OXFORD LIBRARY OF PSYCHOLOGY

The Oxford Handbook of Computational and Mathematical Psychology

Edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels
OXFORD LIBRARY OF PSYCHOLOGY

Editor-in-Chief: Peter E. Nathan

Area Editors:
Clinical Psychology: David H. Barlow
Cognitive Neuroscience: Kevin N. Ochsner and Stephen M. Kosslyn
Cognitive Psychology: Daniel Reisberg
Counseling Psychology: Elizabeth M. Altmaier and Jo-Ida C. Hansen
Developmental Psychology: Philip David Zelazo
Health Psychology: Howard S. Friedman
History of Psychology: David B. Baker
Methods and Measurement: Todd D. Little
Neuropsychology: Kenneth M. Adams
Organizational Psychology: Steve W. J. Kozlowski
Personality and Social Psychology: Kay Deaux and Mark Snyder
OXFORD LIBRARY OF PSYCHOLOGY

Editor-in-Chief: Peter E. Nathan

The Oxford Handbook of Computational and Mathematical Psychology

Edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels
Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide.

Oxford  New York
Auckland  Cape Town  Dar es Salaam  Hong Kong  Karachi  Kuala Lumpur  Madrid  Melbourne  Mexico City  Nairobi  New Delhi  Shanghai  Taipei  Toronto

With offices in
Argentina  Austria  Brazil  Chile  Czech Republic  France  Greece  Guatemala  Hungary  Italy  Japan  Poland  Portugal  Singapore  South Korea  Switzerland  Thailand  Turkey  Ukraine  Vietnam

Oxford is a registered trademark of Oxford University Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016

© Oxford University Press 2015

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Library of Congress Cataloging-in-Publication Data
Oxford handbook of computational and mathematical psychology / edited by Jerome R. Busemeyer, Zheng Wang, James T. Townsend, and Ami Eidels.
pages cm. – (Oxford library of psychology)
Includes bibliographical references and index.
ISBN 978-0-19-995799-6
1. Cognition. 2. Cognitive science. 3. Psychology–Mathematical models. 4. Psychometrics. I. Busemeyer, Jerome R.
BF311.O945 2015
150.1/51–dc23
2015002254
9 8 7 6 5 4 3 2 1 Printed in the United States of America on acid-free paper
Dedicated to the memory of Dr. William K. Estes (1919–2011) and Dr. R. Duncan Luce (1925–2012) Two of the founders of modern mathematical psychology
SHORT CONTENTS

Oxford Library of Psychology  ix
About the Editors  xi
Contributors  xiii
Table of Contents  xvii
Chapters  1–390
Index  391
OXFORD LIBRARY OF PSYCHOLOGY
The Oxford Library of Psychology, a landmark series of handbooks, is published by Oxford University Press, one of the world's oldest and most highly respected publishers, with a tradition of publishing significant books in psychology. The ambitious goal of the Oxford Library of Psychology is nothing less than to span a vibrant, wide-ranging field and, in so doing, to fill a clear market need.

Encompassing a comprehensive set of handbooks, organized hierarchically, the Library incorporates volumes at different levels, each designed to meet a distinct need. At one level is a set of handbooks designed broadly to survey the major subfields of psychology; at another are numerous handbooks that cover important current focal research and scholarly areas of psychology in depth and detail. Planned as a reflection of the dynamism of psychology, the Library will grow and expand as psychology itself develops, thereby highlighting significant new research that will impact the field. Adding to its accessibility and ease of use, the Library will be published in print and, later on, electronically.

The Library surveys psychology's principal subfields with a set of handbooks that capture the current status and future prospects of those major subdisciplines. The initial set includes handbooks of social and personality psychology, clinical psychology, counseling psychology, school psychology, educational psychology, industrial and organizational psychology, cognitive psychology, cognitive neuroscience, methods and measurements, history, neuropsychology, personality assessment, developmental psychology, and more. Each handbook undertakes to review one of psychology's major subdisciplines with breadth, comprehensiveness, and exemplary scholarship.

In addition to these broadly conceived volumes, the Library also includes a large number of handbooks designed to explore in depth more specialized areas of scholarship and research, such as stress, health and coping, anxiety and related disorders, cognitive development, or child and adolescent assessment. In contrast to the broad coverage of the subfield handbooks, each of these latter volumes focuses on an especially productive, more highly focused line of scholarship and research. Whether at the broadest or most specific level, however, all of the Library handbooks offer synthetic coverage that reviews and evaluates the relevant past and present research and anticipates research in the future. Each handbook in the Library includes introductory and concluding chapters written by its editor to provide a roadmap to the handbook's table of contents and to offer informed anticipations of significant future developments in that field.
An undertaking of this scope calls for handbook editors and chapter authors who are established scholars in the areas about which they write. Many of the nation's and world's most productive and best-respected psychologists have agreed to edit Library handbooks or write authoritative chapters in their areas of expertise.

For whom has the Oxford Library of Psychology been written? Because of its breadth, depth, and accessibility, the Library serves a diverse audience, including graduate students in psychology and their faculty mentors, scholars, researchers, and practitioners in psychology and related fields. Each will find in the Library the information they seek on the subfield or focal area of psychology in which they work or are interested.

Befitting its commitment to accessibility, each handbook includes a comprehensive index, as well as extensive references to help guide research. And because the Library was designed from its inception as an online as well as print resource, its structure and contents will be readily and rationally searchable online. Further, once the Library is released online, the handbooks will be regularly and thoroughly updated.

In summary, the Oxford Library of Psychology will grow organically to provide a thoroughly informed perspective on the field of psychology, one that reflects both psychology's dynamism and its increasing interdisciplinarity. Once published electronically, the Library is also destined to become a uniquely valuable interactive tool, with extended search and browsing capabilities. As you begin to consult this handbook, we sincerely hope you will share our enthusiasm for the more than 500-year tradition of Oxford University Press for excellence, innovation, and quality, as exemplified by the Oxford Library of Psychology.

Peter E. Nathan
Editor-in-Chief
Oxford Library of Psychology
ABOUT THE EDITORS
Jerome R. Busemeyer is Provost Professor of Psychology at Indiana University. He was president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include decision field theory and, more recently, pioneering the new field of quantum cognition.

Zheng Wang is Associate Professor at the Ohio State University and directs the Communication and Psychophysiology Lab. Much of her research tries to understand how our cognition, decision making, and communication are contextualized.

James T. Townsend is Distinguished Rudy Professor of Psychology at Indiana University. He was president of the Society for Mathematical Psychology and editor of the Journal of Mathematical Psychology. His theoretical contributions include systems factorial technology and general recognition theory.

Ami Eidels is Senior Lecturer at the School of Psychology, University of Newcastle, Australia, and a principal investigator in the Newcastle Cognition Lab. His research focuses on human cognition, especially visual perception and attention, combined with computational and mathematical modeling.
CONTRIBUTORS
Daniel Algom: School of Psychological Sciences, Tel-Aviv University, Israel
F. Gregory Ashby: Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA
Joseph L. Austerweil: Department of Cognitive, Linguistic, and Psychological Sciences, Brown University, Providence, RI
Scott D. Brown: School of Psychology, University of Newcastle, Callaghan, NSW, Australia
Jerome R. Busemeyer: Department of Psychological and Brain Sciences, Cognitive Science Program, Indiana University, Bloomington, IN
Amy H. Criss: Department of Psychology, Syracuse University, Syracuse, NY
Simon Dennis: Department of Psychology, The Ohio State University, Columbus, OH
Adele Diederich: Psychology, Jacobs University Bremen gGmbH, Bremen 28759, Germany
Chris Donkin: School of Psychology, University of New South Wales, Kensington, NSW, Australia
Ami Eidels: School of Psychology, University of Newcastle, Callaghan, NSW, Australia
Samuel J. Gershman: Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA
Thomas L. Griffiths: Department of Psychology, University of California, Berkeley, Berkeley, CA
Todd M. Gureckis: Department of Psychology, New York University, New York, NY
Robert X. D. Hawkins: Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Andrew Heathcote: School of Psychology, University of Newcastle, Callaghan, NSW, Australia
Marc W. Howard: Department of Psychological and Brain Sciences, Center for Memory and Brain, Boston University, Boston, MA
Brett Jefferson: Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Michael N. Jones: Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
John K. Kruschke: Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Yunfeng Li: Department of Psychological Sciences, Purdue University, West Lafayette, IN
Gordon D. Logan: Department of Psychology, Vanderbilt Vision Research Center, Center for Integrative and Cognitive Neuroscience, Vanderbilt University, Nashville, TN
Bradley C. Love: Experimental Psychology, University College London, London, UK
Dora Matzke: Department of Psychology, University of Amsterdam, Amsterdam, the Netherlands
Robert M. Nosofsky: Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
Richard W. J. Neufeld: Departments of Psychology and Psychiatry, Neuroscience Program, University of Western Ontario, London, Ontario, Canada
Thomas J. Palmeri: Department of Psychology, Vanderbilt Vision Research Center, Center for Integrative and Cognitive Neuroscience, Vanderbilt University, Nashville, TN
Zygmunt Pizlo: Department of Psychological Sciences, Purdue University, West Lafayette, IN
Timothy J. Pleskac: Center for Adaptive Rationality (ARC), Max Planck Institute for Human Development, Berlin, Germany
Emmanuel Pothos: Department of Psychology, City University London, London, UK
Babette Rae: School of Psychology, University of Newcastle, Callaghan, NSW, Australia
Roger Ratcliff: Department of Psychology, The Ohio State University, Columbus, OH
Tadamasa Sawada: Department of Psychology, Higher School of Economics, Moscow, Russia
Jeffrey D. Schall: Department of Psychology, Vanderbilt Vision Research Center, Center for Integrative and Cognitive Neuroscience, Vanderbilt University, Nashville, TN
Philip Smith: School of Psychological Sciences, The University of Melbourne, Parkville, VIC, Australia
Fabian A. Soto: Department of Psychological and Brain Sciences, University of California, Santa Barbara, Santa Barbara, CA
Joshua B. Tenenbaum: Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA
James T. Townsend: Department of Psychological and Brain Sciences, Cognitive Science Program, Indiana University, Bloomington, IN
Joachim Vandekerckhove: Department of Cognitive Sciences, University of California, Irvine, Irvine, CA
Wolf Vanpaemel: Faculty of Psychology and Educational Sciences, University of Leuven, Leuven, Belgium
Eric-Jan Wagenmakers: Department of Psychology, University of Amsterdam, Amsterdam, the Netherlands
Thomas S. Wallsten: Department of Psychology, University of Maryland, College Park, MD
Zheng Wang: School of Communication, Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH
Jon Willits: Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN
CONTENTS

Preface  xix

1. Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology  1
   Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend

Part I • Elementary Cognitive Mechanisms

2. Multidimensional Signal Detection Theory  13
   F. Gregory Ashby and Fabian A. Soto
3. Modeling Simple Decisions and Applications Using a Diffusion Model  35
   Roger Ratcliff and Philip Smith
4. Features of Response Times: Identification of Cognitive Mechanisms through Mathematical Modeling  63
   Daniel Algom, Ami Eidels, Robert X. D. Hawkins, Brett Jefferson, and James T. Townsend
5. Computational Reinforcement Learning  99
   Todd M. Gureckis and Bradley C. Love

Part II • Basic Cognitive Skills

6. Why Is Accurately Labeling Simple Magnitudes So Hard? A Past, Present, and Future Look at Simple Perceptual Judgment  121
   Chris Donkin, Babette Rae, Andrew Heathcote, and Scott D. Brown
7. An Exemplar-Based Random-Walk Model of Categorization and Recognition  142
   Robert M. Nosofsky and Thomas J. Palmeri
8. Models of Episodic Memory  165
   Amy H. Criss and Marc W. Howard

Part III • Higher Level Cognition

9. Structure and Flexibility in Bayesian Models of Cognition  187
   Joseph L. Austerweil, Samuel J. Gershman, Joshua B. Tenenbaum, and Thomas L. Griffiths
10. Models of Decision Making under Risk and Uncertainty  209
    Timothy J. Pleskac, Adele Diederich, and Thomas S. Wallsten
11. Models of Semantic Memory  232
    Michael N. Jones, Jon Willits, and Simon Dennis
12. Shape Perception  255
    Tadamasa Sawada, Yunfeng Li, and Zygmunt Pizlo

Part IV • New Directions

13. Bayesian Estimation in Hierarchical Models  279
    John K. Kruschke and Wolf Vanpaemel
14. Model Comparison and the Principle of Parsimony  300
    Joachim Vandekerckhove, Dora Matzke, and Eric-Jan Wagenmakers
15. Neurocognitive Modeling of Perceptual Decision Making  320
    Thomas J. Palmeri, Jeffrey D. Schall, and Gordon D. Logan
16. Mathematical and Computational Modeling in Clinical Psychology  341
    Richard W. J. Neufeld
17. Quantum Models of Cognition and Decision  369
    Jerome R. Busemeyer, Zheng Wang, and Emmanuel Pothos

Index  391
PREFACE
Computational and mathematical psychology has enjoyed rapid growth over the past decade. Our vision for the Oxford Handbook of Computational and Mathematical Psychology is to invite and organize a set of chapters that review the most important developments, especially those that have impacted, and will continue to impact, other fields such as cognitive psychology, developmental psychology, clinical psychology, and neuroscience. Together with a group of dedicated authors, who are leading scientists in their areas, we believe we have realized our vision. Specifically, the chapters cover the key developments in elementary cognitive mechanisms (e.g., signal detection, information processing, reinforcement learning), basic cognitive skills (e.g., perceptual judgment, categorization, episodic memory), higher-level cognition (e.g., Bayesian cognition, decision making, semantic memory, shape perception), modeling tools (e.g., Bayesian estimation and other new model comparison methods), and emerging new directions (e.g., neurocognitive modeling, applications to clinical psychology, quantum cognition) in computational and mathematical psychology.

An important feature of this handbook is that it aims to engage readers with various levels of modeling experience. Each chapter is self-contained and written by authoritative figures in the topic area. Each chapter is designed to be a relatively applied introduction with a great emphasis on empirical examples (see the New Handbook of Mathematical Psychology (2014) by Batchelder, Colonius, Dzhafarov, and Myung for a more mathematically foundational and less applied presentation). Each chapter endeavors to immediately involve readers, inspire them to apply the introduced models to their own research interests, and refer them to more rigorous mathematical treatments when needed.

First, each chapter provides an elementary overview of the basic concepts, techniques, and models in the topic area. Some chapters also offer a historical perspective on their area or approach. Second, each chapter emphasizes empirical applications of the models. Each chapter shows how the models are being used to understand human cognition and illustrates the use of the models in a tutorial manner. Third, each chapter strives to create engaging, precise, and lucid writing that inspires the use of the models.

The chapters were written for a typical graduate student in virtually any area of psychology, cognitive science, and related social and behavioral sciences, such as consumer behavior and communication. We also expect the handbook to be useful for readers ranging from advanced undergraduate students to experienced faculty members and researchers. Beyond being a handy reference book, it should be beneficial as
a textbook for self-teaching and for graduate-level (or advanced undergraduate-level) courses in computational and mathematical psychology.

We would like to thank all the authors for their excellent contributions. We also thank the following scholars, who helped review the book chapters in addition to the editors (listed alphabetically): Woo-Young Ahn, Greg Ashby, Scott Brown, Cody Cooper, Amy Criss, Adele Diederich, Chris Donkin, Yehiam Eldad, Pegah Fakhari, Birte Forstmann, Tom Griffiths, Andrew Heathcote, Alex Hedstrom, Joseph Houpt, Marc Howard, Matt Irwin, Mike Jones, John Kruschke, Peter Kvam, Bradley Love, Dora Matzke, Jay Myung, Robert Nosofsky, Tim Pleskac, Emmanuel Pothos, Noah Silbert, Tyler Solloway, Fabian Soto, Jennifer Trueblood, Joachim Vandekerckhove, Wolf Vanpaemel, Eric-Jan Wagenmakers, and Paul Williams. The authors' and reviewers' efforts ensure our confidence in the high quality of this handbook.

Finally, we would like to express how much we appreciate the outstanding assistance and guidance provided by our editorial and production teams at Oxford University Press. The hard work of Joan Bossert, Louis Gulino, Anne Dellinger, A. Joseph Lurdu Antoine, the production team of Newgen Knowledge Works Pvt. Ltd., and others at Oxford University Press was essential for the development of this handbook. It has been a true pleasure working with this team!

Jerome R. Busemeyer
Zheng Wang
James T. Townsend
Ami Eidels
December 16, 2014
CHAPTER 1

Review of Basic Mathematical Concepts Used in Computational and Mathematical Psychology

Jerome R. Busemeyer, Zheng Wang, Ami Eidels, and James T. Townsend

Abstract

Computational and mathematical models of psychology all use some common mathematical functions and principles. This chapter provides a brief overview.

Key Words: mathematical functions, derivatives and integrals, probability theory, expectations, maximum likelihood estimation
We have three ways to build theories to explain and predict how variables interact and relate to each other in psychological phenomena: using natural verbal languages, using formal mathematics, and using computational methods. Human intuitive and verbal reasoning has many limitations. For example, Hintzman (1991) summarized at least 10 critical limitations, including our inability to imagine how a dynamic system works. Formal models, including both mathematical and computational models, can address these limitations of human reasoning. Mathematics is a "radically empirical" science (Suppes, 1984, p. 78), with consistent and rigorous evidence (the proof) that is "presented with a completeness not characteristic of any other area of science" (p. 78). Mathematical models can help avoid logic and reasoning errors that are typically encountered in human verbal reasoning. The complexity of theorizing and data often requires the aid of computers and computational languages. Computational models and mathematical models can be thought of as a continuum of a theorizing process. Every computational model is based on a certain mathematical model, and almost every mathematical model can be implemented as a computational model.

Psychological theories may start as a verbal description, which then can be formalized using mathematical language and subsequently coded into computational language. By testing the models against empirical data, the model-fitting outcomes can provide feedback to improve the models, as well as our initial understanding and verbal descriptions. For readers who are newcomers to this exciting field, this chapter provides a review of basic concepts of mathematics, probability, and statistics used in computational and mathematical modeling of psychological representation, mechanisms, and processes. See Busemeyer and Diederich (2010) and Lewandowsky and Farrell (2010) for more detailed presentations.
Mathematical Functions

Mathematical functions are used to map a set of points called the domain of the function into a set of points called the range of the function, such that only one point in the range is assigned to each point in the domain. As a simple example, the linear function is defined as $f(x) = a \cdot x$, where the constant $a$ is the slope of a straight line. In general, we use the notation $f(x)$ to represent a function $f$ that maps a domain point $x$ into a range point $y = f(x)$. If a function $f(x)$ has the property that each range point $y$ can only be reached by a single unique domain point $x$, then we can define the inverse function $f^{-1}(y) = x$ that maps each range point $y = f(x)$ back to the corresponding domain point $x$.

For example, the quadratic function is defined as the map $f(x) = x^2 = x \cdot x$, and if we pick the number $x = 3.5$, then $f(3.5) = 3.5^2 = 12.25$. The quadratic function is defined on a domain of both positive and negative real numbers, and it does not have an inverse because, for example, $(-x)^2 = x^2$, so there are two ways to get back from each range point $y$ to the domain. However, if we restrict the domain to the non-negative real numbers, then the inverse of $x^2$ exists: it is the square root function defined on non-negative real numbers, so that $\sqrt{x^2} = x$.

There are, of course, a large number of functions used in mathematical psychology, but some of the most popular ones include the following.

The power function is denoted $x^a$, where the variable $x$ is a positive real number and the constant $a$ is called the power. A quadratic function can be obtained by setting $a = 2$, but we could instead choose $a = 0.50$, which is the square root function $x^{0.50} = \sqrt{x}$; or we could choose $a = -1$, which produces the reciprocal $x^{-1} = \frac{1}{x}$; or we could choose any real number, such as $a = 1.37$. Using a calculator, one finds that if $x = 15.25$ and $a = 1.37$, then $15.25^{1.37} \approx 41.79$. One important property to remember about power functions is that $x^a \cdot x^b = x^{a+b}$, $x^b \cdot y^b = (x \cdot y)^b$, and $(x^a)^b = x^{ab}$. Also note that $x^0 = 1$. When working with the power function, the variable $x$ appears in the base and the constant $a$ appears as the power.

The exponential function is denoted $e^x$, where the exponent $x$ is any real-valued variable and the constant base $e$ stands for a special number that is approximately $e \approx 2.7183$. Sometimes it is more convenient to use the notation $e^x = \exp(x)$ instead.
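The power and exponential identities above are easy to check numerically. The following Python sketch is illustrative (the particular values of x, y, and b are arbitrary choices, not from the chapter):

```python
import math

x, y, a, b = 15.25, 2.0, 1.37, 0.5

# Power-function identities: x^a * x^b = x^(a+b), x^b * y^b = (x*y)^b, (x^a)^b = x^(a*b)
assert math.isclose(x**a * x**b, x**(a + b))
assert math.isclose(x**b * y**b, (x * y)**b)
assert math.isclose((x**a)**b, x**(a * b))
assert x**0 == 1.0

# The worked example from the text: 15.25 raised to the power 1.37
print(round(15.25**1.37, 2))  # 41.79

# Exponential identities: e^x * e^y = e^(x+y), (e^x)^a = e^(a*x)
assert math.isclose(math.exp(2.5) * math.exp(1.5), math.exp(4.0))
assert math.isclose(math.exp(2.5)**2, math.exp(5.0))
print(round(math.exp(2.5), 4))  # 12.1825
```

Every assertion passes, which is a quick way to convince yourself that the base/exponent rules behave exactly as stated.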
Using a calculator, we can calculate $e^{2.5} = 2.7183^{2.5} = 12.1825$. Note that the exponent can be negative, $-x < 0$, in which case we can write $e^{-x} = \frac{1}{e^x}$. If $x = 0$, then $e^0 = 1$. The exponential function always returns a positive value, $e^x > 0$, and it approaches zero as $x$ approaches negative infinity. More complex forms of the exponential are often used. For example, you will later see the function $e^{-\left(\frac{x-\mu}{\sigma}\right)^2}$, where $x$ is a variable and $\mu$ and $\sigma$ are constants. In this case, it is more convenient to write this as $e^{-\left(\frac{x-\mu}{\sigma}\right)^2} = \exp\left(-\left(\frac{x-\mu}{\sigma}\right)^2\right)$. This tells you to first compute the squared deviation $y = \left(\frac{x-\mu}{\sigma}\right)^2$ and then compute the reciprocal $\frac{1}{\exp(y)}$. The exponential function obeys the properties $e^x \cdot e^y = e^{x+y}$ and $(e^x)^a = e^{a \cdot x}$. In contrast to the power function, the base of the exponential is a constant and the exponent is a variable.

The (natural) log function is denoted $\ln(x)$ for positive values of $x$. For example, using a calculator, for $x = 10$ we obtain $\ln(10) = 2.3026$. (We normally use the natural base $e = 2.7183$. If instead we used base 10, then $\log_{10}(10) = 1$.) The log function obeys the rules $\ln(x \cdot y) = \ln(x) + \ln(y)$ and $\ln(x^a) = a \cdot \ln(x)$. The log function is the inverse of the exponential function, $\ln(\exp(x)) = x$, and the exponential function is the inverse of the log function, $\exp(\ln(x)) = x$. The function $a^x$, where $a$ is a constant and $x$ is a variable, can be rewritten in terms of the exponential function: define $b = \ln(a)$; then $e^{bx} = (e^b)^x = \exp(\ln(a))^x = a^x$.

Figure 1.1 illustrates the power, exponential, and log functions using different coefficient values for each function. As can be seen, the coefficient changes the curve of the functions.

Last but not least are the trigonometric functions based on a circle. Figure 1.2 shows a circle with its center located at coordinates $(0, 0)$ in an $(X, Y)$ plane. Now imagine a line segment of radius $r = 1$ that extends from the center point to the circumference of the circle. This line segment intersects the circumference at coordinates $(\cos(t \cdot \pi), \sin(t \cdot \pi))$ in the plane. The coordinate $\cos(t \cdot \pi)$ represents the projection of the point on the circumference down onto the $X$ axis, and $\sin(t \cdot \pi)$ is the projection of the point on the circumference onto the $Y$ axis. The variable $t$ (which, for example, can be time) moves this point around the circle, with positive values moving the point counterclockwise and negative values moving it clockwise.

The constant $\pi = 3.1416$ equals one-half cycle around the circle, and $2\pi$ is the period of time it takes to go all the way around once. The two functions are related by a translation (called the phase) in time: $\cos(t \cdot \pi - \pi/2) = \sin(t \cdot \pi)$. Note that cos is an even function because $\cos(t \cdot \pi) = \cos(-t \cdot \pi)$, whereas sin is an odd function because $-\sin(t \cdot \pi) = \sin(-t \cdot \pi)$. Also note that these functions are periodic in the sense that, for example, $\cos(t \cdot \pi) = \cos(t \cdot \pi + 2 \cdot k \cdot \pi)$ for any integer $k$. We can generalize these two functions by changing the frequency and the phase. For example, $\cos(\omega \cdot t\pi + \theta)$ is a cosine function with a frequency $\omega$ (changing the time it takes
Fig. 1.1 Examples of three important functions, with various parameter values. From left to right: power function, exponential function, and log function. See text for details.

Fig. 1.2 Left panel illustrates a point on a unit circle with a radius equal to one. Vertical line shows sine, horizontal line shows cosine. Right panel shows sine as a function of time. The point on the Y axis of the right panel corresponds to the point on the Y axis of the left panel.
to complete a cycle) and a phase θ (advancing or delaying the initial value at time t = 0).
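The log and trigonometric identities just described can be verified the same way. A quick Python sketch (the specific values are arbitrary illustrative choices):

```python
import math

x, y, a = 10.0, 3.0, 1.37

# Log rules: ln(x*y) = ln(x) + ln(y) and ln(x^a) = a*ln(x)
assert math.isclose(math.log(x * y), math.log(x) + math.log(y))
assert math.isclose(math.log(x**a), a * math.log(x))
print(round(math.log(10.0), 4))  # 2.3026

# log and exp are inverses of each other
assert math.isclose(math.log(math.exp(2.5)), 2.5)
assert math.isclose(math.exp(math.log(x)), x)

# a^x rewritten in base e: a^x = exp(x * ln(a))
assert math.isclose(a**x, math.exp(x * math.log(a)))

# Trig: sine is cosine shifted by a quarter cycle; cos is even, sin is odd,
# and both repeat with period 2*pi
t = 0.73
assert math.isclose(math.cos(t * math.pi - math.pi / 2), math.sin(t * math.pi))
assert math.isclose(math.cos(-t * math.pi), math.cos(t * math.pi))
assert math.isclose(-math.sin(t * math.pi), math.sin(-t * math.pi))
assert math.isclose(math.cos(t * math.pi), math.cos(t * math.pi + 2 * math.pi))
```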
Derivatives and Integrals

A derivative of a continuous function is the rate of change, or the slope, of the function at some point. Suppose $f(x)$ is some continuous function. For a small increment $\Delta$, the change in this function is $df(x) = f(x) - f(x - \Delta)$, and the rate of change is the change divided by the increment, $\frac{df(x)}{\Delta} = \frac{f(x) - f(x - \Delta)}{\Delta}$. If the function is continuous, then as $\Delta \to 0$, this ratio converges to what is called the derivative of the function at $x$, denoted $\frac{d}{dx}f(x)$.

The derivatives of many functions are derived in calculus (see Stewart (2012) or any calculus textbook for an introduction to calculus). For example, in calculus it is shown that $\frac{d}{dx}e^{c \cdot x} = c \cdot e^{c \cdot x}$, which says that the slope of the exponential function at any point $x$ is proportional to the exponential function itself. As another example, it is shown in calculus that $\frac{d}{dx}x^a = a \cdot x^{a-1}$, which is the derivative of the power function. For example, the derivative of a quadratic function $a \cdot x^2$ is a linear function $2 \cdot a \cdot x$, and the derivative of the linear function $a \cdot x$ is the constant $a$. The derivative of the cosine function is $\frac{d}{dt}\cos(t) = -\sin(t)$, and the derivative of sine is $\frac{d}{dt}\sin(t) = \cos(t)$.
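These derivative rules can be sanity-checked with a finite-difference approximation that directly mirrors the ratio definition of the derivative. A minimal Python sketch (the helper `numerical_derivative`, the step size h, and the test points are illustrative choices, not from the chapter):

```python
import math

def numerical_derivative(f, x, h=1e-6):
    """Central-difference approximation to the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

c, a, x = 0.8, 2.0, 1.5

# d/dx e^(c*x) = c * e^(c*x): the slope is proportional to the function itself
approx = numerical_derivative(lambda u: math.exp(c * u), x)
assert math.isclose(approx, c * math.exp(c * x), rel_tol=1e-5)

# d/dx x^a = a * x^(a-1): derivative of the power function
approx = numerical_derivative(lambda u: u**a, x)
assert math.isclose(approx, a * x**(a - 1), rel_tol=1e-5)

# d/dt cos(t) = -sin(t) and d/dt sin(t) = cos(t)
assert math.isclose(numerical_derivative(math.cos, x), -math.sin(x), rel_tol=1e-5)
assert math.isclose(numerical_derivative(math.sin, x), math.cos(x), rel_tol=1e-5)
```

The central difference is used rather than the one-sided ratio in the text because it converges faster, but both approach the same limit as h shrinks.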
Fig. 1.3 Illustration of the power function and its derivative, for two coefficients (a = 2, left panel; a = 0.5, right panel). The curved lines in both panels mark the power function. The slope of the dotted line (the tangent to the function) is given by the derivative of that function (in this example, at x = 1).
Figure 1.3 illustrates the derivative of the power function at the value x = 1 for two different coefficients. The curved line shows the power function, and the straight line touches the curve at x = 1. The slope of this line is the derivative. The integral of a continuous function is the area under the curve within some interval (see Fig. 1.4). Suppose f (x) is a continuous function of x within the interval [a, b]. A simple way to approximate this area is to divide the interval into N very small steps, with a small increment being a step: [x0 = a, x1 = a + , x2 = a + 2, ..., xj = a + j · , ..., xN −1 = b − , xN = b] Then, compute the area of the rectangle within each step, · f xj , and finally sum all the areas of the rectangles to obtain an approximate area under the curve: N A ≈ · f (x1 ) + · f (x2 ) + · · · + · f xN = f xj · . j=1
As the number of intervals becomes arbitrarily large and the increments get arbitrarily small so that N → ∞ and → 0, this sum converges to the integral b A= f (x) · dx. a
If we allow the upper limit of the integral to be a variable, say z, then the integral becomes a function of the upper limit, which can be written as F(z) = ∫_a^z f(x)·dx. What happens if we take the derivative of an integral? Let's examine the change in the area divided by the increment Δ:
[A(xN) − A(x(N−1))]/Δ = [Δ·f(xN)]/Δ = f(xN). This simple idea (proven more rigorously in a calculus textbook) leads to the first fundamental theorem of calculus, which states that d/dz F(z) = f(z), with F(z) = ∫_a^z f(x)·dx. The fundamental theorem can then be used to find the integral of a function. For example, ∫_0^z x^a dx = (a + 1)^(−1)·z^(a+1) because d/dz (a + 1)^(−1)·z^(a+1) = z^a. The integral ∫^z e^(α·x) dx = (1/α)·e^(α·z) because d/dz e^(α·z) = α·e^(α·z). The integral ∫_0^z cos(t)·dt = sin(z) because d/dz sin(z) = cos(z).
Computational and mathematical models are often described by difference or differential equations. These types of equations are used to describe how the state of a system changes with time. For example, suppose V(t) represents the strength of a neural connection between an input and an output at time t, and suppose x(t) is some reward signal that is guiding the learning process. A simple, discrete-time linear model of learning is V(t) = (1 − α)·V(t − 1) + α·x(t), where 0 ≤ α ≤ 1 is the learning-rate parameter. We can rewrite this as a difference equation: ΔV(t) = V(t) − V(t − 1) = −α·V(t − 1) + α·x(t) = −α·(V(t − 1) − x(t)). This model states that the change in strength at time t is proportional to the negative of the error signal, which is defined as the difference between the
Fig. 1.4 The integral of the function is the area under the curve. It can be approximated as the sum of the areas of the rectangles (left panel). As the rectangles become narrower (middle panel), the sum of their areas converges to the true integral (right panel).
previous strength and the new reward. If we wish to describe learning as occurring more continuously in time, we can introduce a small time increment Δt into the model so that it states ΔV(t) = V(t) − V(t − Δt) = −α·Δt·(V(t − Δt) − x(t)), which says that the change in strength is proportional to the negative of the error signal, with the constant of proportionality now modified by the time increment. Dividing both sides by the time increment Δt, we obtain ΔV(t)/Δt = −α·(V(t − Δt) − x(t)), and now if we allow the time increment to approach zero in the limit, Δt → 0, then the preceding equation converges to the differential equation d/dt V(t) = −α·(V(t) − x(t)), which states that the rate of change in strength is proportional to the negative of the error signal. Sometimes we can solve the differential equation for a simple solution. For example, the solution to the equation d/dt V(t) = −α·V(t) + c is V(t) = c/α − e^(−α·t), because when we substitute this solution back into the differential equation, it satisfies the equality d/dt (c/α − e^(−α·t)) = −α·(c/α − e^(−α·t)) + c. A stochastic difference equation is frequently used in cognitive modeling to represent how a state changes across time when it is perturbed by noise. For example, if we assume that the strength of
a connection changes according to the preceding learning model, but with some noise (denoted as ε(t)) added, then we can use the following stochastic difference equation: ΔV(t) = −α·Δt·(V(t − Δt) − x(t)) + ε(t)·√Δt.
Note that the noise is multiplied by √Δt instead of Δt in a stochastic difference equation. This is required so that the effect of the noise does not disappear as Δt → 0, and the variance of the noise remains proportional to Δt (which is the key characteristic of Brownian motion processes). See Bhattacharya and Waymire (2009) for an excellent book on stochastic processes.
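Both equations can be simulated. The sketch below uses illustrative parameter values (ours, not the handbook's): first, Euler steps of the deterministic equation dV/dt = −αV + c are checked against the quoted closed-form solution V(t) = c/α − e^(−αt); second, with the learning term switched off, the √Δt noise scaling makes the variance of the accumulated state grow in proportion to elapsed time, the Brownian-motion property just mentioned.

```python
import math, random, statistics

# Deterministic part: Euler steps of dV/dt = -alpha * V + c.
alpha, c = 0.5, 2.0
dt, T = 1e-3, 3.0
V = c / alpha - 1.0                   # start value matching the quoted solution
for _ in range(round(T / dt)):
    V += dt * (-alpha * V + c)        # dV = dt * (-alpha * V + c)
assert abs(V - (c / alpha - math.exp(-alpha * T))) < 1e-2

# Stochastic part: pure accumulated noise eps(t) * sqrt(dt), so the variance
# of V(T) across replications should be close to sigma^2 * T.
random.seed(1)
sigma, dt_s, T_s = 1.0, 0.01, 1.0
finals = []
for _ in range(4000):
    W = 0.0
    for _ in range(round(T_s / dt_s)):
        W += random.gauss(0.0, sigma) * math.sqrt(dt_s)
    finals.append(W)
assert abs(statistics.pvariance(finals) - sigma ** 2 * T_s) < 0.1
```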
Elementary Probability Theory
Probability theory describes how to assign probabilities to events. See Feller (1968) for a review of probability theory. We start with a sample space, a set denoted Ω, which contains all the unique outcomes that can be realized. For simplicity, we will assume (unless noted otherwise) that the sample space is finite. (There could be a very large number of outcomes, but the number is finite.) For example, if a person takes two medical tests, test A and test B, and each test can be positive or negative, then the sample space contains four mutually exclusive and exhaustive outcomes: all four combinations of positive and negative test results from tests A and B. Figure 1.5 illustrates the situation for this simple example. An event, such as the event A (e.g., test A is positive), is a subset of the sample space. Suppose for the moment that A and B are two events. The disjunctive event A or B (e.g., test A is positive or test B is positive) is represented as the union A∪B. The conjunctive event A and B (e.g., test A is
[Figure 1.5: left, a 2 × 2 contingency table crossing Test A (negative, positive) with Test B (negative, positive); right, a Venn diagram of events A and B with their intersection A∩B.]
Fig. 1.5 Two ways to illustrate the probability space of events A and B. The contingency table (left) and the Venn diagram (right) correspond in the following way: Positive values on both tests in the table (the conjunctive event, A∩B) are represented by the overlap of the circles in the Venn diagram. Positive values on one test but not on the other in the table (the XOR event, A positive and B negative, or vice versa) are represented by the nonoverlapping areas of circles A and B. Finally, tests that are both negative (upper left entry in the table) correspond in the Venn diagram to the area within the rectangle (the so-called “sample space”) that is not occupied by any of the circles.
positive and test B is positive) is represented as the intersection A∩B. The impossible event (e.g., test A is neither positive nor negative), denoted ∅, is the empty set. The certain event is the entire sample space Ω. The complementary event "not A" is denoted Ā. A probability function p assigns a number between zero and one to each event. The impossible event is assigned zero, and the certain event is assigned one. The other events are assigned probabilities 0 ≤ p(A) ≤ 1, with p(Ā) = 1 − p(A). These probabilities must obey the following additive rule: if A∩B = ∅, then p(A∪B) = p(A) + p(B). What if the events are not mutually exclusive, so that A∩B ≠ ∅? The answer is called the "or" rule, which follows from the previous assumptions: p(A∪B) = p(A∩B) + p(A∩B̄) + p(Ā∩B) = p(A∩B) + p(A∩B̄) + p(Ā∩B) + p(A∩B) − p(A∩B) = p(A) + p(B) − p(A∩B). Suppose we learn that some event A has occurred, and now we wish to define the new probability for event B conditioned on this known event. The conditional probability p(B|A) stands for the probability of event B given that event A has occurred, which is defined as p(B|A) = p(A∩B)/p(A). Similarly, p(A|B) = p(A∩B)/p(B) is the probability of event A given that B has occurred. Using the definition of conditional probability, we can then define the "and" rule for joint probabilities as follows: the probability of A and B equals p(A∩B) = p(A)·p(B|A) = p(B)·p(A|B). An important theorem of probability is called Bayes' rule. It describes how to revise one's beliefs based on evidence. Suppose we have two mutually
exclusive and exhaustive hypotheses denoted H1 and H2. For example, H1 could be that a certain disease is present and H2 that the disease is not present. Define the event D as some observed data that provide evidence for or against each hypothesis, such as a medical test result. Suppose p(D|H1) and p(D|H2) are known. These are called the likelihoods of the data for each hypothesis. For example, medical testing would be used to determine the likelihood of a positive versus negative test result when the disease is known to be present, and the likelihood of a positive versus negative test would also be known when the disease is not present. We define p(H1) and p(H2) as the prior probabilities of each hypothesis. For example, these priors may be based on base rates for disease present or not. Then, according to the conditional probability definition, p(H1|D) = p(H1)·p(D|H1)/p(D) = p(H1)·p(D|H1) / [p(H1)·p(D|H1) + p(H2)·p(D|H2)].
The last line is Bayes' rule. The probability p(H1|D) is called the posterior probability of the hypothesis given the data. It reflects the revision from the prior produced by the evidence from the data. If there are M ≥ 2 hypotheses, then the rule is extended to p(H1|D) = p(H1)·p(D|H1) / Σ(k=1 to M) p(Hk)·p(D|Hk), where the denominator is the sum across the k = 1, ..., M hypotheses. We often work with events that are assigned to numbers. A random variable is a function that assigns real numbers to events. For example, a person may look at an advertisement and then rate how effective it is on a nine-point scale. In this case, there are nine mutually exclusive and exhaustive categories to choose from on the rating
scale, and each choice is assigned a number (say, 1, 2,. . ., or 9). Then we can define a random variable X (R), which is a function that maps the category event R onto one of the nine numbers. For example, if the person chooses the middle rating option, so that R = middle, then we assign X (middle) = 5. For simplicity, we often omit the event and instead write the random variable simply as X . For example, we can ask what is the probability that the random variable is assigned the number 5, which is written as p(X = 5). Then we assign a probability to each value of a random variable by assigning it the probability of the event that produces the value. For example, p(X = 5) equals the probability of the event that the person picks the middle value. Suppose the random
variable has N values x1, x2, ..., xi, ..., xN. In our previous example with the rating scale, the random variable had nine values. The function p(X = xi) (interpreted as the probability that the person picks the choice corresponding to value xi) is called the probability mass function for the random variable X. This function has the following properties: 0 ≤ p(X = xi) ≤ 1, and Σ(i=1 to N) p(X = xi) = p(X = x1) + p(X = x2) + ··· + p(X = xN) = 1. The cumulative probability is then defined as p(X ≤ xi) = p(X = x1) + p(X = x2) + ··· + p(X = xi) = Σ(j=1 to i) p(X = xj).
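The mass function and its running sum can be written out for the nine-point rating example. The probabilities below are made up for illustration; the point is only that a valid pmf sums to one and that the cumulative probability is the running sum just defined.

```python
# Made-up probability mass function over rating values 1..9, with the
# middle ratings most likely; the remaining 0.40 is spread evenly.
values = list(range(1, 10))
pmf = {5: 0.30, 4: 0.15, 6: 0.15}
pmf.update({v: 0.40 / 6 for v in values if v not in pmf})

assert abs(sum(pmf.values()) - 1.0) < 1e-9   # probabilities sum to one

def cdf(x):
    """p(X <= x) = sum of p(X = x_j) over all x_j <= x."""
    return sum(p for v, p in pmf.items() if v <= x)

assert abs(cdf(9) - 1.0) < 1e-9
assert abs(cdf(5) - (0.40 / 6 * 3 + 0.15 + 0.30)) < 1e-9
```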
Often we measure more than one random variable. For example, we could present an advertisement and ask how effective it is for the participant personally, but also ask how effective the participant believes it is for others. Suppose X is the random variable for the nine-point rating scale for self, and let Y be the random variable for the nine-point rating scale for others. Then we can define a joint probability p(X = xi, Y = yj), which equals the probability that xi is selected for self and that yj is selected for others. These joint probabilities form a two-way 9 × 9 table with p(X = xi, Y = yj) in each cell. This joint probability function has the properties:
0 ≤ p(X = xi, Y = yj) ≤ 1;
Σj p(X = xi, Y = yj) = p(X = xi);
Σi p(X = xi, Y = yj) = p(Y = yj);
Σi Σj p(X = xi, Y = yj) = 1.
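The joint-probability properties above, together with the earlier "or" rule, conditional probability, and Bayes' rule, can all be checked on one small table. The numbers below are a made-up joint distribution for the two-medical-test example, not data from the handbook.

```python
# (test A result, test B result, probability) for the four outcomes
table = [("pos", "pos", 0.08), ("pos", "neg", 0.02),
         ("neg", "pos", 0.10), ("neg", "neg", 0.80)]

assert abs(sum(p for _, _, p in table) - 1.0) < 1e-9   # sums to one

pA = sum(p for a, _, p in table if a == "pos")         # marginal p(A positive)
pB = sum(p for _, b, p in table if b == "pos")         # marginal p(B positive)
pAB = sum(p for a, b, p in table if a == "pos" and b == "pos")

# "or" rule: p(A or B) = p(A) + p(B) - p(A and B)
pA_or_B = sum(p for a, b, p in table if a == "pos" or b == "pos")
assert abs(pA_or_B - (pA + pB - pAB)) < 1e-9

# Conditional probability and the "and" rule
pB_given_A = pAB / pA
assert abs(pAB - pA * pB_given_A) < 1e-9

# Bayes' rule with H1 = "B positive", H2 = "B negative", data D = "A positive"
pA_given_B = pAB / pB
pA_given_notB = sum(p for a, b, p in table
                    if a == "pos" and b == "neg") / (1 - pB)
posterior = pB * pA_given_B / (pB * pA_given_B + (1 - pB) * pA_given_notB)
assert abs(posterior - pB_given_A) < 1e-9   # agrees with p(B|A) computed directly
```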
Finally, we can define the conditional probability of Y = yj given X = xi as p(Y = yj | X = xi) = p(X = xi, Y = yj)/p(X = xi). Often, we work with random variables that have a continuous rather than a discrete and finite distribution, such as the normal distribution. Suppose X is a univariate continuous random variable. In this case, the probability assigned to each real number is zero (there are uncountably many of them in any interval). Instead, we start by defining the cumulative distribution function F(x) = p(X ≤ x). Then we define the probability density at each value of x as the derivative of the cumulative distribution function, f(x) = d/dx F(x). Using this definition for the density, we compute the probability of X falling in some interval [a, b] as p(X ∈ [a, b]) = ∫_a^b f(x)dx. The increment f(x)·dx for the continuous random variable is conceptually related to the mass function p(X = xi). The probability density function for the normal distribution equals f(x) = (1/(√(2π)·σ))·e^(−(1/2)·((x−μ)/σ)²), where μ is the mean of the distribution and σ is the standard deviation of the distribution. The normal distribution is popular because of the central limit theorem, which states that the sample mean X̄ = Σ Xi/N of N independent samples of a random variable X will approach normal as the number of samples becomes arbitrarily large, even if the original random variable X is not normal. We often work with sample means, which we expect to be approximately normal because of the central limit theorem.
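A quick simulation illustrates the central limit theorem. Here means of N samples from a decidedly non-normal (uniform) random variable end up with mean and spread matching the normal approximation μ = 0.5, σ/√N = √(1/12)/√N; the sample sizes and counts are illustrative choices of ours.

```python
import random, statistics

random.seed(7)
N, reps = 30, 10_000
means = [statistics.fmean(random.random() for _ in range(N))
         for _ in range(reps)]

# Mean of the sample means is near the uniform mean 0.5,
# and their spread is near sigma / sqrt(N) with sigma = sqrt(1/12).
assert abs(statistics.fmean(means) - 0.5) < 0.005
expected_sd = (1 / 12) ** 0.5 / N ** 0.5
assert abs(statistics.stdev(means) - expected_sd) < 0.005
```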
Expectations
When working with random variables, we are often interested in their moments (i.e., their means, variances, or correlations). See Hogg and Craig (1970) for a reference on mathematical statistics. These different moments are different concepts, but they are all defined as expected values of the random variables. The expectation of the random variable is defined as E[X] = Σi p(X = xi)·xi for the
discrete case, and it is defined as E[X] = ∫ f(x)·x·dx for the continuous case. The mean of a random variable X, denoted μX, is defined as the expectation of the random variable: μX = E[X]. The variance of a random variable X, denoted σ²X, is defined as the expectation of the squared deviation around the mean: σ²X = Var(X) = E[(X − μX)²]. For example, in the discrete case, this equals σ²X = Σi p(X = xi)·(xi − μX)². The standard deviation is the square root of the variance, σX = √(σ²X). The covariance between two random variables (X, Y), denoted σXY, is defined by the expectation of the product of deviations: σXY = cov(X, Y) = E[(X − μX)·(Y − μY)]. For example, in the discrete case, this equals σXY = Σi Σj p(X = xi, Y = yj)·(xi − μX)·(yj − μY). The correlation is defined as ρXY = σXY/(σX·σY). Often we need to combine two random variables by a linear combination Z = a·X + b·Y, where a and b are two constants. For example, we may sum two scores (a = 1, b = 1) or take a difference between two scores (a = 1, b = −1). There are two important rules for determining the mean and the variance of a linear combination. The expectation operator is linear: E[a·X + b·Y] = a·E[X] + b·E[Y]. The variance operator, however, is not linear: var(a·X + b·Y) = a²·var(X) + b²·var(Y) + 2ab·cov(X, Y).
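The two linear-combination rules can be verified on a small made-up joint distribution. The sketch uses a difference score (a = 1, b = −1); the table values are illustrative only.

```python
# Small joint distribution: (x, y, p) triples summing to probability one.
vals = [(1, 1, 0.2), (1, 2, 0.1), (2, 1, 0.3), (2, 2, 0.4)]

E = lambda g: sum(p * g(x, y) for x, y, p in vals)   # expectation operator
muX, muY = E(lambda x, y: x), E(lambda x, y: y)
varX = E(lambda x, y: (x - muX) ** 2)
varY = E(lambda x, y: (y - muY) ** 2)
covXY = E(lambda x, y: (x - muX) * (y - muY))

a, b = 1, -1                                         # difference of the two scores
muZ = E(lambda x, y: a * x + b * y)
varZ = E(lambda x, y: (a * x + b * y - muZ) ** 2)

assert abs(muZ - (a * muX + b * muY)) < 1e-9         # linearity of expectation
assert abs(varZ - (a ** 2 * varX + b ** 2 * varY
                   + 2 * a * b * covXY)) < 1e-9      # variance rule
```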
Maximum Likelihood Estimation
Computational and mathematical models of psychology contain parameters that need to be estimated from the data. For example, suppose a person can choose to play or not play a slot machine at the beginning of each trial. The slot machine pays the amount x(t) on trial t, but this amount is not revealed until the trial is over. Consider a simple model that assumes that the probability of choosing to gamble on trial t, denoted p(t), is predicted by the following linear learning model:
V(t) = (1 − α)·V(t − 1) + α·x(t)
p(t) = 1/(1 + e^(−β·V(t)))
This model has two parameters, α and β, that must be estimated from the data. This is analogous to the estimation problem one faces when using multiple linear regression, where the linear regression is the model and the regression coefficients are the model parameters. However, computational and mathematical models, such as the earlier learning model example, are nonlinear with respect to
the model parameters, which makes them more complicated, and one cannot use simple linear regression fitting routines. The model parameters are estimated from the empirical experimental data. These experiments usually consist of a sample of participants, and each participant provides a series of responses to several experimental conditions. For example, a study of learning to gamble could obtain 100 choice trials at each of 5 payoff conditions from each of 50 participants. One of the first issues for modeling is the level of analysis of the data. On the one hand, a group-level analysis would fit a model to all the data from all the participants, ignoring individual differences. This is not a good idea if there are substantial individual differences. On the other hand, an individual-level analysis would fit a model to each individual separately, allowing arbitrary individual differences. This introduces a new set of parameters for each person, which is unparsimonious. A hierarchical model applies the model to all of the individuals, but it includes a model for the distribution of individual differences. This is a good compromise, but it requires a good model of the distribution of individual differences. Chapter 13 of this book describes the hierarchical approach. Here we describe the basic ideas of fitting models at the individual level using a method called maximum likelihood (see Myung, 2003, for a detailed tutorial on maximum likelihood estimation). Also see Hogg and Craig (1970) for the general properties of maximum likelihood estimates. Suppose we obtain 100 choice trials (gamble, not gamble) from 5 payoff conditions from each participant. The above learning model has two parameters (α, β) that we wish to estimate using the 100 × 5 = 500 binary-valued responses. We can put the 500 answers in a vector D = [x1, x2, ..., xt, ..., x500], where each xt is zero (not gamble) or one (gamble).
If we pick values for the two parameters (α, β), then we can insert these into our learning model and compute the probability of gambling, p(t), for each trial from the model. Define p(xt, t) as the probability that the model predicts the value xt observed on trial t. For example, if xt = 1, then p(xt, t) = p(t), but if xt = 0, then p(xt, t) = 1 − p(t), where recall that p(t) is the predicted probability of choosing the gamble. Then we compute the likelihood of the observed sequence of data D given the model parameters (α, β) as follows: L(D|α, β) = p(x1, 1)·p(x2, 2) ··· p(x500, 500).
Fig. 1.6 Example of maximum likelihood estimation. The histograms describe data sampled from a Gamma distribution with scale and shape parameters both equal to 20. Using maximum likelihood we can estimate the parameter values of a Gamma distribution that best fits the sample (they turn out to be 20.4 and 19.5, respectively), and plot its probability density function (solid line).
To make this computationally feasible, we use the log likelihood instead: LnL(D|α, β) = Σ(t=1 to 500) ln p(xt, t).
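Maximizing this log likelihood can be sketched end to end. The grid search below is a stand-in for the nonlinear search routines mentioned in the text; the payoff schedule, the true parameter values, and the assumption that V is updated after the choice on each trial are all illustrative choices of ours, not the handbook's.

```python
import math, random

random.seed(3)
true_alpha, true_beta = 0.3, 2.0
payoffs = [random.choice([0.0, 1.0]) for _ in range(500)]   # x(t), made up

def choice_probs(alpha, beta):
    """p(t) for each trial; V is updated after the choice on each trial."""
    V, probs = 0.0, []
    for x in payoffs:
        probs.append(1.0 / (1.0 + math.exp(-beta * V)))
        V = (1 - alpha) * V + alpha * x   # V(t) = (1 - alpha) V(t-1) + alpha x(t)
    return probs

# Simulate 500 binary choices from the true parameters.
data = [1 if random.random() < p else 0
        for p in choice_probs(true_alpha, true_beta)]

def lnL(alpha, beta):
    """LnL(D | alpha, beta) = sum over trials of ln p(x_t, t)."""
    return sum(math.log(p if x == 1 else 1.0 - p)
               for x, p in zip(data, choice_probs(alpha, beta)))

# Coarse grid over alpha in {0.1..0.9} and beta in {0.5..5.0}.
grid = [(a / 10, b / 2) for a in range(1, 10) for b in range(1, 11)]
best = max(grid, key=lambda ab: lnL(*ab))
# The true parameters lie on the grid, so the grid maximum is at least as good.
assert lnL(*best) >= lnL(true_alpha, true_beta) - 1e-9
```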
This likelihood changes depending on our choice of α and β. Our goal is to pick the (α, β) that maximizes LnL(D|α, β). Nonlinear search algorithms available in computational software, such as MATLAB, R, Gauss, and Mathematica, can be used to find the maximum likelihood estimates. The log likelihood is a goodness-of-fit measure: higher values indicate better fit. In practice, the computer algorithms find the minimum of the badness-of-fit measure G² = −2·LnL(D|α, β). Maximum likelihood is not restricted to learning models; it can be used to fit all kinds of models. For example, if we observe a response time on each trial, and our model predicts the response time for each trial, then the preceding equation can be applied with xt equal to the observed response time on a trial and with p(xt, t) equal to the predicted probability for the observed value of response time on that trial. Figure 1.6 shows an example in which a sample of response-time data (summarized in the figure by a histogram) was fit by a gamma distribution model for response time using two model parameters. Now suppose we have two different competing learning models. The model we just described has two parameters. Suppose the competing model is quite different and is more complex, with four parameters. Also suppose the models are not nested, so that it is not possible to compute the same predictions for the simpler model using the more
complex model. Then we can compare models by using the Bayesian information criterion (BIC; see Wasserman, 2000, for a review). This criterion is derived on the basis of choosing the model that is most probable given the data. (However, the derivation holds only asymptotically, as the sample size increases indefinitely.) For each model we wish to compare, we can compute a BIC index: BICmodel = G²model + nmodel·ln(N), where nmodel equals the number of model parameters estimated from the data and N equals the number of data points. The BIC index balances model fit with model complexity as measured by the number of parameters. (Note, however, that model complexity involves more than the number of parameters; see Chapter 13.) It is a badness-of-fit index, and so we choose the model with the lowest BIC index. See Chapter 14 for a detailed review on model comparison.
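The BIC computation itself is a one-liner. The fits below are made-up numbers chosen only to show the trade-off: the four-parameter model fits slightly better but pays a larger penalty, so the simpler model wins here.

```python
import math

def bic(lnL, n_params, N):
    """BIC = G^2 + n * ln(N), with badness of fit G^2 = -2 * lnL."""
    return -2.0 * lnL + n_params * math.log(N)

N = 500                                   # number of data points
bic_simple = bic(lnL=-320.0, n_params=2, N=N)
bic_complex = bic(lnL=-316.0, n_params=4, N=N)

# Lower BIC is better; here the penalty outweighs the small gain in fit.
assert bic_simple < bic_complex
```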
Concluding comments
This handbook categorizes models into three levels based on the questions asked and addressed by the models and on the levels of analysis. The first category includes models theorizing elementary cognitive mechanisms, such as the signal detection process, the diffusion process, information processing, and reinforcement learning. They have been widely used in many formal models within and beyond psychology and cognitive science, ranging from basic visual perception to complex decision making. The second category covers models theorizing basic cognitive skills, such as perceptual identification, categorization, and episodic memory. The third category includes models theorizing higher-level cognition, such as Bayesian cognition, decision making, semantic memory, and shape
perception. In addition, we provide two chapters on modeling tools, including Bayesian estimation in hierarchical models and model comparison methods. We conclude the handbook with three chapters on new directions in the field, including neurocognitive modeling, mathematical and computational modeling in clinical psychology, and cognitive and decision models based upon quantum probability theory. The models reviewed in the handbook make use of many of the mathematical ideas presented in this review chapter. Probabilistic models appear in chapters covering signal detection theory (Chapter 2), probabilistic models of cognition (Chapter 9), decision theory (Chapters 10 and 17), and clinical applications (Chapter 16). Stochastic models (i.e., models that are dynamic and probabilistic) appear in chapters covering information processing (Chapter 4), perceptual judgment (Chapter 6), and random walk/diffusion models of choice and response time in various cognitive tasks (Chapters 3, 7, 10, and 15). Learning and memory models are reviewed in Chapters 5, 7, 8, and 11. Models using vector spaces and geometry are introduced in Chapters 11, 12, and 17. The basic concepts reviewed in this chapter should be helpful for readers who are new to mathematical and computational models to jumpstart reading the rest of the book. In addition, each chapter is self-contained, presents a tutorial-style introduction to the topic area exemplified by many
applications, and provides a specific glossary list of the basic concepts in the topic area. We believe you will have a rewarding reading experience.
Note 1. This chapter is restricted to real numbers.
References
Bhattacharya, R. N., & Waymire, E. C. (2009). Stochastic processes with applications (Vol. 61). Philadelphia, PA: SIAM.
Busemeyer, J. R., & Diederich, A. (2009). Cognitive modeling. Thousand Oaks, CA: SAGE.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes (Vol. 134). Boca Raton, FL: CRC Press.
Feller, W. (1968). An introduction to probability theory and its applications (3rd ed., Vol. 1). New York, NY: Wiley.
Hintzman, D. L. (1991). Why are formal models useful in psychology? In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 39–56). Hillsdale, NJ: Erlbaum.
Hogg, R. V., & Craig, A. T. (1970). Introduction to mathematical statistics (3rd ed.). New York, NY: Macmillan.
Lewandowsky, S., & Farrell, S. (2010). Computational modeling in cognition: Principles and practice. Thousand Oaks, CA: SAGE.
Myung, I. J. (2003). Tutorial on maximum likelihood estimation. Journal of Mathematical Psychology, 47, 90–100.
Stewart, J. (2012). Calculus (7th ed.). Belmont, CA: Brooks/Cole.
Suppes, P. (1984). Probabilistic metaphysics. Oxford: Basil Blackwell.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44(1), 92–107.
PART I
Elementary Cognitive Mechanisms

CHAPTER 2
Multidimensional Signal Detection Theory
F. Gregory Ashby and Fabian A. Soto
Abstract Multidimensional signal detection theory is a multivariate extension of signal detection theory that makes two fundamental assumptions, namely that every mental state is noisy and that every action requires a decision. The most widely studied version is known as general recognition theory (GRT). General recognition theory assumes that the percept on each trial can be modeled as a random sample from a multivariate probability distribution defined over the perceptual space. Decision bounds divide this space into regions that are each associated with a response alternative. General recognition theory rigorously defines and tests a number of important perceptual and cognitive conditions, including perceptual and decisional separability and perceptual independence. General recognition theory has been used to analyze data from identification experiments in two ways: (1) fitting and comparing models that make different assumptions about perceptual and decisional processing, and (2) testing assumptions by computing summary statistics and checking whether these satisfy certain conditions. Much has been learned recently about the neural networks that mediate the perceptual and decisional processing modeled by GRT, and this knowledge can be used to improve the design of experiments where a GRT analysis is anticipated. Key Words: signal detection theory, general recognition theory, perceptual separability,
Introduction Signal detection theory revolutionized psychophysics in two different ways. First, it introduced the idea that trial-by-trial variability in sensation can significantly affect a subject’s performance. And second, it introduced the field to the then-radical idea that every psychophysical response requires a decision from the subject, even when the task is as simple as detecting a signal in the presence of noise. Of course, signal detection theory proved to be wildly successful and both of these assumptions are now routinely accepted without question in virtually all areas of psychology. The mathematical basis of signal detection theory is rooted in statistical decision theory, which
itself has a history that dates back at least several centuries. The insight of signal detection theorists was that this model of statistical decisions was also a good model of sensory decisions. The first signal detection theory publication appeared in 1954 (Peterson, Birdsall, & Fox, 1954), but the theory did not really become widely known in psychology until the seminal article of Swets, Tanner, and Birdsall appeared in Psychological Review in 1961. From then until 1986, almost all applications of signal detection theory assumed only one sensory dimension (Tanner, 1956, is the principal exception). In almost all cases, this dimension was meant to represent sensory magnitude. For a detailed description of this standard univariate
theory, see the excellent texts of either Macmillan and Creelman (2005) or Wickens (2002). This chapter describes multivariate generalizations of signal detection theory. Multidimensional signal detection theory is a multivariate extension of signal detection to cases in which there is more than one perceptual dimension. It has all the advantages of univariate signal detection theory (i.e., it separates perceptual and decision processes), but it also offers the best existing method for examining interactions among perceptual dimensions (or components). The most widely studied version of multidimensional signal detection theory is known as general recognition theory (GRT; Ashby & Townsend, 1986). Since its inception, more than 350 articles have applied GRT to a wide variety of phenomena, including categorization (e.g., Ashby & Gott, 1988; Maddox & Ashby, 1993), similarity judgment (Ashby & Perrin, 1988), face perception (Blaha, Silbert, & Townsend, 2011; Thomas, 2001; Wenger & Ingvalson, 2002), recognition and source memory (Banks, 2000; Rotello, Macmillan, & Reeder, 2004), source monitoring (DeCarlo, 2003), attention (Maddox, Ashby, & Waldron, 2002), object recognition (Cohen, 1997; Demeyer, Zaenen, & Wagemans, 2007), perception/action interactions (Amazeen & DaSilva, 2005), auditory and speech perception (Silbert, 2012; Silbert, Townsend, & Lentz, 2009), haptic perception (Giordano et al., 2012; Louw, Kappers, & Koenderink, 2002), and the perception of sexual interest (Farris, Viken, & Treat, 2010). Extending signal detection theory to multiple dimensions might seem like a straightforward mathematical exercise, but, in fact, several new conceptual problems must be solved. First, with more than one dimension, it becomes necessary to model interactions (or the lack thereof) among those dimensions. During the 1960s and 1970s, a great many terms were coined that attempted to describe perceptual interactions among separate stimulus components.
None of these, however, were rigorously defined or had any underlying theoretical foundation. Included in this list were perceptual independence, separability, integrality, performance parity, and sampling independence. Thus, to be useful as a model of perception, any multivariate extension of signal detection theory needed to provide theoretical interpretations of these terms and show rigorously how they were related to one another.
Second, the problem of how to model decision processes when the perceptual space is multidimensional is far more difficult than when there is only one sensory dimension. A standard signal-detection-theory lecture is to show that almost any decision strategy is mathematically equivalent to setting a criterion on the single sensory dimension, then giving one response if the sensory value falls on one side of this criterion, and the other response if the sensory value falls on the other side. For example, in the normal, equal-variance model, this is true regardless of whether subjects base their decision on sensory magnitude or on likelihood ratio. A straightforward generalization of this model to two perceptual dimensions divides the perceptual plane into two response regions. One response is given if the percept falls in the first region and the other response is given if the percept falls in the second region. The obvious problem is that, unlike a line, there are an infinite number of ways to divide a plane into two regions. How do we know which of these has the most empirical validity? The solution to the first of these two problems—that is, the sensory problem—was proposed by Ashby and Townsend (1986) in the article that first developed GRT. The GRT model of sensory interactions has been embellished during the past 25 years, but the core concepts introduced by Ashby and Townsend (1986) remain unchanged (i.e., perceptual independence, perceptual separability). In contrast, the decision problem has been much more difficult. Ashby and Townsend (1986) proposed some candidate decision processes, but at that time they were largely without empirical support. In the ensuing 25 years, however, hundreds of studies have attacked this problem, and today much is known about human decision processes in perceptual and cognitive tasks that use multidimensional perceptual stimuli.
Box 1 Notation
Ai Bj = stimulus constructed by setting component A to level i and component B to level j
ai bj = response in an identification experiment signaling that component A is at level i and component B is at level j
X1 = perceived value of component A
X2 = perceived value of component B
elementary cognitive mechanisms
Box 1 Continued
fij(x1, x2) = joint likelihood that the perceived value of component A is x1 and the perceived value of component B is x2 on a trial when the presented stimulus is Ai Bj
gij(x1) = marginal pdf of component A on trials when stimulus Ai Bj is presented
rij = frequency with which the subject responded Rj on trials when stimulus Si was presented
P(Rj|Si) = probability that response Rj is given on a trial when stimulus Si is presented
General Recognition Theory
General recognition theory (see the Glossary for key concepts related to GRT) can be applied to virtually any task. The most common applications, however, are to tasks in which the stimuli vary on two stimulus components or dimensions. As an example, consider an experiment in which participants are asked to categorize or identify faces that vary across trials on gender and age. Suppose there are four stimuli (i.e., faces) that are created by factorially combining two levels of each dimension. In this case we could denote the two levels of the gender dimension by A1 (male) and A2 (female) and the two levels of the age dimension by B1 (teen) and B2 (adult). Then the four faces are denoted as A1B1 (male teen), A1B2 (male adult), A2B1 (female teen), and A2B2 (female adult).

As with signal detection theory, a fundamental assumption of GRT is that all perceptual systems are inherently noisy. There is noise both in the stimulus (e.g., photon noise) and in the neural systems that determine its sensory representation (Ashby & Lee, 1993). Even so, the perceived value on each sensory dimension will tend to increase as the level of the relevant stimulus component increases. In other words, the distribution of percepts will change when the stimulus changes. So, for example, each time the A1B1 face is presented, its perceived age and maleness will tend to be slightly different. General recognition theory models the sensory or perceptual effects of a stimulus AiBj via the joint probability density function (pdf) fij(x1, x2) (see Box 1 for a description of the notation used in this article). On any particular trial when stimulus AiBj is presented, GRT assumes that the subject's percept can be modeled as a random sample from this
joint pdf. Any such sample defines an ordered pair (x1, x2), the entries of which fix the perceived value of the stimulus on the two sensory dimensions. General recognition theory assumes that the subject uses these values to select a response.

In GRT, the relationship of the joint pdf to the marginal pdfs plays a critical role in determining whether the stimulus dimensions are perceptually integral or separable. The marginal pdf gij(x1) simply describes the likelihoods of all possible sensory values of X1. Note that the marginal pdfs are identical to the one-dimensional pdfs of classical signal detection theory. Component A is perceptually separable from component B if the subject's perception of A does not change when the level of B is varied. For example, age is perceptually separable from gender if the perceived age of the adult in our face experiment is the same for the male adult as for the female adult, and if a similar invariance holds for the perceived age of the teen. More formally, in an experiment with the four stimuli A1B1, A1B2, A2B1, and A2B2, component A is perceptually separable from B if and only if

$$g_{11}(x_1) = g_{12}(x_1) \quad \text{and} \quad g_{21}(x_1) = g_{22}(x_1), \tag{1}$$

for all values of x1.
Similarly, component B is perceptually separable from A if and only if

$$g_{11}(x_2) = g_{21}(x_2) \quad \text{and} \quad g_{12}(x_2) = g_{22}(x_2), \tag{2}$$
for all values of x2. If perceptual separability fails, then A and B are said to be perceptually integral. Note that this definition is purely perceptual, since it places no constraints on any decision processes. Another purely perceptual phenomenon is perceptual independence. According to GRT, components A and B are perceived independently in stimulus AiBj if and only if the perceptual value of component A is statistically independent of the perceptual value of component B on AiBj trials. More specifically, A and B are perceived independently in stimulus AiBj if and only if

$$f_{ij}(x_1, x_2) = g_{ij}(x_1)\, g_{ij}(x_2), \tag{3}$$
for all values of x1 and x2. If perceptual independence is violated, then components A and B are perceived dependently. Note that perceptual independence is a property of a single stimulus, whereas perceptual separability is a property of groups of stimuli. A third important construct from GRT is decisional separability. In our hypothetical experiment
multidimensional signal detection theory
with stimuli A1B1, A1B2, A2B1, and A2B2, and two perceptual dimensions X1 and X2, decisional separability holds on dimension X1 (for example) if the subject's decision about whether stimulus component A is at level 1 or 2 depends only on the perceived value on dimension X1. A decision bound is a line or curve that separates regions of the perceptual space that elicit different responses. The only types of decision bounds that satisfy decisional separability are vertical and horizontal lines.
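These definitions can be illustrated numerically. The sketch below (Python; all distribution parameters are hypothetical, illustrative values) models two percept distributions as bivariate normals, computes the x1 marginal of each by integrating the joint pdf over x2, and checks that the marginals agree across levels of B (perceptual separability of A) and that, with zero correlation, the joint pdf factors into the product of its marginals (perceptual independence, Eq. 3):

```python
import math

def npdf(x, mu, s):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def bvn_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Bivariate normal joint density f(x1, x2)."""
    z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2
    q = (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (2 * (1 - rho ** 2))
    return math.exp(-q) / (2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2))

def marginal_x1(x1, params, lo=-10.0, hi=10.0, n=2000):
    """g(x1): integrate the joint density over x2 (trapezoid rule)."""
    h = (hi - lo) / n
    total = 0.5 * (bvn_pdf(x1, lo, *params) + bvn_pdf(x1, hi, *params))
    for k in range(1, n):
        total += bvn_pdf(x1, lo + k * h, *params)
    return total * h

# Hypothetical percepts for A1B1 and A1B2: only the x2 mean differs, so the
# x1 marginal is unaffected by the level of B (perceptual separability of A).
A1B1 = (0.0, 0.0, 1.0, 1.0, 0.0)   # (mu1, mu2, s1, s2, rho)
A1B2 = (0.0, 2.0, 1.0, 1.0, 0.0)
for x in (-1.0, 0.0, 1.5):
    assert abs(marginal_x1(x, A1B1) - marginal_x1(x, A1B2)) < 1e-6

# With rho = 0, the joint pdf factors into the product of its marginals
# (perceptual independence, Eq. 3).
assert abs(bvn_pdf(0.3, -0.7, *A1B1) - npdf(0.3, 0, 1) * npdf(-0.7, 0, 1)) < 1e-12
```

Note that the separability check involves two stimuli while the independence check involves only one, mirroring the distinction drawn in the text.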
The Multivariate Normal Model
So far we have made no assumptions about the form of the joint or marginal pdfs. Our only assumption has been that there exists some probability distribution associated with each stimulus and that these distributions are all embedded in some Euclidean space (e.g., with orthogonal dimensions). There have been some efforts to extend GRT to more general geometric spaces (i.e., Riemannian manifolds; Townsend, Aisbett, Assadi, & Busemeyer, 2006; Townsend & Spencer-Smith, 2004), but much more common is to add more restrictions to the original version of GRT, not fewer. For example, some applications of GRT have been distribution free (e.g., Ashby & Maddox, 1994; Ashby & Townsend, 1986), but most have assumed that the percepts are multivariate normally distributed. The multivariate normal distribution includes two assumptions. First, the marginal distributions are all normal. Second, the only possible dependencies are pairwise linear relationships. Thus, in multivariate normal distributions, uncorrelated random variables are statistically independent.

A hypothetical example of a GRT model that assumes multivariate normal distributions is shown in Figure 2.1. The ellipses shown there are contours of equal likelihood; that is, all points on the same ellipse are equally likely to be sampled from the underlying distribution. The contours of equal likelihood also describe the shape a scatterplot of points would take if they were random samples from the underlying distribution. Geometrically, the contours are created by taking a slice through the distribution parallel to the perceptual plane and looking down at the result from above. Contours of equal likelihood in multivariate normal distributions are always circles or ellipses. Bivariate normal distributions, like those depicted in Figure 2.1, are each characterized by five parameters: a mean on each dimension, a variance on each dimension, and a covariance or correlation between the values on
the two dimensions. These are typically catalogued in a mean vector and a variance-covariance matrix. For example, consider a bivariate normal distribution with joint density function f(x1, x2). Then the mean vector would equal

$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \tag{4}$$

and the variance-covariance matrix would equal

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \mathrm{cov}_{12} \\ \mathrm{cov}_{21} & \sigma_2^2 \end{bmatrix} \tag{5}$$

where cov12 is the covariance between the values on the two dimensions (note that the correlation coefficient is the standardized covariance; that is, ρ12 = cov12/(σ1 σ2)).

The multivariate normal distribution has another important property. Consider an identification task with only two stimuli and suppose the perceptual effects associated with the presentation of each stimulus can be modeled as a multivariate normal distribution. Then it is straightforward to show that the decision boundary that maximizes accuracy is always linear or quadratic (e.g., Ashby, 1992). The optimal boundary is linear if the two perceptual distributions have equal variance-covariance matrices (and so the contours of equal likelihood have the same shape and are just translations of each other), and the optimal boundary is quadratic if the two variance-covariance matrices are unequal. Thus, in the Gaussian version of GRT, the only decision bounds that are typically considered are either linear or quadratic.

In Figure 2.1, note that perceptual independence holds for all stimuli except A2B2. This can be seen in the contours of equal likelihood. Note that the major and minor axes of the ellipses that define the contours of equal likelihood for stimuli A1B1, A1B2, and A2B1 are all parallel to the two perceptual dimensions. Thus, a scatterplot of samples from each of these distributions would be characterized by zero correlation and, therefore, statistical independence (i.e., in the special Gaussian case).
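The linear-versus-quadratic distinction can also be checked numerically. In the sketch below (Python; all distribution parameters are hypothetical), the log likelihood ratio of two bivariate normal distributions with equal variance-covariance matrices has vanishing second differences (i.e., it is affine in the percept, so the optimal bound is linear), whereas making the matrices unequal introduces curvature:

```python
import math

def bvn_logpdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Log density of a bivariate normal distribution."""
    z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2
    q = (z1 ** 2 - 2 * rho * z1 * z2 + z2 ** 2) / (2 * (1 - rho ** 2))
    return -q - math.log(2 * math.pi * s1 * s2 * math.sqrt(1 - rho ** 2))

def log_lr(x1, x2, p_first, p_second):
    """Log likelihood ratio of two percept distributions at (x1, x2)."""
    return bvn_logpdf(x1, x2, *p_first) - bvn_logpdf(x1, x2, *p_second)

# Two hypothetical perceptual distributions with EQUAL covariance matrices...
pa = (0.0, 0.0, 1.0, 1.0, 0.3)
pb = (2.0, 1.0, 1.0, 1.0, 0.3)
# ...and one with a different covariance matrix
pc = (2.0, 1.0, 2.0, 1.0, 0.0)

def second_diff(f, x1, x2, h=0.5):
    """Second difference in x1; zero everywhere iff f is affine in x1."""
    return f(x1 + h, x2) - 2 * f(x1, x2) + f(x1 - h, x2)

# Equal covariance matrices: log LR is affine, so the optimal bound is linear
for x1 in (-1.0, 0.0, 2.0):
    for x2 in (-1.0, 0.5, 3.0):
        assert abs(second_diff(lambda a, b: log_lr(a, b, pa, pb), x1, x2)) < 1e-9

# Unequal covariance matrices: log LR is quadratic, so the bound is curved
assert abs(second_diff(lambda a, b: log_lr(a, b, pa, pc), 1.0, 0.0)) > 1e-3
```

The quadratic terms of the two log densities cancel exactly when the covariance matrices are equal, which is why only the linear part of the boundary survives in that case.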
However, the major and minor axes of the A2B2 distribution are tilted, reflecting a positive correlation and hence a violation of perceptual independence.

Next, note in Figure 2.1 that stimulus component A is perceptually separable from stimulus component B, but B is not perceptually separable from A. To see this, note that the marginal distributions for stimulus component A are the same, regardless of the level of component B [i.e., g11(x1) = g12(x1) and g21(x1) = g22(x1), for all values of x1]. Thus, the subject's perception of component A does not depend on the level of B and, therefore, stimulus component A is perceptually separable from B. On the other hand, note that the subject's perception of component B does change when the level of component A changes [i.e., g11(x2) ≠ g21(x2) and g12(x2) ≠ g22(x2) for most values of x2]. In particular, when A changes from level 1 to level 2, the subject's mean perceived value of each level of component B increases. Thus, the perception of component B depends on the level of component A, and therefore B is not perceptually separable from A.

Finally, note that decisional separability holds on dimension 1 but not on dimension 2. On dimension 1 the decision bound is vertical. Thus, the subject has adopted the following decision rule: Component A is at level 2 if x1 > Xc1; otherwise component A is at level 1, where Xc1 is the criterion on dimension 1 (i.e., the x1 intercept of the vertical decision bound). Thus, the subject's decision about whether component A is at level 1 or 2 does not depend on the perceived value of component B, so component A is decisionally separable from component B. On the other hand, the decision bound on dimension x2 is not horizontal, so the criterion used to judge whether component B is at level 1 or 2 changes with the perceived value of component A (at least for larger perceived values of A). As a result, component B is not decisionally separable from component A.

Fig. 2.1 Contours of equal likelihood, decision bounds, and marginal perceptual distributions from a hypothetical multivariate normal GRT model that describes the results of an identification experiment with four stimuli that were constructed by factorially combining two levels of two stimulus dimensions.

Applying GRT to Data
The most common applications of GRT are to data collected in an identification experiment like the one modeled in Figure 2.1. The key data from such experiments are collected in a confusion matrix, which contains a row for every stimulus and a column for every response (Table 2.1 displays an example of a confusion matrix, which will be discussed and analyzed later). The entry in row i and column j lists the number of trials on which stimulus Si was presented and the subject gave response Rj . Thus, the entries on the main diagonal give the frequencies of all correct responses and the off-diagonal entries describe the various errors (or confusions). Note that each row sum equals the total number of stimulus presentations of that type. So if each stimulus is presented 100 times then the sum of all entries in each row will equal 100. This means that there is one constraint per row, so an n × n confusion matrix will have n × (n – 1) degrees of freedom. General recognition theory has been used to analyze data from confusion matrices in two different ways. One is to fit the model to the entire confusion matrix. In this method, a GRT model is constructed with specific numerical values of all of its parameters and a predicted confusion matrix is computed. Next, values of each parameter are found that make the predicted matrix as close as possible to the empirical confusion matrix. To test various assumptions about perceptual and decisional processing—for example, whether perceptual independence holds—a version of the model that assumes perceptual independence is fit to the data as well as a version that makes no assumptions about perceptual independence. This latter version contains the former version as a special case (i.e., in which all covariance parameters are set to zero), so it can never fit worse. 
After fitting these two models, we assume that perceptual independence is violated if the more general model fits significantly better than the more restricted model that assumes perceptual independence. The other method for using GRT to test assumptions about perceptual processing, which is arguably more popular, is to compute certain summary statistics from the empirical confusion matrix and then to check whether these satisfy certain conditions that are characteristic of perceptual separability or
independence. Because these two methods are so different, we will discuss each in turn. It is important to note, however, that regardless of which method is used, there are certain nonidentifiabilities in the GRT model that could limit the conclusions that are possible to draw from any such analyses (e.g., Menneer, Wenger, & Blaha, 2010; Silbert & Thomas, 2013). The problems are most severe when GRT is applied to 2 × 2 identification data (i.e., when the stimuli are A1B1, A1B2, A2B1, and A2B2). For example, Silbert and Thomas (2013) showed that in 2 × 2 applications where there are two linear decision bounds that do not satisfy decisional separability, there always exists an alternative model that makes the exact same empirical predictions and satisfies decisional separability (and these two models are related by an affine transformation). Thus, decisional separability is not testable with standard applications of GRT to 2 × 2 identification data (nor can the slopes of the decision bounds be uniquely estimated). For several reasons, however, these nonidentifiabilities are not catastrophic. First, the problems don't generally exist with 3 × 3 or larger identification tasks. In the 3 × 3 case the GRT model with linear bounds requires at least 4 decision bounds to divide the perceptual space into 9 response regions (e.g., in a tic-tac-toe configuration). Typically, two will have a generally vertical orientation and two will have a generally horizontal orientation. In this case, there is no affine transformation that guarantees decisional separability except in the special case where the two vertical-tending bounds are parallel and the two horizontal-tending bounds are parallel (because parallel lines remain parallel after affine transformations). Thus, in 3 × 3 (or higher) designs, decisional separability is typically identifiable and testable.
Second, there are simple experimental manipulations that can be added to the basic 2 × 2 identification experiment to test for decisional separability. In particular, switching the locations of the response keys is known to interfere with performance if decisional separability fails but not if decisional separability holds (Maddox, Glass, O'Brien, Filoteo, & Ashby, 2010; for more information on this, see the section later entitled "Neural Implementations of GRT"). Thus, one could add 100 extra trials to the end of a 2 × 2 identification experiment where the response key locations are randomly interchanged (and participants are informed of this change). If accuracy drops
significantly during this period, then decisional separability can be rejected, whereas if accuracy is unaffected then decisional separability is supported. Third, one could analyze the 2 × 2 data using the newly developed GRT model with individual differences (GRT-wIND; Soto, Vucovich, Musgrave, & Ashby, in press), which was patterned after the INDSCAL model of multidimensional scaling (Carroll & Chang, 1970). GRT-wIND is fit to the data from all individuals simultaneously. All participants are assumed to share the same group perceptual distributions, but different participants are allowed different linear bounds and they are assumed to allocate different amounts of attention to each perceptual dimension. The model does not suffer from the identifiability problems identified by Silbert and Thomas (2013), even in the 2 × 2 case, because with different linear bounds for each participant there is no affine transformation that simultaneously makes all these bounds satisfy decisional separability.
Fitting the GRT Model to Identification Data
computing the likelihood function
When the full GRT model is fit to identification data, the best-fitting values of all free parameters must be found. Ideally, this is done via the method of maximum likelihood; that is, numerical values of all parameters are found that maximize the likelihood of the data given the model. Let S1, S2, . . . , Sn denote the n stimuli in an identification experiment and let R1, R2, . . . , Rn denote the n responses. Let rij denote the frequency with which the subject responded Rj on trials when stimulus Si was presented. Thus, rij is the entry in row i and column j of the confusion matrix. Note that the rij are random variables. The entries in each row have a multinomial distribution. In particular, if P(Rj|Si) is the true probability that response Rj is given on trials when stimulus Si is presented, then the probability of observing the response frequencies ri1, ri2, . . . , rin in row i equals

$$\frac{n_i!}{r_{i1}!\, r_{i2}! \cdots r_{in}!}\, P(R_1|S_i)^{r_{i1}}\, P(R_2|S_i)^{r_{i2}} \cdots P(R_n|S_i)^{r_{in}} \tag{6}$$

where ni is the total number of times that stimulus Si was presented during the course of the experiment. The probability or joint likelihood of observing the entire confusion matrix is the product
of the probabilities of observing each row; that is,

$$L = \prod_{i=1}^{n} \frac{n_i!}{\prod_{j=1}^{n} r_{ij}!} \prod_{j=1}^{n} P(R_j|S_i)^{r_{ij}} \tag{7}$$
General recognition theory models predict that P(Rj|Si) has a specific form. Specifically, they predict that P(Rj|Si) is the volume in the Rj response region under the multivariate distribution of perceptual effects elicited when stimulus Si is presented. This requires computing a multiple integral. The maximum likelihood estimators of the GRT model parameters are those numerical values of each parameter that maximize L. Note that the first term in Eq. 7 does not depend on the values of any model parameters. Rather, it only depends on the data. Thus, the parameter values that maximize the second term also maximize the whole expression. For this reason, the first term can be ignored during the maximization process. Another common practice is to take logs of both sides of Eq. 7. Parameter values that maximize L will also maximize any monotonic function of L (and log is a monotonic transformation). So, the standard approach is to find values of the free parameters that maximize

$$\sum_{i=1}^{n} \sum_{j=1}^{n} r_{ij} \log P(R_j|S_i) \tag{8}$$
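Eq. 8 is straightforward to compute from a confusion matrix and a matrix of predicted response probabilities. The sketch below (Python; the confusion frequencies and predicted probabilities are hypothetical) also illustrates that predicted probabilities closer to the empirical proportions yield a higher log likelihood:

```python
import math

def grt_log_likelihood(confusions, predicted):
    """Eq. 8: sum over i, j of r_ij * log P(R_j | S_i), ignoring the constant
    multinomial coefficient, which does not depend on the parameters."""
    ll = 0.0
    for r_row, p_row in zip(confusions, predicted):
        for r_ij, p_ij in zip(r_row, p_row):
            if r_ij > 0:
                ll += r_ij * math.log(p_ij)
    return ll

# Hypothetical 2-stimulus confusion matrix (rows: stimuli, cols: responses)
confusions = [[80, 20],
              [30, 70]]
# Predicted response probabilities P(R_j | S_i) from some candidate model
predicted = [[0.75, 0.25],
             [0.25, 0.75]]
ll = grt_log_likelihood(confusions, predicted)

# Probabilities matching the empirical proportions (0.8/0.2, 0.3/0.7) fit better
ll_better = grt_log_likelihood(confusions, [[0.8, 0.2], [0.3, 0.7]])
assert ll_better > ll
```

In a real application the `predicted` matrix would be computed from the GRT model's distributional and decision-bound parameters, and a minimization routine would search for the parameter values that maximize this quantity.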
estimating the parameters
In the case of the multivariate normal model, the predicted probability P(Rj|Si) in Eq. 8 equals the volume under the multivariate normal pdf that describes the subject's perceptual experiences on trials when stimulus Si is presented over the response region associated with response Rj. To estimate the best-fitting parameter values using a standard minimization routine, such integrals must be evaluated many times. If decisional separability is assumed, then the problem simplifies considerably. For example, under these conditions, Wickens (1992) derived the first and second derivatives necessary to quickly estimate parameters of the model using the Newton-Raphson method. Other methods must be used for more general models that do not assume decisional separability. Ennis and Ashby (2003) proposed an efficient algorithm for evaluating the integrals that arise when fitting any GRT model. This algorithm allows the parameters of virtually any GRT model to be estimated via standard minimization software. The remainder of this section describes this method. The left side of Figure 2.2 shows a contour of equal likelihood from the bivariate normal
distribution that describes the perceptual effects of stimulus Si, and the solid lines denote two possible decision bounds in this hypothetical task. In Figure 2.2 the bounds are linear, but the method works for any number of bounds that have any parametric form. The shaded region is the Rj response region. Thus, according to GRT, computing P(Rj|Si) is equivalent to computing the volume under the Si perceptual distribution in the Rj response region. This volume is indicated by the shaded region in the figure.

First note that any linear bound can be written in discriminant function form as

$$h(x_1, x_2) = h(\mathbf{x}) = \mathbf{b}'\mathbf{x} + c = 0 \tag{9}$$

where (in the bivariate case) x and b are the vectors

$$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad \text{and} \quad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}$$

and c is a constant. The discriminant function form of any decision bound has the property that positive values are obtained if any point on one side of the bound is inserted into the function, and negative values are obtained if any point on the opposite side is inserted. So, for example, in Figure 2.2, the constants b1, b2, and c can be selected so that h1(x) > 0 for any point x above the h1 bound and h1(x) < 0 for any point below the bound. Similarly, for the h2 bound, the constants can be selected so that h2(x) > 0 for any point to the right of the bound and h2(x) < 0 for any point to the left. Note that under these conditions, the Rj response region is defined as the set of all x such that h1(x) > 0 and h2(x) > 0. Therefore, if we denote the multivariate normal (mvn) pdf for stimulus Si as mvn(μi, Σi), then

$$P(R_j|S_i) = \iint_{h_1(\mathbf{x})>0,\; h_2(\mathbf{x})>0} \mathrm{mvn}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)\, dx_1\, dx_2 \tag{10}$$

Ennis and Ashby (2003) showed how to quickly approximate integrals of this type. The basic idea is to transform the problem using a multivariate form of the well-known z transformation. Ennis and Ashby proposed using the Cholesky transformation. Any random vector x that has a multivariate normal distribution can always be rewritten as

$$\mathbf{x} = \mathbf{P}\mathbf{z} + \boldsymbol{\mu}, \tag{11}$$

where μ is the mean vector of x, z is a random vector with a multivariate z distribution (i.e., a multivariate normal distribution with mean vector 0 and variance-covariance matrix equal to the identity
Fig. 2.2 Schematic illustration of how numerical integration is performed in the multivariate normal GRT model via Cholesky factorization.
matrix I), and P is a lower triangular matrix such that PP′ = Σ (i.e., the variance-covariance matrix of x). If x is bivariate normal then

$$\mathbf{P} = \begin{bmatrix} \sigma_1 & 0 \\ \dfrac{\mathrm{cov}_{12}}{\sigma_1} & \sqrt{\sigma_2^2 - \dfrac{\mathrm{cov}_{12}^2}{\sigma_1^2}} \end{bmatrix} \tag{12}$$
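Eq. 12 can be verified directly: multiplying the lower triangular factor by its transpose must reproduce the variance-covariance matrix. A minimal check (Python; the covariance values are hypothetical):

```python
import math

def cholesky_2x2(var1, var2, cov):
    """Lower triangular P with P P' = Sigma for a 2x2 covariance matrix (Eq. 12)."""
    s1 = math.sqrt(var1)
    return [[s1, 0.0],
            [cov / s1, math.sqrt(var2 - cov ** 2 / var1)]]

# Hypothetical perceptual variance-covariance parameters
var1, var2, cov = 1.0, 2.0, 0.6
P = cholesky_2x2(var1, var2, cov)

# Verify that P P' reproduces Sigma
PPt = [[sum(P[i][k] * P[j][k] for k in range(2)) for j in range(2)] for i in range(2)]
Sigma = [[var1, cov], [cov, var2]]
for i in range(2):
    for j in range(2):
        assert abs(PPt[i][j] - Sigma[i][j]) < 1e-12
```

For higher-dimensional models the same factorization is available from standard linear algebra libraries; the closed form above is the 2 × 2 special case used throughout this section.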
The Cholesky transformation is linear (see Eq. 11), so linear bounds in x space are transformed to linear bounds in z space. In particular, hk(x) = b′x + c = 0 becomes hk(Pz + μ) = b′(Pz + μ) + c = 0, or equivalently

$$h_k^{*}(\mathbf{z}) = (\mathbf{b}'\mathbf{P})\mathbf{z} + (\mathbf{b}'\boldsymbol{\mu} + c) = 0 \tag{13}$$
Thus, in this way we can transform the Eq. 10 integral to

$$P(R_j|S_i) = \iint_{h_1(\mathbf{x})>0,\; h_2(\mathbf{x})>0} \mathrm{mvn}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)\, dx_1\, dx_2 = \iint_{h_1^{*}(\mathbf{z})>0,\; h_2^{*}(\mathbf{z})>0} \mathrm{mvn}(\mathbf{0}, \mathbf{I})\, dz_1\, dz_2 \tag{14}$$

The right and left panels of Figure 2.2 illustrate these two integrals. The key to evaluating the second of these integrals quickly is to preload z-values that are centered in equal-area intervals. In Figure 2.2 each gray point in the right panel has a z1 coordinate that is the center of an interval with area 0.10 under the z distribution (since there
are 10 points). Taking the Cartesian product of these 10 points produces a table of 100 ordered pairs (z1, z2) that are each the center of a rectangle with volume 0.01 (i.e., 0.10 × 0.10) under the bivariate z distribution. Given such a table, the Eq. 14 integral is evaluated by stepping through all (z1, z2) points in the table. Each point is substituted into Eq. 13 for k = 1 and 2 and the signs of h1*(z1, z2) and h2*(z1, z2) are determined. If h1*(z1, z2) > 0 and h2*(z1, z2) > 0, then the Eq. 14 integral is incremented by 0.01. If either or both of these signs are negative, then the value of the integral is unchanged. So the value of the integral is approximately equal to the number of (z1, z2) points that are in the Rj response region divided by the total number of (z1, z2) points in the table. Figure 2.2 shows a 10 × 10 grid of (z1, z2) points, but better results can be expected from a grid with higher resolution. We have had success with a 100 × 100 grid, which should produce approximations to the integral that are accurate to within 0.0001 (Ashby, Waldron, Lee, & Berkman, 2001).

evaluating goodness of fit
As indicated before, one popular method for testing an assumption about perceptual or decisional processing is to fit two versions of a GRT model to the data using the procedures outlined in this section. In the first, restricted version of the model, a number of parameters are set to values that reflect the assumption being tested. For example, fixing all correlations to zero would test perceptual independence. In the second, unrestricted version of the model, the same parameters are free to vary. Once the restricted and unrestricted versions of the model have been fit, they can be compared through
a likelihood ratio test:

$$-2(\log L_R - \log L_U), \tag{15}$$
where LR and LU represent the likelihoods of the restricted and unrestricted models, respectively. Under the null hypothesis that the restricted model is correct, the statistic has a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the restricted and unrestricted models. If several non-nested models were fitted to the data, we would usually want to select the best candidate from this set. The likelihood ratio test cannot be used to select among such non-nested models. Instead, we can compute the Akaike information criterion (AIC; Akaike, 1974) or the Bayesian information criterion (BIC; Schwarz, 1978):

$$\mathrm{AIC} = -2 \log L + 2m \tag{16}$$

$$\mathrm{BIC} = -2 \log L + m \log N \tag{17}$$
where m is the number of free parameters in the model and N is the number of data points being fit. When the sample size is small compared to the number of free parameters of the model, as in most applications of GRT, a correction factor equal to 2m(m + 1)/(N − m − 1) should be added to the AIC (see Burnham & Anderson, 2004). The best model is the one with the smallest AIC or BIC. Because an n × n confusion matrix has n(n − 1) degrees of freedom, the maximum number of free parameters that can be estimated from any confusion matrix is n(n − 1). The origin and unit of measurement on each perceptual dimension are arbitrary. Therefore, without loss of generality, the mean vector of one perceptual distribution can be set to 0, and all variances of that distribution can be set to 1.0. Therefore, if there are two perceptual dimensions and n stimuli, then the full GRT model has 5(n − 1) + 1 free distributional parameters (i.e., each of n − 1 stimuli has 5 free parameters: 2 means, 2 variances, and a covariance; the distribution with mean 0 and all variances set to 1 has 1 free parameter: a covariance). If linear bounds are assumed, then another 2 free parameters must be added for every bound (e.g., slope and intercept). With a factorial design (e.g., as when the stimulus set is A1B1, A1B2, A2B1, and A2B2), there must be at least one bound on each dimension to separate each pair of consecutive component levels. So for stimuli A1B1, A1B2, A2B1, and A2B2, at least two bounds are required (e.g., see Figure 2.1). If, instead, there are 3 levels of each component,
then at least 4 bounds are required. The confusion matrix from a 2 × 2 factorial experiment has 12 degrees of freedom. The full model has more free parameters than this, so it cannot be fit to the data from this experiment. As a result, some restrictive assumptions are required. In a 3 × 3 factorial experiment, however, the confusion matrix has 72 degrees of freedom (9 × 8) and the full model has 49 free parameters (i.e., 41 distributional parameters and 8 decision bound parameters), so the full model can be fit to identification data when there are at least 3 levels of each stimulus dimension. For an alternative to the GRT identification model presented in this section, see Box 2.
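The model-comparison statistics in Eqs. 15, 16, and 17 are simple to compute once the maximized log likelihoods are in hand. In the sketch below (Python), the log likelihoods, parameter counts, and sample size are hypothetical; note that the unrestricted model necessarily fits at least as well, but after the parameter penalty the restricted model can still be preferred by AIC:

```python
import math

def lr_statistic(logL_restricted, logL_unrestricted):
    """Eq. 15: -2 * (log L_R - log L_U); chi-squared under the null."""
    return -2.0 * (logL_restricted - logL_unrestricted)

def aic(logL, m):
    """Eq. 16: AIC = -2 log L + 2m."""
    return -2.0 * logL + 2 * m

def bic(logL, m, N):
    """Eq. 17: BIC = -2 log L + m log N."""
    return -2.0 * logL + m * math.log(N)

# Hypothetical fits: a restricted model (e.g., perceptual independence,
# 10 parameters) vs. an unrestricted model (12 parameters), N = 72 data points
logL_R, logL_U = -113.9, -112.4
stat = lr_statistic(logL_R, logL_U)     # compare to chi-squared with 2 df
assert abs(stat - 3.0) < 1e-9

# Here the likelihood-ratio statistic is below the chi-squared(2) critical
# value (5.99 at alpha = .05), and AIC also prefers the restricted model
assert aic(logL_R, 10) < aic(logL_U, 12)
```

The likelihood ratio test applies only to nested pairs of models; the AIC and BIC comparisons apply whether or not the models are nested.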
Box 2 GRT Versus the Similarity-Choice Model
The most widely known alternative identification model is the similarity-choice model (SCM; Luce, 1963; Shepard, 1957), which assumes that

$$P(R_j|S_i) = \frac{\eta_{ij} \beta_j}{\sum_k \eta_{ik} \beta_k},$$
where ηij is the similarity between stimuli Si and Sj and βj is the bias toward response Rj . The SCM has had remarkable success. For many years, it was the standard against which competing models were compared. For example, in 1992 J. E. K. Smith summarized its performance by concluding that the SCM “has never had a serious competitor as a model of identification data. Even when it has provided a poor model of such data, other models have done even less well” (p. 199). Shortly thereafter, however, the GRT model ended this dominance, at least for identification data collected from experiments with stimuli that differ on only a couple of stimulus dimensions. In virtually every such comparison, the GRT model has provided a substantially better fit than the SCM, in many cases with fewer free parameters (Ashby et al., 2001). Even so, it is important to note that the SCM is still valuable, especially in the case of identification experiments in which the stimuli vary on many unknown stimulus dimensions.
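The SCM choice rule is easy to state in code. The sketch below (Python; the similarity and bias values are hypothetical) computes P(Rj|Si) and confirms that each row of predicted probabilities sums to 1:

```python
def scm_probability(eta, beta, i, j):
    """Similarity-choice model: P(R_j | S_i) = eta_ij * beta_j / sum_k eta_ik * beta_k."""
    denom = sum(eta[i][k] * beta[k] for k in range(len(beta)))
    return eta[i][j] * beta[j] / denom

# Hypothetical similarities (eta_ii = 1, symmetric) and response biases
eta = [[1.0, 0.4],
       [0.4, 1.0]]
beta = [0.5, 0.5]

# Each row of predicted probabilities sums to 1
for i in range(2):
    row_sum = sum(scm_probability(eta, beta, i, j) for j in range(2))
    assert abs(row_sum - 1.0) < 1e-12

# With equal biases, confusions are driven by similarity alone: the correct
# response is more probable than the confusion
assert scm_probability(eta, beta, 0, 0) > scm_probability(eta, beta, 0, 1)
```

Fitting the SCM to a confusion matrix proceeds exactly as for GRT: the η and β parameters are adjusted to maximize the likelihood of the observed response frequencies.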
The Summary Statistics Approach
The summary statistics approach (Ashby & Townsend, 1986; Kadlec & Townsend, 1992a, 1992b) draws inferences about perceptual independence, perceptual separability, and decisional separability by using summary statistics that are easily computed from a confusion matrix. Consider again the factorial identification experiment with 2 levels of 2 stimulus components. As before, we will denote the stimuli in this experiment as A1B1, A1B2, A2B1, and A2B2. In this case, it is convenient to denote the responses as a1b1, a1b2, a2b1, and a2b2. The summary statistics approach operates by computing certain summary statistics that are derived from the 4 × 4 confusion matrix that results from this experiment. The statistics are computed at either the macro- or micro-level of analysis.

macro-analyses
Macro-analyses draw conclusions about perceptual and decisional separability from changes in accuracy, sensitivity, and bias measures computed for one dimension across levels of a second dimension. One of the most widely used summary statistics in macro-analysis is marginal response invariance, which holds for a dimension when the probability of identifying the correct level of that dimension does not depend on the level of any irrelevant dimensions (Ashby & Townsend, 1986). For example, marginal response invariance requires that the probability of correctly identifying that component A is at level 1 is the same regardless of the level of component B, or in other words that

P(a1 | A1B1) = P(a1 | A1B2)

Now in an identification experiment, A1 can be correctly identified regardless of whether the level of B is correctly identified, and so

P(a1 | A1B1) = P(a1b1 | A1B1) + P(a1b2 | A1B1)

For this reason, marginal response invariance holds on dimension X1 if and only if

P(a_i b_1 | A_i B_1) + P(a_i b_2 | A_i B_1) = P(a_i b_1 | A_i B_2) + P(a_i b_2 | A_i B_2)   (18)
for both i = 1 and 2. Similarly, marginal response invariance holds on dimension X2 if and only if
Fig. 2.3 Diagram explaining the relation between macroanalytic summary statistics and the concepts of perceptual and decisional separability.
P(a_1 b_j | A_1 B_j) + P(a_2 b_j | A_1 B_j) = P(a_1 b_j | A_2 B_j) + P(a_2 b_j | A_2 B_j)   (19)
for both j = 1 and 2. Marginal response invariance is closely related to perceptual and decisional separability. In fact, if dimension X 1 is perceptually and decisionally separable from dimension X 2 , then marginal response invariance must hold for X 1 (Ashby & Townsend, 1986). In the later section entitled “Extensions to Response Time,” we describe how an even stronger test is possible with a response time version of marginal response invariance. Figure 2.3 helps to understand intuitively why perceptual and decisional separability together imply marginal response invariance. The top of the figure shows the perceptual distributions of four stimuli that vary on two dimensions. Dimension X 1 is decisionally but not perceptually separable from dimension X 2 ; the distance between the means of the perceptual distributions along the X 1 axis is much greater for the top two stimuli than for the bottom two stimuli. The marginal distributions at the bottom of Figure 2.3 show that the proportion of correct responses, represented by the light-grey areas under the curves, is larger in the second level of X 2 than in the first level. The result would be similar if perceptual separability held and decisional separability failed, as would be the case for X 2
if its decision bound was not perpendicular to its main axis. To test marginal response invariance on dimension X1, we estimate the various probabilities in Eq. 18 from the empirical confusion matrix that results from this identification experiment. Next, equality between the two sides of Eq. 18 is assessed via a standard statistical test. These computations are repeated for both levels of component A, and if either of the two tests is significant, then we conclude that marginal response invariance fails and, therefore, that either perceptual or decisional separability is violated. The left side of Eq. 18 equals P(ai | AiB1) and the right side equals P(ai | AiB2). These are the probabilities that component Ai is correctly identified, and they are analogous to "hit" rates in signal detection theory. To emphasize this relationship, we define the identification hit rate of component Ai on trials when stimulus AiBj is presented as

H_{a_i | A_i B_j} = P(a_i | A_i B_j) = P(a_i b_1 | A_i B_j) + P(a_i b_2 | A_i B_j)   (20)

The analogous false-alarm rates can be defined similarly. For example,

F_{a_2 | A_1 B_j} = P(a_2 | A_1 B_j) = P(a_2 b_1 | A_1 B_j) + P(a_2 b_2 | A_1 B_j)   (21)

In Figure 2.3, note that the dark grey areas in the marginal distributions equal F_{a_1|A_2B_2} (top) and F_{a_1|A_2B_1} (bottom). In signal detection theory, hit and false-alarm rates are used to measure stimulus discriminability (i.e., d′). We can use the identification analogues to compute marginal discriminabilities for each stimulus component (Thomas, 1999). For example,

d′_{AB_j} = Φ⁻¹(H_{a_2|A_2B_j}) − Φ⁻¹(F_{a_2|A_1B_j})   (22)

where the function Φ⁻¹ is the inverse cumulative distribution function of the standard normal distribution. As shown in Figure 2.3, the value of d′_{AB_j} represents the standardized distance between the means of the perceptual distributions of stimuli A1Bj and A2Bj.
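The marginal hit and false-alarm rates of Eqs. 20 and 21, and the marginal d′ of Eq. 22, are easily computed from a confusion matrix. A rough Python sketch (the counts below are hypothetical; rows are stimuli A1B1, A1B2, A2B1, A2B2 and columns are responses a1b1, a1b2, a2b1, a2b2):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical confusion-matrix counts. Rows: stimuli A1B1, A1B2, A2B1,
# A2B2; columns: responses a1b1, a1b2, a2b1, a2b2.
C = np.array([[180.0, 20.0, 40.0, 10.0],
              [25.0, 175.0, 5.0, 45.0],
              [30.0, 10.0, 160.0, 50.0],
              [5.0, 35.0, 45.0, 165.0]])
P = C / C.sum(axis=1, keepdims=True)   # row-wise response probabilities

def marginal_dprime(P, j):
    """Marginal d' for component A at level j of B (Eqs. 20-22).

    H = P(a2 | A2Bj) and F = P(a2 | A1Bj); reporting level 2 on A
    corresponds to columns 2 and 3 (responses a2b1 and a2b2).
    """
    H = P[2 + (j - 1), [2, 3]].sum()   # row of stimulus A2Bj
    F = P[0 + (j - 1), [2, 3]].sum()   # row of stimulus A1Bj
    return norm.ppf(H) - norm.ppf(F)

d1 = marginal_dprime(P, 1)   # d'_{AB1}
d2 = marginal_dprime(P, 2)   # d'_{AB2}; equal values are consistent
                             # with perceptual separability of A from B
```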
If component A is perceptually separable from component B, then the marginal discriminabilities between the two levels of A must be the same for each level of B; that is, d′_{AB_1} = d′_{AB_2} (Kadlec & Townsend, 1992a, 1992b). Thus, if this equality fails, then perceptual separability is violated. The equality between two d′ values can be tested using the following statistic (Marascuilo, 1970):

Z = (d′_1 − d′_2) / √(s²_{d′_1} + s²_{d′_2})   (23)

where

s²_{d′} = F(1 − F) / (n_n [φ(Φ⁻¹(F))]²) + H(1 − H) / (n_s [φ(Φ⁻¹(H))]²)   (24)

Here φ is the standard normal probability density function, H and F are the hit and false-alarm rates associated with the relevant d′, n_n is the number of trials used to compute F, and n_s is the number of trials used to compute H. Under the null hypothesis of equal d′ values, Z follows a standard normal distribution. Marginal hit and false-alarm rates can also be used to compute a marginal response criterion. Several measures of response criterion and bias have been proposed (see Chapter 2 of Macmillan & Creelman, 2005), but perhaps the most widely used criterion measure in recent years (due to Kadlec, 1999) is:

c_{AB_j} = Φ⁻¹(F_{a_1|A_2B_j})   (25)
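The Z test of Eqs. 23 and 24 can be sketched in a few lines of Python (the hit and false-alarm rates and trial counts below are hypothetical):

```python
import numpy as np
from scipy.stats import norm

def dprime_variance(H, F, ns, nn):
    """Sampling variance of d' (Eq. 24). H is the hit rate from ns
    signal trials; F is the false-alarm rate from nn noise trials;
    norm.pdf and norm.ppf are the standard normal density and
    inverse CDF."""
    return (F * (1 - F)) / (nn * norm.pdf(norm.ppf(F)) ** 2) + \
           (H * (1 - H)) / (ns * norm.pdf(norm.ppf(H)) ** 2)

def z_two_dprimes(H1, F1, H2, F2, ns=250, nn=250):
    """Marascuilo (1970) Z statistic for testing d'_1 = d'_2 (Eq. 23)."""
    d1 = norm.ppf(H1) - norm.ppf(F1)
    d2 = norm.ppf(H2) - norm.ppf(F2)
    se = np.sqrt(dprime_variance(H1, F1, ns, nn) +
                 dprime_variance(H2, F2, ns, nn))
    return (d1 - d2) / se

# Hypothetical rates: identical discriminability at the two levels of B
z_equal = z_two_dprimes(0.8, 0.2, 0.8, 0.2)        # exactly 0
# A far more discriminable second level: |Z| exceeds the 1.96 criterion
z_unequal = z_two_dprimes(0.8, 0.2, 0.95, 0.05)
```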
As shown in Figure 2.3, this measure represents the placement of the decision bound relative to the center of the A2Bj distribution. If component A is perceptually separable from component B, but c_{AB_1} ≠ c_{AB_2}, then decisional separability must have failed on dimension X1 (Kadlec & Townsend, 1992a, 1992b). On the other hand, if perceptual separability is violated, then examining the marginal response criteria provides no information about decisional separability. To understand why this is the case, note that in Figure 2.3 the marginal c values are not equal, even though decisional separability holds. A failure of perceptual separability has affected measures of both discriminability and response criteria. To test the difference between two c values, the following test statistic can be used (Kadlec, 1999):

Z = (c_1 − c_2) / √(s²_{c_1} + s²_{c_2})   (26)

where

s²_c = F(1 − F) / (n_n [φ(Φ⁻¹(F))]²)   (27)
micro-analyses Macro-analyses focus on properties of the entire stimulus ensemble. In contrast, micro-analyses test assumptions about perceptual independence and decisional separability by examining summary statistics computed for only one or two stimuli. The most widely used test of perceptual independence is via sampling independence, which holds when the probability of reporting a combination of components P(ai bj ) equals the product of the probabilities of reporting each component alone, P(ai )P(bj ). For example, sampling independence holds for stimulus A1 B1 if and only if P(a1 b1 |A1 B1 ) = P(a1 |A1 B1 ) × P(b1 |A1 B1 ) = [P(a1 b1 |A1 B1 ) + P(a1 b2 |A1 B1 )] × [P(a1 b1 |A1 B1 ) + P(a2 b1 |A1 B1 )] (28) Sampling independence provides a strong test of perceptual independence if decisional separability holds. In fact, if decisional separability holds on both dimensions, then sampling independence holds if and only if perceptual independence holds (Ashby & Townsend, 1986). Figure 2.4A gives an intuitive illustration of this theoretical result. Two cases are presented in which decisional separability holds on both dimensions and the decision bounds cross at the mean of the perceptual distribution. In the distribution to the left, perceptual independence holds and it is easy to see that all four responses are equally likely. Thus, the volume of this bivariate normal distribution in response region R4 = a2 b2 is 0.25. It is also easy to see that half of each marginal distribution lies above its relevant decision criterion (i.e., the two shaded regions), so P(a2 ) = P(b2 ) = 0.5. As a result, sampling independence is satisfied since P(a2 b2 ) = P(a2 ) × P(b2 ). It turns out that this relation holds regardless of where the bounds are placed, as long as they remain perpendicular to the dimension that they divide. 
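The sampling independence check of Eq. 28 reduces to simple arithmetic on one row of the confusion matrix. A minimal sketch, with hypothetical response probabilities:

```python
def sampling_independence_gap(p_a1b1, p_a1b2, p_a2b1, p_a2b2):
    """Deviation from sampling independence (Eq. 28) for one stimulus.

    The inputs are the four response probabilities for a single
    stimulus. Returns P(a1b1) - P(a1)P(b1); a value near zero is
    consistent with sampling independence.
    """
    p_a1 = p_a1b1 + p_a1b2      # report level 1 on A, either level of B
    p_b1 = p_a1b1 + p_a2b1      # report level 1 on B, either level of A
    return p_a1b1 - p_a1 * p_b1

# Equally likely responses satisfy sampling independence exactly
gap_indep = sampling_independence_gap(0.25, 0.25, 0.25, 0.25)   # 0.0
# A positive perceptual correlation inflates P(a1b1) above P(a1)P(b1)
gap_corr = sampling_independence_gap(0.40, 0.10, 0.10, 0.40)    # 0.15
```

In practice each such gap would be accompanied by a statistical test on the estimated proportions, as in the macro-analyses.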
The distribution to the right of Figure 2.4A has the same variances as the previous distribution, and, therefore, the same marginal response proportions for a2 and b2. However, in this case, the covariance is larger than zero and it is clear that P(a2b2) > 0.25. Perceptual independence can also be assessed through discriminability and criterion measures computed for one dimension conditioned on the perceived value on the other dimension. Figure 2.4B shows the perceptual distributions of two stimuli that share the same level of component B (i.e., B1) and have the same perceptual mean on
dimension X2. The decision bound perpendicular to X2 separates the perceptual plane into two regions: percepts falling in the upper region elicit an incorrect response on component B (i.e., a miss for B), whereas percepts falling in the lower region elicit a correct B response (i.e., a hit). The bottom of the figure shows the marginal distribution for each stimulus conditioned on whether B is a hit or a miss. When perceptual independence holds, as is the case for the stimulus to the left, these conditional distributions have the same mean. On the other hand, when perceptual independence does not hold, as is the case for the stimulus to the right, the conditional distributions have different means, which is reflected in different d′ and c values depending on whether there is a hit or a miss on B. If decisional separability holds, differences in the conditional d′ and c values are evidence of violations of perceptual independence (Kadlec & Townsend, 1992a, 1992b). Conditional d′ and c values can be computed from hit and false-alarm rates for two stimuli differing on one dimension, conditioned on the reported level of the second dimension. For example, for the pair A1B1 and A2B1, conditioned on a hit on B, the hit rate for A is P(a1b1 | A1B1) and the false-alarm rate is P(a1b1 | A2B1). Conditioned on a miss on B, the hit rate for A is P(a1b2 | A1B1) and the false-alarm rate is P(a1b2 | A2B1). These values are used as input to Eqs. 22–27 to reach a statistical conclusion. Note that if perceptual independence and decisional separability both hold, then the tests based on sampling independence and equal conditional d′ and c should lead to the same conclusion. If only one of these two tests holds and the other fails, this indicates a violation of decisional separability (Kadlec & Townsend, 1992a, 1992b).
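A sketch of the conditional d′ computation just described, following the hit and false-alarm definitions given in the text, with hypothetical response probabilities for the pair A1B1 and A2B1:

```python
from scipy.stats import norm

# Hypothetical response probabilities for the stimulus pair A1B1 and
# A2B1 (keys are responses a1b1, a1b2, a2b1, a2b2)
P_A1B1 = {"a1b1": 0.55, "a1b2": 0.10, "a2b1": 0.25, "a2b2": 0.10}
P_A2B1 = {"a1b1": 0.20, "a1b2": 0.05, "a2b1": 0.60, "a2b2": 0.15}

def conditional_dprime(b_outcome):
    """d' for component A conditioned on the reported level of B.

    Following the text: conditioned on a hit on B (report b1), the hit
    rate for A is P(a1b1 | A1B1) and the false-alarm rate is
    P(a1b1 | A2B1); conditioned on a miss (report b2), use a1b2
    instead, then apply Eq. 22.
    """
    key = "a1b1" if b_outcome == "hit" else "a1b2"
    H, F = P_A1B1[key], P_A2B1[key]
    return norm.ppf(H) - norm.ppf(F)

d_hit = conditional_dprime("hit")    # ~0.97 for these numbers
d_miss = conditional_dprime("miss")  # ~0.36; unequal values flag a
                                     # possible violation of perceptual
                                     # independence (if DS holds)
```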
An Empirical Example In this section we show with a concrete example how to analyze the data from an identification experiment using GRT. We will first analyze the data by fitting GRT models to the identification confusion matrix, and then we will conduct summary statistics analyses on the same data. Finally, we will compare the results from the two separate analyses. Imagine that you are a researcher interested in how the age and gender of faces interact during face recognition. You run an experiment in which subjects must identify four stimuli, the combination of two levels of age (teen and adult) and two levels of gender (male and female). Each stimulus is presented 250 times, for a total of 1,000
Fig. 2.4 Diagram explaining the relation between micro-analytic summary statistics and the concepts of perceptual independence and decisional separability. Panel A focuses on sampling independence and Panel B on conditional signal detection measures.
trials in the whole experiment. The data to be analyzed are summarized in the confusion matrix displayed in Table 2.1. These data were generated by random sampling from the model shown in Figure 2.5A. The advantage of generating artificial data from this model is that we know in advance what conclusions should be reached by our analyses. For example, note that decisional separability holds in the Figure 2.5A model. Also, because the distance between the "male" and "female" distributions is larger for "adult" than for "teen," gender is not perceptually separable from age. In contrast, the "adult" and "teen" marginal distributions are the same across levels of gender, so age is perceptually separable from gender. Finally, because all distributions show a positive correlation, perceptual independence is violated for all stimuli. A hierarchy of models was fit to the data in Table 2.1 using maximum likelihood estimation (as in Ashby et al., 2001; Thomas, 2001). Because there are only 12 degrees of freedom in the
data, some parameters were fixed for all models. Specifically, all variances were assumed to be equal to one and decisional separability was assumed for both dimensions. Figure 2.5C shows the hierarchy of models used for the analysis, together with the number of free parameters m for each of them. In this figure, PS stands for perceptual separability, PI for perceptual independence, DS for decisional separability, and 1_RHO denotes a model with a single correlation parameter for all distributions. Note that several other models could be tested, depending on specific research goals and hypotheses, or on the results from summary statistics analysis. The arrows in Figure 2.5C connect models that are nested within each other. The results of likelihood ratio tests comparing such nested models are displayed next to each arrow, with an asterisk representing a significantly better fit for the more general model (lower in the hierarchy) and n.s. representing a nonsignificant difference in fit. Starting at the top of the hierarchy, it
[Figure 2.5C depicts the model hierarchy as a lattice of model labels with their free-parameter counts: {PI, PS, DS} m = 4; {1_RHO, PS, DS} m = 5; {PI, PS(Gender), DS} m = 6; {PI, PS(Age), DS} m = 6; {1_RHO, PS(Age), DS} m = 7; {1_RHO, PS(Gender), DS} m = 7; {PI, DS} m = 8; {PS, DS} m = 8; {1_RHO, DS} m = 9; {PS(Age), DS} m = 10; {PS(Gender), DS} m = 10. Arrows connect nested models and are marked with an asterisk when the more general model fits significantly better, or n.s. when the difference in fit is nonsignificant.]
Fig. 2.5 Results of analyzing the data in Table 2.1 with GRT. Panel A shows the GRT model that was used to generate the data. Panel B shows the recovered model from the model fitting and selection process. Panel C shows the hierarchy of models used for the analysis and the number of free parameters (m) in each. PI stands for perceptual independence, PS for perceptual separability, DS for decisional separability and 1_RHO for a single correlation in all distributions.
Table 2.1. Data from a simulated identification experiment with four face stimuli, created by factorially combining two levels of gender (male and female) and two levels of age (teen and adult).

                              Response
Stimulus        Male/Teen  Female/Teen  Male/Adult  Female/Adult
Male/Teen          140          36           34            40
Female/Teen         89          91            4            66
Male/Adult          85           5           90            70
Female/Adult        20          59            8           163
elementary cognitive mechanisms
Table 2.2. Results of the summary statistics analysis for the simulated Gender × Age identification experiment.

Macroanalyses

Marginal response invariance
Test                                      Result                 Conclusion
Equal P(Gender=Male) across all Ages      z = −0.09, p > .1      Yes
Equal P(Gender=Female) across all Ages    z = −7.12, p < .001    No
Equal P(Age=Teen) across all Genders      z = −0.39, p > .1      Yes
Equal P(Age=Adult) across all Genders     z = −1.04, p > .1      Yes

Marginal d′
Test                                  d′ for level 1   d′ for level 2   Result                 Conclusion
Equal d′ for Gender across all Ages        0.84             1.74        z = −5.09, p < .001    No
Equal d′ for Age across all Genders        0.89             1.06        z = −1.01, p > .1      Yes

Marginal c
Test                                  Conclusion
Equal c for Gender across all Ages    No
Equal c for Age across all Genders    Yes

Microanalyses

Sampling independence (one test per stimulus and response combination, in table order)
Result                 Conclusion
z = 1.57, p > .1       Yes
z = −2.05, p < .05     No
z = −2.24, p < .05     No
z = 2.29, p < .05      No
z = 2.14, p < .05      No
z = −2.01, p < .05     No
z = −7.80, p < .001    No
z = −3.13, p < .01     No
z = 2.17, p < .05      No
z = −4.09, p < .001    No
z = 3.77, p < .001     No
z = 5.64, p < .001     No
z = 2.15, p < .05      No
z = −1.14, p > .1      Yes
z = −3.07, p < .01     No
z = −3.44, p < .001    No

Conditional d′
Test                                   d′|Hit   d′|Miss   Result                 Conclusion
Equal d′ for Gender when Age=Teen       0.84     1.48     z = −2.04, p < .05     No
Equal d′ for Gender when Age=Adult      1.83     2.26     z = −1.27, p > .1      Yes
Equal d′ for Age when Gender=Male       0.89     1.44     z = −1.80, p > .05     Yes
Equal d′ for Age when Gender=Female     0.83     1.15     z = −0.70, p > .1      Yes

Conditional c
Test                                   c|Hit    c|Miss    Result                 Conclusion
Equal c for Gender when Age=Teen      −0.014   −1.58      z = 6.03, p < .001     No
Equal c for Gender when Age=Adult     −1.68    −0.66      z = −4.50, p < .001    No
Equal c for Age when Gender=Male      −0.04    −1.50      z = 6.05, p < .001     No
Equal c for Age when Gender=Female    −0.63     0.57      z = −4.46, p < .001    No
is possible to find the best candidate models by following the arrows with an asterisk on them down the hierarchy. This leaves the following candidate models: {PS, DS}, {1_RHO, PS(Age), DS}, {1_RHO, DS}, and {PS(Gender), DS}. From this list, we eliminate {1_RHO, DS} because it does not fit significantly better than the more restricted model {1_RHO, PS(Age), DS}. We also eliminate {PS(Gender), DS} because it does not fit better than the more restricted model {PS, DS}. This leaves two candidate models that cannot be compared through a likelihood ratio test, because they are not nested: {PS, DS} and {1_RHO, PS(Age), DS}. To compare these two models, we can use the BIC or AIC goodness-of-fit measures introduced earlier. The smallest corrected AIC was found for the model {1_RHO, PS(Age), DS} (2,256.43, compared to 2,296.97 for its competitor). This leads to the conclusion that the model that fits these data best assumes perceptual separability of age from gender, violations of perceptual separability of gender from age, and violations of perceptual independence. This model is shown in Figure 2.5B, and it perfectly reproduces the most important features of the model that was used to generate the data. However, note that the quality of this fit depends strongly on the fact that the assumptions used for all the models in Figure 2.5C (decisional separability and all variances equal) are correct in the true model. This will not be the case in many applications, which is why it is always a good idea to complement the model-fitting results with an analysis of summary statistics. The results from the summary statistics analysis are shown in Table 2.2. The interested reader can directly compute all the values in this table from the data in the confusion matrix (Table 2.1). 
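The model-selection logic used above, likelihood ratio tests for nested models and AIC for non-nested competitors, can be sketched as follows (the log-likelihoods and parameter counts are made up for illustration):

```python
from scipy.stats import chi2

def likelihood_ratio_test(logL_restricted, logL_general, df_diff):
    """Likelihood ratio test for nested models: under the null that the
    restricted model is correct, G2 = 2(logL_general - logL_restricted)
    is approximately chi-square with df_diff degrees of freedom."""
    g2 = 2.0 * (logL_general - logL_restricted)
    return g2, chi2.sf(g2, df_diff)

def aic(logL, m):
    """Akaike information criterion: 2m - 2 logL (smaller is better)."""
    return 2.0 * m - 2.0 * logL

# Hypothetical fits: a restricted model with m = 5 parameters nested in
# a more general model with m = 7
g2, p = likelihood_ratio_test(-1130.0, -1124.0, df_diff=2)   # G2 = 12

# For non-nested competitors, compare AIC instead
fits = [(-1130.0, 5), (-1124.0, 7)]
best = min(fits, key=lambda f: aic(*f))
```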
The macro-analytic tests indicate violations of marginal response invariance, and unequal marginal d′ and c values for the gender dimension, both of which suggest that gender is not perceptually separable from age. These results are uninformative about decisional separability. Marginal response
invariance and equal marginal d′ and c values all hold for the age dimension, providing some weak evidence for perceptual and decisional separability of age from gender. The micro-analytic tests show violations of sampling independence for all stimuli, and conditional c values that are significantly different for all stimulus pairs, suggesting possible violations of perceptual independence and decisional separability. Note that if we assumed decisional separability, as we did to fit models to the data, the results of the micro-analytic tests would lead to the conclusion of failure of perceptual independence. Thus, the results of the model fitting and summary statistics analyses converge to similar conclusions, which is not uncommon for real applications of GRT. These conclusions turn out to be correct in our example, but note that several of them depend heavily on making correct assumptions about decisional separability and other features of the perceptual and decisional processes generating the observed data.
Extensions to Response Time There have been a number of extensions of GRT that allow the theory to account both for response accuracy and response time (RT). These have differed in the amount of extra theoretical structure that was added to the theory described earlier. One approach was to add the fewest and least controversial assumptions possible that would allow GRT to make RT predictions. The resulting model succeeds, but it offers no process interpretation of how a decision is reached on each trial. An alternative approach is to add enough theoretical structure to make RT predictions and to describe the perceptual and cognitive processes that generated that decision. We describe each of these approaches in turn.
The RT-Distance Hypothesis In standard univariate signal detection theory, the most common RT assumption is that RT
decreases with the distance between the perceptual effect and the response criterion (Bindra, Donderi, & Nishisato, 1968; Bindra, Williams, & Wise, 1965; Emmerich, Gray, Watson, & Tanis, 1972; Smith, 1968). The obvious multivariate analog of this, which is known as the RT-distance hypothesis, assumes that RT decreases with the distance between the percept and the decision bound. Considerable experimental support for the RT-distance hypothesis has been reported in categorization experiments in which there is only one decision bound and where more observability is possible (Ashby, Boynton, & Lee, 1994; Maddox, Ashby, & Gottlob, 1998). Efforts to incorporate the RT-distance hypothesis into GRT have been limited to two-choice experimental paradigms, such as categorization or speeded classification, which can be modeled with a single decision bound. The most general form of the RT-distance hypothesis makes no assumptions about the parametric form of the function that relates RT and distance to bound. The only assumption is that this function is monotonically decreasing. Specific functional forms are sometimes assumed. Perhaps the most common choice is to assume that RT decreases exponentially with distance to bound (Maddox & Ashby, 1996; Murdock, 1985). An advantage of assuming a specific functional form is that it allows direct fitting to empirical RT distributions (Maddox & Ashby, 1996). Even without any parametric assumptions, however, monotonicity by itself is enough to derive some strong results. For example, consider a filtering task with stimuli A1B1, A1B2, A2B1, and A2B2, and two perceptual dimensions X1 and X2, in which the subject's task on each trial is to name the level of component A. Let P_FA(RT_i ≤ t | A_iB_j) denote the probability that the RT is less than or equal to some value t on trials of a filtering task when the subject correctly classified the level of component A.
Given this, the RT analog of marginal response invariance, referred to as marginal RT invariance, can be defined as (Ashby & Maddox, 1994)

P_FA(RT_i ≤ t | A_iB_1) = P_FA(RT_i ≤ t | A_iB_2)   (29)

for i = 1 and 2 and for all t > 0. Now assume that the weak version of the RT-distance hypothesis holds (i.e., where no functional form for the RT-distance relationship is specified) and that decisional separability also holds. Then Ashby and Maddox (1994) showed that
perceptual separability holds if and only if marginal RT invariance holds for both correct and incorrect responses. Note that this is an if and only if result, which was not true for marginal response invariance. In particular, if decisional separability and marginal response invariance both hold, perceptual separability could still be violated. But if decisional separability, marginal RT invariance, and the RT-distance hypothesis all hold, then perceptual separability must be satisfied. The reason we get the stronger result with RTs is that marginal RT invariance requires that Eq. 29 holds for all values of t, whereas marginal response invariance only requires a single equality to hold. A similar strong result could be obtained with accuracy data if marginal response invariance were required to hold for all possible placements of the response criterion (i.e., the point where the vertical decision bound intersects the X 1 axis).
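Because marginal RT invariance must hold for all t, it is naturally tested by comparing entire RT distributions, for example with a two-sample Kolmogorov-Smirnov test. A sketch with simulated, hypothetical RT data:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated correct-response RTs (in seconds) for naming the level of
# component A, collected at the two levels of the irrelevant component B
rt_B1 = 0.2 + rng.gamma(shape=4.0, scale=0.12, size=400)
rt_B2_same = 0.2 + rng.gamma(shape=4.0, scale=0.12, size=400)
rt_B2_slow = 0.35 + rng.gamma(shape=4.0, scale=0.12, size=400)

# The KS test compares the two empirical distributions at every t, which
# matches the "for all t" requirement of Eq. 29
p_same = ks_2samp(rt_B1, rt_B2_same).pvalue   # typically nonsignificant
p_slow = ks_2samp(rt_B1, rt_B2_slow).pvalue   # a clear violation
```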
Process Models of RT At least three different process models have been proposed that account for both RT and accuracy within a GRT framework. Ashby (1989) proposed a stochastic interpretation of GRT that was instantiated in a discrete-time linear system. In effect, the model assumed that each stimulus component provides input into a set of parallel (and linear) mutually interacting perceptual channels. The channel outputs describe a point that moves through a multidimensional perceptual space during processing. With long exposure durations the percept settles into an equilibrium state, and under these conditions the model becomes equivalent to the static version of GRT. However, the model can also be used to make predictions in cases of short exposure durations and when the subject is operating under conditions of speed stress. In addition, this model makes it possible to relate properties like perceptual separability to network architecture. For example, a sufficient condition for perceptual separability to hold is that there is no crossing of the input lines and no crosstalk between channels. Townsend, Houpt, and Silbert (2012) considerably generalized the stochastic model proposed by Ashby (1989) by extending it to a broad class of parallel processing models. In particular, they considered (almost) any model in which processing on each stimulus dimension occurs in parallel and the stimulus is identified as soon as processing finishes on all dimensions. They began by extending definitions of key GRT concepts, such as perceptual
and decisional separability and perceptual independence, to this broad class of parallel models. Next, under the assumption that decisional separability holds, they developed many RT versions of the summary statistics tests considered earlier in this chapter. Ashby (2000) took a different approach. Rather than specify a processing architecture, he proposed that moment-by-moment fluctuations in the percept could be modeled via a continuous-time multivariate diffusion process. In two-choice tasks with one decision bound, a signed distance is computed to the decision bound at each point in time; that is, in one response region simple distance-to-bound is computed (which is always positive), but in the response region associated with the contrasting response the negative of distance to bound is computed. These values are then continuously integrated and this cumulative value drives a standard diffusion process with two absorbing barriers—one associated with each response. This stochastic version of GRT is more biologically plausible than the Ashby (1989) version (e.g., see Smith & Ratcliff, 2004) and it establishes links to the voluminous work on diffusion models of decision making.
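A toy simulation of this signed-distance idea, in the spirit of Ashby (2000) but with all parameter values invented for illustration: a noisy 2-D percept is sampled at each time step, its signed distance to a linear bound is accumulated, and the accumulated value drives a diffusion with two absorbing barriers.

```python
import numpy as np

rng = np.random.default_rng(1)

def grt_diffusion_trial(percept_mean, bound_normal, bound_offset,
                        barrier=1.0, dt=0.001, noise=1.0, max_t=5.0):
    """One trial of a GRT diffusion sketch.

    At each step a noisy 2-D percept is sampled, its signed distance to
    the linear bound (defined by a normal vector and offset) is
    computed, and the running integral of signed distance plus diffusion
    noise is checked against two absorbing barriers.
    Returns (response: +1, -1, or 0 for no decision; RT in seconds).
    """
    x, t = 0.0, 0.0
    sqrt_dt = np.sqrt(dt)
    while t < max_t:
        percept = percept_mean + rng.standard_normal(2) * noise
        signed_dist = np.dot(bound_normal, percept) - bound_offset
        x += signed_dist * dt + rng.standard_normal() * sqrt_dt
        t += dt
        if x >= barrier:
            return +1, t
        if x <= -barrier:
            return -1, t
    return 0, max_t   # no decision before the deadline

# Vertical bound at x1 = 0.5 (decisional separability on dimension 1);
# a stimulus whose mean percept lies right of the bound favors response +1
resp, rt = grt_diffusion_trial(percept_mean=np.array([1.5, 0.0]),
                               bound_normal=np.array([1.0, 0.0]),
                               bound_offset=0.5)
```

Percepts far from the bound contribute large signed distances, so they drive the accumulator to a barrier quickly, reproducing the RT-distance hypothesis as an emergent property.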
Neural Implementations of GRT
Of course, the perceptual and cognitive processes modeled by GRT are mediated by circuits in the brain. During the past decade or two, much has been learned about the architecture and functioning of these circuits. Perhaps most importantly, there is now overwhelming evidence that humans have multiple neuroanatomically and functionally distinct learning systems (Ashby & Maddox, 2005; Eichenbaum & Cohen, 2004; Squire, 1992). And most relevant to GRT, the evidence is good that the default decision strategy of one of these systems is decisional separability. The most complete description of two of the most important learning systems is arguably provided by the COVIS theory of category learning (Ashby, Alfonso-Reese, Turken, & Waldron, 1998; Ashby, Paul, & Maddox, 2011). COVIS assumes separate rule-based and procedural-learning categorization systems that compete for access to response production. The rule-based system uses executive attention and working memory to select and test simple verbalizable hypotheses about category membership. The procedural system gradually associates categorization responses with regions of perceptual space via reinforcement learning.
COVIS assumes that rule-based categorization is mediated by a broad neural network that includes the prefrontal cortex, anterior cingulate, head of the caudate nucleus, and the hippocampus, whereas the key structures in the procedural-learning system are the striatum and the premotor cortex. Virtually all decision rules that satisfy decisional separability are easily verbalized. In fact, COVIS assumes that the rule-based system is constrained to use rules that satisfy decisional separability (at least piecewise). In contrast, the COVIS procedural system has no such constraints. Instead, it tends to learn decision strategies that approximate the optimal bound. As we have seen, decisional separability is optimal only under some special, restrictive conditions. Thus, as a good first approximation, one can assume that decisional separability holds if subjects use their rule-based system, and that decisional separability is likely to fail if subjects use their procedural system. A large literature establishes conditions that favor one system over the other. Critical features include the nature of the optimal decision bound, the instructions given to the subjects, and the nature and timing of the feedback, to name just a few (e.g., Ashby & Maddox, 2005, 2010). For example, Ashby et al. (2001) fit the full GRT identification model to data from two experiments. In both, 9 similar stimuli were constructed by factorially combining 3 levels of the same 2 stimulus components. Thus, in stimulus space, the nine stimuli had the same 3 × 3 grid configuration in both experiments. In the first experiment, however, subjects were shown this configuration beforehand and the response keypad had the same 3 × 3 grid as the stimuli. In the second experiment, the subjects were not told that the stimuli fell into a grid. Instead, the 9 stimuli were randomly assigned responses from the first 9 letters of the alphabet.
In the first experiment, where subjects knew about the grid structure, the best-fitting GRT model assumed decisional separability on both stimulus dimensions. In the second experiment, where subjects lacked this knowledge, the decision bounds of the best-fitting GRT model violated decisional separability. Thus, one interpretation of these results is that the instructions biased subjects to use their rule-based system in the first experiment and their procedural system in the second experiment. As we have consistently seen throughout this chapter, decisional separability greatly simplifies applications of GRT to behavioral data. Thus, researchers who want to increase the probability
that their subjects use decision strategies that satisfy decisional separability should adopt experimental procedures that encourage subjects to use their rule-based learning system. For example, subjects should be told about the factorial nature of the stimuli, the response device should map onto this factorial structure in a natural way, working memory demands should be minimized (e.g., avoid dual tasking) to ensure that working memory capacity is available for explicit hypothesis testing (Waldron & Ashby, 2001), and the intertrial interval should be long enough so that subjects have sufficient time to process the meaning of the feedback (Maddox, Ashby, Ing, & Pickering, 2004).
Conclusions Multidimensional signal detection theory in general, and GRT in particular, make two fundamental assumptions, namely that every mental state is noisy and that every action requires a decision. When signal detection theory was first proposed, both of these assumptions were controversial. We now know, however, that every sensory, perceptual, or cognitive process must operate in the presence of inherent noise. There is inevitable noise in the stimulus (e.g., photon noise, variability in viewpoint) at the neural level and in secondary factors, such as attention and motivation. Furthermore, there is now overwhelming evidence that every volitional action requires a decision of some sort. In fact, these decisions are now being studied at the level of the single neuron (e.g., Shadlen & Newsome, 2001). Thus, multidimensional signal detection theory captures two fundamental features of almost all behaviors. Beyond these two assumptions, however, the theory is flexible enough to model a wide variety of decision processes and sensory and perceptual interactions. For these reasons, the popularity of multidimensional signal detection theory is likely to grow in the coming decades.
Acknowledgments

Preparation of this chapter was supported in part by Award Number P01NS044393 from the National Institute of Neurological Disorders and Stroke, by grant FA9550-12-1-0355 from the Air Force Office of Scientific Research, and by support from the U.S. Army Research Office through the Institute for Collaborative Biotechnologies under grant W911NF-07-1-0072.
Note

1. This is true except for the endpoints. Since these endpoint intervals have infinite width, the endpoints are set at the z-value that has equal area to the right and left in that interval (0.05 in Figure 2.2).
Glossary

Absorbing barriers: Barriers placed around a diffusion process that terminate the stochastic process upon first contact. In most cases there is one barrier for each response alternative.
Affine transformation: A transformation from an n-dimensional space to an m-dimensional space of the form y = Ax + b, where A is an m × n matrix and b is a vector.
Categorization experiment: An experiment in which the subject’s task is to assign the presented stimulus to the category to which it belongs. If there are n different stimuli, then a categorization experiment must include fewer than n separate response alternatives.
d′: A measure of discriminability from signal detection theory, defined as the standardized distance between the means of the signal and noise perceptual distributions (i.e., the mean difference divided by the common standard deviation).
Decision bound: The set of points separating regions of perceptual space associated with contrasting responses.
Diffusion process: A stochastic process that models the trajectory of a microscopic particle suspended in a liquid and subject to random displacement because of collisions with other molecules.
Euclidean space: The standard space taught in high-school geometry, constructed from orthogonal axes of real numbers. Frequently, the n-dimensional Euclidean space is denoted by ℝⁿ.
False alarm: Incorrectly reporting the presence of a signal when no signal was presented.
Hit: Correctly reporting the presence of a presented signal.
Identification experiment: An experiment in which the subject’s task is to identify each stimulus uniquely. Thus, if there are n different stimuli, then there must be n separate response alternatives. Typically, on each trial, one stimulus is selected randomly and presented to the subject. The subject’s task is to choose the response alternative that is uniquely associated with the presented stimulus.
Likelihood ratio: The ratio of the likelihoods associated with two possible outcomes. If the two trial types are equally likely, then accuracy is maximized when the subject gives one response if the likelihood ratio is greater than 1 and the other response if the likelihood ratio is less than 1.
Multidimensional scaling: A statistical technique in which objects or stimuli are situated in a multidimensional space in such a way that objects that are judged or perceived as similar are placed close together. In most approaches, each object is represented as a single point and the space is constructed from some type of proximity data collected on the to-be-scaled objects. A common choice is to collect similarity ratings on all possible stimulus pairs.
multidimensional signal detection theory
Nested mathematical models: Two mathematical models are nested if one is a special case of the other, in which the restricted model is obtained from the more general model by fixing one or more parameters to certain specific values.
Nonidentifiable models: The case where two seemingly different models make identical predictions.
Perceptual dimension: A range of perceived values of some psychologically primary component of a stimulus.
Procedural learning: Learning that improves incrementally with practice and requires immediate feedback after each response. Prototypical examples include the learning of athletic skills and learning to play a musical instrument.
Response bias: The tendency to favor one response alternative in the face of equivocal sensory information. When the frequencies of different trial types are equal, a response bias occurs in signal detection theory whenever the response criterion is set at any point for which the likelihood ratio is unequal to 1.
Response criterion: In signal detection theory, this is the point on the sensory dimension that separates percepts that elicit one response (e.g., Yes) from percepts that elicit the contrasting response (e.g., No).
Speeded classification: An experimental task in which the subject must quickly categorize the stimulus according to the level of a single stimulus dimension. A common example is the filtering task.
Statistical decision theory: The statistical theory of optimal decision-making.
Striatum: A major input structure within the basal ganglia that includes the caudate nucleus and the putamen.
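Several of these glossary terms (hit, false alarm, d′, response criterion, response bias) can be illustrated numerically. The sketch below uses the standard equal-variance signal detection estimators, d′ = z(H) − z(F) and c = −(z(H) + z(F))/2, where z is the inverse of the standard normal CDF; the function names are ours, not from the chapter.

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """d' = z(hit rate) - z(false-alarm rate): the standardized distance
    between the means of the signal and noise perceptual distributions."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

def criterion(hit_rate, fa_rate):
    """Response criterion c; c = 0 indicates no response bias."""
    z = NormalDist().inv_cdf
    return -0.5 * (z(hit_rate) + z(fa_rate))

# An unbiased observer with hit rate .84 and false-alarm rate .16:
dp = d_prime(0.84, 0.16)   # just under 2
c = criterion(0.84, 0.16)  # approximately 0 (no bias)
```

With symmetric hit and false-alarm rates the criterion sits at the likelihood-ratio value of 1, so c ≈ 0; any shift of the criterion away from that point would appear as a nonzero c, i.e., a response bias.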
References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723. Amazeen, E. L., & DaSilva, F. (2005). Psychophysical test for the independence of perception and action. Journal of Experimental Psychology: Human Perception and Performance, 31, 170. Ashby, F. G. (1989). Stochastic general recognition theory. In D. Vickers & P. L. Smith (Eds.), Human information processing: Measures, mechanisms and models (pp. 435–457). Amsterdam, Netherlands: Elsevier Science Publishers. Ashby, F. G. (1992). Multidimensional models of categorization. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 449–483). Hillsdale, NJ: Erlbaum. Ashby, F. G. (2000). A stochastic version of general recognition theory. Journal of Mathematical Psychology, 44, 310–329. Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological theory of multiple systems in category learning. Psychological Review, 105, 442–481. Ashby, F. G., Boynton, G., & Lee, W. W. (1994). Categorization response time with multidimensional stimuli. Perception & Psychophysics, 55, 11–27. Ashby, F. G., & Gott, R. E. (1988). Decision rules in the perception and categorization of multidimensional stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14, 33–53.
Ashby, F. G., & Lee, W. W. (1993). Perceptual variability as a fundamental axiom of perceptual science. In S.C. Masin (Ed.), Foundations of perceptual theory (pp. 369–399). Amsterdam, Netherlands: Elsevier Science. Ashby, F. G., & Maddox, W. T. (1994). A response time theory of separability and integrality in speeded classification. Journal of Mathematical Psychology, 38, 423–466. Ashby, F. G., & Maddox, W. T. (2005). Human category learning. Annual Review of Psychology, 56, 149–178. Ashby, F. G., & Maddox, W. T. (2010). Human category learning 2.0. Annals of the New York Academy of Sciences, 1224, 147–161. Ashby, F. G., Paul, E. J., & Maddox, W. T. (2011). COVIS. In E. M. Pothos & A. J. Wills (Eds.), Formal approaches in categorization (pp. 65–87). New York, NY: Cambridge University Press. Ashby, F. G., & Perrin, N. A. (1988). Toward a unified theory of similarity and recognition. Psychological Review, 95, 124–150. Ashby, F. G., & Townsend, J. T. (1986). Varieties of perceptual independence. Psychological Review, 93, 154–179. Ashby, F. G., Waldron, E. M., Lee, W. W., & Berkman, A. (2001). Suboptimality in human categorization and identification. Journal of Experimental Psychology: General, 130, 77–96. Banks, W. P. (2000). Recognition and source memory as multivariate decision processes. Psychological Science, 11, 267–273. Bindra, D., Donderi, D. C., & Nishisato, S. (1968). Decision latencies of “same” and “different” judgments. Perception & Psychophysics, 3(2), 121–136. Bindra, D., Williams, J. A., & Wise, J. S. (1965). Judgments of sameness and difference: Experiments on decision time. Science, 150, 1625–1627. Blaha, L., Silbert, N., & Townsend, J. (2011). A general recognition theory study of race adaptation. Journal of Vision, 11, 567–567. Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference understanding AIC and BIC in model selection. Sociological Methods & Research, 33, 261–304. Carroll, J. D., & Chang, J. J. (1970).
Analysis of individual differences in multidimensional scaling via an N-way generalization of “Eckart-Young” decomposition. Psychometrika, 35, 283–319. Cohen, D. J. (1997). Visual detection and perceptual independence: Assessing color and form. Attention, Perception, & Psychophysics, 59, 623–635. DeCarlo, L. T. (2003). Source monitoring and multivariate signal detection theory, with a model for selection. Journal of Mathematical Psychology, 47, 292–303. Demeyer, M., Zaenen, P., & Wagemans, J. (2007). Low-level correlations between object properties and viewpoint can cause viewpoint-dependent object recognition. Spatial Vision, 20, 79–106. Eichenbaum, H., & Cohen, N. J. (2004). From conditioning to conscious recollection: Memory systems of the brain (No. 35). New York, NY: Oxford University Press. Emmerich, D. S., Gray, C. S., Watson, C. S., & Tanis, D. C. (1972). Response latency, confidence and ROCs in auditory signal detection. Perception & Psychophysics, 11, 65–72.
Ennis, D. M., & Ashby, F. G. (2003). Fitting decision bound models to identification or categorization data. Unpublished manuscript. Available at http://www.psych.ucsb.edu/˜ashby/cholesky.pdf Farris, C., Viken, R. J., & Treat, T. A. (2010). Perceived association between diagnostic and non-diagnostic cues of women’s sexual interest: General Recognition Theory predictors of risk for sexual coercion. Journal of Mathematical Psychology, 54, 137–149. Giordano, B. L., Visell, Y., Yao, H. Y., Hayward, V., Cooperstock, J. R., & McAdams, S. (2012). Identification of walked-upon materials in auditory, kinesthetic, haptic, and audio-haptic conditions. Journal of the Acoustical Society of America, 131, 4002–4012. Kadlec, H. (1999). MSDA_2: Updated version of software for multidimensional signal detection analyses. Behavior Research Methods, 31, 384–385. Kadlec, H., & Townsend, J. T. (1992a). Implications of marginal and conditional detection parameters for the separabilities and independence of perceptual dimensions. Journal of Mathematical Psychology, 36, 325–374. Kadlec, H., & Townsend, J. T. (1992b). Signal detection analyses of multidimensional interactions. In F. G. Ashby (Ed.), Multidimensional models of perception and cognition (pp. 181–231). Hillsdale, NJ: Erlbaum. Louw, S., Kappers, A. M., & Koenderink, J. J. (2002). Haptic discrimination of stimuli varying in amplitude and width. Experimental Brain Research, 146, 32–37. Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology (Vol. 1, pp. 103–190). New York, NY: Wiley. Macmillan, N. A., & Creelman, D. (2005). Detection theory: A user’s guide (2nd ed.). Mahwah, NJ: Erlbaum. Maddox, W. T., & Ashby, F. G. (1993). Comparing decision bound and exemplar models of categorization. Perception & Psychophysics, 53, 49–70. Maddox, W. T., & Ashby, F. G. (1996).
Perceptual separability, decisional separability, and the identification–speeded classification relationship. Journal of Experimental Psychology: Human Perception & Performance, 22, 795–817. Maddox, W. T., Ashby, F. G., & Gottlob, L. R. (1998). Response time distributions in multidimensional perceptual categorization. Perception & Psychophysics, 60, 620–637. Maddox, W. T., Ashby, F. G., Ing, A. D., & Pickering, A. D. (2004). Disrupting feedback processing interferes with rule-based but not information-integration category learning. Memory & Cognition, 32, 582–591. Maddox, W. T., Ashby, F. G., & Waldron, E. M. (2002). Multiple attention systems in perceptual categorization. Memory & Cognition, 30, 325–339. Maddox, W. T., Glass, B. D., O’Brien, J. B., Filoteo, J. V., & Ashby, F. G. (2010). Category label and response location shifts in category learning. Psychological Research, 74, 219–236. Marascuilo, L. (1970). Extensions of the significance test for one-parameter signal detection hypotheses. Psychometrika, 35, 237–243. Menneer, T., Wenger, M., & Blaha, L. (2010). Inferential challenges for General Recognition Theory: Mean-shift
integrality and perceptual configurality. Journal of Vision, 10, 1211–1211. Murdock, B. B. (1985). An analysis of the strength-latency relationship. Memory & Cognition, 13, 511–521. Peterson, W. W., Birdsall, T. G., & Fox, W. C. (1954). The theory of signal detectability. Transactions of the IRE Professional Group on Information Theory, PGIT-4, 171–212. Rotello, C. M., Macmillan, N. A., & Reeder, J. A. (2004). Sum-difference theory of remembering and knowing: A two-dimensional signal-detection model. Psychological Review, 111, 588. Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86, 1916–1936. Shepard, R. N. (1957). Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22, 325–345. Silbert, N. H. (2012). Syllable structure and integration of voicing and manner of articulation information in labial consonant identification. Journal of the Acoustical Society of America, 131, 4076–4086. Silbert, N. H., & Thomas, R. D. (2013). Decisional separability, model identification, and statistical inference in the general recognition theory framework. Psychonomic Bulletin & Review, 20(1), 1–20. Silbert, N. H., Townsend, J. T., & Lentz, J. J. (2009). Independence and separability in the perception of complex nonspeech sounds. Attention, Perception, & Psychophysics, 71, 1900–1915. Smith, E. E. (1968). Choice reaction time: An analysis of the major theoretical positions. Psychological Bulletin, 69, 77–110. Smith, J. E. K. (1992). Alternative biased choice models. Mathematical Social Sciences, 23, 199–219. Smith, P. L., & Ratcliff, R. (2004). Psychology and neurobiology of simple decisions. Trends in Neurosciences, 27, 161–168. Soto, F. A., Musgrave, R., Vucovich, L., & Ashby, F. G. (in press).
General recognition theory with individual differences: A new method for examining perceptual and decisional interactions with an application to face perception. Psychonomic Bulletin & Review. Squire, L. R. (1992). Declarative and nondeclarative memory: Multiple brain systems supporting learning and memory. Journal of Cognitive Neuroscience, 4, 232–243. Swets, J. A., Tanner, W. P., Jr., & Birdsall, T. G. (1961). Decision processes in perception. Psychological Review, 68, 301–340. Tanner, W. P., Jr. (1956). Theory of recognition. Journal of the Acoustical Society of America, 30, 922–928. Thomas, R. D. (1999). Assessing sensitivity in a multidimensional space: Some problems and a definition of a general d′. Psychonomic Bulletin & Review, 6, 224–238. Thomas, R. D. (2001). Perceptual interactions of facial dimensions in speeded classification and identification. Perception & Psychophysics, 63, 625–650. Townsend, J. T., Aisbett, J., Assadi, A., & Busemeyer, J. (2006). General recognition theory and methodology for dimensional independence on simple cognitive
manifolds. In H. Colonius & E. N. Dzhafarov (Eds.), Measurement and representation of sensations: Recent progress in psychophysical theory (pp. 203–242). Mahwah, NJ: Erlbaum. Townsend, J. T., Houpt, J. W., & Silbert, N. H. (2012). General recognition theory extended to include response times: Predictions for a class of parallel systems. Journal of Mathematical Psychology, 56, 476–494. Townsend, J. T., & Spencer-Smith, J. B. (2004). Two kinds of global perceptual separability and curvature. In C. Kaernbach, E. Schröger, & H. Müller (Eds.), Psychophysics beyond sensation: Laws and invariants of human cognition (pp. 89–109). Mahwah, NJ: Erlbaum.
Waldron, E. M., & Ashby, F. G. (2001). The effects of concurrent task interference on category learning: Evidence for multiple category learning systems. Psychonomic Bulletin & Review, 8, 168–176. Wenger, M. J., & Ingvalson, E. M. (2002). A decisional component of holistic encoding. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 872–892. Wickens, T. D. (1992). Maximum-likelihood estimation of a multivariate Gaussian rating model with excluded data. Journal of Mathematical Psychology, 36, 213–234. Wickens, T. D. (2002). Elementary signal detection theory. New York, NY: Oxford University Press.
CHAPTER 3
Modeling Simple Decisions and Applications Using a Diffusion Model
Roger Ratcliff and Philip Smith
Abstract
The diffusion model is one of the major sequential-sampling models for two-choice decision-making and choice response time in psychology. The model conceives of decision-making as a process in which noisy evidence is accumulated until one of two response criteria is reached and the associated response is made. The criteria represent the amount of evidence needed to make each decision and reflect the decision maker’s response biases and speed-accuracy trade-off settings. In this chapter we examine the application of the diffusion model in a variety of different settings. We discuss the optimality of the model and review its applications to a number of cognitive tasks, including perception, memory, and language tasks. We also consider its applications to normal and special populations, to the cognitive foundations of individual differences, to value-based decisions, and its role in understanding the neural basis of decision-making.

Key Words: diffusion model, sequential-sampling, drift rate, choice, decision time
Diffusion Models for Rapid Decisions

Over the last 30 or 40 years, there has been a steady development of models for simple decision-making that deal with both the accuracy of decisions and the time taken to make them. The models assume that decisions are made by accumulating noisy information to decision criteria, one criterion for each possible choice. The models successfully account for the probability that each choice is made and the response time (RT) distributions for correct responses and errors. The models are highly constrained by the behavior of these dependent variables. The most frequent applications of these models have been to tasks that require two-choice decisions that are made reasonably quickly, typically with mean RTs less than 1.0–2.0 s. This is fast enough that one can assume that the decisions come from a single decision process and not from multiple, sequential processes (anything much slower and the single-process assumption would be suspect).
The models have been applied successfully to many different tasks including perceptual, numerical, and memory tasks with a variety of subject populations, including older adults, children, dyslexics, and adults undergoing sleep deprivation, reduced blood sugar, or alcohol intoxication. An important feature of human decision-making is that the processing system is very flexible because humans can switch tasks, stimulus dimensions, and output modalities very quickly, from one trial to the next. There are many different kinds of decisions that can be made about any stimulus. If the stimulus is a letter string, decisions can be made about whether it is a word or a nonword, whether it was studied earlier, whether the color is red or green, whether it is upper or lower case, and so on. Responses can be made in different modalities and in different ways in those modalities (for example, manually, vocally, or via eye movements). The same decision mechanism might operate for all these
tasks or the mechanism might be task and modality specific. For two-choice tasks, the assumption usually made is that all decision-related information, that is, all the information that comes from a stimulus or memory, is collapsed onto a single variable, called drift rate, that characterizes the discriminative or preference information in the stimulus. In some situations, subjects may be asked to make judgments based on more than one dimension that cannot be combined in this way. In such cases, the systems factorial methods of Townsend and colleagues (e.g., Townsend, 1972; see the review in Townsend & Wenger, 2004) may be used to determine whether processing on the different dimensions is serial or parallel, or some hybrid of the two. In this chapter, we focus on one model of the class of sequential sampling models of evidence accumulation, the diffusion model (Ratcliff, 1978; Ratcliff & McKoon, 2008; Smith, 2000). A comparison of the diffusion model with other sequential-sampling models, such as the Poisson counter model (Townsend & Ashby, 1983), the Vickers accumulator model (Smith & Vickers, 1988; Vickers, 1970), and the leaky competing accumulator model (Usher & McClelland, 2001), can be found in Ratcliff and Smith (2004). In the diffusion model, for a two-choice task, noisy evidence accumulates from a starting point (Figure 3.1) toward one of two decision criteria or boundaries, and the quality of the information that enters the decision process determines the rate of accumulation. Fitting the model to data provides estimates of drift rates, decision boundaries, and a parameter representing the duration of nondecision processes. The model’s ability to separate these components is one of its key contributions and places major constraints on its ability to explain data. Stimulus difficulty affects drift rate but not the criteria, and to a good approximation, speed-accuracy shifts are represented in the criteria, not drift rate.
If difficulty varies, changes in drift rate alone must accommodate all the changes in performance, namely accuracy and the changes in the spreads and locations of the correct and error RT distributions. Likewise, changes in the criteria affect all the aspects of performance. In these ways, the model is tightly constrained by data. In a perceptual task, drift rate depends on the quality of the perceptual information from a stimulus; in a memory task, it depends on the quality of the match between a test item and memory. In
a brightness discrimination task, for example, if the accumulated evidence reaches the top boundary, a “bright” response is executed and a “dark” response would then correspond to the bottom boundary. Figure 3.1 shows an example, using a brightness discrimination task. Evidence accumulates from a stimulus to the “bright” boundary or to the “dark” boundary. The solid arrow shows the drift rate for a bright stimulus, the dashed arrow shows the drift rate for a less bright stimulus, and the dotted arrow shows the drift rate for a dark stimulus. The three paths in Figure 3.1 show three different outcomes, all with the same drift rate. Noise in the accumulation process produces errors when the accumulated evidence reaches the incorrect boundary and it produces variable RTs that form a distribution of RTs that has the shape of empirically obtained distributions. In the figure, one path leads to a fast correct decision, one to a slow correct decision, and one to an error. Most responses are reasonably fast, but there are slower ones that spread out the right-hand tails of the distributions (as in the distribution at the top of Figure 3.1). As drift rate changes from a large value to near zero, the mean of the RT distribution for both correct and error responses increases because the tail of the RT distribution spreads out. Figure 3.2 shows simulated individual RTs from the model as a function of drift rate, which is assumed to vary from trial to trial. The shortest RTs change little with drift rate, and so a fast response says nothing about the difficulty of the trial. The probability of obtaining a slow response from a high drift rate is very small (e.g., Figure 3.2) and so conditions with the slowest responses come from lower drift rates (see Ratcliff, Philiastides, & Sajda, 2009). Figure 3.1 shows the accumulation-of-evidence process. Besides this, there are processes that
Fig. 3.1 The diffusion decision model with three simulated paths and three different drift rates. (In the figure, the vertical axis is labeled Quality of Evidence from Perception or Memory, with the top boundary labeled Bright and the bottom boundary labeled Dark.)
Fig. 3.2 Plots of individual trial RTs (600–1400 ms) as a function of drift rate for the trial (0.0–0.8); the correlation between RT and drift rate is r = –0.336. The parameters of the diffusion model were: boundary separation a = 0.107, starting point z = 0.048, duration of processes other than the decision process Ter = 0.48 s, SD in drift rate across trials η = 0.20, range in starting points sz = 0.02, range in nondecision time st = 0.18 s, drift rate v = 0.3.
encode stimuli, access memory, transform stimulus information into a decision-related variable that determines drift rate, and execute responses. These components of processing are combined into one “nondecision” component in the model, which has mean Ter. The total processing time for a decision is the sum of the time taken by the decision process and the time taken by the nondecision component. The boundaries of the decision process can be manipulated by instructions (“respond as quickly as possible” or “respond as accurately as possible”), differential rewards for the two choices, and the relative frequencies with which the two stimuli are presented in the experiment. Changes in instructions, rewards, or biases affect both RTs and accuracy, but in the model, to a good approximation, the effects on RTs and accuracy are due to shifts in boundary settings alone, not drift rates or nondecision time. (However, if subjects are pushed very hard to go fast, then nondecision time and drift rates can be lower; e.g., Starns, Ratcliff, & McKoon, 2012.) The left panel of Figure 3.3 shows boundaries moving in for speed relative to accuracy instructions, and the right panel shows how subjects can be biased toward the top response versus the bottom response by moving decision criteria from the dashed-line to the solid-line settings. It is also possible (Figure 3.3, right panel) to adjust the zero point of drift rate (the drift rate criterion) to accommodate biases between the two responses (see Leite & Ratcliff, 2011; Ratcliff, 1985; Ratcliff & McKoon, 2008, Figure 3.3).
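The accumulation process described above is straightforward to sketch in simulation. The following is a minimal, illustrative Euler discretization of a single diffusion-model trial, not the authors' fitting code; the parameter values are loosely patterned on those listed for Figure 3.2 (a = 0.107, z = a/2, s = 0.1, a conventional scaling value), and the function name is ours.

```python
import random

def simulate_diffusion(v, a, z, ter, s=0.1, dt=0.001, rng=random):
    """Simulate one trial of the two-boundary diffusion model.

    Evidence starts at z and accumulates with drift v and within-trial
    noise s until it crosses a (top boundary) or 0 (bottom boundary).
    Returns (choice, rt), where rt adds the nondecision time ter to
    the decision time.
    """
    x = z
    t = 0.0
    step_sd = s * dt ** 0.5      # SD of each Euler step
    while 0.0 < x < a:
        x += v * dt + rng.gauss(0.0, step_sd)
        t += dt
    choice = "top" if x >= a else "bottom"
    return choice, t + ter

random.seed(1)
trials = [simulate_diffusion(v=0.3, a=0.107, z=0.0535, ter=0.45)
          for _ in range(2000)]
p_top = sum(c == "top" for c, _ in trials) / len(trials)
mean_rt = sum(rt for _, rt in trials) / len(trials)
```

Averaging many such trials approximates the choice probabilities and RT distributions that the closed-form expressions later in the chapter deliver exactly; the simulation is useful mainly for building intuition about how noise in accumulation produces errors and skewed RT distributions.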
A problem with early random walk models, which were precursors to the diffusion model, was that they predicted equal correct and error RT distributions if the drift rates for two stimuli were equal in magnitude but opposite in sign (Laming, 1968; Stone, 1960; but see Link & Heath, 1975). This prediction is also made by the diffusion model in the absence of across-trial variability in model parameters. In fact, the patterns of the relative speed of correct versus error responses are as follows: with accuracy instructions and/or difficult tasks, errors are slower than correct responses, and with speed instructions and/or easy tasks, errors are faster than correct responses (Luce, 1986). In the diffusion model, the observed patterns of correct versus error RTs fall out naturally because there is trial-to-trial variability in drift rate and starting point (e.g., Ratcliff, 1981). Figure 3.4 illustrates how this mixing works with just two drift rates or two starting points instead of their full distributions. In Figure 3.4 left panel, the v1 drift rate produces high accuracy and fast responses, the v2 one lower accuracy and slow responses. The mixture of these produces errors slower than correct responses because 5% of the 400 ms process averaged with 20% of the 600 ms process gives a weighted mean of 560 ms, which is slower than the weighted mean for correct responses (491 ms). In Figure 3.4, right panel, the distributions to the left are for processes that start near the correct boundary (the dotted arrow shows the distance the process has to go to make an error—the larger the distance, the slower the response) and the distributions to the right are for processes that start further away from the correct boundary. Processes that start near to the correct boundary have few errors and those errors are slow, whereas processes that start further away have more errors and the errors are fast, leading to errors faster than correct responses. 
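The mixture arithmetic in this illustration can be checked directly. A short sketch (our own, using the probabilities and RTs from Figure 3.4):

```python
def weighted_mean_rt(probs, rts):
    """Probability-weighted mean RT across mixture components."""
    return sum(p * t for p, t in zip(probs, rts)) / sum(probs)

# Left panel: drift v1 gives P(correct) = .95 at 400 ms, drift v2 gives
# P(correct) = .80 at 600 ms; the error probabilities are .05 and .20.
correct_left = weighted_mean_rt([0.95, 0.80], [400, 600])  # ~491 ms
error_left = weighted_mean_rt([0.05, 0.20], [400, 600])    # 560 ms

# Right panel: starting points near and far from the correct boundary,
# with correct RTs 350 and 450 ms and error probabilities .20 and .05.
correct_right = weighted_mean_rt([0.95, 0.80], [350, 450])  # ~396 ms
error_right = weighted_mean_rt([0.20, 0.05], [350, 450])    # 370 ms
```

The weighted means reproduce the values in the text: drift-rate variability yields errors (560 ms) slower than correct responses (491 ms), whereas starting-point variability yields errors (370 ms) faster than correct responses (396 ms).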
In practice, drift rate is assumed to be normally distributed from trial to trial and the starting point is uniformly distributed, but these specific functional forms are not critical (Ratcliff, 2013). Some researchers have argued that across-trial variability in the parameters is not needed (Palmer, Huk, & Shadlen, 2005; Usher & McClelland, 2001). However, it is unreasonable to assume that subjects can set their processing components to identical values on every equivalent trial of an experiment (i.e., ones with the same stimulus value). For drift rates, across-trial variability in drift rate is exactly analogous to variability in stimulus or memory strength in signal detection theory. Later
Fig. 3.3 In the left panel, boundary separation alone changes between speed and accuracy instructions. In the right panel, the starting point varies with bias: bias toward the top boundary (dashed) changes to bias toward the bottom boundary (solid).
Fig. 3.4 Variability in drift rate and starting point and the effects on speed and accuracy. The left panel shows two processes with drift rates v1 and v2 and the starting point halfway between the boundaries (z = a/2), with correct and error RTs of 400 ms for v1 and of 600 ms for v2. Averaging these two illustrates the effects of variability in drift rate across trials and in the illustration yields error responses slower than correct responses (weighted mean RT 560 ms for errors vs. 491 ms for correct responses). The right panel shows processes with two starting points and drift rate v. Averaging processes with starting point 0.5a + 0.5 (high accuracy and short RTs) and starting point 0.5a − 0.5 (lower accuracy and short RTs) yields error responses faster than correct responses (weighted mean RT 370 ms for errors vs. 396 ms for correct responses).
we describe an EEG study of perceptual decision-making that provides independent evidence for across-trial variability in drift rate and mention another that provides evidence for variability in starting point. It is important to understand that the diffusion model is highly falsifiable, not by mean RTs and accuracy values but by RT distributions. If empirical distributions are not right-skewed, and do not shift and spread in exactly the right ways across experimental conditions, the model is falsified. Ratcliff (2002) generated sets of data with RT distributions that are plausible but never obtained in real experiments. For one set, the shapes and locations of the RT distributions were changed as a function of task difficulty, and for the other, the shapes and locations were changed as a function of speed versus accuracy instructions. For none of the resulting distributions was the model able to fit the data. In addition, the distributional predictions of the model are tested every time it is fit to empirical data.
expressions for accuracy and rt distributions
For a two-boundary diffusion process with no across-trial variability in any of the parameters, the equation for accuracy, the proportion of responses terminating at the boundary at zero, is given by

P(v, a, z) = \frac{e^{-2va/s^2} - e^{-2vz/s^2}}{e^{-2va/s^2} - 1} \qquad (1)

(or 1 − z/a if drift is zero), and the cumulative distribution of finishing times at the same boundary is given by

G(t, v, a, z) = P(v, a, z) - \frac{\pi s^2}{a^2}\, e^{-vz/s^2} \sum_{k=1}^{\infty} \frac{2k \sin(k\pi z/a)}{\frac{v^2}{s^2} + \frac{k^2 \pi^2 s^2}{a^2}}\, e^{-\frac{1}{2}\left(\frac{v^2}{s^2} + \frac{k^2 \pi^2 s^2}{a^2}\right)t} \qquad (2)

where a is boundary separation (the top boundary is at a, the bottom boundary is at 0 and the
distribution of finishing times is the distribution at the bottom boundary), z is the starting point, v is drift rate, and s is the SD of the normal distribution of within-trial variability (the square root of the diffusion coefficient). These expressions can be derived as a solution of the partial differential equation for the first-passage-time probability for the diffusion process (Feller, 1968). The results are described in detail in Ratcliff (1978) and Ratcliff and Smith (2004). Because Equation 2 contains an infinite sum, values of the RT density function need to be computed numerically. The series needs to be summed until it converges; this means that terms have to be added until subsequent terms become so small that they do not affect the total. This is complicated by the sine term, which can make one term in the sum small while the next is not. To deal with this in practice, it is necessary to require that two or three successive terms are very small. The predictions from the model are obtained by integrating the results from Equations 1 and 2 over the distributions of the model's across-trial variability parameters using numerical integration. In the standard model, drift rate is normally distributed across trials with SD η, the starting point is uniformly distributed with range sz, and nondecision time is uniformly distributed with range st. The predicted values are "exact" numerical predictions in the sense that they can be made as accurate as necessary (e.g., to 0.1 ms or better) by using more terms in the infinite sum and more steps in the numerical integrations (packages that perform fitting are mentioned later).

Alternative computational methods for obtaining predictions for diffusion models have been described by Smith (2000) and Diederich and Busemeyer (2003). The approach described by Smith uses integral equation methods derived from renewal theory.
It was originally developed in mathematical biology to model the firing rates of integrate-and-fire neurons (Buonocore, Giorno, Nobile, & Ricciardi, 1990). The method is more computationally intensive than the infinite series approach of Equation 2, but has the advantage that it can be applied to processes in which the drift rates or decision criteria change over time or in which the accumulated information decays during the course of a trial. Smith (1995) and Smith and Ratcliff (2009) have proposed models in which drift rates depend on the outputs of visual and memory processes that change during a trial.
They obtained predictions for these models using the integral equation method. Diederich and Busemeyer (2003) proposed a matrix method for obtaining predictions for diffusion models. In their approach, a continuous-time, continuous-state diffusion process is approximated by a discrete-time, discrete-state birth-death process. The probability that the process takes a step up or down at each time point is characterized by a transition matrix whose entries express the rules by which the process evolves over time. By approximating the process in this way, the problem of obtaining RT distributions and response probabilities can be reduced to one of repeated matrix multiplication. This solution can be expensive computationally, but can be made more efficient by solving the associated algebraic eigenvalue problem, avoiding the need for repeated matrix multiplication. The method can also be applied to more complex problems that cannot be solved using the method of Equation 2 and has the advantage that it is very robust computationally.

In some situations, it is important to generate predictions by simulation because simulated data can show the effects of all the sources of variability on a subject's RTs and accuracy. The number of simulated observations can be increased sufficiently that the data approach the predictions that would be determined exactly from the numerical method. The update of evidence, x, on each time step \Delta t during the decision process is determined by the drift rate, v, plus a noise term (a Gaussian random variable, \varepsilon_i, with SD \sigma) that represents variability in processing:

\Delta x_i = v\,\Delta t + \sigma\,\varepsilon_i \sqrt{\Delta t} \qquad (3)
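Equation 3 translates directly into a simulation loop. A minimal sketch follows (parameter values are illustrative; the within-trial SD is written s, matching the notation of Equations 1 and 2, and the loop uses the simple Euler update rather than the more efficient random-walk or rejection methods discussed below):

```python
import random

def simulate_trial(v, a, z, s=0.1, dt=0.001, t_max=10.0, rng=None):
    """One trial of the diffusion process via the Euler update of Eq. 3:
    dx = v*dt + s*eps*sqrt(dt), with eps a standard normal deviate.
    Returns the boundary reached and the decision time (trials surviving
    to t_max are vanishingly rare for these parameter ranges)."""
    rng = rng or random.Random()
    x, t, sqdt = z, 0.0, dt ** 0.5
    while 0.0 < x < a and t < t_max:
        x += v * dt + s * rng.gauss(0.0, 1.0) * sqdt
        t += dt
    return ("upper" if x >= a else "lower"), t
```

Across-trial variability is added by drawing v from a normal distribution with SD η and z from a uniform distribution with range sz at the start of each simulated trial.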
This equation provides the most straightforward method of simulating the diffusion process, but it is not the most efficient. Tuerlinckx, Maris, Ratcliff, & De Boeck (2001) examined four methods for simulating diffusion processes and found that a random walk approximation is better than using Equation 3. They also showed that a “rejection” method is even more efficient. However, if the process is nonstationary and complicated (e.g., with time varying drift rate, or boundaries that have some functional form) or there are several diffusion processes running to model multiple choice tasks, simulation is the simplest way to produce predictions, and the random walk approximation is likely the most efficient.
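As a concrete illustration of Equations 1 and 2, a minimal Python sketch that sums the series with the requirement that several successive terms be negligible (fitting packages implement optimized versions of this computation, including the integration over across-trial variability):

```python
import math

def p_lower(v, a, z, s=0.1):
    """Equation 1: probability of terminating at the lower boundary (0)."""
    if v == 0.0:
        return 1.0 - z / a
    e = math.exp
    return (e(-2*v*a/s**2) - e(-2*v*z/s**2)) / (e(-2*v*a/s**2) - 1.0)

def g_lower(t, v, a, z, s=0.1, tiny=1e-10, needed=3, k_max=100000):
    """Equation 2: cumulative distribution of finishing times at the lower
    boundary.  Because the sine factor can make an isolated term small,
    summation stops only after `needed` successive negligible terms."""
    total, small, k = 0.0, 0, 1
    while small < needed and k < k_max:
        lam = v**2 / s**2 + k**2 * math.pi**2 * s**2 / a**2
        term = 2.0 * k * math.sin(k * math.pi * z / a) / lam * math.exp(-0.5 * lam * t)
        total += term
        small = small + 1 if abs(term) < tiny else 0
        k += 1
    return p_lower(v, a, z, s) - (math.pi * s**2 / a**2) * math.exp(-v * z / s**2) * total
```

With the conventional scaling s = 0.1, a drift rate of 0.2 and boundaries at 0 and 0.1 with an unbiased starting point give an error rate of about 0.12, and g_lower converges to that value as t grows.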
modeling simple decisions and applications
In fitting the diffusion model to data, accuracy and RT distributions for correct and error responses for all the conditions of the experiment must be fit simultaneously, with the values of all of the components of processing estimated at the same time. One commonly used fitting method uses quantiles of the RT distributions for correct and error responses for each condition (the 0.1, 0.3, 0.5, 0.7, and 0.9 quantile RTs). The model predicts the cumulative probability of a response at each RT quantile. Subtracting the cumulative probabilities for each successive quantile from the next higher quantile gives the proportion of responses between adjacent quantiles. For a chi-square computation, these are the expected values, to be compared to the observed proportions of responses between the quantiles (i.e., the proportions between 0.1, 0.3, 0.5, 0.7, and 0.9 are each 0.2, and the proportions below 0.1 and above 0.9 are both 0.1) multiplied by the number of observations. Summing (Observed − Expected)²/Expected over correct and error responses for each condition gives a single chi-square value that is minimized with a general SIMPLEX minimization routine. The parameter values for the model are adjusted by SIMPLEX until the minimum chi-square value is obtained. In any data set, there is the potential problem of outlier RTs, which could be fast (e.g., fast guesses) or slow (e.g., inattention). The quantile-based method provides a good compromise that reduces the influence of outliers because the proportion of responses between the quantiles is used and extreme RTs within the bins have no influence on fitting. To deal further with outliers, a model of such processes is used in some model-fitting approaches, so that the data are assumed to be a mixture of diffusion processes plus a small proportion of outliers. For details of the fitting methods for the standard diffusion model and modeling outliers, see Ratcliff and Tuerlinckx (2002).
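The quantile-based chi-square computation can be sketched as follows. This is a simplified illustration that normalizes within a single response category; the full method (Ratcliff & Tuerlinckx, 2002) works with defective distributions over both responses and all conditions jointly:

```python
def quantile_chi_square(rts, model_cdf, quantiles=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Quantile-based chi-square for one response category of one condition.
    rts: observed RTs for this response; model_cdf(t): the model's predicted
    CDF for this response, conditional on the response being made."""
    n = len(rts)
    srt = sorted(rts)
    qrt = [srt[int(q * n)] for q in quantiles]       # empirical quantile RTs
    cum = [model_cdf(t) for t in qrt]                # model mass below each
    expected = ([cum[0]] +
                [cum[i] - cum[i - 1] for i in range(1, len(cum))] +
                [1.0 - cum[-1]])
    observed = [0.1, 0.2, 0.2, 0.2, 0.2, 0.1]        # by construction
    return sum((o * n - e * n) ** 2 / max(e * n, 1e-10)
               for o, e in zip(observed, expected))
```

When the model CDF matches the data, the statistic is near zero; a mismatched CDF inflates it, which is what SIMPLEX exploits when adjusting parameters.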
New methods for fitting the diffusion model have been developed recently and, over the last 6 or 7 years, fitting packages have been made available by Vandekerckhove and Tuerlinckx (2007) and Voss and Voss (2007). Also, Bayesian methods have been developed (Vandekerckhove, Tuerlinckx, & Lee, 2011) and a Bayesian package by Wiecki, Sofer, and Frank (2013) has been made available. These Bayesian methods also implement hierarchical modeling schemes, in which model parameters for individual subjects are assumed to be random samples from population distributions that are specified within the model. The means and
variances of the population distributions, which are estimated in fitting, determine a range of probable values of drift rates and decision boundaries for individual subjects. Because all subjects are fit simultaneously using these methods, the parameters are constrained by the group parameters, especially with low numbers of observations. The application of these hierarchical methods is in its infancy, and applications with large numbers of subjects, both simulated and real, that show their benefit over and above the more traditional methods are needed. To show how well the diffusion model fits data, we plot RT quantiles against the proportions for which the two responses are made. The top panel of Figure 3.5 shows a histogram for an RT distribution. The 0.1–0.9 quantile RTs and the 0.005 and 0.995 quantiles are shown on the x-axis. The rectangles represent equal areas of 0.2 probability mass between the 0.1–0.3, 0.3–0.5, etc. quantile RTs (and as can be seen, these represent the histogram reasonably well). These quantiles can be used to construct a quantile-probability plot by plotting the 0.1–0.9 quantile RTs vertically, as in the second panel of Figure 3.5, against the response proportion of that condition on the x-axis. Usually, correct responses are on the right of 0.5 and errors to the left (if there is no bias toward one or the other of the responses). Example RT distributions constructed from the equal-area rectangles are also shown in grey. When there is a bias in starting point or when the two response categories are not symmetric (as in lexical decision and memory experiments), two quantile-probability plots are needed, one for each response category. With quantile-probability plots, changes in RT distribution locations and spread as a function of response proportion can be seen easily and compared with model fits.
In the bottom panel of Figure 3.5, the 1–5 symbols are the data and the solid lines are the predictions from fits of the model to the data (with circles denoting the exact location of the predictions). As can be seen in this example, as response proportion changes from about 0.6 to near 1.0, the 0.1 quantile (leading edge) changes little, but the 0.9 quantile changes by as much as 400 ms. This is in line with the model predictions (e.g., Fig. 3.2). Also, as can be seen, error responses are slower than correct responses mainly in the spread, not in the leading edge location. Thus, quantile-probability plots allow all the important aspects of the data to be read from a single plot.
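Assembling the points of a quantile-probability plot from raw data involves only sorting and indexing; a minimal sketch (the cutoff of 10 responses per category is an illustrative choice):

```python
def quantile_probability_points(conditions, quantiles=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Build (response proportion, quantile RT, condition) triples for a
    quantile-probability plot.  `conditions` maps a condition label to a
    (correct_rts, error_rts) pair; categories with very few responses are
    skipped because their quantiles are unreliable."""
    points = []
    for label, (correct, error) in conditions.items():
        n = len(correct) + len(error)
        for rts in (correct, error):
            if len(rts) < 10:
                continue
            p = len(rts) / n
            srt = sorted(rts)
            for q in quantiles:
                points.append((p, srt[int(q * len(srt))], label))
    return points
```

Plotting quantile RT against response proportion for each triple reproduces the stacked-quantile layout of the middle and bottom panels of Figure 3.5.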
elementary cognitive mechanisms
Variants of the Standard Two-Choice Task
Up to this point, we have discussed how the diffusion model explains the results of experiments in which subjects respond with one of the two choices in their own time. The model has also been successfully applied to paradigms in which decision time is manipulated. Here we discuss three of these.
[Figure 3.5 appears here. The bottom panel plots data from a brightness discrimination task with dynamic pixel noise (Ratcliff & Smith, 2010, Experiment 2); see the caption below.]
Fig. 3.5 The top panel shows an RT distribution overlaid with 0.1, 0.3, 0.5, 0.7, and 0.9 quantiles, where the area outside the .1 quantile ranges from 0.005 to 0.1 and the area outside the .9 quantile ranges from 0.9 to 0.995. The areas between each pair of middle quantiles are 0.2 and the areas below 0.1 and above 0.9 are 0.095. The quantile rectangles capture the main features of the RT distribution and therefore provide a reasonable summary of overall distribution shape. The middle panel shows quantile RTs for the 0.1, 0.3, 0.5 (median), 0.7, and 0.9 quantiles (stacked vertically) plotted against response proportion for each of the six conditions. Correct responses are plotted to the right, and error responses to the left. The bottom panel shows a quantile-probability function from Ratcliff and Smith (2010, Experiment 2) with the numbers representing data and the lines representing predictions.
response signal and deadline tasks
For response signal and deadline tasks, a signal is presented after the stimulus and a subject is required to respond as quickly as possible (in, say, 200–300 ms). For a deadline paradigm, the time between the stimulus and the signal is fixed across trials. For a response signal paradigm, the time varies from trial to trial (Reed, 1973; Schouten & Bekker, 1967; Wickelgren, 1977; Wickelgren, Corbett, & Dosher, 1980). With the deadline paradigm, subjects can adopt different strategies or criteria for different deadlines. This is not the case for the response signal paradigm, in which processing can be assumed to be the same up to the signal. To apply the diffusion model to response signal data, Ratcliff (1988, 2006) assumed that there are response criteria just as for the standard two-choice task, and at some signal lag, responses come from a mixture of processes, those that have terminated at one or the other of the boundaries and those that have not. This is in accord with subjects' intuitions that, at the long lags, the decision has already been made, the response has been chosen, and the subject is simply waiting for the signal. As the time between stimulus and signal decreases, a larger and larger proportion of processes will have failed to terminate. Differences among experimental conditions of different difficulties appear as differences in the proportions of accumulated information at the different lags. At the longest lags (2 or more seconds), all or almost all processes will have terminated. For nonterminated processes, there are two possibilities: that decisions are made on the basis of the partial information that has already been accumulated (Figure 3.6, top panel) or that they are simply guesses (Figure 3.6, middle panel). Ratcliff (2006) tested between these possibilities with a numerosity discrimination experiment (subjects decide whether the number of asterisks displayed on a PC monitor is greater than or less than 50).
The same subjects participated in the response signal task and the standard task and examples of the response signal data and model fits are shown in Figure 3.7. When the model
Fig. 3.6 The top two panels show two models for how the diffusion model accounts for response signal data. In the top panel, the proportion of "large" responses at time T1 is the sum of processes that have terminated at the "large" boundary (the black area above the boundary) and nonterminated processes (the black area still within the diffusion process), i.e., partial information. The middle panel shows the same assumption as the top panel except that if a process has not terminated, a guess is used instead of partial information. The bottom panel shows heat maps of simulated paths for the diffusion model. White corresponds to high path density and black to low path density. For the diffusion model, the distribution to the right corresponds to the asymptotic distribution of path positions after about 0.2 seconds (i.e., the vertically oriented distributions in the top panel).
was fit to the two sets of data simultaneously, it fit well and it fit equally well for the two possibilities for nonterminated processes. In other words, "guessing" and partial information models could not be discriminated.
meyer, irwin, osman, & kounios (1988) partial information paradigm
This paradigm used a variant of the response signal task in which, on each trial, subjects responded in the regular way unless a signal to respond
Fig. 3.7 Plots of response proportion as a function of response signal lag from a numerosity discrimination experiment (Ratcliff, 2006) for four subjects. The task required subjects to judge whether the number of dots in a 10x10 array was greater than 50 or less than or equal to 50. The digits 1–8 (in reverse order) and the eight lines represent eight groupings of numbers of dots (e.g., 13–20, 21–30, 31–40, 41–50, 51–60, 61–70, 71–80, and 81–87 dots).
occurred, in which case they were to respond immediately. Thus, any trial could be a signal trial or a regular trial. Meyer et al. developed a method based on a race model that decomposes accuracy on the signal trials (at each signal lag) into a component from fast-finishing regular trials and a component based on partial information. The predictions from the diffusion model matched those from Meyer et al. (1988). Results showed that partial information in some tasks (see also Kounios, Osman, & Meyer, 1987) grew quickly and leveled off at about one-third the accuracy level of regular processes. Ratcliff (1988) examined the predictions of the diffusion model with the assumption that decisions on signal trials were a mixture of processes that terminated at a boundary and processes based on position in the decision process, that is, partial information. Therefore, if a process was above the starting point (i.e., the black area in the vertical distribution in the top panel of Figure 3.6), the decision corresponded to the choice at the upper boundary. The bottom panel of Figure 3.6 shows a heat map of the evolution of simulated diffusion processes. The map shows the density of processes as they begin at the starting point and spread out to the boundaries. The hotter the color (whiter), the more processes in that region. As time goes by, the color becomes
cooler because there are fewer and fewer processes that have not terminated. As in the top panel, the evolution of paths moves the mean position (the thick black line) from the starting point at 0.5 to a point a little above 0.6 by about 0.2 s. This produces an almost stationary distribution (the distribution to the right of the heat map), which gradually collapses over time (the two vertical distributions in the top panel of Figure 3.6). For the case in which partial information is used in the decision, the expression for the distribution of the positions x of decision processes at time t is given by

p(x, t) = e^{v(x-z)/s^2} \sum_{n=1}^{\infty} \frac{2}{a} \sin\!\left(\frac{n\pi x}{a}\right) \sin\!\left(\frac{n\pi z}{a}\right) e^{-\frac{1}{2}\left(v^2/s^2 + n^2\pi^2 s^2/a^2\right)t} \qquad (4)
where s² is the diffusion coefficient, z is the starting point, a is the separation between the boundaries, and v is the drift rate. For model fitting, the expression in Equation 4 must be integrated over the normal distribution of drift rates and the uniform distribution of starting points to include variability in drift rate and starting point across trials. This can be accomplished with numerical integration using Gaussian quadrature. The series in
Equation 4 must be summed until it converges; this means that terms have to be added until subsequent terms become so small that they do not affect the total (i.e., the series has converged to within some criterion, e.g., 10⁻⁵). Then, to obtain the probability of choosing each response alternative, the proportion of processes between 0 and a/2 (for the negative alternative) and between a/2 and a (for the positive alternative) is calculated by integrating the expression for the density over position.

time-varying processing
Ratcliff (1980) examined two cases in which drift rate changes across the time course of processing. In one, drift rate changes discretely at one fixed time. Because there is an explicit expression for the distribution of evidence at that time, this distribution can be used as the starting distribution for a second diffusion process. If the time at which evidence changes is not fixed but has a distribution over time, this distribution can be integrated over. This allows both response signal and regular RT tasks to be modeled. In the other case, boundaries are removed completely and drift rate and the diffusion coefficient vary continuously over time. Only the first case has been used in modeling response signal data (as in Ratcliff, 1988, 2006).

go/no-go task
In the go/no-go task, subjects are told to respond for one of the two choices but to make no response for the other choice. Withholding responses for one of the choices is similar to the response signal task in which responses must be held until the signal. Gomez, Ratcliff, and Perea (2007) proposed that there are two response boundaries for the go/no-go task, just as for the standard task, but that subjects make a response only when accumulated evidence reaches the "go" boundary. Gomez et al. successfully fit the model simultaneously to data from the standard task and data from the go/no-go task.
They also tested a variant for which there was only one boundary, the "go" boundary, but this variant could not fit the data well. Application of the diffusion model simultaneously to the standard task and response signal task or to the standard task and go/no-go tasks places powerful constraints on the model and, when it is successful, it offers new insights into the cognitive processes involved in these tasks. It also provides theoretical convergence between the three tasks, with two boundaries for all three tasks and withheld responses for the latter two.
The first conclusion is that applying models to multiple tasks simultaneously produces strong constraints on models that (if they successfully account for data) lead to new understanding of how the tasks are performed. In the context of the sequential sampling models discussed in this article, this approach yielded a new view of response signal performance: responses increase in accuracy over time mainly because the proportion of terminated processes increases and the increase in accuracy does not come entirely from the increasing availability of partial information. Moreover, versions of the models that provide quite good fits to the data from the standard RT and response signal tasks individually would not account for both sets of data simultaneously with parameters that were consistent across tasks.
Optimality
In animal studies, performance has been described in terms of how close it comes to maximizing reward rate. This is part of a larger theme in neuroscience, which reprises the classical signal detection and sequential-sampling literatures, in which reward rate is used as a criterion for understanding whether neural computations approach optimality. For animals, how close performance is to optimal in terms of reward rate is a reasonable question to ask because animals are deprived of water or food and their overwhelming desire is to obtain them. Also, they are trained for many sessions and so there is ample opportunity to optimize reward. However, when this kind of optimality is translated to human studies, the a priori reasonableness comes into question. This is because humans do not aim to get the most correct per unit time. Instead, they aim to get the most correct in the available time. If a student takes a 2-hour exam and obtains 60% correct in 1 hour, but another student gets 80% correct in 2 hours, the first has more correct per unit time, but the second would be more likely to pass the course. Bogacz, Brown, Moehlis, Holmes, and Cohen (2006) performed extensive analyses of optimality and set the stage for analyses of data. They showed that optimality as defined by reward rate can be adjusted by changing boundary settings. If the boundaries are too far apart, subjects are accurate but slow, and so there are few correct per unit of time. If boundaries are too narrow, RT is short but accuracy is low and there are few correct responses per unit of time. Thus, there is a boundary setting that maximizes the number
correct per unit of time, and it is possible to test whether subjects set criteria near to this value. Starns and Ratcliff (2012) tested undergraduate subjects on a simple numerosity discrimination task in which different groups of subjects were tested at different levels of difficulty. They were tested in blocks of trials that had a fixed total duration, for which they were instructed to get as many correct as possible in the time allowed, and in blocks of trials in which the number of trials was the same no matter how fast they went. Reward-rate optimality predicts that when difficulty increases, subjects should speed up and sacrifice accuracy. Results showed subjects did the opposite, slowing down with increases in difficulty. This is the result we might expect from years of academic training to spend more time on difficult problems. Starns and Ratcliff (2010) analyzed several published data sets with young and older adults and found that young adults with accuracy feedback sometimes approached reward-rate optimality. But older adults rarely moved more than a few percent away from asymptotic accuracy. Young adults in the context of psychology experiments (or perhaps practice with video games, some of which promote speed) will sometimes be able to optimize performance in terms of number correct per unit of time. In general, however, concerns about accuracy that have been trained for years appear to dominate.
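The boundary trade-off at the heart of these analyses can be illustrated with the closed-form accuracy and mean decision time for an unbiased, constant-drift diffusion (cf. Bogacz et al., 2006); the nondecision time and intertrial interval below are illustrative values:

```python
import math

def reward_rate(a, v, s=0.1, t_nd=0.3, t_iti=1.0):
    """Correct responses per second for an unbiased diffusion (z = a/2),
    using the closed-form accuracy and mean decision time for a constant
    drift rate v > 0.  t_nd is nondecision time and t_iti the intertrial
    interval; both are illustrative values."""
    p_correct = 1.0 / (1.0 + math.exp(-v * a / s**2))
    mean_dt = (a / (2.0 * v)) * math.tanh(v * a / (2.0 * s**2))
    return p_correct / (mean_dt + t_nd + t_iti)

# Sweep boundary separation: reward rate peaks at an intermediate value.
best_a = max((i / 1000.0 for i in range(5, 400)),
             key=lambda a: reward_rate(a, 0.2))
```

The sweep makes the text's point directly: boundaries that are too narrow give fast but inaccurate responses, boundaries that are too wide give accurate but slow responses, and reward rate is maximized somewhere in between.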
Domains of Application
One criterion for how well a model performs is whether it does more than simply reiterate what is already known from traditional analyses. Here we describe a number of applications, some of which provide new insights into processing, individual differences, and differences among subject groups. But in other cases, even when the obvious results are obtained, the model integrates the three dependent variables, namely accuracy and correct and error RT distributions, into a common theoretical framework that provides explanations of data that many hypothesis-testing approaches do not. Hypothesis-testing approaches usually select only accuracy or only mean RT as the dependent variable. In some cases, the two variables tell the same empirical story, but in other cases, they are inconsistent. The model-based approach helps to resolve such inconsistencies.
Perceptual Tasks
Recently, diffusion models have been applied to psychophysical discrimination tasks in which
stimuli are presented very briefly, often at low levels of contrast, sometimes with backward masks to limit iconic persistence. The focus has been to understand the perceptual processes involved in the computation of drift rates. Psychophysical paradigms have historically been used mainly with threshold or accuracy measures, but recent studies have collected accuracy and RT data. Ratcliff and Rouder (2000) and Smith, Ratcliff, and Wolfgang (2004) found that the diffusion model provided a good account of accuracy and distributions of RT from tasks with brief backward-masked stimuli. They compared the model with a constant drift rate from starting point to boundaries to the model with varying drift rate. Drift rates might be thought to decrease over time if they either tracked stimulus information or were governed by a decaying perceptual trace. However, there was no evidence in either study of increased skewness in the RT distributions or very slow error RTs at short stimulus durations as would have been expected if the decision process had been driven by a decaying perceptual trace. Instead, it appears that the information that drives the decision is relatively durable. The standard application of the model assumes that, at some point in time after stimulus encoding, the decision process turns on, and evidence is accumulated toward a decision. This time is assumed to be the same across conditions, and drift rate is assumed to be at a constant value from the point the process turns on. The assumption of a constant drift rate could be relaxed: Ratcliff (2002) generated predicted accuracy and RT quantiles for several conditions under the assumption that drift rate ramped up from zero to a constant level over 50 ms. He fit the standard model to these predicted values and found that the model fit well, with nondecision time increased by 25 ms and with starting-point and nondecision-time variability increased.
Thus, a ramped onset of drift rate over a small time range will be indistinguishable from an abrupt onset. Smith and Ratcliff (2009) developed a model, the integrated system model, that is a continuous-flow model composed of perceptual, memory, and decision processes operating in cascade. The perceptual encoding processes are linear filters (Watson, 1986) and the transient outputs of the filters are encoded in a durable form in visual short-term memory (VSTM), which is under the control of spatial attention. The strength of the VSTM trace determines the drift rate for the diffusion
process and the moment-to-moment variations in trace strength act as a source of noise in the decision process. Because the VSTM trace in the model increases over time (i.e., drift rate is time varying), predictions for the model are obtained using the integral equation methods described previously (Smith, 2000). The model has successfully accounted for accuracy and RT distributions in tasks with brief backward-masked stimuli. The main area of application of the integrated system model has been to tasks in which spatial attention is manipulated by spatial cues. In many cuing tasks, in which a single well-localized stimulus is presented in an otherwise empty display, attention shortens RT but increases accuracy only when stimuli are masked (Smith, Ratcliff, & Wolfgang, 2004; Smith, Ellis, Sewell, & Wolfgang, 2010). The model assumes that attention increases the efficiency with which perceptual information is transferred to VSTM and that masks interrupt the process of VSTM trace formation before it is complete. These two processes interact to produce a cuing effect in accuracy only when stimuli are masked but an unconditional effect in RT. The model has successfully accounted for the distributions of RT and accuracy in attention tasks in which the timing of stimulus localization is manipulated via onset transients and localizing markers (Sewell & Smith, 2012). These studies have helped illuminate the way in which performance is determined by perceptual, memory, attention, and decision processes acting in concert. Diederich and Busemeyer (2006) also considered the effects of attention on decision-making in a diffusion-process framework, studying decisions about multi-attribute stimuli for which it is plausible that people shift their attention sequentially from one attribute of a stimulus to the next. 
They assumed that some attributes would provide more information than others and modeled this successfully as a sequence of step changes in drift rate during the course of a trial.
Recognition Memory
One of the early applications of the diffusion model was to recognition memory. In global memory models, a test item is matched against all memory in parallel, and the output is a single value of strength or familiarity (Gillund & Shiffrin, 1984; Hintzman, 1986; Murdock, 1982, and later, Dennis & Humphreys, 2001; McClelland & Chappell, 1998; Shiffrin & Steyvers, 1997). From this point of view, the diffusion model provides
a meeting point between the decision process and memory: specifically, the drift rate for a test item represents the degree of match between the item and memory. In signal detection approaches to recognition memory, there has been considerable interest in the relative standard deviations (SDs) in strength between old and new test items, typically measured by confidence judgment paradigms. The common finding is that z-ROC functions (i.e., z-score transformed receiver operating characteristics) are approximately linear with a slope less than 1 (e.g., Ratcliff, Sheu, & Gronlund, 1992). There have been two interpretations of this finding. One is a single-process model that assumes memory strength is normally distributed, with a larger SD for old items than for new items. The other is a dual-process model in which the familiarity of old and new items comes from normal distributions with equal SDs but there is an additional recollection process (e.g., Yonelinas, 1997). In fits of the diffusion model to recognition memory data, it has usually been assumed that the SD in drift rate across trials is the same for studied and new items. Starns and Ratcliff (2014) performed an analysis of existing data sets that allowed the across-trial variability in drift rate to be different for studied and new items. They found that the across-trial variability in drift rate was larger for studied items than for new items (in about 66% of the cases for individual subjects). It also turned out that the interpretations of the other model parameters did not change when variability was allowed to differ. The advantage of this analysis is that the relative variability of studied and new items could be determined from two-choice data and did not require confidence judgments.
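The unequal-variance account of the sub-unit z-ROC slope can be checked numerically. In the sketch below (distribution parameters are hypothetical), new-item strength is N(0, 1) and old-item strength is N(d, sigma_old) with sigma_old > 1; the fitted z-ROC slope then equals 1/sigma_old, which is less than 1.

```python
from statistics import NormalDist

nd = NormalDist()                      # standard normal, for z-transforms
d, sigma_old = 1.0, 1.25               # hypothetical old-item mean and SD
criteria = [-0.5, 0.0, 0.5, 1.0, 1.5]  # confidence criteria on the strength axis

# Hit and false-alarm rates at each criterion, then z-transform both.
pts = []
for c in criteria:
    hit = 1.0 - NormalDist(mu=d, sigma=sigma_old).cdf(c)
    fa = 1.0 - nd.cdf(c)
    pts.append((nd.inv_cdf(fa), nd.inv_cdf(hit)))

# Least-squares slope of z(hit) on z(fa); analytically it is 1 / sigma_old.
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
slope = (sum((x - mx) * (y - my) for x, y in pts)
         / sum((x - mx) ** 2 for x, _ in pts))
print(f"z-ROC slope = {slope:.3f}")   # 1 / 1.25 = 0.8
```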
Lexical Decision Much as in recognition memory, a test item in lexical decision is matched against memory. The output is a value of how “wordlike” the item is. For sequential sampling models, proposals about how lexical items are accessed in memory must provide output values that, when mapped through a sequential sampling model, produce RTs and accuracy that fit data (Ratcliff, Gomez, & McKoon, 2004). (Note that other models have integrated RT and accuracy with lexical processes, in particular Norris, 2006.) Often, lexical decision RT has been interpreted as a direct measure of the speed
elementary cognitive mechanisms
with which a word can be accessed in the lexicon. For example, some researchers have argued that the well-known effect of word frequency—shorter RTs for higher frequency words—demonstrates the greater accessibility of high frequency words (e.g., their order in a serial search, Forster, 1976; the resting levels of activation in units representing the words in a parallel processing system, Morton, 1969). However, other researchers have argued, as we do here, against a direct mapping from RT to accessibility. For example, Balota and Chumbley (1984) suggested that the effect of word frequency might be a by-product of the nature of the task itself, and not a manifestation of accessibility. In the research presented here, the diffusion model makes explicit how such a by-product might come about.
Semantic and Recognition Priming Effects For semantic priming, the task is usually a lexical decision. A target word is immediately preceded in a test list either by a word related to it (e.g., cat dog) or some other word (e.g., table dog). For recognition priming, the task is old/new recognition and a target word is immediately preceded by a word that was studied near to it in the list of items to be remembered or far from it. In the diffusion model, the simplest assumption about priming effects is that they result from higher drift rates for primed than unprimed items. It has been hypothesized that the difference in drift rates between primed and unprimed items arises from the familiarity of compound cues to memory (McKoon & Ratcliff, 1992; McNamara, 1992, 1994; Ratcliff & McKoon, 1988, 1994). The compound cue for an item is a multiplicative combination of the familiarity of the target word and the familiarity of the prime (see examples in Ratcliff & McKoon, 1988). Because the prime and target of a primed pair share associates in memory, the combination produces a higher value of joint familiarity than if the two were unrelated. This model was capable of explaining a number of phenomena in research on priming, including the range of priming, the decay of priming, the onset of priming, and so on. McKoon and Ratcliff (2012) compared priming in word recognition to associative recognition. Subjects studied pairs of words and then performed either a single-word recognition task or
an associative recognition task (see also Ratcliff, Thapar, & McKoon, 2011). For the associative recognition task, subjects decided whether two words of a test pair had or had not appeared in the same pair at study. In the single-word task, some test words were immediately preceded in the test list by the other word of their studied pair (primed) and some by a word from a different pair (unprimed). Data from the two tasks were fit with the diffusion model and the results showed parallel behavior: the drift rates for associative recognition and those for priming were parallel across ages and IQ, indicating that they are based, at least to some degree, on the same information in memory.
Value-Based Judgments Busemeyer and Townsend (1993) developed a diffusion model called decision field theory to explain choices and decision times for decisions under uncertainty, and later Roe, Busemeyer, and Townsend (2001) extended it to multi-alternative and multiattribute situations. According to the theory, at each moment in time, options are compared in terms of advantages and disadvantages with respect to an attribute; these evaluations are accumulated across time until a threshold is reached, and the first option to cross the threshold determines the choice that is made. The theory accounts for a number of findings that seem paradoxical from the perspective of rational choice theory. Usher and McClelland (2004) proposed another diffusion model to account for a similar range of findings. Milosavljevic, Malmaud, Huth, Koch, and Rangel (2010) examined several variants of diffusion models for value-based judgments. They found that the standard model with across-trial variability in model parameters provided a good account of data from their paradigm. More recently, Krajbich and Rangel (2011) have used a model similar in character to decision field theory. They examined value-based judgments for food items and had subjects choose which of two alternatives they preferred. They monitored eye fixations and, in modeling, assumed that evidence was accumulated at a higher rate for the fixated alternative. Their model accounted for RTs and accuracy and for the influence of which of the two choices was fixated and for how long. Philiastides and Ratcliff (2013) examined value-based judgments of consumer choices with brand names presented on some trials as well as the items for which the choices were made. When the quality of the brand name was in conflict with the perceived quality of the item, the probability of choosing the
modeling simple decisions and applications
item was lower than when they were consistent. Application of the diffusion model showed that the effect of the brand was to alter drift rate but none of the other parameters of the model. This means that the value of the stimulus and brand name were processed as a whole. Currently, there is growing interest in the application of diffusion models to decision-making in marketing and economics, including neuroeconomics. Wide application of diffusion models in this domain is in its infancy, but the potential for theoretical advancement is great, as these examples demonstrate.
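The fixation-dependent accumulation idea behind Krajbich and Rangel's model can be sketched as a simplified attentional diffusion process (illustrative parameter values, not their fitted model): evidence drifts toward the fixated item, with the unfixated item's value discounted by a factor theta, and gaze alternates at random intervals.

```python
import math
import random

def attentional_trial(v_left, v_right, theta=0.3, a=1.0, s=0.5, dt=0.01,
                      mean_fix=0.4, rng=random):
    """One trial of a simplified attentional diffusion process: the drift
    favors the fixated item, and the unfixated item's value is discounted
    by theta. Returns ('left' or 'right', decision time)."""
    x, t = 0.0, 0.0
    fixate_left = rng.random() < 0.5
    next_switch = rng.expovariate(1.0 / mean_fix)
    while abs(x) < a:
        if t >= next_switch:                      # gaze shifts to the other item
            fixate_left = not fixate_left
            next_switch = t + rng.expovariate(1.0 / mean_fix)
        drift = (v_left - theta * v_right) if fixate_left \
                else (theta * v_left - v_right)
        x += drift * dt + s * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return ('left' if x >= a else 'right'), t

rng = random.Random(7)
choices = [attentional_trial(2.0, 1.0, rng=rng)[0] for _ in range(1000)]
p_left = choices.count('left') / len(choices)
print(f"P(choose higher-valued left item) = {p_left:.2f}")
```

Because the discount theta is less than 1, which item is fixated, and for how long, biases the choice and the RT, as in the eye-tracking data described above.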
Aging The application of the diffusion model to studies of aging has been especially successful, producing a different view of the effects of aging on cognition than has been usual in aging research. The general finding in the literature has been that older adults are slower than young adults (but not necessarily less accurate) on most tasks, and this has been interpreted as a decline with age in all or almost all cognitive processes. However, application of the diffusion model showed that this is not correct (Ratcliff, Thapar, & McKoon, 2003, 2004, 2006, 2007; Ratcliff, Thapar, Gomez, & McKoon, 2004). For example, Ratcliff, Thapar, and McKoon (2010) tested old and young adults on numerosity discrimination, lexical decision, and recognition memory. What they found is that older adults had slower nondecision times and set their boundaries wider, but their drift rates were not lower than those of young adults. In contrast, in some tasks (associative recognition and letter discrimination), large declines in drift rate with age have been found (Ratcliff et al., 2011; Thapar et al., 2003).
Individual Differences The diffusion model has been used to examine individual differences. To do so requires that the SDs in model parameters due to estimation variability be smaller than the SDs between subjects. In the aging studies described earlier, with about 45 minutes of data collection, individual differences in drift rates, boundary settings, and nondecision time were three to five times larger than the SDs of the model parameters (see Ratcliff & Tuerlinckx, 2002, for tables of SDs in model parameters). Schmiedek, Oberauer, Wilhelm, Süß, and Wittmann (2007) analyzed data from eight choice-RT tasks (including verbal, numerical, and
spatial tasks) from Oberauer, Süß, Wilhelm, and Wittmann (2003). They found that drift rates in the diffusion model mapped onto working memory, speed of processing, and reasoning ability measures (each of these was measured by aggregated performance on several tasks). In aging studies by Ratcliff et al. (2010, 2011), IQs ranged from about 80 to about 140. Applying the model showed that drift rate varied with IQ (by as much as 2:1 for high versus low IQ subjects) but boundary separation and nondecision time did not. This is the opposite of the pattern for aging. This dissociation provides strong support for the model because it extracts regularity from the three dependent variables (accuracy and correct and error RT distributions). Individual differences across tasks in model parameters provide strong evidence for common abilities across tasks. In the Ratcliff et al. (2010) study, in the lexical decision, item recognition, and associative recognition tasks, there were strong correlations across subjects in drift rate, and these correlated with IQ as measured by WAIS vocabulary and matrix reasoning. Boundary separation also correlated across tasks, as did nondecision time. These results show that the diffusion model extracts components of processing that show systematic individual differences across tasks. Consistent boundary setting across tasks is of special interest because boundary settings are optional: they can easily be changed by instruction (e.g., go fast or be accurate). In most real-life situations, we rarely encounter more than a single decision about a particular stimulus class (except perhaps in Las Vegas or in psychology experiments). This means that there is little chance of adjusting decision criteria in real life, because there is little extended experience with a task in which the decision maker can extract statistics from a long sequence of trials whose structure does not change.
The diffusion model assumes that a decision maker uses this decision mechanism across many tasks, and so we would expect to see correlations in boundary separation across tasks. This is a result that has been obtained whenever the comparison has been made.
Child Development A natural extension from the aging studies is to test children on similar tasks to those performed with older adults to trace the course of development within the model framework. Ratcliff, Love,
Thompson, and Opfer (2012) tested several groups of children on a numerosity discrimination task and a lexical decision task. The results showed that relative to college age subjects, children’s drift rates were lower, boundary separation was larger, and nondecision time was longer. These differences were larger for younger relative to older children. In other laboratories, drift rates have been found to be lower for ADHD and dyslexic children relative to normal controls (ADHD, Mulder et al., 2010; dyslexia, Zeguers et al., 2011). These studies show that the diffusion model can be applied to data collected from children, a domain in which there has been relatively little research with decision models.
Clinical Applications In research on psychopathology and clinical populations, two-choice tasks are commonly used to investigate processing differences between patients and healthy controls. Highly anxious individuals reliably show enhanced processing of threat-provoking materials, but only when two or more stimuli compete for processing resources, not one. However, when White, Ratcliff, Vasey, and McKoon (2010) applied the diffusion model to the RT and accuracy data from a two-choice lexical decision task with single words that included threatening and control words, they found a consistent processing advantage for threatening words in high-anxious individuals, whereas traditional comparisons showed no significant differences. Because the diffusion model makes use of both RT and accuracy data, it has more power to detect differences among subject populations than RT or accuracy alone. Studies of depression have had somewhat different patterns of results. Depressive symptoms are closely linked with abnormal emotional processing: a negative emotional bias in clinical depression, even-handedness (i.e., no emotional bias) in dysphoria, and a positive emotional bias in nondepressed individuals. However, item recognition and lexical decision tasks often fail to produce significant results. White, Ratcliff, Vasey, and McKoon (2009) used the diffusion model to examine emotional processing in dysphoric (i.e., moderately high levels of depressive symptoms) and nondysphoric college students to examine differences in memory and lexical processing of positive and negative emotional words (which were presented among many neutral filler words). They found a positive emotional bias in nondysphoric
subjects and even-handedness in dysphoric subjects in drift rates. As before, this pattern was not apparent in comparisons of reaction times or accuracy, consistent with previous null findings. One limitation of these studies and similar ones is that there may be relatively few materials with the right kinds of properties or structures (as in language processing experiments, for example). The emotional word pools for the experiments contained only 30 words each. This left relatively few observations (especially for errors) to use in fitting the diffusion model, which would result in unreliable parameter estimates. To remedy this, the model was fit to all conditions simultaneously, including the neutral filler conditions, which had hundreds of observations. The only parameter that was allowed to vary between the conditions was drift rate. Estimates for the other parameters (e.g., nondecision time and boundary separation) were largely determined by the filler conditions, because the fitting method essentially weighted estimation of the parameters common to all conditions by the number of observations for each condition. Thus, the filler conditions largely determined all model parameters except the drift rates for the critical conditions, resulting in an increase in power. The results showed a bias for positive emotional words in the nondysphoric participants, but not in the dysphoric participants (White et al., 2009). This difference in emotional bias was not significant when the diffusion model was fit only to the emotional conditions with few observations, nor was it significant in comparisons of mean RT or accuracy. Another study examined the effects of aphasia in a lexical decision task. The impairments produce the exaggerated lexical decision reaction times typical of neurolinguistic patients.
In diffusion model analyses, decision and nondecision processes were compromised, but the quality of the information upon which the decisions were based did not differ much from that of unimpaired subjects (Ratcliff, Perea, Colangelo, & Buchanan, 2004).
Manipulations of Homeostatic State Ratcliff and Van Dongen (2009) looked at effects of sleep deprivation with a numerosity discrimination task, van Ravenzwaaij, Dutilh, and Wagenmakers (2012) looked at the effects of alcohol consumption with a lexical decision task, and Geddes et al. (2010) looked at the effects of reduced blood sugar with a numerosity
discrimination task. Applying the model to all of these studies, the main effect was a reduced drift rate but with either small or no effect on boundary separation and nondecision time. These results show that the diffusion model is useful in providing interpretations of group differences among different subject populations. Furthermore, as noted earlier, the model can be used to examine individual differences (even with only 45 minutes of data collection for a task). This means that this modeling approach, when paired with the right tasks, may have a useful role to play in neuropsychological assessment.
Situations in Which the Standard Model Fails There are several cases in which the standard diffusion model fails to account for experimental data. These fall into two classes: one involves dynamic noise and categorical stimuli and the other involves conflict paradigms. For both, the main way the model fails is that there are cases for which the onset of the RT distribution (i.e., the leading edge) for one condition is delayed relative to the onset for other conditions. Ratcliff and Smith (2010) and Smith, Ratcliff, and Sewell (2014) tested letter discrimination, horizontal versus vertical bars discrimination, and Gabor patch orientation discrimination with stimuli degraded with either static noise or dynamic noise. Noise was implemented by reversing the contrast polarity of some proportion of the pixels (randomly selected) for each of the letter, random bars, and Gabor patch stimuli. For dynamic noise, a different random sample of pixels was chosen on every frame of the display, whereas static noise used a single image with one random sample reversed. Dynamic noise and, to a lesser extent, static noise produced large shifts in the leading edges of the RT distributions. The shapes of the RT distributions were consistent with the model, but increasing noise increased estimates of the nondecision time parameter Ter. This finding is inconsistent with the hypothesis that noise increases RTs simply by reducing the rate at which evidence accumulates in the decision process. Instead, it implies that noise delays the onset of the diffusion process. Smith, Ratcliff, and Sewell (2014) showed that shifts in onsets can be explained by Smith and Ratcliff’s (2009) integrated system model, with the assumption that noise slows the process of forming a stable perceptual representation of the stimulus. In
the integrated system model, drift rate and diffusion noise grow in proportion to one another to an asymptote. Unlike the standard model, in which the onset of evidence accumulation is abrupt, the onset of evidence accumulation in the integrated system model is gradual, controlled by the growth of diffusion noise. Smith, Ratcliff, and Sewell (2014) showed that this model could explain the shifts in the onsets of RT distributions found by Ratcliff and Smith (2010). Smith et al. (2014) also considered a second, release-from-inhibition model, which was motivated, in part, by physiological principles. They modeled release from inhibition using an Ornstein-Uhlenbeck (OU) diffusion process with a time-varying decay coefficient. In the OU process, information accumulation is opposed by a decay term that pulls the process back toward its starting point. The larger the decay, the harder it is for the process to accumulate enough information to reach a criterion and trigger a response. In the standard OU process, decay is proportional to the distance of the process from its starting point, but does not vary with time. Smith et al. (2014) assumed that decay was time-locked to the stimulus. At the start of the trial, before a perceptual representation of the stimulus is formed, the decay term is large and the process remains near its starting point with high probability. As stimulus information becomes available, the decay term progressively decreases, allowing information to accumulate in the same way as it does in the standard model. This model was also able to account for data like those reported by Ratcliff and Smith (2010). Because the inhibition process behaves somewhat like the standard model with variable starting point, the release-from-inhibition model was able to account for the fast errors found at high stimulus discriminability in dynamic noise tasks without the assumption of starting point variability.
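The release-from-inhibition idea can be sketched in a few lines (illustrative parameters, not those fitted by Smith et al.): an OU process whose decay coefficient starts large and relaxes over time, time-locked to the stimulus, so the process is held near its starting point early in the trial and then accumulates normally.

```python
import math
import random

def release_trial(v=0.3, a=0.08, k0=20.0, tau=0.1, s=0.1, dt=0.001,
                  max_t=5.0, rng=random):
    """One trial of an OU process with a time-varying decay coefficient
    k(t) = k0 * exp(-t / tau): decay is large at stimulus onset and
    decreases as a perceptual representation forms. Boundaries at +a and
    -a; returns the decision time, or None if no crossing by max_t."""
    x, t = 0.0, 0.0
    while abs(x) < a:
        k = k0 * math.exp(-t / tau)               # decay, time-locked to stimulus
        x += (v - k * x) * dt + s * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        if t > max_t:
            return None
    return t

rng = random.Random(3)
rts = [release_trial(rng=rng) for _ in range(500)]
rts = [t for t in rts if t is not None]
print(f"mean decision time = {sum(rts) / len(rts):.3f} s")
```

With k0 = 0 this reduces to the standard diffusion model, so the delayed distribution onset is produced entirely by the early, large decay.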
Ratcliff and Frank (2012) also found shifts in the leading edges of RT distributions in a reinforcement learning conflict experiment for which the stimuli were three pairs of letters (the same three throughout the experiment). On each trial, one of the pairs of letters was presented in random order and the subject had to choose and respond to one of the letters. One of the letters of the pair was reinforced more often than the other (in this case, reinforcement was simply a “correct” or “incorrect” message). After a training phase, on a small proportion of the trials, letters from different pairs were presented together. When the two letters
were the highly reinforced members of the pairs, they were chosen nearly equally often and there was no slowing of the RT distribution. But when the letters that were reinforced with low probability were presented together, there was a delay in the leading edge of the RT distribution, an average delay of over 100 ms. This was explained in two ways, one in terms of the basal ganglia model of Frank (2006), and one in terms of the diffusion model. For the diffusion model, a delay in the onset of the decision process produced good fits to the data, but this was, to some degree, a redescription of the empirical result. The basal ganglia model explained these conflict trials by an increase in threshold in the neural circuitry. This was linked to the diffusion model by showing that a transient increase in boundary separation was also capable of explaining the result (the delay in onset of the RT distribution): an increase in boundary separation with an exponential decay mimics a delayed onset. White, Ratcliff, and Starns (2011) also found leading edge shifts in a flanker task. In their experiment, a target angle bracket was presented that pointed in the direction of the correct response. On conflict trials, the target bracket was embedded in a string pointing the other way. Again, RT distributions could not be explained with only a difference in drift rates, but a model with drift rate changing over the time course of the decision, starting out dominated by the flankers and then gradually focusing on the central symbol, was successful. All of these paradigms suggest that, in conflict situations, drift rate is not stationary over time. It is necessary to go beyond the basic decision model and begin to integrate it with models of perceptual and cognitive processing.
Competing Two-Choice Models The diffusion model described to this point is one of a class of sequential sampling models that share many features. They have all given the same interpretations of the effects of independent variables (e.g., Donkin, Brown, Heathcote, & Wagenmakers, 2011; Ratcliff, Thapar, Smith, & McKoon, 2005). This means, for example, that the effects of aging on model components are the same whichever model is used. The leaky competing accumulator (LCA) model (Usher & McClelland, 2001) was developed as
Fig. 3.8 An illustration of the leaky competing accumulator model. The model includes an inhibition term (−βxj), in which the increment to evidence in accumulator i is reduced as a function of activity in the other accumulator (xj), and a leak or decay term (−kxi), in which the increment to evidence is reduced as a function of activity in the accumulator itself. The decision criteria for the two accumulators are c1 and c2, the accumulation rates are v and 1 − v (summing to 1), and there is variability in the starting points that is uniformly distributed across trials with range sz. Variability in processing within a trial is normally distributed with standard deviation σ.
an alternative to the diffusion model. Part of the motivation was to implement neurobiological principles that the authors believed should be incorporated into RT models, especially mutual inhibition mechanisms and decay of information across time. In the LCA model, like the diffusion model, information is accumulated continuously over time. There are two accumulators, one for each response, as shown in Figure 3.8, and a response is made when the amount of information in one of the accumulators reaches its criterion amount. The rate of accumulation, the equivalent of drift rate in the diffusion model, is a combination of three components. The first is the input from the stimulus (v), with a different value for each experimental condition. If the input to one of the accumulators is v, the input to the other is 1−v so that the sum of the two rates is 1. The second component is decay in the amount of accumulated information, k, with size of decay growing as the amount of information in the accumulator grows, and the third is inhibition from the other accumulator, β, with the amount of inhibition growing as the amount of information in the other accumulator grows. If the amount of inhibition is large, the model exhibits features similar to the diffusion model because an increase in accumulated information for one of the response choices produces a decrease for the other choice.
Just as in the diffusion model, the accumulation of information is assumed to be variable over the course of a trial, with a normal distribution with standard deviation σ. Because of the decay and inhibition in the accumulation rates, the tails of RT distributions are longer than they would be if produced without these factors (cf. Smith & Vickers, 1988; Vickers, 1970, 1979; Vickers, Caudrey, & Willson, 1971), which leads to good matches with the skewed shape of empirical distributions. The expression for the change in the amount of accumulated information at time t in accumulator i is:

Δx_i = (v_i − k x_i − β Σ_{j≠i} x_j) Δt + σ η_i √Δt,   i = 1, 2   (5)
The amount of accumulated information is not allowed to take on values below zero, so if it is computed to be below zero, it is reset to zero. This is theoretically equivalent to constraining the diffusion process with a reflecting boundary at zero. The LCA model without across-trial variability in any of its components predicts errors slower than correct responses. To produce errors faster than correct responses and the crossover pattern such that errors are faster than correct responses for easy conditions and slower for difficult conditions, Usher and McClelland assumed variability in the accumulators’ starting points, just as is assumed in the diffusion model and by Laming (1968). In the diffusion model, moving a boundary position is equivalent to moving the starting point. Moving the starting point an amount y toward one boundary is the same as moving that boundary an amount y toward the starting point and the other boundary an amount y away from the starting point. In the LCA model, changing the starting point is not equivalent to changing a boundary position because decay is a function of the distance of the accumulated amount of evidence from zero. Increasing the starting point by an amount y increases decay by an amount proportional to y, but with the starting point at zero, reducing the boundary by y has no effect on decay. Usher and McClelland (2001) implemented variability in starting point by assuming rectangular distributions of the starting points with minimums at zero. No explicit solution is known for the pair of coupled equations in Eq. 5 when they are constrained by decision criteria and the requirement that the
accumulated information remain positive. Thus, as in Usher and McClelland (2001), predictions from the model are obtained by simulation. There have been several analyses of this model. Bogacz et al. (2006) showed that the model could be reduced to a single diffusion process if leak and inhibition were balanced, and examined notions of optimality (but see van Ravenzwaaij, van der Maas, & Wagenmakers, 2012). The Linear Ballistic Accumulator (LBA; Brown & Heathcote, 2008) is similar to the LCA in that it uses two accumulators, but it has no within-trial variability, no decay, and no inhibition. The model assumes that the rate of evidence accumulation and the starting point for accumulation both vary randomly from trial to trial, but that the process of evidence accumulation itself is noise free. In essence, the model assumes that there is noise in the central nervous system on long, between-trial, time scales, but none on the short, moment-to-moment, time scales that govern evidence accumulation within a trial. This assumption appears incompatible with the single-cell recording literature that has linked processes of evidence accumulation with neural firing rates in the oculomotor control system, because such neural spike trains are typically noisy. To reconcile these kinds of data with noiseless evidence accumulation requires an argument to the effect that individual neurons are noisy but the neural ensemble as a whole is effectively noise free. However, it is not clear that firing rates in weakly coupled networks of neurons exhibit the kinds of central limit theorem type properties that this argument requires (Zohary, Shadlen, & Newsome, 1994), and so the status of the central limit argument is unclear.
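Since predictions for Eq. 5 must be obtained by simulation, a direct simulation is also the simplest way to see the model's behavior. The sketch below (illustrative parameter values) runs the two-accumulator LCA with leak, inhibition, and truncation at zero.

```python
import math
import random

def lca_trial(v=0.6, k=2.0, beta=2.0, crit=0.25, sigma=0.3, dt=0.001,
              rng=random):
    """One trial of the leaky competing accumulator (Eq. 5): inputs v and
    1 - v, decay k on an accumulator's own activity, inhibition beta from
    the other accumulator, activities truncated at zero (the reflecting
    boundary). Returns (index of winning accumulator, decision time)."""
    x = [0.0, 0.0]
    inputs = [v, 1.0 - v]
    sqdt = math.sqrt(dt)
    t = 0.0
    while max(x) < crit:
        new_x = []
        for i in (0, 1):
            j = 1 - i
            dx = ((inputs[i] - k * x[i] - beta * x[j]) * dt
                  + sigma * sqdt * rng.gauss(0.0, 1.0))
            new_x.append(max(0.0, x[i] + dx))     # truncate at zero
        x = new_x
        t += dt
    return (0 if x[0] >= crit else 1), t

rng = random.Random(11)
trials = [lca_trial(rng=rng) for _ in range(1000)]
p_correct = sum(1 for w, _ in trials if w == 0) / len(trials)
print(f"P(accumulator with larger input wins) = {p_correct:.2f}")
```

Raising beta makes the accumulators more strongly anticorrelated, which is the regime in which the LCA behaves most like a single diffusion process.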
Multichoice Decision-Making and Confidence Judgments Recently, interest in multichoice decision-making tasks has developed in the neuroscience domain, for visual search (Basso & Wurtz, 1998; Purcell et al., 2010) and motion discrimination (Niwa & Ditterich, 2008; Ditterich, 2010). In psychology, there have been investigations using generalizations of standard two-choice tasks (Leite & Ratcliff, 2010) and absolute identification (Brown, Marley, Donkin, & Heathcote, 2008). In addition, confidence judgments in decision-making and memory tasks are multichoice decisions, and diffusion models are being applied in these domains (Pleskac & Busemeyer, 2010; Ratcliff & Starns, 2009, 2013; Van Zandt, 2002).
It is clear that there is no simple way to extend the two-choice model to tasks with three or more choices. But models with racing accumulators can be extended. Some models with racing accumulators become standard diffusion models when the number of choices is reduced to two. Ratcliff and Starns (2013) proposed a model for confidence judgments in recognition memory tasks that uses a multiple-choice diffusion decision process with separate accumulators of evidence for each confidence choice. The accumulator that first reaches its decision boundary determines which choice is made. Ratcliff and Starns compared five algorithms for accumulating evidence and found that one of them produced choice proportions and full RT distributions for each choice that closely matched empirical data. With this algorithm, an increase in the evidence in one accumulator is accompanied by a decrease in the others with the total amount of evidence in the system being constant. Application of the model to the data from an earlier experiment (Ratcliff, McKoon, & Tindall, 1994) uncovered a relationship between the shapes of z-ROC functions and the behavior of RT distributions. For low-proportion choices, the RT distributions were shifted by as much as several hundred milliseconds relative to high proportion choices. This behavior and the shapes of z-ROC functions were both explained in the model by the behavior of the decision boundaries. For generality, Ratcliff and Starns (2013) also applied the decision model to a three-choice motion discrimination task in which one of the alternatives was the correct choice on only a low proportion of trials. As for the confidence judgment data, the RT distribution for the low probability alternative was shifted relative to the higher probability alternatives. The diffusion model with constant evidence accounted for the shift in the RT distribution better than a competing class of models. 
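The constant-total-evidence idea can be sketched as a race among accumulators in which every increment to one accumulator is offset by an equal aggregate decrement to the others. This is a simplified illustration with made-up parameters, not Ratcliff and Starns's fitted model.

```python
import math
import random

def constant_sum_race(drifts, a=1.0, s=0.4, dt=0.002, rng=random):
    """Multichoice race in which each increment to one accumulator is
    balanced by an equal total decrement spread over the others, so the
    summed evidence in the system stays constant. Returns
    (winner index, decision time)."""
    n = len(drifts)
    x = [0.0] * n
    sqdt = math.sqrt(dt)
    t = 0.0
    while max(x) < a:
        for i in range(n):
            inc = drifts[i] * dt + s * sqdt * rng.gauss(0.0, 1.0)
            x[i] += inc
            for j in range(n):
                if j != i:
                    x[j] -= inc / (n - 1)   # keep the total constant
        t += dt
    return max(range(n), key=lambda i: x[i]), t

rng = random.Random(5)
wins = [constant_sum_race([1.0, 0.5, 0.2], rng=rng)[0] for _ in range(500)]
props = [wins.count(i) / len(wins) for i in range(3)]
print("choice proportions:", props)
```

Because the total is conserved, evidence for a low-drift alternative is driven down by the others, which is one way such a race can produce the strongly shifted RT distributions seen for low-proportion choices.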
Research on multichoice decision making, including confidence judgments, is a growing industry, but the constraints provided by RT distributions and response proportions for the different choices make the modeling quite challenging.
One-Choice Decisions Relatively little work has been done recently on one-choice decisions. In these, there is only one key to press when a stimulus is detected. Ratcliff and Van Dongen (2011) tested a model that used a single diffusion process to represent the process of accumulating evidence. The main application was
to the psychomotor vigilance task (PVT), in which a millisecond timer is displayed on a computer screen and starts counting up at intervals between 2 and 12 s after the subject’s last response. The subject’s task is to hit a key as quickly as possible to stop the timer. When the key is pressed, the counter is stopped, and the RT in milliseconds is displayed for 1 s. In single-choice decision-making tasks, the data are a distribution of RTs for hitting the response key. The one-choice diffusion model assumes that evidence accumulates from the presentation of a stimulus until a decision criterion is hit, upon which a response is initiated (Figure 3.9 illustrates the model). In the model, drift rate is assumed to vary from trial to trial. This relates it to the standard two-choice model, which makes this assumption to fit the relative speeds of correct and error responses. In application of the one-choice model to sleep deprivation data, across-trial variability in drift rate was needed to produce the long tails observed in the RT distributions. Ratcliff and Van Dongen (2011) fit the model to RT distributions and their hazard functions from experiments with the PVT with over 2000 observations per RT distribution per subject. With only changes in drift rate, they found that the model accounted for changes in the shape of RT distributions. In particular, changes in drift rate accounted for the change in hazard-function shape from a high tail under no sleep deprivation to a low tail with sleep deprivation. They also fit data in which the PVT was tested every 2 hours for 36 hours of sleep deprivation and found that drift rate was closely related to an independent measure of alertness, which provides an external validation of the model.
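A minimal simulation of such a one-choice diffusion process, with drift rate redrawn on every trial, illustrates how across-trial drift variability lengthens the tail of the RT distribution. The parameter values below are illustrative assumptions, not fitted values from Ratcliff and Van Dongen (2011).

```python
import numpy as np

rng = np.random.default_rng(1)

def one_choice_trial(v=0.4, eta=0.15, a=0.1, s=0.1, dt=0.001,
                     ter=0.2, tmax=5.0):
    # Evidence accumulates toward a single criterion a; the drift is
    # drawn anew each trial from N(v, eta), as the one-choice model
    # assumes.  tmax caps rare near-zero-drift trials (lapses).
    drift = rng.normal(v, eta)
    x, t = 0.0, 0.0
    while x < a and t < tmax:
        x += drift * dt + s * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return t + ter

rts = np.array([one_choice_trial() for _ in range(1000)])
```

Raising eta stretches the right tail of the simulated distribution, which is the pattern the model uses to capture the long-tailed PVT distributions observed under sleep deprivation.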
Neuroscience One of the major advances in understanding decision making has come from neuroscience applications: single-cell recording in monkeys (and rats) and human neuroscience methods, including fMRI, EEG, and MEG. All of these domains have seen interactions between diffusion model theory and neuroscience measures. Hanes and Schall (1996) made the first connection between theory and single-cell recording data, and this was taken up in work by Shadlen and colleagues (e.g., Gold & Shadlen, 2001). monkey neurophysiology In both psychology and neuroscience, theories of decision processes have been developed that
modeling simple decisions and applications
Fig. 3.9 An illustration of the one-choice diffusion model. Evidence is accumulated at a drift rate v, with SD η across trials, until a decision criterion at a is reached after time Td. Additional processing times include stimulus encoding time Ta and response output time Tb; these sum to nondecision time Ter, which has uniform variability across trials with range st.
assume that evidence is gradually accumulated over time (Boucher, Palmeri, Logan, & Schall, 2007; Churchland, Kiani, & Shadlen, 2008; Ditterich, 2006; Gold & Shadlen, 2001, 2007; Grinband, Hirsch, & Ferrera, 2006; Hanes & Schall, 1996; Mazurek, Roitman, Ditterich, & Shadlen, 2003; Platt & Glimcher, 1999; Purcell et al., 2010; Ratcliff, Cherian, & Segraves, 2003; Ratcliff, Hasegawa, Hasegawa, Smith, & Segraves, 2007; Roitman & Shadlen, 2002; Shadlen & Newsome, 2001). In these studies, cells in the lateral intraparietal cortex (LIP), frontal eye field (FEF), and the superior colliculus (SC) exhibit behavior that corresponds to a gradual buildup in activity that matches the buildup in evidence in making simple perceptual decisions (see also Munoz & Wurtz, 1995; Basso & Wurtz, 1998). The neural populations that exhibit buildup behavior in LIP, FEF, and SC prior to a decision have been studied extensively. There is debate about where exactly the accumulation takes place, but it is clear that (at least) these three structures are part of a circuit that is involved in implementing the decision. These studies so far support the notion that there is a flow of information from LIP to FEF and then to SC prior to a decision. In modeling the neurobiology of the decision process, there are a number of models applied to a range of different tasks. They all have the common theme that they assume evidence is accumulated to a decision criterion, or boundary, and that accumulated evidence corresponds to activity in populations of neurons corresponding to the decision alternatives. The models considered here have been explicitly proposed as models of oculomotor decision making in monkeys or argued to describe the evidence accumulation process in humans or monkeys. The models fall into several
classes (Ratcliff & Smith, 2004; Smith & Ratcliff, 2004), including those that assume accumulation of a single evidence quantity taking on positive and negative values (Gold & Shadlen, 2000, 2001; Ratcliff, 1978; Ratcliff et al., 2003; Ratcliff, Van Zandt, & McKoon, 1999; Smith, 2000) and those that assume that evidence is accumulated in separate accumulators corresponding to separate decisions (Churchland et al., 2008; Ditterich, 2006; Mazurek et al., 2003; Ratcliff et al., 2007; Usher & McClelland, 2001). In this latter class of models, accumulation can be independent in separate accumulators, or it can be interactive so that as evidence grows in one accumulator, it inhibits evidence accumulation in the other accumulator. The single accumulator model can be seen as implementing perfect inhibition because a positive increment toward one boundary is an increment away from the other boundary. The models with separate accumulators have an advantage in that the two accumulators can be used to represent growth of activity in the populations of neurons corresponding to the two decisions. In the single diffusion process models, if the single process represented the aggregate activity in the two populations, then the growth of activity in the two populations would have to be perfectly negatively correlated. This is plausible if the resting activity level is relatively high in the neural populations (e.g., Roitman & Shadlen, 2002), but it is less plausible in populations in which the resting level is low (Hanes & Schall, 1996; Ratcliff et al., 2007). However, the two classes of models largely mimic each other at a behavioral level (Ratcliff, 2006; Ratcliff & Smith, 2004), and although models with racing diffusion processes seem to be superior in application to oculomotor
responses in monkeys, this does not rule out the viability of the single accumulator model for human behavioral and neural data (Philiastides, Ratcliff, & Sajda, 2006; Ratcliff et al., 2009). Ratcliff et al. (2007; see also Ratcliff, Hasegawa, et al., 2011) applied a dual diffusion model to a brightness discrimination task. In the dual diffusion model, evidence for the two responses is accumulated by a pair of racing diffusion processes. In Ratcliff et al.’s model, there was competition at input (drift rates summed to a constant) but no inhibition (i.e., Figure 3.8 without the inhibition). Two rhesus monkeys were required to make a saccade to one of two peripheral choice targets based on the brightness of a central stimulus. Neurons in the deep layers of the SC exhibited robust presaccadic activity when the stimulus specified a saccade toward a target within the neuron’s response field, and the magnitude of this activity was unaffected by the level of difficulty. Activity following brightness stimuli specifying saccades to targets outside the response field was affected by task difficulty, increasing as the task became more difficult, and this modulation correlated with performance accuracy. The model fit the full complexity of the behavioral data, accuracy and RT distributions for correct and error responses, over a range of levels of difficulty. Using the parameters from the fits to behavioral data, simulated paths of the process were generated, and these provided numerical predictions for the behavior of the firing rates in SC neurons that matched most, but not all, of the effects in the data. Simulated paths from the model were compared to neuron activity. The assumption linking the paths to the neuron data is that firing rate is linearly related to position in the accumulation process; the nearer the boundary the decision process is, the higher the firing rate.
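This linking assumption can be illustrated with a simple simulation: racing accumulator paths are generated and their mean position is mapped linearly onto firing rate. The sketch below uses assumed parameter values; the drift rates, boundary, and the linear coefficients r0 and gain are hypothetical, not the fitted values from Ratcliff et al. (2007).

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_paths(v=(0.30, 0.10), a=0.12, s=0.1, dt=0.001, n=300, tmax=0.4):
    """Two racing diffusion processes (drift rates sum to a constant:
    competition at input, no inhibition).  Positions are averaged across
    trials, aligned on stimulus onset, up to each trial's decision."""
    steps = int(tmax / dt)
    total = np.zeros((steps, 2))
    count = np.zeros(steps)
    for _ in range(n):
        x = np.zeros(2)
        for i in range(steps):
            total[i] += x
            count[i] += 1
            x = x + np.asarray(v) * dt + s * np.sqrt(dt) * rng.standard_normal(2)
            if x.max() >= a:
                break                  # decision boundary reached; path ends
    return total / np.maximum(count, 1)[:, None]

x_mean = mean_paths()
r0, gain = 30.0, 500.0                 # hypothetical linear link to spikes/s
rate = r0 + gain * x_mean              # column 0: target; column 1: competitor
```

Under the linear linking assumption, the target accumulator's mean path, and hence its predicted firing rate, rises faster than the competitor's, mirroring the buildup activity recorded in SC neurons.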
The firing rate data show delayed availability of discriminative information for fast, intermediate, and slow decisions when activity is aligned on the stimulus and very small differences in discriminative information when activity is aligned on the saccade. The model produces exactly these patterns of results. The accumulation process is highly variable, allowing the process both to make errors, as is the case for the behavioral performance, and also to account for the firing rate results. Figure 3.10 shows sample results for the firing rate functions (black lines) and predicted firing rates (red lines). There have also been significant modeling efforts to relate models based on spiking neurons to
diffusion models (e.g., Deco, Rolls, Albantakis, & Romo, 2013; Roxin & Ledberg, 2008; Wong & Wang, 2006). Smith (2010) made an explicit connection between diffusion processes at a macro behavioral level and shot noise processes at a slightly abstract neural level. He sought to show how diffusive information accumulation at a behavioral level could arise by aggregating neural firing rate processes. He modeled the representation of stimulus information at the neural level as the difference between excitatory and inhibitory Poisson shot noise processes. The shot noise process describes the cumulative effects of a number of time-varying disturbances or perturbations, each of which is initiated by a point event; the point events arrive according to a Poisson process. These discrete pulses are assumed to decay exponentially, and the summation of overlapping decaying traces constitutes the shot noise process (e.g., Figure 3.1 of Smith, 2010). In his model, the disturbances represent the flux in postsynaptic potentials in a cell population in response to a sequence of action potentials. Smith showed that the time integral of such Poisson shot-noise pairs follows an integrated Ornstein-Uhlenbeck process, whose long-time scale statistics are very similar to those assumed in the standard diffusion model. His analysis showed how diffusive information at a behavioral level could arise from Poisson-like representations at the neural level. Subsequently, Smith and McKenzie (2011) investigated a simple model of how long time scale information accumulation could be realized at a neural level. Wang (2002) previously argued that models of decision making require information integration on a time scale that is an order of magnitude greater than any integration process found at a neural level. He argued that the most plausible substrate for such long-time scale integration is persistent activity in reverberation networks.
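A shot noise process of this kind is easy to simulate: Poisson point events inject pulses that decay exponentially, and the behavioral-level evidence is the time integral of the excitatory-inhibitory difference. The rates and decay constant below are illustrative assumptions, not values from Smith (2010).

```python
import numpy as np

rng = np.random.default_rng(3)

def shot_noise(rate, decay, dt, steps):
    # Each Poisson event adds a unit pulse; existing traces decay
    # exponentially, and the overlapping traces sum: a shot noise process.
    x = np.zeros(steps)
    for i in range(1, steps):
        x[i] = x[i - 1] * np.exp(-decay * dt) + rng.poisson(rate * dt)
    return x

dt, steps = 0.001, 2000
exc = shot_noise(rate=120.0, decay=50.0, dt=dt, steps=steps)  # excitatory flux
inh = shot_noise(rate=80.0, decay=50.0, dt=dt, steps=steps)   # inhibitory flux
evidence = np.cumsum(exc - inh) * dt   # time integral of the difference
```

Because the excitatory rate exceeds the inhibitory rate here, the integrated difference drifts upward like a diffusion process with positive drift, echoing the correspondence that Smith's analysis establishes formally.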
Smith and McKenzie considered a very simple model of a recurrent loop in which spikes cycle around the loop with exponentially distributed cycle times and new spikes are added by superposition. The activity in the loop could, therefore, be modeled as a superposition of Poisson processes. They showed that a model based on such recurrent loops could realize the kind of long-time scale integration process described by Wang and that it, too, exhibited a form of diffusive information accumulation that closely matches what is found behaviorally. In particular, the resulting model successfully predicted the RT distributions and
Fig. 3.10 Neural firing rates averaged over cells for firing rates aligned on the stimulus for the two monkeys from Ratcliff, Hasegawa et al. (2007). The firing rates are divided into thirds as a function of the behavioral response time (fastest third, middle third, and slowest third). The left-hand column shows easy conditions, bright responses to 98% white pixels and dark responses to 98% black pixels, and the right-hand column shows difficult conditions, bright responses to 55% white pixels and dark responses to 55% black pixels. The first row shows firing rates for cells in the receptive field of the target corresponding to the correct response when the correct response is made (target cell). The second row shows firing rates for cells in the receptive field of the target corresponding to the incorrect response for the stimulus when a correct response is made (competitor cell). The solid lines are the data and the dashed lines are model predictions.
choice probabilities from a signal detection experiment reported by Ratcliff and Smith (2004).
Human Neuroscience Diffusion models are currently being combined with fMRI and EEG techniques to look for stimulus-independent areas that implement decision-making (e.g., vmPFC, Heekeren, Marrett, Bandettini, & Ungerleider, 2004) and to map diffusion model parameters onto EEG signals (Philiastides et al., 2006). eeg support for across-trial variability in drift rate Philiastides, Ratcliff, and Sajda (2006) used a face/car discrimination task with briefly presented degraded pictures. They recorded EEGs from multiple electrodes during the task and then weighted and combined the electrical signals to obtain a single number, or regressor, that best discriminated between faces and cars. This was repeated over 60
ms windows from stimulus onset onward. The single-trial regressor was significant at two times, around 180 ms and around 380 ms. Ratcliff, Philiastides, and Sajda (2009) reasoned that, if the regressor was an index of difficulty, then in each condition of the experiment, responses could be sorted into those that the electrical signal said were more facelike and those that were more carlike. When responses were sorted and the diffusion model fit to the two halves of each condition, the drift rates for the two halves differed substantially, but only for the later component at 380 ms. The diffusion model provides an estimate of nondecision time, which represents the duration of encoding and stimulus transformation processes prior to the decision time (as well as response output processes). This estimate shows that the decision process begins no earlier than 400 ms after stimulus onset, and so the late EEG signal component indexes difficulty on a trial-to-trial basis prior to the onset of the decision process. Therefore, these two features of the late component
provide evidence that drift rate varies from trial to trial.
eeg support for across-trial variability in starting point Bode, Sewell, Lilburn, Forte, Smith, and Stahl (2012) reported EEG evidence consistent with trial-to-trial biasing of the starting point of the diffusion process. They recorded EEG activity in a task requiring discrimination between briefly presented images of chairs or pianos embedded in varying levels of noise and then backward masked. They applied a support vector machine pattern classifier to the EEG signals at successive time points and showed that decisions could be decoded (i.e., predicted) from the EEG several hundred milliseconds before the behavioral response. When the stimulus display contained only noise and no discriminative information, the decision outcome could still be predicted from the EEG, but only from the activity prior to stimulus presentation and not from any later time points. Bode et al. found that the RT distributions and accuracy in their task were well described by a diffusion model in which the starting point for evidence accumulation was biased toward the upper or lower boundary, depending on the participant’s previous choice history. They proposed that the information in the prestimulus EEG was a neural correlate of the process of setting the starting point, which occurs prior to the start of evidence accumulation. When the display contained no stimulus information and the drift of the diffusion process was zero, the primary determinant of the decision outcome would be the participant’s bias state: Processes starting near the upper boundary would be more likely to terminate at that boundary, and similarly for the lower boundary.
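The role of the starting point on noise-only trials is easy to demonstrate: for a zero-drift Wiener process between boundaries 0 and a, the probability of terminating at the upper boundary is z/a. The brief simulation below uses illustrative, not fitted, parameter values.

```python
import numpy as np

rng = np.random.default_rng(4)

def zero_drift_choice(z, a=0.1, s=0.1, dt=0.001):
    # Drift rate is zero (a noise-only display); only the starting point
    # z and the accumulated noise determine the decision outcome.
    x = z
    while 0.0 < x < a:
        x += s * np.sqrt(dt) * rng.standard_normal()
    return x >= a                      # True = upper-boundary response

# Estimated probability of an upper-boundary response for three bias states.
p_upper = {z: np.mean([zero_drift_choice(z) for _ in range(400)])
           for z in (0.03, 0.05, 0.07)}
```

The estimates fall near the theoretical values 0.3, 0.5, and 0.7: a process starting nearer the upper boundary is correspondingly more likely to terminate there, which is the kind of bias state Bode et al. inferred from the prestimulus EEG.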
fmri A major problem with attempts to relate results from fMRI measurements to the growth of activity in decision-related brain areas is the sluggishness of the BOLD response. Despite this, there are many studies that use diffusion models in analyses of fMRI data. Mulder, van Maanen, and Forstmann (2014) reviewed a number of studies of perceptual decision making using fMRI methods and found evidence for regions associated with different components of diffusion models. Although there was some convergence, maps of the peak coordinates of the activity for model components showed quite a large scatter across areas. This research would require a chapter by itself, but the notion that some brain areas accumulate noisy evidence from other areas is certainly a mainstream belief in neuroscience, and diffusion models are one theoretical framework that relates the neural to the behavioral level.
structural mri Studies that have examined structural connections between brain areas that are implicated in the control of decision making have found correlations between tract strength and decision-making variables. Forstmann et al. (2010) found a relationship between cortico-striatal connection strength and the ability of subjects to change their speed-accuracy tradeoff settings. Mulder, Boekel, Ratcliff, & Forstmann (2014) found correlations between subjects’ ability to bias their responses in response to reward and vmPFC-STN connection strength. These studies are the beginning of a new approach to brain structure and processing.
Conclusions The use of diffusion models in representing simple decision-making in a variety of domains is an area of research that is seeing significant advances. The view that evidence is accumulated over time to decision criteria now seems settled. The competing models seem to produce about the same conclusions about processing within experimental paradigms, and so broad interpretations do not depend on the specific model being used. In psychological applications, the basic theory and experimental applications are well established and somewhat mature. But applications to individual differences (including neuropsychological testing) and to different subject and patient populations are in their infancy. Also, neuroscience applications in both experimental and theoretical research are blossoming, with a variety of experimental methods being used as well as a variety of variants on the basic models developed in psychology.
Author Note Preparation of this chapter was supported by grants NIA R01-AG041176, AFOSR grant FA9550-11-1-0130, IES grant R305A120189, and by ARC Discovery Grant DP140102970.
Glossary Accumulator Model: A model in which positive increments are continuous random variables and the times at which the increments are made are discrete. The accumulators race to separate decision criteria. Confidence Judgments: Tasks in which responses are made on a discrete scale using different response keys. Decision Boundaries: These represent the amount of evidence needed to make a decision. Decision Criteria: The amount of evidence for one or the other alternative needed to make a decision. In diffusion models, the criteria are represented as boundaries on the evidence space. Diffusion Model: A model that assumes continuously available evidence in continuous time. Evidence accumulates in one signed sum and the process terminates when one of two decision criteria is reached. Diffusion Process: A process in which continuously variable noisy evidence is accumulated in continuous time. Drift rate: The average rate at which a diffusion process accumulates evidence. Go/Nogo Tasks: Tasks in which subjects respond to one stimulus type but, for the other, withhold their response until a timeout. Leaky Competing Accumulator Model: A model in which evidence is continuously available in continuous time. Evidence is accumulated in separate accumulators (i.e., separate diffusion processes) and there is both decay in an accumulator and inhibition from other accumulators. Nondecision Time: Duration of processes other than the decision process. These include encoding time, response output time, memory access time in memory tasks, and the time to transform the stimulus representation to a decision-based representation for perceptual tasks. Optimality: Often defined in terms of “reward rate,” or the number correct per unit time, in simple decision-making experiments, by analogy with animal experiments.
Ornstein-Uhlenbeck diffusion process: This describes a noisy evidence accumulation process with leakage or decay; the standard (Wiener or Brownian motion) diffusion process describes a process in which there is no leakage. Poisson Counter Model: A model in which increments are discrete equal-sized units, but the times at which they arrive at the accumulators are Poisson distributed (exponential delays between counts). Poisson shot noise process: A process in which each point event in a Poisson process generates a continuous, time-varying disturbance or perturbation. The shot noise process is the cumulative sum of the perturbations. The shot noise process has been used as a model for a variety of phenomena, including the flow of electrons in vacuum tubes, the cumulative effects of earth tremors, and the flux in the postsynaptic potential in cell bodies in a neural population. PVT: The psychomotor vigilance test, in which a counter starts counting up and the subject simply hits a key to stop it counting.
Random walk model: A discrete-time counterpart of the diffusion process. A diffusion process accumulates evidence in continuous time, whereas a random walk accumulates evidence at discrete time points. Response Signal and Deadline Tasks: Tasks in which the subject is required to respond at an experimenter-determined time. The dependent variable is usually accuracy, and the task measures how it grows over time in the decision process. Response Time Distributions: The distribution of times at which the decision process terminates (i.e., a histogram of times for data). Single Cell Recording in Animals: Recordings from single neurons, often in awake, behaving animals.
References Balota, D. A., & Chumbley, J. I. (1984). Are lexical decisions a good measure of lexical access? The role of word frequency in the neglected decision stage. Journal of Experimental Psychology: Human Perception and Performance, 10, 340–357. Basso, M. A., & Wurtz, R. H. (1998). Modulation of neuronal activity in superior colliculus by changes in target probability. Journal of Neuroscience, 18, 7519–7534. Bode, S., Sewell, D. K., Lilburn, S., Forte, J. D., Smith, P. L., & Stahl, J. (2012). Predicting perceptual decisions from early brain activity. Journal of Neuroscience, 32, 12488–12498. Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced choice tasks. Psychological Review, 113, 700–765. Boucher, L., Palmeri, T., Logan, G., & Schall, J. (2007). Inhibitory control in mind and brain: An interactive race model of countermanding saccades. Psychological Review, 114, 376–397. Brown, S. D., & Heathcote, A. J. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178. Brown, S. D., Marley, A. A. J., Donkin, C., & Heathcote, A. J. (2008). An integrated model of choices and response times in absolute identification. Psychological Review, 115, 396–425. Buonocore, A., Giorno, V., Nobile, A. G., & Ricciardi, L. (1990). On the two-boundary first-crossing-time problem for diffusion processes. Journal of Applied Probability, 27, 102–114. Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review, 100, 432–459. Churchland, A. K., Kiani, R., & Shadlen, M. N. (2008). Decision-making with multiple alternatives. Nature Neuroscience, 11, 693–702. Deco, G., Rolls, E. T., Albantakis, L., & Romo, R. (2013). Brain mechanisms for perceptual and reward-related decision-making.
Progress in Neurobiology, 103, 194–213. Dennis, S. & Humphreys, M. S. (2001). A context noise model of episodic word recognition. Psychological Review, 108, 452–477.
Diederich, A., & Busemeyer, J. R. (2003). Simple matrix methods for analyzing diffusion models of choice probability, choice response time, and simple response time. Journal of Mathematical Psychology, 47, 304–322. Diederich, A., & Busemeyer, J. R. (2006). Modeling the effects of payoff on response bias in a perceptual discrimination task: Bound-change, drift-rate-change, or two-stage-processing hypothesis. Perception & Psychophysics, 68, 194–207. Ditterich, J. (2006). Computational approaches to visual decision making. In D. J. Chadwick, M. Diamond, & J. Goode (Eds.), Percept, decision, action: Bridging the gaps (p. 114). Chichester, UK: Wiley. Ditterich, J. (2010). A comparison between mechanisms of multi-alternative perceptual decision making: Ability to explain human behavior, predictions for neurophysiology, and relationship with decision theory. Frontiers in Neuroscience, 4, 184. Donkin, C., Brown, S., Heathcote, A., & Wagenmakers, E. J. (2011). Diffusion versus linear ballistic accumulation: Different models for response time, same conclusions about psychological mechanisms? Psychonomic Bulletin & Review, 55, 140–151. Feller, W. (1968). An introduction to probability theory and its applications. New York, NY: Wiley. Forster, K. I. (1976). Accessing the mental lexicon. In R. J. Wales & E. Walker (Eds.), New approaches to language mechanisms (pp. 257–287). Amsterdam, Netherlands: North-Holland. Forstmann, B. U., Anwander, A., Schafer, A., Neumann, J., Brown, S., Wagenmakers, E.-J., Bogacz, R., & Turner, R. (2010). Cortico-striatal connections predict control over speed and accuracy in perceptual decision making. Proceedings of the National Academy of Sciences, 107, 15916–15920. Frank, M. J. (2006). Hold your horses: A dynamic computational role for the subthalamic nucleus in decision making. Neural Networks, 19, 1120–1136. Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91, 1–67.
Geddes, J., Ratcliff, R., Allerhand, M., Childers, R., Wright, R. J., Frier, B. M., & Deary, I. J. (2010). Modeling the effects of hypoglycemia on a two-choice task in adult humans. Neuropsychology, 24, 652–660. Gold, J. I., & Shadlen, M. N. (2000). Representation of a perceptual decision in developing oculomotor commands. Nature, 404, 390–394. Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Science, 5, 10–16. Gold, J. I., & Shadlen, M. N. (2007). The neural basis of decision making. Annual Review of Neuroscience, 30, 535–574. Gomez, P., Ratcliff, R., & Perea, M. (2007). A model of the go/no-go task. Journal of Experimental Psychology: General, 136, 347–369. Grinband, J., Hirsch, J., & Ferrera, V.P. (2006). A neural representation of categorization uncertainty in the human brain. Neuron, 49, 757–763.
Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430. Heekeren, H. R., Marrett, S., Bandettini, P. A., & Ungerleider, L. G. (2004). A general mechanism for perceptual decision-making in the human brain. Nature, 431, 859–862. Hintzman, D. (1986). “Schema abstraction” in a multiple-trace memory model. Psychological Review, 93, 411–428. Krajbich, I., & Rangel, A. (2011). A multi-alternative drift diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proceedings of the National Academy of Sciences, 108, 13852–13857. Kounios, J., Osman, A. M., & Meyer, D. E. (1987). Structure and process in semantic memory: New evidence based on speed-accuracy decomposition. Journal of Experimental Psychology: General, 116, 3–25. Laming, D. R. J. (1968). Information theory of choice reaction time. New York, NY: Wiley. Leite, F. P., & Ratcliff, R. (2010). Modeling reaction time and accuracy of multiple-choice decisions. Attention, Perception and Psychophysics, 72, 246–273. Leite, F. P., & Ratcliff, R. (2011). What cognitive processes drive response biases? A diffusion model analysis. Judgment and Decision Making, 6, 651–687. Link, S. W., & Heath, R. A. (1975). A sequential theory of psychological discrimination. Psychometrika, 40, 77–105. Luce, R. D. (1986). Response times. New York, NY: Oxford University Press. Mazurek, M. E., Roitman, J. D., Ditterich, J., & Shadlen, M. N. (2003). A role for neural integrators in perceptual decision-making. Cerebral Cortex, 13, 1257–1269. McClelland, J. L., & Chappell, M. (1998). Familiarity breeds differentiation: A Bayesian approach to the effects of experience in recognition memory. Psychological Review, 105, 724–760. McKoon, G., & Ratcliff, R. (1992). Spreading activation versus compound cue accounts of priming: Mediated priming revisited. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 1155–1172. McKoon, G., & Ratcliff, R. (2012).
Aging and IQ effects on associative recognition and priming in item recognition. Journal of Memory and Language, 66, 416–437. McNamara, T. P. (1992). Priming and constraints it places on theories of memory and retrieval. Psychological Review, 99, 650–662. McNamara, T. P. (1994). Priming and theories of memory: A reply to Ratcliff and McKoon. Psychological Review, 101, 185–187. Meyer, D. E., Irwin, D. E., Osman, A. M., & Kounios, J. (1988). The dynamics of cognition: Mental processes inferred from a speed-accuracy decomposition technique. Psychological Review, 95, 183–237. Milosavljevic, M., Malmaud, J., Huth, A., Koch, C., & Rangel, A. (2010). The drift diffusion model can account for the accuracy and reaction times of value-based choice under high and low time pressure. Judgment and Decision Making, 5, 437–449. Morton, J. (1969). The interaction of information in word recognition. Psychological Review, 76, 165–178. Mulder, M. J., Boekel, W., Ratcliff, R., & Forstmann, B. U. (2014). Cortico-subthalamic connection predicts individual
differences in value-driven choice bias. Brain Structure & Function, 219, 1239–1249. Mulder, M. J., Bos, D., Weusten, J. M. H., van Belle, J., van Dijk, S. C., Simen, P., van Engeland, H., & Durston, S. (2010). Basic impairments in regulating the speed-accuracy tradeoff predict symptoms of attention-deficit/hyperactivity disorder. Biological Psychiatry, 68, 1114–1119. Mulder, M., van Maanen, L., & Forstmann, B. U. (2014). Perceptual decision neurosciences: A model-based review. Neuroscience, 277, 872–884. Munoz, D. P., & Wurtz, R. H. (1995). Saccade-related activity in monkey superior colliculus. I. Characteristics of burst and buildup cells. Journal of Neurophysiology, 73, 2313–2333. Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89, 609–626. Niwa, M., & Ditterich, J. (2008). Perceptual decisions between multiple directions of visual motion. Journal of Neuroscience, 28, 4435–4445. Norris, D. (2006). The Bayesian reader: Explaining word recognition as an optimal Bayesian decision process. Psychological Review, 113, 327–357. Oberauer, K., Süß, H.-M., Wilhelm, O., & Wittmann, W. W. (2003). The multiple faces of working memory: Storage, processing, supervision, and coordination. Intelligence, 31, 167–193. Palmer, J., Huk, A. C., & Shadlen, M. N. (2005). The effect of stimulus strength on the speed and accuracy of a perceptual decision. Journal of Vision, 5, 376–404. Philiastides, M., & Ratcliff, R. (2013). Influence of branding on preference-based decision making. Psychological Science, 24, 1208–1215. Philiastides, M. G., Ratcliff, R., & Sajda, P. (2006). Neural representation of task difficulty and decision making during perceptual categorization: A timing diagram. Journal of Neuroscience, 26, 8965–8975. Platt, M., & Glimcher, P. W. (1999). Neural correlates of decision variables in parietal cortex. Nature, 400, 233–238. Pleskac, T. J., & Busemeyer, J. R. (2010).
Two-stage dynamic signal detection: A theory of choice, decision time, and confidence. Psychological Review, 117, 864–901. Purcell, B. A., Heitz, R. P., Cohen, J. Y., Schall, J. D., Logan, G. D., & Palmeri, T. J. (2010). Neurally-constrained modeling of perceptual decision making. Psychological Review, 117, 1113–1143. Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108. Ratcliff, R. (1980). A note on modelling accumulation of information when the rate of accumulation changes over time. Journal of Mathematical Psychology, 21, 178–184. Ratcliff, R. (1985). Theoretical interpretations of speed and accuracy of positive and negative responses. Psychological Review, 92, 212–225. Ratcliff, R. (1988). A note on the mimicking of additive reaction time models. Journal of Mathematical Psychology, 32, 192– 204. Ratcliff, R. (2002). A diffusion model account of reaction time and accuracy in a two choice brightness discrimination task: Fitting real data and failing to fit fake but plausible data. Psychonomic Bulletin and Review, 9, 278–291.
Ratcliff, R. (2006). Modeling response signal and response time data. Cognitive Psychology, 53, 195–237. Ratcliff, R. (2013). Parameter variability and distributional assumptions in the diffusion model. Psychological Review, 120, 281–292. Ratcliff, R., Cherian, A., & Segraves, M. (2003). A comparison of macaque behavior and superior colliculus neuronal activity to predictions from models of simple two-choice decisions. Journal of Neurophysiology, 90, 1392–1407. Ratcliff, R., & Frank, M. (2012). Reinforcement-based decision making in corticostriatal circuits: Mutual constraints by neurocomputational and diffusion models. Neural Computation, 24, 1186–1229. Ratcliff, R., Gomez, P., & McKoon, G. (2004). A diffusion model account of the lexical-decision task. Psychological Review, 111, 159–182. Ratcliff, R., Hasegawa, Y. T., Hasegawa, Y. P., Childers, R., Smith, P. L., & Segraves, M. A. (2011). Inhibition in superior colliculus neurons in a brightness discrimination task? Neural Computation, 23, 1790–1820. Ratcliff, R., Hasegawa, Y. T., Hasegawa, Y. P., Smith, P. L., & Segraves, M. A. (2007). Dual diffusion model for single-cell recording data from the superior colliculus in a brightness discrimination task. Journal of Neurophysiology, 97, 1756–1774. Ratcliff, R., Love, J., Thompson, C. A., & Opfer, J. (2012). Children are not like older adults: A diffusion model analysis of developmental changes in speeded responses. Child Development, 83, 367–381. Ratcliff, R., & McKoon, G. (1988). A retrieval theory of priming in memory. Psychological Review, 95, 385–408. Ratcliff, R., & McKoon, G. (1994). Retrieving information from memory: Spreading activation theories versus compound cue theories. Psychological Review, 101, 177–184. Ratcliff, R., & McKoon, G. (2008). The diffusion decision model: Theory and data for two-choice decision tasks. Neural Computation, 20, 873–922. Ratcliff, R., McKoon, G., & Tindall, M. H. (1994). 
Empirical generality of data from recognition memory receiver operating characteristic functions and implications for the global memory models. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 763–785. Ratcliff, R., Perea, M., Colangelo, A., & Buchanan, L. (2004). A diffusion model account of normal and impaired readers. Brain & Cognition, 55, 374–382. Ratcliff, R., Philiastides, M. G., & Sajda, P. (2009). Quality of evidence for perceptual decision making is indexed by trial-to-trial variability of the EEG. Proceedings of the National Academy of Sciences, 106, 6539–6544. Ratcliff, R., & Rouder, J. N. (2000). A diffusion model account of masking in letter identification. Journal of Experimental Psychology: Human Perception and Performance, 26, 127–140. Ratcliff, R., Sheu, C.-F., & Gronlund, S. D. (1992). Testing global memory models using ROC curves. Psychological Review, 99, 518–535. Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sampling models for two-choice reaction time. Psychological Review, 111, 333–367. Ratcliff, R., & Smith, P. L. (2010). Perceptual discrimination in static and dynamic noise: the temporal relation
between perceptual encoding and decision making. Journal of Experimental Psychology: General, 139, 70–94. Ratcliff, R., & Starns, J. J. (2009). Modeling confidence and response time in recognition memory. Psychological Review, 116, 59–83. Ratcliff, R., & Starns, J. J. (2013). Modeling confidence judgments, response times, and multiple choices in decision making: recognition memory and motion discrimination. Psychological Review, 120, 697–719. Ratcliff, R., Thapar, A., Gomez, P., & McKoon, G. (2004). A diffusion model analysis of the effects of aging in the lexical-decision task. Psychology and Aging, 19, 278–289. Ratcliff, R., Thapar, A., & McKoon, G. (2003). A diffusion model analysis of the effects of aging on brightness discrimination. Perception and Psychophysics, 65, 523–535. Ratcliff, R., Thapar, A., & McKoon, G. (2004). A diffusion model analysis of the effects of aging on recognition memory. Journal of Memory and Language, 50, 408–424. Ratcliff, R., Thapar, A., & McKoon, G. (2006). Aging, practice, and perceptual tasks: A diffusion model analysis. Psychology and Aging, 21, 353–371. Ratcliff, R., Thapar, A., & McKoon, G. (2007). Application of the diffusion model to two-choice tasks for adults 75–90 years old. Psychology and Aging, 22, 56–66. Ratcliff, R., Thapar, A., & McKoon, G. (2010). Individual differences, aging, and IQ in two-choice tasks. Cognitive Psychology, 60, 127–157. Ratcliff, R., Thapar, A., & McKoon, G. (2011). Effects of aging and IQ on item and associative memory. Journal of Experimental Psychology: General, 140, 464–487. Ratcliff, R., Thapar, A., Smith, P. L., & McKoon, G. (2005). Aging and response times: A comparison of sequential sampling models. In J. Duncan, P. McLeod, & L. Phillips (Eds.), Speed, control, and age. Oxford, England: Oxford University Press. Ratcliff, R., & Tuerlinckx, F. (2002). Estimating the parameters of the diffusion model: Approaches to dealing with contaminant reaction times and parameter variability. 
Psychonomic Bulletin and Review, 9, 438–481. Ratcliff, R., & Van Dongen, H. P. A. (2009). Sleep deprivation affects multiple distinct cognitive processes. Psychonomic Bulletin and Review, 16, 742–751. Ratcliff, R., & Van Dongen, H. P. A. (2011). A diffusion model for one-choice reaction time tasks and the cognitive effects of sleep deprivation. Proceedings of the National Academy of Sciences, 108, 11285–11290. Ratcliff, R., Van Zandt, T., & McKoon, G. (1999). Connectionist and diffusion models of reaction time. Psychological Review, 106, 261–300. Reed, A. V. (1973). Speed-accuracy trade-off in recognition memory. Science, 181, 574–576. Roe, R. M., Busemeyer, J. R., & Townsend, J. T. (2001). Multialternative decision field theory: A dynamic connectionist model of decision-making. Psychological Review, 108, 370–392. Roitman, J. D., & Shadlen, M. N. (2002). Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time task. Journal of Neuroscience, 22, 9475–9489.
Roxin, A., & Ledberg, A. (2008). Neurobiological models of two-choice decision making can be reduced to a one-dimensional nonlinear diffusion equation. PLoS Computational Biology, 4, e1000046. Schmiedek, F., Oberauer, K., Wilhelm, O., Süß, H.-M., & Wittmann, W. (2007). Individual differences in components of reaction time distributions and their relations to working memory and intelligence. Journal of Experimental Psychology: General, 136, 414–429. Schouten, J. F., & Bekker, J. A. M. (1967). Reaction time and accuracy. Acta Psychologica, 27, 143–153. Sewell, D. K., & Smith, P. L. (2012). Attentional control in visual signal detection: Effects of abrupt-onset and no-onset stimuli. Journal of Experimental Psychology: Human Perception and Performance, 38, 1043–1068. Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86, 1916–1935. Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM: Retrieving effectively from memory. Psychonomic Bulletin and Review, 4, 145–166. Smith, P. L. (1995). Psychophysically principled models of visual simple reaction time. Psychological Review, 102, 567–593. Smith, P. L. (2000). Stochastic dynamic models of response time and accuracy: A foundational primer. Journal of Mathematical Psychology, 44, 408–463. Smith, P. L. (2010). From Poisson shot noise to the integrated Ornstein-Uhlenbeck process: Neurally-principled models of diffusive evidence accumulation in decision-making and response time. Journal of Mathematical Psychology, 54, 266–283. Smith, P. L., Ellis, R., Sewell, D. K., & Wolfgang, B. J. (2010). Cued detection with compound integration-interruption masks reveals multiple attentional mechanisms. Journal of Vision, 10, 1–28. Smith, P. L., & McKenzie, C. (2011). Diffusive information accumulation by minimal recurrent neural models of decision making. Neural Computation, 23, 2000–2031. Smith, P. 
L., & Ratcliff, R. (2004). The psychology and neurobiology of simple decisions. Trends in Neurosciences, 27, 161–168. Smith, P. L., & Ratcliff, R. (2009). An integrated theory of attention and decision making in visual signal detection. Psychological Review, 116, 283–317. Smith, P. L., Ratcliff, R., & Sewell, D. K. (2014). Modeling perceptual discrimination in dynamic noise: Time-changed diffusion and release from inhibition. Journal of Mathematical Psychology, 59, 95–113. Smith, P. L., Ratcliff, R., & Wolfgang, B. J. (2004). Attention orienting and the time course of perceptual decisions: response time distributions with masked and unmasked displays. Vision Research, 44, 1297–1320. Smith, P. L., & Vickers, D. (1988). The accumulator model of two-choice discrimination. Journal of Mathematical Psychology, 32, 135–168. Sperling, G., & Dosher, B. A. (1986). Strategy and optimization in human information processing. In K. Boff, L. Kaufman, and J. Thomas (Eds.), Handbook of perception and performance (Vol. 1, pp. 1–65). New York, NY: Wiley.
Starns, J. J., & Ratcliff, R. (2010). The effects of aging on the speed-accuracy compromise: Boundary optimality in the diffusion model. Psychology and Aging, 25, 377–390. Starns, J. J., & Ratcliff, R. (2012). Age-related differences in diffusion model boundary optimality with both trial-limited and time-limited tasks. Psychonomic Bulletin and Review, 19, 139–145. Starns, J. J., & Ratcliff, R. (2014). Validating the unequal-variance assumption in recognition memory using response time distributions instead of ROC functions: A diffusion model analysis. Journal of Memory and Language, 70, 36–52. Starns, J. J., Ratcliff, R., & McKoon, G. (2012). Evaluating the unequal-variability and dual-process explanations of zROC slopes with response time data and the diffusion model. Cognitive Psychology, 64, 1–34. Stone, M. (1960). Models for choice reaction time. Psychometrika, 25, 251–260. Thapar, A., Ratcliff, R., & McKoon, G. (2003). A diffusion model analysis of the effects of aging on letter discrimination. Psychology and Aging, 18, 415–429. Townsend, J. T. (1972). Some results concerning the identifiability of parallel and serial processes. British Journal of Mathematical and Statistical Psychology, 25, 168–197. Townsend, J. T., & Ashby, F. G. (1983). Stochastic modeling of elementary psychological processes. Cambridge: Cambridge University Press. Townsend, J. T., & Wenger, M. J. (2004). A theory of interactive parallel processing: New capacity measures and predictions for a response time inequality series. Psychological Review, 111, 1003–1035. Tuerlinckx, F., Maris, E., Ratcliff, R., & De Boeck, P. (2001). A comparison of four methods for simulating the diffusion process. Behavior Research Methods, Instruments, & Computers, 33, 443–456. Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108, 550–592. Vandekerckhove, J., & Tuerlinckx, F. 
(2007). Fitting the Ratcliff diffusion model to experimental data. Psychonomic Bulletin & Review, 14, 1011–1026. Vandekerckhove, J., Tuerlinckx, F., & Lee, M. D. (2011). Hierarchical diffusion models for two-choice response times. Psychological Methods, 16, 44–62. van Ravenzwaaij, D., Dutilh, G., & Wagenmakers, E.-J. (2012). A diffusion model decomposition of the effects of alcohol on perceptual decision making. Psychopharmacology, 219, 1017–1025. van Ravenzwaaij, D., van der Maas, H. L. J., & Wagenmakers, E.-J. (2012). Optimal decision making in neural inhibition models. Psychological Review, 119, 201–215. Van Zandt, T. (2002). Analysis of response time distributions. In J. T. Wixted (Vol. Ed.) & H. Pashler (Series Ed.), Stevens’
Handbook of Experimental Psychology (3rd ed.), Volume 4: Methodology in Experimental Psychology (pp. 461–516). New York, NY: Wiley. Vickers, D. (1970). Evidence for an accumulator model of psychophysical discrimination. Ergonomics, 13, 37–58. Vickers, D. (1979). Decision processes in visual perception. New York, NY: Academic Press. Vickers, D., Caudrey, D., & Willson, R. J. (1971). Discriminating between the frequency of occurrence of two alternative events. Acta Psychologica, 35, 151–172. Voss, A., & Voss, J. (2007). Fast-dm: A free program for efficient diffusion model analysis. Behavior Research Methods, 39, 767–775. Wang, X. J. (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36, 955–968. Watson, A. B. (1986). Temporal sensitivity. In K. R. Boff, L. Kaufman, & J. P. Thomas (Eds.), Handbook of perception and human performance (pp. 6-1 to 6-43). New York, NY: Wiley. White, C. N., Ratcliff, R., & Starns, J. J. (2011). Diffusion models of the flanker task: Discrete versus gradual attentional selection. Cognitive Psychology, 63, 210–238. White, C., Ratcliff, R., Vasey, M., & McKoon, G. (2009). Dysphoria and memory for emotional material: A diffusion model analysis. Cognition and Emotion, 23, 181–205. White, C. N., Ratcliff, R., Vasey, M. W., & McKoon, G. (2010). Using diffusion models to understand clinical disorders. Journal of Mathematical Psychology, 54, 39–52. Wickelgren, W. A. (1977). Speed-accuracy tradeoff and information processing dynamics. Acta Psychologica, 41, 67–85. Wickelgren, W. A., Corbett, A. T., & Dosher, B. A. (1980). Priming and retrieval from short-term memory: A speed-accuracy trade-off analysis. Journal of Verbal Learning and Verbal Behavior, 19, 387–404. Wiecki, T. V., Sofer, I., & Frank, M. J. (2013). HDDM: Hierarchical Bayesian estimation of the Drift-Diffusion Model in Python. Frontiers in Neuroinformatics, 7, 1–10. Wong, K.-F., & Wang, X.-J. (2006). 
A recurrent network mechanism for time integration in perceptual decisions. Journal of Neuroscience, 26, 1314–1328. Yonelinas, A. P. (1997). Recognition memory ROCs for item and associative information: The contribution of recollection and familiarity. Memory & Cognition, 25, 747–763. Zeguers, M. H. T., Snellings, P., Tijms, J., Weeda, W. D., Tamboer, P., Bexkens, A., & Huizenga, H. M. (2011). Specifying theories of developmental dyslexia: A diffusion model analysis of word recognition. Developmental Science, 14, 1340–1354. Zohary, E., Shadlen, M., & Newsome, W. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.
elementary cognitive mechanisms
CHAPTER 4
Features of Response Times: Identification of Cognitive Mechanisms through Mathematical Modeling
Daniel Algom, Ami Eidels, Robert X. D. Hawkins, Brett Jefferson, and James T. Townsend
Abstract
Psychology is one of the most recent sciences to issue from the mother-tree of philosophy. One of the greatest challenges is that of formulating theories and methodologies that move the field toward theoretical structures that are not only sufficient to explain and predict phenomena but, in some vital sense, necessary for those purposes. Mathematical modeling is perhaps the most promising general strategy, but even under that aegis, the physical sciences have labored toward that end. The present chapter begins by outlining the roots of our approach in 19th century physics, physiology, and psychology. Then, we witness the renaissance of goals in the 1960s, which were envisioned but not usually realizable in 19th century science and methodology. It could be contended that it is impossible to know the full story of what can be learned through scientific method in the absence of what cannot be known. This precept brings us into the slough of model mimicry, wherein even diametrically opposed physical or psychological concepts can be mathematically equivalent within specified observational theatres! Discussion of examples from close to half a century of research illustrates what we conceive of as unfortunate missteps from the psychological literature as well as what has been learned through careful application of the attendant principles. We conclude with a statement concerning ongoing expansion of our body of approaches and what we might expect in the future. Key Words: parallel processing, serial processing, mimicking, capacity, response times,
stochastic processes, visual search, redundant targets, history of response time measurement
From Past to Future: Main Currents in the Evolution of Reaction Time as a Tool in the Study of Human Information Processing
If time has a history (Hawking, 1988), the timing of mental events certainly does. The idea that human sensations, feelings, or thoughts occur in real time seemed preposterous less than two centuries ago. When the idea finally gained traction, its gradual acceptance in psychology was often accompanied by much rancor that continued well beyond the development of the first attempts at measurement. After some early
progress that had been made in harnessing latency or reaction time (RT) to the study of psychological processes, Titchener (1905, p. 363) was still pondering whether “we have any right to speak of the ‘duration’ of mental processes.” Putting the term duration in inverted commas indicates the recent origin of usage of the term as well as Titchener’s own doubts about its validity or serviceability. Thirty years later, Robert Sessions Woodworth in his celebrated Experimental Psychology argued against acceptance of the first method to use reaction time. In a section poignantly titled,
“Discarding the subtraction method” (Woodworth 1938, p. 309), Woodworth expressed broader and deep-seated reservations, observing that because “we cannot break up the reaction into successive acts and obtain the time for each act, of what use is reaction time?” Even more recent is Johnson’s (1955, p. 5) assertion that, “The reaction-time experiment suggests a method for the analysis of mental processes that turned out to be unworkable.” An onerous history granted, the use of RT is firmly established in modern cognitive psychology not least due to the general conceptual framework provided by the domain known as the information-processing approach. Within this framework, RT is used in a systematic, theoretically guided fashion in the quest to isolate the underlying processes and their interactions activated by a given experimental task (cf. Laming 1968; Luce 1986; Townsend and Ashby 1983; Welford 1980). Nevertheless, we would be remiss if we did not examine, if only in passing, the essence of Woodworth’s reasoning. Woodworth’s concerns hark back to the forceful argument on the continuity of consciousness offered by William James in his seminal Principles of Psychology (see in particular, James 1890, Vol. 1, p. 244). In the chapter on the stream of thought, James contends that, due to its absolute continuity, thought or consciousness cannot be divided up for analysis. His attack is directed against the possibility of introspecting minute mental experiences, but the objection is equally cogent with respect to RT. When obtaining a value of RT, one measures the duration between two markers in time, usually that between some specified signal and the observer’s response. The RT is then taken to represent the time consumed by an internal process needed to perform a mental task. However, if mental processes are not amenable to partition, any pair of markers must be considered arbitrary. 
On a deeper level, the situation is a replica or subspecies of the relationship between nature and language as discussed by Friedrich Nietzsche (1873). Nature might well comprise a continuous whole, but human language (used to describe nature) is always discrete. How does one treat a continuous variable with discrete tools? Without dwelling on this issue in any depth, the upshot is clear. A fundamental, yet heretofore unarticulated assumption underlying all RT-based models, serial or parallel, is this: Natural mental functioning can be divided into separate, psychologically meaningful acts. Returning to history, why did the idea that mental acts occur in real, hence measurable, time
seem so incredible less than 200 years ago? The physiology of the human nervous system had made startling advances just around that time, but for many centuries the main thrust of attempts to understand the system along with the attendant sensations fell under the rubric of “vitalism.” Vitalism is the doctrine that there is a fundamental difference between living organisms and nonliving matter because the former entail something that is missing from the latter. Pinpointing just what this “something” was has proved elusive, yet the doctrine enjoyed widespread influence from antiquity (the Greek anatomist Galen held that vital spirits are necessary for life) to the 19th century (for all his great contributions to physiology, the towering figure of Johannes Müller subscribed to vitalism) to our own time (Freud’s “psychic energy,” “emerging property,” or even “mind” itself come to mind). Vitalism is best understood as opposition to the Cartesian extension of mechanistic explanations to biology (Bechtel and Richardson 1998; Rakover 2007). It is against this background of the strong influence of vitalism that researchers at the time believed that nerve conduction was instantaneous (on the order of the speed of light or faster) and that, at any rate, it was too fast to be measured.
Hermann von Helmholtz’s Measurement of the Speed of the Nerve Impulse
Therefore, Hermann von Helmholtz (1821–1894) along with his fellow students at Johannes Müller’s Berlin Institute of Physiology had to summon their best judgment and blood (signing their antivitalism oath) to rebuff their teacher, and espouse a strictly mechanistic position. Under the circumstances, it was a bold move on the part of Helmholtz and his peers to consider the moving nerve impulse as (merely) an event in space-time on a par with, say, that of a moving locomotive. Devising an ingenious method for measuring time, Helmholtz proceeded to measure the speed of the former. He stimulated a motor nerve in a frog’s leg and found that the latency of the muscular response depended on the distance of the stimulation from the muscle: the smaller the distance, the faster the response. Helmholtz’s calculations showed that the propagation of the impulse down the nerve was surprisingly slow, between 25 and 43 meters per second. Regardless of the value, it became evident that the speed of nerve conduction was finite and measurable! More boldly yet, Helmholtz turned to humans, asking participants to push a button
when they felt stimulation in their leg. Predictably enough, people reacted to stimulation in the toe more slowly than to stimulation in the thigh. Helmholtz estimated the speed of nerve conduction in humans to be between 43 and 150 meters per second. The large range is notable, attesting to considerable variability. It was this variability, within individuals as well as between individuals, that discouraged Helmholtz from further pursuing RT research as a reliable means of psychological investigation. The last point is also notable because individual differences were the subject of a now-famous incident at the Greenwich Observatory, which occurred half a century before Helmholtz’s measurements. Assistant astronomer David Kinnebrook was relieved of his job by his superior, Nevil Maskelyne, due to disagreement in reading the time that a star crossed the hairline in a telescope. The superior found that his assistant’s observations were a fraction of a second longer than his own. Twenty years later, this little-noticed incident (at the time) came to the attention of the German astronomer F. W. Bessel, who started to compare transit times by various astronomers. This first RT study revealed that all astronomers differed in their recordings. In order to cancel out individual variation from the astronomic calculations, Bessel set out to construct “personal equations” as a means to correct or equate differences among observers. Notice that the concept of “personal equation” assumes small (to nil) intra-individual variability in tandem with stable interindividual differences. Neither notion proved to be correct as Helmholtz witnessed with his observers. It turns out that variability, whether of intra- or inter-individual species, is a fixture of RT measurement. It is at this juncture that models developed within the generic framework of human information processing become truly valuable, attempting to disentangle the various sources of RT variability.
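Helmholtz’s inference can be cast as simple arithmetic: stimulate at two sites lying at different distances from the responding end, and divide the extra distance by the extra latency. Everything the two conditions share (central delays, motor time) cancels in the subtraction. The following minimal sketch uses invented numbers merely in the ballpark of Helmholtz’s human estimates, not his actual data:

```python
# Estimate nerve-conduction velocity from latencies measured at two
# stimulation sites, following Helmholtz's two-site logic.
# All numbers below are illustrative, not Helmholtz's data.

def conduction_velocity(d_near_m, rt_near_s, d_far_m, rt_far_s):
    """Extra distance divided by extra latency gives the impulse speed.
    Shared components (central processing, motor output) cancel in the
    subtraction, which is what makes the two-site comparison work."""
    return (d_far_m - d_near_m) / (rt_far_s - rt_near_s)

# Thigh vs. toe stimulation: 0.9 m farther, responses 15 ms slower.
v = conduction_velocity(0.1, 0.150, 1.0, 0.165)
print(f"estimated conduction velocity: {v:.0f} m/s")
# Falls within Helmholtz's reported human range of 43-150 m/s.
```

Note that the same variability Helmholtz observed would make each latency a noisy quantity, so in practice the slope would be estimated over many trials and sites rather than a single pair.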
Studies of Reaction Time in Wundt’s Laboratory: Moving from the Periphery to the Center
Note that for all his pioneering contribution, Helmholtz’s measurements were restricted to the periphery of the nervous system, to sensory and motor nerves transmitting impulses toward or from the brain (Fancher 1990). Even this result, as we recounted, was achieved after travelling a tortuous road. Nevertheless, barely a decade after Helmholtz’s measurements in 1850, the following intriguing question was posed (separately) by Wilhelm Wundt (1832–1920) and Franciscus Donders (1818–1889). Could RT measurement be refined to gauge the duration of central processes, presumably reflecting mental activity in the brain itself? Wundt approached the question experimentally by probing the simultaneity of stimulus appearance in the conscious mind. Do stimuli presented at exactly the same (physical) time evoke similarly simultaneous sensations? In a simple experiment performed in his home in 1861, Wundt attached a calibrated scale to the end of the pendulum of his clock so that the pendulum’s position at any time could be determined with precision. A needle fastened to the pendulum perpendicularly at its middle would strike a bell at the very instant that the pendulum reached a predefined position on the scale. Using this makeshift (yet accurate for the time) instrument (Figure 4.1), Wundt was observing his own mind: Hearing the sound of the bell, Wundt did not perceive the pendulum to be in the predetermined position but always away from there. Calculation based on the perceived distance of the pendulum from its original position showed the perceived time difference to be at around one-tenth of a second. Inevitably, Wundt concluded, people do not consciously experience the visual and auditory stimuli simultaneously, despite the fact that these stimuli occur at the same time.

Fig. 4.1 Schematic of Wundt’s thought meter.

Encouraged by such data, Wundt subsequently attempted to measure specific central processes. A favorite topic was “apperception,” an early term for what is now known as attention. Wundt found that the RT to a given stimulus was shorter by one-tenth of a second if the observer concentrated on the response rather than on the stimulus. The
reason is that one has first to perceive the stimulus and then to apperceive it, that is, to decide whether it is the appropriate one for responding. When focusing on the response, the second of these processes is gratuitous. Consequently, Wundt proposed that apperception takes about one-tenth of a second. Regardless of the particular results, the significance of Wundt’s early foray into RT measurement lies in his bold thrust to probe the duration of mental processes of consequence to cognitive science and everyday life alike. Cognizant of its potential, Wundt’s home apparatus has been depicted as a “thought meter,” and the title of his own report (including subsequent data) aptly read, “Die Geschwindigkeit des Gedankens” (The speed of thoughts; Wundt 1892). Important work in Wundt’s laboratory was carried out on a related subject, the number of stimuli noticed simultaneously during a short glance. James McKeen Cattell (1860–1944), Wundt’s American student and assistant, first employed RT in the study of the visual span of attention or span of apprehension. However, a true pioneer in this domain was the Scottish philosopher, Sir William Hamilton, whose observations are reported in a posthumous book published in 1859. Hamilton spread out marbles on the ground and concluded that, on average, the span of visual attention is limited to 6–7 items. However, if the marbles are arranged in groups (of, say, two, three, or four marbles a group) the person can comprehend many more marbles because the mind considers each group as a unit. These results and conclusions anticipated those of George Miller a century later in his famous article on the “magical number seven” and on the effects of “chunking” (Miller, 1956). The power of grouping was expounded by Cattell himself, who found that whole words could replace single unrelated letters, leaving invariant the number of units noticed within the span. 
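Wundt’s conversion of a spatial reading into a time difference, described above for the pendulum apparatus, can be reconstructed by treating the bob’s motion as simple harmonic. The numbers below are purely illustrative (a 2-second clock pendulum and an arbitrary perceived offset), chosen only to show how a distance on the scale translates into roughly a tenth of a second; they are not Wundt’s own measurements.

```python
import math

def lag_from_displacement(d, amplitude, period_s):
    """Time lag implied by perceiving the pendulum a distance d away from
    the bell-strike position (taken here as the midpoint of the swing),
    assuming harmonic motion x(t) = A * sin(2*pi*t / T). Near the midpoint
    this is approximately (d / A) * T / (2*pi)."""
    return (period_s / (2 * math.pi)) * math.asin(d / amplitude)

# A clock pendulum with a 2 s period, read about 31% of its swing
# amplitude away from the marked position at the moment the bell is heard:
lag = lag_from_displacement(d=3.1, amplitude=10.0, period_s=2.0)
print(f"implied perceptual lag: {lag:.3f} s")
# Roughly the one-tenth of a second Wundt reported.
```

The essential point survives the toy assumptions: given a known motion law for the instrument, a purely spatial observation becomes a temporal measurement.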
Modern studies on the span of attention use short exposure times (at around 50 ms) in order to avoid eye movements and counting. As a result, observers actually report the contents of their short-term or “iconic” memory. George Sperling (1960), reviving interest in the subject in his groundbreaking studies on the information contained in brief visual presentations, concluded that the span was much larger than previously thought (on the order of 12–16 letters), but that it was also short-lived. The very report by the observer can conceal the true size of the span; larger estimates are found when the deleterious effects of reporting are circumvented.
We surely have come a long way from Hamilton’s informal surmises. Nevertheless, his observations brought to the fore the idea of limited capacity (resources or attention) and even the idea of parallel processing. Murray (1988, p. 159), ever the keen reader, concluded that “Hamilton perceived consciousness as a kind of receptacle of limited capacity.” Needless to add, capacity and parallel processing are key concepts in the current approach known as human information processing.
Donders’ Complication Experiment and Method of Subtraction
We already mentioned Franciscus Donders, the true pioneer of RT measurement in psychology. This Dutch physiologist (founder of modern ophthalmology among sundry achievements) developed the first influential, hence lasting procedure for measuring the duration of specific mental processes. Donders devised an experimental setup known as the complication experiment with an associated method of RT measurement called the method of subtraction. The idea was to present tasks of increasing complexity and then to subtract the respective RTs in order to identify the duration of the added processes. The technique is best illustrated by the procedures used by Donders himself (Donders, 1868; we follow Murray’s 1988 depiction). In one variation, the a-method, a sound such as ki is presented by the experimenter and the observer reproduces it orally as quickly as possible (one should note that Donders was the first experimenter to use human [his own] voice in RT studies). The a-task is a simple reaction time experiment, recording the time it takes the observer to react to a predetermined stimulus by a predetermined response. In the b-method, one of several sounds is presented on a trial, and the observer repeats the sound as fast as possible. This variation is dubbed choice reaction time: Several different stimuli are presented and the observer responds to each of them differently. In the c-method, several sounds are given again, but the observer imitates only one of them and remains silent when the others are presented (this variation is now known as the go/no-go procedure). The differences between the respective RTs reflect the duration of the psychological processes involved. For example, the RT for the b-procedure entails both discrimination (or identification) of the stimulus presented and the selection of the appropriate response, whereas that for the c-procedure entails merely discrimination
elementary cognitive mechanisms
Fig. 4.2 Illustration of the complication experiment and analysis by the method of subtraction. Top: A simple RT experiment (a single predetermined response made to a single predetermined stimulus) is complicated into a choice RT experiment (two different stimuli with a different response made to each). Bottom: The time it takes to perform the mental act of choice is estimated by subtracting the mean RT of the simple RT experiment from the mean RT of the choice RT experiment.
(or recognition, see Luce, 1986, p. 213). The mean difference (c−a) was taken by Donders to measure the duration of recognition, whereas that of (b−c) estimated the time consumed by the need to make a choice between responses (see Figure 4.2 for an outline of the Donders experiment and for the logic of the method of subtraction). In the scheme developed by Donders, there is a chain of discrete nonoverlapping processing systems. The duration of each process is measurable, assuming that each added experimental task uniquely taps one and only one of the processing systems. If the assumptions hold, the procedure succeeds in inferring the duration and eventually the attendant architecture of the psychological system under test. Consequently, the idea of subtraction has exerted a profound influence on RT theory and experimentation. Townsend and Ashby (1983) paid well-deserved homage to Donders by designating psychological processes carried out in a serial fashion (i.e., sequential and without overlap in processing time) as Dondersian systems. This much granted, closer scrutiny of the method (in particular, its underlying assumptions) uncovered several problems, so that the method has not been wholeheartedly accepted by students of RT. The
main criticisms are easily summarized because they are interconnected in the final analysis. First, the experimental data collected by different investigators, or by an individual investigator at different times, proved extremely variable. For example, Donders (1868), Laming (1968), and Snodgrass, Luce, & Galanter (1967) reported vastly different RTs for (c−a) and (b−c). Over and above the variability, the order of the differences is not preserved: Donders found (b−c) longer than (c−a), but those subsequent investigators found the opposite pattern. Wundt, an early champion of the method, was so discouraged by the large intra-individual variability that he abandoned his RT studies altogether. Second, the method requires that the added experimental task have no influence on any of the other tasks. The assumption of “pure insertion” (Sternberg, 1969a,b) asserts that the previous processes unfold in time precisely in the same fashion regardless of whether another process is inserted into the chain. If pure insertion is impossible in general or does not hold in particular cases, the assumptions of additivity and independence of the processes are also compromised. To compound the problem, the assumption of pure insertion is untestable with mean statistics, although it might be with distributional statistics (Ashby & Townsend, 1980). The issue is not fully settled (cf. Luce, 1986, p. 215), and it is moot whether it can be fully settled with any mathematical or statistical test. The third criticism is even more fundamental. It concerns the relationship between the experimental task and the unobservable psychological process or subprocesses that the task is supposed to tap. It is not prima facie clear that by calling a task “response choice/selection” or “stimulus discrimination” the underlying psychological process is that of choice or discrimination. It is not even clear that the task taps a single process, excluding all sorts of subprocesses.
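Setting the criticisms aside for a moment, the subtractive bookkeeping itself is straightforward; here is a minimal sketch (the mean RTs are invented values for illustration, not Donders’ data):

```python
# Donders' method of subtraction: estimate stage durations from mean
# RTs of the a- (simple), b- (choice), and c- (go/no-go) tasks.
# The millisecond values below are invented for illustration.

def subtraction_estimates(mean_a, mean_b, mean_c):
    """Under Donders' assumptions: recognition = c - a, choice = b - c."""
    return mean_c - mean_a, mean_b - mean_c

recognition, choice = subtraction_estimates(mean_a=200.0, mean_b=285.0, mean_c=237.0)
print(recognition, choice)  # 37.0 48.0
```

The sketch also makes the first criticism easy to state: nothing in the arithmetic guarantees that the estimated differences are stable across investigators or even across sessions.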
The raison d’être of the complication experiment is minimum complication, so that a single well-defined process is probed with each addition. This minimal-addition, or single-process, principle is not readily testable (certainly not at the level of the mean) and it is even more difficult to satisfy in experimental practice. After all, how can one decide that the added task comprised the smallest complication possible (Külpe, 1895; Sternberg, 1969a,b)? Symptoms of the problem have recurrently surfaced in the century following the Donders experiment. Where Donders called a given task “discrimination,” Wundt called the same task, “cognition.” Donders’ c-task was conceived to tap stimulus recognition, but already in 1886 Cattell questioned its validity, arguing that the task entails processes beyond identification or recognition. More recently, Welford (1980), echoing Cattell’s concerns, concluded that the difference between the b-task (originally thought to tap response selection) and the c-task is one of degree and that both entail choice of the response. Wundt, acutely aware of the problem, conceived a new task, the d-procedure (meant to be a pure measure of recognition), to no avail. More than linguistic indeterminism is at stake. G. A. Smith (1977), for one, obtained data showing choice to be faster than recognition! How does one make choices among stimuli that one does not recognize? In the absence of a definite task-process association and theory, we cannot know with certainty the identity and order of the pertinent psychological processes. Given these problems, the method of subtraction was out of favor for many years with students of RT. The succeeding section will bring us into the modern era of cognitive research. Subsequent sections will revisit many of the concepts with more quantitative detail, but still with emphasis on a friendly style.

Saul Sternberg’s Revival of the Donders Project: Inaugurating the Modern Study of Human Information Processing

Reminiscent of the tale of Sleeping Beauty, Dondersian procedures were lying dormant for over a century. The prince-investigator reviving the technique was Saul Sternberg (1966, 1969a,b), and the magic kiss awakening renewed interest was his memory scan experiment. The participants are first shown a number of items. Then, they decide whether a test item was or was not present in the set just shown. Prototypical results are given in Figure 4.3.

Fig. 4.3 Prototypical results of Sternberg’s memory scan experiment. Mean RT is plotted against the size of the positive set, with separate points for positive and negative responses and lines of best fit.

Two features of the data are noteworthy. First, RT is a linear function, with a positive slope, of the size of the memory set shown. Adding a single member to the memory set increases RT by the same constant amount. Second, targets and foils produce the same increment in RT, so that the slope of the function is the same for yes and for no responses (in Figure 4.3, the intercept, reflecting stimulus encoding, base, and residual time, incidentally is also the same; however, the important feature is the parallelism of the target-present and target-absent functions). Sternberg interpreted the linear function with the positive slope to reflect serial processing such that the test
item is compared with the memory representation of each of the items in the positive set – one item at a time. He interpreted the parallelism of the slopes to mean that the search continues until the entire memory set is exhausted even if an early item in the positive set matches the probe stimulus. Sternberg’s interpretation of his data is now known as the standard serial exhaustive search model. If search ceases as soon as a probe item is located, the process is said to self-terminate. Sternberg’s original (1966) analyses were stronger than many of the scores of studies that followed, due not only to invoking several control conditions but also to helping rule out an important class of parallel models. Again, we will discuss this matter, as well as other topics in this section, in more quantitative detail subsequently. Sternberg’s conclusions seem compelling but, as subsequent research has revealed, neither is forced by the data. The positive slope appears to have all the earmarks of serial processing, but a moment of reflection suffices to show that the same result follows in a natural fashion from parallel processing. Think of horse races (actual ones, not modeling metaphors) with a different number of horses in each race. The referee reports back to the organizer once each race is over (i.e., when the slowest horse crosses the finish line). Clearly, each race is parallel and exhaustive. It requires only a little intuition to conclude that the larger the number of horses, the longer the expected duration between the common start and the finishing time of the slowest horse (i.e., the RT-set size function has a positive slope). Now, if every horse runs just as fast and with the same random variation no matter how many other horses are present, then it can be shown that the increasing duration for all
the horses to finish bends over (i.e., increases by a smaller and smaller amount as the number of horses increases; see Townsend & Ashby, 1983, p. 92, for a proof) rather than being straight. Such a system, whether run by horses or by parallel perceptual or cognitive channels, is said to be of unlimited capacity (e.g., Townsend, 1974; Townsend & Ashby, 1978). Sternberg’s (1966) analyses did rule out this variety of parallel processing. Formal models of memory- or perceptual-scanning have introduced the notion of limited capacity in performing the comparison process. In Townsend’s capacity reallocation model (Townsend, 1969, 1974; Townsend & Ashby, 1983; see also Atkinson, Holmgren, & Juola, 1969), a finite amount of capacity is redistributed after completing the comparison of each item; the processing itself is always a parallel race between the remaining items. Such limited-capacity, parallel exhaustive search models yield precisely the same predictions as Sternberg’s original model (e.g., positive parallel slopes for target-present and target-absent processing, absence of a serial position effect, and linear growth of variance with the number of items), some of which are not generally confirmed by experimental data. Following Townsend’s early development (1969, 1971), several classes of parallel models have been shown to predict Sternberg’s results (Corcoran, 1971; Murdock, 1971; Townsend, 1969, 1971a,b, 1972, 1974; Townsend & Ashby, 1983). Moreover, Sternberg’s data can be predicted by self-terminating rather than exhaustive search, whether in parallel (e.g., Ratcliff, 1978) or even serial (e.g., Theios, Smith, Haviland, Traupmann, & Moy, 1973) models. The reader should consult Section 5 as well as Van Zandt and Townsend (1993) and Townsend and Colonius (1997) for more details on the topic of testing self-terminating versus exhaustive processing in parallel and serial models. The interrogation of Sternberg’s results entailed also (slight) experimental modifications. For example, the memory set can follow rather than precede the probe stimulus, thus initiating what are usually termed visual search (or early target) experiments. Early examples of these designs are found in the studies by Estes and Taylor (1969), Atkinson et al. (1969), and van der Heijden (1975). A more consequential manipulation entails the inclusion of more than a single replica of the target stimulus in the search list. RT is found to decrease with the number of redundant targets (e.g., Baddeley & Ecob, 1973; Egeth, 1966; in bimodal perception, see Bernstein, 1970), a result inconsistent with the prediction of the standard serial exhaustive model. Regardless of this particular result (the violation can be dealt with fairly easily by slight modification of the pertinent models), redundant target designs proved a powerful tool in revealing virtually all aspects of human information processing. Sternberg revived Donders’ method of subtraction in a further profound way. In his method of additive factors (Sternberg, 1966, 1969a,b), one does not eliminate or bypass a stage (as in the method of subtraction) but rather affects it selectively. Think of the standard memory scan experiment for an illustration. In the additive factors scheme, the operation of comparison comprises a single stage affected by the factor of size of the search set. Suppose that one adds another stage, stimulus encoding, affected by degrading the quality of the visual presentation. The logic of the method is as follows. Varying the number of stimuli in the search set affects comparison (and response) processes, whereas degrading the quality of the stimuli affects perceptual encoding. Additivity (of the mean RTs) holds if indeed the manipulations influence the respective processes selectively. If one further assumes independence, the incremental effects of added stages should be additive over accumulated RTs, too. The expected result in this two-stage serial model is shown in Figure 4.4. The influence of set size is revealed by the positive slopes of the RT curves and that of visual degradation by the longer RTs. Critically, the two factors do not interact, as is evident in the parallelism of the slopes.

Fig. 4.4 Hypothetical results in an additive factors experiment in which additivity is seen to hold. RT is plotted against set size for poor-quality and good-quality stimuli, yielding parallel, positively sloped curves.
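Returning for a moment to the horse-race analogy: the bending-over of the unlimited-capacity exhaustive finishing time can be computed directly. The sketch below assumes each horse’s finishing time is an independent Exponential(1) random variable (an illustrative choice; any common distribution yields the same qualitative picture). For this choice, the expected maximum of n such times is the n-th harmonic number, so each added horse lengthens the expected race by only 1/n:

```python
# Expected exhaustive finishing time for n parallel "horses" with
# independent Exponential(1) finishing times: E[max] = 1 + 1/2 + ... + 1/n.

def expected_exhaustive_time(n):
    return sum(1.0 / k for k in range(1, n + 1))

# Successive increments shrink (1/2, 1/3, 1/4, ...), so the curve of
# mean finishing time against n bends over instead of rising linearly.
increments = [expected_exhaustive_time(n) - expected_exhaustive_time(n - 1)
              for n in range(2, 7)]
assert all(a > b for a, b in zip(increments, increments[1:]))
```

This is the sublinear growth contrasted in the text with the straight line predicted by the standard serial model.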
features of response times
The additive factors method, like the memory scan experiment, has engendered a very large amount of research, producing a wealth of valuable theorems (e.g., the independence of additivity and stochastic independence) and theoretical insights (e.g., success and failure in mimicry of serial systems by parallel systems). The last point will be particularly appreciated by those experiencing the frustration in convincing a graduate student (or a seasoned researcher!) that a positive slope does not, ipso facto, imply serial processing (Feature Integration Theory [Treisman & Gelade, 1980] is a poignant case in point). Criticisms and generalizations of the method unearthed further important information. For example, additivity does generally support separate processing stages, but interaction does not necessarily support a single stage. Statistical properties of analysis of variance (ANOVA) might compromise, to an extent, its value as the (sole) diagnostic tool (cf. Townsend, 1984). A really consequential feature of the method in virtually all modifications and generalizations (but see Schweickert, 1982) is that it tells us nothing about the order of occurrence of the various stages (or underlying processes). The ensuing problems were already noticed with respect to the original method by Donders, but they are equally serious with the method of additive factors. Sternberg’s landmark studies, along with the almost concomitant works by Sperling, Estes, Nickerson and Egeth and others, inaugurated the human information-processing approach in earnest. Where Donders, in his subtraction method, changed the nature of the tasks as well as the number of stimuli, Sternberg, in his memory experiment, did not change the task, only added items. It is easier to subtract numerical values of RT than entire psychological processes (cf. Marx & Cronan-Hillix, 1987). 
In his additive factors method, Sternberg showed that it was not even necessary to subtract processes, only to affect them experimentally in a selective way. Within a decade of Sternberg’s seminal contribution, virtually all students of RT and roughly half the community of cognitive psychologists (Lachman, Lachman, & Butterfield, 1979) were conducting research employing or testing some aspect of Sternberg’s theory and methodology.
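The selective-influence logic of the additive factors method can be sketched with a toy two-stage model (every parameter value below is invented for illustration): set size affects only the comparison stage and stimulus degradation only the encoding stage, so their effects on mean RT are additive and the predicted RT-by-set-size lines are parallel, as in Figure 4.4.

```python
# Toy two-stage serial model illustrating the additive factors method.
# All parameter values (ms) are invented. Set size selectively affects
# the comparison stage; stimulus degradation selectively affects encoding.

def mean_rt(set_size, degraded,
            encode=150.0, degrade_cost=80.0, compare=40.0, residual=200.0):
    encoding = encode + (degrade_cost if degraded else 0.0)
    return encoding + compare * set_size + residual

# The degradation effect is identical at every set size: no interaction,
# i.e., parallel RT-versus-set-size lines.
effects = [mean_rt(n, True) - mean_rt(n, False) for n in range(1, 7)]
assert all(e == effects[0] for e in effects)
```

If degradation also slowed each comparison, the lines would diverge and the factors would interact, which is exactly the diagnostic the method exploits.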
Basic Issues Expressed Quantitatively

In the previous sections we surveyed some of the history of mental chronometry. Several key issues
were highlighted from historical and philosophical perspectives, all related to the notion of time and the role it plays in mental processes. First, mental events—our feelings, thoughts, and decisions—take time, and this time can be measured. Second, internal subprocesses can take place one at a time (and are hence called serial processes), or at the same time (parallel processes). Third, when several subprocesses take place, the system may await the completion of each and every one of these subprocesses before moving on to respond (exhaustive processing), or it can finish before that, say, upon the termination of any one of the subprocesses (minimum-time).1 Fourth, subprocesses may be independent from one another (or not), and so may the time durations taken to complete each subprocess. And finally, we introduced the idea that people may have a limited capacity—a limited amount of resources (attention)—and hence can deal effectively with only a limited amount of processing at any given time. In what follows we provide a formal treatment of each of these basic issues, along with illustrative examples. The first issue, regarding the temporal modeling of information processing, is ubiquitous in theoretical approaches to human cognition. We see this affirmed in several chapters of this book, such as Chapter 3 (Modeling Simple Decisions and Using a Diffusion Model) and Chapter 6 (A Past, Present, and Future Look at Simple Perceptual Judgment). Many models of perception and decision making are based on the premise that information, or evidence toward some target behavior, is accumulated over time. Thus, to answer Titchener’s (1905) question, we have both the right and the obligation to speak about the duration of mental processes. The remaining basic issues are discussed next in greater detail; the reader may find the following example helpful throughout this discussion. Suppose that you are a driver approaching an intersection.
The sight of a red light or the sound of a policeman’s whistle signals you to stop and give way. One can think of the visual signal and the auditory signal as being processed in separate subsystems, which we call channels. We denote the time to process and detect a signal in each of the channels by tA (for the visual channel) and tB (for the auditory channel). We further make the assumption that both signals are presented at exactly the same time (we can relax this assumption subsequently). What can we learn about the time course of information processing? What can we learn about the
relationship between the information-processing channels? The critical properties of architecture, stopping rule, and independence will now be introduced with only a little mathematics. A rigorous mathematical statement regarding architecture (i.e., parallel and serial processes) appears in Section 4 of this chapter. For more quantitative detail on these features, the reader should consult Townsend and Ashby (1983) or Townsend and Wenger (2004b, for a more recent statement).

Architecture: Parallel Versus Serial Processing

As mentioned, two or more subprocesses can take place one at a time (serial), or at the same time (parallel). Figure 4.5 illustrates these modes of processing, where each arrow corresponds to a particular channel. It is convenient to consider the way the system operates—its architecture—through the prism of the time it takes to complete the processing of both signals. Suppose that the driver is unwilling to hit the brakes unless both signals are spotted, that is, she processes the two signals exhaustively. In the serial case (Panel a), the time to process both signals is the sum of the durations needed to process each channel, such that the total tserial = tA + tB. In the parallel case, this time equals that needed to process the slower of the two processes, tparallel = max(tA, tB). It is tempting to think that parallel processing will yield a faster braking response compared with serial processing (and more generally that parallel processing is more efficient than serial processing), given that max(tA, tB) < tA + tB, for any tA, tB > 0. This intuitive notion is true only as long as we assume that tA (and similarly tB) is the same in the serial and the parallel cases.2 Is it realistic to expect our driver to bring her car to a stop only after she detects both sources of information? On intuitive grounds, one would prefer to act quickly on the basis of only one signal, whichever signal is detected first as a sign of danger. This issue is considered next.

Exhaustive versus Minimum-Time Stopping Rule

Awaiting the completion of two subprocesses is referred to as exhaustive processing. The processing durations, tserial and tparallel, for that strategy were given earlier. It is also possible to stop as soon as the first process is completed; in our example, as soon as the driver detects the red light or hears the policeman’s whistle. This strategy is referred to as minimum-time processing. The overall time it takes for a parallel system with a minimum-time rule is given by tparallel = min(tA, tB). For a serial system, the total duration depends on the order of processing: tserial = tA if A is processed first, and tserial = tB if B is processed first. Needless to add, in a serial system that stops as soon as the first channel completes (as soon as the first signal is detected), the second channel will not have a chance to operate at all. Although other stopping rules are also possible, the exhaustive and minimum-time stopping rules are of particular interest. They are illustrated in Figure 4.5 (Panels a–d). Processing times for the different systems are summarized in Table 4.1.

Fig. 4.5 Illustrations of serial (Panels a, c), parallel (b, d), and coactive (e) systems. Panels a and b demonstrate exhaustive processing, where both processes A and B must finish before a decision and response can be made. Panels c and d show minimum-time processing, where processing ceases once process A is completed (but B had not finished, as indicated by the broken line). Panel e illustrates a coactive mode of processing, where activation from the two channels is summed before the decision stage.
Table 4.1. Summary of overall completion times for the various models. tA and tB denote the time to process signals in channels A and B, respectively.

  Model and stopping rule                          Overall completion time
  Parallel exhaustive                              max(tA, tB)
  Serial exhaustive                                tA + tB
  Parallel minimum time                            min(tA, tB)
  Serial minimum time:
    if channel A is processed first and B second   tA
    if B is processed first and A second           tB
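The entries of Table 4.1 can be written out directly as a minimal sketch (deterministic times; the millisecond values are invented, and the serial minimum-time case assumes channel A happens to be processed first):

```python
# Deterministic completion times from Table 4.1. Channel durations
# (in ms) are invented for illustration; "serial minimum time"
# assumes channel A is processed first.

def completion_time(t_a, t_b, architecture, stopping_rule):
    if architecture == "parallel":
        return max(t_a, t_b) if stopping_rule == "exhaustive" else min(t_a, t_b)
    if architecture == "serial":
        return t_a + t_b if stopping_rule == "exhaustive" else t_a
    raise ValueError("unknown architecture: " + architecture)

t_a, t_b = 300.0, 450.0
assert completion_time(t_a, t_b, "parallel", "exhaustive") == 450.0  # max(tA, tB)
assert completion_time(t_a, t_b, "serial", "exhaustive") == 750.0    # tA + tB
assert completion_time(t_a, t_b, "parallel", "minimum") == 300.0     # min(tA, tB)
assert completion_time(t_a, t_b, "serial", "minimum") == 300.0       # tA (A first)
```

Note that with these values the parallel minimum-time and serial minimum-time systems coincide, which is precisely why architecture and stopping rule are hard to disentangle from completion times alone.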
Stochastic Independence

Two events are said to be statistically independent if the occurrence of one does not affect the probability of the other. For example, height and SAT score (a standardized test score for college admissions in the United States) are independent if knowing the height of a person tells nothing about his SAT score. In the context of processing models, total completion times of channels A and B are independent if knowing one does not tell us a thing about the value of the other. Our discussion of the architecture and stopping rule was simplified by the fact that we assumed that processing is deterministic, rather than stochastic (probabilistic). A deterministic process always yields a fixed result, such that the effect or phenomenon we observe has no variability. For example, a deterministic process predicts that the time taken to drive from Sydney to Newcastle is always fixed, or that the time to choose between chocolate and vanilla flavors is the same every time we stop at the ice-cream parlor. Under this assumption, we were able to represent the time for processing in channels A and B by the fixed values tA and tB. However, observations of human performance (and Sydney’s traffic) lead to the conclusion that behavior is quite variable and that it can probably be better described as a stochastic process. If so, processing time in any particular channel can no longer be characterized by a fixed value, but is represented by a random variable. A random variable does not have a single, fixed value but can rather take a set of possible values. These values can be characterized by probability distributions. The probability density function (pdf) is defined by f(t) = p(T = t), and gives the likelihood that some
process, which takes random time T to complete, will actually be finished at time t. We can use f(t) to define stochastic independence. In probability theory, two random variables are independent if knowing the value of one tells nothing whatsoever about the values of the other (e.g., Luce, 1986, chapter 1). In processing models, total completion times of channels A and B are independent if knowing one, say tA, tells us nothing about the likelihood of various values of tB. Thus, we can express independence in terms of the joint pdf, fAB(tA, tB) = fA(tA) · fB(tB), which means that the joint density of process A finishing at time tA and process B finishing at time tB is equal to the product of the density of A finishing at time tA and the density of B finishing at time tB.
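A small simulation can illustrate this factorization. The sketch below assumes exponentially distributed channel times with invented means, and checks the joint and marginal probabilities at a single time point:

```python
import random

# Monte Carlo illustration of independence between channel completion
# times. Channel times are drawn from exponential distributions with
# invented means (300 ms and 400 ms); any independent draws would do.
random.seed(1)
N = 200_000
ta = [random.expovariate(1 / 300) for _ in range(N)]  # channel A times
tb = [random.expovariate(1 / 400) for _ in range(N)]  # channel B times

t = 350.0
p_a = sum(x <= t for x in ta) / N           # P(TA <= t)
p_b = sum(x <= t for x in tb) / N           # P(TB <= t)
p_joint = sum(a <= t and b <= t
              for a, b in zip(ta, tb)) / N  # P(TA <= t and TB <= t)

# Under independence the joint probability factors into the product.
assert abs(p_joint - p_a * p_b) < 0.01
```

Were the channels dependent (say, one channel slowing whenever the other is slow), the joint probability would systematically depart from the product of the marginals.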
Workload Capacity and the Capacity Coefficient

We recounted earlier that the time to process multiple signals depends on the stopping rule and mode of processing (serial, parallel). Notably, processing also depends on the amount of resources available for processing, a notion that we call capacity. One may think of the cognitive system as performing some work; the more subprocesses (channels) are engaged, the greater the amount of work there is to perform. We define workload capacity as the fundamental ability of a system to deal with ever heavier task duties (Townsend & Eidels, 2011; see also Townsend & Ashby, 1978, 1983). A ready example is the increase in load from processing one signal to processing two or more signals. One may find it useful to think about work and capacity in terms of metaphors such as water pipes filling a pool, or tradesmen building a house. Suppose that the tradesmen operate in parallel (and, for illustration, deterministically) and that there is an infinite amount of resources (tools, building materials)—unlimited capacity. In that case, a twofold increase in the number of workers will cut in half the amount of time needed to build the house (assuming all tradesmen have the same work rate). Critically, adding more workers does not affect the labor rate of each individual worker. In a similar vein, increasing load on the cognitive system by increasing the number of to-be-processed items does not have an effect on the efficiency and time of processing each item alone. The time to process the visual signal (red light) when it is presented alone should be the same as the time to process the same signal when it is presented in tandem with the auditory signal (whistle by the policeman),
tA|A = tA|AB. To clarify the notation, the subscript A|A indicates processing of signal A given that only signal A is present, whereas A|AB indicates processing of signal A when A and B are both present. If several channels are working toward the same goal and capacity is unlimited, then adding more channels should facilitate processing. It is possible, however, that capacity is limited. In one special case, the overall amount of processing resources, X, can be a fixed value. With more and more channels coming into play, fewer resources can be allocated to each channel, and, consequently, the time to complete processing within each channel increases. So, for example, the time to process the visual signal is longer when the auditory signal is also present. Using the same notation as before, we can express this as tA|A < tA|AB and tB|B < tB|AB. Under limited capacity, performance with a given target is impaired as more targets are added to the task. Metaphorically, this is tantamount to tradesmen who are trying to work in parallel but share one set of tools. Worker A cannot work at the same rate that she did alone if she needs to await her partner handing over the hammer. Given that multiple workers or channels operate toward the same goal, a limited-capacity system can still complete processing faster than (or at least as fast as) any single channel alone (depending on the severity of the capacity limitation). However, a limited-capacity system cannot be faster than an otherwise identical unlimited-capacity system. A third, and at first curious, case is that of super capacity. It is possible in principle that as more and more channels are called for action, the system recruits more resources (à la Kahneman, 1973) and is able to allocate to each of the channels more resources than what each channel originally had when it was working alone.
In this case, tA|A > tA|AB and tB|B > tB|AB , and moreover, the more signals (and channels) there are, the faster the system completes processing. Under supercapacity, performance with a given target is improved as more targets are added to the task. We can model super capacity by way of a system in which channels A and B pool their activation into a single buffer, in which evidence is then compared against a single criterion. In that sense, processing channels can also join efforts to satisfy a common goal as could be the case in the tradesmen example. This mode of processing is often referred to as coactivation (e.g., Colonius & Townsend, 1997; Diederich & Colonius, 1991; Miller, 1978, 1982; Schwarz, 1994; Townsend & Nozawa, 1995;
Townsend & Eidels, 2011) and is illustrated in Figure 4.5e. Clearly, this type of model benefits from an increase in the number of relevant signals. With auditory and visual signals contributing to a single pool, evidence accumulates more quickly, and will surpass threshold faster. Thus, a coactive model is a natural candidate for supercapacity. However, it is not the only way supercapacity can be achieved in parallel systems, as we shall see (Eidels, Houpt, Altieri, Pei, & Townsend, 2011; Townsend & Wenger, 2004a). Townsend and Nozawa (1995) offered a measure of workload capacity known as the capacity coefficient:

COR(t) = log[SAB(t)] / log[SA(t) · SB(t)].    (1)
SA(t) and SB(t) are the survivor functions for completion times of processes A and B, and tell us the probability that channels A and B, respectively, did not finish processing by time t. SAB(t) is the survivor function for completion times of the system when channels A and B are both at work (e.g., when two targets are being processed simultaneously). We have already defined the pdf, f(t) = p(T = t), as the likelihood that a process that takes random time T to complete will actually be finished at time t. We can also define the probability that the process of interest is finished before or at time t, known as the cumulative distribution function (cdf), F(t) = p(T ≤ t). The survivor function is the complement of the cdf, S(t) = 1 – F(t) = p(T > t), and tells us the probability that this process had not yet finished by time t. The capacity coefficient, COR(t), allows one to assess performance in a system that processes multiple signals by comparing the amount of work done by the system when it processes two signals with the amount of work it does when each of the signals is presented alone. The subscript OR indicates that processing terminates as soon as subprocess A or subprocess B finishes (i.e., minimum-time termination). Townsend and Wenger (2004a) developed a complementary capacity coefficient for the AND design, where the system can stop only after the two processes, A and B, are both finished:

CAND(t) = log[FA(t) · FB(t)] / log[FAB(t)].    (2)
Equations 1 and 2 both apply to two channels, but the C(t) index can be easily generalized to account for more than two processes (Blaha & Townsend, 2006). The interpretation of COR(t)
73
and CAND (t) is the same, so that C(t) refers to both indices. Parallel-independent models are characterized by unlimited capacity, C(t) = 1. Capacity is C(t) < 1 in a limited capacity model, and it is C(t) > 1 with super capacity in force. Architecture (serial, parallel), stopping rule, and potential dependencies can also affect the capacity coefficient. For the effect of architecture, consider a serial model, which processes channel A first and then processes channel B. This model will take more time to complete, on average, than an otherwise identical parallel model in which processes A and B occur simultaneously. The former also results in C(t) < 1 – limited capacity. Breakdown of independence across channels also affects C(t) in a predictable manner. Townsend and Wenger (2004a) and Eidels et al. (2011) have shown that positive dependency (one channel “helps” the other) can lead to supercapacity, C(t) > 1, whereas negative dependency (one channel inhibits the other) can lead to limited capacity, C(t) < 1. The capacity coefficient is discussed further in the later section, Theoretical Distinctions, along with an illustrative example from the Stroop milieu. The interpretation of C(t) is particularly revealing when discussed with respect to the benchmark model that we describe next.
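To make the computation concrete, here is a small, self-contained sketch (ours, not the chapter’s; exponential channel times and the specific rates are assumed purely for illustration) that estimates C_OR(t) from simulated single- and double-target response times:

```python
import math
import random

def survivor(sample, t):
    """Empirical survivor function: proportion of RTs exceeding t."""
    return sum(rt > t for rt in sample) / len(sample)

def c_or(rt_a, rt_b, rt_ab, t):
    """Capacity coefficient C_OR(t) = log S_AB(t) / log[S_A(t) * S_B(t)]."""
    s_a, s_b, s_ab = survivor(rt_a, t), survivor(rt_b, t), survivor(rt_ab, t)
    return math.log(s_ab) / (math.log(s_a) + math.log(s_b))

random.seed(0)
n = 100_000
# Single-target trials: each channel working alone (exponential RTs, rate 1).
rt_a = [random.expovariate(1.0) for _ in range(n)]
rt_b = [random.expovariate(1.0) for _ in range(n)]
# Double-target trials under a UCIP OR (race) model: RT is the faster channel.
rt_ab = [min(random.expovariate(1.0), random.expovariate(1.0)) for _ in range(n)]

estimates = [c_or(rt_a, rt_b, rt_ab, t) for t in (0.5, 1.0, 1.5)]
# For an unlimited-capacity, independent parallel model, C_OR(t) hovers near 1.
```

Limited-capacity data would push these estimates below 1, and coactive pooling would push them above 1.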
The Benchmark Model: Parallel, Independent, Unlimited Capacity

The standard parallel model can be considered the “industry standard” in response-time modeling. This model is characterized by unlimited capacity and independent, parallel processing channels (attributes that yield the acronym UCIP; e.g., Townsend & Honey, 2007). If we further assume that the model can stop as soon as either one of the channels completes processing, we end up with an independent race model, illustrated earlier in Figure 4.5(d). Formally, the stochastic version of this model can be written as

S_AB(t) = S_A(t) · S_B(t).   (3)

S_A(t) and S_B(t) are again the survivor functions for completion times of processes A and B and tell us the probability that channels A and B, respectively, did not finish by time t. Consider a model that stops processing as soon as either channel finishes (minimum-time processing), but will otherwise not stop as long as process A is still going on and process B is still going on (i.e., as long as both processes “survive,” hence the term survivor function). Because processing channels A and B are independent, we can multiply the probabilities, so that the probability that the entire system does not stop by time t, S_AB(t), is given by the product of the probabilities of A and B not finishing (see Eq. 3 again).3 We note that this equation describes a model with only two channels, but it can be generalized to any number of channels. The probability that an independent race model with n parallel channels does not complete by time t is given by the product of the probabilities of no channel finishing,

S_race(t) = S_1(t) · S_2(t) · ... · S_n(t) = ∏_{i=1}^{n} S_i(t).   (4)

Given a parallel model, it is possible that the system stops only when all of its channels have completed processing (exhaustive processing). In our example, the system will stop only when both channel A and channel B stop. Assuming again that the channels are independent, the probability that the model completes processing by (at or before) time t is equal to the product of the probabilities of channels A and B finishing,

F_AB(t) = F_A(t) · F_B(t),   (5)

and, in the more general form, with n channels,

F_exhaustive(t) = F_1(t) · F_2(t) · ... · F_n(t) = ∏_{i=1}^{n} F_i(t).   (6)
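The two product rules for independent channels are easy to check by simulation. In this sketch (ours, not the chapter’s; exponential channel times with arbitrary rates), the empirical distributions of the race minimum and the exhaustive maximum are compared against the products of the single-channel functions:

```python
import random

random.seed(1)
n = 100_000
chan_a = [random.expovariate(1.0) for _ in range(n)]
chan_b = [random.expovariate(0.7) for _ in range(n)]
# Minimum-time (OR) stopping: the system finishes when the first channel does.
race = [min(a, b) for a, b in zip(chan_a, chan_b)]
# Exhaustive (AND) stopping: the system finishes when the last channel does.
exhaustive = [max(a, b) for a, b in zip(chan_a, chan_b)]

def S(sample, t):  # empirical survivor function, P(T > t)
    return sum(x > t for x in sample) / len(sample)

def F(sample, t):  # empirical cumulative distribution function, P(T <= t)
    return 1.0 - S(sample, t)

t = 1.0
gap_or = abs(S(race, t) - S(chan_a, t) * S(chan_b, t))         # survivor product rule
gap_and = abs(F(exhaustive, t) - F(chan_a, t) * F(chan_b, t))  # cdf product rule
# Both gaps shrink toward 0 as n grows, because the channels are independent.
```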
Two well-known RT inequalities also define the benchmark model. Miller (1978, 1982) proposed an upper bound for performance in the OR design (“respond as soon as you detect A or detect B”), the race model inequality:

F_AB(t) ≤ F_A(t) + F_B(t).   (7)
The inequality states that the cumulative distribution function for double-target displays, F_AB(t), cannot exceed the sum of the single-target cumulative distribution functions if processing is an ordinary race between parallel and independent channels. Violations of the inequality imply supercapacity of a rather strong degree (Townsend & Eidels, 2011; Townsend & Wenger, 2004a). Grice, Canham, and Gwynne (1984) introduced a bound on limited capacity, often referred to as the Grice inequality:

F_AB(t) ≥ MAX[F_A(t), F_B(t)].   (8)
This inequality states that performance on double-target trials, F_AB(t), should be faster than (or at least as fast as) performance in the faster of the single-target channels. If this inequality is violated, the simultaneous processing of two target signals is highly inefficient and the system is of very limited capacity. An implication is that there are no savings or gains in moving from a single target to multiple targets (in OR designs). In Section 5 we shall demonstrate the use of the three assays of capacity in an OR design: C(t) and inequalities (7) and (8). Colonius and Vorberg (1994) proposed upper and lower bounds appropriate for AND tasks (“respond if you detect target A and target B”), which are analogous to those for OR tasks in the sense that their violations indicate supercapacity and limited capacity. Our benchmark model is, therefore, useful in serving as a gold standard against which performance can be compared and interpreted.
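The two bounds can likewise be checked on simulated OR-design data. In this sketch (our illustration, not from the chapter; double-target trials are generated by an independent race with exponential channels), an ordinary race necessarily respects both the Miller bound and the Grice bound:

```python
import random

random.seed(2)
n = 50_000
single_a = [random.expovariate(1.0) for _ in range(n)]
single_b = [random.expovariate(0.8) for _ in range(n)]
# Double-target trials generated by an independent race (UCIP OR model).
double_ab = [min(random.expovariate(1.0), random.expovariate(0.8))
             for _ in range(n)]

def F(sample, t):  # empirical cumulative distribution function
    return sum(x <= t for x in sample) / len(sample)

for t in (0.3, 1.0, 2.0):
    f_ab, f_a, f_b = F(double_ab, t), F(single_a, t), F(single_b, t)
    assert f_ab <= f_a + f_b + 0.02      # Miller race model inequality
    assert f_ab >= max(f_a, f_b) - 0.02  # Grice inequality
# Supercapacity data would violate the Miller bound; severely limited
# capacity would violate the Grice bound.
```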
Conclusion
Information-processing models can be characterized by four features referring to the relations among processing channels: architecture (serial, parallel), stopping rule (minimum-time, exhaustive), capacity (limited, unlimited, super), and stochastic (in)dependence. Most of these properties are latent and cannot be observed directly. Response times are useful tools for uncovering these properties, but in some cases the result is not unique. Model mimicry is thus the focus of the upcoming section. These caveats granted, recent advances in response-time modeling of cognitive processes have proved useful in addressing some of the mimicking challenges, allowing researchers to identify critical features of human information processing. The later section on Theoretical Distinctions outlines some of these advances, followed by applications of novel techniques from the empirical literature. The reader might have noticed that some interesting topics, such as the stochastic form of serial models, were excluded from our discussion due to lack of space.4 However, the topics included in this chapter should give the reader a good understanding of elementary information-processing theory and solid preparation for more specialized reading. Box 1 gives a practical illustration of the outstanding issues.
Box 1. Is human capacity limited?
We noted in this section that workload capacity—as measured by the capacity coefficient—could theoretically be limited, unlimited, or super, depending on whether the efficiency of processing decreases, is left unchanged, or increases with additional load (e.g., more signals to process). Cumulative evidence suggests that human capacity is limited (Kahneman, 1973), yet important and frequent situations of modern life, such as driving a car, require simultaneous processing of multiple signals. Therefore, a key question is whether human capacity is, in fact, limited, and what might be the consequences of such limitations in our everyday life. Strayer and Johnston (2001) studied the effects of mobile-phone conversations on performance in a concurrent (simulated) driving task. They found that conversations on either a hand-held or a hands-free mobile phone while driving resulted in a failure to detect traffic signals and in slower reactions to those signals when they were detected. The findings clearly suggest that human capacity is limited. However, in a more recent driving-simulator study, Watson and Strayer (2010) were able to identify a group of individuals—referred to as “supertaskers”—who can perform multiple tasks without observed detriments. Although the majority of the participants showed significant performance decrements in the dual-task conditions (compared with a single-task condition of driving without distraction), a small minority of 2.5% showed no performance decrements. These supertaskers can best be characterized as having unlimited capacity (and possibly even supercapacity). The simulated-driving studies by Strayer and colleagues highlight some practical implications of uncovering latent mental constructs (capacity, in this example).

Model Mimicry
Possessing the building blocks (architecture, stopping rule, capacity, and independence), we can now expand our purview to establish classes of models characterized by those properties. For example, a serial model with an exhaustive stopping rule and independent, identically distributed processing times will have a mean response time equal to the sum of the mean response times for each channel,

E[RT] = E[RT_Channel 1] + E[RT_Channel 2] + · · · + E[RT_Channel n] + nE[T_0],
where T_0 is the base time to respond. Thus, for each channel added, we simply add its mean response time to the total average response time. But is this the only model that makes such a prediction? In this section, we provide instances of overlap in the predictions that arise from assuming various models. When one model can predict the results of another model, we face an instance of model mimicry. Though perhaps an obvious platitude, investigators rarely seem to concern themselves with the specter of mimicry. In this discussion, we emphasize total mimicry, that is, the existence of mathematical functions carrying the structure of one model to another in such a way as to render them completely equivalent. The upshot is that no data expressed at the same level as the mimicking equations can decide between competing models. Mimicry at other levels will be considered as well, along with some remedies to the parallel–serial dilemma (in the following section).
Mean Response Time Predictions
Recall that mean RT has been a useful tool in helping to determine (or eliminate) models best suited to data. Sternberg (1966), discussed in Section 2, supported a positive linear relationship between mean RT and set size. An early extension of this paradigm to conditions where the items were on display (instead of being stored in memory) was carried out by Atkinson et al. (1969), with largely similar results. The evidence for exhaustive processing was supported by the lack of an effect of the serial position of the target in the list. On the other hand, Nickerson (1966) argued that these data could be taken to favor self-terminating processing. In seminal research with a different type of visual paradigm, a same–different matching design with multiple targets, the data were interpreted as supporting a serial self-terminating process (Egeth, 1966). Even within the visual search paradigm, sometimes self-terminating stopping is found and sometimes exhaustive stopping is concluded. (See the section Theoretical Distinctions for further discussion of assessing the decisional stopping rule.) However, in none of these pioneering studies was the potential for confounding by other processing characteristics, especially capacity, taken into account. As we recounted, the early standard model was a serial, exhaustive model with equal mean processing times for every item. If one additionally assumes that each item or stage possesses the same actual processing distribution (thus producing the equal mean processing times a fortiori) and that they are also independent, then one has the complete standard serial model as outlined earlier. For simplicity, assume that the mean processing times of the single items are all equal. Assume further that the target has equal probability of appearing in any of the n positions. On target-present trials, participants process (n + 1)/2 items on average (yielding a positive linear relationship between mean RT and set size). On target-absent trials, participants have to process the entire list, so that the average RT is n times the mean RT for a single item. Therefore, on both target-present and target-absent trials, there is a positive linear relationship between mean RT and set size. As we alluded in Section 2, it can be shown that unlimited-capacity, independent parallel models do not generally make this prediction. These models, when using an exhaustive stopping rule, produce logarithmic-like functions that increase with set size, but not in a linear fashion (see Townsend & Ashby, 1983, p. 92). In the case of minimum-time (i.e., race) stopping, they yield curvilinear decreasing mean RT functions. Interestingly, single-target self-terminating processing yields a flat, straight-line mean RT function for these models. Yet, the linear prediction of the standard Sternberg model is not unique to the serial class of models. Next, we introduce a particular parallel model, in which the rate of processing depends on the number of items to be processed, that does yield the linear-increase prediction. This model is just one of a multitude of models that can predict the linear relationship found in the data. Mean response times are a common measure used in determining the processing mechanisms in a task. Although illuminating with respect to the manipulated variables, model conclusions drawn from such observations must consider the possibility of mimicry.
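The contrast drawn above—linear set-size functions for the serial exhaustive model versus logarithmic-like growth for unlimited-capacity parallel exhaustive models—can be seen directly in simulation. This is our own sketch; exponential item times of rate 1 are an arbitrary illustrative choice:

```python
import random
import statistics

random.seed(3)

def mean_exhaustive_rt(n_items, parallel, trials=20_000):
    """Mean exhaustive-processing RT for n_items exponential(1) items."""
    totals = []
    for _ in range(trials):
        times = [random.expovariate(1.0) for _ in range(n_items)]
        # Parallel exhaustive: finish when the slowest channel does (max).
        # Serial exhaustive: items are processed one after another (sum).
        totals.append(max(times) if parallel else sum(times))
    return statistics.mean(totals)

serial_means = [mean_exhaustive_rt(n, parallel=False) for n in (1, 2, 3, 4)]
parallel_means = [mean_exhaustive_rt(n, parallel=True) for n in (1, 2, 3, 4)]
# Serial means grow linearly (about 1, 2, 3, 4), whereas the UCIP parallel
# means follow the harmonic numbers (about 1, 1.5, 1.83, 2.08): each added
# item buys a smaller and smaller increase.
```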
Supporting Mathematics: Serial Model
Recall from the previous section, Basic Issues Expressed Quantitatively, that, in a serial model, items are processed one at a time. In minimum-time processing, the target may appear in any of the available positions, and processing stops when the target is found. As is standard practice, E[I_i] denotes the mean processing time of the ith item. Then, using mathematical induction, for target-present trials, one has
E(Response Time for n positions)
= (1/n)E[I_1] + (1/n)E[I_1 + I_2] + · · · + (1/n)E[I_1 + · · · + I_n]
= (1/n)E[I_1] + · · · + (1/n)(E[I_1] + · · · + E[I_n])
= (1/n)E[I_1] + (2/n)E[I_1] + · · · + (n/n)E[I_1]
= ((n + 1)/2) E[I_1],
whereas on target-absent trials the result is simply

E(Response Time for n items) = E[I_1 + · · · + I_n] = nE[I_1].

There is, thus, a linear relationship between the number of items and mean response times for both positive and negative responses. Of course, the minimum-time serial prediction is simply E[I_1], a flat straight line.
Supporting Mathematics: Parallel Model
For clarity, a “stage” of processing is the time from one item finishing processing to the next item finishing processing. For example, in an exhaustive model with three items to be processed, any channel will have three stages: the time from start until the first item is processed, the time after the first item is processed until the second item is processed, and the time from the second item’s completion until the remaining item is finished. In a parallel model, the distribution of stage processing time takes the form of a difference between item processing times, usually conditioned on channel information. Within-stage independence is defined as the statistical independence of stage processing times across two or more channels in the same stage, j. Across-stage independence assumes the independence of these times within the same channel, but for different stages. Consider the within-stage independent parallel model in which each item’s processing time in a given stage follows an exponential distribution with a rate inversely proportional to the number of items still being processed. In other words, the more items to be processed, the longer the actual processing time of each item will be. Thus, let g_{a_i, j} denote the processing density for the ith item in stage j of processing: an exponential density with rate λ/(n − j + 1). For example, stage-one processing on all n items proceeds at rate λ/n per item, whereas stage-two processing on the remaining n − 1 items proceeds at rate λ/(n − 1) per item. We omit the reasoning due to space limitations, but the average processing time for each stage is 1/λ. So, the mean processing time for n items is ((n + 1)/2)(1/λ) (positive response) and n(1/λ) (negative response). Hence, for 1/λ = E[I_1], this parallel model gives the same predictions as the aforementioned serial model for mean response times as functions of the number of items.
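A quick simulation of this limited-capacity parallel model (our sketch; per-item rates follow the λ/(n − j + 1) scheme just described, with λ = 1) confirms that its mean RTs match the serial predictions (n + 1)/2 for target-present trials and n for target-absent trials:

```python
import random
import statistics

random.seed(4)

def parallel_rt(n_items, lam=1.0, target_present=True):
    """One trial of the fixed-capacity parallel model described in the text."""
    items = list(range(n_items))
    target = random.randrange(n_items) if target_present else -1
    t = 0.0
    while items:
        k = len(items)
        # Each still-unfinished item races at rate lam / k in this stage,
        # so the stage duration (the minimum) is exponential with rate lam.
        finish_times = [random.expovariate(lam / k) for _ in items]
        stage = min(finish_times)
        t += stage
        done = items.pop(finish_times.index(stage))
        if done == target:
            return t  # self-terminating: stop as soon as the target is found
    return t  # target-absent: exhaustive processing of all items

trials = 50_000
present = statistics.mean(parallel_rt(5, target_present=True) for _ in range(trials))
absent = statistics.mean(parallel_rt(5, target_present=False) for _ in range(trials))
# Predictions for n = 5, lam = 1: (n + 1)/2 = 3.0 (present) and n = 5.0 (absent),
# i.e., the same linear set-size functions as the standard serial model.
```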
Intercompletion Time Equivalence
We refer to the time required for a stage of processing as the intercompletion time. So, in a serial model, the intercompletion times are just the processing times. We now examine the issue of model mimicry with respect to the distribution of the intercompletion times. We will show cases in which equivalence can occur between two common models: the across-stage independent serial model and a large class of parallel models that assume within-stage independence. Across-stage independence is defined as the property that the probability density function of two or more stages of processing is the product of the component single-stage density functions. Consider the case in which there are two channels, a and b, each dedicated to processing a particular item. To make the equivalence easy to follow, we write the serial model on the left side of the equations and the parallel model on the right. We use f for the pdf of the serial model and g for the parallel model. p denotes the probability that a is processed first in the serial model. f_a1(t_a1) is the density for a finishing at exactly time t_a1 in the first stage of processing. G denotes the survivor function of the respective subscript (for a parallel model), so G_b1(t_a1) is the probability that the first stage of processing for b will fail to finish before time t_a1 in a parallel model. Then, for the independent serial model to mimic the independent parallel model on all response time measurements it is necessary that

p f_a1(t_a1) f_b2(t_b2 | t_a1) ≡ g_a1(t_a1) G_b1(t_a1) g_b2(t_b2 | t_a1),   (9)

and, for the order in which b finishes first,

(1 − p) f_b1(t_b1) f_a2(t_a2 | t_b1) ≡ g_b1(t_b1) G_a1(t_b1) g_a2(t_a2 | t_b1).   (10)
For mimicry on the level of intercompletion times, we need equivalence for each stage of processing. For example, in the case in which a is processed first (Eq. (9)), one needs to define f and p so that p f_a1(t_a1) ≡ g_a1(t_a1) G_b1(t_a1). The three-bar “≡” sign simply indicates that this equation must be true for all values of t_a1. This turns out to be readily done. Thus, there is a serial model that can completely mimic the response time predictions of any given independent parallel model. This shows us that response time measurements are not enough to prove that there is a unique model for the processes involved in a task. Fortunately, there are distributions for the serial model that make parallel mimicry impossible. The upshot here is that this serial class of models is more general than that of the parallel models—the parallel class is mathematically contained within the serial class. This result provides one potential avenue for assessing architecture: try to determine from the experimental data and appropriate statistics whether processing satisfies serial but not parallel processing. If parallel models pass the tests, then these particular tests cannot discriminate (for that task) between serial and parallel architectures.
Proposition 1. Given any within-stage independent parallel model, there is always a serial model that is completely equivalent to it.

Proof. The proof generalizes to cases with more than two processing positions (Townsend, 1976a). Consider the within-stage independent parallel model whose likelihoods for the two finishing orders appear on the right-hand sides of Eqs. (9) and (10), where ⟨a, b⟩ denotes that a finishes before b. To show equivalence, one needs to define f_a1, f_b1, f_a2, f_b2, and p for a serial model so that each stage of processing gives equivalent intercompletion-time predictions. As above, for second-stage processing, simply set

f_b2(t_b2 | t_a1) = g_b2(t_b2 | t_a1) and f_a2(t_a2 | t_b1) = g_a2(t_a2 | t_b1).

Now we focus on f_a1 and p. For equivalence, it is sufficient that

p f_a1(t) = g_a1(t) G_b1(t).

Integrating with respect to t,

p = ∫₀^∞ g_a1(t) G_b1(t) dt.

The remaining density, f_b1(t), can be solved for in the same way, using the equation (1 − p) f_b1(t) = g_b1(t) G_a1(t) and the fact that

1 − p = 1 − ∫₀^∞ g_a1(t) G_b1(t) dt = ∫₀^∞ g_b1(t) G_a1(t) dt.

Thus, the serial model that mimics the parallel model is given by

p = ∫₀^∞ g_a1(t) G_b1(t) dt,

f_a1(t) = g_a1(t) G_b1(t) / ∫₀^∞ g_a1(t) G_b1(t) dt.

The Math Beneath the Mimicry
Note that, by integrating with respect to t_b2, Eq. (9) reduces to

p f_a1(t_a1) ≡ g_a1(t_a1) G_b1(t_a1)   [first-stage processing]
f_b2(t_b2 | t_a1) ≡ g_b2(t_b2 | t_a1).   [second-stage processing]

The same conclusions hold for Eq. (10) by integrating with respect to t_a2. This means that if there is intercompletion time equivalence, then there is total model equivalence.
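To see the construction in action, here is a numerical sketch (ours, not the chapter’s; exponential parallel channels with arbitrary rates mu_a and mu_b) that computes p and f_a1 from the formulas above and checks them against the closed-form answers:

```python
import math

mu_a, mu_b = 1.5, 0.5  # first-stage rates of parallel channels a and b

def g_a1(t):
    """Parallel channel a's first-stage density (exponential)."""
    return mu_a * math.exp(-mu_a * t)

def G_b1(t):
    """Probability channel b has NOT finished by t (survivor function)."""
    return math.exp(-mu_b * t)

def integrate(f, lo, hi, steps=100_000):
    """Simple trapezoid-rule quadrature."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# p = P(a finishes first) = integral of g_a1(t) * G_b1(t);
# the closed form for exponentials is mu_a / (mu_a + mu_b).
p = integrate(lambda t: g_a1(t) * G_b1(t), 0.0, 40.0)

def f_a1(t):
    """First-stage density of the mimicking serial model."""
    return g_a1(t) * G_b1(t) / p

# f_a1 should equal the density of min(T_a, T_b), which for exponential
# channels is (mu_a + mu_b) * exp(-(mu_a + mu_b) * t).
```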
Notes
1. The most general definition of a confidence interval is the range of parameter values that would not be rejected according to a criterion p value, such as p < 0.05. These limits depend on the arbitrary settings of other parameters, and can be difficult to compute.
2. Data retrieved December 22, 2012 from http://www.baseball-reference.com/leagues/MLB/2012standard-batting.shtml
3. This analysis was summarized at http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multi-level-hierarchical.html
4. In the context of a normal distribution, instead of a beta distribution, the “precision” is the reciprocal of variance. Intuitively, it refers to the narrowness of the distribution for either the normal or beta distributions.
Glossary
Hierarchical model: A formal model that can be expressed such that one parameter is dependent on another parameter. Many models can be meaningfully factored this way, for example when there are parameters that describe data from individuals, and the individual-level parameters depend on group-level parameters.
Highest density interval (HDI): The highest density interval summarizes the interval under a probability distribution where the probability densities inside the interval are higher than probability densities outside the interval. A 95% HDI includes the 95% of the distribution with the highest probability density.
Markov chain Monte Carlo (MCMC): A class of stochastic algorithms for obtaining samples from a probability distribution. The algorithms take a random walk through parameter space, favoring values that have higher probability. With a sufficient number of steps, the values of the parameter are visited in proportion to their probabilities and therefore the samples can be used to approximate the distribution. Widely used examples of MCMC are the Gibbs sampler and the Metropolis-Hastings algorithm.
Posterior distribution: A probability distribution over parameters derived via Bayes’ rule from the prior distribution by taking into account the targeted data.
Prior distribution: A probability distribution over parameters representing the beliefs, knowledge or assumptions about the parameters without reference to the targeted data. The prior distribution and the likelihood function together define a model.
Region of practical equivalence (ROPE): An interval around a parameter value that is considered to be equivalent to that value for practical purposes. The ROPE is used as part of a decision rule for accepting or rejecting particular parameter values.
bayesian estimation in hierarchical models
CHAPTER 14
Model Comparison and the Principle of Parsimony
Joachim Vandekerckhove, Dora Matzke, and Eric-Jan Wagenmakers
Abstract
According to the principle of parsimony, model selection methods should value both descriptive accuracy and simplicity. Here we focus primarily on Bayes factors and minimum description length, explaining how these procedures strike a balance between goodness-of-fit and parsimony. Throughout, we demonstrate the methods with an application to false memory, evaluating three competing multinomial processing tree models of interference in memory.
Key Words: model selection, goodness of fit, parsimony, inference, Akaike’s information
Introduction At its core, the study of psychology is concerned with the discovery of plausible explanations for human behavior. For instance, one may observe that “practice makes perfect”: as people become more familiar with a task, they tend to execute it more quickly and with fewer errors. More interesting is the observation that practice tends to improve performance such that most of the benefit is accrued early on, a pattern of diminishing returns that is well described by a power law (Logan, 1988; but see Heathcote, Brown, & Mewhort, 2000). This pattern occurs across so many different tasks (e.g., cigar rolling, maze solving, fact retrieval, and a variety of standard psychological tasks) that it is known as the “power law of practice.” Consider, for instance, the lexical decision task, a task in which participants have to decide quickly whether a letter string is an existing word (e.g., sunscreen) or not (e.g., tolphin). When repeatedly presented with the same stimuli,1 participants show a power law decrease in their mean response latencies; in fact, they show a power law decrease in the entire response time distribution, that is, both the fast
responses and the slow responses speed up with practice according to a power law (Logan, 1992). The observation that practice makes perfect is trivial, but the finding that practice-induced improvement follows a general law is not. Nevertheless, the power law of practice only provides a descriptive summary of the data and does not explain the reasons that practice should result in a power law improvement in performance. In order to go beyond direct observation and statistical summary, it is necessary to bridge the divide between observed performance on the one hand and the pertinent psychological processes on the other. Such bridges are built from a coherent set of assumptions about the underlying cognitive processes—a theory. Ideally, substantive psychological theories are formalized as quantitative models (Busemeyer & Diederich, 2010; Lewandowsky & Farrell, 2010). For example, the power law of practice has been explained by instance theory (Logan, 1992, 2002). Instance theory stipulates that earlier experiences are stored in memory as individual traces or instances; upon presentation of a stimulus, these instances race to be retrieved, and the winner of the race initiates
a response. Mathematical analysis shows that, as instances are added to memory, the finishing time of the winning instance decreases as a power function. Hence, instance theory provides a simple and general explanation of the power law of practice. For all its elegance and generality, instance theory has not been the last word on the power law of practice. The main reason is that single phenomena often afford different competing explanations. For example, the effects of practice can also be accounted for by Rickard's component power laws model (Rickard, 1997), Anderson's ACT-R model (Anderson, 2004), Cohen et al.'s PDP model (Cohen, Dunbar, & McClelland, 1990), Ratcliff's diffusion model (Dutilh, Vandekerckhove, Tuerlinckx, & Wagenmakers, 2009; Ratcliff, 1978), or Brown and Heathcote's linear ballistic accumulator model (Brown & Heathcote, 2005, 2008; Heathcote & Hayes, 2012). When various models provide competing accounts of the same data set, it can be difficult to choose between them. The process of choosing between models is called model comparison, model selection, or hypothesis testing, and it is the focus of this chapter. A careful model comparison procedure includes both qualitative and quantitative elements. Important qualitative elements include the plausibility, parsimony, and coherence of the underlying assumptions, the consistency with known behavioral phenomena, the ability to explain rather than describe data, and the extent to which model predictions can be falsified through experiments. Here we ignore these important aspects and focus solely on the quantitative elements. The single most important quantitative element of model comparison relates to the ubiquitous trade-off between parsimony and goodness-of-fit (Pitt & Myung, 2002).
The motivating insight is that the appeal of an excellent fit to the data (i.e., high descriptive adequacy) needs to be tempered to the extent that the fit was achieved with a highly complex and powerful model (i.e., low parsimony). The topic of quantitative model comparison is as important as it is challenging; fortunately, the topic has received—and continues to receive— considerable attention in the field of statistics, and the results of those efforts have been made accessible to psychologists through a series of recent special issues, books, and articles (e.g., Grünwald, 2007; Myung et al., 2000; Pitt & Myung, 2002; Wagenmakers & Waldorp, 2006). Here we discuss several procedures for model comparison, with an emphasis on minimum description length and
the Bayes factor. Both procedures entail principled and general solutions to the trade-off between parsimony and goodness of fit. The outline of this chapter is as follows. The first section describes the principle of parsimony and the unavoidable trade-off with goodness of fit. The second section summarizes the research of Wagenaar and Boer (1987), who carried out an experiment to compare three competing multinomial processing tree models (MPTs; Batchelder & Riefer, 1980); this model comparison exercise is used as a running example throughout the chapter. The third section outlines different methods for model comparison and applies them to Wagenaar and Boer's (1987) MPT models. We focus on two popular information criteria, the AIC and the BIC; on the Fisher information approximation of the minimum description length principle; and on Bayes factors as obtained from importance sampling. The fourth section contains conclusions and take-home messages.
The Principle of Parsimony
Throughout history, prominent philosophers and scientists have stressed the importance of parsimony. For instance, in the Almagest—a famous second-century book on astronomy—Ptolemy writes: "We consider it a good principle to explain the phenomena by the simplest hypotheses that can be established, provided this does not contradict the data in an important way." Ptolemy's principle of parsimony is widely known as Occam's razor (see Box 1); the principle is intuitive as it puts a premium on elegance. In addition, most people feel naturally attracted to models and explanations that are easy to understand and communicate. Moreover, the principle also gives grounds to reject propositions that are without empirical support, including extrasensory perception, alien abductions, or mysticism. In an apocryphal interaction, Napoleon Bonaparte asked Pierre-Simon Laplace why the latter's book on the universe did not mention its creator, only to receive the curt reply "I had no need of that hypothesis." However, the principle of parsimony finds its main motivation in the benefits that it bestows on those who use models for prediction. To see this, note that empirical data are assumed to be composed of a structural, replicable part and an idiosyncratic, nonreplicable part. The former is known as the signal, and the latter is known as the noise (Silver, 2012). Models that capture all the signal and none of the noise provide the best possible predictions to unseen data from the same source. Overly simplistic models, however, fail to capture part of the signal; these models underfit the data and provide poor predictions. Overly complex models, on the other hand, mistake some of the noise for actual signal; these models overfit the data and again provide poor predictions. Thus, parsimony is essential because it helps discriminate the signal from the noise, allowing better prediction and generalization to new data.

model comparison and the principle of parsimony
Box 1. Occam's razor
Occam's razor (sometimes Ockham's) is named after the English philosopher and Franciscan friar Father William of Occam (c. 1288–c. 1348), who wrote "Numquam ponenda est pluralitas sine necessitate" (plurality must never be posited without necessity) and "Frustra fit per plura quod potest fieri per pauciora" (it is futile to do with more what can be done with less). Occam's metaphorical razor symbolizes the principle of parsimony: by cutting away needless complexity, the razor leaves only theories, models, and hypotheses that are as simple as possible without being false. Throughout the centuries, many other scholars have espoused the principle of parsimony; the list predating Occam includes Aristotle, Ptolemy, and Thomas Aquinas ("it is superfluous to suppose that what can be accounted for by a few principles has been produced by many"), and the list following Occam includes Isaac Newton ("We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. Therefore, to the same natural effects we must, so far as possible, assign the same causes."), Bertrand Russell, Albert Einstein ("Everything should be made as simple as possible, but no simpler"), and many others. In the field of statistical reasoning and inference, Occam's razor forms the foundation for the principle of minimum description length (Grünwald, 2000, 2007). In addition, Occam's razor is automatically accommodated through Bayes factor model comparisons (e.g., Jefferys & Berger, 1992; Jeffreys, 1961; MacKay, 2003). Both minimum description length and Bayes factors feature prominently in this chapter as principled methods to quantify the trade-off between parsimony and goodness of fit. Note that parsimony plays a role even in classical null-hypothesis significance testing, where the simpler null hypothesis is retained unless the data provide sufficient grounds for its rejection.

new directions
Goodness of fit
"From the earliest days of statistics, statisticians have begun their analysis by proposing a distribution for their observations and then, perhaps with somewhat less enthusiasm, have checked on whether this distribution is true. Thus over the years a vast number of test procedures have appeared, and the study of these procedures has come to be known as goodness-of-fit" (D'Agostino & Stephens, 1986, p. v). The goodness of fit of a model is a quantity that expresses how well the model is able to account for a given set of observations. It addresses the following question: under the assumption that a certain model is a true characterization of the population from which we have obtained a sample, and given the best-fitting parameter estimates for that model, how well does our sample of data agree with that model? Various ways of quantifying goodness of fit exist. One common expression involves a Euclidean distance metric between the data and the model's best prediction (the least squares error, or LSE, metric is the most well known of these). Another measure involves the likelihood function, which expresses the likelihood of observing the data under the model, and is maximized by the best-fitting parameter estimates (Myung, 2000).
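As a toy illustration of these two measures, the sketch below (hypothetical data, standard library only; not from the chapter) computes the least squares error and the normal log-likelihood for a model that predicts a single constant value for every observation. Under a normal error model, both criteria are optimized by the same estimate, the sample mean.

```python
import math

# Hypothetical observations; the toy model predicts a constant mu for all.
data = [2.1, 1.9, 2.4, 2.0, 1.6]

def lse(mu, xs):
    """Least squares error: squared Euclidean distance, data vs. prediction."""
    return sum((x - mu) ** 2 for x in xs)

def log_likelihood(mu, xs, sigma=1.0):
    """Log-likelihood of the data under a normal error model (known sigma)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in xs)

# Under this error model both criteria pick the same best-fitting estimate:
mu_hat = sum(data) / len(data)
print(mu_hat, lse(mu_hat, data), log_likelihood(mu_hat, data))
```

Any other choice of the parameter yields a larger LSE and a smaller log-likelihood, which is why either quantity can serve as a goodness-of-fit measure here.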
Parsimony
Goodness of fit must be balanced against model complexity in order to avoid overfitting—that is, to avoid building models that explain the data at hand well, but fail in out-of-sample predictions. The principle of parsimony forces researchers to abandon complex models that are tweaked to the observed data in favor of simpler models that can generalize to new data sets. A common example is that of polynomial regression. Figure 14.1 gives a typical example. The observed data are the circles in both the left and right panels. Crosses indicate unobserved, out-of-sample data points to which the model should generalize. In the left panel, a quadratic function is fit to the 8 observed data points, whereas the right panel shows a 7th-order polynomial function fitted to the same data. Since a polynomial of degree 7 can be made to contain any 8 points in the plane, the observed data are perfectly captured by the best-fitting polynomial. However, it is clear that this function generalizes poorly to the unobserved samples, and it shows undesirable behavior for larger values of x.

Fig. 14.1 A polynomial regression of degree d is characterized by $\hat{y} = \sum_{i=0}^{d} a_i x^i$. This model has d + 1 free parameters $a_i$; hence, in the right panel, a polynomial of degree 7 perfectly accounts for the 8 visible data points. This 7th-order polynomial, however, accounts poorly for the out-of-sample data points.

In sum, an adequate model comparison method needs to discount goodness of fit with model complexity. But how exactly can this be accomplished? As we will describe shortly, several model comparison methods are currently in vogue, all resulting from principled ideas on how to obtain measures of generalizability,2 meaning that these methods attempt to quantify the extent to which a model predicts unseen data from the same source (cf. Figure 14.1). Before outlining the details of various model-comparison methods, we now introduce a data set that serves as a working example throughout the remainder of the chapter.
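The overfitting pattern of Figure 14.1 can be reproduced in a few lines. The sketch below uses hypothetical data of our own (a quadratic signal with fixed "noise" offsets, not the values plotted in the figure) and compares a least-squares quadratic fit with the degree-7 polynomial that interpolates all 8 training points.

```python
# Sketch of the overfitting phenomenon: quadratic signal plus fixed "noise".
def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(xs, ys):
    """Least-squares coefficients [a0, a1, a2] via the normal equations."""
    S = [[sum(x ** (j + k) for x in xs) for k in range(3)] for j in range(3)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(3)]
    d = det3(S)
    coefs = []
    for col in range(3):            # Cramer's rule, one column at a time
        M = [row[:] for row in S]
        for j in range(3):
            M[j][col] = b[j]
        coefs.append(det3(M) / d)
    return coefs

def interpolate(xs, ys, x):
    """Evaluate the unique degree-(n-1) Lagrange interpolant at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = list(range(8))                               # 8 visible data points
noise = [1.0, -2.0, 2.0, -1.0, 0.5, 1.5, -2.0, 1.0]
ys = [5 + 2 * x + 3 * x ** 2 + e for x, e in zip(xs, noise)]

a0, a1, a2 = fit_quadratic(xs, ys)

# Noise-free out-of-sample probes from the same quadratic signal:
test_xs = [8.0, 9.0, 10.0]
test_ys = [5 + 2 * x + 3 * x ** 2 for x in test_xs]

err_quad = sum((a0 + a1 * x + a2 * x ** 2 - y) ** 2
               for x, y in zip(test_xs, test_ys))
err_deg7 = sum((interpolate(xs, ys, x) - y) ** 2
               for x, y in zip(test_xs, test_ys))
print(err_quad < err_deg7)  # True: the interpolant overfits badly
```

The degree-7 interpolant fits the training data perfectly but inherits the noise, so its out-of-sample error exceeds that of the quadratic fit by several orders of magnitude.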
Example: Competing Models of Interference in Memory
For an example model comparison scenario, we revisit a study by Wagenaar and Boer (1987) on the effect of misleading information on the recollection of an earlier event. The effect of misleading postevent information was first studied systematically by Loftus, Miller, and Burns (1978); for a review of the relevant literature, see Wagenaar and Boer (1987) and references therein.
Wagenaar and Boer (1987) proposed three competing theoretical accounts of the effect of misleading postevent information. To evaluate the three accounts, Wagenaar and Boer set up an experiment and introduced three quantitative models that translate each of the theoretical accounts into a set of parametric assumptions that together give rise to a probability density over the data, given the parameters.
Abstract Accounts
Wagenaar and Boer (1987) outlined three competing theoretical accounts of the effect of misleading postevent information on memory. Loftus's destructive updating model (DUM) posits that the conflicting information replaces and destroys the original memory. A coexistence model (CXM) asserts that an inhibition mechanism suppresses the original memory, which nonetheless remains viable though temporarily inaccessible. Finally, a no-conflict model (NCM) simply states that misleading postevent information is ignored, except when the original information was not encoded or already forgotten.
Experimental Design
The experiment by Wagenaar and Boer (1987) proceeded as follows. In Phase I, a total of 562 participants were shown a sequence of events in the form of a pictorial story involving a pedestrian-car collision. One picture in the story would show a car at an intersection, and a traffic light that was either red, yellow, or green. In Phase II, participants were asked a set of test questions with (potentially) conflicting information: participants might be asked whether they remembered a pedestrian crossing the road when the car approached the "traffic light" (in the consistent group), the "stop sign" (in the inconsistent group), or the "intersection" (in the neutral group). Then, in Phase III, participants were given a recognition test about elements of the story using picture pairs. Each pair would contain one picture from Phase I and one slightly altered version of the original picture. Participants were then asked to identify which of the pair had featured in the original story. A picture pair is shown in Figure 14.2, where the intersection is depicted with either a traffic light or a stop sign. Finally, in Phase IV, participants were informed that the correct choice in Phase III was the picture with the traffic light, and were then asked to recall the color of the traffic light. By design, this experiment should yield different response patterns depending on whether the conflicting postevent information destroys the original information (destructive-updating model), only suppresses it temporarily (coexistence model), or does not affect the original information unless it is unavailable (no-conflict model).

Fig. 14.2 A pair of pictures from the third phase (i.e., the recognition test) of Wagenaar and Boer (1987), containing the critical episode at the intersection. Reprinted with permission. © Acta Psychologica, Elsevier Inc.
Concrete Models
Wagenaar and Boer (1987) developed a series of MPT models (see Box 2) to quantify the predictions of the three competing theoretical accounts. Figure 14.3 depicts the no-conflict MPT model in the inconsistent condition. The figure is essentially a decision tree that is navigated from left to right. In Phase I of the collision narrative, the traffic
light is encoded with probability p, and if so, the color is encoded with probability c. In Phase II, the stop sign is encoded with probability q. In Phase III, the answer may be known or may be guessed correctly with probability 1/2, and in Phase IV the answer may be known or may be guessed correctly with probability 1/3. The probability of each path is given by the product of all the encountered probabilities, and the total probability of a response pattern is the summed probability of all branches that lead to it. For example, the total probability of getting both questions wrong is (1 − p) × q × 2/3 + (1 − p) × (1 − q) × 1/2 × 2/3. We would then, under the no-conflict model, expect that proportion of participants to fall in the response pattern with two errors. The destructive-updating model (Figure 2 in Wagenaar & Boer, 1987) extends the three-parameter no-conflict model by adding a fourth parameter d: the probability of destroying the traffic-light information, which may occur whenever the stop sign was encoded. The coexistence model (Figure 3 in Wagenaar & Boer, 1987), on the other hand, posits an extra probability s that the traffic light is suppressed (but not destroyed) when the stop sign is encoded. A critical difference between the latter two is that a destruction step will lead to chance accuracy in Phase IV if every piece of information was encoded, whereas a suppression step will not affect the underlying memory and will lead to accurate responding. Note here that, if s = 0, the coexistence model reduces to the no-conflict model, as does the destructive-updating model with d = 0. The models make different predictions only in the inconsistent condition; for the consistent and neutral conditions, the trees are identical.
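The branch arithmetic for the "both questions wrong" pattern can be sketched in a few lines (the branch interpretation is our reading of the tree; exact fractions avoid rounding):

```python
from fractions import Fraction

def p_both_wrong(p, q):
    """No-conflict model, inconsistent condition: probability of erring in
    both Phase III and Phase IV, following the two branches in the text."""
    # Branch 1: traffic light not encoded (1 - p) but stop sign encoded (q);
    # as we read the tree, the Phase III choice is then wrong, and the
    # Phase IV color guess fails with probability 2/3.
    branch1 = (1 - p) * q * Fraction(2, 3)
    # Branch 2: neither encoded; Phase III is a coin flip (wrong with
    # probability 1/2) and Phase IV is a 1-in-3 guess (wrong with 2/3).
    branch2 = (1 - p) * (1 - q) * Fraction(1, 2) * Fraction(2, 3)
    return branch1 + branch2

# With the Table 14.1 point estimates p = 0.50 and q = 0.50:
print(p_both_wrong(Fraction(1, 2), Fraction(1, 2)))  # 1/4
```

Under the reported point estimates, the no-conflict model thus expects a quarter of participants in the inconsistent condition to land in the two-error response pattern.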
Fig. 14.3 Multinomial processing tree representation of the inconsistent condition according to the no-conflict model, adapted from Wagenaar and Boer (1987). © Acta Psychologica, Elsevier Inc.
Box 2. Popularity of multinomial processing tree models
Multinomial processing tree models (Batchelder & Riefer, 1980; Chechile, 1973; Chechile & Meyer, 1976; Riefer & Batchelder, 1988) are psychological process models for categorical data. MPT models are used in two ways: as a psychometric tool to measure unobserved cognitive processes, and as a convenient formalization of competing psychological theories. Over time, MPTs have been applied to a wide range of psychological tasks and processes. For instance, MPT models are available for recognition, recall, source monitoring, perception, priming, reasoning, consensus analysis, the process dissociation procedure, implicit attitude measurement, and many other phenomena. For more information about MPTs, we recommend the review articles by Batchelder and Riefer (1999, 2007) and Erdfelder et al. (2009). The latter review article also discusses different software packages that can be used to fit MPT models. Necessarily missing from that
list is the recently developed R package MPTinR (Singmann & Kellen, 2013), with which we have had good experiences. As will become apparent throughout this chapter, however, our preferred method for fitting MPT models is Bayesian (Chechile & Meyer, 1976; Klauer, 2010; Lee & Wagenmakers, 2013; Matzke, Dolan, Batchelder, & Wagenmakers, in press; Rouder, Lu, Morey, Sun, & Speckman, 2008; Smith & Batchelder, 2010).
Previous Conclusions
After fitting the three competing MPT models, Wagenaar and Boer (1987) obtained the parameter point estimates shown in Table 14.1. Using a χ² model fit index, they concluded that "a distinction among the three model families appeared to be impossible in actual practice" (p. 304), after noting that the no-conflict model provides "an almost perfect fit" to the data. They propose, then, "to accept the most parsimonious model, which is the no-conflict model." In the remainder of this chapter, we re-examine this conclusion using various model comparison methods.

Table 14.1. Parameter point estimates from Wagenaar and Boer (1987).

                                      p      c      q      d      s
No-conflict model (NCM)              0.50   0.57   0.50    na     na
Destructive-updating model (DUM)     0.50   0.57   0.50   0.00    na
Coexistence model (CXM)              0.55   0.55   0.43    na    0.20
Three Methods for Model Comparison
Many model comparison methods have been developed, all of them attempts to address the ubiquitous trade-off between parsimony and goodness of fit. Here we focus on three main classes of interrelated methods: (1) AIC and BIC, the most popular information criteria; (2) minimum description length; and (3) Bayes factors. Below we provide a brief description of each method and then apply it to the model comparison problem that confronted Wagenaar and Boer (1987).
Information Criteria
Information criteria are among the most popular methods for model comparison. Their popularity is explained by the simple and transparent manner in which they quantify the trade-off between parsimony and goodness of fit. Consider for instance the oldest information criterion, AIC ("an information criterion"), proposed by Akaike (1973, 1974a):

$\mathrm{AIC} = -2 \ln p(y \mid \hat{\theta}) + 2k.$  (1)

The first term involves $\ln p(y \mid \hat{\theta})$, the maximum log likelihood that quantifies goodness of fit, where y is the data set and $\hat{\theta}$ the maximum-likelihood parameter estimate; the second term, 2k, is a penalty for model complexity, measured by the number of adjustable model parameters k. The AIC estimates the expected information loss incurred when a probability distribution f (associated with the true data-generating process) is approximated by a probability distribution g (associated with the model under evaluation). Hence, the model with the lowest AIC is the model with the smallest expected information loss between reality f and model g, where the discrepancy is quantified by the Kullback-Leibler divergence I(f, g), a measure of the discrepancy between two probability distributions (for
full details, see Burnham & Anderson, 2002). The AIC is unfortunately not consistent: as the number of observations grows infinitely large, AIC is not guaranteed to choose the true data-generating model. In fact, there is cause to believe that the AIC tends to select complex models that overfit the data (O'Hagan & Forster, 2004; for a discussion see Vrieze, 2012). Another information criterion, the BIC ("Bayesian information criterion"), was proposed by Schwarz (1978):

$\mathrm{BIC} = -2 \ln p(y \mid \hat{\theta}) + k \ln n.$  (2)

Here, the penalty term is k ln n, where n is the number of observations.3 Hence, the BIC penalty for complexity increases with sample size, outweighing that of AIC as soon as n ≥ 8. The BIC was derived as an approximation of a Bayesian hypothesis test using default parameter priors (the "unit information prior"; see later for more information on Bayesian hypothesis testing, and see Raftery, 1995, for more information on the BIC). The BIC is consistent: as the number of observations grows infinitely large, BIC is guaranteed to choose the true data-generating model. Nevertheless, there is evidence that, in practical applications, the BIC tends to select simple models that underfit the data (Burnham & Anderson, 2002).

Now consider a set of candidate models, $M_i$, $i = 1, \ldots, M$, each with a specific IC (AIC or BIC) value. The model with the smallest IC value should be preferred, but the extent of this preference is not immediately apparent. For better interpretation we can calculate IC model weights (Akaike, 1974; Burnham & Anderson, 2002; Wagenmakers & Farrell, 2004). First, we compute, for each model i, the difference in IC with respect to the IC of the best candidate model:

$\Delta_i = \mathrm{IC}_i - \min \mathrm{IC}.$  (3)

This step is taken to increase numerical stability, but it also serves to emphasize the point that only differences in IC values are relevant. Next we obtain the model weights by transforming back to the likelihood scale and normalizing:

$w_i = \frac{\exp(-\Delta_i/2)}{\sum_{m=1}^{M} \exp(-\Delta_m/2)}.$  (4)

The resulting AIC and BIC weights are called Akaike weights and Schwarz weights, respectively. These weights not only convey the relative preference among a set of candidate models (i.e., they express a degree to which we should prefer one model from the set as superior) but also provide a method to combine predictions across multiple models using model averaging (Hoeting, Madigan, Raftery, & Volinsky, 1999).

Both AIC and BIC rely on an assessment of model complexity that is relatively crude, because it is determined entirely by the number of free parameters but not by the shape of the function through which they make contact with the data. To illustrate the importance of the functional form in which the parameters participate, consider the case of Fechner's law and Stevens's law of psychophysics. Both of these laws transform objective stimulus intensity to subjective experience through a two-parameter nonlinear function.4 According to Fechner's law, the perceived intensity W(I) of stimulus I is given by the logarithmic function $W(I) = k \ln(I + \beta)$. Stevens's law describes perceived intensity as a power function: $S(I) = cI^b$. Although both laws have the same number of parameters, Stevens's law is more complex because it can cover a larger number of data patterns (see Figure 14.4).

Fig. 14.4 Two representative instances of Fechner's law (left) and Stevens's law (right). Although Fechner's law is restricted to nonlinear functions that level off as stimulus intensity increases, Stevens's law can additionally take shapes with accelerating slopes.

Application to Multinomial Processing Tree Models
In order to apply AIC and BIC to the three competing MPTs proposed by Wagenaar and Boer (1987), we first need to compute the maximum log likelihood. Note that the MPT model parameters determine the predicted probabilities for the different response outcome categories (cf. Figure 14.3
and Box 2); these predicted probabilities are the deterministic parameters of a multinomial probability density function. Hence, the maximum log likelihood parameter estimates for an MPT model produce multinomial parameters that maximize the probability of the observed data (i.e., the occurrence of the various outcome categories). Several software packages exist that can help find the maximum log likelihood parameter estimates for MPTs (Singmann & Kellen, 2013). With these estimates in hand, we can compute the information criteria described in the previous section. Table 14.2 shows the maximum log likelihood as well as AIC, BIC, and their associated weights (wAIC and wBIC; from Eq. 4). Interpreting wAIC and wBIC as measures of relative preference, we see that the results in Table 14.2 are mostly inconclusive. According to wAIC, the no-conflict model and coexistence model are virtually indistinguishable, though both are preferable to the destructive-updating model. According to wBIC, however, the no-conflict model should be preferred over both the destructive-updating model and the coexistence model. The extent of this preference is noticeable but not decisive.
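The information criteria and weights in Table 14.2 can be reproduced from the reported maximum log likelihoods with a short script (Eqs. 1-4; n = 562; raw BIC values may differ from the table in the second decimal because the log likelihoods are reported to only two decimals):

```python
import math

# Reproduce Table 14.2 from the reported maximum log likelihoods.
# (log L, k) per model; n = 562 participants.
models = {"NCM": (-24.41, 3), "DUM": (-24.41, 4), "CXM": (-23.35, 4)}
n = 562

def aic(logl, k):
    """Eq. 1: AIC = -2 log L + 2k."""
    return -2 * logl + 2 * k

def bic(logl, k, n):
    """Eq. 2: BIC = -2 log L + k log n."""
    return -2 * logl + k * math.log(n)

def ic_weights(ics):
    """Eqs. 3 and 4: normalize IC differences on the likelihood scale."""
    deltas = [ic - min(ics) for ic in ics]
    raw = [math.exp(-d / 2) for d in deltas]
    return [r / sum(raw) for r in raw]

aics = [aic(logl, k) for logl, k in models.values()]
bics = [bic(logl, k, n) for logl, k in models.values()]
print([round(a, 2) for a in aics])              # [54.82, 56.82, 54.7]
print([round(w, 2) for w in ic_weights(aics)])  # [0.41, 0.15, 0.44]
print([round(w, 2) for w in ic_weights(bics)])  # [0.86, 0.04, 0.1]
```

The recovered weights match the wAIC and wBIC columns of Table 14.2 up to rounding.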
Table 14.2. AIC and BIC for the Wagenaar & Boer MPT models.

                                     log likelihood   k    AIC     wAIC    BIC     wBIC
No-conflict model (NCM)                  −24.41       3    54.82   0.41    67.82   0.86
Destructive-updating model (DUM)         −24.41       4    56.82   0.15    74.15   0.04
Coexistence model (CXM)                  −23.35       4    54.70   0.44    72.03   0.10

Note: k is the number of free parameters.

Minimum Description Length
The minimum description length principle is based on the idea that statistical inference centers around capturing regularity in data; regularity, in turn, can be exploited to compress the data. Hence, the goal is to find the model that compresses the data the most (Grünwald, 2007). This is related to the concept of Kolmogorov complexity—for a sequence of numbers, Kolmogorov complexity is the length of the shortest program that prints that sequence and then halts (Grünwald, 2007). Although Kolmogorov complexity cannot be calculated, a suite of concrete methods is available based on the idea of model selection through data compression. These methods, most of them developed by Jorma Rissanen, fall under the general heading of minimum description length (MDL; Rissanen, 1978, 1987, 1996, 2001). In psychology, the MDL principle has been applied and promoted primarily by Grünwald (2000), Grünwald, Myung, and Pitt (2005), and Grünwald (2007), as well as Myung, Navarro, and Pitt (2006), Pitt and Myung (2002), and Pitt, Myung, and Zhang (2002). Here we mention three versions of the MDL principle. First, there is the so-called crude two-part code (Grünwald, 2007); here, one sums the description of the model (in bits) and the description of the data encoded with the help of that model (in bits). The penalty for complex models is that they take many bits to describe, increasing the summed code length. Unfortunately, it can be difficult to define the number of bits required to describe a model. Second, there is the Fisher information approximation (FIA; Pitt et al., 2002; Rissanen, 1996):

$\mathrm{FIA} = -\ln p(y \mid \hat{\theta}) + \frac{k}{2} \ln \frac{n}{2\pi} + \ln \int \sqrt{\det I(\theta)} \, d\theta,$  (5)
where $I(\theta)$ denotes the Fisher information matrix of sample size 1 (Ly, Verhagen, Grasman, & Wagenmakers, 2014). $I(\theta)$ is a $k \times k$ matrix whose $(i, j)$th element is

$I_{i,j}(\theta) = E\left[\frac{\partial \ln p(y \mid \theta)}{\partial \theta_i} \frac{\partial \ln p(y \mid \theta)}{\partial \theta_j}\right],$

where $E(\cdot)$ is the expectation operator. Note that FIA is similar to AIC and BIC in that it includes a first term that represents goodness of fit, and additional terms that represent a penalty for complexity. The second term resembles that of BIC, and the third term reflects a more sophisticated penalty that represents the number of distinguishable probability distributions that a model can generate (Pitt et al., 2002). Hence, FIA differs from AIC and BIC in that it also accounts for functional form complexity, not just complexity due to the number of free parameters. Note that FIA weights (or Rissanen weights) can be obtained by multiplying FIA by 2 and then applying Eqs. 3 and 4.

The third version of the MDL principle discussed here is normalized maximum likelihood (NML; Myung et al., 2006; Rissanen, 2001):

$\mathrm{NML} = \frac{p(y \mid \hat{\theta}(y))}{\int_X p(x \mid \hat{\theta}(x)) \, dx}.$  (6)

This equation shows that NML tempers the enthusiasm about a good fit to the observed data y (i.e., the numerator) to the extent that the model could also have provided a good fit to random data x (i.e., the denominator). NML is simple to state but can be difficult to compute; for instance, the denominator may be infinite, and this requires further measures to be taken (for details see Grünwald, 2007). Additionally, NML requires an integration over the entire set of possible data sets, which may be difficult to define as it depends on unknown decision processes in the researchers (Berger & Berry, 1988). Note that, since the computation of NML depends on the likelihood of data that might have occurred but did not, the procedure violates the likelihood principle, which states that all information about a parameter θ obtainable from an experiment is contained in the likelihood function for θ for the given y (Berger & Wolpert, 1988).

Application to Multinomial Processing Tree Models
Using the parameter estimates from Table 14.1 and the code provided by Wu, Myung, and Batchelder (2010), we can compute the FIA for the three competing MPT models considered by Wagenaar and Boer (1987).5 Table 14.3 displays, for each model, the FIA along with its associated complexity measure (the other of its two constituent components, the maximum log likelihood, can be found in Table 14.2). The conclusions from the MDL analysis mirror those from the AIC measure, expressing a slight disfavor for the destructive-updating model, and approximately equal preference for the no-conflict model versus the coexistence model.

Table 14.3. Minimum description length values for the Wagenaar & Boer MPT models.

                                     Complexity    FIA     wFIA
No-conflict model (NCM)                  6.44     30.86    0.44
Destructive-updating model (DUM)         7.39     31.80    0.17
Coexistence model (CXM)                  7.61     30.96    0.39
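To make the FIA complexity penalty concrete, consider a toy one-parameter model of our own (not one of the MPT models above): a single Bernoulli rate θ. Its Fisher information is I(θ) = 1/(θ(1 − θ)), and the integral of √I(θ) over [0, 1] equals π analytically. The sketch below checks this numerically and evaluates the resulting penalty at n = 562:

```python
import math

# Toy illustration: Fisher information of a Bernoulli rate parameter.
def sqrt_fisher_info(theta):
    return math.sqrt(1.0 / (theta * (1.0 - theta)))

# Midpoint rule; the endpoint singularities are integrable, and the
# midpoints avoid evaluating the integrand at theta = 0 or 1.
N = 200_000
h = 1.0 / N
integral = sum(sqrt_fisher_info((i + 0.5) * h) for i in range(N)) * h
print(abs(integral - math.pi) < 0.01)  # True

# FIA complexity penalty (the last two terms of Eq. 5) for k = 1, n = 562:
k, n = 1, 562
complexity = (k / 2) * math.log(n / (2 * math.pi)) + math.log(integral)
```

Unlike the raw parameter count used by AIC and BIC, this penalty depends on the model's functional form through the Fisher information integral.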
Bayes Factors In Bayesian model comparison, the posterior odds for models M1 and M2 are obtained by updating the prior odds with the diagnostic information from the data: p(M1 | y) p(M1 ) m(y | M1 ) = × . (7) p(M2 | y) p(M2 ) m(y | M2 ) Equation 7 shows that the change from prior odds p(M1 )/ p(M2 ) to posterior odds p(M1 | y)/ p(M2 | y) is given by the ratio of marginal likelihoods m(y | M1 )/m(y | M2 ) (see later for the definition of the marginal likelihood). This ratio is known as the Bayes factor (Jeffreys, 1961; Kass & Raftery, 1995). The log of the Bayes factor is often interpreted as the weight of evidence provided by the data (Good, 1985; for details see Berger & Pericchi, 1996; Bernardo & Smith, 1994; Gill, 2002; O’Hagan, 1995). Thus, when the Bayes factor BF12 = m(y | M1 )/m(y | M2 ) equals 5, the observed data y are 5 times more likely to occur under M1 than under M2 ; when BF12 equals 0.1, the observed data are 10 times more likely under M2 than under M1 . Even though the Bayes factor has an unambiguous and continuous scale, it is sometimes useful to summarize the Bayes factor in terms of discrete categories of evidential strength. Jeffreys (1961, Appendix B) proposed the classification scheme shown in Table 14.4. We replaced the labels “not worth more than a bare mention” with “anecdotal,” “decisive” with “extreme,” and “substantial” with “moderate.” These labels facilitate scientific communication but should be considered only as an approximate descriptive articulation of different standards of evidence. Bayes factors negotiate the trade-off between parsimony and goodness of fit and implement an automatic Occam’s razor (Jefferys & Berger, 1992;
Table 14.4. Evidence categories for the Bayes factor BF12 (based on Jeffreys, 1961).

Bayes factor BF12    Interpretation
> 100                Extreme evidence for M1
30 – 100             Very strong evidence for M1
10 – 30              Strong evidence for M1
3 – 10               Moderate evidence for M1
1 – 3                Anecdotal evidence for M1
1                    No evidence
1/3 – 1              Anecdotal evidence for M2
1/10 – 1/3           Moderate evidence for M2
1/30 – 1/10          Strong evidence for M2
1/100 – 1/30         Very strong evidence for M2
< 1/100              Extreme evidence for M2
MacKay, 2003; Myung & Pitt, 1997). To see this, consider that the marginal likelihood m(y | M(·)) can be expressed as $\int p(y \mid \theta, M_{(\cdot)})\, p(\theta \mid M_{(\cdot)})\, d\theta$: an average across the entire parameter space, with the prior providing the averaging weights. It follows that complex models with high-dimensional parameter spaces are not necessarily desirable—large regions of the parameter space may yield a very poor fit to the data, dragging down the average. The marginal likelihood will be highest for parsimonious models that use only those parts of the parameter space that are required to provide an adequate account of the data (Lee & Wagenmakers, 2013). By using marginal likelihood, the Bayes factor punishes models that hedge their bets and make vague predictions. Models can hedge their bets in different ways: by including extra parameters, by assigning very wide prior distributions to the model parameters, or by using parameters that participate in the likelihood through a complicated functional form. By computing a weighted average likelihood across the entire parameter space, the marginal likelihood (and, consequently, the Bayes factor) automatically takes all these aspects into account.
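Equation 7 and Table 14.4 can be wrapped in a few lines of code. The sketch below (function names are ours, not from the chapter) updates prior model odds by a Bayes factor and attaches the corresponding Jeffreys label:

```python
def posterior_model_prob(bf12, prior_prob_m1=0.5):
    """Posterior probability of M1 after updating the prior odds by BF12 (Eq. 7)."""
    prior_odds = prior_prob_m1 / (1.0 - prior_prob_m1)
    posterior_odds = prior_odds * bf12
    return posterior_odds / (1.0 + posterior_odds)

def jeffreys_label(bf12):
    """Descriptive evidence category for BF12, following Table 14.4."""
    if bf12 == 1:
        return "No evidence"
    thresholds = [
        (100,   "Extreme evidence for M1"),
        (30,    "Very strong evidence for M1"),
        (10,    "Strong evidence for M1"),
        (3,     "Moderate evidence for M1"),
        (1,     "Anecdotal evidence for M1"),
        (1/3,   "Anecdotal evidence for M2"),
        (1/10,  "Moderate evidence for M2"),
        (1/30,  "Strong evidence for M2"),
        (1/100, "Very strong evidence for M2"),
    ]
    for cutoff, label in thresholds:
        if bf12 > cutoff:
            return label
    return "Extreme evidence for M2"

# BF12 = 5: the data are 5 times more likely under M1.  Starting from
# indifference (prior odds 1:1), the posterior probability of M1 is 5/6.
print(round(posterior_model_prob(5), 3))  # → 0.833
print(jeffreys_label(5))                  # → Moderate evidence for M1
```

Boundary values (e.g., BF12 exactly 3) are assigned to the weaker category here; the table itself leaves those edge cases open.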
model comparison and the principle of parsimony
309
Bayes factors represent “the standard Bayesian solution to the hypothesis testing and model selection problems” (Lewis & Raftery, 1997, p. 648) and “the primary tool used in Bayesian inference for hypothesis testing and model selection” (Berger, 2006, p. 378), but their application is not without challenges (Box 3). Next we show how these challenges can be overcome for the general class of MPT models. Then we compare the results of our Bayes factor analysis with those of the other model comparison methods using Jeffreys weights (i.e., normalized marginal likelihoods).
Box 3 Two challenges for Bayes factors Bayes factors (Jeffreys, 1961; Kass & Raftery, 1995) come with two main challenges, one practical and one conceptual. The practical challenge arises because Bayes factors are defined as the ratio of two marginal likelihoods, each of which requires integration across the entire parameter space. This integration process can be cumbersome and hence the Bayes factor can be difficult to obtain. Fortunately, there are many approximate and exact methods to facilitate the computation of the Bayes factor (e.g., Ardia, Baştürk, Hoogerheide, & van Dijk, 2012; Chen, Shao, & Ibrahim, 2002; Gamerman & Lopes, 2006); in this chapter we focus on BIC (a crude approximation), the Savage-Dickey density ratio (applies only to nested models), and importance sampling. The conceptual challenge that Bayes factors bring is that the prior on the model parameters has a pronounced and lasting influence on the result. This should not come as a surprise: the Bayes factor punishes models for needless complexity, and the complexity of a model is determined in part by the prior distributions that are assigned to the parameters. The difficulty arises because researchers are often not very confident about the prior distributions that they specify. To overcome this challenge one can either spend more time and effort on the specification of realistic priors, or else one can choose default priors that fulfill general desiderata (e.g., Jeffreys, 1961; Liang, Paulo, Molina, Clyde, & Berger, 2008). Finally, the robustness of the conclusions can be verified by conducting a sensitivity analysis in which one examines the effect of changing the prior specification (e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011).
310
new directions
application to multinomial processing tree models
In order to compute the Bayes factor we seek to determine each model's marginal likelihood m(y | M(·)). As indicated earlier, the marginal likelihood m(y | M(·)) is given by integrating the likelihood over the prior:

$$m(y \mid M_{(\cdot)}) = \int p\left(y \mid \theta, M_{(\cdot)}\right) p\left(\theta \mid M_{(\cdot)}\right) d\theta. \quad (8)$$

The most straightforward manner to obtain m(y | M(·)) is to draw samples from the prior p(θ | M(·)) and average the corresponding values for p(y | θ, M(·)):

$$m(y \mid M_{(\cdot)}) \approx \frac{1}{N} \sum_{i=1}^{N} p\left(y \mid \theta_i, M_{(\cdot)}\right), \qquad \theta_i \sim p(\theta). \quad (9)$$

For MPT models, this brute-force integration approach may often be adequate. An MPT model usually has few parameters, and each is conveniently bounded from 0 to 1. However, brute-force integration is inefficient, particularly when the posterior is highly peaked relative to the prior: in this case, draws from p(θ | M(·)) tend to result in low likelihoods and only few chance draws may have high likelihood. This problem can be overcome by a numerical technique known as importance sampling (Hammersley & Handscomb, 1964). In importance sampling, efficiency is increased by drawing samples from an importance density g(θ) instead of from the prior p(θ | M(·)). Consider an importance density g(θ). Then,

$$m(y \mid M_{(\cdot)}) = \int \frac{g(\theta)}{g(\theta)}\, p\left(y \mid \theta, M_{(\cdot)}\right) p\left(\theta \mid M_{(\cdot)}\right) d\theta = \int \frac{p\left(y \mid \theta, M_{(\cdot)}\right) p\left(\theta \mid M_{(\cdot)}\right)}{g(\theta)}\, g(\theta)\, d\theta \approx \frac{1}{N} \sum_{i=1}^{N} \frac{p\left(y \mid \theta_i, M_{(\cdot)}\right) p\left(\theta_i \mid M_{(\cdot)}\right)}{g(\theta_i)}, \qquad \theta_i \sim g(\theta). \quad (10)$$
Note that if g(θ ) = p(θ | M(·) ), the importance sampler reduces to the brute-force integration shown in Eq. 9. Also note that if g(θ ) = p(θ | y, M(·) ), a single draw suffices to determine p(y) exactly. In sum, when the importance density equals the prior, we have brute force integration, and when it equals the posterior, we have a zero-variance estimator. However, in order to compute the posterior, we would have to be able to compute the normalizing constant (i.e., the marginal likelihood),
which is exactly the quantity we wish to determine. In practice, then, we want to use an importance density that is similar to the posterior, is easy to evaluate, and is easy to draw samples from. In addition, we want to use an importance density with tails that are not thinner than those of the posterior; thin tails cause the estimate to have high variance. These desiderata are met by the Beta mixture importance density described in Box 4: a mixture between a Beta(1, 1) density and a Beta density that provides a close fit to the posterior distribution. Here we use a series of univariate Beta mixtures, one for each separate parameter, but acknowledge that a multivariate importance density is potentially even more efficient as it accommodates correlations between the parameters. In our application to MPT models, we assume that all model parameters have uniform Beta(1, 1) priors. For most MPT models this assumption is fairly uncontroversial. The uniform priors can be thought of as a default choice; in the presence of strong prior knowledge one can substitute more informative priors. The uniform priors yield a default Bayes factor that can be a reference point for an analysis with more informative priors, if such an analysis is desired (i.e., when reliable prior information is available, such as can be elicited from experts or derived from earlier experiments).

monte carlo sampling for the posterior distribution
Before turning to the results of the Bayes factor model comparison, we first inspect the posterior distributions. The posterior distributions were approximated using Markov chain Monte Carlo sampling implemented in JAGS (Plummer, 2003) and WinBUGS (Lunn, Jackson, Best, Thomas, & Spiegelhalter, 2012).6 All code is available on the authors' websites. Convergence was confirmed by visual inspection and the R̂ statistic (Gelman & Rubin, 1992). The top panel of Figure 14.6 shows the posterior distributions for the no-conflict model.
Although there is slightly more certainty about parameter p than there is about parameters q and c, the posterior distributions for all three parameters are relatively wide considering that they are based on data from as many as 562 participants. The middle panel of Figure 14.6 shows the posterior distributions for the destructive-updating model. It is important to realize that when d = 0 (i.e., no destruction of the earlier memory) the destructive-updating model reduces to the no-conflict model. Compared to the no-conflict model,
Box 4 Importance sampling for MPT models using the Beta mixture method Importance sampling was invented by Stan Ulam and John von Neumann. Here we use it to estimate the marginal likelihood by repeatedly drawing samples and averaging—the samples are, however, not drawn from the prior (as per Eq. 9, the brute force method), but instead they are drawn from some convenient density g(θ) (as per Eq. 10; Andrieu, De Freitas, Doucet, & Jordan, 2003; Hammersley & Handscomb, 1964). The parameters in MPT models are constrained to the unit interval, and, therefore, the family of Beta distributions is a natural candidate for g(θ). The middle panel of Figure 14.5 shows an importance density (dashed line) for MPT parameter c in the no-conflict model for the data from Wagenaar and Boer (1987). This importance density is a Beta distribution that was fit to the posterior distribution for c using the method of moments. The importance density provides a good description of the posterior (the dashed line tracks the posterior almost perfectly) and, therefore, is more efficient than the brute force method illustrated in the left panel of Figure 14.5, which uses the prior as the importance density. Unfortunately, Beta distributions do not always fit MPT parameters so well; specifically, the Beta importance density may sometimes have tails that are thinner than the posterior, and this increases the variability of the marginal likelihood estimate. To increase robustness and ensure that the importance density has relatively fat tails, we can use a Beta mixture, shown in the right panel of Figure 14.5. The Beta mixture consists of a uniform prior component (i.e., the Beta(1, 1) prior as in the left panel) and a Beta posterior component (i.e., a Beta distribution fit to the posterior, as in the middle panel). In this example, the mixture weight for the uniform component is w = 0.2.
Small mixture weights retain the efficiency of the Beta posterior approach but avoid the extra variability due to thin tails. It is possible to increase efficiency further by specifying a multivariate importance density, but the present univariate approach is intuitive, easy to implement, and appears to work well in practice. The accuracy of the estimate can be confirmed by increasing the number of draws from the importance density, and by varying the w parameter.
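Box 4's recipe can be sketched end-to-end on a toy problem. The snippet below is a hedged illustration, not the chapter's code: it uses a single binomial rate with a uniform prior, for which the marginal likelihood is known to be 1/(n + 1), and estimates that quantity both by brute force (Eq. 9) and with a method-of-moments Beta mixture importance density (Eq. 10, w = 0.2):

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for an MPT parameter (NOT the Wagenaar & Boer models): a single
# binomial rate theta with a uniform Beta(1, 1) prior.  For y successes in n
# trials the marginal likelihood is exactly 1 / (n + 1), so we can check both
# estimators against the truth.
n, y = 100, 70
true_m = 1.0 / (n + 1)

def log_likelihood(theta):
    """Binomial log-likelihood, including the binomial coefficient."""
    log_coef = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return log_coef + y * np.log(theta) + (n - y) * np.log1p(-theta)

def beta_pdf(theta, a, b):
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return np.exp((a - 1) * np.log(theta) + (b - 1) * np.log1p(-theta) - log_B)

N = 200_000

# Brute force (Eq. 9): average the likelihood over draws from the prior.
theta_prior = rng.uniform(0.0, 1.0, N)
m_brute = np.exp(log_likelihood(theta_prior)).mean()

# Box 4: fit a Beta density to the posterior by the method of moments, then
# mix it with a Beta(1, 1) component (weight w = 0.2) to guarantee fat tails.
posterior_draws = rng.beta(y + 1, n - y + 1, 10_000)  # posterior is Beta(71, 31)
mu, var = posterior_draws.mean(), posterior_draws.var()
nu = mu * (1 - mu) / var - 1                          # method of moments
a_hat, b_hat = mu * nu, (1 - mu) * nu

# Importance sampling (Eq. 10) with the Beta mixture as g(theta).
w = 0.2
from_uniform = rng.random(N) < w
theta_g = np.where(from_uniform, rng.uniform(0.0, 1.0, N),
                   rng.beta(a_hat, b_hat, N))
g = w * 1.0 + (1 - w) * beta_pdf(theta_g, a_hat, b_hat)  # Beta(1,1) pdf is 1
prior_pdf = 1.0                                          # uniform prior
m_importance = np.mean(np.exp(log_likelihood(theta_g)) * prior_pdf / g)

print(true_m, m_brute, m_importance)  # all close to 1/101 ≈ 0.0099
```

The importance-sampling estimate is markedly less variable than the brute-force one here because the posterior Beta(71, 31) is far more peaked than the uniform prior, exactly the situation described in the text.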
Fig. 14.5 Three different importance sampling densities (dashed lines) for the posterior distribution (solid lines) of the c parameter in the no-conflict model as applied to the data from Wagenaar and Boer (1987). Left panel: a uniform Beta importance density (i.e., the brute-force method); middle panel: a Beta posterior importance density (i.e., a Beta distribution that provides the best fit to the posterior); right panel: a Beta mixture importance density (i.e., a mixture of the uniform Beta density and the Beta posterior density, with a mixture weight w = 0.2 on the uniform component).
Fig. 14.6 Posterior distributions for the parameters of the no-conflict MPT model, the destructive updating MPT model, and the coexistence MPT model, as applied to the data from Wagenaar and Boer (1987).
parameters p, q, and c show relatively little change. The posterior distribution for d is very wide, indicating considerable uncertainty about its true value. A frequentist point-estimate yields d̂ = 0 (Wagenaar & Boer, 1987; see also Table 14.1), but this does not convey the fact that this estimate is highly uncertain. The lower panel of Figure 14.6 shows the posterior distributions for the coexistence model. When s = 0 (i.e., no suppression of the earlier memory), the coexistence model reduces to the no-conflict model. Compared to the no-conflict model and the destructive-updating model, parameters p, q, and c again show relatively little change. The posterior distribution for s is very wide, indicating considerable uncertainty about its true value. The fact that the no-conflict model is nested under both the destructive-updating model and the coexistence model allows us to inspect the extra parameters d and s and conclude that we have not learned very much about their true values. This suggests that, despite having tested 562 participants, the data do not firmly support one model over the other. We will now see how Bayes factors can make this intuitive judgment more precise.

importance sampling for the bayes factor
We applied the Beta mixture importance sampling method to estimate marginal likelihoods for the three models under consideration. The results were confirmed by varying the mixture weight w, by
independent implementations by the authors, and by comparison to the Savage-Dickey density ratio test presented later. Table 14.5 shows the results. From the marginal likelihoods and the Jeffreys weights we can derive the Bayes factors for the pairwise comparisons; the Bayes factor is 2.77 in favor of the no-conflict model over the destructive-updating model, the Bayes factor is 1.39 in favor of the coexistence model over the no-conflict model, and the Bayes factor is 3.86 in favor of the coexistence model over the destructive-updating model. The first two of these Bayes factors are anecdotal or "not worth more than a bare mention" (Jeffreys, 1961), and the third one just makes the criterion for "moderate" evidence, although any enthusiasm about this level of evidence should be tempered by the realization that Jeffreys himself described a Bayes factor as high as 5.33 as "odds that would interest a gambler, but would be hardly worth more than a passing mention in a scientific paper" (Jeffreys, 1961, pp. 256–257). In other words, the Bayes factors are consistent with the intuitive visual assessment of the posterior distributions: the data do not allow us to draw strong conclusions. We should stress that Bayes factors apply to a comparison of any two models, regardless of whether or not they are structurally related or nested (i.e., where one model is a special, simplified version of a larger, encompassing model). As is true for the information criteria and minimum description length methods, Bayes factors can be used to compare structurally very different models, such as REM (Shiffrin & Steyvers, 1997) versus ACT-R (Anderson et al., 2004), or the diffusion model (Ratcliff, 1978) versus the linear ballistic accumulator model (Brown & Heathcote, 2008). In other words, Bayes factors can be applied to
nested and non-nested models alike. For the models under consideration, however, there exists a nested structure that allows one to obtain the Bayes factor through a computational shortcut.

the savage-dickey approximation to the bayes factor for comparing nested models
Consider first the comparison between the no-conflict model MNCM and the destructive-updating model MDUM. As shown earlier, we can obtain the Bayes factor for MNCM versus MDUM by computing the marginal likelihoods using importance sampling. However, because the models are nested, we can also obtain the Bayes factor by considering only MDUM, and dividing the posterior ordinate at d = 0 by the prior ordinate at d = 0. This surprising result was first published by Dickey and Lientz (1970), who attributed it to Leonard J. "Jimmie" Savage. The result is now generally known as the Savage-Dickey density ratio (e.g., Dickey, 1971; for extensions and generalizations see Chen, 2005; Verdinelli & Wasserman, 1995; Wetzels, Grasman, & Wagenmakers, 2010; for an introduction for psychologists see Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010; a short mathematical proof is presented in O'Hagan & Forster, 2004, pp. 174–177).7 Thus, we can exploit the fact that MNCM is nested in MDUM and use the Savage-Dickey density ratio to obtain the Bayes factor:

$$\mathrm{BF}_{\mathrm{NCM,DUM}} = \frac{p(d = 0 \mid y, M_{\mathrm{DUM}})}{p(d = 0 \mid M_{\mathrm{DUM}})}.$$
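The ordinate-ratio shortcut is easy to verify on a toy nested problem (a single binomial rate, not the Wagenaar & Boer models; the variable names are ours): the Savage-Dickey ratio and the direct ratio of marginal likelihoods agree exactly.

```python
import math

# Toy nested comparison: M0 fixes a binomial rate at theta0 = 0.5, while M1
# assigns theta a uniform Beta(1, 1) prior.  With y successes in n trials,
# the posterior under M1 is Beta(y + 1, n - y + 1).
n, y, theta0 = 20, 13, 0.5

def beta_pdf(x, a, b):
    log_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_B)

# Savage-Dickey: BF01 = posterior ordinate / prior ordinate at theta0.
bf01_savage_dickey = beta_pdf(theta0, y + 1, n - y + 1) / beta_pdf(theta0, 1, 1)

# Direct route: ratio of marginal likelihoods.  Under M0 the data have
# probability C(n, y) * theta0^n; under M1 the marginal likelihood of a
# binomial count with a uniform prior is 1 / (n + 1).
m0 = math.comb(n, y) * theta0 ** n
m1 = 1.0 / (n + 1)
bf01_direct = m0 / m1

print(round(bf01_savage_dickey, 4), round(bf01_direct, 4))  # both ≈ 1.5525
```

In a real application the posterior ordinate is not available analytically, which is why the chapter estimates it from MCMC samples with a nonparametric density estimator.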
The Savage-Dickey density ratio test is visualized in Figure 14.7; the posterior ordinate at d = 0 is higher than the prior ordinate at d = 0, indicating
Table 14.5. Bayesian evidence (i.e., the logarithm of the marginal likelihood), Jeffreys weights, and pairwise Bayes factors computed from the Jeffreys weights or through the Savage-Dickey density ratio, for the Wagenaar and Boer MPT models.

                                  Bayesian    Jeffreys    Bayes factor (Savage-Dickey)
                                  evidence    weight      over NCM       over DUM       over CXM
No-conflict model (NCM)           −30.55      0.36        1              2.77 (2.81)    0.72 (0.80)
Destructive-updating model (DUM)  −31.57      0.13        0.36 (0.36)    1              0.26 (0.28∗)
Coexistence model (CXM)           −30.22      0.51        1.39 (1.25)    3.86 (3.51∗)   1

∗ Derived through transitivity: 2.81 × 1/0.80 = 3.51.
that the data have increased the plausibility that d equals 0. This means that the data support MNCM over MDUM. The prior ordinate equals 1, and hence BFNCM,DUM simply equals the posterior ordinate at d = 0. A nonparametric density estimator (Stone, Hansen, Kooperberg, & Truong, 1997) that respects the bound at 0 yields an estimate of 2.81. This estimate is close to 2.77, the estimate from the importance sampling approach.

Fig. 14.7 Illustration of the Savage-Dickey density-ratio test. The dashed and solid lines show the prior and the posterior distribution for parameter d in the destructive updating model. The black dots indicate the height of the prior and the posterior distributions at d = 0.

The Savage-Dickey density-ratio test can be applied similarly to the comparison between the no-conflict model MNCM versus the coexistence model MCXM, where the critical test is at s = 0. Here, the posterior ordinate is estimated to be 0.80, and, hence, the Bayes factor for MCXM over MNCM equals 1/0.80 = 1.25, close to the Bayes factor obtained through importance sampling, BFCXM,NCM = 1.39. With these two Bayes factors in hand, we can immediately derive the Bayes factor for the comparison between the destructive updating model MDUM versus the coexistence model MCXM through transitivity, that is, BFCXM,DUM = BFNCM,DUM × BFCXM,NCM. Alternatively, we can also obtain BFCXM,DUM by directly comparing the posterior density for d = 0 against that for s = 0:

$$\mathrm{BF}_{\mathrm{CXM,DUM}} = \mathrm{BF}_{\mathrm{NCM,DUM}} \times \mathrm{BF}_{\mathrm{CXM,NCM}} = \frac{p(d = 0 \mid y, M_{\mathrm{DUM}})}{p(d = 0 \mid M_{\mathrm{DUM}})} \times \frac{p(s = 0 \mid M_{\mathrm{CXM}})}{p(s = 0 \mid y, M_{\mathrm{CXM}})} = \frac{p(d = 0 \mid y, M_{\mathrm{DUM}})}{p(s = 0 \mid y, M_{\mathrm{CXM}})},$$

where the second step is allowed because we have assigned uniform priors to both d and s, so that p(d = 0 | MDUM) = p(s = 0 | MCXM). Hence, the Savage-Dickey estimate for the Bayes factor between the two non-nested models MDUM and MCXM equals the ratio of the posterior ordinates at d = 0 and s = 0, resulting in the estimate BFCXM,DUM = 3.51, close to the importance sampling result of 3.86.

Comparison of Model Comparisons
We have now implemented and performed a variety of model comparison methods for the three competing MPT models introduced by Wagenaar and Boer (1987): we computed and interpreted the Akaike information criteria (AIC), Bayesian information criteria (BIC), the Fisher information approximation of the minimum description length principle (FIA), and two computational implementations of the Bayes factor (BF). The general tenor across most of the model comparison exercises has been that the data do not convincingly support one particular model. However, the destructive updating model is consistently ranked the worst of the set. Looking at the parameter estimates, it is not difficult to see why this is so: the d parameter of the destructive-updating model (i.e., the probability of destroying memory through updating) is estimated at 0, thereby reducing the destructive-updating model to the no-conflict model, and yielding an identical fit to the data (as can be seen in the likelihood column of Table 14.2). Since the no-conflict model counts as a special case of the destructive-updating model, it is by necessity less complex from a model-selection point of view—the d parameter is an unnecessary entity, the inclusion of which is not warranted by the data. This is reflected in the inferior performance of the destructive updating model according to all measures of generalizability. Note that the BF judges the support for the NCM over the DUM to be merely "anecdotal," even though the two models fit the data equally well and differ clearly in complexity; one might have expected the principle of parsimony to deliver strong evidence for the simpler model here.
The lack of clear support of the NCM over the DUM is explained by the considerable uncertainty regarding the value of the parameter d : even though the posterior mode is at d = 0, much posterior variability is visible in the middle panel of Figure 14.6. With more data
and a posterior for d that is more peaked near 0, the evidence in favor of the simpler model would increase. The difference between the no-conflict model and the coexistence model is less clear-cut. Following AIC, the two models are virtually indistinguishable—compared to the coexistence model, the no-conflict model sacrifices one unit of log-likelihood for two units of complexity (one parameter). As a result, both models perform equally well under the AIC measure. Under the BIC measure, however, the penalty for the number of free parameters is more substantial, and here the no-conflict model trades a unit of log likelihood for log(N) = 6.33 units of complexity, outdistancing both the destructive updating model and the coexistence model. The BIC is the exception in clearly preferring the no-conflict model over the coexistence model. The MDL, like the AIC, would have us hedge on the discriminability of the no-conflict model and the coexistence model. The BF, finally, allows us to express evidence for the models using standard probability theory. Between any two models, the BF tells us how much the balance of evidence has shifted due to the data. Using two methods of computing the BF, we determined that the odds of the coexistence model over the destructive updating model almost quadrupled (BFCXM,DUM ≈ 3.86), but the odds of the coexistence model over the no-conflict model barely shifted at all (BFCXM,NCM ≈ 1.39). Finally, we can use the same principles of probability to compute Jeffreys weights, which express, for each model under consideration, the probability that it is true, assuming prior indifference. Furthermore, we can easily recompute the probabilities in case we wish to express a prior preference between the candidate models (for example, we might use the prior to express a preference for sparsity, as was originally proposed by Jeffreys, 1961).
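To make the differing penalties concrete, the sketch below uses the standard definitions AIC = −2 log L + 2k and BIC = −2 log L + k log n (the log-likelihood value is a placeholder, not a number from Table 14.2):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian information criterion: -2 log L + k log n."""
    return -2.0 * log_lik + k * math.log(n)

n = 562  # participants in Wagenaar & Boer (1987)

# Per extra free parameter, AIC adds 2 units and BIC adds log(562) ≈ 6.33
# units, so BIC penalizes the larger model more heavily at this sample size.
print(round(math.log(n), 2))                             # → 6.33
print(aic(-100.0, 4) - aic(-100.0, 3))                   # → 2.0
print(round(bic(-100.0, 4, n) - bic(-100.0, 3, n), 2))   # → 6.33
```

The −100.0 log-likelihood is arbitrary; the penalty differences are independent of it, which is the point of the illustration.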
Concluding Comments
Model comparison methods need to implement the principle of parsimony: goodness-of-fit has to be discounted to the extent that it was accomplished by a model that is overly complex. Many methods of model comparison exist (Myung et al., 2000; Wagenmakers & Waldorp, 2006), and our selective review focused on methods that are popular, easy-to-compute approximations (i.e., AIC and BIC) and methods that are difficult-to-compute "ideal" solutions (i.e., minimum description length and Bayes factors). We applied these model comparison
methods to the scenario of three competing MPT models introduced by Wagenaar and Boer (1987). Despite collecting data from 562 participants, the model comparison methods indicate that the data are somewhat ambiguous; at any rate, the data do not support the destructive updating model. This echoes the conclusions drawn by Wagenaar and Boer (1987). It is important to note that the model-comparison methods discussed in this chapter can be applied regardless of whether the models are nested. This is not just a practical nicety; it also means that the methods are based on principles that transcend the details of a specific model implementation. In our opinion, a method of inference that is necessarily limited to the comparison of nested models is incomplete at best and misleading at worst. It is also important to realize that model comparison methods are relative indices of model adequacy; when, say, the Bayes factor expresses an extreme preference for model A over model B, this does not mean that model A fits the data at all well. Figure 14.8 shows a classic but dramatic example of the inadequacy of simple measures of relative model fit. Because it would be a mistake to base inference on a model that fails to describe the data, a complete inference methodology features both relative and absolute indices of model adequacy. For the MPT models under consideration here, Wagenaar and Boer (1987) reported that the no-conflict model provided "an almost perfect fit" to the data.8 The example MPT scenario considered here was relatively straightforward. More complicated MPT models contain order-restrictions, feature individual differences embedded in a hierarchical framework (Klauer, 2010; Matzke, Dolan, Batchelder, & Wagenmakers, in press), or contain a mixture-model representation with different latent classes of participants (for application to other models see Frühwirth-Schnatter, 2006; Scheibehenne, Rieskamp, & Wagenmakers, 2013).
In theory, it is relatively easy to derive Bayes factors for these more complicated models. In practice, however, Bayes factors for complicated models may require the use of numerical techniques more involved than importance sampling. Nevertheless, for standard MPT models the Beta mixture importance sampler appears to be a convenient and reliable tool to obtain Bayes factors. We hope that this methodology will facilitate the principled comparison of MPT models in future applications.
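As a numerical cross-check, the pairwise Bayes factors and Jeffreys weights reported in Table 14.5 follow directly from the Bayesian evidence (log marginal likelihood) values:

```python
import math

# Bayesian evidence (log marginal likelihoods) from Table 14.5.
evidence = {"NCM": -30.55, "DUM": -31.57, "CXM": -30.22}

def bayes_factor(m1, m2):
    """Bayes factor for model m1 over model m2."""
    return math.exp(evidence[m1] - evidence[m2])

# Jeffreys weights: normalized marginal likelihoods (shift by the maximum
# log evidence for numerical stability).
best = max(evidence.values())
unnorm = {k: math.exp(v - best) for k, v in evidence.items()}
total = sum(unnorm.values())
weights = {k: v / total for k, v in unnorm.items()}

print(round(bayes_factor("NCM", "DUM"), 2))          # → 2.77
print(round(bayes_factor("CXM", "NCM"), 2))          # → 1.39
print(round(bayes_factor("CXM", "DUM"), 2))          # → 3.86
print({k: round(v, 2) for k, v in weights.items()})  # → {'NCM': 0.36, 'DUM': 0.13, 'CXM': 0.51}
```

All four quantities round to the values printed in Table 14.5.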
Anscombe’s Quartet

Fig. 14.8 Anscombe’s Quartet is a set of four bivariate data sets whose basic descriptive statistics are approximately identical. In all cases, the mean of X is 9, the variance of X is 11, the mean of Y is 7.5, the variance of Y is 4.1, the correlation is r = 0.816, and the best fitting linear regression line is ŷᵢ = 3 + 0.5xᵢ, which explains R² = 66.6% of the variance in Y. However, in two of the four cases, the linear regression is clearly a poor account of the data. The relative measure of model fit (R²) gives no indication of this radical difference between the data sets, and an absolute measure of fit (even one as rudimentary as a visual inspection of the regression line) is required. (Figure downloaded from Flickr, courtesy of Eric-Jan Wagenmakers.)
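The quartet's shared statistics are easy to verify. The sketch below uses the first of Anscombe's four data sets (values as published by Anscombe, 1973) and recovers the common summary statistics:

```python
import math

# First data set of Anscombe's quartet (Anscombe, 1973).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
slope = sxy / sxx                # least-squares regression
intercept = my - slope * mx

print(mx, round(sxx / (n - 1), 1))           # → 9.0 11.0
print(round(r, 3))                           # → 0.816
print(round(slope, 2), round(intercept, 2))  # → 0.5 3.0
```

The other three data sets yield the same summaries to within rounding, yet only graphical inspection reveals how differently they are structured.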
Notes
1. This work was partially supported by the starting grant "Bayes or Bust" awarded by the European Research Council to EJW, and NSF grant #1230118 from the Methods, Measurements, and Statistics panel to JV.
2. This terminology is due to Pitt and Myung (2002), who point out that measures often referred to as "model fit indices" are in fact more than mere measures of fit to the data—they combine fit to the data with parsimony and hence measure generalizability. We adopt their more accurate terminology here.
3. Note that for hierarchical models, the definition of sample size n is more complicated (Pauler, 1998; Raftery, 1995).
4. For a more in-depth treatment, see Townsend (1975).
5. Analysis using the MPTinR package by Singmann and Kellen (2013) gave virtually identical results. Technical details for the computation of the NML for MPTs are provided in Appendix B of Klauer and Kellen (2011).
6. The second author used WinBUGS, the first and third authors used JAGS.
7. Note that the Savage-Dickey density ratio requires that when d = 0 the prior for the common parameters p, c, and q is the same for MDUM and MNCM. That is, p(p, c, q | d = 0, MDUM) = p(p, c, q | MNCM).
8. We confirmed the high quality of fit in a Bayesian framework using posterior predictives (Gelman & Hill, 2007), results not reported here.
Glossary

Akaike’s information criterion (AIC): A quantity that expresses the generalizability of a model, based on the likelihood of the data under the model and the number of free parameters in the model.
Akaike weights: A quantity that conveys the relative preference among a set of candidate models, using AIC as a measure of generalizability.
Anscombe’s quartet: A set of four bivariate data sets whose statistical properties are virtually indistinguishable until they are displayed graphically, and a canonical example of the importance of data visualization.
Bayes factor (BF): A quantity that conveys the degree to which the observed data sway our beliefs towards one or the other model. Under a-priori indifference between two models M1 and M2, the BF expresses the a-posteriori relative probability of the two.
Bayesian information criterion (BIC): A quantity that expresses the generalizability of a model, based on the likelihood of the data under the model, the number of free parameters in the model, and the amount of data.
Fisher information approximation (FIA): One of several approximations used to compute the MDL.
Goodness of fit: A quantity that expresses how well a model is able to account for a given set of observations.
Importance sampling: A numerical algorithm to efficiently draw samples from a distribution by factoring it into an easy-to-compute function over an easy-to-sample density.
Jeffreys weights: A quantity that conveys the relative preference among a set of candidate models, using BF as a measure of generalizability.
Likelihood principle: A principle of modeling and statistics that states that all information about a certain parameter that is obtainable from an experiment is contained in the likelihood function of that parameter for the given data. Many common statistical procedures, such as hypothesis testing with p-values, violate this principle.
Minimum description length (MDL): A quantity that expresses the generalizability of a model, based on the extent to which the model allows the observed data to be compressed.
Monte Carlo sampling: A general class of numerical algorithms used to characterize (i.e., compute descriptive measures of) an arbitrary distribution by drawing large numbers of random samples from it.
Nested models: Model M1 is nested in Model M2 if there exists a special case of M2 that is equivalent to M1.
Overfitting: A pitfall of modeling whereby the proposed model is too complex and begins to account for irrelevant particulars (i.e., random noise) of a specific data set, causing the model to generalize poorly to other data sets.
Parsimony: A strategy against overfitting, and a fundamental principle of model selection: all other things being equal, simpler models should be preferred over complex ones; or: greater model complexity must be bought with greater explanatory power. Often referred to as Occam’s razor.
Rissanen weights: A quantity that conveys the relative preference among a set of candidate models, using FIA as a measure of generalizability.
Savage-Dickey density ratio: An efficient method for computing a Bayes factor between nested models.
Schwarz weights: A quantity that conveys the relative preference among a set of candidate models, using BIC as a measure of generalizability.
References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest: Akadémiai Kiadó.
Akaike, H. (1974a). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.
Akaike, H. (1974b). On the likelihood of a time series model. The Statistician, 27, 217–235.
Anderson, J. R., Bothell, D., Byrne, M. D., Douglass, S., Lebiere, C., & Qin, Y. (2004). An integrated theory of the mind. Psychological Review, 111, 1036–1060.
Andrieu, C., De Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50, 5–43.
Ardia, D., Baştürk, N., Hoogerheide, L., & van Dijk, H. K. (2012). A comparative study of Monte Carlo methods for efficient evaluation of marginal likelihood. Computational Statistics and Data Analysis, 56, 3398–3414.
Batchelder, W. H., & Riefer, D. M. (1980). Separation of storage and retrieval factors in free recall of clusterable pairs. Psychological Review, 87, 375–397.
Batchelder, W. H., & Riefer, D. M. (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6, 57–86.
Batchelder, W. H., & Riefer, D. M. (2007). Using multinomial processing tree models to measure cognitive deficits in clinical populations. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 19–50). Washington, DC: American Psychological Association.
Berger, J. O. (2006). Bayes factors. In S. Kotz, N. Balakrishnan, C. Read, B. Vidakovic, & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (2nd ed., Vol. 1, pp. 378–386). Hoboken, NJ: Wiley.
Berger, J. O., & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76, 159–165.
Berger, J. O., & Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91, 109–122.
Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, CA: Institute of Mathematical Statistics.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York, NY: Wiley.
Brown, S. D., & Heathcote, A. (2005). Practice increases the efficiency of evidence accumulation in perceptual choice. Journal of Experimental Psychology: Human Perception and Performance, 31, 289–298.
Brown, S. D., & Heathcote, A. J. (2008). The simplest complete model of choice reaction time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). New York, NY: Springer Verlag.
Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: Sage.
Chechile, R. A. (1973). The relative storage and retrieval losses in short-term memory as a function of the similarity and amount of information processing in the interpolated task. Unpublished doctoral dissertation, University of Pittsburgh.
Chechile, R. A., & Meyer, D. L. (1976). A Bayesian procedure for separately estimating storage and retrieval components of forgetting. Journal of Mathematical Psychology, 13, 269–295.
Chen, M.-H. (2005). Computing marginal likelihoods from a single MCMC output. Statistica Neerlandica, 59, 16–29.
model comparison and the principle of parsimony
317
Chen, M.-H., Shao, Q.-M., & Ibrahim, J. G. (2002). Monte Carlo methods in Bayesian computation. New York, NY: Springer. Cohen, J. D., Dunbar, K., & McClelland, J. L. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332–361. D’Agostino, R. B., & Stephens, M. A. (1986). Goodness-of-fit techniques. New York, NY: Marcel Dekker. Dickey, J. M. (1971). The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics, 42, 204–223. Dickey, J. M., & Lientz, B. P. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a Markov chain. The Annals of Mathematical Statistics, 41, 214–226. Dutilh, G., Vandekerckhove, J., Tuerlinckx, F., & Wagenmakers, E.-J. (2009). A diffusion model decomposition of the practice effect. Psychonomic Bulletin & Review, 16, 1026– 1036. Erdfelder, E., Auer, T.-S., Hilbig, B. E., Aßfalg, A., Moshagen, M., & Nadarevic, L. (2009). Multinomial processing tree models: A review of the literature. Zeitschrift für Psychologie, 217, 108–124. Frühwirth–Schnatter, S. (2006). Finite mixture and Markov switching models. New York, NY: Springer. Gamerman, D., & Lopes, H. F. (2006). Markov chain Monte Carlo: Stochastic simulation for Bayesian inference. Boca Raton, FL: Chapman & Hall/CRC. Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge, England: Cambridge University Press. Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statistical Science, 7, 457–472. Gill, J. (2002). Bayesian methods: A social and behavioral sciences approach. Boca Raton, FL: CRC Press. Good, I. J. (1985). Weight of evidence: A brief survey. J. M. Bernardo, M. H. DeGroot, D. V. Lindley, & A. F. M. Smith, Bayesian statistics 2 (249–269). New York, NY: Elsevier. Grünwald, P. (2000). 
Model selection based on minimum description length. Journal of Mathematical Psychology, 44, 133–152. Grünwald, P. (2007). The minimum description length principle. Cambridge, MA: MIT Press. Grünwald, P., Myung, I. J., & Pitt, M. A. (Eds.). (2005). Advances in minimum description length: Theory and applications. Cambridge, MA: MIT Press. Hammersley, J. M., & Handscomb, D. C. (1964). Monte Carlo methods. London, England: Methuen. Heathcote, A., Brown, S., & Mewhort, D. J. K. (2000). The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7, 185–207. Heathcote, A., & Hayes, B. (2012). Diffusion versus linear ballistic accumulation: Different models for response time with different conclusions about psychological mechanisms? Canadian Journal of Experimental Psychology, 66, 125–136. Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14, 382–417.
318
new directions
Jefferys, W. H., & Berger, J. O. (1992). Ockham’s razor and Bayesian analysis. American Scientist, 80, 64–72. Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford, England: Oxford University Press. Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. Klauer, K. C. (2010). Hierarchical multinomial processing tree models: A latent–trait approach. Psychometrika, 75, 70–98. Klauer, K. C., & Kellen, D. (2011). The flexibility of models of recognition memory: An analysis by the minimum-description length principle. Journal of Mathematical Psychology, 55(6), 430–450. Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian modeling for cognitive science: A practical course. Cambridge, England: Cambridge University Press. Lewandowsky, S., & Farrell, S. (2010). Computational modeling in cognition: Principles and practice. Thousand Oaks, CA: Sage. Lewis, S. M., & Raftery, A. E. (1997). Estimating Bayes factors via posterior simulation with the Laplace–Metropolis estimator. Journal of the American Statistical Association, 92, 648–655. Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association, 103, 410–423. Loftus, E. F., Miller, D. G., & Burns, H. J. (1978). Semantic integration of verbal information into a visual memory. Journal of Experimental Psychology: Human Learning and Memory, 4, 19–31. Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492–527. Logan, G. D. (1992). Shapes of reaction–time distributions and shapes of learning curves: A test of the instance theory of automaticity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18, 883–914. Logan, G. D. (2002). An instance theory of attention and memory. Psychological Review, 109, 376–400. Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). 
The BUGS book: A practical introduction to Bayesian analysis. Boca Raton, FL: Chapman & Hall/CRC. Ly, A., Verhagen, A. J., Grasman, R. P. P. P., Wagenmakers, E.-J. (2014). A tutorial on Fisher information. Manuscript submitted for publication. MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge, England: Cambridge University Press. Matzke, D., Dolan, C. V., Batchelder, W. H., & Wagenmakers, E.-J. (in press). Bayesian estimation of multinomial processing tree models with heterogeneity in participants and items. Psychometrika. Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44, 190–204. Myung, I. J., Forster, M. R., & Browne, M. W. (2000). Model selection [Special issue]. Journal of Mathematical Psychology, 44, 1–2. Myung, I. J., Navarro, D. J., & Pitt, M. A. (2006). Model selection by normalized maximum likelihood. Journal of Mathematical Psychology, 50, 167–179.
Myung, I. J., & Pitt, M. A. (1997). Applying Occam’s razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79–95. O’Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society B, 57, 99–138. O’Hagan, A., & Forster, J. (2004). Kendall’s advanced theory of statistics, Vol. 2B: Bayesian inference (2nd ed.). London, England: Arnold. Pauler, D. K. (1998). The Schwarz criterion and related methods for normal linear models. Biometrika, 85, 13–27. Pitt, M. A., & Myung, I. J. (2002). When a good fit can be bad. Trends in Cognitive Sciences, 6, 421–425. Pitt, M. A., Myung, I. J., & Zhang, S. (2002). Toward a method of selecting among computational models of cognition. Psychological Review, 109, 472–491. Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd international workshop on distributed statistical computing. Vienna, Austria. Raftery, A. E. (1995). Bayesian model selection in social research. P. V. Marsden (Ed.), Sociological methodology (pp. 111–196). Cambridge, England: Blackwell. Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108. Rickard, T. C. (1997). Bending the power law: A CMPL theory of strategy shifts and the automatization of cognitive skills. Journal of Experimental Psychology: General, 126, 288–311. Riefer, D. M., & Batchelder, W. H. (1988). Multinomial modeling and the measurement of cognitive processes. Psychological Review, 95, 318–339. Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 445–471. Rissanen, J. (1987). Stochastic complexity. Journal of the Royal Statistical Society B, 49, 223–239. Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42, 40–47. Rissanen, J. (2001).
Strong optimality of the normalized ML models as universal codes and information in data. IEEE Transactions on Information Theory, 47, 1712–1717. Rouder, J. N., Lu, J., Morey, R. D., Sun, D., & Speckman, P. L. (2008). A hierarchical process dissociation model. Journal of Experimental Psychology: General, 137, 370–389. Scheibehenne, B., Rieskamp, J., & Wagenmakers, E.-J., (2013). Testing adaptive toolbox models: A Bayesian hierarchical approach. Psychological Review, 120, 39–64. Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464. Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM—retrieving effectively
from memory. Psychonomic Bulletin & Review, 4, 145–166. Silver, N. (2012). The signal and the noise: The art and science of prediction. London, England: Allen Lane. Singmann, H., & Kellen, D. (2013). MPTinR: Analysis of multinomial processing tree models with R. Behavior Research Methods, 45, 560–575. Smith, J. B., & Batchelder, W. H. (2010). Beta–MPT: Multinomial processing tree models for addressing individual differences. Journal of Mathematical Psychology, 54, 167–183. Stone, C. J., Hansen, M. H., Kooperberg, C., & Truong, Y. K. (1997). Polynomial splines and their tensor products in extended linear modeling (with discussion). The Annals of Statistics, 25, 1371–1470. Townsend, J. T. (1975). The mind–body equation revisited. C. Cheng (Ed.), Philosophical aspects of the mind–body problem (pp. 200–218). Honolulu, Hawaii: Honolulu University Press. Verdinelli, I., & Wasserman, L. (1995). Computing Bayes factors using a generalization of the Savage–Dickey density ratio. Journal of the American Statistical Association, 90, 614–618. Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17, 228. Wagenaar, W. A., & Boer, J. P. A. (1987). Misleading postevent information: Testing parameterized models of integration in memory. Acta Psychologica, 66, 291–306. Wagenmakers, E.-J., & Farrell, S. (2004). AIC model selection using Akaike weights. Psychonomic Bulletin & Review, 11, 192–196. Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the Savage–Dickey method. Cognitive Psychology, 60, 158–189. Wagenmakers, E.-J., & Waldorp, L. (2006). Model selection: Theoretical developments and applications [Special issue]. Journal of Mathematical Psychology, 50(2), 1–2. Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J.
(2011). Why psychologists must change the way they analyze their data: The case of psi. Journal of Personality and Social Psychology, 100, 426–432. Wetzels, R., Grasman, R. P. P. P., & Wagenmakers, E.-J. (2010). An encompassing prior generalization of the Savage–Dickey density ratio test. Computational Statistics & Data Analysis, 54, 2094–2102. Wu, H., Myung, J. I., & Batchelder, W. H. (2010). Minimum description length model selection of multinomial processing tree models. Psychonomic Bulletin & Review, 17, 275–286.
CHAPTER 15
Neurocognitive Modeling of Perceptual Decision Making
Thomas J. Palmeri, Jeffrey D. Schall, and Gordon D. Logan
Abstract
Mathematical psychology and systems neuroscience have converged on stochastic accumulator models to explain decision making. We examined saccade decisions in monkeys while neurophysiological recordings were made within their frontal eye field. Accumulator models were tested on how well they fit response probabilities and distributions of response times to make saccades. We connected these models with neurophysiology. To test the hypothesis that visually responsive neurons represented perceptual evidence driving accumulation, we replaced perceptual processing time and drift rate parameters with recorded neurophysiology from those neurons. To test the hypothesis that movement-related neurons instantiated the accumulator, we compared measures of neural dynamics with predicted measures of accumulator dynamics. Thus, neurophysiology provides both a constraint on model assumptions and data for model selection. We highlight a gated accumulator model that accounts for saccade behavior during visual search, predicts neurophysiology during search, and provides insights into the locus of cognitive control over decisions.
Key Words: accumulator models, decision making, response time, visual search, stop task, countermanding, neurophysiology, computational modeling, neural modeling, frontal eye field, superior colliculus
Introduction

We make decisions all the time. Whom to marry? What car to buy? What to eat? Whether to turn left or right? Some are easy. Some are hard. Some involve uncertainty. Some involve risk or reward. Decision-making requires integrating our perceptions of the current environment with our knowledge and past experience and our assessments of uncertainty and risk in order to select a possible action from a set of alternatives. Behavioral research on decision-making has had a long and distinguished history in psychology (e.g., Kahneman & Tversky, 1984). We now have powerful computational and mathematical models of how decisions are made (e.g., Brown & Heathcote, 2008; Busemeyer & Townsend, 1993; Dayan & Daw, 2008; Ratcliff & Rouder, 1998). And we
know more about the brain areas involved in a range of decision-making tasks (Glimcher & Rustichini, 2004; Heekeren, Marrett, & Ungerleider, 2008; Schall, 2001; Shadlen & Newsome, 2001). To develop an integrated understanding of decision-making mechanisms, new efforts aim to combine behavioral and neural measures with cognitive modeling (e.g., Forstmann, Wagenmakers, Eichele, Brown, & Serences, 2011; Gold & Shadlen, 2007; Palmeri, in press; Smith & Ratcliff, 2004), an approach we aim to illustrate in some detail here. We focus on perceptual decisions. Perceptual decision-making involves perceptually representing the world with respect to current task goals and using perceptual evidence to inform the selection of an action. A broad class of accumulator models of perceptual decision-making assume that perceptual
evidence accumulates over time to a response threshold (e.g., Bogacz, Brown, Moehlis, Holmes, & Cohen, 2006; Brown & Heathcote, 2008; Link, 1992; Nosofsky & Palmeri, 1997; Palmeri, 1997; Ratcliff & Rouder, 1998; Ratcliff & Smith, 2004; Ratcliff & Smith, in press; Smith & Van Zandt, 2000; Usher & McClelland, 2001; see also Nosofsky & Palmeri, 2015). These models have provided excellent accounts of observed behavior, including the choices people make and the time it takes them to decide. Moreover, the observation that the pattern of spiking activity of certain neurons resembles an accumulation to threshold (Hanes & Schall, 1996) has sparked exciting synergies of mathematical and computational modeling with systems neuroscience (e.g., Boucher, Palmeri, Logan, & Schall, 2007a; Churchland & Ditterich, 2012; Cisek, Puskas, & El-Murr, 2009; Ditterich, 2006, 2010; Mazurek, Roitman, Ditterich, & Shadlen, 2003; Purcell, Heitz, Cohen, Schall, Logan, & Palmeri, 2010; Purcell, Schall, Logan, & Palmeri, 2012; Ratcliff, Cherian, & Segraves, 2003; Ratcliff, Hasegawa, Hasegawa, Smith, & Segraves, 2007; Wong, Huk, Shadlen, & Wang, 2007; Wong & Wang, 2006). In this article, we provide a general review of our contributions to these efforts. We use variants of accumulator models to explain neural mechanisms, use neurophysiology to constrain model assumptions, and use neural and behavioral data as a tool for model selection. Our specific focus has been on perceptual decisions about where and when to make a saccadic eye movement to objects in the visual field. The first section of this article, Perceptual Decisions by Saccades, provides an overview of behavior, neuroanatomy, and neurophysiology of the primate saccade system, with an emphasis on the frontal eye field (FEF).
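The accumulation-to-threshold idea just described can be sketched in a few lines of simulation code. This is a minimal illustration of a two-boundary drift-diffusion trial, not the authors' fitted models; all parameter values (drift rate, threshold, processing and motor times) are assumptions chosen only to produce plausible-looking saccade latencies.

```python
import random

def simulate_diffusion_trial(drift=0.1, theta=2.0, z=0.0, t_r=200.0,
                             t_m=15.0, dt=1.0, noise_sd=0.3, rng=None):
    """Simulate one trial of a two-boundary drift-diffusion decision.

    Parameters mirror the usual accumulator-model quantities: mean
    perceptual processing time t_r (ms), mean drift rate, starting
    point z, response thresholds at +/- theta, and motor time t_m (ms).
    All numerical values are illustrative assumptions, not fitted
    estimates. Returns (choice, response_time_in_ms).
    """
    rng = rng or random.Random()
    x, t = z, t_r
    while abs(x) < theta:
        # One time step: deterministic drift plus Gaussian diffusion noise.
        x += drift * dt + noise_sd * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        t += dt
    choice = "target" if x >= theta else "distractor"
    return choice, t + t_m
```

With a positive mean drift, most simulated trials terminate at the upper (target) boundary, and trial-to-trial noise in the accumulation path yields a right-skewed response time distribution; the choice probabilities and RT distributions these models are fit to arise from exactly this kind of noisy first-passage process.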
There are numerous practical advantages to studying perceptual decisions made by saccades over perceptual decisions made by finger, hand, or limb movements, and we can also capitalize on over two decades of careful systems neuroscience research with awake behaving monkeys characterizing the response properties of neurons in FEF and the interconnected network of other brain areas involved in saccadic eye movements (Figure 15.1). FEF itself provides physiologists and theoreticians a unique window on perceptual decision-making. FEF receives projections from a wide range of posterior brain areas involved in visual perception, projects to subcortical brain areas involved directly in the production of eye
Fig. 15.1 Illustration of the macaque cerebral cortex. Frontal eye field (FEF) is a key brain area involved in the production of saccadic eye movements and the focus of our recent work. It receives projections from numerous posterior visual areas, including the middle temporal area (MT), visual area V4, inferotemporal areas TE and TEO, and the lateral intraparietal area (LIP). FEF projects to the superior colliculus (SC). Both FEF and SC project to the brainstem saccade generators that ultimately control the muscles of the eyes. Not shown are connections between FEF and prefrontal cortical areas and areas of the basal ganglia. (Adapted from Purcell et al., 2010.)
movements, and is modulated by prefrontal brain areas involved in cognitive control. Indeed, one class of visually responsive neurons in FEF represents task-relevant salience of objects in the visual field, whereas another class of movement-related neurons increases its activity in a manner consistent with accumulation-of-evidence models and modulates its activity according to changing task demands (e.g., see Schall, 2001, 2004). One form of an accumulator model is illustrated in Figure 15.2. Accumulator models assume that perceptual processing takes some amount of time. The product of perceptual processing is perceptual evidence that is accumulated over time to make a perceptual decision. The rate of accumulation is often called drift rate, and this drift rate can be variable within a trial, across trials, or both (e.g., Brown & Heathcote, 2008; Ratcliff & Rouder, 1998). Variability in the accumulation of perceptual evidence to a threshold is a major contributor to variability in predicted behavior. In their most general form, accumulator models assume drift rates to be free parameters that can be optimized to fit a set of observed behavioral data. There has been concern that unrestricted assumptions about drift rate and its variability may imbue these models with too much flexibility (Jones & Dzhafarov, 2014; but see also Ratcliff,
Fig. 15.2 (a) Illustration of a classic stochastic accumulator model of perceptual decision-making, highlighting some of the key free parameters. Perceptual processing of a visual stimulus takes some variable amount of time with mean TR. The outcome of perceptual processing is noisy perceptual evidence in favor of competing decisions with some mean drift rate. Perceptual evidence is accumulated over time, originating at some variable starting point (z), and accumulating until some threshold is reached, determined by θ. Illustrated here is a drift-diffusion model, but different architectures for the perceptual decision-making process can be assumed (see Figure 15.5). Variability in the accumulation of evidence to a threshold is a key constituent in predicting variability in RT. A motor response is made with some time TM, which for saccadic eye movements is on the order of 10–20 ms. (b) Our recent work has tested whether many of the free parameters can be constrained by the observed physiological dynamics of one class of neurons in FEF (see Figure 15.5) and whether predicted model dynamics of the stochastic accumulator can predict observed physiological dynamics of another class of neurons in FEF (see Figure 15.8).
2013). One important step in theory development has been to significantly constrain these models by creating theories of the drift rates driving the accumulation of evidence, linking models of perceptual decision making with models of perceptual processing (e.g., Ashby, 2000; Logan & Gordon, 2001; Mack & Palmeri, 2010, 2011; Nosofsky & Palmeri, 1997; Palmeri, 1997; Palmeri & Cottrell, 2009; Palmeri & Tarr, 2008; Schneider & Logan, 2005, 2009; Smith & Ratcliff, 2009). As a first step toward a neural theory of drift rates, we hypothesized that the activity of visually responsive neurons in FEF represents perceptual evidence driving the accumulation to threshold. To test this hypothesis, as described in the section titled A Neural Locus of Drift Rates, we replaced perceptual processing-time and drift-rate parameters directly with recorded neurophysiology from these neurons (see Figures 15.2 and 15.5), testing whether any model architecture for accumulation of perceptual evidence could then quantitatively
account for observed saccade response probabilities and response time distributions. A number of different model architectures have been proposed that all involve some accumulation of perceptual evidence to a threshold (e.g., see Bogacz et al., 2006; Smith & Ratcliff, 2004). For example, as their name implies, independent race models assume that evidence for each alternative accumulates independently (Smith & Van Zandt, 2000; Vickers, 1970). Drift-diffusion models (Ratcliff, 1978; Ratcliff & Rouder, 1998) and random walk models (Laming, 1968; Link, 1992; Nosofsky & Palmeri, 1997; Palmeri, 1997) assume that perceptual evidence in favor of one alternative counts as evidence against competing alternatives. Competing accumulator models (Usher & McClelland, 2001) assume that support for various alternatives is mutually inhibitory, so as evidence in favor of one alternative grows, it inhibits the others, often in a winner-take-all fashion (Grossberg, 1976). Different models can vary in other respects
as well, such as whether integration of evidence is perfect or leaky. We describe these alternative model architectures and how well they account for observed response probabilities and response time distributions in the section Architectures for Perceptual Decision Making. We also tested the hypothesis that movement-related neurons in FEF instantiate an accumulator (Hanes & Schall, 1996). As described in the section Predicting Neural Dynamics, we quantitatively compared measured metrics of neural dynamics with predicted metrics of accumulator dynamics. Neurophysiology and modeling are synergistic in that we test quantitatively whether movement-related neurons have dynamics predicted by accumulator models, and we use the measured neural dynamics of movement-related neurons as an additional tool to select between competing model architectures. Finally, in a complementary way, in the section Control over Perceptual Decisions, we test whether competing hypotheses about cognitive control mechanisms can predict observed behavior as well as the observed modulation of movement-related neuron dynamics.
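The architectural distinctions described above (independent race vs. mutually inhibitory competing accumulators, perfect vs. leaky integration) can be captured in a single small sketch by varying two parameters. This is an illustrative simulation under assumed parameter values, not any of the specific fitted models discussed in this chapter.

```python
import random

def race_trial(drifts, theta=1.0, leak=0.0, inhibition=0.0,
               dt=1.0, noise_sd=0.1, rng=None):
    """One trial of a multi-alternative accumulator race.

    With leak == inhibition == 0.0 this reduces to an independent race
    model; inhibition > 0 yields mutually inhibitory (competing)
    accumulators, and leak > 0 makes integration leaky rather than
    perfect. All parameter values are illustrative assumptions.
    Returns (index_of_winning_accumulator, decision_time).
    """
    rng = rng or random.Random()
    x = [0.0] * len(drifts)
    t = 0.0
    while max(x) < theta:
        total = sum(x)
        # Synchronous update: each unit gains its own evidence, decays
        # with leak, and is suppressed by the other units' activation;
        # activation is floored at zero.
        x = [max(0.0, xi + (v - leak * xi - inhibition * (total - xi)) * dt
                 + noise_sd * (dt ** 0.5) * rng.gauss(0.0, 1.0))
             for xi, v in zip(x, drifts)]
        t += dt
    return max(range(len(x)), key=lambda i: x[i]), t
```

Because the same code instantiates several architectures, it also illustrates the model selection problem: different settings of leak and inhibition can produce quite similar choice probabilities and decision-time distributions, which is precisely why the neural dynamics discussed in this chapter provide valuable additional constraint.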
Perceptual Decisions by Saccades

Significant insights into the neurophysiological basis of perceptual decision-making have come from research on decisions about where and when to move the eyes (e.g., Gold & Shadlen, 2007; Schall, 2001, 2004; Smith & Ratcliff, 2004). Although the majority of human research on perceptual decisions has used manual key-press responses, a neurophysiological focus on saccadic eye movements is justified on several grounds: From the perspective of effector dynamics and motor control, eye movements have relatively few degrees of freedom, far fewer than limb movements, allowing fairly direct links between neurophysiology and behavior to be established (Scudder, Kaneko, & Fuchs, 2002). Saccadic eye movements are also relatively ballistic, with movement dynamics quite stereotyped depending on the direction, starting point, and distance the eyes need to move (Gilchrist, 2011), unlike limb movement, which can reach the same endpoint using a multitude of different trajectories having vastly different temporal dynamics (Rosenbaum, 2009). Moreover, from the perspective of understanding the mechanisms by which perceptual evidence is used to produce a perceptual decision, the saccade system is also a choice candidate to study because of the Frontal
Eye Field (FEF), an area where visual perception, motor production, and cognitive control come together in the primate brain (Schall & Cohen, 2011). FEF has long been known to play a role in the production of saccadic eye movements (e.g., Bruce, Goldberg, Bushnell, & Stanton, 1985; Ferrier, 1874). This is reflected by its direct and indirect connectivity with the superior colliculus (SC) and brain stem nuclei necessary for the production of saccadic eye movement (e.g., Munoz & Schall, 2004; Scudder et al., 2002; Sparks, 2002), as illustrated in Figure 15.1. Also as illustrated, FEF is innervated by numerous dorsal and ventral stream areas of extrastriate visual cortex (Schall, Morel, King, & Bullier, 1995). Not illustrated are connections between FEF and brain areas implicated in cognitive control, such as medial frontal and dorsolateral prefrontal cortex (e.g., Stanton, Bruce, & Goldberg, 1995) and basal ganglia (Goldman-Rakic & Porrino, 1985; Hikosaka & Wurtz, 1983). Neuroanatomically, FEF lies at a juncture of perception, action, and control. This bears out functionally, as various neurons within FEF reflect the importance of objects in the visual field, signal the selection and timing of saccadic eye movements, and modulate in a controlled manner according to changing task conditions (e.g., Heitz & Schall, 2012; Murthy, Ray, Shorter, Schall, & Thompson, 2009; Thompson, Biscoe, & Sato, 2005). At the start of each neurophysiological session, once a neuron in FEF has been isolated, a memory-guided saccade task is used to classify its response properties (Bruce & Goldberg, 1985). As illustrated in Figure 15.3, the monkey fixates a spot in the center of the screen while a target is flashed in the periphery. To earn reward, the monkey must maintain fixation for a variable amount of time after which the fixation spot disappears and then the monkey must make a single saccade to the remembered target location.
When the target is in the receptive field of the FEF neuron, that neuron is classified as a visually responsive neuron (or visual neuron) if it shows a vigorous response to the appearance of the target, perhaps with a tonic response during the delay period, but with no significant saccade-related modulation. The neuron is classified as a movement-related neuron (or movement neuron, sometimes referred to as a buildup neuron) if it shows no or very weak modulation to the appearance of the target but pronounced growth of spike rate immediately
preceding saccade production. Other neurons in FEF show other response properties (e.g., Sato & Schall, 2003), but our recent work has focused primarily on visual and movement neurons, which we might loosely characterize as the incoming input signal and outgoing output signal from FEF (see also Pouget et al., 2009). Once visually responsive neurons and movement-related neurons are identified, their response properties can be measured during a primary perceptual decision task. For example, in a visual search task, as illustrated in Figure 15.3, after the monkey fixates a central spot, a search array is shown containing a target (in this case an L) and several distractors (in this case rotated Ts) and the monkey must make a single saccade to the target in order to receive reward. During visual search, visually responsive and movement-related neurons display characteristic dynamics. Figure 15.4 shows the normalized spiking activity of representative neurons recorded during easy and hard visual search trials when the target (solid) or a distractor (dashed) was in the neuron’s receptive field. For some time after the visual search array appears, visually responsive neurons (Figure 15.4a) show no discrimination between a target and a distractor. However, spiking
Fig. 15.3 Illustration of two saccade decision tasks discussed in this article. (a) In a memory-guided saccade task, the monkey fixates a central point while a peripheral target is quickly flashed; the location of the target is guided by the receptive field properties of the isolated neuron for a given experimental session. The monkey is required to maintain fixation for 400–1000ms, after which the fixation spot disappears. To earn reward, the monkey must make a single saccade to the remembered location of the peripheral target. (b) In a visual search task, the monkey first maintains fixation on a central point. An array of visual objects is then presented and to earn reward the monkey must make a single saccade to the target object and not one of the distractor objects. In this case, the reward target was an L and the distractors were variously rotated Ts, with the particular reward target changed from session to session. Various experiments manipulated the number of distractors (set size), the similarity between targets and distractors, and the particular dimensions on which targets and distractors differed (shape, color, or motion).
Fig. 15.4 Illustration of response properties of visually responsive and movement-related neurons in FEF (Hanes, Patterson, & Schall, 1998; Hanes & Schall, 1996; Purcell et al., 2010). Recordings were made while monkeys engaged in a visual search task where the target either appeared among dissimilar distractors (easy search) or among similar distractors (hard search). Plots display normalized spike rate as a function of time (ms). Visually responsive neuron activity aligned on visual search array onset time illustrated in panel (a), movement-related neuron activity aligned on visual search array onset time illustrated in panel (b), and movement-related neuron activity aligned on saccade time illustrated in panel (c). Solid lines are trials in which the target was in the visual neuron’s receptive field or movement neuron’s movement field (target in), and dashed lines are trials in which the target was outside the neurons’ response fields (target out). (Adapted from Purcell et al., 2010.)
activity eventually discriminates between target and distractor, with generally faster and more significant discrimination with easy compared to hard visual search trials (Bichot & Schall, 1999; Sato, Murthy, Thompson, & Schall, 2001) and small compared to large set sizes (Cohen, Heitz, Woodman, & Schall, 2009). We note that the particular shape of the trajectories taken to achieve this neural discrimination can be somewhat heterogeneous across different neurons, but virtually all visually responsive neurons discriminate target from distractor over time. We emphasize that this discrimination concerns the “targetness” of the object in the neuron’s receptive field, not particular features or dimensions of the object like its color or shape, except under unique circumstances (Bichot, Schall, & Thompson, 1996). Visually responsive neurons display these same characteristic dynamics regardless of whether a saccade is made, such as when the monkey withholds or cancels an eye movement
because of a stop signal (Hanes, Patterson, & Schall, 1998) or when the monkey is trained to maintain fixation and respond with a limb movement and not an eye movement (Thompson, Biscoe, & Sato, 2005). Normalized activity of a representative movement-related neuron is shown aligned on the onset time of the visual search array (Figure 15.4b) and aligned on the time of the saccade (Figure 15.4c). When the monkey makes a saccade to the object in the receptive field (movement field) of the neuron, there is a characteristic buildup of activity some time after array onset; there is far less activity when the nonselected object is in the receptive field, although the precise nature of those dynamics varies somewhat from neuron to neuron. We see clearly that, when aligned on saccade initiation time, activity reaches a relatively constant threshold level immediately prior to the eye movement (Hanes & Schall, 1996), and this pattern of activity holds across search difficulty and set size (Woodman, Kang, Thompson, & Schall, 2008). Movement-related neuron activity does not reach threshold if the monkey withholds or cancels an eye movement because of a stop signal (Hanes et al., 1998; Murthy et al., 2009) or makes a response to the target using a limb movement and not an eye movement (Thompson, Biscoe, & Sato, 2005). We discuss more detailed aspects of the temporal dynamics of movement-related neurons later in this article. One of our primary goals has been to develop models that both predict the saccade behavior of the monkey and predict the temporal dynamics of movement-related neurons in FEF.
A Neural Locus of Drift Rates

Movement-related neurons increase in spike rate over time and reach a constant level of activity immediately prior to a saccade being initiated (Figure 15.4). The dynamics of movement-related neurons appear consistent with the dynamics of models that assume a stochastic accumulation of perceptual evidence to a threshold (Hanes & Schall, 1996; Ratcliff et al., 2003; Schall, 2001; Smith & Ratcliff, 2004). This insight raises several questions that we have begun to address in our recent work: If movement-related neurons instantiate an accumulator model, what kind of accumulator model do they instantiate? What kind of an accumulator model can predict the fine-grained dynamics of movement-related neurons? What
drives the accumulator model? We begin with the last question. A broad class of models of perceptual decision-making assumes that perceptual evidence is accumulated over time to a threshold (Figure 15.2; see also Ratcliff & Smith, 2015). The rate at which perceptual evidence is accumulated, the drift rate, can vary across objects, conditions, and experience. When accumulator models are tested by fitting them to observed behavior, it is not uncommon to assume that different drift rates across different experimental conditions are free parameters that are optimized to maximize or minimize some fit statistic (e.g., Brown & Heathcote, 2008; Boucher et al., 2007a; Ratcliff & Rouder, 1998; Usher & McClelland, 2001). But other theoretical work has aimed to connect models of perceptual decision-making to models of perceptual processing by developing a theory of the drift rates. For example, Nosofsky and Palmeri (1997; Palmeri, 1997) proposed an exemplar-based random walk model (EBRW) that combined the generalized context model of categorization (Nosofsky, 1986) with the instance theory of automaticity (Logan, 1988) to develop a theory of the drift rates driving a stochastic accumulation of evidence. Briefly, EBRW assumes that a perceived object activates previously stored exemplars in visual memory, the probability and speed of exemplar retrieval are governed by similarity, and repeated exemplar retrievals determine the direction and rate of accumulation to a response threshold. EBRW predicts the effects of similarity, experience, and expertise on response probabilities and response times for perceptual decisions about visual categorization and recognition (see Nosofsky & Palmeri, 2015; Palmeri & Cottrell, 2009; Palmeri, Wong, & Gauthier, 2004).
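The retrieval-race logic of EBRW can be made concrete with a small simulation. The sketch below is a minimal illustration of the idea, not the authors' implementation: stimuli are one-dimensional, similarity falls off exponentially with distance (with an assumed scaling parameter `c`), each retrieval steps a random walk toward the retrieved exemplar's category, and the walk terminates at an illustrative fixed threshold.

```python
import math
import random

def ebrw_trial(probe, exemplars, c=2.0, threshold=3, rng=random):
    """One simulated EBRW-style categorization trial.

    exemplars: list of (value, category) pairs stored in memory.
    Returns (choice, n_steps); n_steps serves as a response-time proxy.
    """
    walk = 0
    steps = 0
    while abs(walk) < threshold:
        # Retrieval race: sample one stored exemplar with probability
        # proportional to its similarity to the probe (Shepard/GCM form).
        sims = [math.exp(-c * abs(probe - value)) for value, _ in exemplars]
        r = rng.random() * sum(sims)
        acc = 0.0
        for (value, category), s in zip(exemplars, sims):
            acc += s
            if r <= acc:
                # Each retrieval steps the walk toward the retrieved category.
                walk += 1 if category == "A" else -1
                break
        steps += 1
    return ("A" if walk > 0 else "B"), steps
```

Probes near one category's exemplars yield mostly that category's response in fewer steps, mirroring the similarity and experience effects described above.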
Other theorists have similarly connected visual perception and visual attention mechanisms to accumulator models of perceptual decision making by creating theories of drift rate (e.g., Ashby, 2000; Logan, 2002; Mack & Palmeri, 2010; Schneider & Logan, 2005; Smith & Ratcliff, 2009). As a first step toward a neural theory of drift rates, we recently proposed a neural locus of drift rates when decisions are made by saccades (Purcell et al., 2010, 2012). We hypothesize that the accumulation of evidence is reflected in the firing rate of FEF movement-related neurons and the perceptual evidence driving this accumulation is reflected in the firing rate of FEF visually responsive
neurocognitive modeling of perceptual decision making
Fig. 15.5 Illustration of simulation model architectures tested in Purcell et al. (2010, 2012). Spike trains were recorded from FEF visually-responsive neurons during a saccade visual search task. Trials were sorted into two populations according to whether the target or a distractor was within the neuron’s response field. Spike trains were randomly sampled from each population to generate a normalized activation function that served as the dynamic model input associated with a target (vT ) and a distractor (vD ) on a given simulated trial, as illustrated. Different architectures for perceptual decision-making were systematically tested. Decision units (mT ) could integrate evidence or not, and they could be leaky (k) or not. Decision units could integrate a difference between the inputs (u) or not, the stochastic input could be gated (g) or not, and the units could compete with one another (β) or not. Here, only two decision units are shown, one for a target and one for a distractor. In Purcell et al. (2012) there were eight accumulators, one for each possible stimulus location in the visual search array.
neurons. One way to test this hypothesis would be to develop a model of the dynamics of visually responsive neurons, a model of how those dynamics are translated into drift rates, and then use those drift rates to drive a model of the accumulation of perceptual evidence. We chose a different approach. Rather than model the dynamics of visually responsive neurons, we used the observed firing rates of those neurons directly as a dynamic neural representation of the perceptual evidence that was accumulated over time. Figure 15.5 illustrates our general approach. Activity of visually responsive neurons was recorded from FEF of monkeys performing a visual search task. In Figure 15.4, we illustrate spike density functions of a representative neuron when a target or distractor appeared in its receptive field during easy or hard visual search. For our modeling, we did not use the mean activity of neurons as input but, instead, generated thousands of simulated spike-density functions by subsampling from the full set of individually recorded trials of visually responsive neurons. Specifically, on each simulated trial, we first randomly sampled, with replacement, a set of spike trains recorded from individual neurons. We subsampled from trials when the target was
in the receptive fields of the neurons to simulate perceptual evidence in favor of the target location and trials when a distractor was in the receptive field to simulate perceptual evidence in favor of each of the distractor locations. Along its far left, Figure 15.5 illustrates raster plots for example neurons, with individual trials arranged sequentially along the y axis, time along the x axis, and each black dot indicating the incidence of a recorded spike on a given trial for that neuron. The gray thick bars illustrate a random sampling from those recorded neurons. These sampled spike trains were convolved with a temporally asymmetric doubly exponential function (Thompson, Hanes, Bichot, & Schall, 1996), averaged together, and normalized to create dynamic drift rates associated with target and distractor locations (Purcell et al., 2010, 2012), as illustrated in the middle of Figure 15.5; the resulting input functions are mathematically similar to a Poisson shot noise process (Smith, 2010). Different inputs were defined according to the experimental condition under which the visually responsive neurons were recorded on each trial, such as easy versus hard search or small versus large set sizes. Arguably, this approach allows the most direct test of whether the dynamics of visually responsive
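This input-construction step can be sketched as follows, under assumed kernel time constants (the exact growth and decay constants of the Thompson et al., 1996, kernel are not reproduced here); `spike_trains` stands in for the recorded binary spike trains from one experimental condition.

```python
import numpy as np

def simulated_input(spike_trains, n_sample=10, dt=1.0, tau_g=1.0, tau_d=20.0,
                    rng=None):
    """Build one simulated-trial input function from recorded spike trains.

    spike_trains: list of binary arrays (1 = spike in that 1-ms bin).
    The kernel time constants tau_g (growth) and tau_d (decay) are
    illustrative values, not the authors' exact parameters.
    """
    rng = rng or np.random.default_rng()
    # Sample spike trains with replacement from the recorded population.
    idx = rng.integers(0, len(spike_trains), size=n_sample)
    sampled = np.array([spike_trains[i] for i in idx], dtype=float)
    # Temporally asymmetric kernel: fast exponential rise, slower decay,
    # giving a postsynaptic-potential-like filter.
    t = np.arange(0, 10 * tau_d, dt)
    kernel = (1 - np.exp(-t / tau_g)) * np.exp(-t / tau_d)
    kernel /= kernel.sum()
    # Convolve each sampled train, then average across sampled trains.
    rates = np.array([np.convolve(s, kernel)[: s.size] for s in sampled])
    mean_rate = rates.mean(axis=0)
    # Normalize so inputs from different conditions share a common scale.
    peak = mean_rate.max()
    return mean_rate / peak if peak > 0 else mean_rate
```

Repeating this subsampling across simulated trials yields trial-to-trial variability in the input that comes from the neurophysiology itself rather than from fitted noise parameters.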
neurons provide a sufficient representation of perceptual evidence to predict where and when the monkey moves its eyes. If no model can predict saccade behavior using visually responsive neurons as input, then some other neural signal must be significantly modulating behavior of the monkey. Furthermore, as illustrated by contrasting Figures 15.2a and 15.2b, this novel approach imposes significant constraints on possible models by replacing free parameters governing the mean and variability of perceptual processing time, starting point of accumulation, and drift with observed neurophysiology. Finally, because the neurophysiological signal from visually responsive neurons is continuous in time, the models cannot merely assume that perceptual processing and perceptual decisions constitute discrete stages, as is typical of many accumulator models.
Architectures for Perceptual Decision-Making

Within the broad class of perceptual decision-making models assuming an accumulation of perceptual evidence to a threshold, a variety of different model architectures have been proposed (e.g., see Ratcliff & Smith, 2004; Smith & Ratcliff, 2004). We instantiated several of these competing architectures, and using drift rates defined by the recorded spiking activity of visually responsive neurons as inputs, evaluated how well each could fit observed response probabilities and response times of monkeys making saccades during a visual search task (Purcell et al., 2010, 2012). Figure 15.5 illustrates the common architectural framework. Drift rates defined by neurophysiology constitute the input nodes labeled vT (target) and vD (distractor). We assume an accumulator associated with the target location (mT) and distractor locations (mD). Figure 15.5 shows only one target and one distractor accumulator (Purcell et al., 2010) but we have extended this framework to multiple accumulators, one for every possible target location in the visual field (Purcell et al., 2012). Each accumulator is governed by the following stochastic differential equation:

$$ dm_i(t) = \left[ \left( v_i(t) - u \sum_{j \neq i} v_j(t) - g \right)^{\!+} \frac{dt}{\tau} - \left( \beta \sum_{k \neq i} m_k(t) + k\, m_i(t) \right) \frac{dt}{\tau} \right] + \xi. $$
The mi(t) are rectified to be greater than or equal to zero because we later compare the dynamics of these accumulators to the observed spike rates of movement-related neurons, and those spike rates are nonnegative by definition. ξ represents Gaussian noise intrinsic to each accumulator with mean 0 and standard deviation σ; in all of our simulations, this intrinsic accumulator variability could be assumed to be quite small relative to the variability of the visual inputs vi(t). All accumulators, mi(t), are assumed to race against one another to be the first to reach their threshold θ. The winner of that race between accumulators determines which saccade response is made on that simulated trial and the response time is given by the time to reach threshold plus a small ballistic time of 10–20 ms. If k > 0, these are leaky accumulators; otherwise they are perfect integrators. If β = 0 and u = 0, we have a version of a simple horse race model. If β > 0, these are competing accumulators, and combined with leakage, k > 0, we have the leaky competing accumulator model (Usher & McClelland, 2001). If u > 0, then weighted differences are accumulated by each mi(t). In the case of only two accumulators, one for a target and the other for a distractor, and assuming u = 1, each mi(t) accumulates the difference between evidence for a target versus evidence for a distractor, which is quite similar to a standard drift-diffusion model (see Bogacz et al., 2006; Ratcliff et al., 2007; Usher & McClelland, 2001), and when assuming positive leakage (k > 0) is quite similar to an Ornstein-Uhlenbeck process (Smith, 2010); this similarity can become mathematical identity with some added assumptions (Bogacz et al., 2006; Usher & McClelland, 2001). Finally, we also proposed a novel aspect to this general architecture, which we called a gated accumulator (Purcell et al., 2010, 2012).
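The race just described can be simulated directly by Euler integration of the accumulator equation. The sketch below uses illustrative parameter values for g, k, β, θ, τ, and σ, not the values fitted by Purcell et al.; `v` plays the role of the neurally defined inputs.

```python
import numpy as np

def gated_race(v, u=1.0, g=0.2, k=0.01, beta=0.1, theta=5.0, tau=10.0,
               sigma=0.1, dt=1.0, ballistic=15.0, rng=None):
    """Euler simulation of the gated accumulator race described in the text.

    v: (n_units, n_steps) array of input drift rates, e.g., built from
    visually responsive neuron activity. All parameter values here are
    illustrative, not the fitted values from Purcell et al. (2010, 2012).
    Returns (winning unit, RT in ms), or (None, None) if no unit wins.
    """
    rng = rng or np.random.default_rng()
    n_units, n_steps = v.shape
    m = np.zeros(n_units)
    for t in range(n_steps):
        # Feedforward difference (u) and gate (g), then positive rectification.
        other_inputs = v[:, t].sum() - v[:, t]
        drive = np.maximum(v[:, t] - u * other_inputs - g, 0.0)
        # Lateral inhibition (beta) and leakage (k), plus intrinsic noise.
        other_units = m.sum() - m
        dm = (drive - beta * other_units - k * m) * dt / tau \
            + sigma * np.sqrt(dt / tau) * rng.standard_normal(n_units)
        m = np.maximum(m + dm, 0.0)  # activations rectified at zero
        if m.max() >= theta:
            # First accumulator to reach threshold determines the saccade;
            # RT adds a small ballistic time, as in the text.
            return int(np.argmax(m)), t * dt + ballistic
    return None, None
```

With a target input that grows above the distractor input, the target unit's gated drive becomes positive and it wins the race; while the target input is still weak, the gate keeps both units from accumulating noise.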
When g > 0 and the input is positive-rectified, as indicated by the rectification in the equation, then only inputs that are sufficiently large can enter into the accumulation. For example, consider a gated accumulator assuming u > 0; this would mean that the difference in the evidence in favor of the target over the distractors must be sufficiently large before that difference will accumulate. Recall that we assumed that the inputs are defined by neurophysiology, which has no beginning or ending, apart from the birth or death of the organism. Intuitively, the gate forces the accumulators to accumulate signal, not merely noise, and noise is all that is present before
Fig. 15.6 In Purcell et al. (2010), models (Figure 15.5) were tested on how well they could account for observed RT distributions of the onset of saccades in visual search where the target and distractors were dissimilar (easy search) or similar (hard search). Each panel shows observed cumulative RT distributions (symbols) for easy and hard search. Best-fitting model predictions for a subset of the models tested in Purcell et al. (2010) are shown for illustration, from left to right: a nonaccumulator model that does not integrate perceptual evidence over time, a perfect integrator model with no leakage, a leaky accumulator model, and a gated accumulator model. (Adapted from Purcell et al., 2010.)
perceptual processing has begun to discriminate targets from distractors. We evaluated the fits of competing model architectures to observed response probabilities and distributions of response times using standard model fitting techniques (e.g., Ratcliff & Tuerlinckx, 2002; Van Zandt, 2000). We systematically compared models assuming a horse race, a diffusion-like difference accumulation process, or competition via lateral inhibition, factorially combined with various leaky, nonleaky, or gated accumulators. For example, Figure 15.6 displays observed response time distributions for easy versus hard visual search along with a sample of predictions from some of the model architectures evaluated by Purcell et al. (2010); for these particular data (Bichot, Thompson, Rao, & Schall, 2001; Cohen et al., 2009), there were very few errors. As shown in the left two panels, models assuming no integration at all, meaning that the current value of mi(t) simply reflects the current inputs at time t, and models assuming perfect integration without leakage, provided a relatively poor fit to the observed behavioral data. Although these particular behavioral data were fairly limited, with only a response-time distribution for easy and hard visual search, we could rule out some potential model architectures. However, other competing models, including those with leakage or a gate, assuming a competition or an accumulation of differences, all provided reasonable quantitative accounts of the behavioral data, a couple of examples of which are shown in the two right panels of Figure 15.6. Purcell et al. (2012) evaluated fits of these models to a more comprehensive dataset where set
size was systematically manipulated and where the search was difficult enough to produce significant errors (Cohen et al., 2009). Models were required to fit correct- and error-response probabilities as well as distributions of correct- and error-response times. These data are shown in Figure 15.7. Also shown are the predictions of the best fitting model, which was a gated accumulator model that assumed both significant leakage and competition via lateral inhibition. Likely because this dataset was larger, it also provided a greater challenge to other models, since many horse-race models and diffusion-like models failed to provide adequate fits to the observed data, whether they included leakage or gating (see Purcell et al., 2012). Just based on the quality of fits to observed data, models with leakage and competition via lateral inhibition provided comparable fits whether those models included gating or not in both Purcell et al. (2010) and Purcell et al. (2012). So based on parsimony, a nongated version, which is essentially a leaky competing accumulator model (Usher & McClelland, 2001), would win the theoretical competition. But our goal was also to test whether the accumulators in the competing models could provide a theoretical account of the movement-related neurons in FEF. To do that, we also tested whether the dynamics measured in the accumulators could predict the dynamics measured in movement-related neurons (see also Boucher et al., 2007a; Ratcliff et al., 2003, 2007).
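The model-fitting step referred to in this section typically minimizes a statistic comparing observed and predicted RT distributions at fixed quantiles (in the general spirit of the techniques cited above, e.g., Ratcliff & Tuerlinckx, 2002). The sketch below is a simplified version of such a quantile-based chi-square, not the authors' exact procedure.

```python
import numpy as np

def quantile_chi_square(observed_rts, simulated_rts,
                        quantiles=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pearson chi-square comparing an observed RT distribution with a
    model-simulated one, binned at the observed RT quantiles. A simplified
    sketch of this style of fit statistic, not the authors' procedure."""
    edges = np.quantile(observed_rts, quantiles)
    # Observed proportion of trials between successive quantile cut points.
    obs_p = np.diff(np.concatenate(([0.0], np.asarray(quantiles), [1.0])))
    # Bin the simulated RTs using the same cut points.
    bin_index = np.searchsorted(edges, simulated_rts)
    counts = np.bincount(bin_index, minlength=len(edges) + 1)
    pred_p = np.maximum(counts / counts.sum(), 1e-10)  # guard empty bins
    n = len(observed_rts)
    return n * np.sum((obs_p - pred_p) ** 2 / pred_p)
```

A model whose simulated RTs match the observed distribution yields a small statistic; mispredicted quantile spacing inflates it, which is what drives the parameter search during fitting.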
Predicting Neural Dynamics

Until now, the work we have described follows a long tradition of developing and testing computational and mathematical models of cognition.
Fig. 15.7 In Purcell et al. (2012), models (Figure 15.5) were tested on how well they could account for correct- and error-response probabilities and correct- and error-response time distributions of saccades in a visual search task with three levels of set size: 2 (blue), 4 (green), or 8 (red) objects in the visual array. Predictions from the best-fitting gated accumulator model are shown. (a) Mean observed (symbols) and predicted (lines) correct- (solid) and error- (dashed) response times as a function of set size. (b) Mean observed (symbols) and predicted (lines) probability correct as a function of set size. (c) Observed (symbols) and predicted (lines) cumulative RT distributions of correct responses at each set size. (d) Observed (symbols) and predicted (lines) cumulative RT distributions of error responses at each set size. (Adapted from Purcell et al., 2012.)
Competing models are evaluated on their ability to predict behavioral data by optimizing parameters in order to maximize or minimize the fit of each model to the observed data, and then statistical tests are performed for nested or nonnested model comparison (e.g., see Busemeyer & Diederich, 2010; Lewandowsky & Farrell, 2010). We go beyond this approach to evaluate linking propositions (Schall, 2004; Teller, 1984) that aim to map particular cognitive model mechanisms onto observable neural dynamics. Specifically, we evaluate the linking proposition that movement-related neurons in FEF instantiate an accumulation of evidence to a threshold. We do this by testing how well the simulated dynamics of accumulators in the various model architectures described in the previous section predict the observed dynamics in movement-related neurons. Although the qualitative relationship between accumulator dynamics and movement neuron dynamics has long been recognized (e.g., Hanes & Schall, 1996;
Ratcliff et al., 2003; Smith & Ratcliff, 2004), we go beyond noting qualitative relationships to test quantitative predictions. Following the approach used by Woodman et al. (2008), we evaluated how several key measures of neural dynamics varied according to the measured response time of a saccade. The top row of Figure 15.8 illustrates several hypotheses for how variability in response time is related to variability in the underlying neural dynamics. Fast responses could be associated with an early initial onset of the neural activity from baseline, whereas slow responses could be associated with a delayed onset. Alternatively, fast responses could be associated with high growth rate in spiking activity to threshold, whereas slow responses could be associated with low growth rate. Fast responses could be associated with an increased baseline firing rate or decreased threshold, whereas slow responses could be associated with a decreased baseline firing rate or increased threshold. To evaluate these proposals, the onset
time, growth rate, baseline, and threshold of neural activity were all measured within bins of trials defined by response times from fastest to slowest, both within conditions and across conditions (see Purcell et al., 2010, 2012, for details). The middle row shows the relationship between onset time, growth rate, baseline, and threshold of neural activity and mean response time for each bin of an RT distribution for a representative neuron in a representative condition. The bottom row shows the mean correlation of neural measures with RT as a function of set size from Purcell et al. (2012), with a significant relationship between onset time and response time observed in neural activity in movement-related neurons in FEF. Using analogous methods, we also measured the relationship between onset time, growth rate, baseline, and threshold of accumulator dynamics and response time predicted by each of the competing model architectures that we simulated. Shown in Figure 15.8 are the predictions of the gated accumulator model from Purcell et al. (2012), illustrating a good match between model and neurons. These are true model predictions, not model fits. After the model was fitted to behavioral data, the accumulator dynamics using the best-fitting model parameters were measured and compared directly with the observed neural dynamics. All other models failed to predict the observed neural dynamics. For example, models without gate typically predicted a significant negative correlation between baseline and response time that was completely absent in the observed data. Part of the reason for this is that, with nongated models, the accumulators are allowed to accumulate noise in the input defined by visually responsive neurons.
Although a leakage term may be sufficient to keep a weak noise signal from leading to a premature accumulation to threshold, it cannot prevent significant differences in baseline activity from being correlated with differences in predicted response time when the accumulators reach threshold, at least without significantly compromising fits to the observed behavior.
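The binned-RT analysis described above can be sketched as follows. Onset detection here is deliberately simplified (the first time the bin-averaged rate exceeds a fixed fraction of its peak), standing in for the statistical onset criterion used in the actual analyses.

```python
import numpy as np

def onset_rt_correlation(rts, spike_rates, n_bins=5, frac=0.2):
    """Bin trials by RT (fastest to slowest), estimate an onset time for each
    bin's averaged spike-density function, and correlate onset with mean RT,
    in the spirit of the Woodman et al. (2008) analysis described in the text.

    rts: (n_trials,) response times; spike_rates: (n_trials, n_timepoints).
    Onset = first time the averaged rate exceeds `frac` of its peak, a
    simplified stand-in for the authors' onset-detection procedure.
    """
    rts = np.asarray(rts)
    spike_rates = np.asarray(spike_rates)
    bins = np.array_split(np.argsort(rts), n_bins)
    mean_rt, onsets = [], []
    for trial_idx in bins:
        avg = spike_rates[trial_idx].mean(axis=0)  # bin-averaged SDF
        above = np.nonzero(avg > frac * avg.max())[0]
        onsets.append(above[0] if above.size else np.nan)
        mean_rt.append(rts[trial_idx].mean())
    r = np.corrcoef(mean_rt, onsets)[0, 1]  # Pearson correlation with RT
    return r, np.array(mean_rt), np.array(onsets)
```

Running the same measurement on simulated accumulator trajectories and on recorded spike-density functions allows the correlation patterns in Figure 15.8 to be compared directly.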
Control over Perceptual Decisions

We have also considered the neurophysiological basis of cognitive control over perceptual decisions. Mirroring our other research, we used cognitive models to better understand neural mechanisms and used neural data to constrain competing cognitive models.
Perhaps the most widely used task for studying normal and dysfunctional cognitive control is the stop-signal task (Lappin & Eriksen, 1966; Logan & Cowan, 1984). Saccade variants of this task have been used with monkeys, and neurophysiological activity has been recorded from neurons in FEF (Hanes et al., 1998). The basic stop-signal task with saccades is in certain ways a converse of the memory-guided saccade task illustrated in Figure 15.3. Monkeys initially fixate the center of the screen. After a variable amount of time, the fixation spot disappears and a peripheral target appears somewhere in the visual field, and the monkey must make a single saccade to the target in order to earn reward. This is the primary task, or go signal. On a fraction of trials, some time after the peripheral target appears, the fixation spot is reilluminated, and the monkey is rewarded for cancelling its saccade and maintaining fixation. This is the stop signal. The interval between the appearance of the go signal (the peripheral target) and the stop signal (the fixation point) is called the stop-signal delay (SSD). Monkeys' ability to inhibit their saccade is probabilistic due to the stochastic variability of go and stop processes and depends on SSD. Figure 15.9 displays the key behavioral data observed in the saccade stop-signal paradigm (Hanes et al., 1998). Figure 15.9a displays the probability of responding to the go signal (y axis), despite the presence of a stop signal at a particular SSD (x axis). When the stop signal illuminates shortly after the appearance of the target, at a short SSD, the probability of responding to the go signal is quite small. Control over the saccade as a consequence of the stop signal has been successful. In contrast, for a long SSD, the probability of successfully inhibiting the saccade is rather small.
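This dependence of stopping on SSD is the signature of a race between go and stop finishing times, as formalized by the independent race model discussed next. A minimal simulation with assumed (not fitted) Gaussian finishing-time distributions reproduces the rising inhibition function.

```python
import numpy as np

def inhibition_function(ssds, n_trials=10000, go_mean=250.0, go_sd=40.0,
                        ssrt_mean=100.0, ssrt_sd=15.0, rng=None):
    """Independent race model of the stop-signal task (Logan & Cowan, 1984):
    on each stop-signal trial a go finishing time races a stop finishing
    time (SSD + SSRT). Returns P(respond | stop signal) at each SSD. The
    Gaussian finishing-time distributions and their parameters are
    illustrative assumptions, not fitted values.
    """
    rng = rng or np.random.default_rng()
    probs = []
    for ssd in ssds:
        go = rng.normal(go_mean, go_sd, n_trials)  # go finishing times
        stop = ssd + rng.normal(ssrt_mean, ssrt_sd, n_trials)  # stop finishes
        probs.append(np.mean(go < stop))  # go wins -> saccade escapes inhibition
    return np.array(probs)
```

Short SSDs give the stop process a head start, so few saccades escape; long SSDs let most go processes finish first, producing the characteristic rising inhibition function of Figure 15.9a. The same race logic implies that escaped (signal-respond) saccades are drawn from the fast tail of the go distribution, matching the RT pattern in Figure 15.9b.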
Figure 15.9b displays distributions of response times for primary go trials with a stop signal (signal-respond trials), in which a saccade was erroneously made, shaded gray according to SSD (see figure caption). These response times are significantly faster than response times without any stop signal (no-stop-signal trials), shown in black. Behavioral data in the stop-signal paradigm have long been accounted for by an independent race model (Logan & Cowan, 1984), which assumes that performance is the outcome of a race between a go process, responsible for initiating the movement, and a stop process, responsible for inhibiting the movement (see also Becker & Jürgens, 1979; Boucher, Stuphorn, Logan, Schall, & Palmeri, 2007b; Camalier et al., 2007;
Fig. 15.8 Comparing observed neural dynamics and predicted model dynamics. Top row: Four possible hypotheses for how variability in RT is related to variability in neural or accumulator dynamics: from left to right, variability in RT could be correlated with variability in the onset time, growth rate, baseline, or threshold. Middle row: Following Woodman et al. (2008), correct RTs were binned in groups from fastest to slowest and within each bin the onset time, growth rate, baseline, and threshold of the spike density functions were calculated. The relationship between RT and each neural measure (left to right: onset time, growth rate, baseline, and threshold) is shown for one representative neuron in set size 4 for one of the monkeys tested; the correlation between RT and neural measure and its associated p-value are also shown. Bottom row: Average correlation between RT and neural measure (left to right: onset time, growth rate, baseline, and threshold) as a function of set size observed in neural dynamics and predicted in model dynamics for the gated accumulator model. (Adapted from Purcell et al., 2012.)
Fig. 15.9 (a) Observed inhibition function (gray line) and simulated inhibition function from the interactive race model (black line). (b) Observed (thin lines) and simulated (thick lines) cumulative RT distributions from no-stop-signal trials (black line) and signal-respond trials with progressively longer stop-signal delays (progressively darker gray lines). (c) Illustration of simulated activity in the interactive race model of the go unit and stop unit activation on signal-inhibit (thick solid line) and latency-matched no-stop-signal trials (thin solid lines) with stop-signal delay (SSD) and stop-signal reaction time (SSRT) indicated. Cancel time is indicated by the downward arrow. (d) Histogram of cancel times of the go unit predicted by the interactive race model compared with the histogram of cancel times measured for movement-related neurons in FEF and SC. (Adapted from Boucher et al., 2007a.)
Logan, Van Zandt, Verbruggen, & Wagenmakers, 2014; Olman, 1973). Boucher et al. (2007a) addressed an apparent paradox of how seemingly interacting neurons in the brain could produce behavior that appears to be the outcome of independent processes. Mirroring the general model architectures described earlier and illustrated in the right half of Figure 15.5, they instantiated and tested models that assumed stochastic accumulators for the go process and for the stop process that were either an independent race or that assumed competitive, lateral interactions between stop and go. Outstanding fits to observed behavioral data for both the independent race model and the interactive race model were observed. Figures 15.9a and 15.9b show fits of the interactive race model, but fits of the independent race model were virtually identical. Parsimony would favor the independent race. But neural data favored the interactive race. In the absence of a stop signal, visually responsive neurons in FEF select the target, and movement-related neurons in FEF increase their activity until
new directions
a threshold level is reached, shortly after which a saccade is made (Hanes & Schall, 1996), just as they do on memory-guided saccade tasks or visual search tasks. On trials with a stop signal, the dynamics of visually responsive neurons are unaffected (Hanes et al., 1998). For movement-related neurons, we can distinguish activity when a stop was successful (signal-inhibit trials) from activity when a stop was unsuccessful (signal-respond trials). On signal-respond trials, the activity of movement-related neurons is qualitatively the same as the activity on no-signal trials, with neurons reaching a threshold level before a saccade is made. Even more striking, the activity on signal-respond trials is quantitatively indistinguishable from activity on no-signal trials that are equated for response time (latency-matched trials). On signal-inhibit trials, the activity increases in a manner indistinguishable from latency-matched no-signal trials until some time after the SSD, at which point the activity of movement-related neurons is reduced back to baseline without
reaching the threshold. The saccade has been inhibited. Figure 15.9c displays the predicted accumulator dynamics of the interactive race model (Boucher et al., 2007a). The dynamics of the go accumulator in the interactive race precisely mirror the description of the dynamics of movement-related neurons provided earlier, with dynamics not observed in the independent race model. For signal-inhibit trials and latency-matched no-signal trials, activity increases for some time after SSD, after which activity on signal-inhibit trials returns to baseline while activity on latency-matched no-signal trials continues to threshold. The accumulator dynamics in the interactive race model qualitatively capture the neural dynamics of movement-related neurons. But we could go further than that. We also calculated a metric called cancel time (Hanes et al., 1998), which is a function of the time at which the dynamics statistically diverge between signal-inhibit trials and latency-matched no-signal trials. This time can be calculated from movement-related neurons. It can also be calculated from accumulator dynamics. And as shown in Figure 15.9d, these measures from neurons and the model nicely converge. We emphasize that, as was the case for Purcell et al. (2010, 2012), these are true model predictions. Boucher et al. (2007a) fitted models to behavioral data, then calculated the cancel time predicted by the models, and compared that to the observed cancel time in neurons. Parameters were not adjusted to maximize the correspondence. The hypothesized locus of control in Boucher et al. (2007a) is inhibition of a stop process on the go process, with the stop process identified as activity of fixation-related neurons and the go process identified as activity of movement-related neurons. The gate in the gated accumulator model (Purcell et al., 2010, 2012) could be another hypothesized locus of control over perceptual decisions.
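This logic is easy to see in simulation. Below is a minimal illustrative sketch of an interactive race, not the fitted model of Boucher et al. (2007a); all parameter values (drift rates, inhibition strength, threshold, noise) are assumptions chosen only to reproduce the qualitative pattern of late but potent inhibition of the go unit by the stop unit.

```python
import numpy as np

rng = np.random.default_rng(1)

def interactive_race_trial(ssd, mu_go=0.005, mu_stop=0.05, beta=2.0,
                           theta=1.0, sigma=0.02, max_t=600):
    """One simulated stop-signal trial of a simplified interactive race (1-ms steps).

    Returns ("respond", RT) if the go unit reaches threshold,
    or ("inhibit", None) if the stop unit suppresses it first.
    Parameter values are illustrative, not fitted.
    """
    go = stop = 0.0
    for t in range(max_t):
        if t >= ssd:  # stop unit starts accumulating at the stop-signal delay
            stop = max(0.0, stop + mu_stop + sigma * rng.normal())
        # late but potent inhibition of the stop unit on the go unit
        go = max(0.0, go + mu_go - beta * stop + sigma * rng.normal())
        if go >= theta:
            return "respond", t
    return "inhibit", None

# Inhibition function: P(respond | stop signal) grows with stop-signal delay.
for ssd in (0, 100, 200, 300):
    p = np.mean([interactive_race_trial(ssd)[0] == "respond"
                 for _ in range(500)])
    print(f"SSD = {ssd:3d} ms   P(respond) = {p:.2f}")
```

Because the go unit's trajectory on signal-inhibit trials rises and then collapses shortly after SSD, a cancel time can be read off such simulated trajectories in the same way it is measured from movement-related neurons.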
In recent work, we have suggested that blocking the input to the go unit, rather than actively inhibiting it via a stop unit, could be an alternative mechanism for stopping. Indeed, a blocked input model predicted observed data and distributions of cancel times at least as well as the interactive race model (Logan, Schall, & Palmeri, 2015; Logan, Yamaguchi, Schall, & Palmeri, in press). One suggestion we made was that the stop process could raise a gate between visual neurons that select the target and movement neurons that generate a movement to it, blocking input to the
movement neurons and thereby preventing them from reaching threshold. As another example, in a stop-signal task, both humans and monkeys adapt their performance from trial to trial, for example, producing longer RTs after successfully inhibiting a planned movement (e.g., Bissett & Logan, 2011; Nelson, Boucher, Logan, Palmeri, & Schall, 2010; Verbruggen & Logan, 2008). For monkeys, within FEF, activity of visually responsive neurons is unaffected by these trial-to-trial adjustments, but the onset time of activity of movement-related neurons is significantly delayed (Pouget et al., 2011). Purcell et al. (2012) suggested that strategic adjustment in the level of the gate could explain the delayed onset of movement-related neurons in the absence of any modulation of visually responsive neurons. Moreover, they demonstrated that this strategic adjustment of the gate could be couched in terms of optimality. It has been previously suggested that strategic modulation of accumulator threshold could maximize reward rate, which is defined as the proportion of correct responses per unit time (e.g., Gold & Shadlen, 2002; Lo & Wang, 2006). We observed that strategic modulation of the level of the gate could maximize reward rate in much the same way (Purcell et al., 2012).
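The reward-rate argument can be made concrete with a toy gated race between a target-tuned and a distractor-tuned accumulator. This is only a sketch under assumed parameter values (input strengths, noise, gate levels, intertrial interval), not the fitted model of Purcell et al. (2012); it illustrates how reward rate, correct responses per unit time, can be evaluated as a function of the gate.

```python
import numpy as np

rng = np.random.default_rng(2)

def gated_trial(gate, mu_target=0.06, mu_distractor=0.04,
                sigma=0.05, theta=5.0, max_t=2000):
    """One trial of a two-unit gated race (1-ms steps).

    Only the part of each noisy input exceeding the gate is
    accumulated; parameters are illustrative, not fitted.
    Returns (correct, RT).
    """
    a = np.zeros(2)  # target and distractor accumulators
    for t in range(1, max_t + 1):
        drive = np.array([mu_target, mu_distractor]) + sigma * rng.normal(size=2)
        a = np.maximum(0.0, a + np.maximum(drive - gate, 0.0))
        if a.max() >= theta:
            return int(a.argmax() == 0), t
    return 0, max_t  # timeout counted as an error

def reward_rate(gate, n=300, iti=1000.0):
    """Correct responses per unit time: accuracy / (mean RT + intertrial interval)."""
    trials = [gated_trial(gate) for _ in range(n)]
    accuracy = np.mean([c for c, _ in trials])
    mean_rt = np.mean([t for _, t in trials])
    return accuracy / (mean_rt + iti)

# A higher gate slows responses but filters out more of the weak
# distractor input; reward rate summarizes the resulting trade-off.
for g in (0.0, 0.03, 0.06):
    print(f"gate = {g:.2f}   RR = {reward_rate(g):.6f} correct/ms")
```

Sweeping the gate (or, analogously, the threshold) over a grid and picking the setting with the highest reward rate is the sense in which such an adjustment can be called optimal.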
Summary and Conclusions Here we reviewed some of our contributions to a growing synergy of mathematical psychology and systems neuroscience. Our starting point has been a class of successful cognitive models of perceptual decision-making that assume a stochastic accumulation of perceptual evidence to a threshold over time (Figure 15.2). Models of this sort have long provided excellent accounts of response probabilities and distributions of response times in a wide range of perceptual decision-making tasks and manipulations (e.g., see Nosofsky & Palmeri, 2015; Ratcliff & Smith, 2015). We have extended these models to account for response probabilities and distributions of response times for awake behaving monkeys making saccades to target objects in their visual field (Boucher et al., 2007a; Pouget et al., 2011; Purcell et al., 2010, 2012). Applying techniques common to mathematical psychology, we instantiated different model architectures and ruled out models that provided poor fits to observed data. These models have free parameters that govern theoretical quantities like perceptual processing
neurocognitive modeling of perceptual decision making
time, the starting point of accumulation, the drift rate of accumulation, and the response threshold. We constrained many of these parameters using neurophysiology. Unlike some approaches that constrain parameter values using neurophysiology, often relying on neural findings with rather large confidence intervals, we replaced parameterized model assumptions directly with recorded neurophysiology. Specifically, we sampled from neural activity recorded from visually responsive neurons in FEF, feeding these spike trains directly into stochastic accumulator models, thereby creating a largely nonparametric neural theory of perceptual processing time and the drift rate of accumulation. Not only did this approach constrain computational modeling, it also provided a direct test of the hypothesis that the activity of visually responsive neurons in FEF encodes perceptual evidence: This neural code can be accumulated over time to predict where and when the monkey moves its eyes (Purcell et al., 2010, 2012). We also tested the hypothesis that movement-related neurons in FEF instantiate a stochastic accumulation of evidence. Although it has long been acknowledged that these neurons behave in a way consistent with accumulator models (e.g., Hanes & Schall, 1996; Schall, 2001), we went beyond qualitative description to test whether movement neuron dynamics can be quantitatively predicted by accumulator model dynamics. We measured how the onset of activity, baseline activity, rate of growth, and threshold vary with behavioral response time in both movement-related neurons and model accumulators, and we found close correspondences for some models. Not only does this test a hypothesis about the theoretical role of FEF movement-related neurons in perceptual decision-making, it also provides a powerful means of contrasting models that otherwise make indistinguishable behavioral predictions.
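The "spike trains in place of a drift-rate parameter" step can be sketched schematically. Recorded FEF data cannot be reproduced here, so synthetic Poisson spike trains stand in for recorded visually responsive neurons, and the rates (80 Hz target, 30 Hz distractor), smoothing constant, and threshold are all assumed values; only the pipeline, from spikes to a smoothed firing rate to accumulation to threshold, mirrors the approach of Purcell et al. (2010, 2012).

```python
import numpy as np

rng = np.random.default_rng(3)

def poisson_spikes(rate_hz, dur_ms):
    """Synthetic stand-in for a recorded spike train (1-ms bins)."""
    return rng.random(dur_ms) < rate_hz / 1000.0

def firing_rate(spikes, tau=10.0):
    """Causal exponential smoothing of a spike train into a rate signal."""
    r, out = 0.0, np.empty(len(spikes))
    for i, s in enumerate(spikes):
        r += (-r / tau) + s  # leaky integration of incoming spikes
        out[i] = r
    return out

def accumulate_to_threshold(drive, theta):
    """Integrate the neural drive; return the ms at which it crosses theta."""
    acc = np.cumsum(drive)
    hit = np.flatnonzero(acc >= theta)
    return int(hit[0]) if hit.size else None

# An accumulator driven by the target-tuned rate should reach
# threshold sooner than one driven by the distractor-tuned rate.
target_rt = accumulate_to_threshold(firing_rate(poisson_spikes(80, 500)), 100.0)
distractor_rt = accumulate_to_threshold(firing_rate(poisson_spikes(30, 500)), 100.0)
print(target_rt, distractor_rt)
```

In the actual work, trial-by-trial spike trains were sampled from recorded neurons, so perceptual processing time and drift rate were supplied by data rather than by free parameters.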
Our gated accumulator model, which enforces accumulation of discriminative neural signals from visually responsive neurons, not only accounted for the detailed saccade behavior of monkeys, but also predicted quantitatively the dynamics observed in movement-related neurons in FEF, whereas other models could not (Purcell et al., 2010, 2012; see also Boucher et al., 2007a). This gated accumulator model also suggests a potential locus of cognitive control over perceptual decisions. Increasing the gate may account for speed-accuracy tradeoffs (Purcell et al., 2012) as well as stopping behavior and trial history effects described by
Boucher et al. (2007a) and Pouget et al. (2011), respectively. Turning to more general issues, our work has confronted a common challenge in the development of mathematical and computational models of cognition, in which competing models reach a point where they make very similar predictions, examples of which are discussed in other chapters in this volume (Busemeyer, Wang, Townsend, & Eidels, 2015). This could be a consequence of true mimicry, where models assuming vastly different mechanisms nonetheless produce mathematically identical predictions that cannot be distinguished behaviorally. Often, however, the current corpus of experimental manipulations and measures is simply insufficient to discriminate between competing models. Cognitive modelers have long turned to predicting additional complexity in behavioral data to resolve mimicry, going from predicting accuracy alone to predicting response probabilities as well as response times, and from predicting mean response times to predicting response time distributions, including those for correct and error responses. Indeed, in our work reviewed here, predicting jointly response probabilities and response time distributions yielded considerable traction in discriminating between competing models. Unfortunately, outside the mathematical psychology community, it is not uncommon to hear researchers state with complete confidence that response time distributions yield no more useful information than response time means, a claim long since refuted (e.g., see Townsend, 1990).
That said, recognition is emerging, for example, that response time distributions are key aspects of data that theories of visual cognition need to account for (e.g., Palmer, Horowitz, Torralba, & Wolfe, 2011; Wolfe, Palmer, & Horowitz, 2010), that response time distributions provide challenging constraints for low-level spiking neural models (e.g., Lo, Boucher, Paré, Schall, & Wang, 2009), and more generally that considerations of behavioral variability can yield insights into neural processes (e.g., Churchland et al., 2011; Purcell, Heitz, Cohen, & Schall, 2012). But even joint modeling of response probabilities and response time distributions may be insufficient to contrast competing models. Our work illustrates how neurophysiological data can also help distinguish between models. We have described cases in which two models fit behavioral data equally well (Boucher et al., 2007a;
Purcell et al., 2010, 2012) but one model is more complex than the other. With only behavioral data and an appeal to parsimony, we would have been compelled to exclude the more complex model in favor of the simpler one. However, in order to successfully map observed neural dynamics onto predicted model dynamics, the assumptions of the more complex model were required. Key here is that we believe it is important to map between neural dynamics and model dynamics, not between neural dynamics and model parameters (see also, e.g., Davis, Love, & Preston, 2012). Variation in model parameters need not uniquely map onto variation in neural dynamics, but predicted variation in model dynamics must. And while we have demonstrated the theoretical usefulness of neural data in adjudicating between competing models, we do not believe that neural data have any particular empirical primacy. Just as mimicry issues can emerge when examining behavioral measures like accuracy and response time, analogous mimicry issues may be found at the level of neurophysiology and neural dynamics. Neural data are not necessarily more intrinsically informative than behavioral data, but additional data provide additional constraints for distinguishing between competing models. More generally, our work allies with a growing body of research supporting accumulator models of perceptual decision making (e.g., Nosofsky & Palmeri, 1997; Ratcliff & Rouder, 1998; Ratcliff & Smith, 2004; Usher & McClelland, 2001), not just as models that explain behavior but also as models that explain brain activity measured using neurophysiology (e.g., Boucher et al., 2007a; Churchland & Ditterich, 2012; Purcell et al., 2010, 2012; Ratcliff et al., 2003; but see Heitz & Schall, 2012, 2013), EEG (e.g., Philiastides, Ratcliff, & Sajda, 2006), and fMRI (e.g., Turner et al., 2013; van Maanen et al., 2011; White, Mumford, & Poldrack, 2012).
The relative simplicity of cognitive models like accumulator models is a virtue in that they are computationally tractable, making them easily applicable across a wide range of phenomena and levels of analysis. Making explicit links to brain mechanisms does expose complexities. Our focus here has been largely on FEF, but other brain areas have neurons with dynamics that are visually responsive or movement-related, including SC (Hanes & Wurtz, 2001; Paré & Hanes, 2003) and LIP (Gold & Shadlen, 2007; Mazurek et al., 2003; Shadlen & Newsome, 2001). Compared to the relative simplicity of most
Box 1 Top-down versus Bottom-up Theoretical Approaches Computational cognitive neuroscience aims to understand the relationship between brain and behavior using computational and mathematical models of cognition. One approach is bottom up. Theorists begin with fairly detailed mathematical models of neurons based on current understanding of cellular and molecular neurobiology. A common approach is to develop and test a single model of a neural network built up from these detailed models of neurons along with hypotheses about their excitatory and inhibitory connectivity. Although these neural models provide excellent accounts of spiking and receptor dynamics of individual neurons and may also account well for emergent network activity, they may provide only fairly coarse accounts of observed behavior, have somewhat limited generalizability, and be impractical to rigorously simulate and evaluate quantitatively. Another approach is top down (e.g., Forstmann et al., 2011; Palmeri, 2014). Cognitive models account for details of behavior across multiple conditions, have significant generalizability across tasks and subject populations, and are often relatively easy to simulate and evaluate. It is common to evaluate multiple competing models and to test the necessity and sufficiency of model assumptions with nested model comparison techniques. Although these models do not provide the same level of detailed predictions of spiking and receptor dynamics, they can provide predictions about the temporal dynamics of neural activity at the same level of precision as commonly summarized in neurophysiological investigations, as we illustrated in our review.
In fact, Carandini (2012) suggested that bridging between brain and behavior can only be done by considering intermediate-level theories, that the gap between low-level neural models and behavior is simply a “bridge too far.” Although he considered linear filtering and divisive normalization as example computations that may be carried out across cortex (Carandini & Heeger, 2012), we consider accumulation of evidence as a similar computation that may be carried out in various brain areas, including FEF. These computations can simultaneously explain behavioral and neural dynamics.
stochastic accumulator models, there is a network of brain areas involved in evidence accumulation for perceptual decision making (Gold & Shadlen, 2007; Heekeren et al., 2008; Schall, 2001, 2004). Such mechanisms involving accumulation of evidence for perceptual decision-making may be replicated across different sensory and effector systems in the brain, such as those for visually guided saccades, but there may be domain-general mechanisms as well (e.g., Ho, Brown, & Serences, 2009). Although the dynamics of specific individual neurons within particular brain areas mirror the dynamics of accumulators in models, we also know that, within any given brain area, ensembles of tens of thousands of neurons are involved in the generation of any perceptual decision. We need to understand the scaling relations from simple accumulator models to complex ensembles of thousands of neural accumulators (Zandbelt, Purcell, Palmeri, Logan, & Schall, 2014) and how to map the relatively few parameters that define simple accumulator models onto the great number of parameters that define complex neural dynamics (Umakantha, Purcell, & Palmeri, 2014).
Acknowledgments This work was supported by NIH R01EY021833, NSF Temporal Dynamics of Learning Center SMA-1041755, NIH R01-MH55806, NIH R01-EY008890, NIH P30-EY08126, NIH P30-HD015052, and by Robin and Richard Patton through the E. Bronson Ingram Chair in Neuroscience. Address correspondence to Thomas J. Palmeri, Department of Psychology, Vanderbilt University, Nashville TN 37203. Electronic mail may be addressed to thomas.j.palmeri@vanderbilt.edu.
Glossary drift rate: The mean rate of perceptual evidence accumulation in a stochastic accumulator model of perceptual decision-making. frontal eye field: An area of prefrontal cortex that governs whether, where, and when the eyes move to a new location in the visual field. gated accumulator: A stochastic accumulator model that includes a gate that enforces accumulation of discriminative neural signals, a model which quantitatively accounts for both behavioral and neural dynamics of saccadic eye movement. leakage: A weighted self-inhibition on the accumulation of
perceptual evidence, turning a perfect integrator of perceptual evidence into a leaky integrator of perceptual evidence. movement-related neurons: Neurons in FEF that show little or no modulation to the appearance of the target in the visual field but pronounced growth of spike rate immediately preceding the production of a saccade. perceptual decision-making: Perceptual decision-making requires representing the world with respect to current task goals and using perceptual evidence to inform the selection of a particular action. saccade: A ballistic eye movement of some angle and velocity to a particular location in the visual field. stochastic accumulator model: A class of computational models that assume that noisy perceptual evidence is accumulated over time from a starting point to a threshold, allowing predictions of both response probabilities and distributions of response times. stop-signal task: A classic cognitive control paradigm in which a primary go task is occasionally interrupted with a stop signal. visually responsive neurons: Visually responsive neurons are neurons in FEF that respond to the appearance of an object in their receptive field relative to that object’s salience with respect to current task goals but show little or no change in activity prior to the onset of a saccade.
References Ashby, F. G. (2000). A stochastic version of general recognition theory. Journal of Mathematical Psychology, 44, 310–329. Becker, W., & Jürgens, R. (1979). An analysis of the saccadic system by means of double step stimuli. Vision Research, 19, 976–983. Bichot, N. P., & Schall, J. D. (1999). Effects of similarity and history on neural mechanisms of visual selection. Nature Neuroscience, 2, 549–554. Bichot, N. P., Schall, J. D., & Thompson, K. G. (1996). Visual feature selectivity in frontal eye fields induced by experience in mature macaques. Nature, 381, 697–699. Bichot, N. P., Thompson, K. G., Rao, S. C., & Schall, J. D. (2001). Reliability of macaque frontal eye field neurons signaling saccade targets during visual search. Journal of Neuroscience, 21, 713–725. Bissett, P. G., & Logan, G. D. (2011). Balancing cognitive demands: Control adjustments in the stop-signal paradigm. Journal of Experimental Psychology: Learning, Memory and Cognition, 37, 392–404. Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision making: A formal analysis of models of performance in two-alternative forced-choice tasks. Psychological Review, 113, 700–765. Boucher, L., Palmeri, T. J., Logan, G. D., & Schall, J. D. (2007a). Inhibitory control in mind and brain: An interactive race model of countermanding saccades. Psychological Review, 114, 376–397.
Boucher, L., Stuphorn, V., Logan, G. D., Schall, J. D., & Palmeri, T. J. (2007b). Stopping eye and hand movements: Are the processes independent? Perception & Psychophysics, 69, 785–801. Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology, 57, 153–178. Bruce, C. J., & Goldberg, M. E. (1985). Primate frontal eye fields. I. Single neurons discharging before saccades. Journal of Neurophysiology, 53, 603–635. Bruce, C. J., Goldberg, M. E., Bushnell, M. C., & Stanton, G. B. (1985). Primate frontal eye fields: II. Physiological and anatomical correlates of electrically evoked eye movements. Journal of Neurophysiology, 54, 714–734. Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: Sage Publications. Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review, 100(3), 432–459. Busemeyer, J. R., Wang, Z., Townsend, J. T., & Eidels, A. (2015). Mathematical and computational models of cognition. Oxford, UK: Oxford University Press. Camalier, C. R., Gotler, A., Murthy, A., Thompson, K. G., Logan, G. D., Palmeri, T. J., & Schall, J. D. (2007). Dynamics of saccade target selection: Race model analyses of double step and search step saccade production in human and macaque. Vision Research, 47, 2187–2211. Carandini, M. (2012). From circuits to behavior: A bridge too far? Nature Neuroscience, 15, 507–509. Carandini, M., & Heeger, D. J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13, 51–62. Churchland, A. K., & Ditterich, J. (2012). New advances in understanding decisions among multiple alternatives. Current Opinion in Neurobiology, 22(6), 920–926. Churchland, A. K., Kiani, R., Chaudhuri, R., Wang, X. J., Pouget, A., & Shadlen, M. N. (2011). Variance as a signature of neural computations during decision making.
Neuron, 69(4), 818–831. Cisek, P., Puskas, G. A., & El-Murr, S. (2009). Decisions in changing conditions: The urgency-gating model. Journal of Neuroscience, 29(3), 11560–11571. Cohen, J. Y., Heitz, R. P., Woodman, G. F., Schall, J. D. (2009). Neural basis of the set-size effect in frontal eye field: Timing of attention during visual search. Journal of Neurophysiology, 101, 1699–1704. Dayan, P., & Daw, N. D. (2008). Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience, 8(4), 429–453. Davis, T., Love, B. C., & Preston, A. R. (2012). Learning the exception to the rule: Model-based fMRI reveals specialized representations for surprising category members. Cerebral Cortex, 22, 260–273. Ditterich, J. (2006). Stochastic models of decisions about motion direction: Behavior and physiology. Neural Networks, 19, 981–1012. Ditterich, J. (2010). A comparison between mechanisms of multi-alternative perceptual decision making: Ability to
explain human behavior, predictions for neurophysiology, and relationship with decision theory. Frontiers in Decision Neuroscience, 4. Ferrier, D. (1874). The localization of function in the brain. Proceedings of the Royal Society of London, 22, 229–232. Forstmann, B. U., Wagenmakers, E. J., Eichele, T., Brown, S., & Serences, J. T. (2011). Reciprocal relations between cognitive neuroscience and formal cognitive models: opposites attract? Trends in Cognitive Sciences, 15(6), 272–279. Glimcher, P. W., & Rustichini, A. (2004). Neuroeconomics: The consilience of brain and decision. Science, 306(5695), 447–452. Gilchrist, I. D. (2011). Saccades. In S. P. Liversedge, I. D. Gilchrist, & S. Everling (Eds.), Oxford Handbook on Eye Movements (pp. 85–94). Oxford, UK: Oxford University Press. Gold, J. I., & Shadlen, M. N. (2002). Banburismus and the brain: Decoding the relationship between sensory stimuli, decisions, and reward. Neuron, 36, 299–308. Gold, J. I., & Shadlen, M. N. (2007). The neural basis of decision making. Annual Review of Neuroscience, 30, 535–560. Goldman-Rakic, P. S., & Porrino, L. J. (1985). The primate mediodorsal (MD) nucleus and its projection to the frontal lobe. Journal of Comparative Neurology, 242, 535–560. Grossberg, S. (1976). Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biological Cybernetics, 23, 187–202. Hanes, D. P., Patterson, W. F., II, & Schall, J. D. (1998). Role of frontal eye fields in countermanding saccades: Visual, movement, and fixation activity. Journal of Neurophysiology, 79, 817–834. Hanes, D. P., & Schall, J. D. (1996). Neural control of voluntary movement initiation. Science, 274, 427–430. Hanes, D. P., & Wurtz, R. H. (2001). Interaction of the frontal eye field and superior colliculus for saccade generation. Journal of Neurophysiology, 85(2), 804–815. Heekeren, H. R., Marrett, S., & Ungerleider, L. G. (2008).
The neural systems that mediate human perceptual decision making. Nature Reviews Neuroscience, 9(6), 467–479. Heitz, R. P., & Schall, J. D. (2012). Neural mechanisms of speed-accuracy tradeoff. Neuron, 76, 616–628. Heitz, R. P., & Schall, J. D. (2013). Neural chronometry and coherency across speed-accuracy demands reveal lack of homomorphism between computational and neural mechanisms of evidence accumulation. Philosophical Transactions of the Royal Society of London B, 368, 20130071. Hikosaka, O., & Wurtz, R. H. (1983). Visual and oculomotor functions of monkey substantia nigra pars reticulata: IV. Relation of substantia nigra to superior colliculus. Journal of Neurophysiology, 49, 1285–1301. Ho, T. C., Brown, S., & Serences, J. T. (2009). Domain general mechanisms of perceptual decision making in human cortex. The Journal of Neuroscience, 29(27), 8675–8687. Jones, M., & Dzhafarov, E. N. (2014). Unfalsifiability of major modeling schemes for choice reaction time. Psychological Review, 121, 1–32. Kahneman, D., & Tversky, A. (1984). Choices, values, and frames. American Psychologist, 39(4), 341–350.
Laming, D. R. J. (1968). Information theory of choice-reaction times. New York, NY: Academic. Lappin, J. S., & Eriksen, C. W. (1966). Use of a delayed signal to stop a visual reaction-time response. Journal of Experimental Psychology, 72, 805–811. Lewandowsky, S., & Farrell, S. (2010). Computational modeling in cognition: principles and practice. Thousand Oaks, CA: Sage. Link, S. W. (1992). The wave theory of difference and similarity. Hillsdale, NJ: Erlbaum. Lo, C.-C., Boucher, L., Paré, M., Schall, J. D., & Wang, X.-J. (2009). Proactive inhibitory control and attractor dynamics in countermanding action: A spiking neural circuit model. Journal of Neuroscience, 29, 9059–9071. Lo, C.-C., & Wang, X. J. (2006). Cortico–basal ganglia circuit mechanism for a decision threshold in reaction time tasks. Nature Neuroscience, 9, 956–963. Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492–527. Logan, G. D. (2002). An instance theory of attention and memory. Psychological Review, 109, 376. Logan, G. D., & Cowan, W. B. (1984). On the ability to inhibit thought and action: A theory of an act of control. Psychological Review, 91, 295–327. Logan, G. D., & Gordon, R. D. (2001). Executive control of visual attention in dual-task situations. Psychological Review, 108, 393–434. Logan, G. D., Schall, J. D., & Palmeri, T. J. (2015). Neural models of stopping and going. Manuscript in preparation. To appear in B. Forstmann & E. J. Wagenmakers (Eds.), An introduction to model-based cognitive neuroscience. Springer Neuroscience. Logan, G. D., Van Zandt, T., Verbruggen, F., & Wagenmakers, E.-J. (2014). On the ability to inhibit thought and action: General and special theories of an act of control. Psychological Review, 121(1), 66–95. Logan, G. D., Yamaguchi, M., Schall, J. D., & Palmeri, T. J. (in press). Inhibitory control in mind and brain 2.0: A blocked-input model of saccadic countermanding. Psychological Review. Mack, M. L., & Palmeri, T. J.
(2010). Modeling categorization of scenes containing consistent versus inconsistent objects. Journal of Vision, 10(3):11, 1–11. Mack, M. L., & Palmeri, T. J. (2011). The timing of visual object categorization. Frontiers in Perception Science. Mazurek, M. E., Roitman, J. D., Ditterich, J., & Shadlen, M. N. (2003). A role for neural integrators in perceptual decision making. Cerebral Cortex, 13, 1257–1269. Munoz, D. P., & Schall, J. D. (2004). Concurrent, distributed control of saccade initiation in the frontal eye field and superior colliculus. In W.C. Hall & A. Moschovakis, (Eds.), The superior colliculus: New approaches for studying sensorimotor integration (pp. 55–82). Boca Raton, FL: CRC Press. Murthy, A., Ray, S., Shorter, S. M., Schall, J. D., & Thompson, K. G. (2009). Neural control of visual search by frontal eye field: effects of unexpected target displacement on visual selection and saccade preparation. Journal of Neurophysiology, 101(5), 2485–2506.
Nelson, M. J., Boucher, L., Logan, G. D., Palmeri, T. J., & Schall, J. D. (2010). Impact of nonstationary response time in stopping and stepping saccade tasks. Attention, Perception, & Psychophysics, 72, 1913–1929. Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39–57. Nosofsky, R. M., & Palmeri, T. J. (1997). An exemplar-based random walk model of speeded classification. Psychological Review, 104, 266–299. Nosofsky, R. M., & Palmeri, T. J. (2015). Exemplar-based random walk model. In J. R. Busemeyer, Z. Wang, J. T. Townsend, & A. Eidels (Eds.), Mathematical and computational models of cognition. Oxford University Press. Palmer, E. M., Horowitz, T. S., Torralba, A., & Wolfe, J. M. (2011). What are the shapes of response time distributions in visual search? Journal of Experimental Psychology: Human Perception and Performance, 37, 58. Palmeri, T. J. (1997). Exemplar similarity and the development of automaticity. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 324–354. Palmeri, T. J. (2014). An exemplar of model-based cognitive neuroscience. Trends in Cognitive Science, 18(2), 67–69. Palmeri, T. J., & Cottrell, G. (2009). Modeling perceptual expertise. In I. Gauthier, M. Tarr, & D. Bub (Eds.), Perceptual expertise: bridging brain and behavior. Oxford, UK: Oxford University Press. Palmeri, T. J., & Tarr, M. (2008). Visual object perception and long-term memory. In S. Luck & A. Hollingworth (Eds.), Visual Memory (pp. 163–207). Oxford, UK: Oxford University Press. Palmeri, T. J., Wong, A. C.-N., & Gauthier, I. (2004). Computational approaches to the development of perceptual expertise. Trends in Cognitive Sciences, 8, 378–386. Paré, M., & Hanes, D. P. (2003). Controlled movement processing: superior colliculus activity associated with countermanded saccades. Journal of Neuroscience, 23(16), 6480–6489. Philiastides, M. G., Ratcliff, R., & Sajda, P.
(2006). Neural representation of task difficulty and decision making during perceptual categorization: a timing diagram. Journal of Neuroscience, 26(35), 8965–8975. Pouget, P., Logan, G. D., Palmeri, T. J., Boucher, L., Paré, M., & Schall, J. D. (2011). Neural basis of adaptive response time adjustment during saccade countermanding. Journal of Neuroscience, 31(35), 12604–12612. Pouget, P., Stepniewska, I., Crowder, E. A., Leslie, M. W., Emeric, E. E., Nelson, M. J., & Schall, J. D. (2009). Visual and motor connectivity and the distribution of calcium-binding proteins in macaque frontal eye field: Implications for saccade target selection. Frontiers in Neuroanatomy, 3, 2. Purcell, B. A., Heitz, R. P., Cohen, J. Y., & Schall, J. D. (2012). Response variability of frontal eye field neurons modulates with sensory input and saccade preparation but not visual search salience. Journal of Neurophysiology, 108, 2737–2750. Purcell, B. A., Heitz, R. P., Cohen, J. Y., Schall, J. D., Logan, G. D., & Palmeri, T. J. (2010). Neurally constrained modeling of perceptual decision making. Psychological Review, 117, 1113–1143.
Purcell, B. A., Schall, J. D., Logan, G. D., & Palmeri, T. J. (2012). From salience to saccades: Multiple-alternative gated stochastic accumulator model of visual search. Journal of Neuroscience, 32, 3433–3446. Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108. Ratcliff, R. (2013). Parameter variability and distributional assumptions in the diffusion model. Psychological Review, 120, 281–292. Ratcliff, R., Cherian, A., & Segraves, M. (2003). A comparison of macaque behavior and superior colliculus neuronal activity to predictions from models of two-choice decisions. Journal of Neurophysiology, 90, 1392–1407. Ratcliff, R., Hasegawa, Y. T., Hasegawa, R. P., Smith, P. L., & Segraves, M. A. (2007). Dual diffusion model for single-cell recording data from the superior colliculus in a brightness-discrimination task. Journal of Neurophysiology, 97, 1756–1774. Ratcliff, R., & Rouder, J. N. (1998). Modeling response times for two-choice decisions. Psychological Science, 9, 347–356. Ratcliff, R., & Smith, P. L. (2004). A comparison of sequential sampling models for two-choice reaction time. Psychological Review, 111, 333–367. Ratcliff, R., & Smith, P.L. (2015). Modeling simple decisions and applications using a diffusion model. In J. R. Busemeyer, Z. Wang, J. T. Townsend, & A. Eidels (Eds.), Mathematical and computational models of cognition, Oxford University Press. Ratcliff, R., & Tuerlinckx, F. (2002). Estimating parameters of the diffusion model: Approaches to dealing with contaminant reaction times and parameter variability. Psychonomic Bulletin & Review, 9, 438–481. Rosenbaum, D. A. (2009). Human motor control. (2nd ed.). New York, NY: Academic. Sato, T., Murthy, A., Thompson, K. G., & Schall, J. D. (2001). Search efficiency but not response interference affects visual selection in frontal eye field. Neuron, 30, 583–591. Sato, T., & Schall, J. D. (2003). 
Effects of stimulus-response compatibility on neural selection in frontal eye field. Neuron, 38(4), 637–648. Schall, J. D. (2001). Neural basis of deciding, choosing and acting. Nature Reviews Neuroscience, 2, 33–42. Schall, J. D. (2004). On building a bridge between brain and behavior. Annual Review of Psychology, 55, 23–50. Schall, J. D., & Cohen, J. Y. (2011). The neural basis of saccade target selection. In S. P. Liversedge, I. P. Gilchrist, & S. Everling (Eds.). Oxford handbook on eye movements. Oxford, UK: Oxford University Press. Schall, J. D., Morel, A., King, D., & Bullier, J. (1995). Topography of visual cortex connections with frontal eye field in macaque: Convergence and segregation of processing streams. Journal of Neuroscience, 15, 4464–4487. Schneider, D. W., & Logan, G. D. (2005). Modeling task switching without switching tasks: A short-term priming account of explicitly cued performance. Journal of Experimental Psychology: General, 134, 343–367. Schneider, D. W., & Logan, G. D. (2009). Selecting a response in task switching: Testing a model of compound cue retrieval. Journal of Experimental Psychology: Learning, Memory and Cognition, 35, 122–136.
Scudder, C. A., Kaneko, C. S., & Fuchs, A. F. (2002). The brainstem burst generator for saccadic eye movements: A modern synthesis. Experimental Brain Research, 142, 439– 462. Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. Journal of Neurophysiology, 86 (4), 1916–1936. Smith, P. L. (2010). From Poisson shot noise to the integrated Ornstein-Uhlenbeck process: Neurally principled models of information accumulation in decision-making and response time. Journal of Mathematical Psychology, 54, 266–283. Smith, P. L., & Ratcliff, R. (2004). Psychology and neurobiology of simple decisions. Trends in Neuroscience, 27, 161–168. Smith, P. L., & Ratcliff, R. (2009). An integrated theory of attention and decision making in visual signal detection. Psychological Review, 116, 283. Smith, P. L., & Van Zandt, T. (2000). Time-dependent Poisson counter models of response latency in simple judgment. British Journal of Mathematical & Statistical Psychology, 53, 293–315. Sparks, D. L. (2002). The brainstem control of saccadic eye movements. Nature Reviews Neuroscience, 3, 952–964. Stanton, G. B., Bruce, C. J., & Goldberg, M. E. (1995). Topography of projections to posterior cortical areas from the macaque frontal eye fields. Journal of Comparative Neurology, 353, 291–305. Teller, D. Y. (1984). Linking propositions. Vision Research, 24, 1233–1246. Thompson, K. G., Biscoe, K. L., & Sato, T. R. (2005). Neuronal basis of covert spatial attention in the frontal eye field. Journal of Neuroscience, 25, 9479–9487. Thompson, K. G., Hanes, D. P., Bichot, N. P., & Schall, J. D. (1996). Perceptual and motor processing stages identified in the activity of macaque frontal eye field neurons during visual search. Journal of Neurophysiology, 76, 4040–4055. Townsend, J. T. (1990). The truth and consequences of ordinal differences in statistical distributions: Toward a theory of hierarchical inference. 
Psychological Bulletin, 108, 551–567. Turner, B. M., Forstmann, B. U., Wagenmakers, E. J., Brown, S. D., Sederberg, P. B., and Steyvers, M. (2013). A Bayesian framework for simultaneously modeling neural and behavioral data. NeuroImage, 72, 193–206. Umakantha, A., Purcell, B. A., & Palmeri, T. J. (2014). Mapping between a spiking neural network model and the diffusion model of perceptual decision making. Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108, 550–592. van Maanen, L., Brown, S. D., Eichele, T., Wagenmakers, E. J., Ho, T., Serences, J., & Forstmann, B. U. (2011). Neural correlates of trial-to-trial fluctuations in response caution. Journal of Neuroscience, 31(48), 17488–17495. Van Zandt, T. (2000). How to fit a response time distribution. Psychonomic Bulletin & Review, 7, 424–465. Verbruggen, F., & Logan, G. D. (2008). Response inhibition in the stop-signal paradigm. Trends in Cognitive Sciences, 12, 418–424.
White, C. N., Mumford, J. A., & Poldrack, R. A. (2012). Perceptual criteria in the human brain. Journal of Neuroscience, 32(47), 16716–16724. Wolfe, J. M., Palmer, E. M., & Horowitz, T. S. (2010). Reaction time distributions constrain models of visual search. Vision Research, 50, 1304–1311. Wong, K. F., Huk, A. C., Shadlen, M. N., & Wang, X. J. (2007). Neural circuit dynamics underlying accumulation of time-varying evidence during perceptual decision making. Frontiers in Computational Neuroscience, 1, 1–11.
new directions
Wong, K. F., & Wang, X. J. (2006). A recurrent network mechanism of time integration in perceptual decisions. Journal of Neuroscience, 26, 1314–1328. Woodman, G. F., Kang, M. S., Thompson, K., & Schall, J. D. (2008). The effect of visual search efficiency on response preparation: Neurophysiological evidence for discrete flow. Psychological Science, 19, 128–136. Zandbelt, B. B., Purcell, B. A., Palmeri, T. J., Logan, G. D., & Schall, J. D. (2014). Response times from ensembles of accumulators. Proceedings of the National Academy of Sciences, 111, 2848–2853.
CHAPTER 16
Mathematical and Computational Modeling in Clinical Psychology
Richard W. J. Neufeld
Abstract
This chapter begins with an introduction to the basic ideas behind clinical mathematical and computational modeling. In general, models of normal cognitive-behavioral functioning are titrated to accommodate performance deviations accompanying psychopathology; model features remaining intact indicate functions that are spared; those that are perturbed are triaged as signifying functions that are disorder-affected. Distinctions and interrelations among forms of modeling in clinical science and assessment are stipulated, with an emphasis on analytical, mathematical modeling. Preliminary conceptual and methodological considerations are presented. Concrete examples illustrate the benefits of modeling as applied to specific disorders. Emphasis in each case is on clinically significant information uniquely yielded by the modeling enterprise. Implications for the functional side of clinical functional neuroimaging are detailed. Challenges to modeling in the domain of clinical science and assessment are described, as are tendered solutions. The chapter ends with a description of continuing challenges and future opportunities.

Key Words: clinical mathematical modeling, clinical cognitive modeling, analytical modeling,
Introduction

“The important point for methodology of psychology is that just as in statistics one can have a reasonably precise theory of probable inference, being ‘quasi-exact about the inherently inexact,’ so psychologists should learn to be sophisticated and rigorous in their metathinking about open concepts at the substantive level. . . . In social and biological science, one should keep in mind that explicit definition of theoretical entities is seldom achieved in terms of initial observational variables of those sciences, but it becomes possible instead by theoretical reduction or fusion” (Meehl, 1978, p. 815).

Mathematical and computational modeling of clinical-psychological phenomena can elucidate clinically significant constructs by translating them into variables of a quantitative system, and lending
them meaning according to their operation within that very system (e.g., Braithwaite, 1968). New explanatory properties are availed, as are options for clinical-science measurement, and tools for clinical-assessment technology. This chapter is designed to elaborate on these assets by providing examples where otherwise intractable or hidden clinical information has been educed. Issues of practicality and validity, indigenous to the clinical setting, are examined, as is the potentially unique contribution of clinical modeling to the broader modeling enterprise. Emphasis is on currently prominent domains of application, and exemplary instances within each. Background material for the current developments is available in several sources (e.g., Busemeyer & Diederich, 2010; Neufeld, 1998; 2007a).
We begin by considering an overall epistemic strategy of clinical psychological modeling. Divisions of modeling in the clinical domain are then distinguished. Exemplary implementations are presented, as are certain challenges sui generis to this domain.

Figure 16.1 summarizes the overall epistemic strategy of clinical psychological modeling. In its basic form, quantitative models of normal performance, typically on laboratory tasks, are titrated to accommodate performance deviations occurring with clinical disturbance. The requisite model tweaking, analogous to a reagent of chemical titration, in principle discloses the nature of change to the task-performance system taking place with clinical disturbance. Aspects of the model remaining intact are deemed as aligning with functions spared with the disturbance, and those that have been perturbed are triaged as pointing to functions that have been affected. Accommodation of performance deviations by the appropriated model infrastructure, in turn, speaks to validity of the latter. Successful accommodation of altered performance among clinical samples becomes a source of construct validity, over and against an appropriated model’s failure or strain in doing so. This aspect of model evaluation, an instance of “model-generalization testing” (Busemeyer & Wang, 2000), is one in which performance data from the clinical setting can play an important role.

To illustrate the preceding strategy, consider a typical memory-search task (Sternberg, 1969). Such a task may be appropriated to tap ecologically significant processes: cognitively preparing and transforming (encoding) environmental stimulation into a format facilitating collateral cognitive operations; extracting and manipulating material in
[Figure 16.1 near here: a flow diagram linking “Mathematical and Computational Modeling of Experimental-Task Performance” and “Clinical Mathematical and Computational Psychology: Expression of Performance Deviation according to Model Titration,” connected through “Model-Generalization Testing.”]
Fig. 16.1 Relations between clinical and nonclinical mathematical and computational psychology.
short-term or working memory, on which informed responding rests; and preparing and delivering the information-determined response.

During each trial of the preceding task, a prememorized set of items (memory set), such as alphanumeric characters, is to be scanned, in order to ascertain “as quickly and accurately as possible” the presence or absence of a subsequently presented item (probe item). The subspan memory-set size varies from trial to trial, with the items within each set size also possibly varying over their presentations, or alternatively remaining constant within each set size (variable- versus fixed-set procedures). Other manipulations may be directed, for instance, to increasing probe-encoding demands (whereby, say, the font of the probe item mismatches, rather than matches, the font of the memory-set items). The principal response property tenably is latency from probe-item onset, accuracy being high, and not compromising the validity of latency-based inferences (e.g., no speed-accuracy trade-off).

Quantitatively informed convergent experimental evidence may point to an elongated probe-encoding process as being responsible for delayed trial-wise response latencies in the clinical group. The encoding process may be represented by a portion of the trial-performance model. This portion may stipulate, for example, constituent encoding operations of the encoding process, there being k’ in number (to use model-parameter notation consistent with the clinical modeling literature). Such subprocesses may correspond with observable stimulus features, such as curves, lines, and intersections of the alphanumeric probe, extracted in the service of mental template matching to members of the trial’s memory set. The intensity of processing applicable to each of the respective k’ subprocesses (loosely, speed of completion, or number transacted per unit time; e.g., Rouder, Sun, Speckman, Lu, & Zhou, 2003; Townsend & Ashby, 1983) is denoted v.
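The k’-subprocess, rate-v architecture just described can be made concrete in a few lines. Assuming, as in the Erlang example used later in this chapter, that the k’ encoding subprocesses complete serially, each at exponential rate v, total encoding latency has mean k’/v and intertrial variance k’/v². The parameter values below are chosen purely for illustration, not taken from any fitted data set:

```python
import random

def encoding_time(k, v, rng):
    """One trial's Erlang-distributed encoding latency:
    k serial subprocesses, each completing at exponential rate v."""
    return sum(rng.expovariate(v) for _ in range(k))

def latency_moments(k, v, n=100_000, seed=1):
    """Simulated mean and intertrial variance of trial latencies."""
    rng = random.Random(seed)
    times = [encoding_time(k, v, rng) for _ in range(n)]
    mean = sum(times) / n
    var = sum((t - mean) ** 2 for t in times) / n
    return mean, var

base = latency_moments(k=4, v=2.0)    # theory: mean 4/2 = 2.0, variance 4/4 = 1.0
more_k = latency_moments(k=8, v=2.0)  # elevated k': mean 4.0, variance 2.0
less_v = latency_moments(k=4, v=1.0)  # reduced v:  mean 4.0, variance 4.0
# Both deviations double the mean latency, but they leave distinct
# intertrial-variance signatures -- the leverage exploited by model titration.
```

The point of the sketch is the titration logic discussed next: an elevation in k’ and a reduction in v can mimic one another in mean latency, yet are distinguishable once intertrial variability is brought into the picture.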
Decomposing clinical-sample performance through the model-based analysis potentially exonerates the parameter v, but indicts k’ as the source of encoding protraction. Specifically, model-predicted configurations of performance latency, and intertrial variability, converge with the pattern of empirical deviations from control data, upon elevation in k’, but not reduction in v. By way of parameter assignment in the modeling literature, strictly speaking, cognitive capacity (in the sense of processing speed) is intact, but efficiency of its deployment has suffered. The preceding example of mapping of formal theory onto empirical data
is known in clinical mathematical modeling, and elsewhere, as “abductive reasoning” (see Box 1).

When it comes to potential “value added” beyond the immediate increment in information, the following come to the fore. Ecological significance is imbued in terms of assembling environmental information in the service of establishing necessary responses, including those germane to self-maintenance activities, and meeting environmental demands. On this note, basic deficit may ramify to more elaborate operations in which the affected process plays a key role (e.g., where judgments about complex multidimensional stimuli are built up from encoded constituent dimensions). Deficits in rudimentary processes moreover may parlay into florid symptom development or maintenance, as where thought-content disorder (delusions and thematic hallucinations) arises from insufficient encoding of cues that normally anchor the interpretation of other information. Additional implications pertain to memory, where heightened retrieval failure is risked owing to protraction of initial item encoding. Debility parameterization also may inform other domains of measurement, such as neuroimaging. A tenable model may demarcate selected intratrial epochs of cognitive tasks when a clinically significant constituent process is likely to figure prominently. In this way, times of measurement interest may complement brain regions of interest, for a more informed navigation of space-time coordinates in functional neuroimaging. Symptom significance thus may be brokered to imaged neurocircuitry via formally modeled cognitive abnormalities.
Modeling Distinctions in Psychological Clinical Science

There are several versions of “modeling” in psychological clinical science. Nonformal models, such as flow-diagrams and other organizational schemata, nevertheless, are ubiquitously labeled “models” (cf. McFall, Townsend & Viken, 1995). Our consideration here is restricted to formal models, where the progression of theoretical statements is governed by precisely stipulated rules of successive statement transitions. Most notable, and obviously dominant throughout the history of science, are mathematical models. Formal languages for theory development other than mathematics include symbolic logic and computer syntax. Within the formal modeling enterprise, then, is mathematical modeling, computational modeling [computer simulation, including “connectionist,” “(neural)
Box 1 Abductive Reasoning in Clinical Cognitive Science Scientific rigor does not demand that theoretical explanation for empirical findings be restricted to a specific account from a set of those bearing on the study that have been singled out before the study takes place. In fact, the value of certain formal developments to understanding obtained data configurations may become apparent only after the latter present themselves. It is epistemologically acceptable to explanatorily retrofit extant (formal) theory to empirical data (e.g., in the text, changing the clinical sample’s value of k versus v), a method known as abductive reasoning (Haig, 2008). Abductive reasoning not only has a bona-fide place in science, but it is economical in its applicability to already-published data (Novotney, 2009) and/or proffered conjectures about clinically significant phenomena. On the note of rigorous economy, the clinical scientist can play with model properties to explore the interactions of model-identified or other proposed sources, say, of pathocognition with various clinically significant variables, such as psychological stress. For example, it has been conjectured that some depressive disorders can be understood in terms of highly practiced, automatic negative thoughts supplanting otherwise viable competitors, and also that psychological stress enhances ascendancy of the former on the depressed individual’s cognitive landscape (Hartlage, Alloy, Vasquez & Dykman, 1993). Translating these contentions into terms established in mainstream quantitative cognitive science, however, discloses that psychological stress instead reduces the ascendancy of well-practiced negative thoughts, at least within this highly-defensible assumptive framework (Neufeld, 1996; Townsend & Neufeld, 2004). 
The quantitative translation begins with expressing the dominance of well-practiced (so-called automatic) negative versus less-practiced (so-called effortful) non-negative thought content, as higher average completion rates for the former. With these rate properties in tow, the well-practiced and less-practiced thought content then enter a formally modeled “horse race,” where the faster rates for negative-thought generation evince higher winning probabilities, for all race durations. Note that although these
derivations result from basic computations in integral calculus, they nevertheless yield precise predictions, and lay bare their associated assumptions. Differentiating the above horse-race expression of the probability of negative-thought victory, with respect to a parameter conveying the effects of stress on processing capacity, leads to a negative derivative. Formalized in this way, then, the result shows psychological stress actually to handicap the negative-thought contender. It is conceivable that reduction in the ascendancy of well-practiced negative thoughts, in the face of stressing environmental demands and pressures, in favour of less-practiced but more adaptive cognitive processes conveys a certain protective function. In all events, this example illustrates the hazards of depending on unaided verbal reasoning in attempting to deal with complex intervariable relations (including stress effects on psychopathology), and exemplifies the disclosure, through available formal modeling, of subtleties that are both plausible and clinically significant—if initially counterintuitive (Staddon, 1984; Townsend, 1984).
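The horse race sketched in Box 1 is straightforward to simulate. Here completion times are taken to be exponential, with the well-practiced negative content given the higher rate; the specific rates and horizons are illustrative only, not values from Neufeld (1996) or Townsend and Neufeld (2004):

```python
import random

def negative_win_rate(rate_neg, rate_pos, horizon, n=200_000, seed=7):
    """Proportion of races, decided within `horizon`, won by negative content."""
    rng = random.Random(seed)
    wins = finished = 0
    for _ in range(n):
        t_neg = rng.expovariate(rate_neg)  # well-practiced ("automatic") thoughts
        t_pos = rng.expovariate(rate_pos)  # less-practiced ("effortful") thoughts
        if min(t_neg, t_pos) <= horizon:   # some thought content won by this time
            finished += 1
            wins += t_neg < t_pos
    return wins / finished

# The faster contender wins the majority of races at short and long horizons
# alike; for exponential racers the proportion sits near rate_neg/(rate_neg + rate_pos).
p_short = negative_win_rate(2.0, 1.0, horizon=0.5)
p_long = negative_win_rate(2.0, 1.0, horizon=5.0)
```

In the cited work, stress effects enter through parameters modulating such rates; differentiating the resulting winning probability with respect to the stress parameter then delivers the counterintuitive handicapping result described in the box. That derivation is not reproduced here.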
network,” “cellular automata,” and “computational informatics” modeling], and nonlinear dynamical systems modeling (“chaos-theoretic” modeling, in the popular vernacular). There is, of course, an arbitrary aspect to such divisions. Mathematical modeling can recruit computer computations (later), whereas nonlinear dynamical systems modeling entails differential equations, and so on. Like many systems of classification, the present one facilitates exposition, in this case of modeling activity within the domain of psychological clinical science. Points of contact and overlap among these divisions, as well as unique aspects, should become more apparent with the more detailed descriptions that follow. The respective types of formal modeling potentially provide psychopathology-significant information unique to their specific level of analysis (Marr, 1982). They also may inform each other, and provide across-level-analysis construct validity. For example, manipulation of connectionist-model algorithm parameters may be
guided by results from mathematical modeling. Connectionist-modeling results, in turn, may lend construct validity to mathematical model titration (earlier).

Before delineating subsets of formal modeling, a word is in order about so-called statistical modeling (see, e.g., Rodgers, 2010). Statistical modeling, such as structural-equation modeling, including confirmatory factor analysis, hierarchical linear modeling, mixture growth modeling, and taxometric analysis (mixture-model testing for staggered, or quasi-staggered latent distributions of clinical and nonclinical groups) supplies a platform for data organization, and inferences about its resulting structure. To be sure, parameters, such as path weights and factor loadings, are estimated using methods shared with formal models, as demarcated here. Contra the present emphasis, however, the format of proposed model structure (typically one of multivariate covariance), and computational methods are transcontent, generic, and do not germinate within the staked-out theoretical-content domain with its problem-specific depth of analysis. In the case of formal modeling, it might be said that measurement models and empirical-testing methods are part and parcel of process models of observed responses and data production (see also Box 2). Extended treatment of formal-model distinctives and assets in clinical science is available in alternate venues (e.g., Neufeld, 2007b; Shanahan, Townsend, & Neufeld, in press).
Forms of Modeling in Clinical Science

mathematical modeling
Clinical mathematical modeling is characterized by analytically derived accounts of cognitive-behavioral abnormalities of clinical disorders. In most instances, models are stochastic, meaning they provide for an intrinsic indeterminacy of the modeled phenomenon (not unlike Brownian motion being modeled by a Wiener process, in physics). Doob (1953) has described a stochastic model as a “. . . mathematical abstraction of an empirical process whose development is governed by probabilistic laws” (p. v). Roughly, built into the structure of stochastic models is a summary of nature’s perturbation of empirical values from one observation to the next. Predictions, therefore, by and large, are directed to properties of the distributions of observations, such as those of response latencies over cognitive-task trials. Model
expressions of clinical-sample deviations in distribution properties, therefore, come to the fore. Such properties may include summary statistics, such as distribution moments (notably means and intertrial variances), but can also include distribution features as detailed as the distribution’s probability density function (density function, for short; proportional to the relative frequency of process completion at a particular time since its commencement; see, e.g., Evans, Hastings & Peacock, 2000; Townsend & Ashby, 1983; Van Zandt, 2000).

Grasping the results of mathematical modeling’s analytical developments can be challenging, but it can be aided by computer computations. Where the properties of derivations are not apparent from inspection of their structures themselves, the formulae may be explored numerically. Doing so often invokes three-dimensional response surfaces. The predicted response output expressed by the formula, plotted on the Y axis, is examined as the formula’s model parameters are varied on the X and Z axes. In the earlier example, for instance, the probability of finalizing a stimulus-encoding process within a specified time t may be examined as parameters k’ and v are varied. Note in passing that expression of laws of nature typically is the purview of analytical mathematics (e.g., Newton’s law of gravity).

computational modeling (computer simulation)
Computational modeling expresses recurrent interactions (reciprocal influences) among activated and activating units that have been combined into a network architecture. The interactions are implemented through computer syntax (e.g., Farrell & Lewandowsky, 2010). Some of its proponents have advanced this form of modeling as a method uniquely addressing the computational capacity of the brain, and as such, have viewed the workings of a connectionist network as a brain metaphor.
Accordingly, the network units can stand for neuronal entities (e.g., multineuron modules, or “neurodes”), whose strengths of connection vary over the course of the network’s ultimate generation of targeted output values. Variation in network architecture (essentially paths and densities of interneurode connections), and/or connection activation, present themselves as potential expressions of cognitive abnormalities. Exposition of neuro-connectionist modeling in clinical science has been relatively extensive (e.g., Bianchi, Klein, Caviness, & Cash, 2012; Carter
& Neufeld, 2007; Hoffman & McGlashan, 2007; Phillips & Silverstein, 2003; Siegle & Hasselmo, 2002; Stein & Young, 1992). Typically directed to cognitive functioning, the connectionist-network approach recently has been extended to the study of intersymptom relations, providing a unique perspective, for example, on the issue of co-morbidity (Borsboom & Cramer, 2013; Cramer, Waldorp, van der Maas, & Borsboom, 2010).
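The flavor of such recurrent interaction can be conveyed with a deliberately tiny network: two mutually exciting units with logistic activation, where weakening the inter-unit connection stands in, very loosely, for a candidate connectivity abnormality. The architecture, weights, and update rule below are illustrative inventions, not a model drawn from the studies cited above:

```python
import math

def update(acts, weights, inputs):
    """One synchronous step: each unit passes its summed input through a logistic."""
    n = len(acts)
    return [1 / (1 + math.exp(-(inputs[i] +
                                sum(weights[i][j] * acts[j] for j in range(n)))))
            for i in range(n)]

def settle(coupling, inputs=(0.5, 0.5), steps=100):
    """Iterate two mutually excitatory 'neurodes' to (approximate) equilibrium."""
    acts = [0.0, 0.0]
    weights = [[0.0, coupling], [coupling, 0.0]]  # symmetric inter-unit connection
    for _ in range(steps):
        acts = update(acts, weights, inputs)
    return acts

strong = settle(coupling=1.5)  # intact connectivity: units sustain high activation
weak = settle(coupling=0.2)    # degraded connectivity: same input settles lower
```

Even in this toy case, varying a single connection strength changes the network's settled output for identical input, illustrating how architecture and activation parameters present themselves as potential expressions of cognitive abnormalities.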
nonlinear dynamical system (chaos-theoretic) modeling
This form of modeling again entails interconnected variables (“coupled system dimensions”), but in clinical science, by and large, these are drastically fewer in number than usually is the case with computational network modeling (hence, the common distinction between “high-dimensional networks” and “low-dimensional nonlinear networks”). Continuous interactions of the system variables over time are expressed in terms of differential equations. The latter stipulate the momentary change of each dimension at time t, as determined by the extant values of other system dimensions, potentially including that of the dimension whose momentary change is being specified (e.g., Fukano & Gunji, 2012). The nonlinear properties of the system can arise from the differential equations’ cross-product terms, conveying continuous interactions, or nonlinear functions of individual system dimensions, such as raising extant momentary status to a power other than 1.0. The status of system variables at time t is available via solution to the set of differential equations. Because of the nonlinearity-endowed complexity, solutions virtually always are carried out numerically, meaning computed cumulative infinitesimal changes are added to starting values for the time interval of interest. System dimensions are endowed with their substantive significance according to the theoretical content being modeled (e.g., dimensions of subjective fear, and physical symptoms, in dynamical modeling of panic disorder; Fukano & Gunji, 2012). A variation on the preceding description comprises the progression of changes in system-variable states over discrete trials (e.g., successive husband-wife interchanges; Gottman, Murray, Swanson, Tyson & Swanson, 2002). Such sequential transitions are implemented through trial-wise difference equations, which now take the place of differential equations.
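The numerical-solution step just described (adding computed infinitesimal changes to starting values) is forward-Euler integration. A sketch follows, using a made-up two-dimensional system loosely labeled in panic-disorder terms, fear F and physical symptoms S; the equations and coefficients are invented for illustration and are not the Fukano and Gunji (2012) model:

```python
def euler(deriv, state, dt=0.01, steps=5000):
    """Numerically solve a coupled ODE system by accumulating dt-sized changes."""
    for _ in range(steps):
        d = deriv(state)
        state = [s + dt * ds for s, ds in zip(state, d)]
    return state

def illustrative_system(state):
    F, S = state                            # F: subjective fear, S: physical symptoms
    dF = S - F                              # fear tracks the symptom level
    dS = 0.5 * F * (1 - S) - 0.3 * S + 0.2  # cross-product term F*(1 - S) is the nonlinearity
    return [dF, dS]

final_F, final_S = euler(illustrative_system, state=[0.0, 0.0])
# The trajectory climbs from (0, 0) and settles at a stable fixed point where
# both derivatives vanish (F = S, approximately 0.863 for these coefficients).
```

Storing the intermediate states rather than only the end point yields the model-predicted trajectory against which empirical time series can be compared, as discussed next.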
Model-predicted trajectories can be tested against trajectories of empirical observations, as obtained, say, through diary methods involving on-line data-gathering sites (e.g., Qualtrics; Bolger, Davis & Rafaeli, 2003). In addition, empirical data can be evaluated for the presence of dynamical signatures (“numerical diagnostics”) generated by system-equation-driven, computer-simulated time series. It is possible, moreover, to search empirical data for time-series numerical diagnostics that are general to system complexity of a nonlinear-dynamical-systems nature. This latter endeavour, however, has been criticized when undertaken without a precisely tendered model of the responsible system, ideally buttressed with other forms of modeling (notably mathematical modeling, earlier)—such a more informed undertaking as historically exemplified in physics (Wagenmakers, van der Maas & Farrell, 2012).

A branch of nonlinear dynamical systems theory, ‘catastrophe theory,’ has been implemented notably in the analysis of addiction relapse. For example, therapeutically significant dynamical aspects of aptitude-treatment intervention procedures (e.g., Dance & Neufeld, 1988) have been identified through a catastrophe-theory-based reanalysis of data from a large-scale multisite treatment-evaluation project (Witkiewitz, van der Maas, Hufford & Marlatt, 2007). Catastrophe theory offers established sets of differential equations (“canonical forms”) depicting sudden jumps in nonlinear dynamical system output (e.g., relapse behaviour) occurring with gradual changes in input variables (e.g., incremental stress). Nonlinear dynamical systems modeling—whether originating in the content domain, or essentially imported, as in the case of catastrophe theory—often may be the method of choice when it comes to macroscopic, molar clinical phenomena, such as psychotherapeutic interactions (cf. Molenaar, 2010).
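The cusp, the most commonly applied canonical form, illustrates how such sudden jumps arise. Equilibria of the cusp potential V(x) = x⁴/4 + ax²/2 + bx satisfy x³ + ax + b = 0; with the splitting factor a held negative, slowly sweeping the normal factor b eventually annihilates the currently occupied stable state, forcing a discontinuous jump. The mapping of x, a, and b onto relapse behaviour and stress below is purely illustrative:

```python
def settle(a, b, x0, lr=0.05, iters=5000):
    """Relax to a local minimum of V(x) = x^4/4 + a*x^2/2 + b*x by gradient
    descent; the gradient is the canonical cusp equilibrium form x^3 + a*x + b."""
    x = x0
    for _ in range(iters):
        x -= lr * (x ** 3 + a * x + b)
    return x

a = -2.0                    # splitting factor: two stable states coexist for small |b|
states, x = [], 1.5         # begin in the high (say, abstinent) state
for i in range(51):
    b = -1.0 + 0.05 * i     # gradual change in the normal factor (say, mounting stress)
    x = settle(a, b, x0=x)  # track the state the system currently occupies
    states.append(x)
largest_step = max(abs(states[i + 1] - states[i]) for i in range(50))
# b changes by only 0.05 per step, yet `largest_step` exceeds 2: the occupied
# minimum vanishes at a fold and the state drops abruptly to the remaining one.
```

Sweeping b back down would restore the high state only at a different b value (hysteresis), another signature catastrophe-theoretic prediction that lends itself to empirical test.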
Extended exposition of nonlinear dynamical systems model development, from a clinical-science perspective, including that of substantively significant patterns of dimension trajectories (“dynamical attractors”), has been presented in Gottman et al. (2002), Levy et al. (2012), and Neufeld (1999).
Model Parameter Estimation in Psychological Clinical Science

A parameter is “an arbitrary constant whose value affects the specific nature but not the properties of a mathematical expression” (Borowski & Borwein, 1989, p. 435). In modeling abnormalities in
clinical samples, typically it is the values of model parameters, rather than the model structure (the model’s mathematical organization) that are found to shift away from control values. The clinical significance of the shift depends on the meaning of the parameter(s) involved. Parameters are endowed with substantive significance according to their roles in the formal system in which they are positioned, and their mathematical properties displayed therein. For example, a parameter of a processing model may be deemed to express “task-performance competence.” Construct validity for this interpretation may be supported if the statistical moments of modeled performance-trial latencies (e.g., their mean or variance) are sent to the realm of infinity, barring a minimal value for the parameter. If this value is not met, a critical corpus of extremely long, or incomplete task trials ensues, incurring infinite response-latency moments in the formalized system. From a construct-validity standpoint, this type of effect on system behavior—severe performance impairment, or breakdown—is in keeping with the parameter’s ascribed meaning (Neufeld, 2007b; see also Pitt, Kim, Navarro & Myung, 2006). Added to this source of parameter information— analytical construct validity—is experimental construct validity. Here, estimated parameter values are selectively sensitive to experimental manipulations, diagnostic-group differences, or variation in psychometric measures on which they purportedly bear (e.g., additional constituent operations of a designated cognitive task, other demands being equal, resulting in elevation specifically in the parameter k’, earlier). In clinical science and assessment, values of model parameters to a large extent are estimated from empirical data. 
Such estimation differs from methods frequently used in the physical sciences, which more often have the luxury of direct measurement of parameter values (e.g., measuring parameters of liquid density and temperature, in modeling sedimentation in a petroleum-extraction tailings pond). Parameter estimation from the very empirical data being modeled—of course, with associated penalization in computations of empirical model fit—however, is only superficially suspect. As eloquently stated by Flanagan (1991), In physics there are many explanatory constructs, electrons for example, which cannot be measured independently of the observable situations in which they figure explanatorily. What vindicates the
explanatory use of such constructs is the fact that, given everything else we know about nature, electrons best explain the observable processes in a wide array of experimental tests, and lead to successful predictions (p. 380).
How, then, might parameter values be estimated, with an eye to possible constraints imposed in the clinical arena? Multiple methods of parameter estimation in clinical science variously have been used, depending on desired statistical properties (e.g., maximum likelihood, unbiasedness, Bayes; see, e.g., Evans, et al., 2000) and data-acquisition constraints. Note that selection of parameter-estimation methods is to be distinguished from methods of selecting from among competing models or competing model variants (for issues of model selection, see especially Wagenmakers & Vandekerckhove, this volume).

moment matching

One method of parameter estimation consists of moment matching. Moments of some stochastic distributions can be algebraically combined to provide a direct estimate. For example, the mean of the Erlang distribution of performance-trial latencies, expressed in terms of its parameters k′ and v, earlier, is k′/v (e.g., Evans, et al., 2000). The intertrial variance is k′/v². From these terms, an estimate of v is available as the empirical mean divided by the empirical variance, and an estimate of k′ is available as the mean, squared, divided by the variance.

maximum likelihood

Maximum-likelihood parameter estimation means initially writing a function expressing the likelihood of obtained data, given the data-generation model. The maximum-likelihood estimate is the value of the model parameter that would make the observed data maximally likely, given that model. Maximum-likelihood estimates can be obtained analytically, by differentiating the written likelihood function with respect to the parameter in question, setting it to 0, and solving (the second derivative being negative). For example, the maximum-likelihood estimate of v in the Erlang distribution, earlier, is

\[ \hat{v} = \frac{Nk'}{\sum_{i=1}^{N} t_i}, \]
where the size of the empirical sample of latency values \(t_i\) is N. With multiple parameters, such as v
and k′, the multiple derivatives are set to 0, followed by solving simultaneously. As model complexity increases, with respect to the number of parameters and possibly model structure, numerical solutions may be necessary. Solutions now are found by computationally searching for the likelihood-function maximum, while varying constituent parameters iteratively and reiteratively. Such search algorithms are available through R, through the MATLAB OPTIMIZATION TOOLBOX, computer-algebra programs, such as Waterloo MAPLE, and elsewhere. As with other methods that rely exclusively on the immediate data, stability of estimates rests in good part on extensiveness of data acquisition. Extensive data acquisition, say, on a cognitive task, possibly amounting to hundreds or even thousands of trials, obtained over multiple sessions, may be prohibitive when it comes to distressed patients. For these reasons, Bayesian estimates may be preferred (later).

moment fitting and related procedures

Another straightforward method of parameter estimation is that of maximizing the conformity of model-predicted moments (typically means and intertrial standard deviations, considering instability of higher-order empirical moments; Ratcliff, 1979), across performance conditions and groups. An unweighted least-squares solution minimizes the sum of squared deviations of model predictions from the empirical moments. Although analytical solutions are available in principle (deriving minima using differential calculus, similar to deriving maxima in the case of maximum-likelihood solutions), use of a numerical search algorithm most often is the case. Elaborations on unweighted least-squares solutions include minimization of Pearson χ², where data are response-category frequencies fr_i:

\[ \min \sum_{i=1}^{p} \frac{(fr_{i,\text{observed}} - fr_{i,\text{model-predicted}})^2}{fr_{i,\text{model-predicted}}}, \]

where there are p categories.
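As a numerical sketch of the moment-matching and closed-form maximum-likelihood estimators described above for the Erlang example (the data here are simulated rather than clinical, and the parameter values are invented for illustration):

```python
import random
import statistics

random.seed(7)

# Simulate Erlang(k' = 3, v = 2.0) latencies: an Erlang variate is the sum
# of k' independent exponential stage times, each with rate v.
k_true, v_true, N = 3, 2.0, 5000
data = [sum(random.expovariate(v_true) for _ in range(k_true)) for _ in range(N)]

# Moment matching: mean = k'/v and variance = k'/v^2, so
#   v_hat  = mean / variance,   k'_hat = mean^2 / variance.
m = statistics.fmean(data)
s2 = statistics.variance(data)
v_mm = m / s2
k_mm = m * m / s2

# Maximum likelihood for v with k' known: v_hat = N * k' / sum(t_i).
v_ml = N * k_true / sum(data)

print(round(v_mm, 2), round(k_mm, 2), round(v_ml, 2))
```

Both estimators recover values close to the generating parameters here because the simulated sample is large; with the sparse samples typical of clinical settings, the estimates would be far noisier, which is the motivation for the Bayesian alternatives discussed later.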
Note that frequencies can include those of performance-trial latencies falling within successive time intervals (“bins”). Where the chief data are other than categorical frequencies, as in the case of moments, parameter values may be estimated by constructing and minimizing a pseudo-χ² function. Here, the theoretical and observed frequencies are replaced with the observed and theoretical moments (Townsend, 1984; Townsend & Ashby, 1983, chap. 13). As with the Pearson χ², the squared differences between the
model-predicted and observed values are weighted by the inverse of the model-predicted values. It is important to take account of logistical constraints when it comes to parameter estimation and other aspects of formal modeling in the clinical setting. In this setting, where available data can be sparse, the use of moments may be necessary for stability of estimation. Although coarse, as compared to other distribution properties, moments encompass the entire distribution of values, and can effectively attenuate estimate-destabilizing noise (cf. Neufeld & Gardner, 1990). Observe that, along with likelihood maximization, the present procedures are an exercise in function optimization. In the case of least squares and variants thereon, the exercise is one of function minimization. In fact, often an unweighted least-squares solution also is the maximum likelihood. Similarly, where likelihood functions enter χ² or approximate χ² computations (likelihood-ratio G²), the minimum χ² is also maximum likelihood. Evaluation of the adequacy of parameter estimation amid constraints that are intrinsic to psychological clinical science and assessment ultimately rests with empirical tests of model performance. Performance obviously will suffer with inaccuracy of parameter estimation.

bayesian parameter estimation

Recall that Bayes’s theorem states that the probability of an estimated entity A, given entity-related evidence B (posterior probability of A), is the pre-evidence probability of A (its prior probability) times the conditional probability of B, given A (likelihood), divided by the unconditional probability of B (normalizing factor):

\[ \Pr(A|B) = \frac{\Pr(A)\Pr(B|A)}{\Pr(B)}. \tag{1} \]

As applied to parameter estimation, A becomes the candidate parameter value θ, and B becomes data D theoretically produced by the stochastic model in which θ participates. Recognizing θ as continuous, Eq. (1) becomes
\[ g(\theta|D) = \frac{f(\theta)\Pr(D|\theta)}{\int_{-\infty}^{+\infty} f(\theta)\Pr(D|\theta)\,d\theta}, \tag{2} \]
where g and f denote density functions, over and against discrete-value probabilities Pr. The data D may be frequencies of response categories, such as correct versus incorrect item recognition. Note that the data D as well may be continuous, as in the case of measured process latencies. If
so, the density functions again replace discrete-value probabilities. For an Erlang distribution, for instance, the density function for a latency datum \(t_i\) is

\[ \frac{(v t_i)^{k'-1}}{(k'-1)!}\, v e^{-v t_i}. \]

Allowing \(k'\) to be fixed, for the present purposes of illustration (e.g., directly measured, or set to 1.0, as with the exponential distribution), and allowing θ to stand for v, then for a sample of N independent values, Pr(D|θ) in Eq. (2) becomes the joint conditional density function of the N \(t_i\) values, given v and k′,

\[ \prod_{i=1}^{N} \frac{(v t_i)^{k'-1}}{(k'-1)!}\, v e^{-v t_i}. \]
The posterior density function of θ [e.g., Eq. (2)] can be computed for all candidate values of θ, the tendered estimate then being the mean of this distribution (the statistical property of this estimate is termed “Bayes”; see especially Kruschke & Vanpaemel, this volume). If an individual participant or client is the sole source of D, the Bayes estimate of θ de facto has been individualized accordingly. Bayesian parameter estimation potentially endows clinical science and practice with demonstrably important advantages. Allowance for individual differences in parameter values can be built into the Bayesian architecture, specifically in terms of the prior distribution of performance-model parameters (Batchelder, 1998). In doing so, the architecture also handles the issue of overdispersion in performance data, meaning greater variability than would occur were parameter values fixed for all participants (Batchelder & Riefer, 2007). Selecting a prior distribution of θ depends in part on the nature of θ. Included are strong priors, such as those from the Generalized Gamma family (Evans, et al., 2000), where 0 ≤ θ; the Beta distribution, where 0 ≤ θ ≤ 1.0; and the normal or Gaussian distribution. Included as well are gentle (neutral) priors, notably the uniform distribution, whose positive height spans the range of possible non-0 values of θ (see also Berger, 1985, for Jeffreys’ uninformative, and other prior distributions). Grounds for prior-distribution candidacy also include “conjugation”
with the performance process model. The practical significance of distributions being conjugate essentially is that the resulting posterior distribution becomes more mathematically tractable, allowing a closed-form solution for its probability density function. Defensible strong priors (defensible on theoretical grounds, and those of empirical model fit) can add to the arsenal of clinically significant information, in and of themselves. To illustrate, the Bayesian prior distribution of θ values has its own governing parameters (hyperparameters). Hyperparameters, in turn, can be substantively significant within the addressed content domain (e.g., the taskwise competence parameter, described at the beginning of this section; or another expressing psychological-stress effects on processing speed). Because of the information provided by a strong Bayesian prior, the estimate of θ can be more precise and stable in the face of a smaller data set D than would be necessary when the estimate falls entirely to the data at hand (see, e.g., Batchelder, 1998). This variability-reducing influence on parameter estimation is known as Bayesian shrinkage (e.g., O’Hagan & Forster, 2004). Bayesian shrinkage can be especially valuable in the clinical setting, where it may be unreasonable to expect distressed participants or patients to offer up more than a modest reliable specimen of task performance. Integrating the performance sample with the prior-endowed information is analogous to referring a modest blood sample to the larger body of hematological knowledge in the biomedical assay setting. Diagnostic richness of the proffered specimen is exploited because its composition is subjected to the preexisting body of information that is brought to bear. 
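A minimal numeric sketch of Eq. (2), conjugation, and Bayesian shrinkage, assuming the exponential special case (k′ = 1) and a conjugate Gamma prior on v (the hyperparameter values and sample size below are arbitrary illustrations, not values from the text):

```python
import random

random.seed(1)

# Small "clinical" sample: N = 8 latencies from an exponential with rate v = 2.0,
# standing in for a distressed participant who supplies only a few trials.
v_true = 2.0
data = [random.expovariate(v_true) for _ in range(8)]

# Conjugate Gamma(alpha, beta) prior on v: with an exponential likelihood the
# posterior is Gamma(alpha + N, beta + sum(t_i)) in closed form (conjugation),
# and the Bayes estimate is the posterior mean.
alpha, beta = 4.0, 2.0          # strong prior, centered on alpha/beta = 2.0
N, T = len(data), sum(data)

post_mean = (alpha + N) / (beta + T)   # Bayes estimate of v
mle = N / T                            # estimate from the data alone
prior_mean = alpha / beta

# Shrinkage: the posterior mean always lies between the prior mean and the MLE,
# pulling a noisy small-sample estimate toward the prior-endowed information.
print(prior_mean, round(mle, 3), round(post_mean, 3))
```

The posterior mean here is a compromise between the prior mean and the maximum-likelihood estimate, which is the "shrinkage" property that makes small performance samples from distressed participants more usable.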
Other potentially compelling Bayesian-endowed advantages to clinical science and assessment are described in the section, Special Considerations Applying to Mathematical and Computational Modeling in Psychological Clinical Science and Assessment, later. (See also Box 2 for a unique method of assembling posterior density functions of parameter values to ascertain the probability of parameter differences between task-performance conditions.)
Illustrative Examples of Contributions of Mathematical Psychology to Clinical Science and Assessment

Information conveyed by rigorous quantitative models of clinically significant cognitive-behavioral
systems is illustrated in the following examples. Focus is on uniquely disclosed aspects of system functioning. Results from generic data-theory empirical analyses, such as group by performance-conditions statistical interactions, are elucidated in terms of their modeled performance-process underpinnings. In addition to illuminating results from conventional analyses, quantitative modeling can uncover group differences in otherwise conflated psychopathology-germane functions (e.g., Chechile, 2007; Riefer, Knapp, Batchelder, Bamber & Manifold, 2002). Theoretically unifying seemingly dissociated empirical findings through rigorous dissection of response configurations represents a further contribution of formal modeling to clinical science and assessment (White, Ratcliff, Vasey & McKoon, 2010a). Moreover, definitive assessment of prominent conjectures on pathocognition is availed through experimental paradigms exquisitely meshing with key conjecture elements (Johnson, Blaha, Houpt & Townsend, 2010). Such paradigms emanate from models addressing fundamentals of cognition, and carry the authority of theorem-proof continuity, and closed-form predictions (Townsend & Nozawa, 1995; see also Townsend & Wenger, 2004a). At the same time, measures in common clinical use have not been left behind (e.g., Fridberg, Queller, Ahn, Kim, Bishara & Busemeyer, 2010; Bishara, Kruschke, Stout, Bechara, McCabe & Busemeyer, 2010; Yechiam, Veinott, Busemeyer, & Stout, 2007). Mathematical modeling effectively has quantified cognitive processes at the root of performance on measures such as the Wisconsin Card Sorting Test, the Iowa Gambling Task, and the Go/No-Go task (taken up under Cognitive Modeling of Routinely Used Measures in Clinical Science and Assessment, later). Further, formal models of clinical-group cognition can effectively inform findings from clinical neuroimaging.
Events of focal interest in “event-related imaging” are neither the within- nor between-trial transitions of physical stimuli embedded in administered cognitive paradigms but, rather, the covert mental processes to which such transitions give rise. Modeled stochastic trajectories of the symptom-significant component processes that transact cognitive performance trials can stipulate intratrial epochs of special neuroimaging interest. The latter can complement brain regions of interest, together facilitating the calibration of space-time measurement coordinates in neuroimaging studies.
Multinomial Processing Tree Modeling of Memory and Related Processes; Unveiling and Elucidating Deviations Among Clinical Samples

Possibly the most widely used mathematical modeling in clinical psychology is multinomial processing tree modeling (MPTM). Essentially, MPTM models the production of categorical responses, such as recall or recognition of previously studied items, or judgments of items about their earlier source of presentation (e.g., auditory versus visual, thereby potentially bearing on the nature of hallucinations; Batchelder & Riefer, 1990; 1999). Responses in such categories are modeled as having emanated from a sequence of stages, or processing operations. For example, successful recall of a pair of semantically linked items, such as “computer, Internet,” entails storage and retrieval of the duo, retrieval itself necessarily being dependent on initial storage. The tree in MPTM consists of branches emanating from nodes; processes branching from nodes proceed from other processes, on which the branching process is conditional (as in the case of retrieving a stored item). Each process in each successive branch has a probability of successful occurrence. The probability of a response in a specific category (e.g., accurate retrieval of an item pair) having taken place through a specific train of events, is the product of the probabilities of those events (e.g., probability of storage times the conditional probability of retrieval, given storage). Parameters conveying the event probabilities are viewed as “capacities of the associated processes.” Criteria for success of constituent processes implicitly are strong, in that execution of a process that took place earlier in the branching is a sufficient precondition for the process being fed; the probability of failure of the subsequent process falls to that process itself. In this way, MPTM isolates functioning of the individual response-producing operations, and ascertains deficits accordingly.
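The storage-retrieval logic just described can be made concrete with a small sketch in the spirit of the pair-clustering tree of Batchelder and Riefer (1990). The parameter values, simulated frequencies, and coarse grid-search fit below are invented for illustration, not taken from any study:

```python
import math
from itertools import product

def category_probs(c, r, u):
    """Pair-clustering-style MPT: probabilities of the four recall categories
    for a studied word pair, given cluster-storage probability c, conditional
    cluster-retrieval probability r, and singleton-recall probability u."""
    return [
        c * r,                                 # pair stored as a cluster and retrieved together
        (1 - c) * u * u,                       # not clustered; both members recalled singly
        (1 - c) * 2 * u * (1 - u),             # not clustered; exactly one member recalled
        c * (1 - r) + (1 - c) * (1 - u) ** 2,  # stored but unretrieved, or neither recalled
    ]

# "Observed" frequencies generated from hypothetical true parameter values.
true_params = (0.6, 0.7, 0.4)
n_pairs = 10_000
freqs = [p * n_pairs for p in category_probs(*true_params)]

# Coarse grid-search maximum likelihood (multinomial log-likelihood, up to a constant).
grid = [i / 20 for i in range(1, 20)]
best = max(
    product(grid, grid, grid),
    key=lambda prm: sum(f * math.log(p) for f, p in zip(freqs, category_probs(*prm))),
)
print(best)
```

Note how each category probability is a product of branch probabilities, exactly as in the prose above; the grid search recovers the generating values because the three parameters are identifiable from the four response categories.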
The model structure of MPTM is conceptually tractable, thanks to its straightforward processing tree diagrams. It has, however, a strong analytical foundation, and indeed has spawned innovative methods of parameter estimation (Chechile, 1998), and notably rigorous integration with statistical science (labeled “cognitive psychometrics”; e.g., Batchelder, 1998; Riefer et al., 2002). Computer software advances have accompanied MPTM’s analytical developments (see Moshagen, 2010, for current renderings). Exposition of MPTM
measurement technology has been substantial (e.g., Batchelder & Riefer, 1990; Chechile, 2004; Riefer & Batchelder, 1988), including that tailored to clinical-science audiences (Batchelder, 1998; Batchelder & Riefer, 2007; Chechile, 2007; Riefer et al., 2002; Smith & Batchelder, 2010). Issues in clinical cognition potentially form natural connections with the parameters of MPTM. Just as the categories of response addressable with MPTM are considerable, so are the parameterized constructs it accommodates. In addition to essential processes of memory, perception, and learning, estimates are available for the effects of guessing, and for degrees of participants’ confidence in responding. Moreover, MPTM has been extended to predictions of response latency (Hu, 2001; Schweickert, 1985; Schweickert & Han, 2012). Riefer et al. (2002) applied MPTM in two experiments to decipher the nature of recall performance among schizophrenia and brain-damaged alcoholic participants. In each experiment, the clinical group and controls (nonpsychotic patients, and non–organic-brain-syndrome alcoholics) received six trials of presentation and study of semantically related item pairs (earlier), each study period being followed by recall of items in any order. In both experiments, the research design comprised a “correlational experiment” (Maher, 1970). A correlational experiment consists of the diagnostic groups under study performing under multiple conditions of theoretical interest—a prominent layout in psychological clinical science (e.g., Yang, Tadin, Glasser, Hong & Park, 2013). Initial analyses of variance (ANOVA) were conducted on sheer proportions of items recalled.
In each instance, significant main effects of groups and study-recall trials were obtained; a statistically significant trials-by-groups interaction, the test of particular interest, was reported only in the case of the brain-damaged alcoholic participants and their controls (despite liberal degrees of freedom for within-subjects effects). As noted by Riefer et al. (2002), this generic empirical analysis betrayed its own shortcomings for tapping potentially critical group differences in faculties subserving recall performance. Indeed, Riefer et al. simply but forcefully showed how reliance on typical statistical treatments of data from correlational experiments can generate demonstrably misleading inferences about group differences in experimentally addressed processes of clinical and other interest.
Riefer et al.’s theory-disciplined measures precisely teased apart storage and recall-retrieval processes, but went further in prescribing explicitly theory-driven significance tests on group differences (see also Link, 1982; Link & Day, 1992; Townsend, 1984). Pursuant to the superficially parallel group performance changes across trials, schizophrenia participants failed to match the controls in improvement of storage efficiency specifically over the last 3 trials of the experiment. Moreover, analysis of a model parameter distinguishing the rate of improvement in storage accuracy, as set against its first-trial “baseline,” revealed greater improvement among the schizophrenia participants during trials 2 and 3, but a decline relative to controls during the last 3 trials. In other words, this aspect of recall-task performance arguably was decidedly spared by the disorder, notably during the initial portions of task engagement. Analysis of the model parameter distinguishing rate of improvement in retrieval, as set against its first-trial baseline, now indicated a significantly slower rate among the schizophrenia participants throughout. The interplay of these component operations evidently was lost in the conventional analysis of proportion of items recalled. It goes without saying that precisely profiling disorder-spared and affected aspects of functioning, as exemplified here, can inform the navigation of therapeutic intervention strategies. It also can round out the “functional” picture brought to bear on possible functional neuroimaging measurement obtained during recall task performance. Applying the substantively derived measurement model in the study on alcoholics with organicity yielded potentially important processing specifics buried in the nevertheless now-significant groups-by-trials ANOVA interaction.
As in the case of schizophrenia, the brain-damaged alcoholic participants failed to approximate the controls in improvement of storage operations, specifically over the last 3 trials of the experiment. Also, significant deficits in retrieval again were observed throughout. Further, the rate of improvement in retrieval, relative to the trial-1 baseline, more or less stalled over trials. In contrast to retrieval operations, the rate of improvement in storage among the sample of alcoholics with organicity kept up with that of controls—evidently a disorder-spared aspect of task execution. The mathematically derived performance-assessment MPTM methodology demonstrably evinced substantial informational added value, in terms of clinically significant measurement and
explanation. Multiple dissociation in spared and affected elements of performance was observed within diagnostic groups, moreover with further dissociation of these patterns across groups. Estimates of model parameters in these studies were accompanied by estimated variability therein (group and condition-wise standard deviations). Inferences additionally were strengthened with flanking studies supporting construct validity of parameter interpretation, according to selective sensitivity to parameter-targeted experimental manipulations. In addition, validity of parameter-estimation methodology was attested to through large-scale simulations, which included provision for possible individual differences in parameter values (implemented according to “hierarchical mixture structures”). In like fashion, Chechile (2007) applied MPTM to expose disorder-affected memory processes associated with developmental dyslexia. Three groups were formed according to psychometrically identified poor, average, and above-average reading performance. Presented items consisted of 16 sets of 6 words, some of which were phonologically similar (e.g., blue, zoo), semantically similar (bad, mean), orthographically similar (slap, pals), or dissimilar (e.g., stars, race). For each set of items, 6 pairs of cards formed a 2 × 6 array. The top row consisted of the words to be studied, and the second row was used for testing. A digit-repetition task intervened between study and test phase, controlling for rehearsal of the studied materials. Testing included that of word-item recall or word position, in the top row of the array. For recall trials, a card in the second row was turned over revealing a blank side, and the participant was asked what word was in the position just above. For recognition trials, the face-down side was exposed to reveal a word, with the participant questioned about whether that word was in the corresponding position in the top row.
In some instances, the exposed word was in the corresponding position, and in others it was in a different position. A 6-parameter MPTM model was applied to response data arranged into 10 categories. The parameters included two storage parameters, one directed to identification of a previous presentation, and one reserved for the arguably deeper storage required to detect foils (words in a different position); a recall-retrieval parameter; two guessing parameters, and a response-confidence parameter. As in the case of Riefer et al. (2002), parameterized memory-process deviations proprietary to model implementation were identified.
Apropos of the present purposes, a highlight among other noteworthy group differences occurred as follows. Compared to above-average readers, performance data of poor readers produced lower values for the previous-presentation storage parameter, but higher values for recall retrieval. This dissociation of process strength and deficit was specific to the orthographically similar items. These inferences, moreover, were supported with a decidedly model-principled method of deriving the probability of two groups differing in their distributions of parameter values (see Box 2). The reciprocal effects of stronger retrieval and weaker storage on the poor readers’ performance with the orthographically similar items evinced a nonsignificant difference from the above-average readers on raw recall scores (p > .10). Without the problem-customized measurement model, not only would group differences in memory functions have gone undetected, but the nature of these differences would have remained hidden. Again, a pattern of strength and weakness, and the particular conditions to which the pattern applied (encountering orthographically similar memory items), were powerfully exposed. In this report as well, findings were fortified with model-validating collateral studies. Selective sensitivity to parameter-targeting experimental manipulations lent construct validity to parameter interpretation. In addition, the validity of the model structure hosting the respective parameters was supported with preliminary estimation of coherence between model properties and empirical data [Pr(coherence); Chechile, 2004]. Parameter recovery as well was ascertained through simulations for the adopted sample size. 
Furthermore, parameter estimation in this study employed an innovation developed by Chechile, called “Population Parameter Mapping” (detailed in Chechile, 1998; 2004; 2007).1

unification of disparate findings on threat-sensitivity among anxiety-prone individuals through a common-process model

Random-walk models (Cox & Miller, 1965) are stochastic mathematical models that have a rich history in cognitive theory (e.g., Busemeyer & Townsend, 1993; Link & Heath, 1975; Ratcliff, 1978; see Ratcliff & Smith, this volume). Diffusion modeling (Ratcliff, 1978) presents itself as another form of modeling shown to be of extraordinary value to clinical science. This mathematical method
allows the dissection of decisional performance into component processes that act together to generate choice responses and their latencies (Ratcliff, 1978). Application of diffusion modeling has been used to advantage in simplifying explanation, and in unifying observations on sensitivity to threat-valenced stimulation among anxiety-prone individuals (White, Ratcliff, Vasey & McKoon, 2010a; for diffusion-model software developments, see Wagenmakers, van der Maas, Dolan & Grasman, 2008). Increased engagement of threatening stimulus content (e.g., words such as punishment or accident) among higher anxiety-prone (HA) as compared to lower anxiety-prone (LA) individuals has been demonstrated across multiple paradigms (e.g., the Stroop and dichotic listening tasks; and the dot-probe task, where for HA individuals, detection of the probe is disproportionately increased with its proximity to threatening versus neutral items in a visual array). Consistency of findings of significant HA-LA group differences, however, by and large has depended on presentation of the threat items in the company of nonthreat items. Such differences break down when items are presented singly. Specifically, the critical HA-LA by threat versus nonthreat item interaction (group-item second-order difference, or two-way interaction) has tended to be statistically significant specifically when the two types of stimuli have occurred together. This pattern of findings has generated varyingly complex conjectures about responsible agents. The conjectures have emphasized processing competition between the two types of stimuli, and associated cognitive-control operations. Group differences have been attributed, for example, to differential tagging of the threat items, or to HA participant threat-item disengagement deficit (reviewed in White, et al., 2010a).
The difficulty in obtaining significant second-order differences with presentation of singletons has led some investigators to question the importance, or even existence, of heightened threat-stimulus sensitivity as such among HA individuals. Others have developed neuro-connectionist computational models (e.g., Williams & Oaksford, 1992), and models comprising neuro-connectionist computational-analytical amalgams (Frewen, Dozois, Joanisse & Neufeld, 2008), expressly stipulating elevated threat sensitivity among HA individuals.2 If valid, such sensitivity stands to ramify into grosser clinical symptomatology (Neufeld & Broga, 1981).
Greater threat-stimulus sensitivity defensibly exists among HA individuals; but for reasons that are relatively straightforward, such sensitivity may be more apparent in the company of nonthreat items, as follows: The cognitive system brought to bear on the processing task stands to be one of limited capacity (see Houpt & Townsend, 2012; Townsend & Ashby, 1983; Wenger & Townsend, 2000). When items are presented together, a parallel processing structure arguably is in place for both HA and LA participants (Neufeld & McCarty, 1994; Neufeld, et al., 2007). With less prepotent salience of the threat item for the LA participants, more attentional capacity potentially is drawn off by the nonthreat item, attenuating their difference in processing latency between the two items. A larger interitem difference would occur for the HA participants, assuming their greater resistance to the erosion of processing capacity away from the threat item (see White, et al., 2010a, p. 674). This proposition lends itself to the following simple numerical illustration. We invoke an independent parallel, limited-capacity processing system (IPLC), and exponentially distributed item-completion times (Townsend & Ashby, 1983). Its technical specifics aside, the operation of this system makes for inferences about the present issue that are easy to appreciate. The resources of such a system illustratively are expressed as a value of 10 arbitrary units (essentially, the rate per unit time at which task elements are transacted) for both HA and LA participants. In the case of a solo presentation, a threat item fully engages the system resources of an HA participant, and 90% thereof in the case of an LA participant. The solo presentation of the nonthreat item engages 50% of the system’s resources for both participants.
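These capacity allocations translate directly into mean-latency predictions; the second-order differences traced in the remainder of the illustration can be reproduced in a few lines (a sketch in which `mean_latency` and the percentage splits simply encode the illustrative figures above):

```python
def mean_latency(rate):
    # With exponentially distributed completion times, mean item latency is
    # the reciprocal of the processing rate (capacity share) the item receives.
    return 1.0 / rate

CAP = 10.0  # total system capacity, in the text's arbitrary rate units

# Solo presentation: the threat item gets 100% (HA) vs. 90% (LA) of capacity;
# the nonthreat item gets 50% for both groups.
solo = (mean_latency(1.0 * CAP) - mean_latency(0.9 * CAP)) \
     - (mean_latency(0.5 * CAP) - mean_latency(0.5 * CAP))

# Simultaneous presentation: the HA participant retains 80% on the threat item
# (20% on the nonthreat item); the LA participant divides capacity evenly.
simultaneous = (mean_latency(0.8 * CAP) - mean_latency(0.5 * CAP)) \
             - (mean_latency(0.2 * CAP) - mean_latency(0.5 * CAP))

print(round(solo, 4), round(simultaneous, 4))   # -0.0111 -0.375
```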
By the IPLC model, the second-order difference in mean latency then is (1/10 − 1/9) − 0 = −0.0111 (that is, latency varies inversely as capacity, expressed as a processing-rate parameter). Moving to the simultaneous-item condition, 80% of system-processing resources hypothetically are retained by the threat item in the case of the HA participant, but are evenly divided in the case of the LA participant. The second-order difference now is (1/8 − 1/5) − (1/2 − 1/5) = −0.375. Statistical power obviously will be greater for an increased statistical effect size accompanying such a larger difference.3 It should be possible, nevertheless, to detect the smaller effect size for the single-item condition, with a refined measure of processing. On that note, the larger second-order difference in raw
response times attending the simultaneous-item condition (earlier) itself may be attenuated due to the conflation of processing time with collateral cognitive activities, such as item encoding and response organization and execution. On balance, a measure denuded of such collateral processes may elevate the statistical effect size of the solo-presentation second-order difference at least to that of the paired-presentation raw reaction time. Such a refined measure of processing speed per se was endowed by the diffusion model as applied by White, et al. (2010a) to a lexical decision task (yes-no about whether presented letters form a word). Teased apart were speed of central decisional activities (diffusion-model drift rate), response style (covert accumulation of evidence pending a decision), bias in favor of stating the presence of an actual word, and encoding (initial preparation and transformation of raw stimulation). Analysis was directed to values for these parameters, as well as to those of raw latency and accuracy. In three independent studies, analysis of drift rates consistently yielded significant group-by-item-type second-order differences, whereas analysis of raw latency and accuracy rates consistently fell short. The significant second-order difference also was parameter-selective, being restricted to drift-rate values, even when manipulations were conducive to drawing out possible response-propensity differences.4 Here too, findings were buttressed with supporting analyses. Included was construct-validity-augmenting selective parameter sensitivity to parameter-directed experimental manipulations. A further asset accompanying use of this model is its demonstrable parametric economy, in the following way: parameter values have been shown to be uncorrelated, attesting to their conveyance of independent information. Information also is fully salvaged inasmuch as both correct and incorrect response times are analyzed (see also Link, 1982).
Parameter estimates were accompanied by calculations of their variability (standard deviations and ranges), for the current conditions of estimation. Diagnostic efficiency statistics (sensitivity, specificity, and positive and negative predictive power) were used to round out description of group separation on the drift-rate parameter, as well as on raw data values, employing optimal cut-off scores for predicted classification. In each instance, the drift-rate parameter decidedly outperformed the latency mean and median, as well as raw accuracy. These results were endorsed according to signal-detection
analysis, where the “signal” was the presence of higher anxiety proneness. Altogether, the previously described developments make a strong case for model-delivered parsimony. Seemingly enigmatic and discordant findings are shown to cohere, as products of a common underlying process.
measurement technology emanating from theoretical first principles: assessing fundamentals of cognition in autism spectrum disorders
As averred by Meehl (1978; see quotation at the outset of this chapter), in longer-established disciplines measurement technology has emerged from formal theory itself [writ large in the currently prominent Higgs-boson-directed Large Hadron Collider; Close (2011); see McFall & Townsend (1998) for a still-current update of Meehl’s appraisal of measurement methods in clinical science]. Systems Factorial Technology (SFT; Townsend & Nozawa, 1995; see also Townsend & Wenger, 2004a, and Chapter 4, this volume) comprises such a development in cognitive science, and has been used to notable advantage in clinical cognitive science (Johnson, et al., 2010; Neufeld, et al., 2007; Townsend, Fific, & Neufeld, 2007). Identifiability of fundamentals of cognition has been disclosed by a series of elegant theorem-proof continuities addressed to temporal properties of information processing (see Townsend & Nozawa, 1995 for details; see also Townsend & Altieri, 2012 for recent extensions incorporating the dual response properties of latency and accuracy). The axiomatic statements from which the proofs emanate, moreover, ensure that results are general, when it comes to candidate distributions of processing durations; continuity of underlying population distributions is assumed, but results transcend particular parametric expressions thereof (e.g., exponential, Weibull, etc.; see e.g., Evans, et al., 2000).
The distribution-general feature is particularly important because it makes for robustness across various research settings, something especially to be welcomed in the field of clinical science. Elements of cognitive functioning exposed by SFT include: (a) the architecture, or structure, of the information-processing system; (b) the system’s cognitive workload capacity; (c) selected characteristics of system control; and (d) independence versus interdependence of constituent cognitive operations carried out by system components.
Architecture pertains to whether the system is designed to handle task constituents concurrently (in parallel channels) or successively, in a serial fashion (e.g., encoding curves, lines, and intersections of alphanumeric characters, simultaneously or sequentially). Within the parallel division, moreover, alternate versions can be entertained. The channels can function as segregated units, with the products of their processing remaining distinct from one another in task completion (regular parallel architecture). Alternately, the channels can act as tributaries to a common conduit that receives and conveys the sum of their contributions, dispatching the collective toward task finalization (co-active parallel architecture). Cognitive workload capacity is estimated in SFT through an index related to work and energy in physics (Townsend & Wenger, 2004b). The index registers the potential of the system to undertake cognitive transactions per unit time [analogous to the rate of dispatching Shannon-Weaver (1949) bits of information]. An important aspect of system control entails cessation of processing upon sufficiency for informed responding, over and against extracriterial continuation (operative stopping rules). Furthermore, independence versus interdependence of system components refers to absence versus presence of either mutual facilitation or cross-impedance of system channels devoted to discrete task constituents (e.g., channels handling separate alphanumeric items or possibly item features). Significantly, SFT mathematically disentangles these key elements of cognition. For example, cognitive-workload capacity is isolated from stopping rules and system architecture. Such elements are conflated in macroscopic speed and/or accuracy, whose relative resistance to increased task load (e.g., added items of processing; or concomitant secondary- versus single-task requirements) typically is taken to indicate system capacity (see Neufeld, et al., 2007).
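As a sketch of how the workload-capacity index is computed in practice: for an OR (first-terminating) task, the capacity coefficient compares the cumulative hazard of the double-target condition with the sum of the single-target cumulative hazards, C(t) = H_AB(t) / [H_A(t) + H_B(t)] (Townsend & Wenger, 2004b). The toy response-time data below are simulated for illustration only, not drawn from any study discussed here.

```python
import numpy as np

def cum_hazard(rts, t):
    """Empirical cumulative hazard H(t) = -log S(t) from a sample of RTs."""
    s = np.mean(np.asarray(rts) > t)   # empirical survivor function S(t)
    return -np.log(s) if s > 0 else np.inf

def capacity_or(rt_double, rt_single_a, rt_single_b, t):
    """Workload-capacity coefficient for an OR (first-terminating) task:
    C(t) = H_AB(t) / (H_A(t) + H_B(t)). C(t) = 1 indicates unlimited
    capacity, C(t) < 1 limited capacity, C(t) > 1 super capacity."""
    return cum_hazard(rt_double, t) / (cum_hazard(rt_single_a, t)
                                       + cum_hazard(rt_single_b, t))

# Toy data: two independent exponential channels (rate 2 each); an
# unlimited-capacity parallel race predicts C(t) close to 1.
rng = np.random.default_rng(0)
a = rng.exponential(0.5, 20000)
b = rng.exponential(0.5, 20000)
double = np.minimum(rng.exponential(0.5, 20000), rng.exponential(0.5, 20000))
print(capacity_or(double, a, b, t=0.4))  # close to 1.0
```

The distribution-general character of the index is visible here: only empirical survivor functions enter the computation, with no parametric form assumed.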
Disproportionate change in such behavioral data may occur, however, for reasons other than limitation in system workload capacity. Uneconomical stopping rules may be at work, such as exhaustive processing (task constituents on all system channels are finalized), when self-terminating processing will suffice (informed responding requires completion of only one, or a subset, of task constituents). It also is possible that healthy participants’ seemingly greater workload capacity actually is attributable to a more efficient architecture (e.g., the presence of co-active parallel processing).
This quantitatively disciplined measurement infrastructure takes on increased significance for clinical cognitive science, when it is realized that certain highly prominent constructs therein align with cognitive elements measured by SFT. Especially noteworthy in the study of schizophrenia, for example, is the construct of cognitive capacity (see, e.g., Neufeld, et al., 2007). In addition, system-control stopping rules impinge on so-called executive function, a construct cutting across the study of multiple disorders. Implicated by cognitive-control stopping rules are cognitive-resource conservation, and the robustness of selective inhibition. It should not go unnoticed that SFT fundamentals of cognition also are at the heart of the “automatic-controlled processing” construct.2 This construct arguably trumps all others in frequency of usage in clinical cognitive science. In identifying variants of the cognitive elements enumerated earlier, the stringent mathematical developments of SFT are meshed with relatively straightforward experimental manipulations, illustrated as follows. In the study of processing mechanisms in autism spectrum disorder (ASD), Johnson et al. (2010) instantiated SFT as “double factorial technology” (Townsend & Nozawa, 1995). A designated visual target consisted of a figure of a right-pointing arrow in a visual display. Manipulations included the presence or absence of such a figure. The target figure could be present in the form of constituent items of the visual array being arranged into a pattern forming a right-pointing arrow (global target), the items themselves consisting of right-pointing arrows (local target), or both (double target). This manipulation is incorporated into quantitative indexes discerning the nature of system workload capacity.
The specific target implementation appropriated by Johnson et al. is ideally suited to the assessment of processing deviation in ASD, because prominent hypotheses about ASD cognitive performance hold that more detailed (read “local”) processing is favored. An additional mathematical-theory-driven manipulation entails target salience in the double-target condition. The right-pointing item arrangement can be of high or low salience, as can the right-pointing items making up the arrangement, altogether resulting in four factorial combinations. The combinations, in lockstep with SFT’s mathematical treatment of associated processing-latency distributions, complement the capacity analysis given earlier by discerning competing system architectures,
stopping rules, and in(ter)dependence of processing channels handling the individual targets. A microanalysis of task-performance latency distributions (errors being homogeneously low for both Johnson et al.’s ASD and control participants) was undertaken via the lens of systems-factorial assessment technology.5 Mathematically authorized signatures of double-target facilitation, over and against single-target facilitation of processing (“redundancy gain”), were in evidence for ASD and control participants alike. This aspect of processing evidently was spared with the occurrence of ASD. Contra prominent hypotheses, which were described earlier, all ASD participants displayed a speed advantage for global-target processing over local-target processing. In contrast, 4 of the controls exhibited a local-target advantage or approximate equality of target speed. On balance, the verdict from quantitatively disciplined diagnostics was that this property of performance was directly opposite to that predicted by major hypotheses about ASD cognitive functioning. At minimum, a global-target processing advantage was preserved within this ASD sample. Less prominent in the literature have been conjectures about cognitive control in ASD. However, exhaustive target processing was detected as potentially operative among 5, and definitively operative for 2, of the sample of 10 ASD participants (one case being inconclusive). In contrast, for a minority of the 11 controls—4 in number—exhaustive processing was either possibly or definitively operative. The analysis, therefore, revealed that postcriterial continuation of target processing (with possible implications for preservation of processing resources, and the processing apparatus’s inhibition mechanism) may be disorder affected. System workload capacity, chronometrically measured in its own right, nevertheless was at least that of controls—an additional component of evidently spared functioning.
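The architecture-discerning side of the factorial salience manipulation can be illustrated with the mean interaction contrast (MIC) over the four double-target salience combinations (Townsend & Nozawa, 1995). The condition means below are hypothetical, and the diagnostic pattern is stated only in broad strokes; the full SFT treatment also examines the survivor-function interaction contrast across time.

```python
def mic(m_ll, m_lh, m_hl, m_hh):
    """Mean interaction contrast over the 2x2 salience factorial
    (L = low salience, H = high salience; arguments are mean RTs for the
    four double-target conditions, first factor then second)."""
    return (m_ll - m_lh) - (m_hl - m_hh)

# Hypothetical condition means (ms). Broadly, over-additivity (MIC > 0)
# is a signature of parallel first-terminating or coactive processing,
# MIC = 0 of serial processing, and MIC < 0 of parallel-exhaustive
# processing.
print(mic(620, 540, 560, 510))  # 30 -> over-additive
```

Because the contrast is computed from condition means alone, it requires only the factorial salience design already described, with no parametric distributional commitments.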
Observed violations of selective influence of the target-salience manipulations, notably among the control participants, indicated the presence of interactions across target-processing channels. The violations impelled the construction of special performance-accommodating theoretical architectures. Certain candidate structures thus were mandated by the present clinical-science samples. The upshot is an example where clinical cognitive science reciprocates to nonclinical cognitive science, in this case by possibly hastening the uncovering
of potentially important structures in human cognition.
cognitive modeling of routinely used measures in clinical science and assessment
Measurement in clinical science and assessment frequently has been aimed at the important cognitive-behavioral domain of decision and choice. Examples include the assembling of physical objects based on a judged organizing principle, executing risky gambles, and withholding versus emitting a response to a presenting cue. These decision-choice scenarios are instantiated in the Wisconsin Card Sorting Test (WCST; Berg, 1948), which targets frontal lobe “executive function”; the Iowa Gambling Task (Bechara, Damasio, Damasio & Anderson, 1994), which is directed to decisions potentially abetted by accompanying affect; and the Go/No-Go Discrimination Task (see, e.g., Hoaken, Shaughnessy & Pihl, 2003), which is thought to engage inhibitory aspects of cognitive control. Deficits in decisional operations are poised to be ecologically consequential, when it comes to social, occupational, and self-maintenance activities [see Neufeld & Broga, 1981, for a quantitative portrayal of “critical” (versus “differential”) deficit, a concept recently relabelled “functional deficit”; e.g., Green, Horan & Sugar, 2013]. The Expectancy Valence Learning Model (EVL; Busemeyer & Myung, 1992; Busemeyer & Stout, 2002; Yechiam, Veinott, Busemeyer, & Stout, 2007; see also Bishara et al., 2010, and Fridberg et al., 2010, for related sequential learning models) is a stochastic dynamic model (see Busemeyer & Townsend, 1993) that supplies a formal platform for interpreting performance on such measurement tasks. The model expresses the dynamics of decisional behaviors in terms of the progression of expected values accrued by task alternatives, as governed by the record of outcomes rendered by choice-responses to date.
Dynamic changes in alternative-expectations are specified by the model structure, in which are embedded the psychological forces—model parameters—operative in generating selection likelihoods at the level of constituent selection trials. Parameters of the EVL model deliver notable psychometric “added value” when it comes to the interpretation and clinical-assessment utility of task-measure data.
Box 2 A Novel Statistical Test for Model-Parameter Differences
As indicated in the text, mathematical modeling can prescribe its own measures, experimentation, and tests. Sometimes, proposed tests can transcend the specific domain from which they emerge. This is the case for a statistical test for inequality of model properties between groups, devised by Chechile (2007; 2010). Considerations begin with a “horse race” model of cognitive processes occurring in parallel (Townsend & Ashby, 1983, e.g., p. 249). At any time point t′ since the start of the race, the probability density of the first process of the pair winning is its density function f1(t′) times the probability of the second process remaining incomplete, S2(t′), or f1(t′)S2(t′). Integrating this expression from t′ = 0 to t′ = t gives the probability of the first completion being earlier than the second, as evaluated to time t. Integrating across the entire range of values (t′ = 0 to t′ = ∞) gives the unconditional probability of the first process having a shorter time than the second. Chechile has adapted this reasoning to the construction of a statistical test expressing the probability that a model-parameter value under one condition of cognitive-behavioral performance is less than its value under a comparison condition, Pr(θ1 < θ2). The method uses Bayesian posterior distributions of θ1 and θ2 (with qualifications described and illustrated empirically in Chechile, 2007). Slicing the range of θ into small segments, the proportion of the distribution of θ1 within a particular segment, times the proportion of the θ2 distribution beyond that segment, gives the probability of θ1 lying within the segment and of θ2 lying beyond, analogous to f1(t′) and S2(t′). Summing the products over the entire range of θ (if θ1 and θ2 themselves are probabilities, then θ ranges from 0 to 1.0) directly gives the unconditional probability of θ1 being lower than θ2.
Such a probability of .95, for instance, corresponds to a 95% level of confidence that θ1 < θ2.
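With posterior samples in hand, the summation Chechile describes reduces to the proportion of sample pairs in which θ1 falls below θ2. A minimal sketch follows; the Beta posteriors are hypothetical stand-ins for posteriors obtained from real performance data.

```python
import numpy as np

def prob_less(post1, post2):
    """Pr(theta_1 < theta_2) from Bayesian posterior samples, in the spirit
    of Chechile's (2007, 2010) horse-race construction: the chance of
    theta_1 falling in a small segment, times the chance of theta_2 lying
    beyond it, accumulated over the whole range."""
    p1 = np.asarray(post1)[:, None]   # column vector of theta_1 samples
    p2 = np.asarray(post2)[None, :]   # row vector of theta_2 samples
    return np.mean(p1 < p2)           # proportion of pairs with theta_1 < theta_2

# Toy posteriors (e.g., Beta posteriors for two accuracy parameters).
rng = np.random.default_rng(1)
theta1 = rng.beta(30, 70, 5000)   # centred near .30
theta2 = rng.beta(45, 55, 5000)   # centred near .45
print(prob_less(theta1, theta2))  # well above .9 for these posteriors
```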
Model parameters tap motivational, learning, and response domains of decision and choice. The first ascertains relative sensitivity to positive versus negative outcomes of selected alternatives (e.g., payoff versus loss to a card-selection gamble). A second performance parameter encapsulates the comparative strength of more versus less recent choice outcomes in influencing current responses. And the third parameter conveys the degree to which responding is governed by accumulated information, as opposed to momentary, ephemeral influences (informational grounding, versus impulsivity). The model, therefore, contextualizes the dynamics of choice responding in terms of rigorously estimated, psychologically meaningful constructs. The EVL model, and its closely aligned sequential learning derivatives (e.g., Ahn, Busemeyer, Wagenmakers & Stout, 2008), have successfully deciphered sources of choice-selection abnormalities in several clinical contexts. Studied groups have included those with Huntington’s and Parkinson’s disease (Busemeyer & Stout, 2002), bipolar disorder (Yechiam, Hayden, Bodkins, O’Donnell & Hetrick, 2008), various forms of substance abuse (Bishara, et al., 2010; Fridberg, et al., 2010; Yechiam, et al., 2007), and autism spectrum disorders (Yechiam, Arshavsky, Shamay-Tsoory, Yanov & Aharon, 2010). Attestation to the value of EVL analyses, among other forms, has been that of multiple dissociation of parameterized abnormalities across the studied groups. For example, among Huntington’s disease individuals, the influence of recent outcome experiences in IGT selections has ascended over that of more remote episodes; and responding has been less consistent with extant information as transduced into alternative-outcome modeled expectations (Busemeyer & Stout, 2002; Yechiam, Veinott, Busemeyer & Stout, 2007).
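The three parameters can be sketched in the model's commonly specified form (Busemeyer & Stout, 2002): a loss-attention parameter w forms each outcome's valence, a recency parameter phi updates the chosen alternative's expectancy, and a consistency parameter c governs a trial-dependent softmax choice rule. The two-deck payoff scheme below is hypothetical, for illustration only.

```python
import math
import random

def evl_choice_probs(E, t, c):
    """Softmax ('ratio-of-strengths') choice rule with trial-dependent
    consistency theta(t) = (t/10)**c, as in Busemeyer & Stout (2002)."""
    theta = (t / 10.0) ** c
    z = [math.exp(theta * e) for e in E]
    s = sum(z)
    return [x / s for x in z]

def evl_update(E, chosen, win, loss, w, phi):
    """One EVL trial: the valence weights gains versus losses (w), and the
    recency parameter phi governs how strongly the latest outcome revises
    the chosen alternative's expectancy."""
    v = (1 - w) * win - w * loss
    E[chosen] += phi * (v - E[chosen])
    return E

# Hypothetical two-deck illustration of the expectancy dynamics.
random.seed(2)
E = [0.0, 0.0]                 # initial expectancies
w, phi, c = 0.3, 0.2, 1.0      # loss attention, recency, consistency
for t in range(1, 51):
    p = evl_choice_probs(E, t, c)
    chosen = 0 if random.random() < p[0] else 1
    win = 100 if chosen == 0 else 50            # deck payoffs (hypothetical)
    loss = 250 if (chosen == 0 and random.random() < 0.5) else 0
    E = evl_update(E, chosen, win, loss, w, phi)
print(E)
```

The simulation makes plain how the three constructs named above are separately identifiable: w shifts the valences, phi the speed of expectancy revision, and c the determinism of responding.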
Judgments of both stimulant- and alcohol-dependent individuals, on the other hand, have tracked negative feedback to WCST selections to a lesser degree than have those of controls. Although similarly less affected by negative feedback, stimulant-dependent individuals have been more sensitive than alcohol-dependent individuals to positive outcomes attending their selection responses (Bishara et al., 2010). Through the psychological significance of their constituent parameters, sequential learning models have endowed the routinely used measures, described earlier, with incremental content validity and construct representation (Embretson, 1983).
Conventional measures have been understood with regard to their dynamical-process underpinnings. For instance, WCST performance (e.g., perseverative errors, whereby card sorting has not transitioned to a newly imposed organizing principle) has been simulated by modeled sequential learning (Bishara, et al., 2010). Errors of commission on the go–no-go task have been associated with elevated attention to reward outcomes, and errors of omission have been associated with greater attention to punishing outcomes (Yechiam et al., 2006). Model parameters also have lent incremental nomothetic span (Embretson, 1983) to routinely used measures. Specifically, in addition to producing theoretically consistent diagnostic-group correlates (e.g., elevated sensitivity to IGT-selection rewards, among cocaine users; Yechiam et al., 2007), parameters have been differentially linked to relevant multi-item psychometric measures (Yechiam, et al., 2008). Moreover, model-based measures have displayed diagnostic efficacy incremental to conventional measures. Individual differences in model parameters have added to the prediction of bipolar disorder, over and above that of cognitive-functioning and personality/temperament inventories (Yechiam et al., 2008). In addition to its explanatory value for conventionally measured WCST performance, model parameters comprehensively have captured conventional measures’ diagnostic sensitivity to substance abuse, but conventional measures have failed to capture that of model parameters (Bishara et al., 2010). In addition to the credentials enumerated, the performance of the EVL model and its affiliates has survived the crucible of competition against rival model architectures (e.g., Yechiam et al., 2007).
Applied versions of EVL and closely related sequential learning models also have been vindicated against internal variations on the ultimately applied versions (e.g., attempts at either reduced or increased parameterization; Bishara, et al., 2010; Yechiam, et al., 2007).
formal modeling of pathocognition, and functional neuroimaging (the case of stimulus-encoding elongation in schizophrenia)
Mathematical and computational modeling of cognitive psychopathology can provide vital information when it comes to the functional component
of clinical functional neuroimaging (functional magnetic resonance imaging, magnetic resonance spectroscopy, electroencephalography, and electromagnetoencephalography). Cognitive-process unfolding, as mathematically mandated by a tenable process model, can speak to the why, where, when, and how of neurocircuitry estimation. In so doing, it can provide an authoritative antidote to the vexatious problem of reverse inference, whereby the functions whose neurocircuitry purportedly is being charted are inferred from the monitored neurocircuitry itself (e.g., Green, et al., 2013; see Poldrack, 2011). So-called event-related neuroimaging seeks to track sites of neuro-activation, and intersite coactivation, aligned with transitions taking place in experimental cognitive-performance paradigms (e.g., occurrence of a probe item, for ascertainment of its presence among a set of previously memorized items, in a memory-search task; or presentation of a visual array, for ascertainment of target presence, in a visual-search task). Events of actual interest, however, are the covert cognitive operations activated by one’s experimental manipulations. Estimating the time course of events that are cognitive per se arguably necessitates a tenable mathematical model of their stochastic dynamics. Stipulation of such temporal properties seems all the more important when a covert process of principal interest defensibly is accompanied by collateral processes within a cognitive-task trial (e.g., encoding the probe item for purposes of memory-set comparison, along with response operations, earlier). Motivation for model construction is fueled further if the designated intratrial process has symptom and ecological significance. The upshot is that mathematical modeling of cognitive abnormality is uniquely poised to supply information as to the why, where, and when of clinical and other neuroimaging.
It arguably also speaks to the how of vascular- and electro-neurophysiological signal processing, in terms of selection from among the array of signal-analysis options. Clinical importance of neurocircuitry measures can be increased through their analytical interlacing with clinical symptomatology. Among individuals with schizophrenia, for example, elongated duration of encoding (cognitively preparing and transforming) presenting stimulation into a format that facilitates collateral processes (e.g., those taking place in so-called working memory) can disproportionately jeopardize the intake of contextual cues that are vital to anchoring other
input in its objective significance (i.e., “context deficit”; e.g., Dobson & Neufeld, 1982; George & Neufeld, 1985). This combination potentially contributes to schizophrenia thought-content disorder (delusions and thematic hallucinations; Neufeld, 1991; 2007c). Formally prescribed cognitive psychopathology thus can help broker symptom significance to monitored neurocircuitry. Monitored neurocircuitry likewise can be endowed with increased ecological significance. Intact encoding, for example, may be critical to negotiating environmental stressors and demands, especially when coping is cognition intensive (e.g., where threat minimization is contingent on predictive judgments surrounding coping alternatives; e.g., Neufeld, 2007b; Shanahan & Neufeld, 2010). Impairment of this faculty therefore may be part and parcel of stress susceptibility (Norman & Malla, 1994). Compromised stress resolution in turn may further impair encoding efficiency, and so on (Neufeld, et al., 2010). Turning to the where of neuro-imaging measurement, targeted cognitive functions in principle are used to narrow down neuro-anatomical regions of measurement interest, according to locations on which the functions are thought to supervene. Precision of function specification constrained by formal modeling in principle should reduce ambiguity about what cognitively is being charted neurophysiologically. Explicit quantitative stipulation of functions of interest should sharpen imaging’s functionally guided spatial zones of brain exploration. Formal modeling of cognitive-task performance seems virtually indispensable to carving out intratrial times of neuro-(co)activation measurement interest. Stochastic dynamic models specify probabilistic times of constituent-process occurrence. For instance, encoding a probe item for its comparison to memory-held items would have a certain time trajectory following the probe’s presentation.
With encoding as the singled-out process, imaging signals identified with a trial’s model-estimated epoch, during which the ascendance of encoding was highly probable, would command special attention. A tenable stochastic model of trial performance can convey the stochastic dynamics of a designated symptom- and ecologically significant process, necessary for a reasonable estimate of its corresponding within-trial neuro-circuitry (additional details are available in Neufeld, 2007c, the mathematical and neuro-imaging specifics of which are illustrated in Neufeld, et al., 2010).
A delimited intratrial measurement epoch stipulated by a stochastic-model-specified time trajectory commands considerable temporal resolution of processed imaging signals. Among analysis options, covariance of activation between measurement sites, across time (“seed-voxel, time series covariance”) has been shown empirically and mathematically to have the temporal resolution necessary to monitor neurocircuitry during the designated epoch (Neufeld, 2012; see also Friston, Fletcher, Josephs, Holmes & Rugg, 1998). In this way, formal modeling of cognitive performance speaks to the how of imaging-signal processing by selecting out or creating neuro-imaging-signal analysis options with requisite temporal resolution. Note that success at ferreting out target-process neurocircuitry is supported by the selective occurrence of differential neuro-(co)activation to the modeled presence of the triaged process in the clinical sample, and conversely, its nonoccurrence otherwise (also known as “multiple dissociation”). Model-informed times of measurement interest stand to complement tendered regions of interest, in navigating space-time coordinates of neuroimaging measurement. Moreover, symptom- and ecologically significant functions remain in the company of collateral functions involved in task-trial transaction (e.g., item encoding occurs alongside the memory-scanning and response processes, for which it exists). Keeping the assembly of trial-performance operations intact, while temporally throwing into relief the target process, augurs well for preserving the integrity of the target process as it functions in situ. Doing so arguably increases the ecological validity of findings, over and against experimentally deconstructing the performance apparatus by separating out constituent processes, thereby risking distortion of their system-level operation.
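The epoch-restricted seed-voxel analysis can be sketched as follows; the signals, their coupling, and the epoch boundaries are simulated for illustration only.

```python
import numpy as np

def epoch_coactivation(seed, target, epoch):
    """Seed-voxel, time-series covariance restricted to a model-estimated
    epoch: correlate two regions' signals only over the scans during which
    the target process (e.g., encoding) is most probably ascendant."""
    i, j = epoch
    return np.corrcoef(seed[i:j], target[i:j])[0, 1]

# Toy signals: the two regions are coupled only during a hypothetical
# encoding epoch (scans 20-40), mimicking epoch-limited coactivation.
rng = np.random.default_rng(3)
seed = rng.normal(size=100)
target = rng.normal(size=100)
target[20:40] = seed[20:40] + 0.3 * rng.normal(size=20)  # epoch-limited coupling
print(epoch_coactivation(seed, target, (20, 40)))  # strong within-epoch correlation
```

Restricting the correlation to the model-estimated epoch is what gives the analysis its temporal selectivity; the same pair of signals correlated over the whole run would dilute the coupling.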
Special Considerations Applying to Mathematical and Computational Modeling in Psychological Clinical Science and Assessment Clinical mathematical and computational modeling is a relatively recent development in the history of modeling activity (see, e.g., Townsend, 2008). It goes without saying that opportunities unlocked by clinical implementations are accompanied by intriguing challenges. These take the form of upholding rigor of application amid exigencies imposed by clinical constraints, along
with faithfully engaging major substantive issues found in clinical science and assessment.
Methodological Challenges
One of the first challenges to clinical formal modeling consists of obtaining sufficient data for stability of model-property estimation. This challenge is part and parcel of the field’s longstanding concern with ensuring the applicability of inferences to a given individual or, at least, to a homogeneous class of individuals to which the one at hand belongs (Davidson & Costello, 1969; Neufeld, 1977). Mainline mathematical modeling too has been concerned with this issue. The solution by and large has been to model performance based on a large repertoire of data obtained from a given participant, tested over the course of a substantial number of multitrial sessions. Clinical exigencies (e.g., participant availability or limited endurance), however, may proscribe such extensiveness of data accumulation; the trade-off between participants and trial number may have to be tipped toward more participants and fewer trials.6 The combination of fewer performance trials but more participants requires built-in checks for relative homogeneity of individual data protocols, to ensure that the modeled aggregate is representative of its respective contributors. Collapsing data across participants within a testing session, then, risks folding systematic performance differences into their average, with a resultant centroid that is unrepresentative of any of its parts (see Estes, 1956). Note in passing that similar problems could attend participant-specific aggregation across experimental sessions, should systematic intersession performance drift or reconfiguration occur. Methods have been put forth for detecting systematic heterogeneity (Batchelder & Riefer, 2007; Carter & Neufeld, 1999; Neufeld & McCarty, 1994), or for ascertaining its possibility to be inconsequential to the integrity of modeling results (Riefer, Knapp, Batchelder, Bamber & Manifold, 2002).
If heterogeneity is present, methods of disentangling, and separately modeling, the conflated clusters of homogeneous data have been suggested (e.g., Carter, Neufeld, & Benn, 1998). Systematic individual differences in model properties also can be accommodated by building them into the model architecture itself. Mixture models implement heterogeneity in model operations across individuals and/or performance trials. For example, model parameters may differ across
mathematical and computational
359
individual participants within diagnostic groups, the parameters in principle forming their own random distribution. As parameter values now are randomly distributed across individuals (hyperdistribution), a parameter value’s probability (discrete-value parameters) or probability density (continuous-value parameters) takes on the role of its Bayesian prior probability (density) for the current population of individuals. Combined with a sample of empirical task performance, on whose distribution the parameter bears (base distribution), an individualized estimate of the parameter is available through its Bayesian posterior parameter distribution (see Bayesian Parameter Estimation, earlier). A mixture-model approach can be especially appealing in cases of overdispersion, as indicated, say, by a relatively large coefficient of variation (standard deviation of individual mean performance values divided by their grand mean; see overdispersion under Bayesian Parameter Estimation, earlier).7 Also availed through mixture models are customized (again, Bayesian posterior) performance distributions. If the latter apply to process latencies, for example, they can be consulted to estimate strategic times of neuro-imaging measurement within subgroups of parametrically homogeneous participants (detailed in Neufeld, et al., 2010, Section 7). Apropos of the limitations indigenous to the clinical enterprise, mixture models potentially moderate demands on participants when it comes to their supplying a task-performance sample. A relatively modest sample tenably is all that is required, because of the stabilizing influence on participant-specific estimates, as bestowed by the mixing distribution’s prior-held information. This obviously is a boon in the case of clinical modeling, considering the possible practical constraints in obtaining a stable performance sample from already distressed individuals. Mixture models also open up related avenues of empirical model testing.
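The Bayesian shrinkage such a mixing distribution affords can be sketched minimally, assuming a Beta hyperdistribution over an accuracy parameter and a binomial base distribution; the hyperparameters and the participant's sample are hypothetical.

```python
def posterior_accuracy(k, n, a, b):
    """Individualized Bayesian estimate of an accuracy parameter under a
    Beta(a, b) mixing (hyper)distribution and a binomial base distribution:
    the posterior is Beta(a + k, b + n - k), with mean (a + k)/(a + b + n)."""
    return (a + k) / (a + b + n)

# Group-level prior centred at .80 (hypothetical hyperparameters a=16, b=4);
# a modest sample of n=10 trials from one participant, k=5 correct.
prior_mean = 16 / 20                       # 0.80
raw_estimate = 5 / 10                      # 0.50
shrunk = posterior_accuracy(5, 10, 16, 4)  # 0.70: pulled toward the prior
print(prior_mean, raw_estimate, shrunk)
```

The individualized estimate lands between the participant's raw proportion and the group-level prior mean, which is precisely the stabilizing influence on participant-specific estimates described above.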
For purposes of assessing model validity, predictions of individuals' performance, mediated by modest performance samples, can be applied to separately obtained individual validation samples. In this way, model verisimilitude can be monitored, including the model's functioning at the level of individual participants (for details exposited with empirical data, see Neufeld et al., 2010). Because of the stabilizing influence of the Bayesian prior distribution of base-distribution properties, this strategy provides a tack to the thorny issue of small-N model testing (see Bayesian shrinkage, under Bayesian Parameter Estimation, earlier). An additional asset conferred by a mixture-model platform bridges methodological and clinical substantive issues, as follows. Emerging from this model design is a statistical- and cognitive-behavioral-science-principled measurement infrastructure for dynamically monitoring individual treatment response. The method exploits the distinct prior distributions of base-model properties obtained from varyingly symptomatic and healthy groups. Now Bayes' theorem allows the profiling of an individual at hand, with respect to the latter's current probability of belonging to group g, g = 1, 2, . . . , G (G being the number of referent groups), given obtained performance sample {*}: Pr(g|{*}). Such a profile can be updated with the progression of treatment. Moreover, closely related procedures can be applied to the assessment of treatment regimens. In this case, treatment efficacy is monitored by charting the degree to which the treated sample at large is being edged toward, or possibly away from, healthier functioning (computational specifics are illustrated with empirical data in Neufeld, 2007b, and Neufeld et al., 2010).
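The group-profiling computation Pr(g|{*}) can be sketched directly from Bayes' theorem. The priors and likelihoods below are hypothetical placeholders; in practice each likelihood would come from the corresponding group's base-distribution model evaluated at the individual's performance sample {*}:

```python
# Sketch of profiling an individual's group membership Pr(g | {*}),
# given prior group probabilities and per-group likelihoods of the
# individual's performance sample. All numbers are hypothetical.

def group_posteriors(priors, likelihoods):
    """Bayes' theorem: Pr(g | {*}) proportional to Pr(g) * Pr({*} | g)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    norm = sum(joint)  # normalizing factor: overall probability of the data
    return [j / norm for j in joint]

# G = 3 referent groups (e.g., two symptomatic groups and a healthy group).
priors = [0.3, 0.3, 0.4]          # Pr(g): prior group-membership probabilities
likelihoods = [0.02, 0.05, 0.01]  # Pr({*} | g): each group's base-model likelihood

profile = group_posteriors(priors, likelihoods)
print([round(p, 2) for p in profile])  # [0.24, 0.6, 0.16]
```

Re-running this computation on successive performance samples, as treatment progresses, yields the updated profile described in the text.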
Clinical substantive issues

A substantive issue interlaced with measure-theoretic considerations is the so-called differential-deficit, psychometric-artefact problem (Chapman & Chapman, 1973). Abnormalities that are more pronounced than others may have special etiological importance, in part because of associated neuro-physiological substrates. False inferences of differential deficit, however, are risked because the relative amounts by which examined faculties are disorder-affected are conflated with the psychometric properties of the instruments used to measure them. This issue retains much currency in the contemporary clinical-science literature (e.g., Gold & Dickinson, 2013). Frailties of recommended solutions, consisting of intermeasure calibration toward equality on classical measurement properties (reliability and observed-score variance), have been noted almost since the recommendations' inception (e.g., Neufeld, 1984a; Neufeld & Broga, 1981; with augmenting technical reports, Neufeld & Broga, 1977; Neufeld, 1984b). Note that the arguments countering the original classical-measurement recommendations have been demonstrated to extend from diagnostic-group to
continuous-variable designs (as used, e.g., by Kang & MacDonald, 2010; see Neufeld & Gardner, 1990). It can be shown that transducing classical psychometric partitioning of variance into formally modeled sources not only lends a model-based substantive interpretation to the partitioned variance, but also renders the psychometric-artefact issue (which continues to plague reliance on off-the-shelf and subquantitative measures, often clinically rather than contemporary-cognitive-science contrived) essentially obsolete (Neufeld, 2007b; see also McFall & Townsend, 1998; Neufeld, Vollick, Carter, Boksman, & Jetté, 2002; Silverstein, 2008). Partitioned sources of variance in task performance now are specified according to a governing formal performance model, which incorporates an analytical stipulation of its disorder-affected and spared features (see Figure 16.1, and related prose, in the Introduction of this chapter). These sources of variance include classical measurement-error variance (within-participant variance); within-group interparticipant variance; and intergroup variance. Another major substantive issue encountered in clinical science concerns the integration of so-called cold and hot cognition. Cold cognition, in this context, pertains more or less to the mechanisms of information processing, and hot cognition pertains to their informational product, notably with respect to its semantic and affective properties—roughly, how is the thought delivered (cold cognition), and what does the delivery consist of (hot cognition)? Deviations in the processing apparatus, leading to interpretations of one's environment cum responses that bear directly on symptomatology, hold special clinical significance. The synthesis of cold and hot cognition perhaps is epitomized in the work on risk of sexual coercion, and eating and other disorders, by Teresa Treat, Richard Viken, Richard McFall, and their prominent cognitive-scientist collaborators. Examples include Treat et al.
(2002), and Treat, Viken, Kruschke, & McFall (2010; see also Treat & Viken, 2010). These investigators have shown how basic perceptual processes (perceptual organization) can ramify into other clinically cogent cognitive-behavioral operations, including those of classification in interpersonal perception, and memory. This program of research exemplifies integrative, translational, and collaborative psychological science. It has been rigorously eclectic in its recruitment
of modeling methodology. Included have been multidimensional scaling; classification paradigms; perceptual-independence and covariation paradigms; and memory paradigms—all work that brooks no compromise on state-of-the-art mathematical and computational developments. Deviations in cognitive mapping, classification, and memory retrieval, surrounding presented items bearing on clinically important problems comprising eating disorders and risk of sexual aggression, have been productively studied. A further point of contact between formal process modeling and substantive clinical issues involves relations between the former and multi-item psychometric inventories. Empirical associations between model properties and multi-item measures reciprocally contribute to each other's nomothetic-span construct validity. Correlations with process-model properties furthermore lend construct-representation construct validity to psychometric measures. Psychometric measures, in turn, can provide useful estimates of model properties with which they sufficiently correlate, especially if their economy of administration exceeds that of direct individual-level estimates of the model properties (Carter, Neufeld, & Benn, 1998; Wallsten, Pleskac, & Lejuez, 2005). Furthermore, as a liaison with multi-item psychometric inventories, responding to a test item can be viewed as an exercise in dynamical information processing. As such, modeling option selection via item-response theory can be augmented with modeling item-response time, the latter as the product of a formally portrayed dynamic stochastic process (e.g., Neufeld, 1998; van der Maas, Molenaar, Maris, Kievit, & Borsboom, 2011; certain direct parallels between multi-item inventories and stochastic process models have been developed in Neufeld, 2007b).
Note that a vexing contaminant of scores on multi-item inventories is the influence of social desirability (SD), that is, the pervasive inclination to respond to item options in a socially desirable direction. Indeed, a prime mover of contemporary methods of test construction, Douglas N. Jackson, has stated that SD is the g factor of personality and clinical psychometric measures (e.g., Helmes, 2000). Interestingly, it has been shown that formal theory and its measurement models can circumvent the unwanted influence of SD, because of the socially neutral composition of the measures involved (Koritzky & Yechiam, 2010).
Conclusion

Mathematical and computational modeling arguably is essential to accelerating progress in clinical science and assessment. Indeed, rather than its being relegated to the realm of an esoteric enterprise or quaint curiosity, a strong case can be made for the central role of quantitative modeling in lifting the cognitive side of clinical cognitive neuroscience out of current quagmires and avoiding future ones. Clinical formal modeling augurs for progress in the field, owing to the explicitness of theorizing to which the modeling endeavor is constrained. Returns on research investment are elevated: rigor of expression exposes empirical shortcomings of extant formulations; blind alleys are expressly tagged as such, thanks to the definiteness of derived formulas, so that wasteful retracing is avoided; and needed future directions often are indicated, because of the forced exactness of one's quantitative formulations. Otherwise unavailable or intractable information is released. Moreover, existing data can be exploited by freshly viewing and analyzing them through methods opened up by a valid mathematical and computational infrastructure (Novotney, 2009). Formal developments also stand to enjoy an elongated half-life of contribution to the discipline. Staying power of the value of rigorous and substantively important achievements has been seen in mathematical modeling generally. For instance, early fundamental work on memory dynamics (e.g., Atkinson, Brelsford, & Shiffrin, 1967) recently has been found useful in the study of memory in schizophrenia (Brown et al., 2007). A similar durability stands to occur for clinical mathematical psychology, within the field of clinical science itself, and beyond.
Certain suspected agents of disturbance may resist study through strictly experimental manipulation of independent variables (e.g., non-model-informed correlational experiments; see the earlier sections Multinomial Processing Tree Modeling of Memory and Related Processes, and Unveiling and Elucidating Deviations in Clinical Samples). Such may be the case on ethical grounds, or because the theoretically tendered agent may elude experimental induction (e.g., organismically endogenous stress activation suspected of generating cognition-detracting intrusive associations and/or diminished cognitive-workload capacity). Tenability of such conjectures nevertheless may be tested by examining improvement in model fit to data from afflicted individuals, through quantitatively inserting the otherwise inaccessible constructs into the model composition. Current challenges can be expected to spawn vigorous research activity in several future directions. Among others, these include: (a) capitalizing on modeling opportunities in multiple areas of clinical science poised for modeling applications (e.g., stress, anxiety disorders, and depression); (b) surmounting constraints intrinsic to the clinical setting that pose special barriers to the valid extraction of modeling-based information (e.g., estimating and providing for potentially confounding effects of medication on cognitive-behavioral task performance); (c) developing sound methods for bridging mathematical and computational clinical science to clinical assessment and intervention; and (d) creating methods of tapping model-disciplined information from currently available published data, in the service of model-informed, substantively meaningful meta-analyses. It may be said that evidence-based practice is best practice only if it is based on best evidence. Best evidence goes hand in hand with maximizing methodological options, which compels the candidacy of those that stem from decidedly quantitative theorizing.
Acknowledgments

Manuscript preparation, and the author's own work reported in the manuscript, were supported by grants from the Social Sciences and Humanities Research Council of Canada (author, Principal Investigator), the Medical Research Council of Canada (author, Principal Investigator), and the Canadian Institutes of Health Research (author, co-investigator). I thank Mathew Shanahan for helpful comments on an earlier version, and Lorrie Lefebvre for her valuable assistance with manuscript preparation.
Notes

1. See also Chechile (2010) for the introduction of a novel procedure for individualizing MPTM parameter values, which is especially pertinent to clinical assessment technology. See also Smith & Batchelder (2010) on computational methods for dealing with the clinical-science issue of group-data variability owing to individual differences in parameter values (the problem of "overdispersion").
2. Still others (Ouimet, Gawronski, & Dozois, 2009) have tendered extensive verbally conveyed schemata and flow diagrams [cf. McFall, Townsend, & Viken (1995) for a poignant demarcation of models qua models] entailing dual-system cognition (extrapolation of "automatic" and "controlled" processing properties; Schneider & Shiffrin, 1977; Shiffrin & Schneider,
1977), and threat-stimulus disengagement deficit [possibly suggesting serial item processing; but cf. Neufeld & McCarty (1994); Neufeld, Townsend, & Jetté (2007); White et al., 2010a, p. 674].
3. An independent parallel, limited-capacity processing system, with exponentially distributed processing latencies, potentially characterizing the processing architectures of both HA and LA individuals at least during early processing (Neufeld & McCarty, 1994; Neufeld, Townsend, & Jetté, 2007): (a) parsimoniously can be shown to cohere with reaction-time findings on HA threat bias, including those taken to indicate threat-stimulus disengagement deficit; and (b) also can be shown to cohere with collateral oculomotor data (e.g., Mogg, Millar, & Bradley, 2000; Neufeld, Mather, Merskey, & Russell, 1995; see also Townsend & Ashby, 1983, chapter 5).
4. Note that encoding and response processes typically have been labeled "base processes," and often have been relegated to "wastebasket" status. They may embody important differences, however, between clinical samples and controls. By examining "boundary separation," a response-criterion parameter of the Ratcliff diffusion model, for example, White, Ratcliff, Vasey, and McKoon (2010b) have discovered that HA, but not LA, participants increase their level of caution in responding after making an error (an otherwise undetected "alarm reaction"). See also Neufeld, Boksman, Vollick, George, and Carter (2010) for analyses of potentially symptom-significant elongation of stimulus encoding—a process subserving collateral cognitive operations—among individuals with a diagnosis of schizophrenia.
5.
Extensions to mathematically prescribed signatures of variants on the cognitive elements described earlier recently have included statistical significance tests and charting of statistical properties (e.g., "unbiasedness" and "statistical consistency" of signature estimates; Houpt & Townsend, 2010, 2012), and measures of theoretic and methodological unification of concepts in the literature related to cognitive-workload capacity (known as "Grice" and "Miller" inequalities), all under qualitative properties of SFT's quantitative diagnostics (Townsend & Eidels, 2011). Bayesian extensions also currently are in progress. A description of software for SFT, including a tutorial on its use, has been presented by Houpt, Blaha, McIntire, Havig, and Townsend (2014).
6. The challenge of requisite data magnitude may be overblown. Meaningful results at the individual level of analysis, for example, can be obtained with as few as 160 trials per experimental condition, if the method of analysis is mathematical-theory mandated (see Johnson et al., 2010). Furthermore, clinicians who may balk at the apparent demands of repeated-trial cognitive paradigms nevertheless may have no hesitation in asking patients to respond to 567 separate items of a multi-item inventory.
7. Of late, cognitive science has witnessed a burgeoning interest in mixture models, dubbed "hierarchical Bayesian analysis" (e.g., Lee, 2011). The use of such models in cognitive science nevertheless can be traced back at least two and one half decades (e.g., Morrison, 1979; Schweickert, 1982). Mixture models of various structures (infinite and finite, continuous and discrete), along with their Bayesian assets, moreover have enjoyed a certain history of addressing prominent issues in clinical cognitive science (e.g., Batchelder, 1998; Carter et al., 1998; Neufeld, 2007b; Neufeld, Vollick, Carter, Boksman, & Jetté, 2002; Neufeld & Williamson, 1996), some of which are stated in the text.
Glossary

analytical construct validity: Support for the interpretation of a model parameter that expresses an aspect of individual differences in cognitive performance, according to the parameter's effects on model predictions.
base distribution: Distribution of a cognitive-behavioral performance variable (e.g., speed or accuracy) specified by a task-performance model.
construct-representation construct validity: Support for the interpretation of a measure in terms of the degree to which it comprises mechanisms that are theoretically meaningful in their own right.
exponential distribution: A stochastic distribution of process latencies whose probability density function is νe^(−νt), where t is time (arbitrary units) and ν is a rate parameter; latencies are loaded to the left (i.e., toward t = 0, their maximum probability density being at t = 0), and the distribution has a long tail.
hyperdistribution: A stochastic distribution governing the probabilities or probability densities of properties (e.g., parameters) of a base distribution (see section Bayesian Parameter Estimation in the text).
likelihood function: A mathematical expression conveying the probability or probability density of a set of observed data, as a function of the cognitive-behavioral task-performance model (see section Bayesian Parameter Estimation in the text).
mixing distribution: See hyperdistribution.
mixture model: A model of task performance, expressed in terms of a base distribution of response values, whose base-distribution properties (e.g., parameters) are randomly distributed according to a mixing distribution (see section Bayesian Parameter Estimation in the text).
multiple dissociation: Cognitive-behavioral performance patterns or estimated neuro-circuitry occurring for a clinical group under selected experimental conditions, with contrasting patterns occurring under other conditions. The profile obtained for the clinical group is absent or opposite for control or other clinical groups.
nomothetic-span construct validity: Support for the interpretation of a measure according to its network of associations with variables to which it is theoretically related.
normalizing factor: Overall probability or probability density of data or evidence, all model-prescribed conditions (e.g., parameter values) considered (see section Bayesian Parameter Estimation in the text).
overdispersion: Greater variability in task-performance data than would be expected if a fixed set of model parameters were operative across all participants.
prior distribution: In the Bayesian framework, the probability distribution before data are collected (see hyperdistribution).
posterior distribution: In the Bayesian framework, the distribution corresponding to the prior distribution after the data are collected. Bayes' theorem is used to update the prior distribution, following data acquisition. The key agent of this updating is the likelihood function (see section Bayesian Parameter Estimation in the text).
References

Ahn, W. Y., Busemeyer, J. R., Wagenmakers, E. J., & Stout, J. C. (2008). Comparison of decision learning models using the generalization criterion method. Cognitive Science, 32, 1376–1402.
Atkinson, R. C., Brelsford, J. W., & Shiffrin, R. M. (1967). Multiprocess models for memory with applications to a continuous presentation task. Journal of Mathematical Psychology, 4, 277–300.
Batchelder, W. H. (1998). Multinomial processing tree models and psychological assessment. Psychological Assessment, 10, 331–344.
Batchelder, W. H., & Riefer, D. M. (1990). Multinomial processing models of source monitoring. Psychological Review, 97, 548–564.
Batchelder, W. H., & Riefer, D. M. (1999). Theoretical and empirical review of multinomial process tree modelling. Psychonomic Bulletin & Review, 6, 57–86.
Batchelder, W. H., & Riefer, D. M. (2007). Using multinomial processing tree models to measure cognitive deficits in clinical populations. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 19–50). Washington, DC: American Psychological Association.
Bechara, A., Damasio, A. R., Damasio, H., & Anderson, S. W. (1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50, 7–15.
Berg, E. A. (1948). A simple objective technique for measuring flexibility in thinking. Journal of General Psychology, 39, 15–22.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York, NY: Springer.
Bianchi, M. T., Klein, J. P., Caviness, V. S., & Cash, S. S. (2012). Synchronizing bench and bedside: A clinical overview of networks and oscillations. In M. T. Bianchi, V. S. Caviness, & S. S. Cash (Eds.), Network approaches to diseases of the brain: Clinical applications in neurology and psychiatry (pp. 3–12). Oak Park, IL: Bentham Science Publishers.
Bishara, A., Kruschke, J., Stout, J., Bechara, A., McCabe, D., & Busemeyer, J. (2010).
Sequential learning models for the Wisconsin card sort task: Assessing processes in substance dependent individuals. Journal of Mathematical Psychology, 54, 5–13.
Bolger, N., Davis, A., & Rafaeli, E. (2003). Diary methods: Capturing life as it is lived. Annual Review of Psychology, 54, 579–616.
Borowski, E. J., & Borwein, J. M. (1989). The Harper Collins dictionary of mathematics (2nd ed.). New York, NY: Harper Collins.
Borsboom, D., & Cramer, A. O. J. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91–121.
Braithwaite, R. B. (1968). Scientific explanation. London, England: Cambridge University Press.
Brown, G. G., Lohr, J., Notestine, R., Turner, T., Gamst, A., & Eyler, L. T. (2007). Performance of schizophrenia and bipolar patients on verbal and figural working memory tasks. Journal of Abnormal Psychology, 116, 741–753.
Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. Thousand Oaks, CA: Sage.
Busemeyer, J. R., & Myung, I. J. (1992). An adaptive approach to human decision making: Learning theory, decision theory, and human performance. Journal of Experimental Psychology: General, 121, 177–194.
Busemeyer, J. R., & Stout, J. C. (2002). A contribution of cognitive decision models to clinical assessment: Decomposing performance on the Bechara gambling task. Psychological Assessment, 14, 253–262.
Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review, 100, 432–459.
Busemeyer, J. R., & Wang, Y. (2000). Model comparisons and model selections based on generalization test methodology. Journal of Mathematical Psychology, 44(1), 171–189.
Carter, J. R., & Neufeld, R. W. J. (2007). Cognitive processing of facial affect: Neuro-connectionist modeling of deviations in schizophrenia. Journal of Abnormal Psychology, 116, 290–305.
Carter, J. R., Neufeld, R. W. J., & Benn, K. D. (1998). Application of process models in assessment psychology: Potential assets and challenges. Psychological Assessment, 10, 379–395.
Chapman, L. J., & Chapman, J. P. (1973). Problems in the measurement of cognitive deficit. Psychological Bulletin, 79, 380–385.
Chechile, R. A. (1998). A new method for estimating model parameters for multinomial data. Journal of Mathematical Psychology, 42, 432–471.
Chechile, R. A. (2004). New multinomial models for the Chechile–Meyer task. Journal of Mathematical Psychology, 48, 364–384.
Chechile, R. A. (2007). A model-based storage-retrieval analysis of developmental dyslexia. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 51–79). Washington, DC: American Psychological Association.
Chechile, R. A. (2010). Modeling storage and retrieval processes with clinical populations with applications examining alcohol-induced amnesia and Korsakoff amnesia.
Journal of Mathematical Psychology, 54, 150–166.
Close, F. (2011). The infinity puzzle: How the hunt to understand the universe led to extraordinary science, high politics, and the Large Hadron Collider. Toronto, Canada: A. Knopf.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes. London, England: Methuen.
Cramer, A. O. J., Waldorp, L. J., van der Maas, H., & Borsboom, D. (2010). Comorbidity: A network perspective. Behavioral and Brain Sciences, 33, 137–193.
Dance, K., & Neufeld, R. W. J. (1988). Aptitude-treatment interaction in the clinical setting: An attempt to dispel the patient-uniformity myth. Psychological Bulletin, 104, 192–213.
Davidson, P. O., & Costello, G. G. (1969). N=1: Experimental studies of single cases. New York, NY: Van Nostrand Reinhold.
Dobson, D., & Neufeld, R. W. J. (1982). Paranoid-nonparanoid schizophrenic distinctions in the implementation of external conceptual constraints. Journal of Nervous and Mental Disease, 170, 614–621.
Doob, J. L. (1953). Stochastic processes. New York, NY: Wiley.
Embretson, W. S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197.
Estes, W. K. (1956). The problem of inference from curves based on group data. Psychological Bulletin, 53, 134–140.
Evans, M., Hastings, N., & Peacock, B. (2000). Statistical distributions (3rd ed.). New York, NY: Wiley.
Farrell, S., & Lewandowsky, S. (2010). Computational models as aids to better reasoning in psychology. Current Directions in Psychological Science, 19, 329–335.
Flanagan, O. (1991). Science of the mind (2nd ed.). Cambridge, MA: MIT Press.
Frewen, P. A., Dozois, D., Joanisse, M., & Neufeld, R. W. J. (2008). Selective attention to threat versus reward: Meta-analysis and neural-network modeling of the dot-probe task. Clinical Psychology Review, 28, 308–338.
Fridberg, D. J., Queller, S., Ahn, W.-Y., Kim, W., Bishara, A. J., Busemeyer, J. R., Porrino, L., & Stout, J. C. (2010). Cognitive mechanisms underlying risky decision-making in chronic cannabis users. Journal of Mathematical Psychology, 54, 28–38.
Friston, K. J., Fletcher, P., Josephs, O., Holmes, A., & Rugg, M. D. (1998). Event-related fMRI: Characterizing differential responses. Neuroimage, 7, 30–40.
Fukano, T., & Gunji, Y. P. (2012). Mathematical models of panic disorder. Nonlinear Dynamics, Psychology, and Life Sciences, 16, 457–470.
George, L., & Neufeld, R. W. J. (1985). Cognition and symptomatology in schizophrenia. Schizophrenia Bulletin, 11, 264–285.
Gold, J. M., & Dickinson, D. (2013). "Generalized cognitive deficit" in schizophrenia: Overused or underappreciated? Schizophrenia Bulletin, 39, 263–265.
Gottman, J. M., Murray, J. D., Swanson, C. C., Tyson, R., & Swanson, K. R. (2002). The mathematics of marriage: Dynamic nonlinear models. Cambridge, MA: MIT Press.
Green, M. F., Horan, W. P., & Sugar, C. A. (2013). Has the generalized deficit become the generalized criticism? Schizophrenia Bulletin, 39, 257–262.
Haig, B. D. (2008).
Scientific method, abduction, and clinical reasoning. Journal of Clinical Psychology, 64, 1013–1127.
Hartlage, S., Alloy, L. B., Vázquez, C., & Dykman, B. (1993). Automatic and effortful processing in depression. Psychological Bulletin, 113(2), 247–278.
Helmes, E. (2000). The role of social desirability in the assessment of personality constructs. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment: Honoring Douglas N. Jackson at seventy. Norwell, MA: Kluwer.
Hoaken, P. N. S., Shaughnessy, V. K., & Pihl, R. O. (2003). Executive cognitive functioning and aggression: Is it an issue of impulsivity? Aggressive Behavior, 29, 15–30.
Hoffman, R. E., & McGlashan, T. H. (2007). Using a speech perception neural network simulation to study normal neurodevelopment and auditory hallucinations in schizophrenia. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 239–262). Washington, DC: American Psychological Association.
Houpt, J. W., & Townsend, J. T. (2010). The statistical properties of the survivor interaction contrast. Journal of Mathematical Psychology, 54, 446–453.
Houpt, J. W., & Townsend, J. T. (2012). Statistical measures for workload capacity analysis. Journal of Mathematical Psychology, 56, 341–355.
Houpt, J. W., Blaha, L. M., McIntire, J. P., Havig, P. R., & Townsend, J. T. (2014). Systems factorial technology with R. Behavior Research Methods, 46, 307–330. (Available online from http://link.springer.com/article/10.3758%2Fs13428013-0377-3.)
Hu, X. (2001). Extending general processing tree models to analyze reaction time experiments. Journal of Mathematical Psychology, 45, 603–634.
Johnson, S. A., Blaha, L. M., Houpt, J. W., & Townsend, J. T. (2010). Systems factorial technology provides new insights on global-local information processing in autism spectrum disorders. Journal of Mathematical Psychology, 54, 53–72.
Kang, S. S., & MacDonald, A. W., III (2010). Limitations of true score variance to measure discriminating power: Psychometric simulation study. Journal of Abnormal Psychology, 119, 300–306.
Koritzky, G., & Yechiam, E. (2010). On the robustness of description and experience based decision tasks to social desirability. Journal of Behavioral Decision Making, 23, 83–99.
Lee, M. D. (2011). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology, 55, 1–7.
Levy, L. R., Yao, W., McGuire, G., Vollick, D. N., Jetté, J., Shanahan, M. J., & Neufeld, R. W. J. (2012). Nonlinear bifurcations of psychological stress negotiation: New properties of a formal dynamical model. Nonlinear Dynamics, Psychology, and Life Sciences, 16, 429–456.
Link, S. W. (1982). Correcting response measures for guessing and partial information. Psychological Bulletin, 92, 469–486.
Link, S. W., & Day, R. B. (1992). A theory of cheating. Behavior Research Methods, Instruments, & Computers, 24, 311–316.
Link, S.
W., & Heath, R. A. (1975). A sequential theory of psychological discrimination. Psychometrika, 40, 77–105.
Maher, B. (1970). Introduction to research in psychopathology. New York, NY: McGraw-Hill.
Marr, D. (1982). Vision. San Francisco, CA: Freeman.
McFall, R. M., & Townsend, J. T. (1998). Foundations of psychological assessment: Implications for cognitive assessment in clinical science. Psychological Assessment, 10, 316–330.
McFall, R. M., Townsend, J. T., & Viken, R. J. (1995). Diathesis stress model or "just so" story? Behavioral and Brain Sciences, 18(3), 565–566.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834.
Mogg, K., Millar, N., & Bradley, B. P. (2000). Biases in eye movements to threatening facial expressions in generalized anxiety disorder and depressive disorder. Journal of Abnormal Psychology, 109, 695–704.
Molenaar, P. C. M. (2010). Note on optimization of psychotherapeutic processes. Journal of Mathematical Psychology, 54, 208–213.
Morrison, D. G. (1979). An individual-difference pure-extinction process. Journal of Mathematical Psychology, 19, 307–315.
Moshagen, M. (2010). multiTree: A computer program for the analysis of multinomial processing tree models. Behavior Research Methods, 42(1), 42–54.
Neufeld, R. W. J. (1977). Clinical quantitative methods. New York, NY: Grune & Stratton.
Neufeld, R. W. J. (1984a). Re: The incorrect application of traditional test discriminating power formulations to diagnostic-group studies. Journal of Nervous and Mental Disease, 172, 373–374.
Neufeld, R. W. J. (1984b). Elaboration of incorrect application of traditional test discriminating power formulations to diagnostic-group studies. Department of Psychology Research Bulletin Number 599. London, ON: University of Western Ontario.
Neufeld, R. W. J. (1991). Memory in paranoid schizophrenia. In P. Magaro (Ed.), The cognitive bases of mental disorders: Annual review of psychopathology (Vol. 1, pp. 231–261). Newbury Park, CA: Sage.
Neufeld, R. W. J. (1996). Stochastic models of information processing under stress. Research Bulletin No. 734. London, ON: Department of Psychology, University of Western Ontario.
Neufeld, R. W. J. (1998). Intersections and disjunctions in process-model applications. Psychological Assessment, 10, 396–398.
Neufeld, R. W. J. (1999). Dynamic differentials of stress and coping. Psychological Review, 106, 385–397.
Neufeld, R. W. J. (2007a). Introduction. In R. W. J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling of processes and symptoms (pp. 3–18). Washington, DC: American Psychological Association.
Neufeld, R. W. J. (2007b). Composition and uses of formal clinical cognitive science. In B. Shuart, W. Spaulding, & J. Poland (Eds.), Modeling complex systems: Nebraska Symposium on Motivation, 52, 1–83. Lincoln, NE: University of Nebraska Press.
Neufeld, R. W. J. (2007c). On the centrality and significance of encoding deficit in schizophrenia. Schizophrenia Bulletin, 33, 982–993. Neufeld, R. W. J. (2012). Quantitative clinical cognitive science, cognitive neuroimaging, and tacks to fMRI signal analysis: The case of encoding deficit in schizophrenia. Paper presented at the 45th Annual Meeting of the Society for Mathematical Psychology, Columbus, Ohio, July 21–24, 2012. Neufeld, R. W. J., Boksman, K., Vollick, D., George, L., & Carter, J. (2010). Stochastic dynamics of stimulus encoding in schizophrenia: Theory, testing, and application. Journal of Mathematical Psychology, 54, 90–108. Neufeld, R. W. J. & Broga, M. I. (1977). Fallacy of the reliability-discriminability principle in research on differential cognitive deficit. Department of Psychology Research Bulletin Number 360. London, ON: University of Western Ontario.
366
new directions
Neufeld, R. W. J., & Broga, M. I. (1981). Evaluation of information-sequential aspects of schizophrenic performance, II: Methodological considerations. Journal of Nervous and Mental Disease, 169, 569–579. Neufeld, R. W. J., Vollick, D. Carter, J. R., Boksman, K., & Jetté, J. (2002). Application of stochastic modelling to group and individual differences in cognitive functioning. Psychological Assessment, 14, 279–298. Neufeld, R. W. J., & Gardner, R. C. (1990). Data aggregation in evaluating psychological constructs: Multivariate and logicaldeductive considerations. Journal of Mathematical Psychology, 34, 276–296. Neufeld, R. W. J., & McCarty, T. (1994). A formal analysis of stressor and stress-proneness effects on basic information processing. British Journal of Mathematical and Statistical Psychology, 47, 193–226. Neufeld, R. W. J., & Williamson, P. (1996). Neuropsychological correlates of positive symptoms: Delusions and hallucinations. In C. Pantelis, H.E. Nelson, & T.R.E. Barnes (Eds.), Schizophrenia: A neuropsychological perspective (pp. 205–235). London, England: Wiley Neufeld, R. W. J., Townsend, J. T., & Jetté, J. (2007). Quantitative response time technology for measuring cognitive-processing capacity in clinical studies. In R.W.J. Neufeld (Ed.), Advances in clinical cognitive science: Formal modeling and assessment of processes and symptoms (pp. 207–238). Washington, D.C.: American Psychological Association. Neufeld, R. W. J., Mather, J. A., Merskey, H., & Russell, N. C. (1995). Multivariate structure of eye-movement dysfunction in schizophrenia. Multivariate Experimental Clinical Research, 11, 1–21. Norman, R. M. G. & Malla, A. K.(1994). A prospective study of daily stressors and symptomatology in schizophrenic patients. Social Psychiatry and Psychiatric Epidemiology, 29, 244–249. Novotney, A. (2009). Science on a shoestring. Monitor on Psychology, 40(1), 42–44. O’Hagan, A., & Forster, J. (2004). Kendall’s advanced theory of statistics: Vol. 2B. 
Bayesian inference (2nd ed.). London, England: Arnold. Ouimet, A. J., Gawronski, B., & Dozois, D. J. A. (2009). Cognitive vulnerability to anxiety: A review and an integrative model. Clinical Psychology Review, 29, 459–470. Phillips, W. A., & Silverstein, S. M. (2003) Convergence of biological and psychological perspectives on cognitive coordination in schizophrenia. Discussion. Behavioral and Brain Sciences, 26, 65–138. Pitt, M. A., Kim, W., Navarro, J., & Myung, J. I. (2006). Global model analysis by parameter space partitioning. Psychololgical Review, 113, 57–83. Poldrack, R. A. (2011). Inferring mental states from neuroimaging data: From reverse inference to large-scale decoding. Neuron, 72, 692–697. Ratcliff, R. (1978). A theory of memory retrieval. Psychological Review, 85, 59–108. Ratcliff, R. (1979). Group reaction time distributions and an analysis of distribution statistics. Psychological Bulletin, 86, 446–461.
Riefer, D. M. & Batchelder, W. H. (1988). Multinomial modeling and the measurement of cognitive processes. Psychological Review, 95, 318–339. Riefer, D. M., Knapp, B. R., Batchelder, W. H., Bamber, D., & Manifold, V. (2002). Cognitive psychometrics: Assessing storage and retrieval deficits in special populations with multinomial processing tree models. Psychological Assessment, 14, 184–201. Rodgers, J. L. (2010) The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist, 65(1), 1–12. Rouder, J. N., Sun, D., Speckman, P. L., Lu, J. & Zhou, D. (2003). A hierarchical bayesian statistical framework for response time distributions. Psychometrika, 68(4), 589–606. Schneider, W., & Shiffrin, R. M. (1977). Controlled and automatic human information processing: I. Detection, search, and attention. Psychological Review, 84, 1–66. Schweickert, R. (1982). The bias of an estimate of coupled slack in stochastic PERT networks. Journal of Mathematical Psychology, 26, 1–12. Schweickert, R., (1985) Separable effects of factors on speed and accuracy: Memory scanning, lexical decisions and choice tasks. Psychological Bulletin, 7, 530–548. Schweickert, R., & Han, H. J. (2012). Reaction time predictions for factors selectively influencing processes in processing trees. Paper presented at Mathematical Psychology meeting, Columbus, OH, July 2012. Shanahan, M. J., & Neufeld, R. W. J. (2010). Coping with stress through decisional control: Quantification of negotiating the environment. British Journal of Mathematical and Statistical Psychology, 63, 575–601. Shanahan, M. J., Townsend, J. T. & Neufeld, R. W. J. (in press). Mathematical modeling in clinical psychology. In R. Cautlin & S. Liltienfield (Eds.); R. Zinbarg (Section Ed.) WileyBlackwell’s Encyclopedia of Clinical Psychology, London, Wiley-Blackwell. Shannon, C. E. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press. Shiffrin, R. 
M., & Schneider, W. (1977). Controlled and automatic human information processing. II. Perceptual learning, automatic attending and a general theory. Psychological Review, 84, 127–190. Siegle G. J., Hasselmo M. E. (2002). Using connectionist models to guide assessment of psychological disorder. Psychological Assessment, 14, 263–278. Silverstein, S. S. (2008). Measuring specific, rather than generalized, cognitive deficits and maximizing betweengroup effect size in studies of cognition and cognitive change. Schizophrenia Bulletin, 34, 645–655. Smith, J. B., & Batchelder, W. H. (2010). Beta-MPT: Multinomial processing tree models for addressing individual differences. Journal of Mathematical Psychology, 54, 167–183. Staddon, J. E. R. (1984). Social learning theory and the dynamics of interaction. Psychological Review, 91(4), 502– 507. Stein, D. J., & Young, J. E. (1992). (Editors), Cognitive science and clinical disorders. New York, NY: Academic.
Sternberg, S(1969). The discovery of processing stages: Extensions of Donders’ method. In W. G. Koster (Ed.), Attention and performance II. Acta Psychologica, 30, 276–315. Townsend, J. T. (1984). “Uncovering mental processes with factorial experiments.” Journal of Mathematical Psychology, 28(4), 363–400. Townsend, J. T. (2008). Mathematical psychology: Prospects for the 21st century: A guest editorial. Journal of Mathematical Psychology, 52, 269–280. Townsend, J. T. and Ashby, F. G. (1983). Stochastic modelling of elementary psychological processes. Cambridge: Cambridge University Press. Towsend, J. T., & Altieri, N. (2012). An accuracy-response time capacity assessment function that measures performance against standard parallel predictions. Psychological Review, 119, 500–16. Townsend, J. T. & Eidels, A., (2011). Workload capacity spaces: A unified methodology for response time measures of efficiency as workload is varied. Psychonomic Bulletin & Review, 18, 659–681. Townsend, J. T., Fific, M., & Neufeld, R. W. J. (2007). Assessment of mental architecture in clinical/cognitive research. In T. A. Treat, R. R. Bootzin, T. B. Baker (Eds.), Psychological clinical science: Papers in Honor of Richard M. McFall (pp. 223–258). Mahwah, NJ: Erlbaum. Townsend, J. T. & Neufeld, R. W. J. (2004). Mathematical theory-driven methodology and experimentation in the emerging quantitative clinical cognitive science: Toward general laws of individual differences. Paper presented at the Association for Psychological Science-sponsored symposium on Translation Psychological Science in Honor of Richard M. McFall, Chicago, Illinois, May, 2004. Townsend, J. T., & Nozawa, G. (1995). Spatio-temporal properties of elementary perception: An investigation of parallel, serial, and coactive theories. Journal of Mathematical Psychology, 39, 321–359. Townsend, J. T., & Wenger, M. J. (2004a). The serial-parallel dilemma: A case study in a linkage of theory and method. 
Psychonomic Bulletin & Review, 11, 391–418. Townsend, J. T. & Wenger, M. J. (2004b). A theory of interactive parallel processing: New capacity measures and predictions for a response time inequality series. Psychological Review, 111, 1003–1035. Treat, T. A., & Viken, R. J. (2010). Cognitive processing of weight and emotional information in disordered eating. Current Directions in Psychological Science, 19, 81–85. Treat, T. A., McFall, R. M., Viken, R. J., Nosfosky, R. M., MacKay, D. B., & Kruschke, J. K. (2002). Assessing clinically relevant perceptual organization with multidimensional scaling techniques. Psychological Assessment, 14, 239–252. Treat, T. A., Viken, R. J., Kruschke, J. K., & McFall, R. M. (2010). Role of attention, memory, and covariationdetection processes in clinically significant eating-disorder symptoms. Journal of Mathematical Psychology, 54, 184–195. Van der Maas, H. L. J., Molenaar, D., Maris, G., Kievit, R. A., & Borsboom, D. (2011). Cognitive psychology meets psychometric theory: On the relation between process models for decision making and latent variable models for individual differences. Psychological Review, 118, 339–356.
mathematical and computational
367
Van Zandt, T. (2000). How to fit a response time distribution. Psychonomic Bulletin and Review, 7, 424–465. Wagenmakers, E.-J., van der Maas, H. L. J., Dolan, C., & Grasman, R. P. P. P. (2008). EZ does it! Extensions of the EZ-diffusion model. Psychonomic Bulletin & Review, 15, 1229–1235. Wagenmakers, E.-J., van der Maas, H. L. J., & Farrell, S. (2012). Abstract concepts require concrete models: Why cognitive scientists have not yet embraced nonlinearlycoupled, dynamical, self-organized critical, synergistic, scalefree, exquisitely context-sensitive, interaction-dominant, multifractal, interdependent brain-body-niche systems. TopiCS, 4, 87–93. Wallsten, T. S., Pleskac, T. J., Lejuez, C. W., (2005). Modeling a sequential risk-taking task. Psychological Review, 112, 862–880. Wenger, M. J., & Townsend, J. T. (2000). Basic tools for attention and general processing capacity in perception and cognition. Journal of General Psychology: Visual Attention, 127, 67–99. White, C. N., Ratcliff, R., Vasey, M. W., & McKoon, G. (2010a). Anxiety enhances threat processing without competition among multiple inputs: A diffusion model analysis. Emotion, 10, 662–677. White, C. N., Ratcliff, R., Vasey, M. W., McKoon, G. (2010b). Using diffusion models to understand clinical disorders. Journal of Mathematical Psychology, 54, 39–52 Williams, J. M. G., & Oaksford, M. (1992). Cognitive science, anxiety and depression: From experiments to connectionism.
368
new directions
In D. J. Stein & J. E. Young (Eds.) Cognitive science and clinical disorders (pp. 129–150). San Diego, CA: Academic. Witkiewitz, K., van der Maas, H. J., Hufford, M. R., & Marlatt, G. A. (2007). Non-normality and divergence in posttreatment alcohol use: re-examining the project MATCH data “another way.” Journal of Abnormal Psychology, 116, 378–394. Yang, E., Tadin, D., Glasser, D. M., Hong, S. W., Blake, R., & Park, S. (2013). Visual context processing in schizophrenia. Clinical Psychological Science, 1, 5–15. Yechiam, E., Arshavsky, O., Shamay-Tsoory, S. G., Yaniv, S., and Aharon, J. (2010). Adapted to explore: Reinforcement learning in Autistic Spectrum Conditions. Brain and Cognition, 72, 317–324. Yechiam, E., Goodnight, J., Bates, J. E., Busemeyer, J. R., Dodge, K. A., Pettit, G. S., & Newman, J. P. (2006). A formal cognitive model of the Go/No-Go discrimination task: Evaluation and implications. Psychological Assessment, 18, 239–249. Yechiam, E., Hayden, E. P., Bodkins, M., O’Donnell, B. F., and Hetrick, W. P. (2008). Decision making in bipolar disorder: A cognitive modeling approach. Psychiatry Research, 161, 142–152. Yechiam E., Veinott E. S., Busemeyer J. R., Stout J. C. (2007). Cognitive models for evaluating basic decision processes in clinical populations. In: Neufeld R. W. J. (Ed.), Advances in clinical cognitive science: Formal modeling and assessment of processes and symptoms (pp. 81–111). Washington, DC: APA Publications.
CHAPTER 17
Quantum Models of Cognition and Decision
Jerome R. Busemeyer, Zheng Wang, and Emmanuel Pothos
Abstract
Quantum probability theory provides a new formalism for constructing probabilistic and dynamic systems of cognition and decision. The purpose of this chapter is to introduce psychologists to this fascinating theory. This chapter is organized into six sections. First, some of the basic psychological principles supporting a quantum approach to cognition and decision are summarized; second, some notations and definitions needed to understand quantum probability theory are presented; third, a comparison of quantum and classical probability theories is presented; fourth, quantum probability theory is used to account for some paradoxical findings in the field of human probability judgments; fifth, a comparison of quantum and Markov dynamic theories is presented; and finally, a quantum dynamic model is used to account for some puzzling findings of decision-making research. The chapter concludes with a summary of advantages and disadvantages of a quantum probability theoretical framework for modeling cognition and decision. Key Words: quantum probability, classical probability, Hilbert space, the law of total probability
Reasons for a Quantum Approach to Cognition and Decision
This chapter is not about quantum physics per se. Instead, it explores the application of probabilistic dynamic systems derived from quantum theory to a new domain – cognition and decision-making behavior. Applications of quantum theory have appeared in judgment (Aerts & Aerts, 1994; Busemeyer, Pothos, Franco, & Trueblood, 2011; Franco, 2009; Pothos, Busemeyer, & Trueblood, 2013; Wang & Busemeyer, 2013), decision making (Bordley & Kadane, 1999; Busemeyer, Wang, & Townsend, 2006; Khrennikov & Haven, 2009; Lambert-Mogiliansky, Zamir, & Zwirn, 2009; La Mura, 2009; Pothos & Busemeyer, 2009; Trueblood & Busemeyer, 2011; Yukalov & Sornette, 2011), conceptual combinations (Aerts, 2009; Aerts & Gabora, 2005; Blutner, 2009), memory (Brainerd, Wang, & Reyna, 2013; Bruza
et al., 2009), and perception (Atmanspacher, Filk, & Romer, 2004; Conte et al., 2009). Several review articles (Pothos & Busemeyer, 2013; Wang, Busemeyer, Atmanspacher, & Pothos, 2013) and books (Busemeyer & Bruza, 2012; Ivancevic & Ivancevic, 2010; Khrennikov, 2010) provide a summary of this new program of research. Before presenting the formal ideas, let us first examine why quantum theory should be applicable to human cognition and decision behavior.
Judgments Are Based upon Indefinite and Uncertain Cognitive States
Models commonly used in psychology assume the cognitive system changes from moment to moment, but at any specific moment it is in a definite state with respect to some judgment to be made. For example, suppose a juror has just heard conflicting evidence from the prosecutor and
the defense, and the juror has to consider two mutually exclusive and exhaustive hypotheses—guilty or not guilty. A Bayesian model would assign a probability distribution over the two hypotheses: a probability p(G|evidence) is assigned to guilt and a probability 1 − p(G|evidence) is assigned to not guilty. Therefore, the juror’s subjective probability with respect to the question of guilty or not boils down to a state represented by a point lying somewhere between zero and one on the probability scale at each moment. This probability may change from moment to moment to produce a definite trajectory of probability for guilt across time. However, at each moment, this subjective probability either favors guilt p(G|evidence) > .50, or it favors not guilty p(G|evidence) < .50, or it is exactly at p(G|evidence) = .50. At a single moment, the juror cannot be both favoring guilt p(G|evidence) > .50 and at the same time favoring not guilty p(G|evidence) < .50. In contrast, quantum theory assumes that during deliberation the juror is in an indefinite (superposition) state at each moment. While in an indefinite state, the juror does not necessarily favor guilty, and at the same time the juror does not necessarily favor not guilty. Instead, the juror is in a superposition state that leaves the juror conflicted, ambiguous, confused, or uncertain about the guilty status. The potential for saying guilt may be greater than the potential for saying not guilty at one moment, and these potentials may change from one moment to the next, but either hypothesis could potentially be chosen at each moment. In quantum theory, there is no single trajectory or sample path across time before making a decision. When asked to make a decision, the juror would be forced to commit to either guilt or not.
Judgments Create Rather Than Record a Cognitive State
Models commonly used in psychology assume that what we record at a particular moment reflects the state of the cognitive system as it existed immediately before we inquired about it. For example, if a person watches a scene of an exciting car chase and is asked “Are you afraid?” then the answer “Yes. I am afraid” reflects the person’s cognitive state with respect to that question just before we asked it. In contrast, quantum theory assumes that taking a measurement of a system creates rather than records a property of the system (Wang, Busemeyer, Atmanspacher, & Pothos, 2013). For example, the
person may be ambiguous about his or her feelings after watching the scene, but the answer “Yes. I am afraid” is constructed from the interaction of this indefinite state and the question, which results in a now definitely “afraid” state. This is, in fact, the basis for modern psychological theories of emotion (Schachter & Singer, 1962). Decision scientists also have shown evidence that beliefs and preferences are constructed online rather than simply being read straight out of memory (Payne, Bettman, & Johnson, 1992), and expressing choices and opinions can change preferences (Sharot, Velasquez, & Dolan, 2010).
Judgments Disturb Each Other Producing Order Effects
According to quantum theory, the answer to a question can change a state from an indefinite to a definite state, and this change causes one to respond differently to subsequent questions. Intuitively, the answer to the first question sets up a context that changes the answer to the next question, and this produces order effects of the measurements. Order effects make it impossible to define a joint probability of answers to questions A and B (unless one conditionalizes the conjunction with an order parameter), and instead it is necessary to assign a probability to the sequence of answers to question A followed by question B. In quantum theory, if A and B are two measurements, and the probabilities of the outcomes depend on the order of the measurements, then the two measurements are noncommutative. Many of the mathematical properties of quantum theory, such as Heisenberg’s famous uncertainty principle (Heisenberg, 1958), arise from developing a probabilistic model for noncommutative measurements. Question order effects are a major concern for attitude researchers, who struggle for a theoretical understanding of these effects similar to that achieved in quantum theory (Feldman & Lynch, 1988). Of course, quantum theory is not the only theory to explain order effects. Markov models, for example, can also produce order effects. Quantum theory, however, provides a more natural, elegant, and built-in set of principles (as opposed to ad hoc assumptions) for explaining order effects (Wang, Solloway, Shiffrin, & Busemeyer, 2014).
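The noncommutativity described above can be illustrated with a few lines of linear algebra. The sketch below is not from this chapter; it is a hypothetical two-dimensional example (made-up state vector and made-up question rays, implemented in NumPy) showing that projecting in the order A-then-B yields a different sequence probability than B-then-A:

```python
import numpy as np

# Made-up unit-length belief state and two one-dimensional projectors
# onto rays that are neither parallel nor orthogonal.
s = np.array([1.0, 0.0])                              # initial state
a = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])  # ray for answering "yes" to A
b = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])  # ray for answering "yes" to B
MA = np.outer(a, a)                                   # projector |a><a|
MB = np.outer(b, b)                                   # projector |b><b|

# Probability of "yes to A, then yes to B" = ||MB MA s||^2, and the reverse.
p_ab = np.linalg.norm(MB @ MA @ s) ** 2
p_ba = np.linalg.norm(MA @ MB @ s) ** 2

print(p_ab, p_ba)
assert not np.isclose(p_ab, p_ba)  # MA and MB do not commute: order matters
```

Because MA·MB ≠ MB·MA here, the two orders of questioning assign different probabilities to the same pair of "yes" answers, which is exactly the signature of a question order effect.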
Judgments Do Not Always Obey Classical Logic
Probabilistic models commonly used in psychology are based on the Kolmogorov axioms (1933/1950), which define events as sets that obey the axioms of set theory and Boolean logic. One important axiom is the distributive axiom: if {G, T, F} are events, then G ∩ (T ∪ F) = (G ∩ T) ∪ (G ∩ F). Consider, for example, the concept that a boy is good (G), and the pair of concepts that the boy told the truth (T) versus the boy did not tell the truth (F). According to classical Boolean logic, the event G can only occur in one of two ways: either (G ∩ T) occurs or (G ∩ F) occurs, exclusively. From this distributive axiom, one can derive the law of total probability, p(G) = p(T)p(G|T) + p(F)p(G|F). Quantum probability theory is derived from the von Neumann axioms (1932/1955), which define events as subspaces that obey different axioms from those of set theory. In particular, the distributive axiom does not always hold (Hughes, 1989). For example, according to quantum logic, when you try to decide whether a boy is good without knowing whether he is truthful, you are not forced to have only two thoughts: he is good and he is truthful, or he is good and he is not truthful. You can remain ambiguous or indeterminate over the truthful or not-truthful attributes, which can be represented by a superposition state. The fact that quantum logic does not always obey the distributive axiom implies that the quantum model does not always obey the law of total probability (Khrennikov, 2010).

Preliminary Concepts, Definitions, and Notations
Quantum theory is based on geometry and linear algebra defined on a Hilbert space. (Hilbert spaces are complex vector spaces with certain convergence properties.) Paul Dirac developed an elegant notation for expressing the abstract elements of the theory, which is used in this chapter. This chapter is restricted to finite spaces for simplicity, but note that the theory is also applicable to infinite-dimensional spaces. In fact, to keep the examples simple, this section introduces the ideas using only a three-dimensional space, so that they can be presented visually. Figure 17.1 shows a particular vector labeled S that lies within a three-dimensional space spanned by three basis vectors labeled A, B, and C. For example, a simple attitude model could interpret S as the state of opinion of a person with regard to the beauty of an artwork using three mutually exclusive evaluations, “good,” “mediocre,” or “bad,” which are represented by the basis vectors A, B, and C, respectively.

[Fig. 17.1 Three-dimensional vector space spanned by basis vectors A, B, and C.]

A finite Hilbert space is an N-dimensional vector space defined on a field of complex numbers and
endowed with an inner product. The space has a basis, which is a set of N orthonormal basis vectors χ = {|X₁⟩, ..., |X_N⟩} that span the space. The symbol |X⟩ represents an arbitrary vector in an N-dimensional vector space, which is called a “ket.” This vector can be expressed by its coordinates with respect to the basis χ as follows:

\[ |X\rangle = \sum_{i=1}^{N} x_i\,|X_i\rangle. \]

The coordinates x_i are complex numbers. The N coordinates representing the ket |X⟩ with respect to a basis χ form an N × 1 column matrix

\[ X = \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix}. \]

Referring to Figure 17.1, the coordinates for the specific vector |S⟩ with respect to the {|A⟩, |B⟩, |C⟩} basis equal

\[ S = \begin{bmatrix} 0.696 \\ 0.696 \\ 0.1765 \end{bmatrix}. \]

Referring back to our simple attitude model, the coordinates of S represent the potentials for each of the opinions. The symbol ⟨X| represents a linear functional in an N-dimensional (dual) vector space, which is called a “bra.” Each ket |X⟩ has a corresponding bra ⟨X|. The conjugate transpose operation, |X⟩† = ⟨X|, changes a ket into a bra. The N coordinates representing the bra ⟨X| with respect to a basis χ form a 1 × N row matrix

\[ X^{\dagger} = \begin{bmatrix} x_1^{*} & \cdots & x_N^{*} \end{bmatrix}. \]
The * symbol indicates complex conjugation. For example, the bra ⟨S| corresponding to the ket |S⟩ has the matrix representation

\[ S^{\dagger} = \begin{bmatrix} 0.696 & 0.696 & 0.1765 \end{bmatrix}. \]
(Here the numbers in the example are real, and so conjugation has no effect.) Hilbert spaces are endowed with an inner product. Psychologically, the inner product is a measure of similarity between two vectors. The inner product is a scalar formed by applying the bra to the ket to form a bra-ket:

\[ \langle Y|X\rangle = \sum_{i=1}^{N} y_i^{*} \cdot x_i. \]

For example,

\[ \langle S|S\rangle = S^{\dagger} \cdot S = \begin{bmatrix} 0.696 & 0.696 & 0.1765 \end{bmatrix} \begin{bmatrix} 0.696 \\ 0.696 \\ 0.1765 \end{bmatrix} = 1. \]
This shows that the ket |S⟩ is unit length. The outer product, denoted by |X⟩⟨Y|, is a linear operator, which is used to make transitions from one state to another. In particular, assuming that the kets are unit length, the outer product |X⟩⟨Y| maps the ket |Y⟩ to the ket |X⟩ as follows: (|X⟩⟨Y|) · |Y⟩ = |X⟩⟨Y|Y⟩ = |X⟩. Assuming |X⟩ is unit length, the outer product |X⟩⟨X| projects |X⟩ to itself, |X⟩⟨X| · |X⟩ = |X⟩⟨X|X⟩ = |X⟩ · 1, and |X⟩⟨X| projects any other ket |Y⟩ onto the ray spanned by |X⟩ as follows: |X⟩⟨X|Y⟩ = ⟨X|Y⟩ · |X⟩. For these reasons, |X⟩⟨X| is called the projector for the ray spanned by |X⟩, which is also symbolized as M_X = |X⟩⟨X|. Projectors correspond to subspaces that represent events in quantum theory. They are Hermitian and idempotent. Referring to Figure 17.1, the coordinates for the basis vector |A⟩ (with respect to the {|A⟩, |B⟩, |C⟩} basis) simply equal

\[ A = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \]

and the matrix representation of the projector for this basis vector equals

\[ A \cdot A^{\dagger} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \]

The projector M_A = A · A† corresponds to the subspace representing the event A. In our simple
attitude model, M_A would be used to represent the event that the person decides the artwork to be “good” (which corresponds to event A). The matrix representation of the projection of the ket |S⟩ onto the ray spanned by the basis vector |A⟩ then equals

\[ A \cdot A^{\dagger} \cdot S = \begin{bmatrix} 0.696 \\ 0 \\ 0 \end{bmatrix}. \]

In our simple attitude model, this projection is used to determine the probability that the person decides the artwork is “good.” According to quantum theory, the squared length of this projection, 0.696² = 0.4844, equals the probability that the person will decide that the artwork is “good.” Similarly, the coordinates for the basis vector |B⟩ (with respect to the {|A⟩, |B⟩, |C⟩} basis) simply equal

\[ B = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}. \]

So the matrix representation (with respect to the {|A⟩, |B⟩, |C⟩} basis) for the projector for |B⟩ equals

\[ B \cdot B^{\dagger} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \]

In our simple attitude model, this projector is used to represent the event that the person decides the artwork to be “mediocre” (which corresponds to event B). In addition, the horizontal plane shown in Figure 17.1 is spanned by the {|A⟩, |B⟩} basis vectors, and the projector that projects vectors onto this plane equals M_A + M_B = |A⟩⟨A| + |B⟩⟨B|, which has the matrix representation

\[ A \cdot A^{\dagger} + B \cdot B^{\dagger} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}. \]

In our simple attitude example, this corresponds to the event that the person thinks the artwork is “good” or “mediocre.” The vector labeled T in Figure 17.1 is the projection of the vector |S⟩ onto the plane spanned by the {|A⟩, |B⟩} basis vectors, which has the matrix representation

\[ T = \left( A \cdot A^{\dagger} + B \cdot B^{\dagger} \right) \cdot S. \]
The squared length of this projection, T†T = 2 · (0.696)² = 0.969, equals the probability that the person decides the artwork to be “good” or “mediocre.” This is also the probability that the person thinks the artwork is not “bad” (1 − 0.1765² = 0.969). Referring back to our simple attitude model, suppose that instead of asking whether the artwork is beautiful, we ask what kind of moral message it conveys, and once again there are three answers, such as “good,” “neutral,” or “bad.” Now the person needs to evaluate the same artwork with respect to a new point of view. In quantum theory, this new point of view is represented as a change in the basis. Figure 17.2 illustrates three new orthonormal vectors within the same three-dimensional space, labeled U, V, and W in the figure. Now the basis vectors U, V, and W represent a “good,” “neutral,” or “bad” moral message, respectively. The state S now represents the person’s opinion with respect to this new moral-message point of view. With respect to the {|A⟩, |B⟩, |C⟩} basis, the coordinates for these three vectors are as follows:

\[ U = \begin{bmatrix} \sqrt{1/2} \\ \sqrt{1/2} \\ 0 \end{bmatrix}, \quad V = \begin{bmatrix} 1/2 \\ -1/2 \\ \sqrt{1/2} \end{bmatrix}, \quad W = \begin{bmatrix} -1/2 \\ 1/2 \\ \sqrt{1/2} \end{bmatrix}. \]

These three vectors form another orthogonal basis for spanning the three-dimensional space. The projector M_V = |V⟩⟨V| projects vectors onto the ray spanned by the basis vector |V⟩ as follows: M_V|X⟩ = |V⟩⟨V|X⟩. Using the coordinates defined above for |V⟩, we obtain

\[ M_V|S\rangle = |V\rangle\langle V|S\rangle = \begin{bmatrix} 1/2 \\ -1/2 \\ \sqrt{1/2} \end{bmatrix} \left( \begin{bmatrix} 1/2 & -1/2 & \sqrt{1/2} \end{bmatrix} \begin{bmatrix} 0.696 \\ 0.696 \\ 0.1765 \end{bmatrix} \right) = (0.125)\,V, \]

which indicates that 0.125 is the coordinate of the vector |S⟩ on the |V⟩ basis vector. This is the projection of S on the V basis, and the squared length of this projection, 0.125² = 0.0156, equals the probability of this event, e.g., the probability that the person decides that the artwork is “neutral.” Repeating this procedure for |U⟩ and |W⟩, we obtain the coordinates for the vector |S⟩ in Figure 17.2 with respect to the {|U⟩, |V⟩, |W⟩} basis:

\[ Y = \begin{bmatrix} 0.125 \\ 0.9843 \\ 0.125 \end{bmatrix}. \]

In sum, the same vector |S⟩ can be expressed by the coordinates X using the {|A⟩, |B⟩, |C⟩} basis or by the coordinates Y using the {|U⟩, |V⟩, |W⟩} basis. Note that the event “morally good” is represented by the vector U in Figure 17.2. This vector lies along the diagonal line of the A, B plane. Here we see an interesting feature of quantum theory. If a person is definite that the piece of artwork is “morally good” (represented by the vector U), then the person must be uncertain about whether its beauty is good versus mediocre (because U has a 45-degree angle with respect to each of the A, B vectors). However, if the person is certain that the artwork is “morally good,” then the person is certain that its beauty is not “bad” (because U is contained in the A, B plane).
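The projection probabilities in this section can be checked numerically. The sketch below is not part of the chapter; it uses NumPy (variable names are mine), and it assumes V = (1/2, −1/2, √(1/2)), the coordinates consistent with the ⟨V|S⟩ = 0.125 computation in the text:

```python
import numpy as np

S = np.array([0.696, 0.696, 0.1765])   # state vector from Figure 17.1
A = np.array([1.0, 0.0, 0.0])          # "good" beauty evaluation
B = np.array([0.0, 1.0, 0.0])          # "mediocre" beauty evaluation
MA, MB = np.outer(A, A), np.outer(B, B)

p_good = np.linalg.norm(MA @ S) ** 2                 # 0.696^2 = 0.4844
p_good_or_med = np.linalg.norm((MA + MB) @ S) ** 2   # projection onto A,B plane

# Assumed coordinates for the "neutral" moral-message basis vector V
V = np.array([0.5, -0.5, np.sqrt(0.5)])
coord_V = V @ S                                      # inner product <V|S>
p_neutral = coord_V ** 2

print(round(p_good, 4), round(p_good_or_med, 3), round(coord_V, 3))
```

Running this reproduces the three quantities derived above: the probability of “good” (0.4844), the probability of “good or mediocre” (0.969), and the coordinate of S on the V basis vector (0.125).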
Quantum Compared to Classical Probabilities
[Fig. 17.2 New basis U, V, W for representing the three-dimensional vector space.]
This section presents the quantum probability axioms formulated by Paul Dirac (1958) and John von Neumann (1932/1955), and compares them systematically with the axioms of classical Kolmogorov probability theory (1933/1950) (see Box 1 for a summary). For simplicity, we restrict this presentation to finite spaces in this chapter. Although the space is finite, the number of dimensions can be very large. The general theory is applicable to infinite-dimensional spaces. See Chapter 2 in Busemeyer and Bruza (2012) for a more comprehensive introduction.
Box 1 A brief comparison of the classical Kolmogorov probability theory and the quantum probability theory

Kolmogorov Theory
• Sample space χ of events
• Event A is represented as a subset
• State is a probability function p
• p(A) = probability assigned to event A
• if A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B)
• p(A|B) = p(A ∩ B)/p(B)
• p(A ∩ B) = p(B ∩ A)

Quantum Theory
• Hilbert vector space of events
• Event A is represented as a subspace corresponding to a projector M_A
• State is a vector |S⟩ in Hilbert space
• q(A) = ‖M_A|S⟩‖²
• if M_A M_B = 0, then q(A ∨ B) = q(A) + q(B)
• q(A|B) = ‖M_A M_B|S⟩‖²/q(B)
• in general, ‖M_A M_B|S⟩‖² ≠ ‖M_B M_A|S⟩‖²
Events
Classical probability postulates a sample space χ, which we will assume contains a finite number of points, N (and N may be very large). The set of points in the sample space is defined as χ = {X₁, ..., X_N}. An event A is a subset of this sample space, A ⊆ χ. If A ⊆ χ is an event and B ⊆ χ is an event, then the intersection A ∩ B is an event; also the union A ∪ B is an event. Quantum theory postulates a Hilbert space χ, which we will assume has a finite dimension, N (and again N may be very large). The space is spanned by an orthonormal set of basis vectors χ = {|X₁⟩, ..., |X_N⟩} that form a basis for the space. An event A is a subspace spanned by a subset χ_A ⊆ χ of basis vectors. This event corresponds to a projector M_A = Σ_{i∈A} |X_i⟩⟨X_i|. If A is an event spanned by χ_A ⊆ χ and B is an event spanned by χ_B ⊆ χ, then the meet (infimum) A ∧ B is an event spanned by χ_A ∩ χ_B; also the join (supremum) A ∨ B is an event spanned by χ_A ∪ χ_B. For example, in Figure 17.1, the event A is represented by the ray spanned by the vector |A⟩, and “A or B” is represented by the horizontal
new directions
plane spanned by the two vectors {|A , |B} for the quantum model.
System State

Classical probability postulates a probability function p that maps points in the sample space χ into positive real numbers which sum to unity. The empty set is mapped into zero, the sample space is mapped into one, and all other events are mapped into the interval [0, 1]. If the pair of events {A ⊆ χ, B ⊆ χ} are mutually exclusive, A ∩ B = Ø, then p(A ∪ B) = p(A) + p(B). The probability of the event "not A," denoted Ā, equals p(Ā) = 1 − p(A).

Quantum probability postulates a unit length state vector |X⟩ in the Hilbert space. The probability of an event A spanned by χ_A ⊆ χ is defined by q(A) = ‖M_A|X⟩‖². For later work, it will be convenient to express ‖M|X⟩‖² as the inner product ‖M|X⟩‖² = ⟨X|M†M|X⟩ = ⟨X|M|X⟩, where the last step made use of the idempotency of the projector, M† = M = MM. If the pair of events {A, B}, both spanned by subsets of basis vectors from χ, are mutually exclusive, χ_A ∩ χ_B = Ø, then it follows from orthogonality that q(A ∨ B) = ‖(M_A + M_B)|X⟩‖² = ‖M_A|X⟩‖² + ‖M_B|X⟩‖² = q(A) + q(B). The event Ā is the subspace that is the orthogonal complement to the subspace for the event A, and its probability equals q(Ā) = ‖(I − M_A)|X⟩‖² = 1 − q(A).

For example, in Figure 17.1, the probability of the event A equals ‖M_A|S⟩‖² = ‖A·A†·S‖² = |.696|², and the probability of the event "A or B" equals ‖(M_A + M_B)|S⟩‖² = ‖(A·A† + B·B†)·S‖² = ‖T‖² = 2·|.696|².
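These event probabilities are easy to sketch numerically. The following is a minimal sketch (not the authors' code): the 3-dimensional state vector is an illustrative assumption chosen so that its first coordinate echoes the |.696|² example from Figure 17.1, not the exact state used in the chapter.

```python
import numpy as np

# Illustrative 3-d Hilbert space with orthonormal basis {|A>, |B>, |C>}.
# The state |S> is an assumed vector with <A|S> ~ .696 (echoing Fig. 17.1).
A, B, C = np.eye(3)

S = np.array([0.696, 0.696, 0.175])
S = S / np.linalg.norm(S)            # states must be unit length

def projector(*vectors):
    """Projector M = sum_i |X_i><X_i| onto the span of the given vectors."""
    return sum(np.outer(v, v) for v in vectors)

def q(M, state):
    """Quantum probability of the event with projector M: q = ||M|state>||^2."""
    return np.linalg.norm(M @ state) ** 2

M_A, M_B = projector(A), projector(B)
M_AorB = projector(A, B)             # "A or B": span of {|A>, |B>}

print(q(M_A, S))                     # probability of event A (about .48 here)
print(q(M_AorB, S))                  # q(A or B) = q(A) + q(B) since M_A M_B = 0
```

Note that additivity holds here only because M_A and M_B project onto orthogonal basis vectors, exactly as the orthogonality argument in the text requires.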
State Revision
According to classical probability, if an event A ⊆ χ is observed, then a new conditional probability function is defined by the mapping p(X_i|A) = p(X_i ∩ A)/p(A). The normalizing factor in the denominator is used to guarantee that the probability assigned to the entire sample space remains equal to one. According to quantum probability, if an event A is observed, then the new revised state is defined by |X_A⟩ = M_A|X⟩/‖M_A|X⟩‖. The normalizing factor in the denominator is used to guarantee that the revised state remains unit length. The new revised state is then used (as described earlier) to compute probabilities for events. This is called Lüder's rule.
For example, in Figure 17.1, if the event "A or B" is observed, then the matrix representation of the revised state equals

S_AorB = T/‖T‖ = [√(1/2), √(1/2), 0]ᵀ = U.

The probability of event A given that "A or B" was observed equals ‖A·A†·S_AorB‖² = .50.
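The revision-and-renormalize step can be sketched in a few lines; the 3-d state below is an illustrative assumption chosen only so that the .50 conditional probability of the Figure 17.1 example comes out.

```python
import numpy as np

# Sketch of quantum state revision (Lueder's rule): project the state onto
# the observed event's subspace, then renormalize to unit length.
# The state |S> is an illustrative assumption echoing Figure 17.1.
S = np.array([0.696, 0.696, 0.175])
S = S / np.linalg.norm(S)

M_AorB = np.diag([1.0, 1.0, 0.0])    # projector for "A or B" (span of |A>, |B>)
M_A = np.diag([1.0, 0.0, 0.0])       # projector for event A

T = M_AorB @ S                       # unnormalized projection
S_revised = T / np.linalg.norm(T)    # revised state = [sqrt(1/2), sqrt(1/2), 0]

q_A_given_AorB = np.linalg.norm(M_A @ S_revised) ** 2
print(q_A_given_AorB)                # 0.50: probability of A after observing "A or B"
```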
Commutativity

Classical probability assumes that for any given experiment, there is only one sample space, χ, and all events are contained in this single sample space. Consequently, the intersection between two events and the union of two events are always well defined. A single probability function p is sufficient to assign probabilities to all events of the experiment. This is called the principle of unicity (Griffiths, 2003). It follows from the commutative property of sets that joint probabilities are commutative: p(A) · p(B|A) = p(A ∩ B) = p(B ∩ A) = p(B) · p(A|B).

Quantum probability assumes that there is only one Hilbert space and all events are contained in this single Hilbert space. For a single fixed basis, such as χ = {|X_i⟩, i = 1, ..., N}, the meet and the join of two events spanned by a common set of basis vectors in χ are always well defined, and a probability function q can be used to assign probabilities to all the events defined with respect to the basis χ. When a common basis is used to define all the events, the events are compatible. The beauty of a Hilbert space is that there are many choices for the basis that one can use to describe the space. For example, in Figure 17.2, a new basis using vectors {|U⟩, |V⟩, |W⟩} was introduced to represent the state |S⟩, which was obtained by rotating the original basis {|A⟩, |B⟩, |C⟩} used in Figure 17.1. Suppose χ′ = {|Y_i⟩, i = 1, ..., N} is another orthonormal basis for the Hilbert space. If event A is spanned by χ_A ⊂ χ, and event B is spanned by χ_B ⊂ χ′, then the meet for these two events is not defined; also the join for these two events is not defined (Griffiths, 2003). In this case, the events are not compatible. That is, the projectors for these two events do not commute, M_A M_B ≠ M_B M_A, and the projectors for these two events do not share a common set of eigenvectors. In this case, it is not meaningful to assign a probability simultaneously to the pair of events {A, B} (Dirac, 1958).
When the events are incompatible, the principle of unicity breaks down and the events cannot all be described within a single sample space. The events spanned by χ, which are all compatible with each other, form one sample space; and the events spanned by χ′, which are compatible with each other, form another sample space; but the events from χ are not compatible with the events from χ′. In this case, there are two stochastically unrelated sample spaces (Dzhafarov & Kujala, 2012), and quantum theory provides a single state |S⟩ that can be used to assign probabilities to both sample spaces. For noncommutative events, probabilities are assigned to histories or sequences of events using Lüder's rule. Suppose A is an event spanned by χ_A ⊆ χ, and event B is spanned by χ_B ⊆ χ′. Consider the probability for the sequence of events: A followed by B. The probability of the first event A equals q(A) = ‖M_A|X⟩‖²; the revised state, conditioned on observing this event, equals |X_A⟩ = M_A|X⟩/‖M_A|X⟩‖; the probability of the second event, conditioned on the first event, equals q(B|A) = ‖M_B|X_A⟩‖²; therefore, the probability of event A followed by event B equals

q(A) · q(B|A) = ‖M_A|X⟩‖² · ‖M_B (M_A|X⟩/‖M_A|X⟩‖)‖² = ‖M_B M_A|X⟩‖².   (1)
If the projectors do not commute, then order matters because q(A) · q(B|A) = ‖M_B M_A|X⟩‖² ≠ ‖M_A M_B|X⟩‖² = q(B) · q(A|B). For example, referring to Figures 17.1 and 17.2, the probability of event "A or B" and then event V equals ‖M_V(M_A + M_B)|S⟩‖² = ‖V·V†·T‖² = 0. The probability of the event V and then the event "A or B" equals ‖(M_A + M_B)M_V|S⟩‖² = ‖[0.123/2, −0.123/2, 0]ᵀ‖² = 0.008. If all events are compatible, then quantum probability theory reduces to classical probability theory. In this sense, quantum probability is a generalization of classical probability theory (Gudder, 1988).
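A small sketch makes the order dependence concrete. Here one projector lives in the standard basis and the other in a rotated basis, so they do not commute; the rotation angle and state are illustrative assumptions, not the values behind Figures 17.1 and 17.2.

```python
import numpy as np

# Two incompatible events in a 2-d space: A lives in the standard basis,
# V in a rotated basis, so M_A and M_V do not commute.
theta = 0.4
V = np.array([np.cos(theta), np.sin(theta)])   # rotated basis vector |V>
A = np.array([1.0, 0.0])                        # standard basis vector |A>

M_A = np.outer(A, A)
M_V = np.outer(V, V)

X = np.array([0.6, 0.8])                        # unit-length state |X>

p_A_then_V = np.linalg.norm(M_V @ M_A @ X) ** 2  # q(A) * q(V|A)
p_V_then_A = np.linalg.norm(M_A @ M_V @ X) ** 2  # q(V) * q(A|V)
print(p_A_then_V, p_V_then_A)    # unequal: sequence probabilities depend on order
```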
Violations of the Law of Total Probability

The quantum axioms do not necessarily have to obey the classical law of total probability in the following manner. Consider an experiment with two different conditions. The first condition simply measures whether event B occurs. The second condition first measures whether A occurs, and then measures whether B occurs. For both conditions, we compute the probability of the event B. For the first condition, this is simply q(B) = ‖M_B|X⟩‖². For the second condition, we could observe the sequence with event A and then event B with probability q(A) · q(B|A) = ‖M_B M_A|X⟩‖², or we could observe the sequence with event Ā and then event B with probability q(Ā) · q(B|Ā) = ‖M_B M_Ā|X⟩‖², and so the total probability for event B in the second experiment equals the sum of these two ways:

qT(B) = ‖M_B M_A|X⟩‖² + ‖M_B M_Ā|X⟩‖².

The interference produced in this experiment is defined as the probability of event B observed in the first condition minus the total probability of event B observed in the second condition. According to quantum probability theory, the interference equals Int = q(B) − qT(B). To analyze this more closely, let us decompose the probability from the first condition as follows:

q(B) = ‖M_B|X⟩‖²
     = ‖M_B(M_A + M_Ā)|X⟩‖²
     = ‖(M_B M_A|X⟩) + (M_B M_Ā|X⟩)‖²
     = ‖M_B M_A|X⟩‖² + ‖M_B M_Ā|X⟩‖² + Int,

Int = ⟨X|M_A M_B M_Ā|X⟩ + ⟨X|M_A M_B M_Ā|X⟩*.   (2)
An interference cross-product term, denoted Int, appears in the probability q(B) from the first condition, which produces deviations from the total probability qT(B) computed from the second condition. This interference term can be positive (i.e., constructive interference), negative (i.e., destructive interference), or zero (i.e., no interference). If the two projectors commute so that M_A M_B = M_B M_A, then the interference is zero. There is also an interference term associated with the complementary probability

q(B̄) = ‖M_B̄ M_A|X⟩‖² + ‖M_B̄ M_Ā|X⟩‖² − Int.   (3)

The interference term associated with q(B̄) must be the negative of the interference term associated with q(B) because we must finally obtain

1 = q(B) + q(B̄) = ‖M_B M_A|X⟩‖² + ‖M_B M_Ā|X⟩‖² + ‖M_B̄ M_A|X⟩‖² + ‖M_B̄ M_Ā|X⟩‖².

A skeptic might argue that the preceding rules for assigning probabilities to events defined as subspaces are ad hoc, and that maybe there are many other rules one could use. In fact, a famous theorem by Gleason (1957) proves that these are the only rules one can use to assign probabilities to events defined as subspaces using an additive measure (at least for vector spaces of dimension greater than 2). Now let us turn to a couple of example applications of this theory. Quantum cognition and decision has been applied to a wide range of findings in psychology (see Box 2). In this chapter we only have space to show two illustrations: an application to probability judgment errors, and another application to violations of rational decision-making.
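Equation 2 can be checked numerically. In this sketch, B is an event in a rotated basis while A and Ā live in the standard basis; the angle and state are illustrative assumptions.

```python
import numpy as np

# Sketch of Eq. 2: q(B) = ||M_B M_A|X>||^2 + ||M_B M_Abar|X>||^2 + Int.
# Event B sits in a rotated basis, so it is incompatible with A and Abar.
theta = 0.5
B = np.array([np.cos(theta), np.sin(theta)])
M_B = np.outer(B, B)

M_A = np.diag([1.0, 0.0])        # event A in the standard basis
M_Abar = np.diag([0.0, 1.0])     # its complement

X = np.array([0.6, 0.8])         # unit-length state

q_B = np.linalg.norm(M_B @ X) ** 2
q_total = (np.linalg.norm(M_B @ M_A @ X) ** 2
           + np.linalg.norm(M_B @ M_Abar @ X) ** 2)
interference = q_B - q_total                      # nonzero: total probability fails
cross_term = 2 * (X @ M_A @ M_B @ M_Abar @ X)     # Int of Eq. 2 (real state, so 2*Re)
print(interference, cross_term)                   # the two quantities agree
```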
Application to Probability Judgment Errors

Quantum theory provides a unified and coherent account for a broad range of puzzling findings in the area of human judgments. The theory has provided accounts for order effects on attitude judgments (Wang & Busemeyer, 2013; Wang et al., 2014), inference (Trueblood & Busemeyer, 2010), and causal reasoning (Trueblood & Busemeyer, 2011). The same theory has also been used to account for conjunction and disjunction errors found with probability judgments (Franco, 2009), as well as overextension and underextension errors found in conceptual combinations (Aerts, 2009). This section briefly describes how the theory accounts for conjunction and disjunction errors in probabilistic judgments (Busemeyer et al., 2011). Conjunction and disjunction probability judgment errors are very robust, and they have been found with a wide variety of examples (Tversky & Kahneman, 1983). Here we consider an example where a judge is provided with a brief story about a hypothetical woman named Linda (circa 1980s):

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in antinuclear demonstrations.
Box 2 Applications of quantum theory to cognition and decision

1. Choice and decision time (Busemeyer, Wang, & Townsend, 2006; Fuss & Navarro, 2013)
2. Violations of rational decision theory (Pothos & Busemeyer, 2009; Yukalov & Sornette, 2011)
3. Categorization and decision (Busemeyer, Wang, & Lambert-Mogiliansky, 2009)
4. Probability judgment errors (Busemeyer, Pothos, & Trueblood, 2012)
5. Similarity judgments (Pothos, Busemeyer, & Trueblood, 2013)
6. Causal reasoning (Trueblood & Busemeyer, 2011)
7. Bistable perception (Atmanspacher & Filk, 2010)
8. Conceptual combinations (Aerts, Gabora, & Sozzo, 2013)
9. Concept vagueness (Blutner, Pothos, & Bruza, 2013)
10. Associative memory (Bruza, Kitto, Nelson, & McEvoy, 2009)
11. Memory recognition (Brainerd, Wang, & Reyna, 2013)
12. Attitude question order effects (Wang & Busemeyer, 2013; Wang, Solloway, Shiffrin, & Busemeyer, 2014)
13. Order effects on inference (Trueblood & Busemeyer, 2010)
14. Game theory (Kvam, Lambert-Mogiliansky, & Busemeyer, 2013)
Then the judge is asked to rank the likelihood of the following events: Linda is (a) active in the feminist movement, (b) a bank teller, (c) active in the feminist movement and a bank teller, (d) active in the feminist movement and not a bank teller, (e) active in the feminist movement or a bank teller. The conjunction fallacy occurs when option c is judged to be more likely than option b (even though it can be argued from the classical perspective that the latter contains the former), and the disjunction fallacy occurs when option a is judged to be more likely than option e (again, even though it can be argued that the latter contains the former). For example, in a study
(Morier & Borgida, 1984), the mean probability judgments were ordered as follows: using J(A) to denote the mean probability judgment for event A, J(feminist) = .83 > J(feminist or bank teller) = .60 > J(feminist and bank teller) = .36 > J(bank teller) = .26 (N = 64 observations per mean, and all pairwise differences are statistically significant). These results violate classical probability theory, which is the reason that they are called fallacies. What follows is a simple yet general model for these types of findings. The first assumption is that after reading the story about Linda, a person forms an initial belief state |S⟩ that represents the person's beliefs about features or properties that may or may not be true of Linda. Formally, this belief state is a vector within an N-dimensional vector space. This belief state is used to answer any possible question that might be asked about Linda. The second assumption is that a question such as "Is Linda a feminist?" is represented by an N_F < N dimensional subspace of the N-dimensional vector space. This subspace corresponds to a projector M_F that projects the state vector onto the subspace representing the feminist question. The question "Is Linda a bank teller?" is represented by another subspace of dimension N_B < N with a corresponding projector M_B. The third assumption is that the projectors M_F, M_B do not commute, so that M_F M_B ≠ M_B M_F, and thus the order of their application matters. The reason these two projectors do not commute is the following. The two concepts (feminist, bank teller) are rarely experienced together, and so the person has not formed a compatible representation of beliefs about combinations of both concepts using a common basis. A person may have formed one basis representing features related to feminists, but this basis differs from the basis used to represent features related to bank tellers.
The concepts do not share the same basis and so they are incompatible, and the person needs to change from one basis to the other sequentially in order to answer questions about each concept. The fourth assumption concerns the order in which the concepts are processed when one is asked "Is Linda a feminist and a bank teller?". Given that the events are incompatible, the person has to pick an order in which to process them. It is assumed that the more likely event is processed first. It is quite easy to judge each individual question, which establishes that J(feminist) > J(bank teller). But the question about
"feminist bank teller" is subtler, and this is not as easy as the previous two questions. The judgment for the conjunction requires forming an additional and subtler judgment about the conditional probability J(bank teller given feminist). These assumptions are now used to derive the quantum predictions for the probability of bank teller (using Eq. 2):

q(B) = ‖M_B|S⟩‖² = ‖M_B M_F|S⟩‖² + ‖M_B M_F̄|S⟩‖² + Int.

According to the quantum model, a conjunction error occurs when

q(B) = ‖M_B M_F|S⟩‖² + ‖M_B M_F̄|S⟩‖² + Int < ‖M_B M_F|S⟩‖²,

that is, when Int < −‖M_B M_F̄|S⟩‖².
Formally, the negative interference term produces the conjunction error. Intuitively, the Linda story produces a belief state that is almost orthogonal to the subspace for the bank teller event. However, if this state is first projected onto the feminist subspace (eliminating some details about Linda that make it impossible for her to be a bank teller), then it becomes a bit more likely to think that this feminist can be a bank teller too. Figure 17.3 illustrates how this works using a simple two-dimensional example (though we stress that the specification of the model is general and not restricted to one-dimensional subspaces). The probability for the bank teller alone question is determined from the direct projection from the initial state Ψ to the bank teller (BT) axis, which is shown as the shorter light grey vertical segment. The probability for the conjunction is represented
Fig. 17.3 Example of conjunction fallacy for a special case with two dimensions.
by first projecting Ψ onto feminist (F), and then projecting onto bank teller, which is shown as the longer dark grey vertical segment. Note that the projection for the conjunction (dark grey vertical segment) exceeds the projection for the bank teller alone (light grey vertical segment). The same theory can also account for the disjunction effect. The event that "Linda is a bank teller or a feminist" is the same as the denial of the event that "Linda is not a bank teller and not a feminist." According to the quantum model, the probability of the event "not a bank teller and not a feminist" equals q(B̄) · q(F̄|B̄), and so the probability of the denial equals 1 − q(B̄) · q(F̄|B̄). The disjunction error is predicted to occur when the probability of "feminist" exceeds the disjunction, that is, when q(F) > 1 − q(B̄) · q(F̄|B̄), which implies q(F̄) < q(B̄) · q(F̄|B̄). Therefore, according to the quantum model, a disjunction error occurs when

q(F̄) = ‖M_F̄ M_B|S⟩‖² + ‖M_F̄ M_B̄|S⟩‖² + Int < ‖M_F̄ M_B̄|S⟩‖²,

which implies Int < −‖M_F̄ M_B|S⟩‖², where

Int = ⟨S|M_B̄ M_F̄ M_B|S⟩ + ⟨S|M_B̄ M_F̄ M_B|S⟩*.

To account for both of these fallacies using the same principles and parameters, the model must predict the following order effect (see Busemeyer et al., 2011, appendix): ‖M_B M_F|S⟩‖² > ‖M_F M_B|S⟩‖². Intuitively, the probability obtained by first considering whether Linda is a feminist and then considering whether she is a bank teller must be greater than the probability obtained by the opposite order. Order effects in this direction have been reported: asking people to judge "Is Linda a bank teller?" before asking them to judge "Is Linda a feminist and a bank teller?" significantly reduces the size of the conjunction error as compared to the opposite order (Stolarz-Fantino, Fantino, Zizzo, & Wen, 2003).
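The two-dimensional geometry of Figure 17.3 can be sketched numerically. The 45-degree feminist direction and the initial state below are illustrative assumptions chosen only to make the effect visible, not fitted values from the chapter.

```python
import numpy as np

# 2-d sketch of the conjunction fallacy: the belief state is nearly
# orthogonal to "bank teller" (BT), but projecting first onto "feminist" (F)
# and then onto BT yields a larger probability.
BT = np.array([1.0, 0.0])
F = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])   # assumed 45-degree axis

M_BT = np.outer(BT, BT)
M_F = np.outer(F, F)

psi = np.array([np.cos(1.45), np.sin(1.45)])   # state nearly orthogonal to BT

q_BT = np.linalg.norm(M_BT @ psi) ** 2                 # bank teller alone
q_F_then_BT = np.linalg.norm(M_BT @ M_F @ psi) ** 2    # q(F) * q(BT|F)
print(q_F_then_BT > q_BT)    # True: the conjunction exceeds the single event
```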
The fact that the quantum model can account for both conjunction and disjunction errors using the same principles and same parameters is a definite advantage over other accounts, such as an averaging model. As described in Busemeyer et al. (2011), there are many other qualitative predictions that can be derived from this model. In particular, because q(F ) > q(F )q(B|F ), this model cannot produce double conjunction errors, in which the conjunction is greater than both individual events.
Empirically, single conjunction errors are indeed much more common, and double conjunction errors are infrequent (Yates & Carlson, 1986). Another prediction from this model is that, assuming the conjunction error occurs so that q(F) · q(B|F) > q(B), then q(B|F) ≥ q(B), because q(B|F) ≥ q(F) · q(B|F) > q(B). The intuition here is that, given the detailed knowledge about Linda, it is almost impossible for Linda to be a bank teller; but given that she is viewed more generally as a feminist, it is more likely to think that a feminist can also be a bank teller. This is an important prediction that needs further empirical testing. (See Tentori & Crupi, 2013, for arguments against this prediction.) The model presented in this section can account for conjunction errors, disjunction errors, averaging effects, and order effects. It is, however, only one of many possible ways to build models of probability judgments using quantum principles. In particular, Aerts (2009) and his colleagues (Aerts & Gabora, 2005) have developed alternative quantum models that account for conjunction and disjunction errors in conceptual combinations. Importantly, their model can produce double conjunction errors; but, unfortunately, it must change parameters to account for differences between conjunction and disjunction errors. In summary, the quantum axioms provide a common set of general principles that can be implemented in different ways to construct more specific and competing quantum models of the same phenomena. Each of the specific quantum models can be compared with the others and with classical models with respect to their ability to account for empirical results.
Quantum Dynamics

This section presents the quantum dynamical principles and compares them with Markov processes used in classical dynamical systems. Markov theory provides the basis for constructing a wide variety of classical probability models in cognitive science (e.g., random walk/diffusion models of decision-making). Once again, we restrict this presentation to finite dimensional systems in this chapter. Although finite, the number of dimensions can be very large, and both quantum and Markov processes can readily be extended to infinite dimensional systems. See Busemeyer et al. (2006) and Chapter 7 in Busemeyer and Bruza (2012) for a more comprehensive treatment.
State Space

Both quantum and Markov models begin with a set of N states, χ = {|X_1⟩, ..., |X_N⟩}, where the number of states, N, can be very large. According to the Markov model, a state such as |X_i⟩ represents all the information required to characterize the system at some moment, and χ represents the set of all the possible characterizations of the system across time. At any moment in time, the Markov system is exactly located at some specific state in χ, and across time the state changes from one element to another in χ. In comparison, according to the quantum model, a state such as |X_i⟩ represents a basis vector used to describe the system, and the set χ is a set of basis vectors that span an N-dimensional vector space. At any moment in time, the system is in a superposition state, |ψ⟩, which is a point within the vector space spanned by χ, and across time the point |ψ⟩ moves around in the vector space (until a measurement occurs, which reduces the state to the observed basis vector).
Initial State

According to the Markov model, the system starts at some particular element of χ. However, this initial state may be unknown to the investigator, in which case a probability, denoted 0 ≤ φ_i(0) ≤ 1, is assigned to each state |X_i⟩. The N initial probabilities form an N × 1 column matrix

φ(0) = [φ_1(0), ..., φ_N(0)]ᵀ.

It will be convenient to define a 1 × N row matrix J = [1 · · · 1], which is used for summation. More generally, φ(t) represents the probability distribution across states in χ at time t. The Markov model requires this probability distribution to sum to unity: J · φ(t) = 1.

According to the quantum model, the system starts in a superposition state |ψ(0)⟩ = Σ_i ψ_i(0) · |X_i⟩, where ψ_i is the coordinate (called an amplitude) assigned to the basis vector |X_i⟩. The N amplitudes for the initial state form an N × 1 column matrix

ψ(0) = [ψ_1(0), ..., ψ_N(0)]ᵀ.

More generally, ψ(t) represents the amplitude distribution across basis vectors in χ at time t. The quantum model requires the squared length of this amplitude distribution to equal unity: ψ(t)† ψ(t) = 1.
State Transitions

According to the Markov model, the probability distribution across states evolves across time according to the linear transition law

φ(t + τ) = T(t + τ, t) · φ(t),

where T(t + τ, t) is a transition matrix with element T_ij representing the probability of transiting to a state in row i from a state in column j. The transition matrix of a Markov model is called stochastic because the columns of T(t + τ, t) must sum to one to guarantee that the resulting probability distribution continues to sum to one, that is, J · φ(t + τ) = 1, and recall that J = [1 1 1 ... 1]. (The rows, however, are not required to sum to one.) In many applications, it is assumed that the transition matrix is stationary, so that T(t_2 + τ, t_2) = T(t_1 + τ, t_1) = T(τ) for all t and τ.

According to the quantum model, the amplitude distribution evolves across time according to the linear transition law

ψ(t + τ) = U(t + τ, t) · ψ(t),

where U(t + τ, t) is a unitary matrix with element U_ij representing the amplitude for transiting to row i from column j. The unitary matrix must satisfy the unitary property U† · U = I (I is the identity matrix) in order to guarantee that ψ(t)† ψ(t) = 1. That is, the columns are unit length and each pair of columns is orthogonal. A transition matrix can be formed from the unitary matrix by taking the squared modulus of each of the cell entries of U(t + τ, t). The transition matrix formed in this manner is doubly stochastic: both the rows and columns of this transition matrix must sum to unity. This is a more restrictive constraint on the transition matrix as compared to the Markov model. In many applications, it is assumed that the unitary matrix is stationary, so that U(t_2 + τ, t_2) = U(t_1 + τ, t_1) = U(τ) for all t and τ.
According to the Markov model, the stationary transition matrix obeys the Kolmogorov forward equation

dT(t)/dt = K · T(t),

where K is the intensity matrix, with element K_ij, where K_ij ≥ 0 for i ≠ j, and Σ_i K_ij = 0 to guarantee that T(t) remains a transition matrix.
According to the quantum model, the stationary unitary matrix obeys the Schrödinger equation

dU(t)/dt = −i · H · U(t),

where H is the Hamiltonian matrix, which is a Hermitian matrix, H† = H, to guarantee that U(t) is a unitary matrix. This is where complex numbers enter in a significant way into quantum models. For the Markov model, the solution to the Kolmogorov forward equation is the following matrix exponential function:

T(t) = exp(t · K).

For the quantum model, the solution to the Schrödinger equation is the following complex matrix exponential function:

U(t) = exp(−i · t · H).

In summary, the probability distribution across states for the Markov model at time t equals

φ(t) = exp(t · K) · φ(0),

and likewise the amplitude distribution across states for the quantum model at time t equals

ψ(t) = exp(−i · t · H) · ψ(0).

The most important step for building a dynamic model is specifying the intensity matrix for the Markov model or specifying the Hamiltonian matrix for the quantum model. Here psychological science enters by developing a mapping from the psychological factors onto the parameters that define these matrices. An example is provided following this section to illustrate this model development process.
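Both solutions are one-line matrix exponentials in code. The two-state intensity matrix K and Hamiltonian H below are illustrative choices, not a model of any particular task.

```python
import numpy as np
from scipy.linalg import expm

# Markov vs. quantum dynamics on two states. K has columns summing to zero,
# so T(t) = expm(t*K) is a column-stochastic transition matrix; H is
# Hermitian (real symmetric), so U(t) = expm(-i*t*H) is unitary.
K = np.array([[-1.0,  2.0],
              [ 1.0, -2.0]])
H = np.array([[ 1.0,  2.0],
              [ 2.0, -1.0]])

t = 0.7
T = expm(t * K)                    # Markov transition matrix T(t)
U = expm(-1j * t * H)              # quantum unitary matrix U(t)

phi = T @ np.array([1.0, 0.0])                   # phi(t) = T(t) phi(0)
psi = U @ np.array([1.0, 0.0], dtype=complex)    # psi(t) = U(t) psi(0)

print(phi.sum())                   # 1.0: probabilities still sum to one
print(np.vdot(psi, psi).real)      # 1.0: amplitudes still have unit squared length
print(np.abs(U) ** 2)              # doubly stochastic: rows and columns sum to one
```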
Response Probabilities

Consider the probability of observing the response R_k at time t, which is denoted p(R(t) = R_k). In this section, we use the same choice probability notation for both the Markov and quantum models. Assume that φ(t) is the current probability distribution for the Markov model and ψ(t) is the current amplitude distribution for the quantum model at time t. Both the Markov and quantum models determine the probability of a response by evaluating the set of states that map onto that particular response. Suppose a subset of states, χ_k ⊂ χ, is mapped onto a response R_k. Define M_k as an N × N indicator matrix, which is a diagonal matrix with ones on the diagonal corresponding to the states mapped onto the response R_k, and zeros everywhere else. Then according to the Markov model, the response probability equals (recall J = [1 1 1 ... 1])

p(R(t) = R_k) = J · M_k · φ(t).

If in fact the response R_k is observed at time t, then the new probability distribution, conditioned on this observation, equals

φ(t|R_k) = M_k · φ(t) / p(R(t) = R_k).

According to the quantum model, the response probability equals

p(R(t) = R_k) = ‖M_k · ψ(t)‖².

If in fact the response R_k is observed at time t, then the new amplitude distribution, conditioned on this observation, equals

ψ(t|R_k) = M_k · ψ(t) / √(p(R(t) = R_k)).
The conditional states, φ (t|Rk ) for the Markov model and ψ (t|Rk ) for the quantum model, then become the “initial” states to be used for further evolution and future observations.
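The response rule and the conditioning step can be sketched for a small system. The three-state space and the mapping of states onto responses below are illustrative assumptions.

```python
import numpy as np

# Three states; states 1 and 2 map onto response R1 (an assumed mapping).
M1 = np.diag([1.0, 1.0, 0.0])     # indicator matrix for response R1
J = np.ones(3)                     # 1 x N row of ones, used for summation

# Markov version: p(R1) = J * M1 * phi(t), then renormalize.
phi = np.array([0.2, 0.3, 0.5])            # current probability distribution
p_markov = J @ (M1 @ phi)                   # 0.5
phi_given_R1 = (M1 @ phi) / p_markov        # conditional distribution

# Quantum version: p(R1) = ||M1 psi(t)||^2, then renormalize the amplitude.
psi = np.array([0.2, 0.4, 0.894], dtype=complex)
psi = psi / np.linalg.norm(psi)             # enforce unit length
p_quantum = np.linalg.norm(M1 @ psi) ** 2
psi_given_R1 = (M1 @ psi) / np.sqrt(p_quantum)

print(p_markov, p_quantum)
print(phi_given_R1.sum())            # 1.0: conditional distribution sums to one
print(np.linalg.norm(psi_given_R1))  # 1.0: conditional state is unit length
```

The conditional outputs are exactly the "initial" states that the text says feed into further evolution.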
Application to Decision Making

This section examines two puzzling findings from decision research. One is the violation of the "sure thing" principle (Tversky & Shafir, 1992). Savage (1954) introduced the "sure thing" principle as a rational axiom for the foundation of decision theory. According to the sure thing principle, if under state of the world X you prefer action A over B, and if under the complementary state of the world X̄ you also prefer action A over B, then you should prefer action A over B even when you do not know the state of the world. A violation of the sure thing principle occurs when A is preferred over B for each known state of the world, but the opposite preference occurs when the state of the world is unknown. The other puzzling finding is the violation of the principle of dynamic consistency, called dynamic inconsistency. Dynamic consistency is considered in standard theory to be a rational principle for making dynamic decisions involving a sequence of actions and events over time. According to the backward induction algorithm used to form optimal plans with dynamic decisions, a person works backward, making plans at the end of the sequence
in order to decide actions at the beginning of the sequence. To be dynamically consistent, when reaching the decisions at the end of the sequence, one should follow through on the plan that was used to make the decision at the beginning of the sequence. Violations of dynamic consistency occur when people change plans and fail to follow through on a plan once they arrive at the final decisions.
Two-Stage Gambling Paradigm

Tversky and Shafir (1992) experimentally investigated the sure thing principle using a two-stage gamble. They presented 98 students with a target gamble that had an equal chance of winning $200 or losing $100 (they used hypothetical money). The students were asked to imagine that they had already played the target gamble once, and now they were asked whether they wanted to play the same gamble a second time. Each person experienced three conditions that were separated by a week and mixed with other decision problems to produce independent decisions. They were asked if they wanted to play the gamble a second time, given that they won the first play (Condition 1: known win), given that they lost the first play (Condition 2: known loss), and when the outcome of the first play was unknown (Condition 3: unknown). If they thought they had won the first gamble, the majority (69%) chose to play again; if they thought they had lost the first gamble, then again the majority (59%) chose to play again; but if they didn't know whether they had won or lost, then the majority chose not to play (only 36% wanted to play again). Tversky and Shafir (1992) explained these findings by claiming that people fail to follow through on consequential reasoning. When a person knows she/he has won the first gamble, then a reason to play again arises from the fact that she/he has extra house money to play with. When the person knows she/he has lost the first gamble, then a reason to play again arises from the fact that she/he needs to recover their losses. When the person does not know the outcome of the first play, these reasons fail to arise. However, why not? Pothos and Busemeyer (2009) explained these and other results found by Shafir and Tversky (1992) using the concept of quantum interference. Referring back to the section Violations of the Law of Total Probability, define the event B as deciding to play the gamble on the second stage, define event
A as winning the first play, and define event Ā as losing the first play. Then Eq. 2 expresses the probability of playing the gamble on the second stage for the unknown condition in terms of the total probability, qT(B), of playing the second stage in either of the two known conditions, plus the interference term Int. Given that the probability of winning the first stage equals .50, a violation of the sure thing principle is predicted whenever qT(B) > .50 and the interference term Int is sufficiently negative that q(B) < .50. But what determines the interference term? To answer this question, Pothos and Busemeyer (2009) developed a dynamic quantum model to account for the violation of the sure thing principle. This model is described in detail later, but before presenting these modeling details, let us first examine the second puzzling finding regarding violations of dynamic consistency. The same model is used to explain both findings. Barkan and Busemeyer (1999, 2003) used the same two-stage gambling paradigm to study another phenomenon called dynamic inconsistency, which occurs whenever a person changes plans during decision-making. Each study included a total of 100 people, and each person played a series of gambles twice. Each gamble had an equal chance of producing a win or a loss (e.g., an equal chance to win 200 points or lose 100 points, where each point was worth $0.01). Different gambles were formed by changing the amounts to win or lose. For each gamble, the person was forced to play the first round, and then, contingent on the outcome of the first round, they were given a choice whether to take the same gamble on the second round. Choices were made under two conditions: a planned versus a final choice.
For the planned choice, contingent on winning the first round, the person had to select a plan about whether to take or reject the gamble on the second round; contingent on losing the first round, the person had to make another plan about whether to take or reject the gamble on the second round. Then the first-stage gamble was actually played out and the actual win or loss was revealed. For the final choice, after actually experiencing the win on the first round, the person made a final decision to take or reject the second round. Likewise, after actually experiencing a loss on the first round, the person had to decide whether to take or reject the gamble on the second round. The plan and the final decisions were made equally valuable because the experimenter randomly selected either the planned action or the final action to determine
new directions
the final payoff with real money at stake. The results showed that people violate the dynamic consistency principle: Following an actual win, they changed from planning to take to finally rejecting the second stage; following an actual loss, they changed from planning to reject to finally taking the second stage. For example, Table 17.1 shows the results from the four gambles used by Barkan and Busemeyer (1999). The first two columns show the amounts to win or lose, the next two columns show the probability of taking the gamble under the plan (conditioned on a planned win or loss), and the last two columns show the probability of taking the gamble for the final decision (conditioned on an experienced win or loss). Similar results were found by Barkan and Busemeyer (2003) using 17 different gambles. For later reference, we will denote the amount of the win by xW and the amount of the loss by xL. So, for example, in the first row, xW = 80 and xL = 100. It is worth mentioning that the results shown in Table 17.1 once again demonstrate a violation of the classical law of total probability in the following way. If the law of total probability holds, then the probability of taking the gamble during the plan (denoted p(T|P) and shown in the columns labeled "Plan Win, Plan Loss") equals the probability of winning the first play (denoted p(W), which was stated to be equal to .50) times the probability of taking the gamble after a win (denoted p(T|W) and shown under the column "Final Win" in Table 17.1) plus the probability of losing the first play (denoted p(L) = 1 − p(W)) times the probability of taking the gamble following a loss (denoted p(T|L) and shown under the column "Final Loss" in Table 17.1), so that p(T|P) = p(W) · p(T|W) + p(L) · p(T|L). All the gambles have the same probability of winning, and p(W) is fixed across gambles and is stated in the problem to be equal to .50. However, these assumptions fail to reproduce the findings shown in Table 17.1.
For example, we require p(W ) = .64 to reproduce the data in the first row, but we require p(W ) = .43 to reproduce the third row, and we require p(W ) = .31 to reproduce the data in the fourth row, and even worse, no legitimate value of p(W ) can be found to reproduce the second row.
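These back-solved values can be checked with a few lines of arithmetic. The sketch below is ours, not the authors' analysis script: it averages the two plan columns into a single p(T|P), which is a simplifying assumption, so the implied values come out near, but not exactly at, the .64 and .43 quoted in the text.

```python
# Back-solve p(T|P) = p(W) * p(T|W) + (1 - p(W)) * p(T|L) for p(W),
# row by row, from Table 17.1 (plan win, plan loss, final win, final loss).
rows = [
    (.25, .26, .20, .35),
    (.76, .72, .69, .73),
    (.68, .68, .60, .75),
    (.84, .86, .76, .89),
]

def implied_p_win(plan_w, plan_l, final_w, final_l):
    p_plan = (plan_w + plan_l) / 2          # our simplifying assumption
    return (p_plan - final_l) / (final_w - final_l)

implied = [implied_p_win(*r) for r in rows]
# rows 1, 3, 4 imply p(W) of about .63, .47, .31 rather than the stated .50,
# and row 2 implies a negative "probability" -- no legitimate p(W) exists
assert implied[1] < 0

# Tversky and Shafir's percentages show the same failure: total probability
# forces the unknown condition between .59 and .69, yet only .36 played.
assert not (.59 <= .36 <= .69)
```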
Table 17.1. Barkan and Busemeyer (1999).

Win   Lose   Plan Win   Plan Loss   Final Win   Final Loss
 80    100     .25         .26         .20          .35
 80     40     .76         .72         .69          .73
200    100     .68         .68         .60          .75
200     40     .84         .86         .76          .89

Markov Dynamic Model for Two-Stage Gambles

First let us construct a general Markov model for this two-stage gambling task. The Markov model
uses a four-dimensional state space with states $\{|B_W A_T\rangle, |B_W A_R\rangle, |B_L A_T\rangle, |B_L A_R\rangle\}$, where $|B_W A_T\rangle$ represents the state "believe you win the first gamble and act to take the second gamble," $|B_W A_R\rangle$ represents the state "believe you win the first gamble and act to reject the second gamble," $|B_L A_T\rangle$ represents the state "believe you lose the first gamble and act to take the second gamble," and $|B_L A_R\rangle$ represents the state "believe you lose the first gamble and act to reject the second gamble." The probability distribution over states is represented by a $4 \times 1$ column matrix (that sums to unity) composed of two parts
$$\phi = \phi_W + \phi_L, \qquad \phi_W = \begin{bmatrix} \phi_{WT} \\ \phi_{WR} \\ 0 \\ 0 \end{bmatrix}, \qquad \phi_L = \begin{bmatrix} 0 \\ 0 \\ \phi_{LT} \\ \phi_{LR} \end{bmatrix}.$$
Before evaluating the payoffs of the gamble, the decision maker has an initial state represented by $\phi(0)$. This initial state depends on information about the outcome of the first play. If the outcome of the first play is unknown (i.e., the planning stage), then the initial state is set equal to $\phi(0) = \phi_U$, which has coordinates $\phi_{WT} = \phi_{WR} = \phi_{LT} = \phi_{LR} = \frac{1}{4}$. If the first play is known to be a win, then the initial state is set equal to $\phi(0) = \phi_W$ with coordinates $\phi_{WT} = \phi_{WR} = \frac{1}{2}$, $\phi_{LT} = \phi_{LR} = 0$. If the first play is known to be a loss, then the initial state is set equal to $\phi(0) = \phi_L$ with coordinates $\phi_{WT} = \phi_{WR} = 0$, $\phi_{LT} = \phi_{LR} = \frac{1}{2}$. The probabilities of taking the gamble, depending on the win or lose first-game belief states, are then determined by a transition matrix
$$T(t) = \begin{bmatrix} T_W & 0 \\ 0 & T_L \end{bmatrix},$$
where $T_W$ is a $2 \times 2$ transition matrix conditioned on winning, and $T_L$ is a $2 \times 2$ transition matrix conditioned on losing.
The matrix that picks out the states corresponding to the action of "taking the gamble" is represented by
$$M = \begin{bmatrix} M_T & 0 \\ 0 & M_T \end{bmatrix}, \qquad M_T = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}.$$
Recall that $J = \begin{bmatrix} 1 & 1 & 1 & 1 \end{bmatrix}$ sums across states to obtain the probability of a response. Finally, the Markov model predicts:
$$p(T|W) = J \cdot M \cdot T(t) \cdot \phi_W = J \cdot M_T \cdot T_W \cdot \begin{bmatrix} .50 \\ .50 \end{bmatrix},$$
$$p(T|L) = J \cdot M \cdot T(t) \cdot \phi_L = J \cdot M_T \cdot T_L \cdot \begin{bmatrix} .50 \\ .50 \end{bmatrix},$$
$$p(T|U) = J \cdot M \cdot T(t) \cdot \phi_U = (.50) \cdot p(T|W) + (.50) \cdot p(T|L).$$
The last line shows that the Markov model must satisfy the law of total probability; thus it already qualitatively fails to account for the violation of the sure thing principle and the dynamic inconsistency results described earlier.
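As a numerical check on this algebra, here is a minimal sketch; the entries of T_W and T_L below are hypothetical placeholders (the model leaves them free), chosen only so that the matrices are proper transition matrices.

```python
import numpy as np

# Hypothetical 2x2 transition matrices acting on column-vector states
# (columns sum to 1): T_W conditioned on a win, T_L on a loss.
TW = np.array([[.7, .7],
               [.3, .3]])
TL = np.array([[.6, .6],
               [.4, .4]])

T = np.block([[TW, np.zeros((2, 2))],
              [np.zeros((2, 2)), TL]])        # block-diagonal T(t)
MT = np.array([[1., 0.], [0., 0.]])
M = np.block([[MT, np.zeros((2, 2))],
              [np.zeros((2, 2)), MT]])        # picks out the "take" states
J = np.ones((1, 4))                           # sums across states

phiW = np.array([[.5], [.5], [0.], [0.]])     # known win
phiL = np.array([[0.], [0.], [.5], [.5]])     # known loss
phiU = np.full((4, 1), .25)                   # unknown (planning stage)

p_take = lambda phi: float(J @ M @ T @ phi)
pW, pL, pU = p_take(phiW), p_take(phiL), p_take(phiU)

# The Markov model is forced to obey the law of total probability:
assert abs(pU - (.5 * pW + .5 * pL)) < 1e-12
```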
Quantum Dynamic Model for Two-Stage Gambles
Pothos and Busemeyer (2009) developed a quantum dynamic model that has been applied to the two-stage gambling task. The quantum model also uses a four-dimensional vector space spanned by four basis vectors $\{|B_W A_T\rangle, |B_W A_R\rangle, |B_L A_T\rangle, |B_L A_R\rangle\}$, where $|B_W A_T\rangle$ represents the event "believe you win the first gamble and act to take the second gamble," $|B_W A_R\rangle$ represents the event "believe you win the first gamble and act to reject the second gamble," $|B_L A_T\rangle$ represents the event "believe you lose the first gamble and act to take the second gamble," and $|B_L A_R\rangle$ represents "believe you lose the first gamble and act to reject the second gamble." The decision-maker's state is a superposition over these four basis states:
$$|\psi\rangle = \psi_{WT} \cdot |B_W A_T\rangle + \psi_{WR} \cdot |B_W A_R\rangle + \psi_{LT} \cdot |B_L A_T\rangle + \psi_{LR} \cdot |B_L A_R\rangle.$$
The matrix representation of this superposition state is the $4 \times 1$ column matrix (of length equal to one) composed of two parts
$$\psi = \psi_W + \psi_L, \qquad \psi_W = \begin{bmatrix} \psi_{WT} \\ \psi_{WR} \\ 0 \\ 0 \end{bmatrix}, \qquad \psi_L = \begin{bmatrix} 0 \\ 0 \\ \psi_{LT} \\ \psi_{LR} \end{bmatrix}.$$
Before evaluating the payoffs of the gamble, the decision maker has an initial state represented by $\psi(0)$. This initial state depends on information about the outcome of the first play. If the outcome of the first play is unknown (i.e., the planning stage), then the initial state is set equal to $\psi(0) = \psi_U$, which has coordinates $\psi_{WT} = \psi_{WR} = \psi_{LT} = \psi_{LR} = .50$. If the first play is known to be a win, then the initial state is set equal to $\psi(0) = \psi_W$ with coordinates $\psi_{WT} = \psi_{WR} = \sqrt{.50}$, $\psi_{LT} = \psi_{LR} = 0$. If the first play is known to be a loss, then the initial state is set equal to $\psi(0) = \psi_L$ with coordinates $\psi_{WT} = \psi_{WR} = 0$, $\psi_{LT} = \psi_{LR} = \sqrt{.50}$. Evaluation of the gamble payoffs causes the initial state $\psi(0)$ to evolve into a final state $\psi(t)$ after a period of deliberation time $t$, and this final state is used to decide whether to take or reject the gamble at the second stage. The Hamiltonian $H$ used for this evolution is $H = H_1 + H_2$, where
$$H_1 = \begin{bmatrix} H_W & 0 \\ 0 & H_L \end{bmatrix}, \qquad H_W = \frac{1}{\sqrt{1+h_W^2}} \begin{bmatrix} h_W & 1 \\ 1 & -h_W \end{bmatrix}, \qquad H_L = \frac{1}{\sqrt{1+h_L^2}} \begin{bmatrix} h_L & 1 \\ 1 & -h_L \end{bmatrix},$$
$$H_2 = \frac{-c}{\sqrt{2}} \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & 1 \end{bmatrix}.$$
The matrix $H_W$ in the upper left corner of $H_1$ rotates the state toward taking or rejecting the gamble depending on the final payoffs $(x_W + x_W, x_W - x_L)$, given an initial win of the amount $x_W$ from the first play. The coefficients $h_W$ and $h_L$ in the Hamiltonian $H_1$ are supposed to range between $-1$ and $+1$, so we need to map the utility differences into this range. The hyperbolic tangent provides a smooth S-shaped mapping. We then define $h_W$ in terms of the utility difference following a win as follows:
$$h_W = \frac{2}{1 + e^{-D_W}} - 1, \qquad D_W = u_W - x_W^a,$$
$$u_W = \begin{cases} \frac{1}{2} \cdot (x_W + x_W)^a + \frac{1}{2} \cdot (x_W - x_L)^a & (x_W > x_L) \\ \frac{1}{2} \cdot (x_W + x_W)^a - \frac{1}{2} \cdot b \cdot (x_L - x_W)^a & (x_L > x_W). \end{cases}$$
The variable $u_W$ is the utility of playing the gamble after a win, which uses a risk-aversion parameter $a$ and a loss-aversion parameter $b$. The variable $D_W$ is the difference between the utility of taking and rejecting the gamble after a win. The matrix $H_L$ in the bottom right corner of $H_1$ rotates the state toward taking or rejecting the gamble depending on the final payoffs $(x_W - x_L, -x_L - x_L)$, given an initial loss of the amount $x_L$ from the first play. Once again using the hyperbolic tangent, we map the utility differences following a loss into $h_L$ as follows:
$$h_L = \frac{2}{1 + e^{-D_L}} - 1, \qquad D_L = u_L - x_L^a,$$
$$u_L = \begin{cases} \frac{1}{2} \cdot (x_W - x_L)^a - \frac{1}{2} \cdot b \cdot (x_L + x_L)^a & (x_W > x_L) \\ -\frac{1}{2} \cdot b \cdot (x_L - x_W)^a - \frac{1}{2} \cdot b \cdot (x_L + x_L)^a & (x_L > x_W). \end{cases}$$
The variable $u_L$ is the utility of playing the gamble after a loss, which uses the same risk-aversion
parameter $a$ and the same loss-aversion parameter $b$. The variable $D_L$ is the difference between the utility of taking and rejecting the gamble after a loss. The matrix $H_2$ is designed to align beliefs with actions. This produces a type of "hot hand" effect. The parameter $c$ determines the extent to which beliefs can change from their initial values during the evaluation process, and it is critical for producing interference effects. Critically, if the parameter $c$ is set to zero, then the quantum model reduces to a special case of a Markov model, the law of total probability holds, and there are no interference effects. According to the quantum-model hypothesis, a nonzero value of this parameter $c$ is expected, which will reproduce the 17 different quantum interference terms for the 17 different gambles. The initial state evolves to the final state according to the unitary evolution
$$\psi(t) = \exp(-i \cdot t \cdot H) \cdot \psi(0).$$
Following Pothos and Busemeyer (2009), the deliberation time was set equal to $t = \frac{\pi}{2}$, because at this time point the evolution of preference first reaches an extreme point. The projector for choosing to gamble is represented by the indicator matrix that picks the "take gamble" action
$$M = \begin{bmatrix} M_T & 0 \\ 0 & M_T \end{bmatrix}, \qquad M_T = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}.$$
The probability of taking the gamble for the known win, known loss, and unknown (plan) conditions then equals
$$p(T|W) = \left\| M \cdot \exp(-i \cdot t \cdot H) \cdot \psi_W \right\|^2,$$
$$p(T|L) = \left\| M \cdot \exp(-i \cdot t \cdot H) \cdot \psi_L \right\|^2,$$
$$p(T|U) = \left\| M \cdot \exp(-i \cdot t \cdot H) \cdot \psi_U \right\|^2.$$
The parameter $c$ is critical for producing violations of the law of total probability. If we set the quantum-model parameter $c = 0$, then the quantum model predicts
$$p(T|U) = \left\| \begin{bmatrix} M_T & 0 \\ 0 & M_T \end{bmatrix} \begin{bmatrix} \exp(-i \cdot t \cdot H_W) & 0 \\ 0 & \exp(-i \cdot t \cdot H_L) \end{bmatrix} \left( \begin{bmatrix} .50 \\ .50 \\ 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ .50 \\ .50 \end{bmatrix} \right) \right\|^2$$
$$= (.50) \cdot \left\| M_T \cdot \exp(-i \cdot t \cdot H_W) \cdot \begin{bmatrix} \sqrt{.50} \\ \sqrt{.50} \end{bmatrix} \right\|^2 + (.50) \cdot \left\| M_T \cdot \exp(-i \cdot t \cdot H_L) \cdot \begin{bmatrix} \sqrt{.50} \\ \sqrt{.50} \end{bmatrix} \right\|^2$$
$$= (.50) \cdot p(T|W) + (.50) \cdot p(T|L).$$
Therefore, if $c = 0$, the quantum model satisfies the law of total probability; but if $c \neq 0$, the quantum model can violate it. In fact, when $c = 0$, the Markov model can reproduce the predictions of the quantum model by setting each element of the first row of the transition matrix $T_W$ equal to the $p(T|W)$ predicted by the quantum model, and by setting each element of the first row of the transition matrix $T_L$ equal to the $p(T|L)$ predicted by the quantum model. In other words, we can obtain a Markov model from the quantum model by setting $c = 0$.
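The full pipeline, from payoffs to the three choice probabilities, fits in a short script. This is our sketch of the equations as printed here; the payoff values are illustrative, and the matrix exponential is computed by eigendecomposition, which is valid because H is real symmetric.

```python
import numpy as np

def h_coef(u, reject_u):
    """Map a utility difference D = u - reject_u into (-1, 1)."""
    D = u - reject_u
    return 2.0 / (1.0 + np.exp(-D)) - 1.0

def take_probs(xW, xL, a, b, c, t=np.pi / 2):
    # Utilities of playing again after a win / a loss; the xL > xW branch
    # applies the loss-aversion weight b to negative totals.
    if xW > xL:
        uW = .5 * (xW + xW)**a + .5 * (xW - xL)**a
        uL = .5 * (xW - xL)**a - .5 * b * (xL + xL)**a
    else:
        uW = .5 * (xW + xW)**a - .5 * b * (xL - xW)**a
        uL = -.5 * b * (xL - xW)**a - .5 * b * (xL + xL)**a
    hW = h_coef(uW, xW**a)
    hL = h_coef(uL, xL**a)

    def rot(h):  # 2x2 payoff-rotation block of H1
        return np.array([[h, 1.], [1., -h]]) / np.sqrt(1. + h * h)

    H1 = np.block([[rot(hW), np.zeros((2, 2))],
                   [np.zeros((2, 2)), rot(hL)]])
    H2 = (-c / np.sqrt(2)) * np.array([[1, 0, 1, 0],
                                       [0, -1, 0, 1],
                                       [1, 0, -1, 0],
                                       [0, 1, 0, 1]], float)
    w, V = np.linalg.eigh(H1 + H2)                 # H is real symmetric
    U = V @ np.diag(np.exp(-1j * t * w)) @ V.T     # exp(-i * t * H)
    M = np.diag([1., 0., 1., 0.])                  # projects on "take"
    s = np.sqrt(.5)
    psiW = np.array([s, s, 0., 0.])                # known win
    psiL = np.array([0., 0., s, s])                # known loss
    psiU = np.full(4, .5)                          # unknown (plan)
    p = lambda psi: float(np.linalg.norm(M @ U @ psi) ** 2)
    return p(psiW), p(psiL), p(psiU)

# With c = 0 the model reduces to a Markov model: no interference.
pW, pL, pU = take_probs(200, 100, a=0.71, b=2.54, c=0.0)
assert abs(pU - (.5 * pW + .5 * pL)) < 1e-10

# With the fitted c = -4.40 the law of total probability is violated.
qW, qL, qU = take_probs(200, 100, a=0.71, b=2.54, c=-4.40)
interference = qU - (.5 * qW + .5 * qL)
assert abs(interference) > 1e-6
```

The eigendecomposition route avoids any dependence on a dedicated matrix-exponential routine, and the block-diagonal structure at c = 0 is what makes the cross terms between the win and loss branches vanish.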
Model Comparisons

Next, we compare the Markov model (obtained by setting c = 0 in the quantum model) and the quantum model (allowing c ≠ 0) with respect to their fits to the Barkan and Busemeyer (2003) results in three different ways. The first is to compare least-squares fits to the 17 (gambles with different payoff conditions) × 2 (plan versus final) = 34 mean choice proportions reported in Barkan and Busemeyer (2003). The second is to compare the models using maximum likelihood estimates at the individual level together with AIC and BIC methods. The third is to estimate the hierarchical Bayesian posterior distribution for the critical parameter c that distinguishes the two models. The models are first compared using
$$R^2 = 1 - \frac{SSE}{TSS}, \qquad \text{adjusted } R^2 = 1 - \frac{SSE}{TSS} \cdot \frac{34-1}{34-n}.$$
The latter index includes a penalty term for extra parameters. These statistics were computed with
$$SSE = \sum \left( P_i - p_i \right)^2 \quad \text{and} \quad TSS = \sum \left( P_i - \bar{P} \right)^2,$$
where $P_i$ is the observed mean proportion of trials to choose gamble $i = 1, \ldots, 34$, $p_i$ is the predicted mean proportion, $\bar{P}$ is the grand mean, and 34 = 17 (payoff conditions) × 2 (plan versus final stage choices) is the number of observed choice proportions being fit. The quantum model has n = 3 parameters, and the best-fitting parameters (minimizing the sum of squared error) are a = 0.71 (risk aversion), b = 2.54 (loss aversion), and c = −4.40. The risk-aversion parameter is a bit below one, as expected, and the loss-aversion parameter b exceeds one, as it should. The
model produced an R² = 0.8234 and an adjusted R² = 0.8120. The Markov model, obtained by setting c = 0 in the quantum model, has only two parameters, and it produced an R² = 0.7854 and an adjusted R² = 0.7787, which are lower than those of the quantum model. Next, the models were compared using AIC and BIC methods based on maximum likelihood fits to individuals. For person i on trial t we observe a data pattern Xi(t) = [xTT(t), xTR(t), xRT(t), xRR(t)] defined by xjk(t) = 1 if event (j, k) occurs and zero otherwise, where TT is the event "planned to take the gamble and finally took the gamble," TR is the event "planned to take the gamble but finally rejected the gamble," RT is the event "planned to reject the gamble but finally took the gamble," and RR is the event "planned to reject the gamble and finally rejected the gamble." To allow for possible dependencies between a pair of choices within a single trial, an additional memory recall parameter, m, was included in each model. For both models, it was assumed that there is some probability 0 ≤ m ≤ 1 that the person simply recalls and repeats the planned choice during the final choice, and there is some probability 1 − m that the person forgets or ignores the planned choice when making the final choice. After including this memory parameter, the prediction for each event becomes
$$p_{TT} = p(T|\text{plan}) \cdot \left( m \cdot 1 + (1-m) \cdot p(T|\text{final}) \right)$$
$$p_{TR} = p(T|\text{plan}) \cdot (1-m) \cdot p(R|\text{final})$$
$$p_{RT} = p(R|\text{plan}) \cdot (1-m) \cdot p(T|\text{final})$$
$$p_{RR} = p(R|\text{plan}) \cdot \left( m \cdot 1 + (1-m) \cdot p(R|\text{final}) \right)$$
Using these definitions for each model, the log likelihood function for the 33 trials (see Note 6; with a pair of plan and final choices on each trial) from a single person can be expressed as
$$\ln L(X_i(t)) = \sum_{j,k} x_{jk}(t) \cdot \ln p_{jk}, \qquad \ln L(X_i) = \sum_{t=1}^{33} \ln L(X_i(t)).$$
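This likelihood is easy to sketch; the choice probabilities below are hypothetical, whereas in the actual fits p(T|plan) and p(T|final) come from each model.

```python
import math

def event_probs(p_take_plan, p_take_final, m):
    """Joint probabilities of the four (plan, final) event pairs, mixing
    memory-based repetition (probability m) with a fresh final choice."""
    pT, pF = p_take_plan, p_take_final
    return {
        "TT": pT * (m + (1 - m) * pF),
        "TR": pT * (1 - m) * (1 - pF),
        "RT": (1 - pT) * (1 - m) * pF,
        "RR": (1 - pT) * (m + (1 - m) * (1 - pF)),
    }

def log_likelihood(observed_events, p_take_plan, p_take_final, m):
    """ln L(X_i): sum of ln p_jk over one person's observed trial events."""
    probs = event_probs(p_take_plan, p_take_final, m)
    return sum(math.log(probs[e]) for e in observed_events)

# hypothetical parameter values and one person's observed event sequence
probs = event_probs(p_take_plan=.6, p_take_final=.7, m=.3)
assert abs(sum(probs.values()) - 1.0) < 1e-9    # a proper distribution
ll = log_likelihood(["TT", "TT", "RR", "TR"], .6, .7, .3)
assert ll < 0
```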
The log likelihood from each person was converted into $G_i^2 = -2 \cdot \ln(L_i)$, which indexes the lack of fit, and the parameters that minimized $G_i^2$ were found for each person (see Note 7). The quantum model has one more parameter than the Markov model. In this case, the AIC badness-of-fit index is defined as $G_i^2 + 2$, where 2 is the penalty for the one extra parameter. Using AIC, 48 out of the 100 participants produced AIC indices favoring the quantum model over the Markov model. The BIC penalty depends on the number of observations, which is 33 for each person, and so for one extra parameter the penalty equals log(33) = 3.4965. Using the more conservative BIC index, 22 out of the 100 participants produced BIC indices favoring the quantum model over the Markov model. Thus a majority of participants were adequately fit by the Markov model, but a substantial percentage of participants were better fit by the quantum model. One final method used to compare models is to examine the posterior distribution of the parameter c when estimated by hierarchical Bayesian methods. The details for this analysis are described in Busemeyer, Wang, and Trueblood (2012), and the results are only briefly summarized here. The hierarchical Bayesian estimation method starts by assuming a prior distribution over the individuals for each of the four quantum model parameters. Then, the likelihoods from the individual fits are used to update the prior distribution into a posterior distribution over the individuals for the four parameters. The posterior distribution of the critical quantum parameter c is shown in Figure 17.4. The entire distribution lies below zero, and the mean of the distribution equals −2.67. This supports the hypothesis that the critical quantum parameter, c, is not zero, and the model does not reduce to the Markov model. It is worth noting that the same quantum model also accounts for the violations of the sure thing principle, whereas the Markov model cannot explain this violation. Furthermore, the same quantum model described here was used to explain two other puzzling findings (not reviewed here). One is concerned with order effects on inference (Trueblood & Busemeyer, 2010), and the other is the interference of categorization on decision making (Busemeyer et al., 2009).
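The AIC/BIC bookkeeping behind these counts is simple enough to sketch; the G² values below are hypothetical, chosen only to illustrate how AIC and BIC can disagree for the same participant.

```python
import math

def aic(g2, extra_params):
    # AIC badness of fit: G^2 plus 2 per extra free parameter
    return g2 + 2 * extra_params

def bic(g2, n_obs, extra_params):
    # BIC instead charges ln(n_obs) per extra free parameter
    return g2 + math.log(n_obs) * extra_params

# penalty for one extra parameter with 33 observations per person
assert abs(math.log(33) - 3.4965) < 1e-4

# hypothetical lack-of-fit values for one participant: the quantum model
# (one extra parameter, c) fits a bit better than the Markov model
g2_markov, g2_quantum = 60.0, 57.0
prefer_quantum_aic = aic(g2_quantum, 1) < aic(g2_markov, 0)    # 59 < 60
prefer_quantum_bic = bic(g2_quantum, 33, 1) < bic(g2_markov, 33, 0)
assert prefer_quantum_aic and not prefer_quantum_bic
```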
In sum, the same quantum model has been successfully applied to four distinct puzzling judgement and decision findings, which builds confidence in the broad applicability of the model.
Concluding Comments

This chapter provides a brief introduction to the basic principles of quantum theory and a few major paradoxical judgement and decision findings that
[Figure 17.4 here: a histogram with x-axis "Quantum Parameter c" (−5 to 5) and y-axis "Posterior Probability" (0 to 0.2).]
Fig. 17.4 Posterior distribution of quantum model parameter c across individuals.
the theory has been used to explain. The theory is new and needs further testing, but the initial successful applications demonstrate its viability and theoretical potential. Busemeyer and Bruza (2012) provide a more detailed presentation of the basic principles, and they also describe in detail a much larger number of empirical applications. Also, Pothos and Busemeyer (2013) summarize applications of quantum theory to cognitive science. Finally, special issues on quantum cognition have recently appeared in the Journal of Mathematical Psychology (Bruza, Busemeyer, & Gabora, 2009) and Topics in Cognitive Science (Wang, Busemeyer, Atmanspacher, & Pothos, 2013). What are the advantages and disadvantages of the quantum approach as compared to traditional cognitive theories? First, let us consider some of the disadvantages. One is that the concepts and mathematics are very new and unfamiliar to psychologists, and learning how to use them requires an investment of time and effort. Second, because of the unfamiliarity, it may seem difficult at first to intuitively connect these ideas to traditional concepts of cognitive psychology, such as memory, attention, and information processing. Finally, applications of quantum theory to cognition have to overcome the skepticism that naturally arises when introducing a revolutionary new scientific idea. However, now consider the advantages. First, the mathematics is not as difficult as it seems, and it only requires knowledge of linear algebra and differential equations. Second, once one
does become familiar with the mathematics and concepts, it becomes apparent that quantum theory provides a conceptually elegant and innovative way to formalize and represent some of the major concepts of cognition. For example, the superposition principle provides a natural way to represent the parallel processing of uncertain information and to capture deep, ambiguous feelings. Moreover, the consideration of quantum models allows the (re)introduction of new and useful theoretical principles into psychology, such as incompatibility, interference, and entanglement. In all, the main advantage of quantum theory is that a small set of principles provides a coherent explanation for a wide variety of puzzling results that have never before been connected under a single theoretical framework (see Box 2).
Notes
1. In particular, this chapter does not rely on the quantum brain hypothesis (Hammeroff, 1998).
2. This section makes little use of complex numbers, but the section on dynamics requires their use.
3. This chapter follows the Dirac representation of the state as a vector rather than the more general von Neumann representation of the state as a density matrix.
4. The basis χ is an arbitrary choice, and there are many other choices for a basis. Initially, we restrict ourselves to one arbitrarily chosen basis. Later we discuss issues arising from using different choices for the basis.
5. Kolmogorov assigned a single sample space to the outcomes of an experiment. This allows one to use a different sample space
for each experiment. But then the problem is that these separate sample spaces are left stochastically unrelated.
6. Sixteen gambles were played twice; one other gamble was played only once.
7. A surprising feature was found with the log likelihood function of the quantum model as a function of the key quantum parameter c. The log likelihood function forms a damped oscillation that converges at a reasonably high log likelihood at the extremes, and this is true both for the average across participants and for individual participants.
References
Aerts, D. (2009). Quantum structure in cognition. Journal of Mathematical Psychology, 53(5), 314–348.
Aerts, D., & Aerts, S. (1994). Applications of quantum statistics in psychological studies of decision processes. Foundations of Science, 1, 85–97.
Aerts, D., & Gabora, L. (2005). A theory of concepts and their combinations II: A Hilbert space representation. Kybernetes, 34, 192–221.
Aerts, D., Gabora, L., & Sozzo, S. (2013). Concepts and their dynamics: A quantum-theoretic modeling of human thought. Topics in Cognitive Science, 5, 737–773.
Atmanspacher, H., & Filk, T. (2010). A proposed test of temporal nonlocality in bistable perception. Journal of Mathematical Psychology, 54, 314–321.
Atmanspacher, H., Filk, T., & Romer, H. (2004). Quantum Zeno features of bistable perception. Biological Cybernetics, 90, 33–40.
Barkan, R., & Busemeyer, J. R. (1999). Changing plans: Dynamic inconsistency and the effect of experience on the reference point. Psychonomic Bulletin & Review, 10, 353–359.
Barkan, R., & Busemeyer, J. R. (2003). Modeling dynamic inconsistency with a changing reference point. Journal of Behavioral Decision Making, 16, 235–255.
Blutner, R. (2009). Concepts and bounded rationality: An application of Niestegge's approach to conditional quantum probabilities. In L. Accardi et al. (Eds.), Foundations of probability and physics-5 (Vol. 1101, pp. 302–310).
Blutner, R., Pothos, E. M., & Bruza, P. (2013). A quantum probability perspective on borderline vagueness. Topics in Cognitive Science, 5(4), 711–736.
Bordley, R. F., & Kadane, J. B. (1999). Experiment-dependent priors in psychology. Theory and Decision, 47(3), 213–227.
Brainerd, C. J., Wang, Z., & Reyna, V. (2013). Superposition of episodic memories: Overdistribution and quantum models. Topics in Cognitive Science, 5(4), 773–799.
Bruza, P., Kitto, K., Nelson, D., & McEvoy, C. (2009). Is there something quantum-like in the human mental lexicon? Journal of Mathematical Psychology, 53, 362–377.
Bruza, P. D., Busemeyer, J., & Gabora, L. (Eds.). (2009). Special issue on quantum cognition (Vol. 53). Journal of Mathematical Psychology.
Busemeyer, J. R., & Bruza, P. D. (2012). Quantum models of cognition and decision. Cambridge University Press.
Busemeyer, J. R., Pothos, E. M., Franco, R., & Trueblood, J. S. (2011). A quantum theoretical explanation for probability judgment errors. Psychological Review, 118(2), 193–218.
Busemeyer, J. R., Wang, Z., & Lambert-Mogiliansky, A. (2009). Empirical comparison of Markov and quantum models of decision making. Journal of Mathematical Psychology, 53(5), 423–433.
Busemeyer, J. R., Wang, Z., & Townsend, J. (2006). Quantum dynamics of human decision making. Journal of Mathematical Psychology, 50(3), 220–241.
Busemeyer, J. R., Wang, Z., & Trueblood, J. S. (2012). Hierarchical Bayesian estimation of quantum decision model parameters. In J. R. Busemeyer, F. DuBois, A. Lambert-Mogiliansky, & M. Melucci (Eds.), Quantum interaction. Lecture Notes in Computer Science, Vol. 7620 (pp. 80–89). Springer.
Conte, E., Khrennikov, A. Y., Todarello, O., Federici, A., Mendolicchio, L., & Zbilut, J. P. (2009). Mental states follow quantum mechanics during perception and cognition of ambiguous figures. Open Systems and Information Dynamics, 16, 1–17.
Dirac, P. A. M. (1958). The principles of quantum mechanics. Oxford University Press.
Dzhafarov, E., & Kujala, J. V. (2012). Selectivity in probabilistic causality: Where psychology runs into quantum physics. Journal of Mathematical Psychology, 56, 54–63.
Feldman, J. M., & Lynch, J. G. (1988). Self-generated validity and other effects of measurement on belief, attitude, intention, and behavior. Journal of Applied Psychology, 73(3), 421–435.
Franco, R. (2009). Quantum amplitude amplification algorithm: An explanation of availability bias. In P. Bruza, D. Sofge, W. Lawless, K. van Rijsbergen, & M. Klusch (Eds.), Quantum interaction (pp. 84–96). Springer.
Fuss, I. G., & Navarro, D. J. (2013). Open parallel cooperative and competitive decision processes: A potential provenance for quantum probability decision models. Topics in Cognitive Science, 5(4), 818–843.
Gleason, A. M. (1957). Measures on the closed subspaces of a Hilbert space.
Journal of Mathematics and Mechanics, 6, 885–893.
Griffiths, R. B. (2003). Consistent quantum theory. Cambridge University Press.
Gudder, S. P. (1988). Quantum probability. Academic Press.
Hammeroff, S. R. (1998). Quantum computation in brain microtubules? The Penrose–Hameroff "Orch OR" model of consciousness. Philosophical Transactions of the Royal Society of London (A), 356, 1869–1896.
Heisenberg, W. (1958). Physics and philosophy. Harper and Row.
Hughes, R. I. G. (1989). The structure and interpretation of quantum mechanics. Harvard University Press.
Ivancevic, V. G., & Ivancevic, T. T. (2010). Quantum neural computation. Springer.
Khrennikov, A. Y. (2010). Ubiquitous quantum structure: From psychology to finance. Springer.
Kolmogorov, A. N. (1933/1950). Foundations of the theory of probability. N.Y.: Chelsea Publishing Co.
Lambert-Mogiliansky, A., Zamir, S., & Zwirn, H. (2009). Type indeterminacy: A model of the "KT"
(Kahneman–Tversky)-man. Journal of Mathematical Psychology, 53(5), 349–361.
La Mura, P. (2009). Projective expected utility. Journal of Mathematical Psychology, 53(5), 408–414.
Morier, D. M., & Borgida, E. (1984). The conjunction fallacy: A task specific phenomenon? Personality and Social Psychology Bulletin, 10, 243–252.
Payne, J., Bettman, J. R., & Johnson, E. J. (1992). Behavioral decision research: A constructive processing perspective. Annual Review of Psychology, 43, 87–131.
Peres, A. (1998). Quantum theory: Concepts and methods. Kluwer Academic.
Pothos, E. M., & Busemeyer, J. R. (2009). A quantum probability explanation for violations of "rational" decision theory. Proceedings of the Royal Society B, 276, 2171–2178.
Pothos, E. M., & Busemeyer, J. R. (2013). Can quantum probability provide a new direction for cognitive modeling? Behavioral and Brain Sciences, 36, 255–274.
Pothos, E. M., Busemeyer, J. R., & Trueblood, J. S. (2013). A quantum geometric model of similarity. Psychological Review, 120(3), 679–696.
Savage, L. J. (1954). The foundations of statistics. John Wiley & Sons.
Schachter, S., & Singer, J. E. (1962). Cognitive, social, and physiological determinants of emotional state. Psychological Review, 69(5), 379–399.
Shafir, E., & Tversky, A. (1992). Thinking through uncertainty: Nonconsequential reasoning and choice. Cognitive Psychology, 24, 449–474.
Stolarz-Fantino, S., Fantino, E., Zizzo, D. J., & Wen, J. (2003). The conjunction effect: New evidence for robustness. American Journal of Psychology, 116(1), 15–34.
Tentori, K., & Crupi, V. (2013). Why quantum probability does not explain the conjunction fallacy. Behavioral and Brain Sciences, 36(3), 308–310.
Trueblood, J. S., & Busemeyer, J. R. (2010). A quantum probability account for order effects on inference. Cognitive Science, 35, 1518–1552.
Trueblood, J. S., & Busemeyer, J. R. (2011). A quantum probability model of causal reasoning. Frontiers in Cognitive Science, 3, 138.
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293–315.
Tversky, A., & Shafir, E. (1992). The disjunction effect in choice under uncertainty. Psychological Science, 3, 305–309.
Von Neumann, J. (1932/1955). Mathematical foundations of quantum mechanics. Princeton University Press.
Wang, Z., & Busemeyer, J. R. (2013). A quantum question order model supported by empirical tests of an a priori and precise prediction. Topics in Cognitive Science, 5, 689–710.
Wang, Z., Busemeyer, J. R., Atmanspacher, H., & Pothos, E. M. (2013). The potential of using quantum theory to build models of cognition. Topics in Cognitive Science, 5, 672–688.
Wang, Z., Solloway, T., Shiffrin, R. M., & Busemeyer, J. (2014). Context effects produced by question orders reveal quantum nature of human judgments. Proceedings of the National Academy of Sciences, 111(26), 9431–9436.
Yates, J. F., & Carlson, B. W. (1986). Conjunction errors: Evidence for multiple judgment procedures, including 'signed summation'. Organizational Behavior and Human Decision Processes, 37, 230–253.
Yukalov, V. I., & Sornette, D. (2011). Decision theory with prospect interference and entanglement. Theory and Decision, 70, 283–328.
INDEX
Abductive reasoning in clinical cognitive science, 343–344 Absolute identification absolute and relative judgment, 129–130 intertrial interval and sequential effects, 136–138 learning, 130–133 perfect pitch versus, 133–135 response times, 135 theories of, 124–129 Absorbing barriers, 30 Accumulator models, 321–322, 327–328 Across-trial variability, 37–38, 46, 56–57 Actions, in Markov decision process (MDP), 102–103 ACT-R architectures, 126, 219, 301 Additive factors method, 69–70, 89 ADHD, 49–50 Affine transformation, 28 Aging studies, diffusion models in, 48 Akaike information criterion (AIC), 306–308 Alcohol consumption, 50 Aleatory uncertainty, 210 Algom, D., 63 Allais paradox, 219 ANCHOR-based exemplar model of absolute identification, 126 Anderson’s ACT-R model, 126, 219, 301 Anxiety, diffusion models of, 49 Anxiety-prone individuals, threat sensitivity modeling in, 352–354 Aphasia, 49–50 Ashby, F. G., 13 Assimilation and contrast, in absolute identification, 123–124, 128 Associative learning, 194–196 Associative recognition, 47 Attention allocation differences data, 291–292
descriptive model and parameters, 292–293 overview, 290–291 posterior distribution interpretation, 293–295 Attention-weight parameters, 144 Attraction, as context effect in DFT, 225–226 Austerweil, J. L., 187 Autism spectrum disorders, 354–356 Automaticity, 143, 148–150, 325 Autonomous search models, 178–179 Bandit tasks, 111 Basal ganglia model, 51 Baseball batting example, 282–290 data, 283 descriptive model and parameters, 283–285 overview, 282–283 posterior distribution interpretation, 285–290 shrinkage and multiple comparisons, 290 Basis functions, 201 Bayesian information criterion (BIC), 9, 21, 306–308 Bayesian models. See also Hierarchical models, Bayesian estimation in of cognition, 187–208 clustering observations, 192–196 conclusions, 203–204 continuous quantities, 200–203 features as perceptual units, 196–200 future directions, 204 mathematical background, 188–192 overview, 187–188 overview, 40, 169 parsimony principle in, 309–314 of shape perception, 258–260 Bayesian parameter estimation, 348–349
Bayes’ rule, 6, 281–282
BEAGLE (Bound Encoding of the Aggregate Language Environment) model, 243–244, 248
Bellman equation, 103
Benchmark model, 74–75
Benchmark phenomena, in perceptual judgment, 122–124
Berlin Institute of Physiology, 64
Bernoulli, Daniel, 210–211
Bessel, F. W., 65
Bias-variance trade-off, 190–191
BIC (Bayesian information criterion), 9, 21, 306–308
Blood sugar reduction, 50
Bootstrapping, 105
Boundary setting across tasks, 48
Bound Encoding of the Aggregate Language Environment (BEAGLE) model, 243–244, 248
Bow effects, in absolute identification, 123, 128
Brown, S. D., 121
Brown and Heathcote’s linear ballistic accumulator model, 301
BUGS modeling specification language, 282
Busemeyer, J. R., 1, 369
Calculus, 3–5
Candidate decision processes, 14
Capacity coefficient, 72–74
Capacity limitations, in absolute identification, 122–123
Capacity reallocation model, 69
Capacity theory, 90–91
Catastrophe theory, 346
Categorization, 29, 325. See also Bayesian models; Exemplar-based random walk (EBRW) model
Category learning, 30–31, 189
Cattell, James McKeen, 66
Chaos-theoretic modeling, 345–346
Child development, diffusion models in, 48–49
Chinese restaurant process (CRP) metaphor, 193–195
Choice axiom testing, 211–214
Choice behavior, 199–200
Cholesky transformation, 20
“Chunking,” 66
Clinical psychology, mathematical and computational modeling in, 341–368
  contributions of, 349–359
    cognition in autism spectrum disorders, 354–356
    cognitive modeling of routinely used measures, 356–357
    multinomial processing tree modeling of memory, 350–352
    in pathocognition and functional neuroimaging, 357–359
    threat sensitivity modeling of anxiety-prone individuals, 352–354
  distinctions in, 343–346
  overview, 341–343
  parameter estimation in, 346–349
  special considerations, 359–361
Clustering observations, 192–196
Coactivation, 73
COALS (Correlated Occurrence Analogue to Lexical Semantics) model, 241, 248
Coexistence model (CXM), 303–304, 306, 308–309, 313
Cognition. See Bayesian models; Quantum models of cognition and decision
Cognitive control of perceptual decisions, 330, 334
Cognitive modeling, 219–226
  of clinical science measures, 356–357
  context effects example, 225–226
  decision field theory
    for multialternative choice problems, 222–225
    multi-attribute, 221–222
    overview, 220–221
  “horse race,” 356
index
Cognitive-psychological complementarity, 87–90
Cognitive psychometrics, 290
Cohen’s PDP model, 301
Cold cognition, 361
Commutativity, 375
Competing accumulator models, 322
Complication experiment, 66–68
Component power laws model, 301
Compositional semantics, 249
Compromise, as context effect in DFT, 225–226
Computational reinforcement learning (CRL), 99–117
  decision environment, 102
  exploration and exploitation balance, 106
  goal of, 101
  good decision making, 103–104
  historical perspective, 100–101
  neural correlates of, 106–108
  Q-learning, 105–106
  research issues, 108–114
    human exploration varieties, 110–113
    model-based versus model-free learning, 108–109
    reward varieties, 113–114
    state representation influence, 109–110
  temporal difference learning, 104–105
  values for states and actions, 102–103
Conditioning, 101–103, 111, 195
Confidence judgments, 52–53
Conjunction probability judgment errors, 376–379
Connectionist models
  decision field theory as, 223, 225
  of semantic memory, 234–239
Constancy, in shape perception, 256–257
Constructed semantics model (CSM), 247
Context, 175–178
Context-noise models, 172
Contingency table, 6f
Continuous quantities, relationships of, 200–203
Contrast and assimilation, in absolute identification, 123–124, 128
Correlated Occurrence Analogue to Lexical Semantics (COALS) model, 241, 248
COVIS theory of category learning, 30–31
Credit assignment problem, in reinforcement learning, 100–101, 103
Criss, A. H., 165
CRL (computational reinforcement learning). See Computational reinforcement learning (CRL)
CrossCat model, 195
CRP (Chinese restaurant process) metaphor, 193–195
Crude two-part code, in MDL, 307–308
CSM (constructed semantics model), 247
Cued recall models of episodic memory, 173–174
Cumulative prospect theory, 217–219
CXM (coexistence model), 303–304, 306, 308–309, 313
Deadline tasks, 41–42
Decisional separability, 15–16, 22f, 23
Decision-boundary models, 30, 142
Decision field theory (DFT), 220–225
Decision-making models, 209–231. See also Computational reinforcement learning; Perceptual decision making, neurocognitive modeling of; Quantum models of cognition and decision
  choice axiom testing, 211–214
  cognitive models
    context effects example, 225–226
    decision field theory, 220–221
    decision field theory for multialternative choice problems, 222–225
    multi-attribute decision field theory, 221–222
    overview, 219–220
  historical development of, 210–211
  overview, 209–210
  rational choice models, 214–219
Decision rules for Bayesian posterior distribution, 287
Dennis, S., 232
Density estimation, in Bayesian models, 188–190
Depression, diffusion models of, 49
Derivatives and integrals, 3–5
Destructive updating model (DUM), 303–304, 306, 308–309, 313
Deterministic processes, 72
Diederich, A., 1, 209
Differential-deficit, psychometric-artifact problem, 360
Differential equations, 4–5
Diffusion models, 35–62
  in aging studies, 48
  in child development, 48–49
  in clinical applications, 49–50, 352
  competing two-choice models, 51–56
  failure of, 50–51
  in homeostatic state manipulations, 49–50
  in individual differences studies, 48
  in lexical decision, 46–47
  optimality, 44–45
  in perceptual tasks, 45–46
  for practice effects, 301
  for rapid decisions, 35–44
    accuracy and RT distribution expressions, 38–41
    drift rate, 36–38
    overview, 35–36
    standard two-choice task, 41–44
  in recognition memory, 46
  in semantic and recognition priming effects, 47
  in value-based judgments, 47–48
Diffusion process, 30
Dirichlet-process mixture model, 192–194, 244
Disjunction probability judgment errors, 376–379
Dissociations, in categorization and recognition, 158–159
Distributional models of semantic memory, 239–247
  latent semantic analysis, 239–240
  moving window models, 240–241
  probabilistic topic models, 243–246
  random vector models, 241–243
  retrieval-based semantics, 246–247
Domain of the function, 1
Donders, Franciscus, 65–66
Donders’s complication experiment, 66–68
Donkin, C., 121
Double factorial paradigm, 83
Drift rates
  accumulator model assumptions about, 321–322
  across-trial variability in, 56–57
  in perceptual decision making, 36–38, 45, 325–327
Dual process models of recognition, 166
DUM (destructive updating model), 303–304, 306, 308–309, 313
Dynamic attractor networks, 237–239
Dynamic decision models, 219–220. See also Decision-making models
Dynamic programming, 104
Dyslexia, 50
Effective sample size (ESS) statistic, 282
EGCM (extended generalized context model) of absolute identification, 126, 143
Eidels, A., 1, 63
Emotional bias, 49
Episodic memory, 165–183
  cued recall models of, 173–174
  free recall models of, 174–179
  future directions, 179
  overview, 165–166
  recognition memory models, 166–172
    context-noise models, 172
    global matching models, 167–168
    retrieving effectively from memory (REM) model, 168–171
    updating consequences, 171–172
Epistemic uncertainty, 210
Error signal, 4–5
ESS (effective sample size) statistic, 282
EUT (expected utility theory), 209, 211
EVL (Expectancy Valence Learning Model), 356–357
Exemplar-based random walk (EBRW) model
  of absolute identification, 125–126
  of categorization and recognition, 142–164
    automaticity and perceptual expertise, 148–150
    old-new recognition RTs predicted by, 152–157
    overview, 142–144
    probabilistic feedback to contrast predictions, 150–152
    research goals, 157–159
    in response times, 144–146
    similarity and practice effects, 146–148
  in perceptual decision making, 325
Exemplar models of absolute identification, 125–126, 129
Exhaustive processing, 71–72
Expectancy Valence Learning Model (EVL), 356–357
Expectations, 7–8
Expected utility theory (EUT), 209, 211
Experience-based decision making, 215–216
Experimental Psychology (Woodworth), 83
Exploration/exploitation balance
  experiments in, 100
  human varieties of, 110–113
  in reinforcement learning, 106
Exponential functions, 2
Extended generalized context model (EGCM) of absolute identification, 126, 143
Eye movements, saccadic, 323–325
False alarm rates, 23
Feature inference, 196–199
Feature integration theory, 87
Feature-list models, 233–234
Fechnerian paradigm, 257–258
Fechner’s law of psychophysics, 307
Feed-forward networks, 235
FEF (frontal eye field), 321, 323
Fermat, Pierre de, 210
Flexibility-to-fit data, of models, 93
fMRI
  category-relevant dimensions shown by, 158
  in clinical psychology, 357–359
  context word approach and, 241
  diffusion models and, 57
  model-based analysis of, 107–108
Free recall models of episodic memory, 174–179
Frequentist methods, 281
Frontal eye field (FEF), 321, 323
Functions, mathematical, 1–3
Gabor patch orientation discrimination, 50
Galen, 64
Gate accumulator model, 327
Gaussian distribution, 189, 192
Generalizability, measures of, 303
Generalized context model (GCM), 126, 143, 152, 325
General recognition theory (GRT)
  application of, 14
  applied to data, 17–21
  empirical example, 24–28
  multivariate normal distributions assumed by, 16
  neural implementations of, 30–31
  overview, 15–16
  response accuracy and response time accounted for, 28–30
  summary statistics approach, 22–24
GenSim software for semantic memory modeling, 246
Gershman, S. J., 187
Global matching models, 167–168
Go/No-Go Discrimination Task, 44, 349, 356
Goodness of fit evaluation, 20–21, 302
Grice inequality, 74–75, 92
Griffiths, T. L., 187
Grouping, power of, 66
GRT (general recognition theory). See General recognition theory (GRT)
Guided search, 89
Gureckis, T. M., 99
HA-LA (higher anxiety-prone-lower anxiety-prone) group differences, 352–353
HAL (Hyperspace Analogue to Language) model, 240–241, 245, 248
Hamilton, Sir William, 66
Hawkins, R. X. D., 63
HDI (highest density interval), 285
Heathcote, A., 121
Hebbian learning, 235
HiDEx software for semantic memory modeling, 246
Hierarchical models, Bayesian estimation in, 279–299
  attention allocation differences example, 290–295
    data, 291–292
    descriptive model and parameters, 292–293
    overview, 290–291
    posterior distribution interpretation, 293–295
  baseball batting example, 282–290
    data, 283
    descriptive model and parameters, 283–285
    overview, 282–283
    posterior distribution interpretation, 285–290
    shrinkage and multiple comparisons, 290
  comparison of, 295–297
  ideas in, 279–282
Higher anxiety-prone-lower anxiety-prone (HA-LA) group differences, 352–353
Highest density interval (HDI), 285
Hilbert space, in quantum theory, 371–372, 374–375
Histograms, 9
Homeostatic state manipulations, 49–50
“Horse race” model of cognitive processes, 356
Hot cognition, 361
Howard, M. W., 165
Human function learning, 202–203
Human information processing, 63–70
  Donders’s complication experiment, 66–68
  Sternberg’s work in, 68–70
  von Helmholtz’s measurement of nerve impulse speed, 64–65
  Wundt’s reaction time studies, 65–66
Human neuroscience, diffusion models for, 56–58
Hyperspace Analogue to Language (HAL) model, 240–241, 245, 248
IBP (Indian buffet process) metaphor, 194–195, 197–200, 203
Identification data, fitting GRT to, 18–21
Identification hit rate, 23
Importance sampling for Bayes factor, 312–313
Independence, axioms of, 211–212
Independent parallel, limited-capacity (IPLC) processing system, 353
Independent race model, 74–75
Indian buffet process (IBP) metaphor, 194–195, 197–200, 203
Individual differences studies, diffusion models in, 48
Infinite Relational Model (IRM), 195
Information criteria, in model comparison, 306–307
Instance theory, 301, 325
Institute for Collaborative Biotechnologies, 31
Instrumental conditioning, 101, 111
Integrals and derivatives, 3–5
Integrate-and-fire neurons, 39
Intercompletion time equivalence, 77–78
Intertrial interval, sequential effects and, 136–138
Inverse problem, shape perception as, 256–263
Iowa Gambling Task, 349, 356
IPLC (independent parallel, limited-capacity) processing system, 353
IRM (Infinite Relational Model), 195
James, William, 64
Jefferson, B., 63
Jeffreys weights, 313
Jones, M. N., 232
Kinnebrook, David, 65
Kolmogorov axioms, 307, 370, 373–374
Kruschke, J. K., 279
Kullback-Leibler divergence, 306
Languages, tonal, 134
Latent Dirichlet Allocation algorithms, 244
Latent semantic analysis (LSA), 239–240, 245, 248–249
Law of total probability, 376
LBA (Linear Ballistic Accumulator) model, 52, 301
Leaky competing accumulator (LCA) model, 36, 51–52, 128, 223–225, 327
Learning. See also Computational reinforcement learning
  absolute identification in, 130–133
  associative, 194–196
  Hebbian, 235
  modeling human function, 202–203
  procedural, 30–31
  relationships in continuous quantities, 200–203
Lexical decisions, diffusion models in, 46–47
Lexicographic semi-order (LS) choice rule, 213
Li, Y., 255
Likelihood function, 18–19, 280, 296
Likelihood ratio test, 28
Limited capacity, 70, 73
Linear Ballistic Accumulator (LBA) model, 52, 301
Linear functions, 1–2
Linear regression, 8, 200–202
Logan, G. D., 320
Love, B. C., 99
LSA (latent semantic analysis), 239–240, 245, 248–249
LS (lexicographic semi-order) choice rule, 213
Lüders’s rule, 374
“Magical number seven,” 66
Mapping, functions for, 1
Marginal discriminabilities, 23
Marginal response invariance, 22
Markov Chain Monte Carlo (MCMC) algorithms, 244, 281–282, 293
Markov decision process (MDP), 102–104
Markov dynamic model for two-stage gambles, 382–383, 385–387
Maskelyne, Nevil, 65
Matched filter model, 167, 173
Mathematical concepts, review of, 1–10
  derivatives and integrals, 3–5
  expectations, 7–8
  mathematical functions, 1–3
  maximum likelihood estimation, 8–9
  probability theory, 5–7
Matrix reasoning, 48
Matzke, D., 300
Maximum likelihood estimation (MLE), 8–9, 281, 347
MCMC (Markov Chain Monte Carlo) algorithms, 244, 281–282, 293
MDL (minimum description length), in model comparison, 307–309
MDP (Markov decision process), 102–104
MDS (multidimensional scaling), 143
Mean interaction contrast, 83–84
Measures of generalizability, 303
Memory, 350–352. See also Episodic memory; Semantic memory
Memory interference models example, 303–306
Méré, Chevalier de, 210
Meyer, Irwin, Osman, and Kounios partial information paradigm, 42–44
Miller, George, 66
Minimum description length (MDL), in model comparison, 307–309
Minimum-time stopping rule, exhaustive processing versus, 71–72
Minkowski power model, 144
MLE (maximum likelihood estimation), 8–9, 281, 347
Model-based versus model-free learning, 108–109
Modeling. See Parsimony principle in model comparison; specifically named models
Model mimicking
  degenerative, 80
  ignoring parallel-serial, 87–90
  prediction overlaps from, 75–78
  in psychological science, 91–93
Moderate stochastic transitivity (MST), 212–213
Moment matching, in parameter estimation, 347–348
Monte-Carlo methods, 104, 311–313
Movement-related neurons, in FEF, 321, 323, 325–326, 328
Moving window models, 240–241
MPM (multiplicative prototype model), 290, 292
MPTs (multinomial processing tree models). See Multinomial processing tree models (MPTs)
MST (moderate stochastic transitivity), 212–213
Müller, Johannes, 64
Multialternative choice problems, decision field theory for, 222–225
Multi-armed bandit tasks, 111
Multi-attribute decision field theory, 221–222
Multichoice-decision-making, 52–53
Multidimensional scaling (MDS), 143, 273
Multidimensional signal detection theory, 13–34
  general recognition theory
    applied to data, 17–21
    empirical example, 24–28
    neural implementations of, 30–31
    overview, 15–16
    response accuracy and response time accounted for, 28–30
    summary statistics approach, 22–24
  multivariate normal model, 16–17
  overview, 13–15
Multinomial processing tree models (MPTs), 301, 304–305, 307–311, 350–352
Multiple comparisons, shrinkage and, 290
Multiple linear regression, 8
Multiplicative prototype model (MPM), 290, 292
Multivariate normal model, 16–17
Myopic behavior, of agents, 103
National Institute of Neurological Disorders and Stroke, 31
Natural log functions, 2
NCM (no-conflict model), 303, 305–306, 308–309, 313
Nested models, comparing, 313–314
Neufeld, R. W. J., 341
Neural evidence
  of computational reinforcement learning, 106–108
  of exemplar-based random walk, 158
  of GRT, 30–31
  in perceptual decision making, 325–330
Neurocognitive modeling of perceptual decision making. See Perceptual decision making, neurocognitive modeling of
Neuro-connectionist modeling, 345
Neuroeconomics, 48
Neuroscience, decision making understanding from, 53–58
Newton-Raphson method, 19
Nietzsche, Friedrich, 64
No-conflict model (NCM), 303, 305–306, 308–309, 313
Noise, in perceptual systems, 15, 36
Nondecision time, 48–50, 57
Nonlinear dynamical system modeling, 345–346
Nonparametric models, 189–192, 194–195
Normal distribution, 7
Nosofsky, R. M., 142
Null list strength effects in REM model, 170–171
Numerosity discrimination task, 50
Observations, clustering, 192–196
Occam’s razor, 301–302
One-choice decisions, 53
Operant conditioning, 101
Optimality, 44–45
Optimal planning, 113
Ornstein-Uhlenbeck (OU) diffusion process, 50, 55
Overfitting, 190–191
Palmeri, T. J., 142, 320
Parallelism, 68
Parallel processing
  in benchmark model, 74–75
  mathematics supported by, 77
  parallel-serial mimicry ignored, 87–90
  partial processing as basis of, 80–81
  serial processing versus, 71
Parallel-Serial Tester (PST) paradigm, 82
Parametric models, 189–190
Parsimony principle in model comparison, 300–319
  Bayes factors, 309–314
  comparison of model comparisons, 314–315
  information criteria, 306–307
  memory interference models example, 303–306
  minimum description length, 307–309
  overview, 300–303
Partial information paradigm, 42–44
Pascal, Blaise, 210
Pathocognition, 357–359
Pavlovian conditioning, 195
PBRW (prototype-based random walk) model, 151–152
Perceptual decision making, neurocognitive modeling of, 320–340
  architectures for, 327–328
  conclusions, 333–336
  control over, 330–333
  neural dynamics, predictions of, 328–330
  neural locus of drift rates, 325–327
  overview, 320–323
  saccadic eye movements and, 323–325
Perceptual expertise, automaticity and, 148–150
Perceptual independence, 15–16, 23
Perceptual judgment, 121–141
  absolute identification issues, 129–139
    absolute and relative judgment, 129–130
    absolute identification versus perfect pitch, 133–135
    intertrial interval and sequential effects, 136–138
    learning, 130–133
    response times, 135
  absolute identification theories, 124–129
  benchmark phenomena, 122–124
  overview, 121–122
Perceptual separability, 15, 22f
Perceptual tasks, diffusion models in, 36, 45–46
Perceptual units, features as, 196–200
Perfect pitch, absolute identification versus, 133–135
Perspective. See Shape perception
Pizlo, Z., 255
Pleskac, T. J., 209
Poisson counter model, 36, 55
Poisson shot noise process, 55
Policies, in decision making, 103–104
Polynomial regression, 302–303
Posterior distribution
  in attention allocation differences, 293–295
  in baseball batting example, 285–290
  Monte Carlo sampling for, 311–313
  in tests for model-parameter differences, 356
Pothos, E., 369
Power functions, 2
Power law, 300
Practice effects, 146–148, 300–301
Prediction error, 105, 107–108
Principles of Psychology (James), 64
Probabilistic topic models, 243–246, 248, 250
Probability density function, 72
Probability judgment error, 377–379
Probability mass function, 7
Probability theory, 5–7, 373–377. See also Bayesian models; Decision-making models
Probability weighting function, 214–216
Problem of Points, 210
Procedural learning, 30–31
Prospect theory, 209, 214, 216–219
Prototype-based random walk (PBRW) model, 151–152
Prototype models, 142
PST (Parallel-Serial Tester) paradigm, 82
Psychology, mathematical and computational modeling in. See Clinical psychology, mathematical and computational modeling in
Psychomotor vigilance task (PVT), 53
Q-learning, 104–106, 109, 111
Quadratic functions, 2
Quantile-probability plots, 40
Quantum models of cognition and decision, 369–389
  classical probabilities versus, 373–377
  concepts, definitions, and notation, 371–373
  decision making applications, 381–387
    Markov dynamic model for two-stage gambles, 382–383
    model comparisons, 385–387
    quantum dynamic model for two-stage gambles, 384–385
    two-stage gambling paradigm, 381–382
  dynamical principles, 379–381
  probability judgment error applications, 377–379
  reasons for, 369–371
Race model inequality, 74
Rae, B., 121
Random Permutations Model (RPM), 244
Random variables with continuous distribution, 7–8
Random vector models, 241–243
Random walk, 39. See also Exemplar-based random walk (EBRW) model
Range of the function, 1
Rank-dependent utility theory, 209
Rapid decisions. See Diffusion models
Ratcliff, R., 35
Ratcliff’s diffusion model, 301
Rational analysis, 188
Rational choice models, 47, 214–219
Reaction time distributions, 83–87
Recognition and categorization. See Exemplar-based random walk (EBRW) model
Recognition memory models, 46, 166–172
Region of practical equivalence (ROPE), in decision rules, 286
Regularization methods, in shape perception, 258–260
Reinforcement learning (RL). See Computational reinforcement learning
Relative judgment models of absolute identification, 126–127, 129t, 136
Release-from-inhibition model, 50–51
REM (retrieving effectively from memory) model. See Retrieving effectively from memory (REM) model
Rescorla-Wagner model, 111, 194
Response accuracy, 28–30, 90–91
Response signal tasks, 41–42
Response times (RT)
  absolute identification and, 128, 135
  cognitive-psychological complementarity, 87–90
  in diffusion models, 38–41
  example of, 78–79
  exemplar-based random walk model of, 144–146, 152–157
  GRT to account for, 28–30
  human information processing studied by, 63–70
    Donders’s complication experiment, 66–68
    Sternberg’s work in, 68–70
    von Helmholtz’s measurement of nerve impulse speed, 64–65
    Wundt’s reaction time studies, 65–66
  metatheory expansion to encompass accuracy, 90–91
  model mimicking, 75–78, 91–93
  quantitative expressions of, 70–75
  stopping rule distinctions based on set-size functions, 82–87
  theoretical distinctions, 79–82
Restricted capacity models of absolute identification, 127–128, 129t, 136
Retrieval-based semantics, 246–247
Retrieved context models, 177–178
Retrieving effectively from memory (REM) model
  consequences of updating in, 171–172
  overview, 168–170
  word frequency and null list strength effects in, 170–171
Reward prediction error hypothesis, 107
Reward-rate optimality, 45
Reward varieties, in reinforcement learning, 113–114
Rickard’s component power laws model, 301
Risk in decision making. See Decision-making models
RL (reinforcement learning). See Computational reinforcement learning
ROPE (region of practical equivalence), in decision rules, 286
RPM (Random Permutations Model), 244
RT (response times). See Response times (RT)
RT-distance hypothesis, 29, 150
Rule-plus-exception models, 142
Rumelhart networks, 235–237
Saccadic eye movements, 323–325
SAMBA (Selective Attention, Mapping, and Ballistic Accumulators) model of absolute identification, 128–129, 133, 135–138
Sampling independence test, 24
Savage-Dickey approximation to Bayes factor, 313–314
Sawada, T., 255
SBME (strength-based mirror effect), 171–172
Schall, J. D., 320
Schizophrenia, stimulus-encoding elongation in, 357–359
SCM (similarity-choice model), 21