PLoS One. 2013;8(3):e57410.
doi: 10.1371/journal.pone.0057410. Epub 2013 Mar 13.

Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research

Matthew J C Crump et al. PLoS One. 2013.

Abstract

Amazon Mechanical Turk (AMT) is an online crowdsourcing service where anonymous online workers complete web-based tasks for small sums of money. The service has attracted attention from experimental psychologists interested in gathering human subject data more efficiently. However, relative to traditional laboratory studies, many aspects of the testing environment are not under the experimenter's control. In this paper, we attempt to empirically evaluate the fidelity of the AMT system for use in cognitive behavioral experiments. These types of experiments differ from simple surveys in that they require multiple trials, sustained attention from participants, comprehension of complex instructions, and millisecond accuracy for response recording and stimulus presentation. We replicate a diverse body of tasks from experimental psychology, including the Stroop, Switching, Flanker, Simon, Posner Cuing, attentional blink, subliminal priming, and category learning tasks, with participants recruited through AMT. While most of the replications were qualitatively successful and validated the approach of collecting data anonymously online using a web browser, others revealed disparities between laboratory results and online results. A number of important lessons were encountered in the process of conducting these replications that should be of value to other researchers.


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Congruent and Incongruent RTs, Error Rates and Individual Stroop Scores by Mean RT.
A. Mean RTs and error rates for congruent and incongruent Stroop items with standard error bars. B. Individual subject Stroop difference scores (incongruent-congruent) plotted as a function of individual subject mean RTs.
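As a concrete illustration of the quantities plotted in the two panels, the sketch below computes per-subject condition means and the individual Stroop difference scores (incongruent minus congruent) against each subject's overall mean RT. The data, column layout, and function names are hypothetical, not the authors' actual analysis pipeline.

```python
# Hypothetical sketch of the Figure 1 quantities from trial-level data.
# Trial format and values are illustrative only.
from statistics import mean

# Each trial: (subject_id, condition, rt_ms)
trials = [
    ("s1", "congruent", 620), ("s1", "incongruent", 700),
    ("s1", "congruent", 640), ("s1", "incongruent", 720),
    ("s2", "congruent", 550), ("s2", "incongruent", 600),
    ("s2", "congruent", 570), ("s2", "incongruent", 640),
]

def subject_means(trials):
    """Per-subject mean RT for each condition."""
    acc = {}
    for subj, cond, rt in trials:
        acc.setdefault(subj, {}).setdefault(cond, []).append(rt)
    return {s: {c: mean(rts) for c, rts in conds.items()}
            for s, conds in acc.items()}

means = subject_means(trials)

# Panel B: per-subject Stroop score (incongruent - congruent),
# paired with that subject's overall mean RT.
stroop_scores = {
    s: (m["incongruent"] - m["congruent"],
        mean([m["incongruent"], m["congruent"]]))
    for s, m in means.items()
}
print(stroop_scores)  # s1: score 80, mean RT 670; s2: score 60, mean RT 590
```

The same per-subject difference-score construction applies to the switch costs, flanker scores, and Simon scores in Figures 2 through 4, with the condition labels swapped accordingly.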
Figure 2. Repeat and Switch RTs, Error Rates and Individual Switch costs by mean RT.
A. Mean RTs and error rates for task repeat and switch trials with standard error bars. B. Individual subject switch costs (switch-repeat) plotted as a function of individual subject mean RTs.
Figure 3. Compatible and Incompatible RTs, Error Rates and Individual Flanker Scores by Mean RT.
A. Mean RTs and error rates for compatible and incompatible flanker items with standard error bars. B. Individual subject flanker scores (incompatible-compatible) plotted as a function of individual subject mean RTs.
Figure 4. Compatible and Incompatible RTs, Error Rates, and Individual Simon Scores by Mean RT.
A. Mean RTs and error rates for compatible and incompatible Simon trials with standard error bars. B. Individual subject Simon scores (incompatible-compatible) plotted as a function of individual subject mean RTs.
Figure 5. Visual Cuing: Cued and Uncued Mean RTs as a function of CSTOA.
Mean RTs for cued and uncued trials as a function of cue-target stimulus onset asynchrony with standard error bars.
Figure 6. Attentional Blink: Mean T2 Proportion Correct as a function of T1–T2 Lag.
Mean T2 (second target) proportion correct as a function of T1–T2 lag with standard error bars.
Figure 7. Masked Priming: Compatible and Incompatible Mean RTs and Error Rates Across Prime Durations.
Mean RTs and error rates for compatible and incompatible masked prime trials as a function of prime duration with standard error bars.
Figure 8. Cognitive Learning: A comparison between the learning curves reported in Nosofsky et al. (1994) and the AMT replication data in Experiment 8.
The probability of classification error as a function of training block. The top panel shows the learning curves estimated by Nosofsky et al. using 120 participants (40 per learning problem), each of whom performed two randomly selected problems. The bottom panel shows our AMT data with 228 participants, each of whom performed only one problem (38 per condition). We ended the experiment after 15 blocks, whereas Nosofsky et al. stopped after 25; the Nosofsky et al. data have therefore been truncated to facilitate visual comparison.
Figure 9. Cognitive Learning: The average number of blocks to criterion for each problem, an index of problem difficulty.
The average number of blocks it took participants to reach criterion (2 blocks of 16 trials in a row with no mistakes) in each problem. The white bars show the estimated average number of blocks to criterion reported by Nosofsky et al.
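The blocks-to-criterion index described above can be sketched as a small function: scan a participant's per-block error counts and return the block at which two consecutive error-free blocks are first completed. The function name and data are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the "blocks to criterion" index from Figure 9:
# criterion is met on the first block that completes `run` consecutive
# error-free blocks. Data are illustrative only.

def blocks_to_criterion(errors_per_block, run=2):
    """Return the 1-indexed block at which `run` consecutive
    error-free blocks are first completed, or None if never."""
    streak = 0
    for i, n_errors in enumerate(errors_per_block, start=1):
        streak = streak + 1 if n_errors == 0 else 0
        if streak == run:
            return i
    return None

# Errors per 16-trial block for one hypothetical participant:
# criterion is reached on block 6 (blocks 5 and 6 are both error-free).
print(blocks_to_criterion([3, 1, 0, 2, 0, 0, 0]))  # -> 6
```

Averaging this value over the participants assigned to each problem yields the difficulty index plotted in the figure.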
Figure 10. Cognitive Learning: The learning curves for Shepard et al. Type II and IV problems based on task incentives.
The probability of classification error as a function of training block, learning problem and incentive for Experiment 9. The incentive structure had little impact on performance within each problem.
Figure 11. Cognitive Learning: The learning curves for Shepard et al. problems I, II, IV, and VI in Experiment 10.
The top panel compares the results of Nosofsky et al. to the results of Experiment 10. The bottom panel compares the results of Lewandowsky to the results of Experiment 10, giving two different views of the relationship between the online and laboratory-based data. Overall, the Type II problem seems more difficult than in previous reports (as is the Type VI). However, in general, the instruction manipulation increased the congruence between the online and laboratory data.

References

    1. Pontin J (2007) Artificial Intelligence, With Help From the Humans. The New York Times. Available: http://www.nytimes.com/2007/03/25/business/yourmoney/25Stream.html?_r=0. Accessed 2012 Nov 6.
    2. Mason W, Suri S (2012) Conducting behavioral research on Amazon's Mechanical Turk. Behav Res Methods 44: 1–23.
    3. Gosling SD, Vazire S, Srivastava S, John OP (2004) Should we trust web-based studies? A comparative analysis of six preconceptions about Internet questionnaires. Am Psychol 59: 93–104.
    4. Buhrmester M, Kwang T, Gosling SD (2011) Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspect Psychol Sci 6: 3–5.
    5. Amir O, Rand DG, Gal Y (2012) Economic games on the Internet: the effect of $1 stakes. PLoS ONE 7: 2.
