AI-generated responses are now one of the most pressing threats to online research data quality. Prolific's authenticity checks help solve this by flagging responses that show signs of AI generation or bot behaviour. But the flag itself is only the start. What you do with a flagged response, whether you reject it, exclude it from analysis, or treat it as part of normal data noise, is a methodological decision that affects the validity of your study and the fairness of your participant pool.
This guide focuses on AI-related concerns only. Other exclusions (failed attention checks, incomplete data, study protocol violations) follow similar principles but are out of scope here. For a deeper explanation of what the LLM and bot checks actually measure, see our article on How Prolific Protects Data Quality.
The guidance below is split into two parts depending on whether you have authenticity check results to work with.
Two decisions, not one
Before reaching for the reject button, be clear about which decision you are actually making:
Rejecting a submission on Prolific is a platform action. It withholds payment from the participant and affects their record. If participants receive too many rejections, they will no longer be able to participate in further research on Prolific. Reserve it for cases with clear evidence that the participant violated your study's terms. See our guide on who you can and cannot reject.
Excluding a response from analysis is an analytical decision. The participant is paid and their record is unaffected, but the data is set aside in your statistical analysis according to your pre-specified rules.
These are different decisions with different stakes. A Low authenticity check flag can warrant rejection. A response that you find suspicious, but for which you have no concrete evidence, is almost always a candidate for exclusion, not rejection. Some loss of usable data is normal in any study, including in-person research. Participants who completed your study in good faith should be paid, even if their data is later excluded from analysis.
If you have authenticity check results
This section is for researchers who ran authenticity checks and have flag results to interpret.
How authenticity check results are displayed
Results appear in two separate columns: one for LLM authenticity checks (which look at whether participants used an LLM to answer a free-text question) and one for bot checks (which look for fully automated AI bots answering your study). Each column shows one of three outcomes (High, Mixed, or Low), or no result at all if a check could not be returned.
What each outcome means and what to do
Result | What it means | Recommended action |
High (green) | Responses look authentic. | Approve as you normally would. |
Mixed (orange) | Some responses or signals were flagged, but not all. The flag could also be a false positive. | Review case by case (see more guidance on this below). |
Low (red) | Low-authenticity patterns detected with high confidence. | Confirm with a brief manual review, then reject. |
No result (loading icon / question mark) | The check did not run or could not return a result, so there is insufficient information. | Review manually as you would any unflagged response (see more guidance on this below). |
What to do with a Mixed or ‘no result’ flag?
These are the cases where judgement matters most. A Mixed bot result can have a legitimate explanation (e.g., the participant uses accessibility tools). A Mixed LLM result can mean some free-text answers were flagged while others were not (e.g., 2 out of 4). A "no result" state means the check could not be returned, often because the information could not be retrieved or there was not enough of it. For all of these:
Review the submission. For LLM checks, you can review each flagged question individually by downloading the demographic data, which lets you see exactly which answers were flagged.
Consider any explanation the participant provides. Contact them through Prolific messaging if a quick clarification would help.
Decide based on your pre-specified criteria for these cases. If the data clearly fails them, reject. If a legitimate explanation exists or the response meets your needs, approve.
When the LLM and bot columns disagree
A submission can come back with different ratings in the two columns, for example High on the bot check but Mixed on the LLM check, or the reverse. There is no single rule for these cases, but a few principles help:
A Low rating in either column is a clear signal. You may reject the submission after a brief manual review.
Weight the column most relevant to your study's validity. If your analysis depends heavily on free-text responses, an LLM check returning Mixed matters more. If behavioural data (timing, mouse movement) is core to your design, the bot check carries more weight.
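The results table and the two principles above can be sketched as a small triage helper. This is an illustrative sketch, not a Prolific feature: the function name, the action labels, and the `primary` parameter are all assumptions, and the output is a suggestion for a human reviewer rather than an automated decision.

```python
from typing import Optional


def triage(llm: Optional[str], bot: Optional[str], primary: str = "llm") -> str:
    """Suggest a next step from the two check outcomes.

    llm, bot: "High", "Mixed", "Low", or None when no result was returned.
    primary:  "llm" or "bot", the column most relevant to the study
              (hypothetical parameter, chosen by the researcher).
    """
    # A Low rating in either column is a clear signal.
    if "Low" in (llm, bot):
        return "confirm_then_reject"
    # Both checks look authentic: approve as normal.
    if llm == "High" and bot == "High":
        return "approve"
    # Anything else (Mixed or no result) needs manual review. A Mixed
    # rating in the column the study depends on is reviewed first.
    key = llm if primary == "llm" else bot
    return "review_priority" if key == "Mixed" else "review"


print(triage("Mixed", "High", primary="llm"))
print(triage("High", None))
```

A study built on free-text answers would pass `primary="llm"`, while a design that depends on behavioural data would pass `primary="bot"`.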
Reporting your authenticity check results
In your methods section, note how many responses fell into each tier (High, Mixed, Low, no result) and how many you rejected. If you retained Mixed or no-result responses in your analysis, the sensitivity-analysis and broader reporting guidance in the next section can also apply to your study.
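As a rough illustration of the tally described above, the sketch below counts submissions per tier for each column. The records and field names (`llm`, `bot`, `rejected`) are hypothetical and do not reflect the column names in any Prolific export.

```python
from collections import Counter

# Hypothetical per-submission records; None means no result was returned.
submissions = [
    {"id": "p01", "llm": "High", "bot": "High", "rejected": False},
    {"id": "p02", "llm": "Mixed", "bot": "High", "rejected": False},
    {"id": "p03", "llm": "Low", "bot": "Mixed", "rejected": True},
    {"id": "p04", "llm": None, "bot": "High", "rejected": False},
    {"id": "p05", "llm": "High", "bot": None, "rejected": False},
]

TIERS = ("High", "Mixed", "Low", "No result")


def tier_counts(rows, column):
    """Count submissions per tier for one check column."""
    counts = Counter(row[column] or "No result" for row in rows)
    # Report every tier, including those with zero submissions.
    return {tier: counts.get(tier, 0) for tier in TIERS}


n_rejected = sum(row["rejected"] for row in submissions)

print("LLM check:", tier_counts(submissions, "llm"))
print("Bot check:", tier_counts(submissions, "bot"))
print("Rejected:", n_rejected)
```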
If you suspect AI use but have no authenticity check results
This section is for researchers who did not run authenticity checks but suspect AI use in their study data.
Pre-specify your exclusion criteria
Deciding what counts as a problematic response after you have looked at the data introduces researcher degrees of freedom and inflates false-positive rates (Simmons, Nelson & Simonsohn, 2011; Wicherts et al., 2016). The solution is to decide on your exclusion rule before you collect data.
At a minimum, write your exclusion rules down in your study protocol or analysis plan before launch. You may also opt for formal pre-registration on a platform like OSF or AsPredicted, which timestamps your decisions publicly.
For AI-related concerns specifically, your pre-specified rule might look something like:
"Responses showing indicators of AI generation (e.g., generic phrasing, suspiciously uniform tone) will be excluded from the primary analysis. A sensitivity analysis will be reported including these responses."
The exact rule is yours to choose; what matters is that it exists before you see the data.
Handling responses you suspect
Sometimes a response will feel off (usually because it is very different from the rest of your responses) and you will suspect use of AI. Suspicion alone is not grounds for rejection on Prolific. Rejecting affects a participant's record and their pay, and false positives are a real harm. Excluding the response from analysis, rather than rejecting it, is almost always the right move. The valid reasons to reject a submission are listed in our guide on who you can and cannot reject.
Apply your pre-specified criteria consistently across all participants, and contact the participant through Prolific messaging if a clarification could resolve the uncertainty. We recommend keeping a brief exclusion log alongside your analysis script, recording the participant ID, your decision, and your reason. This becomes the basis for your methods-section reporting later.
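One possible shape for such a log is a small CSV file with the three fields named above. The sketch below is a minimal example under that assumption: the `LogEntry` class, the decision labels, and the sample entries are hypothetical, and it writes to an in-memory buffer that could be swapped for an open file.

```python
import csv
import io
from dataclasses import asdict, dataclass

FIELDS = ["participant_id", "decision", "reason"]


@dataclass(frozen=True)
class LogEntry:
    participant_id: str
    decision: str  # e.g. "exclude" or "retain" (illustrative labels)
    reason: str


def write_log(entries, handle):
    """Write exclusion decisions as CSV rows to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))


# Invented example entries.
entries = [
    LogEntry("p02", "exclude", "generic phrasing across all free-text answers"),
    LogEntry("p07", "retain", "participant clarified via Prolific messaging"),
]

buffer = io.StringIO()  # replace with open("exclusion_log.csv", "w", newline="")
write_log(entries, buffer)
print(buffer.getvalue())
```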
Run sensitivity analyses for the judgement calls
Sensitivity analyses are most valuable for the cases where you had to make a judgement. The standard practice is to run your primary analysis using your pre-specified exclusion rule, then run the analysis again with a different decision applied; for example, including all the responses you flagged as suspicious. Report both. If your conclusions hold across both versions, your findings are robust to the exclusion decision. If they shift, that is itself an important finding and should be reported.
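The two-pass pattern can be illustrated with a deliberately simple analysis: a difference in group means, computed once with suspicious responses excluded and once with them included. The data, the `flagged` field, and the choice of statistic are invented for illustration; a real study would substitute its own pre-specified analysis.

```python
from statistics import mean

# Hypothetical rows: (condition, outcome, flagged_as_suspicious)
rows = [
    ("control", 4.0, False),
    ("control", 5.0, False),
    ("control", 3.0, True),
    ("treatment", 6.0, False),
    ("treatment", 7.0, False),
    ("treatment", 9.0, True),
]


def mean_difference(data, include_flagged):
    """Treatment mean minus control mean, optionally keeping flagged rows."""
    kept = [row for row in data if include_flagged or not row[2]]
    control = [y for cond, y, _ in kept if cond == "control"]
    treatment = [y for cond, y, _ in kept if cond == "treatment"]
    return mean(treatment) - mean(control)


# Primary analysis applies the pre-specified exclusion rule.
primary = mean_difference(rows, include_flagged=False)
# Sensitivity analysis re-runs it with the flagged responses included.
sensitivity = mean_difference(rows, include_flagged=True)

print(f"primary: {primary:.2f}, sensitivity: {sensitivity:.2f}")
```

In this toy data the estimate changes once the flagged responses are included, which is the kind of shift that should be reported alongside the primary result.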
Report exclusions in your methods section
Readers and reviewers should be able to see exactly what you did. We recommend reporting:
How many responses you flagged as suspicious, and on what criteria.
How many you excluded from analysis.
Whether those criteria were pre-registered or decided after seeing the data.
The results of any sensitivity analysis.
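As a rough sketch, the items above can be assembled into a draft sentence for the methods section. The function name, parameters, and wording are hypothetical and are not taken from any Prolific template. The numbers in the usage call are placeholders.

```python
def methods_summary(n_total, n_flagged, n_excluded, criteria, prespecified):
    """Draft a one-paragraph exclusion summary for a methods section."""
    timing = (
        "pre-specified before data collection"
        if prespecified
        else "decided after seeing the data"
    )
    return (
        f"Of {n_total} responses, {n_flagged} were flagged as suspicious "
        f"({criteria}); {n_excluded} were excluded from the primary analysis. "
        f"Exclusion criteria were {timing}."
    )


# Placeholder values for illustration only.
text = methods_summary(
    n_total=200,
    n_flagged=12,
    n_excluded=9,
    criteria="generic phrasing, suspiciously uniform tone",
    prespecified=True,
)
print(text)
```

The sensitivity analysis results would be reported next to this summary in whatever form the primary analysis uses.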
Prolific's Methodological Justification Pack offers template language for reporting sample, recruitment, and data quality decisions, and is a useful starting point if you are not sure how to phrase any of the above.
References
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832.
