I launched an online study earlier this year aimed at understanding the processes that influence eating behaviors and eating disorders among individuals who identify as LGBTQ+. Little did I know I was actually launching a battle with bots.
Online surveys and questionnaires let researchers like me collect large, geographically diverse samples relatively quickly. With the growing availability of machine learning and other big-data methods, there are few easier and more cost-efficient ways to collect data. But these tools have become targets for bots and professional survey takers, which are flooding online research and threatening data integrity.
That threat wasn’t on my mind when my study went live. It consisted of validated self-report questions to measure eating disorder behaviors, exposure to discrimination, psychological well-being, social support, and a number of demographic variables. I was interested in whether sexual orientation and gender influenced eating in ways that we have not yet conceptualized, so I developed a set of open-ended questions to learn more about the lived experiences of individuals who identify as LGBTQ+ and their eating behaviors.
The study was completely online and took approximately 45-60 minutes to complete. I told participants that those completing the study would be compensated $15 for their time.
Within 12 hours, I had received nearly 400 responses. Elated by the response and eager to learn about potential risk factors directly from the communities I work with, I turned first to the responses to the survey’s open-ended questions. I quickly realized I was dealing with a huge problem in my data.
My first clue was identical and somewhat bizarre responses to open-ended questions. I initially thought that the study was being trolled by an individual or a small group of people. I was incredibly upset and worried about the potential impacts this could have on my study results, funding, and reputation.
I immediately froze the data collection to give myself some time to investigate further. As I dug deeper into the data, it became clear that only bots could have completed the large number of false responses I received in such a short period of time.
During several hundred hours of damage control and bot hunting, I developed several unique coding schemes to detect the electronic signatures of potential bots. These included impossible timestamps; not answering required questions or requests, such as consent; identical survey or open-ended responses from different “participants”; inconsistent responses to identical questions; not completing the survey; data where it shouldn’t be, like responses to questions about being a cisgender woman by participants who identified as cisgender men; impossible data values; illogical responses to open-ended questions; and completing the survey materials impossibly fast. I flagged participants with two or more of these infractions as potential bots.
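The "two or more infractions" rule can be sketched in a few lines of Python. This is an illustration only, not the actual coding scheme from this study: the field names (`consent`, `finished`, `age`, `minutes`), the age range, and the 10-minute cutoff are all hypothetical, and only four of the checks listed above are shown.

```python
# Hypothetical sketch: count infractions per response, flag at two or more.
# Field names and thresholds are illustrative assumptions, not the study's own.

MIN_PLAUSIBLE_MINUTES = 10  # assumed cutoff for a survey meant to take 45-60 minutes

# Each check returns True when the response commits that infraction.
CHECKS = {
    "missing_consent": lambda r: r.get("consent") is not True,
    "incomplete": lambda r: not r.get("finished", False),
    "impossible_value": lambda r: not (18 <= r.get("age", -1) <= 120),
    "too_fast": lambda r: r.get("minutes", 0) < MIN_PLAUSIBLE_MINUTES,
}


def infractions(response):
    """Return the names of every check this response fails."""
    return [name for name, failed in CHECKS.items() if failed(response)]


def is_potential_bot(response, threshold=2):
    """Flag a response once it accumulates `threshold` or more infractions."""
    return len(infractions(response)) >= threshold


responses = [
    {"id": "a", "consent": True, "finished": True, "age": 29, "minutes": 52},
    {"id": "b", "consent": None, "finished": True, "age": 203, "minutes": 3},
    {"id": "c", "consent": True, "finished": True, "age": 34, "minutes": 7},
]
flagged = [r["id"] for r in responses if is_potential_bot(r)]
print(flagged)  # only "b" is flagged; "c" has a single infraction
```

Requiring two infractions rather than one keeps a single slip, such as a fast but otherwise plausible respondent, from being treated as a bot.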
Among the responses I received during the 12 hours the survey was live, only 11, or about 3%, were not flagged as bots. Had I not done this analysis, I might have unwittingly paid bots about $6,000, a big chunk of my research funds, for unusable data. I ultimately paid only the human participants for their time, and plan to use the money I wasn’t hoodwinked out of to reopen the original study on LGBTQ+ communities and eating behaviors, with much improved bot-detection tools.
Instead of being angered by this experience, I’m grateful I had it, since it has opened up a new avenue of research for me.
After I had paid the human participants, I took to social media to share my story and lessons learned in what became a viral Twitter thread.
My online #researchstudy was recently infiltrated by bots. I haven't shared this story publicly because I felt a bit like it was my fault. I'm putting my pride aside because I think #dataintegrity is and will be a growing issue in survey data and is not discussed enough (1/n)
— Melissa Simone, PhD (She/Her) (@m_simonephd) September 17, 2019
I had hoped that sharing my story might save at least one person from going through what I had. To my surprise, nearly 1 million people shared, read, or commented on this thread. Many had experienced bot infiltrations of their own. As a result of this experience, I have had phone calls with researchers from various academic institutions, industry, and even the U.S. Census Bureau to discuss the issue of bots in online surveys.
That was a “eureka” moment for me: I realized that this threat to online research went far beyond my project and that to protect the integrity of online research, scientists must become aware of cyber threats. This is now the focus of the methodological arm of my research, along with finding ways to reduce the impact of bots in research.
At the lowest level of sophistication, bots tend to speed through studies faster than humans can, provide illogical responses to open-ended questions, respond to questions that should be hidden from participants, or provide impossible time and date stamps, like a date that has not yet occurred, on informed consent documents. Here are three strategies I developed to identify “simple” bots and keep them from completing online research:
- Include two or three open-ended questions in the study and require responses to them. Monitor these questions for unusual responses or identical responses across “participants.”
- Track timestamps. Survey-building platforms often provide the option to include a stamp with the time the survey was begun, when it was ended, and when the survey taker moves from one questionnaire to the next. Flag impossible dates and times, bundles of participants beginning and completing the survey at the same time, and respondents who completed the survey impossibly fast.
- Use a captcha, a program that protects websites against bots.
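The first two strategies lend themselves to simple scripted checks once the data are exported. The sketch below is a rough Python illustration under stated assumptions: the column names (`open_ended`, `start`, `end`), the sample rows, the reference time `now`, and the 10-minute minimum duration are hypothetical, and it does not cover detecting bundles of participants who start at the same moment.

```python
from collections import Counter
from datetime import datetime, timedelta


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def duplicate_answers(rows, field="open_ended"):
    """Return ids of rows whose open-ended answer is shared with another row."""
    counts = Counter(normalize(r[field]) for r in rows)
    return [r["id"] for r in rows if counts[normalize(r[field])] > 1]


def timestamp_problems(row, now, min_duration=timedelta(minutes=10)):
    """Return the timestamp issues found in one row (empty list if none)."""
    problems = []
    if row["start"] > now or row["end"] > now:
        problems.append("future_date")
    if row["end"] < row["start"]:
        problems.append("ends_before_start")
    elif row["end"] - row["start"] < min_duration:
        problems.append("too_fast")
    return problems


now = datetime(2019, 9, 1, 12, 0)  # hypothetical time of the analysis
rows = [
    {"id": 1, "open_ended": "I eat when stressed.",
     "start": datetime(2019, 8, 30, 9, 0), "end": datetime(2019, 8, 30, 9, 50)},
    {"id": 2, "open_ended": "Good food is good",
     "start": datetime(2019, 8, 30, 9, 1), "end": datetime(2019, 8, 30, 9, 3)},
    {"id": 3, "open_ended": "good  food is GOOD",
     "start": datetime(2020, 1, 1, 9, 0), "end": datetime(2020, 1, 1, 10, 0)},
]

print(duplicate_answers(rows))  # rows 2 and 3 share an answer after normalizing
for row in rows:
    print(row["id"], timestamp_problems(row, now))
```

Exact matching after normalization only catches verbatim copies; near-duplicate or merely bizarre answers still need a human read.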
More sophisticated bots are harder to detect. Sophisticated programmers will ensure that their bots aren’t stacked together by manipulating timestamps and originating IP addresses, code them to create a normal distribution across responses, and extract language from the survey itself to compose more logical responses to open-ended questions. These kinds of bots require additional protective tools:
- Attention or logic checks. Dispersing questions that check whether a survey taker is paying attention while moving through survey materials can flag a bot. For example, the instruction “Ignore the rest of the content of this passage and select the second response option” could be inserted in the middle of a long question. Those who select anything other than the second option would then be flagged.
- Include “honeypot” questions. These are embedded in a survey but are coded in a way that prevents human participants, but not bots, from viewing and responding to them. Highly sophisticated bots can detect honeypot items that stand out in comparison to other items, so honeypot items need to use language similar to that of the other items in the study.
- Use skip logic. Questions that direct survey takers to skip one or more sections prevent them from responding to questions that aren’t applicable to their circumstances. Someone who is not a father, for example, should not have to spend time responding to questions about fatherhood.
- Add redundancy. Ask the same question — “What is your age?” is a good example — at two separate points and check for differences in responses.
- Prevent “ballot-stuffing.” Most study platforms offer the option to track IP addresses to prevent individuals from completing the study more than once.
- Make it personal. Consider including a public link to screen potential participants for eligibility, with ballot-stuffing protections in place. Those who meet eligibility requirements can then be sent a unique link to the survey that can be used only once.
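Several of these protections leave traces that can be screened after export. The Python sketch below is one possible way to do that, not a definitive implementation: the item names (`q17`, `q23_hidden`, `is_father`, `fatherhood_q1`, `age_start`, `age_end`) and the sample responses are hypothetical, and the IP addresses come from a range reserved for documentation.

```python
from collections import Counter

ATTENTION_ITEM = "q17"         # hypothetical item carrying the embedded instruction
ATTENTION_EXPECTED = 2         # "select the second response option"
HONEYPOT_ITEM = "q23_hidden"   # hypothetical item hidden from human participants


def screen(response):
    """Return the names of the checks a single response fails."""
    failed = []
    if response.get(ATTENTION_ITEM) != ATTENTION_EXPECTED:
        failed.append("attention_check")
    # Humans never see the honeypot, so any answer to it is suspicious.
    if response.get(HONEYPOT_ITEM) not in (None, ""):
        failed.append("honeypot_answered")
    # Skip logic: non-fathers should never reach the fatherhood section.
    if response.get("is_father") is False and response.get("fatherhood_q1") not in (None, ""):
        failed.append("answered_skipped_section")
    # Redundancy: the same question asked twice should get the same answer.
    if response.get("age_start") != response.get("age_end"):
        failed.append("age_mismatch")
    return failed


def repeated_ips(responses):
    """Return IP addresses that appear on more than one response."""
    counts = Counter(r["ip"] for r in responses)
    return sorted(ip for ip, n in counts.items() if n > 1)


responses = [
    {"ip": "192.0.2.10", "q17": 2, "q23_hidden": "", "age_start": 27, "age_end": 27},
    {"ip": "192.0.2.44", "q17": 4, "q23_hidden": "Agree", "is_father": False,
     "fatherhood_q1": "Often", "age_start": 31, "age_end": 45},
    {"ip": "192.0.2.44", "q17": 2, "q23_hidden": None, "age_start": 22, "age_end": 22},
]

for r in responses:
    print(r["ip"], screen(r))
print(repeated_ips(responses))
```

A repeated IP address is a prompt for review rather than proof of ballot-stuffing, since people in the same household or on the same campus network can share one.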
There is no sure way to conduct a bot-free study. Given the speed at which bots are learning to bypass our protections, it is possible that some will complete online research studies even with protections in place. And we’re in something of an arms race: As researchers develop and deploy data protection tools, bot programmers find ways around them.
Doing online research comes with a set of risks, yet the data collected using these methods are invaluable and often can’t be collected any other way. That’s why creators and consumers of scientific information must be vigilant and proactive to protect the integrity of the data they collect online, and institutional review boards should require researchers to create a detailed section on data integrity from cyber threats for studies conducted online or on technology-driven platforms.
Now that the dust has settled, I can see the silver lining in this emotionally taxing experience. Actually, after the initial shock wore off, it’s hard to say that anything negative came out of it. As a quantitative psychologist, I am always looking for the next methodological challenge. I wasn’t expecting to find it in cyber threats to online research, yet here I am, looking forward to my next project: investigating what makes some studies more attractive to bots than others.
Melissa Simone, Ph.D., is a postdoctoral research fellow in the Department of Psychiatry and Behavioral Sciences at the University of Minnesota. The online research described here is funded by a training grant from the National Institute of Mental Health. This article is solely the responsibility of the author and does not necessarily represent the views of the National Institute of Mental Health, the National Institutes of Health, or the University of Minnesota.