Some professors allow the technology in their classrooms, others forbid it, and others permit it at their discretion, which might include scrutinizing all students' work with GPT detectors. A peer-reviewed paper recently published in Patterns found that programs built to detect whether text was written by AI or by humans more often falsely label text as AI-generated when it was written by non-native English writers.
In the study, the researchers tested seven widely used GPT detectors on 91 essays written for the Test of English as a Foreign Language (TOEFL) by Chinese speakers and 88 essays written by U.S. eighth-graders, the latter obtained from the Hewlett Foundation's Automated Student Assessment Prize (ASAP).
The GPT detectors accurately classified all U.S. student essays, but incorrectly labeled an average of 61% of the TOEFL essays as AI-generated. One of the detectors incorrectly flagged 97.8% of the TOEFL essays as generated by AI.
The research also found these GPT detectors are not as effective at catching plagiarism as their users may believe. Many of the detectors advertise 99% accuracy without evidence to back up the claims.
The researchers generated essays using ChatGPT, and the GPT detectors flagged 70% of them as AI-generated. But simple prompts, such as asking ChatGPT to "elevate the provided text by employing literary language", changed the text enough to reduce that figure to 3%, which meant the GPT detectors then incorrectly determined the essays were written by humans 97% of the time.
"Our current recommendation is that we should be extremely careful about and maybe try to avoid using these detectors as much as possible," said senior author James Zou, from Stanford University.
The authors attributed the errors to GPT detectors favoring complex language and penalizing simpler word choices that are commonly used by non-native English writers. They found the TOEFL essays exhibited lower text perplexity, a measure of how "surprised" an AI model is by the text. If the next word in an essay is hard for the GPT detector to predict, then it is more likely to assume a human wrote the text; if the next word is easy to predict, it is more likely to assume AI created it.
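To make the idea of perplexity concrete, here is a minimal Python sketch. It is not the method used by any of the detectors in the study: it assumes a toy word-frequency (unigram) model in place of a large language model, and the function names, the reference text and the cutoff value are all hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def train_unigram(reference_tokens):
    """Return a function giving an add-one-smoothed word probability.

    Toy stand-in for a language model: real detectors are assumed to
    use far larger models, not simple word counts.
    """
    counts = Counter(reference_tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves mass for one unseen-word class

    def prob(token):
        return (counts.get(token, 0) + 1) / (total + vocab_size)

    return prob


def perplexity(tokens, prob):
    """Exponential of the average negative log-probability per word."""
    avg_nll = -sum(math.log(prob(t)) for t in tokens) / len(tokens)
    return math.exp(avg_nll)


def looks_ai_generated(tokens, prob, threshold):
    """Hypothetical rule: flag text whose perplexity is below a cutoff."""
    return perplexity(tokens, prob) < threshold


# Hypothetical reference text the toy model "learns" from.
reference = "the cat sat on the mat the dog sat on the rug".split()
prob = train_unigram(reference)

plain = "the cat sat on the mat".split()                  # common words
ornate = "the feline perched atop the tapestry".split()   # rarer words

THRESHOLD = 10.0  # arbitrary cutoff, for illustration only

for name, text in [("plain", plain), ("ornate", ornate)]:
    print(name, round(perplexity(text, prob), 2),
          looks_ai_generated(text, prob, THRESHOLD))
```

In this toy setup, the sentence built from common words gets the lower perplexity and falls under the cutoff, so it is the one flagged, while the sentence with rarer words is not. That mirrors the pattern the researchers describe, though only as an illustration under the assumptions above.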
"If you use common English words, the detectors will give a low perplexity score, meaning my essay is likely to be flagged as AI-generated. If you use complex and fancier words, then it's more likely to be classified as human written by the algorithms," Zou explained.
Detecting AI-generated content, in general, can be difficult, which is why detection methods in the form of third-party computer programs have become popular. The research suggests, however, that these tools can marginalize non-native English writers in evaluative and educational settings.
"It can have significant consequences if these detectors are used to review things like job applications, college entrance essays or high school assignments," Zou explained.
Paradoxically, the study points out that GPT detectors could push non-native English speakers to use generative AI tools more, both to evade detection and to improve their language skills, which would help them avoid the potential harassment and restricted visibility that could result from being discriminated against.