Seeing Through Synthetic Faces: Testing ChatGPT and Gemini Against AI Image Deepfakes

Can AI chatbots really help police the flood of AI-made faces online? A University at Buffalo-led team put that idea to the test, asking multimodal models such as OpenAI’s ChatGPT and Google’s Gemini to spot AI-generated images, then weighing their promise against the limits that still keep specialist detectors ahead.

When chatbots look at faces

The study centers on a problem that has become hard to ignore: AI-generated photos and manipulated imagery now blend into everyday browsing, and the stakes rise when synthetic content is used to mislead. Rather than building a new detector from scratch, the researchers examined whether large language models, originally built for language, could be repurposed to judge whether a human face looks authentic.

The team focused on multimodal versions of these models, meaning systems that can interpret images as well as text by learning from large collections of captioned photos. In that setup, the model effectively treats an image as something it can describe in language, then uses that semantic understanding to reason about what it is seeing.
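To make that setup concrete, here is a minimal, hypothetical sketch of how an image question might be posed to a multimodal model through a chat API. The model name, prompt wording, and helper function are illustrative assumptions, not details taken from the study.

```python
import base64

from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_about_face(image_path: str) -> str:
    """Ask a multimodal chat model whether a face photo looks AI-generated.

    Illustrative only: the model name and prompt wording are assumptions,
    not the prompts used in the University at Buffalo study.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of multimodal model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Look for signs of synthesis or manipulation in this "
                          "face photo, such as blurred hair or abrupt transitions "
                          "with the background. Say whether it looks real or "
                          "AI-generated, and explain your reasoning.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Because the model answers in ordinary language, the same call that produces a verdict can also produce the kind of plain-English explanation the researchers highlight below.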

What the study found

To probe that capability, the researchers provided thousands of real and AI-generated face images and instructed the models to look for synthetic artifacts and signs of manipulation. In this evaluation, ChatGPT reached 79.5% accuracy on images generated by latent diffusion and 77.2% on images generated by StyleGAN.
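For a sense of how such an accuracy figure is tallied, here is a rough, hypothetical sketch; the labels and verdict function are stand-ins, not the study's actual evaluation code.

```python
def score_accuracy(samples, classify_as_fake):
    """Return the fraction of (image_path, is_fake) pairs the classifier gets right.

    `classify_as_fake` stands in for any verdict function -- for example, a
    wrapper that sends the image to a chatbot and parses its answer. This is
    an illustrative sketch, not the study's evaluation pipeline.
    """
    correct = sum(
        1 for image_path, is_fake in samples
        if classify_as_fake(image_path) == is_fake
    )
    return correct / len(samples)


# Hypothetical usage: with 1,000 labeled faces, 795 correct verdicts works out
# to roughly the 79.5% accuracy reported for latent-diffusion images.
# accuracy = score_accuracy(labeled_faces, chatbot_says_fake)
```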

A standout detail was not only whether the model could make a call, but how it communicated the basis for that call. The researchers highlighted that ChatGPT could describe its reasoning in plain terms, for example pointing to slightly blurred hair and an abrupt-looking transition between a subject and the background when discussing an AI-generated photo of a man with glasses.

Where it falls short today

Even with those encouraging results, the study stresses that multimodal chatbots still trail the strongest deepfake detection algorithms, which the researchers describe as reaching accuracy in the mid-to-high 90 percent range. One reason is that specialized detectors can pick up signal-level statistical cues that are invisible to people, while an LLM’s analysis is largely anchored in semantic-level irregularities.

That semantic strength can also become a weakness, since focusing on human-interpretable oddities may miss subtler manipulations. The team also observed practical friction: ChatGPT sometimes refused to analyze images when asked directly whether a photo was AI-generated, citing an inability to help, which the researchers attributed to confidence thresholds and model safeguards.

Gemini, meanwhile, performed similarly in identifying artifacts but often struggled to produce useful supporting explanations, at times offering observations the researchers described as nonsensical, such as pointing to moles that were not there. Taken together, the findings frame LLMs as promising aids, particularly for interpretability, but not yet replacements for purpose-built detectors.
