Have LLMs Finally Mastered Geolocation?

An ambiguous city street, a freshly mown field, and a parked armoured vehicle were among the example photos we chose to challenge Large Language Models (LLMs) from OpenAI, Google, Anthropic, Mistral and xAI to geolocate.

Back in July 2023, Bellingcat analysed the geolocation performance of OpenAI and Google’s models. Both chatbots struggled to identify images and were highly prone to hallucinations. However, since then, such models have rapidly evolved.

To assess how LLMs from OpenAI, Google, Anthropic, Mistral and xAI compare today, we ran 500 geolocation tests, with 20 models each analysing the same set of 25 images.

We chose 25 of our own travel photos, varying in difficulty to geolocate, none of which had been published online before.

Our analysis included older and “deep research” versions of the models, to track how their geolocation capabilities have developed over time. We also included Google Lens to compare whether LLMs offer a genuine improvement over traditional reverse image search. While reverse image search tools work differently from LLMs, they remain one of the most effective ways to narrow down an image’s location when starting from scratch.

The Test

We used 25 of our own travel photos to test a range of outdoor scenes, in both rural and urban areas, with and without identifiable landmarks such as buildings, mountains, signs or roads. These images were sourced from every continent, including Antarctica.

The vast majority have not been reproduced here, as we intend to continue using them to evaluate newer models as they are released. Publishing them here would compromise the integrity of future tests.

Each LLM was given a photo that had not been published online and contained no metadata. All models then received the same prompt: “Where was this photo taken?”, alongside the image. If an LLM asked for more information, the response was identical: “There is no supporting information. Use this photo alone.”

We tested the following models:

Developer	Model	Developer’s Description
Anthropic	Claude Haiku 3.5	“fastest model for daily tasks”
	Claude Sonnet 3.7	“our most intelligent model yet”
	Claude Sonnet 3.7 (extended thinking)	“enhanced reasoning capabilities for complex tasks”
	Claude Sonnet 4.0	“smart, efficient model for everyday use”
	Claude Opus 4.0	“powerful, large model for complex challenges”
Google	Gemini 2.0 Flash	“for everyday tasks plus more features”
	Gemini 2.5 Flash	“uses advanced reasoning”
	Gemini 2.5 Pro	“best for complex tasks”
	Gemini Deep Research	“get in-depth answers”
Mistral	Pixtral Large	“frontier-level image understanding”
OpenAI	ChatGPT 4o	“great for most tasks”
	ChatGPT Deep Research	“designed to perform in-depth, multi-step research using data on the public web”
	ChatGPT 4.5	“good for writing and exploring ideas”
	ChatGPT o3	“uses advanced reasoning”
	ChatGPT o4-mini	“fastest at advanced reasoning”
	ChatGPT o4-mini-high	“great at coding and visual reasoning”
xAI	Grok 3	“smartest”
	Grok 3 DeepSearch	“advanced search and reasoning”
	Grok 3 DeeperSearch	“extended search, more reasoning”

This was not a comprehensive review of all available models, partly due to the speed at which new models and versions are currently being released. For example, we did not assess DeepSeek, as it currently only extracts text from images. Note that in ChatGPT, regardless of what model you select, the “deep research” function is currently powered by a version of o4-mini.

Gemini models have been released in “preview” and “experimental” formats, as well as dated versions like “03-25” and “05-06”. To keep the comparisons manageable, we grouped these variants under their respective base models, e.g. “Gemini 2.5 Pro”.

We also compared every test with the first 10 results from Google Lens’s “visual match” feature, to assess the difficulty of the tests and the usefulness of LLMs in solving them.

We ranked all responses on a scale from 0 to 10, with 10 indicating an accurate and specific identification, such as a neighbourhood, path, or landmark, and 0 indicating no attempt to identify the location at all.
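
As a rough illustration of how scores on this 0-to-10 scale could be averaged per model and ranked, here is a minimal Python sketch. The function name `mean_scores`, the model labels and the sample values are all hypothetical; this is not the tooling used for the tests, and the numbers are not results from them.

```python
from statistics import mean


def mean_scores(scores_by_model):
    """Return the average 0-10 score for each model.

    scores_by_model maps a model name to its list of per-photo scores.
    """
    averages = {}
    for model, scores in scores_by_model.items():
        if not scores:
            raise ValueError(f"no scores recorded for {model}")
        for s in scores:
            # Reject anything outside the 0-10 scale described above.
            if not 0 <= s <= 10:
                raise ValueError(f"score {s} for {model} is outside 0-10")
        averages[model] = mean(scores)
    return averages


# Hypothetical placeholder values, NOT results from the tests described here.
example = {"model-a": [10, 7, 0], "model-b": [3, 3, 6]}

# Sort models from highest to lowest average score.
ranking = sorted(mean_scores(example).items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Averaging is only one possible summary; a median or a count of exact matches would weight confident misses differently.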

And the Winner Is…

ChatGPT beat Google Lens.

In our tests, ChatGPT o3, o4-mini, and o4-mini-high were the only models to outperform Google Lens in identifying the correct location, though not by a large margin. All other models were less effective when it came to geolocating our test photos.

We scored 20 models against 25 photos, rating each from 0 (red) to 10 (dark green) for accuracy in geolocating the images.

Even Google’s own LLM, Gemini, fared worse than Google Lens. Surprisingly, it also scored lower than xAI’s Grok, despite Grok’s well-documented tendency to hallucinate. Gemini’s Deep Research mode scored roughly the same as the three Grok models we tested, with DeeperSearch proving the most effective of xAI’s LLMs.

The highest-scoring models from Anthropic and Mistral lagged well behind their current competitors from OpenAI, Google, and xAI. In several cases, even Claude’s most advanced models identified only the continent, while others were able to narrow their responses down to specific parts of a city. The latest Claude model, Opus 4, performed at a similar level to Gemini 2.5 Pro.

Here are some of the highlights from five of our tests.

A Road in the Japanese Mountains

The photo below was taken on the road between Takayama and Shirakawa in Japan. As well as the road and mountains, signs and buildings are also visible.

Test “snowy-highway” depicted a road near Takayama, Japan.

Gemini 2.5 Pro’s response was not useful. It mentioned Japan, but also Europe, North and South America and Asia. It replied:

“Without any clear, identifiable landmarks, distinctive signage in a recognisable language, or distinctive architectural styles, it’s very difficult to determine the exact country or specific location.”

In contrast, o3 identified both the architectural style and signage, responding:

“Best guess: a snowy mountain stretch of central-Honshu, Japan—somewhere in the Nagano/Toyama area. (Japanese-style houses, kanji on the billboard, and typical expressway barriers give it away.)”

A Field on the Swiss Plateau

This photo was taken near Zurich. It showed no easily recognisable features apart from the mountains in the distance. A reverse image search using Google Lens did not immediately lead to Zurich. Without any context, identifying the location of this photo manually could take some time. So how did the LLMs fare?

Test “field-hills” depicted a view of a field near Zurich.

Gemini 2.5 Pro stated that the photo showed scenery common to many parts of the world and that it could not narrow it down without additional context.

By contrast, ChatGPT excelled at this test. o4-mini identified the “Jura foothills in northern Switzerland”, while o4-mini-high placed the scene “between Zürich and the Jura mountains”.

These answers stood in stark contrast to those from Grok Deep Research, which, despite the visible mountains, confidently stated the photo was taken in the Netherlands. This conclusion appeared to be based on the Dutch name of the account used, “Foeke Postma”, with the model assuming the photo must have been taken there and calling it a “reasonable and well-supported inference”.

An Inner-City Alley Full of Visual Clues in Singapore

This photo of a narrow alleyway on Circular Road in Singapore provoked a wide range of responses from the LLMs and Google Lens, with scores ranging from 3 (nearby country) to 10 (correct location).

Test “dark-alley”, a photo taken of an alleyway in Singapore.

The test served as a good example of how LLMs can outperform Google Lens by focusing on small details in a photo to identify the exact location. Those that answered correctly referenced the writing on the mailbox on the left in the foreground, which revealed the precise address.

While Google Lens returned results from all over Singapore and Malaysia, part of ChatGPT o4-mini’s response read: “This appears to be a classic Singapore shophouse arcade – in fact, if you look at the mailboxes on the left you can just make out the label ‘[correct address].’”

Some of the other models noticed the mailbox but could not read the address visible in the image, falsely inferring that it pointed to other locations. Gemini 2.5 Flash responded, “The design of the mailboxes on the left, particularly the ‘G’ for Geylang, points strongly towards Singapore.” Another Gemini model, 2.5 Pro, spotted the mailbox but focused instead on what it interpreted as Thai script on a storefront, confidently answering: “The visual evidence strongly suggests the photo was taken in an alleyway in Thailand, likely in Bangkok.”

The Costa Rican Coast

One of the harder tests we gave the models to geolocate was a photo taken from Playa Langosta on the Pacific Coast of Costa Rica near Tamarindo.

Test “beach-forest” showed Playa Langosta, Costa Rica.

Gemini and Claude performed the worst on this task, with most models either declining to guess or giving incorrect answers. Claude 3.7 Sonnet correctly identified Costa Rica but hedged with other locations, such as Southeast Asia. Grok was the only model to guess the exact location correctly, while several ChatGPT models (Deep Research, o3 and the o4-minis) guessed within 160km of the beach.
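
Distance-based judgements such as “within 160km” can be checked with a great-circle calculation. The following is a minimal sketch, assuming the guessed and true locations are available as latitude/longitude pairs. The function names `haversine_km` and `within_km` and the Zurich and Bern coordinates are illustrative assumptions, not taken from the tests.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_km(guess, truth, limit_km):
    """True if the guessed (lat, lon) lies within limit_km of the true (lat, lon)."""
    return haversine_km(*guess, *truth) <= limit_km


# Illustrative coordinates only (approximate city centres).
zurich = (47.3769, 8.5417)
bern = (46.9480, 7.4474)
print(round(haversine_km(*zurich, *bern)), "km")
print(within_km(bern, zurich, 160))
```

The haversine formula treats the Earth as a sphere, which is accurate enough for judging guesses at the scale of tens of kilometres.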

An Armoured Vehicle on the Streets of Beirut

This photo was taken on the streets of Beirut and features several details useful for geolocation, including an emblem on the side of the armoured personnel carrier and a partially visible Lebanese flag in the background.

Test “street-military” depicted an armoured personnel carrier on the streets of Beirut.

Surprisingly, most models struggled with this test: Claude 4 Opus, billed as a “powerful, large model for complex challenges”, guessed “somewhere in Europe” owing to the “European-style street furniture and building design”, while Gemini and Grok could only narrow the location down to Lebanon. Half of the ChatGPT models responded with Beirut. Only two models, both ChatGPT, referenced the flag.

So Have LLMs Finally Mastered Geolocation?

LLMs can certainly help researchers to spot the details that Google Lens or they themselves might miss.

One clear advantage of LLMs is their ability to search in multiple languages. They also appear to make good use of small clues, such as vegetation, architectural styles or signage. In one test, a photo of a man wearing a life vest in front of a mountain range was correctly located because the model identified part of a company name on his vest and linked it to a nearby boat tour operator.

For touristic areas and scenic landscapes, Google Lens still outperformed most models. When shown a photo of Schluchsee lake in the Black Forest, Germany, Google Lens returned it as the top result, while ChatGPT was the only LLM to correctly identify the lake’s name. In contrast, in urban settings, LLMs excelled at cross-referencing subtle details, whereas Google Lens tended to fixate on larger, similar-looking structures, such as buildings or ferris wheels, which appear in many other locations.

Heat map to show how each model performed on all 25 tests.

Enhanced Reasoning Modes

You would assume that turning on “deep research” or “extended thinking” functions would have resulted in higher scores. However, on average, Claude and ChatGPT performed worse. Only one Grok model, DeeperSearch, and one Gemini model, Gemini Deep Research, showed improvement. For example, ChatGPT Deep Research was shown a photo of a coastline and took nearly 13 minutes to produce an answer that was about 50km north of the correct location. Meanwhile, o4-mini-high responded in just 39 seconds and gave an answer 15km closer.

Overall, Gemini was more cautious than ChatGPT, but Claude was the most cautious of all. Claude’s “extended thinking” mode made Sonnet even more conservative than the standard version. In some cases, the regular model would hazard a guess, albeit hedged in probabilistic terms, whereas with “extended thinking” enabled for the same test, it either declined to guess or offered only vague, region-level responses.

LLMs Continue to Hallucinate

All the models, at some point, returned answers that were entirely wrong. ChatGPT was typically more confident than Gemini, often leading to better answers, but also more hallucinations.

The risk of hallucinations increased when the scenery was temporary or had changed over time. In one test, for instance, a beach photo showed a large hotel and a temporary ferris wheel (installed in 2024 and dismantled during winter). Many of the models consistently pointed to a different, more frequently photographed beach with a similar ride, despite clear differences.

Final Recommendations

Your account and prompt history may bias results. In one case, when analysing a photo taken in the Coral Pink Sand Dunes State Park, Utah, ChatGPT o4-mini referenced previous conversations with the account holder: “The user mentioned Durango and Colorado earlier, so I suspect they might have posted a photo from a previous trip.”

Similarly, Grok appeared to draw on a user’s Twitter profile and past tweets, even without explicit prompts to do so.

Video comprehension also remains limited. Most LLMs cannot search for or watch video content, cutting off a rich source of location data. They also struggle with coordinates, often returning rough or simply incorrect responses.

Ultimately, LLMs are no silver bullet. They still hallucinate, and when a photo lacks detail, geolocating it will still be difficult. That said, unlike our controlled tests, real-world investigations typically involve additional context. While Google Lens accepts only keywords, LLMs can be supplied with far richer information, making them more adaptable.

There is little doubt that, at the rate they are evolving, LLMs will continue to play an increasingly significant role in open source research. And as newer models emerge, we will continue to test them.

Infographics by Logan Williams and Merel Zoet
