A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic, at a fraction of the cost.
OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.
That score is a significant leap over the leading models from well-funded rivals. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.
"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."
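To make the size of the reported gap concrete, the short sketch below computes both the percentage-point difference and the relative improvement from the three scores quoted in the article. The helper name `gap_vs` is purely illustrative, and the figures are the companies' and researchers' reported numbers, not independently verified results.

```python
# Success rates on Online-Mind2Web as reported in the article (percent).
scores = {
    "Lux": 83.6,
    "OpenAI Operator": 61.3,
    "Claude Computer Use": 56.3,
}


def gap_vs(baseline: str, candidate: str = "Lux") -> tuple[float, float]:
    """Return (percentage-point gap, relative improvement in percent)."""
    diff = scores[candidate] - scores[baseline]
    relative = 100.0 * diff / scores[baseline]
    return round(diff, 1), round(relative, 1)


if __name__ == "__main__":
    for rival in ("OpenAI Operator", "Claude Computer Use"):
        points, relative = gap_vs(rival)
        print(f"Lux vs {rival}: +{points} points ({relative}% relative)")
```

On the reported numbers, Lux leads Operator by 22.3 percentage points (about 36 percent in relative terms) and Claude Computer Use by 27.3 points (about 48 percent).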
The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.
But independent research has cast doubt on whether current agents are as capable as their creators suggest.
Why university researchers built a tougher benchmark to test AI agents, and what they discovered
The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.
Published in April and accepted to the Conference on Language Modeling 2025, the benchmark includes 300 diverse tasks across 136 real websites, covering everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.
The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."
When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems, despite heavy investment and marketing fanfare, did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.
"It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we're also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."
The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.
How OpenAGI trained its AI to take actions instead of just generating text
OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.
Conventional language models train on vast text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.
Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.
"The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."
This self-reinforcing training loop, if it functions as described, could help explain how a smaller team might achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.
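OpenAGI has not published how its training loop works, so the following is only a toy illustration of the general idea Qin describes: a model explores, successful explorations become training data, and the retrained model explores better. Everything here is an assumption made for illustration, including the three-step task, the four candidate actions per step, and the count-based "training." It is not Lux's method.

```python
import random

N_STEPS = 3      # hypothetical task: three UI steps in a row
N_ACTIONS = 4    # four candidate actions per step; action 0 is the "correct" one


def sample_episode(weights, rng):
    """Roll out one episode; return the actions taken and whether it succeeded."""
    actions = [
        rng.choices(range(N_ACTIONS), weights=weights[step])[0]
        for step in range(N_STEPS)
    ]
    return actions, all(a == 0 for a in actions)


def fit_policy(dataset):
    """'Train' by counting: weight = 1 + times an action appears in kept episodes."""
    weights = [[1.0] * N_ACTIONS for _ in range(N_STEPS)]
    for actions in dataset:
        for step, action in enumerate(actions):
            weights[step][action] += 1.0
    return weights


def self_improve(rounds=5, episodes_per_round=1000, seed=0):
    """Alternate exploration and retraining; return the success rate per round."""
    rng = random.Random(seed)
    dataset = []                     # successful trajectories gathered so far
    weights = fit_policy(dataset)    # empty dataset gives a uniform policy
    success_rates = []
    for _ in range(rounds):
        successes = 0
        for _ in range(episodes_per_round):
            actions, ok = sample_episode(weights, rng)
            if ok:
                dataset.append(actions)   # exploration yields new training data
                successes += 1
        success_rates.append(successes / episodes_per_round)
        weights = fit_policy(dataset)     # the retrained model drives the next round
    return success_rates


if __name__ == "__main__":
    print(self_improve())
```

A uniform policy completes this toy task only about one time in 64, yet those rare successes are enough to tilt the next round's policy toward the correct actions, and the success rate then climbs quickly. A real system faces much harder problems that this sketch ignores, such as deciding whether an exploration actually succeeded.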
OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.
Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications
A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.
Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications: spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, and code editing in development environments.
OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a software development kit for developers alongside the model, allowing third parties to build applications on top of Lux.
The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.
"We're partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.
The company confirmed it's in exploratory discussions with AMD and Microsoft about additional partnerships.
What happens when you ask an AI agent to copy your bank details
Computer-use agents present novel safety challenges that don't arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm: transferring money, deleting files, or exfiltrating sensitive information.
OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.
In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I'm not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.
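The refuse-and-warn behavior in that example can be pictured as a gate that runs before any action is executed. The sketch below is a deliberately simple, rule-based stand-in: the term list, the `Decision` type, and the function names are all hypothetical, and Lux reportedly makes this judgment inside the model's own reasoning rather than through keyword matching.

```python
from dataclasses import dataclass

# Hypothetical policy: terms that mark a request as touching sensitive data.
SENSITIVE_TERMS = ("bank details", "password", "credit card", "social security")


@dataclass
class Decision:
    allowed: bool
    message: str


def check_request(request: str) -> Decision:
    """Refuse before acting if the request mentions a sensitive term."""
    lowered = request.lower()
    for term in SENSITIVE_TERMS:
        if term in lowered:
            return Decision(False, f"Refused: request involves sensitive data ({term}).")
    return Decision(True, "OK to proceed.")


def run_agent(request: str) -> str:
    """Check the policy first; only an allowed request reaches execution."""
    decision = check_request(request)
    if not decision.allowed:
        return decision.message       # warn the user and execute nothing
    return f"Executing: {request}"    # placeholder for real action execution


if __name__ == "__main__":
    print(run_agent("copy my bank details and paste it into a new Google doc"))
    print(run_agent("open the calculator"))
```

A keyword filter like this is easy to evade by rephrasing, which is one reason such safeguards need adversarial testing before they can be trusted.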
Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.
The MIT researcher who built two of GitHub's most downloaded AI models
Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.
He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.
Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model he led development on, demonstrated that a high-performing model could be trained from scratch for less than $100,000, a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.
His earlier open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, accumulated roughly 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.
Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.
Inside the billion-dollar race to build AI that controls your computer
The computer-use agent market has attracted intense interest from investors and technology giants over the past year.
OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.
But the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.
OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded rivals. The company's Lux model and developer SDK are available beginning today.
Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.
But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the biggest checkbooks but through the cleverest architectures, and that a small team with the right ideas can outmaneuver the giants.
The technology industry has seen that story before. It rarely stays true for long.
