2025 © Madisony.com. All Rights Reserved.
Technology

Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

Madisony
Last updated: November 8, 2025 2:12 am

The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framework for testing, improving and optimizing AI agents in containerized environments.

The dual release aims to address long-standing pain points in testing and optimizing AI agents, particularly those built to operate autonomously in realistic developer environments.

With a more difficult and rigorously verified task set, Terminal-Bench 2.0 replaces version 1.0 as the standard for assessing frontier model capabilities.

Harbor, the accompanying runtime framework, lets developers and researchers scale evaluations across thousands of cloud containers and integrates with both open-source and proprietary agents and training pipelines.

“Harbor is the package we wish we had had while making Terminal-Bench,” wrote co-creator Alex Shaw on X. “It’s for agent, model, and benchmark developers and researchers who want to evaluate and improve agents and models.”

Higher Bar, Cleaner Data

Terminal-Bench 1.0 saw rapid adoption after its release in May 2025, becoming a default benchmark for evaluating AI-powered agents that operate in developer-style terminal environments. These agents interact with systems through the command line, mimicking how developers work behind the scenes of the graphical user interface.

However, its broad scope came with inconsistencies. Several tasks were identified by the community as poorly specified or unstable because of changes to external services.

Version 2.0 addresses those issues directly. The updated suite includes 89 tasks, each subjected to several hours of manual and LLM-assisted validation. The emphasis is on making tasks solvable, realistic, and clearly specified, raising the difficulty ceiling while improving reliability and reproducibility.

A notable example is the download-youtube task, which was removed or refactored in 2.0 because of its dependence on unstable third-party APIs.

“Astute Terminal-Bench fans may notice that SOTA performance is comparable to TB1.0 despite our claim that TB2.0 is harder,” Shaw noted on X. “We believe this is because task quality is substantially higher in the new benchmark.”

Harbor: Unified Rollouts at Scale

Alongside the benchmark update, the team launched Harbor, a new framework for running and evaluating agents in cloud-deployed containers.

Harbor supports large-scale rollout infrastructure, with compatibility for major providers like Daytona and Modal.

Designed to generalize across agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.

Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command line interface), a GPT-5 powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far.

Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
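To make the clustering concrete, the following Python sketch restates the five leaderboard rates quoted above and works out two simple figures: the gap between first and fifth place, and the approximate number of the 89 tasks each rate corresponds to. The helper names are illustrative, and the task counts are only a rough reading of the percentages, since the article does not say how rates are aggregated across runs.

```python
# Success rates (percent) exactly as quoted in the article's top-five list.
LEADERBOARD = [
    ("Codex CLI (GPT-5)", 49.6),
    ("Codex CLI (GPT-5-Codex)", 44.3),
    ("OpenHands (GPT-5)", 43.8),
    ("Terminus 2 (GPT-5-Codex)", 43.4),
    ("Terminus 2 (Claude Sonnet 4.5)", 42.8),
]

# Terminal-Bench 2.0 task count, as stated in the article.
TOTAL_TASKS = 89


def implied_tasks(rate_percent, total=TOTAL_TASKS):
    """Approximate number of tasks a success rate corresponds to."""
    return rate_percent / 100 * total


def spread(board):
    """Gap in percentage points between the best and worst listed rate."""
    rates = [rate for _, rate in board]
    return round(max(rates) - min(rates), 1)


if __name__ == "__main__":
    for name, rate in LEADERBOARD:
        print(f"{name}: {rate}% is about {implied_tasks(rate):.1f} of {TOTAL_TASKS} tasks")
    print(f"Gap between first and fifth place: {spread(LEADERBOARD)} points")
```

Under these assumptions the top five sit within about seven percentage points of each other, and even the leading rate corresponds to fewer than half of the 89 tasks.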

Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
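For scripted use, the command above can be assembled as an argument list before being handed to a process runner. The Python sketch below only builds and prints the command; it does not invoke Harbor. It uses only the flags shown in the article, written with double hyphens on the assumption that the long options follow the usual CLI convention. The function name and the placeholder model, agent, and directory values are hypothetical and should be replaced with real ones from the Harbor documentation.

```python
import shlex


def build_harbor_command(model, agent, jobs_dir,
                         dataset="terminal-bench@2.0", n_attempts=5):
    """Return the argument list for the benchmark run quoted in the article.

    Only the flags that appear in the article are used; nothing else about
    the Harbor CLI is assumed here.
    """
    return [
        "harbor", "run",
        "-d", dataset,                     # benchmark dataset and version
        "-m", model,                       # model identifier (placeholder)
        "-a", agent,                       # agent identifier (placeholder)
        "--n-attempts", str(n_attempts),   # five runs are required for submission
        "--jobs-dir", jobs_dir,            # where job directories are written
    ]


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    argv = build_harbor_command("example-model", "example-agent", "jobs/tb2")
    print(shlex.join(argv))
```

Keeping the command as a list avoids shell quoting problems if a model or agent name contains spaces, and the same list can be passed to `subprocess.run` once Harbor is installed.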

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure. As LLM agents proliferate in developer and operational environments, the need for controlled, reproducible testing has grown.

These tools offer a potential foundation for a unified evaluation stack, supporting model improvement, environment simulation, and benchmark standardization across the AI ecosystem.
