Fresh off releasing the latest version of its Olmo foundation model, the Allen Institute for AI (Ai2) launched its open-source video model, Molmo 2, on Tuesday, aiming to show that smaller, open models can be viable options for enterprises focused on video understanding and analysis.
In a press release, Ai2 said Molmo 2 “takes Molmo’s strengths in grounded vision and expands them to video and multi-image understanding,” a capability that has largely been dominated by larger proprietary models.
Ai2 launched three variants of Molmo 2:
- Molmo 2 8B, a Qwen-3–based model that Ai2 describes as its “best overall model for video grounding and QA”
- Molmo 2 4B, designed for more efficient deployments
- Molmo 2-O 7B, built on the Olmo model
Molmo 2 supports single-image and multi-image inputs, as well as video clips of varying lengths, enabling tasks such as video grounding, tracking, and question answering.
“One of our core design goals was to close a major gap in open models: grounding,” Ai2 said in its press release.
Ai2 first released the Molmo family of open multimodal models last year, starting with images. It said Molmo 2 surpasses earlier versions in accuracy, temporal understanding, and pixel-level grounding, and in some cases performs competitively with larger models such as Google’s Gemini 3.
How Molmo 2 compares
Despite their smaller size, the Molmo 2 models outperformed Gemini 3 Pro as well as open-weight competitors on video tracking benchmarks.
For image and multi-image reasoning, Ai2 said Molmo 2 8B “leads all open-weight models, with the 4B variant close behind.” The 8B and 4B models also showed strong performance in the open-weight Elo human preference evaluation, though Ai2 noted that larger proprietary models continue to lead that benchmark overall.
But Molmo 2’s biggest gains are in video grounding and video counting, where it outscores comparable open-weight models.
“These results highlight both progress and remaining headroom: video grounding is still hard, and no model yet reaches 40% accuracy,” Ai2 said, referring to current benchmarks.
Many video models, such as Google's Veo 3.1 and OpenAI's Sora, are often very large. Molmo 2 targets a different tradeoff: smaller, open models optimized for grounding and analysis rather than video generation.
