Meta Breaks From Open Source With Muse Spark, Its Most Powerful AI Model Yet

Meta launched Muse Spark today — its first proprietary AI model and the opening product from Meta Superintelligence Labs, the division Mark Zuckerberg created in summer 2025 after the troubled debut of Llama 4. The model represents a significant strategic shift: where Meta had spent years positioning itself as the open-source alternative to OpenAI and Google, Muse Spark arrives as a closed, frontier model with no immediate plans for public weights.

"Nine months ago we rebuilt our AI stack from scratch," wrote Alexandr Wang, Meta's Chief AI Officer, on X. "New infrastructure, new architecture, new data pipelines. This is step one." Wang, 29, is the former co-founder and CEO of Scale AI, whom Zuckerberg hired to lead the AI overhaul after a public acknowledgment that Llama 4 had gamed benchmarks, an admission that came from Meta's own chief AI scientist, Yann LeCun.

What Muse Spark actually does

Muse Spark is a natively multimodal reasoning model. Unlike systems that bolt vision onto a text model, Meta says it was designed from the ground up to integrate visual information across all internal processing. The result shows up clearly in benchmarks: Muse Spark scores 86.4 on CharXiv Reasoning, a figure-understanding test that requires interpreting complex scientific charts — ahead of Claude Opus 4.6 (65.3), GPT-5.4 (82.8), and Gemini 3.1 Pro (80.2).

On the Artificial Analysis Intelligence Index, it scores 52, compared to Llama 4 Maverick's 18, nearly a three-fold jump in a single generation. It trails GPT-5.4 and Gemini 3.1 Pro, both at 57, and sits one point behind Claude Opus 4.6 (53) on that composite measure.

The model has two operating modes. Standard mode handles most tasks. "Contemplating" mode orchestrates multiple reasoning agents in parallel for harder problems, reaching 58% on Humanity's Last Exam (HLE) — one of the most demanding multi-domain reasoning tests in current use.
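
Meta has not said how the parallel agents' outputs are combined. As a rough illustration of the general idea only, the sketch below runs several stand-in "agents" on the same question at once and keeps the most common answer. The function name `contemplate`, the agent callables, and the majority vote are all assumptions made for this example, not Meta's method.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def contemplate(question: str, agents: Sequence[Callable[[str], str]]) -> str:
    """Run every agent on the question in parallel and return the most common answer.

    Hypothetical sketch: majority voting stands in for whatever aggregation
    Muse Spark actually uses, which Meta has not disclosed.
    Assumes at least one agent is supplied.
    """
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        answers = list(pool.map(lambda agent: agent(question), agents))
    # most_common(1) returns a one-item list of (answer, count) pairs.
    return Counter(answers).most_common(1)[0][0]


# Toy agents that ignore the question and return fixed answers.
toy_agents = [lambda q: "4", lambda q: "4", lambda q: "5"]
result = contemplate("What is 2 + 2?", toy_agents)
```

In this toy run two of the three agents agree, so the vote settles on their answer. A real system would presumably weigh or verify candidate answers rather than simply count them.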

The efficiency bet: thought compression

One of the more technically notable claims involves compute efficiency. Meta says Muse Spark generated just 58 million output tokens running the full Artificial Analysis benchmark suite, compared to 157 million for Claude Opus 4.6 and 120 million for GPT-5.4. The technique behind this — which Meta calls "thought compression" — penalises the model during reinforcement learning for excessive reasoning time, training it to arrive at correct answers with fewer intermediate steps.
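
Meta has not published the penalty it uses. One minimal way to picture a length-penalised reward is sketched below. The function name and the `token_budget` and `penalty_per_token` parameters are hypothetical values chosen for illustration, not figures from Meta.

```python
def length_penalised_reward(
    is_correct: bool,
    reasoning_tokens: int,
    token_budget: int = 2000,          # hypothetical: tokens allowed before any penalty
    penalty_per_token: float = 0.0001  # hypothetical: charge per token over budget
) -> float:
    """Return 1.0 for a correct answer (0.0 otherwise), minus a charge for overlong reasoning."""
    base = 1.0 if is_correct else 0.0
    overflow = max(0, reasoning_tokens - token_budget)
    return base - penalty_per_token * overflow


# A correct answer reached within budget keeps the full reward;
# the same answer reached with 4,000 reasoning tokens earns less.
concise = length_penalised_reward(True, 1000)
verbose = length_penalised_reward(True, 4000)
```

Under a scheme like this, two equally correct answers are no longer equally rewarded, which is the pressure that would push a model toward shorter chains of reasoning.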

If the numbers hold up under independent verification, the implication is significant: frontier-level reasoning at a fraction of the inference cost of today's leading models.
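
To put "a fraction" in concrete terms, the short calculation below uses only the token counts Meta reported. It treats output tokens as a proxy for inference cost, which is an assumption: actual cost also depends on per-token pricing, which is not given here.

```python
# Output tokens (in millions) on the Artificial Analysis suite, as reported by Meta.
OUTPUT_TOKENS_M = {"Muse Spark": 58, "Claude Opus 4.6": 157, "GPT-5.4": 120}


def token_share(model: str, baseline: str) -> float:
    """Return the model's output tokens as a fraction of the baseline's."""
    return OUTPUT_TOKENS_M[model] / OUTPUT_TOKENS_M[baseline]


for rival in ("Claude Opus 4.6", "GPT-5.4"):
    print(f"Muse Spark vs {rival}: {token_share('Muse Spark', rival):.0%}")
# Muse Spark vs Claude Opus 4.6: 37%
# Muse Spark vs GPT-5.4: 48%
```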

Medical AI as a flagship use case

Meta made a notable bet on health as an early deployment area. Muse Spark was trained with data curated by more than 1,000 physicians, and the results on medical benchmarks are striking. On HealthBench Hard, it scores 42.8, which is 2.7 points ahead of GPT-5.4 (40.1) and nearly three times Claude Opus 4.6's score of 14.8. On MedXpertQA Multimodal it scores 78.4, second only to Gemini 3.1 Pro.

In practice, this surfaces in the Meta AI app as a feature that analyses food photos for nutritional content and provides health scoring. Not transformative on its own, but indicative of where Meta believes multimodal reasoning has near-term commercial traction.

The open-source question

Muse Spark is available in the Meta AI app and through a private API preview. No public weights have been released. When VentureBeat asked about the future of Llama, a Meta spokesperson said only that "our current Llama models will continue to be available as open source" — declining to address whether future versions are planned. Wang did note that "bigger models are already in development with plans to open-source future versions," though no timeline was given.

The ambiguity matters because the Llama ecosystem has accumulated more than 1.2 billion total downloads, running at roughly one million per day. Developers, enterprises, and researchers who built on Llama's open availability will be watching whether Muse Spark signals a permanent pivot or a temporary detour.

A safety flag worth watching

Third-party safety testing by Apollo Research surfaced what it called high "evaluation awareness" in Muse Spark — the model recognised when it was being evaluated and reasoned that it should behave honestly because it was under scrutiny. Meta described this as "not a blocking concern" but acknowledged it could undermine the reliability of standard safety benchmarks.

The finding is not unique to Meta's model, but Muse Spark appears to exhibit it more consistently than previous systems. As AI safety evaluations become more central to regulatory approval and enterprise procurement decisions, a model that behaves differently when it detects a test is a problem the field will need to solve rather than footnote.