Frontier · Model release · v3.0
Atlas-3
A new generation of capable, considered models.
Atlas-3 is a frontier foundation model with capability gains across reasoning, coding, tool use, and long context. The technical report and the safety statement ship today, alongside the API.
Capabilities
Six places where Atlas-3 takes a step the field has not.
Each capability is paired with one number from the technical report. The full evaluation suite, methodology, and known failure modes are in the report.
01
Reasoning
Multi-step planning under uncertainty
Atlas-3 plans across long horizons, asks for missing information, and revises plans as new evidence arrives. The model is comfortable saying it does not know.
Calibration on AGI-Eval-Bench (multistep): 0.842
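The technical report defines the exact calibration metric behind that number. As an illustration only, one common measure is expected calibration error (ECE): bin predictions by stated confidence and compare confidence to accuracy in each bin. A minimal sketch, not the report's implementation:

```python
# Expected calibration error (ECE): average gap between stated confidence
# and observed accuracy, weighted by bin size. Illustrative only; the
# report's metric may differ.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Sum over equal-width confidence bins of
    (bin size / n) * |bin accuracy - bin mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        mean_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - mean_conf)
    return ece

# Well calibrated: 80%-confident answers that are right 80% of the time.
print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))
```

A model that says "80% sure" and is right 80% of the time scores an ECE near zero; overconfidence widens the gap.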
02
Coding
Repository-aware code edits
Atlas-3 holds a working repository in context, edits across files coherently, runs the test suite, and revises against failures. The model writes a commit message you can read.
SWE-Bench Verified: 71.2% pass@1
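The loop described above, in the smallest possible form: edit, run the tests, revise against the failures. The model call, repository, and test runner below are stand-in stubs, not the Atlas-3 API:

```python
# A minimal edit-run-revise loop. All three functions are illustrative
# stubs standing in for a model call, a repo, and a test suite.

def propose_patch(failures):
    # Stub for a model call; a real agent would see repo context + failures.
    return {"fix_off_by_one": failures == ["test_range"]}

def run_tests(patch):
    # Stub test suite: passes once the off-by-one patch is applied.
    return [] if patch.get("fix_off_by_one") else ["test_range"]

def edit_until_green(max_rounds=3):
    failures = run_tests({})              # baseline run
    for _ in range(max_rounds):
        if not failures:
            return "green"
        patch = propose_patch(failures)   # revise against failures
        failures = run_tests(patch)
    return "gave up"

print(edit_until_green())  # → green
```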
03
Tool use
Disciplined function invocation
Atlas-3 calls the right tool with the right arguments, recovers from tool errors, and chains results into downstream work without losing the original objective.
Tau-Bench (retail): 0.611; the model declines unsafe calls
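What "disciplined invocation" means in practice: validate arguments, refuse calls on the unsafe list rather than executing them, and retry once when a tool errors. The tool registry and policy list below are illustrative stubs, not the published API:

```python
# Sketch of a disciplined tool-invocation wrapper. UNSAFE_TOOLS and
# lookup_order are hypothetical examples, not Atlas-3's tool schema.

UNSAFE_TOOLS = {"delete_account"}   # hypothetical refusal list

def lookup_order(order_id: str) -> dict:
    if not order_id.startswith("ORD-"):
        raise ValueError("malformed order id")
    return {"order_id": order_id, "status": "shipped"}

TOOLS = {"lookup_order": lookup_order}

def invoke(tool, **args):
    if tool in UNSAFE_TOOLS:
        return {"declined": tool}                 # refuse, do not call
    fn = TOOLS[tool]
    for _ in range(2):                            # one retry on tool error
        try:
            return fn(**args)
        except ValueError:
            # Repair the argument and retry instead of dropping the task.
            args = {"order_id": "ORD-" + args["order_id"]}
    return {"error": "tool failed twice"}

print(invoke("lookup_order", order_id="1234"))   # recovers from the error
print(invoke("delete_account"))                  # → {'declined': 'delete_account'}
```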
04
Long context
Eight hundred thousand tokens, used well
The 800K-token context window holds entire repositories, full filings, and multi-day conversations. Recall is steady from page one through the last page.
RULER (NIAH-Multi): 0.96 at 256K, 0.91 at 800K
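For readers unfamiliar with the metric: NIAH-Multi hides several "needles" in long filler text and scores how many the model recalls. A toy version of that setup, not the RULER harness:

```python
# Toy multi-needle recall setup, illustrating what RULER's NIAH-Multi
# measures. Not the actual benchmark harness.

import random

def make_haystack(needles, filler_tokens=1000, seed=0):
    """Scatter key-value 'needles' through filler text at random positions."""
    rng = random.Random(seed)
    words = ["lorem"] * filler_tokens
    for key, value in needles.items():
        words.insert(rng.randrange(len(words)), f"The code for {key} is {value}.")
    return " ".join(words)

def recall_score(answers, needles):
    """Fraction of needles the reader recalled exactly."""
    return sum(answers.get(k) == v for k, v in needles.items()) / len(needles)

needles = {"alpha": "7712", "bravo": "0033"}
haystack = make_haystack(needles)
# A model would read `haystack` and be asked for each code; here we
# simulate a perfect reader by extracting straight from the text.
answers = {k: haystack.split(f"The code for {k} is ")[1][:4] for k in needles}
print(recall_score(answers, needles))  # → 1.0
```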
05
Vision
Document and diagram understanding
Atlas-3 reads diagrams, transcribes handwriting, and describes what it sees in calibrated language. The model says when an image is ambiguous.
ChartQA strict: 0.88; MMMU (Pro): 0.74
06
Voice
Calibrated, not theatrical
When asked to write, Atlas-3 produces prose that holds register over a long passage. The model resists hype words and prefers concrete claims.
Word-level entropy on synthetic prompts: -8.3% vs. Atlas-2
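The report defines the exact entropy measure; as an illustration only, the simplest word-level version is Shannon entropy over the unigram word distribution, where repetitive, hype-heavy text scores lower than varied, concrete prose:

```python
# Plain Shannon entropy over word frequencies, one common way to quantify
# lexical spread. Illustrative; the report's measure may differ.

import math
from collections import Counter

def word_entropy(text: str) -> float:
    """Entropy (bits) of the word-frequency distribution of `text`."""
    counts = Counter(text.lower().split())
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

hype = "great great great great results"
plain = "recall is steady from page one through the last page"
print(word_entropy(hype) < word_entropy(plain))  # → True
```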
Benchmark snapshot
A small selection. The full table is in the technical report.
Numbers are pass@1 unless otherwise noted. We publish the evaluation harness, the prompts, and the seeds. Reproduce, do not recite.
| Benchmark | Atlas-3 | Atlas-2 | Delta |
|---|---|---|---|
| MMLU-Redux | 0.892 | 0.864 | +2.8 pts |
| AGI-Eval-Bench | 0.842 | 0.781 | +6.1 pts |
| SWE-Bench Verified | 0.712 | 0.604 | +10.8 pts |
| Tau-Bench (retail) | 0.611 | 0.523 | +8.8 pts |
| RULER NIAH-Multi @256K | 0.961 | 0.907 | +5.4 pts |
Read the full technical report for ablations, failure analyses, and the evaluation cards we wrote for each benchmark.
Safety statement
We publish what the model will not do, and why.
The safety statement is not a compliance afterthought. It is the second document we wrote, after the technical report. The third document was the API rate limits.
01
Refusal coverage
Atlas-3 declines categories of harm we have decided not to serve. The list is published, with examples, in the safety statement. We will not change that list quietly.
02
Pre-deployment red team
External red teams ran adversarial evaluation against Atlas-3 before the launch. Findings, mitigations, and residual risk are in the safety statement.
03
Misuse monitoring
API traffic is monitored for misuse patterns under our published policy. Persistent abuse triggers escalation, not a silent rate-limit. We document our actions.
04
What we still do not know
The safety statement names open questions on long-horizon agentic behavior, on jailbreak generalization, and on tool misuse. We do not pretend they are solved.
From the research team
A note on what shipped and what did not.
Atlas-3 ships with the capabilities our internal evaluations say it ships with, no more. Earlier in development we held back two demos that polled well in user studies but failed under adversarial probing. They are not in the launch. We will revisit them when the safety case is stronger.
We are publishing the technical report, the evaluation harness, and the safety statement on the same day as the API because we want the same audience to see all three. The model is a step. Steps need to be measurable.
· the research leads
- Lina Marsh · Co-lead, Pre-training
- Daiki Onishi · Co-lead, Reinforcement Learning
- Maya Whitford · Lead, Alignment
- Owen Becher · Lead, Evaluations
API access
Join the waitlist. We onboard partners weekly.
The first wave is research labs, evaluation teams, and builders working on agentic systems. We open broader access in cohorts, with onboarding calls, after the first wave is steady.
We read every request. Tell us what you would build.