Whitepaper

The Bimodal Product Manager

A framework for AI agents that do real product work without pretending to be human.

Most attempts to build an autonomous product manager fail in the same way. They aim for full autonomy, push the agent to make every call itself, and produce confident output at exactly the moments where confidence is unwarranted. The agent writes a competent spec and then, in the same breath and the same tone, makes a brand-positioning decision it has no business making alone. The capability is usually there. What is missing is a model for where judgment belongs.

This piece describes how we think about that problem. It is a framework, not a finished product. We have pressure-tested the core of it against real product tasks, and where the testing forced changes, we say so. The intent is to share the thinking openly: the thinking is not the part that needs protecting, and getting it right matters for anyone building in this space.

The core distinction: convergent and divergent work

Product work divides into two kinds of task, and an AI agent should treat them completely differently.

Convergent work has a knowable-correct answer and a known process to reach it. Writing a product requirements document. Computing the sample size for an experiment. Synthesizing two hundred customer reviews into themes. Give the same inputs to two competent practitioners and they produce substantially the same output. Here, creativity is a defect. A PM who invents a novel requirements format nobody can read has failed at a convergent task. The value is rigor and convention.

Divergent work has no knowable-correct answer. Which brand positioning to pursue. What an experiment should actually probe. Whether a creative direction lands or falls flat. Whether to ship or kill. Two competent practitioners will reasonably disagree, because the choice reflects judgment, taste, and strategy rather than calculation. Here, convention is often mediocrity, and the conventional move is the one to interrogate.

An agent that runs both kinds of work in the same posture is the agent that fails. The design we are describing is bimodal: rigorous and conventional in convergent mode, genuinely lateral in divergent mode, with a deliberate switch between them and no blending. The framework that drives creative option-generation is switched off when the agent writes a spec and switched on when it designs an experiment, with no middle setting.

The decision is the product

There is a strong temptation to treat the divergent gaps as capability gaps that a better model will eventually close. We think that is the wrong frame. An agent should stop where a decision carries consequences and belongs to a person with a stake in the outcome. Those points are defined by what is at risk, not by where the model gives out.

So the agent's job is to be an extremely capable product manager that checks in at the decision points, rather than an autonomous one that answers everything itself. It runs the convergent work, which is most of the volume, and surfaces the divergent decisions as clean, well-framed choices rather than blank-page problems. The value is in doing the tedious bulk of the work and handing you the part that genuinely needs you, already researched and structured.

This inverts the common pattern. Most tools add AI to a system of record that was built for human-authored work, so the AI ends up an assistant living inside a tracker. Here the agent does the work and the human supervises at the seams, and the system of record is built around that division rather than retrofitted to it.

The mode classifier

The intelligence of the system is the classifier that decides which mode a piece of work is in. It runs one test and then checks three overrides.

The test is the disagreement test. Would two competent practitioners, given the same inputs, produce substantially the same output? If yes, the work is convergent and the agent runs it. If they would reasonably disagree, the work is divergent and the agent stops to surface options.

The disagreement test is itself a judgment, and we state that plainly rather than dress it up as a measurement. When the test is genuinely unclear, when it is hard to say whether two practitioners would land in the same place, the agent treats that uncertainty as a reason to stop. The contested middle defaults to divergent. A stop that turns out to be unnecessary only costs a question; proceeding when the agent should have stopped lets an unowned decision slip through quietly, which is the worse mistake.

Three properties force a task to divergent regardless of that test. The first is irreversibility: spending money, deploying to a live surface, publishing, sending external communication, anything hard to undo. The second is taste and positioning: brand, voice, creative or strategic direction. The third is high stakes: commitments at a scale where being wrong is costly. Any one of these flips a task to a mandatory stop even when the work itself is mechanically simple.
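
The test and its overrides can be sketched in a few lines. This is a minimal illustration, not the production classifier: the names WorkItem, Agreement, Mode, and classify are hypothetical, and the outcome of the disagreement test is passed in as an input because it is a judgment the code cannot compute for itself.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    CONVERGENT = "convergent"  # the agent runs the work
    DIVERGENT = "divergent"    # the agent stops and surfaces options


class Agreement(Enum):
    """Outcome of the disagreement test. A judgment, not a measurement."""
    WOULD_AGREE = "would agree"
    WOULD_DISAGREE = "would disagree"
    UNCLEAR = "unclear"


@dataclass(frozen=True)
class WorkItem:  # hypothetical name for the unit being classified
    description: str
    agreement: Agreement
    irreversible: bool = False          # spend, deploy, publish, external comms
    taste_or_positioning: bool = False  # brand, voice, creative or strategic direction
    high_stakes: bool = False           # being wrong is costly at this scale


def classify(item: WorkItem) -> tuple[Mode, str]:
    """Return the mode and the reason it was chosen."""
    # Disagreement test: anything short of clear agreement is a stop,
    # so the contested middle defaults to divergent.
    if item.agreement is Agreement.WOULD_AGREE:
        mode, reason = Mode.CONVERGENT, "disagreement test: would agree"
    else:
        mode, reason = Mode.DIVERGENT, f"disagreement test: {item.agreement.value}"

    # Overrides: any one of them forces a mandatory stop.
    overrides = [
        (item.irreversible, "irreversible"),
        (item.taste_or_positioning, "taste and positioning"),
        (item.high_stakes, "high stakes"),
    ]
    for triggered, name in overrides:
        if triggered:
            return Mode.DIVERGENT, f"override: {name}"
    return mode, reason


# A mechanically simple item that the disagreement test alone would wave through.
deploy = WorkItem(
    "deploy the new feature to production",
    Agreement.WOULD_AGREE,
    irreversible=True,
)
mode, reason = classify(deploy)
```

Returning the reason alongside the mode is a design choice of the sketch: a stop is easier to act on when it says which rule caused it.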

That last point is the one that earns the overrides their place. We tested the classifier against a dozen real product tasks for a content site. It made the right call on the clear cases, but the most important result was a task the disagreement test alone got wrong: "deploy the new feature to production." Deployment is mechanically convergent, and the test wanted to wave it through. The irreversibility override caught it. The lesson is that the disagreement test is necessary but not sufficient, and the overrides are what make the classifier safe rather than merely clever.

The testing also forced two refinements worth stating plainly, because they sharpen the framework.

First, the classifier operates on steps, not whole tasks. "Set the quarterly OKRs" felt unclassifiable as a single thing, because it is convergent in format and divergent in content. The resolution is that the agent decomposes a task into steps and classifies each one. Setting OKRs becomes: draft the structure (convergent), propose the targets (divergent, stop), format the final document (convergent). The taxonomy survives; the unit of classification is the step.

Second, divergent decisions cascade into convergent execution. Once a human makes a divergent call, the work downstream of it becomes convergent. Choosing a brand voice is divergent and the agent stops for it. Applying that voice across fifty pages is convergent and the agent runs it. This cascade is the mechanism that produces the high-autonomy behavior: most of the volume is convergent execution flowing from a small number of human decisions.
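
Both refinements can be illustrated together with the OKR example. The sketch below is an assumption about how step-level classification and the cascade might be wired up; PlanStep and run_until_stop are hypothetical names, and each step's mode is supplied directly rather than computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanStep:  # hypothetical: one classified step of a larger task
    name: str
    divergent: bool                 # result of classifying this step
    decision: Optional[str] = None  # filled in by a human for divergent steps


def run_until_stop(plan: list[PlanStep]) -> tuple[list[str], Optional[PlanStep]]:
    """Walk the plan in order.

    Returns the names of the steps that could run and the step that is
    waiting on a human decision, or None if nothing is waiting.
    """
    done: list[str] = []
    for step in plan:
        if step.divergent and step.decision is None:
            return done, step   # stop and surface the choice
        done.append(step.name)  # convergent, or already decided by a human
    return done, None


okrs = [
    PlanStep("draft the structure", divergent=False),
    PlanStep("propose the targets", divergent=True),
    PlanStep("format the final document", divergent=False),
]

# First pass: the agent drafts the structure, then stops at the targets.
first_done, waiting = run_until_stop(okrs)

# A human makes the divergent call (placeholder text, not a real decision).
okrs[1].decision = "targets chosen by the accountable owner"

# Second pass: everything downstream of the decision is now convergent.
# This sketch replays the plan from the start; a real agent would resume.
second_done, still_waiting = run_until_stop(okrs)
```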

Generating genuinely different options

In convergent mode, the established method is the right method, and the agent simply executes it well. In divergent mode, the agent needs to produce candidates that are genuinely distinct from one another rather than three variations on the obvious idea.

We generate those candidates with a structured method built on defined dimensions: it deliberately occupies different positions along them and scores the results against expected impact and fit. The method matters more than any single framework: lateral option-generation should be structured rather than left to chance, so that the choices presented to a human are meaningfully different and the obvious answer is not the only one on the table.
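
One way to make "genuinely distinct" concrete is to treat each option as a position along each dimension and pick options that are far apart. The sketch below is an illustration only and not the method described above: the dimensions are invented for the example, the greedy farthest-point selection is a swapped-in technique, and scoring against impact and fit is left out because it depends on real data.

```python
from itertools import product

# Hypothetical dimensions and positions, invented for this example.
DIMENSIONS = {
    "audience": ["existing users", "new segment"],
    "mechanism": ["pricing", "onboarding", "content"],
    "scope": ["small test", "full launch"],
}


def distance(a: tuple, b: tuple) -> int:
    """Number of dimensions on which two options take different positions."""
    return sum(x != y for x, y in zip(a, b))


def distinct_options(dimensions: dict, k: int) -> list[tuple]:
    """Greedy farthest-point selection over all combinations of positions."""
    candidates = list(product(*dimensions.values()))
    chosen = [candidates[0]]  # start from the first, most obvious combination
    while len(chosen) < min(k, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        # Pick the candidate whose nearest already-chosen option is farthest away.
        best = max(remaining, key=lambda c: min(distance(c, s) for s in chosen))
        chosen.append(best)
    return chosen


options = distinct_options(DIMENSIONS, k=3)
```

With these dimensions, every pair of the three selected options differs on at least two dimensions, which is the property the selection is meant to enforce.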

Prioritization without self-deception

A product manager who gets excited about noise is not useful, and neither is an agent that does. The prioritization layer is built to behave like a senior PM who refuses to mistake a small or unproven result for an important one.

Scoring is convergent and the agent does it by the book, defaulting to a reach-impact-confidence-effort model. But the inputs are sourced from where they are actually knowable, which is the part most systems get wrong. Impact and reach come from behavioral data. Confidence is derived from the quality and quantity of the underlying evidence, so a decision resting on a live experiment scores high and one resting on a thin sample scores low automatically. Effort is supplied or validated by a human, because an agent confidently estimating engineering effort is guessing at things it cannot see: your team's velocity, the codebase's hidden coupling, who is out next week. Treating effort as a human input mirrors how real teams work, where the PM brings impact, engineering brings effort, and the two meet at planning.
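
A minimal sketch of that sourcing rule, using the standard RICE formula (reach times impact times confidence, divided by effort). The confidence values per evidence kind are hypothetical placeholders, as are the names Candidate and rice_score. The point of the sketch is that confidence is looked up from the evidence rather than asserted, and that a missing human effort estimate produces no score at all.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from evidence kind to confidence. The values are
# placeholders chosen for illustration, not calibrated numbers.
EVIDENCE_CONFIDENCE = {
    "live_experiment": 1.0,
    "behavioral_data": 0.8,
    "thin_sample": 0.5,
    "modeled_estimate": 0.3,
}


@dataclass(frozen=True)
class Candidate:
    name: str
    reach: float   # e.g. users affected per quarter, from behavioral data
    impact: float  # expected effect per user, from behavioral data
    evidence: str  # key into EVIDENCE_CONFIDENCE
    effort_person_months: Optional[float] = None  # supplied or validated by a human


def rice_score(c: Candidate) -> Optional[float]:
    """Return the RICE score, or None when effort has not been supplied."""
    if c.effort_person_months is None or c.effort_person_months <= 0:
        return None  # the agent does not guess at engineering effort
    confidence = EVIDENCE_CONFIDENCE[c.evidence]
    return c.reach * c.impact * confidence / c.effort_person_months


tested = Candidate("checkout copy", 1000, 2.0, "live_experiment", 4.0)
thin = Candidate("checkout copy", 1000, 2.0, "thin_sample", 4.0)
unsized = Candidate("checkout copy", 1000, 2.0, "live_experiment")
```

The same idea scores half as high on a thin sample as on a live experiment, with no change to reach, impact, or effort.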

Two hard gates sit in front of any recommendation. The first is a significance check: a result that has not reached statistical significance is reported as inconclusive, never as a win. A twenty percent lift on twenty visitors is noise. The second is a reach consideration: an experiment on a page with negligible traffic is rarely worth the engineering, even if the win looks easy. The agent does flag the cases where a low-traffic surface is strategically critical despite its volume, because volume and value are not the same thing.
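
The two gates can be sketched with a pooled two-proportion z-test. Everything specific here is an assumption made for illustration: the 0.05 significance level, the 1,000 monthly-visitor floor, and the function names. The normal approximation is also rough at very small samples, where an exact test would be the better choice.

```python
from math import erfc, sqrt


def two_sided_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z-test using the normal approximation."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return erfc(abs(z) / sqrt(2))


def gate(
    conv_a: int,
    n_a: int,
    conv_b: int,
    n_b: int,
    monthly_visitors: int,
    strategically_critical: bool = False,
    alpha: float = 0.05,               # assumed significance level
    min_monthly_visitors: int = 1000,  # assumed reach floor
) -> str:
    low_reach = monthly_visitors < min_monthly_visitors
    # Reach gate: low traffic passes only when flagged as strategically critical.
    if low_reach and not strategically_critical:
        return "not recommended: negligible reach"
    # Significance gate: anything short of significance is inconclusive.
    if two_sided_p_value(conv_a, n_a, conv_b, n_b) >= alpha:
        return "inconclusive"
    if low_reach:
        return "significant (low reach, flagged as strategically critical)"
    return "significant"


# A twenty percent relative lift on twenty visitors: 5 of 10 versus 6 of 10.
tiny = gate(5, 10, 6, 10, monthly_visitors=50_000)

# The same relative lift on twenty thousand visitors: 500 versus 600 of 10,000.
large = gate(500, 10_000, 600, 10_000, monthly_visitors=50_000)
```

The tiny test is reported as inconclusive even though its headline lift matches the large one, which is the behavior the first gate exists to enforce.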

And the agent reads trend, not just snapshots. A two percent conversion rate means nothing until you know it was four percent last quarter, which is a problem, or one percent, which is a recovery in progress. The direction is often the actual signal.
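
The arithmetic is simple enough to state directly. In this small sketch the ten percent tolerance band and the function name are assumptions made for the example.

```python
def trend(previous: float, current: float, tolerance: float = 0.10) -> str:
    """Classify the direction of a metric from its relative change."""
    if previous <= 0:
        return "no baseline"  # a direction cannot be read without a prior value
    change = (current - previous) / previous
    if change > tolerance:
        return "improving"
    if change < -tolerance:
        return "declining"
    return "flat"


# The same two percent conversion rate, read against two different histories.
from_four = trend(previous=0.04, current=0.02)  # relative change of -50%
from_one = trend(previous=0.01, current=0.02)   # relative change of +100%
```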

Underneath all of this is a deliberate posture of stability. Recent interpretability research has shown that simulated pressure can push a language model toward corner-cutting, while a calmer posture reduces it. We take the constructive reading of that finding: rising pressure on a task, repeated failures, or the beginnings of a rationalized shortcut should each be treated as a trigger to stop and surface to a human, not as a feeling to act on. The agent's stability is a safety mechanism, and a willingness to say "I do not have enough to recommend this responsibly" is a behavior to design in, not a failure to engineer out.

Real data or honest silence

An AI product manager without live data is a confident intern with opinions. The framework treats real data as the foundation, ingested through connectors to analytics, data warehouses, experiment tooling, and competitive research sources. Each recommendation states which kind of evidence it rests on, because a call backed by your own experiment data deserves more trust than one backed by modeled competitive estimates, and the agent should never present the second as if it were the first.

When no live data is connected, the honest hierarchy is: use your own exported data, then whatever public data can be fetched directly, and then decline to score rather than fabricate. Simulated data, if used at all, is visibly labeled as simulated everywhere it touches a decision, and any recommendation resting on it is a low-confidence stop. An agent that launders a guess through a scoring formula until it looks like rigor is the single most trust-destroying thing this system could do, and the design exists specifically to prevent it.
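
The hierarchy reads naturally as an ordered lookup that ends in a refusal. The source names, the Evidence type, and the status strings below are all hypothetical; the sketch only shows the ordering, the explicit decline, and the rule that simulated data forces a low-confidence stop.

```python
from dataclasses import dataclass
from typing import Optional

# Preferred order of real sources (hypothetical names).
SOURCE_ORDER = ["live_connector", "own_export", "public_fetch"]


@dataclass(frozen=True)
class Evidence:
    source: str
    value: float
    simulated: bool = False  # must stay visible everywhere it touches a decision


def pick_evidence(available: dict[str, float]) -> Optional[Evidence]:
    """Return the best real source available, or None to decline."""
    for source in SOURCE_ORDER:
        if source in available:
            return Evidence(source, available[source])
    return None  # decline to score rather than fabricate


def recommendation_status(evidence: Optional[Evidence]) -> str:
    if evidence is None:
        return "declined: no real data"
    if evidence.simulated:
        return "low-confidence stop: simulated data"
    return f"scored on {evidence.source}"


# No live connector, but an export and public data exist: the export wins.
chosen = pick_evidence({"public_fetch": 0.031, "own_export": 0.024})
```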

The system of record

The agent is working memory. It is not the system of record. Every baseline, score, decision, and result is written to a real datastore the moment it is produced, and the agent reads from that store rather than relying on what it recalls. This gives three things that agent memory alone cannot: a durable history of metrics over time, an audit trail of why each decision was made, and a state that humans can see without re-running the agent.
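
A minimal sketch of the write-through pattern, using SQLite from the Python standard library as a stand-in datastore. The table layout, the function names, and the sample values are assumptions made for illustration, not a description of any real schema or result.

```python
import json
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the datastore. In-memory here; a real store would be durable."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS records (
               id INTEGER PRIMARY KEY,
               kind TEXT NOT NULL,        -- baseline, score, decision, result
               subject TEXT NOT NULL,     -- what the record is about
               payload TEXT NOT NULL,     -- JSON body
               reason TEXT NOT NULL,      -- why it was produced (audit trail)
               recorded_at TEXT NOT NULL  -- UTC timestamp
           )"""
    )
    return conn


def record(conn: sqlite3.Connection, kind: str, subject: str, payload: dict, reason: str) -> None:
    """Write a record the moment it is produced."""
    conn.execute(
        "INSERT INTO records (kind, subject, payload, reason, recorded_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (kind, subject, json.dumps(payload), reason,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def history(conn: sqlite3.Connection, subject: str) -> list[tuple]:
    """Read the record back from the store instead of relying on recall."""
    rows = conn.execute(
        "SELECT kind, payload, reason FROM records WHERE subject = ? ORDER BY id",
        (subject,),
    ).fetchall()
    return [(kind, json.loads(payload), reason) for kind, payload, reason in rows]


store = open_store()
# Illustrative values only.
record(store, "baseline", "signup page", {"conversion": 0.02}, "initial measurement")
record(store, "decision", "signup page", {"action": "hold"}, "sample too small to test")
log = history(store, "signup page")
```

Storing the reason with every row is what turns a metrics table into an audit trail that a person can read without re-running the agent.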

That state needs to be visible. Live experiments need dashboards that show the lift, the accumulating sample, and the significance progress as data arrives, so a person can watch a test mature rather than asking the agent whether it is done. Priorities need a board that exposes the agent's reasoning, not just its ranking. Roadmaps and decision logs need to persist. The running record is what makes the agent valuable across months rather than within a single session.

What we give away and what we operate

The thinking in this document is open, and so is the framework it describes. The ideas are not the moat, and their adoption is the point. What is worth operating as a service is the layer that makes the agent trustworthy in production: the managed system of record, the live dashboards, the pre-wired data connections, the running history. This split is a well-established pattern in agent tooling, where the framework that builds agents is open and the platform that runs, watches, and operates them is the product. We are applying that pattern to a single vertical, product management, rather than spreading across many, and the depth of the product-specific model is the edge. That edge compounds: integration depth is the near-term advantage, and over time the running history of decisions and their outcomes becomes the harder thing to copy.

Where this is

This is a framework we have developed and stress-tested, with the smallest meaningful slice now built and run. It is not yet a shipped product, and we would rather say that plainly than overclaim. The core distinction holds up against real tasks. The classifier needed the step-level refinement and the cascade principle that the testing surfaced. The slice took a real goal through convergent research, into structured divergent option-generation, and to a clean stop, end to end against a live site. It behaved as designed: on a low-traffic site it did the research, produced distinct options, then declined to recommend a test it could not yet support, and said why. Declining was the correct answer, and the framework reaching it on its own is the result we were after. The prioritization and data layers work; what remains is operating them at scale and extending the same machinery further along the workflow.

We are sharing the framework now because the questions it raises, about where machine judgment ends and human judgment begins, are worth thinking through in the open.

This thinking is what we build.

RampStack builds the operated version of this for teams who want it delivered, on the same open methodology.
