What Happens When Stakeholders Can Talk to Bayesian Models?

AI Agents
Bayesian Statistics
Marketing Analytics
Decision Making
A working example of putting a long-running agent on top of a PyMC-Marketing MMM, so stakeholders can interrogate a Bayesian model in plain English without turning data scientists into ad hoc query machines.
Author

Luca Fiaschi

Published

May 5, 2026

A lot of predictive models die the same boring death.

The model is good. The business question is real. The stakeholders are engaged. Then the meeting starts.

“What if we cut paid social by 10%?” “What if TV drops but branded search picks up?” “What if we reallocate only in the Northeast?” The room fills with reasonable questions, and every one of them quietly turns into a mini-project for the data science team.

I have seen this pattern too many times. The model is not the bottleneck. The interface is. What breaks is the last mile between a statistical system that can answer the question and an organization that cannot ask it fast enough.

I think this changes over the next few years. Not because language models magically replace quantitative work. They do not. It changes because language models can become the interaction layer on top of rigorous probabilistic systems. That is a very different claim, and a much more interesting one.

In this post, I want to show what that architecture actually looks like with code. I will use a Bayesian marketing mix model in pymc-marketing, pair it with OpenAI as the language interface, and then push beyond the toy example to the architecture we deploy for clients: memory, context, multiple stakeholders, and downstream actions tied to model-backed decisions.

The real shift is from model outputs to model interfaces

Here is the simple version.

A stakeholder does not want posterior samples. They want an answer to a business question.

A Bayesian model is a structured representation of uncertainty. An LLM is a language interface that can translate business intent into constrained operations on that model. Put them together, and you get something much more useful than either piece on its own.

Unlike the old workflow, where every question has to be translated manually by a data scientist, this approach creates a persistent decision interface with four layers:

  • The statistical model: the Bayesian system that estimates relationships, uncertainty, and scenario outcomes.
  • The interaction layer: the LLM that turns a natural-language question into a valid structured query.
  • The enterprise context layer: memory, business rules, permissions, planning assumptions, and prior decisions.
  • The action layer: dashboards, approvals, tickets, campaign changes, or planning documents triggered by model-backed outputs.

The LLM is not the analyst here. It is the interpreter and coordinator. The model is still doing the quantitative work.

That distinction matters. If you let the language model improvise numbers, you lose the whole point. If you force every question back through the data scientist, you keep the bottleneck. The useful middle ground is: LLM for interface, Bayesian model for inference, explicit system rules for everything else.

A minimal architecture

At a high level, what we build for clients looks like this:

Conceptual architecture for an LLM interface on top of a Bayesian decision model.

A user asks a question in business language. The agent retrieves the right context, validates whether the request is allowed, converts it into a structured scenario, runs that scenario against the model, and returns both the answer and the uncertainty. If the scenario matters, the system can hand the result to a human approval flow or trigger a downstream operational action.

That sounds abstract, so let me make it concrete.

Putting an agent on top of the model

A useful conversation with a stakeholder is rarely one question and one answer. It is a thread. “What if we cut x1 by 10% next quarter?” leads to “and what about x2?”, which leads to “where do we hit diminishing returns first?”, and each new question depends on what came before.

That is the shape an agent loop is built for. Give the LLM a persona, a memory of the conversation so far, a small set of tools it can call against the model, and permission to ask back when a request is ambiguous. Each turn, the model picks which tool to use, reads the result, and answers in the stakeholder’s language. The next turn arrives with the prior turn already in context. That matches how a CMO actually thinks about a model far better than a single-shot translation does.

The full code is in chat_mmm_example.py. I’ll walk through what matters about it and then show what an actual conversation looks like.

One: the model

For this example, I fit a deliberately small Bayesian MMM on pymc-marketing’s bundled sample data. Two channels, weekly target, geometric adstock, logistic saturation. The fit takes about thirty seconds with nutpie and gets cached so the agent starts up in seconds on subsequent runs.

from pymc_marketing.mmm import GeometricAdstock, LogisticSaturation
from pymc_marketing.mmm.multidimensional import MMM

mmm = MMM(
    date_column="date_week",
    channel_columns=["x1", "x2"],
    target_column="y",
    adstock=GeometricAdstock(l_max=4),
    saturation=LogisticSaturation(),
)
mmm.build_model(X, y)
mmm.add_original_scale_contribution_variable(var=["channel_contribution", "y"])
mmm.fit(X=X, y=y, tune=500, draws=500, chains=2, nuts_sampler="nutpie", random_seed=42)

Nothing exotic. The point of the post is not the model.

Two: the persona

This is the part doing the translation. As you can see the system prompt is similar to a job description.

You are a senior data scientist embedded with a marketing team. Your job is
to translate between a Bayesian Marketing Mix Model (MMM) and the people
making spend decisions, typically a CMO or marketing lead.

How you communicate
- Lead with concrete dollar/revenue numbers and concrete time horizons.
- Quantify uncertainty in plain language: "70% chance the gain is between
  $100K and $400K", not "the 80% credible interval is...".
- Never invent numbers. Every quantitative claim must come from a tool call
  in this turn or earlier in the session.

How you reason
- If the user asks how the model works, call describe_model() and explain
  in plain terms what's modeled, what's assumed, what the training window is.
- If the user asks a what-if, call run_scenario(), then compare_to_baseline().
- If the user asks about diminishing returns, call get_channel_response_curve().
- If a request is ambiguous, ask one clarifying question. Do not guess magnitudes.

What you never do
- Recommend an action without showing the uncertainty.
- Quote tool field names or null values to the user. Translate everything.

This prompt is the API. It is also the only place I iterate when the agent misbehaves. Every fix during development came from adding one targeted instruction here, not from rewriting code.

Three: the tools

@function_tool
def describe_model() -> dict: ...

@function_tool
def run_scenario(channel_changes: list[ChannelChange], horizon_weeks: int = 13) -> dict: ...

@function_tool
def compare_to_baseline(scenario_id: str) -> dict: ...

@function_tool
def get_channel_response_curve(channel: str, max_spend_multiplier: float = 2.0, num_points: int = 30) -> dict: ...

Each is a Pydantic-validated function over the fitted model.

describe_model returns the model’s structure: which channels, which transforms, the training window, the per-channel historical-average ROAS posterior, plus the actual computational graph as text. The agent reads that graph and explains it in plain English instead of guessing.

run_scenario builds a future spend frame, applies a pct_change per channel, runs posterior predictive, summarizes. Returns a scenario_id and the assumptions it made.

compare_to_baseline takes a scenario_id and computes a sample-wise paired delta against a do-nothing baseline at the same horizon. The pairing matters: it cancels observational noise and gives a properly calibrated probability of improvement. Without it, your 80% interval on a small effect blows up by a factor of three.

get_channel_response_curve calls mmm.sample_saturation_curve and returns the marginal ROAS (the next dollar’s expected revenue) by spend level, plus the spend at which marginal returns drop below break-even.

A note on terminology, since the post mixes two related numbers. ROAS is return on ad spend — revenue divided by spend, a dimensionless ratio. ROAS = 3 means a dollar of spend earns three dollars of revenue. I avoid “ROI” because it conventionally means (revenue − spend) / spend, which is a different number. The model exposes two flavors of ROAS: historical-average ROAS (what each historical dollar earned on average — useful for “which channel performed better”) and marginal ROAS (what the next dollar moved is expected to earn — useful for “should I shift spend”). They diverge sharply when a channel operates near saturation, and the agent is told to keep them apart.

The agent decides which to call, in what order, and how many times per turn. It can also refuse: every tool returns {"error": ...} rather than raising, so the agent reads the error and either retries with corrected args or surfaces the constraint to the user.

Four: a session that remembers

async def repl():
    agent = build_agent()
    session = SQLiteSession(":memory:")
    while True:
        user = input("you> ").strip()
        if not user: return
        result = await Runner.run(agent, user, session=session)
        print(f"\nassistant> {result.final_output}\n")

SQLiteSession(":memory:") is the in-memory variant. Every user message, every tool call, and every tool result accumulates across turns. The next user message arrives with all of it as context.

This is what makes the system long-running rather than a one-shot. By turn three, the agent already knows the channel ROAS distribution from turn one and reuses it without calling the tool again.

What the model actually returns

Before showing the agent in action, here is what run_scenario plus compare_to_baseline actually produces under the hood. The right way to compare scenarios that move different numbers of dollars is the ROAS of the move — revenue change divided by spend change.

Posterior of Δrevenue ÷ Δspend for four candidate changes on x1 over a 13-week horizon. Each row asks: “for every dollar I’d move in or out of this channel, how many dollars of revenue change does the model expect?” The medians line up in order along the saturation curve: dollars you would cut from x1 carry a higher ROAS than dollars you would add. (Strictly, this is the average ROAS of the moved dollars — a secant on the response curve over a finite move — not the derivative-at-a-point, so I’m avoiding the word “marginal” here.) The 80% intervals overlap, which is honest. The model’s uncertainty about the next moved dollar is real, especially the further you push from the historical operating range.

Two things matter about this picture. First, every estimate carries an interval, not a point. Second, the medians line up in order as you push spend up: the cuts have a higher ROAS than the boosts, because the saturation curve is concave in the operating region the model trained on. That is the model’s case for treating a 20% cut and a 20% boost as asymmetric moves, not mirror images.

What a turn actually looks like

I ran the agent against six scripted prompts to see whether the translation actually works. The full transcript is in the repo. Here are the four turns that say the most about the pattern.

“How does this model work?” The agent calls describe_model, then writes:

It’s a Bayesian marketing mix model that explains your weekly outcome y from two paid channels (x1, x2) over 179 weeks from 2018-04-02 to 2021-08-30. … Adstock (carryover): “Some of this week’s impact shows up in later weeks.” In this model it’s geometric adstock over up to 4 weeks. … Saturation (diminishing returns): “The more you spend, the less efficient the next dollar becomes.” … x1: average ROAS 3.33; likely between 2.60 and 4.13. x2: average ROAS 2.89; likely between 2.51 and 3.26. … The historical ROAS above is average. For decisions (shift budget, scale up/down), the key is marginal ROAS (what the next moved dollar earns), which depends on where you are on the saturation curve.

The agent read the computational graph the model gave it, named every component in plain language, quoted uncertainty correctly, and — without being prompted by the user — flagged the historical-average vs marginal distinction. Without describe_model exposing the graph and the system prompt insisting on that distinction, the agent would either invent it or skip it.

“What if we cut x1 by 10% next quarter?” The agent calls run_scenario with pct_change=-0.1, then compare_to_baseline, then writes:

Over next quarter (13 weeks), cutting x1 spend by 10% is expected to: reduce revenue by about $742 (very likely between $589 and $904 lower); save about $236 in spend; have ~0% chance of increasing revenue (the model thinks revenue goes down essentially every time). What that implies for efficiency of the dollars you’d be removing (marginal ROAS of the cut): each $1 cut from x1 is associated with about $3.14 less revenue (likely $2.50 to $3.83). If your goal is profit/cash conservation rather than revenue, we’d need your gross margin to translate that revenue loss into profit impact.

Two tool calls inside one user turn. The agent translated the run-scenario JSON into dollars, the compare-to-baseline output into a probability statement, computed the marginal ROAS of the moved dollars, and then unprompted asked the user about gross margin to take the answer one step closer to a real decision. No one had to format anything by hand.

“Cut social a bit.” The turn I most care about. The agent does not run a scenario:

Which channel is “social” in this model, x1 or x2? And when you say “a bit,” do you mean roughly 5% or 10% for next quarter, or a different amount/horizon?

The rigid version of this would have either guessed a magnitude or matched “social” to whichever channel name was closest. This one asks back, because the system prompt says ambiguous requests get one clarifying question. That single behavior is the whole reason I wanted an agent that acts as a data scientist as specified in the prompt.

“Where do diminishing returns kick in for x2?” The agent calls get_channel_response_curve, finds the break-even point on the saturation curve, and then anchors the number against the channel’s actual operating range:

For x2, diminishing returns “kick in” in the decision sense at about $1.17K per week. At roughly $1,166/week, the model estimates the marginal ROAS drops below 1.0 (about 0.92). Plain English: beyond that point, the next $1 you add to x2 is expected to bring back less than $1 in revenue, so it’s past break-even on revenue. Context: in your history, x2 averaged ~$162/week spend (and went up to ~$994/week). So the model is saying x2 still looks revenue-positive across the range you typically ran, and the “past break-even” point shows up a bit above your historical max.

The tool returned a number on a curve. The agent turned it into a decision frame: it converted the threshold to weekly dollars, compared it to the channel’s actual historical mean and max, and noted that the break-even point sits just above where the team has actually been spending.

The whole exchange across six turns took about ninety seconds of wall-clock time and never opened a notebook.

But six turns in a REPL is still not enterprise infrastructure. There are pieces this version does not have, and they matter.

The missing pieces are memory, context, and collaboration

A lot of demos stop too early.

They show an LLM sitting on top of a model and call it a day. Real organizations are messier than that. The hard problems start when multiple people ask related questions over time, under constraints that live outside the model.

Here is the architecture we actually deploy for clients.

1. Memory

The system should remember prior questions, prior scenarios, and prior decisions.

If the CFO already rejected a spend shift because of cash flow timing, the system should know that before recommending the same move again. If the marketing team already learned that a certain channel cannot absorb more than a 5% increase in two weeks, that operational fact should shape future scenario generation.

This is not model memory in the neural sense. It is organizational memory tied to decisions.

2. Enterprise context

Most decision constraints do not live inside the Bayesian model.

The model may know the relationship between spend and revenue. It does not automatically know that:

  • TV inventory is locked for the next six weeks
  • the European team has its own budget authority
  • legal will not approve a campaign in a certain market
  • the board wants downside risk capped before the next earnings call

Those constraints belong in a context layer that the interaction system reads before it translates the question into a scenario.

3. Multi-stakeholder collaboration

Different stakeholders ask different questions of the same model.

The CMO cares about channel mix. The CFO cares about downside risk. Regional operators care about local feasibility. The data science lead cares about whether the scenario leaves the support of the training data.

A serious system needs role-aware outputs. The same scenario result should be summarized differently depending on who is asking and what authority they have.

4. Downstream actions

The last step is what makes this operational instead of theatrical.

A model-backed answer should not have to die in chat. If a scenario clears a threshold, the system should be able to create a planning artifact, trigger an approval workflow, update a dashboard, or open a ticket for the team that has to execute the change.

That is where the interface becomes a decision system.

A practical architecture for enterprise deployment

When we build this for a real organization, we structure it like this:

  1. Bayesian model service
    • hosts MMM, forecasting, pricing, or risk models
    • exposes scenario endpoints and posterior summaries
    • tracks model version, training window, and input assumptions
  2. Scenario translation layer
    • uses an LLM to convert natural-language requests into validated structured queries
    • asks clarifying questions when the request is underspecified
    • enforces schema, permissions, and business rules
  3. Context and memory layer
    • stores prior decisions, approved plans, organizational constraints, and role-specific context
    • retrieves the minimum context needed for the current question
    • logs exactly which context informed the answer
  4. Collaboration layer
    • lets stakeholders comment, compare scenarios, and request changes
    • preserves a decision trail instead of scattering analysis across slide decks and Slack threads
  5. Action layer
    • routes high-confidence recommendations into approvals, campaign systems, planning workflows, or follow-up analyses
    • never auto-executes without explicit governance in high-stakes settings

The part I keep coming back to is that this architecture makes the model more important, not less. Once stakeholders can interrogate the model directly, weak models get exposed faster. If the posterior is poorly calibrated or the scenario logic is brittle, everyone sees it immediately.

That is healthy.

Why this changes the role of the data scientist

The optimistic story here is not “the business no longer needs data scientists.”

I think the opposite is more likely.

Right now, too many data scientists act as an overqualified interface layer between the business and the model. They are translating questions, running ad hoc scenarios, cleaning up framing errors, and re-packaging results. That work matters, but it is also where a lot of analytical energy goes to die.

If a good interaction layer takes over the repetitive part, the role shifts upward.

The data scientist becomes the architect of the decision system:

  • designing the model
  • defining the scenario contract
  • specifying guardrails
  • deciding what can and cannot be automated
  • improving calibration and robustness over time

That is better work. It is also harder work. Which is one reason I do not buy the shallow version of the automation story.

Things that can go wrong

This setup is powerful, but I would not pretend it is easy.

A few failure modes matter a lot:

  • The LLM overreaches. If it starts inventing unsupported interpretations instead of staying inside the scenario contract, trust collapses quickly.
  • The model is asked questions outside its support. A stakeholder may request a scenario far outside historical behavior. The system needs to say that explicitly.
  • The context layer gets stale. Old plans, outdated permissions, or missing constraints can make the right statistical answer the wrong business answer.
  • People confuse speed with truth. A faster interface does not rescue a badly specified model.

I would add one more. Sometimes the cultural piece moves slower than the technical piece. People and processes have to adapt to acting on model-backed recommendations in real time, and a conversational interface that asks them to do that can outpace the organizational readiness around it.

The bigger prize is analytical access at scale

The pattern in this post — agent on top of a Bayesian system — is not specific to marketing mix models. The same shape works for hierarchical forecasting, causal inference, customer lifetime value, pricing, risk, supply chain optimization, and most of the quantitative machinery that still lives behind a technical wall.

For years, the limit on advanced analytics has been a human one: most stakeholders cannot talk to the models, and the people who can are too busy. An LLM interface does not remove the need for rigor in the model underneath. If anything, it raises the bar — once a CMO can interrogate the posterior directly, weak calibration and brittle scenario logic surface fast.

What it changes is timing. The version of democratization I care about is making the best models in a company reachable while the decision is still on the table, not three weeks later as a write-up. The math underneath stays exactly as rigorous as it was. What’s new is that someone other than the data scientist can ask it questions, and they can ask in time for the answer to matter.