How I Verified Climate Change While Planning a Trip to Berlin

Bayesian Statistics

Forecasting

AI Agents

A Bayesian detour through 66 years of rain data. The model couldn’t beat the naive baseline, and proved that nothing can. The trend coefficient was the real payoff.

Author

Luca Fiaschi

Published

July 3, 2026

I’m planning a week in Berlin later this year. I love the city in late summer: lakes, beer gardens, long evenings. What I don’t love is booking flights months ahead and then spending half the trip hiding from rain in museums.

So the question I actually wanted to answer was narrow: of all the weeks in the next six months, which one has the lowest chance of rain?

No weather app answers this. Forecasts stop at ten days. Beyond that you are on your own, and this post is the story of what happened when I refused to accept that. It ends with a booked week, a null result I’m fond of, and a climate-change signal I didn’t set out to find.

The Brandenburg Gate on the kind of evening I’m optimizing for. Photo: Thomas Wolf, foto-tw.de, CC BY-SA 3.0.

The naive answer is called climatology

My first instinct was embarrassingly simple: download historical data and count. For each calendar day, compute the fraction of past years where that day was rainy. If it rained on September 16 in 20 of the last 66 years, call it a 30% chance.

Meteorologists have a name for this: the climatological forecast. It sounds naive. It is also the reference against which every professional forecast system is scored, because at long lead times it is brutally hard to beat.

I pulled 66 years of daily precipitation for Berlin (1960 to 2025) from the ERA5 reanalysis archive, about 24,000 daily observations, and defined a rain day as 1 mm or more. The raw per-day frequencies are jagged, so you smooth them. That smoothed curve was my baseline.

Trying to do better

Two ideas made me think I could improve on counting.

First, drift. The climate of 1965 is not the climate of 2025. A plain average over 66 years assumes it is. If rain frequency is trending, a model that estimates the trend should forecast next year better than a frozen average.

Second, cycles. Year-to-year weather wobbles with slow ocean patterns like El Niño. Maybe some of that wobble is learnable.

The model that captures both is a Bayesian logistic regression on rain occurrence:

\[ \text{logit}(p_{d,y}) \;=\; \beta_0 + s(d) + \big(\delta + g(d)\big)\,t_y + r_{y} \]

In plain English: each day’s rain probability is a baseline, plus a smooth seasonal curve \(s(d)\) (a periodic Gaussian process, so December 31 connects to January 1), plus a climate trend \(\delta\) that is itself allowed to vary by season through \(g(d)\) (maybe September dries while October gets wetter), plus a year-level random effect \(r_y\) that soaks up whatever made 2003 unusually dry.

In PyMC the whole thing fits in about twenty lines:

import pymc as pm

with pm.Model(coords={"year": years}) as model:
    b0 = pm.Normal("b0", mu=-0.4, sigma=0.5)      # baseline log-odds of rain

    # Seasonal curve: a periodic GP over day-of-year, so the shape is
    # smooth and December 31 connects to January 1
    eta = pm.HalfNormal("eta_season", 1.0)
    ls = pm.LogNormal("ls_season", -0.7, 0.5)
    gp = pm.gp.HSGPPeriodic(m=20, scale=eta,
                            cov_func=pm.gp.cov.Periodic(1, period=1.0, ls=ls))
    s = gp.prior("s", X=day_grid)                 # one value per calendar day

    # Climate trend in logit units per decade, plus a second (smaller)
    # periodic GP that lets the trend differ by season
    delta = pm.Normal("delta", mu=0.0, sigma=0.3)
    g = trend_gp.prior("g", X=day_grid)           # built like s, tighter prior

    # Year-to-year wobble: a random walk, projected orthogonal to the
    # trend line so the two terms can't absorb each other's signal
    sigma_ry = pm.HalfNormal("sigma_ry", 0.15)
    ry_raw = pm.GaussianRandomWalk("ry_raw", sigma=sigma_ry, dims="year")
    ry = pm.Deterministic("ry", detrend @ ry_raw, dims="year")

    logit_p = b0 + s[day_idx] + (delta + g[day_idx]) * decade + ry[year_idx]
    pm.Bernoulli("rain", p=pm.math.sigmoid(logit_p), observed=rain)

    idata = pm.sample(nuts_sampler="nutpie")

Honest disclosure: an AI agent did most of the plumbing. The data pipeline, the backtesting harness, the convergence diagnostics, and several dead ends went from idea to committed code in an afternoon of me reviewing and steering. The agent constructed and fit the model following the excellent PyMC modeling skill from PyMC Labs, a set of expert instructions that teaches it current practice: the nutpie sampler, Hilbert-space GP approximations, identification tricks like the detrending projection above, and the full convergence-diagnostics checklist. Five years ago this detour would have cost me a week and I would not have bothered.

The fitted annual cycle for Berlin against the empirical frequencies, and the late-summer window adjusted to 2026.

The verdict: you cannot beat counting

Fitting a model is easy. The question is whether it forecasts better, and with historical data it is remarkably easy to cheat by accident, because the future leaks into the past through a dozen small holes. So the backtest walks forward, the way the model would actually be used. Start in 2006: fit on 1960 through 2005 only, predict every day of 2006, score it. Extend the training window by one year, refit from scratch, predict 2007. Repeat through 2025. Twenty held-out years the fitting never saw.

Two details matter more than they look. First, every quantity the model learns, down to the reference year I center the trend on, is recomputed inside each training window. Second, the held-out unit is a whole year, never scattered days: rain arrives in multi-day spells, so holding out random days while training on their neighbors would leak the answer through autocorrelation and produce flattering scores. The baseline gets the identical treatment, with the smoothed climatological curve recomputed from each training window.

Grading the forecasts needs care of its own. A probability can never be right or wrong on a single day: if I say 30% and it rains, who won? The tool for this is a scoring rule, a function that takes your stated probability and the observed 0/1 outcome and returns a penalty, averaged over many forecasts. Not any function will do. A useful scoring rule must be proper, meaning your expected penalty is smallest when you report what you actually believe. Naive choices fail this test. Score probabilities by absolute error, for instance, and the optimal strategy is to round every forecast to 0 or 1: the rule pays you to hide your uncertainty.

I used the Brier score, the squared error between forecast and outcome:

\[ \text{BS} \;=\; \frac{1}{N}\sum_{i=1}^{N}\,(p_i - y_i)^2 \]

where \(p_i\) is the forecast probability and \(y_i\) is 1 if that day turned out rainy. Lower is better. Zero is clairvoyance. A shrug of 0.5 scores 0.25 whatever happens, and forecasting a constant base rate \(\bar p\) scores about \(\bar p(1-\bar p)\), which for Berlin’s late summer is around 0.22. It is strictly proper, so honesty is the winning strategy, and it has pedigree: Glenn Brier introduced it in 1950 to verify probability-of-precipitation forecasts. The metric was invented for exactly this problem.

Those anchors explain why the numbers below sit so close together. Most of a rain day’s variance is irreducible coin-flip noise, so a flat base-rate forecast already scores about 0.22 and knowing the seasonal shape buys a forecaster only the last few thousandths below that. The competition happens in the third decimal. That is also why the error bar matters: it comes from a bootstrap over whole years rather than days, for the same clustering reason as the folds.

The result: my model scored 0.2144 over the twenty years. The smoothed climatological average scored 0.2150. Statistically tied: the bootstrap interval on the skill difference spans zero. All that machinery, and the count-and-smooth baseline matched it.

The interesting part is why, because the model itself can answer that. Daily weather beyond two weeks is chaos; the only thing left to predict is whether a season leans wetter or drier than usual. From the fitted year effects I could compute how large that predictable component even is. The answer: if an oracle whispered next autumn’s exact anomaly in my ear, my Brier score would improve by 0.4%. A realistic seasonal forecast captures maybe a tenth of that. This is also where El Niño exits the story: its influence on Central European late-summer rain is too weak to matter.

So this is not “my model is bad.” It is a stronger statement: for daily rain probabilities months ahead in this part of the world, no model can meaningfully beat climatology. The signal isn’t there. This is also why agencies like ECMWF and NOAA sell three-month aggregate outlooks and never daily probabilities at seasonal range. I now understand their product design from the inside.

The consolation prize: the drift is real

The tied score hides something. Prediction was a wash, but the model’s coefficients contain knowledge, and one of them is \(\delta\), the long-term drift.

It came out negative: \(-0.033\) logit per decade, with a 94% credible interval of \([-0.079, +0.013]\). In probability terms, a 92% chance that rain-day frequency in Berlin has been falling. Compounded over the record, late summer in Berlin now has roughly four fewer wet days per season than in the 1960s.

The seasonal decomposition is sharper than the headline number:

The long-term trend by day of year. Early September is drying fastest; late October is getting slightly wetter.

Early September is drying at about \(-0.077\) logit per decade, and unlike the annual average, its 94% credible interval sits entirely below zero. Late October, if anything, drifts slightly the other way. The same season, different directions, which is exactly what a single “annual trend” number would have averaged into mush.

The title of this post oversells it, and I know it. One city and one variable are not the IPCC, and I measured occurrence, not intensity (projections for Central Europe expect rain to become less frequent but heavier, which my binary rain-day cutoff cannot see). But I did not go looking for a climate signal. I went looking for a dry week, and the signal showed up anyway, in an afternoon: a 92% probability that the year as a whole is drying, and an early-September drying rate whose credible interval clears zero entirely. That was the moment this stopped being a trip-planning exercise.

The week I booked

Since nothing beats climatology, climatology plus the drift correction is the optimal trip planner. I aggregated the daily posterior into rolling 7-day windows for the next six months:

Expected rain days per 7-day window in Berlin, July 2026 through January 2027. The minimum sits in mid-September.

The driest week in Berlin over the next six months is the first week of September; every top-ranked window starts between the 1st and the 5th. About 1.9 expected rain days and a one-in-five chance of a completely dry week. Fittingly, the winning window sits exactly on the early-September drying trend from the previous section, so it has been getting more reliable over the decades. The worst option is late December, at just under three expected rain days.

Flights booked. I’ll check the ten-day forecast before packing, because inside two weeks real forecasts finally beat the curve.

What I took away

Three things, none of them about rain.

A model that fails to out-predict a baseline can still pay for itself in understanding. I could not forecast better than counting, but I now know why nothing can, I know the ceiling is physics rather than my prior choices, and I got a credible local climate trend as a by-product.

Null results only mean something when the evaluation is honest. Walk-forward validation, whole years held out, proper scoring rules, and a strong baseline. Skip any of these and I would have fooled myself into shipping a “better” forecaster.

And the economics have changed. The marginal cost of doing this rigorously, the backtest harness, the ceiling analysis, the diagnostics, has collapsed now that agents write the scaffolding. The judgment about what to test and whether to believe it is still the job. That part, for now, still rains on me.