Methodology · 15 min read

Geo-Lift Incrementality Testing: 14-Day True-ROAS Measurement Method

Last-click ROAS is a flattering lie — it counts the conversion that would have happened anyway. Geo-lift incrementality testing partitions your spend by geography (e.g., pause Meta in Texas + Florida for 14 days; keep spending in matched control states), compares revenue across the two cohorts, and runs a two-proportion z-test for statistical significance. This playbook walks the design, the math, and the Admaxxer surface that automates it.

  • Holdout window: 14 days (typical detection sensitivity)
  • Confidence level: 95% (α = 0.05 two-tailed)
  • Required spend share: 10-20% of platform daily budget

Last-click ROAS is a flattering lie

Reported ROAS in Meta Ads Manager, Google Ads, and every dashboard that pipes their numbers downstream — including Admaxxer's default last-non-direct view — counts every click that converts, regardless of whether that conversion would have happened without the click. Multi-touch attribution dampens it. Markov-chain attribution dampens it further. Only a randomized experiment can isolate the causal contribution. Geo-lift is that experiment.

Incremental revenue is the revenue you would lose if you turned off the channel. Reported revenue is the revenue the channel takes credit for. The two are not the same. A DTC brand with strong brand affinity can see Meta-reported ROAS of 4.0x and Meta incremental ROAS of 1.6x, because roughly 60% of the so-called Meta conversions come from people who would have bought anyway: they searched for the brand the next day and clicked a Meta retargeting ad on the way to checkout.

The same brand will see Google Search-reported ROAS of 8.0x and Google Search incremental ROAS of 2.2x for the same reason — branded search conversions would happen anyway. Geo-lift quantifies the gap. If you have ever been asked by a CFO or a CMO whether the Meta budget is real or vanity, geo-lift is how you answer.

  • Last-click ROAS counts conversions that would happen anyway.
  • Randomized geographic holdout produces a causal estimate.
  • State-level cluster sampling avoids cross-contamination.
  • The two-proportion z-test is the right statistic for two binomial cohorts.

What a geo-lift test actually is

A geo-lift test is a designed experiment, not an observational regression. You select N test geographies, pause (or significantly reduce) ad spend in them while holding spend constant in matched control geographies, then measure the revenue delta across cohorts. The two-proportion z-test on conversion-rate-per-visitor in the two cohorts gives you a p-value for whether the difference is statistically significant.

The treatment is the spend change. The outcome is the revenue or conversion-rate response. The two cohorts are matched at baseline so the only systematic difference between them is the treatment. Cluster sampling at the state level is what makes the design tractable for DTC — city-level holdouts leak because shoppers move between cities, but state-to-state contamination is rare enough to be a non-issue in practice.

Geo-lift is the canonical way to measure incremental ROAS for high-stakes budget decisions: cutting a channel by 40%, reallocating between Meta and Google, defending an ad budget at a board meeting. Marketing mix modeling (MMM) and multi-touch attribution are useful for ongoing budget tuning, but they cannot replace a designed experiment when the decision is large and the cost of being wrong is high.

Selecting test and control geographies

The test is only as honest as the matched pair. Pick test states randomly and you will spend two weeks watching noise. Admaxxer ships a deterministic matched-pair generator that scores every state pair against the prior 90 days of your own pixel + Shopify data, then assigns one state per pair to the test cohort and the other to the control cohort.

The matching heuristic uses four signals: baseline conversion rate, average order value, DTC penetration per 1,000 residents, and day-of-week traffic shape. These are the only confounds that move DTC revenue in any meaningful way. The distance metric is Euclidean over z-scored features so no single dimension can dominate the pairing.

Minimum five matched pairs (ten states) for a typical DTC sample size. A $20k/month Meta brand will usually yield ten to fifteen matched pairs and run with six to ten in the live test for statistical power headroom. Below is the actual heuristic — TypeScript, no library dependency, runs against the live ClickHouse pixel data.

// server/lib/incrementality/matchedPairs.ts
// Pre-test geography matching — runs against the prior 90 days of
// pixel + Shopify orders. Returns matched pairs ready for test/control
// assignment. State-level (not city-level) to avoid cross-contamination.

type StateBaseline = {
  state: string;            // e.g. "CA", "FL", "TX"
  visitors_90d: number;     // pixel visitor count
  orders_90d: number;       // Shopify orders count
  cvr: number;              // orders / visitors
  aov: number;              // average order value, locked-FX USD
  dowShape: number[];       // 7-element traffic share (Mon..Sun)
  dtcPenetration: number;   // visitors / state population (per 1k)
};

// Pairwise distance — Euclidean over z-scored features so any one
// dimension can't dominate. CVR + AOV + DTC penetration carry the
// most weight (those are the only confounds that move revenue).
function pairDistance(a: StateBaseline, b: StateBaseline): number {
  return Math.hypot(
    zScore(a.cvr, b.cvr),
    zScore(a.aov, b.aov),
    zScore(a.dtcPenetration, b.dtcPenetration),
    dowCosineDistance(a.dowShape, b.dowShape),
  );
}

// Greedy pairing: pick the closest pair, remove both, repeat.
// Yields floor(n / 2) pairs, so the caller must check that at least 5
// remain to satisfy the minimum-sample-size requirement.
export function generateMatchedPairs(
  baselines: StateBaseline[],
): Array<[string, string]> {
  const pool = [...baselines];
  const pairs: Array<[string, string]> = [];
  while (pool.length >= 2) {
    let bestI = 0, bestJ = 1, bestD = Infinity;
    for (let i = 0; i < pool.length; i++) {
      for (let j = i + 1; j < pool.length; j++) {
        const d = pairDistance(pool[i], pool[j]);
        if (d < bestD) { bestD = d; bestI = i; bestJ = j; }
      }
    }
    pairs.push([pool[bestI].state, pool[bestJ].state]);
    pool.splice(bestJ, 1);
    pool.splice(bestI, 1);
  }
  return pairs;
}

A matched pair should clear every one of these thresholds:

  • Within plus or minus 0.5 percentage points of baseline CVR.
  • Within plus or minus 15% AOV.
  • Within plus or minus 20% DTC penetration per 1,000 residents.
  • Day-of-week cosine similarity greater than 0.95.
  • Both states have more than 500 visitors per day on average.
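
The thresholds above can be expressed as a simple acceptance filter applied to each candidate pair. This is a minimal sketch, not Admaxxer's code: the StateStats type and the meetsThresholds function are hypothetical names, and measuring the AOV and penetration gaps against the pair's mean is an assumption, since the list does not say which value is the reference.

```typescript
// Hypothetical shape; field names are illustrative only.
type StateStats = {
  state: string;
  cvr: number;             // orders / visitors as a fraction (0.023 = 2.3%)
  aov: number;             // average order value in USD
  dtcPenetration: number;  // visitors per 1,000 residents
  dowShape: number[];      // 7-element traffic share, Mon..Sun
  visitorsPerDay: number;  // average daily visitors over the baseline window
};

// Cosine similarity between two day-of-week traffic shapes.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / Math.sqrt(normA * normB);
}

// Gap between two positive values, relative to their mean (an assumption).
function relativeGap(x: number, y: number): number {
  return Math.abs(x - y) / ((x + y) / 2);
}

// True only when the pair clears all five thresholds from the list.
function meetsThresholds(a: StateStats, b: StateStats): boolean {
  return (
    Math.abs(a.cvr - b.cvr) <= 0.005 &&                      // 0.5 percentage points
    relativeGap(a.aov, b.aov) <= 0.15 &&                     // 15% AOV
    relativeGap(a.dtcPenetration, b.dtcPenetration) <= 0.2 && // 20% penetration
    cosineSimilarity(a.dowShape, b.dowShape) > 0.95 &&
    a.visitorsPerDay > 500 &&
    b.visitorsPerDay > 500
  );
}
```

Running the filter over the output of the pairing step and counting the survivors is one way to confirm that at least five pairs remain before launch.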

The two-proportion z-test

Two binomial cohorts, each with an observed conversion rate. The question is whether the rates differ enough to reject the null hypothesis that they are the same. The two-proportion z-test answers that with a single statistic and a 95% confidence interval. This is industry-standard math, identical to what statsmodels' proportions_ztest computes. Admaxxer runs it in TypeScript at server/lib/incrementality/zTest.ts with the same arithmetic: no library dependency, no numerical drift between dev and prod.

# Two-proportion z-test (95% confidence, alpha 0.05, two-tailed)

p1 = conversions_test / visitors_test         # observed conversion rate, test cohort
p2 = conversions_control / visitors_control   # observed conversion rate, control cohort

# Pooled estimate (under the null hypothesis that p1 == p2):
p_pooled = (conversions_test + conversions_control)
         / (visitors_test    + visitors_control)

# Standard error of (p1 - p2), using the same visitor counts as above:
se = sqrt( p_pooled * (1 - p_pooled) * (1/visitors_test + 1/visitors_control) )

# Test statistic:
z = (p1 - p2) / se

# Reject the null when |z| > 1.96  (two-tailed, alpha = 0.05)
# 95% confidence interval on the lift:
ci_lower = (p1 - p2) - 1.96 * se
ci_upper = (p1 - p2) + 1.96 * se

Reading the result. The threshold for statistical significance at 95% confidence is |z| > 1.96 (two-tailed). At 99% it is |z| > 2.58. Always report the confidence interval alongside the point estimate — a point estimate without a CI is a half-finished result and easy to misinterpret.

  • |z| greater than 2.58 — reject the null at 99% confidence; budget reallocation is safe.
  • 1.96 less than |z| less than 2.58 — reject at 95%; statistically significant but consider repeating the test before a major budget shift.
  • 1.64 less than |z| less than 1.96 — marginally significant; conservative DTC operators treat this as inconclusive.
  • |z| less than 1.64 — failed to reject the null; check the power-analysis table before declaring the channel ineffective.
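
The formula and the reading bands above translate fairly directly into TypeScript. The sketch below is an illustration under stated assumptions, not the contents of zTest.ts: the function and type names are hypothetical, and the p-value uses the Abramowitz-Stegun 7.1.26 approximation of the complementary error function (absolute error below 1.5e-7), which is a choice made here for self-containment.

```typescript
type ZTestResult = {
  p1: number;      // test-cohort conversion rate
  p2: number;      // control-cohort conversion rate
  se: number;      // pooled standard error of (p1 - p2)
  z: number;
  pValue: number;  // two-tailed
  ciLower: number; // 95% CI on (p1 - p2), absolute rate points
  ciUpper: number;
  verdict: "significant at 99%" | "significant at 95%" | "marginal" | "not significant";
};

// Complementary error function for x >= 0 (Abramowitz-Stegun 7.1.26).
function erfc(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

function twoProportionZTest(
  conversionsTest: number,
  visitorsTest: number,
  conversionsControl: number,
  visitorsControl: number,
): ZTestResult {
  const p1 = conversionsTest / visitorsTest;
  const p2 = conversionsControl / visitorsControl;
  // Pooled rate under the null hypothesis p1 == p2.
  const pooled = (conversionsTest + conversionsControl) / (visitorsTest + visitorsControl);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / visitorsTest + 1 / visitorsControl));
  const z = (p1 - p2) / se;
  const absZ = Math.abs(z);
  // Two-tailed p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
  const pValue = erfc(absZ / Math.SQRT2);
  const verdict: ZTestResult["verdict"] =
    absZ > 2.58 ? "significant at 99%" :
    absZ > 1.96 ? "significant at 95%" :
    absZ > 1.64 ? "marginal" : "not significant";
  return {
    p1, p2, se, z, pValue,
    ciLower: (p1 - p2) - 1.96 * se,
    ciUpper: (p1 - p2) + 1.96 * se,
    verdict,
  };
}
```

Calling twoProportionZTest with the conversion and visitor counts from each cohort returns the statistic, the interval, and the band in one object, which keeps the point estimate and its confidence interval together when the result is reported.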

Power analysis — when 14 days isn't enough

Power is the probability that your test will detect a real effect of a given size. A 20% relative lift on a 2% baseline conversion rate needs roughly 19,600 visitors per cohort to reach 80% power at 95% confidence. A 5% relative lift on the same baseline needs 16 times more visitors (roughly 314,000), because the required sample grows with the inverse square of the lift. Small effects on low-CVR brands are the hardest experiments to run and the easiest to misread.

Admaxxer's /incrementality page runs the power-analysis math live against your specific spend × CTR × CVR profile and tells you the minimum days remaining before the cohort has enough visitors to reach significance. The table below is the practical reference for sketching a test on the back of a napkin.

# Minimum visitors per cohort to detect a 20% relative lift at 95% confidence
# (alpha = 0.05 two-tailed, power 1 - beta = 0.80)

Baseline CVR    Visitors per cohort    Days at 1,000 visitors/day
-------------   --------------------   ---------------------------
0.5%            ~79,600                ~80 days  (28 days is not enough; use higher-traffic states)
1.0%            ~39,600                ~40 days  (extend to 28 days and use higher-traffic states)
2.0%            ~19,600                ~20 days
3.0%            ~12,900                ~13 days
5.0%             ~7,600                ~8  days  (most BFCM-window tests)
8.0%             ~4,600                ~5  days

# Formula (rough):  n_per_cohort  approx  16 * p * (1 - p) / (lift * p)^2
# where p is the baseline CVR and lift is the relative lift (0.20 in the table).
# Detecting a SMALLER lift requires a quadratically larger sample.

Practical guidance: most $20k+/month DTC brands at a 2-3% baseline CVR can run a 14-day geo-lift comfortably. Below 1% CVR you should extend to 28 days or pick higher-traffic test states. Below 0.5% CVR consider running an audience-holdout variant on email or SMS instead — the math is identical but the sample sizes are usually easier to hit.
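
For a quick check of your own numbers, the rough formula above can be wrapped in a few helpers. This is a back-of-the-napkin sketch, not the live power analysis on the /incrementality page: the function names are hypothetical, the power estimate uses a normal approximation with equal cohort sizes, and the normal CDF relies on the Abramowitz-Stegun 7.1.26 approximation.

```typescript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 erfc approximation.
function normalCdf(x: number): number {
  const ax = Math.abs(x) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  const upperTail = 0.5 * poly * Math.exp(-ax * ax); // P(Z > |x|)
  return x >= 0 ? 1 - upperTail : upperTail;
}

// Rough rule: n ~ 16 * p * (1 - p) / (lift * p)^2 per cohort
// (alpha = 0.05 two-tailed, power = 0.80).
function visitorsPerCohort(baselineCvr: number, relativeLift: number): number {
  const delta = baselineCvr * relativeLift; // absolute rate difference to detect
  return Math.ceil((16 * baselineCvr * (1 - baselineCvr)) / (delta * delta));
}

// Days needed at a given daily visitor count per cohort.
function daysToPower(baselineCvr: number, relativeLift: number, visitorsPerDay: number): number {
  return Math.ceil(visitorsPerCohort(baselineCvr, relativeLift) / visitorsPerDay);
}

// Approximate power for n visitors per cohort at alpha = 0.05 two-tailed.
// Ignores the negligible probability of rejecting in the wrong direction.
function approxPower(baselineCvr: number, relativeLift: number, n: number): number {
  const delta = baselineCvr * relativeLift;
  const se = Math.sqrt((2 * baselineCvr * (1 - baselineCvr)) / n);
  return normalCdf(delta / se - 1.96);
}
```

Because the lift enters the formula squared, halving the lift you want to detect quadruples the visitors required, which is worth checking before committing to a 14-day window.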

Test duration — 14 days is the floor, 28 days is the ceiling

Fourteen days captures two full weekly cycles, which smooths out day-of-week effects in DTC traffic (Sunday traffic differs from Wednesday traffic in every category we have measured). It also gives Meta's and Google's bidding algorithms time to adjust to the spend reduction without artificially distorting your control cohort. Anything shorter than 14 days risks fitting noise; anything longer than 28 days starts measuring brand-awareness erosion rather than the spend change.

Past 28 days, three things degrade your test. Brand awareness in the test states starts to erode noticeably; you are measuring 'Meta off for 4 weeks' rather than 'Meta off for 2 weeks.' Seasonal effects creep in — a competitor launches, a TikTok video goes viral, an unexpected weather event affects shopping behavior. And customer questions about why they are not seeing your ads accumulate, which can leak the experiment to the wider audience and contaminate the geography boundary.

Extend to 28 days only when your conversion rate is below 1%. In that regime the visitor count alone is rarely enough at 14 days, and you would rather pay the brand-awareness cost than ship a conclusion you cannot statistically defend. The trade-off is asymmetric: a wrong conclusion can drive a multi-quarter budget misallocation, while two extra weeks of reduced spend in the test states is a one-time cost that recovers as soon as the test closes and the campaigns are restored.

What 14 days does not capture is the trailing tail of the channel. Some channels have a 30-60 day attribution latency — a Meta retargeting impression today drives a purchase six weeks from now. Geo-lift catches the immediate response; the trailing tail shows up in MMM rather than in the experiment readout. For ongoing budget tuning, the combination of the two surfaces is what produces a complete picture; geo-lift alone is the high-confidence point estimate, not the time-distributed contribution function.

Ramp-down handling. When you cut Meta in a test state, the bidding algorithm sees a sudden drop in available impressions and may shift spend to adjacent states (control contamination). Admaxxer's wizard recommends a 48-hour ramp-down — reduce spend by 50% on day -1, then to 0% on day 0 — to give the algorithm time to redistribute gradually. The same logic applies on the way back up. Including the ramp window means the formal 14-day clock starts on day +2 of the configuration, not day 0; the wizard tracks both timestamps separately.

Platform configuration — geo-restricted spend on every major ad network

Meta, Google, TikTok, Amazon, and Pinterest all support geographic exclusions at the campaign or ad-set level. Admaxxer's setup wizard generates the exact exclusion configuration per platform — including the criterion IDs for Google, DMA-to-state mapping for TikTok, and bid-modifier syntax for Amazon DSP — so you don't have to compute it manually. The reference below is the at-a-glance summary; the wizard fills in the specifics for your accounts.

Platform                           Exclusion mechanism
--------------------------------   ---------------------------------------------------
Meta (Facebook + Instagram)        Ad-set level location targeting → exclude US states
Google Ads (Search + PMax)         Campaign location → exclude state-level criterion IDs (each state has its own geo target ID under the US country criterion, 2840)
TikTok Ads                         Ad group geo targeting → exclude DMA / state list (cross-state DMAs split automatically)
Amazon (Sponsored Display + DSP)   DSP geo bid modifier set to -100% in test states; Sponsored Products can't be geo-restricted
Pinterest Ads                      Targeting → Location → exclude US regions; allow 24h for propagation

The general pattern across platforms: clone your existing campaign into two variants, a test-geo variant that targets only the test states and a control-geo variant that excludes them. Then pause the test-geo variant for the 14-day window. The learning phase resets only on the cloned set; your live one keeps its bidding history.

Performance Max on Google is a partial exception — PMax cannot mix include + exclude in one campaign, so you need a separate PMax campaign per cohort. Amazon Sponsored Products cannot be geo-restricted at all; if you are testing Sponsored Products incrementality, use a temporal holdout (pause for a week, run for a week, alternate) rather than a geographic one.

Reading the result — a worked example

Below is a literal day-14 readout from a $20k/month Meta-heavy DTC supplements brand running the test described above. Baseline blended CVR of 2.3% over the prior 90 days. Three test states (CA, FL, TX) with Meta spend paused; three matched controls (NY, GA, OH) with spend held constant. The math is the z-test formula from the two-proportion z-test section above.

# 14-day Meta geo-lift, Vitatree-style DTC supplements brand
# Spend: $20k/month Meta-heavy
# Baseline CVR (90d, blended): 2.3%
# Test geos: CA, FL, TX  (Meta ad spend paused; other channels untouched)
# Control geos: NY, GA, OH  (matched-pair selected by Admaxxer wizard)

# Day 14 readout:
visitors_test     = 41,200
conversions_test  =    872   ->  p1 = 2.116%
visitors_control  = 39,800
conversions_control = 1,010  ->  p2 = 2.538%

# Two-proportion z-test:
p_pooled = (872 + 1010) / (41200 + 39800)
        = 1882 / 81000
        = 0.02324
se = sqrt(0.02324 * 0.97676 * (1/41200 + 1/39800))
   = 0.001059
z  = (0.02116 - 0.02538) / 0.001059
   = -3.98

# |z| = 3.98  >  1.96   ->  reject the null at 95% confidence
# p-value  approximately  0.00007  (effectively p < 0.001)
# 95% CI on the lift:  (-0.629%, -0.214%)  conversion-rate points absolute
# Interpretation: Meta drives roughly an 8-25% relative conversion-rate
# lift in the spend-on cohort (point estimate about 17%). Cutting Meta
# produced a clean, statistically significant conversion-rate decline.
# Meta's incremental contribution is REAL, not the last-click flattering
# lie. Reallocate budget INTO Meta, not away.

Interpretation. The negative sign on z indicates the test cohort underperformed the control, exactly what we expected when Meta spend was cut. The confidence interval is in absolute conversion-rate points; dividing by the control rate (2.538%) gives the relative lift: roughly 8% to 25%, with a point estimate near 17%. A lift of that size in the spend-on cohort is the kind of result that justifies reallocating budget into Meta, not away.
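
The conversion from absolute conversion-rate points to a relative lift can be sketched as a small helper. The names below are hypothetical, and dividing the interval endpoints by the control rate treats that rate as fixed, which is an approximation that slightly understates the uncertainty.

```typescript
type RelativeLift = {
  point: number; // (control rate - test rate) / control rate
  lower: number; // lower 95% bound, relative to the control rate
  upper: number; // upper 95% bound, relative to the control rate
};

// Relative lift of the spend-on (control) cohort over the spend-off (test)
// cohort, using the pooled standard error from the two-proportion z-test.
function relativeLiftFromCounts(
  conversionsTest: number,
  visitorsTest: number,
  conversionsControl: number,
  visitorsControl: number,
): RelativeLift {
  const pTest = conversionsTest / visitorsTest;
  const pControl = conversionsControl / visitorsControl;
  const pooled = (conversionsTest + conversionsControl) / (visitorsTest + visitorsControl);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / visitorsTest + 1 / visitorsControl));
  const diff = pControl - pTest; // positive when the spend-on cohort converts better
  return {
    point: diff / pControl,
    lower: (diff - 1.96 * se) / pControl,
    upper: (diff + 1.96 * se) / pControl,
  };
}

// Usage with the day-14 readout counts from the worked example.
const readoutLift = relativeLiftFromCounts(872, 41200, 1010, 39800);
```

Reporting all three fields together keeps the headline number tied to its interval.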

How to act on a clean result. A clean positive lift (z greater than 1.96 with a positive sign on the spend-on cohort) means the channel is doing real work. Increase its budget allocation by 10-20% and re-run the test next quarter — the lift coefficient typically stays stable for one to two quarters then drifts as audiences saturate. A clean null result (|z| less than 1.96 with adequate power) means the channel is not driving measurable incremental revenue at the current spend level; consider reducing spend by 30-40% and running a second test to confirm.

Geo-lift vs MMM — when to use each

Geo-lift testing is a randomized experiment that produces a causal estimate of channel contribution. MMM (marketing mix modeling) is a regression-based observational technique that estimates channel contribution from historical spend × revenue data. Both have a place in a mature DTC analytics stack; they answer different questions and incur different costs.

  • Geo-lift: use for high-stakes budget decisions. Output is causal. Cost is 14 days of intentionally reduced spend in the test states.
  • MMM: use for ongoing weekly/monthly budget tuning across 5+ channels. Output is observational with a credible interval. No experimental cost — runs continuously on historical data.

The two-step workflow most Admaxxer customers run: MMM continuously to set weekly budget allocation, geo-lift quarterly on whichever channel MMM is most confident is over- or under-credited. The geo-lift result calibrates the next quarter's MMM priors. This is the pattern Triple Whale, Northbeam, and Rockerbox all converged on with sophisticated DTC operators; Admaxxer ships both as first-class surfaces (/incrementality + /mmm).

MMM does not replace geo-lift for high-stakes decisions because it cannot rule out omitted-variable bias. A spike in branded search traffic could be caused by a billboard, a podcast ad, or organic word-of-mouth — MMM sees the spike but cannot tell which input drove it. Geo-lift can, because the randomized assignment closes off all the alternative explanations by construction.

The audience-holdout variant for email and SMS

Geo-lift only works for channels where spend is geographically controllable. Email and SMS are audience-controllable, not geo-controllable. The fix is structurally identical: randomly hold out 20% of the target audience for the duration of the test, send the campaign to the other 80%, and run the same two-proportion z-test on conversion rate per audience member. The z-test compares proportions, so the input is the share of audience members who convert rather than raw revenue.

Sample size considerations. An 80/20 split on a 100,000-list audience yields 80,000 treatment and 20,000 holdout. At a 0.8% campaign-attributed conversion rate that is approximately 640 conversions in treatment and 160 in holdout. The pooled standard error at that size is about 0.07 percentage points, which is enough to detect a relative lift of roughly 25% at 80% power; a 10% relative lift would need about six times the audience. Smaller lists or lower CVRs may need a larger holdout percentage or a longer aggregation window across multiple sends.
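
One common way to get a stable random holdout is to hash each customer ID together with a test identifier, so the same person always lands in the same cohort across sends. The sketch below is an illustration only: in practice the holdout is set on the ESP side, the function names are hypothetical, and the hash (FNV-1a over UTF-16 code units followed by the MurmurHash3 finalizer) is a choice made here, not something the playbook prescribes.

```typescript
// 32-bit string hash: FNV-1a, then the MurmurHash3 finalizer for bit mixing.
function hash32(input: string): number {
  let h = 0x811c9dc5;               // FNV-1a offset basis
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193);   // FNV prime
  }
  h ^= h >>> 16;
  h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13;
  h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0;                   // unsigned 32-bit result
}

// Deterministic assignment: the same (testId, customerId) always gives
// the same answer, and a new testId reshuffles the audience.
function isHoldout(customerId: string, testId: string, holdoutShare: number = 0.2): boolean {
  return hash32(`${testId}:${customerId}`) / 0x100000000 < holdoutShare;
}

function splitAudience(
  customerIds: string[],
  testId: string,
  holdoutShare: number = 0.2,
): { treatment: string[]; holdout: string[] } {
  const treatment: string[] = [];
  const holdout: string[] = [];
  for (const id of customerIds) {
    (isHoldout(id, testId, holdoutShare) ? holdout : treatment).push(id);
  }
  return { treatment, holdout };
}
```

Salting the hash with the test ID matters: reusing one fixed holdout group across every test would starve the same customers of campaigns quarter after quarter.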

Test duration is shorter than geo-lift because the treatment is a discrete event (a campaign send) rather than a 14-day continuous spend reduction. Measure revenue in the 7 days following each send; aggregate across 4-8 sends for a stable estimate of email or SMS incremental contribution. Admaxxer's /incrementality page wires this up automatically for Klaviyo, Postscript, and Attentive ESPs — the holdout is set on the ESP side and the revenue join happens in our pipeline.

What incrementality reveals about 'email-driven' revenue. Klaviyo and most ESPs report email-attributed revenue using a 5-day click-through attribution window. Holdout testing consistently shows that 40-70% of that revenue is not incremental — those customers were going to buy anyway and clicking the email was the convenient link. The incremental fraction matters for ROI on list-rental, paid growth, and welcome-series investment. Run the holdout once a quarter against your flagship campaign type (post-purchase, cart-abandonment, win-back) to keep the calibration current.

Five mistakes that break a geo-lift test

Geo-lift testing has a small but consistent set of failure modes. Each of these has cost an Admaxxer customer at least one test cycle before we shipped a guard against it in the wizard. Knowing them in advance is cheaper than learning them on your own dollar.

  1. City-level holdouts. Cross-city contamination is real — a Bay Area shopper is one click away from an Oakland billboard and a San Francisco podcast ad. Cluster at the state level. Admaxxer rejects city-level designs in the wizard.
  2. Forgetting to exclude on retargeting and branded search. When you pause prospecting in Texas but leave Meta retargeting and Google branded search on, part of the treatment is still running in the test cohort, so the readout no longer measures the channel being off. Exclude every paid surface equally.
  3. Running over a major holiday. Independence Day, Thanksgiving, and Christmas all skew baseline CVR by 2-5 times. Push the test to the week after if a holiday falls inside the 14-day window. The matched-pair generator flags this.
  4. Ignoring power analysis on a low-CVR brand. A 0.5% CVR brand running 14 days with 1,000 visitors/day per cohort has only about 20% power to detect a 20% relative lift at 95% confidence, and far less for a 5% lift. Extend to 28 days, pick higher-traffic test states, or both.
  5. Cherry-picking the readout day. Picking the day that gives the cleanest answer is p-hacking. Lock the readout day at test creation; report the result on that day even if it's less favorable than day 13.
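
Mistakes 3 and 5 are both cheap to guard against at test creation. The sketch below is a hypothetical pre-flight helper, not the wizard's implementation: it fixes the readout date as the day after the window closes and lists any holidays that fall inside the window. The holiday list is supplied by the caller, and all dates are handled in UTC as YYYY-MM-DD strings.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

type TestPlan = {
  startIso: string;           // first day of the formal test window
  readoutIso: string;         // locked readout date (day after the window ends)
  holidayConflicts: string[]; // holidays that fall inside the window
};

function planTest(
  startIso: string,
  windowDays: number,
  holidaysIso: string[],
): Readonly<TestPlan> {
  const start = Date.parse(`${startIso}T00:00:00Z`);
  const end = start + windowDays * DAY_MS; // first instant after the window
  const holidayConflicts = holidaysIso.filter((holiday) => {
    const t = Date.parse(`${holiday}T00:00:00Z`);
    return t >= start && t < end;
  });
  const readoutIso = new Date(end).toISOString().slice(0, 10);
  // Freeze the plan so the readout date cannot be reassigned later.
  return Object.freeze({ startIso, readoutIso, holidayConflicts });
}

// Usage: a 14-day window starting 25 June 2025 overlaps Independence Day.
const draftPlan = planTest("2025-06-25", 14, ["2025-07-04", "2025-11-27", "2025-12-25"]);
```

A non-empty holidayConflicts array is the signal to push the start date back before anything goes live.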

How to act on a clean result

A clean positive lift (z greater than 1.96, positive sign on the spend-on cohort) means the channel is contributing real incremental revenue. The typical action is to increase its budget allocation by 10-20% and re-run the test next quarter to confirm the lift coefficient is stable. Most channels degrade as audiences saturate, so a clean lift today is not a perpetual license — re-test annually at minimum, quarterly for fast-moving channels like TikTok where creative cycles are short and audience overlap is high.

A clean null result (|z| less than 1.96 with adequate power) means the channel is not driving measurable incremental revenue at the current spend level. The standard action is to reduce spend by 30-40% and run a second test to confirm. Cutting cold-turkey is risky — the second-order effects (brand awareness, retargeting funnel feed, audience pixel quality) take longer to manifest and you want a measured taper rather than an abrupt halt. The 30-40% reduction window typically reveals whether the channel has a non-linear contribution: if cutting 40% drops revenue by 40%, the channel was fully incremental at the prior level; if cutting 40% drops revenue by only 10%, the channel has diminishing returns and a smaller budget delivers similar output.

A marginally significant result (1.64 less than |z| less than 1.96) is the most common outcome and the trickiest to act on. The honest move is to extend the test or queue a follow-up test rather than declare a definitive answer. Admaxxer's /incrementality page recommends a specific follow-up test design when this happens, including the minimum additional sample size to reach 95% confidence and an explicit rejection criterion in case the follow-up also lands in the marginal zone (at which point the answer is 'this channel's contribution is small enough that the test cost exceeds the decision value' and you move on with budget at the current level).
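
A rough way to size the follow-up is to use the fact that z grows with the square root of the sample when the underlying effect stays the same. The helper below is a sketch under that assumption and is not the formula the /incrementality page uses; the function name is hypothetical. Note that it only targets |z| = 1.96 at the current point estimate, which gives about even odds of crossing the line, so treat the result as a floor rather than a guarantee.

```typescript
// Extra visitors per cohort needed for |z| to reach targetZ, assuming the
// observed effect size holds and z scales with sqrt(n).
function additionalVisitorsNeeded(
  observedZ: number,
  visitorsPerCohort: number,
  targetZ: number = 1.96,
): number {
  const absZ = Math.abs(observedZ);
  if (absZ === 0) return Infinity;   // no observed effect to extrapolate from
  if (absZ >= targetZ) return 0;     // already past the threshold
  const required = visitorsPerCohort * Math.pow(targetZ / absZ, 2);
  return Math.ceil(required - visitorsPerCohort);
}
```

Dividing the result by your daily visitors per cohort gives the number of extra days to queue, which can then be checked against the 28-day ceiling.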

Reporting the result to stakeholders. The headline number should always be the relative lift with its 95% confidence interval, not just the point estimate. 'Meta drives an 8-25% relative conversion-rate lift (point estimate 17%), statistically significant at p<0.001' is a defensible statement; 'Meta drives a 17% lift' is a half-truth that invites misinterpretation. The CI is what makes the result actionable: a tight CI around a small lift is more useful than a wide CI around a large lift.

Documenting the test for future reference. Admaxxer's /incrementality page persists every test (test states, control states, test window, baseline period, configuration timestamps, day-by-day cohort metrics, final z-statistic, confidence interval, and the decision the team took). This becomes a longitudinal record of how each channel's incrementality has shifted over time — a critical input to budget defense conversations with finance and to onboarding new growth team members who need historical context.

Frequently asked questions

What is geo-lift testing and why use it over MMM?
Geo-lift testing is a randomized experiment: you select N test geographies, pause (or significantly reduce) ad spend in them while keeping it constant in matched control geographies, then measure the revenue delta across cohorts. The two-proportion z-test on conversion-rate-per-visitor in the two cohorts gives a p-value for whether the difference is statistically significant. Geo-lift is a true causal estimate: if test geos sell 5% less than control geos with comparable baseline behavior, that 5% is the channel's incremental revenue contribution. MMM (marketing mix modeling) is regression-based and observational; it estimates contribution but can't isolate causation. Use geo-lift for high-stakes budget reallocation decisions; use MMM for ongoing budget mix recommendations.
How do I pick test vs control geographies?
Admaxxer ships a pre-test matched-pair generator. It compares state-level baseline conversion rates over the prior 90 days, finds pairs of states with similar (a) DTC penetration, (b) AOV distribution, (c) day-of-week traffic shape, then assigns one of each pair to test, the other to control. The math: minimum 5 matched pairs (10 states) for a typical DTC sample size; >20 states preferred. Cluster sampling at the state level (not city-level) avoids cross-contamination from city-level demographics.
How long does a geo-lift test need to run?
Minimum 14 days for typical DTC sample sizes. The bottleneck is the conversion rate, not the visitor count: at ~2% conversion you need roughly 19,600 visitors per cohort to detect a 20% relative lift at 95% confidence and 80% power, and far more for smaller lifts. Admaxxer's /incrementality page shows a real-time power analysis: it tells you the minimum days remaining for your specific spend × CTR × CVR profile to reach significance. Most DTC tests reach significance in 10-18 days; extend to 28 if your conversion rate is <1%.
What if my ad platform doesn't support geo-restricted spend?
Meta, Google, TikTok, Amazon, and Pinterest all support geographic targeting at the campaign or ad-set level. The setup is to clone your existing campaigns into two variants: a 'test-geo' campaign that targets only the test states and a 'control-geo' campaign that excludes them. Then pause the test-geo campaign for the 14-day window. Admaxxer's geo-lift setup wizard generates the exact platform-specific exclusion configuration so you don't have to compute it manually.
Can I run geo-lift for organic + email channels too?
Geo-lift is most valuable for paid channels because the spend is controllable. For organic (SEO, direct, referral), 'geo-lift' doesn't apply since you can't pause organic in Texas. But Admaxxer's /incrementality page lets you run a holdout test on email and SMS by suppressing campaigns to a randomized 20% audience holdout, then running the same z-test on conversion rate per audience member. The mathematical pattern is identical; the cohort partition is audience-based rather than geo-based.

Run this playbook in your own dashboard

Admaxxer ships the pixel + Meta CAPI + Google Enhanced Conversions + Maxxer AI agent + cohort analytics out of the box. The playbook above becomes a live surface in your account after a 5-minute setup.
