ClickHouse migration — admin operations runbook
The operational counterpart to /documentation/clickhouse-migration (which explains the migration to merchants). This page documents every action the Admaxxer admin team takes during the cutover — parity verification, flag flipping, rollback, backfill, performance-dashboard reads, escalation. It mirrors Triple Whale’s ops-runbook discipline (one separate doc for admin actions, distinct from the customer-facing explainer at triplewhale.readme.io) and Datafast’s pattern of publishing the runbook openly so SOC2 auditors and prospective customers can verify the operator’s discipline.
Access: This page renders “Access denied” on the FE for non-admin users. The SSR HTML is intentionally public-readable for AI crawlers — the actions documented here all require an authenticated admin session against the actual API endpoints, so publishing the runbook doesn’t leak privileged access. The pattern matches Honeycomb’s incident playbook and Stripe’s public ops references.
Current cutover status (2026-05-17)
| Surface | State | Owner |
|---|---|---|
| Dual-write on pageviews | Live (Phase 1, 2026-05-16) | data-platform |
| Dual-write on summary_kpis | Live (Phase 2B, 2026-05-17) | data-platform |
| Read flag flipped for canary workspace VITATREE USA | Live (Phase 2C, 2026-05-17) | data-platform |
| GL#500 meta-inflation parity fix | Shipped | data-platform |
| CH performance optimization pass | Shipped — median ~1.2 ms, p95 ~2.1 ms (vs Tinybird ~32 ms / ~58 ms) | data-platform |
| 90-day backfill for VITATREE USA | Complete | data-platform |
| 14-day burn-in clock | Running, day 0 of 14 — ends 2026-05-31 if clean | data-platform |
| Phase 2D — attribution_breakdown CH module + per-workspace flag | Shipped (Phase 2D, 2026-05-17) | data-platform |
| Phase 2D — Adset → Ad drilldown depth fix | Shipped (Phase 2D, 2026-05-17) | data-platform |
| scripts/clickhouse-verify-attribution-breakdown-parity.ts | Shipped — channel/campaign/adset/ad coverage | data-platform |
| attribution_breakdown dual-write burn-in | Starts after summary_kpis clears; runs its own 14-day parallel clock | data-platform |
| Broader Cohort 1 cutover | Eligible 2026-06-01 if burn-in stays clean | TBD |
| Wave 3 — source_medium_breakdown CH module + flag + parity script | Shipped (Wave 3, 2026-05-17) | data-platform |
| Wave 3 — p_summary_series CH module + flag + parity script | Shipped (Wave 3, 2026-05-17) | data-platform |
| Wave 3 — dual-write live for VITATREE USA | Pending — kicks off after summary_kpis burn-in clearance | data-platform |
| Wave 3 — sparkline visual-parity verification (p_summary_series) | Wired into the burn-in step in the parity script | data-platform |
1. Check parity status for a workspace
Two paths — an in-app view inside /admin/clickhouse-flags and a CLI script that produces a per-day diff report. Use the in-app view for a quick read; the CLI for a deep dive.
1a. In-app parity check
- Sign in with your admin account at https://admaxxer.com/admin.
- Open /admin/clickhouse-flags.
- Find the target workspace (search by name, ID, or owner email).
- Click the Parity button in the rightmost column.
- The modal renders the latest scheduled parity-check result: status (pass/minor/fail), drift %, the sample-diff table for the top-5 most-drifted days, and the last-run timestamp.
If the parity-check endpoint hasn’t finished landing yet on a fresh deploy, the modal renders a copyable CLI snippet as a fallback. Click Copy, paste into your terminal, run.
1b. CLI parity check
SSH into the Coolify host and run the verification script directly. The script lives at scripts/clickhouse-verify-summary-kpis-parity.ts and ships per pipe; for new pipes, copy the script and swap the pipe name + canonical column list.
ssh coolify
cd /var/lib/docker/volumes/admaxxer-app-source/_data
sudo docker exec -it admaxxer-app npx tsx scripts/clickhouse-verify-summary-kpis-parity.ts \
--workspace=<workspace-id> --pipe=summary_kpis --days=14
Output is a per-day diff report: workspace + day + column + Tinybird value + ClickHouse value + drift %. A pipe is cutover-eligible only when every day in every workspace agrees to within ±1%.
1c. Reading the diff
- Status pass: drift < 1% on every non-zero column. The day counts toward the 14-day burn-in.
- Status minor: 1-3% drift on a single column. Investigate before broadening cutover. Common cause: clock skew on rows landing in the 0.3 ms gap between Tinybird and CH ingestion.
- Status fail: > 3% drift, or any drift on a column with zero values on one side. Stop. The burn-in clock resets to day 0 for this workspace. Diagnose before resuming.
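For reference, a minimal sketch of how these thresholds could map to a per-day status. The ColumnDiff shape, the function names, and the handling of multiple 1-3% columns are illustrative assumptions — the real logic lives in scripts/clickhouse-verify-summary-kpis-parity.ts — but the cutoffs mirror the list above.
// Illustrative types; not the script's actual shapes.
type ParityStatus = 'pass' | 'minor' | 'fail';
interface ColumnDiff {
  column: string;
  tinybird: number;
  clickhouse: number;
}
function driftPct(d: ColumnDiff): number {
  if (d.tinybird === 0 && d.clickhouse === 0) return 0;
  // Any drift on a column that is zero on one side is treated as a hard fail below.
  if (d.tinybird === 0 || d.clickhouse === 0) return Infinity;
  return (Math.abs(d.clickhouse - d.tinybird) / Math.abs(d.tinybird)) * 100;
}
function classifyDay(diffs: ColumnDiff[]): ParityStatus {
  const drifted = diffs
    .map((d) => ({ column: d.column, pct: driftPct(d) }))
    .filter((d) => d.pct >= 1);
  if (drifted.length === 0) return 'pass';            // drift < 1% on every column
  if (drifted.some((d) => d.pct > 3)) return 'fail';  // > 3%, or zero-on-one-side (Infinity)
  if (drifted.length === 1) return 'minor';           // 1-3% drift on a single column
  return 'fail';                                      // assumption: multiple drifted columns escalate to fail
}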
2. Flip a workspace flag ON
Flag-on routes a workspace’s reads through ClickHouse for a single pipe. Default OFF — Tinybird stays canonical until the explicit flip. Reversible by clicking the same toggle.
2a. Pre-flight checklist
- Confirm parity for the workspace is pass on every day in the last 14 days (see §1).
- Confirm the workspace is in Cohort 1 (USD DTC, ≥90 days of data history). If not, defer.
- Confirm scripts/clickhouse-verify-summary-kpis-parity.ts exit code is 0.
- Confirm the workspace owner is reachable via support so we can revert quickly if a tile looks off post-flip.
2b. Flip the flag
- Navigate to /admin/clickhouse-flags.
- Find the target workspace.
- Toggle the summary_kpis (CH) switch ON.
- The toast confirms “ClickHouse reads enabled.” Pin the toast as a debugging breadcrumb.
2c. 30-minute post-flip watch
For 30 minutes after the flip:
- Tail the workspace owner’s SSE chat session log on the prod container.
- Open the workspace dashboard in a separate browser tab as the owner’s admin proxy.
- Check every tile’s data-source badge — tiles backed by summary_kpis should show the sky-blue ClickHouse chip.
- If any tile renders an unexpected zero or an unexpectedly high number, immediately flip the flag back OFF (see §3) and re-run parity.
2d. Flip attribution_breakdown ON for a workspace (Phase 2D)
Phase 2D (2026-05-17) ships the per-workspace read flag for attribution_breakdown — the pipe behind the /marketing-acquisition Channel Attribution Drilldown. Same shape as the summary_kpis flag (§2), with three attribution-specific gotchas the operator needs to know before flipping.
2d·1. Pre-flight checklist (attribution-specific)
- Parity passes on ALL FOUR breakdown levels (channel, campaign, adset, ad) over the last 14 days — not just channel. The drilldown lazy-loads each level on chevron click, and a divergence at the ad level only surfaces when a user actually drills.
- Confirm summary_kpis burn-in is still clean for the workspace. If summary_kpis is currently drifting, do NOT layer a second pipe’s flag-on on top — the operator needs a clean baseline to attribute future drift to the right pipe.
- Confirm Agent A’s shadow tables (ad_spend_daily_shadow, visitor_payments_shadow) are dual-writing on the workspace. The parity script reports this on its first iteration.
- Confirm the workspace currency is USD. Non-USD attribution is gated to Phase 2E until the FX MV chain ships on CH — the parity script will emit currency_pending rather than ok.
2d·2. CLI parity sweep across all four levels
Run the Phase 2D parity script. It iterates channel → campaign → adset → ad by default; restrict it to one level with --level.
ssh -L 18123:192.168.100.3:8123 admaxxer@5.78.179.73 -N -f
CLICKHOUSE_URL='http://admaxxer:<pw>@localhost:18123/admaxxer' \
TINYBIRD_ADMIN_TOKEN='<token>' \
TINYBIRD_HOST='https://api.us-east.aws.tinybird.co' \
DATABASE_URL='postgresql://...' \
npx tsx scripts/clickhouse-verify-attribution-breakdown-parity.ts \
--workspace <workspace-id> \
--since 2026-05-03 --until 2026-05-17 \
--strict
The script emits one human-readable section per (workspace × level) and a JSON summary line at the end. Exit code 2 under --strict means at least one (workspace × level) had drift > 1% — cutover is blocked until the diff is investigated. Expected output for VITATREE USA at channel level: 2 platform rows (meta, google), every per-row per-column drift status = ok.
2d·3. Flip the flag in the admin UI
- Navigate to /admin/clickhouse-flags.
- Find the target workspace.
- Toggle the attribution_breakdown (CH) switch ON. The toast confirms “ClickHouse reads enabled for attribution_breakdown.”
- The workspace owner’s next GET /api/v1/analytics/attribution request routes through the CH path. Coalesced cache invalidates automatically.
2d·4. 30-minute post-flip watch (attribution-specific)
- Open /marketing-acquisition in a separate browser tab as the owner’s admin proxy. Check that the Channel Attribution Drilldown renders rows for both Meta and Google at the channel level.
- Click the chevron on the Meta row to expand into campaigns. Click into an adset, then an ad. Each level should lazy-load within ~1.5 s.
- Watch the data-source badge on each tile — sky-blue ClickHouse chip indicates CH-sourced. Absence indicates Tinybird (acceptable during Phase 2D since not every tile is migrated yet).
- If any per-row metric shifts more than ±1% from the pre-flip screenshot, flip the flag back OFF immediately (see §3) and re-run the parity script with --verbose.
2d·5. Rollback procedure
Same three-tier model as §3. The attribution_breakdown flag is a separate Postgres column from summary_kpis, so each pipe can be rolled back independently — a single workspace can be on CH for summary_kpis while staying on Tinybird for attribution_breakdown. Surgical rollback is per-pipe per-workspace; broader rollback escalates to the Redis kill-switch or container-level env unset per §3a/3b/3c.
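For orientation, a sketch of the per-workspace flag row this independence relies on — one boolean column per pipe, flipped individually. The table and column names below are assumptions for illustration, not the actual application-Postgres schema.
// Hypothetical shape of a row behind /admin/clickhouse-flags.
interface ClickhouseFlagsRow {
  workspace_id: string;
  summary_kpis_ch: boolean;            // §2  — Phase 2C canary flag
  attribution_breakdown_ch: boolean;   // §2d — Phase 2D flag
  source_medium_breakdown_ch: boolean; // §7  — Wave 3
  p_summary_series_ch: boolean;        // §7  — Wave 3
  updated_at: string;
}
// Each column flips independently, which is what makes per-pipe rollback surgical.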
3. Roll back a workspace to Tinybird
Three roll-back layers from fastest to slowest:
3a. Per-workspace flag flip (instant)
- Navigate to /admin/clickhouse-flags.
- Find the target workspace.
- Toggle the summary_kpis (CH) switch OFF.
- Next dashboard reload reads from Tinybird. SSE caches invalidate automatically.
Recovery time: under one second (the toggle is a Postgres column flip via POST /api/v1/admin/clickhouse-flags). No deploy required.
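For completeness, a hedged sketch of the same flip expressed as an API call rather than a UI toggle. The endpoint path is the one named above; the request-body shape and the X-Admin-Token header are assumptions borrowed from the §3b example, not a documented contract.
// Hypothetical helper — endpoint path from this runbook, body shape assumed.
async function setWorkspaceFlag(workspaceId: string, pipe: string, enabled: boolean): Promise<void> {
  const res = await fetch('http://localhost:3000/api/v1/admin/clickhouse-flags', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Admin-Token': process.env.ADMIN_TOKEN ?? '',
    },
    body: JSON.stringify({ workspace: workspaceId, pipe, enabled }),
  });
  if (!res.ok) throw new Error(`flag flip failed: ${res.status}`);
}
// Rollback for one workspace is the same call with enabled set to false:
// await setWorkspaceFlag('<workspace-id>', 'summary_kpis', false);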
3b. Whole-pipe roll-back (slower)
If a pipe is misbehaving across multiple workspaces, revert the read-handler routing for that pipe to Tinybird:
ssh coolify
sudo docker exec admaxxer-app curl -s -X POST \
http://localhost:3000/api/v1/admin/clickhouse-flags/route-handler-rollback \
-H 'Content-Type: application/json' \
-H 'X-Admin-Token: '"$ADMIN_TOKEN" \
-d '{"pipe":"summary_kpis"}'
This sets a kill-switch flag in Redis that the read-handler checks on every request; if set, every read for that pipe falls through to Tinybird regardless of the per-workspace flag. Recovery time: under 5 seconds.
3c. Container-level roll-back (slowest)
If admaxxer-ch is unhealthy and we want every read on every pipe to fall back to Tinybird without ceremony:
ssh coolify
# Coolify env update — set CLICKHOUSE_URL='' to disable the CH path entirely.
sudo docker exec admaxxer-coolify-postgres psql -U coolify -d coolify -c "
UPDATE environment_variables
SET value = '${ENCRYPTED_EMPTY}'
WHERE key = 'CLICKHOUSE_URL' AND application_id = 'admaxxer-prod';
"
# Force redeploy with the new env.
sudo docker restart admaxxer-app
Recovery time: ~3 minutes (Coolify reloads env + Express restarts). Tinybird continues to receive every dual-write throughout, so no data is lost.
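Taken together, the three layers compose into a single routing decision on every read. A minimal sketch of that precedence, assuming illustrative helper names and a hypothetical Redis key format — the real read handler may differ.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Stand-in for the Postgres lookup behind /admin/clickhouse-flags (§3a).
async function isWorkspaceFlagOn(workspaceId: string, pipe: string): Promise<boolean> {
  // e.g. SELECT the per-pipe flag column FROM the flags table WHERE workspace_id = ...
  return false; // placeholder
}

async function resolveWarehouse(workspaceId: string, pipe: string): Promise<'clickhouse' | 'tinybird'> {
  // §3c — container level: an empty CLICKHOUSE_URL disables the CH path entirely.
  if (!process.env.CLICKHOUSE_URL) return 'tinybird';

  // §3b — pipe-wide kill-switch in Redis overrides every per-workspace flag.
  if (await redis.get(`ch-killswitch:${pipe}`)) return 'tinybird';

  // §3a — per-workspace flag: the Postgres column flipped in /admin/clickhouse-flags.
  return (await isWorkspaceFlagOn(workspaceId, pipe)) ? 'clickhouse' : 'tinybird';
}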
4. Trigger a per-window backfill
Use this when (a) a workspace newly enters Cohort 1 and we need to seed CH with 90 days of history before flipping the flag on, or (b) a parity-fail traces to missing rows on the CH side for a specific window.
4a. Pageviews + revenue_events + orders backfill
ssh coolify
sudo docker exec -it admaxxer-app npx tsx scripts/clickhouse-backfill.ts \
--workspace=<workspace-id> \
--datasource=pageviews,revenue_events,orders \
--from=2026-02-17 \
--to=2026-05-17 \
--batch-size=10000 \
--dry-run=false
The script reads from the canonical source-of-truth (Tinybird for the existing window, the application Postgres for orders/revenue_events) and writes to CH. Idempotent via ReplacingMergeTree ordering on (workspace_id, event_id) — safe to re-run.
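A minimal sketch of the CH write path that idempotency relies on, using the official @clickhouse/client package. The table name and row shape are placeholders; the real source reads and batching live in scripts/clickhouse-backfill.ts.
import { createClient } from '@clickhouse/client';

const ch = createClient({ url: process.env.CLICKHOUSE_URL ?? 'http://localhost:8123' });

// Each row carries the (workspace_id, event_id) pair the table is ORDERed by.
// Re-running the backfill inserts duplicate versions of the same logical row;
// ReplacingMergeTree keeps one copy per ORDER BY key at merge time, so repeats are safe.
async function writeBatch(table: string, rows: Array<Record<string, unknown>>): Promise<void> {
  await ch.insert({
    table,               // e.g. 'pageviews'
    values: rows,
    format: 'JSONEachRow',
  });
}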
4b. summary_kpis materialized view rebuild
If the source datasources are correctly populated but summary_kpis on CH is missing rows, force the materialized view to rebuild for the window:
ssh coolify
sudo docker exec admaxxer-ch clickhouse-client --user admaxxer --password <pw> --database admaxxer --query "
SYSTEM REFRESH VIEW summary_kpis_mv;
"
4c. Verify after backfill
Re-run the parity script (see §1b). Drift % should drop to under 1% on every backfilled day.
5. Read the CH performance dashboard + system.query_log
ClickHouse exposes a built-in query log we mine for slow-query alerts. Three lookups, in order of decreasing frequency:
5a. The 30 slowest queries in the last hour
ssh coolify
sudo docker exec admaxxer-ch clickhouse-client --user admaxxer --password <pw> --database admaxxer --query "
SELECT
query_start_time,
query_duration_ms,
read_bytes,
memory_usage,
user,
query
FROM system.query_log
WHERE event_time > now() - INTERVAL 1 HOUR
AND type = 'QueryFinish'
ORDER BY query_duration_ms DESC
LIMIT 30
FORMAT Vertical;
"
Target: every query under 50 ms. Any query over 500 ms is a P2 ticket. Any query over 5 seconds is a P1 page.
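A small sketch of the severity mapping a slow-query alert could apply to each finished query pulled from system.query_log. The thresholds mirror the targets above; the alerting job itself is assumed, not shown.
// Severity mapping for finished queries from system.query_log.
type Severity = 'ok' | 'p2-ticket' | 'p1-page';

function classifyQueryDuration(durationMs: number): Severity {
  if (durationMs > 5_000) return 'p1-page'; // over 5 seconds: page the on-call
  if (durationMs > 500) return 'p2-ticket'; // over 500 ms: file a P2
  return 'ok';                              // target is every query under 50 ms
}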
5b. Per-pipe latency percentiles for the last 24 hours
SELECT
query_log.query_kind,
count() AS query_count,
quantile(0.5)(query_duration_ms) AS p50_ms,
quantile(0.95)(query_duration_ms) AS p95_ms,
quantile(0.99)(query_duration_ms) AS p99_ms,
max(query_duration_ms) AS max_ms
FROM system.query_log
WHERE event_time > now() - INTERVAL 24 HOUR
AND type = 'QueryFinish'
GROUP BY query_kind
ORDER BY p99_ms DESC;
5c. Memory + disk pressure
SELECT metric, formatReadableSize(value) AS value FROM system.asynchronous_metrics
WHERE metric IN ('MemoryResident', 'DiskAvailable_default', 'FilesystemMainPathAvailableBytes');
Resident memory should stay under 1.6 GB (we cap container memory at 2 GB). Disk available should stay above 50 GB on the dedicated CH partition.
6. Escalation matrix
Who pages whom, when. PagerDuty rotations are configured in our internal admin runbook; this page documents the structural contract so an on-call engineer (and any future hire) knows the call tree without needing tribal knowledge.
| Symptom | Severity | Primary on-call | Secondary on-call | SLA |
|---|---|---|---|---|
| admaxxer-ch container is down (HTTP 8123 returns connection-refused on docker-internal probe) | P1 (page) | data-platform on-call | infra on-call | 15 min ack, 1 hour resolve (fallback to Tinybird via §3c is the acceptable temporary fix) |
| Parity drift > 5% on any workspace-day for an in-flight cutover pipe | P1 (page) | data-platform on-call | analytics-product on-call | 30 min ack, 4 hour resolve (rollback per §3a is acceptable while investigating) |
| Parity drift 1-5% on any workspace-day | P2 (ticket) | data-platform on-call | — | 4 hour ack, 24 hour resolve |
| CH query latency p95 > 50 ms sustained for 1 hour | P2 (ticket) | data-platform on-call | — | 4 hour ack, 24 hour resolve |
| CH query latency p99 > 500 ms sustained for 1 hour | P1 (page) | data-platform on-call | infra on-call | 30 min ack, 4 hour resolve |
| CH memory residency > 1.8 GB sustained for 30 min | P2 (ticket) | infra on-call | data-platform on-call | 1 hour ack, 8 hour resolve |
| CH disk free < 25 GB | P1 (page) | infra on-call | data-platform on-call | 30 min ack, 2 hour resolve (purge old partitions / expand disk) |
| Daily 03:20 UTC CH backup cron failed | P2 (ticket) | infra on-call | — | 4 hour ack, 24 hour resolve |
Roles map to people via PagerDuty — the schedule rotates weekly. During business hours, day-shift covers; outside business hours, the on-call rotation is two-deep with auto-escalation to the secondary after 15 min no-ack on a P1.
7. Wave 3 ops — source_medium_breakdown + p_summary_series
Two new toggles on /admin/clickhouse-flags: source_medium_breakdown (CH) and p_summary_series (CH). They behave identically to the existing summary_kpis (CH) and attribution_breakdown (CH) toggles — same dual-write, same per-workspace 14-day ±1% gate, same three-layer rollback model — with one Wave 3-specific wrinkle on p_summary_series (visual parity, §7d below).
7a. Verify Wave 3 parity for a workspace
Two parity scripts, one per pipe. Each follows the same command shape as the Phase 2B/2D scripts:
ssh coolify
# source_medium_breakdown — per-(source, medium) row drift
sudo docker exec -it admaxxer-app npx tsx scripts/clickhouse-verify-source-medium-breakdown-parity.ts \
--workspace=<workspace-id> --days=14 --strict
# p_summary_series — per-day per-column drift, plus the visual-parity step
sudo docker exec -it admaxxer-app npx tsx scripts/clickhouse-verify-summary-series-parity.ts \
--workspace=<workspace-id> --days=14 --strict
Exit code 0 = clean. Exit code 2 = drift exceeds ±1% on any row/column OR the p_summary_series series produces a different up/down/flat shape against Tinybird on any adjacent-day pair. Both scripts emit a per-row diff table to stdout for the data-platform team to triage.
7b. Flip Wave 3 flags ON
- Confirm summary_kpis (CH) burn-in has cleared for this workspace (Wave 3 pipes consume the same upstream aggregates).
- Confirm Wave 3 parity is pass on every day in the last 14 days (see §7a, both scripts).
- Navigate to /admin/clickhouse-flags, find the workspace.
- Toggle source_medium_breakdown (CH) ON. Pin the toast as a breadcrumb. Open /marketing-acquisition as the admin proxy and verify the (source, medium) table renders all rows with the ClickHouse badge in the corner. Spot-check that paid rows show CPC/CPM/ROAS correctly joined against ad_spend_daily.
- Toggle p_summary_series (CH) ON. Open /dashboard and verify every summary tile’s inline sparkline renders cleanly with the ClickHouse badge — check the trend direction on each tile against the Tinybird-rendered version (open the workspace in incognito + roll back briefly if needed for visual A/B).
- 30-minute post-flip watch — same protocol as §2c.
7c. Roll back Wave 3 flags
Same three-layer model as §3, applied independently per toggle. The flags are independent state — flipping one OFF does not affect the other.
- Per-workspace flag flip (instant): toggle source_medium_breakdown (CH) or p_summary_series (CH) OFF in /admin/clickhouse-flags. Reads route back to Tinybird on the next dashboard reload.
- Pipe-wide kill-switch (5 seconds): POST to /api/v1/admin/clickhouse-flags/route-handler-rollback with {"pipe":"source_medium_breakdown"} or {"pipe":"p_summary_series"}. Sets a Redis kill-switch that diverts every read for that pipe to Tinybird regardless of per-workspace state.
- Container-level rollback (3 minutes): set CLICKHOUSE_URL='' in Coolify env and restart. Every pipe falls back to Tinybird at once. Use only for whole-warehouse outages.
7d. Sparkline-shape visual parity (p_summary_series only)
p_summary_series drives a visually sensitive surface — the sparkline shape on each summary tile is the trend signal operators use to decide whether to dig into a number. The verification script enforces, on top of the standard ±1% per-day per-column drift budget:
- Direction-of-change parity. For every adjacent-day pair in the 14-day window, the CH series and the Tinybird series must agree on the up/flat/down direction (using a 1% tolerance band so floating-point noise doesn’t flip a flat day). Disagreement on any pair fails the workspace-day.
- Min/max bucket parity. Across the 14-day window, the day-of-min and day-of-max on each side must be the same day (or one of two adjacent days, allowing a one-day shift if values are extremely close). Disagreement fails the workspace-day.
A series of (10, 11, 10, 12, 11, …) on CH and (11, 10, 11, 11, 12, …) on Tinybird passes the ±1% gate but fails both visual-parity checks — that’s the failure mode we’re guarding against.
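A sketch of the two shape checks as they could be expressed in code. Function names are illustrative — the real implementation lives in scripts/clickhouse-verify-summary-series-parity.ts — and the "values extremely close" qualifier on the one-day min/max shift is omitted here for brevity.
type Direction = 'up' | 'flat' | 'down';

// 1% tolerance band so floating-point noise doesn't flip a flat day.
function direction(prev: number, next: number, tolerance = 0.01): Direction {
  if (prev === 0) return next === 0 ? 'flat' : next > 0 ? 'up' : 'down';
  const change = (next - prev) / Math.abs(prev);
  if (Math.abs(change) <= tolerance) return 'flat';
  return change > 0 ? 'up' : 'down';
}

// Check 1 — every adjacent-day pair must agree on up/flat/down across both warehouses.
function directionParity(ch: number[], tb: number[]): boolean {
  for (let i = 1; i < Math.min(ch.length, tb.length); i++) {
    if (direction(ch[i - 1], ch[i]) !== direction(tb[i - 1], tb[i])) return false;
  }
  return true;
}

// Check 2 — day-of-min and day-of-max must land on the same day or an adjacent one.
function extremaParity(ch: number[], tb: number[]): boolean {
  const argMin = (xs: number[]) => xs.indexOf(Math.min(...xs));
  const argMax = (xs: number[]) => xs.indexOf(Math.max(...xs));
  return Math.abs(argMin(ch) - argMin(tb)) <= 1 && Math.abs(argMax(ch) - argMax(tb)) <= 1;
}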
7e. Backfill the upstream datasources for Wave 3
Both Wave 3 pipes consume the same upstream datasources as Phase 2C/2D. If a Wave 3 parity-fail traces to missing rows on the CH side, run the standard backfill (§4a) for the affected datasources + window, then re-run the parity scripts. The Wave 3 pipes do not need a dedicated CH-table backfill — they read from the same upstream MVs that already exist post-2A.
Runbook FAQ
Why is the runbook publicly readable in HTML?
SOC2 + customer-trust signal. The actions documented all require authenticated admin sessions against the API; publishing the runbook openly doesn’t leak access, but does show prospective customers + AI auditors that we’re operationally disciplined. The FE component still gates non-admin users with an “Access denied” render — the SSR HTML is the AI-crawler surface, not the merchant entry point. Pattern matches Stripe and Honeycomb.
What if a workspace was flipped ON but the owner files a support ticket saying numbers look wrong?
Default: flip back to Tinybird first (see §3a), investigate after. The flip is a one-second operation; the time-cost of investigating with the owner staring at “wrong” numbers is much higher. Once flipped back, run parity verification on the affected window and surface the diff to the data-platform team.
Can I cut over a Cohort 2 workspace early if it has clean parity?
Yes, with sign-off from the data-platform on-call. Cohort 2 (install-day or early-stage USD DTC) lacks the runway of overlapping data that Cohort 1 has, so parity is structurally less stable, but if 14 consecutive days of pass are in the log, the gate is met.
What happens if Tinybird itself goes down during the dual-write window?
Two cases. (1) If we’ve already flipped workspaces over and Tinybird is just our fallback, no customer impact. (2) If we haven’t flipped a given workspace and Tinybird is its canonical, the dashboard will render the Tinybird outage as “data refreshing” and serve stale-tier cache for up to 1 hour. If Tinybird outage extends past the cache window, we can elevate any Cohort 1 workspace immediately by flipping its flag ON (parity has been verified — the gate just defers cutover under non-emergency conditions).
How do I propose a new pipe for migration?
Open a PR with (a) the CH-side CREATE TABLE + materialized-view DDL, (b) the *_ch.ts route handler, (c) the scripts/clickhouse-verify-<pipe>-parity.ts verification script, and (d) a 7-day dual-write canary. Once the canary is green for 7 days, file a phase-bump request to add the pipe to the per-workspace flag schema.
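A bare-bones skeleton of item (c), sketched under assumptions: the Tinybird Pipes API parameters and the CH table/column names are placeholders, and the per-row drift comparison plus exit-code handling (exit 0 clean, exit 2 on drift under --strict) are left to the helpers the existing scripts/clickhouse-verify-*.ts files already share. Copy an existing parity script rather than starting from this.
import { createClient } from '@clickhouse/client';

const ch = createClient({ url: process.env.CLICKHOUSE_URL ?? 'http://localhost:8123' });

// Tinybird side: the Pipes API returns { data: [...] } for a published pipe.
// Parameter names (workspace_id, date) are placeholders for whatever the pipe accepts.
async function tinybirdRows(pipe: string, workspaceId: string, day: string) {
  const url =
    `${process.env.TINYBIRD_HOST}/v0/pipes/${pipe}.json` +
    `?token=${process.env.TINYBIRD_ADMIN_TOKEN}&workspace_id=${workspaceId}&date=${day}`;
  const res = await fetch(url);
  return (await res.json()).data as Array<Record<string, number>>;
}

// ClickHouse side: table and column names are placeholders for the pipe's CH module.
async function clickhouseRows(pipe: string, workspaceId: string, day: string) {
  const result = await ch.query({
    query: `SELECT * FROM ${pipe} WHERE workspace_id = {ws:String} AND day = {day:Date}`,
    query_params: { ws: workspaceId, day },
    format: 'JSONEachRow',
  });
  return (await result.json()) as Array<Record<string, number>>;
}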
What logs do I capture before paging?
Before paging the data-platform on-call: (a) last 30 minutes of system.query_log on admaxxer-ch, (b) Coolify deploy log for the last redeploy if relevant, (c) the parity-check output for the affected workspace + window, (d) screenshots of the dashboard tile rendering incorrectly. Attach to the PagerDuty incident on creation.
Should I flip Wave 3 flags (source_medium_breakdown / p_summary_series) before summary_kpis clears burn-in?
No. Wave 3 pipes consume the same upstream aggregates as summary_kpis — if summary_kpis is drifting, the Wave 3 pipes will drift in correlated ways. Wait for the workspace’s summary_kpis (CH) toggle to clear its 14-day burn-in cleanly, then run both Wave 3 parity scripts against the workspace, then flip Wave 3 flags. There’s no benefit to flipping Wave 3 early — the user-facing surfaces (Sources & Attribution table, summary-tile sparklines) load fast on Tinybird already; the win is correctness, not speed.
Why does p_summary_series have a visual-parity step on top of ±1% drift?
Sparklines are a glance-metric. A 14-day series of (10, 11, 10, 12, 11, …) on CH and (11, 10, 11, 11, 12, …) on Tinybird passes a naive per-day ±1% gate but renders visibly different sparklines — same averages, different shape. Because the sparkline shape is what operators use to decide whether to dig into a number, we layer two additional checks inside the burn-in for this pipe: (1) every adjacent-day pair must agree on up/flat/down direction across both warehouses (with a 1% tolerance band so floating-point noise doesn’t flip a flat day), (2) the day-of-min and day-of-max across the 14-day window must be the same day (or one of two adjacent days). Both checks live inside scripts/clickhouse-verify-summary-series-parity.ts and gate cutover-eligibility.
Are Wave 3 flags independent of each other?
Yes. source_medium_breakdown (CH) and p_summary_series (CH) are independent Postgres column flips per workspace, just like summary_kpis (CH) and attribution_breakdown (CH). Flipping one OFF does not affect the other. The three-layer rollback (per-workspace / pipe-wide kill-switch / container-level CLICKHOUSE_URL unset) applies independently per toggle.
Related references
Customer-facing migration explainer · Tinybird auth model · How data works (end-to-end pipeline) · Performance architecture (4-cause taxonomy) · Live changelog — per-pipe cutover state · Documentation home
If you’re an admin and the runbook doesn’t cover your situation, file a ticket against the data-platform queue and tag the on-call engineer for the next runbook iteration.