# 2026-05-12 — GitHub Actions minutes exhausted (CI freeze)

| Field | Value |
|---|---|
| **Severity** | high (build pipeline only — no production impact) |
| **Status** | resolved |
| **Opened** | 2026-05-12 09:42 UTC |
| **Mitigated** | 2026-05-12 10:18 UTC (CI re-enabled with manual quota top-up) |
| **Resolved** | 2026-05-12 14:00 UTC (sustained mitigation: auto-triggers disabled, plan upgraded) |
| **Components affected** | `build_pipeline` (GitHub Actions) |
| **Components NOT affected** | Gateway, Cloud, Directory, Webhooks, Pages, ClickHouse, billing webhook, email routing |
| **Customer impact** | None on the live runtime. Internal merge velocity stalled for ~4h. |

## Timeline (UTC)

- **09:42** — Multiple PR CI checks fail with "Recent jobs have exceeded the maximum allowed time for your account." First alert from `cf-status-bot`.
- **09:46** — On-call engineer (a) confirms the message is org-wide, (b) confirms production traffic is unaffected (Cloudflare serves Pages + Workers without GitHub involvement), (c) opens an incident channel.
- **09:51** — Root cause identified: a chain of workflow auto-triggers on `push` + a `schedule` cron that ran a heavy `apply migrations to a Postgres service` job for every merge fired off ~3,400 minutes since the start of the day.
- **10:02** — Org plan upgrade requested in GitHub Billing. ETA 15 min for the plan to take effect (org on Teams plan).
- **10:18** — Plan upgrade processed. CI runs unfreeze.
- **10:25** — All blocked PRs re-run; one merge happens.
- **10:30** — Emergency commit `a48d7e8 EMERGENCY: disable all CI auto-triggers (100% Actions minutes hit) [skip ci]` lands disabling auto-triggers on every workflow.
- **10:40** — Subsequent commits use `[skip ci]` markers where appropriate; manual `workflow_dispatch` is the path forward until a real budget allocation lands.
- **14:00** — All in-flight work resumed without CI in the loop. Patent-safety scan run locally on the affected branches.

## Root cause

Three compounding factors:

1. **No budget alert.** The org had no GitHub Actions cost-cap or 80/90/95% alert configured.
2. **High fan-out.** A recently added matrix job (apply schema migrations against Postgres for every push to every branch in the matrix) tripled minute consumption per merge.
3. **Schedule cron firing on every workflow file.** A `schedule` trigger added in the wave-21 wave was not gated to `main`-only, so it fired for every branch.

## Customer impact

**Zero** to the runtime. The KYE Gateway, Cloud, Directory, and Webhooks all serve from Cloudflare's edge with no GitHub-Actions dependency. The Status feed itself runs on a Pages Function. Build pipeline stoppage = no shipping but no downtime.

## What we did (mitigation)

- Topped up the Actions allowance (paid plan)
- Disabled all `push` and `schedule` auto-triggers across all workflows
- Switched to manual `workflow_dispatch` for the duration
- Wrote this postmortem

## What we are doing (prevention)

- [ ] Add an 80%/95% budget alert on Actions minutes (GitHub Billing → Spending limits)
- [ ] Gate the `schedule` trigger to `main` branch only
- [ ] Cap matrix expansion (was 9-wide; cap to 3-wide for branch builds)
- [ ] Add a self-hosted runner pool to absorb routine load — current spike was a 6h delta but cheap on the BYO-compute path
- [ ] Make the workflow runtime quota visible on this Status page under the `build_pipeline` component
- [ ] Quarterly chaos exercise: deliberately exhaust Actions minutes in the dev org, ensure runbook works

## Lessons

The most important thing this incident validated is the architectural choice to keep CI off the customer hot path. The Gateway never depends on a build job clearing — it serves from a long-running deployment artefact in Cloudflare. Every prior incident review reinforced this; this one confirmed it under real conditions.

The next-most-important thing: visibility. We had no alert; we found out via failed PR checks. The action items above prevent that.

## Related

- Commit `a48d7e8` — emergency CI auto-trigger disable
- [Status page](../index.html)
- KYE Protocol™ [Incident Response Runbook](https://github.com/KYE-Protocol/runbooks/blob/main/INCIDENT-RESPONSE.md) (internal)
