Production runbooks
Use these runbooks for production or production-like daemon instances. They assume:
- a dedicated state_root
- a dedicated workspace_root
- a checked-in routes file
- bearer auth for every non-loopback control-plane exposure
- route credentials stored through auth_ref, not inline route-file keys
Threat Model
Protect these assets first:
- provider API keys, OAuth refresh tokens, connector tokens, MCP credentials, and broker leases
- control-plane admin/read-only bearer tokens
- KHEISH_AUTH_STORE_MASTER_KEY or KHEISH_AUTH_STORE_MASTER_KEY_FILE
- state-root journals, debug artifacts, assets, deliveries, external-action audit records, and audit-signing.key
- workspace files and shell-task output
Treat these as trust boundaries:
- the TLS proxy or ingress in front of the daemon
- the daemon HTTP control plane
- provider, connector, MCP, hook, and shell-tool outbound calls
- local state-root and workspace-root filesystem access
- browser clients allowed through CORS
Plan for these threats:
- leaked control-plane bearer tokens
- leaked provider or connector credentials through logs, debug artifacts, outputs, or backups
- over-broad CORS exposure from a browser origin
- reverse-proxy buffering or rewriting of SSE streams
- SSRF, DNS rebinding, or private-network access through web, hook, MCP, or connector paths
- stale long-running shell tasks, approvals, questions, delivery retries, or debug-full capture windows
- partial backup/restore, secret rotation, or route-file drift during incidents
Baseline controls:
- bind the daemon to loopback or a private service network unless the network boundary is explicitly hardened
- keep bearer auth enabled outside loopback and rotate token files atomically
- store provider credentials with auth_ref and revoke broker subjects/slots during containment
- keep debug capture below full except in short, approved windows
- run doctor, status, route diagnostics, and the smoke scripts below after deploy, restore, and credential rotation
- verify backup restores against an isolated daemon before promotion
TLS and Reverse Proxy
Prefer this boundary:
- bind the daemon to loopback or a private service network
- terminate TLS at a dedicated reverse proxy or ingress controller
- keep daemon bearer auth enabled behind the proxy
- forward SSE streams without response buffering
- keep /healthz and /readyz reachable by the orchestrator
- set --http-auth-mode bearer
- use --http-admin-token-file and, if needed, --http-readonly-token-file
- keep browser clients same-origin through your gateway when possible; Kheish’s direct CORS allowlist is intentionally loopback-only
- disable proxy buffering for /v1/events/stream, /v1/sessions/*/stream, and /v1/runs/*/stream
- preserve request bodies for connector endpoints; do not strip connector signature headers
- set request and response size limits intentionally for your attachment, debug, and output-routing workloads
- set idle timeouts long enough for SSE
The checked-in Nginx fixture is deploy/reverse-proxy/nginx-kheish.conf. Use its config smoke (see Smoke Test) to validate that it still preserves bearer auth, disables SSE buffering, keeps CORS out of the proxy layer, and remains syntactically valid when local nginx is available.
http://localhost:5173.
Validate after deploy:
doctor fails closed on broken /readyz, malformed or buffered SSE, storage write-probe failures, missing state-root lock ownership, unsafe CORS preflight responses, route/provider readiness errors, and invalid hook definitions. It also checks configured HTTP hook hostnames against current DNS and warns or errors before a hook dispatch can surprise operators with private/local address resolution.
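The hook hostname check described above can be illustrated with a small sketch. This is not the daemon's implementation: the function names are hypothetical, and the set of address classes treated as internal is an assumption. It classifies resolved addresses with Python's standard ipaddress module so that a hook URL pointing at a private or local address is flagged before dispatch.

```python
import ipaddress
import socket


def is_internal_address(address: str) -> bool:
    """Return True for addresses an outbound hook should normally not reach."""
    # Drop an IPv6 zone suffix such as "%eth0" before parsing.
    ip = ipaddress.ip_address(address.split("%", 1)[0])
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )


def resolve_addresses(hostname: str) -> set:
    """Resolve a hostname to the set of addresses DNS currently returns."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return {info[4][0] for info in infos}


def internal_addresses(hostname: str) -> set:
    """Hypothetical helper: the resolved addresses that look private/local."""
    return {a for a in resolve_addresses(hostname) if is_internal_address(a)}
```

A resolve-time check like this only reflects current DNS. Guarding against DNS rebinding also requires connecting to the address that was checked rather than resolving the name again.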
Backup
Back up the full daemon state root as one unit. At minimum it contains:
- session and run journals
- approvals, questions, task state, and scheduler state
- channel, project, persona, learning, skill, asset, derivation, and observation state
- delivery queues and delivery ledgers
- debug artifacts that have not yet expired
- daemon-managed connector state
- auth/global-slots.json
- broker revocation and lease history under the auth area
- audit-signing.key
The state-root archive does not cover --workspace-root. If workspace files are part of your recovery objective, back up the workspace root as a separate artifact and verify checksums for the specific files or directories your workflows depend on. The smoke tests below verify this as a separate archive with a workspace marker checksum.
Keep these deployment artifacts with, or recoverable alongside, the backup:
- the routes file used at startup
- file-backed connector, hook, MCP, and skill-root config
- the exact KHEISH_AUTH_STORE_MASTER_KEY or KHEISH_AUTH_STORE_MASTER_KEY_FILE value
- control-plane token files
- the daemon binary version or container image digest
| Objective | Production target |
|---|---|
| State-root RPO | last successful atomic snapshot or stopped-daemon archive |
| Workspace RPO | last workspace archive/checksum set for workflow-critical paths |
| RTO | restore to an isolated daemon, pass validation, then promote traffic |
| Drill cadence | at least monthly, and after daemon, schema, auth-store, connector, or route-file changes |
| Integrity | record SHA-256 for every archive and preserve it outside the archive |
| Confidentiality | encrypt archives with the production backup system or a reviewed envelope key workflow |
| Ownership | restore with the same service user/group and restrictive token/master-key file modes |
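The Integrity row above can be sketched as follows. This is a minimal illustration, not a script from the repository: the function names and the .sha256 sidecar layout are assumptions. It streams each archive through SHA-256, writes the digest to a directory outside the archive, and verifies it again before a restore.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_digest(archive: Path, digest_dir: Path) -> Path:
    """Write '<hex digest>  <archive name>' to a sidecar file outside the archive."""
    digest_dir.mkdir(parents=True, exist_ok=True)
    record = digest_dir / (archive.name + ".sha256")
    record.write_text(f"{sha256_file(archive)}  {archive.name}\n", encoding="utf-8")
    return record


def verify_digest(archive: Path, record: Path) -> bool:
    """Recompute the archive digest and compare it with the recorded value."""
    expected = record.read_text(encoding="utf-8").split()[0]
    return hmac.compare_digest(expected, sha256_file(archive))
```

Keeping digest_dir on separate storage from the archive means a corrupted or tampered archive cannot also rewrite its own checksum.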
Restore
Restore into an isolated daemon first, not directly onto the production listener.
- Restore the state-root archive to a new directory.
- Provide the same auth-store master key used by the original state root.
- Provide the original routes file and deployment config.
- Start on a loopback bind and a fresh workspace root.
- Run the validation commands below.
- Promote traffic only after validation passes.
Key Rotation
Control-plane bearer token files are hot-reloaded when file metadata changes. Rotate them with an atomic write and keep admin/read-only values distinct during the whole operation.
Rotate a route credential by setting a new secret under the same auth_ref (see secrets set <auth_ref> in the rotation matrix). This revokes active broker leases for that slot and preserves route-file references.
Do not rotate KHEISH_AUTH_STORE_MASTER_KEY by simply changing the file. Existing encrypted auth-store records become unreadable. Treat that key as a root encryption key: recover it from your secret manager during restore, and plan any future re-encryption as a separate migration with a verified backup.
Preserve audit-signing.key across backup and restore if you need continuity of the signed external-action audit ledger.
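The atomic bearer-token replacement described above can be sketched like this. It is an illustration under stated assumptions, not the project's tooling: rotate_token_file and its distinct_from parameter are hypothetical names, the token is written without a trailing newline, and the code assumes a POSIX filesystem. The new value is written to a temporary file in the same directory and renamed over the old file, so a reader sees either the old token or the new one, never a partial write.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def rotate_token_file(
    token_path: Path, new_token: str, distinct_from: Optional[Path] = None
) -> None:
    """Atomically replace token_path with new_token (hypothetical helper)."""
    # Keep admin and read-only values distinct during the whole operation.
    if distinct_from is not None and distinct_from.exists():
        if distinct_from.read_text(encoding="utf-8").strip() == new_token:
            raise ValueError("admin and read-only tokens must stay distinct")

    # The temp file must live in the same directory: a rename is only
    # atomic within a single filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=token_path.parent, prefix=".token-", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            os.fchmod(handle.fileno(), 0o600)  # restrictive mode before writing
            handle.write(new_token)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, token_path)  # atomic swap; file metadata changes
    except BaseException:
        # Remove the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The sketch does not set file ownership. If the rotation runs as a different user than the daemon, restore the service user/group on the new file as part of the same procedure.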
Rotation matrix:
| Secret class | Rotate | Validate | Rollback/containment |
|---|---|---|---|
| Admin bearer token | Atomic replace --http-admin-token-file | old token fails, new token can run status and doctor | restore previous token file only if it was not compromised |
| Read-only bearer token | Atomic replace --http-readonly-token-file | read-only token can read status; mutations remain forbidden | restore previous token only if it was not compromised |
| Route API key | secrets set <auth_ref> --provider ... | doctor routes --check-auth, provider readiness, optional route canary | runtime auth revoke-slot <auth_ref> and set a known-good key |
| Connector shared token | rotate the connector shared_token.secret_ref or backing secret | old webhook signature rejected, new signature accepted | revoke connector subject/slot and replace connector config |
| Connector credential slot | secrets set <slot> --provider generic plus sidecar restart/replacement | broker lease changes and token digests stay hidden | runtime auth revoke-slot <slot> |
| MCP bearer secret | update the MCP secret ref through the daemon secret store | run_mcp_local_true_binary.sh and targeted MCP tool call | revoke slot and disable MCP server config |
| MCP OAuth account | refresh/logout through mcp auth/mcp oauth workflow | account status/refresh/logout protocol smoke | revoke OAuth slot and remove server access |
| Capture/provisioning token | rotate capture token lease in the provisioning surface | old token rejected, heartbeat resumes with new lease | revoke lease and deprovision the stale agent |
| Auth-store master key | do not hot-rotate in place | restore drill with original key | recover original key; plan re-encryption as a migration |
| Audit signing key | preserve for continuity; rotate only with an explicit audit-boundary decision | signed records verify before/after restore | archive old public key/id and document the rollover time |
Incident Response
Incident ownership:
- incident commander: decides containment scope and promotion/rollback
- operator: runs daemon commands and captures evidence
- security owner: owns credential revocation and external provider/connector actions
- comms owner: writes timeline, stakeholder updates, and postmortem notes
Switch the permission mode with runtime set-permission-mode dont-ask: it blocks tools that would require approval instead of leaving them pending during containment. Turn debug capture off before collecting new evidence unless the incident handler explicitly needs a fresh isolated debug bundle.
Evidence collection:
- status
- doctor
- doctor routes --check-auth
- doctor routes --check-references
- runs list
- runs get <run_id>
- sessions events <session_id>
- tasks list <session_id>
- deliveries list --run-id <run_id> when output routing is involved
- runs external-actions <run_id> when a signed external boundary is involved
- <state_root>/control-plane-auth/audit.jsonl
Targeted containment:
- interrupt suspicious sessions with sessions interrupt <session_id>
- cancel suspicious runs with runs cancel <run_id>
- stop unsafe shell tasks with tasks stop <session_id> <task_id>
- delete or rotate compromised connectors through the connectors CLI
- disable risky hooks by replacing runtime hook config with a reviewed file
| Scenario | Contain | Evidence | Recover |
|---|---|---|---|
| Leaked admin token | rotate admin token file, verify old token fails | control-plane auth audit, doctor, status | review recent admin-only actions and issue new operator tokens |
| Leaked provider key | runtime auth revoke-slot <auth_ref>, set replacement key | doctor routes --check-auth, provider readiness, route canary | rotate provider-side key and rerun route smoke |
| State-root exfiltration | stop daemon, rotate every stored credential, preserve audit key copy | state-root file inventory, auth lease status, external-action audit | restore from known-good backup with original master key, then rotate secrets |
| Proxy/CORS exposure | remove public route or tighten proxy/CORS config | doctor --cors-origin, auth audit, proxy logs | run reverse-proxy config smoke and browser-origin smoke before reopening |
| debug_level=full left on | runtime set-debug-level off, prune/retain evidence per policy | runs debug, state-root debug artifact inventory | rotate any credentials exposed in captured provider/tool payloads |
| Stuck shell task | tasks list, tasks output, tasks stop | task output, run events, workspace diff | restart daemon only after task tree is gone or intentionally quarantined |
- write an immutable timeline with command timestamps and operator identity
- store evidence in restricted incident storage, not in public logs
- record every rotated secret and revocation result
- run restore and route/reference diagnostics before declaring recovery
- add a regression smoke or unit test for the incident path when feasible
SLO Signals
Use these starting thresholds for production-like deployments:
| Signal | JSON path or probe | Warning | Page |
|---|---|---|---|
| Readiness | GET /readyz | one non-2xx sample | two consecutive non-2xx samples within 60 seconds |
| Health | /v1/status .health.ok | false for one sample | false for two consecutive samples |
| Storage | /v1/status .storage.ok, .storage.probes[] | any warning diagnostic | false immediately |
| Provider readiness | /v1/status .provider_readiness.active_route_ready, .provider_readiness.routes[] | default route warning for 5 minutes | default route error for 5 minutes |
| Run queue lag | /v1/status .runs.queued, .runs.oldest_queued_run_age_ms, .health.warnings[] with queued_run_lag | older than 30 minutes behind active session work | older than 2 hours behind active session work |
| Stale active runs | /v1/status .runs.running, .runs.oldest_non_terminal_run_idle_ms, .health.warnings[] with stale-run codes | older than 2 hours without events | older than 6 hours without events |
| Pending approvals/questions | /v1/status .runs.waiting_for_approval, .runs.waiting_for_user_question | older than 4 hours | older than 24 hours |
| Delivery DLQ | /v1/status .delivery.unresolved_dead_lettered | any growth in staging | any production DLQ item or repeated growth |
| Auth/CORS rate limits | <state_root>/control-plane-auth/audit.jsonl codes auth_rate_limited and cors_origin_rate_limited | repeated code within 10 minutes | sustained repeats for 30 minutes |
| Debug capture | /v1/status .runtime.debug_level | full enabled outside a ticketed window | full enabled for more than 30 minutes |
| State-root disk | host filesystem metrics for --state-root | above 80 percent | above 90 percent or inode exhaustion |
When a signal fires, triage in this order: doctor, status, route diagnostics, then scoped run/session/task/delivery views.
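Several thresholds in the table above can be checked against a single /v1/status snapshot. The sketch below is illustrative only: evaluate_status and the finding names are hypothetical, the JSON paths are the ones listed in the table, and the exact value types the daemon returns are an assumption. Signals that depend on consecutive samples or on a duration (readiness, the page level for health, provider readiness, rate limits) need sample history and are left out.

```python
from typing import Any, Optional

MINUTE_MS = 60_000
HOUR_MS = 60 * MINUTE_MS


def lookup(doc: Any, dotted_path: str) -> Any:
    """Follow a dotted path such as 'runs.queued'; return None if absent."""
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def age_level(age_ms: Optional[float], warn_ms: int, page_ms: int) -> Optional[str]:
    """Map an age in milliseconds to 'warning', 'page', or None."""
    if age_ms is None:
        return None
    if age_ms > page_ms:
        return "page"
    if age_ms > warn_ms:
        return "warning"
    return None


def evaluate_status(status: dict) -> dict:
    """Return {signal: level} for the snapshot-checkable rows of the table."""
    findings = {}
    if lookup(status, "health.ok") is False:
        findings["health"] = "warning"  # paging needs two consecutive samples
    if lookup(status, "storage.ok") is False:
        findings["storage"] = "page"  # the table pages immediately on false
    lag = age_level(
        lookup(status, "runs.oldest_queued_run_age_ms"), 30 * MINUTE_MS, 2 * HOUR_MS
    )
    if lag:
        findings["run_queue_lag"] = lag
    stale = age_level(
        lookup(status, "runs.oldest_non_terminal_run_idle_ms"), 2 * HOUR_MS, 6 * HOUR_MS
    )
    if stale:
        findings["stale_active_runs"] = stale
    if (lookup(status, "delivery.unresolved_dead_lettered") or 0) > 0:
        findings["delivery_dlq"] = "page"  # production rule: any DLQ item pages
    if lookup(status, "runtime.debug_level") == "full":
        findings["debug_capture"] = "warning"  # page only after 30 minutes
    return findings
```

An empty result means none of the snapshot-checkable thresholds were crossed. It does not cover the duration-based rows.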
Use the SLO probe smoke as a CI-ready check for these documented paths. It runs against an auth_ref OpenAI route and verifies:
- /readyz
- /v1/status fields .health.ok, .storage.ok, .provider_readiness.active_route_ready, .provider_readiness.error_route_count, .runs.running, .runs.queued, .runs.waiting_for_approval, .runs.waiting_for_user_question, .delivery.unresolved_dead_lettered, and .runtime.debug_level
- loopback/auth control-plane posture
- clean route auth diagnostics
- raw OpenAI key non-leakage in the evidence directory
The no-backlog SLO smoke keeps queue counters at zero; mixed-state daemon status tests cover backlog age fields such as .runs.oldest_queued_run_age_ms and .runs.oldest_non_terminal_run_idle_ms. The probe keeps its local control-plane token files and auth-store master key outside the evidence directory.
Smoke Test
The checked-in Nginx fixture has its own config smoke. It checks for Authorization forwarding, Host/X-Forwarded-* forwarding, proxy_http_version 1.1, disabled proxy/request buffering, disabled proxy cache, long SSE timeouts, an explicit body-size limit, no wildcard proxy CORS, no disabled upstream TLS verification, and no inline bearer-token literal. If nginx and openssl are installed locally, it also rewrites the fixture to a temporary self-signed certificate path and runs nginx -t.
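A static check in the spirit of the list above might look like the sketch below. It is not the repository's config smoke: the function name, the regular expressions, and the assumption of one directive per line are illustrative choices. It does use real nginx directive names (proxy_http_version, proxy_buffering, proxy_request_buffering, proxy_cache, proxy_read_timeout, client_max_body_size, proxy_ssl_verify).

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# Directives the proxy config is expected to contain (one directive per line).
REQUIRED = {
    "proxy_http_version 1.1": r"^\s*proxy_http_version\s+1\.1\s*;",
    "proxy_buffering off": r"^\s*proxy_buffering\s+off\s*;",
    "proxy_request_buffering off": r"^\s*proxy_request_buffering\s+off\s*;",
    "proxy_cache off": r"^\s*proxy_cache\s+off\s*;",
    "Authorization forwarding": r"^\s*proxy_set_header\s+Authorization\s+\S+\s*;",
    "Host forwarding": r"^\s*proxy_set_header\s+Host\s+\S+\s*;",
    "proxy_read_timeout": r"^\s*proxy_read_timeout\s+\S+\s*;",
    "client_max_body_size": r"^\s*client_max_body_size\s+\S+\s*;",
}

# Patterns that should never appear in the proxy config.
FORBIDDEN = {
    "wildcard CORS header": r'add_header\s+["\']?Access-Control-Allow-Origin["\']?\s+["\']?\*',
    "disabled upstream TLS verification": r"^\s*proxy_ssl_verify\s+off\s*;",
    "inline bearer token": r"Bearer\s+[A-Za-z0-9._~+/=-]{8,}",
}


def check_nginx_config(text: str) -> list:
    """Return a list of problems found in an nginx config (empty means clean)."""
    body = re.sub(r"#.*", "", text)  # ignore comments
    problems = []
    for name, pattern in REQUIRED.items():
        if not re.search(pattern, body, FLAGS):
            problems.append(f"missing: {name}")
    for name, pattern in FORBIDDEN.items():
        if re.search(pattern, body, FLAGS):
            problems.append(f"forbidden: {name}")
    return problems
```

A text check like this only confirms that directives are present. It cannot confirm that a timeout is long enough for SSE or that the file parses, which is why running nginx -t against the real file remains the stronger test.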
The repository includes a provider-free smoke for the documented backup, restore, and route-secret rotation path. The smoke:
- checks status.health.ok, status.storage.ok, provider readiness, and a clean delivery DLQ
- validates runtime get, doctor, doctor routes --check-auth, and doctor routes --check-references
- rotates a route secret and asserts the slot-revocation response shape
- backs up the state root and workspace root as separate archives
- restores both into a second isolated daemon
- verifies the restored secret metadata, route diagnostics, audit-signing.key continuity, and workspace marker checksum
- greps smoke artifacts for raw route-secret leakage
When OPENAI_API_KEY or KHEISH_OPENAI_API_KEY is available, run the live smoke as well. The live smoke:
- uses an auth_ref OpenAI route and sends traffic through a local streaming proxy
- asserts health/storage/provider-readiness status paths
- validates doctor --cors-origin, doctor routes --check-auth, doctor routes --check-references, and doctor routes --canary
- creates a fresh session and completes a real run with OPS_RUNBOOK_LIVE_OK
- probes /v1/events/stream, /v1/sessions/<session_id>/stream, and /v1/runs/<run_id>/stream with replay cursors
- exercises incident commands including runtime set-permission-mode dont-ask, runtime set-debug-level off, runs list, runs get <run_id>, sessions events <session_id>, tasks list <session_id>, deliveries list --run-id <run_id>, runs external-actions <run_id>, sessions interrupt <session_id>, and runs cancel <run_id>
- verifies signed external-action records before and after restore
- rotates the admin bearer token file atomically, then verifies old-token failure and new-token success
- revokes the active broker subject, runs runtime auth revoke-slot openai.prod, and asserts the returned revocation shapes
- rotates a child-process connector credential_slots secret, then verifies the old connector lease is revoked and token digests stay hidden
- restarts the sidecar through connectors put-external ops-webhook and proves old webhook signatures are rejected and new signatures are accepted without creating an extra provider run
- explicitly runs runtime auth revoke-slot sidecars.webhook.routes
- backs up and restores the state and workspace roots after the real run, then verifies audit-signing.key continuity plus workspace marker checksum
- greps the smoke artifacts for raw OpenAI and connector secret leakage
Both ops smokes call scripts/e2e/verify_runbook_commands.py so stale or missing runbook command fragments fail before daemon validation.
The MCP local true-binary smoke (run_mcp_local_true_binary.sh) covers MCP bearer secret-store wiring and generic MCP OAuth login/status/refresh/logout against local protocol fixtures.
A separate TLS smoke generates a local certificate with openssl, terminates HTTPS in a local streaming proxy, validates the generated certificate as the trusted CA for all HTTPS probes, verifies unauthenticated /v1/status returns 401, submits a real OpenAI run through the HTTPS control plane, verifies status health/storage/provider readiness through TLS, probes global/session/run SSE replay over HTTPS including Last-Event-ID, and greps the evidence directory for raw OpenAI key leakage. This validates the daemon’s TLS-facing operator contract, but it does not replace a deployment-specific Nginx, Caddy, or ingress smoke for your production proxy configuration.
Treat smoke evidence directories as operational artifacts, not public logs. They should not contain raw provider keys, but some smokes keep local bearer-token files or auth-store master keys nearby while the daemon is running.