Production runbooks

Use these runbooks for production or production-like daemon instances. They assume:

a dedicated state_root
a dedicated workspace_root
a checked-in routes file
bearer auth for every non-loopback control-plane exposure
route credentials stored through auth_ref, not inline route-file keys

Threat Model

Protect these assets first:

provider API keys, OAuth refresh tokens, connector tokens, MCP credentials, and broker leases
control-plane admin/read-only bearer tokens
KHEISH_AUTH_STORE_MASTER_KEY or KHEISH_AUTH_STORE_MASTER_KEY_FILE
state-root journals, debug artifacts, assets, deliveries, external-action audit records, and audit-signing.key
workspace files and shell-task output

Primary trust boundaries:

the TLS proxy or ingress in front of the daemon
the daemon HTTP control plane
provider, connector, MCP, hook, and shell-tool outbound calls
local state-root and workspace-root filesystem access
browser clients allowed through CORS

Expected threats:

leaked control-plane bearer tokens
leaked provider or connector credentials through logs, debug artifacts, outputs, or backups
over-broad CORS exposure from a browser origin
reverse-proxy buffering or rewriting of SSE streams
SSRF, DNS rebinding, or private-network access through web, hook, MCP, or connector paths
stale long-running shell tasks, approvals, questions, delivery retries, or debug-full capture windows
partial backup/restore, secret rotation, or route-file drift during incidents

Required controls:

bind the daemon to loopback or a private service network unless the network boundary is explicitly hardened
keep bearer auth enabled outside loopback and rotate token files atomically
store provider credentials with auth_ref and revoke broker subjects/slots during containment
keep debug capture below full except in short, approved windows
run doctor, status, route diagnostics, and the smoke scripts below after deploy, restore, and credential rotation
verify backup restores against an isolated daemon before promotion

TLS and Reverse Proxy

Prefer this boundary:

bind the daemon to loopback or a private service network
terminate TLS at a dedicated reverse proxy or ingress controller
keep daemon bearer auth enabled behind the proxy
forward SSE streams without response buffering
keep /healthz and /readyz reachable by the orchestrator

Proxy checklist:

set --http-auth-mode bearer
use --http-admin-token-file and, if needed, --http-readonly-token-file
keep browser clients same-origin through your gateway when possible; Kheish’s direct CORS allowlist is intentionally loopback-only
disable proxy buffering for /v1/events/stream, /v1/sessions/*/stream, and /v1/runs/*/stream
preserve request bodies for connector endpoints; do not strip connector signature headers
set request and response size limits intentionally for your attachment, debug, and output-routing workloads
set idle timeouts long enough for SSE

The repo-owned Nginx fixture lives at deploy/reverse-proxy/nginx-kheish.conf. Validate that it still preserves bearer auth, disables SSE buffering, keeps CORS out of the proxy layer, and remains syntactically valid when local nginx is available:

bash scripts/e2e/ops_reverse_proxy_config_smoke.sh

Minimal Nginx shape:

upstream kheish_daemon {
    server 127.0.0.1:4000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name kheish.example.com;

    ssl_certificate /etc/letsencrypt/live/kheish.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/kheish.example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    client_max_body_size 32m;

    location / {
        proxy_pass http://kheish_daemon;
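        proxy_http_version 1.1;
        # Clear Connection so the upstream keepalive pool above is actually reused.
        proxy_set_header Connection "";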

        proxy_set_header Authorization $http_authorization;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass_request_headers on;

        proxy_read_timeout 1h;
        proxy_send_timeout 1h;
        proxy_buffering off;
        proxy_request_buffering off;
        proxy_cache off;
        add_header X-Accel-Buffering no always;
    }
}

For a browser UI, prefer serving the UI from the same origin as the proxy and forwarding API calls server-side. If you intentionally call the daemon control plane directly from a browser, the browser must use an exact loopback origin accepted by Kheish’s CORS policy, such as http://localhost:5173.

Validate after deploy:

./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor

./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor --cors-origin https://kheish.example.com

./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-auth

./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-references

doctor fails closed on broken /readyz, malformed or buffered SSE, storage write-probe failures, missing state-root lock ownership, unsafe CORS preflight responses, route/provider readiness errors, and invalid hook definitions. It also checks configured HTTP hook hostnames against current DNS and warns or errors before a hook dispatch can surprise operators with private/local address resolution.
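To make the private/local address concern concrete, here is a minimal sketch of the kind of check an operator could run on an already-resolved IPv4 address before trusting a hook hostname. It is not how doctor is implemented. The function name is hypothetical, and IPv6 and hostname resolution are deliberately out of scope.

```shell
# Sketch only: classify a dotted-quad IPv4 address as private/local.
# is_private_ipv4 is a hypothetical helper, not part of Kheish.
# Exit status: 0 = private/loopback/link-local, 1 = other, 2 = not a valid IPv4 literal.
is_private_ipv4() (
  # The subshell body keeps IFS and positional-parameter changes local.
  case $1 in ''|.*|*.) exit 2 ;; esac   # empty, leading dot, or trailing dot
  set -f                                # no globbing while splitting
  IFS=.
  set -- $1                             # split into octets
  [ "$#" -eq 4 ] || exit 2
  for octet in "$@"; do
    case $octet in
      ''|*[!0-9]*|0?*|????*) exit 2 ;;  # empty, non-digit, leading zero, too long
    esac
    [ "$octet" -le 255 ] || exit 2
  done
  case $1 in
    10|127) exit 0 ;;                                     # 10.0.0.0/8 and loopback
    172) [ "$2" -ge 16 ] && [ "$2" -le 31 ] && exit 0 ;;  # 172.16.0.0/12
    192) [ "$2" -eq 168 ] && exit 0 ;;                    # 192.168.0.0/16
    169) [ "$2" -eq 254 ] && exit 0 ;;                    # 169.254.0.0/16 link-local
  esac
  exit 1
)
```

A caller would resolve the hook hostname first and treat exit status 0 as a reason to stop and review the hook definition.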

Backup

Back up the full daemon state root as one unit. At minimum it contains:

session and run journals
approvals, questions, task state, and scheduler state
channel, project, persona, learning, skill, asset, derivation, and observation state
delivery queues and delivery ledgers
debug artifacts that have not yet expired
daemon-managed connector state
auth/global-slots.json
broker revocation and lease history under the auth area
audit-signing.key

This state-root backup does not automatically include arbitrary files under --workspace-root. If workspace files are part of your recovery objective, back up the workspace root as a separate artifact and verify checksums for the specific files or directories your workflows depend on. The smoke tests below verify this as a separate archive with a workspace marker checksum.

Keep these deployment artifacts with, or recoverable alongside, the backup:

the routes file used at startup
file-backed connector, hook, MCP, and skill-root config
the exact KHEISH_AUTH_STORE_MASTER_KEY or KHEISH_AUTH_STORE_MASTER_KEY_FILE value
control-plane token files
the daemon binary version or container image digest

For the cleanest filesystem backup, stop or drain the daemon first. If your platform provides atomic volume snapshots, snapshot the entire state root rather than individual files.

Set explicit recovery objectives before relying on the archive:

Objective	Production target
State-root RPO	last successful atomic snapshot or stopped-daemon archive
Workspace RPO	last workspace archive/checksum set for workflow-critical paths
RTO	restore to an isolated daemon, pass validation, then promote traffic
Drill cadence	at least monthly, and after daemon, schema, auth-store, connector, or route-file changes
Integrity	record SHA-256 for every archive and preserve it outside the archive
Confidentiality	encrypt archives with the production backup system or a reviewed envelope key workflow
Ownership	restore with the same service user/group and restrictive token/master-key file modes

Example tar backup after stopping the daemon:

STAMP=$(date -u +%Y%m%dT%H%M%SZ)  # -u keeps the trailing Z (UTC) accurate; one stamp for both archives
tar -C /var/lib/kheish/state -czf "kheish-state-$STAMP.tar.gz" .
tar -C /var/lib/kheish/workspace -czf "kheish-workspace-$STAMP.tar.gz" .
sha256sum "kheish-state-$STAMP.tar.gz" "kheish-workspace-$STAMP.tar.gz" > kheish-backup.sha256
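The integrity objective (record SHA-256 for every archive and keep it outside the archive) can be wrapped in a small helper. This is a sketch under assumptions: the function name, the manifest naming, and the output-directory layout are illustrative, and it expects the daemon to be stopped or drained as described above.

```shell
# Sketch only: archive one directory and append its SHA-256 to a manifest
# stored beside the archive, never inside it. backup_dir is a hypothetical
# helper; it prints the archive path on success.
backup_dir() (
  src=$1 out_dir=$2 name=$3
  stamp=$(date -u +%Y%m%dT%H%M%SZ)        # one UTC timestamp per archive
  archive="$name-$stamp.tar.gz"
  tar -C "$src" -czf "$out_dir/$archive" . || exit 1
  # Record the checksum relative to out_dir so `sha256sum -c` works from there.
  ( cd "$out_dir" && sha256sum "$archive" >> "$name.sha256" ) || exit 1
  printf '%s\n' "$out_dir/$archive"
)
```

Keep the output directory outside the directory being archived so that the archive does not try to include itself.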

Restore

Restore into an isolated daemon first, not directly onto the production listener.

Restore the state-root archive to a new directory.
Provide the same auth-store master key used by the original state root.
Provide the original routes file and deployment config.
Start on a loopback bind with a fresh workspace root, restoring the workspace archive into it if workspace files are part of your recovery objective.
Run the validation commands below.
Promote traffic only after validation passes.

Example:

mkdir -p /var/lib/kheish/restore-state /var/lib/kheish/restore-workspace
tar -C /var/lib/kheish/restore-state -xzf kheish-state-20260503T120000Z.tar.gz
tar -C /var/lib/kheish/restore-workspace -xzf kheish-workspace-20260503T120000Z.tar.gz

KHEISH_AUTH_STORE_MASTER_KEY_FILE=/run/secrets/kheish-auth-store-master-key \
./target/debug/kheish-daemon serve \
  --bind 127.0.0.1:4010 \
  --state-root /var/lib/kheish/restore-state \
  --workspace-root /var/lib/kheish/restore-workspace \
  --routes-file /etc/kheish/routes.toml \
  --http-auth-mode bearer \
  --http-admin-token-file /run/secrets/kheish-admin-token

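Then verify archive integrity, file ownership, and daemon health. The checksum, ownership, and file-mode commands do not need the daemon and are best run before it starts: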

sha256sum -c kheish-backup.sha256
chown -R kheish:kheish /var/lib/kheish/restore-state /var/lib/kheish/restore-workspace
chmod 600 /run/secrets/kheish-auth-store-master-key /run/secrets/kheish-admin-token

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  status

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  runtime get

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --routes-file /etc/kheish/routes.toml --check-auth

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-references
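Checking the manifest before extraction keeps a corrupted or tampered archive out of the restore directory. The following is a minimal sketch, assuming the manifest was produced by sha256sum with archive names relative to the archive's own directory. The function name is hypothetical.

```shell
# Sketch only: refuse to extract an archive unless its manifest entry verifies.
# verify_then_extract is a hypothetical helper, not a Kheish command.
verify_then_extract() (
  archive=$1 manifest=$2 dest=$3
  dir=$(dirname "$archive")
  base=$(basename "$archive")
  # Check only this archive's manifest line; an absent line also fails,
  # because sha256sum -c rejects empty input.
  grep -F -- "  $base" "$manifest" | ( cd "$dir" && sha256sum -c - >/dev/null ) || {
    echo "checksum mismatch or missing manifest entry: $base" >&2
    exit 1
  }
  mkdir -p "$dest" && tar -C "$dest" -xzf "$archive"
)
```

The destination directory is only created after the checksum passes, so a failed check leaves nothing to clean up.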

Key Rotation

Control-plane bearer token files are hot-reloaded when file metadata changes. Rotate them with an atomic write and keep admin/read-only values distinct during the whole operation:

install -m 600 /run/secrets/new-kheish-admin-token /run/secrets/kheish-admin-token.next
mv /run/secrets/kheish-admin-token.next /run/secrets/kheish-admin-token

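Then verify old-token failure and new-token success. The command below reads the rotated file, so it checks the new token; repeat it against a saved copy of the old token and expect it to be rejected: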

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  status
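The install-then-mv pattern above can be wrapped with two guards: refuse an empty or unchanged token, and refuse a value that matches the other role's token, so that admin and read-only values stay distinct. This is a sketch under assumptions; the function name and the optional third argument are hypothetical.

```shell
# Sketch only: stage the new token next to the live file, then rename over it.
# A rename within one filesystem is atomic, so readers see either the old or
# the new token, never a partial write. rotate_token_file is hypothetical.
# Usage: rotate_token_file NEW_TOKEN_FILE LIVE_TOKEN_FILE [OTHER_ROLE_TOKEN_FILE]
rotate_token_file() (
  new_src=$1 live=$2 other=${3:-}
  [ -s "$new_src" ] || { echo "new token file is empty or missing" >&2; exit 1; }
  if [ -f "$live" ] && cmp -s "$new_src" "$live"; then
    echo "new token is identical to the current token" >&2
    exit 1
  fi
  if [ -n "$other" ] && [ -f "$other" ] && cmp -s "$new_src" "$other"; then
    echo "new token matches the other role's token" >&2
    exit 1
  fi
  # Staging file lives in the same directory as the live file.
  install -m 600 "$new_src" "$live.next" && mv -f "$live.next" "$live"
)
```

For example, `rotate_token_file /run/secrets/new-kheish-admin-token /run/secrets/kheish-admin-token` mirrors the two commands shown earlier in this section.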

Rotate route API-key slots by overwriting the same auth_ref. This revokes active broker leases for that slot and preserves route-file references:

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  secrets set openai.prod \
    --provider openai \
    --from-env OPENAI_API_KEY

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-slot openai.prod

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-auth

Rotate connector child-process credential slots the same way: overwrite the generic secret slot, force the sidecar to restart or be replaced so it fetches the new material, then verify the broker lease changed and revoke the slot explicitly during incident containment:

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  secrets set sidecars.webhook.routes \
    --provider generic \
    --value "$WEBHOOK_ROUTES_JSON"

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  connectors put-external ops-webhook \
    --json '{"platform":"webhook","mode":"child_process","base_url":"http://127.0.0.1:<port>","shared_token":{"secret_ref":"sidecars.webhook.shared"},"child_process":{"command":"python3","args":["connectors/python/run_connector.py","webhook"],"credential_slots":{"WEBHOOK_ROUTES_JSON":"sidecars.webhook.routes"}}}'

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth subject connector:ops-webhook

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth lease <lease_id>

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-slot sidecars.webhook.routes

For MCP bearer secrets and MCP OAuth account refresh/logout behavior, run the local protocol smoke after any auth-broker or MCP runtime change:

bash scripts/e2e/run_mcp_local_true_binary.sh

Do not rotate KHEISH_AUTH_STORE_MASTER_KEY by simply changing the file. Existing encrypted auth-store records become unreadable. Treat that key as a root encryption key: recover it from your secret manager during restore, and plan any future re-encryption as a separate migration with a verified backup. Preserve audit-signing.key across backup and restore if you need continuity of the signed external-action audit ledger.

Rotation matrix:

Secret class	Rotate	Validate	Rollback/containment
Admin bearer token	Atomic replace `--http-admin-token-file`	old token fails, new token can run `status` and `doctor`	restore previous token file only if it was not compromised
Read-only bearer token	Atomic replace `--http-readonly-token-file`	read-only token can read `status`; mutations remain forbidden	restore previous token only if it was not compromised
Route API key	`secrets set <auth_ref> --provider ...`	`doctor routes --check-auth`, provider readiness, optional route canary	`runtime auth revoke-slot <auth_ref>` and set a known-good key
Connector shared token	rotate the connector `shared_token.secret_ref` or backing secret	old webhook signature rejected, new signature accepted	revoke connector subject/slot and replace connector config
Connector credential slot	`secrets set <slot> --provider generic` plus sidecar restart/replacement	broker lease changes and token digests stay hidden	`runtime auth revoke-slot <slot>`
MCP bearer secret	update the MCP secret ref through the daemon secret store	`run_mcp_local_true_binary.sh` and targeted MCP tool call	revoke slot and disable MCP server config
MCP OAuth account	refresh/logout through `mcp auth`/`mcp oauth` workflow	account status/refresh/logout protocol smoke	revoke OAuth slot and remove server access
Capture/provisioning token	rotate capture token lease in the provisioning surface	old token rejected, heartbeat resumes with new lease	revoke lease and deprovision the stale agent
Auth-store master key	do not hot-rotate in place	restore drill with original key	recover original key; plan re-encryption as a migration
Audit signing key	preserve for continuity; rotate only with an explicit audit-boundary decision	signed records verify before/after restore	archive old public key/id and document the rollover time

Incident Response

Incident ownership:

incident commander: decides containment scope and promotion/rollback
operator: runs daemon commands and captures evidence
security owner: owns credential revocation and external provider/connector actions
comms owner: writes timeline, stakeholder updates, and postmortem notes

Initial containment:

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime set-permission-mode dont-ask

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime set-debug-level off

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  doctor

dont-ask blocks tools that would require approval instead of leaving them pending during containment. Turn debug capture off before collecting new evidence unless the incident handler explicitly needs a fresh isolated debug bundle.

Evidence collection:

status
doctor
doctor routes --check-auth
doctor routes --check-references
runs list
runs get <run_id>
sessions events <session_id>
tasks list <session_id>
deliveries list --run-id <run_id> when output routing is involved
runs external-actions <run_id> when a signed external boundary is involved
<state_root>/control-plane-auth/audit.jsonl

Credential containment:

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-subject session:compromised-session

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-slot openai.prod

Operational containment:

interrupt suspicious sessions with sessions interrupt <session_id>
cancel suspicious runs with runs cancel <run_id>
stop unsafe shell tasks with tasks stop <session_id> <task_id>
delete or rotate compromised connectors through the connectors CLI
disable risky hooks by replacing runtime hook config with a reviewed file

Scenario checklists:

Scenario	Contain	Evidence	Recover
Leaked admin token	rotate admin token file, verify old token fails	control-plane auth audit, `doctor`, `status`	review recent admin-only actions and issue new operator tokens
Leaked provider key	`runtime auth revoke-slot <auth_ref>`, set replacement key	`doctor routes --check-auth`, provider readiness, route canary	rotate provider-side key and rerun route smoke
State-root exfiltration	stop daemon, rotate every stored credential, preserve audit key copy	state-root file inventory, auth lease status, external-action audit	restore from known-good backup with original master key, then rotate secrets
Proxy/CORS exposure	remove public route or tighten proxy/CORS config	`doctor --cors-origin`, auth audit, proxy logs	run reverse-proxy config smoke and browser-origin smoke before reopening
`debug_level=full` left on	`runtime set-debug-level off`, prune/retain evidence per policy	`runs debug`, state-root debug artifact inventory	rotate any credentials exposed in captured provider/tool payloads
Stuck shell task	`tasks list`, `tasks output`, `tasks stop`	task output, run events, workspace diff	restart daemon only after task tree is gone or intentionally quarantined

Post-incident requirements:

write an immutable timeline with command timestamps and operator identity
store evidence in restricted incident storage, not in public logs
record every rotated secret and revocation result
run restore and route/reference diagnostics before declaring recovery
add a regression smoke or unit test for the incident path when feasible
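The evidence-collection commands listed earlier in this section can be captured into a restricted directory with a small wrapper. This is a sketch under assumptions: the function name and file naming are hypothetical. The caller passes the full daemon command line, so the wrapper only appends subcommands that already appear in this runbook and does not cover the commands that need a run or session ID.

```shell
# Sketch only: run read-only evidence commands and keep each output in its own
# file under an owner-only directory. collect_evidence is a hypothetical helper.
# Usage: collect_evidence OUT_DIR DAEMON_COMMAND [ARGS...]
collect_evidence() (
  out_dir=$1; shift
  [ "$#" -ge 1 ] || { echo "missing daemon command" >&2; exit 2; }
  umask 077                               # new files and directories are owner-only
  mkdir -p "$out_dir" || exit 1
  failed=0
  for sub in \
    "status" \
    "doctor" \
    "doctor routes --check-auth" \
    "doctor routes --check-references" \
    "runs list"
  do
    # File name: anything that is not a lowercase letter becomes one underscore.
    name=$(printf '%s' "$sub" | tr -cs 'a-z' '_')
    # $sub is intentionally unquoted so it splits into separate arguments.
    if ! "$@" $sub > "$out_dir/$name.txt" 2>&1; then
      echo "$sub" >> "$out_dir/FAILED"
      failed=1
    fi
  done
  exit "$failed"
)
```

A call might look like `collect_evidence /path/to/incident-evidence ./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 --token-file /run/secrets/kheish-admin-token`, where the evidence path is a placeholder for your restricted incident storage.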

SLO Signals

Use these starting thresholds for production-like deployments:

Signal	JSON path or probe	Warning	Page
Readiness	`GET /readyz`	one non-2xx sample	two consecutive non-2xx samples within 60 seconds
Health	`/v1/status` `.health.ok`	`false` for one sample	`false` for two consecutive samples
Storage	`/v1/status` `.storage.ok`, `.storage.probes[]`	any warning diagnostic	`false` immediately
Provider readiness	`/v1/status` `.provider_readiness.active_route_ready`, `.provider_readiness.routes[]`	default route warning for 5 minutes	default route error for 5 minutes
Run queue lag	`/v1/status` `.runs.queued`, `.runs.oldest_queued_run_age_ms`, `.health.warnings[]` with `queued_run_lag`	older than 30 minutes behind active session work	older than 2 hours behind active session work
Stale active runs	`/v1/status` `.runs.running`, `.runs.oldest_non_terminal_run_idle_ms`, `.health.warnings[]` with stale-run codes	older than 2 hours without events	older than 6 hours without events
Pending approvals/questions	`/v1/status` `.runs.waiting_for_approval`, `.runs.waiting_for_user_question`	older than 4 hours	older than 24 hours
Delivery DLQ	`/v1/status` `.delivery.unresolved_dead_lettered`	any growth in staging	any production DLQ item or repeated growth
Auth/CORS rate limits	`<state_root>/control-plane-auth/audit.jsonl` codes `auth_rate_limited` and `cors_origin_rate_limited`	repeated code within 10 minutes	sustained repeats for 30 minutes
Debug capture	`/v1/status` `.runtime.debug_level`	`full` enabled outside a ticketed window	`full` enabled for more than 30 minutes
State-root disk	host filesystem metrics for `--state-root`	above 80 percent	above 90 percent or inode exhaustion
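As an illustration of how two rows of this table translate into alerting logic, here is a minimal sketch. The function names are hypothetical. The readiness check assumes the two samples it receives were taken within the 60-second window. The remaining rows would read the listed JSON paths from /v1/status and are not covered here.

```shell
# Sketch only: threshold logic for the State-root disk and Readiness rows.
# These helpers are hypothetical and not part of Kheish.

# State-root disk: warning above 80 percent, page above 90 percent.
classify_disk_pct() (
  pct=$1
  case $pct in
    ''|*[!0-9]*) echo "expected an integer percentage" >&2; exit 2 ;;
  esac
  if [ "$pct" -gt 90 ]; then echo page
  elif [ "$pct" -gt 80 ]; then echo warning
  else echo ok
  fi
)

# True when the argument is a three-digit 2xx HTTP status code.
is_2xx() { case $1 in 2[0-9][0-9]) return 0 ;; *) return 1 ;; esac; }

# Readiness: one non-2xx sample warns; two consecutive non-2xx samples page.
# Arguments: previous status code, current status code.
classify_readiness() (
  prev=$1 cur=$2
  if is_2xx "$cur"; then echo ok
  elif is_2xx "$prev"; then echo warning
  else echo page
  fi
)
```

The disk percentage could come from the capacity column of `df -P` for the state-root path, and the status codes from two successive probes of /readyz.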

The recommended first response is always evidence-first: doctor, status, route diagnostics, then scoped run/session/task/delivery views.

Use the SLO probe smoke as a CI-ready check for these documented paths:

KHEISH_OPS_SLO_MODEL=gpt-5.4 bash scripts/e2e/ops_slo_probe_smoke.sh

It starts a bearer-protected daemon with an auth_ref OpenAI route and verifies /readyz, /v1/status .health.ok, .storage.ok, .provider_readiness.active_route_ready, .provider_readiness.error_route_count, .runs.running, .runs.queued, .runs.waiting_for_approval, .runs.waiting_for_user_question, .delivery.unresolved_dead_lettered, .runtime.debug_level, loopback/auth control-plane posture, clean route auth diagnostics, and raw OpenAI key non-leakage in the evidence directory. The no-backlog SLO smoke keeps queue counters at zero; mixed-state daemon status tests cover backlog age fields such as .runs.oldest_queued_run_age_ms and .runs.oldest_non_terminal_run_idle_ms. The probe keeps its local control-plane token files and auth-store master key outside the evidence directory.

Smoke Test

The checked-in Nginx fixture has its own config smoke:

bash scripts/e2e/ops_reverse_proxy_config_smoke.sh

It statically verifies the documented reverse-proxy contract: TLS certificate/key settings, TLS protocol floor, loopback daemon upstream, Authorization forwarding, Host/X-Forwarded-* forwarding, proxy_http_version 1.1, disabled proxy/request buffering, disabled proxy cache, long SSE timeouts, explicit body-size limit, no wildcard proxy CORS, no disabled upstream TLS verification, and no inline bearer-token literal. If nginx and openssl are installed locally, it also rewrites the fixture to a temporary self-signed certificate path and runs nginx -t.

The repository includes a provider-free smoke for the documented backup, restore, and route-secret rotation path:

bash scripts/e2e/ops_backup_restore_smoke.sh

It creates a fresh state root and workspace root, keeps local token/master-key files outside the evidence directory, and starts a bearer-protected daemon on loopback. Against that daemon it asserts status.health.ok, status.storage.ok, provider readiness, and a clean delivery DLQ, then validates runtime get, doctor, doctor routes --check-auth, and doctor routes --check-references. It rotates a route secret and asserts the slot-revocation response shape. Finally, it backs up the state root and workspace root as separate archives, restores both into a second isolated daemon, verifies the restored secret metadata, route diagnostics, audit-signing.key continuity, and workspace marker checksum, and greps smoke artifacts for raw route-secret leakage.

When OPENAI_API_KEY or KHEISH_OPENAI_API_KEY is available, run the live smoke as well:

KHEISH_OPS_LIVE_MODEL=gpt-5.4 bash scripts/e2e/ops_runbook_live_smoke.sh

The live smoke starts a bearer-protected daemon with an auth_ref OpenAI route, sends traffic through a local streaming proxy, and then:

asserts health/storage/provider-readiness status paths
validates doctor --cors-origin, doctor routes --check-auth, doctor routes --check-references, and doctor routes --canary
creates a fresh session and completes a real run with OPS_RUNBOOK_LIVE_OK
probes /v1/events/stream, /v1/sessions/<session_id>/stream, and /v1/runs/<run_id>/stream with replay cursors
exercises incident commands including runtime set-permission-mode dont-ask, runtime set-debug-level off, runs list, runs get <run_id>, sessions events <session_id>, tasks list <session_id>, deliveries list --run-id <run_id>, runs external-actions <run_id>, sessions interrupt <session_id>, and runs cancel <run_id>
verifies signed external-action records before and after restore
rotates the admin bearer token file atomically and verifies old-token failure and new-token success
revokes the active broker subject, runs runtime auth revoke-slot openai.prod, and asserts the returned revocation shapes
rotates a child-process connector credential_slots secret and verifies that the old connector lease is revoked and token digests stay hidden
restarts the sidecar through connectors put-external ops-webhook and proves that old webhook signatures are rejected and new signatures are accepted without creating an extra provider run
explicitly runs runtime auth revoke-slot sidecars.webhook.routes
backs up and restores the state and workspace roots after the real run, then verifies audit-signing.key continuity plus workspace marker checksum
greps the smoke artifacts for raw OpenAI and connector secret leakage

Both ops smokes call scripts/e2e/verify_runbook_commands.py so stale or missing runbook command fragments fail before daemon validation.

The MCP local true-binary smoke covers MCP bearer secret-store wiring and generic MCP OAuth login/status/refresh/logout against local protocol fixtures:

bash scripts/e2e/run_mcp_local_true_binary.sh

To validate TLS termination and SSE over HTTPS with a self-signed local certificate, run:

KHEISH_OPS_TLS_LIVE_MODEL=gpt-5.4 bash scripts/e2e/ops_tls_proxy_live_smoke.sh

The TLS smoke starts the same bearer-protected daemon shape, generates a one-day localhost certificate with openssl, terminates HTTPS in a local streaming proxy, and trusts the generated certificate as the CA for all HTTPS probes. It verifies that unauthenticated /v1/status returns 401, submits a real OpenAI run through the HTTPS control plane, verifies status health/storage/provider readiness through TLS, probes global/session/run SSE replay over HTTPS including Last-Event-ID, and greps the evidence directory for raw OpenAI key leakage.

This validates the daemon’s TLS-facing operator contract, but it does not replace a deployment-specific Nginx, Caddy, or ingress smoke for your production proxy configuration.

Treat smoke evidence directories as operational artifacts, not public logs. They should not contain raw provider keys, but some smokes keep local bearer-token files or auth-store master keys nearby while the daemon is running.
