Production runbooks

Use these runbooks for production or production-like daemon instances. They assume:
  • a dedicated state_root
  • a dedicated workspace_root
  • a checked-in routes file
  • bearer auth for every non-loopback control-plane exposure
  • route credentials stored through auth_ref, not inline route-file keys

Threat Model

Protect these assets first:
  • provider API keys, OAuth refresh tokens, connector tokens, MCP credentials, and broker leases
  • control-plane admin/read-only bearer tokens
  • KHEISH_AUTH_STORE_MASTER_KEY or KHEISH_AUTH_STORE_MASTER_KEY_FILE
  • state-root journals, debug artifacts, assets, deliveries, external-action audit records, and audit-signing.key
  • workspace files and shell-task output
Primary trust boundaries:
  • the TLS proxy or ingress in front of the daemon
  • the daemon HTTP control plane
  • provider, connector, MCP, hook, and shell-tool outbound calls
  • local state-root and workspace-root filesystem access
  • browser clients allowed through CORS
Expected threats:
  • leaked control-plane bearer tokens
  • leaked provider or connector credentials through logs, debug artifacts, outputs, or backups
  • over-broad CORS exposure from a browser origin
  • reverse-proxy buffering or rewriting of SSE streams
  • SSRF, DNS rebinding, or private-network access through web, hook, MCP, or connector paths
  • stale long-running shell tasks, approvals, questions, delivery retries, or debug-full capture windows
  • partial backup/restore, secret rotation, or route-file drift during incidents
Required controls:
  • bind the daemon to loopback or a private service network unless the network boundary is explicitly hardened
  • keep bearer auth enabled outside loopback and rotate token files atomically
  • store provider credentials with auth_ref and revoke broker subjects/slots during containment
  • keep debug capture below full except in short, approved windows
  • run doctor, status, route diagnostics, and the smoke scripts below after deploy, restore, and credential rotation
  • verify backup restores against an isolated daemon before promotion

TLS and Reverse Proxy

Prefer this boundary:
  • bind the daemon to loopback or a private service network
  • terminate TLS at a dedicated reverse proxy or ingress controller
  • keep daemon bearer auth enabled behind the proxy
  • forward SSE streams without response buffering
  • keep /healthz and /readyz reachable by the orchestrator
Proxy checklist:
  • set --http-auth-mode bearer
  • use --http-admin-token-file and, if needed, --http-readonly-token-file
  • keep browser clients same-origin through your gateway when possible; Kheish’s direct CORS allowlist is intentionally loopback-only
  • disable proxy buffering for /v1/events/stream, /v1/sessions/*/stream, and /v1/runs/*/stream
  • preserve request bodies for connector endpoints; do not strip connector signature headers
  • set request and response size limits intentionally for your attachment, debug, and output-routing workloads
  • set idle timeouts long enough for SSE
The repo-owned Nginx fixture lives at deploy/reverse-proxy/nginx-kheish.conf. Validate that it still preserves bearer auth, disables SSE buffering, keeps CORS out of the proxy layer, and remains syntactically valid when local nginx is available:
bash scripts/e2e/ops_reverse_proxy_config_smoke.sh
Minimal Nginx shape:
upstream kheish_daemon {
    server 127.0.0.1:4000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name kheish.example.com;

    ssl_certificate /etc/letsencrypt/live/kheish.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/kheish.example.com/privkey.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    client_max_body_size 32m;

    location / {
        proxy_pass http://kheish_daemon;
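        proxy_http_version 1.1;
        # Clear Connection so the upstream keepalive pool above is actually used.
        proxy_set_header Connection "";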

        proxy_set_header Authorization $http_authorization;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass_request_headers on;

        proxy_read_timeout 1h;
        proxy_send_timeout 1h;
        proxy_buffering off;
        proxy_request_buffering off;
        proxy_cache off;
        add_header X-Accel-Buffering no always;
    }
}
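The proxy checklist can also be checked statically before a reload. The sketch below is not the repository's ops_reverse_proxy_config_smoke.sh; check_proxy_conf is a hypothetical helper that only greps a config file for the directives named in the checklist and in the minimal shape above. It assumes a POSIX shell with grep -E, and it does not parse Nginx syntax, so a commented-out directive would still match.

```shell
# Hypothetical helper (not part of the repo): grep an Nginx config for the
# directives this runbook's proxy checklist calls for.
check_proxy_conf() {
    local conf="$1" failed=0 pattern
    # Each pattern is an extended regex matched anywhere in the file.
    for pattern in \
        'proxy_http_version[[:space:]]+1\.1;' \
        'proxy_buffering[[:space:]]+off;' \
        'proxy_request_buffering[[:space:]]+off;' \
        'proxy_cache[[:space:]]+off;' \
        'proxy_set_header[[:space:]]+Authorization[[:space:]]+\$http_authorization;' \
        'client_max_body_size[[:space:]]+[0-9]+[kKmMgG]?;'
    do
        if ! grep -Eq -- "$pattern" "$conf"; then
            echo "missing: $pattern" >&2
            failed=1
        fi
    done
    # CORS is meant to stay out of the proxy layer; reject a wildcard origin.
    if grep -Eq 'Access-Control-Allow-Origin[[:space:]]+"?\*' "$conf"; then
        echo "wildcard CORS header found in proxy config" >&2
        failed=1
    fi
    return "$failed"
}

# Example (path taken from this runbook):
#   check_proxy_conf deploy/reverse-proxy/nginx-kheish.conf && echo "proxy config ok"
```

A static check like this complements nginx -t and the doctor probes; it does not replace them, because it cannot see buffering introduced by another hop in front of the proxy.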
For a browser UI, prefer serving the UI from the same origin as the proxy and forwarding API calls server-side. If you intentionally call the daemon control plane directly from a browser, it must use an exact loopback origin accepted by Kheish’s CORS policy, such as http://localhost:5173.
Validate after deploy:
./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor

./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor --cors-origin https://kheish.example.com

./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-auth

./target/debug/kheish-daemon --base-url https://kheish.example.com \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-references
doctor fails closed on broken /readyz, malformed or buffered SSE, storage write-probe failures, missing state-root lock ownership, unsafe CORS preflight responses, route/provider readiness errors, and invalid hook definitions. It also resolves configured HTTP hook hostnames against current DNS and warns or errors when a hostname resolves to a private or local address, so a hook dispatch cannot surprise operators later.

Backup

Back up the full daemon state root as one unit. At minimum it contains:
  • session and run journals
  • approvals, questions, task state, and scheduler state
  • channel, project, persona, learning, skill, asset, derivation, and observation state
  • delivery queues and delivery ledgers
  • debug artifacts that have not yet expired
  • daemon-managed connector state
  • auth/global-slots.json
  • broker revocation and lease history under the auth area
  • audit-signing.key
This state-root backup does not automatically include arbitrary files under --workspace-root. If workspace files are part of your recovery objective, back up the workspace root as a separate artifact and verify checksums for the specific files or directories your workflows depend on. The smoke tests below verify this as a separate archive with a workspace marker checksum.
Keep these deployment artifacts with, or recoverable alongside, the backup:
  • the routes file used at startup
  • file-backed connector, hook, MCP, and skill-root config
  • the exact KHEISH_AUTH_STORE_MASTER_KEY or KHEISH_AUTH_STORE_MASTER_KEY_FILE value
  • control-plane token files
  • the daemon binary version or container image digest
For the cleanest filesystem backup, stop or drain the daemon first. If your platform provides atomic volume snapshots, snapshot the entire state root rather than individual files.
Set explicit recovery objectives before relying on the archive:
| Objective | Production target |
| --- | --- |
| State-root RPO | last successful atomic snapshot or stopped-daemon archive |
| Workspace RPO | last workspace archive/checksum set for workflow-critical paths |
| RTO | restore to an isolated daemon, pass validation, then promote traffic |
| Drill cadence | at least monthly, and after daemon, schema, auth-store, connector, or route-file changes |
| Integrity | record SHA-256 for every archive and preserve it outside the archive |
| Confidentiality | encrypt archives with the production backup system or a reviewed envelope key workflow |
| Ownership | restore with the same service user/group and restrictive token/master-key file modes |
Example tar backup after stopping the daemon:
# Use one UTC timestamp for both archives so the Z suffix is accurate and the names match.
ts=$(date -u +%Y%m%dT%H%M%SZ)
tar -C /var/lib/kheish/state -czf "kheish-state-$ts.tar.gz" .
tar -C /var/lib/kheish/workspace -czf "kheish-workspace-$ts.tar.gz" .
sha256sum "kheish-state-$ts.tar.gz" "kheish-workspace-$ts.tar.gz" > kheish-backup.sha256
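If the state root and workspace root are archived by a script rather than by hand, the integrity objective can be folded into the same step. The sketch below assumes GNU tar and sha256sum and a stopped or drained daemon. archive_root and its arguments are hypothetical names, and the output directory must sit outside the root being archived.

```shell
# Hypothetical helper: archive one root and append its SHA-256 to a manifest.
# Assumes GNU tar and sha256sum; out_dir must not live inside src.
archive_root() {
    local src="$1" name="$2" out_dir="$3" ts archive
    ts=$(date -u +%Y%m%dT%H%M%SZ)   # UTC, so the trailing Z is accurate
    archive="kheish-${name}-${ts}.tar.gz"
    tar -C "$src" -czf "$out_dir/$archive" . || return 1
    # Record the checksum relative to out_dir so `sha256sum -c` still works
    # after the archive and manifest are copied elsewhere together.
    (cd "$out_dir" && sha256sum "$archive" >> kheish-backup.sha256) || return 1
    echo "$out_dir/$archive"
}

# Example, using the paths from this runbook:
#   archive_root /var/lib/kheish/state state /var/backups/kheish
#   archive_root /var/lib/kheish/workspace workspace /var/backups/kheish
```

Per the Integrity objective, keep a copy of the manifest somewhere other than next to the archives, so that a tampered archive cannot be paired with a rewritten checksum.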

Restore

Restore into an isolated daemon first, not directly onto the production listener.
  1. Restore the state-root archive to a new directory.
  2. Provide the same auth-store master key used by the original state root.
  3. Provide the original routes file and deployment config.
  4. Start on a loopback bind and a fresh workspace root.
  5. Run the validation commands below.
  6. Promote traffic only after validation passes.
Example:
mkdir -p /var/lib/kheish/restore-state /var/lib/kheish/restore-workspace
tar -C /var/lib/kheish/restore-state -xzf kheish-state-20260503T120000Z.tar.gz
tar -C /var/lib/kheish/restore-workspace -xzf kheish-workspace-20260503T120000Z.tar.gz

KHEISH_AUTH_STORE_MASTER_KEY_FILE=/run/secrets/kheish-auth-store-master-key \
./target/debug/kheish-daemon serve \
  --bind 127.0.0.1:4010 \
  --state-root /var/lib/kheish/restore-state \
  --workspace-root /var/lib/kheish/restore-workspace \
  --routes-file /etc/kheish/routes.toml \
  --http-auth-mode bearer \
  --http-admin-token-file /run/secrets/kheish-admin-token
Then verify archive integrity, file ownership, and daemon health:
sha256sum -c kheish-backup.sha256
chown -R kheish:kheish /var/lib/kheish/restore-state /var/lib/kheish/restore-workspace
chmod 600 /run/secrets/kheish-auth-store-master-key /run/secrets/kheish-admin-token

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  status

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  runtime get

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --routes-file /etc/kheish/routes.toml --check-auth

./target/debug/kheish-daemon --base-url http://127.0.0.1:4010 \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-references
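Step 6 of the restore procedure (promote only after validation passes) can be enforced with a small gate that stops at the first failing check. This is a sketch under assumptions: KHEISH_BIN, BASE_URL, TOKEN_FILE, and validate_restore are illustrative names, and the subcommands are the ones listed in this runbook.

```shell
# Hypothetical promotion gate. The variables below are illustrative and can
# be overridden from the environment.
KHEISH_BIN="${KHEISH_BIN:-./target/debug/kheish-daemon}"
BASE_URL="${BASE_URL:-http://127.0.0.1:4010}"
TOKEN_FILE="${TOKEN_FILE:-/run/secrets/kheish-admin-token}"

validate_restore() {
    local check
    # One validation command per line; stop at the first failure.
    while IFS= read -r check; do
        [ -n "$check" ] || continue
        # Word splitting of $check is intentional: each line is a fixed
        # argument list. stdin is redirected so the command cannot consume
        # the remaining lines of the here-document.
        # shellcheck disable=SC2086
        if ! "$KHEISH_BIN" --base-url "$BASE_URL" --token-file "$TOKEN_FILE" \
                $check < /dev/null; then
            echo "restore validation failed: $check" >&2
            return 1
        fi
    done <<'EOF'
status
runtime get
doctor routes --check-auth
doctor routes --check-references
EOF
    echo "restore validation passed"
}

# Example: validate_restore && echo "ready to promote traffic"
```

The gate only reports exit codes. It still leaves the promotion decision to the operator, who should also read the doctor output for warnings that do not fail the command.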

Key Rotation

Control-plane bearer token files are hot-reloaded when file metadata changes. Rotate them with an atomic write and keep admin/read-only values distinct during the whole operation:
install -m 600 /run/secrets/new-kheish-admin-token /run/secrets/kheish-admin-token.next
mv /run/secrets/kheish-admin-token.next /run/secrets/kheish-admin-token
Then verify old-token failure and new-token success:
./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  status
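The status command above exercises only the new token. To check the old-token failure as well, the previous value has to be captured before the mv. The sketch below makes assumptions that should be confirmed against your deployment: the control plane accepts Authorization: Bearer headers, and /v1/status answers 401 for a rejected token and 200 for an accepted one. The function names are hypothetical. The token is passed to curl through a config read from stdin (-K -) so that it does not appear in the process list.

```shell
# Hypothetical check; assumes /v1/status returns 401 for a rejected bearer
# token and 200 for an accepted one. Requires curl. Tokens containing quotes
# or backslashes would need escaping for the curl config syntax.
status_code_for_token() {
    local base_url="$1" token="$2"
    # printf is a shell builtin, so the token never lands in an argv.
    printf 'header = "Authorization: Bearer %s"\n' "$token" |
        curl -sS -K - -o /dev/null -w '%{http_code}' "$base_url/v1/status"
}

verify_token_rotation() {
    local base_url="$1" old_token="$2" new_token="$3" old_code new_code
    old_code=$(status_code_for_token "$base_url" "$old_token")
    new_code=$(status_code_for_token "$base_url" "$new_token")
    if [ "$old_code" != "401" ]; then
        echo "old token not rejected (HTTP $old_code)" >&2
        return 1
    fi
    if [ "$new_code" != "200" ]; then
        echo "new token not accepted (HTTP $new_code)" >&2
        return 1
    fi
    echo "rotation verified"
}

# Example:
#   old=$(cat /run/secrets/kheish-admin-token)      # capture before the mv
#   ...rotate the file as shown above...
#   verify_token_rotation http://127.0.0.1:4000 "$old" \
#       "$(cat /run/secrets/kheish-admin-token)"
```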
Rotate route API-key slots by overwriting the same auth_ref. This revokes active broker leases for that slot and preserves route-file references:
./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  secrets set openai.prod \
    --provider openai \
    --from-env OPENAI_API_KEY

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-slot openai.prod

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  doctor routes --check-auth
Rotate connector child-process credential slots the same way: overwrite the generic secret slot, force the sidecar to restart or be replaced so it fetches the new material, then verify the broker lease changed and revoke the slot explicitly during incident containment:
./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  secrets set sidecars.webhook.routes \
    --provider generic \
    --value "$WEBHOOK_ROUTES_JSON"

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  connectors put-external ops-webhook \
    --json '{"platform":"webhook","mode":"child_process","base_url":"http://127.0.0.1:<port>","shared_token":{"secret_ref":"sidecars.webhook.shared"},"child_process":{"command":"python3","args":["connectors/python/run_connector.py","webhook"],"credential_slots":{"WEBHOOK_ROUTES_JSON":"sidecars.webhook.routes"}}}'

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth subject connector:ops-webhook

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth lease <lease_id>

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-slot sidecars.webhook.routes
For MCP bearer secrets and MCP OAuth account refresh/logout behavior, run the local protocol smoke after any auth-broker or MCP runtime change:
bash scripts/e2e/run_mcp_local_true_binary.sh
Do not rotate KHEISH_AUTH_STORE_MASTER_KEY by simply changing the file. Existing encrypted auth-store records become unreadable. Treat that key as a root encryption key: recover it from your secret manager during restore, and plan any future re-encryption as a separate migration with a verified backup.
Preserve audit-signing.key across backup and restore if you need continuity of the signed external-action audit ledger.
Rotation matrix:
| Secret class | Rotate | Validate | Rollback/containment |
| --- | --- | --- | --- |
| Admin bearer token | Atomic replace --http-admin-token-file | old token fails, new token can run status and doctor | restore previous token file only if it was not compromised |
| Read-only bearer token | Atomic replace --http-readonly-token-file | read-only token can read status; mutations remain forbidden | restore previous token only if it was not compromised |
| Route API key | secrets set <auth_ref> --provider ... | doctor routes --check-auth, provider readiness, optional route canary | runtime auth revoke-slot <auth_ref> and set a known-good key |
| Connector shared token | rotate the connector shared_token.secret_ref or backing secret | old webhook signature rejected, new signature accepted | revoke connector subject/slot and replace connector config |
| Connector credential slot | secrets set <slot> --provider generic plus sidecar restart/replacement | broker lease changes and token digests stay hidden | runtime auth revoke-slot <slot> |
| MCP bearer secret | update the MCP secret ref through the daemon secret store | run_mcp_local_true_binary.sh and targeted MCP tool call | revoke slot and disable MCP server config |
| MCP OAuth account | refresh/logout through mcp auth/mcp oauth workflow | account status/refresh/logout protocol smoke | revoke OAuth slot and remove server access |
| Capture/provisioning token | rotate capture token lease in the provisioning surface | old token rejected, heartbeat resumes with new lease | revoke lease and deprovision the stale agent |
| Auth-store master key | do not hot-rotate in place | restore drill with original key | recover original key; plan re-encryption as a migration |
| Audit signing key | preserve for continuity; rotate only with an explicit audit-boundary decision | signed records verify before/after restore | archive old public key/id and document the rollover time |

Incident Response

Incident ownership:
  • incident commander: decides containment scope and promotion/rollback
  • operator: runs daemon commands and captures evidence
  • security owner: owns credential revocation and external provider/connector actions
  • comms owner: writes timeline, stakeholder updates, and postmortem notes
Initial containment:
./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime set-permission-mode dont-ask

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime set-debug-level off

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  doctor
dont-ask blocks tools that would require approval instead of leaving them pending during containment. Turn debug capture off before collecting new evidence unless the incident handler explicitly needs a fresh isolated debug bundle.
Evidence collection:
  • status
  • doctor
  • doctor routes --check-auth
  • doctor routes --check-references
  • runs list
  • runs get <run_id>
  • sessions events <session_id>
  • tasks list <session_id>
  • deliveries list --run-id <run_id> when output routing is involved
  • runs external-actions <run_id> when a signed external boundary is involved
  • <state_root>/control-plane-auth/audit.jsonl
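The daemon-wide items in the evidence list can be captured in one pass, with a timeline entry per command to support the post-incident timeline. This is a sketch: collect_evidence, KHEISH_BIN, and the output layout are hypothetical, and only commands that need no run or session ID are included. Add the scoped commands once the affected IDs are known.

```shell
# Hypothetical evidence collector. KHEISH_BIN and the file layout are
# illustrative; the subcommands come from the evidence list above.
KHEISH_BIN="${KHEISH_BIN:-./target/debug/kheish-daemon}"

# The body runs in a subshell so the restrictive umask does not leak out.
collect_evidence() (
    base_url="$1" token_file="$2" out_dir="$3" n=0
    # Evidence can hold sensitive data: create everything owner-only.
    # mkdir -p does not tighten an existing directory, so use a fresh path.
    umask 077
    mkdir -p "$out_dir" || exit 1
    while IFS= read -r cmd; do
        [ -n "$cmd" ] || continue
        n=$((n + 1))
        # Timeline entry: UTC timestamp, operator, command.
        printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(id -un)" "$cmd" \
            >> "$out_dir/timeline.tsv"
        # Keep going on failure; a failing command is itself evidence.
        # shellcheck disable=SC2086
        "$KHEISH_BIN" --base-url "$base_url" --token-file "$token_file" $cmd \
            > "$out_dir/$(printf '%02d' "$n").out" 2>&1 < /dev/null || true
    done <<'EOF'
status
doctor
doctor routes --check-auth
doctor routes --check-references
runs list
EOF
)

# Example:
#   collect_evidence http://127.0.0.1:4000 /run/secrets/kheish-admin-token \
#       /var/lib/kheish-incident/evidence
```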
Credential containment:
./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-subject session:compromised-session

./target/debug/kheish-daemon --base-url http://127.0.0.1:4000 \
  --token-file /run/secrets/kheish-admin-token \
  runtime auth revoke-slot openai.prod
Operational containment:
  • interrupt suspicious sessions with sessions interrupt <session_id>
  • cancel suspicious runs with runs cancel <run_id>
  • stop unsafe shell tasks with tasks stop <session_id> <task_id>
  • delete or rotate compromised connectors through the connectors CLI
  • disable risky hooks by replacing runtime hook config with a reviewed file
Scenario checklists:
| Scenario | Contain | Evidence | Recover |
| --- | --- | --- | --- |
| Leaked admin token | rotate admin token file, verify old token fails | control-plane auth audit, doctor, status | review recent admin-only actions and issue new operator tokens |
| Leaked provider key | runtime auth revoke-slot <auth_ref>, set replacement key | doctor routes --check-auth, provider readiness, route canary | rotate provider-side key and rerun route smoke |
| State-root exfiltration | stop daemon, rotate every stored credential, preserve audit key copy | state-root file inventory, auth lease status, external-action audit | restore from known-good backup with original master key, then rotate secrets |
| Proxy/CORS exposure | remove public route or tighten proxy/CORS config | doctor --cors-origin, auth audit, proxy logs | run reverse-proxy config smoke and browser-origin smoke before reopening |
| debug_level=full left on | runtime set-debug-level off, prune/retain evidence per policy | runs debug, state-root debug artifact inventory | rotate any credentials exposed in captured provider/tool payloads |
| Stuck shell task | tasks list, tasks output, tasks stop | task output, run events, workspace diff | restart daemon only after task tree is gone or intentionally quarantined |
Post-incident requirements:
  • write an immutable timeline with command timestamps and operator identity
  • store evidence in restricted incident storage, not in public logs
  • record every rotated secret and revocation result
  • run restore and route/reference diagnostics before declaring recovery
  • add a regression smoke or unit test for the incident path when feasible

SLO Signals

Use these starting thresholds for production-like deployments:
| Signal | JSON path or probe | Warning | Page |
| --- | --- | --- | --- |
| Readiness | GET /readyz | one non-2xx sample | two consecutive non-2xx samples within 60 seconds |
| Health | /v1/status .health.ok | false for one sample | false for two consecutive samples |
| Storage | /v1/status .storage.ok, .storage.probes[] | any warning diagnostic | false immediately |
| Provider readiness | /v1/status .provider_readiness.active_route_ready, .provider_readiness.routes[] | default route warning for 5 minutes | default route error for 5 minutes |
| Run queue lag | /v1/status .runs.queued, .runs.oldest_queued_run_age_ms, .health.warnings[] with queued_run_lag | older than 30 minutes behind active session work | older than 2 hours behind active session work |
| Stale active runs | /v1/status .runs.running, .runs.oldest_non_terminal_run_idle_ms, .health.warnings[] with stale-run codes | older than 2 hours without events | older than 6 hours without events |
| Pending approvals/questions | /v1/status .runs.waiting_for_approval, .runs.waiting_for_user_question | older than 4 hours | older than 24 hours |
| Delivery DLQ | /v1/status .delivery.unresolved_dead_lettered | any growth in staging | any production DLQ item or repeated growth |
| Auth/CORS rate limits | <state_root>/control-plane-auth/audit.jsonl codes auth_rate_limited and cors_origin_rate_limited | repeated code within 10 minutes | sustained repeats for 30 minutes |
| Debug capture | /v1/status .runtime.debug_level | full enabled outside a ticketed window | full enabled for more than 30 minutes |
| State-root disk | host filesystem metrics for --state-root | above 80 percent | above 90 percent or inode exhaustion |
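The age-based rows above (run queue lag and stale active runs) reduce to comparing a millisecond value against two thresholds. A minimal sketch of that comparison follows. The function names are hypothetical, and how the value is fetched, for example with curl and jq against /v1/status, is an assumption left to your monitoring stack. The helpers expect a non-negative integer, so handle a missing or null field before calling them.

```shell
# Hypothetical helper: map an age in milliseconds onto warning/page
# thresholds given in minutes. "Older than" is treated as strictly greater.
classify_age_ms() {
    local age_ms="$1" warn_min="$2" page_min="$3"
    if [ "$age_ms" -gt $((page_min * 60 * 1000)) ]; then
        echo page
    elif [ "$age_ms" -gt $((warn_min * 60 * 1000)) ]; then
        echo warning
    else
        echo ok
    fi
}

# Run queue lag (.runs.oldest_queued_run_age_ms): 30 minutes / 2 hours.
classify_queue_lag() { classify_age_ms "$1" 30 120; }

# Stale active runs (.runs.oldest_non_terminal_run_idle_ms): 2 hours / 6 hours.
classify_stale_run() { classify_age_ms "$1" 120 360; }

# Example: classify_queue_lag 2700000   # 45 minutes
```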
The recommended first response is always evidence-first: doctor, status, route diagnostics, then scoped run/session/task/delivery views.
Use the SLO probe smoke as a CI-ready check for these documented paths:
KHEISH_OPS_SLO_MODEL=gpt-5.4 bash scripts/e2e/ops_slo_probe_smoke.sh
It starts a bearer-protected daemon with an auth_ref OpenAI route and verifies /readyz, /v1/status .health.ok, .storage.ok, .provider_readiness.active_route_ready, .provider_readiness.error_route_count, .runs.running, .runs.queued, .runs.waiting_for_approval, .runs.waiting_for_user_question, .delivery.unresolved_dead_lettered, .runtime.debug_level, loopback/auth control-plane posture, clean route auth diagnostics, and raw OpenAI key non-leakage in the evidence directory. The no-backlog SLO smoke keeps queue counters at zero; mixed-state daemon status tests cover backlog age fields such as .runs.oldest_queued_run_age_ms and .runs.oldest_non_terminal_run_idle_ms. The probe keeps its local control-plane token files and auth-store master key outside the evidence directory.

Smoke Test

The checked-in Nginx fixture has its own config smoke:
bash scripts/e2e/ops_reverse_proxy_config_smoke.sh
It statically verifies the documented reverse-proxy contract: TLS certificate/key settings, TLS protocol floor, loopback daemon upstream, Authorization forwarding, Host/X-Forwarded-* forwarding, proxy_http_version 1.1, disabled proxy/request buffering, disabled proxy cache, long SSE timeouts, explicit body-size limit, no wildcard proxy CORS, no disabled upstream TLS verification, and no inline bearer-token literal. If nginx and openssl are installed locally, it also rewrites the fixture to a temporary self-signed certificate path and runs nginx -t.
The repository includes a provider-free smoke for the documented backup, restore, and route-secret rotation path:
bash scripts/e2e/ops_backup_restore_smoke.sh
The smoke:
  • creates a fresh state root and workspace root
  • keeps local token/master-key files outside the evidence directory
  • starts a bearer-protected daemon on loopback
  • asserts status.health.ok, status.storage.ok, provider readiness, and a clean delivery DLQ
  • validates runtime get, doctor, doctor routes --check-auth, and doctor routes --check-references
  • rotates a route secret and asserts the slot-revocation response shape
  • backs up the state root and workspace root as separate archives
  • restores both into a second isolated daemon
  • verifies the restored secret metadata, route diagnostics, audit-signing.key continuity, and workspace marker checksum
  • greps smoke artifacts for raw route-secret leakage
When OPENAI_API_KEY or KHEISH_OPENAI_API_KEY is available, run the live smoke as well:
KHEISH_OPS_LIVE_MODEL=gpt-5.4 bash scripts/e2e/ops_runbook_live_smoke.sh
The live smoke:
  • starts a bearer-protected daemon with an auth_ref OpenAI route and sends traffic through a local streaming proxy
  • asserts health/storage/provider-readiness status paths
  • validates doctor --cors-origin, doctor routes --check-auth, doctor routes --check-references, and doctor routes --canary
  • creates a fresh session and completes a real run with OPS_RUNBOOK_LIVE_OK
  • probes /v1/events/stream, /v1/sessions/<session_id>/stream, and /v1/runs/<run_id>/stream with replay cursors
  • exercises incident commands including runtime set-permission-mode dont-ask, runtime set-debug-level off, runs list, runs get <run_id>, sessions events <session_id>, tasks list <session_id>, deliveries list --run-id <run_id>, runs external-actions <run_id>, sessions interrupt <session_id>, and runs cancel <run_id>
  • verifies signed external-action records before and after restore
  • rotates the admin bearer token file atomically and verifies old-token failure and new-token success
  • revokes the active broker subject, runs runtime auth revoke-slot openai.prod, and asserts the returned revocation shapes
  • rotates a child-process connector credential_slots secret and verifies that the old connector lease is revoked and token digests stay hidden
  • restarts the sidecar through connectors put-external ops-webhook and proves that old webhook signatures are rejected and new signatures are accepted without creating an extra provider run
  • explicitly runs runtime auth revoke-slot sidecars.webhook.routes
  • backs up and restores the state and workspace roots after the real run, then verifies audit-signing.key continuity and the workspace marker checksum
  • greps the smoke artifacts for raw OpenAI and connector secret leakage
Both ops smokes call scripts/e2e/verify_runbook_commands.py so stale or missing runbook command fragments fail before daemon validation.
The MCP local true-binary smoke covers MCP bearer secret-store wiring and generic MCP OAuth login/status/refresh/logout against local protocol fixtures:
bash scripts/e2e/run_mcp_local_true_binary.sh
To validate TLS termination and SSE over HTTPS with a self-signed local certificate, run:
KHEISH_OPS_TLS_LIVE_MODEL=gpt-5.4 bash scripts/e2e/ops_tls_proxy_live_smoke.sh
The TLS smoke:
  • starts the same bearer-protected daemon shape
  • generates a one-day localhost certificate with openssl
  • terminates HTTPS in a local streaming proxy
  • uses the generated certificate as the trusted CA for all HTTPS probes
  • verifies unauthenticated /v1/status returns 401
  • submits a real OpenAI run through the HTTPS control plane
  • verifies status health/storage/provider readiness through TLS
  • probes global/session/run SSE replay over HTTPS, including Last-Event-ID
  • greps the evidence directory for raw OpenAI key leakage
This validates the daemon’s TLS-facing operator contract, but it does not replace a deployment-specific Nginx, Caddy, or ingress smoke for your production proxy configuration.
Treat smoke evidence directories as operational artifacts, not public logs. They should not contain raw provider keys, but some smokes keep local bearer-token files or auth-store master keys nearby while the daemon is running.
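For CI, the provider-free smokes and the optional live smoke can be chained so that the live one runs only when a provider key is present. The wrapper below is a sketch: live_smoke_enabled and run_ops_smokes are hypothetical names, the script paths and environment variables are the ones named in this section, and it assumes it is run from the repository root.

```shell
# Hypothetical wrapper around the smokes named in this section.
live_smoke_enabled() {
    # True when either provider key variable is set and non-empty.
    [ -n "${OPENAI_API_KEY:-}" ] || [ -n "${KHEISH_OPENAI_API_KEY:-}" ]
}

run_ops_smokes() {
    # Provider-free smokes always run.
    bash scripts/e2e/ops_reverse_proxy_config_smoke.sh || return 1
    bash scripts/e2e/ops_backup_restore_smoke.sh || return 1
    if live_smoke_enabled; then
        # The model default mirrors the example invocation in this section.
        KHEISH_OPS_LIVE_MODEL="${KHEISH_OPS_LIVE_MODEL:-gpt-5.4}" \
            bash scripts/e2e/ops_runbook_live_smoke.sh || return 1
    else
        echo "no provider key set; skipping live smoke" >&2
    fi
}

# Example: run_ops_smokes
```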