
Debugging and recovery

When a run looks wrong, debug from daemon evidence rather than from assumptions about the prompt.

Evidence-first order

The recommended inspection order is:
  1. flows get <flow_id> when the issue is Playbook/Flow-scoped
  2. runs get <run_id>
  3. approvals list --session-id <session_id>
  4. tasks list <session_id>
  5. tasks output <session_id> <task_id>
  6. sessions get <session_id>
  7. sessions events <session_id>
  8. channels get <channel_id> and channels messages list <channel_id> when debugging one public conversation
  9. channels leases <channel_id> when a thread looks stalled or over-eager
  10. channels stimuli list <channel_id> and channels thread-work <channel_id> when autonomous wake-ups or recovery look wrong
  11. runs debug <run_id>
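
The order above can be kept as a small checklist generator. This is a minimal sketch, not part of Kheish: the function name `inspection_order` and the sample ids are hypothetical, and the code only builds the subcommand strings listed above without running anything.

```python
from typing import List, Optional


def inspection_order(
    session_id: str,
    run_id: str,
    task_id: Optional[str] = None,
    flow_id: Optional[str] = None,
    channel_id: Optional[str] = None,
) -> List[str]:
    """Return the subcommands from the list above, in the recommended order.

    Optional ids gate the steps that only apply to a Flow, a single task,
    or a channel. Nothing is executed; the caller decides how to run them.
    """
    steps: List[str] = []
    if flow_id is not None:
        steps.append(f"flows get {flow_id}")
    steps.append(f"runs get {run_id}")
    steps.append(f"approvals list --session-id {session_id}")
    steps.append(f"tasks list {session_id}")
    if task_id is not None:
        steps.append(f"tasks output {session_id} {task_id}")
    steps.append(f"sessions get {session_id}")
    steps.append(f"sessions events {session_id}")
    if channel_id is not None:
        steps.append(f"channels get {channel_id}")
        steps.append(f"channels messages list {channel_id}")
        steps.append(f"channels leases {channel_id}")
        steps.append(f"channels stimuli list {channel_id}")
        steps.append(f"channels thread-work {channel_id}")
    # The heaviest view comes last, after the cheaper evidence.
    steps.append(f"runs debug {run_id}")
    return steps


if __name__ == "__main__":
    # Hypothetical ids, for illustration only.
    for step in inspection_order("sess-1", "run-1", channel_id="chan-1"):
        print(step)
```

Keeping `runs debug` at the end mirrors the list: the lighter views usually narrow the problem before the full debug view is needed.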

Common operational pitfalls

The most common causes of confusion are:
  • wrong daemon URL
  • wrong state root
  • stale sessions created by an older daemon build
  • CLI and daemon schema drift
  • shell-heavy runs that need more than one approval wave
  • updating a persona record and expecting an already-bound session to change automatically
  • assuming the daemon-global MCP or skill inventory is always fully visible to every session
  • assuming visible tools imply delegated credential access, even when the session or child credential_scope has narrowed route, connector, or MCP auth

Restore expectations

Kheish can restore pending approvals, pending structured questions, active inline skills, schedules, and queued deliveries. Recovery is only as good as the state root and binary pair you are actually running.

Kheish does not promise that an arbitrary shell process survives a daemon restart. Active daemon-managed shell tasks are settled fail-closed on boot:
  • inspect tasks list <session_id> for status: "failed"
  • inspect tasks get <session_id> <task_id> for metadata.terminal_reason: "daemon_restarted" and metadata.recovered_on_boot: true
  • inspect tasks output <session_id> <task_id> --full for partial output
  • retry manually only after checking whether the command is safe to repeat
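
The first two checks above can be combined into one predicate. This is a minimal sketch that assumes the task record has already been parsed from JSON into a dictionary, and that `metadata.terminal_reason` and `metadata.recovered_on_boot` are nested under a `metadata` key as their dotted names suggest. The function name `settled_by_restart` is hypothetical.

```python
from typing import Any, Dict


def settled_by_restart(task: Dict[str, Any]) -> bool:
    """Return True when a task record matches the restart fail-closed markers.

    All three fields named in the list above must match: a failed status,
    the "daemon_restarted" terminal reason, and recovered_on_boot set to true.
    """
    metadata = task.get("metadata") or {}
    return (
        task.get("status") == "failed"
        and metadata.get("terminal_reason") == "daemon_restarted"
        and metadata.get("recovered_on_boot") is True
    )


if __name__ == "__main__":
    # Hypothetical record, shaped after the field names in the list above.
    example = {
        "status": "failed",
        "metadata": {"terminal_reason": "daemon_restarted", "recovered_on_boot": True},
    }
    print(settled_by_restart(example))
```

A match only says why the task was settled. Whether the command is safe to repeat still has to be judged from its partial output.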
Do not treat task_output.retrieval_status=success as shell success. It only means the daemon read the persisted output view.

Session metadata is also the authoritative restore source for:
  • the bound persona snapshot
  • the persisted session capability scope override
  • the persisted session credential scope override
  • the persisted session reply-target defaults
  • the effective active inline skill state derived from persona defaults and session-local changes
Channel state restores from its own daemon-owned store, not from any member session journal. That store now includes:
  • channel records
  • public messages and reactions
  • public turn leases
  • queued channel stimuli
  • canonical thread-work state with bindings and progress snapshots
When a public thread behaves unexpectedly after a restart, inspect the restored channel view, stimulus queue, and thread-work projection first. Then confirm whether a referenced ChannelDelivery run is still active or has already settled.

Debug capture

Use debug capture when you need to inspect:
  • the final system prompt sent to the model
  • the effective tool surface
  • provider wire payloads
  • normalized model outputs
Keep debug capture scoped and temporary in production-like environments.

When prompt identity looks wrong, inspect both:
  • ./target/debug/kheish-daemon sessions get <session_id> for the session’s bound persona summary
  • ./target/debug/kheish-daemon personas get <persona_id> for the latest mutable persona record
Those two views can differ legitimately when the session bound an older persona snapshot.

When skill or MCP visibility looks wrong, inspect:
  • ./target/debug/kheish-daemon sessions get <session_id> for capability_scope and effective_capability_scope
  • ./target/debug/kheish-daemon sessions get <session_id> for credential_scope and effective_credential_scope
  • ./target/debug/kheish-daemon runtime get for the daemon-global inventory that existed before session filtering
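
One way to picture the gap between the daemon-global inventory and a session's effective surface is a filter over tools grouped by server. This is a simplified model, not the daemon's implementation: the function name `session_visible_surface`, the dictionary shape, and the sample server and tool names are all hypothetical.

```python
from typing import Dict, List, Set


def session_visible_surface(
    inventory: Dict[str, List[str]],
    executable: Set[str],
) -> Dict[str, List[str]]:
    """Filter a server -> tools inventory down to what a session can execute.

    A server whose tools are all filtered out is dropped entirely, so it no
    longer appears in the session-visible surface at all.
    """
    visible: Dict[str, List[str]] = {}
    for server, tools in inventory.items():
        kept = [tool for tool in tools if tool in executable]
        if kept:
            visible[server] = kept
    return visible


if __name__ == "__main__":
    # Hypothetical inventory and scope, for illustration only.
    global_inventory = {"files": ["read", "write"], "search": ["query"]}
    print(session_visible_surface(global_inventory, {"read"}))
```

In this model a server missing from the session view is not necessarily missing from the daemon. It may only have lost every executable tool to filtering.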
Remember that an MCP server can still disappear from the session-visible surface when none of its tools or helpers remain executable after capability and tool-surface filtering.

When auth-backed execution looks wrong, inspect:
  • ./target/debug/kheish-daemon runtime auth subject <subject_id> for active route and connector leases
  • ./target/debug/kheish-daemon runtime auth lease <lease_id> for one concrete delegated lease
  • ./target/debug/kheish-daemon runs external-actions <run_id> for the signed audit trail that ties one run to principals, grants, targets, and request or response digests
If you are restoring or migrating a state root, keep audit-signing.key together with the external-action ledgers. Without the matching key, existing signed audit records remain readable on disk, but the daemon cannot continue that ledger safely.

When output routing looks wrong, inspect:
  • ./target/debug/kheish-daemon sessions get <session_id> for persisted reply_targets
  • ./target/debug/kheish-daemon connectors list for daemon-managed transport definitions
  • ./target/debug/kheish-daemon connectors get <kind> <name> for one connector’s redacted secret and routing config
Remember that a session reply-target edit is prospective. If a run already captured reply targets before the change, inspect that run rather than assuming the session default will retroactively override it.

The same applies to daemon-managed background shell tasks. They snapshot reply targets at task creation time and later reuse that stored snapshot for completion notifications and restart recovery.

When channel conversation behavior looks wrong, inspect:
  • ./target/debug/kheish-daemon channels get <channel_id> for title, members, autonomy policy, and paused state
  • ./target/debug/kheish-daemon channels messages list <channel_id> for the durable public timeline
  • ./target/debug/kheish-daemon channels leases <channel_id> for the current public turn holder and queued candidates
  • ./target/debug/kheish-daemon channels stimuli list <channel_id> for pending, claimed, superseded, cancelled, or already dispatched wake-ups
  • ./target/debug/kheish-daemon channels thread-work <channel_id> for the canonical root-thread work projection, bindings, and progress snapshots
  • ./target/debug/kheish-daemon runs list --session-id <session_id> and runs get <run_id> when one lease still points at an active or recently settled ChannelDelivery run
Remember that the public conversation truth lives in the channel store, but restart reconciliation can still consult ChannelDelivery runs before a lease is cleared or advanced.

Current channel recovery also does more than just replay leases:
  • claimed stimuli can be re-queued instead of being lost
  • already materialized public posts can be reused instead of duplicated
  • canonical thread-work state can be rebuilt from durable public messages
  • stale progress snapshots and stale bindings can be repaired or dropped when they no longer match the correct root thread
So when a channel “looks wrong after restart”, always inspect the stimulus and thread-work views before assuming the social routing logic itself is broken.

The daemon also repairs its compact session-persona cache from persisted session metadata during restart recovery. If session-persona filtering looks stale after a crash, trust the live daemon views over any older cached JSON you may still see on disk.

Evidence note

  • Code verified: crates/kheish-daemon/src/services/run.rs, crates/kheish-daemon/src/services/task.rs, crates/kheish-daemon/src/state/task_workflow.rs, crates/kheish-daemon/src/state/playbook_workflow.rs.
  • CLI/API verified: commands named in the evidence-first order exist in crates/kheish-daemon/src/main.rs and route through crates/kheish-daemon/src/api/handlers.rs.
  • Daemon live tested for this note: no; deterministic daemon restart tests cover the documented shell-task failure state.
  • Provider-specific tested for this note: no; restart recovery is daemon-local and provider-neutral.