Debugging and recovery

When a run looks wrong, debug from daemon evidence rather than from assumptions about the prompt.

Evidence-first order

The recommended inspection order is:

flows get <flow_id> when the issue is Playbook/Flow-scoped
runs get <run_id>
approvals list --session-id <session_id>
tasks list <session_id>
tasks output <session_id> <task_id>
sessions get <session_id>
sessions events <session_id>
channels get <channel_id> and channels messages list <channel_id> when debugging one public conversation
channels leases <channel_id> when a thread looks stalled or over-eager
channels stimuli list <channel_id> and channels thread-work <channel_id> when autonomous wake-ups or recovery look wrong
runs debug <run_id>
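
The order above can be sketched as a small helper that prints the commands to run for whichever identifiers you already have. This is a minimal sketch, not part of Kheish: the function name `inspection_commands` and the `KHEISH` constant are illustrative, the binary path is the one used elsewhere on this page, and only the subcommands listed above are emitted. Steps whose identifier is unknown are skipped.

```python
from typing import List, Optional

# Binary path as used elsewhere on this page; adjust for your own build.
KHEISH = "./target/debug/kheish-daemon"


def inspection_commands(
    run_id: Optional[str] = None,
    session_id: Optional[str] = None,
    flow_id: Optional[str] = None,
    channel_id: Optional[str] = None,
    task_id: Optional[str] = None,
) -> List[str]:
    """Return the evidence-first commands, in order, for the known identifiers."""
    steps: List[str] = []
    if flow_id:
        steps.append(f"flows get {flow_id}")
    if run_id:
        steps.append(f"runs get {run_id}")
    if session_id:
        steps.append(f"approvals list --session-id {session_id}")
        steps.append(f"tasks list {session_id}")
        if task_id:
            steps.append(f"tasks output {session_id} {task_id}")
        steps.append(f"sessions get {session_id}")
        steps.append(f"sessions events {session_id}")
    if channel_id:
        steps.append(f"channels get {channel_id}")
        steps.append(f"channels messages list {channel_id}")
        steps.append(f"channels leases {channel_id}")
        steps.append(f"channels stimuli list {channel_id}")
        steps.append(f"channels thread-work {channel_id}")
    if run_id:
        # The debug view comes last, after the cheaper evidence has been read.
        steps.append(f"runs debug {run_id}")
    return [f"{KHEISH} {step}" for step in steps]


if __name__ == "__main__":
    for command in inspection_commands(run_id="<run_id>", session_id="<session_id>"):
        print(command)
```

Printing the commands rather than executing them keeps the helper safe to run against any environment; you still choose which daemon URL and state root each command targets.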

Common operational pitfalls

The most common causes of confusion are:

wrong daemon URL
wrong state root
stale sessions created by an older daemon build
CLI and daemon schema drift
shell-heavy runs that need more than one approval wave
updating a persona record and expecting an already-bound session to change automatically
assuming the daemon-global MCP or skill inventory is always fully visible to every session
assuming visible tools imply delegated credential access, even when the session or child credential_scope has narrowed route, connector, or MCP auth

Restore expectations

Kheish can restore pending approvals, pending structured questions, active inline skills, schedules, and queued deliveries. Recovery is only as good as the state root and binary pair you are actually running. Kheish does not promise that an arbitrary shell process survives a daemon restart.

Active daemon-managed shell tasks are settled fail-closed on boot:

inspect tasks list <session_id> for status: "failed"
inspect tasks get <session_id> <task_id> for metadata.terminal_reason: "daemon_restarted" and metadata.recovered_on_boot: true
inspect tasks output <session_id> <task_id> --full for partial output
retry manually only after checking whether the command is safe to repeat
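
The checks above can be sketched as a filter over task records. This is a minimal sketch under stated assumptions: it assumes you have already parsed the task records into Python dictionaries, and that each record exposes `status` and a nested `metadata` object with the `terminal_reason` and `recovered_on_boot` fields named above. The `id` key and the function name `restart_failed_tasks` are hypothetical and used only for illustration.

```python
from typing import Any, Dict, List


def restart_failed_tasks(tasks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return tasks that were settled as failed because the daemon restarted."""
    matches = []
    for task in tasks:
        metadata = task.get("metadata") or {}
        if (
            task.get("status") == "failed"
            and metadata.get("terminal_reason") == "daemon_restarted"
            and metadata.get("recovered_on_boot") is True
        ):
            matches.append(task)
    return matches


# Hypothetical records; the real output layout may differ.
sample = [
    {"id": "t1", "status": "completed", "metadata": {}},
    {
        "id": "t2",
        "status": "failed",
        "metadata": {"terminal_reason": "daemon_restarted", "recovered_on_boot": True},
    },
    {"id": "t3", "status": "failed", "metadata": {"terminal_reason": "timeout"}},
]

print([task["id"] for task in restart_failed_tasks(sample)])  # prints ['t2']
```

The filter only identifies candidates. Whether a command is safe to repeat still has to be decided by reading its partial output.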

Do not treat task_output.retrieval_status=success as shell success. It only means the daemon read the persisted output view.

Session metadata is also the authoritative restore source for:

the bound persona snapshot
the persisted session capability scope override
the persisted session credential scope override
the persisted session reply-target defaults
the effective active inline skill state derived from persona defaults and session-local changes

Channel state restores from its own daemon-owned store, not from any member session journal. That store now includes:

channel records
public messages and reactions
public turn leases
queued channel stimuli
canonical thread-work state with bindings and progress snapshots

When a public thread behaves unexpectedly after restart, inspect the restored channel view, stimulus queue, and thread-work projection first, then confirm whether one referenced ChannelDelivery run is still active or has already settled.

Debug capture

Use debug capture when you need to inspect:

the final system prompt sent to the model
the effective tool surface
provider wire payloads
normalized model outputs

Keep debug capture scoped and temporary in production-like environments.

When prompt identity looks wrong, inspect both:

./target/debug/kheish-daemon sessions get <session_id> for the session’s bound persona summary
./target/debug/kheish-daemon personas get <persona_id> for the latest mutable persona record

Those two views can differ legitimately when the session bound an older persona snapshot.

When skill or MCP visibility looks wrong, inspect:

./target/debug/kheish-daemon sessions get <session_id> for capability_scope and effective_capability_scope
./target/debug/kheish-daemon sessions get <session_id> for credential_scope and effective_credential_scope
./target/debug/kheish-daemon runtime get for the daemon-global inventory that existed before session filtering
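
Comparing the two views amounts to a set difference between the daemon-global inventory and what the session can see. The sketch below assumes you have copied the MCP server or skill names out of each view by hand into plain lists. It makes no assumption about the JSON layout of either command, and the function name `hidden_from_session` is illustrative.

```python
from typing import Iterable, List


def hidden_from_session(
    global_names: Iterable[str], session_names: Iterable[str]
) -> List[str]:
    """Return names present in the daemon-global inventory but not visible to the session."""
    visible = set(session_names)
    return sorted(name for name in set(global_names) if name not in visible)


# Hypothetical names, used only to show the shape of the comparison.
daemon_inventory = ["search", "filesystem", "calendar"]
session_visible = ["search"]

print(hidden_from_session(daemon_inventory, session_visible))  # prints ['calendar', 'filesystem']
```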

Remember that one MCP server can still disappear from the session-visible surface when none of its tools or helpers remain executable after capability and tool-surface filtering.

When auth-backed execution looks wrong, inspect:

./target/debug/kheish-daemon runtime auth subject <subject_id> for active route and connector leases
./target/debug/kheish-daemon runtime auth lease <lease_id> for one concrete delegated lease
./target/debug/kheish-daemon runs external-actions <run_id> for the signed audit trail that ties one run to principals, grants, targets, and request or response digests

If you are restoring or migrating one state root, keep audit-signing.key together with the external-action ledgers. Without the matching key, existing signed audit records remain readable on disk but the daemon cannot continue that ledger safely.

When output routing looks wrong, inspect:

./target/debug/kheish-daemon sessions get <session_id> for persisted reply_targets
./target/debug/kheish-daemon connectors list for daemon-managed transport definitions
./target/debug/kheish-daemon connectors get <kind> <name> for one connector’s redacted secret and routing config

Remember that a session reply-target edit is prospective. If a run already captured reply targets before the change, inspect that run rather than assuming the session default will retroactively override it. The same applies to daemon-managed background shell tasks. They snapshot reply targets at task creation time and later reuse that stored snapshot for completion notifications and restart recovery.

When channel conversation behavior looks wrong, inspect:

./target/debug/kheish-daemon channels get <channel_id> for title, members, autonomy policy, and paused state
./target/debug/kheish-daemon channels messages list <channel_id> for the durable public timeline
./target/debug/kheish-daemon channels leases <channel_id> for the current public turn holder and queued candidates
./target/debug/kheish-daemon channels stimuli list <channel_id> for pending, claimed, superseded, cancelled, or already dispatched wake-ups
./target/debug/kheish-daemon channels thread-work <channel_id> for the canonical root-thread work projection, bindings, and progress snapshots
./target/debug/kheish-daemon runs list --session-id <session_id> and runs get <run_id> when one lease still points at an active or recently settled ChannelDelivery run
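
When the stimulus queue is long, a per-status tally makes stuck entries easier to spot. This is a minimal sketch, assuming the stimuli have been parsed into dictionaries that each carry a `status` string. That field name, the status spellings in the sample, and the function name `stimuli_by_status` are assumptions for illustration, not confirmed output of the command.

```python
from collections import Counter
from typing import Any, Dict, List


def stimuli_by_status(stimuli: List[Dict[str, Any]]) -> Counter:
    """Count stimuli per status; records without a status are counted as 'unknown'."""
    return Counter(stimulus.get("status", "unknown") for stimulus in stimuli)


# Hypothetical records; status spellings are assumed, not taken from real output.
sample_stimuli = [
    {"status": "pending"},
    {"status": "claimed"},
    {"status": "claimed"},
    {"status": "superseded"},
    {},
]

counts = stimuli_by_status(sample_stimuli)
print(counts["claimed"])  # prints 2
```

A tally like this is only a starting point: a claimed stimulus is worth reading in full, together with the lease and thread-work views, before drawing conclusions.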

Remember that the public conversation truth lives in the channel store, but restart reconciliation can still consult ChannelDelivery runs before a lease is cleared or advanced.

Current channel recovery also does more than just replay leases:

claimed stimuli can be re-queued instead of being lost
already materialized public posts can be reused instead of duplicated
canonical thread-work state can be rebuilt from durable public messages
stale progress snapshots and stale bindings can be repaired or dropped when they no longer match the correct root thread

So when a channel “looks wrong after restart”, always inspect the stimulus and thread-work views before assuming the social routing logic itself is broken.

The daemon also repairs its compact session-persona cache from persisted session metadata during restart recovery. If session-persona filtering looks stale after a crash, trust the live daemon views over any older cached JSON you may still see on disk.

Evidence Note

Code verified: crates/kheish-daemon/src/services/run.rs, crates/kheish-daemon/src/services/task.rs, crates/kheish-daemon/src/state/task_workflow.rs, crates/kheish-daemon/src/state/playbook_workflow.rs.
CLI/API verified: commands named in the evidence-first order exist in crates/kheish-daemon/src/main.rs and route through crates/kheish-daemon/src/api/handlers.rs.
Daemon live tested for this note: no; deterministic daemon restart tests cover the documented shell-task failure state.
Provider-specific tested for this note: no; restart recovery is daemon-local and provider-neutral.
