Capture and observations

Kheish models capture as a generic daemon-owned observation flow:
  1. one external producer uploads data into one daemon-owned observation source
  2. the daemon stores one durable observation record plus raw asset metadata
  3. later, one operator or schedule materializes observations into a normal run
This keeps host-local device access, system permissions, and capture timing outside the agent runtime while preserving one stable control-plane contract.

Why capture is not a session input shortcut

Session input and capture solve different problems:
  • session input is for immediate user- or connector-submitted work
  • observations are for durable external state that may be processed later
An observation can be retained, filtered, materialized repeatedly, or ignored without mutating the session journal until you explicitly create a run from it.

Core objects

Observation source

An observation source is one stable daemon-owned ingest boundary. Each source carries:
  • source_id
  • kind
  • sensitivity
  • retention_seconds
  • max_active_observations
  • max_active_bytes
  • allow_materialization
  • allow_output_delivery
The daemon currently supports these source kinds:
  • screen_snapshot
  • webcam_snapshot
  • microphone_segment
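As a sketch, the source fields above can be combined into one creation request body for POST /v1/observation-sources. The field names follow the list above; the concrete values, the helper name, and any field spelling beyond the listed names are illustrative assumptions, not the authoritative schema:

```python
# Hypothetical observation-source creation payload.
# Field names mirror the documented source fields; values are illustrative.
def make_source_request(kind):
    allowed_kinds = {"screen_snapshot", "webcam_snapshot", "microphone_segment"}
    if kind not in allowed_kinds:
        raise ValueError(f"unsupported source kind: {kind}")
    return {
        "kind": kind,
        "sensitivity": "high",            # illustrative value
        "retention_seconds": 24 * 3600,   # retain observations for one day
        "max_active_observations": 500,
        "max_active_bytes": 512 * 1024 * 1024,
        "allow_materialization": True,
        "allow_output_delivery": False,   # conservative default for capture data
    }

req = make_source_request("microphone_segment")
```

The quota and retention fields let one daemon-owned source bound capture ingest without any per-upload negotiation.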

Observation

Each uploaded observation stores:
  • one daemon-owned raw asset_id
  • media_type
  • sha256
  • byte_length
  • captured_at_ms
  • received_at_ms
  • optional canonical_text_asset_id
  • optional stream_id
  • optional seq_no
  • caller-supplied metadata
  • one stable idempotency_key plus daemon request fingerprint
The daemon keeps raw payload storage and run orchestration separate. An observation exists before any model sees it.
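A minimal sketch of building one ingest payload for POST /v1/observation-sources/{source_id}/observations. The field names follow the list above; the helper name and the idempotency-key shape are assumptions for illustration:

```python
import hashlib
import time

# Sketch of an observation ingest payload. The sha256 and byte_length
# fields are derived from the raw payload; the idempotency_key shape
# here is a caller convention, not a daemon requirement.
def make_observation(payload, media_type, stream_id=None, seq_no=None):
    digest = hashlib.sha256(payload).hexdigest()
    obs = {
        "media_type": media_type,
        "sha256": digest,
        "byte_length": len(payload),
        "captured_at_ms": int(time.time() * 1000),
        "metadata": {},
        # A stable key lets the daemon deduplicate retried uploads.
        "idempotency_key": f"{digest}:{len(payload)}",
    }
    if stream_id is not None:
        obs["stream_id"] = stream_id
    if seq_no is not None:
        obs["seq_no"] = seq_no
    return obs

obs = make_observation(b"\x89PNG...", "image/png", stream_id="screen-main", seq_no=1)
```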

Materialization

Materialization converts an observation selection into a normal daemon run. The daemon currently supports three selection shapes:
  • explicit observation_ids
  • latest_from_source
  • latest_from_stream
The source controls whether materialization is allowed at all, and whether the resulting run may inherit or target non-daemon reply routes.
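The three selection shapes can be sketched as request fragments for POST /v1/observation-materializations. The nesting and field spellings beyond the documented names are assumptions:

```python
# Hypothetical builders for the three documented selection shapes.
def select_explicit(observation_ids):
    return {"selection": {"explicit": {"observation_ids": list(observation_ids)}}}

def select_latest_from_source(source_id):
    return {"selection": {"latest_from_source": {"source_id": source_id}}}

def select_latest_from_stream(source_id, stream_id):
    # stream_id is scoped to one source, so both identifiers are required.
    return {"selection": {"latest_from_stream": {"source_id": source_id,
                                                 "stream_id": stream_id}}}
```

Explicit ids give precise grouped selections; the two latest_from shapes cover the common "most recent capture" case without the caller tracking ids.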

Accepted media types

Current source-kind media acceptance is strict:
  • screen_snapshot: image/png, image/jpeg
  • webcam_snapshot: image/png, image/jpeg
  • microphone_segment: audio/wav, audio/webm
This is intentionally narrower than the general daemon asset store. Capture ingress is checked against the source kind before the observation is persisted. The broader asset store also accepts retained and generated audio formats such as audio/mpeg, audio/mp4, and audio/m4a. That wider media support does not change the stricter observation-source ingest contract.
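The strict per-kind acceptance table above can be sketched as a simple lookup; the table contents come from this document, while the function name is illustrative:

```python
# Strict per-source-kind media acceptance, as documented above.
ACCEPTED_MEDIA = {
    "screen_snapshot": {"image/png", "image/jpeg"},
    "webcam_snapshot": {"image/png", "image/jpeg"},
    "microphone_segment": {"audio/wav", "audio/webm"},
}

def accepts(kind, media_type):
    # Unknown kinds accept nothing; ingress is checked before persistence.
    return media_type in ACCEPTED_MEDIA.get(kind, set())
```

Note that audio/mpeg is valid in the wider asset store but rejected at capture ingress, which is the distinction the prose above draws.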

Current daemon-side behavior

The daemon already provides:
  • observations sources create|list|get
  • observations ingest
  • observations list|get
  • observations materialize
  • observations schedule
Observation listing and stream-scoped materialization support source_id + stream_id filtering. Stream identifiers are intentionally scoped to one source rather than treated as a global lookup key. The HTTP surface is exposed under:
  • POST /v1/observation-sources
  • GET /v1/observation-sources
  • GET /v1/observation-sources/{source_id}
  • POST /v1/observation-sources/{source_id}/observations
  • GET /v1/observations
  • GET /v1/observations/{observation_id}
  • POST /v1/observation-materializations
Schedules use the normal /v1/schedules surface with one embedded observation materialization request.
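Source- and stream-scoped listing against GET /v1/observations can be sketched as query construction. The query-parameter names mirror the field names above but are assumptions, as is the local base URL:

```python
from urllib.parse import urlencode

# Hypothetical helper building a scoped listing URL. stream_id is only
# meaningful together with its owning source, so it is never sent alone.
def list_observations_url(base, source_id, stream_id=None):
    params = {"source_id": source_id}
    if stream_id is not None:
        params["stream_id"] = stream_id
    return f"{base}/v1/observations?{urlencode(params)}"

url = list_observations_url("http://localhost:8080", "src-1", "mic-left")
```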

Capture runtimes

The daemon does not open microphones, webcams, or screens itself. Capture happens in an external producer that speaks the observation ingest contract. The current host-local reference runtime is kheish-capture, developed as a sibling repository to this daemon workspace. Its current implemented scope is:
  • fixture-based uploads for image/png, image/jpeg, audio/wav, and audio/webm
  • live screen capture to image/png or image/jpeg
  • live webcam capture to image/png or image/jpeg
  • live microphone capture to audio/wav
  • mixed-audio capture that can emit:
    • one mixed WAV only
    • one raw WAV leg for each input that produced signal
    • raw WAV legs plus one mixed WAV artifact
Its real-daemon tests currently validate the daemon observation contract through fixture and synthetic-driver paths, plus conditional live screen and webcam CLI paths when the environment exposes the required backends and devices. These tests do not yet validate every host-device backend end-to-end on every platform, and this documentation should not imply otherwise.

The daemon also supports direct derivations from one observation subject. Current built-in observation derivations include:
  • canonical_text
  • visual_preview
Those derivations create one derivation record plus one daemon-owned result asset. They do not create a new observation record. For audio specifically, canonical_text is a real daemon-owned speech-to-text path when a transcription backend is configured. The same canonical-text behavior is reused across:
  • audio observations
  • audio assets referenced by normal session input
  • connector-delivered audio that is first normalized into daemon-owned assets
The capability is provider-neutral at the daemon boundary, and the built-in transcription backends today are OpenAI and OpenRouter. Canonical-text derivation still depends on the configured transcription backend, not only on the selected run route.

Correlated audio today

Kheish can already store correlated audio artifacts, but the current daemon materialization model is still source-centric. The current reliable pattern for correlated audio is:
  • keep one daemon source_id
  • use distinct stream_id values for each uploaded leg or derived artifact
  • keep one shared group identifier in observation metadata when needed
  • use latest_from_stream when one stream-local selection is sufficient
  • have one external client materialize with explicit observation_ids when you need a precise grouped selection across multiple related artifacts
This matters for call capture and similar workflows. The daemon stores stream_id and seq_no, and it can now list, materialize, and schedule by source_id + stream_id. Those fields are selection and filtering aids today: the daemon does not yet perform arbitrary metadata grouping or multi-stream capture-session reconstruction on its own, and the default materialization context does not automatically turn stream_id, seq_no, or arbitrary caller metadata into grouped prompt-visible semantics. Documentation and integrations should therefore avoid promising automatic grouped call reconstruction on the daemon side.
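The correlated-audio pattern above can be sketched end to end. The call_group metadata key is a caller convention (not daemon semantics), and the explicit selection shape is an assumption about the materialization request body:

```python
# Sketch: distinct stream_id per audio leg, one shared group identifier
# in caller metadata. The daemon stores these fields but does not group
# by them; the external client resolves the group itself.
def upload_leg(stream_id, seq_no, call_group):
    return {
        "stream_id": stream_id,
        "seq_no": seq_no,
        "metadata": {"call_group": call_group},
    }

legs = [
    upload_leg("mic-local", 1, "call-42"),
    upload_leg("mic-remote", 1, "call-42"),
    upload_leg("mixed", 1, "call-42"),
]

# The client, not the daemon, turns the group into explicit ids
# before calling POST /v1/observation-materializations:
def materialize_by_ids(observation_ids):
    return {"selection": {"explicit": {"observation_ids": list(observation_ids)}}}
```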

Materialization defaults

Materialization is intentionally conservative:
  • image-like sources can include their raw asset by default
  • microphone observations remain transcript-first by default
  • operators can opt into raw microphone assets during materialization with raw_asset_policy = always
  • when canonical text exists for a microphone observation, the daemon materializes that text
  • when canonical text does not exist for a microphone observation, the daemon inserts one placeholder text notice; with raw_asset_policy = always, that notice is attached alongside the raw audio asset
  • when a transcription backend is configured, raw audio/wav and audio/webm observations can derive canonical_text daemon-side and persist it back onto the observation before materialization
  • the same daemon transcription service also handles supported audio assets outside the observation ingest path, such as imported or generated MP3 and M4A files
  • without uploader-supplied canonical text and without a configured transcription backend, microphone materialization falls back to the generic placeholder notice
  • source sensitivity and allow_output_delivery still constrain reply routing
For multi-artifact audio, explicit observation_ids are currently the best way to control exactly what the model receives.
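The transcript-first microphone defaults above can be sketched as one decision function. The raw_asset_policy = always spelling follows the prose; the default policy name, placeholder wording, and part shapes are illustrative assumptions:

```python
# Sketch of transcript-first microphone materialization.
PLACEHOLDER = "[no canonical text available for this audio observation]"

def materialize_microphone(canonical_text, raw_asset_policy="transcript_first"):
    parts = []
    if canonical_text is not None:
        # Canonical text exists, so the daemon materializes that text.
        parts.append({"text": canonical_text})
    else:
        # No canonical text: insert one placeholder text notice.
        parts.append({"text": PLACEHOLDER})
    if raw_asset_policy == "always":
        # Operators opt into raw audio; any text notice rides alongside it.
        parts.append({"raw_asset": True})
    return parts
```

By default the model sees text only; raw audio reaches the run only under the explicit always policy.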

Operator guidance

Use observations when:
  • you want capture data to survive client disconnects and daemon restarts
  • capture is produced outside the daemon process
  • one workflow may analyze the same capture more than once
  • one schedule should trigger future analysis from retained data
Use normal session input when:
  • the caller is sending one immediate instruction plus files
  • no separate retention or capture lifecycle is needed