Capture and observations

Kheish models capture as a generic daemon-owned observation flow:
  1. one external producer uploads data into one daemon-owned observation source
  2. the daemon stores one durable observation record plus raw asset metadata
  3. later, one operator or schedule materializes observations into a normal run
This keeps host-local device access, system permissions, and capture timing outside the agent runtime while preserving one stable control-plane contract.

Why capture is not a session input shortcut

Session input and capture solve different problems:
  • session input is for immediate user- or connector-submitted work
  • observations are for durable external state that may be processed later
An observation can be retained, filtered, materialized repeatedly, or ignored without mutating the session journal until you explicitly create a run from it.

Core objects

Observation source

An observation source is one stable daemon-owned ingest boundary. Each source carries:
  • source_id
  • kind
  • sensitivity
  • retention_seconds
  • max_active_observations
  • max_active_bytes
  • allow_materialization
  • allow_output_delivery
The daemon currently supports these source kinds:
  • screen_snapshot
  • webcam_snapshot
  • microphone_segment
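As a sketch, the source fields above can be combined into one creation request body for POST /v1/observation-sources. The field names follow the list above; the concrete values, the helper name, and any field spelling beyond the listed names are illustrative assumptions, not the authoritative schema:

```python
# Hypothetical observation-source creation payload.
# Field names mirror the documented source fields; values are illustrative.
def make_source_request(kind):
    allowed_kinds = {"screen_snapshot", "webcam_snapshot", "microphone_segment"}
    if kind not in allowed_kinds:
        raise ValueError(f"unsupported source kind: {kind}")
    return {
        "kind": kind,
        "sensitivity": "high",            # illustrative value
        "retention_seconds": 24 * 3600,   # retain observations for one day
        "max_active_observations": 500,
        "max_active_bytes": 512 * 1024 * 1024,
        "allow_materialization": True,
        "allow_output_delivery": False,   # conservative default for capture data
    }

req = make_source_request("microphone_segment")
```

The quota and retention fields let one daemon-owned source bound capture ingest without any per-upload negotiation.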

Observation

Each uploaded observation stores:
  • one daemon-owned raw asset_id
  • media_type
  • sha256
  • byte_length
  • captured_at_ms
  • received_at_ms
  • optional canonical_text_asset_id
  • optional stream_id
  • optional seq_no
  • caller-supplied metadata
  • one stable idempotency_key plus daemon request fingerprint
The daemon keeps raw payload storage and run orchestration separate. An observation exists before any model sees it.
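A minimal sketch of building one ingest payload for POST /v1/observation-sources/{source_id}/observations. The field names follow the list above; the helper name and the idempotency-key shape are assumptions for illustration:

```python
import hashlib
import time

# Sketch of an observation ingest payload. The sha256 and byte_length
# fields are derived from the raw payload; the idempotency_key shape
# here is a caller convention, not a daemon requirement.
def make_observation(payload, media_type, stream_id=None, seq_no=None):
    digest = hashlib.sha256(payload).hexdigest()
    obs = {
        "media_type": media_type,
        "sha256": digest,
        "byte_length": len(payload),
        "captured_at_ms": int(time.time() * 1000),
        "metadata": {},
        # A stable key lets the daemon deduplicate retried uploads.
        "idempotency_key": f"{digest}:{len(payload)}",
    }
    if stream_id is not None:
        obs["stream_id"] = stream_id
    if seq_no is not None:
        obs["seq_no"] = seq_no
    return obs

obs = make_observation(b"\x89PNG...", "image/png", stream_id="screen-main", seq_no=1)
```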

Materialization

Materialization converts an observation selection into a normal daemon run. The daemon currently supports three selection shapes:
  • explicit observation_ids
  • latest_from_source
  • latest_from_stream
The source controls whether materialization is allowed at all, and whether the resulting run may inherit or target non-daemon reply routes.
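The three selection shapes can be sketched as request fragments for POST /v1/observation-materializations. The nesting and field spellings beyond the documented names are assumptions:

```python
# Hypothetical builders for the three documented selection shapes.
def select_explicit(observation_ids):
    return {"selection": {"explicit": {"observation_ids": list(observation_ids)}}}

def select_latest_from_source(source_id):
    return {"selection": {"latest_from_source": {"source_id": source_id}}}

def select_latest_from_stream(source_id, stream_id):
    # stream_id is scoped to one source, so both identifiers are required.
    return {"selection": {"latest_from_stream": {"source_id": source_id,
                                                 "stream_id": stream_id}}}
```

Explicit ids give precise grouped selections; the two latest_from shapes cover the common "most recent capture" case without the caller tracking ids.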

Accepted media types

Current source-kind media acceptance is strict:
  • screen_snapshot: image/png, image/jpeg
  • webcam_snapshot: image/png, image/jpeg
  • microphone_segment: audio/wav, audio/webm
This is intentionally narrower than the general daemon asset store. Capture ingress is checked against the source kind before the observation is persisted. The broader asset store also accepts retained and generated audio formats such as audio/mpeg, audio/mp4, and audio/m4a. That wider media support does not change the stricter observation-source ingest contract.
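The strict per-kind acceptance table above can be sketched as a simple lookup; the table contents come from this document, while the function name is illustrative:

```python
# Strict per-source-kind media acceptance, as documented above.
ACCEPTED_MEDIA = {
    "screen_snapshot": {"image/png", "image/jpeg"},
    "webcam_snapshot": {"image/png", "image/jpeg"},
    "microphone_segment": {"audio/wav", "audio/webm"},
}

def accepts(kind, media_type):
    # Unknown kinds accept nothing; ingress is checked before persistence.
    return media_type in ACCEPTED_MEDIA.get(kind, set())
```

Note that audio/mpeg is valid in the wider asset store but rejected at capture ingress, which is the distinction the prose above draws.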

Current daemon-side behavior

The daemon already provides:
  • observations sources create|list|get
  • observations ingest
  • observations list|get
  • observations materialize
  • observations schedule
Observation listing and stream-scoped materialization support source_id + stream_id filtering. Stream identifiers are intentionally scoped to one source rather than treated as a global lookup key. The HTTP surface is exposed under:
  • POST /v1/observation-sources
  • GET /v1/observation-sources
  • GET /v1/observation-sources/{source_id}
  • POST /v1/observation-sources/{source_id}/observations
  • GET /v1/observations
  • GET /v1/observations/{observation_id}
  • POST /v1/observation-materializations
Schedules use the normal /v1/schedules surface with one embedded observation materialization request.
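Source- and stream-scoped listing against GET /v1/observations can be sketched as query construction. The query-parameter names mirror the field names above but are assumptions, as is the local base URL:

```python
from urllib.parse import urlencode

# Hypothetical helper building a scoped listing URL. stream_id is only
# meaningful together with its owning source, so it is never sent alone.
def list_observations_url(base, source_id, stream_id=None):
    params = {"source_id": source_id}
    if stream_id is not None:
        params["stream_id"] = stream_id
    return f"{base}/v1/observations?{urlencode(params)}"

url = list_observations_url("http://localhost:8080", "src-1", "mic-left")
```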

Capture runtimes

The daemon does not open microphones, webcams, or screens itself. Capture happens in an external producer that speaks the observation ingest contract. The current host-local reference runtime is kheish-capture, developed as a sibling repository to this daemon workspace. Its current implemented scope is:
  • fixture-based uploads for image/png, image/jpeg, audio/wav, and audio/webm
  • live screen capture to image/png or image/jpeg
  • live webcam capture to image/png or image/jpeg
  • live microphone capture to audio/wav
  • mixed-audio capture that can emit:
    • one mixed WAV only
    • one raw WAV leg for each input that produced signal
    • raw WAV legs plus one mixed WAV artifact
Its real-daemon tests currently validate the daemon observation contract through fixture and synthetic-driver paths, plus conditional live screen and webcam CLI paths when the environment exposes the required backends and devices. These tests do not yet validate every host-device backend end-to-end on every platform, and this documentation should not imply otherwise.

The daemon also supports direct derivations from one observation subject. Current built-in observation derivations include:
  • canonical_text
  • visual_preview
Those derivations create one derivation record plus one daemon-owned result asset. They do not create a new observation record. For audio specifically, canonical_text is a real daemon-owned speech-to-text path when a transcription backend is configured. The same canonical-text behavior is reused across:
  • audio observations
  • audio assets referenced by normal session input
  • connector-delivered audio that is first normalized into daemon-owned assets
The capability is provider-neutral at the daemon boundary, and the built-in transcription backends today are OpenAI and OpenRouter. Canonical-text derivation still depends on the configured transcription backend, not only on the selected run route.

Correlated audio today

Kheish can already store correlated audio artifacts, but the current daemon materialization model is still source-centric. The current reliable pattern for correlated audio is:
  • keep one daemon source_id
  • use distinct stream_id values for each uploaded leg or derived artifact
  • keep one shared group identifier in observation metadata when needed
  • use latest_from_stream when one stream-local selection is sufficient
  • have one external client materialize with explicit observation_ids when you need a precise grouped selection across multiple related artifacts
This matters for call capture and similar workflows. The daemon stores stream_id and seq_no, and it can now list, materialize, and schedule by source_id + stream_id. Those fields are selection and filtering aids today: the daemon does not yet perform arbitrary metadata grouping or multi-stream capture-session reconstruction on its own, and the default materialization context does not automatically turn stream_id, seq_no, or arbitrary caller metadata into grouped prompt-visible semantics. Documentation and integrations should therefore avoid promising automatic grouped call reconstruction on the daemon side.
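The correlated-audio pattern above can be sketched end to end. The call_group metadata key is a caller convention (not daemon semantics), and the explicit selection shape is an assumption about the materialization request body:

```python
# Sketch: distinct stream_id per audio leg, one shared group identifier
# in caller metadata. The daemon stores these fields but does not group
# by them; the external client resolves the group itself.
def upload_leg(stream_id, seq_no, call_group):
    return {
        "stream_id": stream_id,
        "seq_no": seq_no,
        "metadata": {"call_group": call_group},
    }

legs = [
    upload_leg("mic-local", 1, "call-42"),
    upload_leg("mic-remote", 1, "call-42"),
    upload_leg("mixed", 1, "call-42"),
]

# The client, not the daemon, turns the group into explicit ids
# before calling POST /v1/observation-materializations:
def materialize_by_ids(observation_ids):
    return {"selection": {"explicit": {"observation_ids": list(observation_ids)}}}
```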

Materialization defaults

Materialization is intentionally conservative:
  • image-like sources can include their raw asset by default
  • microphone observations remain transcript-first by default
  • operators can opt into raw microphone assets during materialization with raw_asset_policy = always
  • when canonical text exists for a microphone observation, the daemon materializes that text
  • when canonical text does not exist for a microphone observation, the daemon inserts one placeholder text notice; with raw_asset_policy = always, that notice is attached alongside the raw audio asset
  • when a transcription backend is configured, raw audio/wav and audio/webm observations can derive canonical_text daemon-side and persist it back onto the observation before materialization
  • the same daemon transcription service also handles supported audio assets outside the observation ingest path, such as imported or generated MP3 and M4A files
  • without uploader-supplied canonical text and without a configured transcription backend, microphone materialization falls back to the generic placeholder notice
  • source sensitivity and allow_output_delivery still constrain reply routing
For multi-artifact audio, explicit observation_ids are currently the best way to control exactly what the model receives.
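The transcript-first microphone defaults above can be sketched as one decision function. The raw_asset_policy = always spelling follows the prose; the default policy name, placeholder wording, and part shapes are illustrative assumptions:

```python
# Sketch of transcript-first microphone materialization.
PLACEHOLDER = "[no canonical text available for this audio observation]"

def materialize_microphone(canonical_text, raw_asset_policy="transcript_first"):
    parts = []
    if canonical_text is not None:
        # Canonical text exists, so the daemon materializes that text.
        parts.append({"text": canonical_text})
    else:
        # No canonical text: insert one placeholder text notice.
        parts.append({"text": PLACEHOLDER})
    if raw_asset_policy == "always":
        # Operators opt into raw audio; any text notice rides alongside it.
        parts.append({"raw_asset": True})
    return parts
```

By default the model sees text only; raw audio reaches the run only under the explicit always policy.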

Operator guidance

Use observations when:
  • you want capture data to survive client disconnects and daemon restarts
  • capture is produced outside the daemon process
  • one workflow may analyze the same capture more than once
  • one schedule should trigger future analysis from retained data
Use normal session input when:
  • the caller is sending one immediate instruction plus files
  • no separate retention or capture lifecycle is needed