Assets and multimodal input
Kheish treats files as daemon-owned assets. Whether a caller uploads a file inline with one input or imports it ahead of time, the daemon persists a normalized asset record and references it from later session state.
Why assets exist
The asset store gives Kheish a stable way to handle multimodal input in a daemon-first system:
- files survive client disconnects and daemon restarts
- session journals keep stable asset references instead of host-specific paths
- the same file can be reused across multiple runs or sessions
- provider-specific request payloads are encoded from daemon-managed asset state instead of caller-local blobs
Supported media types
The daemon currently supports these asset media types:
- text/plain
- text/csv
- text/markdown
- application/json
- application/pdf
- application/dxf
- image/png
- image/jpeg
- audio/wav
- audio/webm
- audio/mpeg
- audio/mpga
- audio/opus
- audio/aac
- audio/flac
- audio/mp4
- audio/m4a
- audio/pcm
- audio/l16
- audio/l24
fmt /data, complete MP3 frames, Ogg Opus
packet lacing, AAC frame lengths, FLAC STREAMINFO, MP4/M4A audio track handlers, declared
sample-rate/channel limits where the container exposes them, and in-limit duration where the
container makes duration calculable; raw PCM has no container signature or channel/sample-rate
metadata, so it is limited to non-empty aligned payloads.
Audio assets are used for retained outputs and, when a transcription backend is configured, for
daemon-derived canonical text. Observation-source uploads remain narrower: microphone sources still
accept only audio/wav and audio/webm.
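A client can check a media type locally before attempting an upload. The following is a minimal sketch, not the daemon's own validation: the constant and function names are hypothetical, and stripping parameters such as "; charset=utf-8" before matching is an assumption about how a caller might normalize the value.

```python
# Media types listed in this section; the names below are illustrative only.
SUPPORTED_ASSET_MEDIA_TYPES = frozenset({
    "text/plain", "text/csv", "text/markdown",
    "application/json", "application/pdf", "application/dxf",
    "image/png", "image/jpeg",
    "audio/wav", "audio/webm", "audio/mpeg", "audio/mpga", "audio/opus",
    "audio/aac", "audio/flac", "audio/mp4", "audio/m4a",
    "audio/pcm", "audio/l16", "audio/l24",
})

# Microphone observation sources accept a narrower set.
MICROPHONE_SOURCE_MEDIA_TYPES = frozenset({"audio/wav", "audio/webm"})


def normalize_media_type(value: str) -> str:
    """Drop any parameters and lowercase the type (client-side assumption)."""
    return value.split(";", 1)[0].strip().lower()


def is_supported_asset(media_type: str) -> bool:
    """True when the type appears in the supported asset list."""
    return normalize_media_type(media_type) in SUPPORTED_ASSET_MEDIA_TYPES


def is_accepted_microphone_upload(media_type: str) -> bool:
    """True when the type is allowed for microphone observation sources."""
    return normalize_media_type(media_type) in MICROPHONE_SOURCE_MEDIA_TYPES
```

A check like this only filters by declared type; the daemon remains the authority on whether a payload is actually accepted.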
Asset lifecycle
There are two normal ways to create an asset:
- import it explicitly through POST /v1/assets or kheish-daemon assets import
- upload it inline as part of a session input request
Either way, the daemon persists a normalized asset record with:
- a stable asset_id
- normalized media_type
- file_name
- sha256
- byte_length
- an opaque raw storage URI
- an opaque derived text_uri when the asset is renderable as bounded text, including daemon-derived transcripts for supported audio assets
- an opaque preview_image_uri plus preview_image_media_type when the daemon derives a visual preview, such as for DXF
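The record fields can be pictured as a small data structure. This is a sketch for illustration, not the daemon's schema: the class, the describe_payload helper, and the raw_uri field name are hypothetical (the source only says the raw storage URI is opaque), and hex encoding for sha256 is an assumption.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AssetRecord:
    asset_id: str
    media_type: str
    file_name: str
    sha256: str                 # assumed to be a hex digest
    byte_length: int
    raw_uri: str                # hypothetical name for the opaque raw storage URI
    text_uri: Optional[str] = None                  # only for text-renderable assets
    preview_image_uri: Optional[str] = None         # only when a preview is derived
    preview_image_media_type: Optional[str] = None


def describe_payload(asset_id: str, file_name: str, media_type: str,
                     payload: bytes, raw_uri: str) -> AssetRecord:
    """Build a record whose digest and length are derived from the payload bytes."""
    return AssetRecord(
        asset_id=asset_id,
        media_type=media_type,
        file_name=file_name,
        sha256=hashlib.sha256(payload).hexdigest(),
        byte_length=len(payload),
        raw_uri=raw_uri,
    )
```

Because the URIs are opaque, callers should pass them back to the daemon unchanged rather than parse them.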
Input submission shapes
Session input supports two shapes.
Legacy text plus attachments
This shape is convenient for the CLI and simple integrations.
- content carries the free-form text body
- attachments appends assets after the text body
On the CLI, attachments are passed as flags:
--file /path/to/file
--asset asset-123
Ordered rich input
Use input_items when you need precise ordering between text and files.
Each item can be:
- text
- asset_reference
- board_reference
- inline_asset
This allows sequences such as:
- text, then one document, then more text
- text, then one image, then more text
- text plus a mix of stored assets and fresh uploads
- text plus a session-visible board revision
input_items must not be combined with legacy content or attachments.
board_reference points at a daemon board visible to the session and can optionally pin a specific revision_id.
The same ordered multimodal shape is also used by sidechain subtasks, public channel messages, and agent-facing orchestration tools. This keeps one text-plus-assets payload structured all the way through the daemon instead of flattening it into plain text between agents.
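The rule that input_items excludes the legacy fields can be checked on the client before a request is sent. The sketch below assumes a JSON-like dictionary payload; the "type" discriminator key and the function name are hypothetical, and only the four item kinds and the top-level field names come from this section.

```python
ITEM_KINDS = {"text", "asset_reference", "board_reference", "inline_asset"}


def validate_session_input(payload: dict) -> str:
    """Return 'ordered' or 'legacy'; raise ValueError for a mixed or unknown shape."""
    uses_legacy = "content" in payload or "attachments" in payload
    if "input_items" in payload:
        if uses_legacy:
            raise ValueError(
                "input_items must not be combined with content or attachments")
        for index, item in enumerate(payload["input_items"]):
            kind = item.get("type")  # assumed discriminator key
            if kind not in ITEM_KINDS:
                raise ValueError(f"item {index}: unknown item kind {kind!r}")
        return "ordered"
    if uses_legacy:
        return "legacy"
    raise ValueError("payload carries neither input_items nor legacy fields")


# Text, then a stored asset, then more text, then a pinned board revision.
# Keys other than revision_id are placeholders for illustration.
ordered_payload = {
    "input_items": [
        {"type": "text", "text": "Compare this drawing"},
        {"type": "asset_reference", "asset_id": "asset-123"},
        {"type": "text", "text": "with the current board."},
        {"type": "board_reference", "board_id": "board-1", "revision_id": "rev-7"},
    ]
}
```

Keeping the items in a list means the order written by the caller is the order the validator and any later encoder see.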
Provider behavior
The runtime keeps one provider-neutral representation and only encodes provider-specific payloads at execution time.
- Supported images remain true image parts and require a vision-capable model route.
- Supported documents can still be sent to non-vision models because the daemon extracts and renders bounded text for them.
- Supported audio assets can contribute daemon-derived transcript text when a transcription backend is configured.
- Some document types, such as DXF, can contribute both derived text and a visual preview when the selected route accepts image parts.
- Ordered input_items preserve the caller-specified sequence across provider adapters.
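The rules above amount to a small decision per asset at encoding time. The following sketch models that decision under stated assumptions: AssetView, plan_parts, and the part labels are hypothetical names, and raising an error for an image on a non-vision route is one possible reading of "require", not confirmed daemon behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AssetView:
    media_type: str
    has_text: bool      # derived text (or transcript) is available
    has_preview: bool   # a derived preview image is available


def plan_parts(asset: AssetView, route_accepts_images: bool) -> List[str]:
    """List the kinds of parts an asset would contribute on a given route."""
    if asset.media_type.startswith("image/"):
        if not route_accepts_images:
            raise ValueError("image assets require a vision-capable model route")
        return ["image"]
    parts: List[str] = []
    if asset.has_text:
        parts.append("text")           # bounded text works on any route
    if asset.has_preview and route_accepts_images:
        parts.append("preview_image")  # e.g. a DXF preview on a vision route
    return parts
```

An audio asset with no configured transcription backend would have has_text set to False and contribute no parts in this model.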
Storage and normalization limits
Current daemon-side normalization includes these important limits:
- raw asset payloads are limited to 12 MiB
- images are normalized to a maximum edge of 2048 pixels
- normalized images are capped at 4 MiB
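These limits translate into simple arithmetic a client can apply before uploading. The sketch below assumes that the 12 MiB limit is inclusive, that downscaling preserves aspect ratio, and that images already within 2048 pixels are left unchanged; none of these details are confirmed by the text above, and all names are hypothetical.

```python
from typing import Tuple

MIB = 1024 * 1024
MAX_RAW_BYTES = 12 * MIB               # raw asset payload limit
MAX_IMAGE_EDGE = 2048                  # longest edge after normalization
MAX_NORMALIZED_IMAGE_BYTES = 4 * MIB   # cap on the normalized image


def fits_raw_limit(byte_length: int) -> bool:
    """True when a payload is within the raw size limit (assumed inclusive)."""
    return byte_length <= MAX_RAW_BYTES


def normalized_dimensions(width: int, height: int) -> Tuple[int, int]:
    """Scale so the longest edge is at most MAX_IMAGE_EDGE, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= MAX_IMAGE_EDGE:
        return width, height

    def scale(edge: int) -> int:
        # Integer arithmetic with round-half-up avoids floating-point drift.
        return max(1, (edge * MAX_IMAGE_EDGE + longest // 2) // longest)

    return scale(width), scale(height)
```

For example, a 4000 by 3000 image would come out as 2048 by 1536 under these assumptions.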
