Assets and multimodal input

Kheish treats files as daemon-owned assets. Whether a caller uploads a file inline with one input or imports it ahead of time, the daemon persists a normalized asset record and references it from later session state.

Why assets exist

The asset store gives Kheish a stable way to handle multimodal input in a daemon-first system:
  • files survive client disconnects and daemon restarts
  • session journals keep stable asset references instead of host-specific paths
  • the same file can be reused across multiple runs or sessions
  • provider-specific request payloads are encoded from daemon-managed asset state instead of caller-local blobs

Supported media types

The daemon currently supports these asset media types:
  • text/plain
  • text/csv
  • text/markdown
  • application/json
  • application/pdf
  • application/dxf
  • image/png
  • image/jpeg
  • audio/wav
  • audio/webm
  • audio/mpeg
  • audio/mp4
  • audio/m4a
  • audio/l16
  • audio/l24
Images are treated as true multimodal inputs. Supported document formats are stored as raw files plus derived text when the daemon can extract it. DXF assets can also expose one daemon-derived PNG preview, so a vision-capable route can inspect the same plan visually without a separate screenshot upload.

Audio assets are also daemon-owned files. They are used for retained outputs and, when a transcription backend is configured, for daemon-derived canonical text. Observation-source uploads are more restricted: microphone sources still accept only audio/wav and audio/webm.
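
As a rough client-side illustration of the list above, the following sketch checks a media type before upload. The function name, the `microphone` flag, and the normalization step (lowercasing and dropping parameters such as `charset`) are assumptions made for this example, not documented daemon behavior.

```python
# Media types copied from the list above.
SUPPORTED_MEDIA_TYPES = frozenset({
    "text/plain", "text/csv", "text/markdown",
    "application/json", "application/pdf", "application/dxf",
    "image/png", "image/jpeg",
    "audio/wav", "audio/webm", "audio/mpeg", "audio/mp4",
    "audio/m4a", "audio/l16", "audio/l24",
})

# Microphone observation sources accept a narrower set.
MICROPHONE_MEDIA_TYPES = frozenset({"audio/wav", "audio/webm"})


def is_supported(media_type: str, microphone: bool = False) -> bool:
    """Hypothetical pre-flight check; the daemon remains the authority."""
    # Assumed normalization: drop parameters and compare case-insensitively.
    normalized = media_type.split(";")[0].strip().lower()
    allowed = MICROPHONE_MEDIA_TYPES if microphone else SUPPORTED_MEDIA_TYPES
    return normalized in allowed
```

A check like this only avoids an obviously doomed upload; the daemon's own validation decides what is actually accepted.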

Asset lifecycle

There are two normal ways to create an asset:
  • import it explicitly through POST /v1/assets or kheish-daemon assets import
  • upload it inline as part of a session input request
In both cases the daemon persists:
  • a stable asset_id
  • normalized media_type
  • file_name
  • sha256
  • byte_length
  • an opaque raw storage URI
  • an opaque derived text_uri when the asset is renderable as bounded text, including daemon-derived transcripts for supported audio assets
  • an opaque preview_image_uri plus preview_image_media_type when the daemon derives a visual preview, such as for DXF
Imports are deduplicated by normalized media type and content digest.
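
To make the record shape and the deduplication rule concrete, here is a minimal in-memory sketch. The field names follow the list above, but `raw_uri`, the `mem://` URI scheme, the `asset-N` ID format, and the `InMemoryAssetStore` class are hypothetical and exist only for illustration; the real daemon's storage and ID scheme are not described here.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class AssetRecord:
    asset_id: str
    media_type: str
    file_name: str
    sha256: str
    byte_length: int
    raw_uri: str  # hypothetical name for the opaque raw storage URI
    text_uri: Optional[str] = None
    preview_image_uri: Optional[str] = None
    preview_image_media_type: Optional[str] = None


class InMemoryAssetStore:
    """Toy store that deduplicates on (normalized media type, sha256)."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], AssetRecord] = {}

    def import_asset(self, file_name: str, media_type: str, data: bytes) -> AssetRecord:
        normalized = media_type.strip().lower()  # assumed normalization
        digest = hashlib.sha256(data).hexdigest()
        key = (normalized, digest)
        existing = self._by_key.get(key)
        if existing is not None:
            # Same media type and content: reuse the first record.
            # How the daemon treats a differing file_name is not specified.
            return existing
        record = AssetRecord(
            asset_id=f"asset-{len(self._by_key) + 1}",  # made-up ID format
            media_type=normalized,
            file_name=file_name,
            sha256=digest,
            byte_length=len(data),
            raw_uri=f"mem://raw/{digest}",  # made-up opaque URI
        )
        self._by_key[key] = record
        return record
```

In this sketch, importing the same bytes twice under the same media type returns the same record, while the same bytes under a different media type produce a separate asset.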

Input submission shapes

Session input supports two shapes.

Legacy text plus attachments

This shape is convenient for the CLI and simple integrations.
  • content carries the free-form text body
  • attachments appends assets after the text body
The daemon CLI currently maps these flags into this legacy attachment shape:
  • --file /path/to/file
  • --asset asset-123

Ordered rich input

Use input_items when you need precise ordering between text and files. Each item can be:
  • text
  • asset_reference
  • inline_asset
This is the correct shape for prompts such as:
  • text, then one document, then more text
  • text, then one image, then more text
  • text plus a mix of stored assets and fresh uploads
input_items must not be combined with legacy content or attachments. The same ordered multimodal shape is also used by sidechain subtasks, public channel messages, and agent-facing orchestration tools. This keeps one text-plus-assets payload structured all the way through the daemon instead of flattening it into plain text between agents.
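
The two shapes and the rule against mixing them can be sketched as plain dictionaries plus a small validator. The top-level keys `content`, `attachments`, and `input_items` and the three item kinds come from the text above; the `type` discriminator key, the per-item field names, and the validator itself are assumptions for illustration, not the daemon's actual schema.

```python
from typing import Any, Dict

ITEM_KINDS = {"text", "asset_reference", "inline_asset"}


def validate_session_input(payload: Dict[str, Any]) -> str:
    """Return "rich" or "legacy"; raise ValueError for mixed or empty input."""
    has_items = "input_items" in payload
    has_legacy = "content" in payload or "attachments" in payload
    if has_items and has_legacy:
        raise ValueError("input_items must not be combined with content or attachments")
    if has_items:
        for item in payload["input_items"]:
            kind = item.get("type")  # "type" is an assumed discriminator key
            if kind not in ITEM_KINDS:
                raise ValueError(f"unknown item kind: {kind!r}")
        return "rich"
    if has_legacy:
        return "legacy"
    raise ValueError("empty input")


# Legacy shape: text body, then attachments appended after it.
legacy_payload = {
    "content": "Summarize this file.",
    "attachments": [{"asset_id": "asset-123"}],  # hypothetical attachment fields
}

# Ordered rich shape: text, then one stored document, then more text.
rich_payload = {
    "input_items": [
        {"type": "text", "text": "Compare this plan"},
        {"type": "asset_reference", "asset_id": "asset-123"},
        {"type": "text", "text": "with the notes from last week."},
    ]
}
```

The rich payload keeps the document between the two text items, which the legacy shape cannot express because attachments always follow the text body.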

Provider behavior

The runtime keeps one provider-neutral representation and only encodes provider-specific payloads at execution time.
  • Supported images remain true image parts and require a vision-capable model route.
  • Supported documents can still be sent to non-vision models because the daemon extracts and renders bounded text for them.
  • Supported audio assets can contribute daemon-derived transcript text when a transcription backend is configured.
  • Some document types, such as DXF, can contribute both derived text and a visual preview when the selected route accepts image parts.
  • Ordered input_items preserve the caller-specified sequence across provider adapters.
This means a daemon can accept document inputs on a text-only route, while image inputs require a route that supports image parts.
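
The routing rules above can be illustrated with a small decision function. This is a simplified sketch, not the runtime's adapter code: the `Asset` class, its flags, the part labels, and `plan_parts` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Asset:
    media_type: str
    has_text: bool = False     # daemon-derived text or transcript is available
    has_preview: bool = False  # daemon-derived preview image is available


def plan_parts(asset: Asset, route_accepts_images: bool) -> List[str]:
    """Decide which parts of one asset could be sent on a given route."""
    if asset.media_type.startswith("image/"):
        # Images stay true image parts, so a text-only route cannot take them.
        if not route_accepts_images:
            raise ValueError("image input requires a vision-capable route")
        return ["image"]
    parts: List[str] = []
    if asset.has_text:
        parts.append("text")
    if asset.has_preview and route_accepts_images:
        parts.append("preview_image")
    # An empty list means nothing could be derived; how the daemon handles
    # that case is not specified in this document.
    return parts
```

Under these assumptions, a PDF with derived text yields `["text"]` on any route, a DXF asset adds `"preview_image"` only on a route that accepts image parts, and a PNG on a text-only route is rejected.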

Storage and normalization limits

Daemon-side normalization currently applies these limits:
  • raw asset payloads are limited to 12 MiB
  • images are normalized to a maximum edge of 2048 pixels
  • normalized images are capped at 4 MiB
These limits are enforced by the daemon before provider routing.
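
A caller can estimate the effect of these limits before uploading. The sketch below encodes the three numbers from the list; the helper names are hypothetical, and two details are assumptions rather than documented behavior: that the 12 MiB limit is inclusive, and that downscaling preserves the aspect ratio.

```python
from typing import Tuple

MAX_RAW_BYTES = 12 * 1024 * 1024              # 12 MiB raw payload limit
MAX_IMAGE_EDGE = 2048                         # maximum edge in pixels
MAX_NORMALIZED_IMAGE_BYTES = 4 * 1024 * 1024  # 4 MiB cap after normalization


def check_raw_size(byte_length: int) -> None:
    """Raise ValueError if a raw payload exceeds the limit (assumed inclusive)."""
    if byte_length > MAX_RAW_BYTES:
        raise ValueError(f"payload is {byte_length} bytes; limit is {MAX_RAW_BYTES}")


def fit_within_max_edge(width: int, height: int) -> Tuple[int, int]:
    """Scale dimensions so the longest edge is at most MAX_IMAGE_EDGE.

    Assumes the aspect ratio is preserved and smaller images are left alone.
    """
    longest = max(width, height)
    if longest <= MAX_IMAGE_EDGE:
        return width, height
    scale = MAX_IMAGE_EDGE / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, under these assumptions a 4096 by 2048 image would be reduced to 2048 by 1024, while a 1000 by 500 image would keep its dimensions.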