Assets and multimodal input
Kheish treats files as daemon-owned assets. Whether a caller uploads a file inline with one input or imports it ahead of time, the daemon persists a normalized asset record and references it from later session state.
Why assets exist
The asset store gives Kheish a stable way to handle multimodal input in a daemon-first system:
- files survive client disconnects and daemon restarts
- session journals keep stable asset references instead of host-specific paths
- the same file can be reused across multiple runs or sessions
- provider-specific request payloads are encoded from daemon-managed asset state instead of caller-local blobs
Supported media types
The daemon currently supports these asset media types:
- text/plain
- text/csv
- text/markdown
- application/json
- application/pdf
- application/dxf
- image/png
- image/jpeg
- audio/wav
- audio/webm
- audio/mpeg
- audio/mp4
- audio/m4a
- audio/l16
- audio/l24
audio/wav and audio/webm.
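A client can screen files against the media types listed above before sending them. The following is a minimal sketch, not part of Kheish: the function name is hypothetical, and stripping parameters such as a charset suffix is an assumption about what a caller might want, not documented daemon behavior. The daemon remains the authority on what it accepts.

```python
# Media types listed in this section as supported by the daemon.
SUPPORTED_MEDIA_TYPES = frozenset({
    "text/plain", "text/csv", "text/markdown",
    "application/json", "application/pdf", "application/dxf",
    "image/png", "image/jpeg",
    "audio/wav", "audio/webm", "audio/mpeg", "audio/mp4",
    "audio/m4a", "audio/l16", "audio/l24",
})


def is_supported_media_type(media_type: str) -> bool:
    """Hypothetical client-side pre-check against the documented list."""
    # Assumption: ignore parameters such as "; charset=utf-8" and compare
    # case-insensitively. The daemon may normalize differently.
    base = media_type.split(";", 1)[0].strip().lower()
    return base in SUPPORTED_MEDIA_TYPES
```

A pre-check like this only avoids an obviously doomed upload; it does not replace the daemon's own normalization.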
Asset lifecycle
There are two normal ways to create an asset:
- import it explicitly through POST /v1/assets or kheish-daemon assets import
- upload it inline as part of a session input request
Either way, the daemon stores an asset record that includes:
- a stable asset_id
- normalized media_type
- file_name
- sha256
- byte_length
- an opaque raw storage URI
- an opaque derived text_uri when the asset is renderable as bounded text, including daemon-derived transcripts for supported audio assets
- an opaque preview_image_uri plus preview_image_media_type when the daemon derives a visual preview, such as for DXF
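The record fields above can be pictured as a small immutable structure. This is an illustrative sketch only: the class and helper are hypothetical, raw_uri is an invented name for the opaque raw storage URI (the source does not name that field), and a hex-encoded digest is an assumption about how sha256 is represented.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AssetRecord:
    """Hypothetical client-side mirror of a daemon asset record."""
    asset_id: str
    media_type: str
    file_name: str
    sha256: str
    byte_length: int
    raw_uri: str  # assumed name for the opaque raw storage URI
    text_uri: Optional[str] = None  # present only when bounded text is derived
    preview_image_uri: Optional[str] = None  # present only with a derived preview
    preview_image_media_type: Optional[str] = None


def describe_payload(data: bytes) -> Tuple[str, int]:
    """Return (hex SHA-256 digest, byte length) for a raw payload."""
    return hashlib.sha256(data).hexdigest(), len(data)


digest, length = describe_payload(b"hello")
record = AssetRecord(
    asset_id="asset-123",
    media_type="text/plain",
    file_name="hello.txt",
    sha256=digest,
    byte_length=length,
    raw_uri="opaque://raw/asset-123",  # placeholder value, not a real URI scheme
)
```

Because the URIs are opaque, a caller should store and pass them back unchanged rather than parse them.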
Input submission shapes
Session input supports two shapes.
Legacy text plus attachments
This shape is convenient for the CLI and simple integrations.
- content carries the free-form text body
- attachments appends assets after the text body
From the CLI, attachments are supplied with flags such as:
- --file /path/to/file
- --asset asset-123
Ordered rich input
Use input_items when you need precise ordering between text and files.
Each item can be:
- text
- asset_reference
- inline_asset
This allows sequences such as:
- text, then one document, then more text
- text, then one image, then more text
- text plus a mix of stored assets and fresh uploads
input_items must not be combined with legacy content or attachments.
The same ordered multimodal shape is also used by sidechain subtasks, public channel messages, and agent-facing orchestration tools. This keeps one text-plus-assets payload structured all the way through the daemon instead of flattening it into plain text between agents.
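The rule that input_items must not be mixed with the legacy fields can be checked before a request is sent. The sketch below is an assumption-laden illustration: the helper names and the per-item keys ("type", "text", "asset_id") are hypothetical, since the exact wire format is not given here. Only the top-level field names input_items, content, and attachments come from this section.

```python
from typing import Any, Dict


def text_item(text: str) -> Dict[str, Any]:
    # Hypothetical item layout; the real key names may differ.
    return {"type": "text", "text": text}


def asset_reference_item(asset_id: str) -> Dict[str, Any]:
    # Hypothetical item layout for a previously stored asset.
    return {"type": "asset_reference", "asset_id": asset_id}


def validate_session_input(body: Dict[str, Any]) -> None:
    """Reject bodies that mix input_items with legacy content/attachments."""
    has_items = "input_items" in body
    has_legacy = "content" in body or "attachments" in body
    if has_items and has_legacy:
        raise ValueError(
            "input_items must not be combined with content or attachments"
        )
    if not has_items and not has_legacy:
        raise ValueError("expected input_items or content/attachments")


# Text, then one stored document, then more text, in caller-specified order.
ordered_body = {
    "input_items": [
        text_item("Summarize this report:"),
        asset_reference_item("asset-123"),
        text_item("Focus on the risks section."),
    ]
}
validate_session_input(ordered_body)
```

Keeping the items in a list makes the ordering explicit, which is the point of the ordered shape.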
Provider behavior
The runtime keeps one provider-neutral representation and only encodes provider-specific payloads at execution time.
- Supported images remain true image parts and require a vision-capable model route.
- Supported documents can still be sent to non-vision models because the daemon extracts and renders bounded text for them.
- Supported audio assets can contribute daemon-derived transcript text when a transcription backend is configured.
- Some document types, such as DXF, can contribute both derived text and a visual preview when the selected route accepts image parts.
- Ordered input_items preserve the caller-specified sequence across provider adapters.
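The behavior above can be modeled as a single pass over provider-neutral items. This is a simplified sketch rather than the runtime's actual adapter code: the Item type, its fields, and encode_parts are all hypothetical, and audio transcripts are left out for brevity. It only illustrates three of the rules: images need a vision-capable route, documents fall back to derived text, and order is preserved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Item:
    """Hypothetical provider-neutral item."""
    kind: str  # "text", "image", or "document"
    text: Optional[str] = None  # literal text, or derived text for documents
    image_ref: Optional[str] = None  # image asset, or derived preview


def encode_parts(items: List[Item], vision_capable: bool) -> List[Tuple[str, str]]:
    """Encode items into ordered (part_type, value) pairs for one route."""
    parts: List[Tuple[str, str]] = []
    for item in items:
        if item.kind == "text":
            parts.append(("text", item.text or ""))
        elif item.kind == "image":
            if not vision_capable:
                raise ValueError("image parts require a vision-capable model route")
            parts.append(("image", item.image_ref or ""))
        elif item.kind == "document":
            # Documents always contribute derived text.
            parts.append(("text", item.text or ""))
            # A derived preview (for example, for DXF) is added only when
            # the route accepts image parts.
            if vision_capable and item.image_ref:
                parts.append(("image", item.image_ref))
        else:
            raise ValueError(f"unknown item kind: {item.kind}")
    return parts
```

For example, a DXF document with both derived text and a preview yields one text part on a non-vision route and a text part followed by an image part on a vision-capable one.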
Storage and normalization limits
Current daemon-side normalization includes these important limits:
- raw asset payloads are limited to 12 MiB
- images are normalized to a maximum edge of 2048 pixels
- normalized images are capped at 4 MiB
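These limits translate into simple arithmetic that a client can apply before uploading. The sketch below is illustrative only: the constant and function names are hypothetical, treating the 12 MiB limit as inclusive is an assumption, and so is the idea that downscaling preserves aspect ratio, since the source states only the maximum edge.

```python
from typing import Tuple

MIB = 1024 * 1024
MAX_RAW_BYTES = 12 * MIB  # raw asset payload limit
MAX_IMAGE_EDGE = 2048  # maximum edge in pixels after normalization
MAX_NORMALIZED_IMAGE_BYTES = 4 * MIB  # cap on normalized images


def fits_raw_limit(byte_length: int) -> bool:
    """Assumes the 12 MiB limit is inclusive."""
    return byte_length <= MAX_RAW_BYTES


def scaled_dimensions(width: int, height: int) -> Tuple[int, int]:
    """Estimate dimensions after capping the longer edge at MAX_IMAGE_EDGE.

    Assumes aspect-ratio-preserving downscaling; images already within
    the limit are returned unchanged.
    """
    longest = max(width, height)
    if longest <= MAX_IMAGE_EDGE:
        return width, height
    scale = MAX_IMAGE_EDGE / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions, a 4096 by 2048 image would come out at 2048 by 1024, while an 800 by 600 image would be left as is.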
