Assets and multimodal input

Kheish treats files as daemon-owned assets. Whether a caller uploads a file inline with one input or imports it ahead of time, the daemon persists a normalized asset record and references it from later session state.

Why assets exist

The asset store gives Kheish a stable way to handle multimodal input in a daemon-first system:

files survive client disconnects and daemon restarts
session journals keep stable asset references instead of host-specific paths
the same file can be reused across multiple runs or sessions
provider-specific request payloads are encoded from daemon-managed asset state instead of caller-local blobs

Supported media types

The daemon currently supports these asset media types:

text/plain
text/csv
text/markdown
application/json
application/pdf
application/dxf
image/png
image/jpeg
audio/wav
audio/webm
audio/mpeg
audio/mp4
audio/m4a
audio/l16
audio/l24

Images are treated as true multimodal inputs. Supported document formats are stored as raw files plus derived text when the daemon can extract it.

DXF assets can also expose one daemon-derived PNG preview so a vision-capable route can inspect the same plan visually without requiring a separate screenshot upload.

Audio assets are also daemon-owned files. They are used for retained outputs and, when a transcription backend is configured, for daemon-derived canonical text.

Observation-source uploads remain narrower: microphone sources still accept only audio/wav and audio/webm.
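
As a rough illustration, a client could pre-check a media type before uploading. The sketch below copies the supported list into a set. The function names and the normalization step (dropping parameters, trimming, lowercasing) are assumptions made for the example, not the daemon's actual logic.

```python
# Media types copied from the supported list in this section.
SUPPORTED_ASSET_MEDIA_TYPES = frozenset({
    "text/plain", "text/csv", "text/markdown",
    "application/json", "application/pdf", "application/dxf",
    "image/png", "image/jpeg",
    "audio/wav", "audio/webm", "audio/mpeg", "audio/mp4",
    "audio/m4a", "audio/l16", "audio/l24",
})

# Microphone observation sources accept a narrower set.
MICROPHONE_SOURCE_MEDIA_TYPES = frozenset({"audio/wav", "audio/webm"})


def normalize_media_type(raw: str) -> str:
    """Hypothetical normalization: drop parameters, trim, lowercase."""
    return raw.split(";", 1)[0].strip().lower()


def is_supported_asset(raw: str) -> bool:
    """True if the normalized type is in the general asset list."""
    return normalize_media_type(raw) in SUPPORTED_ASSET_MEDIA_TYPES


def is_supported_microphone_upload(raw: str) -> bool:
    """True if the normalized type is accepted for microphone sources."""
    return normalize_media_type(raw) in MICROPHONE_SOURCE_MEDIA_TYPES
```

A client-side check like this only avoids obviously doomed uploads; the daemon remains the authority on what it accepts.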

Asset lifecycle

There are two normal ways to create an asset:

import it explicitly through POST /v1/assets or kheish-daemon assets import
upload it inline as part of a session input request

In both cases the daemon persists:

a stable asset_id
normalized media_type
file_name
sha256
byte_length
an opaque raw storage URI
an opaque derived text_uri when the asset is renderable as bounded text, including daemon-derived transcripts for supported audio assets
an opaque preview_image_uri plus preview_image_media_type when the daemon derives a visual preview, such as for DXF

Imports are deduplicated by normalized media type and content digest.
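
The deduplication rule can be sketched with an in-memory store keyed by the pair (normalized media type, SHA-256 digest). This is a minimal illustration, not the daemon's implementation: the class names, the `raw_uri` field name, the `asset-N` id scheme, and the `mem://` URIs are all placeholders.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetRecord:
    # Field names follow the persisted fields listed above.
    asset_id: str
    media_type: str
    file_name: str
    sha256: str
    byte_length: int
    raw_uri: str  # hypothetical name for the opaque raw storage URI


class InMemoryAssetStore:
    """Illustrative store keyed by (normalized media type, content digest)."""

    def __init__(self) -> None:
        self._by_key = {}

    def import_asset(self, media_type: str, file_name: str, data: bytes) -> AssetRecord:
        # Assumed normalization: drop parameters, trim, lowercase.
        normalized = media_type.split(";", 1)[0].strip().lower()
        digest = hashlib.sha256(data).hexdigest()
        key = (normalized, digest)
        existing = self._by_key.get(key)
        if existing is not None:
            return existing  # duplicate import: reuse the stored record
        record = AssetRecord(
            asset_id=f"asset-{len(self._by_key) + 1}",  # placeholder id scheme
            media_type=normalized,
            file_name=file_name,
            sha256=digest,
            byte_length=len(data),
            raw_uri=f"mem://{digest}",  # placeholder opaque URI
        )
        self._by_key[key] = record
        return record
```

Under this keying, the same bytes imported under two different media types produce two records, while re-importing identical bytes under the same type returns the existing one.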

Input submission shapes

Session input supports two shapes.

Legacy text plus attachments

This shape is convenient for the CLI and simple integrations.

content carries the free-form text body
attachments appends assets after the text body

The daemon CLI currently maps:

--file /path/to/file
--asset asset-123

into this legacy attachment shape.
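
For illustration, a legacy request body might be assembled as follows. Only the top-level `content` and `attachments` fields come from this page. The per-attachment keys (`asset_id`, `file_name`, `media_type`, `data_base64`) and the function name are hypothetical, so check the API reference for the real schema.

```python
import base64


def build_legacy_input(content, asset_ids=(), inline_files=()):
    """Build a legacy text-plus-attachments body.

    `inline_files` is an iterable of (file_name, media_type, data) tuples.
    The per-attachment keys used here are hypothetical placeholders.
    """
    # Stored assets are referenced by id, similar in spirit to --asset.
    attachments = [{"asset_id": asset_id} for asset_id in asset_ids]
    # Fresh uploads carry their bytes inline, similar in spirit to --file.
    for file_name, media_type, data in inline_files:
        attachments.append({
            "file_name": file_name,
            "media_type": media_type,
            "data_base64": base64.b64encode(data).decode("ascii"),
        })
    return {"content": content, "attachments": attachments}
```

Because attachments are appended after the text body, this shape cannot express "text, then a file, then more text".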

Ordered rich input

Use input_items when you need precise ordering between text and files. Each item can be:

text
asset_reference
inline_asset

This is the correct shape for prompts such as:

text, then one document, then more text
text, then one image, then more text
text plus a mix of stored assets and fresh uploads

input_items must not be combined with the legacy content or attachments fields.

The same ordered multimodal shape is also used by sidechain subtasks, public channel messages, and agent-facing orchestration tools. A text-plus-assets payload therefore stays structured all the way through the daemon instead of being flattened into plain text between agents.
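
The ordering and exclusivity rules can be sketched as a small validator. The item kinds (text, asset_reference, inline_asset) and the top-level field names come from this page. The `type` discriminator key, the per-item keys, and the function name are assumptions made for the example.

```python
ITEM_KINDS = {"text", "asset_reference", "inline_asset"}


def validate_session_input(body: dict) -> None:
    """Raise ValueError if the body mixes shapes or uses an unknown item kind."""
    if "input_items" not in body:
        return  # legacy shape: nothing to check in this sketch
    if "content" in body or "attachments" in body:
        raise ValueError("input_items must not be combined with content or attachments")
    for index, item in enumerate(body["input_items"]):
        kind = item.get("type")  # hypothetical discriminator key
        if kind not in ITEM_KINDS:
            raise ValueError(f"item {index}: unknown type {kind!r}")


# Text, then one stored document, then more text, in caller-specified order.
ordered = {
    "input_items": [
        {"type": "text", "text": "Compare this plan"},
        {"type": "asset_reference", "asset_id": "asset-123"},
        {"type": "text", "text": "with the notes below."},
    ]
}
validate_session_input(ordered)
```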

Provider behavior

The runtime keeps one provider-neutral representation and only encodes provider-specific payloads at execution time.

Supported images remain true image parts and require a vision-capable model route.
Supported documents can still be sent to non-vision models because the daemon extracts and renders bounded text for them.
Supported audio assets can contribute daemon-derived transcript text when a transcription backend is configured.
Some document types, such as DXF, can contribute both derived text and a visual preview when the selected route accepts image parts.
Ordered input_items preserve the caller-specified sequence across provider adapters.

This means a daemon can accept document inputs on a text-only route, while image inputs require a route that supports image parts.
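
The routing rules above can be summarized in a small decision function. This is a simplified sketch under stated assumptions: the function name, its parameters, and the part labels ("image", "text", "preview_image") are invented for illustration and do not reflect the daemon's internal representation.

```python
IMAGE_MEDIA_TYPES = {"image/png", "image/jpeg"}


def plan_parts(media_type, *, route_accepts_images, has_text=False, has_preview=False):
    """Return the kinds of parts an asset would contribute on a given route.

    has_text: the daemon derived bounded text (documents, audio transcripts).
    has_preview: the daemon derived a visual preview (for example, DXF).
    """
    if media_type in IMAGE_MEDIA_TYPES:
        # Images stay true image parts and need a vision-capable route.
        if not route_accepts_images:
            raise ValueError("image input requires a route that supports image parts")
        return ["image"]
    parts = []
    if has_text:
        parts.append("text")  # usable on text-only routes
    if has_preview and route_accepts_images:
        parts.append("preview_image")  # only when the route accepts images
    return parts
```

For example, a DXF asset with both derived text and a preview would contribute only text on a text-only route, and both parts on a vision-capable one.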

Storage and normalization limits

Daemon-side normalization currently enforces these limits:

raw asset payloads are limited to 12 MiB
images are normalized to a maximum edge of 2048 pixels
normalized images are capped at 4 MiB

These limits are enforced by the daemon before provider routing.
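
As a worked example of the arithmetic, the sketch below encodes the three limits as constants. It assumes that the 12 MiB limit is inclusive and that resizing preserves aspect ratio by scaling the longest edge down to 2048 pixels; neither detail is stated on this page, and the function names are hypothetical.

```python
MIB = 1024 * 1024
MAX_RAW_BYTES = 12 * MIB              # raw asset payload limit
MAX_IMAGE_EDGE = 2048                 # maximum edge after normalization, in pixels
MAX_NORMALIZED_IMAGE_BYTES = 4 * MIB  # cap on the normalized image


def raw_size_ok(byte_length: int) -> bool:
    """Assumes the raw limit is inclusive."""
    return byte_length <= MAX_RAW_BYTES


def fit_within_max_edge(width: int, height: int) -> tuple:
    """Scale dimensions so the longest edge is at most MAX_IMAGE_EDGE.

    Assumes aspect ratio is preserved; smaller images are left unchanged.
    """
    longest = max(width, height)
    if longest <= MAX_IMAGE_EDGE:
        return width, height
    scale = MAX_IMAGE_EDGE / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions a 4096 x 2048 image would be normalized to 2048 x 1024, while an 800 x 600 image would keep its dimensions.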
