Deterministic intelligence

LLMs give different answers each time you ask. .0wav embeds the answer inside the file — same question, same answer, every time, for free after the first.

Ask Claude what time a meeting started. Ask again an hour later. The first answer says 2:14 PM. The second says “around 2:15.” Neither is wrong. Both cost you a query.

This is the default state of generative AI. Every read is an inference. Every inference is non-deterministic, billable, and slightly different from the last one. RAG layers retrieval on top: it narrows the context to the relevant chunks, but still runs an LLM over them on every ask. The retrieval is cheaper than re-reading the whole file. It is not free.

.0wav flips the model. The expensive work — diarization, alignment, sentiment, behavioral profile, complexity map, speaker embeddings — runs once, when the file is created. The output is written into the file itself. After that, asking a question is a structured lookup, not an inference. The fortieth query returns the same value as the first, instantly, at zero marginal compute cost.
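
Concretely, the write-once step might look like the sketch below. This assumes Python with h5py; the dataset names and layout are illustrative, not the published .0wav schema:

    import h5py
    import numpy as np

    # One-time processing: run the expensive models once, then freeze
    # their outputs into the container. All names here are hypothetical.
    def write_asset(path, words, word_starts, speaker_segments):
        with h5py.File(path, "a") as f:
            # Word-level transcript with aligned start times (seconds).
            f.create_dataset("transcript/words",
                             data=np.array(words, dtype="S32"))
            f.create_dataset("transcript/start_times",
                             data=np.asarray(word_starts))
            # One row per diarized segment: (speaker_id, start, end).
            f.create_dataset("speakers/segments",
                             data=np.asarray(speaker_segments))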

What “deterministic” means here

A .0wav file is an HDF5 container. Inside it: the audio, the master FLAC, the transcript with word-level timestamps, the speaker segments, the per-second complexity map, the embeddings, the metadata. When you ask “who spoke first?” or “what was the average words-per-minute?” the answer is read from the container. There is no model in the loop at read time. The answer cannot drift between queries because no inference happens between queries.
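
A read, by contrast, is a few lines with no model call anywhere in them. A minimal sketch, again assuming h5py and the same illustrative dataset names as above:

    import h5py

    # Read time: structured lookups, no model in the loop.
    with h5py.File("meeting.0wav", "r") as f:
        # Who spoke first? The segment with the earliest start time.
        segments = f["speakers/segments"][...]  # rows of (speaker_id, start, end)
        first_speaker = int(segments[segments[:, 1].argmin(), 0])

        # Average words per minute, from the aligned word timestamps.
        starts = f["transcript/start_times"][...]
        avg_wpm = len(starts) / ((starts[-1] - starts[0]) / 60.0)

    # Run it forty times: same bytes out, no tokens billed.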

This is what we mean by deterministic intelligence: the answer is a property of the asset, not the output of a model run.

What this is not

It is not a cache. A cache stores a model's past answer and bets that the model would produce the same one if asked again. Caches break when the model updates, the prompt changes, or the input drifts.

It is not RAG. RAG retrieves text and asks a model to interpret it. Interpretation introduces variance. Interpretation costs tokens.

It is not transcription. Transcription is one of dozens of fields written into the asset. The behavioral profile, the autonomic matrix, the prosody features, and the speaker graph are equally first-class.
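
Seeing everything the asset carries is a one-line walk of the container. A sketch with h5py; the actual output depends on the real schema:

    import h5py

    # Walk the whole container; the transcript is one branch among many.
    with h5py.File("meeting.0wav", "r") as f:
        f.visit(print)  # prints every group and dataset path in the file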

Why this matters now

Inference cost is the bill that does not stop arriving. Every “ask the recording again” is a fresh charge. At a hundred recordings, this is a footnote. At ten thousand, it’s the AWS line item. At a million — call center transcripts, depositions, podcasts, sales calls, telehealth visits — it’s the reason the budget review has a Q4 panic in it.

Storing the answer is one-time. Recomputing it is forever.

What it costs

You pay once, when the asset is processed. Every read after that is the cost of decompressing a few HDF5 datasets — microseconds, not seconds, and no per-token pricing. A team that re-asks the same recording forty times pays for one inference run, not forty.

The first answer is not free. Every answer after it is.
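
The arithmetic is short enough to write down. A back-of-the-envelope sketch with made-up prices; the real numbers depend on your model, provider, and recording length:

    # Assumed: $0.50 for one full inference pass over a recording,
    # effectively zero marginal compute per HDF5 read.
    inference_cost = 0.50
    queries = 40

    reinfer_total = queries * inference_cost  # pay on every ask: $20.00
    owav_total = inference_cost               # pay once, at write time: $0.50

    print(f"re-infer each time: ${reinfer_total:.2f}")
    print(f".0wav, write once:  ${owav_total:.2f}")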

Where it breaks

.0wav is right when:

  • The set of questions you’ll ask is roughly known up front (transcription, who spoke, when, how, with what emotional register).
  • The asset will be queried more than once.
  • The cost of re-asking matters.

It’s the wrong tool when you need open-ended reasoning that wasn’t anticipated at processing time, or when the asset is single-use. For those, you want a model and a prompt. For everything else, the question and the answer should already live in the file.

Further reading

  • The compute tax — why the inference bill keeps growing and what it costs to ignore.
  • Why RAG isn’t enough — coming soon to the wiki.
  • The .0wav format — coming soon to the wiki.