MyAPI

A retrieval layer for agents and humans that answers cold-start context questions in under 3 seconds — benchmarked, not vibes-tested.

Python, FastAPI, Khoj, Tailscale, RAG, Benchmarking

The cold start problem

Every AI agent session starts the same way. The agent scans your files, greps through logs, tries to infer project context from filenames and README fragments, and maybe hallucinates a few capabilities that don't exist. By the time it has enough context to do useful work, 30-90 seconds have elapsed — and you've burned tokens on orientation rather than execution.

The context was already there. Three years of it — Obsidian notes, exported ChatGPT and Claude conversations, CLI agent session logs. The problem wasn't that the information didn't exist. It was that none of it was findable in the way agents consume information. Agents don't browse. They issue queries and expect structured evidence to reason with.

So I built a retrieval layer that sits between agents and my entire personal corpus.

What MyAPI actually is

An agent (Claude Code, Codex, Cursor) hits a single /query endpoint with a natural language question. MyAPI classifies the intent, routes it through a multi-lane retrieval pipeline, and returns reranked evidence with source metadata — in under 3 seconds.
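To make the classification step concrete, here is a minimal sketch of routing a question to an intent before retrieval. The intent names and cue phrases are assumptions chosen for illustration, not MyAPI's actual classifier, which may well use a model rather than keyword rules.

```python
from enum import Enum


class Intent(Enum):
    DECISION = "decision"   # "what did we conclude about X"
    EPISODIC = "episodic"   # "find that thread where ..."
    CONCEPT = "concept"     # fallback: explain or describe something


# Hypothetical cue phrases per intent, checked in insertion order.
_CUES = {
    Intent.DECISION: ("decide", "conclude", " vs ", "chose", "choose"),
    Intent.EPISODIC: ("find that", "that thread", "last time", "remember"),
}


def classify_intent(query: str) -> Intent:
    """Return the first intent whose cue phrase appears in the query."""
    text = f" {query.lower()} "
    for intent, cues in _CUES.items():
        if any(cue in text for cue in cues):
            return intent
    return Intent.CONCEPT
```

A rule table like this is easy to audit when a query lands in the wrong lane, which is the main reason to start with it before reaching for a learned classifier.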

The pipeline indexes roughly 3,200 Obsidian markdown notes, exported ChatGPT and Claude conversation archives, and CLI agent session logs — about 3 years of decisions, architectural constraints, and debugging resolutions. A Python normalizer (context_refinery/) converts these heterogeneous formats into canonical knowledge objects before they hit the search index.
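As a rough illustration of what a canonical knowledge object could look like, the sketch below normalizes two source types into one dataclass. The field names, the function names, and the chat export schema are all hypothetical. Real ChatGPT and Claude exports use their own formats, and the actual code in context_refinery/ may be organized quite differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class KnowledgeObject:
    """One normalized record, whatever the original format was (hypothetical shape)."""
    source_type: str          # e.g. "obsidian", "chat_export", "cli_log"
    source_id: str            # path or conversation id
    title: str
    body: str
    created_at: datetime
    tags: tuple[str, ...] = ()


def from_obsidian_note(path: str, text: str, mtime: float) -> KnowledgeObject:
    """Use the first '# ' heading as the title and inline #tags as tags."""
    lines = text.splitlines()
    title = next((line[2:].strip() for line in lines if line.startswith("# ")), path)
    tags = tuple(sorted(
        {word[1:] for word in text.split()
         if word.startswith("#") and len(word) > 1 and word[1].isalnum()}
    ))
    created = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return KnowledgeObject("obsidian", path, title, text, created, tags)


def from_chat_export(conversation: dict) -> KnowledgeObject:
    """Flatten an assumed {'id', 'title', 'created', 'messages'} dict into one body."""
    body = "\n".join(f"{m['role']}: {m['content']}" for m in conversation["messages"])
    created = datetime.fromtimestamp(conversation["created"], tz=timezone.utc)
    return KnowledgeObject(
        "chat_export", conversation["id"], conversation.get("title", "untitled"), body, created
    )
```

The point of the shared shape is that everything downstream (indexing, filtering by time or source type, reranking) only has to understand one record type.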

Khoj handles the semantic vector search. The Context Refinery sits on top of that: query classification, multi-lane retrieval (semantic + keyword + synthesized-note boosting), metadata-aware filtering, and evidence reranking. The result is structured enough for an agent to reason with — not just a list of file paths.
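One common way to merge several retrieval lanes is reciprocal rank fusion, sketched below with a multiplicative boost for synthesized notes. This is an assumption about how such a merge could work, not a description of the Context Refinery's actual reranker. The function name, the constant k = 60, and the boost factor are placeholders.

```python
from collections import defaultdict


def fuse_lanes(
    lanes: dict[str, list[str]],
    synthesized: set[str],
    k: int = 60,
    boost: float = 1.5,
) -> list[tuple[str, float]]:
    """Merge ranked id lists from each lane with reciprocal rank fusion.

    Each lane contributes 1 / (k + rank) per document. Documents flagged as
    synthesized notes then have their fused score multiplied by `boost`.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in lanes.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    for doc_id in scores:
        if doc_id in synthesized:
            scores[doc_id] *= boost
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = fuse_lanes(
    {"semantic": ["a", "b", "c"], "keyword": ["b", "d"]},
    synthesized={"c"},
)
```

In this example "b" ranks first because both lanes returned it, and "c" moves ahead of "a" only because of the synthesized-note boost.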

Two audiences, one pipeline

The retrieval pipeline is the same. What changes is the response shape.

Agents get structured JSON — ranked evidence blocks with confidence scores, source metadata, and timestamps. The use case is cold-start context elimination: the agent arrives, asks a few context questions, and gets oriented in seconds instead of minutes.

Humans get the same evidence packaged differently. Episodic recall ("find that thread where I figured out the Khoj sync race condition"), decision retrieval ("what did we conclude about Postgres vs SQLite for receipts"), and synthesis queries across years of notes. Same pipeline, same provenance — different output envelope.
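The "same pipeline, different envelope" idea can be sketched as two renderers over one list of evidence. The Evidence fields mirror what is described above (confidence, source, timestamp), but the exact names and the output layouts are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Evidence:
    text: str
    source: str
    timestamp: str      # ISO 8601 string
    confidence: float   # 0.0 to 1.0


def _ranked(evidence: list[Evidence]) -> list[Evidence]:
    """Highest-confidence evidence first."""
    return sorted(evidence, key=lambda e: e.confidence, reverse=True)


def agent_envelope(query: str, evidence: list[Evidence]) -> str:
    """Structured JSON for agents: ranked evidence blocks with metadata."""
    return json.dumps({"query": query, "evidence": [asdict(e) for e in _ranked(evidence)]})


def human_envelope(query: str, evidence: list[Evidence]) -> str:
    """Plain-text summary for humans, carrying the same provenance."""
    lines = [f"Q: {query}"]
    lines += [f"- {e.text} ({e.source}, {e.timestamp[:10]})" for e in _ranked(evidence)]
    return "\n".join(lines)
```

Both renderers call the same ranking helper, so any change in ordering shows up identically for agents and humans.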

How I evaluate it

I don't "vibe-test" retrieval. I built a categorized query bank with seven diagnostic buckets:

  1. Win — correct evidence returned, top-ranked. The baseline expectation.
  2. Weak win — correct evidence exists but is buried below threshold. A ranking problem.
  3. Corpus gap — the information doesn't exist in the indexed corpus. Not a retrieval problem; a content problem.
  4. Retrieval gap — the information exists but search didn't surface it. The embedding or query expansion strategy failed.
  5. Metadata gap — the information is there but got filtered out by time, source type, or tag. The query classifier made the wrong assumption.
  6. Intent gap — the query was classified into the wrong lane. The system looked for a decision when the user asked for a concept explanation.
  7. Answer-shape gap — the evidence is correct but the response structure doesn't match what the caller needs. Agents need structured confidence scores; humans want prose synthesis.

Every retrieval improvement gets measured against these buckets. They tell me which lever to pull: corpus shaping, intent classification, or reranking strategy. Swapping models or tuning hyperparameters is the last resort, not the first instinct.
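A small sketch of how a labeled query bank could be tallied into "which lever to pull". The seven bucket names come from the list above. The bucket-to-lever mapping is one reading of those descriptions, and the function names are hypothetical.

```python
from collections import Counter
from enum import Enum


class Bucket(Enum):
    WIN = "win"
    WEAK_WIN = "weak win"
    CORPUS_GAP = "corpus gap"
    RETRIEVAL_GAP = "retrieval gap"
    METADATA_GAP = "metadata gap"
    INTENT_GAP = "intent gap"
    ANSWER_SHAPE_GAP = "answer-shape gap"


# Assumed mapping from failure bucket to the lever most likely to fix it.
LEVER = {
    Bucket.WEAK_WIN: "reranking strategy",
    Bucket.CORPUS_GAP: "corpus shaping",
    Bucket.RETRIEVAL_GAP: "embedding / query expansion",
    Bucket.METADATA_GAP: "intent classification",
    Bucket.INTENT_GAP: "intent classification",
    Bucket.ANSWER_SHAPE_GAP: "response shaping",
}


def win_rate(results: list[Bucket]) -> float:
    """Fraction of benchmark queries that landed in the Win bucket."""
    return sum(r is Bucket.WIN for r in results) / len(results) if results else 0.0


def lever_counts(results: list[Bucket]) -> Counter:
    """Count how many non-win results point at each lever."""
    return Counter(LEVER[r] for r in results if r is not Bucket.WIN)
```

Sorting the resulting counter with most_common() gives a crude priority list for the next round of work.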

How it runs

Phase 1 is deployed on a cloud VM behind Tailscale. If you're not on my tailnet, you can't reach it. The VM auto-shuts down after a few hours of inactivity and spins up on demand.
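The idle shutdown decision reduces to comparing the time since the last request against a limit. The sketch below shows only that check. The function name is hypothetical, the 3-hour value in the example is a placeholder for "a few hours", and how the VM is actually stopped and restarted is outside what this shows.

```python
from datetime import datetime, timedelta, timezone


def should_shut_down(last_request_at: datetime, now: datetime, idle_limit: timedelta) -> bool:
    """Return True once the time since the last request reaches idle_limit."""
    return now - last_request_at >= idle_limit


# Example with a placeholder limit of 3 hours.
last_seen = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
idle = should_shut_down(last_seen, last_seen + timedelta(hours=4), timedelta(hours=3))
```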

No public endpoints. No auth layer. The trust boundary is the network itself.

Phase 2

Phase 1 proved the pipeline works. Phase 2 is about trust calibration — systematically working through the benchmark bucket gaps, tuning the intent classifier, and expanding the corpus with higher-signal content.

The active direction right now is what I'm calling the corpus v1 substrate: normalized markdown, meaningful folder structure, frontmatter provenance, stable headers, and explicit trust/canonicality fields. It turns out that corpus quality is source-of-truth quality: if what you're indexing isn't well structured to begin with, no amount of retrieval tuning recovers the signal.
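As an illustration of what checking frontmatter provenance might involve, the sketch below parses a simple key: value frontmatter block and reports missing fields. The required field names (source, created, trust, canonical) are hypothetical stand-ins for the provenance and trust/canonicality fields, and the parser handles only flat key: value pairs, not full YAML.

```python
# Hypothetical required fields for a corpus v1 note.
REQUIRED_FIELDS = ("source", "created", "trust", "canonical")


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines between the leading '---' delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat the note as having no frontmatter


def missing_fields(text: str) -> list[str]:
    """List required fields that are absent or empty in a note's frontmatter."""
    fields = parse_frontmatter(text)
    return [name for name in REQUIRED_FIELDS if not fields.get(name)]
```

Running a check like this across the vault before indexing turns "is the corpus well structured?" into a list of specific notes to fix.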

When I can watch an agent cold-start, query MyAPI, and immediately begin working from accurate context without asking me to confirm any of it — that's when Phase 2 is done.