Multi-tenant RAG on Kubernetes: what bends first

When you industrialise a RAG platform for several clients, the LLM isn't the problem — isolation, quotas, and per-tenant observability are.

I built a multi-tenant RAG platform for Conexia. Everyone talks about RAG as if the topic were “pick an LLM and bolt a vector store behind it.” In reality, as soon as you go past the single-client POC, the LLM is the least interesting component. Here are the three things that bend first, and what I did to make them operable.

1. Isolation is everywhere or nowhere

When you start, the temptation is to separate tenants by “logical namespace”: a key prefix, a tag on documents, done. That holds until the first leak. And the first leak comes fast, because there are at least six surfaces where isolation can break:

  • The vector DB (a wrong or missing where filter and tenant A receives tenant B’s context)
  • Secrets (LLM keys, scraping credentials, per-client webhooks)
  • Quotas (tenant A must not saturate the LLM pool for tenant B)
  • Observability (logs and metrics must be tagged per tenant to make per-client debugging possible)
  • Cache (a poorly scoped prompt cache leaks business info between tenants)
  • Configuration (system prompts, sources, business workflows)

My rule: one tenant = one identifier that propagates through every layer, and we fail by default if the identifier is missing. No silent “default value.” If a request arrives without a tenant ID, it gets rejected, not quietly attributed to tenant zero.
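
The rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's actual code: the names `set_tenant`, `require_tenant`, `MissingTenantError` and `TenantScopedStore` are hypothetical, and a plain list of dicts stands in for a real vector DB.

```python
from contextvars import ContextVar
from typing import Optional

# Hypothetical names throughout; an in-memory list stands in for the vector DB.
_current_tenant: ContextVar[Optional[str]] = ContextVar("current_tenant", default=None)


class MissingTenantError(RuntimeError):
    """Raised when code runs without a tenant ID in context."""


def set_tenant(tenant_id: Optional[str]) -> None:
    # Would be called by the auth middleware once the caller is authenticated.
    if not tenant_id:
        raise MissingTenantError("request has no tenant ID; rejecting")
    _current_tenant.set(tenant_id)


def require_tenant() -> str:
    # No fallback value: a missing tenant is an error, never "tenant zero".
    tenant_id = _current_tenant.get()
    if tenant_id is None:
        raise MissingTenantError("no tenant ID in context")
    return tenant_id


class TenantScopedStore:
    """Wraps a document store so every query carries a tenant filter."""

    def __init__(self, documents):
        # documents: list of dicts with "tenant_id" and "text" keys
        self._documents = documents

    def search(self, text: str):
        tenant_id = require_tenant()  # fails before touching any data
        return [
            d for d in self._documents
            if d["tenant_id"] == tenant_id and text.lower() in d["text"].lower()
        ]
```

The point of the design is that callers cannot forget the filter: `search` takes no tenant argument, so the only way to query is through a context that already holds a validated tenant ID.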

In practice on K8s, that means:

  • An auth middleware that injects the tenant ID into the request context
  • A wrapper on the vector DB that refuses queries without tenant_id
  • Kubernetes secrets scoped by namespace, or External Secrets synced from a vault with a tenant-aware path
  • Systematic Prometheus labels on every application metric

# Example: required labels on RAG metrics
- name: rag_query_latency_seconds
  labels:
    - tenant_id
    - source_type     # web, gmail, messenger
    - llm_provider
    - cache_hit

If you can’t answer “how many tokens did tenant X consume this week, and on which models” in five seconds, your multi-tenant observability doesn’t exist yet.
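
To make that question concrete, here is a small Python sketch of the aggregation it implies. It assumes an in-memory list of usage events; in a real deployment the same answer would more likely come from a query over a counter labelled with tenant_id and model. `TokenLedger` and its methods are hypothetical names.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class TokenLedger:
    """In-memory stand-in for per-tenant token accounting (hypothetical)."""

    def __init__(self):
        self._events = []  # (timestamp, tenant_id, model, tokens)

    def record(self, tenant_id, model, tokens, when=None):
        # Same fail-by-default rule as everywhere else: no tenant, no record.
        if not tenant_id:
            raise ValueError("tenant_id is required")
        timestamp = when or datetime.now(timezone.utc)
        self._events.append((timestamp, tenant_id, model, tokens))

    def usage_since(self, tenant_id, since):
        # Sum tokens per model for one tenant, from `since` onwards.
        totals = defaultdict(int)
        for timestamp, tenant, model, tokens in self._events:
            if tenant == tenant_id and timestamp >= since:
                totals[model] += tokens
        return dict(totals)


ledger = TokenLedger()
ledger.record("tenant-x", "model-a", 1200)
week_ago = datetime.now(timezone.utc) - timedelta(days=7)
weekly_usage = ledger.usage_since("tenant-x", week_ago)
```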

2. Ingestion pipelines are the real source of outages

The LLM rarely falls over. Your scraping and indexing pipelines? They fall over constantly: the source site changes its HTML, the source API rate-limits you, a PDF arrives corrupted, a batch of documents blows up your embedding budget because someone uploaded a log dump.

What saved me:

  • Idempotence everywhere: a document indexed twice doesn’t create two entries. Content hash as key.
  • Quarantine: any document that fails goes into a separate queue, not into the main queue that silently fills up.
  • Per-tenant budget cap on embeddings: a client can’t burn the monthly budget overnight by uploading 50 GB of PDFs.
  • Indexing workflow split into observable stages (fetch → parse → chunk → embed → upsert), with metrics at each stage.
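
The four points above fit together in one small sketch. It is a simplification under several assumptions: the fetch stage is left out, UTF-8 decoding stands in for real parsing, a whitespace word count stands in for a tokenizer, every tenant shares the same cap, and `Ingestor` is a hypothetical name.

```python
import hashlib
from collections import defaultdict


class Ingestor:
    def __init__(self, embed_budget_tokens: int):
        self.index = {}                       # (tenant_id, content_hash) -> chunks
        self.quarantine = []                  # (tenant_id, stage, error message)
        self.spent = defaultdict(int)         # tenant_id -> tokens embedded so far
        self.budget = embed_budget_tokens     # same cap for every tenant here
        self.stage_counts = defaultdict(int)  # (stage, "ok" | "failed") -> count

    def ingest(self, tenant_id: str, raw: bytes) -> str:
        # Idempotence: the content hash is the key, so re-ingesting is a no-op.
        key = (tenant_id, hashlib.sha256(raw).hexdigest())
        if key in self.index:
            return "skipped"
        stage = "parse"
        try:
            text = raw.decode("utf-8")        # stand-in for real parsing
            self.stage_counts[(stage, "ok")] += 1

            stage = "chunk"
            chunks = [text[i:i + 200] for i in range(0, len(text), 200)]
            self.stage_counts[(stage, "ok")] += 1

            stage = "embed"
            cost = len(text.split())          # crude token estimate
            if self.spent[tenant_id] + cost > self.budget:
                raise RuntimeError("embedding budget exceeded")
            self.spent[tenant_id] += cost     # a real embedding call would go here
            self.stage_counts[(stage, "ok")] += 1

            stage = "upsert"
            self.index[key] = chunks
            self.stage_counts[(stage, "ok")] += 1
            return "indexed"
        except Exception as exc:
            # Quarantine: record which stage failed instead of retrying forever.
            self.stage_counts[(stage, "failed")] += 1
            self.quarantine.append((tenant_id, stage, str(exc)))
            return "quarantined"
```

Keeping the current stage name in a variable means one except block can both count the failure per stage and say where the document died, which is most of what you need when triaging the quarantine queue.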

It’s less sexy than an LLM benchmark, but it’s what keeps the platform alive on a Sunday morning.

3. LLM routing is ops, not ML

Everyone dreams of an intelligent router that picks the best model. In practice, useful routing is boring:

  • A rule per criticality level (“internal test” requests don’t go to GPT-4)
  • An availability fallback (if the primary provider times out, switch)
  • A per-tenant circuit breaker so that one broken client doesn’t take the others down
  • A per-tenant daily budget alert
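
A sketch of the first three rules in Python, assuming providers are plain callables that raise `TimeoutError` when they time out. The route table, model names, `TenantBreaker` and `route` are all hypothetical, and the daily budget alert is left out.

```python
import time

# Static rule per criticality level; model names are placeholders.
ROUTES = {
    "internal_test": ["small-model"],
    "production": ["primary-model", "fallback-model"],
}


class TenantBreaker:
    """Opens for one tenant after max_failures consecutive failed requests."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = {}   # tenant_id -> consecutive failures
        self.opened_at = {}  # tenant_id -> time the breaker opened

    def allow(self, tenant_id) -> bool:
        opened = self.opened_at.get(tenant_id)
        if opened is None:
            return True
        if self.clock() - opened >= self.cooldown:
            # Cooldown elapsed: close the breaker and let traffic through again.
            del self.opened_at[tenant_id]
            self.failures[tenant_id] = 0
            return True
        return False

    def record(self, tenant_id, ok: bool) -> None:
        if ok:
            self.failures[tenant_id] = 0
            return
        self.failures[tenant_id] = self.failures.get(tenant_id, 0) + 1
        if self.failures[tenant_id] >= self.max_failures:
            self.opened_at[tenant_id] = self.clock()


def route(tenant_id, criticality, prompt, providers, breaker):
    if not breaker.allow(tenant_id):
        raise RuntimeError(f"circuit open for tenant {tenant_id}")
    for name in ROUTES[criticality]:
        try:
            answer = providers[name](prompt)
        except TimeoutError:
            continue  # availability fallback: try the next provider in the list
        breaker.record(tenant_id, ok=True)
        return name, answer
    breaker.record(tenant_id, ok=False)
    raise RuntimeError("all providers failed")
```

Everything here is a lookup or a counter, which is the point: when a request lands on the wrong model, the explanation is one dictionary entry away.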

“ML routing” that dynamically picks the best model for each prompt is interesting in research. In production, you first want routing to be deterministic, observable, and debuggable. Sophistication comes later.

What I’d do first on a new project

If I were starting over tomorrow:

  1. Tenant ID everywhere, fail-by-default, from the first line of code
  2. Per-tenant metrics from day one, even with a single tenant
  3. LLM and embedding budget caps before the first client integration
  4. Ingestion pipeline with quarantine and idempotence before optimising anything else
  5. The “real” LLM routing last, and as simple as possible

RAG isn’t an AI problem. It’s a platform problem. And platforms are built from the edges (isolation, quotas, observability), not from the centre (the model).