22 minute read

If you are building anything that pipes a user’s history into an LLM prompt, this story is probably waiting for you too.

For about six weeks one of our services was sending every user’s entire interaction history into the prompt with no cap on size. For most users that was fine. For our power users it was three to five minutes per call, our upstream LLM gateway timing out, jobs piling up in the queue, and demos blowing up in real time.

We had a demo on Tuesday. On Monday we shipped this:

.slice(0, 10_000)

Latency went from minutes to two seconds. We made the demo. Six days later we replaced that one line with what should have been there all along: per-user semantic retrieval on Postgres and pgvector.

This post is about both halves of that story. Why the hotfix worked and was still wrong, and what the real solution looks like.


The shape of the problem

Briefly, so the details below make sense: we run an internal service that generates survey responses on behalf of users, for cases where we need realistic-looking respondents to seed a test, validate a new flow, or reproduce an edge case. To make a generated answer sound like a specific person and not a generic one, the service feeds the LLM that user’s prior question-and-answer history as context, then asks the model to fill the current survey page in their voice.

That detail aside, the lesson in this post is general. Any system that pipes a user’s full history into an LLM prompt has the same shape. The history is unbounded, the model is paid per token, and the data path is silent about what part of the history actually matters for the question being asked.

Worth naming what we were doing, because the rest of the post turns on it. We were doing context stuffing, not retrieval. Every Q&A pair the user had ever answered, joined together, sent to the model in one shot. The irony is not lost on us: the retrieval in “retrieval-augmented generation” is exactly what we did not have. We had stuffing. The rest of this post is the story of how we built the retrieval part, six weeks later, after the stuffing broke.

For a user with a small history (a handful of past surveys), the blob is a few KB and the model takes a second or two to come back. That was the steady state for most users. It was the state we shipped at, and for a while, it held.


How we ended up sending megabyte-sized prompts

The original implementation took the path of least resistance: dump everything the system knew about a user into the prompt and let the model figure out what to do with it. Every past Q&A pair, joined together with newlines, in whatever order the rows had landed in the database. No cap on size, no relevance filtering, nothing fancy.

For about six weeks it worked. Then we onboarded a long-tenured user group, with people who had answered hundreds of surveys each. Their context blobs blew past a megabyte. The largest one in production was 2.66 million characters.

You can probably guess what the LLM gateway did with that.

LLM calls on those users started taking three to five minutes. Some hit the upstream timeout. The background job queue feeding the LLM started backing up. And worst of all, we had product demos lined up where the customer was watching the queue drain in real time and asking why.

There was no real mystery about what was wrong. One log line was enough to confirm it:

this.logger.debug(`prompt length: ${prompt.length}`);
// prompt length: 2660142

2.66 million characters into a chat completion endpoint will take minutes, in any model, in any region. The model has to read it.

The cheapest debugging step for a slow third-party call is logging the size of what you are sending. It is also the one most people skip. Payload size belongs in the same log line as the URL and the response time, in every HTTP client wrapper you write. You will get it back, in saved time, the first time something goes wrong.


The hotfix

We had less than 24 hours before the demo. The hotfix went in around lunchtime:

const MAX_MEMBER_CONTEXT_CHARS = 10_000

private truncateContext(memberContext: string, surveyId: number): string {
  if (memberContext.length <= MAX_MEMBER_CONTEXT_CHARS) {
    return memberContext
  }
  this.logger.warn(
    `[CONTEXT-TRUNCATE] surveyId=${surveyId}: memberContext ${memberContext.length} chars truncated to ${MAX_MEMBER_CONTEXT_CHARS}`,
  )
  return memberContext.slice(0, MAX_MEMBER_CONTEXT_CHARS)
}

A slice, a warning log so we could measure how often it tripped, and a MAX_MEMBER_CONTEXT_CHARS constant so the magic number had a name. End-to-end LLM latency went from three to five minutes back down to roughly two seconds. The demo went out the door.

Nobody on the team thought this was the answer. The PR description said in writing that this was a stopgap to keep our LLM path responsive, and that a proper retrieval implementation was the next ticket. But it bought us the breathing room to build that properly instead of in a panic.

This is also where most teams stop. The fire is out, the metric is green, the velocity board moves. The thing that made us not stop is that we already knew what was wrong with the hotfix, because it was obvious the moment we wrote it.


Why slice(0, N) is not retrieval

The blob we were truncating was stitched together in insertion order: the order Q&A pairs happened to land in the database, which is roughly the chronological order in which the user took the surveys. For our heaviest user, the first 10 KB was answers from surveys taken years ago. The page we were generating a response for was a recent satisfaction tracker. The Q&A pairs that would have actually helped the model, that user’s recent answers on the same product surface, sat at character offset 1.4 million. The hotfix was clipping them off and feeding the model old, mostly unrelated history.

To be honest about what we know and don’t know here: we did not measure a quality regression in production. We do not have an automated eval on generated-answer quality yet. What we have is the structural argument, which is enough to take seriously. If the data you are clipping has no order with respect to the question being asked, the first N bytes of it are statistically no different from a random sample of N bytes. The model is not failing in that case. It is succeeding loudly on the wrong input. That is a worse failure mode than slow latency, because there is no graph that lights up when it happens.

  Unbounded prompt After hotfix
LLM latency 3-5 min on power users 1-3s across the board
Context the model saw Everything, including the right pairs The oldest 10 KB, regardless of relevance
What is at risk Throughput, visibly Answer relevance, invisibly
Failure mode Loud, on every graph Silent, on no graph

The dangerous column is the right one. The hotfix turned a problem we could see into a problem we couldn’t. That is a fine trade for two days while you ship a demo. It is not a fine steady state.

Truncation is not a solution, it is an incident in disguise. It silences a metric you watch (latency) by creating a regression in something you don’t (answer quality). If the data you are clipping has no order relative to the query, slicing the first N bytes of it is no different statistically from picking N bytes at random. The fix is not a bigger N.


The right question

Under pressure on Monday we asked the small question: how do we make this prompt smaller, today, in time for the demo? Truncation answered it.

The bigger question was the one we walked into the design doc with: which part of this user’s history is actually relevant to the page we are filling right now? The honest answer to that is not “the first N bytes.” It is “the chunks that semantically match the question being asked.” The first is truncation. The second is retrieval. They look similar from outside. They behave nothing alike.

But before we describe the retrieval design, the more important framing: retrieval is not the default path. Most users in our system have small context blobs, a few KB of past answers. Sending the whole thing to the LLM works fine. There is no reason to embed, store, query, and reconstruct context for a user whose entire history would fit in a tweet.

The honest design is: keep the legacy blob path for everyone whose context fits inside the model’s effective attention window. Only build the retrieval pipeline for the heavy users where that path breaks. The threshold we landed on is 200K tokens of user context, measured with tiktoken rather than length / 4 since token counts vary 2-5× by content. Below that, the model gets the whole blob. Above it, retrieval kicks in.

This matters because it changes what the system looks like. We are not building a RAG system. We are building a gated escape hatch for the long tail of users whose history is too big for the cheap path. The 90%+ of users who stay under the threshold never touch the new infrastructure. Storage cost, embedding API cost, retrieval latency, all of it scales with the small fraction of heavy users, not with the whole user base.

The shape of the retrieval path itself, for heavy users:

  1. At sync time, if the user’s blob crosses 200K tokens, split their Q&A pairs into chunks of 10 (chronological order). Store one row per chunk in the vector table.
  2. A background worker fills in embeddings for each new chunk asynchronously.
  3. At fill time, embed the question on the current page, run a cosine similarity search over that user’s chunks, take top 5, send those into the prompt.

We will come back to why 10 pairs per chunk and not 1 or 100 in a moment. That’s the part the design discussion actually argued about.


Why pgvector, not a new vector database

We already run Postgres. Postgres has pgvector. Pulling in a new vector database (Pinecone, Weaviate, Qdrant, whatever the current favourite is) for a single feature is the kind of architectural decision that looks free on a slide and is anything but free six months later in oncall. So we built on what we already operate.

Embeddings come from OpenAI’s text-embedding-3-small at 1536 dimensions, which is what the vector(1536) column type below is sized for. The model is cheap, fast, and accurate enough for our use case. We are not married to it; the column dimension is the only place the choice leaks into the schema, and a future migration to a different model is a backfill, not a rewrite.

CREATE TABLE panel_member_answer_vectors (
  id SERIAL PRIMARY KEY,
  synced_panel_member_id INT NOT NULL
    REFERENCES synced_panel_members(id) ON DELETE CASCADE,
  chunk_text TEXT NOT NULL,      -- concatenation of 10 Q&A pairs
  pair_count INT NOT NULL,        -- how many pairs are in this chunk
  embedding vector(1536),
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  UNIQUE (synced_panel_member_id, chunk_text)
);
CREATE INDEX idx_pmav_member ON panel_member_answer_vectors (synced_panel_member_id);

Three things in that schema are deliberate and worth calling out.

No HNSW index, on purpose. HNSW is the canonical pgvector index and the default advice in any RAG blog post. We did not use it. The reason is the shape of our query, not the size of our data. We never search across all users, we only search within one user. A B-tree on synced_panel_member_id narrows the result set to that one heavy user’s chunks, and then a sequential cosine comparison over that bounded set runs in single-digit milliseconds. At our heaviest user today the set is tiny; at the ceiling we are sizing for, it tops out around 2,000 chunks per user. Either way the cosine pass stays in memory and stays fast. HNSW would add build time, write amplification, and a real tuning surface (pgvector tuning is a topic of its own), and at our query shape it would not be faster. We will add it the day telemetry says we need it. Not before.

UNIQUE on (synced_panel_member_id, chunk_text), with a side benefit. When we ran the migration to populate the new table, the constraint deduped a sync bug we didn’t know we had: our heaviest user had 78 duplicate Q&A pairs out of 127 stored rows. The blob format had hidden it the entire time, because duplicates inside one giant joined string just look like one long string of repeated text. The constraint did the cleanup as a side effect of being correct. The upstream sync bug that produced the duplicates is filed separately and is being fixed at the source; the constraint stays regardless, because defensive layers like this are the difference between a one-time data wash and a recurring oncall page. A reminder that schema constraints are not just gates against bad input. They are also flashlights, pointed at dirty data that is already there.

Nullable embedding column. Chunks are inserted synchronously as part of the user-sync transaction. Embeddings are filled in asynchronously by a background worker that batches calls to the embedding provider. This separates two concerns that have no business being in the same transaction: the durable record of what the user answered, and the derived index that lets us search it. If the embedding provider is down, sync still works. Retrieval falls through to the legacy blob path until the worker catches up. The system degrades, it does not break.


A flag, not a query: how the system knows which path to take

Here is the question the gated design forces you to answer: at fill time, how does the code know whether this user is a small-blob user or a heavy retrieval user?

The naive answer is “query the vector table and see if rows exist.” That works, and it is wrong. It pays a DB roundtrip for every single response generation, including the 90%+ of calls that will immediately turn around and use the blob anyway. Multiply that by every page of every survey for every user in a run, and the “small users are free” claim quietly stops being true.

The honest answer is to store the state where it is needed. We added one column to the existing user row:

is_chunked: boolean; // default false

The flag is set to true at sync time, the moment the user’s blob crosses the 200K-token threshold. The row was already being loaded for the response generation anyway, so the flag arrives for free. Retrieval becomes a single in-memory check:

if (!member.isChunked) return legacyBlob(member.context);
// only heavy members reach this line
return semanticRetrieval(member.id, queryEmbedding);

One deliberate semantic: the flag is sticky-true. Once a user becomes chunked, they stay chunked, even if a future resync drops them below 200K tokens. The reasoning is part sunk cost (the chunks are already vectorized, the embedding bill is already paid), part state consistency (no edge case where rows exist in the vector table but the flag is telling fill-time to use the blob). The trade-off is that a user cannot be un-chunked through the normal resync flow. Only deleting and re-creating the user row resets the state. We made peace with that.

The bigger principle: store state explicitly, don’t recompute it. A boolean that is set once at write time beats a DB query on every read. Reading code that says if (!member.isChunked) tells the next engineer exactly what is going on at this branch in the design. Reading code that says if ((await countVectorRows(memberId)) === 0) tells them nothing about why the system is asking that question, and quietly costs an extra network roundtrip on every fill. The boolean is not faster by accident; it is faster because it is the right shape for the decision being made.


The grouping decision: per-pair, per-survey, or per-N

This was the longest discussion in the design. Three options were on the table, and the interesting part of the design is which one we picked and why the obvious choices were wrong.

Per-pair (one Q&A = one row)

The first instinct, and the one most RAG tutorials show you. Embed each Q&A pair on its own, retrieve top-K pairs.

It dies on the row count math at scale. We are not a 1000-user app. A user group can have 90K users, and a heavy user can have 20K answers across years of activity. Per-pair grouping puts that at 1.8 billion rows in a single Postgres table, with around 11 TB of vector data. Postgres can technically store that. We do not want to operate it. Vacuum behaviour, replication lag, backup windows, index rebuild times: every one of those gets unhealthy fast at that row count, and none of them is the failure mode you want for a feature that is supposed to be an “escape hatch.”

Per-pair also pays embedding cost per pair. At a million pairs across a user base, that is a million embedding API calls at sync time. The cost is real money even at OpenAI’s small-model pricing, and the throughput is bounded by the embedding API’s rate limit. None of this is unsolvable, but it is all overhead being paid in exchange for a granularity the retrieval doesn’t need.

Per-survey (one survey’s worth of answers = one row)

The next-most-obvious idea, and intuitively appealing. A survey is a natural topic boundary. NPS surveys are about loyalty, demographic surveys are about who the person is, product surveys are about features. Grouping by survey preserves topic. No magic number to tune.

It dies on two real problems:

  1. Similarity score dilution. A survey covers many sub-topics. A satisfaction tracker might have a Net Promoter question, a product feedback question, a competitor comparison, and a demographic question, all in one survey. Embedding all of those into one vector averages the topic signal out. When the page asks something specific (“how satisfied with response time”), the per-survey chunk’s embedding is generic and doesn’t match strongly. The cosine search produces lower scores across the board, which is exactly the failure mode you don’t want from a retrieval system.

  2. Survey size is unbounded. A QuestionPro survey can have 500 questions. Some have 5000. A 500-question survey at ~150 chars per answer is 75K characters in one row, which is past the 8192-token input limit of most embedding models. We’d have to chunk anyway. So we’re not actually grouping by survey, we’re chunking inside surveys and pretending the boundary matters. It doesn’t.

Per-N pairs (we landed on N=10)

The middle ground. Take a user’s pairs in chronological order and group them in fixed batches of 10. One row per batch. ~1500 characters per chunk on average. Bounded chunk size, bounded row count, embedding input always inside the model’s limit.

The trade-off is real: a chunk of 10 chronologically-adjacent pairs may contain mixed topics. Mixing two or three topics in 1500 characters still produces a meaningful embedding (the topics are at least adjacent in the user’s timeline), but it’s not as clean as one topic per chunk would be. We chose this anyway. The reason: the alternatives are worse. Per-pair scales catastrophically. Per-survey loses signal for surveys that mix topics, which is most of them, and breaks on size for the large ones.

10 is not magic. It is the smallest number where the embedding has enough text to carry topic signal and where the row count stays in single-digit millions instead of single-digit billions at scale. It is a tuning parameter, and we will revisit it. To revisit it well, we need something we do not have yet: an automated eval on generated-answer quality. The next thing this work unlocks is exactly that. Once retrieval is doing something other than “use the blob,” we finally have two variants to compare, and a reason to build the eval. That is its own ticket, and probably its own post.

The other rejections (briefer, because they’re the standard ones)

BM25 / Postgres full-text search. Free and tempting, but plainto_tsquery is AND-default and stems aggressively. A page asking “how satisfied with our services” tokenizes to 'satisfi' & 'servic', returns zero matches against a user who had answered five surveys about “AI advancement satisfaction.” Survey questions paraphrase across surveys (“how likely to recommend” → “would you suggest us to a friend”). Handling paraphrase is the reason to do semantic retrieval, and BM25 cannot.

JSONB array column on the existing users table. Tempting because it avoids a new migration. But pgvector’s <=> cosine operator works at the row level, not the array-element level. An array column forces UNNEST per query → no usable index → sequential scan of the entire users table on every retrieval.

Reuse the existing question_vectors table with a discriminator column. Polymorphic schema with nullable foreign keys and branching logic in every query. “One fewer table” is a benefit that lasts a sprint. Polymorphism is a maintenance cost that compounds for years.

HNSW from day one. Right index, wrong workload. Per-user B-tree filter + in-memory cosine over a few thousand vectors beats HNSW for our query shape until row count per user grows another order of magnitude. We’ll add it when telemetry says we need it.

Vector retrieval has a stack of canonical defaults: HNSW indexes, per-pair embeddings, recursive chunking, hybrid keyword-plus-semantic scoring. Each is correct for some scale and some query shape. None is correct for all of them. The canonical architecture is canonical for someone else’s problem. Measure your data, and run the row count math at the scale you actually operate at, before you copy it.


K=5, not K=10 or K=20

Once retrieval is working, the next instinct is to be generous with K. More context is more information, right?

Wrong, and it took some reading to convince me of it. The current production-RAG research (Lushbinary’s 2026 guide summarizes it well) keeps finding the same thing: LLM accuracy degrades as input grows. The model spreads its attention across whatever you give it. If five of your ten retrieved passages are noise, the five good ones do worse than they would alone, because the model is now also reasoning about the noise. Smaller K with higher-precision results beats larger K with looser thresholds.

That was the second time on this project we ran into the same fact in a different costume. Truncation hid the relevant data behind a length cap. Over-large K drowned it in distractors. The lesson is the same lesson: less input, more accurate output, as long as the input you keep is the right input.


What the new path costs

Because the retrieval pipeline only runs for heavy users, the cost numbers are nothing like “build RAG for the whole user base.” From the data we have in production today, roughly 5-10% of users cross the 200K-token threshold. That estimate will sharpen as we roll out and see real traffic distribution, but the order of magnitude is right.

The sizing below is for the ceiling we are designing for, not the heaviest user in production today. Today’s heaviest is small (a few dozen distinct Q&A pairs, after the UNIQUE constraint deduped them). The numbers that follow describe what happens when a fully active group of long-tenured users shows up. That is the case the design has to survive.

  • Eligible users: for a 90K-user group, somewhere between 5K and 10K cross the threshold. The rest stay on the legacy blob path with zero infrastructure changes: no row in the vector table, no embedding call, no extra latency.
  • Storage: at the ceiling we are sizing for, a heavy user with ~20K answers produces ~2K chunks at 10 pairs each. At 1536-dim float32, that is ~12 MB per heavy user, ~60-120 GB total for the heavy tail of a large user base. Postgres handles that easily on standard storage.
  • Embedding API cost: one async call per new chunk at sync time, batched. At fill time, the query embedding is cached per page, so the same page being filled for N heavy users costs one embedding call total, not N.
  • Retrieval latency: single-digit milliseconds per query. The B-tree filter narrows to one user’s chunks; the cosine pass runs over a few thousand vectors in memory.
  • End-to-end LLM latency on heavy users: two to three seconds. Same as the truncation hotfix was giving us, except the model is now being handed the right context instead of an arbitrary prefix of the wrong context.
  • End-to-end LLM latency on small users: unchanged. They never enter the retrieval path.

The legacy blob is the fallback for two different reasons. For small users it is the default, because it is cheaper, simpler, and has no embedding-worker dependency. For heavy users it is the safety net: if retrieval returns zero rows, or the embedding worker is behind, the call falls through to the old behaviour rather than failing. Worst case is what we shipped before the hotfix, which is a worst case we have already lived through. There is no regression path we can introduce by rolling this out. Only an upside path we can lose if rollout exposes something we missed.


Three things worth carrying out of this

If the rest is too long to remember, these are the three.

Log the payload size on every external call. Especially LLM calls. Especially the ones where the payload is “everything we know about this user.” The cost of always having the number is zero. The cost of not having it the first time something goes slow is hours.

Treat slice(0, N) on an LLM prompt as a hotfix, not a solution. It is a perfectly reasonable thing to ship on a Monday afternoon to keep a demo alive. It is not a thing to leave in place for a quarter. If the data you are cutting has no order relative to the question being asked, the first N bytes are statistically no better than a random N bytes, and the model will use them anyway, quietly, without telling you. The fix is not a bigger N, the fix is a smaller, more relevant prompt.

Measure your own data before copying a reference architecture. HNSW indexes, recursive chunking, hybrid keyword-plus-semantic scoring, large-K retrieval. Each of these is a correct default for somebody’s workload, but not automatically for yours. Our per-user B-tree filter followed by an in-memory cosine pass over a handful of vectors looks “wrong” against a textbook RAG diagram, and is the right answer for our query shape. Yours might be different. Read your data first, then pick the architecture.

Where this is going

The retrieval pipeline is rolling out behind a feature flag. The legacy truncation helper is still there underneath as a safety net. Once we have a few weeks of production data showing the fallback rate stays below our threshold, the blob column and the truncation helper both come out together.

The other consumer of the same context blob (a conversational AI surface elsewhere in our stack) currently has no truncation at all, and quietly exceeds the model’s token limit on long-tail users. Same retrieval call fixes it. The API is already built, it just needs to be wired in.

If your system pipes user history into an LLM prompt and your only defence against runaway prompt size is a length cap, you are probably shipping the same bug we were. The fix is not a bigger cap. The fix is a smaller, more relevant prompt.