ADR 0005: Memory-budget discipline in fit and load
Date: 2026-04-22 Status: accepted
Context
snapvec targets a laptop-local workflow: developer machines with
8-16 GB RAM, corpora in the 10k-1M vector range. At the top of
that range, the difference between "peak memory equals final size"
and "peak memory equals 2x-10x the final size" is the difference
between a job that completes and a job that is killed by the OOM
killer or grinds through swap.
Two places in the library were violating this implicitly:
OPQ fit. fit_opq_rotation used to cast the entire centred
float32 batch to float64 for covariance accumulation.
At N=1M, d=384 that is ~3 GB of transient memory on top of the
original float32 input. On a 16 GB laptop with embeddings already
resident, use_opq=True could OOM silently.
Load paths. All four index loaders shared the same pattern: read the payload with f.read, wrap it with np.frombuffer, and .copy() the result into arr.
f.read returns a bytes object that stays alive until the
assignment completes; .copy() allocates a second buffer of the
same size before arr takes ownership. Peak memory during load
is ~2x the final array. PQSnapIndex.load() additionally did a
.T.copy() to transpose the on-disk (n, M) row-major block into
the (M, n) C-contiguous RAM layout, adding another full-size
transient.
Individually each was a tolerable hiccup. Cumulatively they meant the library documented a 1M-vector corpus as supported while quietly needing 4-5 GB of headroom above the final index size to actually get there.
Decision
Adopt a blanket rule for the fit and load paths: never hold two copies of any per-corpus-sized array at the same time, and cap transient memory at a constant multiple of the per-vector cost.
fit_opq_rotation accumulates covariance in 16,384-row chunks.
    chunk = 16_384                                 # rows per chunk
    cov = np.zeros((d, d), dtype=np.float64)       # float64 accumulator
    for start in range(0, n, chunk):
        X_chunk = X[start:start + chunk] - mean    # float32
        X_chunk64 = X_chunk.astype(np.float64)     # ~chunk * d * 8 bytes
        cov += X_chunk64.T @ X_chunk64
Peak transient is chunk * d * 8 bytes (16,384 * 384 * 8, or ~50 MB at d=384),
independent of N. Numerics are unchanged: same per-row outer
products, same float64 accumulator, same eigendecomposition.
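The "numerics are unchanged" claim can be sanity-checked on a small random input by comparing the chunked accumulation with a whole-batch cast. This is a minimal sketch, not snapvec's fit_opq_rotation: the helper names cov_whole_batch and cov_chunked, the toy sizes, and the chunk size used in the demo are assumptions made for illustration.

```python
import numpy as np


def cov_whole_batch(X: np.ndarray) -> np.ndarray:
    """Old-style accumulation: cast the whole centred batch to float64."""
    mean = X.mean(axis=0)
    Xc64 = (X - mean).astype(np.float64)  # N * d * 8 bytes of transient
    return Xc64.T @ Xc64


def cov_chunked(X: np.ndarray, chunk: int = 16_384) -> np.ndarray:
    """Chunked accumulation: at most chunk * d * 8 bytes of float64 at once."""
    n, d = X.shape
    mean = X.mean(axis=0)
    cov = np.zeros((d, d), dtype=np.float64)
    for start in range(0, n, chunk):
        X_chunk64 = (X[start:start + chunk] - mean).astype(np.float64)
        cov += X_chunk64.T @ X_chunk64
    return cov


rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 32)).astype(np.float32)  # toy corpus

ref = cov_whole_batch(X)
out = cov_chunked(X, chunk=512)  # small chunk so several iterations run

# The two results should differ only by floating-point summation order.
max_rel_err = np.abs(out - ref).max() / np.abs(ref).max()
print("max relative error:", max_rel_err)

# float64 transient of each approach at N=1M, d=384.
whole_bytes = 1_000_000 * 384 * 8
chunk_bytes = 16_384 * 384 * 8
print("whole-batch cast:", whole_bytes, "bytes")
print("one chunk:       ", chunk_bytes, "bytes")
```

Both helpers centre with the same float32 mean, so any difference between them comes from the order in which the float64 partial sums are added.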
load() uses readinto into pre-allocated numpy buffers.
No bytes transient, no .copy() duplicate. PQSnapIndex.load()
additionally streams the on-disk (n, M) block through a 16,384-row
staging buffer and fuses the transpose into each chunk's copy into
the final (M, n) array, so the full-size transpose transient is
also gone.
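The two load strategies described above can be sketched as follows. This is an illustration under stated assumptions, not snapvec's loader: the function names read_codes and read_codes_transposed, the uint8 code dtype, and the absence of any file header are all hypothetical.

```python
import io

import numpy as np


def _fill(f, arr: np.ndarray) -> None:
    """Read exactly arr.nbytes bytes from f directly into arr's memory."""
    view = memoryview(arr).cast("B")  # flat byte view; requires C-contiguity
    got = f.readinto(view)
    if got != view.nbytes:
        raise EOFError(f"expected {view.nbytes} bytes, got {got}")


def read_codes(f, n: int, M: int) -> np.ndarray:
    """Fill a pre-allocated (n, M) array: no bytes object, no .copy()."""
    arr = np.empty((n, M), dtype=np.uint8)
    _fill(f, arr)
    return arr


def read_codes_transposed(f, n: int, M: int, chunk: int = 16_384) -> np.ndarray:
    """Read an on-disk (n, M) row-major block into an (M, n) C-contiguous array."""
    out = np.empty((M, n), dtype=np.uint8)
    stage = np.empty((min(chunk, n), M), dtype=np.uint8)  # bounded staging buffer
    for start in range(0, n, chunk):
        rows = min(chunk, n - start)
        buf = stage[:rows]  # leading rows of the stage are still contiguous
        _fill(f, buf)
        out[:, start:start + rows] = buf.T  # transpose fused into the copy
    return out


# Demo on an in-memory "file"; 1000 rows is not a multiple of the chunk size,
# so the final partial chunk is exercised too.
rng = np.random.default_rng(0)
codes = rng.integers(0, 256, size=(1000, 8), dtype=np.uint8)
blob = codes.tobytes()

a = read_codes(io.BytesIO(blob), 1000, 8)
b = read_codes_transposed(io.BytesIO(blob), 1000, 8, chunk=128)
```

In this sketch the only allocations that scale with n are the final arrays themselves; the staging buffer is capped at chunk * M bytes regardless of corpus size.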
Consequences
- Measured in a clean subprocess (so ru_maxrss is not contaminated by earlier phases):
| index | n | M | file | codes | peak RSS before | peak RSS after |
|---|---|---|---|---|---|---|
| PQ | 500k | 96 | 50 MB | 46 MB | 199 MB | 152 MB |
| IVFPQ | 500k | 192 | 96 MB | 92 MB | 291 MB | 198 MB |
Savings are linear in N -- one codes-array-worth of transient
removed per load.
- OPQ fit at N=1M drops from ~3 GB transient peak to ~50 MB transient peak, independent of N.
- File formats are unchanged. Existing .snpv, .snpq, .snpi, .snpr files round-trip identically under the new code.
- The rule generalises. Any future feature that touches a per-corpus-sized array in the fit or load path should:
  - Pre-allocate the final buffer and fill it (no frombuffer + copy, no array + array.copy()).
  - Stream arithmetic that needs a different dtype through a small bounded chunk, not a whole-batch cast.
  - Flag in review if these are not possible; do not ship a silent 2x-10x memory blow-up.
- The one remaining per-corpus allocation at load time is the final codes array itself. Reducing that further requires a memory-mapped lazy path (v0.12 roadmap). That is out of scope for this ADR.
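The clean-subprocess measurement mentioned in the first consequence can be sketched as below. This is a hypothetical harness, not the script used to produce the table: the child allocates a plain bytes object as a stand-in for an index load, and the names CHILD and peak_rss_of_child are assumptions. The resource module is Unix-only, and ru_maxrss is reported in kibibytes on Linux but in bytes on macOS.

```python
import subprocess
import sys

# Code run in a fresh interpreter, so its peak RSS reflects only this work.
CHILD = r"""
import resource
import sys

n, M = int(sys.argv[1]), int(sys.argv[2])
payload = b"\x01" * (n * M)  # stand-in for loading an (n, M) codes array
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
"""


def peak_rss_of_child(n: int, M: int) -> int:
    """Return the child's ru_maxrss (platform-dependent units)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, str(n), str(M)],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(result.stdout.strip())


small = peak_rss_of_child(1, 1)          # interpreter baseline
large = peak_rss_of_child(500_000, 96)   # baseline plus ~48 MB of payload
print("baseline peak RSS:", small)
print("with payload:     ", large)
```

Running each measurement in its own process matters because ru_maxrss is a high-water mark: once an earlier phase has pushed it up, later and smaller phases cannot bring it back down.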