# Changelog
All notable changes to snapvec are documented here.
The format follows Keep a Changelog;
the project uses Semantic Versioning.
## [Unreleased]
## [0.11.1] -- 2026-04-22
Memory + dtype correctness patch. No API or file-format change. Two memory-footprint fixes on the fit/load paths for large indices (N = 1M class), plus a sweep through the scale-and-rotate paths to eliminate silent float64 upcasts. Also ships the first batch of ADRs and the engineering-note paper draft.
### Changed

- `fit_opq_rotation(X, M)` now accumulates the covariance in 16,384-row float64 chunks instead of casting the whole centred matrix to float64 at once. Peak working memory during the OPQ fit drops from ~(n * d * 8) bytes to ~(chunk * d * 8) -- at n=1M and d=384, that is a ~3 GB -> ~30 MB transient, independent of n. Numerics are unchanged (same covariance, same eigendecomposition); all existing OPQ determinism/round-trip tests pass.
- `load()` on `SnapIndex`, `PQSnapIndex`, `IVFPQSnapIndex`, and `ResidualSnapIndex` now uses `f.readinto(arr.data)` into pre-allocated numpy buffers instead of `np.frombuffer(f.read(...)).copy()`. The old path held the transient bytes string and the final array simultaneously, roughly doubling peak RAM during load. Measured on a single process (n=500k, M=192 IVFPQ): peak RSS during load drops from 291 MB to 198 MB, a one-codes-array saving that scales linearly with n.
- `PQSnapIndex.load()` additionally streams the on-disk (n, M) row-major code block through a small staging buffer, fusing the transpose into the per-chunk copy into the final (M, n) C-contiguous array. This eliminates the full-size transpose copy that the previous `.T.copy()` path allocated.
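The chunked accumulation in the first bullet can be sketched in plain NumPy. This is an illustrative reconstruction (function name and signature assumed), not the actual `fit_opq_rotation` internals:

```python
import numpy as np

def chunked_covariance(X: np.ndarray, chunk: int = 16_384) -> np.ndarray:
    """Accumulate the centred covariance in float64, chunk-by-chunk, so the
    float64 working set is (chunk, d) instead of (n, d)."""
    mean = X.mean(axis=0, dtype=np.float64)
    d = X.shape[1]
    cov = np.zeros((d, d), dtype=np.float64)
    for start in range(0, len(X), chunk):
        # Only one chunk is ever held in float64 at a time.
        block = X[start:start + chunk].astype(np.float64) - mean
        cov += block.T @ block
    return cov / len(X)
```

The result is term-by-term identical to centring the whole matrix in float64 first; only the peak allocation changes.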
### Fixed

- Silent float64 upcasts in scale-and-rotate paths. Several sites in `_index.py`, `_ivfpq.py`, `_pq.py`, `_residual.py`, and `_rotation.py` were multiplying or rotating float32 arrays in ways that quietly promoted to float64 (e.g. `arr * scalar` with a Python float, missing `astype(float32, copy=False)` at rotation boundaries, `_apply_qjl_arrays` accumulating in float64). All hot-path arithmetic is now explicitly float32; perf is within 2% noise on these paths. Adds `tests/test_dtype_invariants.py` (224 lines) to lock the invariant.
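For readers auditing their own code for the same class of bug, the promotion is easy to demonstrate; a minimal example (not snapvec code):

```python
import numpy as np

X = np.ones((8, 4), dtype=np.float32)
R = np.eye(4)                      # float64 by default -- the silent culprit

# Rotating float32 data with a float64 matrix quietly promotes the result:
assert (X @ R).dtype == np.float64

# Pinning the operand at the boundary keeps the hot path in float32:
R32 = R.astype(np.float32, copy=False)
assert (X @ R32).dtype == np.float32
```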
### Docs

- First batch of Architecture Decision Records under `docs/adr/` (5 records: RHT off, early_stop parked, OPQ-P vs OPQ-NP, cluster-contiguous codes, memory-budget discipline).
- Engineering-note paper draft on compressed ANN for local-first RAG at `paper/snapvec-0.11.md`.
### Notes

File format is unchanged (`.snpv` v3, `.snpq` v2, `.snpi` v5, `.snpr` v1); existing indices load correctly under the new code.
## [0.11.0] -- 2026-04-21
First minor bump on the 0.10 line. Adds optional OPQ rotation
(parametric, Ge et al. 2013) to PQSnapIndex and IVFPQSnapIndex.
No behaviour change for existing callers -- the flag defaults off.
### Added

- `use_opq` parameter on `PQSnapIndex` and `IVFPQSnapIndex`. When True, `fit()` learns an orthogonal rotation that balances per-subspace variance, and both corpus and queries go through that rotation before PQ encoding / scoring. Orthogonality preserves inner products (`<q, x> == <Rq, Rx>`), so the rest of the pipeline is unchanged.
- `snapvec._kmeans.fit_opq_rotation(X, M)`: standalone helper that returns the `(dim, dim)` float32 rotation. Eigendecomposition of the covariance in float64 (for numerical stability at large training sets), round-robin allocation of sorted eigenvectors to subspaces, downcast to float32 at the return boundary.
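The eigen-allocation recipe can be sketched as follows. `opq_rotation_sketch` is illustrative only, not snapvec's `fit_opq_rotation`:

```python
import numpy as np

def opq_rotation_sketch(X: np.ndarray, M: int) -> np.ndarray:
    """Parametric OPQ-style rotation (sketch): diagonalise the covariance in
    float64, deal the variance-sorted eigenvectors round-robin across the M
    subspaces so each gets a similar variance budget, downcast on return."""
    d = X.shape[1]
    assert d % M == 0
    Xc = X.astype(np.float64) - X.mean(axis=0, dtype=np.float64)
    cov = Xc.T @ Xc / len(X)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # largest variance first
    buckets = [[] for _ in range(M)]             # round-robin allocation
    for rank, idx in enumerate(order):
        buckets[rank % M].append(idx)
    cols = [i for bucket in buckets for i in bucket]
    return eigvecs[:, cols].astype(np.float32)   # orthogonal (d, d)
```

Because the result is a column permutation of an orthonormal eigenbasis, it stays orthogonal, which is what preserves the inner products the rest of the pipeline relies on.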
### Impact (BEIR FIQA, BGE-small, N=57,638, nlist=512, 300 queries)
Recall@10 at matched bytes/vec, rerank disabled:
| M | d_sub | baseline | OPQ | delta |
|---|---|---|---|---|
| 48 | 8 | 0.553 | 0.656 | +10.3pp |
| 96 | 4 | 0.767 | 0.812 | +4.6pp |
| 192 | 2 | 0.932 | 0.931 | 0 |
OPQ helps when the per-subspace dimension is at least 4. At
d_sub = 2 the rotation has no room to redistribute variance and
the gain collapses. Latency is identical in all three rows
(the extra matmul at preprocess is ~10 us, hidden by the rest of
the search pipeline).
Scope. The table above is measured on a single dataset (BEIR
FIQA with BGE-small). OPQ gains depend on the coordinate-variance
distribution of your embedding; on near-isotropic distributions
there is nothing for the rotation to redistribute. Run
`experiments/bench_ivfpq_opq.py` on your own corpus before
claiming a recall win downstream.
### Training cost

`fit_opq_rotation` on the FIQA training sample (dim=384):
| N (training rows) | time | peak memory |
|---|---|---|
| 10k | 21 ms | 52 MB |
| 57k | 56 ms | 271 MB |
The memory peak comes from the intermediate float64 cast used for
covariance accumulation; it scales to ~3 GB transiently at N = 1M.
A follow-up can chunk the covariance accumulation to bound memory
at the cost of one extra pass; not in this release. Stored
rotation matrix itself is dim * dim * 4 bytes (576 KB at dim=384).
Runtime cost is ~1.6 us per query matmul.
### File format

- `.snpq` bumped to v2 (legacy v1 still readable). New flag `_FLAG_USE_OPQ` (bit 2); rotation block written only when set.
- `.snpi` bumped to v5 (legacy v1-v4 still readable). New flag `_FLAG_USE_OPQ` (bit 3); rotation block written only when set.
Non-OPQ saves pay no disk cost for the new rotation block -- the
serialisation is gated on the flag, so baseline indices get only
the four-byte version-header bump from v1 to v2 (`.snpq`) or v4
to v5 (`.snpi`). v0.10.x readers still refuse v0.11.0 files at
the version check, even when no OPQ flag is set; upgrade them via
`pip install -U snapvec` as the error message suggests.
### Tests

9 new tests covering OPQ across `PQSnapIndex` and `IVFPQSnapIndex`:
rotation orthogonality, mutual exclusion with RHT, save/load
round-trip, seed determinism, no recall regression on clustered
synthetic data, and a regression guard (`test_ivfpq_opq_search_batch_matches_search`)
that asserts `search_batch` and `search` agree under `use_opq=True`.
Full suite: 199 tests, `mypy --strict` clean, `ruff` clean.
### Internal

- PR #64 review caught a real bug during dogfooding: `search_batch` initially rotated queries via `_preprocess_single`'s logic but not in its own batch path. Fixed + regression test added.
## [0.10.3] -- 2026-04-21
Performance patch release. No API changes. Byte-identical on-disk output for the same seed and inputs (determinism tests still pass).
### Performance

- `IVFPQSnapIndex.add_batch` streaming ingest is ~18% faster (PR #61). The old path allocated two N-sized copies of `_codes` per call (via `np.concatenate`, then a second `[:, order]` after argsort). The new path sorts just the new batch by cluster id and then merges it into the cluster-contiguous layout with a single allocation plus a per-cluster contiguous copy loop -- one N-sized memcpy per call instead of two, and the copy is a straight stream instead of a fancy-index scatter.
Measured on a 50k-row streaming ingest, batch_size=500,
dim=128, nlist=32, M=16:
| Point | Before | After | Delta |
|---|---|---|---|
| First batch (N=500) | 2.46 us/row | 1.75 us/row | -29% |
| Last batch (N=50k) | 8.02 us/row | 6.15 us/row | -23% |
| Total 50k ingest | 243 ms | 200 ms | -18% |
The last-batch-vs-first-batch growth ratio is still ~3x because the remaining copy is fundamental to the cluster-contiguous layout; truly O(1)-per-row streaming needs a delta-buffer design (ROADMAP v0.12).
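The single-allocation merge can be sketched as follows. The function name, signature, and `(M, n)` layout details here are illustrative, not the PR #61 source:

```python
import numpy as np

def merge_cluster_contiguous(codes, offsets, new_codes, new_clusters):
    """Merge a new batch into cluster-contiguous (M, n) codes with one
    allocation and straight per-cluster streaming copies (sketch)."""
    M, n = codes.shape
    nlist = len(offsets) - 1
    order = np.argsort(new_clusters, kind="stable")   # sort only the batch
    new_sorted = new_codes[:, order]
    counts = np.bincount(new_clusters, minlength=nlist)
    merged = np.empty((M, n + new_codes.shape[1]), dtype=codes.dtype)
    new_offsets = np.zeros(nlist + 1, dtype=np.int64)
    src, dst = 0, 0
    for c in range(nlist):
        old = offsets[c + 1] - offsets[c]
        # Existing rows of cluster c, then the batch's rows for cluster c --
        # both contiguous slices, so each copy is a straight memcpy.
        merged[:, dst:dst + old] = codes[:, offsets[c]:offsets[c + 1]]
        dst += old
        merged[:, dst:dst + counts[c]] = new_sorted[:, src:src + counts[c]]
        dst += counts[c]
        src += counts[c]
        new_offsets[c + 1] = dst
    return merged, new_offsets
```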
- Row-wise L2 norm via `einsum` instead of `np.linalg.norm` (PR #59, 7 call sites across `_index.py`, `_ivfpq.py`, `_pq.py`, `_residual.py`). `np.sqrt(np.einsum('ij,ij->i', X, X))` is 2.4x faster than `np.linalg.norm(X, axis=1)` in isolation on a (10000, 384) float32 matrix (0.72 ms -> 0.30 ms per call). In practice the end-to-end delta on every measured path is within noise (+/- 1% over `SnapIndex.add_batch`, `PQSnapIndex.add_batch`, `ResidualSnapIndex.add_batch`, `IVFPQSnapIndex.search_batch`) because the norm is a small fraction of each path (RHT + codebook search dominate). Kept for correctness consistency and for callers whose hot path does dominate on batch normalisation. See `experiments/bench_bolt_einsum_impact.py` for the measurement.
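The two formulations are numerically interchangeable, which is what makes the swap safe; a quick check:

```python
import numpy as np

X = np.random.default_rng(0).standard_normal((1000, 384)).astype(np.float32)

n_ref = np.linalg.norm(X, axis=1)               # general-purpose path
n_fast = np.sqrt(np.einsum('ij,ij->i', X, X))   # fused row-wise dot product

# Same values, same dtype -- einsum just skips norm()'s generality overhead.
assert np.allclose(n_ref, n_fast, rtol=1e-4)
assert n_fast.dtype == np.float32
```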
## [0.10.2] -- 2026-04-21
Release-infrastructure patch on top of v0.10.1. No library changes.
### Changed

- Drop `macos-13` from the wheel matrix. During the v0.10.1 release run the GitHub-hosted `macos-13` runner pool stayed queued for ~45 minutes and was cancelled, blocking the PyPI publish (`publish` requires every wheel job to succeed). Apple Silicon has dominated new Mac sales for 4+ years; Intel Mac users now install from the sdist (requires `brew install libomp` and a C compiler available via Xcode command-line tools). Re-adding the row is one matrix entry if GitHub's macos-13 capacity recovers.
The v0.10.0 and v0.10.1 tags exist in git but never produced PyPI artifacts; 0.10.2 is the first version actually published.
## [0.10.1] -- 2026-04-20
Release-infrastructure patch on top of v0.10.0. No library changes.
### Fixed

- cibuildwheel macOS arm64 target: `MACOSX_DEPLOYMENT_TARGET=13.0` was set uniformly for every macOS wheel, but the Homebrew libomp installed on GitHub's `macos-14` runner is linked against macOS 14 system libraries. `delocate-wheel` rejected the bundle with "Library dependencies do not satisfy target MacOS version 13.0". Pin the target per matrix row: `macos-13` builds (x86_64) stay at `13.0`, `macos-14` builds (arm64) now target `14.0`. User-visible consequence: snapvec arm64 wheels require macOS 14+ (Sonoma); x86_64 wheels still work on macOS 13+.
- cibuildwheel test env missing hypothesis (from the v0.10.0 release attempt, previously fixed in PR #57 but worth recording in the release notes): the cibuildwheel test phase runs the full test suite against the built wheel and needs every test-time import. `CIBW_TEST_REQUIRES: "pytest hypothesis>=6.100"`.
## [0.10.0] -- 2026-04-20
Headline: professionalisation release. The only library change
is a narrow fix in `SnapIndex.search` (reject `k < 1` with a clear
error, matching the other three index types); everything else wires
up the infrastructure, tests, documentation, and benchmarks that make
snapvec ready for third-party adoption and publishable comparisons.
### Added

- CI test matrix on Linux, macOS-14, and Windows across CPython 3.10, 3.12, 3.13 (PR #43). Pre-compiled wheels produced by `cibuildwheel` cover Linux x86_64/aarch64, macOS-13 x86_64, macOS-14 arm64, and Windows AMD64 across CPython 3.10, 3.11, 3.12, 3.13, published to PyPI on `v*` tags via trusted publishing.
- Documentation site at https://stffns.github.io/snapvec/ built from MkDocs Material + `mkdocstrings`. Eighteen pages: getting started, one user-guide per index type, architecture, benchmarks, concurrency contract, API reference, changelog (PR #49).
- Executable examples in `examples/` (quickstart, PQ, IVF-PQ with rerank, filter search, save/load, streaming ingest) exercised by the CI matrix on every commit (PR #49).
- Community files: `SECURITY.md`, `CODE_OF_CONDUCT.md`, `CONTRIBUTING.md`, bug / feature / PR templates, dependabot, a root `CLAUDE.md`, pre-commit config (PR #43).
- Test suite grew from 151 to 190+:
    - Hypothesis property-based tests (order-invariance, save/load round-trip, delete reduces len, search respects k, filter subset) across SnapIndex, PQSnapIndex, IVFPQSnapIndex (PR #50).
    - Determinism tests: same seed + same inputs produces byte-identical index files for all four classes (PR #50).
    - Adversarial tests: empty index, n=1, k > n, zero-norm queries, degenerate distributions, bit-extreme configurations (PR #50).
    - Forward-compat tests: patch a known-good file's version byte and assert the loader raises an upgrade hint (PR #52).
    - Recall regression smoke test on a clustered synthetic corpus: asserts IVFPQ + rerank stays >= 0.90 recall@10 on every CI run (PR #52).
- `bench/threading` curve on search_batch: p50 across num_threads = 1, 2, 4, 8 and nprobe in (4..256). Published in docs (PR #51).
- `bench/competitive` head-to-head on BEIR FIQA: snapvec (three bit-depths for flat + IVFPQ with and without rerank at M=48 and M=192) vs FAISS IVFPQ (M=48 and M=192 matched-budget) vs hnswlib vs sqlite-vec. Unified Pareto table in docs (PR #53).
- MkDocs site now deploys automatically to GitHub Pages on every merge to `main` (PR #49).
### Changed

- `SnapIndex.search(k)` now raises `ValueError` for `k < 1`, matching the validation that `PQSnapIndex`, `ResidualSnapIndex`, and `IVFPQSnapIndex` already had (PR #50). Caught by the new adversarial suite; prior behaviour returned all results on `k=0` and `n-1` results on `k=-1` via an accidental slicing interaction.
- Forward-compat errors on `load()` now name the versions this build supports and point at `pip install -U snapvec` (PR #52). The old `"unsupported version N"` is gone.
- README trimmed from 673 to 91 lines; technical content moved to the MkDocs site under structured navigation (PR #49).
- macOS arm64 build uses `-mcpu=native` instead of `-march=native` (which Xcode 15 clang rejects as `unknown target CPU 'apple-m1'`). Fixed the CI build on GitHub's `macos-14` runners (PR #43).
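The changelog does not show the offending line, but one plausible shape of the accidental slicing interaction (an assumption, shown here purely to illustrate the failure mode) is a top-k taken via a tail slice:

```python
import numpy as np

scores = np.arange(5)
ranked = np.sort(scores)[::-1]   # best-first

# Hypothetical bug shape: selecting the top-k with arr[-k:] misbehaves
# at the boundaries instead of raising.
k = 0
assert len(ranked[-k:]) == 5     # -0 is just 0, so arr[-0:] returns ALL rows
k = -1
assert len(ranked[-k:]) == 4     # arr[1:] returns n-1 rows
```

Validating `k < 1` up front, as 0.10.0 now does, closes the whole class regardless of the slicing idiom used downstream.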
### Fixed

- Ten ruff lint errors in snapvec/ and tests/ (unused locals and an ambiguous `l` loop variable) (PR #43).
### Documentation

- Concurrency contract (`docs/user-guide/concurrency.md`) makes the single-writer / multi-reader guarantee explicit. Calls out that `freeze()` (or one warm-up search) is required before fanning out to reader threads because `SnapIndex.search` lazily materialises its fp16 centroid cache on the first query (PR #50, PR #51).
- Benchmarks page covers FIQA recall, sqlite-vec scale, the threading curve, and the competitive comparison (PR #53).
- README badges: PyPI version, Python versions, CI, Docs deployment, license, monthly downloads (PR #43, PR #49).
### Infrastructure

- `hypothesis` pulled into `[dev]` optional deps (PR #50).
- `mkdocs`, `mkdocs-material`, `mkdocstrings[python]` pulled into a new `[docs]` optional extra (PR #49).
- `faiss-cpu`, `hnswlib`, `sqlite-vec` are used only by the competitive bench; not added as snapvec runtime dependencies.
## [0.9.0] -- 2026-04-16
Headline: 5.8x faster search via Cython compiled kernels, recall 0.977 at 441us on FIQA -- zero extra runtime dependencies.
The v0.8.1 pure-NumPy sprint proved that the ADC scoring loop (74-97% of search time) is dispatch-bound at M=192: NumPy's ~2us per-call overhead x 192 subspaces = ~400us floor that no vectorisation trick can break. This release compiles the hot path to native code via Cython+OpenMP, eliminating the dispatch overhead entirely.
The 0.977 recall at nprobe=64+rerank reproduces the v0.6.0 measurement exactly, across three releases with non-trivial algorithmic changes (column-major layout, batched matmul LUT, compiled kernel). Same config, same data, same number.
### Added

- Cython compiled ADC kernel (`snapvec/_fast.pyx`) with OpenMP parallel scoring via `prange`. Two kernels:
    - `adc_colmajor`: fused subspace scoring loop (replaces the M=192 Python dispatch loop). Serial below 2000 candidates, parallel above.
    - `fused_gather_adc`: reads codes directly from `_codes` via `row_idx` indirection, eliminating the intermediate `(M, total)` gather buffer. Saves both the memcpy and the allocation.
- `setup.py` with OpenMP auto-detection per platform (macOS via Homebrew libomp, Linux via gcc `-fopenmp`, Windows via MSVC `/openmp`). Falls back to serial-only compilation if OpenMP is not available.
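For reference, the computation the compiled kernel performs is the classic ADC sum; a pure-NumPy sketch of the semantics (illustrative, not the `_fast.pyx` source):

```python
import numpy as np

def adc_scores_reference(codes_colmajor: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Reference ADC: for each candidate, sum the LUT entries selected by
    its per-subspace codes.

    codes_colmajor: (M, n) uint8 PQ codes, column-major per the changelog.
    lut:            (M, K) per-query lookup table of partial scores.
    """
    M, n = codes_colmajor.shape
    scores = np.zeros(n, dtype=np.float32)
    for m in range(M):                        # the per-subspace loop -- this
        scores += lut[m, codes_colmajor[m]]   # dispatch is what Cython fuses
    return scores
```

The Cython kernel computes exactly this sum but in one native loop nest, which is where the ~2 us-per-dispatch x 192-subspace floor disappears.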
### Changed

- Build backend switched from hatchling to setuptools to support Cython extension compilation.
- numba removed -- Cython produces identical performance (same LLVM backend on macOS) without the 143 MB JIT runtime, cold-start penalty, or Python version lag. `snapvec` now depends only on numpy at runtime.
- `pyproject.toml` description updated to reflect "NumPy + Cython compiled kernels" (was "Pure NumPy, no heavy dependencies").
### Performance
ADC kernel microbench (M=192, K=256):
| n_cand | NumPy (v0.8.1) | Cython serial | Cython parallel |
|---|---|---|---|
| 500 | 199 us | 42 us (4.7x) | 40 us (5.0x) |
| 3,500 | 695 us | 271 us (2.6x) | 187 us (3.7x) |
| 10,000 | 2,112 us | 788 us (2.7x) | 253 us (8.3x) |
| 50,000 | 8,847 us | 4,604 us | 946 us (9.4x) |
End-to-end on BGE-small / FIQA (N=57,638, nlist=512, M=192, K=256, 500 queries):
| nprobe (+ rerank) | v0.6.0 (us) | v0.9.0 (us) | Speedup | Recall |
|---|---|---|---|---|
| 32 | 1,380 | 369 | 3.7x | 0.944 |
| 64 | 2,550 | 441 | 5.8x | 0.977 |
| 128 | 4,590 | 635 | 7.2x | 0.992 |
PQ full-scan: recall 0.915 at 1,529 us. IVF-PQ nprobe=64 + rerank is 3.5x faster at higher recall -- strictly superior on both axes.
Four operating points, all sub-millisecond at N=57K:
| Profile | nprobe | Recall | us/query |
|---|---|---|---|
| Fastest | 16 | 0.883 | 304 |
| Interactive | 32 | 0.944 | 369 |
| High precision | 64 | 0.977 | 441 |
| Near-exact | 128 | 0.992 | 635 |
### What was tried and rejected
Full experimentation log in `experiments/PERF_NOTES.md`:
- Vectorised ADC (take_along_axis, chunked, fancy indexing): 0.4-0.9x. Materialising (M, n_cand) intermediate kills L1 cache locality that the per-subspace loop preserves.
- Sparse CSR matvec: 3.3x without build cost, 0.2x with. CSR construction dominates.
- Gather vectorisation (np.concatenate, fancy indexing, cumsum): all slower. Contiguous-slice memcpy beats scattered access.
- Numba gather kernel: 0.1x. Element-by-element Numba loop can't beat NumPy's bulk memcpy for contiguous slices.
- Numba was the initial compiled backend. Matched Cython performance exactly but adds 143MB runtime (numba + llvmlite), 2-3s JIT cold start, Python version lag, and serverless/container friction. Replaced by Cython for zero runtime cost.
### Pipeline structure (confirmed optimal)
Query -> Preprocess (6us) -> Coarse probe (19us) -> LUT build (25us)
-> Gather row_idx (22us) -> Fused ADC (187us) -> Top-k (84us)
Gather-then-score (224 dispatches) beats score-per-cluster (6,144 dispatches) by 27x. Every pipeline stage profiled and tested against alternatives.
## [0.8.1] -- 2026-04-16
Performance-only release: push pure-NumPy search to its ceiling.
### Changed

- LUT build: 7x faster -- replace M=192 separate per-subspace matmuls with a single batched `np.matmul` call. Applies to `PQSnapIndex.search()`, `IVFPQSnapIndex.search()`, and `IVFPQSnapIndex.search_batch()` (einsum replaced with matmul). LUT build drops from ~10% to <1% of IVF-PQ search time.
- PQSnapIndex ADC: 1.47x faster -- switch `_codes` storage from row-major `(n, M)` to column-major `(M, n)`, matching the IVF-PQ layout. Per-subspace gather now reads contiguous memory instead of stride-M hops. End-to-end PQ search is ~32% faster. File format unchanged (transpose on save/load).
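The batched LUT build can be sketched against its per-subspace predecessor; function names and the codebook layout are illustrative assumptions, not snapvec internals:

```python
import numpy as np

def build_lut_batched(q: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """One batched matmul over all M subspaces (the 0.8.1 shape).

    q:         (dim,) preprocessed query, dim == M * d_sub.
    codebooks: (M, K, d_sub) per-subspace centroids.
    Returns:   (M, K) lookup table of <q_m, c_mk> partial scores.
    """
    M, K, d_sub = codebooks.shape
    q_sub = q.reshape(M, d_sub, 1)
    return (codebooks @ q_sub)[..., 0]   # batched (M,K,d_sub) @ (M,d_sub,1)

def build_lut_loop(q: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """The pre-0.8.1 shape: M separate small matmuls, M dispatches."""
    M, K, d_sub = codebooks.shape
    q_sub = q.reshape(M, d_sub)
    return np.stack([codebooks[m] @ q_sub[m] for m in range(M)])
```

Both produce the same `(M, K)` table; the batched form replaces M Python-level dispatches with one.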
### Performance notes

Full pure-NumPy optimisation sprint documented in
`experiments/PERF_NOTES.md`. Tested and rejected: vectorised ADC
via `take_along_axis` (0.4x), chunked ADC (0.4x), sparse CSR
matvec (0.2x with build cost), gather vectorisation via fancy
indexing (0.15x). Pipeline structure (gather-then-score) confirmed
optimal: 224 dispatches vs 6,144 for score-per-cluster.
## [0.8.0] -- 2026-04-15
Headline: multi-reader safety + id-scoped IVF search. Two
features the v0.7 users asked for once snapvec landed in multi-
tenant services: a formal thread-safety contract so a single index
can fan out across request threads without duplicating memory, and
id-whitelist filtering on IVFPQSnapIndex so tenant / partition
scoping doesn't have to be bolted on with post-filter passes.
### Added

- `FreezableIndex` mixin — every index class now exposes `idx.freeze()` / `idx.unfreeze()` / `idx.frozen`. After freezing, every mutator (`add`, `add_batch`, `delete`, `fit`, `close`) raises `RuntimeError` naming the operation, so concurrent `search()` from multiple threads is race-free by contract — no lock is taken on the query hot path.

    `SnapIndex.freeze()` also pre-warms the lazy `_cache` (the float16 centroid matrix materialised on first full-scan search), so two concurrent first-queries on a just-frozen index no longer race on that assignment.

- `IVFPQSnapIndex.search_batch` is safe under concurrent callers. A `threading.Lock` serialises the lazy `ThreadPoolExecutor` init so two callers can't both observe `_executor is None` and each create their own pool. The init path uses double-checked locking so the steady-state query path never touches the lock. Once created, the executor is not torn down during `search_batch()`; if a caller asks for a different `num_threads` after init, `search_batch` raises `ValueError` rather than shutting down a pool another thread may be about to submit work to. Lifecycle teardown remains in `close()`.
- `IVFPQSnapIndex.search(..., filter_ids=...)` / `search_batch(..., filter_ids=...)` — id-whitelist filtering for IVF-PQ, cluster- and pool-aware:
    - Cluster skip: the probe ranking is restricted to clusters that contain at least one filter row, so sparse filters don't waste `nprobe` slots on empty clusters. When `len(allowed_clusters) <= nprobe` the probe set is computed with a direct `np.tile` of the allowed clusters (no argpartition).
    - Pool-aware: the row filter is applied before the top-k / rerank-pool selection, so the rerank candidate pool is drawn from the filtered subset — no slots spent on ids the caller just excluded.
    - Sparse-friendly mask: internally represented as a sorted `int64` array + `np.isin(row_idx, filter_rows, assume_unique=True)`, not a dense `N`-bool array. At `N = 100M` this is the difference between a 100 MB allocation per query and a scan proportional to the probed subset.
    - Unknown ids in the filter are silently dropped; an entirely-unknown filter returns `[]` immediately.
### Changed

- Every index class's `add()` now calls `self._check_not_frozen("add")` before delegating to `add_batch()`, so the `RuntimeError` thrown on a frozen index names `add` (the method the caller invoked) rather than `add_batch` (the internal delegate).
### Documentation

- `IVFPQSnapIndex.search()` docstring clarifies that the rerank cache has been `float16` since v0.7 while the rerank matmul itself stays in `float32` via NumPy type-promotion on `q_pre`.
- `_freezable.py` module docstring spells out that any lazy per-class cache reachable from `search()` must be pre-warmed in the subclass's `freeze()` override (or serialised by the class itself) — not assumed safe by default.
## [0.7.1] — 2026-04-15
Test-correctness patch. No behaviour change to the shipped fp16 rerank cache — the v0.7.0 recall measurements (0.943 / 0.977 / 0.993) stand. But the unit test that was supposed to protect against future regressions of that cast was vacuous, caught in the PR #31 post-merge review (thanks, Copilot).
### Fixed

- `test_fp16_cache_recall_matches_fp32_within_noise` was a tautology. The original A/B swapped the fp16 cache for its own upcast (`cache.astype(float32)`) and compared. Because upcasting float16 → float32 is lossless and the rerank matmul runs in fp32 regardless (driven by `q_pre`'s fp32 dtype via NumPy type promotion), the two runs produced bit-identical scores and the assertion was always trivially true. A future regression that made the cast lossier would not have been caught.
The test now reconstructs the pre-truncation fp32 cache by
re-running `_preprocess` on the original float32 corpus and
compares rerank recall against that. The fix includes an
explicit `scores_differ` assertion that at least one score in the
top-10 must differ between the fp16 and fp32 runs — a guard that
will fail loudly if the test regresses back to the vacuous form.
- Clarified the `_full_precision` field docstring on how NumPy's type-promotion rules keep the rerank matmul in float32. The previous wording said NumPy "promotes fp16 to fp32 internally on matmul" — technically true in this case, but the reason it lands in float32 is that `q_pre` is float32, not anything intrinsic to the fp16 dtype. Rewritten to flag the risk if `q_pre` were ever downgraded to fp16.
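The losslessness that made the original test vacuous is easy to demonstrate (illustrative, not the test itself):

```python
import numpy as np

x32 = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
x16 = x32.astype(np.float16)                 # the truncating cast under test

# fp16 -> fp32 upcasting is exact (every fp16 value is representable in
# fp32), so comparing the fp16 cache against its own upcast compares it
# with itself -- the scores cannot differ:
assert np.array_equal(x16.astype(np.float32), x16.astype(np.float32))

# A meaningful A/B must compare against the ORIGINAL fp32 values, where
# the fp16 truncation actually lost information:
assert not np.array_equal(x16.astype(np.float32), x32)
```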
## [0.7.0] — 2026-04-15
Headline: tier-1 production reliability + halved rerank cache.
Two features that make snapvec deployable at scale in environments
where silent disk corruption, cloud re-encoding, or partial writes
are real risks — and where the 4× storage cost of
`keep_full_precision=True` in v0.6 was the main reason users
capped out their use of the rerank path.
### Added

- CRC32 trailer on every persistent format (`.snpv`, `.snpq`, `.snpr`, `.snpi`). Catches single-bit flips on disk and mid-transfer corruption at `load()` time instead of silently returning wrong search results. 8-byte tail: 4-byte magic `CRC2` + uint32 LE `zlib.crc32` of the preceding payload. `CRC2` (not `CRC1`) leaves room to upgrade to xxHash3 later without breaking old readers.
- Shared atomic-save helpers in `snapvec/_file_format.py`: `ChecksumWriter`, `save_with_checksum_atomic(path, writer_fn)`, `verify_checksum(path)`, `has_trailer(path)`. Each index module's `save()` body shrinks to a `_write(f)` closure; the `.tmp` + `os.replace` atomic-rename dance lives in one place.
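The trailer-plus-atomic-rename scheme can be sketched as follows; the helper names here are illustrative stand-ins, not the `_file_format.py` API:

```python
import os
import struct
import zlib

def save_atomic_with_trailer(path: str, payload: bytes) -> None:
    """Write payload + 8-byte trailer (magic b"CRC2" + uint32 LE CRC32) to a
    temp file, then atomically rename so readers never see a partial write."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(payload)
        f.write(b"CRC2" + struct.pack("<I", zlib.crc32(payload)))
    os.replace(tmp, path)   # atomic on POSIX and Windows

def verify_trailer(path: str) -> bool:
    """Check the magic and recompute the CRC over the payload."""
    with open(path, "rb") as f:
        data = f.read()
    payload, magic, crc = data[:-8], data[-8:-4], data[-4:]
    return magic == b"CRC2" and struct.unpack("<I", crc)[0] == zlib.crc32(payload)
```

A flipped bit anywhere in the payload changes the recomputed CRC, so verification fails at load time instead of surfacing as wrong search results.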
### Changed

- `IVFPQSnapIndex` rerank cache is now float16 (was float32 in v0.6). Halves both disk and RAM footprint of `keep_full_precision=True` indices. NumPy upcasts fp16 → fp32 implicitly in the rerank matmul, so arithmetic precision is unchanged — only the stored representation is quantised.
Recall A/B on BGE-small / FIQA (N = 57,638, `rerank_candidates=100`):
| nprobe | fp32 (v0.6) | fp16 (v0.7) | Δ |
|---|---|---|---|
| 32 | 0.943 | 0.943 | 0.000 |
| 64 | 0.977 | 0.976 | -0.001 |
| 128 | 0.994 | 0.993 | -0.001 |
Storage drops from 84 MB to 42 MB at N = 57,638; extrapolated, that is 1.5 GB → 768 MB at N = 1M.
- `.snpi` bumped to v4. v3 files (fp32 cache) load transparently via a row-chunked fp32→fp16 cast (≈1 MB of fp32 intermediate per chunk) so peak transient RAM stays bounded regardless of N.
### Fixed

- `ChecksumWriter.finalise()` is now actually idempotent (a second call is a no-op, not a second trailer).
- `write()` after `finalise()` now raises `RuntimeError` instead of silently invalidating the trailer CRC.
- `verify_checksum` opens the file once (previously twice — once via `has_trailer`, once to compute the CRC).
### v0.8 roadmap

- `freeze()` + thread-safe search contract (deferred from v0.7 — better with more soak time).
- `filter_ids` IVF-aware (cluster skip + pool-aware).
- `snapvec[fast]` Numba / Rust accelerator (see `docs/blog/02-fast-extension.md`).
## [0.6.0] — 2026-04-15
Headline: breaks the PQ recall ceiling. The v0.5.0 release
documented a hard limit of recall@10 = 0.929 at M = 192, K = 256
— no amount of nprobe could get past it, because the residual PQ
codebook couldn't resolve the last few percent of error. v0.6 adds
an opt-in float32 rerank pass that recovers the recall the PQ
approximation lost, at <1 % latency overhead.
### Added

- `IVFPQSnapIndex(... keep_full_precision=True)` — stores an additional `(n, dim_eff) float32` cache of the original (post-preprocess) vectors alongside the PQ codes, sorted cluster-contiguously in sync with `_codes`. Opt-in; default `False` keeps the v0.5 storage footprint exactly. Storage cost is `dim_eff × 4 bytes / vector` (e.g. +1536 B/vec at `dim = 384`).
- `IVFPQSnapIndex.search(..., rerank_candidates=N)` — when set (and the index was built with `keep_full_precision=True`), the IVF-PQ pass returns its top-`N` candidates, those get re-scored exactly against the cached float32 vectors, and the top-`k` of the reranked set is returned. Requires `rerank_candidates >= k`.
Measured on FIQA / BGE-small (N = 57,638, nlist=512, M=192,
K=256, 300 queries, mean ms / query):
| nprobe | PQ recall | PQ ms | + rerank(100) recall | + rerank(100) ms |
|---|---|---|---|---|
| 16 | 0.842 | 0.87 | 0.880 | 0.89 |
| 32 | 0.893 | 1.49 | 0.943 | 1.38 |
| 64 | 0.917 | 2.52 | 0.977 | 2.55 |
| 128 | 0.928 | 4.55 | 0.994 | 4.59 |
Rerank cost is a single `(rerank_candidates, dim_eff) @ (dim_eff,)`
matmul (~38k ops at N = 100). Latency overhead is within measurement
noise on this laptop. `rerank_candidates=100` already saturates the
recall lift — going to 200 does not move the number.
- `.snpi` v3 file format with `_FLAG_KEEP_FULL_PRECISION`. v1 and v2 files continue to load transparently (no full-precision data); v3 save/load round-trips the float32 cache byte-for-byte.
- `stats()` now reports both `bytes_per_vec` and `bytes_per_vec_codes_only` so users can see the storage delta that `keep_full_precision` introduces.
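The rerank pass described above reduces to a gather, one matmul, and a second top-k; an illustrative sketch (names and signature assumed, not snapvec source):

```python
import numpy as np

def rerank_sketch(q, pq_scores, row_idx, full_cache, k, rerank_candidates):
    """Take the approximate top-N by PQ score, re-score those rows exactly
    against the float32 cache, return the exact top-k (higher = better)."""
    top_n = row_idx[np.argsort(pq_scores)[::-1][:rerank_candidates]]
    exact = full_cache[top_n] @ q            # one (N, dim) @ (dim,) matmul
    return top_n[np.argsort(exact)[::-1][:k]]
```

As long as the true best row survives into the PQ top-N, the exact re-scoring recovers it, which is why the recall ceiling lifts while latency barely moves.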
### Changed

- `IVFPQSnapIndex` internals extract a shared `_gather_pq_scores` helper so `_score_one` (default path) and `_score_one_with_rerank` share the same candidate-scoring logic — including the `* self._norms` scaling when `normalized=False`. A single source of truth prevents divergence of the candidate metric between the two paths (a class of bugs that bit the first rev of this PR).
- `SnapIndex.delete` is now O(1) via swap-with-last instead of O(N) via `np.delete` + dict-rewrite. Micro-bench at N = 10k over 500 deletes: 421× faster (112 ms → 0.27 ms). Unlocks use cases with id churn (tenant rotation, TTL eviction).
- `SnapIndex._apply_qjl_arrays` (the `use_prod=True` path) no longer materialises a `(N, d) float32` temporary via `qjl.astype(np.float32) @ S_q`. `np.dot(qjl, S_q)` delegates the mixed-type math to NumPy's C backend directly. Latency-equivalent at our sizes (both paths hit the BLAS floor) but avoids the ~150 MB allocation spike per search at N = 100k.
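The swap-with-last delete can be sketched as follows; the data-structure names are illustrative, not `SnapIndex` internals:

```python
import numpy as np

def delete_swap_with_last(vectors, ids, id_to_row, target_id):
    """O(1) delete (sketch): overwrite the deleted row with the last row,
    update the id->row map for the moved id, shrink by one."""
    row = id_to_row.pop(target_id)
    last = len(ids) - 1
    if row != last:
        vectors[row] = vectors[last]     # move last row into the hole
        ids[row] = ids[last]
        id_to_row[ids[last]] = row       # re-point the moved id
    return vectors[:last], ids[:last]
```

The trade-off is that row order is no longer insertion order, which is fine for a flat index whose search scans every row anyway.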
### Documentation
- README decision table adds a new row for the rerank path and quotes the 0.994 recall number.
- `docs/blog/01-numpy-perf-ceiling.md` and `docs/blog/02-fast-extension.md` kept up to date with the decision matrix (Numba vs Rust + PyO3) for the v0.7 accelerator follow-up.
### v0.7 roadmap (tracked but not in 0.6)

- `snapvec[fast]` accelerator for the IVF-PQ search hot path. Numba for prototyping, Rust + PyO3 + maturin for shipping.
- Bit-packed PQ codes for `K < 256`.
- `filter_ids` on `PQSnapIndex` and `IVFPQSnapIndex`.
## [0.5.0] — 2026-04-15
Headline: an evidence-driven perf sprint for IVFPQSnapIndex.
Search-side wins are modest in pure NumPy (the hot path is dispatch-
bound at M = 192 — the rest of the slack only opens up with a
compiled inner loop), but the build path, the API surface and the
documentation each got a real lift. See `PERF_DECISION.md` for the
sprint synthesis and `experiments/PERF_NOTES.md` for the
negative-result archive.
### Added

- `IVFPQSnapIndex.search_batch(queries, k, nprobe, num_threads)` — ergonomic batched API. Builds the per-batch coarse score matrix and the `(B, M, K)` LUT tensor with single BLAS calls instead of looping `search()` per query, with optional `ThreadPoolExecutor` parallelism for the per-query gather + scoring phase. Performance-neutral (~1×) vs the per-query loop in pure NumPy because Apple Accelerate / Intel MKL already amortise; future-proof for the v0.6 numba kernel.
- `IVFPQSnapIndex.close()` — explicit thread pool teardown for long-lived processes that cycle indices.
- n_train sizing warning + docs — `IVFPQSnapIndex.fit()` raises a `UserWarning` when `n_train < 30 · nlist` (the FAISS rule of thumb), with the actual ratio and recommended size. README adds a "Sizing nlist and the training set" subsection with a concrete table for N ∈ {10k, 100k, 1M, 10M}.
- `PERF_DECISION.md` — one-page sprint synthesis kept at the repo root: what we measured, what we shipped, what we dropped, why the original "1M / 0.94 / <1 ms" target shifted, and the v0.6 roadmap.
- `experiments/PERF_NOTES.md` — negative-result archive with measured A/B for every technique that did not pan out (early-stop probing, single-query `num_threads`, quantised LUT / SWAR), so we do not burn the budget twice on the same ideas.
### Changed

- `add_batch` is 12× faster at N = 1M (827s → 70s) via chunked residual encoding (peak transient memory bounded under ~150 MB). The previous N = 1M cost was dominated by swap, not the algorithm.
- `IVFPQSnapIndex` codes are now stored column-major `(M, n)` so the per-subspace gather inside `search()` is a contiguous slice instead of a stride-M lookup. ~12% latency win on the search hot path; structurally enables the cleaner gather patterns the rest of the sprint relies on.
- `.snpi` file format bumped to v2 to match the column-major layout. v1 files load transparently via a one-time transpose at load time.
- Coarse cluster ranking now uses the L2-monotone score `2⟨q, c⟩ − ‖c‖²` instead of plain `⟨q, c⟩`, which matches the metric used during k-means assignment. Plain dot product was a small but real bias since coarse centroids are means of unit vectors with varying norms.
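The claim that `2⟨q, c⟩ − ‖c‖²` is L2-monotone for unit-norm queries follows from expanding the squared distance; a quick numerical check (not snapvec code):

```python
import numpy as np

# For a unit-norm query q:
#   |q - c|^2 = |q|^2 - 2<q,c> + |c|^2 = 1 - (2<q,c> - |c|^2)
# so ranking clusters by 2<q,c> - |c|^2 descending is exactly ranking by
# L2 distance ascending; plain <q,c> ignores the varying |c|^2 term.
rng = np.random.default_rng(0)
q = rng.standard_normal(16)
q /= np.linalg.norm(q)
C = rng.standard_normal((8, 16)) * rng.uniform(0.5, 1.5, (8, 1))  # varied norms

l2_order = np.argsort(((q - C) ** 2).sum(axis=1))               # true L2 order
mono_order = np.argsort(-(2 * (C @ q) - (C ** 2).sum(axis=1)))  # monotone score
assert np.array_equal(l2_order, mono_order)
```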
### Fixed

- `add_batch` now rejects duplicate ids (within the batch and against the already-indexed set). Silently letting them through previously pointed `_id_to_row` at the wrong row.
- `load()` now validates `_offsets` invariants (length, monotone, boundary values). Corrupted files raise instead of silently mis-indexing the codes buffer.
- Stop reading `ThreadPoolExecutor`'s private `_max_workers` attribute; track the worker count explicitly.
### Documentation

- README "Which Index should I use?" table now references the FIQA-queries recall sweep numbers from `bench_ivf_pq_fiqa_recall.py` instead of the augmented-corpus baseline that had its own ceiling artefact.
v0.6 roadmap (tracked but not in 0.5)¶
- `pq_rerank=True`: float32 rerank of the IVF-PQ top-N to break the `K = 256` PQ ceiling at 0.929 recall@10.
- Optional `numba` accelerator (`snapvec[fast]`) that fuses the per-subspace gather + sum into a single SIMD inner loop, removing NumPy's per-iteration dispatch floor.
- Bit-packed PQ codes for `K < 256`.
- `filter_ids` on `PQSnapIndex` and `IVFPQSnapIndex`.
[0.4.0] — 2026-04-15¶
Headline: snapvec grows from one scalar index to a family of four
(`SnapIndex`, `ResidualSnapIndex`, `PQSnapIndex`, `IVFPQSnapIndex`),
covering the full accuracy / storage / latency frontier from
training-free scalar compression to sub-linear IVF search.
Added¶
- `IVFPQSnapIndex` — inverted-file + residual Product Quantization index with cluster-contiguous storage. `fit()` trains `nlist` coarse k-means centroids, then a shared per-subspace residual PQ codebook (M × K) on the pooled residuals. `add` sorts codes by coarse cluster id and maintains an `offsets[nlist + 1]` array so each probed cluster is a contiguous slice instead of a dynamic boolean mask. Cluster ranking uses the L2-monotone score `2⟨q, c⟩ − ‖c‖²` to match the metric used during assignment (plain `⟨q, c⟩` is biased because coarse centroids are means of unit vectors and their norms vary). Search visits only `nprobe / nlist` of the corpus and merges probed-cluster scores with the per-cluster ADC LUT offset by `⟨q, centroid_c⟩`. On BGE-small (N = 20 000, `M=192`, `K=256`, `nlist=256`) this delivers a 9.7× speedup at recall 0.910 (`nprobe=8`), 6.1× at recall 0.931 (`nprobe=16`), and 3.5× at recall 0.940 (`nprobe=32`, actually exceeding `PQSnapIndex` full-scan recall because per-cluster residuals have smaller variance than globally-centred vectors). New persistent format `.snpi` (magic `SNPI`, v1).
- `PQSnapIndex` — product-quantization index with learned codebooks. Splits the embedding into `M` subspaces of equal size (exact division required — no ragged last subspace in v1), trains a per-subspace `K`-centroid codebook via k-means++ + Lloyd iterations, and scores queries with asymmetric distance computation (ADC) over a per-query `(M, K)` LUT. On BGE-small / SciFact this reaches recall@10 = 0.94 at 192 B/vec (with `normalized=True`), where `SnapIndex(bits=3)` delivers 0.78 at the same storage, and matches `SnapIndex(bits=4)` recall at half the bytes per vector — at the cost of a one-off `fit(sample)` step that the training-free `SnapIndex` does not need. Off by default: `use_rht=False`, because the rotation that lets fixed Lloyd-Max codebooks work on arbitrary data actively destroys the subspace structure k-means is about to exploit. Storage per vector is `M` bytes in normalized mode, `M + 4` bytes otherwise (the float32 norm). Ships with its own persistent format (`.snpq`, magic `SNPQ`, v1).
- `ResidualSnapIndex` — two-stage Lloyd-Max quantization (coarse `b1` bits + residual `b2` bits, with the residual rescaled by the theoretical `σ_r = √ε(b1)`). Reuses the existing `{2, 3, 4}`-bit codebooks; no new training. Opens operating points uniform snapvec cannot reach: on BGE-small/SciFact, `b1=3, b2=3` (6 bits/coord) lifts recall@10 from 0.870 (uniform 4-bit) to 0.921, and `b1=4, b2=3` (7 bits/coord) reaches 0.957. Includes a `rerank_M` search mode that does a coarse pass over the whole corpus and reranks only the top-`M` with the full reconstruction — converges to full-reconstruction recall at `M=100`, making the fine stage O(M) instead of O(N). New on-disk format `.snpr` (magic `SNPR`, v1); codes are unpacked `uint8` per coordinate in this release (tight packing is follow-up).
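The ADC scoring described above can be sketched in a few lines of NumPy. This is an illustration of the technique, not snapvec's internals; the function name and array layouts are assumptions. The key property is that each score equals the dot product of the query with the vector's PQ reconstruction:

```python
import numpy as np

def adc_scores(q: np.ndarray, codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    # q: (d,) query; codes: (n, M) uint8; codebooks: (M, K, d // M).
    M, K, sub = codebooks.shape
    # Per-query (M, K) LUT: inner product of each query sub-vector with
    # every centroid of the matching subspace codebook.
    lut = np.einsum("mkd,md->mk", codebooks, q.reshape(M, sub))
    # Gather one LUT entry per (vector, subspace), then sum across subspaces:
    # score[i] = sum_m lut[m, codes[i, m]].
    return lut[np.arange(M), codes].sum(axis=1)
```

Because the LUT is built once per query, the per-vector cost is M table lookups and adds instead of a d-dimensional dot product.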
[0.3.0] — 2026-04-14¶
Added¶
- Tight 3-bit packing in RAM and on disk: 8 three-bit indices are packed across 3 bytes (24 bits) instead of the previous byte-aligned layout. 3-bit compression improves from 5.9× → 7.8× on disk and 3.0× → 7.8× in RAM at `d=384` (`padded_dim=512`), closing the gap that previously made 3-bit mode strictly dominated by 4-bit.
- Vectorised FWHT: `_rotation._fwht_inplace` now uses a single reshape view per butterfly level instead of a Python-level slice loop. Single-query RHT is ~24× faster at pdim=512 (460 µs → 19 µs) and ~40× faster at pdim=2048. End-to-end warm query latency is unchanged at `N=100k, d=384` (the gemv still dominates), but the RHT is no longer the #2 cost in the Python glue — relevant for high-dim models and batched insertion.
- File format v3 with a transparent backward-compatibility decoder. v1 and v2 files (which used byte-aligned 3-bit) are detected via the version field and decoded with the legacy path, then re-packed into the v3 layout in memory.
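The reshape-per-level idea can be sketched as follows — an unnormalised FWHT that vectorises each butterfly level with one view, rather than snapvec's exact `_fwht_inplace`:

```python
import numpy as np

def fwht_inplace(x: np.ndarray) -> None:
    # In-place unnormalised fast Walsh-Hadamard transform.
    # Each level reshapes the array into rows of width 2*h, so the
    # butterfly (a+b, a-b) runs as two whole-array NumPy ops instead
    # of a Python loop over index pairs.
    n = x.shape[-1]
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        v = x.reshape(-1, 2 * h)
        a = v[:, :h].copy()          # copy: first half is overwritten next
        b = v[:, h:]
        v[:, :h] = a + b
        v[:, h:] = a - b
        h *= 2
```

Applying the transform twice multiplies the input by n, which makes for a cheap self-check.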
Changed¶
- `_indices` storage is now bit-packed for all bit widths (2, 3, 4) in RAM. `_can_pack` evaluates to `True` whenever `(pdim * mse_bits) % 8 == 0`, which is satisfied for every `pdim = 2^k` with `k ≥ 3`.
- `save`/`load` indices are zero-copy for all bit widths: RAM layout matches disk layout byte-for-byte, so `tobytes()`/`np.frombuffer` replaces the previous unpack + repack round-trip.
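For the awkward 3-bit case, the tight layout above (8 indices in 3 bytes) can be illustrated with NumPy's bit routines. This is a sketch of the packing scheme, not snapvec's actual packer, and the function names are assumptions:

```python
import numpy as np

def pack_3bit(idx: np.ndarray) -> np.ndarray:
    # idx: uint8 array of 3-bit indices (values 0..7), length a multiple of 8.
    assert idx.size % 8 == 0 and idx.max(initial=0) < 8
    # Keep only the low 3 bits of each value (MSB-first within each index),
    # then pack the concatenated bitstream: 24 bits -> 3 bytes per 8 indices.
    bits = np.unpackbits(idx.astype(np.uint8).reshape(-1, 1), axis=1)[:, -3:]
    return np.packbits(bits.reshape(-1))

def unpack_3bit(packed: np.ndarray, n: int) -> np.ndarray:
    # Inverse: re-expand the bitstream and rebuild each 3-bit value.
    bits = np.unpackbits(packed)[: n * 3].reshape(n, 3).astype(np.uint8)
    return (bits[:, 0] << 2) | (bits[:, 1] << 1) | bits[:, 2]
```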
Documentation¶
- README "Compression ratios" table reflects actual (not theoretical) numbers and collapses "disk" and "RAM" into a single column for all modes (they're equal now).
- Lloyd-Max table drops the
bytes/coord (actual)warning column — 3-bit now matches its theoretical 0.375 bytes/coord. - Roadmap
Tight 3-bit packingandVectorised FWHTitems moved to "previously shipped".
Compatibility¶
- Pre-v0.3 `.snpv` files load unchanged. The reader dispatches the 3-bit decoder on the version field, so no user action is required.
- Public API unchanged: `SnapIndex(dim, bits=...)` works as before, with a smaller footprint for `bits=3`.
[0.2.0] — 2026-04-13¶
Added¶
- RAM-packed indices for 2-bit and 4-bit modes (byte-aligned packing). At `d=384, N=100k`, 4-bit: idle RAM drops from 51.6 MB to 26.0 MB (−50%); warm RAM drops from 154 MB to 128 MB (−17%); warm query latency unchanged.
- `normalized=True` constructor flag for pre-normalized embeddings (skips the `np.linalg.norm` step in `add_batch`/`search`). Persisted via bit 1 of the v2 flags field.
- Roadmap section in README ranking pure-Python and optional Rust improvements by measured ROI.
Fixed¶
- README compression numbers previously reported theoretical bytes/coord without the RHT padding; the real footprint at `d=384` is `padded_dim=512`, so 4-bit is 260 B/vec, 2-bit is 132 B/vec, etc.
- Documented (not yet fixed in this release) that the 3-bit disk packer was byte-aligned, giving the same 0.5 bytes/coord as 4-bit. This is addressed in v0.3.0.
- README flagged the silent `float64 → float32` cast inside `add_batch`/`search` that triples peak RAM during insertion when inputs are `float64`. Measured at +306 MB for `N=100k, d=384`.
[0.1.3] — 2026-04-12¶
Added¶
- `filter_ids` parameter on `search()` for O(|filter| · d) pre-filtering instead of full-scan + post-filter.
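The pre-filter path can be sketched as follows — a hypothetical standalone version of the idea (names like `id_to_row` are illustrative, not snapvec's API): gather only the candidate rows, then score with one small matvec instead of scanning all N rows and discarding afterwards.

```python
import numpy as np

def search_with_filter(q, vectors, id_to_row, filter_ids, k):
    # Gather the |filter| candidate rows, score them in one
    # (|filter|, d) @ (d,) matvec — O(|filter| * d), independent of N.
    rows = np.array([id_to_row[i] for i in filter_ids], dtype=np.intp)
    scores = vectors[rows] @ q
    order = np.argsort(-scores)[:k]
    return [filter_ids[j] for j in order], scores[order]
```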
[0.1.1] — initial internal release¶
Added¶
- Initial implementation of TurboQuant / HadaMax (RHT + Lloyd-Max) with 2/3/4-bit support.
- `use_prod=True` mode with QJL unbiased correction.
- `chunk_size` streaming search.
- `mypy --strict` type annotations.
- Atomic `.snpv` save / load format.