PQSnapIndex

PQSnapIndex(dim: int, M: int, K: int = 256, seed: int = 0, normalized: bool = False, use_rht: bool = False, use_opq: bool = False)

Bases: FreezableIndex

Product-quantization index trained once on a corpus sample.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dim | int | Embedding dimension. | required |
| M | int | Number of subspaces. (pdim or dim) must be divisible by M. | required |
| K | int | Centroids per subspace. Must satisfy 2 ≤ K ≤ 256. | 256 |
| seed | int | Seed for the RHT (if used) and for k-means++ init. | 0 |
| normalized | bool | When True, inputs are assumed unit-length and no per-vector norm is stored. | False |
| use_rht | bool | When True, prepend the randomized Hadamard transform before splitting into subspaces. Off by default: on modern embeddings it hurts PQ by destroying subspace structure. | False |
| use_opq | bool | When True, learn an orthogonal OPQ-P rotation (Ge et al., 2013) during fit() and apply it to both corpus and queries. Balances per-subspace variance, typically lifting recall@10 by 0.5-2 pp at the same bytes/vec. Mutually exclusive with use_rht. | False |
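The constraints listed above can be checked before constructing an index. The following is a minimal sketch, not the library's own validation: `check_pq_params` is a hypothetical helper, the choice of `ValueError` is an assumption, and it tests divisibility against `dim` only because the source does not define `pdim`.

```python
def check_pq_params(dim, M, K=256, use_rht=False, use_opq=False):
    """Hypothetical pre-flight check mirroring the documented constraints.

    Returns the width of each subspace (dim // M) when the arguments are valid.
    """
    if not 2 <= K <= 256:
        # K <= 256 means every centroid id fits in a single byte.
        raise ValueError(f"K must satisfy 2 <= K <= 256, got {K}")
    if use_rht and use_opq:
        raise ValueError("use_rht and use_opq are mutually exclusive")
    if M <= 0 or dim % M != 0:
        # Assumption: only dim is checked here; pdim is not defined in the docs.
        raise ValueError(f"dim={dim} is not divisible by M={M}")
    return dim // M


# 768-dimensional embeddings split into 96 subspaces of 8 dimensions each.
subspace_width = check_pq_params(dim=768, M=96)
```

With K at most 256, each vector's code takes one byte per subspace, so M also sets the code size in bytes.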

fit

fit(training_vectors: NDArray[float32], kmeans_iters: int = 15) -> None

Train per-subspace codebooks on training_vectors.
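To illustrate what training per-subspace codebooks involves, here is a minimal NumPy sketch of generic product quantization. It is not the PQSnapIndex implementation: `train_codebooks` and `encode_pq` are hypothetical names, and the sketch initializes centroids from random training points rather than the k-means++ init the parameter table mentions.

```python
import numpy as np


def train_codebooks(X, M, K, kmeans_iters=15, seed=0):
    """Run plain Lloyd's k-means independently on each of the M subspaces."""
    rng = np.random.default_rng(seed)
    n, dim = X.shape
    d_sub = dim // M  # assumes dim is divisible by M
    codebooks = np.empty((M, K, d_sub), dtype=np.float32)
    for m in range(M):
        sub = X[:, m * d_sub:(m + 1) * d_sub]
        # Random-sample init (assumption: simpler than k-means++); needs n >= K.
        centroids = sub[rng.choice(n, size=K, replace=False)].copy()
        for _ in range(kmeans_iters):
            # Squared distance from every point to every centroid: shape (n, K).
            d2 = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            assign = d2.argmin(axis=1)
            for k in range(K):
                members = sub[assign == k]
                if len(members):  # leave empty clusters where they are
                    centroids[k] = members.mean(axis=0)
        codebooks[m] = centroids
    return codebooks


def encode_pq(X, codebooks):
    """Map each vector to M centroid ids; K <= 256 lets each id fit in uint8."""
    M, K, d_sub = codebooks.shape
    codes = np.empty((X.shape[0], M), dtype=np.uint8)
    for m in range(M):
        sub = X[:, m * d_sub:(m + 1) * d_sub]
        d2 = ((sub[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=2)
        codes[:, m] = d2.argmin(axis=1)
    return codes


# Synthetic data only, for shape checking.
X = np.random.default_rng(1).standard_normal((500, 16)).astype(np.float32)
codebooks = train_codebooks(X, M=4, K=16)
codes = encode_pq(X, codebooks)
```

Each row of `codes` holds one byte per subspace, which is why a second training pass would invalidate codes produced against the earlier codebooks.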

Must be called exactly once, before the first add / add_batch. Calling fit a second time raises an error, whether or not any vectors have been indexed: a second fit would silently overwrite the codebooks and, if any vectors had been indexed, invalidate their codes.
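The fit-once rule can be pictured as a simple guard. This is a sketch under stated assumptions, not the PQSnapIndex source: `FitOnceSketch` is a hypothetical stand-in, the use of `RuntimeError` is an assumption (the docs do not name the exception type), and the "training" step is a placeholder.

```python
class FitOnceSketch:
    """Hypothetical stand-in showing a fit-exactly-once guard."""

    def __init__(self):
        self._codebooks = None  # set by the first (and only) fit()
        self._count = 0

    def fit(self, training_vectors):
        if self._codebooks is not None:
            # Refuse rather than overwrite codebooks that codes may depend on.
            raise RuntimeError("fit() has already been called")
        self._codebooks = list(training_vectors)  # placeholder for real training

    def add(self, vector):
        if self._codebooks is None:
            raise RuntimeError("call fit() before add()")
        self._count += 1  # placeholder for encoding and storing the vector


index = FitOnceSketch()
index.fit([[0.0, 1.0], [1.0, 0.0]])
index.add([0.5, 0.5])

try:
    index.fit([[2.0, 2.0]])
    second_fit_raised = False
except RuntimeError:
    second_fit_raised = True
```

Raising on the second call, even when nothing has been indexed yet, keeps the behavior uniform and avoids a silent codebook swap.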