Subtitle: How many clusters do we need? How many should each query probe (nprobe)? Tested on 10M, 50M, and 100M vectors


Inspired by the SPFresh and SPANN articles, this project builds a vector database on top of S3. The setup is many small clusters stored in S3.

We start with design and performance considerations. How many of these small clusters do we need to check per query? What is a reasonable recall@10?

  • Dataset: LAION-400M, 400 million CLIP image embeddings.
  • Ground truth: computed by brute-force GPU search on a RunPod A100.
  • Index type: IVFFlat.
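
To make the recall@10 target concrete, here is a minimal sketch of how recall@k could be computed against the brute-force ground truth. The function name and the list-of-lists input layout are assumptions for illustration, not the format the actual evaluation uses.

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Mean fraction of the true top-k neighbours that appear in the retrieved top-k.

    retrieved[i] and ground_truth[i] are ranked lists of vector ids for query i.
    (Hypothetical layout: adapt to however the ids are actually stored.)
    """
    if not ground_truth or len(retrieved) != len(ground_truth):
        raise ValueError("need one retrieved list per ground-truth list")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        truth_k = set(truth[:k])
        # Count how many of the exact neighbours the approximate search found.
        total += len(truth_k.intersection(got[:k])) / len(truth_k)
    return total / len(ground_truth)
```

A recall@10 of 0.9 then means that, on average, 9 of the 10 exact nearest neighbours show up in the approximate result.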

Questions

  • How many small clusters [nlist]?
  • Given a query, in how many such clusters should we search [nprobe]?
  • Should we duplicate vectors and write to multiple such clusters?
  • How much would it cost, in terms of S3 storage and queries?
  • How much time would it take for a query?
  • How to add new vectors — how to manage delta files?
  • When to do compaction — which strategy?
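
To illustrate what nlist and nprobe control, the following is a toy IVF-flat index in pure Python. It is a sketch under simplifying assumptions (a few Lloyd iterations for clustering, squared Euclidean distance, everything in memory); the helper names are made up for this example, and it is not the implementation used for the experiments.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Put every vector id into the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        lists[nearest].append(idx)
    return lists


def build_ivf(vectors, nlist, iters=5, seed=0):
    """Cluster the vectors into nlist inverted lists (simplified k-means)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        lists = assign(vectors, centroids)
        for c, members in enumerate(lists):
            if members:  # keep the old centroid if a cluster went empty
                cols = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(col) / len(members) for col in cols]
    return centroids, assign(vectors, centroids)


def search(query, vectors, centroids, lists, nprobe, k=10):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]
```

With nprobe equal to nlist the search degenerates to brute force. Lowering nprobe reduces the number of clusters that must be fetched and scanned, at the price of recall.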

How to answer these questions

Create a simulator.

  • Input: a DB size and a strategy.
  • Output: a simulated estimate of cost and latency.
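
As a rough sketch of what such a simulator might look like, the code below takes a DB size and a strategy and returns cost and latency estimates. The `Strategy` and `Environment` classes and all of their fields are hypothetical. The model assumes equal-sized clusters, float32 vectors by default, and fetches issued in parallel waves; it ignores the time to scan the fetched vectors. Prices and latencies are caller-supplied placeholders, not real provider numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class Strategy:
    nlist: int                 # number of small clusters
    nprobe: int                # clusters fetched per query
    replication: float = 1.0   # average copies of each vector across clusters


@dataclass
class Environment:
    # Every field is a caller-supplied placeholder, not a real provider price.
    storage_price_per_gb_month: float
    get_price_per_1000: float
    first_byte_latency_ms: float
    throughput_mb_per_s: float
    parallel_fetches: int


def simulate(n_vectors, dim, strategy, env, bytes_per_value=4):
    """Estimate monthly storage cost, per-query request cost, and fetch latency."""
    stored_bytes = n_vectors * strategy.replication * dim * bytes_per_value
    cluster_bytes = stored_bytes / strategy.nlist  # assumes equal-sized clusters
    # Fetches run in waves of `parallel_fetches`; each wave pays the
    # first-byte latency plus the transfer time of one cluster.
    waves = math.ceil(strategy.nprobe / env.parallel_fetches)
    transfer_ms = cluster_bytes / (env.throughput_mb_per_s * 1e6) * 1000
    return {
        "storage_cost_per_month": stored_bytes / 1e9 * env.storage_price_per_gb_month,
        "request_cost_per_query": strategy.nprobe * env.get_price_per_1000 / 1000,
        "fetch_latency_ms": waves * (env.first_byte_latency_ms + transfer_ms),
        "cluster_mb": cluster_bytes / 1e6,
    }
```

Sweeping nlist, nprobe, and replication over a grid and feeding in measured latency numbers would show where the trade-offs sit for a given DB size.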

Compaction Strategies

Let’s take inspiration from another open-source project: RocksDB.

RocksDB’s core insight is that random writes to disk are slow, but sequential writes are fast. Its solution: buffer writes in memory (the memtable), flush them to disk as immutable sorted files (level L0), and compact those files in the background into deeper levels (L1, L2, …).

We can do the same by appending deltas on a daily, weekly, or monthly schedule.

Alternatively, we could use levels L0, L1, …, L6, where L0 is in memory, L1 is on local disk, and so on down to L6, which is in S3.
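
The leveled variant can be sketched as a toy model in which each level has a capacity and overflows into the next one as a single batch. The class name, the capacities, and the "move the whole level down" policy are assumptions for illustration. A real compaction would also merge the deltas into the cluster files, which this sketch does not attempt.

```python
class LeveledStore:
    """Toy leveled store: levels[0] is the newest (in-memory) level."""

    def __init__(self, capacities):
        # capacities[i] is the most entries level i may hold before it is
        # compacted into level i + 1. The last level is unbounded.
        self.capacities = list(capacities)
        self.levels = [[] for _ in range(len(self.capacities) + 1)]
        self.compactions = 0

    def add(self, item):
        """Append a new vector id to L0, then compact any level that overflowed."""
        self.levels[0].append(item)
        self._compact()

    def _compact(self):
        for i, cap in enumerate(self.capacities):
            if len(self.levels[i]) > cap:
                # Move the whole level down in one sequential batch write.
                self.levels[i + 1].extend(self.levels[i])
                self.levels[i] = []
                self.compactions += 1

    def read_all(self):
        """A query has to consult every level, newest first."""
        return [item for level in self.levels for item in level]
```

Counting compactions and the sizes written per level gives the simulator a way to estimate write amplification and the number of S3 PUT requests for a given strategy.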

S3 - AWS, Hetzner, Cloudflare [R2]

The solution is portable across providers and across regions.

Using the benchmark script, we can measure latency in the target environment and adjust the setup accordingly.

We could also measure differences between regions within a given provider.
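
The benchmark script itself is not shown here. As a hedged sketch of the kind of measurement involved, the harness below times an arbitrary `fetch(key)` callable and reports nearest-rank percentiles. The function names are hypothetical, and `fetch` stands in for whatever client call reads one cluster object from the provider under test.

```python
import time


def percentile(sorted_samples, p):
    """Nearest-rank percentile (p is an integer from 1 to 100) of a sorted list."""
    rank = max(1, (p * len(sorted_samples) + 99) // 100)  # ceiling division
    return sorted_samples[rank - 1]


def measure_latency(fetch, keys, repeats=3):
    """Call fetch(key) for every key, `repeats` times, and summarise timings in ms."""
    samples = []
    for _ in range(repeats):
        for key in keys:
            start = time.perf_counter()
            fetch(key)
            samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {
        "count": len(samples),
        "p50_ms": percentile(samples, 50),
        "p95_ms": percentile(samples, 95),
        "p99_ms": percentile(samples, 99),
        "max_ms": samples[-1],
    }
```

Running the same harness against each provider and region, with the same keys and object sizes, yields numbers that can be compared directly and fed back into the simulator.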