Reverse image search at 1.28 M images: how much float can you throw away?

16× smaller storage on the searchable index. 97.6 % recall vs full float. A practical look at binary quantisation as a first-phase retrieval strategy on a million-doc image corpus.

The setup

Drop any image. The page shows three columns side by side, each running the same query against a different rank profile.

Behind it: 1.28 million images from ImageNet-1k's training split, embedded once with facebook/dinov2-base (768-d, L2-normalised), indexed in Vespa Cloud with two parallel ANN structures over the same vectors.

Why bother quantising

DINOv2-base outputs a 768-dimensional float embedding per image. Stored as bfloat16, that's 1.5 KB per vector; for 1.28 M images that's 1.84 GB just for the embeddings, before the HNSW graph or the JPEG bytes themselves (~70 GB).

Sign-bit-quantise the 768 floats to a 768-bit binary vector - keep the sign, drop the magnitude - and you get 96 bytes per vector instead of 1536: a 16× reduction.

Storage breakdown (1.28 M docs)

The float embedding dominates at 1.84 GB. Binary at 117 MB fits in attribute memory on a single content node with room to spare. JPEG bytes (~70 GB) live on disk and aren't graphed.
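What binarize | pack_bits amounts to, as a minimal numpy sketch (the 768-d vector here is hypothetical random data, not a real DINOv2 embedding):

```python
import numpy as np

def binarize_pack(v: np.ndarray) -> np.ndarray:
    """Sign-bit quantise: keep the sign, drop the magnitude, then pack
    8 bits per byte - mirrors Vespa's binarize | pack_bits pipeline."""
    bits = (v > 0).astype(np.uint8)         # 768 floats -> 768 {0,1} bits
    return np.packbits(bits).view(np.int8)  # 768 bits -> 96 int8 bytes

rng = np.random.default_rng(0)
emb = rng.standard_normal(768).astype(np.float32)
packed = binarize_pack(emb)
print(packed.shape)  # (96,)
```

96 bytes against 1536 for bfloat16 is the 16× figure above.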

Vespa makes this two-step structure native - the binary tensor is derived from the fed embedding by the indexing pipeline. You only feed the float; Vespa builds the binary attribute and HNSW index at indexing time:

field embedding_binary type tensor<int8>(x[96]) {
    indexing: input embedding | binarize | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
    index {
        hnsw {
            max-links-per-node: 24
            neighbors-to-explore-at-insert: 200
        }
    }
}
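Under the hamming distance metric, the distance between two packed vectors is a popcount of their XOR. A sketch with two hypothetical 8-bit vectors:

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """XOR the packed bytes, then count the set bits."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

a = np.packbits(np.array([1, 0, 1, 1, 0, 0, 0, 1], dtype=np.uint8))
b = np.packbits(np.array([1, 1, 1, 0, 0, 0, 0, 1], dtype=np.uint8))
print(hamming_distance(a, b))  # 2 - two bit positions differ
```

Vespa turns the distance into a rank score via its general closeness definition, 1 / (1 + distance).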

The float vector also gets its own HNSW index, used as the gold-standard reference and the rerank source:

field embedding type tensor<bfloat16>(x[768]) {
    indexing: attribute | index
    attribute {
        distance-metric: prenormalized-angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}

The five rank profiles

Five strategies, all in schemas/image.sd:

| profile | first-phase | second-phase rerank | tests |
|---|---|---|---|
| closeness_binary | Hamming HNSW | none | lossy lower bound |
| closeness | Hamming HNSW | manual cosine, top 100 | recommended hybrid |
| closeness_hybrid_strict | Hamming HNSW | manual cosine, top 1000 | does deeper rerank help? |
| closeness_weighted | Hamming HNSW | α·float + (1−α)·binary, top 100 | does mixing help? |
| closeness_float | Float HNSW | none | reference |

Wiring the float rerank

The hybrid second-phase reranks the binary first-phase candidates by their float cosine against the query. Because both query and attribute are L2-normalised, cosine equals the plain dot product, so we expose it as a function the rank expression and match-features can both reference:

function cos_emb() {
    expression: sum(query(q) * attribute(embedding))
}
function closeness_emb() {
    expression: 0.5 * (1 + cos_emb)
}

first-phase {
    expression: closeness(field, embedding_binary)
}
second-phase {
    rerank-count: 100
    expression: closeness_emb
}

match-features {
    closeness(field, embedding_binary)
    distance(field, embedding_binary)
    closeness_emb
}

Defining closeness_emb as a function (rather than reusing closeness(field, embedding) in second-phase) is what lets us also surface a comparable float score in match-features - useful for the three-column compare on the demo page and for the eval below, which compares per-method scores on the same hits.
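Putting the two phases together in numpy (brute force standing in for both HNSW indexes; corpus and query are hypothetical random data):

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.standard_normal((10_000, 768)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)  # L2-normalise, as here
q = docs[0] + 0.02 * rng.standard_normal(768).astype(np.float32)
q /= np.linalg.norm(q)                               # query = perturbed doc 0

# Phase 1: Hamming over sign bits (brute force stands in for the binary HNSW).
doc_bits = np.packbits(docs > 0, axis=1)             # (10000, 96) packed bytes
q_bits = np.packbits(q > 0)
dist = np.unpackbits(doc_bits ^ q_bits, axis=1).sum(axis=1)
candidates = np.argsort(dist)[:200]                  # target_hits = 200

# Phase 2: float dot product (== cosine on unit vectors) over the top 100.
rerank = candidates[:100]                            # rerank-count: 100
scores = 0.5 * (1 + docs[rerank] @ q)                # the closeness_emb mapping
top10 = rerank[np.argsort(-scores)][:10]
print(int(top10[0]))  # the perturbed source doc ranks first
```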

Methodology

50 validation queries, k = 10, target_hits = 200. Recall@10 is measured against the float profile's top-10, so the float profile scores 1.000 by construction.

Headline result

[Bar chart: recall@10 vs float gold per profile - same values as the table below]
| profile | recall@10 | mean rank shift |
|---|---|---|
| binary | 0.604 | 2.23 |
| hybrid | 0.976 | 0.02 |
| hybrid_strict | 0.976 | 0.02 |
| weighted | 0.971 | 0.13 |
| float | 1.000 | 0.00 |
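The two eval metrics, sketched. The post doesn't show its eval code, and "mean rank shift" is assumed here to mean the average absolute position change of docs present in both top-10 lists:

```python
def recall_at_k(gold, pred, k=10):
    """Fraction of the gold top-k that the candidate profile also returns."""
    return len(set(gold[:k]) & set(pred[:k])) / k

def mean_rank_shift(gold, pred, k=10):
    """Average |position change| over docs in both top-k lists (assumed definition)."""
    shared = set(gold[:k]) & set(pred[:k])
    return sum(abs(gold.index(d) - pred.index(d)) for d in shared) / len(shared)

gold = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # float profile's top-10
pred = [1, 2, 3, 5, 4, 6, 7, 8, 9, 11]  # one swap, one miss
print(recall_at_k(gold, pred))          # 0.9
print(mean_rank_shift(gold, pred))      # 2/9 ≈ 0.22
```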


The mean hides the shape

A 0.976 mean reads like “every query loses ~2.4 %”. It isn't. Per-query recall@10 across the 50 validation queries:

[Histogram: per-query recall@10 vs float gold, binary vs binary+rerank, over the 50 queries]

49 of 50 hybrid queries return the float top-10 exactly. One query drops to 0.8. Median recall is 1.0; the mean is dragged down by a single outlier. The hybrid profile isn't “mostly right” - it's either right or it isn't, and it's right almost every time.

Binary alone tells the opposite story. No query is perfect; the mode is 0.6 (14 queries) and the spread covers 0.2 – 0.9. The lossy first-phase isn't making a small mistake on every query - it's making a different mistake on every query. That's why the rerank works: the missing docs aren't systematically gone, they're scattered just outside the top 10 of the binary candidate pool.

Rerank is essentially free

The hybrid profiles read the same binary HNSW candidates as binary-only, so the float rerank adds only a 768-d dot product on at most 1000 vectors. End-to-end p50 latency across all five profiles fits in a ~10 ms band:

| profile | p50 (ms) | p95 (ms) |
|---|---|---|
| binary | 447 | 483 |
| hybrid | 450 | 496 |
| hybrid_strict | 451 | 477 |
| weighted | 453 | 494 |
| float | 455 | 479 |

The flat profile means the choice between binary-only and binary+rerank is essentially “do you want the recall or not” - there's no latency knob to tune. (These are end-to-end including the EC2 → Vespa Cloud round-trip, so absolute numbers are network-bound. The differential between profiles is what matters.)

Recall vs target_hits

target_hits bounds HNSW exploration depth. Higher = better recall, slower. Same 50-query validation set at each depth:

[Line chart: recall@10 vs target_hits (50, 100, 200, 500, 1000; log scale), binary vs binary+rerank]

Hybrid recall climbs from 0.925 at depth 50 to 0.976 at depth 200, then plateaus. Binary alone has barely any room to grow even with 10× more candidates: it tops out at ~0.60 because some float-correct neighbours simply never make it into the binary top-N at any depth.

The actionable shape: target_hits = 200 is the sweet spot for this corpus. Doubling depth past that doesn't change recall - only latency.
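For reference, a query against these profiles might be assembled like this. This is a sketch: the profile and tensor names come from the schemas above, but the demo's actual client code isn't shown in the post.

```python
def make_query(q_float: list, q_packed: list, target_hits: int = 200) -> dict:
    """Vespa Query API body: binary nearestNeighbor drives retrieval,
    the float query tensor feeds the second-phase rerank."""
    return {
        "yql": ("select * from image where "
                f"{{targetHits: {target_hits}}}"
                "nearestNeighbor(embedding_binary, q_binary)"),
        "ranking.profile": "closeness",     # the hybrid profile
        "input.query(q_binary)": q_packed,  # 96 int8 values (sign bits, packed)
        "input.query(q)": q_float,          # 768 floats for the rerank
        "hits": 10,
    }

body = make_query([0.0] * 768, [0] * 96)
print(body["yql"])
```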

Try the weighted profile yourself

The closeness_weighted rank profile blends the manual float cosine with the (real) Hamming closeness in the second phase. Slide α on the live demo to re-run the query: at α = 0 you rank by binary closeness alone, at α = 1 by float closeness alone.

Open the demo with α = 0.70
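Numerically, the blend is just a convex combination of the two closeness scores (the values below are hypothetical):

```python
def weighted_score(alpha: float, float_closeness: float, binary_closeness: float) -> float:
    """closeness_weighted second phase: alpha * float + (1 - alpha) * binary."""
    return alpha * float_closeness + (1 - alpha) * binary_closeness

# e.g. float closeness 0.5*(1+cos) vs Hamming closeness 1/(1+distance) -
# note the two live on very different scales.
f_close, b_close = 0.93, 0.004
print(weighted_score(0.0, f_close, b_close))  # binary only -> 0.004
print(weighted_score(1.0, f_close, b_close))  # float only  -> 0.93
print(weighted_score(0.7, f_close, b_close))
```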

Takeaways

  1. DINOv2 + binary quantisation + float rerank works. 16× storage savings for the searchable index, 97.6 % recall vs full float, basically zero ordering error among the docs that survive the first phase.
  2. Binary alone is not viable for image search at million-doc scale. 60 % recall is recoverable with rerank but not without; if you want a binary-only pipeline you need a richer first-phase signal (multi-vector, 2-bit, or product quantisation).
  3. Rerank pool depth saturates at 100 for this corpus. Going deeper without going wider is wasted CPU.
  4. Match-features are how you compare ranking strategies fairly. Surfacing both binary and float scores per hit, regardless of which drove ranking, is what makes the three-column compare on the demo page possible and the eval above honest.

Vespa Cloud · DINOv2 · ImageNet-1k