Reverse image search at 1.28 M images: how much float can you throw away?
16× smaller storage on the searchable index. 97.6 % recall vs full float. A practical look at binary quantisation as a first-phase retrieval strategy on a million-doc image corpus.
The setup
Drop any image. The page shows three columns side by side, each running the same query against a different rank profile:
- Binary HNSW only - first-phase Hamming, no rerank. Cheap, lossy.
- Binary + float rerank - recommended hybrid. Cheap retrieval, float refinement on top 100.
- Float only - full-precision HNSW. The reference; what you'd get if storage weren't a concern.
Behind it: 1.28 million images from ImageNet-1k's training split, embedded once with facebook/dinov2-base (768-d, L2-normalised), indexed in Vespa Cloud with two parallel ANN structures over the same vectors.
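The embedding step ran once, offline. A minimal sketch of it, assuming Hugging Face transformers and CLS-token pooling (the pooling choice is an assumption, not something the pipeline above specifies):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base").eval()

def embed(path: str) -> list[float]:
    """768-d, L2-normalised DINOv2-base embedding for one image."""
    inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
    with torch.no_grad():
        cls = model(**inputs).last_hidden_state[:, 0]  # CLS token, shape (1, 768)
    return torch.nn.functional.normalize(cls, dim=-1)[0].tolist()
```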
Why bother quantising
DINOv2-base outputs a 768-dimensional float embedding per image. Stored as bfloat16 that's 1.5 KB per vector; for 1.28 M images that's 1.97 GB just for the embeddings, before the HNSW graph or the JPEG bytes themselves (~70 GB).
Sign-bit-quantise the 768 floats to a 768-bit binary vector - keep the sign, drop the magnitude - and you get:
- Storage: 768 ÷ 8 = 96 bytes per vector. 16× smaller than bfloat16.
- Distance: Hamming = popcount(a XOR b) - three or four cycles per pair on modern CPUs, no FP unit (sketched after this list).
- Cost: a binarised vector loses information. Two embeddings with the same sign pattern but different magnitudes look identical to Hamming.
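A minimal numpy sketch of both operations, assuming Vespa's binarize thresholds at zero (which matches the sign-bit framing):

```python
import numpy as np

def binarize(v: np.ndarray) -> np.ndarray:
    """Sign-bit quantise: keep the sign, drop the magnitude.
    768 floats -> 96 packed bytes, matching tensor<int8>(x[96])."""
    bits = (v > 0).astype(np.uint8)         # 1 where positive, else 0
    return np.packbits(bits).view(np.int8)  # 768 bits -> 96 bytes

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """popcount(a XOR b) over the packed bytes."""
    return int(np.unpackbits((a ^ b).view(np.uint8)).sum())
```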
Storage breakdown (1.28 M docs)
| component | size |
|---|---|
| embedding (bfloat16) | 1.97 GB |
| binary (packed) | 123 MB |
| float HNSW links (m=16) | 82 MB |
| binary HNSW links (m=24) | 123 MB |
The float embedding dominates at 1.97 GB. Binary at 123 MB fits in attribute memory on a single content node with room to spare. JPEG bytes (~70 GB) live on disk and aren't graphed.
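The arithmetic behind the two embedding rows:

```python
DOCS, DIM = 1_280_000, 768

bf16_bytes   = DOCS * DIM * 2   # 2 bytes per bfloat16 component
binary_bytes = DOCS * DIM // 8  # 1 bit per component, packed

print(f"bfloat16: {bf16_bytes / 1e9:.2f} GB")      # 1.97 GB
print(f"binary:   {binary_bytes / 1e6:.0f} MB")    # 123 MB
print(f"ratio:    {bf16_bytes // binary_bytes}x")  # 16x
```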
Vespa makes this two-step structure native - the binary tensor is derived from the fed embedding by the indexing pipeline. You only feed the float; Vespa builds the binary attribute and HNSW index at indexing time:
```
field embedding_binary type tensor<int8>(x[96]) {
    indexing: input embedding | binarize | pack_bits | attribute | index
    attribute {
        distance-metric: hamming
    }
    index {
        hnsw {
            max-links-per-node: 24
            neighbors-to-explore-at-insert: 200
        }
    }
}
```

The float vector also gets its own HNSW index, used as the gold-standard reference and the rerank source:

```
field embedding type tensor<bfloat16>(x[768]) {
    indexing: attribute | index
    attribute {
        distance-metric: prenormalized-angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 200
        }
    }
}
```
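The feed side, sketched against Vespa's /document/v1 API (endpoint, namespace, and document type here are placeholders):

```python
import requests

# Placeholder endpoint/namespace/doctype - substitute your own.
DOC_API = "https://myapp.vespa-cloud.example/document/v1/images/image/docid"

def feed(doc_id: str, embedding: list[float]) -> None:
    """Feed only the float embedding; Vespa derives embedding_binary
    via the indexing pipeline shown above."""
    doc = {"fields": {"embedding": {"values": embedding}}}
    requests.post(f"{DOC_API}/{doc_id}", json=doc, timeout=10).raise_for_status()
```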
The five rank profiles
Five strategies, all in schemas/image.sd:
| profile | first-phase | second-phase rerank | what it tests |
|---|---|---|---|
| closeness_binary | Hamming HNSW | none | lossy lower bound |
| closeness | Hamming HNSW | manual cosine, top 100 | recommended hybrid |
| closeness_hybrid_strict | Hamming HNSW | manual cosine, top 1000 | does deeper rerank help? |
| closeness_weighted | Hamming HNSW | α·float + (1−α)·binary, top 100 | does mixing help? |
| closeness_float | float HNSW | none | reference |
Wiring the float rerank
The hybrid second phase reranks the binary first-phase candidates by their float cosine against the query. Because both query and attribute are L2-normalised, cosine equals the plain dot product, so we expose it as a function that both the rank expression and match-features can reference:
```
function cos_emb() {
    expression: sum(query(q) * attribute(embedding))
}
function closeness_emb() {
    expression: 0.5 * (1 + cos_emb)
}
first-phase {
    expression: closeness(field, embedding_binary)
}
second-phase {
    rerank-count: 100
    expression: closeness_emb
}
match-features {
    closeness(field, embedding_binary)
    distance(field, embedding_binary)
    closeness_emb
}
```

Defining closeness_emb as a function (rather than reusing closeness(field, embedding) in second-phase) is what lets us also surface a comparable float score in match-features - useful for the three-column compare on the demo page and for the eval below, which compares per-method scores on the same hits.
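On the query side, switching strategies is a one-parameter change. A sketch against Vespa's query API - the endpoint and the q_binary query-tensor name are assumptions, and binarize() is the numpy helper from earlier:

```python
import numpy as np
import requests

SEARCH_API = "https://myapp.vespa-cloud.example/search/"  # placeholder

def search(embedding: list[float], profile: str = "closeness",
           target_hits: int = 200, hits: int = 10) -> dict:
    """nearestNeighbor over the binary index, ranked by the given profile."""
    packed = binarize(np.array(embedding)).tolist()  # 96 int8 values
    body = {
        "yql": "select * from image where "
               f"{{targetHits: {target_hits}}}nearestNeighbor(embedding_binary, q_binary)",
        "ranking.profile": profile,
        "input.query(q)": embedding,      # float tensor, used by the rerank phase
        "input.query(q_binary)": packed,  # packed binary tensor, used by Hamming
        "hits": hits,
    }
    return requests.post(SEARCH_API, json=body, timeout=5).json()
```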
Methodology
- Sample: 50 validation-split images (the entire validation cut of the corpus).
- Query embedding: fetched via Vespa's document API, not re-encoded - query vector is bit-for-bit what's in the index, so the comparison is purely about ranking.
- Gold: closeness_float top-K. Float-HNSW recall vs exact brute-force is >0.99 for our parameters.
- Metrics: recall@10, mean rank shift, p50/p95 latency (sketched after this list).
- Self-exclusion: each query excludes its own ID, so top-10 measures actual retrieval, not “1 self + 9 neighbours”.
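One plausible reading of the two ranking metrics, given ranked doc-ID lists per profile:

```python
def recall_at_k(gold: list[str], cand: list[str], k: int = 10) -> float:
    """Fraction of the float top-k that the candidate profile also returns."""
    return len(set(gold[:k]) & set(cand[:k])) / k

def mean_rank_shift(gold: list[str], cand: list[str], k: int = 10) -> float:
    """Mean absolute rank difference over docs present in both top-k lists."""
    shared = set(gold[:k]) & set(cand[:k])
    if not shared:
        return float("nan")
    return sum(abs(gold.index(d) - cand.index(d)) for d in shared) / len(shared)
```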
Headline result
50 validation queries, k = 10, target_hits = 200. Recall is measured against the float profile (which is 1.000 by construction).
| profile | recall@10 | mean rank shift |
|---|---|---|
| binary | 0.604 | 2.23 |
| hybrid | 0.976 | 0.02 |
| hybrid_strict | 0.976 | 0.02 |
| weighted | 0.971 | 0.13 |
| float | 1.000 | 0.00 |
What this says:
- Binary alone keeps 60.4 % of float's top-10. The other 40 % are near-duplicates by cosine but their sign patterns drift far enough that Hamming pushes them out of the top-10. At 1.28 M scale the noise floor of 768-bit Hamming is real.
- Adding a float rerank on top 100 recovers to 97.6 %. Most of the time the float-correct neighbour is somewhere in the top-100 by Hamming, just in the wrong position. Rerank fixes the order. Mean rank shift among shared docs drops from 2.23 to 0.02.
- rerank-100 → rerank-1000 doesn't help (0.976 → 0.976). The bottleneck is binary retrieval, not rerank pool depth.
- Weighted blending at α=0.7 is slightly worse (0.971 vs 0.976). Mixing in binary at this scale dilutes the signal.
The mean hides the shape
A 0.976 mean reads like “every query loses ~2.4 %”. It isn't. Per-query recall@10 across the 50 validation queries:
The overwhelming majority of hybrid queries return the float top-10 exactly; the worst drops to 0.8. Median recall is 1.0; the mean is dragged down by a small tail of imperfect queries. The hybrid profile isn't "mostly right" - it's either right or it isn't, and it's right almost every time.
Binary alone tells the opposite story. No query is perfect; the mode is 0.6 (14 queries) and the spread covers 0.2 – 0.9. The lossy first-phase isn't making a small mistake on every query - it's making a different mistake on every query. That's why the rerank works: the missing docs aren't systematically gone, they're scattered just outside the top 10 of the binary candidate pool.
Rerank is essentially free
The binary-based profiles all read the same binary HNSW candidates, so the float rerank adds only a 768-d dot product over at most 1000 vectors. End-to-end p50 latency across all five profiles fits in a ~10 ms band:
The flat profile means the choice between binary-only and binary+rerank is essentially “do you want the recall or not” - there's no latency knob to tune. (These are end-to-end including the EC2 → Vespa Cloud round-trip, so absolute numbers are network-bound. The differential between profiles is what matters.)
Recall vs target_hits
target_hits bounds HNSW exploration depth. Higher = better recall, slower. Same 50-query validation set at each depth:
Hybrid recall climbs from 0.925 at depth 50 to 0.976 at depth 200, then plateaus. Binary alone has barely any room to grow even with 10× more candidates: it tops out at ~0.60 because some float-correct neighbours simply never make it into the binary top-N at any depth.
The actionable shape: target_hits = 200 is the sweet spot for this corpus. Doubling depth past that doesn't change recall - only latency.
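The sweep itself is just a loop over the query set, reusing the hypothetical search() and recall_at_k() sketches from above (self-exclusion omitted for brevity):

```python
def top_ids(resp: dict) -> list[str]:
    """Ranked document IDs from a Vespa search response."""
    return [hit["id"] for hit in resp.get("root", {}).get("children", [])]

def sweep(query_embs: list[list[float]],
          depths=(50, 100, 200, 500, 1000, 2000)) -> None:
    for depth in depths:
        recalls = []
        for emb in query_embs:
            gold   = top_ids(search(emb, profile="closeness_float"))
            hybrid = top_ids(search(emb, profile="closeness", target_hits=depth))
            recalls.append(recall_at_k(gold, hybrid))
        print(f"target_hits={depth}: recall@10={sum(recalls) / len(recalls):.3f}")
```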
Try the weighted profile yourself
The closeness_weighted rank profile blends the manual cosine with the actual first-phase Hamming closeness in the second phase. Slide α to query the live demo - at α=0 you're ranking by binary closeness alone, at α=1 by float closeness alone:
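If α is wired as a query input - an assumption about this schema, not confirmed by it - passing it per request looks like:

```python
def search_weighted(embedding: list[float], alpha: float = 0.7) -> dict:
    """Query the closeness_weighted profile; assumes the schema reads
    query(alpha) in its second-phase blend."""
    body = {
        "yql": "select * from image where "
               "{targetHits: 200}nearestNeighbor(embedding_binary, q_binary)",
        "ranking.profile": "closeness_weighted",
        "input.query(q)": embedding,
        "input.query(q_binary)": binarize(np.array(embedding)).tolist(),
        "input.query(alpha)": alpha,  # 0 = binary only, 1 = float only
        "hits": 10,
    }
    return requests.post(SEARCH_API, json=body, timeout=5).json()
```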
Takeaways
- DINOv2 + binary quantisation + float rerank works. 16× storage savings for the searchable index, 97.6 % recall vs full float, basically zero ordering error among the docs that survive the first phase.
- Binary alone is not viable for image search at million-doc scale. 60 % recall is recoverable with rerank but not without; if you want a binary-only pipeline you need a richer first-phase signal (multi-vector, 2-bit, or product quantisation).
- Rerank pool depth saturates at 100 for this corpus. Going deeper without going wider is wasted CPU.
- Match-features are how you compare ranking strategies fairly. Surfacing both binary and float scores per hit, regardless of which drove ranking, is what makes the three-column compare on the demo and the eval above honest.
Vespa Cloud · DINOv2 · ImageNet-1k