The query image
Whatever you uploaded — a selfie, a sunset, a sneaker. We resize its longer side to 224px and feed the raw pixels into the encoder.
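Here is a minimal sketch of that resize arithmetic. The helper name `resize_dims` and the rounding rule are our assumptions for illustration. Whether the real pipeline pads or crops the shorter side afterwards is not stated here, so the sketch stops at the scaled dimensions.

```python
def resize_dims(width: int, height: int, target: int = 224) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals `target`, keeping aspect ratio."""
    longer = max(width, height)
    # Round to whole pixels; never let the shorter side collapse to zero.
    new_w = max(1, round(width * target / longer))
    new_h = max(1, round(height * target / longer))
    return new_w, new_h


print(resize_dims(4032, 3024))  # landscape photo -> (224, 168)
print(resize_dims(1080, 1920))  # portrait screenshot -> (126, 224)
```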
Drop any image. We turn it into a 768-dimensional embedding with DINOv2, then ask Vespa to return the most visually similar photographs from ImageNet-1k.
A self-supervised Vision Transformer turns the image into a single 768-dim vector. We L2-normalize it so cosine and dot-product agree.
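To see why normalizing makes cosine and dot-product agree, here is a small sketch in plain Python. The two three-element vectors are made-up stand-ins for real 768-dim embeddings, and the helper names are ours.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Divide by the Euclidean length so the result has length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for two embeddings.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_hat, b_hat = l2_normalize(a), l2_normalize(b)

# Cosine divides by both lengths; after normalization both lengths are 1,
# so the plain dot product gives the same number.
assert math.isclose(cosine(a, b), dot(a_hat, b_hat))
```

Once every stored vector has unit length, ranking by dot product and ranking by cosine similarity produce the same order, so the cheaper dot product is enough.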
Vespa packs each vector into 768 bits (96 bytes) and walks an HNSW graph over Hamming distance for first-phase retrieval, which takes 16× less memory than bfloat16 (1,536 bytes per vector). The full-precision vectors live on disk and are paged in only for the second-phase rerank of the top 100 candidates.
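The two-phase idea can be sketched in plain Python. This is an illustration under stated assumptions, not Vespa's implementation: we assume binarization by sign (bit set when the component is positive), we use a brute-force Hamming scan as a stand-in for the HNSW graph walk, and the vectors are random rather than real DINOv2 embeddings. All function names are ours.

```python
import math
import random

DIM = 768


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pack_bits(vec):
    """Assumed sign binarization: bit i is 1 when vec[i] > 0, packed into one int."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits: XOR, then count the ones."""
    return bin(a ^ b).count("1")


def search(query, vectors, codes, first_phase_k=100, final_k=5):
    q_bits = pack_bits(query)
    # First phase: cheap Hamming distance on the packed codes
    # (brute force here; the real system walks an HNSW graph).
    candidates = sorted(range(len(codes)), key=lambda i: hamming(q_bits, codes[i]))
    candidates = candidates[:first_phase_k]
    # Second phase: rerank only those candidates with full-precision dot product.
    reranked = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return reranked[:final_k]


rng = random.Random(0)
vectors = [l2_normalize([rng.gauss(0, 1) for _ in range(DIM)]) for _ in range(500)]
codes = [pack_bits(v) for v in vectors]  # computed once, at indexing time

top = search(vectors[42], vectors, codes)
assert top[0] == 42  # a query that is already indexed ranks first

# Memory per vector: 768 bits vs. 768 bfloat16 values at 2 bytes each.
binary_bytes = DIM // 8   # 96
bfloat16_bytes = DIM * 2  # 1536
assert bfloat16_bytes // binary_bytes == 16
```

The first phase touches only the 96-byte codes, so the expensive full-precision dot product runs on at most 100 candidates per query.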
The closest vectors map back to their images, ranked by similarity. If your query image is already in the index, it should come back as the first result.