The query image
Whatever you uploaded — a selfie, a sunset, a sneaker. We resize its longer side to 224px and feed the raw pixels into the encoder.
PIL.Image → processor(images=image, return_tensors='pt')
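The resize step above can be sketched as follows. This is a minimal illustration of aspect-preserving resizing with Pillow, not the exact production pipeline; in practice a Hugging Face image processor handles resizing and normalization, and the `preprocess` helper here is hypothetical.

```python
from PIL import Image

def preprocess(image: Image.Image, target: int = 224) -> Image.Image:
    """Resize so the longer side is `target` px, preserving aspect ratio.

    Hypothetical helper sketching the resize described above; the real
    pipeline hands the result to a Hugging Face image processor.
    """
    w, h = image.size
    scale = target / max(w, h)
    return image.resize((round(w * scale), round(h * scale)))

# Stand-in for an uploaded photo (640x480 white canvas).
img = Image.new("RGB", (640, 480), "white")
out = preprocess(img)
print(out.size)  # longer side is now 224
```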
Drop any image. We turn it into a 768-dimensional embedding with DINOv2, then ask Vespa to return the most visually similar photographs from ImageNet-1k.
A self-supervised Vision Transformer turns the image into a single 768-dim vector. We L2-normalize it so cosine and dot-product agree.
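Why L2-normalize? On unit vectors, the dot product equals cosine similarity, so either metric ranks neighbors identically. A minimal NumPy sketch, with random vectors standing in for real DINOv2 embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=768)  # stand-in for a DINOv2 query embedding
w = rng.normal(size=768)  # stand-in for an indexed embedding

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return x / np.linalg.norm(x)

vn, wn = l2_normalize(v), l2_normalize(w)

# Cosine similarity of the raw vectors...
cosine = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))
# ...equals the plain dot product once both are unit-length.
assert np.isclose(np.dot(vn, wn), cosine)
```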
Vespa stores every ImageNet vector as bfloat16 and indexes it in an HNSW graph. A single YQL nearestNeighbor query walks the graph and returns the K closest points under the prenormalized-angular distance metric.
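A query of this shape can be sketched as a YQL string plus a request body. This is a hedged sketch: the field name `embedding`, query tensor name `q`, schema name `images`, and rank-profile name `similarity` are assumptions, not the demo's actual schema; only the `{targetHits:K}nearestNeighbor(...)` operator is standard Vespa YQL.

```python
# Hypothetical schema: document field `embedding`, query tensor `q`.
K = 10
yql = (
    "select image_id from images "
    f"where {{targetHits:{K}}}nearestNeighbor(embedding, q)"
)

# Request body as it would be POSTed to Vespa's /search/ endpoint.
body = {
    "yql": yql,
    "input.query(q)": [0.0] * 768,  # the L2-normalized query embedding goes here
    "hits": K,
    "ranking": "similarity",  # assumed rank-profile using closeness(embedding)
}
print(yql)
```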
The closest vectors become the nearest images — ranked by similarity. Your query image is the first result if it lives in the index.