Univ. of Modena and
ALADIN introduces a two-stage architecture that distills fine-grained alignment scores into an efficient common embedding space, enabling high-performance image-text matching and retrieval. The model achieves competitive recall while demonstrating up to a 90-fold increase in inference speed compared to entangled Vision-Language Transformers.
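The idea of distilling alignment scores into a common embedding space can be illustrated with a minimal sketch. This is not ALADIN's actual implementation: the function names, the row-wise KL objective, and the temperature parameter are all assumptions chosen for illustration, and the summary above does not specify the loss the authors use. The sketch assumes a slow "teacher" that yields an image-by-text score matrix and a "student" that scores pairs by cosine similarity between independently computed embeddings.

```python
import math

# Hypothetical helper names; none of these come from the ALADIN codebase.

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(image_embs, text_embs):
    """Student scores: cosine similarity for every image-text pair."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp((x - m) / temperature) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """Mean row-wise KL(teacher || student); an assumed stand-in objective."""
    loss = 0.0
    for t_row, s_row in zip(teacher_scores, student_scores):
        p = softmax(t_row, temperature)
        q = softmax(s_row, temperature)
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss / len(teacher_scores)

def rank_texts(image_emb, text_embs):
    """Retrieval in the common space: sort text indices by similarity, best first."""
    sims = cosine_matrix([image_emb], text_embs)[0]
    return sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
```

Under these assumptions, the speed advantage comes from the student: text and image embeddings can be computed once and cached, so a query reduces to dot products, whereas an entangled Transformer must run a full forward pass for every image-text pair.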