VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

BibTeX
@misc{chen2025vljepajointembedding,
      title={VL-JEPA: Joint Embedding Predictive Architecture for Vision-language},
      author={Delong Chen and Mustafa Shukor and Theo Moutakanni and Willy Chung and Jade Yu and Tejaswi Kasarla and Allen Bolourchi and Yann LeCun and Pascale Fung},
      year={2025},
      eprint={2512.10942},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.10942},
}
AI Audio Lecture + Q&A
Transcript
John: Welcome to our seminar on Transformer Architectures. Today's lecture is on 'VL-JEPA: Joint Embedding Predictive Architecture for Vision-language,' a recent paper from Meta FAIR and HKUST. We've seen a lot of work recently, like 'SparseVLM' and 'SmolVLM', focusing on making Vision-Language Models more efficient. This work takes a different path, proposing a fundamental architectural shift championed by Yann LeCun. It challenges the dominant generative paradigm. Yes, Noah?

Noah: Excuse me, Professor. Could you first clarify what 'Joint Embedding Predictive Architecture' or JEPA actually means? Is it just another name for an encoder-decoder model?

John: That's an excellent starting point. It's distinct from a standard encoder-decoder. Think of the two dominant approaches we have now. First, you have CLIP-style models, which learn to map images and text to a shared space. They're efficient for retrieval but can't generate text. On the other end, you have generative VLMs, like LLaVA, which bolt a vision encoder onto an LLM to generate text token by token. They are powerful but computationally very expensive and slow.

Noah: Right, because they have to predict the exact sequence of words.

John: Precisely. They're modeling not just the core meaning, but also all the superficial linguistic variations: word choice, sentence structure, and so on. JEPA proposes a third way. Instead of predicting high-dimensional data like pixels or text tokens, the model learns to predict abstract representations of the target in a latent space. The goal is to predict the embedding of the answer, not the answer's raw text itself. This simplifies the learning problem immensely.

Noah: So the model doesn't have to worry about whether the answer is 'a dog on a couch' versus 'a canine resting on a sofa'. It just has to predict an embedding that represents that core concept.

John: Exactly. That's the main contribution. It aims to combine the efficiency of representation learning with the multitask capabilities of generative models, but without the high computational cost of autoregressive token generation. It's built for efficiency and real-time use cases.

Noah: So how does it work under the hood? What are the key components?

John: The architecture has four main parts. First, an X-Encoder, which is a frozen vision model, turns the input image or video into a visual embedding. Second, a Predictor, which is a transformer, takes that visual embedding and a text query, like a question, and predicts a target embedding. Third, a Y-Encoder, which is a separate text model, creates the 'ground truth' embedding from the actual text answer. The model is trained to minimize the distance between the predicted embedding and the ground truth embedding using an InfoNCE loss.

Noah: Wait, InfoNCE loss? Why not just a simple MSE or cosine similarity loss between the two embeddings?

John: Good question. While you could use those, InfoNCE is a contrastive loss. It not only pulls the prediction and the true target embedding together but also pushes the prediction away from other negative examples in the batch. This helps prevent representation collapse, where the model learns to output the same average embedding for everything. It ensures the embedding space remains uniform and semantically meaningful.
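The training objective described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' code: the `x_encoder`, `predictor`, and `y_encoder` callables, the tensor shapes, and the temperature value are placeholders; only the overall structure (frozen X-Encoder, Predictor conditioned on a text query, Y-Encoder target, in-batch InfoNCE) follows the description in the lecture.

```python
# Minimal sketch of a JEPA-style vision-language training step.
# Module names, shapes, and the temperature are illustrative placeholders.
import torch
import torch.nn.functional as F

def vl_jepa_step(x_encoder, predictor, y_encoder, images, query_tokens,
                 answer_tokens, temperature=0.07):
    """One training step: predict the answer's embedding, not its tokens."""
    with torch.no_grad():                       # X-Encoder stays frozen
        visual_emb = x_encoder(images)          # (B, ...) visual embeddings

    # Predictor conditions on the visual embedding and the text query
    pred_emb = predictor(visual_emb, query_tokens)    # (B, D) predicted target

    # Y-Encoder produces the "ground truth" embedding of the answer text
    target_emb = y_encoder(answer_tokens)             # (B, D)

    # InfoNCE: match each prediction to its own target and treat the rest of
    # the batch as negatives, which discourages collapse to a mean embedding
    pred_emb = F.normalize(pred_emb, dim=-1)
    target_emb = F.normalize(target_emb, dim=-1)
    logits = pred_emb @ target_emb.t() / temperature  # (B, B) similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

Note that this sketch computes the contrastive loss only in the prediction-to-target direction; whether the actual model adds a symmetric term or other regularizers is not covered in the lecture.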
Noah: And what about the fourth component? How do you get actual text out if the model only produces an embedding?

John: That's the clever part. The fourth component is a Y-Decoder. It translates the predicted embedding back into human-readable text. But here's the key: this decoder is not used during the main training phase. It's a lightweight component brought in only at inference time when you need a text output. This massively reduces training complexity. For many tasks, like classification, you don't even need the decoder; you just find which candidate label's embedding is closest to the predicted one.

Noah: That seems particularly useful for streaming video. You wouldn't have to generate a caption for every single frame.

John: That's one of their main applications, which they call 'selective decoding'. The model produces a continuous stream of semantic embeddings non-autoregressively. You can monitor this stream and only invoke the lightweight decoder when a significant semantic shift is detected. The paper reports this reduces decoding operations by nearly three times with minimal performance loss.

John: The primary implication here is a potential shift away from brute-force generative modeling for many VLM tasks, especially those requiring efficiency and real-time response. By operating in a latent semantic space, VL-JEPA demonstrates a more sample-efficient and computationally cheaper path to generalist multimodal understanding. This work is the first to apply the JEPA concept to general-domain vision-language problems and show it is competitive.

Noah: It makes sense. The trend in papers like 'NVILA' and 'EVEv2' has been about finding new ways to break the efficiency barrier. This feels like a more foundational architectural answer to that problem. But what are the limitations? Does this latent-space prediction work for more complex, multi-step reasoning tasks where generating a chain of thought is important?

John: That is the critical question for future work. The authors acknowledge that the model hasn't been evaluated on complex reasoning, tool use, or agentic behaviors where explicit textual generation is often key. It excels at tasks like VQA, classification, and retrieval. Whether 'visual chain-of-thought' can be done effectively within a latent space is an open and very interesting research direction that this paper raises.

John: So, to wrap up, VL-JEPA provides a compelling proof-of-concept for a new class of vision-language models. It's not about replacing generative VLMs entirely, but about offering a highly efficient and effective alternative for a broad range of tasks, particularly in real-time and streaming contexts.

John: The key takeaway is this: the future of AI may involve less direct prediction of data and more reasoning within abstract, semantic spaces. VL-JEPA is a significant step in that direction. Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
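To make the decoder-free classification and 'selective decoding' ideas from the lecture concrete, here is a rough sketch, again under assumptions: `predict_embedding`, `y_decoder`, and the 0.8 similarity threshold are hypothetical stand-ins for illustration, not the paper's actual interface.

```python
# Illustrative sketch of decoder-free classification and selective decoding
# for streaming video, as discussed in the lecture. Callables and the
# similarity threshold are assumed placeholders.
import torch
import torch.nn.functional as F

def classify(pred_emb, label_embs):
    """Closed-set task: pick the candidate label whose embedding is closest
    to the predicted embedding -- no text decoder needed."""
    sims = F.normalize(pred_emb, dim=-1) @ F.normalize(label_embs, dim=-1).t()
    return sims.argmax(dim=-1)                 # (B,) predicted label indices

def stream_captions(frames, predict_embedding, y_decoder, threshold=0.8):
    """Emit a caption only when the semantic embedding shifts enough."""
    last_decoded = None
    for frame in frames:
        emb = predict_embedding(frame)         # non-autoregressive prediction
        if last_decoded is None or F.cosine_similarity(
                emb, last_decoded, dim=-1) < threshold:
            yield y_decoder(emb)               # invoke the lightweight decoder
            last_decoded = emb
        # otherwise skip decoding: the scene has not changed semantically
```

In practice the threshold would be tuned so that the decoder fires only on genuine semantic shifts; the near-threefold reduction in decoding operations quoted above is the paper's reported figure, not a property of this sketch.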