NAVER LABS Europe
MASt3R, developed by NAVER LABS Europe, presents an image matching framework that explicitly incorporates 3D scene information to establish pixel correspondences. The method achieves a 30% absolute improvement in VCRE AUC on the map-free localization benchmark and sets new state-of-the-art results for relative pose estimation on the CO3Dv2 and RealEstate10k datasets.
NAVER LABS Europe developed Provence, an efficient and robust context pruner for Retrieval-Augmented Generation (RAG) systems. It dynamically identifies and removes irrelevant sentences from retrieved contexts, achieving up to 80% compression while maintaining or improving LLM answer quality and accelerating inference by up to 2x, notably by unifying pruning with reranking into a single model.
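A minimal sketch of the sentence-pruning idea: split the retrieved context into sentences, score each against the question, and keep only the relevant ones. Provence itself unifies pruning with a trained reranker; the scoring function and threshold below are stand-in assumptions for illustration, not the Provence interface.

```python
# Illustrative sketch of sentence-level context pruning for RAG.
# The scorer and threshold are assumptions, not the actual Provence model.
import re

def prune_context(question: str, context: str, score_fn, threshold: float = 0.1) -> str:
    """Keep only sentences whose relevance score to the question exceeds a threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    kept = [s for s in sentences if score_fn(question, s) >= threshold]
    return " ".join(kept)

# Trivial lexical-overlap scorer standing in for a trained pruner:
def overlap_score(question: str, sentence: str) -> float:
    q, s = set(question.lower().split()), set(sentence.lower().split())
    return len(q & s) / max(len(q), 1)

context = ("The Eiffel Tower is in Paris. It was completed in 1889. "
           "Paris is also known for its cafes.")
print(prune_context("When was the Eiffel Tower completed?", context, overlap_score))
```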
The SPLADE model fine-tunes BERT to generate high-dimensional, sparse lexical representations for queries and documents, enabling efficient first-stage ranking using inverted indexes. It achieves effectiveness comparable to leading dense retrieval models while offering greater efficiency for large-scale systems.
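The core SPLADE recipe can be sketched in a few lines: take the masked-language-model logits over the vocabulary at each token position, apply a log-saturation to the ReLU'd logits, and pool across positions. The sketch below uses an off-the-shelf BERT checkpoint rather than the fine-tuned SPLADE weights, so the output is not actually sparse without the FLOPS regularization used in training.

```python
# Minimal sketch of SPLADE-style sparse encoding with Hugging Face Transformers.
# Reproduces the general recipe (MLM logits -> log-saturation -> pooling),
# not the fine-tuned SPLADE checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def splade_encode(text: str, pooling: str = "sum") -> torch.Tensor:
    """Return a |vocab|-dim vector: w_j = pool_i log(1 + relu(logit_ij))."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(0)       # (seq_len, vocab)
    sat = torch.log1p(torch.relu(logits))                # log-saturation
    mask = inputs["attention_mask"].squeeze(0).unsqueeze(-1)
    sat = sat * mask                                     # ignore padding positions
    # Sum pooling (original SPLADE); SPLADE v2 uses max pooling instead.
    return sat.sum(dim=0) if pooling == "sum" else sat.max(dim=0).values

vec = splade_encode("sparse lexical retrieval with inverted indexes")
print((vec > 0).sum().item(), "non-zero vocabulary weights")
```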
A unified 3D reconstruction framework from UCL and Naver Labs Europe enables flexible incorporation of camera and scene priors like intrinsics, poses, and depth maps into transformer-based architectures, achieving state-of-the-art performance across multiple 3D vision tasks while enabling high-resolution processing and efficient pose estimation through dual coordinate prediction.
We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and 3D location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person locations, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and 3D location using a new cross-attention module called the Human Prediction Head (HPH), with one query attending to the entire set of features for each detected person. As direct prediction of fine-grained hands and facial poses in a single shot, i.e., without relying on explicit crops around body parts, is hard to learn from existing data, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating it into the training data further enhances predictions, particularly for hands. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously: a ViT-S backbone on 448×448 images already yields a fast and competitive model, while larger models and higher resolutions obtain state-of-the-art results.
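A hedged sketch of the cross-attention idea behind the Human Prediction Head: one query per detected person attends to the full set of ViT tokens and regresses that person's parameters. The dimensions and the output head below are illustrative assumptions, not the released architecture.

```python
# Sketch of an HPH-style cross-attention head: person queries attend to all
# image tokens. Feature size and parameter count are illustrative only.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, dim: int = 384, n_heads: int = 8, n_params: int = 179):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Linear(dim, n_params))  # e.g. pose/shape/location

    def forward(self, person_queries: torch.Tensor, tokens: torch.Tensor):
        # person_queries: (B, n_people, dim), taken at detected heatmap peaks
        # tokens:         (B, n_tokens, dim), the full ViT feature map
        attended, _ = self.attn(person_queries, tokens, tokens)
        return self.mlp(attended)  # (B, n_people, n_params)

head = PredictionHead()
out = head(torch.randn(1, 3, 384), torch.randn(1, 1024, 384))
print(out.shape)  # torch.Size([1, 3, 179])
```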
Geo4D adapts a pre-trained video diffusion model, DynamiCrafter, for feed-forward monocular 4D scene reconstruction, effectively extracting explicit 4D geometry from video generators. The method achieves state-of-the-art performance in video depth and camera pose estimation on real-world benchmarks, despite being trained solely on synthetic data.
SPLADE v2, developed by NAVER LABS Europe and Sorbonne Université, enhances sparse neural retrieval by incorporating max pooling, knowledge distillation, and document-only expansion. It achieves state-of-the-art performance on MS MARCO (36.8 MRR@10) and strong zero-shot generalization on BEIR, while maintaining efficiency suitable for inverted indexes.
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, Multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
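The multi-head design described above can be pictured as separate lightweight decoders over a shared encoder feature map. The sketch below uses assumed layer shapes and a heavily simplified DensePose output; it is not the HAMSt3R implementation.

```python
# Illustrative multi-head decoding over shared encoder features: heads for
# human segmentation, DensePose-style correspondences, and depth.
# All layer sizes and head designs are assumptions.
import torch
import torch.nn as nn

class HumanSceneHeads(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.seg_head = nn.Conv2d(dim, 1, 1)        # person / not-person logits
        self.densepose_head = nn.Conv2d(dim, 3, 1)  # simplified (part, u, v) output
        self.depth_head = nn.Conv2d(dim, 1, 1)      # per-pixel depth

    def forward(self, feats: torch.Tensor) -> dict:
        # feats: (B, dim, H, W) from a shared (e.g. DUNE-like) encoder
        return {"segmentation": self.seg_head(feats),
                "densepose": self.densepose_head(feats),
                "depth": self.depth_head(feats)}

heads = HumanSceneHeads()
out = heads(torch.randn(1, 256, 32, 32))
print({k: tuple(v.shape) for k, v in out.items()})
```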
Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.
NAVER LABS Europe developed CroCo, a self-supervised pre-training method that leverages cross-view image completion to learn robust 3D geometric representations. This approach achieves superior performance on various 3D vision tasks such as depth estimation, optical flow, and relative camera pose, though at some cost on high-level semantic tasks.
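In spirit, cross-view completion is masked image modeling in which the reconstruction of one view can borrow information from a second, unmasked view of the same scene. A toy sketch with placeholder modules follows; it illustrates the objective, not the CroCo architecture.

```python
# Toy sketch of a cross-view completion objective: masked patches of view 2
# are reconstructed with help from an unmasked view 1. All modules are
# placeholders (e.g. view 1 should really pass through an encoder first).
import torch
import torch.nn as nn

class TinyCrossDecoder(nn.Module):
    """Placeholder decoder: tokens of view 2 cross-attend to view-1 tokens."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, visible2, ctx, idx, n_total):
        B, _, D = visible2.shape
        tokens = self.mask_token.expand(B, n_total, D).clone()
        tokens[:, idx] = visible2                  # fill in the visible patches
        attended, _ = self.cross(tokens, ctx, ctx) # borrow geometry from view 1
        return self.out(attended)

def croco_style_loss(view1, view2, decoder, mask_ratio: float = 0.9):
    B, N, D = view2.shape
    keep = torch.randperm(N)[: int(N * (1 - mask_ratio))]
    pred = decoder(view2[:, keep], view1, keep, N)
    return nn.functional.mse_loss(pred, view2)     # regress all patches of view 2

dec = TinyCrossDecoder()
print(croco_style_loss(torch.randn(2, 196, 64), torch.randn(2, 196, 64), dec).item())
```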
NAVER LABS Europe introduces MUSt3R, a multi-view 3D reconstruction framework that achieves state-of-the-art performance while running at 8-11 FPS. Its symmetric architecture and multi-layer memory mechanism enable both offline and online reconstruction.
SPLADE-v3, developed by NAVER LABS Europe, introduces enhanced training methodologies that notably improve the effectiveness of sparse neural retrieval models. The models achieve statistically significant gains over previous SPLADE versions and demonstrate strong competitiveness with cross-encoder re-rankers on diverse benchmarks.
NAVER LABS Europe researchers introduce PISCO, an efficient document compression framework for retrieval-augmented generation that achieves 16x compression with minimal accuracy loss (0-3%) while requiring only 48 hours of training on a single GPU, challenging the assumption that effective compression systems require extensive pretraining.
This work introduces CroCo v2, an enhanced self-supervised pre-training method that enables a generic Vision Transformer architecture to achieve state-of-the-art performance on dense geometric vision tasks like stereo matching and optical flow. It leverages millions of real-world image pairs, employs relative positional embeddings, and scales up the model, demonstrating that specialized architectural components are not strictly necessary for these tasks.
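The relative positional embeddings mentioned above are rotary position embeddings (RoPE): rotating query/key feature pairs by a position-dependent angle makes attention scores depend only on relative offsets. The sketch below is the standard 1D formulation; CroCo v2 applies the idea to image tokens, and its exact variant may differ.

```python
# Standard 1D rotary position embedding (RoPE) sketch, applied to queries/keys
# before attention; not CroCo v2's exact 2D variant.
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """x: (B, N, D) with D even; positions: (N,) token positions."""
    B, N, D = x.shape
    freqs = base ** (-torch.arange(0, D, 2).float() / D)     # (D/2,) frequencies
    angles = positions[:, None].float() * freqs[None, :]     # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                     # rotate each 2D pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = rope(torch.randn(1, 16, 64), torch.arange(16))
print(q.shape)  # torch.Size([1, 16, 64])
```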
Kinaema, developed by NAVER LABS Europe, introduces a recurrent sequence model that enables embodied AI agents to form scalable, long-term spatial memory from continuous visual and odometry streams. The model outperforms prior memory baselines on memory-based relative pose estimation and improves navigation performance in challenging, continuous scenarios.
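As a rough illustration (all module choices below are assumptions, not Kinaema's architecture), such a model folds a stream of image embeddings and odometry into a fixed-size recurrent state, which can later be queried for the relative pose of a remembered observation.

```python
# Illustrative recurrent spatial memory: consume (image embedding, odometry)
# steps, then query the hidden state for a relative pose. All dimensions and
# module choices are assumptions.
import torch
import torch.nn as nn

class RecurrentSpatialMemory(nn.Module):
    def __init__(self, img_dim=384, odo_dim=3, hidden=512):
        super().__init__()
        self.rnn = nn.GRUCell(img_dim + odo_dim, hidden)
        self.pose_head = nn.Sequential(nn.Linear(hidden + img_dim, hidden),
                                       nn.ReLU(),
                                       nn.Linear(hidden, 4))  # (x, y, cos, sin)

    def step(self, h, img_emb, odometry):
        return self.rnn(torch.cat([img_emb, odometry], dim=-1), h)

    def query(self, h, query_img_emb):
        return self.pose_head(torch.cat([h, query_img_emb], dim=-1))

mem = RecurrentSpatialMemory()
h = torch.zeros(1, 512)
for _ in range(10):                       # consume a short trajectory
    h = mem.step(h, torch.randn(1, 384), torch.randn(1, 3))
print(mem.query(h, torch.randn(1, 384)).shape)  # torch.Size([1, 4])
```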
LUDVIG introduces a learning-free "inverse rendering" approach that directly aggregates 2D visual features from models like DINOv2 onto 3D Gaussian Splatting scenes, complemented by a 3D graph diffusion mechanism for refinement. This method from Univ. Grenoble Alpes and NAVER LABS Europe achieves performance comparable to or better than optimization-based techniques across various 3D semantic tasks, while accelerating feature generation and uplifting by 5 to 10 times.
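The aggregation step can be sketched as a weighted scatter: each Gaussian averages the 2D features of the pixels it contributes to, weighted by its alpha-blending weight from the renderer. The flat contribution layout below is an assumption for illustration; LUDVIG's exact implementation may differ.

```python
# Learning-free feature uplifting sketch: scatter pixel features onto the
# Gaussians that rendered them, weighted by alpha-blending weights.
import torch

def uplift_features(pixel_feats, gauss_ids, weights, n_gaussians):
    """
    pixel_feats: (P, D) 2D features (e.g. DINOv2) for P pixel contributions
    gauss_ids:   (P,)   index of the Gaussian receiving each contribution
    weights:     (P,)   alpha-blending weight of that Gaussian at that pixel
    """
    D = pixel_feats.shape[1]
    acc = torch.zeros(n_gaussians, D)
    norm = torch.zeros(n_gaussians)
    acc.index_add_(0, gauss_ids, pixel_feats * weights[:, None])
    norm.index_add_(0, gauss_ids, weights)
    return acc / norm.clamp(min=1e-8)[:, None]   # per-Gaussian 3D feature

feats3d = uplift_features(torch.randn(1000, 384),
                          torch.randint(0, 50, (1000,)),
                          torch.rand(1000), n_gaussians=50)
print(feats3d.shape)  # torch.Size([50, 384])
```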
Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.
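A minimal sketch of heterogeneous multi-teacher distillation as described: one student encoder, one lightweight projector per teacher, and a per-teacher feature-matching loss. The projector design, dimensions, and cosine loss are assumptions, not DUNE's exact recipe.

```python
# Co-distillation sketch: a single student encoder with teacher-specific
# projectors, trained to match each teacher's features.
import torch
import torch.nn as nn

class CoDistillStudent(nn.Module):
    def __init__(self, dim: int, teacher_dims: dict):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3 * 16 * 16, dim), nn.GELU(),
                                     nn.Linear(dim, dim))    # stand-in for a ViT
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(dim, d) for name, d in teacher_dims.items()})

    def forward(self, patches):
        z = self.encoder(patches)                            # shared representation
        return {name: proj(z) for name, proj in self.projectors.items()}

def codistill_loss(student_out, teacher_feats):
    cos = nn.CosineSimilarity(dim=-1)
    return sum((1 - cos(student_out[k], v)).mean() for k, v in teacher_feats.items())

student = CoDistillStudent(256, {"mast3r": 1024, "multi_hmr": 768, "dino": 768})
patches = torch.randn(4, 196, 3 * 16 * 16)
targets = {"mast3r": torch.randn(4, 196, 1024),
           "multi_hmr": torch.randn(4, 196, 768),
           "dino": torch.randn(4, 196, 768)}
print(codistill_loss(student(patches), targets).item())
```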
MASt3R-SfM, from NAVER LABS Europe, presents a Structure-from-Motion pipeline that uses foundation models for 3D vision to reconstruct 3D scenes from unconstrained image collections. The system demonstrates improved robustness in challenging scenarios such as limited overlap and pure rotation, while maintaining quasi-linear complexity in the number of images.
We propose a training-free method for open-vocabulary semantic segmentation using Vision-and-Language Models (VLMs). Our approach enhances the initial per-patch predictions of VLMs through label propagation, which jointly optimizes predictions by incorporating patch-to-patch relationships. Since VLMs are primarily optimized for cross-modal alignment and not for intra-modal similarity, we use a Vision Model (VM) that is observed to better capture these relationships. We address resolution limitations inherent to patch-based encoders by applying label propagation at the pixel level as a refinement step, significantly improving segmentation accuracy near class boundaries. Our method, called LPOSS+, performs inference over the entire image, avoiding window-based processing and thereby capturing contextual interactions across the full image. LPOSS+ achieves state-of-the-art performance among training-free methods, across a diverse set of datasets. Code: this https URL
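The propagation step follows classic label propagation (Zhou et al., 2004): initial per-patch VLM scores are refined through a normalized affinity graph built from Vision-Model features. A self-contained sketch follows; the affinity construction in LPOSS+ may differ.

```python
# Label-propagation sketch: refine per-patch class scores with a normalized
# kNN affinity graph built from Vision-Model features, solved in closed form.
import torch

def label_propagation(init_scores, vm_feats, alpha: float = 0.9, k: int = 10):
    """
    init_scores: (N, C) per-patch VLM predictions
    vm_feats:    (N, D) Vision-Model features for patch-to-patch affinities
    """
    f = torch.nn.functional.normalize(vm_feats, dim=-1)
    sim = f @ f.T                                          # cosine affinities
    topk = sim.topk(k, dim=-1)
    W = torch.zeros_like(sim).scatter_(1, topk.indices, topk.values.clamp(min=0))
    W = 0.5 * (W + W.T)                                    # symmetrize the kNN graph
    d = W.sum(-1).clamp(min=1e-8).rsqrt()
    S = d[:, None] * W * d[None, :]                        # D^-1/2 W D^-1/2
    N = S.shape[0]
    # Closed-form fixed point of F <- alpha * S @ F + (1 - alpha) * init_scores
    return torch.linalg.solve(torch.eye(N) - alpha * S, (1 - alpha) * init_scores)

refined = label_propagation(torch.rand(196, 21), torch.randn(196, 384))
print(refined.shape)  # torch.Size([196, 21])
```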
Neural retrievers based on dense representations combined with Approximate Nearest Neighbors search have recently received a lot of attention, owing their success to distillation and/or better sampling of examples for training -- while still relying on the same backbone architecture. In the meantime, sparse representation learning fueled by traditional inverted indexing techniques has seen a growing interest, inheriting from desirable IR priors such as explicit lexical matching. While some architectural variants have been proposed, less effort has been put into the training of such models. In this work, we build on SPLADE -- a sparse expansion-based retriever -- and show to what extent it is able to benefit from the same training improvements as dense models, by studying the effect of distillation, hard-negative mining, and Pre-trained Language Model initialization. We furthermore study the link between effectiveness and efficiency, in in-domain and zero-shot settings, leading to state-of-the-art results in both scenarios for sufficiently expressive models.
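Distillation for retriever training is commonly implemented with a MarginMSE objective (Hofstätter et al.): the student's score margin between a positive and a hard-negative document is regressed onto a cross-encoder teacher's margin. A minimal sketch is given below; the exact loss configuration used in this work may differ.

```python
# MarginMSE distillation sketch: match the student's positive-vs-negative
# score margin to a cross-encoder teacher's margin.
import torch

def margin_mse_loss(q, d_pos, d_neg, teacher_pos, teacher_neg):
    """
    q, d_pos, d_neg:          (B, V) query / document representations
    teacher_pos, teacher_neg: (B,)   cross-encoder teacher scores
    """
    student_margin = (q * d_pos).sum(-1) - (q * d_neg).sum(-1)  # dot-product scores
    teacher_margin = teacher_pos - teacher_neg
    return torch.nn.functional.mse_loss(student_margin, teacher_margin)

B, V = 8, 30522
loss = margin_mse_loss(torch.rand(B, V), torch.rand(B, V), torch.rand(B, V),
                       torch.randn(B), torch.randn(B))
print(loss.item())
```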