Focoos AI
SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation
25 Mar 2025
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. Existing methods either restrict reasoning to independent short clips, losing global context, or process the entire video offline, impairing their application in a streaming fashion. In this work, we aim to surpass these limitations and design an RVOS method capable of effectively operating in streaming-like scenarios while retaining contextual information from past frames. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities and is naturally suited for streaming processing. We make SAM2 wiser by empowering it with natural language understanding and explicit temporal modeling at the feature extraction stage, without fine-tuning its weights and without outsourcing modality interaction to external models. To this end, we introduce a novel adapter module that injects temporal information and multi-modal cues into the feature extraction process. We further reveal the phenomenon of tracking bias in SAM2 and propose a learnable module to adjust its tracking focus when the current frame features suggest a new object more aligned with the caption. Our proposed method, SAMWISE, achieves state-of-the-art results across various benchmarks, while adding a negligible overhead of fewer than 5M parameters. Code is available at this https URL.
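As a rough illustration of the general mechanism of injecting multi-modal and temporal cues into frozen backbone features through a lightweight residual adapter, the sketch below uses hypothetical module names and dimensions; it is not the SAMWISE implementation, only a minimal PyTorch example of the idea.

```python
import torch
import torch.nn as nn

class CrossModalTemporalAdapter(nn.Module):
    """Minimal sketch of a lightweight adapter that adds text and temporal
    cues to frozen visual features via residual cross-attention.
    Names and sizes are illustrative assumptions, not the SAMWISE design."""

    def __init__(self, dim: int = 256, text_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        # Visual tokens attend to caption tokens (multi-modal cue injection).
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Current-frame tokens attend to cached past-frame tokens (temporal modeling).
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_tokens, past_tokens, text_tokens):
        # frame_tokens: (B, N, dim) frozen backbone features of the current frame
        # past_tokens:  (B, M, dim) features cached from previous frames
        # text_tokens:  (B, T, text_dim) caption embeddings from a text encoder
        txt = self.text_proj(text_tokens)
        x = frame_tokens
        x = x + self.text_attn(self.norm1(x), txt, txt, need_weights=False)[0]
        x = x + self.temporal_attn(self.norm2(x), past_tokens, past_tokens,
                                   need_weights=False)[0]
        # Residual form leaves the frozen SAM2 features unchanged when the
        # adapter's contribution is small.
        return x
```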
The revenge of BiSeNet: Efficient Multi-Task Image Segmentation
15 Apr 2024
Recent advancements in image segmentation have focused on enhancing the efficiency of models to meet the demands of real-time applications, especially on edge devices. However, existing research has primarily concentrated on single-task settings, especially semantic segmentation, leading to redundant efforts and specialized architectures for different tasks. To address this limitation, we propose a novel architecture for efficient multi-task image segmentation, capable of handling various segmentation tasks without sacrificing efficiency or accuracy. We introduce BiSeNetFormer, which leverages the efficiency of two-stream semantic segmentation architectures and extends them into a mask classification framework. Our approach maintains the efficient spatial and context paths to capture detailed and semantic information, respectively, while leveraging an efficient transformer-based segmentation head that computes the binary masks and class probabilities. By seamlessly supporting multiple tasks, namely semantic and panoptic segmentation, BiSeNetFormer offers a versatile solution for multi-task segmentation. We evaluate our approach on popular datasets, Cityscapes and ADE20K, demonstrating impressive inference speeds while maintaining competitive accuracy compared to state-of-the-art architectures. Our results indicate that BiSeNetFormer represents a significant advancement towards fast, efficient, and multi-task segmentation networks, bridging the gap between model efficiency and task adaptability.
To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition

A re-evaluation of image matching within Visual Place Recognition (VPR) pipelines reveals that while re-ranking can degrade Recall@1 on datasets saturated by modern retrieval models, it proves valuable for challenging scenarios. The work establishes that image matching's inlier counts serve as a reliable indicator of retrieval confidence, enabling an adaptive strategy for selective re-ranking.
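A minimal sketch of the selective re-ranking strategy described above, assuming a generic `match_fn` helper that returns the number of geometrically verified inliers between two images; the threshold and all names are illustrative assumptions, not the paper's exact pipeline.

```python
def adaptive_rerank(query_img, candidates, match_fn, inlier_threshold=50):
    """Re-rank retrieved places with image matching only when retrieval looks
    uncertain. `candidates` are database images ordered by global-descriptor
    similarity; `match_fn(a, b)` returns an inlier count (assumed helper)."""
    # Use the inlier count of the top candidate as a confidence signal.
    top_inliers = match_fn(query_img, candidates[0])
    if top_inliers >= inlier_threshold:
        return candidates  # retrieval is confident: keep the original ranking

    # Low confidence: pay the matching cost and re-rank by inlier count.
    inlier_counts = [match_fn(query_img, c) for c in candidates]
    order = sorted(range(len(candidates)),
                   key=lambda i: inlier_counts[i], reverse=True)
    return [candidates[i] for i in order]
```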

What does CLIP know about peeling a banana?
18 Apr 2024
Humans show an innate capability to identify tools that support specific actions. The association between object parts and the actions they facilitate is usually referred to as affordance. Being able to segment object parts depending on the tasks they afford is crucial for enabling intelligent robots to use objects of daily living. Traditional supervised learning methods for affordance segmentation require costly pixel-level annotations, while weakly supervised approaches, though less demanding, still rely on object-interaction examples and support a closed set of actions. These limitations hinder scalability, may introduce biases, and usually restrict models to a limited set of predefined actions. This paper proposes AffordanceCLIP, which overcomes these limitations by leveraging the implicit affordance knowledge embedded within large pre-trained Vision-Language models like CLIP. We experimentally demonstrate that CLIP, although not explicitly trained for affordance detection, retains valuable information for the task. Our AffordanceCLIP achieves competitive zero-shot performance compared to methods with specialized training, while offering several advantages: i) it works with any action prompt, not just a predefined set; ii) it requires training only a small number of additional parameters compared to existing solutions; and iii) it eliminates the need for direct supervision on action-object pairs, opening new perspectives for functionality-based reasoning in models.
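A minimal sketch of the kind of zero-shot scoring this enables, assuming patch-level visual features and an action-prompt embedding that already live in a shared CLIP-style space (how they are extracted is left out, and all names are illustrative rather than the AffordanceCLIP implementation):

```python
import torch
import torch.nn.functional as F

def affordance_heatmap(patch_feats, text_feat, grid_hw, image_hw):
    """Score each visual patch against an action prompt (e.g. "peeling a banana")
    and upsample the scores into a per-pixel affordance map.
    patch_feats: (N, D) patch embeddings; text_feat: (D,) prompt embedding."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    scores = patch_feats @ text_feat                   # (N,) cosine similarities
    h, w = grid_hw
    heatmap = scores.view(1, 1, h, w)                  # reshape to the patch grid
    heatmap = F.interpolate(heatmap, size=image_hw,
                            mode="bilinear", align_corners=False)
    return heatmap.squeeze()                           # (H, W) affordance map
```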
PEM: Prototype-based Efficient MaskFormer for Image Segmentation

PEM introduces a new architecture that enhances the efficiency of MaskFormer-style models for image segmentation by redesigning the transformer decoder and pixel decoder. This approach achieves state-of-the-art performance-speed trade-offs on panoptic and semantic segmentation across datasets like Cityscapes and ADE20K, delivering up to twice the speed of Mask2Former with comparable accuracy.
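As a loose illustration of a prototype-style decoder block, the sketch below lets each object query attend only to its top-k most similar pixel features instead of the whole feature map; this is an assumption-laden simplification of the idea, not PEM's actual decoder.

```python
import torch
import torch.nn as nn

class PrototypeCrossAttention(nn.Module):
    """Each object query attends to the k pixel tokens most similar to it,
    rather than the full feature map. Illustrative simplification only."""

    def __init__(self, dim: int = 256, k: int = 128):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.k = k
        self.scale = dim ** -0.5

    def forward(self, queries, pixel_feats):
        # queries: (B, Q, dim) object queries; pixel_feats: (B, HW, dim)
        q = self.q_proj(queries)
        keys = self.k_proj(pixel_feats)
        values = self.v_proj(pixel_feats)
        attn = (q @ keys.transpose(1, 2)) * self.scale       # (B, Q, HW)
        # Keep only the top-k pixels per query ("prototypes"); mask out the rest.
        kth = attn.topk(self.k, dim=-1).values[..., -1:]
        attn = attn.masked_fill(attn < kth, float("-inf")).softmax(dim=-1)
        return queries + attn @ values                       # residual query update
```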

Show or Tell? A Benchmark To Evaluate Visual and Textual Prompts in Semantic Segmentation
06 May 2025

Researchers from Politecnico di Torino and Focoos AI introduce the Show or Tell (SoT) benchmark, a comprehensive framework for directly comparing visual and textual prompts in multi-class semantic segmentation across 14 diverse datasets. Their evaluation demonstrates that visual prompts often yield higher segmentation accuracy, particularly in specialized domains, while textual prompts are more computationally efficient and perform well for common concepts.
