Institute of Digital Twin, EIT Ningbo
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

DreamVLA, a Vision-Language-Action (VLA) model from a collaboration including Shanghai Jiao Tong University and Tsinghua University, enhances robot manipulation by forecasting comprehensive future world knowledge, including dynamic regions, depth, and semantics. It achieves this by integrating these predictions into a unified transformer, leading to improved generalization and higher success rates across various robotic tasks while maintaining efficient inference.
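To make the idea concrete, below is a minimal sketch of a transformer that decodes learned queries for dynamic regions, depth, and semantics alongside an action token in a single forward pass. All module sizes, names, and the one-query-per-target layout are illustrative assumptions, not DreamVLA's actual architecture.

```python
# Minimal sketch of a unified transformer decoding world-knowledge queries and
# an action token; all names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedVLASketch(nn.Module):
    def __init__(self, dim=512, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        # One learned query token per prediction target:
        # dynamic regions, depth, semantics, action.
        self.queries = nn.Parameter(torch.randn(4, dim))
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, obs_tokens, lang_tokens):
        # obs_tokens: (batch, n_obs, dim); lang_tokens: (batch, n_lang, dim)
        q = self.queries.expand(obs_tokens.size(0), -1, -1)
        x = torch.cat([obs_tokens, lang_tokens, q], dim=1)
        x = self.backbone(x)
        dyn, depth, sem, act = x[:, -4], x[:, -3], x[:, -2], x[:, -1]
        # dyn/depth/sem would feed lightweight prediction heads during training.
        return dyn, depth, sem, self.action_head(act)

model = UnifiedVLASketch()
outs = model(torch.randn(2, 196, 512), torch.randn(2, 16, 512))
print([o.shape for o in outs])  # three (2, 512) tensors and one (2, 7) action
```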

Probing the Difficulty Perception Mechanism of Large Language Models

Researchers demonstrated that Large Language Models (LLMs) encode problem difficulty in their internal representations, localizing the mechanism to specific attention heads and showing that it can be recovered with a linear probe. This enables automatic difficulty annotation, offers insight into adaptive reasoning, and shows that the internal difficulty signal is distinct from token-level entropy.
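The linear-probe finding suggests difficulty can be read out with a simple regressor over hidden activations. Below is a minimal sketch of such a probe; the activations and labels are random placeholders standing in for per-problem hidden states extracted from a candidate attention head, and none of this is the authors' code.

```python
# Minimal sketch of a linear difficulty probe (illustrative, not the paper's code).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 4096))   # one vector per problem (placeholder)
difficulty = rng.uniform(0.0, 1.0, size=1000)   # scalar difficulty labels (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, difficulty, test_size=0.2, random_state=0
)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)        # linear probe on activations
print("held-out R^2:", probe.score(X_te, y_te)) # near zero on this random data
```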

Multimodal Language Models See Better When They Look Shallower

Researchers systematically analyzed visual layer selection in Multimodal Large Language Models (MLLMs), demonstrating that integrating features from shallow, middle, and deep Vision Transformer layers via a simple concatenation fusion outperforms conventional deep-layer reliance and more complex fusion strategies.
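A concatenation fusion of this kind is simple to express: the sketch below stacks features from three ViT layers along the channel dimension and projects them into the language model's token space. Layer choices, dimensions, and the projection design are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of multi-layer ViT feature fusion by concatenation;
# layer picks, dimensions, and the projection are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096, num_layers=3):
        super().__init__()
        # Project the concatenated features into the LLM's embedding space.
        self.proj = nn.Linear(vit_dim * num_layers, llm_dim)

    def forward(self, layer_feats):
        # layer_feats: list of (batch, num_patches, vit_dim) hidden states,
        # e.g. from one shallow, one middle, and one deep ViT layer.
        fused = torch.cat(layer_feats, dim=-1)
        return self.proj(fused)

shallow, middle, deep = (torch.randn(2, 256, 1024) for _ in range(3))
visual_tokens = ConcatFusion()([shallow, middle, deep])
print(visual_tokens.shape)  # torch.Size([2, 256, 4096])
```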

MultiConIR: Towards multi-condition Information Retrieval

The MultiConIR benchmark was developed to systematically evaluate information retrieval and reranking models on multi-condition natural language queries, revealing that current state-of-the-art models suffer significant performance degradation and lack robust relevance monotonicity and format invariance. Advanced general-purpose LLMs, such as GPT-4o, demonstrated superior capabilities in these complex retrieval scenarios.
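Relevance monotonicity can be tested with a simple invariant: as a document satisfies more of the query's conditions, its retrieval score should not decrease. The sketch below expresses that check for an arbitrary scoring function; the toy scorer and example documents are hypothetical, not part of the MultiConIR benchmark code.

```python
# Illustrative relevance-monotonicity check for a multi-condition query.
def is_monotonic(score, query, docs_by_matched_conditions):
    # docs_by_matched_conditions: documents ordered by the number of query
    # conditions they satisfy, fewest first.
    scores = [score(query, doc) for doc in docs_by_matched_conditions]
    return all(a <= b for a, b in zip(scores, scores[1:]))

# Toy scorer: counts how many query conditions appear verbatim in the document.
toy_score = lambda q, d: sum(cond in d for cond in q.split(" and "))
query = "red and cotton and long-sleeve"
docs = ["blue shirt", "red shirt", "red cotton shirt", "red cotton long-sleeve shirt"]
print(is_monotonic(toy_score, query, docs))  # True for this toy scorer
```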

DexVLG: Dexterous Vision-Language-Grasp Model at Scale
03 Jul 2025

A new framework introduces DexGraspNet 3.0, the largest synthetic dataset for dexterous grasping, with 170 million semantically annotated poses, and DexVLG, a large vision-language-grasp model. The model predicts language-aligned dexterous grasp poses from single-view RGBD input, achieving an 80% success rate and 75% part accuracy in real-world experiments.

LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in this https URL.
Large Language Models Empowered Personalized Web Agents

Researchers from the National University of Singapore and collaborators introduced the concept of LLM-empowered personalized Web agents, aiming to automate online tasks by incorporating user-specific data. They developed the PersonalWAB benchmark and proposed the PUMA framework, which notably improved task accuracy and efficiency by leveraging personalized user memory and preference optimization, outperforming larger general-purpose LLMs.

Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis

A data-efficient framework for Thai text-to-speech synthesis combines phoneme-tone adaptive modeling with specialized preprocessing pipelines to handle complex linguistic features, achieving high-fidelity speech synthesis and zero-shot voice cloning while requiring significantly less training data than traditional approaches.

Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph
This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with manually designed constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modality-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at this https URL
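To illustrate the two core modules, here is a minimal sketch of cross-attention with learned queries, which pools permutation-invariant style tokens from the input, and a vector-quantization step, which snaps per-position features to a discrete codebook of content tokens. Shapes and sizes are assumptions; this is not the authors' implementation, and it omits the link attention decoder and all training losses.

```python
# Minimal sketch of learned-query cross-attention for permutation-invariant
# style and vector quantization for discrete content tokens (illustrative only).
import torch
import torch.nn as nn

class StyleRetriever(nn.Module):
    def __init__(self, dim=256, num_style_tokens=8, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_style_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (batch, seq, dim)
        q = self.queries.expand(x.size(0), -1, -1)
        # Attention pools over positions, so the output is invariant to
        # permutations of the input sequence (absent positional encodings).
        style, _ = self.attn(q, x, x)
        return style                                # (batch, num_style_tokens, dim)

class ContentQuantizer(nn.Module):
    def __init__(self, dim=256, codebook_size=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                           # x: (batch, seq, dim)
        codes = self.codebook.weight                # (codebook_size, dim)
        dists = torch.cdist(x, codes.expand(x.size(0), -1, -1))
        idx = dists.argmin(dim=-1)                  # discrete content token ids
        return self.codebook(idx), idx              # quantized features, ids

x = torch.randn(2, 100, 256)
style = StyleRetriever()(x)
content, tokens = ContentQuantizer()(x)
print(style.shape, content.shape, tokens.shape)
```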