Institute of Digital Twin, EIT Ningbo
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

DreamVLA, a Vision-Language-Action (VLA) model from a collaboration including Shanghai Jiao Tong University and Tsinghua University, enhances robot manipulation by forecasting comprehensive future world knowledge, including dynamic regions, depth, and semantics. It achieves this by integrating these predictions into a unified transformer, leading to improved generalization and higher success rates across various robotic tasks while maintaining efficient inference.
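To make the idea concrete, below is a minimal sketch of a transformer that decodes learned queries for dynamic regions, depth, and semantics alongside an action token in a single forward pass. All module sizes, names, and the one-query-per-target layout are illustrative assumptions, not DreamVLA's actual architecture.

```python
# Minimal sketch of a unified transformer decoding world-knowledge queries and
# an action token; all names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedVLASketch(nn.Module):
    def __init__(self, dim=512, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        # One learned query token per prediction target:
        # dynamic regions, depth, semantics, action.
        self.queries = nn.Parameter(torch.randn(4, dim))
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, obs_tokens, lang_tokens):
        # obs_tokens: (batch, n_obs, dim); lang_tokens: (batch, n_lang, dim)
        q = self.queries.expand(obs_tokens.size(0), -1, -1)
        x = torch.cat([obs_tokens, lang_tokens, q], dim=1)
        x = self.backbone(x)
        dyn, depth, sem, act = x[:, -4], x[:, -3], x[:, -2], x[:, -1]
        # dyn/depth/sem would feed lightweight prediction heads during training.
        return dyn, depth, sem, self.action_head(act)

model = UnifiedVLASketch()
outs = model(torch.randn(2, 196, 512), torch.randn(2, 16, 512))
print([o.shape for o in outs])  # three (2, 512) tensors and one (2, 7) action
```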

Probing the Difficulty Perception Mechanism of Large Language Models

Researchers demonstrated that Large Language Models (LLMs) encode problem difficulty in their internal representations, localizing the mechanism to specific attention heads and showing that it can be recovered with a linear probe. This enables automatic difficulty annotation, offers insight into adaptive reasoning, and shows that the internal difficulty signal is distinct from token-level entropy.
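The linear-probe finding suggests difficulty can be read out with a simple regressor over hidden activations. Below is a minimal sketch of such a probe; the activations and labels are random placeholders standing in for per-problem hidden states extracted from a candidate attention head, and none of this is the authors' code.

```python
# Minimal sketch of a linear difficulty probe (illustrative, not the paper's code).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 4096))   # one vector per problem (placeholder)
difficulty = rng.uniform(0.0, 1.0, size=1000)   # scalar difficulty labels (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(
    hidden_states, difficulty, test_size=0.2, random_state=0
)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)        # linear probe on activations
print("held-out R^2:", probe.score(X_te, y_te)) # near zero on this random data
```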

Multimodal Language Models See Better When They Look Shallower

Researchers systematically analyzed visual layer selection in Multimodal Large Language Models (MLLMs), demonstrating that integrating features from shallow, middle, and deep Vision Transformer layers via a simple concatenation fusion outperforms conventional deep-layer reliance and more complex fusion strategies.
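A concatenation fusion of this kind is simple to express: the sketch below stacks features from three ViT layers along the channel dimension and projects them into the language model's token space. Layer choices, dimensions, and the projection design are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of multi-layer ViT feature fusion by concatenation;
# layer picks, dimensions, and the projection are illustrative assumptions.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, vit_dim=1024, llm_dim=4096, num_layers=3):
        super().__init__()
        # Project the concatenated features into the LLM's embedding space.
        self.proj = nn.Linear(vit_dim * num_layers, llm_dim)

    def forward(self, layer_feats):
        # layer_feats: list of (batch, num_patches, vit_dim) hidden states,
        # e.g. from one shallow, one middle, and one deep ViT layer.
        fused = torch.cat(layer_feats, dim=-1)
        return self.proj(fused)

shallow, middle, deep = (torch.randn(2, 256, 1024) for _ in range(3))
visual_tokens = ConcatFusion()([shallow, middle, deep])
print(visual_tokens.shape)  # torch.Size([2, 256, 4096])
```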

MultiConIR: Towards multi-condition Information Retrieval

The MultiConIR benchmark was developed to systematically evaluate information retrieval and reranking models on multi-condition natural language queries, revealing that current state-of-the-art models suffer significant performance degradation and lack robust relevance monotonicity and format invariance. Advanced general-purpose LLMs, such as GPT-4o, demonstrated superior capabilities in these complex retrieval scenarios.
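Relevance monotonicity can be tested with a simple invariant: as a document satisfies more of the query's conditions, its retrieval score should not decrease. The sketch below expresses that check for an arbitrary scoring function; the toy scorer and example documents are hypothetical, not part of the MultiConIR benchmark code.

```python
# Illustrative relevance-monotonicity check for a multi-condition query.
def is_monotonic(score, query, docs_by_matched_conditions):
    # docs_by_matched_conditions: documents ordered by the number of query
    # conditions they satisfy, fewest first.
    scores = [score(query, doc) for doc in docs_by_matched_conditions]
    return all(a <= b for a, b in zip(scores, scores[1:]))

# Toy scorer: counts how many query conditions appear verbatim in the document.
toy_score = lambda q, d: sum(cond in d for cond in q.split(" and "))
query = "red and cotton and long-sleeve"
docs = ["blue shirt", "red shirt", "red cotton shirt", "red cotton long-sleeve shirt"]
print(is_monotonic(toy_score, query, docs))  # True for this toy scorer
```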

DexVLG: Dexterous Vision-Language-Grasp Model at Scale
03 Jul 2025

A new framework introduces DexGraspNet 3.0, the largest synthetic dataset for dexterous grasping, with 170 million semantically annotated poses, and DexVLG, a large vision-language-grasp model. The model predicts language-aligned dexterous grasp poses from single-view RGBD input, achieving an 80% success rate and 75% part accuracy in real-world experiments.

LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in this https URL.
Large Language Models Empowered Personalized Web Agents

Researchers from the National University of Singapore and collaborators introduced the concept of LLM-empowered personalized Web agents, aiming to automate online tasks by incorporating user-specific data. They developed the PersonalWAB benchmark and proposed the PUMA framework, which notably improved task accuracy and efficiency by leveraging personalized user memory and preference optimization, outperforming larger general-purpose LLMs.

Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis

A data-efficient framework for Thai text-to-speech synthesis combines phoneme-tone adaptive modeling with specialized preprocessing pipelines to handle complex linguistic features, achieving high-fidelity speech synthesis and zero-shot voice cloning while requiring significantly less training data than traditional approaches.

Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph
This paper addresses the unsupervised learning of content-style decomposed representation. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with manually designed constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modality-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at this https URL
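To illustrate the two core modules, here is a minimal sketch of cross-attention with learned queries, which pools permutation-invariant style tokens from the input, and a vector-quantization step, which snaps per-position features to a discrete codebook of content tokens. Shapes and sizes are assumptions; this is not the authors' implementation, and it omits the link attention decoder and all training losses.

```python
# Minimal sketch of learned-query cross-attention for permutation-invariant
# style and vector quantization for discrete content tokens (illustrative only).
import torch
import torch.nn as nn

class StyleRetriever(nn.Module):
    def __init__(self, dim=256, num_style_tokens=8, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_style_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (batch, seq, dim)
        q = self.queries.expand(x.size(0), -1, -1)
        # Attention pools over positions, so the output is invariant to
        # permutations of the input sequence (absent positional encodings).
        style, _ = self.attn(q, x, x)
        return style                                # (batch, num_style_tokens, dim)

class ContentQuantizer(nn.Module):
    def __init__(self, dim=256, codebook_size=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                           # x: (batch, seq, dim)
        codes = self.codebook.weight                # (codebook_size, dim)
        dists = torch.cdist(x, codes.expand(x.size(0), -1, -1))
        idx = dists.argmin(dim=-1)                  # discrete content token ids
        return self.codebook(idx), idx              # quantized features, ids

x = torch.randn(2, 100, 256)
style = StyleRetriever()(x)
content, tokens = ContentQuantizer()(x)
print(style.shape, content.shape, tokens.shape)
```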