Researchers at the ZJU-UIUC Institute developed SDiT, a Spiking Diffusion Model with a Transformer backbone, which substantially advances the quality and efficiency of SNN-based image generation. The model achieves significantly improved FID scores on datasets like MNIST and Fashion-MNIST while demonstrating reduced computational cost compared to ANN counterparts.
Researchers at Zhejiang University developed Tram, a token-level retrieval-augmented mechanism that guides the decoder's generation of source code summaries by incorporating external knowledge. This approach established new state-of-the-art BLEU scores on four public benchmarks, including a 1.39-point improvement on Java and 1.53 points on Python, while also enhancing the generation of low-frequency, domain-specific terms.
This paper presents a comprehensive conceptual framework for addressing privacy, security, and trustworthiness challenges in Distributed Wireless Large AI Models (WLAM). It synthesizes existing knowledge and proposes layered protection strategies and integrated approaches across various technical and ethical dimensions to enable robust and responsible deployment.
LightMem introduces a lightweight, three-stage memory system for LLM agents, inspired by human cognition, to manage long-term interactions efficiently. This system achieves state-of-the-art accuracy on conversational benchmarks while significantly reducing computational costs, including up to a 38x reduction in total token usage and a 55x reduction in API calls across various LLM backbones.
Researchers from Shanghai AI Lab and Zhejiang University developed OmniWorld, a multi-domain, multi-modal dataset for 4D world modeling, offering over 300 million frames with comprehensive annotations including RGB, depth, and camera poses. The dataset combines a novel synthetic game environment with curated real-world data. Its accompanying benchmark shows that current 3D geometric and camera-controlled video generation models struggle in complex dynamic scenarios, while fine-tuning on OmniWorld consistently improves their performance across tasks.
M3-Agent, developed by ByteDance Seed, introduces a multimodal AI agent framework that processes continuous video and audio streams, constructs an entity-centric long-term memory, and performs multi-turn reasoning. The system achieves 30.7% accuracy on the M3-Bench-robot dataset and 48.9% on M3-Bench-web, demonstrating enhanced person understanding and cross-modal reasoning capabilities.
Researchers from Beijing University of Posts and Telecommunications, Westlake University, and Zhejiang University, along with the OpenHelix Team, introduce VLA-Adapter, an efficient method for bridging vision-language representations to robotic actions. The approach achieves state-of-the-art performance with a tiny 0.5B-parameter backbone and no robotic data pre-training, reaching a 97.3% average success rate on the LIBERO benchmark with 3x faster inference (219.2 Hz) than comparable methods.
Researchers from DAMO Academy, Hupan Lab, and Zhejiang University developed WorldVLA, an autoregressive framework that unifies robot action generation and environmental state forecasting. This approach integrates vision, language, and action modeling with a world model, yielding superior performance on robotic manipulation tasks and improving both action execution and future state prediction.
A comprehensive synthesis of Large Language Models for automated software development covers the entire model lifecycle, from data curation to autonomous agents, and offers practical guidance derived from empirical experiments on pre-training, fine-tuning, and reinforcement learning, alongside a detailed analysis of challenges and future directions.
Researchers from Zhejiang University and ByteDance introduced CodeVision, a "code-as-tool" framework that equips Multimodal Large Language Models (MLLMs) to programmatically interact with images. The approach significantly improves MLLM robustness by correcting common image corruptions and enables state-of-the-art multi-tool reasoning through emergent tool use and error recovery.
Research establishes a theoretical link between Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) by reinterpreting GRPO as a contrastive learning objective. This insight leads to "2-GRPO," a variant that achieves comparable mathematical reasoning performance to standard GRPO while reducing training time by over 70% and requiring only 1/8 of the rollouts.
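The intuition behind the GRPO-DPO link can be sketched with the standard group-relative advantage computation (a minimal illustration, not the paper's code; whether population or sample standard deviation is used varies by implementation, and population std is assumed here):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage: z-score each rollout's reward within its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)  # population std; some implementations use sample std
    if sigma == 0:
        return [0.0] * len(rewards)     # all rollouts tied: no learning signal
    return [(r - mu) / sigma for r in rewards]

# A standard group (here 8 rollouts with binary correctness rewards): each
# advantage is scaled by how surprising the outcome is given the group's
# success rate.
print(grpo_advantages([1, 0, 1, 0, 0, 1, 0, 0]))

# With only 2 rollouts ("2-GRPO"), any reward gap collapses to a single
# preferred/dispreferred pair with advantages +1/-1, structurally the same
# contrastive signal that DPO extracts from a preference pair.
print(grpo_advantages([1, 0]))  # -> [1.0, -1.0]
```

With a group of two, the normalization can only output +1/-1 (or zero on a tie), so each update pushes the policy toward one rollout and away from the other, which is why far fewer rollouts per prompt suffice.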
URSA presents a uniform discrete diffusion framework that incorporates a metric probability path for video generation, enabling iterative global refinement in discrete token space. This framework achieves performance competitive with state-of-the-art continuous diffusion models across text-to-video, image-to-video, and text-to-image benchmarks, while enhancing scalability and multi-task capabilities.
The ReSearch framework enables Large Language Models to integrate multi-step reasoning with external search, learning interactively via reinforcement learning without supervised intermediate steps. It yields substantial performance gains on complex multi-hop question answering benchmarks and reveals emergent self-correction capabilities.
MemOS, a memory operating system for AI systems, redefines memory as a first-class system resource to address current Large Language Model limitations in long-context reasoning, continuous personalization, and knowledge evolution. This framework unifies heterogeneous memory types (plaintext, activation, parameter) using a standardized MemCube unit, achieving superior performance on benchmarks like LoCoMo and PreFEval, and demonstrating robust, low-latency memory operations.