MiroMind
DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving

DriveDPO introduces a two-stage policy learning framework for end-to-end autonomous driving, integrating unified policy distillation with a novel Safety Direct Preference Optimization (DPO). This approach yields a new state-of-the-art PDMS of 90.0 on the NAVSIM benchmark, outperforming DiffusionDrive by 1.9 points and WOTE by 2.0 points.
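DPO-style objectives of this kind score a preferred ("chosen") trajectory against a rejected one under the current policy and a frozen reference policy. Below is a minimal, generic DPO loss sketch in PyTorch; the safety-based scoring DriveDPO uses to pick chosen and rejected trajectories is not shown, and all names are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss over log-probabilities of chosen vs. rejected trajectories.

    Each argument is a tensor of summed log-probabilities (one value per pair).
    `beta` scales the implicit reward margin; 0.1 is a common default, not DriveDPO's setting.
    """
    # Implicit rewards are log-probability ratios against the frozen reference policy.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the margin between chosen and rejected via a logistic loss.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```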

100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models

This survey by MiroMind and affiliated universities synthesizes open-source replication studies of DeepSeek-R1, detailing methodologies for reasoning language models (RLMs) through supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR). It highlights the critical role of high-quality data and adapted RL algorithms in achieving strong reasoning capabilities, and identifies emerging research directions and challenges.

Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective

A novel prompt design paradigm demonstrates that pruning in-context learning examples into seemingly incoherent "gibberish" can consistently improve large language model performance across various tasks, challenging conventional prompt engineering wisdom. The PROMPTQUINE evolutionary search framework effectively discovers these unconventional prompts, providing insights into LLM behavior and highlighting vulnerabilities in current AI alignment techniques.
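PROMPTQUINE is described as an evolutionary search that prunes in-context examples; the sketch below shows one plausible hill-climbing variant of such token-pruning search, with a hypothetical `score_prompt` evaluator standing in for task accuracy. It is illustrative only, not the paper's algorithm.

```python
import random

def evolve_prompt(tokens, score_prompt, generations=50, population=8):
    """Evolutionary pruning of an in-context prompt (illustrative sketch).

    `tokens` is the prompt split into tokens; `score_prompt` maps a token list
    to a task score (e.g., validation accuracy) and is an assumed callable.
    """
    best = tokens
    best_score = score_prompt(best)
    for _ in range(generations):
        # Mutate: each candidate drops a small random subset of the surviving tokens.
        candidates = []
        for _ in range(population):
            keep = [t for t in best if random.random() > 0.1]
            candidates.append(keep if keep else best)
        # Select the fittest candidate; keep it only if it does not hurt the score.
        scored = [(score_prompt(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda x: x[0])
        if top_score >= best_score:
            best, best_score = top, top_score
    return best, best_score
```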

Multi-band Frequency Reconstruction for Neural Psychoacoustic Coding

A neural psychoacoustic coding framework called MUFFIN enables high-fidelity audio compression through multi-band spectral quantization and modified snake activation functions, achieving state-of-the-art reconstruction quality at a 12.5 Hz compression rate while preserving critical acoustic details across speech, music, and environmental sound domains.
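The snake activation referenced here is commonly defined as x + (1/α)·sin²(αx); a minimal implementation of that standard form is sketched below. The "modified" variant MUFFIN uses is not specified in this summary, so the code shows only the common baseline.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Standard snake activation: x + (1/alpha) * sin^2(alpha * x).

    `alpha` is a learnable per-channel frequency; MUFFIN's modification on top
    of this baseline is not reproduced here.
    """
    def __init__(self, channels, alpha_init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1, channels, 1), alpha_init))

    def forward(self, x):
        # x is expected to have shape (batch, channels, time).
        return x + (1.0 / (self.alpha + 1e-9)) * torch.sin(self.alpha * x) ** 2
```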

MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback

This research introduces experiment-guided hypothesis ranking for scientific discovery, developing a high-fidelity simulator (CSX-Sim) and an in-context reinforcement learning (ICRL) framework with the CSX-Rank agent. This system significantly reduces the number of experimental trials needed to identify optimal hypotheses, requiring an average of 15.196 trials versus 28.608 for pre-experiment ranking on the TOMATO-chem dataset, thus enabling more efficient and cost-effective scientific exploration.
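Experiment-guided ranking of this kind alternates between proposing a hypothesis to test and re-ranking the remaining candidates in light of simulated feedback. The loop below is a simplified greedy sketch with hypothetical `simulator` and `rank_hypotheses` callables; it is not the CSX-Rank agent itself.

```python
def experiment_guided_search(hypotheses, rank_hypotheses, simulator, max_trials=30, target=1.0):
    """Greedy experiment-guided hypothesis search (illustrative sketch).

    `rank_hypotheses(history)` returns hypotheses ordered by predicted promise given
    past (hypothesis, feedback) pairs; `simulator(h)` returns a scalar feedback score.
    Both are assumed interfaces, not the paper's API.
    """
    history = []
    tried = set()
    for trial in range(1, max_trials + 1):
        # Re-rank untested hypotheses using the accumulated experimental feedback.
        ranked = [h for h in rank_hypotheses(history) if h not in tried]
        if not ranked:
            break
        candidate = ranked[0]
        feedback = simulator(candidate)
        history.append((candidate, feedback))
        tried.add(candidate)
        if feedback >= target:  # stop once an optimal hypothesis is found
            return candidate, trial
    if not history:
        return None, 0
    return max(history, key=lambda x: x[1])[0], len(history)
```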

On the Role of Difficult Prompts in Self-Play Preference Optimization
Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically involves a language model generating on-policy responses to prompts and a reward model (RM) guiding the selection of chosen and rejected responses, on which the model can be further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite being a core component of this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We first use the mean reward of N sampled responses to a prompt as a proxy for its difficulty. We find that difficult prompts yield substantially worse self-play optimization performance than easy prompts. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. We also observe that the performance gap between difficult and easy prompts narrows as model capacity increases, suggesting that prompt difficulty interacts with model capacity. Building on these findings, we explore strategies to mitigate the negative effect of difficult prompts on final performance. We demonstrate that selectively removing an appropriate portion of challenging prompts enhances overall self-play performance, and we also report failed attempts and lessons learned.
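As a concrete reading of the difficulty proxy: sample N on-policy responses per prompt, score them with the reward model, and treat a low mean reward as high difficulty; the hardest prompts can then be removed before DPO training. The helper names and the keep ratio below are hypothetical, not the paper's code.

```python
def prompt_difficulty(prompt, generate, reward_model, n=8):
    """Mean reward of n sampled responses, used as an (inverse) difficulty proxy.

    `generate(prompt, n)` returns n on-policy responses and `reward_model(prompt, r)`
    returns a scalar reward; both are assumed interfaces.
    """
    responses = generate(prompt, n)
    rewards = [reward_model(prompt, r) for r in responses]
    return sum(rewards) / len(rewards)

def filter_hard_prompts(prompts, generate, reward_model, keep_ratio=0.8, n=8):
    """Drop the hardest (lowest mean-reward) prompts before self-play DPO training."""
    ranked = sorted(prompts,
                    key=lambda p: prompt_difficulty(p, generate, reward_model, n),
                    reverse=True)  # highest mean reward (easiest) first
    return ranked[: int(len(ranked) * keep_ratio)]
```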
NTU Speechlab LLM-Based Multilingual ASR System for Interspeech MLC-SLM Challenge 2025
This report details the NTU Speechlab system developed for the Interspeech 2025 Multilingual Conversational Speech and Language Model (MLC-SLM) Challenge (Task I), where we achieved 5th place. We present comprehensive analyses of our multilingual automatic speech recognition system, highlighting key advancements in model architecture, data selection, and training strategies. In particular, language-specific prompts and model averaging techniques were instrumental in boosting system performance across diverse languages. Compared to the initial baseline system, our final model reduced the average Mix Error Rate from 20.2% to 10.6%, representing an absolute improvement of 9.6% (a relative improvement of 48%) on the evaluation set. Our results demonstrate the effectiveness of our approach and offer practical insights for future Speech Large Language Models.
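The model averaging mentioned here typically means averaging the parameters of several training checkpoints before evaluation; the PyTorch snippet below shows that standard recipe with hypothetical checkpoint paths, not the exact Speechlab pipeline.

```python
import torch

def average_checkpoints(paths):
    """Average model parameters across saved checkpoints (standard recipe).

    Assumes each file stores a plain state dict of floating-point tensors.
    """
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# Hypothetical usage: average the last three epoch checkpoints before evaluation.
# model.load_state_dict(average_checkpoints(["ep8.pt", "ep9.pt", "ep10.pt"]))
```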
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess instruction-following ability in audio LLMs. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow instructions involving audio input. The dataset is released publicly to support future research in this emerging area.
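Benchmarks of this kind typically score model outputs with simple rule-based validators per dimension. Below is a hedged sketch of what checks for the Capitalization, Length, and List Structure dimensions could look like; the rules are illustrative guesses, not the released IFEval-Audio scoring code.

```python
import re

def check_capitalization(output: str) -> bool:
    """Example rule: every word fully uppercase (one plausible capitalization constraint)."""
    words = output.split()
    return bool(words) and all(w.upper() == w for w in words)

def check_length(output: str, max_words: int = 50) -> bool:
    """Example rule: response stays within a word budget."""
    return len(output.split()) <= max_words

def check_list_structure(output: str, n_items: int = 3) -> bool:
    """Example rule: response is a numbered list with the requested number of items."""
    items = re.findall(r"^\s*\d+[\.\)]", output, flags=re.MULTILINE)
    return len(items) == n_items
```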
100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models
15 May 2025
The recent development of reasoning language models (RLMs) represents a novel evolution of large language models. In particular, the recent release of DeepSeek-R1 has generated widespread social impact and sparked enthusiasm in the research community for exploring the explicit reasoning paradigm of language models. However, the implementation details of the released models have not been fully open-sourced by DeepSeek, including DeepSeek-R1-Zero, DeepSeek-R1, and the distilled small models. As a result, many replication studies have emerged that aim to reproduce the strong performance of DeepSeek-R1, reaching comparable results through similar training procedures and fully open-source data resources. These works have investigated feasible strategies for supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), focusing on data preparation and method design, and have yielded a variety of valuable insights. In this report, we summarize recent replication studies, focusing primarily on SFT and RLVR as the two main directions and introducing the details of data construction, method design, and training procedures. Moreover, we distill key findings from the implementation details and experimental results reported by these studies. We also discuss additional techniques for enhancing RLMs, highlighting the potential to expand the application scope of these models and the challenges that remain in their development. With this survey, we aim to help researchers and developers of RLMs stay updated with the latest advancements and to inspire new ideas that further enhance RLMs.
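RLVR, as surveyed here, replaces a learned reward model with a programmatic check of the final answer. The snippet below sketches a typical math-style verifiable reward (exact match of an extracted final answer), which is one common instantiation rather than any specific replication's code; the "Answer:" output format is an assumption.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward for RLVR-style training on math problems (illustrative).

    Extracts the final answer after an 'Answer:' marker (an assumed output format)
    and compares it to the ground truth; returns 1.0 for a match, 0.0 otherwise.
    """
    match = re.search(r"Answer:\s*(.+)", response)
    if match is None:
        return 0.0
    prediction = match.group(1).strip()
    return 1.0 if prediction == ground_truth.strip() else 0.0
```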