Institute of Science Tokyo
We present LatentCSI, a novel method for generating images of the physical environment from WiFi CSI measurements that leverages a pretrained latent diffusion model (LDM). Unlike prior approaches that rely on complex and computationally intensive techniques such as GANs, our method employs a lightweight neural network to map CSI amplitudes directly into the latent space of an LDM. We then apply the LDM's denoising diffusion model to the latent representation with text-based guidance before decoding using the LDM's pretrained decoder to obtain a high-resolution image. This design bypasses the challenges of pixel-space image generation and avoids the explicit image encoding stage typically required in conventional image-to-image pipelines, enabling efficient and high-quality image synthesis. We validate our approach on two datasets: a wide-band CSI dataset we collected with off-the-shelf WiFi devices and cameras; and a subset of the publicly available MM-Fi dataset. The results demonstrate that LatentCSI outperforms baselines of comparable complexity trained directly on ground-truth images in both computational efficiency and perceptual quality, while additionally providing practical advantages through its unique capacity for text-guided controllability.
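A minimal sketch of how such a pipeline might be wired on top of a pretrained Stable-Diffusion-style LDM (e.g. loaded via Hugging Face diffusers); the CSI encoder architecture, its dimensions, and the strength schedule are our assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class CSIEncoder(nn.Module):
    """Lightweight MLP mapping CSI amplitudes to a 4x64x64 SD latent
    (sizes illustrative, not the paper's exact network)."""
    def __init__(self, csi_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(csi_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, 4 * 64 * 64))

    def forward(self, csi_amp):
        return self.net(csi_amp).view(-1, 4, 64, 64)

@torch.no_grad()
def csi_to_image(csi_amp, encoder, unet, vae, scheduler, text_emb,
                 steps=50, strength=0.6):
    z = encoder(csi_amp)                    # CSI -> latent; no image encoder
    scheduler.set_timesteps(steps)
    t0 = steps - int(strength * steps)      # img2img-style partial chain
    timesteps = scheduler.timesteps[t0:]
    z = scheduler.add_noise(z, torch.randn_like(z), timesteps[:1])
    for t in timesteps:                     # text-guided denoising
        eps = unet(z, t, encoder_hidden_states=text_emb).sample
        z = scheduler.step(eps, t, z).prev_sample
    return vae.decode(z / vae.config.scaling_factor).sample
```

Here `unet`, `vae`, and `scheduler` would come from a pretrained LDM checkpoint and `text_emb` from its text encoder; the key point is that the CSI-predicted latent replaces the VAE-encoded image an ordinary image-to-image pipeline would start from.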
By reformulating multi-head attention to reveal an intrinsic FFN-like structure, UMoE introduces a unified Mixture-of-Experts architecture that integrates shared experts across both attention and FFN layers. This approach consistently improves language modeling perplexity and zero-shot performance across various tasks, while enhancing parameter efficiency in large language models.
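An illustrative sketch of the core idea as we read it: softmax token mixing happens first, and the subsequent FFN-like transform on the mixed tokens is the slot a shared, routed expert can fill. The exact parameterization below is our assumption, not UMoE's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertFFN(nn.Module):
    """A two-layer FFN 'expert' reusable by attention and FFN layers."""
    def __init__(self, d, hidden):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d, hidden), nn.Linear(hidden, d)
    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

class UMoEAttentionSketch(nn.Module):
    def __init__(self, d, hidden, n_experts, top_k=2):
        super().__init__()
        self.q, self.k = nn.Linear(d, d), nn.Linear(d, d)
        self.experts = nn.ModuleList(SharedExpertFFN(d, hidden)
                                     for _ in range(n_experts))
        self.router = nn.Linear(d, n_experts)
        self.top_k = top_k

    def forward(self, x):                                  # x: (B, S, d)
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(-2, -1)
                             / x.shape[-1] ** 0.5, dim=-1)
        mixed = attn @ x                                   # token mixing first
        gate = torch.softmax(self.router(mixed), dim=-1)
        val, idx = gate.topk(self.top_k, dim=-1)           # top-k routing
        out = torch.zeros_like(mixed)
        for j in range(self.top_k):                        # dispatch to experts
            for e, expert in enumerate(self.experts):
                mask = idx[..., j] == e
                if mask.any():
                    out[mask] += val[..., j][mask].unsqueeze(-1) * expert(mixed[mask])
        return out
```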
TrustJudge introduces a probabilistic framework to systematically mitigate two fundamental inconsistencies—score-comparison and pairwise transitivity—within LLM-as-a-judge evaluation. The method significantly reduces conflict ratios and non-transitivity rates by employing distribution-sensitive scoring and likelihood-aware aggregation, while maintaining or enhancing evaluation accuracy across various large language models and tasks.
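A toy illustration of distribution-sensitive scoring (our simplification of the idea, not TrustJudge's full aggregation): weight each rating by the judge's probability mass over it instead of taking the single most likely rating token.

```python
import math

def expected_score(rating_logprobs: dict) -> float:
    """`rating_logprobs` maps rating strings (e.g. "1".."5") to
    log-probabilities from the judge LLM; obtaining them, e.g. via an
    API's top-logprobs, is assumed."""
    probs = {r: math.exp(lp) for r, lp in rating_logprobs.items()}
    z = sum(probs.values())                 # renormalize over rating tokens
    return sum(int(r) * p / z for r, p in probs.items())

# A judge split 40/60 between ratings 3 and 4 yields 3.6, not a hard 4,
# so two such responses no longer collapse into a spurious tie or flip.
print(expected_score({"3": math.log(0.4), "4": math.log(0.6)}))  # -> 3.6
```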
AdaBlock-dLLM introduces a training-free scheduler that dynamically adjusts block sizes in diffusion-based Large Language Models (dLLMs) during inference. This approach improves generation accuracy by up to 5.3% while maintaining or enhancing throughput, particularly when integrated with Key-Value caching.
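A toy scheduler in the spirit of that description: commit a larger semi-autoregressive block when the denoiser is confident about a long run of upcoming tokens, a smaller one otherwise. The confidence signal, threshold, and bounds are our assumptions, not the paper's exact rule:

```python
import torch

def adaptive_block_size(token_confidence: torch.Tensor,
                        lo: int = 4, hi: int = 32, tau: float = 0.9) -> int:
    """Length of the leading confident run, clamped to [lo, hi]."""
    confident = (token_confidence >= tau).int().cumprod(dim=0)
    run = int(confident.sum().item())
    return max(lo, min(hi, run))

# Example: the first 6 positions clear the threshold, so the next block
# spans 6 tokens instead of a fixed size.
conf = torch.tensor([0.99, 0.97, 0.95, 0.93, 0.96, 0.92, 0.40, 0.80])
print(adaptive_block_size(conf))  # -> 6
```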
Researchers at Institute of Science Tokyo and AIST introduced a "transform-and-retain" paradigm for LLM pre-training data, actively rewriting existing corpora with LLMs to enhance quality. This approach led to a 17.0 pass@1 point increase on HumanEval for code and a 12.4 accuracy point increase on GSM8K for math in continual pre-training of Llama-3.1-8B.
Researchers at Sakana AI developed Transformer-Squared, a framework enabling large language models to self-adapt dynamically to diverse tasks in real-time. It leverages Singular Value Fine-tuning (SVF) to create highly efficient, composable "expert" vectors and employs a two-pass inference mechanism, demonstrating performance gains over LoRA and the ability to transfer experts across different base models.
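A minimal sketch of the SVF idea: freeze the SVD factors of a pretrained weight and learn only a per-singular-value scale vector, so each "expert" is a single vector. Anything beyond W' = U diag(z * S) V^T below is our scaffolding:

```python
import torch
import torch.nn as nn

class SVFLinear(nn.Module):
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)        # frozen factors
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.z = nn.Parameter(torch.ones_like(S))  # the expert vector

    def forward(self, x):
        w = self.U @ torch.diag(self.z * self.S) @ self.Vh
        return x @ w.T

layer = SVFLinear(torch.randn(64, 128))
print(layer(torch.randn(2, 128)).shape)             # torch.Size([2, 64])
print(sum(p.numel() for p in layer.parameters()))   # 64 trainable scales
```

Because an expert is just `z`, experts are cheap to store, composable by interpolation, and plausibly transferable wherever the frozen factors are compatible.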
Kawamura et al. introduce PowerCLIP, a pre-training framework that aligns combinations of image regions with structured textual phrases to enhance compositional understanding in vision-language models. It achieves state-of-the-art performance, including a 7.1% average Top-1 accuracy gain over CLIP on zero-shot classification and a 4.3% average Recall@1 gain on image-text retrieval benchmarks.
FreeRet is a training-free framework that transforms any off-the-shelf Multimodal Large Language Model (MLLM) into a competitive two-stage retriever, achieving state-of-the-art performance on multimodal benchmarks without requiring additional training or data. The approach demonstrates that MLLMs can efficiently serve both as embedders for candidate search and as precise rerankers.
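A skeleton of the two-stage flow as the summary describes it; how the MLLM yields embeddings and the `rerank_fn(query, candidate) -> float` judgment call are both stand-ins:

```python
import numpy as np

def two_stage_retrieve(query, query_emb, cand_embs, candidates,
                       rerank_fn, k=10):
    """Stage 1: coarse cosine-similarity search over MLLM embeddings.
    Stage 2: precise reranking of the shortlist by the same MLLM."""
    sims = cand_embs @ query_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    shortlist = np.argsort(-sims)[:k]                  # stage 1: recall
    scored = [(int(i), rerank_fn(query, candidates[i])) for i in shortlist]
    scored.sort(key=lambda t: -t[1])                   # stage 2: precision
    return [i for i, _ in scored]
```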
This study proposes a new approach to quantum state recovery following measurement. Specifically, we introduce a special operation that transfers the probability amplitude of the quantum state into its orthogonal complement. This operation is followed by a measurement performed on that orthogonal subspace, enabling the undisturbed original quantum state to be regained. Remarkably, the recovery is achieved without the post-measurement operation depending on the measurement outcome, thus allowing recovery without historical dependence. This constitutes a highly nontrivial phenomenon. From an operational perspective, since the no-cloning theorem forbids perfect and probabilistic cloning of arbitrary quantum states, and traditional post-measurement reversal methods typically rely on operations contingent on the measurement outcome, this result questions fundamental assumptions about the necessity of historical dependence. From an informational perspective, since this recovery method erases the information about the measurement outcome, it is intriguing that the information can be erased without accessing the outcome itself. These results establish the operational and informational non-triviality of the scheme, formulated in a direct-sum Hilbert space framework.
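A toy picture of the direct-sum setting, in our own notation and subject to suitable normalization (not necessarily the paper's construction):

```latex
% The system lives in H = H_0 \oplus H_1; an operation T moves amplitude
% into the orthogonal complement H_1 before measuring there.
\[
  \mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1, \qquad
  T\bigl(\lvert\psi\rangle \oplus 0\bigr)
    = \sqrt{1-\epsilon}\,\lvert\psi\rangle \oplus \sqrt{\epsilon}\,\lvert\psi\rangle .
\]
```

If the subsequent measurement has Kraus operators of the form $I_{\mathcal{H}_0} \oplus M_i$, i.e. it acts only on $\mathcal{H}_1$, then the $\mathcal{H}_0$ component stays proportional to $\lvert\psi\rangle$ for every outcome $i$, so a single outcome-independent map back onto $\mathcal{H}_0$ suffices to recover the state.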
FlashGMM presents a redesigned entropy coding algorithm for learned image compression that resolves the computational bottleneck of Gaussian Mixture Models (GMMs). This approach eliminates the need for CDF lookup tables, achieving up to a 90x speedup over prior GMM implementations while slightly improving rate-distortion performance by 0.26% BD-Rate.
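A sketch of the underlying idea: evaluate the mixture CDF on the fly, which is exactly the quantity a range coder needs, instead of materializing per-symbol lookup tables. FlashGMM's exact numerics and quantization are not reproduced here:

```python
import math

def gmm_cdf(x, weights, means, scales):
    """CDF of a Gaussian mixture at x, via the error function."""
    return sum(w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
               for w, m, s in zip(weights, means, scales))

# A range coder encodes integer symbol y with the interval
# [gmm_cdf(y - 0.5, ...), gmm_cdf(y + 0.5, ...)).
lo = gmm_cdf(-0.5, [0.6, 0.4], [0.0, 3.0], [1.0, 2.0])
hi = gmm_cdf(+0.5, [0.6, 0.4], [0.0, 3.0], [1.0, 2.0])
print(hi - lo)  # probability mass assigned to symbol 0
```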
MixtureVitae introduces an open, web-scale pretraining dataset that minimizes legal and ethical risks by using permissive-first text sources, augmented with high-quality instruction and reasoning data. Models trained on this corpus achieve performance competitive with those trained on non-permissive data, and demonstrate an order-of-magnitude improvement in math and coding abilities over other permissive datasets.
Preconditioning is widely used in machine learning to accelerate convergence on the empirical risk, yet its role on the expected risk remains underexplored. In this work, we investigate how preconditioning affects feature learning and generalization performance. We first show that the input information available to the model is conveyed solely through the Gram matrix defined by the preconditioner's metric, thereby inducing a controllable spectral bias on feature learning. Concretely, instantiating the preconditioner as the p-th power of the input covariance matrix within a single-index teacher model, we prove that the exponent p and the alignment between the teacher and the input spectrum are crucial factors for generalization. We further investigate how the interplay between these factors influences feature learning from three complementary perspectives: (i) robustness to noise, (ii) out-of-distribution generalization, and (iii) forward knowledge transfer. Our results indicate that the learned feature representations closely mirror the spectral bias introduced by the preconditioner -- favoring components that are emphasized and exhibiting reduced sensitivity to those that are suppressed. Crucially, we demonstrate that generalization is significantly enhanced when this spectral bias is aligned with that of the teacher.
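A minimal numerical sketch of the setup: gradient descent preconditioned by the p-th power of the input covariance in a single-index teacher problem. The tanh link, learning rate, and p value are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))              # inputs
w_star = rng.normal(size=32)                 # single-index teacher direction
y = np.tanh(X @ w_star)                      # teacher outputs (link assumed)

Sigma = X.T @ X / len(X)                     # input covariance
eigval, eigvec = np.linalg.eigh(Sigma)
p = -0.5                                     # illustrative exponent
P = eigvec @ np.diag(eigval ** p) @ eigvec.T # preconditioner Sigma^p

w, lr = np.zeros(32), 0.1
for _ in range(200):
    t = np.tanh(X @ w)
    grad = X.T @ ((1 - t ** 2) * (t - y)) / len(X)
    w -= lr * P @ grad                       # step in the Sigma^p metric
print(np.corrcoef(np.tanh(X @ w), y)[0, 1])  # fit to the teacher
```

Varying p reweights which covariance directions the model learns first, which is the spectral bias the abstract refers to.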
Researchers from Shanghai AI Laboratory, Institute of Science Tokyo, and Nanjing University developed EXPVID, the first benchmark for scientific experiment video understanding and reasoning, leveraging JoVE videos and peer-reviewed papers. The benchmark evaluates Multimodal Large Language Models (MLLMs) across perception, procedural understanding, and scientific reasoning tasks, revealing that proprietary models like GPT-5 and Gemini-2.5 significantly outperform open-source counterparts in complex scientific contexts.
The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.
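A sketch of the weight-construction step as we read it: every expert starts from the dense FFN weight, but a random fraction of intermediate units is statistically re-initialized per expert to break symmetry. The row-wise choice and the init scale below are assumptions:

```python
import torch

def drop_upcycle(dense_ffn_weight: torch.Tensor, n_experts: int,
                 drop_ratio: float = 0.5):
    """Return n_experts copies of the dense weight, each with a random
    drop_ratio fraction of rows re-sampled at the original weight scale."""
    experts, std = [], dense_ffn_weight.std().item()
    for _ in range(n_experts):
        w = dense_ffn_weight.clone()
        n_drop = int(drop_ratio * w.shape[0])
        rows = torch.randperm(w.shape[0])[:n_drop]   # units to re-initialize
        w[rows] = torch.randn(n_drop, w.shape[1]) * std
        experts.append(w)
    return experts

experts = drop_upcycle(torch.randn(1024, 256), n_experts=8)
print(len(experts), experts[0].shape)  # 8 torch.Size([1024, 256])
```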
Observations of microlensed gravitational waves (GWs) emitted by compact binary coalescences (CBCs) are essential for studying the mass density distribution in the universe, including black holes and dark matter halos. However, no confident detection of microlensed GWs has been reported to date. There are two important challenges in the identification of microlensed GWs. The first is that the source waveform and lens structure models are not known a priori. The second is that certain classes of unlensed GWs can mimic microlensed GWs, resulting in undesirable false alarms. In this work, we propose to use the Kramers-Kronig (KK) relation for gravitational lensing systems. We argue that such systems are essentially linear response systems obeying causality, in which the KK relation must hold. The power of this method lies in the fact that microlensed GWs, regardless of the lens structure, must obey the KK relation, while unlensed GW events are not in general expected to obey it. This, in principle, allows us to identify microlensed GWs while dismissing microlensing mimickers. We provide the first important steps towards a methodology that exploits the KK relation, and test its usefulness under idealized conditions.
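For reference, the standard dispersion relations that a causal linear-response transfer function must satisfy take the familiar form below (written for a generic F(f) under suitable analyticity and fall-off conditions; the paper's precise conditions on the lensing amplification factor are not reproduced here):

```latex
% Kramers-Kronig relations; \mathcal{P} denotes the principal value.
\[
  \operatorname{Re} F(f) = \frac{1}{\pi}\,
    \mathcal{P}\!\int_{-\infty}^{\infty}
      \frac{\operatorname{Im} F(f')}{f' - f}\, df',
  \qquad
  \operatorname{Im} F(f) = -\frac{1}{\pi}\,
    \mathcal{P}\!\int_{-\infty}^{\infty}
      \frac{\operatorname{Re} F(f')}{f' - f}\, df'.
\]
```

The real and imaginary parts of the frequency-domain amplification are thus not independent; checking this consistency is what separates genuine lensing from mimickers.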
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization skills and reasoning skills. By training MoE families that vary total parameters, active parameters, and top-k routing under fixed compute budgets, we disentangle pre-training loss from downstream accuracy. Our results reveal two principles. First, Active FLOPs: models with identical training loss but greater active compute achieve higher reasoning accuracy. Second, Total tokens per parameter (TPP): memorization tasks improve with more parameters, while reasoning tasks benefit from optimal TPP, indicating that reasoning is data-hungry. Neither reinforcement learning post-training (GRPO) nor increased test-time compute alters these trends. We therefore argue that optimal MoE sparsity must be determined jointly by active FLOPs and TPP, revising the classical picture of compute-optimal scaling. Our model checkpoints, code and logs are open-source at this https URL.
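Back-of-envelope versions of the two quantities the principles are stated in, using the common 6ND transformer FLOP rule of thumb and reading TPP as training tokens over total parameters (our interpretation of the summary):

```python
def moe_scaling_quantities(total_params, active_params, train_tokens):
    active_flops = 6 * active_params * train_tokens  # compute actually spent
    tpp = train_tokens / total_params                # tokens per parameter
    return active_flops, tpp

# Example: a 5B-active / 60B-total MoE trained on 1T tokens.
flops, tpp = moe_scaling_quantities(60e9, 5e9, 1e12)
print(f"active FLOPs ~ {flops:.2e}, TPP ~ {tpp:.1f}")  # ~3.00e+22, ~16.7
```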
The DP-SynRAG framework generates a differentially private synthetic RAG database using LLMs and a multi-stage process involving private clustering and text generation. This approach enables RAG systems to process an unlimited number of queries under a fixed privacy budget, outperforming previous per-query DP methods in scalability and achieving robust privacy against leakage while maintaining high accuracy for RAG tasks.
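A skeleton of the multi-stage flow as the summary describes it; all three callables are placeholders, and the DP mechanisms and privacy accounting live inside them rather than in this sketch:

```python
def dp_synrag_pipeline(private_docs, dp_cluster, dp_generate, build_index):
    """Cluster private texts under DP, generate synthetic per-cluster
    texts with an LLM, then serve every future query from the synthetic
    index. The privacy budget is spent once, up front."""
    clusters = dp_cluster(private_docs)          # DP clustering (assumed)
    synthetic = [doc for c in clusters for doc in dp_generate(c)]
    return build_index(synthetic)                # queries are now budget-free
```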
Generating physically plausible human motion is crucial for applications such as character animation and virtual reality. Existing approaches often incorporate a simulator-based motion projection layer to the diffusion process to enforce physical plausibility. However, such methods are computationally expensive due to the sequential nature of the simulator, which prevents parallelization. We show that simulator-based motion projection can be interpreted as a form of guidance, either classifier-based or classifier-free, within the diffusion process. Building on this insight, we propose SimDiff, a Simulator-constrained Diffusion Model that integrates environment parameters (e.g., gravity, wind) directly into the denoising process. By conditioning on these parameters, SimDiff generates physically plausible motions efficiently, without repeated simulator calls at inference, and also provides fine-grained control over different physical coefficients. Moreover, SimDiff successfully generalizes to unseen combinations of environmental parameters, demonstrating compositional generalization.
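A sketch of environment-conditioned guidance in the classifier-free style the summary alludes to; the denoiser signature and the null-condition convention are our assumptions:

```python
import torch

@torch.no_grad()
def guided_denoise_step(denoiser, x_t, t, env_params, motion_cond, w=2.0):
    """One guided epsilon prediction: steer the sample toward motions
    consistent with env_params (e.g. gravity, wind) without any
    simulator call at inference time."""
    eps_uncond = denoiser(x_t, t, motion_cond, env=None)      # env dropped
    eps_cond = denoiser(x_t, t, motion_cond, env=env_params)  # env provided
    return eps_uncond + w * (eps_cond - eps_uncond)           # guided epsilon
```

Raising or lowering `w` gives the fine-grained control over physical coefficients the summary mentions, and conditioning on parameter combinations unseen in training is what the compositional-generalization claim refers to.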
Recent progress in quantum computing has enabled systems with tens of reliable logical qubits, built from thousands of noisy physical qubits. However, many impactful applications demand quantum computations with millions of logical qubits, necessitating highly scalable quantum error correction. In classical information theory, low-density parity-check (LDPC) codes can approach channel capacity efficiently. Yet, no quantum error-correcting codes with efficient decoding have been shown to approach the hashing bound - a fundamental limit on quantum capacity - despite decades of research. Here, we present quantum LDPC codes that not only approach the hashing bound but also allow decoding with computational cost linear in the number of physical qubits. This breakthrough paves the way for large-scale, fault-tolerant quantum computation. Combined with emerging hardware that manages many qubits, our approach brings quantum solutions to important real-world problems significantly closer to reality.
Camellia introduces a new benchmark to quantify entity-centric cultural biases in Large Language Models across nine Asian languages and six distinct Asian cultures. The evaluation reveals that current LLMs exhibit a 30-40% preference for Western entities in culturally-grounded contexts, demonstrate varied sentiment associations, and show significant performance disparities (12-20% accuracy gaps) in extracting Asian-associated entities.