Korea Advanced Institute of Science and Technology
We study an optimal control problem encompassing investment, consumption, and retirement decisions under exponential (CARA-type) utility. The financial market comprises a bond with constant drift and a stock following geometric Brownian motion. The agent receives continuous income, consumes over time, and has the option to retire irreversibly, gaining increased leisure post-retirement compared to pre-retirement. The objective is to maximize the expected exponential utility of weighted consumption and leisure over an infinite horizon. Using a martingale approach and dual value function, we derive implicit solutions for the optimal portfolio, consumption, and retirement time. The analysis highlights key contributions: first, the equivalent condition for no retirement is characterized by a specific income threshold; second, the influence of income and leisure levels on optimal portfolio, consumption, and retirement decisions is thoroughly examined. These results provide valuable insights into the interplay between financial and lifestyle choices in retirement planning.
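As a rough illustration only (our notation and weighting, not the paper's exact formulation), the problem can be written as choosing a portfolio, consumption, and an irreversible retirement time to maximize discounted CARA utility of consumption weighted with leisure, where the leisure rate jumps up at retirement:

```latex
% Schematic objective; beta, gamma, theta, l_1, l_2 and the additive weighting are illustrative assumptions.
V \;=\; \sup_{(\pi_t,\, c_t,\, \tau)}
\mathbb{E}\!\left[ \int_{0}^{\infty} e^{-\beta t}\,
\Bigl( -\tfrac{1}{\gamma}\, e^{-\gamma \left( c_t + \theta\, \ell_t \right)} \Bigr)\, dt \right],
\qquad
\ell_t = \ell_1\, \mathbf{1}_{\{t < \tau\}} + \ell_2\, \mathbf{1}_{\{t \ge \tau\}},
\quad \ell_2 > \ell_1 .
```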
Multimodal Magnetic Resonance (MR) Imaging plays a crucial role in disease diagnosis due to its ability to provide complementary information by analyzing the relationships among multimodal images of the same subject. Acquiring all MR modalities, however, can be expensive, and certain MR images may be missed during a scanning session depending on the study protocol. The typical solution is to synthesize the missing modalities from the acquired images, for example with generative adversarial networks (GANs). Yet GANs built on convolutional neural networks (CNNs) tend to lack global relationships and mechanisms to condition on the desired modality. To address this, we propose a transformer-based modality infuser designed to synthesize multimodal brain MR images. In our method, we extract modality-agnostic features with the encoder and then transform them into modality-specific features using the modality infuser. Furthermore, the modality infuser captures long-range relationships among all brain structures, leading to the generation of more realistic images. We carried out experiments on the BraTS 2018 dataset, translating between four MR modalities, and our experimental results demonstrate the superiority of the proposed method in terms of synthesis quality. In addition, we conducted experiments on a brain tumor segmentation task and compared different conditioning methods.
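A minimal sketch of the pipeline described above, assuming a shared CNN encoder that produces modality-agnostic patch tokens, a transformer "infuser" conditioned on a learned target-modality embedding (added to the tokens), and a convolutional decoder; the layer sizes and additive conditioning are our assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ModalityInfuserSketch(nn.Module):
    """Toy encoder -> transformer modality infuser -> decoder pipeline."""
    def __init__(self, n_modalities=4, dim=256, n_layers=4, n_heads=8):
        super().__init__()
        # Modality-agnostic encoder shared across input modalities.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, dim, kernel_size=4, stride=4),  # patchify a 1-channel MR slice
            nn.GELU(),
        )
        # Learned embedding of the *target* modality used as the condition.
        self.modality_emb = nn.Embedding(n_modalities, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        # Transformer "infuser": global self-attention over all patch tokens.
        self.infuser = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.decoder = nn.ConvTranspose2d(dim, 1, kernel_size=4, stride=4)

    def forward(self, x, target_modality):
        # x: (B, 1, H, W) source image; target_modality: (B,) int64 labels
        feat = self.encoder(x)                    # (B, C, H/4, W/4)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # modality-agnostic tokens
        cond = self.modality_emb(target_modality).unsqueeze(1)
        tokens = self.infuser(tokens + cond)      # modality-specific tokens
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(feat)                 # synthesized target-modality image

x = torch.randn(2, 1, 64, 64)
y = ModalityInfuserSketch()(x, torch.tensor([0, 2]))  # e.g. T1 -> FLAIR
print(y.shape)  # torch.Size([2, 1, 64, 64])
```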
Adaptive-RAG introduces a framework that dynamically selects one of three retrieval-augmented LLM strategies based on an input query's predicted complexity. This approach balances accuracy and computational cost, achieving competitive performance while significantly reducing the average number of retrieval and generation steps compared to fixed multi-step methods.
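The routing idea can be sketched as a small classifier that maps each query to one of three strategies; the heuristic classifier, strategy functions, and thresholds below are placeholders, not Adaptive-RAG's actual components:

```python
def classify_complexity(query: str) -> str:
    """Placeholder for Adaptive-RAG's trained complexity classifier.
    Returns one of 'simple', 'moderate', 'complex'."""
    n_entities = sum(tok[0].isupper() for tok in query.split())  # crude proxy
    if n_entities <= 1:
        return "simple"
    return "moderate" if n_entities <= 2 else "complex"

def answer(query: str) -> str:
    strategy = classify_complexity(query)
    if strategy == "simple":
        return llm_only(query)          # no retrieval
    if strategy == "moderate":
        return single_step_rag(query)   # one retrieval round
    return multi_step_rag(query)        # iterative retrieve-and-reason

# Placeholder strategy implementations.
def llm_only(q):        return f"[LLM-only answer to: {q}]"
def single_step_rag(q): return f"[single-retrieval answer to: {q}]"
def multi_step_rag(q):  return f"[multi-step answer to: {q}]"

print(answer("who wrote Hamlet?"))
```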
The Convolutional Block Attention Module (CBAM) enhances Convolutional Neural Networks by sequentially applying channel and spatial attention mechanisms. This lightweight module consistently improves performance across various architectures and tasks, including image classification and object detection, with minimal additional computational cost.
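A condensed version of the module as described (channel attention from average- and max-pooled descriptors passed through a shared MLP, followed by spatial attention from channel-wise average and max maps); the reduction ratio and kernel size follow common defaults:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal CBAM block: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: conv over channel-wise average and max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        ca = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        x = x * ca                                   # channel-refined features
        sa = torch.sigmoid(self.spatial(torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)))
        return x * sa                                # spatially-refined features

feats = torch.randn(2, 64, 32, 32)
print(CBAM(64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```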
The Riemannian Diffusion Language Model (RDLM) introduces a continuous diffusion framework for discrete data that leverages the geometry of categorical distributions on a statistical manifold. This approach achieves state-of-the-art performance for diffusion models across language modeling, pixel-level image generation, and DNA sequence design, with scalable training enabled by a simulation-free approximation method.
Time series analysis provides essential insights into real-world system dynamics and informs downstream decision-making, yet most existing methods overlook the rich contextual signals present in auxiliary modalities. To bridge this gap, we introduce TimeXL, a multi-modal prediction framework that integrates a prototype-based time series encoder with three collaborating Large Language Models (LLMs) to deliver more accurate predictions and interpretable explanations. First, a multi-modal prototype-based encoder processes both time series and textual inputs to generate preliminary forecasts alongside case-based rationales. These outputs then feed into a prediction LLM, which refines the forecasts by reasoning over the encoder's predictions and explanations. Next, a reflection LLM compares the predicted values against the ground truth, identifying textual inconsistencies or noise. Guided by this feedback, a refinement LLM iteratively enhances text quality and triggers encoder retraining. This closed-loop workflow of prediction, critique (reflection), and refinement continuously improves the framework's performance and interpretability. Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to an 8.9% improvement in AUC and produces human-centric, multi-modal explanations, highlighting the power of LLM-driven reasoning for time series prediction.
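A compact sketch of the closed loop described above; every helper below is a hypothetical stand-in (toy arithmetic, not TimeXL's encoder or LLM prompts), shown only to make the prediction, reflection, and refinement roles concrete:

```python
# Hypothetical stand-ins for the encoder and the three LLM roles.
def prototype_encoder(series, text): return sum(series) / len(series), "matched prototype: rising trend"
def prediction_llm(s, t, f, r):      return f * 1.05                       # refine forecast with rationale
def reflection_llm(t, pred, y):      return "noisy text" if abs(pred - y) > 1 else "text ok"
def refinement_llm(t, critique):     return t + f" [refined: {critique}]"
def retrain_encoder(s, t, y):        pass                                  # placeholder retraining step

def timexl_style_loop(series, text, y_true, rounds=3):
    """Schematic prediction -> reflection -> refinement loop, not TimeXL's actual code."""
    for _ in range(rounds):
        forecast, rationale = prototype_encoder(series, text)      # case-based forecast
        pred = prediction_llm(series, text, forecast, rationale)   # LLM-refined forecast
        critique = reflection_llm(text, pred, y_true)              # compare against ground truth
        text = refinement_llm(text, critique)                      # improve textual context
        retrain_encoder(series, text, y_true)                      # closed-loop retraining
    return pred

print(timexl_style_loop([1.0, 2.0, 3.0], "demand-related news", y_true=2.4))
```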
Researchers from NAVER Cloud and KAIST rigorously analyze Peri-LN, a Transformer normalization strategy that normalizes both sub-layer inputs and outputs, demonstrating its capability to maintain stable hidden-state activations and gradient flow, resulting in improved pre-training and downstream task performance for large language models.
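A minimal sketch of a Peri-LN-style sub-layer as summarized above, with normalization applied to both the sub-layer's input and its output before the residual addition; the use of LayerNorm (rather than RMSNorm) and the attention sub-layer are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PeriLNBlock(nn.Module):
    """Sub-layer normalized at both input and output (Peri-LN-style sketch).
    Pre-LN would omit norm_out; Post-LN would normalize only after the residual."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.norm_in = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, x):
        h = self.norm_in(x)                            # normalize the sub-layer input
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + self.norm_out(h)                    # normalize the output, then add residual

x = torch.randn(2, 16, 128)
print(PeriLNBlock(128)(x).shape)  # torch.Size([2, 16, 128])
```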
The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at this https URL.
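The layer-importance idea can be sketched as ablating one Transformer layer at a time and scoring it by the resulting WER increase, then keeping only the highest-scoring layers; the evaluation callback and toy demo below are placeholders, not SPADE's actual interface or index:

```python
def wer_layer_importance(model_layers, evaluate_wer):
    """Score each layer by how much WER degrades when that layer is skipped.
    `evaluate_wer(layers)` is a placeholder for synthesizing speech with the given
    layer stack and measuring word error rate."""
    base = evaluate_wer(model_layers)
    scores = []
    for i in range(len(model_layers)):
        ablated = model_layers[:i] + model_layers[i + 1:]
        scores.append(evaluate_wer(ablated) - base)   # large increase => important layer
    return scores

def prune_least_important(model_layers, evaluate_wer, keep: int):
    scores = wer_layer_importance(model_layers, evaluate_wer)
    keep_idx = sorted(sorted(range(len(scores)), key=lambda i: scores[i])[-keep:])
    return [model_layers[i] for i in keep_idx]        # preserve original layer order

# Toy demo with a fake WER evaluator in which layers L0 and L3 matter most.
layers = ["L0", "L1", "L2", "L3"]
fake_wer = lambda ls: 5.0 + (2.0 if "L0" not in ls else 0) + (1.5 if "L3" not in ls else 0)
print(prune_least_important(layers, fake_wer, keep=2))  # ['L0', 'L3']
```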
Researchers from Shanghai Jiao Tong University, The Chinese University of Hong Kong, KAIST, and Geely developed Semantic-VAE, a Variational Autoencoder framework that incorporates semantic alignment to address the reconstruction-generation dilemma in zero-shot Text-to-Speech. This method achieved a Word Error Rate of 2.10% and speaker similarity of 0.64, outperforming prior mel-based and vanilla VAE approaches on the LibriSpeech-PC test-clean dataset.
Researchers from KAIST developed Diffusion-Link, a diffusion probabilistic model that generatively transforms audio embeddings into text-like distributions to bridge the audio-text modality gap. This lightweight module achieves state-of-the-art performance in zero-shot automatic audio captioning on AudioCaps, improving CIDEr by 52.5% over baselines without requiring external knowledge.
Spread Preference Annotation (SPA) presents an iterative framework that efficiently aligns Large Language Models (LLMs) with human preferences using only a small initial human-annotated dataset. The method achieves this by leveraging the LLM's intrinsic knowledge to generate self-refined preference labels, demonstrating a 21.13% win rate against GPT-4 on AlpacaEval 2.0 using significantly less data than previous approaches.
VoiceDiT is a multi-modal generative model developed at KAIST that synthesizes environment-aware speech and audio from diverse inputs, including text and visual prompts. The system generates high-quality, intelligible speech that seamlessly integrates with specified acoustic environments, outperforming existing models in speech intelligibility and audio realism.
Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well suited to audio-visual LLMs (AV-LLMs), where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model's confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN, demonstrating strong robustness and generalizability. Our code is available at this https URL.
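The contrastive-decoding core can be sketched as combining the original logits with logits from an input whose least dominant modality has been masked, while skipping the contrastive pass when the model is already confident (low entropy). The alpha value, entropy threshold, and masking choice below are illustrative, not the paper's settings:

```python
import torch

def avcd_style_logits(orig_logits, perturbed_logits, alpha=1.0, entropy_thresh=2.0):
    """One contrastive decoding step (sketch): amplify tokens the clean input supports
    and the perturbed (modality-masked) input does not."""
    probs = torch.softmax(orig_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    if entropy.item() < entropy_thresh:        # confident -> skip the contrastive pass
        return orig_logits
    return (1 + alpha) * orig_logits - alpha * perturbed_logits

orig = torch.randn(32000)   # vocabulary-sized logits from the full audio-visual input
pert = torch.randn(32000)   # logits after attentively masking the weakest modality
next_token = avcd_style_logits(orig, pert).argmax()
print(int(next_token))
```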
Researchers from KAIST and ByteDance Seed developed SCORE, a training-free inference-time scaling method that uses standardized composite rewards to enhance Text-to-Audio generation. SCORE improves both perceptual audio quality and text-alignment, achieving a 10.8% improvement in CLAP score and a 10.6% improvement in production quality over naive sampling when using balanced guidance.
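The standardized composite reward idea can be sketched as best-of-N selection: z-score each reward across the N candidate generations and pick the candidate with the highest weighted sum. The reward functions, weights, and toy candidates below are placeholders:

```python
import numpy as np

def score_style_selection(candidates, reward_fns, weights=None):
    """Best-of-N selection with standardized composite rewards (sketch).
    `reward_fns` are placeholder scorers (e.g. audio quality, text alignment)."""
    rewards = np.array([[fn(c) for c in candidates] for fn in reward_fns])     # (R, N)
    z = (rewards - rewards.mean(axis=1, keepdims=True)) / (rewards.std(axis=1, keepdims=True) + 1e-8)
    weights = np.ones(len(reward_fns)) if weights is None else np.asarray(weights)
    composite = weights @ z                     # (N,) standardized composite scores
    return candidates[int(composite.argmax())]

# Toy demo: "candidates" are numbers scored by two dummy rewards on different scales.
cands = [0.2, 0.5, 0.9, 0.4]
best = score_style_selection(cands, [lambda x: 100 * x, lambda x: 1 - abs(x - 0.5)])
print(best)
```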
Researchers from Mila and KAIST developed the Generative Flow Ant Colony Sampler (GFACS), a meta-heuristic that combines GFlowNets with Ant Colony Optimization (ACO) for combinatorial optimization. This approach generates multi-modal prior distributions of solutions and achieves superior solution quality and efficiency, outperforming strong baselines and specialized RL solvers across various benchmarks, often with reduced training time.
The SEED model applies diffusion models directly to speaker embeddings, aiming to enhance the robustness of speaker recognition systems against environmental mismatches. It improves speaker identification accuracy by up to 19.6% in noisy conditions compared to baseline systems, demonstrating an efficient alternative to traditional audio enhancement or complex disentangled representation learning approaches.
Integer linear programming (ILP) is widely utilized for various combinatorial optimization problems. Primal heuristics play a crucial role in quickly finding feasible solutions for NP-hard ILPs. Although end-to-end learning-based primal heuristics (E2EPH) have recently been proposed, they are typically unable to independently generate feasible solutions and mainly focus on binary variables. Ensuring feasibility is critical, especially when handling non-binary integer variables. To address this challenge, we propose RL-SPH, a novel reinforcement learning-based start primal heuristic capable of independently generating feasible solutions, even for ILPs involving non-binary integers. Experimental results demonstrate that RL-SPH rapidly obtains high-quality feasible solutions, achieving on average a 44x lower primal gap and a 2.3x lower primal integral compared to existing primal heuristics.
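For reference, the two reported metrics are commonly defined as follows in the MIP literature (standard definitions, not restated in the abstract): the primal gap of an incumbent solution x̃ relative to the best known solution x*, and the primal integral as that gap integrated over solving time:

```latex
\gamma(\tilde{x}) \;=\;
\begin{cases}
0, & \text{if } c^{\top}\tilde{x} = c^{\top}x^{*} = 0,\\[2pt]
1, & \text{if } c^{\top}\tilde{x}\cdot c^{\top}x^{*} < 0,\\[2pt]
\dfrac{\lvert c^{\top}\tilde{x} - c^{\top}x^{*} \rvert}
      {\max\bigl(\lvert c^{\top}\tilde{x}\rvert,\; \lvert c^{\top}x^{*}\rvert\bigr)}, & \text{otherwise},
\end{cases}
\qquad
P(T) \;=\; \int_{0}^{T} \gamma\bigl(\tilde{x}(t)\bigr)\, dt .
```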
Speech enhancement remains challenging due to the trade-off between efficiency and perceptual quality. In this paper, we introduce MAGE, a Masked Audio Generative Enhancer that advances generative speech enhancement through a compact and robust design. Unlike prior masked generative models with random masking, MAGE employs a scarcity-aware coarse-to-fine masking strategy that prioritizes frequent tokens in early steps and rare tokens in later refinements, improving efficiency and generalization. We also propose a lightweight corrector module that further stabilizes inference by detecting low-confidence predictions and re-masking them for refinement. Built on BigCodec and finetuned from Qwen2.5-0.5B, MAGE is reduced to 200M parameters through selective layer retention. Experiments on DNS Challenge and noisy LibriSpeech show that MAGE achieves state-of-the-art perceptual quality and significantly reduces word error rate for downstream recognition, outperforming larger baselines. Audio examples are available at this https URL.
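The scarcity-aware coarse-to-fine masking and the confidence-based corrector can be sketched as below; the frequency statistics, unmasking order, and confidence threshold are illustrative placeholders, not MAGE's actual implementation:

```python
import numpy as np

def scarcity_aware_unmask_order(token_ids, token_freq):
    """Reveal frequent (coarse) tokens first and rare tokens in later refinement steps."""
    freqs = np.array([token_freq.get(t, 1) for t in token_ids])
    return np.argsort(-freqs)       # positions ordered from most to least frequent token

def corrector_remask(pred_probs, threshold=0.5):
    """Re-mask low-confidence positions so they are predicted again in the next step."""
    conf = pred_probs.max(axis=-1)
    return conf < threshold         # boolean mask of positions to re-predict

# Toy demo over 6 codec-token positions.
tokens = [12, 7, 12, 3, 99, 7]
freq = {12: 500, 7: 300, 3: 40, 99: 2}
print(scarcity_aware_unmask_order(tokens, freq))   # frequent positions come first
probs = np.random.dirichlet(np.ones(8), size=6)    # fake per-position distributions
print(corrector_remask(probs, threshold=0.5))
```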
This survey reviews the progress of diffusion models in generating images from text, i.e., text-to-image diffusion models. As a self-contained work, this survey starts with a brief introduction of how diffusion models work for image synthesis, followed by the background for text-conditioned image synthesis. Based on that, we present an organized review of pioneering methods and their improvements on text-to-image generation. We further summarize applications beyond image generation, such as text-guided generation for various modalities like videos, and text-guided image editing. Beyond the progress made so far, we discuss existing challenges and promising future directions.
In zero-shot skeleton-based action recognition (ZSAR), aligning skeleton features with the text features of action labels is essential for accurately predicting unseen actions. ZSAR faces a fundamental challenge in bridging the modality gap between these two kinds of features, which severely limits generalization to unseen actions. Previous methods focus on direct alignment between the skeleton and text latent spaces, but the modality gap between these spaces hinders robust generalization learning. Motivated by the success of diffusion models in multi-modal alignment (e.g., text-to-image, text-to-video), we present the first diffusion-based skeleton-text alignment framework for ZSAR. Our approach, Triplet Diffusion for Skeleton-Text Matching (TDSM), focuses on the cross-alignment power of diffusion models rather than their generative capability. Specifically, TDSM aligns skeleton features with text prompts by incorporating text features into the reverse diffusion process, where skeleton features are denoised under text guidance, forming a unified skeleton-text latent space for robust matching. To enhance discriminative power, we introduce a triplet diffusion (TD) loss that encourages TDSM to pull correct skeleton-text matches together while pushing apart mismatched pairs from different action classes. TDSM significantly outperforms very recent state-of-the-art methods, with large margins of 2.36 to 13.05 percentage points, demonstrating superior accuracy and scalability in zero-shot settings through effective skeleton-text matching.
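As a schematic only (our notation and margin form, not the paper's exact loss), a triplet-style diffusion objective of the kind described can compare denoising errors under the matched text condition c⁺ and a mismatched condition c⁻ from a different action class, rewarding low error for the match and enforcing a margin m against the mismatch:

```latex
\mathcal{L}_{\mathrm{TD}}
= \bigl\lVert \epsilon - \epsilon_{\theta}(z_t,\, t,\, c^{+}) \bigr\rVert_2^{2}
+ \lambda\, \max\!\Bigl(0,\; m
+ \bigl\lVert \epsilon - \epsilon_{\theta}(z_t,\, t,\, c^{+}) \bigr\rVert_2^{2}
- \bigl\lVert \epsilon - \epsilon_{\theta}(z_t,\, t,\, c^{-}) \bigr\rVert_2^{2}\Bigr)
```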