With rapid advances in generative artificial intelligence, the text-to-music
synthesis task has emerged as a promising direction for music generation.
Nevertheless, achieving precise control over multi-track generation remains an
open challenge. While existing models excel at directly generating a complete
multi-track mix, they struggle to compose individual tracks and integrate them
in a controllable manner. This departure from the typical workflow of
professional composers hinders the refinement of details within specific
tracks. To address this gap, we propose JEN-1 Composer, a
unified framework designed to efficiently model marginal, conditional, and
joint distributions over multi-track music using a single model. Building upon
an audio latent diffusion model, JEN-1 Composer extends it to versatile
multi-track music generation. We introduce a progressive curriculum training
strategy that gradually increases the difficulty of the training tasks while
preserving the model's generalization ability and facilitating smooth transitions
between different scenarios. During inference, users can iteratively generate
and select music tracks, thus incrementally composing entire musical pieces in
accordance with the Human-AI co-composition workflow. Our approach demonstrates
state-of-the-art performance in controllable and high-fidelity multi-track
music synthesis, marking a significant advancement in interactive AI-assisted
music creation. Our demo pages are available at www.jenmusic.ai/research.
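
As a rough illustration of how a single diffusion model might cover marginal, conditional, and joint track distributions, the PyTorch-style sketch below randomly partitions tracks into generation targets and clean conditioning context at each training step, with a curriculum stage controlling how those subsets are sampled. All names (`UnifiedTrackDenoiser`, `sample_task_mask`, `training_step`), the toy noise schedule, and the stage probabilities are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only (hypothetical names, toy noise schedule): one
# diffusion network covers marginal, conditional, and joint track
# distributions by randomly splitting tracks into "generate" and "condition"
# subsets at every training step.
import torch
import torch.nn as nn

N_TRACKS, LATENT_DIM, T_STEPS = 4, 64, 1000  # e.g. bass / drums / instrument / melody


class UnifiedTrackDenoiser(nn.Module):
    """Single denoiser over all track latents; a binary task mask marks which
    tracks are generation targets (1) and which are clean conditions (0)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_TRACKS * LATENT_DIM + N_TRACKS + 1, 256),
            nn.SiLU(),
            nn.Linear(256, N_TRACKS * LATENT_DIM),
        )

    def forward(self, x, task_mask, t):
        # x: (B, N_TRACKS, LATENT_DIM); task_mask: (B, N_TRACKS); t: (B,)
        h = torch.cat([x.flatten(1), task_mask, t[:, None] / T_STEPS], dim=-1)
        return self.net(h).view(-1, N_TRACKS, LATENT_DIM)


def sample_task_mask(batch, stage):
    """Curriculum over task difficulty: stage 0 samples single-track (marginal)
    tasks, stage 1 samples conditional tasks, stage >= 2 also mixes in the
    fully joint case where every track is a target."""
    if stage == 0:
        mask = torch.zeros(batch, N_TRACKS)
        mask[torch.arange(batch), torch.randint(0, N_TRACKS, (batch,))] = 1.0
        return mask
    mask = (torch.rand(batch, N_TRACKS) < 0.5).float()
    if stage >= 2:
        mask[torch.rand(batch) < 0.3] = 1.0  # some samples generate all tracks jointly
    mask[mask.sum(-1) == 0, 0] = 1.0         # always keep at least one target track
    return mask


def training_step(model, clean_latents, stage):
    B = clean_latents.size(0)
    task_mask = sample_task_mask(B, stage)
    t = torch.randint(0, T_STEPS, (B,)).float()
    noise = torch.randn_like(clean_latents)
    alpha = (1.0 - t / T_STEPS).view(B, 1, 1)  # toy linear schedule, not the real one
    noisy = alpha.sqrt() * clean_latents + (1 - alpha).sqrt() * noise
    # Target tracks are noised; conditioning tracks are passed through clean.
    x_in = torch.where(task_mask[..., None].bool(), noisy, clean_latents)
    pred = model(x_in, task_mask, t)
    # Denoising loss only on the tracks the model is asked to generate.
    return ((pred - noise) ** 2 * task_mask[..., None]).mean()


model = UnifiedTrackDenoiser()
loss = training_step(model, torch.randn(8, N_TRACKS, LATENT_DIM), stage=1)
```

Under these assumptions, the same task mask would let a user at inference time fix already-accepted tracks as clean conditions and regenerate only the remaining ones, mirroring the iterative track-by-track co-composition workflow described above.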