In the realm of 3D digital human applications, music-to-dance generation is a
challenging task. Given the one-to-many relationship between music and dance,
previous methods have been limited in their approach, relying solely on
matching and generating dance movements according to the music rhythm. In
the professional field of choreography, a dance phrase consists of several
dance poses and dance movements. Dance poses are composed of a series of basic,
meaningful body postures, while dance movements reflect dynamic qualities
such as the rhythm, melody, and style of a dance. Taking inspiration from these
concepts, we introduce an innovative dance generation pipeline called
DanceMeld, which comprises two stages, i.e., a dance decoupling stage and a
dance generation stage. In the decoupling stage, a hierarchical VQ-VAE is used to
disentangle dance poses and dance movements at different levels of the feature
space, where the bottom code represents dance poses and the top code represents
dance movements.
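To make the decoupling stage concrete, below is a minimal sketch (not the authors' released code) of a two-level, VQ-VAE-2-style model over pose sequences: a bottom branch quantizes frame-level features into a bottom code (dance poses), and a top branch quantizes a further temporally downsampled representation into a top code (dance movements). The pose dimension (147, e.g. 24 joints in a 6D rotation representation plus root translation), codebook size, layer widths, and downsampling rates are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient estimator."""
    def __init__(self, num_codes, dim, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                                    # z: (B, C, T)
        z_flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])  # (B*T, C)
        idx = torch.cdist(z_flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(idx).view(z.shape[0], z.shape[2], -1).permute(0, 2, 1)
        # codebook loss + commitment loss
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                         # straight-through estimator
        return z_q, idx.view(z.shape[0], z.shape[2]), loss

class HierarchicalVQVAE(nn.Module):
    """Bottom code: frame-level body postures; top code: coarser temporal dynamics."""
    def __init__(self, pose_dim=147, dim=256, num_codes=512):
        super().__init__()
        self.enc_bottom = nn.Sequential(                     # /2 temporal downsampling
            nn.Conv1d(pose_dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1))
        self.enc_top = nn.Sequential(                        # further /4 downsampling
            nn.Conv1d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 4, stride=2, padding=1))
        self.vq_bottom = VectorQuantizer(num_codes, dim)
        self.vq_top = VectorQuantizer(num_codes, dim)
        self.up_top = nn.ConvTranspose1d(dim, dim, 4, stride=4)  # top code back to bottom rate
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(2 * dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(dim, pose_dim, 3, padding=1))

    def forward(self, motion):                               # motion: (B, pose_dim, T)
        h_b = self.enc_bottom(motion)
        h_t = self.enc_top(h_b)
        q_t, top_code, loss_t = self.vq_top(h_t)
        q_b, bottom_code, loss_b = self.vq_bottom(h_b)
        recon = self.dec(torch.cat([q_b, self.up_top(q_t)], dim=1))
        return recon, top_code, bottom_code, loss_t + loss_b

motion = torch.randn(2, 147, 64)                             # two 64-frame pose sequences
recon, top_code, bottom_code, vq_loss = HierarchicalVQVAE()(motion)
print(recon.shape, top_code.shape, bottom_code.shape)        # (2,147,64) (2,8) (2,32)
```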
In the generation stage, we utilize a diffusion model as a prior to model the
distribution of the latent codes and to generate them conditioned on music features.
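For the generation stage, the following is a minimal sketch of such a diffusion prior over the continuous latent code embeddings: a denoiser is trained to predict the noise added to the clean latents, conditioned on the diffusion timestep and frame-aligned music features. The Transformer denoiser, the 35-dimensional music features, and the linear beta schedule are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDenoiser(nn.Module):
    """Predicts the noise added to the latent code sequence, given timestep and music."""
    def __init__(self, latent_dim=256, music_dim=35, hidden=512, steps=1000):
        super().__init__()
        self.time_emb = nn.Embedding(steps, hidden)
        self.music_proj = nn.Linear(music_dim, hidden)
        self.in_proj = nn.Linear(latent_dim, hidden)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True), num_layers=4)
        self.out_proj = nn.Linear(hidden, latent_dim)

    def forward(self, z_t, t, music):        # z_t: (B, T, D), t: (B,), music: (B, T, M)
        h = self.in_proj(z_t) + self.music_proj(music) + self.time_emb(t)[:, None, :]
        return self.out_proj(self.backbone(h))

# DDPM-style forward (noising) process with a linear beta schedule.
STEPS = 1000
betas = torch.linspace(1e-4, 0.02, STEPS)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(denoiser, z0, music):
    """Sample a timestep, noise the clean latents, and regress the added noise."""
    t = torch.randint(0, STEPS, (z0.shape[0],))
    noise = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    return F.mse_loss(denoiser(z_t, t, music), noise)

denoiser = LatentDenoiser()
z0 = torch.randn(2, 32, 256)    # e.g. embeddings of the latent codes at 32 time steps
music = torch.randn(2, 32, 35)  # music features aligned to the latent temporal rate
diffusion_loss(denoiser, z0, music).backward()
```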
We have experimentally demonstrated the representational capabilities of the
top and bottom codes, which enable an explicit, decoupled expression of dance
poses and dance movements. This disentanglement not only provides control over
motion details, style, and rhythm, but also facilitates applications such as
dance style transfer and dance unit editing. Qualitative and quantitative
experiments on the AIST++ dataset demonstrate the superiority of our approach
over other methods.