Key Laboratory of Intelligent Information Processing
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Developed by researchers at ICT/CAS, LLaMA-Omni is an end-to-end model enabling low-latency, high-quality speech interaction with open-source Large Language Models, achieving a response latency of 236ms and strong instruction-following performance while requiring less than 3 days of training on 4 GPUs. It addresses the gap in open-source solutions for simultaneous speech and text generation by employing a non-autoregressive streaming speech decoder and an efficient two-stage training strategy.

LevelRAG: Enhancing Retrieval-Augmented Generation with Multi-hop Logic Planning over Rewriting Augmented Searchers

LevelRAG introduces a hierarchical architecture for Retrieval-Augmented Generation (RAG) systems that decouples high-level retrieval logic from retriever-specific optimizations, enabling flexible multi-hop question answering by combining sparse, dense, and web searchers. It demonstrated strong performance on multi-hop QA datasets, matching larger models while using significantly fewer parameters.
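The decoupling described above can be sketched as a high-level planner that decomposes the question into sub-queries and fans them out to interchangeable low-level searchers. This is a minimal illustrative sketch, not LevelRAG's actual implementation; all function names, the toy corpus, and the hard-coded decomposition are assumptions for demonstration.

```python
def level_rag(question, decompose, searchers, aggregate):
    """High-level searcher: plan sub-queries, fan out to low-level
    searchers (e.g. sparse / dense / web), then aggregate the evidence.
    Retrieval logic lives here; retriever-specific tuning lives inside
    each searcher, so the two can evolve independently."""
    sub_queries = decompose(question)      # multi-hop logic planning
    evidence = []
    for q in sub_queries:
        for search in searchers:           # any mix of searchers plugs in
            evidence.extend(search(q))
    return aggregate(question, evidence)

# Toy components, for illustration only.
corpus = {
    "capital of France": ["Paris is the capital of France."],
    "river in Paris": ["The Seine flows through Paris."],
}

def toy_decompose(question):
    # A real planner would use an LLM; here two hops are hard-coded.
    return ["capital of France", "river in Paris"]

def toy_sparse(q):
    return corpus.get(q, [])

def toy_aggregate(question, evidence):
    return " ".join(evidence)

answer = level_rag("Which river flows through the capital of France?",
                   toy_decompose, [toy_sparse], toy_aggregate)
```

Because the searchers are passed in as plain callables, swapping a dense retriever for a sparse one (or adding a web searcher) requires no change to the multi-hop planning logic.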

Wait-info Policy: Balancing Source and Target at Information Level for Simultaneous Machine Translation
Simultaneous machine translation (SiMT) outputs the translation while receiving the source inputs, and hence needs to balance the received source information against the translated target information to make a reasonable decision between waiting for more inputs and outputting the translation. Previous methods always balance source and target information at the token level, either directly waiting for a fixed number of tokens or adjusting the waiting based on the current token. In this paper, we propose a Wait-info Policy to balance source and target at the information level. We first quantify the amount of information contained in each token, named info. Then during simultaneous translation, the decision of waiting or outputting is made by comparing the total info of the previous target outputs with that of the received source inputs. Experiments show that our method outperforms strong baselines under all latency levels and achieves a better balance via the proposed info.
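The read/write decision described in the abstract can be sketched as a loop comparing cumulative info on the two sides. This is a simplified illustration under assumed inputs: per-token info values and the threshold are made up, and the real policy estimates info with a learned model rather than taking it as given.

```python
def simulate_wait_info(src_info, tgt_info, threshold=1.0):
    """Return the READ/WRITE action sequence of a wait-info-style policy.

    src_info / tgt_info: per-token info amounts (here assumed known).
    WRITE is allowed only when the accumulated source info exceeds the
    info already emitted on the target side by at least `threshold`.
    """
    actions = []
    src_sum = tgt_sum = 0.0
    i = j = 0
    while j < len(tgt_info):
        if i < len(src_info) and src_sum - tgt_sum < threshold:
            src_sum += src_info[i]   # READ: consume one source token
            i += 1
            actions.append("READ")
        else:
            tgt_sum += tgt_info[j]   # WRITE: emit one target token
            j += 1
            actions.append("WRITE")
    return actions
```

With uniform info of 1.0 per token and a threshold of 1.0, the policy degenerates to an alternating read/write schedule (wait-1); uneven info values shift the schedule toward waiting on information-dense source spans, which is the point of moving the balance from the token level to the information level.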
Back Translation for Speech-to-text Translation Without Transcripts
The success of end-to-end speech-to-text translation (ST) is often achieved by utilizing source transcripts, e.g., by pre-training with automatic speech recognition (ASR) and machine translation (MT) tasks, or by introducing additional ASR and MT data. Unfortunately, transcripts are not always available, since numerous unwritten languages exist worldwide. In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. Motivated by the remarkable success of back translation in MT, we develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data. To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units and achieve back translation by cascading a target-to-unit model and a unit-to-speech model. With our synthetic ST data, we achieve an average boost of 2.3 BLEU on the MuST-C En-De, En-Fr, and En-Es datasets. Further experiments show that our method is especially effective in low-resource scenarios.
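The cascaded back-translation pipeline above can be sketched as a two-stage mapping from target text to pseudo source speech. This is an illustrative sketch only: the toy target-to-unit and unit-to-speech models below are stand-ins (the real ones are trained neural models over self-supervised discrete units), and all names are assumptions.

```python
def bt4st_synthesize(mono_targets, target_to_unit, unit_to_speech):
    """Synthesize pseudo ST pairs from target-side monolingual text.

    Cascade: target text -> discrete units -> source speech.  Using
    discrete units as the intermediate eases the short-to-long text->
    speech generation and the one-to-many mapping of the reverse task.
    Returns (speech, target_text) pairs for training the forward ST model.
    """
    pairs = []
    for tgt_text in mono_targets:
        units = target_to_unit(tgt_text)   # stage 1: target-to-unit model
        speech = unit_to_speech(units)     # stage 2: unit-to-speech model
        pairs.append((speech, tgt_text))
    return pairs

# Toy stand-ins for the two trained models, for illustration only.
def toy_target_to_unit(text):
    return [ord(c) % 100 for c in text]        # fake discrete unit IDs

def toy_unit_to_speech(units):
    return [float(u) for u in units]           # fake waveform samples

pairs = bt4st_synthesize(["Hallo"], toy_target_to_unit, toy_unit_to_speech)
```

The synthesized pairs are then mixed with any available real ST data to train the forward speech-to-text model, mirroring how back translation augments parallel text in MT.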