Researchers systematically analyzed visual layer selection in Multimodal Large Language Models (MLLMs), showing that fusing features from shallow, middle, and deep Vision Transformer layers via simple concatenation outperforms both the conventional reliance on deep layers alone and more complex fusion strategies.
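A minimal sketch of the fusion idea, not the paper's code: features from a few ViT layers are concatenated along the channel dimension and projected into the LLM embedding space. The layer indices and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLayerConcatFusion(nn.Module):
    """Concatenate shallow/middle/deep ViT features, then project to the LLM."""

    def __init__(self, vit_dim=1024, llm_dim=4096, num_selected=3):
        super().__init__()
        # A single projection applied after channel-wise concatenation.
        self.proj = nn.Linear(vit_dim * num_selected, llm_dim)

    def forward(self, hidden_states, layer_ids=(4, 12, 23)):
        # hidden_states: one [batch, num_patches, vit_dim] tensor per ViT layer
        # (e.g., obtained with output_hidden_states=True in common ViT APIs).
        picked = [hidden_states[i] for i in layer_ids]
        fused = torch.cat(picked, dim=-1)   # [batch, patches, vit_dim * 3]
        return self.proj(fused)             # [batch, patches, llm_dim]

# Toy usage with random per-layer features from a 24-layer ViT.
states = [torch.randn(2, 196, 1024) for _ in range(24)]
tokens = MultiLayerConcatFusion()(states)
print(tokens.shape)  # torch.Size([2, 196, 4096])
```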
The MULTICONIR benchmark was developed to systematically evaluate information retrieval and reranking models on natural-language queries that combine multiple conditions, revealing that current state-of-the-art models suffer significant performance drops on such multi-condition queries and lack robust relevance monotonicity and format invariance. Advanced general-purpose LLMs, such as GPT-4o, handled these complex retrieval scenarios markedly better.
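To make the relevance-monotonicity property concrete, here is an illustrative sketch (my own, not the benchmark's code): a retriever's score should rise as a document satisfies more of the query's conditions. The scorer and documents below are toy stand-ins.

```python
def is_relevance_monotonic(score, query, docs_by_condition_count):
    """True if the score strictly increases as documents satisfy more
    conditions; docs_by_condition_count[k] satisfies exactly k conditions."""
    scores = [score(query, doc) for doc in docs_by_condition_count]
    return all(a < b for a, b in zip(scores, scores[1:]))

# Toy three-condition query and documents ordered by conditions satisfied.
conditions = ["wireless", "noise-cancelling", "headphones"]
docs = [
    "wired earbuds",                         # satisfies 0 conditions
    "wireless earbuds",                      # 1
    "wireless noise-cancelling headset",     # 2
    "wireless noise-cancelling headphones",  # 3
]
# Stand-in scorer: count condition substrings present (query is unused here).
score = lambda q, d: sum(c in d for c in conditions)
print(is_relevance_monotonic(score, " ".join(conditions), docs))  # True
```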
A data-efficient framework for Thai text-to-speech synthesis combines phoneme-tone adaptive modeling with specialized preprocessing pipelines to handle Thai's complex linguistic features, such as its five lexical tones and unsegmented script, achieving high-fidelity speech synthesis and zero-shot voice cloning while requiring significantly less training data than traditional approaches.
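An illustrative sketch (not the paper's pipeline) of what phoneme-tone preprocessing can look like: each Thai syllable becomes a phoneme string plus an explicit tone tag, yielding a sequence a tone-adaptive acoustic model can consume. The tiny lexicon and tone indexing (0 = mid, 1 = low, 2 = falling, 3 = high, 4 = rising) are assumptions for the example.

```python
# Hypothetical mini-lexicon: word -> list of (phoneme, tone index) pairs.
LEXICON = {
    "สวัสดี": [("sa", 1), ("wat", 1), ("dii", 0)],  # "hello": low, low, mid
    "ครับ": [("khrap", 3)],                          # polite particle: high
}

def to_phoneme_tone_tokens(text):
    """Flatten words into interleaved phoneme and tone tokens,
    e.g. ['sa', '<tone_1>', 'wat', '<tone_1>', ...]."""
    tokens = []
    for word in text.split():
        for phoneme, tone in LEXICON.get(word, []):
            tokens.append(phoneme)
            tokens.append(f"<tone_{tone}>")
    return tokens

print(to_phoneme_tone_tokens("สวัสดี ครับ"))
# ['sa', '<tone_1>', 'wat', '<tone_1>', 'dii', '<tone_0>', 'khrap', '<tone_3>']
```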