Qatar Computing Research Institute
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

This survey systematically reviews more than 100 recent papers on Multimodal Retrieval-Augmented Generation (RAG), proposing an innovation-driven taxonomy that categorizes methods across retrieval, fusion, augmentation, generation, and training strategies, and outlining open challenges and future research directions.

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Researchers from UCLA, University of Washington, Qatar Computing Research Institute, Google, and Stanford University developed X-Teaming, an adaptive multi-agent framework for multi-turn jailbreaking, achieving attack success rates up to 98.1% against robust LLMs. They also created XGuard-Train, a 30K-entry dataset that improved multi-turn attack resistance by 34.2% when used for fine-tuning.

Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models

The OPEN-RAG framework enhances the reasoning capabilities of open-source Large Language Models (LLMs) by integrating external knowledge via Retrieval-Augmented Generation (RAG). It leverages a parameter-efficient sparse Mixture of Experts (MoE) architecture, reflection tokens for context evaluation, and adaptive retrieval, demonstrating improved performance on single and multi-hop reasoning tasks.
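The adaptive-retrieval component above can be pictured with a short sketch: consult the retriever only when the model signals that external knowledge is needed. The threshold, the reflection-score function, and the toy retriever/generator below are illustrative assumptions, not OPEN-RAG's actual interfaces.

```python
# Minimal sketch of adaptive retrieval gated by a reflection signal.
# retrieve_score(query) stands in for the model's probability of emitting
# a "retrieve" reflection token for this query (an assumption here).

def answer(query, retrieve_score, retriever, generator, threshold=0.5):
    """Consult the retriever only when the model judges it useful."""
    if retrieve_score(query) >= threshold:
        context = "\n".join(retriever(query))          # ground the prompt
        return generator(context + "\n\nQ: " + query)
    return generator("Q: " + query)                    # parametric knowledge only

# Toy stand-ins so the sketch runs end to end:
toy_score = lambda q: 0.9 if "capital" in q else 0.1
toy_retriever = lambda q: ["Paris is the capital of France."]
toy_generator = lambda prompt: prompt                  # echoes its prompt

grounded = answer("capital of France?", toy_score, toy_retriever, toy_generator)
direct = answer("hello?", toy_score, toy_retriever, toy_generator)
```

In a real system the same gate lets easy queries skip retrieval entirely, which is where the latency savings come from.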

LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models

LLMxCPG introduces a two-phase framework that integrates Code Property Graphs (CPGs) with Large Language Models (LLMs) to improve software vulnerability detection. The system reduces code size by up to 90.93% through CPG-guided slicing, achieving a 20% F1-score improvement over state-of-the-art baselines on unseen datasets while demonstrating robustness to code modifications.
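The graph-guided slicing step can be illustrated with a toy backward slice: starting from a security-sensitive sink, walk dependence edges backward and keep only the lines the sink transitively depends on. The graph and line numbers below are invented for illustration; a real pipeline would build the Code Property Graph with a tool such as Joern.

```python
# Toy backward slice over a dependence graph (line -> lines it depends on).
from collections import deque

def backward_slice(deps, sink):
    """Return the sorted set of lines reachable backward from the sink."""
    keep, work = set(), deque([sink])
    while work:
        line = work.popleft()
        if line in keep:
            continue
        keep.add(line)
        work.extend(deps.get(line, []))
    return sorted(keep)

# Six-line toy function; line 6 is the sink (e.g. an unchecked strcpy call).
deps = {6: [2, 5], 5: [3], 2: [1]}
print(backward_slice(deps, sink=6))  # -> [1, 2, 3, 5, 6]; line 4 is dropped
```

Dropping the unreachable lines is what shrinks the input handed to the LLM, in the spirit of the up-to-90.93% size reduction reported above.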

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. The results highlight the effectiveness of semantic and contextual guidance for speech tokenization and downstream tasks. Code and pretrained models are available at this https URL.
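The latent-fusion idea in (i) can be sketched numerically: project the semantic and contextual feature streams into the codec's latent space and merge them with the acoustic latents before quantization. The projection-and-sum rule, the dimensions, and the random matrices below are illustrative assumptions, not FuseCodec's actual fusion operator.

```python
# NumPy sketch of fusing semantic/contextual features into encoder latents.
import numpy as np

rng = np.random.default_rng(0)
T, d_ac, d_sem, d_ctx = 50, 128, 768, 1024   # frames and feature widths (assumed)

acoustic = rng.normal(size=(T, d_ac))        # codec encoder latents
semantic = rng.normal(size=(T, d_sem))       # e.g. self-supervised speech features
contextual = rng.normal(size=(T, d_ctx))     # e.g. text-LM features, frame-aligned

# Projections would be learned in practice; random here for the sketch.
W_sem = rng.normal(size=(d_sem, d_ac)) / np.sqrt(d_sem)
W_ctx = rng.normal(size=(d_ctx, d_ac)) / np.sqrt(d_ctx)

fused = acoustic + semantic @ W_sem + contextual @ W_ctx
print(fused.shape)  # same shape the quantizer already expects
```

Because the fused tensor keeps the acoustic latent's shape, the downstream quantizer needs no architectural change, which is the appeal of fusing in latent space.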
Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing

Researchers at QCRI developed a model diffing methodology using crosscoders to mechanistically analyze changes in Large Language Models after fine-tuning. Their work reveals that techniques like Simplified Preference Optimization (SimPO) lead to targeted shifts in internal capabilities, enhancing areas such as safety and instruction following while diminishing others like hallucination detection and specialized technical skills.

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection
The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.
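The "paragraph-level relative" idea can be sketched as group-normalized rewards computed per paragraph of a sampled explanation, GRPO-style but at paragraph granularity. The reward values and the normalization below are assumptions for illustration; the paper's exact objective may differ.

```python
# Sketch: group-relative advantage per paragraph, (r - mean) / std.
import statistics

def paragraph_advantages(rewards):
    """Normalize per-paragraph rewards within their group."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mu) / sd for r in rewards]

# Three paragraphs of one sampled explanation, scored for visual grounding:
advs = paragraph_advantages([0.2, 0.9, 0.4])
```

Normalizing within the group means only relatively better-grounded paragraphs are reinforced, rather than every paragraph of a high-reward sample.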
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

MATHMIST introduces the first parallel multilingual benchmark dataset for evaluating Large Language Models on mathematical problem-solving across seven typologically diverse languages. It expands beyond arithmetic to include symbolic and proof-based reasoning, employing novel evaluation paradigms such as code-switched Chain-of-Thought and perturbed reasoning to reveal persistent performance deficiencies in low-resource settings and complex cross-lingual generalization challenges.

Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team

Xolver, a multi-agent reasoning framework, emulates human Olympiad teams by integrating collaborative agents, a dual-memory system, and iterative refinement. It achieves state-of-the-art results on complex math and coding benchmarks, demonstrating that even lightweight LLM backbones can surpass larger models when augmented with holistic experience learning.

AI Debate Aids Assessment of Controversial Claims
As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides, especially on consequential topics where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI systems remain truthful even when their capabilities exceed those of their evaluators. Yet when humans serve as evaluators, their own beliefs and biases can impair judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial factuality claims about COVID-19 and climate change, topics where people hold strong prior beliefs. We conduct two studies. Study I recruits human judges with either mainstream or skeptical beliefs who evaluate claims through two protocols: debate (interaction with two AI advisors arguing opposing sides) or consultancy (interaction with a single AI advisor). Study II uses AI judges with and without human-like personas to evaluate the same protocols. In Study I, debate consistently improves human judgment accuracy and confidence calibration, outperforming consultancy by 4-10% across COVID-19 and climate change claims. The improvement is most significant for judges with mainstream beliefs (up to +15.2% accuracy on COVID-19 claims), though debate also helps skeptical judges who initially misjudge claims move toward accurate views (+4.7% accuracy). In Study II, AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%), suggesting their potential for supervising frontier AI models. These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight in contested domains.
Deep Learning for Anomaly Detection: A Survey

This survey provides a structured and comprehensive overview of deep learning-based anomaly detection methods, categorizing techniques by data type, label availability, and training objective. It synthesizes findings on deep learning's enhanced performance for complex data and reviews the adoption and effectiveness of these methods across numerous real-world application domains.

T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking
Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. Code is available at: this https URL
Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using multiple evaluation metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: this https URL
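A hybrid pipeline of the kind described above can be sketched as three stages of increasing cost: cheap rule-based flags first, a semantic-similarity filter second, and an LLM validator last. The lexicon, the token-overlap similarity, and the validator stub below are illustrative stand-ins, not Translation Tangles' actual components.

```python
# Sketch of a staged bias-detection pipeline for translation pairs.
BIAS_LEXICON = {"bossy", "hysterical"}           # toy rule-based stage

def similar(a, b):
    """Crude token-overlap similarity standing in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def detect_bias(source, translation, llm_validate, sim_floor=0.3):
    flagged = any(w in translation.lower().split() for w in BIAS_LEXICON)
    if not flagged:
        return False                              # rules found nothing
    if similar(source, translation) < sim_floor:  # likely mistranslation, not bias
        return False
    return llm_validate(source, translation)      # final, most expensive check

result = detect_bias("she is assertive", "she is bossy and assertive",
                     llm_validate=lambda s, t: True)
```

Ordering the stages this way keeps LLM calls to a small fraction of pairs, which matters when validating thousands of translation-reference pairs.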
Semantic Ranking for Automated Adversarial Technique Annotation in Security Text

Researchers from Purdue, QCRI, and MBZUAI developed a multi-stage semantic ranking system to automate the annotation of cyber threat behaviors to MITRE ATT&CK techniques. The system, which utilizes fine-tuned transformer models and a newly released human-annotated dataset, achieved a recall@10 of 92.07% and recall@3 of 81.02%, outperforming prior methods and significantly exceeding the performance of general large language models.

Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions

This survey paper from Sharif University of Technology, Iran University of Science and Technology, and Qatar Computing Research Institute presents a comprehensive overview of how generative artificial intelligence techniques are transforming character animation. It integrates advances across previously fragmented areas like facial animation, avatar creation, and motion synthesis, demonstrating how AI can automate and augment traditional animation processes, leading to more accessible and efficient content creation.

TechniqueRAG: Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text

Researchers from the Qatar Computing Research Institute developed TECHNIQUERAG, a Retrieval-Augmented Generation framework that automates the annotation of adversarial techniques in cyber threat intelligence texts. This framework achieves state-of-the-art performance, with an F1 score of 91.09% on the Procedures dataset for technique prediction, by integrating off-the-shelf retrievers, a zero-shot LLM re-ranker, and a minimally fine-tuned generator to overcome data scarcity and enhance domain-specific precision.

The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology
The emergence of ChatGPT marked a transformative milestone for Artificial Intelligence (AI), showcasing the remarkable potential of Large Language Models (LLMs) to generate human-like text. This wave of innovation has revolutionized how we interact with technology, seamlessly integrating LLMs into everyday tasks such as vacation planning, email drafting, and content creation. While English-speaking users have significantly benefited from these advancements, the Arabic world faces distinct challenges in developing Arabic-specific LLMs. Arabic, one of the most widely spoken languages in the world, serves more than 422 million native speakers in 27 countries and is deeply rooted in a rich linguistic and cultural heritage. Developing Arabic LLMs (ALLMs) presents an unparalleled opportunity to bridge technological gaps and empower communities. The journey of ALLMs has been both fascinating and complex, evolving from rudimentary text processing systems to sophisticated AI-driven models. This article explores the trajectory of ALLMs, from their inception to the present day, highlighting the efforts to evaluate these models through benchmarks and public leaderboards. We also discuss the challenges and opportunities that ALLMs present for the Arab world.
StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models
In this work, we present a series of structure transformation attacks on LLM alignment, where we encode natural language intent using diverse syntax spaces, ranging from simple structured formats and basic query languages (e.g., SQL) to novel syntax spaces created entirely by LLMs. Our extensive evaluation shows that our simplest attacks can achieve close to a 90% success rate, even on strict LLMs (such as Claude 3.5 Sonnet) using SOTA alignment mechanisms. We improve attack performance further by using an adaptive scheme that combines structure transformations with existing content transformations, resulting in over 96% ASR with 0% refusals. To generalize our attacks, we explore numerous structure formats, including syntaxes generated purely by LLMs. Our results indicate that such novel syntaxes are easy to generate and yield a high ASR, suggesting that defending against our attacks is not straightforward. Finally, we develop a benchmark and evaluate existing safety-alignment defenses against it, showing that most of them fail, with attacks reaching 100% ASR. Our results show that existing safety alignment relies mostly on token-level patterns without recognizing harmful concepts, highlighting and motivating the need for serious research efforts in this direction. As a case study, we demonstrate how attackers can use our attack to easily generate a sample malware and a corpus of fraudulent SMS messages, which perform well in bypassing detection.
GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge
Recently, there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can come from many domains, some of which may not be seen at test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether models can detect generated text from a large yet fixed number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate, suggesting that detectors can robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research.
T-RAG: Lessons from the LLM Trenches

The Qatar Computing Research Institute (QCRI) developed T-RAG, a system for secure, on-premise question answering over private organizational documents by integrating Retrieval-Augmented Generation (RAG) with a QLoRA-finetuned Llama-2 7B model and a unique tree-based context for hierarchical information. T-RAG achieved a 73.0% total correct rate in human evaluations, surpassing RAG (56.8%) and finetuned-only (54.1%) models, especially for hierarchical queries, and demonstrated improved robustness in "Needle in a Haystack" tests.
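The tree-based context idea can be sketched simply: when a question mentions an entity, serialize its position in the organizational hierarchy into a line of text that is prepended to the retrieved chunks. The entity names and rendering format below are invented for illustration; T-RAG's actual tree structure and serialization may differ.

```python
# Sketch: render an entity's chain of command as a textual context line.
org = {"CEO": None, "CTO": "CEO", "Data Team": "CTO", "Alice": "Data Team"}

def tree_context(entity, parents):
    """Walk parent links from the entity to the root, then render the path."""
    chain = [entity]
    while parents.get(chain[-1]):
        chain.append(parents[chain[-1]])
    return " -> ".join(reversed(chain))

print(tree_context("Alice", org))  # CEO -> CTO -> Data Team -> Alice
```

Supplying this path explicitly is what lets the model answer hierarchy questions ("who does Alice report to?") that flat chunk retrieval tends to miss.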
