alphaXiv

History

Papers Benchmarks

Amazon. Inc

11 Oct 2025

agentic-frameworks agents computer-science

TRAJECT-Bench:A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

Michigan State University Amazon. Inc Hippocratic AI Penn State University

A new benchmark called TRAJECT-Bench introduces a rigorous framework for evaluating Large Language Model (LLM) agents' tool-use capabilities using over 1,000 real-world APIs, complex multi-step trajectories, and both direct and indirect user queries. Evaluations reveal LLMs struggle significantly with inferring intent from indirect queries and scaling tool use, though agentic frameworks like ReAct provide modest improvements.

108

06 Jun 2025

computer-science cryptography-and-security

Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS

Michigan State University

University of Georgia

University of Arizona Amazon. Inc IBM, T.J. Watson Research Center

Researchers from Michigan State University, Amazon, IBM, and others introduce a systematic framework for analyzing vulnerabilities in Large Language Model-based Multi-Agent Systems (LLM-MAS). This work formalizes attack surfaces, threat models, and malicious goals, revealing how inter-agent communication and compositional effects create unique security challenges beyond those of single LLMs.

20,685

11 May 2025

computer-science information-retrieval machine-learning

Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models

Amazon. Inc Walmart Global Tech

Walmart researchers developed a knowledge distillation framework that transfers semantic understanding capabilities from large language models to a smaller, production-ready model for e-commerce search ranking, achieving comparable performance while meeting strict latency requirements and demonstrating improved user engagement metrics in A/B testing on Walmart.com.

104

03 Jun 2025

computer-science cryptography-and-security

Attention Knows Whom to Trust: Attention-based Trust Management for LLM Multi-Agent Systems

Michigan State University

The Pennsylvania State University Amazon. Inc

Researchers from Michigan State University, Amazon Inc., and The Pennsylvania State University developed an attention-based trust management system (A-Trust) to evaluate message trustworthiness in Large Language Model-based Multi-Agent Systems (LLM-MAS), achieving over 80% message detection rate and significantly reducing attack success rates by up to 70% while maintaining system utility.

09 Aug 2025

adversarial-attacks adversarial-robustness computer-science

Many-Turn Jailbreaking

University of California, Santa Barbara Amazon. Inc

Current jailbreaking work on large language models (LLMs) aims to elicit unsafe outputs from given prompts. However, it only focuses on single-turn jailbreaking targeting one specific query. On the contrary, the advanced LLMs are designed to handle extremely long contexts and can thus conduct multi-turn conversations. So, we propose exploring multi-turn jailbreaking, in which the jailbroken LLMs are continuously tested on more than the first-turn conversation or a single target query. This is an even more serious threat because 1) it is common for users to continue asking relevant follow-up questions to clarify certain jailbroken details, and 2) it is also possible that the initial round of jailbreaking causes the LLMs to respond to additional irrelevant questions consistently. As the first step (First draft done at June 2024) in exploring multi-turn jailbreaking, we construct a Multi-Turn Jailbreak Benchmark (MTJ-Bench) for benchmarking this setting on a series of open- and closed-source models and provide novel insights into this new safety threat. By revealing this new vulnerability, we aim to call for community efforts to build safer LLMs and pave the way for a more in-depth understanding of jailbreaking LLMs.

22 Jun 2025

computer-science computation-and-language data-curation

LLMs for Customized Marketing Content Generation and Evaluation at Scale

Texas A&M University Amazon. Inc

Offsite marketing is essential in e-commerce, enabling businesses to reach customers through external platforms and drive traffic to retail websites. However, most current offsite marketing content is overly generic, template-based, and poorly aligned with landing pages, limiting its effectiveness. To address these limitations, we propose MarketingFM, a retrieval-augmented system that integrates multiple data sources to generate keyword-specific ad copy with minimal human intervention. We validate MarketingFM via offline human and automated evaluations and large-scale online A/B tests. In one experiment, keyword-focused ad copy outperformed templates, achieving up to 9% higher CTR, 12% more impressions, and 0.38% lower CPC, demonstrating gains in ad ranking and cost efficiency. Despite these gains, human review of generated ads remains costly. To address this, we propose AutoEval-Main, an automated evaluation system that combines rule-based metrics with LLM-as-a-Judge techniques to ensure alignment with marketing principles. In experiments with large-scale human annotations, AutoEval-Main achieved 89.57% agreement with human reviewers. Building on this, we propose AutoEval-Update, a cost-efficient LLM-human collaborative framework to dynamically refine evaluation prompts and adapt to shifting criteria with minimal human input. By selectively sampling representative ads for human review and using a critic LLM to generate alignment reports, AutoEval-Update improves evaluation consistency while reducing manual effort. Experiments show the critic LLM suggests meaningful refinements, improving LLM-human agreement. Nonetheless, human oversight remains essential for setting thresholds and validating refinements before deployment.

03 Feb 2023

causal-inference computer-science artificial-intelligence

Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization

University of Utah NEC Laboratories America, Inc.

University of Central Florida Amazon. Inc

In this paper, we propose REASON, a novel framework that enables the automatic discovery of both intra-level (i.e., within-network) and inter-level (i.e., across-network) causal relationships for root cause localization. REASON consists of Topological Causal Discovery and Individual Causal Discovery. The Topological Causal Discovery component aims to model the fault propagation in order to trace back to the root causes. To achieve this, we propose novel hierarchical graph neural networks to construct interdependent causal networks by modeling both intra-level and inter-level non-linear causal relations. Based on the learned interdependent causal networks, we then leverage random walks with restarts to model the network propagation of a system fault. The Individual Causal Discovery component focuses on capturing abrupt change patterns of a single system entity. This component examines the temporal patterns of each entity's metric data (i.e., time series), and estimates its likelihood of being a root cause based on the Extreme Value theory. Combining the topological and individual causal scores, the top K system entities are identified as root causes. Extensive experiments on three real-world datasets with case studies demonstrate the effectiveness and superiority of the proposed framework.

20 Oct 2025

computer-science computation-and-language machine-learning

Generative or Discriminative? Revisiting Text Classification in the Era of Transformers

Amazon. Inc

Researchers at Amazon Inc. systematically compared modern Transformer-based generative and discriminative models for text classification. They found that while generative models (Autoregressive and Discrete Diffusion) exhibit advantages in low-data scenarios for larger models trained from scratch, pseudo-generative Masked Language Models surprisingly achieve superior accuracy in high-data regimes, and discriminative encoder models excel with small scales and when pre-trained.

11 Nov 2025

computer-science machine-learning domain-adaptation

Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation

Ocean University of China The University of British Columbia Amazon. Inc Ewha Womans University

Recent work on latent diffusion models (LDMs) has focused almost exclusively on generative tasks, leaving their potential for discriminative transfer largely unexplored. We introduce Discriminative Vicinity Diffusion (DVD), a novel LDM-based framework for a more practical variant of source-free domain adaptation (SFDA): the source provider may share not only a pre-trained classifier but also an auxiliary latent diffusion module, trained once on the source data and never exposing raw source samples. DVD encodes each source feature's label information into its latent vicinity by fitting a Gaussian prior over its k-nearest neighbors and training the diffusion network to drift noisy samples back to label-consistent representations. During adaptation, we sample from each target feature's latent vicinity, apply the frozen diffusion module to generate source-like cues, and use a simple InfoNCE loss to align the target encoder to these cues, explicitly transferring decision boundaries without source access. Across standard SFDA benchmarks, DVD outperforms state-of-the-art methods. We further show that the same latent diffusion module enhances the source classifier's accuracy on in-domain data and boosts performance in supervised classification and domain generalization experiments. DVD thus reinterprets LDMs as practical, privacy-preserving bridges for explicit knowledge transfer, addressing a core challenge in source-free domain adaptation that prior methods have yet to solve.

158

11 Dec 2024

computer-science programming-languages software-engineering

Scalable, Validated Code Translation of Entire Projects using Large Language Models

University of Bristol Amazon. Inc

This paper introduces Oxidizer, a modular, validated LLM-based translation pipeline capable of translating entire Go projects to Rust with high semantic correctness. The system achieved an average compilation success rate of 99% and a function I/O equivalence rate of 73% across seven diverse Go projects, demonstrating improved scalability and reliability compared to prior methods.

292

19 Aug 2025

ai-for-health attention-mechanisms computer-science

MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Stanford University

McGill University Amazon. Inc Montreal Neurological Institute, McGill University

Researchers from McGill University developed MedVisionLlama, an architecture that integrates frozen Large Language Model (LLM) layers into Vision Transformers (ViTs) for 3D medical image segmentation. This approach improves segmentation accuracy and data efficiency by leveraging LLM's implicit knowledge, outperforming baselines across diverse tasks.

18 Feb 2024

computer-science computation-and-language computer-vision-and-pattern-recognition

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

University of Washington

University of Michigan

MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation

Amazon. Inc

We propose a single-shot approach to determining 6-DoF pose of an object with available 3D computer-aided design (CAD) model from a single RGB image. Our method, dubbed MRC-Net, comprises two stages. The first performs pose classification and renders the 3D object in the classified pose. The second stage performs regression to predict fine-grained residual pose within class. Connecting the two stages is a novel multi-scale residual correlation (MRC) layer that captures high-and-low level correspondences between the input image and rendering from first stage. MRC-Net employs a Siamese network with shared weights between both stages to learn embeddings for input and rendered images. To mitigate ambiguity when predicting discrete pose class labels on symmetric objects, we use soft probabilistic labels to define pose class in the first stage. We demonstrate state-of-the-art accuracy, outperforming all competing RGB-based methods on four challenging BOP benchmark datasets: T-LESS, LM-O, YCB-V, and ITODD. Our method is non-iterative and requires no complex post-processing.

02 Nov 2023

computer-science computation-and-language human-ai-interaction

The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models

How Culturally Aware are Vision-Language Models?

Stanford University Amazon. Inc

An image is often considered worth a thousand words, and certain images can tell rich and insightful stories. Can these stories be told via image captioning? Images from folklore genres, such as mythology, folk dance, cultural signs, and symbols, are vital to every culture. Our research compares the performance of four popular vision-language models (GPT-4V, Gemini Pro Vision, LLaVA, and OpenFlamingo) in identifying culturally specific information in such images and creating accurate and culturally sensitive image captions. We also propose a new evaluation metric, the Cultural Awareness Score (CAS), which measures the degree of cultural awareness in image captions. We provide a dataset MOSAIC-1.5k labeled with ground truth for images containing cultural background and context and a labeled dataset with assigned Cultural Awareness Scores that can be used with unseen data. Creating culturally appropriate image captions is valuable for scientific research and can be beneficial for many practical applications. We envision our work will promote a deeper integration of cultural sensitivity in AI applications worldwide. By making the dataset and Cultural Awareness Score available to the public, we aim to facilitate further research in this area, encouraging the development of more culturally aware AI systems that respect and celebrate global diversity.

09 Dec 2021

attention-mechanisms computer-science contrastive-learning

CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification

Stony Brook University Amazon. Inc

Modern Web systems such as social media and e-commerce contain rich contents expressed in images and text. Leveraging information from multi-modalities can improve the performance of machine learning tasks such as classification and recommendation. In this paper, we propose the Cross-Modality Attention Contrastive Language-Image Pre-training (CMA-CLIP), a new framework which unifies two types of cross-modality attentions, sequence-wise attention and modality-wise attention, to effectively fuse information from image and text pairs. The sequence-wise attention enables the framework to capture the fine-grained relationship between image patches and text tokens, while the modality-wise attention weighs each modality by its relevance to the downstream tasks. In addition, by adding task specific modality-wise attentions and multilayer perceptrons, our proposed framework is capable of performing multi-task classification with multi-modalities. We conduct experiments on a Major Retail Website Product Attribute (MRWPA) dataset and two public datasets, Food101 and Fashion-Gen. The results show that CMA-CLIP outperforms the pre-trained and fine-tuned CLIP by an average of 11.9% in recall at the same level of precision on the MRWPA dataset for multi-task classification. It also surpasses the state-of-the-art method on Fashion-Gen Dataset by 5.5% in accuracy and achieves competitive performance on Food101 Dataset. Through detailed ablation studies, we further demonstrate the effectiveness of both cross-modality attention modules and our method's robustness against noise in image and text inputs, which is a common challenge in practice.

11 Sep 2023

computer-science computer-vision-and-pattern-recognition image-generation

NICE: CVPR 2023 Challenge on Zero-shot Image Captioning

KAIST

University of Waterloo

UC Berkeley

Tsinghua University Illinois Institute of Technology

Seoul National University LG AI Research

University of Sydney Hanyang University Amazon. Inc Kakao Brain Shutterstock NJUST Wooribank

Yimu Wang

In this report, we introduce NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of 2023 challenge. This project is designed to challenge the computer vision community to develop robust image captioning models that advance the state-of-the-art both in terms of accuracy and fairness. Through the challenge, the image captioning models were tested using a new evaluation dataset that includes a large variety of visual concepts from many domains. There was no specific training data provided for the challenge, and therefore the challenge entries were required to adapt to new types of image descriptions that had not been seen during training. This report includes information on the newly proposed NICE dataset, evaluation methods, challenge results, and technical details of top-ranking entries. We expect that the outcomes of the challenge will contribute to the improvement of AI models on various vision-language tasks.

13 Oct 2022

computer-science computer-science-and-game-theory machine-learning

Smooth Calibration, Leaky Forecasts, Finite Recall, and Nash Dynamics

University of Pennsylvania The Hebrew University of Jerusalem Amazon. Inc

We propose to smooth out the calibration score, which measures how good a forecaster is, by combining nearby forecasts. While regular calibration can be guaranteed only by randomized forecasting procedures, we show that smooth calibration can be guaranteed by deterministic procedures. As a consequence, it does not matter if the forecasts are leaked, i.e., made known in advance: smooth calibration can nevertheless be guaranteed (while regular calibration cannot). Moreover, our procedure has finite recall, is stationary, and all forecasts lie on a finite grid. To construct the procedure, we deal also with the related setups of online linear regression and weak calibration. Finally, we show that smooth calibration yields uncoupled finite-memory dynamics in n-person games "smooth calibrated learning" in which the players play approximate Nash equilibria in almost all periods (by contrast, calibrated learning, which uses regular calibration, yields only that the time-averages of play are approximate correlated equilibria).

26 Aug 2022

computer-science artificial-intelligence computer-vision-and-pattern-recognition

CH-MARL: A Multimodal Benchmark for Cooperative, Heterogeneous Multi-Agent Reinforcement Learning

University of Southern California Amazon. Inc

We propose a multimodal (vision-and-language) benchmark for cooperative and heterogeneous multi-agent learning. We introduce a benchmark multimodal dataset with tasks involving collaboration between multiple simulated heterogeneous robots in a rich multi-room home environment. We provide an integrated learning framework, multimodal implementations of state-of-the-art multi-agent reinforcement learning techniques, and a consistent evaluation protocol. Our experiments investigate the impact of different modalities on multi-agent learning performance. We also introduce a simple message passing method between agents. The results suggest that multimodality introduces unique challenges for cooperative multi-agent learning and there is significant room for advancing multi-agent reinforcement learning methods in such settings.

19 Mar 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

Northeastern University Amazon. Inc University of Maryland Baltimore County DEVCOM Army Research Laboratory

Traditional transformer-based semantic segmentation relies on quantized embeddings. However, our analysis reveals that autoencoder accuracy on segmentation mask using quantized embeddings (e.g. VQ-VAE) is 8% lower than continuous-valued embeddings (e.g. KL-VAE). Motivated by this, we propose a continuous-valued embedding framework for semantic segmentation. By reformulating semantic mask generation as a continuous image-to-embedding diffusion process, our approach eliminates the need for discrete latent representations while preserving fine-grained spatial and semantic details. Our key contribution includes a diffusion-guided autoregressive transformer that learns a continuous semantic embedding space by modeling long-range dependencies in image features. Our framework contains a unified architecture combining a VAE encoder for continuous feature extraction, a diffusion-guided transformer for conditioned embedding generation, and a VAE decoder for semantic mask reconstruction. Our setting facilitates zero-shot domain adaptation capabilities enabled by the continuity of the embedding space. Experiments across diverse datasets (e.g., Cityscapes and domain-shifted variants) demonstrate state-of-the-art robustness to distribution shifts, including adverse weather (e.g., fog, snow) and viewpoint variations. Our model also exhibits strong noise resilience, achieving robust performance (

\approx

95% AP compared to baseline) under gaussian noise, moderate motion blur, and moderate brightness/contrast variations, while experiencing only a moderate impact (

\approx

90% AP compared to baseline) from 50% salt and pepper noise, saturation and hue shifts. Code available: this https URL

There are no more papers matching your filters at the moment.

Events

Personalize Your Feed

Install Browser Extension

We're hiring

alphaXiv

Explore

State of the Art

Sign In

Labs

Feedback

Dark mode

TRAJECT-Bench:A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

Comprehensive Vulnerability Analysis is Necessary for Trustworthy LLM-MAS

Knowledge Distillation for Enhancing Walmart E-commerce Search Relevance Using Large Language Models

Attention Knows Whom to Trust: Attention-based Trust Management for LLM Multi-Agent Systems

Many-Turn Jailbreaking

LLMs for Customized Marketing Content Generation and Evaluation at Scale

Hierarchical Graph Neural Networks for Causal Discovery and Root Cause Localization

Generative or Discriminative? Revisiting Text Classification in the Era of Transformers

Vicinity-Guided Discriminative Latent Diffusion for Privacy-Preserving Domain Adaptation

Scalable, Validated Code Translation of Entire Projects using Large Language Models

MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation

The Art of SOCRATIC QUESTIONING: Recursive Thinking with Large Language Models

How Culturally Aware are Vision-Language Models?

CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification

NICE: CVPR 2023 Challenge on Zero-shot Image Captioning

Smooth Calibration, Leaky Forecasts, Finite Recall, and Nash Dynamics

CH-MARL: A Multimodal Benchmark for Cooperative, Heterogeneous Multi-Agent Reinforcement Learning

CAM-Seg: A Continuous-valued Embedding Approach for Semantic Image Generation

Events

AI for Law

Personalize Your Feed