alphaXiv

History

Papers Benchmarks

UNSW

385

30 Nov 2024

agent-based-systems computer-science artificial-intelligence

AgentOps: Enabling Observability of LLM Agents

CSIRO’s Data61 UNSW

Data61, CSIRO researchers introduced AgentOps, a specialized DevOps paradigm and comprehensive taxonomy for Large Language Model (LLM) agents, to enhance observability. This framework details critical artifacts and their relationships, offering a structured approach to tracing agent operations that addresses AI safety concerns and surpasses the capabilities of existing MLOps tools.

18 Jun 2025

computer-science computer-vision-and-pattern-recognition machine-learning

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

UNSW University of Surrey Mohamed bin Zayed University of AI Jio Institute

ViMaR introduces an inference framework for Vision-Language Models (VLMs) that enhances image captioning efficiency and fidelity. The framework achieves over 4x speedup compared to prior value-guided methods while significantly reducing hallucination rates and enabling VLM self-improvement.

138

04 Mar 2025

agentic-frameworks agents computer-science

Playing games with Large language models: Randomness and strategy

UNSW

This research explores the capabilities and limitations of Large Language Models (LLMs) when engaging in game-theoretic interactions. It finds LLMs consistently struggle with generating truly random outputs in Rock Paper Scissors, leading to biased play and high tie rates, but also demonstrates their strong malleability in the Prisoner's Dilemma, where explicit strategic guidance dramatically shifts behavior from cooperation to defection.

128

22 Oct 2020

computer-science machine-learning graph-neural-networks

Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting

Griffith University University of Technology Sydney UNSW

Modeling complex spatial and temporal correlations in the correlated time series data is indispensable for understanding the traffic dynamics and predicting the future status of an evolving traffic system. Recent works focus on designing complicated graph neural network architectures to capture shared patterns with the help of pre-defined graphs. In this paper, we argue that learning node-specific patterns is essential for traffic forecasting while the pre-defined graph is avoidable. To this end, we propose two adaptive modules for enhancing Graph Convolutional Network (GCN) with new capabilities: 1) a Node Adaptive Parameter Learning (NAPL) module to capture node-specific patterns; 2) a Data Adaptive Graph Generation (DAGG) module to infer the inter-dependencies among different traffic series automatically. We further propose an Adaptive Graph Convolutional Recurrent Network (AGCRN) to capture fine-grained spatial and temporal correlations in traffic series automatically based on the two modules and recurrent networks. Our experiments on two real-world traffic datasets show AGCRN outperforms state-of-the-art by a significant margin without pre-defined graphs about spatial connections.

156

03 Feb 2025

computer-science software-engineering

RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data

UNSW Chongqing University RMIT University

RCAEval provides a comprehensive open-source benchmark for Root Cause Analysis in microservice systems, featuring three large-scale datasets with 735 failure cases across diverse systems and 11 fault types, including code-level issues, supported by multi-source telemetry. Its integrated evaluation framework includes 15 state-of-the-art RCA baselines, with preliminary experiments revealing that current methods have considerable room for improvement in accuracy across varied fault types.

04 Apr 2025

computer-science artificial-intelligence computation-and-language

Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

CSIRO’s Data61 UNSW

Researchers from CSIRO's Data61 and UNSW developed a framework to learn natural language safety constraints from demonstrations, enabling Large Language Models (LLMs) to achieve robust, proactive safety alignment. Applying this method to fine-tune DistilBERT resulted in zero constraint violations in a text-based navigation task, even under domain shifts, outperforming standard reinforcement learning approaches.

23 Oct 2025

agents chain-of-thought computer-science

Illusions of reflection: open-ended task reveals systematic failures in Large Language Models' reflective reasoning

UNSW Aurecon Group

Humans do not just find mistakes after the fact -- we often catch them mid-stream because 'reflection' is tied to the goal and its constraints. Today's large language models produce reasoning tokens and 'reflective' text, but is it functionally equivalent with human reflective reasoning? Prior work on closed-ended tasks -- with clear, external 'correctness' signals -- can make 'reflection' look effective while masking limits in self-correction. We therefore test eight frontier models on a simple, real-world task that is open-ended yet rule-constrained, with auditable success criteria: to produce valid scientific test items, then revise after considering their own critique. First-pass performance is poor (often zero valid items out of 4 required; mean

\approx

1), and reflection yields only modest gains (also

\approx

1). Crucially, the second attempt frequently repeats the same violation of constraint, indicating 'corrective gains' arise largely from chance production of a valid item rather than error detection and principled, constraint-sensitive repair. Performance before and after reflection deteriorates as open-endedness increases, and models marketed for 'reasoning' show no advantage. Our results suggest that current LLM 'reflection' lacks functional evidence of the active, goal-driven monitoring that helps humans respect constraints even on a first pass. Until such mechanisms are instantiated in the model itself, reliable performance requires external structure that enforces constraints. Our code is available at: this https URL

10 Oct 2025

adversarial-attacks ai-for-cybersecurity computer-science

Machine Learning for Detection and Analysis of Novel LLM Jailbreaks

UNSW Pingla Institute

Large Language Models (LLMs) suffer from a range of vulnerabilities that allow malicious users to solicit undesirable responses through manipulation of the input text. These so-called jailbreak prompts are designed to trick the LLM into circumventing the safety guardrails put in place to keep responses acceptable to the developer's policies. In this study, we analyse the ability of different machine learning models to distinguish jailbreak prompts from genuine uses, including looking at our ability to identify jailbreaks that use previously unseen strategies. Our results indicate that using current datasets the best performance is achieved by fine tuning a Bidirectional Encoder Representations from Transformers (BERT) model end-to-end for identifying jailbreaks. We visualise the keywords that distinguish jailbreak from genuine prompts and conclude that explicit reflexivity in prompt structure could be a signal of jailbreak intention.

03 Jun 2025

computer-science computation-and-language inference-optimization

Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation

Chinese Academy of Sciences

Sun Yat-Sen University

The Hong Kong Polytechnic University UNSW Shenzhen MSU-BIT University Shenzhen Institutes of Advanced Technology

Dingwei Chen

This paper presents Premature Layers Interpolation (PLI), a training-free and plug-and-play method that reduces LLM hallucinations by strategically inserting new layers derived from the spherical linear interpolation of existing adjacent layers. The approach consistently improves factual accuracy and reasoning across various models and datasets with minimal computational overhead.

17 Mar 2025

computer-science software-engineering

SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation

CSIRO UNSW ANU

UI automation is a useful technique for UI testing, bug reproduction, and robotic process automation. Recording user actions with an application assists rapid development of UI automation scripts, but existing recording techniques are intrusive, rely on OS or GUI framework accessibility support, or assume specific app implementations. Reverse engineering user actions from screencasts is non-intrusive, but a key reverse-engineering step is currently missing - recognizing human-understandable structured user actions ([command] [widget] [location]) from action screencasts. To fill the gap, we propose a deep learning-based computer vision model that can recognize 11 commands and 11 widgets, and generate location phrases from action screencasts, through joint learning and multi-task learning. We label a large dataset with 7260 video-action pairs, which record user interactions with Word, Zoom, Firefox, Photoshop, and Windows 10 Settings. Through extensive experiments, we confirm the effectiveness and generality of our model, and demonstrate the usefulness of a screencast-to-action-script tool built upon our model for bug reproduction.

19 Aug 2023

ai-for-health computer-science contrastive-learning

Contrastive Learning-based Imputation-Prediction Networks for In-hospital Mortality Risk Modeling using EHRs

UNSW RMIT University Northwest A\&F University Flinders University

Researchers at Flinders University, UNSW, RMIT University, and Northwest A&F University developed CL-ImpPreNet, a framework that leverages graph analysis for patient stratification and contrastive learning to simultaneously impute missing values in Electronic Health Records (EHRs) and predict in-hospital mortality. This integrated approach achieved superior performance over state-of-the-art baselines on MIMIC-III and eICU datasets for both imputation and prediction tasks.

26 Nov 2024

attention-mechanisms computer-science computer-vision-and-pattern-recognition

TAFM-Net: A Novel Approach to Skin Lesion Segmentation Using Transformer Attention and Focal Modulation

UNSW Abasyn University Islamabad Campus

Incorporating modern computer vision techniques into clinical protocols shows promise in improving skin lesion segmentation. The U-Net architecture has been a key model in this area, iteratively improved to address challenges arising from the heterogeneity of dermatologic images due to varying clinical settings, lighting, patient attributes, and hair density. To further improve skin lesion segmentation, we developed TAFM-Net, an innovative model leveraging self-adaptive transformer attention (TA) coupled with focal modulation (FM). Our model integrates an EfficientNetV2B1 encoder, which employs TA to enhance spatial and channel-related saliency, while a densely connected decoder integrates FM within skip connections, enhancing feature emphasis, segmentation performance, and interpretability crucial for medical image analysis. A novel dynamic loss function amalgamates region and boundary information, guiding effective model training. Our model achieves competitive performance, with Jaccard coefficients of 93.64\%, 86.88\% and 92.88\% in the ISIC2016, ISIC2017 and ISIC2018 datasets, respectively, demonstrating its potential in real-world scenarios.

12 Jun 2025

chain-of-thought computer-science artificial-intelligence

Context Is Not Comprehension

UNSW

Verbose ListOps (VLO) is introduced as a benchmark for evaluating large language models' (LLMs) multi-step reasoning and intermediate state tracking within coherent, distracting narratives. The research demonstrates that leading LLMs experience a drastic performance reduction, typically 50% or more, when the same logical computations are embedded in natural language compared to their performance on explicit bare logical tasks.

20 Dec 2022

computer-science artificial-intelligence cryptography-and-security

PublicCheck: Public Integrity Verification for Services of Run-time Deep Models

CSIRO’s Data61 UNSW Indian Institute of Technology Delhi Cyber CRC

Existing integrity verification approaches for deep models are designed for private verification (i.e., assuming the service provider is honest, with white-box access to model parameters). However, private verification approaches do not allow model users to verify the model at run-time. Instead, they must trust the service provider, who may tamper with the verification results. In contrast, a public verification approach that considers the possibility of dishonest service providers can benefit a wider range of users. In this paper, we propose PublicCheck, a practical public integrity verification solution for services of run-time deep models. PublicCheck considers dishonest service providers, and overcomes public verification challenges of being lightweight, providing anti-counterfeiting protection, and having fingerprinting samples that appear smooth. To capture and fingerprint the inherent prediction behaviors of a run-time model, PublicCheck generates smoothly transformed and augmented encysted samples that are enclosed around the model's decision boundary while ensuring that the verification queries are indistinguishable from normal queries. PublicCheck is also applicable when knowledge of the target model is limited (e.g., with no knowledge of gradients or model parameters). A thorough evaluation of PublicCheck demonstrates the strong capability for model integrity breach detection (100% detection accuracy with less than 10 black-box API queries) against various model integrity attacks and model compression attacks. PublicCheck also demonstrates the smooth appearance, feasibility, and efficiency of generating a plethora of encysted samples for fingerprinting.

31 Jan 2022

computer-science machine-learning statistics

Weisfeiler and Lehman Go Cellular: CW Networks

University of Cambridge

UCLA

Imperial College London

Shanghai Jiao Tong University UNSW MPI MiS Twitter

CW Networks leverage regular cell complexes and a cellular Weisfeiler-Lehman test to enhance the expressive power of geometric deep learning models for graph-structured data. This approach enables principled modeling of higher-order interactions, leading to state-of-the-art results on molecular graph prediction tasks and distinguishing graphs beyond the capabilities of 3-WL tests.

27 May 2025

computer-science cryptography-and-security operating-systems

Fast, Secure, Adaptable: LionsOS Design, Implementation and Performance

UNSW

LionsOS, an operating system built on the seL4 microkernel and Microkit, offers a practical approach to developing high-assurance systems for industrial embedded applications. It achieves superior networking performance, outperforming Linux with higher throughput, lower CPU usage, and 3-10x lower latency, while maintaining a small trusted computing base and simplifying driver development.

11 Mar 2025

astrophysics-of-galaxies solar-and-stellar-astrophysics physics

The GALAH Survey: Data Release 4

Monash University University of Utah University of Oklahoma University of Ljubljana Aarhus University

Space Telescope Science Institute UNSW University of Southern Queensland

Stockholm University Uppsala University University of Iowa

Australian National University Macquarie University

University of Sydney

Chalmers University of Technology International Space Science Institute Beijing

Ioana Ciucă

The stars of the Milky Way carry the chemical history of our Galaxy in their atmospheres as they journey through its vast expanse. Like barcodes, we can extract the chemical fingerprints of stars from high-resolution spectroscopy. The fourth data release (DR4) of the Galactic Archaeology with HERMES (GALAH) Survey, based on a decade of observations, provides the chemical abundances of up to 32 elements for 917 588 stars that also have exquisite astrometric data from the

Gaia

satellite. For the first time, these elements include life-essential nitrogen to complement carbon, and oxygen as well as more measurements of rare-earth elements critical to modern-life electronics, offering unparalleled insights into the chemical composition of the Milky Way. For this release, we use neural networks to simultaneously fit stellar parameters and abundances across the whole wavelength range, leveraging synthetic grids computed with Spectroscopy Made Easy. These grids account for atomic line formation in non-local thermodynamic equilibrium for 14 elements. In a two-iteration process, we first fit stellar labels to all 1 085 520 spectra, then co-add repeated observations and refine these labels using astrometric data from

Gaia

and 2MASS photometry, improving the accuracy and precision of stellar parameters and abundances. Our validation thoroughly assesses the reliability of spectroscopic measurements and highlights key caveats. GALAH DR4 represents yet another milestone in Galactic archaeology, combining detailed chemical compositions from multiple nucleosynthetic channels with kinematic information and age estimates. The resulting dataset, covering nearly a million stars, opens new avenues for understanding not only the chemical and dynamical history of the Milky Way, but also the broader questions of the origin of elements and the evolution of planets, stars, and galaxies.

03 Jun 2024

ai-for-health computer-science machine-learning

Distributional Refinement Network: Distributional Forecasting via Deep Learning

University of Melbourne UNSW

Researchers from the University of Melbourne and UNSW Sydney developed the Distributional Refinement Network (DRN), a deep learning framework designed for accurate distributional forecasting while maintaining interpretability. This approach refines predictions from an interpretable baseline model, demonstrating superior performance in metrics like Negative Log-Likelihood and CRPS on synthetic and real-world insurance claim datasets, while also providing actionable insights into feature contributions via SHAP analysis.

18 Mar 2025

computer-science conversational-ai emerging-technologies

MERCI: Multimodal Emotional and peRsonal Conversational Interactions Dataset

UNSW UTS

a baki kocaballı

The integration of conversational agents into our daily lives has become increasingly common, yet many of these agents cannot engage in deep interactions with humans. Despite this, there is a noticeable shortage of datasets that capture multimodal information from human-robot interaction dialogues. To address this gap, we have recorded a novel multimodal dataset (MERCI) that encompasses rich embodied interaction data. The process involved asking participants to complete a questionnaire and gathering their profiles on ten topics, such as hobbies and favorite music. Subsequently, we initiated conversations between the robot and the participants, leveraging GPT-4 to generate contextually appropriate responses based on the participant's profile and emotional state, as determined by facial expression recognition and sentiment analysis. Automatic and user evaluations were conducted to assess the overall quality of the collected data. The results of both evaluations indicated a high level of naturalness, engagement, fluency, consistency, and relevance in the conversation, as well as the robot's ability to provide empathetic responses. It is worth noting that the dataset is derived from genuine interactions with the robot, involving participants who provided personal information and conveyed actual emotions.

18 Mar 2025

computer-science artificial-intelligence machine-learning

AI-Powered Prediction of Nanoparticle Pharmacokinetics: A Multi-View Learning Approach

The University of Hong Kong UNSW University of Surrey University of Western Australia

The clinical translation of nanoparticle-based treatments remains limited due to the unpredictability of (nanoparticle) NP pharmacokinetics

\unicode{x2014}

how they distribute, accumulate, and clear from the body. Predicting these behaviours is challenging due to complex biological interactions and the difficulty of obtaining high-quality experimental datasets. Existing AI-driven approaches rely heavily on data-driven learning but fail to integrate crucial knowledge about NP properties and biodistribution mechanisms. We introduce a multi-view deep learning framework that enhances pharmacokinetic predictions by incorporating prior knowledge of key NP properties such as size and charge into a cross-attention mechanism, enabling context-aware feature selection and improving generalization despite small datasets. To further enhance prediction robustness, we employ an ensemble learning approach, combining deep learning with XGBoost (XGB) and Random Forest (RF), which significantly outperforms existing AI models. Our interpretability analysis reveals key physicochemical properties driving NP biodistribution, providing biologically meaningful insights into possible mechanisms governing NP behaviour in vivo rather than a black-box model. Furthermore, by bridging machine learning with physiologically based pharmacokinetic (PBPK) modelling, this work lays the foundation for data-efficient AI-driven drug discovery and precision nanomedicine.

There are no more papers matching your filters at the moment.

Events

Personalize Your Feed

Install Browser Extension

We're hiring

alphaXiv

Explore

State of the Art

Sign In

Labs

Feedback

Dark mode

AgentOps: Enabling Observability of LLM Agents

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Playing games with Large language models: Randomness and strategy

Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting

RCAEval: A Benchmark for Root Cause Analysis of Microservice Systems with Telemetry Data

Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

Illusions of reflection: open-ended task reveals systematic failures in Large Language Models' reflective reasoning

Machine Learning for Detection and Analysis of Novel LLM Jailbreaks

Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation

SeeAction: Towards Reverse Engineering How-What-Where of HCI Actions from Screencasts for UI Automation

Contrastive Learning-based Imputation-Prediction Networks for In-hospital Mortality Risk Modeling using EHRs

TAFM-Net: A Novel Approach to Skin Lesion Segmentation Using Transformer Attention and Focal Modulation

Context Is Not Comprehension

PublicCheck: Public Integrity Verification for Services of Run-time Deep Models

Weisfeiler and Lehman Go Cellular: CW Networks

Fast, Secure, Adaptable: LionsOS Design, Implementation and Performance

The GALAH Survey: Data Release 4

Distributional Refinement Network: Distributional Forecasting via Deep Learning

MERCI: Multimodal Emotional and peRsonal Conversational Interactions Dataset

AI-Powered Prediction of Nanoparticle Pharmacokinetics: A Multi-View Learning Approach

Events

AI for Law

Personalize Your Feed