Mayo Clinic

The 2024 Brain Tumor Segmentation (BraTS) Challenge: Glioma Segmentation on Post-treatment MRI

28 May 2024

Heidelberg University UCLA

Gliomas are the most common malignant primary brain tumors in adults and one of the deadliest types of cancer. There are many challenges in treatment and monitoring due to the genetic diversity and high intrinsic heterogeneity in appearance, shape, histology, and treatment response. Treatments include surgery, radiation, and systemic therapies, with magnetic resonance imaging (MRI) playing a key role in treatment planning and post-treatment longitudinal assessment. The 2024 Brain Tumor Segmentation (BraTS) challenge on post-treatment glioma MRI will provide a community standard and benchmark for state-of-the-art automated segmentation models based on the largest expert-annotated post-treatment glioma MRI dataset. Challenge competitors will develop automated segmentation models to predict four distinct tumor sub-regions consisting of enhancing tissue (ET), surrounding non-enhancing T2/fluid-attenuated inversion recovery (FLAIR) hyperintensity (SNFH), non-enhancing tumor core (NETC), and resection cavity (RC). Models will be evaluated on separate validation and test datasets using standardized performance metrics utilized across the BraTS 2024 cluster of challenges, including lesion-wise Dice Similarity Coefficient and Hausdorff Distance. Models developed during this challenge will advance the field of automated MRI segmentation and contribute to their integration into clinical practice, ultimately enhancing patient care.

#ai-for-health #computer-science #computer-vision-security
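
The BraTS 2024 abstract above names the Dice Similarity Coefficient as one of its evaluation metrics. As a rough illustration only, the sketch below computes the plain Dice score, 2|A∩B| / (|A| + |B|), on two flattened binary masks. It does not reproduce the challenge's lesion-wise variant, which scores each lesion separately. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions made for this example, not part of the challenge definition.

```python
def dice_coefficient(pred, truth):
    """Plain Dice score for two flattened binary masks of equal length.

    Assumption for this sketch: two empty masks count as a perfect match (1.0).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")

    # Voxels labelled positive in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total positive voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)

    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# One overlapping voxel, two positive voxels in each mask.
predicted = [1, 1, 0, 0]
reference = [1, 0, 1, 0]
print(dice_coefficient(predicted, reference))
```

A per-region score, for example for ET or RC, would apply the same function to the binary mask of that label alone.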

MONAI: An open-source framework for deep learning in healthcare

04 Nov 2022

University College London University of Oxford

Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provides purpose-specific AI model architectures, transformations and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

MRI super-resolution reconstruction using efficient diffusion probabilistic model with residual shifting

26 Apr 2025

Emory University Mayo Clinic

Researchers at Emory University and Mayo Clinic developed Res-SRDiff, an efficient diffusion model that reconstructs high-resolution MRI from low-resolution inputs by integrating a residual error-shifting mechanism. This approach enables high-fidelity super-resolution in just four sampling steps, achieving reconstruction speeds of 0.46 seconds per brain slice and 0.95 seconds per prostate slice, while preserving fine anatomical details.

#ai-for-health #computer-science #computer-vision-and-pattern-recognition

BiomedGPT: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks

11 Aug 2024

Lehigh University Harvard Medical School

BiomedGPT introduces the first open-source, lightweight generalist vision-language foundation model for biomedicine, successfully addressing diverse tasks by unifying text and image data within a transformer architecture. The model achieved state-of-the-art results in 16 out of 25 experiments and demonstrated strong zero-shot capabilities, often outperforming larger proprietary models despite its smaller parameter count.

#ai-for-health #computer-science #artificial-intelligence

Comprehensive language-image pre-training for 3D medical image understanding

16 Oct 2025

University of Cambridge Microsoft

Researchers from Microsoft and leading medical institutions developed COLIPRI, a 3D vision-language pre-training framework that integrates contrastive learning, radiology report generation, and masked autoencoding to enhance understanding of 3D medical images. The framework significantly improved the clinical accuracy of generated radiology reports and achieved strong performance in global classification and retrieval tasks.

#ai-for-health #computer-science #computer-vision-and-pattern-recognition

MIRIAD: Augmenting LLMs with millions of medical query-response pairs

09 Jun 2025

ETH Zurich Stanford University

LLMs are bound to transform healthcare with advanced decision support and flexible chat assistants. However, LLMs are prone to generating inaccurate medical content. To ground LLMs in high-quality medical knowledge, LLMs have been equipped with external knowledge via RAG, where unstructured medical knowledge is split into small text chunks that can be selectively retrieved and integrated into the LLM's context. Yet, existing RAG pipelines rely on raw, unstructured medical text, which can be noisy, uncurated and difficult for LLMs to effectively leverage. Systematic approaches to organize medical knowledge to best surface it to LLMs are generally lacking. To address these challenges, we introduce MIRIAD, a large-scale, curated corpus of 5,821,948 medical QA pairs, each rephrased from and grounded in a passage from peer-reviewed medical literature using a semi-automated pipeline combining LLM generation, filtering, grounding, and human annotation. Unlike prior medical corpora, which rely on unstructured text, MIRIAD encapsulates web-scale medical knowledge in an operationalized query-response format, which enables more targeted retrieval. Experiments on challenging medical QA benchmarks show that augmenting LLMs with MIRIAD improves accuracy up to 6.7% compared to unstructured RAG baselines with the same source corpus and with the same amount of retrieved text. Moreover, MIRIAD improved the ability of LLMs to detect medical hallucinations by 22.5 to 37% (increase in F1 score). We further introduce MIRIAD-Atlas, an interactive map of MIRIAD spanning 56 medical disciplines, enabling clinical users to visually explore, search, and refine medical knowledge. MIRIAD promises to unlock a wealth of downstream applications, including medical information retrievers, enhanced RAG applications, and knowledge-grounded chat interfaces, which ultimately enable more reliable LLM applications in healthcare.

#ai-for-health #computer-science #conversational-ai

BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

28 Oct 2025

University of Illinois at Urbana-Champaign Stanford University

Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world clinical data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications, including triage and referral, consultation, information extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini series, and Qwen3 series) under various inference strategies. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: this https URL

#ai-for-health #computer-science #artificial-intelligence

Is the medical image segmentation problem solved? A survey of current developments and future directions

27 Aug 2025

Harvard Medical School University of Pennsylvania

This survey critically evaluates the progress in deep learning-based medical image segmentation from 2015 to 2024, concluding that while performance has improved significantly, the problem remains unsolved, especially concerning clinical viability and generalization. It identifies persistent challenges across various segmentation paradigms and outlines future research directions, including the development of intelligent segmentation agents and the need for new evaluation metrics that quantify human-in-the-loop efficiency.

#agents #ai-for-health #attention-mechanisms

A deep learning framework for efficient pathology image analysis

18 Feb 2025

University of Washington Imperial College London

A deep learning framework, EAGLE, was developed to efficiently analyze whole slide pathology images, achieving an average AUROC of 0.742 across 31 tasks while processing images over 99% faster than previous methods by selectively focusing on critical regions.

#ai-for-health #attention-mechanisms #computer-science

EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices

19 Oct 2024

Arizona State University Mayo Clinic

EViT-UNet presents an efficient U-shaped deep learning architecture for medical image segmentation that balances high accuracy with low computational cost. This approach strategically integrates convolutional operations and vision transformer blocks, achieving an average DSC of 80.9% on the Synapse dataset with only 5.4 GMac, making it suitable for deployment on resource-constrained mobile and edge devices.

#computer-science #computer-vision-and-pattern-recognition #edge-computing

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

24 Dec 2023

Zepeng Frazier Huo

Stanford University Mayo Clinic

MedAlign introduces a clinician-generated dataset for evaluating Large Language Models on complex instruction-following tasks using comprehensive Electronic Health Records. It reveals that state-of-the-art LLMs achieve a maximum of 65% correctness on these tasks, highlighting the critical role of context length and a misalignment in current medical LLM fine-tuning approaches, while identifying automated metrics that correlate with clinician judgment.

#ai-for-health #computer-science #conversational-ai

LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models

08 Oct 2025

Moein Shariatnia

Mayo Clinic Tehran University of Medical Sciences

Systematic literature reviews and meta-analyses are essential for synthesizing research insights, but they remain time-intensive and labor-intensive due to the iterative processes of screening, evaluation, and data extraction. This paper introduces and evaluates LatteReview, a Python-based framework that leverages large language models (LLMs) and multi-agent systems to automate key elements of the systematic review process. Designed to streamline workflows while maintaining rigor, LatteReview utilizes modular agents for tasks such as title and abstract screening, relevance scoring, and structured data extraction. These agents operate within orchestrated workflows, supporting sequential and parallel review rounds, dynamic decision-making, and iterative refinement based on user feedback. LatteReview's architecture integrates LLM providers, enabling compatibility with both cloud-based and locally hosted models. The framework supports features such as Retrieval-Augmented Generation (RAG) for incorporating external context, multimodal reviews, Pydantic-based validation for structured inputs and outputs, and asynchronous programming for handling large-scale datasets. The framework is available on the GitHub repository, with detailed documentation and an installable package.

#computer-science #computation-and-language #human-ai-interaction

High-throughput digital twin framework for predicting neurite deterioration using MetaFormer attention

18 Dec 2024

Carnegie Mellon University Mayo Clinic

Neurodevelopmental disorders (NDDs) cover a variety of conditions, including autism spectrum disorder, attention-deficit/hyperactivity disorder, and epilepsy, which impair the central and peripheral nervous systems. Their high comorbidity and complex etiologies present significant challenges for accurate diagnosis and effective treatments. Conventional clinical and experimental studies are time-intensive, burdening research progress considerably. This paper introduces a high-throughput digital twin framework for modeling neurite deteriorations associated with NDDs, integrating synthetic data generation, experimental images, and machine learning (ML) models. The synthetic data generator utilizes an isogeometric analysis (IGA)-based phase field model to capture diverse neurite deterioration patterns such as neurite retraction, atrophy, and fragmentation while mitigating the limitations of scarce experimental data. The ML model utilizes MetaFormer-based gated spatiotemporal attention architecture with deep temporal layers and provides fast predictions. The framework effectively captures long-range temporal dependencies and intricate morphological transformations with average errors of 1.9641% and 6.0339% for synthetic and experimental neurite deterioration, respectively. Seamlessly integrating simulations, experiments, and ML, the digital twin framework can guide researchers to make informed experimental decisions by predicting potential experimental outcomes, significantly reducing costs and saving valuable time. It can also advance our understanding of neurite deterioration and provide a scalable solution for exploring complex neurological mechanisms, contributing to the development of targeted treatments.

#ai-for-health #attention-mechanisms #computer-science

INSPECT: A Multimodal Dataset for Pulmonary Embolism Diagnosis and Prognosis

17 Nov 2023

Zepeng Frazier Huo

Stanford University Microsoft

Synthesizing information from multiple data sources plays a crucial role in the practice of modern medicine. Current applications of artificial intelligence in medicine often focus on single-modality data due to a lack of publicly available, multimodal medical datasets. To address this limitation, we introduce INSPECT, which contains de-identified longitudinal records from a large cohort of patients at risk for pulmonary embolism (PE), along with ground truth labels for multiple outcomes. INSPECT contains data from 19,402 patients, including CT images, radiology report impression sections, and structured electronic health record (EHR) data (i.e. demographics, diagnoses, procedures, vitals, and medications). Using INSPECT, we develop and release a benchmark for evaluating several baseline modeling approaches on a variety of important PE related tasks. We evaluate image-only, EHR-only, and multimodal fusion models. Trained models and the de-identified dataset are made available for non-commercial use under a data use agreement. To the best of our knowledge, INSPECT is the largest multimodal dataset integrating 3D medical imaging and EHR for reproducible methods evaluation and research.

#ai-for-health #computer-science #computer-vision-security

Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models

31 Mar 2024

Arizona State University Mayo Clinic

Logical reasoning is fundamental for humans yet presents a substantial challenge in the domain of Artificial Intelligence. Initially, researchers used Knowledge Representation and Reasoning (KR) systems that did not scale and required non-trivial manual effort. Recently, the emergence of large language models (LLMs) has demonstrated the ability to overcome various limitations of formal Knowledge Representation (KR) systems. Consequently, there is growing interest in using LLMs for logical reasoning via natural language. This work strives to understand the proficiency of LLMs in logical reasoning by offering a brief review of the latest progress in this area, with a focus on logical reasoning datasets, tasks, and the methods adopted to utilize LLMs for reasoning. To offer a thorough analysis, we have compiled a benchmark titled LogiGLUE. This includes 24 varied datasets encompassing deductive, abductive, and inductive reasoning. Utilizing LogiGLUE as a foundation, we have trained an instruction fine-tuned language model, resulting in LogiT5. We study single-task training, multi-task training, and "chain-of-thought" knowledge distillation fine-tuning techniques to assess the performance of the model across the different logical reasoning categories. We also assess various LLMs using LogiGLUE, and the findings indicate that LLMs excel most in abductive reasoning, followed by deductive reasoning, while they are least effective at inductive reasoning. We aim to shed light on the capabilities and potential pathways for enhancing logical reasoning proficiency in LLMs, paving the way for more advanced and nuanced developments in this critical field.

#computer-science #artificial-intelligence #computation-and-language

Reinforcement Learning for Out-of-Distribution Reasoning in LLMs: An Empirical Study on Diagnosis-Related Group Coding

14 Oct 2025

University of Illinois at Urbana-Champaign Mayo Clinic

Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.

#ai-for-health #computer-science #artificial-intelligence

Towards Explainable and Safe Conversational Agents for Mental Health: A Survey

25 Apr 2023

Mayo Clinic IIIT Hyderabad

This survey critically evaluates current conversational agents for mental health, highlighting their limitations in explainability, safety, and clinical grounding. It proposes a framework for building knowledge-infused, trustworthy virtual mental health assistants and advocates for new evaluation metrics for responsible development.

#ai-for-health #computer-science #conversational-ai

The Rise of Small Language Models in Healthcare: A Comprehensive Survey

25 Apr 2025

Shaina Raza

Vector Institute Mayo Clinic

Despite substantial progress in healthcare applications driven by large language models (LLMs), concerns around data privacy and limited resources are growing. Small language models (SLMs) offer a scalable and clinically viable solution for efficient performance in resource-constrained environments for next-generation healthcare informatics. Our comprehensive survey presents a taxonomic framework to identify and categorize them for healthcare professionals and informaticians. The timeline of healthcare SLM contributions establishes a foundational framework for analyzing models across three dimensions: NLP tasks, stakeholder roles, and the continuum of care. We present a taxonomic framework to identify the architectural foundations for building models from scratch; adapting SLMs to clinical precision through prompting, instruction fine-tuning, and reasoning; and accessibility and sustainability through compression techniques. Our primary objective is to offer a comprehensive survey for healthcare professionals, introducing recent innovations in model optimization and equipping them with curated resources to support future research and development in the field. Aiming to showcase the groundbreaking advancements in SLMs for healthcare, we present a comprehensive compilation of experimental results across widely studied NLP tasks in healthcare to highlight the transformative potential of SLMs in healthcare. The updated repository is available on GitHub.

#ai-for-health #computer-science #artificial-intelligence

MedYOLO: A Medical Image Object Detection Framework

07 Jun 2024

Artificial intelligence-enhanced identification of organs, lesions, and other structures in medical imaging is typically done using convolutional neural networks (CNNs) designed to make voxel-accurate segmentations of the region of interest. However, the labels required to train these CNNs are time-consuming to generate and require attention from subject matter experts to ensure quality. For tasks where voxel-level precision is not required, object detection models offer a viable alternative that can reduce annotation effort. Despite this potential application, there are few options for general purpose object detection frameworks available for 3-D medical imaging. We report on MedYOLO, a 3-D object detection framework using the one-shot detection method of the YOLO family of models and designed for use with medical imaging. We tested this model on four different datasets: BraTS, LIDC, an abdominal organ Computed Tomography (CT) dataset, and an ECG-gated heart CT dataset. We found our models achieve high performance on commonly present medium and large-sized structures such as the heart, liver, and pancreas even without hyperparameter tuning. However, the models struggle with very small or rarely present structures.

#ai-for-health #computer-science #computer-vision-security

CONFLARE: CONFormal LArge language model REtrieval

04 Apr 2024

Moein Shariatnia

Mayo Clinic Tehran University of Medical Sciences

Retrieval-augmented generation (RAG) frameworks enable large language models (LLMs) to retrieve relevant information from a knowledge base and incorporate it into the context for generating responses. This mitigates hallucinations and allows for the updating of knowledge without retraining the LLM. However, RAG does not guarantee valid responses if retrieval fails to identify the necessary information as the context for response generation. Also, if there is contradictory content, the RAG response will likely reflect only one of the two possible responses. Therefore, quantifying uncertainty in the retrieval process is crucial for ensuring RAG trustworthiness. In this report, we introduce a four-step framework for applying conformal prediction to quantify retrieval uncertainty in RAG frameworks. First, a calibration set of questions answerable from the knowledge base is constructed. Each question's embedding is compared against document embeddings to identify the most relevant document chunks containing the answer and record their similarity scores. Given a user-specified error rate (α), these similarity scores are then analyzed to determine a similarity score cutoff threshold. During inference, all chunks with similarity exceeding this threshold are retrieved to provide context to the LLM, ensuring the true answer is captured in the context with a (1-α) confidence level. We provide a Python package that enables users to implement the entire workflow proposed in our work, only using LLMs and without human intervention.

#computer-science #artificial-intelligence #computation-and-language
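
The CONFLARE abstract above describes choosing a similarity cutoff from calibration scores for a given error rate α. The sketch below shows one standard way to do this with split conformal prediction. It is not the authors' package or API. The function names, the `>=` comparison, and the finite-sample rule (take the k-th smallest calibration score with k = ⌊α(n+1)⌋, and retrieve everything when k is 0) are assumptions chosen for illustration.

```python
import math


def conformal_similarity_threshold(calibration_scores, alpha):
    """Cutoff such that a new answer-bearing chunk scores at or above it
    with probability at least 1 - alpha, assuming exchangeable scores.

    calibration_scores: similarity of each calibration question to the
    chunk that actually contains its answer (hypothetical input format).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")

    n = len(calibration_scores)
    # Number of calibration scores allowed to fall below the cutoff.
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration points for this alpha: retrieve every chunk.
        return float("-inf")
    return sorted(calibration_scores)[k - 1]


def retrieve_chunks(chunk_scores, threshold):
    """Return the ids of all chunks whose similarity meets the cutoff."""
    return [cid for cid, score in chunk_scores.items() if score >= threshold]


# Made-up calibration similarities, for illustration only.
calibration = [0.91, 0.62, 0.78, 0.85, 0.55, 0.70, 0.88, 0.66, 0.74]
cutoff = conformal_similarity_threshold(calibration, alpha=0.2)
print(cutoff)
print(retrieve_chunks({"a": 0.90, "b": 0.62, "c": 0.40}, cutoff))
```

A smaller α lowers the cutoff, so more chunks are retrieved per query in exchange for a stronger coverage guarantee.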
