computational-engineering-finance-and-science
Large language models are increasingly deployed across the financial sector for tasks such as research, compliance, risk analysis, and customer service, which makes rigorous safety evaluation essential. However, existing financial benchmarks primarily focus on textbook-style question answering and numerical problem solving, but fail to evaluate models' real-world safety behaviors. They weakly assess regulatory compliance and investor-protection norms, rarely stress-test multi-turn adversarial tactics such as jailbreaks or prompt injection, inconsistently ground answers in long filings, ignore tool- or RAG-induced over-reach risks, and rely on opaque or non-auditable evaluation protocols. To close these gaps, we introduce CNFinBench, a benchmark that employs finance-tailored red-team dialogues and is structured around a Capability-Compliance-Safety triad, including evidence-grounded reasoning over long reports and jurisdiction-aware rule/tax compliance tasks. For systematic safety quantification, we introduce the Harmful Instruction Compliance Score (HICS) to measure how consistently models resist harmful prompts across multi-turn adversarial dialogues. To ensure auditability, CNFinBench enforces strict output formats with dynamic option perturbation for objective tasks and employs a hybrid LLM-ensemble plus human-calibrated judge for open-ended evaluations. Experiments on 21 models across 15 subtasks confirm a persistent capability-compliance gap: models achieve an average score of 61.0 on capability tasks but fall to 34.18 on compliance and risk-control evaluations. Under multi-turn adversarial dialogue tests, most systems reach only partial resistance (HICS 60-79), demonstrating that refusal alone is not a reliable proxy for safety without cited and verifiable reasoning.
We present a 3D polycrystal foundation model that learns a physically structured representation of voxel-based microstructures through large-scale self-supervised pretraining. The encoder is trained on a dataset of 100,000 FCC microstructures whose crystallographic orientations span the texture hull, using a masking strategy that forces the model to infer latent features from incomplete spatial information. The quality of the learned representation is evaluated through two downstream tasks with distinct physical characteristics. (i) Homogenized stiffness prediction: the pretrained encoder consistently outperforms the non-pretrained baseline across all masking ratios. (ii) Nonlinear response modeling: the encoder is coupled with an orientation-aware interaction-based deep material network (ODMN) to infer complete sets of network parameters, enabling accurate stress-strain predictions for previously unseen microstructures. In both tasks, the pretrained encoder demonstrates markedly stronger generalization capability. These results underscore the strong transferability of the proposed framework and its suitability for data-scarce scientific settings, where labeled microstructures are limited and physics-consistent generalization is essential. The foundation model provides a scalable route toward integration with experimentally derived microstructures, offering a new basis for microstructure-property reasoning in practical materials design.
Multivariate time series forecasts are widely used, such as industrial, transportation and financial forecasts. However, the dominant frequencies in time series may shift with the evolving spectral distribution of the data. Traditional Mixture of Experts (MoE) models, which employ a fixed number of experts, struggle to adapt to these changes, resulting in frequency coverage imbalance issue. Specifically, too few experts can lead to the overlooking of critical information, while too many can introduce noise. To this end, we propose Ada-MoGE, an adaptive Gaussian Mixture of Experts model. Ada-MoGE integrates spectral intensity and frequency response to adaptively determine the number of experts, ensuring alignment with the input data's frequency distribution. This approach prevents both information loss due to an insufficient number of experts and noise contamination from an excess of experts. Additionally, to prevent noise introduction from direct band truncation, we employ Gaussian band-pass filtering to smoothly decompose the frequency domain features, further optimizing the feature representation. The experimental results show that our model achieves state-of-the-art performance on six public benchmarks with only 0.2 million parameters.
Correlations in complex systems are often obscured by nonstationarity, long-range memory, and heavy-tailed fluctuations, which limit the usefulness of traditional covariance-based analyses. To address these challenges, we construct scale and fluctuation-dependent correlation matrices using the multifractal detrended cross-correlation coefficient ρr\rho_r that selectively emphasizes fluctuations of different amplitudes. We examine the spectral properties of these detrended correlation matrices and compare them to the spectral properties of the matrices calculated in the same way from synthetic Gaussian and qqGaussian signals. Our results show that detrending, heavy tails, and the fluctuation-order parameter rr jointly produce spectra, which substantially depart from the random case even under absence of cross-correlations in time series. Applying this framework to one-minute returns of 140 major cryptocurrencies from 2021-2024 reveals robust collective modes, including a dominant market factor and several sectoral components whose strength depends on the analyzed scale and fluctuation order. After filtering out the market mode, the empirical eigenvalue bulk aligns closely with the limit of random detrended cross-correlations, enabling clear identification of structurally significant outliers. Overall, the study provides a refined spectral baseline for detrended cross-correlations and offers a promising tool for distinguishing genuine interdependencies from noise in complex, nonstationary, heavy-tailed systems.
Tasou et al. present a comprehensive analysis of sparse computations in Deep Neural Network inference, providing a taxonomy of sparsity, reviewing state-of-the-art kernels, and empirically evaluating their performance on CPUs and GPUs. The work highlights significant untapped efficiency potential due to extremely low hardware utilization in current sparse implementations.
The SecureFinAI Lab at Columbia University developed an orchestration framework for financial agents (FinAgents) to enable agentic trading, moving beyond traditional algorithmic systems. This framework achieved superior risk-adjusted returns and reduced volatility and drawdown in both stock and cryptocurrency trading simulations, while enhancing accessibility to sophisticated financial strategies.
12
In component shape optimization, the component properties are often evaluated by computationally expensive simulations. Such optimization becomes unfeasible when it is focused on a global search requiring thousands of simulations to be evaluated. Here, we present a viable global shape optimization methodology based on multi-objective evolutionary algorithms accelerated by deep neural networks (DNNs). Our methodology alternates between evaluating simulations and utilizing the generated data to train DNNs with various architectures. When a suitable DNN architecture is identified, the DNN replaces the simulation in the rest of the global search. Our methodology was tested on five ZDT benchmark functions, showing itself at the level of and sometimes more flexible than other state-of-the-art acceleration approaches. Then, it was applied to a real-life optimization problem, namely the shape optimization of a single-phase ejector. Compared with a non-accelerated methodology, ours was able to save weeks of CPU time in solving this problem. To experimentally confirm the performance of the optimized ejector shapes, four of them were 3D printed and tested on the lab scale confirming the predicted performance. This suggests that our methodology could be used for acceleration of other real-life shape optimization problems.
The increasing prevalence of thyroid cancer globally has led to the development of various computer-aided detection methods. Accurate segmentation of thyroid nodules is a critical first step in the development of AI-assisted clinical decision support systems. This study focuses on instance segmentation of thyroid nodules using YOLOv5 algorithms on ultrasound images. We evaluated multiple YOLOv5 variants (Nano, Small, Medium, Large, and XLarge) across two dataset versions, with and without doppler images. The YOLOv5-Large algorithm achieved the highest performance with a dice score of 91\% and mAP of 0.87 on the dataset including doppler images. Notably, our results demonstrate that doppler images, typically excluded by physicians, can significantly improve segmentation performance. The YOLOv5-Small model achieved 79\% dice score when doppler images were excluded, while including them improved performance across all model variants. These findings suggest that instance segmentation with YOLOv5 provides an effective real-time approach for thyroid nodule detection, with potential clinical applications in automated diagnostic systems.
We curate the DeXposure dataset, the first large-scale dataset for inter-protocol credit exposure in decentralized financial networks, covering global markets of 43.7 million entries across 4.3 thousand protocols, 602 blockchains, and 24.3 thousand tokens, from 2020 to 2025. A new measure, value-linked credit exposure between protocols, is defined as the inferred financial dependency relationships derived from changes in Total Value Locked (TVL). We develop a token-to-protocol model using DefiLlama metadata to infer inter-protocol credit exposure from the token's stock dynamics, as reported by the protocols. Based on the curated dataset, we develop three benchmarks for machine learning research with financial applications: (1) graph clustering for global network measurement, tracking the structural evolution of credit exposure networks, (2) vector autoregression for sector-level credit exposure dynamics during major shocks (Terra and FTX), and (3) temporal graph neural networks for dynamic link prediction on temporal graphs. From the analysis, we observe (1) a rapid growth of network volume, (2) a trend of concentration to key protocols, (3) a decline of network density (the ratio of actual connections to possible connections), and (4) distinct shock propagation across sectors, such as lending platforms, trading exchanges, and asset management protocols. The DeXposure dataset and code have been released publicly. We envision they will help with research and practice in machine learning as well as financial risk monitoring, policy analysis, DeFi market modeling, amongst others. The dataset also contributes to machine learning research by offering benchmarks for graph clustering, vector autoregression, and temporal graph analysis.
2
The OmniScientist framework integrates AI agents within a simulated human scientific ecosystem by encoding collaborative and infrastructural norms, allowing AI to participate as "genuine scientists." The approach achieved superior literature review, a dramatic reduction in solution error for experiment automation, and enhanced human-AI collaboration in complex problem-solving.
Forecasting cryptocurrency prices is hindered by extreme volatility and a methodological dilemma between information-scarce univariate models and noise-prone full-multivariate models. This paper investigates a partial-multivariate approach to balance this trade-off, hypothesizing that a strategic subset of features offers superior predictive power. We apply the Partial-Multivariate Transformer (PMformer) to forecast daily returns for BTCUSDT and ETHUSDT, benchmarking it against eleven classical and deep learning models. Our empirical results yield two primary contributions. First, we demonstrate that the partial-multivariate strategy achieves significant statistical accuracy, effectively balancing informative signals with noise. Second, we experiment and discuss an observable disconnect between this statistical performance and practical trading utility; lower prediction error did not consistently translate to higher financial returns in simulations. This finding challenges the reliance on traditional error metrics and highlights the need to develop evaluation criteria more aligned with real-world financial objectives.
This work addresses the question of how generative artificial intelligence can be used to reduce the time required to set up electromagnetic simulation models. A chatbot based on a large language model is presented, enabling the automated generation of simulation models with various functional enhancements. A chatbot-driven workflow based on the large language model Google Gemini 2.0 Flash automatically generates and solves two-dimensional finite element eddy current models using Gmsh and GetDP. Python is used to coordinate and automate interactions between the workflow components. The study considers conductor geometries with circular cross-sections of variable position and number. Additionally, users can define custom post-processing routines and receive a concise summary of model information and simulation results. Each functional enhancement includes the corresponding architectural modifications and illustrative case studies.
Predictive modeling of stellarator plasmas is crucial for advancing nuclear fusion energy, yet it faces unique computational difficulties. One of the main challenges is accurately simulating the dynamics of specific particle species that are not well captured by fluid models, which necessitates the use of hybrid fluid-kinetic models. The non-axisymmetric geometry of stellarators fundamentally couples the toroidal Fourier modes, in contrast to what happens in tokamaks, requiring different numerical and computational treatment. This work presents a novel, globally coupled projection scheme inside the JOREK finite element framework. The approach ensures a self-consistent and physically accurate transfer of kinetic markers to the fluid grid, effectively handling the complex 3D mesh by constructing and solving a unified linear system that encompasses all toroidal harmonics simultaneously. To manage the computational complexity of this coupling, the construction of the system's matrix is significantly accelerated using the Fast Fourier Transform (FFT). The efficient localization of millions of particles is made possible by implementing a 3D R-Tree spatial index, which supports this projection and ensures computational tractability at scale. On realistic Wendelstein 7-X stellarator geometries, the fidelity of the framework is rigorously shown. In sharp contrast to the uncoupled approaches' poor performance, quantitative convergence tests verify that the coupled scheme attains the theoretically anticipated spectral convergence. This study offers a crucial capability for the predictive analysis and optimization of next-generation stellarator designs by developing a validated, high-fidelity computational tool.
Core-shell electrode particles are a promising morphology control strategy for high-performance lithium-ion batteries. However, experimental observations reveal that these structures remain prone to mechanical failure, with shell fractures and core-shell debonding occurring after a single charge. In this work, we present a novel, comprehensive computational framework to predict and gain insight into the failure of core-shell morphologies and the associated degradation in battery performance. The fully coupled chemo-mechano-damage model presented captures the interplay between mechanical damage and electrochemical behaviours, enabling the quantification of particle cracking and capacity fade. Both bulk material fracture and interface debonding are captured by utilising the phase field method. We quantify the severity of particle cracking and capacity loss through case studies on a representative core-shell system (NMC811@NMC532). The results bring valuable insights into cracking patterns, underlying mechanisms, and their impact on capacity loss. Surface cracks are found to initiate when a significantly higher lithium concentration accumulates in the core compared to the shell. Interfacial debonding is shown to arise from localised hoop stresses near the core-shell interface, due to greater shell expansion. This debonding develops rapidly, impedes lithium-ion transport, and can lead to more than 10\% capacity loss after a single discharge. Furthermore, larger particles may experience crack branching driven by extensive tensile zones, potentially fragmenting the entire particle. The framework developed can not only bring new insight into the degradation mechanisms of core-shell particles but also be used to design electrode materials with improved performance and extended lifetime.
AgenticSciML introduces a multi-agent AI framework designed to autonomously discover and refine scientific machine learning (SciML) methodologies. The system consistently discovered solutions that outperformed single-agent baselines by 10x to over 11,000x in error reduction, generating novel SciML strategies across six benchmark problems.
Machine learning (ML) and artificial intelligence (AI) algorithms are transforming and empowering the characterization and control of dynamic systems in the engineering, physical, and biological sciences. These emerging modeling paradigms require comparative metrics to evaluate a diverse set of scientific objectives, including forecasting, state reconstruction, generalization, and control, while also considering limited data scenarios and noisy measurements. We introduce a common task framework (CTF) for science and engineering, which features a growing collection of challenge data sets with a diverse set of practical and common objectives. The CTF is a critically enabling technology that has contributed to the rapid advance of ML/AI algorithms in traditional applications such as speech recognition, language processing, and computer vision. There is a critical need for the objective metrics of a CTF to compare the diverse algorithms being rapidly developed and deployed in practice today across science and engineering.
Large language models (LLMs) achieve strong performance across benchmarks--from knowledge quizzes and math reasoning to web-agent tasks--but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments--U.S. stocks and Polymarket prediction markets--differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.
28
Buried pipelines transporting oil and gas across geohazard-prone regions are exposed to potential ground movement, leading to the risk of significant strain demand and structural failure. Reliability analysis, which determines the probability of failure after accounting for pertinent uncertainties, is essential for ensuring the safety of pipeline systems. However, traditional reliability analysis methods involving computationally intensive numerical models, such as finite element simulations of pipeline subjected to ground movement, have limited applications; this is partly because stochastic sampling approaches require repeated simulations over a large number of samples for the uncertain variables when estimating low probabilities. This study introduces Physics-Informed Neural Network for Reliability Analysis (PINN-RA) for buried pipelines subjected to ground movement, which integrates PINN-based surrogate model with Monte Carlo Simulation (MCS) to achieve efficient reliability assessment. To enable its application under uncertain variables associated with soil properties and ground movement, the PINN-based surrogate model is extended to solve a parametric differential equation system, namely the governing equation of pipelines embedded in soil with different properties. The findings demonstrate that PINN-RA significantly reduces the computational effort required and thus accelerates reliability analysis. By eliminating the need for repetitive numerical evaluations of pipeline subjected to permanent ground movement, the proposed approach provides an efficient and scalable tool for pipeline reliability assessment, enabling rapid decision-making in geohazard-prone regions.
In the domain of corporate credit rating, traditional deep learning methods have improved predictive accuracy but still suffer from the inherent 'black-box' problem and limited interpretability. While incorporating non-financial information enriches the data and provides partial interpretability, the models still lack hierarchical reasoning mechanisms, limiting their comprehensive analytical capabilities. To address these challenges, we propose CreditXAI, a Multi-Agent System (MAS) framework that simulates the collaborative decision-making process of professional credit analysts. The framework focuses on business, financial, and governance risk dimensions to generate consistent and interpretable credit assessments. Experimental results demonstrate that multi-agent collaboration improves predictive accuracy by more than 7% over the best single-agent baseline, confirming its significant synergistic advantage in corporate credit risk evaluation. This study provides a new technical pathway to build intelligent and interpretable credit rating models.
Researchers from Renmin University of China and BAAI developed FinSight, a multi-agent AI system that generates high-quality, multimodal financial research reports comparable to those by human experts. The system achieved an overall score of 8.09 on a new benchmark, outperforming leading commercial deep research systems, and excelled in chart expressiveness and analytical insight.
There are no more papers matching your filters at the moment.