alphaXiv

Beneficial AI Foundation

26 Sep 2025

agents computer-science machine-learning

A benchmark for vericoding: formally verified program synthesis

MIT, Beneficial AI Foundation

This research introduces "vericoding," the LLM-driven generation of formally verified code from formal specifications. It establishes the largest benchmark to date, comprising 12,504 tasks across Dafny, Verus, and Lean, and demonstrates that LLMs can achieve up to an 82.2% success rate in generating provably correct code for Dafny tasks, alongside a rapid increase in pure verification capabilities from 68% to 96% within a year on existing benchmarks.

08 Feb 2025

computer-science artificial-intelligence machine-learning

Proving the Coding Interview: A Benchmark for Formally Verified Code Generation

Beneficial AI Foundation

We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark, including 1083 curated and quality-controlled samples. Previously, APPS provided a benchmark and dataset for programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean's "sorry" keyword). On the 406 theorems of 100 randomly selected samples, Sonnet correctly proves 30% and Gemini correctly proves 18%. We challenge the machine learning and program synthesis communities to solve both the general-purpose programming problems and their associated correctness specifications. The benchmark is available at this https URL
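
To make the idea of "theorems given without proof" concrete, here is a minimal Lean 4 sketch. It is not taken from FVAPPS: the task, the function `maxOfTwo`, and both theorem names are hypothetical, chosen only to illustrate how a unit test such as "the result is at least as large as each input" can be generalized into a theorem whose proof is left open with `sorry`.

```lean
-- Hypothetical task (not from the benchmark): return the larger of two naturals.
def maxOfTwo (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- A unit test would check a few concrete inputs; a theorem states the
-- property for all inputs. The proofs are left open with `sorry`.
theorem maxOfTwo_ge_left (a b : Nat) : a ≤ maxOfTwo a b := by
  sorry

theorem maxOfTwo_ge_right (a b : Nat) : b ≤ maxOfTwo a b := by
  sorry
```

Lean accepts a file like this with a warning for each `sorry`; the task for a prover, human or model, is to replace each `sorry` with a proof that the checker accepts.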

01 Oct 2025

bayesian-deep-learning computer-science machine-learning

Analysis of Variational Sparse Autoencoders

University of Colorado, Beneficial AI Foundation

Sparse Autoencoders (SAEs) have emerged as a promising approach for interpreting neural network representations by learning sparse, human-interpretable features from dense activations. We investigate whether incorporating variational methods into SAE architectures can improve feature organization and interpretability. We introduce the Variational Sparse Autoencoder (vSAE), which replaces deterministic ReLU gating with stochastic sampling from learned Gaussian posteriors and incorporates KL divergence regularization toward a standard normal prior. Our hypothesis is that this probabilistic sampling creates dispersive pressure, causing features to organize more coherently in the latent space while avoiding overlap. We evaluate a TopK vSAE against a standard TopK SAE on Pythia-70M transformer residual stream activations using comprehensive benchmarks including SAE Bench, individual feature interpretability analysis, and global latent space visualization through t-SNE. The vSAE underperforms the standard SAE across core evaluation metrics, though it excels at feature independence and ablation metrics. The KL divergence term creates excessive regularization pressure that substantially reduces the fraction of living features, leading to the observed performance degradation. While vSAE features demonstrate improved robustness, they exhibit many more dead features than the baseline. Our findings suggest that naive application of variational methods to SAEs does not improve feature organization or interpretability.
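
The two ingredients the abstract names, sampling from a learned Gaussian posterior and a KL penalty toward a standard normal prior, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the posterior is assumed to be diagonal and parameterized by a mean and a log-variance per latent, and applying TopK after sampling is an assumption, since the abstract does not say how the two are combined.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1).

    Assumes a diagonal Gaussian posterior given by per-latent mean and
    log-variance (an assumption; the abstract does not give the form).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def top_k(z, k):
    """Keep the k largest entries of z and zero out the rest."""
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(z)]


if __name__ == "__main__":
    rng = random.Random(0)
    mu = [0.5, -1.0, 2.0, 0.0]
    log_var = [0.0, -1.0, -2.0, 0.0]
    z = sample_latent(mu, log_var, rng)
    print("sparse code:", top_k(z, 2))
    print("KL penalty:", kl_to_standard_normal(mu, log_var))
```

The KL term is zero only when every latent has mean 0 and variance 1, so it pulls all posteriors toward the same uninformative distribution. That is consistent with the abstract's report that the regularization pressure reduces the fraction of living features.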
