Ensemble AI
Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence between a simple Structured State-Space Model (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to masked self-attention with a $1$-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time $O(T)$ recurrence or as a quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar-identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case's training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten the bridge between recurrent SSMs and Transformers and widen the design space for expressive yet efficient sequence models.
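The scalar-identity case of this duality can be checked numerically in a few lines. The sketch below is an illustration written from the statement above, not code from the cited paper; all variable names and shapes are our own choices. It computes the same sequence transformation once as a linear-time recurrence and once as masked attention with the $1$-semiseparable mask, and confirms the outputs agree.

```python
# Numerical sketch of the SSD duality for a scalar-times-identity state matrix.
import numpy as np

rng = np.random.default_rng(0)
T, d_state, d_in = 6, 4, 3               # sequence length, state size, input width
a = rng.uniform(0.5, 1.0, size=T)        # scalar decay a_t (state matrix = a_t * I)
B = rng.standard_normal((T, d_state))    # input projections B_t
C = rng.standard_normal((T, d_state))    # output projections C_t
x = rng.standard_normal((T, d_in))       # inputs x_t

# Recurrent O(T) form: h_t = a_t * h_{t-1} + B_t x_t^T,  y_t = C_t^T h_t
h = np.zeros((d_state, d_in))
y_rec = np.zeros((T, d_in))
for t in range(T):
    h = a[t] * h + np.outer(B[t], x[t])
    y_rec[t] = C[t] @ h

# Attention O(T^2) form: Y = (L ∘ (C B^T)) X, where the 1-semiseparable mask is
# L[t, s] = a_{s+1} * a_{s+2} * ... * a_t for t >= s, and 0 otherwise.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1:t + 1])  # empty product = 1 when s == t
y_attn = (L * (C @ B.T)) @ x

print(np.allclose(y_rec, y_attn))  # True: the two realizations coincide
```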
We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the task-specific mode, for any continuous function $f: \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the prompt-programmable mode, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via prompting alone. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates; all adaptation happens through the prompt. Numerical results corroborate our theory. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
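As a toy illustration of the "sharp dot-product gap" mechanism described above (our own sketch, not the paper's construction; the frozen head, the candidate functions, and the scale $\beta$ are all illustrative assumptions), a fixed softmax attention head can be steered purely by the prompt to output whichever candidate computation the prompt's keys single out:

```python
# Toy sketch: a frozen softmax attention head "programmed" by the prompt alone.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frozen_attention(query, keys, values):
    """Single-head softmax attention with fixed (identity) projections."""
    scores = keys @ query            # dot products between prompt tokens and the query
    return softmax(scores) @ values

beta = 20.0                          # large scale -> sharp gaps -> near one-hot softmax
x = 0.7
candidates = [lambda t: 2 * t,       # f_0
              lambda t: t ** 2,      # f_1
              lambda t: -t]          # f_2
keys = beta * np.eye(len(candidates))                 # prompt keys: scaled basis vectors
values = np.array([[f(x)] for f in candidates])       # prompt values: candidate outputs

# Choosing which basis direction the query points to selects which f_i the head emulates.
for i in range(len(candidates)):
    query = np.eye(len(candidates))[i]
    out = frozen_attention(query, keys, values)[0]
    print(i, out, candidates[i](x))  # attention output ~ the selected f_i(x)
```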
In deep learning, processing multidimensional inputs (e.g., images, medical scans, and time series) is an important task that often requires flattening them. We introduce $\mathit{NdLinear}$, a drop-in replacement for linear layers that operates directly on tensors, requiring no flattening. By applying transformations separately along each dimension, NdLinear preserves native data structure while achieving dramatic parameter reductions, often by orders of magnitude, with minimal memory overhead. We prove NdLinear maintains expressivity through a structured Tucker decomposition while preserving VC-dimension scaling. Extensive experiments demonstrate NdLinear's capacity to achieve significant parameter reductions with substantial wall-clock efficiency gains and minimal memory overhead. For instance, our $\mathit{NdLinear}$-LoRA matches or exceeds standard LoRA on language reasoning tasks using up to $9\times$ fewer parameters. Experiments across CNNs, RNNs, Transformers, and MLPs on vision, language, time-series, and tabular tasks consistently demonstrate NdLinear's efficiency gains. While excelling at axis-separable tasks, NdLinear has limitations with entangled spatial interactions. By processing data in its original N-dimensional form, NdLinear provides a theoretically grounded, practical component for building more efficient neural architectures.
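The following is a minimal sketch of the per-axis factorized linear idea as we read it from the abstract; the class name, constructor arguments, and initialization are our own illustrative choices, not the released NdLinear API. Each non-batch axis of the input tensor gets its own small weight matrix, which is where the parameter savings come from.

```python
# Illustrative per-axis (mode-wise) linear layer in the spirit of NdLinear.
import numpy as np

class NdLinearSketch:
    """Applies an independent linear map along each non-batch axis of a tensor."""
    def __init__(self, in_dims, out_dims, rng=None):
        rng = rng or np.random.default_rng(0)
        # One small weight matrix per axis instead of one huge flattened matrix.
        self.weights = [rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
                        for d_in, d_out in zip(in_dims, out_dims)]

    def __call__(self, x):
        # x has shape (batch, *in_dims); transform axis k with weights[k-1].
        for k, w in enumerate(self.weights, start=1):
            x = np.moveaxis(np.tensordot(x, w, axes=([k], [0])), -1, k)
        return x

# Example: a (batch, 8, 8, 3) input mapped to (batch, 16, 16, 6).
layer = NdLinearSketch(in_dims=(8, 8, 3), out_dims=(16, 16, 6))
out = layer(np.random.default_rng(1).standard_normal((4, 8, 8, 3)))
print(out.shape)   # (4, 16, 16, 6)
# Parameters: 8*16 + 8*16 + 3*6 = 274, versus 8*8*3 * 16*16*6 = 294912 for a
# flattened dense layer -- the "orders of magnitude" reduction noted above.
```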
Given a database of bit strings $A_1,\ldots,A_m \in \{0,1\}^n$, a fundamental data structure task is to estimate the distances between a given query $B \in \{0,1\}^n$ and all the strings in the database. In addition, one might further want to ensure the integrity of the database by releasing these distance statistics in a secure manner. In this work, we propose differentially private (DP) data structures for this type of task, with a focus on Hamming and edit distance. On top of the strong privacy guarantees, our data structures are also time- and space-efficient. In particular, our data structure is $\epsilon$-DP against any sequence of queries of arbitrary length, and for any query $B$ such that the maximum distance to any string in the database is at most $k$, we output $m$ distance estimates. Moreover:
- For Hamming distance, our data structure answers any query in $\widetilde{O}(mk+n)$ time and each estimate deviates from the true distance by at most $\widetilde{O}(k/e^{\epsilon/\log k})$;
- For edit distance, our data structure answers any query in $\widetilde{O}(mk^2+n)$ time and each estimate deviates from the true distance by at most $\widetilde{O}(k/e^{\epsilon/(\log k \log n)})$.

For moderate $k$, both data structures support sublinear query time. We obtain these results via a novel adaptation of the randomized response technique as a bit flipping procedure, applied to the sketched strings.
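To make the randomized-response ingredient concrete, here is a hedged single-string sketch for Hamming distance: each stored bit is flipped independently, and the observed distance to a query is debiased. The paper applies this idea to sketched strings and supports arbitrary query sequences and edit distance; the version below works on a raw string for a single comparison, only to show why flipped bits still yield an unbiased distance estimate. All function names and parameter choices are our own.

```python
# Sketch of randomized-response bit flipping with debiased Hamming estimates.
import numpy as np

rng = np.random.default_rng(0)

def privatize(bits, eps):
    """Randomized response: flip each stored bit with probability 1 / (1 + e^eps)."""
    p = 1.0 / (1.0 + np.exp(eps))
    flips = rng.random(bits.shape) < p
    return np.where(flips, 1 - bits, bits), p

def estimate_hamming(noisy_bits, p, query):
    """Debias the observed distance: E[observed] = n*p + true*(1 - 2p)."""
    n = query.size
    observed = np.count_nonzero(noisy_bits != query)
    return (observed - n * p) / (1.0 - 2.0 * p)

n, eps = 10_000, 2.0
A = rng.integers(0, 2, size=n)
B = A.copy(); B[:2_500] ^= 1                    # query at true Hamming distance 2500
A_noisy, p = privatize(A, eps)
print(round(estimate_hamming(A_noisy, p, B)))   # unbiased estimate, close to 2500
```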
We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only an improved privacy-utility tradeoff but also better efficiency over prior results. Specifically, we study the following mathematical problem: given a similarity function $f$ (or DP KDE) and a private dataset $X \subset \mathbb{R}^d$, our goal is to preprocess $X$ so that for any query $y \in \mathbb{R}^d$, we approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. The best previous algorithm for $f(x,y) = \|x - y\|_1$ is the node-contaminated balanced binary tree of [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and time for preprocessing with $n = |X|$. For any query point, the query time is $d \log n$, with a $(1+\alpha)$-approximation guarantee and additive error $\epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n$. In this paper, we improve the previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects:
- We reduce the query time by a factor of $\alpha^{-1} \log n$.
- We improve the approximation ratio from $1+\alpha$ to $1$.
- We reduce the error dependence by a factor of $\alpha^{-0.5}$.

From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, for each query, the answer is split into $\alpha^{-1} \log n$ numbers, each derived from the summation of $\log n$ values in interval tree countings. In contrast, we construct the tree differently, splitting the answer into $\log n$ numbers, where each is a smart combination of two distance values, two counting values, and $y$ itself. We believe our tree structure may be of independent interest.
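To see why counting values, summation values, and $y$ itself can reconstruct the $\ell_1$ kernel sum, note the one-dimensional identity $\sum_{x} |x - y| = y\,|\{x \le y\}| - \sum_{x \le y} x + \sum_{x > y} x - y\,|\{x > y\}|$. The sketch below is a flat, single-query illustration of this identity with Laplace noise added to the counts and sums; it is not the paper's tree construction, which is what makes the error bounds hold over arbitrary query sequences, and the function names and noise scales are our own assumptions.

```python
# Flat 1-D illustration (not the paper's tree): estimate sum_x |x - y| from noisy
# side counts and side sums.  Fresh noise per query would not stay eps-DP across
# many queries; the tree construction handles that.
import numpy as np

rng = np.random.default_rng(0)

def dp_l1_sum_1d(X, y, eps, R):
    """Noisy estimate of sum_x |x - y| for data X in [-R, R] (illustrative only)."""
    lap = lambda sens: rng.laplace(scale=sens / eps)   # Laplace noise, scale = sensitivity / eps
    below = X <= y
    c_lo = np.count_nonzero(below) + lap(1.0)          # count of points <= y
    c_hi = np.count_nonzero(~below) + lap(1.0)         # count of points > y
    s_lo = X[below].sum() + lap(R)                     # sum of points <= y
    s_hi = X[~below].sum() + lap(R)                    # sum of points > y
    # sum_x |x - y| = y*#{x<=y} - sum_{x<=y} x + sum_{x>y} x - y*#{x>y}
    return y * c_lo - s_lo + s_hi - y * c_hi

R = 1.0
X = rng.uniform(-R, R, size=5_000)
y = 0.3
print(dp_l1_sum_1d(X, y, eps=1.0, R=R), np.abs(X - y).sum())   # estimate vs. truth
```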