Ensemble AI
Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence between a simple Structured State-Space Model (SSM) and a masked attention mechanism. In particular, a state-space model with a scalar-times-identity state matrix is equivalent to masked self-attention with a $1$-semiseparable causal mask. Consequently, the same sequence transformation (model) has two algorithmic realizations: as a linear-time $O(T)$ recurrence or as a quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize this duality: (i) we extend SSD from the scalar-identity case to general diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs match the scalar case's training complexity lower bounds while supporting richer dynamics; (iii) we establish a necessary and sufficient condition under which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we show that such duality fails to extend to standard softmax attention due to rank explosion. Together, these results tighten the bridge between recurrent SSMs and Transformers and widen the design space for expressive yet efficient sequence models.
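The scalar-identity case of this duality can be checked numerically in a few lines. The sketch below is an illustration written from the statement above, not code from the cited paper; all variable names and shapes are our own choices. It computes the same sequence transformation once as a linear-time recurrence and once as masked attention with the $1$-semiseparable mask, and confirms the outputs agree.

```python
# Numerical sketch of the SSD duality for a scalar-times-identity state matrix.
import numpy as np

rng = np.random.default_rng(0)
T, d_state, d_in = 6, 4, 3               # sequence length, state size, input width
a = rng.uniform(0.5, 1.0, size=T)        # scalar decay a_t (state matrix = a_t * I)
B = rng.standard_normal((T, d_state))    # input projections B_t
C = rng.standard_normal((T, d_state))    # output projections C_t
x = rng.standard_normal((T, d_in))       # inputs x_t

# Recurrent O(T) form: h_t = a_t * h_{t-1} + B_t x_t^T,  y_t = C_t^T h_t
h = np.zeros((d_state, d_in))
y_rec = np.zeros((T, d_in))
for t in range(T):
    h = a[t] * h + np.outer(B[t], x[t])
    y_rec[t] = C[t] @ h

# Attention O(T^2) form: Y = (L ∘ (C B^T)) X, where the 1-semiseparable mask is
# L[t, s] = a_{s+1} * a_{s+2} * ... * a_t for t >= s, and 0 otherwise.
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1:t + 1])  # empty product = 1 when s == t
y_attn = (L * (C @ B.T)) @ x

print(np.allclose(y_rec, y_attn))  # True: the two realizations coincide
```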
We prove that a minimal Transformer with frozen weights emulates a broad class of algorithms by in-context prompting. We formalize two modes of in-context algorithm emulation. In the task-specific mode, for any continuous function $f: \mathbb{R} \to \mathbb{R}$, we show the existence of a single-head softmax attention layer whose forward pass reproduces functions of the form $f(w^\top x - y)$ to arbitrary precision. This general template subsumes many popular machine learning algorithms (e.g., gradient descent, linear regression, ridge regression). In the prompt-programmable mode, we prove universality: a single fixed-weight two-layer softmax attention module emulates all algorithms from the task-specific class (i.e., each implementable by a single softmax attention) via prompting alone. Our key idea is to construct prompts that encode an algorithm's parameters into token representations, creating sharp dot-product gaps that force the softmax attention to follow the intended computation. This construction requires no feed-forward layers and no parameter updates; all adaptation happens through the prompt. Numerical results corroborate our theory. These findings forge a direct link between in-context learning and algorithmic emulation, and offer a simple mechanism for large Transformers to serve as prompt-programmable libraries of algorithms. They illuminate how GPT-style foundation models may swap algorithms via prompts alone, and establish a form of algorithmic universality in modern Transformer models.
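As a toy illustration of the "sharp dot-product gap" mechanism described above (our own sketch, not the paper's construction; the frozen head, the candidate functions, and the scale $\beta$ are all illustrative assumptions), a fixed softmax attention head can be steered purely by the prompt to output whichever candidate computation the prompt's keys single out:

```python
# Toy sketch: a frozen softmax attention head "programmed" by the prompt alone.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frozen_attention(query, keys, values):
    """Single-head softmax attention with fixed (identity) projections."""
    scores = keys @ query            # dot products between prompt tokens and the query
    return softmax(scores) @ values

beta = 20.0                          # large scale -> sharp gaps -> near one-hot softmax
x = 0.7
candidates = [lambda t: 2 * t,       # f_0
              lambda t: t ** 2,      # f_1
              lambda t: -t]          # f_2
keys = beta * np.eye(len(candidates))                 # prompt keys: scaled basis vectors
values = np.array([[f(x)] for f in candidates])       # prompt values: candidate outputs

# Choosing which basis direction the query points to selects which f_i the head emulates.
for i in range(len(candidates)):
    query = np.eye(len(candidates))[i]
    out = frozen_attention(query, keys, values)[0]
    print(i, out, candidates[i](x))  # attention output ~ the selected f_i(x)
```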
In deep learning, processing multidimensional inputs (e.g., images, medical scans, and time series) is an important task that often requires flattening them. We introduce $\mathit{NdLinear}$, a drop-in replacement for linear layers that operates directly on tensors, requiring no flattening. By applying transformations separately along each dimension, NdLinear preserves native data structure while achieving dramatic parameter reductions, often by orders of magnitude, with minimal memory overhead. We prove NdLinear maintains expressivity through a structured Tucker decomposition while preserving VC-dimension scaling. Extensive experiments demonstrate NdLinear's capacity to achieve significant parameter reductions with substantial wall-clock efficiency gains and minimal memory overhead. For instance, our $\mathit{NdLinear}$-LoRA matches or exceeds standard LoRA on language reasoning tasks using up to $9\times$ fewer parameters. Experiments across CNNs, RNNs, Transformers, and MLPs on vision, language, time-series, and tabular tasks consistently demonstrate NdLinear's efficiency gains. While excelling at axis-separable tasks, NdLinear has limitations with entangled spatial interactions. By processing data in its original N-dimensional form, NdLinear provides a theoretically grounded, practical component for building more efficient neural architectures.
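The following is a minimal sketch of the per-axis factorized linear idea as we read it from the abstract; the class name, constructor arguments, and initialization are our own illustrative choices, not the released NdLinear API. Each non-batch axis of the input tensor gets its own small weight matrix, which is where the parameter savings come from.

```python
# Illustrative per-axis (mode-wise) linear layer in the spirit of NdLinear.
import numpy as np

class NdLinearSketch:
    """Applies an independent linear map along each non-batch axis of a tensor."""
    def __init__(self, in_dims, out_dims, rng=None):
        rng = rng or np.random.default_rng(0)
        # One small weight matrix per axis instead of one huge flattened matrix.
        self.weights = [rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
                        for d_in, d_out in zip(in_dims, out_dims)]

    def __call__(self, x):
        # x has shape (batch, *in_dims); transform axis k with weights[k-1].
        for k, w in enumerate(self.weights, start=1):
            x = np.moveaxis(np.tensordot(x, w, axes=([k], [0])), -1, k)
        return x

# Example: a (batch, 8, 8, 3) input mapped to (batch, 16, 16, 6).
layer = NdLinearSketch(in_dims=(8, 8, 3), out_dims=(16, 16, 6))
out = layer(np.random.default_rng(1).standard_normal((4, 8, 8, 3)))
print(out.shape)   # (4, 16, 16, 6)
# Parameters: 8*16 + 8*16 + 3*6 = 274, versus 8*8*3 * 16*16*6 = 294912 for a
# flattened dense layer -- the "orders of magnitude" reduction noted above.
```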
Given a database of bit strings $A_1,\ldots,A_m \in \{0,1\}^n$, a fundamental data structure task is to estimate the distances between a given query $B \in \{0,1\}^n$ and all the strings in the database. In addition, one might further want to ensure the integrity of the database by releasing these distance statistics in a secure manner. In this work, we propose differentially private (DP) data structures for this type of task, with a focus on Hamming and edit distance. On top of the strong privacy guarantees, our data structures are also time- and space-efficient. In particular, our data structure is $\epsilon$-DP against any sequence of queries of arbitrary length, and for any query $B$ such that the maximum distance to any string in the database is at most $k$, we output $m$ distance estimates. Moreover:
- For Hamming distance, our data structure answers any query in $\widetilde{O}(mk+n)$ time and each estimate deviates from the true distance by at most $\widetilde{O}(k/e^{\epsilon/\log k})$;
- For edit distance, our data structure answers any query in $\widetilde{O}(mk^2+n)$ time and each estimate deviates from the true distance by at most $\widetilde{O}(k/e^{\epsilon/(\log k \log n)})$.

For moderate $k$, both data structures support sublinear query time. We obtain these results via a novel adaptation of the randomized response technique as a bit flipping procedure, applied to the sketched strings.
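To make the randomized-response ingredient concrete, here is a hedged single-string sketch for Hamming distance: each stored bit is flipped independently, and the observed distance to a query is debiased. The paper applies this idea to sketched strings and supports arbitrary query sequences and edit distance; the version below works on a raw string for a single comparison, only to show why flipped bits still yield an unbiased distance estimate. All function names and parameter choices are our own.

```python
# Sketch of randomized-response bit flipping with debiased Hamming estimates.
import numpy as np

rng = np.random.default_rng(0)

def privatize(bits, eps):
    """Randomized response: flip each stored bit with probability 1 / (1 + e^eps)."""
    p = 1.0 / (1.0 + np.exp(eps))
    flips = rng.random(bits.shape) < p
    return np.where(flips, 1 - bits, bits), p

def estimate_hamming(noisy_bits, p, query):
    """Debias the observed distance: E[observed] = n*p + true*(1 - 2p)."""
    n = query.size
    observed = np.count_nonzero(noisy_bits != query)
    return (observed - n * p) / (1.0 - 2.0 * p)

n, eps = 10_000, 2.0
A = rng.integers(0, 2, size=n)
B = A.copy(); B[:2_500] ^= 1                    # query at true Hamming distance 2500
A_noisy, p = privatize(A, eps)
print(round(estimate_hamming(A_noisy, p, B)))   # unbiased estimate, close to 2500
```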
We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only an improved privacy-utility tradeoff but also better efficiency over prior results. Specifically, we study the following mathematical problem: given a similarity function $f$ (or DP KDE) and a private dataset $X \subset \mathbb{R}^d$, our goal is to preprocess $X$ so that for any query $y \in \mathbb{R}^d$, we approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. The best previous algorithm for $f(x,y) = \|x - y\|_1$ is the node-contaminated balanced binary tree of [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and time for preprocessing with $n = |X|$. For any query point, the query time is $d \log n$, with a $(1+\alpha)$-approximation guarantee and additive error $\epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n$. In this paper, we improve the previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects:
- We reduce the query time by a factor of $\alpha^{-1} \log n$.
- We improve the approximation ratio from $1+\alpha$ to $1$.
- We reduce the error dependence by a factor of $\alpha^{-0.5}$.

From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, for each query, the answer is split into $\alpha^{-1} \log n$ numbers, each derived from the summation of $\log n$ values in interval tree countings. In contrast, we construct the tree differently, splitting the answer into $\log n$ numbers, where each is a smart combination of two distance values, two counting values, and $y$ itself. We believe our tree structure may be of independent interest.
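To see why counting values, summation values, and $y$ itself can reconstruct the $\ell_1$ kernel sum, note the one-dimensional identity $\sum_{x} |x - y| = y\,|\{x \le y\}| - \sum_{x \le y} x + \sum_{x > y} x - y\,|\{x > y\}|$. The sketch below is a flat, single-query illustration of this identity with Laplace noise added to the counts and sums; it is not the paper's tree construction, which is what makes the error bounds hold over arbitrary query sequences, and the function names and noise scales are our own assumptions.

```python
# Flat 1-D illustration (not the paper's tree): estimate sum_x |x - y| from noisy
# side counts and side sums.  Fresh noise per query would not stay eps-DP across
# many queries; the tree construction handles that.
import numpy as np

rng = np.random.default_rng(0)

def dp_l1_sum_1d(X, y, eps, R):
    """Noisy estimate of sum_x |x - y| for data X in [-R, R] (illustrative only)."""
    lap = lambda sens: rng.laplace(scale=sens / eps)   # Laplace noise, scale = sensitivity / eps
    below = X <= y
    c_lo = np.count_nonzero(below) + lap(1.0)          # count of points <= y
    c_hi = np.count_nonzero(~below) + lap(1.0)         # count of points > y
    s_lo = X[below].sum() + lap(R)                     # sum of points <= y
    s_hi = X[~below].sum() + lap(R)                    # sum of points > y
    # sum_x |x - y| = y*#{x<=y} - sum_{x<=y} x + sum_{x>y} x - y*#{x>y}
    return y * c_lo - s_lo + s_hi - y * c_hi

R = 1.0
X = rng.uniform(-R, R, size=5_000)
y = 0.3
print(dp_l1_sum_1d(X, y, eps=1.0, R=R), np.abs(X - y).sum())   # estimate vs. truth
```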