Researchers from Yandex, HSE University, and IST Austria developed "Hogwild! Inference," a protocol enabling multiple Large Language Models (LLMs) to generate text concurrently by sharing a dynamically updated attention Key-Value (KV) cache. This mechanism allows LLMs to self-organize and collaborate in real time, achieving up to a 3.6x speedup and improved accuracy on complex reasoning tasks.
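A minimal conceptual sketch of the shared-cache idea, assuming a hypothetical single-step decode interface (the real Hogwild! protocol, cache layout, and synchronization differ):

```python
import threading

class SharedKVCache:
    """Toy stand-in for a shared cache: workers append entries all can read."""
    def __init__(self):
        self.entries = []            # (worker_id, token) in arrival order
        self.lock = threading.Lock()

    def append(self, worker_id, token):
        with self.lock:
            self.entries.append((worker_id, token))

    def snapshot(self):
        with self.lock:
            return list(self.entries)

def worker(worker_id, decode_step, cache, n_steps):
    for _ in range(n_steps):
        context = cache.snapshot()                # sees tokens from all workers
        token = decode_step(worker_id, context)   # placeholder for an LLM step
        cache.append(worker_id, token)            # immediately visible to others

cache = SharedKVCache()
dummy_step = lambda wid, ctx: f"w{wid}-tok{len(ctx)}"   # stands in for a real model
threads = [threading.Thread(target=worker, args=(i, dummy_step, cache, 5))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache.snapshot())
```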
Researchers at Yandex and HSE University developed TabM, an MLP-based model that utilizes parameter-efficient ensembling to achieve state-of-the-art performance on tabular data. This approach demonstrates that simple MLP architectures, when enhanced with ensemble techniques, can outperform more complex deep learning models while maintaining high efficiency across a diverse set of 46 datasets.
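A hedged sketch of the kind of parameter-efficient (BatchEnsemble-style) ensembling such a model can build on: one shared weight matrix plus cheap per-member scaling vectors. Names and dimensions are illustrative, not TabM's exact implementation:

```python
import torch
import torch.nn as nn

class EnsembleLinear(nn.Module):
    """One shared weight matrix, k cheap ensemble members via rank-1 scaling."""
    def __init__(self, d_in, d_out, k):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_in, d_out) * d_in ** -0.5)  # shared
        self.r = nn.Parameter(torch.ones(k, d_in))    # per-member input scaling
        self.s = nn.Parameter(torch.ones(k, d_out))   # per-member output scaling
        self.bias = nn.Parameter(torch.zeros(k, d_out))

    def forward(self, x):                  # x: (batch, k, d_in)
        h = (x * self.r) @ self.weight     # shared matmul, member-specific inputs
        return h * self.s + self.bias

layer = EnsembleLinear(d_in=8, d_out=16, k=4)
x = torch.randn(32, 1, 8).expand(32, 4, 8)    # same features fed to all 4 members
print(layer(x).shape)                          # torch.Size([32, 4, 16])
```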
FlexGen enables high-throughput generative inference of large language models like OPT-175B on a single commodity GPU. It achieves this with a novel memory offloading and scheduling system that optimally manages the GPU-CPU-disk hierarchy and leverages 4-bit quantization, yielding up to 112x higher throughput than existing systems.
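A toy illustration of just-in-time weight offloading, the simplest ingredient of such a system; FlexGen's actual scheduler also offloads the KV cache and activations, overlaps I/O with compute, and applies 4-bit compression, none of which is shown here:

```python
import torch
import torch.nn as nn

layers = [nn.Linear(1024, 1024) for _ in range(8)]   # weights live on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

def offloaded_forward(x):
    x = x.to(device)
    for layer in layers:
        layer.to(device)            # load this layer's weights just in time
        x = torch.relu(layer(x))
        layer.to("cpu")             # release GPU memory before the next layer
    return x

print(offloaded_forward(torch.randn(4, 1024)).shape)
```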
Researchers from Yandex, MIPT, and HSE conducted a rigorous evaluation of deep learning architectures for tabular data, introducing a ResNet-like baseline and the FT-Transformer. Their work clarifies that while the FT-Transformer achieves state-of-the-art deep learning performance and excels on certain datasets, Gradient Boosted Decision Trees retain superiority on others, underscoring that no single model universally outperforms the rest.
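A rough sketch of the FT-Transformer-style feature tokenizer, in which each numerical feature becomes a token via a per-feature affine embedding and categorical features via lookup tables, with a [CLS] token prepended before a standard Transformer encoder (hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    def __init__(self, n_num, cat_cardinalities, d):
        super().__init__()
        self.num_weight = nn.Parameter(torch.randn(n_num, d) * d ** -0.5)
        self.num_bias = nn.Parameter(torch.zeros(n_num, d))
        self.cat_embeddings = nn.ModuleList(
            [nn.Embedding(c, d) for c in cat_cardinalities])
        self.cls = nn.Parameter(torch.randn(1, 1, d) * d ** -0.5)

    def forward(self, x_num, x_cat):            # (B, n_num), (B, n_cat)
        num_tokens = x_num.unsqueeze(-1) * self.num_weight + self.num_bias
        cat_tokens = torch.stack(
            [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeddings)], dim=1)
        cls = self.cls.expand(x_num.size(0), -1, -1)
        return torch.cat([cls, num_tokens, cat_tokens], dim=1)

tokenizer = FeatureTokenizer(n_num=3, cat_cardinalities=[5, 7], d=32)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(32, 1)

tokens = tokenizer(torch.randn(16, 3), torch.randint(0, 5, (16, 2)))
prediction = head(encoder(tokens)[:, 0])        # predict from the [CLS] token
```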
TabDDPM adapts denoising diffusion probabilistic models to heterogeneous tabular data, generating synthetic datasets with more realistic feature distributions and inter-feature correlations. The approach outperforms existing GAN/VAE methods in machine learning utility while offering improved privacy over interpolation-based techniques like SMOTE.
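A minimal forward-diffusion step for the numerical columns of a tabular row, assuming standardized features and a linear variance schedule; categorical columns use multinomial diffusion in the paper and are not shown:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear variance schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) for standardized numerical features x0."""
    a = alphas_cumprod[t].sqrt()
    s = (1.0 - alphas_cumprod[t]).sqrt()
    return a * x0 + s * torch.randn_like(x0)

x0 = torch.randn(8, 5)        # 8 rows, 5 standardized numerical features
x_t = q_sample(x0, t=500)     # the denoiser is trained to invert such steps
```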
Foundation models are an emerging research direction in tabular deep learning. Notably, TabPFNv2 recently claimed superior performance over traditional GBDT-based methods on small-scale datasets using an in-context learning paradigm, which does not adapt model parameters to target datasets. However, the optimal finetuning approach for adapting tabular foundation models, and how this adaptation reshapes their internal mechanisms, remain underexplored. While prior works studied finetuning for earlier foundation models, inconsistent findings and TabPFNv2's unique architecture necessitate a fresh investigation. To address these questions, we first systematically evaluate various finetuning strategies on diverse datasets. Our findings establish full finetuning as the most practical solution for TabPFNv2 in terms of time-efficiency and effectiveness. We then investigate how finetuning alters TabPFNv2's inner mechanisms, drawing an analogy to retrieval-augmented models. We reveal that the success of finetuning stems from the fact that, after gradient-based adaptation, the dot products of the query representations of test objects and the key representations of in-context training objects more accurately reflect their target similarity. This improved similarity allows finetuned TabPFNv2 to better approximate the target dependency by appropriately weighting relevant in-context samples, improving the retrieval-based prediction logic. From a practical perspective, we managed to finetune TabPFNv2 on datasets with up to 50K objects, observing performance improvements on almost all tasks. More precisely, on academic datasets with I.I.D. splits, finetuning allows TabPFNv2 to achieve state-of-the-art results, while on datasets with gradual temporal shifts and rich feature sets, TabPFNv2 is less stable and prior methods remain better.
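A tiny illustration of the retrieval-style reading used in this analysis: a test query attends over the keys of in-context training objects, and the prediction is the similarity-weighted combination of their targets. This is an analogy, not TabPFNv2's literal forward pass:

```python
import torch

def retrieval_style_predict(q_test, k_train, y_train):
    """Prediction as a similarity-weighted average of in-context targets."""
    # q_test: (n_test, d), k_train: (n_train, d), y_train: (n_train,)
    scores = q_test @ k_train.t() / k_train.size(1) ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ y_train

q = torch.randn(5, 32)        # query representations of test objects
k = torch.randn(100, 32)      # key representations of in-context training objects
y = torch.randn(100)          # their targets
print(retrieval_style_predict(q, k, y).shape)   # torch.Size([5])
```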
CatBoost, developed by Yandex, introduces a gradient boosting framework that addresses prediction shift and target leakage in categorical feature processing. It achieves superior predictive quality across various benchmarks by employing ordered boosting and ordered target statistics.
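An illustrative sketch of ordered target statistics: each row's categorical value is encoded using only the targets of rows that precede it in a random permutation, which is what prevents target leakage. CatBoost itself averages over several permutations and combines this with ordered boosting:

```python
import numpy as np

def ordered_target_statistics(categories, targets, a=1.0, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    prior = targets.mean()
    sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for idx in order:                                # "history" = earlier rows only
        c = categories[idx]
        encoded[idx] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + targets[idx]    # update history after encoding
        counts[c] = counts.get(c, 0) + 1
    return encoded

cats = np.array(["red", "blue", "red", "red", "blue"])
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_statistics(cats, y))
```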
SpQR introduces a sparse-quantized representation that achieves near-lossless compression of large language model weights to 3-4 bits, reducing memory footprint by over 3.4x while maintaining perplexity within 1% of 16-bit models and accelerating inference by 20-30%.
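A simplified sketch of the sparse-plus-quantized idea: quantize a weight matrix to a few bits, then keep the weights with the largest quantization error as a sparse high-precision correction. SpQR's real format (per-group bilevel quantization, CSR outlier storage) is more involved:

```python
import torch

def sparse_quantize(w, bits=3, outlier_frac=0.01):
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    q = torch.round((w - w.min()) / scale)           # integer codes in [0, levels]
    deq = q * scale + w.min()
    err = (w - deq).abs()
    k = max(1, int(outlier_frac * w.numel()))
    threshold = err.flatten().topk(k).values.min()
    outlier_mask = err >= threshold                  # keep the worst-quantized weights
    outliers = torch.where(outlier_mask, w, torch.zeros_like(w))
    return deq, outliers, outlier_mask

w = torch.randn(64, 64)
deq, outliers, mask = sparse_quantize(w)
reconstructed = torch.where(mask, outliers, deq)     # dense low-bit + sparse corrections
print((w - reconstructed).abs().mean())
```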
A deep convolutional generator network, even when randomly initialized and without prior training, can serve as a powerful implicit prior for various image restoration tasks. The study demonstrates that optimizing this network's weights to fit a single degraded image can effectively perform denoising, super-resolution, and inpainting, leveraging the inherent structural biases of the network itself.
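A minimal Deep-Image-Prior-style denoising loop, with a toy ConvNet standing in for the hourglass generator used in the paper; a fixed random input is mapped to the single noisy image, and early stopping acts as the regularizer:

```python
import torch
import torch.nn as nn

net = nn.Sequential(                       # toy generator, randomly initialized
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
)
noisy = torch.rand(1, 3, 64, 64)           # the single degraded image
z = torch.rand(1, 32, 64, 64)              # fixed random input, never changed
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(300):                    # stop early: the network itself is the prior
    optimizer.zero_grad()
    loss = ((net(z) - noisy) ** 2).mean()
    loss.backward()
    optimizer.step()

denoised = net(z).detach()
```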
This paper identifies Instance Normalization (IN) as a crucial architectural component for improving feed-forward neural style transfer, enabling real-time stylization networks to produce images of comparable quality to the slower iterative methods. Replacing Batch Normalization with IN resolves issues like image artifacts and achieves high-fidelity results across different generator architectures and resolutions.
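A quick illustration of the swap in question: normalize each image's channels with its own spatial statistics (InstanceNorm) rather than batch statistics (BatchNorm). The generator below is a toy, not a full stylization network:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, norm):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), norm(c_out), nn.ReLU())

# The same toy generator with BatchNorm (original setups) vs. InstanceNorm (the swap).
g_bn = nn.Sequential(conv_block(3, 32, nn.BatchNorm2d), conv_block(32, 3, nn.BatchNorm2d))
g_in = nn.Sequential(conv_block(3, 32, nn.InstanceNorm2d), conv_block(32, 3, nn.InstanceNorm2d))

# InstanceNorm normalizes every sample's channels with its own spatial statistics:
x = torch.randn(8, 3, 64, 64)
mu = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
manual = (x - mu) / torch.sqrt(var + 1e-5)
print(torch.allclose(manual, nn.InstanceNorm2d(3)(x), atol=1e-4))   # True
```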
This research analyzes how various modern deep learning techniques achieve competitive performance on tabular data by examining their implicit mechanisms for handling data (aleatoric) uncertainty. The study reveals that methods such as numerical feature embeddings, retrieval-augmented models, and advanced ensembling strategies provide robustness in high-uncertainty regions, explaining their effectiveness.
We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to ≈2× speedup over speculative decoding at the cost of a ≤1% drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting ≥25 tokens per speculation cycle at a 2% drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
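A schematic of the relaxed acceptance rule described above, with hypothetical placeholders for the models and classifier rather than AutoJudge's actual API:

```python
def relaxed_verify(draft_tokens, target_tokens, hidden_states, classifier, threshold=0.5):
    """Return how many draft tokens to accept in one speculation cycle."""
    accepted = 0
    for d_tok, t_tok, h in zip(draft_tokens, target_tokens, hidden_states):
        if d_tok == t_tok:
            accepted += 1                    # exact match: always accept
        elif classifier(h) >= threshold:     # mismatch judged unimportant: keep draft token
            accepted += 1
        else:
            break                            # important mismatch: fall back to the target model
    return accepted
```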
A decentralized model-parallel algorithm, SWARM parallelism, enables efficient training of billion-scale neural networks using unreliable, heterogeneous, and low-bandwidth computing resources. It successfully trained a 1.01 billion parameter Transformer on 400 preemptible T4 GPUs over a public cloud network with less than 200 Mb/s of bandwidth, demonstrating convergence comparable to models trained on high-end A100 GPUs.
Researchers at Yandex and collaborating universities demonstrated that only a small subset of Transformer attention heads are crucial for neural machine translation, identifying specialized functions like positional, syntactic, and rare word attention. Their novel pruning method allows removing up to 90% of encoder heads with minimal BLEU score degradation, enhancing efficiency and interpretability.
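A simplified sketch of gating attention heads so unimportant ones can be pruned: each head's output is scaled by a gate, and a sparsity penalty pushes gates toward zero. The paper uses stochastic hard-concrete gates with an L0 relaxation; the deterministic sigmoid gates here are a simplification:

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Scale each attention head's output by a learnable gate."""
    def __init__(self, n_heads):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(n_heads))

    def forward(self, head_outputs):                 # (batch, n_heads, seq, d_head)
        gates = torch.sigmoid(self.logits)           # one gate per head
        return head_outputs * gates.view(1, -1, 1, 1), gates

gate = GatedHeads(n_heads=8)
outputs, gates = gate(torch.randn(2, 8, 16, 64))
sparsity_penalty = gates.sum()    # add to the loss; heads with gates near 0 are pruned
```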
SEQUOIA presents a speculative decoding method that uses a dynamic programming approach to find optimal token tree topologies and a robust sampling-without-replacement verification algorithm. This approach achieves up to 4.04x speedup for Llama2-7B on A100 GPUs and up to 9.5x speedup for Llama3-70B in offloading settings on L40 GPUs, consistently outperforming prior methods.
There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well-understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning - a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies, and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly-performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama 2 family models at 2 bits per parameter.
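For context, a minimal version of the straight-through estimator (STE) baseline that this work questions: the forward pass uses quantized weights while gradients flow to the continuous weights as if quantization were the identity. PV-Tuning's alternating updates of continuous and discrete parameters are not shown:

```python
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        return torch.clamp(torch.round(w / scale), -2, 1) * scale   # 2-bit grid

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None          # pretend quantization is the identity

w = torch.randn(16, 16, requires_grad=True)           # continuous "master" weights
scale = w.detach().abs().mean()
x = torch.randn(4, 16)
loss = (x @ STEQuantize.apply(w, scale).t()).pow(2).mean()
loss.backward()                                        # gradients land on the fp weights
print(w.grad.abs().mean())
```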
We present Yambda-5B, a large-scale open dataset sourced from the Yandex Music streaming platform. Yambda-5B contains 4.79 billion user-item interactions from 1 million users across 9.39 million tracks. The dataset includes two primary types of interactions: implicit feedback (listening events) and explicit feedback (likes, dislikes, unlikes and undislikes). In addition, we provide audio embeddings for most tracks, generated by a convolutional neural network trained on audio spectrograms. A key distinguishing feature of Yambda-5B is the inclusion of the is_organic flag, which separates organic user actions from recommendation-driven events. This distinction is critical for developing and evaluating machine learning algorithms, as Yandex Music relies on recommender systems to personalize track selection for users. To support rigorous benchmarking, we introduce an evaluation protocol based on a Global Temporal Split, allowing recommendation algorithms to be assessed in conditions that closely mirror real-world use. We report benchmark results for standard baselines (ItemKNN, iALS) and advanced models (SANSA, SASRec) using a variety of evaluation metrics. By releasing Yambda-5B to the community, we aim to provide a readily accessible, industrial-scale resource to advance research, foster innovation, and promote reproducible results in recommender systems.
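A hedged sketch of a Global Temporal Split: a single timestamp threshold divides all users' interactions into train and test, so evaluation mimics training on the past and serving future events. Column names are illustrative, not the dataset's exact schema:

```python
import pandas as pd

def global_temporal_split(interactions: pd.DataFrame, test_frac: float = 0.1):
    """Split all users' interactions at a single global timestamp threshold."""
    cutoff = interactions["timestamp"].quantile(1.0 - test_frac)
    train = interactions[interactions["timestamp"] <= cutoff]
    test = interactions[interactions["timestamp"] > cutoff]
    return train, test

events = pd.DataFrame({
    "user_id":   [1, 1, 2, 2, 3],
    "item_id":   [10, 11, 10, 12, 13],
    "timestamp": [100, 200, 150, 400, 500],
})
train, test = global_temporal_split(events, test_frac=0.2)
```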
The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the importance of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths. Our code is available at this https URL.
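A toy evolutionary loop in the spirit of dynamic per-layer compression search: candidate profiles assign a compression level to each layer, mutations trade budget between layers so the global constraint is preserved, and the lowest-loss profile survives. The objective below is a dummy placeholder for compressing and evaluating a real model:

```python
import random

LEVELS = (0, 1, 2, 3)                      # per-layer compression levels

def mutate(profile):
    """Trade one unit of budget between two layers, keeping the total fixed."""
    child = list(profile)
    i, j = random.sample(range(len(child)), 2)
    if child[i] < max(LEVELS) and child[j] > min(LEVELS):
        child[i] += 1
        child[j] -= 1
    return child

def dummy_loss(profile):                   # stand-in for evaluating the compressed model
    return sum((level - layer % 3) ** 2 for layer, level in enumerate(profile))

profile = [1] * 12                         # uniform starting profile for 12 layers
best = (dummy_loss(profile), profile)
for _ in range(200):
    child = mutate(best[1])
    score = dummy_loss(child)
    if score < best[0]:
        best = (score, child)
print(best)
```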
Yandex researchers scaled transformer encoders for recommender systems to one billion parameters using the ARGUS framework, demonstrating that such large models yield substantial performance improvements. The approach achieved record-breaking online gains of +2.26% in Total Listening Time and +6.37% in Like Likelihood on a large music platform, validating the scaling hypothesis for user behavior modeling.
PETALS enables collaborative inference and parameter-efficient fine-tuning of large language models (LLMs) such as BLOOM-176B, allowing their use on distributed consumer-grade GPUs at interactive speeds of approximately one inference step per second.