Scorch: A Library for Sparse Deep Learning

BibTeX
@article{Yan2024ScorchAL,
  author  = {Bobby Yan and Alexander J Root and Trevor Gale and David Broman and Fredrik Kjolstad},
  title   = {Scorch: A Library for Sparse Deep Learning},
  journal = {ArXiv},
  volume  = {abs/2405.16883},
  year    = {2024}
}
AI Audio Lecture + Q&A
Transcript
John: Welcome to Advanced Topics in Compilers for Machine Learning. Today's lecture is on 'Scorch: A Library for Sparse Deep Learning' from researchers at Stanford University. We've seen a lot of work recently in optimizing large models, with papers like 'SparseTIR' exploring new compilation abstractions. This paper addresses a key challenge: bridging the gap between advanced sparse tensor compilers and the deep learning frameworks that practitioners actually use. It argues that sparsity shouldn't require learning a new framework, but should be a native, efficient feature of the one you already know. Yes, Noah?

Noah: Hi Professor. So is the main idea here just to make PyTorch better at handling sparse tensors? I thought it already had some support for that.

John: That's a good question. PyTorch does have some support, but it's quite rudimentary. It's limited to a few specific data formats, like COO and CSR, and a handful of hand-optimized operations like sparse matrix multiplication. If you want to do anything slightly different, you either have to write your own high-effort custom kernel or find a specialized, one-off library. This creates a fragmented ecosystem where you might use PyTorch Geometric for graphs and another library for sparse transformers, each with its own API. Scorch's goal is to unify this.

John: The project is guided by three main objectives. First, seamless integration. The idea is that a user can just type 'import scorch as torch' and their existing code can start leveraging sparse tensors without major changes. Second, it provides unified abstractions. An operation like 'einsum' should work on any mix of dense and sparse tensors, which is something current frameworks struggle with. Third, and most importantly, it offers robust automatic performance optimization. It uses a compiler to make intelligent decisions about how to execute the code, freeing the researcher from manual performance tuning.

Noah: That automatic optimization sounds like the key contribution. But how does it handle the performance-flexibility trade-off? Hand-tuned kernels are standard in PyTorch for a reason: they are extremely fast. Can a compiler-generated kernel really compete with that?

John: That is the central challenge. Scorch's approach is to intercept a PyTorch operation at runtime. If it involves a sparse tensor, the Scorch compiler stack takes over. It first generates a domain-specific intermediate representation of the computation. Then it applies a series of novel optimization algorithms. The first is automatic loop ordering. In sparse computations, the order of loops can dramatically change the asymptotic complexity. Scorch uses a fast heuristic to find a good loop order that avoids worst-case scenarios. It also intelligently inserts temporary data structures, which they call workspaces, to manage intermediate results efficiently without having to create a full dense tensor.

Noah: Workspaces? So it materializes temporary tensors to avoid inefficient access patterns?

John: Exactly, but it materializes them as sparse structures, not dense ones, to maintain efficiency. The second optimization is automatic tiling. This is a standard technique for improving cache locality, but it's tricky with sparse data. Scorch's key insight here is to avoid tiling sparse dimensions, as that can lead to unpredictable and expensive searches within the sparse data structure. Instead, it identifies and tiles only the dense loops where data reuse is predictable.
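As an aside before the transcript continues, here is a minimal sketch of the 'import scorch as torch' drop-in pattern and the mixed sparse/dense einsum John describes above. It assumes Scorch mirrors the PyTorch tensor API it wraps; only the import line and the idea of einsum over mixed operands come from the transcript, while the construction calls are standard PyTorch-style methods and the shapes are invented for illustration.

import scorch as torch  # drop-in replacement for the torch namespace, per the transcript

# Dense feature matrix and a sparse operand. The sparse-construction calls
# below are ordinary PyTorch-style methods, assumed rather than confirmed
# to be part of Scorch's interface.
X = torch.randn(4096, 256)
A = torch.randn(4096, 4096)
A = torch.where(A > 2.5, A, torch.zeros_like(A)).to_sparse_csr()

# The same einsum call accepts a mix of sparse and dense operands; the
# Scorch compiler intercepts it, picks a loop order, inserts sparse
# workspaces for intermediates, and tiles only the dense loops.
Y = torch.einsum("ij,jk->ik", A, X)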
John: Finally, there's format inference, where the compiler automatically determines the optimal storage format for the output tensor based on the operation and the input formats. You don't have to specify it manually.

Noah: Wait, the report mentioned they use red-black trees for those workspaces. Isn't that a relatively slow data structure for a high-performance context compared to, say, a hash map?

John: A sharp observation. The choice of a red-black tree is a trade-off. It provides guaranteed logarithmic time complexity for insertions and, importantly, ordered iteration. This predictability is very valuable for a compiler. Hash maps can have better average-case performance, but their worst-case performance on collisions can be poor, and they don't preserve order, which might complicate subsequent compiler passes. For their initial CPU-focused work, the authors chose to prioritize predictable performance over alternatives that are sometimes faster but more volatile.

John: The impact of this approach is quite clear in their results. For standard operations like SpMM in Graph Convolutional Networks, they achieved over a 2x speedup compared to PyTorch's native implementation. But the most compelling result was for Sampled Dense-Dense Matrix Multiplication, or SDDMM. PyTorch has no hand-optimized kernel for this, so a user has to perform a full dense multiplication and then filter it, which is asymptotically slow. Scorch, however, automatically generates a single, fused kernel that is orders of magnitude faster. This is the real power: it enables the exploration of new sparse models and operations that were previously impractical to implement.

Noah: So it's not just about speeding up existing operations, but about unlocking new combinations of operations that were previously impractical. This sounds conceptually similar to the goals of general compilers like 'TVM', but hyper-focused on solving the sparse tensor problem within the PyTorch ecosystem.

John: Precisely. It applies the principles of compiler automation to solve a specific, high-impact problem. By providing a general and unified solution, it directly combats the ecosystem fragmentation we talked about earlier. Researchers can focus on novel architecture design rather than low-level performance engineering.

John: In summary, Scorch's primary significance is that it democratizes the use of sparsity in deep learning. By integrating a sophisticated sparse compiler seamlessly into PyTorch, it dramatically lowers the barrier to entry for building and experimenting with efficient, sparse models. The work is currently focused on CPU inference, but the compiler foundation is solid and could be extended to GPUs and training in the future. The main takeaway is this: by reframing sparsity as a compiler problem instead of a manual kernel-writing problem, we can unlock both performance and greater flexibility in our deep learning frameworks.

John: Thanks for listening. If you have any further questions, ask our AI assistant or drop a comment.
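To make the SDDMM discussion concrete, here is an illustrative sketch in plain NumPy/SciPy rather than Scorch-generated code; the function names are invented for illustration. It contrasts the dense-then-filter approach John mentions with a computation restricted to the nonzeros of the sampling matrix, which is the asymptotic win a fused kernel provides.

# SDDMM semantics: out[i, j] = S[i, j] * (A @ B)[i, j], needed only at the
# nonzero positions of S.
import numpy as np
import scipy.sparse as sp

def sddmm_naive(S, A, B):
    # Dense-then-filter: O(m * n * k) work and an m x n dense temporary,
    # regardless of how few nonzeros S has.
    return S.multiply(A @ B)

def sddmm_fused(S, A, B):
    # Compute only at S's nonzeros: O(nnz(S) * k) work, no dense temporary.
    S = S.tocoo()
    vals = np.empty_like(S.data)
    for idx, (i, j) in enumerate(zip(S.row, S.col)):
        vals[idx] = S.data[idx] * (A[i, :] @ B[:, j])
    return sp.coo_matrix((vals, (S.row, S.col)), shape=S.shape)

m, k, n = 1000, 64, 1000
A, B = np.random.rand(m, k), np.random.rand(k, n)
S = sp.random(m, n, density=0.001, format="csr")
assert np.allclose(sddmm_naive(S, A, B).toarray(), sddmm_fused(S, A, B).toarray())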