LuxiTech
Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.
1,022
Gated Slot Attention (GSA) presents a linear-time recurrent architecture that incorporates a gating mechanism to enhance memory capacity and adaptively forget information. It improves performance on in-context recall-intensive tasks and enables more effective finetuning of pretrained Transformers into recurrent models, often surpassing larger models trained from scratch on various benchmarks.
5
Researchers from the University of California, Santa Cruz and Intel Labs developed the first scalable language model architecture that completely eliminates matrix multiplication (MatMul) operations. This MatMul-free model achieves competitive performance with Transformer++ models up to 2.7 billion parameters while drastically improving computational and energy efficiency on both GPUs and neuromorphic hardware.
3,032
Enhancing the computational efficiency of on-device Deep Neural Networks (DNNs) remains a significant challengein mobile and edge computing. As we aim to execute increasingly complex tasks with constrained computational resources, much of the research has focused on compressing neural network structures and optimizing systems. Although many studies have focused on compressing neural network structures and parameters or optimizing underlying systems, there has been limited attention on optimizing the fundamental building blocks of neural networks: the neurons. In this study, we deliberate on a simple but important research question: Can we design artificial neurons that offer greater efficiency than the traditional neuron paradigm? Inspired by the threshold mechanisms and the excitation-inhibition balance observed in biological neurons, we propose a novel artificial neuron model, Threshold Neurons. Using Threshold Neurons, we can construct neural networks similar to those with traditional artificial neurons, while significantly reducing hardware implementation complexity. Our extensive experiments validate the effectiveness of neural networks utilizing Threshold Neurons, achieving substantial power savings of 7.51x to 8.19x and area savings of 3.89x to 4.33x at the kernel level, with minimal loss in precision. Furthermore, FPGA-based implementations of these networks demonstrate 2.52x power savings and 1.75x speed enhancements at the system level. The source code will be made available upon publication.
Deploying Spiking Neural Networks (SNNs) on the Xylo neuromorphic chip via the Rockpool framework represents a significant advancement in achieving ultra-low-power consumption and high computational efficiency for edge applications. This paper details a novel deployment pipeline, emphasizing the integration of Rockpool's capabilities with Xylo's architecture, and evaluates the system's performance in terms of energy efficiency and accuracy. The unique advantages of the Xylo chip, including its digital spiking architecture and event-driven processing model, are highlighted to demonstrate its suitability for real-time, power-sensitive applications.
There are no more papers matching your filters at the moment.