H-Net, developed by researchers at Carnegie Mellon University and Cartesia AI, introduces an end-to-end hierarchical network that learns dynamic data chunking, enabling direct processing of raw bytes. This architecture surpasses traditional BPE-tokenized large language models in performance and robustness across various modalities while achieving better data efficiency.
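Below is a minimal sketch of the dynamic-chunking idea described above: a small learned scorer marks chunk boundaries in a raw byte stream, and the bytes within each chunk are pooled into a single vector for a higher-level network. The module names, sizes, pooling choice, and threshold rule here are illustrative assumptions, not H-Net's actual mechanism.

```python
# Illustrative sketch of learned dynamic chunking over raw bytes (not the
# authors' implementation; names and the thresholding rule are assumptions).
import torch
import torch.nn as nn


class DynamicChunker(nn.Module):
    def __init__(self, dim: int = 256, threshold: float = 0.5):
        super().__init__()
        self.byte_embed = nn.Embedding(256, dim)   # raw bytes, no tokenizer
        self.boundary_scorer = nn.Linear(dim, 1)   # per-position boundary score
        self.threshold = threshold

    def forward(self, byte_ids: torch.Tensor):
        # byte_ids: (seq_len,) tensor of values in [0, 255]
        x = self.byte_embed(byte_ids)                                  # (seq_len, dim)
        scores = torch.sigmoid(self.boundary_scorer(x)).squeeze(-1)    # (seq_len,)
        boundaries = (scores > self.threshold).nonzero().squeeze(-1).tolist()

        # Pool the bytes of each chunk into one vector for the outer network.
        chunks, start = [], 0
        for end in boundaries + [byte_ids.numel()]:
            if end > start:
                chunks.append(x[start:end].mean(dim=0))
                start = end
        return torch.stack(chunks)                                     # (num_chunks, dim)


text = "H-Net reads raw bytes.".encode("utf-8")
chunk_vectors = DynamicChunker()(torch.tensor(list(text)))
print(chunk_vectors.shape)  # (num_chunks, 256); num_chunks depends on the learned scores
```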
This empirical study evaluates Mamba-based language models, including a novel Mamba-2-Hybrid architecture, at 8 billion parameters against Transformer models. The Mamba-2-Hybrid model combines efficient state-space layers with a small fraction of self-attention layers, outperforming pure Transformers on standard language tasks and demonstrating superior long-context generalization and significantly faster inference speeds, particularly for very long sequences.
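The structural idea, interleaving a small number of attention blocks among many state-space blocks, can be sketched as a layer schedule. The layer count, attention fraction, and even-spacing rule below are illustrative assumptions, not the study's exact 8-billion-parameter configuration.

```python
# Sketch of a hybrid layer schedule: mostly SSM blocks, a small evenly spaced
# fraction of attention blocks. Counts and spacing are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Block:
    index: int
    kind: str  # "ssm" or "attention"


def build_hybrid_schedule(num_layers: int, attention_fraction: float) -> list[Block]:
    """Spread a small number of attention blocks evenly among SSM blocks."""
    num_attention = max(1, round(num_layers * attention_fraction))
    stride = num_layers / num_attention
    attention_positions = {round(i * stride) for i in range(num_attention)}
    return [
        Block(i, "attention" if i in attention_positions else "ssm")
        for i in range(num_layers)
    ]


schedule = build_hybrid_schedule(num_layers=48, attention_fraction=0.08)
print([b.index for b in schedule if b.kind == "attention"])  # [0, 12, 24, 36]
```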
A unifying matrix mixer framework is introduced for diverse sequence models, leading to Hydra, a bidirectional architecture. Hydra demonstrates state-of-the-art performance on GLUE and ImageNet-1K benchmarks, outperforming BERT and ViT respectively, while retaining sub-quadratic computational efficiency.
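The matrix-mixer view can be illustrated with a toy example: a sequence layer is read as y = M x, where M mixes positions, and making M see both past and future positions yields a bidirectional mixer. The dense matrices below are stand-ins for exposition only; Hydra's contribution is keeping M structured (quasiseparable) so the bidirectional mixing stays sub-quadratic.

```python
# Toy numpy sketch of the matrix-mixer view (dense matrices used only for
# illustration; the actual Hydra mixer is structured, not dense).
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 6, 4
x = rng.normal(size=(seq_len, dim))

# Causal mixer: position t only sees positions <= t (lower-triangular M).
causal_M = np.tril(rng.normal(size=(seq_len, seq_len)))
y_causal = causal_M @ x

# Bidirectional mixer: also mix over future positions (upper-triangular part),
# so every position sees both past and future context.
anticausal_M = np.triu(rng.normal(size=(seq_len, seq_len)))
y_bidirectional = (causal_M + anticausal_M) @ x

print(y_causal.shape, y_bidirectional.shape)  # (6, 4) (6, 4)
```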