Toyota Technological Institute at Chicago
Sparse Fusion Transformers (SFT) introduce an efficient architecture for multimodal classification that leverages the complementary nature of modalities to aggressively sparsify unimodal representations before fusion. The approach achieves up to an 11-fold reduction in computational cost and memory usage while maintaining or improving accuracy on benchmark datasets such as VGGSound and CMU-MOSEI.
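To illustrate the general idea of pruning unimodal tokens before cross-modal fusion, here is a minimal PyTorch-style sketch. The names (`SparseFusionSketch`, `keep_ratio`, `score_proj`) and the top-k scoring rule are illustrative assumptions, not the authors' implementation; the point is only that the fusion transformer operates on a small fraction of the original tokens.

```python
# Minimal sketch of sparsify-then-fuse, under assumed names and a simple
# top-k token scorer; not the SFT paper's actual code.
import torch
import torch.nn as nn

class SparseFusionSketch(nn.Module):
    def __init__(self, dim=256, keep_ratio=0.3, n_heads=4, n_layers=2, n_classes=10):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Per-modality scorer used to rank tokens before pruning (assumption).
        self.score_proj = nn.Linear(dim, 1)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, n_classes)

    def _sparsify(self, tokens):
        # Keep only the top-scoring fraction of tokens in each modality,
        # so the fusion transformer sees far fewer tokens.
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        scores = self.score_proj(tokens).squeeze(-1)            # (B, T)
        idx = scores.topk(k, dim=1).indices                     # (B, k)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                            # (B, k, D)

    def forward(self, audio_tokens, video_tokens):
        # Sparsify each modality independently, then fuse the survivors.
        fused = torch.cat(
            [self._sparsify(audio_tokens), self._sparsify(video_tokens)], dim=1
        )
        fused = self.fusion(fused)
        return self.head(fused.mean(dim=1))

# Fusion cost now scales with the pruned token count rather than the
# full unimodal sequence lengths.
model = SparseFusionSketch()
logits = model(torch.randn(2, 128, 256), torch.randn(2, 196, 256))
```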