PACT (Pruning and Clustering-based Token Reduction) reduces the computational demands of Visual Language Models by identifying and removing unimportant visual tokens and merging redundant ones. The method achieves up to a 71.3% token reduction with only a 1.4% performance drop on LLaVA-OneVision-7B and enables a 60% throughput improvement on Qwen-VL while maintaining model performance.
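The prune-then-merge idea behind PACT can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: the importance score (token L2 norm), the greedy cosine-similarity clustering, and the `keep_ratio` / `merge_threshold` parameters are all stand-in assumptions here, not PACT's actual scoring or clustering procedure.

```python
import numpy as np

def reduce_tokens(tokens, keep_ratio=0.5, merge_threshold=0.9):
    """Illustrative prune-then-merge token reduction (simplified, not PACT itself).

    tokens: (N, D) array of visual token embeddings.
    keep_ratio: fraction of tokens retained after importance pruning.
    merge_threshold: cosine similarity above which surviving tokens merge.
    """
    # 1) Prune: use the L2 norm as a stand-in importance proxy and keep
    #    only the highest-scoring fraction of tokens.
    scores = np.linalg.norm(tokens, axis=1)
    n_keep = max(1, int(len(tokens) * keep_ratio))
    kept = tokens[np.argsort(scores)[::-1][:n_keep]]

    # 2) Merge: greedily cluster kept tokens whose cosine similarity to an
    #    existing cluster representative exceeds the threshold.
    unit = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    clusters, reps = [], []
    for vec, u in zip(kept, unit):
        for i, rep in enumerate(reps):
            if float(u @ rep) >= merge_threshold:
                clusters[i].append(vec)  # redundant token: fold into cluster
                break
        else:
            clusters.append([vec])  # novel token: start a new cluster
            reps.append(u)

    # Each cluster collapses to the mean of its members, so the output
    # sequence is strictly no longer than the pruned one.
    return np.stack([np.mean(c, axis=0) for c in clusters])
```

The two stages compose: pruning drops tokens the model is unlikely to attend to, while merging deduplicates the survivors, which is why the combined reduction can be much larger than either stage alone.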
This research develops and evaluates the first reconstruction attack, named CDMI, specifically designed for state-of-the-art layout-aware document understanding models. The work demonstrates that these models are vulnerable to data leakage, showing perfect reconstruction of up to 4.1% of sensitive fields in a one-shot attack and up to 22.5% in multi-shot scenarios with membership inference.