South China University of Technology
Researchers from Nanyang Technological University, Sun Yat-Sen University, and South China University of Technology developed a general-purpose 3D Vision-Language Pre-training framework that leverages 3D scene graphs to achieve multi-level alignment between 3D scenes and natural language. The framework establishes state-of-the-art or competitive performance across 3D visual grounding, question answering, and dense captioning tasks.
Researchers propose a 6DMA-aided cell-free massive MIMO system that optimizes antenna rotation angles to maximize the average achievable sum-rate by adapting to long-term user spatial distributions. The system demonstrates superior performance over fixed-antenna setups and provides distinct optimal 6DMA orientations for different receiver combining strategies.
This survey provides a comprehensive analysis of 'reasoning Large Language Models,' detailing their transition from intuitive 'System 1' to deliberate 'System 2' thinking. It maps the foundational technologies, core construction methods, and evaluation benchmarks, highlighting their enhanced performance in complex tasks like mathematics and coding while also identifying current limitations and future research directions.
Patch-as-Decodable Token (PaDT) unifies multimodal Large Language Models by enabling direct generation of both textual and diverse visual outputs, such as bounding boxes and segmentation masks. This approach achieves state-of-the-art performance across fine-grained vision tasks, with smaller PaDT models surpassing much larger existing MLLMs.
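The core idea of treating patches as decodable tokens can be illustrated with a toy decoder: predicted visual tokens index cells of a patch grid, and a bounding box is the tight rectangle around the referenced cells. This is a minimal sketch of the concept only; the function name, grid geometry, and decoding rule are assumptions, not PaDT's actual decoder.

```python
def patch_tokens_to_bbox(patch_ids, grid=14, image_size=224):
    """Toy decoding of predicted visual patch tokens into a bounding
    box: each token indexes one cell of a grid x grid patch layout,
    and the box is the tight pixel rectangle covering those cells."""
    cell = image_size // grid
    rows = [p // grid for p in patch_ids]
    cols = [p % grid for p in patch_ids]
    return (min(cols) * cell, min(rows) * cell,
            (max(cols) + 1) * cell, (max(rows) + 1) * cell)

# Tokens 15, 16, 30 reference grid cells (row 1, col 1) through
# (row 2, col 2) on a 14x14 grid over a 224x224 image.
box = patch_tokens_to_bbox([15, 16, 30])
```

A segmentation mask could be decoded the same way by painting the referenced cells instead of bounding them.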
Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
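The synthesis pipeline can be pictured with a toy example: draw a random linear structural causal model, sample rows from it, threshold a weighted sum for the label, and serialize each shot on one compact line. Everything here (function names, the linear SCM form, the `->` serialization) is a hypothetical sketch, far simpler than the paper's millions of SCMs.

```python
import random

def sample_scm_task(n_features=4, n_shots=8, seed=0):
    """Toy synthesis of a tabular classification task from a random
    structural causal model: each feature depends linearly on its
    predecessors plus Gaussian noise, and the label thresholds a
    weighted sum of all features."""
    rng = random.Random(seed)
    # Random linear SCM over a causal ordering: x_i = sum_j<i w[i][j]*x_j + noise
    w = [[rng.uniform(-1, 1) for _ in range(i)] for i in range(n_features)]
    label_w = [rng.uniform(-1, 1) for _ in range(n_features)]

    def sample_row():
        x = []
        for i in range(n_features):
            x.append(sum(wi * xj for wi, xj in zip(w[i], x)) + rng.gauss(0, 1))
        y = int(sum(lw * xi for lw, xi in zip(label_w, x)) > 0)
        return x, y

    return [sample_row() for _ in range(n_shots)]

def serialize(task):
    """Token-efficient serialization: one compact line per in-context shot."""
    return "\n".join(
        ",".join(f"{v:.2f}" for v in x) + f"->{y}" for x, y in task
    )

task = sample_scm_task()
prompt = serialize(task)
```

Because no real-world labels are involved, such tasks can be generated in unlimited quantity, which is what makes continued pretraining on them feasible.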
3D-VLA introduces a generative world model that integrates 3D scene understanding with language and action generation, enabling embodied AI systems to predict future states and plan actions in 3D environments. It leverages a new 3D embodied instruction dataset and demonstrates improved performance in 3D reasoning, goal generation, and action planning compared to 2D vision-language models.
This work provides a comprehensive survey and taxonomy of multimodal spatial reasoning within large language models, addressing the scarcity of systematic reviews and standardized benchmarks. It analyzes diverse approaches for enhancing spatial intelligence in MLLMs and introduces open benchmarks to facilitate rigorous evaluation and comparison across various tasks and modalities.
OCRBench v2 offers an improved benchmark for evaluating Large Multimodal Models (LMMs) on visual text localization and reasoning. It presents 23 tasks across 31 diverse scenarios with 10,000 human-validated instruction-response pairs and a private test set, revealing that current LMMs perform poorly on fine-grained spatial perception, complex layout understanding, and structured element parsing tasks, despite advances in basic text recognition.
The Mixture-of-Memories (MoM) architecture replaces a single recurrent state with multiple independent memory states and a routing mechanism, enhancing linear sequence models' ability to retain information over long sequences. This design enables performance on recall-intensive tasks comparable to Transformer models while maintaining linear time complexity during training and constant-time inference.
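The routing idea above can be sketched in a few lines of plain Python: a router scores each input against per-memory keys, and only the top-k memories are updated, so different kinds of information land in different states. The keys, decay rule, and read-out here are illustrative stand-ins; in the paper these are learned components inside a linear-attention network.

```python
import math

class MixtureOfMemories:
    """Toy mixture-of-memories with sparse routing over independent
    memory states (illustrative sketch, not the paper's layer)."""

    def __init__(self, n_memories=4, dim=8, top_k=2):
        # Fixed, arbitrary router keys just for the demo.
        self.keys = [[math.sin(m * dim + d) for d in range(dim)]
                     for m in range(n_memories)]
        self.memories = [[0.0] * dim for _ in range(n_memories)]
        self.top_k = top_k

    def step(self, x):
        # Route: score the input against each memory's key, keep top-k.
        scores = [sum(k * xi for k, xi in zip(key, x)) for key in self.keys]
        chosen = sorted(range(len(scores)), key=lambda m: scores[m])[-self.top_k:]
        # Update only the routed memories (decayed write).
        for m in chosen:
            self.memories[m] = [0.9 * s + 0.1 * xi
                                for s, xi in zip(self.memories[m], x)]
        # Read-out: aggregate the routed memories.
        return [sum(self.memories[m][d] for m in chosen)
                for d in range(len(x))]

mom = MixtureOfMemories()
out = mom.step([1.0] * 8)
```

Because each step touches only k of the memories, the per-token cost stays constant regardless of sequence length, which is the property the summary refers to.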
This paper provides a comprehensive survey of lifelong learning methods for LLM-based agents, focusing on how these agents can continuously learn and adapt through perception, memory, and action modules.
Researchers from Peking University and South China University of Technology developed FakeShield, a framework that uses multi-modal large language models for explainable and generalized image forgery detection and localization. This system not only detects and precisely localizes image manipulations but also provides human-understandable explanations for its judgments, achieving superior performance across diverse tampering types.
The 3D-LLM framework from a collaboration including MIT and UMass Amherst enables large language models to understand and reason about the 3D physical world. It achieves this by generating large-scale 3D-language data and deriving 3D features from multi-view 2D images, demonstrating improved performance across tasks like 3D question answering and object grounding compared to prior methods.
Researchers from Peking University and collaborating institutions develop TimeChat-Online, a streaming video understanding framework that reduces visual token processing by 82.8% through differential token dropping while maintaining 98% performance on StreamingBench, enabling efficient real-time interaction with continuous video streams.
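The differential token dropping described above can be sketched as a simple filter: keep only the visual tokens whose features changed noticeably since the previous frame, preserving each kept token's position index for the language model. The function name, the L1 change metric, and the threshold are assumptions for illustration, not the paper's exact design.

```python
def drop_static_tokens(prev_frame, cur_frame, threshold=0.1):
    """Keep only tokens whose mean absolute feature change since the
    previous frame exceeds a threshold (toy differential dropping)."""
    kept = []
    for idx, (p, c) in enumerate(zip(prev_frame, cur_frame)):
        change = sum(abs(a - b) for a, b in zip(p, c)) / len(c)
        if change > threshold:
            kept.append((idx, c))  # position index preserved for the LLM
    return kept

prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
cur  = [[0.0, 0.0], [1.0, 1.5], [2.0, 2.0]]  # only token 1 moved
kept = drop_static_tokens(prev, cur)
```

On largely static streams most tokens are dropped, which is where the reported 82.8% reduction comes from.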
3D-Mem introduces a scalable 3D scene memory framework for embodied agents, leveraging multi-view image "Memory Snapshots" for explored regions and "Frontier Snapshots" for unexplored areas. This enables efficient lifelong exploration and enhanced spatial reasoning, outperforming baselines in various embodied question answering and navigation tasks by effectively integrating with Vision-Language Models.
Researchers at South China University of Technology and collaborators introduced NSG-VD, a physics-driven method utilizing a Normalized Spatiotemporal Gradient (NSG) and Maximum Mean Discrepancy, to detect AI-generated videos by identifying violations of physical continuity. The approach achieves superior detection performance on advanced generative models like Sora and demonstrates strong robustness in data-imbalanced settings.
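The intuition behind the detector can be sketched on a 1-D "video": compute a per-pixel temporal difference normalized by the local spatial gradient, then compare the statistic distributions of two clips with a crude MMD. The normalization, kernel, and toy data below are illustrative assumptions, not NSG-VD's actual formulation.

```python
def nsg(frames, eps=1e-6):
    """Toy normalized spatiotemporal gradient on a 1-D video:
    temporal difference divided by local spatial gradient magnitude.
    Physically smooth motion keeps this ratio well behaved."""
    stats = []
    for t in range(1, len(frames)):
        f_prev, f_cur = frames[t - 1], frames[t]
        for x in range(1, len(f_cur) - 1):
            temporal = f_cur[x] - f_prev[x]
            spatial = abs(f_cur[x + 1] - f_cur[x - 1]) / 2
            stats.append(temporal / (spatial + eps))
    return stats

def mmd(a, b):
    """Crude MMD estimate with a linear kernel: squared mean gap."""
    mean = lambda s: sum(s) / len(s)
    return (mean(a) - mean(b)) ** 2

smooth = [[float(x + t) for x in range(8)] for t in range(4)]       # uniform drift
jumpy  = [[float((x * 7 + t * 13) % 5) for x in range(8)] for t in range(4)]
gap = mmd(nsg(smooth), nsg(jumpy))
```

A clip violating physical continuity produces NSG statistics far from those of smooth motion, so the discrepancy score separates the two.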
South China University of Technology and Pazhou Laboratory researchers develop TLM (Test-Time Learning for LLMs), a framework that adapts large language models to new domains during inference using only unlabeled test data, minimizing input perplexity through LoRA-based parameter updates. On domain knowledge adaptation tasks in their AdaptEval benchmark it achieves at least 20% performance improvements, while a sample-efficient learning strategy that prioritizes high-perplexity examples prevents catastrophic forgetting and reduces computational overhead compared to traditional fine-tuning approaches.
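The sample-selection step can be sketched as follows: compute perplexity from per-token log-probabilities and keep the top-k most surprising unlabeled inputs for the next adaptation update. The function names and the made-up log-probabilities are hypothetical; only the exp-of-negative-mean-log-prob formula is standard.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token log-probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_adaptation_batch(samples, k=2):
    """Rank unlabeled test inputs by perplexity and keep the top-k
    most surprising ones for the next parameter update (sketch of a
    sample-efficient selection strategy)."""
    ranked = sorted(samples.items(),
                    key=lambda kv: perplexity(kv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical per-token log-probs the model assigned to each input.
samples = {
    "familiar sentence": [-0.1, -0.2, -0.1],
    "domain jargon A":   [-2.0, -1.5, -2.5],
    "domain jargon B":   [-1.0, -1.2, -0.8],
}
batch = select_adaptation_batch(samples, k=2)
```

Training only on the high-perplexity inputs concentrates updates on genuinely novel material, which is how the strategy limits both forgetting and compute.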
CorrCLIP introduces a framework to enhance CLIP's performance on open-vocabulary semantic segmentation by reconstructing patch correlations, specifically addressing inter-class incoherence. The method achieved state-of-the-art performance among training-free approaches, improving average mIoU by up to 8.5% on various benchmarks and often surpassing weakly-supervised models.
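The gist of suppressing inter-class correlations can be sketched like this: keep a patch-to-patch similarity entry only when both patches fall in the same region (e.g., a class-agnostic mask), then renormalize each row. This is a toy rendering of the idea, not CorrCLIP's exact formulation; the region masks and similarity values are made up.

```python
def reconstruct_correlations(sim, regions):
    """Zero out similarity between patches in different regions and
    renormalize each row (sketch of patch-correlation reconstruction)."""
    n = len(sim)
    out = []
    for i in range(n):
        row = [sim[i][j] if regions[i] == regions[j] else 0.0
               for j in range(n)]
        total = sum(row) or 1.0
        out.append([v / total for v in row])
    return out

# 4 patches, two regions; cross-region similarity is treated as noise.
sim = [[1.0, 0.8, 0.6, 0.5],
       [0.8, 1.0, 0.5, 0.6],
       [0.6, 0.5, 1.0, 0.9],
       [0.5, 0.6, 0.9, 1.0]]
regions = [0, 0, 1, 1]
corr = reconstruct_correlations(sim, regions)
```

After reconstruction, each patch only aggregates features from its own region, so predictions within one object stay coherent.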
LifelongAgentBench introduces the first unified benchmark to evaluate large language model agents as lifelong learners across Database, Operating System, and Knowledge Graph environments. The benchmark demonstrates that experience replay consistently improves agent performance (e.g., from 19% to 78% accuracy on database tasks) but is limited by context length, motivating group self-consistency, which drastically reduces token usage (e.g., from 56,409 to 11,002 tokens) while maintaining effectiveness.
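The group self-consistency idea can be sketched simply: instead of packing every past experience into one oversized context, split the experiences into small groups, answer the query once per group, and majority-vote the results. The `query_answerer` callback stands in for an LLM call and is hypothetical, as is the toy voting answerer.

```python
from collections import Counter

def group_self_consistency(experiences, query_answerer, group_size=3):
    """Answer a query once per small group of experiences and return
    the majority-vote result (sketch; each group fits in context)."""
    groups = [experiences[i:i + group_size]
              for i in range(0, len(experiences), group_size)]
    votes = [query_answerer(g) for g in groups]
    return Counter(votes).most_common(1)[0][0]

# Toy answerer: a group "answers" with its own majority label.
answer = group_self_consistency(
    ["A", "A", "B", "A", "B", "B", "A", "A", "A"],
    lambda g: Counter(g).most_common(1)[0][0],
)
```

Each call only pays for one group's tokens, which is why the aggregate token count drops so sharply relative to a single monolithic context.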