The paper introduces SimpleTIR, an end-to-end Reinforcement Learning (RL) approach that stabilizes multi-turn Tool-Integrated Reasoning (TIR) in Large Language Models (LLMs) under the Zero RL setting. SimpleTIR resolves training instability and gradient explosions by filtering problematic 'void turns,' achieving state-of-the-art performance on mathematical reasoning benchmarks and fostering diverse, emergent reasoning patterns.
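As a rough illustration of the void-turn filtering the summary describes, the sketch below drops trajectories whose turns produce neither a tool call nor a final answer before the RL update; the data types and field names are hypothetical, not SimpleTIR's actual interface.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    text: str
    tool_call: Optional[str]   # None if the turn issued no code/tool call
    is_final_answer: bool      # True if the turn ends with a final answer

@dataclass
class Trajectory:
    turns: List[Turn]
    reward: float

def is_void_turn(turn: Turn) -> bool:
    """A 'void' turn yields neither a tool call nor a final answer."""
    return turn.tool_call is None and not turn.is_final_answer

def filter_batch(batch: List[Trajectory]) -> List[Trajectory]:
    """Drop trajectories containing any void turn before the policy update,
    the stabilization step the summary attributes to SimpleTIR."""
    return [traj for traj in batch if not any(is_void_turn(t) for t in traj.turns)]
```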
Depth Anything V2 introduces a new paradigm for monocular depth estimation by leveraging precise synthetic data for detail and massive pseudo-labeled real data for robustness. The model family achieves state-of-the-art performance in zero-shot relative depth estimation, with the largest model reaching 97.4% accuracy on the new DA-2K benchmark, while maintaining high inference efficiency.
SimpleRL-Zoo comprehensively investigates zero reinforcement learning (RL) on 10 diverse open base models, demonstrating that advanced reasoning behaviors can emerge from models initially lacking strong instruction-following and that traditional supervised fine-tuning can actually hinder this emergence.
This paper introduces 'General-Reasoner,' an approach that extends Large Language Model reasoning capabilities across diverse domains beyond mathematics and coding. It achieves this by curating a large-scale, verifiable all-domain reasoning dataset and developing a generative model-based verifier for robust answer assessment within a 'Zero RL' training framework, with models demonstrating performance competitive with or surpassing commercial LLMs on various benchmarks.
Researchers from HKU, TikTok, CUHK, and ZJU developed "Depth Anything," a robust foundation model for monocular depth estimation that leverages an unprecedented 62 million unlabeled images. The model demonstrates superior zero-shot generalization and establishes new state-of-the-art performance upon fine-tuning on various benchmarks, also yielding a versatile encoder for semantic tasks.
SWE-Perf introduces the first benchmark for evaluating Large Language Models (LLMs) on real-world, repository-level code performance optimization, drawing from 140 human-authored performance-improving pull requests. Evaluations using SWE-Perf reveal a substantial gap between current LLM capabilities (e.g., OpenHands with 2.26% average gain) and expert performance (10.85% gain), indicating LLMs struggle with complex, multi-function optimization and identifying opportunities in computationally intensive code.
A new benchmark, DAComp, evaluates large language model (LLM) agents across the full data intelligence lifecycle, integrating repository-level data engineering and open-ended data analysis. Experiments reveal state-of-the-art LLMs struggle with holistic pipeline orchestration and strategic insight synthesis, achieving an aggregated data engineering score of 43.45% and a data analysis score of 50.84% with top models.
An automated web-agent-driven pipeline called MCP-FLOW constructs a large-scale, high-quality dataset from real-world Model Context Protocol (MCP) servers to enable Large Language Models (LLMs) to effectively use diverse external tools. The generated dataset, comprising over 68,000 instruction-function call pairs, allows smaller, fine-tuned LLMs to achieve superior performance in tool selection and complex agentic tasks compared to larger state-of-the-art models.
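To make the dataset format concrete, here is a hypothetical instruction-to-function-call record of the kind the summary describes; the field names and the weather tool are invented for illustration and are not taken from the MCP-FLOW release.

```python
import json

example_pair = {
    "instruction": "Find the current weather in Paris and reply in Celsius.",
    "server": "weather-mcp",                  # which MCP server exposes the tool
    "function_call": {
        "name": "get_current_weather",
        "arguments": {"location": "Paris", "unit": "celsius"},
    },
}

# Records like this can be stored one per line (JSONL) and used to fine-tune
# smaller models on tool selection and argument filling.
print(json.dumps(example_pair, indent=2))
```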
Researchers developed SURDS, a benchmark for evaluating Vision Language Models' fine-grained spatial understanding and reasoning in real-world driving environments. An accompanying reinforcement learning-based alignment method achieved an overall score of 40.80 on SURDS, enabling a 3B-parameter model to surpass larger, general-purpose VLMs.
Researchers from National University of Singapore's Show Lab and TikTok introduced the Draw-In-Mind (DIM) framework, which rebalances the understanding and generation roles within unified multimodal models for instruction-guided image editing. This approach, leveraging a new Chain-of-Thought (CoT) dataset, achieved state-of-the-art performance on image editing benchmarks like ImgEdit and GEdit-Bench-EN, while also demonstrating a 4.5x speedup in inference compared to previous models.
SWE-Dev introduces a large-scale dataset for evaluating and training autonomous AI systems on real-world Feature-Driven Development (FDD) tasks, demonstrating that fine-tuning models on its 14,000 verifiable training instances significantly improves performance on complex, repository-level coding challenges.
A comprehensive review by researchers from TikTok and several universities surveys the advancements and challenges of MultiModal Large Language Models (MM-LLMs) for understanding long videos. The work systematically traces model evolution, highlights the necessity for specialized architectural adaptations and training strategies, and quantitatively compares performance across various video benchmarks.
In the multimedia domain, Infrared Small Target Detection (ISTD) plays an important role in drone-based multi-modality sensing. To address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD, we propose a doubly wavelet-guided invariance learning framework (Ivan-ISTD). In the first stage, we generate training samples aligned with the target domain using Wavelet-guided Cross-domain Synthesis. This wavelet-guided alignment mechanism accurately separates the target from the background through multi-frequency wavelet filtering. In the second stage, we introduce Real-domain Noise Invariance Learning, which extracts real noise characteristics from the target domain to build a dynamic noise library. The model learns noise invariance through a self-supervised loss, thereby overcoming the distribution bias inherent in traditional artificial noise modeling. Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic degradation dataset that simulates the distribution shifts encountered in real-world applications. We also validate the versatility of our method on other real-world datasets. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods across multiple quantitative metrics; in particular, Ivan-ISTD shows excellent robustness in cross-domain scenarios. The code for this work can be found at: this https URL.
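A minimal sketch of the multi-frequency wavelet filtering idea, assuming PyWavelets is available: the low-frequency approximation roughly captures the background, while the high-frequency detail bands retain small targets. This is only an illustration of the separation principle, not the Ivan-ISTD implementation.

```python
import numpy as np
import pywt

def wavelet_split(image: np.ndarray, wavelet: str = "haar", level: int = 2):
    """Return a background-dominated and a detail-dominated reconstruction."""
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    # Background: keep only the low-frequency approximation, zero out the details.
    bg_coeffs = [coeffs[0]] + [tuple(np.zeros_like(c) for c in d) for d in coeffs[1:]]
    background = pywt.waverec2(bg_coeffs, wavelet)
    # Detail: zero the approximation, keep the high-frequency bands where
    # small infrared targets concentrate.
    tgt_coeffs = [np.zeros_like(coeffs[0])] + list(coeffs[1:])
    details = pywt.waverec2(tgt_coeffs, wavelet)
    return background, details

frame = np.random.rand(128, 128).astype(np.float32)  # stand-in for an IR frame
background, details = wavelet_split(frame)
```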
Researchers from TikTok Inc. developed a reasoning-enhanced domain-adaptive pretraining paradigm for Multimodal Large Language Models (MLLMs) to improve short video content governance. This method, using custom pretraining tasks, enhances MLLMs' understanding of complex guidelines and generalization across content issues, outperforming GPT-4o and achieving up to 31% absolute accuracy gains on in-domain tasks.
Large Language Models (LLMs) have gained significant popularity due to their remarkable capabilities in text understanding and generation. However, despite their widespread deployment in inference services such as ChatGPT, concerns have arisen about the potential leakage of sensitive user data. Existing solutions primarily rely on privacy-enhancing technologies to mitigate such risks, but face a trade-off among efficiency, privacy, and utility. To narrow this gap, we propose Cape, a context-aware prompt perturbation mechanism based on differential privacy, which enables efficient inference with an improved privacy-utility trade-off. Concretely, we introduce a hybrid utility function that better captures token similarity. Additionally, we propose a bucketized sampling mechanism to handle the large sampling space, which can otherwise lead to long-tail phenomena. Extensive experiments across multiple datasets, along with ablation studies, demonstrate that Cape achieves a better privacy-utility trade-off than prior state-of-the-art works.
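As a rough sketch of how bucketized sampling can tame a large candidate space under the exponential mechanism, the code below groups candidate tokens into utility buckets, samples a bucket, then samples a token within it. The uniform buckets and plain utility scores are assumptions made for illustration; they do not reproduce Cape's hybrid utility function.

```python
import numpy as np
from typing import Optional

def bucketized_exponential_mechanism(utilities: np.ndarray, epsilon: float,
                                     sensitivity: float = 1.0,
                                     n_buckets: int = 32,
                                     rng: Optional[np.random.Generator] = None) -> int:
    """Sample a replacement-token index with probability ~ exp(eps*u / (2*sensitivity)),
    after grouping candidates into utility buckets to keep sampling cheap."""
    rng = rng or np.random.default_rng()
    low, high = float(utilities.min()), float(utilities.max()) + 1e-9
    edges = np.linspace(low, high, n_buckets + 1)
    bucket_ids = np.clip(np.digitize(utilities, edges) - 1, 0, n_buckets - 1)
    # Score each non-empty bucket by a representative utility (its midpoint),
    # weighted by how many tokens fall inside it.
    bucket_scores = np.full(n_buckets, -np.inf)
    for b in range(n_buckets):
        members = np.where(bucket_ids == b)[0]
        if members.size:
            rep_u = 0.5 * (edges[b] + edges[b + 1])
            bucket_scores[b] = epsilon * rep_u / (2 * sensitivity) + np.log(members.size)
    probs = np.exp(bucket_scores - bucket_scores.max())
    probs /= probs.sum()
    chosen_bucket = rng.choice(n_buckets, p=probs)
    # Sample uniformly among the tokens inside the chosen bucket.
    members = np.where(bucket_ids == chosen_bucket)[0]
    return int(rng.choice(members))
```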
Researchers at TikTok and Rutgers University developed Secret-Protected Evolution (SecPE), a framework for differentially private synthetic text generation that provides secret-aware protection. This approach dramatically reduces computational overhead, achieving up to 13,000x faster histogram computation, and delivers improved data utility and downstream task performance under privacy compared to traditional GDP-based methods.
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advancements in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. InfiR aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at github.com/Reallm-Labs/InfiR.
Effective content moderation is essential for video platforms to safeguard user experience and uphold community standards. While traditional video classification models effectively handle well-defined moderation tasks, they struggle with complicated scenarios such as implicit harmful content and contextual ambiguity. Multimodal large language models (MLLMs) offer a promising solution to these limitations with their superior cross-modal reasoning and contextual understanding. However, two key challenges hinder their industrial adoption. First, the high computational cost of MLLMs makes full-scale deployment impractical. Second, adapting generative models for discriminative classification remains an open research problem. In this paper, we first introduce an efficient method to transform a generative MLLM into a multimodal classifier using minimal discriminative training data. To enable industry-scale deployment, we then propose a router-ranking cascade system that integrates MLLMs with a lightweight router model. Offline experiments demonstrate that our MLLM-based approach improves F1 score by 66.50% over traditional classifiers while requiring only 2% of the fine-tuning data. Online evaluations show that our system increases automatic content moderation volume by 41%, while the cascading deployment reduces computational cost to only 1.5% of direct full-scale deployment.
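A minimal sketch of a router-ranking cascade of the kind described above, assuming a cheap router that scores risk and an expensive MLLM classifier that is invoked only on the top-ranked fraction of traffic; the function names and the 2% budget are illustrative, not the deployed system.

```python
from typing import Callable, List, Tuple

def cascade_moderate(videos: List[str],
                     router_score: Callable[[str], float],
                     mllm_classify: Callable[[str], bool],
                     review_budget: float = 0.02) -> List[Tuple[str, bool]]:
    """Rank videos by router risk score and run the MLLM only on the top
    `review_budget` fraction; everything else passes through as benign."""
    ranked = sorted(videos, key=router_score, reverse=True)
    cutoff = max(1, int(len(ranked) * review_budget))
    decisions = []
    for i, video in enumerate(ranked):
        flagged = mllm_classify(video) if i < cutoff else False
        decisions.append((video, flagged))
    return decisions
```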
This paper presents an efficient framework for private Transformer inference that combines Homomorphic Encryption (HE) and Secure Multi-party Computation (MPC) to protect data privacy. Existing methods often leverage HE for linear layers (e.g., matrix multiplications) and MPC for non-linear layers (e.g., Softmax activation functions), but the conversion between HE and MPC introduces significant communication costs. The proposed framework, dubbed BLB, overcomes this by breaking down layers into fine-grained operators and further fusing adjacent linear operators, reducing the need for HE/MPC conversions. To manage the increased ciphertext bit width from the fused linear operators, BLB proposes the first secure conversion protocol between CKKS and MPC and enables CKKS-based computation of the fused operators. Additionally, BLB proposes an efficient matrix multiplication protocol for fused computation in Transformers. Extensive evaluations on BERT-base, BERT-large, and GPT2-base show that BLB achieves a 21× reduction in communication overhead compared to BOLT (S&P'24) and a 2× reduction compared to Bumblebee (NDSS'25), along with latency reductions of 13× and 1.8×, respectively, when leveraging GPU acceleration.
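The linear-algebra fact behind fusing adjacent linear operators is that (xW1)W2 = x(W1W2), so two back-to-back linear steps with no nonlinearity between them can be evaluated as one, needing a single HE/MPC conversion instead of two. The snippet below checks that identity numerically; it is a plain-text sketch, not a cryptographic implementation of BLB's protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))     # activations
W1 = rng.standard_normal((64, 64))   # first linear operator
W2 = rng.standard_normal((64, 16))   # adjacent linear operator

unfused = (x @ W1) @ W2              # two linear evaluations, two boundary crossings
fused = x @ (W1 @ W2)                # one evaluation of the pre-fused weight

assert np.allclose(unfused, fused)
```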
Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analysis, 2) inefficient large-model approaches, and 3) the lack of large-scale benchmark datasets. In this paper, we focus on building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1,009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and across various tasks, promoting future long video understanding at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.
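A hypothetical record layout for a time-aware QA item in such a benchmark is sketched below; the field names and values are assumptions for illustration, not the released HLV-1K schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class HourLongVideoQA:
    video_id: str
    question: str                          # time-aware query referencing a span of the video
    answer: str
    choices: Optional[List[str]] = None    # present for multi-choice (MCQA) items
    reasoning_level: str = "frame"         # "frame", "within_event", "cross_event", or "long_term"
    time_span_sec: List[float] = field(default_factory=lambda: [0.0, 0.0])

item = HourLongVideoQA(
    video_id="hlv_000123",
    question="Between 00:42:10 and 00:45:00, what does the presenter write on the board?",
    answer="The agenda for the second session",
    reasoning_level="within_event",
    time_span_sec=[2530.0, 2700.0],
)
```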