alphaXiv


Tomorrow Advancing Life

294

18 Feb 2025

computer-science computer-vision-and-pattern-recognition generative-models

Personalized Image Generation with Deep Generative Models: A Decade Survey

Harbin Institute of Technology

The Hong Kong Polytechnic University Tomorrow Advancing Life

A decade-spanning survey offers the first comprehensive review of personalized image generation techniques across Generative Adversarial Networks, Diffusion Models, and Autoregressive Models, proposing a unified framework for their analysis. The review details the evolution from GAN-based latent manipulation to highly flexible, text-driven diffusion-based customization, and identifies key challenges like the fidelity-versus-editability trade-off.

208

23 Aug 2025

computer-science computer-vision-and-pattern-recognition image-and-video-processing

MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration

Harbin Institute of Technology Tomorrow Advancing Life

Recent advancements in image quality assessment (IQA), driven by sophisticated deep neural network designs, have significantly improved the ability to approach human perception. However, most existing methods focus narrowly on fitting the overall score, neglecting the fact that humans typically evaluate image quality along different dimensions before arriving at an overall quality assessment. To overcome this problem, we propose a multi-dimensional image quality assessment (MDIQA) framework. Specifically, we model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions, to capture the multifaceted nature of human visual perception within distinct branches. Each branch of MDIQA is initially trained under the guidance of a separate dimension, and the respective features are then combined to generate the final IQA score. Additionally, once the MDIQA model is trained, it can be deployed for flexible training of image restoration (IR) models, enabling the restoration results to better align with varying user preferences through the adjustment of perceptual dimension weights. Extensive experiments demonstrate that MDIQA achieves superior performance and can be effectively and flexibly applied to image restoration tasks. The code is available: this https URL.

2,263

11 Apr 2024

attention-mechanisms autonomous-vehicles computer-science

HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention

Chinese Academy of Sciences Tomorrow Advancing Life

Predicting the trajectories of road agents is essential for autonomous driving systems. Recent mainstream methods follow a static paradigm, which predicts the future trajectory from a fixed duration of historical frames. These methods make predictions independently even at adjacent time steps, which leads to potential instability and temporal inconsistency. As successive time steps have largely overlapping historical frames, their forecasts should be intrinsically correlated: overlapping predicted trajectories should either be consistent, or differ but share the same motion goal, depending on the road situation. Motivated by this, we introduce HPNet, a novel dynamic trajectory forecasting method. Aiming for stable and accurate trajectory forecasting, our method leverages not only historical frames, including maps and agent states, but also historical predictions. Specifically, we design a Historical Prediction Attention module to automatically encode the dynamic relationship between successive predictions. By using historical predictions, it also extends the attention range beyond the currently visible window. The proposed Historical Prediction Attention, together with the Agent Attention and Mode Attention, is further formulated as the Triple Factorized Attention module, serving as the core design of HPNet. Experiments on the Argoverse and INTERACTION datasets show that HPNet achieves state-of-the-art performance and generates accurate and stable future trajectories. Our code is available at this https URL.

176

312

04 Dec 2024

computer-science computation-and-language computer-vision-and-pattern-recognition

LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning

Harbin Institute of Technology Tomorrow Advancing Life Pazhou Lab, Guangzhou

Mastering a skill generally relies on both hands-on experience from doers and insightful, high-level guidance by mentors. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal updates at each step. Large Language Models (LLMs) can also search for better solutions by inferring from natural language instructions, akin to a high-level mentor. In this paper, we show that these two participators are complementary to each other and can effectively collaborate as a combined optimization framework. The collaborative optimization is achieved by alternating between the gradient-based and LLM-based optimizers. We instruct LLMs to generate possibly improved solutions by taking parameter trajectories recorded during the previous stage of gradient-based optimization into account. Inferred results of LLMs are used as restarting points for the next stage of gradient optimization. We verify the effectiveness of this optimization framework on prompt tuning. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, the combined optimization method consistently yields improvements over competitive baselines on a variety of tasks. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code is released at this https URL
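The alternating scheme described in this abstract can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: the one-dimensional toy objective, the step sizes, and in particular `propose_restart`, which is a hypothetical rule-based stand-in for the LLM call (the abstract does not specify prompts or models). It only shows the control flow: run a gradient stage, record the trajectory, ask a proposer for a restart point, and repeat.

```python
def loss(x):
    # Toy non-convex objective with a local and a global minimum (illustrative only).
    return x**4 - 8 * x**2 + 3 * x


def grad(x):
    # Analytic derivative of the toy objective above.
    return 4 * x**3 - 16 * x + 3


def gradient_stage(x, steps=200, lr=0.01):
    """Plain gradient descent that records the (parameter, loss) trajectory."""
    trajectory = []
    for _ in range(steps):
        trajectory.append((x, loss(x)))
        x -= lr * grad(x)
    trajectory.append((x, loss(x)))
    return x, trajectory


def propose_restart(trajectory):
    """Hypothetical stand-in for the LLM-based optimizer.

    A real system would prompt an LLM with the recorded trajectory; here we
    simply try a few fixed offsets around the best point seen so far.
    """
    best_x, _ = min(trajectory, key=lambda point: point[1])
    candidates = [best_x + d for d in (-4.0, -2.0, -1.0, 1.0, 2.0, 4.0)]
    return min(candidates, key=loss)


def alternate(x0, rounds=3):
    """Alternate gradient stages with proposed restarts; keep the best result."""
    best_x, best_loss = x0, loss(x0)
    x = x0
    for _ in range(rounds):
        x, trajectory = gradient_stage(x)
        if loss(x) < best_loss:
            best_x, best_loss = x, loss(x)
        x = propose_restart(trajectory)  # restart point for the next stage
    return best_x, best_loss


if __name__ == "__main__":
    x_gd, _ = gradient_stage(3.0)
    print("gradient descent only:", x_gd, loss(x_gd))
    print("alternating scheme:   ", alternate(3.0))
```

On this toy problem, gradient descent alone settles in the local minimum near x = 1.9, while the alternating loop escapes to the lower minimum near x = -2.09. Tracking the best result across rounds is a design choice of this sketch, not something stated in the abstract.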

165

13 Aug 2025

chain-of-thought computer-science artificial-intelligence

Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Harbin Institute of Technology

The Hong Kong Polytechnic University Tomorrow Advancing Life Pazhou Lab, Guangzhou

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Effective alignment needs high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to perform reasoning according to the visual-derived text and the original question. This method presents a cost-efficient solution for multi-modal model development by optimizing existing models to work collaboratively, avoiding end-to-end development of vision-language models from scratch. By transforming images into language model-compatible text representations, it facilitates future low-cost and flexible upgrades to upcoming powerful LLMs. We introduce an outcome-rewarded joint-tuning strategy to optimize the cooperation between the visual interpretation and linguistic reasoning model. Evaluation results on vision-language benchmarks demonstrate that the decoupled reasoning framework outperforms recent LVLMs. Our approach yields particularly significant performance gains on visually intensive geometric mathematics problems. The code is available: this https URL.

18 Aug 2023

computer-science computer-vision-and-pattern-recognition

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

Harbin Institute of Technology

The Hong Kong Polytechnic University Tomorrow Advancing Life

Yabo Zhang

Researchers at Harbin Institute of Technology, The Hong Kong Polytechnic University, and TAL developed ELITE, a learning-based encoder that efficiently encodes visual concepts into textual embeddings for customized text-to-image generation. The method learns a new concept from a single image in approximately 0.05 seconds, achieving high fidelity and robust editability within diffusion models.

524

22 Jul 2025

computer-science artificial-intelligence computer-vision-and-pattern-recognition

VitaGlyph: Vitalizing Artistic Typography with Flexible Dual-branch Diffusion Models

Harbin Institute of Technology Tomorrow Advancing Life

Yabo Zhang

Artistic typography is a technique to visualize the meaning of input character in an imaginable and readable manner. With powerful text-to-image diffusion models, existing methods directly design the overall geometry and texture of input character, making it challenging to ensure both creativity and legibility. In this paper, we introduce a dual-branch, training-free method called VitaGlyph, enabling flexible artistic typography with controllable geometry changes while maintaining the readability. The key insight of VitaGlyph is to treat input character as a scene composed of a Subject and its Surrounding, which are rendered with varying degrees of geometric transformation. To enhance the visual appeal and creativity of the generated artistic typography, the subject flexibly expresses the essential concept of the input character, while the surrounding enriches relevant background without altering the shape, thus maintaining overall readability. Specifically, we implement VitaGlyph through a three-phase framework: (i) Knowledge Acquisition leverages large language models to design text descriptions for the subject and surrounding. (ii) Regional Interpretation detects the part that most closely matches the subject description and refines the structure via Semantic Typography. (iii) Attentional Compositional Generation separately renders the textures of the Subject and Surrounding regions and blends them in an attention-based manner. Experimental results demonstrate that VitaGlyph not only achieves better artistry and readability but also manages to depict multiple customized concepts, facilitating more creative and pleasing artistic typography generation. Our code will be made publicly available.

28 Jul 2024

attention-mechanisms computer-science computer-vision-and-pattern-recognition

MasterWeaver: Taming Editability and Face Identity for Personalized Text-to-Image Generation

Harbin Institute of Technology

The Hong Kong Polytechnic University Peng Cheng Lab Tomorrow Advancing Life

Text-to-image (T2I) diffusion models have shown significant success in personalized text-to-image generation, which aims to generate novel images with human identities indicated by the reference images. Although several tuning-free methods have achieved promising identity fidelity, they usually suffer from overfitting. The learned identity tends to entangle with irrelevant information, resulting in unsatisfactory text controllability, especially on faces. In this work, we present MasterWeaver, a test-time tuning-free method designed to generate personalized images with both faithful identity fidelity and flexible editability. Specifically, MasterWeaver adopts an encoder to extract identity features and steers the image generation through additionally introduced cross attention. To improve editability while maintaining identity fidelity, we propose an editing direction loss for training, which aligns the editing directions of MasterWeaver with those of the original T2I model. Additionally, a face-augmented dataset is constructed to facilitate disentangled identity learning and further improve editability. Extensive experiments demonstrate that MasterWeaver not only generates personalized images with faithful identity but also exhibits superior text controllability. Our code can be found at this https URL.

129

07 Feb 2025

computer-science computer-vision-security computer-vision-and-pattern-recognition

Explicit Relational Reasoning Network for Scene Text Detection

Fudan University Hunan Normal University Xiangtan University Tomorrow Advancing Life

Connected component (CC) is a proper text shape representation that aligns with human reading intuition. However, CC-based text detection methods have recently faced a developmental bottleneck that their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) to elegantly model the component relationships without post-processing. Concretely, we first represent each text instance as multiple ordered text components, and then treat these components as objects in sequential movement. In this way, scene text detection can be innovatively viewed as a tracking problem. From this perspective, we design an end-to-end tracking decoder to achieve a CC-based method dispensing with post-processing entirely. Additionally, we observe that there is an inconsistency between classification confidence and localization quality, so we propose a Polygon Monte-Carlo method to quickly and accurately evaluate the localization quality. Based on this, we introduce a position-supervised classification loss to guide the task-aligned learning of ERRNet. Experiments on challenging benchmarks demonstrate the effectiveness of our ERRNet. It consistently achieves state-of-the-art accuracy while holding highly competitive inference speed.
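The abstract does not spell out how the Polygon Monte-Carlo method works, so the following is only a generic sketch of the underlying idea: estimating the overlap (IoU) between a predicted and a reference polygon by random sampling rather than exact polygon clipping. The function names, the sample count, and the use of the joint bounding box as the sampling region are all assumptions for illustration, not the authors' implementation.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the horizontal ray from (x, y) cross the edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def monte_carlo_iou(poly_a, poly_b, num_samples=20000, seed=0):
    """Estimate IoU by sampling points uniformly in the joint bounding box."""
    rng = random.Random(seed)
    xs = [p[0] for p in poly_a + poly_b]
    ys = [p[1] for p in poly_a + poly_b]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    inter = 0
    union = 0
    for _ in range(num_samples):
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        in_a = point_in_polygon(x, y, poly_a)
        in_b = point_in_polygon(x, y, poly_b)
        inter += in_a and in_b
        union += in_a or in_b
    return inter / union if union else 0.0


if __name__ == "__main__":
    square_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
    square_b = [(1, 0), (3, 0), (3, 2), (1, 2)]
    # The exact IoU of these two squares is 2 / 6 = 1/3.
    print(monte_carlo_iou(square_a, square_b))
```

A sampling estimate like this trades a small, controllable error for speed and simplicity on arbitrary polygons, which is presumably why a Monte Carlo formulation is attractive for scoring localization quality during training.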

17 Aug 2021

computer-science computer-vision-and-pattern-recognition generative-models

Orthogonal Jacobian Regularization for Unsupervised Disentanglement in Image Generation

Harbin Institute of Technology Tomorrow Advancing Life Pazhou Lab, Guangzhou

Unsupervised disentanglement learning is a crucial issue for understanding and exploiting deep generative models. Recently, SeFa tries to find latent disentangled directions by performing SVD on the first projection of a pre-trained GAN. However, it is only applied to the first layer and works in a post-processing way. Hessian Penalty minimizes the off-diagonal entries of the output's Hessian matrix to facilitate disentanglement, and can be applied to multiple layers. However, it constrains each entry of the output independently, making it insufficient for disentangling the latent directions (e.g., shape, size, rotation) of spatially correlated variations. In this paper, we propose a simple Orthogonal Jacobian Regularization (OroJaR) to encourage deep generative models to learn disentangled representations. It simply encourages the variations of the output caused by perturbations on different latent dimensions to be orthogonal, and the Jacobian with respect to the input is calculated to represent this variation. We show that OroJaR also encourages the output's Hessian matrix to be diagonal in an indirect manner. In contrast to the Hessian Penalty, OroJaR constrains the output in a holistic way, making it very effective in disentangling latent dimensions corresponding to spatially correlated variations. Quantitative and qualitative experimental results show that our method is effective in disentangled and controllable image generation, and performs favorably against state-of-the-art methods. Our code is available at this https URL
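The orthogonality idea in this abstract can be made concrete with a small sketch. It assumes a toy generator that maps a short latent list to a short output list, approximates each Jacobian column by finite differences, and sums the squared inner products between columns belonging to different latent dimensions. The toy generators, the finite-difference step, and the exact form of the penalty are illustrative assumptions; the abstract does not give the authors' loss or how they compute it efficiently for deep networks.

```python
def jacobian_columns(generator, z, eps=1e-5):
    """Finite-difference approximation of d(output)/d(z_i) for each latent i."""
    base = generator(z)
    columns = []
    for i in range(len(z)):
        z_shift = list(z)
        z_shift[i] += eps
        shifted = generator(z_shift)
        columns.append([(s - b) / eps for s, b in zip(shifted, base)])
    return columns


def orthogonality_penalty(generator, z):
    """Sum of squared inner products between Jacobian columns of different latents.

    The penalty is zero when perturbing different latent dimensions changes
    the output in mutually orthogonal directions.
    """
    cols = jacobian_columns(generator, z)
    penalty = 0.0
    for i in range(len(cols)):
        for j in range(len(cols)):
            if i != j:
                dot = sum(a * b for a, b in zip(cols[i], cols[j]))
                penalty += dot * dot
    return penalty


def disentangled(z):
    # Hypothetical generator: each latent controls a separate output entry.
    return [2.0 * z[0], 0.0, 3.0 * z[1]]


def entangled(z):
    # Hypothetical generator: both latents move the same output entries.
    return [z[0] + z[1], z[0] + z[1], 0.5 * z[1]]


if __name__ == "__main__":
    z = [0.3, -0.7]
    print("disentangled penalty:", orthogonality_penalty(disentangled, z))
    print("entangled penalty:   ", orthogonality_penalty(entangled, z))
```

In a training loop, a term of this kind would be added to the generator loss with some weight, and automatic differentiation would normally replace the finite differences used here.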

158

13 May 2024

computer-science artificial-intelligence computation-and-language

MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning

Ocean University of China Tomorrow Advancing Life

MuMath-Code improves open-source Large Language Models' mathematical reasoning by integrating multi-perspective data augmentation with external tool-use. The method leverages a novel code-nested dataset and a two-stage training strategy, achieving state-of-the-art performance for open models with 90.7% on GSM8K and 55.1% on MATH.

21 Aug 2023

attention-mechanisms computer-science computer-vision-security

Patch Is Not All You Need

Chinese Academy of Sciences Institute of Computing Technology Tomorrow Advancing Life

Vision Transformers have achieved great success in computer vision, delivering exceptional performance across various tasks. However, their inherent reliance on sequential input enforces the manual partitioning of images into patch sequences, which disrupts the image's inherent structural and semantic continuity. To handle this, we propose a novel Pattern Transformer (Patternformer) to adaptively convert images to pattern sequences for Transformer input. Specifically, we employ a Convolutional Neural Network to extract various patterns from the input image, with each channel representing a unique pattern that is fed into the succeeding Transformer as a visual token. By enabling the network to optimize these patterns, each pattern concentrates on its local region of interest, thereby preserving its intrinsic structural and semantic information. Employing only the vanilla ResNet and Transformer, we achieve state-of-the-art performance on CIFAR-10 and CIFAR-100 and competitive results on ImageNet.

29 Mar 2025

computer-science computer-vision-and-pattern-recognition

IPAD: Iterative, Parallel, and Diffusion-based Network for Scene Text Recognition

Chinese Academy of Sciences Tomorrow Advancing Life

Xiaomeng Yang

Nowadays, scene text recognition has attracted more and more attention due to its diverse applications. Most state-of-the-art methods adopt an encoder-decoder framework with the attention mechanism, autoregressively generating text from left to right. Despite the convincing performance, this sequential decoding strategy constrains the inference speed. Conversely, non-autoregressive models provide faster, simultaneous predictions but often sacrifice accuracy. Although utilizing an explicit language model can improve performance, it burdens the computational load. Besides, separating linguistic knowledge from vision information may harm the final prediction. In this paper, we propose an alternative solution that uses a parallel and iterative decoder that adopts an easy-first decoding strategy. Furthermore, we regard text recognition as an image-based conditional text generation task and utilize the discrete diffusion strategy, ensuring exhaustive exploration of bidirectional contextual information. Extensive experiments demonstrate that the proposed approach achieves superior results on the benchmark datasets, including both Chinese and English text images.

22 Apr 2023

computer-science computer-vision-and-pattern-recognition generative-models

CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model

Chinese Academy of Sciences Tomorrow Advancing Life Beijing Institute of Technology

Robin Wang

With the development of deep generative models, recent years have seen great success in Chinese landscape painting generation. However, few works focus on controllable Chinese landscape painting generation due to the lack of data and limited modeling capabilities. In this work, we propose a controllable Chinese landscape painting generation method named CCLAP, which can generate paintings with specific content and style based on the Latent Diffusion Model. Specifically, it consists of two cascaded modules, i.e., a content generator and a style aggregator. The content generator module ensures that the content of generated paintings matches the input text, while the style aggregator module generates paintings in a style corresponding to a reference image. Moreover, a new dataset of Chinese landscape paintings named CLAP is collected for comprehensive evaluation. Both the qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance, especially in artful composition and artistic conception. Code is available at this https URL

31 Mar 2022

computer-science computer-vision-and-pattern-recognition generative-models

Semantic-shape Adaptive Feature Modulation for Semantic Image Synthesis

Tianjin University

Alibaba Group Harbin Institute of Technology Peng Cheng Laboratory Tomorrow Advancing Life

Zhengyao Lyu

Although recent years have witnessed substantial progress in semantic image synthesis, it is still challenging to synthesize photo-realistic images with rich details. Most previous methods focus on exploiting the given semantic map, which captures only an object-level layout for an image. A fine-grained part-level semantic layout would benefit the generation of object details, and it can be roughly inferred from an object's shape. In order to exploit the part-level layouts, we propose a Shape-aware Position Descriptor (SPD) to describe each pixel's positional feature, where object shape is explicitly encoded into the SPD feature. Furthermore, a Semantic-shape Adaptive Feature Modulation (SAFM) block is proposed to combine the given semantic map and our positional features to produce adaptively modulated features. Extensive experiments demonstrate that the proposed SPD and SAFM significantly improve the generation of objects with rich details. Moreover, our method performs favorably against the SOTA methods in terms of quantitative and qualitative evaluation. The source code and model are available at this https URL

20 Dec 2023

attention-mechanisms computer-science computer-vision-security

Masked and Permuted Implicit Context Learning for Scene Text Recognition

Chinese Academy of Sciences

Communication University of China Tomorrow Advancing Life

Xiaomeng Yang

Scene Text Recognition (STR) is difficult because of the variations in text styles, shapes, and backgrounds. Though the integration of linguistic information enhances models' performance, existing methods based on either permuted language modeling (PLM) or masked language modeling (MLM) have their pitfalls. PLM's autoregressive decoding lacks foresight into subsequent characters, while MLM overlooks inter-character dependencies. Addressing these problems, we propose a masked and permuted implicit context learning network for STR, which unifies PLM and MLM within a single decoder, inheriting the advantages of both approaches. We utilize the training procedure of PLM, and to integrate MLM, we incorporate word length information into the decoding process and replace the undetermined characters with mask tokens. Besides, perturbation training is employed to train a more robust model against potential length prediction errors. Our empirical evaluations demonstrate the performance of our model. It not only achieves superior performance on the common benchmarks but also achieves a substantial improvement of 9.1% on the more challenging Union14M-Benchmark.

06 Apr 2022

computer-science computer-vision-and-pattern-recognition

Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis

Harbin Institute of Technology Peng Cheng Laboratory Tomorrow Advancing Life

Semantic image synthesis is a challenging task with many practical applications. Although remarkable progress has been made in semantic image synthesis with spatially-adaptive normalization, existing methods normalize the feature activations under only coarse-level guidance (e.g., semantic class). However, different parts of a semantic object (e.g., the wheel and window of a car) are quite different in structure and texture, so blurry synthesis results are usually inevitable due to the lack of fine-grained guidance. In this paper, we propose a novel normalization module, termed REtrieval-based Spatially AdaptIve normaLization (RESAIL), for introducing pixel-level fine-grained guidance to the normalization architecture. Specifically, we first present a retrieval paradigm that finds, for each test semantic mask, a content patch of the same semantic class from the training set with the most similar shape. Then, RESAIL uses the retrieved patch to guide the feature normalization of the corresponding region and can provide pixel-level fine-grained guidance, thereby greatly mitigating blurry synthesis results. Moreover, distorted ground-truth images are also utilized as alternatives to retrieval-based guidance for feature normalization, further benefiting model training and improving the visual quality of generated images. Experiments on several challenging datasets show that RESAIL performs favorably against state-of-the-art methods in terms of quantitative metrics, visual quality, and subjective evaluation. The source code and pre-trained models will be publicly available.

25 Sep 2022

computer-science computer-vision-and-pattern-recognition

Towards Diverse and Faithful One-shot Adaption of Generative Adversarial Networks

Harbin Institute of Technology Tomorrow Advancing Life

Yabo Zhang

Researchers from Harbin Institute of Technology propose DiFa, a method for one-shot generative domain adaptation that simultaneously achieves faithful style acquisition and diverse image generation. The approach utilizes global CLIP alignment, an attentive local style loss, and a selective cross-domain consistency loss to adapt pre-trained GANs to new domains using only a single reference image, demonstrating improved image quality and diversity over existing methods.

18 Oct 2022

computer-science computer-vision-and-pattern-recognition image-segmentation

1st Place Solutions for the UVO Challenge 2022

Beijing University of Posts and Telecommunications Tomorrow Advancing Life

This paper describes the approach we have taken in the challenge. We adopt the same two-stage scheme as last year's champion, that is, detection first, followed by segmentation. We train a more powerful detector and segmentor separately. In addition, we perform pseudo-label training on the test set, based on a student-teacher framework and end-to-end transformer-based object detection. The method ranks first in the 2nd Unidentified Video Objects (UVO) challenge, achieving AR@100 of 46.8, 64.7, and 32.2 in the limited data frame track, unlimited data frame track, and video track, respectively.

31 May 2023

computer-science computer-vision-and-pattern-recognition few-shot-learning

Inferring and Leveraging Parts from Object Shape for Improving Semantic Image Synthesis

Harbin Institute of Technology

The Hong Kong Polytechnic University Tomorrow Advancing Life

Despite the progress in semantic image synthesis, it remains a challenging problem to generate photo-realistic parts from an input semantic map. Integrating a part segmentation map can undoubtedly benefit image synthesis, but such maps are inconvenient for users to provide. To improve part synthesis, this paper proposes to infer Parts from Object ShapE (iPOSE) and leverage them for improving semantic image synthesis. However, although several part segmentation datasets are available, part annotations are still not provided for many object categories in semantic image synthesis. To circumvent this, we resort to a few-shot regime to learn a PartNet for predicting the object part map with the guidance of pre-defined support part maps. PartNet can be readily generalized to handle a new object category when a small number (e.g., 3) of support part maps for this category are provided. Furthermore, part semantic modulation is presented to incorporate both the inferred part map and the semantic map for image synthesis. Experiments show that iPOSE not only generates objects with rich part details but also enables flexible control of image synthesis. Moreover, iPOSE performs favorably against the state-of-the-art methods in terms of quantitative and qualitative evaluation. Our code will be publicly available at this https URL.
