Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: a lack of support for low-resource languages, a domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR (Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
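A rough sketch of how a TAS-style score could combine the three attribute similarities the abstract names; the component scores, equal weighting, and function name are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def tas_score(font_sim, color_sim, background_sim, weights=(1.0, 1.0, 1.0)):
    """Hypothetical TAS-style aggregate: combine per-attribute similarities.

    font_sim / color_sim / background_sim are assumed to lie in [0, 1]
    (e.g. cosine similarity of font embeddings, color-histogram overlap,
    background patch similarity). Equal weighting is an assumption,
    not the paper's formula.
    """
    w = np.asarray(weights, dtype=float)
    s = np.asarray([font_sim, color_sim, background_sim], dtype=float)
    return float(np.dot(w, s) / w.sum())

print(tas_score(0.92, 0.88, 0.95))  # ~0.917
```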
In this paper, we present an effective data augmentation framework leveraging a Large Language Model (LLM) and a Diffusion Model (DM) to tackle the challenges inherent in data-scarce scenarios. Recently, DMs have opened up the possibility of generating synthetic images to complement a few training images. However, increasing the diversity of synthetic images also raises the risk of generating samples outside the target distribution. Our approach addresses this issue by embedding novel semantic information into text prompts via the LLM and utilizing real images as visual prompts, thus generating semantically rich images. To ensure that the generated images remain within the target distribution, we dynamically adjust the guidance weight based on each image's CLIPScore to control the diversity. Experimental results show that our method produces synthetic images with enhanced diversity while maintaining adherence to the target distribution. Consequently, our approach proves to be more efficient in the few-shot setting on several benchmarks. Our code is available at this https URL.
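A minimal sketch of the kind of CLIPScore-driven guidance adjustment described above, under the assumption that weaker prompt alignment should trigger stronger guidance; the thresholds, weight range, and mapping are hypothetical, not the paper's values:

```python
import numpy as np

def guidance_weight_from_clipscore(clip_score, w_min=3.0, w_max=9.0,
                                   score_lo=0.20, score_hi=0.35):
    """Map a CLIPScore to a classifier-free-guidance weight.

    Lower CLIPScore (the sample drifting from the target prompt/distribution)
    -> larger guidance weight to pull it back toward the prompt; higher
    CLIPScore -> smaller weight, leaving room for diversity. All constants
    here are illustrative assumptions.
    """
    t = np.clip((clip_score - score_lo) / (score_hi - score_lo), 0.0, 1.0)
    return w_max - t * (w_max - w_min)

# A weakly aligned sample gets stronger guidance than a well-aligned one.
print(guidance_weight_from_clipscore(0.22))  # ~8.2
print(guidance_weight_from_clipscore(0.33))  # ~3.8
```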
To achieve realistic immersion in landscape images, fluids such as water and clouds need to move within the image while revealing new scenes from various camera perspectives. Recently, a field called dynamic scene video has emerged, which combines single-image animation with 3D photography. These methods use a pseudo 3D space implicitly represented with Layered Depth Images (LDIs). LDIs separate a single image into depth-based layers, which enables elements like water and clouds to move within the image while revealing new scenes from different camera perspectives. However, since landscapes typically consist of continuous elements, including fluids, separating a landscape image into discrete layers can diminish depth perception and introduce distortions depending on camera movement. Furthermore, because the 3D space is modeled only implicitly, the output is limited to videos in the 2D domain, reducing its versatility. In this paper, we propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework focuses on optimizing 3D Gaussians by generating multi-view images from a single image and on creating 3D motion to optimize 4D Gaussians. The most important part of the proposed framework is consistent 3D motion estimation, which estimates a common motion among the multi-view images to bring the motion in 3D space closer to actual motions. To the best of our knowledge, this is the first attempt to consider animation while representing a complete 3D space from a single landscape image. Our model demonstrates the ability to provide realistic immersion in various landscape images through diverse experiments and metrics. Extensive experimental results are available at this https URL.
Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.
This paper presents a novel and efficient image enhancement method based on pigment representation. Unlike conventional methods where the color transformation is restricted to pre-defined color spaces like RGB, our method dynamically adapts to the input content by transforming RGB colors into a high-dimensional feature space referred to as pigments. The proposed pigment representation offers adaptability and expressiveness, achieving superior image enhancement performance. The proposed method transforms input RGB colors into high-dimensional pigments, which are then reprojected individually and blended to refine and aggregate the color information in the pigment space. The pigments are then transformed back into RGB colors to generate an enhanced output image. The transformation and reprojection parameters are derived from a visual encoder, which adaptively estimates these parameters based on the content of the input image. Extensive experimental results demonstrate the superior performance of the proposed method over state-of-the-art methods in image enhancement tasks, including image retouching and tone mapping, while maintaining relatively low computational complexity and a small model size.
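A toy illustration of the transform, reproject/blend, and transform-back pipeline described above; the pigment dimensionality, the nonlinearities, and the fact that parameters are passed in by hand (rather than predicted by a visual encoder) are simplifying assumptions:

```python
import numpy as np

def enhance_with_pigments(rgb, W_up, b_up, W_mix, W_down, b_down):
    """Toy pigment-style per-pixel color transform.

    rgb:    (H, W, 3) image in [0, 1].
    W_up:   (3, K) projection from RGB into a K-dim "pigment" space.
    W_mix:  (K, K) blending matrix acting across pigments.
    W_down: (K, 3) projection back to RGB.
    In the paper these parameters come from a visual encoder conditioned
    on the input image; here they are simply given as arguments.
    """
    h, w, _ = rgb.shape
    x = rgb.reshape(-1, 3)
    pig = np.tanh(x @ W_up + b_up)        # lift colors into pigment space
    pig = np.maximum(pig @ W_mix, 0.0)    # reproject and blend pigments
    out = pig @ W_down + b_down           # map back to RGB
    return np.clip(out.reshape(h, w, 3), 0.0, 1.0)

# Random parameters only to show the shapes; a real model would predict them.
rng = np.random.default_rng(0)
K = 16
img = rng.random((4, 4, 3))
out = enhance_with_pigments(
    img,
    rng.normal(0, 0.5, (3, K)), np.zeros(K),
    rng.normal(0, 0.3, (K, K)),
    rng.normal(0, 0.3, (K, 3)), np.zeros(3),
)
print(out.shape)  # (4, 4, 3)
```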
For real-world language applications, detecting an out-of-distribution (OOD) sample is helpful to alert users or reject such unreliable samples. However, modern over-parameterized language models often produce overconfident predictions for both in-distribution (ID) and OOD samples. In particular, language models suffer from OOD samples with a similar semantic representation to ID samples since these OOD samples lie near the ID manifold. A rejection network can be trained with ID and diverse outlier samples to detect test OOD samples, but explicitly collecting auxiliary OOD datasets brings an additional burden for data collection. In this paper, we propose a simple but effective method called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes. The surrogate OOD sample introduced by POE shows a similar representation to ID data, which is most effective in training a rejection network. Our method does not require any external OOD data and can be easily implemented within off-the-shelf Transformers. A comprehensive comparison with state-of-the-art algorithms demonstrates POE's competitiveness on several text classification benchmarks.
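A simplified sketch of the masking idea behind POE: repeatedly mask the tokens most associated with ID classes to obtain near-manifold surrogate outliers. The token-scoring scheme and step count here are stand-ins for the paper's actual attribution procedure:

```python
def pseudo_outliers(sentence, class_keyword_scores, mask_token="[MASK]", n_steps=2):
    """Create surrogate OOD texts by masking tokens most tied to ID classes.

    class_keyword_scores maps a token to its relevance to the ID classes
    (e.g. frequency- or attention-based); both the scoring and the number
    of masking steps are illustrative stand-ins for the paper's procedure.
    """
    tokens = sentence.split()
    ranked = sorted(range(len(tokens)),
                    key=lambda i: class_keyword_scores.get(tokens[i].lower(), 0.0),
                    reverse=True)
    surrogates = []
    masked = list(tokens)
    for step in range(min(n_steps, len(tokens))):
        masked[ranked[step]] = mask_token  # mask the next most class-related token
        surrogates.append(" ".join(masked))
    return surrogates

scores = {"refund": 0.9, "shipping": 0.7, "order": 0.6}
print(pseudo_outliers("I want a refund for my shipping order", scores))
```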
The FASTSUM collaboration has a long-standing programme of using anisotropic lattice QCD to investigate strong-interaction thermodynamics, and in particular spectral quantities. Here we present first results from our new ensemble, which has a temporal lattice spacing a_t = 15 am and anisotropy ξ = a_s/a_t = 7, giving unprecedented resolution in the temporal direction. We show results for the chiral transition, vector-axial-vector degeneracy, and heavy quarkonium, and compare them with earlier results obtained with coarser time resolution.
A social norm defines what is good and what is bad in social contexts, as well as what to do based on such assessments. A stable social norm should be maintained against errors committed by its players. In addition, individuals may have different probabilities of errors in following the norm, and a social norm would be unstable if it benefited those who do not follow the norm carefully. In this work, we show that Simple Standing, which has been known to resist errors and mutants successfully, actually exhibits threshold behavior. That is, in a population of individuals playing the donation game according to Simple Standing, the residents can suppress the invasion of mutants with higher error proneness only if the residents' own error proneness is sufficiently low. Otherwise, the population will be invaded by mutants that commit assessment errors more frequently, and a series of such invasions will eventually undermine the existing social norm. This study suggests that the stability analysis of a social norm may have a different picture if the probability of error itself is regarded as an individual attribute.
Modular quantum architectures have emerged as a promising approach for scaling quantum computing systems by connecting multiple Quantum Processing Units (QPUs). However, this approach introduces significant challenges due to costly inter-core operations between chips and quantum state transfers, which contribute to noise and quantum decoherence. This paper presents QARMA, a novel Qubit mapping using Attention-based deep Reinforcement learning (DRL) for Modular quantum Architectures, along with its extension QARMA-R that incorporates dynamic qubit reuse capabilities. Our approach combines an attention-based mechanism with Graph Neural Networks (GNN) to learn optimal qubit allocation, routing, and reuse strategies that minimize inter-core communications. We introduce two key innovations: (1) a transformer-based encoder that captures both the global circuit structure and local qubit interactions and (2) a dynamic qubit reuse compilation mechanism that leverages mid-circuit measurement and reset operations to reduce inter-operation and qubit requirements. Our experimental results show significant improvements over state-of-the-art approaches. Compared to highly optimized Qiskit with a modular architecture configuration, QARMA-R reduces inter-core communications by up to 100% (on average 86%), while QARMA maintains a 15-40% improvement for larger circuits without reuse. Against traditional modular qubit mapping, our approach achieves a 97-100% reduction in inter-core operations. The proposed methods advance quantum circuit compilation techniques and enable the execution of more extensive quantum algorithms on resource-constrained modular quantum systems, contributing to the growing body of research on scalable quantum computing architectures.
Microwave drives applied to superconducting qubits (SCQs) are central to high-fidelity control and fast readout. However, recent studies find that even drives far below the superconducting gap frequency may cause drive-induced quasiparticle generation (QPG) across Josephson junctions (JJs), posing a serious concern for fault-tolerant superconducting quantum computing. Here, we find experimental evidence that the actual QPG rates in strongly driven SCQs are remarkably lower than expected. We apply intense drive fields through readout resonators, reaching effective qubit drive amplitudes up to 300 GHz. The nonlinear response of the resonators enables quantification of the energy loss from SCQs into their environments, including the contribution from QPG. Even when conservatively attributing all measured dissipation to QPG, the observed energy dissipation rates are far lower than expected from the ideal QPG model. Meanwhile, calculations incorporating high-frequency cutoffs (HFCs) near 17-20 GHz in the QPG conductance can explain the experiments. These HFCs yield QPG rates a few orders of magnitude smaller than those without HFCs, providing evidence that the QPG rates are lower than predicted by the ideal model. Our findings prevent overestimation of drive-induced QPG and provide crucial guidance for operating superconducting quantum processors. Identifying the microscopic origin of the discrepancy opens new material and device opportunities to further mitigate QPG.
Recent thousand-qubit processors represent a significant hardware advancement, but current limitations prevent effective quantum error correction (QEC), necessitating reliance on quantum error mitigation (QEM) to enhance result fidelity from quantum computers. Our paper introduces a noise-aware folding technique that enhances Zero-Noise Extrapolation (ZNE) by leveraging the noise characteristics of target quantum hardware to fold circuits more efficiently. Unlike traditional ZNE approaches assuming uniform error distribution, our method redistributes noise using calibration data based on hardware noise models. By employing a noise-adaptive compilation method combined with our proposed folding mechanism, we enhance the ZNE accuracy of quantum gate-based computing using superconducting quantum computers. This paper highlights the uniqueness of our method, summarizes noise accumulation, presents the scaling algorithm, and compares the reliability of our method with that of existing models using a linear extrapolation model. Experimental results show that compared to existing folding methods, our approach achieved a 35% improvement on quantum computer simulators and a 31% improvement on real quantum computers, demonstrating the effectiveness of our proposed approach.
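For reference, the extrapolation step that follows any folding scheme (noise-aware or uniform) is a fit over measurements taken at amplified noise levels; the numbers below are illustrative, not from the paper:

```python
import numpy as np

def zero_noise_extrapolate(noise_factors, expectation_values):
    """Linear extrapolation of noisy expectation values to the zero-noise limit.

    noise_factors: effective noise scaling per folded circuit (e.g. 1, 2, 3).
    With noise-aware folding these factors would be weighted by per-gate
    error rates instead of assuming uniform error, but the extrapolation
    step itself is the same.
    """
    slope, intercept = np.polyfit(noise_factors, expectation_values, deg=1)
    return intercept  # estimated expectation value at noise factor 0

# Illustrative numbers: the measured value degrades as noise is amplified.
factors = [1.0, 2.0, 3.0]
values = [0.82, 0.71, 0.60]
print(round(zero_noise_extrapolate(factors, values), 3))  # ~0.93
```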
Recent studies on indirect reciprocity with private assessment on complete graphs suggest the possibility that one can continuously modulate the degree of segregation by controlling how to judge a good person helping a bad one. A well-known social norm called L6 judges it as bad, which eventually segregates the society into two antagonistic clusters, but if it is judged as good, the system reaches paradise where everyone likes each other. In this work, we numerically study this transition between segregation and paradise in two different settings. Firstly, in a uniform population of size N where everyone regards such a donor as good with probability p and bad with 1-p, we observe paradise when Np is sufficiently greater than O(1). In contrast, in a heterogeneous setting where only k individuals judge such a donor as good, the size difference of the clusters increases almost linearly as k increases, so paradise can only be reached as k → N in a large population. Therefore, when an urgent change is needed to overcome the segregation due to L6, a small change in each and every individual's behavior is more efficient than a radical change in a fraction of the population.
This paper investigates a learning solution for robust beamforming optimization in downlink multi-user systems. A base station (BS) identifies efficient multi-antenna transmission strategies only with imperfect channel state information (CSI) and its stochastic features. To this end, we propose a robust training algorithm where a deep neural network (DNN), which only accepts estimates and statistical knowledge of the perfect CSI, is optimized to fit the real-world propagation environment. Consequently, the trained DNN can provide efficient robust beamforming solutions based only on imperfect observations of the actual CSI. Numerical results validate the advantages of the proposed learning approach compared to conventional schemes.
A comparative analysis of deep learning architectures for face spoofing detection demonstrates MobileNetV2's superior performance, with 91.59% accuracy on test data compared to the Vision Transformer's 86.54%, while evaluating model efficiency and generalization capabilities across a dataset of 150,986 images.
In conventional multi-user multiple-input multiple-output (MU-MIMO) systems with frequency division duplexing (FDD), channel acquisition and precoder optimization processes have been designed separately although they are highly coupled. This paper studies an end-to-end design of downlink MU-MIMO systems which include pilot sequences, limited feedback, and precoding. To address this problem, we propose a novel deep learning (DL) framework which jointly optimizes the feedback information generation at users and the precoder design at a base station (BS). Each procedure in the MU-MIMO system is replaced by intelligently designed deep neural network (DNN) units. At the BS, a neural network generates pilot sequences and helps the users obtain accurate channel state information. At each user, the channel feedback operation is carried out in a distributed manner by an individual user DNN. Then, another BS DNN collects feedback information from the users and determines the MIMO precoding matrices. A joint training algorithm is proposed to optimize all DNN units in an end-to-end manner. In addition, for a scalable design, a training strategy is proposed that avoids retraining for different network sizes. Numerical results demonstrate the effectiveness of the proposed DL framework compared to classical optimization techniques and other conventional DNN schemes.
Handling class imbalance remains a central challenge in machine learning, particularly in pattern recognition tasks where rare but critical events, such as fraudulent transactions or medical anomalies, must be identified accurately. Traditional generative models offer a potential remedy through data augmentation but often treat generation and classification as independent processes, leading to distribution mismatch and limited classifier benefit. To address these shortcomings, we propose Causal Cooperative Networks (CCNETS), a modular learning framework that integrates generation, inference, and reconstruction within a unified causal paradigm. CCNETS comprises three cooperative modules: an Explainer for latent feature abstraction, a Reasoner for label prediction, and a Producer for context-aware data generation. These components interact through a causal feedback loop, where classification results guide targeted sample synthesis. A key innovation, the Zoint mechanism, enables adaptive fusion of latent and observable features, enhancing semantic richness and enabling robust decision-making under uncertainty. We evaluate CCNETS on a real-world credit card fraud detection dataset with extreme imbalance (fraud cases < 0.2%). Across three experimental setups (synthetic training, amplified generation, and direct classifier comparison), CCNETS outperforms baseline methods, achieving higher F1 scores, precision, and recall. Models trained on CCNETS-generated data also demonstrate superior generalization under limited data conditions. These results establish CCNETS as a scalable, interpretable, and hybrid soft computing framework. By causally aligning synthetic data with classifier objectives, CCNETS advances imbalanced pattern recognition and opens new directions for robust, modular learning in real-world applications.
Image denoising is essential for removing noise in images caused by electric device malfunctions or other factors during image acquisition. It helps preserve image quality and interpretation. Many convolutional autoencoder algorithms have proven effective in image denoising. Owing to their promising efficiency, quantum computers have gained popularity. This study introduces a quantum convolutional autoencoder (QCAE) method for improved image denoising. This method was developed by substituting the representative latent space of the autoencoder with a quantum circuit. To enhance efficiency, we leveraged the advantages of the quantum approximate optimization algorithm (QAOA)-incorporated parameter-shift rule to identify an optimized cost function, facilitating effective learning from data and gradient computation on an actual quantum computer. The proposed QCAE method outperformed its classical counterpart as it exhibited lower training loss and a higher structural similarity index (SSIM) value. QCAE also outperformed its classical counterpart in denoising the MNIST dataset by up to 40% in terms of SSIM value, confirming its enhanced capabilities in real-world applications. Evaluation of QAOA performance across different circuit configurations and layer variations showed that our technique outperformed other circuit designs by 25% on average.
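The parameter-shift rule mentioned above can be stated in a few lines; the toy expectation function below stands in for a circuit evaluation on hardware:

```python
import numpy as np

def expectation(theta):
    """Stand-in for a quantum expectation value, e.g. <Z> after RY(theta)|0>."""
    return np.cos(theta)

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    """Parameter-shift rule: exact gradient for gates generated by Pauli operators.

    d<f>/d(theta) = [f(theta + pi/2) - f(theta - pi/2)] / 2
    This is the kind of rule a QAOA/QCAE training loop relies on to obtain
    gradients directly from circuit evaluations.
    """
    return 0.5 * (f(theta + shift) - f(theta - shift))

theta = 0.7
print(parameter_shift_grad(expectation, theta))  # ~ -sin(0.7) ≈ -0.644
print(-np.sin(theta))                            # analytic value for comparison
```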
Surface defect detection of steel, especially the recognition of multi-scale defects, has always been a major challenge in industrial manufacturing. Steel surfaces exhibit defects of various sizes and shapes, which limits the accuracy of traditional image processing and detection methods in complex environments; in particular, these methods suffer from insufficient accuracy and high miss-detection rates when dealing with small target defects. To address this issue, this study proposes a detection framework based on deep learning, specifically YOLOv9s, combined with the C3Ghost module, the SCConv module, and the CARAFE upsampling operator, to improve detection accuracy and model performance. First, the SCConv module is used to reduce feature redundancy and optimize feature representation by reconstructing the spatial and channel dimensions. Second, the C3Ghost module is introduced to enhance the model's feature extraction ability by reducing redundant computations and parameter volume, thereby improving model efficiency. Finally, the CARAFE upsampling operator, which can more finely reorganize feature maps in a content-aware manner, optimizes the upsampling process and ensures detailed restoration of high-resolution defect regions. Experimental results demonstrate that the proposed model achieves higher accuracy and robustness in steel surface defect detection tasks compared to other methods, effectively addressing defect detection problems.
This letter studies deep learning (DL) approaches to optimize beamforming vectors in downlink multi-user multi-antenna systems that can be universally applied to arbitrarily given transmit power limitation at a base station. We exploit the sum power budget as side information so that deep neural networks (DNNs) can effectively learn the impact of the power constraint in the beamforming optimization. Consequently, a single training process is sufficient for the proposed universal DL approach, whereas conventional methods need to train multiple DNNs for all possible power budget levels. Numerical results demonstrate the effectiveness of the proposed DL methods over existing schemes.
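One way to see how a single trained network can serve arbitrary power budgets is that the budget P enters both as an input feature and as a hard constraint applied to the output; the projection step below is a generic sketch of that constraint, not the paper's architecture:

```python
import numpy as np

def project_to_power_budget(raw_beams, power_budget):
    """Scale raw network outputs so the total transmit power meets the budget.

    raw_beams: (num_users, num_antennas) complex beamforming vectors produced
    by a DNN that also receives the power budget P as a side-information
    input, so one trained network can handle any P.
    """
    total_power = np.sum(np.abs(raw_beams) ** 2)
    scale = np.sqrt(power_budget / max(total_power, 1e-12))
    return raw_beams * scale

rng = np.random.default_rng(1)
beams = rng.normal(size=(4, 8)) + 1j * rng.normal(size=(4, 8))
for P in [1.0, 10.0]:
    w = project_to_power_budget(beams, P)
    print(P, round(np.sum(np.abs(w) ** 2), 6))  # total power matches each budget
```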
Researchers introduced "Calibrated PLM (CALL)," a framework combining confidence penalty losses, MixUp, and ensemble methods, demonstrating improved classification accuracy and calibration for pre-trained Transformers in multi-class text classification, especially under low-resource conditions. The empirical study detailed the effects and synergistic potential of various calibration techniques.
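Of the techniques combined in CALL, the confidence penalty is the easiest to sketch: a negative-entropy term added to cross-entropy that discourages over-peaked predictions. The weight beta and the numerical example are illustrative:

```python
import numpy as np

def penalized_cross_entropy(logits, label, beta=0.1):
    """Cross-entropy with a confidence penalty (negative-entropy regularizer).

    loss = CE - beta * H(p): the lower the predictive entropy (i.e. the more
    over-confident the softmax), the larger the loss relative to plain CE.
    beta is an illustrative value, not the paper's setting.
    """
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    ce = -np.log(probs[label] + 1e-12)
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return ce - beta * entropy

print(penalized_cross_entropy(np.array([4.0, 0.5, 0.1]), label=0))
```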