In this paper, we propose EDIT (Encoder-Decoder Image Transformer), a novel architecture designed to mitigate the attention sink phenomenon observed in Vision Transformer models. Attention sink occurs when an excessive amount of attention is allocated to the [CLS] token, distorting the model's ability to effectively process image patches. To address this, we introduce a layer-aligned encoder-decoder architecture, in which the encoder uses self-attention to process image patches while the decoder uses cross-attention to focus on the [CLS] token. Unlike traditional encoder-decoder frameworks, where the decoder depends solely on high-level encoder representations, EDIT allows the decoder to extract information starting from low-level features and to progressively refine the representation layer by layer. EDIT is naturally interpretable, as demonstrated through sequential attention maps that illustrate its refined, layer-by-layer focus on key image features. Experiments on ImageNet-1k and ImageNet-21k, along with transfer learning tasks, show that EDIT achieves consistent performance improvements over DeiT3 models. These results highlight the effectiveness of EDIT's design in addressing attention sink and improving visual feature extraction.
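The layer-aligned decoding described above can be illustrated with a short sketch. This is a minimal PyTorch illustration, not the authors' implementation: the module names, dimensions, the single-[CLS]-query design, and the use of nn.MultiheadAttention are assumptions; it only shows how a decoder layer can cross-attend to the encoder features of the same depth and expose a per-layer attention map for interpretation.

```python
import torch
import torch.nn as nn

class LayerAlignedDecoderLayer(nn.Module):
    """One decoder layer: the [CLS] query cross-attends to same-depth encoder features."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cls_token, patch_tokens):
        # cls_token: (B, 1, D); patch_tokens: (B, N, D) from the encoder layer at the same depth
        attn_out, attn_map = self.cross_attn(cls_token, patch_tokens, patch_tokens)
        cls_token = cls_token + attn_out
        cls_token = cls_token + self.mlp(self.norm(cls_token))
        return cls_token, attn_map  # attn_map gives the per-layer attention view

# Minimal usage: refine the [CLS] representation layer by layer over stored encoder outputs
B, N, D, depth = 2, 196, 384, 12
encoder_features = [torch.randn(B, N, D) for _ in range(depth)]  # one tensor per encoder layer
cls = torch.zeros(B, 1, D)
decoder = nn.ModuleList([LayerAlignedDecoderLayer(D) for _ in range(depth)])
for layer, feats in zip(decoder, encoder_features):
    cls, attn = layer(cls, feats)
```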
This paper presents Fourier Neural Operators (FNOs) as a method for physical layer processing in next-generation MIMO systems, addressing challenges like near-field propagation and continuous apertures by learning function-to-function mappings. It demonstrates FNOs' ability to accurately model holographic MIMO channels within 2.04 ms and to achieve superior channel estimation with lower Normalized Mean Squared Error (NMSE) for flexible intelligent metasurfaces compared to existing methods.
The advent of Rydberg atomic quantum receivers (RAQRs) offers a new solution for the evolution of wireless transceiver architectures, promising unprecedented sensitivity and immunity to thermal noise. However, RAQRs introduce a unique non-linear signal model based on biased phase retrieval, which complicates fundamental channel estimation tasks. Traditional iterative algorithms often struggle in low signal-to-noise-ratio regimes and fail to capture complex, non-ideal system characteristics. To address this, we propose a novel model-driven deep learning framework for channel estimation in RAQRs. Specifically, we propose a Transformer-based unrolling architecture, termed URformer, derived by unrolling a stabilized variant of the expectation-maximization Gerchberg-Saxton (EM-GS) algorithm. Each layer of the proposed URformer incorporates three trainable modules: 1) a learnable filter implemented by a neural network that replaces the fixed Bessel-function ratio in the classic EM-GS algorithm; 2) a trainable gating mechanism that adaptively combines the classic and learned updates to ensure training stability; and 3) an efficient channel Transformer block that learns to correct residual errors by capturing non-local dependencies across the channel matrix. Numerical results demonstrate that the proposed URformer significantly outperforms classic iterative algorithms and conventional black-box neural networks while requiring less pilot overhead.
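One unrolled iteration might look roughly like the following. This is a speculative sketch of the three modules named above, not the paper's architecture: the signal model (a complex pilot matrix A with magnitude-only measurements), the filter/gate/Transformer designs, and all dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class URformerLayer(nn.Module):
    """One unrolled EM-GS-style iteration with three trainable modules (illustrative only)."""
    def __init__(self, d_model=64, heads=4):
        super().__init__()
        # 1) learnable filter: small MLP standing in for the fixed Bessel-function ratio
        self.filter_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))
        # 2) trainable gate mixing the classic update with the learned (filtered) update
        self.gate_logit = nn.Parameter(torch.zeros(1))
        # 3) lightweight Transformer block correcting residual errors across channel entries
        self.refine = nn.TransformerEncoderLayer(d_model, heads, dim_feedforward=128,
                                                 batch_first=True)
        self.embed = nn.Linear(2, d_model)    # (real, imag) -> feature
        self.project = nn.Linear(d_model, 2)  # feature -> (real, imag) correction

    def forward(self, h, y_mag, A, A_pinv):
        # h: (B, n) complex estimate, y_mag: (B, m) measured magnitudes,
        # A: (m, n) complex pilot matrix, A_pinv: (n, m) its pseudo-inverse
        z = h @ A.T                                     # forward model
        phase = torch.exp(1j * z.angle())
        z_classic = y_mag * phase                       # classic GS magnitude projection
        gain = self.filter_net(y_mag.unsqueeze(-1)).squeeze(-1)
        z_learned = gain * y_mag * phase                # learned magnitude re-weighting
        g = torch.sigmoid(self.gate_logit)
        h_new = (g * z_classic + (1 - g) * z_learned) @ A_pinv.T   # back-projection
        # residual correction by the Transformer block on stacked real/imag features
        feat = self.embed(torch.stack((h_new.real, h_new.imag), dim=-1))
        corr = self.project(self.refine(feat))
        return h_new + torch.complex(corr[..., 0], corr[..., 1])
```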
While initial applications of artificial intelligence (AI) in wireless communications over the past decade have demonstrated considerable potential using specialized models for targeted communication tasks, the revolutionary demands of sixth-generation (6G) networks for holographic communications, ubiquitous sensing, and native intelligence are propelling a necessary evolution towards AI-native wireless networks. The arrival of large AI models paves the way for the next phase of Wireless AI, driven by wireless foundation models (WFMs). In particular, pre-training on universal electromagnetic (EM) principles equips WFMs with the essential adaptability for a multitude of demanding 6G applications. However, existing large AI models face critical limitations, including pre-training strategies disconnected from EM-compliant constraints, which leads to physically inconsistent predictions; a lack of embedded understanding of wave propagation physics; and the inaccessibility of massive labeled datasets for comprehensive EM-aware training. To address these challenges, this article presents an electromagnetic information theory-guided self-supervised pre-training (EIT-SPT) framework designed to systematically inject EM physics into WFMs. The EIT-SPT framework aims to infuse WFMs with intrinsic EM knowledge, thereby enhancing their physical consistency, generalization capabilities across varied EM landscapes, and overall data efficiency. Building upon the proposed EIT-SPT framework, this article first elaborates on diverse potential applications of WFMs in 6G scenarios, then validates the efficacy of the proposed framework through illustrative case studies, and finally summarizes critical open research challenges and future directions for WFMs.
This paper performs a comprehensive and comparative evaluation of state-of-the-art local features for the task of image-based 3D reconstruction. The evaluated local features cover both recently developed features learned with powerful machine learning techniques and elaborately designed handcrafted features. To obtain a comprehensive evaluation, we include both float-type and binary features. Meanwhile, two kinds of datasets are used in this evaluation. One is a dataset of many different scene types with ground-truth 3D points, containing images of different scenes captured at fixed positions; it is used for quantitative performance evaluation of different local features under controlled image-capturing conditions. The other dataset contains Internet-scale image sets of several landmarks with many unrelated images, and is used for qualitative performance evaluation of different local features in the free image-collection setting. Our experimental results show that binary features are capable of reconstructing scenes from controlled image sequences in only a fraction of the processing time required by float-type features. However, for large-scale image sets with many distracting images, float-type features show a clear advantage over binary ones.
Despite remarkable advancements, current Text-to-Image (T2I) models struggle with complex, long-form textual instructions, frequently failing to accurately render intricate details, spatial relationships, or specific constraints. This limitation is highlighted by benchmarks such as LongBench-T2I, which reveal deficiencies in handling composition, specific text, and fine textures. To address this, we propose DeCoT (Decomposition-CoT), a novel framework that leverages Large Language Models (LLMs) to significantly enhance T2I models' understanding and execution of complex instructions. DeCoT operates in two core stages: first, Complex Instruction Decomposition and Semantic Enhancement, where an LLM breaks down raw instructions into structured, actionable semantic units and clarifies ambiguities; second, Multi-Stage Prompt Integration and Adaptive Generation, which transforms these units into a hierarchical or optimized single prompt tailored for existing T2I models. Extensive experiments on the LongBench-T2I dataset demonstrate that DeCoT consistently and substantially improves the performance of leading T2I models across all evaluated dimensions, particularly in challenging aspects like "Text" and "Composition". Quantitative results, validated by multiple MLLM evaluators (Gemini-2.0-Flash and InternVL3-78B), show that DeCoT, when integrated with Infinity-8B, achieves an average score of 3.52, outperforming the baseline Infinity-8B (3.44). Ablation studies confirm the critical contribution of each DeCoT component and the importance of sophisticated LLM prompting. Furthermore, human evaluations corroborate these findings, indicating superior perceptual quality and instruction fidelity. DeCoT effectively bridges the gap between high-level user intent and T2I model requirements, leading to more faithful and accurate image generation.
As real propagation environments become increasingly complex and dynamic, millimeter-wave beam prediction faces significant challenges. Traditional methods that rely on real-time channel state information (CSI) are computationally expensive and often fail to maintain accuracy in such environments. However, the powerful cross-modal representation capability of vision-language models (VLMs) provides a promising alternative. In this paper, we present a VLM-driven, contrastive-learning-based multimodal beam prediction framework that integrates multimodal data via modality-specific encoders. To enforce cross-modal consistency, we adopt a contrastive pretraining strategy to align image and LiDAR features in the latent space. We use location information as text prompts fed to the text encoder to introduce the language modality, which further improves cross-modal consistency. Experiments on the DeepSense-6G dataset show that our VLM backbone provides additional semantic grounding. Compared with existing methods, our framework achieves an overall distance-based accuracy score (DBA-Score) of 0.9016, corresponding to a 1.46% average improvement.
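The contrastive pretraining step described above corresponds to a standard symmetric InfoNCE objective between paired modality embeddings. Below is a minimal PyTorch sketch; the encoder outputs, batch pairing, and temperature value are assumed rather than taken from the paper. The same form can be applied between image (or LiDAR) embeddings and the text-prompt embeddings that encode location.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_emb, lidar_emb, temperature=0.07):
    """InfoNCE-style loss aligning paired image and LiDAR embeddings in a shared latent space.
    img_emb, lidar_emb: (B, D) outputs of the modality-specific encoders (illustrative)."""
    img_emb = F.normalize(img_emb, dim=-1)
    lidar_emb = F.normalize(lidar_emb, dim=-1)
    logits = img_emb @ lidar_emb.T / temperature        # (B, B) pairwise similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # matched pairs sit on the diagonal; symmetrize over both matching directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```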
Extremely large-scale multiple-input multiple-output (XL-MIMO) is a key technology for next-generation wireless communication systems. By deploying significantly more antennas than conventional massive MIMO systems, XL-MIMO promises substantial improvements in spectral efficiency. However, due to the drastically increased array size, the conventional planar wave channel model is no longer accurate, necessitating a transition to a near-field spherical wave model. This shift challenges traditional beam training and channel estimation methods, which were designed for planar wave propagation. In this article, we present a comprehensive review of state-of-the-art beam training and channel estimation techniques for XL-MIMO systems. We analyze the fundamental principles, key methodologies, and recent advancements in this area, highlighting their respective strengths and limitations in addressing the challenges posed by the near-field propagation environment. Furthermore, we explore open research challenges that remain unresolved to provide valuable insights for researchers and engineers working toward the development of next-generation XL-MIMO communication systems.
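For readers unfamiliar with the modeling shift mentioned above, the contrast between far-field (planar-wave) and near-field (spherical-wave) array responses for a uniform linear array is commonly written as follows; this is a standard textbook form with our own notation, not the article's.

```latex
% ULA with N antennas, spacing d, wavelength \lambda; \delta_n = (2n - N + 1)/2, n = 0,\dots,N-1.
a_{\mathrm{far}}(\theta) = \frac{1}{\sqrt{N}}
  \left[\, e^{-\jmath \frac{2\pi}{\lambda}\,\delta_n d \sin\theta} \,\right]_{n=0}^{N-1},
\qquad
a_{\mathrm{near}}(r,\theta) = \frac{1}{\sqrt{N}}
  \left[\, e^{-\jmath \frac{2\pi}{\lambda}\,(r_n - r)} \,\right]_{n=0}^{N-1},
\quad r_n = \sqrt{r^2 + \delta_n^2 d^2 - 2\,r\,\delta_n d \sin\theta},
```

where the near-field response depends on the user distance $r$ as well as the angle $\theta$, which is precisely why beam training and channel estimation designed for the planar-wave model break down at XL-MIMO array sizes.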
Query-focused summarization over multi-table data is a challenging yet critical task for extracting precise and relevant information from structured data. Existing methods often rely on complex preprocessing steps and struggle to generalize across domains or handle the logical reasoning required for multi-table queries. In this paper, we propose QueryTableSummarizer++, an end-to-end generative framework leveraging large language models (LLMs) enhanced with table-aware pre-training, query-aligned fine-tuning, and reinforcement learning with feedback. Our method eliminates the need for intermediate serialization steps and directly generates query-relevant summaries. Experiments on a benchmark dataset demonstrate that QueryTableSummarizer++ significantly outperforms state-of-the-art baselines in terms of BLEU, ROUGE, and F1-score. Additional analyses highlight its scalability, generalization across domains, and robust handling of complex queries. Human evaluation further validates the superior quality and practical applicability of the generated summaries, establishing QueryTableSummarizer++ as a highly effective solution for multi-table summarization tasks.
It is known that every (single-qudit) Clifford operator maps the full set of generalized Pauli matrices (GPMs) to itself under unitary conjugation, which is an important quantum operation that plays a crucial role in quantum computation and information. However, many quantum information processing tasks require that a specific set of GPMs be mapped to another such set under conjugation, rather than the entire set. We formalize this by introducing local Clifford operators, which map a given $n$-GPM set to another such set under unitary conjugation. We establish necessary and sufficient conditions for such an operator to transform a pair of GPMs, showing that these local Clifford operators admit a classical matrix representation, analogous to the classical (or symplectic) representation of standard (single-qudit) Clifford operators. Furthermore, we demonstrate that any local Clifford operator acting on an $n$-GPM set ($n\geq 2$) can be decomposed into a product of standard Clifford operators and a local Clifford operator acting on a pair of GPMs. This decomposition provides a complete classical characterization of unitary conjugation mappings between $n$-GPM sets. As a key application, we use this framework to address the local unitary equivalence (LU-equivalence) of sets of generalized Bell states (GBSs). We prove that the 31 equivalence classes of 4-GBS sets in the bipartite system $\mathbb{C}^{6}\otimes \mathbb{C}^{6}$ previously identified via Clifford operators are indeed distinct under LU-equivalence, confirming that this classification is complete.
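For context, a common choice of definitions behind the terminology above is sketched below; the notation is assumed (the paper's conventions may differ), and the matrix description of conjugation is the "classical representation" referred to in the abstract.

```latex
% Single-qudit generalized Pauli matrices in dimension d, with \omega = e^{2\pi i/d}:
Z\lvert j\rangle = \omega^{j}\lvert j\rangle, \qquad
X\lvert j\rangle = \lvert j+1 \bmod d\rangle, \qquad
\text{GPMs: } X^{a}Z^{b},\ \ a,b \in \mathbb{Z}_d .
% A Clifford operator U maps GPMs to GPMs (up to phase) under conjugation,
% acting linearly on the exponent vector (a,b) over \mathbb{Z}_d:
U\,(X^{a}Z^{b})\,U^{\dagger} \;\propto\; X^{a'}Z^{b'}, \qquad
\begin{pmatrix} a' \\ b' \end{pmatrix} = M \begin{pmatrix} a \\ b \end{pmatrix},
\quad M \in \mathbb{Z}_d^{2\times 2}\ \text{invertible}.
```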
In this article, we overview intelligent reflecting surface (IRS)-empowered wireless communication systems. We first present the fundamentals of IRS-assisted wireless transmission. On this basis, we explore the integration of IRS with various advanced transmission technologies, such as millimeter wave, non-orthogonal multiple access, and physical layer security. Following this, we discuss the effects of hardware impairments and imperfect channel state information on IRS system performance. Finally, we highlight several open issues to be addressed.
Flexible intelligent metasurfaces (FIMs) offer a new solution for wireless communications by introducing morphological degrees of freedom, dynamically morphing their three-dimensional shape to ensure multipath signals interfere constructively. However, realizing the desired performance gains in FIM systems critically depends on acquiring accurate channel state information across a continuous and high-dimensional deformation space. Therefore, this paper investigates this fundamental channel estimation problem for FIM-assisted millimeter-wave communication systems. First, we develop model-based frameworks that structure the problem either as function approximation using interpolation and kernel methods or as a sparse signal recovery problem that leverages the inherent angular sparsity of millimeter-wave channels. To further advance the estimation capability beyond the explicit assumptions in these model-based channel estimation frameworks, we propose a deep learning-based framework using a Fourier neural operator (FNO). By parameterizing a global convolution operator in the Fourier domain, we design an efficient FNO architecture to learn the continuous operator that maps FIM shapes to channel responses with mesh-independent properties. Furthermore, we exploit a hierarchical FNO (H-FNO) architecture to efficiently capture multi-scale features across a hierarchy of spatial resolutions. Numerical results demonstrate that the proposed H-FNO significantly outperforms the model-based benchmarks in estimation accuracy and pilot efficiency. In particular, the interpretability analysis shows that the proposed H-FNO learns an anisotropic spatial filter adapted to the physical geometry of the FIM and is capable of accurately reconstructing the non-linear channel response across the continuous deformation space.
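The core FNO operation referenced above, a global convolution parameterized in the Fourier domain, has a standard form. Below is a minimal 1D PyTorch sketch of such a layer; the channel counts, 1D discretization, and block structure are illustrative assumptions and do not reproduce the paper's H-FNO.

```python
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    """Core FNO building block: a global convolution parameterized in the Fourier domain."""
    def __init__(self, in_ch, out_ch, modes):
        super().__init__()
        self.modes = modes  # number of retained low-frequency Fourier modes
        scale = 1.0 / (in_ch * out_ch)
        self.weight = nn.Parameter(scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat))

    def forward(self, x):                       # x: (B, in_ch, L) samples of the input function
        x_ft = torch.fft.rfft(x, dim=-1)        # (B, in_ch, L//2 + 1)
        out_ft = torch.zeros(x.size(0), self.weight.size(1), x_ft.size(-1),
                             dtype=torch.cfloat, device=x.device)
        # multiply the retained modes by learned complex weights (mesh-independent operator)
        out_ft[..., :self.modes] = torch.einsum('bim,iom->bom',
                                                x_ft[..., :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.size(-1), dim=-1)

class FNOBlock(nn.Module):
    """Spectral path plus a pointwise linear path, as in standard FNO layers."""
    def __init__(self, ch, modes):
        super().__init__()
        self.spectral = SpectralConv1d(ch, ch, modes)
        self.pointwise = nn.Conv1d(ch, ch, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.spectral(x) + self.pointwise(x))
```

Because the learned weights act on Fourier modes rather than on a fixed grid, the same trained operator can be queried at different spatial resolutions, which is the mesh-independence property mentioned above.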
The Internet of Things (IoT) is increasingly used in our everyday lives as well as in numerous industrial applications. However, due to limitations in computing and power capabilities, IoT devices need to send their respective tasks to cloud service stations that are usually located far away. Having to transmit data over long distances introduces challenges for services that require low latency, such as industrial control in factories and plants and artificial-intelligence-assisted autonomous driving. To solve this issue, mobile edge computing (MEC) is deployed at the network's edge to reduce transmission time. In this regard, this study proposes a new offloading scheme for MEC-assisted ultra-dense cellular networks using reinforcement learning (RL) techniques. The proposed scheme enables efficient resource allocation and dynamic offloading decisions based on varying network conditions and user demands. The RL algorithm learns from the network's historical data and adapts the offloading decisions to optimize the network's overall performance. Non-orthogonal multiple access is also adopted to improve resource utilization among the IoT devices. Simulation results demonstrate that the proposed scheme outperforms other state-of-the-art offloading algorithms in terms of energy efficiency, network throughput, and user satisfaction.
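As a concrete, heavily simplified picture of the RL component, the sketch below shows a tabular Q-learning loop for a binary local-vs-edge offload decision. The state space, reward shape, and toy environment are assumptions made for illustration; the paper's scheme additionally handles NOMA resource allocation and richer network state.

```python
import numpy as np

# Tabular Q-learning sketch for a binary offload decision (0 = execute locally, 1 = offload to edge).
# The state here is a coarse (channel-quality, server-load) index; the actual scheme uses a richer
# state, action, and reward design than this toy example.
n_states, n_actions = 16, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1

def step(state, action, rng):
    """Toy environment: reward trades off latency against energy (illustrative only)."""
    latency = rng.uniform(1, 3) if action == 1 else rng.uniform(2, 5)
    energy = rng.uniform(0.5, 1.0) if action == 1 else rng.uniform(1.0, 2.0)
    reward = -(latency + 0.5 * energy)
    return rng.integers(n_states), reward

rng = np.random.default_rng(0)
state = rng.integers(n_states)
for _ in range(10_000):
    action = rng.integers(n_actions) if rng.random() < eps else int(Q[state].argmax())
    next_state, reward = step(state, action, rng)
    # standard Q-learning temporal-difference update
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```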
Rapid bone scintigraphy is crucial for diagnosing skeletal disorders and detecting tumor metastases in children, as it shortens scan duration and reduces discomfort. However, accelerated acquisition often degrades image quality, impairing the visibility of fine anatomical details and potentially compromising diagnosis. To overcome this limitation, we introduce the first application of SAM-based semantic priors for medical image restoration, utilizing the Segment Anything Model (SAM) to enhance pediatric rapid bone scintigraphy. Our approach employs two cascaded networks, $f^{IR1}$ and $f^{IR2}$, supported by three specialized modules: a Semantic Prior Integration (SPI) module, a Semantic Knowledge Distillation (SKD) module, and a Semantic Consistency Module (SCM). The SPI and SKD modules inject domain-specific semantic cues from a fine-tuned SAM, while the SCM preserves coherent semantic feature representations across both cascaded stages. Moreover, we present RBS, a novel Rapid Bone Scintigraphy dataset comprising paired standard (20 cm/min) and rapid (40 cm/min) scans from 137 pediatric patients aged 0.5–16 years, making it the first dataset tailored for pediatric rapid bone scintigraphy restoration. Extensive experiments on both a public endoscopic dataset and our RBS dataset demonstrate that our method consistently surpasses existing techniques in PSNR, SSIM, FID, and LPIPS metrics.
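A rough picture of how semantic priors from a segmentation model can be injected into a restoration network is sketched below. This is not the paper's SPI module: the gated-fusion design, channel sizes, and interface to the SAM encoder are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class SemanticPriorIntegration(nn.Module):
    """Illustrative SPI-style block: fuse restoration features with semantic features
    from a (fine-tuned) SAM image encoder via a learned gate. Design details are assumed."""
    def __init__(self, restore_ch, sam_ch):
        super().__init__()
        self.project = nn.Conv2d(sam_ch, restore_ch, kernel_size=1)
        self.gate = nn.Sequential(nn.Conv2d(2 * restore_ch, restore_ch, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, feat, sam_feat):
        # feat: (B, C, H, W) restoration features; sam_feat: (B, C_sam, h, w) SAM encoder features
        sam_feat = nn.functional.interpolate(self.project(sam_feat), size=feat.shape[-2:],
                                             mode='bilinear', align_corners=False)
        g = self.gate(torch.cat((feat, sam_feat), dim=1))
        return feat + g * sam_feat   # semantic cues injected where the gate deems them useful
```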
Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the complexity of aligning visual and textual information. This paper presents a novel approach leveraging large language models (LLMs) and large vision-language models (LVLMs) combined with instruction tuning to address these challenges. We introduce a new dataset comprising diverse visual stories, annotated with detailed captions and multimodal elements. Our method employs a combination of supervised and reinforcement learning to fine-tune the model, enhancing its narrative generation capabilities. Quantitative evaluations using GPT-4 and qualitative human assessments demonstrate that our approach significantly outperforms existing models, achieving higher scores in narrative coherence, relevance, emotional depth, and overall quality. The results underscore the effectiveness of instruction tuning and the potential of LLMs/LVLMs in advancing visual storytelling.
Motivated by the recent LHC Higgs data and null results in searches for new physics, we investigate the Higgs couplings and naturalness in the littlest Higgs model with T-parity. By performing a global fit of the latest Higgs data, electroweak precision observables, and $R_{b}$ measurements, we find that the scale $f$ can be excluded up to 600 GeV at the $2\sigma$ confidence level. The expected Higgs coupling measurements at the future collider TLEP will improve this lower limit to above 3 TeV. Besides, the top partner mass $m_{T_{+}}$ can be excluded up to 880 GeV at the $2\sigma$ confidence level. The future HL-LHC can constrain this mass in the region $m_{T_{+}} < 2.2$ TeV, corresponding to a fine-tuning larger than 1%.
This paper introduces a robust, learning-based method for diagnosing the state of distribution network switchgear, which is crucial for maintaining the power quality for end users. Traditional diagnostic models often rely heavily on expert knowledge and lack robustness. To address this, our method incorporates an expanded feature vector that includes environmental data, temperature readings, switch position, motor operation, insulation conditions, and local discharge information. We tackle the issue of high dimensionality through feature mapping. The method introduces a decision radius to categorize unlabeled samples and updates the model parameters using a combination of supervised and unsupervised loss, along with a consistency regularization function. This approach ensures robust learning even with a limited number of labeled samples. Comparative analysis demonstrates that this method significantly outperforms existing models in both accuracy and robustness.
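A hedged sketch of the kind of objective described above, a supervised loss plus a radius-gated consistency term on unlabeled samples, is given below. The feature hook, perturbation, radius rule, and weighting are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(model, x_lab, y_lab, x_unlab, class_centers, radius=1.0, lam=0.5):
    """Illustrative objective: supervised cross-entropy on labeled samples plus a consistency
    term on unlabeled samples that fall within a decision radius of a class center in feature
    space. `model.features` and the weighting `lam` are hypothetical names/values."""
    sup = F.cross_entropy(model(x_lab), y_lab)

    feats = model.features(x_unlab)                     # assumed feature-extractor hook
    dists = torch.cdist(feats, class_centers)           # (B_u, K) distances to class centers
    min_dist, pseudo = dists.min(dim=1)
    mask = (min_dist < radius).float()                  # keep only samples inside the radius

    # consistency regularization: predictions under a perturbed view should match pseudo-labels
    noisy = x_unlab + 0.05 * torch.randn_like(x_unlab)
    unsup = (F.cross_entropy(model(noisy), pseudo, reduction='none') * mask).mean()
    return sup + lam * unsup
```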
Both unmanned aerial vehicles (UAVs) and intelligent reflecting surfaces (IRSs) are gaining traction as transformative technologies for upcoming wireless networks. IRS-aided UAV communication, which introduces IRSs into UAV communications, has emerged in an effort to improve system performance while also overcoming UAV communication constraints and issues. The purpose of this paper is to provide a comprehensive overview of IRS-assisted UAV communications. First, we provide five examples of how IRSs and UAVs can be combined to achieve unrivaled potential in difficult situations. We then review the technological features of the most recent research on IRS-aided UAV communications from the perspective of the main performance criteria, i.e., energy efficiency, security, and spectral efficiency. Additionally, we survey prior studies on the adoption of enabling techniques such as machine learning algorithms. Lastly, some promising research directions and open challenges for IRS-aided UAV communication are presented.
Recent studies on the electrical switching of tetragonal antiferromagnets (AFMs) via Néel spin-orbit torque have paved the way for the economical use of antiferromagnetic materials. The most difficult obstacle that presently limits the application of antiferromagnetic materials in spintronics, especially in memory storage applications, could be the small and fragile magnetoresistance (MR) of AFM-based nanostructures. In this study, we investigated spin transport in Mn$_2$Au-based tunnel junctions using first-principles scattering theory. Giant MRs of more than 1000% are predicted in some Fe/MgO/Ag/Mn$_2$Au/Ta junctions, of about the same order as that in an MgO-based ferromagnetic tunnel junction with the same barrier thickness. The interplay of the spin-filtering effect, quantum-well resonant states, and interfacial resonant states could be responsible for the unusually giant and robust MRs observed in these Mn$_2$Au-based junctions.
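For reference, the magnetoresistance ratio quoted above is conventionally defined as follows (the notation is ours); an MR above 1000% then corresponds to a resistance ratio between the two Néel-vector configurations exceeding eleven.

```latex
% Conventional ("optimistic") magnetoresistance ratio, with R_high and R_low the junction
% resistances in the two N\'eel-vector configurations:
\mathrm{MR} \;=\; \frac{R_{\mathrm{high}} - R_{\mathrm{low}}}{R_{\mathrm{low}}} \times 100\% .
```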
Reconfigurable intelligent surfaces (RISs) dynamically control signal propagation to enhance wireless communications. This paper presents a novel framework for rotatable-RIS-assisted physical-layer multicast systems, aiming to maximize the sum of minimum multicast rates via joint optimization of base station beamforming, RIS phase shifts, and RIS orientation. Unlike unicast or non-rotatable setups, the rotatable RIS adapts its orientation to align signals with user groups, improving fairness and rates for weak users. An alternating optimization approach combines convex optimization for beamforming and phase shifts with exhaustive search and particle swarm optimization (PSO) for the orientation. Majorization-minimization-based algorithms solve the subproblems iteratively. Simulation results show that the framework achieves a 24.1% rate improvement via exhaustive search and 20.0% via PSO over the non-rotatable-RIS baseline, with PSO performance close to the exhaustive-search upper bound, highlighting the benefits of physical-layer multicast and orientation optimization.
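The PSO search over the RIS orientation can be pictured with the minimal sketch below; the objective rate_fn stands in for the inner beamforming/phase-shift optimization described above, and the swarm parameters are generic defaults rather than values from the paper.

```python
import numpy as np

def pso_orientation(rate_fn, n_particles=20, iters=50, bounds=(0.0, 2 * np.pi), seed=0):
    """Minimal particle swarm search over a single RIS orientation angle.
    rate_fn(phi) should return the min-multicast-rate objective achieved for orientation phi
    (a placeholder for the inner convex/MM optimization of beamforming and phase shifts)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(*bounds, n_particles)
    vel = np.zeros(n_particles)
    p_best, p_val = pos.copy(), np.array([rate_fn(p) for p in pos])
    g_best = p_best[p_val.argmax()]
    w, c1, c2 = 0.7, 1.5, 1.5                       # inertia and acceleration coefficients
    for _ in range(iters):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        vel = w * vel + c1 * r1 * (p_best - pos) + c2 * r2 * (g_best - pos)
        pos = np.clip(pos + vel, *bounds)
        val = np.array([rate_fn(p) for p in pos])
        improved = val > p_val
        p_best[improved], p_val[improved] = pos[improved], val[improved]
        g_best = p_best[p_val.argmax()]
    return g_best, p_val.max()

# Toy usage with a synthetic objective standing in for the inner optimization:
best_phi, best_rate = pso_orientation(lambda phi: np.sin(phi) + 0.3 * np.cos(3 * phi))
```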