Fujian University of Technology
TSMamba introduces a time series foundation model built on the Mamba architecture, achieving linear computational complexity for long sequence processing. It leverages a two-stage transfer learning strategy from pre-trained Mamba Large Language Models, demonstrating strong zero-shot generalization and superior fine-tuned performance across various forecasting benchmarks.
Speech emotion recognition (SER) plays a vital role in improving the interactions between humans and machines by inferring human emotion and affective states from speech signals. Whereas recent works primarily focus on mining spatiotemporal information from hand-crafted features, we explore how to model the temporal patterns of speech emotions from dynamic temporal scales. Towards that goal, we introduce a novel temporal emotional modeling approach for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net), which learns multi-scale contextual affective representations from various time scales. Specifically, TIM-Net first employs temporal-aware blocks to learn temporal affective representation, then integrates complementary information from the past and the future to enrich contextual representations, and finally, fuses multiple time scale features for better adaptation to the emotional variation. Extensive experimental results on six benchmark SER datasets demonstrate the superior performance of TIM-Net, gaining 2.34% and 2.61% improvements of the average UAR and WAR over the second-best on each corpus. The source code is available at this https URL.
167
This paper proposes a highly robust autonomous agent framework based on the ReAct paradigm, designed to solve complex tasks through adaptive decision making and multi-agent collaboration. Unlike traditional frameworks that rely on fixed workflows generated by LLM-based planners, this framework dynamically generates next actions during agent execution based on prior trajectories, thereby enhancing its robustness. To address potential termination issues caused by adaptive execution paths, I propose a timely abandonment strategy incorporating a probabilistic penalty mechanism. For multi-agent collaboration, I introduce a memory transfer mechanism that enables shared and dynamically updated memory among agents. The framework's innovative timely abandonment strategy dynamically adjusts the probability of task abandonment via probabilistic penalties, allowing developers to balance conservative and exploratory tendencies in agent execution strategies by tuning hyperparameters. This significantly improves adaptability and task execution efficiency in complex environments. Additionally, agents can be extended through external tool integration, supported by modular design and MCP protocol compatibility, which enables flexible action space expansion. Through explicit division of labor, the multi-agent collaboration mechanism enables agents to focus on specific task components, thereby significantly improving execution efficiency and quality.
Recent AI-generated image (AIGI) detectors achieve impressive accuracy under clean condition. In view of antiforensics, it is significant to develop advanced adversarial attacks for evaluating the security of such detectors, which remains unexplored sufficiently. This letter proposes a Dual-domain Feature Importance Attack (DuFIA) scheme to invalidate AIGI detectors to some extent. Forensically important features are captured by the spatially interpolated gradient and frequency-aware perturbation. The adversarial transferability is enhanced by jointly modeling spatial and frequency-domain feature importances, which are fused to guide the optimization-based adversarial example generation. Extensive experiments across various AIGI detectors verify the cross-model transferability, transparency and robustness of DuFIA.
1
In semi-supervised semantic segmentation (SSS), weak-to-strong consistency regularization techniques are widely utilized in recent works, typically combined with input-level and feature-level perturbations. However, the integration between weak-to-strong consistency regularization and network perturbation has been relatively rare. We note several problems with existing network perturbations in SSS that may contribute to this phenomenon. By revisiting network perturbations, we introduce a new approach for network perturbation to expand the existing weak-to-strong consistency regularization for unlabeled data. Additionally, we present a volatile learning process for labeled data, which is uncommon in existing research. Building upon previous work that includes input-level and feature-level perturbations, we present MLPMatch (Multi-Level-Perturbation Match), an easy-to-implement and efficient framework for semi-supervised semantic segmentation. MLPMatch has been validated on the Pascal VOC and Cityscapes datasets, achieving state-of-the-art performance. Code is available from this https URL.
4
In order to enhance the performance of Transformer models for long-term multivariate forecasting while minimizing computational demands, this paper introduces the Joint Time-Frequency Domain Transformer (JTFT). JTFT combines time and frequency domain representations to make predictions. The frequency domain representation efficiently extracts multi-scale dependencies while maintaining sparsity by utilizing a small number of learnable frequencies. Simultaneously, the time domain (TD) representation is derived from a fixed number of the most recent data points, strengthening the modeling of local relationships and mitigating the effects of non-stationarity. Importantly, the length of the representation remains independent of the input sequence length, enabling JTFT to achieve linear computational complexity. Furthermore, a low-rank attention layer is proposed to efficiently capture cross-dimensional dependencies, thus preventing performance degradation resulting from the entanglement of temporal and channel-wise modeling. Experimental results on six real-world datasets demonstrate that JTFT outperforms state-of-the-art baselines in predictive performance.
18
·
According to the World Health Organization(WHO), it is estimated that approximately 1.3 billion people live with some forms of vision impairment globally, of whom 36 million are blind. Due to their disability, engaging these minority into the society is a challenging problem. The recent rise of smart mobile phones provides a new solution by enabling blind users' convenient access to the information and service for understanding the world. Users with vision impairment can adopt the screen reader embedded in the mobile operating systems to read the content of each screen within the app, and use gestures to interact with the phone. However, the prerequisite of using screen readers is that developers have to add natural-language labels to the image-based components when they are developing the app. Unfortunately, more than 77% apps have issues of missing labels, according to our analysis of 10,408 Android apps. Most of these issues are caused by developers' lack of awareness and knowledge in considering the minority. And even if developers want to add the labels to UI components, they may not come up with concise and clear description as most of them are of no visual issues. To overcome these challenges, we develop a deep-learning based model, called LabelDroid, to automatically predict the labels of image-based buttons by learning from large-scale commercial apps in Google Play. The experimental results show that our model can make accurate predictions and the generated labels are of higher quality than that from real Android developers.
48
Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs' common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference Engine (CIE) for contextual inference, and a Coherence Assessment Module (CAM) for evaluating logical consistency and confidence. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. It attains an average accuracy of 69.4%, surpassing the best open-source baseline by +2.4 percentage points, with particular strength in complex reasoning scenarios. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning.
The optimal multicast tree problem in the Software-Defined Networking (SDN) multicast routing is an NP-hard combinatorial optimization problem. Although existing SDN intelligent solution methods, which are based on deep reinforcement learning, can dynamically adapt to complex network link state changes, these methods are plagued by problems such as redundant branches, large action space, and slow agent convergence. In this paper, an SDN intelligent multicast routing algorithm based on deep hierarchical reinforcement learning is proposed to circumvent the aforementioned problems. First, the multicast tree construction problem is decomposed into two sub-problems: the fork node selection problem and the construction of the optimal path from the fork node to the destination node. Second, based on the information characteristics of SDN global network perception, the multicast tree state matrix, link bandwidth matrix, link delay matrix, link packet loss rate matrix, and sub-goal matrix are designed as the state space of intrinsic and meta controllers. Then, in order to mitigate the excessive action space, our approach constructs different action spaces at the upper and lower levels. The meta-controller generates an action space using network nodes to select the fork node, and the intrinsic controller uses the adjacent edges of the current node as its action space, thus implementing four different action selection strategies in the construction of the multicast tree. To facilitate the intelligent agent in constructing the optimal multicast tree with greater speed, we developed alternative reward strategies that distinguish between single-step node actions and multi-step actions towards multiple destination nodes.
Speech Emotion Recognition (SER) has become a growing focus of research in human-computer interaction. An essential challenge in SER is to extract common attributes from different speakers or languages, especially when a specific source corpus has to be trained to recognize the unknown data coming from another speech corpus. To address this challenge, a Capsule Network (CapsNet) and Transfer Learning based Mixed Task Net (CTLMTNet) are proposed to deal with both the singlecorpus and cross-corpus SER tasks simultaneously in this paper. For the single-corpus task, the combination of Convolution-Pooling and Attention CapsNet module CPAC) is designed by embedding the self-attention mechanism to the CapsNet, guiding the module to focus on the important features that can be fed into different capsules. The extracted high-level features by CPAC provide sufficient discriminative ability. Furthermore, to handle the cross-corpus task, CTL-MTNet employs a Corpus Adaptation Adversarial Module (CAAM) by combining CPAC with Margin Disparity Discrepancy (MDD), which can learn the domain-invariant emotion representations through extracting the strong emotion commonness. Experiments including ablation studies and visualizations on both singleand cross-corpus tasks using four well-known SER datasets in different languages are conducted for performance evaluation and comparison. The results indicate that in both tasks the CTL-MTNet showed better performance in all cases compared to a number of state-of-the-art methods. The source code and the supplementary materials are available at: this https URL
We report a direct search for a new gauge boson, XX, with a mass of 17 MeV/c217~\text{MeV}/c^2, which could explain the anomalous excess of e+ee^+e^- pairs observed in the 8Be^8\text{Be} nuclear transitions. The search is conducted in the charmonium decay χcJXJ/ψ (J=0,1,2)\chi_{cJ}\to X J/\psi~(J=0,1,2) via the radiative transition ψ(3686)γχcJ\psi(3686)\to\gamma\chi_{cJ} using (2712.4±14.3)×106\left(2712.4\pm 14.3 \right)\times 10^6 ψ(3686)\psi(3686) events collected with the BESIII detector at the BEPCII collider. No significant signal is observed, and the new upper limit on the coupling strength of charm quark and the new gauge boson, ϵc\epsilon_c, at 17 MeV/c217~\text{MeV}/c^2 is set to be |\epsilon_c|<1.2\times 10^{-2} at 90%90\% confidence level. We also report new constraints on the mixing strength ϵ\epsilon between the Standard Model photon and dark photon γ\gamma^\prime in the mass range from 5 MeV/c25~\text{MeV}/c^2 to 300 MeV/c2300~\text{MeV}/c^2. The upper limits at 90%90\% confidence level vary within (2.517.5)×103(2.5-17.5)\times 10^{-3} depending on the γ\gamma^\prime mass.
Bhakti is introduced as a lightweight vector database management system designed for small to medium-sized datasets, enhancing large language models with semantic search and a novel memory-enhanced dialogue solution. This system facilitates more coherent and personalized chatbot interactions by using a weighted vectorization method for dialogue history, demonstrating improved conversational qualities in qualitative evaluations.
38
It is assumed that superheavy dark matter particles with O(EeV) mass may decay to relativistic milli-charged particles (MCPs). The upward-going MCPs passing through the Earth could be measured by the fluorescence detectors (FD) of the Pierre Auger observatory (Auger). The massless hidden photon model was taken for MCPs to interact with nuclei, so that the numbers and fluxes of expected MCPs and neutrinos may be evaluated at the FD of Auger. Based on the assumption that no events are observed at the FD of Auger in 14 years, the corresponding upper limits on MCP fluxes were calculated at 90\% C. L.. These results indicated that MCPs could be directly detected in the secondaries' energy range O(1EeV)-O(10EeV) at the FD of Auger, when 3×1013ϵ21033\times10^{-13}\lesssim\epsilon^2\lesssim 10^{-3}. And a new region of 10 MeV < mMCPm_{MCP} < 1 PeV and 10710^{-7} \lesssim ϵ\epsilon \lesssim 3×1023\times10^{-2} is ruled out in the mMCPm_{MCP}-ϵ\epsilon plane with 14 years of Auger data.
Structural health monitoring (SHM) ensures the safety and longevity of structures such as aerospace equipment and wind power installations. Developing a simple, highly flexible, and scalable SHM method that does not depend on baseline models is significant for ensuring the operational integrity of advanced composite structures. In this regard, a hybrid baseline-free damage detection and localization framework incorporating an unsupervised Kolmogorov-Arnold autoencoder (KAE) and modified probabilistic elliptical imaging algorithm (MRAPID) is proposed for damage detection and localization in composite structures. Specifically, KAE was used to process the guided wave signals (GW) without any prior feature extraction process. The KAE continuously learns and adapts to the baseline model of each structure, learning from the response characteristics of its undamaged state. Then, the predictions from KAE are processed, combined with the MRAPID to generate a damage probability map. The performance of the proposed method for damage detection and localization was verified using the simulated damage data obtained on wind turbine blades and the actual damage data obtained on composite flat plates. The results show that the proposed method can effectively detect and localize damage and can achieve multiple damage localization. In addition, the method outperforms classical damage detection algorithms and state-of-the-art baseline-free damage detection and localization methods in terms of damage localization accuracy.
Although known for negatively impacting the operation of superconducting qubits, thermal baths are shown to exert qubit control in a positive way, provided they are properly engineered. We demonstrate an experimental method to engineer the transduction of microwave driving into heat flow through a leaky resonator. Given the precise conversion, a qubit receiving the heat flow obtains a quasi-thermal equilibrium with arbitrary target temperature in hundreds of nanoseconds. We show that the dynamics of the quantum transducing process is described by thermalizing channel states, generated from the double dressings of the resonator by the semi-classical driving and the qubit-resonator coupling. Their spectrum, coupling, and driving strength determine the channel rate of energy flow, along with the relaxation rates of photon leakage into the bath. The analytical prediction is shown to match well with the experimental measurements on an Xmon qubit circuit.
Traditional multicast routing methods have some problems in constructing a multicast tree, such as limited access to network state information, poor adaptability to dynamic and complex changes in the network, and inflexible data forwarding. To address these defects, the optimal multicast routing problem in software-defined networking (SDN) is tailored as a multi-objective optimization problem, and an intelligent multicast routing algorithm DRL-M4MR based on the deep Q network (DQN) deep reinforcement learning (DRL) method is designed to construct a multicast tree in SDN. First, the multicast tree state matrix, link bandwidth matrix, link delay matrix, and link packet loss rate matrix are designed as the state space of the DRL agent by combining the global view and control of the SDN. Second, the action space of the agent is all the links in the network, and the action selection strategy is designed to add the links to the current multicast tree under four cases. Third, single-step and final reward function forms are designed to guide the intelligence to make decisions to construct the optimal multicast tree. The experimental results show that, compared with existing algorithms, the multicast tree construct by DRL-M4MR can obtain better bandwidth, delay, and packet loss rate performance after training, and it can make more intelligent multicast routing decisions in a dynamic network environment.
55
With the rapid development of big data and artificial intelligence technologies, the demand for effective processing and retrieval of vector data is growing. Against this backdrop, I have developed the Bhakti vector database, aiming to provide a lightweight and easy-to-deploy solution to meet the storage and semantic search needs of small and medium-sized datasets. Bhakti supports a variety of similarity calculation methods and a domain-specific language (DSL) for document-based pattern matching pre-filtering, facilitating migration of data with its portable data files, flexible data management and seamless integration with Python3. Furthermore, I propose a memory-enhanced large language model dialogue solution based on the Bhakti database, which can assign different weights to the question and answer in dialogue history, achieving fine-grained control over the semantic importance of each segment in a single dialogue history. Through experimental validation, my method shows significant performance in the application of semantic search and question-answering systems. Although there are limitations in processing large datasets, such as not supporting approximate calculation methods like HNSW, the lightweight nature of Bhakti gives it a clear advantage in scenarios involving small and medium-sized datasets.
·
UI design is an integral part of software development. For many developers who do not have much UI design experience, exposing them to a large database of real-application UI designs can help them quickly build up a realistic understanding of the design space for a software feature and get design inspirations from existing applications. However, existing keyword-based, image-similarity-based, and component-matching-based methods cannot reliably find relevant high-fidelity UI designs in a large database alike to the UI wireframe that the developers sketch, in face of the great variations in UI designs. In this article, we propose a deep-learning-based UI design search engine to fill in the gap. The key innovation of our search engine is to train a wireframe image autoencoder using a large database of real-application UI designs, without the need for labeling relevant UI designs. We implement our approach for Android UI design search, and conduct extensive experiments with artificially created relevant UI designs and human evaluation of UI design search results. Our experiments confirm the superior performance of our search engine over existing image-similarity or component-matching-based methods and demonstrate the usefulness of our search engine in real-world UI design tasks.
15
In human-computer interaction, Speech Emotion Recognition (SER) plays an essential role in understanding the user's intent and improving the interactive experience. While similar sentimental speeches own diverse speaker characteristics but share common antecedents and consequences, an essential challenge for SER is how to produce robust and discriminative representations through causality between speech emotions. In this paper, we propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) to construct a novel emotional causality representation learning component with a multi-scale receptive field. GM-TCNet deploys a novel emotional causality representation learning component to capture the dynamics of emotion across the time domain, constructed with dilated causal convolution layer and gating mechanism. Besides, it utilizes skip connection fusing high-level features from different gated convolution blocks to capture abundant and subtle emotion changes in human speech. GM-TCNet first uses a single type of feature, mel-frequency cepstral coefficients, as inputs and then passes them through the gated temporal convolutional module to generate the high-level features. Finally, the features are fed to the emotion classifier to accomplish the SER task. The experimental results show that our model maintains the highest performance in most cases compared to state-of-the-art techniques.
In this work, we explore the properties of the matrix method for black hole quasinormal modes on the nonuniform grid. In particular, the method is implemented to be adapted to the Chebyshev grid, aimed at effectively suppressing Runge's phenomenon. It is found that while such an implementation is favorable from a mathematical point of view, in practice, the increase in precision does not necessarily compensate for the penalty in computational time. On the other hand, the original matrix method, though subject to Runge's phenomenon, is shown to be reasonably robust and suffices for most applications with a moderate grid number. In terms of computational time and obtained significant figures, we carried out an analysis regarding the trade-off between the two aspects. The implications of the present study are also addressed.
There are no more papers matching your filters at the moment.