Chinese Academy of Military Science
Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by minimizing the feature discrepancy between the soft prompts and the hard prompts. To preserve the generalization abilities of the pretrained CLIP model, our approach incorporates prototype-guided visual feature alignment, ensuring that the prompted visual features from the frozen image encoder align closely with class-specific prototypes. Additionally, we propose a cross-modal global-local alignment module that focuses on expression-relevant facial features, further improving the alignment between textual and visual features. Extensive experiments demonstrate our framework outperforms state-of-the-art methods on three FER benchmark datasets, while retaining the benefits of the pretrained model and minimizing computational costs.
Tracking multiple objects in a continuous video stream is crucial for many computer vision tasks. It involves detecting and associating objects with their respective identities across successive frames. Despite significant progress made in multiple object tracking (MOT), recent studies have revealed the vulnerability of existing MOT methods to adversarial attacks. Nevertheless, all of these attacks belong to digital attacks that inject pixel-level noise into input images, and are therefore ineffective in physical scenarios. To fill this gap, we propose PapMOT, which can generate physical adversarial patches against MOT for both digital and physical scenarios. Besides attacking the detection mechanism, PapMOT also optimizes a printable patch that can be detected as new targets to mislead the identity association process. Moreover, we introduce a patch enhancement strategy to further degrade the temporal consistency of tracking results across video frames, resulting in more aggressive attacks. We further develop new evaluation metrics to assess the robustness of MOT against such attacks. Extensive evaluations on multiple datasets demonstrate that our PapMOT can successfully attack various architectures of MOT trackers in digital scenarios. We also validate the effectiveness of PapMOT for physical attacks by deploying printed adversarial patches in the real world.
Adversarial patches are widely used to evaluate the robustness of object detection systems in real-world scenarios. These patches were initially designed to deceive single-modal detectors (e.g., visible or infrared) and have recently been extended to target visible-infrared dual-modal detectors. However, existing dual-modal adversarial patch attacks have limited attack effectiveness across diverse physical scenarios. To address this, we propose CDUPatch, a universal cross-modal patch attack against visible-infrared object detectors across scales, views, and scenarios. Specifically, we observe that color variations lead to different levels of thermal absorption, resulting in temperature differences in infrared imaging. Leveraging this property, we propose an RGB-to-infrared adapter that maps RGB patches to infrared patches, enabling unified optimization of cross-modal patches. By learning an optimal color distribution on the adversarial patch, we can manipulate its thermal response and generate an adversarial infrared texture. Additionally, we introduce a multi-scale clipping strategy and construct a new visible-infrared dataset, MSDrone, which contains aerial vehicle images in varying scales and perspectives. These data augmentation strategies enhance the robustness of our patch in real-world conditions. Experiments on four benchmark datasets (e.g., DroneVehicle, LLVIP, VisDrone, MSDrone) show that our method outperforms existing patch attacks in the digital domain. Extensive physical tests further confirm strong transferability across scales, views, and scenarios.
Recently, surrogate models based on deep learning have attracted much attention for engineering analysis and optimization. As the construction of data pairs in most engineering problems is time-consuming, data acquisition is becoming the predictive capability bottleneck of most deep surrogate models, which also exists in surrogate for thermal analysis and design. To address this issue, this paper develops a physics-informed convolutional neural network (CNN) for the thermal simulation surrogate. The network can learn a mapping from heat source layout to the steady-state temperature field without labeled data, which equals solving an entire family of partial difference equations (PDEs). To realize the physics-guided training without labeled data, we employ the heat conduction equation and finite difference method to construct the loss function. Since the solution is sensitive to boundary conditions, we properly impose hard constraints by padding in the Dirichlet and Neumann boundary conditions. In addition, the neural network architecture is well-designed to improve the prediction precision of the problem at hand, and pixel-level online hard example mining is introduced to overcome the imbalance of optimization difficulty in the computation domain. The experiments demonstrate that the proposed method can provide comparable predictions with numerical method and data-driven deep learning models. We also conduct various ablation studies to investigate the effectiveness of the network component and training methods proposed in this paper.
A gradient shortcut method called SDO accelerates backpropagation in diffusion models by retaining only one optimization step's computational graph, reducing computation time by 90% and memory usage by 35% while maintaining or improving performance across multiple controllable generation tasks.
The physics-informed neural network (PINN) is effective in solving the partial differential equation (PDE) by capturing the physics constraints as a part of the training loss function through the Automatic Differentiation (AD). This study proposes the hybrid finite difference with the physics-informed neural network (HFD-PINN) to fully use the domain knowledge. The main idea is to use the finite difference method (FDM) locally instead of AD in the framework of PINN. In particular, we use AD at complex boundaries and the FDM in other domains. The hybrid learning model shows promising results in experiments. To use the FDM locally in the complex boundary domain and avoid the generation of background mesh, we propose the HFD-PINN-sdf method, which locally uses the finite difference scheme at random points. In addition, the signed distance function is used to avoid the difference scheme from crossing the domain boundary. In this paper, we demonstrate the performance of our proposed methods and compare the results with the different number of collocation points for the Poisson equation, Burgers equation. We also chose several different finite difference schemes, including the compact finite difference method (CDM) and crank-nicolson method (CNM), to verify the robustness of HFD-PINN. We take the heat conduction problem and the heat transfer problem on the irregular domain as examples to demonstrate the efficacy of our framework. In summary, HFD-PINN, especially HFD-PINN-sdf, are more instructive and efficient, significantly when solving PDEs in complex geometries.
Knowledge-based visual question answering (VQA) requires world knowledge beyond the image for accurate answer. Recently, instead of extra knowledge bases, a large language model (LLM) like GPT-3 is activated as an implicit knowledge engine to jointly acquire and reason the necessary knowledge for answering by converting images into textual information (e.g., captions and answer candidates). However, such conversion may introduce irrelevant information, which causes the LLM to misinterpret images and ignore visual details crucial for accurate knowledge. We argue that multimodal large language model (MLLM) is a better implicit knowledge engine than the LLM for its superior capability of visual understanding. Despite this, how to activate the capacity of MLLM as the implicit knowledge engine has not been explored yet. Therefore, we propose GeReA, a generate-reason framework that prompts a MLLM like InstructBLIP with question relevant vision and language information to generate knowledge-relevant descriptions and reasons those descriptions for knowledge-based VQA. Specifically, the question-relevant image regions and question-specific manual prompts are encoded in the MLLM to generate the knowledge relevant descriptions, referred to as question-aware prompt captions. After that, the question-aware prompt captions, image-question pair, and similar samples are sent into the multi-modal reasoning model to learn a joint knowledge-image-question representation for answer prediction. GeReA unlocks the use of MLLM as the implicit knowledge engine, surpassing all previous state-of-the-art methods on OK-VQA and A-OKVQA datasets, with test accuracies of 66.5% and 63.3% respectively. Our code will be released at this https URL.
10
Researchers from the Chinese Academy of Military Science and Tianjin AI Innovation Center developed FCA-NIG, an automated generative framework for Vision-Language Navigation (VLN) that creates a large-scale dataset (FCA-R2R) with fine-grained sub-instruction-sub-trajectory and entity-landmark alignments. This dataset enhances various VLN agents' performance, with some models showing a 10% increase in Success Rate on unseen validation sets.
Thermal issue is of great importance during layout design of heat source components in systems engineering, especially for high functional-density products. Thermal analysis generally needs complex simulation, which leads to an unaffordable computational burden to layout optimization as it iteratively evaluates different schemes. Surrogate modeling is an effective way to alleviate computation complexity. However, temperature field prediction (TFP) with complex heat source layout (HSL) input is an ultra-high dimensional nonlinear regression problem, which brings great difficulty to traditional regression models. The Deep neural network (DNN) regression method is a feasible way for its good approximation performance. However, it faces great challenges in both data preparation for sample diversity and uniformity in the layout space with physical constraints, and proper DNN model selection and training for good generality, which necessitates efforts of both layout designer and DNN experts. To advance this cross-domain research, this paper proposes a DNN based HSL-TFP surrogate modeling task benchmark. With consideration for engineering applicability, sample generation, dataset evaluation, DNN model, and surrogate performance metrics, are thoroughly studied. Experiments are conducted with ten representative state-of-the-art DNN models. Detailed discussion on baseline results is provided and future prospects are analyzed for DNN based HSL-TFP tasks.
Temperature field inversion of heat-source systems (TFI-HSS) with limited observations is essential to monitor the system health. Although some methods such as interpolation have been proposed to solve TFI-HSS, those existing methods ignore correlations between data constraints and physics constraints, causing the low precision. In this work, we develop a physics-informed neural network-based temperature field inversion (PINN-TFI) method to solve the TFI-HSS task and a coefficient matrix condition number based position selection of observations (CMCN-PSO) method to select optima positions of noise observations. For the TFI-HSS task, the PINN-TFI method encodes constrain terms into the loss function, thus the task is transformed into an optimization problem of minimizing the loss function. In addition, we have found that noise observations significantly affect reconstruction performances of the PINN-TFI method. To alleviate the effect of noise observations, the CMCN-PSO method is proposed to find optimal positions, where the condition number of observations is used to evaluate positions. The results demonstrate that the PINN-TFI method can significantly improve prediction precisions and the CMCN-PSO method can find good positions to acquire a more robust temperature field.
32
We approach the problem of high-DOF reaching-and-grasping via learning joint planning of grasp and motion with deep reinforcement learning. To resolve the sample efficiency issue in learning the high-dimensional and complex control of dexterous grasping, we propose an effective representation of grasping state characterizing the spatial interaction between the gripper and the target object. To represent gripper-object interaction, we adopt Interaction Bisector Surface (IBS) which is the Voronoi diagram between two close by 3D geometric objects and has been successfully applied in characterizing spatial relations between 3D objects. We found that IBS is surprisingly effective as a state representation since it well informs the fine-grained control of each finger with spatial relation against the target object. This novel grasp representation, together with several technical contributions including a fast IBS approximation, a novel vector-based reward and an effective training strategy, facilitate learning a strong control model of high-DOF grasping with good sample efficiency, dynamic adaptability, and cross-category generality. Experiments show that it generates high-quality dexterous grasp for complex shapes with smooth grasping motions.
63
Physical adversarial attacks in object detection have attracted increasing attention. However, most previous works focus on hiding the objects from the detector by generating an individual adversarial patch, which only covers the planar part of the vehicle's surface and fails to attack the detector in physical scenarios for multi-view, long-distance and partially occluded objects. To bridge the gap between digital attacks and physical attacks, we exploit the full 3D vehicle surface to propose a robust Full-coverage Camouflage Attack (FCA) to fool detectors. Specifically, we first try rendering the nonplanar camouflage texture over the full vehicle surface. To mimic the real-world environment conditions, we then introduce a transformation function to transfer the rendered camouflaged vehicle into a photo realistic scenario. Finally, we design an efficient loss function to optimize the camouflage texture. Experiments show that the full-coverage camouflage attack can not only outperform state-of-the-art methods under various test cases but also generalize to different environments, vehicles, and object detectors. The code of FCA will be available at: this https URL.
Deep learning methodology contributes a lot to the development of hyperspectral image (HSI) analysis community. However, it also makes HSI analysis systems vulnerable to adversarial attacks. To this end, we propose a masked spatial-spectral autoencoder (MSSA) in this paper under self-supervised learning theory, for enhancing the robustness of HSI analysis systems. First, a masked sequence attention learning module is conducted to promote the inherent robustness of HSI analysis systems along spectral channel. Then, we develop a graph convolutional network with learnable graph structure to establish global pixel-wise combinations.In this way, the attack effect would be dispersed by all the related pixels among each combination, and a better defense performance is achievable in spatial aspect.Finally, to improve the defense transferability and address the problem of limited labelled samples, MSSA employs spectra reconstruction as a pretext task and fits the datasets in a self-supervised manner.Comprehensive experiments over three benchmarks verify the effectiveness of MSSA in comparison with the state-of-the-art hyperspectral classification methods and representative adversarial defense strategies.
Recovering a globally accurate complex physics field from limited sensor is critical to the measurement and control in the aerospace engineering. General reconstruction methods for recovering the field, especially the deep learning with more parameters and better representational ability, usually require large amounts of labeled data which is unaffordable. To solve the problem, this paper proposes Uncertainty Guided Ensemble Self-Training (UGE-ST), using plentiful unlabeled data to improve reconstruction performance. A novel self-training framework with the ensemble teacher and pretraining student designed to improve the accuracy of the pseudo-label and remedy the impact of noise is first proposed. On the other hand, uncertainty-guided learning is proposed to encourage the model to focus on the highly confident regions of pseudo-labels and mitigate the effects of wrong pseudo-labeling in self-training, improving the performance of the reconstruction model. Experiments include the pressure velocity field reconstruction of airfoil and the temperature field reconstruction of aircraft system indicate that our UGE-ST can save up to 90% of the data with the same accuracy as supervised learning.
Physics-informed neural networks (PINNs) have lately received significant attention as a representative deep learning-based technique for solving partial differential equations (PDEs). Most fully connected network-based PINNs use automatic differentiation to construct loss functions that suffer from slow convergence and difficult boundary enforcement. In addition, although convolutional neural network (CNN)-based PINNs can significantly improve training efficiency, CNNs have difficulty in dealing with irregular geometries with unstructured meshes. Therefore, we propose a novel framework based on graph neural networks (GNNs) and radial basis function finite difference (RBF-FD). We introduce GNNs into physics-informed learning to better handle irregular domains with unstructured meshes. RBF-FD is used to construct a high-precision difference format of the differential equations to guide model training. Finally, we perform numerical experiments on Poisson and wave equations on irregular domains. We illustrate the generalizability, accuracy, and efficiency of the proposed algorithms on different PDE parameters, numbers of collection points, and several types of RBFs.
Physical adversarial attacks against deep neural networks (DNNs) have recently gained increasing attention. The current mainstream physical attacks use printed adversarial patches or camouflage to alter the appearance of the target object. However, these approaches generate conspicuous adversarial patterns that show poor stealthiness. Another physical deployable attack is the optical attack, featuring stealthiness while exhibiting weakly in the daytime with sunlight. In this paper, we propose a novel Reflected Light Attack (RFLA), featuring effective and stealthy in both the digital and physical world, which is implemented by placing the color transparent plastic sheet and a paper cut of a specific shape in front of the mirror to create different colored geometries on the target object. To achieve these goals, we devise a general framework based on the circle to model the reflected light on the target object. Specifically, we optimize a circle (composed of a coordinate and radius) to carry various geometrical shapes determined by the optimized angle. The fill color of the geometry shape and its corresponding transparency are also optimized. We extensively evaluate the effectiveness of RFLA on different datasets and models. Experiment results suggest that the proposed method achieves over 99% success rate on different datasets and models in the digital world. Additionally, we verify the effectiveness of the proposed method in different physical environments by using sunlight or a flashlight.
Deep neural networks (DNNs) have made remarkable strides in various computer vision tasks, including image classification, segmentation, and object detection. However, recent research has revealed a vulnerability in advanced DNNs when faced with deliberate manipulations of input data, known as adversarial attacks. Moreover, the accuracy of DNNs is heavily influenced by the distribution of the training dataset. Distortions or perturbations in the color space of input images can introduce out-of-distribution data, resulting in misclassification. In this work, we propose a brightness-variation dataset, which incorporates 24 distinct brightness levels for each image within a subset of ImageNet. This dataset enables us to simulate the effects of light and shadow on the images, so as is to investigate the impact of light and shadow on the performance of DNNs. In our study, we conduct experiments using several state-of-the-art DNN architectures on the aforementioned dataset. Through our analysis, we discover a noteworthy positive correlation between the brightness levels and the loss of accuracy in DNNs. Furthermore, we assess the effectiveness of recently proposed robust training techniques and strategies, including AugMix, Revisit, and Free Normalizer, using the ResNet50 architecture on our brightness-variation dataset. Our experimental results demonstrate that these techniques can enhance the robustness of DNNs against brightness variation, leading to improved performance when dealing with images exhibiting varying brightness levels.
Object detectors have demonstrated vulnerability to adversarial examples crafted by small perturbations that can deceive the object detector. Existing adversarial attacks mainly focus on white-box attacks and are merely valid at a specific viewpoint, while the universal multi-view black-box attack is less explored, limiting their generalization in practice. In this paper, we propose a novel universal multi-view black-box attack against object detectors, which optimizes a universal adversarial UV texture constructed by multiple image stickers for a 3D object via the designed layout optimization algorithm. Specifically, we treat the placement of image stickers on the UV texture as a circle-based layout optimization problem, whose objective is to find the optimal circle layout filled with image stickers so that it can deceive the object detector under the multi-view scenario. To ensure reasonable placement of image stickers, two constraints are elaborately devised. To optimize the layout, we adopt the random search algorithm enhanced by the devised important-aware selection strategy to find the most appropriate image sticker for each circle from the image sticker pools. Extensive experiments conducted on four common object detectors suggested that the detection performance decreases by a large magnitude of 74.29% on average in multi-view scenarios. Additionally, a novel evaluation tool based on the photo-realistic simulator is designed to assess the texture-based attack fairly.
In satellite layout design, heat source layout optimization (HSLO) is an effective technique to decrease the maximum temperature and improve the heat management of the whole system. Recently, deep learning surrogate assisted HSLO has been proposed, which learns the mapping from layout to its corresponding temperature field, so as to substitute the simulation during optimization to decrease the computational cost largely. However, it faces two main challenges: 1) the neural network surrogate for the certain task is often manually designed to be complex and requires rich debugging experience, which is challenging for the designers in the engineering field; 2) existing algorithms for HSLO could only obtain a near optimal solution in single optimization and are easily trapped in local optimum. To address the first challenge, considering reducing the total parameter numbers and ensuring the similar accuracy as well as, a neural architecture search (NAS) method combined with Feature Pyramid Network (FPN) framework is developed to realize the purpose of automatically searching for a small deep learning surrogate model for HSLO. To address the second challenge, a multimodal neighborhood search based layout optimization algorithm (MNSLO) is proposed, which could obtain more and better approximate optimal design schemes simultaneously in single optimization. Finally, two typical two-dimensional heat conduction optimization problems are utilized to demonstrate the effectiveness of the proposed method. With the similar accuracy, NAS finds models with 80% fewer parameters, 64% fewer FLOPs and 36% faster inference time than the original FPN. Besides, with the assistance of deep learning surrogate by automatic search, MNSLO could achieve multiple near optimal design schemes simultaneously to provide more design diversities for designers.
Polynomial chaos expansion (PCE) is a powerful surrogate model-based reliability analysis method. Generally, a PCE model with a higher expansion order is usually required to obtain an accurate surrogate model for some complex non-linear stochastic systems. However, the high-order PCE increases the number of labeled data required for solving the expansion coefficients. To alleviate this problem, this paper proposes a consistency regularization-based deep polynomial chaos neural network (Deep PCNN) method, including the low-order adaptive PCE model (the auxiliary model) and the high-order polynomial chaos neural network (the main model). The expansion coefficients of the main model are parameterized into the learnable weights of the polynomial chaos neural network, realizing iterative learning of expansion coefficients to obtain more accurate high-order PCE models. The auxiliary model uses a proposed consistency regularization loss function to assist in training the main model. The consistency regularization-based Deep PCNN method can significantly reduce the number of labeled data in constructing a high-order PCE model without losing accuracy by using few labeled data and abundant unlabeled data. A numerical example validates the effectiveness of the consistency regularization-based Deep PCNN method, and then this method is applied to analyze the reliability of two aerospace engineering systems.
There are no more papers matching your filters at the moment.