Universidade da Beira Interior
Concept-based models aim to explain model decisions with human-understandable concepts. However, most existing approaches treat concepts as numerical attributes, without providing complementary visual explanations that could localize the predicted concepts. This limits their utility in real-world applications, particularly in high-stakes scenarios such as medical use cases. This paper proposes ViConEx-Med, a novel transformer-based framework for visual concept explainability, which introduces multi-concept learnable tokens to jointly predict and localize visual concepts. By leveraging specialized attention layers for processing visual and text-based concept tokens, our method produces concept-level localization maps while maintaining high predictive accuracy. Experiments on both synthetic and real-world medical datasets demonstrate that ViConEx-Med outperforms prior concept-based models and achieves competitive performance with black-box models in terms of both concept detection and localization precision. Our results suggest a promising direction for building inherently interpretable models grounded in visual concepts. Code is publicly available at this https URL.
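To make the token mechanism concrete, below is a minimal PyTorch sketch of multi-concept learnable tokens attending to patch tokens; the module name, the single cross-attention layer, and the head design are illustrative assumptions, not the ViConEx-Med implementation.

```python
# Hypothetical sketch of multi-concept learnable tokens (not the official ViConEx-Med code).
import torch
import torch.nn as nn

class ConceptTokenHead(nn.Module):
    def __init__(self, num_concepts: int, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # One learnable token per visual concept, used as a query over patch tokens.
        self.concept_tokens = nn.Parameter(torch.randn(1, num_concepts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)  # per-concept presence logit

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (B, N, dim) from any ViT-style backbone.
        B = patch_tokens.size(0)
        queries = self.concept_tokens.expand(B, -1, -1)
        # Cross-attention: each concept token attends to the patch tokens.
        out, attn_weights = self.attn(queries, patch_tokens, patch_tokens,
                                      need_weights=True, average_attn_weights=True)
        concept_logits = self.classifier(out).squeeze(-1)  # (B, num_concepts)
        # attn_weights: (B, num_concepts, N); reshaping over the patch grid
        # yields one localization map per predicted concept.
        return concept_logits, attn_weights
```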
Concept-based models naturally lend themselves to the development of inherently interpretable skin lesion diagnosis, as medical experts make decisions based on a set of visual patterns of the lesion. Nevertheless, the development of these models depends on the existence of concept-annotated datasets, whose availability is scarce due to the specialized knowledge and expertise required in the annotation process. In this work, we show that vision-language models can be used to alleviate the dependence on a large number of concept-annotated samples. In particular, we propose an embedding learning strategy to adapt CLIP to the downstream task of skin lesion classification using concept-based descriptions as textual embeddings. Our experiments reveal that vision-language models not only attain better accuracy when using concepts as textual embeddings, but also require fewer concept-annotated samples to achieve comparable performance to approaches specifically devised for automatic concept generation.
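As an illustration of using concept-based descriptions as textual embeddings, here is a minimal zero-shot sketch built on OpenAI's public clip package; the class descriptions and file name are placeholders, and the paper's embedding learning strategy goes beyond this baseline.

```python
# Zero-shot lesion scoring with concept-based class descriptions (illustrative only).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical concept-based descriptions for each class.
class_descriptions = {
    "melanoma": "a skin lesion with asymmetry, irregular borders and an atypical pigment network",
    "nevus": "a symmetric skin lesion with regular borders and uniform pigmentation",
}

image = preprocess(Image.open("lesion.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(list(class_descriptions.values())).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(texts)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_emb @ txt_emb.T).softmax(dim=-1)

for name, p in zip(class_descriptions, probs[0].tolist()):
    print(f"{name}: {p:.3f}")
```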
To deepen our understanding of Quantum Gravity and its connections with black holes and cosmology, building a common language and exchanging ideas across different approaches is crucial. The Nordita Program "Quantum Gravity: from gravitational effective field theories to ultraviolet complete approaches" created a platform for extensive discussions, aimed at pinpointing both common grounds and sources of disagreements, with the hope of generating ideas and driving progress in the field. This contribution summarizes the twelve topical discussions held during the program and collects individual thoughts of speakers and panelists on the future of the field in light of these discussions.
Although General Relativity predicts the presence of a singularity inside a Black Hole, it is not a complete theory of gravity. The real structure of a Black Hole interior near the expected singularity depends on the UV completion of gravity. In this paper, we establish that the question of whether singular spherically symmetric solutions are absent is governed by the functional form of a non-perturbative graviton propagator. We explicitly show, in the framework of a ghost-free infinite derivative gravity, that for a graviton propagator of the exponential form favored by unitarity, a singularity is not possible unless the unphysical situation of a Black Hole with infinite total mass is considered.
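For context, the exponential propagator referred to here is usually written as follows; this is the standard textbook form for ghost-free infinite derivative gravity, and the paper's precise conventions may differ.

```latex
% Tree-level graviton propagator dressed by an entire form factor
% (standard ghost-free choice a(\Box) = e^{-\Box/M^2}):
\Pi(k^2) \;=\; \frac{\Pi_{\mathrm{GR}}(k^2)}{a(-k^2)}
\;\sim\; \frac{1}{k^2\, e^{k^2/M^2}} .
% e^{k^2/M^2} is entire and nowhere vanishing, so the only pole is the
% massless graviton at k^2 = 0 and no ghost states are introduced.
```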
A comprehensive survey synthesizes the current state of adversarial attacks and defenses for Deep Neural Networks, consolidating recent advancements, standardizing evaluation metrics including Auto-Attack, and examining adversarial effects on Vision Transformers while identifying future research directions.
The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: this https URL.
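Schematically, the two stages read as follows; this is a hedged Python sketch with hypothetical retriever and LVLM interfaces, and the actual prompts and retrieval module are specified in the paper.

```python
# Schematic two-stage CBVLM-style pipeline (all helper names are hypothetical).
def diagnose(image, concepts, lvlm, retriever, k=4):
    # Stage 1: per-concept prediction with retrieved in-context examples.
    predicted = {}
    for concept in concepts:
        examples = retriever.top_k(image, concept, k)      # few-shot demonstrations
        prompt = build_concept_prompt(concept, examples)   # "Is <concept> present?"
        predicted[concept] = lvlm.ask(image, prompt)       # e.g. "yes" / "no"

    # Stage 2: classify the image, grounded only on the predicted concepts.
    prompt = build_diagnosis_prompt(predicted)  # lists the concepts marked present
    return lvlm.ask(image, prompt), predicted
```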
Cloud Robotics is helping to create a new generation of robots that leverage the nearly unlimited resources of large data centers (i.e., the cloud), overcoming the limitations imposed by on-board resources. Differences in processing power, capabilities, resource sizes, energy consumption, and so forth make scheduling and task allocation critical components. The basic idea of task allocation and scheduling is to optimize performance by minimizing completion time, energy consumption, and delays between two consecutive tasks, among others, while maximizing resource utilization, the number of completed tasks in a given time interval, and the like. In the past, several works have addressed various aspects of task allocation and scheduling. In this paper, we provide a comprehensive overview of task allocation and scheduling strategies and related metrics suitable for robotic network cloud systems. We discuss the issues related to allocation and scheduling methods and the limitations that need to be overcome. The literature review is organized according to three different viewpoints: Architectures and Applications, Methods, and Parameters. In addition, the limitations of each method are highlighted for future research.
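As one concrete instance of the completion-time objectives mentioned above, the classical makespan formulation on heterogeneous resources can be written as follows; this is illustrative, not a model taken from the surveyed works.

```latex
% Illustrative makespan (completion-time) formulation;
% x_{ij} = 1 iff task i runs on resource j with processing time p_{ij}.
\min_{x}\; C_{\max}
\quad \text{s.t.} \quad
\sum_{j} x_{ij} = 1 \;\;\forall i,
\qquad
\sum_{i} p_{ij}\, x_{ij} \;\le\; C_{\max} \;\;\forall j,
\qquad
x_{ij} \in \{0,1\}.
```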
Current black-box adversarial attacks either require multiple queries or diffusion models to produce adversarial samples that can impair the target model's performance. However, these methods require training a surrogate loss or a diffusion model to produce adversarial samples, which limits their applicability in real-world settings. Thus, we propose a Zero Query Black-box Adversarial (ZQBA) attack that exploits the representations of Deep Neural Networks (DNNs) to fool other networks. Instead of requiring thousands of queries to produce deceiving adversarial samples, we use the feature maps obtained from a DNN and add them to clean images to impair the classification of a target model. The results suggest that ZQBA can transfer adversarial samples to different models and across various datasets, namely CIFAR and Tiny ImageNet. The experiments also show that ZQBA is more effective than state-of-the-art black-box attacks using a single query, while maintaining the imperceptibility of the perturbations, evaluated both quantitatively (SSIM) and qualitatively, emphasizing the vulnerabilities of employing DNNs in real-world contexts. All the source code is available at this https URL.
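The core idea, adding a source network's feature maps to clean images without ever querying the target, can be sketched as follows; the layer choice, normalization, and epsilon budget are assumptions, not the paper's exact recipe.

```python
# Illustrative zero-query feature-map perturbation in the spirit of ZQBA.
import torch
import torch.nn.functional as F
import torchvision.models as models

source = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

def zqba_like_attack(images: torch.Tensor, eps: float = 8 / 255) -> torch.Tensor:
    # Take an intermediate feature map from the (frozen) source network.
    with torch.no_grad():
        feats = source.layer3(source.layer2(source.layer1(
            source.maxpool(source.relu(source.bn1(source.conv1(images)))))))
    # Collapse channels, upsample to image size, and normalize to [-1, 1].
    pert = feats.mean(dim=1, keepdim=True)
    pert = F.interpolate(pert, size=images.shape[-2:], mode="bilinear",
                         align_corners=False)
    pert = pert / (pert.abs().amax(dim=(-2, -1), keepdim=True) + 1e-8)
    # Add a bounded perturbation to the clean image; no queries to the target model.
    return (images + eps * pert).clamp(0, 1)
```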
Neural Architecture Search (NAS) benchmarks have significantly improved the ability to develop and compare NAS methods, while drastically reducing the computational overhead by providing meta-information about thousands of trained neural networks. However, tabular benchmarks have several drawbacks that can hinder fair comparisons and yield unreliable results. They usually focus on providing a small pool of operations in heavily constrained search spaces -- usually cell-based neural networks with pre-defined outer skeletons. In this work, we conducted an empirical analysis of the widely used NAS-Bench-101, NAS-Bench-201 and TransNAS-Bench-101 benchmarks in terms of their generalizability and of how different operations influence the performance of the generated architectures. We found that only a subset of the operation pool is required to generate architectures close to the upper bound of the performance range, and that the performance distribution is negatively skewed, with a higher density of architectures near the upper bound. We consistently found convolution layers to have the highest impact on an architecture's performance, and that specific combinations of operations favor top-scoring architectures. These findings shed light on the correct evaluation and comparison of NAS methods using NAS benchmarks, showing that directly searching on NAS-Bench-201, ImageNet16-120 and TransNAS-Bench-101 produces more reliable results than searching only on CIFAR-10. Furthermore, with this work we provide suggestions for future benchmark evaluations and design. The code used to conduct the evaluations is available at this https URL.
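A toy sketch of the skewness check reported above, assuming the benchmark's final accuracies (in percent) have already been queried and dumped to an array; the file name is hypothetical.

```python
# Skewness of a tabular benchmark's accuracy distribution (illustrative).
import numpy as np
from scipy.stats import skew

accuracies = np.load("nasbench201_cifar10_accuracies.npy")  # hypothetical dump
print(f"skewness: {skew(accuracies):.3f}")  # negative => mass near the upper bound

top = np.quantile(accuracies, 0.95)
print(f"share within 1 point of the top-5% threshold: "
      f"{np.mean(accuracies >= top - 1.0):.2%}")
```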
Light field technology is a powerful imaging method that captures both the intensity and direction of light rays in a scene, enabling the reconstruction of 3D information and supporting a range of unique applications. However, light fields produce vast amounts of data, making efficient compression essential for their practical use. View synthesis plays a key role in light field technology by enabling the generation of new views, yet its interaction with compression has not been fully explored. In this work, a subjective analysis of the effect of view synthesis on light field compression is conducted. To achieve this, a sparsely sampled light field is created by dropping views from an original light field. Both light fields are then encoded using JPEG Pleno and VVC. View synthesis is then applied to the compressed sampled light field to reconstruct the same number of views as the original. The subjective evaluation follows the proposed JPEG AIC-3 test methodology designed to assess the quality of high-fidelity compressed images. This test displays two stimuli side-by-side, each alternating between an original and a coded view, creating a flicker effect on both sides; the user must choose the side with the stronger flicker and, therefore, the lower quality. Using these subjective results, a selection of objective metrics is validated.
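Metric validation against subjective scores of this kind is typically done with correlation statistics; below is a minimal sketch with placeholder data, and the paper's exact validation protocol may differ.

```python
# Correlating objective metric scores with subjective scores (illustrative).
import numpy as np
from scipy.stats import pearsonr, spearmanr

mos = np.array([4.2, 3.8, 2.5, 1.9, 3.1])                 # placeholder subjective scores
metric_scores = np.array([0.91, 0.85, 0.60, 0.42, 0.73])  # placeholder metric outputs

plcc, _ = pearsonr(metric_scores, mos)    # linearity of the metric
srocc, _ = spearmanr(metric_scores, mos)  # monotonicity of the metric
print(f"PLCC={plcc:.3f}  SROCC={srocc:.3f}")
```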
We show how infinite derivative modifications of gravity impact the stochastic background of Gravitational Waves from the early Universe. A generic property of the ghost-free theory, fixed on Minkowski space-time, is the emergence of an infinite number of complex mass states when other classical backgrounds are considered. These additional states are shown to enhance the power spectrum of scalar perturbations generated during inflation. Current and future space-based and terrestrial interferometers offer indirect testing methods for the infinite derivative gravity action, enabling the exploration of new parameter spaces. In particular, we identify unconventional blue-tilted Gravitational Wave spectra, presenting a novel approach for testing infinite derivative quantum gravity in the future.
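For reference, a blue-tilted spectrum means a positive tensor spectral index in the standard power-law parameterization; the convention below is stated for context only.

```latex
% Power-law parameterization of the tensor spectrum at pivot scale k_*:
\mathcal{P}_T(k) \;=\; A_T \left(\frac{k}{k_*}\right)^{n_T},
\qquad n_T > 0 \quad \text{(blue tilt)}.
% Vanilla single-field slow-roll inflation gives n_T = -2\epsilon \le 0,
% so a blue-tilted spectrum is a distinctive signature of modified dynamics.
```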
Quantum field theory (QFT) in Rindler spacetime is a gateway to understanding unitarity and information loss paradoxes in curved spacetime. Rindler coordinates map Minkowski spacetime onto regions with horizons, effectively dividing accelerated observers into causally disconnected sectors. Employing standard quantum field theory techniques and Bogoliubov transformations between Minkowski and Rindler coordinates yields entanglement between states across these causally separated regions of spacetime. This results in a breakdown of unitarity, implying that information regarding the entangled partner may be irretrievably lost beyond the Rindler horizon. As a consequence, one has a situation of pure states evolving into mixed states. In this paper, we introduce a novel framework for comprehending this phenomenon using a recently proposed formulation of direct-sum quantum field theory (DQFT), which is grounded in superselection rules formulated by the parity and time reversal ($\mathcal{PT}$) symmetry of Minkowski spacetime. In the context of DQFT applied to Rindler spacetime, we demonstrate that each Rindler observer can, in principle, access pure states within the horizon, thereby restoring unitarity. However, our analysis also reveals the emergence of a thermal spectrum of Unruh radiation. This prompts a reevaluation of entanglement in Rindler spacetime, where we propose a novel perspective on how Rindler observers may reconstruct complementary information beyond the horizon. Furthermore, we revisit the implications of the Reeh-Schlieder theorem within the framework of DQFT. Lastly, we underscore how our findings contribute to ongoing efforts aimed at elucidating the role of unitarity in quantum field theory within the context of de Sitter and black hole spacetimes.
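The thermal spectrum referred to above is the standard Unruh result, quoted here for context; the paper derives its DQFT analogue.

```latex
% Standard Unruh spectrum for an observer with proper acceleration a
% (units \hbar = c = k_B = 1); \beta_\omega is the Bogoliubov coefficient
% mixing Minkowski and Rindler modes:
|\beta_\omega|^2 \;=\; \langle N_\omega \rangle \;=\; \frac{1}{e^{2\pi\omega/a} - 1},
\qquad
T_U \;=\; \frac{a}{2\pi}.
```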
The growing demand for surveillance in public spaces presents significant challenges due to the shortage of human resources. Current AI-based video surveillance systems heavily rely on core computer vision models that require extensive fine-tuning, which is particularly difficult in surveillance settings due to limited datasets and challenging conditions (viewpoint, low quality, etc.). In this work, we propose leveraging Large Vision-Language Models (LVLMs), known for their strong zero-shot and few-shot generalization, to tackle video understanding tasks in surveillance. Specifically, we explore VideoLLaMA2, a state-of-the-art LVLM, and an improved token-level sampling method, Self-Reflective Sampling (Self-ReS). Our experiments on the UCF-Crime dataset show that VideoLLaMA2 represents a significant leap in zero-shot performance, with a 20% boost over the baseline. Self-ReS further increases zero-shot action recognition performance to 44.6%. These results highlight the potential of LVLMs, paired with improved sampling techniques, for advancing surveillance video analysis in diverse scenarios.
The dominant approach to surface defect detection has been the use of hand-crafted feature-based methods. However, these fall short when the conditions under which images are acquired vary. In this paper, we therefore set out to determine how well several state-of-the-art Convolutional Neural Networks perform on the task of surface defect detection. Moreover, we propose two methods: CNN-Fusion, which fuses the predictions of all the networks into a final one, and Auto-Classifier, a novel method that improves a Convolutional Neural Network by modifying its classification component using AutoML. We carried out experiments to evaluate the proposed methods on the task of surface defect detection using different datasets from DAGM2007. We show that the use of Convolutional Neural Networks achieves better results than traditional methods, and also that Auto-Classifier outperforms all other methods, achieving 100% accuracy and 100% AUC across all the datasets.
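Prediction-level fusion of the kind CNN-Fusion performs can be sketched in a few lines; softmax averaging is shown here as one plausible rule, and the paper defines the actual fusion.

```python
# Minimal prediction-level fusion sketch (illustrative, not the paper's exact rule).
import torch

@torch.no_grad()
def fuse_predictions(models, images: torch.Tensor) -> torch.Tensor:
    # Average the softmax outputs of all trained networks into one prediction.
    probs = [m(images).softmax(dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)  # (B, num_classes)
```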
Recent advancements in imitation learning have been largely fueled by the integration of sequence models, which provide a structured flow of information to effectively mimic task behaviours. The Decision Transformer (DT) and, subsequently, the Hierarchical Decision Transformer (HDT) introduced Transformer-based approaches to learning task policies. Recently, the Mamba architecture has been shown to outperform Transformers across various task domains. In this work, we introduce two novel methods, Decision Mamba (DM) and Hierarchical Decision Mamba (HDM), aimed at improving upon the Transformer models. Through extensive experimentation across diverse environments such as OpenAI Gym and D4RL, leveraging varying demonstration datasets, we demonstrate the superiority of Mamba models over their Transformer counterparts in a majority of tasks. Results show that DM outperforms other methods in most settings. The code can be found at this https URL
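A rough sketch of a Decision-Mamba-style policy, assuming the mamba_ssm package's Mamba block and Decision-Transformer-style (return, state, action) token interleaving; this is not the authors' implementation.

```python
# Hypothetical Decision-Mamba-style policy head (illustrative sketch).
import torch
import torch.nn as nn
from mamba_ssm import Mamba

class DecisionMambaSketch(nn.Module):
    def __init__(self, state_dim, act_dim, d_model=128, n_layers=4):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)        # return-to-go token
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.blocks = nn.ModuleList(Mamba(d_model=d_model) for _ in range(n_layers))
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim).
        # Interleave (R, s, a) per timestep into one sequence: (B, 3T, d_model).
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2).flatten(1, 2)
        for block in self.blocks:
            tokens = block(tokens)  # Mamba replaces the causal self-attention stack
        # Predict the next action from the state-token positions.
        return self.predict_action(tokens[:, 1::3])
```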
Full-reference point cloud objective metrics currently provide very accurate representations of perceptual quality. These metrics are usually composed of a set of features that are combined in some way into a final quality value. In this study, the features of the best-performing metrics are analyzed. For that, the objective quality metrics are first compared against each other and the differences in their quality representations are studied, which led to the selection of the metrics used in this study: point-to-plane, point-to-attribute, Point Cloud Structural Similarity, the Point Cloud Quality Metric, and Multiscale Graph Similarity. The features defined in those metrics are examined based on their contribution to the objective estimation using recursive feature elimination, with both support vector regression and ridge regression employed as estimators. The Broad Quality Assessment of Static Point Clouds in Compression Scenario database was used for both training and validation of the models. The features selected by recursive feature elimination were then combined using the same regression method used to select them. The best combination models were then evaluated across five different publicly available subjective quality assessment datasets, targeting different point cloud characteristics and distortions. It was concluded that a set of features selected from the Point Cloud Quality Metric, Multiscale Graph Similarity, and PSNR MSE D2, combined with ridge regression, yields the best performance. This model leads to the definition of the Feature Selection Model.
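The feature-selection step can be reproduced schematically with scikit-learn's RFE wrapped around a Ridge estimator; the file names and the number of retained features below are placeholders.

```python
# Recursive feature elimination over per-metric features with Ridge (illustrative).
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

X = np.load("pointcloud_metric_features.npy")  # hypothetical (n_samples, n_features)
y = np.load("mos_scores.npy")                  # hypothetical subjective scores

selector = RFE(estimator=Ridge(alpha=1.0), n_features_to_select=8, step=1)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.support_))
print("feature ranking:", selector.ranking_)
```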
In this paper we show that there is a universal prediction for the Newtonian potential of infinite derivative, ghost-free, quadratic curvature gravity. We show that, in order to make such a theory ghost-free at the perturbative level, the Newtonian potential always falls off as 1/r in the infrared limit, while at short distances the potential becomes non-singular. We provide examples that can potentially test the scale of gravitational non-locality up to 0.004 eV.
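The best-known example of this behaviour, for the exponential form factor $a(\Box) = e^{-\Box/M^2}$, is the error-function potential; this standard result is quoted for context, while the paper establishes the general statement.

```latex
% Non-singular potential for the exponential form factor a(\Box) = e^{-\Box/M^2}:
\Phi(r) \;=\; -\frac{G m}{r}\,\operatorname{erf}\!\left(\frac{M r}{2}\right),
\qquad
\Phi(r) \xrightarrow[\;r \to \infty\;]{} -\frac{G m}{r},
\qquad
\Phi(0) \;=\; -\frac{G m M}{\sqrt{\pi}}.
```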
The main challenges hindering the adoption of deep learning-based systems in clinical settings are the scarcity of annotated data and the lack of interpretability and trust in these systems. Concept Bottleneck Models (CBMs) offer inherent interpretability by constraining the final disease prediction on a set of human-understandable concepts. However, this inherent interpretability comes at the cost of greater annotation burden. Additionally, adding new concepts requires retraining the entire system. In this work, we introduce a novel two-step methodology that addresses both of these challenges. By simulating the two stages of a CBM, we utilize a pretrained Vision Language Model (VLM) to automatically predict clinical concepts, and an off-the-shelf Large Language Model (LLM) to generate disease diagnoses based on the predicted concepts. Furthermore, our approach supports test-time human intervention, enabling corrections to predicted concepts, which improves final diagnoses and enhances transparency in decision-making. We validate our approach on three skin lesion datasets, demonstrating that it outperforms traditional CBMs and state-of-the-art explainable methods, all without requiring any training and utilizing only a few annotated examples. The code is available at this https URL
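The two steps, together with the test-time intervention, can be sketched as follows; the interfaces are hypothetical, and the actual VLM/LLM prompts are defined in the paper.

```python
# Schematic concept prediction + human intervention + LLM diagnosis (illustrative).
def diagnose_with_intervention(image, concepts, vlm, llm, corrections=None):
    # Step 1: the VLM predicts each clinical concept from the image.
    predicted = {c: vlm.predict_concept(image, c) for c in concepts}
    # Step 2: an expert may overwrite any predicted concept before diagnosis.
    if corrections:
        predicted.update(corrections)
    # Step 3: the LLM produces a diagnosis grounded only on the (corrected) concepts.
    return llm.diagnose(predicted), predicted
```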
A central issue in the theory of extreme values concerns suitable conditions under which the well-known results for the limiting distributions of the maximum of i.i.d. sequences can be applied to stationary ones. In this context, the extremal index appears as a key parameter capturing the effect of temporal dependence on the limiting distribution of the maxima. The multivariate extremal index generalizes this concept to a multivariate context and affects the tail dependence structure within the marginal sequences and between them. As it is a function, inference becomes more difficult, and it is therefore important to obtain characterizations, namely bounds based on the marginal dependence that are easier to estimate. In this work we present two decompositions that emphasize different types of information contained in the multivariate extremal index, derive an upper bound sharper than those found in the literature, and analyze its role in the limiting model of the componentwise maxima of a stationary sequence. We illustrate the results with examples of recognized interest in applications.
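For context, the univariate notion being generalized is the extremal index $\theta \in (0,1]$ of a stationary sequence; the standard definition reads as follows.

```latex
% Extremal index \theta of a stationary sequence (X_n) with marginal
% distribution F and partial maxima M_n = \max(X_1, \dots, X_n):
P\!\left(M_n \le u_n\right) \;\approx\; F^{\,n\theta}(u_n).
```

Thus $\theta = 1$ recovers the i.i.d. behaviour $F^{n}(u_n)$, while $\theta < 1$ signals clustering of extremes; the multivariate extremal index extends this scalar to a function in the componentwise-maxima setting.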
Deep learning models are widely used nowadays for their reliability in performing various tasks. However, they do not typically provide the reasoning behind their decisions, which is a significant drawback, particularly in more sensitive areas such as biometrics, security and healthcare. The most commonly used approaches to provide interpretability create visual attention heatmaps of regions of interest on an image based on the model's gradient backpropagation. Although this is a viable approach, current methods are targeted at image settings and default/standard deep learning models, meaning that they require significant adaptations to work in video/multi-modal settings and with custom architectures. This paper proposes a model-agnostic approach to interpretability, based on a novel use of the Squeeze and Excitation (SE) block, that creates visual attention heatmaps. By including an SE block before the classification layer of any model, we are able to retrieve the most influential features via manipulation of the SE vector, one of the key components of the SE block. Our results show that this new SE-based interpretability can be applied to various models in image and video/multi-modal settings, namely biometrics of facial features with CelebA and behavioral biometrics using Active Speaker Detection datasets. Furthermore, our proposal does not compromise model performance on the original task, and achieves competitive results with current interpretability approaches on state-of-the-art object datasets, highlighting its robustness on varying data beyond the biometric context.
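A minimal PyTorch sketch of an SE-based head of the kind described above; the heatmap construction shown is one plausible reading of the SE-vector manipulation, not the authors' exact code.

```python
# SE block placed before a classifier head, with a channel-importance heatmap (sketch).
import torch
import torch.nn as nn

class SEHead(nn.Module):
    def __init__(self, channels: int, num_classes: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feats: torch.Tensor):
        # feats: (B, C, H, W) from any backbone.
        s = self.squeeze(feats).flatten(1)            # (B, C)
        w = self.excite(s)                            # per-channel importance in [0, 1]
        recalibrated = feats * w[:, :, None, None]    # SE recalibration
        logits = self.fc(self.squeeze(recalibrated).flatten(1))
        # Heatmap: sum of channel activations weighted by their SE importance.
        heatmap = recalibrated.sum(dim=1)             # (B, H, W)
        return logits, heatmap
```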