Jožef Stefan International Postgraduate School
This survey paper systematically categorizes and analyzes the synergy between Knowledge Graphs (KGs) and Large Language Models (LLMs), identifying critical challenges in scalability, computational efficiency, and data quality that differentiate it from prior reviews. It provides a structured framework for understanding current integration approaches and outlines key open problems for future research in creating more reliable and interpretable AI systems.
Researchers extended the OPTION ontology to formally describe modular optimization algorithms, then constructed Knowledge Graphs integrating problem features, algorithm configurations, and performance data. A pipeline combining Knowledge Graph Embeddings with a Random Forest classifier successfully predicted algorithm performance, achieving F1 scores exceeding 0.9 in balanced scenarios and near-perfect F1 scores (0.999) for highly imbalanced classifications.
A study from the Jožef Stefan Institute systematically evaluates catastrophic forgetting in cross-lingual classification when adapting Large Language Models, comparing Intermediate Training and Cross-Lingual Validation with full-model and adapter fine-tuning. It quantifies the trade-offs, showing that Intermediate Training enhances target language performance while Cross-Lingual Validation better preserves source language knowledge.
To improve the reading experience, many news sites organize news into topical collections, called stories. In this work, we present an approach for implementing real-time story identification for a news monitoring system that automatically collects news articles as they appear online and processes them in various ways. Story identification aims to assign each news article to a specific story that the article is covering. The process is similar to text clustering and topic modeling, but requires that articles be grouped based on particular events, places, and people, rather than general text similarity (as in clustering) or general (predefined) topics (as in topic modeling). We present an approach to story identification that is capable of functioning in real time, assigning articles to stories as they are published online. In the proposed approach, we combine text representation techniques, clustering algorithms, and online topic modeling methods. We combine various text representation methods to extract specific events and named entities necessary for story identification, showing that a mixture of online topic-modeling approaches such as BERTopic, DBStream, and TextClust can be adapted for story discovery. We evaluate our approach on a news dataset from Slovene media covering a period of 1 month. We show that our real-time approach produces sensible results as judged by human evaluators.
For two decades, NdFeB based magnets have been a critical component in a range of electrical devices engaged in energy production and conversion. The magnet shape and the internal microstructure of the selected NdFeB grade govern their efficiency and size. However, stricter requirements on device efficiency call for better performing magnets preferably with novel functionality not achievable today. Here we use 3D metal printing by Selective Laser Melting to fabricate dense net shape permanent magnets based on NdFeB that exhibit high magnetic performance. Evidence is provided that the internal microstructure, not achievable by traditional manufacturing means, is the origin of the solid magnetic properties. The freedom in magnet body shape and size that ranges from the millimeter to tens of centimeter scale opens up a design freedom that could be a catalyzer for the next generation of electrical devices.
Large semantic knowledge bases are grounded in factual knowledge. However, recent approaches to dense text representations (i.e. embeddings) do not efficiently exploit these resources. Dense and robust representations of documents are essential for effectively solving downstream classification and retrieval tasks. This work demonstrates that injecting embedded information from knowledge bases can augment the performance of contemporary Large Language Model (LLM)-based representations for the task of text classification. Further, by considering automated machine learning (AutoML) with the fused representation space, we demonstrate it is possible to improve classification accuracy even if we use low-dimensional projections of the original representation space obtained via efficient matrix factorization. This result shows that significantly faster classifiers can be achieved with minimal or no loss in predictive performance, as demonstrated using five strong LLM baselines on six diverse real-life datasets. The code is freely available at this https URL.
Keyword extraction is used for summarizing the content of a document and supports efficient document retrieval, and is as such an indispensable part of modern text-based systems. We explore how load centrality, a graph-theoretic measure applied to graphs derived from a given text, can be used to efficiently identify and rank keywords. By introducing meta vertices (aggregates of existing vertices) and systematic redundancy filters, the proposed method performs on par with the state of the art for the keyword extraction task on 14 diverse datasets. The proposed method is unsupervised and interpretable, and can also be used for document visualization.
In this work, the effects of thermal annealing at 500 °C on aerosol-deposited 0.65Pb(Mg1/3Nb2/3)O3-0.35PbTiO3 thick films on stainless-steel substrates are investigated using two complementary methods at high and low applied external electric fields. The first is the Positive Up Negative Down method, which provides information about the switching and non-switching contributions to the polarization. It shows that the as-deposited film is ferroelectric before annealing, since it has a switching contribution to the polarization. After annealing, the switching and non-switching contributions to the polarization increased by factors of 1.6 and 2.33, respectively, indicating stronger ferroelectric behavior. The second method is based on impedance spectroscopy coupled with Rayleigh analysis. The results show that post-deposition thermal annealing increases the reversible domain wall contribution to the dielectric permittivity by a factor of 11 while keeping the threshold field similar. This indicates that, after annealing, the domain wall density is larger while the domain wall mobility remains similar. These two complementary characterization methods show that annealing strengthens the ferroelectric behavior of the thick film by increasing the domain wall density, and that its influence is visible both in the polarization versus electric field loop and in the dielectric permittivity.
Automatic term extraction plays an essential role in domain language understanding and several natural language processing downstream tasks. In this paper, we propose a comparative study on the predictive power of Transformers-based pretrained language models toward term extraction in a multi-language cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models by conducting the intersection or union on the term output sets of different language models. Our experiments have been conducted on the ACTER corpus covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from the related work leveraging multilingual models, regarding all the languages except Dutch and French if the term extraction task excludes the extraction of named entity terms. Furthermore, by combining the outputs of the two best performing models, we achieve significant improvements.
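The ensembling step described above (taking the intersection or union of the term sets produced by different models) can be illustrated with a minimal Python sketch. The function names, the toy term sets, and the gold standard below are hypothetical and chosen only for illustration; they are not taken from the paper or from the ACTER or RSDO5 corpora.

```python
def ensemble_terms(predictions, mode="union"):
    """Combine the term sets predicted by several models.

    predictions: list of iterables of term strings, one per model.
    mode: "union" keeps a term if any model extracted it,
          "intersection" keeps it only if every model extracted it.
    """
    sets = [set(p) for p in predictions]
    if not sets:
        return set()
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError("mode must be 'union' or 'intersection'")


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 of predicted terms against gold terms."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical outputs of a monolingual and a multilingual model (toy data).
mono_terms = {"heart failure", "ejection fraction", "patient"}
multi_terms = {"heart failure", "ejection fraction", "beta blocker"}
gold_terms = {"heart failure", "ejection fraction", "beta blocker"}

union_terms = ensemble_terms([mono_terms, multi_terms], mode="union")
intersection_terms = ensemble_terms([mono_terms, multi_terms], mode="intersection")

print("union:", precision_recall_f1(union_terms, gold_terms))
print("intersection:", precision_recall_f1(intersection_terms, gold_terms))
```

On this toy data the union favors recall (every gold term is found, at the cost of one false positive), while the intersection favors precision (no false positives, but one gold term is missed). Which trade-off is preferable on real corpora is an empirical question that this sketch does not answer.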
Manufacturing industries strive to improve production efficiency and product quality by deploying advanced sensing and control systems. Wearable sensors are emerging as a promising solution for achieving this goal, as they can provide continuous and unobtrusive monitoring of workers' activities in the manufacturing line. This paper presents a novel wearable sensing prototype that combines IMU and body capacitance sensing modules to recognize worker activities in the manufacturing line. To handle these multimodal sensor data, we propose and compare early and late sensor data fusion approaches for multi-channel time-series convolutional neural networks and deep convolutional LSTM. We evaluate the proposed hardware and neural network model by collecting and annotating sensor data using the proposed sensing prototype and Apple Watches in the testbed of the manufacturing line. Experimental results demonstrate that our proposed methods achieve superior performance compared to the baseline methods, indicating the potential of the proposed approach for real-world applications in manufacturing industries. Furthermore, the proposed sensing prototype with a body capacitive sensor and feature fusion method improves the macro F1 score by 6.35% over the same prototype without a body capacitive sensor and by 9.38% over Apple Watch data.
We propose an approach to symbolic regression based on a novel variational autoencoder for generating hierarchical structures, HVAE. It combines simple atomic units with shared weights to recursively encode and decode the individual nodes in the hierarchy. Encoding is performed bottom-up and decoding top-down. We empirically show that HVAE can be trained efficiently with small corpora of mathematical expressions and can accurately encode expressions into a smooth low-dimensional latent space. The latter can be efficiently explored with various optimization methods to address the task of symbolic regression. Indeed, random search through the latent space of HVAE performs better than random search through expressions generated by manually crafted probabilistic grammars for mathematical expressions. Finally, the EDHiE system for symbolic regression, which applies an evolutionary algorithm to the latent space of HVAE, reconstructs equations from a standard symbolic regression benchmark better than a state-of-the-art system based on a similar combination of deep learning and evolutionary algorithms.
In this paper, we focus on the detection of semantic changes in Slovene, a less-resourced Slavic language with two million speakers. Detecting and tracking semantic changes provides insight into the evolution of language caused by changes in society and culture. We present the first Slovene dataset for evaluating semantic change detection systems, which contains aggregated semantic change scores for 104 target words obtained from more than 3,000 manually annotated sentence pairs. We analyze an important class of semantic change measures based on the average pairwise distance and identify several limitations. To address these limitations, we propose a novel metric based on regularized optimal transport, which offers a more robust framework for quantifying semantic change. We provide a comprehensive evaluation of various existing semantic change detection methods and associated semantic change measures on our dataset. Through empirical testing, we demonstrate that our proposed approach, leveraging regularized optimal transport, achieves either matching or improved performance compared to baseline approaches.
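As a point of reference for the baseline family discussed above, the average pairwise distance between the usages of a word in two time periods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: cosine distance is assumed as the underlying distance, the two-dimensional vectors are toy stand-ins for contextual embeddings, and the function names are hypothetical. It does not implement the regularized optimal transport metric proposed in the paper.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def average_pairwise_distance(usages_t1, usages_t2):
    """Mean distance over all pairs (one usage from each time period)."""
    total = sum(cosine_distance(u, v) for u in usages_t1 for v in usages_t2)
    return total / (len(usages_t1) * len(usages_t2))


# Toy usage embeddings of one target word (assumed, not real data).
period_1 = [[1.0, 0.0], [1.0, 0.0]]   # a single sense in the first period
period_2 = [[1.0, 0.0], [0.0, 1.0]]   # the old sense plus a new one

apd_changed = average_pairwise_distance(period_1, period_2)
apd_stable = average_pairwise_distance(period_1, period_1)
print(apd_changed, apd_stable)
```

Because the score averages over all cross-period pairs, it rises when the second period contains usages far from those in the first period and stays at zero when the two periods contain identical usages, as in the toy example.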
Dehumanisation involves the perception and/or treatment of a social group's members as less than human. This phenomenon is rarely addressed with computational linguistic techniques. We adapt a recently proposed approach for English, making it easier to transfer to other languages and to evaluate, introducing a new sentiment resource, the use of zero-shot cross-lingual valence and arousal detection, and a new method for statistical significance testing. We then apply it to study attitudes to migration expressed in Slovene newspapers, to examine changes in the Slovene discourse on migration between the 2015-16 migration crisis following the war in Syria and the 2022-23 period following the war in Ukraine. We find that while this discourse became more negative and more intense over time, it is less dehumanising when specifically addressing Ukrainian migrants compared to others.
Doping of quantum antiferromagnets is an established approach to investigate the robustness of their ground state against the competing phases. Predictions of doping effects on the ground state of the Shastry-Sutherland dimer model are here verified experimentally on Mg-doped SrCu2(BO3)2. A partial incorporation of Mg2+ on the Cu2+-site in the SrCu2(BO3)2 structure leads to a subtle but systematic lattice expansion with the increasing Mg-doping concentration, which is accompanied by a concomitant decrease in the spin gap, the Curie-Weiss temperature and the peak temperature of the susceptibility. These findings indicate a doping-induced breaking of Cu2+ spin-1/2 dimers which is also corroborated by X-band EPR spectroscopy that points to a systematic increase in intensity of free Cu2+ sites with increasing Mg-doping concentration. Extending the Mg-doping up to nominal x = 0.10 or SrCu1.9Mg0.1(BO3)2, in the magnetisation measurements taken up to 35 T, a suppression of the pseudo-1/8 plateau is found along with a clear presence of an anomaly at an onset critical field H'C0 ~ 9 T. The latter, absent in pure SrCu2(BO3)2, emerges due to the coupling of liberated Cu2+ spin-1/2 entities in the vicinity of Mg-doping induced impurities.
In this study, explainable machine learning techniques are applied to predict the toxicity of mussels in the Gulf of Trieste (Adriatic Sea) caused by harmful algal blooms. By analysing a newly created 28-year dataset containing records of toxic phytoplankton in mussel farming areas and toxin concentrations in mussels (Mytilus galloprovincialis), we train and evaluate the performance of ML models to accurately predict diarrhetic shellfish poisoning (DSP) events. The random forest model provided the best prediction of positive toxicity results based on the F1 score. Explainability methods such as permutation importance and SHAP identified key species (Dinophysis fortii and D. caudata) and environmental factors (salinity, river discharge and precipitation) as the best predictors of DSP outbreaks. These findings are important for improving early warning systems and supporting sustainable aquaculture practices.
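Permutation importance, one of the explainability methods named above, can be sketched without any machine learning library. The sketch below is a minimal illustration, assuming accuracy as the score and a hand-written threshold rule as a stand-in for a trained model; the feature values, the threshold, and all function names are hypothetical and are not taken from the study's dataset or models.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    correct = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return correct / len(labels)


def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [row[:j] + [value] + row[j + 1:]
                        for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a trained classifier: uses feature 0 only (assumed rule)."""
    return 1 if row[0] > 5.0 else 0


# Toy data: feature 0 drives the label, feature 1 is ignored by the model.
rows = [[1.0, 30.0], [2.0, 35.0], [8.0, 31.0],
        [9.0, 36.0], [3.0, 33.0], [7.0, 34.0]]
labels = [toy_model(row) for row in rows]

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

A feature the model never uses receives an importance of exactly zero, since shuffling it cannot change any prediction, while shuffling a feature the model relies on can only lower the score in this toy setting. With a real model, the same procedure would normally be run on held-out data and with the evaluation score of interest, such as F1.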
We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We introduce the dataset, provide the reasoning behind some of the choices during its creation, present statistics on the dataset, and, using a simple classifier, some baseline results on predicting political orientation on the left-to-right axis, and on power position identification, i.e., distinguishing between the speeches delivered by governing coalition party members from those of opposition party members.
The growing impact of climate change on coastal areas, particularly active but fragile regions, necessitates collaboration among diverse stakeholders and disciplines to formulate effective environmental protection policies. We introduce a novel specialized corpus comprising 2,491 sentences from 410 scientific abstracts concerning coastal areas, for the Automatic Term Extraction (ATE) and Classification (ATC) tasks. Inspired by the ARDI framework, focused on the identification of Actors, Resources, Dynamics and Interactions, we automatically extract domain terms and their distinct roles in the functioning of coastal systems by leveraging monolingual and multilingual transformer models. The evaluation demonstrates consistent results, achieving an F1 score of approximately 80% for automated term extraction and an F1 score of 70% for extracting terms and their labels. These findings are promising and signify an initial step towards the development of a specialized Knowledge Base dedicated to coastal areas.
With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits a high zero-shot performance on all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
Magnetoelectric composites integrate the coupling between magnetic and piezoelectric materials to create new functionalities for potential technological applications. This coupling is typically achieved through the exchange of magnetic, electric, or elastic energy across the interfaces between the different constituent materials. Tailoring the strength of the magnetoelectric effect is primarily accomplished by selecting suitable materials for each constituent and by optimizing geometrical and microstructural designs. Various composite architectures, such as (0-3), (2-2), (1-3) and core-shell connectivities, have been studied to enhance magnetoelectric coupling and other required physical properties in composites. This review examines the latest advancements in magnetoelectric materials, focusing on the impact of different interphase connectivity types on their properties and performance. Before exploring magnetic-electric coupling, a brief overview of the historical background of multiferroic magnetoelectric composites is provided. Fundamental concepts underlying the magnetoelectric effect, piezoelectricity, and the magnetostrictive effect are explained, including their origins and examples of these materials' properties. So far, three types of magnetoelectric composite connectivities have been investigated experimentally: particulate composites (0-3), laminated and thin films (2-2), sticks embedded in matrix, core-shell particles, and coaxial fibers. An outlook on the prospects and scientific challenges in the field of multiferroic magnetoelectric composites is given at the end of this review.
Single crystals are essential for characterizing a wide range of magnetic states, including exotic ones such as quantum spin liquids. This study reports a flux method for growing single crystals of NdTa7O19, the first quantum spin liquid candidate on a triangular spin lattice with dominant Ising-like spin correlations. Purple NdTa7O19 single crystals with hexagonal morphology were successfully grown using a K2Mo3O10-B2O3 flux. With lateral sizes up to 3.5 mm and a thickness up to 2 mm, these are the largest dimensions reported to date. The chemical composition was confirmed by powder and single-crystal X-ray diffraction along with scanning electron microscopy with energy dispersive X-ray spectroscopy. Aiming for an accurate determination of the magnetic anisotropy and its effect on the magnetic properties, NdTa7O19 crystals were additionally analyzed by magnetic susceptibility, revealing a substantial anisotropy without long-range magnetic ordering down to 2 K. Single crystals of two novel rare-earth heptatantalates, ErTa7O19 and GdTa7O19, were also grown and their magnetic properties investigated. The magnetic anisotropy of ErTa7O19 closely resembles that of isostructural NdTa7O19, indicating a possibility of a similar exotic magnetic ground state. In contrast, GdTa7O19 shows paramagnetic behavior, consistent with previous results obtained for polycrystalline samples.