Abstract
In this contribution, we formalize the general case of fuzzy linguistic summaries with multiple summarizers and qualifiers. We extend previous definitions to an advanced form that combines a linguistic quantifier Q with several qualifiers and several summarizers, joined by aggregation operators ⋆, ⋄ ∈ {AND, OR}. A fuzzy linguistic summary with two summarizers may be exemplified as: Most young employees earn low salary AND have small experience, and one with two qualifiers as: Most young employees AND having small experience earn low salary. We study the consistency of fuzzy linguistic summaries, focusing mainly on the double negation (DN) property. We consider different negations for linguistic quantifiers and show which of them preserve the DN property. The applicability of the proposed general definition and its properties is illustrated by multiple examples, including experiments on the real-life problem of sensor-based health monitoring. This work highlights the importance of selecting in advance the aggregation functions used to construct fuzzy linguistic summaries, and their influence on whether the double negation property holds.
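As a rough illustration of how a summary with two summarizers might be evaluated, the sketch below computes a degree of truth in the classic Zadeh-style way: the quantifier is applied to the proportion of qualified records that also satisfy the aggregated summarizers. The membership functions, the sample records, and the choice of min for AND and max for OR are illustrative assumptions, not the definitions used in the paper.

```python
def clamp(x):
    """Clip a value to the unit interval [0, 1]."""
    return max(0.0, min(1.0, x))

# Hypothetical membership functions (assumed shapes, for illustration only).
def young(age):
    return clamp((40 - age) / 10)          # fully young up to 30, not young from 40

def low_salary(salary):
    return clamp((5000 - salary) / 2000)   # fully low up to 3000, not low from 5000

def small_experience(years):
    return clamp((6 - years) / 4)          # fully small up to 2 years, not small from 6

def most(p):
    return clamp((p - 0.3) / 0.5)          # relative quantifier: 0 below 0.3, 1 above 0.8

def truth_degree(records, quantifier, qualifier, summarizers, agg=min):
    """T = Q(sum(min(R, P)) / sum(R)), where P aggregates all summarizers with agg."""
    num = 0.0
    den = 0.0
    for rec in records:
        r = qualifier(rec)
        p = agg(s(rec) for s in summarizers)
        num += min(r, p)
        den += r
    return quantifier(num / den) if den > 0 else 0.0

# Made-up sample data.
employees = [
    {"age": 25, "salary": 2500, "years": 1},
    {"age": 28, "salary": 3000, "years": 2},
    {"age": 32, "salary": 4600, "years": 3},
    {"age": 45, "salary": 6000, "years": 15},
]
summarizers = [lambda e: low_salary(e["salary"]), lambda e: small_experience(e["years"])]
is_young = lambda e: young(e["age"])

# "Most young employees earn low salary AND have small experience"
t_and = truth_degree(employees, most, is_young, summarizers, agg=min)
# The same summary with OR between the summarizers
t_or = truth_degree(employees, most, is_young, summarizers, agg=max)
print(t_and, t_or)
```

Swapping `agg` between `min` and `max` changes the resulting degree of truth, which hints at why the choice of aggregation function matters for properties such as double negation.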
Abstract
Fuzzy linguistic summaries provide compact, human-readable descriptions of complex data. In explainable artificial intelligence (XAI), they have been used to transform numerical model explanations into natural-language statements. Traditional quality criteria, such as the degree of truth and the degree of support, are exclusively data-driven and do not necessarily capture domain knowledge. In clinical settings, experts often reason contrastively and relative to baseline expectations: a pattern is informative if it differentially characterizes one diagnostic state versus others and if it increases the concentration of that state under the observed pattern. In this paper, we introduce two new quality criteria, contrast and surprise, for assessing fuzzy linguistic summaries in classification contexts. In the current formulation, the proposed criteria are defined for the qualifier-summarizer pattern and do not directly depend on the quantifier, so they are intended to complement, rather than replace, summary-level criteria such as the degree of truth and the degree of support. We provide illustrative experiments on real-life data to show their complementarity with the degree of truth and the degree of support.
Abstract
In many real-world scenarios, explanations are not perceived as isolated entities, but rather as streams accompanying models that operate over extended periods of time. In such contexts, users are exposed not to a single explanation, but to a sequence of explanations whose evolution may influence trust, understanding, and decision-making. In this work, we aim to capture the evolution of explanations that exhibit sudden changes, contradict prior explanations, or fluctuate erratically, as such behavior may compromise their practical utility even when the model maintains a high level of predictive performance. We propose a framework for assessing the consistency of explanations in evolving supervised learning contexts, taking into account the variability of the input datasets and the uncertainty of the model. We introduce a model-agnostic and explanation-agnostic index for quantifying the temporal consistency of automated explanations. We empirically demonstrate that the proposed index facilitates the quantification and localization of temporal instability in explanation streams. Experimental results for simulated multivariate time series underscore the ability of the proposed index to capture instantaneous changes and their relative impact in relation to historical patterns.
Abstract
Fuzzy linguistic summarization [8] provides concise and easy-to-interpret descriptions of large datasets. Its main aim is to aggregate and translate numeric observations into natural-language sentences using linguistic expressions. Numerous examples confirm the usefulness of fuzzy linguistic summaries in practical applications, in particular those based on sensor-collected data, where the analysis is too complex or time-consuming for experts. This contribution builds upon previous achievements in the theories of generalized and intermediate quantifiers [1] and of evaluative linguistic expressions [2]. When constructing fuzzy linguistic summaries, one can distinguish relative expressions, such as low, high, medium, or short, from absolute ones, e.g., around 20 [9]. In the majority of the related works, relative expressions are sufficient. However, in the clinical setting, there are well-established norms arising from medical guidelines that need to be acknowledged. Absolute expressions are of great importance in medical applications, where many indicators have numerical standards, such as blood or heart rate tests (e.g., ‘About 100 patients with high blood pressure have a pulse of around 90.’). Thus, there is a need to consider absolute linguistic expressions, which are usually represented as unimodal membership functions, and, in our opinion, the properties of such summaries have not been studied intensively so far. In this work, we study the antonym property of fuzzy linguistic summaries with absolute linguistic expressions. First, we briefly review qualitative evaluation criteria with a particular focus on the degree of truth (as a baseline) and the degrees of imprecision and specificity. Next, we consider the antonym property and investigate its adequacy for the selected criteria.
Abstract
Recent advances in information and communication technologies have led to the widespread use of smart meters, wearable sensors, and related devices in healthcare, industrial monitoring (e.g., manufacturing and energy systems), and transportation. These systems generate large volumes of sequential data, yet fully annotating them remains expensive and often impractical. Consequently, only a small fraction of the data is typically labeled, limiting the applicability of fully supervised learning. At the same time, ignoring available labels altogether, as in fully unsupervised approaches, risks discarding valuable information. This tension has motivated growing interest in semi-supervised methods that learn from both labeled and unlabeled data. However, many such approaches rely on complex black-box models, making the decision-making process opaque, which is particularly problematic in high-risk domains such as medicine and finance. Explainable AI (XAI) has therefore become essential for building trust and ensuring accountability in these settings. This review surveys recent advances in explainable semi-supervised methods for multivariate time-series analysis. We introduce a taxonomy based on how explainability is integrated, categorizing the approaches as white-box (transparent), post-hoc (opaque), and intermediate. Within each category, we further classify them by their most distinctive characteristics: the model class for white-box, the interpretability integration strategy for intermediate, and the explanation technique for post-hoc. We also discuss common practices and lessons learned in dealing with partial supervision and model interpretability, and highlight key challenges in sequential data analysis, such as the choice of performance metrics and explanation techniques. We find that, although interest in explainable semi-supervised time-series methods is growing, the systematic evaluation of explanations remains underdeveloped and lacks standardized practices.
Nearly 70% of the reviewed works report some form of explainability validation; however, it is typically indirect, qualitative, or limited in scope. Overall, explainable semi-supervised methods represent a promising direction for future research, with potential benefits across a wide range of real-world time-series applications.
Abstract
Bipolar affective disorder and depression are among the most prevalent mental health conditions, with recent advances highlighting the role of sensors and computational methods in monitoring them. However, current Artificial Intelligence (AI)-based systems, while accurate, often lack transparency, limiting their trustworthiness and clinical adoption. Furthermore, the state of the art still lacks clear guidelines on how to design advanced human-centric validation approaches for the interpretations or explanations of intelligent systems, which are needed to pave the way towards trustworthy AI systems ready to be adopted by clinicians. This paper presents a novel evaluation approach integrating supervised learning with fuzzy information granules derived from fuzzy association rules and linguistic summaries to enhance interpretability. Its main innovation lies in the human-centric evaluation methodology. Our use case study in the mental health monitoring setting demonstrates the framework’s ability to reveal meaningful relationships between sensor data and mental states. Thus, this work contributes to the development of trustworthy AI systems in compliance with emerging regulatory standards. Our findings suggest that fuzzy logic-based interpretations of the patients’ acoustic features would be beneficial for both clinicians and patients. In the survey, 75% of respondents agreed that the interpretations addressed important aspects of the clinical problem, and 91.7% agreed that additional interpretations would help psychiatrists in daily patient care. However, evaluations were more critical concerning clarity and evidential support. Further work should focus on improving the conciseness and clarity of the automatically constructed fuzzy information granules.
Abstract
Fuzzy linguistic summaries provide insights about large numerical datasets in natural language. While their practical potential has been demonstrated with many applications across domains, effective monitoring of evolving sequences of such summaries remains a significant challenge. This limitation is especially evident in dynamic environments such as remote health monitoring, where new data are collected continuously and human-consistent monitoring approaches are needed.
The majority of the related works concentrate on assessing the quality of individual summary sentences, typically in terms of measures such as degree of truth, confidence, support, informativeness, or focus. The evaluation of sets or sequences of summaries is considerably more complex, particularly in real-world settings where data may arrive incrementally or remain partially incomplete, thereby introducing additional uncertainty into the assessment process. The interpretability of collections of linguistic summaries was first studied in [1] and further extended in [2].
In this work, we aim to explore the properties of sequences of fuzzy linguistic summaries, focusing on the construction of effective and easy-to-understand information granules that support communication about changes in the observed multivariate time series. For this purpose, we adapt fuzzy linguistic summaries based on the concept of extended protoforms [3]. Recall that a fuzzy linguistic summary (FLS) [2] is built from a linguistic quantifier Q, one or more qualifiers, and one or more summarizers.
This work also builds on previous research on the theories of generalized and intermediate quantifiers [4], initiated by the work of Mostowski [5]. An FLS can also be expressed as a generalized quantifier [6], that is, an operator Q binding n variables.
In this work, we consider selected examples of summaries with the quantifiers Majority and Around half. The main focus is on monitoring sequences of FLS. We propose a stability index and compare it with selected well-established statistical process control approaches.
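To make the statistical process control baseline more concrete, the following minimal sketch applies a Shewhart-style check to a sequence of degrees of truth of one summary. It is not the proposed stability index, whose definition is not given here; the sequence values, the baseline length, and the three-sigma rule are assumptions made for illustration.

```python
from statistics import mean, stdev

def shewhart_limits(baseline, k=3.0):
    """Control limits: baseline mean plus/minus k sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread

def out_of_control(sequence, baseline_size, k=3.0):
    """Indices of observations after the baseline that fall outside the limits."""
    lower, upper = shewhart_limits(sequence[:baseline_size], k)
    return [
        i for i, value in enumerate(sequence)
        if i >= baseline_size and not (lower <= value <= upper)
    ]

# Hypothetical degrees of truth of a summary such as
# "Majority of recent recordings have low energy", one value per time window.
truth_sequence = [0.80, 0.82, 0.79, 0.81, 0.80, 0.78, 0.45, 0.80]
flagged = out_of_control(truth_sequence, baseline_size=5)
print(flagged)
```

The first five windows serve as the in-control baseline, and later windows are flagged when their degree of truth leaves the resulting band.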
Abstract
Granular Computing (GC) has evolved dynamically over recent decades, substantially enhancing methods to improve our understanding of large numerical datasets, as demonstrated, for example, by Pedrycz and Bargiela [1]. Recent developments have advanced areas such as fuzzy association rule mining and linguistic summarization, among others. Despite significant theoretical achievements and numerous successful practical applications, one of the main remaining challenges is the adequate validation of fuzzy information granules, and the task becomes even more complex when different granular computing approaches are compared.
Let us consider the following two examples of fuzzy information granules:
I1: Almost every recording in depression has low energy.
I2: If we observe low energy, then the patient is most likely suffering from depression.
In practice, there is often a need to compare different types of information granules, often imprecise, that come from various sources. To the best of the authors’ knowledge, while there are significant contributions to particular areas, for example, association rule mining, including approaches to assess rule quality, there is little research on the comparative analysis of different granules that thoroughly considers various quality criteria, types of quantifiers, and representations of linguistic expressions.
In this work, we pose the question of how to assess the dissimilarity of pairs of information granules that may be exemplified with I1 and I2. We focus on two representative types of information granules, namely fuzzy association rules (FAR) and fuzzy linguistic summaries, and aim to (1) propose a unified notation for the construction and selection of the most meaningful fuzzy information granules, and (2) analyze and discuss the assessment of dissimilarity across the considered types. We consider an illustrative example of real-life data collected within the mental health monitoring application scenario, which served as inspiration for this research. This work will conclude with a discussion about open challenges.
Abstract
Acoustic features extracted from speech are promising for monitoring the mental state of patients with bipolar disorder. However, further research is still needed to enhance existing AI-based predictive systems with additional explanations in natural language. Fuzzy linguistic summaries have proven successful in many contexts, including supporting the understanding of neural networks tailored to smartphone-based mental health monitoring. This work recalls the recent PLENARY approach (Explaining black-box models in natural language through fuzzy linguistic summaries), which aims to explain changes in the way of speaking and to relate the voice to symptoms associated with mental disorders, such as depressed mood, anxiety, and agitation. Linguistic summaries aim to describe large and complex numeric datasets and classification outputs in a human-consistent manner, for example, “Among records that contribute positively to predicting depression, most of them have pitch-related speech features at a low level.” Finally, running examples in the Python programming language are provided.
Abstract
Feature importance plays a fundamental role in machine learning and serves as a cornerstone of explainable machine learning. In temporal settings, where data accumulates sequentially, the relevance of features may evolve, introducing challenges for interpretation. While temporal variation in feature importance is increasingly relevant for applications such as clinical monitoring and time-series prediction, it remains underexplored in the literature. In this paper, we propose a novel methodology for quantifying the temporal stability of local feature attributions. Our approach combines an exponentially weighted moving average (EWMA) model with performance metrics. The goal is to compute a feature-wise stability metric that reflects how consistently a feature contributes to model predictions over time. To complement this, we introduce a distributional drift score based on the Wasserstein distance, capturing shifts in the underlying feature distributions. Together, these two signals form a diagnostic framework that distinguishes between shifts due to data dynamics and those arising from model behavior. We evaluate our approach on a simulated dataset reflecting a mental health monitoring scenario, as well as on a publicly available benchmark time-series dataset. In both cases, the proposed metrics uncover nuanced patterns of feature behavior, enabling practitioners to identify features that are not only important but also temporally reliable. Our results demonstrate that assessing both the stability of explanations and the drift of features provides a more robust foundation for trustworthy model interpretation in dynamic environments.
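The two signals described above can be sketched roughly as follows. The EWMA recursion and the one-dimensional Wasserstein distance are standard, but the way they are combined here (an EWMA of absolute step-to-step changes in one feature's attribution) is a hypothetical stand-in for the stability metric, which is not specified in this abstract. All data values are made up.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    s = values[0]  # initialise with the first observation
    for v in values:
        s = alpha * v + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

def attribution_instability(attributions, alpha=0.3):
    """Hypothetical signal: EWMA of absolute changes between consecutive attributions."""
    changes = [abs(b - a) for a, b in zip(attributions, attributions[1:])]
    return ewma(changes, alpha)

def wasserstein_1d(u, v):
    """1-Wasserstein distance between two equal-size one-dimensional samples."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Local attributions of one feature over consecutive time steps (illustrative).
attributions = [0.5, 0.5, 0.5, 0.9]
instability = attribution_instability(attributions)

# Values of the same feature in a reference window and in the current window.
reference_window = [0.0, 1.0, 2.0]
current_window = [1.0, 2.0, 3.0]
drift = wasserstein_1d(reference_window, current_window)
print(instability, drift)
```

Read together, a rise in the instability signal without a rise in drift would point towards model behavior, whereas a simultaneous rise in both would point towards changes in the data.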
Abstract
One of the main challenges when developing medical decision support systems for the emergency room is adequately filtering the most relevant information. High workload, stress, and the necessity for urgent decisions require precise answers to the questions posed. Although LLM-based systems can provide abundant information, physicians need concise and relevant data in this particular clinical setting. In this study, we perform a pilot assessment of the transparency of selected LLM-based systems. The comparative analysis includes the ChatGPT o1 model, which was asked to produce responses with varying temperatures, and a pilot graph-based RAG system specializing in cardiovascular diseases. A survey was conducted among 33 clinicians regarding the amount of information contained in the provided prompts. Physicians favored the most readable, specific, and helpful answers in emergency department conditions. Reliable medical data and the form in which answers are delivered are crucial for physicians working in the emergency room. We conclude that physicians prefer LLM responses generated at a specific temperature. Further research should be expanded to enable tailoring responses not only to the clinical situation but also to the experience of the asking physician.
Abstract
In this study, we propose a novel modification of the Exact Shapley Values Based on Nearest Neighbor Classification method that incorporates fuzzy logic to better account for uncertainty in datasets. The modification introduces a fuzzy weight vector computed using the Fuzzy K-Nearest Neighbor algorithm, which improves the computation of Shapley values for pseudo-labeled instances. This approach aims to better reflect the relevance of observations and mitigate the effects of data uncertainty.
The method was validated on a real-world pilot study of 654 patients diagnosed with acute coronary syndrome. The dataset exhibits data uncertainty, one of the problems commonly encountered in clinical practice. The results show that the K-FVSNN method achieves competitive performance and maintains robust results even with up to 90% missing labels in the training set. These results highlight the potential of the K-FVSNN method for handling uncertain data in medical applications. Future work will explore its application to other datasets and further refinements to the weight vector to improve its generalizability.
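For readers unfamiliar with the Fuzzy K-Nearest Neighbor algorithm, the sketch below computes class memberships for a query point using inverse-distance weighting with a fuzzifier m. It assumes crisp labels for the neighbours and made-up data. How K-FVSNN turns such memberships into a weight vector for the Shapley value computation is not shown, since the abstract does not specify it.

```python
import math

def fuzzy_knn_memberships(query, points, labels, k=3, m=2.0, eps=1e-9):
    """Class memberships of `query` from its k nearest neighbours.

    Each neighbour is weighted by 1 / d^(2 / (m - 1)), where d is its
    Euclidean distance to the query and m > 1 is the fuzzifier.
    """
    neighbours = sorted(
        ((math.dist(query, p), lab) for p, lab in zip(points, labels)),
        key=lambda pair: pair[0],
    )[:k]
    exponent = 2.0 / (m - 1.0)
    # eps guards against division by zero when the query coincides with a point
    weights = [(1.0 / (d ** exponent + eps), lab) for d, lab in neighbours]
    total = sum(w for w, _ in weights)
    return {
        c: sum(w for w, lab in weights if lab == c) / total
        for c in sorted(set(labels))
    }

# Illustrative two-class data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
labels = [0, 0, 1, 1]
memberships = fuzzy_knn_memberships((0.1, 0.1), points, labels, k=3)
print(memberships)
```

The memberships sum to one, so they can be read as soft labels; an instance close to several points of one class receives a membership near one for that class.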
Related papers:
Abstract
Introduction
Voice features could be a sensitive marker of affective state in bipolar disorder (BD). Smartphone apps offer an excellent opportunity to collect voice data in the natural setting and can become a useful tool for phase prediction in BD.
Aims of the Study
We investigate the relations between the symptoms of BD, evaluated by psychiatrists, and patients’ voice characteristics. A smartphone app extracted acoustic parameters from the daily phone calls of n = 51 patients. We show how the prosodic, spectral, and voice quality features correlate with clinically assessed affective states and explore their usefulness in predicting the BD phase.
Methods
A smartphone app (BDmon) was developed to collect the voice signal and extract its physical features. BD patients used the application for 208 days on average. Psychiatrists assessed the severity of BD symptoms using the Hamilton Depression Rating Scale-17 and the Young Mania Rating Scale. We analyze the relations between acoustic features of speech and patients’ mental states using generalized linear mixed-effect models.
Results
The prosodic, spectral, and voice quality parameters are valid markers for assessing the severity of manic and depressive symptoms. The accuracy of the predictive generalized mixed-effect model is 70.9%–71.4%. Significant differences in the effect sizes and directions are observed between female and male subgroups. The greater the severity of mania in males, the louder (β = 1.6) and higher (β = 0.71) the tone of voice, the more clearly (β = 1.35) and more sharply (β = 0.95) they speak, and the longer their conversations (β = 1.64). For females, the observations are either exactly the opposite or absent: the greater the severity of mania, the quieter (β = −0.27) and lower (β = −0.21) the tone of voice and the less clearly (β = −0.25) they speak, whereas no correlation is found for the length of speech. On the other hand, the greater the severity of bipolar depression in males, the quieter (β = −1.07) and less clearly (β = −1.00) they speak. In females, no distinct correlations between the severity of depressive symptoms and the change in voice parameters are found.
Conclusions
Speech analysis provides physiological markers of affective symptoms in BD, and acoustic features extracted from speech are effective in predicting BD phases. This could personalize monitoring and care for BD patients and help to decide whether a specialist should be consulted.

