ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization
[AUTHORS]
Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Minhao Cheng, Shuai Wang, Hong Cheng, Jingren Zhou
[COMMENTS]
https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
[LINK]
http://arxiv.org/abs/2509.13313v1
[DATE]
2025-09-17 01:57:22+08:00
[CATEGORIES]
cs.CL
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
[AUTHORS]
Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
[ABSTRACT]
This paper tackles open-ended deep research (OEDR), a complex challenge where
AI agents must synthesize vast web-scale information into insightful reports.
Current approaches are plagued by dual-fold limitations: static research
pipelines that decouple planning from evidence acquisition and one-shot
generation paradigms that easily suffer from long-context failure issues like
“lost in the middle” and hallucinations. To address these challenges, we
introduce WebWeaver, a novel dual-agent framework that emulates the human
research process. The planner operates in a dynamic cycle, iteratively
interleaving evidence acquisition with outline optimization to produce a
comprehensive, source-grounded outline linking to a memory bank of evidence.
The writer then executes a hierarchical retrieval and writing process,
composing the report section by section. By performing targeted retrieval of
only the necessary evidence from the memory bank for each part, it effectively
mitigates long-context issues. Our framework establishes a new state-of-the-art
across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and
DeepResearchGym. These results validate our human-centric, iterative
methodology, demonstrating that adaptive planning and focused synthesis are
crucial for producing high-quality, reliable, and well-structured reports.
[COMMENTS]
An agent system for open-ended deep research
[LINK]
http://arxiv.org/abs/2509.13312v1
[DATE]
2025-09-17 01:57:21+08:00
[CATEGORIES]
cs.CL
Towards General Agentic Intelligence via Environment Scaling
[AUTHORS]
Runnan Fang, Shihao Cai, Baixuan Li, Jialong Wu, Guangyu Li, Wenbiao Yin, Xinyu Wang, Xiaobin Wang, Liangcai Su, Zhen Zhang, Shibin Wu, Zhengwei Tao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
[COMMENTS]
https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
[LINK]
http://arxiv.org/abs/2509.13311v1
[DATE]
2025-09-17 01:57:20+08:00
[CATEGORIES]
cs.CL
Scaling Agents via Continual Pre-training
[AUTHORS]
Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
[COMMENTS]
https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
[LINK]
http://arxiv.org/abs/2509.13310v1
[DATE]
2025-09-17 01:57:19+08:00
[CATEGORIES]
cs.CL
WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents
[AUTHORS]
Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
[ABSTRACT]
Recent advances in deep-research systems have demonstrated the potential for
AI agents to autonomously discover and synthesize knowledge from external
sources. In this paper, we introduce WebResearcher, a novel framework for
building such agents through two key components: (1) WebResearcher, an
iterative deep-research paradigm that reformulates deep research as a Markov
Decision Process, where agents periodically consolidate findings into evolving
reports while maintaining focused workspaces, overcoming the context
suffocation and noise contamination that plague existing mono-contextual
approaches; and (2) WebFrontier, a scalable data synthesis engine that
generates high-quality training data through tool-augmented complexity
escalation, enabling systematic creation of research tasks that bridge the gap
between passive knowledge recall and active knowledge construction. Notably, we
find that the training data from our paradigm significantly enhances tool-use
capabilities even for traditional mono-contextual methods. Furthermore, our
paradigm naturally scales through parallel thinking, enabling concurrent
multi-agent exploration for more comprehensive conclusions. Extensive
experiments across 6 challenging benchmarks demonstrate that WebResearcher
achieves state-of-the-art performance, even surpassing frontier proprietary
systems.
[COMMENTS]
https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
[LINK]
http://arxiv.org/abs/2509.13309v1
[DATE]
2025-09-17 01:57:17+08:00
[CATEGORIES]
cs.CL
WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
[AUTHORS]
Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
[ABSTRACT]
Transcending human cognitive limitations represents a critical frontier in
LLM training. Proprietary agentic systems like DeepResearch have demonstrated
superhuman capabilities on extremely complex information-seeking benchmarks
such as BrowseComp, a feat previously unattainable. We posit that their success
hinges on a sophisticated reasoning pattern absent in open-source models: the
ability to systematically reduce extreme uncertainty when navigating vast
information landscapes. Based on this insight, we introduce WebSailor, a
complete post-training methodology designed to instill this crucial capability.
Our approach involves generating novel, high-uncertainty tasks through
structured sampling and information obfuscation, RFT cold start, and an
efficient agentic RL training algorithm, Duplicating Sampling Policy
Optimization (DUPO). With this integrated pipeline, WebSailor significantly
outperforms all open-source agents in complex information-seeking tasks,
matching proprietary agents’ performance and closing the capability gap.
[COMMENTS]
https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
[LINK]
http://arxiv.org/abs/2509.13305v1
[DATE]
2025-09-17 01:57:03+08:00
[CATEGORIES]
cs.LG
cs.CL
JoPA: Explaining Large Language Model’s Generation via Joint Prompt Attribution
[AUTHORS]
Yurui Chang, Bochuan Cao, Yujia Wang, Jinghui Chen, Lu Lin
[ABSTRACT]
Large Language Models (LLMs) have demonstrated impressive performances in
complex text generation tasks. However, the contribution of the input prompt to
the generated content still remains obscure to humans, underscoring the
necessity of understanding the causality between input and output pairs.
Existing works for providing prompt-specific explanation often confine model
output to be classification or next-word prediction. Few initial attempts
aiming to explain the entire language generation often treat input prompt texts
independently, ignoring their combinatorial effects on the follow-up
generation. In this study, we introduce a counterfactual explanation framework
based on Joint Prompt Attribution, JoPA, which aims to explain how a few prompt
texts collaboratively influence the LLM’s complete generation. Particularly,
we formulate the task of prompt attribution for generation interpretation as a
combinatorial optimization problem, and introduce a probabilistic algorithm to
search for the causal input combination in the discrete space. We define and
utilize multiple metrics to evaluate the produced explanations, demonstrating
both the faithfulness and efficiency of our framework.
[COMMENTS]
Accepted to ACL 2025 (Main)
[LINK]
http://arxiv.org/abs/2405.20404v3
[DATE]
2025-09-17 01:48:53+08:00
[CATEGORIES]
cs.CL
cs.LG
ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement
[AUTHORS]
Ali Salamatian, Amirhossein Abaskohi, Wan-Cyuan Fan, Mir Rayat Imtiaz Hossain, Leonid Sigal, Giuseppe Carenini
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.13282v1
[DATE]
2025-09-17 01:35:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Evaluating LLM Alignment on Personality Inference from Real-World Interview Data
[AUTHORS]
Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
[ABSTRACT]
Large Language Models (LLMs) are increasingly deployed in roles requiring
nuanced psychological understanding, such as emotional support agents,
counselors, and decision-making assistants. However, their ability to interpret
human personality traits, a critical aspect of such applications, remains
unexplored, particularly in ecologically valid conversational settings. While
prior work has simulated LLM “personas” using discrete Big Five labels on
social media data, the alignment of LLMs with continuous, ground-truth
personality assessments derived from natural interactions is largely
unexamined. To address this gap, we introduce a novel benchmark comprising
semi-structured interview transcripts paired with validated continuous Big Five
trait scores. Using this dataset, we systematically evaluate LLM performance
across three paradigms: (1) zero-shot and chain-of-thought prompting with
GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA
architectures, and (3) regression using static embeddings from pretrained BERT
and OpenAI’s text-embedding-3-small. Our results reveal that all Pearson
correlations between model predictions and ground-truth personality traits
remain below 0.26, highlighting the limited alignment of current LLMs with
validated psychological constructs. Chain-of-thought prompting offers minimal
gains over zero-shot, suggesting that personality inference relies more on
latent semantic representation than explicit reasoning. These findings
underscore the challenges of aligning LLMs with complex human attributes and
motivate future work on trait-specific prompting, context-aware modeling, and
alignment-oriented fine-tuning.
[COMMENTS]
8 pages, 3 figures
[LINK]
http://arxiv.org/abs/2509.13244v1
[DATE]
2025-09-17 00:54:35+08:00
[CATEGORIES]
cs.CL
SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning
[AUTHORS]
Mingsheng Cai, Jiuming Jiang, Wenhao Huang, Che Liu, Rossella Arcucci
[ABSTRACT]
Cardiovascular diseases are a leading cause of death and disability
worldwide. Electrocardiogram (ECG) is critical for diagnosing and monitoring
cardiac health, but obtaining large-scale annotated ECG datasets is
labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL)
methods mitigate this by learning features without extensive labels but fail to
capture fine-grained clinical semantics and require extensive task-specific
fine-tuning. To address these challenges, we propose $\textbf{SuPreME}$, a
$\textbf{Su}$pervised $\textbf{Pre}$-training framework for
$\textbf{M}$ultimodal $\textbf{E}$CG representation learning. SuPreME is
pre-trained using structured diagnostic labels derived from ECG report entities
through a one-time offline extraction with Large Language Models (LLMs), which
help denoise, standardize cardiac concepts, and improve clinical representation
learning. By fusing ECG signals with textual cardiac queries instead of fixed
labels, SuPreME enables zero-shot classification of unseen conditions without
further fine-tuning. We evaluate SuPreME on six downstream datasets covering
106 cardiac conditions, achieving superior zero-shot AUC performance of
$77.20\%$, surpassing state-of-the-art eSSLs by $4.98\%$. Results demonstrate
SuPreME’s effectiveness in leveraging structured, clinically relevant knowledge
for high-quality ECG representations.
[COMMENTS]
Findings of The 2025 Conference on Empirical Methods in Natural
Language Processing (EMNLP 2025)
[LINK]
http://arxiv.org/abs/2502.19668v3
[DATE]
2025-09-17 00:49:11+08:00
[CATEGORIES]
cs.CL
cs.LG
Is the Top Still Spinning? Evaluating Subjectivity in Narrative Understanding
[AUTHORS]
Melanie Subbiah, Akankshya Mishra, Grace Kim, Liyan Tang, Greg Durrett, Kathleen McKeown
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2504.01132v2
[DATE]
2025-09-17 00:47:26+08:00
[CATEGORIES]
cs.CL
Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter
[AUTHORS]
Theodora Moldovan, Arianna Pera, Davide Vega, Luca Maria Aiello
[ABSTRACT]
We study how participation in collective action is articulated in podcast
discussions, using the Black Lives Matter (BLM) movement as a case study. While
research on collective action discourse has primarily focused on text-based
content, this study takes a first step toward analyzing audio formats by using
podcast transcripts. Using the Structured Podcast Research Corpus (SPoRC), we
investigated spoken language expressions of participation in collective action,
categorized as problem-solution, call-to-action, intention, and execution. We
identified podcast episodes discussing racial justice after important
BLM-related events in May and June of 2020, and extracted participatory
statements using a layered framework adapted from prior work on social media.
We examined the emotional dimensions of these statements, detecting eight key
emotions and their association with varying stages of activism. We found that
emotional profiles vary by stage, with different positive emotions standing out
during calls-to-action, intention, and execution. We detected negative
associations between collective action and negative emotions, contrary to
theoretical expectations. Our work contributes to a better understanding of how
activism is expressed in spoken digital discourse and how emotional framing may
depend on the format of the discussion.
[COMMENTS]
11 pages, 5 figures
[LINK]
http://arxiv.org/abs/2509.13197v1
[DATE]
2025-09-17 00:00:19+08:00
[CATEGORIES]
cs.CL
The Few-shot Dilemma: Over-prompting Large Language Models
[AUTHORS]
Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler
[ABSTRACT]
Over-prompting, a phenomenon where excessive examples in prompts lead to
diminished performance in Large Language Models (LLMs), challenges the
conventional wisdom about in-context few-shot learning. To investigate this
few-shot dilemma, we outline a prompting framework that leverages three
standard few-shot selection methods - random sampling, semantic embedding, and
TF-IDF vectors - and evaluate these methods across multiple LLMs, including
GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral.
Our experimental results reveal that incorporating excessive domain-specific
examples into prompts can paradoxically degrade performance in certain LLMs,
which contradicts the prior empirical conclusion that more relevant few-shot
examples universally benefit LLMs. Given the trend of LLM-assisted software
engineering and requirement analysis, we experiment with two real-world
software requirement classification datasets. By gradually increasing the
number of TF-IDF-selected and stratified few-shot examples, we identify their
optimal quantity for each LLM. This combined approach achieves superior
performance with fewer examples, avoiding the over-prompting problem, thus
surpassing the state-of-the-art by 1% in classifying functional and
non-functional requirements.
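A minimal sketch of how TF-IDF-based, label-stratified few-shot selection could look. This is an illustration under stated assumptions, not the authors' pipeline: the function names (`tfidf_vectors`, `select_few_shot`), the whitespace tokenizer, the smoothed IDF formula, and the round-robin stratification over labels are all hypothetical choices made here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_few_shot(query, pool, k):
    """Pick up to k (text, label) examples from pool.

    Examples are ranked by TF-IDF similarity to the query within each
    label, then taken round-robin across labels so that no single
    class dominates the prompt.
    """
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    by_label = {}
    for (text, label), vec in zip(pool, vectors):
        by_label.setdefault(label, []).append((cosine(query_vec, vec), text))
    for ranked in by_label.values():
        ranked.sort(reverse=True)  # most similar first
    labels = sorted(by_label)
    chosen, rank = [], 0
    while len(chosen) < k and any(rank < len(by_label[l]) for l in labels):
        for label in labels:
            if rank < len(by_label[label]) and len(chosen) < k:
                chosen.append((by_label[label][rank][1], label))
        rank += 1
    return chosen
```

Sweeping `k` upward and scoring each setting on held-out data is one way to look for the per-model optimum the abstract describes; the stopping criterion itself is not specified in the abstract.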
[COMMENTS]
accepted for the main track of FLLM
[LINK]
http://arxiv.org/abs/2509.13196v1
[DATE]
2025-09-17 00:00:06+08:00
[CATEGORIES]
cs.CL
QDFlow: A Python package for physics simulations of quantum dot devices
[AUTHORS]
Donovan L. Buterakos, Sandesh S. Kalantre, Joshua Ziegler, Jacob M Taylor, Justyna P. Zwolak
[ABSTRACT]
Recent advances in machine learning (ML) have accelerated progress in
calibrating and operating quantum dot (QD) devices. However, most ML approaches
rely on access to large, high-quality labeled datasets for training,
benchmarking, and validation, with labels capturing key features in the data.
Obtaining such datasets experimentally is challenging due to limited data
availability and the labor-intensive nature of labeling. QDFlow is an
open-source physics simulator for multi-QD arrays that generates realistic
synthetic data with ground-truth labels. QDFlow combines a self-consistent
Thomas-Fermi solver, a dynamic capacitance model, and flexible noise modules to
produce charge stability diagrams and ray-based data closely resembling
experiments. With extensive tunable parameters and customizable noise models,
QDFlow supports the creation of large, diverse datasets for ML development,
benchmarking, and quantum device research.
[COMMENTS]
17 pages, 5 figures
[LINK]
http://arxiv.org/abs/2509.13298v1
[DATE]
2025-09-17 01:54:25+08:00
[CATEGORIES]
cs.LG
Accelerating Protein Molecular Dynamics Simulation with DeepJump
[AUTHORS]
Allan dos Santos Costa, Manvitha Ponnapati, Dana Rubin, Tess Smidt, Joseph Jacobson
[ABSTRACT]
Unraveling the dynamical motions of biomolecules is essential for bridging
their structure and function, yet it remains a major computational challenge.
Molecular dynamics (MD) simulation provides a detailed depiction of
biomolecular motion, but its high-resolution temporal evolution comes at
significant computational cost, limiting its applicability to timescales of
biological relevance. Deep learning approaches have emerged as promising
solutions to overcome these computational limitations by learning to predict
long-timescale dynamics. However, generalizable kinetics models for proteins
remain largely unexplored, and the fundamental limits of achievable
acceleration while preserving dynamical accuracy are poorly understood. In this
work, we fill this gap with DeepJump, a Euclidean-Equivariant Flow
Matching-based model for predicting protein conformational dynamics across
multiple temporal scales. We train DeepJump on trajectories of the diverse
proteins of mdCATH, systematically studying our model’s performance in
generalizing to long-term dynamics of fast-folding proteins and characterizing
the trade-off between computational acceleration and prediction accuracy. We
demonstrate the application of DeepJump to ab initio folding, showcasing
prediction of folding pathways and native states. Our results demonstrate that
DeepJump achieves significant $\approx$1000$\times$ computational acceleration
while effectively recovering long-timescale dynamics, providing a stepping
stone for enabling routine simulation of proteins.
[LINK]
http://arxiv.org/abs/2509.13294v1
[DATE]
2025-09-17 01:48:58+08:00
[CATEGORIES]
cs.LG
OGF: An Online Gradient Flow Method for Optimizing the Statistical Steady-State Time Averages of Unsteady Turbulent Flows
[AUTHORS]
Tom Hickling, Jonathan F. MacArt, Justin Sirignano, Den Waidmann
[ABSTRACT]
Turbulent flows are chaotic and unsteady, but their statistical distribution
converges to a statistical steady state. Engineering quantities of interest
typically take the form of time-average statistics such as $ \frac{1}{t}
\int_0^t f ( u(x,\tau; \theta) ) d\tau \overset{t \rightarrow
\infty}{\rightarrow} F(x; \theta)$, where $u(x,t; \theta)$ are solutions of the
Navier–Stokes equations with parameters $\theta$. Optimizing over $F(x;
\theta)$ has many engineering applications including geometric optimization,
flow control, and closure modeling. However, this remains an open challenge, as
existing computational approaches are incapable of scaling to physically
representative numbers of grid points. The fundamental obstacle is the
chaoticity of turbulent flows: gradients calculated with the adjoint method
diverge exponentially as $t \rightarrow \infty$.
We develop a new online gradient-flow (OGF) method that is scalable to large
degree-of-freedom systems and enables optimizing for the steady-state
statistics of chaotic, unsteady, turbulence-resolving simulations. The method
forward-propagates an online estimate for the gradient of $F(x; \theta)$ while
simultaneously performing online updates of the parameters $\theta$. A key
feature is the fully online nature of the algorithm to facilitate faster
optimization progress and its combination with a finite-difference estimator to
avoid the divergence of gradients due to chaoticity. The proposed OGF method is
demonstrated for optimizations over three chaotic ordinary and partial
differential equations: the Lorenz-63 equation, the Kuramoto–Sivashinsky
equation, and Navier–Stokes solutions of compressible, forced, homogeneous
isotropic turbulence. In each case, the OGF method successfully reduces the
loss based on $F(x; \theta)$ by several orders of magnitude and accurately
recovers the optimal parameters.
[COMMENTS]
34 pages, 13 figures
[LINK]
http://arxiv.org/abs/2507.05149v2
[DATE]
2025-09-17 01:29:17+08:00
[CATEGORIES]
cs.LG
LLMs for energy and macronutrients estimation using only text data from 24-hour dietary recalls: a parameter-efficient fine-tuning experiment using a 10-shot prompt
[AUTHORS]
Rodrigo M Carrillo-Larco
[ABSTRACT]
BACKGROUND: Most artificial intelligence tools used to estimate nutritional
content rely on image input. However, whether large language models (LLMs) can
accurately predict nutritional values based solely on text descriptions of
foods consumed remains unknown. If effective, this approach could enable
simpler dietary monitoring without the need for photographs. METHODS: We used
24-hour dietary recalls from adolescents aged 12-19 years in the National
Health and Nutrition Examination Survey (NHANES). An open-source quantized LLM
was prompted using a 10-shot, chain-of-thought approach to estimate energy and
five macronutrients based solely on text strings listing foods and their
quantities. We then applied parameter-efficient fine-tuning (PEFT) to evaluate
whether predictive accuracy improved. NHANES-calculated values served as the
ground truth for energy, proteins, carbohydrates, total sugar, dietary fiber
and total fat. RESULTS: In a pooled dataset of 11,281 adolescents (49.9% male,
mean age 15.4 years), the vanilla LLM yielded poor predictions. The mean
absolute error (MAE) was 652.08 for energy, and Lin’s CCC was <0.46 across
endpoints. In contrast, the fine-tuned model performed substantially better,
with energy MAEs ranging from 171.34 to 190.90 across subsets, and Lin’s CCC
exceeding 0.89 for all outcomes. CONCLUSIONS: When prompted using a
chain-of-thought approach and fine-tuned with PEFT, open-source LLMs exposed
solely to text input can accurately predict energy and macronutrient values
from 24-hour dietary recalls. This approach holds promise for low-burden,
text-based dietary monitoring tools.
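For readers unfamiliar with the two metrics reported above, the sketch below computes the mean absolute error and Lin's concordance correlation coefficient (CCC) using their standard definitions. The numbers are made-up illustrative values, not data from the study, and the function names are chosen here.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def lins_ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient.

    CCC = 2 * cov / (var_true + var_pred + (mean_true - mean_pred)^2).
    Unlike Pearson's r, it penalizes systematic offsets as well as scatter.
    """
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


# Illustrative (invented) reference values and a prediction with a constant bias.
reference = [1800.0, 2200.0, 2500.0, 3000.0]
biased_prediction = [value + 300.0 for value in reference]

mae = mean_absolute_error(reference, biased_prediction)
ccc = lins_ccc(reference, biased_prediction)
```

In this toy case the predictions are perfectly correlated with the reference, yet the CCC falls below 1 because of the constant offset, which is why CCC is a stricter agreement measure than correlation alone.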
[COMMENTS]
https://github.com/rodrigo-carrillo/LLMs-Macronutrient-Estimation-NHANES-Adolescents
[LINK]
http://arxiv.org/abs/2509.13268v1
[DATE]
2025-09-17 01:26:17+08:00
[CATEGORIES]
cs.LG
JANUS: A Dual-Constraint Generative Framework for Stealthy Node Injection Attacks
[AUTHORS]
Jiahao Zhang, Xiaobing Pei, Zhaokun Zhong, Wenqiang Hao, Zhenghao Tang
[ABSTRACT]
Graph Neural Networks (GNNs) have demonstrated remarkable performance across
various applications, yet they are vulnerable to sophisticated adversarial
attacks, particularly node injection attacks. The success of such attacks
heavily relies on their stealthiness, the ability to blend in with the original
graph and evade detection. However, existing methods often achieve stealthiness
by relying on indirect proxy metrics, lacking consideration for the fundamental
characteristics of the injected content, or focusing only on imitating local
structures, which leads to the problem of local myopia. To overcome these
limitations, we propose a dual-constraint stealthy node injection framework,
called Joint Alignment of Nodal and Universal Structures (JANUS). At the local
level, we introduce a local feature manifold alignment strategy to achieve
geometric consistency in the feature space. At the global level, we incorporate
structured latent variables and maximize the mutual information with the
generated structures, ensuring the injected structures are consistent with the
semantic patterns of the original graph. We model the injection attack as a
sequential decision process, which is optimized by a reinforcement learning
agent. Experiments on multiple standard datasets demonstrate that the JANUS
framework significantly outperforms existing methods in terms of both attack
effectiveness and stealthiness.
[LINK]
http://arxiv.org/abs/2509.13266v1
[DATE]
2025-09-17 01:24:30+08:00
[CATEGORIES]
cs.LG
Post-Hoc Split-Point Self-Consistency Verification for Efficient, Unified Quantification of Aleatoric and Epistemic Uncertainty in Deep Learning
[AUTHORS]
Zhizhong Zhao, Ke Chen
[ABSTRACT]
Uncertainty quantification (UQ) is vital for trustworthy deep learning, yet
existing methods are either computationally intensive, such as Bayesian or
ensemble methods, or provide only partial, task-specific estimates, such as
single-forward-pass techniques. In this paper, we propose a post-hoc
single-forward-pass framework that jointly captures aleatoric and epistemic
uncertainty without modifying or retraining pretrained models. Our method
applies \emph{Split-Point Analysis} (SPA) to decompose predictive residuals
into upper and lower subsets, computing \emph{Mean Absolute Residuals} (MARs)
on each side. We prove that, under ideal conditions, the total MAR equals the
harmonic mean of subset MARs; deviations define a novel \emph{Self-consistency
Discrepancy Score} (SDS) for fine-grained epistemic estimation across
regression and classification. For regression, side-specific quantile
regression yields prediction intervals with improved empirical coverage, which
are further calibrated via SDS. For classification, when calibration data are
available, we apply SPA-based calibration identities to adjust the softmax
outputs and then compute predictive entropy on these calibrated probabilities.
Extensive experiments on diverse regression and classification benchmarks
demonstrate that our framework matches or exceeds several state-of-the-art UQ
methods while incurring minimal overhead.
Our source code is available at https://github.com/zzz0527/SPC-UQ.
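The harmonic-mean identity mentioned in the abstract can be checked numerically. The sketch below assumes that "ideal conditions" means the residuals have zero mean around the prediction; that reading is an assumption made here, not something the abstract states. With zero-mean residuals, the positive and negative parts carry equal total mass, which forces the overall MAR to equal the harmonic mean of the two side-specific MARs. Shifting the residuals breaks the identity, and a gap of this kind is what the abstract says the SDS is built on (the exact SDS definition is not given in the abstract).

```python
import random


def mean_absolute_residuals(residuals):
    """Return (total MAR, upper-side MAR, lower-side MAR)."""
    upper = [r for r in residuals if r > 0]
    lower = [-r for r in residuals if r < 0]
    total_mar = sum(abs(r) for r in residuals) / len(residuals)
    return total_mar, sum(upper) / len(upper), sum(lower) / len(lower)


def harmonic_mean(a, b):
    return 2.0 * a * b / (a + b)


rng = random.Random(0)

# Deliberately skewed samples, centered so the residuals sum to zero.
raw = [rng.expovariate(1.0) for _ in range(1000)]
center = sum(raw) / len(raw)
residuals = [x - center for x in raw]

total, up, low = mean_absolute_residuals(residuals)
gap_centered = abs(total - harmonic_mean(up, low))  # zero up to rounding

# A miscentered predictor: the same residuals with a constant offset.
shifted = [r + 0.5 for r in residuals]
total_s, up_s, low_s = mean_absolute_residuals(shifted)
gap_shifted = abs(total_s - harmonic_mean(up_s, low_s))  # clearly non-zero
```

The check works for any residual distribution, symmetric or not, as long as the residuals are centered; only the split point matters.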
[COMMENTS]
32 pages, 15 figures and 16 tables. Technical Report submitted to a
journal for publication
[LINK]
http://arxiv.org/abs/2509.13262v1
[DATE]
2025-09-17 01:16:01+08:00
[CATEGORIES]
cs.LG
Learning from a Biased Sample
[AUTHORS]
Roshni Sahoo, Lihua Lei, Stefan Wager
[ABSTRACT]
The empirical risk minimization approach to data-driven decision making
requires access to training data drawn under the same conditions as those that
will be faced when the decision rule is deployed. However, in a number of
settings, we may be concerned that our training sample is biased in the sense
that some groups (characterized by either observable or unobservable
attributes) may be under- or over-represented relative to the general
population; and in this setting empirical risk minimization over the training
set may fail to yield rules that perform well at deployment. We propose a model
of sampling bias called conditional $\Gamma$-biased sampling, where observed
covariates can affect the probability of sample selection arbitrarily much but
the amount of unexplained variation in the probability of sample selection is
bounded by a constant factor. Applying the distributionally robust optimization
framework, we propose a method for learning a decision rule that minimizes the
worst-case risk incurred under a family of test distributions that can generate
the training distribution under $\Gamma$-biased sampling. We apply a result of
Rockafellar and Uryasev to show that this problem is equivalent to an augmented
convex risk minimization problem. We give statistical guarantees for learning a
model that is robust to sampling bias via the method of sieves, and propose a
deep learning algorithm whose loss function captures our robust learning
target. We empirically validate our proposed method in a case study on
prediction of mental health scores from health survey data and a case study on
ICU length of stay prediction.
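For context, the classical Rockafellar-Uryasev result expresses the conditional value-at-risk of a loss $Z$ as a minimization over one auxiliary scalar. The abstract does not state the paper's own augmented objective, so the block below shows only this standard identity in generic notation; the symbols $Z$, $\alpha$, and $\eta$ are chosen here and are not taken from the paper.

```latex
\mathrm{CVaR}_{\alpha}(Z)
  \;=\; \min_{\eta \in \mathbb{R}}
  \left\{ \eta + \frac{1}{1-\alpha}\,
  \mathbb{E}\!\left[ (Z - \eta)_{+} \right] \right\},
\qquad (z)_{+} = \max(z, 0).
```

When $Z$ is a loss that is convex in the model parameters, the bracketed objective is jointly convex in the parameters and $\eta$, which is the sense in which a worst-case risk can become an "augmented" convex minimization problem.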
[LINK]
http://arxiv.org/abs/2209.01754v4
[DATE]
2025-09-17 01:04:19+08:00
[CATEGORIES]
cs.LG
Don’t Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning
[AUTHORS]
Bo Yin, Xingyi Yang, Xinchao Wang
[ABSTRACT]
Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt
weight matrices while keeping activation functions fixed. We introduce
\textbf{NoRA}, the first PEFT framework that directly adapts nonlinear
activation functions in pretrained transformer-based models. NoRA replaces
fixed activations with learnable rational functions and applies structured
low-rank updates to numerator and denominator coefficients, with a group-wise
design that localizes adaptation and improves stability at minimal cost. On
vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds
full fine-tuning while updating only 0.4\% of parameters (0.02M), achieving
accuracy gains of +0.17\% and +0.27\%. When combined with LoRA
(\textbf{NoRA++}), it outperforms LoRA and DoRA under matched training budgets
by adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++
consistently improves generation quality, yielding average MMLU gains of
+0.3\%–0.8\%, including +1.6\% on STEM (Alpaca) and +1.3\% on OpenOrca. We
further show that NoRA constrains adaptation to a low-dimensional functional
subspace, implicitly regularizing update magnitude and direction. These results
establish activation-space tuning as a complementary and highly
parameter-efficient alternative to weight-based PEFT, positioning activation
functions as first-class objects for model adaptation.
[LINK]
http://arxiv.org/abs/2509.13240v1
[DATE]
2025-09-17 00:47:03+08:00
[CATEGORIES]
cs.LG
Single-stream Policy Optimization
[AUTHORS]
Zhongwen Xu, Zihan Ding
[ABSTRACT]
We revisit policy-gradient optimization for Large Language Models (LLMs) from
a single-stream perspective. Prevailing group-based methods like GRPO reduce
variance with on-the-fly baselines but suffer from critical flaws: frequent
degenerate groups erase learning signals, and synchronization barriers hinder
scalability. We introduce Single-stream Policy Optimization (SPO), which
eliminates these issues by design. SPO replaces per-group baselines with a
persistent, KL-adaptive value tracker and normalizes advantages globally across
the batch, providing a stable, low-variance learning signal for every sample.
Being group-free, SPO enables higher throughput and scales effectively in
long-horizon or tool-integrated settings where generation times vary.
Furthermore, the persistent value tracker naturally enables an adaptive
curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO
converges more smoothly and attains higher accuracy than GRPO, while
eliminating computation wasted on degenerate groups. Ablation studies confirm
that SPO’s gains stem from its principled approach to baseline estimation and
advantage normalization, offering a more robust and efficient path for LLM
reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the
average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial
absolute point gains on challenging datasets, including +7.3 pp on BRUMO 25,
+4.4 pp on AIME 25, +3.3 pp on HMMT 25, and achieves consistent relative gain
in pass@$k$ across the evaluated $k$ values. SPO’s success challenges the
prevailing trend of adding incidental complexity to RL algorithms, highlighting
a path where fundamental principles, not architectural workarounds, drive the
next wave of progress in LLM reasoning.
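The contrast between a per-group baseline and a persistent baseline with batch-level normalization can be sketched in a few lines. This is a hedged illustration, not SPO itself: the abstract does not specify the KL-adaptive update rule, so the `ValueTracker` below uses a plain exponential moving average as a stand-in, and all names, the initial value, and the step size are assumptions made here.

```python
import math


def group_relative_advantages(rewards):
    """Group-based baseline: subtract the mean reward of one prompt's group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


class ValueTracker:
    """Hypothetical persistent baseline: one running reward estimate per prompt.

    A simple exponential moving average stands in for the KL-adaptive
    tracker named in the abstract, whose exact form is not given there.
    """

    def __init__(self, init_value=0.5, step=0.1):
        self.init_value = init_value
        self.step = step
        self.values = {}

    def get(self, prompt_id):
        return self.values.get(prompt_id, self.init_value)

    def update(self, prompt_id, reward):
        old = self.get(prompt_id)
        self.values[prompt_id] = old + self.step * (reward - old)


def batch_advantages(tracker, samples, eps=1e-8):
    """samples: list of (prompt_id, reward), one response per prompt.

    Advantages are reward minus the tracked baseline, then standardized
    across the whole batch rather than within a per-prompt group.
    """
    raw = [reward - tracker.get(pid) for pid, reward in samples]
    mean = sum(raw) / len(raw)
    std = math.sqrt(sum((a - mean) ** 2 for a in raw) / len(raw))
    advantages = [(a - mean) / (std + eps) for a in raw]
    for pid, reward in samples:
        tracker.update(pid, reward)  # baseline persists across batches
    return advantages


# A degenerate group (all responses correct) yields no learning signal.
degenerate = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Single-stream batch: one response per prompt, baseline from the tracker.
tracker = ValueTracker()
batch = [("p1", 1.0), ("p2", 0.0), ("p3", 1.0), ("p4", 1.0)]
advantages = batch_advantages(tracker, batch)
```

Because each sample needs only its own prompt's tracked value, no sample has to wait for the rest of a group to finish generating, which is the scalability argument the abstract makes.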
[LINK]
http://arxiv.org/abs/2509.13232v1
[DATE]
2025-09-17 00:39:11+08:00
[CATEGORIES]
cs.LG
Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation
[AUTHORS]
Hugo Carlesso, Josiane Mothe, Radu Tudor Ionescu
[ABSTRACT]
Hyperspectral imaging (HSI) captures detailed spectral signatures across
hundreds of contiguous bands per pixel, being indispensable for remote sensing
applications such as land-cover classification, change detection, and
environmental monitoring. Due to the high dimensionality of HSI data and the
slow rate of data transfer in satellite-based systems, compact and efficient
models are required to support onboard processing and minimize the transmission
of redundant or low-value data, e.g. cloud-covered areas. To this end, we
introduce a novel curriculum multi-task self-supervised learning (CMTSSL)
framework designed for lightweight architectures for HSI analysis. CMTSSL
integrates masked image modeling with decoupled spatial and spectral jigsaw
puzzle solving, guided by a curriculum learning strategy that progressively
increases data complexity during self-supervision. This enables the encoder to
jointly capture fine-grained spectral continuity, spatial structure, and global
semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously
addresses spatial and spectral reasoning within a unified and computationally
efficient design, being particularly suitable for training lightweight models
for onboard satellite deployment. We validate our approach on four public
benchmark datasets, demonstrating consistent gains in downstream segmentation
tasks, using architectures that are over 16,000x lighter than some
state-of-the-art models. These results highlight the potential of CMTSSL in
generalizable representation learning with lightweight architectures for
real-world HSI applications. Our code is publicly available at
https://github.com/hugocarlesso/CMTSSL.
[LINK]
http://arxiv.org/abs/2509.13229v1
[DATE]
2025-09-17 00:37:59+08:00
[CATEGORIES]
cs.LG
On the Out-of-Distribution Backdoor Attack for Federated Learning
[AUTHORS]
Jiahao Xu, Zikai Zhang, Rui Hu
[ABSTRACT]
Traditional backdoor attacks in federated learning (FL) operate within
constrained attack scenarios, as they depend on visible triggers and require
physical modifications to the target object, which limits their practicality.
To address this limitation, we introduce a novel backdoor attack prototype for
FL called the out-of-distribution (OOD) backdoor attack ($\mathtt{OBA}$), which
uses OOD data as both poisoned samples and triggers simultaneously. Our
approach significantly broadens the scope of backdoor attack scenarios in FL.
To improve the stealthiness of $\mathtt{OBA}$, we propose $\mathtt{SoDa}$,
which regularizes both the magnitude and direction of malicious local models
during local training, aligning them closely with their benign versions to
evade detection. Empirical results demonstrate that $\mathtt{OBA}$ effectively
circumvents state-of-the-art defenses while maintaining high accuracy on the
main task.
To address this security vulnerability in the FL system, we introduce
$\mathtt{BNGuard}$, a new server-side defense method tailored against
$\mathtt{SoDa}$. $\mathtt{BNGuard}$ leverages the observation that OOD data
causes significant deviations in the running statistics of batch normalization
layers. This allows $\mathtt{BNGuard}$ to identify malicious model updates and
exclude them from aggregation, thereby enhancing the backdoor robustness of FL.
Extensive experiments across various settings show the effectiveness of
$\mathtt{BNGuard}$ in defending against $\mathtt{SoDa}$. The code is available
at https://github.com/JiiahaoXU/SoDa-BNGuard.
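The observation behind the defense (OOD data shifts batch-normalization running statistics) can be illustrated with a small sketch. It assumes each client reports a vector of BN running means and flags clients that lie far from the coordinate-wise median. The distance measure and the threshold rule are assumptions made for illustration, not the authors' procedure, and the function names are hypothetical.

```python
import statistics


def bn_distance_scores(client_bn_means):
    """Distance of each client's BN running means from the coordinate-wise median."""
    dims = range(len(client_bn_means[0]))
    median = [statistics.median(c[d] for c in client_bn_means) for d in dims]
    return [
        sum((c[d] - median[d]) ** 2 for d in dims) ** 0.5
        for c in client_bn_means
    ]


def select_benign(client_bn_means, factor=3.0):
    """Keep clients whose score is at most `factor` times the median score.

    The threshold rule is an illustrative assumption.
    """
    scores = bn_distance_scores(client_bn_means)
    cutoff = factor * statistics.median(scores)
    return [i for i, s in enumerate(scores) if s <= cutoff]


# Toy running means: four similar clients and one strongly shifted client.
clients = [
    [0.10, -0.20, 0.05],
    [0.12, -0.18, 0.04],
    [0.09, -0.21, 0.06],
    [0.11, -0.19, 0.05],
    [0.90, 0.70, -0.80],
]
kept = select_benign(clients)
```

Only the updates indexed by `kept` would then enter aggregation.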
[COMMENTS]
To appear at MobiHoc 2025
[LINK]
http://arxiv.org/abs/2509.13219v1
[DATE]
2025-09-17 00:23:39+08:00
[CATEGORIES]
cs.LG
FOSSIL: Regret-minimizing weighting for robust learning under imbalance and small data
[AUTHORS]
J. Cha, J. Lee, J. Cho, J. Shin
[ABSTRACT]
Imbalanced and small data regimes are pervasive in domains such as rare
disease imaging, genomics, and disaster response, where labeled samples are
scarce and naive augmentation often introduces artifacts. Existing solutions
such as oversampling, focal loss, or meta-weighting address isolated aspects of
this challenge but remain fragile or complex. We introduce FOSSIL (Flexible
Optimization via Sample Sensitive Importance Learning), a unified weighting
framework that seamlessly integrates class imbalance correction,
difficulty-aware curricula, augmentation penalties, and warmup dynamics into a
single interpretable formula. Unlike prior heuristics, the proposed framework
provides regret-based theoretical guarantees and achieves consistent empirical
gains over ERM, curriculum, and meta-weighting baselines on synthetic and
real-world datasets, while requiring no architectural changes.
[COMMENTS]
24 pages, 6 figures, submitted to ICLR 2026
[LINK]
http://arxiv.org/abs/2509.13218v1
[DATE]
2025-09-17 00:23:21+08:00
[CATEGORIES]
cs.LG
Flow-Based Fragment Identification via Binding Site-Specific Latent Representations
[AUTHORS]
Rebecca Manuela Neeser, Ilia Igashov, Arne Schneuing, Michael Bronstein, Philippe Schwaller, Bruno Correia
[ABSTRACT]
Fragment-based drug design is a promising strategy leveraging the binding of
small chemical moieties that can efficiently guide drug discovery. The initial
step of fragment identification remains challenging, as fragments often bind
weakly and non-specifically. We developed a protein-fragment encoder that
relies on a contrastive learning approach to map both molecular fragments and
protein surfaces in a shared latent space. The encoder captures
interaction-relevant features and enables virtual screening as well
as generative design with our new method LatentFrag. In LatentFrag, fragment
embeddings and positions are generated conditioned on the protein surface while
being chemically realistic by construction. Our expressive fragment and protein
representations allow location of protein-fragment interaction sites with high
sensitivity and we observe state-of-the-art fragment recovery rates when
sampling from the learned distribution of latent fragment embeddings. Our
generative method outperforms common methods such as virtual screening at a
fraction of its computational cost, providing a valuable starting point for
fragment hit discovery. We further show the practical utility of LatentFrag and
extend the workflow to full ligand design tasks. Together, these approaches
contribute to advancing fragment identification and provide valuable tools for
fragment-based drug discovery.
[LINK]
http://arxiv.org/abs/2509.13216v1
[DATE]
2025-09-17 00:20:45+08:00
[CATEGORIES]
cs.LG
Density-Aware Farthest Point Sampling
[AUTHORS]
Paolo Climaco, Jochen Garcke
[ABSTRACT]
We focus on training machine learning regression models in scenarios where
the availability of labeled training data is limited due to computational
constraints or high labeling costs. Thus, selecting suitable training sets from
unlabeled data is essential for balancing performance and efficiency. For the
selection of the training data, we focus on passive and model-agnostic sampling
methods that only consider the data feature representations. We derive an upper
bound for the expected prediction error of Lipschitz continuous regression
models that linearly depends on the weighted fill distance of the training set,
a quantity we can estimate simply by considering the data features. We
introduce “Density-Aware Farthest Point Sampling” (DA-FPS), a novel sampling
method. We prove that DA-FPS provides approximate minimizers for a data-driven
estimation of the weighted fill distance, thereby aiming at minimizing our
derived bound. We conduct experiments using two regression models across three
datasets. The results demonstrate that DA-FPS significantly reduces the mean
absolute prediction error compared to other sampling strategies.
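For orientation, here is a sketch of classic, unweighted farthest point sampling together with the (unweighted) fill distance it greedily reduces. The density-aware weighting that distinguishes DA-FPS is not specified in the abstract and is deliberately left out. Function names and the toy points are hypothetical.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly add the point farthest from the selected set."""
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return selected


def fill_distance(points, selected):
    """Largest distance from any point to its nearest selected point."""
    return max(min(math.dist(p, points[s]) for s in selected) for p in points)


points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
chosen = farthest_point_sampling(points, k=3)
```

Comparing `fill_distance(points, [0])` with `fill_distance(points, chosen)` shows how adding far-away points shrinks the largest uncovered gap, which is the quantity the derived error bound depends on in its weighted form.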
[COMMENTS]
12 pages, 2 figures
[LINK]
http://arxiv.org/abs/2509.13213v1
[DATE]
2025-09-17 00:19:14+08:00
[CATEGORIES]
cs.LG
HAM: Hierarchical Adapter Merging for Scalable Continual Learning
[AUTHORS]
Eric Nuertey Coleman, Luigi Quarantiello, Samrat Mukherjee, Julio Hurtado, Vincenzo Lomonaco
[ABSTRACT]
Continual learning is an essential capability of human cognition, yet it
poses significant challenges for current deep learning models. The primary
issue is that new knowledge can interfere with previously learned information,
causing the model to forget earlier knowledge in favor of the new, a phenomenon
known as catastrophic forgetting. Although large pre-trained models can
partially mitigate forgetting by leveraging their existing knowledge and
over-parameterization, they often struggle when confronted with novel data
distributions. Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA,
enable efficient adaptation to new knowledge. However, they still face
challenges in scaling to dynamic learning scenarios and long sequences of
tasks, as maintaining one adapter per task introduces complexity and increases
the potential for interference. In this paper, we introduce Hierarchical
Adapters Merging (HAM), a novel framework that dynamically combines adapters
from different tasks during training. This approach enables HAM to scale
effectively, allowing it to manage more tasks than competing baselines with
improved efficiency. To achieve this, HAM maintains a fixed set of groups that
hierarchically consolidate new adapters. For each task, HAM trains a low-rank
adapter along with an importance scalar, then dynamically groups tasks based on
adapter similarity. Within each group, adapters are pruned, scaled, and merged,
facilitating transfer learning between related tasks. Extensive experiments on
three vision benchmarks show that HAM significantly outperforms
state-of-the-art methods, particularly as the number of tasks increases.
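The prune, scale, and merge step within one group can be sketched as follows. This assumes each adapter is represented by a flattened weight update, uses magnitude pruning, and merges by importance-weighted summation. These choices, the `keep_ratio` parameter, and the function names are illustrative assumptions, and the similarity-based grouping is omitted.

```python
def prune(delta, keep_ratio=0.5):
    """Zero all but the largest-magnitude entries (magnitude pruning)."""
    k = max(1, int(len(delta) * keep_ratio))
    threshold = sorted((abs(x) for x in delta), reverse=True)[k - 1]
    return [x if abs(x) >= threshold else 0.0 for x in delta]


def merge_group(deltas, importances, keep_ratio=0.5):
    """Prune each adapter update, scale it by its importance, then sum."""
    merged = [0.0] * len(deltas[0])
    for delta, alpha in zip(deltas, importances):
        for i, x in enumerate(prune(delta, keep_ratio)):
            merged[i] += alpha * x
    return merged


# Two toy adapter updates that belong to the same group.
task_a = [0.8, -0.1, 0.05, 0.6]
task_b = [0.7, 0.02, -0.9, 0.1]
merged = merge_group([task_a, task_b], importances=[1.0, 0.5])
```

Pruning before merging limits how many coordinates two adapters can overwrite in each other, which is one plausible way to reduce interference between tasks.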
[LINK]
http://arxiv.org/abs/2509.13211v1
[DATE]
2025-09-17 00:18:19+08:00
[CATEGORIES]
cs.LG
Hybrid Two-Stage Reconstruction of Multiscale Subsurface Flow with Physics-informed Residual Connected Neural Operator
[AUTHORS]
Peiqi Li, Jie Chen
[ABSTRACT]
Novel neural networks show great potential for solving partial
differential equations. For single-phase flow problems in subsurface porous
media with high-contrast coefficients, the key is to develop neural operators
with accurate reconstruction capability and strict adherence to physical laws.
In this study, we propose a hybrid two-stage framework that uses multiscale
basis functions and physics-guided deep learning to solve the Darcy flow
problem in high-contrast fractured porous media. In the first stage, a
data-driven model is used to reconstruct the multiscale basis function based on
the permeability field to achieve effective dimensionality reduction while
preserving the necessary multiscale features. In the second stage, the
physics-informed neural network, together with a Transformer-based global
information extractor, is used to reconstruct the pressure field by integrating
the physical constraints derived from the Darcy equation, ensuring consistency
with the physical laws of the real world. The model was evaluated on datasets
with different combinations of permeability and basis functions and performed
well in terms of reconstruction accuracy. Specifically, the framework achieves
$R^2$ values above 0.9 in terms of basis function fitting and pressure
reconstruction, and the residual indicator is on the order of $1\times
10^{-4}$. These results validate the ability of the proposed framework to
achieve accurate reconstruction while maintaining physical consistency.
[LINK]
http://arxiv.org/abs/2501.13271v2
[DATE]
2025-09-17 00:13:44+08:00
[CATEGORIES]
cs.LG
B-TGAT: A Bi-directional Temporal Graph Attention Transformer for Clustering Multivariate Spatiotemporal Data
[AUTHORS]
Francis Ndikum Nji, Vandana Janaja, Jianwu Wang
[ABSTRACT]
Clustering high-dimensional multivariate spatiotemporal climate data is
challenging due to complex temporal dependencies, evolving spatial
interactions, and non-stationary dynamics. Conventional clustering methods,
including recurrent and convolutional models, often struggle to capture both
local and global temporal relationships while preserving spatial context. We
present a time-distributed hybrid U-Net autoencoder that integrates a
Bi-directional Temporal Graph Attention Transformer (B-TGAT) to guide efficient
temporal clustering of multidimensional spatiotemporal climate datasets. The
encoder and decoder are equipped with ConvLSTM2D modules that extract joint
spatial–temporal features by modeling localized dynamics and spatial
correlations over time, and skip connections that preserve multiscale spatial
details during feature compression and reconstruction. At the bottleneck,
B-TGAT integrates graph-based spatial modeling with attention-driven temporal
encoding, enabling adaptive weighting of temporal neighbors and capturing both
short and long-range dependencies across regions. This architecture produces
discriminative latent embeddings optimized for clustering. Experiments on three
distinct spatiotemporal climate datasets demonstrate superior cluster
separability, temporal stability, and alignment with known climate transitions
compared to state-of-the-art baselines. The integration of ConvLSTM2D, U-Net
skip connections, and B-TGAT enhances temporal clustering performance while
providing interpretable insights into complex spatiotemporal variability,
advancing both methodological development and climate science applications.
[COMMENTS]
10 pages, In review
[LINK]
http://arxiv.org/abs/2509.13202v1
[DATE]
2025-09-17 00:08:21+08:00
[CATEGORIES]
cs.LG
Game-RL: Synthesizing Verifiable Game Tasks at Scale to Boost VLMs General Reasoning
[AUTHORS]
Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
[COMMENTS]
63 pages, 23 figures, submitted to NeurIPS 2025
[LINK]
http://arxiv.org/abs/2505.13886v4
[DATE]
2025-09-16 23:33:16+08:00
[CATEGORIES]
cs.CL
LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals
[AUTHORS]
Jinxin Li, Gang Tu, ShengYu Cheng, Junjie Hu, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan
[ABSTRACT]
Hallucination remains a critical barrier for deploying large language models
(LLMs) in reliability-sensitive applications. Existing detection methods
largely fall into two categories: factuality checking, which is fundamentally
constrained by external knowledge coverage, and static hidden-state analysis,
which fails to capture deviations in reasoning dynamics. As a result, their
effectiveness and robustness remain limited. We propose HSAD (Hidden Signal
Analysis-based Detection), a novel hallucination detection framework that
models the temporal dynamics of hidden representations during autoregressive
generation. HSAD constructs hidden-layer signals by sampling activations across
layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain
representations, and extracts the strongest non-DC frequency component as
spectral features. Furthermore, by leveraging the autoregressive nature of
LLMs, HSAD identifies optimal observation points for effective and reliable
detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves an
improvement of over 10 percentage points over prior state-of-the-art methods. By
integrating reasoning-process modeling with frequency-domain analysis, HSAD
establishes a new paradigm for robust hallucination detection in LLMs.
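The spectral feature described above, the strongest non-DC frequency component of a hidden-layer signal, can be sketched in a few lines. To stay within the standard library, a naive discrete Fourier transform is swapped in for the FFT. In practice an FFT routine would compute the same magnitudes faster. The toy signal and the function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the one-sided DFT (naive O(n^2) stand-in for an FFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def strongest_non_dc(signal):
    """Return (frequency index, magnitude) of the largest non-DC component."""
    mags = dft_magnitudes(signal)
    # Index 0 is the DC term (the signal's sum), so the search starts at 1.
    k = max(range(1, len(mags)), key=mags.__getitem__)
    return k, mags[k]


# Toy "hidden-layer signal": a constant offset plus 3 cycles over 32 samples.
signal = [2.0 + math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
freq, magnitude = strongest_non_dc(signal)
```

Skipping index 0 matters here: the constant offset dominates the spectrum, yet it says nothing about how the signal varies.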
[LINK]
http://arxiv.org/abs/2509.13154v1
[DATE]
2025-09-16 23:08:19+08:00
[CATEGORIES]
cs.CL
The Belief State Transformer
[AUTHORS]
Edward S. Hu, Kwangjun Ahn, Qinghua Liu, Haoran Xu, Manan Tomar, Ada Langford, Jayden Teoh, Bryon Xu, David Yan, Dinesh Jayaraman, Alex Lamb, John Langford
[ABSTRACT]
We introduce the “Belief State Transformer”, a next-token predictor that
takes both a prefix and suffix as inputs, with a novel objective of predicting
both the next token for the prefix and the previous token for the suffix. The
Belief State Transformer effectively learns to solve challenging problems that
conventional forward-only transformers struggle with, in a domain-independent
fashion. Key to this success is learning a compact belief state that captures
all relevant information necessary for accurate predictions. Empirical
ablations show that each component of the model is essential in difficult
scenarios where standard Transformers fall short. For the task of story writing
with known prefixes and suffixes, our approach outperforms the
Fill-in-the-Middle method for reaching known goals and demonstrates improved
performance even when the goals are unknown. Altogether, the Belief State
Transformer enables more efficient goal-conditioned decoding, better test-time
inference, and high-quality text representations on small scale problems.
Website: https://edwhu.github.io/bst-website
[COMMENTS]
Updated report with new improvements and authors
[LINK]
http://arxiv.org/abs/2410.23506v3
[DATE]
2025-09-16 22:40:13+08:00
[CATEGORIES]
cs.LG
cs.CL
Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning
[AUTHORS]
Sijia Cui, Shuai Xu, Aiyao He, Yanna Wang, Bo Xu
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) have led to the
development of LLM-based AI agents. A key challenge is the creation of agents
that can effectively ground themselves in complex, adversarial long-horizon
environments. Existing methods mainly focus on (1) using LLMs as policies to
interact with the environment through generating low-level feasible actions,
and (2) utilizing LLMs to generate high-level tasks or language guides to
stimulate action generation. However, the former struggles to generate reliable
actions, while the latter relies heavily on expert experience to translate
high-level tasks into specific action sequences. To address these challenges,
we introduce the Plan with Language, Act with Parameter (PLAP) planning
framework that facilitates the grounding of LLM-based agents in long-horizon
environments. The PLAP method comprises three key components: (1) a skill
library containing environment-specific parameterized skills, (2) a skill
planner powered by LLMs, and (3) a skill executor converting the parameterized
skills into executable action sequences. We implement PLAP in MicroRTS, a
long-horizon real-time strategy game that provides an unfamiliar and
challenging environment for LLMs. The experimental results demonstrate the
effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting
outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully
crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI.
Additionally, we design comprehensive evaluation metrics and test 6
closed-source and 2 open-source LLMs within the PLAP framework, ultimately
releasing an LLM leaderboard ranking long-horizon skill planning ability. Our
code is available at https://github.com/AI-Research-TeamX/PLAP.
[COMMENTS]
Accepted to IJCNN 2025
[LINK]
http://arxiv.org/abs/2509.13127v1
[DATE]
2025-09-16 22:36:30+08:00
[CATEGORIES]
cs.CL
Can Code-Switched Texts Activate a Knowledge Switch in LLMs? A Case Study on English-Korean Code-Switching
[AUTHORS]
Seoyeon Kim, Huiseo Kim, Chanjun Park, Jinyoung Yeo, Dongha Lee
[ABSTRACT]
Recent large language models (LLMs) demonstrate multilingual abilities, yet
they are English-centric due to the dominance of English in training corpora.
Limited resources for low-resource languages remain a crucial challenge.
Code-switching (CS), a phenomenon where multilingual speakers alternate between
languages in a discourse, can convey subtle cultural and linguistic nuances
that would otherwise be lost in translation, and it elicits language-specific
knowledge in human communication. In light of this, we investigate whether
code-switching can activate knowledge, that is, identify and leverage it for
reasoning, when LLMs solve low-resource language tasks. To facilitate the
research, we first present EnKoQA, a synthetic English-Korean CS
question-answering dataset. We provide a comprehensive analysis of a variety of
multilingual LLMs by subdividing the activation process into knowledge
identification and knowledge leveraging. Our results demonstrate that, compared
to English text, CS can faithfully activate knowledge inside LLMs, especially
in language-specific domains, suggesting the potential of code-switching for
low-resource language tasks.
[COMMENTS]
Accepted to EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2410.18436v3
[DATE]
2025-09-16 22:24:05+08:00
[CATEGORIES]
cs.CL
Counterfactual Simulatability of LLM Explanations for Generation Tasks
[AUTHORS]
Marvin Limpijankit, Yanda Chen, Melanie Subbiah, Nicholas Deas, Kathleen McKeown
[ABSTRACT]
LLMs can be unpredictable, as even slight alterations to the prompt can cause
the output to change in unexpected ways. Thus, the ability of models to
accurately explain their behavior is critical, especially in high-stakes
settings. One approach for evaluating explanations is counterfactual
simulatability, how well an explanation allows users to infer the model’s
output on related counterfactuals. Counterfactual simulatability has been
previously studied for yes/no question answering tasks. We provide a general
framework for extending this method to generation tasks, using news
summarization and medical suggestion as example use cases. We find that while
LLM explanations do enable users to better predict LLM outputs on
counterfactuals in the summarization setting, there is significant room for
improvement for medical suggestion. Furthermore, our results suggest that the
evaluation for counterfactual simulatability may be more appropriate for
skill-based tasks as opposed to knowledge-based tasks.
[LINK]
http://arxiv.org/abs/2505.21740v2
[DATE]
2025-09-16 21:55:55+08:00
[CATEGORIES]
cs.CL
Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
[AUTHORS]
Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, Roberto Marras
[ABSTRACT]
While Large Language Models (LLMs) excel at generating human-like text,
aligning their outputs with complex, qualitative goals like pedagogical
soundness remains a significant challenge. Standard reinforcement learning
techniques often rely on slow and expensive LLM-as-a-judge evaluations or on
brittle, keyword-based metrics like ROUGE, which fail to capture the semantic
essence of a high-quality explanation. In this work, we introduce a novel
approach to reward shaping within the Group Relative Policy Optimisation (GRPO)
framework. Our central contribution is the use of a small, efficient
encoder-only transformer as a semantic reward model. This model provides a
dense, semantically rich reward signal based on the cosine similarity between a
generated explanation and a ground-truth reference, guiding the policy towards
explanations that are not just factually correct but also structurally and
conceptually aligned with expert reasoning. We apply this method to the task of
training a model for the Italian medical-school entrance examinations,
following standard domain-adaptive continued pre-training (CPT) and supervised
fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic
reward significantly improves explanation faithfulness and clarity over a
strong SFT baseline, showcasing the power of using lightweight encoder models
for nuanced reward shaping in complex generation tasks.
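A minimal sketch of the reward path described in this abstract: cosine similarity between a generated explanation's embedding and a reference embedding serves as the reward, followed by GRPO-style standardization within the group of samples. The embeddings below are hypothetical placeholders for encoder outputs, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: standardize rewards within one group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical embeddings; an encoder-only model would produce these in practice.
reference = [1.0, 0.0, 1.0]
candidates = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
rewards = [cosine_similarity(c, reference) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Unlike a keyword-overlap metric, the reward here varies smoothly with embedding similarity, so paraphrases of the reference explanation can still score highly.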
[LINK]
http://arxiv.org/abs/2509.13081v1
[DATE]
2025-09-16 21:39:29+08:00
[CATEGORIES]
cs.CL
When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning
[AUTHORS]
Mengyi Deng, Xin Li, Tingyu Zhu, Zhicheng Yang, Zhijiang Guo, Wei Wang
[ABSTRACT]
Existing work has shown that o1-level performance can be achieved with
limited data distillation, but most existing methods focus on unidirectional
supervised fine-tuning (SFT), overlooking the intricate interplay between
diverse reasoning patterns. In this paper, we construct r1k, a high-quality
reverse reasoning dataset derived by inverting 1,000 forward examples from s1k,
and examine how SFT and Direct Preference Optimization (DPO) affect alignment
under bidirectional reasoning objectives. SFT on r1k yields a 1.6%–6.8%
accuracy improvement over s1k across evaluated benchmarks. However, naively
mixing forward and reverse data during SFT weakens the directional distinction.
Although DPO can partially recover this distinction, it also suppresses less
preferred reasoning paths by shifting the probability mass toward irrelevant
outputs. These findings suggest that mixed reasoning data introduce conflicting
supervision signals, underscoring the need for robust and direction-aware
alignment strategies.
[LINK]
http://arxiv.org/abs/2509.13079v1
[DATE]
2025-09-16 21:36:36+08:00
[CATEGORIES]
cs.LG
cs.CL
TAPS: Tool-Augmented Personalisation via Structured Tagging
[AUTHORS]
Ekaterina Taktasheva, Jeff Dalton
[ABSTRACT]
Recent advancements in tool-augmented large language models have enabled them
to interact with external tools, enhancing their ability to perform complex
user tasks. However, existing approaches overlook the role of personalisation
in guiding tool use. This work investigates how user preferences can be
effectively integrated into goal-oriented dialogue agents. Through extensive
analysis, we identify key weaknesses in the ability of LLMs to personalise tool
use. To this end, we introduce TAPS, a novel solution that enhances
personalised tool use by leveraging a structured tagging tool and an
uncertainty-based tool detector. TAPS significantly improves the ability of
LLMs to incorporate user preferences, achieving the new state-of-the-art for
open source models on the NLSI task.
[COMMENTS]
Accepted to EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2506.20409v3
[DATE]
2025-09-16 21:36:33+08:00
[CATEGORIES]
cs.CL
EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models
[AUTHORS]
Tao Zou, Xinghua Zhang, Haiyang Yu, Minzheng Wang, Fei Huang, Yongbin Li
[COMMENTS]
Accepted by EMNLP 2025
[LINK]
http://arxiv.org/abs/2506.08375v2
[DATE]
2025-09-16 21:19:41+08:00
[CATEGORIES]
cs.CL
Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
[AUTHORS]
Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
[ABSTRACT]
Hateful memes have become a significant concern on the Internet,
necessitating robust automated detection systems. While Large Multimodal Models
(LMMs) have shown promise in hateful meme detection, they face notable
challenges like sub-optimal performance and limited out-of-domain
generalization capabilities. Recent studies further reveal the limitations of
both supervised fine-tuning (SFT) and in-context learning when applied to LMMs
in this setting. To address these issues, we propose a robust adaptation
framework for hateful meme detection that enhances in-domain accuracy and
cross-domain generalization while preserving the general vision-language
capabilities of LMMs. Analysis reveals that our approach achieves improved
robustness under adversarial attacks compared to SFT models. Experiments on six
meme classification datasets show that our approach achieves state-of-the-art
performance, outperforming larger agentic systems. Moreover, our method
generates higher-quality rationales for explaining hateful content compared to
standard SFT, enhancing model interpretability. Code available at
https://github.com/JingbiaoMei/RGCL
[COMMENTS]
EMNLP 2025 Main (Oral)
[LINK]
http://arxiv.org/abs/2502.13061v4
[DATE]
2025-09-16 21:10:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Break the Checkbox: Challenging Closed-Style Evaluations of Cultural Alignment in LLMs
[AUTHORS]
Mohsinul Kabir, Ajwad Abrar, Sophia Ananiadou
[COMMENTS]
Accepted at EMNLP 2025 (Main)
[LINK]
http://arxiv.org/abs/2502.08045v3
[DATE]
2025-09-16 21:10:21+08:00
[CATEGORIES]
cs.CL
From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
[AUTHORS]
Viktor Hangya, Fabian Küch, Darina Gold
[ABSTRACT]
Iterative evaluation of LLMs during training is essential to ensure expected
capability development, but can be time- and compute-intensive. While NLU
tasks, where the model selects from fixed answer choices, are cheap to
evaluate, essential capabilities like reasoning and code generation rely on the
more time-consuming NLG (token-by-token generation) format. In this work, our
aim is to decrease the computational burden of NLG benchmarks in order to
enable monitoring crucial LLM capabilities during model training. We
reformulate generative tasks into computationally cheaper NLU alternatives. We
test the performance correlation between the original and reformulated tasks
using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code
generation, factual knowledge and reading comprehension. Our results show a
strong correlation between task formats, supporting capability assessment via
cheaper alternatives and achieving over 35x average reduction in evaluation
time. Our project is available at:
https://github.com/Fraunhofer-IIS/EvalShortcut
[COMMENTS]
Accepted to EMNLP 2025 (Main Conference)
[LINK]
http://arxiv.org/abs/2506.03592v2
[DATE]
2025-09-16 21:10:04+08:00
[CATEGORIES]
cs.CL
Multi-Model Synthetic Training for Mission-Critical Small Language Models
[AUTHORS]
Nolan Platt, Pragyansmita Nayak
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable capabilities across
many domains, yet their application to specialized fields remains constrained
by the scarcity and complexity of domain-specific training data. We present a
novel approach that achieves a 261x cost reduction for maritime intelligence by
using LLMs as one-time teachers rather than using them directly for inference.
Our method transforms 3.2 billion Automatic Identification System (AIS) vessel
tracking records into 21,543 synthetic question and answer pairs through
multi-model generation (GPT-4o and o3-mini), preventing overfitting and
ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves
75% accuracy on maritime tasks, while being substantially cheaper than using a
larger model for inference. We show that smaller, cheaper models, when
fine-tuned properly, can provide accuracy similar to that of larger models that
are prohibitively expensive. Our work contributes to the growing field of
synthetic dataset generation for specialized AI applications and presents a
highly reproducible framework for domains where manual annotation is
infeasible. Beyond expanding research in the growing field of specialized
small language models, our approach has immediate applications in maritime
safety, security operations, and vessel traffic management systems in various
industries.
[COMMENTS]
8 pages. Accepted as a full paper to the 3rd International Conference
on Foundation and Large Language Models (IEEE FLLM) 2025
[LINK]
http://arxiv.org/abs/2509.13047v1
[DATE]
2025-09-16 21:04:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Reading Between the Prompts: How Stereotypes Shape LLM’s Implicit Personalization
[AUTHORS]
Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
[ABSTRACT]
Generative Large Language Models (LLMs) infer users’ demographic information
from subtle cues in the conversation – a phenomenon called implicit
personalization. Prior work has shown that such inferences can lead to lower
quality responses for users assumed to be from minority groups, even when no
demographic information is explicitly provided. In this work, we systematically
explore how LLMs respond to stereotypical cues using controlled synthetic
conversations, by analyzing the models’ latent user representations through
both model internals and generated answers to targeted user questions. Our
findings reveal that LLMs do infer demographic attributes based on these
stereotypical signals, which for a number of groups even persists when the user
explicitly identifies with a different demographic group. Finally, we show that
this form of stereotype-driven implicit personalization can be effectively
mitigated by intervening on the model’s internal representations using a
trained linear probe to steer them toward the explicitly stated identity. Our
results highlight the need for greater transparency and control in how LLMs
represent user identity.
[COMMENTS]
Accepted at EMNLP Main 2025
[LINK]
http://arxiv.org/abs/2505.16467v2
[DATE]
2025-09-16 20:56:24+08:00
[CATEGORIES]
cs.CL
HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking
[AUTHORS]
Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun
[ABSTRACT]
Retrieval-Augmented Generation (RAG) enhances the response capabilities of
language models by integrating external knowledge sources. However, document
chunking, an important part of RAG systems, often lacks effective evaluation
tools. This paper first analyzes why existing RAG evaluation benchmarks are
inadequate for assessing document chunking quality, specifically due to
evidence sparsity. Based on this conclusion, we propose HiCBench, which
includes manually annotated multi-level document chunking points, synthesized
evidence-dense question-answer (QA) pairs, and their corresponding evidence
sources. Additionally, we introduce the HiChunk framework, a multi-level
document structuring framework based on fine-tuned LLMs, combined with the
Auto-Merge retrieval algorithm to improve retrieval quality. Experiments
demonstrate that HiCBench effectively evaluates the impact of different
chunking methods across the entire RAG pipeline. Moreover, HiChunk achieves
better chunking quality within reasonable time consumption, thereby enhancing
the overall performance of RAG systems.
[COMMENTS]
17 pages, 5 figures, 6 tables
[LINK]
http://arxiv.org/abs/2509.11552v2
[DATE]
2025-09-16 20:36:35+08:00
[CATEGORIES]
cs.CL
ToM-SSI: Evaluating Theory of Mind in Situated Social Interactions
[AUTHORS]
Matteo Bortoletto, Constantin Ruhdorfer, Andreas Bulling
[COMMENTS]
EMNLP 2025 (Main)
[LINK]
http://arxiv.org/abs/2509.05066v2
[DATE]
2025-09-16 20:22:34+08:00
[CATEGORIES]
cs.CL
TokenSkip: Controllable Chain-of-Thought Compression in LLMs
[AUTHORS]
Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, Wenjie Li
[ABSTRACT]
Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning
capabilities of large language models (LLMs). Recent advancements, such as
OpenAI’s o1 and DeepSeek-R1, suggest that scaling up the length of CoT
sequences during inference could further boost LLM reasoning performance.
However, due to the autoregressive nature of LLM decoding, longer CoT outputs
lead to a linear increase in inference latency, adversely affecting user
experience, particularly when the CoT exceeds 10,000 tokens. To address this
limitation, we analyze the semantic importance of tokens within CoT outputs and
reveal that their contributions to reasoning vary. Building on this insight, we
propose TokenSkip, a simple yet effective approach that enables LLMs to
selectively skip less important tokens, allowing for controllable CoT
compression. Extensive experiments across various models and tasks demonstrate
the effectiveness of TokenSkip in reducing CoT token usage while preserving
strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct,
TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less
than a 0.4% performance drop. We release our code and checkpoints at
https://github.com/hemingkx/TokenSkip.
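As a rough illustration of the token-skipping idea described above (not the
authors' implementation), the sketch below keeps the highest-scoring fraction
of CoT tokens in their original order. The function name `compress_cot`, the
example importance scores, and the `ratio` parameter are hypothetical; the
abstract does not say how token importance is computed.

```python
import math


def compress_cot(tokens, scores, ratio):
    """Keep roughly `ratio` of the tokens, preferring high scores, preserving order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    # Number of tokens to retain (at least one).
    keep = max(1, math.ceil(ratio * len(tokens)))
    # Rank positions by score (descending); ties go to the earlier position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore the original left-to-right order of the retained positions.
    kept_positions = sorted(ranked[:keep])
    return [tokens[i] for i in kept_positions]


# Hypothetical tokens and importance scores, for illustration only.
tokens = ["So", "first", "we", "add", "3", "and", "4", "to", "get", "7"]
scores = [0.1, 0.2, 0.1, 0.8, 0.9, 0.3, 0.9, 0.2, 0.6, 1.0]
compressed = compress_cot(tokens, scores, ratio=0.5)
print(compressed)
```

Here `ratio` plays the role of a controllable compression target: lowering it
drops more low-scoring tokens, while `ratio=1.0` leaves the sequence unchanged.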
[COMMENTS]
EMNLP 2025 (Long Paper), camera-ready version
[LINK]
http://arxiv.org/abs/2502.12067v3
[DATE]
2025-09-16 20:21:22+08:00
[CATEGORIES]
cs.CL
SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data
[AUTHORS]
Jian Gao, Fufangchen Zhao, Yiyang Zhang, Danfeng Yan
[ABSTRACT]
Poor sitting posture is a critical yet often overlooked factor contributing
to long-term musculoskeletal disorders and physiological dysfunctions. Existing
sitting posture monitoring systems, although leveraging visual, IMU, or
pressure-based modalities, often suffer from coarse-grained recognition and
lack the semantic expressiveness necessary for personalized feedback. In this
paper, we propose SitLLM, a lightweight multimodal framework that
integrates flexible pressure sensing with large language models (LLMs) to
enable fine-grained posture understanding and personalized health-oriented
response generation. SitLLM comprises three key components: (1) a
Gaussian-Robust Sensor Embedding Module that partitions pressure maps
into spatial patches and injects local noise perturbations for robust feature
extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that
reprograms sensor embeddings into the LLM’s semantic space via multi-head
cross-attention using the pre-trained vocabulary embeddings; and (3) a
Multi-Context Prompt Module that fuses feature-level, structure-level,
statistical-level, and semantic-level contextual information to guide
instruction comprehension.
[LINK]
http://arxiv.org/abs/2509.12994v1
[DATE]
2025-09-16 20:06:05+08:00
[CATEGORIES]
cs.CL
How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild
[AUTHORS]
Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
[ABSTRACT]
In the age of misinformation, hallucination – the tendency of Large Language
Models (LLMs) to generate non-factual or unfaithful responses – represents the
main risk for their global utility. Despite LLMs becoming increasingly
multilingual, the vast majority of research on detecting and quantifying LLM
hallucination is (a) English-centric and (b) focused on machine translation (MT)
and summarization, tasks that are less common “in the wild” than open
information seeking. In contrast, we aim to quantify the extent of LLM
hallucination across languages in knowledge-intensive long-form question
answering. To this end, we train a multilingual hallucination detection model
and conduct a large-scale study across 30 languages and 6 open-source LLM
families. We start from an English hallucination detection dataset and rely on
MT to generate (noisy) training data in other languages. We also manually
annotate gold data for five high-resource languages; we then demonstrate, for
these languages, that the estimates of hallucination rates are similar between
silver (LLM-generated) and gold test sets, validating the use of silver data
for estimating hallucination rates for other languages. For the final rates
estimation, we build a knowledge-intensive QA dataset for 30 languages with
LLM-generated prompts and Wikipedia articles as references. We find that, while
LLMs generate longer responses with more hallucinated tokens for
higher-resource languages, there is no correlation between length-normalized
hallucination rates of languages and their digital representation. Further, we
find that smaller LLMs exhibit larger hallucination rates than larger models.
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2502.12769v3
[DATE]
2025-09-16 19:12:12+08:00
[CATEGORIES]
cs.CL
Do LLMs Understand Wine Descriptors Across Cultures? A Benchmark for Cultural Adaptations of Wine Reviews
[AUTHORS]
Chenye Zou, Xingyue Wen, Tianyi Hu, Qian Janice Wang, Daniel Hershcovich
[ABSTRACT]
Recent advances in large language models (LLMs) have opened the door to
culture-aware language tasks. We introduce the novel problem of adapting wine
reviews across Chinese and English, which goes beyond literal translation by
incorporating regional taste preferences and culture-specific flavor
descriptors. In a case study on cross-cultural wine review adaptation, we
compile the first parallel corpus of professional reviews, containing 8k
Chinese and 16k Anglophone reviews. We benchmark both
neural-machine-translation baselines and state-of-the-art LLMs with automatic
metrics and human evaluation. For the latter, we propose three culture-oriented
criteria – Cultural Proximity, Cultural Neutrality, and Cultural Genuineness
– to assess how naturally a translated review resonates with target-culture
readers. Our analysis shows that current models struggle to capture cultural
nuances, especially in translating wine descriptions across different cultures.
This highlights the challenges and limitations of translation models in
handling cultural content.
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2509.12961v1
[DATE]
2025-09-16 19:10:30+08:00
[CATEGORIES]
cs.CL
Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models
[AUTHORS]
Yuval Weiss, David Demitri Africa, Paula Buttery, Richard Diehl Martinez
[ABSTRACT]
Parameter-efficient methods such as LoRA have revolutionised the fine-tuning
of LLMs. Still, their extension to pretraining via ReLoRA is less well
understood, especially for small language models (SLMs), which offer lower
computational and environmental costs. This work is the first systematic study
of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and
learning dynamics. Through ablation experiments, we find that ReLoRA generally
performs worse than standard training on loss, Paloma perplexity and BLiMP,
with the gap widening for the larger models. Further analysis of the learning
dynamics of the models indicates that ReLoRA reinforces the rank deficiencies
found in smaller models. These results indicate that low-rank update strategies
may not transfer easily to SLM pretraining, highlighting the need for more
research in the low-compute regime.
[COMMENTS]
12 Pages, 6 Tables, 8 Figures
[LINK]
http://arxiv.org/abs/2509.12960v1
[DATE]
2025-09-16 19:06:58+08:00
[CATEGORIES]
cs.CL
Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework
[AUTHORS]
Heng Zhang, Chengzhi Zhang
[ABSTRACT]
The automated generation of research workflows is essential for improving the
reproducibility of research and accelerating the paradigm of “AI for Science”.
However, existing methods typically extract merely fragmented procedural
components and thus fail to capture complete research workflows. To address
this gap, we propose an end-to-end framework that generates comprehensive,
structured research workflows by mining full-text academic papers. As a case
study in the Natural Language Processing (NLP) domain, our paragraph-centric
approach first employs Positive-Unlabeled (PU) Learning with SciBERT to
identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772.
Subsequently, we utilize Flan-T5 with prompt learning to generate workflow
phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of
0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically
categorized into data preparation, data processing, and data analysis stages
using ChatGPT with few-shot learning, achieving a classification precision of
0.958. By mapping categorized phrases to their locations in the
documents, we finally generate readable visual flowcharts of the entire
research workflows. This approach facilitates the analysis of workflows derived
from an NLP corpus and reveals key methodological shifts over the past two
decades, including the increasing emphasis on data analysis and the transition
from feature engineering to ablation studies. Our work offers a validated
technical framework for automated workflow generation, along with a novel,
process-oriented perspective for the empirical investigation of evolving
scientific paradigms. Source code and data are available at:
https://github.com/ZH-heng/research_workflow.
[LINK]
http://arxiv.org/abs/2509.12955v1
[DATE]
2025-09-16 18:59:23+08:00
[CATEGORIES]
cs.CL
Jailbreaking Large Language Models Through Content Concretization
[AUTHORS]
Johan Wahréus, Ahmed Hussain, Panos Papadimitratos
[ABSTRACT]
Large Language Models (LLMs) are increasingly deployed for task automation
and content generation, yet their safety mechanisms remain vulnerable to
circumvention through different jailbreaking techniques. In this paper, we
introduce Content Concretization (CC), a novel jailbreaking technique
that iteratively transforms abstract malicious requests into concrete,
executable implementations. CC is a two-stage process: first, generating
initial LLM responses using lower-tier models with less constrained safety filters,
then refining them through higher-tier models that process both the preliminary
output and original prompt. We evaluate our technique using 350
cybersecurity-specific prompts, demonstrating substantial improvements in
jailbreak Success Rates (SRs), increasing from 7% (no refinements) to 62%
after three refinement iterations, while maintaining a cost of 7.5 cents per
prompt. Comparative A/B testing across nine different LLM evaluators confirms
that outputs from additional refinement steps are consistently rated as more
malicious and technically superior. Moreover, manual code analysis reveals that
generated outputs execute with minimal modification, although optimal
deployment typically requires target-specific fine-tuning. With eventual
improved harmful code generation, these results highlight critical
vulnerabilities in current LLM safety frameworks.
[COMMENTS]
Accepted for presentation in the Conference on Game Theory and AI for
Security (GameSec) 2025
[LINK]
http://arxiv.org/abs/2509.12937v1
[DATE]
2025-09-16 18:34:26+08:00
[CATEGORIES]
cs.CL
UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
[AUTHORS]
Shuhei Kato
[ABSTRACT]
We propose UtterTune, a lightweight adaptation method that fine-tunes a
multilingual text-to-speech (TTS) system based on a large language model (LLM)
architecture, designed to enhance the controllability of pronunciation in a
target language while preserving performance in others. While LLM architectures
have enabled TTS models to achieve remarkable naturalness, accurately modeling
grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially
when the model omits an explicit G2P module and directly processes minimally
encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank
adaptation to enable the control of segmental pronunciation and pitch accent at
the phoneme level for Japanese speech, the target language in this paper, while
maintaining naturalness and speaker similarity in a zero-shot setting.
Objective and subjective evaluations confirm its effectiveness.
[COMMENTS]
5 pages
[LINK]
http://arxiv.org/abs/2508.09767v2
[DATE]
2025-09-16 18:28:11+08:00
[CATEGORIES]
cs.CL
All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning
[AUTHORS]
Caiqi Zhang, Chang Shu, Ehsan Shareghi, Nigel Collier
[COMMENTS]
EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2509.12908v1
[DATE]
2025-09-16 18:02:52+08:00
[CATEGORIES]
cs.CL
Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings
[AUTHORS]
Shiyu Li, Yang Tang, Ruijie Liu, Shi-Zhe Chen, Xi Chen
[ABSTRACT]
Large language models (LLMs) have recently demonstrated excellent performance
in text embedding tasks. Previous work usually uses LoRA to fine-tune existing
LLMs, which are limited by the data and training gap between LLMs and embedding
models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM
trained from scratch and fine-tuned as a text embedder. First, we add news data
and multilingual pairs for LLM pretraining to bridge the data gap. Based on
this, we propose a cross-lingual retrieval dataset that enables the LLM to
better integrate embeddings across different languages. Second, whereas LLMs
use a causal mask with token-level loss, embedding models use a bidirectional
mask with sentence-level loss. This training gap makes full fine-tuning less
effective than LoRA. We introduce a soft-masking mechanism to gradually
transition between these two types of masks, enabling the model to learn more
comprehensive representations. Based on this, we propose a dynamic hard
negative mining method that exposes the model to more difficult negative
examples throughout the training process. Being intuitive and effective, with
only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA
performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese
MTEB (May 19, 2025).
[COMMENTS]
EMNLP 2025 Oral
[LINK]
http://arxiv.org/abs/2509.12892v1
[DATE]
2025-09-16 17:48:11+08:00
[CATEGORIES]
cs.CL
The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations
[AUTHORS]
Yubo Zhu, Dongrui Liu, Zecheng Lin, Wei Tong, Sheng Zhong, Jing Shao
[ABSTRACT]
Estimating the difficulty of input questions as perceived by large language
models (LLMs) is essential for accurate performance evaluation and adaptive
inference. Existing methods typically rely on repeated response sampling,
auxiliary models, or fine-tuning the target model itself, which may incur
substantial computational costs or compromise generality. In this paper, we
propose a novel approach for difficulty estimation that leverages only the
hidden representations produced by the target LLM. We model the token-level
generation process as a Markov chain and define a value function to estimate
the expected output quality given any hidden state. This allows for efficient
and accurate difficulty estimation based solely on the initial hidden state,
without generating any output tokens. Extensive experiments across both textual
and multimodal tasks demonstrate that our method consistently outperforms
existing baselines in difficulty estimation. Moreover, we apply our difficulty
estimates to guide adaptive reasoning strategies, including Self-Consistency,
Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer
generated tokens.
[LINK]
http://arxiv.org/abs/2509.12886v1
[DATE]
2025-09-16 17:38:41+08:00
[CATEGORIES]
cs.CL
Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation
[AUTHORS]
Bohao Yang, Kun Zhao, Dong Liu, Chen Tang, Liang Zhan, Chenghua Lin
[ABSTRACT]
Automatic open-domain dialogue evaluation has attracted increasing attention,
yet remains challenging due to the complexity of assessing response
appropriateness. Traditional evaluation metrics, typically trained with true
positive and randomly selected negative responses, tend to assign higher scores
to responses that share greater content similarity with contexts. However,
adversarial negative responses, despite possessing high lexical overlap with
contexts, can be semantically incongruous. Consequently, existing metrics
struggle to effectively evaluate such responses, resulting in low correlations
with human judgments. While recent studies have demonstrated the effectiveness
of Large Language Models (LLMs) for open-domain dialogue evaluation, they still
face challenges in handling adversarial negative examples. We propose a novel
evaluation framework that integrates Abstract Meaning Representation (AMR)
enhanced domain-specific language models (SLMs) with LLMs. Our SLMs explicitly
incorporate AMR graph information through a gating mechanism for enhanced
semantic representation learning, while both SLM predictions and AMR knowledge
are integrated into LLM prompts for robust evaluation. Extensive experiments on
open-domain dialogue evaluation tasks demonstrate the superiority of our method
compared to state-of-the-art baselines. Our comprehensive ablation studies
reveal that AMR graph information contributes substantially more to performance
improvements. Our framework achieves strong correlations with human judgments
across multiple datasets, establishing a new benchmark for dialogue evaluation.
Our code and data are publicly available.
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2404.01129v5
[DATE]
2025-09-16 17:27:11+08:00
[CATEGORIES]
cs.CL
Crafting Customisable Characters with LLMs: A Persona-Driven Role-Playing Agent Framework
[AUTHORS]
Bohao Yang, Dong Liu, Chenghao Xiao, Kun Zhao, Chen Tang, Chao Li, Lin Yuan, Guang Yang, Chenghua Lin
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2406.17962v7
[DATE]
2025-09-16 17:13:07+08:00
[CATEGORIES]
cs.CL
Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data
[AUTHORS]
Kurt Micallef, Nizar Habash, Claudia Borg
[ABSTRACT]
Maltese is a unique Semitic language that has evolved under extensive
influence from Romance and Germanic languages, particularly Italian and
English. Despite its Semitic roots, its orthography is based on the Latin
script, creating a gap between it and its closest linguistic relatives in
Arabic. In this paper, we explore whether Arabic-language resources can support
Maltese natural language processing (NLP) through cross-lingual augmentation
techniques. We investigate multiple strategies for aligning Arabic textual data
with Maltese, including various transliteration schemes and machine translation
(MT) approaches. As part of this, we also introduce novel transliteration
systems that better represent Maltese orthography. We evaluate the impact of
these augmentations on monolingual and multilingual models and demonstrate that
Arabic-based augmentation can significantly benefit Maltese NLP tasks.
[COMMENTS]
EMNLP Camera-Ready
[LINK]
http://arxiv.org/abs/2509.12853v1
[DATE]
2025-09-16 17:09:50+08:00
[CATEGORIES]
cs.CL
Executable Ontologies: Synthesizing Event Semantics with Dataflow Architecture
[AUTHORS]
Aleksandr Boldachev
[ABSTRACT]
This paper presents boldsea, Boldachev’s semantic-event approach – an
architecture for modeling complex dynamic systems using executable ontologies
– semantic models that act as dynamic structures, directly controlling process
execution. We demonstrate that integrating event semantics with a dataflow
architecture addresses the limitations of traditional Business Process
Management (BPM) systems and object-oriented semantic technologies. The paper
presents the formal BSL (boldsea Semantic Language), including its BNF grammar,
and outlines the boldsea-engine’s architecture, which directly interprets
semantic models as executable algorithms without compilation. It enables the
modification of event models at runtime, ensures temporal transparency, and
seamlessly merges data and business logic within a unified semantic framework.
[COMMENTS]
22 pages, 6 figures. Corrected captions on Figure 4
[LINK]
http://arxiv.org/abs/2509.09775v2
[DATE]
2025-09-16 17:05:52+08:00
[CATEGORIES]
cs.CL
Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic
[AUTHORS]
Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
[ABSTRACT]
Large language models (LLMs) achieve impressive results on advanced
mathematics benchmarks but sometimes fail on basic arithmetic tasks, raising
the question of whether they have truly grasped fundamental arithmetic rules or
are merely relying on pattern matching. To unravel this issue, we
systematically probe LLMs’ understanding of two-integer addition (0 to $2^{64}$)
by testing three crucial properties: commutativity (A+B=B+A), representation
invariance via symbolic remapping (e.g., $7 \to Y$), and consistent accuracy
scaling with operand length. Our evaluation of 12 leading LLMs reveals a stark
disconnect: while models achieve high numeric accuracy (73.8-99.8%), they
systematically fail these diagnostics. Specifically, accuracy plummets to <=
7.5% with symbolic inputs, commutativity is violated in up to 20% of cases, and
accuracy scaling is non-monotonic. These findings demonstrate that current LLMs
address elementary addition via pattern matching, not robust rule induction,
motivating new diagnostic benchmarks and innovations in model architecture and
training to cultivate genuine mathematical reasoning. Our dataset and
generating code are available at
https://github.com/kuri-leo/llm-arithmetic-diagnostic.
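To make two of the diagnostics described above concrete, here is a minimal
sketch, not the authors' benchmark code. It measures how often an adder gives
different answers for A+B and B+A, and shows a digit-to-symbol remapping. All
names (`remap_digits`, `commutativity_violation_rate`, `sample_pairs`, the toy
adders, and the digit-to-letter mapping) are hypothetical; in practice
`add_fn` would wrap a call to the model under test.

```python
import random


def remap_digits(number, mapping):
    """Rewrite each decimal digit of `number` with its symbol in `mapping`."""
    return "".join(mapping[d] for d in str(number))


def commutativity_violation_rate(add_fn, pairs):
    """Fraction of (a, b) pairs where add_fn(a, b) differs from add_fn(b, a)."""
    if not pairs:
        return 0.0
    violations = sum(1 for a, b in pairs if add_fn(a, b) != add_fn(b, a))
    return violations / len(pairs)


def sample_pairs(n, max_value=2**64, seed=0):
    """Draw n operand pairs uniformly from [0, max_value], reproducibly."""
    rng = random.Random(seed)
    return [(rng.randint(0, max_value), rng.randint(0, max_value)) for _ in range(n)]


# Hypothetical stand-ins for a model under test.
def exact_adder(a, b):
    return a + b


def order_sensitive_adder(a, b):
    # Deliberately flawed: when a < b, it drops any carry into a new leading digit.
    total = a + b
    return total if a >= b else total % (10 ** len(str(b)))


mapping = dict(zip("0123456789", "ABCDEFGHIJ"))  # arbitrary symbol assignment
pairs = sample_pairs(200)
print(remap_digits(7052, mapping))
print(commutativity_violation_rate(exact_adder, pairs))
print(commutativity_violation_rate(order_sensitive_adder, pairs))
```

An adder that has actually learned the rule should score 0.0 on the
commutativity check regardless of operand order, which is what the exact adder
does here; the flawed adder fails whenever the sum gains a digit.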
[COMMENTS]
Accepted by EMNLP’25 Main
[LINK]
http://arxiv.org/abs/2504.05262v2
[DATE]
2025-09-16 16:56:37+08:00
[CATEGORIES]
cs.CL
ConvergeWriter: Data-Driven Bottom-Up Article Construction
[AUTHORS]
Binquan Ji, Jiaqi Wang, Ruiting Li, Xingchen Han, Yiyang Qi, Shichao Wang, Yifei Lu, Yuantao Han, Feiliang Ren
[ABSTRACT]
Large Language Models (LLMs) have shown remarkable prowess in text
generation, yet producing long-form, factual documents grounded in extensive
external knowledge bases remains a significant challenge. Existing “top-down”
methods, which first generate a hypothesis or outline and then retrieve
evidence, often suffer from a disconnect between the model’s plan and the
available knowledge, leading to content fragmentation and factual inaccuracies.
To address these limitations, we propose a novel “bottom-up,” data-driven
framework that inverts the conventional generation pipeline. Our approach is
predicated on a “Retrieval-First for Knowledge, Clustering for Structure”
strategy, which first establishes the “knowledge boundaries” of the source
corpus before any generative planning occurs. Specifically, we perform
exhaustive iterative retrieval from the knowledge base and then employ an
unsupervised clustering algorithm to organize the retrieved documents into
distinct “knowledge clusters.” These clusters form an objective, data-driven
foundation that directly guides the subsequent generation of a hierarchical
outline and the final document content. This bottom-up process ensures that the
generated text is strictly constrained by and fully traceable to the source
material, proactively adapting to the finite scope of the knowledge base and
fundamentally mitigating the risk of hallucination. Experimental results on
both 14B and 32B parameter models demonstrate that our method achieves
performance comparable to or exceeding state-of-the-art baselines, and is
expected to demonstrate unique advantages in knowledge-constrained scenarios
that demand high fidelity and structural coherence. Our work presents an
effective paradigm for generating reliable, structured, long-form documents,
paving the way for more robust LLM applications in high-stakes,
knowledge-intensive domains.
[LINK]
http://arxiv.org/abs/2509.12811v1
[DATE]
2025-09-16 16:30:52+08:00
[CATEGORIES]
cs.CL
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
[AUTHORS]
Simon A. Aytes, Jinheon Baek, Sung Ju Hwang
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2503.05179v3
[DATE]
2025-09-16 16:18:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs
[AUTHORS]
Jun Bai, Minghao Tong, Yang Liu, Zixia Jia, Zilong Zheng
[COMMENTS]
Accepted by EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2508.19594v2
[DATE]
2025-09-16 16:17:06+08:00
[CATEGORIES]
cs.CL
Teaching Your Models to Understand Code via Focal Preference Alignment
[AUTHORS]
Jie Wu, Haoling Li, Xin Zhang, Jianwen Luo, Yangyu Huang, Ruihang Chu, Yujiu Yang, Scarlett Li
[ABSTRACT]
Preference learning extends the performance of Code LLMs beyond traditional
supervised fine-tuning by leveraging relative quality comparisons. In existing
approaches, a set of n candidate solutions is evaluated based on test case
success rates, with the candidate demonstrating a higher pass rate being
labeled as positive and its counterpart with a lower pass rate as negative.
However, because this approach aligns entire failing code blocks rather than
pinpointing specific errors, it lacks the granularity necessary to capture
meaningful error-correction relationships. As a result, the model is unable to
learn more informative error-correction patterns. To address these issues, we
propose Target-DPO, a new preference alignment framework that mimics human
iterative debugging to refine Code LLMs. Target-DPO explicitly locates error
regions and aligns the corresponding tokens via a tailored DPO algorithm. To
facilitate it, we introduce the CodeFlow dataset, where samples are iteratively
refined until passing tests, with modifications capturing error corrections.
Extensive experiments show that a diverse suite of Code LLMs equipped with
Target-DPO achieves significant performance gains in code generation and
improves on challenging tasks like BigCodeBench. In-depth analysis reveals that
Target-DPO yields fewer errors. Code, model and datasets are available at:
https://github.com/JieWu02/Target-DPO.
[COMMENTS]
Accepted by EMNLP’25
[LINK]
http://arxiv.org/abs/2503.02783v3
[DATE]
2025-09-16 16:03:41+08:00
[CATEGORIES]
cs.CL
cs.LG
A funny companion: Distinct neural responses to perceived AI- versus human-generated humor
[AUTHORS]
Xiaohui Rao, Hanlin Wu, Zhenguang G. Cai
[ABSTRACT]
As AI companions become capable of human-like communication, including
telling jokes, understanding how people cognitively and emotionally respond to
AI humor becomes increasingly important. This study used electroencephalography
(EEG) to compare how people process humor from AI versus human sources.
Behavioral analysis revealed that participants rated AI and human humor as
comparably funny. However, neurophysiological data showed that AI humor
elicited a smaller N400 effect, suggesting reduced cognitive effort during the
processing of incongruity. This was accompanied by a larger Late Positive
Potential (LPP), indicating a greater degree of surprise and emotional
response. This enhanced LPP likely stems from the violation of low initial
expectations regarding AI’s comedic capabilities. Furthermore, a key temporal
dynamic emerged: human humor showed habituation effects, marked by an
increasing N400 and a decreasing LPP over time. In contrast, AI humor
demonstrated increasing processing efficiency and emotional reward, with a
decreasing N400 and an increasing LPP. This trajectory reveals how the brain
can dynamically update its predictive model of AI capabilities. This process of
cumulative reinforcement challenges “algorithm aversion” in humor, as it
demonstrates how cognitive adaptation to AI’s language patterns can lead to an
intensified emotional reward. Additionally, participants’ social attitudes
toward AI modulated these neural responses, with higher perceived AI
trustworthiness correlating with enhanced emotional engagement. These findings
indicate that the brain responds to AI humor with surprisingly positive and
intense reactions, highlighting humor’s potential for fostering genuine
engagement in human-AI social interaction.
[LINK]
http://arxiv.org/abs/2509.10847v2
[DATE]
2025-09-16 15:44:57+08:00
[CATEGORIES]
cs.CL
Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
[AUTHORS]
Omri Suissa, Muhiim Ali, Shengmai Chen, Yinuo Cai, Shekhar Pradhan
[ABSTRACT]
Humans can recognize an image as an instance of a general concept, beyond
simply identifying its objects and their relationships. In this paper, we
investigate (1) the extent to which VLMs have this concept abstraction capacity,
and (2) strategies for encoding the sort of higher-concept information in images
that would enable the resulting VLM (CLEAR GLASS model) to have this
capability to a greater degree. To this end, we introduce a grouped
image-caption dataset (MAGIC), which consists of several groups of image
captions and for each group a set of associated images and higher-level
conceptual labels. We use a novel contrastive loss technique to induce the
model to encode in the representation of each image (caption) in a group the
information that is common to all members of the image-caption group. Our main
contribution is a grouped contrastive loss function based on text-image
contrastive groups (outer contrastive loss) as well as an inner loss which
measures the distances between image-caption instances in the group. Our
training methodology results in the CLEAR GLASS model having the concept
abstraction capacity as an emergent capacity because the model is not exposed
to the higher-level concepts associated with each group. Instead, the training
forces the model to create for each image-caption group a semantic
representation that brings it closer to the semantic representation of the
higher-level concepts in the latent semantic space. Our experiments show that
this training methodology results in a model which shows improvement in
abstract concept recognition compared to SOTA models.
[LINK]
http://arxiv.org/abs/2509.12771v1
[DATE]
2025-09-16 15:36:44+08:00
[CATEGORIES]
cs.CL
AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
[AUTHORS]
Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
[ABSTRACT]
Positive, supportive online communication in social media (candy speech) has
the potential to foster civility, yet automated detection of such language
remains underexplored, limiting systematic analysis of its impact. We
investigate how candy speech can be reliably detected in a 46k-comment German
YouTube corpus by monolingual and multilingual language models, including
GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual
XLM-RoBERTa-Large model trained to detect candy speech at the span level
outperforms other approaches, ranking first in both binary (positive F1: 0.8906)
and categorized span-based detection (strict F1: 0.6307) subtasks at the
GermEval 2025 Shared Task on Candy Speech Detection. We speculate that
span-based training, multilingual capabilities, and emoji-aware tokenizers
improved detection performance. Our results demonstrate the effectiveness of
multilingual models in identifying positive, supportive language.
[COMMENTS]
6 pages, 1 figure, 2 tables
[LINK]
http://arxiv.org/abs/2509.07459v2
[DATE]
2025-09-16 15:28:10+08:00
[CATEGORIES]
cs.CL
InfoGain-RAG: Boosting Retrieval-Augmented Generation via Document Information Gain-based Reranking and Filtering
[AUTHORS]
Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei, Yuqing Ding, Han Li
[ABSTRACT]
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to
address key limitations of Large Language Models (LLMs), such as hallucination,
outdated knowledge, and a lack of references. However, current RAG frameworks
often struggle with identifying whether retrieved documents meaningfully
contribute to answer generation. This shortcoming makes it difficult to filter
out irrelevant or even misleading content, which notably impacts the final
performance. In this paper, we propose Document Information Gain (DIG), a novel
metric designed to quantify the contribution of retrieved documents to correct
answer generation. DIG measures a document’s value by computing the difference
in the LLM’s generation confidence with and without the document in its context.
Further, we introduce InfoGain-RAG, a framework that leverages DIG scores to
train a specialized reranker, which prioritizes each retrieved document from
exact distinguishing and accurate sorting perspectives. This approach can
effectively filter out irrelevant documents and select the most valuable ones
for better answer generation. Extensive experiments across various models and
benchmarks demonstrate that InfoGain-RAG can significantly outperform existing
approaches, in both single- and multiple-retriever paradigms. Specifically, on
NaturalQA, it achieves improvements of 17.9%, 4.5%, and 12.5% in exact match
accuracy over naive RAG, self-reflective RAG, and modern ranking-based RAG,
respectively, and even an average increase of 15.3% on the advanced proprietary
model GPT-4o across all datasets. These results demonstrate the feasibility of
InfoGain-RAG as it can offer a reliable solution for RAG in multiple
applications.
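A minimal sketch of the DIG idea described above, assuming a hypothetical `confidence(question, context_docs, answer)` callable that returns the model's confidence in a known reference answer. The function names, the toy confidence stand-in, and the `min_gain` threshold are illustrative assumptions, not the authors' implementation; the abstract states that DIG scores are used to train a reranker, which this sketch does not attempt.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: confidence(question, context_docs, answer) -> value in [0, 1].
ConfidenceFn = Callable[[str, Sequence[str], str], float]


def document_information_gain(
    confidence: ConfidenceFn, question: str, answer: str, document: str
) -> float:
    """Confidence in the answer with the document minus confidence without it."""
    with_doc = confidence(question, [document], answer)
    without_doc = confidence(question, [], answer)
    return with_doc - without_doc


def rank_and_filter(
    confidence: ConfidenceFn,
    question: str,
    answer: str,
    documents: Sequence[str],
    min_gain: float = 0.0,  # assumed threshold: drop documents that do not help
) -> List[Tuple[float, str]]:
    """Score every document, drop those at or below min_gain, sort best first."""
    scored = [
        (document_information_gain(confidence, question, answer, doc), doc)
        for doc in documents
    ]
    kept = [pair for pair in scored if pair[0] > min_gain]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


def toy_confidence(question: str, context_docs: Sequence[str], answer: str) -> float:
    """Stand-in for an LLM: supportive context helps, distracting context hurts."""
    if any(answer.lower() in doc.lower() for doc in context_docs):
        return 0.9
    if context_docs:
        return 0.1
    return 0.2


if __name__ == "__main__":
    docs = ["Paris is the capital of France.", "Bananas are yellow."]
    for gain, doc in rank_and_filter(
        toy_confidence, "What is the capital of France?", "Paris", docs
    ):
        print(f"{gain:+.2f}  {doc}")
```

In this toy setting the supportive document receives a positive gain and the distracting one a negative gain, so only the former survives filtering.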
[COMMENTS]
EMNLP’25 Oral Presentation
[LINK]
http://arxiv.org/abs/2509.12765v1
[DATE]
2025-09-16 15:28:07+08:00
[CATEGORIES]
cs.CL
LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning
[AUTHORS]
Yining Huang, Bin Li, Keke Tang, Meilian Chen
[ABSTRACT]
Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit
substantially from chain-of-thought (CoT) reasoning, yet pushing their
performance typically requires vast data, large model sizes, and full-parameter
fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost,
most existing approaches primarily address domain adaptation or layer-wise
allocation rather than explicitly tailoring data and parameters to different
response demands. Inspired by “Thinking, Fast and Slow,” which characterizes
two distinct modes of thought, System 1 (fast, intuitive, often automatic) and
System 2 (slower, more deliberative and analytic), we draw an analogy that
different “subregions” of an LLM’s parameters might similarly specialize for
tasks that demand quick, intuitive responses versus those requiring multi-step
logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework
that partitions both data and parameters by System 1 or System 2 demands, using
fewer yet more focused parameters for each task. Specifically, we classify task
data via multi-model role-playing and voting, and partition parameters based on
importance scoring. We then adopt a two-stage fine-tuning strategy: first training
System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and
intuition, then refining System 2 tasks with reinforcement learning (RL) to
reinforce deeper logical deliberation. Extensive experiments show that the
two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while
matching or surpassing SOTA PEFT baselines.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2507.20999v3
[DATE]
2025-09-16 15:22:09+08:00
[CATEGORIES]
cs.LG
cs.CL
Dynamic Relation Inference via Verb Embeddings
[AUTHORS]
Omri Suissa, Muhiim Ali, Ariana Azarbal, Hui Shen, Shekhar Pradhan
[ABSTRACT]
CLIP has demonstrated exceptional image-text matching capabilities due to its
training on contrastive learning tasks. Past research has suggested that
whereas CLIP effectively matches text to images when the matching can be
achieved just by matching the text with the objects in the image, CLIP
struggles when the matching depends on representing the relationship among the
objects in the images (i.e., inferring relations). Previous attempts to address
this limitation by training CLIP on relation detection datasets with only
linguistic supervision have met with limited success. In this paper, we offer
insights and practical methods to advance the field of relation inference from
images. This paper approaches the task of creating a model that effectively
detects relations among the objects in images by producing text and image
embeddings that capture relationships through linguistic supervision. To this
end, we propose Dynamic Relation Inference via Verb Embeddings (DRIVE), which
augments the COCO dataset, fine-tunes CLIP with hard-negative
subject-relation-object triples and corresponding images, and introduces a
novel loss function to improve relation detection. Evaluated on multiple
CLIP-based models, our method significantly improves zero-shot relation
inference accuracy in both frozen and fine-tuned settings, significantly
outperforming CLIP and state-of-the-art models while generalizing well on
unseen data.
[LINK]
http://arxiv.org/abs/2503.13021v2
[DATE]
2025-09-16 15:12:21+08:00
[CATEGORIES]
cs.CL
Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs
[AUTHORS]
Hanqing Li, Kiran Sheena Jyothi, Henry Liang, Sharika Mahadevan, Diego Klabjan
[ABSTRACT]
We propose a new, training-free method, Graph Reasoning via Retrieval
Augmented Framework (GRRAF), that harnesses retrieval-augmented generation
(RAG) alongside the code-generation capabilities of large language models
(LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target
graph is stored in a graph database, and the LLM is prompted to generate
executable code queries that retrieve the necessary information. This approach
circumvents the limitations of existing methods that require extensive
finetuning or depend on predefined algorithms, and it incorporates an error
feedback loop with a time-out mechanism to ensure both correctness and
efficiency. Experimental evaluations on the GraphInstruct dataset reveal that
GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle
detection, bipartite graph checks, shortest path computation, and maximum flow,
while maintaining consistent token costs regardless of graph sizes. Imperfect
but still very high performance is observed on subgraph matching. Notably,
GRRAF scales effectively to large graphs with up to 10,000 nodes.
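The error feedback loop with a time-out mentioned above can be sketched roughly as follows. This is an illustrative assumption of how such a loop might look, not GRRAF's code: `generate_query` stands in for the LLM, `execute` stands in for the graph database, and the toy operations are hypothetical. Note that the deadline here is only checked between attempts; it does not interrupt a query that is already running.

```python
import time
from typing import Any, Callable, Optional, Tuple


def run_with_feedback(
    generate_query: Callable[[str, Optional[str]], str],
    execute: Callable[[str], Any],
    question: str,
    max_seconds: float = 5.0,
) -> Tuple[Any, int]:
    """Regenerate the query with the last error as feedback until it runs or time is up."""
    deadline = time.monotonic() + max_seconds
    feedback: Optional[str] = None
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        query = generate_query(question, feedback)
        try:
            return execute(query), attempts
        except Exception as err:  # any execution error becomes feedback for the next attempt
            feedback = f"{type(err).__name__}: {err}"
    raise TimeoutError(f"no successful query within {max_seconds}s; last error: {feedback}")


# Toy stand-ins (hypothetical): a tiny graph and a single supported operation.
TOY_GRAPH = {"a": ["b"], "b": ["c"], "c": []}
OPERATIONS = {"count_nodes": lambda graph: len(graph)}


def toy_execute(query: str) -> Any:
    if query not in OPERATIONS:
        raise KeyError(f"unknown operation {query!r}")
    return OPERATIONS[query](TOY_GRAPH)


def toy_generate(question: str, feedback: Optional[str]) -> str:
    # The first attempt contains a typo; the "model" corrects it once it sees an error.
    return "count_nodes" if feedback else "count_node"


if __name__ == "__main__":
    print(run_with_feedback(toy_generate, toy_execute, "How many nodes are in the graph?"))
```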
[LINK]
http://arxiv.org/abs/2509.12743v1
[DATE]
2025-09-16 14:58:58+08:00
[CATEGORIES]
cs.CL
PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims
[AUTHORS]
Yongmin Yoo, Qiongkai Xu, Longbing Cao
[ABSTRACT]
High-stakes texts such as patent claims, medical records, and technical
reports are structurally complex and demand a high degree of reliability and
precision. While large language models (LLMs) have recently been applied to
automate their generation in high-stakes domains, reliably evaluating such
outputs remains a major challenge. Conventional natural language generation
(NLG) metrics are effective for generic documents but fail to capture the
structural and legal characteristics essential to evaluating complex
high-stakes documents. To address this gap, we propose PatentScore, a
multi-dimensional evaluation framework specifically designed for one of the
most intricate and rigorous domains, patent claims. PatentScore integrates
hierarchical decomposition of claim elements, validation patterns grounded in
legal and technical standards, and scoring across structural, semantic, and
legal dimensions. In experiments on our dataset, which consists of 400 Claim 1
instances, PatentScore achieved the highest correlation with expert annotations
($r = 0.819$), significantly outperforming widely used NLG metrics. This work
establishes a new standard for evaluating LLM-generated patent claims,
providing a solid foundation for research on patent generation and validation.
[LINK]
http://arxiv.org/abs/2505.19345v2
[DATE]
2025-09-16 14:50:21+08:00
[CATEGORIES]
cs.CL
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
[AUTHORS]
Ziqi Miao, Yi Ding, Lijun Li, Jing Shao
[COMMENTS]
Accepted to EMNLP 2025 (Main). 17 pages, 7 figures
[LINK]
http://arxiv.org/abs/2507.02844v2
[DATE]
2025-09-16 14:48:12+08:00
[CATEGORIES]
cs.CL
A Novel Recurrent Neural Network Framework for Prediction and Treatment of Oncogenic Mutation Progression
[AUTHORS]
Rishab Parthasarathy, Achintya Bhowmik
[ABSTRACT]
Despite significant medical advancements, cancer remains the second leading
cause of death, with over 600,000 deaths per year in the US. One emerging
field, pathway analysis, is promising but still relies on manually derived wet
lab data, which is time-consuming to acquire. This work proposes an efficient,
effective end-to-end framework for Artificial Intelligence (AI) based pathway
analysis that predicts both cancer severity and mutation progression, thus
recommending possible treatments. The proposed technique involves a novel
combination of time-series machine learning models and pathway analysis. First,
mutation sequences were isolated from The Cancer Genome Atlas (TCGA) Database.
Then, a novel preprocessing algorithm was used to filter key mutations by
mutation frequency. This data was fed into a Recurrent Neural Network (RNN)
that predicted cancer severity. Then, the model probabilistically used the RNN
predictions, information from the preprocessing algorithm, and multiple
drug-target databases to predict future mutations and recommend possible
treatments. This framework achieved robust results and Receiver Operating
Characteristic (ROC) curves (a key statistical metric) with accuracies greater
than 60%, similar to existing cancer diagnostics. In addition, preprocessing
played an instrumental role in isolating important mutations, demonstrating
that each cancer stage studied may contain on the order of a few hundred key
driver mutations, consistent with current research. Heatmaps based on predicted
gene frequency were also generated, highlighting key mutations in each cancer.
Overall, this work is the first to propose an efficient, cost-effective
end-to-end framework for projecting cancer progression and providing possible
treatments without relying on expensive, time-consuming wet lab work.
[COMMENTS]
12 pages, 11 figures, work originally done in 2022/2023 and was
awarded as one of the Regeneron Science Talent Search Finalists in 2022
[LINK]
http://arxiv.org/abs/2509.12732v1
[DATE]
2025-09-16 14:46:28+08:00
[CATEGORIES]
cs.LG
cs.CL
Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation
[AUTHORS]
Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li
[COMMENTS]
Accepted at EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2506.17088v3
[DATE]
2025-09-16 13:49:11+08:00
[CATEGORIES]
cs.CL
Efficient Context Selection for Long-Context QA: No Tuning, No Iteration, Just Adaptive-$k$
[AUTHORS]
Chihiro Taguchi, Seiji Maekawa, Nikita Bhutani
[ABSTRACT]
Retrieval-augmented generation (RAG) and long-context language models (LCLMs)
both address context limitations of LLMs in open-domain question answering
(QA). However, determining the optimal external context to retrieve remains an open problem:
fixing the retrieval size risks either wasting tokens or omitting key evidence.
Existing adaptive methods like Self-RAG and Self-Route rely on iterative LLM
prompting and perform well on factoid QA, but struggle with aggregation QA,
where the optimal context size is both unknown and variable. We present
Adaptive-$k$ retrieval, a simple and effective single-pass method that
adaptively selects the number of passages based on the distribution of the
similarity scores between the query and the candidate passages. It does not
require model fine-tuning, extra LLM inferences or changes to existing
retriever-reader pipelines. On both factoid and aggregation QA benchmarks,
Adaptive-$k$ matches or outperforms fixed-$k$ baselines while using up to 10x
fewer tokens than full-context input, yet still retrieves 70% of relevant
passages. It improves accuracy across five LCLMs and two embedding models,
highlighting that dynamically adjusting context size leads to more efficient
and accurate QA.
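As a rough illustration of choosing the number of passages from the distribution of similarity scores, the sketch below cuts the ranked list at the largest drop between consecutive scores. The largest-gap rule, the function name `adaptive_k`, and the `min_k` parameter are assumptions made for this example; the abstract does not specify the exact selection criterion.

```python
from typing import List, Sequence


def adaptive_k(scores: Sequence[float], min_k: int = 1) -> List[int]:
    """Return indices of the passages to keep, ranked by similarity score.

    Assumed heuristic: sort scores in descending order and cut at the largest
    gap between consecutive scores, keeping at least min_k (>= 1) passages.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if len(order) <= min_k:
        return order
    ranked = [scores[i] for i in order]
    best_cut, best_gap = min_k, -1.0
    for j in range(min_k - 1, len(ranked) - 1):
        gap = ranked[j] - ranked[j + 1]  # drop between rank j and rank j + 1
        if gap > best_gap:
            best_gap, best_cut = gap, j + 1
    return order[:best_cut]


if __name__ == "__main__":
    # Three passages score high, two score low: the cut falls between the groups.
    similarity = [0.91, 0.30, 0.88, 0.25, 0.86]
    print(adaptive_k(similarity))
```

The selection is a single pass over precomputed scores, so it needs no extra LLM calls and can sit between an existing retriever and reader.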
[COMMENTS]
26 pages, 16 tables, 5 figures. Accepted at EMNLP 2025 (Main)
[LINK]
http://arxiv.org/abs/2506.08479v2
[DATE]
2025-09-16 13:21:57+08:00
[CATEGORIES]
cs.CL
GTA: Supervised-Guided Reinforcement Learning for Text Classification with Large Language Models
[AUTHORS]
Min Zeng, Jingfei Sun, Xueyou Luo, Caiquan Liu, Shiqi Zhang, Li Xie, Xiaoxin Chen
[ABSTRACT]
In natural language processing tasks, pure reinforcement learning (RL)
fine-tuning methods often suffer from inefficient exploration and slow
convergence, while supervised fine-tuning (SFT) methods, although efficient in
training, have a limited performance ceiling and a less solid theoretical
foundation compared to RL. To address this efficiency-capability trade-off, we
propose the Guess-Think-Answer (GTA) framework that combines the efficiency of
SFT with the capability gains of RL in a unified training paradigm. GTA works
by having the model first produce a provisional guess (optimized via
cross-entropy loss), then reflect on this guess before generating the final
answer, with RL rewards shaping both the final output and the format of the
entire GTA structure. This hybrid approach achieves both faster convergence
than pure RL and a higher performance ceiling than pure SFT. To mitigate gradient
conflicts between the two training signals, we employ loss masking and gradient
constraints. Empirical results on four text classification benchmarks
demonstrate that GTA substantially accelerates convergence while outperforming
both standalone SFT and RL baselines.
[COMMENTS]
Accepted at EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.12108v2
[DATE]
2025-09-16 13:13:41+08:00
[CATEGORIES]
cs.CL
OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
[AUTHORS]
Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova
[ABSTRACT]
In machine translation (MT), health is a high-stakes domain characterised by
widespread deployment and domain-specific vocabulary. However, there is a lack
of MT evaluation datasets for low-resource languages in this domain. To address
this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978
documents and 26,824 sentences from the World Health Organization’s e-learning
platform. Sourced from expert-authored, professionally translated materials
shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages,
of which nine are low-resource. Leveraging this new resource, we evaluate
modern large language models (LLMs) against traditional MT models. Our findings
reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5
Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our
low-resource test set. Further, we investigate how LLM context utilisation
affects accuracy, finding that the benefits of document-level translation are
most pronounced in specialised domains like health. We release the OpenWHO
corpus to encourage further research into low-resource MT in the health domain.
[COMMENTS]
Accepted at WMT 2025
[LINK]
http://arxiv.org/abs/2508.16048v2
[DATE]
2025-09-16 13:10:52+08:00
[CATEGORIES]
cs.CL
Case-Based Decision-Theoretic Decoding with Quality Memories
[AUTHORS]
Hiroyuki Deguchi, Masaaki Nagata
[ABSTRACT]
Minimum Bayes risk (MBR) decoding is a decision rule for text generation,
which selects the hypothesis that maximizes the expected utility and robustly
generates higher-quality texts than maximum a posteriori (MAP) decoding.
However, it depends on sample texts drawn from the text generation model; thus,
it is difficult to find a hypothesis that correctly captures the knowledge or
information of out-of-domain data. To tackle this issue, we propose case-based
decision-theoretic (CBDT) decoding, another method to estimate the expected
utility using examples of domain data. CBDT decoding not only generates
higher-quality texts than MAP decoding; combined with MBR decoding, it also
outperforms MBR decoding alone in seven-domain De–En and
Ja$\leftrightarrow$En translation tasks and image captioning tasks on MSCOCO
and nocaps datasets.
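For readers unfamiliar with the MBR baseline referred to above, here is a minimal sketch of sample-based MBR decoding: each candidate is scored by its average utility against samples drawn from the model, and the highest-scoring candidate is returned. The Jaccard token-overlap utility is a toy assumption chosen to keep the example self-contained; it is not the metric used in the paper, and the sketch does not implement CBDT decoding.

```python
from typing import Callable, Sequence


def jaccard(a: str, b: str) -> float:
    """Toy utility: token-set overlap between two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mbr_decode(
    hypotheses: Sequence[str],
    samples: Sequence[str],
    utility: Callable[[str, str], float] = jaccard,
) -> str:
    """Pick the hypothesis with the highest average utility against the samples."""

    def expected_utility(hypothesis: str) -> float:
        return sum(utility(hypothesis, s) for s in samples) / len(samples)

    return max(hypotheses, key=expected_utility)


if __name__ == "__main__":
    samples = ["the cat sat", "the cat sat", "the cat sat down", "a dog ran"]
    candidates = ["the cat sat down", "the cat sat", "a dog ran"]
    print(mbr_decode(candidates, samples))
```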
[COMMENTS]
Accepted at EMNLP 2025 main
[LINK]
http://arxiv.org/abs/2509.12677v1
[DATE]
2025-09-16 13:01:05+08:00
[CATEGORIES]
cs.CL
Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content
[AUTHORS]
Shaz Furniturewala, Arkaitz Zubiaga
[ABSTRACT]
The volume of machine-generated content online has grown dramatically due to
the widespread use of Large Language Models (LLMs), leading to new challenges
for content moderation systems. Conventional content moderation classifiers,
which are usually trained on text produced by humans, suffer from
misclassifications due to LLM-generated text deviating from their training data
and adversarial attacks that aim to avoid detection. Present-day defence
tactics are reactive rather than proactive, since they rely on adversarial
training or external detection models to identify attacks. In this work, we aim
to identify the vulnerable components of toxicity classifiers that contribute
to misclassification, proposing a novel strategy based on mechanistic
interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa
classifiers, testing on diverse datasets spanning a variety of minority groups.
We use adversarial attacking techniques to identify vulnerable circuits.
Finally, we suppress these vulnerable circuits, improving performance against
adversarial attacks. We also provide demographic-level insights into these
vulnerable circuits, exposing fairness and robustness gaps in model training.
We find that models have distinct heads that are either crucial for performance
or vulnerable to attack, and suppressing the vulnerable heads improves
performance on adversarial input. We also find that different heads are
responsible for vulnerability across different demographic groups, which can
inform more inclusive development of toxicity detection models.
[LINK]
http://arxiv.org/abs/2509.12672v1
[DATE]
2025-09-16 12:51:18+08:00
[CATEGORIES]
cs.CL
Chat-Driven Text Generation and Interaction for Person Retrieval
[AUTHORS]
Zequn Xie, Chuxin Wang, Sihang Cai, Yeqiang Wang, Shulei Wang, Tao Jin
[ABSTRACT]
Text-based person search (TBPS) enables the retrieval of person images from
large-scale databases using natural language descriptions, offering critical
value in surveillance applications. However, a major challenge lies in the
labor-intensive process of obtaining high-quality textual annotations, which
limits scalability and practical deployment. To address this, we introduce two
complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text
Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues
with MLLMs, producing fine-grained and diverse visual descriptions without
manual supervision. MTI refines user queries at inference time through dynamic,
dialogue-based reasoning, enabling the system to interpret and resolve vague,
incomplete, or ambiguous descriptions, characteristics often seen in
real-world search scenarios. Together, MTG and MTI form a unified and
annotation-free framework that significantly improves retrieval accuracy,
robustness, and usability. Extensive evaluations demonstrate that our method
achieves competitive or superior results while eliminating the need for manual
captions, paving the way for scalable and practical deployment of TBPS systems.
[COMMENTS]
Accepted by EMNLP 2025. 13 pages, 3 figures
[LINK]
http://arxiv.org/abs/2509.12662v1
[DATE]
2025-09-16 12:40:24+08:00
[CATEGORIES]
cs.CL
PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition
[AUTHORS]
Li Fu, Yu Xin, Sunlu Zeng, Lu Fan, Youzheng Wu, Xiaodong He
[ABSTRACT]
This paper presents a Pronunciation-Aware Contextualized (PAC) framework to
address two key challenges in Large Language Model (LLM)-based Automatic Speech
Recognition (ASR) systems: effective pronunciation modeling and robust
homophone discrimination. Both are essential for rare or long-tail word
recognition. The proposed approach adopts a two-stage learning paradigm. First,
we introduce a pronunciation-guided context learning method. It employs an
interleaved grapheme-phoneme context modeling strategy that incorporates
grapheme-only distractors, encouraging the model to leverage phonemic cues for
accurate recognition. Then, we propose a pronunciation-discriminative
reinforcement learning method with perturbed label sampling to further enhance
the model's ability to distinguish contextualized homophones. Experimental
results on the public English Librispeech and Mandarin AISHELL-1 datasets
indicate that PAC: (1) achieves relative Word Error Rate (WER) reductions of
30.2% and 53.8%, respectively, compared to pre-trained LLM-based ASR models, and
(2) achieves relative reductions of 31.8% and 60.5%, respectively, in biased WER
for long-tail words compared to strong baselines.
[COMMENTS]
Submitted to ICASSP 2026
[LINK]
http://arxiv.org/abs/2509.12647v1
[DATE]
2025-09-16 12:07:28+08:00
[CATEGORIES]
cs.CL
Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
[AUTHORS]
Zhiyu Yang, Shuo Wang, Yukun Yan, Yang Deng
[ABSTRACT]
LLMs are transforming software development, yet current code generation and
code repair benchmarks mainly assess syntactic and functional correctness in
simple, single-error cases. LLMs’ capabilities to autonomously find and fix
runtime logical errors in complex data science code remain largely unexplored.
To address this gap, we introduce DSDBench: the Data Science Debugging
Benchmark, the first benchmark for systematic evaluation of LLMs on multi-hop
error tracing and multi-bug detection in data science code debugging. DSDBench
adapts datasets from existing data science task benchmarks, such as DABench and
MatPlotBench, featuring realistic data science debugging tasks with
automatically synthesized multi-hop, multi-bug code snippets. DSDBench includes
1,117 annotated samples with 741 cause-effect error pairs and runtime error
messages. Evaluations of state-of-the-art LLMs on DSDBench show significant
performance gaps, highlighting challenges in debugging logical runtime errors
in data science code. DSDBench offers a crucial resource to evaluate and
improve LLMs’ debugging and reasoning capabilities, enabling more reliable
AI-assisted data science in the future. DSDBench is publicly available at
github.com/KevinCL16/DSDBench.
[COMMENTS]
Accepted at EMNLP 2025 Main, Oral
[LINK]
http://arxiv.org/abs/2503.22388v3
[DATE]
2025-09-16 11:49:04+08:00
[CATEGORIES]
cs.CL
Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
[AUTHORS]
Zijin Hong, Zheng Yuan, Qinggang Zhang, Hao Chen, Junnan Dong, Feiran Huang, Xiao Huang
[ABSTRACT]
Generating accurate SQL from users’ natural language questions (text-to-SQL)
remains a long-standing challenge due to the complexities involved in user
question understanding, database schema comprehension, and SQL generation.
Traditional text-to-SQL systems, which combine human engineering and deep
neural networks, have made significant progress. Subsequently, pre-trained
language models (PLMs) have been developed for text-to-SQL tasks, achieving
promising results. However, as modern databases and user questions grow more
complex, PLMs with a limited parameter size often produce incorrect SQL. This
necessitates more sophisticated and tailored optimization methods, which
restricts the application of PLM-based systems. Recently, large language models
(LLMs) have shown significant capabilities in natural language understanding as
model scale increases. Thus, integrating LLM-based solutions can bring unique
opportunities, improvements, and solutions to text-to-SQL research. In this
survey, we provide a comprehensive review of existing LLM-based text-to-SQL
studies. Specifically, we offer a brief overview of the technical challenges
and evolutionary process of text-to-SQL. Next, we introduce the datasets and
metrics designed to evaluate text-to-SQL systems. Subsequently, we present a
systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we
summarize the survey, discuss the remaining challenges in this field, and
outline expectations for future research directions. All the related resources
on LLM-based text-to-SQL, including research papers, benchmarks, and open-source projects,
are collected for the community in our repository:
https://github.com/DEEP-PolyU/Awesome-LLM-based-Text2SQL.
[COMMENTS]
Accepted to IEEE TKDE
[LINK]
http://arxiv.org/abs/2406.08426v6
[DATE]
2025-09-16 11:28:04+08:00
[CATEGORIES]
cs.CL
Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs
[AUTHORS]
Hang Guo, Yawei Li, Luca Benini
[ABSTRACT]
Recent advances in Large Language Model (LLM) compression, such as
quantization and pruning, have achieved notable success. However, as these
techniques gradually approach their respective limits, relying on a single
method for further compression has become increasingly challenging. In this
work, we explore an alternative solution by combining quantization and
sparsity. This joint approach, though promising, introduces new difficulties
due to the inherently conflicting requirements on weight distributions:
quantization favors compact ranges, while pruning benefits from high variance.
To attack this problem, we propose Optimal Brain Restoration (OBR), a general
and training-free framework that aligns pruning and quantization by error
compensation between both. OBR minimizes performance degradation on downstream
tasks by building on a second-order Hessian objective, which is then
reformulated into a tractable problem through surrogate approximation and
ultimately reaches a closed-form solution via group error compensation.
Experiments show that OBR enables aggressive W4A4KV4 quantization with 50%
sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory
reduction compared to the FP16-dense baseline.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2509.11177v2
[DATE]
2025-09-16 11:17:50+08:00
[CATEGORIES]
cs.CL
MVPBench: A Benchmark and Fine-Tuning Framework for Aligning Large Language Models with Diverse Human Values
[AUTHORS]
Yao Liang, Dongcheng Zhao, Feifei Zhao, Guobin Shen, Yuwei Wang, Dongqi Liang, Yi Zeng
[ABSTRACT]
The alignment of large language models (LLMs) with human values is critical
for their safe and effective deployment across diverse user populations.
However, existing benchmarks often neglect cultural and demographic diversity,
leading to limited understanding of how value alignment generalizes globally.
In this work, we introduce MVPBench, a novel benchmark that systematically
evaluates LLMs’ alignment with multi-dimensional human value preferences across
75 countries. MVPBench contains 24,020 high-quality instances annotated with
fine-grained value labels, personalized questions, and rich demographic
metadata, making it the most comprehensive resource of its kind to date. Using
MVPBench, we conduct an in-depth analysis of several state-of-the-art LLMs,
revealing substantial disparities in alignment performance across geographic
and demographic lines. We further demonstrate that lightweight fine-tuning
methods, such as Low-Rank Adaptation (LoRA) and Direct Preference Optimization
(DPO), can significantly enhance value alignment in both in-domain and
out-of-domain settings. Our findings underscore the necessity for
population-aware alignment evaluation and provide actionable insights for
building culturally adaptive and value-sensitive LLMs. MVPBench serves as a
practical foundation for future research on global alignment, personalized
value modeling, and equitable AI development.
[COMMENTS]
Some parts of the paper need to be revised. We would therefore like
to withdraw the paper and resubmit it after making the necessary changes
[LINK]
http://arxiv.org/abs/2509.08022v2
[DATE]
2025-09-16 11:06:45+08:00
[CATEGORIES]
cs.CL
EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving
[AUTHORS]
Mukai Li, Linfeng Song, Zhenwen Liang, Jiahao Xu, Shansan Gong, Qi Liu, Haitao Mi, Dong Yu
[ABSTRACT]
Large Language Models (LLMs) have recently advanced the field of Automated
Theorem Proving (ATP), attaining substantial performance gains through widely
adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT)
reasoning and increased sampling passes. However, they both introduce
significant computational overhead for inference. Moreover, existing cost
analyses typically regulate only the number of sampling passes, while
neglecting the substantial disparities in sampling costs introduced by
different scaling strategies. In this paper, we systematically compare the
efficiency of different test-time scaling strategies for ATP models and
demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source
approaches. We then investigate approaches to significantly reduce token usage
and sampling passes while maintaining the original performance. Specifically, we
propose two complementary methods that can be integrated into a unified EconRL
pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching
mechanism designed to mitigate unnecessary token consumption, and (2) diverse
parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance
pass rates under constrained sampling passes. Experiments on miniF2F and
ProofNet demonstrate that our EconProver achieves comparable performance to
baseline methods with only 12% of the computational cost. This work provides
actionable insights for deploying lightweight ATP models without sacrificing
performance.
[LINK]
http://arxiv.org/abs/2509.12603v1
[DATE]
2025-09-16 11:00:13+08:00
[CATEGORIES]
cs.CL
DaSAThco: Data-Aware SAT Heuristics Combinations Optimization via Large Language Models
[AUTHORS]
Minyu Chen, Guoqiang Li
[ABSTRACT]
The performance of Conflict-Driven Clause Learning solvers hinges on internal
heuristics, yet the heterogeneity of SAT problems makes a single, universally
optimal configuration unattainable. While prior automated methods can find
specialized configurations for specific problem families, this dataset-specific
approach lacks generalizability and requires costly re-optimization for new
problem types. We introduce DaSAThco, a framework that addresses this challenge
by learning a generalizable mapping from instance features to tailored
heuristic ensembles, enabling a train-once, adapt-broadly model. Our framework
uses a Large Language Model, guided by systematically defined Problem
Archetypes, to generate a diverse portfolio of specialized heuristic ensembles
and subsequently learns an adaptive selection mechanism to form the final
mapping. Experiments show that DaSAThco achieves superior performance and, most
notably, demonstrates robust out-of-domain generalization where non-adaptive
methods show limitations. Our work establishes a more scalable and practical
path toward automated algorithm design for complex, configurable systems.
[COMMENTS]
11 pages
[LINK]
http://arxiv.org/abs/2509.12602v1
[DATE]
2025-09-16 10:58:50+08:00
[CATEGORIES]
cs.CL
The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning
[AUTHORS]
Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang
[ABSTRACT]
We present LightVLA, a simple yet effective differentiable token pruning
framework for vision-language-action (VLA) models. While VLA models have shown
impressive capability in executing real-world robotic tasks, their deployment
on resource-constrained platforms is often bottlenecked by the heavy
attention-based computation over large sets of visual tokens. LightVLA
addresses this challenge through adaptive, performance-driven pruning of visual
tokens: It generates dynamic queries to evaluate visual token importance, and
adopts Gumbel softmax to enable differentiable token selection. Through
fine-tuning, LightVLA learns to preserve the most informative visual tokens
while pruning tokens which do not contribute to task execution, thereby
improving efficiency and performance simultaneously. Notably, LightVLA requires
no heuristic magic numbers and introduces no additional trainable parameters,
making it compatible with modern inference frameworks. Experimental results
demonstrate that LightVLA outperforms different VLA models and existing token
pruning methods across diverse tasks on the LIBERO benchmark, achieving higher
success rates with substantially reduced computational overhead. Specifically,
LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9%
improvement in task success rate. Meanwhile, we also investigate the learnable
query-based token pruning method LightVLA* with additional trainable
parameters, which also achieves satisfactory performance. Our work reveals that
as VLA pursues optimal performance, LightVLA spontaneously learns to prune
tokens from a performance-driven perspective. To the best of our knowledge,
LightVLA is the first work to apply adaptive visual token pruning to VLA tasks
with the collateral goals of efficiency and performance, marking a significant
step toward more efficient, powerful and practical real-time robotic systems.
[COMMENTS]
Under review. Project site:
https://liauto-research.github.io/LightVLA
[LINK]
http://arxiv.org/abs/2509.12594v1
[DATE]
2025-09-16 10:43:46+08:00
[CATEGORIES]
cs.CL
Match Chat: Real Time Generative AI and Generative Computing for Tennis
[AUTHORS]
Aaron Baughman, Gozde Akay, Eduardo Morales, Rahul Agarwal, Preetika Srivastava
[ABSTRACT]
We present Match Chat, a real-time, agent-driven assistant designed to
enhance the tennis fan experience by delivering instant, accurate responses to
match-related queries. Match Chat integrates Generative Artificial Intelligence
(GenAI) with Generative Computing (GenComp) techniques to synthesize key
insights during live tennis singles matches. The system debuted at the 2025
Wimbledon Championships and the 2025 US Open, where it provided about 1 million
users with seamless access to streaming and static data through natural
language queries. The architecture is grounded in an Agent-Oriented
Architecture (AOA) combining rule engines, predictive models, and agents to
pre-process and optimize user queries before passing them to GenAI components.
The Match Chat system had an answer accuracy of 92.83% with an average response
time of 6.25 seconds under loads of up to 120 requests per second (RPS). Over
96.08% of all queries were guided using interactive prompt design, contributing
to a user experience that prioritized clarity, responsiveness, and minimal
effort. The system was designed to mask architectural complexity, offering a
frictionless and intuitive interface that required no onboarding or technical
familiarity. Across both Grand Slam deployments, Match Chat maintained 100%
uptime and supported nearly 1 million unique users, underscoring the
scalability and reliability of the platform. This work introduces key design
patterns for real-time, consumer-facing AI systems that emphasize speed,
precision, and usability, highlighting a practical path for deploying
performant agentic systems in dynamic environments.
[COMMENTS]
12 pages, 5 Figures, 4 Tables
[LINK]
http://arxiv.org/abs/2509.12592v1
[DATE]
2025-09-16 10:38:27+08:00
[CATEGORIES]
cs.CL
Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement
[AUTHORS]
Hao Li, Yizheng Sun, Viktor Schlegel, Kailai Yang, Riza Batista-Navarro, Goran Nenadic
[ABSTRACT]
Argument summarization aims to generate concise, structured representations
of complex, multi-perspective debates. While recent work has advanced the
identification and clustering of argumentative components, the generation stage
remains underexplored. Existing approaches typically rely on single-pass
generation, offering limited support for factual correction or structural
refinement. To address this gap, we introduce Arg-LLaDA, a novel large language
diffusion framework that iteratively improves summaries via sufficiency-guided
remasking and regeneration. Our method combines a flexible masking controller
with a sufficiency-checking module to identify and revise unsupported,
redundant, or incomplete spans, yielding more faithful, concise, and coherent
outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA
surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation
metrics. In addition, human evaluations reveal substantial improvements across
core dimensions (coverage, faithfulness, and conciseness), validating the
effectiveness of our iterative, sufficiency-aware generation strategy.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2507.19081v3
[DATE]
2025-09-16 10:37:26+08:00
[CATEGORIES]
cs.CL
MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models
[AUTHORS]
Vijay Govindarajan, Pratik Patel, Sahil Tripathi, Md Azizul Hoque, Gautam Siddharth Kashyap
[ABSTRACT]
Automated Audio Captioning (AAC) generates captions for audio clips but faces
challenges due to limited datasets compared to image captioning. To overcome
this, we propose a zero-shot AAC system that leverages pre-trained models,
eliminating the need for extensive training. Our approach uses a pre-trained
audio CLIP model to extract auditory features and generate a structured prompt,
which guides a Large Language Model (LLM) in caption generation. Unlike
traditional greedy decoding, our method refines token selection through the
audio CLIP model, ensuring alignment with the audio content. Experimental
results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using
MAGIC search with the WavCaps model. The performance is heavily influenced by
the audio-text matching model and keyword selection, with optimal results
achieved using a single keyword prompt, and a 50% performance drop when no
keyword list is used.
[COMMENTS]
Accepted in The 26th International Conference on Web Information
Systems Engineering (WISE), scheduled for 15-17 December 2025 in Marrakech,
Morocco
[LINK]
http://arxiv.org/abs/2509.12591v1
[DATE]
2025-09-16 10:36:00+08:00
[CATEGORIES]
cs.CL
Yet Another Watermark for Large Language Models
[AUTHORS]
Siyuan Bao, Ying Shi, Zhiguang Yang, Hanzhou Wu, Xinpeng Zhang
[ABSTRACT]
Existing watermarking methods for large language models (LLMs) mainly embed
the watermark by adjusting the token sampling prediction or post-processing,
lacking intrinsic coupling with LLMs, which may significantly reduce the
semantic quality of the generated marked texts. Traditional watermarking
methods based on training or fine-tuning may be extendable to LLMs. However,
most of them are limited to the white-box scenario, or very time-consuming due
to the massive parameters of LLMs. In this paper, we present a new watermarking
framework for LLMs, where the watermark is embedded into the LLM by
manipulating the internal parameters of the LLM, and can be extracted from the
generated text without accessing the LLM. Compared with related methods, the
proposed method entangles the watermark with the intrinsic parameters of the
LLM, which better balances the robustness and imperceptibility of the
watermark. Moreover, the proposed method enables us to extract the watermark
under the black-box scenario, which is computationally efficient for use.
Experimental results have also verified the feasibility, superiority, and
practicality of the proposed method. This work provides a new perspective different from mainstream
works, which may shed light on future research.
[COMMENTS]
https://scholar.google.com/citations?hl=en&user=IdiF7M0AAAAJ
[LINK]
http://arxiv.org/abs/2509.12574v1
[DATE]
2025-09-16 10:04:55+08:00
[CATEGORIES]
cs.CL
HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation
[AUTHORS]
Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong
[ABSTRACT]
The advancement of Large Language Models (LLMs) enables flexible and
interpretable automatic evaluations. In the field of machine translation
evaluation, utilizing LLMs with translation error annotations based on
Multidimensional Quality Metrics (MQM) yields more human-aligned judgments.
However, current LLM-based evaluation methods still face challenges in
accurately identifying error spans and assessing their severity. In this paper,
we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation
Evaluation. We argue that existing approaches inadequately exploit the
fine-grained structural and semantic information within the MQM hierarchy. To
address this, we develop a hierarchical multi-agent system grounded in the MQM
error typology, enabling granular evaluation of subtype errors. Two key
strategies are incorporated to further mitigate systemic hallucinations within
the framework: the utilization of the model’s self-reflection capability and
the facilitation of agent discussion involving asymmetric information.
Empirically, HiMATE outperforms competitive baselines across different datasets
in conducting human-aligned evaluations. Further analyses underscore its
significant advantage in error span detection and severity assessment,
achieving an average F1-score improvement of 89% over the best-performing
baseline. We make our code and data publicly available at
https://github.com/nlp2ct-shijie/HiMATE.
[LINK]
http://arxiv.org/abs/2505.16281v3
[DATE]
2025-09-16 09:45:19+08:00
[CATEGORIES]
cs.CL
LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
[AUTHORS]
Robin Vujanic, Thomas Rueckstiess
[ABSTRACT]
We present LEAF (“Lightweight Embedding Alignment Framework”), a knowledge
distillation framework for text embedding models. A key distinguishing feature
is that our distilled leaf models are aligned to their teacher. In the context
of information retrieval, this allows for flexible asymmetric architectures
where documents are encoded with the larger teacher model, while queries can be
served with the smaller leaf models. We also show that leaf models
automatically inherit MRL and robustness to output quantization whenever these
properties are present in the teacher model, without explicitly training for
them. To demonstrate the capability of our framework, we publish leaf-ir, a
23M-parameter, information-retrieval-oriented text embedding model trained using
LEAF, which sets a new state-of-the-art (SOTA) on BEIR, ranking #1 on the
public leaderboard for this benchmark and for models of its size. When run in
asymmetric mode, its retrieval performance is further increased. Our scheme is
however not restricted to the information retrieval setting, and we demonstrate
its wider applicability by synthesizing the multi-task leaf-mt model. This also
sets a new SOTA, ranking #1 on the public MTEB v2 (English) leaderboard for its
size. LEAF is applicable to black-box models and, in contrast to other embedding
model training frameworks, it does not require judgments or hard negatives,
and training can be conducted using small batch sizes. Thus, dataset and
training infrastructure requirements for our framework are modest. We make our
models publicly available under a permissive Apache 2.0 license.
[COMMENTS]
17 pages, 12 figures
[LINK]
http://arxiv.org/abs/2509.12539v1
[DATE]
2025-09-16 08:41:05+08:00
[CATEGORIES]
cs.CL
cs.LG
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
[AUTHORS]
Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
[ABSTRACT]
Large language models (LLMs) possess broad world knowledge and strong
general-purpose reasoning ability, yet they struggle to learn from many
in-context examples on standard machine learning (ML) tasks, that is, to
leverage many-shot demonstrations purely via in-context learning (ICL) without
gradient descent. We introduce MachineLearningLM, a portable
continued-pretraining framework that equips a general-purpose LLM with robust
in-context ML capability while preserving its general knowledge and reasoning
for broader chat workflows.
Our pretraining procedure synthesizes ML tasks from millions of structural
causal models (SCMs), spanning shot counts up to 1,024. We begin with a
random-forest teacher, distilling tree-based decision strategies into the LLM
to strengthen robustness in numerical modeling. All tasks are serialized with a
token-efficient prompt, enabling 3x to 6x more examples per context window and
delivering up to 50x amortized throughput via batch inference.
Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8),
MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an
average of about 15% on out-of-distribution tabular classification across
finance, physics, biology, and healthcare domains. It exhibits a striking
many-shot scaling law: accuracy increases monotonically as in-context
demonstrations grow from 8 to 1,024. Without any task-specific training, it
attains random-forest-level accuracy across hundreds of shots. General chat
capabilities, including knowledge and reasoning, are preserved: it achieves
75.4% on MMLU.
[LINK]
http://arxiv.org/abs/2509.06806v5
[DATE]
2025-09-16 08:33:42+08:00
[CATEGORIES]
cs.CL
The Adaptation Paradox: Agency vs. Mimicry in Companion Chatbots
[AUTHORS]
T. James Brandt, Cecilia Xi Wang
[ABSTRACT]
Generative AI powers a growing wave of companion chatbots, yet principles for
fostering genuine connection remain unsettled. We test two routes: visible user
authorship versus covert language-style mimicry. In a preregistered 3x2
experiment (N = 162), we manipulated user-controlled avatar generation (none,
premade, user-generated) and Language Style Matching (LSM) (static vs.
adaptive). Generating an avatar boosted rapport ($\omega^2$ = .040, p = .013),
whereas adaptive LSM underperformed static style on personalization and
satisfaction (d = 0.35, p = .009) and was paradoxically judged less adaptive (t
= 3.07, p = .003, d = 0.48). We term this an Adaptation Paradox: synchrony
erodes connection when perceived as incoherent, destabilizing persona. To
explain, we propose a stability-and-legibility account: visible authorship
fosters natural interaction, while covert mimicry risks incoherence. Our
findings suggest designers should prioritize legible, user-driven
personalization and limit stylistic shifts rather than rely on opaque mimicry.
[COMMENTS]
31 pages, 17 figures, 2 tables. Submitted to CHI 2026 (under review).
Preregistered: https://osf.io/f4h5b ; Code/Materials:
https://doi.org/10.5281/zenodo.15801081
[LINK]
http://arxiv.org/abs/2509.12525v1
[DATE]
2025-09-16 08:02:27+08:00
[CATEGORIES]
cs.CL
Context-Aware Language Models for Forecasting Market Impact from Sequences of Financial News
[AUTHORS]
Ross Koval, Nicholas Andrews, Xifeng Yan
[ABSTRACT]
Financial news plays a critical role in the information diffusion process in
financial markets and is a known driver of stock prices. However, the
information in each news article is not necessarily self-contained, often
requiring a broader understanding of the historical news coverage for accurate
interpretation. Further, identifying and incorporating the most relevant
contextual information presents significant challenges. In this work, we
explore the value of historical context in the ability of large language models
to understand the market impact of financial news. We find that historical
context provides a consistent and significant improvement in performance across
methods and time horizons. To this end, we propose an efficient and effective
contextualization method that uses a large LM to process the main article,
while a small LM encodes the historical context into concise summary embeddings
that are then aligned with the large model’s representation space. We explore
the behavior of the model through multiple qualitative and quantitative
interpretability tests and reveal insights into the value of contextualization.
Finally, we demonstrate that the value of historical context in model
predictions has real-world applications, translating to substantial
improvements in simulated investment performance.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2509.12519v1
[DATE]
2025-09-16 07:51:13+08:00
[CATEGORIES]
cs.CL
A comparison of pipelines for the translation of a low resource language based on transformers
[AUTHORS]
Chiara Bonfanti, Michele Colombino, Giulia Coucourde, Faeze Memari, Stefano Pinardi, Rosa Meo
[ABSTRACT]
This work compares three pipelines for training transformer-based neural
networks to produce machine translators for Bambara, a Mandé language spoken
in Africa by about 14,188,850 people. The first pipeline trains a simple
transformer to translate sentences from French into Bambara. The second
fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures
for French-to-Bambara translation. Models from the first two pipelines were
trained with different hyperparameter combinations to improve BLEU and chrF
scores, evaluated on both test sentences and official Bambara benchmarks. The
third pipeline uses language distillation with a student-teacher dual neural
network to integrate Bambara into a pre-trained LaBSE model, which provides
language-agnostic embeddings. A BERT extension is then applied to LaBSE to
generate translations. All pipelines were tested on Dokotoro (medical) and
Bayelemagaba (mixed domains). Results show that the first pipeline, although
simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on
Bayelemagaba), consistent with low-resource translation results. On the Yiri
dataset, created for this work, it achieves 33.81% BLEU and 41% chrF.
Instructor-based models perform better on single datasets than on aggregated
collections, suggesting they capture dataset-specific patterns more
effectively.
[COMMENTS]
9 pages, 4 figures
[LINK]
http://arxiv.org/abs/2509.12514v1
[DATE]
2025-09-16 07:36:49+08:00
[CATEGORIES]
cs.CL
cs.LG
FunAudio-ASR Technical Report
[AUTHORS]
Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou
[ABSTRACT]
In recent years, automatic speech recognition (ASR) has witnessed
transformative advancements driven by three complementary paradigms: data
scaling, model size scaling, and deep integration with large language models
(LLMs). However, LLMs are prone to hallucination, which can significantly
degrade user experience in real-world ASR applications. In this paper, we
present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically
combines massive data, large model capacity, LLM integration, and reinforcement
learning to achieve state-of-the-art performance across diverse and complex
speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized
for practical deployment, with enhancements in streaming capability, noise
robustness, code-switching, hotword customization, and satisfying other
real-world application requirements. Experimental results show that while most
LLM-based ASR systems achieve strong performance on open-source benchmarks,
they often underperform on real industry evaluation sets. Thanks to
production-oriented optimizations, FunAudio-ASR achieves SOTA performance on
real application datasets, demonstrating its effectiveness and robustness in
practical settings.
[LINK]
http://arxiv.org/abs/2509.12508v1
[DATE]
2025-09-16 07:19:36+08:00
[CATEGORIES]
cs.CL
Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering
[AUTHORS]
Hwan Chang, Yumin Kim, Yonghyun Jun, Hwanhee Lee
[ABSTRACT]
As Large Language Models (LLMs) are increasingly deployed in sensitive
domains such as enterprise and government, ensuring that they adhere to
user-defined security policies within context is critical, especially with
respect to information non-disclosure. While prior LLM studies have focused on
general safety and socially sensitive data, large-scale benchmarks for
contextual security preservation against attacks remain lacking. To address
this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating
LLM adherence to contextual non-disclosure policies in question answering.
Derived from realistic contexts, our dataset includes explicit policies and
queries designed as direct and challenging indirect attacks seeking prohibited
information. We evaluate 10 LLMs on our benchmark and reveal a significant
vulnerability: many models violate user-defined policies and leak sensitive
information. This failure is particularly severe against indirect attacks,
highlighting a critical gap in current LLM safety alignment for sensitive
applications. Our analysis reveals that while models can often identify the
correct answer to a query, they struggle to incorporate policy constraints
during generation. In contrast, they exhibit a partial ability to revise
outputs when explicitly prompted. Our findings underscore the urgent need for
more robust methods to guarantee contextual security.
[COMMENTS]
EMNLP 2025 (Main Conference)
[LINK]
http://arxiv.org/abs/2505.15805v2
[DATE]
2025-09-16 07:11:40+08:00
[CATEGORIES]
cs.CL
UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment
[AUTHORS]
Joseph Marvin Imperial, Abdullah Barayan, Regina Stodden, Rodrigo Wilkens, Ricardo Munoz Sanchez, Lingyun Gao, Melissa Torgbi, Dawn Knight, Gail Forey, Reka R. Jablonkai, Ekaterina Kochmar, Robert Reynolds, Eugénio Ribeiro, Horacio Saggion, Elena Volodina, Sowmya Vajjala, Thomas François, Fernando Alva-Manchego, Harish Tayyar Madabushi
[COMMENTS]
Accepted to EMNLP 2025 (Main Conference)
[LINK]
http://arxiv.org/abs/2506.01419v2
[DATE]
2025-09-16 06:17:42+08:00
[CATEGORIES]
cs.CL
Does Language Model Understand Language?
[AUTHORS]
Suvojit Acharjee, Utathya Aich, Asfak Ali
[ABSTRACT]
Despite advances in natural language generation and understanding, LMs still
struggle with fine-grained linguistic phenomena such as tense, negation, voice,
and modality, which are elements central to effective human communication.
In the context of the United Nations SDG 4, where linguistic clarity is
critical, the deployment of LMs in educational technologies demands careful
scrutiny. As LMs are increasingly powering applications like tutoring systems,
automated grading, and translation, their alignment with human linguistic
interpretation becomes essential for effective learning. In this study, we
conduct an evaluation of SOTA language models across these challenging contexts
in both English and Bengali. To ensure a structured assessment, we introduce a
new Route for Evaluation of Cognitive Inference in Systematic Environments
guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence
pairs in English and Bengali, specifically challenges these models on critical
aspects of language comprehension, including negation, tense, and voice variations.
We assess the performance of SOTA models including MISTRAL-SABA-24B,
LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard
metrics like Pearson correlation, Spearman correlation, and Mean Absolute
Error, as well as a novel, linguistically inspired metric, HCE accuracy. The
HCE accuracy measures how often model predictions fall within one standard
deviation of the mean human rating, thus capturing human-like tolerance for
variability in language interpretation. Our findings highlight Compound-Beta as
the most balanced model, consistently achieving high correlations and low MAEs
across diverse language conditions. It records the highest Pearson correlation
in English and demonstrates robust performance on mixed-language data,
indicating a strong alignment with human judgments in cross-lingual scenarios.
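
The HCE accuracy described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each item has one numeric model prediction and a list of human ratings, and it assumes the population standard deviation is used; the abstract does not specify either detail. The function name `hce_accuracy` is hypothetical.

```python
from statistics import mean, pstdev


def hce_accuracy(predictions, human_ratings):
    """Fraction of predictions within one standard deviation of the mean human rating.

    predictions: list of floats, one model score per item.
    human_ratings: list of lists; human_ratings[i] holds the human scores for item i.
    """
    if len(predictions) != len(human_ratings):
        raise ValueError("predictions and human_ratings must have the same length")
    if not predictions:
        raise ValueError("at least one item is required")

    hits = 0
    for pred, ratings in zip(predictions, human_ratings):
        mu = mean(ratings)
        sigma = pstdev(ratings)  # assumption: population std; the source does not say
        if abs(pred - mu) <= sigma:
            hits += 1
    return hits / len(predictions)
```

Under these assumptions, an item on which all annotators agree exactly has zero tolerance, so only a prediction equal to the shared rating counts as a hit.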
[LINK]
http://arxiv.org/abs/2509.12459v1
[DATE]
2025-09-16 05:09:09+08:00
[CATEGORIES]
cs.CL
Topic Coverage-based Demonstration Retrieval for In-Context Learning
[AUTHORS]
Wonbin Kweon, SeongKu Kang, Runchu Tian, Pengcheng Jiang, Jiawei Han, Hwanjo Yu
[ABSTRACT]
The effectiveness of in-context learning relies heavily on selecting
demonstrations that provide all the necessary information for a given test
input. To achieve this, it is crucial to identify and cover fine-grained
knowledge requirements. However, prior methods often retrieve demonstrations
based solely on embedding similarity or generation probability, resulting in
irrelevant or redundant examples. In this paper, we propose TopicK, a topic
coverage-based retrieval framework that selects demonstrations to
comprehensively cover topic-level knowledge relevant to both the test input and
the model. Specifically, TopicK estimates the topics required by the input and
assesses the model’s knowledge on those topics. TopicK then iteratively selects
demonstrations that introduce previously uncovered required topics, on which
the model exhibits low topical knowledge. We validate the effectiveness of
TopicK through extensive experiments across various datasets and both open- and
closed-source LLMs. Our source code is available at
https://github.com/WonbinKweon/TopicK_EMNLP2025.
[COMMENTS]
EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2509.12451v1
[DATE]
2025-09-16 05:00:28+08:00
[CATEGORIES]
cs.CL
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
[AUTHORS]
Jiayi He, Yangmin Huang, Qianyun Du, Xiangying Zhou, Zhiyang He, Jiaxue Hu, Xiaodong Tao, Lixian Lai
[ABSTRACT]
The increasing deployment of Large Language Models (LLMs) in healthcare
necessitates a rigorous evaluation of their factual reliability. However,
existing benchmarks are often limited by narrow domains of data, failing to
capture the complexity of real-world medical information. To address this
critical gap, we introduce MedFact, a new and challenging benchmark for Chinese
medical fact-checking. MedFact comprises 2,116 expert-annotated instances
curated from diverse real-world texts, spanning 13 medical specialties, 8
fine-grained error types, 4 writing styles, and multiple difficulty levels. Its
construction employs a hybrid AI-human framework where iterative expert
feedback refines an AI-driven, multi-criteria filtering process, ensuring both
high data quality and difficulty. We conduct a comprehensive evaluation of 20
leading LLMs, benchmarking their performance on veracity classification and
error localization against a human expert baseline. Our results reveal that
while models can often determine if a text contains an error, precisely
localizing it remains a substantial challenge, with even top-performing models
falling short of human performance. Furthermore, our analysis uncovers a
frequent “over-criticism” phenomenon, a tendency for models to misidentify
correct information as erroneous, which is exacerbated by advanced reasoning
techniques such as multi-agent collaboration and inference-time scaling. By
highlighting these critical challenges for deploying LLMs in medical
applications, MedFact provides a robust resource to drive the development of
more factually reliable and medically aware models.
[LINK]
http://arxiv.org/abs/2509.12440v1
[DATE]
2025-09-16 04:46:21+08:00
[CATEGORIES]
cs.CL
Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition
[AUTHORS]
Danielle Cohen, Yoni Halpern, Noam Kahlon, Joel Oren, Omri Berkovitch, Sapir Caduri, Ido Dagan, Anatoly Efros
[ABSTRACT]
Understanding user intents from UI interaction trajectories remains a
challenging, yet crucial, frontier in intelligent agent development. While
massive, datacenter-based, multi-modal large language models (MLLMs) possess
greater capacity to handle the complexities of such sequences, smaller models,
which can run on-device to provide a privacy-preserving, low-cost, and
low-latency user experience, struggle with accurate intent inference. We
address these limitations by introducing a novel decomposed approach: first, we
perform structured interaction summarization, capturing key information from
each user action. Second, we perform intent extraction using a fine-tuned model
operating on the aggregated summaries. This method improves intent
understanding in resource-constrained models, even surpassing the base
performance of large MLLMs.
[LINK]
http://arxiv.org/abs/2509.12423v1
[DATE]
2025-09-16 04:20:30+08:00
[CATEGORIES]
cs.CL
Responsible AI in NLP: GUS-Net Span-Level Bias Detection Dataset and Benchmark for Generalizations, Unfairness, and Stereotypes
[AUTHORS]
Maximus Powers, Shaina Raza, Alex Chang, Rehana Riaz, Umang Mavani, Harshitha Reddy Jonala, Ansh Tiwari, Hua Wei
[ABSTRACT]
Representational harms in language technologies often occur in short spans
within otherwise neutral text, where phrases may simultaneously convey
generalizations, unfairness, or stereotypes. Framing bias detection as
sentence-level classification obscures which words carry bias and what type is
present, limiting both auditability and targeted mitigation. We introduce the
GUS-Net Framework, comprising the GUS dataset and a multi-label token-level
detector for span-level analysis of social bias. The GUS dataset contains 3,739
unique snippets across multiple domains, with over 69,000 token-level
annotations. Each token is labeled using BIO tags (Begin, Inside, Outside) for
three pathways of representational harm: Generalizations, Unfairness, and
Stereotypes. To ensure reliable data annotation, we employ an automated
multi-agent pipeline that proposes candidate spans which are subsequently
verified and corrected by human experts. We formulate bias detection as
multi-label token-level classification and benchmark both encoder-based models
(e.g., BERT family variants) and decoder-based large language models (LLMs).
Our evaluations cover token-level identification and span-level entity
recognition on our test set, and out-of-distribution generalization. Empirical
results show that encoder-based models consistently outperform decoder-based
baselines on nuanced and overlapping spans while being more computationally
efficient. The framework delivers interpretable, fine-grained diagnostics that
enable systematic auditing and mitigation of representational harms in
real-world NLP systems.
[LINK]
http://arxiv.org/abs/2410.08388v5
[DATE]
2025-09-16 04:20:14+08:00
[CATEGORIES]
cs.CL
Probing LLM Hallucination from Within: Perturbation-Driven Approach via Internal Knowledge
[AUTHORS]
Seongmin Lee, Hsiang Hsu, Chun-Fu Chen, Duen Horng Chau
[ABSTRACT]
LLM hallucination, where unfaithful text is generated, presents a critical
challenge for LLMs’ practical applications. Current detection methods often
resort to external knowledge, LLM fine-tuning, or supervised training with
large hallucination-labeled datasets. Moreover, these approaches do not
distinguish between different types of hallucinations, which is crucial for
enhancing detection performance. To address such limitations, we introduce
hallucination probing, a new task that classifies LLM-generated text into three
categories: aligned, misaligned, and fabricated. Driven by our novel discovery
that perturbing key entities in prompts affects LLM’s generation of these three
types of text differently, we propose SHINE, a novel hallucination probing
method that does not require external knowledge, supervised training, or LLM
fine-tuning. SHINE is effective in hallucination probing across three modern
LLMs, and achieves state-of-the-art performance in hallucination detection,
outperforming seven competing methods across four datasets and four LLMs,
underscoring the importance of probing for accurate detection.
[COMMENTS]
22 pages, 15 figures
[LINK]
http://arxiv.org/abs/2411.09689v4
[DATE]
2025-09-16 04:18:50+08:00
[CATEGORIES]
cs.CL
MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering
[AUTHORS]
Wen-wai Yim, Asma Ben Abacha, Zixuan Yu, Robert Doerning, Fei Xia, Meliha Yetisgen
[ABSTRACT]
Evaluating natural language generation (NLG) systems in the medical domain
presents unique challenges due to the critical demands for accuracy, relevance,
and domain-specific expertise. Traditional automatic evaluation metrics, such
as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between
high-quality outputs, especially given the open-ended nature of medical
question answering (QA) tasks where multiple valid responses may exist. In this
work, we introduce MORQA (Medical Open-Response QA), a new multilingual
benchmark designed to assess the effectiveness of NLG evaluation metrics across
three medical visual and text-based QA datasets in English and Chinese. Unlike
prior resources, our datasets feature 2-4+ gold-standard answers authored by
medical professionals, along with expert human ratings for three English and
Chinese subsets. We benchmark both traditional metrics and large language model
(LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based
approaches significantly outperform traditional metrics in correlating with
expert judgments. We further analyze factors driving this improvement,
including LLMs’ sensitivity to semantic nuances and robustness to variability
among reference answers. Our results provide the first comprehensive,
multilingual qualitative study of NLG evaluation in the medical domain,
highlighting the need for human-aligned evaluation methods. All datasets and
annotations will be publicly released to support future research.
[COMMENTS]
9 pages, 8 tables
[LINK]
http://arxiv.org/abs/2509.12405v1
[DATE]
2025-09-16 03:51:57+08:00
[CATEGORIES]
cs.CL
Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs
[AUTHORS]
Ayush Gupta, Ramneet Kaur, Anirban Roy, Adam D. Cobb, Rama Chellappa, Susmit Jha
[COMMENTS]
Accepted to EMNLP 2025 main conference
[LINK]
http://arxiv.org/abs/2509.04655v2
[DATE]
2025-09-16 03:42:21+08:00
[CATEGORIES]
cs.CL
Concurrent Linguistic Error Detection (CLED): a New Methodology for Error Detection in Large Language Models
[AUTHORS]
Jinhua Zhu, Javier Conde, Zhen Gao, Pedro Reviriego, Shanshan Liu, Fabrizio Lombardi
[ABSTRACT]
The wide adoption of Large language models (LLMs) makes their dependability a
pressing concern. Detection of errors is the first step to mitigating their
impact on a system and thus, efficient error detection for LLMs is an important
issue. In many settings, the LLM is considered as a black box with no access to
the internal nodes; this prevents the use of many error detection schemes that
need access to the model’s internal nodes. An interesting observation is that
the output of LLMs in error-free operation should be valid and normal text.
Therefore, when the text is not valid or differs significantly from normal
text, it is likely that there is an error. Based on this observation we propose
to perform Concurrent Linguistic Error Detection (CLED); this scheme extracts
some linguistic features of the text generated by the LLM and feeds them to a
concurrent classifier that detects errors. Since the proposed error detection
mechanism only relies on the outputs of the model, it can be used on LLMs
in which there is no access to the internal nodes. The proposed CLED scheme has
been evaluated on the T5 model when used for news summarization and on the
OPUS-MT model when used for translation. In both cases, the same set of
linguistic features has been used for error detection to illustrate the
applicability of the proposed scheme beyond a specific case. The results show
that CLED can detect most of the errors at a low overhead penalty. The use of
the concurrent classifier also enables a trade-off between error detection
effectiveness and its associated overhead, so providing flexibility to a
designer.
[COMMENTS]
11 pages, 6 figures, 30 references
[LINK]
http://arxiv.org/abs/2403.16393v2
[DATE]
2025-09-16 03:36:16+08:00
[CATEGORIES]
cs.CL
cs.LG
SENTRA: Selected-Next-Token Transformer for LLM Text Detection
[AUTHORS]
Mitchell Plyler, Yilun Zhang, Alexander Tuzhilin, Saoud Khalifah, Sen Tian
[ABSTRACT]
LLMs are becoming increasingly capable and widespread. Consequently, the
potential and reality of their misuse is also growing. In this work, we address
the problem of detecting LLM-generated text that is not explicitly declared as
such. We present a novel, general-purpose, and supervised LLM text detector,
SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder
leveraging selected-next-token-probability sequences and utilizing contrastive
pre-training on large amounts of unlabeled data. Our experiments on three
popular public datasets across 24 domains of text demonstrate SENTRA is a
general-purpose classifier that significantly outperforms popular baselines in
the out-of-domain setting.
[COMMENTS]
EMNLP Findings 2025
[LINK]
http://arxiv.org/abs/2509.12385v1
[DATE]
2025-09-16 03:26:17+08:00
[CATEGORIES]
cs.CL
cs.LG
LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation
[AUTHORS]
Anu Pradhan, Alexandra Ortan, Apurv Verma, Madhavan Seshadri
[ABSTRACT]
The evaluation bottleneck in recommendation systems has become particularly
acute with the rise of Generative AI, where traditional metrics fall short of
capturing nuanced quality dimensions that matter in specialized domains like
legal research. Can we trust Large Language Models to serve as reliable judges
of their own kind? This paper investigates LLM-as-a-Judge as a principled
approach to evaluating Retrieval-Augmented Generation systems in legal
contexts, where the stakes of recommendation quality are exceptionally high.
We tackle two fundamental questions that determine practical viability: which
inter-rater reliability metrics best capture the alignment between LLM and
human assessments, and how do we conduct statistically sound comparisons
between competing systems? Through systematic experimentation, we discover that
traditional agreement metrics like Krippendorff’s alpha can be misleading in
the skewed distributions typical of AI system evaluations. Instead, Gwet’s AC2
and rank correlation coefficients emerge as more robust indicators for judge
selection, while the Wilcoxon Signed-Rank Test with Benjamini-Hochberg
corrections provides the statistical rigor needed for reliable system
comparisons.
Our findings suggest a path toward scalable, cost-effective evaluation that
maintains the precision demanded by legal applications, transforming what was
once a human-intensive bottleneck into an automated, yet statistically
principled, evaluation framework.
[COMMENTS]
Accepted in EARL 25: The 2nd Workshop on Evaluating and Applying
Recommender Systems with Large Language Models at RecSys 2025
[LINK]
http://arxiv.org/abs/2509.12382v1
[DATE]
2025-09-16 03:20:21+08:00
[CATEGORIES]
cs.CL
Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
[AUTHORS]
Ujjwala Anantheswaran, Himanshu Gupta, Kevin Scaria, Shreyas Verma, Chitta Baral, Swaroop Mishra
[ABSTRACT]
Large Language Models (LLMs) excel at various tasks, including solving math
word problems (MWPs), but struggle with real-world problems containing
irrelevant information. To address this, we propose a prompting framework that
generates adversarial variants of MWPs by adding irrelevant variables. We
introduce a dataset, PROBLEMATHIC, containing both adversarial and
non-adversarial MWPs. Our experiments reveal that LLMs are susceptible to
distraction by numerical noise, resulting in an average relative performance
drop of ~26% on adversarial MWPs. To mitigate this, we fine-tune LLMs (Llama-2,
Mistral) on the adversarial samples from our dataset. Fine-tuning on
adversarial training instances improves performance on adversarial MWPs by ~8%,
indicating increased robustness to noise and improved ability to identify
relevant data for reasoning. Finally, to assess the generalizability of our
prompting framework, we introduce GSM-8K-Adv, an adversarial variant of the
GSM-8K benchmark. LLMs continue to struggle when faced with adversarial
information, reducing performance by up to 6%.
[COMMENTS]
Published at ICLR 2025 Workshop on Reasoning and Planning for LLMs
[LINK]
http://arxiv.org/abs/2406.15444v5
[DATE]
2025-09-16 03:18:19+08:00
[CATEGORIES]
cs.CL
MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables
[AUTHORS]
Matteo Marcuzzo, Alessandro Zangari, Andrea Albarelli, Jose Camacho-Collados, Mohammad Taher Pilehvar
[ABSTRACT]
As LLMs excel on standard reading comprehension benchmarks, attention is
shifting toward evaluating their capacity for complex abstract reasoning and
inference. Literature-based benchmarks, with their rich narrative and moral
depth, provide a compelling framework for evaluating such deeper comprehension
skills. Here, we present MORABLES, a human-verified benchmark built from fables
and short stories drawn from historical literature. The main task is structured
as multiple-choice questions targeting moral inference, with carefully crafted
distractors that challenge models to go beyond shallow, extractive question
answering. To further stress-test model robustness, we introduce adversarial
variants designed to surface LLM vulnerabilities and shortcuts due to issues
such as data contamination. Our findings show that, while larger models
outperform smaller ones, they remain susceptible to adversarial manipulation
and often rely on superficial patterns rather than true moral reasoning. This
brittleness results in significant self-contradiction, with the best models
refuting their own answers in roughly 20% of cases depending on the framing of
the moral choice. Interestingly, reasoning-enhanced models fail to bridge this
gap, suggesting that scale - not reasoning ability - is the primary driver of
performance.
[COMMENTS]
Accepted to EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2509.12371v1
[DATE]
2025-09-16 03:06:10+08:00
[CATEGORIES]
cs.CL
GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries
[AUTHORS]
Nuno Fachada, Daniel Fernandes, Carlos M. Fernandes, Bruno D. Ferreira-Saraiva, João P. Matos-Carvalho
[ABSTRACT]
Large Language Models (LLMs) have advanced rapidly as tools for automating
code generation in scientific research, yet their ability to interpret and use
unfamiliar Python APIs for complex computational experiments remains poorly
characterized. This study systematically benchmarks a selection of
state-of-the-art LLMs in generating functional Python code for two increasingly
challenging scenarios: conversational data analysis with the ParShift
library, and synthetic data generation and clustering using pyclugen
and scikit-learn. Both experiments use structured, zero-shot prompts
specifying detailed requirements but omitting in-context examples. Model
outputs are evaluated quantitatively for functional correctness and prompt
compliance over multiple runs, and qualitatively by analyzing the errors
produced when code execution fails. Results show that only a small subset of
models consistently generate correct, executable code. GPT-4.1 achieved a 100%
success rate across all runs in both experimental tasks, whereas most other
models succeeded in fewer than half of the runs, with only Grok-3 and
Mistral-Large approaching comparable performance. In addition to benchmarking
LLM performance, this approach helps identify shortcomings in third-party
libraries, such as unclear documentation or obscure implementation bugs.
Overall, these findings highlight current limitations of LLMs for end-to-end
scientific automation and emphasize the need for careful prompt design,
comprehensive library documentation, and continued advances in language model
capabilities.
[COMMENTS]
The peer-reviewed version of this paper is published in Future
Internet at https://doi.org/10.3390/fi17090412. This version is typeset by
the author and differs only in pagination and typographical detail
[LINK]
http://arxiv.org/abs/2508.00033v2
[DATE]
2025-09-16 02:54:46+08:00
[CATEGORIES]
cs.CL
The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models
[AUTHORS]
Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu
[ABSTRACT]
Despite their remarkable progress across diverse domains, Large Language
Models (LLMs) consistently fail at simple character-level tasks, such as
counting letters in words, due to a fundamental limitation: tokenization. In
this work, we frame this limitation as a problem of low mutual information and
analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks
that isolate character-level reasoning in a controlled setting, we show that
such capabilities emerge suddenly and only late in training. We find that
percolation-based models of concept emergence explain these patterns,
suggesting that learning character composition is not fundamentally different
from learning commonsense knowledge. To address this bottleneck, we propose a
lightweight architectural modification that significantly improves
character-level reasoning while preserving the inductive advantages of subword
models. Together, our results bridge low-level perceptual gaps in tokenized LMs
and provide a principled framework for understanding and mitigating their
structural blind spots. We make our code publicly available.
[COMMENTS]
Accepted at EMNLP 2025 Main as Oral Presentation (Top 15% of accepted
papers)
[LINK]
http://arxiv.org/abs/2505.14172v3
[DATE]
2025-09-16 02:36:32+08:00
[CATEGORIES]
cs.CL
Exact Coset Sampling for Quantum Lattice Algorithms
[AUTHORS]
Yifan Zhang
[ABSTRACT]
We give a simple, fully correct, and assumption-light replacement for the
contested “domain-extension” in Step 9 of a recent windowed-QFT lattice
algorithm with complex-Gaussian windows \citep{chen2024quantum}. The published
Step 9 suffers from a periodicity/support mismatch. We present a pair-shift
difference construction that coherently cancels all unknown offsets, produces
an exact uniform CRT-coset state over $\mathbb{Z}_{P}$, and then uses the QFT
to enforce the intended modular linear relation. The unitary is reversible,
uses $\mathrm{poly}(\log M_2)$ gates, and preserves the algorithm’s
asymptotics. Project Page: https://github.com/yifanzhang-pro/quantum-lattice.
[COMMENTS]
Project Page: https://github.com/yifanzhang-pro/quantum-lattice
[LINK]
http://arxiv.org/abs/2509.12341v1
[DATE]
2025-09-16 02:10:28+08:00
[CATEGORIES]
cs.CL
MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch
[AUTHORS]
Nikolay Banar, Ehsan Lotfi, Jens Van Nooten, Cristina Arhiliuc, Marija Kliocaite, Walter Daelemans
[ABSTRACT]
Recently, embedding resources, including models, benchmarks, and datasets,
have been widely released to support a variety of languages. However, the Dutch
language remains underrepresented, typically comprising only a small fraction
of the published multilingual resources. To address this gap and encourage the
further development of Dutch embeddings, we introduce new resources for their
evaluation and generation. First, we introduce the Massive Text Embedding
Benchmark for Dutch (MTEB-NL), which includes both existing Dutch datasets and
newly created ones, covering a wide range of tasks. Second, we provide a
training dataset compiled from available Dutch retrieval datasets, complemented
with synthetic data generated by large language models to expand task coverage
beyond retrieval. Finally, we release a series of E5-NL models, compact yet
efficient embedding models that demonstrate strong performance across multiple
tasks. We make our resources publicly available through the Hugging Face Hub
and the MTEB package.
[LINK]
http://arxiv.org/abs/2509.12340v1
[DATE]
2025-09-16 02:08:08+08:00
[CATEGORIES]
cs.CL
Event2Vec: A Geometric Approach to Learning Composable Representations of Event Sequences
[AUTHORS]
Antonin Sulc
[ABSTRACT]
The study of neural representations, both in biological and artificial
systems, is increasingly revealing the importance of geometric and topological
structures. Inspired by this, we introduce Event2Vec, a novel framework for
learning representations of discrete event sequences. Our model leverages a
simple, additive recurrent structure to learn composable, interpretable
embeddings. We provide a theoretical analysis demonstrating that, under
specific training objectives, our model’s learned representations in a
Euclidean space converge to an ideal additive structure. This ensures that the
representation of a sequence is the vector sum of its constituent events, a
property we term the linear additive hypothesis. To address the limitations of
Euclidean geometry for hierarchical data, we also introduce a variant of our
model in hyperbolic space, which is naturally suited to embedding tree-like
structures with low distortion. We present experiments to validate our
hypothesis and demonstrate the benefits of each geometry, highlighting the
improved performance of the hyperbolic model on hierarchical event sequences.
[COMMENTS]
10 pages, 3 figures, Symmetry and Geometry in Neural Representations
Workshop at NeurIPS (NeurReps) 2025
[LINK]
http://arxiv.org/abs/2509.12188v1
[DATE]
2025-09-16 01:51:02+08:00
[CATEGORIES]
cs.LG
cs.CL
RAGs to Riches: RAG-like Few-shot Learning for Large Language Model Role-playing
[AUTHORS]
Timothy Rupprecht, Enfu Nan, Arash Akbari, Arman Akbari, Lei Lu, Priyanka Maan, Sean Duffy, Pu Zhao, Yumei He, David Kaeli, Yanzhi Wang
[ABSTRACT]
Role-playing Large language models (LLMs) are increasingly deployed in
high-stakes domains such as healthcare, education, and governance, where
failures can directly impact user trust and well-being. A cost-effective
paradigm for LLM role-playing is few-shot learning, but existing approaches
often cause models to break character in unexpected and potentially harmful
ways, especially when interacting with hostile users. Inspired by
Retrieval-Augmented Generation (RAG), we reformulate LLM role-playing into a
text retrieval problem and propose a new prompting framework called
RAGs-to-Riches, which leverages curated reference demonstrations to condition
LLM responses. We evaluate our framework with LLM-as-a-judge preference voting
and introduce two novel token-level ROUGE metrics: Intersection over Output
(IOO) to quantify how much an LLM improvises and Intersection over References
(IOR) to measure the utilization rate of few-shot demonstrations during the evaluation
tasks. When simulating interactions with a hostile user, our prompting strategy
incorporates in its responses during inference an average of 35% more tokens
from the reference demonstrations. As a result, across 453 role-playing
interactions, our models are consistently judged as being more authentic, and
remain in-character more often than zero-shot and in-context learning (ICL)
methods. Our method presents a scalable strategy for building robust,
human-aligned LLM role-playing frameworks.
[LINK]
http://arxiv.org/abs/2509.12168v1
[DATE]
2025-09-16 01:31:15+08:00
[CATEGORIES]
cs.CL
Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation
[AUTHORS]
Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Dongzhan Zhou, Xiao-yong Wei, Qing Li
[ABSTRACT]
Recently, Large Language Models (LLMs) have shown great potential in natural
language-driven molecule discovery. However, existing datasets and benchmarks
for molecule-text alignment are predominantly built on a one-to-one mapping,
measuring LLMs’ ability to retrieve a single, pre-defined answer, rather than
their creative potential to generate diverse, yet equally valid, molecular
candidates. To address this critical gap, we propose Speak-to-Structure
(S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural
language-driven molecule generation. S^2-Bench is specifically designed for
one-to-many relationships, challenging LLMs to demonstrate genuine molecular
understanding and generation capabilities. Our benchmark includes three key
tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and
customized molecule generation (MolCustom), each probing a different aspect of
molecule discovery. We also introduce OpenMolIns, a large-scale instruction
tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like
GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 28 LLMs
shifts the focus from simple pattern recall to realistic molecular design,
paving the way for more capable LLMs in natural language-driven molecule
discovery.
[COMMENTS]
Our codes and datasets are available through
https://github.com/phenixace/TOMG-Bench
[LINK]
http://arxiv.org/abs/2412.14642v3
[DATE]
2025-09-16 01:29:42+08:00
[CATEGORIES]
cs.CL
Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation
[AUTHORS]
Hongxiang Zhang, Hao Chen, Muhao Chen, Tianyi Zhang
[ABSTRACT]
Recent decoding methods improve the factuality of large language models
(LLMs) by refining how the next token is selected during generation. These
methods typically operate at the token level, leveraging internal
representations to suppress superficial patterns. Nevertheless, LLMs remain
prone to hallucinations, especially over longer contexts. In this paper, we
propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy
that actively decides when to apply contrasting layers during generation. By
casting decoding as a sequential decision-making problem, ActLCD employs a
reinforcement learning policy guided by a reward-aware classifier to optimize
factuality beyond the token level. Our experiments demonstrate that ActLCD
surpasses state-of-the-art methods across five benchmarks, showcasing its
effectiveness in mitigating hallucinations in diverse generation scenarios.
[COMMENTS]
19 pages, 3 figures, EMNLP 2025
[LINK]
http://arxiv.org/abs/2505.23657v3
[DATE]
2025-09-16 01:26:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Pun Unintended: LLMs and the Illusion of Humor Understanding
[AUTHORS]
Alessandro Zangari, Matteo Marcuzzo, Andrea Albarelli, Mohammad Taher Pilehvar, Jose Camacho-Collados
[COMMENTS]
Accepted to EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2509.12158v1
[DATE]
2025-09-16 01:22:30+08:00
[CATEGORIES]
cs.CL
XplaiNLP at CheckThat! 2025: Multilingual Subjectivity Detection with Finetuned Transformers and Prompt-Based Inference with Large Language Models
[AUTHORS]
Ariana Sahitaj, Jiaao Li, Pia Wenzel Neves, Fedor Splitt, Premtim Sahitaj, Charlott Jakob, Veronika Solopova, Vera Schmitt
[ABSTRACT]
This notebook reports the XplaiNLP submission to the CheckThat! 2025 shared
task on multilingual subjectivity detection. We evaluate two approaches: (1)
supervised fine-tuning of transformer encoders, EuroBERT, XLM-RoBERTa, and
German-BERT, on monolingual and machine-translated training data; and (2)
zero-shot prompting using two LLMs: o3-mini for Annotation (rule-based
labelling) and gpt-4.1-mini for DoubleDown (contrastive rewriting) and
Perspective (comparative reasoning). The Annotation Approach achieves 1st place
in the Italian monolingual subtask with an F_1 score of 0.8104, outperforming
the baseline of 0.6941. In the Romanian zero-shot setting, the fine-tuned
XLM-RoBERTa model obtains an F_1 score of 0.7917, ranking 3rd and exceeding the
baseline of 0.6461. The same model also performs reliably in the multilingual
task and improves over the baseline in Greek. For German, a German-BERT model
fine-tuned on translated training data from typologically related languages
yields competitive performance over the baseline. In contrast, performance in
the Ukrainian and Polish zero-shot settings falls slightly below the respective
baselines, reflecting the challenge of generalization in low-resource
cross-lingual scenarios.
[LINK]
http://arxiv.org/abs/2509.12130v1
[DATE]
2025-09-16 00:53:41+08:00
[CATEGORIES]
cs.CL
When marine radar target detection meets pretrained large language models
[AUTHORS]
Qiying Hu, Linping Zhang, Xueqian Wang, Gang Li, Yu Liu, Xiao-Ping Zhang
[ABSTRACT]
Deep learning (DL) methods are widely used to extract high-dimensional
patterns from the sequence features of radar echo signals. However,
conventional DL algorithms face challenges such as redundant feature segments
and constraints from restricted model sizes. To address these issues, we
propose a framework that integrates feature preprocessing with large language
models (LLMs). Our preprocessing module tokenizes radar sequence features,
applies a patch selection algorithm to filter out uninformative segments, and
projects the selected patches into embeddings compatible with the feature space
of pre-trained LLMs. Leveraging these refined embeddings, we incorporate a
pre-trained LLM, fine-tuning only the normalization layers to reduce training
burdens while enhancing performance. Experiments on measured datasets
demonstrate that the proposed method significantly outperforms the
state-of-the-art baselines on supervised learning tests.
[LINK]
http://arxiv.org/abs/2509.12110v1
[DATE]
2025-09-16 00:38:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations
[AUTHORS]
Kaixiang Zhang, Justine Zhang, Cristian Danescu-Niculescu-Mizil
[ABSTRACT]
An intrinsic aspect of every conversation is the way talk-time is shared
between multiple speakers. Conversations can be balanced, with each speaker
claiming a similar amount of talk-time, or imbalanced when one talks
disproportionately. Such overall distributions are the consequence of
continuous negotiations between the speakers throughout the conversation: who
should be talking at every point in time, and for how long? In this work we
introduce a computational framework for quantifying both the conversation-level
distribution of talk-time between speakers, as well as the lower-level dynamics
that lead to it. We derive a typology of talk-time sharing dynamics structured
by several intuitive axes of variation. By applying this framework to a large
dataset of video-chats between strangers, we confirm that, perhaps
unsurprisingly, different conversation-level distributions of talk-time are
perceived differently by speakers, with balanced conversations being preferred
over imbalanced ones, especially by those who end up talking less. Then we
reveal that – even when they lead to the same level of overall balance –
different types of talk-time sharing dynamics are perceived differently by the
participants, highlighting the relevance of our newly introduced typology.
Finally, we discuss how our framework offers new tools to designers of
computer-mediated communication platforms, for both human-human and human-AI
communication.
[COMMENTS]
Accepted for publication at CSCW 2025. Code and data available in
ConvoKit (https://convokit.cornell.edu)
[LINK]
http://arxiv.org/abs/2506.20474v3
[DATE]
2025-09-16 00:34:13+08:00
[CATEGORIES]
cs.CL
In-domain SSL pre-training and streaming ASR
[AUTHORS]
Jarod Duret, Salima Mdhaffar, Gaëlle Laperrière, Ryan Whetten, Audrey Galametz, Catherine Kobus, Marion-Cécile Martin, Jo Oleiwan, Yannick Estève
[ABSTRACT]
In this study, we investigate the benefits of domain-specific self-supervised
pre-training for both offline and streaming ASR in Air Traffic Control (ATC)
environments. We train BEST-RQ models on 4.5k hours of unlabeled ATC data, then
fine-tune on a smaller supervised ATC set. To enable real-time processing, we
propose using chunked attention and dynamic convolutions, ensuring low-latency
inference. We compare these in-domain SSL models against state-of-the-art,
general-purpose speech encoders such as w2v-BERT 2.0 and HuBERT. Results show
that domain-adapted pre-training substantially improves performance on standard
ATC benchmarks, significantly reducing word error rates when compared to models
trained on broad speech corpora. Moreover, the proposed streaming approach
further improves word error rate under tighter latency constraints, making it
particularly suitable for safety-critical aviation applications. These findings
highlight that specializing SSL representations for ATC data is a practical
path toward more accurate and efficient ASR systems in real-world operational
settings.
[COMMENTS]
Accepted to SPECOM 2025
[LINK]
http://arxiv.org/abs/2509.12101v1
[DATE]
2025-09-16 00:25:43+08:00
[CATEGORIES]
cs.CL
Is ‘Hope’ a person or an idea? A pilot benchmark for NER: comparing traditional NLP tools and large language models on ambiguous entities
[AUTHORS]
Payam Latifi
[ABSTRACT]
This pilot study presents a small-scale but carefully annotated benchmark of
Named Entity Recognition (NER) performance across six systems: three non-LLM
NLP tools (NLTK, spaCy, Stanza) and three general-purpose large language models
(LLMs: Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B). The dataset contains 119
tokens covering five entity types (PERSON, LOCATION, ORGANIZATION, DATE, TIME).
We evaluated each system’s output against the manually annotated gold standard
dataset using F1-score. The results show that LLMs generally outperform
conventional tools in recognizing context-sensitive entities like person names,
with Gemini achieving the highest average F1-score. However, traditional
systems like Stanza demonstrate greater consistency in structured tags such as
LOCATION and DATE. We also observed variability among LLMs, particularly in
handling temporal expressions and multi-word organizations. Our findings
highlight that while LLMs offer improved contextual understanding, traditional
tools remain competitive in specific tasks, informing model selection.
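For readers unfamiliar with the metric, the sketch below shows one common way to compute F1 against a gold standard. Representing annotations as sets of (text, label) pairs, the helper name `f1_score`, and the example annotations are all assumptions for illustration; the abstract does not describe the exact scoring protocol used in the study.

```python
def f1_score(gold, predicted):
    """Micro F1 over sets of (text, label) entity annotations."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations: one of three predicted labels is wrong.
gold = {("Hope", "PERSON"), ("Paris", "LOCATION"), ("Monday", "DATE")}
pred = {("Hope", "PERSON"), ("Paris", "ORGANIZATION"), ("Monday", "DATE")}
score = f1_score(gold, pred)  # precision = recall = 2/3
```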
[COMMENTS]
14 pages, 9 figures, 2 tables. This is a pilot study evaluating six
NER systems – three traditional tools (NLTK, spaCy, Stanza) and three LLMs
(Gemini-1.5-flash, DeepSeek-V3, Qwen-3-4B) – on a small, ambiguity-rich
dataset of 119 tokens. The annotated dataset and prompts are provided in
appendices for full reproducibility. All experiments were conducted on 14 May
2025
[LINK]
http://arxiv.org/abs/2509.12098v1
[DATE]
2025-09-16 00:21:59+08:00
[CATEGORIES]
cs.CL
SURGIN: SURrogate-guided Generative INversion for subsurface multiphase flow with quantified uncertainty
[AUTHORS]
Zhao Feng, Bicheng Yan, Luanxiao Zhao, Xianda Shen, Renyu Zhao, Wenhao Wang, Fengshou Zhang
[ABSTRACT]
We present a direct inverse modeling method named SURGIN, a SURrogate-guided
Generative INversion framework tailored for subsurface multiphase flow data
assimilation. Unlike existing inversion methods that require adaptation for
each new observational configuration, SURGIN features a zero-shot conditional
generation capability, enabling real-time assimilation of unseen monitoring
data without task-specific retraining. Specifically, SURGIN synergistically
integrates a U-Net enhanced Fourier Neural Operator (U-FNO) surrogate with a
score-based generative model (SGM), framing the conditional generation as a
surrogate prediction-guidance process in a Bayesian perspective. Instead of
directly learning the conditional generation of geological parameters, an
unconditional SGM is first pretrained in a self-supervised manner to capture
the geological prior, after which posterior sampling is performed by leveraging
a differentiable U-FNO surrogate to enable efficient forward evaluations
conditioned on unseen observations. Extensive numerical experiments demonstrate
SURGIN’s capability to decently infer heterogeneous geological fields and
predict spatiotemporal flow dynamics with quantified uncertainty across diverse
measurement settings. By unifying generative learning with surrogate-guided
Bayesian inference, SURGIN establishes a new paradigm for inverse modeling and
uncertainty quantification in parametric functional spaces.
[LINK]
http://arxiv.org/abs/2509.13189v1
[DATE]
2025-09-16 23:42:22+08:00
[CATEGORIES]
cs.LG
Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy
[AUTHORS]
Yunchuan Guan, Yu Liu, Ke Zhou, Zhiqi Shen, Jenq-Neng Hwang, Serge Belongie, Lei Li
[ABSTRACT]
Meta-learning is a powerful paradigm for tackling few-shot tasks. However,
recent studies indicate that models trained with the whole-class training
strategy can achieve comparable performance to those trained with meta-learning
in few-shot classification tasks. To demonstrate the value of meta-learning, we
establish an entropy-limited supervised setting for fair comparisons. Through
both theoretical analysis and experimental validation, we establish that
meta-learning has a tighter generalization bound compared to whole-class
training. We show that meta-learning is more efficient with limited entropy
and is more robust to label noise and heterogeneous tasks, making it
well-suited for unsupervised tasks. Based on these insights, we propose MINO, a
meta-learning framework designed to enhance unsupervised performance. MINO
utilizes the adaptive clustering algorithm DBSCAN with a dynamic head for
unsupervised task construction and a stability-based meta-scaler for robustness
against label noise. Extensive experiments confirm its effectiveness in
multiple unsupervised few-shot and zero-shot tasks.
[COMMENTS]
Accepted by ICCV 2025
[LINK]
http://arxiv.org/abs/2509.13185v1
[DATE]
2025-09-16 23:39:03+08:00
[CATEGORIES]
cs.LG
Efficient Cold-Start Recommendation via BPE Token-Level Embedding Initialization with LLM
[AUTHORS]
Yushang Zhao, Xinyue Han, Qian Leng, Qianyi Sun, Haotian Lyu, Chengrui Zhou
[ABSTRACT]
The cold-start problem is a central challenge in recommender systems,
especially when no past interaction data is available for new users or new
items. Content-based features and hybrid approaches are the conventional
solutions, but they can only work in a sparse metadata environment
with shallow patterns. In this paper, we present an efficient cold-start
recommendation strategy based on subword-level representations, applying
Byte Pair Encoding (BPE) tokenization and pre-trained Large Language
Model (LLM) embeddings in the initialization procedure. We obtain fine-grained
token-level vectors that are aligned with the BPE vocabulary as opposed to
using coarse-grained sentence embeddings. Together, these token embeddings can
be used as dense semantic priors on unseen entities, making immediate
recommendation performance possible without user-item interaction history. Our
mechanism can be compared to collaborative filtering systems and tested over
benchmark datasets with stringent cold-start assumptions. Experimental findings
show that the given BPE-LLM method achieves higher Recall@k, NDCG@k, and Hit
Rate measurements compared to the standard baseline and displays the same
capability of sufficient computational performance. Furthermore, we demonstrate
that using subword-aware embeddings yields better generalizability and is more
interpretable, especially within a multilingual and sparse input setting. This
work indicates that token-level semantic initialization is a practical,
lightweight, yet effective extension to modern recommender systems in the
zero-shot setting.
[LINK]
http://arxiv.org/abs/2509.13179v1
[DATE]
2025-09-16 23:32:51+08:00
[CATEGORIES]
cs.LG
Any-Step Density Ratio Estimation via Interval-Annealed Secant Alignment
[AUTHORS]
Wei Chen, Shigui Li, Jiacheng Li, Jian Xu, Zhiqi Lin, Junmei Yang, Delu Zeng, John Paisley, Qibin Zhao
[ABSTRACT]
Estimating density ratios is a fundamental problem in machine learning, but
existing methods often trade off accuracy for efficiency. We propose
Interval-annealed Secant Alignment Density Ratio Estimation (ISA-DRE),
a framework that enables accurate, any-step estimation without numerical
integration.
Instead of modeling infinitesimal tangents as in prior methods, ISA-DRE
learns a global secant function, defined as the expectation of all tangents
over an interval, with provably lower variance, making it more suitable for
neural approximation. This is made possible by the Secant Alignment
Identity, a self-consistency condition that formally connects the secant with
its underlying tangent representations.
To mitigate instability during early training, we introduce Contraction
Interval Annealing, a curriculum strategy that gradually expands the alignment
interval during training. This process induces a contraction mapping, which
improves convergence and training stability.
Empirically, ISA-DRE achieves competitive accuracy with significantly fewer
function evaluations compared to prior methods, resulting in much faster
inference and making it well suited for real-time and interactive applications.
[LINK]
http://arxiv.org/abs/2509.04852v2
[DATE]
2025-09-16 23:29:33+08:00
[CATEGORIES]
cs.LG
Concentration inequalities for semidefinite least squares based on data
[AUTHORS]
Filippo Fabiani, Andrea Simonetto
[ABSTRACT]
We study data-driven least squares (LS) problems with semidefinite (SD)
constraints and derive finite-sample guarantees on the spectrum of their
optimal solutions when these constraints are relaxed. In particular, we provide
a high confidence bound allowing one to solve a simpler program in place of the
full SDLS problem, while ensuring that the eigenvalues of the resulting
solution are $\varepsilon$-close to those enforced by the SD constraints. The
developed certificate, which consistently shrinks as the number of data points
increases, turns out to be easy to compute, distribution-free, and only
requires independent and identically distributed samples. Moreover, when the
SDLS is used to learn an unknown quadratic function, we establish bounds on the
error between a gradient descent iterate minimizing the surrogate cost obtained
with no SD constraints and the true minimizer.
[LINK]
http://arxiv.org/abs/2509.13166v1
[DATE]
2025-09-16 23:17:37+08:00
[CATEGORIES]
cs.LG
On the Correlation between Individual Fairness and Predictive Accuracy in Probabilistic Models
[AUTHORS]
Alessandro Antonucci, Eric Rossetto, Ivan Duvnjak
[ABSTRACT]
We investigate individual fairness in generative probabilistic classifiers by
analysing the robustness of posterior inferences to perturbations in private
features. Building on established results in robustness analysis, we
hypothesise a correlation between robustness and predictive accuracy,
specifically, instances exhibiting greater robustness are more likely to be
classified accurately. We empirically assess this hypothesis using a benchmark
of fourteen datasets with fairness concerns, employing Bayesian networks as the
underlying generative models. To address the computational complexity
associated with robustness analysis over multiple private features with
Bayesian networks, we reformulate the problem as a most probable explanation
task in an auxiliary Markov random field. Our experiments confirm the
hypothesis about the correlation, suggesting novel directions to mitigate the
traditional trade-off between fairness and accuracy.
[COMMENTS]
15 pages, 9 figures, 1 table
[LINK]
http://arxiv.org/abs/2509.13165v1
[DATE]
2025-09-16 23:17:13+08:00
[CATEGORIES]
cs.LG
Geoff: The Generic Optimization Framework & Frontend for Particle Accelerator Controls
[AUTHORS]
Penelope Madysa, Sabrina Appel, Verena Kain, Michael Schenk
[ABSTRACT]
Geoff is a collection of Python packages that form a framework for automation
of particle accelerator controls. With particle accelerator laboratories around
the world researching machine learning techniques to improve accelerator
performance and uptime, a multitude of approaches and algorithms have emerged.
The purpose of Geoff is to harmonize these approaches and to minimize friction
when comparing or migrating between them. It provides standardized interfaces
for optimization problems, utility functions to speed up development, and a
reference GUI application that ties everything together. Geoff is an
open-source library developed at CERN and maintained and updated in
collaboration between CERN and GSI as part of the EURO-LABS project. This paper
gives an overview of Geoff’s design, features, and current usage.
[COMMENTS]
18 pages, 5 figures. Submitted to SoftwareX
[LINK]
http://arxiv.org/abs/2506.03796v3
[DATE]
2025-09-16 23:03:48+08:00
[CATEGORIES]
cs.LG
Learning from Heterophilic Graphs: A Spectral Theory Perspective on the Impact of Self-Loops and Parallel Edges
[AUTHORS]
Kushal Bose, Swagatam Das
[ABSTRACT]
Graph heterophily poses a formidable challenge to the performance of
Message-passing Graph Neural Networks (MP-GNNs). The familiar low-pass filters
like Graph Convolutional Networks (GCNs) face performance degradation, which
can be attributed to the blending of the messages from dissimilar neighboring
nodes. The performance of the low-pass filters on heterophilic graphs still
requires an in-depth analysis. In this context, we update the heterophilic
graphs by adding a number of self-loops and parallel edges. We observe that
eigenvalues of the graph Laplacian decrease and increase respectively by
increasing the number of self-loops and parallel edges. We conduct several
studies regarding the performance of GCN on various benchmark heterophilic
networks by adding either self-loops or parallel edges. The studies reveal that
the GCN exhibited either increasing or decreasing performance trends on adding
self-loops and parallel edges. In light of the studies, we established
connections between the graph spectra and the performance trends of the
low-pass filters on the heterophilic graphs. The graph spectra characterize the
essential intrinsic properties of the input graph like the presence of
connected components, sparsity, average degree, cluster structures, etc. Our
work is adept at seamlessly evaluating graph spectrum and properties by
observing the performance trends of the low-pass filters without pursuing the
costly eigenvalue decomposition. The theoretical foundations are also discussed
to validate the impact of adding self-loops and parallel edges on the graph
spectrum.
[LINK]
http://arxiv.org/abs/2509.13139v1
[DATE]
2025-09-16 22:54:54+08:00
[CATEGORIES]
cs.LG
Discovering Mathematical Equations with Diffusion Language Model
[AUTHORS]
Xiaoxu Han, Chengzhen Ning, Jinghui Zhong, Fubiao Yang, Yu Wang, Xin Mu
[ABSTRACT]
Discovering valid and meaningful mathematical equations from observed data
plays a crucial role in scientific discovery. However, this task, symbolic
regression, remains challenging due to the vast search space and the trade-off
between accuracy and complexity. In this paper, we introduce DiffuSR, a
pre-training framework for symbolic regression built upon a continuous-state
diffusion language model. DiffuSR employs a trainable embedding layer within
the diffusion process to map discrete mathematical symbols into a continuous
latent space, modeling equation distributions effectively. Through iterative
denoising, DiffuSR converts an initial noisy sequence into a symbolic equation,
guided by numerical data injected via a cross-attention mechanism. We also
design an effective inference strategy to enhance the accuracy of the
diffusion-based equation generator, which injects logit priors into genetic
programming. Experimental results on standard symbolic regression benchmarks
demonstrate that DiffuSR achieves competitive performance with state-of-the-art
autoregressive methods and generates more interpretable and diverse
mathematical expressions.
[LINK]
http://arxiv.org/abs/2509.13136v1
[DATE]
2025-09-16 22:53:44+08:00
[CATEGORIES]
cs.LG
Second-Order Tensorial Partial Differential Equations on Graphs
[AUTHORS]
Aref Einizade, Fragkiskos D. Malliaros, Jhony H. Giraldo
[ABSTRACT]
Processing data on multiple interacting graphs is crucial for many
applications, but existing approaches rely mostly on discrete filtering or
first-order continuous models, dampening high frequencies and slowing information
propagation. In this paper, we introduce second-order tensorial partial
differential equations on graphs (SoTPDEG) and propose the first theoretically
grounded framework for second-order continuous product graph neural networks
(GNNs). Our method exploits the separability of cosine kernels in Cartesian
product graphs to enable efficient spectral decomposition while preserving
high-frequency components. We further provide rigorous over-smoothing and
stability analysis under graph perturbations, establishing a solid theoretical
foundation. Experimental results on spatiotemporal traffic forecasting
illustrate the superiority over the compared methods.
[COMMENTS]
9 pages, 1 figure
[LINK]
http://arxiv.org/abs/2509.02015v3
[DATE]
2025-09-16 22:41:12+08:00
[CATEGORIES]
cs.LG
Informed Correctors for Discrete Diffusion Models
[AUTHORS]
Yixiu Zhao, Jiaxin Shi, Feng Chen, Shaul Druckmann, Lester Mackey, Scott Linderman
[ABSTRACT]
Discrete diffusion has emerged as a powerful framework for generative
modeling in discrete domains, yet efficiently sampling from these models
remains challenging. Existing sampling strategies often struggle to balance
computation and sample quality when the number of sampling steps is reduced,
even when the model has learned the data distribution well. To address these
limitations, we propose a predictor-corrector sampling scheme where the
corrector is informed by the diffusion model to more reliably counter the
accumulating approximation errors. To further enhance the effectiveness of our
informed corrector, we introduce complementary architectural modifications
based on hollow transformers and a simple tailored training objective that
leverages more training signal. We use a synthetic example to illustrate the
failure modes of existing samplers and show how informed correctors alleviate
these problems. On the text8 and tokenized ImageNet 256x256 datasets, our
informed corrector consistently produces superior samples with fewer errors or
improved FID scores for discrete diffusion models. These results underscore the
potential of informed correctors for fast and high-fidelity generation using
discrete diffusion. Our code is available at
https://github.com/lindermanlab/informed-correctors.
[LINK]
http://arxiv.org/abs/2407.21243v4
[DATE]
2025-09-16 22:31:25+08:00
[CATEGORIES]
cs.LG
Sublinear-Time Algorithms for Diagonally Dominant Systems and Applications to the Friedkin-Johnsen Model
[AUTHORS]
Weiming Feng, Zelin Li, Pan Peng
[ABSTRACT]
We study sublinear-time algorithms for solving linear systems $Sz = b$, where
$S$ is a diagonally dominant matrix, i.e., $|S_{ii}| \geq \delta + \sum_{j \ne
i} |S_{ij}|$ for all $i \in [n]$, for some $\delta \geq 0$. We present
randomized algorithms that, for any $u \in [n]$, return an estimate $z_u$ of
$z^*_u$ with additive error $\varepsilon$ or $\varepsilon \lVert
z^*\rVert_\infty$, where $z^*$ is some solution to $Sz^* = b$, and the
algorithm only needs to read a small portion of the input $S$ and $b$. For
example, when the additive error is $\varepsilon$ and assuming $\delta>0$, we
give an algorithm that runs in time $O\left( \frac{\|b\|_\infty^2
S_{\max}}{\delta^3 \varepsilon^2} \log \frac{\| b \|_\infty}{\delta
\varepsilon} \right)$, where $S_{\max} = \max_{i \in [n]} |S_{ii}|$. We also
prove a matching lower bound, showing that the linear dependence on $S_{\max}$
is optimal. Unlike previous sublinear-time algorithms, which apply only to
symmetric diagonally dominant matrices with non-negative diagonal entries, our
algorithm works for general strictly diagonally dominant matrices ($\delta >
0$) and a broader class of non-strictly diagonally dominant matrices $(\delta =
0)$. Our approach is based on analyzing a simple probabilistic recurrence
satisfied by the solution. As an application, we obtain an improved
sublinear-time algorithm for opinion estimation in the Friedkin–Johnsen model.
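To make the diagonal dominance condition concrete, the sketch below computes the largest $\delta$ for which $|S_{ii}| \geq \delta + \sum_{j \ne i} |S_{ij}|$ holds in every row. The helper name `dominance_margin` and the example matrix are assumptions for illustration. This check reads the whole matrix, so it is not one of the sublinear-time algorithms described in the abstract.

```python
def dominance_margin(S):
    """Largest delta with |S[i][i]| >= delta + sum_{j != i} |S[i][j]| for all i.

    A positive result means strictly diagonally dominant, zero means
    non-strictly dominant, negative means not diagonally dominant.
    """
    margins = []
    for i, row in enumerate(S):
        off_diag = sum(abs(x) for j, x in enumerate(row) if j != i)
        margins.append(abs(row[i]) - off_diag)
    return min(margins)


# Hypothetical non-symmetric matrix with a negative diagonal entry.
S = [[4.0, -1.0, 1.0],
     [0.5, -3.0, 1.0],
     [1.0, 1.0, 2.0]]
delta = dominance_margin(S)  # row margins are 2.0, 1.5, 0.0
```

The last row has margin zero, so this example falls in the non-strict case ($\delta = 0$) that the abstract treats separately from the strict one.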
[LINK]
http://arxiv.org/abs/2509.13112v1
[DATE]
2025-09-16 22:13:31+08:00
[CATEGORIES]
cs.LG
Quantifying The Limits of AI Reasoning: Systematic Neural Network Representations of Algorithms
[AUTHORS]
Anastasis Kratsios, Dennis Zvigelsky, Bradd Hart
[ABSTRACT]
A main open question in contemporary AI research is quantifying the forms of
reasoning neural networks can perform when perfectly trained. This paper
answers this by interpreting reasoning tasks as circuit emulation, where the
gates define the type of reasoning; e.g. Boolean gates for predicate logic,
tropical circuits for dynamic programming, arithmetic and analytic gates for
symbolic mathematical representation, and hybrids thereof for deeper reasoning;
e.g. higher-order logic.
We present a systematic meta-algorithm that converts essentially any circuit
into a feedforward neural network (NN) with ReLU activations by iteratively
replacing each gate with a canonical ReLU MLP emulator. We show that, on any
digital computer, our construction emulates the circuit exactly (no
approximation, no rounding, modular overflow included), demonstrating that no
reasoning task lies beyond the reach of neural networks. The number of neurons
in the resulting network (parametric complexity) scales with the circuit’s
complexity, and the network’s computational graph (structure) mirrors that of
the emulated circuit. This formalizes the folklore that NNs trade
algorithmic run-time (circuit runtime) for space complexity (number of
neurons).
We derive a range of applications of our main result, from emulating
shortest-path algorithms on graphs with cubic-size NNs, to simulating stopped
Turing machines with roughly quadratically large NNs, and even the emulation
of randomized Boolean circuits. Lastly, we demonstrate that our result is
strictly more powerful than a classical universal approximation theorem: any
universal function approximator can be encoded as a circuit and directly
emulated by an NN.
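The gate-by-gate replacement idea can be illustrated with well-known ReLU identities for Boolean gates on inputs in {0, 1}. The sketch below is only an example of exact emulation with integer arithmetic. The function names are hypothetical, and these identities are not claimed to be the canonical emulators constructed in the paper.

```python
def relu(x):
    return max(0, x)


def and_gate(a, b):
    # ReLU(a + b - 1) equals 1 only when both inputs are 1.
    return relu(a + b - 1)


def or_gate(a, b):
    # 1 - ReLU(1 - a - b) equals 0 only when both inputs are 0.
    return 1 - relu(1 - a - b)


def not_gate(a):
    # Affine map; no ReLU is needed on {0, 1} inputs.
    return 1 - a


def xor_circuit(a, b):
    # XOR as a small circuit: (a OR b) AND NOT (a AND b).
    # Each gate is replaced by its ReLU emulator, so the network's
    # structure mirrors the circuit's structure.
    return and_gate(or_gate(a, b), not_gate(and_gate(a, b)))
```

Because every intermediate value stays in {0, 1}, the composed network reproduces the circuit's truth table exactly rather than approximately.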
[COMMENTS]
18 pages main body, 45 pages total + references
[LINK]
http://arxiv.org/abs/2508.18526v2
[DATE]
2025-09-16 22:10:46+08:00
[CATEGORIES]
cs.LG
Convex Regularization and Convergence of Policy Gradient Flows under Safety Constraints
[AUTHORS]
Pekka Malo, Lauri Viitasaari, Antti Suominen, Eeva Vilkkumaa, Olli Tahvonen
[ABSTRACT]
This paper examines reinforcement learning (RL) in infinite-horizon decision
processes with almost-sure safety constraints, crucial for applications like
autonomous systems, finance, and resource management. We propose a
doubly-regularized RL framework combining reward and parameter regularization
to address safety constraints in continuous state-action spaces. The problem is
formulated as a convex regularized objective with parametrized policies in the
mean-field regime. Leveraging mean-field theory and Wasserstein gradient flows,
policies are modeled on an infinite-dimensional statistical manifold, with
updates governed by parameter distribution gradient flows. Key contributions
include solvability conditions for safety-constrained problems, smooth bounded
approximations for gradient flows, and exponential convergence guarantees under
sufficient regularization. General regularization conditions, including entropy
regularization, support practical particle method implementations. This
framework provides robust theoretical insights and guarantees for safe RL in
complex, high-dimensional settings.
[COMMENTS]
29 pages
[LINK]
http://arxiv.org/abs/2411.19193v2
[DATE]
2025-09-16 22:10:15+08:00
[CATEGORIES]
cs.LG
Safe Learning Under Irreversible Dynamics via Asking for Help
[AUTHORS]
Benjamin Plaut, Juan Liévano-Karim, Hanlin Zhu, Stuart Russell
[ABSTRACT]
Most learning algorithms with formal regret guarantees essentially rely on
trying all possible behaviors, which is problematic when some errors cannot be
recovered from. Instead, we allow the learning agent to ask for help from a
mentor and to transfer knowledge between similar states. We show that this
combination enables the agent to learn both safely and effectively. Under
standard online learning assumptions, we provide an algorithm whose regret and
number of mentor queries are both sublinear in the time horizon for any Markov
Decision Process (MDP), including MDPs with irreversible dynamics. Our proof
involves a sequence of three reductions which may be of independent interest.
Conceptually, our result may be the first formal proof that it is possible for
an agent to obtain high reward while becoming self-sufficient in an unknown,
unbounded, and high-stakes environment without resets.
[COMMENTS]
Under submission
[LINK]
http://arxiv.org/abs/2502.14043v2
[DATE]
2025-09-16 21:55:47+08:00
[CATEGORIES]
cs.LG
BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
[AUTHORS]
Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
[ABSTRACT]
Recent progress in aligning image and video generative models with Group
Relative Policy Optimization (GRPO) has improved human preference alignment,
but existing variants remain inefficient due to sequential rollouts and large
numbers of sampling steps, and suffer from unreliable credit assignment: sparse terminal
rewards are uniformly propagated across timesteps, failing to capture the
varying criticality of decisions during denoising. In this paper, we present
BranchGRPO, a method that restructures the rollout process into a branching
tree, where shared prefixes amortize computation and pruning removes low-value
paths and redundant depths. BranchGRPO introduces three contributions: (1) a
branching scheme that amortizes rollout cost through shared prefixes while
preserving exploration diversity; (2) a reward fusion and depth-wise advantage
estimator that transforms sparse terminal rewards into dense step-level
signals; and (3) pruning strategies that cut gradient computation but leave
forward rollouts and exploration unaffected. On HPDv2.1 image alignment,
BranchGRPO improves alignment scores by up to 16% over DanceGRPO,
while reducing per-iteration training time by nearly 55%. A hybrid
variant, BranchGRPO-Mix, further accelerates training to 4.7x faster than
DanceGRPO without degrading alignment. On WanX video generation, it further
achieves higher Video-Align scores with sharper and temporally consistent
frames compared to DanceGRPO. Code is available at
https://fredreic1849.github.io/BranchGRPO-Webpage/.
[COMMENTS]
12 pages, 6 figures
[LINK]
http://arxiv.org/abs/2509.06040v4
[DATE]
2025-09-16 21:50:17+08:00
[CATEGORIES]
cs.LG
Multi-task and few-shot learning in virtual flow metering
[AUTHORS]
Kristian Løvland, Bjarne Grimstad, Lars S. Imsland
[ABSTRACT]
Recent literature has explored various ways to improve soft sensors by
utilizing learning algorithms with transferability. A performance gain is
generally attained when knowledge is transferred among strongly related soft
sensor learning tasks. One setting where it is reasonable to expect strongly
related tasks is when learning soft sensors for separate process units that
are of the same type. Applying methods that exploit transferability in this
setting leads to what we call multi-unit soft sensing.
This paper formulates a probabilistic, hierarchical model for multi-unit soft
sensing. The model is implemented using a deep neural network. The proposed
learning method is studied empirically on a large-scale industrial case by
developing virtual flow meters (a type of soft sensor) for 80 petroleum wells.
We investigate how the model generalizes with the number of wells/units. We
demonstrate that multi-unit models learned from data from many wells permit
few-shot learning of virtual flow meters for new wells. Surprisingly, given
the difficulty of the tasks, few-shot learning on 1-3 data points often leads
to high performance on new wells.
[COMMENTS]
17 pages, 12 figures. Updates consist of extended dataset descriptions
and a study on the role of context parameter dimension
[LINK]
http://arxiv.org/abs/2309.15828v3
[DATE]
2025-09-16 21:24:03+08:00
[CATEGORIES]
cs.LG
Traces Propagation: Memory-Efficient and Scalable Forward-Only Learning in Spiking Neural Networks
[AUTHORS]
Lorenzo Pes, Bojian Yin, Sander Stuijk, Federico Corradi
[ABSTRACT]
Spiking Neural Networks (SNNs) provide an efficient framework for processing
dynamic spatio-temporal signals and for investigating the learning principles
underlying biological neural systems. A key challenge in training SNNs is to
solve both spatial and temporal credit assignment. The dominant approach for
training SNNs is Backpropagation Through Time (BPTT) with surrogate gradients.
However, BPTT is in stark contrast with the spatial and temporal locality
observed in biological neural systems and leads to high computational and
memory demands, limiting efficient training strategies and on-device learning.
Although existing local learning rules achieve local temporal credit assignment
by leveraging eligibility traces, they fail to address the spatial credit
assignment without resorting to auxiliary layer-wise matrices, which increase
memory overhead and hinder scalability, especially on embedded devices. In this
work, we propose Traces Propagation (TP), a forward-only, memory-efficient,
scalable, and fully local learning rule that combines eligibility traces with a
layer-wise contrastive loss without requiring auxiliary layer-wise matrices. TP
outperforms other fully local learning rules on NMNIST and SHD datasets. On
more complex datasets such as DVS-GESTURE and DVS-CIFAR10, TP showcases
competitive performance and scales effectively to deeper SNN architectures such
as VGG-9, while providing favorable memory scaling compared to prior fully
local scalable rules, for datasets with a significant number of classes.
Finally, we show that TP is well suited for practical fine-tuning tasks, such
as keyword spotting on the Google Speech Commands dataset, thus paving the way
for efficient learning at the edge.
[LINK]
http://arxiv.org/abs/2509.13053v1
[DATE]
2025-09-16 21:11:52+08:00
[CATEGORIES]
cs.LG
Spiking Vocos: An Energy-Efficient Neural Vocoder
[AUTHORS]
Yukun Chen, Zhaoxi Mu, Andong Li, Peilin Li, Xinyu Yang
[ABSTRACT]
Despite the remarkable progress in the synthesis speed and fidelity of neural
vocoders, their high energy consumption remains a critical barrier to practical
deployment on computationally restricted edge devices. Spiking Neural Networks
(SNNs), widely recognized for their high energy efficiency due to their
event-driven nature, offer a promising solution for low-resource scenarios. In
this paper, we propose Spiking Vocos, a novel spiking neural vocoder with
ultra-low energy consumption, built upon the efficient Vocos framework. To
mitigate the inherent information bottleneck in SNNs, we design a Spiking
ConvNeXt module to reduce Multiply-Accumulate (MAC) operations and incorporate
an amplitude shortcut path to preserve crucial signal dynamics. Furthermore, to
bridge the performance gap with its Artificial Neural Network (ANN)
counterpart, we introduce a self-architectural distillation strategy to
effectively transfer knowledge. A lightweight Temporal Shift Module is also
integrated to enhance the model’s ability to fuse information across the
temporal dimension with negligible computational overhead. Experiments
demonstrate that our model achieves performance comparable to its ANN
counterpart, with UTMOS and PESQ scores of 3.74 and 3.45 respectively, while
consuming only 14.7% of the energy. The source code is available at
https://github.com/pymaster17/Spiking-Vocos.
[LINK]
http://arxiv.org/abs/2509.13049v1
[DATE]
2025-09-16 21:09:13+08:00
[CATEGORIES]
cs.LG
Understanding Generalization in Physics Informed Models through Affine Variety Dimensions
[AUTHORS]
Takeshi Koshizuka, Issei Sato
[ABSTRACT]
Physics-informed machine learning is gaining significant traction for
enhancing statistical performance and sample efficiency through the integration
of physical knowledge. However, current theoretical analyses often presume
complete prior knowledge in non-hybrid settings, overlooking the crucial
integration of observational data, and are frequently limited to linear
systems, unlike the prevalent nonlinear nature of many real-world applications.
To address these limitations, we introduce a residual form that unifies
collocation and variational methods, enabling the incorporation of incomplete
and complex physical constraints in hybrid learning settings. Within this
formulation, we establish that the generalization performance of
physics-informed regression in such hybrid settings is governed by the
dimension of the affine variety associated with the physical constraint, rather
than by the number of parameters. This enables a unified analysis that is
applicable to both linear and nonlinear equations. We also present a method to
approximate this dimension and provide experimental validation of our
theoretical findings.
[LINK]
http://arxiv.org/abs/2501.18879v2
[DATE]
2025-09-16 20:49:15+08:00
[CATEGORIES]
cs.LG
ReTrack: Data Unlearning in Diffusion Models through Redirecting the Denoising Trajectory
[AUTHORS]
Qitan Shi, Cheng Jin, Jiawei Zhang, Yuantao Gu
[ABSTRACT]
Diffusion models excel at generating high-quality, diverse images but suffer
from training data memorization, raising critical privacy and safety concerns.
Data unlearning has emerged to mitigate this issue by removing the influence of
specific data without retraining from scratch. We propose ReTrack, a fast and
effective data unlearning method for diffusion models. ReTrack employs
importance sampling to construct a more efficient fine-tuning loss, which we
approximate by retaining only dominant terms. This yields an interpretable
objective that redirects denoising trajectories toward the $k$-nearest
neighbors, enabling efficient unlearning while preserving generative quality.
Experiments on MNIST T-Shirt, CelebA-HQ, CIFAR-10, and Stable Diffusion show
that ReTrack achieves state-of-the-art performance, striking the best trade-off
between unlearning strength and generation quality preservation.
[LINK]
http://arxiv.org/abs/2509.13007v1
[DATE]
2025-09-16 20:20:15+08:00
[CATEGORIES]
cs.LG
Ensemble Visualization With Variational Autoencoder
[AUTHORS]
Cenyang Wu, Qinhan Yu, Liang Zhou
[ABSTRACT]
We present a new method to visualize data ensembles by constructing
structured probabilistic representations in latent spaces, i.e.,
lower-dimensional representations of spatial data features. Our approach
transforms the spatial features of an ensemble into a latent space through
feature space conversion and unsupervised learning using a variational
autoencoder (VAE). The resulting latent spaces follow multivariate standard
Gaussian distributions, enabling analytical computation of confidence intervals
and density estimation of the probabilistic distribution that generates the
data ensemble. Preliminary results on a weather forecasting ensemble
demonstrate the effectiveness and versatility of our method.
[COMMENTS]
Accepted by the IEEE Workshop on Uncertainty Visualization
[LINK]
http://arxiv.org/abs/2509.13000v1
[DATE]
2025-09-16 20:13:15+08:00
[CATEGORIES]
cs.LG
Data-driven Methods of Extracting Text Structure and Information Transfer
[AUTHORS]
Shinichi Honna, Taichi Murayama, Akira Matsui
[ABSTRACT]
The Anna Karenina Principle (AKP) holds that success requires satisfying a
small set of essential conditions, whereas failure takes diverse forms. We test
AKP, its reverse, and two further patterns described as ordered and noisy
across novels, online encyclopedias, research papers, and movies. Texts are
represented as sequences of functional blocks, and convergence is assessed in
transition order and position. Results show that structural principles vary by
medium: novels follow reverse AKP in order, Wikipedia combines AKP with ordered
patterns, academic papers display reverse AKP in order but remain noisy in
position, and movies diverge by genre. Success therefore depends on structural
constraints that are specific to each medium, while failure assumes different
shapes across domains.
[LINK]
http://arxiv.org/abs/2509.12999v1
[DATE]
2025-09-16 20:13:09+08:00
[CATEGORIES]
cs.LG
PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning
[AUTHORS]
Wenfeng Feng, Penghong Zhao, Guochao Jiang, Chuzhan Hao, Yuewei Zhang, Hao Wang, Guohua Liu
[ABSTRACT]
Critic-free reinforcement learning methods, particularly group policies, have
attracted considerable attention for their efficiency in complex tasks.
However, these methods rely heavily on multiple sampling and comparisons within
the policy to estimate advantage, which may cause the policy to fall into a
local optimum and increase computational cost. To address these issues, we propose
PVPO, an efficient reinforcement learning method enhanced by an advantage
reference anchor and data pre-sampling. Specifically, we use the reference
model to roll out in advance and employ the calculated reward score as a
reference anchor. Our approach effectively corrects the cumulative bias
introduced by intra-group comparisons and significantly reduces reliance on the
number of rollouts during training. Meanwhile, the reference model can assess
sample difficulty during data pre-sampling, enabling effective selection of
high-gain data to improve training efficiency. Moreover, PVPO is orthogonal to
other advanced critic-free RL algorithms, making it compatible with and
complementary to these methods. Experiments conducted on nine datasets across
two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance.
Our approach not only demonstrates robust generalization across multiple tasks,
but also exhibits scalable performance across models of varying scales.
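
The abstract above does not state the advantage formula, so the following is only a minimal sketch of the contrast it describes. It assumes the reference anchor is the mean reward of pre-sampled reference-model rollouts and that it replaces the in-group mean as the baseline. The function names and reward values are hypothetical, and any normalization used by the authors is omitted.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Group-style baseline: each rollout is compared with its own group's mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def anchored_advantages(rewards, reference_rewards):
    """Anchor-style baseline (assumed form): compare with the mean reward of
    rollouts that a reference model produced in advance."""
    anchor = mean(reference_rewards)
    return [r - anchor for r in rewards]


# Hypothetical rewards: every policy rollout in the group succeeds,
# while the reference model succeeded once in four pre-sampled rollouts.
policy_rewards = [1.0, 1.0, 1.0, 1.0]
reference_rewards = [0.0, 1.0, 0.0, 0.0]

group_adv = group_relative_advantages(policy_rewards)
anchor_adv = anchored_advantages(policy_rewards, reference_rewards)
```

In this toy case the group-relative baseline gives zero advantage to every rollout, so the group carries no learning signal, whereas the anchored baseline still rewards the policy for beating the reference model.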
[COMMENTS]
17 pages, 9 figures
[LINK]
http://arxiv.org/abs/2508.21104v2
[DATE]
2025-09-16 20:11:29+08:00
[CATEGORIES]
cs.LG
Bridging Performance Gaps for Foundation Models: A Post-Training Strategy for ECGFounder
[AUTHORS]
Ya Zhou, Yujie Yang, Xiaohan Fan, Wei Zhao
[ABSTRACT]
ECG foundation models are increasingly popular due to their adaptability
across various tasks. However, their clinical applicability is often limited by
performance gaps compared to task-specific models, even after pre-training on
large ECG datasets and fine-tuning on target data. This limitation is likely
due to the lack of an effective post-training strategy. In this paper, we
propose a simple yet effective post-training approach to enhance ECGFounder, a
state-of-the-art ECG foundation model pre-trained on over 7 million ECG
recordings. Experiments on the PTB-XL benchmark show that our approach improves
the baseline fine-tuning strategy by 1.2%-3.3% in macro AUROC and 5.3%-20.9% in
macro AUPRC. Additionally, our method outperforms several recent
state-of-the-art approaches, including task-specific and advanced
architectures. Further evaluation reveals that our method is more stable and
sample-efficient compared to the baseline, achieving a 9.1% improvement in
macro AUROC and a 34.9% improvement in macro AUPRC using just 10% of the
training data. Ablation studies identify key components, such as stochastic
depth and preview linear probing, that contribute to the enhanced performance.
These findings underscore the potential of post-training strategies to improve
ECG foundation models, and we hope this work will contribute to the continued
development of foundation models in the ECG domain.
[COMMENTS]
A simple yet effective strategy for ECG foundation models
[LINK]
http://arxiv.org/abs/2509.12991v1
[DATE]
2025-09-16 20:02:13+08:00
[CATEGORIES]
cs.LG
Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection
[AUTHORS]
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Sicong Li, Qingming Huang
[ABSTRACT]
In this report, we address the problem of determining whether a user performs
an action incorrectly from egocentric video data. To handle the challenges
posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted
Mixture-of-Experts (DR-MoE) framework. In the first stage, features are
extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are
combined through a feature-level expert module. In the second stage, three
classifiers are trained with different objectives: reweighted cross-entropy to
mitigate class imbalance, AUC loss to improve ranking under skewed
distributions, and label-aware loss with sharpness-aware minimization to
enhance calibration and generalization. Their predictions are fused using a
classification-level expert module. The proposed method achieves strong
performance, particularly in identifying rare and ambiguous mistake instances.
The code is available at https://github.com/boyuh/DR-MoE.
[LINK]
http://arxiv.org/abs/2509.12990v1
[DATE]
2025-09-16 20:00:42+08:00
[CATEGORIES]
cs.LG
Causal Discovery via Quantile Partial Effect
[AUTHORS]
Yikang Chen, Xingzhe Sun, Dehui Du
[ABSTRACT]
Quantile Partial Effect (QPE) is a statistic associated with conditional
quantile regression, measuring the effect of covariates at different levels.
Our theory demonstrates that when the QPE of cause on effect is assumed to lie
in a finite linear span, cause and effect are identifiable from their
observational distribution. This generalizes previous identifiability results
based on Functional Causal Models (FCMs) with additive, heteroscedastic noise,
etc. Meanwhile, since QPE resides entirely at the observational level, this
parametric assumption does not require considering mechanisms, noise, or even
the Markov assumption, but rather directly utilizes the asymmetry of shape
characteristics in the observational distribution. By performing basis function
tests on the estimated QPE, causal directions can be distinguished, which is
empirically shown to be effective in experiments on a large number of bivariate
causal discovery datasets. For multivariate causal discovery, leveraging the
close connection between QPE and score functions, we find that Fisher
Information is sufficient as a statistical measure to determine causal order
when assumptions are made about the second moment of QPE. We validate the
feasibility of using Fisher Information to identify causal order on multiple
synthetic and real-world multivariate causal discovery datasets.
[COMMENTS]
29 pages, 6 figures
[LINK]
http://arxiv.org/abs/2509.12981v1
[DATE]
2025-09-16 19:43:01+08:00
[CATEGORIES]
cs.LG
Improving Accuracy and Efficiency of Implicit Neural Representations: Making SIREN a WINNER
[AUTHORS]
Hemanth Chandravamsi, Dhanush V. Shenoy, Steven H. Frankel
[ABSTRACT]
We identify and address a fundamental limitation of sinusoidal representation
networks (SIRENs), a class of implicit neural representations. SIRENs (Sitzmann
et al., 2020), when not initialized appropriately, can struggle to fit
signals that fall outside their frequency support. In extreme cases, when the
network’s frequency support misaligns with the target spectrum, a ‘spectral
bottleneck’ phenomenon is observed, where the model collapses to a near-zero
output and fails to recover even the frequency components that are within its
representational capacity. To overcome this, we propose WINNER - Weight
Initialization with Noise for Neural Representations. WINNER perturbs uniformly
initialized weights of base SIREN with Gaussian noise - whose noise scales are
adaptively determined by the spectral centroid of the target signal. Similar to
random Fourier embeddings, this mitigates ‘spectral bias’ but without
introducing additional trainable parameters. Our method achieves
state-of-the-art audio fitting and significant gains in image and 3D shape
fitting tasks over base SIREN. Beyond signal fitting, WINNER suggests new
avenues in adaptive, target-aware initialization strategies for optimizing deep
neural network training. For code and data visit
cfdlabtechnion.github.io/siren_square/.
[LINK]
http://arxiv.org/abs/2509.12980v1
[DATE]
2025-09-16 19:41:13+08:00
[CATEGORIES]
cs.LG
Spiking Neural Networks for Continuous Control via End-to-End Model-Based Learning
[AUTHORS]
Justus Huebotter, Pablo Lanillos, Marcel van Gerven, Serge Thill
[ABSTRACT]
Despite recent progress in training spiking neural networks (SNNs) for
classification, their application to continuous motor control remains limited.
Here, we demonstrate that fully spiking architectures can be trained end-to-end
to control robotic arms with multiple degrees of freedom in continuous
environments. Our predictive-control framework combines Leaky
Integrate-and-Fire dynamics with surrogate gradients, jointly optimizing a
forward model for dynamics prediction and a policy network for goal-directed
action. We evaluate this approach on both a planar 2D reaching task and a
simulated 6-DOF Franka Emika Panda robot. Results show that SNNs can achieve
stable training and accurate torque control, establishing their viability for
high-dimensional motor tasks. An extensive ablation study highlights the role
of initialization, learnable time constants, and regularization in shaping
training dynamics. We conclude that while stable and effective control can be
achieved, recurrent spiking networks remain highly sensitive to hyperparameter
settings, underscoring the importance of principled design choices.
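
As a rough illustration of the Leaky Integrate-and-Fire dynamics and surrogate gradients mentioned above, the sketch below simulates a single discrete-time neuron. The decay factor `beta`, the subtract-on-spike reset, and the fast-sigmoid surrogate are common textbook choices assumed here, not details taken from the paper, and the time constant is fixed rather than learnable.

```python
def lif_step(v, current, beta=0.9, threshold=1.0):
    """One discrete LIF update: leak, integrate the input, spike, soft reset."""
    v = beta * v + current
    spike = 1.0 if v >= threshold else 0.0
    v = v - spike * threshold  # subtract the threshold after a spike
    return v, spike


def surrogate_grad(v, threshold=1.0, slope=10.0):
    """Fast-sigmoid surrogate for the derivative of the non-differentiable spike."""
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2


def simulate(currents, beta=0.9, threshold=1.0):
    """Run the neuron over a sequence of input currents and record the spikes."""
    v, spikes = 0.0, []
    for c in currents:
        v, s = lif_step(v, c, beta, threshold)
        spikes.append(s)
    return spikes


# A constant input of 0.6 makes the neuron fire on every second step.
spikes = simulate([0.6] * 5)
```

The surrogate is largest at the threshold and decays away from it, which is what lets gradients flow through the spike during end-to-end training.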
[LINK]
http://arxiv.org/abs/2509.05356v2
[DATE]
2025-09-16 19:40:33+08:00
[CATEGORIES]
cs.LG
BAPFL: Exploring Backdoor Attacks Against Prototype-based Federated Learning
[AUTHORS]
Honghong Zeng, Jiong Lou, Zhe Wang, Hefeng Zhou, Chentao Wu, Wei Zhao, Jie Li
[ABSTRACT]
Prototype-based federated learning (PFL) has emerged as a promising paradigm
to address data heterogeneity problems in federated learning, as it leverages
mean feature vectors as prototypes to enhance model generalization. However,
its robustness against backdoor attacks remains largely unexplored. In this
paper, we identify that PFL is inherently resistant to existing backdoor
attacks due to its unique prototype learning mechanism and local data
heterogeneity. To further explore the security of PFL, we propose BAPFL, the
first backdoor attack method specifically designed for PFL frameworks. BAPFL
integrates a prototype poisoning strategy with a trigger optimization
mechanism. The prototype poisoning strategy manipulates the trajectories of
global prototypes to mislead the prototype training of benign clients, pushing
their local prototypes of clean samples away from the prototypes of
trigger-embedded samples. Meanwhile, the trigger optimization mechanism learns
a unique and stealthy trigger for each potential target label, and guides the
prototypes of trigger-embedded samples to align closely with the global
prototype of the target label. Experimental results across multiple datasets
and PFL variants demonstrate that BAPFL achieves a 35%-75% improvement in
attack success rate compared to traditional backdoor attacks, while preserving
main task accuracy. These results highlight the effectiveness, stealthiness,
and adaptability of BAPFL in PFL.
[LINK]
http://arxiv.org/abs/2509.12964v1
[DATE]
2025-09-16 19:15:19+08:00
[CATEGORIES]
cs.LG
Spatiotemporal graph neural process for reconstruction, extrapolation, and classification of cardiac trajectories
[AUTHORS]
Jaume Banus, Augustin C. Ogier, Roger Hullin, Philippe Meyer, Ruud B. van Heeswijk, Jonas Richiardi
[ABSTRACT]
We present a probabilistic framework for modeling structured spatiotemporal
dynamics from sparse observations, focusing on cardiac motion. Our approach
integrates neural ordinary differential equations (NODEs), graph neural
networks (GNNs), and neural processes into a unified model that captures
uncertainty, temporal continuity, and anatomical structure. We represent
dynamic systems as spatiotemporal multiplex graphs and model their latent
trajectories using a GNN-parameterized vector field. Given the sparse context
observations at node and edge levels, the model infers a distribution over
latent initial states and control variables, enabling both interpolation and
extrapolation of trajectories. We validate the method on three synthetic
dynamical systems (coupled pendulum, Lorenz attractor, and Kuramoto
oscillators) and two real-world cardiac imaging datasets - ACDC (N=150) and UK
Biobank (N=526) - demonstrating accurate reconstruction, extrapolation, and
disease classification capabilities. The model accurately reconstructs
trajectories and extrapolates future cardiac cycles from a single observed
cycle. It achieves state-of-the-art results on the ACDC classification task (up
to 99% accuracy), and detects atrial fibrillation in UK Biobank subjects with
competitive performance (up to 67% accuracy). This work introduces a flexible
approach for analyzing cardiac motion and offers a foundation for graph-based
learning in structured biomedical spatiotemporal time-series data.
[LINK]
http://arxiv.org/abs/2509.12953v1
[DATE]
2025-09-16 18:57:51+08:00
[CATEGORIES]
cs.LG
Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning
[AUTHORS]
Weiliang Zhang, Xiaohan Huang, Yi Du, Ziyue Qiao, Qingqing Long, Zhen Meng, Yuanchun Zhou, Meng Xiao
[ABSTRACT]
Feature selection aims to preprocess the target dataset, find an optimal and
most streamlined feature subset, and enhance the downstream machine learning
task. Among filter, wrapper, and embedded-based approaches, the reinforcement
learning (RL)-based subspace exploration strategy provides a novel objective
optimization-directed perspective and promising performance. Nevertheless, even
with improved performance, current reinforcement learning approaches face
challenges similar to conventional methods when dealing with complex datasets.
These challenges stem from the inefficient paradigm of using one agent per
feature and the inherent complexities present in the datasets. This observation
motivates us to investigate and address the above issue and propose a novel
approach, namely HRLFS. Our methodology initially employs a Large Language
Model (LLM)-based hybrid state extractor to capture each feature’s mathematical
and semantic characteristics. Based on this information, features are
clustered, facilitating the construction of hierarchical agents for each
cluster and sub-cluster. Extensive experiments demonstrate the efficiency,
scalability, and robustness of our approach. Compared with contemporary and
one-feature-one-agent RL-based approaches, HRLFS improves the downstream ML
performance with iterative feature subspace exploration while accelerating
total run time by reducing the number of agents involved.
[COMMENTS]
20 pages, keywords: Automated Feature Engineering, Tabular Dataset,
Multi-Agent Reinforcement Learning, Feature Selection
[LINK]
http://arxiv.org/abs/2504.17356v2
[DATE]
2025-09-16 18:52:32+08:00
[CATEGORIES]
cs.LG
Sy-FAR: Symmetry-based Fair Adversarial Robustness
[AUTHORS]
Haneen Najjar, Eyal Ronen, Mahmood Sharif
[ABSTRACT]
Security-critical machine-learning (ML) systems, such as face-recognition
systems, are susceptible to adversarial examples, including real-world
physically realizable attacks. Various means to boost ML’s adversarial
robustness have been proposed; however, they typically induce unfair
robustness: It is often easier to attack from certain classes or groups than
from others. Several techniques have been developed to improve adversarial
robustness while seeking perfect fairness between classes. Yet, prior work has
focused on settings where security and fairness are less critical. Our insight
is that achieving perfect parity in realistic fairness-critical tasks, such as
face recognition, is often infeasible – some classes may be highly similar,
leading to more misclassifications between them. Instead, we suggest that
seeking symmetry – i.e., attacks from class $i$ to $j$ would be as successful
as from $j$ to $i$ – is more tractable. Intuitively, symmetry is desirable
because class resemblance is a symmetric relation in most domains.
Additionally, as we prove theoretically, symmetry between individuals induces
symmetry between any set of sub-groups, in contrast to other fairness notions
where group-fairness is often elusive. We develop Sy-FAR, a technique to
encourage symmetry while also optimizing adversarial robustness and extensively
evaluate it using five datasets, with three model architectures, including
against targeted and untargeted realistic attacks. The results show Sy-FAR
significantly improves fair adversarial robustness compared to state-of-the-art
methods. Moreover, we find that Sy-FAR is faster and more consistent across
runs. Notably, Sy-FAR also ameliorates another type of unfairness we discover
in this work – target classes that adversarial examples are likely to be
classified into become significantly less vulnerable after inducing symmetry.
[COMMENTS]
20 pages, 11 figures
[LINK]
http://arxiv.org/abs/2509.12939v1
[DATE]
2025-09-16 18:39:42+08:00
[CATEGORIES]
cs.LG
Soft Gradient Boosting with Learnable Feature Transforms for Sequential Regression
[AUTHORS]
Huseyin Karaca, Suleyman Serdar Kozat
[ABSTRACT]
We propose a soft gradient boosting framework for sequential regression that
embeds a learnable linear feature transform within the boosting procedure. At
each boosting iteration, we train a soft decision tree and learn a linear input
feature transform Q together. This approach is particularly advantageous in
high-dimensional, data-scarce scenarios, as it discovers the most relevant
input representations while boosting. We demonstrate, using both synthetic and
real-world datasets, that our method effectively and efficiently increases the
performance by an end-to-end optimization of feature selection/transform and
boosting while avoiding overfitting. We also extend our algorithm to
differentiable non-linear transforms if overfitting is not a problem. To
support reproducibility and future work, we share our code publicly.
[LINK]
http://arxiv.org/abs/2509.12920v1
[DATE]
2025-09-16 18:14:47+08:00
[CATEGORIES]
cs.LG
TimeCluster with PCA is Equivalent to Subspace Identification of Linear Dynamical Systems
[AUTHORS]
Christian L. Hines, Samuel Spillard, Daniel P. Martin
[ABSTRACT]
TimeCluster is a visual analytics technique for discovering structure in long
multivariate time series by projecting overlapping windows of data into a
low-dimensional space. We show that, when Principal Component Analysis (PCA) is
chosen as the dimensionality reduction technique, this procedure is
mathematically equivalent to classical linear subspace identification
(block-Hankel matrix plus Singular Value Decomposition (SVD)). In both
approaches, the same low-dimensional linear subspace is extracted from the time
series data. We first review the TimeCluster method and the theory of subspace
system identification. Then we show that forming the sliding-window matrix of a
time series yields a Hankel matrix, so applying PCA (via SVD) to this matrix
recovers the same principal directions as subspace identification. Thus the
cluster coordinates from TimeCluster coincide with those obtained by subspace
identification methods. We present experiments on synthetic and real dynamical signals
confirming that the two embeddings coincide. Finally, we explore and discuss
future opportunities enabled by this equivalence, including forecasting from
the identified state space, streaming/online extensions, incorporating and
visualising external inputs and robust techniques for displaying underlying
trends in corrupted data.
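
The key structural step in the abstract above, that a sliding-window matrix of a time series is a Hankel matrix, can be checked with a small sketch. It assumes a univariate series and a stride of one (a multivariate series would give a block-Hankel matrix instead); the helper names are hypothetical and the PCA/SVD step is left out.

```python
def sliding_window_matrix(series, window):
    """Stack all overlapping windows of length `window` as rows (stride 1)."""
    return [series[i:i + window] for i in range(len(series) - window + 1)]


def is_hankel(matrix):
    """A Hankel matrix is constant along anti-diagonals: M[i][j] == M[i+1][j-1]."""
    rows, cols = len(matrix), len(matrix[0])
    return all(
        matrix[i][j] == matrix[i + 1][j - 1]
        for i in range(rows - 1)
        for j in range(1, cols)
    )


series = [3, 1, 4, 1, 5, 9, 2, 6]
H = sliding_window_matrix(series, window=3)
```

Each entry satisfies H[i][j] = series[i + j], which is exactly the Hankel property, so applying an SVD to this matrix operates on the same object that subspace identification starts from.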
[COMMENTS]
15 pages, 9 figures
[LINK]
http://arxiv.org/abs/2509.12895v1
[DATE]
2025-09-16 17:50:35+08:00
[CATEGORIES]
cs.LG
Single-seed generation of Brownian paths and integrals for adaptive and high order SDE solvers
[AUTHORS]
Andraž Jelinčič, James Foster, Patrick Kidger
[ABSTRACT]
Despite the success of adaptive time-stepping in ODE simulation, it has so
far seen few applications for Stochastic Differential Equations (SDEs). To
simulate SDEs adaptively, methods such as the Virtual Brownian Tree (VBT) have
been developed, which can generate Brownian motion (BM) non-chronologically.
However, in most applications, knowing only the values of Brownian motion is
not enough to achieve a high order of convergence; for that, we must compute
time-integrals of BM such as $\int_s^t W_r \, dr$. With the aim of using high
order SDE solvers adaptively, we extend the VBT to generate these integrals of
BM in addition to the Brownian increments. A JAX-based implementation of our
construction is included in the popular Diffrax library
(https://github.com/patrick-kidger/diffrax).
Since the entire Brownian path produced by VBT is uniquely determined by a
single PRNG seed, previously generated samples need not be stored, which
results in a constant memory footprint and enables experiment repeatability and
strong error estimation. Based on binary search, the VBT’s time complexity is
logarithmic in the tolerance parameter $\varepsilon$. Unlike the original VBT
algorithm, which was only precise at some dyadic times, we prove that our
construction exactly matches the joint distribution of the Brownian motion and
its time integrals at any query times, provided they are at least $\varepsilon$
apart.
We present two applications of adaptive high order solvers enabled by our new
VBT. Using adaptive solvers to simulate a high-volatility CIR model, we achieve
more than twice the convergence order of constant stepping. We apply an
adaptive third order underdamped or kinetic Langevin solver to an MCMC problem,
where our approach outperforms the No U-Turn Sampler, while using only a tenth
of its function evaluations.
[LINK]
http://arxiv.org/abs/2405.06464v6
[DATE]
2025-09-16 17:44:51+08:00
[CATEGORIES]
cs.LG
Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation
[AUTHORS]
Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow
[ABSTRACT]
Pointing is a key mode of interaction with robots, yet most prior work has
focused on recognition rather than generation. We present a motion capture
dataset of human pointing gestures covering diverse styles, handedness, and
spatial targets. Using reinforcement learning with motion imitation, we train
policies that reproduce human-like pointing while maximizing precision. Results
show our approach enables context-aware pointing behaviors in simulation,
balancing task performance with natural dynamics.
[COMMENTS]
Presented at the Context-Awareness in HRI (CONAWA) Workshop, ACM/IEEE
International Conference on Human-Robot Interaction (HRI 2022), March 7, 2022
[LINK]
http://arxiv.org/abs/2509.12880v1
[DATE]
2025-09-16 17:30:42+08:00
[CATEGORIES]
cs.LG
Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use
[AUTHORS]
Yabo Zhang, Yihan Zeng, Qingyun Li, Zhen Hu, Kavin Han, Wangmeng Zuo
[ABSTRACT]
Large language models (LLMs) have demonstrated strong capabilities in
language understanding and reasoning, yet they remain limited when tackling
real-world tasks that require up-to-date knowledge, precise operations, or
specialized tool use. To address this, we propose Tool-R1, a reinforcement
learning framework that enables LLMs to perform general, compositional, and
multi-step tool use by generating executable Python code. Tool-R1 supports
integration of user-defined tools and standard libraries, with variable sharing
across steps to construct coherent workflows. An outcome-based reward function,
combining LLM-based answer judgment and code execution success, guides policy
optimization. To improve training efficiency, we maintain a dynamic sample
queue to cache and reuse high-quality trajectories, reducing the overhead of
costly online sampling. Experiments on the GAIA benchmark show that Tool-R1
substantially improves both accuracy and robustness, achieving about 10% gain
over strong baselines, with larger improvements on complex multi-step tasks.
These results highlight the potential of Tool-R1 for enabling reliable and
efficient tool-augmented reasoning in real-world applications. Our code will be
available at https://github.com/YBYBZhang/Tool-R1.
[LINK]
http://arxiv.org/abs/2509.12867v1
[DATE]
2025-09-16 17:22:21+08:00
[CATEGORIES]
cs.LG
TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-modal Representation for End-to-end Autonomous Driving
[AUTHORS]
Xuefeng Jiang, Yuan Ma, Pengxiang Li, Leimeng Xu, Xin Wen, Kun Zhan, Zhongpu Xia, Peng Jia, Xianpeng Lang, Sheng Sun
[ABSTRACT]
In recent years, diffusion models have demonstrated remarkable potential
across diverse domains, from vision generation to language modeling.
Transferring their generative capabilities to modern end-to-end autonomous
driving systems has also emerged as a promising direction. However, existing
diffusion-based trajectory generative models often exhibit mode collapse where
different random noises converge to similar trajectories after the denoising
process. Therefore, state-of-the-art models often rely on anchored trajectories
from pre-defined trajectory vocabulary or scene priors in the training set to
mitigate collapse and enrich the diversity of generated trajectories, but such
inductive biases are not available in real-world deployment, so these models can
struggle when generalizing to unseen scenarios. In this work, we investigate
the possibility of effectively tackling the mode collapse challenge without the
assumption of pre-defined trajectory vocabulary or pre-computed scene priors.
Specifically, we propose TransDiffuser, an encoder-decoder based generative
trajectory planning model, where the encoded scene information and motion
states serve as the multi-modal conditional input of the denoising decoder.
Different from existing approaches, we exploit a simple yet effective
multi-modal representation decorrelation optimization mechanism during the
denoising process to enrich the latent representation space which better guides
the downstream generation. Without any predefined trajectory anchors or
pre-computed scene priors, TransDiffuser achieves a PDMS of 94.85 on the
closed-loop planning-oriented benchmark NAVSIM, surpassing previous
state-of-the-art methods. Qualitative evaluation further shows that
TransDiffuser generates more diverse and plausible trajectories that explore
more of the drivable area.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2505.09315v2
[DATE]
2025-09-16 17:17:36+08:00
[CATEGORIES]
cs.LG
Minimax optimal transfer learning for high-dimensional additive regression
[AUTHORS]
Seung Hyun Moon
[ABSTRACT]
This paper studies high-dimensional additive regression under the transfer
learning framework, where one observes samples from a target population
together with auxiliary samples from different but potentially related
regression models. We first introduce a target-only estimation procedure based
on the smooth backfitting estimator with local linear smoothing. In contrast to
previous work, we establish general error bounds under sub-Weibull($\alpha$)
noise, thereby accommodating heavy-tailed error distributions. In the
sub-exponential case ($\alpha=1$), we show that the estimator attains the
minimax lower bound under regularity conditions, which requires a substantial
departure from existing proof strategies. We then develop a novel two-stage
estimation method within a transfer learning framework, and provide theoretical
guarantees at both the population and empirical levels. Error bounds are
derived for each stage under general tail conditions, and we further
demonstrate that the minimax optimal rate is achieved when the auxiliary and
target distributions are sufficiently close. All theoretical results are
supported by simulation studies and real data analysis.
[COMMENTS]
This is a draft version of the paper. All responsibilities are
assigned to the first author
[LINK]
http://arxiv.org/abs/2509.06308v2
[DATE]
2025-09-16 16:59:48+08:00
[CATEGORIES]
cs.LG
Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?
[AUTHORS]
Hannah Markgraf, Shamburaj Sawant, Hanna Krasowski, Lukas Schäfer, Sebastien Gros, Matthias Althoff
[ABSTRACT]
Projection-based safety filters, which modify unsafe actions by mapping them
to the closest safe alternative, are widely used to enforce safety constraints
in reinforcement learning (RL). Two integration strategies are commonly
considered: Safe environment RL (SE-RL), where the safeguard is treated as part
of the environment, and safe policy RL (SP-RL), where it is embedded within the
policy through differentiable optimization layers. Despite their practical
relevance in safety-critical settings, a formal understanding of their
differences is lacking. In this work, we present a theoretical comparison of
SE-RL and SP-RL. We identify a key distinction in how each approach is affected
by action aliasing, a phenomenon in which multiple unsafe actions are projected
to the same safe action, causing information loss in the policy gradients. In
SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it
manifests directly as rank-deficient Jacobians during backpropagation through
the safeguard. Our contributions are threefold: (i) a unified formalization of
SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical
analysis of their respective policy gradient estimates, highlighting the role
of action aliasing, and (iii) a comparative study of mitigation strategies,
including a novel penalty-based improvement for SP-RL that aligns with
established SE-RL practices. Empirical results support our theoretical
predictions, showing that action aliasing is more detrimental for SP-RL than
for SE-RL. However, with appropriate improvement strategies, SP-RL can match or
outperform improved SE-RL across a range of environments. These findings
provide actionable insights for choosing and refining projection-based safe RL
methods based on task characteristics.
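The action-aliasing effect described above can be illustrated with a toy safeguard. This is a minimal sketch, assuming a box-shaped safe set so that the Euclidean projection reduces to coordinate-wise clipping; the function names and the box bounds are hypothetical and not taken from the paper.

```python
def project_to_box(action, low=-1.0, high=1.0):
    """Euclidean projection onto the box [low, high]^n (coordinate-wise clipping)."""
    return [min(max(a, low), high) for a in action]


def projection_jacobian_diagonal(action, low=-1.0, high=1.0):
    """Diagonal of the projection Jacobian.

    An entry is 1 where the coordinate lies strictly inside the box and 0
    where it is clipped, so clipped coordinates pass no gradient back.
    """
    return [1.0 if low < a < high else 0.0 for a in action]


# Two different unsafe actions (hypothetical values) ...
unsafe_a = [2.0, 0.5]
unsafe_b = [3.0, 0.5]

# ... are mapped to the same safe action: this is action aliasing.
safe_a = project_to_box(unsafe_a)
safe_b = project_to_box(unsafe_b)

# The Jacobian has a zero on its diagonal, i.e. it is rank-deficient.
jac_diag = projection_jacobian_diagonal(unsafe_a)
```

In this toy case the first coordinate receives no gradient when backpropagating through the safeguard, which is the SP-RL failure mode the abstract refers to.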
[LINK]
http://arxiv.org/abs/2509.12833v1
[DATE]
2025-09-16 16:56:38+08:00
[CATEGORIES]
cs.LG
Energy-Efficient Quantized Federated Learning for Resource-constrained IoT devices
[AUTHORS]
Wilfrid Sougrinoma Compaoré, Yaya Etiabi, El Mehdi Amhoud, Mohamad Assaad
[ABSTRACT]
Federated Learning (FL) has emerged as a promising paradigm for enabling
collaborative machine learning while preserving data privacy, making it
particularly suitable for Internet of Things (IoT) environments. However,
resource-constrained IoT devices face significant challenges due to limited
energy, unreliable communication channels, and the impracticality of assuming
infinite blocklength transmission. This paper proposes a federated learning
framework for IoT networks that integrates finite blocklength transmission,
model quantization, and an error-aware aggregation mechanism to enhance energy
efficiency and communication reliability. The framework also optimizes uplink
transmission power to balance energy savings and model performance. Simulation
results demonstrate that the proposed approach significantly reduces energy
consumption by up to 75% compared to a standard FL model, while maintaining
robust model accuracy, making it a viable solution for FL in real-world IoT
scenarios with constrained resources. This work paves the way for efficient and
reliable FL implementations in practical IoT deployments. Index Terms:
Federated learning, IoT, finite blocklength, quantization, energy efficiency.
[COMMENTS]
6 pages, accepted at IEEE PIMRC 2025
[LINK]
http://arxiv.org/abs/2509.12814v1
[DATE]
2025-09-16 16:31:46+08:00
[CATEGORIES]
cs.LG
Fast reconstruction of degenerate populations of conductance-based neuron models from spike times
[AUTHORS]
Julien Brandoit, Damien Ernst, Guillaume Drion, Arthur Fyon
[ABSTRACT]
Neurons communicate through spikes, and spike timing is a crucial part of
neuronal processing. Spike times can be recorded experimentally both
intracellularly and extracellularly, and are the main output of
state-of-the-art neural probes. On the other hand, neuronal activity is
controlled at the molecular level by the currents generated by many different
transmembrane proteins called ion channels. Connecting spike timing to ion
channel composition remains an arduous task to date. To address this challenge,
we developed a method that combines deep learning with a theoretical tool
called Dynamic Input Conductances (DICs), which reduce the complexity of ion
channel interactions into three interpretable components describing how neurons
spike. Our approach uses deep learning to infer DICs directly from spike times
and then generates populations of “twin” neuron models that replicate the
observed activity while capturing natural variability in membrane channel
composition. The method is fast, accurate, and works using only spike
recordings. We also provide open-source software with a graphical interface,
making it accessible to researchers without programming expertise.
[LINK]
http://arxiv.org/abs/2509.12783v1
[DATE]
2025-09-16 16:02:00+08:00
[CATEGORIES]
cs.LG
MEGAN: Mixture of Experts for Robust Uncertainty Estimation in Endoscopy Videos
[AUTHORS]
Damola Agbelese, Krishna Chaitanya, Pushpak Pati, Chaitanya Parmar, Pooya Mobadersany, Shreyas Fadnavis, Lindsey Surace, Shadi Yarandi, Louis R. Ghanem, Molly Lucas, Tommaso Mansi, Oana Gabriela Cula, Pablo F. Damasceno, Kristopher Standish
[ABSTRACT]
Reliable uncertainty quantification (UQ) is essential in medical AI.
Evidential Deep Learning (EDL) offers a computationally efficient way to
quantify model uncertainty alongside predictions, unlike traditional methods
such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these
methods often rely on a single expert’s annotations as ground truth for model
training, overlooking the inter-rater variability in healthcare. To address
this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates
uncertainty estimates and predictions from multiple AI experts via EDL models
trained with diverse ground truths and modeling strategies. MEGAN’s gating
network optimally combines predictions and uncertainties from each EDL model,
enhancing overall prediction confidence and calibration. We extensively
benchmark MEGAN on endoscopy videos for Ulcerative colitis (UC) disease
severity estimation, assessed by visual labeling of Mayo Endoscopic Subscore
(MES), where inter-rater variability is prevalent. In a large-scale prospective
UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5%
reduction in Expected Calibration Error (ECE) compared to existing methods.
Furthermore, MEGAN facilitated uncertainty-guided sample stratification,
reducing the annotation burden and potentially increasing efficiency and
consistency in UC trials.
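For readers unfamiliar with the calibration metric quoted above, the following is a minimal sketch of the commonly used binned Expected Calibration Error. It assumes equal-width confidence bins and is not the authors' evaluation code; the function name, the bin count, and the example values are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: two confident hits, one less confident hit, one miss.
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
```

A lower value means that predicted confidence tracks observed accuracy more closely.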
[COMMENTS]
11 pages, 2 figures, 1 table, accepted at UNSURE, MICCAI
[LINK]
http://arxiv.org/abs/2509.12772v1
[DATE]
2025-09-16 15:42:01+08:00
[CATEGORIES]
cs.LG
Stochastic Optimal Control via Measure Relaxations
[AUTHORS]
Etienne Buehrle, Christoph Stiller
[ABSTRACT]
The optimal control problem of stochastic systems is commonly solved via
robust or scenario-based optimization methods, which are both challenging to
scale to long optimization horizons. We cast the optimal control problem of a
stochastic system as a convex optimization problem over occupation measures. We
demonstrate our method on a set of synthetic and real-world scenarios, learning
cost functions from data via Christoffel polynomials. The code for our
experiments is available at https://github.com/ebuehrle/dpoc.
[COMMENTS]
7 pages, 4 figures
[LINK]
http://arxiv.org/abs/2508.00886v2
[DATE]
2025-09-16 15:40:23+08:00
[CATEGORIES]
cs.LG
Physics-informed neural network solves minimal surfaces in curved spacetime
[AUTHORS]
Koji Hashimoto, Koichi Kyo, Masaki Murata, Gakuto Ogiwara, Norihiro Tanahashi
[ABSTRACT]
We develop a flexible framework based on physics-informed neural networks
(PINNs) for solving boundary value problems involving minimal surfaces in
curved spacetimes, with a particular emphasis on singularities and moving
boundaries. By encoding the underlying physical laws into the loss function and
designing network architectures that incorporate the singular behavior and
dynamic boundaries, our approach enables robust and accurate solutions to both
ordinary and partial differential equations with complex boundary conditions.
We demonstrate the versatility of this framework through applications to
minimal surface problems in anti-de Sitter (AdS) spacetime, including examples
relevant to the AdS/CFT correspondence (e.g. Wilson loops and gluon scattering
amplitudes) popularly used in the context of string theory in theoretical
physics. Our methods efficiently handle singularities at boundaries, and also
support both “soft” (loss-based) and “hard” (formulation-based) imposition of
boundary conditions, including cases where the position of a boundary is
promoted to a trainable parameter. The techniques developed here are not
limited to high-energy theoretical physics but are broadly applicable to
boundary value problems encountered in mathematics, engineering, and the
natural sciences, wherever singularities and moving boundaries play a critical
role.
[COMMENTS]
40 pages, 17 figures, 3 tables; v2: added arXiv number of the
companion paper
[LINK]
http://arxiv.org/abs/2509.10866v2
[DATE]
2025-09-16 15:24:07+08:00
[CATEGORIES]
cs.LG
Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative Model
[AUTHORS]
Saki Hashimoto, Shoichi Hasegawa, Tomochika Ishikawa, Akira Taniguchi, Yoshinobu Hagiwara, Lotfi El Hafi, Tadahiro Taniguchi
[ABSTRACT]
Robots operating in domestic and office environments must understand object
ownership to correctly execute instructions such as “Bring me my cup.”
However, ownership cannot be reliably inferred from visual features alone. To
address this gap, we propose Active Ownership Learning (ActOwL), a framework
that enables robots to actively generate and ask ownership-related questions to
users. ActOwL employs a probabilistic generative model to select questions that
maximize information gain, thereby acquiring ownership knowledge efficiently to
improve learning efficiency. Additionally, by leveraging commonsense knowledge
from Large Language Models (LLM), objects are pre-classified as either shared
or owned, and only owned objects are targeted for questioning. Through
experiments in a simulated home environment and a real-world laboratory
setting, ActOwL achieved significantly higher ownership clustering accuracy
with fewer questions than baseline methods. These findings demonstrate the
effectiveness of combining active inference with LLM-guided commonsense
reasoning, advancing the capability of robots to acquire ownership knowledge
for practical and socially appropriate task execution.
[COMMENTS]
Submitted to AROB-ISBC 2026 (Journal Track option)
[LINK]
http://arxiv.org/abs/2509.12754v1
[DATE]
2025-09-16 15:15:52+08:00
[CATEGORIES]
cs.LG
DeltaHedge: A Multi-Agent Framework for Portfolio Options Optimization
[AUTHORS]
Feliks Bańka, Jarosław A. Chudziak
[ABSTRACT]
In volatile financial markets, balancing risk and return remains a
significant challenge. Traditional approaches often focus solely on equity
allocation, overlooking the strategic advantages of options trading for dynamic
risk hedging. This work presents DeltaHedge, a multi-agent framework that
integrates options trading with AI-driven portfolio management. By combining
advanced reinforcement learning techniques with an ensembled options-based
hedging strategy, DeltaHedge enhances risk-adjusted returns and stabilizes
portfolio performance across varying market conditions. Experimental results
demonstrate that DeltaHedge outperforms traditional strategies and standalone
models, underscoring its potential to transform practical portfolio management
in complex financial environments. Building on these findings, this paper
contributes to the fields of quantitative finance and AI-driven portfolio
optimization by introducing a novel multi-agent system for integrating options
trading strategies, addressing a gap in the existing literature.
[COMMENTS]
Presented at Pacific Asia Conference on Information Systems (PACIS
2025), Kuala Lumpur. Official proceedings available at
https://aisel.aisnet.org/pacis2025/aiandml/aiandml/25/. 16 pages, 7 figures,
3 tables
[LINK]
http://arxiv.org/abs/2509.12753v1
[DATE]
2025-09-16 15:14:56+08:00
[CATEGORIES]
cs.LG
Finite Neural Networks as Mixtures of Gaussian Processes: From Provable Error Bounds to Prior Selection
[AUTHORS]
Steven Adams, Andrea Patanè, Morteza Lahijanian, Luca Laurenti
[ABSTRACT]
Infinitely wide or deep neural networks (NNs) with independent and
identically distributed (i.i.d.) parameters have been shown to be equivalent to
Gaussian processes. Because of the favorable properties of Gaussian processes,
this equivalence is commonly employed to analyze neural networks and has led to
various breakthroughs over the years. However, neural networks and Gaussian
processes are equivalent only in the limit; in the finite case there are
currently no methods available to approximate a trained neural network with a
Gaussian model with bounds on the approximation error. In this work, we present
an algorithmic framework to approximate a neural network of finite width and
depth, and with not necessarily i.i.d. parameters, with a mixture of Gaussian
processes with error bounds on the approximation error. In particular, we
consider the Wasserstein distance to quantify the closeness between
probabilistic models and, by relying on tools from optimal transport and
Gaussian processes, we iteratively approximate the output distribution of each
layer of the neural network as a mixture of Gaussian processes. Crucially, for
any NN and $\epsilon >0$ our approach is able to return a mixture of Gaussian
processes that is $\epsilon$-close to the NN at a finite set of input points.
Furthermore, we rely on the differentiability of the resulting error bound to
show how our approach can be employed to tune the parameters of a NN to mimic
the functional behavior of a given Gaussian process, e.g., for prior selection
in the context of Bayesian inference. We empirically investigate the
effectiveness of our results on both regression and classification problems
with various neural network architectures. Our experiments highlight how our
results can represent an important step towards understanding neural network
predictions and formally quantifying their uncertainty.
[LINK]
http://arxiv.org/abs/2407.18707v2
[DATE]
2025-09-16 15:14:07+08:00
[CATEGORIES]
cs.LG
Deep Generative and Discriminative Digital Twin endowed with Variational Autoencoder for Unsupervised Predictive Thermal Condition Monitoring of Physical Robots in Industry 6.0 and Society 6.0
[AUTHORS]
Eric Guiffo Kaigom
[ABSTRACT]
Robots are unrelentingly used to achieve operational efficiency in Industry
4.0 along with symbiotic and sustainable assistance for the work-force in
Industry 5.0. As resilience, robustness, and well-being are required in
anti-fragile manufacturing and human-centric societal tasks, an autonomous
anticipation and adaption to thermal saturation and burns due to motors
overheating become instrumental for human safety and robot availability. Robots
are thereby expected to self-sustain their performance and deliver user
experience, in addition to communicating their capability to other agents in
advance to ensure fully automated thermally feasible tasks, and prolong their
lifetime without human intervention. However, the traditional robot shutdown,
when facing an imminent thermal saturation, inhibits productivity in factories
and comfort in the society, while cooling strategies are hard to implement
after the robot acquisition. In this work, smart digital twins endowed with
generative AI, i.e., variational autoencoders, are leveraged to manage
thermally anomalous and generate uncritical robot states. The notion of thermal
difficulty is derived from the reconstruction error of variational
autoencoders. A robot can use this score to predict, anticipate, and share the
thermal feasibility of desired motion profiles to meet requirements from
emerging applications in Industry 6.0 and Society 6.0.
[COMMENTS]
© 2025 the authors. This work has been accepted to the
10th IFAC Symposium on Mechatronic Systems & 14th IFAC Symposium on
Robotics, July 15-18, 2025, Paris, France, for publication under a Creative
Commons Licence CC-BY-NC-ND
[LINK]
http://arxiv.org/abs/2509.12740v1
[DATE]
2025-09-16 14:52:59+08:00
[CATEGORIES]
cs.LG
Improved Impossible Tuning and Lipschitz-Adaptive Universal Online Learning with Gradient Variations
[AUTHORS]
Kei Takemura, Ryuta Matsuno, Keita Sakuma
[ABSTRACT]
A central goal in online learning is to achieve adaptivity to unknown problem
characteristics, such as environmental changes captured by gradient variation
(GV), function curvature (universal online learning, UOL), and gradient scales
(Lipschitz adaptivity, LA). Simultaneously achieving these with optimal
performance is a major challenge, partly due to limitations in algorithms for
prediction with expert advice. These algorithms often serve as meta-algorithms
in online ensemble frameworks, and their sub-optimality hinders overall UOL
performance. Specifically, existing algorithms addressing the “impossible
tuning” issue incur an excess $\sqrt{\log T}$ factor in their regret bound
compared to the lower bound. To solve this problem, we propose a novel
optimistic online mirror descent algorithm with an auxiliary initial round
using large learning rates. This design enables a refined analysis where a
generated negative term cancels the gap-related factor, resolving the
impossible tuning issue up to $\log\log T$ factors. Leveraging our improved
algorithm as a meta-algorithm, we develop the first UOL algorithm that
simultaneously achieves state-of-the-art GV bounds and LA under standard
assumptions. Our UOL result overcomes key limitations of prior works, notably
resolving the conflict between LA mechanisms and regret analysis for GV bounds
– an open problem highlighted by Xie et al.
[COMMENTS]
Our proof of Lemma 3 (a key lemma) has a critical error
[LINK]
http://arxiv.org/abs/2505.21095v2
[DATE]
2025-09-16 14:43:14+08:00
[CATEGORIES]
cs.LG
A Graph Machine Learning Approach for Detecting Topological Patterns in Transactional Graphs
[AUTHORS]
Francesco Zola, Jon Ander Medina, Andrea Venturi, Amaia Gil, Raul Orduna
[ABSTRACT]
The rise of digital ecosystems has exposed the financial sector to evolving
abuse and criminal tactics that share operational knowledge and techniques both
within and across different environments (fiat-based, crypto-assets, etc.).
Traditional rule-based systems lack the adaptability needed to detect
sophisticated or coordinated criminal behaviors (patterns), highlighting the
need for strategies that analyze actors’ interactions to uncover suspicious
activities and extract their modus operandi. For this reason, in this work, we
propose an approach that integrates graph machine learning and network analysis
to improve the detection of well-known topological patterns within
transactional graphs. However, a key challenge lies in the limitations of
traditional financial datasets, which often provide sparse, unlabeled
information that is difficult to use for graph-based pattern analysis.
Therefore, we firstly propose a four-step preprocessing framework that involves
(i) extracting graph structures, (ii) considering data temporality to manage
large node sets, (iii) detecting communities within, and (iv) applying
automatic labeling strategies to generate weak ground-truth labels. Then, once
the data is processed, Graph Autoencoders are implemented to distinguish among
the well-known topological patterns. Specifically, three different GAE variants
are implemented and compared in this analysis. Preliminary results show that
this pattern-focused, topology-driven method is effective for detecting complex
financial crime schemes, offering a promising alternative to conventional
rule-based detection systems.
[COMMENTS]
Paper accepted @ Workshop on AI for Financial Crime Fight (AI4FCF @
ICDM 2025)
[LINK]
http://arxiv.org/abs/2509.12730v1
[DATE]
2025-09-16 14:43:11+08:00
[CATEGORIES]
cs.LG
Generalizable Holographic Reconstruction via Amplitude-Only Diffusion Priors
[AUTHORS]
Jeongsol Kim, Chanseok Lee, Jong Chul Ye, Mooseok Jang
[ABSTRACT]
Phase retrieval in inline holography is a fundamental yet ill-posed inverse
problem due to the nonlinear coupling between amplitude and phase in coherent
imaging. We present a novel off-the-shelf solution that leverages a diffusion
model trained solely on object amplitude to recover both amplitude and phase
from diffraction intensities. Using a predictor-corrector sampling framework
with separate likelihood gradients for amplitude and phase, our method enables
complex field reconstruction without requiring ground-truth phase data for
training. We validate the proposed approach through extensive simulations and
experiments, demonstrating robust generalization across diverse object shapes,
imaging system configurations, and modalities, including lensless setups.
Notably, a diffusion prior trained on simple amplitude data (e.g., polystyrene
beads) successfully reconstructs complex biological tissue structures,
highlighting the method’s adaptability. This framework provides a
cost-effective, generalizable solution for nonlinear inverse problems in
computational imaging, and establishes a foundation for broader coherent
imaging applications beyond holography.
[COMMENTS]
Keywords: Diffusion model, phase retrieval, inline-holography,
inverse problem
[LINK]
http://arxiv.org/abs/2509.12728v1
[DATE]
2025-09-16 14:36:08+08:00
[CATEGORIES]
cs.LG
Training-free Adjustable Polynomial Graph Filtering for Ultra-fast Multimodal Recommendation
[AUTHORS]
Yu-Seung Roh, Joo-Young Kim, Jin-Duk Park, Won-Yong Shin
[ABSTRACT]
Multimodal recommender systems improve the performance of canonical
recommender systems with no item features by utilizing diverse content types
such as text, images, and videos, while alleviating inherent sparsity of
user-item interactions and accelerating user engagement. However, current
neural network-based models often incur significant computational overhead due
to the complex training process required to learn and integrate information
from multiple modalities. To address this challenge, we propose MultiModal-Graph
Filtering (MM-GF), a training-free method grounded in graph filtering (GF) for
efficient and accurate multimodal recommendations. Specifically, MM-GF first
constructs multiple similarity graphs for two distinct modalities as well as
user-item interaction data. Then, MM-GF optimally fuses these multimodal
signals using a polynomial graph filter that allows for precise control of the
frequency response by adjusting frequency bounds. Furthermore, the filter
coefficients are treated as hyperparameters, enabling flexible and data-driven
adaptation. Extensive experiments on real-world benchmark datasets demonstrate
that MM-GF not only improves recommendation accuracy by up to 22.25% compared
to the best competitor but also dramatically reduces computational costs by
achieving a runtime of less than 10 seconds.
[COMMENTS]
17 pages, 7 figures, 6 tables
[LINK]
http://arxiv.org/abs/2503.04406v2
[DATE]
2025-09-16 14:35:48+08:00
[CATEGORIES]
cs.LG
Unbiased Online Curvature Approximation for Regularized Graph Continual Learning
[AUTHORS]
Jie Yin, Ke Sun, Han Wu
[ABSTRACT]
Graph continual learning (GCL) aims to learn from a continuous sequence of
graph-based tasks. Regularization methods are vital for preventing catastrophic
forgetting in GCL, particularly in the challenging replay-free,
class-incremental setting, where each task consists of a set of unique classes.
In this work, we first establish a general regularization framework for GCL
based on the curved parameter space induced by the Fisher information matrix
(FIM). We show that the dominant Elastic Weight Consolidation (EWC) and its
variants are a special case within this framework, using a diagonal
approximation of the empirical FIM based on parameters from previous tasks. To
overcome their limitations, we propose a new unbiased online curvature
approximation of the full FIM based on the model’s current learning state. Our
method directly estimates the regularization term in an online manner without
explicitly evaluating and storing the FIM itself. This enables the model to
better capture the loss landscape during learning new tasks while retaining the
knowledge learned from previous tasks. Extensive experiments on three graph
datasets demonstrate that our method significantly outperforms existing
regularization-based methods, achieving a superior trade-off between stability
(retaining old knowledge) and plasticity (acquiring new knowledge).
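The baseline that the abstract generalizes, EWC with a diagonal Fisher approximation, can be written in a few lines. This is a minimal sketch of the standard EWC penalty only, not the proposed online curvature approximation; the parameter names and example values are hypothetical.

```python
def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """Diagonal EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2.

    theta       -- current parameters
    theta_star  -- parameters stored after the previous task
    fisher_diag -- diagonal Fisher information estimates, one per parameter
    """
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_star, fisher_diag)
    )


# Hypothetical two-parameter model: the second parameter has the larger
# Fisher value, so it is penalized more per unit of drift.
penalty = ewc_penalty(
    theta=[1.0, 2.0], theta_star=[0.0, 2.5], fisher_diag=[2.0, 4.0]
)
```

Because only the diagonal is kept, interactions between parameters are ignored, which is the limitation the full-FIM approximation in the abstract is meant to address.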
[COMMENTS]
9 pages
[LINK]
http://arxiv.org/abs/2509.12727v1
[DATE]
2025-09-16 14:35:13+08:00
[CATEGORIES]
cs.LG
Data-Driven Discovery of Emergent Dynamics in Reaction-Diffusion Systems from Sparse and Noisy Observations
[AUTHORS]
Saumitra Dwivedi, Ricardo da Silva Torres, Ibrahim A. Hameed, Gunnar Tufte, Anniken Susanne T. Karlsen
[ABSTRACT]
Data-driven discovery of emergent dynamics is gaining popularity,
particularly in the context of reaction-diffusion systems. These systems are
widely studied across various fields, including neuroscience, ecology,
epidemiology, and several other subject areas that deal with emergent dynamics.
A current challenge in the discovery process relates to system identification
when there is no prior knowledge of the underlying physics. We attempt to
address this challenge by learning Soft Artificial Life (Soft ALife) models,
such as Agent-based and Cellular Automata (CA) models, from observed data for
reaction-diffusion systems. In this paper, we present findings on the
applicability of a conceptual framework, the Data-driven Rulesets for Soft
Artificial Life (DRSALife) model, to learn Soft ALife rulesets that accurately
represent emergent dynamics in a reaction-diffusion system from observed data.
This model has demonstrated promising results for Elementary CA Rule 30, Game
of Life, and Vicsek Flocking problems in recent work. To our knowledge, this is
one of the few studies that explore machine-based Soft ALife ruleset learning
and system identification for reaction-diffusion dynamics without any prior
knowledge of the underlying physics. Moreover, we provide comprehensive
findings from experiments investigating the potential effects of using noisy
and sparse observed datasets on learning emergent dynamics. Additionally, we
successfully identify the structure and parameters of the underlying partial
differential equations (PDEs) representing these dynamics. Experimental results
demonstrate that the learned models are able to predict the emergent dynamics
with good accuracy (74%) and exhibit quite robust performance when subjected to
Gaussian noise and temporal sparsity.
[LINK]
http://arxiv.org/abs/2509.09278v2
[DATE]
2025-09-16 14:32:45+08:00
[CATEGORIES]
cs.LG
Revisiting Transferable Adversarial Images: Systemization, Evaluation, and New Insights
[AUTHORS]
Zhengyu Zhao, Hanwei Zhang, Renjue Li, Ronan Sicre, Laurent Amsaleg, Michael Backes, Qi Li, Qian Wang, Chao Shen
[ABSTRACT]
Transferable adversarial images raise critical security concerns for computer
vision systems in real-world, black-box attack scenarios. Although many
transfer attacks have been proposed, existing research lacks a systematic and
comprehensive evaluation. In this paper, we systemize transfer attacks into
five categories around the general machine learning pipeline and provide the
first comprehensive evaluation, with 23 representative attacks against 11
representative defenses, including the recent, transfer-oriented defense and
the real-world Google Cloud Vision. In particular, we identify two main
problems of existing evaluations: (1) for attack transferability, lack of
intra-category analyses with fair hyperparameter settings, and (2) for attack
stealthiness, lack of diverse measures. Our evaluation results validate that
these problems have indeed caused misleading conclusions and missing points,
and addressing them leads to new, consensus-challenging insights, such
as (1) an early attack, DI, even outperforms all similar follow-up ones, (2)
the state-of-the-art (white-box) defense, DiffPure, is even vulnerable to
(black-box) transfer attacks, and (3) even under the same $L_p$ constraint,
different attacks yield dramatically different stealthiness results regarding
diverse imperceptibility metrics, finer-grained measures, and a user study. We
hope that our analyses will serve as guidance on properly evaluating
transferable adversarial images and advance the design of attacks and defenses.
Code is available at https://github.com/ZhengyuZhao/TransferAttackEval.
[COMMENTS]
TPAMI 2025. Code is available at
https://github.com/ZhengyuZhao/TransferAttackEval
[LINK]
http://arxiv.org/abs/2310.11850v2
[DATE]
2025-09-16 14:15:46+08:00
[CATEGORIES]
cs.LG
Spatio-temporal DeepKriging in PyTorch: A Supplementary Application to Precipitation Data for Interpolation and Probabilistic Forecasting
[AUTHORS]
Pratik Nag
[LINK]
http://arxiv.org/abs/2509.12708v1
[DATE]
2025-09-16 13:58:31+08:00
[CATEGORIES]
cs.LG
NORA: A Nephrology-Oriented Representation Learning Approach Towards Chronic Kidney Disease Classification
[AUTHORS]
Mohammad Abdul Hafeez Khan, Twisha Bhattacharyya, Omar Khan, Noorah Khan, Alina Aziz Fatima Khan, Mohammed Qutub Khan, Sujoy Ghosh Hajra
[ABSTRACT]
Chronic Kidney Disease (CKD) affects millions of people worldwide, yet its
early detection remains challenging, especially in outpatient settings where
laboratory-based renal biomarkers are often unavailable. In this work, we
investigate the predictive potential of routinely collected non-renal clinical
variables for CKD classification, including sociodemographic factors, comorbid
conditions, and urinalysis findings. We introduce the Nephrology-Oriented
Representation leArning (NORA) approach, which combines supervised contrastive
learning with a nonlinear Random Forest classifier. NORA first derives
discriminative patient representations from tabular EHR data, which are then
used for downstream CKD classification. We evaluated NORA on a clinic-based EHR
dataset from Riverside Nephrology Physicians. Our results demonstrated that
NORA improves class separability and overall classification performance,
particularly enhancing the F1-score for early-stage CKD. Additionally, we
assessed the generalizability of NORA on the UCI CKD dataset, demonstrating its
effectiveness for CKD risk stratification across distinct patient cohorts.
[COMMENTS]
7 pages, 5 figures, accepted to the International Conference on
Machine Learning and Applications (ICMLA) 2025
[LINK]
http://arxiv.org/abs/2509.12704v1
[DATE]
2025-09-16 13:54:33+08:00
[CATEGORIES]
cs.LG
Soft Graph Transformer for MIMO Detection
[AUTHORS]
Jiadong Hong, Lei Liu, Xinyu Bian, Wenjie Wang, Zhaoyang Zhang
[ABSTRACT]
We propose the Soft Graph Transformer (SGT), a Soft-Input-Soft-Output neural
architecture tailored for MIMO detection. While Maximum Likelihood (ML)
detection achieves optimal accuracy, its prohibitive exponential complexity
renders it impractical for real-world systems. Conventional message passing
algorithms offer tractable alternatives but rely on large-system asymptotics
and random matrix assumptions, both of which break down under practical
implementations. Prior Transformer-based detectors, on the other hand, fail to
incorporate the MIMO factor graph structure and cannot utilize decoder-side
soft information, limiting their standalone performance and their applicability
in iterative detection-decoding (IDD). To overcome these limitations, SGT
integrates message passing directly into a graph-aware attention mechanism and
supports decoder-informed updates through soft-input embeddings. This design
enables effective soft-output generation while preserving computational
efficiency. As a standalone detector, SGT closely approaches ML performance and
surpasses prior Transformer-based approaches.
[COMMENTS]
8 pages
[LINK]
http://arxiv.org/abs/2509.12694v1
[DATE]
2025-09-16 13:42:45+08:00
[CATEGORIES]
cs.LG
ZTree: A Subgroup Identification Based Decision Tree Learning Framework
[AUTHORS]
Eric Cheng, Jie Cheng
[ABSTRACT]
Decision trees are a commonly used class of machine learning models valued
for their interpretability and versatility, capable of both classification and
regression. We propose ZTree, a novel decision tree learning framework that
replaces CART’s traditional purity-based splitting with statistically
principled subgroup identification. At each node, ZTree applies hypothesis
testing (e.g., z-tests, t-tests, Mann-Whitney U, log-rank) to assess whether a
candidate subgroup differs meaningfully from the complement. To adjust for the
complication of multiple testing, we employ a cross-validation-based approach
to determine if further node splitting is needed. This robust stopping
criterion eliminates the need for post-pruning and makes the test threshold
(z-threshold) the only parameter for controlling tree complexity. Because of
the simplicity of the tree growing procedure, once a detailed tree is learned
using the most lenient z-threshold, all simpler trees can be derived by simply
removing nodes that do not meet the larger z-thresholds. This makes parameter
tuning intuitive and efficient. Furthermore, this z-threshold is essentially a
p-value, allowing users to easily plug in appropriate statistical tests into
our framework without adjusting the range of parameter search. Empirical
evaluation on five large-scale UCI datasets demonstrates that ZTree
consistently delivers strong performance, especially in low-data regimes.
Compared to CART, ZTree also tends to grow simpler trees without sacrificing
performance. ZTree introduces a statistically grounded alternative to
traditional decision tree splitting by leveraging hypothesis testing and a
cross-validation approach to multiple testing correction, resulting in an
efficient and flexible framework.
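
A minimal sketch of the kind of split selection the abstract describes, assuming a two-sample z-test on the mean of a numeric target. The function names `two_sample_z` and `best_split` are hypothetical, and the cross-validation stopping criterion mentioned in the abstract is not modeled here; only the z-threshold check is shown.

```python
import math

def two_sample_z(subgroup, complement):
    """Z-statistic for the difference in means between a subgroup and its complement."""
    n1, n2 = len(subgroup), len(complement)
    m1 = sum(subgroup) / n1
    m2 = sum(complement) / n2
    # Unbiased sample variances of each group.
    v1 = sum((y - m1) ** 2 for y in subgroup) / (n1 - 1)
    v2 = sum((y - m2) ** 2 for y in complement) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)
    if se == 0.0:
        # Degenerate case: both groups are constant.
        return 0.0 if m1 == m2 else math.copysign(math.inf, m1 - m2)
    return (m1 - m2) / se

def best_split(xs, ys, z_threshold):
    """Return (cut, |z|) for the strongest split x <= cut, or None if none reaches z_threshold."""
    best = None
    for cut in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= cut]
        right = [y for x, y in zip(xs, ys) if x > cut]
        if len(left) < 2 or len(right) < 2:
            continue  # variance is undefined for a single observation
        z = abs(two_sample_z(left, right))
        if z >= z_threshold and (best is None or z > best[1]):
            best = (cut, z)
    return best

# Toy data (assumed for illustration): the target jumps between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
split = best_split(xs, ys, z_threshold=3.0)
no_split = best_split(xs, ys, z_threshold=100.0)
```

Raising `z_threshold` can only remove splits, which is consistent with the abstract's remark that simpler trees can be derived from a tree grown under the most lenient threshold.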
[COMMENTS]
15 pages, 1 table, 5 figures
[LINK]
http://arxiv.org/abs/2509.12688v1
[DATE]
2025-09-16 13:25:16+08:00
[CATEGORIES]
cs.LG
AI/ML Based Detection and Categorization of Covert Communication in IPv6 Network
[AUTHORS]
Mohammad Wali Ur Rahman, Yu-Zheng Lin, Carter Weeks, David Ruddell, Jeff Gabriellini, Bill Hayes, Salim Hariri, Pratik Satam, Edward V. Ziegler Jr
[ABSTRACT]
The flexibility and complexity of IPv6 extension headers allow attackers to
create covert channels or bypass security mechanisms, leading to potential data
breaches or system compromises. Machine learning has matured into the primary
detection technology used to mitigate covert
communication threats. However, the complexity of detecting covert
communication, evolving injection techniques, and scarcity of data make
building machine-learning models challenging. In previous related research,
machine learning has shown good performance in detecting covert communications,
but oversimplified attack scenario assumptions cannot represent the complexity
of modern covert technologies and make it easier for machine learning models to
detect covert communications. To bridge this gap, in this study, we analyzed
the packet structure and network traffic behavior of IPv6, used encryption
algorithms, and performed covert communication injection without changing
network packet behavior to get closer to real attack scenarios. In addition to
analyzing and injecting methods for covert communications, this study also uses
comprehensive machine learning techniques to train the model proposed in this
study to detect threats, including traditional decision trees such as random
forests and gradient boosting, as well as complex neural network architectures
such as CNNs and LSTMs, to achieve detection accuracy of over 90\%. This study
details the methods used for dataset augmentation and the comparative
performance of the applied models, reinforcing insights into the adaptability
and resilience of the machine learning application in IPv6 covert
communication. We further introduce a Generative AI-driven script refinement
framework, leveraging prompt engineering as a preliminary exploration of how
generative agents can assist in covert communication detection and model
enhancement.
[COMMENTS]
15 pages, 8 figures, accepted by Springer Cybersecurity
[LINK]
http://arxiv.org/abs/2501.10627v2
[DATE]
2025-09-16 13:21:05+08:00
[CATEGORIES]
cs.LG
Adversarial Combinatorial Semi-bandits with Graph Feedback
[AUTHORS]
Yuxiao Wen
[ABSTRACT]
In combinatorial semi-bandits, a learner repeatedly selects from a
combinatorial decision set of arms, receives the realized sum of rewards, and
observes the rewards of the individual selected arms as feedback. In this
paper, we extend this framework to include \emph{graph feedback}, where the
learner observes the rewards of all neighboring arms of the selected arms in a
feedback graph $G$. We establish that the optimal regret over a time horizon
$T$ scales as $\widetilde{\Theta}(S\sqrt{T}+\sqrt{\alpha ST})$, where $S$ is
the size of the combinatorial decisions and $\alpha$ is the independence number
of $G$. This result interpolates between the known regrets
$\widetilde\Theta(S\sqrt{T})$ under full information (i.e., $G$ is complete)
and $\widetilde\Theta(\sqrt{KST})$ under the semi-bandit feedback (i.e., $G$
has only self-loops), where $K$ is the total number of arms. A key technical
ingredient is to realize a convexified action using a random decision vector
with negative correlations. We also show that online stochastic mirror descent
(OSMD) that only realizes convexified actions in expectation is suboptimal. In
addition, we describe the problem of \emph{combinatorial semi-bandits with
general capacity} and apply our results to derive an improved regret upper
bound, which may be of independent interest.
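
As a quick numeric check of the interpolation claim, the sketch below evaluates the leading-order rate $S\sqrt{T}+\sqrt{\alpha ST}$ at the two extremes of the feedback graph. Logarithmic factors and constants are ignored, and the values of `K`, `S`, and `T` are arbitrary choices for illustration.

```python
import math

def regret_rate(S, T, alpha):
    """Leading-order rate S*sqrt(T) + sqrt(alpha*S*T), ignoring log factors and constants."""
    return S * math.sqrt(T) + math.sqrt(alpha * S * T)

K, S, T = 100, 5, 10_000  # illustrative values with S <= K

# Complete graph: the independence number is 1, so the rate is within 2x of S*sqrt(T).
full_info = regret_rate(S, T, alpha=1)

# Self-loops only: the independence number is K, so the rate is within 2x of sqrt(K*S*T).
semi_bandit = regret_rate(S, T, alpha=K)
```

Because $S \le K$, the term $S\sqrt{T}$ never exceeds $\sqrt{KST}$, which is why the second case collapses to the known semi-bandit rate.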
[COMMENTS]
To appear in ICML 2025
[LINK]
http://arxiv.org/abs/2502.18826v7
[DATE]
2025-09-16 13:16:32+08:00
[CATEGORIES]
cs.LG
Large Language Model Scaling Laws for Neural Quantum States in Quantum Chemistry
[AUTHORS]
Oliver Knitter, Dan Zhao, Stefan Leichenauer, Shravan Veerapaneni
[ABSTRACT]
Scaling laws have been used to describe how large language model (LLM)
performance scales with model size, training data size, or amount of
computational resources. Motivated by the fact that neural quantum states (NQS)
have increasingly adopted LLM-based components, we seek to understand NQS
scaling laws, thereby shedding light on the scalability and optimal
performance–resource trade-offs of NQS ansatze. In particular, we identify
scaling laws that predict the performance, as measured by absolute error and
V-score, for transformer-based NQS as a function of problem size in
second-quantized quantum chemistry applications. By performing analogous
compute-constrained optimization of the obtained parametric curves, we find
that the relationship between model size and training time is highly dependent
on loss metric and ansatz, and does not follow the approximately linear
relationship found for language models.
[COMMENTS]
16 pages, 5 figures, to be submitted for peer review
[LINK]
http://arxiv.org/abs/2509.12679v1
[DATE]
2025-09-16 13:04:36+08:00
[CATEGORIES]
cs.LG
Instance-level Randomization: Toward More Stable LLM Evaluations
[AUTHORS]
Yiyang Li, Yonghuang Wu, Ying Luo, Liangtai Sun, Zishu Qin, Lin Qiu, Xuezhi Cao, Xunliang Cai
[COMMENTS]
Accepted by Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.12678v1
[DATE]
2025-09-16 13:04:00+08:00
[CATEGORIES]
cs.LG
MFAF: An EVA02-Based Multi-scale Frequency Attention Fusion Method for Cross-View Geo-Localization
[AUTHORS]
YiTong Liu, TianZhu Liu, YanFeng GU
[ABSTRACT]
Cross-view geo-localization aims to determine the geographical location of a
query image by matching it against a gallery of images. This task is
challenging due to the significant appearance variations of objects observed
from variable views, along with the difficulty in extracting discriminative
features. Existing approaches often rely on extracting features through feature
map segmentation while neglecting spatial and semantic information. To address
these issues, we propose the EVA02-based Multi-scale Frequency Attention Fusion
(MFAF) method. The MFAF method consists of Multi-Frequency Branch-wise Block
(MFB) and the Frequency-aware Spatial Attention (FSA) module. The MFB block
effectively captures both low-frequency structural features and high-frequency
edge details across multiple scales, improving the consistency and robustness
of feature representations across various viewpoints. Meanwhile, the FSA module
adaptively focuses on the key regions of frequency features, significantly
mitigating the interference caused by background noise and viewpoint
variability. Extensive experiments on widely recognized benchmarks, including
University-1652, SUES-200, and Dense-UAV, demonstrate that the MFAF method
achieves competitive performance in both drone localization and drone
navigation tasks.
[COMMENTS]
17 pages, 13 figures
[LINK]
http://arxiv.org/abs/2509.12673v1
[DATE]
2025-09-16 12:51:52+08:00
[CATEGORIES]
cs.LG
PBPK-iPINNs : Inverse Physics-Informed Neural Networks for Physiologically Based Pharmacokinetic Brain Models
[AUTHORS]
Charuka D. Wickramasinghe, Krishanthi C. Weerasinghe, Pradeep K. Ranaweera
[ABSTRACT]
Physics-Informed Neural Networks (PINNs) leverage machine learning with
differential equations to solve direct and inverse problems, ensuring
predictions follow physical laws. Physiologically based pharmacokinetic (PBPK)
modeling advances beyond classical compartmental approaches by using a
mechanistic, physiology-focused framework. A PBPK model is based on a system of
ODEs, with each equation representing the mass balance of a drug in a
compartment, such as an organ or tissue. These ODEs include parameters that
reflect physiological, biochemical, and drug-specific characteristics to
simulate how the drug moves through the body. In this paper, we introduce
PBPK-iPINN, a method to estimate drug-specific or patient-specific parameters
and drug concentration profiles in PBPK brain compartment models using inverse
PINNs. We demonstrate that, for the inverse problem to converge to the correct
solution, the loss function components (data loss, initial conditions loss, and
residual loss) must be appropriately weighted, and parameters (including number
of layers, number of neurons, activation functions, learning rate, optimizer,
and collocation points) must be carefully tuned. The performance of the
PBPK-iPINN approach is then compared with established traditional numerical and
statistical methods.
[COMMENTS]
24 pages, 11 figures
[LINK]
http://arxiv.org/abs/2509.12666v1
[DATE]
2025-09-16 12:43:09+08:00
[CATEGORIES]
cs.LG
State-of-Health Prediction for EV Lithium-Ion Batteries via DLinear and Robust Explainable Feature Selection
[AUTHORS]
Minsu Kim, Jaehyun Oh, Sang-Young Lee, Junghwan Kim
[ABSTRACT]
Accurate prediction of the state-of-health (SOH) of lithium-ion batteries is
essential for ensuring the safety, reliability, and efficient operation of
electric vehicles (EVs). Battery packs in EVs experience nonuniform degradation
due to cell-to-cell variability (CtCV), posing a major challenge for real-time
battery management. In this work, we propose an explainable, data-driven SOH
prediction framework tailored for EV battery management systems (BMS). The
approach combines robust feature engineering with a DLinear model. Using NASA’s
battery aging dataset, we extract twenty meaningful features from voltage,
current, temperature, and time profiles, and select key features using Pearson
correlation and Shapley additive explanations (SHAP). The SHAP-based selection
yields consistent feature importance across multiple cells, effectively
capturing CtCV. The DLinear algorithm outperforms long short-term memory (LSTM)
and Transformer architectures in prediction accuracy, while requiring fewer
training cycles and lower computational cost. This work offers a scalable and
interpretable framework for SOH forecasting, enabling practical implementation
in EV BMS and promoting safer, more efficient electric mobility.
[LINK]
http://arxiv.org/abs/2501.11542v2
[DATE]
2025-09-16 12:29:46+08:00
[CATEGORIES]
cs.LG
Sustainable LSTM-Based Precoding for RIS-Aided mmWave MIMO Systems with Implicit CSI
[AUTHORS]
Po-Heng Chou, Jiun-Jia Wu, Wan-Jen Huang, Ronald Y. Chang
[ABSTRACT]
In this paper, we propose a sustainable long short-term memory (LSTM)-based
precoding framework for reconfigurable intelligent surface (RIS)-assisted
millimeter-wave (mmWave) MIMO systems. Instead of explicit channel state
information (CSI) estimation, the framework exploits uplink pilot sequences to
implicitly learn channel characteristics, reducing both pilot overhead and
inference complexity. Practical hardware constraints are addressed by
incorporating the phase-dependent amplitude model of RIS elements, while a
multi-label training strategy improves robustness when multiple near-optimal
codewords yield comparable performance. Simulations show that the proposed
design achieves over 90% of the spectral efficiency of exhaustive search (ES)
with only 2.2% of its computation time, cutting energy consumption by nearly
two orders of magnitude. The method also demonstrates resilience under
distribution mismatch and scalability to larger RIS arrays, making it a
practical and energy-efficient solution for sustainable 6G wireless networks.
[COMMENTS]
6 pages, 5 figures, 2 tables, and accepted by 2025 IEEE Globecom Workshops
[LINK]
http://arxiv.org/abs/2509.12658v1
[DATE]
2025-09-16 12:29:14+08:00
[CATEGORIES]
cs.LG
Synthetic Survival Data Generation for Heart Failure Prognosis Using Deep Generative Models
[AUTHORS]
Chanon Puttanawarut, Natcha Fongsrisin, Porntep Amornritvanich, Panu Looareesuwan, Cholatid Ratanatharathorn
[ABSTRACT]
Background: Heart failure (HF) research is constrained by limited access to
large, shareable datasets due to privacy regulations and institutional
barriers. Synthetic data generation offers a promising solution to overcome
these challenges while preserving patient confidentiality. Methods: We
generated synthetic HF datasets from institutional data comprising 12,552
unique patients using five deep learning models: tabular variational
autoencoder (TVAE), normalizing flow, ADSGAN, SurvivalGAN, and tabular
denoising diffusion probabilistic models (TabDDPM). We comprehensively
evaluated synthetic data utility through statistical similarity metrics,
survival prediction using machine learning and privacy assessments. Results:
SurvivalGAN and TabDDPM demonstrated high fidelity to the original dataset,
exhibiting similar variable distributions and survival curves after applying
histogram equalization. SurvivalGAN (C-indices: 0.71-0.76) and TVAE (C-indices:
0.73-0.76) achieved the strongest performance in survival prediction
evaluation, closely matching real-data performance (C-indices: 0.73-0.76).
Privacy evaluation confirmed protection against re-identification attacks.
Conclusions: Deep learning-based synthetic data generation can produce
high-fidelity, privacy-preserving HF datasets suitable for research
applications. This publicly available synthetic dataset addresses critical data
sharing barriers and provides a valuable resource for advancing HF research and
predictive modeling.
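
For readers unfamiliar with the C-index reported above, here is a minimal sketch of Harrell's concordance index, one common definition. The abstract does not state which variant the authors used, so this is an assumption, and the toy data are invented for illustration.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which higher risk goes with shorter survival."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's recorded time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable

# Toy cohort: event = 1 means the event was observed, 0 means censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.5, 0.7, 0.9])
uninformative = concordance_index(times, events, [0.5, 0.5, 0.5, 0.5])
```

A value of 0.5 corresponds to an uninformative risk score and 1.0 to a perfect ranking, which puts the reported 0.71 to 0.76 range in context.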
[LINK]
http://arxiv.org/abs/2509.04245v2
[DATE]
2025-09-16 11:53:00+08:00
[CATEGORIES]
cs.LG
Optimization of GNN Training Through Half-precision
[AUTHORS]
Arnab Kanti Tarafder, Yidong Gong, Pradeep Kumar
[ABSTRACT]
Recent trends in lower precision, e.g. half-precision floating point,
training have shown improved system performance and reduced memory usage for
Deep Learning while maintaining accuracy. However, current GNN systems cannot
achieve such goals for GNN, as our analyses show that they massively
underperform while showing abnormal accuracy when using half-precision. These
systems suffer from value overflow issues due to lowered precision,
under-utilization of hardware resources, and poor training performance. To
mitigate this, we introduce HalfGNN, a half-precision based GNN system. HalfGNN
proposes novel techniques: new vector operations for half-precision data types
that improve data load and reduction performance, and discretized SpMM that
overcomes the value overflow and natively provides workload balancing. Such
techniques improve hardware utilization, reduce memory usage, and remove atomic
writes. Evaluations show that HalfGNN achieves an average 2.30X speedup in
training time over DGL (float-based) for GAT, GCN, and GIN while
achieving similar accuracy, and saving 2.67X memory.
[LINK]
http://arxiv.org/abs/2411.01109v2
[DATE]
2025-09-16 11:50:00+08:00
[CATEGORIES]
cs.LG
High-Energy Concentration for Federated Learning in Frequency Domain
[AUTHORS]
Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Leyuan Fang
[ABSTRACT]
Federated Learning (FL) presents significant potential for collaborative
optimization without data sharing. Since synthetic data is sent to the server,
leveraging the popular concept of dataset distillation, this FL framework
protects real data privacy while alleviating data heterogeneity. However, such
methods are still challenged by the redundant information and noise in entire
spatial-domain designs, which inevitably increases the communication burden. In
this paper, we propose a novel Frequency-Domain aware FL method with
high-energy concentration (FedFD) to address this problem. Our FedFD is
inspired by the discovery that the discrete cosine transform predominantly
distributes energy to specific regions, referred to as high-energy
concentration. The principle behind FedFD is that low-energy components, such as
high-frequency ones, usually contain redundant information and noise, so
filtering them out helps reduce communication costs and optimize performance. Our
FedFD is mathematically formulated to preserve the low-frequency components
using a binary mask, facilitating an optimal solution through frequency-domain
distribution alignment. In particular, real data-driven synthetic
classification is imposed into the loss to enhance the quality of the
low-frequency components. On five image and speech datasets, FedFD
outperforms state-of-the-art methods while reducing communication
costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha
= 0.01$, FedFD achieves a minimum reduction of 37.78\% in the communication
cost, while attaining a 10.88\% performance gain.
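
The idea of keeping low-frequency DCT coefficients with a binary mask can be sketched in one dimension as follows. This is an illustration only: FedFD's actual mask, its dimensionality, and its loss are not specified in the abstract, and the helper names `dct`, `idct`, and `low_frequency_filter` are hypothetical.

```python
import math

def _scale(k, N):
    # Orthonormal DCT-II scaling factor.
    return math.sqrt((1 if k == 0 else 2) / N)

def dct(x):
    """Orthonormal DCT-II of a 1D sequence."""
    N = len(x)
    return [
        _scale(k, N) * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        for k in range(N)
    ]

def idct(X):
    """Inverse of the orthonormal DCT-II above."""
    N = len(X)
    return [
        sum(_scale(k, N) * X[k] * math.cos(math.pi * (n + 0.5) * k / N) for k in range(N))
        for n in range(N)
    ]

def low_frequency_filter(x, keep):
    """Zero every DCT coefficient with index >= keep using a binary mask, then invert."""
    coeffs = dct(x)
    mask = [1 if k < keep else 0 for k in range(len(coeffs))]
    return idct([c * m for c, m in zip(coeffs, mask)])

# A smooth ramp: most of its energy sits in the first few coefficients.
signal = [n / 16 for n in range(16)]
coeffs = dct(signal)
energy_kept = sum(c * c for c in coeffs[:4]) / sum(c * c for c in coeffs)
compressed = low_frequency_filter(signal, keep=4)
lossless = low_frequency_filter(signal, keep=len(signal))
```

Only the `keep` retained coefficients would need to be transmitted in such a scheme, which is the intuition behind the communication savings the abstract reports.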
[LINK]
http://arxiv.org/abs/2509.12630v1
[DATE]
2025-09-16 11:49:26+08:00
[CATEGORIES]
cs.LG
Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports
[AUTHORS]
Jian Chen, Jiabao Dou, Jinbao Tian, Yunqi Yang, Zhou Li
[ABSTRACT]
The automatic classification of occupational accident reports is a critical
research area for enhancing workplace safety and enabling large-scale risk
analysis. However, the severe class imbalance inherent in these real-world
datasets often compromises the performance of analytical models, particularly
for rare but severe incident types, hindering the development of reliable
automated systems. To address this challenge, we propose ABEX-RAT, a novel and
efficient framework that synergizes generative data augmentation with robust
adversarial training. Our approach first employs a two-step
abstractive-expansive (ABEX) pipeline, which leverages a large language model
to distill core incident semantics and then uses a generative model to create
diverse, high-quality synthetic samples for underrepresented classes.
Subsequently, a lightweight classifier is trained on the augmented data using a
computationally efficient random adversarial training (RAT) protocol, which
stochastically applies perturbations to enhance model generalization and
robustness without significant overhead. Experimental results on the public
OSHA dataset demonstrate that our method achieves new state-of-the-art
performance, reaching a macro-F1 score of 90.32% and significantly
outperforming previous SOTA and fine-tuned large model baselines. Our work
validates that this synergistic strategy is a highly effective and efficient
alternative to brute-force fine-tuning for specialized, imbalanced
classification tasks. The code is publicly available
at: https://github.com/nxcc-lab/ABEX-RAT.
[LINK]
http://arxiv.org/abs/2509.02072v3
[DATE]
2025-09-16 11:28:45+08:00
[CATEGORIES]
cs.LG
Mob-based cattle weight gain forecasting using ML models
[AUTHORS]
Muhammad Riaz Hasib Hossain, Rafiqul Islam, Shawn R McGrath, Md Zahidul Islam, David Lamb
[ABSTRACT]
Forecasting mob-based cattle weight gain (MB CWG) may benefit large livestock
farms, allowing farmers to refine their feeding strategies, make educated
breeding choices, and reduce risks linked to climate variability and market
fluctuations. In this paper, a novel technique termed MB CWG is proposed to
forecast the one-month-ahead weight gain of herd-based cattle using
historical data collected from the Charles Sturt University Farm. This research
employs a Random Forest (RF) model, comparing its performance against Support
Vector Regression (SVR) and Long Short-Term Memory (LSTM) models for monthly
weight gain prediction. Four datasets were used to evaluate the performance of
models, using 756 samples from 108 herd-based cattle, along with weather
data (rainfall and temperature) influencing CWG. The RF model performs better
than the SVR and LSTM models across all datasets, achieving an R^2 of 0.973,
RMSE of 0.040, and MAE of 0.033 when both weather and age factors were
included. The results indicate that including both weather and age factors
significantly improves the accuracy of weight gain predictions, with the RF
model outperforming the SVR and LSTM models in all scenarios. These findings
demonstrate the potential of RF as a robust tool for forecasting cattle weight
gain in variable conditions, highlighting the influence of age and climatic
factors on herd-based weight trends. This study has also developed an
innovative automated preprocessing tool to generate a benchmark dataset for MB
CWG predictive models. The tool is publicly available on GitHub and can assist
in preparing datasets for current and future analytical research.
[LINK]
http://arxiv.org/abs/2509.12615v1
[DATE]
2025-09-16 11:23:43+08:00
[CATEGORIES]
cs.LG
ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
[AUTHORS]
Hengrui Zhang, Yulong Hui, Yihao Liu, Huanchen Zhang
[ABSTRACT]
Predicates are foundational components in data analysis systems. However,
modern workloads increasingly involve unstructured documents, which demand
semantic understanding beyond traditional value-based predicates. Although
Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, their
high inference cost leads to unacceptable overhead given enormous document
collections and ad-hoc queries. Therefore, we introduce \textsc{ScaleDoc}, a novel
system that addresses this by decoupling predicate execution into an offline
representation phase and an optimized online filtering phase. In the offline
phase, \textsc{ScaleDoc} leverages an LLM to generate semantic representations
for each document. Online, for each query, it trains a lightweight proxy model
on these representations to filter the majority of documents, forwarding only
the ambiguous cases to the LLM for final decision. Furthermore,
\textsc{ScaleDoc} proposes two core innovations to achieve significant
efficiency: (1) a contrastive-learning-based framework that trains the proxy
model to generate reliable predicating decision scores; (2) an adaptive cascade
mechanism that determines the effective filtering policy while meeting specific
accuracy targets. Our evaluations across three datasets demonstrate that
\textsc{ScaleDoc} achieves over a 2$\times$ end-to-end speedup and reduces
expensive LLM invocations by up to 85\%, making large-scale semantic analysis
practical and efficient.
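
The cascade described above can be sketched as a two-threshold filter: the proxy decides confident cases and only ambiguous ones reach the LLM. Everything here is a hypothetical stand-in: `cascade_filter`, the `low` and `high` thresholds, and the toy scoring functions are assumptions, and the abstract's adaptive threshold selection is not modeled.

```python
def cascade_filter(docs, proxy_score, llm_predicate, low, high):
    """Decide confident cases with the proxy; forward scores strictly between low and high to the LLM."""
    results = {}
    llm_calls = 0
    for doc_id, doc in docs.items():
        score = proxy_score(doc)
        if score >= high:
            results[doc_id] = True      # proxy is confident the predicate holds
        elif score <= low:
            results[doc_id] = False     # proxy is confident the predicate fails
        else:
            results[doc_id] = llm_predicate(doc)  # ambiguous: pay for the LLM
            llm_calls += 1
    return results, llm_calls

# Toy stand-ins: each "document" is just its precomputed proxy score.
docs = {"a": 0.95, "b": 0.05, "c": 0.50}
results, llm_calls = cascade_filter(
    docs,
    proxy_score=lambda d: d,
    llm_predicate=lambda d: d >= 0.5,  # placeholder for a real LLM call
    low=0.2,
    high=0.8,
)
```

Widening the gap between `low` and `high` trades more LLM invocations for higher accuracy, which is the tradeoff the adaptive cascade is said to manage.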
[LINK]
http://arxiv.org/abs/2509.12610v1
[DATE]
2025-09-16 11:18:06+08:00
[CATEGORIES]
cs.LG
Your Compiler is Backdooring Your Model: Understanding and Exploiting Compilation Inconsistency Vulnerabilities in Deep Learning Compilers
[AUTHORS]
Simin Chen, Jinjun Peng, Yixin He, Junfeng Yang, Baishakhi Ray
[ABSTRACT]
Deep learning (DL) compilers are core infrastructure in modern DL systems,
offering flexibility and scalability beyond vendor-specific libraries. This
work uncovers a fundamental vulnerability in their design: can an official,
unmodified compiler alter a model’s semantics during compilation and introduce
hidden backdoors? We study both adversarial and natural settings. In the
adversarial case, we craft benign models where triggers have no effect
pre-compilation but become effective backdoors after compilation. Tested on six
models, three commercial compilers, and two hardware platforms, our attack
yields 100% success on triggered inputs while preserving normal accuracy and
remaining undetected by state-of-the-art detectors. The attack generalizes
across compilers, hardware, and floating-point settings. In the natural
setting, we analyze the top 100 HuggingFace models (including one with 220M+
downloads) and find natural triggers in 31 models. This shows that compilers
can introduce risks even without adversarial manipulation.
Our results reveal an overlooked threat: unmodified DL compilers can silently
alter model semantics. To our knowledge, this is the first work to expose
inherent security risks in DL compiler design, opening a new direction for
secure and trustworthy ML.
[LINK]
http://arxiv.org/abs/2509.11173v2
[DATE]
2025-09-16 10:55:43+08:00
[CATEGORIES]
cs.LG
Disentangling Content from Style to Overcome Shortcut Learning: A Hybrid Generative-Discriminative Learning Framework
[AUTHORS]
Siming Fu, Sijun Dong, Xiaoliang Meng
[ABSTRACT]
Despite the remarkable success of Self-Supervised Learning (SSL), its
generalization is fundamentally hindered by Shortcut Learning, where models
exploit superficial features like texture instead of intrinsic structure. We
experimentally verify this flaw within the generative paradigm (e.g., MAE) and
argue it is a systemic issue also affecting discriminative methods, identifying
it as the root cause of their failure on unseen domains. While existing methods
often tackle this at a surface level by aligning or separating domain-specific
features, they fail to alter the underlying learning mechanism that fosters
shortcut dependency. To address this at its core, we propose HyGDL (Hybrid
Generative-Discriminative Learning Framework), a hybrid framework that achieves
explicit content-style disentanglement. Our approach is guided by the
Invariance Pre-training Principle: forcing a model to learn an invariant
essence by systematically varying a bias (e.g., style) at the input while
keeping the supervision signal constant. HyGDL operates on a single encoder and
analytically defines style as the component of a representation that is
orthogonal to its style-invariant content, derived via vector projection. This
is operationalized through a synergistic design: (1) a self-distillation
objective learns a stable, style-invariant content direction; (2) an analytical
projection then decomposes the representation into orthogonal content and style
vectors; and (3) a style-conditioned reconstruction objective uses these
vectors to restore the image, providing end-to-end supervision. Unlike prior
methods that rely on implicit heuristics, this principled disentanglement
allows HyGDL to learn truly robust representations, demonstrating superior
performance on benchmarks designed to diagnose shortcut learning.
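
The analytical projection in step (2) can be illustrated with plain vector algebra: project a representation onto a content direction and treat the orthogonal remainder as style. The function name `decompose` and the toy vectors are assumptions; how HyGDL learns the content direction is not shown.

```python
def decompose(z, content_dir):
    """Split z into its projection onto content_dir and the orthogonal remainder."""
    dot = sum(a * b for a, b in zip(z, content_dir))
    norm_sq = sum(c * c for c in content_dir)
    content = [dot / norm_sq * c for c in content_dir]
    style = [a - b for a, b in zip(z, content)]
    return content, style

# Toy representation and content direction (illustrative values).
z = [3.0, 4.0, 0.0]
content_dir = [1.0, 0.0, 0.0]
content, style = decompose(z, content_dir)

# By construction the two parts are orthogonal and sum back to z.
orthogonality = sum(s * c for s, c in zip(style, content_dir))
reconstructed = [c + s for c, s in zip(content, style)]
```

Because the decomposition is exact, a decoder that receives both vectors has all the information in the original representation, which fits the reconstruction objective in step (3).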
[LINK]
http://arxiv.org/abs/2509.11598v2
[DATE]
2025-09-16 10:52:25+08:00
[CATEGORIES]
cs.LG
A Particle-Flow Algorithm for Free-Support Wasserstein Barycenters
[AUTHORS]
Kisung You
[ABSTRACT]
The Wasserstein barycenter extends the Euclidean mean to the space of
probability measures by minimizing the weighted sum of squared 2-Wasserstein
distances. We develop a free-support algorithm for computing Wasserstein
barycenters that avoids entropic regularization and instead follows the formal
Riemannian geometry of Wasserstein space. In our approach, barycenter atoms
evolve as particles advected by averaged optimal-transport displacements, with
barycentric projections of optimal transport plans used in place of Monge maps
when the latter do not exist. This yields a geometry-aware particle-flow update
that preserves sharp features of the Wasserstein barycenter while remaining
computationally tractable. We establish theoretical guarantees, including
consistency of barycentric projections, monotone descent and convergence to
stationary points, stability with respect to perturbations of the inputs, and
resolution consistency as the number of atoms increases. Empirical studies on
averaging probability distributions, Bayesian posterior aggregation, image
prototypes and classification, and large-scale clustering demonstrate accuracy
and scalability of the proposed particle-flow approach, positioning it as a
principled alternative to both linear programming and regularized solvers.
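
A minimal one-dimensional sketch of the particle-flow idea, under strong simplifying assumptions: all measures have the same number of equal-weight atoms, so sorted-to-sorted matching is the optimal transport map and no barycentric projection is needed. The function name, `step`, and `iters` are hypothetical; the paper's general-dimension algorithm is not reproduced here.

```python
def particle_flow_barycenter_1d(measures, weights, atoms, step=0.5, iters=50):
    """Move barycenter atoms along the weighted average of OT displacements (1D, equal-size case)."""
    atoms = sorted(atoms)
    sorted_measures = [sorted(m) for m in measures]
    for _ in range(iters):
        # In 1D with equal-weight atoms, the i-th smallest barycenter atom is
        # transported to the i-th smallest atom of each input measure.
        atoms = [
            x + step * sum(w * (m[i] - x) for w, m in zip(weights, sorted_measures))
            for i, x in enumerate(atoms)
        ]
    return atoms

# Two toy input measures with three atoms each; weights are assumed to sum to 1.
measures = [[0.0, 1.0, 2.0], [10.0, 11.0, 12.0]]
weights = [0.5, 0.5]
barycenter = particle_flow_barycenter_1d(measures, weights, atoms=[0.0, 0.0, 0.0])
```

In this special case the fixed point is the weighted average of the sorted atoms, here approximately `[5, 6, 7]`, so the flow can be checked against a closed form.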
[LINK]
http://arxiv.org/abs/2509.11435v2
[DATE]
2025-09-16 10:50:21+08:00
[CATEGORIES]
cs.LG
Auditable Early Stopping for Agentic Routing: Ledger-Verified Run-Wise Certificates under Local DP
[AUTHORS]
Shivam Akhauri
[ABSTRACT]
We address when a best-first router for tool-use agents can stop exploring
without missing a better leaf, while preserving local differential privacy
(LDP) and leaving an audit trail. We introduce a run-wise certificate that
couples each node’s key to the same exponential race that realizes leaf
perturbations; the usual halting rule (stop when the maximum over $v$ in $F$ of
Key$(v) \le B^*$) then certifies the realized run. We give two certified modes
on context-indexed prefix DAGs with child partition: (i) Exact (known counts),
using lazy offset propagation with winner reuse; and (ii) Surrogate (upper
bounds only), which anchors keys to a parent-level surrogate race and allows
validator tightening via $\kappa = \log(N / N_{ub})$. A small compiler enforces
the partition property, and an admissible, race-independent M(tau) keeps keys
sound. The ledger logs uniforms, counts, and tie handling; privacy follows by
post-processing. Experiments on synthetic graphs and a small real pipeline show
tight stopping, deterministic replay, and low overhead.
[LINK]
http://arxiv.org/abs/2509.10550v2
[DATE]
2025-09-16 10:41:43+08:00
[CATEGORIES]
cs.LG
Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving
[AUTHORS]
Wonung Kim, Yubin Lee, Yoonsung Kim, Jinwoo Hwang, Seongryong Oh, Jiyong Jung, Aziz Huseynov, Woong Gyu Park, Chang Hyun Park, Divya Mahajan, Jongse Park
[ABSTRACT]
Transformers are the driving force behind today’s Large Language Models
(LLMs), serving as the foundation for their performance and versatility. Yet,
their compute and memory costs grow with sequence length, posing scalability
challenges for long-context inferencing. In response, the algorithm community
is exploring alternative architectures, such as state space models (SSMs),
linear attention, and recurrent neural networks (RNNs), which we refer to as
post-transformers. This shift presents a key challenge: building a serving
system that efficiently supports both transformer and post-transformer LLMs
within a unified framework. To address this challenge, we analyze the
performance characteristics of transformer and post-transformer LLMs. Despite
their algorithmic differences, both are fundamentally limited by memory
bandwidth under batched inference due to attention in transformers and state
updates in post-transformers. Further analyses suggest two additional insights:
(1) state update operations, unlike attention, incur high hardware cost, making
per-bank PIM acceleration inefficient, and (2) different low-precision
arithmetic methods offer varying accuracy-area tradeoffs, among which we identify
Microsoft’s MX as the Pareto-optimal choice. Building on these insights, we
design Pimba as an array of State-update Processing Units (SPUs), each shared
between two banks to enable interleaved access to PIM. Each SPU includes a
State-update Processing Engine (SPE) that comprises element-wise multipliers
and adders using MX-based quantized arithmetic, enabling efficient execution of
state update and attention operations. Our evaluation shows that, compared to
LLM-optimized GPU and GPU+PIM systems, Pimba achieves up to 4.1x and 2.1x
higher token generation throughput, respectively.
[LINK]
http://arxiv.org/abs/2507.10178v3
[DATE]
2025-09-16 10:24:35+08:00
[CATEGORIES]
cs.LG
EB-gMCR: Energy-Based Generative Modeling for Signal Unmixing and Multivariate Curve Resolution
[AUTHORS]
Yu-Tang Chang, Shih-Fang Chen
[ABSTRACT]
Signal unmixing analysis decomposes data into basic patterns and is widely
applied in chemical and biological research. Multivariate curve resolution
(MCR), a branch of signal unmixing, separates mixed signals into components
(base patterns) and their concentrations (intensity), playing a key role in
understanding composition. Classical MCR is typically framed as matrix
factorization (MF) and requires a user-specified number of components, usually
unknown in real data. As the data size or component count grows, these MCR
approaches face significant scalability challenges. This study reformulates
MCR as a data generative process (gMCR), and introduces an Energy-Based solver,
EB-gMCR, that automatically discovers the smallest component set and their
concentrations for reconstructing the mixed signals faithfully. On synthetic
benchmarks with up to 256 components, EB-gMCR attains high reconstruction
fidelity and recovers the component count within 5% at 20dB noise and
near-exact at 30dB. On two public spectral datasets, it identifies the correct
component count and improves component separation over MF-based MCR approaches
(NMF variants, ICA, MCR-ALS). EB-gMCR is a general solver for fixed-pattern
signal unmixing (components remain invariant across mixtures). Domain priors
(non-negativity, nonlinear mixing) enter as plug-in modules, enabling
adaptation to new instruments or domains without altering the core selection
learning step. The source code is available at
https://github.com/b05611038/ebgmcr_solver.
[COMMENTS]
10 pages, 3 figures, 2 tables
[LINK]
http://arxiv.org/abs/2507.23600v3
[DATE]
2025-09-16 10:17:56+08:00
[CATEGORIES]
cs.LG
Understanding Boolean Function Learnability on Deep Neural Networks: PAC Learning Meets Neurosymbolic Models
[AUTHORS]
Marcio Nicolau, Anderson R. Tavares, Zhiwei Zhang, Pedro Avelar, João M. Flach, Luis C. Lamb, Moshe Y. Vardi
[ABSTRACT]
Computational learning theory states that many classes of boolean formulas
are learnable in polynomial time. This paper addresses the understudied subject
of how, in practice, such formulas can be learned by deep neural networks.
Specifically, we analyze boolean formulas associated with model-sampling
benchmarks, combinatorial optimization problems, and random 3-CNFs with varying
degrees of constrainedness. Our experiments indicate that: (i) neural learning
generalizes better than pure rule-based systems and pure symbolic approaches;
(ii) relatively small and shallow neural networks are very good approximators
of formulas associated with combinatorial optimization problems; (iii) smaller
formulas seem harder to learn, possibly due to the fewer positive (satisfying)
examples available; and (iv) interestingly, underconstrained 3-CNF formulas are
more challenging to learn than overconstrained ones. Such findings pave the way
for a better understanding, construction, and use of interpretable
neurosymbolic AI methods.
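
As a rough illustration of how constrainedness affects the supply of positive (satisfying) examples, the sketch below generates random 3-CNF formulas and counts satisfying assignments by brute force. It assumes constrainedness is controlled by the clause-to-variable ratio, which the abstract does not spell out; the function names and the sizes used are hypothetical and are not taken from the paper.

```python
import itertools
import random


def random_3cnf(num_vars, num_clauses, rng):
    """Each clause picks 3 distinct variables and negates each with probability 1/2."""
    formula = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), 3)
        formula.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return formula


def satisfies(assignment, formula):
    """assignment[i] holds the truth value of variable i + 1."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in formula
    )


def positive_fraction(num_vars, formula):
    """Fraction of all 2**num_vars assignments that satisfy the formula."""
    assignments = itertools.product((False, True), repeat=num_vars)
    hits = sum(satisfies(a, formula) for a in assignments)
    return hits / 2 ** num_vars


# Hypothetical sweep: 10 variables, clause-to-variable ratios 2, 4 and 6.
rng = random.Random(0)
for ratio in (2, 4, 6):
    formula = random_3cnf(10, 10 * ratio, rng)
    print(ratio, positive_fraction(10, formula))
```

Brute-force counting is only feasible for small variable counts; it is used here purely to make the class balance visible.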
[COMMENTS]
Version accepted for NeSy 2025
[LINK]
http://arxiv.org/abs/2009.05908v4
[DATE]
2025-09-16 09:18:21+08:00
[CATEGORIES]
cs.LG
iCD: An Implicit Clustering Distillation Method for Structural Information Mining
[AUTHORS]
Xiang Xue, Yatu Ji, Qing-dao-er-ji Ren, Bao Shi, Min Lu, Nier Wu, Xufei Zhuang, Haiteng Xu, Gan-qi-qi-ge Cha
[ABSTRACT]
Logit Knowledge Distillation has gained substantial research interest in
recent years due to its simplicity and lack of requirement for intermediate
feature alignment; however, it suffers from limited interpretability in its
decision-making process. To address this, we propose implicit Clustering
Distillation (iCD): a simple and effective method that mines and transfers
interpretable structural knowledge from logits, without requiring ground-truth
labels or feature-space alignment. iCD leverages Gram matrices over decoupled
local logit representations to enable student models to learn latent semantic
structural patterns. Extensive experiments on benchmark datasets demonstrate
the effectiveness of iCD across diverse teacher-student architectures, with
particularly strong performance in fine-grained classification tasks –
achieving a peak improvement of +5.08% over the baseline. The code is available
at: https://github.com/maomaochongaa/iCD.
[LINK]
http://arxiv.org/abs/2509.12553v1
[DATE]
2025-09-16 09:16:13+08:00
[CATEGORIES]
cs.LG
An Adaptive Tensor-Train Decomposition Approach for Efficient Deep Neural Network Compression
[AUTHORS]
Shiyi Luo, Mingshuo Liu, Yifeng Yu, Shangping Ren, Yu Bai
[ABSTRACT]
In the field of model compression, choosing an appropriate rank for tensor
decomposition is pivotal for balancing model compression rate and efficiency.
However, this selection, whether done manually or through optimization-based
automatic methods, often increases computational complexity. Manual rank
selection lacks efficiency and scalability, often requiring extensive
trial-and-error, while optimization-based automatic methods significantly
increase the computational burden. To address this, we introduce a novel,
automatic, and budget-aware rank selection method for efficient model
compression, which employs Layer-Wise Imprinting Quantitation (LWIQ). LWIQ
quantifies each layer’s significance within a neural network by integrating a
proxy classifier. This classifier assesses the layer’s impact on overall model
performance, allowing for a more informed adjustment of tensor rank.
Furthermore, our approach includes a scaling factor to cater to varying
computational budget constraints. This budget awareness eliminates the need for
repetitive rank recalculations for different budget scenarios. Experimental
results on the CIFAR-10 dataset show that LWIQ improves rank search efficiency
by 63.2% while accuracy drops by only 0.86% at a 3.2x smaller model size on
ResNet-56, compared with the state-of-the-art proxy-based automatic tensor
rank selection method.
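
To make the role of the rank concrete, the following minimal sketch counts the parameters of a tensor-train (TT) factorization for a given set of ranks. It is not the LWIQ procedure from the abstract; the function names, mode sizes, and ranks are illustrative assumptions.

```python
from math import prod


def tt_param_count(mode_sizes, ranks):
    """Parameters of a TT factorization whose k-th core has shape (r[k-1], n[k], r[k]).

    `ranks` lists the internal TT ranks; the two boundary ranks are fixed to 1.
    """
    if len(ranks) != len(mode_sizes) - 1:
        raise ValueError("need exactly len(mode_sizes) - 1 internal ranks")
    full_ranks = [1, *ranks, 1]
    return sum(
        full_ranks[k] * n * full_ranks[k + 1]
        for k, n in enumerate(mode_sizes)
    )


def compression_ratio(mode_sizes, ranks):
    """Dense parameter count divided by the TT parameter count."""
    return prod(mode_sizes) / tt_param_count(mode_sizes, ranks)


# Hypothetical 4-way reshaping of a 64 x 64 weight matrix (8 * 8 * 8 * 8 entries).
modes = [8, 8, 8, 8]
for r in (2, 4, 8):
    ranks = [r] * (len(modes) - 1)
    print(r, tt_param_count(modes, ranks), round(compression_ratio(modes, ranks), 2))
```

Because the parameter count grows roughly quadratically in the internal ranks, small rank changes per layer shift the compression rate substantially, which is why the rank choice dominates the budget.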
[COMMENTS]
11 pages, 6 figures
[LINK]
http://arxiv.org/abs/2408.01534v3
[DATE]
2025-09-16 09:07:49+08:00
[CATEGORIES]
cs.LG
EMOE: A Framework for Out-of-distribution Uncertainty Based Rejection via Model-Agnostic Expansive Matching of Experts
[AUTHORS]
Yunni Qu, James Wellnitz, Dzung Dinh, Bhargav Vaduri, Alexander Tropsha, Junier Oliva
[ABSTRACT]
Expansive Matching of Experts (EMOE) is a novel framework that utilizes
support-expanding, extrapolatory pseudo-labeling to improve prediction and
uncertainty-based rejection on out-of-distribution (OOD) points. EMOE uses a
diverse set of base experts as pseudo-labelers on the augmented data to improve
OOD performance through multiple MLP heads (one per expert) with a shared
embedding, trained with a novel per-head matching loss. Unlike prior
methods that rely on modality-specific augmentations or assume access to OOD
data, EMOE introduces extrapolatory pseudo-labeling on latent-space
augmentations, enabling robust OOD generalization with any real-valued vector
data. In contrast to prior modality agnostic methods with neural backbones,
EMOE is model-agnostic, working effectively with methods from simple tree-based
models to complex OOD generalization models. We demonstrate that EMOE achieves
superior performance compared to state-of-the-art methods on diverse datasets
in the single-source domain generalization setting.
[LINK]
http://arxiv.org/abs/2406.01825v3
[DATE]
2025-09-16 09:02:27+08:00
[CATEGORIES]
cs.LG
Human + AI for Accelerating Ad Localization Evaluation
[AUTHORS]
Harshit Rajgarhia, Shivali Dalmia, Mengyang Zhao, Mukherji Abhishek, Kiran Ganesh
[ABSTRACT]
Adapting advertisements for multilingual audiences requires more than simple
text translation; it demands preservation of visual consistency, spatial
alignment, and stylistic integrity across diverse languages and formats. We
introduce a structured framework that combines automated components with human
oversight to address the complexities of advertisement localization. To the
best of our knowledge, this is the first work to integrate scene text
detection, inpainting, machine translation (MT), and text reimposition
specifically for accelerating ad localization evaluation workflows. Qualitative
results across six locales demonstrate that our approach produces semantically
accurate and visually coherent localized advertisements, suitable for
deployment in real-world workflows.
[LINK]
http://arxiv.org/abs/2509.12543v1
[DATE]
2025-09-16 08:52:41+08:00
[CATEGORIES]
cs.LG
Cross-Modal Deep Metric Learning for Time Series Anomaly Detection
[AUTHORS]
Wei Li, Zheze Yang
[ABSTRACT]
To effectively address the issues of low sensitivity and high time
consumption in time series anomaly detection, we propose an anomaly detection
method based on cross-modal deep metric learning. A cross-modal deep metric
learning feature clustering model is constructed, composed of an input layer, a
triplet selection layer, and a loss function computation layer. The squared
Euclidean distances between cluster centers are calculated, and a stochastic
gradient descent strategy is employed to optimize the model and classify
different time series features. The inner product of principal component
direction vectors is used as a metric for anomaly measurement. The von
Mises-Fisher (vMF) distribution is applied to describe the directional
characteristics of time series data, and historical data is used to train and
obtain evaluation parameters. By comparing the principal component direction
vector of actual time series data with the threshold, anomaly detection is
performed. Experimental results demonstrate that the proposed method accurately
classifies time series data with different attributes, exhibits high
sensitivity to anomalies, and achieves high detection accuracy, fast detection
speed, and strong robustness.
[LINK]
http://arxiv.org/abs/2509.12540v1
[DATE]
2025-09-16 08:43:00+08:00
[CATEGORIES]
cs.LG
Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN
[AUTHORS]
Miroslav Cibula, Kristína Malinovská, Matthias Kerzel
[ABSTRACT]
Trajectory planning in robotics is understood as generating a sequence of
joint configurations that will lead a robotic agent, or its manipulator, from
an initial state to the desired final state, thus completing a manipulation
task while considering constraints like robot kinematics and the environment.
Typically, this is achieved via sampling-based planners, which are
computationally intensive. Recent advances demonstrate that trajectory planning
can also be performed by supervised sequence learning of trajectories, often
requiring only a single or fixed number of passes through a neural
architecture, thus ensuring a bounded computation time. Such fully supervised
approaches, however, perform imitation learning; they do not learn based on
whether the trajectories can successfully reach a goal, but try to reproduce
observed trajectories. In our work, we build on this approach and propose a
cognitively inspired self-supervised learning scheme based on a recurrent
architecture for building a trajectory model. We evaluate the feasibility of
the proposed method on a task of kinematic planning for a robotic arm. The
results suggest that the model is able to learn to generate trajectories only
using given paired forward and inverse kinematics models, and indicate that
this novel method could facilitate planning for more complex manipulation tasks
requiring adaptive solutions.
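
The idea of learning from paired forward and inverse kinematics models, rather than from demonstrated trajectories, can be sketched with a planar two-link arm. This is a textbook analytic model chosen for illustration, not the robot or the networks used in the paper; the link lengths, function names, and the round-trip error are assumptions.

```python
import math


def forward_kinematics(theta1, theta2, l1=1.0, l2=1.0):
    """End-effector position of a planar two-link arm."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y


def inverse_kinematics(x, y, l1=1.0, l2=1.0):
    """One analytic solution (the one with theta2 in [0, pi]) for a reachable target."""
    cos_t2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_t2 <= 1.0:
        raise ValueError("target is out of reach")
    theta2 = math.acos(cos_t2)
    theta1 = math.atan2(y, x) - math.atan2(
        l2 * math.sin(theta2), l1 + l2 * math.cos(theta2)
    )
    return theta1, theta2


def goal_error(joint_angles, goal):
    """Distance between the goal and the position reached by the given joints.

    A signal of this kind needs no demonstrated trajectory, only the forward model.
    """
    x, y = forward_kinematics(*joint_angles)
    return math.hypot(x - goal[0], y - goal[1])


goal = (1.2, 0.5)
print(goal_error(inverse_kinematics(*goal), goal))  # close to zero
print(goal_error((0.0, 0.0), goal))                 # error of an arbitrary pose
```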
[COMMENTS]
12 pages, 4 figures, 2 tables. To be published in 2025 International
Conference on Artificial Neural Networks (ICANN) proceedings. This research
was funded by the Horizon Europe project TERAIS, GA no. 101079338, and in
part by the Slovak Grant Agency for Science (VEGA), project 1/0373/23. The
code can be found at https://doi.org/10.5281/zenodo.17127997
[LINK]
http://arxiv.org/abs/2507.02171v2
[DATE]
2025-09-16 08:36:01+08:00
[CATEGORIES]
cs.LG
Pre-trained Visual Representations Generalize Where it Matters in Model-Based Reinforcement Learning
[AUTHORS]
Scott Jones, Liyou Zhou, Sebastian W. Pattinson
[ABSTRACT]
In visuomotor policy learning, the control policy for the robotic agent is
derived directly from visual inputs. The typical approach, where a policy and
vision encoder are trained jointly from scratch, generalizes poorly to novel
visual scene changes. Using pre-trained vision models (PVMs) to inform a policy
network improves robustness in model-free reinforcement learning (MFRL). Recent
developments in model-based reinforcement learning (MBRL) suggest that MBRL is
more sample-efficient than MFRL. However, counterintuitively, existing work has
found PVMs to be ineffective in MBRL. Here, we investigate the effectiveness of
PVMs in MBRL, specifically on generalization under visual domain shifts. We show
that, in scenarios with severe shifts, PVMs perform much better than a baseline
model trained from scratch. We further investigate the effects of varying
levels of fine-tuning of PVMs. Our results show that partial fine-tuning can
maintain the highest average task performance under the most extreme
distribution shifts. Our results demonstrate that PVMs are highly successful in
promoting robustness in visual policy learning, providing compelling evidence
for their wider adoption in model-based robotic learning applications.
[LINK]
http://arxiv.org/abs/2509.12531v1
[DATE]
2025-09-16 08:13:14+08:00
[CATEGORIES]
cs.LG
Overcoming classic challenges for artificial neural networks by providing incentives and practice
[AUTHORS]
Kazuki Irie, Brenden M. Lake
[ABSTRACT]
Since the earliest proposals for artificial neural network (ANN) models of
the mind and brain, critics have pointed out key weaknesses in these models
compared to human cognitive abilities. Here we review recent work that uses
metalearning to overcome several classic challenges, which we characterize as
addressing the Problem of Incentive and Practice – that is, providing machines
with both incentives to improve specific skills and opportunities to practice
those skills. This explicit optimization contrasts with more conventional
approaches that hope the desired behaviour will emerge through optimizing
related but different objectives. We review applications of this principle to
addressing four classic challenges for ANNs: systematic generalization,
catastrophic forgetting, few-shot learning and multi-step reasoning. We also
discuss how large language models incorporate key aspects of this metalearning
framework (namely, sequence prediction with feedback trained on diverse data),
which helps to explain some of their successes on these classic challenges.
Finally, we discuss the prospects for understanding aspects of human
development through this framework, and whether natural environments provide
the right incentives and practice for learning how to make challenging
generalizations.
[COMMENTS]
In press at Nature Machine Intelligence
[LINK]
http://arxiv.org/abs/2410.10596v4
[DATE]
2025-09-16 08:12:42+08:00
[CATEGORIES]
cs.LG
A Statistical Analysis of Deep Federated Learning for Intrinsically Low-dimensional Data
[AUTHORS]
Saptarshi Chakraborty, Peter L. Bartlett
[ABSTRACT]
Despite significant research on the optimization aspects of federated
learning, the exploration of generalization error, especially in the realm of
heterogeneous federated learning, remains an area that has been insufficiently
investigated, primarily limited to developments in the parametric regime. This
paper delves into the generalization properties of deep federated regression
within a two-stage sampling model. Our findings reveal that the intrinsic
dimension, characterized by the entropic dimension, plays a pivotal role in
determining the convergence rates for deep learners when appropriately chosen
network sizes are employed. Specifically, when the true relationship between
the response and explanatory variables is described by a $\beta$-Hölder
function and one has access to $n$ independent and identically distributed
(i.i.d.) samples from $m$ participating clients, for participating clients, the
error rate scales at most as $\tilde{O}((mn)^{-2\beta/(2\beta +
\bar{d}_{2\beta}(\lambda))})$, whereas for non-participating clients, it scales
as $\tilde{O}(\Delta \cdot m^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))} +
(mn)^{-2\beta/(2\beta + \bar{d}_{2\beta}(\lambda))})$. Here
$\bar{d}_{2\beta}(\lambda)$ denotes the corresponding $2\beta$-entropic
dimension of $\lambda$, the marginal distribution of the explanatory variables.
The dependence between the two stages of the sampling scheme is characterized
by $\Delta$. Consequently, our findings not only explicitly incorporate the
“heterogeneity” of the clients, but also highlight that the convergence rates
of errors of deep federated learners are not contingent on the nominal high
dimensionality of the data but rather on its intrinsic dimension.
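
A small numerical sketch may help show why the intrinsic dimension matters in the rates above. It evaluates the stated expressions while ignoring logarithmic factors and constants; the function names and the numbers plugged in (smoothness, client and sample counts, and the two dimensions) are purely illustrative assumptions.

```python
def rate_exponent(beta, dim):
    """Exponent a = 2*beta / (2*beta + dim) appearing in rates of the form N**(-a)."""
    if beta <= 0 or dim < 0:
        raise ValueError("beta must be positive and dim non-negative")
    return 2.0 * beta / (2.0 * beta + dim)


def participating_rate(m, n, beta, dim):
    """(m*n)**(-a), the participating-client rate up to log factors and constants."""
    return (m * n) ** (-rate_exponent(beta, dim))


def non_participating_rate(m, n, beta, dim, delta):
    """delta * m**(-a) + (m*n)**(-a), again up to log factors and constants."""
    a = rate_exponent(beta, dim)
    return delta * m ** (-a) + (m * n) ** (-a)


# Hypothetical setting: beta = 2, 100 clients with 50 samples each.
beta, m, n = 2.0, 100, 50
for dim in (4, 784):  # an assumed intrinsic dimension vs. a nominal input dimension
    print(dim, rate_exponent(beta, dim), participating_rate(m, n, beta, dim))
```

With these assumed values, the exponent is far larger for the small dimension, so the bound decays much faster in the total sample count $mn$.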
[LINK]
http://arxiv.org/abs/2410.20659v2
[DATE]
2025-09-16 08:10:25+08:00
[CATEGORIES]
cs.LG
Graph Homophily Booster: Rethinking the Role of Discrete Features on Heterophilic Graphs
[AUTHORS]
Ruizhong Qiu, Ting-Wei Li, Gaotang Li, Hanghang Tong
[ABSTRACT]
Graph neural networks (GNNs) have emerged as a powerful tool for modeling
graph-structured data. However, existing GNNs often struggle with heterophilic
graphs, where connected nodes tend to have dissimilar features or labels. While
numerous methods have been proposed to address this challenge, they primarily
focus on architectural designs without directly targeting the root cause of the
heterophily problem. These approaches still perform even worse than the
simplest MLPs on challenging heterophilic datasets. For instance, our
experiments show that 21 of the latest GNNs still fall behind the MLP on the Actor
dataset. This critical challenge calls for an innovative approach to addressing
graph heterophily beyond architectural designs. To bridge this gap, we propose
and study a new and unexplored paradigm: directly increasing the graph
homophily via a carefully designed graph transformation. In this work, we
present a simple yet effective framework called GRAPHITE to address graph
heterophily. To the best of our knowledge, this work is the first method that
explicitly transforms the graph to directly improve the graph homophily.
Stemming from the exact definition of homophily, our proposed GRAPHITE creates
feature nodes to facilitate homophilic message passing between nodes that share
similar features. Furthermore, we both theoretically and empirically show that
our proposed GRAPHITE significantly increases the homophily of originally
heterophilic graphs, with only a slight increase in the graph size. Extensive
experiments on challenging datasets demonstrate that our proposed GRAPHITE
significantly outperforms state-of-the-art methods on heterophilic graphs while
achieving comparable accuracy with state-of-the-art methods on homophilic
graphs.
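
The feature-node idea can be pictured with a toy graph transformation: one auxiliary node is added per discrete feature value and linked to every node holding that feature, so nodes sharing a feature end up two hops apart. This is only a sketch of the general idea under assumed data structures; it is not the GRAPHITE construction itself, and all names below are hypothetical.

```python
def add_feature_nodes(num_nodes, edges, node_features):
    """Append one auxiliary node per distinct feature and link it to its holders.

    node_features[i] is the set of discrete feature ids held by node i.
    Returns (total_node_count, augmented_edge_list, feature_to_node_id).
    """
    feature_ids = sorted(set().union(*node_features))
    feature_node = {f: num_nodes + k for k, f in enumerate(feature_ids)}
    new_edges = list(edges)
    for node, feats in enumerate(node_features):
        for f in sorted(feats):
            new_edges.append((node, feature_node[f]))
    return num_nodes + len(feature_ids), new_edges, feature_node


def two_hop_pairs(num_original, edges):
    """Pairs (i < j) of original nodes that share at least one neighbour."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    pairs = set()
    for shared in neighbours.values():
        originals = sorted(x for x in shared if x < num_original)
        pairs.update(
            (a, b) for i, a in enumerate(originals) for b in originals[i + 1:]
        )
    return pairs


# Toy graph: edges join nodes with different features.
edges = [(0, 1), (2, 3)]
features = [{"red"}, {"blue"}, {"red"}, {"blue"}]
total, new_edges, mapping = add_feature_nodes(4, edges, features)
print(two_hop_pairs(4, edges))      # pairs reachable in two hops before
print(two_hop_pairs(4, new_edges))  # pairs reachable in two hops after
```

In this toy case the graph grows by only as many nodes as there are distinct feature values, which matches the abstract's remark about a slight increase in graph size.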
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2509.12530v1
[DATE]
2025-09-16 08:10:20+08:00
[CATEGORIES]
cs.LG
Selective Risk Certification for LLM Outputs via Information-Lift Statistics: PAC-Bayes, Robustness, and Skeleton Design
[AUTHORS]
Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma
[ABSTRACT]
Large language models often produce plausible but incorrect outputs. Existing
heuristics such as HallBayes lack formal guarantees. We develop the first
comprehensive theory of information-lift certificates under selective
classification. Our contributions are: (i) a PAC-Bayes sub-gamma
analysis extending beyond standard Bernstein bounds; (ii) explicit skeleton
sensitivity theorems quantifying robustness to misspecification; (iii)
failure-mode guarantees under assumption violations; and (iv) a principled
variational method for skeleton construction. Across six datasets and multiple
model families, we validate assumptions empirically, reduce abstention by
12–15% at the same risk, and maintain runtime overhead below 20% (further
reduced via batching).
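
The trade-off between abstention and risk in selective classification can be sketched as follows. The scores here stand in for whatever certificate statistic is used; the sketch does not compute information lift, and the function name, scores, and thresholds are hypothetical.

```python
def selective_metrics(scores, correct, threshold):
    """Answer only when score >= threshold.

    Returns (coverage, selective_risk): the fraction of inputs answered and the
    error rate among answered inputs (0.0 when nothing is answered).
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    answered = [c for s, c in zip(scores, correct) if s >= threshold]
    coverage = len(answered) / len(scores)
    risk = sum(1 for c in answered if not c) / len(answered) if answered else 0.0
    return coverage, risk


# Hypothetical confidence scores and correctness labels for four outputs.
scores = [0.9, 0.8, 0.4, 0.2]
correct = [True, False, True, False]
for threshold in (0.1, 0.5, 0.85):
    coverage, risk = selective_metrics(scores, correct, threshold)
    print(threshold, coverage, 1.0 - coverage, risk)  # third column is abstention
```

Reducing abstention at the same risk means finding a score for which a given risk level is reached at higher coverage.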
[LINK]
http://arxiv.org/abs/2509.12527v1
[DATE]
2025-09-16 08:05:54+08:00
[CATEGORIES]
cs.LG
Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
[AUTHORS]
Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen
[ABSTRACT]
Recently, Multimodal Large Language Models (MLLMs) have gained significant
attention across various domains. However, their widespread adoption has also
raised serious safety concerns. In this paper, we uncover a new safety risk of
MLLMs: the output preference of MLLMs can be arbitrarily manipulated by
carefully optimized images. Such attacks often generate contextually relevant
yet biased responses that are neither overtly harmful nor unethical, making
them difficult to detect. Specifically, we introduce a novel method, Preference
Hijacking (Phi), for manipulating the MLLM response preferences using a
preference-hijacked image. Our method works at inference time and requires no
model modifications. Additionally, we introduce a universal hijacking
perturbation – a transferable component that can be embedded into different
images to hijack MLLM responses toward any attacker-specified preferences.
Experimental results across various tasks demonstrate the effectiveness of our
approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.
[LINK]
http://arxiv.org/abs/2509.12521v1
[DATE]
2025-09-16 07:55:57+08:00
[CATEGORIES]
cs.LG
Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents
[AUTHORS]
Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow
[ABSTRACT]
One of the main goals of robotics and intelligent agent research is to enable
natural communication with humans in physically situated settings. While recent
work has focused on verbal modes such as language and speech, non-verbal
communication is crucial for flexible interaction. We present a framework for
generating pointing gestures in embodied agents by combining imitation and
reinforcement learning. Using a small motion capture dataset, our method learns
a motor control policy that produces physically valid, naturalistic gestures
with high referential accuracy. We evaluate the approach against supervised
learning and retrieval baselines in both objective metrics and a virtual
reality referential game with human users. Results show that our system
achieves higher naturalness and accuracy than state-of-the-art supervised
models, highlighting the promise of imitation-RL for communicative gesture
generation and its potential application to robots.
[COMMENTS]
DOI: 10.3389/frobt.2023.1110534. This is the author’s LaTeX version
[LINK]
http://arxiv.org/abs/2509.12507v1
[DATE]
2025-09-16 07:15:15+08:00
[CATEGORIES]
cs.LG
Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline
[AUTHORS]
Stefanos Gkikas, Ioannis Kyprakis, Manolis Tsiknakis
[ABSTRACT]
Pain is a complex condition that affects a large portion of the population.
Accurate and consistent evaluation is essential for individuals experiencing
pain and supports the development of effective and advanced management
strategies. Automatic pain assessment systems provide continuous monitoring,
aid clinical decision-making, and aim to reduce distress while preventing
functional decline. This study has been submitted to the Second Multimodal
Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed
method introduces a pipeline that employs respiration as the input signal and
integrates a highly efficient cross-attention transformer with a
multi-windowing strategy. Extensive experiments demonstrate that respiration
serves as a valuable physiological modality for pain assessment. Furthermore,
results show that compact and efficient models, when properly optimized, can
deliver strong performance, often surpassing larger counterparts. The proposed
multi-window strategy effectively captures short-term and long-term features,
along with global characteristics, enhancing the model’s representational
capacity.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2507.21881,
arXiv:2507.21875
[LINK]
http://arxiv.org/abs/2507.21886v6
[DATE]
2025-09-16 07:02:53+08:00
[CATEGORIES]
cs.LG
SamudrACE: Fast and Accurate Coupled Climate Modeling with 3D Ocean and Atmosphere Emulators
[AUTHORS]
James P. C. Duncan, Elynn Wu, Surya Dheeshjith, Adam Subel, Troy Arcomano, Spencer K. Clark, Brian Henn, Anna Kwa, Jeremy McGibbon, W. Andre Perkins, William Gregory, Carlos Fernandez-Granda, Julius Busecke, Oliver Watt-Meyer, William J. Hurlin, Alistair Adcroft, Laure Zanna, Christopher Bretherton
[ABSTRACT]
Traditional numerical global climate models simulate the full Earth system by
exchanging boundary conditions between separate simulators of the atmosphere,
ocean, sea ice, land surface, and other geophysical processes. This paradigm
allows for distributed development of individual components within a common
framework, unified by a coupler that handles translation between realms via
spatial or temporal alignment and flux exchange. Following a similar approach
adapted for machine learning-based emulators, we present SamudrACE: a coupled
global climate model emulator which produces centuries-long simulations at
1-degree horizontal, 6-hourly atmospheric, and 5-daily oceanic resolution, with
145 2D fields spanning 8 atmospheric and 19 oceanic vertical levels, plus sea
ice, surface, and top-of-atmosphere variables. SamudrACE is highly stable and
has low climate biases comparable to those of its components with prescribed
boundary forcing, with realistic variability in coupled climate phenomena such
as ENSO that is not possible to simulate in uncoupled mode.
[COMMENTS]
23 pages, 17 figures
[LINK]
http://arxiv.org/abs/2509.12490v1
[DATE]
2025-09-16 06:27:26+08:00
[CATEGORIES]
cs.LG
Finite-Agent Stochastic Differential Games on Large Graphs: II. Graph-Based Architectures
[AUTHORS]
Ruimeng Hu, Jihao Long, Haosheng Zhou
[ABSTRACT]
We propose a novel neural network architecture, called Non-Trainable
Modification (NTM), for computing Nash equilibria in stochastic differential
games (SDGs) on graphs. These games model a broad class of graph-structured
multi-agent systems arising in finance, robotics, energy, and social dynamics,
where agents interact locally under uncertainty. The NTM architecture imposes a
graph-guided sparsification on feedforward neural networks, embedding fixed,
non-trainable components aligned with the underlying graph topology. This
design enhances interpretability and stability, while significantly reducing
the number of trainable parameters in large-scale, sparse settings. We
theoretically establish a universal approximation property for NTM in static
games on graphs and numerically validate its expressivity and robustness
through supervised learning tasks. Building on this foundation, we incorporate
NTM into two state-of-the-art game solvers, Direct Parameterization and Deep
BSDE, yielding their sparse variants (NTM-DP and NTM-DBSDE). Numerical
experiments on three SDGs across various graph structures demonstrate that
NTM-based methods achieve performance comparable to their fully trainable
counterparts, while offering improved computational efficiency.
[LINK]
http://arxiv.org/abs/2509.12484v1
[DATE]
2025-09-16 06:11:56+08:00
[CATEGORIES]
cs.LG
Comparative Analysis of Wave Scattering Numerical Modeling Using the Boundary Element Method and Physics-Informed Neural Networks
[AUTHORS]
Oscar Rincón-Cardeno, Gregorio Pérez Bernal, Silvana Montoya Noguera, Nicolás Guarín-Zapata
[ABSTRACT]
Purpose - This study compares the Boundary Element Method (BEM) and
Physics-Informed Neural Networks (PINNs) for solving the two-dimensional
Helmholtz equation in wave scattering problems. The objective is to evaluate
the performance of both methods under the same conditions.
Design/methodology/approach - We solve the Helmholtz equation using BEM and
PINNs for the same scattering problem. The PINNs are trained by minimizing the
residual of the governing equations and boundary conditions, with their
configuration determined through hyperparameter optimization, while the BEM is
applied using boundary discretization. Both methods are evaluated in terms of
solution accuracy, computation time, and generalization capacity.
Findings - Numerical experiments were conducted by varying the number of
integration points for BEM and the number of layers and neurons per layer for
PINNs. Hyperparameter tuning provided further insight into suitable
configurations for wave scattering problems. At comparable accuracy, PINNs
produced consistent solutions but required training times approximately 42
times longer than BEM. However, once trained, PINNs achieved evaluation times
up to 204 times faster. The generalization capacity was also assessed outside
the PINN training domain, where the relative error increased from $7.46 \times
10^{-2}$ to 8.22, while BEM maintained a similar error level in the extended
region.
Originality/value - This work presents a direct comparison between PINNs and
BEM for the Helmholtz equation. The analysis provides quantitative data on the
performance of both methods, supporting their selection in future research on
wave propagation problems and establishing future challenges and directions.
[COMMENTS]
19 pages, 7 figures
[LINK]
http://arxiv.org/abs/2509.12483v1
[DATE]
2025-09-16 06:08:20+08:00
[CATEGORIES]
cs.LG
InfoGain Wavelets: Furthering the Design of Graph Diffusion Wavelets
[AUTHORS]
David R. Johnson, Smita Krishnaswamy, Michael Perlmutter
[ABSTRACT]
Diffusion wavelets extract information from graph signals at different scales
of resolution by utilizing graph diffusion operators raised to various powers,
known as diffusion scales. Traditionally, these scales are chosen to be dyadic
integers, $2^j$. Here, we propose a novel, unsupervised method for selecting
the diffusion scales based on ideas from information theory. We then show that
our method can be incorporated into wavelet-based GNNs, which are modeled after
the geometric scattering transform, via graph classification experiments.
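
For readers unfamiliar with diffusion wavelets, the sketch below builds them as differences of powers of a lazy random walk operator, the form commonly used in geometric scattering. It accepts any increasing list of scales, so dyadic scales and other choices can be compared; it does not implement the information-theoretic scale selection proposed in the paper. The helper names, the toy graph, and the convention of including scale 0 are assumptions.

```python
def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matpow(p, t):
    """p raised to the non-negative integer power t (t = 0 gives the identity)."""
    n = len(p)
    out = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(t):
        out = matmul(out, p)
    return out


def lazy_random_walk(adj):
    """P = (I + A D^{-1}) / 2 for an undirected graph without isolated nodes."""
    n = len(adj)
    deg = [sum(row) for row in adj]  # symmetric adjacency: row sums are degrees
    return [
        [0.5 * ((1.0 if i == j else 0.0) + adj[i][j] / deg[j]) for j in range(n)]
        for i in range(n)
    ]


def diffusion_wavelets(p, scales):
    """Return [P^{t_0} - P^{t_1}, P^{t_1} - P^{t_2}, ...] for increasing scales."""
    powers = [matpow(p, t) for t in scales]
    return [
        [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(pa, pb)]
        for pa, pb in zip(powers, powers[1:])
    ]


# 4-node path graph with dyadic scales 0, 1, 2, 4, 8.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
P = lazy_random_walk(adj)
wavelets = diffusion_wavelets(P, [0, 1, 2, 4, 8])
signal = [1.0, 0.0, 0.0, 0.0]
for w in wavelets:
    print([round(sum(w[i][j] * signal[j] for j in range(4)), 4) for i in range(4)])
```

Swapping in a non-dyadic list such as `[0, 1, 3, 5, 8]` changes which frequency bands each wavelet isolates, which is the degree of freedom the scale selection acts on.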
[LINK]
http://arxiv.org/abs/2504.08802v2
[DATE]
2025-09-16 06:01:11+08:00
[CATEGORIES]
cs.LG
Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning
[AUTHORS]
Manav Vora, Jonas Liang, Michael N. Grussing, Melkior Ornik
[ABSTRACT]
Monotonic Partially Observable Markov Decision Processes (POMDPs), where the
system state progressively decreases until a restorative action is performed,
can be used to model sequential repair problems effectively. This paper
considers the problem of solving budget-constrained multi-component monotonic
POMDPs, where a finite budget limits the maximal number of restorative actions.
For a large number of components, solving such a POMDP using current methods is
computationally intractable due to the exponential growth in the state space
with an increasing number of components. To address this challenge, we propose
a two-step approach. Since the individual components of a budget-constrained
multi-component monotonic POMDP are only connected via the shared budget, we
first approximate the optimal budget allocation among these components using an
approximation of each component POMDP’s optimal value function which is
obtained through a random forest model. Subsequently, we introduce an
oracle-guided meta-trained Proximal Policy Optimization (PPO) algorithm to
solve each of the independent budget-constrained single-component monotonic
POMDPs. The oracle policy is obtained by performing value iteration on the
corresponding monotonic Markov Decision Process (MDP). This two-step method
provides scalability in solving truly massive multi-component monotonic POMDPs.
To demonstrate the efficacy of our approach, we consider a real-world
maintenance scenario that involves inspection and repair of an administrative
building by a team of agents within a maintenance budget. Finally, we perform a
computational complexity analysis for a varying number of components to show
the scalability of the proposed approach.
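
The oracle step, value iteration on a fully observed monotonic MDP, can be sketched on a toy single-component repair problem. The reward, transition model, and discounting below are assumptions made for illustration; the budget constraint and partial observability from the abstract are deliberately left out.

```python
def value_iteration(num_levels, decay_prob, repair_cost, gamma=0.95, tol=1e-9):
    """Discounted value iteration for a toy single-component repair MDP.

    States 0..num_levels-1 are health levels and the per-step reward is the
    current level. "wait" lets the level drop by one with probability
    decay_prob (it never rises on its own, hence monotonic); "repair" pays
    repair_cost and moves the component to the top level.
    """
    top = num_levels - 1
    values = [0.0] * num_levels
    while True:
        new_values, policy = [], []
        for s in range(num_levels):
            worse = max(s - 1, 0)
            q_wait = s + gamma * (
                decay_prob * values[worse] + (1.0 - decay_prob) * values[s]
            )
            q_repair = s - repair_cost + gamma * values[top]
            if q_repair > q_wait:
                new_values.append(q_repair)
                policy.append("repair")
            else:
                new_values.append(q_wait)
                policy.append("wait")
        delta = max(abs(a - b) for a, b in zip(new_values, values))
        values = new_values
        if delta < tol:
            return values, policy


# Hypothetical component with 5 health levels and a 50% chance of decay per step.
for cost in (0.0, 2.0, 1e6):
    values, policy = value_iteration(5, 0.5, cost)
    print(cost, policy)
```

A policy obtained this way assumes the health level is observed exactly, which is what makes it usable as an oracle for guiding a learner that only sees partial observations.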
[LINK]
http://arxiv.org/abs/2408.07192v3
[DATE]
2025-09-16 05:58:36+08:00
[CATEGORIES]
cs.LG
Tuning Sequential Monte Carlo Samplers via Greedy Incremental Divergence Minimization
[AUTHORS]
Kyurae Kim, Zuheng Xu, Jacob R. Gardner, Trevor Campbell
[ABSTRACT]
The performance of sequential Monte Carlo (SMC) samplers heavily depends on
the tuning of the Markov kernels used in the path proposal. For SMC samplers
with unadjusted Markov kernels, standard tuning objectives, such as the
Metropolis-Hastings acceptance rate or the expected-squared jump distance, are
no longer applicable. While stochastic gradient-based end-to-end optimization
has been explored for tuning SMC samplers, it often incurs excessive training
costs, even for tuning just the kernel step sizes. In this work, we propose a
general adaptation framework for tuning the Markov kernels in SMC samplers by
minimizing the incremental Kullback-Leibler (KL) divergence between the
proposal and target paths. For step size tuning, we provide a gradient- and
tuning-free algorithm that is generally applicable for kernels such as Langevin
Monte Carlo (LMC). We further demonstrate the utility of our approach by
providing a tailored scheme for tuning kinetic LMC used in SMC samplers. Our
implementations are able to obtain a full schedule of tuned parameters at the
cost of a few vanilla SMC runs, which is a fraction of gradient-based
approaches.
[COMMENTS]
Accepted to ICML'25; v4, v5: fixed typos
[LINK]
http://arxiv.org/abs/2503.15704v5
[DATE]
2025-09-16 05:27:08+08:00
[CATEGORIES]
cs.LG
Nonlocal Neural Tangent Kernels via Parameter-Space Interactions
[AUTHORS]
Sriram Nagaraj, Vishakh Hari
[ABSTRACT]
The Neural Tangent Kernel (NTK) framework has provided deep insights into the
training dynamics of neural networks under gradient flow. However, it relies on
the assumption that the network is differentiable with respect to its
parameters, an assumption that breaks down when considering non-smooth target
functions or parameterized models exhibiting non-differentiable behavior. In
this work, we propose a Nonlocal Neural Tangent Kernel (NNTK) that replaces the
local gradient with a nonlocal interaction-based approximation in parameter
space. Nonlocal gradients are known to exist for a wider class of functions
than the standard gradient. This allows NTK theory to be extended to nonsmooth
functions, stochastic estimators, and broader families of models. We explore
both fixed-kernel and attention-based formulations of this nonlocal operator.
We illustrate the new formulation with numerical studies.
[LINK]
http://arxiv.org/abs/2509.12467v1
[DATE]
2025-09-16 05:23:47+08:00
[CATEGORIES]
cs.LG
Cott-ADNet: Lightweight Real-Time Cotton Boll and Flower Detection Under Field Conditions
[AUTHORS]
Rui-Feng Wang, Mingrui Xu, Matthew C Bauer, Iago Beffart Schardong, Xiaowen Ma, Kangning Cui
[ABSTRACT]
Cotton is one of the most important natural fiber crops worldwide, yet
harvesting remains limited by labor-intensive manual picking, low efficiency,
and yield losses from missing the optimal harvest window. Accurate recognition
of cotton bolls and their maturity is therefore essential for automation, yield
estimation, and breeding research. We propose Cott-ADNet, a lightweight
real-time detector tailored to cotton boll and flower recognition under complex
field conditions. Building on YOLOv11n, Cott-ADNet enhances spatial
representation and robustness through improved convolutional designs, while
introducing two new modules: a NeLU-enhanced Global Attention Mechanism to
better capture weak and low-contrast features, and a Dilated Receptive Field
SPPF to expand receptive fields for more effective multi-scale context modeling
at low computational cost. We curate a labeled dataset of 4,966 images, and
release an external validation set of 1,216 field images to support future
research. Experiments show that Cott-ADNet achieves 91.5% Precision, 89.8%
Recall, 93.3% mAP50, 71.3% mAP, and 90.6% F1-Score with only 7.5 GFLOPs,
maintaining stable performance under multi-scale and rotational variations.
These results demonstrate Cott-ADNet as an accurate and efficient solution for
in-field deployment, and thus provide a reliable basis for automated cotton
harvesting and high-throughput phenotypic analysis. Code and dataset are
available at https://github.com/SweefongWong/Cott-ADNet.
[COMMENTS]
14 pages, 5 figures, 1 table
[LINK]
http://arxiv.org/abs/2509.12442v1
[DATE]
2025-09-16 04:50:03+08:00
[CATEGORIES]
cs.LG
Neural-Quantum-States Impurity Solver for Quantum Embedding Problems
[AUTHORS]
Yinzhanghao Zhou, Tsung-Han Lee, Ao Chen, Nicola Lanatà, Hong Guo
[ABSTRACT]
Neural quantum states (NQS) have emerged as a promising approach to solve
second-quantised Hamiltonians, because of their scalability and flexibility. In
this work, we design and benchmark an NQS impurity solver for the quantum
embedding methods, focusing on the ghost Gutzwiller Approximation (gGA)
framework. We introduce a graph transformer-based NQS framework able to
represent arbitrarily connected impurity orbitals and develop an error control
mechanism to stabilise iterative updates throughout the quantum embedding
loops. We validate the accuracy of our approach with benchmark gGA calculations
of the Anderson Lattice Model, yielding results in excellent agreement with the
exact diagonalisation impurity solver. Finally, our analysis of the
computational budget reveals the method’s principal bottleneck to be the
high-accuracy sampling of physical observables required by the embedding loop,
rather than the NQS variational optimisation, directly highlighting the
critical need for more efficient inference techniques.
[COMMENTS]
10 pages main text, and 4 figures. Note that YinZhangHao Zhou and
Zhanghao Zhouyin are the same person, I use them both
[LINK]
http://arxiv.org/abs/2509.12431v1
[DATE]
2025-09-16 04:33:10+08:00
[CATEGORIES]
cs.LG
Surrogate Representation Inference for Noisy Text and Image Annotations
[AUTHORS]
Kentaro Nakamura
[ABSTRACT]
As researchers increasingly rely on machine learning models and LLMs to
annotate unstructured data, such as texts or images, various approaches have
been proposed to correct bias in downstream statistical analysis. However,
existing methods tend to yield large standard errors and require some
error-free human annotation. In this paper, I introduce Surrogate
Representation Inference (SRI), which assumes that unstructured data fully
mediate the relationship between human annotations and structured variables.
The assumption is guaranteed by design provided that human coders rely only on
unstructured data for annotation. Under this setting, I propose a neural
network architecture that learns a low-dimensional representation of
unstructured data such that the surrogate assumption remains satisfied.
When multiple human annotations are available, SRI can further correct
non-differential measurement errors that may exist in human annotations.
Focusing on text-as-outcome settings, I formally establish the identification
conditions and semiparametric efficient estimation strategies that enable
learning and leveraging such a low-dimensional representation. Simulation
studies and a real-world application demonstrate that SRI reduces standard
errors by over 50% when machine learning prediction accuracy is moderate and
provides valid inference even when human annotations contain non-differential
measurement errors.
[LINK]
http://arxiv.org/abs/2509.12416v1
[DATE]
2025-09-16 04:09:21+08:00
[CATEGORIES]
cs.LG
Bayesian Parametric Matrix Models: Principled Uncertainty Quantification for Spectral Learning
[AUTHORS]
Mohammad Nooraiepour
[ABSTRACT]
Scientific machine learning increasingly uses spectral methods to understand
physical systems. Current spectral learning approaches provide only point
estimates without uncertainty quantification, limiting their use in
safety-critical applications where prediction confidence is essential.
Parametric matrix models have emerged as powerful tools for scientific machine
learning, achieving exceptional performance by learning governing equations.
However, their deterministic nature limits deployment in uncertainty
quantification applications. We introduce Bayesian parametric matrix models
(B-PMMs), a principled framework that extends PMMs to provide uncertainty
estimates while preserving their spectral structure and computational
efficiency. B-PMM addresses the fundamental challenge of quantifying
uncertainty in matrix eigenvalue problems where standard Bayesian methods fail
due to the geometric constraints of spectral decomposition. The theoretical
contributions include: (i) adaptive spectral decomposition with regularized
matrix perturbation bounds that characterize eigenvalue uncertainty
propagation, (ii) structured variational inference algorithms using
manifold-aware matrix-variate Gaussian posteriors that respect Hermitian
constraints, and (iii) finite-sample calibration guarantees with explicit
dependence on spectral gaps and problem conditioning. Experimental validation
across matrix dimensions from 5x5 to 500x500 with perfect convergence rates
demonstrates that B-PMMs achieve exceptional uncertainty calibration (ECE <
0.05) while maintaining favorable scaling. The framework exhibits graceful
degradation under spectral ill-conditioning and provides reliable uncertainty
estimates even in near-degenerate regimes. The proposed framework supports
robust spectral learning in uncertainty-critical domains and lays the
groundwork for broader Bayesian spectral machine learning.
[LINK]
http://arxiv.org/abs/2509.12406v1
[DATE]
2025-09-16 03:52:35+08:00
[CATEGORIES]
cs.LG
Structured Information Loss in Network Embeddings
[AUTHORS]
Gabriel Chuang, Augustin Chaintreau
[ABSTRACT]
We analyze a simple algorithm for network embedding, explicitly
characterizing conditions under which the learned representation encodes the
graph’s generative model fully, partially, or not at all. In cases where the
embedding loses some information (i.e., is not invertible), we describe the
equivalence classes of graphons that map to the same embedding, finding that
these classes preserve community structure but lose substantial density
information. Finally, we show implications for community detection and link
prediction. Our results suggest strong limitations on the effectiveness of link
prediction based on embeddings alone, and we show common conditions under which
naive link prediction adds edges in a disproportionate manner that can either
mitigate or exacerbate structural biases.
[LINK]
http://arxiv.org/abs/2509.12396v1
[DATE]
2025-09-16 03:41:24+08:00
[CATEGORIES]
cs.LG
Adaptive Spatial Goodness Encoding: Advancing and Scaling Forward-Forward Learning Without Backpropagation
[AUTHORS]
Qingchun Gong, Robert Bogdan Staszewski, Kai Xu
[ABSTRACT]
The Forward-Forward (FF) algorithm offers a promising alternative to
backpropagation (BP). Despite advancements in recent FF-based extensions, which
have enhanced the original algorithm and adapted it to convolutional neural
networks (CNNs), they often suffer from limited representational capacity and
poor scalability to large-scale datasets, primarily due to exploding channel
dimensionality. In this work, we propose adaptive spatial goodness encoding
(ASGE), a new FF-based training framework tailored for CNNs. ASGE leverages
feature maps to compute spatially-aware goodness representations at each
layer, enabling layer-wise supervision. Crucially, this approach decouples
classification complexity from channel dimensionality, thereby addressing the
issue of channel explosion and achieving competitive performance compared to
other BP-free methods. ASGE outperforms all other FF-based approaches across
multiple benchmarks, delivering test accuracies of 99.65% on MNIST, 93.41% on
FashionMNIST, 90.62% on CIFAR-10, and 65.42% on CIFAR-100. Moreover, we present
the first successful application of FF-based training to ImageNet, with Top-1
and Top-5 accuracies of 26.21% and 47.49%. By entirely eliminating BP and
significantly narrowing the performance gap with BP-trained models, the ASGE
framework establishes a viable foundation toward scalable BP-free CNN training.
[LINK]
http://arxiv.org/abs/2509.12394v1
[DATE]
2025-09-16 03:38:32+08:00
[CATEGORIES]
cs.LG
Causal-Symbolic Meta-Learning (CSML): Inducing Causal World Models for Few-Shot Generalization
[AUTHORS]
Mohamed Zayaan S
[ABSTRACT]
Modern deep learning models excel at pattern recognition but remain
fundamentally limited by their reliance on spurious correlations, leading to
poor generalization and a demand for massive datasets. We argue that a key
ingredient for human-like intelligence (robust, sample-efficient learning) stems
from an understanding of causal mechanisms. In this work, we introduce
Causal-Symbolic Meta-Learning (CSML), a novel framework that learns to infer
the latent causal structure of a task distribution. CSML comprises three key
modules: a perception module that maps raw inputs to disentangled symbolic
representations; a differentiable causal induction module that discovers the
underlying causal graph governing these symbols; and a graph-based reasoning
module that leverages this graph to make predictions. By meta-learning a shared
causal world model across a distribution of tasks, CSML can rapidly adapt to
novel tasks, including those requiring reasoning about interventions and
counterfactuals, from only a handful of examples. We introduce CausalWorld, a
new physics-based benchmark designed to test these capabilities. Our
experiments show that CSML dramatically outperforms state-of-the-art
meta-learning and neuro-symbolic baselines, particularly on tasks demanding
true causal inference.
[COMMENTS]
10 pages, 4 figures
[LINK]
http://arxiv.org/abs/2509.12387v1
[DATE]
2025-09-16 03:28:09+08:00
[CATEGORIES]
cs.LG
Geometric Red-Teaming for Robotic Manipulation
[AUTHORS]
Divyam Goel, Yufei Wang, Tiancheng Wu, Guixiu Qiao, Pavel Piliptchak, David Held, Zackory Erickson
[ABSTRACT]
Standard evaluation protocols in robotic manipulation typically assess policy
performance over curated, in-distribution test sets, offering limited insight
into how systems fail under plausible variation. We introduce Geometric
Red-Teaming (GRT), a red-teaming framework that probes robustness through
object-centric geometric perturbations, automatically generating CrashShapes –
structurally valid, user-constrained mesh deformations that trigger
catastrophic failures in pre-trained manipulation policies. The method
integrates a Jacobian field-based deformation model with a gradient-free,
simulator-in-the-loop optimization strategy. Across insertion, articulation,
and grasping tasks, GRT consistently discovers deformations that collapse
policy performance, revealing brittle failure modes missed by static
benchmarks. By combining task-level policy rollouts with constraint-aware shape
exploration, we aim to build a general purpose framework for structured,
object-centric robustness evaluation in robotic manipulation. We additionally
show that fine-tuning on individual CrashShapes, a process we refer to as
blue-teaming, improves task success by up to 60 percentage points on those
shapes, while preserving performance on the original object, demonstrating the
utility of red-teamed geometries for targeted policy refinement. Finally, we
validate both red-teaming and blue-teaming results with a real robotic arm,
observing that simulated CrashShapes reduce task success from 90% to as low as
22.5%, and that blue-teaming recovers performance to up to 90% on the
corresponding real-world geometry – closely matching simulation outcomes.
Videos and code can be found on our project website:
https://georedteam.github.io/ .
[COMMENTS]
Accepted at the 9th Annual Conference on Robot Learning (CoRL 2025,
Oral)
[LINK]
http://arxiv.org/abs/2509.12379v1
[DATE]
2025-09-16 03:12:26+08:00
[CATEGORIES]
cs.LG
Diffusion-Based Generation and Imputation of Driving Scenarios from Limited Vehicle CAN Data
[AUTHORS]
Julian Ripper, Ousama Esbel, Rafael Fietzek, Max Mühlhäuser, Thomas Kreutz
[ABSTRACT]
Training deep learning methods on small time series datasets that also
include corrupted samples is challenging. Diffusion models have been shown to be
effective at generating realistic synthetic data and at correcting corrupted
samples through imputation. In this context, this paper focuses on generating
synthetic yet realistic samples of automotive time series data. We show that
denoising diffusion probabilistic models (DDPMs) can effectively solve this
task by applying them to a challenging vehicle CAN-dataset with long-term data
and a limited number of samples. Therefore, we propose a hybrid generative
approach that combines autoregressive and non-autoregressive techniques. We
evaluate our approach with two recently proposed DDPM architectures for time
series generation, for which we propose several improvements. To evaluate the
generated samples, we propose three metrics that quantify physical correctness
and test track adherence. Our best model is able to outperform even the
training data in terms of physical correctness, while showing plausible driving
behavior. Finally, we use our best model to successfully impute physically
implausible regions in the training data, thereby improving the data quality.
[COMMENTS]
Preprint, Paper has been accepted at ITSC 2025
[LINK]
http://arxiv.org/abs/2509.12375v1
[DATE]
2025-09-16 03:07:28+08:00
[CATEGORIES]
cs.LG
Explainable Unsupervised Multi-Anomaly Detection and Temporal Localization in Nuclear Times Series Data with a Dual Attention-Based Autoencoder
[AUTHORS]
Konstantinos Vasili, Zachery T. Dahm, Stylianos Chatzidakis
[ABSTRACT]
The nuclear industry is advancing toward more new reactor designs, with
next-generation reactors expected to be smaller in scale and power output.
These systems have the potential to produce large volumes of information in the
form of multivariate time-series data, which could be used for enhanced
real-time monitoring and control. In this context, the development of remote
autonomous or semi-autonomous control systems for reactor operation has gained
significant interest. A critical first step toward such systems is an accurate
diagnostics module capable of detecting and localizing anomalies within the
reactor system. Recent studies have proposed various ML and DL approaches for
anomaly detection in the nuclear domain. Despite promising results, key
challenges remain, including limited to no explainability, lack of access to
real-world data, and scarcity of abnormal events, which impedes benchmarking
and characterization. Most existing studies treat these methods as black boxes,
while recent work highlights the need for greater interpretability of ML/DL
outputs in safety-critical domains. Here, we propose an unsupervised
methodology based on an LSTM autoencoder with a dual attention mechanism for
characterization of abnormal events in a real-world reactor radiation area
monitoring system. The framework includes not only detection but also
localization of the event and was evaluated using real-world datasets of
increasing complexity from the PUR-1 research reactor. The attention mechanisms
operate in both the feature and temporal dimensions, where the feature
attention assigns weights to radiation sensors exhibiting abnormal patterns,
while time attention highlights the specific timesteps where irregularities
occur, thus enabling localization. By combining the results, the framework can
identify both the affected sensors and the duration of each anomaly within a
single unified network.
[LINK]
http://arxiv.org/abs/2509.12372v1
[DATE]
2025-09-16 03:06:17+08:00
[CATEGORIES]
cs.LG
PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training
[AUTHORS]
Seth Ockerman, Amal Gueroudji, Tanwi Mallick, Yixuan He, Line Pouchard, Robert Ross, Shivaram Venkataraman
[ABSTRACT]
Spatiotemporal graph neural networks (ST-GNNs) are powerful tools for
modeling spatial and temporal data dependencies. However, their applications
have been limited primarily to small-scale datasets because of memory
constraints. While distributed training offers a solution, current frameworks
lack support for spatiotemporal models and overlook the properties of
spatiotemporal data. Informed by a scaling study on a large-scale workload, we
present PyTorch Geometric Temporal Index (PGT-I), an extension to PyTorch
Geometric Temporal that integrates distributed data parallel training and two
novel strategies: index-batching and distributed-index-batching. Our index
techniques exploit spatiotemporal structure to construct snapshots dynamically
at runtime, significantly reducing memory overhead, while
distributed-index-batching extends this approach by enabling scalable
processing across multiple GPUs. Our techniques enable the first-ever training
of an ST-GNN on the entire PeMS dataset without graph partitioning, reducing
peak memory usage by up to 89% and achieving up to an 11.78x speedup over
standard DDP with 128 GPUs.
[COMMENTS]
To appear in the 2025 International Conference for High Performance
Computing, Networking, Storage, and Analysis
[LINK]
http://arxiv.org/abs/2507.11683v3
[DATE]
2025-09-16 02:57:17+08:00
[CATEGORIES]
cs.LG
Test-Time Canonicalization by Foundation Models for Robust Perception
[AUTHORS]
Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash
[COMMENTS]
Published at ICML 2025
[LINK]
http://arxiv.org/abs/2507.10375v2
[DATE]
2025-09-16 02:55:09+08:00
[CATEGORIES]
cs.LG
Unsupervised Atomic Data Mining via Multi-Kernel Graph Autoencoders for Machine Learning Force Fields
[AUTHORS]
Hong Sun, Joshua A. Vita, Amit Samanta, Vincenzo Lordi
[ABSTRACT]
Constructing a chemically diverse dataset while avoiding sampling bias is
critical to training efficient and generalizable force fields. However, in
computational chemistry and materials science, many common dataset generation
techniques are prone to oversampling regions of the potential energy surface.
Furthermore, these regions can be difficult to identify and isolate from each
other or may not align well with human intuition, making it challenging to
systematically remove bias in the dataset. While traditional clustering and
pruning (down-sampling) approaches can be useful for this, they can often lead
to information loss or a failure to properly identify distinct regions of the
potential energy surface due to difficulties associated with the high
dimensionality of atomic descriptors. In this work, we introduce the
Multi-kernel Edge Attention-based Graph Autoencoder (MEAGraph) model, an
unsupervised approach for analyzing atomic datasets. MEAGraph combines multiple
linear kernel transformations with attention-based message passing to capture
geometric sensitivity and enable effective dataset pruning without relying on
labels or extensive training. Demonstrated applications on niobium, tantalum,
and iron datasets show that MEAGraph efficiently groups similar atomic
environments, allowing for the use of basic pruning techniques for removing
sampling bias. This approach provides an effective method for representation
learning and clustering that can be used for data analysis, outlier detection,
and dataset optimization.
[LINK]
http://arxiv.org/abs/2509.12358v1
[DATE]
2025-09-16 02:41:51+08:00
[CATEGORIES]
cs.LG
Memorization Sinks: Isolating Memorization during LLM Training
[AUTHORS]
Gaurav R. Ghosal, Pratyush Maini, Aditi Raghunathan
[ABSTRACT]
Large language models are susceptible to memorizing repeated sequences,
posing privacy and copyright concerns. A popular mitigation strategy is to
remove memorized information from specific neurons post-hoc. However, such
approaches have shown limited success so far. In a controlled setting, we show
that the memorization of natural sequences (those that resemble linguistically
plausible text) becomes mechanistically entangled with general language
abilities, thereby becoming challenging to remove post-hoc. In this work, we
put forward a new paradigm of MemSinks that promotes isolation of memorization
by design. We leverage a sequence identifier that activates a unique set of
memorization neurons for each sequence across repetitions. By analyzing the
dynamics of learning and forgetting, we argue that MemSinks facilitates
isolation of memorized content, making it easier to remove without compromising
general language capabilities. We implement MemSinks at the billion-parameter
and billion-token scale, and observe both effective isolation and strong
generalization. To our knowledge, this is the first proof-of-concept on real
data demonstrating that simultaneous generalization and isolation is
achievable. We open-source our code at http://github.com/grghosal/MemSinks.
[COMMENTS]
Accepted at the 2025 International Conference on Machine Learning
[LINK]
http://arxiv.org/abs/2507.09937v2
[DATE]
2025-09-16 02:32:06+08:00
[CATEGORIES]
cs.LG
MetaLLMix : An XAI Aided LLM-Meta-learning Based Approach for Hyper-parameters Optimization
[AUTHORS]
Mohammed Tiouti, Mohamed Bal-Ghaoui
[ABSTRACT]
Effective model and hyperparameter selection remains a major challenge in
deep learning, often requiring extensive expertise and computation. While
AutoML and large language models (LLMs) promise automation, current LLM-based
approaches rely on trial and error and expensive APIs, which provide limited
interpretability and generalizability. We propose MetaLLMiX, a zero-shot
hyperparameter optimization framework combining meta-learning, explainable AI,
and efficient LLM reasoning. By leveraging historical experiment outcomes with
SHAP explanations, MetaLLMiX recommends optimal hyperparameters and pretrained
models without additional trials. We further employ an LLM-as-judge evaluation
to control output format, accuracy, and completeness. Experiments on eight
medical imaging datasets using nine open-source lightweight LLMs show that
MetaLLMiX achieves competitive or superior performance to traditional HPO
methods while drastically reducing computational cost. Our local deployment
outperforms prior API-based approaches, achieving optimal results on 5 of 8
tasks, response time reductions of 99.6-99.9%, and the fastest training times
on 6 datasets (2.4-15.7x faster), maintaining accuracy within 1-5% of
best-performing baselines.
[LINK]
http://arxiv.org/abs/2509.09387v2
[DATE]
2025-09-16 02:23:31+08:00
[CATEGORIES]
cs.LG
Linear Dimensionality Reduction for Word Embeddings in Tabular Data Classification
[AUTHORS]
Liam Ressel, Hamza A. A. Gardi
[ABSTRACT]
The Engineers’ Salary Prediction Challenge requires classifying salary
categories into three classes based on tabular data. The job description is
represented as a 300-dimensional word embedding incorporated into the tabular
features, drastically increasing dimensionality. Additionally, the limited
number of training samples makes classification challenging. Linear
dimensionality reduction of word embeddings for tabular data classification
remains underexplored. This paper studies Principal Component Analysis (PCA)
and Linear Discriminant Analysis (LDA). We show that PCA, with an appropriate
subspace dimension, can outperform raw embeddings. LDA without regularization
performs poorly due to covariance estimation errors, but applying shrinkage
improves performance significantly, even with only two dimensions. We propose
Partitioned-LDA, which splits embeddings into equal-sized blocks and performs
LDA separately on each, thereby reducing the size of the covariance matrices.
Partitioned-LDA outperforms regular LDA and, combined with shrinkage, achieves
top-10 accuracy on the competition public leaderboard. This method effectively
enhances word embedding performance in tabular data classification with limited
training samples.
[LINK]
http://arxiv.org/abs/2509.12346v1
[DATE]
2025-09-16 02:19:00+08:00
[CATEGORIES]
cs.LG
Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success
[AUTHORS]
Ben Griffin, Diego Vidaurre, Ugur Koyluoglu, Joseph Ternasky, Fuat Alican, Yigit Ihlamur
[ABSTRACT]
Predicting rare outcomes such as startup success is central to venture
capital, demanding models that are both accurate and interpretable. We
introduce Random Rule Forest (RRF), a lightweight ensemble method that uses a
large language model (LLM) to generate simple YES/NO questions in natural
language. Each question functions as a weak learner, and their responses are
combined using a threshold-based voting rule to form a strong, interpretable
predictor.
Applied to a dataset of 9,892 founders, RRF achieves a 6.9x improvement over
a random baseline on held-out data; adding expert-crafted questions lifts this
to 8x and highlights the value of human-LLM collaboration. Compared with zero-
and few-shot baselines across three LLM architectures, RRF attains an F0.5 of
0.121, versus 0.086 for the best baseline (+0.035 absolute, +41% relative). By
combining the creativity of LLMs with the rigor of ensemble learning, RRF
delivers interpretable, high-precision predictions suitable for decision-making
in high-stakes domains.
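
The threshold-based voting idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact voting rule, so the "at least `threshold` YES answers" rule, the function names `rrf_predict` and `f_beta`, and the example answers are all assumptions. The F-beta helper uses the standard definition, where beta = 0.5 weights precision more heavily than recall.

```python
from typing import Sequence


def rrf_predict(answers: Sequence[bool], threshold: int) -> bool:
    """Predict positive when at least `threshold` questions are answered YES.

    Assumed rule: each YES/NO question is one weak learner casting one vote.
    """
    return sum(answers) >= threshold


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical answers to five LLM-generated YES/NO questions for one founder.
answers = [True, False, True, True, False]
print(rrf_predict(answers, threshold=3))  # three YES votes meet the threshold
print(rrf_predict(answers, threshold=4))  # three YES votes fall short
```

Raising the threshold trades recall for precision, which is one plausible reason a precision-weighted metric such as F0.5 would be used to evaluate a predictor of rare outcomes.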
[COMMENTS]
13 pages including appendix, 4 figures
[LINK]
http://arxiv.org/abs/2505.24622v2
[DATE]
2025-09-16 02:17:48+08:00
[CATEGORIES]
cs.LG
FEDONet : Fourier-Embedded DeepONet for Spectrally Accurate Operator Learning
[AUTHORS]
Arth Sojitra, Mrigank Dhingra, Omer San
[ABSTRACT]
Deep Operator Networks (DeepONets) have recently emerged as powerful
data-driven frameworks for learning nonlinear operators, particularly suited
for approximating solutions to partial differential equations (PDEs). Despite
their promising capabilities, the standard implementation of DeepONets, which
typically employs fully connected linear layers in the trunk network, can
encounter limitations in capturing complex spatial structures inherent to
various PDEs. To address this, we introduce Fourier-embedded trunk networks
within the DeepONet architecture, leveraging random Fourier feature mappings to
enrich spatial representation capabilities. Our proposed Fourier-embedded
DeepONet, FEDONet, demonstrates superior performance compared to the traditional
DeepONet across a comprehensive suite of PDE-driven datasets, including the
two-dimensional Poisson equation, Burgers’ equation, the Lorenz-63 chaotic
system, Eikonal equation, Allen-Cahn equation, Kuramoto-Sivashinsky equation,
and the Lorenz-96 system. Empirical evaluations of FEDONet consistently show
significant improvements in solution reconstruction accuracy, with average
relative L2 performance gains ranging between 2-3x compared to the DeepONet
baseline. This study highlights the effectiveness of Fourier embeddings in
enhancing neural operator learning, offering a robust and broadly applicable
methodology for PDE surrogate modeling.
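
A random Fourier feature mapping of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the FEDONet code: the Gaussian frequency matrix, the 2*pi scaling, the `sigma` bandwidth, and the names `make_fourier_matrix` and `fourier_embed` are hypothetical choices. In a DeepONet-style model, the embedded coordinates would replace the raw coordinates as input to the trunk network.

```python
import math
import random
from typing import List, Sequence


def make_fourier_matrix(in_dim: int, num_features: int,
                        sigma: float = 1.0, seed: int = 0) -> List[List[float]]:
    """Sample a fixed (num_features x in_dim) Gaussian frequency matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(in_dim)]
            for _ in range(num_features)]


def fourier_embed(x: Sequence[float], B: Sequence[Sequence[float]]) -> List[float]:
    """Map coordinates x to [cos(2*pi*B x), sin(2*pi*B x)]."""
    proj = [2.0 * math.pi * sum(b * xi for b, xi in zip(row, x)) for row in B]
    return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]


# Embed one 2-D query point with 4 random frequencies.
B = make_fourier_matrix(in_dim=2, num_features=4, sigma=1.0)
z = fourier_embed([0.25, 0.75], B)
print(len(z))  # 8 features: 4 cosines followed by 4 sines
```

The bandwidth `sigma` controls how high the sampled frequencies are; larger values let a downstream network represent finer spatial detail, at the cost of a harder fit on smooth targets.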
[LINK]
http://arxiv.org/abs/2509.12344v1
[DATE]
2025-09-16 02:13:28+08:00
[CATEGORIES]
cs.LG
Integrating Attention-Enhanced LSTM and Particle Swarm Optimization for Dynamic Pricing and Replenishment Strategies in Fresh Food Supermarkets
[AUTHORS]
Xianchen Liu, Tianhui Zhang, Xinyu Zhang, Lingmin Hou, Zhen Guo, Yuanhao Tian, Yang Liu
[ABSTRACT]
This paper presents a novel approach to optimizing pricing and replenishment
strategies in fresh food supermarkets by combining Long Short-Term Memory
(LSTM) networks with Particle Swarm Optimization (PSO). The LSTM model,
enhanced with an attention mechanism, is used to predict sales volumes, pricing
trends, and spoilage rates over a seven-day period. The predictions generated
by the LSTM model serve as inputs for the PSO algorithm, which iteratively
optimizes pricing and replenishment strategies to maximize profitability while
adhering to inventory constraints. The integration of cost-plus pricing allows
for dynamic adjustments based on fixed and variable costs, ensuring real-time
adaptability to market fluctuations. The framework not only maximizes profits
but also reduces food waste, contributing to more sustainable supermarket
operations. The attention mechanism enhances the interpretability of the LSTM
model by identifying key time points and factors influencing sales, improving
decision-making accuracy. This methodology bridges the gap between predictive
modeling and optimization, offering a scalable solution for dynamic pricing and
inventory management in fresh food retail and other industries dealing with
perishable goods.
[COMMENTS]
16 pages, 6 figures
[LINK]
http://arxiv.org/abs/2509.12339v1
[DATE]
2025-09-16 02:07:44+08:00
[CATEGORIES]
cs.LG
Uncertainty-Aware Hourly Air Temperature Mapping at 2 km Resolution via Physics-Guided Deep Learning
[AUTHORS]
Shengjie Kris Liu, Siqin Wang, Lu Zhang
[ABSTRACT]
Near-surface air temperature is a key physical property of the Earth’s
surface. Although weather stations offer continuous monitoring and satellites
provide broad spatial coverage, no single data source offers seamless data in a
spatiotemporal fashion. Here, we propose a data-driven, physics-guided deep
learning approach to generate hourly air temperature data at 2 km resolution
over the contiguous United States. The approach, called Amplifier
Air-Transformer, first reconstructs GOES-16 surface temperature data obscured
by clouds. It does so through a neural network encoded with the annual
temperature cycle, incorporating a linear term to amplify ERA5 temperature
values at finer scales and convolutional layers to capture spatiotemporal
variations. Then, another neural network transforms the reconstructed surface
temperature into air temperature by leveraging its latent relationship with key
Earth surface properties. The approach is further enhanced with predictive
uncertainty estimation through deep ensemble learning to improve reliability.
The proposed approach is built and tested on 77.7 billion surface temperature
pixels and 155 million air temperature records from weather stations across the
contiguous United States (2018-2024), achieving hourly air temperature mapping
accuracy of 1.93 °C in station-based validation. The proposed approach
streamlines surface temperature reconstruction and air temperature prediction,
and it can be extended to other satellite sources for seamless air temperature
monitoring at high spatiotemporal resolution. The generated data of this study
can be downloaded at https://doi.org/10.5281/zenodo.15252812, and the project
webpage can be found at https://skrisliu.com/HourlyAirTemp2kmUSA/.
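
The deep-ensemble uncertainty step mentioned above can be illustrated with a
minimal sketch. It assumes each ensemble member has already produced a
temperature map (flattened to a list of pixel values) and summarizes the
members per pixel by their mean and standard deviation. The function name
ensemble_summary and the toy values are hypothetical; the paper's actual
uncertainty formulation may differ.

```python
import statistics


def ensemble_summary(member_maps):
    """Per-pixel mean and spread across ensemble members.

    member_maps: list of equally long lists, one list of pixel values per
    ensemble member (hypothetical layout chosen for this sketch).
    """
    if not member_maps:
        raise ValueError("need at least one ensemble member")
    n_pixels = len(member_maps[0])
    if any(len(m) != n_pixels for m in member_maps):
        raise ValueError("all member maps must have the same length")

    # Regroup values so that each tuple holds one pixel across all members.
    per_pixel = list(zip(*member_maps))
    means = [statistics.fmean(values) for values in per_pixel]
    # Population standard deviation across members as a simple spread measure.
    spreads = [statistics.pstdev(values) for values in per_pixel]
    return means, spreads


if __name__ == "__main__":
    toy_members = [[20.0, 10.0], [21.0, 10.0], [22.0, 10.0]]
    means, spreads = ensemble_summary(toy_members)
    print(means, spreads)
```

A pixel where the members agree gets a spread of zero, while disagreement
between members shows up as a larger spread.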
[LINK]
http://arxiv.org/abs/2509.12329v1
[DATE]
2025-09-16 02:01:04+08:00
[CATEGORIES]
cs.LG
VADER: A Variational Autoencoder to Infer Planetary Masses and Gas-Dust Disk Properties Around Young Stars
[AUTHORS]
Sayed Shafaat Mahmud, Sayantan Auddy, Neal Turner, Jeffrey S. Bary
[ABSTRACT]
We present VADER (Variational Autoencoder for Disks Embedded with
Rings), for inferring both planet mass and global disk properties from
high-resolution ALMA dust continuum images of protoplanetary disks (PPDs).
VADER, a probabilistic deep learning model, enables uncertainty-aware inference
of planet masses, $\alpha$-viscosity, dust-to-gas ratio, Stokes number, flaring
index, and the number of planets directly from protoplanetary disk images.
VADER is trained on over 100,000 synthetic images of PPDs generated from
FARGO3D simulations post-processed with RADMC3D. Our trained
model predicts physical planet and disk parameters with $R^2 > 0.9$ from dust
continuum images of PPDs. Applied to 23 real disks, VADER’s mass estimates are
consistent with literature values and reveal latent correlations that reflect
known disk physics. Our results establish VAE-based generative models as robust
tools for probabilistic astrophysical inference, with direct applications to
interpreting protoplanetary disk substructures in the era of large
interferometric surveys.
[COMMENTS]
6 pages, 5 figures, Accepted and Published at International
Conference on Machine Learning, Machine Learning for Astrophysics Workshop
2025
[LINK]
http://arxiv.org/abs/2509.12324v1
[DATE]
2025-09-16 02:00:19+08:00
[CATEGORIES]
cs.LG
Dynamic Relational Priming Improves Transformer in Multivariate Time Series
[AUTHORS]
Hunjae Lee, Corey Clark
[ABSTRACT]
Standard attention mechanisms in transformers employ static token
representations that remain unchanged across all pair-wise computations in each
layer. This limits their representational alignment with the potentially
diverse relational dynamics of each token-pair interaction. While they excel in
domains with relatively homogeneous relationships, standard attention’s static
relational learning struggles to capture the diverse, heterogeneous
inter-channel dependencies of multivariate time series (MTS) data, where
different channel-pair interactions within a single system may be governed by
entirely different physical laws or temporal dynamics. To better align the
attention mechanism for such domain phenomena, we propose attention with
dynamic relational priming (prime attention). Unlike standard attention where
each token presents an identical representation across all of its pair-wise
interactions, prime attention tailors each token dynamically (or per
interaction) through learnable modulations to best capture the unique
relational dynamics of each token pair, optimizing each pair-wise interaction
for that specific relationship. This representational plasticity of prime
attention enables effective extraction of relationship-specific information in
MTS while maintaining the same asymptotic computational complexity as standard
attention. Our results demonstrate that prime attention consistently
outperforms standard attention across benchmarks, achieving up to 6.5%
improvement in forecasting accuracy. In addition, we find that prime attention
achieves comparable or superior performance using up to 40% less sequence
length compared to standard attention, further demonstrating its superior
relational modeling capabilities.
[LINK]
http://arxiv.org/abs/2509.12196v1
[DATE]
2025-09-16 01:56:15+08:00
[CATEGORIES]
cs.LG
Safety Pretraining: Toward the Next Generation of Safe AI
[AUTHORS]
Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary C. Lipton, J. Zico Kolter
[LINK]
http://arxiv.org/abs/2504.16980v2
[DATE]
2025-09-16 01:51:55+08:00
[CATEGORIES]
cs.LG
HoloGarment: 360° Novel View Synthesis of In-the-Wild Garments
[AUTHORS]
Johanna Karras, Yingwei Li, Yasamin Jafarian, Ira Kemelmacher-Shlizerman
[ABSTRACT]
Novel view synthesis (NVS) of in-the-wild garments is a challenging task due to
significant occlusions, complex human poses, and cloth deformations. Prior
methods rely on synthetic 3D training data consisting of mostly unoccluded and
static objects, leading to poor generalization on real-world clothing. In this
paper, we propose HoloGarment (Hologram-Garment), a method that takes 1-3
images or a continuous video of a person wearing a garment and generates
360° novel views of the garment in a canonical pose. Our key insight is to
bridge the domain gap between real and synthetic data with a novel implicit
training paradigm leveraging a combination of large-scale real video data and
small-scale synthetic 3D data to optimize a shared garment embedding space.
During inference, the shared embedding space further enables dynamic
video-to-360° NVS through the construction of a garment “atlas”
representation by finetuning a garment embedding on a specific real-world
video. The atlas captures garment-specific geometry and texture across all
viewpoints, independent of body pose or motion. Extensive experiments show that
HoloGarment achieves state-of-the-art performance on NVS of in-the-wild
garments from images and videos. Notably, our method robustly handles
challenging real-world artifacts – such as wrinkling, pose variation, and
occlusion – while maintaining photorealism, view consistency, fine texture
details, and accurate geometry. Visit our project page for additional results:
https://johannakarras.github.io/HoloGarment
[LINK]
http://arxiv.org/abs/2509.12187v1
[DATE]
2025-09-16 01:50:57+08:00
[CATEGORIES]
cs.LG
The Morgan-Pitman Test of Equality of Variances and its Application to Machine Learning Model Evaluation and Selection
[AUTHORS]
Argimiro Arratia, Alejandra Cabaña, Ernesto Mordecki, Gerard Rovira-Parra
[ABSTRACT]
Model selection in non-linear models often prioritizes performance metrics
over statistical tests, limiting the ability to account for sampling
variability. We propose the use of a statistical test to assess the equality of
variances in forecasting errors. The test builds upon the classic Morgan-Pitman
approach, incorporating enhancements to ensure robustness against data with
heavy-tailed distributions or outliers with high variance, plus a strategy to
make residuals from machine learning models statistically independent. Through
a series of simulations and real-world data applications, we demonstrate the
test’s effectiveness and practical utility, offering a reliable tool for model
evaluation and selection in diverse contexts.
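
As background, the classic Morgan-Pitman idea can be sketched as follows: for
paired errors a and b, the variances are equal exactly when the sum a + b and
the difference a - b are uncorrelated, so the test reduces to a correlation
test on those two derived series. The sketch below covers only this classic
version and omits the robustness enhancements and the independence strategy
the paper proposes. The function names are hypothetical, and the statistic t
follows a Student t distribution with n - 2 degrees of freedom only under the
usual bivariate-normality assumption.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def morgan_pitman_statistic(errors_a, errors_b):
    """Return (r, t) for the classic Morgan-Pitman test on paired errors.

    r is the correlation between (a + b) and (a - b); its sign matches the
    sign of var(a) - var(b). t = r * sqrt((n - 2) / (1 - r^2)).
    """
    n = len(errors_a)
    if n != len(errors_b) or n < 3:
        raise ValueError("need two paired sequences with at least 3 entries")
    sums = [a + b for a, b in zip(errors_a, errors_b)]
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    r = pearson_r(sums, diffs)
    denom = 1.0 - r * r
    if denom <= 0.0:
        # Perfect correlation: the statistic is unbounded.
        return r, math.copysign(math.inf, r)
    return r, r * math.sqrt((n - 2) / denom)


if __name__ == "__main__":
    model_a = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]
    model_b = [0.1, 0.2, -0.1, -0.2, 0.05, 0.0]
    print(morgan_pitman_statistic(model_a, model_b))
```

A positive r indicates that the first model's errors have the larger sample
variance; comparing t against a t distribution quantile then gives the
accept/reject decision.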
[COMMENTS]
29 pages, 4 figures
[LINK]
http://arxiv.org/abs/2509.12185v1
[DATE]
2025-09-16 01:47:38+08:00
[CATEGORIES]
cs.LG
Security of Deep Reinforcement Learning for Autonomous Driving: A Survey
[AUTHORS]
Ambra Demontis, Srishti Gupta, Maura Pintor, Luca Demetrio, Kathrin Grosse, Hsiao-Ying Lin, Chengfang Fang, Battista Biggio, Fabio Roli
[ABSTRACT]
Reinforcement learning (RL) enables agents to learn optimal behaviors through
interaction with their environment and has been increasingly deployed in
safety-critical applications, including autonomous driving. Despite its
promise, RL is susceptible to attacks designed either to compromise policy
learning or to induce erroneous decisions by trained agents. Although the
literature on RL security has grown rapidly and several surveys exist, existing
categorizations often fall short in guiding the selection of appropriate
defenses for specific systems. In this work, we present a comprehensive survey
of 86 recent studies on RL security, addressing these limitations by
systematically categorizing attacks and defenses according to defined threat
models and single- versus multi-agent settings. Furthermore, we examine the
relevance and applicability of state-of-the-art attacks and defense mechanisms
within the context of autonomous driving, providing insights to inform the
design of robust RL systems.
[LINK]
http://arxiv.org/abs/2212.06123v2
[DATE]
2025-09-16 01:46:22+08:00
[CATEGORIES]
cs.LG
All that structure matches does not glitter
[AUTHORS]
Maya M. Martirossyan, Thomas Egg, Philipp Hoellmer, George Karypis, Mark Transtrum, Adrian Roitberg, Mingjie Liu, Richard G. Hennig, Ellad B. Tadmor, Stefano Martiniani
[ABSTRACT]
Generative models for materials, especially inorganic crystals, hold
potential to transform the theoretical prediction of novel compounds and
structures. Advancement in this field depends critically on robust benchmarks
and minimal, information-rich datasets that enable meaningful model evaluation.
This paper critically examines common datasets and reported metrics for a
crystal structure prediction task: generating the most likely
structures given the chemical composition of a material. We focus on three key
issues: First, materials datasets should contain unique crystal structures; for
example, we show that the widely-utilized carbon-24 dataset only contains
$\approx$40% unique structures. Second, materials datasets should not be split
randomly if polymorphs of many different compositions are numerous, which we
find to be the case for the perov-5 dataset. Third, benchmarks can mislead if
used uncritically, e.g., reporting a match rate metric without considering the
structural variety exhibited by identical building blocks. To address these
oft-overlooked issues, we introduce several fixes. We provide revised versions
of the carbon-24 dataset: one with duplicates removed, one deduplicated and
split by number of atoms $N$, and two containing only identical structures but
with different unit cells. We also propose a new split for the perov-5 dataset
which ensures polymorphs are grouped within each split subset, setting a more
sensible standard for benchmarking model performance. Finally, we present METRe
and cRMSE, new model evaluation metrics that can correct existing issues with
the match rate metric.
[LINK]
http://arxiv.org/abs/2509.12178v1
[DATE]
2025-09-16 01:41:16+08:00
[CATEGORIES]
cs.LG
From Autoencoders to CycleGAN: Robust Unpaired Face Manipulation via Adversarial Learning
[AUTHORS]
Collin Guo
[ABSTRACT]
Human face synthesis and manipulation are increasingly important in
entertainment and AI, with a growing demand for highly realistic,
identity-preserving images even when only unpaired, unaligned datasets are
available. We study unpaired face manipulation via adversarial learning, moving
from autoencoder baselines to a robust, guided CycleGAN framework. While
autoencoders capture coarse identity, they often miss fine details. Our
approach integrates spectral normalization for stable training, identity- and
perceptual-guided losses to preserve subject identity and high-level structure,
and landmark-weighted cycle constraints to maintain facial geometry across pose
and illumination changes. Experiments show that our adversarially trained
CycleGAN improves realism (FID), perceptual quality (LPIPS), and identity
preservation (ID-Sim) over autoencoders, with competitive cycle-reconstruction
SSIM and practical inference times, achieving high quality without paired
datasets and approaching pix2pix on curated paired subsets. These results
demonstrate that guided, spectrally normalized CycleGANs provide a practical
path from autoencoders to robust unpaired face manipulation.
[COMMENTS]
8 pages, 7 figures
[LINK]
http://arxiv.org/abs/2509.12176v1
[DATE]
2025-09-16 01:40:19+08:00
[CATEGORIES]
cs.LG
MMM: Clustering Multivariate Longitudinal Mixed-type Data
[AUTHORS]
Francesco Amato, Julien Jacques
[ABSTRACT]
Multivariate longitudinal data of mixed-type are increasingly collected in
many science domains. However, algorithms to cluster this kind of data remain
scarce, due to the challenge of simultaneously modeling the within- and
between-time dependence structures for multivariate data of mixed kind. We
introduce the Mixture of Mixed-Matrices (MMM) model: reorganizing the data in a
three-way structure and assuming that the non-continuous variables are
observations of underlying latent continuous variables, the model relies on a
mixture of matrix-variate normal distributions to perform clustering in the
latent dimension. The MMM model is thus able to handle continuous, ordinal,
binary, nominal and count data and to concurrently model the heterogeneity, the
association among the responses and the temporal dependence structure in a
parsimonious way and without assuming conditional independence. The inference
is carried out through an MCMC-EM algorithm, which is detailed. An evaluation
of the model through synthetic data shows its inference abilities. A real-world
application on financial data is presented.
[LINK]
http://arxiv.org/abs/2509.12166v1
[DATE]
2025-09-16 01:30:31+08:00
[CATEGORIES]
cs.LG
On the Generalization of Representation Uncertainty in Earth Observation
[AUTHORS]
Spyros Kondylatos, Nikolaos Ioannis Bountos, Dimitrios Michail, Xiao Xiang Zhu, Gustau Camps-Valls, Ioannis Papoutsis
[ABSTRACT]
Recent advances in Computer Vision have introduced the concept of pretrained
representation uncertainty, enabling zero-shot uncertainty estimation. This
holds significant potential for Earth Observation (EO), where trustworthiness
is critical, yet the complexity of EO data poses challenges to
uncertainty-aware methods. In this work, we investigate the generalization of
representation uncertainty in EO, considering the domain’s unique semantic
characteristics. We pretrain uncertainties on large EO datasets and propose an
evaluation framework to assess their zero-shot performance in multi-label
classification and segmentation EO tasks. Our findings reveal that, unlike
uncertainties pretrained on natural images, EO-pretraining exhibits strong
generalization across unseen EO domains, geographic locations, and target
granularities, while maintaining sensitivity to variations in ground sampling
distance. We demonstrate the practical utility of pretrained uncertainties by
showcasing their alignment with task-specific uncertainties in downstream
tasks, their sensitivity to real-world EO image noise, and their ability to
generate spatial uncertainty estimates out-of-the-box. Initiating the
discussion on representation uncertainty in EO, our study provides insights
into its strengths and limitations, paving the way for future research in the
field. Code and weights are available at:
https://github.com/Orion-AI-Lab/EOUncertaintyGeneralization.
[COMMENTS]
Accepted to ICCV 2025
[LINK]
http://arxiv.org/abs/2503.07082v2
[DATE]
2025-09-16 01:24:39+08:00
[CATEGORIES]
cs.LG
Learning Neural Networks by Neuron Pursuit
[AUTHORS]
Akshay Kumar, Jarvis Haupt
[ABSTRACT]
The first part of this paper studies the evolution of gradient flow for
homogeneous neural networks near a class of saddle points exhibiting a sparsity
structure. The choice of these saddle points is motivated by previous work
on homogeneous networks, which identified the first saddle point encountered by
gradient flow after escaping the origin. It is shown here that, when
initialized sufficiently close to such saddle points, gradient flow remains
near the saddle point for a sufficiently long time, during which the set of
weights with small norm remains small but converges in direction. Furthermore,
important empirical observations are made on the behavior of gradient descent
after escaping these saddle points. The second part of the paper, motivated by
these results, introduces a greedy algorithm to train deep neural networks
called Neuron Pursuit (NP). It is an iterative procedure which alternates
between expanding the network by adding neuron(s) with carefully chosen
weights, and minimizing the training loss using this augmented network. The
efficacy of the proposed algorithm is validated using numerical experiments.
[LINK]
http://arxiv.org/abs/2509.12154v1
[DATE]
2025-09-16 01:18:35+08:00
[CATEGORIES]
cs.LG
A learning-driven automatic planning framework for proton PBS treatments of H&N cancers
[AUTHORS]
Qingqing Wang, Liqiang Xiao, Chang Chang
[ABSTRACT]
Proton pencil beam scanning (PBS) treatment planning for head & neck (H&N)
cancers involves numerous conflicting objectives, requiring iterative objective
parameter adjustments to balance multiple clinical goals. We propose a
learning-driven inverse optimizer and integrate it into a proximal policy
optimization (PPO)-based planning framework to automatically generate
high-quality plans for patients with diverse treatment requirements. The
inverse optimizer is a learning-to-optimize (L2O) method that predicts update
steps by learning from task-specific data distributions. For the first time,
long-context processing techniques developed for large language models (LLMs)
are utilized to address the scalability limitations of existing L2O methods,
enabling simultaneous optimization over a substantially large set of variables.
The PPO framework functions as an outer-loop virtual planner, autonomously
adjusting objective parameters through a policy network, and the inner-loop L2O
inverse optimizer computes machine-deliverable spot monitor unit (MU) values
based on the PPO-refined objectives. Moreover, a Swin UnetR dose predictor is
trained with prescription- and beam-specific information to estimate the
initial objective parameters. In our experiments, a total of 97 patients with
bilateral or ipsilateral H&N cancers are collected for training and testing.
Compared with the second-order gradient-based methods, our L2O optimizer
improves the effectiveness and efficiency of the time-consuming inverse
optimization by 22.97% and 36.41%, respectively, and in conjunction with the
PPO-based virtual planner, plans are generated within clinically acceptable
times, i.e., 2.55 hours on average, and show improved or comparable
organs-at-risk sparing with superior target coverage compared with
human-generated plans.
[COMMENTS]
27 pages, 4 figures
[LINK]
http://arxiv.org/abs/2508.11085v2
[DATE]
2025-09-16 01:16:18+08:00
[CATEGORIES]
cs.LG
All Optical Echo State Network Reservoir Computing
[AUTHORS]
Ishwar S Kaushik, Peter J Ehlers, Daniel Soh
[ABSTRACT]
We propose an innovative design for an all-optical Echo State Network (ESN),
an advanced type of reservoir computer known for its universal computational
capabilities. Our design enables fully optical implementation of arbitrary
ESNs, featuring flexibility in optical matrix multiplication and nonlinear
activation. Leveraging the nonlinear characteristics of stimulated Brillouin
scattering (SBS), the architecture efficiently realizes measurement-free
nonlinear activation. The approach significantly reduces computational overhead
and energy consumption compared to traditional software-based methods.
Comprehensive simulations validate the system’s memory capacity, nonlinear
processing strength, and polynomial algebra capabilities, showcasing
performance comparable to software ESNs across key benchmark tasks. Our design
establishes a feasible, scalable, and universally applicable framework for
optical reservoir computing, suitable for diverse machine learning
applications.
[COMMENTS]
14 pages, 11 figures
[LINK]
http://arxiv.org/abs/2504.08224v2
[DATE]
2025-09-16 01:00:50+08:00
[CATEGORIES]
cs.LG
$K$-Level Policy Gradients for Multi-Agent Reinforcement Learning
[AUTHORS]
Aryaman Reddi, Gabriele Tiboni, Jan Peters, Carlo D’Eramo
[ABSTRACT]
Actor-critic algorithms for deep multi-agent reinforcement learning (MARL)
typically employ a policy update that responds to the current strategies of
other agents. While being straightforward, this approach does not account for
the updates of other agents at the same update step, resulting in
miscoordination. In this paper, we introduce the $K$-Level Policy Gradient
(KPG), a method that recursively updates each agent against the updated
policies of other agents, speeding up the discovery of effective coordinated
policies. We theoretically prove that KPG with finite iterates achieves
monotonic convergence to a local Nash equilibrium under certain conditions. We
provide principled implementations of KPG by applying it to the deep MARL
algorithms MAPPO, MADDPG, and FACMAC. Empirically, we demonstrate superior
performance over existing deep MARL algorithms in StarCraft II and multi-agent
MuJoCo.
[LINK]
http://arxiv.org/abs/2509.12117v1
[DATE]
2025-09-16 00:42:56+08:00
[CATEGORIES]
cs.LG
SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals
[AUTHORS]
Peixuan Han, Cheng Qian, Xiusi Chen, Yuji Zhang, Heng Ji, Denghui Zhang
[ABSTRACT]
Large language models (LLMs) exhibit exceptional capabilities across various
tasks but also pose risks by generating harmful content. Existing safety
mechanisms, while improving model safety, often lead to overly cautious
behavior and fail to fully leverage LLMs’ internal cognitive processes.
Inspired by humans’ reflective thinking capability, we first show that LLMs can
similarly perform internal assessments about safety in their internal states.
Building on this insight, we propose SafeSwitch, a dynamic framework that
regulates unsafe outputs by utilizing the prober-based internal state monitor
that actively detects harmful intentions, and activates a safety head that
leads to safer and more conservative responses only when necessary. SafeSwitch
reduces harmful outputs by approximately 80% on harmful queries while
maintaining strong utility, reaching a Pareto-optimal trade-off among several methods.
Our method is also advantageous over traditional methods in offering more
informative, context-aware refusals, and achieves these benefits while only
tuning less than 6% of the original parameters. SafeSwitch demonstrates large
language models’ capacity for self-awareness and reflection regarding safety,
offering a promising approach to more nuanced and effective safety controls.
Code for this work is available at https://github.com/Hanpx20/SafeSwitch.
[LINK]
http://arxiv.org/abs/2502.01042v5
[DATE]
2025-09-16 00:36:22+08:00
[CATEGORIES]
cs.LG
MODIS: Multi-Omics Data Integration for Small and unpaired datasets
[AUTHORS]
Daniel Lepe-Soltero, Thierry Artières, Anaïs Baudot, Paul Villoutreix
[ABSTRACT]
An important objective in computational biology is the efficient integration
of multi-omics data. The task of integration comes with challenges: multi-omics
data are most often unpaired (requiring diagonal integration), partially
labeled with information about biological conditions, and in some situations
such as rare diseases, only very small datasets are available. We present
MODIS, a semi-supervised framework designed to account for these particular
challenges. To address the challenge of very small datasets, we propose to
exploit information contained in larger multi-omics databases by training our
model on a large reference database and a small target dataset simultaneously,
effectively turning the problem of transfer learning into a problem of learning
with class imbalance. MODIS performs diagonal integration on unpaired samples,
leveraging class-labels to align modalities despite class imbalance and data
scarcity. The architecture combines multiple variational auto-encoders, a class
classifier and an adversarially trained modality classifier. To ensure training
stability, we adapted a regularized relativistic GAN loss to this setting. We
first validate MODIS on a synthetic dataset to assess the level of supervision
needed for accurate alignment and to quantify the impact of class imbalance on
predictive performance. We then apply our approach to the large public TCGA
database, considering between 10 and 34 classes (cancer types and normal
tissue). MODIS demonstrates high prediction accuracy, robust performance with
limited supervision, and stability to class imbalance. These results position
MODIS as a promising solution for challenging integration scenarios,
particularly diagonal integration with a small number of samples, typical of
rare disease studies. The code is available at
https://github.com/VILLOUTREIXLab/MODIS.
[LINK]
http://arxiv.org/abs/2503.18856v2
[DATE]
2025-09-16 00:29:33+08:00
[CATEGORIES]
cs.LG
Draw a Portrait of Your Graph Data: An Instance-Level Profiling Framework for Graph-Structured Data
[AUTHORS]
Tianqi Zhao, Russa Biswas, Megha Khosla
[ABSTRACT]
Graph machine learning models often achieve similar overall performance yet
behave differently at the node level, failing on different subsets of nodes
with varying reliability. Standard evaluation metrics such as accuracy obscure
these fine-grained differences, making it difficult to diagnose when and where
models fail. We introduce NodePro, a node profiling framework that enables
fine-grained diagnosis of model behavior by assigning interpretable profile
scores to individual nodes. These scores combine data-centric signals, such as
feature dissimilarity, label uncertainty, and structural ambiguity, with
model-centric measures of prediction confidence and consistency during
training. By aligning model behavior with these profiles, NodePro reveals
systematic differences between models, even when aggregate metrics are
indistinguishable. We show that node profiles generalize to unseen nodes,
supporting prediction reliability without ground-truth labels. Finally, we
demonstrate the utility of NodePro in identifying semantically inconsistent or
corrupted nodes in a structured knowledge graph, illustrating its effectiveness
in real-world settings.
[LINK]
http://arxiv.org/abs/2509.12094v1
[DATE]
2025-09-16 00:18:54+08:00
[CATEGORIES]
cs.LG
Operator learning for hyperbolic partial differential equations
[AUTHORS]
Christopher Wang, Alex Townsend
[ABSTRACT]
We construct the first rigorously justified probabilistic algorithm for
recovering the solution operator of a hyperbolic partial differential equation
(PDE) in two variables from input-output training pairs. The primary challenge
of recovering the solution operator of hyperbolic PDEs is the presence of
characteristics, along which the associated Green’s function is discontinuous.
Therefore, a central component of our algorithm is a rank detection scheme that
identifies the approximate location of the characteristics. By combining the
randomized singular value decomposition with an adaptive hierarchical partition
of the domain, we construct an approximant to the solution operator using
$O(\Psi_\epsilon^{-1}\epsilon^{-7}\log(\Xi_\epsilon^{-1}\epsilon^{-1}))$
input-output pairs with relative error $O(\Xi_\epsilon^{-1}\epsilon)$ in the
operator norm as $\epsilon\to0$, with high probability. Here, $\Psi_\epsilon$
represents the existence of degenerate singular values of the solution
operator, and $\Xi_\epsilon$ measures the quality of the training data. Our
assumptions on the regularity of the coefficients of the hyperbolic PDE are
relatively weak given that hyperbolic PDEs do not have the “instantaneous
smoothing effect” of elliptic and parabolic PDEs, and our recovery rate
improves as the regularity of the coefficients increases.
[COMMENTS]
44 pages, 8 figures
[LINK]
http://arxiv.org/abs/2312.17489v2
[DATE]
2025-09-16 00:18:44+08:00
[CATEGORIES]
cs.LG
Deceptive Risk Minimization: Out-of-Distribution Generalization by Deceiving Distribution Shift Detectors
[AUTHORS]
Anirudha Majumdar
[ABSTRACT]
This paper proposes deception as a mechanism for out-of-distribution (OOD)
generalization: by learning data representations that make training data appear
independent and identically distributed (iid) to an observer, we can identify
stable features that eliminate spurious correlations and generalize to unseen
domains. We refer to this principle as deceptive risk minimization (DRM) and
instantiate it with a practical differentiable objective that simultaneously
learns features that eliminate distribution shifts from the perspective of a
detector based on conformal martingales while minimizing a task-specific loss.
In contrast to domain adaptation or prior invariant representation learning
methods, DRM does not require access to test data or a partitioning of training
data into a finite number of data-generating domains. We demonstrate the
efficacy of DRM on numerical experiments with concept shift and a simulated
imitation learning setting with covariate shift in environments that a robot is
deployed in.
[LINK]
http://arxiv.org/abs/2509.12081v1
[DATE]
2025-09-16 00:11:55+08:00
[CATEGORIES]
cs.LG
A Time-Series Foundation Model by Universal Delay Embedding
[AUTHORS]
Zijian Wang, Peng Tao, Jifan Shi, Rui Bao, Rui Liu, Luonan Chen
[ABSTRACT]
This study introduces Universal Delay Embedding (UDE), a pretrained
foundation model designed to revolutionize time-series forecasting through
principled integration of delay embedding representation and Koopman operator
prediction. Leveraging Takens’ embedding theorem, UDE as a dynamical
representation of observed data constructs two-dimensional subspace patches
from Hankel matrices, theoretically preserving dynamical and topological
properties of underlying dynamical systems. Such patches are viewed as images,
which can be efficiently processed by exploiting advanced deep learning
technologies. Computationally, these patches further serve as tokens for
learning a self-attention encoder, thus enabling accurate prediction of
nonlinear time-series by a finite-dimensional Koopman operator in a linear
manner in a latent space. Extensive evaluations across various benchmarks and
real-world climate datasets demonstrate over 20% average reduction in mean
squared error versus state-of-the-art foundation models, alongside superior
generalization in fine-tuning scenarios. In particular, the learned dynamical
representations and Koopman operator prediction forms from the patches exhibit
exceptional interpretability, with consistent identification of topologically
informative subspaces and robust encoding of domain-invariant dynamics,
establishing UDE as a scalable, interpretable framework for universal
time-series modeling and forecasting with broad scientific and industrial
applicability.
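As a rough illustration of the delay-embedding step the abstract refers to, the sketch below builds a Hankel (trajectory) matrix from a scalar series. The function name, the `dim` and `tau` parameters, and the use of NumPy are assumptions made for this example; it is not the UDE implementation.

```python
import numpy as np

def hankel_delay_embedding(x, dim, tau=1):
    """Return the delay-embedding (Hankel) matrix of a 1-D series.

    Row i is [x[i], x[i + tau], ..., x[i + (dim - 1) * tau]].
    `dim` (embedding dimension) and `tau` (delay) are illustrative parameters.
    """
    x = np.asarray(x, dtype=float)
    n_rows = len(x) - (dim - 1) * tau
    if n_rows <= 0:
        raise ValueError("series too short for the requested dim and tau")
    # Broadcast row offsets against column delays to get an index grid.
    idx = np.arange(n_rows)[:, None] + tau * np.arange(dim)[None, :]
    return x[idx]
```

Sub-blocks of such a matrix could then be treated as two-dimensional patches, which is the role the abstract assigns to them.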
[LINK]
http://arxiv.org/abs/2509.12080v1
[DATE]
2025-09-16 00:11:49+08:00
[CATEGORIES]
cs.LG
An End to End Edge to Cloud Data and Analytics Strategy
[AUTHORS]
Vijay Kumar Butte, Sujata Butte
[ABSTRACT]
There is an exponential growth of connected Internet of Things (IoT) devices.
These have given rise to applications that rely on real time data to make
critical decisions quickly. Enterprises today are adopting cloud at a rapid
pace. There is a critical need to develop secure and efficient strategies and
architectures to best leverage the capabilities of cloud and edge assets. This
paper provides an end to end secure edge to cloud data and analytics strategy.
To enable real life implementation, the paper provides reference architectures
for device layer, edge layer and cloud layer.
[LINK]
http://arxiv.org/abs/2509.12296v1
[DATE]
2025-09-16 00:04:10+08:00
[CATEGORIES]
cs.LG
Dion: Distributed Orthonormalized Updates
[AUTHORS]
Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, John Langford
[ABSTRACT]
Orthonormalized updates accelerate training, improve stability, and enable
robust hyperparameter transfer, but existing methods like Muon rely on dense
matrix operations that clash with sharded weights in large-scale LLM training,
causing high compute and communication cost. We introduce Dion (Distributed
Orthonormalization), a scalable and efficient update rule that replaces
Newton-Schulz iteration with amortized power iteration on a momentum buffer,
avoiding full-matrix reconstruction and integrating cleanly with weight
sharding. The rank-fraction parameter with error feedback enables low-rank
updates that balance quality with significant cost savings. On language models
from 160M to 3B parameters, Dion retains the benefits of orthonormalized
updates, while markedly reducing wall-clock time at scale, making it a
practical optimizer for next-generation foundation models. Code is available
at: https://github.com/microsoft/dion/
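To make the idea of power iteration on a matrix concrete, here is a generic single step that produces a rank-r factorization with an orthonormal left factor. This is a textbook sketch under assumed names (`power_iteration_step`, `M`, `Q`); it does not reproduce Dion's update rule, its error feedback, or its sharding scheme.

```python
import numpy as np

def power_iteration_step(M, Q):
    """One power-iteration step toward a rank-r factorization M ~ P @ R.T.

    M: (m, n) matrix (for example a momentum buffer; an assumption here).
    Q: (n, r) right factor carried over from the previous step.
    """
    # Orthonormalize the projected columns with a reduced QR decomposition.
    P, _ = np.linalg.qr(M @ Q)   # P has shape (m, r), orthonormal columns
    R = M.T @ P                  # R has shape (n, r)
    return P, R
```

Reusing `R` as the next `Q` spreads the iteration across optimizer steps, which is one way to read "amortized" in the abstract.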
[COMMENTS]
“Version 3” with various new updates
[LINK]
http://arxiv.org/abs/2504.05295v3
[DATE]
2025-09-16 00:02:53+08:00
[CATEGORIES]
cs.LG
Early Detection of Branched Broomrape (Phelipanche ramosa) Infestation in Tomato Crops Using Leaf Spectral Analysis and Machine Learning
[AUTHORS]
Mohammadreza Narimani, Alireza Pourreza, Ali Moghimi, Parastoo Farajpoor, Hamid Jafarbiglu, Mohsen B. Mesgaran
[ABSTRACT]
Branched broomrape (Phelipanche ramosa) is a chlorophyll-deficient parasitic
weed that threatens tomato production by extracting nutrients from the host. We
investigate early detection using leaf-level spectral reflectance (400-2500 nm)
and ensemble machine learning. In a field experiment in Woodland, California,
we tracked 300 tomato plants across growth stages defined by growing degree
days (GDD). Leaf reflectance was acquired with a portable spectrometer and
preprocessed (band denoising, 1 nm interpolation, Savitzky-Golay smoothing,
correlation-based band reduction). Clear class differences were observed near
1500 nm and 2000 nm water absorption features, consistent with reduced leaf
water content in infected plants at early stages. An ensemble combining Random
Forest, XGBoost, SVM with RBF kernel, and Naive Bayes achieved 89% accuracy at
585 GDD, with recalls of 0.86 (infected) and 0.93 (noninfected). Accuracy
declined at later stages (e.g., 69% at 1568 GDD), likely due to senescence and
weed interference. Despite the small number of infected plants and
environmental confounders, results show that proximal sensing with ensemble
learning enables timely detection of broomrape before canopy symptoms are
visible, supporting targeted interventions and reduced yield losses.
[COMMENTS]
Author-accepted version. Accepted and presented at AGRICONTROL 2025
(8th IFAC Conference on Sensing, Control and Automation Technologies for
Agriculture), UC Davis, USA. To appear in IFAC-PapersOnLine (Elsevier)
[LINK]
http://arxiv.org/abs/2509.12074v1
[DATE]
2025-09-16 00:00:32+08:00
[CATEGORIES]
cs.LG
Steering Language Models in Multi-Token Generation: A Case Study on Tense and Aspect
[AUTHORS]
Alina Klerings, Jannik Brinkmann, Daniel Ruffinelli, Simone Ponzetto
[ABSTRACT]
Large language models (LLMs) are able to generate grammatically well-formed
text, but how do they encode their syntactic knowledge internally? While prior
work has focused largely on binary grammatical contrasts, in this work, we
study the representation and control of two multidimensional hierarchical
grammar phenomena - verb tense and aspect - and for each, identify distinct,
orthogonal directions in residual space using linear discriminant analysis.
Next, we demonstrate causal control over both grammatical features through
concept steering across three generation tasks. Then, we use these identified
features in a case study to investigate factors influencing effective steering
in multi-token generation. We find that steering strength, location, and
duration are crucial parameters for reducing undesirable side effects such as
topic shift and degeneration. Our findings suggest that models encode tense and
aspect in structurally organized, human-like ways, but effective control of
such features during generation is sensitive to multiple factors and requires
manual tuning or automated optimization.
[COMMENTS]
to be published in The 2025 Conference on Empirical Methods in
Natural Language Processing
[LINK]
http://arxiv.org/abs/2509.12065v1
[DATE]
2025-09-15 23:48:09+08:00
[CATEGORIES]
cs.CL
Is In-Context Learning Learning?
[AUTHORS]
Adrian de Wynter
[ABSTRACT]
In-context learning (ICL) allows some autoregressive models to solve tasks
via next-token prediction and without needing further training. This has led to
claims about these models’ ability to solve (learn) unseen tasks with only a
few shots (exemplars) in the prompt. However, deduction does not always imply
learning, as ICL does not explicitly encode a given observation. Instead, the
models rely on their prior knowledge and the exemplars given, if any. We argue
that, mathematically, ICL does constitute learning, but its full
characterisation requires empirical work. We then carry out a large-scale
analysis of ICL ablating out or accounting for memorisation, pretraining,
distributional shifts, and prompting style and phrasing. We find that ICL is an
effective learning paradigm, but limited in its ability to learn and generalise
to unseen tasks. We note that, in the limit where exemplars become more
numerous, accuracy is insensitive to exemplar distribution, model, prompt
style, and the input’s linguistic features. Instead, it deduces patterns from
regularities in the prompt, which leads to distributional sensitivity,
especially in prompting styles such as chain-of-thought. Given the varied
accuracies on formally similar tasks, we conclude that autoregression’s ad-hoc
encoding is not a robust mechanism, and suggests limited all-purpose
generalisability.
[COMMENTS]
Director’s cut
[LINK]
http://arxiv.org/abs/2509.10414v2
[DATE]
2025-09-15 23:29:49+08:00
[CATEGORIES]
cs.CL
cs.LG
FinGEAR: Financial Mapping-Guided Enhanced Answer Retrieval
[AUTHORS]
Ying Li, Mengyu Wang, Miguel de Carvalho, Sotirios Sabanis, Tiejun Ma
[ABSTRACT]
Financial disclosures such as 10-K filings present challenging retrieval
problems due to their length, regulatory section hierarchy, and domain-specific
language, which standard retrieval-augmented generation (RAG) models underuse.
We introduce FinGEAR (Financial Mapping-Guided Enhanced Answer Retrieval), a
retrieval framework tailored to financial documents. FinGEAR combines a finance
lexicon for Item-level guidance (FLAM), dual hierarchical indices for
within-Item search (Summary Tree and Question Tree), and a two-stage
cross-encoder reranker. This design aligns retrieval with disclosure structure
and terminology, enabling fine-grained, query-aware context selection.
Evaluated on full 10-Ks with queries aligned to the FinQA dataset, FinGEAR
delivers consistent gains in precision, recall, F1, and relevancy, improving F1
by up to 56.7% over flat RAG, 12.5% over graph-based RAGs, and 217.6% over
prior tree-based systems, while also increasing downstream answer accuracy with
a fixed reader. By jointly modeling section hierarchy and domain lexicon
signals, FinGEAR improves retrieval fidelity and provides a practical
foundation for high-stakes financial analysis.
[LINK]
http://arxiv.org/abs/2509.12042v1
[DATE]
2025-09-15 23:25:26+08:00
[CATEGORIES]
cs.CL
Hopscotch: Discovering and Skipping Redundancies in Language Models
[AUTHORS]
Mustafa Eyceoz, Nikhil Shivakumar Nayak, Hao Wang, Ligong Han, Akash Srivastava
[ABSTRACT]
Modern causal language models stack many attention blocks to improve
performance, but not all blocks are necessary for every task. We propose
Hopscotch, a simple yet effective method that identifies and skips attention
blocks with least contributions to a task and adapts to preserve output
quality. Hopscotch jointly optimizes which blocks to skip and how to scale the
outputs of the remaining layers. By introducing lightweight, trainable scaling
parameters to attention and MLP blocks, it mitigates distribution shifts in
hidden states caused by removing attention blocks. Hopscotch does not modify
model weights or require access to pretraining or instruction-tuning data, and
is compatible with existing model compression techniques. When applied to
Llama-3.1-8B and Qwen2.5-7B, Hopscotch achieves less than
a 2% drop in performance even after skipping four attention blocks.
[COMMENTS]
10 pages, 4 figures, 9 tables
[LINK]
http://arxiv.org/abs/2506.03303v2
[DATE]
2025-09-15 23:22:06+08:00
[CATEGORIES]
cs.CL
cs.LG
AMQ: Enabling AutoML for Mixed-precision Weight-Only Quantization of Large Language Models
[AUTHORS]
Sangjun Lee, Seung-taek Woo, Jungyu Jin, Changhun Lee, Eunhyeok Park
[COMMENTS]
EMNLP 2025 Main Conference, Long Paper (Oral)
[LINK]
http://arxiv.org/abs/2509.12019v1
[DATE]
2025-09-15 22:59:35+08:00
[CATEGORIES]
cs.LG
cs.CL
Are Generative Models Underconfident? Better Quality Estimation with Boosted Model Probability
[AUTHORS]
Tu Anh Dinh, Jan Niehues
[ABSTRACT]
Quality Estimation (QE) is the task of estimating the quality of model output during
inference when the ground truth is not available. Deriving output quality from
the models’ output probability is the most trivial and low-effort way. However,
we show that the output probability of text-generation models can appear
underconfident. At each output step, there can be multiple correct options,
making the probability distribution spread out more. Thus, lower probability
does not necessarily mean lower output quality. Due to this observation, we
propose a QE approach called BoostedProb, which boosts the model’s confidence
in cases where there are multiple viable output options. With no increase in
complexity, BoostedProb is notably better than raw model probability in
different settings, achieving on average +0.194 improvement in Pearson
correlation to ground-truth quality. It also comes close to or outperforms more
costly approaches like supervised or ensemble-based QE in certain settings.
[COMMENTS]
Accepted to EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2502.11115v4
[DATE]
2025-09-15 22:57:04+08:00
[CATEGORIES]
cs.CL
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
[AUTHORS]
Yuhao Du, Qianwei Huang, Guo Zhu, Zhanchen Dai, Shunian Chen, Qiming Zhu, Le Pan, Minghao Chen, Yuhao Zhang, Li Zhou, Benyou Wang, Haizhou Li
[ABSTRACT]
The rapid advancement of speech-to-speech (S2S) large language models (LLMs)
has significantly improved real-time spoken interaction. However, current
evaluation frameworks remain inadequate for assessing performance in complex,
multi-turn dialogues. To address this, we introduce MTalk-Bench, a multi-turn
S2S benchmark covering three core dimensions: Semantic Information,
Paralinguistic Information, and Ambient Sound. Each dimension includes nine
realistic scenarios, along with targeted tasks to assess specific capabilities
such as reasoning. Our dual-method evaluation framework combines Arena-style
evaluation (pairwise comparison) and Rubrics-based evaluation (absolute
scoring) for relative and absolute assessment. The benchmark includes both
model and human outputs, evaluated by human evaluators and LLMs. Experimental
results reveal two sets of findings. Overall performance of S2S LLMs: (1)
models excel at semantic information processing yet underperform on
paralinguistic information and ambient sounds perception; (2) models typically
regain coherence by increasing response length, sacrificing efficiency in
multi-turn dialogues; (3) modality-aware, task-specific designs outperform
brute scaling. Evaluation framework and reliability: (1) Arena and Rubrics
yield consistent, complementary rankings, but reliable distinctions emerge only
when performance gaps are large; (2) LLM-as-a-judge aligns with humans when
gaps are clear or criteria explicit, but exhibits position and length biases
and is reliable on nonverbal evaluation only with text annotations. These
results highlight current limitations in S2S evaluation and the need for more
robust, speech-aware assessment frameworks.
[LINK]
http://arxiv.org/abs/2508.18240v2
[DATE]
2025-09-15 22:50:39+08:00
[CATEGORIES]
cs.CL
Lost in Embeddings: Information Loss in Vision-Language Models
[AUTHORS]
Wenyan Li, Raphael Tang, Chengzu Li, Caiqi Zhang, Ivan Vulić, Anders Søgaard
[ABSTRACT]
Vision–language models (VLMs) often process visual inputs through a
pretrained vision encoder, followed by a projection into the language model’s
embedding space via a connector component. While crucial for modality fusion,
the potential information loss induced by this projection step and its direct
impact on model capabilities remain understudied. We introduce two
complementary approaches to examine and quantify this loss by analyzing the
latent representation space. First, we evaluate semantic information
preservation by analyzing changes in k-nearest neighbor relationships between
image representations, before and after projection. Second, we directly measure
information loss by reconstructing visual embeddings from the projected
representation, localizing loss at an image patch level. Experiments reveal
that connectors substantially distort the local geometry of visual
representations, with k-nearest neighbors diverging by 40–60%
post-projection, correlating with degradation in retrieval performance. The
patch-level embedding reconstruction provides interpretable insights for model
behavior on visually grounded question-answering tasks, finding that areas of
high information loss reliably predict instances where models struggle.
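The first analysis in the abstract, comparing k-nearest-neighbor relationships before and after projection, can be sketched as a neighbor-overlap ratio. The function names, the Euclidean distance, and the brute-force search are assumptions chosen for brevity; the paper's exact metric and setup may differ.

```python
import numpy as np

def knn_indices(emb, k):
    """Indices of the k nearest neighbors of each row (Euclidean, brute force)."""
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # exclude each point from its own list
    return np.argsort(dist, axis=1)[:, :k]

def knn_overlap(before, after, k):
    """Mean fraction of k-NN shared between two embeddings of the same items."""
    nb = knn_indices(before, k)
    na = knn_indices(after, k)
    shared = [len(set(b) & set(a)) / k for b, a in zip(nb, na)]
    return float(np.mean(shared))
```

An overlap of 1.0 means the projection preserved every neighborhood; one minus the overlap gives a divergence figure of the kind the abstract reports.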
[LINK]
http://arxiv.org/abs/2509.11986v1
[DATE]
2025-09-15 22:38:06+08:00
[CATEGORIES]
cs.CL
Spec-LLaVA: Accelerating Vision-Language Models with Dynamic Tree-Based Speculative Decoding
[AUTHORS]
Mingxiao Huo, Jiayi Zhang, Hewei Wang, Jinfeng Xu, Zheyu Chen, Huilin Tai, Yijun Chen
[ABSTRACT]
Vision-Language Models (VLMs) enable powerful multimodal reasoning but suffer
from slow autoregressive inference, limiting their deployment in real-time
applications. We introduce Spec-LLaVA, a system that applies speculative
decoding to accelerate VLMs without sacrificing output quality. Spec-LLaVA
pairs a lightweight draft VLM with a large target model: the draft speculates
future tokens, which the target verifies in parallel, allowing multiple tokens
to be generated per step. To maximize efficiency, we design a dynamic
tree-based verification algorithm that adaptively expands and prunes
speculative branches using draft model confidence. On MS COCO out-of-domain
images, Spec-LLaVA achieves up to 3.28× faster decoding on LLaVA-1.5
(7B, 13B) with no loss in generation quality. This work presents a lossless
acceleration framework for VLMs using dynamic tree-structured speculative
decoding, opening a path toward practical real-time multimodal assistants.
Importantly, the lightweight draft model design makes the framework amenable to
resource-constrained or on-device deployment settings.
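The draft-then-verify loop described in the abstract can be illustrated with a simplified, single-chain greedy variant. This is not the dynamic tree-based algorithm from the paper: `draft_next` and `target_next` are hypothetical callables that return the next token for a given context, and verification is written sequentially here although a real system would score all drafted positions in one parallel pass.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Draft k tokens, then keep the longest prefix the target agrees with.

    Always returns at least one token: either the target's correction at the
    first mismatch, or one extra target token when every draft token matches.
    """
    # 1) The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed token in order.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)   # replace the first wrong token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) All drafts accepted: the target contributes one additional token.
    accepted.append(target_next(ctx))
    return accepted
```

Because every emitted token is one the target would have chosen greedily, the output matches plain greedy decoding with the target, which is the sense in which such schemes are lossless.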
[COMMENTS]
7 pages, accepted by ICML TTODLer-FM workshop
[LINK]
http://arxiv.org/abs/2509.11961v1
[DATE]
2025-09-15 22:16:51+08:00
[CATEGORIES]
cs.CL
LinguaLens: Towards Interpreting Linguistic Mechanisms of Large Language Models via Sparse Auto-Encoder
[AUTHORS]
Yi Jing, Zijun Yao, Hongzhu Guo, Lingxu Ran, Xiaozhi Wang, Lei Hou, Juanzi Li
[ABSTRACT]
Large language models (LLMs) demonstrate exceptional performance on tasks
requiring complex linguistic abilities, such as reference disambiguation and
metaphor recognition/generation. Although LLMs possess impressive capabilities,
their internal mechanisms for processing and representing linguistic knowledge
remain largely opaque. Prior research on linguistic mechanisms is limited by
coarse granularity, limited analysis scale, and narrow focus. In this study, we
propose LinguaLens, a systematic and comprehensive framework for analyzing the
linguistic mechanisms of large language models, based on Sparse Auto-Encoders
(SAEs). We extract a broad set of Chinese and English linguistic features
across four dimensions (morphology, syntax, semantics, and pragmatics). By
employing counterfactual methods, we construct a large-scale counterfactual
dataset of linguistic features for mechanism analysis. Our findings reveal
intrinsic representations of linguistic knowledge in LLMs, uncover patterns of
cross-layer and cross-lingual distribution, and demonstrate the potential to
control model outputs. This work provides a systematic suite of resources and
methods for studying linguistic mechanisms, offers strong evidence that LLMs
possess genuine linguistic knowledge, and lays the foundation for more
interpretable and controllable language modeling in future research.
[COMMENTS]
Accepted by EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2502.20344v2
[DATE]
2025-09-15 22:09:38+08:00
[CATEGORIES]
cs.CL
How to Evaluate Medical AI
[AUTHORS]
Ilia Kopanichuk, Petr Anokhin, Vladimir Shaposhnikov, Vladimir Makharev, Ekaterina Tsapieva, Iaroslav Bespalov, Dmitry V. Dylov, Ivan Oseledets
[ABSTRACT]
The integration of artificial intelligence (AI) into medical diagnostic
workflows requires robust and consistent evaluation methods that ensure
reliability and clinical relevance while accounting for variability in expert
judgments. Traditional metrics like precision and recall often fail to account
for the inherent variability in expert judgments, leading to inconsistent
assessments of AI performance. Inter-rater agreement statistics like Cohen’s
Kappa are more reliable but they lack interpretability. We introduce Relative
Precision and Recall of Algorithmic Diagnostics (RPAD and RRAD), new
evaluation metrics that compare AI outputs against multiple expert opinions
rather than a single reference. By normalizing performance against inter-expert
disagreement, these metrics provide a more stable and realistic measure of the
quality of predicted diagnosis. In addition to the comprehensive analysis of
diagnostic quality measures, our study contains a very important side result.
Our evaluation methodology allows us to avoid selecting diagnoses from a
limited list when evaluating a given case. Instead, both the models being
tested and the examiners verifying them arrive at a free-form diagnosis. In
this automated methodology for establishing the identity of free-form clinical
diagnoses, a remarkable 98% accuracy becomes attainable. We evaluate our
approach using 360 medical dialogues, comparing multiple large language models
(LLMs) against a panel of physicians. A large-scale study shows that
top-performing models, such as DeepSeek-V3, achieve consistency on par with or
exceeding expert consensus. Moreover, we demonstrate that expert judgments
exhibit significant variability - often greater than that between AI and
humans. This finding underscores the limitations of any absolute metrics and
supports the need to adopt relative metrics in medical AI.
[COMMENTS]
10 pages, 7 figures
[LINK]
http://arxiv.org/abs/2509.11941v1
[DATE]
2025-09-15 22:01:22+08:00
[CATEGORIES]
cs.CL
Designing LLMs for cultural sensitivity: Evidence from English-Japanese translation
[AUTHORS]
Helene Tenzer, Oumnia Abidi, Stefan Feuerriegel
[LINK]
http://arxiv.org/abs/2509.11921v1
[DATE]
2025-09-15 21:37:35+08:00
[CATEGORIES]
cs.CL
Uncertainty in Authorship: Why Perfect AI Detection Is Mathematically Impossible
[AUTHORS]
Aadil Gani Ganie
[ABSTRACT]
As large language models (LLMs) become more advanced, it is increasingly
difficult to distinguish between human-written and AI-generated text. This
paper draws a conceptual parallel between quantum uncertainty and the limits of
authorship detection in natural language. We argue that there is a fundamental
trade-off: the more confidently one tries to identify whether a text was
written by a human or an AI, the more one risks disrupting the text’s natural
flow and authenticity. This mirrors the tension between precision and
disturbance found in quantum systems. We explore how current detection
methods, such as stylometry, watermarking, and neural classifiers, face
inherent limitations. Enhancing detection accuracy often leads to changes in
the AI’s output, making other features less reliable. In effect, the very act
of trying to detect AI authorship introduces uncertainty elsewhere in the text.
Our analysis shows that when AI-generated text closely mimics human writing,
perfect detection becomes not just technologically difficult but theoretically
impossible. We address counterarguments and discuss the broader implications
for authorship, ethics, and policy. Ultimately, we suggest that the challenge
of AI-text detection is not just a matter of better tools; it reflects a
deeper, unavoidable tension in the nature of language itself.
[LINK]
http://arxiv.org/abs/2509.11915v1
[DATE]
2025-09-15 21:33:32+08:00
[CATEGORIES]
cs.CL
GATEAU: Selecting Influential Samples for Long Context Alignment
[AUTHORS]
Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
[ABSTRACT]
Aligning large language models to handle instructions with extremely long
contexts has yet to be fully investigated. Previous studies have attempted to
scale up the available data volume by synthesizing long instruction-following
samples, as constructing such a dataset tends to be challenging for annotators.
However, a lack of a well-defined strategy for ensuring data quality may
introduce low-quality samples and restrict the model’s performance. Thus, we
propose GATEAU, a novel framework to address the unique challenge of long
context alignment by identifying the influential samples enriched with
long-range dependency relations. Specifically, GATEAU measures the long-range
dependencies from two essential aspects: the difficulty of generating target
responses due to the long-range dependencies, and the difficulty of
understanding long inputs due to such dependencies. Comprehensive experiments
indicate that GATEAU effectively identifies influential samples, and the model
trained on these selected samples exhibits better instruction-following and
long-context understanding capabilities.
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2410.15633v7
[DATE]
2025-09-15 21:10:22+08:00
[CATEGORIES]
cs.CL
Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance
[AUTHORS]
Xixi Wang, Miguel Costa, Jordanka Kovaceva, Shuai Wang, Francisco C. Pereira
[COMMENTS]
Accepted to EMNLP 2025 findings
[LINK]
http://arxiv.org/abs/2506.04427v2
[DATE]
2025-09-15 20:38:51+08:00
[CATEGORIES]
cs.CL
The AI Memory Gap: Users Misremember What They Created With AI or Without
[AUTHORS]
Tim Zindulka, Sven Goller, Daniela Fernandes, Robin Welsch, Daniel Buschek
[ABSTRACT]
As large language models (LLMs) become embedded in interactive text
generation, disclosure of AI as a source depends on people remembering which
ideas or texts came from themselves and which were created with AI. We
investigate how accurately people remember the source of content when using AI.
In a pre-registered experiment, 184 participants generated and elaborated on
ideas both unaided and with an LLM-based chatbot. One week later, they were
asked to identify the source (noAI vs withAI) of these ideas and texts. Our
findings reveal a significant gap in memory: After AI use, the odds of correct
attribution dropped, with the steepest decline in mixed human-AI workflows,
where either the idea or elaboration was created with AI. We validated our
results using a computational model of source memory. Discussing broader
implications, we highlight the importance of considering source confusion in
the design and use of interactive text generation technologies.
[COMMENTS]
31 pages, 10 figures, 9 tables
[LINK]
http://arxiv.org/abs/2509.11851v1
[DATE]
2025-09-15 20:31:00+08:00
[CATEGORIES]
cs.CL
Collaborative Document Editing with Multiple Users and AI Agents
[AUTHORS]
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek
[ABSTRACT]
Current AI writing support tools are largely designed for individuals,
complicating collaboration when co-writers must leave the shared workspace to
use AI and then communicate and reintegrate results. We propose integrating AI
agents directly into collaborative writing environments. Our prototype makes AI
use transparent and customisable through two new shared objects: agent profiles
and tasks. Agent responses appear in the familiar comment feature. In a user
study (N=30), 14 teams worked on writing projects during one week. Interaction
logs and interviews show that teams incorporated agents into existing norms of
authorship, control, and coordination, rather than treating them as team
members. Agent profiles were viewed as personal territory, while created agents
and outputs became shared resources. We discuss implications for team-based AI
interaction, highlighting opportunities and boundaries for treating AI as a
shared resource in collaborative work.
[COMMENTS]
34 pages, 10 figures, 4 tables
[LINK]
http://arxiv.org/abs/2509.11826v1
[DATE]
2025-09-15 20:11:59+08:00
[CATEGORIES]
cs.CL
Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts
[AUTHORS]
Haodi Ma, Dzmitry Kasinets, Daisy Zhe Wang
[ABSTRACT]
Multimodal knowledge graph completion (MMKGC) aims to predict missing links
in multimodal knowledge graphs (MMKGs) by leveraging information from various
modalities alongside structural data. Existing MMKGC approaches primarily
extend traditional knowledge graph embedding (KGE) models, which often require
creating an embedding for every entity. This results in large model sizes and
inefficiencies in integrating multimodal information, particularly for
real-world graphs. Meanwhile, Transformer-based models have demonstrated
competitive performance in knowledge graph completion (KGC). However, their
focus on single-modal knowledge limits their capacity to utilize cross-modal
information. Recently, large vision-language models (VLMs) have shown potential
in cross-modal tasks but are constrained by the high cost of training. In this
work, we propose a novel approach that integrates Transformer-based KGE models
with cross-modal context generated by pre-trained VLMs, thereby extending their
applicability to MMKGC. Specifically, we employ a pre-trained VLM to transform
relevant visual information from entities and their neighbors into textual
sequences. We then frame KGC as a sequence-to-sequence task, fine-tuning the
model with the generated cross-modal context. This simple yet effective method
significantly reduces model size compared to traditional KGE approaches while
achieving competitive performance across multiple large-scale datasets with
minimal hyperparameter tuning.
[LINK]
http://arxiv.org/abs/2501.15688v2
[DATE]
2025-09-15 19:55:10+08:00
[CATEGORIES]
cs.CL
cs.LG
Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning
[AUTHORS]
Filip Sondej, Yushi Yang
[ABSTRACT]
Current unlearning techniques and safety training consistently fail to remove
dangerous knowledge from language models. We analyze the root causes and
propose a highly selective technique which unlearns robustly and without
disrupting general performance.
We perform PCA on activations and module output gradients to identify
subspaces containing common representations, and collapse them before
calculating unlearning updates. This way we avoid unlearning general
representations, and only target those specific to the unlearned facts.
When unlearning WMDP dataset facts from Llama-3.1-8B, we drop post-attack
accuracy 80x more than our best baseline (Circuit Breakers) on biohazardous
facts and 30x more on cyberhazardous facts. Despite this, we disrupt general
performance 30x less (only 0.1% WikiText loss increase), while requiring less
than 3 GPU-seconds per fact.
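One plausible reading of "collapse" in the abstract is projecting out the top principal directions before an update is computed. The sketch below does that with an SVD-based PCA. The function name, the choice of `k`, and the interpretation of collapse as orthogonal projection are assumptions; the authors' procedure, which also involves module output gradients, may differ.

```python
import numpy as np

def collapse_top_components(acts, vectors, k):
    """Remove from `vectors` their components along the top-k PCs of `acts`.

    acts:    (n_samples, d) activations used to estimate common directions.
    vectors: (m, d) vectors to clean (for example activations for one fact).
    Returns the collapsed vectors and the (k, d) orthonormal basis removed.
    """
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    collapsed = vectors - (vectors @ basis.T) @ basis
    return collapsed, basis
```

After this step the vectors carry no signal along the shared directions, so an update computed from them would, under these assumptions, target only what is specific to the fact being unlearned.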
[LINK]
http://arxiv.org/abs/2509.11816v1
[DATE]
2025-09-15 19:55:10+08:00
[CATEGORIES]
cs.LG
cs.CL
PledgeTracker: A System for Monitoring the Fulfilment of Pledges
[AUTHORS]
Yulong Chen, Michael Sejr Schlichtkrull, Zhenyun Deng, David Corney, Nasim Asl, Joshua Salisbury, Andrew Dudfield, Andreas Vlachos
[ABSTRACT]
Political pledges reflect candidates’ policy commitments, but tracking their
fulfilment requires reasoning over incremental evidence distributed across
multiple, dynamically updated sources. Existing methods simplify this task into
a document classification task, overlooking its dynamic, temporal and
multi-document nature. To address this issue, we introduce
PledgeTracker, a system that reformulates pledge verification into
structured event timeline construction. PledgeTracker consists of three core
components: (1) a multi-step evidence retrieval module; (2) a timeline
construction module; and (3) a fulfilment filtering module, allowing the
capture of the evolving nature of pledge fulfilment and producing interpretable
and structured timelines. We evaluate PledgeTracker in collaboration with
professional fact-checkers in real-world workflows, demonstrating its
effectiveness in retrieving relevant evidence and reducing human verification
effort.
[COMMENTS]
EMNLP 2025 demo
[LINK]
http://arxiv.org/abs/2509.11804v1
[DATE]
2025-09-15 19:37:47+08:00
[CATEGORIES]
cs.CL
From Fuzzy Speech to Medical Insight: Benchmarking LLMs on Noisy Patient Narratives
[AUTHORS]
Eden Mama, Liel Sheri, Yehudit Aperstein, Alexander Apartsin
[ABSTRACT]
The widespread adoption of large language models (LLMs) in healthcare raises
critical questions about their ability to interpret patient-generated
narratives, which are often informal, ambiguous, and noisy. Existing benchmarks
typically rely on clean, structured clinical text, offering limited insight
into model performance under realistic conditions. In this work, we present a
novel synthetic dataset designed to simulate patient self-descriptions
characterized by varying levels of linguistic noise, fuzzy language, and
layperson terminology. Our dataset comprises clinically consistent scenarios
annotated with ground-truth diagnoses, spanning a spectrum of communication
clarity to reflect diverse real-world reporting styles. Using this benchmark,
we fine-tune and evaluate several state-of-the-art models (LLMs), including
BERT-based and encoder-decoder T5 models. To support reproducibility and future
research, we release the Noisy Diagnostic Benchmark (NDB), a structured dataset
of noisy, synthetic patient descriptions designed to stress-test and compare
the diagnostic capabilities of large language models (LLMs) under realistic
linguistic conditions. We make the benchmark available to the community:
https://github.com/lielsheri/PatientSignal
[COMMENTS]
6 pages, 1 figure
[LINK]
http://arxiv.org/abs/2509.11803v1
[DATE]
2025-09-15 19:34:46+08:00
[CATEGORIES]
cs.CL
Low-rank variational dropout: Uncertainty and rank selection in adapters
[AUTHORS]
Cooper Doyle
[ABSTRACT]
Parameter-efficient fine-tuning (PEFT) methods such as LoRA adapt large
language models by inserting low-rank adapters, but they leave open two key
questions: how to give the adapted model calibrated uncertainty, and how to
choose the adapter rank. Existing approaches to uncertainty are typically
post-hoc, while rank selection is manual and task-specific. BayesLoRA revisits
variational dropout in the LoRA setting and shows that the natural unit of
stochasticity is not individual weights but entire ranks of the adapter. By
placing rank-wise variational distributions over adapter components, BayesLoRA
defines a posterior that (i) yields calibrated predictions through adapter-only
Monte Carlo sampling and (ii) prunes redundant ranks automatically via an
ARD-style KL term. Theoretical analysis shows that this rank-parameterized
posterior localizes uncertainty to the adapted subspace and explains
amplification under distribution shift. Empirically, BayesLoRA improves
calibration while at the same time producing lighter, faster adapters, removing
the need to tune ranks by hand. This dual role of uncertainty estimation and
uncertainty-driven pruning suggests BayesLoRA may offer a practical default for
reliable and efficient PEFT.
[COMMENTS]
5 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.22809v2
[DATE]
2025-09-15 19:21:46+08:00
[CATEGORIES]
cs.LG
cs.CL
User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums
[AUTHORS]
Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, Helena Holmström Olsson
[ABSTRACT]
Customer feedback in industrial forums reflects a rich but underexplored
source of insight into real-world product experience. These publicly shared
discussions offer an organic view of user expectations, frustrations, and
success stories shaped by the specific contexts of use. Yet, harnessing this
information for systematic analysis remains challenging due to the unstructured
and domain-specific nature of the content. The lack of structure and
specialized vocabulary makes it difficult for traditional data analysis
techniques to accurately interpret, categorize, and quantify the feedback,
thereby limiting its potential to inform product development and support
strategies. To address these challenges, this paper presents the User
eXperience Perception Insights Dataset (UXPID), a collection of 7130
artificially synthesized and anonymized user feedback branches extracted from a
public industrial automation forum. Each JavaScript object notation (JSON)
record contains multi-post comments related to specific hardware and software
products, enriched with metadata and contextual conversation data. Leveraging a
large language model (LLM), each branch is systematically analyzed and
annotated for UX insights, user expectations, severity and sentiment ratings,
and topic classifications. The UXPID dataset is designed to facilitate research
in user requirements, user experience (UX) analysis, and AI-driven feedback
processing, particularly where privacy and licensing restrictions limit access
to real-world data. UXPID supports the training and evaluation of
transformer-based models for tasks such as issue detection, sentiment analysis,
and requirements extraction in the context of technical forums.
[LINK]
http://arxiv.org/abs/2509.11777v1
[DATE]
2025-09-15 18:58:41+08:00
[CATEGORIES]
cs.CL
cs.LG
An Agentic Toolkit for Adaptive Information Extraction from Regulatory Documents
[AUTHORS]
Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
[ABSTRACT]
Declaration of Performance (DoP) documents, mandated by EU regulation,
certify the performance of construction products. While some of their content
is standardized, DoPs vary widely in layout, language, schema, and format,
posing challenges for automated key-value pair extraction (KVP) and question
answering (QA). Existing static or LLM-only IE pipelines often hallucinate and
fail to adapt to this structural diversity. Our domain-specific, stateful
agentic system addresses these challenges through a planner-executor-responder
architecture. The system infers user intent, detects document modality, and
orchestrates tools dynamically for robust, traceable reasoning while avoiding
tool misuse or execution loops. Evaluation on a curated DoP dataset
demonstrates improved robustness across formats and languages, offering a
scalable solution for structured data extraction in regulated workflows.
[LINK]
http://arxiv.org/abs/2509.11773v1
[DATE]
2025-09-15 18:53:05+08:00
[CATEGORIES]
cs.CL
Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation
[AUTHORS]
T. G. D. K. Sumanathilaka, Nicholas Micallef, Julian Hough
[ABSTRACT]
Ambiguous words are often found in modern digital communications. Lexical
ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due
to limited data. Consequently, the efficiency of translation, information
retrieval, and question-answering systems is hindered by these limitations.
This study investigates the use of Large Language Models (LLMs) to improve WSD
using a novel approach combining a systematic prompt augmentation mechanism
with a knowledge base (KB) consisting of different sense interpretations. The
proposed method incorporates a human-in-the-loop approach to prompt augmentation,
in which the prompt is supported by Part-of-Speech (POS) tagging, synonyms of
ambiguous words, aspect-based sense filtering and few-shot prompting to guide
the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based
approach, this work demonstrates a substantial improvement in performance. The
evaluation was conducted using FEWS test data and sense tags. This research
advances accurate word interpretation in social media and digital
communication.
[COMMENTS]
12 pages, 6 tables, 1 figure, Proceedings of the 1st International
Conference on NLP & AI for Cyber Security
[LINK]
http://arxiv.org/abs/2411.18337v5
[DATE]
2025-09-15 18:05:37+08:00
[CATEGORIES]
cs.CL
[AUTHORS]
Quang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan, Cuong A. Pham, Qinbo Sun, Weimin Qi, Kentaro Inui, Dezhen Song
[ABSTRACT]
Efficient path planning in robotics, particularly within large-scale, complex
environments, remains a significant hurdle. While Large Language Models (LLMs)
offer strong reasoning capabilities, their high computational cost and limited
adaptability hinder real-time deployment on edge devices. We present SmallPlan,
a novel framework leveraging LLMs as teacher models to train lightweight
Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan,
the SLMs provide optimal action sequences to navigate across scene graphs that
compactly represent full-scaled 3D scenes. The SLMs are trained in a
simulation-powered, interleaved manner with LLM-guided supervised fine-tuning
(SFT) and reinforcement learning (RL). This strategy not only enables SLMs to
successfully complete navigation tasks but also makes them aware of important
factors like travel distance, providing more efficient path planning. Through
experiments, we demonstrate that the fine-tuned SLMs perform competitively with
larger models like GPT-4o on sequential path planning, without suffering from
hallucination and overfitting. SmallPlan is resource-efficient, making it
well-suited for edge-device deployment and advancing practical autonomous
robotics. Our source code is available here:
https://github.com/quangpham2006/SmallPlan
[COMMENTS]
Paper is under review
[LINK]
http://arxiv.org/abs/2505.00831v5
[DATE]
2025-09-15 18:05:34+08:00
[CATEGORIES]
cs.CL
Lean Formalization of Generalization Error Bound by Rademacher Complexity
[AUTHORS]
Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda
[ABSTRACT]
We formalize the generalization error bound using the Rademacher complexity
for the Lean 4 theorem prover based on the probability theory in the Mathlib 4
library. Generalization error quantifies the gap between a learning machine’s
performance on given training data versus unseen test data, and the Rademacher
complexity is a powerful tool to upper-bound the generalization error of a
variety of modern learning problems. Previous studies have only formalized
extremely simple cases such as bounds by parameter counts and analyses for very
simple models (decision stumps). Formalizing the Rademacher complexity bound,
also known as the uniform law of large numbers, requires substantial
development and is achieved for the first time in this study. In the course of
development, we formalize the Rademacher complexity and its unique arguments
such as symmetrization, and clarify the topological assumptions on hypothesis
classes under which the bound holds. As an application, we also present the
formalization of generalization error bound for $L^2$-regularization models.
[COMMENTS]
Major update
[LINK]
http://arxiv.org/abs/2503.19605v3
[DATE]
2025-09-15 17:48:25+08:00
[CATEGORIES]
cs.LG
cs.CL
LLM as a Broken Telephone: Iterative Generation Distorts Information
[AUTHORS]
Amr Mohamed, Mingmeng Geng, Michalis Vazirgiannis, Guokan Shang
[ABSTRACT]
As large language models are increasingly responsible for online content,
concerns arise about the impact of repeatedly processing their own outputs.
Inspired by the “broken telephone” effect in chained human communication, this
study investigates whether LLMs similarly distort information through iterative
generation. Through translation-based experiments, we find that distortion
accumulates over time, influenced by language choice and chain complexity.
While degradation is inevitable, it can be mitigated through strategic
prompting techniques. These findings contribute to discussions on the long-term
effects of AI-mediated information propagation, raising important questions
about the reliability of LLM-generated content in iterative workflows.
[COMMENTS]
Accepted to ACL 2025, Main Conference
[LINK]
http://arxiv.org/abs/2502.20258v2
[DATE]
2025-09-15 17:44:07+08:00
[CATEGORIES]
cs.CL
Chain of Strategy Optimization Makes Large Language Models Better Emotional Supporter
[AUTHORS]
Weixiang Zhao, Xingyu Sui, Xinyang Han, Yang Deng, Yulin Hu, Jiahe Guo, Libo Qin, Qianyun Du, Shijin Wang, Yanyan Zhao, Bing Qin, Ting Liu
[ABSTRACT]
The growing emotional stress in modern society has increased the demand for
Emotional Support Conversations (ESC). While Large Language Models (LLMs) show
promise for ESC, they face two key challenges: (1) low strategy selection
accuracy, and (2) preference bias, limiting their adaptability to emotional
needs of users. Existing supervised fine-tuning (SFT) struggles to address
these issues, as it rigidly trains models on single gold-standard responses
without modeling nuanced strategy trade-offs. To overcome these limitations, we
propose Chain-of-Strategy Optimization (CSO), a novel approach that optimizes
strategy selection preferences at each dialogue turn. We first leverage Monte
Carlo Tree Search to construct ESC-Pro, a high-quality preference dataset with
turn-level strategy-response pairs. Training on ESC-Pro with CSO improves both
strategy accuracy and bias mitigation, enabling LLMs to generate more
empathetic and contextually appropriate responses. Experiments on LLaMA-3.1-8B,
Gemma-2-9B, and Qwen2.5-7B demonstrate that CSO outperforms standard SFT,
highlighting the efficacy of fine-grained, turn-level preference modeling in
ESC.
[COMMENTS]
21 pages, 9 figures, 17 tables
[LINK]
http://arxiv.org/abs/2503.05362v2
[DATE]
2025-09-15 17:43:22+08:00
[CATEGORIES]
cs.CL
UR$^2$: Unify RAG and Reasoning through Reinforcement Learning
[AUTHORS]
Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma, Yang Liu
[ABSTRACT]
Large Language Models (LLMs) have shown remarkable capabilities through two
complementary paradigms: Retrieval-Augmented Generation (RAG), which enhances
knowledge grounding, and Reinforcement Learning from Verifiable Rewards (RLVR),
which optimizes complex reasoning abilities. However, these two capabilities
are often developed in isolation, and existing efforts to unify them remain
narrow in scope – typically limited to open-domain QA with fixed retrieval
settings and task-specific constraints. This lack of integration constrains
generalization and limits the applicability of RAG-RL methods to broader
domains. To bridge this gap, we propose UR$^2$ (Unified RAG and Reasoning), a
general framework that unifies retrieval and reasoning through reinforcement
learning. UR$^2$ introduces two key contributions: a difficulty-aware curriculum
training that selectively invokes retrieval only for challenging problems, and
a hybrid knowledge access strategy combining domain-specific offline corpora
with LLM-generated summaries. These components are designed to enable dynamic
coordination between retrieval and reasoning, improving adaptability across a
diverse range of tasks. Experiments across open-domain QA, MMLU-Pro, medical,
and mathematical reasoning tasks demonstrate that UR$^2$ (built on
Qwen-2.5-3/7B and LLaMA-3.1-8B) significantly outperforms existing RAG and RL
methods, achieving performance comparable to GPT-4o-mini and GPT-4.1-mini on
several benchmarks. We have released all code, models, and data at
https://github.com/Tsinghua-dhy/UR2.
[LINK]
http://arxiv.org/abs/2508.06165v2
[DATE]
2025-09-15 17:23:58+08:00
[CATEGORIES]
cs.CL
Room acoustics affect communicative success in hybrid meeting spaces: a pilot study
[AUTHORS]
Robert Einig, Stefan Janscha, Jonas Schuster, Julian Koch, Martin Hagmueller, Barbara Schuppler
[ABSTRACT]
Since the COVID-19 pandemic in 2020, universities and companies have
increasingly integrated hybrid features into their meeting spaces, or even
created dedicated rooms for this purpose. While the importance of a fast and
stable internet connection is often prioritized, the acoustic design of seminar
rooms is frequently overlooked. Poor acoustics, particularly excessive
reverberation, can lead to issues such as misunderstandings, reduced speech
intelligibility or cognitive and vocal fatigue. This pilot study investigates
whether room acoustic interventions in a seminar room at Graz University of
Technology support better communication in hybrid meetings. For this purpose,
we recorded two groups of persons twice, once before and once after improving
the acoustics of the room. Our findings – despite not reaching statistical
significance due to the small sample size – indicate clearly that our spatial
interventions improve communicative success in hybrid meetings. To make the
paper accessible to readers from the speech communication community as well, we
explain the room acoustics background relevant to the interpretation of our
results.
[LINK]
http://arxiv.org/abs/2509.11709v1
[DATE]
2025-09-15 17:09:33+08:00
[CATEGORIES]
cs.CL
CoachMe: Decoding Sport Elements with a Reference-Based Coaching Instruction Generation Model
[AUTHORS]
Wei-Hsin Yeh, Yu-An Su, Chih-Ning Chen, Yi-Hsueh Lin, Calvin Ku, Wen-Hsin Chiu, Min-Chun Hu, Lun-Wei Ku
[COMMENTS]
Published in Proceedings of the 63rd Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025.
Official version: https://doi.org/10.18653/v1/2025.acl-long.1413
[LINK]
http://arxiv.org/abs/2509.11698v1
[DATE]
2025-09-15 17:01:39+08:00
[CATEGORIES]
cs.CL
cs.LG
A Dynamic Knowledge Update-Driven Model with Large Language Models for Fake News Detection
[AUTHORS]
Di Jin, Jun Yang, Xiaobao Wang, Junwei Zhang, Shuqi Li, Dongxiao He
[ABSTRACT]
As the Internet and social media evolve rapidly, distinguishing credible news
from a vast amount of complex information poses a significant challenge. Due to
the suddenness and instability of news events, the authenticity labels of news
can potentially shift as events develop, making it crucial for fake news
detection to obtain the latest event updates. Existing methods employ
retrieval-augmented generation to fill knowledge gaps, but they suffer from
issues such as insufficient credibility of retrieved content and interference
from noisy information. We propose a dynamic knowledge update-driven model for
fake news detection (DYNAMO), which leverages knowledge graphs to achieve
continuous updating of new knowledge and integrates with large language models
to fulfill dual functions: news authenticity detection and verification of new
knowledge correctness, solving the two key problems of ensuring the
authenticity of new knowledge and deeply mining news semantics. Specifically,
we first construct a news-domain-specific knowledge graph. Then, we use Monte
Carlo Tree Search to decompose complex news items and verify them step by step.
Finally, we extract and update new knowledge from verified real news texts and
reasoning paths. Experimental results demonstrate that DYNAMO achieves the best
performance on two real-world datasets.
[LINK]
http://arxiv.org/abs/2509.11687v1
[DATE]
2025-09-15 16:38:08+08:00
[CATEGORIES]
cs.CL
Measuring Visual Understanding in Telecom domain: Performance Metrics for Image-to-UML conversion using VLMs
[AUTHORS]
HG Ranjani, Rutuja Prabhudesai
[ABSTRACT]
Telecom domain 3GPP documents are replete with images containing sequence
diagrams. Advances in Vision-Language Large Models (VLMs) have eased conversion
of such images to machine-readable PlantUML (puml) formats. However, there is a
gap in evaluation of such conversions - existing works do not compare puml
scripts for various components. In this work, we propose performance metrics to
measure the effectiveness of such conversions. A dataset of sequence diagrams
from 3GPP documents is chosen to be representative of domain-specific actual
scenarios. We compare puml outputs from two VLMs - Claude Sonnet and GPT-4V -
against manually created ground truth representations. We use version control
tools to capture differences and introduce standard performance metrics to
measure accuracies along various components: participant identification,
message flow accuracy, sequence ordering, and grouping construct preservation.
We demonstrate effectiveness of proposed metrics in quantifying conversion
errors across various components of puml scripts. The results show that nodes,
edges and messages are accurately captured. However, we observe that VLMs do
not necessarily perform well on complex structures such as notes, boxes, and groups.
Our experiments and performance metrics indicate a need for better
representation of these components in training data for fine-tuned VLMs.
[LINK]
http://arxiv.org/abs/2509.11667v1
[DATE]
2025-09-15 16:08:41+08:00
[CATEGORIES]
cs.LG
cs.CL
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
[AUTHORS]
Minxuan Lv, Zhenpeng Su, Leiyu Pan, Yizhe Xiong, Zijia Lin, Hui Chen, Wei Zhou, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Songlin Hu
[ABSTRACT]
As large language models continue to scale, computational costs and resource
consumption have emerged as significant challenges. While existing
sparsification methods like pruning reduce computational overhead, they risk
losing model knowledge through parameter removal. This paper proposes DSMoE
(Dynamic Sparse Mixture-of-Experts), a novel approach that achieves
sparsification by partitioning pre-trained FFN layers into computational
blocks. We implement adaptive expert routing using sigmoid activation and
straight-through estimators, enabling tokens to flexibly access different
aspects of model knowledge based on input complexity. Additionally, we
introduce a sparsity loss term to balance performance and computational
efficiency. Extensive experiments on LLaMA models demonstrate that under
equivalent computational constraints, DSMoE achieves superior performance
compared to existing pruning and MoE approaches across language modeling and
downstream tasks, particularly excelling in generation tasks. Analysis reveals
that DSMoE learns distinctive layerwise activation patterns, providing new
insights for future MoE architecture design.
[COMMENTS]
Accepted by EMNLP main conference
[LINK]
http://arxiv.org/abs/2502.12455v3
[DATE]
2025-09-15 15:57:24+08:00
[CATEGORIES]
cs.CL
MALLM: Multi-Agent Large Language Models Framework
[AUTHORS]
Jonas Becker, Lars Benedikt Kaesberg, Niklas Bauer, Jan Philip Wahle, Terry Ruas, Bela Gipp
[ABSTRACT]
Multi-agent debate (MAD) has demonstrated the ability to augment collective
intelligence by scaling test-time compute and leveraging expertise. Current
frameworks for multi-agent debate are often designed towards tool use, lack
integrated evaluation, or provide limited configurability of agent personas,
response generators, discussion paradigms, and decision protocols. We introduce
MALLM (Multi-Agent Large Language Models), an open-source framework that
enables systematic analysis of MAD components. MALLM offers more than 144
unique configurations of MAD, including (1) agent personas (e.g., Expert,
Personality), (2) response generators (e.g., Critical, Reasoning), (3)
discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g.,
Voting, Consensus). MALLM uses simple configuration files to define a debate.
Furthermore, MALLM can load any textual Huggingface dataset (e.g., MMLU-Pro,
WinoGrande) and provides an evaluation pipeline for easy comparison of MAD
configurations. MALLM is tailored towards researchers and provides a window
into the heart of multi-agent debate, facilitating the understanding of its
components and their interplay.
[COMMENTS]
Accepted at EMNLP 2025 (Demo)
[LINK]
http://arxiv.org/abs/2509.11656v1
[DATE]
2025-09-15 15:48:02+08:00
[CATEGORIES]
cs.CL
EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI
[AUTHORS]
Sai Kartheek Reddy Kasu
[ABSTRACT]
The deployment of large language models (LLMs) in mental health and other
sensitive domains raises urgent questions about ethical reasoning, fairness,
and responsible alignment. Yet, existing benchmarks for moral and clinical
decision-making do not adequately capture the unique ethical dilemmas
encountered in mental health practice, where confidentiality, autonomy,
beneficence, and bias frequently intersect. To address this gap, we introduce
Ethical Reasoning in Mental Health (EthicsMH), a pilot dataset of 125 scenarios
designed to evaluate how AI systems navigate ethically charged situations in
therapeutic and psychiatric contexts. Each scenario is enriched with structured
fields, including multiple decision options, expert-aligned reasoning, expected
model behavior, real-world impact, and multi-stakeholder viewpoints. This
structure enables evaluation not only of decision accuracy but also of
explanation quality and alignment with professional norms. Although modest in
scale and developed with model-assisted generation, EthicsMH establishes a task
framework that bridges AI ethics and mental health decision-making. By
releasing this dataset, we aim to provide a seed resource that can be expanded
through community and expert contributions, fostering the development of AI
systems capable of responsibly handling some of society’s most delicate
decisions.
[LINK]
http://arxiv.org/abs/2509.11648v1
[DATE]
2025-09-15 15:35:35+08:00
[CATEGORIES]
cs.CL
Efficient Environmental Claim Detection with Hyperbolic Graph Neural Networks
[AUTHORS]
Darpan Aswal, Manjira Sinha
[ABSTRACT]
Transformer-based models, especially large language models (LLMs), dominate the
field of NLP with their mass adoption in tasks such as text generation,
summarization and fake news detection. These models offer ease of deployment
and reliability for most applications; however, they require significant
amounts of computational power for training as well as inference. This poses
challenges for their adoption in resource-constrained applications, especially in
the open-source community where compute availability is usually scarce. This
work proposes a graph-based approach for Environmental Claim Detection,
exploring Graph Neural Networks (GNNs) and Hyperbolic Graph Neural Networks
(HGNNs) as lightweight yet effective alternatives to transformer-based models.
Re-framing the task as a graph classification problem, we transform claim
sentences into dependency parsing graphs, utilizing a combination of word2vec
and learnable part-of-speech (POS) tag embeddings for the node features and
encoding syntactic dependencies in the edge relations. Our results show that
our graph-based models, particularly HGNNs in the Poincaré space (P-HGNNs),
achieve performance superior to the state-of-the-art on environmental claim
detection while using up to 30x fewer parameters. We also demonstrate
that HGNNs benefit vastly from explicitly modeling data in hierarchical
(tree-like) structures, enabling them to significantly improve over their
Euclidean counterparts.
[LINK]
http://arxiv.org/abs/2502.13628v2
[DATE]
2025-09-15 15:14:58+08:00
[CATEGORIES]
cs.CL
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
[AUTHORS]
Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
[ABSTRACT]
Large vision-language models (LVLMs) have achieved remarkable performance on
multimodal tasks. However, they still suffer from hallucinations, generating
text inconsistent with visual input, posing significant risks in real-world
applications. Existing approaches to address this issue focus on incorporating
external knowledge bases, alignment training, or decoding strategies, all of
which require substantial computational cost and time. Recent works try to
explore more efficient alternatives by adjusting LVLMs’ internal
representations. Although promising, these methods may cause hallucinations to
be insufficiently suppressed or lead to excessive interventions that negatively
affect normal semantics. In this work, we leverage sparse autoencoders (SAEs)
to identify semantic directions closely associated with faithfulness or
hallucination, extracting more precise and disentangled hallucination-related
representations. Our analysis demonstrates that interventions along the
identified faithful direction can mitigate hallucinations, while those along
the hallucinatory direction can exacerbate them. Building on these insights, we
propose Steering LVLMs via SAE Latent Directions (SSL), a plug-and-play method
based on SAE-derived latent directions to mitigate hallucinations in LVLMs.
Extensive experiments demonstrate that SSL significantly outperforms existing
decoding approaches in mitigating hallucinations, while maintaining
transferability across different model architectures with negligible additional
time overhead. The code is available at https://github.com/huazhenglin2003/SSL.
[COMMENTS]
Accepted to Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2505.16146v2
[DATE]
2025-09-15 15:02:17+08:00
[CATEGORIES]
cs.CL
cs.LG
CM-Align: Consistency-based Multilingual Alignment for Large Language Models
[AUTHORS]
Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Yufeng Chen, Jinan Xu, Jie Zhou
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2509.08541v2
[DATE]
2025-09-15 14:55:00+08:00
[CATEGORIES]
cs.CL
CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks
[AUTHORS]
Sunguk Choi, Yonghoon Kwon, Heondeuk Lee
[ABSTRACT]
Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs)
solve difficult problems, but very long traces often slow or even degrade
performance on fast, intuitive “System-1” tasks. We introduce Connector-Aware
Compact CoT (CAC-CoT) – a method that deliberately restricts reasoning to a
small, fixed set of connector phrases, steering the model toward concise and
well-structured explanations. Despite its simplicity, our synthetic method
with general-purpose LLMs yields high-quality training data. CAC-CoT
achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2)
while also achieving approximately 85% on S1-Bench (System-1), surpassing the
baseline by over 20%. Its reasoning traces average approximately 300
tokens (ART), about one-third the length of baseline traces, delivering higher
efficiency without loss of accuracy.
[COMMENTS]
Accepted at EMNLP 2025 findings
[LINK]
http://arxiv.org/abs/2508.18743v2
[DATE]
2025-09-15 14:27:47+08:00
[CATEGORIES]
cs.CL
AesBiasBench: Evaluating Bias and Alignment in Multimodal Language Models for Personalized Image Aesthetic Assessment
[AUTHORS]
Kun Li, Lai-Man Po, Hongzheng Yang, Xuyuan Xu, Kangcheng Liu, Yuzhi Zhao
[ABSTRACT]
Multimodal Large Language Models (MLLMs) are increasingly applied in
Personalized Image Aesthetic Assessment (PIAA) as a scalable alternative to
expert evaluations. However, their predictions may reflect subtle biases
influenced by demographic factors such as gender, age, and education. In this
work, we propose AesBiasBench, a benchmark designed to evaluate MLLMs along two
complementary dimensions: (1) stereotype bias, quantified by measuring
variations in aesthetic evaluations across demographic groups; and (2)
alignment between model outputs and genuine human aesthetic preferences. Our
benchmark covers three subtasks (Aesthetic Perception, Assessment, Empathy) and
introduces structured metrics (IFD, NRD, AAS) to assess both bias and
alignment. We evaluate 19 MLLMs, including proprietary models (e.g., GPT-4o,
Claude-3.5-Sonnet) and open-source models (e.g., InternVL-2.5, Qwen2.5-VL).
Results indicate that smaller models exhibit stronger stereotype biases,
whereas larger models align more closely with human preferences. Incorporating
identity information often exacerbates bias, particularly in emotional
judgments. These findings underscore the importance of identity-aware
evaluation frameworks in subjective vision-language tasks.
[COMMENTS]
Accepted by EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.11620v1
[DATE]
2025-09-15 14:25:39+08:00
[CATEGORIES]
cs.CL
Dynamic Span Interaction and Graph-Aware Memory for Entity-Level Sentiment Classification
[AUTHORS]
Md. Mithun Hossain, Sanjara, Md. Shakil Hossain, Sudipto Chaki
[ABSTRACT]
Entity-level sentiment classification involves identifying the sentiment
polarity linked to specific entities within text. This task poses several
challenges: effectively modeling the subtle and complex interactions between
entities and their surrounding sentiment expressions; capturing dependencies
that may span across sentences; and ensuring consistent sentiment predictions
for multiple mentions of the same entity through coreference resolution.
Additionally, linguistic phenomena such as negation, ambiguity, and overlapping
opinions further complicate the analysis. These complexities make entity-level
sentiment classification a difficult problem, especially in real-world, noisy
textual data. To address these issues, we propose SpanEIT, a novel framework
integrating dynamic span interaction and graph-aware memory mechanisms for
enhanced entity-sentiment relational modeling. SpanEIT builds span-based
representations for entities and candidate sentiment phrases, employs
bidirectional attention for fine-grained interactions, and uses a graph
attention network to capture syntactic and co-occurrence relations. A
coreference-aware memory module ensures entity-level consistency across
documents. Experiments on FSAD, BARU, and IMDB datasets show SpanEIT
outperforms state-of-the-art transformer and hybrid baselines in accuracy and
F1 scores. Ablation and interpretability analyses validate the effectiveness of
our approach, underscoring its potential for fine-grained sentiment analysis in
applications like social media monitoring and customer feedback analysis.
[LINK]
http://arxiv.org/abs/2509.11604v1
[DATE]
2025-09-15 13:47:57+08:00
[CATEGORIES]
cs.CL
AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering
[AUTHORS]
Hassan Alhuzali, Walid Al-Eisawi, Muhammad Abdul-Mageed, Chaimae Abouzahir, Mouath Abu-Daoud, Ashwag Alasmari, Renad Al-Monef, Ali Alqahtani, Lama Ayash, Leen Kharouf, Farah E. Shamout, Nizar Habash
[ABSTRACT]
We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question
Answering Shared Task, held in conjunction with ArabicNLP 2025 (co-located with
EMNLP 2025). This shared task addresses the paucity of high-quality Arabic
medical QA resources by offering two complementary tracks: MentalQA, focusing
on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and
MedArabiQ, covering broader medical domains such as internal medicine,
pediatrics, and clinical decision making. Each track comprises multiple
subtasks, evaluation datasets, and standardized metrics, facilitating fair
benchmarking. The task was structured to promote modeling under realistic,
multilingual, and culturally nuanced healthcare contexts. We outline the
dataset creation, task design and evaluation framework, participation
statistics, and baseline systems, and summarize the overall outcomes. We conclude
with reflections on the performance trends observed and prospects for future
iterations in Arabic health QA.
[COMMENTS]
ArabicNLP 2025, co-located with EMNLP 2025
[LINK]
http://arxiv.org/abs/2508.20047v3
[DATE]
2025-09-15 13:11:57+08:00
[CATEGORIES]
cs.CL
Formal Reasoning for Intelligent QA Systems: A Case Study in the Educational Domain
[AUTHORS]
Tuan Bui, An Nguyen, Phat Thai, Minh Hua, Ngan Pham L. N., Ngan Pham T. B., Dung Le, Long Nguyen, Thanh-Tung Tran, Thang Bui, Tho Quan
[ABSTRACT]
Reasoning is essential for closed-domain QA systems in which procedural
correctness and policy compliance are critical. While large language models
(LLMs) have shown strong performance on many reasoning tasks, recent work
reveals that their reasoning traces are often unfaithful, serving more as
plausible justifications than as causally grounded derivations. Efforts to
combine LLMs with symbolic engines (e.g., Prover9, Z3) have improved
reliability but remain limited to static forms of logic, struggling with
dynamic, state-based reasoning such as multi-step progressions and conditional
transitions.
In this paper, we propose MCFR (Model Checking for Formal Reasoning), a
neuro-symbolic framework that integrates LLMs with model checking to support
property verification. MCFR translates natural language into formal
specifications and verifies them over transition models. To support evaluation,
we introduce EduMC-QA, a benchmark dataset grounded in real academic
procedures. Our results show that MCFR improves reasoning faithfulness and
interpretability, offering a viable path toward verifiable QA in high-stakes
closed-domain applications. In addition to evaluating MCFR, we compare its
performance with state-of-the-art LLMs such as ChatGPT, DeepSeek, and Claude to
contextualize its effectiveness.
[COMMENTS]
Published at the 2nd ACM Workshop in AI-powered Question & Answering
Systems (AIQAM ’25), co-located with ACM Multimedia 2025
[LINK]
http://arxiv.org/abs/2509.11572v1
[DATE]
2025-09-15 12:34:42+08:00
[CATEGORIES]
cs.CL
Bhaasha, Bhasa, Zaban: A Survey for Low-Resourced Languages in South Asia – Current Stage and Challenges
[AUTHORS]
Sampoorna Poria, Xiaolei Huang
[ABSTRACT]
Rapid developments of large language models have revolutionized many NLP
tasks for English data. Unfortunately, the models and their evaluations for
low-resource languages are being overlooked, especially for languages in South
Asia. Although there are more than 650 languages in South Asia, many of them
either have very limited computational resources or are missing from existing
language models. Thus, a concrete question to be answered is: Can we assess the
current stage and challenges to inform our NLP community and facilitate model
developments for South Asian languages? In this survey, we have comprehensively
examined current efforts and challenges of NLP models for South Asian languages
by retrieving studies since 2020, with a focus on transformer-based models,
such as BERT, T5, and GPT. We present advances and gaps across three essential
aspects: data, models, and tasks, such as available data sources, fine-tuning
strategies, and domain applications. Our findings highlight substantial issues,
including missing data in critical domains (e.g., health), code-mixing, and
lack of standardized evaluation benchmarks. Our survey aims to raise awareness
within the NLP community for more targeted data curation, unify benchmarks
tailored to cultural and linguistic nuances of South Asia, and encourage an
equitable representation of South Asian languages. The complete list of
resources is available at: https://github.com/trust-nlp/LM4SouthAsia-Survey.
[LINK]
http://arxiv.org/abs/2509.11570v1
[DATE]
2025-09-15 12:31:22+08:00
[CATEGORIES]
cs.CL
D$^2$HScore: Reasoning-Aware Hallucination Detection via Semantic Breadth and Depth Analysis in LLMs
[AUTHORS]
Yue Ding, Xiaofang Zhu, Tianze Xia, Junfei Wu, Xinlong Chen, Qiang Liu, Liang Wang
[ABSTRACT]
Although Large Language Models (LLMs) have achieved remarkable success, their
practical application is often hindered by the generation of non-factual
content, which is called “hallucination”. Ensuring the reliability of LLMs’
outputs is a critical challenge, particularly in high-stakes domains such as
finance, security, and healthcare. In this work, we revisit hallucination
detection from the perspective of model architecture and generation dynamics.
Leveraging the multi-layer structure and autoregressive decoding process of
LLMs, we decompose hallucination signals into two complementary dimensions: the
semantic breadth of token representations within each layer, and the semantic
depth of core concepts as they evolve across layers. Based on this insight, we
propose D$^2$HScore (Dispersion and Drift-based Hallucination Score),
a training-free and label-free framework that jointly measures: (1)
Intra-Layer Dispersion, which quantifies the semantic diversity of
token representations within each layer; and (2) Inter-Layer Drift,
which tracks the progressive transformation of key token representations across
layers. To ensure drift reflects the evolution of meaningful semantics rather
than noisy or redundant tokens, we guide token selection using attention
signals. By capturing both the horizontal and vertical dynamics of
representation during inference, D$^2$HScore provides an interpretable and
lightweight proxy for hallucination detection. Extensive experiments across
five open-source LLMs and five widely used benchmarks demonstrate that
D$^2$HScore consistently outperforms existing training-free baselines.
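The two signals described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: dispersion is taken as the mean Euclidean distance of token vectors from their layer centroid, drift as the mean cosine distance of selected tokens between consecutive layers, and `select_key_tokens` is a hypothetical stand-in for attention-guided token selection.

```python
import math

def intra_layer_dispersion(token_vectors):
    """Assumed proxy for breadth: mean distance of token vectors from their centroid."""
    dim = len(token_vectors[0])
    centroid = [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]
    return sum(math.dist(v, centroid) for v in token_vectors) / len(token_vectors)

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def select_key_tokens(attention_weights, k):
    """Hypothetical selection rule: indices of the k tokens with the largest attention weight."""
    order = sorted(range(len(attention_weights)), key=lambda i: attention_weights[i], reverse=True)
    return order[:k]

def inter_layer_drift(layers, key_token_indices):
    """Assumed proxy for depth: mean cosine distance of key tokens between consecutive layers."""
    total, count = 0.0, 0
    for prev, curr in zip(layers, layers[1:]):
        for i in key_token_indices:
            total += cosine_distance(prev[i], curr[i])
            count += 1
    return total / count

# Toy hidden states (invented values): layers[l][t] is the vector of token t at layer l.
layer0 = [[1.0, 0.0], [1.0, 0.0]]
layer1 = [[1.0, 0.0], [0.0, 1.0]]
key_tokens = select_key_tokens([0.2, 0.8], k=1)   # picks token 1
breadth = [intra_layer_dispersion(layer) for layer in (layer0, layer1)]
depth = inter_layer_drift([layer0, layer1], key_tokens)
```

In this toy input the second layer spreads the two tokens apart, so its dispersion is larger, and the selected token rotates to an orthogonal direction, which gives the maximal drift for non-negative vectors. How the two scores are combined into a final detector is not stated in the abstract and is left out here.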
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2509.11569v1
[DATE]
2025-09-15 12:28:38+08:00
[CATEGORIES]
cs.CL
Hallucinated Span Detection with Multi-View Attention Features
[AUTHORS]
Yuya Ogasa, Yuki Arase
[ABSTRACT]
This study addresses the problem of hallucinated span detection in the
outputs of large language models. It has received less attention than
output-level hallucination detection despite its practical importance. Prior
work has shown that attentions often exhibit irregular patterns when
hallucinations occur. Motivated by these findings, we extract features from the
attention matrix that provide complementary views capturing (a) whether certain
tokens are influential or ignored, (b) whether attention is biased toward
specific subsets, and (c) whether a token is generated with reference to a
narrow or broad context. These features are input to a
Transformer-based classifier to conduct sequential labelling to identify
hallucinated spans. Experimental results indicate that the proposed method
outperforms strong baselines on hallucinated span detection with longer input
contexts, such as data-to-text and summarisation tasks.
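As a rough illustration of the three views (a) to (c), the sketch below derives one simple statistic per view from a toy attention matrix. The specific statistics (column means, top-k mass, row entropy) and all function names are assumptions chosen for illustration; the abstract does not specify the actual feature definitions.

```python
import math

def received_attention(attn):
    """View (a): average attention each token receives (column means).
    High values suggest influential tokens, low values ignored ones."""
    n_rows, n_cols = len(attn), len(attn[0])
    return [sum(row[j] for row in attn) / n_rows for j in range(n_cols)]

def top_share(row, k=1):
    """View (b): fraction of one token's attention mass on its k most-attended tokens."""
    return sum(sorted(row, reverse=True)[:k])

def row_entropy(row):
    """View (c): Shannon entropy of one token's attention distribution.
    Low entropy means a narrow context, high entropy a broad one."""
    return -sum(p * math.log(p) for p in row if p > 0)

# Toy causal attention matrix (invented values); attn[i][j] is the weight
# that generated token i places on token j, and each row sums to 1.
attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
features = [
    {"top_share": top_share(row), "entropy": row_entropy(row)}
    for row in attn
]
influence = received_attention(attn)
```

Per-token feature vectors of this kind could then be fed to a sequence labelling model; the classifier itself is outside the scope of this sketch.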
[LINK]
http://arxiv.org/abs/2504.04335v2
[DATE]
2025-09-15 12:21:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Too Helpful, Too Harmless, Too Honest or Just Right?
[AUTHORS]
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
[ABSTRACT]
Large Language Models (LLMs) exhibit strong performance across a wide range
of NLP tasks, yet aligning their outputs with the principles of Helpfulness,
Harmlessness, and Honesty (HHH) remains a persistent challenge. Existing
methods often optimize for individual alignment dimensions in isolation,
leading to trade-offs and inconsistent behavior. While Mixture-of-Experts (MoE)
architectures offer modularity, they suffer from poorly calibrated routing,
limiting their effectiveness in alignment tasks. We propose TrinityX, a modular
alignment framework that incorporates a Mixture of Calibrated Experts (MoCaE)
within the Transformer architecture. TrinityX leverages separately trained
experts for each HHH dimension, integrating their outputs through a calibrated,
task-adaptive routing mechanism that combines expert signals into a unified,
alignment-aware representation. Extensive experiments on three standard
alignment benchmarks (Alpaca for Helpfulness, BeaverTails for Harmlessness, and
TruthfulQA for Honesty) demonstrate that TrinityX outperforms strong baselines,
achieving relative improvements of 32.5% in win rate, 33.9% in safety score,
and 28.4% in truthfulness. In addition, TrinityX reduces memory usage and
inference latency by over 40% compared to prior MoE-based approaches. Ablation
studies highlight the importance of calibrated routing, and cross-model
evaluations confirm TrinityX’s generalization across diverse LLM backbones.
[COMMENTS]
EMNLP ’25 Main
[LINK]
http://arxiv.org/abs/2509.08486v2
[DATE]
2025-09-15 11:28:04+08:00
[CATEGORIES]
cs.CL
HARP: Hallucination Detection via Reasoning Subspace Projection
[AUTHORS]
Junjie Hu, Gang Tu, ShengYu Cheng, Jinxin Li, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan
[ABSTRACT]
Hallucinations in Large Language Models (LLMs) pose a major barrier to their
reliable use in critical decision-making. Although existing hallucination
detection methods have improved accuracy, they still struggle with
disentangling semantic and reasoning information and maintaining robustness. To
address these challenges, we propose HARP (Hallucination detection via
reasoning subspace projection), a novel hallucination detection framework. HARP
establishes that the hidden state space of LLMs can be decomposed into a direct
sum of a semantic subspace and a reasoning subspace, where the former encodes
linguistic expression and the latter captures internal reasoning processes.
Moreover, we demonstrate that the Unembedding layer can disentangle these
subspaces, and by applying Singular Value Decomposition (SVD) to its
parameters, the basis vectors spanning the semantic and reasoning subspaces are
obtained. Finally, HARP projects hidden states onto the basis vectors of the
reasoning subspace, and the resulting projections are then used as input
features for hallucination detection in LLMs. By using these projections, HARP
reduces the dimension of the feature to approximately 5% of the original,
filters out most noise, and achieves enhanced robustness. Experiments across
multiple datasets show that HARP achieves state-of-the-art hallucination
detection performance; in particular, it achieves an AUROC of 92.8% on
TriviaQA, outperforming the previous best method by 7.5%.
[LINK]
http://arxiv.org/abs/2509.11536v1
[DATE]
2025-09-15 11:02:33+08:00
[CATEGORIES]
cs.CL
Oyster-I: Beyond Refusal – Constructive Safety Alignment for Responsible Language Models
[AUTHORS]
Ranjie Duan, Jiexi Liu, Xiaojun Jia, Shiji Zhao, Ruoxi Cheng, Fengxiang Wang, Cheng Wei, Yong Xie, Chang Liu, Defeng Li, Yinpeng Dong, Yichi Zhang, Yuefeng Chen, Chongwen Wang, Xingjun Ma, Xingxing Wei, Yang Liu, Hang Su, Jun Zhu, Xinfeng Li, Yitong Sun, Jie Zhang, Jinzhao Hu, Sha Xu, Wenchao Yang, Yitong Yang, Jialing Tao, Hui Xue
[ABSTRACT]
Large language models (LLMs) typically deploy safety mechanisms to prevent
harmful content generation. Most current approaches focus narrowly on risks
posed by malicious actors, often framing risks as adversarial events and
relying on defensive refusals. However, in real-world settings, risks also come
from non-malicious users seeking help while under psychological distress (e.g.,
self-harm intentions). In such cases, the model’s response can strongly
influence the user’s next actions. Simple refusals may lead them to repeat,
escalate, or move to unsafe platforms, creating worse outcomes. We introduce
Constructive Safety Alignment (CSA), a human-centric paradigm that protects
against malicious misuse while actively guiding vulnerable users toward safe
and helpful results. Implemented in Oyster-I (Oy1), CSA combines game-theoretic
anticipation of user reactions, fine-grained risk boundary discovery, and
interpretable reasoning control, turning safety into a trust-building process.
Oy1 achieves state-of-the-art safety among open models while retaining high
general capabilities. On our Constructive Benchmark, it shows strong
constructive engagement, close to GPT-5, and unmatched robustness on the
Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from
refusal-first to guidance-first safety, CSA redefines the model-user
relationship, aiming for systems that are not just safe, but meaningfully
helpful. We release Oy1, code, and the benchmark to support responsible,
user-centered AI.
[COMMENTS]
Technical Report. Code & model weights available:
https://github.com/Alibaba-AAIG/Oyster
[LINK]
http://arxiv.org/abs/2509.01909v5
[DATE]
2025-09-15 10:58:40+08:00
[CATEGORIES]
cs.CL
Towards Reliable and Interpretable Document Question Answering via VLMs
[AUTHORS]
Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
[ABSTRACT]
Vision-Language Models (VLMs) have shown strong capabilities in document
understanding, particularly in identifying and extracting textual information
from complex documents. Despite this, accurately localizing answers within
documents remains a major challenge, limiting both interpretability and
real-world applicability. To address this, we introduce DocExplainerV0, a
plug-and-play bounding-box prediction module that decouples answer generation
from spatial localization. This design makes it applicable to existing VLMs,
including proprietary systems where fine-tuning is not feasible. Through
systematic evaluation, we provide quantitative insights into the gap between
textual accuracy and spatial grounding, showing that correct answers often lack
reliable localization. Our standardized framework highlights these shortcomings
and establishes a benchmark for future research toward more interpretable and
robust document information extraction VLMs.
[LINK]
http://arxiv.org/abs/2509.10129v2
[DATE]
2025-09-15 10:38:59+08:00
[CATEGORIES]
cs.CL
MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation
[AUTHORS]
Chenghao Yang, Yinbo Luo, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu
[ABSTRACT]
Large Language Models (LLMs), e.g. ChatGPT, have been widely adopted
in real-world dialogue applications. However, LLMs’ robustness, especially in
handling long, complex dialogue sessions that involve frequent motivation
transfer and sophisticated cross-turn dependency, has long been criticized.
Nevertheless, no existing benchmark fully reflects these weaknesses. We
present MARS-Bench, a Multi-turn Athletic Real-world Scenario Dialogue
Benchmark, designed to remedy the gap. MARS-Bench is constructed from
play-by-play text commentary so as to feature realistic dialogues specifically
designed to evaluate three critical
aspects of multi-turn conversations: Ultra Multi-turn, Interactive Multi-turn,
and Cross-turn Tasks. Extensive experiments on MARS-Bench also reveal that
closed-source LLMs significantly outperform open-source alternatives, explicit
reasoning significantly boosts LLMs’ robustness on handling long complex
dialogue sessions, and LLMs indeed face significant challenges when handling
motivation transfer and sophisticated cross-turn dependency. Moreover, based
on an attention visualization experiment with Qwen2.5-7B-Instruction, we
provide a mechanistic interpretability analysis of how attention sinks caused
by special tokens degrade LLMs’ performance on long, complex dialogue
sessions.
[COMMENTS]
29 pages, 13 figures, Accepted to EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2505.23810v2
[DATE]
2025-09-15 10:12:59+08:00
[CATEGORIES]
cs.CL
LVLMs are Bad at Overhearing Human Referential Communication
[AUTHORS]
Zhengxiang Wang, Weiling Li, Panagiotis Kaliosis, Owen Rambow, Susan E. Brennan
[COMMENTS]
EMNLP 2025 (Main)
[LINK]
http://arxiv.org/abs/2509.11514v1
[DATE]
2025-09-15 10:03:18+08:00
[CATEGORIES]
cs.CL
Unsupervised Candidate Ranking for Lexical Substitution via Holistic Sentence Semantics
[AUTHORS]
Zhongyang Hu, Naijie Gu, Xiangzhi Tao, Tianhui Gu, Yibing Zhou
[ABSTRACT]
A key subtask in lexical substitution is ranking the given candidate words. A
common approach is to replace the target word with a candidate in the original
sentence and feed the modified sentence into a model to capture semantic
differences before and after substitution. However, effectively modeling the
bidirectional influence of candidate substitution on both the target word and
its context remains challenging. Existing methods often focus solely on
semantic changes at the target position or rely on parameter tuning over
multiple evaluation metrics, making it difficult to accurately characterize
semantic variation. To address this, we investigate two approaches: one based
on attention weights and another leveraging the more interpretable integrated
gradients method, both designed to measure the influence of context tokens on
the target token and to rank candidates by incorporating semantic similarity
between the original and substituted sentences. Experiments on the LS07 and
SWORDS datasets demonstrate that both approaches improve ranking performance.
[LINK]
http://arxiv.org/abs/2509.11513v1
[DATE]
2025-09-15 09:57:09+08:00
[CATEGORIES]
cs.CL
DeDisCo at the DISRPT 2025 Shared Task: A System for Discourse Relation Classification
[AUTHORS]
Zhuoxuan Ju, Jingni Wu, Abhishek Purushothama, Amir Zeldes
[COMMENTS]
System submission for the DISRPT 2025 Shared Task on Discourse
Relation Parsing and Treebanking, in conjunction with CODI-CRAC & EMNLP 2025.
1st place in Task 3: relation classification
[LINK]
http://arxiv.org/abs/2509.11498v1
[DATE]
2025-09-15 09:25:37+08:00
[CATEGORIES]
cs.CL
AKCIT-FN at CheckThat! 2025: Switching Fine-Tuned SLMs and LLM Prompting for Multilingual Claim Normalization
[AUTHORS]
Fabrycio Leite Nakano Almada, Kauan Divino Pouso Mariano, Maykon Adriell Dutra, Victor Emanuel da Silva Monteiro, Juliana Resplande Sant’Anna Gomes, Arlindo Rodrigues Galvão Filho, Anderson da Silva Soares
[ABSTRACT]
Claim normalization, the transformation of informal social media posts into
concise, self-contained statements, is a crucial step in automated
fact-checking pipelines. This paper details our submission to the CLEF-2025
CheckThat! Task 2, which challenges systems to perform claim normalization
across twenty languages, divided into thirteen supervised (high-resource) and
seven zero-shot (no training data) tracks.
Our approach, leveraging fine-tuned Small Language Models (SLMs) for
supervised languages and Large Language Model (LLM) prompting for zero-shot
scenarios, achieved podium positions (top three) in fifteen of the twenty
languages. Notably, this included second-place rankings in eight languages,
five of which were among the seven designated zero-shot languages, underscoring
the effectiveness of our LLM-based zero-shot strategy. For Portuguese, our
initial development language, our system achieved an average METEOR score of
0.5290, ranking third. All implementation artifacts, including inference,
training, evaluation scripts, and prompt configurations, are publicly available
at https://github.com/ju-resplande/checkthat2025_normalization.
[COMMENTS]
15 pages, 2 figures
[LINK]
http://arxiv.org/abs/2509.11496v1
[DATE]
2025-09-15 09:19:49+08:00
[CATEGORIES]
cs.CL
LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models
[AUTHORS]
Kang He, Kaushik Roy
[ABSTRACT]
Large language models (LLMs) have achieved remarkable multi-step reasoning
capabilities across various domains. However, LLMs still face distinct
challenges in complex logical reasoning, as (1) proof-finding requires
systematic exploration and the maintenance of logical coherence and (2)
searching for the right combination of premises at each reasoning step is
inherently challenging in tasks with a large premise space. To address this, we
propose LogicTree, an inference-time modular framework employing
algorithm-guided search to automate structured proof exploration and ensure
logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate
a caching mechanism into LogicTree to enable effective utilization of historical
knowledge, preventing reasoning stagnation and minimizing redundancy.
Furthermore, we address the combinatorial complexity of premise search by
decomposing it into a linear process. The refined premise selection restricts
subsequent inference to at most one derivation per step, enhancing reasoning
granularity and enforcing strict step-by-step reasoning. Additionally, we
introduce two LLM-free heuristics for premise prioritization, enabling
strategic proof search. Experimental results on five datasets demonstrate that
LogicTree optimally scales inference-time computation to achieve higher proof
accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6%
and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o
outperforms o3-mini by 7.6% on average.
[COMMENTS]
EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2504.14089v2
[DATE]
2025-09-15 09:15:50+08:00
[CATEGORIES]
cs.CL
cs.LG
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
[AUTHORS]
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
[ABSTRACT]
Scaling laws predict that the performance of large language models improves
with increasing model size and data size. In practice, pre-training has been
relying on massive web crawls, using almost all data sources publicly available
on the internet so far. However, this pool of natural data does not grow at the
same rate as the compute supply. Furthermore, the availability of high-quality
texts is even more limited: data filtering pipelines often remove up to 99% of
the initial web scrapes to achieve state-of-the-art performance. To address the “data wall”
of pre-training scaling, our work explores ways to transform and recycle data
discarded in existing filtering processes. We propose REWIRE, REcycling the Web
with guIded REwrite, a method to enrich low-quality documents so that they
could become useful for training. This in turn allows us to increase the
representation of synthetic data in the final pre-training set. Experiments at
1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw
texts and our rewritten texts leads to improvements of 1.0, 1.3 and 2.5
percentage points, respectively, across 22 diverse tasks, compared to training on only
filtered web data. Training on the raw-synthetic data mix is also more
effective than having access to 2x web data. Through further analysis, we
demonstrate that about 82% of the mixed-in texts come from transforming
lower-quality documents that would otherwise be discarded. REWIRE also
outperforms related approaches of generating synthetic data, including
Wikipedia-style paraphrasing, question-answer synthesizing and knowledge
extraction. These results suggest that recycling web texts holds the potential
for being a simple and effective approach for scaling pre-training data. We
make our high-quality synthetic data publicly available at
https://huggingface.co/datasets/facebook/recycling_the_web.
[COMMENTS]
Accepted to COLM 2025
[LINK]
http://arxiv.org/abs/2506.04689v3
[DATE]
2025-09-15 08:40:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Understanding the Uncertainty of LLM Explanations: A Perspective Based on Reasoning Topology
[AUTHORS]
Longchao Da, Xiaoou Liu, Jiaxin Dai, Lu Cheng, Yaqing Wang, Hua Wei
[ABSTRACT]
Understanding the uncertainty in large language model (LLM) explanations is
important for evaluating their faithfulness and reasoning consistency, and thus
provides insights into the reliability of an LLM’s output regarding a question. In
this work, we propose a novel framework that quantifies uncertainty in LLM
explanations through a reasoning topology perspective. By designing a
structural elicitation strategy, we guide the LLMs to frame the explanations of
an answer into a graph topology. This process decomposes the explanations into
the knowledge-related sub-questions and topology-based reasoning structures,
which allows us to quantify uncertainty not only at the semantic level but also
from the reasoning path. It also makes it easier to assess knowledge
redundancy and provides interpretable insights into the reasoning process. Our
method offers a systematic way to interpret the LLM reasoning, analyze
limitations, and provide guidance for enhancing robustness and faithfulness.
This work pioneers the use of graph-structured uncertainty measurement in LLM
explanations and demonstrates the potential of topology-based quantification.
[COMMENTS]
28 pages, 9 figures; accepted at COLM ’25
[LINK]
http://arxiv.org/abs/2502.17026v2
[DATE]
2025-09-15 08:12:59+08:00
[CATEGORIES]
cs.CL
Improving LLMs’ Learning for Coreference Resolution
[AUTHORS]
Yujian Gan, Yuan Liang, Yanni Lin, Juntao Yu, Massimo Poesio
[ABSTRACT]
Coreference Resolution (CR) is crucial for many NLP tasks, but existing LLMs
struggle with hallucination and under-performance. In this paper, we
investigate the limitations of existing LLM-based approaches to CR, specifically
the Question-Answering (QA) Template and Document Template methods, and propose
two novel techniques: Reversed Training with Joint Inference and Iterative
Document Generation. Our experiments show that Reversed Training improves the
QA Template method, while Iterative Document Generation eliminates
hallucinations in the generated source text and boosts coreference resolution.
Integrating these methods and techniques offers an effective and robust
solution to LLM-based coreference resolution.
[LINK]
http://arxiv.org/abs/2509.11466v1
[DATE]
2025-09-15 07:08:35+08:00
[CATEGORIES]
cs.CL
CEMTM: Contextual Embedding-based Multimodal Topic Modeling
[AUTHORS]
Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.11465v1
[DATE]
2025-09-15 07:07:46+08:00
[CATEGORIES]
cs.CL
cs.LG
Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions
[AUTHORS]
Nannan Huang, Haytham M. Fayek, Xiuzhen Zhang
[ABSTRACT]
Model compression through post-training pruning offers a way to reduce model
size and computational requirements without significantly impacting model
performance. However, the effect of pruning on the fairness of LLM-generated
summaries remains unexplored, particularly for opinion summarisation where
biased outputs could influence public views. In this paper, we present a
comprehensive empirical analysis of opinion summarisation, examining three
state-of-the-art pruning methods and various calibration sets across three
open-source LLMs using four fairness metrics. Our systematic analysis reveals
that pruning methods have a greater impact on fairness than calibration sets.
Building on these insights, we propose High Gradient Low Activation (HGLA)
pruning, which identifies and removes parameters that are redundant for input
processing but influential in output generation. Our experiments demonstrate
that HGLA can better maintain or even improve fairness compared to existing
methods, showing promise across models and tasks where traditional methods have
limitations. Our human evaluation shows HGLA-generated outputs are fairer than
those of existing state-of-the-art pruning methods. Code is available at:
https://github.com/amberhuang01/HGLA.
[COMMENTS]
Accepted to EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2508.17610v3
[DATE]
2025-09-15 07:02:43+08:00
[CATEGORIES]
cs.CL
The Diffusion Duality
[AUTHORS]
Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
[ABSTRACT]
Uniform-state discrete diffusion models hold the promise of fast text
generation due to their inherent ability to self-correct. However, they are
typically outperformed by autoregressive models and masked diffusion models. In
this work, we narrow this performance gap by leveraging a key insight:
Uniform-state diffusion processes naturally emerge from an underlying Gaussian
diffusion. Our method, Duo, transfers powerful techniques from Gaussian
diffusion to improve both training and sampling. First, we introduce a
curriculum learning strategy guided by the Gaussian process, doubling training
speed by reducing variance. Models trained with curriculum learning surpass
autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we
present Discrete Consistency Distillation, which adapts consistency
distillation from the continuous to the discrete setting. This algorithm
unlocks few-step generation in diffusion language models by accelerating
sampling by two orders of magnitude. We provide the code and model checkpoints
on the project page: http://s-sahoo.github.io/duo
[COMMENTS]
ICML 2025. We provide the code at: https://github.com/s-sahoo/duo
[v2]: Camera ready revisions
[LINK]
http://arxiv.org/abs/2506.10892v2
[DATE]
2025-09-15 06:07:45+08:00
[CATEGORIES]
cs.LG
cs.CL
Artificial intelligence contribution to translation industry: looking back and forward
[AUTHORS]
Mohammed Q. Shormani
[ABSTRACT]
This study provides a comprehensive analysis of artificial intelligence (AI)
contribution to research in the translation industry (ACTI), synthesizing it
over forty-five years from 1980-2024. 13220 articles were retrieved from three
sources, namely WoS, Scopus, and Lens; 9836 were unique records, which were
used for the analysis. I provided two types of analysis, viz., scientometric
and thematic, focusing on Cluster, Subject categories, Keywords, Bursts,
Centrality and Research Centers for the former. For the latter, I provided a
thematic review for 18 articles, selected purposefully from the articles
involved, centering on purpose, approach, findings, and contribution to ACTI
future directions. This study is significant for its valuable contribution to
ACTI knowledge production over 45 years, emphasizing several trending issues
and hotspots including Machine translation, Statistical machine translation,
Low-resource language, Large language model, Arabic dialects, Translation
quality, and Neural machine translation. The findings reveal that the more AI
develops, the more it contributes to the translation industry, as Neural Networking
Algorithms have been incorporated and Deep Language Learning Models like
ChatGPT have been launched. However, much rigorous research is still needed to
overcome several problems facing the translation industry, specifically
concerning low-resource, multi-dialectal and free word order languages, and
cultural and religious registers.
[COMMENTS]
30 pages, 13 figures
[LINK]
http://arxiv.org/abs/2411.19855v3
[DATE]
2025-09-15 06:01:33+08:00
[CATEGORIES]
cs.CL
Learning to Optimize Multi-Objective Alignment Through Dynamic Reward Weighting
[AUTHORS]
Yining Lu, Zilong Wang, Shiyang Li, Xin Liu, Changlong Yu, Qingyu Yin, Zhan Shi, Zixuan Zhang, Meng Jiang
[ABSTRACT]
Prior works in multi-objective reinforcement learning typically use linear
reward scalarization with fixed weights, which provably fails to capture
non-convex Pareto fronts and thus yields suboptimal results. This limitation
becomes especially critical in online preference alignment for large language
models. Here, stochastic trajectories generated by parameterized policies
create highly non-linear and non-convex mappings from parameters to objectives
for which no single static weighting scheme can find optimal trade-offs. We address
this limitation by introducing dynamic reward weighting, which adaptively
adjusts reward weights during the online reinforcement learning process. Unlike
existing approaches that rely on fixed-weight interpolation, our dynamic
weighting continuously balances and prioritizes objectives in training,
facilitating effective exploration of Pareto fronts in objective space. We
introduce two approaches of increasing sophistication and generalizability: (1)
hypervolume-guided weight adaptation and (2) gradient-based weight
optimization, offering a versatile toolkit for online multi-objective
alignment. Our extensive experiments demonstrate their compatibility with
commonly used online reinforcement learning algorithms (including GRPO,
REINFORCE, and RLOO), effectiveness across multiple mathematical reasoning
datasets, and applicability to different model families, consistently achieving
Pareto dominant solutions with fewer training steps than fixed-weight linear
scalarization baselines.
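
The claim that fixed-weight linear scalarization misses non-convex Pareto fronts can be checked on a toy example. The sketch below is not the authors' method: the three reward vectors and the helper names are hypothetical, chosen only to show that a Pareto-optimal point lying in a dent of the front is never the maximizer of any fixed-weight sum.

```python
def best_under_weights(points, weights):
    """Return the reward vector that maximizes the fixed-weight sum w . r."""
    return max(points, key=lambda p: sum(w * r for w, r in zip(weights, p)))


def reachable_by_fixed_weights(points, steps=101):
    """Collect every point picked by some weight pair (w, 1 - w) on a grid."""
    found = set()
    for i in range(steps):
        w = i / (steps - 1)
        found.add(best_under_weights(points, (w, 1.0 - w)))
    return found


# Hypothetical two-objective rewards. (0.4, 0.4) is Pareto-optimal (nothing
# dominates it) but sits in a dent of the front, below the line joining the
# other two points.
front = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]
picked = reachable_by_fixed_weights(front)
```

Here `picked` contains only the two extreme points: for every weight pair, one of them scores at least 0.5 while the balanced point scores 0.4. Reaching the balanced trade-off requires something other than a single static weight vector, which is the gap the abstract's dynamic weighting targets.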
[LINK]
http://arxiv.org/abs/2509.11452v1
[DATE]
2025-09-15 05:56:35+08:00
[CATEGORIES]
cs.LG
cs.CL
CognitiveSky: Scalable Sentiment and Narrative Analysis for Decentralized Social Media
[AUTHORS]
Gaurab Chhetri, Anandi Dutta, Subasish Das
[ABSTRACT]
The emergence of decentralized social media platforms presents new
opportunities and challenges for real-time analysis of public discourse. This
study introduces CognitiveSky, an open-source and scalable framework designed
for sentiment, emotion, and narrative analysis on Bluesky, a federated Twitter
or X.com alternative. By ingesting data through Bluesky’s Application
Programming Interface (API), CognitiveSky applies transformer-based models to
annotate large-scale user-generated content and produces structured and
analyzable outputs. These summaries drive a dynamic dashboard that visualizes
evolving patterns in emotion, activity, and conversation topics. Built entirely
on free-tier infrastructure, CognitiveSky achieves both low operational cost
and high accessibility. While demonstrated here for monitoring mental health
discourse, its modular design enables applications across domains such as
disinformation detection, crisis response, and civic sentiment analysis. By
bridging large language models with decentralized networks, CognitiveSky offers
a transparent, extensible tool for computational social science in an era of
shifting digital ecosystems.
[COMMENTS]
This is the author’s preprint version of a paper accepted for
presentation at HICSS 59 (Hawaii International Conference on System
Sciences), 2026, Hawaii, USA. The final published version will appear in the
official conference proceedings. Conference site: https://hicss.hawaii.edu/
[LINK]
http://arxiv.org/abs/2509.11444v1
[DATE]
2025-09-15 05:37:24+08:00
[CATEGORIES]
cs.CL
A Transformer-Based Cross-Platform Analysis of Public Discourse on the 15-Minute City Paradigm
[AUTHORS]
Gaurab Chhetri, Darrell Anderson, Boniphace Kutela, Subasish Das
[ABSTRACT]
This study presents the first multi-platform sentiment analysis of public
opinion on the 15-minute city concept across Twitter, Reddit, and news media.
Using compressed transformer models and Llama-3-8B for annotation, we classify
sentiment across heterogeneous text domains. Our pipeline handles long-form and
short-form text, supports consistent annotation, and enables reproducible
evaluation. We benchmark five models (DistilRoBERTa, DistilBERT, MiniLM,
ELECTRA, TinyBERT) using stratified 5-fold cross-validation, reporting
F1-score, AUC, and training time. DistilRoBERTa achieved the highest F1
(0.8292), TinyBERT the best efficiency, and MiniLM the best cross-platform
consistency. Results show News data yields inflated performance due to class
imbalance, Reddit suffers from summarization loss, and Twitter offers moderate
challenge. Compressed models perform competitively, challenging assumptions
that larger models are necessary. We identify platform-specific trade-offs and
propose directions for scalable, real-world sentiment classification in urban
planning discourse.
[COMMENTS]
This is the author’s preprint version of a paper accepted for
presentation at the 24th International Conference on Machine Learning and
Applications (ICMLA 2025), December 3-5, 2025, Florida, USA. The final
published version will appear in the official IEEE proceedings. Conference
site: https://www.icmla-conference.org/icmla25/
[LINK]
http://arxiv.org/abs/2509.11443v1
[DATE]
2025-09-15 05:36:24+08:00
[CATEGORIES]
cs.CL
STRICT: Stress Test of Rendering Images Containing Text
[AUTHORS]
Tianyu Zhang, Xinyu Wang, Lu Li, Zhenghan Tai, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
[ABSTRACT]
While diffusion models have revolutionized text-to-image generation with
their ability to synthesize realistic and diverse scenes, they continue to
struggle to generate consistent and legible text within images. This
shortcoming is commonly attributed to the locality bias inherent in
diffusion-based generation, which limits their ability to model long-range
spatial dependencies. In this paper, we introduce $\textbf{STRICT}$, a
benchmark designed to systematically stress-test the ability of diffusion
models to render coherent and instruction-aligned text in images. Our benchmark
evaluates models across multiple dimensions: (1) the maximum length of readable
text that can be generated; (2) the correctness and legibility of the generated
text; and (3) the rate at which models fail to follow text-generation instructions. We
evaluate several state-of-the-art models, including proprietary and open-source
variants, and reveal persistent limitations in long-range consistency and
instruction-following capabilities. Our findings provide insights into
architectural bottlenecks and motivate future research directions in multimodal
generative modeling. We release our entire evaluation pipeline at
https://github.com/tianyu-z/STRICT-Bench.
[COMMENTS]
Accepted as a main conference paper at EMNLP 2025
[LINK]
http://arxiv.org/abs/2505.18985v2
[DATE]
2025-09-15 05:30:57+08:00
[CATEGORIES]
cs.LG
cs.CL
FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs
[AUTHORS]
Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman
[ABSTRACT]
Speech tokenization enables discrete representation and facilitates speech
language modeling. However, existing neural codecs capture low-level acoustic
features, overlooking the semantic and contextual cues inherent to human
speech. While recent efforts introduced semantic representations from
self-supervised speech models or incorporated contextual representations from
pre-trained language models, challenges remain in aligning and unifying the
semantic and contextual representations. We introduce FuseCodec, which unifies
acoustic, semantic, and contextual representations through strong cross-modal
alignment and globally informed supervision. We propose three complementary
techniques: (i) Latent Representation Fusion, integrating semantic and
contextual features directly into the encoder latent space for robust and
unified representation learning; (ii) Global Semantic-Contextual Supervision,
supervising discrete tokens with globally pooled and broadcasted
representations to enhance temporal consistency and cross-modal alignment; and
(iii) Temporally Aligned Contextual Supervision, strengthening alignment by
dynamically matching contextual and speech tokens within a local window for
fine-grained token-level supervision. We further introduce FuseCodec-TTS,
demonstrating our methodology’s applicability to zero-shot speech synthesis.
Empirically, FuseCodec achieves state-of-the-art performance in LibriSpeech,
surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy,
perceptual quality, intelligibility, and speaker similarity. Results highlight
the effectiveness of contextually and semantically guided tokenization for
speech tokenization and downstream tasks. Code and pretrained models are
available at https://github.com/mubtasimahasan/FuseCodec.
[LINK]
http://arxiv.org/abs/2509.11425v1
[DATE]
2025-09-15 04:35:36+08:00
[CATEGORIES]
cs.CL
Trading-R1: Financial Trading with LLM Reasoning via Reinforcement Learning
[AUTHORS]
Yijia Xiao, Edward Sun, Tong Chen, Fang Wu, Di Luo, Wei Wang
[ABSTRACT]
Developing professional, structured reasoning on par with human financial
analysts and traders remains a central challenge in AI for finance, where
markets demand interpretability and trust. Traditional time-series models lack
explainability, while LLMs face challenges in turning natural-language analysis
into disciplined, executable trades. Although reasoning LLMs have advanced in
step-by-step planning and verification, their application to risk-sensitive
financial decisions is underexplored. We present Trading-R1, a
financially-aware model that incorporates strategic thinking and planning for
comprehensive thesis composition, facts-grounded analysis, and
volatility-adjusted decision making. Trading-R1 aligns reasoning with trading
principles through supervised fine-tuning and reinforcement learning with a
three-stage easy-to-hard curriculum. Training uses Tauric-TR1-DB, a 100k-sample
corpus spanning 18 months, 14 equities, and five heterogeneous financial data
sources. Evaluated on six major equities and ETFs, Trading-R1 demonstrates
improved risk-adjusted returns and lower drawdowns compared to both open-source
and proprietary instruction-following models as well as reasoning models. The
system generates structured, evidence-based investment theses that support
disciplined and interpretable trading decisions. Trading-R1 Terminal will be
released at https://github.com/TauricResearch/Trading-R1.
[COMMENTS]
Tauric Research: https://github.com/TauricResearch
[LINK]
http://arxiv.org/abs/2509.11420v1
[DATE]
2025-09-15 04:13:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Continually Adding New Languages to Multilingual Language Models
[AUTHORS]
Abraham Toluwase Owodunni, Sachin Kumar
[ABSTRACT]
Multilingual language models are trained on a fixed set of languages, and to
support new languages, the models need to be retrained from scratch. This is an
expensive endeavor and is often infeasible, as model developers tend not to
release their pre-training data. Naive approaches, such as continued
pretraining, suffer from catastrophic forgetting; however, mitigation
strategies like experience replay cannot be applied due to the lack of original
pretraining data. In this work, we investigate the problem of continually
adding new languages to a multilingual model, assuming access to pretraining
data in only the target languages. We explore multiple approaches to address
this problem and propose Layer-Selective LoRA (LayRA), which adds Low-Rank
Adapters (LoRA) to selected initial and final layers while keeping the rest of
the model frozen. LayRA builds on two insights: (1) LoRA reduces forgetting,
and (2) multilingual models encode inputs in the source language in the initial
layers, reason in English in intermediate layers, and translate back to the
source language in final layers. We experiment with adding multiple
combinations of Galician, Swahili, and Urdu to pretrained language models and
evaluate each method on diverse multilingual tasks. We find that LayRA provides
the overall best tradeoff, preserving models’ capabilities in previously
supported languages while being competitive with existing approaches such as
LoRA in learning new languages. We also demonstrate that using model
arithmetic, the adapted models can be equipped with strong instruction
following abilities without access to any instruction tuning data in the target
languages.
[LINK]
http://arxiv.org/abs/2509.11414v1
[DATE]
2025-09-15 04:08:15+08:00
[CATEGORIES]
cs.CL
IOLBENCH: Benchmarking LLMs on Linguistic Reasoning
[AUTHORS]
Satyam Goyal, Soham Dan
[ABSTRACT]
Despite the remarkable advancements and widespread applications of deep
neural networks, their ability to perform reasoning tasks remains limited,
particularly in domains requiring structured, abstract thought. In this paper,
we investigate the linguistic reasoning capabilities of state-of-the-art large
language models (LLMs) by introducing IOLBENCH, a novel benchmark derived from
International Linguistics Olympiad (IOL) problems. This dataset encompasses
diverse problems testing syntax, morphology, phonology, and semantics, all
carefully designed to be self-contained and independent of external knowledge.
These tasks challenge models to engage in metacognitive linguistic reasoning,
requiring the deduction of linguistic rules and patterns from minimal examples.
Through extensive benchmarking of leading LLMs, we find that even the most
advanced models struggle to handle the intricacies of linguistic complexity,
particularly in areas demanding compositional generalization and rule
abstraction. Our analysis highlights both the strengths and persistent
limitations of current models in linguistic problem-solving, offering valuable
insights into their reasoning capabilities. By introducing IOLBENCH, we aim to
foster further research into developing models capable of human-like reasoning,
with broader implications for the fields of computational linguistics and
artificial intelligence.
[LINK]
http://arxiv.org/abs/2501.04249v2
[DATE]
2025-09-15 02:43:41+08:00
[CATEGORIES]
cs.CL
Transformer Enhanced Relation Classification: A Comparative Analysis of Contextuality, Data Efficiency and Sequence Complexity
[AUTHORS]
Bowen Jing, Yang Cui, Tianpeng Huang
[ABSTRACT]
In the era of large language models, relation extraction (RE) plays an
important role in information extraction through the transformation of
unstructured raw text into structured data (Wadhwa et al., 2023). In this
paper, we systematically compare the performance of deep supervised learning
approaches without transformers and those with transformers. We used a series
of non-transformer architectures such as PA-LSTM (Zhang et al., 2017),
C-GCN (Zhang et al., 2018), and AGGCN (attention-guided GCN) (Guo et al., 2019),
and a series of transformer architectures such as BERT, RoBERTa, and R-BERT (Wu
and He, 2019). Our comparison included traditional metrics like micro F1, as
well as evaluations in different scenarios, varying sentence lengths, and
different percentages of the dataset for training. Our experiments were
conducted on TACRED, TACREV, and RE-TACRED. The results show that
transformer-based models outperform non-transformer models, achieving micro F1
scores of 80-90% compared to 64-67% for non-transformer models. Additionally,
we briefly review the research journey in supervised relation classification
and discuss the role and current status of large language models (LLMs) in
relation extraction.
[LINK]
http://arxiv.org/abs/2509.11374v1
[DATE]
2025-09-15 02:11:31+08:00
[CATEGORIES]
cs.CL
!MSA at AraHealthQA 2025 Shared Task: Enhancing LLM Performance for Arabic Clinical Question Answering through Prompt Engineering and Ensemble Learning
[AUTHORS]
Mohamed Tarek, Seif Ahmed, Mohamed Basem
[COMMENTS]
8 pages, ArabicNLP 2025, co-located with EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.11365v1
[DATE]
2025-09-15 01:39:58+08:00
[CATEGORIES]
cs.CL
ConvSearch-R1: Enhancing Query Reformulation for Conversational Search with Reasoning via Reinforcement Learning
[AUTHORS]
Changtai Zhu, Siyin Wang, Ruijun Feng, Kai Song, Xipeng Qiu
[ABSTRACT]
Conversational search systems require effective handling of context-dependent
queries that often contain ambiguity, omission, and coreference. Conversational
Query Reformulation (CQR) addresses this challenge by transforming these
queries into self-contained forms suitable for off-the-shelf retrievers.
However, existing CQR approaches suffer from two critical constraints: high
dependency on costly external supervision from human annotations or large
language models, and insufficient alignment between the rewriting model and
downstream retrievers. We present ConvSearch-R1, the first self-driven
framework that completely eliminates dependency on external rewrite supervision
by leveraging reinforcement learning to optimize reformulation directly through
retrieval signals. Our novel two-stage approach combines Self-Driven Policy
Warm-Up to address the cold-start problem through retrieval-guided
self-distillation, followed by Retrieval-Guided Reinforcement Learning with a
specially designed rank-incentive reward shaping mechanism that addresses the
sparsity issue in conventional retrieval metrics. Extensive experiments on
TopiOCQA and QReCC datasets demonstrate that ConvSearch-R1 significantly
outperforms previous state-of-the-art methods, achieving over 10% improvement
on the challenging TopiOCQA dataset while using smaller 3B parameter models
without any external supervision.
[COMMENTS]
Accepted by EMNLP 2025 at the Main Conference
[LINK]
http://arxiv.org/abs/2505.15776v2
[DATE]
2025-09-15 00:49:40+08:00
[CATEGORIES]
cs.CL
Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design
[AUTHORS]
Yunze Xiao, Lynnette Hui Xian Ng, Jiarui Liu, Mona T. Diab
[COMMENTS]
Accepted in EMNLP main proceedings; Updated citations
[LINK]
http://arxiv.org/abs/2508.17573v2
[DATE]
2025-09-15 00:28:31+08:00
[CATEGORIES]
cs.CL
Foundational theory for optimal decision tree problems. II. Optimal hypersurface decision tree algorithm
[AUTHORS]
Xi He
[ABSTRACT]
Decision trees are a ubiquitous model for classification and regression tasks
due to their interpretability and efficiency. However, solving the optimal
decision tree (ODT) problem remains a challenging combinatorial optimization
task. Even for the simplest splitting rules, axis-parallel hyperplanes, it is
NP-hard to optimize. In Part I of this series, we rigorously defined the proper
decision tree model through four axioms and, based on these, introduced four
formal definitions of the ODT problem. From these definitions, we derived four
generic algorithms capable of solving ODT problems for arbitrary decision trees
satisfying the axioms. We also analyzed the combinatorial geometric properties
of hypersurfaces, showing that decision trees defined by polynomial
hypersurface splitting rules satisfy the proper axioms that we proposed.
In this second paper (Part II) of this two-part series, building on the
algorithmic and geometric foundations established in Part I, we introduce the
first hypersurface decision tree (HODT) algorithm. To the best of our
knowledge, existing optimal decision tree methods are, to date, limited to
hyperplane splitting rules (a special case of hypersurfaces) and rely on
general-purpose solvers. In contrast, our HODT algorithm addresses the general
hypersurface decision tree model without requiring external solvers.
Using synthetic datasets generated from ground-truth hyperplane decision
trees, we vary tree size, data size, dimensionality, and label and feature
noise. Results show that our algorithm recovers the ground truth more
accurately than axis-parallel trees and exhibits greater robustness to noise.
We also analyzed generalization performance across 30 real-world datasets,
showing that HODT can achieve up to 30% higher accuracy than the
state-of-the-art optimal axis-parallel decision tree algorithm when tree
complexity is properly controlled.
[LINK]
http://arxiv.org/abs/2509.12057v1
[DATE]
2025-09-15 23:38:44+08:00
[CATEGORIES]
cs.LG
LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications
[AUTHORS]
Yujun Lin, Zhekai Zhang, Song Han
[ABSTRACT]
Modern tensor applications, especially foundation models and generative AI
applications, require multiple input modalities (both vision and language),
which increases the demand for flexible accelerator architecture. Existing
frameworks suffer from the trade-off between design flexibility and
productivity of RTL generation: they are either limited to very few hand-written
templates or cannot automatically generate the RTL. To address this challenge,
we propose the LEGO framework, which targets tensor applications and
automatically generates spatial architecture design and outputs synthesizable
RTL code without handwritten RTL design templates. Leveraging the
affine-transformation-based architecture representation, LEGO front end finds
interconnections between function units, synthesizes the memory system, and
fuses different spatial dataflow designs based on data reuse analysis. LEGO
back end then translates the hardware in a primitive-level graph to perform
lower-level optimizations, and applies a set of linear-programming algorithms
to optimally insert pipeline registers and reduce the overhead of unused logic
when switching spatial dataflows. Our evaluation demonstrates that LEGO can
achieve 3.2x speedup and 2.4x energy efficiency compared to previous work
Gemmini, and can generate one architecture for diverse modern foundation models
in generative AI applications.
[COMMENTS]
The first two authors have equal contributions; Published as a
conference paper in HPCA 2025; 13 pages, 14 figures
[LINK]
http://arxiv.org/abs/2509.12053v1
[DATE]
2025-09-15 23:36:18+08:00
[CATEGORIES]
cs.LG
Kolb-Based Experiential Learning for Generalist Agents with Human-Level Kaggle Data Science Performance
[AUTHORS]
Antoine Grosnit, Alexandre Maraval, Refinath S N, Zichao Zhao, James Doran, Giuseppe Paolo, Albert Thomas, Jonas Gonzalez, Abhineet Kumar, Khyati Khandelwal, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balázs Kégl, Haitham Bou-Ammar, Jun Wang
[ABSTRACT]
Human expertise emerges through iterative cycles of interaction, reflection,
and internal model updating, which are central to cognitive theories such as
Kolb’s experiential learning and Vygotsky’s zone of proximal development. In
contrast, current AI systems, particularly LLM agents, rely on static
pre-training or rigid workflows, lacking mechanisms for continual adaptation.
Recent studies identified early cognitive traits in LLM agents (reflection,
revision, and self-correction) suggesting foundational elements of human-like
experiential learning. Thus the key question: Can we design LLM agents capable
of structured, cognitively grounded learning similar to human processes? In
response, we propose a computational framework of Kolb’s learning cycle with
Vygotsky’s ZPD for autonomous agents. Our architecture separates extrinsic
(environment interaction) and intrinsic (internal reflection/abstraction)
functions, enabling cognitively grounded scaffolded learning, where the agent
initially learns within structured environments, followed by open-ended
generalisation. This approach empowers agents to master complex tasks and domains
that traditional fine-tuning or simple reflective methods could not tackle
effectively. Its potential is powerfully demonstrated via direct comparison
with humans in real-world Kaggle data science competitions. Learning fully
automated data science code generation across 81 tasks, our system, Agent K,
demonstrated the ability to perform the entire workflow autonomously, achieving
an Elo-MMR score of 1694, beyond the median score of the Kaggle Masters (the top
2% among 200,000 users) of our study. With medal-level performance of 9 gold, 8
silver, and 12 bronze, including 4 gold and 4 silver on prize-awarding
competitions, Agent K is the first AI system to successfully integrate Kolb- and
Vygotsky-inspired human cognitive learning, marking a major step toward
generalist AI.
[LINK]
http://arxiv.org/abs/2411.03562v3
[DATE]
2025-09-15 23:34:58+08:00
[CATEGORIES]
cs.LG
Hi-DARTS: Hierarchical Dynamically Adapting Reinforcement Trading System
[AUTHORS]
Hoon Sagong, Heesu Kim, Hanbeen Hong
[ABSTRACT]
Conventional autonomous trading systems struggle to balance computational
efficiency and market responsiveness due to their fixed operating frequency. We
propose Hi-DARTS, a hierarchical multi-agent reinforcement learning framework
that addresses this trade-off. Hi-DARTS utilizes a meta-agent to analyze market
volatility and dynamically activate specialized Time Frame Agents for
high-frequency or low-frequency trading as needed. During back-testing on AAPL
stock from January 2024 to May 2025, Hi-DARTS yielded a cumulative return of
25.17% with a Sharpe Ratio of 0.75. This performance surpasses standard
benchmarks, including a passive buy-and-hold strategy on AAPL (12.19% return)
and the S&P 500 ETF (SPY) (20.01% return). Our work demonstrates that dynamic,
hierarchical agents can achieve superior risk-adjusted returns while
maintaining high computational efficiency.
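
For readers comparing the reported figures, the two headline metrics can be computed from a series of per-period returns as sketched below. The abstract does not state the risk-free rate or the annualization convention used, so the defaults here (zero risk-free rate, 252 trading periods per year) are assumptions, and the function names are illustrative.

```python
import math
from statistics import mean, stdev


def cumulative_return(returns):
    """Compound per-period simple returns into one cumulative return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio: mean excess return over its sample std dev.

    risk_free is a per-period rate; both defaults are assumptions.
    """
    excess = [r - risk_free for r in returns]
    volatility = stdev(excess)
    if volatility == 0:
        raise ValueError("Sharpe ratio is undefined for zero volatility")
    return mean(excess) / volatility * math.sqrt(periods_per_year)
```

A higher cumulative return does not by itself imply a higher Sharpe ratio, since the latter divides by volatility; that is why the two numbers are reported side by side.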
[COMMENTS]
Accepted paper at International Conference on ICT Convergence 2025
[LINK]
http://arxiv.org/abs/2509.12048v1
[DATE]
2025-09-15 23:31:47+08:00
[CATEGORIES]
cs.LG
Travel Time and Weather-Aware Traffic Forecasting in a Conformal Graph Neural Network Framework
[AUTHORS]
Mayur Patil, Qadeer Ahmed, Shawn Midlam-Mohler
[ABSTRACT]
Traffic flow forecasting is essential for managing congestion, improving
safety, and optimizing various transportation systems. However, it remains a
prevailing challenge due to the stochastic nature of urban traffic and
environmental factors. Better predictions require models capable of
accommodating the traffic variability influenced by multiple dynamic and
complex interdependent factors. In this work, we propose a Graph Neural Network
(GNN) framework to address the stochasticity by leveraging adaptive adjacency
matrices using log-normal distributions and Coefficient of Variation (CV)
values to reflect real-world travel time variability. Additionally, weather
factors such as temperature, wind speed, and precipitation adjust edge weights
and enable GNN to capture evolving spatio-temporal dependencies across traffic
stations. This enhancement over the static adjacency matrix allows the model to
adapt effectively to traffic stochasticity and changing environmental
conditions. Furthermore, we utilize the Adaptive Conformal Prediction (ACP)
framework to provide reliable uncertainty quantification, achieving target
coverage while maintaining acceptable prediction intervals. Experimental
results demonstrate that the proposed model achieves better prediction accuracy
and uncertainty bounds than baseline methods. We then
validate this method by constructing traffic scenarios in SUMO and applying
Monte-Carlo simulation to derive a travel time distribution for a Vehicle Under
Test (VUT) to reflect real-world variability. The simulated mean travel time of
the VUT falls within the intervals defined by INRIX historical data, verifying
the model’s robustness.
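
The abstract does not spell out its ACP procedure. As a rough illustration of how adaptive conformal methods keep coverage near a target, the sketch below uses the widely known online update in which the working miscoverage level moves by `gamma * (alpha - err)` after each step. The absolute-residual score, the clipping of the quantile level, and all names are assumptions of this sketch, not details from the paper.

```python
import math


def empirical_quantile(scores, q):
    """Return the q-th empirical quantile of scores, with q clipped to [0, 1]."""
    ordered = sorted(scores)
    q = min(max(q, 0.0), 1.0)
    idx = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[idx]


def adaptive_conformal(preds, truths, calib_scores, alpha=0.1, gamma=0.05):
    """Build one (lower, upper) interval per step, adapting alpha online."""
    alpha_t = alpha
    scores = list(calib_scores)  # absolute residuals from a calibration set
    intervals = []
    for pred, y in zip(preds, truths):
        width = empirical_quantile(scores, 1.0 - alpha_t)
        lower, upper = pred - width, pred + width
        intervals.append((lower, upper))
        err = 0.0 if lower <= y <= upper else 1.0
        # A miss lowers alpha_t (wider intervals next); a hit raises it slightly.
        alpha_t += gamma * (alpha - err)
        scores.append(abs(y - pred))
    return intervals
```

After a missed observation the next interval widens, and after a run of covered observations it narrows, which is how the empirical coverage is steered toward `1 - alpha` under shifting conditions.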
[COMMENTS]
This manuscript has been accepted as a REGULAR PAPER in the
Transactions on Intelligent Transportation Systems 2025
[LINK]
http://arxiv.org/abs/2509.12043v1
[DATE]
2025-09-15 23:25:43+08:00
[CATEGORIES]
cs.LG
Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining
[AUTHORS]
Rupert Mitchell, Kristian Kersting
[ABSTRACT]
We present Multipole Semantic Attention (MuSe), an efficient approximation of
softmax attention that combines semantic clustering with multipole expansions
from computational physics. Our method addresses the quadratic computational
complexity of transformers in the context length by clustering queries and keys
separately in their learned representation spaces, enabling a hierarchical
two-stage attention mechanism. Unlike prior clustering approaches that group
only keys or use unified clustering, we maintain separate clusterings that
respect attention’s asymmetric treatment of these spaces. We augment
centroid-based (monopole) approximations with dipole corrections that capture
directional variance within clusters, preserving richer information during
training. The method operates as a drop-in replacement for standard attention,
requiring only hyperparameter specification without architectural
modifications. Our approach achieves $\mathcal{O}(NCD)$ complexity for acausal
attention with $C$ clusters and $\mathcal{O}(NCD \log N)$ for causal attention.
On isolated attention layers, we demonstrate a $3\times$ speedup over cuDNN Flash
Attention at 8k context length, with relative squared errors below 20%. For
causal attention, we develop a hierarchical block decomposition that combines
exact local computation with efficient long-range approximation. In end-to-end
pretraining of a 30M parameter model on book-length texts with 16k context, we
achieve 12.2% runtime reduction with only 0.36% loss degradation, establishing
the viability of multipole approximations for efficient transformer
pretraining.
[LINK]
http://arxiv.org/abs/2509.10406v2
[DATE]
2025-09-15 23:24:37+08:00
[CATEGORIES]
cs.LG
Decision-Theoretic Approaches for Improved Learning-Augmented Algorithms
[AUTHORS]
Spyros Angelopoulos, Christoph Dürr, Georgii Melidi
[ABSTRACT]
We initiate the systematic study of decision-theoretic metrics in the design
and analysis of algorithms with machine-learned predictions. We introduce
approaches based on both deterministic measures, such as distance-based
evaluation, that help us quantify how close the algorithm is to an ideal
solution, and stochastic measures that balance the trade-off between the
algorithm’s performance and the risk associated with the imperfect oracle.
These approaches allow us to quantify the algorithm’s performance across the
full spectrum of the prediction error, and thus choose the best algorithm
within an entire class of otherwise incomparable ones. We apply our framework
to three well-known problems from online decision making, namely ski-rental,
one-max search, and contract scheduling.
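
To make the ski-rental setting concrete, the sketch below compares the classical break-even strategy with a strategy that blindly trusts a predicted season length. It is a textbook illustration under assumed unit rental cost and purchase price `b`, not the decision-theoretic framework of the paper; the function names are hypothetical.

```python
def cost(buy_day, season_len, b):
    """Total cost when renting at 1 per day and buying at price b on buy_day."""
    if season_len < buy_day:
        return season_len          # the season ended before the purchase
    return (buy_day - 1) + b       # rented for buy_day - 1 days, then bought


def optimal_cost(season_len, b):
    """Offline optimum: rent throughout or buy on day one, whichever is cheaper."""
    return min(season_len, b)


def break_even_day(b):
    """Prediction-free strategy: buy on day b."""
    return b


def follow_prediction_day(predicted_len, b):
    """Trust the prediction: buy at once if it says the season is long."""
    return 1 if predicted_len >= b else float("inf")


def ratio(buy_day, season_len, b):
    """Competitive ratio of a strategy on one instance."""
    return cost(buy_day, season_len, b) / optimal_cost(season_len, b)
```

With `b = 10`, the break-even strategy never exceeds a ratio of 1.9, whatever the season length. The prediction-following strategy achieves ratio 1 when the prediction is right, but its ratio grows with the season length when a short season is wrongly predicted. Quantifying that trade-off across the whole range of prediction error is the kind of comparison the abstract describes.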
[LINK]
http://arxiv.org/abs/2501.17701v2
[DATE]
2025-09-15 23:20:23+08:00
[CATEGORIES]
cs.LG
Scalable extensions to given-data Sobol’ index estimators
[AUTHORS]
Teresa Portone, Bert Debusschere, Samantha Yang, Emiliano Islas-Quinones, T. Patrick Xiao
[ABSTRACT]
Given-data methods for variance-based sensitivity analysis have significantly
advanced the feasibility of Sobol’ index computation for computationally
expensive models and models with many inputs. However, the limitations of
existing methods still preclude their application to models with an extremely
large number of inputs. In this work, we present practical extensions to the
existing given-data Sobol’ index method, which allow variance-based sensitivity
analysis to be efficiently performed on large models such as neural networks,
which have $>10^4$ parameterizable inputs. For models of this size, holding all
input-output evaluations simultaneously in memory – as required by existing
methods – can quickly become impractical. These extensions also support
nonstandard input distributions with many repeated values, which are not
amenable to equiprobable partitions employed by existing given-data methods.
Our extensions include a general definition of the given-data Sobol’ index
estimator with arbitrary partition, a streaming algorithm to process
input-output samples in batches, and a heuristic to filter out small indices
that are indistinguishable from zero indices due to statistical noise. We show
that the equiprobable partition employed in existing given-data methods can
introduce significant bias into Sobol’ index estimates even at large sample
sizes and provide numerical analyses that demonstrate why this can occur. We
also show that our streaming algorithm can achieve comparable accuracy and
runtimes with lower memory requirements, relative to current methods which
process all samples at once. We demonstrate our novel developments on two
application problems in neural network modeling.
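The batch-wise idea described above can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes the common binned form of a given-data first-order index (variance of the per-bin output means divided by the total output variance), accepts an arbitrary partition through user-supplied bin edges, and keeps only running sums so that samples never need to be held in memory together. The class name `StreamingSobol` and its methods are hypothetical.

```python
import bisect


class StreamingSobol:
    """Running first-order Sobol' index estimate for ONE input (illustrative)."""

    def __init__(self, edges):
        # Interior bin boundaries; any partition is allowed, not only equiprobable.
        self.edges = sorted(edges)
        n_bins = len(self.edges) + 1
        self.count = [0] * n_bins      # samples seen per bin
        self.total = [0.0] * n_bins    # sum of outputs per bin
        self.n = 0                     # total samples seen
        self.sum_y = 0.0               # running sum of outputs
        self.sum_y2 = 0.0              # running sum of squared outputs

    def update(self, xs, ys):
        """Fold one batch of (input, output) pairs into the running sums."""
        for x, y in zip(xs, ys):
            k = bisect.bisect_right(self.edges, x)
            self.count[k] += 1
            self.total[k] += y
            self.n += 1
            self.sum_y += y
            self.sum_y2 += y * y

    def first_order_index(self):
        """Variance of the per-bin means divided by the total output variance."""
        mean = self.sum_y / self.n
        var = self.sum_y2 / self.n - mean * mean
        if var <= 0.0:
            return 0.0
        between = sum(
            c * (t / c - mean) ** 2
            for c, t in zip(self.count, self.total)
            if c > 0
        ) / self.n
        return between / var
```

Each call to `update` processes one batch and discards it, so memory use depends on the number of bins rather than the number of samples.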
[LINK]
http://arxiv.org/abs/2509.09078v2
[DATE]
2025-09-15 23:10:59+08:00
[CATEGORIES]
cs.LG
Imitation Learning as Return Distribution Matching
[AUTHORS]
Filippo Lazzati, Alberto Maria Metelli
[ABSTRACT]
We study the problem of training a risk-sensitive reinforcement learning (RL)
agent through imitation learning (IL). Unlike standard IL, our goal is not only
to train an agent that matches the expert’s expected return (i.e., its average
performance) but also its risk attitude (i.e., other features of the return
distribution, such as variance). We propose a general formulation of the
risk-sensitive IL problem in which the objective is to match the expert’s
return distribution in Wasserstein distance. We focus on the tabular setting
and assume the expert’s reward is known. After demonstrating the limited
expressivity of Markovian policies for this task, we introduce an efficient and
sufficiently expressive subclass of non-Markovian policies tailored to it.
Building on this subclass, we develop two provably efficient algorithms, RS-BC
and RS-KT, for solving the problem when the transition model is unknown and
known, respectively. We show that RS-KT achieves substantially lower sample
complexity than RS-BC by exploiting dynamics information. We further
demonstrate the sample efficiency of return distribution matching in the
setting where the expert’s reward is unknown by designing an oracle-based
variant of RS-KT. Finally, we complement our theoretical analysis of RS-KT and
RS-BC with numerical simulations, highlighting both their sample efficiency and
the advantages of non-Markovian policies over standard sample-efficient IL
algorithms.
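To make the matching objective concrete, here is a minimal sketch of the 1-Wasserstein distance between two empirical return distributions. It assumes both agents supply the same number of sampled returns, in which case the distance reduces to the mean absolute difference between sorted samples. The function name `wasserstein_1` is hypothetical, and this is not the RS-BC or RS-KT algorithm.

```python
def wasserstein_1(returns_a, returns_b):
    """1-Wasserstein distance between two equal-size empirical distributions."""
    if len(returns_a) != len(returns_b) or not returns_a:
        raise ValueError("expected two non-empty samples of equal size")
    a, b = sorted(returns_a), sorted(returns_b)
    # On the real line, the optimal coupling pairs the sorted samples in order.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Same expected return (5.0), different risk profile.
expert = [0.0, 10.0]
learner = [5.0, 5.0]
distance = wasserstein_1(expert, learner)
```

In this example the two agents have identical average returns, yet the distance is nonzero, which is exactly the kind of mismatch that matching only the expected return would miss.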
[LINK]
http://arxiv.org/abs/2509.12026v1
[DATE]
2025-09-15 23:08:04+08:00
[CATEGORIES]
cs.LG
Learning non-Markovian Dynamical Systems with Signature-based Encoders
[AUTHORS]
Eliott Pradeleix, Rémy Hosseinkhan-Boucher, Alena Shilova, Onofrio Semeraro, Lionel Mathelin
[ABSTRACT]
Neural ordinary differential equations offer an effective framework for
modeling dynamical systems by learning a continuous-time vector field. However,
they rely on the Markovian assumption - that future states depend only on the
current state - which is often untrue in real-world scenarios where the
dynamics may depend on the history of past states. This limitation becomes
especially evident in settings involving the continuous control of complex
systems with delays and memory effects. To capture historical dependencies,
existing approaches often rely on recurrent neural network (RNN)-based
encoders, which are inherently discrete and struggle with continuous modeling.
In addition, they may exhibit poor training behavior. In this work, we
investigate the use of the signature transform as an encoder for learning
non-Markovian dynamics in a continuous-time setting. The signature transform
offers a continuous-time alternative with strong theoretical foundations and
proven efficiency in summarizing multidimensional information in time. We
integrate a signature-based encoding scheme into encoder-decoder dynamics
models and demonstrate that it outperforms RNN-based alternatives in test
performance on synthetic benchmarks.
[COMMENTS]
Accepted at [ML-DE] Machine Learning Meets Differential Equations
2025 (ECAI 2025). To appear in Proceedings of Machine Learning Research
(PMLR)
[LINK]
http://arxiv.org/abs/2509.12022v1
[DATE]
2025-09-15 23:01:22+08:00
[CATEGORIES]
cs.LG
Task-Focused Consolidation with Spaced Recall: Making Neural Networks Learn like College Students
[AUTHORS]
Prital Bamnodkar
[ABSTRACT]
Deep neural networks often suffer from a critical limitation known as
catastrophic forgetting, where performance on past tasks degrades after
learning new ones. This paper introduces a novel continual learning approach
inspired by human learning strategies like Active Recall, Deliberate Practice,
and Spaced Repetition, named Task-Focused Consolidation with Spaced Recall
(TFC-SR). TFC-SR enhances the standard experience replay framework with a
mechanism we term the Active Recall Probe. It is a periodic, task-aware
evaluation of the model’s memory that stabilizes the representations of past
knowledge. We test TFC-SR on the Split MNIST and the Split CIFAR-100 benchmarks
against leading regularization-based and replay-based baselines. Our results
show that TFC-SR performs significantly better than these methods. For
instance, on the Split CIFAR-100, it achieves a final accuracy of 13.17%
compared to Standard Experience Replay’s 7.40%. We demonstrate that this
advantage comes from the stabilizing effect of the probe itself, and not from
the difference in replay volume. Additionally, we analyze the trade-off between
memory size and performance and show that while TFC-SR performs better in
memory-constrained environments, higher replay volume is still more effective
when available memory is abundant. We conclude that TFC-SR is a robust and
efficient approach, highlighting the importance of integrating active memory
retrieval mechanisms into continual learning systems.
[COMMENTS]
Improved Grammar, consistency and flow. Some sections like the
Discussion Section have been rewritten for improvement. Figures and Tables
have improved formatting, while the algorithm pseudocode is now consistent
with the experiments and less ambiguous
[LINK]
http://arxiv.org/abs/2507.21109v2
[DATE]
2025-09-15 22:56:56+08:00
[CATEGORIES]
cs.LG
MAYA: Addressing Inconsistencies in Generative Password Guessing through a Unified Benchmark
[AUTHORS]
William Corrias, Fabio De Gaspari, Dorjan Hitaj, Luigi V. Mancini
[ABSTRACT]
Recent advances in generative models have led to their application in
password guessing, with the aim of replicating the complexity, structure, and
patterns of human-created passwords. Despite their potential, inconsistencies
and inadequate evaluation methodologies in prior research have hindered
meaningful comparisons and a comprehensive, unbiased understanding of their
capabilities. This paper introduces MAYA, a unified, customizable,
plug-and-play benchmarking framework designed to facilitate the systematic
characterization and benchmarking of generative password-guessing models in the
context of trawling attacks. Using MAYA, we conduct a comprehensive assessment
of six state-of-the-art approaches, which we re-implemented and adapted to
ensure standardization. Our evaluation spans eight real-world password datasets
and covers an exhaustive set of advanced testing scenarios, totaling over
15,000 compute hours. Our findings indicate that these models effectively
capture different aspects of human password distribution and exhibit strong
generalization capabilities. However, their effectiveness varies significantly
with long and complex passwords. Through our evaluation, sequential models
consistently outperform other generative architectures and traditional
password-guessing tools, demonstrating unique capabilities in generating
accurate and complex guesses. Moreover, the diverse password distributions
learned by the models enable a multi-model attack that outperforms the best
individual model. By releasing MAYA, we aim to foster further research,
providing the community with a new tool to consistently and reliably benchmark
generative password-guessing models. Our framework is publicly available at
https://github.com/williamcorrias/MAYA-Password-Benchmarking.
[COMMENTS]
Paper accepted at the 47th IEEE Symposium on Security and Privacy
(S&P 2026)
[LINK]
http://arxiv.org/abs/2504.16651v3
[DATE]
2025-09-15 22:53:33+08:00
[CATEGORIES]
cs.LG
Robustness in the Face of Partial Identifiability in Reward Learning
[AUTHORS]
Filippo Lazzati, Alberto Maria Metelli
[ABSTRACT]
In Reward Learning (ReL), we are given feedback on an unknown target reward,
and the goal is to use this information to recover it in order to carry out
some downstream application, e.g., planning. When the feedback is not
informative enough, the target reward is only partially identifiable, i.e.,
there exists a set of rewards, called the feasible set, that are equally
plausible candidates for the target reward. In these cases, the ReL algorithm
might recover a reward function different from the target reward, possibly
leading to a failure in the application. In this paper, we introduce a general
ReL framework that makes it possible to quantify the drop in “performance” suffered in
the considered application because of identifiability issues. Building on this,
we propose a robust approach to address the identifiability problem in a
principled way, by maximizing the “performance” with respect to the worst-case
reward in the feasible set. We then develop Rob-ReL, a ReL algorithm that
applies this robust approach to the subset of ReL problems aimed at assessing a
preference between two policies, and we provide theoretical guarantees on
sample and iteration complexity for Rob-ReL. We conclude with a
proof-of-concept experiment to illustrate the considered setting.
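The worst-case principle can be sketched as a maximin choice over a finite feasible set. This is an illustration only, not Rob-ReL: the finite set of candidate rewards, the candidate decisions, and the `performance` callback are all assumptions introduced here.

```python
def robust_choice(candidates, feasible_rewards, performance):
    """Pick the candidate whose worst-case performance over the feasible set is best."""
    def worst_case(candidate):
        return min(performance(candidate, reward) for reward in feasible_rewards)
    return max(candidates, key=worst_case)


# Hypothetical three-action example: two rewards are equally plausible.
feasible = [
    [1.0, 0.4, 0.0],
    [0.0, 0.5, 1.0],
]
best_action = robust_choice(range(3), feasible, lambda a, r: r[a])
```

Actions 0 and 2 are optimal under one plausible reward but worthless under the other, so the maximin rule prefers action 1, whose worst-case value is 0.4.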
[LINK]
http://arxiv.org/abs/2501.06376v2
[DATE]
2025-09-15 22:37:21+08:00
[CATEGORIES]
cs.LG
Deep learning joint extremes of metocean variables using the SPAR model
[AUTHORS]
Ed Mackay, Callum Murphy-Barltrop, Jordan Richards, Philip Jonathan
[ABSTRACT]
This paper presents a novel deep learning framework for estimating
multivariate joint extremes of metocean variables, based on the Semi-Parametric
Angular-Radial (SPAR) model. When considered in polar coordinates, the problem
of modelling multivariate extremes is transformed to one of modelling an
angular density, and the tail of a univariate radial variable conditioned on
angle. In the SPAR approach, the tail of the radial variable is modelled using
a generalised Pareto (GP) distribution, providing a natural extension of
univariate extreme value theory to the multivariate setting. In this work, we
show how the method can be applied in higher dimensions, using a case study for
five metocean variables: wind speed, wind direction, wave height, wave period,
and wave direction. The angular variable is modelled using a kernel density
method, while the parameters of the GP model are approximated using
fully-connected deep neural networks. Our approach provides great flexibility
in the dependence structures that can be represented, together with
computationally efficient routines for training the model. Furthermore, the
method requires fewer assumptions about the underlying distribution(s) than
existing approaches and provides an asymptotically
justified means of extrapolating outside the range of observations. Using
various diagnostic plots, we show that the fitted models provide a good
description of the joint extremes of the metocean variables considered.
[LINK]
http://arxiv.org/abs/2412.15808v3
[DATE]
2025-09-15 22:35:15+08:00
[CATEGORIES]
cs.LG
Learning from Uncertain Similarity and Unlabeled Data
[AUTHORS]
Meng Wei, Zhongnian Li, Peng Ying, Xinzheng Xu
[ABSTRACT]
Existing similarity-based weakly supervised learning approaches often rely on
precise similarity annotations between data pairs, which may inadvertently
expose sensitive label information and raise privacy risks. To mitigate this
issue, we propose Uncertain Similarity and Unlabeled Learning (USimUL), a novel
framework where each similarity pair is embedded with an uncertainty component
to reduce label leakage. In this paper, we propose an unbiased risk estimator
that learns from uncertain similarity and unlabeled data. Additionally, we
theoretically prove that the estimator achieves statistically optimal
parametric convergence rates. Extensive experiments on both benchmark and
real-world datasets show that our method achieves superior classification
performance compared to conventional similarity-based approaches.
[LINK]
http://arxiv.org/abs/2509.11984v1
[DATE]
2025-09-15 22:29:36+08:00
[CATEGORIES]
cs.LG
Learned Controllers for Agile Quadrotors in Pursuit-Evasion Games
[AUTHORS]
Alejandro Sanchez Roncero, Yixi Cai, Olov Andersson, Petter Ogren
[ABSTRACT]
We address the problem of agile 1v1 quadrotor pursuit-evasion, where a
pursuer and an evader learn to outmaneuver each other through reinforcement
learning (RL). Such settings face two major challenges: non-stationarity, since
each agent’s evolving policy alters the environment dynamics and destabilizes
training, and catastrophic forgetting, where a policy overfits to the current
adversary and loses effectiveness against previously encountered strategies. To
tackle these issues, we propose an Asynchronous Multi-Stage Population-Based
(AMSPB) algorithm. At each stage, the pursuer and evader are trained
asynchronously against a frozen pool of opponents sampled from a growing
population of past and current policies, stabilizing training and ensuring
exposure to diverse behaviors. Within this framework, we train neural network
controllers that output either velocity commands or body rates with collective
thrust. Experiments in a high-fidelity simulator show that: (i) AMSPB-trained
RL policies outperform RL and geometric baselines; (ii) body-rate-and-thrust
controllers achieve more agile flight than velocity-based controllers, leading
to better pursuit-evasion performance; (iii) AMSPB yields stable, monotonic
gains across stages; and (iv) policies trained in one arena size generalize
fairly well to other sizes without retraining.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.02849v2
[DATE]
2025-09-15 22:29:31+08:00
[CATEGORIES]
cs.LG
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
[AUTHORS]
Chuan He, Zhanwang Deng, Zhaosong Lu
[ABSTRACT]
Neural network (NN) training is inherently a large-scale matrix optimization
problem, yet the matrix structure of NN parameters has long been overlooked.
Recently, the optimizer Muon, which explicitly exploits this
structure, has gained significant attention for its strong performance in
foundation model training. A key component contributing to Muon’s success is
matrix orthogonalization. In this paper, we propose low-rank
orthogonalization, which explicitly leverages the low-rank nature of gradients
during NN training. Building on this, we propose low-rank matrix-signed
gradient descent and a low-rank variant of Muon. Our numerical experiments
demonstrate the superior performance of low-rank orthogonalization, with the
low-rank Muon achieving promising results in GPT-2 and LLaMA pretraining –
surpassing the performance of the carefully tuned vanilla Muon. Theoretically,
we establish the iteration complexity of the low-rank matrix-signed gradient
descent for finding an approximate stationary solution, as well as that of
low-rank Muon for finding an approximate stochastic stationary solution under
heavy-tailed noise.
[COMMENTS]
27 pages
[LINK]
http://arxiv.org/abs/2509.11983v1
[DATE]
2025-09-15 22:28:53+08:00
[CATEGORIES]
cs.LG
Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
[AUTHORS]
Zhaohui Yang, Yuxiao Ye, Shilei Jiang, Chen Hu, Linjing Li, Shihong Deng, Daxin Jiang
[ABSTRACT]
Recent advances in reasoning language models have witnessed a paradigm shift
from short to long chain-of-thought (CoT) patterns. Given the substantial computational cost of
rollouts in long CoT models, maximizing the utility of fixed training datasets
becomes crucial. Our analysis reveals that negative responses contain valuable
components such as self-reflection and error-correction steps, yet the main
existing methods either completely discard negative samples (RFT) or apply
equal penalization across all tokens (RL), failing to leverage these potential
learning signals. In light of this, we propose Behavior Constrained Policy
Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline
RL framework that encompasses three stages: 1) sample segmentation, 2)
consensus-based step correctness assessment combining LLM and PRM judgers, and
3) policy optimization with NSA designed to effectively mine positive steps
within negative samples. Experimental results show that BCPG-NSA outperforms
baselines on several challenging math/coding reasoning benchmarks using the
same training dataset, achieving improved sample efficiency and demonstrating
robustness and scalability when extended to multiple iterations.
[LINK]
http://arxiv.org/abs/2505.14403v4
[DATE]
2025-09-15 22:23:10+08:00
[CATEGORIES]
cs.LG
Deep operator network for surrogate modeling of poroelasticity with random permeability fields
[AUTHORS]
Sangjoon Park, Yeonjong Shin, Jinhyun Choo
[ABSTRACT]
Poroelasticity – coupled fluid flow and elastic deformation in porous media
– often involves spatially variable permeability, especially in subsurface
systems. In such cases, simulations with random permeability fields are widely
used for probabilistic analysis, uncertainty quantification, and inverse
problems. These simulations require repeated forward solves that are often
prohibitively expensive, motivating the development of efficient surrogate
models. However, efficient surrogate modeling techniques for poroelasticity
with random permeability fields remain scarce. In this study, we propose a
surrogate modeling framework based on the deep operator network (DeepONet), a
neural architecture designed to learn mappings between infinite-dimensional
function spaces. The proposed surrogate model approximates the solution
operator that maps random permeability fields to transient poroelastic
responses. To enhance predictive accuracy and stability, we integrate three
strategies: nondimensionalization of the governing equations, input
dimensionality reduction via Karhunen–Loève expansion, and a two-step
training procedure that decouples the optimization of branch and trunk
networks. The methodology is evaluated on two benchmark problems in
poroelasticity: soil consolidation and ground subsidence induced by groundwater
extraction. In both cases, the DeepONet achieves substantial speedup in
inference while maintaining high predictive accuracy across a wide range of
permeability statistics. These results highlight the potential of the proposed
approach as a scalable and efficient surrogate modeling technique for
poroelastic systems with random permeability fields.
[LINK]
http://arxiv.org/abs/2509.11966v1
[DATE]
2025-09-15 22:18:49+08:00
[CATEGORIES]
cs.LG
Identifiable Autoregressive Variational Autoencoders for Nonlinear and Nonstationary Spatio-Temporal Blind Source Separation
[AUTHORS]
Mika Sipilä, Klaus Nordhausen, Sara Taskinen
[ABSTRACT]
The modeling and prediction of multivariate spatio-temporal data involve
numerous challenges. Dimension reduction methods can significantly simplify
this process, provided that they account for the complex dependencies between
variables and across time and space. Nonlinear blind source separation has
emerged as a promising approach, particularly following recent advances in
identifiability results. Building on these developments, we introduce the
identifiable autoregressive variational autoencoder, which ensures the
identifiability of latent components consisting of nonstationary autoregressive
processes. The blind source separation efficacy of the proposed method is
showcased through a simulation study, where it is compared against
state-of-the-art methods, and the spatio-temporal prediction performance is
evaluated against several competitors on air pollution and weather datasets.
[LINK]
http://arxiv.org/abs/2509.11962v1
[DATE]
2025-09-15 22:17:06+08:00
[CATEGORIES]
cs.LG
TabStruct: Measuring Structural Fidelity of Tabular Data
[AUTHORS]
Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
[ABSTRACT]
Evaluating tabular generators remains a challenging problem, as the unique
causal structural prior of heterogeneous tabular data does not lend itself to
intuitive human inspection. Recent work has introduced structural fidelity as a
tabular-specific evaluation dimension to assess whether synthetic data complies
with the causal structures of real data. However, existing benchmarks often
neglect the interplay between structural fidelity and conventional evaluation
dimensions, thus failing to provide a holistic understanding of model
performance. Moreover, they are typically limited to toy datasets, as
quantifying existing structural fidelity metrics requires access to
ground-truth causal structures, which are rarely available for real-world
datasets. In this paper, we propose a novel evaluation framework that jointly
considers structural fidelity and conventional evaluation dimensions. We
introduce a new evaluation metric, global utility, which enables the
assessment of structural fidelity even in the absence of ground-truth causal
structures. In addition, we present TabStruct, a comprehensive
evaluation benchmark offering large-scale quantitative analysis on 13 tabular
generators from nine distinct categories, across 29 datasets. Our results
demonstrate that global utility provides a task-independent, domain-agnostic
lens for tabular generator performance. We release the TabStruct benchmark
suite, including all datasets, evaluation pipelines, and raw results. Code is
available at https://github.com/SilenceX12138/TabStruct.
[COMMENTS]
55 pages, 60 tables, 7 figures
[LINK]
http://arxiv.org/abs/2509.11950v1
[DATE]
2025-09-15 22:08:20+08:00
[CATEGORIES]
cs.LG
Learning from Scratch: Structurally-masked Transformer for Next Generation Lib-free Simulation
[AUTHORS]
Junlang Huang, Hao Chen, Zhong Guan
[ABSTRACT]
This paper proposes a neural framework for power and timing prediction of
multi-stage data paths, distinguishing itself from traditional lib-based
analytical methods dependent on driver characterization and load
simplifications. To the best of our knowledge, this is the first
language-based, netlist-aware neural network designed explicitly for standard
cells. Our approach employs two pre-trained neural models of waveform
prediction and delay estimation that directly infer transient waveforms and
propagation delays from SPICE netlists, conditioned on critical physical
parameters such as load capacitance, input slew, and gate size. This method
accurately captures both intrinsic and coupling-induced delay effects without
requiring simplification or interpolation. For multi-stage timing prediction,
we implement a recursive propagation strategy where predicted waveforms from
each stage feed into subsequent stages, cumulatively capturing delays across
the logic chain. This approach ensures precise timing alignment and complete
waveform visibility throughout complex signal pathways. The waveform prediction
utilizes a hybrid CNN-Transformer architecture with netlist-aware node-level
encoding, addressing traditional Transformers’ fixed input dimensionality
constraints. Additionally, specialized subnetworks separately handle primary
delay estimation and crosstalk correction. Experimental results demonstrate
SPICE-level accuracy, consistently achieving RMSE below 0.0098 across diverse
industrial circuits. The proposed framework provides a scalable, structurally
adaptable neural alternative to conventional power and timing engines,
demonstrating high fidelity to physical circuit behaviors.
[COMMENTS]
Prepare for complementary experiments
[LINK]
http://arxiv.org/abs/2507.17396v2
[DATE]
2025-09-15 22:06:28+08:00
[CATEGORIES]
cs.LG
Neuro-Symbolic Agents with Modal Logic for Autonomous Diagnostics
[AUTHORS]
Antonin Sulc, Thorsten Hellert
[ABSTRACT]
The development of intelligent agents, particularly those powered by language
models (LMs), has shown their critical role in environments that require
intelligent and autonomous decision-making. Environments are not passive testing
grounds: they provide the data that agents need to learn, and they present
very challenging conditions that demand an adaptive, complex, and autonomous
capacity to make decisions. While the paradigm of scaling models and datasets
has led to remarkable emergent capabilities, we argue that scaling the
structure, fidelity, and logical consistency of agent reasoning within these
environments is a crucial, yet underexplored, dimension of AI research. This
paper introduces a neuro-symbolic multi-agent architecture where the belief
states of individual agents are formally represented as Kripke models. This
foundational choice enables them to reason about known concepts of
possibility and necessity using the formal language of modal
logic. In this work, we use immutable, domain-specific knowledge, encoded as
logical constraints essential for proper diagnosis, to make inferences. In the
proposed model, we show that these constraints actively
guide the hypothesis generation of LMs, effectively preventing them from
reaching physically or logically untenable conclusions. In a high-fidelity
simulated particle accelerator environment, our system successfully diagnoses
complex, cascading failures by combining the powerful semantic intuition of LMs
with the rigorous, verifiable validation of modal logic and a factual world
model, showcasing a viable path toward more robust, reliable, and verifiable
autonomous agents.
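As a rough illustration of how possibility and necessity are evaluated on a Kripke model, the sketch below uses the standard semantics: a proposition is possible at a world if it holds in at least one accessible world, and necessary if it holds in all of them. The class `KripkeModel`, the world names, and the fault labels are hypothetical and are not taken from the paper's system.

```python
class KripkeModel:
    """Worlds, an accessibility relation, and a valuation of atomic propositions."""

    def __init__(self, access, valuation):
        self.access = access          # world -> set of accessible worlds
        self.valuation = valuation    # world -> set of atoms true in that world

    def holds(self, world, atom):
        return atom in self.valuation.get(world, set())

    def possibly(self, world, atom):
        # Diamond: the atom is true in at least one accessible world.
        return any(self.holds(v, atom) for v in self.access.get(world, set()))

    def necessarily(self, world, atom):
        # Box: the atom is true in every accessible world.
        return all(self.holds(v, atom) for v in self.access.get(world, set()))


# Hypothetical belief state: from w0 the agent considers w1 and w2 possible.
beliefs = KripkeModel(
    access={"w0": {"w1", "w2"}},
    valuation={
        "w1": {"magnet_fault", "beam_loss"},
        "w2": {"rf_fault", "beam_loss"},
    },
)
candidate_fault = beliefs.possibly("w0", "magnet_fault")
certain_symptom = beliefs.necessarily("w0", "beam_loss")
```

Under this reading, a hypothesis proposed by an LM could be rejected when it is not possible in any accessible world, which is one simple way a logical constraint might prune untenable conclusions.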
[COMMENTS]
10 pages, 1 figure, Scaling Environments for Agents (SEA) Workshop at
NeurIPS
[LINK]
http://arxiv.org/abs/2509.11943v1
[DATE]
2025-09-15 22:03:06+08:00
[CATEGORIES]
cs.LG
Quantum Noise Tomography with Physics-Informed Neural Networks
[AUTHORS]
Antonin Sulc
[ABSTRACT]
Characterizing the environmental interactions of quantum systems is a
critical bottleneck in the development of robust quantum technologies.
Traditional tomographic methods are often data-intensive and struggle with
scalability. In this work, we introduce a novel framework for performing
Lindblad tomography using Physics-Informed Neural Networks (PINNs). By
embedding the Lindblad master equation directly into the neural network’s loss
function, our approach simultaneously learns the quantum state’s evolution and
infers the underlying dissipation parameters from sparse, time-series
measurement data. Our results show that PINNs can reconstruct both the system
dynamics and the functional form of unknown noise parameters, presenting a
sample-efficient and scalable solution for quantum device characterization.
Ultimately, our method produces a fully-differentiable digital twin of a noisy
quantum system by learning its governing master equation.
[COMMENTS]
6 pages, 3 figures, Machine Learning and the Physical Sciences
Workshop at the 39th conference on Neural Information Processing Systems
(NeurIPS)
[LINK]
http://arxiv.org/abs/2509.11911v1
[DATE]
2025-09-15 21:30:50+08:00
[CATEGORIES]
cs.LG
High Effort, Low Gain: Fundamental Limits of Active Learning for Linear Dynamical Systems
[AUTHORS]
Nicolas Chatzikiriakos, Kevin Jamieson, Andrea Iannelli
[ABSTRACT]
In this work, we consider the problem of identifying an unknown linear
dynamical system given a finite hypothesis class. In particular, we analyze the
effect of the excitation input on the sample complexity of identifying the true
system with high probability. To this end, we present sample complexity lower
bounds that capture the choice of the selected excitation input. The sample
complexity lower bound gives rise to a system theoretic condition to determine
the potential benefit of experiment design. Informed by the analysis of the
sample complexity lower bound, we propose a persistent excitation (PE)
condition tailored to the considered setting, which we then use to establish
sample complexity upper bounds. Notably, the PE condition is weaker than
in the case of an infinite hypothesis class and allows analyzing different
excitation inputs modularly. Crucially, the lower and upper bounds share the
same dependency on key problem parameters. Finally, we leverage these insights
to propose an active learning algorithm that sequentially excites the system
optimally with respect to the current estimate, and provide sample complexity
guarantees for the presented algorithm. Concluding simulations showcase the
effectiveness of the proposed algorithm.
[LINK]
http://arxiv.org/abs/2509.11907v1
[DATE]
2025-09-15 21:29:24+08:00
[CATEGORIES]
cs.LG
Wavelet-SARIMA-Transformer: A Hybrid Model for Rainfall Forecasting
[AUTHORS]
Junmoni Saikia, Kuldeep Goswami, Sarat C. Kakaty
[ABSTRACT]
This study develops and evaluates a novel hybrid Wavelet-SARIMA-Transformer
(WST) framework to forecast monthly rainfall across five meteorological
subdivisions of Northeast India over the period 1971 to 2023. The approach
employs the Maximal Overlap Discrete Wavelet Transform (MODWT) with four wavelet
families (Haar, Daubechies, Symlet, and Coiflet) to achieve shift-invariant,
multiresolution decomposition of the rainfall series. Linear and
seasonal components are modeled using Seasonal ARIMA (SARIMA), while nonlinear
components are modeled by a Transformer network, and forecasts are
reconstructed via inverse MODWT. Comprehensive validation using an 80:20
train-test split and multiple performance indices, including RMSE, MAE, SMAPE,
Willmott's d, Skill Score, Percent Bias, Explained Variance, and Legates-McCabe's
E1, demonstrates the superiority of the Haar-based hybrid model, WHST. Across
all subdivisions, WHST consistently achieved lower forecast errors, stronger
agreement with observed rainfall, and unbiased predictions compared with
stand-alone SARIMA, stand-alone Transformer, and two-stage wavelet hybrids. Residual
adequacy was confirmed through the Ljung-Box test, while Taylor diagrams
provided an integrated assessment of correlation, variance fidelity, and RMSE,
further reinforcing the robustness of the proposed approach. The results
highlight the effectiveness of integrating multiresolution signal decomposition
with complementary linear and deep learning models for hydroclimatic
forecasting. Beyond rainfall, the proposed WST framework offers a scalable
methodology for forecasting complex environmental time series, with direct
implications for flood risk management, water resources planning, and climate
adaptation strategies in data-sparse and climate-sensitive regions.
[LINK]
http://arxiv.org/abs/2509.11903v1
[DATE]
2025-09-15 21:27:19+08:00
[CATEGORIES]
cs.LG
Learning Representations in Video Game Agents with Supervised Contrastive Imitation Learning
[AUTHORS]
Carlos Celemin, Joseph Brennan, Pierluigi Vito Amadori, Tim Bradley
[ABSTRACT]
This paper introduces a novel application of Supervised Contrastive Learning
(SupCon) to Imitation Learning (IL), with a focus on learning more effective
state representations for agents in video game environments. The goal is to
obtain latent representations of the observations that better capture the
action-relevant factors, thereby better modeling the cause-effect relationship
between the observations and the actions performed by the
demonstrator; for example, the player jumps whenever an obstacle appears ahead.
We propose an approach to integrate the SupCon loss with continuous output
spaces, enabling SupCon to operate without constraints regarding the type of
actions of the environment. Experiments on the 3D games Astro Bot and Returnal,
and multiple 2D Atari games show improved representation quality, faster
learning convergence, and better generalization compared to baseline models
trained only with supervised action prediction loss functions.
[LINK]
http://arxiv.org/abs/2509.11880v1
[DATE]
2025-09-15 21:00:29+08:00
[CATEGORIES]
cs.LG
Predicting Stock Prices using Permutation Decision Trees and Strategic Trailing
[AUTHORS]
Vishrut Ramraj, Nithin Nagaraj, Harikrishnan N B
[ABSTRACT]
In this paper, we explore the application of Permutation Decision Trees (PDT)
and strategic trailing for predicting stock market movements and executing
profitable trades in the Indian stock market. We focus on high-frequency data
using 5-minute candlesticks for the top 50 stocks listed in the NIFTY 50 index
and Forex pairs such as XAUUSD and EURUSD. We implement a trading strategy that
aims to buy stocks at lower prices and sell them at higher prices, capitalizing
on short-term market fluctuations. Due to regulatory constraints in India,
short selling is not considered in our strategy. The model incorporates various
technical indicators and employs hyperparameters such as the trailing stop-loss
value and support thresholds to manage risk effectively. We trained and tested
the models on a 3-month dataset provided by Yahoo Finance. Our bot based on
Permutation Decision Tree achieved a profit of 1.1802% over the testing
period, whereas a bot based on LSTM gave a return of 0.557% over the testing
period and a bot based on RNN gave a return of 0.5896% over the testing
period. All of the bots outperform the buy-and-hold strategy, which resulted in
a loss of 2.29%.
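To make the idea of a long-only strategy with a trailing stop-loss concrete, here is a minimal sketch. It is not the authors' bot: the entry signals are passed in as a plain list (in the paper they would presumably come from the Permutation Decision Tree), and the function name, the `trail_pct` parameter, and the rule of exiting at the bar's price are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def simulate_trailing_stop(
    prices: List[float],
    entry_signals: List[bool],
    trail_pct: float,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Long-only simulation with a trailing stop-loss (hypothetical helper).

    Returns (total return as a fraction, list of (entry_index, exit_index)).
    No short selling: the position is either flat or long one unit of equity.
    """
    equity = 1.0
    entry_price: Optional[float] = None
    entry_idx = -1
    peak = 0.0
    trades: List[Tuple[int, int]] = []
    for i, price in enumerate(prices):
        if entry_price is None:
            # Flat: open a position only when the model signals a buy.
            if entry_signals[i]:
                entry_price, entry_idx, peak = price, i, price
            continue
        # Long: the stop level trails the highest price seen since entry.
        peak = max(peak, price)
        if price <= peak * (1.0 - trail_pct):
            equity *= price / entry_price
            trades.append((entry_idx, i))
            entry_price = None
    if entry_price is not None:
        # Close any position still open at the end of the test window.
        equity *= prices[-1] / entry_price
        trades.append((entry_idx, len(prices) - 1))
    return equity - 1.0, trades
```

A larger `trail_pct` gives each trade more room before the stop triggers, while a smaller one exits sooner; this is the trade-off that the trailing stop-loss hyperparameter mentioned in the abstract controls.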
[COMMENTS]
27 pages
[LINK]
http://arxiv.org/abs/2504.12828v3
[DATE]
2025-09-15 20:57:10+08:00
[CATEGORIES]
cs.LG
Early alignment in two-layer networks training is a two-edged sword
[AUTHORS]
Etienne Boursier, Nicolas Flammarion
[ABSTRACT]
Training neural networks with first order optimisation methods is at the core
of the empirical success of deep learning. The scale of initialisation is a
crucial factor, as small initialisations are generally associated with a feature
learning regime, for which gradient descent is implicitly biased towards simple
solutions. This work provides a general and quantitative description of the
early alignment phase, originally introduced by Maennel et al. (2018). For
small initialisation and one hidden ReLU layer networks, the early stage of the
training dynamics leads to an alignment of the neurons towards key directions.
This alignment induces a sparse representation of the network, which is
directly related to the implicit bias of gradient flow at convergence. This
sparsity-inducing alignment, however, comes at the expense of difficulties in
minimising the training objective: we also provide a simple data example for
which overparameterised networks fail to converge towards global minima and
only converge to a spurious stationary point instead.
[COMMENTS]
Official JMLR version
[LINK]
http://arxiv.org/abs/2401.10791v3
[DATE]
2025-09-15 20:47:31+08:00
[CATEGORIES]
cs.LG
Bridging Vision Language Models and Symbolic Grounding for Video Question Answering
[AUTHORS]
Haodi Ma, Vyom Pathak, Daisy Zhe Wang
[ABSTRACT]
Video Question Answering (VQA) requires models to reason over spatial,
temporal, and causal cues in videos. Recent vision language models (VLMs)
achieve strong results but often rely on shallow correlations, leading to weak
temporal grounding and limited interpretability. We study symbolic scene graphs
(SGs) as intermediate grounding signals for VQA. SGs provide structured
object-relation representations that complement VLMs' holistic reasoning. We
introduce SG-VLM, a modular framework that integrates frozen VLMs with scene
graph grounding via prompting and visual localization. Across three benchmarks
(NExT-QA, iVQA, ActivityNet-QA) and multiple VLMs (QwenVL, InternVL), SG-VLM
improves causal and temporal reasoning and outperforms prior baselines, though
gains over strong VLMs are limited. These findings highlight both the promise
and current limitations of symbolic grounding, and offer guidance for future
hybrid VLM-symbolic approaches in video understanding.
[LINK]
http://arxiv.org/abs/2509.11862v1
[DATE]
2025-09-15 20:35:56+08:00
[CATEGORIES]
cs.LG
Visualization and Analysis of the Loss Landscape in Graph Neural Networks
[AUTHORS]
Samir Moustafa, Lorenz Kummer, Simon Fetzel, Nils M. Kriege, Wilfried N. Gansterer
[ABSTRACT]
Graph Neural Networks (GNNs) are powerful models for graph-structured data,
with broad applications. However, the interplay between GNN parameter
optimization, expressivity, and generalization remains poorly understood. We
address this by introducing an efficient learnable dimensionality reduction
method for visualizing GNN loss landscapes, and by analyzing the effects of
over-smoothing, jumping knowledge, quantization, sparsification, and
preconditioning on GNN optimization. Our learnable projection method surpasses
the state-of-the-art PCA-based approach, enabling accurate reconstruction of
high-dimensional parameters with lower memory usage. We further show that
architecture, sparsification, and the optimizer’s preconditioning significantly
impact the GNN optimization landscape, the training process, and the final
prediction performance. These insights contribute to developing more efficient
designs of GNN architectures and training strategies.
[LINK]
http://arxiv.org/abs/2509.11792v1
[DATE]
2025-09-15 19:22:55+08:00
[CATEGORIES]
cs.LG
Watch Your Step: A Cost-Sensitive Framework for Accelerometer-Based Fall Detection in Real-World Streaming Scenarios
[AUTHORS]
Timilehin B. Aderinola, Luca Palmerini, Ilaria D’Ascanio, Lorenzo Chiari, Jochen Klenk, Clemens Becker, Brian Caulfield, Georgiana Ifrim
[ABSTRACT]
Real-time fall detection is crucial for enabling timely interventions and
mitigating the severe health consequences of falls, particularly in older
adults. However, existing methods often rely on simulated data or assumptions
such as prior knowledge of fall events, limiting their real-world
applicability. Practical deployment also requires efficient computation and
robust evaluation metrics tailored to continuous monitoring. This paper
presents a real-time fall detection framework for continuous monitoring without
prior knowledge of fall events. Using over 60 hours of inertial measurement
unit (IMU) data from the FARSEEING real-world falls dataset, we employ recent
efficient classifiers to compute fall probabilities in streaming mode. To
enhance robustness, we introduce a cost-sensitive learning strategy that tunes
the decision threshold using a cost function reflecting the higher risk of
missed falls compared to false alarms. Unlike many methods that achieve high
recall only at the cost of precision, our framework achieved Recall of 1.00,
Precision of 0.84, and an F1 score of 0.91 on FARSEEING, detecting all falls
while keeping false alarms low, with average inference time below 5 ms per
sample. These results demonstrate that cost-sensitive threshold tuning enhances
the robustness of accelerometer-based fall detection. They also highlight the
potential of our computationally efficient framework for deployment in
real-time wearable sensor systems for continuous monitoring.
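The cost-sensitive threshold tuning described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the 10:1 cost ratio, the candidate grid, and the function names are assumptions, and the classifier that produces the fall probabilities is left out. The last helper only re-checks the arithmetic behind the reported F1 score.

```python
from typing import List, Optional, Sequence, Tuple


def total_cost(
    probs: Sequence[float],
    labels: Sequence[int],
    threshold: float,
    cost_fn: float,
    cost_fp: float,
) -> float:
    """Weighted count of missed falls (FN) and false alarms (FP)."""
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return cost_fn * fn + cost_fp * fp


def tune_threshold(
    probs: Sequence[float],
    labels: Sequence[int],
    cost_fn: float = 10.0,  # assumed: a missed fall costs 10x a false alarm
    cost_fp: float = 1.0,
    candidates: Optional[List[float]] = None,
) -> Tuple[float, float]:
    """Return (threshold, cost) minimising the cost on validation data."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    # Ties are broken in favour of the lowest threshold (higher recall).
    best = min(
        candidates,
        key=lambda t: (total_cost(probs, labels, t, cost_fn, cost_fp), t),
    )
    return best, total_cost(probs, labels, best, cost_fn, cost_fp)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With Precision 0.84 and Recall 1.00, `f1_score` gives 2 × 0.84 / 1.84 ≈ 0.913, consistent with the reported F1 of 0.91.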
[LINK]
http://arxiv.org/abs/2509.11789v1
[DATE]
2025-09-15 19:19:42+08:00
[CATEGORIES]
cs.LG
Greedy Low-Rank Gradient Compression for Distributed Learning with Convergence Guarantees
[AUTHORS]
Chuyan Chen, Yutong He, Pengrui Li, Weichen Jia, Kun Yuan
[ABSTRACT]
Distributed optimization is pivotal for large-scale signal processing and
machine learning, yet communication overhead remains a major bottleneck.
Low-rank gradient compression, in which the transmitted gradients are
approximated by low-rank matrices to reduce communication, offers a promising
remedy. Existing methods typically adopt either randomized or greedy
compression strategies: randomized approaches project gradients onto randomly
chosen subspaces, introducing high variance and degrading empirical
performance; greedy methods select the most informative subspaces, achieving
strong empirical results but lacking convergence guarantees. To address this
gap, we propose GreedyLore, the first Greedy Low-Rank gradient compression
algorithm for distributed learning with rigorous convergence guarantees.
GreedyLore incorporates error feedback to correct the bias introduced by greedy
compression and introduces a semi-lazy subspace update that ensures the
compression operator remains contractive throughout all iterations. With these
techniques, we prove that GreedyLore achieves a convergence rate of
$\mathcal{O}(\sigma/\sqrt{NT} + 1/T)$ under standard optimizers such as MSGD
and Adam, marking the first linear speedup convergence rate for low-rank
gradient compression. Extensive experiments are conducted to validate our
theoretical findings.
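As a rough illustration of how error feedback corrects the bias of a greedy low-rank compressor, the sketch below uses a rank-1 compressor built with plain power iteration. It is a generic textbook-style construction under stated assumptions, not GreedyLore itself: the semi-lazy subspace update, the optimizer, and the multi-worker averaging are all omitted, and every function name here is hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a: Matrix, b: Matrix) -> Matrix:
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rank1_approx(m: Matrix, iters: int = 50) -> Matrix:
    """Greedy rank-1 compressor: project m onto its dominant right
    singular direction, estimated by power iteration on m^T m.
    Caveat: if the all-ones start vector lies in the null space of m,
    the result is the zero matrix."""
    rows, cols = len(m), len(m[0])
    v = [1.0] * cols
    for _ in range(iters):
        u = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        v = [sum(m[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = sum(x * x for x in v) ** 0.5
        if norm == 0.0:
            return [[0.0] * cols for _ in range(rows)]
        v = [x / norm for x in v]
    u = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return [[u[i] * v[j] for j in range(cols)] for i in range(rows)]


def compress_with_error_feedback(
    grad: Matrix, error: Matrix
) -> Tuple[Matrix, Matrix]:
    """Add the carried-over residual, compress, and keep what was lost."""
    corrected = add(grad, error)
    sent = rank1_approx(corrected)
    return sent, sub(corrected, sent)
```

Because each step transmits `grad + error` minus the new residual, the sum of everything transmitted plus the final residual equals the sum of the true gradients, so information dropped by the compressor is delayed rather than lost.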
[COMMENTS]
17 pages, 5 figures
[LINK]
http://arxiv.org/abs/2507.08784v3
[DATE]
2025-09-15 19:06:25+08:00
[CATEGORIES]
cs.LG
Industrial Energy Disaggregation with Digital Twin-generated Dataset and Efficient Data Augmentation
[AUTHORS]
Christian Internò, Andrea Castellani, Sebastian Schmitt, Fabio Stella, Barbara Hammer
[ABSTRACT]
Industrial Non-Intrusive Load Monitoring (NILM) is limited by the scarcity of
high-quality datasets and the complex variability of industrial energy
consumption patterns. To address data scarcity and privacy issues, we introduce
the Synthetic Industrial Dataset for Energy Disaggregation (SIDED), an
open-source dataset generated using Digital Twin simulations. SIDED includes
three types of industrial facilities across three different geographic
locations, capturing diverse appliance behaviors, weather conditions, and load
profiles. We also propose the Appliance-Modulated Data Augmentation (AMDA)
method, a computationally efficient technique that enhances NILM model
generalization by intelligently scaling appliance power contributions based on
their relative impact. We show in experiments that NILM models trained with
AMDA-augmented data significantly improve the disaggregation of energy
consumption of complex industrial appliances like combined heat and power
systems. Specifically, in our out-of-sample scenarios, models trained with AMDA
achieved a Normalized Disaggregation Error of 0.093, outperforming models
trained without data augmentation (0.451) and those trained with random data
augmentation (0.290). Data distribution analyses confirm that AMDA effectively
aligns training and test data distributions, enhancing model generalization.
[LINK]
http://arxiv.org/abs/2506.20525v2
[DATE]
2025-09-15 18:51:28+08:00
[CATEGORIES]
cs.LG
Stabilizing PINNs: A regularization scheme for PINN training to avoid unstable fixed points of dynamical systems
[AUTHORS]
Milos Babic, Franz M. Rohrhofer, Bernhard C. Geiger
[ABSTRACT]
It was recently shown that the loss function used for training
physics-informed neural networks (PINNs) exhibits local minima at solutions
corresponding to fixed points of dynamical systems. In the forward setting,
where the PINN is trained to solve initial value problems, these local minima
can interfere with training and potentially lead to physically incorrect
solutions. Building on stability theory, this paper proposes a regularization
scheme that penalizes solutions corresponding to unstable fixed points.
Experimental results on four dynamical systems, including the Lotka-Volterra
model and the van der Pol oscillator, show that our scheme helps avoid
physically incorrect solutions and substantially improves the training success
rate of PINNs.
[COMMENTS]
8 pages, 3 figures
[LINK]
http://arxiv.org/abs/2509.11768v1
[DATE]
2025-09-15 18:44:30+08:00
[CATEGORIES]
cs.LG
Data Fusion and Machine Learning for Ship Fuel Consumption Modelling – A Case of Bulk Carrier Vessel
[AUTHORS]
Abdella Mohamed, Xiangyu Hu, Christian Hendricks
[ABSTRACT]
There is an increasing push for operational measures to reduce ships’ bunker
fuel consumption and carbon emissions, driven by the International Maritime
Organization (IMO) mandates. Key performance indicators such as the Energy
Efficiency Operational Indicator (EEOI) focus on fuel efficiency. Strategies
like trim optimization, virtual arrival, and green routing have emerged. The
theoretical basis for these approaches lies in accurate prediction of fuel
consumption as a function of sailing speed, displacement, trim, climate, and
sea state. This study utilized 296 voyage reports from a bulk carrier vessel
over one year (November 16, 2021 to November 21, 2022) and 28 parameters,
integrating hydrometeorological big data from the Copernicus Marine Environment
Monitoring Service (CMEMS) with 19 parameters and the European Centre for
Medium-Range Weather Forecasts (ECMWF) with 61 parameters. The objective was to
evaluate whether fusing external public data sources enhances modeling accuracy
and to highlight the most influential parameters affecting fuel consumption.
The results reveal a strong potential for machine learning techniques to
predict ship fuel consumption accurately by combining voyage reports with
climate and sea data. However, validation on similar classes of vessels remains
necessary to confirm generalizability.
[COMMENTS]
44 pages, 6 figures, preprint version
[LINK]
http://arxiv.org/abs/2509.11750v1
[DATE]
2025-09-15 18:01:14+08:00
[CATEGORIES]
cs.LG
Likelihood Ratio Tests by Kernel Gaussian Embedding
[AUTHORS]
Leonardo V. Santoro, Victor M. Panaretos
[ABSTRACT]
We propose a novel kernel-based nonparametric two-sample test, employing the
combined use of kernel mean and kernel covariance embedding. Our test builds on
recent results showing how such combined embeddings map distinct probability
measures to mutually singular Gaussian measures on the kernel’s RKHS.
Leveraging this “separation of measure phenomenon”, we construct a test
statistic based on the relative entropy between the Gaussian embeddings, in
effect the likelihood ratio. The likelihood ratio is specifically tailored to
detect equality versus singularity of two Gaussians, and satisfies a
“$0/\infty$” law, in that it vanishes under the null and diverges under the
alternative. To implement the test in finite samples, we introduce a
regularised version, calibrated by way of permutation. We prove consistency,
establish uniform power guarantees under mild conditions, and discuss how our
framework unifies and extends prior approaches based on spectrally regularized
MMD. Empirical results on synthetic and real data demonstrate remarkable gains
in power compared to state-of-the-art methods, particularly in high-dimensional
and weak-signal regimes.
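Calibration by permutation, as mentioned above, can be sketched generically: the test statistic is recomputed on random relabelings of the pooled sample to approximate its null distribution. The statistic used below (absolute difference of means) is only a placeholder and an assumption of this sketch; the abstract's regularised likelihood-ratio statistic is not reproduced here.

```python
import random
from typing import Callable, Sequence


def permutation_p_value(
    x: Sequence[float],
    y: Sequence[float],
    statistic: Callable[[Sequence[float], Sequence[float]], float],
    n_perm: int = 999,
    seed: int = 0,
) -> float:
    """Permutation p-value for a two-sample statistic (large = evidence
    against the null hypothesis that x and y share a distribution)."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        # Under the null, group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        if statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # The +1 terms count the observed labeling itself, keeping the test valid.
    return (exceed + 1) / (n_perm + 1)


def abs_mean_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Placeholder statistic; any two-sample statistic can be plugged in."""
    return abs(sum(a) / len(a) - sum(b) / len(b))
```

Rejecting when the returned p-value is at most the chosen level controls the type I error without requiring the statistic's null distribution in closed form.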
[LINK]
http://arxiv.org/abs/2508.07982v2
[DATE]
2025-09-15 18:00:49+08:00
[CATEGORIES]
cs.LG
Analysing Python Machine Learning Notebooks with Moose
[AUTHORS]
Marius Mignard, Steven Costiou, Nicolas Anquetil, Anne Etien
[ABSTRACT]
Machine Learning (ML) code, particularly within notebooks, often exhibits
lower quality compared to traditional software. Bad practices arise at three
distinct levels: general Python coding conventions, the organizational
structure of the notebook itself, and ML-specific aspects such as
reproducibility and correct API usage. However, existing analysis tools
typically focus on only one of these levels and struggle to capture ML-specific
semantics, limiting their ability to detect issues. This paper introduces
Vespucci Linter, a static analysis tool with multi-level capabilities, built on
Moose and designed to address this challenge. Leveraging a metamodeling
approach that unifies the notebook’s structural elements with Python code
entities, our linter enables a more contextualized analysis to identify issues
across all three levels. We implemented 22 linting rules derived from the
literature and applied our tool to a corpus of 5,000 notebooks from the Kaggle
platform. The results reveal violations at all levels, validating the relevance
of our multi-level approach and demonstrating Vespucci Linter’s potential to
improve the quality and reliability of ML development in notebook environments.
[LINK]
http://arxiv.org/abs/2509.11748v1
[DATE]
2025-09-15 17:59:49+08:00
[CATEGORIES]
cs.LG
Kernel Embeddings and the Separation of Measure Phenomenon
[AUTHORS]
Leonardo V. Santoro, Kartik G. Waghmare, Victor M. Panaretos
[ABSTRACT]
We prove that kernel covariance embeddings lead to information-theoretically
perfect separation of distinct probability distributions. In statistical terms,
we establish that testing for the equality of two probability measures on a
compact and separable metric space is equivalent to testing for the singularity
between two centered Gaussian measures on a reproducing kernel Hilbert Space.
The corresponding Gaussians are defined via the notion of kernel covariance
embedding of a probability measure, and the Hilbert space is that generated by
the embedding kernel. Distinguishing singular Gaussians is fundamentally
simpler from an information-theoretic perspective than non-parametric
two-sample testing, particularly in complex or high-dimensional domains. This
is because singular Gaussians are supported on essentially separate and affine
subspaces. Our proof leverages the classical Feldman-Hajek dichotomy, and shows
that even a small perturbation of a distribution will be maximally magnified
through its Gaussian embedding. This “separation of measure phenomenon”
appears to be a blessing of infinite dimensionality, by means of embedding,
with the potential to inform the design of efficient inference tools in
considerable generality. The elicitation of this phenomenon also appears to
crystallize, in a precise and simple mathematical statement, the outstanding
empirical effectiveness of the so-called “kernel trick”.
[LINK]
http://arxiv.org/abs/2505.04613v2
[DATE]
2025-09-15 17:35:15+08:00
[CATEGORIES]
cs.LG
DRAG: Data Reconstruction Attack using Guided Diffusion
[AUTHORS]
Wa-Kin Lei, Jun-Cheng Chen, Shang-Tse Chen
[ABSTRACT]
With the rise of large foundation models, split inference (SI) has emerged as
a popular computational paradigm for deploying models across lightweight edge
devices and cloud servers, addressing data privacy and computational cost
concerns. However, most existing data reconstruction attacks have focused on
smaller CNN classification models, leaving the privacy risks of foundation
models in SI settings largely unexplored. To address this gap, we propose a
novel data reconstruction attack based on guided diffusion, which leverages the
rich prior knowledge embedded in a latent diffusion model (LDM) pre-trained on
a large-scale dataset. Our method performs iterative reconstruction on the
LDM’s learned image prior, effectively generating high-fidelity images
resembling the original data from their intermediate representations (IR).
Extensive experiments demonstrate that our approach significantly outperforms
state-of-the-art methods, both qualitatively and quantitatively, in
reconstructing data from deep-layer IRs of the vision foundation model. The
results highlight the urgent need for more robust privacy protection mechanisms
for large models in SI scenarios. Code is available at:
https://github.com/ntuaislab/DRAG.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2509.11724v1
[DATE]
2025-09-15 17:26:19+08:00
[CATEGORIES]
cs.LG
Neural Audio Codecs for Prompt-Driven Universal Source Separation
[AUTHORS]
Adhiraj Banerjee, Vipul Arora
[ABSTRACT]
Text-guided source separation supports flexible audio editing across media
and assistive applications, but existing models like AudioSep are too
compute-heavy for edge deployment. Neural audio codec (NAC) models such as
CodecFormer and SDCodec are compute-efficient but limited to fixed-class
separation. We introduce CodecSep, the first NAC-based model for on-device
universal, text-driven separation. CodecSep combines DAC compression with a
Transformer masker modulated by CLAP-derived FiLM parameters. Across six
open-domain benchmarks under matched training/prompt protocols,
CodecSep surpasses AudioSep in separation fidelity (SI-SDR)
while remaining competitive in perceptual quality (ViSQOL) and matching or
exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream
deployments, it needs just 1.35 GMACs end-to-end – approximately $54\times$
less compute ($25\times$ architecture-only) than spectrogram-domain separators
like AudioSep – while remaining fully bitstream-compatible.
[COMMENTS]
21 pages, 1 figure, pre-print, under review
[LINK]
http://arxiv.org/abs/2509.11717v1
[DATE]
2025-09-15 17:12:57+08:00
[CATEGORIES]
cs.LG
Beyond Regularity: Modeling Chaotic Mobility Patterns for Next Location Prediction
[AUTHORS]
Yuqian Wu, Yuhong Peng, Jiapeng Yu, Xiangyu Liu, Zeting Yan, Kang Lin, Weifeng Su, Bingqing Qu, Raymond Lee, Dingqi Yang
[ABSTRACT]
Next location prediction is a key task in human mobility analysis, crucial
for applications like smart city resource allocation and personalized
navigation services. However, existing methods face two significant challenges:
first, they fail to address the dynamic imbalance between periodic and chaotic
mobility patterns, leading to inadequate adaptation over sparse trajectories;
second, they underutilize contextual cues, such as temporal regularities in
arrival times, which persist even in chaotic patterns and offer stronger
predictability than spatial forecasts due to reduced search spaces. To tackle
these challenges, we propose CANOE, a ChAotic Neural Oscillator nEtwork for
next location prediction, which introduces a biologically inspired Chaotic
Neural Oscillatory
Attention mechanism to inject adaptive variability into traditional attention,
enabling balanced representation of evolving mobility behaviors, and employs a
Tri-Pair Interaction Encoder along with a Cross Context Attentive Decoder to
fuse multimodal “who-when-where” contexts in a joint framework for enhanced
prediction performance. Extensive experiments on two real-world datasets
demonstrate that CANOE consistently and significantly outperforms a sizeable
collection of state-of-the-art baselines, yielding 3.17%-13.11% improvement
over the best-performing baselines across different cases. In particular, CANOE
can make robust predictions over mobility trajectories with different levels of
chaos. A series of ablation studies also supports our key design
choices. Our code is available at: https://github.com/yuqian2003/CANOE.
[COMMENTS]
12 pages, 5 figures
[LINK]
http://arxiv.org/abs/2509.11713v1
[DATE]
2025-09-15 17:10:48+08:00
[CATEGORIES]
cs.LG
Group Expectation Policy Optimization for Heterogeneous Reinforcement Learning
[AUTHORS]
Han Zhang, Ruibin Zheng, Zexuan Yi, Zhuo Zhang, Hanyang Peng, Hui Wang, Zike Yuan, Cai Ke, Shiwei Chen, Jiacheng Yang, Yangning Li, Xiang Li, Jiangyue Yan, Yaoqi Liu, Liwen Jing, Jiayin Qi, Ruifeng Xu, Binxing Fang, Yue Yu
[ABSTRACT]
As single-center computing approaches power constraints, decentralized
training is becoming essential. Reinforcement Learning (RL) post-training
enhances Large Language Models (LLMs) but faces challenges in heterogeneous
distributed environments due to its tightly-coupled sampling-learning
alternation. We propose HeteroRL, an asynchronous RL architecture that
decouples rollout sampling from parameter learning, enabling robust deployment
across geographically distributed nodes under network delays. We identify that
latency-induced KL divergence causes importance sampling failure due to high
variance. To address this, we propose Group Expectation Policy Optimization
(GEPO), which reduces importance weight variance through a refined sampling
mechanism. Theoretically, GEPO achieves exponential variance reduction.
Experiments show it maintains superior stability over methods like GRPO, with
less than 3% performance degradation under 1800-second delays, demonstrating
strong potential for decentralized RL in heterogeneous networks.
[LINK]
http://arxiv.org/abs/2508.17850v4
[DATE]
2025-09-15 17:08:09+08:00
[CATEGORIES]
cs.LG
Feasibility of In-Ear Single-Channel ExG for Wearable Sleep Monitoring in Real-World Settings
[AUTHORS]
Philipp Lepold, Jonas Leichtle, Tobias Röddiger, Michael Beigl
[ABSTRACT]
Automatic sleep staging typically relies on gold-standard EEG setups, which
are accurate but obtrusive and impractical for everyday use outside sleep
laboratories. This limits applicability in real-world settings, such as home
environments, where continuous, long-term monitoring is needed. Detecting sleep
onset is particularly relevant, enabling consumer applications (e.g.
automatically pausing media playback when the user falls asleep). Recent
research has shown correlations between in-ear EEG and full-scalp EEG for
various phenomena, suggesting wearable, in-ear devices could allow unobtrusive
sleep monitoring. We investigated the feasibility of using single-channel
in-ear electrophysiological (ExG) signals for automatic sleep staging in a
wearable device by conducting a sleep study with 11 participants (mean age:
24), using a custom earpiece with a dry eartip electrode (Dätwyler SoftPulse)
as a measurement electrode in one ear and a reference in the other. Ground
truth sleep stages were obtained from an Apple Watch Ultra, validated for sleep
staging. Our system achieved 90.5% accuracy for binary sleep detection (Awake
vs. Asleep) and 65.1% accuracy for four-class staging (Awake, REM, Core, Deep)
using leave-one-subject-out validation. These findings demonstrate the
potential of in-ear electrodes as a low-effort, comfortable approach to sleep
monitoring, with applications such as stopping podcasts when users fall asleep.
[LINK]
http://arxiv.org/abs/2509.07896v2
[DATE]
2025-09-15 16:34:10+08:00
[CATEGORIES]
cs.LG
Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models
[AUTHORS]
Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang
[ABSTRACT]
Graph foundation models, inspired by the success of LLMs, are designed to
learn the optimal embedding from multi-domain TAGs for the downstream
cross-task generalization capability. During our investigation, graph VQ-MAE
stands out among the increasingly diverse landscape of GFM architectures. This
is attributed to its ability to jointly encode topology and textual attributes
from multiple domains into discrete embedding spaces with clear semantic
boundaries. Despite its potential, domain generalization conflicts cause
imperceptible pitfalls. In this paper, we instantiate two of them, and they are
just like two sides of the same GFM optimization coin - Side 1 Model
Degradation: The encoder and codebook fail to capture the diversity of inputs;
Side 2 Representation Collapse: The hidden embedding and codebook vector fail
to preserve semantic separability due to constraints from narrow representation
subspaces. These two pitfalls (sides) collectively impair the decoder and
generate low-quality reconstructed supervision, causing the GFM
optimization dilemma during pre-training (coin). Through empirical
investigation, we attribute the above challenges to Information Bottleneck and
Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) -
(1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic
fusion strategy and a mixture-of-codebooks with domain-aware routing to improve
information capacity. (2) Regularization Tinker for Optimization Coin, which
utilizes two additional regularizations to further improve gradient supervision
in our proposed Information Tinker. Notably, as a flexible architecture, MoT
adheres to the scaling laws of GFM, offering a controllable model scale.
Compared to SOTA baselines, experiments on 22 datasets across 6 domains
demonstrate that MoT achieves significant improvements in supervised, few-shot,
and zero-shot scenarios.
[LINK]
http://arxiv.org/abs/2509.08401v3
[DATE]
2025-09-15 16:24:50+08:00
[CATEGORIES]
cs.LG
SpaPool: Soft Partition Assignment Pooling for Graph Neural Networks
[AUTHORS]
Rodrigue Govan, Romane Scherrer, Philippe Fournier-Viger, Nazha Selmaoui-Folcher
[ABSTRACT]
This paper introduces SpaPool, a novel pooling method that combines the
strengths of both dense and sparse techniques for a graph neural network.
SpaPool groups vertices into an adaptive number of clusters, leveraging the
benefits of both dense and sparse approaches. It aims to maintain the
structural integrity of the graph while reducing its size efficiently.
Experimental results on several datasets demonstrate that SpaPool achieves
competitive performance compared to existing pooling techniques and excels
particularly on small-scale graphs. This makes SpaPool a promising method for
applications requiring efficient and effective graph processing.
[LINK]
http://arxiv.org/abs/2509.11675v1
[DATE]
2025-09-15 16:16:40+08:00
[CATEGORIES]
cs.LG
Intrinsic Dimension Estimating Autoencoder (IDEA) Using CancelOut Layer and a Projected Loss
[AUTHORS]
Antoine Oriou, Philipp Krah, Julian Koellermeier
[ABSTRACT]
This paper introduces the Intrinsic Dimension Estimating Autoencoder (IDEA),
which identifies the underlying intrinsic dimension of a wide range of datasets
whose samples lie on either linear or nonlinear manifolds. Beyond estimating
the intrinsic dimension, IDEA is also able to reconstruct the original dataset
after projecting it onto the corresponding latent space, which is structured
using re-weighted double CancelOut layers. Our key contribution is the
introduction of the projected reconstruction loss term, guiding the training of
the model by continuously assessing the reconstruction quality under the
removal of an additional latent dimension. We first assess the performance of
IDEA on a series of theoretical benchmarks to validate its robustness. These
experiments allow us to test its reconstruction ability and compare its
performance with state-of-the-art intrinsic dimension estimators. The
benchmarks show good accuracy and high versatility of our approach.
Subsequently, we apply our model to data generated from the numerical solution
of a vertically resolved one-dimensional free-surface flow, following a
pointwise discretization of the vertical velocity profile in the horizontal
direction, vertical direction, and time. IDEA succeeds in estimating the
dataset’s intrinsic dimension and then reconstructs the original solution by
working directly within the projection space identified by the network.
[COMMENTS]
Preprint with 12 pages and 12 figures
[LINK]
http://arxiv.org/abs/2509.10011v2
[DATE]
2025-09-15 16:02:19+08:00
[CATEGORIES]
cs.LG
C3DE: Causal-Aware Collaborative Neural Controlled Differential Equation for Long-Term Urban Crowd Flow Prediction
[AUTHORS]
Yuting Liu, Qiang Zhou, Hanzhe Li, Chenqi Gong, Jingjing Gu
[ABSTRACT]
Long-term urban crowd flow prediction suffers significantly from cumulative
sampling errors, due to increased sequence lengths and sampling intervals,
which inspired us to leverage Neural Controlled Differential Equations (NCDEs)
to mitigate this issue. However, regarding the crucial influence of Points of
Interest (POIs) evolution on long-term crowd flow, the multi-timescale
asynchronous dynamics between crowd flow and POI distribution, coupled with
latent spurious causality, poses challenges to applying NCDEs for long-term
urban crowd flow prediction. To this end, we propose Causal-aware Collaborative
neural CDE (C3DE) to model the long-term dynamic of crowd flow. Specifically,
we introduce a dual-path NCDE as the backbone to effectively capture the
asynchronous evolution of collaborative signals across multiple time scales.
Then, we design a dynamic correction mechanism with the counterfactual-based
causal effect estimator to quantify the causal impact of POIs on crowd flow and
minimize the accumulation of spurious correlations. Finally, we leverage a
predictor for long-term prediction with the fused collaborative signals of POI
and crowd flow. Extensive experiments on three real-world datasets demonstrate
the superior performance of C3DE, particularly in cities with notable flow
fluctuations.
[LINK]
http://arxiv.org/abs/2509.12289v1
[DATE]
2025-09-15 15:24:39+08:00
[CATEGORIES]
cs.LG
Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications
[AUTHORS]
Yoon Pyo Lee
[ABSTRACT]
The integration of Large Language Models (LLMs) into safety-critical domains,
such as nuclear engineering, necessitates a deep understanding of their
internal reasoning processes. This paper presents a novel methodology for
interpreting how an LLM encodes and utilizes domain-specific knowledge, using a
Boiling Water Reactor system as a case study. We adapted a general-purpose LLM
(Gemma-3-1b-it) to the nuclear domain using a parameter-efficient fine-tuning
technique known as Low-Rank Adaptation. By comparing the neuron activation
patterns of the base model to those of the fine-tuned model, we identified a
sparse set of neurons whose behavior was significantly altered during the
adaptation process. To probe the causal role of these specialized neurons, we
employed a neuron silencing technique. Our results demonstrate that while
silencing most of these specialized neurons individually did not produce a
statistically significant effect, deactivating the entire group collectively
led to a statistically significant degradation in task performance. Qualitative
analysis further revealed that silencing these neurons impaired the model’s
ability to generate detailed, contextually accurate technical information. This
paper provides a concrete methodology for enhancing the transparency of an
opaque black-box model, allowing domain expertise to be traced to verifiable
neural circuits. This offers a pathway towards achieving nuclear-grade
artificial intelligence (AI) assurance, addressing the verification and
validation challenges mandated by nuclear regulatory frameworks (e.g., 10 CFR
50 Appendix B), which have limited AI deployment in safety-critical nuclear
operations.
[COMMENTS]
Accepted for publication in Nuclear Technology. 24 pages, 2 tables, 4
figures
[LINK]
http://arxiv.org/abs/2507.09931v2
[DATE]
2025-09-15 15:21:30+08:00
[CATEGORIES]
cs.LG
FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction
[AUTHORS]
Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly
[ABSTRACT]
Per- and polyfluoroalkyl substances (PFAS), chemicals found in products like
non-stick cookware, are unfortunately persistent environmental pollutants with
severe health risks. Accurately mapping PFAS contamination is crucial for
guiding targeted remediation efforts and protecting public and environmental
health, yet detection across large regions remains challenging due to the cost
of testing and the difficulty of simulating their spread. In this work, we
introduce FOCUS, a geospatial deep learning framework with a label noise-aware
loss function, to predict PFAS contamination in surface water over large
regions. By integrating hydrological flow data, land cover information, and
proximity to known PFAS sources, our approach leverages both spatial and
environmental context to improve prediction accuracy. We evaluate the
performance of our approach through extensive ablation studies, robustness
analysis, real-world validation, and comparative analyses against baselines
like sparse segmentation, as well as existing scientific methods, including
Kriging and pollutant transport simulations. Results and expert feedback
highlight our framework’s potential for scalable PFAS monitoring.
[LINK]
http://arxiv.org/abs/2502.14894v2
[DATE]
2025-09-15 15:20:49+08:00
[CATEGORIES]
cs.LG
‘Hello, World!’: Making GNNs Talk with LLMs
[AUTHORS]
Sunwoo Kim, Soo Yong Lee, Jaemin Yoo, Kijung Shin
[COMMENTS]
Published as a conference paper at EMNLP 2025 Findings. Code and
datasets are in https://github.com/kswoo97/GLN-Code
[LINK]
http://arxiv.org/abs/2505.20742v2
[DATE]
2025-09-15 15:07:30+08:00
[CATEGORIES]
cs.LG
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
[AUTHORS]
Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, Baris Kasikci
[ABSTRACT]
Retrieval-augmented generation (RAG) extends large language models (LLMs)
with external data sources to enhance factual correctness and domain coverage.
Modern RAG pipelines rely on large datastores, leading to system challenges in
latency-sensitive deployments, especially when GPU memory is limited. To
address these challenges, we propose TeleRAG, an efficient inference system
that reduces RAG latency with minimal GPU memory requirements. The core
innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that
anticipates required data and transfers it from CPU to GPU in parallel with LLM
generation. By leveraging the modularity of RAG pipelines, the inverted file
index (IVF) search algorithm and similarities between queries, TeleRAG
optimally overlaps data movement and computation. Experimental results
demonstrate that TeleRAG achieves up to a 1.53x average reduction in end-to-end
latency for single-query inference and up to 1.83x average improvement in
throughput for batch-query scenarios compared to state-of-the-art systems. This
confirms the practical utility of TeleRAG for faster and more memory-efficient
deployments of advanced RAG applications.
[LINK]
http://arxiv.org/abs/2502.20969v2
[DATE]
2025-09-15 14:58:50+08:00
[CATEGORIES]
cs.LG
Adaptive-GraphSketch: Real-Time Edge Anomaly Detection via Multi-Layer Tensor Sketching and Temporal Decay
[AUTHORS]
Ocheme Anthony Ekle, William Eberle
[ABSTRACT]
Anomaly detection in dynamic graphs is essential for identifying malicious
activities, fraud, and unexpected behaviors in real-world systems such as
cybersecurity and power grids. However, existing approaches struggle with
scalability, probabilistic interpretability, and adaptability to evolving
traffic patterns. In this paper, we propose ADAPTIVE-GRAPHSKETCH, a lightweight
and scalable framework for real-time anomaly detection in streaming edge data.
Our method integrates temporal multi-tensor sketching with Count-Min Sketch
using Conservative Update (CMS-CU) to compactly track edge frequency patterns
with bounded memory, while mitigating hash collision issues. We incorporate
Bayesian inference for probabilistic anomaly scoring and apply Exponentially
Weighted Moving Average (EWMA) for adaptive thresholding tuned to burst
intensity. Extensive experiments on four real-world intrusion detection
datasets demonstrate that ADAPTIVE-GRAPHSKETCH outperforms state-of-the-art
baselines such as ANOEDGE-G/L, MIDAS-R, and F-FADE, achieving up to 6.5% AUC
gain on CIC-IDS2018 and up to 15.6% on CIC-DDoS2019, while processing 20
million edges in under 3.4 seconds using only 10 hash functions. Our results
show that ADAPTIVE-GRAPHSKETCH is practical and effective for fast, accurate
anomaly detection in large-scale streaming graphs.
Keywords: Anomaly Detection, Streaming, Real-time, Dynamic Graphs, Edge
Streams, Tensor Sketching
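
The Count-Min Sketch with Conservative Update named in the abstract can be
illustrated with a minimal sketch. This is not the authors' implementation:
the class name, the salted BLAKE2 hashing, and the table sizes are assumptions
chosen for illustration, and the temporal decay, Bayesian scoring, and EWMA
thresholding steps are omitted.

```python
import hashlib


class CountMinSketchCU:
    """Count-Min Sketch with conservative update (illustrative only)."""

    def __init__(self, depth=4, width=1024):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting the key with the row id.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, count=1):
        cells = [(r, self._index(r, key)) for r in range(self.depth)]
        # Conservative update: raise only the counters that fall below
        # (current estimate + count) instead of incrementing every row.
        target = min(self.table[r][c] for r, c in cells) + count
        for r, c in cells:
            if self.table[r][c] < target:
                self.table[r][c] = target

    def estimate(self, key):
        # The minimum over rows is an upper bound on the true frequency.
        return min(self.table[r][self._index(r, key)] for r in range(self.depth))


# Hypothetical edge stream: each edge is a (source, destination) pair.
sketch = CountMinSketchCU(depth=4, width=64)
for _ in range(5):
    sketch.update(("10.0.0.1", "10.0.0.2"))
sketch.update(("10.0.0.3", "10.0.0.4"))
```

Conservative update keeps the one-sided error of the plain Count-Min Sketch
(estimates never fall below the true count) while reducing the overestimation
caused by hash collisions.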
[COMMENTS]
10 pages, 6 figures. Accepted for presentation at the IEEE
International Conference on Knowledge Graphs (ICKG 2025). This is the authors'
accepted version; the final published paper will be available via IEEE Xplore
[LINK]
http://arxiv.org/abs/2509.11633v1
[DATE]
2025-09-15 14:57:35+08:00
[CATEGORIES]
cs.LG
SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
[AUTHORS]
Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, Linfeng Zhang
[ABSTRACT]
Diffusion models have revolutionized high-fidelity image and video synthesis,
yet their computational demands remain prohibitive for real-time applications.
These models face two fundamental challenges: strict temporal dependencies
preventing parallelization, and computationally intensive forward passes
required at each denoising step. Drawing inspiration from speculative decoding
in large language models, we present SpeCa, a novel ‘Forecast-then-verify’
acceleration framework that effectively addresses both limitations. SpeCa’s
core innovation lies in introducing Speculative Sampling to diffusion models,
predicting intermediate features for subsequent timesteps based on fully
computed reference timesteps. Our approach implements a parameter-free
verification mechanism that efficiently evaluates prediction reliability,
enabling real-time decisions to accept or reject each prediction while
incurring negligible computational overhead. Furthermore, SpeCa introduces
sample-adaptive computation allocation that dynamically modulates resources
based on generation complexity, allocating reduced computation for simpler
samples while preserving intensive processing for complex instances.
Experiments demonstrate 6.34x acceleration on FLUX with minimal quality
degradation (5.5% drop), 7.3x speedup on DiT while preserving generation
fidelity, and 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The
verification mechanism incurs minimal overhead (1.67%-3.5% of full inference
costs), establishing a new paradigm for efficient diffusion model inference
while maintaining generation quality even at aggressive acceleration ratios.
Our code has been released on GitHub:
https://github.com/Shenyi-Z/Cache4Diffusion
[COMMENTS]
15 pages, 9 figures, ACM Multimedia 2025
[LINK]
http://arxiv.org/abs/2509.11628v1
[DATE]
2025-09-15 14:46:22+08:00
[CATEGORIES]
cs.LG
Murphys Laws of AI Alignment: Why the Gap Always Wins
[AUTHORS]
Madhava Gaikwad
[ABSTRACT]
We study reinforcement learning from human feedback under misspecification.
Sometimes human feedback is systematically wrong on certain types of inputs,
like a broken compass that points the wrong way in specific regions. We prove
that when feedback is biased on a fraction alpha of contexts with bias strength
epsilon, any learning algorithm needs exponentially many samples
exp(n*alpha*epsilon^2) to distinguish between two possible “true” reward
functions that differ only on these problematic contexts. However, if you can
identify where feedback is unreliable (a “calibration oracle”), you can focus
your limited questions there and overcome the exponential barrier with just
O(1/(alpha*epsilon^2)) queries. This quantifies why alignment is hard: rare
edge cases with subtly biased feedback create an exponentially hard learning
problem unless you know where to look.
The gap between what we optimize (proxy from human feedback) and what we want
(true objective) is fundamentally limited by how common the problematic
contexts are (alpha), how wrong the feedback is there (epsilon), and how much
the true objectives disagree there (gamma). Murphy’s Law for AI alignment: the
gap always wins unless you actively route around misspecification.
[COMMENTS]
Provides a formal impossibility theorem (Murphys Gap) and welcomes
collaboration on large-scale experiments and benchmark design
[LINK]
http://arxiv.org/abs/2509.05381v3
[DATE]
2025-09-15 14:39:32+08:00
[CATEGORIES]
cs.LG
Inducing Uncertainty for Test-Time Privacy
[AUTHORS]
Muhammad H. Ashiq, Peter Triantafillou, Hung Yun Tseng, Grigoris G. Chrysos
[ABSTRACT]
Unlearning is the predominant method for removing the influence of data in
machine learning models. However, even after unlearning, models often continue
to produce the same predictions on the unlearned data with high confidence.
This persistent behavior can be exploited by adversaries using confident model
predictions on incorrect or obsolete data to harm users. We call this threat
model, which unlearning fails to protect against, test-time privacy. In
particular, an adversary with full model access can bypass any naive defenses
which ensure test-time privacy. To address this threat, we introduce an
algorithm which perturbs model weights to induce maximal uncertainty on
protected instances while preserving accuracy on the rest of the instances. Our
core algorithm is based on finetuning with a Pareto optimal objective that
explicitly balances test-time privacy against utility. We also provide a
certifiable approximation algorithm which achieves $(\varepsilon, \delta)$
guarantees without convexity assumptions. We then prove a tight, non-vacuous
bound that characterizes the privacy-utility tradeoff that our algorithms
incur. Empirically, our method obtains $>3\times$ stronger uncertainty than
pretraining with $<0.2\%$ drops in accuracy on various image recognition
benchmarks. Altogether, this framework provides a tool to guarantee additional
protection to end users.
[LINK]
http://arxiv.org/abs/2509.11625v1
[DATE]
2025-09-15 14:38:57+08:00
[CATEGORIES]
cs.LG
A Controllable 3D Deepfake Generation Framework with Gaussian Splatting
[AUTHORS]
Wending Liu, Siyun Liang, Huy H. Nguyen, Isao Echizen
[ABSTRACT]
We propose a novel 3D deepfake generation framework based on 3D Gaussian
Splatting that enables realistic, identity-preserving face swapping and
reenactment in a fully controllable 3D space. Compared to conventional 2D
deepfake approaches that suffer from geometric inconsistencies and limited
generalization to novel views, our method combines a parametric head model with
dynamic Gaussian representations to support multi-view consistent rendering,
precise expression control, and seamless background integration. To address
editing challenges in point-based representations, we explicitly separate the
head and background Gaussians and use pre-trained 2D guidance to optimize the
facial region across views. We further introduce a repair module to enhance
visual consistency under extreme poses and expressions. Experiments on
NeRSemble and additional evaluation videos demonstrate that our method achieves
comparable performance to state-of-the-art 2D approaches in identity
preservation, as well as pose and expression consistency, while significantly
outperforming them in multi-view rendering quality and 3D consistency. Our
approach bridges the gap between 3D modeling and deepfake synthesis, enabling
new directions for scene-aware, controllable, and immersive visual forgeries,
revealing the threat that the emerging 3D Gaussian Splatting technique could be
used for manipulation attacks.
[LINK]
http://arxiv.org/abs/2509.11624v1
[DATE]
2025-09-15 14:34:17+08:00
[CATEGORIES]
cs.LG
Topology Structure Optimization of Reservoirs Using GLMY Homology
[AUTHORS]
Yu Chen, Shengwei Wang, Hongwei Lin
[ABSTRACT]
A reservoir is an efficient network for time series processing. It is well
known that network structure is one of the determinants of its performance.
However, the topology structure of reservoirs, as well as their performance, is
hard to analyze, due to the lack of suitable mathematical tools. In this
paper, we study the topology structure of reservoirs using persistent GLMY
homology theory, and develop a method to improve its performance. Specifically,
it is found that the reservoir performance is closely related to the
one-dimensional GLMY homology groups. Then, we develop a reservoir structure
optimization method by modifying the minimal representative cycles of
one-dimensional GLMY homology groups. Finally, by experiments, it is validated
that the performance of reservoirs is jointly influenced by the reservoir
structure and the periodicity of the dataset.
[LINK]
http://arxiv.org/abs/2509.11612v1
[DATE]
2025-09-15 14:11:29+08:00
[CATEGORIES]
cs.LG
Scaling to Multimodal and Multichannel Heart Sound Classification: Fine-Tuning Wav2Vec 2.0 with Synthetic and Augmented Biosignals
[AUTHORS]
Milan Marocchi, Matthew Fynn, Kayapanda Mandana, Yue Rong
[ABSTRACT]
Cardiovascular diseases (CVDs) are the leading cause of death worldwide,
accounting for approximately 17.9 million deaths each year. Early detection is
critical, creating a demand for accurate and inexpensive pre-screening methods.
Deep learning has recently been applied to classify abnormal heart sounds
indicative of CVDs using synchronised phonocardiogram (PCG) and
electrocardiogram (ECG) signals, as well as multichannel PCG (mPCG). However,
state-of-the-art architectures remain underutilised due to the limited
availability of synchronised and multichannel datasets. Augmented datasets and
pre-trained models provide a pathway to overcome these limitations, enabling
transformer-based architectures to be trained effectively. This work combines
traditional signal processing with denoising diffusion models, WaveGrad and
DiffWave, to create an augmented dataset to fine-tune a Wav2Vec 2.0-based
classifier on multimodal and multichannel heart sound datasets. The approach
achieves state-of-the-art performance. On the Computing in Cardiology (CinC)
2016 dataset of single channel PCG, accuracy, unweighted average recall (UAR),
sensitivity, specificity and Matthews correlation coefficient (MCC) reach
92.48%, 93.05%, 93.63%, 92.48%, 94.93% and 0.8283, respectively. Using the
synchronised PCG and ECG signals of the training-a dataset from CinC, 93.14%,
92.21%, 94.35%, 90.10%, 95.12% and 0.8380 are achieved for accuracy, UAR,
sensitivity, specificity and MCC, respectively. Using a wearable vest dataset
consisting of mPCG data, the model achieves 77.13% accuracy, 74.25% UAR,
86.47% sensitivity, 62.04% specificity, and 0.5082 MCC. These results
demonstrate the effectiveness of transformer-based models for CVD detection
when supported by augmented datasets, highlighting their potential to advance
multimodal and multichannel heart sound classification.
[COMMENTS]
35 pages, 37 figures, 19 tables
[LINK]
http://arxiv.org/abs/2509.11606v1
[DATE]
2025-09-15 13:52:41+08:00
[CATEGORIES]
cs.LG
Dynamic Adaptive Parsing of Temporal and Cross-Variable Patterns for Network State Classification
[AUTHORS]
Yuan Gao, Xuelong Wang, Zhenguo Dong, Yong Zhang
[ABSTRACT]
Effective network state classification is a primary task for ensuring network
security and optimizing performance. Existing deep learning models have shown
considerable progress in this area. Some methods excel at analyzing the complex
temporal periodicities found in traffic data, while graph-based approaches are
adept at modeling the dynamic dependencies between different variables.
However, a key trade-off remains, as these methods struggle to capture both
characteristics simultaneously. Models focused on temporal patterns often
overlook crucial variable dependencies, whereas those centered on dependencies
may fail to capture fine-grained temporal details. To address this trade-off,
we introduce DAPNet, a framework based on a Mixture-of-Experts architecture.
DAPNet integrates three specialized networks for periodic analysis, dynamic
cross-variable correlation modeling, and hybrid temporal feature extraction. A
learnable gating network dynamically assigns weights to experts based on the
input sample and computes a weighted fusion of their outputs. Furthermore, a
hybrid regularization loss function ensures stable training and addresses the
common issue of class imbalance. Extensive experiments on two large-scale
network intrusion detection datasets (CICIDS2017/2018) validate DAPNet’s higher
accuracy for its target application. The generalizability of the architectural
design is evaluated across ten public UEA benchmark datasets, positioning
DAPNet as a specialized framework for network state classification.
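
The gated fusion step described above (a gating network assigns per-sample
weights to the experts and their outputs are summed with those weights) can be
sketched as follows. This is a generic Mixture-of-Experts illustration, not
DAPNet itself: the linear gate, the function names, and the toy experts are
assumptions.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def moe_fuse(x, gate_weights, experts):
    """Weighted fusion of expert outputs.

    x: (batch, d) inputs; gate_weights: (d, n_experts) linear gate (assumed);
    experts: list of callables mapping (batch, d) -> (batch, d_out).
    """
    gates = softmax(x @ gate_weights)                      # (batch, n_experts)
    outputs = np.stack([f(x) for f in experts], axis=1)    # (batch, n_experts, d_out)
    fused = (gates[:, :, None] * outputs).sum(axis=1)      # (batch, d_out)
    return fused, gates


# Toy usage: three stand-in "experts" and a gate that weights them equally.
x = np.ones((2, 3))
experts = [lambda v: v, lambda v: 2 * v, lambda v: 3 * v]
fused, gates = moe_fuse(x, np.zeros((3, 3)), experts)
```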
[LINK]
http://arxiv.org/abs/2509.11601v1
[DATE]
2025-09-15 13:32:32+08:00
[CATEGORIES]
cs.LG
Binary Quantization For LLMs Through Dynamic Grouping
[AUTHORS]
Xinzhe Zheng, Zhen-Qun Yang, Haoran Xie, S. Joe Qin, Arlene Chen, Fangzhen Lin
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable performance across
a wide range of Natural Language Processing (NLP) tasks, but require
substantial memory and computational resources. Binary quantization, which
compresses model weights from 16-bit Brain Float to 1-bit representations in
{-1, 1}, offers significant reductions in storage and inference costs. However,
such aggressive quantization often leads to notable performance degradation
compared to more conservative 4-bit quantization methods. In this research, we
propose a novel optimization objective tailored for binary quantization, along
with three algorithms designed to realize it effectively. Our method enhances
blocked quantization by dynamically identifying optimal unstructured
sub-matrices through adaptive grouping strategies. Experimental results
demonstrate that our approach achieves an average bit length of just 1.007
bits, while maintaining high model quality. Specifically, our quantized LLaMA
3.2 3B model attains a perplexity of 8.23, remarkably close to the original
7.81, and surpasses the previous SOTA, BiLLM, whose perplexity is 123.90.
Furthermore, our method is competitive with SOTA 4-bit approaches such as GPTQ
in both performance and efficiency. The compression process is highly
efficient, requiring only 14 seconds to quantize the full LLaMA 3.2 3B weights
on a single CPU core, with the entire process completing in under 100 minutes
and exhibiting embarrassingly parallel properties.
Code - https://github.com/johnnyzheng0636/WGM_bi_quan
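
As background for the blocked quantization mentioned above, a minimal sketch of
sign-plus-scale binarization over fixed contiguous blocks follows. It is not the
authors' method: their approach selects unstructured sub-matrices dynamically,
whereas the fixed block layout, the function name, and the example matrix here
are assumptions for illustration.

```python
import numpy as np


def binarize_blocks(weights, block_size):
    """Approximate each row-block of `weights` as scale * sign(weights)."""
    rows, cols = weights.shape
    assert cols % block_size == 0, "sketch assumes blocks tile the columns exactly"
    n_blocks = cols // block_size
    signs = np.where(weights >= 0, 1.0, -1.0)
    scales = np.empty((rows, n_blocks))
    approx = np.empty_like(weights, dtype=float)
    for b in range(n_blocks):
        sl = slice(b * block_size, (b + 1) * block_size)
        # With the signs fixed, the scale that minimizes the squared error
        # is the mean absolute value of the block.
        scales[:, b] = np.abs(weights[:, sl]).mean(axis=1)
        approx[:, sl] = scales[:, b:b + 1] * signs[:, sl]
    return signs, scales, approx


W = np.array([[0.5, -1.5, 2.0, -2.0]])
signs, scales, approx = binarize_blocks(W, block_size=2)
```

Smaller blocks lower the reconstruction error but store more scales, which is
the trade-off that grouping strategies try to balance.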
[COMMENTS]
An error was identified in the quantization bit width; it is not
binary
[LINK]
http://arxiv.org/abs/2509.03054v2
[DATE]
2025-09-15 13:32:08+08:00
[CATEGORIES]
cs.LG
Timing Matters: Enhancing User Experience through Temporal Prediction in Smart Homes
[AUTHORS]
Shrey Ganatra, Spandan Anaokar, Pushpak Bhattacharyya
[ABSTRACT]
The proliferation of IoT devices generates vast interaction data, offering
insights into user behaviour. While prior work predicts what actions users
perform, the timing of these actions – critical for enabling proactive and
efficient smart systems – remains relatively underexplored. Addressing this
gap, we focus on predicting the time of the next user action in smart
environments. Due to the lack of public datasets with fine-grained timestamps
suitable for this task and associated privacy concerns, we contribute a dataset
of 11.6k sequences synthesized based on human annotations of interaction
patterns, pairing actions with precise timestamps. To this end, we introduce
Timing-Matters, a Transformer-Encoder based method that predicts action timing,
achieving 38.30% accuracy on the synthesized dataset, outperforming the best
baseline by 6%, and showing 1–6% improvements on other open datasets. Our code
and dataset will be publicly released.
[COMMENTS]
7 pages + 1 reference, 5 figures, 6 tables
[LINK]
http://arxiv.org/abs/2411.18719v2
[DATE]
2025-09-15 13:31:49+08:00
[CATEGORIES]
cs.LG
AMLNet: A Knowledge-Based Multi-Agent Framework to Generate and Detect Realistic Money Laundering Transactions
[AUTHORS]
Sabin Huda, Ernest Foo, Zahra Jadidi, MA Hakim Newton, Abdul Sattar
[ABSTRACT]
Anti-money laundering (AML) research is constrained by the lack of publicly
shareable, regulation-aligned transaction datasets. We present AMLNet, a
knowledge-based multi-agent framework with two coordinated units: a
regulation-aware transaction generator and an ensemble detection pipeline. The
generator produces 1,090,173 synthetic transactions (approximately 0.16%
laundering-positive) spanning core laundering phases (placement, layering,
integration) and advanced typologies (e.g., structuring, adaptive threshold
behavior). Regulatory alignment reaches 75% based on AUSTRAC rule coverage
(Section 4.2), while a composite technical fidelity score of 0.75 summarizes
temporal, structural, and behavioral realism components (Section 4.4). The
detection ensemble achieves F1 0.90 (precision 0.84, recall 0.97) on the
internal test partitions of AMLNet and adapts to the external SynthAML dataset,
indicating architectural generalizability across different synthetic generation
paradigms. We provide multi-dimensional evaluation (regulatory, temporal,
network, behavioral) and release the dataset (Version 1.0,
https://doi.org/10.5281/zenodo.16736515), to advance reproducible and
regulation-conscious AML experimentation.
[LINK]
http://arxiv.org/abs/2509.11595v1
[DATE]
2025-09-15 13:25:46+08:00
[CATEGORIES]
cs.LG
Piecewise Deterministic Markov Processes for Bayesian Neural Networks
[AUTHORS]
Ethan Goan, Dimitri Perrin, Kerrie Mengersen, Clinton Fookes
[ABSTRACT]
Inference on modern Bayesian Neural Networks (BNNs) often relies on a
variational inference treatment, imposing violated assumptions of independence
and the form of the posterior. Traditional MCMC approaches avoid these
assumptions at the cost of increased computation due to their incompatibility with
subsampling of the likelihood. New Piecewise Deterministic Markov Process
(PDMP) samplers permit subsampling, though they introduce model-specific
inhomogeneous Poisson Processes (IPPs) which are difficult to sample from. This
work introduces a new generic and adaptive thinning scheme for sampling from
these IPPs, and demonstrates how this approach can accelerate the application
of PDMPs for inference in BNNs. Experimentation illustrates how inference with
these methods is computationally feasible, can improve predictive accuracy,
MCMC mixing performance, and provide informative uncertainty measurements when
compared against other approximate inference schemes.
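
For readers unfamiliar with thinning, the classical (non-adaptive) scheme for
sampling an inhomogeneous Poisson process is sketched below. It is not the
adaptive scheme proposed in the paper: it assumes a known constant upper bound
on the rate, and the function name and example rate are illustrative.

```python
import math
import random


def sample_ipp_thinning(rate, rate_upper_bound, horizon, rng):
    """Event times on [0, horizon] for an inhomogeneous Poisson process.

    Assumes 0 <= rate(t) <= rate_upper_bound for all t in [0, horizon].
    """
    events = []
    t = 0.0
    while True:
        # Propose the next event from a homogeneous process at the bounding rate.
        t += rng.expovariate(rate_upper_bound)
        if t > horizon:
            return events
        # Accept the proposal with probability rate(t) / rate_upper_bound.
        if rng.random() * rate_upper_bound < rate(t):
            events.append(t)


rng = random.Random(0)
events = sample_ipp_thinning(lambda t: 1.0 + math.sin(t), 2.0, 10.0, rng)
```

A loose bound wastes proposals and a bound that is too tight is invalid, which
is why choosing the bound is the hard part for model-specific rates.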
[COMMENTS]
Includes correction to software and corrigendum note (fix
supplementary references)
[LINK]
http://arxiv.org/abs/2302.08724v3
[DATE]
2025-09-15 13:10:19+08:00
[CATEGORIES]
cs.LG
Learning Singularity-Encoded Green’s Functions with Application to Iterative Methods
[AUTHORS]
Qi Sun, Shengyan Li, Bowen Zheng, Lili Ju, Xuejun Xu
[ABSTRACT]
Green’s function provides an inherent connection between theoretical analysis
and numerical methods for elliptic partial differential equations, and general
absence of its closed-form expression necessitates surrogate modeling to guide
the design of effective solvers. Unfortunately, numerical computation of
Green’s function remains challenging due to its doubled dimensionality and
intrinsic singularity. In this paper, we present a novel singularity-encoded
learning approach to resolve these problems in an unsupervised fashion. Our
method embeds the Green’s function within a one-order higher-dimensional space
by encoding its prior estimate as an augmented variable, followed by a neural
network parametrization to manage the increased dimensionality. By projecting
the trained neural network solution back onto the original domain, our deep
surrogate model exploits its spectral bias to accelerate conventional iterative
schemes, serving either as a preconditioner or as part of a hybrid solver. The
effectiveness of our proposed method is empirically verified through numerical
experiments with two and four dimensional Green’s functions, achieving
satisfactory resolution of singularities and acceleration of iterative solvers.
[LINK]
http://arxiv.org/abs/2509.11580v1
[DATE]
2025-09-15 12:53:22+08:00
[CATEGORIES]
cs.LG
STRIDE: Subset-Free Functional Decomposition for XAI in Tabular Settings
[AUTHORS]
Chaeyun Ko
[ABSTRACT]
Most explainable AI (XAI) frameworks are limited in their expressiveness,
summarizing complex feature effects as single scalar values \phi_i. This
approach answers “what” features are important but fails to reveal “how” they
interact. Furthermore, methods that attempt to capture interactions, like those
based on Shapley values, often face an exponential computational cost. We
present STRIDE, a scalable framework that addresses both limitations by
reframing explanation as a subset-enumeration-free, orthogonal “functional
decomposition” in a Reproducing Kernel Hilbert Space (RKHS). In the tabular
setups we study, STRIDE analytically computes functional components f_S(x_S)
via a recursive kernel-centering procedure. The approach is model-agnostic and
theoretically grounded with results on orthogonality and L^2 convergence. In
tabular benchmarks (10 datasets, median over 10 seeds), STRIDE attains a 3.0
times median speedup over TreeSHAP and a mean R^2=0.93 for reconstruction. We
also introduce “component surgery”, a diagnostic that isolates a learned
interaction and quantifies its contribution; on California Housing, removing a
single interaction reduces test R^2 from 0.019 to 0.027.
[COMMENTS]
Major revision for submission to ICLR 2026. Substantially revised
abstract, introduction, and discussion. Added new ‘component surgery’
analysis and updated benchmark results for clarity. (12 pages, 2 figures)
[LINK]
http://arxiv.org/abs/2509.09070v2
[DATE]
2025-09-15 11:49:28+08:00
[CATEGORIES]
cs.LG
Solved in Unit Domain: JacobiNet for Differentiable Coordinate-Transformed PINNs
[AUTHORS]
Xi Chen, Jianchuan Yang, Junjie Zhang, Runnan Yang, Xu Liu, Hong Wang, Tinghui Zheng, Ziyu Ren, Wenqi Hu
[ABSTRACT]
Physics-Informed Neural Networks offer a powerful framework for solving PDEs
by embedding physical laws into the learning process. However, when applied to
domains with irregular boundaries, PINNs often suffer from instability and slow
convergence, which stems from (1) inconsistent normalization due to geometric
anisotropy, (2) inaccurate boundary enforcements, and (3) imbalanced loss term
competition. A common workaround is to map the domain to a regular space. Yet,
conventional mapping methods rely on case-specific meshes, define Jacobians at
pre-specified fixed nodes, and reformulate PDEs via the chain rule, making them
incompatible with modern automatic-differentiation, tensor-based frameworks. To
bridge this gap, we propose JacobiNet, a learning-based coordinate-transformed
PINN framework that unifies domain mapping and PDE solving within an end-to-end
differentiable architecture. Leveraging lightweight MLPs, JacobiNet learns
continuous, differentiable mappings, enables direct Jacobian computation via
autograd, and shares its computation graph with downstream PINNs. Its continuous nature
and built-in Jacobian eliminate the need for meshing, explicit Jacobian
computation/storage, and PDE reformulation, while unlocking geometric-editing
operations, reducing the mapping cost. Separating physical modeling from
geometric complexity, JacobiNet (1) addresses normalization challenges in the
original anisotropic coordinates, (2) facilitates hard constraints of boundary
conditions, and (3) mitigates the long-standing imbalance among loss terms.
Evaluated on various PDEs, JacobiNet reduces the L2 error from 0.11-0.73 to
0.01-0.09. In vessel-like domains with varying shapes, JacobiNet enables
millisecond-level mapping inference for unseen geometries, improves prediction
accuracy by an average of 3.65, while delivering over 10x speedup,
demonstrating strong generalization, accuracy, and efficiency.
[COMMENTS]
Submitted to CMAME, revision in progress
[LINK]
http://arxiv.org/abs/2508.02537v2
[DATE]
2025-09-15 11:42:13+08:00
[CATEGORIES]
cs.LG
K2-Think: A Parameter-Efficient Reasoning System
[AUTHORS]
Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
[ABSTRACT]
K2-Think is a reasoning system that achieves state-of-the-art performance
with a 32B parameter model, matching or surpassing much larger models like
GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system
shows that smaller models can compete at the highest levels by combining
advanced post-training and test-time computation techniques. The approach is
based on six key technical pillars: Long Chain-of-thought Supervised
Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic
planning prior to reasoning, Test-time Scaling, Speculative Decoding, and
Inference-optimized Hardware, all using publicly available open-source
datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art
scores on public benchmarks for open-source models, while also performing
strongly in other areas such as Code and Science. Our results confirm that a
more parameter-efficient model like K2-Think 32B can compete with
state-of-the-art systems through an integrated post-training recipe that
includes long chain-of-thought training and strategic inference-time
enhancements, making open-source reasoning systems more accessible and
affordable. K2-Think is freely available at k2think.ai, offering best-in-class
inference speeds of over 2,000 tokens per second per request via the Cerebras
Wafer-Scale Engine.
[COMMENTS]
To access the K2-Think reasoning system, please visit www.k2think.ai
[LINK]
http://arxiv.org/abs/2509.07604v3
[DATE]
2025-09-15 11:29:27+08:00
[CATEGORIES]
cs.LG
Compressed Sensing: Mathematical Foundations, Implementation, and Advanced Optimization Techniques
[AUTHORS]
Shane Stevenson, Maryam Sabagh
[ABSTRACT]
Compressed sensing is a signal processing technique that allows for the
reconstruction of a signal from a small set of measurements. The key idea
behind compressed sensing is that many real-world signals are inherently
sparse, meaning that they can be efficiently represented in a different space
with only a few components compared to their original space representation. In
this paper we will explore the mathematical formulation behind compressed
sensing, its logic and pathologies, and apply compressed sensing to real world
signals.
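
As a rough illustration of the sparse-recovery idea described in this abstract, the sketch below reconstructs a sparse vector from fewer random measurements than unknowns using iterative soft-thresholding (ISTA). The solver choice, the Gaussian measurement matrix, the problem sizes, and the names `soft_threshold` and `ista` are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.05, n_iter=3000):
    """Approximately minimize 0.5 * ||A x - y||^2 + lam * ||x||_1 (hypothetical helper)."""
    # Step size 1/L, where L is the Lipschitz constant of the gradient A^T (A x - y).
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Illustrative problem: a length-200 signal with 5 nonzeros, observed through 60 measurements.
rng = np.random.default_rng(0)
n, m, k = 200, 60, 5
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)  # assumed random Gaussian sensing matrix
y = A @ x_true

x_hat = ista(A, y)
rel_err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The L1 penalty is what encodes the sparsity assumption; a larger `lam` gives sparser but more biased estimates.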
[LINK]
http://arxiv.org/abs/2509.11550v1
[DATE]
2025-09-15 11:29:18+08:00
[CATEGORIES]
cs.LG
UI-S1: Advancing GUI Automation via Semi-online Reinforcement Learning
[AUTHORS]
Zhengxi Lu, Jiabo Ye, Fei Tang, Yongliang Shen, Haiyang Xu, Ziwei Zheng, Weiming Lu, Ming Yan, Fei Huang, Jun Xiao, Yueting Zhuang
[ABSTRACT]
Graphical User Interface (GUI) agents have demonstrated remarkable progress
in automating complex user interface interactions through reinforcement
learning. However, current approaches face a fundamental dilemma: offline RL
enables stable training on pre-collected trajectories, but struggles with
multi-step task execution for lack of trajectory-level reward signals; online
RL captures these signals through environment interaction, but suffers from
sparse rewards and prohibitive deployment costs. To address it, we present
Semi-online Reinforcement Learning, a novel paradigm that simulates online RL
on offline trajectories. During each rollout process, we preserve the original
model output within the multi-turn dialogue, where a Patch Module adaptively
recovers the divergence between rollout and expert trajectories. To capture
long-term training signals, Semi-online RL introduces discounted future returns
into the reward computation and optimizes the policy with weighted step-level
and episode-level advantages. We further introduce Semi-Online Performance
(SOP), a metric that aligns better with true online performance, serving as a
practical and effective proxy for real-world evaluation. Experiments show that
our Semi-online RL achieves SOTA performance among 7B models across four
dynamic benchmarks, with significant gains over the base model (e.g., +12.0% on
AndroidWorld, +23.8% on AITW), demonstrating significant progress in bridging
the gap between offline training efficiency and online multi-turn reasoning.
The code is available at https://github.com/X-PLUG/MobileAgent/tree/main/UI-S1.
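
The abstract mentions folding discounted future returns into the reward computation. A minimal sketch of the standard discounted-return recursion is given below, assuming per-step scalar rewards and a discount factor `gamma`; the function name and the example values are hypothetical, and the paper's actual reward and advantage weighting may differ.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Illustrative three-step trajectory: a late reward is propagated to earlier steps.
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
```

With this recursion, a step that earns no immediate reward still receives credit for rewards that follow it, which is the long-horizon signal the abstract refers to.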
[COMMENTS]
22 pages, 17 figures
[LINK]
http://arxiv.org/abs/2509.11543v1
[DATE]
2025-09-15 11:24:08+08:00
[CATEGORIES]
cs.LG
Self-Evolving Curriculum for LLM Reasoning
[AUTHORS]
Xiaoyin Chen, Jiarui Lu, Minsu Kim, Dinghuai Zhang, Jian Tang, Alexandre Piché, Nicolas Gontier, Yoshua Bengio, Ehsan Kamalloo
[ABSTRACT]
Reinforcement learning (RL) has proven effective for fine-tuning large
language models (LLMs), significantly enhancing their reasoning abilities in
domains such as mathematics and code generation. A crucial factor influencing
RL fine-tuning success is the training curriculum: the order in which training
problems are presented. While random curricula serve as common baselines, they
remain suboptimal; manually designed curricula often rely heavily on
heuristics, and online filtering methods can be computationally prohibitive. To
address these limitations, we propose Self-Evolving Curriculum (SEC), an
automatic curriculum learning method that learns a curriculum policy
concurrently with the RL fine-tuning process. Our approach formulates
curriculum selection as a non-stationary Multi-Armed Bandit problem, treating
each problem category (e.g., difficulty level or problem type) as an individual
arm. We leverage the absolute advantage from policy gradient methods as a proxy
measure for immediate learning gain. At each training step, the curriculum
policy selects categories to maximize this reward signal and is updated using
the TD(0) method. Across three distinct reasoning domains: planning, inductive
reasoning, and mathematics, our experiments demonstrate that SEC significantly
improves models’ reasoning capabilities, enabling better generalization to
harder, out-of-distribution test problems. Additionally, our approach achieves
better skill balance when fine-tuning simultaneously on multiple reasoning
domains. These findings highlight SEC as a promising strategy for RL
fine-tuning of LLMs.
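
The bandit formulation described above can be sketched as follows. This is an illustrative reading, not the authors' implementation: each problem category is an arm, the reward is the mean absolute advantage of the sampled batch, and the arm value is updated with a TD(0)-style step, which for a stateless bandit reduces to an exponential moving average. The softmax sampling rule, the class name `CurriculumBandit`, and all hyperparameter values are assumptions.

```python
import numpy as np

class CurriculumBandit:
    """Hypothetical curriculum policy over problem categories (arms)."""

    def __init__(self, n_arms, lr=0.5, temperature=1.0, seed=0):
        self.q = np.zeros(n_arms)          # estimated learning gain per category
        self.lr = lr
        self.temperature = temperature
        self.rng = np.random.default_rng(seed)

    def probs(self):
        # Softmax over arm values (assumed sampling rule).
        z = self.q / self.temperature
        z = z - z.max()                    # subtract max for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def sample(self):
        return int(self.rng.choice(len(self.q), p=self.probs()))

    def update(self, arm, advantages):
        # Reward: mean absolute advantage as a proxy for immediate learning gain.
        reward = float(np.mean(np.abs(advantages)))
        # TD(0) update with no next state: Q <- Q + lr * (reward - Q).
        self.q[arm] += self.lr * (reward - self.q[arm])
        return reward

bandit = CurriculumBandit(n_arms=3)
arm = bandit.sample()
bandit.update(arm, advantages=[1.0, -1.0, 0.5, -0.5])
print(arm, bandit.q, bandit.probs())
```

Because the values are moving averages, categories whose advantages shrink as the model improves lose probability mass over time, which matches the non-stationary setting the abstract describes.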
[LINK]
http://arxiv.org/abs/2505.14970v3
[DATE]
2025-09-15 11:08:37+08:00
[CATEGORIES]
cs.LG
Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation
[AUTHORS]
Hanfei Zhou, Lei Shi
[ABSTRACT]
A key challenge in scientific machine learning is solving partial
differential equations (PDEs) on complex domains, where the curved geometry
complicates the approximation of functions and their derivatives required by
differential operators. This paper establishes the first simultaneous
approximation theory for deep neural networks on manifolds. We prove that a
constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights–a property
that plays a crucial role in controlling generalization error–can approximate
any function in the Sobolev space $\mathcal{W}_p^{k}(\mathcal{M}^d)$ to an
error of $\varepsilon$ in the $\mathcal{W}_p^{s}(\mathcal{M}^d)$ norm, for
$k\geq 3$ and $s<k$, using $\mathcal{O}(\varepsilon^{-d/(k-s)})$ nonzero
parameters, a rate that overcomes the curse of dimensionality by depending only
on the intrinsic dimension $d$. These results readily extend to functions in
Hölder-Zygmund spaces. We complement this result with a matching lower bound,
proving our construction is nearly optimal by showing the required number of
parameters matches up to a logarithmic factor. Our proof of the lower bound
introduces novel estimates for the Vapnik-Chervonenkis dimension and
pseudo-dimension of the network’s high-order derivative classes. These
complexity bounds provide a theoretical cornerstone for learning PDEs on
manifolds involving derivatives. Our analysis reveals that the network
architecture leverages a sparse structure to efficiently exploit the manifold’s
low-dimensional geometry.
[LINK]
http://arxiv.org/abs/2509.09362v2
[DATE]
2025-09-15 10:56:15+08:00
[CATEGORIES]
cs.LG
E-ROBOT: a dimension-free method for robust statistics and machine learning via Schrödinger bridge
[AUTHORS]
Davide La Vecchia, Hang Liu
[ABSTRACT]
We propose the Entropic-regularized Robust Optimal Transport (E-ROBOT)
framework, a novel method that combines the robustness of ROBOT with the
computational and statistical benefits of entropic regularization. We show
that, rooted in the Schrödinger bridge problem theory, E-ROBOT defines the
robust Sinkhorn divergence $\overline{W}_{\varepsilon,\lambda}$, where the
parameter $\lambda$ controls robustness and $\varepsilon$ governs the
regularization strength. Letting $n\in \mathbb{N}$ denote the sample size, a
central theoretical contribution is establishing that the sample complexity of
$\overline{W}_{\varepsilon,\lambda}$ is $\mathcal{O}(n^{-1/2})$, thereby
avoiding the curse of dimensionality that plagues standard ROBOT. This
dimension-free property unlocks the use of $\overline{W}_{\varepsilon,\lambda}$
as a loss function in large-dimensional statistical and machine learning tasks.
In this regard, we demonstrate its utility through four applications:
goodness-of-fit testing; computation of barycenters for corrupted 2D and 3D
shapes; definition of gradient flows; and image colour transfer. From the
computation standpoint, a perk of our novel method is that it can be easily
implemented by modifying existing (Python) routines. From the
theoretical standpoint, our work opens the door to many research directions in
statistics and machine learning: we discuss some of them.
[LINK]
http://arxiv.org/abs/2509.11532v1
[DATE]
2025-09-15 10:49:04+08:00
[CATEGORIES]
cs.LG
High-Fidelity Scientific Simulation Surrogates via Adaptive Implicit Neural Representations
[AUTHORS]
Ziwei Li, Yuhan Duan, Tianyu Xiong, Yi-Tang Chen, Wei-Lun Chao, Han-Wei Shen
[ABSTRACT]
Effective surrogate models are critical for accelerating scientific
simulations. Implicit neural representations (INRs) offer a compact and
continuous framework for modeling spatially structured data, but they often
struggle with complex scientific fields exhibiting localized, high-frequency
variations. Recent approaches address this by introducing additional features
along rigid geometric structures (e.g., grids), but at the cost of flexibility
and increased model size. In this paper, we propose a simple yet effective
alternative: Feature-Adaptive INR (FA-INR). FA-INR leverages cross-attention to
an augmented memory bank to learn flexible feature representations, enabling
adaptive allocation of model capacity based on data characteristics, rather
than rigid structural assumptions. To further improve scalability, we introduce
a coordinate-guided mixture of experts (MoE) that enhances the specialization
and efficiency of feature representations. Experiments on three large-scale
ensemble simulation datasets show that FA-INR achieves state-of-the-art
fidelity while significantly reducing model size, establishing a new trade-off
frontier between accuracy and compactness for INR-based surrogates.
[LINK]
http://arxiv.org/abs/2506.06858v2
[DATE]
2025-09-15 10:36:14+08:00
[CATEGORIES]
cs.LG
DARD: Dice Adversarial Robustness Distillation against Adversarial Attacks
[AUTHORS]
Jing Zou, Shungeng Zhang, Meikang Qiu, Chong Li
[ABSTRACT]
Deep learning models are vulnerable to adversarial examples, posing critical
security challenges in real-world applications. While Adversarial Training (AT)
is a widely adopted defense mechanism to enhance robustness, it often incurs
a trade-off by degrading performance on unperturbed, natural data. Recent
efforts have highlighted that larger models exhibit enhanced robustness over
their smaller counterparts. In this paper, we empirically demonstrate that such
robustness can be systematically distilled from large teacher models into
compact student models. To achieve better performance, we introduce Dice
Adversarial Robustness Distillation (DARD), a novel method designed to transfer
robustness through a tailored knowledge distillation paradigm. Additionally, we
propose Dice Projected Gradient Descent (DPGD), an adversarial example
generalization method optimized for effective attack. Our extensive experiments
demonstrate that the DARD approach consistently outperforms adversarially
trained networks with the same architecture, achieving superior robustness and
standard accuracy.
[COMMENTS]
Accepted at SecureComm 2025, 15 pages, 4 figures
[LINK]
http://arxiv.org/abs/2509.11525v1
[DATE]
2025-09-15 10:31:30+08:00
[CATEGORIES]
cs.LG
Know What You Don’t Know: Selective Prediction for Early Exit DNNs
[AUTHORS]
Divya Jyoti Bajpai, Manjesh Kumar Hanawal
[ABSTRACT]
Inference latency and trustworthiness of Deep Neural Networks (DNNs) are the
bottlenecks in deploying them in critical applications like sensitive tasks.
Early Exit (EE) DNNs overcome the latency issues by allowing samples to exit
from intermediary layers if they attain 'high' confidence scores on the
predicted class. However, the DNNs are known to exhibit overconfidence, which
can lead to many samples exiting early and render EE strategies untrustworthy.
We use Selective Prediction (SP) to overcome this issue by checking the
'hardness' of the samples rather than just relying on the confidence score
alone. We propose SPEED, a novel approach that uses Deferral Classifiers (DCs)
at each layer to check the hardness of samples before performing EEs.
Specifically, the DCs identify if a sample is hard to predict at an
intermediary layer, leading to hallucination, and defer it to an expert. Early
detection of hard samples for inference prevents the wastage of computational
resources and improves trust by deferring the hard samples to the expert. We
demonstrate that EE aided with SP improves both accuracy and latency. Our
method minimizes the risk of wrong prediction by $50\%$ with a speedup of
$2.05\times$ as compared to the final layer. The anonymized source code is
available at https://github.com/Div290/SPEED
[COMMENTS]
To appear in the Fifth International Conference on AI ML Systems
[LINK]
http://arxiv.org/abs/2509.11520v1
[DATE]
2025-09-15 10:19:09+08:00
[CATEGORIES]
cs.LG
A Permutation-free Kernel Two-Sample Test
[AUTHORS]
Shubhanshu Shekhar, Ilmun Kim, Aaditya Ramdas
[ABSTRACT]
The kernel Maximum Mean Discrepancy (MMD) is a popular multivariate distance
metric between distributions that has found utility in two-sample testing. The
usual kernel-MMD test statistic is a degenerate U-statistic under the null, and
thus it has an intractable limiting distribution. Hence, to design a
level-$\alpha$ test, one usually selects the rejection threshold as the
$(1-\alpha)$-quantile of the permutation distribution. The resulting
nonparametric test has finite-sample validity but suffers from large
computational cost, since every permutation takes quadratic time. We propose
the cross-MMD, a new quadratic-time MMD test statistic based on
sample-splitting and studentization. We prove that under mild assumptions, the
cross-MMD has a limiting standard Gaussian distribution under the null.
Importantly, we also show that the resulting test is consistent against any
fixed alternative, and when using the Gaussian kernel, it has minimax
rate-optimal power against local alternatives. For large sample sizes, our new
cross-MMD provides a significant speedup over the MMD, for only a slight loss
in power.
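
To make the sample-splitting and studentization idea concrete, here is a hedged sketch of a cross-MMD-style test. It splits each sample in half, uses the second halves to build an empirical witness function, evaluates that function on the first halves, and studentizes the resulting difference in means so that a standard Gaussian threshold can be used. The function names, the fixed Gaussian bandwidth, and the Welch-style variance estimate are assumptions for illustration; the estimator in the paper may differ in detail.

```python
from statistics import NormalDist

import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def cross_mmd_test(x, y, alpha=0.05, bandwidth=1.0):
    """Return (studentized statistic, reject?) for H0: x and y share a distribution."""
    x1, x2 = x[: len(x) // 2], x[len(x) // 2:]
    y1, y2 = y[: len(y) // 2], y[len(y) // 2:]

    def witness(z):
        # Empirical witness built only from the second halves (x2, y2).
        kx = gaussian_kernel(z, x2, bandwidth).mean(axis=1)
        ky = gaussian_kernel(z, y2, bandwidth).mean(axis=1)
        return kx - ky

    a, b = witness(x1), witness(y1)
    stat = a.mean() - b.mean()
    # Assumed variance estimate: two-sample (Welch-style) standard error.
    var = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    t = stat / np.sqrt(var)
    threshold = NormalDist().inv_cdf(1.0 - alpha)  # one-sided Gaussian quantile
    return float(t), bool(t > threshold)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = rng.normal(loc=3.0, size=(200, 2))
print(cross_mmd_test(x, y))
```

No permutations are needed: the rejection threshold comes directly from the standard normal quantile, so the whole test costs one pass of quadratic-time kernel evaluations.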
[COMMENTS]
Published at the Thirty-sixth Conference on Neural Information
Processing Systems (NeurIPS), with an oral presentation. The current version
on arXiv fixes a bug in the proof of Theorem 9
[LINK]
http://arxiv.org/abs/2211.14908v3
[DATE]
2025-09-15 10:16:53+08:00
[CATEGORIES]
cs.LG
TED: Accelerate Model Training by Internal Generalization
[AUTHORS]
Jinying Xiao, Ping Li, Jie Nie
[ABSTRACT]
Large language models have demonstrated strong performance in recent years,
but the high cost of training drives the need for efficient methods to compress
dataset sizes. We propose TED pruning, a method that addresses the challenge of
overfitting under high pruning ratios by quantifying the model’s ability to
improve performance on pruned data while fitting retained data, known as
Internal Generalization (IG). TED uses an optimization objective based on
Internal Generalization Distance (IGD), measuring changes in IG before and
after pruning to align with true generalization performance and achieve
implicit regularization. The IGD optimization objective was verified to allow
the model to achieve the smallest upper bound on generalization error. The
impact of small mask fluctuations on IG is studied through masks and Taylor
approximation, and fast estimation of IGD is enabled. In analyzing continuous
training dynamics, the prior effect of IGD is validated, and a progressive
pruning strategy is proposed. Experiments on image classification, natural
language understanding, and large language model fine-tuning show TED achieves
lossless performance with 60-70% of the data. Upon acceptance, our code will
be made publicly available.
[COMMENTS]
ECAI 2024
[LINK]
http://arxiv.org/abs/2405.03228v3
[DATE]
2025-09-15 10:13:39+08:00
[CATEGORIES]
cs.LG
SEVEN: Pruning Transformer Model by Reserving Sentinels
[AUTHORS]
Jinying Xiao, Ping Li, Jie Nie, Zhe Tang
[ABSTRACT]
Large-scale Transformer models (TM) have demonstrated outstanding performance
across various tasks. However, their considerable parameter size restricts
their applicability, particularly on mobile devices. Due to the dynamic and
intricate nature of gradients on TM compared to Convolutional Neural Networks,
commonly used pruning methods tend to retain weights with larger gradient
noise. This results in pruned models that are sensitive to sparsity and
datasets, exhibiting suboptimal performance. Symbolic Descent (SD) is a general
approach for training and fine-tuning TM. In this paper, we attempt to describe
the noisy batch gradient sequences on TM through the cumulative process of SD.
We utilize this design to dynamically assess the importance scores of
weights. We introduce SEVEN, which particularly favors weights with
consistently high sensitivity, i.e., weights with small gradient noise, and
tends to preserve them. Extensive experiments on various
TM in natural language, question-answering, and image classification domains
are conducted to validate the effectiveness of SEVEN. The results demonstrate
significant improvements of SEVEN in multiple pruning scenarios and across
different sparsity levels. Additionally, SEVEN exhibits robust performance
under various fine-tuning strategies. The code is publicly available at
https://github.com/xiaojinying/SEVEN.
[COMMENTS]
IJCNN 2024
[LINK]
http://arxiv.org/abs/2403.12688v2
[DATE]
2025-09-15 10:09:20+08:00
[CATEGORIES]
cs.LG
Machine Learning-Driven Predictive Resource Management in Complex Science Workflows
[AUTHORS]
Tasnuva Chowdhury, Tadashi Maeno, Fatih Furkan Akman, Joseph Boudreau, Sankha Dutta, Shengyu Feng, Adolfy Hoisie, Kuan-Chieh Hsu, Raees Khan, Jaehyung Kim, Ozgur O. Kilic, Scott Klasky, Alexei Klimentov, Tatiana Korchuganova, Verena Ingrid Martinez Outschoorn, Paul Nilsson, David K. Park, Norbert Podhorszki, Yihui Ren, John Rembrandt Steele, Frédéric Suter, Sairam Sri Vatsavai, Torre Wenaus, Wei Yang, Yiming Yang, Shinjae Yoo
[ABSTRACT]
The collaborative efforts of large communities in science experiments, often
comprising thousands of global members, reflect a monumental commitment to
exploration and discovery. Recently, advanced and complex data processing has
gained increasing importance in science experiments. Data processing workflows
typically consist of multiple intricate steps, and the precise specification of
resource requirements is crucial for each step to allocate optimal resources
for effective processing. Estimating resource requirements in advance is
challenging due to a wide range of analysis scenarios, varying skill levels
among community members, and the continuously increasing spectrum of computing
options. One practical approach to mitigate these challenges involves initially
processing a subset of each step to measure precise resource utilization from
actual processing profiles before completing the entire step. While this
two-staged approach enables processing on optimal resources for most of the
workflow, it has drawbacks such as initial inaccuracies leading to potential
failures and suboptimal resource usage, along with overhead from waiting for
initial processing completion, which is critical for fast-turnaround analyses.
In this context, our study introduces a novel pipeline of machine learning
models within a comprehensive workflow management system, the Production and
Distributed Analysis (PanDA) system. These models employ advanced machine
learning techniques to predict key resource requirements, overcoming challenges
posed by limited upfront knowledge of characteristics at each step. Accurate
forecasts of resource requirements enable informed and proactive
decision-making in workflow management, enhancing the efficiency of handling
diverse, complex workflows across heterogeneous resources.
[LINK]
http://arxiv.org/abs/2509.11512v1
[DATE]
2025-09-15 09:53:30+08:00
[CATEGORIES]
cs.LG
SafeDiver: Cooperative AUV-USV Assisted Diver Communication via Multi-agent Reinforcement Learning Approach
[AUTHORS]
Tinglong Deng, Hang Tao, Xinxiang Wang, Yinyan Wang, Hanjiang Luo
[ABSTRACT]
As underwater human activities are increasing, the demand for underwater
communication service presents a significant challenge. Existing underwater
diver communication methods face hurdles due to inherent disadvantages and
complex underwater environments. To address this issue, we propose a scheme
that utilizes maritime unmanned systems to assist divers with reliable and
high-speed communication. Multiple AUVs are equipped with optical and acoustic
multimodal communication devices as relay nodes, providing adaptive
communication services based on changes in the diver’s activity area. By using
a multi-agent reinforcement learning (MARL) approach to control the cooperative
movement of AUVs, high-speed and reliable data transmission between divers can
be achieved. At the same time, by exploiting the on-demand deployment and wide
coverage of unmanned surface vehicles (USVs) as surface relay nodes that
coordinate and forward information from AUVs, and by controlling AUVs to
adaptively select relay USV nodes for data transmission, high-quality
communication between divers and the surface platform can be achieved. Through
simulation verification, the proposed scheme can effectively achieve reliable
and high-speed communication for divers.
[LINK]
http://arxiv.org/abs/2509.11508v1
[DATE]
2025-09-15 09:44:28+08:00
[CATEGORIES]
cs.LG
Drug Repurposing Using Deep Embedded Clustering and Graph Neural Networks
[AUTHORS]
Luke Delzer, Robert Kroleski, Ali K. AlShami, Jugal Kalita
[ABSTRACT]
Drug repurposing has historically been an economically infeasible process for
identifying novel uses for abandoned drugs. Modern machine learning has enabled
the identification of complex biochemical intricacies in candidate drugs;
however, many studies rely on simplified datasets with known drug-disease
similarities. We propose a machine learning pipeline that uses unsupervised
deep embedded clustering, combined with supervised graph neural network link
prediction to identify new drug-disease links from multi-omic data.
Unsupervised autoencoder and cluster training reduced the dimensionality of
omic data into a compressed latent embedding. A total of 9,022 unique drugs
were partitioned into 35 clusters with a mean silhouette score of 0.8550. Graph
neural networks achieved strong statistical performance, with a prediction
accuracy of 0.901, receiver operating characteristic area under the curve of
0.960, and F1-Score of 0.901. A ranked list comprised of 477 per-cluster link
probabilities exceeding 99 percent was generated. This study could provide new
drug-disease link prospects across unrelated disease domains, while advancing
the understanding of machine learning in drug repurposing studies.
[COMMENTS]
Accepted at the 2025 International Conference on Machine Learning and
Applications (ICMLA)
[LINK]
http://arxiv.org/abs/2509.11493v1
[DATE]
2025-09-15 09:04:37+08:00
[CATEGORIES]
cs.LG
Multilingual Diversity Improves Vision-Language Representations
[AUTHORS]
Thao Nguyen, Matthew Wallingford, Sebastin Santy, Wei-Chiu Ma, Sewoong Oh, Ludwig Schmidt, Pang Wei Koh, Ranjay Krishna
[COMMENTS]
NeurIPS 2024 Spotlight paper
[LINK]
http://arxiv.org/abs/2405.16915v3
[DATE]
2025-09-15 08:59:09+08:00
[CATEGORIES]
cs.LG
FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
[AUTHORS]
Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee
[ABSTRACT]
Recent advances in Post-Training Quantization (PTQ) techniques have
significantly increased demand for serving quantized large language models
(LLMs), enabling higher throughput and substantially reduced memory usage with
minimal accuracy loss. Quantized models address memory constraints in LLMs and
enhance GPU resource utilization through efficient GPU sharing. However,
quantized models have smaller KV block sizes than non-quantized models, causing
limited memory efficiency due to memory fragmentation. Also, distinct resource
usage patterns between quantized and non-quantized models require efficient
scheduling to maximize throughput. To address these challenges, we propose
FineServe, an inference serving framework for mixed-precision LLMs. FineServe’s
key contributions include: (1) KV Slab, a precision-aware adaptive memory
management technique dynamically allocating KV cache based on model
quantization characteristics, significantly reducing GPU memory fragmentation,
and (2) a two-level scheduling framework comprising a global scheduler that
places models on GPUs based on request rates, latency SLOs, and memory
constraints and efficiency, and a local scheduler that adaptively adjusts batch
sizes according to real-time request fluctuations. Experimental results
demonstrate that FineServe achieves up to 2.2x higher SLO attainment and 1.8x
higher token generation throughput compared to the state-of-the-art GPU sharing
systems.
[LINK]
http://arxiv.org/abs/2509.06261v2
[DATE]
2025-09-15 08:51:47+08:00
[CATEGORIES]
cs.LG
Topology-Aware and Highly Generalizable Deep Reinforcement Learning for Efficient Retrieval in Multi-Deep Storage Systems
[AUTHORS]
Funing Li, Yuan Tian, Ruben Noortwyck, Jifeng Zhou, Liming Kuang, Robert Schulz
[ABSTRACT]
In modern industrial and logistics environments, the rapid expansion of fast
delivery services has heightened the demand for storage systems that combine
high efficiency with increased density. Multi-deep autonomous vehicle storage
and retrieval systems (AVS/RS) present a viable solution for achieving greater
storage density. However, these systems encounter significant challenges during
retrieval operations due to lane blockages. A conventional approach to mitigate
this issue involves storing items with homogeneous characteristics in a single
lane, but this strategy restricts the flexibility and adaptability of
multi-deep storage systems.
In this study, we propose a deep reinforcement learning-based framework to
address the retrieval problem in multi-deep storage systems with heterogeneous
item configurations. Each item is associated with a specific due date, and the
objective is to minimize total tardiness. To effectively capture the system’s
topology, we introduce a graph-based state representation that integrates both
item attributes and the local topological structure of the multi-deep
warehouse. To process this representation, we design a novel neural network
architecture that combines a Graph Neural Network (GNN) with a Transformer
model. The GNN encodes topological and item-specific information into
embeddings for all directly accessible items, while the Transformer maps these
embeddings into global priority assignments. The Transformer’s strong
generalization capability further allows our approach to be applied to storage
systems with diverse layouts. Extensive numerical experiments, including
comparisons with heuristic methods, demonstrate the superiority of the proposed
neural network architecture and the effectiveness of the trained agent in
optimizing retrieval tardiness.
[LINK]
http://arxiv.org/abs/2506.14787v2
[DATE]
2025-09-15 08:47:00+08:00
[CATEGORIES]
cs.LG
STLCG++: A Masking Approach for Differentiable Signal Temporal Logic Specification
[AUTHORS]
Parv Kapoor, Kazuki Mizuta, Eunsuk Kang, Karen Leung
[ABSTRACT]
Signal Temporal Logic (STL) offers a concise yet expressive framework for
specifying and reasoning about spatio-temporal behaviors of robotic systems.
Attractively, STL admits the notion of robustness, the degree to which an input
signal satisfies or violates an STL specification, thus providing a nuanced
evaluation of system performance. In particular, the differentiability of STL
robustness enables direct integration into robotic workflows that rely on
gradient-based optimization, such as trajectory optimization and deep learning.
However, existing approaches to evaluating and differentiating STL robustness
rely on recurrent computations, which become inefficient with longer sequences,
limiting their use in time-sensitive applications. In this paper, we present
STLCG++, a masking-based approach that parallelizes STL robustness evaluation
and backpropagation across timesteps, achieving more than 1000$\times$
faster computation time than the recurrent approach. We also introduce a
smoothing technique to enable the differentiation of time interval bounds,
thereby expanding STL’s applicability in gradient-based optimization tasks
involving spatial and temporal variables. Finally, we demonstrate STLCG++’s
benefits through three robotics use cases and provide JAX and PyTorch libraries
for seamless integration into modern robotics workflows. Project website with
demo and code: https://uw-ctrl.github.io/stlcg/.
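
The masking idea can be illustrated on a single STL operator. The sketch below computes the robustness of "always over [a, b], signal > threshold" at every timestep at once, by masking a T-by-T grid of margins instead of looping over time. It uses plain NumPy rather than the JAX or PyTorch libraries mentioned above, and the function name, the truncation of windows at the end of the signal, and the example values are assumptions for this illustration.

```python
import numpy as np

def always_robustness(signal, threshold, a, b):
    """Robustness of G_[a,b](signal > threshold) at each timestep, via masking."""
    margins = np.asarray(signal, dtype=float) - threshold  # robustness of the predicate
    T = len(margins)
    t = np.arange(T)[:, None]   # evaluation time (rows)
    s = np.arange(T)[None, :]   # candidate time inside the window (columns)
    in_window = (s >= t + a) & (s <= t + b)
    # Entries outside the window are masked with +inf so they never win the min.
    # Windows are truncated at the end of the signal; an empty window yields +inf.
    masked = np.where(in_window, margins[None, :], np.inf)
    return masked.min(axis=1)

print(always_robustness([1.0, 3.0, 2.0, 0.0], threshold=1.0, a=0, b=1))
```

All rows are reduced in one vectorized operation, which is the property that lets a masking formulation run in parallel across timesteps instead of recurrently.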
[COMMENTS]
© 2025 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
[LINK]
http://arxiv.org/abs/2501.04194v2
[DATE]
2025-09-15 08:45:51+08:00
[CATEGORIES]
cs.LG
Enhancing Radiographic Disease Detection with MetaCheX, a Context-Aware Multimodal Model
[AUTHORS]
Nathan He, Cody Chen
[ABSTRACT]
Existing deep learning models for chest radiology often neglect patient
metadata, limiting diagnostic accuracy and fairness. To bridge this gap, we
introduce MetaCheX, a novel multimodal framework that integrates chest X-ray
images with structured patient metadata to replicate clinical decision-making.
Our approach combines a convolutional neural network (CNN) backbone with
metadata processed by a multilayer perceptron through a shared classifier.
Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed
radiograph-only baseline models across multiple CNN architectures. By
integrating metadata, the overall diagnostic accuracy was significantly
improved, measured by an increase in AUROC. The results of this study
demonstrate that metadata reduces algorithmic bias and enhances model
generalizability across diverse patient populations. MetaCheX advances clinical
artificial intelligence toward robust, context-aware radiographic disease
detection.
[COMMENTS]
All authors contributed equally, 5 pages, 2 figures, 1 table
[LINK]
http://arxiv.org/abs/2509.12287v1
[DATE]
2025-09-15 08:44:44+08:00
[CATEGORIES]
cs.LG
Semantic Augmentation in Images using Language
[AUTHORS]
Sahiti Yerramilli, Jayant Sravan Tamarapalli, Tanmay Girish Kulkarni, Jonathan Francis, Eric Nyberg
[ABSTRACT]
Deep Learning models are incredibly data-hungry and require very large
labeled datasets for supervised learning. As a consequence, these models often
suffer from overfitting, limiting their ability to generalize to real-world
examples. Recent advancements in diffusion models have enabled the generation
of photorealistic images based on textual inputs. Leveraging the substantial
datasets used to train these diffusion models, we propose a technique to
utilize generated images to augment existing datasets. This paper explores
various strategies for effective data augmentation to improve the out-of-domain
generalization capabilities of deep learning models.
[LINK]
http://arxiv.org/abs/2404.02353v4
[DATE]
2025-09-15 08:39:25+08:00
[CATEGORIES]
cs.LG
RAPTOR: A Foundation Policy for Quadrotor Control
[AUTHORS]
Jonas Eschmann, Dario Albani, Giuseppe Loianno
[ABSTRACT]
Humans are remarkably data-efficient when adapting to new unseen conditions,
like driving a new car. In contrast, modern robotic control systems, like
neural network policies trained using Reinforcement Learning (RL), are highly
specialized for single environments. Because of this overfitting, they are
known to break down even under small differences like the Simulation-to-Reality
(Sim2Real) gap and require system identification and retraining for even
minimal changes to the system. In this work, we present RAPTOR, a method for
training a highly adaptive foundation policy for quadrotor control. Our method
enables training a single, end-to-end neural-network policy to control a wide
variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg
that also differ in motor type (brushed vs. brushless), frame type (soft vs.
rigid), propeller type (2/3/4-blade), and flight controller
(PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy
with only 2084 parameters is sufficient for zero-shot adaptation to a wide
variety of platforms. The adaptation through In-Context Learning is made
possible by using a recurrence in the hidden layer. The policy is trained
through a novel Meta-Imitation Learning algorithm, where we sample 1000
quadrotors and train a teacher policy for each of them using Reinforcement
Learning. Subsequently, the 1000 teachers are distilled into a single, adaptive
student policy. We find that within milliseconds, the resulting foundation
policy adapts zero-shot to unseen quadrotors. We extensively test the
capabilities of the foundation policy under numerous conditions (trajectory
tracking, indoor/outdoor, wind disturbance, poking, different propellers).
[LINK]
http://arxiv.org/abs/2509.11481v1
[DATE]
2025-09-15 08:05:40+08:00
[CATEGORIES]
cs.LG
Cross-Platform Scaling of Vision-Language-Action Models from Edge to Cloud GPUs
[AUTHORS]
Amir Taherin, Juyi Lin, Arash Akbari, Arman Akbari, Pu Zhao, Weiwei Chen, David Kaeli, Yanzhi Wang
[ABSTRACT]
Vision-Language-Action (VLA) models have emerged as powerful generalist
policies for robotic control, yet their performance scaling across model
architectures and hardware platforms, as well as their associated power
budgets, remain poorly understood. This work presents an evaluation of five
representative VLA models – spanning state-of-the-art baselines and two newly
proposed architectures – targeting edge and datacenter GPU platforms. Using
the LIBERO benchmark, we measure accuracy alongside system-level metrics,
including latency, throughput, and peak memory usage, under varying edge power
constraints and high-performance datacenter GPU configurations. Our results
identify distinct scaling trends: (1) architectural choices, such as action
tokenization and model backbone size, strongly influence throughput and memory
footprint; (2) power-constrained edge devices exhibit non-linear performance
degradation, with some configurations matching or exceeding older datacenter
GPUs; and (3) high-throughput variants can be achieved without significant
accuracy loss. These findings provide actionable insights when selecting and
optimizing VLAs across a range of deployment constraints. Our work challenges
current assumptions about the superiority of datacenter hardware for robotic
inference.
[COMMENTS]
To appear in the Asilomar Conference on Signals, Systems, and
Computers 2025
[LINK]
http://arxiv.org/abs/2509.11480v1
[DATE]
2025-09-15 08:00:37+08:00
[CATEGORIES]
cs.LG
Modality-Aware Infrared and Visible Image Fusion with Target-Aware Supervision
[AUTHORS]
Tianyao Sun, Dawei Xiang, Tianqi Ding, Xiang Fang, Yijiashun Qi, Zunduo Zhao
[ABSTRACT]
Infrared and visible image fusion (IVIF) is a fundamental task in multi-modal
perception that aims to integrate complementary structural and textural cues
from different spectral domains. In this paper, we propose FusionNet, a novel
end-to-end fusion framework that explicitly models inter-modality interaction
and enhances task-critical regions. FusionNet introduces a modality-aware
attention mechanism that dynamically adjusts the contribution of infrared and
visible features based on their discriminative capacity. To achieve
fine-grained, interpretable fusion, we further incorporate a pixel-wise alpha
blending module, which learns spatially-varying fusion weights in an adaptive
and content-aware manner. Moreover, we formulate a target-aware loss that
leverages weak ROI supervision to preserve semantic consistency in regions
containing important objects (e.g., pedestrians, vehicles). Experiments on the
public M3FD dataset demonstrate that FusionNet generates fused images with
enhanced semantic preservation, high perceptual quality, and clear
interpretability. Our framework provides a general and extensible solution for
semantic-aware multi-modal image fusion, with benefits for downstream tasks
such as object detection and scene understanding.
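
The pixel-wise alpha blending step described above can be illustrated with a minimal sketch. It assumes the common convex-combination form, fused = alpha * infrared + (1 - alpha) * visible, which the abstract does not spell out; the function name `alpha_blend` and the hand-written weight map are hypothetical, and in FusionNet the weights would be learned rather than supplied.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of intensities in [0, 1]


def alpha_blend(infrared: Image, visible: Image, alpha: Image) -> Image:
    """Fuse two same-sized images with a per-pixel weight map.

    alpha[i][j] is the weight given to the infrared pixel; the visible pixel
    receives 1 - alpha[i][j]. The convex-combination form is an assumption
    made for illustration, not a detail confirmed by the abstract.
    """
    fused = []
    for ir_row, vis_row, a_row in zip(infrared, visible, alpha):
        fused.append([a * ir + (1.0 - a) * vis
                      for ir, vis, a in zip(ir_row, vis_row, a_row)])
    return fused


# Toy 2x2 example: alpha = 1 keeps infrared, alpha = 0 keeps visible.
ir = [[1.0, 0.0], [0.5, 0.5]]
vis = [[0.0, 1.0], [0.5, 0.0]]
alpha = [[1.0, 0.0], [0.5, 0.25]]
fused = alpha_blend(ir, vis, alpha)
```

Because each output pixel is a weighted average of the two inputs, the weight map itself can be inspected to see which modality dominated each region, which is one way to read the interpretability claim.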
[COMMENTS]
Accepted by 2025 6th International Conference on Computer Vision and
Data Mining (ICCVDM 2025)
[LINK]
http://arxiv.org/abs/2509.11476v1
[DATE]
2025-09-15 07:44:15+08:00
[CATEGORIES]
cs.LG
Tool-as-Interface: Learning Robot Policies from Observing Human Tool Use
[AUTHORS]
Haonan Chen, Cheng Zhu, Shuijing Liu, Yunzhu Li, Katherine Driggs-Campbell
[ABSTRACT]
Tool use is essential for enabling robots to perform complex real-world
tasks, but learning such skills requires extensive datasets. While
teleoperation is widely used, it is slow, delay-sensitive, and poorly suited
for dynamic tasks. In contrast, human videos provide a natural way for data
collection without specialized hardware, though they pose challenges for robot
learning due to viewpoint variations and embodiment gaps. To address these
challenges, we propose a framework that transfers tool-use knowledge from
humans to robots. To improve the policy’s robustness to viewpoint variations,
we use two RGB cameras to reconstruct 3D scenes and apply Gaussian splatting
for novel view synthesis. We reduce the embodiment gap using segmented
observations and tool-centric, task-space actions to achieve
embodiment-invariant visuomotor policy learning. We demonstrate our framework’s
effectiveness across a diverse suite of tool-use tasks, where our learned
policy shows strong generalization and robustness to human perturbations,
camera motion, and robot base movement. Our method achieves a 71% improvement
in task success over teleoperation-based diffusion policies and dramatically
reduces data collection time by 77% and 41% compared to teleoperation and the
state-of-the-art interface, respectively.
[COMMENTS]
Accepted to CoRL 2025. Project page:
https://tool-as-interface.github.io. 17 pages, 14 figures
[LINK]
http://arxiv.org/abs/2504.04612v2
[DATE]
2025-09-15 07:11:15+08:00
[CATEGORIES]
cs.LG
Data-Induced Interactions of Sparse Sensors Using Statistical Physics
[AUTHORS]
Andrei A. Klishin, J. Nathan Kutz, Krithika Manohar
[ABSTRACT]
Large-dimensional empirical data in science and engineering frequently have a
low-rank structure and can be represented as a combination of just a few
eigenmodes. Because of this structure, we can use just a few spatially
localized sensor measurements to reconstruct the full state of a complex
system. The quality of this reconstruction, especially in the presence of
sensor noise, depends significantly on the spatial configuration of the
sensors. Multiple algorithms based on gappy interpolation and QR factorization
have been proposed to optimize sensor placement. Here, instead of an algorithm
that outputs a single “optimal” sensor configuration, we take a statistical
mechanics view to compute the full landscape of sensor interactions induced by
the training data. The two key advances of this paper are the recasting of the
sensor placement landscape in an Ising model form and a regularized
reconstruction that significantly decreases reconstruction error for few
sensors. In addition, we provide the first uncertainty quantification of the sparse
sensing reconstruction and open questions about the shape of the reconstruction
risk curve. Mapping out these data-induced sensor interactions allows combining
them with external selection criteria and anticipating sensor replacement
impacts.
[COMMENTS]
23 RevTeX pages, 12 figures
[LINK]
http://arxiv.org/abs/2307.11838v2
[DATE]
2025-09-15 06:12:29+08:00
[CATEGORIES]
cs.LG
Tabular Data with Class Imbalance: Predicting Electric Vehicle Crash Severity with Pretrained Transformers (TabPFN) and Mamba-Based Models
[AUTHORS]
Shriyank Somvanshi, Pavan Hebli, Gaurab Chhetri, Subasish Das
[ABSTRACT]
This study presents a deep tabular learning framework for predicting crash
severity in electric vehicle (EV) collisions using real-world crash data from
Texas (2017-2023). After filtering for electric-only vehicles, 23,301
EV-involved crash records were analyzed. Feature importance techniques using
XGBoost and Random Forest identified intersection relation, first harmful
event, person age, crash speed limit, and day of week as the top predictors,
along with advanced safety features like automatic emergency braking. To
address class imbalance, Synthetic Minority Over-sampling Technique and Edited
Nearest Neighbors (SMOTEENN) resampling was applied. Three state-of-the-art
deep tabular models, TabPFN, MambaNet, and MambaAttention, were benchmarked for
severity prediction. While TabPFN demonstrated strong generalization,
MambaAttention achieved superior performance in classifying severe injury cases
due to its attention-based feature reweighting. The findings highlight the
potential of deep tabular architectures for improving crash severity prediction
and enabling data-driven safety interventions in EV crash contexts.
[COMMENTS]
This is the author’s preprint version of a paper accepted for
presentation at the 24th International Conference on Machine Learning and
Applications (ICMLA 2025), December 3-5, 2025, Florida, USA. The final
published version will appear in the official IEEE proceedings. Conference
site: https://www.icmla-conference.org/icmla25/
[LINK]
http://arxiv.org/abs/2509.11449v1
[DATE]
2025-09-15 05:46:17+08:00
[CATEGORIES]
cs.LG
Disentanglement of Biological and Technical Factors via Latent Space Rotation in Clinical Imaging Improves Disease Pattern Discovery
[AUTHORS]
Jeanny Pan, Philipp Seeböck, Christoph Fürböck, Svitlana Pochepnia, Jennifer Straub, Lucian Beer, Helmut Prosch, Georg Langs
[ABSTRACT]
Identifying new disease-related patterns in medical imaging data with the
help of machine learning enlarges the vocabulary of recognizable findings. This
supports diagnostic and prognostic assessment. However, image appearance varies
not only due to biological differences, but also due to imaging technology
linked to vendors and to scanning or reconstruction parameters. The resulting
domain shifts impede data representation learning strategies and the discovery
of biologically meaningful cluster appearances. To address these challenges, we
introduce an approach to actively learn the domain shift via post-hoc rotation
of the data latent space, enabling disentanglement of biological and technical
factors. Results on real-world heterogeneous clinical data showcase that the
learned disentangled representation leads to stable clusters representing
tissue-types across different acquisition settings. Cluster consistency is
improved by +19.01% (ARI), +16.85% (NMI), and +12.39% (Dice) compared to the
entangled representation, outperforming four state-of-the-art harmonization
methods. When using the clusters to quantify tissue composition on idiopathic
pulmonary fibrosis patients, the learned profiles enhance Cox survival
prediction. This indicates that the proposed label-free framework facilitates
biomarker discovery in multi-center routine imaging data. Code is available on
GitHub at https://github.com/cirmuw/latent-space-rotation-disentanglement.
[COMMENTS]
The Fourth Workshop on Applications of Medical Artificial
Intelligence, AMAI 2025, Held in Conjunction with MICCAI 2025, Daejeon,
Republic of Korea, September 23, 2025, Proceedings
[LINK]
http://arxiv.org/abs/2509.11436v1
[DATE]
2025-09-15 05:16:15+08:00
[CATEGORIES]
cs.LG
Efficient Pauli channel estimation with logarithmic quantum memory
[AUTHORS]
Sitan Chen, Weiyuan Gong
[ABSTRACT]
Here we revisit one of the prototypical tasks for characterizing the
structure of noise in quantum devices: estimating every eigenvalue of an
$n$-qubit Pauli noise channel to error $\epsilon$. Prior work [14] proved no-go
theorems for this task in the practical regime where one has a limited amount
of quantum memory, e.g. any protocol with $\le 0.99n$ ancilla qubits of quantum
memory must make exponentially many measurements, provided it is
non-concatenating. Such protocols can only interact with the channel by
repeatedly preparing a state, passing it through the channel, and measuring
immediately afterward.
This left open a natural question: does the lower bound hold even for general
protocols, i.e. ones which chain together many queries to the channel,
interleaved with arbitrary data-processing channels, before measuring?
Surprisingly, in this work we show the opposite: there is a protocol that can
estimate the eigenvalues of a Pauli channel to error $\epsilon$ using only
$O(\log n/\epsilon^2)$ ancilla and $\tilde{O}(n^2/\epsilon^2)$ measurements. In
contrast, we show that any protocol with zero ancilla, even a concatenating
one, must make $\Omega(2^n/\epsilon^2)$ measurements, which is tight.
Our results imply, to our knowledge, the first quantum learning task where
logarithmically many qubits of quantum memory suffice for an exponential
statistical advantage. Our protocol can be naturally extended to a protocol
that learns the eigenvalues of Pauli terms within any subset $A$ of a Pauli
channel with $O(\log\log(|A|)/\epsilon^2)$ ancilla and
$\tilde{O}(n^2/\epsilon^2)$ measurements.
[COMMENTS]
57 pages, 1 figure
[LINK]
http://arxiv.org/abs/2309.14326v5
[DATE]
2025-09-15 04:54:26+08:00
[CATEGORIES]
cs.LG
An End-to-End Depth-Based Pipeline for Selfie Image Rectification
[AUTHORS]
Ahmed Alhawwary, Janne Mustaniemi, Phong Nguyen-Ha, Janne Heikkilä
[ABSTRACT]
Portraits or selfie images taken from a close distance typically suffer from
perspective distortion. In this paper, we propose an end-to-end deep
learning-based rectification pipeline to mitigate the effects of perspective
distortion. We learn to predict the facial depth by training a deep CNN. The
estimated depth is utilized to adjust the camera-to-subject distance by moving
the camera farther, increasing the camera focal length, and reprojecting the 3D
image features to the new perspective. The reprojected features are then fed to
an inpainting module to fill in the missing pixels. We leverage a
differentiable renderer to enable end-to-end training of our depth estimation
and feature extraction nets to improve the rectified outputs. To boost the
results of the inpainting module, we incorporate an auxiliary module to predict
the horizontal movement of the camera which decreases the area that requires
hallucination of challenging face parts such as ears. Unlike previous works, we
process the full-frame input image at once without cropping the subject’s face
and processing it separately from the rest of the body, eliminating the need
for complex post-processing steps to attach the face back to the subject’s
body. To train our network, we utilize the popular game engine Unreal Engine to
generate a large synthetic face dataset containing various subjects, head
poses, expressions, eyewear, clothes, and lighting. Quantitative and
qualitative results show that our rectification pipeline outperforms previous
methods, and produces comparable results with a time-consuming 3D GAN-based
method while being more than 260 times faster.
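
The geometric idea of moving the camera farther while increasing the focal length can be sketched with the standard pinhole projection model, where image-plane size is focal_length * size / depth. This is an illustrative assumption and not the paper's learned pipeline; the function names and the example distances are hypothetical.

```python
def projected_size(focal_length: float, size: float, depth: float) -> float:
    """Pinhole projection: image-plane size of an object at the given depth."""
    return focal_length * size / depth


def compensating_focal_length(focal_length: float, depth: float,
                              extra_distance: float) -> float:
    """Focal length that keeps an object at `depth` the same image size
    after the camera moves back by `extra_distance`."""
    return focal_length * (depth + extra_distance) / depth


def near_far_ratio(depth_near: float, depth_far: float,
                   extra_distance: float = 0.0) -> float:
    """How much larger a near feature appears than an equally sized far one."""
    return (depth_far + extra_distance) / (depth_near + extra_distance)


# Hypothetical selfie geometry: nose at 0.30 m, ears at 0.40 m from the camera.
ratio_close = near_far_ratio(0.30, 0.40)       # strong perspective distortion
ratio_far = near_far_ratio(0.30, 0.40, 1.0)    # camera moved back by 1 m

# Doubling the distance of a face at 0.5 m needs twice the focal length.
new_focal = compensating_focal_length(4.0, 0.5, 0.5)
```

The near/far ratio tends toward 1 as the camera retreats, which is why the rectified portrait looks less distorted, while the longer focal length keeps the face at roughly the same size in the frame.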
[COMMENTS]
Accepted at IEEE TPAMI
[LINK]
http://arxiv.org/abs/2412.19189v2
[DATE]
2025-09-15 04:49:26+08:00
[CATEGORIES]
cs.LG
Long-time dynamics and universality of nonconvex gradient descent
[AUTHORS]
Qiyang Han
[ABSTRACT]
This paper develops a general approach to characterize the long-time
trajectory behavior of nonconvex gradient descent in generalized single-index
models in the large aspect ratio regime. In this regime, we show that for each
iteration the gradient descent iterate concentrates around a deterministic
vector called the ‘Gaussian theoretical gradient descent’, whose dynamics can
be tracked by a state evolution system of two recursive equations for two
scalars. Our concentration guarantees hold universally for a broad class of
design matrices and remain valid over long time horizons until algorithmic
convergence or divergence occurs. Moreover, our approach reveals that gradient
descent iterates are in general approximately independent of the data and
strongly incoherent with the feature vectors, a phenomenon previously known as
the ‘implicit regularization’ effect of gradient descent in specific models
under Gaussian data.
As an illustration of the utility of our general theory, we present two
applications of different natures in the regression setting. In the first, we
prove global convergence of nonconvex gradient descent with general independent
initialization for a broad class of structured link functions, and establish
universality of randomly initialized gradient descent in phase retrieval for
large aspect ratios. In the second, we develop a data-free iterative algorithm
for estimating state evolution parameters along the entire gradient descent
trajectory, thereby providing a low-cost yet statistically valid tool for
practical tasks such as hyperparameter tuning and runtime determination.
As a by-product of our analysis, we show that in the large aspect ratio
regime, the Gaussian theoretical gradient descent coincides with a recent line
of dynamical mean-field theory for gradient descent over the constant-time
horizon.
[LINK]
http://arxiv.org/abs/2509.11426v1
[DATE]
2025-09-15 04:36:18+08:00
[CATEGORIES]
cs.LG
Prediction of Stocks Index Price using Quantum GANs
[AUTHORS]
Sangram Deshpande, Gopal Ramesh Dahale, Sai Nandan Morapakula, Uday Wad
[ABSTRACT]
This paper investigates the application of Quantum Generative Adversarial
Networks (QGANs) for stock price prediction. Financial markets are inherently
complex, marked by high volatility and intricate patterns that traditional
models often fail to capture. QGANs, leveraging the power of quantum computing,
offer a novel approach by combining the strengths of generative models with
quantum machine learning techniques. We implement a QGAN model tailored for
stock price prediction and evaluate its performance using historical stock
market data. Our results demonstrate that QGANs can generate synthetic data
closely resembling actual market behavior, leading to enhanced prediction
accuracy. The experiment was conducted using the Stocks index price data and
the AWS Braket SV1 simulator for training the QGAN circuits. The
quantum-enhanced model outperforms classical Long Short-Term Memory (LSTM) and
GAN models in terms of convergence speed and prediction accuracy. This research
represents a key step toward integrating quantum computing in financial
forecasting, offering potential advantages in speed and precision over
traditional methods. The findings suggest important implications for traders,
financial analysts, and researchers seeking advanced tools for market analysis.
[LINK]
http://arxiv.org/abs/2509.12286v1
[DATE]
2025-09-15 04:28:24+08:00
[CATEGORIES]
cs.LG
Framing AI System Benchmarking as a Learning Task: FlexBench and the Open MLPerf Dataset
[AUTHORS]
Grigori Fursin, Daniel Altunay
[ABSTRACT]
Existing AI system benchmarks such as MLPerf often struggle to keep pace with
the rapidly evolving AI landscape, making it difficult to support informed
deployment, optimization, and co-design decisions for AI systems. We suggest
that benchmarking itself can be framed as an AI task - one in which models are
continuously evaluated and optimized across diverse datasets, software, and
hardware, using key metrics such as accuracy, latency, throughput, energy
consumption, and cost. To support this perspective, we present FlexBench: a
modular extension of the MLPerf LLM inference benchmark, integrated with
HuggingFace and designed to provide relevant and actionable insights.
Benchmarking results and metadata are collected into an Open MLPerf Dataset,
which can be collaboratively curated, extended, and leveraged for predictive
modeling and feature engineering. We successfully validated the FlexBench
concept through MLPerf Inference submissions, including evaluations of DeepSeek
R1 and LLaMA 3.3 on commodity servers. The broader objective is to enable
practitioners to make cost-effective AI deployment decisions that reflect their
available resources, requirements, and constraints.
[LINK]
http://arxiv.org/abs/2509.11413v1
[DATE]
2025-09-15 04:02:15+08:00
[CATEGORIES]
cs.LG
Deriving the Scaled-Dot-Function via Maximum Likelihood Estimation and Maximum Entropy Approach
[AUTHORS]
Jiyong Ma
[ABSTRACT]
In this paper, we present a maximum likelihood estimation approach to
determine the value vector in transformer models. We model the sequence of
value vectors, key vectors, and the query vector as a sequence of Gaussian
distributions. The variance in each Gaussian distribution depends on the time
step, the corresponding key vector, and the query vector. The mean value in
each Gaussian distribution depends on the time step and the corresponding
value vector. This analysis may offer a new explanation of the
scaled-dot-product function or softmax function used in transformer
architectures [1]. Another explanation, inspired by [4], is based on the
maximum entropy approach in natural language processing [5]. In this approach,
a query vector and key vectors are used to derive the feature functions for the
maximum entropy model.
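
For reference, the scaled-dot-product function that the abstract sets out to explain can be written as a short sketch. This is the standard formulation from the transformer literature, softmax(q · k_t / sqrt(d)) applied as weights over the value vectors, and not the paper's maximum likelihood or maximum entropy derivation; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(
    query: Sequence[float],
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted sum of value vectors) for one query."""
    d = len(query)
    # Dot product of the query with each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values))
              for j in range(dim_v)]
    return weights, output


# Two identical keys receive equal weight, so the output is the mean value.
weights, output = scaled_dot_product_attention(
    [1.0, 0.0],
    [[1.0, 0.0], [1.0, 0.0]],
    [[2.0, 0.0], [0.0, 2.0]],
)
```

The output is a weighted average of the value vectors. A precision-weighted average of Gaussian means has the same general shape, which may be one way to read the abstract's Gaussian modeling, though the abstract itself gives no such detail.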
[LINK]
http://arxiv.org/abs/2509.12285v1
[DATE]
2025-09-15 03:52:32+08:00
[CATEGORIES]
cs.LG
Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research
[AUTHORS]
Julian Junyan Wang, Victor Xiaoqi Wang
[ABSTRACT]
Unequal access to costly datasets essential for empirical research has long
hindered researchers from disadvantaged institutions, limiting their ability to
contribute to their fields and advance their careers. Recent breakthroughs in
Large Language Models (LLMs) have the potential to democratize data access by
automating data collection from unstructured sources. We develop and evaluate a
novel methodology using GPT-4o-mini within a Retrieval-Augmented Generation
(RAG) framework to collect data from corporate disclosures. Our approach
achieves human-level accuracy in collecting CEO pay ratios from approximately
10,000 proxy statements and Critical Audit Matters (CAMs) from more than 12,000
10-K filings, with LLM processing times of 9 and 40 minutes respectively, each
at a cost under US $10. This stands in stark contrast to the hundreds of hours
needed for manual collection or the thousands of dollars required for
commercial database subscriptions. To foster a more inclusive research
community by empowering researchers with limited resources to explore new
avenues of inquiry, we share our methodology and the resulting datasets.
[COMMENTS]
58 pages, 5 figures, 5 tables
[LINK]
http://arxiv.org/abs/2412.02065v3
[DATE]
2025-09-15 03:24:35+08:00
[CATEGORIES]
cs.LG
From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming
[AUTHORS]
Anusha Sinha, Keltin Grimes, James Lucassen, Michael Feffer, Nathan VanHoudnos, Zhiwei Steven Wu, Hoda Heidari
[ABSTRACT]
A red team simulates adversary attacks to help defenders find effective
strategies to defend their systems in a real-world operational setting. As more
enterprise systems adopt AI, red-teaming will need to evolve to address the
unique vulnerabilities and risks posed by AI systems. We take the position that
AI systems can be more effectively red-teamed if AI red-teaming is recognized
as a domain-specific evolution of cyber red-teaming. Specifically, we argue
that existing Cyber Red Teams who adopt this framing will be able to better
evaluate systems with AI components by recognizing that AI poses new risks, has
new failure modes to exploit, and often contains unpatchable bugs that
re-prioritize disclosure and mitigation strategies. Similarly, adopting a
cybersecurity framing will allow existing AI Red Teams to leverage a
well-tested structure to emulate realistic adversaries, promote mutual
accountability with formal rules of engagement, and provide a pattern to mature
the tooling necessary for repeatable, scalable engagements. In these ways, the
merging of AI and Cyber Red Teams will create a robust security ecosystem and
best position the community to adapt to the rapidly changing threat landscape.
[LINK]
http://arxiv.org/abs/2509.11398v1
[DATE]
2025-09-15 03:21:58+08:00
[CATEGORIES]
cs.LG
Offline RLAIF: Piloting VLM Feedback for RL via SFO
[AUTHORS]
Jacob Beck
[ABSTRACT]
While internet-scale image and textual data have enabled strong
generalization in Vision-Language Models (VLMs), the absence of internet-scale
control data has impeded the development of similar generalization in standard
reinforcement learning (RL) agents. Although VLMs are fundamentally limited in
their ability to solve control tasks due to their lack of action-conditioned
training data, their capacity for image understanding allows them to provide
valuable feedback in RL tasks by recognizing successful outcomes. A key
challenge in Reinforcement Learning from AI Feedback (RLAIF) is determining how
best to integrate VLM-derived signals into the learning process. We explore
this question in the context of offline RL and introduce a class of methods
called Sub-Trajectory Filtered Optimization (SFO). We identify three key
insights. First, trajectory length plays a crucial role in offline RL, as
full-trajectory preference learning exacerbates the stitching problem,
necessitating the use of sub-trajectories. Second, even in Markovian
environments, a non-Markovian reward signal from a sequence of images is
required to assess trajectory improvement, as VLMs do not interpret control
actions and must rely on visual cues over time. Third, a simple yet effective
approach – filtered and weighted behavior cloning – consistently outperforms more
complex RLHF-based methods. We propose Sub-Trajectory Filtered Behavior Cloning
(SFBC), a method that leverages VLM feedback on sub-trajectories while
incorporating a retrospective filtering mechanism that removes sub-trajectories
preceding failures to improve robustness and prevent turbulence. Please enjoy
our airport puns.
[COMMENTS]
Code is provided at https://github.com/jacooba/OfflineRLAIF
[LINK]
http://arxiv.org/abs/2503.01062v6
[DATE]
2025-09-15 03:13:37+08:00
[CATEGORIES]
cs.LG
Approaches to Responsible Governance of GenAI in Organizations
[AUTHORS]
Dhari Gandhi, Himanshu Joshi, Lucas Hartman, Shabnam Hassani
[ABSTRACT]
Peer-reviewed and accepted at IEEE ISTAS 2025.
The rapid evolution of Generative AI (GenAI) has introduced unprecedented
opportunities while presenting complex challenges around ethics,
accountability, and societal impact. This paper draws on a literature review,
established governance frameworks, and industry roundtable discussions to
identify core principles for integrating responsible GenAI governance into
diverse organizational structures. Our objective is to provide actionable
recommendations for a balanced, risk-based governance approach that enables
both innovation and oversight. Findings emphasize the need for adaptable risk
assessment tools, continuous monitoring practices, and cross-sector
collaboration to establish trustworthy GenAI. These insights provide a
structured foundation and Responsible GenAI Guide (ResAI) for organizations to
align GenAI initiatives with ethical, legal, and operational best practices.
[LINK]
http://arxiv.org/abs/2504.17044v2
[DATE]
2025-09-15 02:57:53+08:00
[CATEGORIES]
cs.LG
Enhancing ML Models Interpretability for Credit Scoring
[AUTHORS]
Sagi Schwartz, Qinling Wang, Fang Fang
[ABSTRACT]
Predicting default is essential for banks to ensure profitability and
financial stability. While modern machine learning methods often outperform
traditional regression techniques, their lack of transparency limits their use
in regulated environments. Explainable artificial intelligence (XAI) has
emerged as a solution in domains like credit scoring. However, most XAI
research focuses on post-hoc interpretation of black-box models, which does not
produce models lightweight or transparent enough to meet regulatory
requirements, such as those for Internal Ratings-Based (IRB) models.
This paper proposes a hybrid approach: post-hoc interpretations of black-box
models guide feature selection, followed by training glass-box models that
maintain both predictive power and transparency.
Using the Lending Club dataset, we demonstrate that this approach achieves
performance comparable to a benchmark black-box model while using only 10
features - an 88.5% reduction. In our example, SHapley Additive exPlanations
(SHAP) is used for feature selection, eXtreme Gradient Boosting (XGBoost)
serves as the benchmark and the base black-box model, and Explainable Boosting
Machine (EBM) and Penalized Logistic Tree Regression (PLTR) are the
investigated glass-box models.
We also show that model refinement using feature interaction analysis,
correlation checks, and expert input can further enhance model interpretability
and robustness.
[LINK]
http://arxiv.org/abs/2509.11389v1
[DATE]
2025-09-15 02:47:38+08:00
[CATEGORIES]
cs.LG
Some Robustness Properties of Label Cleaning
[AUTHORS]
Chen Cheng, John Duchi
[ABSTRACT]
We demonstrate that learning procedures that rely on aggregated labels, e.g.,
label information distilled from noisy responses, enjoy robustness properties
impossible without data cleaning. This robustness appears in several ways. In
the context of risk consistency – when one takes the standard approach in
machine learning of minimizing a surrogate (typically convex) loss in place of
a desired task loss (such as the zero-one mis-classification error) –
procedures using label aggregation obtain stronger consistency guarantees than
those even possible using raw labels. And while classical statistical scenarios
of fitting perfectly-specified models suggest that incorporating all possible
information – modeling uncertainty in labels – is statistically efficient,
consistency fails for “standard” approaches as soon as a loss to be minimized
is even slightly mis-specified. Yet procedures leveraging aggregated
information still converge to optimal classifiers, highlighting how
incorporating a fuller view of the data analysis pipeline, from collection to
model-fitting to prediction time, can yield a more robust methodology by
refining noisy signals.
[COMMENTS]
39 pages
[LINK]
http://arxiv.org/abs/2509.11379v1
[DATE]
2025-09-15 02:17:51+08:00
[CATEGORIES]
cs.LG
Intelligent Reservoir Decision Support: An Integrated Framework Combining Large Language Models, Advanced Prompt Engineering, and Multimodal Data Fusion for Real-Time Petroleum Operations
[AUTHORS]
Seyed Kourosh Mahjour, Seyed Saman Mahjour
[ABSTRACT]
The petroleum industry faces unprecedented challenges in reservoir
management, requiring rapid integration of complex multimodal datasets for
real-time decision support. This study presents a novel integrated framework
combining state-of-the-art large language models (GPT-4o, Claude 4 Sonnet,
Gemini 2.5 Pro) with advanced prompt engineering techniques and multimodal data
fusion for comprehensive reservoir analysis. The framework implements
domain-specific retrieval-augmented generation (RAG) with over 50,000 petroleum
engineering documents, chain-of-thought reasoning, and few-shot learning for
rapid field adaptation. Multimodal integration processes seismic
interpretations, well logs, and production data through specialized AI models
with vision transformers. Field validation across 15 diverse reservoir
environments demonstrates exceptional performance: 94.2% reservoir
characterization accuracy, 87.6% production forecasting precision, and 91.4%
well placement optimization success rate. The system achieves sub-second
response times while maintaining 96.2% safety reliability with no high-risk
incidents during evaluation. Economic analysis reveals 62-78% cost reductions
(mean 72%) relative to traditional methods with 8-month payback period.
Few-shot learning reduces field adaptation time by 72%, while automated prompt
optimization achieves 89% improvement in reasoning quality. The framework
processed real-time data streams with 96.2% anomaly detection accuracy and
reduced environmental incidents by 45%. We provide detailed experimental
protocols, baseline comparisons, ablation studies, and statistical significance
testing to ensure reproducibility. This research demonstrates practical
integration of cutting-edge AI technologies with petroleum domain expertise for
enhanced operational efficiency, safety, and economic performance.
[LINK]
http://arxiv.org/abs/2509.11376v1
[DATE]
2025-09-15 02:13:27+08:00
[CATEGORIES]
cs.LG
Decoding Musical Origins: Distinguishing Human and AI Composers
[AUTHORS]
Cheng-Yang Tsai, Tzu-Wei Huang, Shao-Yu Wei, Guan-Wei Chen, Hung-Ying Chu, Yu-Cheng Lin
[ABSTRACT]
With the rapid advancement of Large Language Models (LLMs), AI-driven music
generation has become a vibrant and fruitful area of research. However, the
representation of musical data remains a significant challenge. To address
this, a novel, machine-learning-friendly music notation system, YNote, was
developed. This study leverages YNote to train an effective classification
model capable of distinguishing whether a piece of music was composed by a
human (Native), a rule-based algorithm (Algorithm Generated), or an LLM (LLM
Generated). We frame this as a text classification problem, applying the Term
Frequency-Inverse Document Frequency (TF-IDF) algorithm to extract structural
features from YNote sequences and using the Synthetic Minority Over-sampling
Technique (SMOTE) to address data imbalance. The resulting model achieves an
accuracy of 98.25%, successfully demonstrating that YNote retains sufficient
stylistic information for analysis. More importantly, the model can identify
the unique “technological fingerprints” left by different AI generation
techniques, providing a powerful tool for tracing the origins of AI-generated
content.
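The TF-IDF step mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the token sequences below are hypothetical placeholders rather than real YNote notation, the weighting is the plain unsmoothed variant (library implementations often smooth the IDF term), and the SMOTE oversampling step is omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {token: weight} dict per tokenized document.

    Weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return vectors


# Hypothetical token sequences standing in for encoded pieces of music.
docs = [
    ["a", "b", "c", "a"],
    ["a", "d", "d", "e"],
    ["a", "b", "b", "f"],
]
vecs = tf_idf(docs)
```

A token that occurs in every sequence (here `"a"`) receives weight zero, while tokens concentrated in a single sequence are weighted most heavily; this is what makes the resulting vectors usable as class-discriminating features.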
[LINK]
http://arxiv.org/abs/2509.11369v1
[DATE]
2025-09-15 01:50:33+08:00
[CATEGORIES]
cs.LG
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits
[AUTHORS]
Loka Li, Wong Yu Kang, Minghao Fu, Guangyi Chen, Zhenhao Chen, Gongxu Luo, Yuewen Sun, Salman Khan, Peter Spirtes, Kun Zhang
[ABSTRACT]
Understanding human behavior traits is central to applications in
human-computer interaction, computational social science, and personalized AI
systems. Such understanding often requires integrating multiple modalities to
capture nuanced patterns and relationships. However, existing resources rarely
provide datasets that combine behavioral descriptors with complementary
modalities such as facial attributes and biographical information. To address
this gap, we present PersonaX, a curated collection of multimodal datasets
designed to enable comprehensive analysis of public traits across modalities.
PersonaX consists of (1) CelebPersona, featuring 9444 public figures from
diverse occupations, and (2) AthlePersona, covering 4181 professional athletes
across 7 major sports leagues. Each dataset includes behavioral trait
assessments inferred by three high-performing large language models, alongside
facial imagery and structured biographical features. We analyze PersonaX at two
complementary levels. First, we abstract high-level trait scores from text
descriptions and apply five statistical independence tests to examine their
relationships with other modalities. Second, we introduce a novel causal
representation learning (CRL) framework tailored to multimodal and
multi-measurement data, providing theoretical identifiability guarantees.
Experiments on both synthetic and real-world data demonstrate the effectiveness
of our approach. By unifying structured and unstructured analysis, PersonaX
establishes a foundation for studying LLM-inferred behavioral traits in
conjunction with visual and biographical attributes, advancing multimodal trait
analysis and causal reasoning.
[LINK]
http://arxiv.org/abs/2509.11362v1
[DATE]
2025-09-15 01:30:03+08:00
[CATEGORIES]
cs.LG
On Linear Mode Connectivity of Mixture-of-Experts Architectures
[AUTHORS]
Viet-Hoang Tran, Van Hoan Trinh, Khanh Vinh Bui, Tan M. Nguyen
[ABSTRACT]
Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes
of neural networks, wherein independently trained models have been observed to
be connected–up to permutation symmetries–by linear paths in parameter space
along which the loss remains consistently low. This observation challenges
classical views of non-convex optimization and has implications for model
ensembling, generalization, and our understanding of neural loss geometry.
Inspired by recent studies on LMC in standard neural networks, we
systematically investigate this phenomenon within Mixture-of-Experts (MoE)
architectures–a class of models known for their scalability and computational
efficiency, which combine traditional neural networks–referred to as
experts–through a learnable gating mechanism. We begin by conducting a
comprehensive analysis of both dense and sparse gating regimes, demonstrating
that the symmetries inherent to MoE architectures are fully characterized by
permutations acting on both the expert components and the gating function.
Building on these foundational findings, we propose a matching algorithm that
enables alignment between independently trained MoEs, thereby facilitating the
discovery of LMC. Finally, we empirically validate the presence of LMC using
our proposed algorithm across diverse MoE configurations–including dense,
sparse, and shared-expert variants–under a wide range of model settings and
datasets of varying scales and modalities. Our results confirm the existence of
LMC in MoE architectures and offer fundamental insights into the functional
landscape and optimization dynamics of deep learning models.
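The permutation symmetry described above can be checked on a toy example. The sketch below assumes a dense softmax gate over a scalar input and linear experts; these choices, and all names and values in it, are illustrative assumptions and do not reproduce the paper's matching algorithm.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense_moe(x, gate_w, experts):
    """Toy dense MoE: sum_i softmax(gate_w * x)_i * (slope_i * x + bias_i)."""
    weights = softmax([g * x for g in gate_w])
    return sum(w * (a * x + b) for w, (a, b) in zip(weights, experts))


gate_w = [0.5, -1.0, 2.0]
experts = [(1.0, 0.0), (-2.0, 1.0), (0.5, -0.5)]  # (slope, bias) per expert

perm = [2, 0, 1]
gate_p = [gate_w[i] for i in perm]
experts_p = [experts[i] for i in perm]

y = dense_moe(0.7, gate_w, experts)
y_both = dense_moe(0.7, gate_p, experts_p)        # permute gate and experts together
y_experts_only = dense_moe(0.7, gate_w, experts_p)  # permute experts only
```

Applying the same permutation to the gate and the experts leaves the output unchanged (up to floating-point rounding), whereas permuting the experts alone changes it. This is why two independently trained MoEs must be aligned before interpolating between their parameters.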
[LINK]
http://arxiv.org/abs/2509.11348v1
[DATE]
2025-09-15 00:51:41+08:00
[CATEGORIES]
cs.LG
BiLSTM-VHP: BiLSTM-Powered Network for Viral Host Prediction
[AUTHORS]
Azher Ahmed Efat, Farzana Islam, Annajiat Alim Rasel, Munima Haque
[ABSTRACT]
Recorded history shows the long coexistence of humans and animals, suggesting
it began much earlier. Despite some beneficial interdependence, many animals
carry viral diseases that can spread to humans. These diseases are known as
zoonotic diseases. Recent outbreaks of SARS-CoV-2, Monkeypox and swine flu
viruses have shown how these viruses can disrupt human life and cause death.
Fast and accurate predictions of the host from which the virus spreads can help
prevent these diseases from spreading. This work presents BiLSTM-VHP, a
lightweight bidirectional long short-term memory (LSTM)-based architecture that
can predict the host from the nucleotide sequence of orthohantavirus, rabies
lyssavirus, and rotavirus A with high accuracy. The proposed model works with
nucleotide sequences of 400 bases in length and achieved a prediction accuracy
of 89.62% for orthohantavirus, 96.58% for rotavirus A, and 77.22% for rabies
lyssavirus, outperforming previous studies. Moreover, performance of the model
is assessed using the confusion matrix, F-1 score, precision, recall, and
microaverage AUC. In addition, we introduce three curated datasets of
orthohantavirus, rotavirus A, and rabies lyssavirus containing 8,575, 95,197,
and 22,052 nucleotide sequences divided into 9, 12, and 29 host classes,
respectively. The code and datasets are available at
https://doi.org/10.17605/OSF.IO/ANFKR
[LINK]
http://arxiv.org/abs/2509.11345v1
[DATE]
2025-09-15 00:42:11+08:00
[CATEGORIES]
cs.LG
Beyond Instance Consistency: Investigating View Diversity in Self-supervised Learning
[AUTHORS]
Huaiyuan Qin, Muli Yang, Siyuan Hu, Peng Hu, Yu Zhang, Chen Gong, Hongyuan Zhu
[ABSTRACT]
Self-supervised learning (SSL) conventionally relies on the instance
consistency paradigm, assuming that different views of the same image can be
treated as positive pairs. However, this assumption breaks down for non-iconic
data, where different views may contain distinct objects or semantic
information. In this paper, we investigate the effectiveness of SSL when
instance consistency is not guaranteed. Through extensive ablation studies, we
demonstrate that SSL can still learn meaningful representations even when
positive pairs lack strict instance consistency. Furthermore, our analysis
reveals that increasing view diversity, by enforcing zero overlap
or using smaller crop scales, can enhance downstream performance on
classification and dense prediction tasks. However, excessive diversity is
found to reduce effectiveness, suggesting an optimal range for view diversity.
To quantify this, we adopt the Earth Mover’s Distance (EMD) as an estimator to
measure mutual information between views, finding that moderate EMD values
correlate with improved SSL learning, providing insights for future SSL
framework design. We validate our findings across a range of settings,
highlighting their robustness and applicability on diverse data sources.
[COMMENTS]
Published in TMLR. Review: https://openreview.net/forum?id=urWCU3YMA0
[LINK]
http://arxiv.org/abs/2509.11344v1
[DATE]
2025-09-15 00:41:17+08:00
[CATEGORIES]
cs.LG
Next-Generation Reservoir Computing for Dynamical Inference
[AUTHORS]
Rok Cestnik, Erik A. Martens
[ABSTRACT]
We present a simple and scalable implementation of next-generation reservoir
computing for modeling dynamical systems from time series data. Our approach
uses a pseudorandom nonlinear projection of time-delay embedded input, allowing
an arbitrary dimension of the feature space, thus providing a flexible
alternative to the polynomial-based projections used in previous
next-generation reservoir computing variants. We apply the method to benchmark
tasks – including attractor reconstruction and bifurcation diagram estimation
– using only partial and noisy observations. We also include an exploratory
example of estimating asymptotic oscillation phases. The models remain stable
over long rollouts and generalize beyond training data. This framework enables
the precise control of system state and is well suited for surrogate modeling
and digital twin applications.
[COMMENTS]
10 pages, 10 figures
[LINK]
http://arxiv.org/abs/2509.11338v1
[DATE]
2025-09-15 00:28:48+08:00
[CATEGORIES]
cs.LG
On the Escaping Efficiency of Distributed Adversarial Training Algorithms
[AUTHORS]
Ying Cao, Kun Yuan, Ali H. Sayed
[ABSTRACT]
Adversarial training has been widely studied in recent years due to its role
in improving model robustness against adversarial attacks. This paper focuses
on comparing different distributed adversarial training algorithms–including
centralized and decentralized strategies–within multi-agent learning
environments. Previous studies have highlighted the importance of model
flatness in determining robustness. To this end, we develop a general
theoretical framework to study the escaping efficiency of these algorithms from
local minima, which is closely related to the flatness of the resulting models.
We show that when the perturbation bound is sufficiently small (i.e., when the
attack strength is relatively mild) and a large batch size is used,
decentralized adversarial training algorithms–including consensus and
diffusion–are guaranteed to escape faster from local minima than the
centralized strategy, thereby favoring flatter minima. However, as the
perturbation bound increases, this trend may no longer hold. In the simulation
results, we illustrate our theoretical findings and systematically compare the
performance of models obtained through decentralized and centralized
adversarial training algorithms. The results highlight the potential of
decentralized strategies to enhance the robustness of models in distributed
settings.
[LINK]
http://arxiv.org/abs/2509.11337v1
[DATE]
2025-09-15 00:28:20+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Yonghao Weng, Liqiang Gao, Linwu Zhu, Jian Huang
[ABSTRACT]
Recently, large language models (LLMs) have achieved remarkable breakthroughs
in general domains such as programming and writing, and have demonstrated
strong potential in various scientific research scenarios. However, the
capabilities of AI models in the highly specialized field of materials
characterization and analysis have not yet been systematically or sufficiently
validated. To address this gap, we present MatQnA, the first multi-modal
benchmark dataset specifically designed for material characterization
techniques. MatQnA includes ten mainstream characterization methods, such as
X-ray Photoelectron Spectroscopy (XPS), X-ray Diffraction (XRD), Scanning
Electron Microscopy (SEM), Transmission Electron Microscopy (TEM), etc. We
employ a hybrid approach combining LLMs with human-in-the-loop validation to
construct high-quality question-answer pairs, integrating both multiple-choice
and subjective questions. Our preliminary evaluation results show that the most
advanced multi-modal AI models (e.g., GPT-4.1, Claude 4, Gemini 2.5, and Doubao
Vision Pro 32K) have already achieved nearly 90% accuracy on objective
questions in materials data interpretation and analysis tasks, demonstrating
strong potential for applications in materials characterization and analysis.
The MatQnA dataset is publicly available at
https://huggingface.co/datasets/richardhzgg/matQnA.
[LINK]
http://arxiv.org/abs/2509.11335v1
[DATE]
2025-09-15 00:23:48+08:00
[CATEGORIES]
cs.LG
From Personas to Talks: Revisiting the Impact of Personas on LLM-Synthesized Emotional Support Conversations
[AUTHORS]
Shenghan Wu, Yimo Zhu, Wynne Hsu, Mong-Li Lee, Yang Deng
[ABSTRACT]
The rapid advancement of Large Language Models (LLMs) has revolutionized the
generation of emotional support conversations (ESC), offering scalable
solutions with reduced costs and enhanced data privacy. This paper explores the
role of personas in the creation of ESC by LLMs. Our research utilizes
established psychological frameworks to measure and infuse persona traits into
LLMs, which then generate dialogues in the emotional support scenario. We
conduct extensive evaluations to understand the stability of persona traits in
dialogues, examining shifts in traits post-generation and their impact on
dialogue quality and strategy distribution. Experimental results reveal several
notable findings: 1) LLMs can infer core persona traits, 2) subtle shifts in
emotionality and extraversion occur, influencing the dialogue dynamics, and 3)
the application of persona traits modifies the distribution of emotional
support strategies, enhancing the relevance and empathetic quality of the
responses. These findings highlight the potential of persona-driven LLMs in
crafting more personalized, empathetic, and effective emotional support
dialogues, which has significant implications for the future design of
AI-driven emotional support systems.
[COMMENTS]
Accepted by EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2502.11451v2
[DATE]
2025-09-14 23:58:08+08:00
[CATEGORIES]
cs.CL
On the Fundamental Impossibility of Hallucination Control in Large Language Models
[AUTHORS]
Michał P. Karpowicz
[ABSTRACT]
This paper establishes a fundamental impossibility theorem: no LLM capable of
performing non-trivial knowledge aggregation can simultaneously achieve
truthful knowledge representation, semantic information conservation, complete
revelation of relevant knowledge, and knowledge-constrained optimality. The
impossibility is not an engineering limitation but arises from the mathematical
structure of information aggregation itself. We establish this result by
describing the inference process as an auction of ideas, where distributed
components compete exploiting their partial knowledge to shape responses. The
proof spans three independent mathematical domains: mechanism design theory
(Green-Laffont), the theory of proper scoring rules (Savage), and direct
architectural analysis of transformers (Log-Sum-Exp convexity). In particular,
we show how to quantify the creation of overconfident or intuitive
responses – the signature of both hallucination and creativity, or imagination.
To support this analysis, we introduce the complementary concepts of the
semantic information measure and the emergence operator to model bounded
reasoning in a general setting. We prove that while bounded reasoning generates
accessible information, providing valuable insights and inspirations, the
idealized unconstrained reasoning strictly preserves semantic content. By
demonstrating that hallucination and imagination are mathematically identical
phenomena – grounded in departures from truthfulness, semantic information
conservation, revelation of relevant knowledge, and knowledge-constrained
optimality – we offer a principled foundation for managing these behaviors in
advanced AI systems. Finally, we present some speculative ideas to inspire
evaluation and refinements of the proposed theory.
[COMMENTS]
Mathematics debugged: introduces Polish space model of knowledge,
added examples, corrected errors, re-edited, new safety and alignment section
[LINK]
http://arxiv.org/abs/2506.06382v6
[DATE]
2025-09-14 23:56:29+08:00
[CATEGORIES]
cs.CL
cs.LG
Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context
[AUTHORS]
Dasol Choi, Jungwhan Kim, Guijin Son
[ABSTRACT]
Physical commonsense reasoning datasets like PIQA are predominantly
English-centric and lack cultural diversity. We introduce Ko-PIQA, a Korean
physical commonsense reasoning dataset that incorporates cultural context.
Starting from 3.01 million web-crawled questions, we employed a multi-stage
filtering approach using three language models to identify 11,553 PIQA-style
questions. Through GPT-4o refinement and human validation, we obtained 441
high-quality question-answer pairs. A key feature of Ko-PIQA is its cultural
grounding: 19.7% of questions contain culturally specific elements like
traditional Korean foods (kimchi), clothing (hanbok), and specialized
appliances (kimchi refrigerators) that require culturally-aware reasoning
beyond direct translation. We evaluate seven language models on Ko-PIQA, with
the best model achieving 83.22% accuracy while the weakest reaches only
59.86%, demonstrating significant room for improvement. Models particularly
struggle with culturally specific scenarios, highlighting the importance of
culturally diverse datasets. Ko-PIQA serves as both a benchmark for Korean
language models and a foundation for more inclusive commonsense reasoning
research. The dataset and code will be publicly available.
[LINK]
http://arxiv.org/abs/2509.11303v1
[DATE]
2025-09-14 22:47:04+08:00
[CATEGORIES]
cs.CL
The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences
[AUTHORS]
Valentin Romanov, Steven A Niederer
[ABSTRACT]
Developing effective prompts demands significant cognitive investment to
generate reliable, high-quality responses from Large Language Models (LLMs). By
deploying case-specific prompt engineering techniques that streamline
frequently performed life sciences workflows, researchers could achieve
substantial efficiency gains that far exceed the initial time investment
required to master these techniques. The Prompt Report published in 2025
outlined 58 different text-based prompt engineering techniques, highlighting
the numerous ways prompts could be constructed. To provide actionable
guidelines and reduce the friction of navigating these various approaches, we
distil this report to focus on 6 core techniques: zero-shot, few-shot
approaches, thought generation, ensembling, self-criticism, and decomposition.
We break down the significance of each approach and ground it in use cases
relevant to life sciences, from literature summarization and data extraction to
editorial tasks. We provide detailed recommendations for how prompts should and
shouldn’t be structured, addressing common pitfalls including multi-turn
conversation degradation, hallucinations, and distinctions between reasoning
and non-reasoning models. We examine context window limitations and agentic
tools like Claude Code, and analyze the effectiveness of Deep Research tools
across OpenAI, Google, Anthropic and Perplexity platforms, discussing current
limitations. We demonstrate how prompt engineering can augment rather than
replace existing established individual practices around data processing and
document editing. Our aim is to provide actionable guidance on core prompt
engineering principles, and to facilitate the transition from opportunistic
prompting to an effective, low-friction systematic practice that contributes to
higher quality research.
[LINK]
http://arxiv.org/abs/2509.11295v1
[DATE]
2025-09-14 22:39:35+08:00
[CATEGORIES]
cs.CL
LastingBench: Defend Benchmarks Against Knowledge Leakage
[AUTHORS]
Yixiong Fang, Tianran Sun, Yuling Shi, Min Wang, Xiaodong Gu
[ABSTRACT]
The increasing complexity of large language models (LLMs) raises concerns
about their ability to “cheat” on standard Question Answering (QA) benchmarks
by memorizing task-specific data. This undermines the validity of benchmark
evaluations, as they no longer reflect genuine model capabilities but instead
the effects of data leakage. While prior work has focused on detecting such
leakage, little attention has been given to mitigating its impact and
preserving the long-term utility of benchmarks. In this paper, we introduce
LastingBench, a novel framework designed to continuously reinforce and
safeguard existing benchmarks against knowledge leakage. LastingBench
identifies leakage points in the context through perturbation, then rewrites
the leakage points to counterfactual ones, disrupting memorization while
preserving the benchmark’s original evaluative intent. Evaluations of
state-of-the-art QA benchmarks show significant performance gaps, highlighting
the efficacy of LastingBench in reducing memorization effects. LastingBench
offers a practical and scalable solution to ensure benchmark robustness over
time, promoting fairer and more interpretable evaluations of LLMs.
[LINK]
http://arxiv.org/abs/2506.21614v2
[DATE]
2025-09-14 22:29:20+08:00
[CATEGORIES]
cs.CL
Evaluating Automatic Speech Recognition Systems for Korean Meteorological Experts
[AUTHORS]
ChaeHun Park, Hojun Cho, Jaegul Choo
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2410.18444v3
[DATE]
2025-09-14 22:27:40+08:00
[CATEGORIES]
cs.CL
Mitigating Hallucinations in Large Vision-Language Models by Self-Injecting Hallucinations
[AUTHORS]
Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Jun Gao, Congxuan Zhang, Xiaojuan Qi, Bing Li, Weiming Hu
[ABSTRACT]
Large Vision-Language Models (LVLMs) suffer from serious hallucination
problems, where the model-generated responses are inconsistent with the visual
inputs. Existing hallucination mitigation methods are mainly based on
preference alignment and require external human annotations or auxiliary models
for preference data collection, which increase costs and limit sustainable
improvement. To tackle these challenges, we propose Autonomous Preference
Alignment via Self-Injection (APASI), a novel and generalizable method that
mitigates hallucinations without external dependencies. APASI leverages the
target LVLM to self-inject hallucinations into a generated response, creating a
pair of responses with varying preference levels. During the self-injection
process, the dis-preferred response is generated based on three key
observations of hallucinations, ensuring it simulates real hallucination
patterns. This fidelity offers an accurate learning signal for hallucination
mitigation. Moreover, APASI incorporates an iterative alignment training
strategy combined with curriculum learning to periodically update the
preference data with increasing challenge, enabling stable and continuous
enhancement of the LVLM. Extensive experiments across six benchmarks show that
APASI not only effectively mitigates hallucinations for three baseline models
but also achieves comparable or even superior performance to alignment-based
methods with external dependency, thereby demonstrating its effectiveness and
generalization capability. The code is available at
https://github.com/davidluciolu/APASI.
[COMMENTS]
Accepted at EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.11287v1
[DATE]
2025-09-14 22:26:53+08:00
[CATEGORIES]
cs.CL
PDFMathTranslate: Scientific Document Translation Preserving Layouts
[AUTHORS]
Rongxin Ouyang, Chang Chu, Zhikuang Xin, Xiangyao Ma
[ABSTRACT]
Language barriers in scientific documents hinder the diffusion and
development of science and technologies. However, prior efforts in translating
such documents largely overlooked the information in layouts. To bridge the
gap, we introduce PDFMathTranslate, the world’s first open-source software for
translating scientific documents while preserving layouts. Leveraging the most
recent advances in large language models and precise layout detection, we
contribute to the community with key improvements in precision, flexibility,
and efficiency. The work has been open-sourced at
https://github.com/byaidu/pdfmathtranslate with more than 222k downloads.
[COMMENTS]
7 pages, 4 figures, EMNLP 2025 Demo
[LINK]
http://arxiv.org/abs/2507.03009v3
[DATE]
2025-09-14 18:24:36+08:00
[CATEGORIES]
cs.CL
cs.LG
RanAT4BIE: Random Adversarial Training for Biomedical Information Extraction
[AUTHORS]
Jian Chen, Shengyi Lv, Leilei Su
[ABSTRACT]
We introduce random adversarial training (RAT), a novel framework
successfully applied to biomedical information extraction (BioIE) tasks.
Building on PubMedBERT as the foundational architecture, our study first
validates the effectiveness of conventional adversarial training in enhancing
pre-trained language models’ performance on BioIE tasks. While adversarial
training yields significant improvements across various performance metrics, it
also introduces considerable computational overhead. To address this
limitation, we propose RAT as an efficient solution for biomedical information
extraction. This framework strategically integrates random sampling mechanisms
with adversarial training principles, achieving dual objectives: enhanced model
generalization and robustness while significantly reducing computational costs.
Through comprehensive evaluations, RAT demonstrates superior performance
compared to baseline models in BioIE tasks. The results highlight RAT’s
potential as a transformative framework for biomedical natural language
processing, offering a balance between model performance and
computational efficiency.
[COMMENTS]
Accepted for publication at the International Joint Conference on
Neural Networks (IJCNN) 2025
[LINK]
http://arxiv.org/abs/2509.11191v1
[DATE]
2025-09-14 17:40:00+08:00
[CATEGORIES]
cs.CL
Differentially-private text generation degrades output language quality
[AUTHORS]
Erion Çano, Ivan Habernal
[ABSTRACT]
Ensuring user privacy by synthesizing data from large language models (LLMs)
tuned under differential privacy (DP) has become popular recently. However, the
impact of DP fine-tuned LLMs on the quality of the language and the utility of
the texts they produce has not been investigated. In this work, we tune five
LLMs with three corpora under four levels of privacy and assess the length, the
grammatical correctness, and the lexical diversity of the text outputs they
produce. We also probe the utility of the synthetic outputs in downstream
classification tasks such as book genre recognition based on book descriptions
and cause of death recognition based on verbal autopsies. The results indicate
that LLMs tuned under stronger privacy constraints produce texts that are
shorter by at least 77%, that are less grammatically correct by at least 9%,
and are less diverse by at least 10% in bi-gram diversity. Furthermore, the
accuracy they reach in downstream classification tasks decreases, which might
be detrimental to the usefulness of the generated synthetic data.
[COMMENTS]
20 pages, 3 figures, 35 tables
[LINK]
http://arxiv.org/abs/2509.11176v1
[DATE]
2025-09-14 17:16:11+08:00
[CATEGORIES]
cs.CL
Rumor Detection by Multi-task Suffix Learning based on Time-series Dual Sentiments
[AUTHORS]
Zhiwei Liu, Kailai Yang, Eduard Hovy, Sophia Ananiadou
[COMMENTS]
work in progress
[LINK]
http://arxiv.org/abs/2502.14383v2
[DATE]
2025-09-14 16:55:19+08:00
[CATEGORIES]
cs.CL
Assessing LLMs in Art Contexts: Critique Generation and Theory of Mind Evaluation
[AUTHORS]
Takaya Arita, Wenxian Zheng, Reiji Suzuki, Fuminori Akiba
[ABSTRACT]
This study explored how large language models (LLMs) perform in two areas
related to art: writing critiques of artworks and reasoning about mental states
(Theory of Mind, or ToM) in art-related situations. For the critique generation
part, we built a system that combines Noel Carroll’s evaluative framework with
a broad selection of art criticism theories. The model was prompted to first
write a full-length critique and then shorter, more coherent versions using a
step-by-step prompting process. These AI-generated critiques were then compared
with those written by human experts in a Turing test-style evaluation. In many
cases, human subjects had difficulty telling which was which, and the results
suggest that LLMs can produce critiques that are not only plausible in style
but also rich in interpretation, as long as they are carefully guided. In the
second part, we introduced new simple ToM tasks based on situations involving
interpretation, emotion, and moral tension, which can appear in the context of
art. These go beyond standard false-belief tests and allow for more complex,
socially embedded forms of reasoning. We tested 41 recent LLMs and found that
their performance varied across tasks and models. In particular, tasks that
involved affective or ambiguous situations tended to reveal clearer
differences. Taken together, these results help clarify how LLMs respond to
complex interpretative challenges, revealing both their cognitive limitations
and potential. While our findings do not directly contradict the so-called
Generative AI Paradox–the idea that LLMs can produce expert-like output
without genuine understanding–they suggest that, depending on how LLMs are
instructed, such as through carefully designed prompts, these models may begin
to show behaviors that resemble understanding more closely than we might
assume.
[COMMENTS]
Corrected a typo in the metadata title only
(“Assesing” -> “Assessing”). No changes were made to the PDF or source files
[LINK]
http://arxiv.org/abs/2504.12805v2
[DATE]
2025-09-14 16:24:18+08:00
[CATEGORIES]
cs.CL
AQUA: Attention via QUery mAgnitudes for Memory and Compute Efficient Inference in LLMs
[AUTHORS]
Santhosh G S, Saurav Prakash, Balaraman Ravindran
[ABSTRACT]
The quadratic complexity of the attention mechanism remains a fundamental
barrier to scaling Large Language Models (LLMs) to longer contexts, creating a
critical bottleneck in both computation and memory. To address this, we
introduce AQUA (Attention via QUery mAgnitudes), a novel and versatile
approximation strategy that significantly reduces the cost of attention with a
graceful performance trade-off. Our method operates in two phases: an efficient
offline step where we compute a universal, language-agnostic projection matrix
via SVD on a calibration dataset, and an online inference step where we project
query and key vectors and dynamically select a sparse subset of dimensions
based on the query’s magnitude. We provide a formal theoretical analysis of
AQUA, establishing the break-even point at which it becomes more
computationally efficient than standard attention. Our empirical evaluations on
state-of-the-art models like Llama-3.1-8B demonstrate that a 25% reduction in
the attention dot-product computation can be achieved with a statistically
insignificant impact on performance across a wide range of benchmarks. We
further showcase the versatility of AQUA by demonstrating its ability to
synergistically accelerate existing token eviction methods like H2O and to
directly reduce KV-cache memory size. By offering a controllable knob to
balance efficiency and accuracy, AQUA provides a practical and powerful tool
for making large-scale LLM inference more accessible and sustainable.
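
The online step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the query and key vectors have already been multiplied by the offline projection matrix, and the function name `aqua_dot` and the parameter `keep` are hypothetical names introduced only for this example.

```python
def aqua_dot(q_proj, k_proj, keep):
    """Approximate q.k using only the `keep` dimensions where |q| is largest.

    q_proj, k_proj: already-projected query/key vectors (equal-length lists).
    keep: number of dimensions to retain (hypothetical efficiency knob).
    """
    # Rank dimensions by query magnitude, largest first.
    order = sorted(range(len(q_proj)), key=lambda i: abs(q_proj[i]), reverse=True)
    selected = order[:keep]
    # Dot product restricted to the selected dimensions.
    return sum(q_proj[i] * k_proj[i] for i in selected)


q = [0.1, -2.0, 0.05, 1.5]
k = [1.0, 0.5, -3.0, 2.0]
exact = aqua_dot(q, k, keep=4)   # all dimensions: the exact dot product
approx = aqua_dot(q, k, keep=2)  # only the two largest-|q| dimensions
```

Lowering `keep` trades accuracy for fewer multiply-adds, which is the kind of controllable knob the abstract refers to.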
[LINK]
http://arxiv.org/abs/2509.11155v1
[DATE]
2025-09-14 16:20:48+08:00
[CATEGORIES]
cs.LG
cs.CL
EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
[AUTHORS]
Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
[ABSTRACT]
In this paper, we introduce EasyEdit2, a framework designed to enable
plug-and-play adjustability for controlling Large Language Model (LLM)
behaviors. EasyEdit2 supports a wide range of test-time interventions,
including safety, sentiment, personality, reasoning patterns, factuality, and
language features. Unlike its predecessor, EasyEdit2 features a new
architecture specifically designed for seamless model steering. It comprises
key modules such as the steering vector generator and the steering vector
applier, which enable automatic generation and application of steering vectors
to influence the model’s behavior without modifying its parameters. One of the
main advantages of EasyEdit2 is its ease of use: users do not need extensive
technical knowledge. With just a single example, they can effectively guide and
adjust the model’s responses, making precise control both accessible and
efficient. Empirically, we report model steering performance across different
LLMs, demonstrating the effectiveness of these techniques. We have released the
source code on GitHub at https://github.com/zjunlp/EasyEdit along with a
demonstration notebook. In addition, we provide a demo video at
https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.
[COMMENTS]
EMNLP 2025 System Demonstrations. Demo:
https://www.youtube.com/watch?v=AkfoiPfp5rQ; code:
https://github.com/zjunlp/EasyEdit
[LINK]
http://arxiv.org/abs/2504.15133v3
[DATE]
2025-09-14 16:10:18+08:00
[CATEGORIES]
cs.CL
cs.LG
Text2Mem: A Unified Memory Operation Language for Memory Operating System
[AUTHORS]
Felix Wang, Boyu Chen, Kerun Xu, Bo Tang, Feiyu Xiong, Zhiyu Li
[ABSTRACT]
Large language model agents increasingly depend on memory to sustain
long-horizon interaction, but existing frameworks remain limited. Most expose only a
few basic primitives such as encode, retrieve, and delete, while higher-order
operations like merge, promote, demote, split, lock, and expire are missing or
inconsistently supported. Moreover, there is no formal and executable
specification for memory commands, leaving scope and lifecycle rules implicit
and causing unpredictable behavior across systems. We introduce Text2Mem, a
unified memory operation language that provides a standardized pathway from
natural language to reliable execution. Text2Mem defines a compact yet
expressive operation set aligned with encoding, storage, and retrieval. Each
instruction is represented as a JSON-based schema instance with required fields
and semantic invariants, which a parser transforms into typed operation objects
with normalized parameters. A validator ensures correctness before execution,
while adapters map typed objects either to a SQL prototype backend or to real
memory frameworks. Model-based services such as embeddings or summarization are
integrated when required. All results are returned through a unified execution
contract. This design ensures safety, determinism, and portability across
heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark
that separates schema generation from backend execution to enable systematic
evaluation. Together, these components establish the first standardized
foundation for memory control in agents.
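
The parse-then-validate pathway described above might look roughly like the sketch below. The field names (`op`, `target`, `args`), the `ttl_seconds` invariant, and the class `MemoryOp` are assumptions made for illustration; the abstract does not specify the actual Text2Mem schema.

```python
import json
from dataclasses import dataclass, field

# Operation names taken from the abstract; treating them as one flat set is an assumption.
ALLOWED_OPS = {"encode", "retrieve", "delete", "merge", "promote",
               "demote", "split", "lock", "expire"}


@dataclass
class MemoryOp:
    """Typed operation object (field names are illustrative assumptions)."""
    op: str
    target: str
    args: dict = field(default_factory=dict)


def parse(raw: str) -> MemoryOp:
    """Turn a JSON instruction into a typed object with normalized parameters."""
    data = json.loads(raw)
    return MemoryOp(op=str(data["op"]).strip().lower(),
                    target=str(data["target"]).strip(),
                    args=dict(data.get("args", {})))


def validate(op: MemoryOp) -> None:
    """Reject instructions that violate the (assumed) invariants before execution."""
    if op.op not in ALLOWED_OPS:
        raise ValueError(f"unknown operation: {op.op}")
    if not op.target:
        raise ValueError("target must be non-empty")
    if op.op == "expire" and "ttl_seconds" not in op.args:
        raise ValueError("expire requires args.ttl_seconds")


instruction = '{"op": "Expire", "target": "note-17", "args": {"ttl_seconds": 3600}}'
parsed = parse(instruction)
validate(parsed)  # raises ValueError if an invariant is violated
```

Keeping validation separate from parsing means a backend adapter only ever receives objects that have already passed the checks.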
[COMMENTS]
11 pages, 3 figures
[LINK]
http://arxiv.org/abs/2509.11145v1
[DATE]
2025-09-14 15:30:09+08:00
[CATEGORIES]
cs.CL
When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs’ Toxicity
[AUTHORS]
Shiyao Cui, Xijia Feng, Yingkang Wang, Junxiao Yang, Zhexin Zhang, Biplab Sikdar, Hongning Wang, Han Qiu, Minlie Huang
[ABSTRACT]
Emojis are globally used non-verbal cues in digital communication, and
extensive research has examined how large language models (LLMs) understand and
utilize emojis across contexts. While usually associated with friendliness or
playfulness, it is observed that emojis may trigger toxic content generation in
LLMs. Motivated by such an observation, we aim to investigate: (1) whether
emojis can clearly enhance the toxicity generation in LLMs and (2) how to
interpret this phenomenon. We begin with a comprehensive exploration of
emoji-triggered LLM toxicity generation by automating the construction of
prompts with emojis to subtly express toxic intent. Experiments across 5
mainstream languages on 7 famous LLMs along with jailbreak tasks demonstrate
that prompts with emojis could easily induce toxicity generation. To understand
this phenomenon, we conduct model-level interpretations spanning semantic
cognition, sequence generation and tokenization, suggesting that emojis can act
as a heterogeneous semantic channel to bypass the safety mechanisms. To pursue
deeper insights, we further probe the pre-training corpus and uncover a potential
correlation between emoji-related data pollution and toxicity
generation behaviors. Supplementary materials provide our implementation code
and data. (Warning: This paper contains potentially sensitive content)
[LINK]
http://arxiv.org/abs/2509.11141v1
[DATE]
2025-09-14 15:21:44+08:00
[CATEGORIES]
cs.CL
Agentic Username Suggestion and Multimodal Gender Detection in Online Platforms: Introducing the PNGT-26K Dataset
[AUTHORS]
Farbod Bijary, Mohsen Ebadpour, Amirhosein Tajbakhsh
[ABSTRACT]
Persian names present unique challenges for natural language processing
applications, particularly in gender detection and digital identity creation,
due to transliteration inconsistencies and culture-specific naming patterns.
Existing tools exhibit significant performance degradation on Persian names,
while the scarcity of comprehensive datasets further compounds these
limitations. To address these challenges, the present research introduces
PNGT-26K, a comprehensive dataset of Persian names, their commonly associated
gender, and their English transliteration, consisting of approximately 26,000
tuples. As a demonstration of how this resource can be utilized, we also
introduce two frameworks, namely Open Gender Detection and Nominalist. Open
Gender Detection is a production-grade, ready-to-use framework for using
existing data from a user, such as profile photo and name, to give a
probabilistic guess about the person’s gender. Nominalist, the second framework
introduced by this paper, utilizes agentic AI to help users choose a username
for their social media accounts on any platform. It can be easily integrated
into any website to provide a better user experience. The PNGT-26K dataset,
Nominalist and Open Gender Detection frameworks are publicly available on
GitHub.
[LINK]
http://arxiv.org/abs/2509.11136v1
[DATE]
2025-09-14 15:08:32+08:00
[CATEGORIES]
cs.LG
cs.CL
We Argue to Agree: Towards Personality-Driven Argumentation-Based Negotiation Dialogue Systems for Tourism
[AUTHORS]
Priyanshu Priya, Saurav Dudhate, Desai Vishesh Yasheshbhai, Asif Ekbal
[ABSTRACT]
Integrating argumentation mechanisms into negotiation dialogue systems
improves conflict resolution through exchanges of arguments and critiques.
Moreover, incorporating personality attributes enhances adaptability by
aligning interactions with individuals’ preferences and styles. To advance
these capabilities in negotiation dialogue systems, we propose a novel
Personality-driven Argumentation-based Negotiation Dialogue Generation (PAN-DG)
task. To support this task, we introduce PACT, a dataset of Personality-driven
Argumentation-based negotiation Conversations for the Tourism sector. This dataset,
generated using Large Language Models (LLMs), features three distinct
personality profiles, viz. Argumentation Profile, Preference Profile, and
Buying Style Profile to simulate a variety of negotiation scenarios involving
diverse personalities. Thorough automatic and manual evaluations indicate that
the dataset comprises high-quality dialogues. Further, we conduct comparative
experiments between pre-trained and fine-tuned LLMs for the PAN-DG task.
Multi-dimensional evaluation demonstrates that the fine-tuned LLMs effectively
generate personality-driven rational responses during negotiations. This
underscores the effectiveness of PACT in enhancing personalization and
reasoning capabilities in negotiation dialogue systems, thereby establishing a
foundation for future research in this domain.
[COMMENTS]
Paper is accepted at EMNLP (Findings) 2025
[LINK]
http://arxiv.org/abs/2509.11118v1
[DATE]
2025-09-14 14:16:42+08:00
[CATEGORIES]
cs.CL
Fluid Language Model Benchmarking
[AUTHORS]
Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith
[ABSTRACT]
Language model (LM) benchmarking faces several challenges: comprehensive
evaluations are costly, benchmarks often fail to measure the intended
capabilities, and evaluation quality can degrade due to labeling errors and
benchmark saturation. Although various strategies have been proposed to
mitigate these issues, they tend to address individual aspects in isolation,
neglecting broader questions about overall evaluation quality. Here, we
introduce Fluid Benchmarking, a new evaluation approach that advances LM
benchmarking across multiple dimensions. Inspired by psychometrics, Fluid
Benchmarking is based on the insight that the relative value of benchmark items
depends on an LM’s capability level, suggesting that evaluation should adapt to
each LM. Methodologically, Fluid Benchmarking estimates an item response model
based on existing LM evaluation results and uses the inferred quantities to
select evaluation items dynamically, similar to computerized adaptive testing
in education. In our experiments, we compare Fluid Benchmarking against the
common practice of random item sampling as well as more sophisticated
baselines, including alternative methods grounded in item response theory. We
examine four dimensions – efficiency, validity, variance, and saturation –
and find that Fluid Benchmarking achieves superior performance in all of them
(e.g., higher validity and less variance on MMLU with fifty times fewer items).
Our analysis shows that the two components of Fluid Benchmarking have distinct
effects: item response theory, used to map performance into a latent ability
space, increases validity, while dynamic item selection reduces variance.
Overall, our results suggest that LM benchmarking can be substantially improved
by moving beyond static evaluation.
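
A rough sketch of the adaptive selection idea follows. It assumes a two-parameter logistic (2PL) item response model and picks the unused item with the highest Fisher information at the current ability estimate, which is a common choice in computerized adaptive testing; the abstract does not state which model or selection criterion the authors use, and all names and item parameters below are illustrative.

```python
import math


def p_correct(theta, a, b):
    """2PL item response model: probability that an LM with ability theta
    answers an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Information an item carries about theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next_item(theta, items, used):
    """Return the index of the unused item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in used]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


# (discrimination, difficulty) pairs; values are made up for illustration.
items = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
choice_weak = pick_next_item(-2.0, items, used=set())   # low-ability LM
choice_strong = pick_next_item(2.0, items, used=set())  # high-ability LM
```

With equal discrimination, the most informative item is the one whose difficulty is closest to the current ability estimate, so weaker and stronger LMs are routed to different items.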
[COMMENTS]
COLM 2025
[LINK]
http://arxiv.org/abs/2509.11106v1
[DATE]
2025-09-14 13:49:42+08:00
[CATEGORIES]
cs.CL
cs.LG
Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models
[AUTHORS]
Shengjie Ma, Xuhui Jiang, Chengjin Xu, Cehao Yang, Liyu Zhang, Jian Guo
[ABSTRACT]
Large Language Models (LLMs) have achieved remarkable success but remain
data-inefficient, especially when learning from small, specialized corpora with
limited and proprietary data. Existing synthetic data generation methods for
continue pre-training focus on intra-document content and overlook
cross-document knowledge associations, limiting content diversity and depth. We
propose Synthesize-on-Graph (SoG), a synthetic data generation framework that
incorporates cross-document knowledge associations for efficient corpus
expansion. SoG constructs a context graph by extracting entities and concepts
from the original corpus, representing cross-document associations, and
employing a graph walk strategy for knowledge-associated sampling. This
enhances synthetic data diversity and coherence, enabling models to learn
complex knowledge structures and handle rare knowledge. To further improve the
quality of synthetic data, we integrate two complementary strategies,
Chain-of-Thought (CoT) and Contrastive Clarifying (CC), to enhance both
reasoning capability and discriminative power. Extensive experiments
demonstrate that SoG surpasses state-of-the-art (SOTA) methods on multi-hop and
domain-specific question answering, while achieving competitive performance on
long-context reading comprehension. These results highlight the superior
generalization ability of SoG. Our work advances the paradigm of synthetic data
generation and offers practical solutions for efficient knowledge acquisition
in LLMs, particularly for downstream tasks and domains with limited training
data.
[LINK]
http://arxiv.org/abs/2505.00979v3
[DATE]
2025-09-14 13:23:15+08:00
[CATEGORIES]
cs.CL
MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
[AUTHORS]
Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, Yi R. Fung
[COMMENTS]
We release our code and resource at
https://github.com/no-touch-fish/Multi-QA-Tuning. The paper is accepted into
EMNLP 2025 main
[LINK]
http://arxiv.org/abs/2504.21773v4
[DATE]
2025-09-14 13:13:07+08:00
[CATEGORIES]
cs.CL
Length-Aware Rotary Position Embedding for Text-Speech Alignment
[AUTHORS]
Hyeongju Kim, Juheon Lee, Jinhyeok Yang, Jacob Morton
[ABSTRACT]
Many recent text-to-speech (TTS) systems are built on transformer
architectures and employ cross-attention mechanisms for text-speech alignment.
Within these systems, rotary position embedding (RoPE) is commonly used to
encode positional information in text and speech representations. In this work,
we introduce length-aware RoPE (LARoPE), a simple yet effective extension of
RoPE that improves text-speech alignment. Unlike RoPE, which relies on absolute
indices, LARoPE computes relative distances between query and key positions
using length-normalized indices. Experimental results show that LARoPE
consistently outperforms RoPE, offering faster loss convergence, more accurate
text-speech alignment, and higher overall TTS quality. Furthermore, LARoPE
demonstrates greater resilience to variations in utterance duration and
maintains stable performance in extended speech generation up to 30 seconds,
whereas RoPE suffers from notable degradation. Notably, our method achieves a
state-of-the-art word error rate on a standard zero-shot TTS benchmark.
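
The effect of length-normalized indices can be illustrated with a toy two-dimensional rotary embedding. This is only a sketch of the general idea: the normalization `scale * index / length`, the constant `scale`, and the single frequency `freq` are assumptions, not details taken from the paper.

```python
import math


def normalized_position(index, length, scale=10.0):
    """Map an absolute index to a length-normalized position in [0, scale)."""
    return scale * index / length


def rotate(pair, angle):
    """Apply a 2-D rotation, the basic building block of RoPE."""
    x, y = pair
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def score(q, k, q_idx, q_len, k_idx, k_len, freq=1.0):
    """Dot product of a rotated query/key pair; it depends only on the
    difference between the two length-normalized positions."""
    qr = rotate(q, freq * normalized_position(q_idx, q_len))
    kr = rotate(k, freq * normalized_position(k_idx, k_len))
    return qr[0] * kr[0] + qr[1] * kr[1]


# A speech frame halfway through 200 frames and a text token halfway through
# 20 tokens share the same normalized position, so no relative rotation applies.
aligned = score((1.0, 0.0), (1.0, 0.0), 100, 200, 10, 20)
misaligned = score((1.0, 0.0), (1.0, 0.0), 100, 200, 0, 20)
```

With absolute indices, frame 100 and token 10 would be 90 positions apart. After normalization they coincide, which is why the attention score favors the diagonal text-speech alignment regardless of sequence length.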
[COMMENTS]
5 pages, 3 figures, preprint
[LINK]
http://arxiv.org/abs/2509.11084v1
[DATE]
2025-09-14 12:25:13+08:00
[CATEGORIES]
cs.CL
Transplant Then Regenerate: A New Paradigm for Text Data Augmentation
[AUTHORS]
Guangzhan Wang, Hongyu Zhang, Beijun Shen, Xiaodong Gu
[ABSTRACT]
Data augmentation is a critical technique in deep learning. Traditional
methods like Back-translation typically focus on lexical-level rephrasing,
which primarily produces variations with the same semantics. While large
language models (LLMs) have enhanced text augmentation through their “knowledge
emergence” capability, controlling the style and structure of these outputs
remains challenging and requires meticulous prompt engineering. In this paper,
we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs.
The core idea of LMTransplant is transplant-then-regenerate: incorporating seed
text into a context expanded by an LLM, and asking the LLM to regenerate a variant
based on the expanded context. This strategy allows the model to create more
diverse and creative content-level variants by fully leveraging the knowledge
embedded in LLMs, while preserving the core attributes of the original text. We
evaluate LMTransplant across various text-related tasks, demonstrating its
superior performance over existing text augmentation methods. Moreover,
LMTransplant demonstrates exceptional scalability as the size of augmented data
grows.
[COMMENTS]
Accepted by EMNLP 2025
[LINK]
http://arxiv.org/abs/2508.14723v3
[DATE]
2025-09-14 12:08:17+08:00
[CATEGORIES]
cs.CL
The System Description of CPS Team for Track on Driving with Language of CVPR 2024 Autonomous Grand Challenge
[AUTHORS]
Jinghan Peng, Jingwen Wang, Xing Yu, Dehui Du
[ABSTRACT]
This report outlines our approach using vision language model systems for the
Driving with Language track of the CVPR 2024 Autonomous Grand Challenge. We
have exclusively utilized the DriveLM-nuScenes dataset for training our models.
Our systems are built on the LLaVA models, which we enhanced through
fine-tuning with the LoRA and DoRA methods. Additionally, we have integrated
depth information from open-source depth estimation models to enrich the
training and inference processes. For inference, particularly with
multiple-choice and yes/no questions, we adopted a Chain-of-Thought reasoning
approach to improve the accuracy of the results. This comprehensive methodology
enabled us to achieve a score of 0.7799 on the validation set leaderboard,
ranking 1st.
[LINK]
http://arxiv.org/abs/2509.11071v1
[DATE]
2025-09-14 11:37:17+08:00
[CATEGORIES]
cs.CL
LLMAP: LLM-Assisted Multi-Objective Route Planning with User Preferences
[AUTHORS]
Liangqi Yuan, Dong-Jun Han, Christopher G. Brinton, Sabine Brunswicker
[ABSTRACT]
The rise of large language models (LLMs) has made natural language-driven
route planning an emerging research area that encompasses rich user objectives.
Current research exhibits two distinct approaches: direct route planning using
LLM-as-Agent and graph-based searching strategies. However, LLMs in the former
approach struggle to handle extensive map data, while the latter shows limited
capability in understanding natural language preferences. Additionally, a more
critical challenge arises from the highly heterogeneous and unpredictable
spatio-temporal distribution of users across the globe. In this paper, we
introduce a novel LLM-Assisted route Planning (LLMAP) system that employs an
LLM-as-Parser to comprehend natural language, identify tasks, extract user
preferences, and recognize task dependencies, coupled with a Multi-Step Graph
construction with iterative Search (MSGS) algorithm as the underlying solver
for optimal route finding. Our multi-objective optimization approach adaptively
tunes objective weights to maximize points of interest (POI) quality and task
completion rate while minimizing route distance, subject to three key
constraints: user time limits, POI opening hours, and task dependencies. We
conduct extensive experiments using 1,000 routing prompts sampled with varying
complexity across 14 countries and 27 cities worldwide. The results demonstrate
that our approach achieves superior performance with guarantees across multiple
constraints.
[LINK]
http://arxiv.org/abs/2509.12273v1
[DATE]
2025-09-14 10:30:19+08:00
[CATEGORIES]
cs.CL
cs.LG
Rethinking Human Preference Evaluation of LLM Rationales
[AUTHORS]
Ziang Li, Manasi Ganti, Zixian Ma, Helena Vasconcelos, Qijia He, Ranjay Krishna
[COMMENTS]
Published in the XLLM-Reason-Plan Workshop on the Application of LLM
Explainability to Reasoning and Planning at COLM 2025
[LINK]
http://arxiv.org/abs/2509.11026v1
[DATE]
2025-09-14 09:33:14+08:00
[CATEGORIES]
cs.CL
MEPT: Mixture of Expert Prompt Tuning as a Manifold Mapper
[AUTHORS]
Runjia Zeng, Guangyan Sun, Qifan Wang, Tong Geng, Sohail Dianat, Xiaotian Han, Raghuveer Rao, Xueling Zhang, Cheng Han, Lifu Huang, Dongfang Liu
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.00996v2
[DATE]
2025-09-14 07:37:47+08:00
[CATEGORIES]
cs.LG
cs.CL
Persona-Based Synthetic Data Generation Using Multi-Stage Conditioning with Large Language Models for Emotion Recognition
[AUTHORS]
Keito Inoshita, Rushia Harada
[ABSTRACT]
In the field of emotion recognition, the development of high-performance
models remains a challenge due to the scarcity of high-quality, diverse
emotional datasets. Emotional expressions are inherently subjective, shaped by
individual personality traits, socio-cultural backgrounds, and contextual
factors, making large-scale, generalizable data collection both ethically and
practically difficult. To address this issue, we introduce PersonaGen, a novel
framework for generating emotionally rich text using a Large Language Model
(LLM) through multi-stage persona-based conditioning. PersonaGen constructs
layered virtual personas by combining demographic attributes, socio-cultural
backgrounds, and detailed situational contexts, which are then used to guide
emotion expression generation. We conduct comprehensive evaluations of the
generated synthetic data, assessing semantic diversity through clustering and
distributional metrics, human-likeness via LLM-based quality scoring, realism
through comparison with real-world emotion corpora, and practical utility in
downstream emotion classification tasks. Experimental results show that
PersonaGen significantly outperforms baseline methods in generating diverse,
coherent, and discriminative emotion expressions, demonstrating its potential
as a robust alternative for augmenting or replacing real-world emotional
datasets.
[LINK]
http://arxiv.org/abs/2507.13380v2
[DATE]
2025-09-14 06:38:50+08:00
[CATEGORIES]
cs.CL
FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering
[AUTHORS]
Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji
[ABSTRACT]
Multimodal multihop question answering (MMQA) requires reasoning over images
and text from multiple sources. Despite advances in visual question answering,
this multihop setting remains underexplored due to a lack of quality datasets.
Existing methods focus on single-hop, single-modality, or short texts, limiting
real-world applications like interpreting educational documents with long,
multimodal content. To fill this gap, we introduce FM2DS, the first framework
for creating a high-quality dataset for MMQA. Our approach consists of a
5-stage pipeline that involves acquiring relevant multimodal documents from
Wikipedia, synthetically generating high-level questions and answers, and
validating them through rigorous criteria to ensure data quality. We evaluate
our methodology by training models on our synthesized dataset and testing on
two benchmarks: MultimodalQA and WebQA. Our results demonstrate that, with an
equal sample size, models trained on our synthesized data outperform those
trained on human-collected data by 1.9 in exact match (EM) score on average.
Additionally, we introduce M2QA-Bench with 1k samples, the first benchmark for
MMQA on long documents, generated using FM2DS and refined by human annotators.
We believe our data synthesis method will serve as a strong foundation for
training and evaluating MMQA models.
[COMMENTS]
Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2412.07030v5
[DATE]
2025-09-14 04:47:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Can Advanced LLMs Coach Smaller LLMs? Knowledge Distillation for Goal-Oriented Dialogs
[AUTHORS]
Tong Wang, K. Sudhir, Dat Hong
[ABSTRACT]
Enterprises deploying LLMs for goal-oriented dialogs, such as customer
service, face a critical trade-off between performance, control, and cost.
Proprietary models like GPT-4 offer strong performance but are costly and
cannot be self-hosted, raising security and privacy concerns. Open-source
alternatives offer flexibility and lower token costs but lag in performance. We
introduce Guidance Elicitation and Retrieval (GER), a prompt-based knowledge
distillation framework where a high-performance teacher LLM coaches a
lower-performance student without modifying the student’s parameters. GER
extracts tactical guidance for a wide range of dialog scenarios from the
teacher and stores these scenario-guidance pairs in a structured library. At
inference time, the student retrieves the relevant guidance and integrates it
into its prompt. While GER training can be bootstrapped entirely with synthetic
data, its modular design lets it seamlessly augment the synthetic data with
human conversational logs. In addition, the modular design enables easy
auditing and updating of the guidance library as new scenarios and constraints
emerge. Experiments show GER’s guidance-based coaching outperforms both
example-output-based fine-tuning and non-customized guidance baselines, and generalizes
across other contexts and student models. The GER framework is potentially
extensible to coach human service agents.
[LINK]
http://arxiv.org/abs/2408.07238v2
[DATE]
2025-09-14 04:28:25+08:00
[CATEGORIES]
cs.CL
cs.LG
Monitoring Decoding: Mitigating Hallucination via Evaluating the Factuality of Partial Response during Generation
[AUTHORS]
Yurui Chang, Bochuan Cao, Lu Lin
[ABSTRACT]
While large language models have demonstrated exceptional performance across
a wide range of tasks, they remain susceptible to hallucinations – generating
plausible yet factually incorrect content. Existing methods for mitigating such
risk often rely on sampling multiple full-length generations, which introduces
significant response latency and becomes ineffective when the model
consistently produces hallucinated outputs with high confidence. To address
these limitations, we introduce Monitoring Decoding (MD), a novel framework
that dynamically monitors the generation process and selectively applies
in-process interventions, focusing on revising crucial tokens responsible for
hallucinations. Instead of waiting until completion of multiple full-length
generations, we identify hallucination-prone tokens during generation using a
monitor function, and further refine these tokens through a tree-based decoding
strategy. This approach ensures enhanced factual accuracy and coherence in
the generated output while maintaining efficiency. Experimental results
demonstrate that MD consistently outperforms self-consistency-based approaches
in both effectiveness and efficiency, achieving higher factual accuracy while
significantly reducing computational overhead.
[COMMENTS]
Accepted to ACL 2025 (Findings)
[LINK]
http://arxiv.org/abs/2503.03106v2
[DATE]
2025-09-14 02:51:06+08:00
[CATEGORIES]
cs.CL
cs.LG
Introducing Spotlight: A Novel Approach for Generating Captivating Key Information from Documents
[AUTHORS]
Ankan Mullick, Sombit Bose, Rounak Saha, Ayan Kumar Bhowmick, Aditya Vempaty, Prasenjit Dey, Ravi Kokku, Pawan Goyal, Niloy Ganguly
[COMMENTS]
Paper accepted in EMNLP 2025 Main Conference (Full)
[LINK]
http://arxiv.org/abs/2509.10935v1
[DATE]
2025-09-14 02:18:37+08:00
[CATEGORIES]
cs.CL
Public Data Assisted Differentially Private In-Context Learning
[AUTHORS]
Seongho Joo, Hyukhun Koh, Kyomin Jung
[ABSTRACT]
In-context learning (ICL) in Large Language Models (LLMs) has shown
remarkable performance across various tasks without requiring fine-tuning.
However, recent studies have highlighted the risk of private data leakage
through the prompt in ICL, especially when LLMs are exposed to malicious
attacks. While differential privacy (DP) provides strong privacy guarantees, it
often significantly reduces the utility of in-context learning (ICL). To
address this challenge, we incorporate task-related public data into the ICL
framework while maintaining the DP guarantee. Based on this approach, we
propose a private in-context learning algorithm that effectively balances
privacy protection and model utility. Through experiments, we demonstrate that
our approach significantly improves the utility of private ICL with the
assistance of public data. Additionally, we show that our method is robust
against membership inference attacks, demonstrating empirical privacy
protection.
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2509.10932v1
[DATE]
2025-09-14 02:11:51+08:00
[CATEGORIES]
cs.CL
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
[AUTHORS]
Gili Lior, Eliya Habba, Shahar Levy, Avi Caciularu, Gabriel Stanovsky
[ABSTRACT]
LLMs are highly sensitive to prompt phrasing, yet standard benchmarks
typically report performance using a single prompt, raising concerns about the
reliability of such evaluations. In this work, we argue for a stochastic method
of moments evaluation over the space of meaning-preserving prompt
perturbations. We introduce a formal definition of reliable evaluation that
accounts for prompt sensitivity, and suggest ReliableEval, a method for
estimating the number of prompt resamplings needed to obtain meaningful
results. Using our framework, we stochastically evaluate five frontier LLMs and
find that even top-performing models like GPT-4o and Claude-3.7-Sonnet exhibit
substantial prompt sensitivity. Our approach is model-, task-, and
metric-agnostic, offering a recipe for meaningful and robust LLM evaluation.
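
To make the idea of estimating the number of prompt resamplings concrete, the sketch below uses a generic normal-approximation sample-size calculation based on the sample mean and standard deviation of pilot scores. It is an illustration under stated assumptions, not the ReliableEval procedure itself, which the abstract does not spell out; the function name, the `half_width` target, and the pilot scores are all hypothetical.

```python
import math
import statistics


def resamplings_needed(scores, half_width, z=1.96):
    """Estimate how many prompt paraphrases are needed so that the mean score
    has a confidence interval of +/- half_width (normal approximation)."""
    s = statistics.stdev(scores)  # sample standard deviation of pilot scores
    return math.ceil((z * s / half_width) ** 2)


# Accuracy of one model under five paraphrases of the same prompt (made-up numbers).
pilot_scores = [0.70, 0.74, 0.66, 0.72, 0.68]
mean_score = statistics.mean(pilot_scores)
n_needed = resamplings_needed(pilot_scores, half_width=0.01)
```

The more a model's score varies across paraphrases, the larger `n_needed` becomes, which is one way to see why single-prompt results can be unreliable.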
[COMMENTS]
Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2505.22169v2
[DATE]
2025-09-14 02:08:05+08:00
[CATEGORIES]
cs.CL
Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding
[AUTHORS]
Seongho Joo, Hyukhun Koh, Kyomin Jung
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.10931v1
[DATE]
2025-09-14 02:07:56+08:00
[CATEGORIES]
cs.CL
Aligning ESG Controversy Data with International Guidelines through Semi-Automatic Ontology Construction
[AUTHORS]
Tsuyoshi Iwata, Guillaume Comte, Melissa Flores, Ryoma Kondo, Ryohei Hisano
[ABSTRACT]
The growing importance of environmental, social, and governance data in
regulatory and investment contexts has increased the need for accurate,
interpretable, and internationally aligned representations of non-financial
risks, particularly those reported in unstructured news sources. However,
aligning such controversy-related data with principle-based normative
frameworks, such as the United Nations Global Compact or Sustainable
Development Goals, presents significant challenges. These frameworks are
typically expressed in abstract language, lack standardized taxonomies, and
differ from the proprietary classification systems used by commercial data
providers. In this paper, we present a semi-automatic method for constructing
structured knowledge representations of environmental, social, and governance
events reported in the news. Our approach uses lightweight ontology design,
formal pattern modeling, and large language models to convert normative
principles into reusable templates expressed in the Resource Description
Framework. These templates are used to extract relevant information from news
content and populate a structured knowledge graph that links reported incidents
to specific framework principles. The result is a scalable and transparent
framework for identifying and interpreting non-compliance with international
sustainability guidelines.
[COMMENTS]
Author accepted manuscript. This paper has been accepted for
presentation at the ISWC 2025 Posters & Demos Track. License details will be
updated once the official proceedings are published
[LINK]
http://arxiv.org/abs/2509.10922v1
[DATE]
2025-09-14 01:49:59+08:00
[CATEGORIES]
cs.CL
Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
[AUTHORS]
Qinglin Zhu, Runcong Zhao, Hanqi Yan, Yulan He, Yudong Chen, Lin Gui
[ABSTRACT]
Large Language Models (LLMs) struggle with complex reasoning due to limited
diversity and inefficient search. We propose Soft Reasoning, an embedding-based
search framework that optimises the embedding of the first token to guide
generation. It combines (1) embedding perturbation for controlled exploration
and (2) Bayesian optimisation to refine embeddings via a verifier-guided
objective, balancing exploration and exploitation. This approach improves
reasoning accuracy and coherence while avoiding reliance on heuristic search.
Experiments demonstrate superior correctness with minimal computation, making
it a scalable, model-agnostic solution. The code is released at
https://github.com/alickzhu/Soft-Reasoning.
[COMMENTS]
Accepted as a Spotlight at ICML 2025
[LINK]
http://arxiv.org/abs/2505.24688v4
[DATE]
2025-09-14 01:45:08+08:00
[CATEGORIES]
cs.CL
CultureSynth: A Hierarchical Taxonomy-Guided and Retrieval-Augmented Framework for Cultural Question-Answer Synthesis
[AUTHORS]
Xinyu Zhang, Pei Zhang, Shuang Luo, Jialong Tang, Yu Wan, Baosong Yang, Fei Huang
[ABSTRACT]
Cultural competence, defined as the ability to understand and adapt to
multicultural contexts, is increasingly vital for large language models (LLMs)
in global environments. While several cultural benchmarks exist to assess LLMs’
cultural competence, current evaluations suffer from fragmented taxonomies,
domain specificity, and heavy reliance on manual data annotation. To address
these limitations, we introduce CultureSynth, a novel framework comprising (1)
a comprehensive hierarchical multilingual cultural taxonomy covering 12 primary
and 130 secondary topics, and (2) a Retrieval-Augmented Generation (RAG)-based
methodology leveraging factual knowledge to synthesize culturally relevant
question-answer pairs. The CultureSynth-7 synthetic benchmark contains 19,360
entries and 4,149 manually verified entries across 7 languages. Evaluation of
14 prevalent LLMs of different sizes reveals clear performance stratification
led by ChatGPT-4o-Latest and Qwen2.5-72B-Instruct. The results demonstrate that
a 3B-parameter threshold is necessary for achieving basic cultural competence,
models display varying architectural biases in knowledge processing, and
significant geographic disparities exist across models. We believe that
CultureSynth offers a scalable framework for developing culturally aware AI
systems while reducing reliance on manual annotation. The benchmark is
available at https://github.com/Eyr3/CultureSynth.
[COMMENTS]
Accepted as a Findings paper at EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.10886v1
[DATE]
2025-09-14 00:33:56+08:00
[CATEGORIES]
cs.CL
Term2Note: Synthesising Differentially Private Clinical Notes from Medical Terms
[AUTHORS]
Yuping Wu, Viktor Schlegel, Warren Del-Pinto, Srinivasan Nandakumar, Iqra Zahid, Yidan Sun, Usama Farghaly Omar, Amirah Jasmine, Arun-Kumar Kaliya-Perumal, Chun Shen Tham, Gabriel Connors, Anil A Bharath, Goran Nenadic
[ABSTRACT]
Training data is fundamental to the success of modern machine learning
models, yet in high-stakes domains such as healthcare, the use of real-world
training data is severely constrained by concerns over privacy leakage. A
promising solution to this challenge is the use of differentially private (DP)
synthetic data, which offers formal privacy guarantees while maintaining data
utility. However, striking the right balance between privacy protection and
utility remains challenging in clinical note synthesis, given its domain
specificity and the complexity of long-form text generation. In this paper, we
present Term2Note, a methodology to synthesise long clinical notes under strong
DP constraints. By structurally separating content and form, Term2Note
generates section-wise note content conditioned on DP medical terms, with each
governed by separate DP constraints. A DP quality maximiser further enhances
synthetic notes by selecting high-quality outputs. Experimental results show
that Term2Note produces synthetic notes with statistical properties closely
aligned with real clinical notes, demonstrating strong fidelity. In addition,
multi-label classification models trained on these synthetic notes perform
comparably to those trained on real data, confirming their high utility.
Compared to existing DP text generation baselines, Term2Note achieves
substantial improvements in both fidelity and utility while operating under
fewer assumptions, suggesting its potential as a viable privacy-preserving
alternative to using sensitive clinical notes.
[LINK]
http://arxiv.org/abs/2509.10882v1
[DATE]
2025-09-14 00:26:38+08:00
[CATEGORIES]
cs.CL
Generalized Dirichlet Energy and Graph Laplacians for Clustering Directed and Undirected Graphs
[AUTHORS]
Harry Sevi, Gwendal Debaussart-Joniec, Malik Hacini, Matthieu Jonckheere, Argyris Kalogeratos
[ABSTRACT]
Clustering in directed graphs remains a fundamental challenge due to the
asymmetry in edge connectivity, which limits the applicability of classical
spectral methods originally designed for undirected graphs. A common workaround
is to symmetrize the adjacency matrix, but this often leads to losing critical
directional information. In this work, we introduce the generalized Dirichlet
energy (GDE), a novel energy functional that extends the classical Dirichlet
energy to handle arbitrary positive vertex measures and Markov transition
matrices. GDE provides a unified framework applicable to both directed and
undirected graphs, and is closely tied to the diffusion dynamics of random
walks. Building on this framework, we propose the generalized spectral
clustering (GSC) method that enables the principled clustering of weakly
connected digraphs without resorting to the introduction of teleportation to
the random walk transition matrix. A key component of our approach is the
utilization of a parametrized vertex measure encoding graph directionality and
density. Experiments on real-world point-cloud datasets demonstrate that GSC
consistently outperforms existing spectral clustering approaches in terms of
clustering accuracy and robustness, offering a powerful new tool for
graph-based data analysis.
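
As a rough illustration of the kind of quantity involved, the sketch below evaluates a Dirichlet-type energy for a positive vertex measure nu and a row-stochastic transition matrix P, and checks it against an equivalent quadratic form. The specific form sum_{i,j} nu[i] P[i,j] (f[i] - f[j])^2 and the helper names `dirichlet_energy` and `energy_matrix` are assumptions made for illustration; the paper's exact definition of GDE may differ.

```python
import numpy as np

def dirichlet_energy(f, nu, P):
    """Assumed form: sum_{i,j} nu[i] * P[i, j] * (f[i] - f[j])**2."""
    diff = f[:, None] - f[None, :]
    return float(np.sum(nu[:, None] * P * diff**2))

def energy_matrix(nu, P):
    """Symmetric matrix L such that f @ L @ f equals dirichlet_energy(f, nu, P).

    Relies on the rows of P summing to 1 (Markov transition matrix).
    """
    N = np.diag(nu)            # vertex measure on the diagonal
    Xi = np.diag(P.T @ nu)     # measure pushed forward by one random-walk step
    return N + Xi - (N @ P + P.T @ N)

# Small directed graph: adjacency is not symmetric.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)   # random-walk transition matrix
nu = np.array([0.5, 0.2, 0.3])         # arbitrary positive vertex measure
f = np.array([1.0, -1.0, 2.0])         # function on the vertices

L = energy_matrix(nu, P)
print(dirichlet_energy(f, nu, P), f @ L @ f)
```

The matrix L is symmetric even though the graph is directed, which is what allows standard spectral tools to be applied without symmetrizing the adjacency matrix itself.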
[LINK]
http://arxiv.org/abs/2203.03221v3
[DATE]
2025-09-14 23:55:04+08:00
[CATEGORIES]
cs.LG
A dynamic view of some anomalous phenomena in SGD
[AUTHORS]
Vivek Shripad Borkar
[ABSTRACT]
It has been observed by Belkin et al. that over-parametrized neural networks
exhibit a ‘double descent’ phenomenon. That is, as the model complexity (as
reflected in the number of features) increases, the test error initially
decreases, then increases, and then decreases again. A counterpart of this
phenomenon in the time domain has been noted in the context of epoch-wise
training, viz., the test error decreases with the number of iterates, then
increases, then decreases again. Another anomalous phenomenon is that of
grokking, wherein two regimes of descent are interrupted by a third
regime wherein the mean loss remains almost constant. This note presents a
plausible explanation for these and related phenomena by using the theory of
two time scale stochastic approximation, applied to the continuous time limit
of the gradient dynamics. This gives a novel perspective for an already well
studied theme.
[COMMENTS]
9 pages, 4 figures
[LINK]
http://arxiv.org/abs/2505.01751v3
[DATE]
2025-09-14 23:54:59+08:00
[CATEGORIES]
cs.LG
AIssistant: An Agentic Approach for Human–AI Collaborative Scientific Work on Reviews and Perspectives in Machine Learning
[AUTHORS]
Sasi Kiran Gaddipati, Farhana Keya, Gollam Rabby, Sören Auer
[ABSTRACT]
Advances in AI-assisted research have introduced powerful tools for
literature retrieval, hypothesis generation, experimentation, and manuscript
preparation. However, systems remain fragmented and lack human-centred
workflows. To address these gaps, we introduce AIssistant, an agentic,
open-source Human-AI collaborative framework designed to simplify the
end-to-end creation of scientific workflows. Since our development is still in
an early stage, we present here the first experiments with AIssistant for
perspective and review research papers in machine learning. Our system
integrates modular tools and agents for literature synthesis, section-wise
experimentation, citation management, and automatic LaTeX paper text
generation, while maintaining human oversight at every stage to ensure
accuracy, coherence, and scholarly rigour. We conducted a comprehensive
evaluation across three layers: (1) Independent Human Review, following NeurIPS
double-blind standards; (2) Automated LLM Review, using GPT-5 as a scalable
human review proxy; and (3) Program Chair Oversight, where the chair monitors
the entire review process and makes final validation and acceptance decisions.
The results demonstrate that AIssistant improves drafting efficiency and
thematic consistency. Nonetheless, Human-AI collaboration remains essential for
maintaining factual correctness, methodological soundness, and ethical
compliance. Despite its effectiveness, we identify key limitations, including
hallucinated citations, difficulty adapting to dynamic paper structures, and
incomplete integration of multimodal content.
[LINK]
http://arxiv.org/abs/2509.12282v1
[DATE]
2025-09-14 23:50:31+08:00
[CATEGORIES]
cs.LG
Next Edit Prediction: Learning to Predict Code Edits from Context and Interaction History
[AUTHORS]
Ruofan Lu, Yintong Huo, Meng Zhang, Yichen Li, Michael R. Lyu
[ABSTRACT]
The rapid advancement of large language models (LLMs) has led to the
widespread adoption of AI-powered coding assistants integrated into a
development environment. On one hand, low-latency code completion offers
completion suggestions but is fundamentally constrained to the cursor’s current
position. On the other hand, chat-based editing can perform complex
modifications, yet forces developers to stop their work and describe their intent
in natural language, which causes a context switch away from the code. This
creates a suboptimal user experience, as neither paradigm proactively predicts
the developer’s next edit in a sequence of related edits. To bridge this gap
and provide seamless code edit suggestions, we introduce the task of Next
Edit Prediction, a novel task designed to infer developer intent from recent
interaction history to predict both the location and content of the subsequent
edit. Specifically, we curate a high-quality supervised fine-tuning dataset and
an evaluation benchmark for the Next Edit Prediction task. Then, we conduct
supervised fine-tuning on a series of models and perform a comprehensive
evaluation of both the fine-tuned models and other baseline models, yielding
several novel findings. This work lays the foundation for a new interaction
paradigm that proactively collaborates with developers by anticipating their
following action, rather than merely reacting to explicit instructions. The
code is available at https://github.com/lurf21/NextEditPrediction.
[LINK]
http://arxiv.org/abs/2508.10074v2
[DATE]
2025-09-14 23:44:13+08:00
[CATEGORIES]
cs.LG
Anant-Net: Breaking the Curse of Dimensionality with Scalable and Interpretable Neural Surrogate for High-Dimensional PDEs
[AUTHORS]
Sidharth S. Menon, Ameya D. Jagtap
[ABSTRACT]
High-dimensional partial differential equations (PDEs) arise in diverse
scientific and engineering applications but remain computationally intractable
due to the curse of dimensionality. Traditional numerical methods struggle with
the exponential growth in computational complexity, particularly on hypercubic
domains, where the number of required collocation points increases rapidly with
dimensionality. Here, we introduce Anant-Net, an efficient neural surrogate
that overcomes this challenge, enabling the solution of PDEs in high
dimensions. Unlike hyperspheres, where the internal volume diminishes as
dimensionality increases, hypercubes retain or expand their volume (for unit or
larger length), making high-dimensional computations significantly more
demanding. Anant-Net efficiently incorporates high-dimensional boundary
conditions and minimizes the PDE residual at high-dimensional collocation
points. To enhance interpretability, we integrate Kolmogorov-Arnold networks
into the Anant-Net architecture. We benchmark Anant-Net’s performance on
several linear and nonlinear high-dimensional equations, including the Poisson,
Sine-Gordon, and Allen-Cahn equations, demonstrating high accuracy and
robustness across randomly sampled test points from high-dimensional space.
Importantly, Anant-Net achieves these results with remarkable efficiency,
solving 300-dimensional problems on a single GPU within a few hours. We also
compare Anant-Net’s results for accuracy and runtime with other
state-of-the-art methods. Our findings establish Anant-Net as an accurate,
interpretable, and scalable framework for efficiently solving high-dimensional
PDEs.
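
The volume contrast between hyperspheres and hypercubes mentioned above can be checked numerically. The sketch below uses the standard closed form for the volume of a radius-1 ball in d dimensions, pi^(d/2) / Gamma(d/2 + 1), and compares it with a unit-side hypercube, whose volume is always 1. The function names are illustrative and are not taken from the paper.

```python
import math

def unit_ball_volume(d):
    """Volume of the radius-1 ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1).

    Computed in log space so that large d does not overflow the gamma function.
    """
    log_volume = 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)
    return math.exp(log_volume)

def unit_cube_volume(d):
    """Volume of a hypercube with side length 1 (independent of d)."""
    return 1.0

for d in (2, 3, 10, 50, 300):
    print(d, unit_ball_volume(d), unit_cube_volume(d))
```

For large d the ball volume shrinks toward zero while the cube volume stays at 1, so uniformly covering a hypercubic domain with collocation points becomes much more demanding.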
[COMMENTS]
32 pages, 18 figures
[LINK]
http://arxiv.org/abs/2505.03595v3
[DATE]
2025-09-14 23:30:03+08:00
[CATEGORIES]
cs.LG
Contrastive Network Representation Learning
[AUTHORS]
Zihan Dong, Xin Zhou, Ryumei Nakada, Lexin Li, Linjun Zhang
[ABSTRACT]
Network representation learning seeks to embed networks into a
low-dimensional space while preserving the structural and semantic properties,
thereby facilitating downstream tasks such as classification, trait prediction,
edge identification, and community detection. Motivated by challenges in brain
connectivity data analysis that is characterized by subject-specific,
high-dimensional, and sparse networks that lack node or edge covariates, we
propose a novel contrastive learning-based statistical approach for network
edge embedding, which we name Adaptive Contrastive Edge Representation
Learning (ACERL). It builds on two key components: contrastive learning of
augmented network pairs, and a data-driven adaptive random masking mechanism.
We establish the non-asymptotic error bounds, and show that our method achieves
the minimax optimal convergence rate for edge representation learning. We
further demonstrate the applicability of the learned representation in multiple
downstream tasks, including network classification, important edge detection,
and community detection, and establish the corresponding theoretical
guarantees. We validate our method through both synthetic data and real brain
connectivity studies, and show its competitive performance compared to the
baseline method of sparse principal components analysis.
[LINK]
http://arxiv.org/abs/2509.11316v1
[DATE]
2025-09-14 23:25:59+08:00
[CATEGORIES]
cs.LG
Meta-model Neural Process for Probabilistic Power Flow under Varying N-1 System Topologies
[AUTHORS]
Sel Ly, Kapil Chauhan, Anshuman Singh, Hung Dinh Nguyen
[ABSTRACT]
The probabilistic power flow (PPF) problem is essential to quantifying the
distribution of the nodal voltages due to uncertain injections. The
conventional PPF problem considers a fixed topology, and the solutions to such
a PPF problem are associated with this topology. A change in the topology might
alter the power flow patterns and thus require the PPF problem to be solved
again. The previous PPF model and its solutions are no longer valid for the new
topology. This practice incurs both inconvenience and computation burdens as
more contingencies are foreseen due to high renewables and a large share of
electric vehicles. This paper presents a novel topology-adaptive approach,
based on the meta-model Neural Process (MMNP), for finding the solutions to PPF
problems under varying N-1 topologies, particularly with one-line failures. By
leveraging context set-based topology representation and conditional
distribution over function learning techniques, the proposed MMNP enhances the
robustness of PPF models to topology variations, mitigating the need for
retraining PPF models on a new configuration. Simulations on an IEEE 9-bus
system and IEEE 118-bus system validate the model’s performance. The maximum
%L1-relative error norm was observed as 1.11% and 0.77% in 9-bus and 118-bus,
respectively. This adaptive approach fills a critical gap in PPF methodology in
an era of increasing grid volatility.
[COMMENTS]
An improved version for the conference paper at PESGM 2025
[LINK]
http://arxiv.org/abs/2509.12281v1
[DATE]
2025-09-14 23:07:33+08:00
[CATEGORIES]
cs.LG
Think Small, Plan Smart: Minimalist Symbolic Abstraction and Heuristic Subspace Search for LLM-Guided Task Planning
[AUTHORS]
Junfeng Tang, Yuping Yan, Zihan Ye, Zhenshou Song, Zeqi Zheng, Yaochu Jin
[ABSTRACT]
Reliable task planning is pivotal for achieving long-horizon autonomy in
real-world robotic systems. Large language models (LLMs) offer a promising
interface for translating complex and ambiguous natural language instructions
into actionable plans. However, their probabilistic and opaque nature often
leads to logically inconsistent or infeasible outputs. To address these
limitations, recent frameworks combine LLMs with symbolic planners by first
generating action models (Planning Domain Definition Language) and then
applying heuristic search. Although promising, such systems still suffer from
representation redundancy and exponential search complexity, often resulting in
inefficient or overly long plans. To improve planning efficiency and
effectiveness, we propose PLAHX (Planning from Language using Abstraction and
Heuristic eXploration), a two-stage LLM-symbolic planning framework that
integrates abstract symbolic representations with meta-heuristic subspace
search in a parallel and iterative fashion. Rather than relying on verbose
LLM-generated domain models, we introduce a minimalist symbolic abstraction
pipeline that preserves semantic fidelity while eliminating redundancy. Our
approach redefines LLM-symbolic planning not by making LLMs smarter, but by
reducing the symbolic search space adaptively. Empirical results across four
challenging domains, including block stacking and robotic mobile grasping, show
that our approach improves the success rate by 21.47% on average, while
reducing token consumption by 13% compared to state-of-the-art baselines.
[LINK]
http://arxiv.org/abs/2501.15214v2
[DATE]
2025-09-14 22:33:58+08:00
[CATEGORIES]
cs.LG
Derivative-informed Graph Convolutional Autoencoder with Phase Classification for the Lifshitz-Petrich Model
[AUTHORS]
Yanlai Chen, Yajie Ji, Zhenli Xu
[ABSTRACT]
The Lifshitz-Petrich (LP) model is a classical model for describing complex
spatial patterns such as quasicrystals and multiphase structures. Solving and
classifying the solutions of the LP model is challenging due to the presence of
high-order gradient terms and the long-range orientational order characteristic
of the quasicrystals. To address these challenges, we propose a
Derivative-informed Graph Convolutional Autoencoder (DiGCA) to classify the
multi-component multi-state solutions of the LP model. The classifier consists
of two stages. In the offline stage, the DiGCA phase classifier innovatively
incorporates both solutions and their derivatives for training a graph
convolutional autoencoder which effectively captures intricate spatial
dependencies while significantly reducing the dimensionality of the solution
space. In the online stage, the framework employs a neural network classifier
to efficiently categorize encoded solutions into distinct phase diagrams. The
numerical results demonstrate that the DiGCA phase classifier accurately solves
the LP model, classifies its solutions, and rapidly generates detailed phase
diagrams in a robust manner, offering significant improvements in both
efficiency and accuracy over traditional methods.
[LINK]
http://arxiv.org/abs/2509.11293v1
[DATE]
2025-09-14 22:32:42+08:00
[CATEGORIES]
cs.LG
PINGS: Physics-Informed Neural Network for Fast Generative Sampling
[AUTHORS]
Achmad Ardani Prasha, Clavino Ourizqi Rachmadi, Muhamad Fauzan Ibnu Syahlan, Naufal Rahfi Anugerah, Nanda Garin Raditya, Putri Amelia, Sabrina Laila Mutiara, Hilman Syachr Ramadhan
[ABSTRACT]
We introduce PINGS (Physics-Informed Neural Network for Fast Generative
Sampling), a framework that amortizes diffusion sampling by training a
physics-informed network to approximate reverse-time probability-flow dynamics,
reducing sampling to a single forward pass (NFE = 1). As a proof of concept, we
learn a direct map from a 3D standard normal to a non-Gaussian Gaussian Mixture
Model (GMM). PINGS preserves the target’s distributional structure
(multi-bandwidth kernel $MMD^2 = 1.88 \times 10^{-2}$ with small errors in
mean, covariance, skewness, and excess kurtosis) and achieves constant-time
generation: $10^4$ samples in $16.54 \pm 0.56$ milliseconds on an RTX 3090,
versus 468-843 milliseconds for DPM-Solver (10/20) and 960 milliseconds for DDIM
(50) under matched conditions. We also sanity-check the
PINN/automatic-differentiation pipeline on a damped harmonic oscillator,
obtaining MSEs down to $\mathcal{O}(10^{-5})$. Compared to fast but iterative
ODE solvers and direct-map families (Flow, Rectified-Flow, Consistency), PINGS
frames generative sampling as a PINN-style residual problem with endpoint
anchoring, yielding a white-box, differentiable map with NFE = 1. These
proof-of-concept results position PINGS as a promising route to fast,
function-based generative sampling with potential extensions to scientific
simulation (e.g., fast calorimetry).
[COMMENTS]
19 pages, 4 figures
[LINK]
http://arxiv.org/abs/2509.11284v1
[DATE]
2025-09-14 22:22:33+08:00
[CATEGORIES]
cs.LG
Protected Probabilistic Classification Library
[AUTHORS]
Ivan Petej
[ABSTRACT]
This paper introduces a new Python package specifically designed to address
calibration of probabilistic classifiers under dataset shift. The method is
demonstrated in binary and multi-class settings and its effectiveness is
measured against a number of existing post-hoc calibration methods. The
empirical results are promising and suggest that our technique can be helpful
in a variety of settings for batch and online learning classification problems
where the underlying data distribution changes between the training and test
sets.
[LINK]
http://arxiv.org/abs/2509.11267v1
[DATE]
2025-09-14 21:43:01+08:00
[CATEGORIES]
cs.LG
SelectMix: Enhancing Label Noise Robustness through Targeted Sample Mixing
[AUTHORS]
Qiuhao Liu, Ling Li, Yao Lu, Qi Xuan, Zhaowei Zhu, Jiaheng Wei
[ABSTRACT]
Deep neural networks tend to memorize noisy labels, severely degrading their
generalization performance. Although Mixup has demonstrated effectiveness in
improving generalization and robustness, existing Mixup-based methods typically
perform indiscriminate mixing without principled guidance on sample selection
and mixing strategy, inadvertently propagating noisy supervision. To overcome
these limitations, we propose SelectMix, a confidence-guided mixing framework
explicitly tailored for noisy labels. SelectMix first identifies potentially
noisy or ambiguous samples through confidence-based mismatch analysis using
K-fold cross-validation, then selectively blends identified uncertain samples
with confidently predicted peers from their potential classes. Furthermore,
SelectMix employs soft labels derived from all classes involved in the mixing
process, ensuring the labels accurately represent the composition of the mixed
samples, thus aligning supervision signals closely with the actual mixed
inputs. Through extensive theoretical analysis and empirical evaluations on
multiple synthetic (MNIST, Fashion-MNIST, CIFAR-10, CIFAR-100) and real-world
benchmark datasets (CIFAR-N, MNIST and Clothing1M), we demonstrate that
SelectMix consistently outperforms strong baseline methods, validating its
effectiveness and robustness in learning with noisy labels.
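
A minimal sketch of the mixing step described above, assuming a standard Mixup-style convex combination of an uncertain sample with a confidently predicted peer. The function name `mix_with_soft_label`, the fixed coefficient `lam`, and the two-class soft label are illustrative assumptions; the paper's actual selection rule and label construction may differ.

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a one-hot vector for an integer class label."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def mix_with_soft_label(x_uncertain, y_uncertain, x_peer, y_peer, lam, num_classes):
    """Blend two inputs and build a soft label with the same mixing weights.

    The soft label places weight `lam` on the uncertain sample's class and
    `1 - lam` on the peer's class, so it mirrors the composition of the input.
    """
    x_mix = lam * x_uncertain + (1.0 - lam) * x_peer
    y_mix = lam * one_hot(y_uncertain, num_classes) \
        + (1.0 - lam) * one_hot(y_peer, num_classes)
    return x_mix, y_mix

rng = np.random.default_rng(0)
x_u = rng.normal(size=4)   # hypothetical uncertain sample (features)
x_p = rng.normal(size=4)   # hypothetical confidently predicted peer
x_mix, y_mix = mix_with_soft_label(x_u, 2, x_p, 5, lam=0.3, num_classes=10)
print(y_mix)
```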
[LINK]
http://arxiv.org/abs/2509.11265v1
[DATE]
2025-09-14 21:37:38+08:00
[CATEGORIES]
cs.LG
Gradient Free Deep Reinforcement Learning With TabPFN
[AUTHORS]
David Schiff, Ofir Lindenbaum, Yonathan Efroni
[ABSTRACT]
Gradient-based optimization is fundamental to most modern deep reinforcement
learning algorithms; however, it introduces significant sensitivity to
hyperparameters, unstable training dynamics, and high computational costs. We
propose TabPFN RL, a novel gradient-free deep RL framework that repurposes the
meta-trained transformer TabPFN as a Q-function approximator. Originally
developed for tabular classification, TabPFN is a transformer pre-trained on
millions of synthetic datasets to perform inference on new unseen datasets via
in-context learning. Given an in-context dataset of sample-label pairs and new
unlabeled data, it predicts the most likely labels in a single forward pass,
without gradient updates or task-specific fine-tuning. We use TabPFN to predict
Q-values using inference only, thereby eliminating the need for
backpropagation at both training and inference. To cope with the model’s fixed
context budget, we design a high-reward episode gate that retains only the top
5% of trajectories. Empirical evaluations on the Gymnasium classic control
suite demonstrate that TabPFN RL matches or surpasses Deep Q-Network on
CartPole-v1, MountainCar-v0, and Acrobot-v1, without applying gradient descent
or any extensive hyperparameter tuning. We discuss the theoretical aspects of
how bootstrapped targets and non-stationary visitation distributions violate
the independence assumptions encoded in TabPFN’s prior, yet the model retains a
surprising generalization capacity. We further formalize the intrinsic context
size limit of in-context RL algorithms and propose principled truncation
strategies that enable continual learning when the context is full. Our results
establish prior-fitted networks such as TabPFN as a viable foundation for fast
and computationally efficient RL, opening new directions for gradient-free RL
with large pre-trained transformers.
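
The episode gate mentioned above can be sketched as a simple filter over stored episodes. This is only an illustration under stated assumptions: episodes are ranked by their total return, the function name `high_reward_gate` and the tuple layout are hypothetical, and the paper's actual gating rule may be defined differently.

```python
import math

def high_reward_gate(episodes, keep_fraction=0.05):
    """Keep the top `keep_fraction` of episodes, ranked by episode return.

    `episodes` is a list of (transitions, episode_return) pairs. At least one
    episode is always kept so the context is never empty.
    """
    if not episodes:
        return []
    n_keep = max(1, math.ceil(keep_fraction * len(episodes)))
    # sorted() is stable, so ties keep their original (oldest-first) order.
    ranked = sorted(episodes, key=lambda ep: ep[1], reverse=True)
    return ranked[:n_keep]

# Hypothetical buffer of 100 one-step episodes with returns 0, 1, ..., 99.
episodes = [([("s", "a", float(r), "s_next")], float(r)) for r in range(100)]
kept = high_reward_gate(episodes)
print([ret for _, ret in kept])
```

Only the transitions from the retained episodes would then be placed in the in-context dataset, which keeps the context within a fixed budget.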
[LINK]
http://arxiv.org/abs/2509.11259v1
[DATE]
2025-09-14 21:09:58+08:00
[CATEGORIES]
cs.LG
From PowerSGD to PowerSGD+: Low-Rank Gradient Compression for Distributed Optimization with Convergence Guarantees
[AUTHORS]
Shengping Xie, Chuyan Chen, Kun Yuan
[ABSTRACT]
Low-rank gradient compression methods, such as PowerSGD, have gained
attention in communication-efficient distributed optimization. However, the
convergence guarantees of PowerSGD remain unclear, particularly in stochastic
settings. In this paper, we show that PowerSGD does not always converge to the
optimal solution and provide a clear counterexample to support this finding. To
address this, we introduce PowerSGD+, which periodically updates the projection
subspace via singular value decomposition, ensuring that it remains aligned
with the optimal subspace. We prove that PowerSGD+ converges under standard
assumptions and validate its effectiveness through empirical evaluation on
large language model tasks.
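
A minimal sketch of low-rank gradient compression in the PowerSGD style: one power-iteration step produces rank-r factors, and a separate helper resets the projection matrix from a singular value decomposition. How and when PowerSGD+ performs its periodic refresh is not specified in the abstract, so `svd_refresh` and its use here are assumptions for illustration only.

```python
import numpy as np

def power_step(M, Q):
    """One power-iteration step: returns (P, Q_new) with M approximated by P @ Q_new.T.

    M is an (n, m) gradient matrix and Q an (m, r) projection matrix.
    """
    P, _ = np.linalg.qr(M @ Q)   # orthonormal basis of the projected columns, (n, r)
    Q_new = M.T @ P              # updated right factor, (m, r)
    return P, Q_new

def svd_refresh(M, rank):
    """Assumed refresh rule: reset Q to the top-`rank` right singular vectors of M."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:rank].T

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))   # gradient of exact rank 2
Q = svd_refresh(M, rank=2)
P, Q_new = power_step(M, Q)
M_hat = P @ Q_new.T                                      # decompressed gradient
print(np.abs(M - M_hat).max())
```

Only P and Q_new (8x2 and 6x2 here) need to be communicated instead of the full 8x6 matrix; for a gradient of exact rank r the reconstruction is exact up to floating-point error.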
[LINK]
http://arxiv.org/abs/2509.11254v1
[DATE]
2025-09-14 20:54:28+08:00
[CATEGORIES]
cs.LG
Calibration in Deep Learning: A Survey of the State-of-the-Art
[AUTHORS]
Cheng Wang
[ABSTRACT]
Calibrating deep neural models plays an important role in building reliable,
robust AI systems in safety-critical applications. Recent work has shown that
modern neural networks that possess high predictive capability are poorly
calibrated and produce unreliable model predictions. Though deep learning
models achieve remarkable performance on various benchmarks, the study of model
calibration and reliability is relatively under-explored. Ideal deep models
should have not only high predictive performance but also be well calibrated.
There have been some recent advances in calibrating deep models. In this
survey, we review the state-of-the-art calibration methods and their principles
for performing model calibration. First, we start with the definition of model
calibration and explain the root causes of model miscalibration. Then we
introduce the key metrics that can measure this aspect. It is followed by a
summary of calibration methods that we roughly classify into four categories:
post-hoc calibration, regularization methods, uncertainty estimation, and
composition methods. We also cover recent advancements in calibrating large
models, particularly large language models (LLMs). Finally, we discuss some
open issues, challenges, and potential directions.
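
One widely used calibration metric is the expected calibration error (ECE), which bins predictions by confidence and averages the gap between accuracy and confidence in each bin. The abstract does not name specific metrics, so the sketch below is offered only as a common example; the function name and the equal-width binning are choices made here.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins on [0, 1].

    `confidences` are predicted probabilities of the chosen class and
    `correct` are 0/1 indicators of whether each prediction was right.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    # Bin k holds confidences in (k / n_bins, (k + 1) / n_bins]; 0 goes to bin 0.
    bin_ids = np.digitize(confidences, inner_edges, right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by the fraction of samples in the bin
    return float(ece)

# Overconfident model: 90% confidence but only 50% accuracy.
ece_overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
# Perfectly confident and always right.
ece_perfect = expected_calibration_error([1.0] * 4, [1] * 4)
print(ece_overconfident, ece_perfect)
```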
[COMMENTS]
34 pages
[LINK]
http://arxiv.org/abs/2308.01222v4
[DATE]
2025-09-14 20:53:07+08:00
[CATEGORIES]
cs.LG
QualityFM: a Multimodal Physiological Signal Foundation Model with Self-Distillation for Signal Quality Challenges in Critically Ill Patients
[AUTHORS]
Zongheng Guo, Tao Chen, Manuela Ferrario
[ABSTRACT]
Photoplethysmogram (PPG) and electrocardiogram (ECG) are commonly recorded in
the intensive care unit (ICU) and operating room (OR). However, the high incidence
of poor, incomplete, and inconsistent signal quality can lead to false alarms
or diagnostic inaccuracies. The methods explored so far suffer from limited
generalizability, reliance on extensive labeled data, and poor cross-task
transferability. To overcome these challenges, we introduce QualityFM, a novel
multimodal foundation model for these physiological signals, designed to
acquire a general-purpose understanding of signal quality. Our model is
pre-trained on a large-scale dataset comprising over 21 million 30-second
waveforms and 179,757 hours of data. Our approach involves a dual-track
architecture that processes paired physiological signals of differing quality,
leveraging a self-distillation strategy where an encoder for high-quality
signals is used to guide the training of an encoder for low-quality signals. To
efficiently handle long sequential signals and capture essential local
quasi-periodic patterns, we integrate a windowed sparse attention mechanism
within our Transformer-based model. Furthermore, a composite loss function,
which combines direct distillation loss on encoder outputs with indirect
reconstruction loss based on power and phase spectra, ensures the preservation
of frequency-domain characteristics of the signals. We pre-train three models
with varying parameter counts (9.6 M to 319 M) and demonstrate their efficacy
and practical value through transfer learning on three distinct clinical tasks:
detection of false ventricular tachycardia alarms, the identification of atrial
fibrillation, and the estimation of arterial blood pressure (ABP) from PPG and
ECG signals.
[COMMENTS]
11 pages, 5 figures, 7 tables
[LINK]
http://arxiv.org/abs/2509.06516v2
[DATE]
2025-09-14 20:44:02+08:00
[CATEGORIES]
cs.LG
Revisiting Meter Tracking in Carnatic Music using Deep Learning Approaches
[AUTHORS]
Satyajeet Prabhu
[ABSTRACT]
Beat and downbeat tracking, jointly referred to as Meter Tracking, is a
fundamental task in Music Information Retrieval (MIR). Deep learning models
have far surpassed traditional signal processing and classical machine learning
approaches in this domain, particularly for Western (Eurogenetic) genres, where
large annotated datasets are widely available. These systems, however, perform
less reliably on underrepresented musical traditions. Carnatic music, a rich
tradition from the Indian subcontinent, is renowned for its rhythmic intricacy
and unique metrical structures (tālas). The most notable prior work on meter
tracking in this context employed probabilistic Dynamic Bayesian Networks
(DBNs). The performance of state-of-the-art (SOTA) deep learning models on
Carnatic music, however, remains largely unexplored.
In this study, we evaluate two models for meter tracking in Carnatic music:
the Temporal Convolutional Network (TCN), a lightweight architecture that has
been successfully adapted for Latin rhythms, and Beat This!, a
transformer-based model designed for broad stylistic coverage without the need
for post-processing. Replicating the experimental setup of the DBN baseline on
the Carnatic Music Rhythm (CMR$_f$) dataset, we systematically assess the
performance of these models in a directly comparable setting. We further
investigate adaptation strategies, including fine-tuning the models on Carnatic
data and the use of musically informed parameters. Results show that while
off-the-shelf models do not always outperform the DBN, their performance
improves substantially with transfer learning, matching or surpassing the
baseline. These findings indicate that SOTA deep learning models can be
effectively adapted to underrepresented traditions, paving the way for more
inclusive and broadly applicable meter tracking systems.
[LINK]
http://arxiv.org/abs/2509.11241v1
[DATE]
2025-09-14 20:33:34+08:00
[CATEGORIES]
cs.LG
Online Optimization on Hadamard Manifolds: Curvature Independent Regret Bounds on Horospherically Convex Objectives
[AUTHORS]
Emre Sahinoglu, Shahin Shahrampour
[ABSTRACT]
We study online Riemannian optimization on Hadamard manifolds under the
framework of horospherical convexity (h-convexity). Prior work mostly relies on
the geodesic convexity (g-convexity), leading to regret bounds scaling poorly
with the manifold curvature. To address this limitation, we analyze Riemannian
online gradient descent for h-convex and strongly h-convex functions and
establish $O(\sqrt{T})$ and $O(\log(T))$ regret guarantees, respectively. These
bounds are curvature-independent and match the results in the Euclidean
setting. We validate our approach with experiments on the manifold of symmetric
positive definite (SPD) matrices equipped with the affine-invariant metric. In
particular, we investigate online Tyler’s $M$-estimation and online Fréchet
mean computation, showing the application of h-convexity in practice.
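As a toy illustration of the Riemannian online gradient descent update
discussed above, the sketch below treats the simplest SPD case, 1x1 matrices
(positive scalars) under the affine-invariant metric, with squared-distance
losses $f_t(x) = \frac{1}{2} d(x, y_t)^2$ and step size $1/t$. The function
name, the loss, and the step schedule are assumptions chosen for illustration;
they are not taken from the paper.

```python
import math

def online_frechet_mean(samples):
    """Riemannian online gradient descent on positive scalars (1x1 SPD matrices).

    Round t receives y_t and suffers f_t(x) = 0.5 * d(x, y_t)**2, where
    d(x, y) = |log(y / x)| is the affine-invariant distance.
    """
    x = samples[0]  # initialize at the first observation
    for t, y in enumerate(samples[1:], start=2):
        eta = 1.0 / t                      # assumed step-size schedule
        grad = -x * math.log(y / x)        # Riemannian gradient: -Log_x(y)
        x = x * math.exp(-eta * grad / x)  # exponential map: Exp_x(-eta * grad)
    return x

estimate = online_frechet_mean([1.0, 4.0, 16.0])
```

With this schedule the iterate after $t$ rounds equals the geometric mean of
the first $t$ observations, which is the Fréchet mean in this one-dimensional
case.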
[LINK]
http://arxiv.org/abs/2509.11236v1
[DATE]
2025-09-14 20:27:31+08:00
[CATEGORIES]
cs.LG
TransZero: Parallel Tree Expansion in MuZero using Transformer Networks
[AUTHORS]
Emil Malmsten, Wendelin Böhmer
[ABSTRACT]
We present TransZero, a model-based reinforcement learning algorithm that
removes the sequential bottleneck in Monte Carlo Tree Search (MCTS). Unlike
MuZero, which constructs its search tree step by step using a recurrent
dynamics model, TransZero employs a transformer-based network to generate
multiple latent future states simultaneously. Combined with the Mean-Variance
Constrained (MVC) evaluator that eliminates dependence on inherently sequential
visitation counts, our approach enables the parallel expansion of entire
subtrees during planning. Experiments in MiniGrid and LunarLander show that
TransZero achieves up to an eleven-fold speedup in wall-clock time compared to
MuZero while maintaining sample efficiency. These results demonstrate that
parallel tree construction can substantially accelerate model-based
reinforcement learning, bringing real-time decision-making in complex
environments closer to practice. The code is publicly available on GitHub.
[COMMENTS]
Submitted to BNAIC/BeNeLearn 2025. 15 pages, 4 figures
[LINK]
http://arxiv.org/abs/2509.11233v1
[DATE]
2025-09-14 20:20:38+08:00
[CATEGORIES]
cs.LG
ResWCAE: Biometric Pattern Image Denoising Using Residual Wavelet-Conditioned Autoencoder
[AUTHORS]
Youzhi Liang, Wen Liang
[ABSTRACT]
The utilization of biometric authentication with pattern images is
increasingly popular in compact Internet of Things (IoT) devices. However, the
reliability of such systems can be compromised by image quality issues,
particularly in the presence of high levels of noise. While state-of-the-art
deep learning algorithms designed for generic image denoising have shown
promise, their large number of parameters and lack of optimization for unique
biometric pattern retrieval make them unsuitable for these devices and
scenarios. In response to these challenges, this paper proposes a lightweight
and robust deep learning architecture, the Residual Wavelet-Conditioned
Convolutional Autoencoder (Res-WCAE) with a Kullback-Leibler divergence (KLD)
regularization, designed specifically for fingerprint image denoising. Res-WCAE
comprises two encoders - an image encoder and a wavelet encoder - and one
decoder. Residual connections between the image encoder and decoder are
leveraged to preserve fine-grained spatial features, while the bottleneck layer
is conditioned on the compressed representation of features obtained from the
wavelet encoder using approximation and detail subimages in the
wavelet-transform domain. The effectiveness of Res-WCAE is evaluated against
several state-of-the-art denoising methods, and the experimental results
demonstrate that Res-WCAE outperforms these methods, particularly for heavily
degraded fingerprint images in the presence of high levels of noise. Overall,
Res-WCAE shows promise as a solution to the challenges faced by biometric
authentication systems in compact IoT devices.
[COMMENTS]
8 pages, 2 figures
[LINK]
http://arxiv.org/abs/2307.12255v2
[DATE]
2025-09-14 20:08:55+08:00
[CATEGORIES]
cs.LG
Foundational theory for optimal decision tree problems. I. Algorithmic and geometric foundations
[AUTHORS]
Xi He
[ABSTRACT]
In the first paper (part I) of this series of two, we introduce four novel
definitions of the ODT problems: three for size-constrained trees and one for
depth-constrained trees. These definitions are stated unambiguously through
executable recursive programs, satisfying all criteria we propose for a formal
specification. In this sense, they resemble the “standard form” used in the
study of general-purpose solvers.
Grounded in algebraic programming theory (a relational formalism for deriving
correct-by-construction algorithms from specifications), we can not only
establish the existence or nonexistence of dynamic programming solutions but
also derive them constructively whenever they exist. Consequently, the four
generic problem definitions yield four novel optimal algorithms for ODT
problems with arbitrary splitting rules that satisfy the axioms and objective
functions of a given form. These algorithms encompass the known
depth-constrained, axis-parallel ODT algorithm as a special case, while
providing a unified, efficient, and elegant solution for the general ODT
problem.
In Part II, we present the first optimal hypersurface decision tree algorithm
and provide comprehensive experiments against axis-parallel decision tree
algorithms, including heuristic CART and state-of-the-art optimal methods. The
results demonstrate the significant potential of decision trees with flexible
splitting rules. Moreover, our framework is readily extendable to support
algorithms for constructing even more flexible decision trees, including those
with mixed splitting rules.
[COMMENTS]
50 pages, 1 figure
[LINK]
http://arxiv.org/abs/2509.11226v1
[DATE]
2025-09-14 20:01:02+08:00
[CATEGORIES]
cs.LG
Blending 3D Geometry and Machine Learning for Multi-View Stereopsis
[AUTHORS]
Vibhas Vats, Md. Alimoor Reza, David Crandall, Soon-heung Jung
[ABSTRACT]
Traditional multi-view stereo (MVS) methods primarily depend on photometric
and geometric consistency constraints. In contrast, modern learning-based
algorithms often rely on the plane sweep algorithm to infer 3D geometry,
applying explicit geometric consistency (GC) checks only as a post-processing
step, with no impact on the learning process itself. In this work, we introduce
GC MVSNet plus plus, a novel approach that actively enforces geometric
consistency of reference view depth maps across multiple source views (multi
view) and at various scales (multi scale) during the learning phase (see Fig.
1). This integrated GC check significantly accelerates the learning process by
directly penalizing geometrically inconsistent pixels, effectively halving the
number of training iterations compared to other MVS methods. Furthermore, we
introduce a densely connected cost regularization network with two distinct
block designs, simple and feature-dense, optimized to harness dense feature
connections for enhanced regularization. Extensive experiments demonstrate that
our approach achieves a new state of the art on the DTU and BlendedMVS datasets
and secures second place on the Tanks and Temples benchmark. To our knowledge,
GC MVSNet plus plus is the first method to enforce multi-view, multi-scale
supervised geometric consistency during learning. Our code is available.
[COMMENTS]
A pre-print – accepted at Neurocomputing. arXiv admin note:
substantial text overlap with arXiv:2310.19583
[LINK]
http://arxiv.org/abs/2505.03470v4
[DATE]
2025-09-14 19:37:02+08:00
[CATEGORIES]
cs.LG
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
[AUTHORS]
Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov
[ABSTRACT]
Depth completion upgrades sparse depth measurements into dense depth maps
guided by a conventional image. Existing methods for this highly ill-posed task
operate in tightly constrained settings and tend to struggle when applied to
images outside the training domain or when the available depth measurements are
sparse, irregularly distributed, or of varying density. Inspired by recent
advances in monocular depth estimation, we reframe depth completion as an
image-conditional depth map generation guided by sparse measurements. Our
method, Marigold-DC, builds on a pretrained latent diffusion model for
monocular depth estimation and injects the depth observations as test-time
guidance via an optimization scheme that runs in tandem with the iterative
inference of denoising diffusion. The method exhibits excellent zero-shot
generalization across a diverse range of environments and handles even
extremely sparse guidance effectively. Our results suggest that contemporary
monocular depth priors greatly robustify depth completion: it may be better to
view the task as recovering dense depth from (dense) image pixels, guided by
sparse depth; rather than as inpainting (sparse) depth, guided by an image.
Project website: https://MarigoldDepthCompletion.github.io/
[COMMENTS]
ICCV 2025
[LINK]
http://arxiv.org/abs/2412.13389v2
[DATE]
2025-09-14 19:15:24+08:00
[CATEGORIES]
cs.LG
Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning
[AUTHORS]
Mateusz Praski, Jakub Adamczyk, Wojciech Czech
[ABSTRACT]
Pretrained neural networks have attracted significant interest in chemistry
and small molecule drug design. Embeddings from these models are widely used
for molecular property prediction, virtual screening, and small data learning
in molecular chemistry. This study presents the most extensive comparison of
such models to date, evaluating 25 models across 25 datasets. Under a fair
comparison framework, we assess models spanning various modalities,
architectures, and pretraining strategies. Using a dedicated hierarchical
Bayesian statistical testing model, we arrive at a surprising result: nearly
all neural models show negligible or no improvement over the baseline ECFP
molecular fingerprint. Only the CLAMP model, which is also based on molecular
fingerprints, performs statistically significantly better than the
alternatives. These findings raise concerns about the evaluation rigor in
existing studies. We discuss potential causes, propose solutions, and offer
practical recommendations.
[LINK]
http://arxiv.org/abs/2508.06199v3
[DATE]
2025-09-14 19:09:51+08:00
[CATEGORIES]
cs.LG
Predictable Compression Failures: Why Language Models Actually Hallucinate
[AUTHORS]
Leon Chlon, Ahmed Karim, Maggie Chlon
[ABSTRACT]
Large language models perform near-Bayesian inference yet violate permutation
invariance on exchangeable data. We resolve this by showing transformers
minimize expected conditional description length (cross-entropy) over
orderings, $\mathbb{E}_\pi[\ell(Y \mid \Gamma_\pi(X))]$, which admits a
Kolmogorov-complexity interpretation up to additive constants, rather than the
permutation-invariant description length $\ell(Y \mid X)$. This makes them
Bayesian in expectation, not in realization. We derive (i) a Quantified
Martingale Violation bound showing order-induced deviations scale as $O(\log
n)$ with constants; (ii) the Expectation-level Decompression Law linking
information budgets to reliability for Bernoulli predicates; and (iii)
deployable planners (B2T/RoH/ISR) for answer/abstain decisions. Empirically,
permutation dispersion follows $a+b\ln n$ (Qwen2-7B $b \approx 0.377$,
Llama-3.1-8B $b \approx 0.147$); permutation mixtures improve ground-truth
likelihood/accuracy; and randomized dose-response shows hallucinations drop by
$\sim 0.13$ per additional nat. A pre-specified audit with a fixed ISR=1.0
achieves near-0% hallucinations via calibrated refusal at 24% abstention. The
framework turns hallucinations into predictable compression failures and
enables principled information budgeting.
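The reported scaling of permutation dispersion, $a + b\ln n$, is a linear model
in $\ln n$, so $a$ and $b$ can be estimated by ordinary least squares. A minimal
sketch follows; the function name and the synthetic data are assumptions, not
the authors' fitting procedure or measurements. The slope 0.377 is borrowed
from the figure quoted above only to make the example concrete, and the
intercept 0.5 is arbitrary.

```python
import math

def fit_log_law(ns, ys):
    """Least-squares fit of y = a + b * ln(n); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Synthetic, noise-free data generated from a = 0.5, b = 0.377 (illustrative).
ns = [2, 4, 8, 16, 32]
ys = [0.5 + 0.377 * math.log(n) for n in ns]
a_hat, b_hat = fit_log_law(ns, ys)
```

On noise-free data the fit recovers the generating coefficients up to
floating-point error; with real measurements the residuals would indicate how
well the logarithmic law holds.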
[LINK]
http://arxiv.org/abs/2509.11208v1
[DATE]
2025-09-14 18:32:59+08:00
[CATEGORIES]
cs.LG
CRoC: Context Refactoring Contrast for Graph Anomaly Detection with Limited Supervision
[AUTHORS]
Siyue Xie, Da Sun Handason Tam, Wing Cheong Lau
[ABSTRACT]
Graph Neural Networks (GNNs) are widely used as the engine for various
graph-related tasks, owing to their effectiveness in analyzing graph-structured
data. However, training robust GNNs often demands abundant labeled data, which
is a critical bottleneck in real-world applications. This limitation severely
impedes progress in Graph Anomaly Detection (GAD), where anomalies are
inherently rare, costly to label, and may actively camouflage their patterns to
evade detection. To address these problems, we propose Context Refactoring
Contrast (CRoC), a simple yet effective framework that trains GNNs for GAD by
jointly leveraging limited labeled and abundant unlabeled data. Different from
previous works, CRoC exploits the class imbalance inherent in GAD to refactor
the context of each node, which builds augmented graphs by recomposing the
attributes of nodes while preserving their interaction patterns. Furthermore,
CRoC encodes heterogeneous relations separately and integrates them into the
message-passing process, enhancing the model’s capacity to capture complex
interaction semantics. These operations preserve node semantics while
encouraging robustness to adversarial camouflage, enabling GNNs to uncover
intricate anomalous cases. In the training stage, CRoC is further integrated
with the contrastive learning paradigm. This allows GNNs to effectively harness
unlabeled data during joint training, producing richer, more discriminative
node embeddings. CRoC is evaluated on seven real-world GAD datasets with
varying scales. Extensive experiments demonstrate that CRoC achieves up to 14%
AUC improvement over baseline GNNs and outperforms state-of-the-art GAD methods
under limited-label settings.
[COMMENTS]
Accepted by ECAI 2025
[LINK]
http://arxiv.org/abs/2508.12278v2
[DATE]
2025-09-14 18:08:13+08:00
[CATEGORIES]
cs.LG
Quantum Architecture Search for Solving Quantum Machine Learning Tasks
[AUTHORS]
Michael Kölle, Simon Salfer, Tobias Rohe, Philipp Altmann, Claudia Linnhoff-Popien
[ABSTRACT]
Quantum computing leverages quantum mechanics to address computational
problems in ways that differ fundamentally from classical approaches. While
current quantum hardware remains error-prone and limited in scale, Variational
Quantum Circuits offer a noise-resilient framework suitable for today’s
devices. The performance of these circuits strongly depends on the underlying
architecture of their parameterized quantum components. Identifying efficient,
hardware-compatible quantum circuit architectures – known as Quantum
Architecture Search (QAS) – is therefore essential. Manual QAS is complex and
error-prone, motivating efforts to automate it. Among various automated
strategies, Reinforcement Learning (RL) remains underexplored, particularly in
Quantum Machine Learning contexts. This work introduces RL-QAS, a framework
that applies RL to discover effective circuit architectures for classification
tasks. We evaluate RL-QAS using the Iris and binary MNIST datasets. The agent
autonomously discovers low-complexity circuit designs that achieve high test
accuracy. Our results show that RL is a viable approach for automated
architecture search in quantum machine learning. However, applying RL-QAS to
more complex tasks will require further refinement of the search strategy and
performance evaluation mechanisms.
[LINK]
http://arxiv.org/abs/2509.11198v1
[DATE]
2025-09-14 17:55:38+08:00
[CATEGORIES]
cs.LG
Federated Recommender System with Data Valuation for E-commerce Platform
[AUTHORS]
Jongwon Park, Minku Kang, Wooseok Sim, Soyoung Lee, Hogun Park
[ABSTRACT]
Federated Learning (FL) is gaining prominence in machine learning as privacy
concerns grow. This paradigm allows each client (e.g., an individual online
store) to train a recommendation model locally while sharing only model
updates, without exposing the raw interaction logs to a central server, thereby
preserving privacy in a decentralized environment. Nonetheless, most existing
FL-based recommender systems still rely solely on each client’s private data,
despite the abundance of publicly available datasets that could be leveraged to
enrich local training; this potential remains largely underexplored. To this
end, we consider a realistic scenario wherein a large shopping platform
collaborates with multiple small online stores to build a global recommender
system. The platform possesses global data, such as shareable user and item
lists, while each store holds a portion of interaction data privately (or
locally). Although integrating global data can help mitigate the limitations of
sparse and biased clients’ local data, it also introduces additional
challenges: simply combining all global interactions can amplify noise and
irrelevant patterns, worsening personalization and increasing computational
costs. To address these challenges, we propose FedGDVE, which selectively
augments each client’s local graph with semantically aligned samples from the
global dataset. FedGDVE employs: (i) a pre-trained graph encoder to extract
global structural features, (ii) a local valid predictor to assess
client-specific relevance, (iii) a reinforcement-learning-based probability
estimator to filter and sample only the most pertinent global interactions.
FedGDVE improves performance by up to 34.86% on recognized benchmarks in FL
environments.
[COMMENTS]
Accepted to Expert Systems with Applications Journal, Elsevier
[LINK]
http://arxiv.org/abs/2509.11196v1
[DATE]
2025-09-14 17:48:23+08:00
[CATEGORIES]
cs.LG
Investigating the Lottery Ticket Hypothesis for Variational Quantum Circuits
[AUTHORS]
Michael Kölle, Leonhard Klingert, Julian Schönberger, Philipp Altmann, Tobias Rohe, Claudia Linnhoff-Popien
[ABSTRACT]
Quantum computing is an emerging field in computer science that has seen
considerable progress in recent years, especially in machine learning. By
harnessing the principles of quantum physics, it can surpass the limitations of
classical algorithms. However, variational quantum circuits (VQCs), which rely
on adjustable parameters, often face the barren plateau phenomenon, hindering
optimization. The Lottery Ticket Hypothesis (LTH) is a recent concept in
classical machine learning that has led to notable improvements in parameter
efficiency for neural networks. It states that within a large network, a
smaller, more efficient subnetwork, or “winning ticket,” can achieve
comparable performance, potentially circumventing plateau challenges. In this
work, we investigate whether this idea can apply to VQCs. We show that the weak
LTH holds for VQCs, revealing winning tickets that retain just 26.0% of the
original parameters. For the strong LTH, where a pruning mask is learned
without any training, we discovered a winning ticket in a binary VQC, achieving
100% accuracy with only 45% of the weights. These findings indicate that LTH
may mitigate barren plateaus by reducing parameter counts while preserving
performance, thus enhancing the efficiency of VQCs in quantum machine learning
tasks.
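In classical LTH experiments, winning tickets are typically found by magnitude
pruning: train, remove the smallest-magnitude weights, reset the survivors to
their initial values, and retrain. The sketch below shows the pruning step
only. It assumes a magnitude criterion, which the abstract does not state, and
the function name and example weights are hypothetical.

```python
def magnitude_mask(weights, keep_fraction):
    """Return a 0/1 mask keeping the largest-magnitude `keep_fraction` of weights."""
    n_keep = max(1, round(len(weights) * keep_fraction))
    # Indices sorted from largest to smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [1 if i in keep else 0 for i in range(len(weights))]

# Hypothetical trained parameters of a small circuit.
trained = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.03, 0.6]
mask = magnitude_mask(trained, keep_fraction=0.25)
```

Here a quarter of the eight parameters survive, namely the two with the
largest absolute values (0.9 and -0.7).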
[LINK]
http://arxiv.org/abs/2509.11190v1
[DATE]
2025-09-14 17:39:32+08:00
[CATEGORIES]
cs.LG
Sub-universal variational circuits for combinatorial optimization problems
[AUTHORS]
Gal Weitz, Lirandë Pira, Chris Ferrie, Joshua Combes
[ABSTRACT]
Quantum variational circuits have gained significant attention due to their
applications in the quantum approximate optimization algorithm and quantum
machine learning research. This work introduces a novel class of classical
probabilistic circuits designed for generating approximate solutions to
combinatorial optimization problems constructed using two-bit stochastic
matrices. Through a numerical study, we investigate the performance of our
proposed variational circuits in solving the Max-Cut problem on various graphs
of increasing sizes. Our classical algorithm demonstrates improved performance
for several graph types compared to the quantum approximate optimization
algorithm. Our
findings suggest that evaluating the performance of quantum variational
circuits against variational circuits with sub-universal gate sets is a
valuable benchmark for identifying areas where quantum variational circuits can
excel.
[COMMENTS]
10 pages, 7 figures
[LINK]
http://arxiv.org/abs/2308.14981v2
[DATE]
2025-09-14 17:25:23+08:00
[CATEGORIES]
cs.LG
Harnessing Optimization Dynamics for Curvature-Informed Model Merging
[AUTHORS]
Pouria Mahdavinia, Hamed Mahdavi, Niloofar Mireshghallah, Mehrdad Mahdavi
[ABSTRACT]
Model merging is an effective post-training strategy for composing
capabilities in large language models without joint retraining. We study this
in the supervised fine-tuning (SFT) stage, where multiple capability-based SFT
checkpoints – spanning math, code, precise instruction following, general
instruction following, and knowledge recall – must be consolidated into a
single model. We introduce Optimization Trajectory Aware (OTA) Merging, a
curvature-aware aggregation that leverages optimizer second-moment statistics
as a diagonal curvature proxy to reweight parameter edits and mitigate
interference. Complementing OTA, we propose Fast Fisher Grafting (FFG), a
curvature-driven task-localization step that sparsifies conflicting or
low-importance edits. FFG induces extremely low-rank masks concentrated in
early attention query/key projections and token embeddings, exploiting shared
curvature across capabilities. We further develop a memory-light compression of
the second moments that preserves OTA’s effect. Across diverse capability-based
SFT checkpoints, OTA+FFG improves merged-model quality over strong weight-space
baselines, reduces negative transfer, and remains robust across sparsity
levels. Analyses reveal substantial curvature overlap between checkpoints,
offering a novel lens on why simple linear merging can be effective in
practice. Ablations confirm that FFG is critical for reducing task interference
and that the compressed second moments retain the gains of the full
formulation. To facilitate reproducibility, we open-source all code, training
and evaluation scripts, visualization artifacts, and capability-specific SFT
checkpoints at https://github.com/pmahdavi/ota-merge.
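One way to picture curvature-aware reweighting: treat each checkpoint's
optimizer second-moment estimate as a per-parameter importance weight and take
a weighted average of the task edits (fine-tuned minus base weights). The
sketch below is an assumed, simplified form of that idea, not the OTA procedure
from the paper; the function name, the square-root weighting, and `eps` are
illustrative choices.

```python
import math

def curvature_weighted_merge(base, finetuned, second_moments, eps=1e-8):
    """Merge checkpoints parameter-wise, weighting each task edit by a
    diagonal curvature proxy sqrt(v) + eps (an assumed weighting)."""
    merged = []
    for i, w0 in enumerate(base):
        weights = [math.sqrt(v[i]) + eps for v in second_moments]
        edits = [w[i] - w0 for w in finetuned]
        total = sum(weights)
        merged.append(w0 + sum(c * d for c, d in zip(weights, edits)) / total)
    return merged

base = [0.0, 0.0]
finetuned = [[1.0, 0.0], [0.0, 1.0]]       # two task-specific checkpoints
second_moments = [[1.0, 0.0], [0.0, 1.0]]  # each task is "sharp" in one coordinate
merged = curvature_weighted_merge(base, finetuned, second_moments)
```

Plain averaging would return 0.5 for both parameters; the weighting keeps each
task's edit nearly intact where that task's curvature proxy is large.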
[LINK]
http://arxiv.org/abs/2509.11167v1
[DATE]
2025-09-14 16:59:53+08:00
[CATEGORIES]
cs.LG
GK-SMOTE: A Hyperparameter-free Noise-Resilient Gaussian KDE-Based Oversampling Approach
[AUTHORS]
Mahabubur Rahman Miraj, Hongyu Huang, Ting Yang, Jinxue Zhao, Nankun Mu, Xinyu Lei
[ABSTRACT]
Imbalanced classification is a significant challenge in machine learning,
especially in critical applications like medical diagnosis, fraud detection,
and cybersecurity. Traditional oversampling techniques, such as SMOTE, often
fail to handle label noise and complex data distributions, leading to reduced
classification accuracy. In this paper, we propose GK-SMOTE, a
hyperparameter-free, noise-resilient extension of SMOTE, built on Gaussian
Kernel Density Estimation (KDE). GK-SMOTE enhances class separability by
generating synthetic samples in high-density minority regions, while
effectively avoiding noisy or ambiguous areas. This self-adaptive approach uses
Gaussian KDE to differentiate between safe and noisy regions, ensuring more
accurate sample generation without requiring extensive parameter tuning. Our
extensive experiments on diverse binary classification datasets demonstrate
that GK-SMOTE outperforms existing state-of-the-art oversampling techniques
across key evaluation metrics, including MCC, Balanced Accuracy, and AUPRC. The
proposed method offers a robust, efficient solution for imbalanced
classification tasks, especially in noisy data environments, making it an
attractive choice for real-world applications.
[COMMENTS]
15 pages, 5 figures, 9th APWeb-WAIM joint International Conference on
Web and Big Data (APWeb-WAIM 2025)
[LINK]
http://arxiv.org/abs/2509.11163v1
[DATE]
2025-09-14 16:50:30+08:00
[CATEGORIES]
cs.LG
NeurStore: Efficient In-database Deep Learning Model Management System
[AUTHORS]
Siqi Xiang, Sheng Wang, Xiaokui Xiao, Cong Yue, Zhanhao Zhao, Beng Chin Ooi
[ABSTRACT]
With the prevalence of in-database AI-powered analytics, there is an
increasing demand for database systems to efficiently manage the ever-expanding
number and size of deep learning models. However, existing database systems
typically store entire models as monolithic files or apply compression
techniques that overlook the structural characteristics of deep learning
models, resulting in suboptimal model storage overhead. This paper presents
NeurStore, a novel in-database model management system that enables efficient
storage and utilization of deep learning models. First, NeurStore employs a
tensor-based model storage engine to enable fine-grained model storage within
databases. In particular, we enhance the hierarchical navigable small world
(HNSW) graph to index tensors, and only store additional deltas for tensors
within a predefined similarity threshold to ensure tensor-level deduplication.
Second, we propose a delta quantization algorithm that effectively compresses
delta tensors, thus achieving a superior compression ratio with controllable
model accuracy loss. Finally, we devise a compression-aware model loading
mechanism, which improves model utilization performance by enabling direct
computation on compressed tensors. Experimental evaluations demonstrate that
NeurStore achieves superior compression ratios and competitive model loading
throughput compared to state-of-the-art approaches.
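The deduplication idea can be sketched as follows: a new tensor that lies close
to an already stored base tensor is kept as a quantized delta instead of a full
copy. Everything below (the function names, the max-absolute-difference
similarity test, the uniform 8-bit quantizer, the threshold value) is an
assumption made for illustration; NeurStore's HNSW-based index and its actual
quantization algorithm are not reproduced here.

```python
def quantize_delta(base, tensor, bits=8):
    """Uniformly quantize (tensor - base) to signed integers; returns (codes, scale)."""
    delta = [t - b for t, b in zip(tensor, base)]
    max_abs = max(abs(d) for d in delta)
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(d / scale) for d in delta]
    return codes, scale

def reconstruct(base, codes, scale):
    """Rebuild an approximation of the tensor from base + dequantized delta."""
    return [b + c * scale for b, c in zip(base, codes)]

def store(base, tensor, threshold=0.1):
    """Keep a quantized delta when the tensor is within `threshold` of the
    base (hypothetical similarity test); otherwise keep the full tensor."""
    if max(abs(t - b) for t, b in zip(tensor, base)) <= threshold:
        return ("delta", quantize_delta(base, tensor))
    return ("full", list(tensor))

base = [0.50, -1.20, 3.00, 0.00]
near = [0.52, -1.25, 3.01, 0.03]
kind, (codes, scale) = store(base, near)
restored = reconstruct(base, codes, scale)
```

With uniform rounding, each restored value differs from the original by at most
half a quantization step, which is one simple way to keep accuracy loss
controllable.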
[LINK]
http://arxiv.org/abs/2509.03228v2
[DATE]
2025-09-14 16:48:46+08:00
[CATEGORIES]
cs.LG
Stabilizing Data-Free Model Extraction
[AUTHORS]
Dat-Thinh Nguyen, Kim-Hung Le, Nhien-An Le-Khac
[ABSTRACT]
Model extraction is a severe threat to Machine Learning-as-a-Service systems,
especially through data-free approaches, where dishonest users can replicate
the functionality of a black-box target model without access to realistic data.
Despite recent advancements, existing data-free model extraction methods suffer
from the oscillating accuracy of the substitute model. This oscillation, which
could be attributed to the constant shift in the generated data distribution
during the attack, makes the attack impractical since the optimal substitute
model cannot be determined without access to the target model’s in-distribution
data. Hence, we propose MetaDFME, a novel data-free model extraction method
that employs meta-learning in the generator training to reduce the distribution
shift, aiming to mitigate the substitute model’s accuracy oscillation. In
detail, we train our generator to iteratively capture the meta-representations
of the synthetic data during the attack. These meta-representations can be
adapted in a few steps to produce data that helps the substitute model
learn from the target model while reducing the effect of distribution
shifts. Our experiments on popular baseline image datasets, MNIST, SVHN,
CIFAR-10, and CIFAR-100, demonstrate that MetaDFME outperforms the current
state-of-the-art data-free model extraction method while exhibiting a more
stable substitute model’s accuracy during the attack.
[COMMENTS]
28th European Conference on Artificial Intelligence (ECAI-2025)
[LINK]
http://arxiv.org/abs/2509.11159v1
[DATE]
2025-09-14 16:36:56+08:00
[CATEGORIES]
cs.LG
Feature Space Topology Control via Hopkins Loss
[AUTHORS]
Einari Vaaras, Manu Airaksinen
[ABSTRACT]
Feature space topology refers to the organization of samples within the
feature space. Modifying this topology can be beneficial in machine learning
applications, including dimensionality reduction, generative modeling, transfer
learning, and robustness to adversarial attacks. This paper introduces a novel
loss function, Hopkins loss, which leverages the Hopkins statistic to enforce a
desired feature space topology, which is in contrast to existing
topology-related methods that aim to preserve input feature topology. We
evaluate the effectiveness of Hopkins loss on speech, text, and image data in
two scenarios: classification and dimensionality reduction using nonlinear
bottleneck autoencoders. Our experiments show that integrating Hopkins loss
into classification or dimensionality reduction has only a small impact on
classification performance while providing the benefit of modifying feature
topology.
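For readers unfamiliar with the Hopkins statistic that the loss builds on, the
sketch below computes one common form of it. The paper's differentiable loss is
not reproduced; the function name, the bounding-box sampling, and the choice to
raise distances to the power of the dimension are assumptions of this example.

```python
import math
import random

def hopkins_statistic(points, m, rng):
    """One common form of the Hopkins statistic for a list of d-dimensional points.

    Values near 0.5 suggest spatially random data, values near 1 suggest
    clustered data, and values near 0 suggest regularly spaced data.
    """
    dim = len(points[0])
    lows = [min(p[k] for p in points) for k in range(dim)]
    highs = [max(p[k] for p in points) for k in range(dim)]

    def nearest(q, exclude=None):
        return min(math.dist(q, p) for i, p in enumerate(points) if i != exclude)

    # u: distances from m uniform random locations to their nearest data point
    u = [nearest(tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)))
         for _ in range(m)]
    # w: distances from m sampled data points to their nearest other data point
    w = [nearest(points[i], exclude=i) for i in rng.sample(range(len(points)), m)]

    u_sum = sum(x ** dim for x in u)
    w_sum = sum(x ** dim for x in w)
    return u_sum / (u_sum + w_sum)

rng = random.Random(0)
# Two tight, well-separated synthetic clusters.
clustered = ([(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
             + [(rng.gauss(10, 0.1), rng.gauss(10, 0.1)) for _ in range(50)])
h = hopkins_statistic(clustered, m=10, rng=rng)
```

For strongly clustered data such as this, the statistic lands close to 1. A
loss that pushes it toward a chosen target value would, in the same spirit,
steer features toward a clustered, random, or evenly spread arrangement.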
[COMMENTS]
Accepted for publication in Proc. IEEE ICTAI 2025, Athens, Greece
[LINK]
http://arxiv.org/abs/2509.11154v1
[DATE]
2025-09-14 16:16:20+08:00
[CATEGORIES]
cs.LG
RoVerFly: Robust and Versatile Learning-based Control of Quadrotor Across Payload Configurations
[AUTHORS]
Mintae Kim, Jiaze Cai, Koushil Sreenath
[ABSTRACT]
Designing robust controllers for precise, arbitrary trajectory tracking with
quadrotors is challenging due to nonlinear dynamics and underactuation, and
becomes harder with flexible cable-suspended payloads that introduce extra
degrees of freedom and hybridness. Classical model-based methods offer
stability guarantees but require extensive tuning and often do not adapt when
the configuration changes, such as when a payload is added or removed, or when
the payload mass or cable length varies. We present RoVerFly, a unified
learning-based control framework in which a reinforcement learning (RL) policy
serves as a robust and versatile tracking controller for standard quadrotors
and for cable-suspended payload systems across a range of configurations.
Trained with task and domain randomization, the controller is resilient to
disturbances and varying dynamics. It achieves strong zero-shot generalization
across payload settings, including no payload as well as varying mass and cable
length, without controller switching or re-tuning, while retaining the
interpretability and structure of a feedback tracking controller. Code and
supplementary materials are available at
https://github.com/mintaeshkim/roverfly
[COMMENTS]
8 pages
[LINK]
http://arxiv.org/abs/2509.11149v1
[DATE]
2025-09-14 15:41:40+08:00
[CATEGORIES]
cs.LG
VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving
[AUTHORS]
Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang
[ABSTRACT]
Modern Large Language Model (LLM) serving systems increasingly support
interactive applications, like real-time chat assistants, code generation
tools, and agentic workflows. However, the soaring energy cost of LLM inference
presents a growing challenge for sustainable and cost-effective deployment.
This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM
serving, built from a control theory perspective. VoltanaLLM co-designs
frequency scaling and request routing in emerging prefill/decode disaggregated
architectures, leveraging their decoupled execution to enable fine-grained
phase-specific control. It consists of a feedback-driven frequency controller
that dynamically adapts GPU frequency for prefill and decode phases, and a
state-space router that explores routing decisions across frequency-scaled
instances to minimize energy under latency constraints. We implement VoltanaLLM
in SGLang and evaluate its performance over multiple state-of-the-art LLMs and
real-world datasets. The results demonstrate that VoltanaLLM achieves up to
36.3% energy savings while maintaining near-perfect SLO attainment rate, paving
the way for sustainable and intelligent LLM serving. Code of VoltanaLLM is
open-sourced on GitHub:
https://github.com/Supercomputing-System-AI-Lab/VoltanaLLM.
[LINK]
http://arxiv.org/abs/2509.04827v2
[DATE]
2025-09-14 15:30:56+08:00
[CATEGORIES]
cs.LG
Sampling-enabled scalable manifold learning unveils the discriminative cluster structure of high-dimensional data
[AUTHORS]
Dehua Peng, Zhipeng Gui, Wenzhang Wei, Fa Li, Jie Gui, Huayi Wu, Jianya Gong
[ABSTRACT]
As a pivotal branch of machine learning, manifold learning uncovers the
intrinsic low-dimensional structure within complex nonlinear manifolds in
high-dimensional space for visualization, classification, clustering, and
gaining key insights. Although existing techniques have achieved remarkable
successes, they suffer from extensive distortions of cluster structure, which
hinders the understanding of underlying patterns. Scalability issues also limit
their applicability for handling large-scale data. We hence propose a
sampling-based Scalable manifold learning technique that enables Uniform and
Discriminative Embedding, namely SUDE, for large-scale and high-dimensional
data. It starts by seeking a set of landmarks to construct the low-dimensional
skeleton of the entire data, and then incorporates the non-landmarks into the
learned space based on the constrained locally linear embedding (CLLE). We
empirically validated the effectiveness of SUDE on synthetic datasets and
real-world benchmarks, and applied it to analyze single-cell data and detect
anomalies in electrocardiogram (ECG) signals. SUDE exhibits a distinct advantage
in scalability with respect to data size and embedding dimension, and has
promising performance in cluster separation, integrity, and global structure
preservation. The experiments also demonstrate notable robustness in embedding
quality as the sampling rate decreases.
[COMMENTS]
80 pages, 37 figures
[LINK]
http://arxiv.org/abs/2401.01100v5
[DATE]
2025-09-14 14:36:35+08:00
[CATEGORIES]
cs.LG
WildSmoke: Ready-to-Use Dynamic 3D Smoke Assets from a Single Video in the Wild
[AUTHORS]
Yuqiu Liu, Jialin Song, Manolis Savva, Wuyang Chen
[ABSTRACT]
We propose a pipeline to extract and reconstruct dynamic 3D smoke assets from
a single in-the-wild video, and further integrate interactive simulation for
smoke design and editing. Recent developments in 3D vision have significantly
improved the reconstruction and rendering of fluid dynamics, supporting realistic
and temporally consistent view synthesis. However, current fluid reconstructions
rely heavily on carefully controlled clean lab environments, whereas real-world
videos captured in the wild are largely underexplored. We pinpoint three key
challenges of reconstructing smoke in real-world videos and design targeted
techniques, including smoke extraction with background removal, initialization
of smoke particles and camera poses, and inferring multi-view videos. Our
method not only outperforms previous reconstruction and generation methods with
high-quality smoke reconstructions (+2.22 average PSNR on wild videos), but
also enables diverse and realistic editing of fluid dynamics by simulating our
smoke assets. We provide our models, data, and 4D smoke assets at
https://autumnyq.github.io/WildSmoke.
[LINK]
http://arxiv.org/abs/2509.11114v1
[DATE]
2025-09-14 14:06:42+08:00
[CATEGORIES]
cs.LG
Multi-Modal Sensing Aided mmWave Beamforming for V2V Communications with Transformers
[AUTHORS]
Muhammad Baqer Mollah, Honggang Wang, Hua Fang
[ABSTRACT]
Beamforming techniques are utilized in millimeter wave (mmWave) communication
to address the inherent path loss limitation, thereby establishing and
maintaining reliable connections. However, adopting the standard-defined
beamforming approach in highly dynamic vehicular environments often incurs high
beam training overheads and reduces the available airtime for communications,
which is mainly due to exchanging pilot signals and exhaustive beam
measurements. To this end, we present a multi-modal sensing and fusion learning
framework as a potential alternative solution to reduce such overheads. In this
framework, we first extract features individually from the visual and GPS
coordinate sensing modalities using modality-specific encoders, and subsequently
fuse the multimodal features to obtain predicted top-k beams so that the best
line-of-sight links can be proactively established. To show the
generalizability of the proposed framework, we perform a comprehensive
experiment in four different vehicle-to-vehicle (V2V) scenarios from a real-world
multi-modal sensing and communication dataset. From the experiment, we observe
that the proposed framework achieves up to 77.58% accuracy in predicting the
top-15 beams correctly, outperforms single modalities, incurs an average power
loss as low as roughly 2.32 dB, and reduces the beam search space overhead by
76.56% for the top-15 beams relative to the standard-defined approach.
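The two headline metrics above can be illustrated with a minimal sketch. The function names and toy scores are hypothetical, and the 64-beam codebook is an assumption: the abstract does not state the codebook size, but sweeping 15 of 64 beams happens to give the reported 76.56% reduction.

```python
def top_k_accuracy(scores, best_beam, k):
    """Fraction of samples whose true best beam is among the k highest-scoring beams."""
    hits = 0
    for row, target in zip(scores, best_beam):
        # Beam indices ordered from highest to lowest predicted score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += target in ranked[:k]
    return hits / len(scores)


def search_space_reduction(k, codebook_size):
    """Fraction of beam measurements avoided by sweeping k beams instead of all of them."""
    return 1.0 - k / codebook_size


# Toy predictions for 3 samples over 4 candidate beams (made-up numbers).
scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.10, 0.30, 0.10],
    [0.30, 0.15, 0.05, 0.50],
]
best_beam = [1, 2, 1]

top1 = top_k_accuracy(scores, best_beam, k=1)
top2 = top_k_accuracy(scores, best_beam, k=2)

# Assumed 64-beam codebook; only the top-15 predicted beams are measured.
reduction = search_space_reduction(15, 64)
```

Raising k trades measurement overhead for a higher chance that the true best beam is inside the swept set.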
[COMMENTS]
6 Pages, Accepted to present at 2025 IEEE Global Communications
Conference (GLOBECOM), Taipei, Taiwan
[LINK]
http://arxiv.org/abs/2509.11112v1
[DATE]
2025-09-14 14:03:42+08:00
[CATEGORIES]
cs.LG
BIGNet: Pretrained Graph Neural Network for Embedding Semantic, Spatial, and Topological Data in BIM Models
[AUTHORS]
Jin Han, Xin-Zheng Lu, Jia-Rui Lin
[ABSTRACT]
Large Foundation Models (LFMs) have demonstrated significant advantages in
civil engineering, but they primarily focus on textual and visual data,
overlooking the rich semantic, spatial, and topological features in BIM
(Building Information Modelling) models. Therefore, this study develops the
first large-scale graph neural network (GNN), BIGNet, to learn and reuse
multidimensional design features embedded in BIM models. Firstly, a scalable
graph representation is introduced to encode the “semantic-spatial-topological”
features of BIM components, and a dataset with nearly 1 million nodes and 3.5
million edges is created. Subsequently, BIGNet is proposed by introducing a new
message-passing mechanism to GraphMAE2 and further pretrained with a node
masking strategy. Finally, BIGNet is evaluated in various transfer learning
tasks for BIM-based design checking. Results show that: 1) homogeneous graph
representation outperforms heterogeneous graph in learning design features, 2)
considering local spatial relationships in a 30 cm radius enhances performance,
and 3) BIGNet with GAT (Graph Attention Network)-based feature extraction
achieves the best transfer learning results. This innovation leads to a 72.7%
improvement in Average F1-score over non-pretrained models, demonstrating its
effectiveness in learning and transferring BIM design features and facilitating
their automated application in future design and lifecycle management.
[LINK]
http://arxiv.org/abs/2509.11104v1
[DATE]
2025-09-14 13:43:14+08:00
[CATEGORIES]
cs.LG
GCN-TULHOR: Trajectory-User Linking Leveraging GCNs and Higher-Order Spatial Representations
[AUTHORS]
Khoa Tran, Pranav Gupta, Manos Papagelis
[ABSTRACT]
Trajectory-user linking (TUL) aims to associate anonymized trajectories with
the users who generated them, which is crucial for personalized
recommendations, privacy-preserving analytics, and secure location-based
services. Existing methods struggle with sparse data, incomplete routes, and
limited modeling of complex spatial dependencies, often relying on low-level
check-in data or ignoring spatial patterns. In this paper, we introduce
GCN-TULHOR, a method that transforms raw location data into higher-order
mobility flow representations using hexagonal tessellation, reducing data
sparsity and capturing richer spatial semantics, and that integrates Graph
Convolutional Networks (GCNs). Our approach converts both sparse check-in and
continuous GPS trajectory data into unified higher-order flow representations,
mitigating sparsity while capturing deeper semantic information. The GCN layer
explicitly models complex spatial relationships and non-local dependencies
without requiring side information such as timestamps or points of interest.
Experiments on six real-world datasets show consistent improvements over
classical baselines, RNN- and Transformer-based models, and the TULHOR method
in accuracy, precision, recall, and F1-score. GCN-TULHOR achieves 1-8% relative
gains in accuracy and F1. Sensitivity analysis identifies an optimal setup with
a single GCN layer and 512-dimensional embeddings. The integration of GCNs
enhances spatial learning and improves generalizability across mobility data.
This work highlights the value of combining graph-based spatial learning with
sequential modeling, offering a robust and scalable solution for TUL with
applications in recommendations, urban planning, and security.
[LINK]
http://arxiv.org/abs/2509.11095v1
[DATE]
2025-09-14 13:14:09+08:00
[CATEGORIES]
cs.LG
What is in a Price? Estimating Willingness-to-Pay with Bayesian Hierarchical Models
[AUTHORS]
Srijesh Pillai, Rajesh Kumar Chandrawat
[ABSTRACT]
For premium consumer products, pricing strategy is not about a single number,
but about understanding the perceived monetary value of the features that
justify a higher cost. This paper proposes a robust methodology to deconstruct
a product’s price into the tangible value of its constituent parts. We employ
Bayesian Hierarchical Conjoint Analysis, a sophisticated statistical technique,
to solve this high-stakes business problem using the Apple iPhone as a
universally recognizable case study. We first simulate a realistic choice-based
conjoint survey where consumers choose between different hypothetical iPhone
configurations. We then develop a Bayesian Hierarchical Logit Model to infer
consumer preferences from this choice data. The core innovation of our model is
its ability to directly estimate the Willingness-to-Pay (WTP) in dollars for
specific feature upgrades, such as a “Pro” camera system or increased storage.
Our results demonstrate that the model successfully recovers the true,
underlying feature valuations from noisy data, providing not just a point
estimate but a full posterior probability distribution for the dollar value of
each feature. This work provides a powerful, practical framework for
data-driven product design and pricing strategy, enabling businesses to make
more intelligent decisions about which features to build and how to price them.
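As a rough illustration of how a dollar-valued WTP posterior can be read off a logit model, the sketch below uses the common ratio form WTP = -beta_feature / beta_price applied draw by draw. This is an assumption about the mechanics, not the authors' model, which may parameterize WTP directly; the posterior draws here are synthetic stand-ins for sampler output.

```python
import random
import statistics


def wtp_draws(beta_feature, beta_price):
    """Per-draw WTP: the price increase that exactly offsets the feature's utility."""
    return [-bf / bp for bf, bp in zip(beta_feature, beta_price)]


# Hypothetical posterior draws; in practice these would come from an MCMC sampler.
rng = random.Random(0)
beta_price = [rng.gauss(-0.01, 0.001) for _ in range(4000)]  # utility per dollar (negative)
beta_camera = [rng.gauss(1.5, 0.1) for _ in range(4000)]     # utility of the camera upgrade

draws = sorted(wtp_draws(beta_camera, beta_price))
wtp_median = statistics.median(draws)

# Central 95% interval taken from the sorted draws.
interval_95 = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws)) - 1])
```

Because the ratio is computed per draw, the result is a full distribution over dollar values rather than a single point estimate.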
[COMMENTS]
7 pages, 6 figures, 1 table. Accepted for publication in the
proceedings of the 2025 Advances in Science and Engineering Technology
International Conferences (ASET)
[LINK]
http://arxiv.org/abs/2509.11089v1
[DATE]
2025-09-14 12:39:35+08:00
[CATEGORIES]
cs.LG
SH-SAS: An Implicit Neural Representation for Complex Spherical-Harmonic Scattering Fields for 3D Synthetic Aperture Sonar
[AUTHORS]
Omkar Shailendra Vengurlekar, Adithya Pediredla, Suren Jayasuriya
[ABSTRACT]
Synthetic aperture sonar (SAS) reconstruction requires recovering both the
spatial distribution of acoustic scatterers and their direction-dependent
response. Time-domain backprojection is the most common 3D SAS reconstruction
algorithm, but it does not model directionality and can suffer from sampling
limitations, aliasing, and occlusion. Prior neural volumetric methods applied
to synthetic aperture sonar treat each voxel as an isotropic scattering
density, not modeling anisotropic returns. We introduce SH-SAS, an implicit
neural representation that expresses the complex acoustic scattering field as a
set of spherical harmonic (SH) coefficients. A multi-resolution hash encoder
feeds a lightweight MLP that outputs complex SH coefficients up to a specified
degree L. The zeroth-order coefficient acts as an isotropic scattering field,
which also serves as the density term, while higher orders compactly capture
directional scattering with minimal parameter overhead. Because the model
predicts the complex amplitude for any transmit-receive baseline, training is
performed directly from 1-D time-of-flight signals without the need to beamform
intermediate images for supervision. Across synthetic and real SAS (both in-air
and underwater) benchmarks, results show that SH-SAS performs better in terms
of 3D reconstruction quality and geometric metrics than previous methods.
[LINK]
http://arxiv.org/abs/2509.11087v1
[DATE]
2025-09-14 12:29:28+08:00
[CATEGORIES]
cs.LG
DemandLens: Enhancing Forecast Accuracy Through Product-Specific Hyperparameter Optimization
[AUTHORS]
Srijesh Pillai, M. I. Jawid Nazir
[ABSTRACT]
DemandLens presents an innovative Prophet-based forecasting model for the
mattress-in-a-box industry, incorporating COVID-19 metrics and SKU-specific
hyperparameter optimization. This industry has seen significant growth of
e-commerce players in recent years, with a business model that relies largely
on outsourcing mattress manufacturing and the related logistics and supply
chain operations while focusing on marketing the product and driving
conversions through direct-to-consumer sales channels. Within the United
States, only a limited number of mattress contract manufacturers are available,
so it is important that they manage their raw materials, supply chain, and
inventory intelligently in order to serve as many mattress brands as possible.
Our approach addresses the critical need for accurate sales forecasting in an
industry that is heavily dependent on third-party contract manufacturing. This,
in turn, helps the contract manufacturers prepare in advance, avoid bottleneck
scenarios, and source raw materials at optimal rates. The model demonstrates
strong predictive capabilities through SKU-specific hyperparameter
optimization, offering contract manufacturers and mattress brands a reliable
tool to streamline supply chain operations.
[COMMENTS]
10 pages, 12 figures, 3 tables. Accepted for publication in the
proceedings of the 2025 Advances in Science and Engineering Technology
International Conferences (ASET)
[LINK]
http://arxiv.org/abs/2509.11085v1
[DATE]
2025-09-14 12:25:50+08:00
[CATEGORIES]
cs.LG
Developing an aeroponic smart experimental greenhouse for controlling irrigation and plant disease detection using deep learning and IoT
[AUTHORS]
Mohammadreza Narimani, Ali Hajiahmad, Ali Moghimi, Reza Alimardani, Shahin Rafiee, Amir Hossein Mirzabe
[ABSTRACT]
Controlling environmental conditions and monitoring plant status in
greenhouses is critical to promptly making appropriate management decisions
aimed at promoting crop production. The primary objective of this research
study was to develop and test a smart aeroponic greenhouse on an experimental
scale where the status of geranium plants and environmental conditions are
continuously monitored through the integration of the internet of things (IoT)
and artificial intelligence (AI). An IoT-based platform was developed to
control the environmental conditions of plants more efficiently and provide
insights to users to make informed management decisions. In addition, we
developed an AI-based disease detection framework using VGG-19,
InceptionResNetV2, and InceptionV3 algorithms to analyze the images captured
periodically after an intentional inoculation. The performance of the AI
framework was compared with an expert’s evaluation of disease status.
Preliminary results showed that the IoT system implemented in the greenhouse
environment is able to publish data such as temperature, humidity, water flow,
and volume of charge tanks online continuously to users and adjust the
controlled parameters to provide an optimal growth environment for the plants.
Furthermore, the results of the AI framework demonstrate that the VGG-19
algorithm distinguished drought stress and rust leaves from healthy leaves
with the highest accuracy (92%) among the algorithms tested.
[COMMENTS]
Author-accepted version. Presented at ASABE Annual International
Meeting (AIM) 2021 (virtual), Paper 2101252. Please cite the published
meeting paper: doi:10.13031/aim.202101252. Minor wording and formatting
updates in this preprint
[LINK]
http://arxiv.org/abs/2509.12274v1
[DATE]
2025-09-14 11:48:22+08:00
[CATEGORIES]
cs.LG
C-Learner: Constrained Learning for Causal Inference
[AUTHORS]
Tiffany Tianhui Cai, Yuri Fonseca, Kaiwen Hou, Hongseok Namkoong
[ABSTRACT]
Popular debiased estimation methods for causal inference – such as augmented
inverse propensity weighting and targeted maximum likelihood estimation –
enjoy desirable asymptotic properties like statistical efficiency and double
robustness, but they can produce unstable estimates when there is limited
overlap between treatment and control, requiring additional assumptions or ad
hoc adjustments in practice (e.g., truncating propensity scores). In contrast,
simple plug-in estimators are stable but lack desirable asymptotic properties.
We propose a novel debiasing approach that achieves the best of both worlds,
producing stable plug-in estimates with desirable asymptotic properties. Our
constrained learning framework solves for the best plug-in estimator under the
constraint that the first-order error with respect to the plugged-in quantity
is zero, and can leverage flexible model classes including neural networks and
tree ensembles. In several experimental settings, including ones in which we
handle text-based covariates by fine-tuning language models, our constrained
learning-based estimator outperforms basic versions of one-step estimation and
targeting in challenging settings with limited overlap between treatment and
control, and performs similarly otherwise.
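The instability described above can be seen in a small sketch that compares a plug-in estimate of the average treatment effect with the standard augmented inverse propensity weighting (AIPW) estimate. This is not the C-Learner itself; the data, outcome-model predictions, and propensity scores are made up, and the `clip` argument stands in for the ad hoc truncation the abstract mentions.

```python
def plug_in_ate(mu1, mu0):
    """Plug-in estimate: average difference of the fitted outcome models."""
    return sum(m1 - m0 for m1, m0 in zip(mu1, mu0)) / len(mu1)


def aipw_ate(y, a, mu1, mu0, e, clip=None):
    """AIPW estimate: plug-in term plus inverse-propensity-weighted residuals."""
    total = 0.0
    for yi, ai, m1, m0, ei in zip(y, a, mu1, mu0, e):
        if clip is not None:
            # Optional truncation of the propensity score to [clip, 1 - clip].
            ei = min(max(ei, clip), 1 - clip)
        total += m1 - m0 + ai * (yi - m1) / ei - (1 - ai) * (yi - m0) / (1 - ei)
    return total / len(y)


# Toy data: the third unit is treated despite a propensity of only 0.01 (limited overlap).
y = [1.0, 0.0, 2.0, 1.0]        # observed outcomes
a = [1, 0, 1, 0]                # treatment indicators
mu1 = [1.5, 1.5, 1.5, 1.5]      # hypothetical fitted E[Y | X, A=1]
mu0 = [0.5, 0.5, 0.5, 0.5]      # hypothetical fitted E[Y | X, A=0]
e = [0.5, 0.5, 0.01, 0.5]       # hypothetical fitted propensity scores

ate_plug_in = plug_in_ate(mu1, mu0)
ate_aipw = aipw_ate(y, a, mu1, mu0, e)
ate_aipw_clipped = aipw_ate(y, a, mu1, mu0, e, clip=0.1)
```

In this toy case a single unit with a tiny propensity score dominates the unclipped AIPW estimate, while the plug-in estimate is unaffected by the propensity scores.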
[LINK]
http://arxiv.org/abs/2405.09493v6
[DATE]
2025-09-14 11:47:58+08:00
[CATEGORIES]
cs.LG
Kernel-based Stochastic Approximation Framework for Nonlinear Operator Learning
[AUTHORS]
Jia-Qi Yang, Lei Shi
[ABSTRACT]
We develop a stochastic approximation framework for learning nonlinear
operators between infinite-dimensional spaces utilizing general Mercer
operator-valued kernels. Our framework encompasses two key classes: (i) compact
kernels, which admit discrete spectral decompositions, and (ii) diagonal
kernels of the form $K(x,x')=k(x,x')T$, where $k$ is a scalar-valued kernel and
$T$ is a positive operator on the output space. This broad setting induces
expressive vector-valued reproducing kernel Hilbert spaces (RKHSs) that
generalize the classical $K=kI$ paradigm, thereby enabling rich structural
modeling with rigorous theoretical guarantees. To address target operators
lying outside the RKHS, we introduce vector-valued interpolation spaces to
precisely quantify misspecification error. Within this framework, we establish
dimension-free polynomial convergence rates, demonstrating that nonlinear
operator learning can overcome the curse of dimensionality. The use of general
operator-valued kernels further allows us to derive rates for intrinsically
nonlinear operator learning, going beyond the linear-type behavior inherent in
diagonal constructions of $K=kI$. Importantly, this framework accommodates a
wide range of operator learning tasks, ranging from integral operators such as
Fredholm operators to architectures based on encoder-decoder representations.
Moreover, we validate its effectiveness through numerical experiments on the
two-dimensional Navier-Stokes equations.
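A finite-dimensional stand-in may help make the diagonal kernel class concrete. The paper works with infinite-dimensional output spaces; the sketch below assumes a two-dimensional output space, a Gaussian scalar kernel, and an arbitrary positive definite matrix in place of the operator $T$. All names are illustrative.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Scalar-valued kernel k(x, x')."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.exp(-np.dot(d, d) / (2 * bandwidth**2)))


def diagonal_operator_kernel(x, y, T, bandwidth=1.0):
    """K(x, x') = k(x, x') T, an m-by-m matrix acting on the output space."""
    return gaussian_kernel(x, y, bandwidth) * T


# T is built as A A^T so that it is symmetric positive definite (A is invertible).
A = np.array([[1.0, 0.5], [0.0, 1.0]])
T = A @ A.T

# Three input points; the block Gram matrix stacks K(x_i, x_j) for all pairs.
xs = [np.array([0.0]), np.array([1.0]), np.array([2.5])]
gram = np.block([[diagonal_operator_kernel(xi, xj, T) for xj in xs] for xi in xs])

# A valid operator-valued kernel yields a positive semidefinite block Gram matrix.
min_eig = np.linalg.eigvalsh(gram).min()
```

Setting `T` to the identity matrix recovers the classical $K=kI$ case mentioned in the abstract.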
[COMMENTS]
34 pages, 3 figures
[LINK]
http://arxiv.org/abs/2509.11070v1
[DATE]
2025-09-14 11:33:36+08:00
[CATEGORIES]
cs.LG
BERT4beam: Large AI Model Enabled Generalized Beamforming Optimization
[AUTHORS]
Yuhang Li, Yang Lu, Wei Chen, Bo Ai, Zhiguo Ding, Dusit Niyato
[ABSTRACT]
Artificial intelligence (AI) is anticipated to emerge as a pivotal enabler
for the forthcoming sixth-generation (6G) wireless communication systems.
However, current research efforts regarding large AI models for wireless
communications primarily focus on fine-tuning pre-trained large language models
(LLMs) for specific tasks. This paper investigates a large-scale AI model
designed for beamforming optimization to adapt and generalize to diverse tasks
defined by system utilities and scales. We propose a novel framework based on
bidirectional encoder representations from transformers (BERT), termed
BERT4beam. We aim to formulate the beamforming optimization problem as a
token-level sequence learning task, perform tokenization of the channel state
information, construct the BERT model, and conduct task-specific pre-training
and fine-tuning strategies. Based on the framework, we propose two BERT-based
approaches for single-task and multi-task beamforming optimization,
respectively. Both approaches are generalizable for varying user scales.
Moreover, the former can adapt to varying system utilities and antenna
configurations by re-configuring the input and output modules of the BERT model,
while the latter, termed UBERT, can directly generalize to diverse tasks, due
to a finer-grained tokenization strategy. Extensive simulation results
demonstrate that the two proposed approaches can achieve near-optimal
performance and outperform existing AI models across various beamforming
optimization tasks, showcasing strong adaptability and generalizability.
[LINK]
http://arxiv.org/abs/2509.11056v1
[DATE]
2025-09-14 10:49:29+08:00
[CATEGORIES]
cs.LG
Adapting Projection-Based Reduced-Order Models using Projected Gaussian Process
[AUTHORS]
Xiao Liu, Jingyi Feng, Xinchao Liu
[ABSTRACT]
Projection-based model reduction is among the most widely adopted methods for
constructing parametric Reduced-Order Models (ROM). Utilizing the snapshot data
from solving full-order governing equations, the Proper Orthogonal
Decomposition (POD) computes the optimal basis modes that represent the data,
and a ROM can be constructed in the low-dimensional vector subspace spanned by
the POD basis. For parametric governing equations, a potential challenge arises
when there is a need to update the POD basis to adapt ROMs that accurately
capture the variation of a system’s behavior over its parameter space (in
design, control, uncertainty quantification, digital twin applications, etc.).
In this paper, we propose a Projected Gaussian Process (pGP) and formulate the
problem of adapting the POD basis as a supervised statistical learning problem,
for which the goal is to learn a mapping from the parameter space to the
Grassmann manifold that contains the optimal subspaces. A mapping is first
established between the Euclidean space and the horizontal space of an
orthogonal matrix that spans a reference subspace in the Grassmann manifold. A
second mapping from the horizontal space to the Grassmann manifold is
established through the Exponential/Logarithm maps between the manifold and its
tangent space. Finally, given a new parameter, the conditional distribution of
a vector can be found in the Euclidean space using the Gaussian Process (GP)
regression, and such a distribution is then projected to the Grassmann manifold,
which enables us to predict the optimal subspace for the new parameter. As a
statistical learning approach, the proposed pGP allows us to optimally estimate
(or tune) the model parameters from data and quantify the statistical
uncertainty associated with the prediction. The advantages of the proposed pGP
are demonstrated by numerical experiments.
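The POD step that the abstract builds on can be sketched with a thin SVD of the snapshot matrix. This covers only the basis computation, not the proposed pGP; the snapshot data are synthetic with exact rank 2, and the function name is illustrative.

```python
import numpy as np


def pod_basis(snapshots, r):
    """Return the r leading POD modes of an (n_dof, n_snapshots) snapshot matrix."""
    # The left singular vectors are the POD modes, ordered by captured energy.
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r], s


rng = np.random.default_rng(0)

# Synthetic snapshots: 20 states of a 50-dimensional system that live in a 2D subspace.
true_modes = np.linalg.qr(rng.standard_normal((50, 2)))[0]
coefficients = rng.standard_normal((2, 20))
X = true_modes @ coefficients

Phi, singular_values = pod_basis(X, r=2)

# Project the snapshots onto the POD subspace and measure the relative error.
X_rom = Phi @ (Phi.T @ X)
rel_err = np.linalg.norm(X - X_rom) / np.linalg.norm(X)
```

The columns of `Phi` span one point on the Grassmann manifold; adapting the basis to a new parameter amounts to predicting a different such subspace.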
[LINK]
http://arxiv.org/abs/2410.14090v2
[DATE]
2025-09-14 10:43:35+08:00
[CATEGORIES]
cs.LG
An Advanced Convolutional Neural Network for Bearing Fault Diagnosis under Limited Data
[AUTHORS]
Shengke Sun, Shuzhen Han, Ziqian Luan, Xinghao Qin, Jiao Yin, Zhanshan Zhao, Jinli Cao, Hua Wang
[ABSTRACT]
In the area of bearing fault diagnosis, deep learning (DL) methods have been
widely used recently. However, due to high cost or privacy concerns,
high-quality labeled data are scarce in real-world scenarios. While few-shot
learning has shown promise in addressing data scarcity, existing methods still
face significant limitations in this domain. Traditional data augmentation
techniques often suffer from mode collapse and generate low-quality samples
that fail to capture the diversity of bearing fault patterns. Moreover, the
local receptive fields of conventional convolutional neural networks (CNNs)
make them inadequate for extracting global features from complex vibration
signals. Additionally, existing methods fail to model the intricate
relationships between limited training samples. To solve these problems, we
propose an advanced data augmentation and contrastive Fourier convolution
framework (DAC-FCF) for bearing fault diagnosis under limited data. Firstly, a
novel conditional consistent latent representation and reconstruction
generative adversarial network (CCLR-GAN) is proposed to generate more diverse
data. Secondly, a contrastive learning based joint optimization mechanism is
utilized to better model the relations between the available training data.
Finally, we propose a 1D Fourier convolution neural network (1D-FCNN) to
achieve global awareness of the input data. Experiments demonstrate that DAC-FCF
achieves significant improvements, outperforming baselines by up to 32% on
the Case Western Reserve University (CWRU) dataset and 10% on a self-collected
test bench. Extensive ablation experiments prove the effectiveness of the
proposed components. Thus, the proposed DAC-FCF offers a promising solution for
bearing fault diagnosis under limited data.
[LINK]
http://arxiv.org/abs/2509.11053v1
[DATE]
2025-09-14 10:41:48+08:00
[CATEGORIES]
cs.LG
Can We Treat Noisy Labels as Accurate?
[AUTHORS]
Yuxiang Zheng, Zhongyi Han, Yilong Yin, Xin Gao, Tongliang Liu
[ABSTRACT]
Noisy labels significantly hinder the accuracy and generalization of machine
learning models, particularly when resulting from ambiguous instance features
that complicate correct labeling. Traditional approaches, such as those relying
on transition matrices for label correction, often struggle to effectively
resolve such ambiguity, due to their inability to capture complex relationships
between instances and noisy labels. In this paper, we propose EchoAlign, a
paradigm shift in learning from noisy labels. Unlike previous methods that
attempt to correct labels, EchoAlign treats noisy labels ($\tilde{Y}$) as
accurate and modifies corresponding instances ($X$) to better align with these
labels. The EchoAlign framework comprises two main components: (1) EchoMod
leverages controllable generative models to selectively modify instance
features, achieving alignment with noisy labels while preserving intrinsic
instance characteristics such as shape, texture, and semantic identity. (2)
EchoSelect mitigates distribution shifts introduced by instance modifications
by strategically retaining a substantial subset of original instances with
correct labels. Specifically, EchoSelect exploits feature similarity
distributions between original and modified instances to accurately distinguish
between correctly and incorrectly labeled samples. Extensive experiments across
three benchmark datasets demonstrate that EchoAlign significantly outperforms
state-of-the-art methods, particularly in high-noise environments, achieving
superior accuracy and robustness. Notably, under 30% instance-dependent noise,
EchoSelect retains nearly twice the number of correctly labeled samples
compared to previous methods, maintaining 99% selection accuracy, thereby
clearly illustrating the effectiveness of EchoAlign. The implementation of
EchoAlign is publicly available at
https://github.com/KevinCarpricorn/EchoAlign/tree/main.
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2405.12969v2
[DATE]
2025-09-14 10:36:20+08:00
[CATEGORIES]
cs.LG
Data-Efficient Ensemble Weather Forecasting with Diffusion Models
[AUTHORS]
Kevin Valencia, Ziyang Liu, Justin Cui
[ABSTRACT]
Although numerical weather forecasting methods have dominated the field,
recent advances in deep learning methods, such as diffusion models, have shown
promise in ensemble weather forecasting. However, such models are typically
autoregressive and are thus computationally expensive. This is a challenge in
climate science, where data can be limited, costly, or difficult to work with.
In this work, we explore the impact of curated data selection on these
autoregressive diffusion models. We evaluate several data sampling strategies
and show that a simple time-stratified sampling approach achieves performance
similar to or better than full-data training. Notably, it outperforms the
full-data model on certain metrics and performs only slightly worse on others
while using only 20% of the training data. Our results demonstrate the
feasibility of data-efficient diffusion training, especially for weather
forecasting, and motivate future work on adaptive or model-aware sampling
methods that go beyond random or purely temporal sampling.
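One way time-stratified sampling could look in practice is sketched below. The abstract does not say how strata are defined, so grouping by calendar month is an assumption, as are the function name and toy timestamps; the 20% fraction matches the figure quoted above.

```python
import random
from collections import defaultdict


def time_stratified_sample(timestamps, fraction, stratum_of, seed=0):
    """Pick about `fraction` of the sample indices from every time stratum."""
    rng = random.Random(seed)

    # Group sample indices by the stratum their timestamp falls into.
    strata = defaultdict(list)
    for index, stamp in enumerate(timestamps):
        strata[stratum_of(stamp)].append(index)

    chosen = []
    for key in sorted(strata):
        members = strata[key]
        n_keep = max(1, round(fraction * len(members)))
        chosen.extend(rng.sample(members, n_keep))
    return sorted(chosen)


# Toy dataset: 10 years of (year, month) stamps with 30 samples per month.
timestamps = [(2000 + y, m) for y in range(10) for m in range(1, 13) for _ in range(30)]

# Assumed stratum: calendar month, so every season is equally represented.
subset = time_stratified_sample(timestamps, 0.2, stratum_of=lambda stamp: stamp[1])
```

Unlike uniform random subsampling, every stratum is guaranteed its share of the retained data, which keeps rare seasons or regimes from being dropped by chance.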
[LINK]
http://arxiv.org/abs/2509.11047v1
[DATE]
2025-09-14 10:22:16+08:00
[CATEGORIES]
cs.LG
Hybrid Quantum Neural Networks for Efficient Protein-Ligand Binding Affinity Prediction
[AUTHORS]
Seon-Geun Jeong, Kyeong-Hwan Moon, Won-Joo Hwang
[ABSTRACT]
Protein-ligand binding affinity is critical in drug discovery, but
experimentally determining it is time-consuming and expensive. Artificial
intelligence (AI) has been used to predict binding affinity, significantly
accelerating this process. However, the high-performance requirements and vast
datasets involved in affinity prediction demand increasingly large AI models,
requiring substantial computational resources and training time. Quantum
machine learning has emerged as a promising solution to these challenges. In
particular, hybrid quantum-classical models can reduce the number of parameters
while maintaining or improving performance compared to classical counterparts.
Despite these advantages, challenges persist: why hybrid quantum models achieve
these benefits, whether quantum neural networks (QNNs) can replace classical
neural networks, and whether such models are feasible on noisy
intermediate-scale quantum (NISQ) devices. This study addresses these
challenges by proposing a hybrid quantum neural network (HQNN) that empirically
demonstrates the capability to approximate non-linear functions in the latent
feature space derived from classical embedding. The primary goal of this study
is to achieve a parameter-efficient model in binding affinity prediction while
ensuring feasibility on NISQ devices. Numerical results indicate that HQNN
achieves comparable or superior performance and parameter efficiency compared
to classical neural networks, underscoring its potential as a viable
replacement. This study highlights the potential of hybrid QML in computational
drug discovery, offering insights into its applicability and advantages in
addressing the computational challenges of protein-ligand binding affinity
prediction.
[COMMENTS]
43 pages, 9 figures, and 12 tables. Accepted by EPJ Quantum
Technology
[LINK]
http://arxiv.org/abs/2509.11046v1
[DATE]
2025-09-14 10:20:21+08:00
[CATEGORIES]
cs.LG
When Deep Learning Meets Polyhedral Theory: A Survey
[AUTHORS]
Joey Huchette, Gonzalo Muñoz, Thiago Serra, Calvin Tsay
[ABSTRACT]
In the past decade, deep learning became the prevalent methodology for
predictive modeling thanks to the remarkable accuracy of deep neural networks
in tasks such as computer vision and natural language processing. Meanwhile,
the structure of neural networks converged back to simpler representations
based on piecewise constant and piecewise linear functions such as the
Rectified Linear Unit (ReLU), which became the most commonly used type of
activation function in neural networks. That made certain types of network
structure, such as the typical fully-connected feedforward neural network,
amenable to analysis through polyhedral theory
and to the application of methodologies such as Linear Programming (LP) and
Mixed-Integer Linear Programming (MILP) for a variety of purposes. In this
paper, we survey the main topics emerging from this fast-paced area of work,
which bring a fresh perspective to understanding neural networks in more detail
as well as to applying linear optimization techniques to train, verify, and
reduce the size of such networks.
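As an illustration of why ReLU networks fit MILP, a single unit $y=\max(0, w^\top x + b)$ admits the standard big-M encoding below. The symbols are assumptions introduced here: $L < 0 < U$ are known bounds on the pre-activation $w^\top x + b$, and $z$ is a binary variable indicating whether the unit is active. The survey itself may present tighter or alternative formulations.

```latex
\begin{aligned}
y &\ge w^\top x + b, \\
y &\ge 0, \\
y &\le w^\top x + b - L\,(1 - z), \\
y &\le U z, \\
z &\in \{0, 1\}.
\end{aligned}
```

With $z=1$ the constraints force $y = w^\top x + b \ge 0$, and with $z=0$ they force $y = 0$ and $w^\top x + b \le 0$, so the feasible set is exactly the graph of the ReLU over the bounded domain.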
[LINK]
http://arxiv.org/abs/2305.00241v4
[DATE]
2025-09-14 10:19:44+08:00
[CATEGORIES]
cs.LG
FragmentGPT: A Unified GPT Model for Fragment Growing, Linking, and Merging in Molecular Design
[AUTHORS]
Xuefeng Liu, Songhao Jiang, Qinan Huang, Tinson Xu, Ian Foster, Mengdi Wang, Hening Lin, Jinbo Xu, Rick Stevens
[ABSTRACT]
Fragment-Based Drug Discovery (FBDD) is a popular approach in early drug
development, but designing effective linkers to combine disconnected molecular
fragments into chemically and pharmacologically viable candidates remains
challenging. Further complexity arises when fragments contain structural
redundancies, like duplicate rings, which cannot be addressed by simply adding
or removing atoms or bonds. To address these challenges in a unified framework,
we introduce FragmentGPT, which integrates two core components: (1) a novel
chemically-aware, energy-based bond cleavage pre-training strategy that equips
the GPT-based model with fragment growing, linking, and merging capabilities,
and (2) a novel Reward Ranked Alignment with Expert Exploration (RAE) algorithm
that combines expert imitation learning for diversity enhancement, data
selection and augmentation for Pareto and composite score optimality, and
Supervised Fine-Tuning (SFT) to align the learner policy with multi-objective
goals. Conditioned on fragment pairs, FragmentGPT generates linkers that
connect diverse molecular subunits while simultaneously optimizing for multiple
pharmaceutical goals. It also learns to resolve structural redundancies, such as
duplicated fragments, through intelligent merging, enabling the synthesis of
optimized molecules. FragmentGPT facilitates controlled, goal-driven molecular
assembly. Experiments and ablation studies on real-world cancer datasets
demonstrate its ability to generate chemically valid, high-quality molecules
tailored for downstream drug discovery tasks.
[LINK]
http://arxiv.org/abs/2509.11044v1
[DATE]
2025-09-14 10:17:07+08:00
[CATEGORIES]
cs.LG
Convergence Rate in Nonlinear Two-Time-Scale Stochastic Approximation with State (Time)-Dependence
[AUTHORS]
Zixi Chen, Yumin Xu, Ruixun Zhang
[ABSTRACT]
The nonlinear two-time-scale stochastic approximation is widely studied under
conditions of bounded variances in noise. Motivated by recent advances that
allow for variability linked to the current state or time, we consider state-
and time-dependent noises. We show that the Lyapunov function exhibits
polynomial convergence rates in both cases, with the rate of polynomial decay
depending on the parameters of state- or time-dependent noises. Notably, if the
state noise parameters fully approach their limiting value, the Lyapunov
function achieves an exponential convergence rate. We provide two numerical
examples to illustrate our theoretical findings in the context of stochastic
gradient descent with Polyak-Ruppert averaging and stochastic bilevel
optimization.
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2509.11039v1
[DATE]
2025-09-14 10:06:15+08:00
[CATEGORIES]
cs.LG
From Federated Learning to X-Learning: Breaking the Barriers of Decentrality Through Random Walks
[AUTHORS]
Allan Salihovic, Payam Abdisarabshali, Michael Langberg, Seyyedali Hosseinalipour
[ABSTRACT]
We provide our perspective on X-Learning (XL), a novel distributed learning
architecture that generalizes and extends the concept of decentralization. Our
goal is to present a vision for XL, introducing its unexplored design
considerations and degrees of freedom. To this end, we shed light on the
intuitive yet non-trivial connections between XL, graph theory, and Markov
chains. We also present a series of open research directions to stimulate
further research.
[COMMENTS]
6 figures, 12 pages
[LINK]
http://arxiv.org/abs/2509.03709v2
[DATE]
2025-09-14 09:21:14+08:00
[CATEGORIES]
cs.LG
Fast Fourier Transform-Based Spectral and Temporal Gradient Filtering for Differential Privacy
[AUTHORS]
Hyeju Shin, Vincent-Daniel, Kyudan Jung, Seongwon Yun
[ABSTRACT]
Differential Privacy (DP) has emerged as a key framework for protecting
sensitive data in machine learning, but standard DP-SGD often suffers from
significant accuracy loss due to injected noise. To address this limitation, we
introduce the FFT-Enhanced Kalman Filter (FFTKF), a differentially private
optimization method that improves gradient quality while preserving
$(\varepsilon, \delta)$-DP guarantees. FFTKF applies frequency-domain filtering
to shift privacy noise into less informative high-frequency components,
preserving the low-frequency gradient signals that carry most learning
information. A scalar-gain Kalman filter with a finite-difference Hessian
approximation further refines the denoised gradients. The method has
per-iteration complexity $\mathcal{O}(d \log d)$ and achieves higher test
accuracy than DP-SGD and DiSK on MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet
with CNNs, Wide ResNets, and Vision Transformers. Theoretical analysis shows
that FFTKF ensures equivalent privacy while delivering a stronger
privacy–utility trade-off through reduced variance and controlled bias.
[LINK]
http://arxiv.org/abs/2505.04468v2
[DATE]
2025-09-14 07:21:23+08:00
[CATEGORIES]
cs.LG
Gradient Methods with Online Scaling Part II. Practical Aspects
[AUTHORS]
Ya-Chi Chu, Wenzhi Gao, Yinyu Ye, Madeleine Udell
[LINK]
http://arxiv.org/abs/2509.11007v1
[DATE]
2025-09-14 07:14:27+08:00
[CATEGORIES]
cs.LG
Adversarial Examples Are Not Bugs, They Are Superposition
[AUTHORS]
Liv Gorton, Owen Lewis
[ABSTRACT]
Adversarial examples – inputs with imperceptible perturbations that fool
neural networks – remain one of deep learning’s most perplexing phenomena
despite nearly a decade of research. While numerous defenses and explanations
have been proposed, there is no consensus on the fundamental mechanism. One
underexplored hypothesis is that superposition, a concept from mechanistic
interpretability, may be a major contributing factor, or even the primary
cause. We present four lines of evidence in support of this hypothesis, greatly
extending prior arguments by Elhage et al. (2022): (1) superposition can
theoretically explain a range of adversarial phenomena, (2) in toy models,
intervening on superposition controls robustness, (3) in toy models,
intervening on robustness (via adversarial training) controls superposition,
and (4) in ResNet18, intervening on robustness (via adversarial training)
controls superposition.
[LINK]
http://arxiv.org/abs/2508.17456v2
[DATE]
2025-09-14 07:06:06+08:00
[CATEGORIES]
cs.LG
Hardness, Structural Knowledge, and Opportunity: An Analytical Framework for Modular Performance Modeling
[AUTHORS]
Omid Gheibi, Christian Kästner, Pooyan Jamshidi
[ABSTRACT]
Performance-influence models are beneficial for understanding how
configurations affect system performance, but their creation is challenging due
to the exponential growth of configuration spaces. While gray-box approaches
leverage selective “structural knowledge” (like the module execution graph of
the system) to improve modeling, the relationship between this knowledge, a
system’s characteristics (we call them “structural aspects”), and potential
model improvements is not well understood. This paper addresses this gap by
formally investigating how variations in structural aspects (e.g., the number
of modules and options per module) and the level of structural knowledge impact
the creation of “opportunities” for improved “modular performance modeling”. We
introduce and quantify the concept of modeling “hardness”, defined as the
inherent difficulty of performance modeling. Through controlled experiments
with synthetic system models, we establish an “analytical matrix” to measure
these concepts. Our findings show that modeling hardness is primarily driven by
the number of modules and configuration options per module. More importantly,
we demonstrate that both higher levels of structural knowledge and increased
modeling hardness significantly enhance the opportunity for improvement. The
impact of these factors varies by performance metric; for ranking accuracy
(e.g., in a debugging task), structural knowledge is more dominant, while for
prediction accuracy (e.g., in a resource management task), hardness plays a
stronger role. These results provide actionable insights for system designers,
guiding them to strategically allocate time and select appropriate modeling
approaches based on a system’s characteristics and a given task’s objectives.
[LINK]
http://arxiv.org/abs/2509.11000v1
[DATE]
2025-09-14 06:52:10+08:00
[CATEGORIES]
cs.LG
Factor Graph Optimization for Leak Localization in Water Distribution Networks
[AUTHORS]
Paul Irofti, Luis Romero-Ben, Florin Stoican, Vicenç Puig
[ABSTRACT]
Detecting and localizing leaks in water distribution network systems is an
important topic with direct environmental, economic, and social impact. Our
paper is the first to explore the use of factor graph optimization techniques
for leak localization in water distribution networks, enabling us to perform
sensor fusion between pressure and demand sensor readings and to estimate the
network’s temporal and structural state evolution across all network nodes. The
methodology introduces specific water network factors and proposes a new
architecture composed of two factor graphs: a leak-free state estimation factor
graph and a leak localization factor graph. When a new sensor reading is
obtained, unlike Kalman and other interpolation-based methods, which estimate
only the current network state, factor graphs update both current and past
states. Results on Modena, L-TOWN and synthetic networks show that factor
graphs are much faster than nonlinear Kalman-based alternatives such as the
UKF, while also providing improvements in localization compared to
state-of-the-art estimation-localization approaches. Implementation and
benchmarks are available at https://github.com/pirofti/FGLL.
[LINK]
http://arxiv.org/abs/2509.10982v1
[DATE]
2025-09-14 05:06:27+08:00
[CATEGORIES]
cs.LG
Toward Quantum Utility in Finance: A Robust Data-Driven Algorithm for Asset Clustering
[AUTHORS]
Shivam Sharma, Supreeth Mysore Venkatesh, Pushkin Kachroo
[ABSTRACT]
Clustering financial assets based on return correlations is a fundamental
task in portfolio optimization and statistical arbitrage. However, classical
clustering methods often fall short when dealing with signed correlation
structures, typically requiring lossy transformations and heuristic assumptions
such as a fixed number of clusters. In this work, we apply the Graph-based
Coalition Structure Generation algorithm (GCS-Q) to directly cluster signed,
weighted graphs without relying on such transformations. GCS-Q formulates each
partitioning step as a QUBO problem, enabling it to leverage quantum annealing
for efficient exploration of exponentially large solution spaces. We validate
our approach on both synthetic and real-world financial data, benchmarking
against state-of-the-art classical algorithms such as SPONGE and k-Medoids. Our
experiments demonstrate that GCS-Q consistently achieves higher clustering
quality, as measured by Adjusted Rand Index and structural balance penalties,
while dynamically determining the number of clusters. These results highlight
the practical utility of near-term quantum computing for graph-based
unsupervised learning in financial applications.
[COMMENTS]
9 pages, 2 figures, International Quantum Engineering conference and
exhibition (QUEST-IS 2025)
[LINK]
http://arxiv.org/abs/2509.07766v2
[DATE]
2025-09-14 05:01:32+08:00
[CATEGORIES]
cs.LG
PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint
[AUTHORS]
Bhoomit Vasani, Jack FitzGerald, Anjie Fang, Sushmit Vaish
[ABSTRACT]
We introduce PHLoRA (Post-hoc LoRA, pronounced “flora”), a simple yet
powerful method to extract low-rank adaptation adapters from full-rank
fine-tuned models without requiring access to training data or gradients. By
computing the low-rank decomposition of weight differences between a base model
and its fine-tuned counterpart, our method reconstructs adapter modules that
can be merged or dynamically routed at inference time via S-LoRA, or served in
scalable industry settings using platforms like NVIDIA NIM. This approach
amortizes latency overhead across requests and yields substantial cost savings.
Unlike prior work that trains each adapter explicitly, our approach decouples
fine-tuning from adapter generation, allowing adapter extraction from existing
full-rank models or third-party checkpoints. Experiments on text, image, and
video benchmarks using the Amazon Nova model family demonstrate that extracted
adapters preserve high energy from the full weight delta, can be pruned safely,
and yield negligible degradation in downstream task performance when re-merged.
Overall, PHLoRA provides a practical path for making all existing full-rank
checkpoints adapter-ready, democratizing scalable inference for all models.
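The low-rank extraction described above can be sketched as a truncated SVD of the weight delta. This is a minimal illustration only, assuming NumPy; the function name `extract_lora`, the way singular values are folded into one factor, and the energy ratio (share of squared singular values retained) are assumptions for illustration, since the abstract does not specify the paper's exact factorization, rank selection, or energy metric.

```python
import numpy as np

def extract_lora(w_base, w_finetuned, rank):
    """Hypothetical helper: factor the weight delta into rank-`rank` adapters B @ A."""
    delta = w_finetuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # (out_dim, rank), singular values folded into B
    a = vt[:rank, :]             # (rank, in_dim)
    # Share of the delta's squared Frobenius norm kept by the truncation (assumed metric).
    energy = float(np.sum(s[:rank] ** 2) / np.sum(s ** 2))
    return b, a, energy

# Toy check: a base matrix plus an exactly rank-4 update.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(64, 32))
true_update = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
w_ft = w_base + true_update

b, a, energy = extract_lora(w_base, w_ft, rank=4)
w_merged = w_base + b @ a        # re-merging the adapter into the base weights
```

In this toy case the update is exactly rank 4, so re-merging recovers the fine-tuned weights up to floating-point error; for a real checkpoint the delta is generally full-rank and the retained energy depends on the chosen rank.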
[LINK]
http://arxiv.org/abs/2509.10971v1
[DATE]
2025-09-14 04:13:58+08:00
[CATEGORIES]
cs.LG
The Psychogenic Machine: Simulating AI Psychosis, Delusion Reinforcement and Harm Enablement in Large Language Models
[AUTHORS]
Joshua Au Yeung, Jacopo Dalmasso, Luca Foschini, Richard JB Dobson, Zeljko Kraljevic
[ABSTRACT]
Background: Reports of “AI psychosis”, in which user-LLM interactions may
exacerbate or induce psychosis or adverse psychological symptoms, are on the
rise. While the sycophantic and agreeable nature of LLMs can be beneficial, it
can become a vector for harm by reinforcing delusional beliefs in vulnerable
users.
Methods: We introduce psychosis-bench, a novel benchmark designed to
systematically evaluate the psychogenicity of LLMs, comprising 16 structured,
12-turn conversational scenarios simulating the progression of delusional
themes (Erotic Delusions, Grandiose/Messianic Delusions, Referential Delusions)
and potential harms. We evaluated eight prominent LLMs for Delusion
Confirmation (DCS), Harm Enablement (HES), and Safety Intervention (SIS) across
explicit and implicit conversational contexts.
Findings: Across 1,536 simulated conversation turns, all LLMs demonstrated
psychogenic potential, showing a strong tendency to perpetuate rather than
challenge delusions (mean DCS of 0.91 $\pm$ 0.88). Models frequently enabled
harmful user requests (mean HES of 0.69 $\pm$ 0.84) and offered safety
interventions in only roughly a third of applicable turns (mean SIS of 0.37
$\pm$ 0.48). In 51 of 128 scenarios (39.8%), no safety interventions were offered.
Performance was significantly worse in implicit scenarios: models were more
likely to confirm delusions and enable harm while offering fewer interventions
(p < .001). A strong correlation was found between DCS and HES (rs = .77).
Model performance varied widely, indicating that safety is not an emergent
property of scale alone.
Conclusion: This study establishes LLM psychogenicity as a quantifiable risk
and underscores the urgent need for re-thinking how we train LLMs. We frame
this issue not merely as a technical challenge but as a public health
imperative requiring collaboration between developers, policymakers, and
healthcare professionals.
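The counts reported above are mutually consistent under one assumption that the abstract does not state explicitly: each of the eight models was run on all 16 scenarios of 12 turns each. A quick arithmetic check under that assumption:

```python
# Assumed design: every model sees every scenario, and every scenario has 12 turns.
models, scenarios, turns = 8, 16, 12

total_turns = models * scenarios * turns      # simulated conversation turns
model_scenario_pairs = models * scenarios     # scenario runs across all models
no_intervention_share = 51 / model_scenario_pairs

print(total_turns, model_scenario_pairs, round(100 * no_intervention_share, 1))
# prints: 1536 128 39.8
```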
[LINK]
http://arxiv.org/abs/2509.10970v1
[DATE]
2025-09-14 04:10:28+08:00
[CATEGORIES]
cs.LG
Predictive Free Energy Simulations Through Hierarchical Distillation of Quantum Hamiltonians
[AUTHORS]
Chenghan Li, Garnet Kin-Lic Chan
[LINK]
http://arxiv.org/abs/2509.10967v1
[DATE]
2025-09-14 03:53:02+08:00
[CATEGORIES]
cs.LG
Efficient Imitation Without Demonstrations via Value-Penalized Auxiliary Control from Examples
[AUTHORS]
Trevor Ablett, Bryan Chan, Jayce Haoran Wang, Jonathan Kelly
[ABSTRACT]
Common approaches to providing feedback in reinforcement learning are the use
of hand-crafted rewards or full-trajectory expert demonstrations.
Alternatively, one can use examples of completed tasks, but such an approach
can be extremely sample inefficient. We introduce value-penalized auxiliary
control from examples (VPACE), an algorithm that significantly improves
exploration in example-based control by adding examples of simple auxiliary
tasks and an above-success-level value penalty. Across both simulated and real
robotic environments, we show that our approach substantially improves learning
efficiency for challenging tasks, while maintaining bounded value estimates.
Preliminary results also suggest that VPACE may learn more efficiently than the
more common approaches of using full trajectories or true sparse rewards.
Project site: https://papers.starslab.ca/vpace/.
[COMMENTS]
In Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA’25), Atlanta, USA, May 19-23, 2025
[LINK]
http://arxiv.org/abs/2407.03311v4
[DATE]
2025-09-14 03:48:09+08:00
[CATEGORIES]
cs.LG
Potential failures of physics-informed machine learning in traffic flow modeling: theoretical and experimental analysis
[AUTHORS]
Yuan-Zheng Lei, Yaobang Gong, Dianwei Chen, Yao Cheng, Xianfeng Terry Yang
[ABSTRACT]
This study investigates why physics-informed machine learning (PIML) can fail
in macroscopic traffic flow modeling. We define failure as cases where a PIML
model underperforms both purely data-driven and purely physics-based baselines
by a given threshold. Unlike in other fields, physics residuals themselves do
not hinder optimization in this setting. Instead, effective updates require
both data and physics gradients to form acute angles with the true gradient, a
condition difficult to satisfy with low-resolution loop data. In such cases,
neural networks cannot accurately approximate density and speed, and the
constructed physics residuals, already degraded by discrete sampling and
temporal averaging, lose their ability to capture PDE dynamics, which directly
leads to PIML failure. Theoretically, although LWR and ARZ solutions are weak
solutions, for piecewise $C^k$ initial data they remain $C^k$ off the shock set
under mild conditions, which has Lebesgue measure zero. Thus, almost all
detector or collocation points lie in smooth regions where residuals are valid,
and the MLP’s inability to exactly represent discontinuities is immaterial.
Finally, we establish MSE lower bounds of physics residuals: higher-order
models such as ARZ have strictly larger consistency error bounds than LWR under
mild conditions. This explains why LWR-based PIML can outperform ARZ-based PIML
even with high-resolution data, with the gap shrinking as resolution increases,
consistent with prior empirical findings.
[LINK]
http://arxiv.org/abs/2505.11491v2
[DATE]
2025-09-14 03:25:04+08:00
[CATEGORIES]
cs.LG
Clarifying Model Transparency: Interpretability versus Explainability in Deep Learning with MNIST and IMDB Examples
[AUTHORS]
Mitali Raj
[ABSTRACT]
The impressive capabilities of deep learning models are often counterbalanced
by their inherent opacity, commonly termed the “black box” problem, which
impedes their widespread acceptance in high-trust domains. In response, the
intersecting disciplines of interpretability and explainability, collectively
falling under the Explainable AI (XAI) umbrella, have become focal points of
research. Although these terms are frequently used as synonyms, they carry
distinct conceptual weights. This document offers a comparative exploration of
interpretability and explainability within the deep learning paradigm,
carefully outlining their respective definitions, objectives, prevalent
methodologies, and inherent difficulties. Through illustrative examinations of
the MNIST digit classification task and IMDB sentiment analysis, we
substantiate a key argument: interpretability generally pertains to a model’s
inherent capacity for human comprehension of its operational mechanisms (global
understanding), whereas explainability is more commonly associated with
post-hoc techniques designed to illuminate the basis for a model’s individual
predictions or behaviors (local explanations). For example, feature attribution
methods can reveal why a specific MNIST image is recognized as a ‘7’, and
word-level importance can clarify an IMDB sentiment outcome. However, these
local insights do not render the complex underlying model globally transparent.
A clear grasp of this differentiation, as demonstrated by these standard
datasets, is vital for fostering dependable and sound artificial intelligence.
[COMMENTS]
5 pages, 2 figures, Accepted at ICICC 2026
[LINK]
http://arxiv.org/abs/2509.10929v1
[DATE]
2025-09-14 02:06:55+08:00
[CATEGORIES]
cs.LG
Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation
[AUTHORS]
Mohanad Albughdadi
[ABSTRACT]
Recent advances in Earth Observation have focused on large-scale foundation
models. However, these models are computationally expensive, limiting their
accessibility and reuse for downstream tasks. In this work, we investigate
compact architectures as a practical pathway toward smaller general-purpose EO
models. We propose a Metadata-aware Mixture-of-Experts Masked Autoencoder
(MoE-MAE) with only 2.5M parameters. The model combines sparse expert routing
with geo-temporal conditioning, incorporating imagery alongside
latitude/longitude and seasonal/daily cyclic encodings. We pretrain the MoE-MAE
on the BigEarthNet-Landsat dataset and evaluate embeddings from its frozen
encoder using linear probes. Despite its small size, the model competes with
much larger architectures, demonstrating that metadata-aware pretraining
improves transfer and label efficiency. To further assess generalization, we
evaluate on the EuroSAT-Landsat dataset, which lacks explicit metadata, and
still observe competitive performance compared to models with hundreds of
millions of parameters. These results suggest that compact, metadata-aware
MoE-MAEs are an efficient and scalable step toward future EO foundation models.
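The cyclic encodings mentioned above are commonly built from sine/cosine pairs, so that values at the wrap-around point (for example hour 23 and hour 1) stay close together. The sketch below is an assumption-laden illustration: the function names `cyclic_encode` and `encode_metadata`, the chosen periods, and the 8-element layout are hypothetical, and the abstract does not specify the exact features the paper uses.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def encode_metadata(lat_deg, lon_deg, day_of_year, hour_of_day):
    """Hypothetical 8-dimensional metadata vector; the layout is illustrative only."""
    lat = math.radians(lat_deg)
    return [
        math.sin(lat), math.cos(lat),         # latitude is not periodic; sin/cos keeps it bounded
        *cyclic_encode(lon_deg, 360.0),       # longitude wraps at +/-180 degrees
        *cyclic_encode(day_of_year, 365.25),  # seasonal cycle
        *cyclic_encode(hour_of_day, 24.0),    # daily cycle
    ]

features = encode_metadata(lat_deg=48.85, lon_deg=2.35, day_of_year=172, hour_of_day=10.5)
```

A vector like `features` could then be projected and added to, or concatenated with, the image tokens; how the conditioning is actually injected is a design choice the abstract leaves open.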
[LINK]
http://arxiv.org/abs/2509.10919v1
[DATE]
2025-09-14 01:35:17+08:00
[CATEGORIES]
cs.LG
ToMA: Token Merge with Attention for Image Generation with Diffusion Models
[AUTHORS]
Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, Shengjie Wang
[ABSTRACT]
Diffusion models excel in high-fidelity image generation but face scalability
limits due to transformers’ quadratic attention complexity. Plug-and-play token
reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens
in generated images but rely on GPU-inefficient operations (e.g., sorting,
scattered writes), introducing overheads that negate theoretical speedups when
paired with optimized attention implementations (e.g., FlashAttention). To
bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf
method that redesigns token reduction for GPU-aligned efficiency, with three
key contributions: 1) a reformulation of token merge as a submodular
optimization problem to select diverse tokens; 2) merge/unmerge as an
attention-like linear transformation via GPU-friendly matrix operations; and 3)
exploiting latent locality and sequential redundancy (pattern reuse) to
minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23%,
respectively (with DINO $\Delta < 0.07$), outperforming prior methods. This
work bridges the gap between theoretical and practical efficiency for
transformers in diffusion.
[COMMENTS]
In proceedings of the 42nd International Conference on Machine
Learning (ICML 2025). Code available at https://github.com/wenboluu/ToMA
[LINK]
http://arxiv.org/abs/2509.10918v1
[DATE]
2025-09-14 01:35:00+08:00
[CATEGORIES]
cs.LG
Robustifying Diffusion-Denoised Smoothing Against Covariate Shift
[AUTHORS]
Ali Hedayatnia, Mostafa Tavassolipour, Babak Nadjar Araabi, Abdol-Hossein Vahabie
[ABSTRACT]
Randomized smoothing is a well-established method for achieving certified
robustness against l2-adversarial perturbations. By incorporating a denoiser
before the base classifier, pretrained classifiers can be seamlessly integrated
into randomized smoothing without significant performance degradation. Among
existing methods, Diffusion Denoised Smoothing, in which a pretrained denoising
diffusion model serves as the denoiser, has produced state-of-the-art results.
However, we show that employing a denoising diffusion model introduces a
covariate shift via misestimation of the added noise, ultimately degrading the
smoothed classifier’s performance. To address this issue, we propose a novel
adversarial objective function focused on the added noise of the denoising
diffusion model. This approach is inspired by our understanding of the origin
of the covariate shift. Our goal is to train the base classifier to ensure it
is robust against the covariate shift introduced by the denoiser. Our method
significantly improves certified accuracy across three standard classification
benchmarks (MNIST, CIFAR-10, and ImageNet), achieving new state-of-the-art
performance under l2-adversarial perturbations. Our implementation is publicly
available at
https://github.com/ahedayat/Robustifying-DDS-Against-Covariate-Shift
[LINK]
http://arxiv.org/abs/2509.10913v1
[DATE]
2025-09-14 01:27:37+08:00
[CATEGORIES]
cs.LG
Developing a Multi-Modal Machine Learning Model For Predicting Performance of Automotive Hood Frames
[AUTHORS]
Abhishek Indupally, Satchit Ramnath
[ABSTRACT]
Is there a way for a designer to evaluate the performance of a given hood
frame geometry without spending significant time on simulation setup? This
paper seeks to address this challenge by developing a multimodal
machine-learning (MMML) architecture that learns from different modalities of
the same data to predict performance metrics. It also aims to use the MMML
architecture to enhance the efficiency of engineering design processes by
reducing reliance on computationally expensive simulations. The proposed
architecture accelerates design exploration, enabling rapid iteration while
maintaining high-performance standards, especially in the concept design phase.
The study also presents results that show that by combining multiple data
modalities, MMML outperforms traditional single-modality approaches. Two new
frame geometries, not part of the training dataset, are also used for
prediction using the trained MMML model to showcase the ability to generalize
to unseen frame models. The findings underscore MMML’s potential in
supplementing traditional simulation-based workflows, particularly in the
conceptual design phase, and highlight its role in bridging the gap between
machine learning and real-world engineering applications. This research paves
the way for the broader adoption of machine learning techniques in engineering
design, with a focus on refining multimodal approaches to optimize structural
development and accelerate the design cycle.
[LINK]
http://arxiv.org/abs/2508.20358v2
[DATE]
2025-09-14 01:24:01+08:00
[CATEGORIES]
cs.LG
Principled Approximation Methods for Efficient and Scalable Deep Learning
[AUTHORS]
Pedro Savarese
[ABSTRACT]
Recent progress in deep learning has been driven by increasingly larger
models. However, their computational and energy demands have grown
proportionally, creating significant barriers to their deployment and to a
wider adoption of deep learning technologies. This thesis investigates
principled approximation methods for improving the efficiency of deep learning
systems, with a particular focus on settings that involve discrete constraints
and non-differentiability.
We study three main approaches toward improved efficiency: architecture
design, model compression, and optimization. For model compression, we propose
novel approximations for pruning and quantization that frame the underlying
discrete problem as continuous and differentiable, enabling gradient-based
training of compression schemes alongside the model’s parameters. These
approximations allow for fine-grained sparsity and precision configurations,
leading to highly compact models without significant fine-tuning. In the
context of architecture design, we design an algorithm for neural architecture
search that leverages parameter sharing across layers to efficiently explore
implicitly recurrent architectures. Finally, we study adaptive optimization,
revisiting theoretical properties of widely used methods and proposing an
adaptive optimizer that allows for quick hyperparameter tuning.
Our contributions center on tackling computationally hard problems via
scalable and principled approximations. Experimental results on image
classification, language modeling, and generative modeling tasks show that the
proposed methods provide significant improvements in terms of training and
inference efficiency while maintaining, or even improving, the model’s
performance.
[COMMENTS]
PhD thesis
[LINK]
http://arxiv.org/abs/2509.00174v2
[DATE]
2025-09-14 01:01:49+08:00
[CATEGORIES]
cs.LG
Research on Short-Video Platform User Decision-Making via Multimodal Temporal Modeling and Reinforcement Learning
[AUTHORS]
Jinmeiyang Wang, Jing Dong, Li Zhou
[ABSTRACT]
This paper proposes the MT-DQN model, which integrates a Transformer,
Temporal Graph Neural Network (TGNN), and Deep Q-Network (DQN) to address the
challenges of predicting user behavior and optimizing recommendation strategies
in short-video environments. Experiments demonstrate that MT-DQN consistently
outperforms traditional concatenated models, such as Concat-Modal, achieving an
average F1-score improvement of 10.97% and an average NDCG@5 improvement of
8.3%. Compared to the classic reinforcement learning model Vanilla-DQN, MT-DQN
reduces MSE by 34.8% and MAE by 26.5%. Nonetheless, we also recognize
challenges in deploying MT-DQN in real-world scenarios, such as its
computational cost and latency sensitivity during online inference, which will
be addressed through future architectural optimization.
[COMMENTS]
26 pages
[LINK]
http://arxiv.org/abs/2509.12269v1
[DATE]
2025-09-14 00:28:14+08:00
[CATEGORIES]
cs.LG
On the Impact of Downstream Tasks on Sampling and Reconstructing Noisy Graph Signals
[AUTHORS]
Baskaran Sripathmanathan, Xiaowen Dong, Michael Bronstein
[ABSTRACT]
We investigate graph signal reconstruction and sample selection for
classification tasks. We present general theoretical characterisations of
classification error applicable to multiple commonly used reconstruction
methods, and compare that to the classical reconstruction error. We demonstrate
the applicability of our results by using them to derive new optimal sampling
methods for linearized graph convolutional networks, and show improvement over
other graph signal processing based methods.
[COMMENTS]
This work has been accepted for publication at IEEE CAMSAP 2025
[LINK]
http://arxiv.org/abs/2509.10874v1
[DATE]
2025-09-14 00:09:43+08:00
[CATEGORIES]
cs.LG
Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue
[AUTHORS]
Sangyeop Kim, Yohan Lee, Sanghwa Kim, Hyunjong Kim, Sungzoon Cho
[ABSTRACT]
Effective long-term memory in conversational AI requires synthesizing
information across multiple sessions. However, current systems place excessive
reasoning burden on response generation, making performance significantly
dependent on model sizes. We introduce PREMem (Pre-storage Reasoning for
Episodic Memory), a novel approach that shifts complex reasoning processes from
inference to memory construction. PREMem extracts fine-grained memory fragments
categorized into factual, experiential, and subjective information; it then
establishes explicit relationships between memory items across sessions,
capturing evolution patterns like extensions, transformations, and
implications. By performing this reasoning during pre-storage rather than when
generating a response, PREMem creates enriched representations while reducing
computational demands during interactions. Experiments show significant
performance improvements across all model sizes, with smaller models achieving
results comparable to much larger baselines while maintaining effectiveness
even with constrained token budgets. Code and dataset are available at
https://github.com/sangyeop-kim/PREMem.
[COMMENTS]
Accepted by EMNLP 2025 (Findings)
[LINK]
http://arxiv.org/abs/2509.10852v1
[DATE]
2025-09-13 23:18:08+08:00
[CATEGORIES]
cs.CL
Text2Sign Diffusion: A Generative Approach for Gloss-Free Sign Language Production
[AUTHORS]
Liqian Feng, Lintao Wang, Kun Hu, Dehui Kong, Zhiyong Wang
[ABSTRACT]
Sign language production (SLP) aims to translate spoken language sentences
into a sequence of pose frames in a sign language, bridging the communication
gap and promoting digital inclusion for deaf and hard-of-hearing communities.
Existing methods typically rely on gloss, a symbolic representation of sign
language words or phrases that serves as an intermediate step in SLP. This
limits the flexibility and generalization of SLP, as gloss annotations are
often unavailable and language-specific. Therefore, we present a novel
diffusion-based generative approach - Text2Sign Diffusion (Text2SignDiff) for
gloss-free SLP. Specifically, a gloss-free latent diffusion model is proposed
to generate sign language sequences from noisy latent sign codes and spoken
text jointly, reducing the potential error accumulation through a
non-autoregressive iterative denoising process. We also design a cross-modal
signing aligner that learns a shared latent space to bridge visual and textual
content in sign and spoken languages. This alignment supports the conditioned
diffusion-based process, enabling more accurate and contextually relevant sign
language generation without gloss. Extensive experiments on the commonly used
PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method,
achieving state-of-the-art performance.
[LINK]
http://arxiv.org/abs/2509.10845v1
[DATE]
2025-09-13 23:05:19+08:00
[CATEGORIES]
cs.CL
Evaluating Large Language Models for Evidence-Based Clinical Question Answering
[AUTHORS]
Can Wang, Yiqun Chen
[ABSTRACT]
Large Language Models (LLMs) have demonstrated substantial progress in
biomedical and clinical applications, motivating rigorous evaluation of their
ability to answer nuanced, evidence-based questions. We curate a multi-source
benchmark drawing from Cochrane systematic reviews and clinical guidelines,
including structured recommendations from the American Heart Association and
narrative guidance used by insurers. Using GPT-4o-mini and GPT-5, we observe
consistent performance patterns across sources and clinical domains: accuracy
is highest on structured guideline recommendations (90%) and lower on narrative
guideline and systematic review questions (60–70%). We also find a strong
correlation between accuracy and the citation count of the underlying
systematic reviews, where each doubling of citations is associated with roughly
a 30% increase in the odds of a correct answer. Models show moderate ability to
reason about evidence quality when contextual information is supplied. When we
incorporate retrieval-augmented prompting, providing the gold-source abstract
raises accuracy on previously incorrect items to 0.79; providing top 3 PubMed
abstracts (ranked by semantic relevance) improves accuracy to 0.23, while
random abstracts reduce accuracy (0.10, within temperature variation). These
effects are mirrored in GPT-4o-mini, underscoring that source clarity and
targeted retrieval – not just model size – drive performance. Overall, our
results highlight both the promise and current limitations of LLMs for
evidence-based clinical question answering. Retrieval-augmented prompting
emerges as a useful strategy to improve factual accuracy and alignment with
source evidence, while stratified evaluation by specialty and question type
remains essential to understand current knowledge access and to contextualize
model performance.
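
The citation effect above can be made concrete with a little arithmetic. The sketch below assumes the reported association is log-linear in the number of doublings (as it would be in a logistic regression on log2 citations); that modeling choice, the function names, and the example baseline are illustrative assumptions, not details from the paper.

```python
import math

def odds_multiplier(citation_ratio, per_doubling=1.3):
    """Odds multiplier implied by a given ratio of citation counts.

    Assumes each doubling multiplies the odds by `per_doubling`
    (1.3 corresponds to "roughly a 30% increase in the odds").
    """
    return per_doubling ** math.log2(citation_ratio)

def shifted_probability(p_base, citation_ratio, per_doubling=1.3):
    """Convert a baseline accuracy to the accuracy implied at a higher citation count."""
    base_odds = p_base / (1.0 - p_base)
    new_odds = base_odds * odds_multiplier(citation_ratio, per_doubling)
    return new_odds / (1.0 + new_odds)

# Hypothetical example: a review with 8x the citations (three doublings).
print(odds_multiplier(8))            # about 1.3 ** 3
print(shifted_probability(0.5, 2))   # odds go from 1.0 to 1.3
```

Note that a 30% increase in odds is not a 30-point increase in accuracy; the conversion depends on the baseline.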
[LINK]
http://arxiv.org/abs/2509.10843v1
[DATE]
2025-09-13 23:03:34+08:00
[CATEGORIES]
cs.CL
Towards Automated Error Discovery: A Study in Conversational AI
[AUTHORS]
Dominic Petrak, Thy Thy Tran, Iryna Gurevych
[ABSTRACT]
Although LLM-based conversational agents demonstrate strong fluency and
coherence, they still produce undesirable behaviors (errors) that are
challenging to prevent from reaching users during deployment. Recent research
leverages large language models (LLMs) to detect errors and guide
response-generation models toward improvement. However, current LLMs struggle
to identify errors not explicitly specified in their instructions, such as
those arising from updates to the response-generation model or shifts in user
behavior. In this work, we introduce Automated Error Discovery, a framework for
detecting and defining errors in conversational AI, and propose SEEED (Soft
Clustering Extended Encoder-Based Error Detection), as an encoder-based
approach to its implementation. We enhance the Soft Nearest Neighbor Loss by
amplifying distance weighting for negative samples and introduce Label-Based
Sample Ranking to select highly contrastive examples for better representation
learning. SEEED outperforms adapted baselines – including GPT-4o and Phi-4 –
across multiple error-annotated dialogue datasets, improving the accuracy for
detecting unknown errors by up to 8 points and demonstrating strong
generalization to unknown intent detection.
[COMMENTS]
Accepted to EMNLP 2025 main conference
[LINK]
http://arxiv.org/abs/2509.10833v1
[DATE]
2025-09-13 22:53:22+08:00
[CATEGORIES]
cs.CL
cs.LG
Revealing the Inherent Instructability of Pre-Trained Language Models
[AUTHORS]
Seokhyun An, Minji Kim, Hyounghun Kim
[COMMENTS]
Findings of EMNLP 2025 (32 pages). Code available at
https://github.com/seokhyunan/response-tuning
[LINK]
http://arxiv.org/abs/2410.02465v3
[DATE]
2025-09-13 13:11:42+08:00
[CATEGORIES]
cs.CL
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
[AUTHORS]
Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
[ABSTRACT]
Large language models (LLMs) utilize key-value (KV) cache to store historical
information during sequence processing. The size of KV cache grows linearly as
the length of the sequence extends, which seriously affects memory usage and
decoding efficiency. Current methods for KV cache eviction typically utilize
the last window from the pre-filling phase as queries to compute the KV
importance scores for eviction. Although this scheme is simple to implement, it
tends to overly focus on local information, potentially leading to the neglect
or omission of crucial global information. To mitigate this issue, we propose
Judge Q, a novel training method which incorporates a soft token list. This
method only tunes the model’s embedding layer at a low training cost. By
concatenating the soft token list at the end of the input sequence, we train
these tokens’ attention map to the original input sequence to align with that
of the actual decoded tokens. In this way, the queries corresponding to the
soft tokens can effectively capture global information and better evaluate the
importance of the keys and values within the KV cache, thus maintaining
decoding quality when KV cache is evicted. Under the same eviction budget, our
method exhibits less performance degradation compared to existing eviction
approaches. We validate our approach through experiments conducted on models
such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks
including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an
improvement of approximately 1 point on the LongBench and over 3 points on
RULER. This proposed methodology can be seamlessly integrated into existing
open-source models with minimal training overhead, thereby enhancing
performance in KV cache eviction scenarios.
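
The baseline scheme described above (scoring cached keys with queries from the last window) can be sketched as follows. This is a minimal single-head illustration in plain Python; the function names, the sum-over-queries aggregation, and the top-`budget` selection rule are assumptions for illustration, not the paper's implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def select_kv_to_keep(queries, keys, budget):
    """Return sorted indices of the `budget` cached positions to keep.

    queries: list of query vectors (e.g. the last window of the prompt).
    keys:    list of cached key vectors, one per position.
    Each position is scored by the attention mass it receives, summed
    over all supplied queries (an assumed aggregation).
    """
    dim = len(keys[0])
    scores = [0.0] * len(keys)
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        for i, p in enumerate(softmax(logits)):
            scores[i] += p
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(select_kv_to_keep([[5.0, 0.0]], keys, budget=2))
```

Under this framing, a method such as Judge Q would change which queries are passed in (queries derived from trained soft tokens rather than the last window), while the scoring step stays the same.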
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2509.10798v1
[DATE]
2025-09-13 11:34:12+08:00
[CATEGORIES]
cs.CL
Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks
[AUTHORS]
Julian Junyan Wang, Victor Xiaoqi Wang
[ABSTRACT]
This study provides the first comprehensive assessment of consistency and
reproducibility in Large Language Model (LLM) outputs in finance and accounting
research. We evaluate how consistently LLMs produce outputs given identical
inputs through extensive experimentation with 50 independent runs across five
common tasks: classification, sentiment analysis, summarization, text
generation, and prediction. Using three OpenAI models (GPT-3.5-turbo,
GPT-4o-mini, and GPT-4o), we generate over 3.4 million outputs from diverse
financial source texts and data, covering MD&As, FOMC statements, finance news
articles, earnings call transcripts, and financial statements. Our findings
reveal substantial but task-dependent consistency, with binary classification
and sentiment analysis achieving near-perfect reproducibility, while complex
tasks show greater variability. More advanced models do not consistently
demonstrate better consistency and reproducibility, with task-specific patterns
emerging. LLMs significantly outperform expert human annotators in consistency
and maintain high agreement even where human experts significantly disagree. We
further find that simple aggregation strategies across 3-5 runs dramatically
improve consistency. We also find that aggregation may come with an additional
benefit of improved accuracy for sentiment analysis when using newer models.
Simulation analysis reveals that despite measurable inconsistency in LLM
outputs, downstream statistical inferences remain remarkably robust. These
findings address concerns about what we term “G-hacking,” the selective
reporting of favorable outcomes from multiple generative AI runs, by
demonstrating that such risks are relatively low for finance and accounting
tasks.
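
One simple way to aggregate several runs is a majority vote over the labels they return. The abstract does not specify which aggregation strategies were used, so the sketch below is only an assumed example for categorical outputs; `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among repeated runs.

    Ties are resolved in favor of the label that appeared first,
    which follows from Counter.most_common ordering.
    """
    if not labels:
        raise ValueError("need at least one run")
    return Counter(labels).most_common(1)[0][0]

# Three hypothetical runs of a sentiment classifier on the same input.
runs = ["positive", "negative", "positive"]
print(majority_vote(runs))
```

For numeric outputs, a mean or median over runs would play the same role.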
[COMMENTS]
76 pages, 20 tables, 12 figures
[LINK]
http://arxiv.org/abs/2503.16974v4
[DATE]
2025-09-13 09:57:49+08:00
[CATEGORIES]
cs.CL
cs.LG
ISACL: Internal State Analyzer for Copyrighted Training Data Leakage
[AUTHORS]
Guangwei Zhang, Qisheng Su, Jiateng Liu, Cheng Qian, Yanzhou Pan, Yanjie Fu, Denghui Zhang
[ABSTRACT]
Large Language Models (LLMs) have revolutionized Natural Language Processing
(NLP) but pose risks of inadvertently exposing copyrighted or proprietary data,
especially when such data is used for training but not intended for
distribution. Traditional methods address these leaks only after content is
generated, which can lead to the exposure of sensitive information. This study
introduces a proactive approach: examining LLMs’ internal states before text
generation to detect potential leaks. By using a curated dataset of copyrighted
materials, we trained a neural network classifier to identify risks, allowing
for early intervention by stopping the generation process or altering outputs
to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG)
system, this framework ensures adherence to copyright and licensing
requirements while enhancing data privacy and ethical standards. Our results
show that analyzing internal states effectively mitigates the risk of
copyrighted data leakage, offering a scalable solution that fits smoothly into
AI workflows, ensuring compliance with copyright regulations while maintaining
high-quality text generation. The implementation is available on
GitHub at https://github.com/changhu73/Internal_states_leakage.
[LINK]
http://arxiv.org/abs/2508.17767v2
[DATE]
2025-09-13 09:49:58+08:00
[CATEGORIES]
cs.CL
cs.LG
Base Models Beat Aligned Models at Randomness and Creativity
[AUTHORS]
Peter West, Christopher Potts
[ABSTRACT]
Alignment has quickly become a default ingredient in LLM development, with
techniques such as reinforcement learning from human feedback making models act
safely, follow instructions, and perform ever-better on complex tasks. While
these techniques are certainly useful, we propose that they should not be
universally applied and demonstrate a range of tasks on which base language
models consistently outperform their popular aligned forms. Particularly, we
study tasks that require unpredictable outputs, such as random number
generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and
creative writing. In each case, aligned models tend towards narrow behaviors
that result in distinct disadvantages, for instance, preferring to generate “7”
over other uniformly random numbers, becoming almost fully predictable in some
game states, or prioritizing pleasant writing over creative originality. Across
models tested, better performance on common benchmarks tends to correlate with
worse performance on our tasks, suggesting an effective trade-off in the
required capabilities.
[LINK]
http://arxiv.org/abs/2505.00047v2
[DATE]
2025-09-13 08:52:27+08:00
[CATEGORIES]
cs.CL
RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems
[AUTHORS]
Adarsh Srinivasan, Jacob Dineen, Muhammad Umar Afzal, Muhammad Uzair Sarfraz, Irbaz B. Riaz, Ben Zhou
[ABSTRACT]
Large language models in healthcare often miss critical emotional cues,
delivering medically sound but emotionally flat advice. This is especially
problematic in clinical contexts where patients are distressed and vulnerable,
and require empathic communication to support safety, adherence, and trust. We
present RECAP (Reflect-Extract-Calibrate-Align-Produce), an inference-time
framework that adds structured emotional reasoning without retraining. By
decomposing empathy into transparent appraisal-theoretic stages and exposing
per-dimension Likert signals, RECAP produces nuanced, auditable responses.
Across EmoBench, SECEU, and EQ-Bench, RECAP improves emotional reasoning by
22-28% on 8B models and 10-13% on larger models over zero-shot baselines.
Clinician evaluations further confirm superior empathetic communication. RECAP
shows that modular, theory-grounded prompting can systematically enhance
emotional intelligence in medical AI while preserving the accountability
required for deployment.
[LINK]
http://arxiv.org/abs/2509.10746v1
[DATE]
2025-09-13 07:30:45+08:00
[CATEGORIES]
cs.CL
Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models
[AUTHORS]
Ozan Gokdemir, Neil Getty, Robert Underwood, Sandeep Madireddy, Franck Cappello, Arvind Ramanathan, Ian T. Foster, Rick L. Stevens
[ABSTRACT]
As scientific knowledge grows at an unprecedented pace, evaluation benchmarks
must evolve to reflect new discoveries and ensure language models are tested on
current, diverse literature. We propose a scalable, modular framework for
generating multiple-choice question-answering (MCQA) benchmarks directly from
large corpora of scientific papers. Our pipeline automates every stage of MCQA
creation, including PDF parsing, semantic chunking, question generation, and
model evaluation. As a case study, we generate more than 16,000 MCQs from
22,000 open-access articles in radiation and cancer biology. We then evaluate a
suite of small language models (1.1B-14B parameters) on these questions,
comparing baseline accuracy with retrieval-augmented generation (RAG) from
paper-derived semantic chunks and from reasoning traces distilled from GPT-4.1.
We find that reasoning-trace retrieval consistently improves performance on
both synthetic and expert-annotated benchmarks, enabling several small models
to surpass GPT-4 on the 2023 Astro Radiation and Cancer Biology exam.
[COMMENTS]
This manuscript has been accepted for publication at the
Supercomputing 25 (SC ‘25) Conference (Frontiers in Generative AI for HPC
Science and Engineering: Foundations, Challenges, and Opportunities Workshop)
in St. Louis, MO, USA on November 16th, 2025. It will appear in the SC25
Workshop Proceedings after that date
[LINK]
http://arxiv.org/abs/2509.10744v1
[DATE]
2025-09-13 07:22:49+08:00
[CATEGORIES]
cs.CL
Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs
[AUTHORS]
Mobina Pournemat, Keivan Rezaei, Gaurang Sriramanan, Arman Zarei, Jiaxiang Fu, Yang Wang, Hamid Eghbalzadeh, Soheil Feizi
[ABSTRACT]
Despite widespread success in language understanding and generation, large
language models (LLMs) exhibit unclear and often inconsistent behavior when
faced with tasks that require probabilistic reasoning. In this work, we present
the first comprehensive study of the reasoning capabilities of LLMs over
explicit discrete probability distributions. Given observations from a
probability distribution, we evaluate models on three carefully designed tasks,
mode identification, maximum likelihood estimation, and sample generation, by
prompting them to provide responses to queries about either the joint
distribution or its conditionals. These tasks thus probe a range of
probabilistic skills, including frequency analysis, marginalization, and
generative behavior. Through comprehensive empirical evaluations, we
demonstrate that there exists a clear performance gap between smaller and
larger models, with the latter demonstrating stronger inference and surprising
capabilities in sample generation. Furthermore, our investigations reveal
notable limitations, including sensitivity to variations in the notation
utilized to represent probabilistic outcomes and performance degradation of
over 60% as context length increases. Together, our results provide a detailed
understanding of the probabilistic reasoning abilities of LLMs and identify key
directions for future improvement.
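
For reference, the ground-truth quantities behind two of the tasks named above (mode identification and maximum likelihood estimation) are easy to compute for a discrete distribution. The sketch below is a plain-Python illustration; the function names and the toy data are assumptions, not the paper's evaluation code.

```python
from collections import Counter

def mle_categorical(observations):
    """Maximum likelihood estimate of a categorical distribution: relative frequencies."""
    counts = Counter(observations)
    n = len(observations)
    return {outcome: c / n for outcome, c in counts.items()}

def mode(observations):
    """Most frequent outcome in the sample."""
    return Counter(observations).most_common(1)[0][0]

def conditional_mle(pairs, given_x):
    """MLE of P(Y | X = given_x) from (x, y) observations of a joint distribution."""
    ys = [y for x, y in pairs if x == given_x]
    return mle_categorical(ys)

sample = ["a", "a", "b", "c"]
print(mode(sample), mle_categorical(sample))

joint = [("sun", "walk"), ("sun", "walk"), ("rain", "walk"), ("sun", "bus")]
print(conditional_mle(joint, "sun"))
```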
[COMMENTS]
25 pages, 4 figures, 6 tables
[LINK]
http://arxiv.org/abs/2509.10739v1
[DATE]
2025-09-13 06:58:05+08:00
[CATEGORIES]
cs.CL
PolyTruth: Multilingual Disinformation Detection using Transformer-Based Language Models
[AUTHORS]
Zaur Gouliev, Jennifer Waters, Chengqian Wang
[ABSTRACT]
Disinformation spreads rapidly across linguistic boundaries, yet most AI
models are still benchmarked only on English. We address this gap with a
systematic comparison of five multilingual transformer models: mBERT, XLM,
XLM-RoBERTa, RemBERT, and mT5 on a common fake-vs-true machine learning
classification task. While transformer-based language models have demonstrated
notable success in detecting disinformation in English, their effectiveness in
multilingual contexts still remains up for debate. To facilitate evaluation, we
introduce PolyTruth Disinfo Corpus, a novel corpus of 60,486 statement pairs
(false claim vs. factual correction) spanning over twenty-five languages that
collectively cover five language families and a broad topical range including
politics, health, climate, finance, and conspiracy, half of which are
fact-checked disinformation claims verified by an augmented MindBugs Discovery
dataset. Our experiments revealed performance variations. Models such as
RemBERT achieved better overall accuracy, particularly excelling in
low-resource languages, whereas models like mBERT and XLM exhibit considerable
limitations when training data is scarce. We provide a discussion of these
performance patterns and implications for real-world deployment. The dataset is
publicly available on our GitHub repository to encourage further
experimentation and advancement. Our findings illuminate both the potential and
the current limitations of AI systems for multilingual disinformation
detection.
[COMMENTS]
11 pages, 5 figures, 4 tables. Submitted to arXiv in Computation and
Language
[LINK]
http://arxiv.org/abs/2509.10737v1
[DATE]
2025-09-13 06:53:17+08:00
[CATEGORIES]
cs.CL
cs.LG
Understanding Emergent In-Context Learning from a Kernel Regression Perspective
[AUTHORS]
Chi Han, Ziqi Wang, Han Zhao, Heng Ji
[ABSTRACT]
Large language models (LLMs) have initiated a paradigm shift in transfer
learning. In contrast to the classic pretraining-then-finetuning procedure, in
order to use LLMs for downstream prediction tasks, one only needs to provide a
few demonstrations, known as in-context examples, without adding more or
updating existing model parameters. This in-context learning (ICL) capability
of LLMs is intriguing, and it is not yet fully understood how pretrained LLMs
acquire such capabilities. In this paper, we investigate the reason why a
transformer-based language model can accomplish in-context learning after
pre-training on a general language corpus by proposing a kernel-regression
perspective of understanding LLMs’ ICL behaviors when faced with in-context
examples. More concretely, we first prove that Bayesian inference on in-context
prompts can be asymptotically understood as kernel regression $\hat y = \sum_i
y_i K(x, x_i)/\sum_i K(x, x_i)$ as the number of in-context demonstrations
grows. Then, we empirically investigate the in-context behaviors of language
models. We find that during ICL, the attention and hidden features in LLMs
match the behaviors of a kernel regression. Finally, our theory provides
insights into multiple phenomena observed in the ICL field: why retrieving
demonstrative samples similar to test samples can help, why ICL performance is
sensitive to the output formats, and why ICL accuracy benefits from selecting
in-distribution and representative samples. Code and resources are publicly
available at https://github.com/Glaciohound/Explain-ICL-As-Kernel-Regression.
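
The kernel regression estimator quoted above, $\hat y = \sum_i y_i K(x, x_i)/\sum_i K(x, x_i)$, can be written out in a few lines. The sketch below uses a Gaussian kernel with a `bandwidth` parameter; the kernel choice and all names are assumptions for illustration, since the abstract does not fix a particular $K$.

```python
import math

def gaussian_kernel(x, xi, bandwidth=1.0):
    """K(x, x_i) = exp(-||x - x_i||^2 / (2 * bandwidth^2)); an assumed kernel choice."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def kernel_regression(x, xs, ys, bandwidth=1.0):
    """Weighted average of the demonstration labels y_i, weighted by K(x, x_i)."""
    weights = [gaussian_kernel(x, xi, bandwidth) for xi in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Two hypothetical demonstrations; the query sits exactly between them.
xs = [[0.0], [1.0]]
ys = [0.0, 1.0]
print(kernel_regression([0.5], xs, ys))
```

Demonstrations closer to the query receive larger weights, which matches the abstract's point that retrieving demonstrations similar to the test sample can help.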
[COMMENTS]
Transactions on Machine Learning Research (TMLR 2025)
[LINK]
http://arxiv.org/abs/2305.12766v3
[DATE]
2025-09-13 06:18:33+08:00
[CATEGORIES]
cs.CL
cs.LG
SearchInstruct: Enhancing Domain Adaptation via Retrieval-Based Instruction Dataset Creation
[AUTHORS]
Iman Barati, Mostafa Amiri, Heshaam Faili
[ABSTRACT]
Supervised Fine-Tuning (SFT) is essential for training large language models
(LLMs), significantly enhancing critical capabilities such as instruction
following and in-context learning. Nevertheless, creating suitable training
datasets tailored for specific domains remains challenging due to unique domain
constraints and data scarcity. In this paper, we propose SearchInstruct, an
innovative method explicitly designed to construct high quality instruction
datasets for SFT. Our approach begins with a limited set of domain specific,
human generated questions, which are systematically expanded using a large
language model. Subsequently, domain relevant resources are dynamically
retrieved to generate accurate and contextually appropriate answers for each
augmented question. Experimental evaluation demonstrates that SearchInstruct
enhances both the diversity and quality of SFT datasets, leading to measurable
improvements in LLM performance within specialized domains. Additionally, we
show that beyond dataset generation, the proposed method can also effectively
facilitate tasks such as model editing, enabling efficient updates to existing
models. To facilitate reproducibility and community adoption, we provide full
implementation details, the complete set of generated instruction response
pairs, and the source code in a publicly accessible Git repository:
https://github.com/mostafaamiri/SearchInstruct
[LINK]
http://arxiv.org/abs/2509.10708v1
[DATE]
2025-09-13 05:50:39+08:00
[CATEGORIES]
cs.CL
A Survey on Retrieval And Structuring Augmented Generation with Large Language Models
[AUTHORS]
Pengcheng Jiang, Siru Ouyang, Yizhu Jiao, Ming Zhong, Runchu Tian, Jiawei Han
[ABSTRACT]
Large Language Models (LLMs) have revolutionized natural language processing
with their remarkable capabilities in text generation and reasoning. However,
these models face critical challenges when deployed in real-world applications,
including hallucination generation, outdated knowledge, and limited domain
expertise. Retrieval And Structuring (RAS) Augmented Generation addresses these
limitations by integrating dynamic information retrieval with structured
knowledge representations. This survey (1) examines retrieval mechanisms
including sparse, dense, and hybrid approaches for accessing external
knowledge; (2) explores text structuring techniques such as taxonomy
construction, hierarchical classification, and information extraction that
transform unstructured text into organized representations; and (3) investigates
how these structured representations integrate with LLMs through prompt-based
methods, reasoning frameworks, and knowledge embedding techniques. It also
identifies technical challenges in retrieval efficiency, structure quality, and
knowledge integration, while highlighting research opportunities in multimodal
retrieval, cross-lingual structures, and interactive systems. This
comprehensive overview provides researchers and practitioners with insights
into RAS methods, applications, and future directions.
[COMMENTS]
KDD’25 survey track
[LINK]
http://arxiv.org/abs/2509.10697v1
[DATE]
2025-09-13 05:25:25+08:00
[CATEGORIES]
cs.CL
Struct-Bench: A Benchmark for Differentially Private Structured Text Generation
[AUTHORS]
Shuaiqi Wang, Vikas Raunak, Arturs Backurs, Victor Reis, Pei Zhou, Sihao Chen, Longqi Yang, Zinan Lin, Sergey Yekhanin, Giulia Fanti
[ABSTRACT]
Differentially private (DP) synthetic data generation is a promising
technique for utilizing private datasets that otherwise cannot be exposed for
model training or other analytics. While much research literature has focused
on generating private unstructured text and image data, in enterprise settings,
structured data (e.g., tabular) is more common, often including natural
language fields or components. Existing synthetic data evaluation techniques
(e.g., FID) struggle to capture the structural properties and correlations of
such datasets. In this work, we propose Struct-Bench, a framework and benchmark
for evaluating synthetic datasets derived from structured datasets that contain
natural language data. The Struct-Bench framework requires users to provide a
representation of their dataset structure as a Context-Free Grammar (CFG). Our
benchmark comprises 5 real-world and 2 synthetically generated datasets, each
annotated with CFGs. We show that these datasets demonstrably present a great
challenge even for state-of-the-art DP synthetic data generation methods.
Struct-Bench also includes reference implementations of different metrics and a
leaderboard, thereby providing researchers a standardized evaluation platform
to benchmark and investigate privacy-preserving synthetic data generation
methods. Further, we also present a case study showing how to use Struct-Bench
to improve the synthetic data quality of Private Evolution (PE) on structured
data. The benchmark and the leaderboard have been publicly made available at
https://struct-bench.github.io.
[LINK]
http://arxiv.org/abs/2509.10696v1
[DATE]
2025-09-13 05:18:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Pluralistic Alignment for Healthcare: A Role-Driven Framework
[AUTHORS]
Jiayou Zhong, Anudeex Shetty, Chao Jia, Xuanrui Lin, Usman Naseem
[COMMENTS]
Accepted to EMNLP 2025 (Main Proceedings)
[LINK]
http://arxiv.org/abs/2509.10685v1
[DATE]
2025-09-13 04:28:27+08:00
[CATEGORIES]
cs.CL
cs.LG
Context Copying Modulation: The Role of Entropy Neurons in Managing Parametric and Contextual Knowledge Conflicts
[AUTHORS]
Zineddine Tighidet, Andrea Mogini, Hedi Ben-younes, Jiali Mei, Patrick Gallinari, Benjamin Piwowarski
[ABSTRACT]
The behavior of Large Language Models (LLMs) when facing contextual
information that conflicts with their internal parametric knowledge is
inconsistent, with no generally accepted explanation for the expected outcome
distribution. Recent work has identified in autoregressive transformer models a
class of neurons – called entropy neurons – that produce a significant effect
on the model output entropy while having an overall moderate impact on the
ranking of the predicted tokens. In this paper, we investigate the preliminary
claim that these neurons are involved in inhibiting context copying behavior in
transformers by looking at their role in resolving conflicts between contextual
and parametric information. We show that entropy neurons are responsible for
suppressing context copying across a range of LLMs, and that ablating them
leads to a significant change in the generation process. These results enhance
our understanding of the internal dynamics of LLMs when handling conflicting
information.
[COMMENTS]
Accepted at EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.10663v1
[DATE]
2025-09-13 03:42:16+08:00
[CATEGORIES]
cs.CL
Interdisciplinary Research in Conversation: A Case Study in Computational Morphology for Language Documentation
[AUTHORS]
Enora Rice, Katharina von der Wense, Alexis Palmer
[ABSTRACT]
Computational morphology has the potential to support language documentation
through tasks like morphological segmentation and the generation of Interlinear
Glossed Text (IGT). However, our research outputs have seen limited use in
real-world language documentation settings. This position paper situates the
disconnect between computational morphology and language documentation within a
broader misalignment between research and practice in NLP and argues that the
field risks becoming decontextualized and ineffectual without systematic
integration of User-Centered Design (UCD). To demonstrate how principles from
UCD can reshape the research agenda, we present a case study of GlossLM, a
state-of-the-art multilingual IGT generation model. Through a small-scale user
study with three documentary linguists, we find that despite strong
metric-based performance, the system fails to meet core usability needs in real
documentation contexts. These insights raise new research questions around
model constraints, label standardization, segmentation, and personalization. We
argue that centering users not only produces more effective tools, but surfaces
richer, more relevant research directions.
[COMMENTS]
Accepted to EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.10644v1
[DATE]
2025-09-13 03:20:11+08:00
[CATEGORIES]
cs.CL
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
[AUTHORS]
Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi
[ABSTRACT]
Do large language models (LLMs) anticipate when they will answer correctly?
To study this, we extract activations after a question is read but before any
tokens are generated, and train linear probes to predict whether the model’s
forthcoming answer will be correct. Across three open-source model families
ranging from 7 to 70 billion parameters, projections on this “in-advance
correctness direction” trained on generic trivia questions predict success in
distribution and on diverse out-of-distribution knowledge datasets,
outperforming black-box baselines and verbalised predicted confidence.
Predictive power saturates in intermediate layers, suggesting that
self-assessment emerges mid-computation. Notably, generalisation falters on
questions requiring mathematical reasoning. Moreover, for models responding “I
don’t know”, doing so strongly correlates with the probe score, indicating that
the same direction also captures confidence. By complementing previous results
on truthfulness and other behaviours obtained with probes and sparse
auto-encoders, our work contributes essential findings to elucidate LLM
internals.
[LINK]
http://arxiv.org/abs/2509.10625v1
[DATE]
2025-09-13 02:09:55+08:00
[CATEGORIES]
cs.CL
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers
[AUTHORS]
Akshat Pandey, Karun Kumar, Raphael Tang
[ABSTRACT]
Pretrained automatic speech recognition (ASR) models such as Whisper perform
well but still need domain adaptation to handle unseen vocabulary and parlance.
In many real-world settings, collecting speech data is impractical,
necessitating text-only adaptation. We propose WhisTLE, a deeply supervised,
text-only adaptation method for pretrained encoder-decoder ASR models. WhisTLE
trains a variational autoencoder (VAE) to model encoder outputs from text and
fine-tunes the decoder using the learned text-to-latent encoder, optionally
combined with text-to-speech (TTS) adaptation. At inference, the original
encoder is restored, incurring no extra runtime cost. Across four out-of-domain
datasets and four ASR models, WhisTLE with TTS reduces word error rate (WER) by
12.3% relative to TTS-only adaptation and outperforms all non-WhisTLE baselines
in 27 of 32 scenarios.
[COMMENTS]
5 pages, 2 figures
[LINK]
http://arxiv.org/abs/2509.10452v1
[DATE]
2025-09-13 01:59:09+08:00
[CATEGORIES]
cs.CL
cs.LG
Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers
[AUTHORS]
Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Dong Yu
[ABSTRACT]
Traditional transformer models often allocate a fixed amount of computational
resources to every input token, leading to inefficient and unnecessary
computation. To address this, the Mixture of Depths (MoD) was introduced to
dynamically adjust the computational depth by skipping less important layers.
Despite its promise, current MoD approaches remain under-explored and face two
main challenges: (1) high training costs due to the need to train the entire
model along with the routers that determine which layers to skip, and (2) the
risk of performance degradation when important layers are bypassed. In response
to the first issue, we propose Router-Tuning, a method that fine-tunes only the
router on a small dataset, drastically reducing the computational overhead
associated with full model training. For the second challenge, we propose
MindSkip, which deploys Attention with Dynamic Depths. This method preserves
the model’s performance while significantly enhancing computational and memory
efficiency. Extensive experiments demonstrate that our approach delivers
competitive results while dramatically improving the computation efficiency,
e.g., 21% speedup and only a 0.2% performance drop. The code is released at
https://github.com/CASE-Lab-UMD/Router-Tuning.
[COMMENTS]
EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2410.13184v6
[DATE]
2025-09-13 01:55:02+08:00
[CATEGORIES]
cs.CL
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
[AUTHORS]
Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
[ABSTRACT]
Augmenting large language models (LLMs) with browsing tools substantially
improves their potential as deep search agents to solve complex, real-world
tasks. Yet, open LLMs still perform poorly in such settings due to limited
long-horizon reasoning capacity with browsing tools and the lack of
sufficiently difficult supervised data. To address these challenges, we present
DeepDive to advance deep search agents. First, we propose a strategy to
automatically synthesize complex, difficult, and hard-to-find questions from
open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement
learning (RL) to enhance LLMs’ long-horizon reasoning with deep search.
Experiments show that DeepDive-32B achieves a competitive new result among
open-source models on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and
Search-o1. We demonstrate that multi-turn RL training improves deep search
ability and significantly contributes to the performance improvements across
multiple benchmarks. We observe that DeepDive enables test-time scaling of tool
calls and parallel sampling. All datasets, models, and code are publicly
available at https://github.com/THUDM/DeepDive.
[LINK]
http://arxiv.org/abs/2509.10446v1
[DATE]
2025-09-13 01:52:35+08:00
[CATEGORIES]
cs.CL
RefactorCoderQA: Benchmarking LLMs for Multi-Domain Coding Question Solutions in Cloud and Edge Deployment
[AUTHORS]
Shadikur Rahman, Aroosa Hameed, Gautam Srivastava, Syed Muhammad Danish
[ABSTRACT]
To optimize the reasoning and problem-solving capabilities of Large Language
Models (LLMs), we propose a novel cloud-edge collaborative architecture that
enables a structured, multi-agent prompting framework. This framework comprises
three specialized components: GuideLLM, a lightweight model deployed at the
edge to provide methodological guidance; SolverLLM, a more powerful model
hosted in the cloud responsible for generating code solutions; and JudgeLLM, an
automated evaluator for assessing solution correctness and quality. To evaluate
and demonstrate the effectiveness of this architecture in realistic settings,
we introduce RefactorCoderQA, a comprehensive benchmark designed to evaluate
and enhance the performance of LLMs across multi-domain
coding tasks. Motivated by the limitations of existing benchmarks,
RefactorCoderQA systematically covers various technical domains, including
Software Engineering, Data Science, Machine Learning, and Natural Language
Processing, using authentic coding challenges from Stack Overflow. Extensive
experiments reveal that our fine-tuned model, RefactorCoder-MoE, achieves
state-of-the-art performance, significantly outperforming leading open-source
and commercial baselines with an overall accuracy of 76.84%. Human evaluations
further validate the interpretability, accuracy, and practical relevance of the
generated solutions. In addition, we evaluate system-level metrics, such as
throughput and latency, to gain deeper insights into the performance
characteristics and trade-offs of the proposed architecture.
[COMMENTS]
12 pages, 5 figures, submitted to IEEE Transactions on Services
Computing
[LINK]
http://arxiv.org/abs/2509.10436v1
[DATE]
2025-09-13 01:44:22+08:00
[CATEGORIES]
cs.CL
Direct Judgement Preference Optimization
[AUTHORS]
Peifeng Wang, Austin Xu, Yilun Zhou, Caiming Xiong, Shafiq Joty
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2409.14664v3
[DATE]
2025-09-13 01:21:39+08:00
[CATEGORIES]
cs.CL
Long Context Automated Essay Scoring with Language Models
[AUTHORS]
Christopher Ormerod, Gitit Kehat
[ABSTRACT]
Transformer-based language models are architecturally constrained to process
text of a fixed maximum length. Essays written by higher-grade students
frequently exceed the maximum allowed length for many popular open-source
models. A common approach to addressing this issue when using these models for
Automated Essay Scoring is to truncate the input text. This raises serious
validity concerns as it undermines the model’s ability to fully capture and
evaluate organizational elements of the scoring rubric, which requires long
contexts to assess. In this study, we evaluate several models that incorporate
architectural modifications of the standard transformer architecture to
overcome these length limitations using the Kaggle ASAP 2.0 dataset. The models
considered in this study include fine-tuned versions of XLNet, Longformer,
ModernBERT, Mamba, and Llama models.
[COMMENTS]
8 pages, 2 figures, 2 tables
[LINK]
http://arxiv.org/abs/2509.10417v1
[DATE]
2025-09-13 01:13:47+08:00
[CATEGORIES]
cs.CL
Slaves to the Law of Large Numbers: An Asymptotic Equipartition Property for Perplexity in Generative Language Models
[AUTHORS]
Tyler Bell, Avinash Mudireddy, Ivan Johnson-Eversoll, Soura Dasgupta, Raghu Mudumbai
[ABSTRACT]
We prove a new asymptotic un-equipartition property for the perplexity of
long texts generated by a language model and present supporting experimental
evidence from open-source models. Specifically, we show that the logarithmic
perplexity of any large text generated by a language model must asymptotically
converge to the average entropy of its token distributions. This defines a
“typical set” that all long synthetic texts generated by a language model
must belong to. We refine the concept of “typical set” to include only
grammatically correct texts. We then show that this refined typical set is a
vanishingly small subset of all possible grammatically correct texts for a very
general definition of grammar. This means that language models are strongly
constrained in the range of their possible behaviors and outputs. We make no
simplifying assumptions (such as stationarity) about the statistics of language
model outputs, and therefore our results are directly applicable to practical
real-world models without any approximations. We discuss possible applications
of the typical set concept to problems such as detecting synthetic texts and
membership inference in training datasets.
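Illustrative sketch (not from the paper): the convergence claim can be checked numerically on a toy model. The example assumes a hypothetical first-order "language model" whose next-token distribution depends only on the previous token, samples a long text from it, and compares the log-perplexity of that text with the average entropy of the distributions it was sampled from. Vocabulary size, text length, and all names are assumptions.

```python
import math
import random


def random_distribution(vocab_size, rng):
    # Skewed random weights so that the distributions have different entropies.
    weights = [rng.random() ** 3 + 1e-3 for _ in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(p):
    # Shannon entropy in nats.
    return -sum(q * math.log(q) for q in p if q > 0)


def generate_and_score(num_tokens, vocab_size=20, seed=0):
    """Sample a text from the toy model and return
    (log-perplexity of the text, average entropy of the token distributions)."""
    rng = random.Random(seed)
    # One next-token distribution per previous token (toy context dependence).
    table = [random_distribution(vocab_size, rng) for _ in range(vocab_size)]
    prev = 0
    neg_log_likelihood = 0.0
    entropy_sum = 0.0
    for _ in range(num_tokens):
        p = table[prev]
        token = rng.choices(range(vocab_size), weights=p)[0]
        neg_log_likelihood -= math.log(p[token])
        entropy_sum += entropy(p)
        prev = token
    return neg_log_likelihood / num_tokens, entropy_sum / num_tokens


log_ppl, avg_entropy = generate_and_score(20000)
gap = abs(log_ppl - avg_entropy)
print(f"log-perplexity={log_ppl:.4f}  average entropy={avg_entropy:.4f}  gap={gap:.4f}")
```

The gap shrinks as the text gets longer because each per-token term has zero mean given the context, so the deviations average out.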
[LINK]
http://arxiv.org/abs/2405.13798v4
[DATE]
2025-09-13 01:03:33+08:00
[CATEGORIES]
cs.CL
Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems
[AUTHORS]
Alva West, Yixuan Weng, Minjun Zhu, Zhen Lin, Yue Zhang
[ABSTRACT]
Failure attribution in multi-agent systems – pinpointing the exact step
where a decisive error occurs – is a critical yet unsolved challenge. Current
methods treat this as a pattern recognition task over long conversation logs,
leading to critically low step-level accuracy (below 17%), which renders them
impractical for debugging complex systems. Their core weakness is a fundamental
inability to perform robust counterfactual reasoning: to determine if
correcting a single action would have actually averted the task failure. To
bridge this counterfactual inference gap, we introduce Abduct-Act-Predict (A2P)
Scaffolding, a novel agent framework that transforms failure attribution from
pattern recognition into a structured causal inference task. A2P explicitly
guides a large language model through a formal three-step reasoning process
within a single inference pass: (1) Abduction, to infer the hidden root causes
behind an agent’s actions; (2) Action, to define a minimal corrective
intervention; and (3) Prediction, to simulate the subsequent trajectory and
verify if the intervention resolves the failure. This structured approach
leverages the holistic context of the entire conversation while imposing a
rigorous causal logic on the model’s analysis. Our extensive experiments on the
Who&When benchmark demonstrate its efficacy. On the Algorithm-Generated
dataset, A2P achieves 47.46% step-level accuracy, a 2.85× improvement
over the 16.67% of the baseline. On the more complex Hand-Crafted dataset, it
achieves 29.31% step accuracy, a 2.43× improvement over the baseline’s
12.07%. By reframing the problem through a causal lens, A2P Scaffolding
provides a robust, verifiable, and significantly more accurate solution for
automated failure attribution.
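Illustrative check (not from the paper): the improvement factors quoted above follow directly from the reported accuracies. The helper name is hypothetical; the four percentages are the ones given in the abstract.

```python
def improvement_factor(new_accuracy, baseline_accuracy):
    # Ratio of the new accuracy to the baseline accuracy.
    return new_accuracy / baseline_accuracy


# Step-level accuracies (in percent) as reported in the abstract.
algorithm_generated = improvement_factor(47.46, 16.67)
hand_crafted = improvement_factor(29.31, 12.07)

print(f"Algorithm-Generated: {algorithm_generated:.2f}x")
print(f"Hand-Crafted: {hand_crafted:.2f}x")
```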
[LINK]
http://arxiv.org/abs/2509.10401v1
[DATE]
2025-09-13 00:51:15+08:00
[CATEGORIES]
cs.CL
Dropping Experts, Recombining Neurons: Retraining-Free Pruning for Sparse Mixture-of-Experts LLMs
[AUTHORS]
Yixiao Zhou, Ziyu Zhao, Dongzhou Cheng, Zhiliang Wu, Jie Gui, Yi Yang, Fei Wu, Yu Cheng, Hehe Fan
[ABSTRACT]
Sparse Mixture-of-Experts (SMoE) architectures are widely used in large
language models (LLMs) due to their computational efficiency. However, though
only a few experts are activated for each token, SMoE still requires loading
all expert parameters, leading to high memory usage and challenges in
deployment. Previous work has tried to reduce the overhead by pruning and
merging experts, but primarily focused on expert-level operations, leaving
neuron-level structure underexplored. We propose DERN (Dropping Experts,
Recombining Neurons), a task-agnostic and retraining-free framework for expert
pruning and reconstruction. We observe that experts are often misaligned and
contain semantic conflicts at the neuron level, which poses challenges for
direct merging. To solve this, DERN works in three steps: it first prunes
redundant experts using router statistics; then it decomposes them into
neuron-level expert segments, assigning each segment to its most compatible
retained expert; and finally, it merges segments within each retained expert to
build a compact representation. Experiments on Mixtral, Qwen, and DeepSeek SMoE
models show that DERN improves performance by more than 5% on commonsense
reasoning and MMLU benchmarks under 50% expert sparsity, without extra
training. It also greatly reduces the number of experts and memory usage,
making SMoE LLMs easier to deploy in practice.
[COMMENTS]
Accepted to EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.10377v1
[DATE]
2025-09-13 00:09:39+08:00
[CATEGORIES]
cs.CL
Optimal message passing for molecular prediction is simple, attentive and spatial
[AUTHORS]
Alma C. Castaneda-Leautaud, Rommie E. Amaro
[ABSTRACT]
The predictive performance of message-passing neural networks (MPNNs) for
molecular property prediction can be improved by simplifying how messages are
passed and by using descriptors that capture multiple aspects of molecular
graphs. In this work, we designed model
architectures that achieved state-of-the-art performance, surpassing more
complex models such as those pre-trained on external databases. We assessed
dataset diversity to complement our performance results, finding that
structural diversity influences the need for additional components in our MPNNs
and feature sets.
In most datasets, our best architecture employs bidirectional message-passing
with an attention mechanism, applied to a minimalist message formulation that
excludes self-perception, highlighting that relatively simpler models, compared
to classical MPNNs, yield higher class separability. In contrast, we found that
convolution normalization factors do not benefit the predictive power in all
the datasets tested. This was corroborated in both global and node-level
outputs. Additionally, we analyzed the influence of both adding spatial
features and working with 3D graphs, finding that 2D molecular graphs are
sufficient when complemented with appropriately chosen 3D descriptors. This
approach not only preserves predictive performance but also reduces
computational cost by over 50%, making it particularly advantageous for
high-throughput screening campaigns.
[COMMENTS]
32 pages, 12 figures. Preprint submitted to RSC Drug Discovery
[LINK]
http://arxiv.org/abs/2509.10871v1
[DATE]
2025-09-13 23:55:02+08:00
[CATEGORIES]
cs.LG
GTHNA: Local-global Graph Transformer with Memory Reconstruction for Holistic Node Anomaly Evaluation
[AUTHORS]
Mingkang Li, Xuexiong Luo, Yue Zhang, Yaoyang Li, Fu Lin
[ABSTRACT]
Anomaly detection in graph-structured data is an inherently challenging
problem, as it requires the identification of rare nodes that deviate from the
majority in both their structural and behavioral characteristics. Existing
methods, such as those based on graph convolutional networks (GCNs), often
suffer from over-smoothing, which causes the learned node representations to
become indistinguishable. Furthermore, graph reconstruction-based approaches
are vulnerable to anomalous node interference during the reconstruction
process, leading to inaccurate anomaly detection. In this work, we propose a
novel and holistic anomaly evaluation framework that integrates three key
components: a local-global Transformer encoder, a memory-guided reconstruction
mechanism, and a multi-scale representation matching strategy. These components
work synergistically to enhance the model’s ability to capture both local and
global structural dependencies, suppress the influence of anomalous nodes, and
assess anomalies from multiple levels of granularity. Anomaly scores are
computed by combining reconstruction errors and memory matching signals,
resulting in a more robust evaluation. Extensive experiments on seven benchmark
datasets demonstrate that our method outperforms existing state-of-the-art
approaches, offering a comprehensive and generalizable solution for anomaly
detection across various graph domains.
[COMMENTS]
9 pages, 7 figures
[LINK]
http://arxiv.org/abs/2509.10869v1
[DATE]
2025-09-13 23:52:16+08:00
[CATEGORIES]
cs.LG
CogGNN: Cognitive Graph Neural Networks in Generative Connectomics
[AUTHORS]
Mayssa Soussia, Yijun Lin, Mohamed Ali Mahjoub, Islem Rekik
[ABSTRACT]
Generative learning has advanced network neuroscience, enabling tasks like
graph super-resolution, temporal graph prediction, and multimodal brain graph
fusion. However, current methods, mainly based on graph neural networks (GNNs),
focus solely on structural and topological properties, neglecting cognitive
traits. To address this, we introduce the first cognified generative model,
CogGNN, which endows GNNs with cognitive capabilities (e.g., visual memory) to
generate brain networks that preserve cognitive features. While broadly
applicable, we present CogGNN, a specific variant designed to integrate visual
input, a key factor in brain functions like pattern recognition and memory
recall. As a proof of concept, we use our model to learn connectional brain
templates (CBTs), population-level fingerprints from multi-view brain networks.
Unlike prior work that overlooks cognitive properties, CogGNN generates CBTs
that are both cognitively and structurally meaningful. Our contributions are:
(i) a novel cognition-aware generative model with a visual-memory-based loss;
(ii) a CBT-learning framework with a co-optimization strategy to yield
well-centered, discriminative, cognitively enhanced templates. Extensive
experiments show that CogGNN outperforms state-of-the-art methods, establishing
a strong foundation for cognitively grounded brain network modeling.
[LINK]
http://arxiv.org/abs/2509.10864v1
[DATE]
2025-09-13 23:38:56+08:00
[CATEGORIES]
cs.LG
Variable Selection Using Relative Importance Rankings
[AUTHORS]
Tien-En Chang, Argon Chen
[ABSTRACT]
Although conceptually related, variable selection and relative importance
(RI) analysis have been treated quite differently in the literature. While RI
is typically used for post-hoc model explanation, this paper explores its
potential for variable ranking and filter-based selection before model
creation. Specifically, we anticipate strong performance from the RI measures
because they incorporate both direct and combined effects of predictors,
addressing a key limitation of marginal correlation that ignores dependencies
among predictors. We implement and evaluate the RI-based variable selection
methods using general dominance (GD), comprehensive relative importance (CRI),
and a newly proposed, computationally efficient variant termed CRI.Z.
We first demonstrate how the RI measures more accurately rank the variables
than the marginal correlation, especially when there are suppressed or weak
predictors. We then show that predictive models built on these rankings are
highly competitive, often outperforming state-of-the-art methods such as the
lasso and relaxed lasso. The proposed RI-based methods are particularly
effective in challenging cases involving clusters of highly correlated
predictors, a setting known to cause failures in many benchmark methods.
Although lasso methods have dominated the recent literature on variable
selection, our study reveals that the RI-based method is a powerful and
competitive alternative. We believe these underutilized tools deserve greater
attention in statistics and machine learning communities. The code is available
at: https://github.com/tien-endotchang/RI-variable-selection.
[COMMENTS]
26 pages, 9 figures
[LINK]
http://arxiv.org/abs/2509.10853v1
[DATE]
2025-09-13 23:21:39+08:00
[CATEGORIES]
cs.LG
Neurosymbolic AI Transfer Learning Improves Network Intrusion Detection
[AUTHORS]
Huynh T. T. Tran, Jacob Sander, Achraf Cohen, Brian Jalaian, Nathaniel D. Bastian
[ABSTRACT]
Transfer learning is commonly utilized in various fields such as computer
vision, natural language processing, and medical imaging due to its impressive
capability to address subtasks and work with different datasets. However, its
application in cybersecurity has not been thoroughly explored. In this paper,
we present an innovative neurosymbolic AI framework designed for network
intrusion detection systems, which play a crucial role in combating malicious
activities in cybersecurity. Our framework leverages transfer learning and
uncertainty quantification. The findings indicate that transfer learning
models, trained on large and well-structured datasets, outperform neural-based
models that rely on smaller datasets, paving the way for a new era in
cybersecurity solutions.
[COMMENTS]
9 pages, 2 figures, 6 tables
[LINK]
http://arxiv.org/abs/2509.10850v1
[DATE]
2025-09-13 23:12:35+08:00
[CATEGORIES]
cs.LG
FACTORS: Factorial Approximation for Complementary Two-factor Optimization with Risk-aware Scoring
[AUTHORS]
Dongseok Kim, Wonjun Jeong, Gisung Oh
[ABSTRACT]
We propose FACTORS, a framework that combines design of experiments with
Shapley decomposition to address performance and stability issues that are
sensitive to combinations of training factors. Our approach consistently
estimates main effects and two-factor interactions, then integrates them into a
risk-adjusted objective function that jointly accounts for uncertainty and
cost, enabling reliable selection of configurations under a fixed budget.
Effect estimation is implemented through two complementary paths: a plug-in
path based on conditional means, and a least-squares path that reconstructs
Shapley contributions from samples. These paths are designed to work
complementarily even when design density and bias levels differ. By
incorporating standardization of estimates, bias correction, and uncertainty
quantification, our procedure ensures comparability across heterogeneous factor
spaces and designs, while a lightweight search routine yields configurations
within practical time even for large factor spaces. On the theoretical side, we
provide error decompositions, sample complexity analysis, and upper bounds on
optimality gaps. On the interpretive side, we summarize main effects and
interactions in map form, highlighting adjustment priorities and safe
improvement pathways. Across diverse datasets and design conditions, our
approach improves rank preservation and optimal configuration identification,
reduces decision-making risks, and offers a tuning foundation that delivers
interpretable justification alongside stable performance gains even under
budget constraints.
[COMMENTS]
43 pages, 8 figures
[LINK]
http://arxiv.org/abs/2509.10825v1
[DATE]
2025-09-13 22:44:45+08:00
[CATEGORIES]
cs.LG
A Traditional Approach to Symbolic Piano Continuation
[AUTHORS]
Christian Zhou-Zheng, John Backsund, Dun Li Chan, Alex Coventry, Avid Eslami, Jyotin Goel, Xingwen Han, Danysh Soomro, Galen Wei
[ABSTRACT]
We present a traditional approach to symbolic piano music continuation for
the MIREX 2025 Symbolic Music Generation challenge. While computational music
generation has recently focused on developing large foundation models with
sophisticated architectural modifications, we argue that simpler approaches
remain more effective for constrained, single-instrument tasks. We thus return
to a simple, unaugmented next-token-prediction objective on tokenized raw MIDI,
aiming to outperform large foundation models by using better data and better
fundamentals. We release model weights and code at
https://github.com/christianazinn/mirex2025.
[COMMENTS]
3 pages, extended abstract, MIREX session at ISMIR 2025 LBD
[LINK]
http://arxiv.org/abs/2509.12267v1
[DATE]
2025-09-13 22:22:11+08:00
[CATEGORIES]
cs.LG
Rethinking Sparse Autoencoders: Select-and-Project for Fairness and Control from Encoder Features Alone
[AUTHORS]
Antonio Bărbălau, Cristian Daniel Păduraru, Teodor Poncu, Alexandru Tifrea, Elena Burceanu
[ABSTRACT]
Sparse Autoencoders (SAEs) have proven valuable due to their ability to
provide interpretable and steerable representations. Current debiasing methods
based on SAEs manipulate these sparse activations presuming that feature
representations are housed within decoder weights. We challenge this
fundamental assumption and introduce an encoder-focused alternative for
representation debiasing, contributing three key findings: (i) we highlight an
unconventional SAE feature selection strategy, (ii) we propose a novel SAE
debiasing methodology that orthogonalizes input embeddings against encoder
weights, and (iii) we establish a performance-preserving mechanism during
debiasing through encoder weight interpolation. Our Selection and Projection
framework, termed S&P TopK, surpasses conventional SAE usage in fairness
metrics by a factor of up to 3.2 and advances state-of-the-art test-time VLM
debiasing results by a factor of up to 1.8 while maintaining downstream
performance.
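Illustrative sketch (not the authors' implementation): orthogonalizing an embedding against one encoder weight vector amounts to removing the embedding's component along that vector. The example assumes a single selected encoder row and made-up numbers; with several selected rows, the rows would first need to be orthonormalized. All names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(embedding, direction):
    """Return the embedding with its component along `direction` removed."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    scale = dot(embedding, direction) / norm_sq
    return [e - scale * d for e, d in zip(embedding, direction)]


# Hypothetical input embedding and one selected SAE encoder row.
embedding = [0.5, -1.2, 2.0, 0.3]
encoder_row = [1.0, 0.0, 1.0, 0.0]

debiased = project_out(embedding, encoder_row)
residual = dot(debiased, encoder_row)  # zero up to floating-point error
print(debiased, residual)
```

Coordinates that the encoder row does not touch are left unchanged, which is one reason a projection of this kind can leave unrelated information intact.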
[LINK]
http://arxiv.org/abs/2509.10809v1
[DATE]
2025-09-13 14:36:07+08:00
[CATEGORIES]
cs.LG
A Convolution and Attention Based Encoder for Reinforcement Learning under Partial Observability
[AUTHORS]
Wuhao Wang, Zhiyong Chen
[ABSTRACT]
Partially Observable Markov Decision Processes (POMDPs) remain a core
challenge in reinforcement learning due to incomplete state information. We
address this by reformulating POMDPs as fully observable processes with
fixed-length observation histories as augmented states. To efficiently encode
these histories, we propose a lightweight temporal encoder based on depthwise
separable convolution and self-attention, avoiding the overhead of recurrent
and Transformer-based models. Integrated into an actor-critic framework, our
method achieves superior performance on continuous control benchmarks under
partial observability. More broadly, this work shows that lightweight temporal
encoding can improve the scalability of AI systems under uncertainty. It
advances the development of agents capable of reasoning robustly in real-world
environments where information is incomplete or delayed.
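Illustrative sketch (not the authors' code): treating a fixed-length observation history as the augmented state can be done with a small rolling buffer. The example assumes zero padding before enough observations have arrived; the class name and that padding choice are hypothetical. The temporal encoder described above would then consume the returned window.

```python
from collections import deque


class ObservationHistory:
    """Keeps the last `history_len` observations as the augmented state."""

    def __init__(self, obs_dim, history_len):
        self.obs_dim = obs_dim
        self.history_len = history_len
        self.buffer = deque(maxlen=history_len)
        self.reset()

    def reset(self):
        # Assumption: pad with zero observations at the start of an episode.
        self.buffer.clear()
        for _ in range(self.history_len):
            self.buffer.append([0.0] * self.obs_dim)

    def push(self, observation):
        if len(observation) != self.obs_dim:
            raise ValueError("unexpected observation size")
        # The deque drops the oldest entry automatically once it is full.
        self.buffer.append(list(observation))
        return self.state()

    def state(self):
        # Oldest observation first, newest last.
        return [list(obs) for obs in self.buffer]


tracker = ObservationHistory(obs_dim=2, history_len=3)
first_state = tracker.push([1.0, 2.0])
for obs in ([3.0, 4.0], [5.0, 6.0], [7.0, 8.0]):
    latest_state = tracker.push(obs)
print(first_state)
print(latest_state)
```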
[LINK]
http://arxiv.org/abs/2505.23857v2
[DATE]
2025-09-13 11:54:46+08:00
[CATEGORIES]
cs.LG
Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models
[AUTHORS]
Weimin Wu, Xuefeng Song, Yibo Wen, Qinjie Lin, Zhihan Zhou, Jerry Yao-Chieh Hu, Zhong Wang, Han Liu
[ABSTRACT]
We introduce Genome-Factory, an integrated Python library for tuning,
deploying, and interpreting genomic models. Our core contribution is to
simplify and unify the workflow for genomic model development: data collection,
model tuning, inference, benchmarking, and interpretability. For data
collection, Genome-Factory offers an automated pipeline to download genomic
sequences and preprocess them. It also includes quality control, such as GC
content normalization. For model tuning, Genome-Factory supports three
approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning.
It is compatible with a wide range of genomic models. For inference,
Genome-Factory enables both embedding extraction and DNA sequence generation.
For benchmarking, we include two existing benchmarks and provide a flexible
interface for users to incorporate additional benchmarks. For interpretability,
Genome-Factory introduces the first open-source biological interpreter based on
a sparse auto-encoder. This module disentangles embeddings into sparse,
near-monosemantic latent units and links them to interpretable genomic features
by regressing on external readouts. To improve accessibility, Genome-Factory
features both a zero-code command-line interface and a user-friendly web
interface. We validate the utility of Genome-Factory across three dimensions:
(i) Compatibility with diverse models and fine-tuning methods; (ii)
Benchmarking downstream performance using two open-source benchmarks; (iii)
Biological interpretation of learned representations with DNABERT-2. These
results highlight its end-to-end usability and practical value for real-world
genomic analysis.
[LINK]
http://arxiv.org/abs/2509.12266v1
[DATE]
2025-09-13 11:31:55+08:00
[CATEGORIES]
cs.LG
GoldenTransformer: A Modular Fault Injection Framework for Transformer Robustness Research
[AUTHORS]
Luke Howard
[ABSTRACT]
Transformers have become the foundation for a wide range of
state-of-the-art models across natural language processing, computer vision,
and other machine learning domains. Despite their widespread deployment, the
robustness of these models under fault conditions remains underexplored. We
present GoldenTransformer, a modular and extensible fault injection framework
designed to evaluate the resiliency of Large Language Models to induced
hardware faults. GoldenTransformer offers a unified Python-based platform for
injecting diverse classes of faults (such as weight corruption, activation
injections, and attention-level disruptions) into pretrained
transformer-based models. Inspired by the GoldenEye simulator for DNNs, our
framework focuses on the unique challenges of working with large transformer
architectures, including considerations such as structural complexity, latent
dependencies, and nonuniform layer definitions. GoldenTransformer is built atop
PyTorch and HuggingFace Transformers, and it supports experiment
reproducibility, metric logging, and visualization out of the box. We detail
the technical design and use of GoldenTransformer and demonstrate it through
several example experiments on classification and generation tasks. By enabling
controlled injection of faults at multiple logical and structural points in a
transformer, GoldenTransformer offers researchers and practitioners a valuable
tool for model robustness analysis and for guiding dependable system design in
real-world LLM applications.
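Illustrative sketch (this is not GoldenTransformer's API): one common way to emulate a hardware fault in a weight is to flip a single bit of its IEEE 754 float32 encoding. The function names below are hypothetical, and the example works on a plain Python list rather than a PyTorch tensor.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of the float32 encoding of `value`.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range 0..31")
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (corrupted,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return corrupted


def inject_weight_fault(weights, index, bit):
    """Return a copy of `weights` with one bit flipped in one entry."""
    corrupted = list(weights)
    corrupted[index] = flip_bit(corrupted[index], bit)
    return corrupted


weights = [1.0, 0.5, -0.25]
sign_fault = inject_weight_fault(weights, index=0, bit=31)      # 1.0 becomes -1.0
exponent_fault = inject_weight_fault(weights, index=0, bit=30)  # 1.0 becomes inf
print(sign_fault, exponent_fault)
```

The two faults show why bit position matters: a flipped sign bit keeps the magnitude, while a flipped high exponent bit can turn an ordinary weight into infinity.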
[COMMENTS]
4 Pages
[LINK]
http://arxiv.org/abs/2509.10790v1
[DATE]
2025-09-13 10:52:08+08:00
[CATEGORIES]
cs.LG
Non-Linear Model-Based Sequential Decision-Making in Agriculture
[AUTHORS]
Sakshi Arya, Wentao Lin
[ABSTRACT]
Sequential decision-making is central to sustainable agricultural management
and precision agriculture, where resource inputs must be optimized under
uncertainty and over time. However, such decisions must often be made with
limited observations, whereas classical bandit and reinforcement learning
approaches typically rely on either linear or black-box reward models that may
misrepresent domain knowledge or require large amounts of data. We propose a
family of nonlinear, model-based bandit algorithms that embed
domain-specific response curves directly into the exploration-exploitation
loop. By coupling (i) principled uncertainty quantification with (ii)
closed-form or rapidly computable profit optima, these algorithms achieve
sublinear regret and near-optimal sample complexity while preserving
interpretability. Theoretical analysis establishes regret and sample complexity
bounds, and extensive simulations emulating real-world fertilizer-rate
decisions show consistent improvements over both linear and nonparametric
baselines (such as linear UCB and k-NN UCB) in the low-sample regime, under
both well-specified and shape-compatible misspecified models. Because our
approach leverages mechanistic insight rather than large data volumes, it is
especially suited to resource-constrained settings, supporting sustainable,
inclusive, and transparent sequential decision-making across agriculture,
environmental management, and allied applications.
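Illustrative sketch (not from the paper): a closed-form profit optimum is easy to see for a quadratic response curve. The quadratic form yield(x) = a + b*x - c*x^2, the prices, and all parameter values are assumptions chosen for illustration; the abstract does not state which response curves are used. Setting the derivative of price * yield(x) - cost * x to zero gives x* = (price * b - cost) / (2 * price * c).

```python
def profit(rate, a, b, c, price, cost):
    # Revenue from the (assumed) quadratic yield response minus input cost.
    crop_yield = a + b * rate - c * rate * rate
    return price * crop_yield - cost * rate


def optimal_rate(b, c, price, cost):
    """Closed-form maximizer of profit for the quadratic response curve."""
    if c <= 0 or price <= 0:
        raise ValueError("c and price must be positive")
    # Clip at zero: a negative application rate is not meaningful.
    return max(0.0, (price * b - cost) / (2.0 * price * c))


# Hypothetical parameter estimates and prices.
a, b, c = 2.0, 0.04, 0.0001
price, cost = 200.0, 1.0

best_rate = optimal_rate(b, c, price, cost)
best_profit = profit(best_rate, a, b, c, price, cost)
print(f"optimal rate: {best_rate:.1f}, profit at optimum: {best_profit:.2f}")
```

In a bandit loop, the curve parameters would be re-estimated from observed yields each round, and an uncertainty-aware version of this optimum would choose the next rate.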
[LINK]
http://arxiv.org/abs/2509.01924v2
[DATE]
2025-09-13 10:37:08+08:00
[CATEGORIES]
cs.LG
Contextual Budget Bandit for Food Rescue Volunteer Engagement
[AUTHORS]
Ariana Tang, Naveen Raman, Fei Fang, Zheyuan Ryan Shi
[ABSTRACT]
Volunteer-based food rescue platforms tackle food waste by matching surplus
food to communities in need. These platforms face the dual problem of
maintaining volunteer engagement and maximizing the food rescued. Existing
algorithms to improve volunteer engagement exacerbate geographical disparities,
leaving some communities systematically disadvantaged. We address this issue by
proposing Contextual Budget Bandit. Contextual Budget Bandit incorporates
context-dependent budget allocation in restless multi-armed bandits, a model of
decision-making which allows for stateful arms. By doing so, we can allocate
higher budgets to communities with lower match rates, thereby alleviating
geographical disparities. To tackle this problem, we develop an empirically
fast heuristic algorithm. Because the heuristic algorithm can achieve a poor
approximation when active volunteers are scarce, we design the Mitosis
algorithm, which is guaranteed to compute the optimal budget allocation.
Empirically, we demonstrate that our algorithms outperform baselines on both
synthetic and real-world food rescue datasets, and show how our algorithm
achieves geographical fairness in food rescue.
[LINK]
http://arxiv.org/abs/2509.10777v1
[DATE]
2025-09-13 09:49:00+08:00
[CATEGORIES]
cs.LG
Parameter estimation with uncertainty quantification from continuous measurement data using neural network ensembles
[AUTHORS]
Amanuel Anteneh
[ABSTRACT]
We show that ensembles of deep neural networks, called deep ensembles, can be
used to perform quantum parameter estimation while also providing a means for
quantifying uncertainty in parameter estimates, which is a key advantage of
using Bayesian inference for parameter estimation. These models are shown to be
more robust to noise in the measurement results used to perform the parameter
estimation as well as noise in the data used to train them. We also show that
much less data is needed to achieve comparable performance to Bayesian
inference based estimation, which is known to reach the ultimate precision
limit as more data is collected, than was used in previous proposals.
[LINK]
http://arxiv.org/abs/2509.10756v1
[DATE]
2025-09-13 07:58:44+08:00
[CATEGORIES]
cs.LG
HalluField: Detecting LLM Hallucinations via Field-Theoretic Modeling
[AUTHORS]
Minh Vu, Brian K. Tran, Syed A. Shah, Geigh Zollicoffer, Nhat Hoang-Xuan, Manish Bhattarai
[ABSTRACT]
Large Language Models (LLMs) exhibit impressive reasoning and
question-answering capabilities. However, they often produce inaccurate or
unreliable content known as hallucinations. This unreliability significantly
limits their deployment in high-stakes applications. Thus, there is a growing
need for a general-purpose method to detect hallucinations in LLMs. In this
work, we introduce HalluField, a novel field-theoretic approach for
hallucination detection based on a parametrized variational principle and
thermodynamics. Inspired by thermodynamics, HalluField models an LLM’s response
to a given query and temperature setting as a collection of discrete likelihood
token paths, each associated with a corresponding energy and entropy. By
analyzing how energy and entropy distributions vary across token paths under
changes in temperature and likelihood, HalluField quantifies the semantic
stability of a response. Hallucinations are then detected by identifying
unstable or erratic behavior in this energy landscape. HalluField is
computationally efficient and highly practical: it operates directly on the
model’s output logits without requiring fine-tuning or auxiliary neural
networks. Notably, the method is grounded in a principled physical
interpretation, drawing analogies to the first law of thermodynamics.
Remarkably, by modeling LLM behavior through this physical lens, HalluField
achieves state-of-the-art hallucination detection performance across models and
datasets.
[LINK]
http://arxiv.org/abs/2509.10753v1
[DATE]
2025-09-13 07:49:52+08:00
[CATEGORIES]
cs.LG
Testing classical properties from quantum data
[AUTHORS]
Matthias C. Caro, Preksha Naik, Joseph Slote
[ABSTRACT]
Properties of Boolean functions can often be tested much faster than the
functions can be learned. However, this advantage usually disappears when
testers are limited to random samples of a function $f$–a natural setting for
data science–rather than queries. In this work we initiate the study of a
quantum version of this “data science scenario”: quantum algorithms that test
properties of $f$ solely from quantum data in the form of copies of the
function state $|f\rangle \propto \sum_x|x,f(x)\rangle$.
$\bullet$ New tests. For three well-established properties–monotonicity,
symmetry, and triangle-freeness–we show that the speedup lost when restricting
classical testers to sampled data can be recovered by quantum algorithms
operating solely from quantum data.
$\bullet$ Inadequacy of Fourier sampling. Our new testers use techniques
beyond quantum Fourier sampling, and we show that this is necessary. In
particular, there is no constant-complexity tester for symmetry relying solely
on Fourier sampling and random classical samples.
$\bullet$ Classical queries vs. quantum data. We exhibit a testing problem
that can be solved from $O(1)$ classical queries but that requires
$\Omega(2^{n/2})$ function state copies. The Forrelation problem provides a
separation of the same magnitude in the opposite direction, so we conclude that
quantum data and classical queries are “maximally incomparable” resources for
testing.
$\bullet$ Towards lower bounds. We also begin the study of lower bounds for
testing from quantum data. For quantum monotonicity testing, we prove that the
ensembles of Goldreich et al. (2000) and Black (2023), which give exponential
lower bounds for classical sample-based testing, do not yield any nontrivial
lower bounds for testing from quantum data. New insights specific to quantum
data will be required for proving copy complexity lower bounds for testing in
this model.
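As a reading aid (not taken from the paper), the function state above can be written with its normalization made explicit. This assumes $f:\{0,1\}^n \to \{0,1\}$, with $n$ the input length that appears in the $\Omega(2^{n/2})$ bound:

```latex
\[
  |f\rangle \;=\; \frac{1}{\sqrt{2^{n}}} \sum_{x \in \{0,1\}^{n}} |x\rangle\,|f(x)\rangle ,
  \qquad
  \langle f | f \rangle \;=\; \frac{1}{2^{n}} \sum_{x,y} \langle x | y \rangle\,
  \langle f(x) | f(y) \rangle \;=\; \frac{1}{2^{n}} \sum_{x} 1 \;=\; 1 .
\]
% Measuring |f> in the computational basis returns (x, f(x)) with x uniform
% over {0,1}^n, so one copy of |f> is at least as useful as one random sample.
```

This is why testing from function state copies generalizes the classical sample-based setting: discarding the quantum structure by measuring recovers exactly one uniformly random labeled example per copy.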
[COMMENTS]
34 + 2 pages, 2 tables, 1 figure
[LINK]
http://arxiv.org/abs/2411.12730v3
[DATE]
2025-09-13 07:47:46+08:00
[CATEGORIES]
cs.LG
Coordinated Reinforcement Learning Prefetching Architecture for Multicore Systems
[AUTHORS]
Mohammed Humaid Siddiqui, Fernando Guzman, Yufei Wu, Ruishu Ann
[ABSTRACT]
Hardware prefetching is critical to fill the performance gap between CPU
speeds and slower memory accesses. With multicore architectures becoming
commonplace, traditional prefetchers are severely challenged. Independent core
operation creates significant redundancy (up to 20% of prefetch requests are
duplicates), causing unnecessary memory bus traffic and wasted bandwidth.
Furthermore, cutting-edge prefetchers such as Pythia suffer from about a 10%
performance loss when scaling from a single-core to a four-core system. To
solve these problems, we propose CRL-Pythia, a coordinated reinforcement
learning based prefetcher specifically designed for multicore systems. In this
work, CRL-Pythia addresses these issues by enabling cross-core sharing of
information and cooperative prefetching decisions, which greatly reduces
redundant prefetch requests and improves learning convergence across cores. Our
experiments demonstrate that CRL-Pythia outperforms single Pythia
configurations in all cases, with approximately 12% IPC (instructions per
cycle) improvement for bandwidth-constrained workloads, while imposing moderate
hardware overhead. Our sensitivity analyses also verify its robustness and
scalability, thereby making CRL-Pythia a practical and efficient solution to
contemporary multicore systems.
[COMMENTS]
47 pages, 12 figures, technical report prepared at Fairleigh
Dickinson University
[LINK]
http://arxiv.org/abs/2509.10719v1
[DATE]
2025-09-13 06:20:33+08:00
[CATEGORIES]
cs.LG
MinatoLoader: Accelerating Machine Learning Training Through Efficient Data Preprocessing
[AUTHORS]
Rahma Nouaji, Stella Bitchebe, Ricardo Macedo, Oana Balmau
[ABSTRACT]
Data loaders are used by Machine Learning (ML) frameworks like PyTorch and
TensorFlow to apply transformations to data before feeding it into the
accelerator. This operation is called data preprocessing. Data preprocessing
plays an important role in the ML training workflow because if it is
inefficiently pipelined with the training, it can yield high GPU idleness,
resulting in significant training delays. Unfortunately, existing data loaders
turn out to waste GPU resources, with $76\%$ GPU idleness when using the
PyTorch data loader, for example. One key source of inefficiency is the
variability in preprocessing time across samples within the same dataset.
Existing data loaders are oblivious to this variability, and they construct
batches without any consideration of slow or fast samples. In this case, the
entire batch is delayed by a single slow sample, stalling the training pipeline
and resulting in head-of-line blocking.
To address these inefficiencies, we present MinatoLoader, a general-purpose
data loader for PyTorch that accelerates training and improves GPU utilization.
MinatoLoader is designed for a single-server setup containing multiple GPUs.
It continuously prepares data in the background and actively constructs batches
by prioritizing fast-to-preprocess samples, while slower samples are processed
in parallel.
We evaluate MinatoLoader on servers with V100 and A100 GPUs. On a machine
with four A100 GPUs, MinatoLoader improves the training time of a wide range of
workloads by up to $7.5\times$ ($3.6\times$ on average) over PyTorch DataLoader
and Pecan, and up to $3\times$ ($2.2\times$ on average) over DALI. It also
increases average GPU utilization from 46.4% with PyTorch to 90.45%, while
preserving model accuracy and enabling faster convergence.
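The head-of-line blocking described above can be illustrated with a toy model. This is a minimal sketch, not MinatoLoader's implementation: it assumes every sample is preprocessed in parallel starting at time 0, that a batch is ready when its slowest sample finishes, and that batches are delivered in list order. The function names and the timing values are hypothetical.

```python
def in_order_ready_times(prep_times, batch_size):
    """Batches follow dataset order; each is ready when its slowest sample is."""
    return [max(prep_times[i:i + batch_size])
            for i in range(0, len(prep_times), batch_size)]


def ready_first_ready_times(prep_times, batch_size):
    """Batches are filled with whichever samples finish preprocessing first."""
    done = sorted(prep_times)  # completion order under the parallel assumption
    return [max(done[i:i + batch_size])
            for i in range(0, len(done), batch_size)]


def gpu_idle(ready_times, step_time):
    """Total time the accelerator waits when batches are consumed in list order."""
    clock = 0
    idle = 0
    for ready in ready_times:
        if ready > clock:          # batch not ready yet: the GPU stalls
            idle += ready - clock
            clock = ready
        clock += step_time         # one training step on this batch
    return idle


# Sixteen samples, one of which is slow (9 time units instead of 1).
prep_times = [1, 1, 9, 1] + [1] * 12
idle_in_order = gpu_idle(in_order_ready_times(prep_times, 4), step_time=2)
idle_ready_first = gpu_idle(ready_first_ready_times(prep_times, 4), step_time=2)
```

In this toy setting the single slow sample stalls the very first in-order batch, whereas filling batches from already-finished samples pushes the slow sample into the last batch, where most of its latency overlaps with training.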
[COMMENTS]
Paper accepted at EuroSys 2026 (will be updated after the
camera-ready)
[LINK]
http://arxiv.org/abs/2509.10712v1
[DATE]
2025-09-13 06:06:57+08:00
[CATEGORIES]
cs.LG
Simulation-Based Sensitivity Analysis in Optimal Treatment Regimes and Causal Decomposition with Individualized Interventions
[AUTHORS]
Soojin Park, Suyeon Kang, Chioun Lee
[ABSTRACT]
Causal decomposition analysis aims to assess the effect of modifying risk
factors on reducing social disparities in outcomes. Recently, this analysis has
incorporated individual characteristics when modifying risk factors by
utilizing optimal treatment regimes (OTRs). Since the newly defined
individualized effects rely on the no omitted confounding assumption,
developing sensitivity analyses to account for potential omitted confounding is
essential. Moreover, OTRs and individualized effects are primarily based on
binary risk factors, and no formal approach currently exists to benchmark the
strength of omitted confounding using observed covariates for binary risk
factors. To address this gap, we extend a simulation-based sensitivity analysis
that simulates unmeasured confounders, addressing two sources of bias emerging
from deriving OTRs and estimating individualized effects. Additionally, we
propose a formal bounding strategy that benchmarks the strength of omitted
confounding for binary risk factors. Using the High School Longitudinal Study
2009 (HSLS:09), we demonstrate this sensitivity analysis and benchmarking
method.
[COMMENTS]
42 pages
[LINK]
http://arxiv.org/abs/2506.19010v2
[DATE]
2025-09-13 05:50:57+08:00
[CATEGORIES]
cs.LG
DOSA: Differentiable Model-Based One-Loop Search for DNN Accelerators
[AUTHORS]
Charles Hong, Qijing Huang, Grace Dinh, Mahesh Subedar, Yakun Sophia Shao
[ABSTRACT]
In the hardware design space exploration process, it is critical to optimize
both hardware parameters and algorithm-to-hardware mappings. Previous work has
largely approached this simultaneous optimization problem by exploring the
hardware design space and the mapspace, both individually large and highly
nonconvex, independently of each other. The resulting combinatorial
explosion has created significant difficulties for optimizers.
In this paper, we introduce DOSA, which consists of differentiable
performance models and a gradient descent-based optimization technique to
simultaneously explore both spaces and identify high-performing design points.
Experimental results demonstrate that DOSA outperforms random search and
Bayesian optimization by 2.80x and 12.59x, respectively, in improving DNN model
energy-delay product, given a similar number of samples. We also demonstrate
the modularity and flexibility of DOSA by augmenting our analytical model with
a learned model, allowing us to optimize buffer sizes and mappings of a real
DNN accelerator and attain a 1.82x improvement in energy-delay product.
[COMMENTS]
Published at MICRO 2023
[LINK]
http://arxiv.org/abs/2509.10702v1
[DATE]
2025-09-13 05:38:50+08:00
[CATEGORIES]
cs.LG
IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs
[AUTHORS]
Aosong Feng, Zhichao Xu, Xian Wu, Kang Zhou, Sheng Guan, Yueyan Chen, Ninad Kulkarni, Yun Zhou, Balasubramaniam Srinivasan, Haibo Ding, Lin Lee Cheong
[ABSTRACT]
Routing incoming queries to the most cost-effective LLM while maintaining
response quality poses a fundamental challenge in optimizing performance-cost
trade-offs for large-scale commercial systems. We present IPR, a
quality-constrained Intelligent Prompt Routing framework that dynamically
selects optimal models based on predicted response quality and user-specified
tolerance levels. IPR introduces three key innovations: (1) a modular
architecture with lightweight quality estimators trained on 1.5M prompts
annotated with calibrated quality scores, enabling fine-grained quality
prediction across model families; (2) a user-controlled routing mechanism with
tolerance parameter $\tau \in [0,1]$ that provides explicit control over
quality-cost trade-offs; and (3) an extensible design using frozen encoders
with model-specific adapters, reducing new model integration from days to
hours. To rigorously train and evaluate IPR, we curate an industrial-level
dataset IPRBench (to be released upon legal approval), a
comprehensive benchmark containing 1.5 million examples with response quality
annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR
achieves 43.9% cost reduction while maintaining quality parity with the
strongest model in the Claude family and processes requests with sub-150ms
latency.
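The abstract does not state the routing rule itself, so the following is only one plausible reading of a tolerance-controlled router, offered as a sketch. It assumes nonnegative predicted quality scores and the hypothetical rule "pick the cheapest model whose predicted quality is at least $(1-\tau)$ times the best predicted quality". The function name, model names, scores, and costs are all illustrative.

```python
def route(predicted_quality, cost, tau):
    """Return the cheapest model within a tolerance tau of the best predicted quality.

    Hypothetical rule, not taken from the paper: a model is eligible when
    its predicted quality >= (1 - tau) * best predicted quality.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    best = max(predicted_quality.values())
    threshold = (1.0 - tau) * best
    eligible = [m for m, q in predicted_quality.items() if q >= threshold]
    return min(eligible, key=lambda m: cost[m])


# Illustrative per-prompt quality predictions and per-request costs.
quality = {"large": 0.90, "medium": 0.84, "small": 0.70}
cost = {"large": 10.0, "medium": 3.0, "small": 1.0}

strict = route(quality, cost, tau=0.0)    # only the top-quality model qualifies
balanced = route(quality, cost, tau=0.1)  # allows a 10% relative quality drop
frugal = route(quality, cost, tau=0.3)    # allows a 30% relative quality drop
```

Under this reading, $\tau = 0$ always selects a top-quality model and larger $\tau$ trades predicted quality for cost monotonically.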
[COMMENTS]
The submission was made without the full consent of all listed
authors. We are withdrawing until authorship is resolved
[LINK]
http://arxiv.org/abs/2509.06274v2
[DATE]
2025-09-13 05:29:53+08:00
[CATEGORIES]
cs.LG
CrunchLLM: Multitask LLMs for Structured Business Reasoning and Outcome Prediction
[AUTHORS]
Rabeya Tus Sadia, Qiang Cheng
[ABSTRACT]
Predicting the success of start-up companies, defined as achieving an exit
through acquisition or IPO, is a critical problem in entrepreneurship and
innovation research. Datasets such as Crunchbase provide both structured
information (e.g., funding rounds, industries, investor networks) and
unstructured text (e.g., company descriptions), but effectively leveraging this
heterogeneous data for prediction remains challenging. Traditional machine
learning approaches often rely only on structured features and achieve moderate
accuracy, while large language models (LLMs) offer rich reasoning abilities but
struggle to adapt directly to domain-specific business data. We present
CrunchLLM, a domain-adapted LLM framework for startup success
prediction. CrunchLLM integrates structured company attributes with
unstructured textual narratives and applies parameter-efficient fine-tuning
strategies alongside prompt optimization to specialize foundation models for
entrepreneurship data. Our approach achieves accuracy exceeding 80% on
Crunchbase startup success prediction, significantly outperforming traditional
classifiers and baseline LLMs. Beyond predictive performance, CrunchLLM
provides interpretable reasoning traces that justify its predictions, enhancing
transparency and trustworthiness for financial and policy decision makers. This
work demonstrates how adapting LLMs with domain-aware fine-tuning and
structured–unstructured data fusion can advance predictive modeling of
entrepreneurial outcomes. CrunchLLM contributes a methodological framework and
a practical tool for data-driven decision making in venture capital and
innovation policy.
[LINK]
http://arxiv.org/abs/2509.10698v1
[DATE]
2025-09-13 05:26:11+08:00
[CATEGORIES]
cs.LG
Kalman Bayesian Transformer
[AUTHORS]
Haoming Jing, Oren Wright, José M. F. Moura, Yorie Nakahira
[ABSTRACT]
Sequential fine-tuning of transformers is useful when new data arrive
sequentially, especially with shifting distributions. Unlike batch learning,
sequential learning demands that training be stabilized despite a small amount
of data by balancing new information and previously learned knowledge in the
pre-trained models. This challenge is further complicated when training is to
be completed in latency-critical environments and learning must additionally
quantify and be mediated by uncertainty. Motivated by these challenges, we
propose a novel method that frames sequential fine-tuning as a posterior
inference problem within a Bayesian framework. Our approach integrates
closed-form moment propagation of random variables, Kalman Bayesian Neural
Networks, and Taylor approximations of the moments of softmax functions. By
explicitly accounting for pre-trained models as priors and adaptively balancing
them against new information based on quantified uncertainty, our method
achieves robust and data-efficient sequential learning. The effectiveness of
our method is demonstrated through numerical simulations involving sequential
adaptation of a decision transformer to tasks characterized by distribution
shifts and limited memory resources.
[COMMENTS]
Accepted to the 64th IEEE Conference on Decision and Control (CDC
2025)
[LINK]
http://arxiv.org/abs/2509.10695v1
[DATE]
2025-09-13 05:15:23+08:00
[CATEGORIES]
cs.LG
Learning Concave Bid Shading Strategies in Online Auctions via Measure-valued Proximal Optimization
[AUTHORS]
Iman Nodozi, Djordje Gligorijevic, Abhishek Halder
[ABSTRACT]
This work proposes a bid shading strategy for first-price auctions as a
measure-valued optimization problem. We consider a standard parametric form for
bid shading and formulate the problem as convex optimization over the joint
distribution of shading parameters. After each auction, the shading parameter
distribution is adapted via a regularized Wasserstein-proximal update with a
data-driven energy functional. This energy functional is conditional on the
context, i.e., on publisher/user attributes such as domain, ad slot type,
device, or location. The proposed algorithm encourages the bid distribution to
place more weight on values with higher expected surplus, i.e., where the win
probability and the value gap are both large. We show that the resulting
measure-valued convex optimization problem admits a closed form solution. A
numerical example illustrates the proposed method.
[LINK]
http://arxiv.org/abs/2509.10693v1
[DATE]
2025-09-13 05:11:06+08:00
[CATEGORIES]
cs.LG
Continuum Attention for Neural Operators
[AUTHORS]
Edoardo Calvello, Nikola B. Kovachki, Matthew E. Levine, Andrew M. Stuart
[ABSTRACT]
Transformers, and the attention mechanism in particular, have become
ubiquitous in machine learning. Their success in modeling nonlocal, long-range
correlations has led to their widespread adoption in natural language
processing, computer vision, and time series problems. Neural operators, which
map spaces of functions into spaces of functions, are necessarily both
nonlinear and nonlocal if they are universal; it is thus natural to ask whether
the attention mechanism can be used in the design of neural operators.
Motivated by this, we study transformers in the function space setting. We
formulate attention as a map between infinite dimensional function spaces and
prove that the attention mechanism as implemented in practice is a Monte Carlo
or finite difference approximation of this operator. The function space
formulation allows for the design of transformer neural operators, a class of
architectures designed to learn mappings between function spaces. In this
paper, we state and prove the first universal approximation result for
transformer neural operators, using only a slight modification of the
architecture implemented in practice. The prohibitive cost of applying the
attention operator to functions defined on multi-dimensional domains leads to
the need for more efficient attention-based architectures. For this reason we
also introduce a function space generalization of the patching strategy from
computer vision, and introduce a class of associated neural operators.
Numerical results, on an array of operator learning problems, demonstrate the
promise of our approaches to function space formulations of attention and their
use in neural operators.
[LINK]
http://arxiv.org/abs/2406.06486v3
[DATE]
2025-09-13 04:13:48+08:00
[CATEGORIES]
cs.LG
Multi-Agent Systems Execute Arbitrary Malicious Code
[AUTHORS]
Harold Triedman, Rishi Jha, Vitaly Shmatikov
[ABSTRACT]
Multi-agent systems coordinate LLM-based agents to perform tasks on users’
behalf. In real-world applications, multi-agent systems will inevitably
interact with untrusted inputs, such as malicious Web content, files, email
attachments, and more.
Using several recently proposed multi-agent frameworks as concrete examples,
we demonstrate that adversarial content can hijack control and communication
within the system to invoke unsafe agents and functionalities. This results in
a complete security breach, up to execution of arbitrary malicious code on the
user’s device or exfiltration of sensitive data from the user’s containerized
environment. For example, when agents are instantiated with GPT-4o, Web-based
attacks successfully cause the multi-agent system to execute arbitrary malicious
code in 58-90% of trials (depending on the orchestrator). In some
model-orchestrator configurations, the attack success rate is 100%. We also
demonstrate that these attacks succeed even if individual agents are not
susceptible to direct or indirect prompt injection, and even if they refuse to
perform harmful actions. We hope that these results will motivate development
of trust and security models for multi-agent systems before they are widely
deployed.
[COMMENTS]
33 pages, 5 figures, 7 tables
[LINK]
http://arxiv.org/abs/2503.12188v2
[DATE]
2025-09-13 03:53:17+08:00
[CATEGORIES]
cs.LG
M4GN: Mesh-based Multi-segment Hierarchical Graph Network for Dynamic Simulations
[AUTHORS]
Bo Lei, Victor M. Castillo, Yeping Hu
[ABSTRACT]
Mesh-based graph neural networks (GNNs) have become effective surrogates for
PDE simulations, yet their deep message passing incurs high cost and
over-smoothing on large, long-range meshes; hierarchical GNNs shorten
propagation paths but still face two key obstacles: (i) building coarse graphs
that respect mesh topology, geometry, and physical discontinuities, and (ii)
maintaining fine-scale accuracy without sacrificing the speed gained from
coarsening. We tackle these challenges with M4GN, a three-tier, segment-centric
hierarchical network. M4GN begins with a hybrid segmentation strategy that
pairs a fast graph partitioner with a superpixel-style refinement guided by
modal-decomposition features, producing contiguous segments of dynamically
consistent nodes. These segments are encoded by a permutation-invariant
aggregator, avoiding the order sensitivity and quadratic cost of aggregation
approaches used in prior works. The resulting information bridges a micro-level
GNN, which captures local dynamics, and a macro-level transformer that reasons
efficiently across segments, achieving a principled balance between accuracy
and efficiency. Evaluated on multiple representative benchmark datasets, M4GN
improves prediction accuracy by up to 56% while achieving up to 22% faster
inference than state-of-the-art baselines.
[COMMENTS]
Accepted and published in Transactions on Machine Learning Research
(TMLR), 2025
[LINK]
http://arxiv.org/abs/2509.10659v1
[DATE]
2025-09-13 03:38:38+08:00
[CATEGORIES]
cs.LG
Self-Supervised Goal-Reaching Results in Multi-Agent Cooperation and Exploration
[AUTHORS]
Chirayu Nimonkar, Shlok Shah, Catherine Ji, Benjamin Eysenbach
[ABSTRACT]
For groups of autonomous agents to achieve a particular goal, they must
engage in coordination and long-horizon reasoning. However, designing reward
functions to elicit such behavior is challenging. In this paper, we study how
self-supervised goal-reaching techniques can be leveraged to enable agents to
cooperate. The key idea is that, rather than have agents maximize some scalar
reward, agents aim to maximize the likelihood of visiting a certain goal. This
problem setting enables human users to specify tasks via a single goal state
rather than implementing a complex reward function. While the feedback signal
is quite sparse, we will demonstrate that self-supervised goal-reaching
techniques enable agents to learn from such feedback. On MARL benchmarks, our
proposed method outperforms alternative approaches that have access to the same
sparse reward signal as our method. While our method has no explicit mechanism
for exploration, we observe that self-supervised multi-agent goal-reaching
leads to emergent cooperation and exploration in settings where alternative
approaches never witness a single successful trial.
[COMMENTS]
Project website with videos https://chirayu-n.github.io/gcmarl and
code https://github.com/Chirayu-N/gc-marl are online
[LINK]
http://arxiv.org/abs/2509.10656v1
[DATE]
2025-09-13 03:35:20+08:00
[CATEGORIES]
cs.LG
On a Geometry of Interbrain Networks
[AUTHORS]
Nicolás Hinrichs, Noah Guzmán, Melanie Weber
[ABSTRACT]
Effective analysis in neuroscience benefits significantly from robust
conceptual frameworks. Traditional metrics of interbrain synchrony in social
neuroscience typically depend on fixed, correlation-based approaches,
restricting their explanatory capacity to descriptive observations. Inspired by
the successful integration of geometric insights in network science, we propose
leveraging discrete geometry to examine the dynamic reconfigurations in neural
interactions during social exchanges. Unlike conventional synchrony approaches,
our method interprets inter-brain connectivity changes through the evolving
geometric structures of neural networks. This geometric framework is realized
through a pipeline that identifies critical transitions in network connectivity
using entropy metrics derived from curvature distributions. By doing so, we
significantly enhance the capacity of hyperscanning methodologies to uncover
underlying neural mechanisms in interactive social behavior.
[COMMENTS]
4 pages, 1 figure, submitted to NeurReps workshop 2025
[LINK]
http://arxiv.org/abs/2509.10650v1
[DATE]
2025-09-13 03:26:27+08:00
[CATEGORIES]
cs.LG
FairCoT: Enhancing Fairness in Text-to-Image Generation via Chain of Thought Reasoning with Multimodal Large Language Models
[AUTHORS]
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
[ABSTRACT]
In the domain of text-to-image generative models, biases inherent in training
datasets often propagate into generated content, posing significant ethical
challenges, particularly in socially sensitive contexts. We introduce FairCoT,
a novel framework that enhances fairness in text-to-image models through Chain
of Thought (CoT) reasoning within multimodal generative large language models.
FairCoT employs iterative CoT refinement to systematically mitigate biases, and
dynamically adjusts textual prompts in real time, ensuring diverse and
equitable representation in generated images. By integrating iterative
reasoning processes, FairCoT addresses the limitations of zero-shot CoT in
sensitive scenarios, balancing creativity with ethical responsibility.
Experimental evaluations across popular text-to-image systems, including DALLE
and various Stable Diffusion variants, demonstrate that FairCoT significantly
enhances fairness and diversity without sacrificing image quality or semantic
fidelity. By combining robust reasoning, lightweight deployment, and
extensibility to multiple models, FairCoT represents a promising step toward
more socially responsible and transparent AI-driven content generation.
[COMMENTS]
Accepted at EMNLP 2025
[LINK]
http://arxiv.org/abs/2406.09070v4
[DATE]
2025-09-13 03:20:38+08:00
[CATEGORIES]
cs.LG
Advanced Hybrid Transformer LSTM Technique with Attention and TS Mixer for Drilling Rate of Penetration Prediction
[AUTHORS]
Saddam Hussain Khan
[ABSTRACT]
Accurate prediction of the Rate of Penetration (ROP) is pivotal for drilling
optimization, yet it remains a persistent challenge due to the nonlinear,
dynamic, and heterogeneous nature of drilling data. This study introduces a
novel hybrid deep learning architecture in which input data are first processed
through a customized Long Short-Term Memory (LSTM) network to capture
multi-scale temporal dependencies aligned with drilling operational cycles, and
the resulting features are subsequently refined by an Enhanced Transformer
encoder with drilling-specific positional encodings and real-time optimization.
Concurrently, the same input is directed to a Time-Series Mixer (TS-Mixer)
block that enables efficient cross-feature modeling of static and categorical
attributes such as lithology indices and mud properties. The outputs from the
enhanced Transformer and TS-Mixer are concatenated, after which an adaptive
attention selectively emphasizes the most informative feature representations
for accurate ROP prediction. The proposed framework fuses sequential memory,
static feature interactions, global contextual learning, and dynamic feature
weighting, providing a comprehensive solution to the heterogeneous and
event-driven nature of drilling dynamics. Evaluation on a real-world drilling
dataset demonstrates benchmark-leading performance, achieving an R-squared of
0.9988 and a MAPE of 1.447%, significantly surpassing standalone and hybrid
baselines. Model interpretability is achieved through SHAP and LIME, and
comparisons between actual and predicted curves, along with bias checks,
confirm the accuracy and fairness of the model across various scenarios. This
advanced hybrid approach enables dependable real-time ROP prediction,
supporting the development of intelligent, cost-effective drilling optimization
systems with significant operational benefits.
[COMMENTS]
35 Pages, 19 Figures, 9 Tables
[LINK]
http://arxiv.org/abs/2508.05210v2
[DATE]
2025-09-13 03:14:53+08:00
[CATEGORIES]
cs.LG
Accurate and Private Diagnosis of Rare Genetic Syndromes from Facial Images with Federated Deep Learning
[AUTHORS]
Ali Burak Ünal, Cem Ata Baykara, Peter Krawitz, Mete Akgün
[ABSTRACT]
Machine learning has shown promise in facial dysmorphology, where
characteristic facial features provide diagnostic clues for rare genetic
disorders. GestaltMatcher, a leading framework in this field, has demonstrated
clinical utility across multiple studies, but its reliance on centralized
datasets limits further development, as patient data are siloed across
institutions and subject to strict privacy regulations. We introduce a
federated GestaltMatcher service based on a cross-silo horizontal federated
learning framework, which allows hospitals to collaboratively train a global
ensemble feature extractor without sharing patient images. Patient data are
mapped into a shared latent space, and a privacy-preserving kernel matrix
computation framework enables syndrome inference and discovery while
safeguarding confidentiality. New participants can directly benefit from and
contribute to the system by adopting the global feature extractor and kernel
configuration from previous training rounds. Experiments show that the
federated service retains over 90% of centralized performance and remains
robust to both varying silo numbers and heterogeneous data distributions.
[LINK]
http://arxiv.org/abs/2509.10635v1
[DATE]
2025-09-13 02:42:33+08:00
[CATEGORIES]
cs.LG
Interpretable neural network system identification method for two families of second-order systems based on characteristic curves
[AUTHORS]
Federico J. Gonzalez, Luis P. Lara
[ABSTRACT]
Nonlinear system identification often involves a fundamental trade-off
between interpretability and flexibility, and frequently requires the incorporation of
physical constraints. We propose a unified data-driven framework that combines
the mathematical structure of the governing differential equations with the
flexibility of neural networks (NNs). At the core of our approach is the
concept of characteristic curves (CCs), which represent individual nonlinear
functions (e.g., friction and restoring components) of the system. Each CC is
modeled by a dedicated NN, enabling a modular and interpretable representation
of the system equation. To demonstrate the versatility of the CC-based
formalism, we introduce three identification strategies: (1) SINDy-CC, which
extends the sparse regression approach of SINDy by incorporating the
mathematical structure of the governing equations as constraints; (2) Poly-CC,
which represents each CC using high-degree polynomials; and (3) NN-CC, which
uses NNs without requiring prior assumptions about basis functions. Our results
show that all three approaches are well-suited for systems with simple
polynomial nonlinearities, such as the van der Pol oscillator. In contrast,
NN-CC demonstrates superior performance in modeling systems with complex
nonlinearities and discontinuities, such as those observed in stick-slip
systems. The key contribution of this work is to demonstrate that the CC-based
framework, particularly the NN-CC approach, can capture complex nonlinearities
while maintaining interpretability through the explicit representation of the
CCs. This balance makes it well-suited for modeling systems with
discontinuities and complex nonlinearities that are challenging to assess using
traditional polynomial or sparse regression methods, providing a powerful tool
for nonlinear system identification.
[LINK]
http://arxiv.org/abs/2509.10632v1
[DATE]
2025-09-13 02:32:02+08:00
[CATEGORIES]
cs.LG
Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) for Diabetes Risk Prediction
[AUTHORS]
Kenneth G. Young II
[ABSTRACT]
The Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM) is an
innovative machine learning framework that harnesses quantum-inspired
techniques to predict diabetes risk with exceptional accuracy and efficiency.
Utilizing the PIMA Indians Diabetes dataset augmented with 2,000 synthetic
samples to mitigate class imbalance (total: 2,768 samples, 1,949 positives),
QISICGM integrates a self-improving concept graph with a stacked ensemble
comprising Random Forests (RF), Extra Trees (ET), transformers, convolutional
neural networks (CNNs), and feed-forward neural networks (FFNNs). This approach
achieves an out-of-fold (OOF) F1 score of 0.8933 and an AUC of 0.8699,
outperforming traditional methods. Quantum-inspired elements, such as phase
feature mapping and neighborhood sequence modeling, enrich feature
representations, enabling CPU-efficient inference at 8.5 rows per second. This
paper presents a detailed architecture, theoretical foundations, code insights,
and performance evaluations, including visualizations from the outputs
subfolder. The open-source implementation (v1.0.0) is available at
https://github.com/keninayoung/QISICGM, positioning QISICGM as a potential
benchmark for AI-assisted clinical triage in diabetes and beyond. Ultimately,
this work emphasizes trustworthy AI through calibration, interpretability, and
open-source reproducibility.
[COMMENTS]
13 pages, 3 figures, includes performance tables and visualizations.
Proposes a Quantum-Inspired Stacked Integrated Concept Graph Model (QISICGM)
that integrates phase feature mapping, self-improving concept graphs, and
neighborhood sequence modeling within a stacked ensemble. Demonstrates
improved F1 and AUC on an augmented PIMA Diabetes dataset with efficient CPU
inference
[LINK]
http://arxiv.org/abs/2509.12259v1
[DATE]
2025-09-13 02:26:31+08:00
[CATEGORIES]
cs.LG
Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks
[AUTHORS]
Friedrich Wolf-Monheim
[ABSTRACT]
Next to decision tree and k-nearest neighbours algorithms, deep convolutional
neural networks (CNNs) are widely used to classify audio data in many domains
like music, speech or environmental sounds. To train a specific CNN, various
spectral and rhythm features like mel-scaled spectrograms, mel-frequency
cepstral coefficients (MFCC), cyclic tempograms, short-time Fourier transform
(STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy
normalized statistics (CENS) chromagrams can be used as digital image input
data for the neural network. The performance of these spectral and rhythm
features for audio category level as well as audio class level classification
is investigated in detail with a deep CNN and the ESC-50 dataset with 2,000
labeled environmental audio recordings using an end-to-end deep learning
pipeline. The evaluated metrics accuracy, precision, recall and F1 score for
multiclass classification clearly show that the mel-scaled spectrograms and the
mel-frequency cepstral coefficients (MFCC) perform significantly better than
the other spectral and rhythm features investigated in this research for audio
classification tasks using deep CNNs.
[LINK]
http://arxiv.org/abs/2509.07756v2
[DATE]
2025-09-13 02:16:11+08:00
[CATEGORIES]
cs.LG
Optimal Multimarginal Schrödinger Bridge: Minimum Spanning Tree over Measure-valued Vertices
[AUTHORS]
Georgiy A. Bondar, Abhishek Halder
[ABSTRACT]
The Multimarginal Schrödinger Bridge (MSB) finds the optimal coupling among
a collection of random vectors with known statistics and a known correlation
structure. In the MSB formulation, this correlation structure is specified
a priori as an undirected connected graph with measure-valued vertices.
In this work, we formulate and solve the problem of finding the optimal MSB in
the sense that we seek the optimal coupling over all possible graph structures. We
find that computing the optimal MSB amounts to solving the minimum spanning
tree problem over measure-valued vertices. We show that the resulting problem
can be solved in two steps. The first step constructs a complete graph with
edge weight equal to a sum of the optimal value of the corresponding bimarginal
SB and the entropies of the endpoints. The second step solves a standard
minimum spanning tree problem over that complete weighted graph. Numerical
experiments illustrate the proposed solution.
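
The second step described above is an ordinary minimum spanning tree computation, which can be sketched as follows. This is a minimal illustration using Kruskal's algorithm, not the authors' code: the function name, the `weight` callback, and the toy edge weights are assumptions. In the abstract's setting each weight would come from step one (the bimarginal SB optimal value plus the endpoint entropies); here the weights are made-up numbers.

```python
from itertools import combinations


def minimum_spanning_tree(num_vertices, weight):
    """Kruskal's algorithm on a complete graph.

    `weight(i, j)` returns the weight of edge {i, j} for i < j. It is a
    caller-supplied placeholder for the step-one edge weights.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted(
        (weight(i, j), i, j) for i, j in combinations(range(num_vertices), 2)
    )
    tree = []
    for w, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so no cycle
            parent[root_i] = root_j
            tree.append((i, j, w))
    return tree


# Hypothetical step-one output for four marginals (made-up numbers).
toy_weights = {
    (0, 1): 1.0, (0, 2): 4.0, (0, 3): 3.0,
    (1, 2): 2.0, (1, 3): 5.0, (2, 3): 1.5,
}
tree = minimum_spanning_tree(4, lambda i, j: toy_weights[(i, j)])
total = sum(w for _, _, w in tree)
```

The expensive part in practice would be step one, since it needs one bimarginal SB solve per vertex pair; the spanning tree step itself is cheap by comparison.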
[LINK]
http://arxiv.org/abs/2509.10626v1
[DATE]
2025-09-13 02:15:42+08:00
[CATEGORIES]
cs.LG
Convergence Analysis of Asynchronous Federated Learning with Gradient Compression for Non-Convex Optimization
[AUTHORS]
Diying Yang, Yingwei Hou, Weigang Wu
[ABSTRACT]
In practical federated learning (FL), the large communication overhead
between clients and the server is often a significant bottleneck. Gradient
compression methods can effectively reduce this overhead, while error feedback
(EF) restores model accuracy. Moreover, due to device heterogeneity,
synchronous FL often suffers from stragglers and inefficiency, issues that
asynchronous FL effectively alleviates. However, asynchronous FL settings
inherently face three major challenges (asynchronous delay, data
heterogeneity, and flexible client participation), and the complex interactions among
these system/statistical constraints and compression/EF mechanisms remain
poorly understood theoretically. In this paper, we fill this gap through a
comprehensive convergence study that adequately decouples and unravels these
complex interactions across various FL frameworks. We first consider a basic
asynchronous FL framework AsynFL, and establish an improved convergence
analysis that relies on fewer assumptions and yields a superior convergence
rate than prior studies. We then extend our study to a compressed version,
AsynFLC, and derive sufficient conditions for its convergence, indicating the
nonlinear interaction between asynchronous delay and compression rate. Our
analysis further demonstrates how asynchronous delay and data heterogeneity
jointly exacerbate compression-induced errors, thereby hindering convergence.
Furthermore, we study the convergence of AsynFLC-EF, the framework that further
integrates EF. We prove that EF can effectively reduce the variance of gradient
estimation under the aforementioned challenges, enabling AsynFLC-EF to match
the convergence rate of AsynFL. We also show that the impact of asynchronous
delay and flexible participation on EF is limited to slowing down the
higher-order convergence term. Experimental results substantiate our analytical
findings very well.
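
The combination of gradient compression and error feedback discussed above can be sketched on the client side as follows. This is a minimal illustration under stated assumptions: the abstract does not name a compressor, so top-k sparsification is used here as a stand-in, and the class and function names are hypothetical. Asynchrony and server-side aggregation are not modelled.

```python
def top_k(vector, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    order = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    kept = set(order[:k])
    return [v if i in kept else 0.0 for i, v in enumerate(vector)]


class ErrorFeedbackCompressor:
    """Client-side state: the residual that compression has dropped so far."""

    def __init__(self, dim, k):
        self.k = k
        self.residual = [0.0] * dim

    def compress(self, gradient):
        # Add back what earlier rounds failed to transmit.
        corrected = [g + r for g, r in zip(gradient, self.residual)]
        message = top_k(corrected, self.k)
        # Remember what this round's compression dropped.
        self.residual = [c - m for c, m in zip(corrected, message)]
        return message


compressor = ErrorFeedbackCompressor(dim=3, k=1)
first = compressor.compress([1.0, 0.5, 0.25])
second = compressor.compress([0.0, 0.5, 0.0])
```

In the second call the coordinate that was dropped in round one is transmitted once its accumulated value becomes the largest, which is the mechanism by which error feedback recovers information lost to compression.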
[LINK]
http://arxiv.org/abs/2504.19903v3
[DATE]
2025-09-13 02:13:51+08:00
[CATEGORIES]
cs.LG
AI-in-the-Loop: Privacy Preserving Real-Time Scam Detection and Conversational Scambaiting by Leveraging LLMs and Federated Learning
[AUTHORS]
Ismail Hossain, Sai Puppala, Sajedul Talukder, Md Jahangir Alam
[ABSTRACT]
Scams exploiting real-time social engineering – such as phishing,
impersonation, and phone fraud – remain a persistent and evolving threat
across digital platforms. Existing defenses are largely reactive, offering
limited protection during active interactions. We propose a privacy-preserving,
AI-in-the-loop framework that proactively detects and disrupts scam
conversations in real time. The system combines instruction-tuned artificial
intelligence with a safety-aware utility function that balances engagement with
harm minimization, and employs federated learning to enable continual model
updates without raw data sharing. Experimental evaluations show that the system
produces fluent and engaging responses (perplexity as low as 22.3, engagement
$\approx$0.80), while human studies confirm significant gains in realism,
safety, and effectiveness over strong baselines. In federated settings, models
trained with FedAvg sustain up to 30 rounds while preserving high engagement
($\approx$0.80), strong relevance ($\approx$0.74), and low PII leakage
($\leq$0.0085). Even with differential privacy, novelty and safety remain
stable, indicating that robust privacy can be achieved without sacrificing
performance. The evaluation of guard models (LlamaGuard, LlamaGuard2/3,
MD-Judge) shows a straightforward pattern: stricter moderation settings reduce
the chance of exposing personal information, but they also limit how much the
model engages in conversation. In contrast, more relaxed settings allow longer
and richer interactions, which improve scam detection, but at the cost of
higher privacy risk. To our knowledge, this is the first framework to unify
real-time scam-baiting, federated privacy preservation, and calibrated safety
moderation into a proactive defense paradigm.
[COMMENTS]
This paper was accepted at the 26th Privacy Enhancing Technologies
Symposium (PETS 2026). We uploaded it to arXiv as a preprint
[LINK]
http://arxiv.org/abs/2509.05362v2
[DATE]
2025-09-13 02:06:28+08:00
[CATEGORIES]
cs.LG
Building a General SimCLR Self-Supervised Foundation Model Across Neurological Diseases to Advance 3D Brain MRI Diagnoses
[AUTHORS]
Emily Kaczmarek, Justin Szeto, Brennan Nichyporuk, Tal Arbel
[ABSTRACT]
3D structural Magnetic Resonance Imaging (MRI) brain scans are commonly
acquired in clinical settings to monitor a wide range of neurological
conditions, including neurodegenerative disorders and stroke. While deep
learning models have shown promising results analyzing 3D MRI across a number
of brain imaging tasks, most are highly tailored for specific tasks with
limited labeled data, and are not able to generalize across tasks and/or
populations. The development of self-supervised learning (SSL) has enabled the
creation of large medical foundation models that leverage diverse, unlabeled
datasets ranging from healthy to diseased data, showing significant success in
2D medical imaging applications. However, even the very few foundation models
for 3D brain MRI that have been developed remain limited in resolution, scope,
or accessibility. In this work, we present a general, high-resolution
SimCLR-based SSL foundation model for 3D brain structural MRI, pre-trained on
18,759 patients (44,958 scans) from 11 publicly available datasets spanning
diverse neurological diseases. We compare our model to Masked Autoencoders
(MAE), as well as two supervised baselines, on four diverse downstream
prediction tasks in both in-distribution and out-of-distribution settings. Our
fine-tuned SimCLR model outperforms all other models across all tasks. Notably,
our model still achieves superior performance when fine-tuned using only 20% of
labeled training samples for predicting Alzheimer’s disease. We use publicly
available code and data, and release our trained model at
https://github.com/emilykaczmarek/3D-Neuro-SimCLR, contributing a broadly
applicable and accessible foundation model for clinical brain MRI analysis.
[COMMENTS]
Accepted to ICCV 2025 Workshop CVAMD
[LINK]
http://arxiv.org/abs/2509.10620v1
[DATE]
2025-09-13 02:05:08+08:00
[CATEGORIES]
cs.LG
pySigLib – Fast Signature-Based Computations on CPU and GPU
[AUTHORS]
Daniil Shmelev, Cristopher Salvi
[ABSTRACT]
Signature-based methods have recently gained significant traction in machine
learning for sequential data. In particular, signature kernels have emerged as
powerful discriminators and training losses for generative models on
time-series, notably in quantitative finance. However, existing implementations
do not scale to the dataset sizes and sequence lengths encountered in practice.
We present pySigLib, a high-performance Python library offering optimised
implementations of signatures and signature kernels on CPU and GPU, fully
compatible with PyTorch’s automatic differentiation. Beyond an efficient
software stack for large-scale signature-based computation, we introduce a
novel differentiation scheme for signature kernels that delivers accurate
gradients at a fraction of the runtime of existing libraries.
[LINK]
http://arxiv.org/abs/2509.10613v1
[DATE]
2025-09-13 02:00:14+08:00
[CATEGORIES]
cs.LG
Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration
[AUTHORS]
Ahmed Khaled, Satyen Kale, Arthur Douillard, Chi Jin, Rob Fergus, Manzil Zaheer
[ABSTRACT]
Modern machine learning often requires training with large batch sizes,
distributed data, and massively parallel compute hardware (like mobile and
other edge devices or distributed data centers). Communication becomes a major
bottleneck in such settings but methods like Local Stochastic Gradient Descent
(Local SGD) show great promise in reducing this additional communication
overhead. Local SGD consists of three parts: a local optimization process, an
aggregation mechanism, and an outer optimizer that uses the aggregated updates
from the nodes to produce a new model. While there exists an extensive
literature on understanding the impact of hyperparameters in the local
optimization process, the choice of outer optimizer and its hyperparameters is
less clear. We study the role of the outer optimizer in Local SGD, and prove
new convergence guarantees for the algorithm. In particular, we show that
tuning the outer learning rate allows us to (a) trade off between optimization
error and stochastic gradient noise variance, and (b) make up for ill-tuning of
the inner learning rate. Our theory suggests that the outer learning rate
should sometimes be set to values greater than $1$. We extend our results to
settings where we use momentum in the outer optimizer, and we show a similar
role for the momentum-adjusted outer learning rate. We also study acceleration
in the outer optimizer and show that it improves the convergence rate as a
function of the number of communication rounds, improving upon the convergence
rate of prior algorithms that apply acceleration locally. Finally, we also
introduce a novel data-dependent analysis of Local SGD that yields further
insights on outer learning rate tuning. We conduct comprehensive experiments
with standard language models and various outer optimizers to validate our
theory.
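
The three parts of Local SGD named above (local optimization, aggregation, outer optimizer) can be sketched on a toy problem. This is an illustration only, assuming two workers that minimize one-dimensional quadratics with exact gradients and a plain SGD outer step; all names and constants are hypothetical and the example does not reproduce the paper's analysis or experiments.

```python
def local_sgd_round(x, targets, inner_lr, local_steps, outer_lr):
    """One communication round of Local SGD on 1-D quadratics.

    Worker m minimizes 0.5 * (x - targets[m]) ** 2 with exact gradients,
    a toy stand-in for stochastic gradients on real data.
    """
    deltas = []
    for target in targets:
        local_x = x
        for _ in range(local_steps):
            local_x -= inner_lr * (local_x - target)  # local gradient step
        deltas.append(local_x - x)  # the change this worker proposes
    average_delta = sum(deltas) / len(deltas)  # aggregation
    return x + outer_lr * average_delta  # outer optimizer: plain SGD step


def run(outer_lr, rounds=5):
    x = 0.0
    for _ in range(rounds):
        x = local_sgd_round(x, targets=[1.0, 3.0], inner_lr=0.1,
                            local_steps=2, outer_lr=outer_lr)
    return x


plain_averaging = run(outer_lr=1.0)    # outer step of 1 is ordinary averaging
larger_outer_step = run(outer_lr=2.0)  # outer learning rate greater than 1
```

With these deliberately small inner steps, the run with outer learning rate 2 ends closer to the minimizer of the averaged objective (x = 2) than plain averaging does, which is one simple instance of an outer step compensating for a poorly tuned inner learning rate.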
[LINK]
http://arxiv.org/abs/2509.10439v1
[DATE]
2025-09-13 01:47:58+08:00
[CATEGORIES]
cs.LG
Mutual Information Tracks Policy Coherence in Reinforcement Learning
[AUTHORS]
Cameron Reid, Wael Hafez, Amirhossein Nazeri
[ABSTRACT]
Reinforcement Learning (RL) agents deployed in real-world environments face
degradation from sensor faults, actuator wear, and environmental shifts, yet
lack intrinsic mechanisms to detect and diagnose these failures. We present an
information-theoretic framework that both reveals the fundamental dynamics of
RL and provides practical methods for diagnosing deployment-time anomalies.
Through analysis of state-action mutual information patterns in a robotic
control task, we first demonstrate that successful learning exhibits
characteristic information signatures: mutual information between states and
actions steadily increases from 0.84 to 2.83 bits (238% growth) despite growing
state entropy, indicating that agents develop increasingly selective attention
to task-relevant patterns. Intriguingly, the joint mutual information between
state-action pairs and next states, MI(S,A;S’), follows an inverted U-curve,
peaking during early learning before declining as the agent specializes,
suggesting a
transition from broad exploration to efficient exploitation. More immediately
actionable, we show that information metrics can differentially diagnose system
failures: noise in the observation space, i.e., on states (sensor faults), produces broad
collapses across all information channels with pronounced drops in state-action
coupling, while action-space noise (actuator faults) selectively disrupts
action-outcome predictability while preserving state-action relationships. This
differential diagnostic capability, demonstrated through controlled perturbation
experiments, enables precise fault localization without architectural
modifications or performance degradation. By establishing information patterns
as both signatures of learning and diagnostics for system health, we provide the
foundation for adaptive RL systems capable of autonomous fault detection and
policy adjustment based on information-theoretic principles.
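
The state-action mutual information tracked above can be estimated from logged rollouts. The sketch below is a plug-in (empirical frequency) estimator and assumes states and actions are already discrete; a continuous control task would first need binning or another estimator, and the abstract does not say which one the authors use. The function name and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information_bits(pairs):
    """Plug-in estimate of I(S; A) in bits from (state, action) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    states = Counter(s for s, _ in pairs)
    actions = Counter(a for _, a in pairs)
    mi = 0.0
    for (s, a), count in joint.items():
        # p(s, a) / (p(s) * p(a)) simplifies to count * n / (count_s * count_a).
        mi += (count / n) * log2(count * n / (states[s] * actions[a]))
    return mi


# Deterministic policy over two equally likely states: the action reveals the state.
coupled = [(0, "left"), (1, "right")] * 50
# The action ignores the state entirely.
decoupled = [(s, a) for s in (0, 1) for a in ("left", "right")] * 25

mi_coupled = mutual_information_bits(coupled)
mi_decoupled = mutual_information_bits(decoupled)
```

The coupled policy carries one bit of state information in its actions and the decoupled policy carries none, which is the kind of contrast a drop in state-action coupling would show up as.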
[COMMENTS]
10 pages, 4 figures, 1 table
[LINK]
http://arxiv.org/abs/2509.10423v1
[DATE]
2025-09-13 01:24:20+08:00
[CATEGORIES]
cs.LG
Run-Time Monitoring of ERTMS/ETCS Control Flow by Process Mining
[AUTHORS]
Francesco Vitale, Tommaso Zoppi, Francesco Flammini, Nicola Mazzocca
[ABSTRACT]
Ensuring the resilience of computer-based railways is increasingly crucial to
account for uncertainties and changes due to the growing complexity and
criticality of those systems. Although their software relies on strict
verification and validation processes following well-established best practices
and certification standards, anomalies can still occur at run-time due to
residual faults, system and environmental modifications that were unknown at
design-time, or other emergent cyber-threat scenarios. This paper explores
run-time control-flow anomaly detection using process mining to enhance the
resilience of ERTMS/ETCS L2 (European Rail Traffic Management System / European
Train Control System Level 2). Process mining allows learning the actual
control flow of the system from its execution traces, thus enabling run-time
monitoring through online conformance checking. In addition, anomaly
localization is performed through unsupervised machine learning to link
relevant deviations to critical system components. We test our approach on a
reference ERTMS/ETCS L2 scenario, namely the RBC/RBC Handover, to show its
capability to detect and localize anomalies with high accuracy, efficiency, and
explainability.
[COMMENTS]
Accepted to the 6th International Conference on Reliability, Safety,
and Security of Railway Systems (RSSRail2025)
[LINK]
http://arxiv.org/abs/2509.10419v1
[DATE]
2025-09-13 01:17:35+08:00
[CATEGORIES]
cs.LG
Inpainting-Guided Policy Optimization for Diffusion Large Language Models
[AUTHORS]
Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen
[ABSTRACT]
Masked diffusion large language models (dLLMs) are emerging as promising
alternatives to autoregressive LLMs, offering competitive performance while
supporting unique generation capabilities such as inpainting. We explore how
inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with
reinforcement learning faces an exploration challenge: sparse reward signals
and sample waste when models fail to discover correct solutions. While this
inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity: their
inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided
Policy Optimization), an RL framework that strategically inserts partial
ground-truth reasoning traces during online sampling. Unlike providing full
solutions, inpainting steers exploration toward promising trajectory spaces
while preserving self-generated reasoning, bridging supervised fine-tuning and
reinforcement learning. We apply IGPO to group-based optimization methods such
as GRPO, where exploration failures cause zero advantages and gradients. IGPO
restores meaningful gradients while improving sample efficiency. We also
propose supervised fine-tuning on synthetically rewritten concise traces that
better align with dLLM generation patterns. With additional techniques
including entropy-based filtering, our training recipe yields substantial gains
across three mathematical benchmarks (GSM8K, Math500, and AMC), achieving new
state-of-the-art results for full-attention masked dLLMs.
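
The zero-advantage failure mentioned above for group-based methods such as GRPO can be seen in a few lines. This is a sketch of group-relative reward normalization as commonly described, with a population standard deviation and a small `eps` chosen here for illustration; it is not the IGPO implementation and does not model inpainting.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group."""
    center = mean(rewards)
    scale = pstdev(rewards)
    return [(r - center) / (scale + eps) for r in rewards]


# Every sampled response fails: all advantages are zero, so no gradient signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
# One response succeeds: correct and incorrect samples are now separated.
one_right = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

Any mechanism that gets at least one correct sample into a group turns the first case into the second, which is why guided exploration can restore a usable learning signal.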
[COMMENTS]
preprint; 21 pages
[LINK]
http://arxiv.org/abs/2509.10396v1
[DATE]
2025-09-13 00:44:31+08:00
[CATEGORIES]
cs.LG
Bayesian Sheaf Neural Networks
[AUTHORS]
Patrick Gillespie, Layal Bou Hamdan, Ioannis Schizas, David L. Boothe, Vasileios Maroulas
[ABSTRACT]
Equipping graph neural networks with a convolution operation defined in terms
of a cellular sheaf offers advantages for learning expressive representations
of heterophilic graph data. The most flexible approach to constructing the
sheaf is to learn it as part of the network as a function of the node features.
However, this leaves the network potentially overly sensitive to the learned
sheaf. As a counter-measure, we propose a variational approach to learning
cellular sheaves within sheaf neural networks, yielding an architecture we
refer to as a Bayesian sheaf neural network. As part of this work, we define a
novel family of reparameterizable probability distributions on the rotation
group $SO(n)$ using the Cayley transform. We evaluate the Bayesian sheaf neural
network on several graph datasets, and show that our Bayesian sheaf models
achieve leading performance compared to baseline models and are less sensitive
to the choice of hyperparameters under limited training data settings.
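
For reference, the standard Cayley transform mentioned above maps skew-symmetric matrices to rotations. The block below records only this textbook fact and a short verification; the specific family of reparameterizable distributions built on it is not described in the abstract and is not reconstructed here.

```latex
\[
  Q = (I - A)(I + A)^{-1}, \qquad A^\top = -A .
\]
% I + A is invertible because the eigenvalues of a real skew-symmetric
% matrix are purely imaginary, so -1 is never an eigenvalue of A.
\[
  Q^\top Q
  = (I + A)^{-\top}(I - A)^\top (I - A)(I + A)^{-1}
  = (I - A)^{-1}(I + A)(I - A)(I + A)^{-1}
  = I ,
\]
% using A^\top = -A and the fact that I + A and I - A commute.
\[
  \det Q = \frac{\det(I - A)}{\det(I + A)}
         = \frac{\det\big((I + A)^\top\big)}{\det(I + A)} = 1 ,
  \qquad\text{so } Q \in SO(n).
\]
```

Since Q + I = 2(I + A)^{-1} is invertible, the image of this map excludes rotations that have -1 as an eigenvalue.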
[COMMENTS]
32 pages, 4 figures
[LINK]
http://arxiv.org/abs/2410.09590v2
[DATE]
2025-09-13 00:40:19+08:00
[CATEGORIES]
cs.LG
Differentially Private Decentralized Dataset Synthesis Through Randomized Mixing with Correlated Noise
[AUTHORS]
Utsab Saha, Tanvir Muntakim Tonoy, Hafiz Imtiaz
[ABSTRACT]
In this work, we explore differentially private synthetic data generation in
a decentralized-data setting by building on the recently proposed
Differentially Private Class-Centric Data Aggregation (DP-CDA). DP-CDA
synthesizes data in a centralized setting by mixing multiple randomly-selected
samples from the same class and injecting carefully calibrated Gaussian noise,
ensuring $(\epsilon, \delta)$-differential privacy. When deployed in a
decentralized or federated setting, where each client holds only a small
partition of the data, DP-CDA faces new challenges. The limited sample size per
client increases the sensitivity of local computations, requiring higher noise
injection to maintain the differential privacy guarantee. This, in turn, leads
to a noticeable degradation in the utility compared to the centralized setting.
To mitigate this issue, we integrate the Correlation-Assisted Private
Estimation (CAPE) protocol into the federated DP-CDA framework and propose the
CAPE Assisted Federated DP-CDA algorithm. CAPE enables limited collaboration among
the clients by allowing them to generate jointly distributed (anti-correlated)
noise that cancels out in aggregate, while preserving privacy at the individual
level. This technique significantly improves the privacy-utility trade-off in
the federated setting. Extensive experiments on MNIST and FashionMNIST datasets
demonstrate that the proposed CAPE Assisted Federated DP-CDA approach can
achieve utility comparable to its centralized counterpart in some parameter
regimes, while maintaining rigorous differential privacy guarantees.
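
The idea of noise that "cancels out in aggregate" can be illustrated with a toy example. The sketch below simply centers independent Gaussian draws so that they sum to zero across clients. It is an assumption-laden illustration of the cancellation property only: it is not the CAPE protocol, it omits how clients would jointly generate such noise and any additional noise needed for a formal guarantee, and it provides no differential privacy on its own. All names and values are hypothetical.

```python
import random


def zero_sum_noise(num_clients, sigma, rng):
    """Draw Gaussian noise per client, then center it so the terms sum to zero."""
    raw = [rng.gauss(0.0, sigma) for _ in range(num_clients)]
    center = sum(raw) / num_clients
    return [z - center for z in raw]


rng = random.Random(0)
local_values = [0.2, 0.5, 0.8, 0.1]  # hypothetical per-client statistics
noise = zero_sum_noise(len(local_values), sigma=5.0, rng=rng)
released = [v + e for v, e in zip(local_values, noise)]

true_sum = sum(local_values)
released_sum = sum(released)
```

Each released value is heavily perturbed, yet the aggregate matches the true sum up to floating-point error, which is the utility benefit that correlated noise is meant to deliver.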
[COMMENTS]
This work has been submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2509.10385v1
[DATE]
2025-09-13 00:18:35+08:00
[CATEGORIES]
cs.LG
Flow Straight and Fast in Hilbert Space: Functional Rectified Flow
[AUTHORS]
Jianxin Zhang, Clayton Scott
[ABSTRACT]
Many generative models originally developed in finite-dimensional Euclidean
space have functional generalizations in infinite-dimensional settings.
However, the extension of rectified flow to infinite-dimensional spaces remains
unexplored. In this work, we establish a rigorous functional formulation of
rectified flow in an infinite-dimensional Hilbert space. Our approach builds
upon the superposition principle for continuity equations in an
infinite-dimensional space. We further show that this framework extends
naturally to functional flow matching and functional probability flow ODEs,
interpreting them as nonlinear generalizations of rectified flow. Notably, our
extension to functional flow matching removes the restrictive measure-theoretic
assumptions in the existing theory of Kerrigan et al. (2024).
Furthermore, we demonstrate experimentally that our method achieves superior
performance compared to existing functional generative models.
[LINK]
http://arxiv.org/abs/2509.10384v1
[DATE]
2025-09-13 00:18:16+08:00
[CATEGORIES]
cs.LG
Evolving Voices Based on Temporal Poisson Factorisation
[AUTHORS]
Jan Vávra, Bettina Grün, Paul Hofmarcher
[ABSTRACT]
The world is evolving and so is the vocabulary used to discuss topics in
speech. Analysing political speech data from more than 30 years requires the
use of flexible topic models to uncover the latent topics and their change in
prevalence over time as well as the change in the vocabulary of the topics. We
propose the temporal Poisson factorisation (TPF) model as an extension to the
Poisson factorisation model to model sparse count data matrices obtained based
on the bag-of-words assumption from text documents with time stamps. We discuss
and empirically compare different model specifications for the time-varying
latent variables consisting either of a flexible auto-regressive structure of
order one or a random walk. Estimation is based on variational inference where
we consider a combination of coordinate ascent updates with automatic
differentiation using batching of documents. Suitable variational families are
proposed to ease inference. We compare results obtained using independent
univariate variational distributions for the time-varying latent variables to
those obtained with a multivariate variant. We discuss in detail the results of
the TPF model when analysing speeches from 18 sessions in the U.S. Senate
(1981-2016).
[COMMENTS]
main paper: 20 pages (2 single figures, 3 double figures, 3 tables),
appendix: 2 pages, supplementary materials: 18 pages (2 plots, 4 quadruple
plots, 2 tables), references: 3 pages
[LINK]
http://arxiv.org/abs/2410.18486v2
[DATE]
2025-09-13 00:15:25+08:00
[CATEGORIES]
cs.LG
Attacking Attention of Foundation Models Disrupts Downstream Tasks
[AUTHORS]
Hondamunige Prasanna Silva, Federico Becattini, Lorenzo Seidenari
[ABSTRACT]
Foundation models represent the most prominent and recent paradigm shift in
artificial intelligence. Foundation models are large models, trained on broad
data that deliver high accuracy in many downstream tasks, often without
fine-tuning. For this reason, models such as CLIP, DINO, or Vision Transformers
(ViT) are becoming the bedrock of many industrial AI-powered applications.
However, the reliance on pre-trained foundation models also introduces
significant security concerns, as these models are vulnerable to adversarial
attacks. Such attacks involve deliberately crafted inputs designed to deceive
AI systems, jeopardizing their reliability. This paper studies the
vulnerabilities of vision foundation models, focusing specifically on CLIP and
ViTs, and explores the transferability of adversarial attacks to downstream
tasks. We introduce a novel attack, targeting the structure of
transformer-based architectures in a task-agnostic fashion. We demonstrate the
effectiveness of our attack on several downstream tasks: classification,
captioning, image/text retrieval, segmentation and depth estimation. Code is
available at: https://github.com/HondamunigePrasannaSilva/attack-attention
[COMMENTS]
Paper published at CVPR 2025 Workshop Advml
[LINK]
http://arxiv.org/abs/2506.05394v3
[DATE]
2025-09-13 00:12:48+08:00
[CATEGORIES]
cs.LG
Matrix-free Neural Preconditioner for the Dirac Operator in Lattice Gauge Theory
[AUTHORS]
Yixuan Sun, Srinivas Eswar, Yin Lin, William Detmold, Phiala Shanahan, Xiaoye Li, Yang Liu, Prasanna Balaprakash
[ABSTRACT]
Linear systems arise in generating samples and in calculating observables in
lattice quantum chromodynamics (QCD). Solving the Hermitian positive definite
systems, which are sparse but ill-conditioned, involves using iterative
methods, such as Conjugate Gradient (CG), which are time-consuming and
computationally expensive. Preconditioners can effectively accelerate this
process, with the state-of-the-art being multigrid preconditioners. However,
constructing useful preconditioners can be challenging, adding additional
computational overhead, especially in large linear systems. We propose a
framework, leveraging operator learning techniques, to construct linear maps as
effective preconditioners. The method in this work does not rely on explicit
matrices from either the original linear systems or the produced
preconditioners, allowing efficient model training and application in the CG
solver. In the context of the Schwinger model (U(1) gauge theory in 1+1
spacetime dimensions with two degenerate-mass fermions), this preconditioning
scheme effectively decreases the condition number of the linear systems and
approximately halves the number of iterations required for convergence in
relevant parameter ranges. We further demonstrate that the framework learns a
general mapping dependent on the lattice structure, which leads to zero-shot
learning ability for the Dirac operators constructed from gauge field
configurations of different sizes.
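
The matrix-free setting described above, where the solver only ever applies the operator and the preconditioner to vectors, can be sketched with a standard preconditioned Conjugate Gradient loop. This is a generic real-valued textbook version, not the paper's method: a diagonal (Jacobi) preconditioner stands in for the learned neural preconditioner, and the small tridiagonal test system, function names, and tolerances are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def preconditioned_cg(apply_a, apply_m, b, tol=1e-10, max_iter=1000):
    """Solve A x = b given only the actions v -> A v and r -> M r.

    `apply_m` should approximate the action of the inverse of A.
    Returns the solution and the number of iterations used.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the initial guess x = 0
    z = apply_m(r)
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        ap = apply_a(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        z = apply_m(r)
        rz_next = dot(r, z)
        p = [zi + (rz_next / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_next
    return x, max_iter


# Toy symmetric positive definite operator, applied without storing a matrix.
diag = [2.0, 20.0, 200.0, 2000.0]


def apply_a(v):
    out = []
    for i in range(len(v)):
        value = diag[i] * v[i]
        if i > 0:
            value -= v[i - 1]
        if i < len(v) - 1:
            value -= v[i + 1]
        out.append(value)
    return out


def jacobi(r):
    # Stand-in preconditioner: divide by the diagonal of A.
    return [ri / di for ri, di in zip(r, diag)]


b = [1.0, 1.0, 1.0, 1.0]
x_plain, iterations_plain = preconditioned_cg(apply_a, lambda r: list(r), b)
x_prec, iterations_prec = preconditioned_cg(apply_a, jacobi, b)
```

Swapping `jacobi` for any other callable, including a trained network's forward pass, leaves the solver unchanged, which is the practical appeal of keeping both operators matrix-free.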
[LINK]
http://arxiv.org/abs/2509.10378v1
[DATE]
2025-09-13 00:10:18+08:00
[CATEGORIES]
cs.LG
Is Adversarial Training with Compressed Datasets Effective?
[AUTHORS]
Tong Chen, Raghavendra Selvan
[ABSTRACT]
Dataset Condensation (DC) refers to the recent class of dataset compression
methods that generate a smaller, synthetic, dataset from a larger dataset. This
synthetic dataset aims to retain the essential information of the original
dataset, enabling models trained on it to achieve performance levels comparable
to those trained on the full dataset. Most current DC methods have mainly been
concerned with achieving high test performance with a limited data budget, and
have not directly addressed the question of adversarial robustness. In this
work, we investigate the impact of adversarial robustness on models trained
with compressed datasets. We show that the compressed datasets obtained from DC
methods are not effective in transferring adversarial robustness to models. As
a solution to improve dataset compression efficiency and adversarial robustness
simultaneously, we present a robustness-aware dataset compression method based
on finding the Minimal Finite Covering (MFC) of the dataset. The proposed
method is (1) provably robust by minimizing the generalized adversarial loss,
(2) more effective than DC methods when applying adversarial training over MFC,
and (3) obtained by a one-time computation and applicable to any model.
[COMMENTS]
22 pages, 10 figures, 3 tables, accepted at Scandinavian Conference
on Image Analysis 2025 (SCIA 2025)
[LINK]
http://arxiv.org/abs/2402.05675v3
[DATE]
2025-09-13 00:09:45+08:00
[CATEGORIES]
cs.LG
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
[AUTHORS]
Seokjin Go, Joongun Park, Spandan More, Hanjiang Wu, Irene Wang, Aaron Jezghani, Tushar Krishna, Divya Mahajan
[ABSTRACT]
The rapid scaling of Large Language Models (LLMs) has pushed training
workloads far beyond the limits of single-node analysis, demanding a deeper
understanding of how these models behave across large-scale, multi-GPU systems.
In this paper, we present a comprehensive characterization of LLM training
across diverse real-world workloads and hardware platforms, including NVIDIA
H100/H200 and AMD MI250 GPUs. We analyze dense and sparse models under various
parallelism strategies – tensor, pipeline, data, and expert – and evaluate
their effects on hardware utilization, power consumption, and thermal behavior.
We further evaluate the effectiveness of optimizations such as activation
recomputation and compute-communication overlap. Our findings show that
performance is not determined solely by scaling hardware capacity. Scale-up
systems with fewer, higher-memory GPUs can outperform scale-out systems in
communication-bound regimes, but only under carefully tuned configurations; in
other cases, scale-out deployments achieve superior throughput. We also show
that certain parallelism combinations, such as tensor with pipeline, lead to
bandwidth underutilization due to inefficient data chunking, while increasing
microbatch sizes beyond a certain point induces bursty execution and peak power
excursions that worsen thermal throttling. These insights reveal how training
performance is shaped by complex interactions between hardware, system
topology, and model execution. We conclude by offering recommendations for
system and hardware design to improve the scalability and reliability of future
LLM systems and workloads. The source code of this project is available at
https://github.com/sitar-lab/CharLLM-PPT.
[LINK]
http://arxiv.org/abs/2509.10371v1
[DATE]
2025-09-13 00:05:07+08:00
[CATEGORIES]
cs.LG
A Conflicts-free, Speed-lossless KAN-based Reinforcement Learning Decision System for Interactive Driving in Roundabouts
[AUTHORS]
Zhihao Lin, Zhen Tian, Jianglin Lan, Qi Zhang, Ziyang Ye, Hanyang Zhuang, Xianxian Zhao
[ABSTRACT]
Safety and efficiency are crucial for autonomous driving in roundabouts,
especially in mixed traffic with both autonomous vehicles (AVs) and human-driven
vehicles. This paper presents a learning-based algorithm that promotes safe and
efficient driving across varying roundabout traffic conditions. A deep
Q-learning network is used to learn optimal strategies in complex multi-vehicle
roundabout scenarios, while a Kolmogorov-Arnold Network (KAN) improves the AVs’
environmental understanding. To further enhance safety, an action inspector
filters unsafe actions, and a route planner optimizes driving efficiency.
Moreover, model predictive control ensures stability and precision in
execution. Experimental results demonstrate that the proposed system
consistently outperforms state-of-the-art methods, achieving fewer collisions,
reduced travel time, and stable training with smooth reward convergence.
[COMMENTS]
14 pages, 11 figures, published in IEEE Transactions on Intelligent
Transportation Systems
[LINK]
http://arxiv.org/abs/2408.08242v2
[DATE]
2025-09-13 00:03:33+08:00
[CATEGORIES]
cs.LG
UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs
[AUTHORS]
Wenhao Li, Mingbao Lin, Yunshan Zhong, Shuicheng Yan, Rongrong Ji
[ABSTRACT]
Managing long texts is challenging for large language models (LLMs) due to
limited context window sizes. This study introduces UIO-LLMs, an unbiased
incremental optimization approach for memory-enhanced transformers under
long-context settings. We initially conceptualize the process as a streamlined
encoder-decoder framework where the weights-shared encoder and decoder
respectively encapsulate a context segment into memories and leverage these
memories to predict outputs of the subsequent segment. Subsequently, by
treating our memory-enhanced transformers as fully-connected recurrent neural
networks (RNNs), we refine the training process using the Truncated
Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative
incremental optimization techniques. These techniques not only diminish time
complexity but also address the bias in gradient computation through an
unbiased optimization process. UIO-LLMs successfully handle long context, such
as extending the context window of Llama2-7b-chat from 4K to 100K tokens with
minimal 2% additional parameters, while keeping the inference cost nearly
linear as context length increases.
[COMMENTS]
This article was not accepted, and its quality is not very good.
Therefore, we have decided to withdraw the submission and will not resubmit
it elsewhere
[LINK]
http://arxiv.org/abs/2406.18173v3
[DATE]
2025-09-12 23:39:00+08:00
[CATEGORIES]
cs.CL
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes
[AUTHORS]
Tyler Loakman, William Thorne, Chenghua Lin
[COMMENTS]
Accepted to Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2507.13335v2
[DATE]
2025-09-12 22:23:05+08:00
[CATEGORIES]
cs.CL
SI-FACT: Mitigating Knowledge Conflict via Self-Improving Faithfulness-Aware Contrastive Tuning
[AUTHORS]
Shengqiang Fu
[ABSTRACT]
Large Language Models often generate unfaithful responses in knowledge
intensive tasks due to knowledge conflict, that is, a preference for relying on
internal parametric knowledge rather than the provided context. To address this
issue, we propose a novel self improving framework, Self Improving Faithfulness
Aware Contrastive Tuning. The framework uses a self instruct mechanism that
allows the base LLM to automatically generate high quality, structured
contrastive learning data, including anchor samples, semantically equivalent
positive samples, and negative samples simulating unfaithful scenarios. This
approach significantly reduces the cost of manual annotation. Subsequently,
contrastive learning is applied to train the model, enabling it to pull
faithful responses closer and push unfaithful responses farther apart in the
representation space. Experiments on knowledge conflict evaluation benchmarks
ECARE KRE and COSE KRE show that the SI FACT model based on Llama3 8B Instruct
improves the Contextual Recall Rate by 6.2% over the best baseline method,
while significantly reducing dependence on internal memory. The results
indicate that SI FACT provides strong effectiveness and high data efficiency in
enhancing the contextual faithfulness of LLMs, offering a practical pathway
toward building more proactive and trustworthy language models.
[LINK]
http://arxiv.org/abs/2509.10208v1
[DATE]
2025-09-12 20:56:14+08:00
[CATEGORIES]
cs.CL
Benchmark of stylistic variation in LLM-generated texts
[AUTHORS]
Jiří Milička, Anna Marklová, Václav Cvrček
[ABSTRACT]
This study investigates the register variation in texts written by humans and
comparable texts produced by large language models (LLMs). Biber’s
multidimensional analysis (MDA) is applied to a sample of human-written texts
and AI-created texts generated to be their counterparts to find the dimensions
of variation in which LLMs differ most significantly and most systematically
from humans. As textual material, a new LLM-generated corpus AI-Brown is used,
which is comparable to BE-21 (a Brown family corpus representing contemporary
British English). Since all languages except English are underrepresented in
the training data of frontier LLMs, a similar analysis is replicated on Czech
using the AI-Koditex corpus and a Czech multidimensional model. Sixteen frontier
models were examined in various settings and prompts, with emphasis placed on the
difference between base models and instruction-tuned models. Based on this, a
benchmark is created through which models can be compared with each other and
ranked in interpretable dimensions.
[LINK]
http://arxiv.org/abs/2509.10179v1
[DATE]
2025-09-12 20:12:20+08:00
[CATEGORIES]
cs.CL
A 2-step Framework for Automated Literary Translation Evaluation: Its Promises and Pitfalls
[AUTHORS]
Sheikh Shafayat, Dongkeun Yoon, Woori Jang, Jiwoo Choi, Alice Oh, Seohyon Jung
[ABSTRACT]
In this work, we propose and evaluate the feasibility of a two-stage pipeline
to evaluate literary machine translation, in a fine-grained manner, from
English to Korean. The results show that our framework provides fine-grained,
interpretable metrics suited for literary translation and obtains a higher
correlation with human judgment than traditional machine translation metrics.
Nonetheless, it still fails to match inter-human agreement, especially in
metrics like Korean Honorifics. We also observe that LLMs tend to favor
translations generated by other LLMs, and we highlight the necessity of
developing more sophisticated evaluation methods to ensure accurate and
culturally sensitive machine translation of literary works.
[LINK]
http://arxiv.org/abs/2412.01340v3
[DATE]
2025-09-12 20:10:06+08:00
[CATEGORIES]
cs.CL
Error Analysis in a Modular Meeting Transcription System
[AUTHORS]
Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach
[ABSTRACT]
Meeting transcription is a field of high relevance and remarkable progress in
recent years. Still, challenges remain that limit its performance. In this
work, we extend a previously proposed framework for analyzing leakage in speech
separation with proper sensitivity to temporal locality. We show that there is
significant leakage to the cross channel in areas where only the primary
speaker is active. At the same time, the results demonstrate that this does not
affect the final performance much as these leaked parts are largely ignored by
the voice activity detection (VAD). Furthermore, different segmentations are
compared showing that advanced diarization approaches are able to reduce the
gap to oracle segmentation by a third compared to a simple energy-based VAD. We
additionally reveal what factors contribute to the remaining difference. The
results represent state-of-the-art performance on LibriCSS among systems that
train the recognition module on LibriSpeech data only.
[COMMENTS]
Accepted at ITG Conference on Speech Communication 2025
[LINK]
http://arxiv.org/abs/2509.10143v1
[DATE]
2025-09-12 19:10:38+08:00
[CATEGORIES]
cs.CL
cs.LG
Population-Aligned Persona Generation for LLM-based Social Simulation
[AUTHORS]
Zhengyu Hu, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Jianxun Lian, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
[ABSTRACT]
Recent advances in large language models (LLMs) have enabled human-like
social simulations at unprecedented scale and fidelity, offering new
opportunities for computational social science. A key challenge, however, is
the construction of persona sets that authentically represent the diversity and
distribution of real-world populations. Most existing LLM-based social
simulation studies focus primarily on designing agentic frameworks and
simulation environments, often overlooking the complexities of persona
generation and the potential biases introduced by unrepresentative persona
sets. In this paper, we propose a systematic framework for synthesizing
high-quality, population-aligned persona sets for LLM-driven social simulation.
Our approach begins by leveraging LLMs to generate narrative personas from
long-term social media data, followed by rigorous quality assessment to filter
out low-fidelity profiles. We then apply importance sampling to achieve global
alignment with reference psychometric distributions, such as the Big Five
personality traits. To address the needs of specific simulation contexts, we
further introduce a task-specific module that adapts the globally aligned
persona set to targeted subpopulations. Extensive experiments demonstrate that
our method significantly reduces population-level bias and enables accurate,
flexible social simulation for a wide range of research and policy
applications.
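
As a rough illustration of the importance-sampling step mentioned above (not the
authors' implementation), the sketch below reweights a persona pool so that one
discrete trait bucket matches a reference distribution. The trait labels, the
80/20 pool, the 50/50 target, and the function names are all hypothetical.

```python
import random
from collections import Counter


def importance_weights(traits, target):
    """Weight each persona by target_prob / empirical_prob of its trait bucket."""
    counts = Counter(traits)
    n = len(traits)
    return [target.get(t, 0.0) / (counts[t] / n) for t in traits]


def resample(items, weights, k, seed=0):
    """Draw k items with replacement, proportionally to their weights."""
    rng = random.Random(seed)
    return rng.choices(items, weights=weights, k=k)


# Hypothetical pool: 80% "high" extraversion, 20% "low"; the reference is 50/50.
traits = ["high"] * 8 + ["low"] * 2
target = {"high": 0.5, "low": 0.5}

weights = importance_weights(traits, target)
# Indices of the resampled personas; over-represented buckets are drawn less often.
aligned = resample(list(range(len(traits))), weights, k=1000)
```

Under-represented buckets receive weights above 1 and over-represented buckets
receive weights below 1, so the resampled set moves toward the reference
distribution.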
[LINK]
http://arxiv.org/abs/2509.10127v1
[DATE]
2025-09-12 18:43:47+08:00
[CATEGORIES]
cs.CL
cs.LG
Prominence-aware automatic speech recognition for conversational speech
[AUTHORS]
Julian Linke, Barbara Schuppler
[ABSTRACT]
This paper investigates prominence-aware automatic speech recognition (ASR)
by combining prominence detection and speech recognition for conversational
Austrian German. First, prominence detectors were developed by fine-tuning
wav2vec2 models to classify word-level prominence. The detector was then used
to automatically annotate prosodic prominence in a large corpus. Based on those
annotations, we trained novel prominence-aware ASR systems that simultaneously
transcribe words and their prominence levels. The integration of prominence
information did not change performance compared to our baseline ASR system,
while reaching a prominence detection accuracy of 85.53% for utterances where
the recognized word sequence was correct. This paper shows that
transformer-based models can effectively encode prosodic information and
represents a novel contribution to prosody-enhanced ASR, with potential
applications for linguistic research and prosody-informed dialogue systems.
[LINK]
http://arxiv.org/abs/2509.10116v1
[DATE]
2025-09-12 18:18:38+08:00
[CATEGORIES]
cs.CL
Scaling Arabic Medical Chatbots Using Synthetic Data: Enhancing Generative AI with Synthetic Patient Records
[AUTHORS]
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Khaled Shaban
[ABSTRACT]
The development of medical chatbots in Arabic is significantly constrained by
the scarcity of large-scale, high-quality annotated datasets. While prior
efforts compiled a dataset of 20,000 Arabic patient-doctor interactions from
social media to fine-tune large language models (LLMs), model scalability and
generalization remained limited. In this study, we propose a scalable synthetic
data augmentation strategy to expand the training corpus to 100,000 records.
Using advanced generative AI systems (ChatGPT-4o and Gemini 2.5 Pro), we generated
80,000 contextually relevant and medically coherent synthetic question-answer
pairs grounded in the structure of the original dataset. These synthetic
samples were semantically filtered, manually validated, and integrated into the
training pipeline. We fine-tuned five LLMs, including Mistral-7B and AraGPT2,
and evaluated their performance using BERTScore metrics and expert-driven
qualitative assessments. To further analyze the effectiveness of synthetic
sources, we conducted an ablation study comparing ChatGPT-4o and
Gemini-generated data independently. The results showed that ChatGPT-4o data
consistently led to higher F1-scores and fewer hallucinations across all
models. Overall, our findings demonstrate the viability of synthetic
augmentation as a practical solution for enhancing domain-specific language
models in low-resource medical NLP, paving the way for more inclusive,
scalable, and accurate Arabic healthcare chatbot systems.
[COMMENTS]
Accepted in AICCSA 2025
[LINK]
http://arxiv.org/abs/2509.10108v1
[DATE]
2025-09-12 17:58:11+08:00
[CATEGORIES]
cs.CL
Arabic Large Language Models for Medical Text Generation
[AUTHORS]
Abdulrahman Allam, Seif Ahmed, Ali Hamdi, Ammar Mohammed
[ABSTRACT]
Efficient hospital management systems (HMS) are critical worldwide to address
challenges such as overcrowding, limited resources, and poor availability of
urgent health care. Existing methods often lack the ability to provide
accurate, real-time medical advice, particularly for irregular inputs and
underrepresented languages. To overcome these limitations, this study proposes
an approach that fine-tunes large language models (LLMs) for Arabic medical
text generation. The system is designed to assist patients by providing
accurate medical advice, diagnoses, drug recommendations, and treatment plans
based on user input. The research methodology required the collection of a
unique dataset from social media platforms, capturing real-world medical
conversations between patients and doctors. The dataset, which includes patient
complaints together with medical advice, was properly cleaned and preprocessed
to account for multiple Arabic dialects. Fine-tuning state-of-the-art
generative models, such as Mistral-7B-Instruct-v0.2, LLaMA-2-7B, and GPT-2
Medium, optimized the system’s ability to generate reliable medical text.
Results from evaluations indicate that the fine-tuned Mistral-7B model
outperformed the other models, achieving average BERT (Bidirectional Encoder
Representations from Transformers) Score values in precision, recall, and
F1-scores of 68.5%, 69.08%, and 68.5%, respectively. Comparative
benchmarking and qualitative assessments validate the system’s ability to
produce coherent and relevant medical replies to informal input. This study
highlights the potential of generative artificial intelligence (AI) in
advancing HMS, offering a scalable and adaptable solution for global healthcare
challenges, especially in linguistically and culturally diverse environments.
[COMMENTS]
Published in 2025 4th International Conference on Computer
Technologies (ICCTech)
[LINK]
http://arxiv.org/abs/2509.10095v1
[DATE]
2025-09-12 17:37:26+08:00
[CATEGORIES]
cs.CL
Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery
[AUTHORS]
Mustapha Adamu, Qi Zhang, Huitong Pan, Longin Jan Latecki, Eduard C. Dragut
[ABSTRACT]
The growing complexity and volume of climate science literature make it
increasingly difficult for researchers to find relevant information across
models, datasets, regions, and variables. This paper introduces a
domain-specific Knowledge Graph (KG) built from climate publications and
broader scientific texts, aimed at improving how climate knowledge is accessed
and used. Unlike keyword-based search, our KG supports structured, semantic
queries that help researchers discover precise connections such as which models
have been validated in specific regions or which datasets are commonly used
with certain teleconnection patterns. We demonstrate how the KG answers such
questions using Cypher queries, and outline its integration with large language
models in RAG systems to improve transparency and reliability in
climate-related question answering. This work moves beyond KG construction to
show its real-world value for climate researchers, model developers, and others
who rely on accurate, contextual scientific information.
[COMMENTS]
ACM SIGIR 2025 Workshop MANILA
[LINK]
http://arxiv.org/abs/2509.10087v1
[DATE]
2025-09-12 17:28:29+08:00
[CATEGORIES]
cs.CL
Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification
[AUTHORS]
Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
[ABSTRACT]
Recent works have revealed the great potential of speculative decoding in
accelerating the autoregressive generation process of large language models.
The success of these methods relies on the alignment between draft candidates
and the sampled outputs of the target model. Existing methods mainly achieve
draft-target alignment with training-based methods, e.g., EAGLE, Medusa,
involving considerable training costs. In this paper, we present a
training-free alignment-augmented speculative decoding algorithm. We propose
alignment sampling, which leverages output distribution obtained in the
prefilling phase to provide more aligned draft candidates. To further benefit
from high-quality but non-aligned draft candidates, we also introduce a simple
yet effective flexible verification strategy. Through an adaptive probability
threshold, our approach can improve generation accuracy while further improving
inference efficiency. Experiments on 8 datasets (including question answering,
summarization and code completion tasks) show that our approach increases the
average generation score by 3.3 points for the LLaMA3 model. Our method
achieves a mean acceptance length of up to 2.39 and speeds up generation by a factor of 2.23.
[COMMENTS]
Accepted at EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2505.13204v2
[DATE]
2025-09-12 17:08:05+08:00
[CATEGORIES]
cs.CL
Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models
[AUTHORS]
Yi Feng, Jiaqi Wang, Wenxuan Zhang, Zhuang Chen, Yutong Shen, Xiyao Xiao, Minlie Huang, Liping Jing, Jian Yu
[COMMENTS]
EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2507.20241v2
[DATE]
2025-09-12 17:08:02+08:00
[CATEGORIES]
cs.CL
!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment
[AUTHORS]
Mohamed Basem, Mohamed Younes, Seif Ahmed, Abdelrahman Moustafa
[ABSTRACT]
We present MSA's winning system for the BAREC 2025 Shared Task on fine-grained
Arabic readability assessment, achieving first place in six of six tracks. Our
approach is a confidence-weighted ensemble of four complementary transformer
models (AraBERTv2, AraELECTRA, MARBERT, and CAMeLBERT) each fine-tuned with
distinct loss functions to capture diverse readability signals. To tackle
severe class imbalance and data scarcity, we applied weighted training,
advanced preprocessing, SAMER corpus relabeling with our strongest model, and
synthetic data generation via Gemini 2.5 Flash, adding about 10,000 rare-level
samples. A targeted post-processing step corrected prediction distribution
skew, delivering a 6.3 percent Quadratic Weighted Kappa (QWK) gain. Our system
reached 87.5 percent QWK at the sentence level and 87.4 percent at the document
level, demonstrating the power of model and loss diversity, confidence-informed
fusion, and intelligent augmentation for robust Arabic readability prediction.
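
For readers unfamiliar with the metric, a minimal sketch of Quadratic Weighted
Kappa (QWK) follows. It assumes readability levels are encoded as integers from
0 to num_classes - 1 and that the percentages quoted above are the kappa value
multiplied by 100; the function name is hypothetical and this is not the shared
task's scoring script.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Agreement between two ordinal ratings, penalising squared distance."""
    n = len(y_true)
    # Observed confusion matrix: rows are gold levels, columns are predictions.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    if den == 0:
        raise ValueError("QWK is undefined when only one level ever appears")
    return 1.0 - num / den
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and
negative values mean systematic disagreement.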
[COMMENTS]
10 pages, 8 figures, ArabicNLP 2025, co-located with EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.10040v1
[DATE]
2025-09-12 16:08:45+08:00
[CATEGORIES]
cs.CL
Multi-Intent Recognition in Dialogue Understanding: A Comparison Between Smaller Open-Source LLMs
[AUTHORS]
Adnan Ahmad, Philine Kowol, Stefan Hillmann, Sebastian Möller
[ABSTRACT]
In this paper, we provide an extensive analysis of multi-label intent
classification using Large Language Models (LLMs) that are open-source,
publicly available, and can be run on consumer hardware. We use the MultiWOZ
2.1 dataset, a benchmark in the dialogue system domain, to investigate the
efficacy of three popular open-source pre-trained LLMs, namely LLama2-7B-hf,
Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot
setup, giving 20 examples in the prompt with some instructions. Our approach
focuses on the differences in performance of these models across several
performance metrics by methodically assessing these models on multi-label
intent classification tasks. Additionally, we compare the performance of the
instruction-based fine-tuning approach with supervised learning using the
smaller transformer model BertForSequenceClassification as a baseline. To
evaluate the performance of the models, we use evaluation metrics like
accuracy, precision, and recall as well as micro, macro, and weighted F1 score.
We also report the inference time, VRAM requirements, etc. The Mistral-7B-v0.1
outperforms two other generative models on 11 intent classes out of 14 in terms
of F-Score, with a weighted average of 0.50. It also has relatively lower
Hamming Loss and higher Jaccard Similarity, making it the winning model in the
few-shot setting. We find that the BERT-based supervised classifier has superior
performance compared to the best-performing few-shot generative LLM. The study
provides a framework for small open-source LLMs in detecting complex
multi-intent dialogues, enhancing the Natural Language Understanding aspect of
task-oriented chatbots.
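
The two multi-label metrics named above can be sketched as follows. This is an
illustrative implementation on binary indicator vectors, with made-up example
labels and hypothetical function names; it is not the evaluation code used in
the paper.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots that differ, over all samples and labels."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total


def jaccard_similarity(y_true, y_pred):
    """Sample-averaged |intersection| / |union| of gold and predicted label sets."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(true_row, pred_row) if t and p)
        union = sum(1 for t, p in zip(true_row, pred_row) if t or p)
        # Convention assumed here: two empty label sets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Two utterances, four hypothetical intent classes each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 1, 0]]

loss = hamming_loss(y_true, y_pred)        # 2 wrong slots out of 8
score = jaccard_similarity(y_true, y_pred)  # each sample overlaps on 1 of 2 labels
```

Lower Hamming Loss and higher Jaccard Similarity both indicate that predicted
intent sets are closer to the gold sets.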
[LINK]
http://arxiv.org/abs/2509.10010v1
[DATE]
2025-09-12 15:10:55+08:00
[CATEGORIES]
cs.CL
Input-Time Scaling
[AUTHORS]
Rapheal Huang, Weilong Guo
[ABSTRACT]
Current Large Language Models (LLMs) are usually post-trained on large-scale,
carefully curated datasets (data & training scaling) and perform reasoning at
test time (inference-time scaling). In this work, we present a new scaling
paradigm, Input-Time Scaling, to complement previous scaling methods by putting
resources on queries (input time). During training and testing, we utilize
meta-knowledge from LLMs to refine inputs with different strategies. We also
discover a new phenomenon, train-test co-design. It requires us to apply query
strategies during training and testing as a whole. Only applying strategies on
training or testing would seriously degrade the performance gained. We are also
surprised to find that seemingly low data quality datasets can perform better.
We can get the best performance even by adding irrelevant information to the
queries, with randomly selected 1k examples from a minimally filtered dataset.
These findings contradict the widely held inductive bias, “garbage in, garbage
out”. Curating datasets with seemingly high-quality data can even potentially
limit the performance ceiling. In addition, models trained on more data of
similar quality (15k vs. 1k) perform worse, so the intuition of simply scaling the
size should also be carefully inspected. The good news is that our findings are
compatible with the Less is More phenomenon. 1K examples are enough to invoke
high-level reasoning ability. With experiments on Qwen2.5-32B-Instruct, we are
able to reach SOTA performance among 32B models on AIME24 (76.7%) and
AIME25 (76.7%) pass@1. We can further achieve AIME24 (76.7%) and AIME25 (80%) with
a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B,
the result would be 90.0% on AIME24 and 80.0% on AIME25. To facilitate
reproducibility and further research, we are working on open-sourcing our
datasets, data pipelines, evaluation results, and checkpoints.
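
The "majority vote of three models" mentioned above can be illustrated with a
small sketch. The answers below are invented, the function names are
hypothetical, and the tie-breaking rule (the earliest-listed model wins) is an
assumption, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the earliest-listed answer."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer


def vote_accuracy(per_model_answers, gold):
    """Accuracy of the voted answer, given one answer list per model."""
    voted = [majority_vote(list(column)) for column in zip(*per_model_answers)]
    return sum(v == g for v, g in zip(voted, gold)) / len(gold)


# Three hypothetical models answering three hypothetical questions.
model_a = ["42", "7", "13"]
model_b = ["42", "8", "13"]
model_c = ["41", "7", "12"]
gold = ["42", "7", "12"]

acc = vote_accuracy([model_a, model_b, model_c], gold)
```

Voting helps only when the models' errors differ: on the third question two
models share the same wrong answer, so the vote is wrong as well.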
[LINK]
http://arxiv.org/abs/2508.13654v4
[DATE]
2025-09-12 15:04:59+08:00
[CATEGORIES]
cs.LG
cs.CL
Unsupervised Hallucination Detection by Inspecting Reasoning Processes
[AUTHORS]
Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu
[COMMENTS]
To appear in EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.10004v1
[DATE]
2025-09-12 14:58:17+08:00
[CATEGORIES]
cs.CL
Polish-English medical knowledge transfer: A new benchmark and results
[AUTHORS]
Łukasz Grzybowski, Jakub Pokrywka, Michał Ciesiółka, Jeremi I. Kaczmarek, Marek Kubis
[ABSTRACT]
Large Language Models (LLMs) have demonstrated significant potential in
handling specialized tasks, including medical problem-solving. However, most
studies predominantly focus on English-language contexts. This study introduces
a novel benchmark dataset based on Polish medical licensing and specialization
exams (LEK, LDEK, PES) taken by medical doctor candidates and practicing
doctors pursuing specialization. The dataset was web-scraped from publicly
available resources provided by the Medical Examination Center and the Chief
Medical Chamber. It comprises over 24,000 exam questions, including a subset of
parallel Polish-English corpora, where the English portion was professionally
translated by the examination center for foreign candidates. By creating a
structured benchmark from these existing exam questions, we systematically
evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and
Polish-specific models, and compare their performance against human medical
students. Our analysis reveals that while models like GPT-4o achieve near-human
performance, significant challenges persist in cross-lingual translation and
domain-specific understanding. These findings underscore disparities in model
performance across languages and medical specialties, highlighting the
limitations and ethical considerations of deploying LLMs in clinical practice.
[LINK]
http://arxiv.org/abs/2412.00559v2
[DATE]
2025-09-12 14:49:21+08:00
[CATEGORIES]
cs.CL
Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts
[AUTHORS]
Georgios Chochlakis, Peter Wu, Arjun Bedi, Marcus Ma, Kristina Lerman, Shrikanth Narayanan
[COMMENTS]
Accepted to the Main Proceedings of EMNLP, 2025. 20 pages, 16
figures, 10 tables
[LINK]
http://arxiv.org/abs/2505.17222v2
[DATE]
2025-09-12 14:41:58+08:00
[CATEGORIES]
cs.CL
CMHG: A Dataset and Benchmark for Headline Generation of Minority Languages in China
[AUTHORS]
Guixian Xu, Zeli Su, Ziyin Zhang, Jianing Liu, XU Han, Ting Zhang, Yushuang Dong
[ABSTRACT]
Minority languages in China, such as Tibetan, Uyghur, and Traditional
Mongolian, face significant challenges due to their unique writing systems,
which differ from international standards. This discrepancy has led to a severe
lack of relevant corpora, particularly for supervised tasks like headline
generation. To address this gap, we introduce a novel dataset, Chinese Minority
Headline Generation (CMHG), which includes 100,000 entries for Tibetan, and
50,000 entries each for Uyghur and Mongolian, specifically curated for headline
generation tasks. Additionally, we propose a high-quality test set annotated by
native speakers, designed to serve as a benchmark for future research in this
domain. We hope this dataset will become a valuable resource for advancing
headline generation in Chinese minority languages and contribute to the
development of related benchmarks.
[LINK]
http://arxiv.org/abs/2509.09990v1
[DATE]
2025-09-12 14:18:44+08:00
[CATEGORIES]
cs.CL
Open-sci-ref-0.01: open and reproducible reference baselines for language model and dataset comparison
[AUTHORS]
Marianna Nezhurina, Jörg Franke, Taishi Nakamura, Timur Carstensen, Niccolò Ajroldi, Ville Komulainen, David Salinas, Jenia Jitsev
[ABSTRACT]
We introduce open-sci-ref, a family of dense transformer models trained as
research baselines across multiple model (0.13B to 1.7B parameters) and token
scales (up to 1T) on 8 recent open reference datasets. Evaluating the models on
various standardized benchmarks, our set of training runs establishes reference
points that enable researchers to assess the sanity and quality of alternative
training approaches across scales and datasets. Intermediate checkpoints allow
comparison and studying of the training dynamics. The established reference
baselines allow training procedures to be compared through their scaling
trends, aligning them on a common compute axis. Comparison of open reference
datasets reveals that training on NemoTron-CC HQ consistently outperforms other
reference datasets, followed by DCLM-baseline and FineWeb-Edu. In addition to
intermediate training checkpoints, the release includes logs, code, and
downstream evaluations to simplify reproduction, standardize comparison, and
facilitate future research.
[COMMENTS]
Model weights and intermediate checkpoints are available at
https://huggingface.co/collections/open-sci/open-sci-ref-001-685905e598be658fbcebff4f;
code for reproducing training, evaluation and raw experiments data at
https://github.com/LAION-AI/open-sci-ref-0.01
[LINK]
http://arxiv.org/abs/2509.09009v2
[DATE]
2025-09-12 13:22:38+08:00
[CATEGORIES]
cs.LG
cs.CL
Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark
[AUTHORS]
Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He
[ABSTRACT]
As AI advances toward general intelligence, the focus is shifting from
systems optimized for static tasks to creating open-ended agents that learn
continuously. In this paper, we introduce Experience-driven Lifelong Learning
(ELL), a framework for building self-evolving agents capable of continuous
growth through real-world interaction. The framework is built on four core
principles: (1) Experience Exploration: Agents learn through continuous,
self-motivated interaction with dynamic environments, navigating interdependent
tasks and generating rich experiential trajectories. (2) Long-term Memory:
Agents preserve and structure historical knowledge, including personal
experiences, domain expertise, and commonsense reasoning, into a persistent
memory system. (3) Skill Learning: Agents autonomously improve by abstracting
recurring patterns from experience into reusable skills, which are actively
refined and validated for application in new tasks. (4) Knowledge
Internalization: Agents internalize explicit and discrete experiences into
implicit and intuitive capabilities as “second nature”.
We also introduce StuLife, a benchmark dataset for ELL that simulates a
student’s holistic college journey, from enrollment to academic and personal
development, across three core phases and ten detailed sub-scenarios. StuLife
is designed around three key paradigm
[LINK]
http://arxiv.org/abs/2508.19005v4
[DATE]
2025-09-12 13:22:00+08:00
[CATEGORIES]
cs.CL
Déjà Vu: Multilingual LLM Evaluation through the Lens of Machine Translation Evaluation
[AUTHORS]
Julia Kreutzer, Eleftheria Briakou, Sweta Agrawal, Marzieh Fadaee, Tom Kocmi
[ABSTRACT]
Generation capabilities and language coverage of multilingual large language
models (mLLMs) are advancing rapidly. However, evaluation practices for
generative abilities of mLLMs are still lacking comprehensiveness, scientific
rigor, and consistent adoption across research labs, which undermines their
potential to meaningfully guide mLLM development. We draw parallels with
machine translation (MT) evaluation, a field that faced similar challenges and
has, over decades, developed transparent reporting standards and reliable
evaluations for multilingual generative models. Through targeted experiments
across key stages of the generative evaluation pipeline, we demonstrate how
best practices from MT evaluation can deepen the understanding of quality
differences between models. Additionally, we identify essential components for
robust meta-evaluation of mLLMs, ensuring the evaluation methods themselves are
rigorously assessed. We distill these insights into a checklist of actionable
recommendations for mLLM research and development.
[LINK]
http://arxiv.org/abs/2504.11829v4
[DATE]
2025-09-12 12:48:46+08:00
[CATEGORIES]
cs.CL
Agentic Vehicles for Human-Centered Mobility Systems
[AUTHORS]
Jiangbo Yu
[ABSTRACT]
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity
to operate according to internal rules without external control. Autonomous
vehicles (AuVs) are therefore understood as systems that perceive their
environment and execute pre-programmed tasks independently of external input,
consistent with the SAE levels of automated driving. Yet recent research and
real-world deployments have begun to showcase vehicles that exhibit behaviors
outside the scope of this definition. These include natural language
interaction with humans, goal adaptation, contextual reasoning, external tool
use, and the handling of unforeseen ethical dilemmas, enabled in part by
multimodal large language models (LLMs). These developments highlight not only
a gap between technical autonomy and the broader cognitive and social
capacities required for human-centered mobility, but also the emergence of a
form of vehicle intelligence that currently lacks a clear designation. To
address this gap, the paper introduces the concept of agentic vehicles (AgVs):
vehicles that integrate agentic AI systems to reason, adapt, and interact
within complex environments. It synthesizes recent advances in agentic systems
and suggests how AgVs can complement and even reshape conventional autonomy to
ensure mobility services are aligned with user and societal needs. The paper
concludes by outlining key challenges in the development and governance of AgVs
and their potential role in shaping future agentic transportation systems.
[LINK]
http://arxiv.org/abs/2507.04996v5
[DATE]
2025-09-12 11:15:11+08:00
[CATEGORIES]
cs.CL
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
[AUTHORS]
Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen
[ABSTRACT]
Effective tool use is essential for large language models (LLMs) to interact
meaningfully with their environment. However, progress is limited by the lack
of efficient reinforcement learning (RL) frameworks specifically designed for
tool use, due to challenges in constructing stable training environments and
designing verifiable reward mechanisms. To address this, we propose an
automated environment construction pipeline, incorporating scenario
decomposition, document generation, function integration, complexity scaling,
and localized deployment. This enables the creation of high-quality training
environments that provide detailed and measurable feedback without relying on
external tools. Additionally, we introduce a verifiable reward mechanism that
evaluates both the precision of tool use and the completeness of task
execution. When combined with trajectory data collected from the constructed
environments, this mechanism integrates seamlessly with standard RL algorithms
to facilitate feedback-driven model training. Experiments on LLMs of varying
scales demonstrate that our approach significantly enhances the models’
tool-use performance without degrading their general capabilities, regardless
of inference modes or training algorithms. Our analysis suggests that these
gains result from improved context understanding and reasoning, driven by
updates to the lower-layer MLP parameters in models.
[LINK]
http://arxiv.org/abs/2508.08791v2
[DATE]
2025-09-12 10:57:21+08:00
[CATEGORIES]
cs.CL
FinMTEB: Finance Massive Text Embedding Benchmark
[AUTHORS]
Yixuan Tang, Yi Yang
[COMMENTS]
EMNLP 2025, https://github.com/yixuantt/FinMTEB
[LINK]
http://arxiv.org/abs/2502.10990v3
[DATE]
2025-09-12 10:40:43+08:00
[CATEGORIES]
cs.CL
DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
[AUTHORS]
Ngoc-Son Nguyen, Hieu-Nghia Huynh-Nguyen, Thanh V. T. Tran, Truong-Son Hy, Van Nguyen
[ABSTRACT]
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that
mimics the voice of an unseen speaker using only a short reference sample,
requiring not only speaker adaptation but also accurate modeling of prosodic
attributes. Recent approaches based on language models, diffusion, and flow
matching have shown promising results in zero-shot TTS, but still suffer from
slow inference and repetition artifacts. Discrete codec representations have
been widely adopted for speech synthesis, and recent works have begun to
explore diffusion models in purely discrete settings, suggesting the potential
of discrete generative modeling for speech synthesis. However, existing
flow-matching methods typically embed these discrete tokens into a continuous
space and apply continuous flow matching, which may not fully leverage the
advantages of discrete representations. To address these challenges, we
introduce DiFlow-TTS, which, to the best of our knowledge, is the first model
to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS
explicitly models factorized speech attributes within a compact and unified
architecture. It leverages in-context learning by conditioning on textual
content, along with prosodic and acoustic attributes extracted from a reference
speech, enabling effective attribute cloning in a zero-shot setting. In
addition, the model employs a factorized flow prediction mechanism with
distinct heads for prosody and acoustic details, allowing it to learn
aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS
achieves promising performance in several key metrics, including naturalness,
prosody, preservation of speaker style, and energy control. It also maintains a
compact model size and achieves low-latency inference, generating speech up to
25.8 times faster than the latest existing baselines.
[LINK]
http://arxiv.org/abs/2509.09631v2
[DATE]
2025-09-12 09:59:18+08:00
[CATEGORIES]
cs.CL
Faster and Better LLMs via Latency-Aware Test-Time Scaling
[AUTHORS]
Zili Wang, Tianyu Zhang, Haoli Bai, Lu Hou, Xianzhi Yu, Wulong Liu, Shiming Xiang, Lei Zhu
[ABSTRACT]
Test-Time Scaling (TTS) has proven effective in improving the performance of
Large Language Models (LLMs) during inference. However, existing research has
overlooked the efficiency of TTS from a latency-sensitive perspective. Through
a latency-aware evaluation of representative TTS methods, we demonstrate that a
compute-optimal TTS does not always result in the lowest latency in scenarios
where latency is critical. To address this gap and achieve latency-optimal TTS,
we propose two key approaches by optimizing the concurrency configurations: (1)
branch-wise parallelism, which leverages multiple concurrent inference
branches, and (2) sequence-wise parallelism, enabled by speculative decoding.
By integrating these two approaches and allocating computational resources
properly to each, our latency-optimal TTS enables a 32B model to reach 82.3%
accuracy on MATH-500 within 1 minute and a smaller 3B model to achieve 72.4%
within 10 seconds. Our work emphasizes the importance of latency-aware TTS and
demonstrates its ability to deliver both speed and accuracy in
latency-sensitive scenarios.
[LINK]
http://arxiv.org/abs/2505.19634v4
[DATE]
2025-09-12 09:41:20+08:00
[CATEGORIES]
cs.CL
Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
[AUTHORS]
Yuriel Ryan, Rui Yang Tan, Kenny Tsu Wei Choo, Roy Ka-Wei Lee
[COMMENTS]
27 pages, 8 figures, EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.12248v1
[DATE]
2025-09-12 09:39:24+08:00
[CATEGORIES]
cs.CL
NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
[AUTHORS]
Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
[ABSTRACT]
Enhancing the linguistic capabilities of Large Language Models (LLMs) to
include low-resource languages is a critical research area. Current research
directions predominantly rely on synthetic data generated by translating
English corpora, which, while demonstrating promising linguistic understanding
and translation abilities, often results in models aligned with source language
culture. These models frequently fail to represent the cultural heritage and
values of local communities. This work proposes a methodology to create both
synthetic and retrieval-based pre-training data tailored to a specific
community, considering its (i) language, (ii) cultural heritage, and (iii)
cultural values. We demonstrate our methodology using Egyptian and Moroccan
dialects as testbeds, chosen for their linguistic and cultural richness and
current underrepresentation in LLMs. As a proof-of-concept, we develop
NileChat, a 3B parameter LLM adapted for Egyptian and Moroccan communities,
incorporating their language, cultural heritage, and values. Our results on
various understanding, translation, and cultural and values alignment
benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar
size and performs on par with larger models. We share our methods, data, and
models with the community to promote the inclusion and coverage of more diverse
communities in LLM development.
[LINK]
http://arxiv.org/abs/2505.18383v2
[DATE]
2025-09-12 06:14:33+08:00
[CATEGORIES]
cs.CL
Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
[AUTHORS]
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
[ABSTRACT]
Tokenization is a necessary component within the current architecture of many
language models, including the transformer-based large language models (LLMs)
of Generative AI, yet its impact on the model’s cognition is often overlooked.
We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is
sufficient for reasonably human-like language performance, and that the
emergence of human-meaningful linguistic units among tokens and current
structural constraints motivate changes to existing, linguistically-agnostic
tokenization techniques, particularly with respect to their roles as (1)
semantic primitives and as (2) vehicles for conveying salient distributional
patterns from human language to the model. We explore tokenizations from a BPE
tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken;
and the information in exemplar token vectors as they move through the layers
of a RoBERTa (large) model. Besides creating sub-optimal semantic building
blocks and obscuring the model’s access to the necessary distributional
patterns, we describe how tokens and pretraining can act as a backdoor for bias
and other unwanted content, which current alignment practices may not
remediate. Additionally, we relay evidence that the tokenization algorithm’s
objective function impacts the LLM’s cognition, despite being arguably
meaningfully insulated from the main system intelligence. [First uploaded to
arXiv in December, 2024.]
[LINK]
http://arxiv.org/abs/2412.10924v5
[DATE]
2025-09-12 05:57:39+08:00
[CATEGORIES]
cs.CL
Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case
[AUTHORS]
Bastián González-Bustamante, Nando Verelst, Carla Cisternas
[ABSTRACT]
Large Language Models (LLMs) offer promising avenues for methodological and
applied innovations in survey research by using synthetic respondents to
emulate human answers and behaviour, potentially mitigating measurement and
representation errors. However, the extent to which LLMs recover aggregate item
distributions remains uncertain and downstream applications risk reproducing
social stereotypes and biases inherited from training data. We evaluate the
reliability of LLM-generated synthetic survey responses against ground-truth
human responses from a Chilean public opinion probabilistic survey.
Specifically, we benchmark 128 prompt-model-question triplets, generating
189,696 synthetic profiles, and pool performance metrics (i.e., accuracy,
precision, recall, and F1-score) in a meta-analysis across 128
question-subsample pairs to test for biases along key sociodemographic
dimensions. The evaluation spans OpenAI’s GPT family and o-series reasoning
models, as well as Llama and Qwen checkpoints. Three results stand out. First,
synthetic responses achieve excellent performance on trust items (F1-score and
accuracy > 0.90). Second, GPT-4o, GPT-4o-mini and Llama 4 Maverick perform
comparably on this task. Third, synthetic-human alignment is highest among
respondents aged 45-59. Overall, LLM-based synthetic samples approximate
responses from a probabilistic sample, though with substantial item-level
heterogeneity. Capturing the full nuance of public opinion remains challenging
and requires careful calibration and additional distributional tests to ensure
algorithmic fidelity and reduce errors.
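
For reference, the four metrics named above can be computed for a single answer category as in the following minimal sketch. The labels and the `prf1` helper are hypothetical illustrations and are not taken from the paper's data or code.

```python
def prf1(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one category treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical human answers vs. synthetic answers for one survey item.
human = ["trust", "trust", "distrust", "trust", "distrust"]
synthetic = ["trust", "distrust", "distrust", "trust", "trust"]
accuracy, precision, recall, f1 = prf1(human, synthetic, positive="trust")
```

With these toy labels there are two true positives, one false positive and one false negative, so precision, recall and F1 all equal 2/3 while accuracy is 3/5.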
[COMMENTS]
Working paper: 18 pages, 4 tables, 2 figures
[LINK]
http://arxiv.org/abs/2509.09871v1
[DATE]
2025-09-12 05:43:59+08:00
[CATEGORIES]
cs.CL
Vibe Check: Understanding the Effects of LLM-Based Conversational Agents’ Personality and Alignment on User Perceptions in Goal-Oriented Tasks
[AUTHORS]
Hasibur Rahman, Smit Desai
[ABSTRACT]
Large language models (LLMs) enable conversational agents (CAs) to express
distinctive personalities, raising new questions about how such designs shape
user perceptions. This study investigates how personality expression levels and
user-agent personality alignment influence perceptions in goal-oriented tasks.
In a between-subjects experiment (N=150), participants completed travel
planning with CAs exhibiting low, medium, or high expression across the Big
Five traits, controlled via our novel Trait Modulation Keys framework. Results
revealed an inverted-U relationship: medium expression produced the most
positive evaluations across Intelligence, Enjoyment, Anthropomorphism,
Intention to Adopt, Trust, and Likeability, significantly outperforming both
extremes. Personality alignment further enhanced outcomes, with Extraversion
and Emotional Stability emerging as the most influential traits. Cluster
analysis identified three distinct compatibility profiles, with “Well-Aligned”
users reporting substantially positive perceptions. These findings demonstrate
that personality expression and strategic trait alignment constitute optimal
design targets for CA personality, offering design implications as LLM-based
CAs become increasingly prevalent.
[LINK]
http://arxiv.org/abs/2509.09870v1
[DATE]
2025-09-12 05:43:49+08:00
[CATEGORIES]
cs.CL
Decoding Neural Emotion Patterns through Large Language Model Embeddings
[AUTHORS]
Gideon Vos, Maryam Ebrahimpour, Liza van Eijk, Zoltan Sarnyai, Mostafa Rahimi Azghadi
[ABSTRACT]
Understanding how emotional expression in language relates to brain function
is a challenge in computational neuroscience and affective computing.
Traditional neuroimaging is costly and lab-bound, but abundant digital text
offers new avenues for emotion-brain mapping. Prior work has largely examined
neuroimaging-based emotion localization or computational text analysis
separately, with little integration. We propose a computational framework that
maps textual emotional content to anatomically defined brain regions without
requiring neuroimaging. Using OpenAI’s text-embedding-ada-002, we generate
high-dimensional semantic representations, apply dimensionality reduction and
clustering to identify emotional groups, and map them to 18 brain regions
linked to emotional processing. Three experiments were conducted: i) analyzing
conversational data from healthy vs. depressed subjects (DAIC-WOZ dataset) to
compare mapping patterns, ii) applying the method to the GoEmotions dataset and
iii) comparing human-written text with large language model (LLM) responses to
assess differences in inferred brain activation. Emotional intensity was scored
via lexical analysis. Results showed neuroanatomically plausible mappings with
high spatial specificity. Depressed subjects exhibited greater limbic
engagement tied to negative affect. Discrete emotions were successfully
differentiated. LLM-generated text matched humans in basic emotion distribution
but lacked nuanced activation in empathy and self-referential regions (medial
prefrontal and posterior cingulate cortex). This cost-effective, scalable
approach enables large-scale analysis of naturalistic language, distinguishes
between clinical populations, and offers a brain-based benchmark for evaluating
AI emotional expression.
[COMMENTS]
26 pages, 9 figures
[LINK]
http://arxiv.org/abs/2508.09337v2
[DATE]
2025-09-12 05:41:16+08:00
[CATEGORIES]
cs.CL
Latency and Token-Aware Test-Time Compute
[AUTHORS]
Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun
[ABSTRACT]
Inference-time scaling has emerged as a powerful way to improve large
language model (LLM) performance by generating multiple candidate responses and
selecting among them. However, existing work on dynamic allocation for
test-time compute typically considers only parallel generation methods such as
best-of-N, overlooking incremental decoding methods like beam search, and has
largely ignored latency, focusing only on token usage. We formulate
inference-time scaling as a problem of dynamic compute allocation and method
selection, where the system must decide which strategy to apply and how much
compute to allocate on a per-query basis. Our framework explicitly incorporates
both token cost and wall-clock latency, the latter being critical for user
experience and particularly for agentic workflows where models must issue
multiple queries efficiently. Experiments on reasoning benchmarks show that our
approach consistently outperforms static strategies, achieving favorable
accuracy-cost trade-offs while remaining practical for deployment.
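
A minimal sketch of per-query method selection under both token and latency costs is given below. The linear utility, the `Option` fields, and all numbers are assumptions made for illustration; the abstract does not specify the actual objective or how accuracy is estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str            # decoding strategy and budget, e.g. "best_of_8"
    est_accuracy: float  # predicted probability of a correct answer (assumed given)
    tokens: int          # expected number of generated tokens
    latency_s: float     # expected wall-clock latency in seconds


def pick(options, lam_tokens, lam_latency):
    """Choose the option maximizing an assumed linear accuracy-cost utility."""
    def utility(o):
        return o.est_accuracy - lam_tokens * o.tokens - lam_latency * o.latency_s
    return max(options, key=utility)


# Hypothetical estimates for a single query.
options = [
    Option("greedy", 0.60, 200, 1.0),
    Option("best_of_8", 0.75, 1600, 1.5),  # parallel: many tokens, low latency
    Option("beam_4", 0.78, 900, 6.0),      # incremental: fewer tokens, slower
]

token_only = pick(options, lam_tokens=1e-5, lam_latency=0.0).name
latency_aware = pick(options, lam_tokens=1e-5, lam_latency=0.02).name
token_scarce = pick(options, lam_tokens=5e-4, lam_latency=0.02).name
```

In this toy setting, ignoring latency favors beam search, adding a latency penalty shifts the choice to best-of-8, and a high token price falls back to greedy decoding.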
[LINK]
http://arxiv.org/abs/2509.09864v1
[DATE]
2025-09-12 05:35:19+08:00
[CATEGORIES]
cs.LG
cs.CL
MEMOIR: Lifelong Model Editing with Minimal Overwrite and Informed Retention for LLMs
[AUTHORS]
Ke Wang, Yiming Qin, Nikolaos Dimitriadis, Alessandro Favero, Pascal Frossard
[ABSTRACT]
Language models deployed in real-world systems often require post-hoc updates
to incorporate new or corrected knowledge. However, editing such models
efficiently and reliably, without retraining or forgetting previous
information, remains a major challenge. Existing methods for lifelong model
editing either compromise generalization, interfere with past edits, or fail to
scale to long editing sequences. We propose MEMOIR, a novel scalable framework
that injects knowledge through a residual memory, i.e., a dedicated parameter
module, while preserving the core capabilities of the pre-trained model. By
sparsifying input activations through sample-dependent masks, MEMOIR confines
each edit to a distinct subset of the memory parameters, minimizing
interference among edits. At inference, it identifies relevant edits by
comparing the sparse activation patterns of new queries to those stored during
editing. This enables generalization to rephrased queries by activating only
the relevant knowledge while suppressing unnecessary memory activation for
unrelated prompts. Experiments on question answering, hallucination correction,
and out-of-distribution generalization benchmarks for LLaMA-3 and Mistral
backbones demonstrate that MEMOIR achieves state-of-the-art performance across
reliability, generalization, and locality metrics, scaling to thousands of
sequential edits with minimal forgetting.
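
The routing idea described above (compare a query's sparse activation pattern with patterns stored at edit time) can be illustrated with the following sketch. The top-k masking rule, the overlap score, the threshold, and all function names are assumptions made for illustration; they are not confirmed details of MEMOIR.

```python
def topk_mask(activation, k):
    """Indices of the k largest-magnitude entries (an assumed masking rule)."""
    order = sorted(range(len(activation)),
                   key=lambda i: abs(activation[i]), reverse=True)
    return frozenset(order[:k])


def overlap(mask_a, mask_b):
    """Fraction of mask_a's active indices that are also active in mask_b."""
    return len(mask_a & mask_b) / max(len(mask_a), 1)


def route(query_activation, stored_masks, k, threshold):
    """Return the index of the best-matching stored edit, or None if no
    stored mask overlaps enough (the memory would then stay inactive)."""
    q = topk_mask(query_activation, k)
    best_idx, best = None, 0.0
    for idx, mask in enumerate(stored_masks):
        score = overlap(q, mask)
        if score > best:
            best_idx, best = idx, score
    return best_idx if best >= threshold else None


# Hypothetical activations recorded when two edits were made.
stored = [
    topk_mask([5.0, 0.1, -4.0, 0.2, 0.0, 0.3], k=2),  # edit 0
    topk_mask([0.1, 0.2, 0.0, 3.0, -6.0, 0.1], k=2),  # edit 1
]

rephrased = route([4.5, 0.3, -3.5, 0.1, 0.2, 0.0], stored, k=2, threshold=0.5)
unrelated = route([0.1, 7.0, 0.2, 0.1, 0.3, -5.0], stored, k=2, threshold=0.5)
```

Here the rephrased query shares its active indices with edit 0 and is routed to it, while the unrelated prompt overlaps with neither stored mask and activates no edit.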
[COMMENTS]
The first two authors contributed equally to this work
[LINK]
http://arxiv.org/abs/2506.07899v3
[DATE]
2025-09-12 04:56:58+08:00
[CATEGORIES]
cs.CL
cs.LG
Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
[AUTHORS]
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
[ABSTRACT]
Multilingual vision-language models (VLMs) promise universal image-text
retrieval, yet their social biases remain underexplored. We perform the first
systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP,
CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in
resource availability and morphological gender marking. Using balanced subsets
of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify
race and gender bias and measure stereotype amplification. Contrary to the
intuition that multilinguality mitigates bias, every model exhibits stronger
gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest
biases precisely in the low-resource languages it targets, while the shared
encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into
gender-neutral languages; loosely coupled encoders largely avoid this leakage.
Although SigLIP-2 reduces agency and communion skews, it inherits – and in
caption-sparse contexts (e.g., Xhosa) amplifies – the English anchor’s crime
associations. Highly gendered languages consistently magnify all bias types,
yet gender-neutral languages remain vulnerable whenever cross-lingual weight
sharing imports foreign stereotypes. Aggregated metrics thus mask
language-specific hot spots, underscoring the need for fine-grained,
language-aware bias evaluation in future multilingual VLM research.
[LINK]
http://arxiv.org/abs/2505.14160v2
[DATE]
2025-09-12 04:26:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Pragmatic Frames Evoked by Gestures: A FrameNet Brasil Approach to Multimodality in Turn Organization
[AUTHORS]
Helen de Andrade Abreu, Tiago Timponi Torrent, Ely Edison da Silva Matos
[ABSTRACT]
This paper proposes a framework for modeling multimodal conversational turn
organization via the proposition of correlations between language and
interactive gestures, based on an analysis of how pragmatic frames are
conceptualized and evoked by communicators. As a means to provide evidence for
the analysis, we developed an annotation methodology to enrich a multimodal
dataset (annotated for semantic frames) with pragmatic frames modeling
conversational turn organization. Although conversational turn organization has
been studied by researchers from diverse fields, the specific strategies,
especially gestures used by communicators, had not yet been encoded in a
dataset that can be used for machine learning. To fill this gap, we enriched
the Frame2 dataset with annotations of gestures used for turn organization. The
Frame2 dataset features 10 episodes from the Brazilian TV series Pedro Pelo
Mundo annotated for semantic frames evoked in both video and text. This dataset
allowed us to closely observe how communicators use interactive gestures
outside a laboratory, in settings, to our knowledge, not previously recorded in
related literature. Our results have confirmed that communicators involved in
face-to-face conversation make use of gestures as a tool for passing, taking
and keeping conversational turns, and also revealed variations of some gestures
that had not been documented before. We propose that the use of these gestures
arises from the conceptualization of pragmatic frames, involving mental spaces,
blending and conceptual metaphors. In addition, our data demonstrate that the
annotation of pragmatic frames contributes to a deeper understanding of human
cognition and language.
[COMMENTS]
Paper submitted to Language Sciences Journal
[LINK]
http://arxiv.org/abs/2509.09804v1
[DATE]
2025-09-12 03:14:57+08:00
[CATEGORIES]
cs.CL
HEFT: A Coarse-to-Fine Hierarchy for Enhancing the Efficiency and Accuracy of Language Model Reasoning
[AUTHORS]
Brennen Hill
[ABSTRACT]
The adaptation of large language models (LLMs) to specialized reasoning tasks
is fundamentally constrained by computational resources. Parameter-Efficient
Fine-Tuning (PEFT) methods have emerged as a powerful solution, yet the
landscape of these techniques is diverse, with distinct methods operating in
either the model’s weight space or its representation space. This paper
investigates the hypothesis that a synergistic combination of these paradigms
can unlock superior performance and efficiency. We introduce HEFT (Hierarchical
Efficient Fine-Tuning), a novel hierarchical adaptation strategy that composes
two distinct PEFT methods in a coarse-to-fine manner: first, a broad,
foundational adaptation in the weight space using Low-Rank Adaptation (LoRA),
followed by a precise, surgical refinement of internal activations using
Representation Fine-Tuning (ReFT). We evaluate this approach by fine-tuning a
Llama-2-7B model on the BoolQ benchmark, a challenging dataset for inferential
reasoning. Our results reveal a profound synergistic effect. A model fine-tuned
for only three epochs with our HEFT strategy achieves an accuracy of 85.17%,
exceeding the performance of models trained for 20 epochs with either LoRA-only
(85.05%) or ReFT-only (83.36%) methodologies. This work demonstrates that the
thoughtful composition of PEFT methods is a potent algorithmic innovation,
offering a more efficient and effective path toward advancing the reasoning
capabilities of language models. By achieving superior results with a fraction
of the computational budget, our findings present a principled approach to
overcoming the obstacles inherent in adapting large-scale models for complex
cognitive tasks.
[LINK]
http://arxiv.org/abs/2509.09801v1
[DATE]
2025-09-12 03:06:46+08:00
[CATEGORIES]
cs.CL
cs.LG
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
[AUTHORS]
Rongyao Fang, Aldrich Yu, Chengqi Duan, Linjiang Huang, Shuai Bai, Yuxuan Cai, Kun Wang, Si Liu, Xihui Liu, Hongsheng Li
[ABSTRACT]
The advancement of open-source text-to-image (T2I) models has been hindered
by the absence of large-scale, reasoning-focused datasets and comprehensive
evaluation benchmarks, resulting in a performance gap compared to leading
closed-source systems. To address this challenge, we introduce FLUX-Reason-6M
and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark).
FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality
FLUX-generated images and 20 million bilingual (English and Chinese)
descriptions specifically designed to teach complex reasoning. The images are
organized according to six key characteristics: Imagination, Entity, Text
rendering, Style, Affection, and Composition, and we design an explicit
Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image
generation steps. The whole data curation takes 15,000 A100 GPU days, providing the
community with a resource previously unattainable outside of large industrial
labs. PRISM-Bench offers a novel evaluation standard with seven distinct
tracks, including a formidable Long Text challenge using GCoT. Through
carefully designed prompts, it utilizes advanced vision-language models for
nuanced human-aligned assessment of prompt-image alignment and image
aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench
reveals critical performance gaps and highlights specific areas requiring
improvement. Our dataset, benchmark, and evaluation code are released to
catalyze the next wave of reasoning-oriented T2I generation. Project page:
https://flux-reason-6m.github.io/ .
[COMMENTS]
Project page: https://flux-reason-6m.github.io/
[LINK]
http://arxiv.org/abs/2509.09680v1
[DATE]
2025-09-12 01:59:59+08:00
[CATEGORIES]
cs.CL
ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
[AUTHORS]
Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
[ABSTRACT]
Large language models require massive memory footprints, severely limiting
deployment on consumer hardware. Quantization reduces memory through lower
numerical precision, but extreme 2-bit quantization suffers from catastrophic
performance loss due to outliers in activations. Rotation-based methods such as
QuIP and QuaRot apply orthogonal transforms to eliminate outliers before
quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} =
(\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these
methods use fixed transforms (Hadamard matrices achieving optimal worst-case
coherence $\mu = 1/\sqrt{n}$) that cannot adapt to specific weight
distributions. We identify that different transformer layers exhibit distinct
outlier patterns, motivating layer-adaptive rotations rather than
one-size-fits-all approaches. We propose ButterflyQuant, which replaces
Hadamard rotations with learnable butterfly transforms parameterized by
continuous Givens rotation angles. Unlike Hadamard’s discrete $\{+1, -1\}$
entries that are non-differentiable and prohibit gradient-based learning,
butterfly transforms’ continuous parameterization enables smooth optimization
while guaranteeing orthogonality by construction. This orthogonal constraint
ensures theoretical guarantees in outlier suppression while achieving $O(n \log
n)$ computational complexity with only $\frac{n \log n}{2}$ learnable
parameters. We further introduce a uniformity regularization on
post-transformation activations to promote smoother distributions amenable to
quantization. Learning requires only 128 calibration samples and converges in
minutes on a single GPU, a negligible one-time cost. On LLaMA-2-7B with 2-bit
quantization, ButterflyQuant achieves 15.4 perplexity versus 22.1 for QuaRot.
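
The computational-invariance identity and the parameter count quoted above can be checked numerically with the sketch below. It builds an orthogonal $\mathbf{Q}$ as a product of $\log_2 n$ stages of Givens rotations on disjoint index pairs. The pairing scheme, the angle values, and the helper names are assumptions for illustration, not the paper's implementation, and no learning or quantization is performed.

```python
import math


def butterfly_apply(x, angles):
    """Apply Q to x, where Q is a product of log2(n) stages of Givens rotations.

    angles[stage][pair] is the rotation angle for one index pair at one stage.
    """
    n = len(x)
    y = list(x)
    stride, stage = 1, 0
    while stride < n:
        out = list(y)
        pair = 0
        for block in range(0, n, 2 * stride):
            for i in range(block, block + stride):
                j = i + stride
                c = math.cos(angles[stage][pair])
                s = math.sin(angles[stage][pair])
                out[i] = c * y[i] - s * y[j]
                out[j] = s * y[i] + c * y[j]
                pair += 1
        y = out
        stride *= 2
        stage += 1
    return y


def butterfly_matrix(n, angles):
    """Materialize Q column by column by applying it to the basis vectors."""
    cols = [butterfly_apply([float(k == i) for k in range(n)], angles)
            for i in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]


def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]


n = 4
angles = [[0.3, -1.1], [0.7, 2.0]]  # arbitrary values: n*log2(n)/2 = 4 angles
W = [[1.0, 2.0, 0.0, -1.0], [0.5, -3.0, 4.0, 2.0]]
x = [10.0, -0.2, 0.1, 0.3]  # first entry plays the role of an outlier

Q = butterfly_matrix(n, angles)
# (W Q^T)[i][j] = sum_k W[i][k] * Q[j][k]
WQt = [[sum(W[i][k] * Q[j][k] for k in range(n)) for j in range(n)]
       for i in range(len(W))]
y_direct = matvec(W, x)
y_rotated = matvec(WQt, butterfly_apply(x, angles))
max_diff = max(abs(a - b) for a, b in zip(y_direct, y_rotated))
num_angles = sum(len(stage) for stage in angles)
```

Because every stage rotates disjoint pairs, $\mathbf{Q}$ is orthogonal for any choice of angles, so `y_rotated` matches `y_direct` up to floating-point error, and the number of angles equals $\frac{n \log_2 n}{2}$.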
[COMMENTS]
Replace discrete Hadamard transforms with continuous Butterfly
transforms to facilitate the learning of rotation matrices in LLM
quantization
[LINK]
http://arxiv.org/abs/2509.09679v1
[DATE]
2025-09-12 01:59:51+08:00
[CATEGORIES]
cs.LG
cs.CL
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning
[AUTHORS]
Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding
[ABSTRACT]
Vision-Language-Action (VLA) models have recently emerged as a powerful
paradigm for robotic manipulation. Despite substantial progress enabled by
large-scale pretraining and supervised fine-tuning (SFT), these models face two
fundamental challenges: (i) the scarcity and high cost of large-scale
human-operated robotic trajectories required for SFT scaling, and (ii) limited
generalization to tasks involving distribution shift. Recent breakthroughs in
Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can
dramatically enhance step-by-step reasoning capabilities, raising a natural
question: Can RL similarly improve the long-horizon step-by-step action
planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL
framework tailored for VLA models. Building upon veRL, we introduce
VLA-specific trajectory sampling, scalable parallelization, multi-environment
rendering, and optimized loss computation. When applied to OpenVLA-OFT,
SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$
on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce.
SimpleVLA-RL not only reduces dependence on large-scale data and enables robust
generalization, but also remarkably surpasses SFT in real-world tasks.
Moreover, we identify a novel phenomenon “pushcut” during RL training,
wherein the policy discovers previously unseen patterns beyond those seen in
the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL
[LINK]
http://arxiv.org/abs/2509.09674v1
[DATE]
2025-09-12 01:59:17+08:00
[CATEGORIES]
cs.CL
cs.LG
Steering MoE LLMs via Expert (De)Activation
[AUTHORS]
Mohsen Fayyaz, Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Ryan Rossi, Trung Bui, Hinrich Schütze, Nanyun Peng
[ABSTRACT]
Mixture-of-Experts (MoE) in Large Language Models (LLMs) routes each token
through a subset of specialized Feed-Forward Networks (FFN), known as experts.
We present SteerMoE, a framework for steering MoE models by detecting and
controlling behavior-linked experts. Our detection method identifies experts
with distinct activation patterns across paired inputs exhibiting contrasting
behaviors. By selectively (de)activating such experts during inference, we
control behaviors like faithfulness and safety without retraining or modifying
weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to
+20% and faithfulness by +27%. In adversarial attack mode, it drops safety by
-41% alone, and -100% when combined with existing jailbreak methods, bypassing
all safety guardrails and exposing a new dimension of alignment faking hidden
within experts.
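The detect-then-(de)activate idea can be sketched as follows. The abstract
does not state the scoring rule, so the activation-rate gap used here, along
with all function names and the forcing of router logits to plus or minus
infinity, is an assumption for illustration rather than the SteerMoE method
itself.

```python
import math

def activation_rates(routes, num_experts):
    """Fraction of tokens routed to each expert.

    routes holds one list of selected expert ids per token.
    """
    counts = [0] * num_experts
    for selected in routes:
        for e in selected:
            counts[e] += 1
    total = max(len(routes), 1)
    return [c / total for c in counts]

def behavior_linked_experts(routes_pos, routes_neg, num_experts, n):
    """Rank experts by how differently they fire on contrasting inputs.

    Returns (promote, suppress): the n experts most associated with the
    desired behavior and the n most associated with its opposite.
    """
    pos = activation_rates(routes_pos, num_experts)
    neg = activation_rates(routes_neg, num_experts)
    gap = [p - q for p, q in zip(pos, neg)]
    order = sorted(range(num_experts), key=lambda e: gap[e])
    return order[-n:][::-1], order[:n]

def steered_top_k(logits, k, activate=(), deactivate=()):
    """Top-k routing with selected experts forced on or off (weights untouched)."""
    adjusted = list(logits)
    for e in activate:
        adjusted[e] = math.inf
    for e in deactivate:
        adjusted[e] = -math.inf
    return sorted(range(len(adjusted)), key=lambda e: adjusted[e], reverse=True)[:k]
```

Steering in this sketch only changes which experts the router selects at
inference time, consistent with the abstract's claim that no retraining or
weight modification is required.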
[LINK]
http://arxiv.org/abs/2509.09660v1
[DATE]
2025-09-12 01:55:09+08:00
[CATEGORIES]
cs.CL
cs.LG
Retrieval-Augmented Generation for Reliable Interpretation of Radio Regulations
[AUTHORS]
Zakaria El Kassimi, Fares Fourati, Mohamed-Slim Alouini
[ABSTRACT]
We study question answering in the domain of radio regulations, a legally
sensitive and high-stakes area. We propose a telecom-specific
Retrieval-Augmented Generation (RAG) pipeline and introduce, to our knowledge,
the first multiple-choice evaluation set for this domain, constructed from
authoritative sources using automated filtering and human validation. To assess
retrieval quality, we define a domain-specific retrieval metric, under which
our retriever achieves approximately 97% accuracy. Beyond retrieval, our
approach consistently improves generation accuracy across all tested models. In
particular, while naively inserting documents without structured retrieval
yields only marginal gains for GPT-4o (less than 1%), applying our pipeline
results in nearly a 12% relative improvement. These findings demonstrate that
carefully targeted grounding provides a simple yet strong baseline and an
effective domain-specific solution for regulatory question answering. All code
and evaluation scripts, along with our derived question-answer dataset, are
available at https://github.com/Zakaria010/Radio-RAG.
[LINK]
http://arxiv.org/abs/2509.09651v1
[DATE]
2025-09-12 01:43:42+08:00
[CATEGORIES]
cs.CL
cs.LG
All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens
[AUTHORS]
Siddarth Mamidanna, Daking Rai, Ziyu Yao, Yilun Zhou
[COMMENTS]
EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2509.09650v1
[DATE]
2025-09-12 01:41:29+08:00
[CATEGORIES]
cs.CL
Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems
[AUTHORS]
Minghang Zhu, Zhengliang Shi, Zhiwei Xu, Shiguang Wu, Lingjie Wang, Pengjie Ren, Zhaochun Ren, Zhumin Chen
[ABSTRACT]
The advancement of large language models (LLMs) has enabled the construction
of multi-agent systems to solve complex tasks by dividing responsibilities
among specialized agents, such as a planning agent for subgoal generation and a
grounding agent for executing tool-use actions. Most existing methods typically
fine-tune these agents independently, leading to capability gaps among them
with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint
Alignment Tuning framework that improves agent collaboration through iterative
alignment. MOAT alternates between two key stages: (1) Planning Agent
Alignment, which optimizes the planning agent to generate subgoal sequences
that better guide the grounding agent; and (2) Grounding Agent Improving, which
fine-tunes the grounding agent using diverse subgoal-action pairs generated by
the agent itself to enhance its generalization capability. Theoretical analysis
proves that MOAT ensures a non-decreasing and progressively convergent training
process. Experiments across six benchmarks demonstrate that MOAT outperforms
state-of-the-art baselines, achieving average improvements of 3.1% on held-in
tasks and 4.4% on held-out tasks.
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2509.09629v1
[DATE]
2025-09-12 01:15:45+08:00
[CATEGORIES]
cs.CL
Physics-informed sensor coverage through structure preserving machine learning
[AUTHORS]
Benjamin David Shaffer, Brooks Kinch, Joseph Klobusicky, M. Ani Hsieh, Nathaniel Trask
[ABSTRACT]
We present a machine learning framework for adaptive source localization in
which agents use a structure-preserving digital twin of a coupled
hydrodynamic-transport system for real-time trajectory planning and data
assimilation. The twin is constructed with conditional neural Whitney forms
(CNWF), coupling the numerical guarantees of finite element exterior calculus
(FEEC) with transformer-based operator learning. The resulting model preserves
discrete conservation, and adapts in real time to streaming sensor data. It
employs a conditional attention mechanism to identify: a reduced Whitney-form
basis; reduced integral balance equations; and a source field, each compatible
with given sensor measurements. The induced reduced-order environmental model
retains the stability and consistency of standard finite-element simulation,
yielding a physically realizable, regular mapping from sensor data to the
source field. We propose a staggered scheme that alternates between evaluating
the digital twin and applying Lloyd’s algorithm to guide sensor placement, with
analysis providing conditions for monotone improvement of a coverage
functional. Using the predicted source field as an importance function within
an optimal-recovery scheme, we demonstrate recovery of point sources under
continuity assumptions, highlighting the role of regularity as a sufficient
condition for localization. Experimental comparisons with physics-agnostic
transformer architectures show improved accuracy in complex geometries when
physical constraints are enforced, indicating that structure preservation
provides an effective inductive bias for source identification.
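The sensor-placement half of the staggered scheme can be sketched with a
discrete, importance-weighted Lloyd iteration. In the paper the importance
function is the source field predicted by the digital twin; here a fixed
Gaussian bump stands in for it, and the grid, function names, and parameters
are assumptions for illustration only.

```python
import math

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def coverage_cost(points, weights, sensors):
    """Importance-weighted squared distance from each point to its nearest sensor."""
    return sum(w * min(dist2(p, s) for s in sensors) for p, w in zip(points, weights))

def lloyd_step(points, weights, sensors):
    """Move each sensor to the weighted centroid of the points nearest to it."""
    acc = [[0.0, 0.0, 0.0] for _ in sensors]  # sum(w*x), sum(w*y), sum(w)
    for p, w in zip(points, weights):
        k = min(range(len(sensors)), key=lambda i: dist2(p, sensors[i]))
        acc[k][0] += w * p[0]
        acc[k][1] += w * p[1]
        acc[k][2] += w
    # A sensor whose cell carries no weight stays where it is.
    return [(a[0] / a[2], a[1] / a[2]) if a[2] > 0 else s for a, s in zip(acc, sensors)]

# Stand-in importance field: a bump around a hypothetical source at (0.7, 0.3).
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
importance = [math.exp(-dist2(p, (0.7, 0.3)) / 0.02) for p in grid]
sensors = [(0.1, 0.1), (0.9, 0.9), (0.1, 0.9)]
costs = [coverage_cost(grid, importance, sensors)]
for _ in range(10):
    sensors = lloyd_step(grid, importance, sensors)
    costs.append(coverage_cost(grid, importance, sensors))
```

With the importance field held fixed, each step cannot increase the coverage
cost, which mirrors the monotone-improvement property the abstract analyzes;
in the full scheme the field would be re-evaluated by the twin between steps.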
[LINK]
http://arxiv.org/abs/2509.10363v1
[DATE]
2025-09-12 23:54:13+08:00
[CATEGORIES]
cs.LG
Unveiling Group-Specific Distributed Concept Drift: A Fairness Imperative in Federated Learning
[AUTHORS]
Teresa Salazar, João Gama, Helder Araújo, Pedro Henriques Abreu
[ABSTRACT]
In the evolving field of machine learning, ensuring group fairness has become
a critical concern, prompting the development of algorithms designed to
mitigate bias in decision-making processes. Group fairness refers to the
principle that a model’s decisions should be equitable across different groups
defined by sensitive attributes such as gender or race, ensuring that
individuals from privileged groups and unprivileged groups are treated fairly
and receive similar outcomes. However, achieving fairness in the presence of
group-specific concept drift remains an unexplored frontier, and our research
represents pioneering efforts in this regard. Group-specific concept drift
refers to situations where one group experiences concept drift over time while
another does not, leading to a decrease in fairness even if accuracy remains
fairly stable. Within the framework of Federated Learning, where clients
collaboratively train models, its distributed nature further amplifies these
challenges since each client can experience group-specific concept drift
independently while still sharing the same underlying concept, creating a
complex and dynamic environment for maintaining fairness. The most significant
contribution of our research is the formalization and introduction of the
problem of group-specific concept drift and its distributed counterpart,
shedding light on its critical importance in the field of fairness.
Additionally, leveraging insights from prior research, we adapt an existing
distributed concept drift adaptation algorithm to tackle group-specific
distributed concept drift which uses a multi-model approach, a local
group-specific drift detection mechanism, and continuous clustering of models
over time. The findings from our experiments highlight the importance of
addressing group-specific concept drift and its distributed counterpart to
advance fairness in machine learning.
[COMMENTS]
accepted for publication in IEEE Transactions on Neural Networks and
Learning Systems (early access, Sep. 2025)
[LINK]
http://arxiv.org/abs/2402.07586v4
[DATE]
2025-09-12 23:26:26+08:00
[CATEGORIES]
cs.LG
Why does your graph neural network fail on some graphs? Insights from exact generalisation error
[AUTHORS]
Nil Ayday, Mahalakshmi Sabanayagam, Debarghya Ghoshdastidar
[ABSTRACT]
Graph Neural Networks (GNNs) are widely used in learning on graph-structured
data, yet a principled understanding of why they succeed or fail remains
elusive. While prior works have examined architectural limitations such as
over-smoothing and over-squashing, these do not explain what enables GNNs to
extract meaningful representations or why performance varies drastically
between similar architectures. These questions are related to the role of
generalisation: the ability of a model to make accurate predictions on
unlabelled data. Although several works have derived generalisation error
bounds for GNNs, these are typically loose, restricted to a single
architecture, and offer limited insight into what governs generalisation in
practice. In this work, we take a different approach by deriving the exact
generalisation error for GNNs in a transductive fixed-design setting through
the lens of signal processing. From this viewpoint, GNNs can be interpreted as
graph filter operators that act on node features via the graph structure. By
focusing on linear GNNs while allowing non-linearity in the graph filters, we
derive the first exact generalisation error for a broad range of GNNs,
including convolutional, PageRank-based, and attention-based models. The exact
characterisation of the generalisation error reveals that only the aligned
information between node features and graph structure contributes to
generalisation. Furthermore, we quantify the effect of homophily on
generalisation. Our work provides a framework that explains when and why GNNs
can effectively leverage structural and feature information, offering practical
guidance for model selection.
[LINK]
http://arxiv.org/abs/2509.10337v1
[DATE]
2025-09-12 23:18:36+08:00
[CATEGORIES]
cs.LG
I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation
[AUTHORS]
Jordan Sassoon, Michal Szczepanski, Martyna Poreba
[LINK]
http://arxiv.org/abs/2509.10334v1
[DATE]
2025-09-12 23:14:19+08:00
[CATEGORIES]
cs.LG
ARMA Block: A CNN-Based Autoregressive and Moving Average Module for Long-Term Time Series Forecasting
[AUTHORS]
Myung Jin Kim, YeongHyeon Park, Il Dong Yun
[LINK]
http://arxiv.org/abs/2509.10324v1
[DATE]
2025-09-12 23:03:49+08:00
[CATEGORIES]
cs.LG
SME-TEAM: Leveraging Trust and Ethics for Secure and Responsible Use of AI and LLMs in SMEs
[AUTHORS]
Iqbal H. Sarker, Helge Janicke, Ahmad Mohsin, Leandros Maglaras
[ABSTRACT]
Artificial Intelligence (AI) and Large Language Models (LLMs) are reshaping
today’s business practices; however, their adoption within small and
medium-sized enterprises (SMEs) raises significant technical, ethical and trust
issues. This paper proposes a structured, multi-phased framework designed to
embed trust and ethical principles throughout the AI lifecycle for their secure
and responsible use in SMEs. Structured around four pillars, i.e., Data,
Algorithms, Human oversight, and Model Architecture, the framework bridges
theoretical ethical principles with operational practice, enhancing AI
capabilities in diverse SME applications. Ultimately, this paper offers a
structured roadmap for responsible AI adoption, framing trust and ethics as a
catalyst for resilience, competitiveness, and sustainable innovation in SMEs.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2509.10594v1
[DATE]
2025-09-12 22:59:52+08:00
[CATEGORIES]
cs.LG
Robot guide with multi-agent control and automatic scenario generation with LLM
[AUTHORS]
Elizaveta D. Moskovskaya, Anton D. Moscowsky
[ABSTRACT]
The work describes the development of a hybrid control architecture for an
anthropomorphic tour guide robot, combining a multi-agent resource management
system with automatic behavior scenario generation based on large language
models. The proposed approach aims to overcome the limitations of traditional
systems, which rely on manual tuning of behavior scenarios. These limitations
include manual configuration, low flexibility, and lack of naturalness in robot
behavior. The process of preparing tour scenarios is implemented through a
two-stage generation: first, a stylized narrative is created, then non-verbal
action tags are integrated into the text. The multi-agent system ensures
coordination and conflict resolution during the execution of parallel actions,
as well as maintaining default behavior after the completion of main
operations, contributing to more natural robot behavior. The results obtained
from the trial demonstrate the potential of the proposed approach for
automating and scaling social robot control systems.
[COMMENTS]
14 pages, 5 figures, 2 tables, 1 demo-video and repository link
[LINK]
http://arxiv.org/abs/2509.10317v1
[DATE]
2025-09-12 22:59:04+08:00
[CATEGORIES]
cs.LG
GraphCSVAE: Graph Categorical Structured Variational Autoencoder for Spatiotemporal Auditing of Physical Vulnerability Towards Sustainable Post-Disaster Risk Reduction
[AUTHORS]
Joshua Dimasaka, Christian Geiß, Robert Muir-Wood, Emily So
[ABSTRACT]
In the aftermath of disasters, many institutions worldwide face challenges in
continually monitoring changes in disaster risk, limiting the ability of key
decision-makers to assess progress towards the UN Sendai Framework for Disaster
Risk Reduction 2015-2030. While numerous efforts have substantially advanced
the large-scale modeling of hazard and exposure through Earth observation and
data-driven methods, progress remains limited in modeling another equally
important yet challenging element of the risk equation: physical vulnerability.
To address this gap, we introduce Graph Categorical Structured Variational
Autoencoder (GraphCSVAE), a novel probabilistic data-driven framework for
modeling physical vulnerability by integrating deep learning, graph
representation, and categorical probabilistic inference, using time-series
satellite-derived datasets and prior expert belief systems. We introduce a
weakly supervised first-order transition matrix that reflects the changes in
the spatiotemporal distribution of physical vulnerability in two
disaster-stricken and socioeconomically disadvantaged areas: (1) the
cyclone-impacted coastal Khurushkul community in Bangladesh and (2) the
mudslide-affected city of Freetown in Sierra Leone. Our work reveals
post-disaster regional dynamics in physical vulnerability, offering valuable
insights into localized spatiotemporal auditing and sustainable strategies for
post-disaster risk reduction.
[COMMENTS]
Accepted full paper at the 8th International Disaster and Risk
Conference, IDRC 2025 | Keywords: weakly supervised, graph deep learning,
categorical distribution, physical vulnerability, remote sensing,
spatiotemporal disaster risk, transition matrix | The data and code are
respectively available at https://doi.org/10.5281/zenodo.16656471 and
https://github.com/riskaudit/GraphCSVAE
[LINK]
http://arxiv.org/abs/2509.10308v1
[DATE]
2025-09-12 22:50:56+08:00
[CATEGORIES]
cs.LG
Data-Driven Discovery of Mobility Periodicity for Understanding Urban Systems
[AUTHORS]
Xinyu Chen, Qi Wang, Yunhan Zheng, Nina Cao, HanQin Cai, Jinhua Zhao
[ABSTRACT]
Human mobility regularity is crucial for understanding urban dynamics and
informing decision-making processes. This study first quantifies the
periodicity in complex human mobility data as a sparse identification of
dominant positive auto-correlations in time series autoregression and then
discovers periodic patterns. We apply the framework to large-scale metro
passenger flow data in Hangzhou, China and multi-modal mobility data in New
York City and Chicago, USA, revealing the interpretable weekly periodicity
across different spatial locations over the past several years. The analysis of
ridesharing data from 2019 to 2024 demonstrates the disruptive impact of the
pandemic on mobility regularity and the subsequent recovery trends. In 2024,
the periodic mobility patterns of ridesharing, taxi, subway, and bikesharing in
Manhattan uncover the regularity and variability of these travel modes. Our
findings highlight the potential of interpretable machine learning to discover
spatiotemporal mobility patterns and offer a valuable tool for understanding
urban systems.
[LINK]
http://arxiv.org/abs/2508.03747v2
[DATE]
2025-09-12 22:48:48+08:00
[CATEGORIES]
cs.LG
Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data
[AUTHORS]
Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang
[ABSTRACT]
The Job-Shop Scheduling Problem (JSP) and Flexible Job-Shop Scheduling
Problem (FJSP) are canonical combinatorial optimization problems with
wide-ranging applications in industrial operations. In recent years, many
online reinforcement learning (RL) approaches have been proposed to learn
constructive heuristics for JSP and FJSP. Although effective, these online RL
methods require millions of interactions with simulated environments that may
not capture real-world complexities, and their random policy initialization
leads to poor sample efficiency. To address these limitations, we introduce
Conservative Discrete Quantile Actor-Critic (CDQAC), a novel offline RL
algorithm that learns effective scheduling policies directly from historical
data, eliminating the need for costly online interactions, while maintaining
the ability to improve upon suboptimal training data. CDQAC couples a
quantile-based critic with a delayed policy update, estimating the return
distribution of each machine-operation pair rather than selecting pairs
outright. Our extensive experiments demonstrate CDQAC’s remarkable ability to
learn from diverse data sources. CDQAC consistently outperforms the original
data-generating heuristics and surpasses state-of-the-art offline and online RL
baselines. In addition, CDQAC is highly sample efficient, requiring only 10-20
training instances to learn high-quality policies. Surprisingly, we find that
CDQAC performs better when trained on data generated by a random heuristic than
when trained on higher-quality data from genetic algorithms and priority
dispatching rules.
[LINK]
http://arxiv.org/abs/2509.10303v1
[DATE]
2025-09-12 22:45:39+08:00
[CATEGORIES]
cs.LG
On Regression in Extreme Regions
[AUTHORS]
Stephan Clémençon, Nathan Huet, Anne Sabourin
[ABSTRACT]
We establish a statistical learning theoretical framework aimed at
extrapolation, or out-of-domain generalization, on the unobserved tails of
covariates in continuous regression problems. Our strategy involves performing
statistical regression on a subsample of observations with continuous labels
that are the furthest away from the origin, focusing specifically on their
angular components. The underlying assumptions of our approach are grounded in
the theory of multivariate regular variation, a cornerstone of extreme value
theory. We address the stylized problem of nonparametric least squares
regression with predictors chosen from a Vapnik-Chervonenkis class.
This work contributes to a broader initiative to develop statistical learning
theoretical foundations for supervised learning strategies that enhance
performance on the supposedly heavy tails of covariates. Previous efforts in
this area have focused exclusively on binary classification on extreme
covariates. Although the continuous target setting necessitates different
techniques and regularity assumptions, our main results echo findings from
earlier studies. We quantify the predictive performance on tail regions in
terms of excess risk, presenting it as a finite sample risk bound with a clear
bias-variance decomposition. Numerical experiments with simulated and real data
illustrate our theoretical findings.
[COMMENTS]
30 pages (main paper), 12 pages (appendix), 3 figures, 2 tables.
Accepted for publication in EJS
[LINK]
http://arxiv.org/abs/2303.03084v3
[DATE]
2025-09-12 22:44:04+08:00
[CATEGORIES]
cs.LG
Kriging prior Regression: A Case for Kriging-Based Spatial Features with TabPFN in Soil Mapping
[AUTHORS]
Jonas Schmidinger, Viacheslav Barkov, Sebastian Vogel, Martin Atzmueller, Gerard B M Heuvelink
[ABSTRACT]
Machine learning and geostatistics are two fundamentally different frameworks
for predicting and spatially mapping soil properties. Geostatistics leverages
the spatial structure of soil properties, while machine learning captures the
relationship between available environmental features and soil properties. We
propose a hybrid framework that enriches ML with spatial context through
engineering of ‘spatial lag’ features from ordinary kriging. We call this
approach ‘kriging prior regression’ (KpR), as it follows the inverse logic of
regression kriging. To evaluate this approach, we assessed both the point and
probabilistic prediction performance of KpR, using the TabPFN model across six
fieldscale datasets from LimeSoDa. These datasets included soil organic carbon,
clay content, and pH, along with features derived from remote sensing and
in-situ proximal soil sensing. KpR with TabPFN demonstrated reliable
uncertainty estimates and more accurate predictions in comparison to several
other spatial techniques (e.g., regression/residual kriging with TabPFN), as
well as to established non-spatial machine learning algorithms (e.g., random
forest). Most notably, it significantly improved the average $R^2$ by around 30%
compared to machine learning algorithms without spatial context. This
improvement was due to the strong prediction performance of the TabPFN
algorithm itself and the complementary spatial information provided by KpR
features. TabPFN is particularly effective for prediction tasks with small
sample sizes, common in precision agriculture, whereas KpR can compensate for
weak relationships between sensing features and soil properties when proximal
soil sensing data are limited. Hence, we conclude that KpR with TabPFN is a
very robust and versatile modelling framework for digital soil mapping in
precision agriculture.
[LINK]
http://arxiv.org/abs/2509.09408v2
[DATE]
2025-09-12 22:31:32+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Salih Toprak, Muge Erel-Ozcevik
[ABSTRACT]
In disaster scenarios where conventional energy infrastructure is
compromised, secure and traceable energy trading between solar-powered
households and mobile charging units becomes a necessity. To ensure the
integrity of such transactions over a blockchain network, robust and
unpredictable nonce generation is vital. This study proposes an SDN-enabled
architecture where machine learning regressors are leveraged not for their
accuracy, but for their potential to generate randomized values suitable as
nonce candidates. We therefore term the approach Proof of AutoML. Here, SDN
allows flexible control over data flows and energy routing policies even in
fragmented or degraded networks, ensuring adaptive response during emergencies.
Using a 9000-sample dataset, we evaluate five AutoML-selected regression
models - Gradient Boosting, LightGBM, Random Forest, Extra Trees, and K-Nearest
Neighbors - not by their prediction accuracy, but by their ability to produce
diverse and non-deterministic outputs across shuffled data inputs. Randomness
analysis reveals that Random Forest and Extra Trees regressors exhibit complete
dependency on randomness, whereas Gradient Boosting, K-Nearest Neighbors and
LightGBM show strong but slightly lower randomness scores (97.6%, 98.8% and
99.9%, respectively). These findings highlight that certain machine learning
models, particularly tree-based ensembles, may serve as effective and
lightweight nonce generators within blockchain-secured, SDN-based energy
trading infrastructures resilient to disaster conditions.
[COMMENTS]
6 pages, 3 figures, 7th International Conference on Blockchain
Computing and Applications (BCCA 2025), ©2025 IEEE
[LINK]
http://arxiv.org/abs/2509.10291v1
[DATE]
2025-09-12 22:30:18+08:00
[CATEGORIES]
cs.LG
Property prediction for ionic liquids without prior structural knowledge using limited experimental data: A data-driven neural recommender system leveraging transfer learning
[AUTHORS]
Sahil Sethi, Kai Sundmacher, Caroline Ganzer
[ABSTRACT]
Ionic liquids (ILs) have emerged as versatile replacements for traditional
solvents because their physicochemical properties can be precisely tailored to
various applications. However, accurately predicting key thermophysical
properties remains challenging due to the vast chemical design space and the
limited availability of experimental data. In this study, we present a
data-driven transfer learning framework that leverages a neural recommender
system (NRS) to enable reliable property prediction for ILs using sparse
experimental datasets. The approach involves a two-stage process: first,
pre-training NRS models on COSMO-RS-based simulated data at fixed temperature
and pressure to learn property-specific structural embeddings for cations and
anions; and second, fine-tuning simple feedforward neural networks using these
embeddings with experimental data at varying temperatures and pressures. In
this work, five essential IL properties are considered: density, viscosity,
surface tension, heat capacity, and melting point. The framework supports both
within-property and cross-property knowledge transfer. Notably, pre-trained
models for density, viscosity, and heat capacity are used to fine-tune models
for all five target properties, achieving improved performance by a substantial
margin for four of them. The model exhibits robust extrapolation to previously
unseen ILs. Moreover, the final trained models enable property prediction for
over 700,000 IL combinations, offering a scalable solution for IL screening in
process design. This work highlights the effectiveness of combining simulated
data and transfer learning to overcome sparsity in the experimental data.
[LINK]
http://arxiv.org/abs/2509.10273v1
[DATE]
2025-09-12 22:13:31+08:00
[CATEGORIES]
cs.LG
Uncertainty Modeling in Graph Neural Networks via Stochastic Differential Equations
[AUTHORS]
Richard Bergna, Sergio Calvo-Ordoñez, Felix L. Opolka, Pietro Liò, Jose Miguel Hernandez-Lobato
[ABSTRACT]
We propose a novel Stochastic Differential Equation (SDE) framework to
address the problem of learning uncertainty-aware representations for
graph-structured data. While Graph Neural Ordinary Differential Equations
(GNODEs) have shown promise in learning node representations, they lack the
ability to quantify uncertainty. To address this, we introduce Latent Graph
Neural Stochastic Differential Equations (LGNSDE), which enhance GNODE by
embedding randomness through a Bayesian prior-posterior mechanism for epistemic
uncertainty and Brownian motion for aleatoric uncertainty. By leveraging the
existence and uniqueness of solutions to graph-based SDEs, we prove that the
variance of the latent space bounds the variance of model outputs, thereby
providing theoretically sensible guarantees for the uncertainty estimates.
Furthermore, we show mathematically that LGNSDEs are robust to small
perturbations in the input, maintaining stability over time. Empirical results
across several benchmarks demonstrate that our framework is competitive in
out-of-distribution detection, robustness to noise, and active learning,
underscoring the ability of LGNSDEs to quantify uncertainty reliably. Code is
available at
https://github.com/Richard-Bergna/GraphNeuralSDE.
[COMMENTS]
Accepted at ICLR 2025 as Spotlight. 18 pages including appendix
[LINK]
http://arxiv.org/abs/2408.16115v5
[DATE]
2025-09-12 22:10:50+08:00
[CATEGORIES]
cs.LG
Representation Learning on Large Non-Bipartite Transaction Networks using GraphSAGE
[AUTHORS]
Mihir Tare, Clemens Rattasits, Yiming Wu, Euan Wielewski
[ABSTRACT]
Financial institutions increasingly require scalable tools to analyse complex
transactional networks, yet traditional graph embedding methods struggle with
dynamic, real-world banking data. This paper demonstrates the practical
application of GraphSAGE, an inductive Graph Neural Network framework, to
non-bipartite heterogeneous transaction networks within a banking context.
Unlike transductive approaches, GraphSAGE scales well to large networks and can
generalise to unseen nodes which is critical for institutions working with
temporally evolving transactional data. We construct a transaction network
using anonymised customer and merchant transactions and train a GraphSAGE model
to generate node embeddings. Our exploratory work on the embeddings reveals
interpretable clusters aligned with geographic and demographic attributes.
Additionally, we illustrate their utility in downstream classification tasks by
applying them to a money mule detection model where using these embeddings
improves the prioritisation of high-risk accounts. Beyond fraud detection, our
work highlights the adaptability of this framework to banking-scale networks,
emphasising its inductive capability, scalability, and interpretability. This
study provides a blueprint for financial organisations to harness graph machine
learning for actionable insights in transactional ecosystems.
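The inductive property highlighted above can be made concrete with a minimal
sketch of a single GraphSAGE layer using the mean aggregator. The abstract
does not say which aggregator or dimensions the authors used, so the mean
aggregator, the toy features, and the hand-written weight matrix below are
assumptions for illustration.

```python
import math

def sage_mean_layer(features, neighbors, weight):
    """One GraphSAGE layer with mean aggregation.

    features:  dict node -> list of floats (input embedding)
    neighbors: dict node -> list of neighbor nodes
    weight:    list of rows, shape (out_dim, 2 * in_dim), shared by all nodes
    """
    out = {}
    for v, h_v in features.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            agg = [sum(features[u][d] for u in nbrs) / len(nbrs) for d in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)
        z = h_v + agg  # concatenate self features with the neighbor mean
        h = [max(0.0, sum(w * a for w, a in zip(row, z))) for row in weight]  # ReLU
        norm = math.sqrt(sum(a * a for a in h))
        out[v] = [a / norm for a in h] if norm > 0 else h  # L2-normalize
    return out

# Toy graph: the same weight matrix embeds node "d", which could have been
# added after training, because the layer depends only on local features.
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [2.0, 0.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "d": ["b"]}
weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
embeddings = sage_mean_layer(features, neighbors, weight)
```

Since the learned parameters are the shared weight matrix rather than a
per-node embedding table, new customers or merchants can be embedded without
retraining, which is the contrast with transductive methods drawn in the
abstract.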
[LINK]
http://arxiv.org/abs/2509.12255v1
[DATE]
2025-09-12 22:09:16+08:00
[CATEGORIES]
cs.LG
LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
[AUTHORS]
Yuhang Huang, Jiazhao Zhang, Shilong Zou, Xinwang Liu, Ruizhen Hu, Kai Xu
[ABSTRACT]
Predictive manipulation has recently gained considerable attention in the
Embodied AI community due to its potential to improve robot policy performance
by leveraging predicted states. However, generating accurate future visual
states of robot-object interactions from world models remains a well-known
challenge, particularly in achieving high-quality pixel-level representations.
To this end, we propose LaDi-WM, a world model that predicts the latent space
of future states using diffusion modeling. Specifically, LaDi-WM leverages the
well-established latent space aligned with pre-trained Visual Foundation Models
(VFMs), which comprises both geometric features (DINO-based) and semantic
features (CLIP-based). We find that predicting the evolution of the latent
space is easier to learn and more generalizable than directly predicting
pixel-level images. Building on LaDi-WM, we design a diffusion policy that
iteratively refines output actions by incorporating forecasted states, thereby
generating more consistent and accurate results. Extensive experiments on both
synthetic and real-world benchmarks demonstrate that LaDi-WM significantly
enhances policy performance by 27.9% on the LIBERO-LONG benchmark and 20% on
the real-world scenario. Furthermore, our world model and policies achieve
impressive generalizability in real-world experiments.
[COMMENTS]
CoRL 2025
[LINK]
http://arxiv.org/abs/2505.11528v6
[DATE]
2025-09-12 21:58:52+08:00
[CATEGORIES]
cs.LG
Multi-Turn Human-LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol
[AUTHORS]
Harshvardhan Mestha, Karan Bania, Shreyas V, Sidong Liu, Ashwin Srinivasan
[ABSTRACT]
Our interest is in the design of software systems involving a human-expert
interacting – using natural language – with a large language model (LLM) on
data analysis tasks. For complex problems, it is possible that LLMs can harness
human expertise and creativity to find solutions that were otherwise elusive.
On one level, this interaction takes place through multiple turns of prompts
from the human and responses from the LLM. Here we investigate a more
structured approach based on an abstract protocol described in [3] for
interaction between agents. The protocol is motivated by a notion of “two-way
intelligibility” and is modelled by a pair of communicating finite-state
machines. We provide an implementation of the protocol, and provide empirical
evidence of using the implementation to mediate interactions between an LLM and
a human-agent in two areas of scientific interest (radiology and drug design).
We conduct controlled experiments with a human proxy (a database), and
uncontrolled experiments with human subjects. The results provide evidence in
support of the protocol’s capability of capturing one- and two-way
intelligibility in human-LLM interaction; and for the utility of two-way
intelligibility in the design of human-machine systems.
[LINK]
http://arxiv.org/abs/2410.20600v2
[DATE]
2025-09-12 21:52:45+08:00
[CATEGORIES]
cs.LG
Space Group Informed Transformer for Crystalline Materials Generation
[AUTHORS]
Zhendong Cao, Xiaoshan Luo, Jian Lv, Lei Wang
[ABSTRACT]
We introduce CrystalFormer, a transformer-based autoregressive model
specifically designed for space group-controlled generation of crystalline
materials. By explicitly incorporating space group symmetry, CrystalFormer
greatly reduces the effective complexity of crystal space, which is essential
for data- and compute-efficient generative modeling of crystalline materials.
Leveraging the prominent discrete and sequential nature of the Wyckoff
positions, CrystalFormer learns to generate crystals by directly predicting the
species and coordinates of symmetry-inequivalent atoms in the unit cell. We
demonstrate the advantages of CrystalFormer in standard tasks such as symmetric
structure initialization and element substitution over widely used conventional
approaches. Furthermore, we showcase its plug-and-play application to
property-guided materials design, highlighting its flexibility. Our analysis
reveals that CrystalFormer ingests sensible solid-state chemistry knowledge and
heuristics by compressing the material dataset, thus enabling systematic
exploration of crystalline materials space. The simplicity, generality, and
adaptability of CrystalFormer position it as a promising architecture to be the
foundational model of the entire crystalline materials space, heralding a new
era in materials discovery and design.
[COMMENTS]
29 pages, 12 figures
[LINK]
http://arxiv.org/abs/2403.15734v3
[DATE]
2025-09-12 21:52:28+08:00
[CATEGORIES]
cs.LG
Neural Force Field: Few-shot Learning of Generalized Physical Reasoning
[AUTHORS]
Shiqian Li, Ruihong Shen, Yaoyu Tao, Chi Zhang, Yixin Zhu
[ABSTRACT]
Physical reasoning is a remarkable human ability that enables rapid learning
and generalization from limited experience. Current AI models, despite
extensive training, still struggle to achieve similar generalization,
especially in out-of-distribution (OOD) settings. This limitation stems from
their inability to abstract core physical principles from observations. A key
challenge is developing representations that can efficiently learn and
generalize physical dynamics from minimal data. Here we present Neural Force
Field (NFF), a framework extending Neural Ordinary Differential Equation (NODE)
to learn complex object interactions through force field representations, which
can be efficiently integrated through an Ordinary Differential Equation (ODE)
solver to predict object trajectories. Unlike existing approaches that rely on
discrete latent spaces, NFF captures fundamental physical concepts such as
gravity, support, and collision in continuous explicit force fields.
Experiments on three challenging physical reasoning tasks demonstrate that NFF,
trained with only a few examples, achieves strong generalization to unseen
scenarios. This physics-grounded representation enables efficient
forward-backward planning and rapid adaptation through interactive refinement.
Our work suggests that incorporating physics-inspired representations into
learning systems can help bridge the gap between artificial and human physical
reasoning capabilities.
[COMMENTS]
31 pages
[LINK]
http://arxiv.org/abs/2502.08987v4
[DATE]
2025-09-12 21:15:39+08:00
[CATEGORIES]
cs.LG
Investigating Feature Attribution for 5G Network Intrusion Detection
[AUTHORS]
Federica Uccello, Simin Nadjm-Tehrani
[ABSTRACT]
With the rise of fifth-generation (5G) networks in critical applications, it
is urgent to move from detection of malicious activity to systems capable of
providing a reliable verdict suitable for mitigation. In this regard,
understanding and interpreting machine learning (ML) models’ security alerts is
crucial for enabling actionable incident response orchestration. Explainable
Artificial Intelligence (XAI) techniques are expected to enhance trust by
providing insights into why alerts are raised. A dominant approach
statistically associates feature sets that can be correlated to a given alert.
This paper starts by questioning whether such attribution is relevant for
future generation communication systems, and investigates its merits in
comparison with an approach based on logical explanations. We extensively study
two methods, SHAP and VoTE-XAI, by analyzing their interpretations of alerts
generated by an XGBoost model in three different use cases with several 5G
communication attacks. We identify three metrics for assessing explanations:
sparsity, how concise they are; stability, how consistent they are across
samples from the same attack type; and efficiency, how fast an explanation is
generated. As an example, in a 5G network with 92 features, 6 were deemed
important by VoTE-XAI for a Denial of Service (DoS) variant, ICMPFlood, while
SHAP identified over 20. More importantly, we found a significant divergence
between features selected by SHAP and VoTE-XAI. However, none of the top-ranked
features selected by SHAP were missed by VoTE-XAI. When it comes to efficiency
of providing interpretations, we found that VoTE-XAI is significantly more
responsive, e.g. it provides a single explanation in under 0.002 seconds, in a
high-dimensional setting (478 features).
[LINK]
http://arxiv.org/abs/2509.10206v1
[DATE]
2025-09-12 20:55:48+08:00
[CATEGORIES]
cs.LG
Steering Protein Language Models
[AUTHORS]
Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao
[ABSTRACT]
Protein Language Models (PLMs), pre-trained on extensive evolutionary data
from natural proteins, have emerged as indispensable tools for protein design.
While powerful, PLMs often struggle to produce proteins with precisely
specified functionalities or properties due to inherent challenges in
controlling their outputs. In this work, we investigate the potential of
Activation Steering, a technique originally developed for controlling text
generation in Large Language Models (LLMs), to direct PLMs toward generating
protein sequences with targeted properties. We propose a simple yet effective
method that employs activation editing to steer PLM outputs, and extend this
approach to protein optimization through a novel editing site identification
module. Through comprehensive experiments on lysozyme-like sequence generation
and optimization, we demonstrate that our methods can be seamlessly integrated
into both auto-encoding and autoregressive PLMs without requiring additional
training. These results highlight a promising direction for precise protein
engineering using foundation models.
[COMMENTS]
Accepted to ICML 2025
[LINK]
http://arxiv.org/abs/2509.07983v2
[DATE]
2025-09-12 20:39:45+08:00
[CATEGORIES]
cs.LG
DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy
[AUTHORS]
Frederik L. Dennig, Daniel A. Keim
[ABSTRACT]
Recently, autoencoders (AEs) have gained interest for creating parametric and
invertible projections of multidimensional data. Parametric projections make it
possible to embed new, unseen samples without recalculating the entire
projection, while invertible projections allow the synthesis of new data
instances. However, existing methods perform poorly when dealing with
out-of-distribution samples in either the data or embedding space. Thus, we
propose DE-VAE, an uncertainty-aware variational AE using differential entropy
(DE) to improve the learned parametric and invertible projections. Given a
fixed projection, we train DE-VAE to learn a mapping into 2D space and an
inverse mapping back to the original space. We conduct quantitative and
qualitative evaluations on four well-known datasets, using UMAP and t-SNE as
baseline projection methods. Our findings show that DE-VAE can create
parametric and inverse projections with comparable accuracy to other current
AE-based approaches while enabling the analysis of embedding uncertainty.
[COMMENTS]
5 pages, 3 figures, LaTeX; fixed typos; to appear at the 2025 IEEE
Workshop on Uncertainty Visualization
[LINK]
http://arxiv.org/abs/2508.12145v3
[DATE]
2025-09-12 20:37:11+08:00
[CATEGORIES]
cs.LG
Hadamard-Riemannian Optimization for Margin-Variance Ensemble
[AUTHORS]
Zexu Jin
[ABSTRACT]
Ensemble learning has been widely recognized as a pivotal technique for
boosting predictive performance by combining multiple base models.
Nevertheless, conventional margin-based ensemble methods predominantly focus on
maximizing the expected margin while neglecting the critical role of margin
variance, which inherently restricts the generalization capability of the model
and heightens its vulnerability to overfitting, particularly in noisy or
imbalanced datasets. Additionally, the conventional approach of optimizing
ensemble weights within the probability simplex often introduces computational
inefficiency and scalability challenges, complicating its application to
large-scale problems. To tackle these limitations, this paper introduces a
novel ensemble learning framework that explicitly incorporates margin variance
into the loss function. Our method jointly optimizes the negative expected
margin and its variance, leading to enhanced robustness and improved
generalization performance. Moreover, by reparameterizing the ensemble weights
onto the unit sphere, we substantially simplify the optimization process and
improve computational efficiency. Extensive experiments conducted on multiple
benchmark datasets demonstrate that the proposed approach consistently
outperforms traditional margin-based ensemble techniques, underscoring its
effectiveness and practical utility.
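
One way to read the reparameterization described above is sketched below,
under stated assumptions rather than as the paper's algorithm. The sketch
assumes binary labels in {-1, +1}, a weight vector written as the elementwise
(Hadamard) square w = v * v of a unit vector v so that w lies on the
probability simplex, a loss of the form -mean(margin) + lam * var(margin), and
plain Riemannian gradient descent on the sphere. The function name
fit_sphere_weights, the hyperparameters, and the synthetic data are
hypothetical.

```python
import numpy as np

def fit_sphere_weights(H, y, lam=1.0, lr=0.1, steps=500):
    """Minimise -mean(margin) + lam * var(margin) with w = v * v, ||v|| = 1."""
    A = y[:, None] * H                       # per-sample, per-model margins
    a_bar = A.mean(axis=0)                   # mean margin of each base model
    C = np.cov(A, rowvar=False, bias=True)   # margin covariance across models
    m = H.shape[1]
    v = np.full(m, 1.0 / np.sqrt(m))         # start from uniform weights
    for _ in range(steps):
        w = v * v
        grad_w = -a_bar + 2.0 * lam * (C @ w)   # gradient with respect to w
        grad_v = 2.0 * v * grad_w               # chain rule through w = v * v
        grad_v -= (grad_v @ v) * v              # project onto the tangent space
        v = v - lr * grad_v
        v /= np.linalg.norm(v)                  # retract back onto the sphere
    return v * v

# Hypothetical data: three base classifiers of decreasing quality.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
H = np.sign(y[:, None] + rng.normal(scale=[0.5, 1.0, 3.0], size=(200, 3)))
weights = fit_sphere_weights(H, y, lam=0.5)
```

The squared unit vector is non-negative and sums to one by construction, so no
explicit projection onto the simplex is needed during optimisation.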
[LINK]
http://arxiv.org/abs/2509.10189v1
[DATE]
2025-09-12 20:28:39+08:00
[CATEGORIES]
cs.LG
P3D: Scalable Neural Surrogates for High-Resolution 3D Physics Simulations with Global Context
[AUTHORS]
Benjamin Holzschuh, Georg Kohl, Florian Redinger, Nils Thuerey
[ABSTRACT]
We present a scalable framework for learning deterministic and probabilistic
neural surrogates for high-resolution 3D physics simulations. We introduce a
hybrid CNN-Transformer backbone architecture targeted for 3D physics
simulations, which significantly outperforms existing architectures in terms of
speed and accuracy. Our proposed network can be pretrained on small patches of
the simulation domain, which can be fused to obtain a global solution,
optionally guided via a fast and scalable sequence-to-sequence model to include
long-range dependencies. This setup allows for training large-scale models with
reduced memory and compute requirements for high-resolution datasets. We
evaluate our backbone architecture against a large set of baseline methods with
the objective to simultaneously learn the dynamics of 14 different types of
PDEs in 3D. We demonstrate how to scale our model to high-resolution isotropic
turbulence with spatial resolutions of up to $512^3$. Finally, we demonstrate
the versatility of our network by training it as a diffusion model to produce
probabilistic samples of highly turbulent 3D channel flows across varying
Reynolds numbers, accurately capturing the underlying flow statistics.
[LINK]
http://arxiv.org/abs/2509.10186v1
[DATE]
2025-09-12 20:26:06+08:00
[CATEGORIES]
cs.LG
Physics-Informed Neural Networks vs. Physics Models for Non-Invasive Glucose Monitoring: A Comparative Study Under Realistic Synthetic Conditions
[AUTHORS]
Riyaadh Gani
[ABSTRACT]
Non-invasive glucose monitors often fail outside the lab because existing
datasets ignore hardware noise, environmental drift, and person-to-person
physiology. We introduce the first ultra-realistic near-infrared (NIR)
simulator that injects 12-bit ADC quantisation, +/-0.1% LED ageing, photodiode
dark noise, 15-45 C temperature, 30-90% relative humidity, contact-pressure
variation, Fitzpatrick I-VI melanin, and diurnal glucose excursions (dawn
phenomenon). Using this platform (rho glucose-NIR = 0.21), we benchmark six
methods: Enhanced Beer-Lambert (physics-engineered ridge regression), three
physics-informed neural networks (PINNs), a selective radiative-transfer PINN,
and a shallow DNN. Beer-Lambert achieves 13.6 mg/dL RMSE, 95.8% Clarke-A and
93.8% +/-15% accuracy with only 56 parameters and 0.01 ms inference,
outperforming the best PINN (14.6 mg/dL) and the SDNN baseline (35.1 mg/dL).
Results overturn the assumption that deeper PINNs dominate and supply an open,
end-to-end reference stack for rapid prototyping of embedded optical glucose
sensors.
[LINK]
http://arxiv.org/abs/2509.12253v1
[DATE]
2025-09-12 20:18:00+08:00
[CATEGORIES]
cs.LG
Building Age Estimation: A New Multi-Modal Benchmark Dataset and Community Challenge
[AUTHORS]
Nikolaos Dionelis, Alessandra Feliciotti, Mattia Marconcini, Devis Peressutti, Nika Oman Kadunc, JaeWan Park, Hagai Raja Sinulingga, Steve Andreas Immanuel, Ba Tran, Caroline Arnold, Nicolas Longépé
[ABSTRACT]
Estimating the construction year of buildings is critical for advancing
sustainability, as older structures often lack energy-efficient features.
Sustainable urban planning relies on accurate building age data to reduce
energy consumption and mitigate climate change. In this work, we introduce
MapYourCity, a novel multi-modal benchmark dataset comprising top-view Very
High Resolution (VHR) imagery, multi-spectral Earth Observation (EO) data from
the Copernicus Sentinel-2 satellite constellation, and co-localized street-view
images across various European cities. Each building is labeled with its
construction epoch, and the task is formulated as a seven-class classification
problem covering periods from 1900 to the present. To advance research in EO
generalization and multi-modal learning, we organized a community-driven data
challenge in 2024, hosted by ESA $\Phi$-lab, which ran for four months and
attracted wide participation.
This paper presents the Top-4 performing models from the challenge and their
evaluation results. We assess model generalization on cities excluded from
training to prevent data leakage, and evaluate performance under missing
modality scenarios, particularly when street-view data is unavailable. Results
demonstrate that building age estimation is both feasible and effective, even
in previously unseen cities and when relying solely on top-view satellite
imagery (i.e. with VHR and Sentinel-2 images). The MapYourCity dataset thus
provides a valuable resource for developing scalable, real-world solutions in
sustainable urban analytics.
[COMMENTS]
16 pages, 20 figures, 1 table, Submitted
[LINK]
http://arxiv.org/abs/2502.13818v4
[DATE]
2025-09-12 20:15:34+08:00
[CATEGORIES]
cs.LG
The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagrams
[AUTHORS]
Lénaïc Chizat
[ABSTRACT]
We study the gradient-based training of large-depth residual networks
(ResNets) from standard random initializations. We show that with a diverging
depth $L$, a fixed embedding dimension $D$, and an arbitrary hidden width $M$,
the training dynamics converges to a Neural Mean ODE training dynamics.
Remarkably, the limit is independent of the scaling of $M$, covering practical
cases of, say, Transformers, where $M$ (the number of hidden units or attention
heads per layer) is typically of the order of $D$. For a residual scale
$\Theta_D\big(\frac{\alpha}{LM}\big)$, we obtain the error bound
$O_D\big(\frac{1}{L}+ \frac{\alpha}{\sqrt{LM}}\big)$ between the model’s output
and its limit after a fixed number of gradient steps, and we verify empirically
that this rate is tight. When $\alpha=\Theta(1)$, the limit exhibits complete
feature learning, i.e. the Mean ODE is genuinely non-linearly parameterized. In
contrast, we show that $\alpha \to \infty$ yields a “lazy ODE” regime where the
Mean ODE is linearly parameterized. We then focus on the particular case of
ResNets with two-layer perceptron blocks, for which we study how these scalings
depend on the embedding dimension $D$. We show that for this model, the only
residual scale that leads to complete feature learning is
$\Theta\big(\frac{\sqrt{D}}{LM}\big)$. In this regime, we prove the error bound
$O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its
limit after a fixed number of gradient steps, which is also empirically tight.
Our convergence results rely on a novel mathematical perspective on ResNets:
(i) due to the randomness of the initialization, the forward and backward pass
through the ResNet behave as the stochastic approximation of certain mean ODEs,
and (ii) by propagation of chaos (that is, asymptotic independence of the
units) this behavior is preserved through the training dynamics.
[LINK]
http://arxiv.org/abs/2509.10167v1
[DATE]
2025-09-12 19:51:44+08:00
[CATEGORIES]
cs.LG
Diffusion Buffer: Online Diffusion-based Speech Enhancement with Sub-Second Latency
[AUTHORS]
Bunlong Lay, Rostislav Makarov, Timo Gerkmann
[ABSTRACT]
Diffusion models are a class of generative models that have been recently
used for speech enhancement with remarkable success but are computationally
expensive at inference time. Therefore, these models are impractical for
processing streaming data in real-time. In this work, we adapt a sliding window
diffusion framework to the speech enhancement task. Our approach progressively
corrupts speech signals through time, assigning more noise to frames close to
the present in a buffer. This approach outputs denoised frames with a delay
proportional to the chosen buffer size, enabling a trade-off between
performance and latency. Empirical results demonstrate that our method
outperforms standard diffusion models and runs efficiently on a GPU, achieving
an input-output latency on the order of 0.3 to 1 seconds. This marks the first
practical diffusion-based solution for online speech enhancement.
[COMMENTS]
5 pages, 2 figures, Accepted to Interspeech 2025
[LINK]
http://arxiv.org/abs/2506.02908v2
[DATE]
2025-09-12 19:49:57+08:00
[CATEGORIES]
cs.LG
Repulsive Monte Carlo on the sphere for the sliced Wasserstein distance
[AUTHORS]
Vladimir Petrovic, Rémi Bardenet, Agnès Desolneux
[ABSTRACT]
In this paper, we consider the problem of computing the integral of a
function on the unit sphere, in any dimension, using Monte Carlo methods.
Although the methods we present are general, our guiding thread is the sliced
Wasserstein distance between two measures on $\mathbb{R}^d$, which is precisely
an integral on the $d$-dimensional sphere. The sliced Wasserstein distance (SW)
has gained momentum in machine learning either as a proxy to the less
computationally tractable Wasserstein distance, or as a distance in its own
right, due in particular to its built-in alleviation of the curse of
dimensionality. There have been recent numerical benchmarks of quadratures for
the sliced Wasserstein, and our viewpoint differs in that we concentrate on
quadratures where the nodes are repulsive, i.e. negatively dependent. Indeed,
negative dependence can bring variance reduction when the quadrature is adapted
to the integration task. Our first contribution is to extract and motivate
quadratures from the recent literature on determinantal point processes (DPPs)
and repelled point processes, as well as repulsive quadratures from the
literature specific to the sliced Wasserstein distance. We then numerically
benchmark these quadratures. Moreover, we analyze the variance of the UnifOrtho
estimator, an orthogonal Monte Carlo estimator. Our analysis sheds light on
UnifOrtho’s success for the estimation of the sliced Wasserstein in large
dimensions, as well as counterexamples from the literature. Our final
recommendation for the computation of the sliced Wasserstein distance is to use
randomized quasi-Monte Carlo in low dimensions and UnifOrtho in large
dimensions. DPP-based quadratures only shine when quasi-Monte Carlo also does,
while repelled quadratures show moderate variance reduction in general, but
more theoretical effort is needed to make them robust.
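
To make the orthogonal Monte Carlo idea concrete, here is a minimal sketch in
the spirit of the estimator the abstract calls UnifOrtho; the exact
construction in the paper may differ. It assumes two empirical measures with
the same number of points, draws blocks of mutually orthogonal directions from
sign-corrected QR factorisations of Gaussian matrices, and averages the
one-dimensional squared 2-Wasserstein distances computed by sorting. The
function names are hypothetical.

```python
import numpy as np

def orthogonal_directions(d, rng):
    """Return d orthonormal directions (rows) from a Haar-distributed matrix."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    q *= np.sign(np.diag(r))   # sign correction makes q Haar-distributed
    return q.T

def sliced_wasserstein2(x, y, n_blocks, rng):
    """Estimate the squared sliced 2-Wasserstein distance between samples x and y."""
    d = x.shape[1]
    vals = []
    for _ in range(n_blocks):
        for theta in orthogonal_directions(d, rng):
            # 1D optimal transport between equal-size samples matches sorted points.
            px = np.sort(x @ theta)
            py = np.sort(y @ theta)
            vals.append(np.mean((px - py) ** 2))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8))
shift = np.arange(8, dtype=float)
same = sliced_wasserstein2(x, x, n_blocks=2, rng=rng)
moved = sliced_wasserstein2(x, x + shift, n_blocks=2, rng=rng)
exact = float(shift @ shift) / 8   # value for a pure translation
```

For a pure translation the integrand is a quadratic form in the direction, so
each orthonormal block sums to the squared shift norm and the estimate is
exact up to rounding. This is a small instance of the variance reduction that
negatively dependent nodes can bring.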
[LINK]
http://arxiv.org/abs/2509.10166v1
[DATE]
2025-09-12 19:48:11+08:00
[CATEGORIES]
cs.LG
Leveraging Data Augmentation and Siamese Learning for Predictive Process Monitoring
[AUTHORS]
Sjoerd van Straten, Alessandro Padella, Marwan Hassani
[ABSTRACT]
Predictive Process Monitoring (PPM) enables forecasting future events or
outcomes of ongoing business process instances based on event logs. However,
deep learning PPM approaches are often limited by the low variability and small
size of real-world event logs. To address this, we introduce SiamSA-PPM, a
novel self-supervised learning framework that combines Siamese learning with
Statistical Augmentation for Predictive Process Monitoring. It employs three
novel statistically grounded transformation methods that leverage control-flow
semantics and frequent behavioral patterns to generate realistic, semantically
valid new trace variants. These augmented views are used within a Siamese
learning setup to learn generalizable representations of process prefixes
without the need for labeled supervision. Extensive experiments on real-life
event logs demonstrate that SiamSA-PPM achieves competitive or superior
performance compared to the SOTA in both next activity and final outcome
prediction tasks. Our results further show that statistical augmentation
significantly outperforms random transformations and improves variability in
the data, highlighting SiamSA-PPM as a promising direction for training data
enrichment in process prediction.
[LINK]
http://arxiv.org/abs/2507.18293v2
[DATE]
2025-09-12 19:43:41+08:00
[CATEGORIES]
cs.LG
A Symmetry-Integrated Approach to Surface Code Decoding
[AUTHORS]
Hoshitaro Ohnishi, Hideo Mukai
[ABSTRACT]
Quantum error correction, which utilizes logical qubits that are encoded as
redundant multiple physical qubits to find and correct errors in physical
qubits, is indispensable for practical quantum computing. Surface code is
considered to be a promising encoding method with a high error threshold that
is defined by stabilizer generators. However, previous methods have suffered
from the problem that the decoder acquires solely the error probability
distribution because of the non-uniqueness of correct prediction obtained from
the input. To circumvent this problem, we propose a technique to reoptimize the
decoder model by approximating syndrome measurements with a continuous function
that is mathematically interpolated by a neural network. We evaluated the
improvement in accuracy of a multilayer-perceptron-based decoder for code
distances of 5 and 7 as well as for decoders based on convolutional and
recurrent neural networks and transformers for a code distance of 5. In all
cases, the reoptimized decoder gave better accuracy than the original models,
demonstrating the universal effectiveness of the proposed method that is
independent of code distance or network architecture. These results suggest
that re-framing the problem of surface code decoding into a regression problem
that can be tackled by deep learning is a useful strategy.
[COMMENTS]
12 pages, 6 figures
[LINK]
http://arxiv.org/abs/2509.10164v1
[DATE]
2025-09-12 19:41:49+08:00
[CATEGORIES]
cs.LG
Federated Multi-Agent Reinforcement Learning for Privacy-Preserving and Energy-Aware Resource Management in 6G Edge Networks
[AUTHORS]
Francisco Javier Esono Nkulu Andong, Qi Min
[ABSTRACT]
As sixth-generation (6G) networks move toward ultra-dense, intelligent edge
environments, efficient resource management under stringent privacy, mobility,
and energy constraints becomes critical. This paper introduces a novel
Federated Multi-Agent Reinforcement Learning (Fed-MARL) framework that
incorporates cross-layer orchestration of both the MAC layer and application
layer for energy-efficient, privacy-preserving, and real-time resource
management across heterogeneous edge devices. Each agent uses a Deep Recurrent
Q-Network (DRQN) to learn decentralized policies for task offloading, spectrum
access, and CPU energy adaptation based on local observations (e.g., queue
length, energy, CPU usage, and mobility). To protect privacy, we introduce a
secure aggregation protocol based on elliptic-curve Diffie-Hellman key
exchange, which ensures accurate model updates without exposing raw data to
semi-honest adversaries. We formulate the resource management problem as a
partially observable multi-agent Markov decision process (POMMDP) with a
multi-objective reward function that jointly optimizes latency, energy
efficiency, spectral efficiency, fairness, and reliability under 6G-specific
service requirements such as URLLC, eMBB, and mMTC. Simulation results
demonstrate that Fed-MARL outperforms centralized MARL and heuristic baselines
in task success rate, latency, energy efficiency, and fairness, while ensuring
robust privacy protection and scalability in dynamic, resource-constrained 6G
edge networks.
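
The abstract states only that secure aggregation is based on elliptic-curve
Diffie-Hellman key exchange. The sketch below assumes the common
pairwise-masking construction, which the paper may or may not use: every pair
of agents shares a secret, expands it into a mask, and one agent adds the mask
while the other subtracts it, so the masks cancel in the sum. The integer
seeds stand in for key-derived secrets and are not cryptographically secure;
all names are hypothetical.

```python
import numpy as np

def pair_mask(seed, shape):
    """Expand a shared secret (here a plain integer seed) into a mask vector."""
    return np.random.default_rng(seed).normal(size=shape)

def mask_update(agent, update, shared_seeds):
    """Add one mask per peer; the sign convention makes masks cancel pairwise."""
    masked = update.copy()
    for other, seed in shared_seeds[agent].items():
        sign = 1.0 if agent < other else -1.0
        masked += sign * pair_mask(seed, update.shape)
    return masked

n_agents, dim = 3, 4
rng = np.random.default_rng(0)
updates = [rng.normal(size=dim) for _ in range(n_agents)]

# Placeholder for the pairwise secrets a key exchange would establish.
shared_seeds = {i: {} for i in range(n_agents)}
for i in range(n_agents):
    for j in range(i + 1, n_agents):
        seed = 1000 * i + j
        shared_seeds[i][j] = seed
        shared_seeds[j][i] = seed

masked = [mask_update(i, updates[i], shared_seeds) for i in range(n_agents)]
aggregate = np.sum(masked, axis=0)   # equals the sum of the raw updates
```

The server sees only the masked vectors, yet their sum matches the sum of the
raw updates. A deployed protocol would work over a finite field and handle
agents that drop out, both of which this sketch omits.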
[LINK]
http://arxiv.org/abs/2509.10163v1
[DATE]
2025-09-12 19:41:40+08:00
[CATEGORIES]
cs.LG
BenchECG and xECG: a benchmark and baseline for ECG foundation models
[AUTHORS]
Riccardo Lunelli, Angus Nicolson, Samuel Martin Pröll, Sebastian Johannes Reinstadler, Axel Bauer, Clemens Dlaska
[ABSTRACT]
Electrocardiograms (ECGs) are inexpensive, widely used, and well-suited to
deep learning. Recently, interest has grown in developing foundation models for
ECGs - models that generalise across diverse downstream tasks. However,
consistent evaluation has been lacking: prior work often uses narrow task
selections and inconsistent datasets, hindering fair comparison. Here, we
introduce BenchECG, a standardised benchmark comprising a comprehensive suite
of publicly available ECG datasets and versatile tasks. We also propose xECG,
an xLSTM-based recurrent model trained with SimDINOv2 self-supervised learning,
which achieves the best BenchECG score compared to publicly available
state-of-the-art models. In particular, xECG is the only publicly available
model to perform strongly on all datasets and tasks. By standardising
evaluation, BenchECG enables rigorous comparison and aims to accelerate
progress in ECG representation learning. xECG achieves superior performance
over earlier approaches, defining a new baseline for future ECG foundation
models.
[COMMENTS]
32 pages, 4 figures, 22 tables
[LINK]
http://arxiv.org/abs/2509.10151v1
[DATE]
2025-09-12 19:27:17+08:00
[CATEGORIES]
cs.LG
Cost-Free Personalization via Information-Geometric Projection in Bayesian Federated Learning
[AUTHORS]
Nour Jamoussi, Giuseppe Serra, Photios A. Stavrou, Marios Kountouris
[ABSTRACT]
Bayesian Federated Learning (BFL) combines uncertainty modeling with
decentralized training, enabling the development of personalized and reliable
models under data heterogeneity and privacy constraints. Existing approaches
typically rely on Markov Chain Monte Carlo (MCMC) sampling or variational
inference, often incorporating personalization mechanisms to better adapt to
local data distributions. In this work, we propose an information-geometric
projection framework for personalization in parametric BFL. By projecting the
global model onto a neighborhood of the user’s local model, our method enables
a tunable trade-off between global generalization and local specialization.
Under mild assumptions, we show that this projection step is equivalent to
computing a barycenter on the statistical manifold, allowing us to derive
closed-form solutions and achieve cost-free personalization. We apply the
proposed approach to a variational learning setup using the Improved
Variational Online Newton (IVON) optimizer and extend its application to
general aggregation schemes in BFL. Empirical evaluations under heterogeneous
data distributions confirm that our method effectively balances global and
local performance with minimal computational overhead.
[LINK]
http://arxiv.org/abs/2509.10132v1
[DATE]
2025-09-12 18:46:21+08:00
[CATEGORIES]
cs.LG
A Comprehensive Survey on Imbalanced Data Learning
[AUTHORS]
Xinyi Gao, Dongting Xie, Yihang Zhang, Zhengren Wang, Chong Chen, Conghui He, Hongzhi Yin, Wentao Zhang
[ABSTRACT]
With the expansion of data availability, machine learning (ML) has achieved
remarkable breakthroughs in both academia and industry. However, imbalanced
data distributions are prevalent in various types of raw data and severely
hinder the performance of ML by biasing the decision-making processes. To
deepen the understanding of imbalanced data and facilitate the related research
and applications, this survey systematically analyzes various real-world data
formats and groups existing research on these formats into four
distinct categories: data re-balancing, feature representation, training
strategy, and ensemble learning. This structured analysis helps researchers
comprehensively understand the pervasive nature of imbalance across diverse
data formats, thereby paving a clearer path toward achieving specific research
goals. We provide an overview of relevant open-source libraries, spotlight
current challenges, and offer novel insights aimed at fostering future
advancements in this critical area of study.
[LINK]
http://arxiv.org/abs/2502.08960v3
[DATE]
2025-09-12 18:37:30+08:00
[CATEGORIES]
cs.LG
Multivariate Long-term Time Series Forecasting with Fourier Neural Filter
[AUTHORS]
Chenheng Xu, Dan Wu, Yixin Zhu, Ying Nian Wu
[ABSTRACT]
Multivariate long-term time series forecasting has been suffering from the
challenge of capturing both temporal dependencies within variables and spatial
correlations across variables simultaneously. Current approaches predominantly
repurpose backbones from natural language processing or computer vision (e.g.,
Transformers), which fail to adequately address the unique properties of time
series (e.g., periodicity). The research community lacks a dedicated backbone
with temporal-specific inductive biases, instead relying on domain-agnostic
backbones supplemented with auxiliary techniques (e.g., signal decomposition).
We introduce FNF as the backbone and DBD as the architecture to provide
excellent learning capabilities and optimal learning pathways for
spatio-temporal modeling, respectively. Our theoretical analysis proves that
FNF unifies local time-domain and global frequency-domain information
processing within a single backbone that extends naturally to spatial modeling,
while information bottleneck theory demonstrates that DBD provides superior
gradient flow and representation capacity compared to existing unified or
sequential architectures. Our empirical evaluation across 11 public benchmark
datasets spanning five domains (energy, meteorology, transportation,
environment, and nature) confirms state-of-the-art performance with consistent
hyperparameter settings. Notably, our approach achieves these results without
any auxiliary techniques, suggesting that properly designed neural
architectures can capture the inherent properties of time series, potentially
transforming time series modeling in scientific and industrial applications.
[LINK]
http://arxiv.org/abs/2506.09174v2
[DATE]
2025-09-12 17:50:48+08:00
[CATEGORIES]
cs.LG
KAN-SR: A Kolmogorov-Arnold Network Guided Symbolic Regression Framework
[AUTHORS]
Marco Andrea Bühler, Gonzalo Guillén-Gosálbez
[ABSTRACT]
We introduce a novel symbolic regression framework, KAN-SR, built on
Kolmogorov-Arnold Networks (KANs), which follows a divide-and-conquer approach.
Symbolic regression searches for mathematical equations that best fit a given
dataset and is commonly solved with genetic programming approaches. We show
that by using deep learning techniques, more specifically KANs, and combining them
with simplification strategies such as translational symmetries and
separabilities, we are able to recover ground-truth equations of the Feynman
Symbolic Regression for Scientific Discovery (SRSD) dataset. Additionally, we
show that by combining the proposed framework with neural controlled
differential equations, we are able to model the dynamics of an in-silico
bioprocess system precisely, opening the door for the dynamic modeling of other
engineering systems.
[LINK]
http://arxiv.org/abs/2509.10089v1
[DATE]
2025-09-12 17:31:34+08:00
[CATEGORIES]
cs.LG
Predictive Spike Timing Enables Distributed Shortest Path Computation in Spiking Neural Networks
[AUTHORS]
Simen Storesund, Kristian Valset Aars, Robin Dietrich, Nicolai Waniek
[ABSTRACT]
Efficient planning and sequence selection are central to intelligence, yet
current approaches remain largely incompatible with biological computation.
Classical graph algorithms like Dijkstra’s or A* require global state and
biologically implausible operations such as backtracing, while reinforcement
learning methods rely on slow gradient-based policy updates that appear
inconsistent with rapid behavioral adaptation observed in natural systems.
We propose a biologically plausible algorithm for shortest-path computation
that operates through local spike-based message-passing with realistic
processing delays. The algorithm exploits spike-timing coincidences to identify
nodes on optimal paths: Neurons that receive inhibitory-excitatory message
pairs earlier than predicted reduce their response delays, creating a temporal
compression that propagates backwards from target to source. Through analytical
proof and simulations on random spatial networks, we demonstrate that the
algorithm converges and discovers all shortest paths using purely timing-based
mechanisms. By showing how short-term timing dynamics alone can compute
shortest paths, this work provides new insights into how biological networks
might solve complex computational problems through purely local computation and
relative spike-time prediction. These findings open new directions for
understanding distributed computation in biological and artificial systems,
with possible implications for computational neuroscience, AI, reinforcement
learning, and neuromorphic systems.
[LINK]
http://arxiv.org/abs/2509.10077v1
[DATE]
2025-09-12 17:13:47+08:00
[CATEGORIES]
cs.LG
Task-Oriented Multimodal Token Transmission in Resource-Constrained Multiuser Networks
[AUTHORS]
Junhe Zhang, Wanli Ni, Pengwei Wang, Dongyu Wang
[ABSTRACT]
Despite the promising paradigm enabled by integrating semantic communication
(SemCom) with multimodal large models (MLMs) for transmitting and utilizing
multimodal data, efficiently fusing and exploiting cross-modal information
still remains challenging. Moreover, widely adopted transformer-based
architectures inevitably produce excessively long token embeddings for
transmission, which result in higher bandwidth consumption, increased power
usage, and greater latency, rendering them impractical in resource-constrained
networks. In this letter, we propose a task-oriented multimodal token
transmission scheme for efficient multimodal information fusion and
utilization. To improve inter-modal consistency and task-relevant token
transmission, we design a two-stage training algorithm that involves
cross-modal alignment followed by task-oriented fine-tuning. Meanwhile, token
compression is performed using a sliding window pooling operation to conserve
limited communication resources. To balance the trade-off between latency
reduction and performance degradation caused by compression, we formulate a
weighted-sum optimization problem over latency and inference performance. We
jointly optimize bandwidth, power allocation, and token length across users by
using an alternating optimization method. Simulation results demonstrate that
the proposed algorithm outperforms the baseline under different bandwidth and
power budgets. Moreover, the two-stage training algorithm achieves higher
accuracy across various signal-to-noise ratios than the method without
cross-modal alignment.
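As a rough illustration of the sliding window pooling step mentioned above, the sketch below average-pools a token sequence along its length. The abstract does not state the pooling operator, window size, or stride, so the mean operator and the `window` and `stride` parameters here are assumptions made for illustration only.

```python
from typing import List


def sliding_window_pool(
    tokens: List[List[float]], window: int, stride: int
) -> List[List[float]]:
    """Average-pool a token sequence along the sequence axis.

    `tokens` is a list of equal-length embedding vectors. Mean pooling,
    `window`, and `stride` are illustrative assumptions, not the paper's
    stated configuration.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    pooled = []
    # Slide over the sequence; trailing tokens that do not fill a window are dropped.
    for start in range(0, len(tokens) - window + 1, stride):
        chunk = tokens[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(tok[d] for tok in chunk) / window for d in range(dim)])
    return pooled


if __name__ == "__main__":
    seq = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    # Four tokens with window 2 and stride 2 are compressed to two tokens.
    print(sliding_window_pool(seq, window=2, stride=2))
```

With a stride equal to the window size, the number of transmitted tokens shrinks by roughly that factor, which is the latency versus accuracy trade-off the abstract refers to.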
[LINK]
http://arxiv.org/abs/2505.07841v2
[DATE]
2025-09-12 16:58:30+08:00
[CATEGORIES]
cs.LG
When and How Does CLIP Enable Domain and Compositional Generalization?
[AUTHORS]
Elias Kempf, Simon Schrodi, Max Argus, Thomas Brox
[COMMENTS]
ICML 2025 (Spotlight)
[LINK]
http://arxiv.org/abs/2502.09507v3
[DATE]
2025-09-12 16:50:44+08:00
[CATEGORIES]
cs.LG
Prior shift estimation for positive unlabeled data through the lens of kernel embedding
[AUTHORS]
Jan Mielniczuk, Wojciech Rejchel, Paweł Teisseyre
[ABSTRACT]
We study estimation of a class prior for unlabeled target samples which
possibly differs from that of the source population. Moreover, it is assumed that
the source data is partially observable: only samples from the positive class
and from the whole population are available (PU learning scenario). We
introduce a novel direct estimator of a class prior which avoids estimation of
posterior probabilities in both populations and has a simple geometric
interpretation. It is based on a distribution matching technique together with
kernel embedding in a Reproducing Kernel Hilbert Space and is obtained as an
explicit solution to an optimisation task. We establish its asymptotic
consistency as well as an explicit non-asymptotic bound on its deviation from
the unknown prior, which is calculable in practice. We study finite sample
behaviour for synthetic and real data and show that the proposal performs
consistently on par with or better than its competitors.
[LINK]
http://arxiv.org/abs/2502.21194v2
[DATE]
2025-09-12 16:49:56+08:00
[CATEGORIES]
cs.LG
Reinforcement learning for spin torque oscillator tasks
[AUTHORS]
Jakub Mojsiejuk, Sławomir Ziętek, Witold Skowroński
[ABSTRACT]
We address the problem of automatic synchronisation of the spintronic
oscillator (STO) by means of reinforcement learning (RL). A numerical solution
of the macrospin Landau-Lifschitz-Gilbert-Slonczewski equation is used to
simulate the STO, and we train two types of RL agents to synchronise with a
target frequency within a fixed number of steps. We explore modifications to
this base task and show an improvement in both convergence and energy
efficiency of the synchronisation that can be easily achieved in the simulated
environment.
[COMMENTS]
3 figures, 6 pages
[LINK]
http://arxiv.org/abs/2509.10057v1
[DATE]
2025-09-12 16:41:39+08:00
[CATEGORIES]
cs.LG
When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing Values
[AUTHORS]
Christophe Muller, Erwan Scornet, Julie Josse
[ABSTRACT]
Predicting a response with partially missing inputs remains a challenging
task even in parametric models, since parameter estimation in itself is not
sufficient to predict on partially observed inputs. Several works study
prediction in linear models. In this paper, we focus on logistic models, which
present their own difficulties. From a theoretical perspective, we prove that a
Pattern-by-Pattern strategy (PbP), which learns one logistic model per
missingness pattern, accurately approximates Bayes probabilities in various
missing data scenarios (MCAR, MAR and MNAR). Empirically, we thoroughly compare
various methods (constant and iterative imputations, complete case analysis,
PbP, and an EM algorithm) across classification, probability estimation,
calibration, and parameter inference. Our analysis provides a comprehensive
view of logistic regression with missing values. It reveals that mean
imputation can be used as a baseline for low sample sizes, and improved
performance is obtained via nonlinear multiple iterative imputation techniques
with the labels (MICE.RF.Y). For large sample sizes, PbP is the best method for
Gaussian mixtures, and we recommend MICE.RF.Y in the presence of nonlinear
features.
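The partitioning behind a Pattern-by-Pattern strategy can be sketched as follows: rows are grouped by which inputs are missing, and each group keeps only its observed coordinates, so that one model could then be fit per group. This is a minimal sketch under stated assumptions: missing values are encoded as `None`, the helper name `group_by_missingness` is hypothetical, and the per-pattern logistic fit itself is omitted.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple

Pattern = Tuple[bool, ...]  # True marks a missing coordinate


def group_by_missingness(
    X: Sequence[Sequence[Optional[float]]], y: Sequence[int]
) -> Dict[Pattern, Tuple[List[List[float]], List[int]]]:
    """Split (X, y) into one sub-dataset per missingness pattern.

    Each sub-dataset keeps only the observed coordinates, so a separate
    model (not fitted here) can be trained on each one.
    """
    groups = defaultdict(lambda: ([], []))
    for row, label in zip(X, y):
        pattern = tuple(v is None for v in row)
        observed = [v for v in row if v is not None]
        groups[pattern][0].append(observed)
        groups[pattern][1].append(label)
    return dict(groups)


if __name__ == "__main__":
    X = [[1.0, None], [2.0, 3.0], [4.0, None]]
    y = [0, 1, 1]
    for pattern, (rows, labels) in group_by_missingness(X, y).items():
        print(pattern, rows, labels)
```

The number of groups can grow up to 2^d for d inputs, which is one reason the abstract's recommendation of this strategy is tied to large sample sizes.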
[LINK]
http://arxiv.org/abs/2507.13024v2
[DATE]
2025-09-12 16:27:21+08:00
[CATEGORIES]
cs.LG
Uncertainty-Aware Tabular Prediction: Evaluating VBLL-Enhanced TabPFN in Safety-Critical Medical Data
[AUTHORS]
Madhushan Ramalingam
[ABSTRACT]
Predictive models are being increasingly used across a wide range of domains,
including safety-critical applications such as medical diagnosis and criminal
justice. Reliable uncertainty estimation is a crucial task in such settings.
Tabular Prior-data Fitted Network (TabPFN) is a recently proposed machine
learning foundation model for tabular datasets, which uses a generative
transformer architecture. Variational Bayesian Last Layers (VBLL) is a
state-of-the-art lightweight variational formulation that effectively improves
uncertainty estimation with minimal computational overhead. In this work we aim
to evaluate the performance of VBLL integrated with the recently proposed
TabPFN in uncertainty calibration. Our experiments, conducted on three
benchmark medical tabular datasets, compare the performance of the original
TabPFN and the VBLL-integrated version. Contrary to expectations, we observed
that the original TabPFN consistently outperforms the VBLL-integrated TabPFN in
uncertainty calibration across all datasets.
[LINK]
http://arxiv.org/abs/2509.10048v1
[DATE]
2025-09-12 16:24:19+08:00
[CATEGORIES]
cs.LG
FedRP: A Communication-Efficient Approach for Differentially Private Federated Learning Using Random Projection
[AUTHORS]
Mohammad Hasan Narimani, Mostafa Tavassolipour
[ABSTRACT]
Federated learning (FL) offers an innovative paradigm for collaborative model
training across decentralized devices, such as smartphones, balancing enhanced
predictive performance with the protection of user privacy in sensitive areas
like Internet of Things (IoT) and medical data analysis. Despite its
advantages, FL encounters significant challenges related to user privacy
protection against potential attacks and the management of communication costs.
This paper introduces a novel federated learning algorithm called FedRP, which
integrates random projection techniques with the Alternating Direction Method
of Multipliers (ADMM) optimization framework. This approach enhances privacy by
employing random projection to reduce the dimensionality of model parameters
prior to their transmission to a central server, reducing the communication
cost. The proposed algorithm offers a strong $(\epsilon, \delta)$-differential
privacy guarantee, demonstrating resilience against data reconstruction
attacks. Experimental results reveal that FedRP not only maintains high model
accuracy but also outperforms existing methods, including conventional
differential privacy approaches and FedADMM, in terms of both privacy
preservation and communication efficiency.
[LINK]
http://arxiv.org/abs/2509.10041v1
[DATE]
2025-09-12 16:08:48+08:00
[CATEGORIES]
cs.LG
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
[AUTHORS]
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
[ABSTRACT]
Reinforcement learning (RL) has become a dominant paradigm for training large
language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs
requires massive parallelization and poses an urgent need for efficient
training systems. Most existing large-scale RL systems for LLMs are
synchronous, alternating generation and training in a batch setting where
rollouts in each training batch are generated by the same model. This approach
stabilizes RL training but suffers from severe system-level inefficiency:
generation must wait until the longest output in the batch is completed before
model updates, resulting in GPU underutilization. We present AReaL, a fully
asynchronous RL system that completely decouples generation from training.
Rollout workers in AReaL continuously generate new outputs without waiting,
while training workers update the model whenever a batch of data is collected.
AReaL also incorporates a collection of system-level optimizations, leading to
substantially higher GPU utilization. To stabilize RL training, AReaL balances
the workload of rollout and training workers to control data staleness, and
adopts a staleness-enhanced PPO variant to better handle outdated training
samples. Extensive experiments on math and code reasoning benchmarks show that
AReaL achieves up to 2.77$\times$ training speedup compared to synchronous
systems with the same number of GPUs and matched or improved final performance.
The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
[LINK]
http://arxiv.org/abs/2505.24298v3
[DATE]
2025-09-12 15:59:18+08:00
[CATEGORIES]
cs.LG
Symbolic Feedforward Networks for Probabilistic Finite Automata: Exact Simulation and Learnability
[AUTHORS]
Sahil Rajesh Dhayalkar
[ABSTRACT]
We present a formal and constructive theory showing that probabilistic finite
automata (PFAs) can be exactly simulated using symbolic feedforward neural
networks. Our architecture represents state distributions as vectors and
transitions as stochastic matrices, enabling probabilistic state propagation
via matrix-vector products. This yields a parallel, interpretable, and
differentiable simulation of PFA dynamics using soft updates, without
recurrence. We formally characterize probabilistic subset construction,
$\varepsilon$-closure, and exact simulation via layered symbolic computation,
and prove equivalence between PFAs and specific classes of neural networks. We
further show that these symbolic simulators are not only expressive but
learnable: trained with standard gradient descent-based optimization on labeled
sequence data, they recover the exact behavior of ground-truth PFAs. This
learnability, formalized in Proposition 5.1, is the crux of this work. Our
results unify probabilistic automata theory with neural architectures under a
rigorous algebraic framework, bridging the gap between symbolic computation and
deep learning.
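The state-propagation idea described above (state distributions as vectors, transitions as stochastic matrices) can be illustrated with a short sketch. It assumes row-stochastic matrices indexed by input symbol and computes the acceptance probability of a word as the probability mass on accepting states; the function names and the two-state toy automaton are hypothetical and do not reproduce the paper's symbolic network construction.

```python
from typing import Dict, List, Sequence


def step(dist: Sequence[float], T: Sequence[Sequence[float]]) -> List[float]:
    """One propagation step: new_dist[j] = sum_i dist[i] * T[i][j]."""
    n_next = len(T[0])
    return [sum(dist[i] * T[i][j] for i in range(len(dist))) for j in range(n_next)]


def accept_prob(
    initial: Sequence[float],
    transitions: Dict[str, Sequence[Sequence[float]]],
    accepting: Sequence[bool],
    word: str,
) -> float:
    """Probability that the automaton ends in an accepting state after `word`."""
    dist = list(initial)
    for symbol in word:
        dist = step(dist, transitions[symbol])
    return sum(p for p, is_acc in zip(dist, accepting) if is_acc)


if __name__ == "__main__":
    # Toy two-state automaton; each matrix row sums to 1.
    transitions = {
        "a": [[0.5, 0.5], [0.0, 1.0]],
        "b": [[1.0, 0.0], [0.5, 0.5]],
    }
    initial = [1.0, 0.0]
    accepting = [False, True]
    print(accept_prob(initial, transitions, accepting, "aa"))
    print(accept_prob(initial, transitions, accepting, "ab"))
```

Because each step is a fixed linear map chosen by the input symbol, a word of length n unrolls into n stacked layers with no recurrent state, which matches the feedforward framing in the abstract.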
[COMMENTS]
19 pages, 2 figures
[LINK]
http://arxiv.org/abs/2509.10034v1
[DATE]
2025-09-12 15:57:01+08:00
[CATEGORIES]
cs.LG
Sparse Coding Representation of 2-way Data
[AUTHORS]
Boya Ma, Abram Magner, Maxwell McNeil, Petko Bogdanov
[ABSTRACT]
Sparse dictionary coding represents signals as linear combinations of a few
dictionary atoms. It has been applied to images, time series, graph signals and
multi-way spatio-temporal data by jointly employing temporal and spatial
dictionaries. Data-agnostic analytical dictionaries, such as the discrete
Fourier transform, wavelets and graph Fourier, have seen wide adoption due to
efficient implementations and good practical performance. On the other hand,
dictionaries learned from data offer sparser and more accurate solutions but
require learning of both the dictionaries and the coding coefficients. This
becomes especially challenging for multi-dictionary scenarios since encoding
coefficients correspond to all atom combinations from the dictionaries. To
address this challenge, we propose a low-rank coding model for 2-dictionary
scenarios and study its data complexity. Namely, we establish a bound on the
number of samples needed to learn dictionaries that generalize to unseen
samples from the same distribution. We propose a convex relaxation solution,
called AODL, whose exact solution we show also solves the original problem. We
then solve this relaxation via alternating optimization between the sparse
coding matrices and the learned dictionaries, which we prove to be convergent.
We demonstrate its quality for data reconstruction and missing value imputation
in both synthetic and real-world datasets. For a fixed reconstruction quality,
AODL learns up to 90\% sparser solutions compared to non-low-rank and
analytical (fixed) dictionary baselines. In addition, the learned dictionaries
reveal interpretable insights into patterns present within the samples used for
training.
[LINK]
http://arxiv.org/abs/2509.10033v1
[DATE]
2025-09-12 15:53:33+08:00
[CATEGORIES]
cs.LG
Exploring Expert Specialization through Unsupervised Training in Sparse Mixture of Experts
[AUTHORS]
Strahinja Nikolic, Ilker Oguz, Demetri Psaltis
[ABSTRACT]
Understanding the internal organization of neural networks remains a
fundamental challenge in deep learning interpretability. We address this
challenge by exploring a novel Sparse Mixture of Experts Variational
Autoencoder (SMoE-VAE) architecture. We test our model on the QuickDraw
dataset, comparing unsupervised expert routing against a supervised baseline
guided by ground-truth labels. Surprisingly, we find that unsupervised routing
consistently achieves superior reconstruction performance. The experts learn to
identify meaningful sub-categorical structures that often transcend
human-defined class boundaries. Through t-SNE visualizations and reconstruction
analysis, we investigate how MoE models uncover fundamental data structures
that are more aligned with the model’s objective than predefined labels.
Furthermore, our study on the impact of dataset size provides insights into the
trade-offs between data quantity and expert specialization, offering guidance
for designing efficient MoE architectures.
[COMMENTS]
14 pages, 7 figures
[LINK]
http://arxiv.org/abs/2509.10025v1
[DATE]
2025-09-12 15:45:10+08:00
[CATEGORIES]
cs.LG
Analyzing the Impact of Adversarial Examples on Explainable Machine Learning
[AUTHORS]
Prathyusha Devabhakthini, Sasmita Parida, Raj Mani Shukla, Suvendu Chandan Nayak, Tapadhir Das
[ABSTRACT]
Adversarial attacks are a type of attack on machine learning models where an
attacker deliberately modifies the inputs to cause the model to make incorrect
predictions. Adversarial attacks can have serious consequences, particularly in
applications such as autonomous vehicles, medical diagnosis, and security
systems. Work on the vulnerability of deep learning models to adversarial
attacks has shown that it is very easy to craft samples that cause a model
to make unintended predictions. In this work, we analyze the impact of
adversarial attacks on model interpretability in text classification
problems. We develop an ML-based classification model for text data. Then, we
introduce adversarial perturbations to the text data to understand the
classification performance after the attack. Subsequently, we analyze and
interpret the model’s explainability before and after the attack.
[LINK]
http://arxiv.org/abs/2307.08327v2
[DATE]
2025-09-12 15:14:11+08:00
[CATEGORIES]
cs.LG
Neural Scaling Laws for Deep Regression
[AUTHORS]
Tilen Cadez, Kyoung-Min Kim
[ABSTRACT]
Neural scaling laws, power-law relationships between generalization errors
and characteristics of deep learning models, are vital tools for developing
reliable models while managing limited resources. Although the success of large
language models highlights the importance of these laws, their application to
deep regression models remains largely unexplored. Here, we empirically
investigate neural scaling laws in deep regression using a parameter estimation
model for twisted van der Waals magnets. We observe power-law relationships
between the loss and both training dataset size and model capacity across a
wide range of values, employing various architectures, including fully
connected networks, residual networks, and vision transformers. Furthermore,
the scaling exponents governing these relationships range from 1 to 2, with
specific values depending on the regressed parameters and model details. The
consistent scaling behaviors and their large scaling exponents suggest that the
performance of deep regression models can improve substantially with increasing
data size.
[COMMENTS]
Supplementary Information will be provided with the published
manuscript
[LINK]
http://arxiv.org/abs/2509.10000v1
[DATE]
2025-09-12 14:49:19+08:00
[CATEGORIES]
cs.LG
Semi-Supervised Learning for Dose Prediction in Targeted Radionuclide: A Synthetic Data Study
[AUTHORS]
Jing Zhang, Alexandre Bousse, Chi-Hieu Pham, Kuangyu Shi, Julien Bert
[ABSTRACT]
Targeted Radionuclide Therapy (TRT) is a modern strategy in radiation
oncology that aims to administer a potent radiation dose specifically to cancer
cells using cancer-targeting radiopharmaceuticals. Accurate radiation dose
estimation tailored to individual patients is crucial. Deep learning,
particularly with pre-therapy imaging, holds promise for personalizing TRT
doses. However, current methods require large time series of SPECT imaging,
which is hardly achievable in routine clinical practice, and thus raises issues
of data availability. Our objective is to develop a semi-supervised learning
(SSL) solution to personalize dosimetry using pre-therapy images. The aim is to
develop an approach that achieves accurate results when PET/CT images are
available, but are associated with only a few post-therapy dosimetry data
provided by SPECT images. In this work, we introduce an SSL method using a
pseudo-label generation approach for regression tasks inspired by the FixMatch
framework. The feasibility of the proposed solution was preliminarily evaluated
through an in-silico study using synthetic data and Monte Carlo simulation.
Experimental results for organ dose prediction yielded promising outcomes,
showing that the use of pseudo-labeled data provides better accuracy compared
to using only labeled data.
[COMMENTS]
12 pages, 13 figures, 5 tables
[LINK]
http://arxiv.org/abs/2503.05367v2
[DATE]
2025-09-12 14:47:33+08:00
[CATEGORIES]
cs.LG
Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates
[AUTHORS]
Harry Julian, Rachel Beeson, Lohith Konathala, Johanna Ulin, Jiameng Gao
[ABSTRACT]
Neural Audio Codecs (NACs) have become increasingly adopted in speech
processing tasks due to their excellent rate-distortion performance and
compatibility with Large Language Models (LLMs) as discrete feature
representations for audio generation. While most existing codecs rely on
Residual Vector Quantization (RVQ), Finite Scalar Quantization (FSQ) has
recently emerged as a compelling alternative that simplifies training and
natively supports single codebooks. We introduce NeuCodec, an FSQ-based NAC,
and show that FSQ has baked-in redundancy, which produces an encoding that
is robust when transmitted through noisy channels. First, through an encoder
distillation experiment, we show that two different encoders can learn to
encode identical audio into vastly different code sequences whilst maintaining
comparable reconstruction quality with the same quantizer and decoder. Second,
we demonstrate that FSQ has vastly superior bit-level perturbation robustness
by comparing the performance of RVQ and FSQ codecs when simulating the
transmission of code sequences through a noisy channel.
[LINK]
http://arxiv.org/abs/2509.09550v2
[DATE]
2025-09-12 14:43:25+08:00
[CATEGORIES]
cs.LG
Data-Driven Energy Estimation for Virtual Servers Using Combined System Metrics and Machine Learning
[AUTHORS]
Amandip Sangha
[ABSTRACT]
This paper presents a machine learning-based approach to estimate the energy
consumption of virtual servers without access to physical power measurement
interfaces. Using resource utilization metrics collected from guest virtual
machines, we train a Gradient Boosting Regressor to predict energy consumption
measured via RAPL on the host. We demonstrate, for the first time, guest-only
resource-based energy estimation without privileged host access with
experiments across diverse workloads, achieving high predictive accuracy and
variance explained ($0.90 \leq R^2 \leq 0.97$), indicating the feasibility of
guest-side energy estimation. This approach can enable energy-aware scheduling,
cost optimization and physical host independent energy estimates in virtualized
environments. Our approach addresses a critical gap in virtualized environments
(e.g. cloud) where direct energy measurement is infeasible.
[LINK]
http://arxiv.org/abs/2509.09991v1
[DATE]
2025-09-12 14:22:01+08:00
[CATEGORIES]
cs.LG
PL-Net: Progressive Learning Network for Medical Image Segmentation
[AUTHORS]
Kunpeng Mao, Ruoyu Li, Junlong Cheng, Danmei Huang, Zhiping Song, ZeKui Liu
[ABSTRACT]
In recent years, deep convolutional neural network-based segmentation methods
have achieved state-of-the-art performance for many medical analysis tasks.
However, most of these approaches rely on optimizing the U-Net structure or
adding new functional modules, which overlooks the complementation and fusion
of coarse-grained and fine-grained semantic information. To address these
issues, we propose a 2D medical image segmentation framework called Progressive
Learning Network (PL-Net), which comprises Internal Progressive Learning (IPL)
and External Progressive Learning (EPL). PL-Net offers the following
advantages: (1) IPL divides feature extraction into two steps, allowing for the
mixing of different size receptive fields and capturing semantic information
from coarse to fine granularity without introducing additional parameters; (2)
EPL divides the training process into two stages to optimize parameters and
facilitate the fusion of coarse-grained information in the first stage and
fine-grained information in the second stage. We conducted comprehensive
evaluations of our proposed method on five medical image segmentation datasets,
and the experimental results demonstrate that PL-Net achieves competitive
segmentation performance. It is worth noting that PL-Net does not introduce any
additional learnable parameters compared to other U-Net variants.
[LINK]
http://arxiv.org/abs/2110.14484v3
[DATE]
2025-09-12 14:17:06+08:00
[CATEGORIES]
cs.LG
A Unified Framework for Diffusion Bridge Problems: Flow Matching and Schrödinger Matching into One
[AUTHORS]
Minyoung Kim
[ABSTRACT]
The bridge problem is to find an SDE (or sometimes an ODE) that bridges two
given distributions. The application areas of the bridge problem are enormous,
among which the recent generative modeling (e.g., conditional or unconditional
image generation) is the most popular. Also, the famous Schrödinger bridge
problem, a widely known problem for a century, is a special instance of the
bridge problem. The two most popular algorithms for tackling bridge problems in
the deep learning era are (conditional) flow matching and iterative fitting
algorithms, where the former is confined to ODE solutions and the latter is
specific to the Schrödinger bridge problem. The main contribution of
this article is twofold: i) We provide concise reviews of these algorithms
with technical details to some extent; ii) We propose a novel unified
perspective and framework that subsumes these seemingly unrelated algorithms
(and their variants) into one. In particular, we show that our unified
framework can instantiate the Flow Matching (FM) algorithm, the (mini-batch)
optimal transport FM algorithm, the (mini-batch) Schrödinger bridge FM
algorithm, and the deep Schrödinger bridge matching (DSBM) algorithm as its
special cases. We believe that this unified framework will be useful for
viewing the bridge problems in a more general and flexible perspective, and in
turn can help researchers and practitioners to develop new bridge algorithms in
their fields.
[LINK]
http://arxiv.org/abs/2503.21756v2
[DATE]
2025-09-12 14:05:24+08:00
[CATEGORIES]
cs.LG
Why and How Auxiliary Tasks Improve JEPA Representations
[AUTHORS]
Jiacan Yu, Siyi Chen, Mingrui Liu, Nono Horiuchi, Vladimir Braverman, Zicheng Xu, Dan Haramati, Randall Balestriero
[ABSTRACT]
Joint-Embedding Predictive Architecture (JEPA) is increasingly used for
visual representation learning and as a component in model-based RL, but its
behavior remains poorly understood. We provide a theoretical characterization
of a simple, practical JEPA variant that has an auxiliary regression head
trained jointly with latent dynamics. We prove a No Unhealthy Representation
Collapse theorem: in deterministic MDPs, if training drives both the
latent-transition consistency loss and the auxiliary regression loss to zero,
then any pair of non-equivalent observations, i.e., those that do not have the
same transition dynamics or auxiliary label, must map to distinct latent
representations. Thus, the auxiliary task anchors which distinctions the
representation must preserve. Controlled ablations in a counting environment
corroborate the theory and show that training the JEPA model jointly with the
auxiliary head generates a richer representation than training them separately.
Our work indicates a path to improve JEPA encoders: training them with an
auxiliary function that, together with the transition dynamics, encodes the
right equivalence relations.
[LINK]
http://arxiv.org/abs/2509.12249v1
[DATE]
2025-09-12 13:28:29+08:00
[CATEGORIES]
cs.LG
Soft Diamond Regularizers for Deep Learning
[AUTHORS]
Olaoluwa Adigun, Bart Kosko
[ABSTRACT]
This chapter presents the new family of soft diamond synaptic regularizers
based on thick-tailed symmetric alpha stable $S{\alpha}S$ probability bell
curves. These new parametrized weight priors improved deep-learning performance
on image and language-translation test sets and increased the sparsity of the
trained weights. They outperformed the state-of-the-art hard-diamond Laplacian
regularizer of sparse lasso regression and classification. The $S{\alpha}S$
synaptic weight priors have power-law bell-curve tails that are thicker than
the thin exponential tails of Gaussian bell curves that underlie ridge
regularizers. Their tails get thicker as the $\alpha$ parameter decreases.
These thicker tails model more impulsive behavior and allow for occasional
distant search in synaptic weight spaces of extremely high dimension. The
geometry of their constraint sets has a diamond shape. The shape varies from a
circle to a star or diamond that depends on the $\alpha$ tail thickness and
dispersion of the $S{\alpha}S$ weight prior. These $S{\alpha}S$ bell curves
lack a closed form in general and this makes direct training computationally
intensive. We removed this computational bottleneck by using a precomputed
look-up table. We tested the soft diamond regularizers with deep neural
classifiers on both image test sets and German-to-English language translation.
The image simulations used the three datasets CIFAR-10, CIFAR-100, and
Caltech-256. The regularizers improved the accuracy and sparsity of the
classifiers. We also tested with deep neural machine-translation models on the
IWSLT-2016 Evaluation dataset for German-to-English text translation. They also
outperformed ridge regularizers and lasso regularizers. These findings
recommend the sub-Cauchy $\alpha = 0.5$ soft diamond regularizer as a
competitive and sparse regularizer for large-scale machine learning.
[COMMENTS]
25 pages, 15 figures. This version extends the earlier version titled
“Training Deep Neural Classifiers with Soft Diamond Regularizers”
[LINK]
http://arxiv.org/abs/2412.20724v2
[DATE]
2025-09-12 13:20:39+08:00
[CATEGORIES]
cs.LG
Drone-Based Multispectral Imaging and Deep Learning for Timely Detection of Branched Broomrape in Tomato Farms
[AUTHORS]
Mohammadreza Narimani, Alireza Pourreza, Ali Moghimi, Mohsen Mesgaran, Parastoo Farajpoor, Hamid Jafarbiglu
[ABSTRACT]
This study addresses the escalating threat of branched broomrape (Phelipanche
ramosa) to California’s tomato industry, which supplies over 90 percent of U.S.
processing tomatoes. The parasite’s largely underground life cycle makes early
detection difficult, while conventional chemical controls are costly,
environmentally harmful, and often ineffective. To address this, we combined
drone-based multispectral imagery with Long Short-Term Memory (LSTM) deep
learning networks, using the Synthetic Minority Over-sampling Technique (SMOTE)
to handle class imbalance. Research was conducted on a known broomrape-infested
tomato farm in Woodland, Yolo County, CA, across five key growth stages
determined by growing degree days (GDD). Multispectral images were processed to
isolate tomato canopy reflectance. At 897 GDD, broomrape could be detected with
79.09 percent overall accuracy and 70.36 percent recall without integrating
later stages. Incorporating sequential growth stages with LSTM improved
detection substantially. The best-performing scenario, which integrated all
growth stages with SMOTE augmentation, achieved 88.37 percent overall accuracy
and 95.37 percent recall. These results demonstrate the strong potential of
temporal multispectral analysis and LSTM networks for early broomrape
detection. While further real-world data collection is needed for practical
deployment, this study shows that UAV-based multispectral sensing coupled with
deep learning could provide a powerful precision agriculture tool to reduce
losses and improve sustainability in tomato production.
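
To illustrate the class-imbalance step named in the abstract above, here is a minimal sketch of SMOTE-style oversampling: each synthetic minority sample is an interpolation between a minority point and one of its nearest minority neighbours. The feature values, the function name `smote_sample`, and the parameters `k`, `n_new`, and `seed` are assumptions for illustration; the abstract does not describe the authors' implementation.

```python
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        target = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + t * (b - a) for a, b in zip(base, target)])
    return synthetic

# Hypothetical two-band reflectance features for the minority (infested) class
minority = [[0.10, 0.40], [0.12, 0.38], [0.15, 0.45], [0.11, 0.42]]
new_points = smote_sample(minority)
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the bounding box of the minority class rather than duplicating existing rows.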
[COMMENTS]
Author-accepted version (no publisher header/footer). 10 pages +
presentation. Published in Proceedings of SPIE Defense + Commercial Sensing
2024, Vol. 13053, Paper 1305304. Event: National Harbor, Maryland, USA.
Official version: https://doi.org/10.1117/12.3021219
[LINK]
http://arxiv.org/abs/2509.09972v1
[DATE]
2025-09-12 13:16:56+08:00
[CATEGORIES]
cs.LG
Interpretable Data-driven Anomaly Detection in Industrial Processes with ExIFFI
[AUTHORS]
Davide Frizzo, Francesco Borsatti, Alessio Arcudi, Antonio De Moliner, Roberto Oboe, Gian Antonio Susto
[ABSTRACT]
Anomaly Detection (AD) is crucial in industrial settings to streamline
operations by detecting underlying issues. Conventional methods merely label
observations as normal or anomalous, lacking crucial insights. In Industry 5.0,
interpretable outcomes become desirable to enable users to understand the
rationale behind model decisions. This paper presents the first industrial
application of ExIFFI, a recent approach for fast, efficient explanations for
the Extended Isolation Forest (EIF) AD method. ExIFFI is tested on three
industrial datasets, demonstrating superior explanation effectiveness and
computational efficiency compared to other state-of-the-art explainable AD
models.
[COMMENTS]
This is an extension of the previous version of the paper, submitted
to IEEE Transaction for Industry Application. The extension consists in:
improved text, new citations, new benchmark dataset CoffeeData, and new
figures
[LINK]
http://arxiv.org/abs/2405.01158v2
[DATE]
2025-09-12 13:14:00+08:00
[CATEGORIES]
cs.LG
Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings
[AUTHORS]
Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton
[ABSTRACT]
Compared to traditional machine learning models, recent large language models
(LLMs) can exhibit multi-task-solving capabilities through multiple dialogues
and multi-modal data sources. These unique characteristics of LLMs, together
with their large model size, make their deployment more challenging.
Specifically, (i) deploying LLMs on local devices faces computational, memory,
and energy resource issues, while (ii) deploying them in the cloud cannot
guarantee real-time service and incurs communication/usage costs. In this
paper, we design TMO, a local-cloud LLM inference system with Three-M
Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a
lightweight local LLM that can process simple tasks at high speed and (ii) a
large-scale cloud LLM that can handle multi-modal data sources. We develop a
resource-constrained reinforcement learning (RCRL) strategy for TMO that
optimizes the inference location (i.e., local vs. cloud) and multi-modal data
sources to use for each task/dialogue, aiming to maximize the long-term reward
(response quality, latency, and usage cost) while adhering to resource
constraints. We also contribute M4A1, a new dataset we curated that contains
reward and cost metrics across multiple modality, task, dialogue, and LLM
configurations, enabling evaluation of offloading decisions. We demonstrate the
effectiveness of TMO compared to several exploration-decision and LLM-as-Agent
baselines, showing significant improvements in latency, cost, and response
quality.
[LINK]
http://arxiv.org/abs/2502.11007v3
[DATE]
2025-09-12 12:41:28+08:00
[CATEGORIES]
cs.LG
Limited Reference, Reliable Generation: A Two-Component Framework for Tabular Data Generation in Low-Data Regimes
[AUTHORS]
Mingxuan Jiang, Yongxin Wang, Ziyue Dai, Yicun Liu, Hongyi Nie, Sen Liu, Hongfeng Chai
[ABSTRACT]
Synthetic tabular data generation is increasingly essential in data
management, supporting downstream applications when real-world and high-quality
tabular data is insufficient. Existing tabular generation approaches, such as
generative adversarial networks (GANs), diffusion models, and fine-tuned Large
Language Models (LLMs), typically require sufficient reference data, limiting
their effectiveness in domain-specific databases with scarce records. While
prompt-based LLMs offer flexibility without parameter tuning, they often fail
to capture dataset-specific feature-label dependencies and generate redundant
data, leading to degradation in downstream task performance. To overcome these
issues, we propose ReFine, a framework that (i) derives symbolic “if-then”
rules from interpretable models and embeds them into prompts to explicitly
guide generation toward domain-specific feature distribution, and (ii) applies
a dual-granularity filtering strategy that suppresses over-sampling patterns
and selectively refines rare but informative samples to reduce distributional
imbalance. Extensive experiments on various regression and classification
benchmarks demonstrate that ReFine consistently outperforms state-of-the-art
methods, achieving up to 0.44 absolute improvement in R-squared for regression
and 10.0 percent relative improvement in F1 score for classification tasks.
[LINK]
http://arxiv.org/abs/2509.09960v1
[DATE]
2025-09-12 12:34:46+08:00
[CATEGORIES]
cs.LG
Adaptive Token Merging for Efficient Transformer Semantic Communication at the Edge
[AUTHORS]
Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis, Sami Muhaidat
[ABSTRACT]
Large-scale transformers are central to modern semantic communication, yet
their high computational and communication costs hinder deployment on
resource-constrained edge devices. This paper introduces a training-free
framework for adaptive token merging, a novel mechanism that compresses
transformer representations at runtime by selectively merging semantically
redundant tokens under per-layer similarity thresholds. Unlike prior
fixed-ratio reduction, our approach couples merging directly to input
redundancy, enabling data-dependent adaptation that balances efficiency and
task relevance without retraining. We cast the discovery of merging strategies
as a multi-objective optimization problem and leverage Bayesian optimization to
obtain Pareto-optimal trade-offs between accuracy, inference cost, and
communication cost. On ImageNet classification, we match the accuracy of the
unmodified transformer with 30% fewer floating-point operations per second and
under 20% of the original communication cost, while for visual question
answering our method achieves performance competitive with the full LLaVA model
at less than one-third of the compute and one-tenth of the bandwidth. Finally,
we show that our adaptive merging is robust across varying channel conditions
and provides inherent privacy benefits, substantially degrading the efficacy of
model inversion attacks. Our framework provides a practical and versatile
solution for deploying powerful transformer models in resource-limited edge
intelligence scenarios.
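
The threshold-driven merging described in the abstract above can be sketched as follows. This is an illustrative assumption, not the authors' algorithm: it greedily averages each token into the previously kept token when their cosine similarity reaches a threshold, so the amount of compression depends on how redundant the input is. The names `cosine` and `merge_tokens` and the toy vectors are hypothetical; in a per-layer setting one such threshold would be chosen for each layer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_tokens(tokens, threshold):
    """Greedily merge each token into the previous kept token when their
    cosine similarity is >= threshold; merged tokens are running averages."""
    merged = []  # list of (vector, number of tokens averaged so far)
    for tok in tokens:
        if merged and cosine(merged[-1][0], tok) >= threshold:
            vec, n = merged[-1]
            merged[-1] = ([(a * n + b) / (n + 1) for a, b in zip(vec, tok)], n + 1)
        else:
            merged.append((list(tok), 1))
    return [vec for vec, _ in merged]

# Toy sequence: two near-duplicate pairs
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
compact = merge_tokens(tokens, threshold=0.95)
```

With this toy input the four tokens collapse to two, while a threshold above 1 leaves the sequence untouched, which is the data-dependent behaviour a fixed-ratio scheme lacks.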
[COMMENTS]
Submitted to IEEE Journals
[LINK]
http://arxiv.org/abs/2509.09955v1
[DATE]
2025-09-12 12:11:59+08:00
[CATEGORIES]
cs.LG
PCGBandit: One-shot acceleration of transient PDE solvers via online-learned preconditioners
[AUTHORS]
Mikhail Khodak, Min Ki Jung, Brian Wynne, Edmond Chow, Egemen Kolemen
[ABSTRACT]
Data-driven acceleration of scientific computing workflows has been a
high-profile aim of machine learning (ML) for science, with numerical
simulation of transient partial differential equations (PDEs) being one of the
main applications. The focus thus far has been on methods that require
classical simulations to train, which when combined with the data-hungriness
and optimization challenges of neural networks has caused difficulties in
demonstrating a convincing advantage against strong classical baselines. We
consider an alternative paradigm in which the learner uses a classical solver’s
own data to accelerate it, enabling a one-shot speedup of the simulation.
Concretely, since transient PDEs often require solving a sequence of related
linear systems, the feedback from repeated calls to a linear solver such as
preconditioned conjugate gradient (PCG) can be used by a bandit algorithm to
online-learn an adaptive sequence of solver configurations (e.g.
preconditioners). The method we develop, PCGBandit, is implemented directly on
top of the popular open source software OpenFOAM, which we use to show its
effectiveness on a set of fluid and magnetohydrodynamics (MHD) problems.
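
The idea of learning a solver configuration from the solver's own feedback can be sketched with a simple bandit loop. The sketch below uses epsilon-greedy selection, which is a stand-in chosen for brevity and not necessarily the algorithm used in PCGBandit; the configuration names and their simulated costs are hypothetical, and a real setup would replace the simulated cost with the measured time or iteration count of each linear solve.

```python
import random

def run_bandit(costs, n_steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy selection of the configuration with the lowest
    observed mean cost; `costs` maps configuration name -> nominal cost."""
    rng = random.Random(seed)
    arms = list(costs)
    pulls = {a: 0 for a in arms}
    mean_cost = {a: 0.0 for a in arms}
    for _ in range(n_steps):
        untried = [a for a in arms if pulls[a] == 0]
        if untried:
            arm = untried[0]                      # try every arm once
        elif rng.random() < epsilon:
            arm = rng.choice(arms)                # explore
        else:
            arm = min(arms, key=mean_cost.get)    # exploit the cheapest so far
        # Simulated feedback: nominal cost with +/-10% noise (assumption)
        observed = costs[arm] * (1.0 + 0.1 * rng.uniform(-1.0, 1.0))
        pulls[arm] += 1
        mean_cost[arm] += (observed - mean_cost[arm]) / pulls[arm]
    return pulls, mean_cost

# Hypothetical preconditioner configurations and nominal solve costs
simulated_costs = {"config_a": 3.0, "config_b": 1.0, "config_c": 1.8}
pulls, mean_cost = run_bandit(simulated_costs)
```

Each linear solve in the time-stepping loop plays the role of one bandit round, so no separate training simulation is needed.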
[COMMENTS]
code available at https://github.com/mkhodak/PCGBandit
[LINK]
http://arxiv.org/abs/2509.08765v2
[DATE]
2025-09-12 11:49:57+08:00
[CATEGORIES]
cs.LG
Quantum-Assisted Machine Learning Models for Enhanced Weather Prediction
[AUTHORS]
Saiyam Sakhuja, Shivanshu Siyanwal, Abhishek Tiwari, Britant, Savita Kashyap
[ABSTRACT]
Quantum Machine Learning (QML) presents a revolutionary approach to
weather forecasting by using quantum computing to improve predictive modeling
capabilities. In this study, we apply QML models, including Quantum Gated
Recurrent Units (QGRUs), Quantum Neural Networks (QNNs), Quantum Long
Short-Term Memory (QLSTM), Variational Quantum Circuits (VQCs), and Quantum
Support Vector Machines (QSVMs), to analyze meteorological time-series data from
the ERA5 dataset. Our methodology includes preprocessing meteorological
features and implementing QML architectures for both classification and regression
tasks. The results demonstrate that QML models can achieve reasonable accuracy
in both prediction and classification tasks, particularly in binary
classification. However, challenges such as quantum hardware limitations and
noise affect scalability and generalization. This research provides insights
into the feasibility of QML for weather prediction, paving the way for further
exploration of hybrid quantum-classical frameworks to enhance meteorological
forecasting.
[COMMENTS]
Will require more permissions and data to be republished later for
academic rigor
[LINK]
http://arxiv.org/abs/2503.23408v3
[DATE]
2025-09-12 11:44:40+08:00
[CATEGORIES]
cs.LG
Constructive Universal Approximation and Sure Convergence for Multi-Layer Neural Networks
[AUTHORS]
Chien-Ming Chi
[ABSTRACT]
We propose o1Neuro, a new neural network model built on sparse indicator
activation neurons, with two key statistical properties. (1) Constructive
universal approximation: At the population level, a deep o1Neuro can
approximate any measurable function of $\boldsymbol{X}$, while a shallow
o1Neuro suffices for additive models with two-way interaction components,
including XOR and univariate terms, assuming $\boldsymbol{X} \in [0,1]^p$ has
bounded density. Combined with prior work showing that a single-hidden-layer
non-sparse network is a universal approximator, this highlights a trade-off
between activation sparsity and network depth in approximation capability. (2)
Sure convergence: At the sample level, the optimization of o1Neuro reaches an
optimal model with probability approaching one after sufficiently many update
rounds, and we provide an example showing that the required number of updates
is well bounded under linear data-generating models. Empirically, o1Neuro is
compared with XGBoost, Random Forests, and TabNet for learning complex
regression functions with interactions, demonstrating superior predictive
performance on several benchmark datasets from OpenML and the UCI Machine
Learning Repository with $n = 10000$, as well as on synthetic datasets with
$100 \le n \le 20000$.
[COMMENTS]
34 pages, 3 figures, 7 tables
[LINK]
http://arxiv.org/abs/2507.04779v2
[DATE]
2025-09-12 11:29:11+08:00
[CATEGORIES]
cs.LG
A Novel Approach to Balance Convenience and Nutrition in Meals With Long-Term Group Recommendations and Reasoning on Multimodal Recipes and its Implementation in BEACON
[AUTHORS]
Vansh Nagpal, Siva Likitha Valluru, Kausik Lakkaraju, Nitin Gupta, Zach Abdulrahman, Andrew Davison, Biplav Srivastava
[ABSTRACT]
A common decision made by people, whether healthy or with health conditions,
is choosing meals like breakfast, lunch, and dinner, comprising combinations of
foods for appetizer, main course, side dishes, desserts, and beverages. Often,
this decision involves tradeoffs between nutritious choices (e.g., salt and
sugar levels, nutrition content) and convenience (e.g., cost and accessibility,
cuisine type, food source type). We present a data-driven solution for meal
recommendations that considers customizable meal configurations and time
horizons. This solution balances user preferences while accounting for food
constituents and cooking processes. Our contributions include introducing
goodness measures, a recipe conversion method from text to the recently
introduced multimodal rich recipe representation (R3) format, learning methods
using contextual bandits that show promising preliminary results, and the
prototype, usage-inspired, BEACON system.
[LINK]
http://arxiv.org/abs/2412.17910v2
[DATE]
2025-09-12 11:23:17+08:00
[CATEGORIES]
cs.LG
Towards Developing Socially Compliant Automated Vehicles: Advances, Expert Insights, and A Conceptual Framework
[AUTHORS]
Yongqi Dong, Bart van Arem, Haneen Farah
[ABSTRACT]
Automated Vehicles (AVs) hold promise for revolutionizing transportation by
improving road safety, traffic efficiency, and overall mobility. Despite the
steady advancement in high-level AVs in recent years, the transition to full
automation entails a period of mixed traffic, where AVs of varying automation
levels coexist with human-driven vehicles (HDVs). Making AVs socially compliant
and understood by human drivers is expected to improve the safety and
efficiency of mixed traffic. Thus, ensuring AVs’ compatibility with HDVs and
social acceptance is crucial for their successful and seamless integration into
mixed traffic. However, research in this critical area of developing Socially
Compliant AVs (SCAVs) remains sparse. This study carries out the first
comprehensive scoping review to assess the current state of the art in
developing SCAVs, identifying key concepts, methodological approaches, and
research gaps. An informal expert interview was also conducted to discuss the
literature review results and identify critical research gaps and expectations
towards SCAVs. Based on the scoping review and expert interview input, a
conceptual framework is proposed for the development of SCAVs. The conceptual
framework is evaluated using an online survey targeting researchers,
technicians, policymakers, and other relevant professionals worldwide. The
survey results provide valuable validation and insights, affirming the
significance of the proposed conceptual framework in tackling the challenges of
integrating AVs into mixed-traffic environments. Additionally, future research
perspectives and suggestions are discussed, contributing to the research and
development agenda of SCAVs.
[COMMENTS]
23 pages, 13 figures, accepted by the Journal of Communications in
Transportation Research
[LINK]
http://arxiv.org/abs/2501.06089v3
[DATE]
2025-09-12 11:22:52+08:00
[CATEGORIES]
cs.LG
DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition
[AUTHORS]
Yifei Wang, Wenbin Wang, Yong Luo
[ABSTRACT]
Though Multimodal Intent Recognition (MIR) proves effective by utilizing rich
information from multiple sources (e.g., language, video, and audio), the
potential for intent-irrelevant and conflicting information across modalities
may hinder performance from being further improved. Most current models attempt
to fuse modalities by applying mechanisms like multi-head attention to unimodal
feature sequences and then adding the result back to the original
representation. This process risks corrupting the primary linguistic features
with noisy or irrelevant non-verbal signals, as it often fails to capture the
fine-grained, token-level influence where non-verbal cues should modulate, not
just augment, textual meaning. To address this, we introduce DyKen-Hyena, which
reframes the problem from feature fusion to processing modulation. Our model
translates audio-visual cues into dynamic, per-token convolutional kernels that
directly modulate textual feature extraction. This fine-grained approach
achieves state-of-the-art results on the MIntRec and MIntRec2.0 benchmarks.
Notably, it yields a +10.46% F1-score improvement in out-of-scope detection,
validating that our method creates a fundamentally more robust intent
representation.
[COMMENTS]
8 pages, 2 figures
[LINK]
http://arxiv.org/abs/2509.09940v1
[DATE]
2025-09-12 11:12:39+08:00
[CATEGORIES]
cs.LG
Quantum-Enhanced Forecasting for Deep Reinforcement Learning in Algorithmic Trading
[AUTHORS]
Jun-Hao Chen, Yu-Chien Huang, Yun-Cheng Tsai, Samuel Yen-Chi Chen
[ABSTRACT]
The convergence of quantum-inspired neural networks and deep reinforcement
learning offers a promising avenue for financial trading. We implemented a
trading agent for USD/TWD by integrating Quantum Long Short-Term Memory (QLSTM)
for short-term trend prediction with Quantum Asynchronous Advantage
Actor-Critic (QA3C), a quantum-enhanced variant of the classical A3C. Trained
on data from 2000-01-01 to 2025-04-30 (80% training, 20% testing), the
long-only agent achieves 11.87% return over around 5 years with 0.92% max
drawdown, outperforming several currency ETFs. We detail state design (QLSTM
features and indicators), reward function for trend-following/risk control, and
multi-core training. Results show hybrid models yield competitive FX trading
performance. Implications include QLSTM’s effectiveness for small-profit trades
with tight risk and future enhancements. Key hyperparameters: QLSTM sequence
length = 4, QA3C workers = 8. Limitations: classical quantum simulation and
simplified strategy. The views expressed in this article are those of
the authors and do not represent the views of Wells Fargo. This article is for
informational purposes only. Nothing contained in this article should be
construed as investment advice. Wells Fargo makes no express or implied
warranties and expressly disclaims all legal, tax, and accounting implications
related to this article.
[LINK]
http://arxiv.org/abs/2509.09176v2
[DATE]
2025-09-12 11:09:02+08:00
[CATEGORIES]
cs.LG
SciML Agents: Write the Solver, Not the Solution
[AUTHORS]
Saarth Gaonkar, Xiang Zheng, Haocheng Xi, Rishabh Tiwari, Kurt Keutzer, Dmitriy Morozov, Michael W. Mahoney, Amir Gholami
[ABSTRACT]
Recent work in scientific machine learning aims to tackle scientific tasks
directly by predicting target values with neural networks (e.g.,
physics-informed neural networks, neural ODEs, neural operators, etc.), but
attaining high accuracy and robustness has been challenging. We explore an
alternative view: use LLMs to write code that leverages decades of numerical
algorithms. This shifts the burden from learning a solution function to making
domain-aware numerical choices. We ask whether LLMs can act as SciML agents
that, given a natural-language ODE description, generate runnable code that is
scientifically appropriate, selecting suitable solvers (stiff vs. non-stiff),
and enforcing stability checks. There is currently no benchmark to measure this
kind of capability for scientific computing tasks. As such, we first introduce
two new datasets: a diagnostic dataset of adversarial “misleading” problems;
and a large-scale benchmark of 1,000 diverse ODE tasks. The diagnostic set
contains problems whose superficial appearance suggests stiffness, and that
require algebraic simplification to demonstrate non-stiffness; and the
large-scale benchmark spans stiff and non-stiff ODE regimes. We evaluate open-
and closed-source LLMs along two axes: (i) unguided versus guided
prompting with domain-specific knowledge; and (ii) off-the-shelf versus
fine-tuned variants. Our evaluation measures both executability and numerical
validity against reference solutions. We find that with sufficient context and
guided prompts, newer instruction-following models achieve high accuracy on
both criteria. In many cases, recent open-source systems perform strongly
without fine-tuning, while older or smaller models still benefit from
fine-tuning. Overall, our preliminary results indicate that careful prompting
and fine-tuning can yield a specialized LLM agent capable of reliably solving
simple ODE problems.
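
Why the stiff versus non-stiff solver choice mentioned above matters can be seen on the linear test problem y' = lam * y with a large negative lam. The sketch below is an illustration under assumed values (lam, step size, and step count are not from the paper): an explicit Euler step multiplies the solution by (1 + h * lam), which blows up when |1 + h * lam| > 1, while an implicit Euler step multiplies by 1 / (1 - h * lam), which decays for any positive h.

```python
def explicit_euler(lam, y0, h, n):
    """n explicit Euler steps for y' = lam * y."""
    y = y0
    for _ in range(n):
        y = y + h * lam * y
    return y

def implicit_euler(lam, y0, h, n):
    """n implicit (backward) Euler steps for y' = lam * y, solved in closed form."""
    y = y0
    for _ in range(n):
        y = y / (1.0 - h * lam)
    return y

# Assumed stiff test problem: the true solution decays rapidly to zero
lam, y0, h, n = -1000.0, 1.0, 0.01, 50
explicit = explicit_euler(lam, y0, h, n)   # growth factor per step: 1 + h*lam = -9
implicit = implicit_euler(lam, y0, h, n)   # decay factor per step: 1/(1 - h*lam) = 1/11

# A simple stability check of the kind an agent might enforce
explicit_is_stable = abs(1.0 + h * lam) <= 1.0
```

Here the explicit result grows without bound even though the true solution decays, so a generated program that picks a non-stiff method for such a problem would be runnable yet numerically invalid.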
[LINK]
http://arxiv.org/abs/2509.09936v1
[DATE]
2025-09-12 10:53:57+08:00
[CATEGORIES]
cs.LG
Multi-Play Combinatorial Semi-Bandit Problem
[AUTHORS]
Shintaro Nakamura, Yuko Kuroki, Wei Chen
[ABSTRACT]
In the combinatorial semi-bandit (CSB) problem, a player selects an action
from a combinatorial action set and observes feedback from the base arms
included in the action. While CSB is widely applicable to combinatorial
optimization problems, its restriction to binary decision spaces excludes
important cases involving non-negative integer flows or allocations, such as
the optimal transport and knapsack problems. To overcome this limitation, we
propose the multi-play combinatorial semi-bandit (MP-CSB), where a player can
select a non-negative integer action and observe multiple feedbacks from a
single arm in each round. We propose two algorithms for the MP-CSB. One is a
Thompson-sampling-based algorithm that is computationally feasible even when
the action space is exponentially large with respect to the number of arms, and
attains $O(\log T)$ distribution-dependent regret in the stochastic regime,
where $T$ is the time horizon. The other is a best-of-both-worlds algorithm,
which achieves $O(\log T)$ variance-dependent regret in the stochastic regime
and the worst-case $\tilde{\mathcal{O}}\left( \sqrt{T} \right)$ regret in the
adversarial regime. Moreover, its regret in the adversarial regime is data-dependent,
adapting to the cumulative loss of the optimal action, the total quadratic
variation, and the path-length of the loss sequence. Finally, we numerically
show that the proposed algorithms outperform existing methods in the CSB
literature.
[LINK]
http://arxiv.org/abs/2509.09933v1
[DATE]
2025-09-12 10:46:59+08:00
[CATEGORIES]
cs.LG
Balancing Utility and Privacy: Dynamically Private SGD with Random Projection
[AUTHORS]
Zhanhong Jiang, Md Zahid Hasan, Nastaran Saadati, Aditya Balu, Chao Liu, Soumik Sarkar
[ABSTRACT]
Stochastic optimization is a pivotal enabler in modern machine learning,
producing effective models for various tasks. However, several existing works
have shown that model parameters and gradient information are susceptible to
privacy leakage. Although Differentially Private SGD (DPSGD) addresses privacy
concerns, its static noise mechanism impacts the error bounds for model
performance. Additionally, with the exponential increase in model parameters,
efficient learning of these models using stochastic optimizers has become more
challenging. To address these concerns, we introduce the Dynamically
Differentially Private Projected SGD (D2P2-SGD) optimizer. In D2P2-SGD, we
combine two important ideas: (i) dynamic differential privacy (DDP) with
automatic gradient clipping and (ii) random projection with SGD, allowing
dynamic adjustment of the tradeoff between utility and privacy of the model. It
exhibits provably sub-linear convergence rates across different objective
functions, matching the best available rate. The theoretical analysis further
suggests that DDP leads to better utility at the cost of privacy, while random
projection enables more efficient model learning. Extensive experiments across
diverse datasets show that D2P2-SGD remarkably enhances accuracy while
maintaining privacy. Our code is available here.
[COMMENTS]
27 pages, 13 figures
[LINK]
http://arxiv.org/abs/2509.09485v2
[DATE]
2025-09-12 09:27:15+08:00
[CATEGORIES]
cs.LG
Self-Optimizing Machine Learning Potential Assisted Automated Workflow for Highly Efficient Complex Systems Material Design
[AUTHORS]
Jiaxiang Li, Junwei Feng, Jie Luo, Bowen Jiang, Xiangyu Zheng, Qigang Song, Jian Lv, Keith Butler, Hanyu Liu, Congwei Xie, Yu Xie, Yanming Ma
[ABSTRACT]
Machine learning interatomic potentials have revolutionized complex materials
design by enabling rapid exploration of material configurational spaces via
crystal structure prediction with ab initio accuracy. However, critical
challenges persist in ensuring robust generalization to unknown structures and
minimizing the requirement for substantial expert knowledge and time-consuming
manual interventions. Here, we propose an automated crystal structure
prediction framework built upon the attention-coupled neural networks potential
to address these limitations. The generalizability of the potential is achieved
by sampling regions across the local minima of the potential energy surface,
where the self-evolving pipeline autonomously refines the potential iteratively
while minimizing human intervention. The workflow is validated on Mg-Ca-H
ternary and Be-P-N-O quaternary systems by exploring nearly 10 million
configurations, demonstrating substantial speedup compared to first-principles
calculations. These results underscore the effectiveness of our approach in
accelerating the exploration and discovery of complex multi-component
functional materials.
[LINK]
http://arxiv.org/abs/2505.08159v3
[DATE]
2025-09-12 09:19:46+08:00
[CATEGORIES]
cs.LG
ForTIFAI: Fending Off Recursive Training Induced Failure for AI Models
[AUTHORS]
Soheil Zibakhsh Shabgahi, Pedram Aghazadeh, Azalia Mirhoseini, Farinaz Koushanfar
[ABSTRACT]
The increasing reliance on generative AI models has accelerated the
generation rate of synthetic data, with some projections suggesting that most
available new data for training could be machine-generated by 2030. This shift
to mainly synthetic content presents a critical challenge: repeated training
on synthetic data leads to a phenomenon known as model collapse, where model
performance degrades over generations of training, eventually rendering the
models ineffective. Although prior studies have explored the causes and
detection of model collapse, existing mitigation strategies remain limited.
In this paper, we identify models' overconfidence in their self-generated data
as a key driver of collapse. Building on this observation, we propose a
confidence-aware loss function that downweights high-confidence predictions
during training. We introduce a novel loss function we call Truncated Cross
Entropy (TCE). We demonstrate that TCE significantly delays model collapse in
recursive training.
We provide a model-agnostic framework that links the loss function design to
model collapse mitigation and validate our approach both theoretically and
empirically, showing that it can extend the model’s fidelity interval before
collapse by more than 2.3x. Finally, we show that our method generalizes across
modalities. These findings suggest that the design of loss functions provides a
simple yet powerful tool for preserving the quality of generative models in the
era of increasing synthetic data.
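
One way to read "downweighting high-confidence predictions" is to give zero weight to any prediction whose probability for the target already exceeds a threshold. The sketch below implements that reading only as an assumption: the abstract does not state the exact form of Truncated Cross Entropy, and the threshold `gamma`, the averaging over kept terms, and the function name are hypothetical choices.

```python
import math

def truncated_cross_entropy(target_probs, gamma=0.9):
    """Mean negative log-likelihood over predictions whose target
    probability is <= gamma; more confident predictions get zero weight."""
    kept = [p for p in target_probs if p <= gamma]
    if not kept:
        return 0.0  # every prediction was above the confidence threshold
    return -sum(math.log(p) for p in kept) / len(kept)

# Model probabilities assigned to the correct token at four positions (toy values)
probs = [0.99, 0.95, 0.5, 0.2]
loss = truncated_cross_entropy(probs)             # only 0.5 and 0.2 contribute
plain = truncated_cross_entropy(probs, gamma=1.0)  # ordinary cross entropy
```

Setting `gamma=1.0` recovers ordinary cross entropy, so the threshold acts as a single knob controlling how much of the already-confident signal is ignored.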
[LINK]
http://arxiv.org/abs/2509.08972v2
[DATE]
2025-09-12 09:02:19+08:00
[CATEGORIES]
cs.LG
Variational Neural Networks for Observable Thermodynamics (V-NOTS)
[AUTHORS]
Christopher Eldred, François Gay-Balmaz, Vakhtang Putkaradze
[ABSTRACT]
Much attention has recently been devoted to data-based computing of evolution
of physical systems. In such approaches, information about data points from
past trajectories in phase space is used to reconstruct the equations of motion
and to predict future solutions that have not been observed before. However, in
many cases, the available data does not correspond to the variables that define
the system’s phase space. We focus our attention on the important example of
dissipative dynamical systems. In that case, the phase space consists of
coordinates, momenta and entropies; however, the momenta and entropies cannot,
in general, be observed directly. To address this difficulty, we develop an
efficient data-based computing framework based exclusively on observable
variables, by constructing a novel approach based on the thermodynamic
Lagrangian, and constructing neural networks that respect the thermodynamics
and guarantee non-decreasing entropy evolution. We show that our network
can provide an efficient description of phase space evolution based on a
limited number of data points and a relatively small number of parameters in
the system.
[COMMENTS]
26 pages, 6 figures
[LINK]
http://arxiv.org/abs/2509.09899v1
[DATE]
2025-09-12 07:22:22+08:00
[CATEGORIES]
cs.LG
Accelerating 3D Photoacoustic Computed Tomography with End-to-End Physics-Aware Neural Operators
[AUTHORS]
Jiayun Wang, Yousuf Aborahama, Arya Khokhar, Yang Zhang, Chuwei Wang, Karteekeya Sastry, Julius Berner, Yilin Luo, Boris Bonev, Zongyi Li, Kamyar Azizzadenesheli, Lihong V. Wang, Anima Anandkumar
[ABSTRACT]
Photoacoustic computed tomography (PACT) combines optical contrast with
ultrasonic resolution, achieving deep-tissue imaging beyond the optical
diffusion limit. While three-dimensional PACT systems enable high-resolution
volumetric imaging for applications spanning transcranial to breast imaging,
current implementations require dense transducer arrays and prolonged
acquisition times, limiting clinical translation. We introduce Pano (PACT
imaging neural operator), an end-to-end physics-aware model that directly
learns the inverse acoustic mapping from sensor measurements to volumetric
reconstructions. Unlike existing approaches (e.g. universal back-projection
algorithm), Pano learns both physics and data priors while also being agnostic
to the input data resolution. Pano employs spherical discrete-continuous
convolutions to preserve hemispherical sensor geometry, incorporates Helmholtz
equation constraints to ensure physical consistency and operates
resolution-independently across varying sensor configurations. We demonstrate
the robustness and efficiency of Pano in reconstructing high-quality images
from both simulated and real experimental data, achieving consistent
performance even with significantly reduced transducer counts and limited-angle
acquisition configurations. The framework maintains reconstruction fidelity
across diverse sparse sampling patterns while enabling real-time volumetric
imaging capabilities. This advancement establishes a practical pathway for
making 3D PACT more accessible and feasible for both preclinical research and
clinical applications, substantially reducing hardware requirements without
compromising image reconstruction quality.
[LINK]
http://arxiv.org/abs/2509.09894v1
[DATE]
2025-09-12 07:12:55+08:00
[CATEGORIES]
cs.LG
Your Image is Secretly the Last Frame of a Pseudo Video
[AUTHORS]
Wenlong Chen, Wenlin Chen, Lapo Rastrelli, Yingzhen Li
[ABSTRACT]
Diffusion models, which can be viewed as a special case of hierarchical
variational autoencoders (HVAEs), have shown profound success in generating
photo-realistic images. In contrast, standard HVAEs often produce images of
inferior quality compared to diffusion models. In this paper, we hypothesize
that the success of diffusion models can be partly attributed to the additional
self-supervision information for their intermediate latent states provided by
corrupted images, which along with the original image form a pseudo video.
Based on this hypothesis, we explore the possibility of improving other types
of generative models with such pseudo videos. Specifically, we first extend a
given image generative model to its video generative model counterpart, and
then train the video generative model on pseudo videos constructed by applying
data augmentation to the original images. Furthermore, we analyze the potential
issues of first-order Markov data augmentation methods, which are typically
used in diffusion models, and propose to use more expressive data augmentation
to construct more useful information in pseudo videos. Our empirical results on
the CIFAR10 and CelebA datasets demonstrate that improved image generation
quality can be achieved with additional self-supervised information from pseudo
videos.
[COMMENTS]
Presented at the ICLR 2025 Workshop on Deep Generative Model in
Machine Learning: Theory, Principle and Efficacy (DeLTa). 1-frame results for
CIFAR10 in Table 2 corrected. Code released
[LINK]
http://arxiv.org/abs/2410.20158v3
[DATE]
2025-09-12 07:09:05+08:00
[CATEGORIES]
cs.LG
Automated Tuning for Diffusion Inverse Problem Solvers without Generative Prior Retraining
[AUTHORS]
Yaşar Utku Alçalar, Junno Yun, Mehmet Akçakaya
[ABSTRACT]
Diffusion/score-based models have recently emerged as powerful generative
priors for solving inverse problems, including accelerated MRI reconstruction.
While their flexibility allows decoupling the measurement model from the
learned prior, their performance heavily depends on carefully tuned data
fidelity weights, especially under fast sampling schedules with few denoising
steps. Existing approaches often rely on heuristics or fixed weights, which
fail to generalize across varying measurement conditions and irregular timestep
schedules. In this work, we propose Zero-shot Adaptive Diffusion Sampling
(ZADS), a test-time optimization method that adaptively tunes fidelity weights
across arbitrary noise schedules without requiring retraining of the diffusion
prior. ZADS treats the denoising process as a fixed unrolled sampler and
optimizes fidelity weights in a self-supervised manner using only undersampled
measurements. Experiments on the fastMRI knee dataset demonstrate that ZADS
consistently outperforms both traditional compressed sensing and recent
diffusion-based methods, showcasing its ability to deliver high-fidelity
reconstructions across varying noise schedules and acquisition settings.
[COMMENTS]
IEEE International Workshop on Computational Advances in Multi-Sensor
Adaptive Processing (CAMSAP), 2025
[LINK]
http://arxiv.org/abs/2509.09880v1
[DATE]
2025-09-12 06:22:32+08:00
[CATEGORIES]
cs.LG
A Topic Modeling Analysis of Stigma Dimensions, Social, and Related Behavioral Circumstances in Clinical Notes Among Patients with HIV
[AUTHORS]
Ziyi Chen, Yiyang Liu, Mattia Prosperi, Krishna Vaddiparti, Robert L Cook, Jiang Bian, Yi Guo, Yonghui Wu
[ABSTRACT]
Objective: To characterize stigma dimensions, social, and related behavioral
circumstances in people living with HIV (PLWHs) seeking care, using NLP methods
applied to a large collection of EHR clinical notes from a large integrated
health system in the southeast United States. Methods: We identified a cohort
of PLWHs from the UF Health IDR and performed topic modeling analysis using
Latent Dirichlet Allocation to uncover stigma-related dimensions and related
social and behavioral contexts. Domain experts created a seed list of
HIV-related stigma keywords, then applied a snowball strategy to review notes
for additional terms until saturation was reached iteratively. To identify more
target topics, we tested three keyword-based filtering strategies. The detected
topics were evaluated using three widely used metrics and manually reviewed by
specialists. In addition, we conducted word frequency analysis and topic
variation analysis among subgroups to examine differences across age and
sex-specific demographics. Results: We identified 9140 PLWHs at UF Health and
collected 2.9 million clinical notes. Through the iterative keyword approach,
we generated a list of 91 keywords associated with HIV-related stigma. Topic
modeling on sentences containing at least one keyword uncovered a wide range of
topic themes, such as “Mental Health Concern, Stigma”, “Treatment Refusal,
Isolation”, and “Substance Abuse”. Topic variation analysis across age
subgroups revealed substantial differences. Conclusion: Extracting and
understanding the HIV-related stigma and associated social and behavioral
circumstances from EHR clinical notes enables scalable, time-efficient
assessment and overcomes the limitations of traditional questionnaires.
Findings from this research provide actionable insights to inform patient care
and interventions to improve HIV-care outcomes.
[LINK]
http://arxiv.org/abs/2506.09279v2
[DATE]
2025-09-12 06:21:00+08:00
[CATEGORIES]
cs.LG
Efficient transformer adaptation for analog in-memory computing via low-rank adapters
[AUTHORS]
Chen Li, Elena Ferro, Corey Lammie, Manuel Le Gallo, Irem Boybat, Bipin Rajendran
[ABSTRACT]
Analog In-Memory Computing (AIMC) offers a promising solution to the von
Neumann bottleneck. However, deploying transformer models on AIMC remains
challenging due to their inherent need for flexibility and adaptability across
diverse tasks. For the benefits of AIMC to be fully realized, weights of static
vector-matrix multiplications must be mapped and programmed to analog devices
in a weight-stationary manner. This poses two challenges for adapting a base
network to hardware and downstream tasks: (i) conventional analog
hardware-aware (AHWA) training requires retraining the entire model, and (ii)
reprogramming analog devices is both time- and energy-intensive. To address
these issues, we propose Analog Hardware-Aware Low-Rank Adaptation (AHWA-LoRA)
training, a novel approach for efficiently adapting transformers to AIMC
hardware. AHWA-LoRA training keeps the analog weights fixed as meta-weights and
introduces lightweight external LoRA modules for both hardware and task
adaptation. We validate AHWA-LoRA training on SQuAD v1.1 and the GLUE
benchmark, demonstrate its scalability to larger models, and show its
effectiveness in instruction tuning and reinforcement learning. We further
evaluate a practical deployment scenario that balances AIMC tile latency with
digital LoRA processing using optimized pipeline strategies, with RISC-V-based
programmable multi-core accelerators. This hybrid architecture achieves
efficient transformer inference with only a 4% per-layer overhead compared to a
fully AIMC implementation.
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2411.17367v3
[DATE]
2025-09-12 05:49:49+08:00
[CATEGORIES]
cs.LG
Musculoskeletal simulation of limb movement biomechanics in Drosophila melanogaster
[AUTHORS]
Pembe Gizem Özdil, Chuanfang Ning, Jasper S. Phelps, Sibo Wang-Chen, Guy Elisha, Alexander Blanke, Auke Ijspeert, Pavan Ramdya
[ABSTRACT]
Computational models are critical to advance our understanding of how neural,
biomechanical, and physical systems interact to orchestrate animal behaviors.
Despite the availability of near-complete reconstructions of the Drosophila
melanogaster central nervous system, musculature, and exoskeleton, anatomically
and physically grounded models of fly leg muscles are still missing. These
models provide an indispensable bridge between motor neuron activity and joint
movements. Here, we introduce the first 3D, data-driven musculoskeletal model
of Drosophila legs, implemented in both OpenSim and MuJoCo simulation
environments. Our model incorporates a Hill-type muscle representation based on
high-resolution X-ray scans from multiple fixed specimens. We present a
pipeline for constructing muscle models using morphological imaging data and
for optimizing unknown muscle parameters specific to the fly. We then combine
our musculoskeletal models with detailed 3D pose estimation data from behaving
flies to achieve muscle-actuated behavioral replay in OpenSim. Simulations of
muscle activity across diverse walking and grooming behaviors predict
coordinated muscle synergies that can be tested experimentally. Furthermore, by
training imitation learning policies in MuJoCo, we test the effect of different
passive joint properties on learning speed and find that damping and stiffness
facilitate learning. Overall, our model enables the investigation of motor
control in an experimentally tractable model organism, providing insights into
how biomechanics contribute to the generation of complex limb movements. Moreover,
our model can be used to control embodied artificial agents to generate
naturalistic and compliant locomotion in simulated environments.
[COMMENTS]
23 pages, 11 figures
[LINK]
http://arxiv.org/abs/2509.06426v2
[DATE]
2025-09-12 05:45:17+08:00
[CATEGORIES]
cs.LG
Off Policy Lyapunov Stability in Reinforcement Learning
[AUTHORS]
Sarvan Gill, Daniela Constantinescu
[ABSTRACT]
Traditional reinforcement learning lacks the ability to provide stability
guarantees. More recent algorithms learn Lyapunov functions alongside the
control policies to ensure stable learning. However, the current self-learned
Lyapunov functions are sample inefficient due to their on-policy nature. This
paper introduces a method for learning Lyapunov functions off-policy and
incorporates the proposed off-policy Lyapunov function into the Soft Actor
Critic and Proximal Policy Optimization algorithms to provide them with a data
efficient stability certificate. Simulations of an inverted pendulum and a
quadrotor illustrate the improved performance of the two algorithms when
endowed with the proposed off-policy Lyapunov function.
[COMMENTS]
Conference on Robot Learning (CORL) 2025
[LINK]
http://arxiv.org/abs/2509.09863v1
[DATE]
2025-09-12 05:34:08+08:00
[CATEGORIES]
cs.LG
HiLight: A Hierarchical Reinforcement Learning Framework with Global Adversarial Guidance for Large-Scale Traffic Signal Control
[AUTHORS]
Yaqiao Zhu, Hongkai Wen, Geyong Min, Man Luo
[ABSTRACT]
Efficient traffic signal control (TSC) is essential for mitigating urban
congestion, yet existing reinforcement learning (RL) methods face challenges in
scaling to large networks while maintaining global coordination. Centralized RL
suffers from scalability issues, while decentralized approaches often lack
unified objectives, resulting in limited network-level efficiency. In this
paper, we propose HiLight, a hierarchical reinforcement learning framework with
global adversarial guidance for large-scale TSC. HiLight consists of a
high-level Meta-Policy, which partitions the traffic network into subregions
and generates sub-goals using a Transformer-LSTM architecture, and a low-level
Sub-Policy, which controls individual intersections with global awareness. To
improve the alignment between global planning and local execution, we introduce
an adversarial training mechanism, where the Meta-Policy generates challenging
yet informative sub-goals, and the Sub-Policy learns to surpass these targets,
leading to more effective coordination. We evaluate HiLight across both
synthetic and real-world benchmarks, and additionally construct a large-scale
Manhattan network with diverse traffic conditions, including peak transitions,
adverse weather, and holiday surges. Experimental results show that HiLight
exhibits significant advantages in large-scale scenarios and remains
competitive across standard benchmarks of varying sizes.
[LINK]
http://arxiv.org/abs/2506.14391v2
[DATE]
2025-09-12 05:09:24+08:00
[CATEGORIES]
cs.LG
An Information-Theoretic Framework for Credit Risk Modeling: Unifying Industry Practice with Statistical Theory for Fair and Interpretable Scorecards
[AUTHORS]
Agus Sudjianto, Denis Burakov
[ABSTRACT]
Credit risk modeling relies extensively on Weight of Evidence (WoE) and
Information Value (IV) for feature engineering, and Population Stability Index
(PSI) for drift monitoring, yet their theoretical foundations remain
disconnected. We establish a unified information-theoretic framework revealing
these industry-standard metrics as instances of classical information
divergences. Specifically, we prove that IV exactly equals PSI (Jeffreys
divergence) computed between good and bad credit outcomes over identical bins.
Through the delta method applied to WoE transformations, we derive standard
errors for IV and PSI, enabling formal hypothesis testing and probabilistic
fairness constraints for the first time. We formalize credit modeling’s
inherent performance-fairness trade-off as maximizing IV for predictive power
while minimizing IV for protected attributes. Using automated binning with
depth-1 XGBoost stumps, we compare three encoding strategies: logistic
regression with one-hot encoding, WoE transformation, and constrained XGBoost.
All methods achieve comparable predictive performance (AUC 0.82-0.84),
demonstrating that principled, information-theoretic binning outweighs encoding
choice. Mixed-integer programming traces Pareto-efficient solutions along the
performance-fairness frontier with uncertainty quantification. This framework
bridges theory and practice, providing the first rigorous statistical
foundation for widely-used credit risk metrics while offering principled tools
for balancing accuracy and fairness in regulated environments.
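The stated identity between IV and PSI can be checked numerically. The sketch
below is a minimal illustration, assuming the standard textbook definitions
WoE_i = ln(g_i / b_i), IV = sum_i (g_i - b_i) * WoE_i and
PSI = sum_i (a_i - e_i) * ln(a_i / e_i), where g_i and b_i are the shares of
good and bad outcomes in bin i. The bin counts and function names are
hypothetical and are not taken from the paper.

```python
import math


def information_value(good_counts, bad_counts):
    """IV = sum_i (g_i - b_i) * WoE_i, with WoE_i = ln(g_i / b_i)."""
    total_good, total_bad = sum(good_counts), sum(bad_counts)
    iv = 0.0
    for g, b in zip(good_counts, bad_counts):
        g_share, b_share = g / total_good, b / total_bad
        woe = math.log(g_share / b_share)  # Weight of Evidence for this bin
        iv += (g_share - b_share) * woe
    return iv


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def psi(actual, expected):
    """Population Stability Index between two binned distributions."""
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical bin counts (illustration only); every bin must be non-empty.
good_counts = [400, 300, 200, 100]
bad_counts = [20, 30, 50, 100]
good_dist = [g / sum(good_counts) for g in good_counts]
bad_dist = [b / sum(bad_counts) for b in bad_counts]

iv = information_value(good_counts, bad_counts)
psi_value = psi(good_dist, bad_dist)
# Jeffreys divergence is the symmetrized KL divergence.
jeffreys = kl_divergence(good_dist, bad_dist) + kl_divergence(bad_dist, good_dist)
print(iv, psi_value, jeffreys)
```

All three printed values agree up to floating-point rounding, because each
reduces to the same sum of (g_i - b_i) * ln(g_i / b_i) over the shared bins.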
[LINK]
http://arxiv.org/abs/2509.09855v1
[DATE]
2025-09-12 05:05:34+08:00
[CATEGORIES]
cs.LG
HGEN: Heterogeneous Graph Ensemble Networks
[AUTHORS]
Jiajun Shen, Yufei Jin, Yi He, Xingquan Zhu
[ABSTRACT]
This paper presents HGEN that pioneers ensemble learning for heterogeneous
graphs. We argue that the heterogeneity in node types, nodal features, and
local neighborhood topology poses significant challenges for ensemble learning,
particularly in accommodating diverse graph learners. Our HGEN framework
ensembles multiple learners through a meta-path and transformation-based
optimization pipeline to uplift classification accuracy. Specifically, HGEN
uses meta-path combined with random dropping to create Allele Graph Neural
Networks (GNNs), whereby the base graph learners are trained and aligned for
later ensembling. To ensure effective ensemble learning, HGEN presents two key
components: 1) a residual-attention mechanism to calibrate allele GNNs of
different meta-paths, thereby enforcing node embeddings to focus on more
informative graphs to improve base learner accuracy, and 2) a
correlation-regularization term to enlarge the disparity among embedding
matrices generated from different meta-paths, thereby enriching base learner
diversity. We analyze the convergence of HGEN and attest its higher
regularization magnitude over simple voting. Experiments on five heterogeneous
networks validate that HGEN consistently outperforms its state-of-the-art
competitors by a substantial margin.
[COMMENTS]
The paper is in proceedings of the 34th IJCAI Conference, 2025
[LINK]
http://arxiv.org/abs/2509.09843v1
[DATE]
2025-09-12 04:50:00+08:00
[CATEGORIES]
cs.LG
Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models
[AUTHORS]
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
[ABSTRACT]
Vision-language models (VLMs) deliver strong zero-shot recognition but
frequently inherit social biases from their training data. We systematically
disentangle three design factors – model size, training-data scale, and
training-data source – by comparing CLIP and OpenCLIP, two models that share
an identical contrastive objective yet differ in encoder width and in the
image-text corpora on which they are pre-trained (400M proprietary pairs vs.
400M/2B LAION). Across balanced face-analysis benchmarks, enlarging the encoder
reduces gender skew in CLIP but amplifies both gender and racial skew in
OpenCLIP; increasing the LAION corpus from 400M to 2B further increases
OpenCLIP bias. At matched model and data budgets, substituting proprietary data
with LAION improves gender fairness while increasing racial skew, underscoring
data source as the primary driver of bias patterns. We also evaluate three
post-hoc, test-time debiasing strategies – Bias Prompts, Prompt Array, and
SANER. Debiasing reduces but does not eliminate harm, and its effectiveness is
source- and size-dependent: Bias Prompts most effectively reduce gender skew in
CLIP at smaller model sizes, whereas Prompt Array and SANER more reliably
reduce racial skew in OpenCLIP; scaling LAION reconfigures which method is most
fair. Taken together, these findings challenge the assumption that bigger
models or datasets are automatically fairer and foreground training data source
as the key determinant of both bias and mitigation efficacy. We release code
and evaluation scripts to enable transparent, reproducible auditing of future
VLMs.
[LINK]
http://arxiv.org/abs/2501.13223v5
[DATE]
2025-09-12 04:42:53+08:00
[CATEGORIES]
cs.LG
Counterfactual Probabilistic Diffusion with Expert Models
[AUTHORS]
Wenhao Mu, Zhi Cao, Mehmed Uludag, Alexander Rodríguez
[ABSTRACT]
Predicting counterfactual distributions in complex dynamical systems is
essential for scientific modeling and decision-making in domains such as public
health and medicine. However, existing methods often rely on point estimates or
purely data-driven models, which tend to falter under data scarcity. We propose
a time series diffusion-based framework that incorporates guidance from
imperfect expert models by extracting high-level signals to serve as structured
priors for generative modeling. Our method, ODE-Diff, bridges mechanistic and
data-driven approaches, enabling more reliable and interpretable causal
inference. We evaluate ODE-Diff across semi-synthetic COVID-19 simulations,
synthetic pharmacological dynamics, and real-world case studies, demonstrating
that it consistently outperforms strong baselines in both point prediction and
distributional accuracy.
[LINK]
http://arxiv.org/abs/2508.13355v2
[DATE]
2025-09-12 04:38:41+08:00
[CATEGORIES]
cs.LG
Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning
[AUTHORS]
Reza Asad, Reza Babanezhad, Sharan Vaswani
[ABSTRACT]
Value-based approaches such as DQN are the default methods for off-policy
reinforcement learning with discrete-action environments such as Atari. Common
policy-based methods are either on-policy and do not effectively learn from
off-policy data (e.g. PPO), or have poor empirical performance in the
discrete-action setting (e.g. SAC). Consequently, starting from discrete SAC
(DSAC), we revisit the design of actor-critic methods in this setting. First,
we determine that the coupling between the actor and critic entropy is the
primary reason behind the poor performance of DSAC. We demonstrate that by
merely decoupling these components, DSAC can achieve performance comparable to
DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic
framework that subsumes DSAC as a special case. Our framework allows using an
m-step Bellman operator for the critic update, and enables combining standard
policy optimization methods with entropy regularization to instantiate the
resulting actor objective. Theoretically, we prove that the proposed methods
can guarantee convergence to the optimal regularized value function in the
tabular setting. Empirically, we demonstrate that these methods can approach
the performance of DQN on standard Atari games, and do so even without entropy
regularization or explicit exploration.
[LINK]
http://arxiv.org/abs/2509.09838v1
[DATE]
2025-09-12 04:34:08+08:00
[CATEGORIES]
cs.LG
CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio
[AUTHORS]
Marco Pasini, Stefan Lattner, George Fazekas
[ABSTRACT]
Efficiently representing audio signals in a compressed latent space is
critical for latent generative modelling. However, existing autoencoders often
force a choice between continuous embeddings and discrete tokens. Furthermore,
achieving high compression ratios while maintaining audio fidelity remains a
challenge. We introduce CoDiCodec, a novel audio autoencoder that overcomes
these limitations by both efficiently encoding global features via summary
embeddings, and by producing both compressed continuous embeddings at ~11 Hz
and discrete tokens at a rate of 2.38 kbps from the same trained model,
offering unprecedented flexibility for different downstream generative tasks.
This is achieved through Finite Scalar Quantization (FSQ) and a novel
FSQ-dropout technique, and does not require additional loss terms beyond the
single consistency loss used for end-to-end training. CoDiCodec supports both
autoregressive decoding and a novel parallel decoding strategy, with the latter
achieving superior audio quality and faster decoding. CoDiCodec outperforms
existing continuous and discrete autoencoders at similar bitrates in terms of
reconstruction audio quality. Our work enables a unified approach to audio
compression, bridging the gap between continuous and discrete generative
modelling paradigms.
[COMMENTS]
Accepted to ISMIR 2025
[LINK]
http://arxiv.org/abs/2509.09836v1
[DATE]
2025-09-12 04:31:18+08:00
[CATEGORIES]
cs.LG
Estimating carbon pools in the shelf sea environment: reanalysis or model-informed machine learning?
[AUTHORS]
Jozef Skakala
[ABSTRACT]
Shelf seas are important for carbon sequestration and carbon cycle, but shelf
sea observations for carbon pools are often sparse or highly uncertain.
An alternative can be provided by reanalyses, but these are often expensive to
run. We propose to use an ensemble of neural networks (i.e. deep ensemble) to
learn from a coupled physics-biogeochemistry model the relationship between the
directly observable variables and carbon pools. We demonstrate for the North-West
European Shelf (NWES) sea environment that when the deep ensemble trained on a
model free run simulation is applied to the NWES reanalysis, it is able to
reproduce the reanalysis outputs for carbon pools and additionally provide
uncertainty information. We focus on explainability of the results and
demonstrate potential use of the deep ensembles for future climate what-if
scenarios. We suggest that model-informed machine learning presents a viable
alternative to expensive reanalyses and could complement observations, wherever
they are missing and/or highly uncertain.
[COMMENTS]
24 pages, 9 figures (3 in the appendix), v2 - minor changes
[LINK]
http://arxiv.org/abs/2508.10178v2
[DATE]
2025-09-12 04:03:23+08:00
[CATEGORIES]
cs.LG
DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
[AUTHORS]
Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool
[ABSTRACT]
Robust semantic perception for autonomous vehicles relies on effectively
combining multiple sensors with complementary strengths and weaknesses.
State-of-the-art sensor fusion approaches to semantic perception often treat
sensor data uniformly across the spatial extent of the input, which hinders
performance when faced with challenging conditions. By contrast, we propose a
novel depth-guided multimodal fusion method that upgrades condition-aware
fusion by integrating depth information. Our network, DGFusion, poses
multimodal segmentation as a multi-task problem, utilizing the lidar
measurements, which are typically available in outdoor sensor suites, both as
one of the model’s inputs and as ground truth for learning depth. Our
corresponding auxiliary depth head helps to learn depth-aware features, which
are encoded into spatially varying local depth tokens that condition our
attentive cross-modal fusion. Together with a global condition token, these
local depth tokens dynamically adapt sensor fusion to the spatially varying
reliability of each sensor across the scene, which largely depends on depth. In
addition, we propose a robust loss for our depth, which is essential for
learning from lidar inputs that are typically sparse and noisy in adverse
conditions. Our method achieves state-of-the-art panoptic and semantic
segmentation performance on the challenging MUSES and DELIVER datasets. Code
and models will be available at https://github.com/timbroed/DGFusion
[COMMENTS]
Code and models will be available at
https://github.com/timbroed/DGFusion
[LINK]
http://arxiv.org/abs/2509.09828v1
[DATE]
2025-09-12 04:03:00+08:00
[CATEGORIES]
cs.LG
Early Detection of Visual Impairments at Home Using a Smartphone Red-Eye Reflex Test
[AUTHORS]
Judith Massmann, Alexander Lichtenstein, Francisco M. López
[ABSTRACT]
Numerous visual impairments can be detected in red-eye reflex images from
young children. The so-called Bruckner test is traditionally performed by
ophthalmologists in clinical settings. Thanks to the recent technological
advances in smartphones and artificial intelligence, it is now possible to
recreate the Bruckner test using a mobile device. In this paper, we present a
first study conducted during the development of KidsVisionCheck, a free
application that can perform vision screening with a mobile device using
red-eye reflex images. The underlying model relies on deep neural networks
trained on children’s pupil images collected and labeled by an ophthalmologist.
With an accuracy of 90% on unseen test data, our model provides highly reliable
performance without the necessity of specialist equipment. Furthermore, we can
identify the optimal conditions for data collection, which can in turn be used
to provide immediate feedback to the users. In summary, this work marks a first
step toward accessible pediatric vision screenings and early intervention for
vision abnormalities worldwide.
[COMMENTS]
Accepted at IEEE ICDL 2025. 6 pages, 7 figures, 2 tables
[LINK]
http://arxiv.org/abs/2509.09808v1
[DATE]
2025-09-12 03:27:57+08:00
[CATEGORIES]
cs.LG
Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation
[AUTHORS]
Tianqi Qiao, Marie Maros
[ABSTRACT]
We propose and study Sparse Polyak, a variant of Polyak’s adaptive step size,
designed to solve high-dimensional statistical estimation problems where the
problem dimension is allowed to grow much faster than the sample size. In such
settings, the standard Polyak step size performs poorly, requiring an
increasing number of iterations to achieve optimal statistical precision, even
when the problem remains well conditioned and/or the achievable precision
itself does not degrade with problem size. We trace this limitation to a
mismatch in how smoothness is measured: in high dimensions, it is no longer
effective to estimate the Lipschitz smoothness constant. Instead, it is more
appropriate to estimate the smoothness restricted to specific directions
relevant to the problem (restricted Lipschitz smoothness constant). Sparse
Polyak overcomes this issue by modifying the step size to estimate the
restricted Lipschitz smoothness constant. We support our approach with both
theoretical analysis and numerical experiments, demonstrating its improved
performance.
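For orientation, the sketch below shows the classical Polyak step size that
the abstract takes as its starting point, eta_t = (f(x_t) - f*) / ||grad f(x_t)||^2.
It does not implement Sparse Polyak, whose exact rule is not given in the
abstract. The toy quadratic objective, its known optimal value f* = 0, and all
function names are assumptions made purely for illustration.

```python
import math

# Toy objective f(x) = 0.5 * sum_i a_i * x_i^2 with minimum value f* = 0 at x = 0.
CURVATURES = [1.0, 10.0]


def objective(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURES, x))


def gradient(x):
    return [a * xi for a, xi in zip(CURVATURES, x)]


def polyak_descent(x0, f_star=0.0, num_steps=500):
    """Gradient descent with the classical Polyak step size."""
    x = list(x0)
    # The minimizer is the origin, so the norm of x is the distance to it.
    distances = [math.sqrt(sum(xi * xi for xi in x))]
    for _ in range(num_steps):
        g = gradient(x)
        grad_sq = sum(gi * gi for gi in g)
        if grad_sq == 0.0:  # already at a stationary point
            break
        step = (objective(x) - f_star) / grad_sq  # Polyak step size
        x = [xi - step * gi for xi, gi in zip(x, g)]
        distances.append(math.sqrt(sum(xi * xi for xi in x)))
    return x, distances


x_final, distances = polyak_descent([1.0, 1.0])
print(objective(x_final), distances[0], distances[-1])
```

For a convex objective this step never increases the distance to the
minimizer, which is why the rule needs no tuned learning rate, only the
optimal value f*.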
[LINK]
http://arxiv.org/abs/2509.09802v1
[DATE]
2025-09-12 03:13:05+08:00
[CATEGORIES]
cs.LG
Learning Value of Information towards Joint Communication and Control in 6G V2X
[AUTHORS]
Lei Lei, Kan Zheng, Xuemin Shen
[ABSTRACT]
As Cellular Vehicle-to-Everything (C-V2X) evolves towards future
sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are
emerging to become a key application. Leveraging data-driven Machine Learning
(ML), especially Deep Reinforcement Learning (DRL), is expected to
significantly enhance CAV decision-making in both vehicle control and V2X
communication under uncertainty. These two decision-making processes are
closely intertwined, with the value of information (VoI) acting as a crucial
bridge between them. In this paper, we introduce Sequential Stochastic Decision
Process (SSDP) models to define and assess VoI, demonstrating their application
in optimizing communication systems for CAVs. Specifically, we formally define
the SSDP model and demonstrate that the MDP model is a special case of it. The
SSDP model offers a key advantage by explicitly representing the set of
information that can enhance decision-making when available. Furthermore, as
current research on VoI remains fragmented, we propose a systematic VoI
modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal
Control theories. We define different categories of VoI and discuss their
corresponding estimation methods. Finally, we present a structured approach to
leverage the various VoI metrics for optimizing the “When”, “What”, and “How”
to communicate problems. For this purpose, SSDP models are formulated
with VoI-associated reward functions derived from VoI-based optimization
objectives. While we use a simple vehicle-following control problem to
illustrate the proposed methodology, it holds significant potential to
facilitate the joint optimization of stochastic, sequential control and
communication decisions in a wide range of networked control systems.
[LINK]
http://arxiv.org/abs/2505.06978v3
[DATE]
2025-09-12 03:05:35+08:00
[CATEGORIES]
cs.LG
Auxiliary Discriminator Sequence Generative Adversarial Networks (ADSeqGAN) for Few Sample Molecule Generation
[AUTHORS]
Haocheng Tang, Jing Long, Beihong Ji, Junmei Wang
[ABSTRACT]
In this work, we introduce Auxiliary Discriminator Sequence Generative
Adversarial Networks (ADSeqGAN), a novel approach for molecular generation in
small-sample datasets. Traditional generative models often struggle with
limited training data, particularly in drug discovery, where molecular datasets
for specific therapeutic targets, such as nucleic acid binders and central
nervous system (CNS) drugs, are scarce. ADSeqGAN addresses this challenge by
integrating an auxiliary random forest classifier as an additional
discriminator into the GAN framework, significantly improving molecular
generation quality and class specificity. Our method incorporates a pretrained
generator and the Wasserstein distance to enhance training stability and diversity.
We evaluate ADSeqGAN across three representative cases. First, on nucleic acid-
and protein-targeting molecules, ADSeqGAN shows superior capability in
generating nucleic acid binders compared to baseline models. Second, through
oversampling, it markedly improves CNS drug generation, achieving higher yields
than traditional de novo models. Third, in cannabinoid receptor type 1 (CB1)
ligand design, ADSeqGAN generates novel druglike molecules, with 32.8%
predicted actives surpassing hit rates of CB1-focused and general-purpose
libraries when assessed by a target-specific LRIP-SF scoring function. Overall,
ADSeqGAN offers a versatile framework for molecular design in data-scarce
scenarios, with demonstrated applications in nucleic acid binders, CNS drugs,
and CB1 ligands.
[COMMENTS]
Accepted by Journal of Chemical Information and Modeling, ASAP
[LINK]
http://arxiv.org/abs/2502.16446v2
[DATE]
2025-09-12 03:05:03+08:00
[CATEGORIES]
cs.LG
A Modular and Multimodal Generative AI Framework for Urban Building Energy Data: Generating Synthetic Homes
[AUTHORS]
Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra
[ABSTRACT]
Computational models have emerged as powerful tools for energy modeling
research, touting scalability and quantitative results. However, these models
require a plethora of data, some of which is inaccessible, expensive, or raises
privacy concerns. We introduce a modular multimodal framework to produce this
data from publicly accessible residential information and images using
generative artificial intelligence (AI). Additionally, we provide a pipeline
demonstrating this framework, and we evaluate its generative AI components. Our
experiments show that our framework’s use of AI avoids common issues with
generative models. Our framework produces realistic, labeled data. By reducing
dependence on costly or restricted data sources, we pave a path towards more
accessible and reproducible research.
[COMMENTS]
44 pages; 2 appendices; 9 figures; 1 table. Code available at
https://github.com/Lafayette-EshbaughSilveyra-Group/synthetic-homes
[LINK]
http://arxiv.org/abs/2509.09794v1
[DATE]
2025-09-12 02:53:21+08:00
[CATEGORIES]
cs.LG
From the Gradient-Step Denoiser to the Proximal Denoiser and their associated convergent Plug-and-Play algorithms
[AUTHORS]
Vincent Herfeld, Baudouin Denis de Senneville, Arthur Leclaire, Nicolas Papadakis
[LINK]
http://arxiv.org/abs/2509.09793v1
[DATE]
2025-09-12 02:53:08+08:00
[CATEGORIES]
cs.LG
InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction
[AUTHORS]
Zhichen Zeng, Xiaolong Liu, Mengyue Hang, Xiaoyi Liu, Qinghai Zhou, Chaofei Yang, Yiqun Liu, Yichen Ruan, Laming Chen, Yuxin Chen, Yujia Hao, Jiaqi Xu, Jade Nie, Xi Liu, Buyun Zhang, Wei Wen, Siyang Yuan, Hang Yin, Xin Zhang, Kai Wang, Wen-Yen Chen, Yiping Han, Huayu Li, Chunzhi Yang, Bo Long, Philip S. Yu, Hanghang Tong, Jiyan Yang
[ABSTRACT]
Click-through rate (CTR) prediction, which predicts the probability of a user
clicking an ad, is a fundamental task in recommender systems. The emergence of
heterogeneous information, such as user profile and behavior sequences, depicts
user interests from different aspects. A mutually beneficial integration of
heterogeneous information is the cornerstone towards the success of CTR
prediction. However, most of the existing methods suffer from two fundamental
limitations, including (1) insufficient inter-mode interaction due to the
unidirectional information flow between modes, and (2) aggressive information
aggregation caused by early summarization, resulting in excessive information
loss. To address the above limitations, we propose a novel module named
InterFormer to learn heterogeneous information interaction in an interleaving
style. To achieve better interaction learning, InterFormer enables
bidirectional information flow for mutually beneficial learning across
different modes. To avoid aggressive information aggregation, we retain
complete information in each data mode and use a separate bridging arch for
effective information selection and summarization. Our proposed InterFormer
achieves state-of-the-art performance on three public datasets and a
large-scale industrial dataset.
[COMMENTS]
11 pages, 6 figures
[LINK]
http://arxiv.org/abs/2411.09852v4
[DATE]
2025-09-12 02:51:53+08:00
[CATEGORIES]
cs.LG
One Head, Many Models: Cross-Attention Routing for Cost-Aware LLM Selection
[AUTHORS]
Roshini Pulishetty, Mani Kishan Ghantasala, Keerthy Kaushik Dasoju, Niti Mangwani, Vishal Garimella, Aditya Mate, Somya Chatterjee, Yue Kang, Ehi Nosakhare, Sadid Hasan, Soundar Srinivasan
[ABSTRACT]
The proliferation of large language models (LLMs) with varying computational
costs and performance profiles presents a critical challenge for scalable,
cost-effective deployment in real-world applications. We introduce a unified
routing framework that leverages a single-head cross-attention mechanism to
jointly model query and model embeddings, enabling dynamic selection of the
optimal LLM for each input query. Our approach is evaluated on RouterBench, a
large-scale, publicly available benchmark encompassing diverse LLM pools and
domains. By explicitly capturing fine-grained query-model interactions, our
router predicts both response quality and generation cost, achieving up to 6.6%
improvement in Average Improvement in Quality (AIQ) and 2.9% in maximum
performance over existing routers. To robustly balance performance and cost, we
propose an exponential reward function that enhances stability across user
preferences. The resulting architecture is lightweight, generalizes effectively
across domains, and demonstrates improved efficiency compared to prior methods,
establishing a new standard for cost-aware LLM routing.
[LINK]
http://arxiv.org/abs/2509.09782v1
[DATE]
2025-09-12 02:29:09+08:00
[CATEGORIES]
cs.LG
Demo: Healthcare Agent Orchestrator (HAO) for Patient Summarization in Molecular Tumor Boards
[AUTHORS]
Matthias Blondeel, Noel Codella, Sam Preston, Hao Qiu, Leonardo Schettini, Frank Tuan, Wen-wai Yim, Smitha Saligrama, Mert Öz, Shrey Jain, Matthew P. Lungren, Thomas Osborne
[ABSTRACT]
Molecular Tumor Boards (MTBs) are multidisciplinary forums where oncology
specialists collaboratively assess complex patient cases to determine optimal
treatment strategies. A central element of this process is the patient summary,
typically compiled by a medical oncologist, radiation oncologist, or surgeon,
or their trained medical assistant, who distills heterogeneous medical records
into a concise narrative to facilitate discussion. This manual approach is
often labor-intensive, subjective, and prone to omissions of critical
information. To address these limitations, we introduce the Healthcare Agent
Orchestrator (HAO), a Large Language Model (LLM)-driven AI agent that
coordinates a multi-agent clinical workflow to generate accurate and
comprehensive patient summaries for MTBs. Evaluating predicted patient
summaries against ground truth presents additional challenges due to stylistic
variation, ordering, synonym usage, and phrasing differences, which complicate
the measurement of both succinctness and completeness. To overcome these
evaluation hurdles, we propose TBFact, a “model-as-a-judge” framework
designed to assess the comprehensiveness and succinctness of generated
summaries. Using a benchmark dataset derived from de-identified tumor board
discussions, we applied TBFact to evaluate our Patient History agent. Results
show that the agent captured 94% of high-importance information (including
partial entailments) and achieved a TBFact recall of 0.84 under strict
entailment criteria. We further demonstrate that TBFact enables a data-free
evaluation framework that institutions can deploy locally without sharing
sensitive clinical data. Together, HAO and TBFact establish a robust foundation
for delivering reliable and scalable support to MTBs.
[COMMENTS]
9 pages, 1 figure; Added missing co-authors and contributors
[LINK]
http://arxiv.org/abs/2509.06602v2
[DATE]
2025-09-12 01:52:20+08:00
[CATEGORIES]
cs.LG
Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings
[AUTHORS]
Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable performance across
a wide range of natural language processing (NLP) tasks, leading to widespread
adoption in both research and industry. However, their inference workloads are
computationally and energy intensive, raising concerns about sustainability and
environmental impact. As LLMs continue to scale, it becomes essential to
identify and optimize the factors that influence their runtime efficiency
without compromising performance. In this work, we systematically investigate
the energy-performance trade-offs of LLMs during inference. We benchmark models
of varying sizes and architectures, including Falcon-7B, Mistral-7B-v0.1,
LLaMA-3.2-1B, LLaMA-3.2-3B, and GPT-Neo-2.7B, across tasks such as question
answering, commonsense reasoning, and factual generation. We analyze the effect
of input characteristics, such as sequence length, entropy, and named entity
density. Furthermore, we examine the impact of hardware-level
optimizations through Dynamic Voltage and Frequency Scaling (DVFS), measuring
how different GPU clock settings affect latency and power consumption. Our
empirical findings show that model architecture, input complexity, and clock
configuration significantly influence inference efficiency. By correlating
input features with energy metrics and evaluating DVFS behavior, we identify
practical strategies that reduce energy consumption by up to 30% while
preserving model quality. This study provides actionable insights for designing
energy-efficient and sustainable LLM inference systems.
[LINK]
http://arxiv.org/abs/2501.08219v3
[DATE]
2025-09-12 01:49:08+08:00
[CATEGORIES]
cs.LG
Modular Jump Gaussian Processes
[AUTHORS]
Anna R. Flowers, Christopher T. Franck, Mickaël Binois, Chiwoo Park, Robert B. Gramacy
[ABSTRACT]
Gaussian processes (GPs) furnish accurate nonlinear predictions with
well-calibrated uncertainty. However, the typical GP setup has a built-in
stationarity assumption, making it ill-suited for modeling data from processes
with sudden changes, or “jumps” in the output variable. The “jump GP” (JGP) was
developed for modeling data from such processes, combining local GPs and latent
“level” variables under a joint inferential framework. But joint modeling can
be fraught with difficulty. We aim to simplify by suggesting a more modular
setup, eschewing joint inference but retaining the main JGP themes: (a)
learning optimal neighborhood sizes that locally respect manifolds of
discontinuity; and (b) a new cluster-based (latent) feature to capture regions
of distinct output levels on both sides of the manifold. We show that each of
(a) and (b) separately leads to dramatic improvements when modeling processes
with jumps. In tandem (but without requiring joint inference) that benefit is
compounded, as illustrated on real and synthetic benchmark examples from the
recent literature.
[COMMENTS]
19 pages, 13 figures
[LINK]
http://arxiv.org/abs/2505.15557v2
[DATE]
2025-09-12 01:23:43+08:00
[CATEGORIES]
cs.LG
Directly Aligning the Full Diffusion Trajectory with Fine-Grained Human Preference
[AUTHORS]
Xiangwei Shen, Zhimin Li, Zhantao Yang, Shiyi Zhang, Yingfang Zhang, Donghao Li, Chunyu Wang, Qinglin Lu, Yansong Tang
[ABSTRACT]
Recent studies have demonstrated the effectiveness of directly aligning
diffusion models with human preferences using differentiable reward. However,
they exhibit two primary challenges: (1) they rely on multistep denoising with
gradient computation for reward scoring, which is computationally expensive,
thus restricting optimization to only a few diffusion steps; (2) they often
need continuous offline adaptation of reward models in order to achieve desired
aesthetic quality, such as photorealism or precise lighting effects. To address
the limitation of multistep denoising, we propose Direct-Align, a method that
predefines a noise prior to effectively recover original images from any time
steps via interpolation, leveraging the equation that diffusion states are
interpolations between noise and target images, which effectively avoids
over-optimization in late timesteps. Furthermore, we introduce Semantic
Relative Preference Optimization (SRPO), in which rewards are formulated as
text-conditioned signals. This approach enables online adjustment of rewards in
response to positive and negative prompt augmentation, thereby reducing the
reliance on offline reward fine-tuning. By fine-tuning the FLUX model with
optimized denoising and online reward adjustment, we improve its
human-evaluated realism and aesthetic quality by over 3x.
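
The interpolation idea described above can be illustrated with a minimal sketch. It assumes the linear form x_t = (1 - t) * x0 + t * noise, which is one common parameterization and may differ from the schedule the paper uses. The function names `interpolate` and `recover_clean` are hypothetical and are not taken from the authors' code. When the noise is fixed in advance, the clean sample can be recovered in closed form from any intermediate state with t < 1.

```python
def interpolate(x0, noise, t):
    """Assumed forward process: linear blend of a clean sample and known noise."""
    return [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]


def recover_clean(x_t, noise, t):
    """Invert the blend above; valid only for t < 1 and when the noise is known."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    return [(v - t * n) / (1.0 - t) for v, n in zip(x_t, noise)]


x0 = [0.2, -0.5, 1.0]      # toy "image" values
noise = [1.5, 0.3, -0.7]   # predefined noise, fixed ahead of time
x_t = interpolate(x0, noise, t=0.8)
x0_hat = recover_clean(x_t, noise, t=0.8)
```

Because the recovery is a single algebraic step, it needs no iterative denoising, which is consistent with the abstract's motivation of avoiding multistep denoising.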
[COMMENTS]
15 pages
[LINK]
http://arxiv.org/abs/2509.06942v3
[DATE]
2025-09-12 01:14:11+08:00
[CATEGORIES]
cs.LG
Near-Optimal Sample Complexity in Reward-Free Kernel-Based Reinforcement Learning
[AUTHORS]
Aya Kayal, Sattar Vakili, Laura Toni, Alberto Bernacchia
[ABSTRACT]
Reinforcement Learning (RL) problems are being considered under increasingly
complex structures. While tabular and linear models have been thoroughly
explored, the analytical study of RL under nonlinear function approximation,
especially kernel-based models, has recently gained traction for their strong
representational capacity and theoretical tractability. In this context, we
examine the question of statistical efficiency in kernel-based RL within the
reward-free RL framework, specifically asking: how many samples are required to
design a near-optimal policy? Existing work addresses this question under
restrictive assumptions about the class of kernel functions. We first explore
this question by assuming a generative model, then relax this assumption at the
cost of increasing the sample complexity by a factor of H, the length of the
episode. We tackle this fundamental problem using a broad class of kernels and
a simpler algorithm compared to prior work. Our approach derives new confidence
intervals for kernel ridge regression, specific to our RL setting, which may be
of broader applicability. We further validate our theoretical findings through
simulations.
[COMMENTS]
Accepted at AISTATS 2025
[LINK]
http://arxiv.org/abs/2502.07715v2
[DATE]
2025-09-12 01:08:49+08:00
[CATEGORIES]
cs.LG
Functional Groups are All you Need for Chemically Interpretable Molecular Property Prediction
[AUTHORS]
Roshan Balaji, Joe Bobby, Nirav Pravinbhai Bhatt
[ABSTRACT]
Molecular property prediction using deep learning (DL) models has accelerated
drug and materials discovery, but the resulting DL models often lack
interpretability, hindering their adoption by chemists. This work proposes
developing molecule representations using the concept of Functional Groups (FG)
in chemistry. We introduce the Functional Group Representation (FGR) framework,
a novel approach to encoding molecules based on their fundamental chemical
substructures. Our method integrates two types of functional groups: those
curated from established chemical knowledge (FG), and those mined from a large
molecular corpus using sequential pattern mining (MFG). The resulting FGR
framework encodes molecules into a lower-dimensional latent space by leveraging
pre-training on a large dataset of unlabeled molecules. Furthermore, the
proposed framework allows the inclusion of 2D structure-based descriptors of
molecules. We demonstrate that the FGR framework achieves state-of-the-art
performance on a diverse range of 33 benchmark datasets spanning physical
chemistry, biophysics, quantum mechanics, biological activity, and
pharmacokinetics while enabling chemical interpretability. Crucially, the
model’s representations are intrinsically aligned with established chemical
principles, allowing chemists to directly link predicted properties to specific
functional groups and facilitating novel insights into structure-property
relationships. Our work presents a significant step toward developing
high-performing, chemically interpretable DL models for molecular discovery.
[LINK]
http://arxiv.org/abs/2509.09619v1
[DATE]
2025-09-12 01:01:31+08:00
[CATEGORIES]
cs.LG
Explaining Concept Drift through the Evolution of Group Counterfactuals
[AUTHORS]
Ignacy Stępka, Jerzy Stefanowski
[ABSTRACT]
Machine learning models in dynamic environments often suffer from concept
drift, where changes in the data distribution degrade performance. While
detecting this drift is a well-studied topic, explaining how and why the
model’s decision-making logic changes still remains a significant challenge. In
this paper, we introduce a novel methodology to explain concept drift by
analyzing the temporal evolution of group-based counterfactual explanations
(GCEs). Our approach tracks shifts in the GCEs’ cluster centroids and their
associated counterfactual action vectors before and after a drift. These
evolving GCEs act as an interpretable proxy, revealing structural changes in
the model’s decision boundary and its underlying rationale. We operationalize
this analysis within a three-layer framework that synergistically combines
insights from the data layer (distributional shifts), the model layer
(prediction disagreement), and our proposed explanation layer. We show that
such a holistic view allows for a more comprehensive diagnosis of drift, making
it possible to distinguish between different root causes, such as a spatial
data shift versus a re-labeling of concepts.
[COMMENTS]
TempXAI Workshop @ ECML PKDD 2025
[LINK]
http://arxiv.org/abs/2509.09616v1
[DATE]
2025-09-12 00:58:34+08:00
[CATEGORIES]
cs.LG
ReBaNO: Reduced Basis Neural Operator Mitigating Generalization Gaps and Achieving Discretization Invariance
[AUTHORS]
Haolan Zheng, Yanlai Chen, Jiequn Han, Yue Yu
[ABSTRACT]
We propose a novel data-lean operator learning algorithm, the Reduced Basis
Neural Operator (ReBaNO), to solve a group of PDEs with multiple distinct
inputs. Inspired by the Reduced Basis Method and the recently introduced
Generative Pre-Trained Physics-Informed Neural Networks, ReBaNO relies on a
mathematically rigorous greedy algorithm to build its network structure offline
adaptively from the ground up. Knowledge distillation via a task-specific
activation function allows ReBaNO to have a compact architecture requiring
minimal computational cost online while embedding physics. In comparison to
state-of-the-art operator learning algorithms such as PCA-Net, DeepONet, FNO,
and CNO, numerical results demonstrate that ReBaNO significantly outperforms
them in terms of eliminating/shrinking the generalization gap for both in- and
out-of-distribution tests and being the only operator learning algorithm
achieving strict discretization invariance.
[LINK]
http://arxiv.org/abs/2509.09611v1
[DATE]
2025-09-12 00:52:54+08:00
[CATEGORIES]
cs.LG
LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
[AUTHORS]
Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, Nguyen Cam-Tu
[ABSTRACT]
KV Cache is commonly used to accelerate LLM inference with long contexts, yet
its high memory demand drives the need for cache compression. Existing
compression methods, however, are largely heuristic and lack dynamic budget
allocation. To address this limitation, we introduce a unified framework for
cache compression by minimizing information loss in Transformer residual
streams. Building on it, we analyze the layer attention output loss and derive
a new metric to compare cache entries across heads, enabling layer-wise
compression with dynamic head budgets. Additionally, by contrasting cross-layer
information, we also achieve dynamic layer budgets. LAVa is the first unified
strategy for cache eviction and dynamic budget allocation that, unlike prior
methods, does not rely on training or the combination of multiple strategies.
Experiments with benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and
InfiniteBench) demonstrate its superiority. Moreover, our experiments reveal a
new insight: dynamic layer budgets are crucial for generation tasks (e.g., code
completion), while dynamic head budgets play a key role in extraction tasks
(e.g., extractive QA). As a fully dynamic compression method, LAVa consistently
maintains top performance across task types. Our code is available at
https://github.com/MGDDestiny/Lava.
[LINK]
http://arxiv.org/abs/2509.09754v1
[DATE]
2025-09-12 00:48:24+08:00
[CATEGORIES]
cs.LG
Conditioning on PDE Parameters to Generalise Deep Learning Emulation of Stochastic and Chaotic Dynamics
[AUTHORS]
Ira J. S. Shokar, Rich R. Kerswell, Peter H. Haynes
[ABSTRACT]
We present a deep learning emulator for stochastic and chaotic
spatio-temporal systems, explicitly conditioned on the parameter values of the
underlying partial differential equations (PDEs). Our approach involves
pre-training the model on a single parameter domain, followed by fine-tuning on
a smaller, yet diverse dataset, enabling generalisation across a broad range of
parameter values. By incorporating local attention mechanisms, the network is
capable of handling varying domain sizes and resolutions. This enables
computationally efficient pre-training on smaller domains while requiring only
a small additional dataset to learn how to generalise to larger domain sizes.
We demonstrate the model’s capabilities on the chaotic Kuramoto-Sivashinsky
equation and stochastically-forced beta-plane turbulence, showcasing its
ability to capture phenomena at interpolated parameter values. The emulator
provides significant computational speed-ups over conventional numerical
integration, facilitating efficient exploration of parameter space, while a
probabilistic variant of the emulator provides uncertainty quantification,
allowing for the statistical study of rare events.
[LINK]
http://arxiv.org/abs/2509.09599v1
[DATE]
2025-09-12 00:37:45+08:00
[CATEGORIES]
cs.LG
Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication
[AUTHORS]
Maysam Behmanesh, Erkan Turan, Maks Ovsjanikov
[ABSTRACT]
Graph alignment, the problem of identifying corresponding nodes across
multiple graphs, is fundamental to numerous applications. Most existing
unsupervised methods embed node features into latent representations to enable
cross-graph comparison without ground-truth correspondences. However, these
methods suffer from two critical limitations: the degradation of node
distinctiveness due to oversmoothing in GNN-based embeddings, and the
misalignment of latent spaces across graphs caused by structural noise, feature
heterogeneity, and training instability, ultimately leading to unreliable node
correspondences. We propose a novel graph alignment framework that
simultaneously enhances node distinctiveness and enforces geometric consistency
across latent spaces. Our approach introduces a dual-pass encoder that combines
low-pass and high-pass spectral filters to generate embeddings that are both
structure-aware and highly discriminative. To address latent space
misalignment, we incorporate a geometry-aware functional map module that learns
bijective and isometric transformations between graph embeddings, ensuring
consistent geometric relationships across different representations. Extensive
experiments on graph benchmarks demonstrate that our method consistently
outperforms existing unsupervised alignment baselines, exhibiting superior
robustness to structural inconsistencies and challenging alignment scenarios.
Additionally, comprehensive evaluation on vision-language benchmarks using
diverse pretrained models shows that our framework effectively generalizes
beyond graph domains, enabling unsupervised alignment of vision and language
representations.
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2509.09597v1
[DATE]
2025-09-12 00:36:16+08:00
[CATEGORIES]
cs.LG
Critical Challenges and Guidelines in Evaluating Synthetic Tabular Data: A Systematic Review
[AUTHORS]
Nazia Nafis, Inaki Esnaola, Alvaro Martinez-Perez, Maria-Cruz Villa-Uriol, Venet Osmani
[ABSTRACT]
Generating synthetic tabular data can be challenging; however, evaluating
its quality is just as challenging, if not more so. This systematic review sheds
light on the critical importance of rigorous evaluation of synthetic health
data to ensure reliability, relevance, and their appropriate use. Based on
screening of 1766 papers and a detailed review of 101 papers we identified key
challenges, including lack of consensus on evaluation methods, improper use of
evaluation metrics, limited input from domain experts, inadequate reporting of
dataset characteristics, and limited reproducibility of results. In response,
we provide several guidelines on the generation and evaluation of synthetic
data, to allow the community to unlock and fully harness the transformative
potential of synthetic data and accelerate innovation.
[LINK]
http://arxiv.org/abs/2504.18544v2
[DATE]
2025-09-12 00:27:30+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Yujian Ma, Jinqiu Sang, Ruizhe Li
[ABSTRACT]
Large pre-trained speech models such as Whisper offer strong generalization
but pose significant challenges for resource-efficient adaptation. Low-Rank
Adaptation (LoRA) has become a popular parameter-efficient fine-tuning method,
yet its underlying mechanisms in speech tasks remain poorly understood. In this
work, we conduct the first systematic mechanistic interpretability study of
LoRA within the Whisper encoder for speech emotion recognition (SER). Using a
suite of analytical tools, including layer contribution probing, logit-lens
inspection, and representational similarity via singular value decomposition
(SVD) and centered kernel alignment (CKA), we reveal two key mechanisms: a
delayed specialization process that preserves general features in early layers
before consolidating task-specific information, and a forward alignment,
backward differentiation dynamic between LoRA’s matrices. Our findings clarify
how LoRA reshapes encoder hierarchies, providing both empirical insights and a
deeper mechanistic understanding for designing efficient and interpretable
adaptation strategies in large speech models. Our code is available at
https://github.com/harryporry77/Behind-the-Scenes.
[COMMENTS]
Work in process
[LINK]
http://arxiv.org/abs/2509.08454v2
[DATE]
2025-09-12 00:01:59+08:00
[CATEGORIES]
cs.LG
Entropy-Gated Branching for Efficient Test-Time Reasoning
[AUTHORS]
Xianzhi Li, Ethan Callanan, Abdellah Ghassel, Xiaodan Zhu
[ABSTRACT]
Test-time compute methods like beam search can significantly improve the
reasoning capabilities and problem-solving accuracy of large language models.
However, these approaches require substantially increased computational
resources, with most computation wasted on exploring low-diversity branches
where the model already exhibits high confidence. We observe that a small
subset of uncertain reasoning steps has a disproportionately large impact on
final prediction accuracy, and branching at these points tends to yield
higher-quality and more diverse candidate reasoning steps. Therefore, we
introduce Entropy-Gated Branching: a novel inference technique that dynamically
allocates computational resources by selectively expanding prediction sequences
only at points of high uncertainty. Our method leverages entropy as a gating
mechanism to identify when branching is most beneficial, coupled with an
external feedback model to rank and prune candidate branches. Empirical results
on mathematical and financial reasoning benchmarks show that this strategy
improves accuracy by 22.6% over standard inference while operating 37% faster
than conventional beam search with similar or higher performance. Our results
show that dynamic resource allocation during inference can substantially
improve both efficiency and effectiveness, offering a more scalable pathway to
enhanced LLM reasoning capabilities.
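
The gating step described above can be sketched as follows. This is only an illustration of entropy-based gating under stated assumptions: the function names, the toy probability vectors, and the threshold value are all hypothetical, and the external feedback model that ranks and prunes branches is not modeled here.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution given as probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(probs, threshold):
    """Gate: expand several candidates only when the model is uncertain."""
    return shannon_entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]   # toy next-step distribution, low entropy
uncertain = [0.30, 0.25, 0.25, 0.20]   # toy next-step distribution, high entropy
THRESHOLD = 1.0  # hypothetical value; would be tuned per task in practice

branch_on_confident = should_branch(confident, THRESHOLD)
branch_on_uncertain = should_branch(uncertain, THRESHOLD)
```

With a gate like this, a decoder would follow a single continuation at confident steps and spend extra compute only where the distribution is spread out.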
[LINK]
http://arxiv.org/abs/2503.21961v2
[DATE]
2025-09-11 23:49:39+08:00
[CATEGORIES]
cs.CL
Persistent Homology of Topic Networks for the Prediction of Reader Curiosity
[AUTHORS]
Manuel D. S. Hopp, Vincent Labatut, Arthur Amalvy, Richard Dufour, Hannah Stone, Hayley Jach, Kou Murayama
[ABSTRACT]
Reader curiosity, the drive to seek information, is crucial for textual
engagement, yet remains relatively underexplored in NLP. Building on
Loewenstein’s Information Gap Theory, we introduce a framework that models
reader curiosity by quantifying semantic information gaps within a text’s
semantic structure. Our approach leverages BERTopic-inspired topic modeling and
persistent homology to analyze the evolving topology (connected components,
cycles, voids) of a dynamic semantic network derived from text segments,
treating these features as proxies for information gaps. To empirically
evaluate this pipeline, we collect reader curiosity ratings from participants
(n = 49) as they read S. Collins’s “The Hunger Games” novel. We then use the
topological features from our pipeline as independent variables to predict
these ratings, and experimentally show that they significantly improve
curiosity prediction compared to a baseline model (73% vs. 30% explained
deviance), validating our approach. This pipeline offers a new computational
method for analyzing text structure and its relation to reader engagement.
[COMMENTS]
Original paper with an improved and extended appendix
[LINK]
http://arxiv.org/abs/2506.11095v2
[DATE]
2025-09-11 23:49:22+08:00
[CATEGORIES]
cs.CL
Prompting the Market? A Large-Scale Meta-Analysis of GenAI in Finance NLP (2022-2025)
[AUTHORS]
Paolo Pedinotti, Peter Baumann, Nathan Jessurun, Leslie Barrett, Enrico Santus
[ABSTRACT]
Large Language Models (LLMs) have rapidly reshaped financial NLP, enabling
new tasks and driving a proliferation of datasets and diversification of data
sources. Yet, this transformation has outpaced traditional surveys. In this
paper, we present MetaGraph, a generalizable methodology for extracting
knowledge graphs from scientific literature and analyzing them to obtain a
structured, queryable view of research trends. We define an ontology for
financial NLP research and apply an LLM-based extraction pipeline to 681 papers
(2022-2025), enabling large-scale, data-driven analysis. MetaGraph reveals
three key phases: early LLM adoption and task/dataset innovation; critical
reflection on LLM limitations; and growing integration of peripheral techniques
into modular systems. This structured view offers both practitioners and
researchers a clear understanding of how financial NLP has evolved,
highlighting emerging trends, shifting priorities, and methodological
shifts, while also demonstrating a reusable approach for mapping scientific
progress in other domains.
[COMMENTS]
7 pages, 6 appendices, EMNLP industry track
[LINK]
http://arxiv.org/abs/2509.09544v1
[DATE]
2025-09-11 23:37:56+08:00
[CATEGORIES]
cs.CL
DeMeVa at LeWiDi-2025: Modeling Perspectives with In-Context Learning and Label Distribution Learning
[AUTHORS]
Daniil Ignatev, Nan Li, Hugh Mee Wong, Anh Dang, Shane Kaszefski Yaschuk
[ABSTRACT]
This system paper presents the DeMeVa team’s approaches to the third edition
of the Learning with Disagreements shared task (LeWiDi 2025; Leonardelli et
al., 2025). We explore two directions: in-context learning (ICL) with large
language models, where we compare example sampling strategies; and label
distribution learning (LDL) methods with RoBERTa (Liu et al., 2019b), where we
evaluate several fine-tuning methods. Our contributions are twofold: (1) we
show that ICL can effectively predict annotator-specific annotations
(perspectivist annotations), and that aggregating these predictions into soft
labels yields competitive performance; and (2) we argue that LDL methods are
promising for soft label predictions and merit further exploration by the
perspectivist community.
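
Aggregating annotator-specific predictions into a soft label, as mentioned in contribution (1), can be sketched like this. The sketch assumes aggregation by relative frequency, which is one plausible reading and may not match the team's exact procedure; the function name, annotator IDs, and label names are hypothetical.

```python
from collections import Counter


def soft_label(annotator_predictions, label_set):
    """Turn per-annotator hard predictions into a probability distribution."""
    if not annotator_predictions:
        raise ValueError("need at least one annotator prediction")
    counts = Counter(annotator_predictions.values())
    total = len(annotator_predictions)
    return {label: counts[label] / total for label in label_set}


# Hypothetical predictions for one item from four annotators
predictions = {
    "ann_1": "sarcastic",
    "ann_2": "sarcastic",
    "ann_3": "not_sarcastic",
    "ann_4": "sarcastic",
}
dist = soft_label(predictions, ["sarcastic", "not_sarcastic"])
```

The resulting distribution keeps the disagreement between annotators visible instead of collapsing it into a single majority label.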
[COMMENTS]
11 pages, 4 figures; to appear at NLPerspectives@EMNLP-2025
[LINK]
http://arxiv.org/abs/2509.09524v1
[DATE]
2025-09-11 23:04:42+08:00
[CATEGORIES]
cs.CL
cs.LG
Uncertainty Quantification in Retrieval Augmented Question Answering
[AUTHORS]
Laura Perez-Beltrachini, Mirella Lapata
[ABSTRACT]
Retrieval augmented Question Answering (QA) helps QA models overcome
knowledge gaps by incorporating retrieved evidence, typically a set of
passages, alongside the question at test time. Previous studies show that this
approach improves QA performance and reduces hallucinations, without, however,
assessing whether the retrieved passages are indeed useful at answering
correctly. In this work, we propose to quantify the uncertainty of a QA model
via estimating the utility of the passages it is provided with. We train a
lightweight neural model to predict passage utility for a target QA model and
show that while simple information theoretic metrics can predict answer
correctness up to a certain extent, our approach efficiently approximates or
outperforms more expensive sampling-based methods. Code and data are available
at https://github.com/lauhaide/ragu.
[COMMENTS]
TMLR (09/2025)
[LINK]
http://arxiv.org/abs/2502.18108v3
[DATE]
2025-09-11 22:41:09+08:00
[CATEGORIES]
cs.CL
An Ontology-Driven Graph RAG for Legal Norms: A Structural, Temporal, and Deterministic Approach
[AUTHORS]
Hudson de Martim
[ABSTRACT]
Retrieval-Augmented Generation (RAG) systems in the legal domain face a
critical challenge: standard, flat-text retrieval is blind to the hierarchical,
diachronic, and causal structure of law, leading to anachronistic and
unreliable answers. This paper introduces the Structure-Aware Temporal Graph
RAG (SAT-Graph RAG), an ontology-driven framework designed to overcome these
limitations by explicitly modeling the formal structure and diachronic nature
of legal norms. We ground our knowledge graph in a formal, LRMoo-inspired model
that distinguishes abstract legal Works from their versioned Expressions. We
model temporal states as efficient aggregations that reuse the versioned
expressions (CTVs) of unchanged components, and we reify legislative events as
first-class Action nodes to make causality explicit and queryable. This
structured backbone enables a unified, planner-guided query strategy that
applies explicit policies to deterministically resolve complex requests for (i)
point-in-time retrieval, (ii) hierarchical impact analysis, and (iii) auditable
provenance reconstruction. Through a case study on the Brazilian Constitution,
we demonstrate how this approach provides a verifiable, temporally-correct
substrate for LLMs, enabling higher-order analytical capabilities while
drastically reducing the risk of factual errors. The result is a practical
framework for building more trustworthy and explainable legal AI systems.
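Point-in-time retrieval over versioned Expressions can be pictured with a small sketch. It is not the paper's implementation: the `Expression` schema, the `expression_at` helper, and the dates and texts are all hypothetical, and the sketch assumes each version is in force from its start date until the next version begins.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from typing import Optional, Sequence


@dataclass(frozen=True)
class Expression:
    """One versioned wording of an abstract legal Work (hypothetical schema)."""
    valid_from: date
    text: str


def expression_at(versions: Sequence[Expression], when: date) -> Optional[Expression]:
    """Return the version in force on `when`; `versions` must be sorted by valid_from."""
    starts = [v.valid_from for v in versions]
    idx = bisect_right(starts, when) - 1  # last version starting on or before `when`
    return versions[idx] if idx >= 0 else None


# Placeholder data: the dates and texts are invented for illustration only.
article = [
    Expression(date(2000, 1, 1), "original wording"),
    Expression(date(2010, 6, 1), "amended wording"),
]
print(expression_at(article, date(2005, 3, 15)).text)
```

Because the lookup depends only on the sorted start dates, the same query always resolves to the same version, which is the sense in which such retrieval is deterministic.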
[COMMENTS]
Major revision for clarity and academic precision. Updated title and
abstract. Refined core terminology, contributions, related work, and shifted
the implementation to a conceptual architecture. Added new arguments to
strengthen the paper’s thesis
[LINK]
http://arxiv.org/abs/2505.00039v5
[DATE]
2025-09-11 22:34:52+08:00
[CATEGORIES]
cs.CL
Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
[AUTHORS]
Aleksandra Bakalova, Yana Veitsman, Xinting Huang, Michael Hahn
[ABSTRACT]
In-Context Learning (ICL) is an intriguing ability of large language models
(LLMs). Despite a substantial amount of work on its behavioral aspects and how
it emerges in miniature setups, it remains unclear which mechanism assembles
task information from the individual examples in a few-shot prompt. We use
causal interventions to identify information flow in Gemma-2 2B for five
naturalistic ICL tasks. We find that the model infers task information using a
two-step strategy we call contextualize-then-aggregate: In the lower layers,
the model builds up representations of individual few-shot examples, which are
contextualized by preceding examples through connections between few-shot input
and output tokens across the sequence. In the higher layers, these
representations are aggregated to identify the task and prepare prediction of
the next output. The importance of the contextualization step differs between
tasks, and it may become more important in the presence of ambiguous examples.
Overall, by providing rigorous causal analysis, our results shed light on the
mechanisms through which ICL happens in language models.
[LINK]
http://arxiv.org/abs/2504.00132v3
[DATE]
2025-09-11 22:13:48+08:00
[CATEGORIES]
cs.CL
cs.LG
CritiQ: Mining Data Quality Criteria from Human Preferences
[AUTHORS]
Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
[ABSTRACT]
Language models depend heavily on high-quality data for optimal performance.
Existing approaches rely on manually designed heuristics, the perplexity of
existing models, training classifiers, or careful prompt engineering, which
require significant expert experience and human annotation effort while
introducing biases. We introduce CritiQ, a novel data selection method that
automatically mines criteria from human preferences for data quality with only
~30 human-annotated pairs and performs efficient data selection. The main
component, CritiQ Flow, employs a manager agent to evolve quality criteria and
worker agents to make pairwise judgments. We build a knowledge base that
extracts quality criteria from previous work to boost CritiQ Flow. Compared to
perplexity- and classifier-based methods, verbal criteria are more
interpretable and possess reusable value. After deriving the criteria, we train
the CritiQ Scorer to give quality scores and perform efficient data selection.
We demonstrate the effectiveness of our method in the code, math, and logic
domains, achieving high accuracy on human-annotated test sets. To validate the
quality of the selected data, we continually train Llama 3.1 models and observe
improved performance on downstream tasks compared to uniform sampling. Ablation
studies validate the benefits of the knowledge base and the reflection process.
We analyze how criteria evolve and the effectiveness of majority voting.
[COMMENTS]
To be published in ACL 2025. Code is available at
https://github.com/KYLN24/CritiQ
[LINK]
http://arxiv.org/abs/2502.19279v3
[DATE]
2025-09-11 22:11:32+08:00
[CATEGORIES]
cs.CL
Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation
[AUTHORS]
Lucie Poláková, Martin Popel, Věra Kloudová, Michal Novák, Mariia Anisimova, Jiří Balhar
[ABSTRACT]
The EdUKate project combines digital education, linguistics, translation
studies, and machine translation to develop multilingual learning materials for
Czech primary and secondary schools. Launched through collaboration between a
major Czech academic institution and the country’s largest educational
publisher, the project is aimed at translating up to 9,000 multimodal
interactive exercises from Czech into Ukrainian, English, and German for an
educational web portal. It emphasizes the development and evaluation of a
direct Czech-Ukrainian machine translation system tailored to the educational
domain, with special attention to processing formatted content such as XML and
PDF and handling technical and scientific terminology. We present findings from
an initial survey of Czech teachers regarding the needs of non-Czech-speaking
students and describe the system’s evaluation and implementation on the web
portal. All resulting applications are freely available to students, educators,
and researchers.
[COMMENTS]
8 pages, 2 figures
[LINK]
http://arxiv.org/abs/2509.09473v1
[DATE]
2025-09-11 21:54:44+08:00
[CATEGORIES]
cs.CL
GrACE: A Generative Approach to Better Confidence Elicitation in Large Language Models
[AUTHORS]
Zhaohan Zhang, Ziquan Liu, Ioannis Patras
[ABSTRACT]
Assessing the reliability of Large Language Models (LLMs) by confidence
elicitation is a prominent approach to AI safety in high-stakes applications,
such as healthcare and finance. Existing methods either require expensive
computational overhead or suffer from poor calibration, making them impractical
and unreliable for real-world deployment. In this work, we propose GrACE, a
Generative Approach to Confidence Elicitation that enables scalable and
reliable confidence elicitation for LLMs. GrACE adopts a novel mechanism in
which the model expresses confidence by the similarity between the last hidden
state and the embedding of a special token appended to the vocabulary, in
real-time. We fine-tune the model for calibrating the confidence with
calibration targets associated with accuracy. Experiments with three LLMs and
two benchmark datasets show that the confidence produced by GrACE achieves the
best discriminative capacity and calibration on open-ended generation tasks,
outperforming six competing methods without resorting to additional sampling or
an auxiliary model. Moreover, we propose two strategies for improving test-time
scaling based on confidence induced by GrACE. Experimental results show that
using GrACE not only improves the accuracy of the final decision but also
significantly reduces the number of required samples in the test-time scaling
scheme, indicating the potential of GrACE as a practical solution for deploying
LLMs with scalable, reliable, and real-time confidence estimation.
[COMMENTS]
20 pages, 11 figures
[LINK]
http://arxiv.org/abs/2509.09438v1
[DATE]
2025-09-11 21:25:40+08:00
[CATEGORIES]
cs.CL
FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training
[AUTHORS]
Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, Yequan Wang
[ABSTRACT]
Full-duplex dialog models aim to listen and speak simultaneously, delivering
rapid responses to dynamic user input. Among different solutions to full
duplexity, a native solution merges multiple channels in each time step,
achieving the lowest latency. However, prevailing designs break down the
textual monologue sentences for word-level alignment with audio streams, which
degrades language modeling abilities. To help address this issue, we introduce
natural monologues, which are composed of continuous sentences and waiting
intervals, mimicking human-like cognitive behavior in dialogs. We find a proper
training paradigm to be critical for semantically aligning natural monologues
with audio. To this end, we develop a dual training paradigm that alternates
the position of the monologues, either leading or trailing the audio, across
different training stages. A combination of our natural monologue and dual
training strategy is applied in developing FLM-Audio, our 7B spoken dialog
chatbot with native full-duplexity. As confirmed by experimental results,
FLM-Audio achieves superior response qualities and chatting experiences while
requiring significantly less training data.
[LINK]
http://arxiv.org/abs/2509.02521v2
[DATE]
2025-09-11 21:07:17+08:00
[CATEGORIES]
cs.CL
LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations
[AUTHORS]
Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi
[COMMENTS]
Accepted to EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2509.09396v1
[DATE]
2025-09-11 20:25:41+08:00
[CATEGORIES]
cs.LG
cs.CL
Hierarchical Bracketing Encodings Work for Dependency Graphs
[AUTHORS]
Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares
[ABSTRACT]
We revisit hierarchical bracketing encodings from a practical perspective in
the context of dependency graph parsing. The approach encodes graphs as
sequences, enabling linear-time parsing with $n$ tagging actions, and still
representing reentrancies, cycles, and empty nodes. Compared to existing graph
linearizations, this representation substantially reduces the label space while
preserving structural information. We evaluate it on a multilingual and
multi-formalism benchmark, showing competitive results and consistent
improvements over other methods in exact match accuracy.
[COMMENTS]
Accepted at EMNLP 2025 (main)
[LINK]
http://arxiv.org/abs/2509.09388v1
[DATE]
2025-09-11 20:08:22+08:00
[CATEGORIES]
cs.CL
Improving Alignment in LVLMs with Debiased Self-Judgment
[AUTHORS]
Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2508.20655v2
[DATE]
2025-09-11 20:03:20+08:00
[CATEGORIES]
cs.CL
MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems
[AUTHORS]
Channdeth Sok, David Luz, Yacine Haddam
[ABSTRACT]
Large Language Models (LLMs) are increasingly deployed in enterprise
applications, yet their reliability remains limited by hallucinations, i.e.,
confident but factually incorrect information. Existing detection approaches,
such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not
address the unique challenges of Retrieval-Augmented Generation (RAG) systems,
where responses must be consistent with retrieved evidence. We therefore
present MetaRAG, a metamorphic testing framework for hallucination detection in
Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time,
unsupervised, black-box setting, requiring neither ground-truth references nor
access to model internals, making it suitable for proprietary and high-stakes
domains. The framework proceeds in four stages: (1) decompose answers into
atomic factoids, (2) generate controlled mutations of each factoid using
synonym and antonym substitutions, (3) verify each variant against the
retrieved context (synonyms are expected to be entailed and antonyms
contradicted), and (4) aggregate penalties for inconsistencies into a
response-level hallucination score. Crucially for identity-aware AI, MetaRAG
localizes unsupported claims at the factoid span where they occur (e.g.,
pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility),
allowing users to see flagged spans and enabling system designers to configure
thresholds and guardrails for identity-sensitive queries. Experiments on a
proprietary enterprise dataset illustrate the effectiveness of MetaRAG for
detecting hallucinations and enabling trustworthy deployment of RAG-based
conversational agents. We also outline a topic-based deployment design that
translates MetaRAG’s span-level scores into identity-aware safeguards; this
design is discussed but not evaluated in our experiments.
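Stages (3) and (4) can be sketched as follows. The sketch assumes that an entailment checker has already produced a verdict for each mutated variant. The aggregation rule (per-factoid mismatch rate, then the maximum over factoids) and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
EXPECTED = {"synonym": "entailed", "antonym": "contradicted"}


def factoid_penalty(variants):
    """Fraction of mutated variants whose verdict differs from the expected one.

    `variants` is a list of (mutation_kind, verdict) pairs, where the verdict
    would come from checking the variant against the retrieved context.
    """
    if not variants:
        raise ValueError("each factoid needs at least one variant")
    mismatches = sum(verdict != EXPECTED[kind] for kind, verdict in variants)
    return mismatches / len(variants)


def response_score(factoids):
    """Aggregate per-factoid penalties; taking the maximum is an assumption."""
    return max(factoid_penalty(variants) for variants in factoids)


# Hand-written verdicts standing in for an entailment checker.
supported = [("synonym", "entailed"), ("antonym", "contradicted")]
unsupported = [("synonym", "neutral"), ("antonym", "entailed")]
score = response_score([supported, unsupported])
print(score)
```

Keeping the penalty per factoid, before aggregation, is what would allow flagged spans to be shown to the user.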
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2509.09360v1
[DATE]
2025-09-11 19:18:23+08:00
[CATEGORIES]
cs.CL
Generative Data Refinement: Just Ask for Better Data
[AUTHORS]
Minqi Jiang, João G. M. Araújo, Will Ellsworth, Sian Gooding, Edward Grefenstette
[ABSTRACT]
For a fixed parameter size, the capabilities of large models are primarily
determined by the quality and quantity of their training data. Consequently,
training datasets now grow faster than the rate at which new data is indexed on
the web, leading to projected data exhaustion over the next decade. Much more
data exists as user-generated content that is not publicly indexed, but
incorporating such data comes with considerable risks, such as leaking private
information and other undesirable content. We introduce a framework, Generative
Data Refinement (GDR), for using pretrained generative models to transform a
dataset with undesirable content into a refined dataset that is more suitable
for training. Our experiments show that GDR can outperform industry-grade
solutions for dataset anonymization, as well as enable direct detoxification of
highly unsafe datasets. Moreover, we show that by generating synthetic data
that is conditioned on each example in the real dataset, GDR’s refined outputs
naturally match the diversity of web scale datasets, and thereby avoid the
often challenging task of generating diverse synthetic data via model
prompting. The simplicity and effectiveness of GDR make it a powerful tool for
scaling up the total stock of training data for frontier models.
[LINK]
http://arxiv.org/abs/2509.08653v2
[DATE]
2025-09-11 19:18:17+08:00
[CATEGORIES]
cs.LG
cs.CL
Culturally-Nuanced Story Generation for Reasoning in Low-Resource Languages: The Case of Javanese and Sundanese
[AUTHORS]
Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
[ABSTRACT]
Culturally grounded commonsense reasoning is underexplored in low-resource
languages due to scarce data and costly native annotation. We test whether
large language models (LLMs) can generate culturally nuanced narratives for
such settings. Focusing on Javanese and Sundanese, we compare three data
creation strategies: (1) LLM-assisted stories prompted with cultural cues, (2)
machine translation from Indonesian benchmarks, and (3) native-written stories.
Human evaluation finds LLM stories match natives on cultural fidelity but lag
in coherence and correctness. We fine-tune models on each dataset and evaluate
on a human-authored test set for classification and generation. LLM-generated
data yields higher downstream performance than machine-translated and
Indonesian human-authored training data. We release a high-quality benchmark of
culturally grounded commonsense stories in Javanese and Sundanese to support
future work.
[LINK]
http://arxiv.org/abs/2502.12932v2
[DATE]
2025-09-11 18:20:11+08:00
[CATEGORIES]
cs.CL
RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
[AUTHORS]
Jiahui Li, Lin Li, Tai-wei Chang, Kun Kuang, Long Chen, Jun Zhou, Cheng Yang
[ABSTRACT]
Reinforcement learning from human feedback (RLHF) offers a promising approach
to aligning large language models (LLMs) with human preferences. Typically, a
reward model is trained or supplied to act as a proxy for humans in evaluating
generated responses during the reinforcement training phase. However, current
reward models operate as sequence-to-one models, allocating a single, sparse,
and delayed reward to an entire output sequence. This approach may overlook the
significant contributions of individual tokens toward the desired outcome. To
this end, we propose a more fine-grained, token-level guidance approach for RL
training. Specifically, we introduce RED, a novel reward redistribution method
that evaluates and assigns specific credit to each token using an off-the-shelf
reward model. Utilizing these fine-grained rewards enhances the model’s
understanding of language nuances, leading to more precise performance
improvements. Notably, our method does not require modifying the reward model
or introducing additional training steps, thereby incurring minimal
computational costs. Experimental results across diverse datasets and tasks
demonstrate the superiority of our approach.
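One way to picture token-level credit assignment with an off-the-shelf reward model is the sketch below. It assumes that each token's reward is the change in the reward model's score when that token is appended to the prefix. This is an illustrative assumption rather than a rule stated in the abstract, and `toy_score` is a hypothetical stand-in for a real reward model.

```python
def redistribute(tokens, score_prefix):
    """Assign each token the score change caused by appending it to the prefix."""
    rewards = []
    previous = score_prefix([])
    for t in range(1, len(tokens) + 1):
        current = score_prefix(tokens[:t])
        rewards.append(current - previous)
        previous = current
    return rewards


def toy_score(prefix):
    """Hypothetical stand-in for a reward model: counts polite words."""
    return float(sum(tok in {"please", "thanks"} for tok in prefix))


tokens = ["please", "send", "it", "thanks"]
token_rewards = redistribute(tokens, toy_score)
print(token_rewards)
```

Under this assumption the per-token rewards telescope: their sum equals the score of the full sequence minus the score of the empty prefix, so the holistic reward is preserved while being spread across tokens.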
[LINK]
http://arxiv.org/abs/2411.08302v2
[DATE]
2025-09-11 18:17:06+08:00
[CATEGORIES]
cs.CL
From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models
[AUTHORS]
Grazia Sveva Ascione, Nicolò Tamagnone
[ABSTRACT]
Classifying patents by their relevance to the UN Sustainable Development
Goals (SDGs) is crucial for tracking how innovation addresses global
challenges. However, the absence of a large, labeled dataset limits the use of
supervised learning. Existing methods, such as keyword searches, transfer
learning, and citation-based heuristics, lack scalability and generalizability.
This paper frames patent-to-SDG classification as a weak supervision problem,
using citations from patents to SDG-tagged scientific publications (NPL
citations) as a noisy initial signal. To address its sparsity and noise, we
develop a composite labeling function (LF) that uses large language models
(LLMs) to extract structured concepts, namely functions, solutions, and
applications, from patents and SDG papers based on a patent ontology.
Cross-domain similarity scores are computed and combined using a rank-based
retrieval approach. The LF is calibrated via a custom positive-only loss that
aligns with known NPL-SDG links without penalizing discovery of new SDG
associations. The result is a silver-standard, soft multi-label dataset mapping
patents to SDGs, enabling the training of effective multi-label regression
models. We validate our approach through two complementary strategies: (1)
internal validation against held-out NPL-based labels, where our method
outperforms several baselines including transformer-based models, and zero-shot
LLM; and (2) external validation using network modularity in patent citation,
co-inventor, and co-applicant graphs, where our labels reveal greater thematic,
cognitive, and organizational coherence than traditional technological
classifications. These results show that weak supervision and semantic
alignment can enhance SDG classification at scale.
[LINK]
http://arxiv.org/abs/2509.09303v1
[DATE]
2025-09-11 17:44:16+08:00
[CATEGORIES]
cs.CL
PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions
[AUTHORS]
Yixuan Tang, Yi Yang, Ahmed Abbasi
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) demonstrate remarkable
capabilities across various fields. These developments have led to more direct
communication between humans and LLMs in various situations, such as social
companionship and psychological support. However, LLMs often exhibit
limitations in emotional perception and social competence during real-world
conversations. These limitations partly originate from their inability to adapt
their communication style and emotional expression to different social and task
contexts. In this work, we introduce PersonaFuse, a novel LLM post-training
framework that enables LLMs to adapt and express different personalities for
varying situations. Inspired by Trait Activation Theory and the Big Five
personality model, PersonaFuse employs a Mixture-of-Expert architecture that
combines persona adapters with a dynamic routing network, enabling contextual
trait expression. Experimental results show that PersonaFuse substantially
outperforms baseline models across multiple dimensions of social-emotional
intelligence. Importantly, these gains are achieved without sacrificing general
reasoning ability or model safety, which remain common limitations of direct
prompting and supervised fine-tuning approaches. PersonaFuse also delivers
consistent improvements in downstream human-centered applications, such as
mental health counseling and review-based customer service. Finally, human
preference evaluations against leading LLMs, including GPT-4o and DeepSeek,
demonstrate that PersonaFuse achieves competitive response quality despite its
comparatively smaller model size. These findings demonstrate that PersonaFuse
offers a theoretically grounded and practical approach for developing
social-emotional enhanced LLMs, marking a significant advancement toward more
human-centric AI systems.
[LINK]
http://arxiv.org/abs/2509.07370v2
[DATE]
2025-09-11 17:42:02+08:00
[CATEGORIES]
cs.CL
Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning
[AUTHORS]
Bingning Huang, Tu Nguyen, Matthieu Zimmer
[ABSTRACT]
Recent advances in reasoning with large language models (LLMs) have shown the
effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality
intermediate trajectories, particularly in math and symbolic domains. Inspired
by this, we explore how MCTS-derived trajectories, traditionally used for
training value or reward models, can be repurposed to improve policy
optimization in preference-based reinforcement learning (RL). Specifically, we
focus on Group Relative Policy Optimization (GRPO), a recent algorithm that
enables preference-consistent policy learning without value networks. We
propose a staged GRPO training paradigm where completions are derived from
partially revealed MCTS rollouts, introducing a novel tree-structured setting
for advantage estimation. This leads to a rich class of prefix-conditioned
reward signals, which we analyze theoretically and empirically. Our initial
results indicate that while structured advantage estimation can stabilize
updates and better reflect compositional reasoning quality, challenges such as
advantage saturation and reward signal collapse remain. We propose heuristic
and statistical solutions to mitigate these issues and discuss open challenges
for learning under staged or tree-like reward structures.
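For orientation, the sketch below shows the plain group-relative normalization commonly associated with GRPO, not the tree-structured, prefix-conditioned estimator studied in the paper. Using the population standard deviation and the `eps` value are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields a usable signal.
mixed = group_relative_advantages([0.0, 1.0, 1.0, 0.0])
# A group where every completion earns the same reward yields none.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, saturated)
```

The second case is one simple way a learning signal can vanish: when all completions in a group receive the same reward, every advantage is zero and the update carries no information.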
[LINK]
http://arxiv.org/abs/2509.09284v1
[DATE]
2025-09-11 17:18:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
[AUTHORS]
Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
[ABSTRACT]
In long-horizon tasks, recent agents based on Large Language Models (LLMs)
face a significant challenge: sparse, outcome-based rewards make it
difficult to assign credit to intermediate steps. Previous methods mainly focus
on creating dense reward signals to guide learning, either through traditional
reinforcement learning techniques like inverse reinforcement learning or by
using Process Reward Models for step-by-step feedback. In this paper, we
identify a fundamental problem in the learning dynamics of LLMs: the magnitude
of policy gradients is inherently coupled with the entropy, which leads to
inefficiently small updates for confident correct actions and potentially
destabilizing large updates for uncertain ones. To resolve this, we propose
Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the
learning signal based on step-wise uncertainty and the final task outcome. EMPG
amplifies updates for confident correct actions, penalizes confident errors,
and attenuates updates from uncertain steps to stabilize exploration. We
further introduce a bonus term for future clarity that encourages agents to
find more predictable solution paths. Through comprehensive experiments on
three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we
demonstrate that EMPG achieves substantial performance gains and significantly
outperforms strong policy gradient baselines. Project page is at
https://empgseed-seed.github.io/
[COMMENTS]
ICLR 2026 Under review
[LINK]
http://arxiv.org/abs/2509.09265v1
[DATE]
2025-09-11 16:50:01+08:00
[CATEGORIES]
cs.LG
cs.CL
Agentic LLMs for Question Answering over Tabular Data
[AUTHORS]
Rishit Tyagi, Mohit Gupta, Rahul Bouri
[ABSTRACT]
Question Answering over Tabular Data (Table QA) presents unique challenges
due to the diverse structure, size, and data types of real-world tables. The
SemEval 2025 Task 8 (DataBench) introduced a benchmark composed of large-scale,
domain-diverse datasets to evaluate the ability of models to accurately answer
structured queries. We propose a Natural Language to SQL (NL-to-SQL) approach
leveraging large language models (LLMs) such as GPT-4o, GPT-4o-mini, and
DeepSeek v2:16b to generate SQL queries dynamically. Our system follows a
multi-stage pipeline involving example selection, SQL query generation, answer
extraction, verification, and iterative refinement. Experiments demonstrate the
effectiveness of our approach, achieving 70.5% accuracy on DataBench QA and
71.6% on DataBench Lite QA, significantly surpassing baseline scores of 26%
and 27% respectively. This paper details our methodology, experimental
results, and alternative approaches, providing insights into the strengths and
limitations of LLM-driven Table QA.
[COMMENTS]
Accepted at ACL workshop SemEval 2025
[LINK]
http://arxiv.org/abs/2509.09234v1
[DATE]
2025-09-11 16:12:38+08:00
[CATEGORIES]
cs.CL
Scalable Evaluation of Online Facilitation Strategies via Synthetic Simulation of Discussions
[AUTHORS]
Dimitris Tsirmpas, Ion Androutsopoulos, John Pavlopoulos
[ABSTRACT]
Limited large-scale evaluations exist for facilitation strategies of online
discussions due to significant costs associated with human involvement. An
effective solution is synthetic discussion simulations using Large Language
Models (LLMs) to create initial pilot experiments. We propose design principles
based on existing methodologies for synthetic discussion generation. Based on
these principles, we propose a simple, generalizable, LLM-driven methodology to
prototype the development of LLM facilitators by generating synthetic data
without human involvement, and which surpasses current baselines. We use our
methodology to test whether current Social Science strategies for facilitation
can improve the performance of LLM facilitators. We find that, while LLM
facilitators significantly improve synthetic discussions, there is no evidence
that the application of these strategies leads to further improvements in
discussion quality. In an effort to aid research in the field of facilitation,
we release a large, publicly available dataset containing LLM-generated and
LLM-annotated discussions using multiple open-source models. This dataset can
be used for LLM facilitator finetuning as well as behavioral analysis of
current out-of-the-box LLMs in the task. We also release an open-source python
framework that efficiently implements our methodology at great scale.
[COMMENTS]
15 pages, 3 tables, 12 figures
[LINK]
http://arxiv.org/abs/2503.16505v3
[DATE]
2025-09-11 16:05:33+08:00
[CATEGORIES]
cs.CL
cs.LG
Identifying Key Features for Establishing Sustainable Agro-Tourism Centre: A Data Driven Approach
[AUTHORS]
Alka Gadakh, Vidya Kumbhar, Sonal Khosla, Kumar Karunendra
[ABSTRACT]
Agro-tourism serves as a strategic economic model designed to facilitate
rural development by diversifying income streams for local communities like
farmers while promoting the conservation of indigenous cultural heritage and
traditional agricultural practices. Because agro-tourism is a rapidly growing
subdomain of tourism, the strategies for its growth need to be studied in
detail. The current study identifies the important indicators for the
growth and enhancement of agro-tourism. The study is conducted in two phases:
in the first, the important indicators are identified through a comprehensive
literature review; in the second, state-of-the-art techniques are used to
identify the important indicators for the growth of agro-tourism. The
indicators are also referred to as features. Machine learning models for
feature selection were applied: the Least Absolute Shrinkage and Selection
Operator (LASSO) method was combined with machine learning classifiers such
as Logistic Regression (LR), Decision Trees (DT), Random Forest (RF), and
Extreme Gradient Boosting (XGBoost) to assess the growth of agro-tourism.
The results show that with the LASSO method, the LR model gives the highest
classification accuracy of 98% on a 70-30% train-test split, followed by RF
with 95% accuracy. Similarly, on the 80-20% train-test split, LR maintains
the highest accuracy at 99%, while DT and XGBoost follow with 97% accuracy.
[LINK]
http://arxiv.org/abs/2509.09214v1
[DATE]
2025-09-11 15:43:40+08:00
[CATEGORIES]
cs.LG
cs.CL
A Novel Data Augmentation Approach for Automatic Speaking Assessment on Opinion Expressions
[AUTHORS]
Chung-Chun Wang, Jhen-Ke Lin, Hao-Chien Lu, Hong-Yun Lin, Berlin Chen
[ABSTRACT]
Automated speaking assessment (ASA) on opinion expressions is often hampered
by the scarcity of labeled recordings, which restricts prompt diversity and
undermines scoring reliability. To address this challenge, we propose a novel
training paradigm that leverages a large language model (LLM) to generate
diverse responses of a given proficiency level, converts responses into
synthesized speech via speaker-aware text-to-speech synthesis, and employs a
dynamic importance loss to adaptively reweight training instances based on
feature distribution differences between synthesized and real speech.
Subsequently, a multimodal large language model integrates aligned textual
features with speech signals to predict proficiency scores directly.
Experiments conducted on the LTTC dataset show that our approach outperforms
methods relying on real data or conventional augmentation, effectively
mitigating low-resource constraints and enabling ASA on opinion expressions
with cross-modal information.
[COMMENTS]
submitted to the ISCA SLaTE-2025 Workshop
[LINK]
http://arxiv.org/abs/2506.04077v2
[DATE]
2025-09-11 15:34:06+08:00
[CATEGORIES]
cs.CL
T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables
[AUTHORS]
Jie Zhang, Changzai Pan, Kaiwen Wei, Sishi Xiong, Yu Zhao, Xiangyu Li, Jiaxin Peng, Xiaoyan Gu, Jian Yang, Wenhan Chang, Zhenhe Wu, Jiang Zhong, Shuangyong Song, Yongxiang Li, Xuelong Li
[ABSTRACT]
Extensive research has been conducted to explore the capabilities of large
language models (LLMs) in table reasoning. However, the essential task of
transforming table information into reports remains a significant challenge
for industrial applications. This task is plagued by two critical issues: 1)
the complexity and diversity of tables lead to suboptimal reasoning outcomes;
and 2) existing table benchmarks lack the capacity to adequately assess the
practical application of this task. To fill this gap, we propose the
table-to-report task and construct a bilingual benchmark named T2R-bench, where
the key information flows from the tables to the reports for this task. The
benchmark comprises 457 industrial tables, all derived from real-world
scenarios and encompassing 19 industry domains as well as 4 types of industrial
tables. Furthermore, we propose evaluation criteria to fairly measure the
quality of report generation. The experiments on 25 widely-used LLMs reveal
that even state-of-the-art models like DeepSeek-R1 only achieve an overall
score of 62.71, indicating that LLMs still have room for improvement
on T2R-bench.
[LINK]
http://arxiv.org/abs/2508.19813v2
[DATE]
2025-09-11 15:29:17+08:00
[CATEGORIES]
cs.CL
CCF: A Context Compression Framework for Efficient Long-Sequence Language Modeling
[AUTHORS]
Wenhao Li, Bangcheng Sun, Weihao Ye, Tianyi Zhang, Daohai Yu, Fei Chao, Rongrong Ji
[ABSTRACT]
Scaling language models to longer contexts is essential for capturing rich
dependencies across extended discourse. However, naïve context extension
imposes significant computational and memory burdens, often resulting in
inefficiencies during both training and inference. In this work, we propose
CCF, a novel context compression framework designed to enable efficient
long-context modeling by learning hierarchical latent representations that
preserve global semantics while aggressively reducing input redundancy. CCF
integrates segment-wise semantic aggregation with key-value memory encoding,
forming compact representations that support accurate reconstruction and
long-range understanding. To further enhance scalability, we introduce a
training-efficient optimization strategy that couples incremental segment
decoding with sparse reservoir sampling, substantially reducing memory overhead
without degrading performance. Empirical results on multiple long-context
language modeling benchmarks demonstrate that CCF achieves competitive
perplexity under high compression ratios, and significantly improves throughput
and memory efficiency compared to existing approaches. These findings highlight
the potential of structured compression for scalable and effective long-context
language modeling.
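The abstract does not define its "sparse reservoir sampling" variant. As a point of reference only, the sketch below shows classic reservoir sampling (Algorithm R), which keeps a uniform sample of k items from a stream without storing the whole stream. The function name, the `segments` list, and the choice of k are illustrative assumptions, not details from the paper.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Classic Algorithm R: uniform sample of k items from a stream.

    Illustrative only; CCF's actual sampling scheme is not specified
    in the abstract.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: keep 8 of 100 segment identifiers.
segments = [f"segment_{i}" for i in range(100)]
kept = reservoir_sample(segments, k=8)
print(kept)
```

Memory use depends only on k, not on the stream length, which is the property that makes this family of samplers attractive for long sequences.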
[LINK]
http://arxiv.org/abs/2509.09199v1
[DATE]
2025-09-11 15:13:49+08:00
[CATEGORIES]
cs.CL
Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition
[AUTHORS]
Chin Yuen Kwok, Jia Qi Yip
[ABSTRACT]
Contextual biasing improves rare word recognition of ASR models by
prioritizing the output of rare words during decoding. A common approach is
Trie-based biasing, which gives “bonus scores” to partial hypotheses (e.g.
“Bon”) that may lead to the generation of the rare word (e.g. “Bonham”). If the
full word (“Bonham”) isn’t ultimately recognized, the system revokes those
earlier bonuses. This revocation is limited to beam search and is
computationally expensive, particularly for models with large decoders. To
overcome these limitations, we propose adapting ASR models to look ahead and
predict multiple steps at once. This avoids the revocation step entirely by
better estimating whether a partial hypothesis will lead to the generation of
the full rare word. By fine-tuning Whisper with only 10 hours of synthetic
data, our method reduces the word error rate on the NSC Part 2 test set from
30.86% to 12.19%.
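The bonus-and-revocation behaviour of conventional Trie-based biasing described above can be sketched as follows. This is a toy, character-level illustration of the baseline mechanism, not the proposed K-step method. The helper names, the flat prefix set standing in for a trie, and the fixed per-token bonus are all assumptions made for the example.

```python
def build_prefixes(rare_words):
    """Collect every non-empty prefix of every rare word (a flattened trie)."""
    prefixes = set()
    for word in rare_words:
        for end in range(1, len(word) + 1):
            prefixes.add(word[:end])
    return prefixes


def bias_step(partial, token, rare_words, prefixes, bonus=1.0):
    """Return (score_delta, new_partial) after appending one token.

    Assumes single-character tokens, so len(partial) equals the number
    of bonuses granted so far for the current partial match.
    """
    candidate = partial + token
    if candidate in rare_words:
        return bonus, ""  # full word recognized: bonuses are kept
    if candidate in prefixes:
        return bonus, candidate  # still on a path toward a rare word
    # Fell off the trie: revoke the bonuses given to the partial match.
    return -bonus * len(partial), ""


def total_bias(text, rare_words, bonus=1.0):
    """Accumulate the biasing score over a whole hypothesis."""
    prefixes = build_prefixes(rare_words)
    partial, total = "", 0.0
    for ch in text:
        delta, partial = bias_step(partial, ch, rare_words, prefixes, bonus)
        total += delta
    return total


rare = {"Bonham"}
print(total_bias("Bonham", rare))  # bonuses kept
print(total_bias("Bonus", rare))   # bonuses for "Bon" revoked
```

In beam search this bookkeeping has to run for every hypothesis at every step, which is the cost the abstract says the look-ahead approach avoids.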
[COMMENTS]
Published in Interspeech 2025
[LINK]
http://arxiv.org/abs/2509.09196v1
[DATE]
2025-09-11 15:11:46+08:00
[CATEGORIES]
cs.CL
Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
[AUTHORS]
Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
[ABSTRACT]
Reducing the key-value (KV) cache burden in Large Language Models (LLMs)
significantly accelerates inference. Dynamically selecting critical KV caches
during decoding helps maintain performance. Existing methods use random linear
hashing to identify important tokens, but this approach is inefficient due to
the orthogonal distribution of queries and keys within two narrow cones in
LLMs. We introduce Spotlight Attention, a novel method that employs non-linear
hashing functions to optimize the embedding distribution of queries and keys,
enhancing coding efficiency and robustness. We also developed a lightweight,
stable training framework using a Bradley-Terry ranking-based loss, enabling
optimization of the non-linear hashing module on GPUs with 16GB memory in 8
hours. Experimental results show that Spotlight Attention drastically improves
retrieval precision while shortening the length of the hash code at least
5$\times$ compared to traditional linear hashing. Finally, we exploit the
computational advantages of bitwise operations by implementing specialized CUDA
kernels, achieving hashing retrieval for 512K tokens in under 100$\mu$s on a
single A100 GPU, with end-to-end throughput up to 3$\times$ higher than vanilla
decoding.
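For context, the random linear hashing baseline that the abstract contrasts against can be sketched as sign-of-projection hashing followed by Hamming-distance ranking. This is a generic illustration of that baseline, not Spotlight Attention's learned non-linear hashing. The function names, the 16-bit code length, and the toy vectors are assumptions.

```python
import random


def random_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def hash_code(vec, planes):
    """Pack the signs of the projections into an integer bit code."""
    code = 0
    for plane in planes:
        dot = sum(v * p for v, p in zip(vec, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two codes (XOR + popcount)."""
    return bin(a ^ b).count("1")


def top_k_keys(query, keys, planes, k):
    """Return indices of the k keys whose codes are closest to the query's."""
    q = hash_code(query, planes)
    dists = [(hamming(q, hash_code(key, planes)), idx)
             for idx, key in enumerate(keys)]
    return [idx for _, idx in sorted(dists)[:k]]


planes = random_hyperplanes(dim=2, n_bits=16)
keys = [[1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]]
print(top_k_keys([1.0, 0.0], keys, planes, k=2))
```

The XOR-and-popcount comparison is the kind of bitwise operation that maps well onto specialized GPU kernels.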
[LINK]
http://arxiv.org/abs/2508.19740v3
[DATE]
2025-09-11 14:45:58+08:00
[CATEGORIES]
cs.CL
Clip Your Sequences Fairly: Enforcing Length Fairness for Sequence-Level RL
[AUTHORS]
Hanyi Mao, Quanjia Xiao, Lei Pang, Haixiao Liu
[ABSTRACT]
We propose FSPO (Fair Sequence Policy Optimization), a sequence-level
reinforcement learning method for LLMs that enforces length-fair clipping
directly in the importance-sampling (IS) weight space. We revisit
sequence-level RL methods and identify a mismatch when PPO/GRPO-style clipping
is transplanted to sequences: a fixed clip range systematically reweights short
vs. long responses, distorting the effective objective. Theoretically, we
formalize length fairness via a Length Reweighting Error (LRE) and prove that
small LRE yields a directional cosine guarantee between the clipped and true
updates. FSPO introduces a simple, Gaussian-motivated remedy: we clip the
sequence log-IS ratio with a band that applies a KL-corrected drift term and
scales as $\sqrt{L}$. Empirically, FSPO flattens clip rates across length bins,
stabilizes training, and outperforms all baselines across multiple evaluation
datasets.
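A minimal sketch of length-dependent clipping in log-IS space is shown below. The abstract states only that the band carries a KL-corrected drift term and scales as $\sqrt{L}$. The specific form used here (center at `-drift * length`, half-width `width * sqrt(length)`) and the parameter names are assumptions made for illustration, not the paper's definition.

```python
import math


def clip_log_ratio(log_ratio, length, width=0.2, drift=0.0):
    """Clip a sequence-level log importance ratio to a length-aware band.

    Assumed form (not taken from the paper):
      center     = -drift * length
      half-width = width * sqrt(length)
    """
    center = -drift * length
    half = width * math.sqrt(length)
    return max(center - half, min(center + half, log_ratio))


# A fixed clip range would treat these two sequences identically;
# here the band widens with the square root of the length.
print(clip_log_ratio(0.5, length=4))
print(clip_log_ratio(0.5, length=100))
```

Under this assumed form a 100-token response tolerates a log ratio five times larger than a 4-token response, rather than sharing one fixed range.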
[LINK]
http://arxiv.org/abs/2509.09177v1
[DATE]
2025-09-11 14:27:10+08:00
[CATEGORIES]
cs.LG
cs.CL
MIND: Towards Immersive Psychological Healing with Multi-agent Inner Dialogue
[AUTHORS]
Yujia Chen, Changsong Li, Yiming Wang, Tianjie Ju, Qingqing Xiao, Nan Zhang, Zifan Kong, Peng Wang, Binyu Yan
[COMMENTS]
Accepted by EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2502.19860v2
[DATE]
2025-09-11 13:37:42+08:00
[CATEGORIES]
cs.CL
SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models
[AUTHORS]
Amirhossein Dabiriaghdam, Lele Wang
[ABSTRACT]
The widespread adoption of large language models (LLMs) necessitates reliable
methods to detect LLM-generated text. We introduce SimMark, a robust
sentence-level watermarking algorithm that makes LLMs’ outputs traceable
without requiring access to model internals, making it compatible with both
open and API-based LLMs. By leveraging the similarity of semantic sentence
embeddings combined with rejection sampling to embed detectable statistical
patterns imperceptible to humans, and employing a soft counting mechanism,
SimMark achieves robustness against paraphrasing attacks. Experimental results
demonstrate that SimMark sets a new benchmark for robust watermarking of
LLM-generated content, surpassing prior sentence-level watermarking techniques
in robustness, sampling efficiency, and applicability across diverse domains,
all while maintaining the text quality and fluency.
[COMMENTS]
Accepted to EMNLP 25 main
[LINK]
http://arxiv.org/abs/2502.02787v2
[DATE]
2025-09-11 12:36:56+08:00
[CATEGORIES]
cs.CL
cs.LG
VeriSafe Agent: Safeguarding Mobile GUI Agent via Logic-based Action Verification
[AUTHORS]
Jungjae Lee, Dongjae Lee, Chihun Choi, Youngmin Im, Jaeyoung Wi, Kihong Heo, Sangeun Oh, Sunjae Lee, Insik Shin
[ABSTRACT]
Large Foundation Models (LFMs) have unlocked new possibilities in
human-computer interaction, particularly with the rise of mobile Graphical User
Interface (GUI) Agents capable of interacting with mobile GUIs. These agents
allow users to automate complex mobile tasks through simple natural language
instructions. However, the inherent probabilistic nature of LFMs, coupled with
the ambiguity and context-dependence of mobile tasks, makes LFM-based
automation unreliable and prone to errors. To address this critical challenge,
we introduce VeriSafe Agent (VSA): a formal verification system that serves as
a logically grounded safeguard for Mobile GUI Agents. VSA deterministically
ensures that an agent’s actions strictly align with user intent before
executing the action. At its core, VSA introduces a novel autoformalization
technique that translates natural language user instructions into a formally
verifiable specification. This enables runtime, rule-based verification of the
agent’s actions, detecting erroneous actions even before they take effect. To
the best of our knowledge, VSA is the first attempt to bring the rigor of
formal verification to GUI agents, bridging the gap between LFM-driven actions
and formal software verification. We implement VSA using off-the-shelf LFM
services (GPT-4o) and evaluate its performance on 300 user instructions across
18 widely used mobile apps. The results demonstrate that VSA achieves
94.33%-98.33% accuracy in verifying agent actions, outperforming existing
LFM-based verification methods by 30.00%-16.33%, and increases the GUI agent’s
task completion rate by 90%-130%.
[LINK]
http://arxiv.org/abs/2503.18492v2
[DATE]
2025-09-11 12:15:45+08:00
[CATEGORIES]
cs.CL
ViRanker: A BGE-M3 & Blockwise Parallel Transformer Cross-Encoder for Vietnamese Reranking
[AUTHORS]
Phuong-Nam Dang, Kieu-Linh Nguyen, Thanh-Hieu Pham
[ABSTRACT]
This paper presents ViRanker, a cross-encoder reranking model tailored to the
Vietnamese language. Built on the BGE-M3 encoder and enhanced with the
Blockwise Parallel Transformer, ViRanker addresses the lack of competitive
rerankers for Vietnamese, a low-resource language with complex syntax and
diacritics. The model was trained on an 8 GB curated corpus and fine-tuned with
hybrid hard-negative sampling to strengthen robustness. Evaluated on the
MMARCO-VI benchmark, ViRanker achieves strong early-rank accuracy, surpassing
multilingual baselines and competing closely with PhoRanker. By releasing the
model openly on Hugging Face, we aim to support reproducibility and encourage
wider adoption in real-world retrieval systems. Beyond Vietnamese, this study
illustrates how careful architectural adaptation and data curation can advance
reranking in other underrepresented languages.
[COMMENTS]
9 pages
[LINK]
http://arxiv.org/abs/2509.09131v1
[DATE]
2025-09-11 12:07:43+08:00
[CATEGORIES]
cs.CL
Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia
[AUTHORS]
Sophia Maria
[ABSTRACT]
Large language models (LLMs) excel in general-domain applications, yet their
performance often degrades in specialized tasks requiring domain-specific
knowledge. E-commerce is particularly challenging, as its data are noisy,
heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a
vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and
71B active per token, designed for Southeast Asian e-commerce. Compass-v3
adopts fewer but larger experts, combined with hardware-efficient
optimizations (such as intra-node expert parallelism and a customized memcpy
operator) to maximize GPU utilization. The model is trained on 12T tokens of
curated multilingual corpora and large-scale synthetic e-commerce instructions
using a mixed-training strategy. To enhance alignment, we propose
Optimal-Transport Direct Preference Optimization (OTPO), which captures
token-level distinctions and improves instruction adherence in
commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3
delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1,
GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong
multilingual capability across low-resource Southeast Asian languages
(Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while
sustaining competitive performance on general benchmarks. It has already been
widely applied in Shopee’s industrial-scale e-commerce platform and is
gradually replacing OpenAI’s traffic, now accounting for over 70% of total LLM
usage, highlighting its dual strengths in specialized commerce expertise and
broad linguistic competence.
[LINK]
http://arxiv.org/abs/2509.09121v1
[DATE]
2025-09-11 11:23:48+08:00
[CATEGORIES]
cs.CL
OTESGN: Optimal Transport-Enhanced Syntactic-Semantic Graph Networks for Aspect-Based Sentiment Analysis
[AUTHORS]
Xinfeng Liao, Xuanqi Chen, Lianxi Wang, Jiahuan Yang, Zhuowei Chen, Ziying Rong
[ABSTRACT]
Aspect-based sentiment analysis (ABSA) aims to identify aspect terms and
determine their sentiment polarity. While dependency trees combined with
contextual semantics provide structural cues, existing approaches often rely on
dot-product similarity and fixed graphs, which limit their ability to capture
nonlinear associations and adapt to noisy contexts. To address these
limitations, we propose the Optimal Transport-Enhanced Syntactic-Semantic Graph
Network (OTESGN), a model that jointly integrates structural and distributional
signals. Specifically, a Syntactic Graph-Aware Attention module models global
dependencies with syntax-guided masking, while a Semantic Optimal Transport
Attention module formulates aspect-opinion association as a distribution
matching problem solved via the Sinkhorn algorithm. An Adaptive Attention
Fusion mechanism balances heterogeneous features, and contrastive
regularization enhances robustness. Extensive experiments on three benchmark
datasets (Rest14, Laptop14, and Twitter) demonstrate that OTESGN delivers
state-of-the-art performance. Notably, it surpasses competitive baselines by up
to +1.30 Macro-F1 on Laptop14 and +1.01 on Twitter. Ablation studies and
visualization analyses further highlight OTESGN’s ability to capture
fine-grained sentiment associations and suppress noise from irrelevant context.
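The Sinkhorn algorithm mentioned above solves entropy-regularized optimal transport by alternately rescaling rows and columns of a kernel matrix. The sketch below is a generic, pure-Python version on a toy cost matrix. How OTESGN builds its cost matrix and marginals is not given in the abstract, so the inputs, `eps`, and the iteration count are illustrative assumptions.

```python
import math


def sinkhorn(cost, row_marg, col_marg, eps=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations.

    cost:     n x m list of lists of transport costs
    row_marg: length-n source distribution
    col_marg: length-m target distribution
    Returns the n x m transport plan.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-C / eps).
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows, then columns, to match the marginals.
        u = [row_marg[i] / sum(K[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marg[j] / sum(K[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two "aspect" tokens and two "opinion" tokens,
# with low cost on the diagonal.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
for row in plan:
    print(row)
```

The resulting plan concentrates mass on low-cost pairs while respecting both marginals, which is what lets it act as an attention-like matching between two sets of tokens.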
[LINK]
http://arxiv.org/abs/2509.08612v2
[DATE]
2025-09-11 10:55:43+08:00
[CATEGORIES]
cs.CL
TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla
[AUTHORS]
Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri
[ABSTRACT]
Despite being the 5th most spoken language, Bangla remains underrepresented
in Large Language Models (LLMs), particularly for code generation. This
primarily stems from the scarcity of high-quality data to pre-train and/or
finetune such models. Hence, we introduce the first dedicated family of Code
LLMs for Bangla (1B & 9B). We offer three major contributions: (1)
comprehensive Bangla code instruction datasets for programming domain
adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code
generation; and (3) the TigerCoder-family of Code LLMs, achieving significant
~11-18% performance gains at Pass@1 over existing multilingual and
general-purpose Bangla LLMs. Our findings show that curated, high-quality
datasets can overcome limitations of smaller models for low-resource languages.
We open-source all resources to advance further Bangla LLM research.
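The abstract does not say how Pass@1 is computed. A commonly used choice in code-generation evaluation is the unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$, where n samples are drawn per problem and c of them pass the tests. The sketch below assumes that estimator, and the sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget.
    Assumed metric definition; the abstract does not specify it.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 10 samples per problem, 3 of them correct.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n.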
[LINK]
http://arxiv.org/abs/2509.09101v1
[DATE]
2025-09-11 10:25:49+08:00
[CATEGORIES]
cs.CL
Optimizing Length Compression in Large Reasoning Models
[AUTHORS]
Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou
[ABSTRACT]
Large Reasoning Models (LRMs) have achieved remarkable success, yet they
often suffer from producing unnecessary and verbose reasoning chains. We
identify a core aspect of this issue as “invalid thinking” – models tend to
repeatedly double-check their work after having derived the correct answer. To
address this specific inefficiency, we move beyond the general principles of
Efficacy and Efficiency to propose two new, fine-grained principles: Brevity,
which advocates for eliminating redundancy, and Sufficiency, which ensures
critical reasoning steps are preserved. Guided by these principles, we
introduce LC-R1, a post-training method based on Group Relative Policy
Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for
overall conciseness and a Compress Reward that is specifically designed to
remove the invalid portion of the thinking process. Extensive experiments on
multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant
reduction in sequence length (~50%) with only a marginal (~2%) drop in
accuracy, achieving a favorable trade-off point on the Pareto frontier that
prioritizes high compression. Our analysis further validates the robustness of
LC-R1 and provides valuable insights for developing more powerful yet
computationally efficient LRMs. Our code is released at
https://github.com/zxiangx/LC-R1.
[COMMENTS]
16 pages, 7 figures, 4 tables
[LINK]
http://arxiv.org/abs/2506.14755v2
[DATE]
2025-09-11 10:13:24+08:00
[CATEGORIES]
cs.CL
MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction
[AUTHORS]
Zhongqiu Li, Shiquan Wang, Ruiyu Fang, Mengjiao Bao, Zhenhe Wu, Shuangyong Song, Yongxiang Li, Zhongjiang He
[ABSTRACT]
Large language models (LLMs) demonstrate robust capabilities across diverse
research domains. However, their performance in universal information
extraction (UIE) remains insufficient, especially when tackling structured
output scenarios that involve complex schema descriptions and require
multi-step reasoning. While existing approaches enhance the performance of LLMs
through in-context learning and instruction tuning, significant limitations
nonetheless persist. To enhance the model’s generalization ability, we propose
integrating reinforcement learning (RL) with multi-perspective reasoning for
information extraction (IE) tasks. Our work transitions LLMs from passive
extractors to active reasoners, enabling them to understand not only what to
extract but also how to reason. Experiments conducted on multiple IE benchmarks
demonstrate that MR-UIE consistently elevates extraction accuracy across
domains and surpasses state-of-the-art methods on several datasets.
Furthermore, incorporating multi-perspective reasoning into RL notably enhances
generalization in complex IE tasks, underscoring the critical role of reasoning
in challenging scenarios.
[LINK]
http://arxiv.org/abs/2509.09082v1
[DATE]
2025-09-11 09:08:58+08:00
[CATEGORIES]
cs.CL
SWI: Speaking with Intent in Large Language Models
[AUTHORS]
Yuwei Yin, EunJeong Hwang, Giuseppe Carenini
[ABSTRACT]
Intent, typically clearly formulated and planned, functions as a cognitive
framework for communication and problem-solving. This paper introduces the
concept of Speaking with Intent (SWI) in large language models (LLMs), where
the explicitly generated intent encapsulates the model’s underlying intention
and provides high-level planning to guide subsequent analysis and action. By
emulating deliberate and purposeful thoughts in the human mind, SWI is
hypothesized to enhance the reasoning capabilities and generation quality of
LLMs. Extensive experiments on text summarization, multi-task question
answering, and mathematical reasoning benchmarks consistently demonstrate the
effectiveness and generalizability of Speaking with Intent over direct
generation without explicit intent. Further analysis corroborates the
generalizability of SWI under different experimental settings. Moreover, human
evaluations verify the coherence, effectiveness, and interpretability of the
intent produced by SWI. The promising results in enhancing LLMs with explicit
intents pave a new avenue for boosting LLMs’ generation and reasoning abilities
with cognitive notions.
[COMMENTS]
Code: https://github.com/YuweiYin/SWI
[LINK]
http://arxiv.org/abs/2503.21544v3
[DATE]
2025-09-11 08:53:14+08:00
[CATEGORIES]
cs.CL
cs.LG
ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts
[AUTHORS]
Amelia F. Hardy, Houjun Liu, Allie Griffith, Bernard Lange, Duncan Eddy, Mykel J. Kochenderfer
[ABSTRACT]
Existing LLM red-teaming approaches prioritize high attack success rate,
often resulting in high-perplexity prompts. This focus overlooks low-perplexity
attacks that are more difficult to filter, more likely to arise during benign
usage, and more impactful as negative downstream training examples. In
response, we introduce ASTPrompter, a single-step optimization method that uses
contrastive preference learning to train an attacker to maintain low perplexity
while achieving a high attack success rate (ASR). ASTPrompter achieves an
attack success rate 5.1 times higher on Llama-8.1B while using inputs that are
2.1 times more likely to occur according to the frozen LLM. Furthermore, our
attack transfers to Mistral-7B, Qwen-7B, and TinyLlama in both black- and
white-box settings. Lastly, by tuning a single hyperparameter in our method, we
discover successful attack prefixes along an efficient frontier between ASR and
perplexity, highlighting perplexity as a previously under-considered factor in
red-teaming.
[COMMENTS]
8 pages, 7 pages of appendix, 3 tables, 4 figures
[LINK]
http://arxiv.org/abs/2407.09447v5
[DATE]
2025-09-11 06:36:47+08:00
[CATEGORIES]
cs.CL
HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets
[AUTHORS]
Ying Yuan, Xing-Yue Monica Ge, Aaron Archer Waterman, Tommaso Biancalani, David Richmond, Yogesh Pandit, Avtar Singh, Russell Littman, Jin Liu, Jan-Christian Huetter, Vladimir Ermakov
[ABSTRACT]
Large-scale single-cell and Perturb-seq investigations routinely involve
clustering cells and subsequently annotating each cluster with Gene-Ontology
(GO) terms to elucidate the underlying biological programs. However, both
stages, resolution selection and functional annotation, are inherently
subjective, relying on heuristics and expert curation. We present
HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming
cluster annotation into a quantitatively optimizable task. Initially, an LLM
functioning as a gene-set analyst analyzes the content of each gene program or
perturbation module and generates a ranked list of GO-based hypotheses,
accompanied by calibrated confidence scores. Subsequently, we embed every
predicted description with a sentence-embedding model, compute pair-wise cosine
similarities, and let the agent referee panel score (i) the internal
consistency of the predictions, high average similarity within the same
cluster, termed intra-cluster agreement, and (ii) their external distinctiveness,
low similarity between clusters, termed inter-cluster separation. These two
quantities are combined to produce an agent-derived resolution score, which is
maximized when clusters exhibit simultaneous coherence and mutual exclusivity.
When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary
test, our Resolution Score selects clustering granularities that exhibit
alignment with known pathways compared to classical metrics such as silhouette
score and modularity score for gene functional enrichment summary. These findings
establish LLM agents as objective adjudicators of cluster resolution and
functional annotation, thereby paving the way for fully automated,
context-aware interpretation pipelines in single-cell multi-omics studies.
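The two quantities described above, intra-cluster agreement and inter-cluster separation, can be sketched from pairwise cosine similarities. The abstract does not state how they are combined. The difference of means used below, the function names, and the 2-D toy embeddings are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def resolution_score(clusters):
    """Mean intra-cluster similarity minus mean inter-cluster similarity.

    clusters: dict mapping cluster id -> list of description embeddings.
    The subtraction is an assumed way to combine the two terms.
    """
    intra, inter = [], []
    ids = list(clusters)
    for i, ci in enumerate(ids):
        vecs = clusters[ci]
        # Pairs within the same cluster.
        for a in range(len(vecs)):
            for b in range(a + 1, len(vecs)):
                intra.append(cosine(vecs[a], vecs[b]))
        # Pairs across different clusters (each pair counted once).
        for cj in ids[i + 1:]:
            for x in vecs:
                for y in clusters[cj]:
                    inter.append(cosine(x, y))

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(intra) - mean(inter)


coherent = {"a": [[1.0, 0.0], [1.0, 0.0]], "b": [[0.0, 1.0], [0.0, 1.0]]}
mixed = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[1.0, 0.0], [0.0, 1.0]]}
print(resolution_score(coherent), resolution_score(mixed))
```

Evaluating the score at each candidate clustering resolution and keeping the maximum would give a simple selection rule in the spirit of the abstract.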
[LINK]
http://arxiv.org/abs/2509.09740v1
[DATE]
2025-09-11 06:25:33+08:00
[CATEGORIES]
cs.CL
cs.LG
COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation
[AUTHORS]
Umair Hassan
[ABSTRACT]
Urdu, spoken by over 250 million people, remains critically under-served in
multimodal and vision-language research. The absence of large-scale,
high-quality datasets has limited the development of Urdu-capable systems and
reinforced biases in multilingual vision-language models trained primarily on
high-resource languages. To address this gap, we present COCO-Urdu, a
large-scale image-caption dataset derived from MS COCO, containing 59,000
images and 319,000 Urdu captions selected through stratified sampling to
preserve the original distribution. Captions were translated using SeamlessM4T
v2 and validated with a hybrid multimodal quality estimation framework that
integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual
grounding, and BERTScore with back-translation for semantic consistency;
low-scoring captions were iteratively refined using open-source large language
models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting
consistently strong results. To the best of our knowledge, COCO-Urdu is the
largest publicly available Urdu captioning dataset. By releasing both the
dataset and the quality estimation pipeline, we aim to reduce language bias in
multimodal research and establish a foundation for inclusive vision-language
systems.
[COMMENTS]
17 pages, 3 figures, 3 tables. Dataset available at
https://huggingface.co/datasets/umairhassan02/urdu-translated-coco-captions-subset.
Scripts and notebooks to reproduce results available at
https://github.com/umair-hassan2/COCO-Urdu
[LINK]
http://arxiv.org/abs/2509.09014v1
[DATE]
2025-09-11 05:17:32+08:00
[CATEGORIES]
cs.CL
BRoverbs – Measuring how much LLMs understand Portuguese proverbs
[AUTHORS]
Thales Sales Almeida, Giovana Kerche Bonás, João Guilherme Alves Santos
[ABSTRACT]
Large Language Models (LLMs) exhibit significant performance variations
depending on the linguistic and cultural context in which they are applied.
This disparity signals the necessity of mature evaluation frameworks that can
assess their capabilities in specific regional settings. In the case of
Portuguese, existing evaluations remain limited, often relying on translated
datasets that may not fully capture linguistic nuances or cultural references.
Meanwhile, native Portuguese-language datasets predominantly focus on
structured national exams or sentiment analysis of social media interactions,
leaving gaps in evaluating broader linguistic understanding. To address this
limitation, we introduce BRoverbs, a dataset specifically designed to assess
LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic
resource, encapsulating cultural wisdom, figurative expressions, and complex
syntactic structures that challenge a model’s comprehension of regional
expressions. BRoverbs aims to provide a new evaluation tool for
Portuguese-language LLMs, contributing to advancing regionally informed
benchmarking. The benchmark is available at
https://huggingface.co/datasets/Tropic-AI/BRoverbs.
[LINK]
http://arxiv.org/abs/2509.08960v1
[DATE]
2025-09-11 03:47:46+08:00
[CATEGORIES]
cs.CL
Documents Are People and Words Are Items: A Psychometric Approach to Textual Data with Contextual Embeddings
[AUTHORS]
Jinsong Chen
[ABSTRACT]
This research introduces a novel psychometric method for analyzing textual
data using large language models. By leveraging contextual embeddings to create
contextual scores, we transform textual data into response data suitable for
psychometric analysis. Treating documents as individuals and words as items,
this approach provides a natural psychometric interpretation under the
assumption that certain keywords, whose contextual meanings vary significantly
across documents, can effectively differentiate documents within a corpus. The
modeling process comprises two stages: obtaining contextual scores and
performing psychometric analysis. In the first stage, we utilize natural
language processing techniques and encoder-based transformer models to identify
common keywords and generate contextual scores. In the second stage, we employ
various types of factor analysis, including exploratory and bifactor models, to
extract and define latent factors, determine factor correlations, and identify
the most significant words associated with each factor. Applied to the Wiki
STEM corpus, our experimental results demonstrate the method’s potential to
uncover latent knowledge dimensions and patterns within textual data. This
approach not only enhances the psychometric analysis of textual data but also
holds promise for applications in fields rich in textual information, such as
education, psychology, and law.
[LINK]
http://arxiv.org/abs/2509.08920v1
[DATE]
2025-09-11 02:31:37+08:00
[CATEGORIES]
cs.CL
CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
[AUTHORS]
Zongxi Li, Yang Li, Haoran Xie, S. Joe Qin
[COMMENTS]
Accepted by EMNLP 2025 (Main Conference)
[LINK]
http://arxiv.org/abs/2502.01523v2
[DATE]
2025-09-11 02:27:02+08:00
[CATEGORIES]
cs.CL
Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach
[AUTHORS]
Imene Kolli, Ario Saeid Vaghefi, Chiara Colesanti Senni, Shantam Raj, Markus Leippold
[ABSTRACT]
InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of
over 500 companies and 250 industry associations, assessing each entity’s
support or opposition to science-based policy pathways for achieving the Paris
Agreement’s goal of limiting global warming to 1.5°C. Although
InfluenceMap has made progress with automating key elements of the analytical
workflow, a significant portion of the assessment remains manual, making it
time- and labor-intensive and susceptible to human error. We propose an
AI-assisted framework to accelerate the monitoring of corporate climate policy
engagement by leveraging Retrieval-Augmented Generation to automate the most
time-intensive extraction of relevant evidence from large-scale textual data.
Our evaluation shows that a combination of layout-aware parsing, the Nomic
embedding model, and few-shot prompting strategies yields the best performance
in extracting and classifying evidence from multilingual corporate documents.
We conclude that while the automated RAG system effectively accelerates
evidence extraction, the nuanced nature of the analysis necessitates a
human-in-the-loop approach where the technology augments, rather than replaces,
expert judgment to ensure accuracy.
[LINK]
http://arxiv.org/abs/2509.08907v1
[DATE]
2025-09-11 02:09:45+08:00
[CATEGORIES]
cs.CL
Noise or Nuance: An Investigation Into Useful Information and Filtering For LLM Driven AKBC
[AUTHORS]
Alex Clay, Ernesto Jiménez-Ruiz, Pranava Madhyastha
[ABSTRACT]
RAG and fine-tuning are prevalent strategies for improving the quality of LLM
outputs. However, in constrained situations, such as that of the 2025 LM-KBC
challenge, such techniques are restricted. In this work we investigate three
facets of the triple completion task: generation, quality assurance, and LLM
response parsing. Our work finds that in this constrained setting: additional
information improves generation quality, LLMs can be effective at filtering
poor quality triples, and the tradeoff between flexibility and consistency with
LLM response parsing is setting dependent.
[COMMENTS]
8 pages, 1 figure, accepted to the ISWC 2025 LM-KBC Workshop
[LINK]
http://arxiv.org/abs/2509.08903v1
[DATE]
2025-09-11 02:04:41+08:00
[CATEGORIES]
cs.CL
Recurrence Meets Transformers for Universal Multimodal Retrieval
[AUTHORS]
Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
[ABSTRACT]
With the rapid advancement of multimodal retrieval and its application in
LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged.
Existing methods predominantly rely on task-specific fine-tuning of
vision-language models and are limited to single-modality queries or documents.
In this paper, we propose ReT-2, a unified retrieval model that supports
multimodal queries, composed of both images and text, and searches across
multimodal document collections where text and images coexist. ReT-2 leverages
multi-layer representations and a recurrent Transformer architecture with
LSTM-inspired gating mechanisms to dynamically integrate information across
layers and modalities, capturing fine-grained visual and textual details. We
evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different
retrieval configurations. Results demonstrate that ReT-2 consistently achieves
state-of-the-art performance across diverse settings, while offering faster
inference and reduced memory usage compared to prior approaches. When
integrated into retrieval-augmented generation pipelines, ReT-2 also improves
downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source
code and trained models are publicly available at:
https://github.com/aimagelab/ReT-2
[LINK]
http://arxiv.org/abs/2509.08897v1
[DATE]
2025-09-11 02:00:29+08:00
[CATEGORIES]
cs.CL
A Survey of Reinforcement Learning for Large Reasoning Models
[AUTHORS]
Kaiyan Zhang, Yuxin Zuo, Bingxiang He, Youbang Sun, Runze Liu, Che Jiang, Yuchen Fan, Kai Tian, Guoli Jia, Pengfei Li, Yu Fu, Xingtai Lv, Yuchen Zhang, Sihang Zeng, Shang Qu, Haozhan Li, Shijie Wang, Yuru Wang, Xinwei Long, Fangfu Liu, Xiang Xu, Jiaze Ma, Xuekai Zhu, Ermo Hua, Yihao Liu, Zonglin Li, Huayu Chen, Xiaoye Qu, Yafu Li, Weize Chen, Zhenzhao Yuan, Junqi Gao, Dong Li, Zhiyuan Ma, Ganqu Cui, Zhiyuan Liu, Biqing Qi, Ning Ding, Bowen Zhou
[ABSTRACT]
In this paper, we survey recent advances in Reinforcement Learning (RL) for
reasoning with Large Language Models (LLMs). RL has achieved remarkable success
in advancing the frontier of LLM capabilities, particularly in addressing
complex logical tasks such as mathematics and coding. As a result, RL has
emerged as a foundational methodology for transforming LLMs into Large
Reasoning Models (LRMs). With the
rapid progress of the field, further scaling of RL for LRMs now faces
foundational challenges not only in computational resources but also in
algorithm design, training data, and infrastructure. To this end, it is timely
to revisit the development of this domain, reassess its trajectory, and explore
strategies to enhance the scalability of RL toward Artificial SuperIntelligence
(ASI). In particular, we examine research applying RL to LLMs and LRMs for
reasoning abilities, especially since the release of DeepSeek-R1, including
foundational components, core problems, training resources, and downstream
applications, to identify future opportunities and directions for this rapidly
evolving area. We hope this review will promote future research on RL for
broader reasoning models. Github:
https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
[LINK]
http://arxiv.org/abs/2509.08827v1
[DATE]
2025-09-11 01:59:43+08:00
[CATEGORIES]
cs.CL
cs.LG
TweakLLM: A Routing Architecture for Dynamic Tailoring of Cached Responses
[AUTHORS]
Muhammad Taha Cheema, Abeer Aamir, Khawaja Gul Muhammad, Naveed Anwar Bhatti, Ihsan Ayyub Qazi, Zafar Ayyub Qazi
[COMMENTS]
13 pages, 9 figures
[LINK]
http://arxiv.org/abs/2507.23674v2
[DATE]
2025-09-11 01:59:08+08:00
[CATEGORIES]
cs.LG
cs.CL
Subjective Behaviors and Preferences in LLM: Language of Browsing
[AUTHORS]
Sai Sundaresan, Harshita Chopra, Atanu R. Sinha, Koustava Goswami, Nagasai Saketh Naidu, Raghav Karan, N Anushka
[ABSTRACT]
A Large Language Model (LLM) offers versatility across domains and tasks,
purportedly benefiting users with a wide variety of behaviors and preferences.
We question this perception about an LLM when users have inherently subjective
behaviors and preferences, as seen in their ubiquitous and idiosyncratic
browsing of websites or apps. The sequential behavior logs of pages, thus
generated, form something akin to each user’s self-constructed “language”,
albeit without the structure and grammar imbued in natural languages. We ask:
(i) Can a small LM represent the “language of browsing” better than a large LM?
(ii) Can an LM with a single set of parameters (or, single LM) adequately
capture myriad users’ heterogeneous, subjective behaviors and preferences?
(iii) Can a single LM with high average performance yield low variance in
performance to make alignment good at user level? We introduce clusterwise LM
training, HeTLM (Heterogeneity aware Training of Language Model), appropriate
for subjective behaviors. We find that (i) a small LM trained using a
page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM
with a heterogeneous, cluster-specific set of parameters outperforms a single LM
of the same family, controlling for the number of parameters; and (iii) a
higher mean and a lower variance in generation ensue, implying improved
alignment.
[COMMENTS]
Accepted at EMNLP 2025
[LINK]
http://arxiv.org/abs/2508.15474v2
[DATE]
2025-09-11 01:51:03+08:00
[CATEGORIES]
cs.CL
MoVoC: Morphology-Aware Subword Construction for Geez Script Languages
[AUTHORS]
Hailay Kidu Teklehaymanot, Dren Fazlija, Wolfgang Nejdl
[ABSTRACT]
Subword-based tokenization methods often fail to preserve morphological
boundaries, a limitation especially pronounced in low-resource, morphologically
complex languages such as those written in the Geez script. To address this, we
present MoVoC (Morpheme-aware Subword Vocabulary Construction) and train
MoVoC-Tok, a tokenizer that integrates supervised morphological analysis into
the subword vocabulary. This hybrid segmentation approach combines
morpheme-based and Byte Pair Encoding (BPE) tokens to preserve morphological
integrity while maintaining lexical meaning. To tackle resource scarcity, we
curate and release manually annotated morpheme data for four Geez script
languages and a morpheme-aware vocabulary for two of them. While the proposed
tokenization method does not lead to significant gains in automatic translation
quality, we observe consistent improvements in intrinsic metrics, MorphoScore,
and Boundary Precision, highlighting the value of morphology-aware segmentation
in enhancing linguistic fidelity and token efficiency. Our morpheme-annotated
datasets and tokenizer will be publicly available to support further research
in low-resource, morphologically rich languages. Our code and data are
available on GitHub: https://github.com/hailaykidu/MoVoC
[COMMENTS]
This submission is approximately 10 pages in length and includes 1
figure and 6 tables
[LINK]
http://arxiv.org/abs/2509.08812v1
[DATE]
2025-09-11 01:45:10+08:00
[CATEGORIES]
cs.CL
Evaluating LLMs Without Oracle Feedback: Agentic Annotation Evaluation Through Unsupervised Consistency Signals
[AUTHORS]
Cheng Chen, Haiyan Yin, Ivor Tsang
[ABSTRACT]
Large Language Models (LLMs), when paired with prompt-based tasks, have
significantly reduced data annotation costs and reliance on human annotators.
However, evaluating the quality of their annotations remains challenging in
dynamic, unsupervised environments where oracle feedback is scarce and
conventional methods fail. To address this challenge, we propose a novel
agentic annotation paradigm, where a student model collaborates with a noisy
teacher (the LLM) to assess and refine annotation quality without relying on
oracle feedback. The student model, acting as an unsupervised feedback
mechanism, employs a user preference-based majority voting strategy to evaluate
the consistency of the LLM outputs. To systematically measure the reliability
of LLM-generated annotations, we introduce the Consistent and Inconsistent
(CAI) Ratio, a novel unsupervised evaluation metric. The CAI Ratio not only
quantifies the annotation quality of the noisy teacher under limited user
preferences but also plays a critical role in model selection, enabling the
identification of robust LLMs in dynamic, unsupervised environments. Applied to
ten open-domain NLP datasets across four LLMs, the CAI Ratio demonstrates a
strong positive correlation with LLM accuracy, establishing it as an essential
tool for unsupervised evaluation and model selection in real-world settings.
[COMMENTS]
11 pages, 10 figures
[LINK]
http://arxiv.org/abs/2509.08809v1
[DATE]
2025-09-11 01:42:41+08:00
[CATEGORIES]
cs.CL
Scaling Truth: The Confidence Paradox in AI Fact-Checking
[AUTHORS]
Ihsan A. Qazi, Zohaib Khan, Abdullah Ghani, Agha A. Raza, Zafar A. Qazi, Wassay Sajjad, Ayesha Ali, Asher Javaid, Muhammad Abdullah Sohail, Abdul H. Azeemi
[COMMENTS]
65 pages, 26 figures, 6 tables
[LINK]
http://arxiv.org/abs/2509.08803v1
[DATE]
2025-09-11 01:36:25+08:00
[CATEGORIES]
cs.CL
CURE: Controlled Unlearning for Robust Embeddings – Mitigating Conceptual Shortcuts in Pre-Trained Language Models
[AUTHORS]
Aysenur Kocak, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
[COMMENTS]
Accepted at the Conference on Empirical Methods in Natural Language
Processing (EMNLP 2025)
[LINK]
http://arxiv.org/abs/2509.05230v2
[DATE]
2025-09-11 01:32:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Do All Autoregressive Transformers Remember Facts the Same Way? A Cross-Architecture Analysis of Recall Mechanisms
[AUTHORS]
Minyeong Choe, Haehyun Cho, Changho Seo, Hyunil Kim
[ABSTRACT]
Understanding how Transformer-based language models store and retrieve
factual associations is critical for improving interpretability and enabling
targeted model editing. Prior work, primarily on GPT-style models, has
identified MLP modules in early layers as key contributors to factual recall.
However, it remains unclear whether these findings generalize across different
autoregressive architectures. To address this, we conduct a comprehensive
evaluation of factual recall across several models – including GPT, LLaMA,
Qwen, and DeepSeek – analyzing where and how factual information is encoded
and accessed. Consequently, we find that Qwen-based models behave differently
from previous patterns: attention modules in the earliest layers contribute
more to factual recall than MLP modules. Our findings suggest that even within
the autoregressive Transformer family, architectural variations can lead to
fundamentally different mechanisms of factual recall.
[COMMENTS]
Accepted at EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.08778v1
[DATE]
2025-09-11 01:06:55+08:00
[CATEGORIES]
cs.CL
Calibrating MLLM-as-a-judge via Multimodal Bayesian Prompt Ensembles
[AUTHORS]
Eric Slyman, Mehrab Tanjim, Kushal Kafle, Stefan Lee
[ABSTRACT]
Multimodal large language models (MLLMs) are increasingly used to evaluate
text-to-image (TTI) generation systems, providing automated judgments based on
visual and textual context. However, these “judge” models often suffer from
biases, overconfidence, and inconsistent performance across diverse image
domains. While prompt ensembling has shown promise for mitigating these issues
in unimodal, text-only settings, our experiments reveal that standard
ensembling methods fail to generalize effectively for TTI tasks. To address
these limitations, we propose a new multimodal-aware method called Multimodal
Mixture-of-Bayesian Prompt Ensembles (MMB). Our method uses a Bayesian prompt
ensemble approach augmented by image clustering, allowing the judge to
dynamically assign prompt weights based on the visual characteristics of each
sample. We show that MMB improves accuracy in pairwise preference judgments and
greatly enhances calibration, making it easier to gauge the judge’s true
uncertainty. In evaluations on two TTI benchmarks, HPSv2 and MJBench, MMB
outperforms existing baselines in alignment with human annotations and
calibration across varied image content. Our findings highlight the importance
of multimodal-specific strategies for judge calibration and suggest a promising
path forward for reliable large-scale TTI evaluation.
[COMMENTS]
17 pages, 8 figures, Accepted at ICCV 2025
[LINK]
http://arxiv.org/abs/2509.08777v1
[DATE]
2025-09-11 01:06:47+08:00
[CATEGORIES]
cs.CL
A Dynamic Fusion Model for Consistent Crisis Response
[AUTHORS]
Xiaoying Song, Anirban Saha Anik, Eduardo Blanco, Vanessa Frias-Martinez, Lingzi Hong
[ABSTRACT]
In response to the urgent need for effective communication with
crisis-affected populations, automated responses driven by language models have
been proposed to assist in crisis communications. A critical yet often
overlooked factor is the consistency of response style, which could affect the
trust of affected individuals in responders. Despite its importance, few
studies have explored methods for maintaining stylistic consistency across
generated responses. To address this gap, we propose a novel metric for
evaluating style consistency and introduce a fusion-based generation approach
grounded in this metric. Our method employs a two-stage process: it first
assesses the style of candidate responses and then optimizes and integrates
them at the instance level through a fusion process. This enables the
generation of high-quality responses while significantly reducing stylistic
variation between instances. Experimental results across multiple datasets
demonstrate that our approach consistently outperforms baselines in both
response quality and stylistic uniformity.
[COMMENTS]
Accepted at Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.01053v2
[DATE]
2025-09-11 01:01:54+08:00
[CATEGORIES]
cs.CL
Speaking at the Right Level: Literacy-Controlled Counterspeech Generation with RAG-RL
[AUTHORS]
Xiaoying Song, Anirban Saha Anik, Dibakar Barua, Pengcheng Luo, Junhua Ding, Lingzi Hong
[ABSTRACT]
Health misinformation spreading online poses a significant threat to public
health. Researchers have explored methods for automatically generating
counterspeech to health misinformation as a mitigation strategy. Existing
approaches often produce uniform responses, ignoring that the health literacy
level of the audience could affect the accessibility and effectiveness of
counterspeech. We propose a Controlled-Literacy framework using
retrieval-augmented generation (RAG) with reinforcement learning (RL) to
generate tailored counterspeech adapted to different health literacy levels. In
particular, we retrieve knowledge aligned with specific health literacy levels,
enabling accessible and factual information to support generation. We design a
reward function incorporating subjective user preferences and objective
readability-based rewards to optimize counterspeech to the target health
literacy level. Experiment results show that Controlled-Literacy outperforms
baselines by generating more accessible and user-preferred counterspeech. This
research contributes to more equitable and impactful public health
communication by improving the accessibility and comprehension of counterspeech
to health misinformation.
[COMMENTS]
Accepted at Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.01058v2
[DATE]
2025-09-11 00:52:35+08:00
[CATEGORIES]
cs.CL
AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
[AUTHORS]
Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
[ABSTRACT]
Developing autonomous LLM agents capable of making a series of intelligent
decisions to solve complex, real-world tasks is a fast-evolving frontier. Like
human cognitive development, agents are expected to acquire knowledge and
skills through exploration and interaction with the environment. Despite
advances, the community still lacks a unified, interactive reinforcement
learning (RL) framework that can effectively train such agents from scratch –
without relying on supervised fine-tuning (SFT) – across diverse and realistic
environments. To bridge this gap, we introduce AgentGym-RL, a new framework to
train LLM agents for multi-turn interactive decision-making through RL. The
framework features a modular and decoupled architecture, ensuring high
flexibility and extensibility. It encompasses a wide variety of real-world
scenarios, and supports mainstream RL algorithms. Furthermore, we propose
ScalingInter-RL, a training approach designed for exploration-exploitation
balance and stable RL optimization. In early stages, it emphasizes exploitation
by restricting the number of interactions, and gradually shifts towards
exploration with larger horizons to encourage diverse problem-solving
strategies. In this way, the agent develops more diverse behaviors and is less
prone to collapse under long horizons. We perform extensive experiments to
validate the stability and effectiveness of both the AgentGym-RL framework and
the ScalingInter-RL approach. Our agents match or surpass commercial models on
27 tasks across diverse environments. We offer key insights and will
open-source the complete AgentGym-RL framework – including code and datasets
– to empower the research community in developing the next generation of
intelligent agents.
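
The horizon schedule described above (few interactions early in training, more later) can be sketched as follows. This is only an illustration under stated assumptions: the function name `max_turns_at`, the step-wise shape of the schedule, and all numeric defaults are hypothetical and are not taken from the abstract.

```python
def max_turns_at(step: int, start: int = 5, increment: int = 5,
                 stage_len: int = 100, cap: int = 30) -> int:
    """Return the per-rollout interaction budget at a given training step.

    Hypothetical step-wise schedule: the budget starts small and grows by
    `increment` every `stage_len` training steps, up to `cap`.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    stage = step // stage_len          # index of the current training stage
    return min(start + stage * increment, cap)


# Example: budgets at a few training steps (values follow from the defaults).
budgets = [max_turns_at(s) for s in (0, 99, 100, 250, 10_000)]
```

With these assumed defaults, rollouts are truncated at 5 turns during the first stage and at 30 turns once the cap is reached.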
[COMMENTS]
preprint, 39 pages, 16 figures. Project:
https://AgentGym-RL.github.io/. Framework and Code:
https://github.com/woooodyy/AgentGym, https://github.com/woooodyy/AgentGym-RL
[LINK]
http://arxiv.org/abs/2509.08755v1
[DATE]
2025-09-11 00:46:11+08:00
[CATEGORIES]
cs.LG
cs.CL
MPO: Boosting LLM Agents with Meta Plan Optimization
[AUTHORS]
Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang, Sujian Li
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2503.02682v2
[DATE]
2025-09-11 00:45:42+08:00
[CATEGORIES]
cs.CL
cs.LG
Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling
[AUTHORS]
Neil Zeghidour, Eugene Kharitonov, Manu Orsini, Václav Volhejn, Gabriel de Marmiesse, Edouard Grave, Patrick Pérez, Laurent Mazaré, Alexandre Défossez
[ABSTRACT]
We introduce Delayed Streams Modeling (DSM), a flexible formulation for
streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence
generation is often cast in an offline manner, where the model consumes the
complete input sequence before generating the first output timestep.
Alternatively, streaming sequence-to-sequence models rely on learning a policy for
choosing when to advance on the input stream, or write to the output stream.
DSM instead models already time-aligned streams with a decoder-only language
model. By moving the alignment to a pre-processing step, and introducing
appropriate delays between streams, DSM provides streaming inference of
arbitrary output sequences, from any input combination, making it applicable to
many sequence-to-sequence problems. In particular, given text and audio
streams, automatic speech recognition (ASR) corresponds to the text stream
being delayed, while the opposite gives a text-to-speech (TTS) model. We
perform extensive experiments for these two major sequence-to-sequence tasks,
showing that DSM provides state-of-the-art performance and latency while
supporting arbitrarily long sequences, even being competitive with offline
baselines. Code, samples and demos are available at
https://github.com/kyutai-labs/delayed-streams-modeling
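
The idea of delaying one time-aligned stream relative to another can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function name `apply_delay`, the `<pad>` symbol, and the use of string placeholders instead of real audio or text tokens.

```python
def apply_delay(audio, text, text_delay, pad="<pad>"):
    """Pair two time-aligned streams step by step, with the text stream delayed.

    A positive `text_delay` means that at each step the text stream lags the
    audio stream, which is the ASR-like arrangement described above. Both
    streams are padded so that no element is dropped.
    """
    if text_delay < 0:
        raise ValueError("text_delay must be non-negative")
    n = max(len(audio), len(text) + text_delay)
    padded_audio = list(audio) + [pad] * (n - len(audio))
    padded_text = [pad] * text_delay + list(text)
    padded_text += [pad] * (n - len(padded_text))
    return list(zip(padded_audio, padded_text))


# Example: three aligned frames, text delayed by two steps.
steps = apply_delay(["a0", "a1", "a2"], ["t0", "t1", "t2"], text_delay=2)
```

Swapping the roles of the two streams (delaying audio instead of text) gives the TTS-like arrangement mentioned in the abstract.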
[LINK]
http://arxiv.org/abs/2509.08753v1
[DATE]
2025-09-11 00:43:01+08:00
[CATEGORIES]
cs.CL
GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning
[AUTHORS]
Chenglong Wang, Yongyu Mu, Hang Zhou, Yifu Huo, Ziming Zhu, Jiali Zeng, Murun Yang, Bei Li, Tong Xiao, Xiaoyang Hao, Chunliang Zhang, Fandong Meng, Jingbo Zhu
[ABSTRACT]
Significant progress in reward modeling over recent years has been driven by
a paradigm shift from task-specific designs towards generalist reward models.
Despite this trend, developing effective reward models remains a fundamental
challenge: the heavy reliance on large-scale labeled preference data.
Pre-training on abundant unlabeled data offers a promising direction, but
existing approaches fall short of instilling explicit reasoning into reward
models. To bridge this gap, we propose a self-training approach that leverages
unlabeled data to elicit reward reasoning in reward models. Based on this
approach, we develop GRAM-R$^2$, a generative reward model trained to produce
not only preference labels but also accompanying reward rationales. GRAM-R$^2$
can serve as a foundation model for reward reasoning and can be applied to a
wide range of tasks with minimal or no additional fine-tuning. It can support
downstream applications such as response ranking and task-specific reward
tuning. Experiments on response ranking, task adaptation, and reinforcement
learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers
strong performance, outperforming several strong discriminative and generative
baselines.
[LINK]
http://arxiv.org/abs/2509.02492v2
[DATE]
2025-09-11 00:37:27+08:00
[CATEGORIES]
cs.CL
cs.LG
X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
[AUTHORS]
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
[ABSTRACT]
Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one
structured prompt, but prior work relied on a handful of manually written
templates. We present X-Teaming Evolutionary M2S, an automated framework that
discovers and optimizes M2S templates through language-model-guided evolution.
The system pairs smart sampling from 12 sources with an LLM-as-judge inspired
by StrongREJECT and records fully auditable logs.
Maintaining selection pressure by setting the success threshold to $\theta =
0.70$, we obtain five evolutionary generations, two new template families, and
44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of
2,500 trials (judge fixed) shows that structural gains transfer but vary by
target; two models score zero at the same threshold. We also find a positive
coupling between prompt length and score, motivating length-aware judging.
Our results demonstrate that structure-level search is a reproducible route
to stronger single-turn probes and underscore the importance of threshold
calibration and cross-model evaluation. Code, configurations, and artifacts are
available at https://github.com/hyunjun1121/M2S-x-teaming.
[LINK]
http://arxiv.org/abs/2509.08729v1
[DATE]
2025-09-11 00:17:44+08:00
[CATEGORIES]
cs.CL
What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets
[AUTHORS]
Meghan Wilkinson, Robert H Thomson
[ABSTRACT]
Supervised machine learning techniques rely on labeled data to achieve high
task performance, but this requires the labels to capture some meaningful
differences in the underlying data structure. For training network intrusion
detection algorithms, most datasets contain a series of attack classes and a
single large benign class which captures all non-attack network traffic. A
review of intrusion detection papers and guides that explicitly state their
data preprocessing steps identified that the majority took the labeled
categories of the dataset at face value when training their algorithms. The
present paper evaluates the structure of benign traffic in several common
intrusion detection datasets (NSL-KDD, UNSW-NB15, and CIC-IDS 2017) and
determines whether there are meaningful sub-categories within this traffic
which may improve overall multi-classification performance using common machine
learning techniques. We present an overview of some unsupervised clustering
techniques (e.g., HDBSCAN, Mean Shift Clustering) and show how they
differentially cluster the benign traffic space.
[COMMENTS]
10 pages; accepted to SBP-BRiMS 2025 Poster Session
[LINK]
http://arxiv.org/abs/2509.09564v1
[DATE]
2025-09-11 23:55:21+08:00
[CATEGORIES]
cs.LG
Average Causal Effect Estimation in DAGs with Hidden Variables: Beyond Back-Door and Front-Door Criteria
[AUTHORS]
Anna Guo, Razieh Nabi
[ABSTRACT]
The identification theory for causal effects in directed acyclic graphs
(DAGs) with hidden variables is well established, but methods for estimating
and inferring functionals that extend beyond the g-formula remain
underdeveloped. Previous studies have introduced semiparametric estimators for
such functionals in a broad class of DAGs with hidden variables. While these
estimators exhibit desirable statistical properties such as double robustness
in certain cases, they also face significant limitations. Notably, they
encounter substantial computational challenges, particularly involving density
estimation and numerical integration for continuous variables, and their
estimates may fall outside the parameter space of the target estimand.
Additionally, the asymptotic properties of these estimators are underexplored,
especially when integrating flexible statistical and machine learning models
for nuisance functional estimations. This paper addresses these challenges by
introducing novel one-step corrected plug-in and targeted minimum loss-based
estimators of causal effects for a class of hidden variable DAGs that go beyond
classical back-door and front-door criteria (known as the treatment primal
fixability criterion in prior literature). These estimators leverage
data-adaptive machine learning algorithms to minimize modeling assumptions
while ensuring key statistical properties including double robustness,
efficiency, boundedness within the target parameter space, and asymptotic
linearity under $L^2(P)$-rate conditions for nuisance functional estimates that
yield root-n consistent causal effect estimates. To ensure our estimation
methods are accessible in practice, we provide the flexCausal package in R.
[LINK]
http://arxiv.org/abs/2409.03962v2
[DATE]
2025-09-11 23:52:22+08:00
[CATEGORIES]
cs.LG
Boosting Embodied AI Agents through Perception-Generation Disaggregation and Asynchronous Pipeline Execution
[AUTHORS]
Shulai Zhang, Ao Xu, Quan Chen, Han Zhao, Weihao Cui, Ningxin Zheng, Haibin Lin, Xin Liu, Minyi Guo
[ABSTRACT]
Embodied AI systems operate in dynamic environments, requiring seamless
integration of perception and generation modules to process high-frequency
input and output demands. Traditional sequential computation patterns, while
effective in ensuring accuracy, face significant limitations in achieving the
necessary “thinking” frequency for real-world applications. In this work, we
present Auras, an algorithm-system co-designed inference framework to optimize
the inference frequency of embodied AI agents. Auras disaggregates the
perception and generation and provides controlled pipeline parallelism for them
to achieve high and stable throughput. Faced with the data staleness problem
that appears when the parallelism is increased, Auras establishes a public
context for perception and generation to share, thereby preserving the accuracy
of embodied agents. Experimental results show that Auras improves throughput by
2.54x on average while achieving 102.7% of the original accuracy, demonstrating
its efficacy in overcoming the constraints of sequential computation and
providing high throughput.
[LINK]
http://arxiv.org/abs/2509.09560v1
[DATE]
2025-09-11 23:51:43+08:00
[CATEGORIES]
cs.LG
Variance-Aware Noisy Training: Hardening DNNs against Unstable Analog Computations
[AUTHORS]
Xiao Wang, Hendrik Borras, Bernhard Klein, Holger Fröning
[ABSTRACT]
The disparity between the computational demands of deep learning and the
capabilities of compute hardware is expanding drastically. Although deep
learning achieves remarkable performance in countless tasks, its escalating
requirements for computational power and energy consumption surpass the
sustainable limits of even specialized neural processing units, including the
Apple Neural Engine and NVIDIA TensorCores. This challenge is intensified by
the slowdown in CMOS scaling.
Analog computing presents a promising alternative, offering substantial
improvements in energy efficiency by directly manipulating physical quantities
such as current, voltage, charge, or photons. However, it is inherently
vulnerable to manufacturing variations, nonlinearities, and noise, leading to
degraded prediction accuracy. One of the most effective techniques for
enhancing robustness, Noisy Training, introduces noise during the training
phase to reinforce the model against disturbances encountered during inference.
Although highly effective, its performance degrades in real-world environments
where noise characteristics fluctuate due to external factors such as
temperature variations and temporal drift.
This study underscores the necessity of Noisy Training while revealing its
fundamental limitations in the presence of dynamic noise. To address these
challenges, we propose Variance-Aware Noisy Training, a novel approach that
mitigates performance degradation by incorporating noise schedules which
emulate the evolving noise conditions encountered during inference. Our method
substantially improves model robustness, without training overhead. We
demonstrate a significant increase in robustness, from 79.3% with conventional
Noisy Training to 97.6% with Variance-Aware Noisy Training on CIFAR-10 and
from 32.4% to 99.7% on Tiny ImageNet.
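
As a rough illustration of training with a varying noise level rather than a fixed one, the sketch below samples a fresh noise standard deviation at each step and injects Gaussian noise into a list of activations. The abstract does not specify the actual schedule, so the uniform sampling, the range bounds, and the names `sample_noise_std` and `add_noise` are all assumptions.

```python
import random


def sample_noise_std(rng: random.Random, low: float = 0.05, high: float = 0.25) -> float:
    """Draw a noise standard deviation for one training step (assumed uniform range)."""
    return rng.uniform(low, high)


def add_noise(activations, std: float, rng: random.Random):
    """Return a copy of `activations` with zero-mean Gaussian noise of scale `std` added."""
    return [a + rng.gauss(0.0, std) for a in activations]


# Example: one noisy forward pass with a step-specific noise level.
rng = random.Random(0)
std = sample_noise_std(rng)
noisy = add_noise([0.5, -1.0, 2.0], std, rng)
```

Conventional Noisy Training would correspond to keeping `std` constant across steps; here it changes from step to step to mimic fluctuating noise conditions.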
[LINK]
http://arxiv.org/abs/2503.16183v2
[DATE]
2025-09-11 23:35:11+08:00
[CATEGORIES]
cs.LG
DeepVoting: Learning and Fine-Tuning Voting Rules with Canonical Embeddings
[AUTHORS]
Leonardo Matone, Ben Abramowitz, Ben Armstrong, Avinash Balakrishnan, Nicholas Mattei
[ABSTRACT]
Aggregating agent preferences into a collective decision is an important step
in many problems (e.g., hiring, elections, peer review) and across areas of
computer science (e.g., reinforcement learning, recommender systems). As Social
Choice Theory has shown, the problem of designing aggregation rules with
specific sets of properties (axioms) can be difficult, or provably impossible
in some cases. Instead of designing algorithms by hand, one can learn
aggregation rules, particularly voting rules, from data. However, prior work in
this area has required extremely large models or been limited by the choice of
preference representation, i.e., embedding. We recast the problem of designing
voting rules with desirable properties into one of learning probabilistic
functions that output distributions over a set of candidates. Specifically, we
use neural networks to learn probabilistic social choice functions. Using
standard embeddings from the social choice literature we show that preference
profile encoding has a significant impact on the efficiency and ability of neural
networks to learn rules, allowing us to learn rules faster and with smaller
networks than previous work. Moreover, we show that our learned rules can be
fine-tuned using axiomatic properties to create novel voting rules and make
them resistant to specific types of “attack”. Namely, we fine-tune rules to
resist a probabilistic version of the No Show Paradox.
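
To make "a probabilistic function that outputs a distribution over candidates" concrete, the sketch below embeds a preference profile as a pairwise-comparison matrix and turns Borda scores into a distribution with a softmax. This is a hand-written stand-in, not the learned neural rule from the paper; the function names and the `temperature` parameter are assumptions.

```python
import math


def pairwise_matrix(profile, m):
    """M[a][b] = number of voters who rank candidate a above candidate b."""
    matrix = [[0] * m for _ in range(m)]
    for ranking in profile:
        position = {cand: i for i, cand in enumerate(ranking)}
        for a in range(m):
            for b in range(m):
                if a != b and position[a] < position[b]:
                    matrix[a][b] += 1
    return matrix


def softmax_borda(profile, m, temperature=1.0):
    """Map a profile to a probability distribution over the m candidates.

    The Borda score of a candidate equals its row sum in the pairwise matrix.
    """
    scores = [sum(row) for row in pairwise_matrix(profile, m)]
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Example: three voters ranking candidates 0, 1, 2 (best first).
profile = [[0, 1, 2], [0, 2, 1], [1, 0, 2]]
dist = softmax_borda(profile, m=3)
```

A learned rule would replace `softmax_borda` with a network that takes an embedding such as `pairwise_matrix(profile, m)` as input and is trained to output a comparable distribution.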
[LINK]
http://arxiv.org/abs/2408.13630v2
[DATE]
2025-09-11 23:32:16+08:00
[CATEGORIES]
cs.LG
ProDiGy: Proximity- and Dissimilarity-Based Byzantine-Robust Federated Learning
[AUTHORS]
Sena Ergisi, Luis Maßny, Rawad Bitar
[ABSTRACT]
Federated Learning (FL) has emerged as a widely studied paradigm for distributed
learning. Despite its many advantages, FL remains vulnerable to adversarial
attacks, especially under data heterogeneity. We propose a new Byzantine-robust
FL algorithm called ProDiGy. The key novelty lies in evaluating the client
gradients using a joint dual scoring system based on the gradients’ proximity
and dissimilarity. We demonstrate through extensive numerical experiments that
ProDiGy outperforms existing defenses in various scenarios. In particular, when
the clients’ data do not follow an IID distribution, while other defense
mechanisms fail, ProDiGy maintains strong defense capabilities and model
accuracy. These findings highlight the effectiveness of a dual perspective
approach that promotes natural similarity among honest clients while detecting
suspicious uniformity as a potential indicator of an attack.
[LINK]
http://arxiv.org/abs/2509.09534v1
[DATE]
2025-09-11 23:25:59+08:00
[CATEGORIES]
cs.LG
Development and Comparative Evaluation of Three Artificial Intelligence Models (NLP, LLM, JEPA) for Predicting Triage in Emergency Departments: A 7-Month Retrospective Proof-of-Concept
[AUTHORS]
Edouard Lansiaux, Ramy Azzouz, Emmanuel Chazard, Amélie Vromant, Eric Wiel
[ABSTRACT]
Emergency departments struggle with persistent triage errors, especially
undertriage and overtriage, which are aggravated by growing patient volumes and
staff shortages. This study evaluated three AI models [TRIAGEMASTER (NLP),
URGENTIAPARSE (LLM), and EMERGINET (JEPA)] against the FRENCH triage scale and
nurse practice, using seven months of adult triage data from Roger Salengro
Hospital in Lille, France. Among the models, the LLM-based URGENTIAPARSE
consistently outperformed both AI alternatives and nurse triage, achieving the
highest accuracy (F1-score 0.900, AUC-ROC 0.879) and superior performance in
predicting hospitalization needs (GEMSA). Its robustness across structured data
and raw transcripts highlighted the advantage of LLM architectures in
abstracting patient information. Overall, the findings suggest that integrating
LLM-based AI into emergency department workflows could significantly enhance
patient safety and operational efficiency, though successful adoption will
depend on addressing limitations and ensuring ethical transparency.
[COMMENTS]
13 pages, 7 figures, 3 tables
[LINK]
http://arxiv.org/abs/2507.01080v2
[DATE]
2025-09-11 23:20:56+08:00
[CATEGORIES]
cs.LG
Extended Neural Contractive Dynamical Systems: On Multiple Tasks and Riemannian Safety Regions
[AUTHORS]
Hadi Beik Mohammadi, Søren Hauberg, Georgios Arvanitidis, Gerhard Neumann, Leonel Rozo
[ABSTRACT]
Stability guarantees are crucial when ensuring that a fully autonomous robot
does not take undesirable or potentially harmful actions. We recently proposed
the Neural Contractive Dynamical Systems (NCDS), which is a neural network
architecture that guarantees contractive stability. With this,
learning-from-demonstrations approaches can trivially provide stability
guarantees. However, our early work left several unanswered questions, which we
here address. Beyond providing an in-depth explanation of NCDS, this paper
extends the framework with more careful regularization, a conditional variant
of the framework for handling multiple tasks, and an uncertainty-driven
approach to latent obstacle avoidance. Experiments verify that the developed
system has the flexibility of ordinary neural networks while providing the
stability guarantees needed for autonomous robotics.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2401.09352
[LINK]
http://arxiv.org/abs/2411.11405v3
[DATE]
2025-09-11 23:11:27+08:00
[CATEGORIES]
cs.LG
Explainable AI for Accelerated Microstructure Imaging: A SHAP-Guided Protocol on the Connectome 2.0 scanner
[AUTHORS]
Quentin Uhl, Tommaso Pavan, Julianna Gerold, Kwok-Shing Chan, Yohan Jun, Shohei Fujita, Aneri Bhatt, Yixin Ma, Qiaochu Wang, Hong-Hsi Lee, Susie Y. Huang, Berkin Bilgic, Ileana Jelescu
[ABSTRACT]
The diffusion MRI Neurite Exchange Imaging model offers a promising framework
for probing gray matter microstructure by estimating parameters such as
compartment sizes, diffusivities, and inter-compartmental water exchange time.
However, existing protocols require long scan times. This study proposes a
reduced acquisition scheme for the Connectome 2.0 scanner that preserves model
accuracy while substantially shortening scan duration. We developed a
data-driven framework using explainable artificial intelligence with a guided
recursive feature elimination strategy to identify an optimal 8-feature subset
from a 15-feature protocol. The performance of this optimized protocol was
validated in vivo and benchmarked against the full acquisition and alternative
reduction strategies. Parameter accuracy, preservation of anatomical contrast,
and test-retest reproducibility were assessed. The reduced protocol yielded
parameter estimates and cortical maps comparable to the full protocol, with low
estimation errors in synthetic data and minimal impact on test-retest
variability. Compared to theory-driven and heuristic reduction schemes, the
optimized protocol demonstrated superior robustness, reducing the deviation in
water exchange time estimates by over two-fold. In conclusion, this hybrid
optimization framework enables viable imaging of neurite exchange in 14 minutes
without loss of parameter fidelity. This approach supports the broader
application of exchange-sensitive diffusion magnetic resonance imaging in
neuroscience and clinical research, and offers a generalizable method for
designing efficient acquisition protocols in biophysical parameter mapping.
[COMMENTS]
Submitted to IEEE Transactions on Medical Imaging (TMI). This
all-in-one version includes supplementary materials. 18 pages, 14 figures, 2
tables
[LINK]
http://arxiv.org/abs/2509.09513v1
[DATE]
2025-09-11 22:53:26+08:00
[CATEGORIES]
cs.LG
PIPES: A Meta-dataset of Machine Learning Pipelines
[AUTHORS]
Cynthia Moreira Maia, Lucas B. V. de Amorim, George D. C. Cavalcanti, Rafael M. O. Cruz
[ABSTRACT]
Solutions to the Algorithm Selection Problem (ASP) in machine learning face
the challenge of high computational costs associated with evaluating various
algorithms’ performances on a given dataset. To mitigate this cost, the
meta-learning field can leverage previously executed experiments shared in
online repositories such as OpenML. OpenML provides an extensive collection of
machine learning experiments. However, an analysis of OpenML’s records reveals
limitations. It lacks diversity in pipelines, specifically when exploring data
preprocessing steps/blocks, such as scaling or imputation, resulting in limited
representation. Its experiments are often focused on a few popular techniques
within each pipeline block, leading to an imbalanced sample. To overcome the
observed limitations of OpenML, we propose PIPES, a collection of experiments
involving multiple pipelines designed to represent all combinations of the
selected sets of techniques, aiming at diversity and completeness. PIPES stores
the results of experiments performed applying 9,408 pipelines to 300 datasets.
It includes detailed information on the pipeline blocks, training and testing
times, predictions, performances, and the eventual error messages. This
comprehensive collection of results allows researchers to perform analyses
across diverse and representative pipelines and datasets. PIPES also offers
potential for expansion, as additional data and experiments can be incorporated
to support the meta-learning community further. The data, code, supplementary
material, and all experiments can be found at
https://github.com/cynthiamaia/PIPES.git.
[LINK]
http://arxiv.org/abs/2509.09512v1
[DATE]
2025-09-11 22:52:58+08:00
[CATEGORIES]
cs.LG
LLMs for sensory-motor control: Combining in-context and iterative learning
[AUTHORS]
Jônata Tyska Carvalho, Stefano Nolfi
[ABSTRACT]
We propose a method that enables large language models (LLMs) to control
embodied agents by directly mapping continuous observation vectors to
continuous action vectors. At the outset, the LLMs generate a control strategy
based on a textual description of the agent, its environment, and the intended
goal. This strategy is then iteratively refined through a learning process in
which the LLMs are repeatedly prompted to improve the current strategy, using
performance feedback and sensory-motor data collected during its evaluation.
The method is validated on classic control tasks from the Gymnasium library and
the inverted pendulum task from the MuJoCo library. The approach proves
effective with relatively compact models such as Gpt-oss:120b and Qwen2.5:72b.
In most cases, it successfully identifies optimal or near-optimal solutions by
integrating symbolic knowledge derived through reasoning with sub-symbolic
sensory-motor data gathered as the agent interacts with its environment.
[COMMENTS]
Article updated with results from gpt-oss:120b. 24 pages (13 pages
are from appendix), 6 figures, code for experiments replication and
supplementary material provided at
https://github.com/jtyska/llm-robotics-article/
[LINK]
http://arxiv.org/abs/2506.04867v2
[DATE]
2025-09-11 22:52:08+08:00
[CATEGORIES]
cs.LG
Learning functions through Diffusion Maps
[AUTHORS]
Alvaro Almeida Gomez
[ABSTRACT]
We propose a data-driven method for approximating real-valued functions on
smooth manifolds, building on the Diffusion Maps framework under the manifold
hypothesis. Given pointwise evaluations of a function, the method constructs a
smooth extension to the ambient space by exploiting diffusion geometry and its
connection to the heat equation and the Laplace-Beltrami operator.
To address the computational challenges of high-dimensional data, we
introduce a dimensionality reduction strategy based on the low-rank structure
of the distance matrix, revealed via singular value decomposition (SVD). In
addition, we develop an online updating mechanism that enables efficient
incorporation of new data, thereby improving scalability and reducing
computational cost.
Numerical experiments, including applications to sparse CT reconstruction,
demonstrate that the proposed methodology outperforms classical feedforward
neural networks and interpolation methods in terms of both accuracy and
efficiency.
[COMMENTS]
Comments are welcome
[LINK]
http://arxiv.org/abs/2509.03758v2
[DATE]
2025-09-11 22:50:33+08:00
[CATEGORIES]
cs.LG
Euclidean Distance Deflation Under High-Dimensional Heteroskedastic Noise
[AUTHORS]
Keyi Li, Yuval Kluger, Boris Landa
[ABSTRACT]
Pairwise Euclidean distance calculation is a fundamental step in many machine
learning and data analysis algorithms. In real-world applications, however,
these distances are frequently distorted by heteroskedastic
noise, a prevalent form of inhomogeneous corruption
characterized by variable noise magnitudes across data observations. Such noise
inflates the computed distances in a nontrivial way, leading to
misrepresentations of the underlying data geometry. In this work, we address
the tasks of estimating the noise magnitudes per observation and correcting the
pairwise Euclidean distances under heteroskedastic noise. Perhaps surprisingly,
we show that in general high-dimensional settings and without assuming prior
knowledge on the clean data structure or noise distribution, both tasks can be
performed reliably, even when the noise levels vary considerably. Specifically,
we develop a principled, hyperparameter-free approach that jointly estimates
the noise magnitudes and corrects the distances. We provide theoretical
guarantees for our approach, establishing probabilistic bounds on the
estimation errors of both noise magnitudes and distances. These bounds,
measured in the normalized $\ell_1$ norm, converge to zero at polynomial rates
as both feature dimension and dataset size increase. Experiments on synthetic
datasets demonstrate that our method accurately estimates distances in
challenging regimes, significantly improving the robustness of subsequent
distance-based computations. Notably, when applied to single-cell RNA
sequencing data, our method yields noise magnitude estimates consistent with an
established prototypical model, enabling accurate nearest neighbor
identification that is fundamental to many downstream analyses.
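The inflation effect can be illustrated with a small simulation. For independent zero-mean noise with expected squared norms r_i and r_j, the expected noisy squared distance equals the clean squared distance plus r_i + r_j. The sketch below assumes the noise magnitudes are known, whereas the paper estimates them jointly with the distances; the function names, the Gaussian noise model, and the clipping at zero are illustrative assumptions.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def deflate_squared_distance(noisy_sq_dist, r_i, r_j):
    """Subtract the expected noise inflation r_i + r_j (clipped at zero)."""
    return max(noisy_sq_dist - r_i - r_j, 0.0)


def simulate(dim=2000, sigma_i=0.5, sigma_j=1.5, seed=0):
    """Compare clean, noisy, and corrected squared distances for one pair."""
    rng = random.Random(seed)
    x_i = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    x_j = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Heteroskedastic noise: each observation has its own noise level.
    y_i = [a + rng.gauss(0.0, sigma_i) for a in x_i]
    y_j = [a + rng.gauss(0.0, sigma_j) for a in x_j]
    clean = squared_distance(x_i, x_j)
    noisy = squared_distance(y_i, y_j)
    # Assumption: noise magnitudes are known exactly in this toy setting.
    r_i, r_j = dim * sigma_i ** 2, dim * sigma_j ** 2
    corrected = deflate_squared_distance(noisy, r_i, r_j)
    return clean, noisy, corrected


clean, noisy, corrected = simulate()
```

In high dimension the fluctuation around the expectation is small relative to the inflation itself, which is why the corrected value lands much closer to the clean distance than the noisy one does.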
[LINK]
http://arxiv.org/abs/2507.18520v2
[DATE]
2025-09-11 22:50:14+08:00
[CATEGORIES]
cs.LG
Bridging Simplicity and Sophistication using GLinear: A Novel Architecture for Enhanced Time Series Prediction
[AUTHORS]
Syed Tahir Hussain Rizvi, Neel Kanwal, Muddasar Naeem
[ABSTRACT]
Time Series Forecasting (TSF) is an important application across many fields.
There is a debate about whether Transformers, despite being good at
understanding long sequences, struggle with preserving temporal relationships
in time series data. Recent research suggests that simpler linear models might
outperform or at least provide competitive performance compared to complex
Transformer-based models for TSF tasks. In this paper, we propose a novel
data-efficient architecture, Gaussian-activated Linear model (GLinear), for
multivariate TSF that exploits periodic patterns to provide
better accuracy. It achieves higher prediction accuracy while requiring less
historical data than other state-of-the-art linear predictors. Four different
datasets (ETTh1, Electricity, Traffic, and Weather) are used to evaluate the
performance of the proposed predictor. A performance comparison with
state-of-the-art linear architectures (such as NLinear, DLinear, and RLinear)
and transformer-based time series predictors (Autoformer) shows that
GLinear, despite being data efficient, outperforms the existing architectures
in most cases of multivariate TSF while being competitive in others. We hope
that the proposed GLinear model opens new fronts of research and development of
simpler and more sophisticated architectures for data and computationally
efficient time-series analysis. The source code is publicly available on
GitHub.
[COMMENTS]
Submitted to Digital Signal Processing Journal
[LINK]
http://arxiv.org/abs/2501.01087v4
[DATE]
2025-09-11 22:43:09+08:00
[CATEGORIES]
cs.LG
Asynchronous Gossip Algorithms for Rank-Based Statistical Methods
[AUTHORS]
Anna Van Elst, Igor Colin, Stephan Clémençon
[ABSTRACT]
As decentralized AI and edge intelligence become increasingly prevalent,
ensuring robustness and trustworthiness in such distributed settings has become
a critical issue, especially in the presence of corrupted or adversarial data.
Traditional decentralized algorithms are vulnerable to data contamination as
they typically rely on simple statistics (e.g., means or sums), motivating the
need for more robust statistics. In line with recent work on decentralized
estimation of trimmed means and ranks, we develop gossip algorithms for
computing a broad class of rank-based statistics, including L-statistics and
rank statistics, both known for their robustness to outliers. We apply our
method to perform robust distributed two-sample hypothesis testing, introducing
the first gossip algorithm for Wilcoxon rank-sum tests. We provide rigorous
convergence guarantees, including the first convergence rate bound for
asynchronous gossip-based rank estimation. We empirically validate our
theoretical results through experiments on diverse network topologies.
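As background for the vulnerability described above, here is a minimal sketch of standard asynchronous pairwise gossip averaging, the kind of simple-statistic protocol that a single corrupted value can skew. It is not the rank-based algorithm proposed in the paper; the function name, the ring topology, and the example values are assumptions for illustration.

```python
import random


def gossip_average(values, edges, num_steps, seed=0):
    """Asynchronous pairwise gossip for the network mean.

    At each step one random edge wakes up and its two endpoints replace
    their current estimates by the average of the two.
    """
    rng = random.Random(seed)
    estimates = list(values)
    for _ in range(num_steps):
        i, j = rng.choice(edges)
        mean_ij = (estimates[i] + estimates[j]) / 2.0
        estimates[i] = estimates[j] = mean_ij
    return estimates


# Ring of six nodes: node k talks only to node (k + 1) mod 6.
ring = [(k, (k + 1) % 6) for k in range(6)]
clean_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
corrupted_values = [1.0, 2.0, 3.0, 4.0, 5.0, 600.0]  # one contaminated node

clean_estimates = gossip_average(clean_values, ring, num_steps=5000)
corrupted_estimates = gossip_average(corrupted_values, ring, num_steps=5000)
```

Every node converges to the global mean, so one outlier moves all estimates from 3.5 to 102.5. Rank-based statistics such as trimmed means are designed to limit exactly this kind of influence.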
[LINK]
http://arxiv.org/abs/2509.07543v2
[DATE]
2025-09-11 22:39:19+08:00
[CATEGORIES]
cs.LG
OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection
[AUTHORS]
Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany
[ABSTRACT]
Deepfakes, synthetic media created using advanced AI techniques, have
intensified the spread of misinformation, particularly in politically sensitive
contexts. Existing deepfake detection datasets are often limited, relying on
outdated generation methods, low realism, or single-face imagery, restricting
the effectiveness for general synthetic image detection. By analyzing social
media posts, we identify multiple modalities through which deepfakes propagate
misinformation. Furthermore, our human perception study demonstrates that
recently developed proprietary models produce synthetic images increasingly
indistinguishable from real ones, complicating accurate identification by the
general public. Consequently, we present a comprehensive, politically-focused
dataset specifically crafted for benchmarking detection against modern
generative models. This dataset contains three million real images paired with
descriptive captions, which are used for generating 963k corresponding
high-quality synthetic images from a mix of proprietary and open-source models.
Recognizing the continual evolution of generative techniques, we introduce an
innovative crowdsourced adversarial platform, where participants are
incentivized to generate and submit challenging synthetic images. This ongoing
community-driven initiative ensures that deepfake detection methods remain
robust and adaptive, proactively safeguarding public discourse from
sophisticated misinformation threats.
[COMMENTS]
25 pages, 12 figures
[LINK]
http://arxiv.org/abs/2509.09495v1
[DATE]
2025-09-11 22:34:22+08:00
[CATEGORIES]
cs.LG
Revisiting Non-Acyclic GFlowNets in Discrete Environments
[AUTHORS]
Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov
[ABSTRACT]
Generative Flow Networks (GFlowNets) are a family of generative models that
learn to sample objects from a given probability distribution, potentially
known up to a normalizing constant. Instead of working in the object space,
GFlowNets proceed by sampling trajectories in an appropriately constructed
directed acyclic graph environment, greatly relying on the acyclicity of the
graph. In our paper, we revisit the theory that relaxes the acyclicity
assumption and present a simpler theoretical framework for non-acyclic
GFlowNets in discrete environments. Moreover, we provide various novel
theoretical insights related to training with fixed backward policies, the
nature of flow functions, and connections between entropy-regularized RL and
non-acyclic GFlowNets, which naturally generalize the respective concepts and
theoretical results from the acyclic setting. In addition, we experimentally
re-examine the concept of loss stability in non-acyclic GFlowNet training, as
well as validate our own theoretical findings.
[COMMENTS]
ICML 2025; minor corrections in proofs of Proposition 3.6 and 3.8 in
v3, all results remain unchanged
[LINK]
http://arxiv.org/abs/2502.07735v3
[DATE]
2025-09-11 22:32:19+08:00
[CATEGORIES]
cs.LG
The Information Dynamics of Generative Diffusion
[AUTHORS]
Luca Ambrogioni
[ABSTRACT]
Generative diffusion models have emerged as a powerful class of models in
machine learning, yet a unified theoretical understanding of their operation is
still developing. This paper provides an integrated perspective on generative
diffusion by connecting their dynamic, information-theoretic, and thermodynamic
properties under a unified mathematical framework. We demonstrate that the rate
of conditional entropy production during generation (i.e. the generative
bandwidth) is directly governed by the expected divergence of the score
function’s vector field. This divergence, in turn, is linked to the branching
of trajectories and generative bifurcations, which we characterize as
symmetry-breaking phase transitions in the energy landscape. This synthesis
offers a powerful insight: the process of generation is fundamentally driven by
the controlled, noise-induced breaking of (approximate) symmetries, where peaks
in information transfer correspond to critical transitions between possible
outcomes. The score function acts as a dynamic non-linear filter that regulates
the bandwidth of the noise by suppressing fluctuations that are incompatible
with the data.
[LINK]
http://arxiv.org/abs/2508.19897v3
[DATE]
2025-09-11 22:30:28+08:00
[CATEGORIES]
cs.LG
Meta-Learning Reinforcement Learning for Crypto-Return Prediction
[AUTHORS]
Junqiao Wang, Zhaoyang Guan, Guanyu Liu, Tianze Xia, Xianzhi Li, Shuo Yin, Xinyuan Song, Chuhan Cheng, Tianyu Shi, Alex Lee
[ABSTRACT]
Predicting cryptocurrency returns is notoriously difficult: price movements
are driven by a fast-shifting blend of on-chain activity, news flow, and social
sentiment, while labeled training data are scarce and expensive. In this paper,
we present Meta-RL-Crypto, a transformer-based architecture that combines
meta-learning and reinforcement learning (RL) to create a fully
self-improving trading agent. Starting from a vanilla instruction-tuned LLM,
the agent iteratively alternates between three roles (actor, judge, and
meta-judge) in a closed-loop architecture. This learning process requires no
additional human supervision. It can leverage multimodal market inputs and
internal preference feedback. The agent in the system continuously refines both
the trading policy and evaluation criteria. Experiments across diverse market
regimes demonstrate that Meta-RL-Crypto performs well on the technical
indicators of the real market and outperforms other LLM-based baselines.
[LINK]
http://arxiv.org/abs/2509.09751v1
[DATE]
2025-09-11 22:20:45+08:00
[CATEGORIES]
cs.LG
Database Views as Explanations for Relational Deep Learning
[AUTHORS]
Agapi Rissaki, Ilias Fountalis, Wolfgang Gatterbauer, Benny Kimelfeld
[ABSTRACT]
In recent years, there has been significant progress in the development of
deep learning models over relational databases, including architectures based
on heterogeneous graph neural networks (hetero-GNNs) and heterogeneous graph
transformers. In effect, such architectures state how the database records and
links (e.g., foreign-key references) translate into a large, complex numerical
expression, involving numerous learnable parameters. This complexity makes it
hard to explain, in human-understandable terms, how a model uses the available
data to arrive at a given prediction. We present a novel framework for
explaining machine-learning models over relational databases, where
explanations are view definitions that highlight focused parts of the database
that mostly contribute to the model’s prediction. We establish such global
abductive explanations by adapting the classic notion of determinacy by Nash,
Segoufin, and Vianu (2010). In addition to tuning the tradeoff between
determinacy and conciseness, the framework allows controlling the level of
granularity by adopting different fragments of view definitions, such as ones
highlighting whole columns, foreign keys between tables, relevant groups of
tuples, and so on. We investigate the realization of the framework in the case
of hetero-GNNs. We develop heuristic algorithms that avoid the exhaustive
search over the space of all databases. We propose techniques that are
model-agnostic, and others that are tailored to hetero-GNNs via the notion of
learnable masking. Our approach is evaluated through an extensive empirical
study on the RelBench collection, covering a variety of domains and different
record-level tasks. The results demonstrate the usefulness of the proposed
explanations, as well as the efficiency of their generation.
[LINK]
http://arxiv.org/abs/2509.09482v1
[DATE]
2025-09-11 22:11:48+08:00
[CATEGORIES]
cs.LG
CountTRuCoLa: Rule Confidence Learning for Temporal Knowledge Graph Forecasting
[AUTHORS]
Julia Gastinger, Christian Meilicke, Heiner Stuckenschmidt
[ABSTRACT]
We address the task of temporal knowledge graph (TKG) forecasting by
introducing a fully explainable method based on temporal rules. Motivated by
recent work proposing a strong baseline using recurrent facts, our approach
learns four simple types of rules with a confidence function that considers
both recency and frequency. Evaluated on nine datasets, our method matches or
surpasses the performance of eight state-of-the-art models and two baselines,
while providing fully interpretable predictions.
[LINK]
http://arxiv.org/abs/2509.09474v1
[DATE]
2025-09-11 21:56:21+08:00
[CATEGORIES]
cs.LG
AEGIS: An Agent for Extraction and Geographic Identification in Scholarly Proceedings
[AUTHORS]
Om Vishesh, Harshad Khadilkar, Deepak Akkil
[ABSTRACT]
Keeping pace with the rapid growth of academic literature presents a
significant challenge for researchers, funding bodies, and academic societies.
To address the time-consuming manual effort required for scholarly discovery,
we present a novel, fully automated system that transitions from data discovery
to direct action. Our pipeline demonstrates how a specialized AI agent,
‘Agent-E’, can be tasked with identifying papers from specific geographic
regions within conference proceedings and then executing a Robotic Process
Automation (RPA) to complete a predefined action, such as submitting a
nomination form. We validated our system on 586 papers from five different
conferences, where it successfully identified every target paper with a recall
of 100% and a near-perfect accuracy of 99.4%. This demonstration highlights the
potential of task-oriented AI agents to not only filter information but also to
actively participate in and accelerate the workflows of the academic community.
[COMMENTS]
5 pages, 2 figures
[LINK]
http://arxiv.org/abs/2509.09470v1
[DATE]
2025-09-11 21:52:52+08:00
[CATEGORIES]
cs.LG
Physics consistent machine learning framework for inverse modeling with applications to ICF capsule implosions
[AUTHORS]
Daniel A. Serino, Evan Bell, Marc Klasky, Ben S. Southworth, Balasubramanya Nadiga, Trevor Wilcox, Oleg Korobkin
[ABSTRACT]
In high energy density physics (HEDP) and inertial confinement fusion (ICF),
predictive modeling is complicated by uncertainty in parameters that
characterize various aspects of the modeled system, such as those
characterizing material properties, equation of state (EOS), opacities, and
initial conditions. Typically, however, these parameters are not directly
observable. What is observed instead is a time sequence of radiographic
projections using X-rays. In this work, we define a set of sparse hydrodynamic
features derived from the outgoing shock profile and outer material edge, which
can be obtained from radiographic measurements, to directly infer such
parameters. Our machine learning (ML)-based methodology involves a pipeline of
two architectures, a radiograph-to-features network (R2FNet) and a
features-to-parameters network (F2PNet), that are trained independently and
later combined to approximate a posterior distribution for the parameters from
radiographs. We show that the estimated parameters can be used in a
hydrodynamics code to obtain density fields and hydrodynamic shock and outer
edge features that are consistent with the data. Finally, we demonstrate that
features resulting from an unknown EOS model can be successfully mapped onto
parameters of a chosen analytical EOS model, implying that network predictions
are learning physics, with a degree of invariance to the underlying choice of
EOS model.
[LINK]
http://arxiv.org/abs/2412.20192v2
[DATE]
2025-09-11 21:51:55+08:00
[CATEGORIES]
cs.LG
AquaCast: Urban Water Dynamics Forecasting with Precipitation-Informed Multi-Input Transformer
[AUTHORS]
Golnoosh Abdollahinejad, Saleh Baghersalimi, Denisa-Andreea Constantinescu, Sergey Shevchik, David Atienza
[COMMENTS]
This work has been submitted to Journal of Hydrology, Elsevier, and a
preprint version is also available at SSRN 10.2139/ssrn.5399833
[LINK]
http://arxiv.org/abs/2509.09458v1
[DATE]
2025-09-11 21:42:34+08:00
[CATEGORIES]
cs.LG
Unveiling Multiple Descents in Unsupervised Autoencoders
[AUTHORS]
Kobi Rahimi, Yehonathan Refael, Tom Tirer, Ofir Lindenbaum
[ABSTRACT]
The phenomenon of double descent has challenged the traditional bias-variance
trade-off in supervised learning but remains unexplored in unsupervised
learning, with some studies arguing for its absence. In this study, we first
demonstrate analytically that double descent does not occur in linear
unsupervised autoencoders (AEs). In contrast, we show for the first time that
both double and triple descent can be observed with nonlinear AEs across
various data models and architectural designs. We examine the effects of
partial sample and feature noise and highlight the importance of bottleneck
size in influencing the double descent curve. Through extensive experiments on
both synthetic and real datasets, we uncover model-wise, epoch-wise, and
sample-wise double descent across several data types and architectures. Our
findings indicate that over-parameterized models not only improve
reconstruction but also enhance performance in downstream tasks such as anomaly
detection and domain adaptation, highlighting their practical value in complex
real-world scenarios.
[LINK]
http://arxiv.org/abs/2406.11703v3
[DATE]
2025-09-11 21:42:30+08:00
[CATEGORIES]
cs.LG
Composable Score-based Graph Diffusion Model for Multi-Conditional Molecular Generation
[AUTHORS]
Anjie Qiao, Zhen Wang, Chuan Chen, DeFu Lian, Enhong Chen
[ABSTRACT]
Controllable molecular graph generation is essential for material and drug
discovery, where generated molecules must satisfy diverse property constraints.
While recent advances in graph diffusion models have improved generation
quality, their effectiveness in multi-conditional settings remains limited due
to reliance on joint conditioning or continuous relaxations that compromise
fidelity. To address these limitations, we propose Composable Score-based Graph
Diffusion model (CSGD), the first model that extends score matching to discrete
graphs via concrete scores, enabling flexible and principled manipulation of
conditional guidance. Building on this foundation, we introduce two score-based
techniques: Composable Guidance (CoG), which allows fine-grained control over
arbitrary subsets of conditions during sampling, and Probability Calibration
(PC), which adjusts estimated transition probabilities to mitigate train-test
mismatches. Empirical results on four molecular datasets show that CSGD
achieves state-of-the-art performance, with a 15.3% average improvement in
controllability over prior methods, while maintaining high validity and
distributional fidelity. Our findings highlight the practical advantages of
score-based modeling for discrete graph generation and its capacity for
flexible, multi-property molecular design.
[LINK]
http://arxiv.org/abs/2509.09451v1
[DATE]
2025-09-11 21:37:56+08:00
[CATEGORIES]
cs.LG
RoseCDL: Robust and Scalable Convolutional Dictionary Learning for Rare-event Detection
[AUTHORS]
Jad Yehya, Mansour Benbakoura, Cédric Allain, Benoît Malezieux, Matthieu Kowalski, Thomas Moreau
[ABSTRACT]
Identifying recurring patterns and rare events in large-scale signals is a
fundamental challenge in fields such as astronomy, physical simulations, and
biomedical science. Convolutional Dictionary Learning (CDL) offers a powerful
framework for modeling local structures in signals, but its use for detecting
rare or anomalous events remains largely unexplored. In particular, CDL faces
two key challenges in this setting: high computational cost and sensitivity to
artifacts and outliers. In this paper, we introduce RoseCDL, a scalable and
robust CDL algorithm designed for unsupervised rare event detection in long
signals. RoseCDL combines stochastic windowing for efficient training on large
datasets with inline outlier detection to enhance robustness and isolate
anomalous patterns. This reframes CDL as a practical tool for event discovery
and characterization in real-world signals, extending its role beyond
traditional tasks like compression or denoising.
[LINK]
http://arxiv.org/abs/2509.07523v3
[DATE]
2025-09-11 21:35:58+08:00
[CATEGORIES]
cs.LG
Sigma Flows for Image and Data Labeling and Learning Structured Prediction
[AUTHORS]
Jonas Cassel, Bastian Boll, Stefania Petra, Peter Albers, Christoph Schnörr
[ABSTRACT]
This paper introduces the sigma flow model for the prediction of structured
labelings of data observed on Riemannian manifolds, including Euclidean image
domains as a special case. The approach combines the Laplace-Beltrami framework
for image denoising and enhancement, introduced by Sochen, Kimmel and Malladi
about 25 years ago, and the assignment flow approach introduced and studied by
the authors.
The sigma flow arises as the Riemannian gradient flow of generalized harmonic
energies and thus is governed by a nonlinear geometric PDE which determines a
harmonic map from a closed Riemannian domain manifold to a statistical
manifold, equipped with the Fisher-Rao metric from information geometry. A
specific ingredient of the sigma flow is the mutual dependency of the
Riemannian metric of the domain manifold on the evolving state. This makes the
approach amenable to machine learning in a specific way, by realizing this
dependency through a mapping with compact time-variant parametrization that can
be learned from data. Proof-of-concept experiments demonstrate the expressivity
of the sigma flow model and prediction performance.
Structural similarities to transformer network architectures and networks
generated by the geometric integration of sigma flows are pointed out, which
highlights the connection to deep learning and, conversely, may stimulate the
use of geometric design principles for structured prediction in other areas of
scientific machine learning.
[COMMENTS]
51 pages, revised experimental section
[LINK]
http://arxiv.org/abs/2408.15946v2
[DATE]
2025-09-11 21:14:43+08:00
[CATEGORIES]
cs.LG
A Comprehensive Guide to Differential Privacy: From Theory to User Expectations
[AUTHORS]
Napsu Karmitsa, Antti Airola, Tapio Pahikkala, Tinja Pitkämäki
[ABSTRACT]
The increasing availability of personal data has enabled significant advances
in fields such as machine learning, healthcare, and cybersecurity. However,
this data abundance also raises serious privacy concerns, especially in light
of powerful re-identification attacks and growing legal and ethical demands for
responsible data use. Differential privacy (DP) has emerged as a principled,
mathematically grounded framework for mitigating these risks. This review
provides a comprehensive survey of DP, covering its theoretical foundations,
practical mechanisms, and real-world applications. It explores key algorithmic
tools and domain-specific challenges - particularly in privacy-preserving
machine learning and synthetic data generation. The report also highlights
usability issues and the need for improved communication and transparency in DP
systems. Overall, the goal is to support informed adoption of DP by researchers
and practitioners navigating the evolving landscape of data privacy.
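As an illustration of the kind of practical mechanism such a survey covers, the classical Laplace mechanism can be sketched as below. This is a minimal sketch under stated assumptions, not code from the paper: the function names, the counting query, and the parameter values are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. exponential variables with rate 1/scale
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value with epsilon-DP noise for a query of given L1 sensitivity."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    return true_value + laplace_noise(scale, rng)


def private_count(records, predicate, epsilon, rng):
    """Counting query: adding or removing one record changes the count by at most 1."""
    true_count = sum(1 for r in records if predicate(r))
    return laplace_mechanism(true_count, 1.0, epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 58, 62, 19, 47]  # hypothetical data
    print(private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng))
```

The noise scale is sensitivity divided by epsilon, so a stricter privacy budget yields a noisier released value.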
[LINK]
http://arxiv.org/abs/2509.03294v2
[DATE]
2025-09-11 21:12:37+08:00
[CATEGORIES]
cs.LG
Semantic Concentration for Self-Supervised Dense Representations Learning
[AUTHORS]
Peisong Wen, Qianqian Xu, Siran Dai, Runmin Cong, Qingming Huang
[ABSTRACT]
Recent advances in image-level self-supervised learning (SSL) have made
significant progress, yet learning dense representations for patches remains
challenging. Mainstream methods encounter an over-dispersion phenomenon in which
patches from the same instance/category scatter, harming downstream performance
on dense tasks. This work reveals that image-level SSL avoids over-dispersion
by involving implicit semantic concentration. Specifically, the non-strict
spatial alignment ensures intra-instance consistency, while shared patterns,
i.e., similar parts of within-class instances in the input space, ensure
inter-image consistency. Unfortunately, these approaches are infeasible for
dense SSL due to their spatial sensitivity and complicated scene-centric data.
These observations motivate us to explore explicit semantic concentration for
dense SSL. First, to break the strict spatial alignment, we propose to distill
the patch correspondences. Facing noisy and imbalanced pseudo labels, we
propose a noise-tolerant ranking loss. The core idea is extending the Average
Precision (AP) loss to continuous targets, such that its decision-agnostic and
adaptive focusing properties prevent the student model from being misled.
Second, to discriminate the shared patterns from complicated scenes, we propose
the object-aware filter to map the output space to an object-based space.
Specifically, patches are represented by learnable prototypes of objects via
cross-attention. Last but not least, empirical studies across various tasks
soundly support the effectiveness of our method. Code is available at
https://github.com/KID-7391/CoTAP.
[LINK]
http://arxiv.org/abs/2509.09429v1
[DATE]
2025-09-11 21:12:10+08:00
[CATEGORIES]
cs.LG
Adapting Vision-Language Models for Neutrino Event Classification in High-Energy Physics
[AUTHORS]
Dikshant Sagar, Kaiwen Yu, Alejandro Yankelevich, Jianming Bian, Pierre Baldi
[ABSTRACT]
Recent advances in Large Language Models (LLMs) have demonstrated their
remarkable capacity to process and reason over structured and unstructured data
modalities beyond natural language. In this work, we explore the applications
of Vision Language Models (VLMs), specifically a fine-tuned variant of LLaMa
3.2, to the task of identifying neutrino interactions in pixelated detector
data from high-energy physics (HEP) experiments. We benchmark this model
against a state-of-the-art convolutional neural network (CNN) architecture,
similar to those used in the NOvA and DUNE experiments, which have achieved
high efficiency and purity in classifying electron and muon neutrino events.
Our evaluation considers both the classification performance and
interpretability of the model predictions. We find that VLMs can outperform
CNNs, while also providing greater flexibility in integrating auxiliary textual
or semantic information and offering more interpretable, reasoning-based
predictions. This work highlights the potential of VLMs as a general-purpose
backbone for physics event classification, due to their high performance,
interpretability, and generalizability, which opens new avenues for integrating
multimodal reasoning in experimental neutrino physics.
[LINK]
http://arxiv.org/abs/2509.08461v2
[DATE]
2025-09-11 21:03:04+08:00
[CATEGORIES]
cs.LG
On the Relationship Between Adversarial Robustness and Decision Region in Deep Neural Networks
[AUTHORS]
Seongjin Park, Haedong Jeong, Tair Djanibekov, Giyoung Jeon, Jinseok Seol, Jaesik Choi
[ABSTRACT]
In general, Deep Neural Networks (DNNs) are evaluated by the generalization
performance measured on unseen data excluded from the training phase. Along
with the development of DNNs, the generalization performance converges to the
state-of-the-art and it becomes difficult to evaluate DNNs solely based on this
metric. The robustness against adversarial attack has been used as an
additional metric to evaluate DNNs by measuring their vulnerability. However,
few studies have been performed to analyze the adversarial robustness in terms
of the geometry in DNNs. In this work, we perform an empirical study to analyze
the internal properties of DNNs that affect model robustness under adversarial
attacks. In particular, we propose the novel concept of the Populated Region
Set (PRS), where training samples are populated more frequently, to represent
the internal properties of DNNs in a practical setting. From systematic
experiments with the proposed concept, we provide empirical evidence to
validate that a low PRS ratio has a strong relationship with the adversarial
robustness of DNNs. We also devise a PRS regularizer leveraging the
characteristics of PRS to improve the adversarial robustness without
adversarial training.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2207.03400v2
[DATE]
2025-09-11 20:51:51+08:00
[CATEGORIES]
cs.LG
Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples
[AUTHORS]
Daniel Agyapong, Briana H. Beatty, Peter G. Kennedy, Toby D. Hocking
[ABSTRACT]
Co-occurrence network inference algorithms have significantly advanced our
understanding of microbiome communities. However, these algorithms typically
analyze microbial associations within samples collected from a single
environmental niche, often capturing only static snapshots rather than dynamic
microbial processes. Previous studies have commonly grouped samples from
different environmental niches together without fully considering how microbial
communities adapt their associations when faced with varying ecological
conditions. Our study addresses this limitation by explicitly investigating
both spatial and temporal dynamics of microbial communities. We analyzed
publicly available microbiome abundance data across multiple locations and time
points, to evaluate algorithm performance in predicting microbial associations
using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates
algorithms in two distinct scenarios: training and testing within the same
environmental niche (Same), and training and testing on combined data from
multiple environmental niches (All). To overcome the limitations of
conventional algorithms, we propose fuser, an algorithm that, while not
entirely new in machine learning, is novel for microbiome community network
inference. It retains subsample-specific signals while simultaneously sharing
relevant information across environments during training. Unlike standard
approaches that infer a single generalized network from combined data, fuser
generates distinct, environment-specific predictive networks. Our results
demonstrate that fuser achieves comparable predictive performance to existing
algorithms such as glmnet when evaluated within homogeneous environments
(Same), and notably reduces test error compared to baseline algorithms in
cross-environment (All) scenarios.
[LINK]
http://arxiv.org/abs/2509.09413v1
[DATE]
2025-09-11 20:51:34+08:00
[CATEGORIES]
cs.LG
Effort-aware Fairness: Incorporating a Philosophy-informed, Human-centered Notion of Effort into Algorithmic Fairness Metrics
[AUTHORS]
Tin Trung Nguyen, Jiannan Xu, Zora Che, Phuong-Anh Nguyen-Le, Rushil Dandamudi, Donald Braman, Furong Huang, Hal Daumé III, Zubin Jelveh
[ABSTRACT]
Although popularized AI fairness metrics, e.g., demographic parity, have
uncovered bias in AI-assisted decision-making outcomes, they do not consider
how much effort one has spent to get to where one is today in the input feature
space. However, the notion of effort is important in how Philosophy and humans
understand fairness. We propose a philosophy-informed approach to conceptualize
and evaluate Effort-aware Fairness (EaF), grounded in the concept of Force,
which represents the temporal trajectory of predictive features coupled with
inertia. Besides theoretical formulation, our empirical contributions include:
(1) a pre-registered human subjects experiment, which shows that for both
stages of the (individual) fairness evaluation process, people consider the
temporal trajectory of a predictive feature more than its aggregate value; (2)
pipelines to compute Effort-aware Individual/Group Fairness in the criminal
justice and personal finance contexts. Our work may enable AI model auditors to
uncover and potentially correct unfair decisions against individuals who have
spent significant efforts to improve but are still stuck with systemic
disadvantages outside their control.
[COMMENTS]
AIES 2025
[LINK]
http://arxiv.org/abs/2505.19317v4
[DATE]
2025-09-11 20:10:12+08:00
[CATEGORIES]
cs.LG
Robust Non-Linear Correlations via Polynomial Regression
[AUTHORS]
Luca Giuliani, Michele Lombardi
[ABSTRACT]
The Hirschfeld-Gebelein-Rényi (HGR) correlation coefficient is an extension
of Pearson’s correlation that is not limited to linear correlations, with
potential applications in algorithmic fairness, scientific analysis, and causal
discovery. Recently, novel algorithms to estimate HGR in a differentiable
manner have been proposed to facilitate its use as a loss regularizer in
constrained machine learning applications. However, the inherent
uncomputability of HGR requires a bias-variance trade-off, which can possibly
compromise the robustness of the proposed methods, hence raising technical
concerns if applied in real-world scenarios. We introduce a novel computational
approach for HGR that relies on user-configurable polynomial kernels, offering
greater robustness compared to previous methods and featuring a faster yet
almost equally effective restriction. Our approach provides significant
advantages in terms of robustness and determinism, making it a more reliable
option for real-world applications. Moreover, we present a brief experimental
analysis to validate the applicability of our approach within a constrained
machine learning framework, showing that its computation yields an insightful
subgradient that can serve as a loss regularizer.
[LINK]
http://arxiv.org/abs/2509.09380v1
[DATE]
2025-09-11 19:55:48+08:00
[CATEGORIES]
cs.LG
LiDAR-BIND-T: Improved and Temporally Consistent Sensor Modality Translation and Fusion for Robotic Applications
[AUTHORS]
Niels Balemans, Ali Anwar, Jan Steckel, Siegfried Mercelis
[ABSTRACT]
This paper extends LiDAR-BIND, a modular multi-modal fusion framework that
binds heterogeneous sensors (radar, sonar) to a LiDAR-defined latent space,
with mechanisms that explicitly enforce temporal consistency. We introduce
three contributions: (i) temporal embedding similarity that aligns consecutive
latent representations, (ii) a motion-aligned transformation loss that matches
displacement between predictions and ground truth LiDAR, and (iii) windowed
temporal fusion using a specialised temporal module. We further update the
model architecture to better preserve spatial structure. Evaluations on
radar/sonar-to-LiDAR translation demonstrate improved temporal and spatial
coherence, yielding lower absolute trajectory error and better occupancy map
accuracy in Cartographer-based SLAM (Simultaneous Localisation and Mapping). We
propose different metrics based on the Fréchet Video Motion Distance (FVMD)
and a correlation-peak distance metric providing practical temporal quality
indicators to evaluate SLAM performance. The proposed temporal LiDAR-BIND, or
LiDAR-BIND-T, maintains plug-and-play modality fusion while substantially
enhancing temporal stability, resulting in improved robustness and performance
for downstream SLAM.
[LINK]
http://arxiv.org/abs/2509.05728v2
[DATE]
2025-09-11 19:47:58+08:00
[CATEGORIES]
cs.LG
Representation-Aware Distributionally Robust Optimization: A Knowledge Transfer Framework
[AUTHORS]
Zitao Wang, Nian Si, Molei Liu
[ABSTRACT]
We propose REpresentation-Aware Distributionally Robust Estimation (READ), a
novel framework for Wasserstein distributionally robust learning that accounts
for predictive representations when guarding against distributional shifts.
Unlike classical approaches that treat all feature perturbations equally, READ
embeds a multidimensional alignment parameter into the transport cost, allowing
the model to differentially discourage perturbations along directions
associated with informative representations. This yields robustness to feature
variation while preserving invariant structure. Our first contribution is a
theoretical foundation: we show that seminorm regularizations for linear
regression and binary classification arise as Wasserstein distributionally
robust objectives, thereby providing tractable reformulations of READ and
unifying a broad class of regularized estimators under the DRO lens. Second, we
adopt a principled procedure for selecting the Wasserstein radius using the
techniques of robust Wasserstein profile inference. This further enables the
construction of valid, representation-aware confidence regions for model
parameters with distinct geometric features. Finally, we analyze the geometry
of READ estimators as the alignment parameters vary and propose an optimization
algorithm to estimate the projection of the global optimum onto this solution
surface. This procedure selects among equally robust estimators while optimally
constructing a representation structure. We conclude by demonstrating the
effectiveness of our framework through extensive simulations and a real-world
study, providing a powerful robust estimation method grounded in representation
learning.
[LINK]
http://arxiv.org/abs/2509.09371v1
[DATE]
2025-09-11 19:42:17+08:00
[CATEGORIES]
cs.LG
Low-degree lower bounds via almost orthonormal bases
[AUTHORS]
Alexandra Carpentier, Simone Maria Giancola, Christophe Giraud, Nicolas Verzelen
[ABSTRACT]
Low-degree polynomials have emerged as a powerful paradigm for providing
evidence of statistical-computational gaps across a variety of high-dimensional
statistical models [Wein25]. For detection problems – where the goal is to
test a planted distribution $\mathbb{P}'$ against a null distribution
$\mathbb{P}$ with independent components – the standard approach is to bound
the advantage using an $\mathbb{L}^2(\mathbb{P})$-orthonormal family of
polynomials. However, this method breaks down for estimation tasks or more
complex testing problems where $\mathbb{P}$ has some planted structures, so
that no simple $\mathbb{L}^2(\mathbb{P})$-orthogonal polynomial family is
available. To address this challenge, several technical workarounds have been
proposed [SW22,SW25], though their implementation can be delicate. In this
work, we propose a more direct proof strategy. Focusing on random graph models,
we construct a basis of polynomials that is almost orthonormal under
$\mathbb{P}$, in precisely those regimes where statistical-computational gaps
arise. This almost orthonormal basis not only yields a direct route to
establishing low-degree lower bounds, but also allows us to explicitly identify
the polynomials that optimize the low-degree criterion. This, in turn, provides
insights into the design of optimal polynomial-time algorithms. We illustrate
the effectiveness of our approach by recovering known low-degree lower bounds,
and establishing new ones for problems such as hidden subcliques, stochastic
block models, and seriation models.
[LINK]
http://arxiv.org/abs/2509.09353v1
[DATE]
2025-09-11 19:07:36+08:00
[CATEGORIES]
cs.LG
MoSE: Unveiling Structural Patterns in Graphs via Mixture of Subgraph Experts
[AUTHORS]
Junda Ye, Zhongbao Zhang, Li Sun, Siqiang Luo
[ABSTRACT]
While graph neural networks (GNNs) have achieved great success in learning
from graph-structured data, their reliance on local, pairwise message passing
restricts their ability to capture complex, high-order subgraph patterns,
leading to insufficient structural expressiveness. Recent efforts have
attempted to enhance structural expressiveness by integrating random walk
kernels into GNNs. However, these methods are inherently designed for
graph-level tasks, which limits their applicability to other downstream tasks
such as node classification. Moreover, their fixed kernel configurations hinder
the model’s flexibility in capturing diverse subgraph structures. To address
these limitations, this paper proposes a novel Mixture of Subgraph Experts
(MoSE) framework for flexible and expressive subgraph-based representation
learning across diverse graph tasks. Specifically, MoSE extracts informative
subgraphs via anonymous walks and dynamically routes them to specialized
experts based on structural semantics, enabling the model to capture diverse
subgraph patterns with improved flexibility and interpretability. We further
provide a theoretical analysis of MoSE’s expressivity within the Subgraph
Weisfeiler-Lehman (SWL) Test, proving that it is more powerful than SWL.
Extensive experiments, together with visualizations of learned subgraph
experts, demonstrate that MoSE not only outperforms competitive baselines but
also provides interpretable insights into structural patterns learned by the
model.
[COMMENTS]
16 pages, 11 figures
[LINK]
http://arxiv.org/abs/2509.09337v1
[DATE]
2025-09-11 18:45:50+08:00
[CATEGORIES]
cs.LG
Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment
[AUTHORS]
Dimitrios Anastasiou, Razvan Caramalau, Nazir Sirajudeen, Matthew Boal, Philip Edwards, Justin Collins, John Kelly, Ashwin Sridhar, Maxine Tran, Faiz Mumtaz, Nevil Pavithran, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos
[ABSTRACT]
Automated surgical skill assessment (SSA) is a central task in surgical
computer vision. Developing robust SSA models is challenging due to the
scarcity of skill annotations, which are time-consuming to produce and require
expert consensus. Few-shot learning (FSL) offers a scalable alternative
enabling model development with minimal supervision, though its success
critically depends on effective pre-training. While widely studied for several
surgical downstream tasks, pre-training has remained largely unexplored in SSA.
In this work, we formulate SSA as a few-shot task and investigate how
self-supervised pre-training strategies affect downstream few-shot SSA
performance. We annotate a publicly available robotic surgery dataset with
Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate
various pre-training sources across three few-shot settings. We quantify domain
similarity and analyze how domain gap and the inclusion of procedure-specific
data into pre-training influence transferability. Our results show that small
but domain-relevant datasets can outperform large scale, less aligned ones,
achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot
settings, respectively. Moreover, incorporating procedure-specific data into
pre-training with a domain-relevant external dataset significantly boosts
downstream performance, with an average gain of +1.22% in accuracy and +2.28%
in F1-score; however, applying the same strategy with less similar but
large-scale sources can instead lead to performance degradation. Code and
models are available at https://github.com/anastadimi/ssa-fsl.
[COMMENTS]
Accepted at MICCAI 2025 DEMI Workshop
[LINK]
http://arxiv.org/abs/2509.09327v1
[DATE]
2025-09-11 18:23:19+08:00
[CATEGORIES]
cs.LG
Model-Agnostic Open-Set Air-to-Air Visual Object Detection for Reliable UAV Perception
[AUTHORS]
Spyridon Loukovitis, Anastasios Arsenos, Vasileios Karampinis, Athanasios Voulodimos
[ABSTRACT]
Open-set detection is crucial for robust UAV autonomy in air-to-air object
detection under real-world conditions. Traditional closed-set detectors degrade
significantly under domain shifts and flight data corruption, posing risks to
safety-critical applications. We propose a novel, model-agnostic open-set
detection framework designed specifically for embedding-based detectors. The
method explicitly handles unknown object rejection while maintaining robustness
against corrupted flight data. It estimates semantic uncertainty via entropy
modeling in the embedding space and incorporates spectral normalization and
temperature scaling to enhance open-set discrimination. We validate our
approach on the challenging AOT aerial benchmark and through extensive
real-world flight tests. Comprehensive ablation studies demonstrate consistent
improvements over baseline methods, achieving up to a 10% relative AUROC gain
compared to standard YOLO-based detectors. Additionally, we show that
background rejection further strengthens robustness without compromising
detection accuracy, making our solution particularly well-suited for reliable
UAV perception in dynamic air-to-air environments.
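The idea of scoring uncertainty with entropy after temperature scaling can be sketched as follows. This is only a generic illustration: it applies Shannon entropy to a temperature-scaled softmax over class scores, which is a simplification and not the authors' embedding-space entropy model. The function names and the rejection threshold are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_unknown(logits, temperature, threshold):
    """Flag a detection as unknown when its predictive entropy exceeds a threshold."""
    return entropy(softmax_with_temperature(logits, temperature)) > threshold


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]
    ambiguous = [1.0, 0.9, 1.1]
    # Threshold of 0.5 nats is an arbitrary value chosen for this example.
    print(is_unknown(confident, temperature=1.0, threshold=0.5))
    print(is_unknown(ambiguous, temperature=1.0, threshold=0.5))
```

In practice the temperature and threshold would be tuned on held-out data; the values here are placeholders.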
[LINK]
http://arxiv.org/abs/2509.09297v1
[DATE]
2025-09-11 17:40:06+08:00
[CATEGORIES]
cs.LG
Uniform convergence for Gaussian kernel ridge regression
[AUTHORS]
Paul Dommel, Rajmadan Lakshmanan
[ABSTRACT]
This paper establishes the first polynomial convergence rates for Gaussian
kernel ridge regression (KRR) with a fixed hyperparameter in both the uniform
and the $L^{2}$-norm. The uniform convergence result closes a gap in the
theoretical understanding of KRR with the Gaussian kernel, where no such rates
were previously known. In addition, we prove a polynomial $L^{2}$-convergence
rate in the case where the Gaussian kernel’s width parameter is fixed. This
also contributes to the broader understanding of smooth kernels, for which
previously only sub-polynomial $L^{2}$-rates were known in similar settings.
Together, these results provide new theoretical justification for the use of
Gaussian KRR with fixed hyperparameters in nonparametric regression.
[COMMENTS]
The submission is being withdrawn because the authorship of the
manuscript does not comply with the publishing/authorship guidelines of our
department
[LINK]
http://arxiv.org/abs/2508.11274v2
[DATE]
2025-09-11 17:19:48+08:00
[CATEGORIES]
cs.LG
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
[AUTHORS]
Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, Jianyu Chen, Jiang Bian
[ABSTRACT]
Visual-Language-Action (VLA) models have emerged as a popular paradigm for
learning robot manipulation policies that can follow language instructions and
generalize to novel scenarios. Recent work has begun to explore the
incorporation of latent actions, an abstract representation of visual change
between two frames, into VLA pre-training. In this paper, we introduce villa-X,
a novel Visual-Language-Latent-Action (ViLLA) framework that advances latent
action modeling for learning generalizable robot manipulation policies. Our
approach improves both how latent actions are learned and how they are
incorporated into VLA pre-training. Together, these contributions enable
villa-X to achieve superior performance across simulated environments including
SIMPLER and LIBERO, as well as on two real-world robot setups including gripper
and dexterous hand manipulation. We believe the ViLLA paradigm holds
significant promise, and that our villa-X provides a strong foundation for
future research.
[COMMENTS]
Project page: https://aka.ms/villa-x
[LINK]
http://arxiv.org/abs/2507.23682v2
[DATE]
2025-09-11 17:15:53+08:00
[CATEGORIES]
cs.LG
A Vector-Quantized Foundation Model for Patient Behavior Monitoring
[AUTHORS]
Rodrigo Oliver, Josué Pérez-Sabater, Leire Paz-Arbaizar, Diego Herrero-Quevedo, Antonio Artés-Rodríguez, Alejandro Lancho, Pablo M. Olmos
[ABSTRACT]
Foundation models have achieved remarkable success across various domains,
yet their adoption in healthcare remains limited. While significant advances
have been made in medical imaging, genetic biomarkers, and time series from
electronic health records, the potential of foundation models for patient
behavior monitoring through personal digital devices remains underexplored. The
data generated by these devices are inherently heterogeneous, multisource, and
often exhibit high rates of missing data, posing unique challenges. This paper
introduces a novel foundation model based on a modified vector quantized
variational autoencoder, specifically designed to process real-world data from
smartphones and wearable devices. We leveraged the discrete latent
representation of this model to effectively perform two downstream tasks,
suicide risk assessment and emotional state prediction, on different held-out
clinical cohorts without the need for fine-tuning. We also highlight the
existence of a trade-off between discrete and continuous latent structures,
suggesting that hybrid models may be optimal for balancing accuracy across
various supervised and unsupervised tasks.
[COMMENTS]
10 pages (32 with references and supplementary material). Submitted
to Elsevier’s journal on Artificial Intelligence in Medicine
[LINK]
http://arxiv.org/abs/2503.15221v3
[DATE]
2025-09-11 17:08:07+08:00
[CATEGORIES]
cs.LG
Unsupervised Multi-Attention Meta Transformer for Rotating Machinery Fault Diagnosis
[AUTHORS]
Hanyang Wang, Yuxuan Yang, Hongjun Wang, Lihui Wang
[ABSTRACT]
The intelligent fault diagnosis of rotating mechanical equipment usually
requires a large amount of labeled sample data. However, in practical
industrial applications, acquiring enough data is both challenging and
expensive in terms of time and cost. Moreover, different types of rotating
mechanical equipment with distinct mechanical properties require
separate training of diagnostic models for each case. To address the challenges
of limited fault samples and the lack of generalizability in prediction models
for practical engineering applications, we propose a Multi-Attention Meta
Transformer method for few-shot unsupervised rotating machinery fault diagnosis
(MMT-FD). This framework extracts potential fault representations from
unlabeled data and demonstrates strong generalization capabilities, making it
suitable for diagnosing faults across various types of mechanical equipment.
The MMT-FD framework integrates a time-frequency domain encoder and a
meta-learning generalization model. The time-frequency domain encoder predicts
status representations generated through random augmentations in the
time-frequency domain. These enhanced data are then fed into a meta-learning
network for classification and generalization training, followed by fine-tuning
using a limited amount of labeled data. The model is iteratively optimized
using a small number of contrastive learning iterations, resulting in high
efficiency. To validate the framework, we conducted experiments on a bearing
fault dataset and rotor test bench data. The results demonstrate that the
MMT-FD model achieves 99% fault diagnosis accuracy with only 1% of labeled
sample data, exhibiting robust generalization capabilities.
[LINK]
http://arxiv.org/abs/2509.09251v1
[DATE]
2025-09-11 16:35:43+08:00
[CATEGORIES]
cs.LG
Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data
[AUTHORS]
Tim Gyger, Reinhard Furrer, Fabio Sigrist
[ABSTRACT]
Gaussian processes are flexible probabilistic regression models which are
widely used in statistics and machine learning. However, a drawback is their
limited scalability to large data sets. To alleviate this, full-scale
approximations (FSAs) combine predictive process methods and covariance
tapering, thus approximating both global and local structures. We show how
iterative methods can be used to reduce computational costs in calculating
likelihoods, gradients, and predictive distributions with FSAs. In particular,
we introduce a novel preconditioner and show theoretically and empirically that
it accelerates the conjugate gradient method’s convergence speed and mitigates
its sensitivity with respect to the FSA parameters and the eigenvalue structure
of the original covariance matrix, and we demonstrate empirically that it
outperforms a state-of-the-art pivoted Cholesky preconditioner. Furthermore, we
introduce an accurate and fast way to calculate predictive variances using
stochastic simulation and iterative methods. In addition, we show how our newly
proposed FITC preconditioner can also be used in iterative methods for Vecchia
approximations. In our experiments, it outperforms existing state-of-the-art
preconditioners for Vecchia approximations. All methods are implemented in a
free C++ software library with high-level Python and R packages.
[LINK]
http://arxiv.org/abs/2405.14492v4
[DATE]
2025-09-11 16:33:26+08:00
[CATEGORIES]
cs.LG
Global Optimization of Stochastic Black-Box Functions with Arbitrary Noise Distributions using Wilson Score Kernel Density Estimation
[AUTHORS]
Thorbjørn Mosekjær Iversen, Lars Carøe Sørensen, Simon Faarvang Mathiesen, Henrik Gordon Petersen
[ABSTRACT]
Many optimization problems in robotics involve the optimization of
time-expensive black-box functions, such as those involving complex simulations
or evaluation of real-world experiments. Furthermore, these functions are often
stochastic as repeated experiments are subject to unmeasurable disturbances.
Bayesian optimization can be used to optimize such methods in an efficient
manner by deploying a probabilistic function estimator to estimate with a given
confidence so that regions of the search space can be pruned away.
Consequently, the success of the Bayesian optimization depends on the function
estimator’s ability to provide informative confidence bounds. Existing function
estimators require many function evaluations to infer the underlying confidence
or depend on modeling of the disturbances. In this paper, it is shown that the
confidence bounds provided by the Wilson Score Kernel Density Estimator
(WS-KDE) are applicable as excellent bounds to any stochastic function with an
output confined to the closed interval [0;1] regardless of the distribution of
the output. This finding opens up the use of WS-KDE for stable global
optimization on a wider range of cost functions. The properties of WS-KDE in
the context of Bayesian optimization are demonstrated in simulation and applied
to the problem of automated trap design for vibrational part feeders.
[LINK]
http://arxiv.org/abs/2509.09238v1
[DATE]
2025-09-11 16:20:30+08:00
[CATEGORIES]
cs.LG
Temporal Query Network for Efficient Multivariate Time Series Forecasting
[AUTHORS]
Shengsheng Lin, Haojun Chen, Haijie Wu, Chunyun Qiu, Weiwei Lin
[ABSTRACT]
Sufficiently modeling the correlations among variables (aka channels) is
crucial for achieving accurate multivariate time series forecasting (MTSF). In
this paper, we propose a novel technique called Temporal Query (TQ) to more
effectively capture multivariate correlations, thereby improving model
performance in MTSF tasks. Technically, the TQ technique employs periodically
shifted learnable vectors as queries in the attention mechanism to capture
global inter-variable patterns, while the keys and values are derived from the
raw input data to encode local, sample-level correlations. Building upon the TQ
technique, we develop a simple yet efficient model named Temporal Query Network
(TQNet), which employs only a single-layer attention mechanism and a
lightweight multi-layer perceptron (MLP). Extensive experiments demonstrate
that TQNet learns more robust multivariate correlations, achieving
state-of-the-art forecasting accuracy across 12 challenging real-world
datasets. Furthermore, TQNet achieves high efficiency comparable to
linear-based methods even on high-dimensional datasets, balancing performance
and computational cost. The code is available at:
https://github.com/ACAT-SCUT/TQNet.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2505.12917v2
[DATE]
2025-09-11 16:11:41+08:00
[CATEGORIES]
cs.LG
Vejde: A Framework for Inductive Deep Reinforcement Learning Based on Factor Graph Color Refinement
[AUTHORS]
Jakob Nyberg, Pontus Johnson
[ABSTRACT]
We present and evaluate Vejde, a framework which combines data abstraction,
graph neural networks and reinforcement learning to produce inductive policy
functions for decision problems with richly structured states, such as object
classes and relations. MDP states are represented as databases of facts about
entities, and Vejde converts each state to a bipartite graph, which is mapped
to latent states through neural message passing. The factored representation of
both states and actions allows Vejde agents to handle problems of varying size
and structure. We tested Vejde agents on eight problem domains defined in RDDL,
with ten problem instances each, where policies were trained using both
supervised and reinforcement learning. To test policy generalization, we
separate problem instances into two sets, one for training and the other solely
for testing. Test results on unseen instances for the Vejde agents were
compared to MLP agents trained on each problem instance, as well as the online
planning algorithm Prost. Our results show that Vejde policies on average
generalize to the test instances without a significant loss in score.
Additionally, the inductive agents received scores on unseen test instances
that on average were close to the instance-specific MLP agents.
[LINK]
http://arxiv.org/abs/2509.09219v1
[DATE]
2025-09-11 15:51:38+08:00
[CATEGORIES]
cs.LG
Towards Robust Influence Functions with Flat Validation Minima
[AUTHORS]
Xichen Ye, Yifan Wu, Weizhong Zhang, Cheng Jin, Yifan Chen
[COMMENTS]
Accepted by ICML 2025. arXiv admin note: text overlap with
arXiv:2310.00902 by other authors
[LINK]
http://arxiv.org/abs/2505.19097v2
[DATE]
2025-09-11 15:44:56+08:00
[CATEGORIES]
cs.LG
Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning
[AUTHORS]
Somnath Hazra, Pallab Dasgupta, Soumyajit Dey
[ABSTRACT]
Constrained Reinforcement Learning (RL) aims to maximize the return while
adhering to predefined constraint limits, which represent domain-specific
safety requirements. In continuous control settings, where learning agents
govern system actions, balancing the trade-off between reward maximization and
constraint satisfaction remains a significant challenge. Policy optimization
methods often exhibit instability near constraint boundaries, resulting in
suboptimal training performance. To address this issue, we introduce a novel
approach that integrates an adaptive incentive mechanism in addition to the
reward structure to stay within the constraint bound before approaching the
constraint boundary. Building on this insight, we propose Incrementally
Penalized Proximal Policy Optimization (IP3O), a practical algorithm that
enforces a progressively increasing penalty to stabilize training dynamics.
Through empirical evaluation on benchmark environments, we demonstrate the
efficacy of IP3O compared to the performance of state-of-the-art Safe RL
algorithms. Furthermore, we provide theoretical guarantees by deriving a bound
on the worst-case error of the optimality achieved by our algorithm.
[COMMENTS]
11 pages, Accepted to the 34th International Joint Conference on
Artificial Intelligence (IJCAI) 2025, Main Track
[LINK]
http://arxiv.org/abs/2509.09208v1
[DATE]
2025-09-11 15:33:35+08:00
[CATEGORIES]
cs.LG
Structure Matters: Brain Graph Augmentation via Learnable Edge Masking for Data-efficient Psychiatric Diagnosis
[AUTHORS]
Mujie Liu, Chenze Wang, Liping Chen, Nguyen Linh Dan Le, Niharika Tewari, Ting Dang, Jiangang Ma, Feng Xia
[ABSTRACT]
The limited availability of labeled brain network data makes it challenging
to achieve accurate and interpretable psychiatric diagnoses. While
self-supervised learning (SSL) offers a promising solution, existing methods
often rely on augmentation strategies that can disrupt crucial structural
semantics in brain graphs. To address this, we propose SAM-BG, a two-stage
framework for learning brain graph representations with structural semantic
preservation. In the pre-training stage, an edge masker is trained on a small
labeled subset to capture key structural semantics. In the SSL stage, the
extracted structural priors guide a structure-aware augmentation process,
enabling the model to learn more semantically meaningful and robust
representations. Experiments on two real-world psychiatric datasets demonstrate
that SAM-BG outperforms state-of-the-art methods, particularly in small-labeled
data settings, and uncovers clinically relevant connectivity patterns that
enhance interpretability. Our code is available at
https://github.com/mjliu99/SAM-BG.
[LINK]
http://arxiv.org/abs/2509.09744v1
[DATE]
2025-09-11 15:24:39+08:00
[CATEGORIES]
cs.LG
Breaking the Statistical Similarity Trap in Extreme Convection Detection
[AUTHORS]
Md Tanveer Hossain Munim
[ABSTRACT]
Current evaluation metrics for deep learning weather models create a
“Statistical Similarity Trap”, rewarding blurry predictions while missing rare,
high-impact events. We provide quantitative evidence of this trap, showing
sophisticated baselines achieve 97.9% correlation yet 0.00 CSI for dangerous
convection detection. We introduce DART (Dual Architecture for Regression
Tasks), a framework addressing the challenge of transforming coarse atmospheric
forecasts into high-resolution satellite brightness temperature fields
optimized for extreme convection detection (below 220 K). DART employs a
dual-decoder architecture with explicit background/extreme decomposition,
physically motivated oversampling, and task-specific loss functions. We present
four key findings: (1) empirical validation of the Statistical Similarity Trap
across multiple sophisticated baselines; (2) the “IVT Paradox”: removing
Integrated Water Vapor Transport, widely regarded as essential for atmospheric
river analysis, improves extreme convection detection by 270%; (3)
architectural necessity demonstrated through operational flexibility (DART
achieves CSI = 0.273 with bias = 2.52 vs. 6.72 for baselines at equivalent
CSI); and (4) real-world validation with the August 2023 Chittagong flooding
disaster as a case study. To our knowledge, this is the first work to
systematically address this hybrid conversion-segmentation-downscaling task,
with no direct prior benchmarks identified in existing literature. Our
validation against diverse statistical and deep learning baselines sufficiently
demonstrates DART’s specialized design. The framework enables precise
operational calibration through beta-tuning, trains in under 10 minutes on
standard hardware, and integrates seamlessly with existing meteorological
workflows, demonstrating a pathway toward trustworthy AI for extreme weather
preparedness.
[COMMENTS]
43 pages, 7 figures
[LINK]
http://arxiv.org/abs/2509.09195v1
[DATE]
2025-09-11 15:10:45+08:00
[CATEGORIES]
cs.LG
Uncertainty-aware Diffusion and Reinforcement Learning for Joint Plane Localization and Anomaly Diagnosis in 3D Ultrasound
[AUTHORS]
Yuhao Huang, Yueyue Xu, Haoran Dou, Jiaxiao Deng, Xin Yang, Hongyu Zheng, Dong Ni
[ABSTRACT]
Congenital uterine anomalies (CUAs) can lead to infertility, miscarriage,
preterm birth, and an increased risk of pregnancy complications. Compared to
traditional 2D ultrasound (US), 3D US can reconstruct the coronal plane,
providing a clear visualization of the uterine morphology for assessing CUAs
accurately. In this paper, we propose an intelligent system for simultaneous
automated plane localization and CUA diagnosis. Our highlights are: 1) we
develop a denoising diffusion model with local (plane) and global (volume/text)
guidance, using an adaptive weighting strategy to optimize attention allocation
to different conditions; 2) we introduce a reinforcement learning-based
framework with unsupervised rewards to extract the key slice summary from
redundant sequences, fully integrating information across multiple planes to
reduce learning difficulty; 3) we provide text-driven uncertainty modeling for
coarse prediction, and leverage it to adjust the classification probability for
overall performance improvement. Extensive experiments on a large 3D uterine US
dataset show the efficacy of our method, in terms of plane localization and CUA
diagnosis. Code is available at https://github.com/yuhoo0302/CUA-US.
[COMMENTS]
Accepted by MICCAI 2025; 10 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.23538v2
[DATE]
2025-09-11 14:34:11+08:00
[CATEGORIES]
cs.LG
Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication
[AUTHORS]
Omar Erak, Omar Alhussein, Hatem Abou-Zeid, Mehdi Bennis
[ABSTRACT]
Large-scale transformer models have emerged as a powerful tool for semantic
communication systems, enabling edge devices to extract rich representations
for robust inference across noisy wireless channels. However, their substantial
computational demands remain a major barrier to practical deployment in
resource-constrained 6G networks. In this paper, we present a training-free
framework for adaptive token merging in pretrained vision transformers to
jointly reduce inference time and transmission resource usage. We formulate the
selection of per-layer merging proportions as a multi-objective optimization
problem to balance accuracy and computational cost. We employ Gaussian
process-based Bayesian optimization to construct a Pareto frontier of optimal
configurations, enabling flexible runtime adaptation to dynamic application
requirements and channel conditions. Extensive experiments demonstrate that our
method consistently outperforms other baselines and achieves significant
reductions in floating-point operations while maintaining competitive accuracy
across a wide range of signal-to-noise ratio (SNR) conditions. Additional
results highlight the effectiveness of adaptive policies that adjust merging
aggressiveness in response to channel quality, providing a practical mechanism
to trade off latency and semantic fidelity on demand. These findings establish
a scalable and efficient approach for deploying transformer-based semantic
communication in future edge intelligence systems.
[COMMENTS]
To appear in IEEE Globecom 2025
[LINK]
http://arxiv.org/abs/2509.09168v1
[DATE]
2025-09-11 14:05:35+08:00
[CATEGORIES]
cs.LG
HISPASpoof: A New Dataset For Spanish Speech Forensics
[AUTHORS]
Maria Risques, Kratika Bhagtani, Amit Kumar Singh Yadav, Edward J. Delp
[ABSTRACT]
Zero-shot Voice Cloning (VC) and Text-to-Speech (TTS) methods have advanced
rapidly, enabling the generation of highly realistic synthetic speech and
raising serious concerns about their misuse. While numerous detectors have been
developed for English and Chinese, Spanish, spoken by over 600 million people
worldwide, remains underrepresented in speech forensics. To address this gap, we
introduce HISPASpoof, the first large-scale Spanish dataset designed for
synthetic speech detection and attribution. It includes real speech from public
corpora across six accents and synthetic speech generated with six zero-shot
TTS systems. We evaluate five representative methods, showing that detectors
trained on English fail to generalize to Spanish, while training on HISPASpoof
substantially improves detection. We also evaluate synthetic speech attribution
performance on HISPASpoof, i.e., identifying the generation method of synthetic
speech. HISPASpoof thus provides a critical benchmark for advancing reliable
and inclusive speech forensics in Spanish.
[COMMENTS]
8 pages, 1 figure, 10 tables, being submitted to ICASSP 2026 (IEEE
International Conference on Acoustics, Speech, and Signal Processing 2026)
[LINK]
http://arxiv.org/abs/2509.09155v1
[DATE]
2025-09-11 13:29:07+08:00
[CATEGORIES]
cs.LG
CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction
[AUTHORS]
Hongzong Li, Jiahao Ma, Zhanpeng Shi, Rui Xiao, Fanming Jin, Ye-Fan Hu, Hangjun Che, Jian-Dong Huang
[ABSTRACT]
Antibody binding site prediction plays a pivotal role in computational
immunology and therapeutic antibody design. Existing sequence or structure
methods rely on single-view features and fail to identify antibody-specific
binding sites on the antigens. In this paper, we propose CAME-AB, a
novel Cross-modality Attention framework with a Mixture-of-Experts (MoE)
backbone for robust antibody binding site prediction. CAME-AB integrates five
biologically grounded modalities, including raw amino acid encodings, BLOSUM
substitution profiles, pretrained language model embeddings, structure-aware
features, and GCN-refined biochemical graphs, into a unified multimodal
representation. To enhance adaptive cross-modal reasoning, we propose an
adaptive modality fusion module that learns to dynamically weight each
modality based on its global relevance and input-specific contribution. A
Transformer encoder combined with an MoE module further promotes feature
specialization and capacity expansion. We additionally incorporate a supervised
contrastive learning objective to explicitly shape the latent space geometry,
encouraging intra-class compactness and inter-class separability. To improve
optimization stability and generalization, we apply stochastic weight averaging
during training. Extensive experiments on benchmark antibody-antigen datasets
demonstrate that CAME-AB consistently outperforms strong baselines on multiple
metrics, including Precision, Recall, F1-score, AUC-ROC, and MCC. Ablation
studies further validate the effectiveness of each architectural component and
the benefit of multimodal feature integration. The model implementation details
and the code are available at https://anonymous.4open.science/r/CAME-AB-C525.
[LINK]
http://arxiv.org/abs/2509.06465v4
[DATE]
2025-09-11 13:09:47+08:00
[CATEGORIES]
cs.LG
Video Understanding by Design: How Datasets Shape Architectures and Insights
[AUTHORS]
Lei Wang, Piotr Koniusz, Yongsheng Gao
[ABSTRACT]
Video understanding has advanced rapidly, fueled by increasingly complex
datasets and powerful architectures. Yet existing surveys largely classify
models by task or family, overlooking the structural pressures through which
datasets guide architectural evolution. This survey is the first to adopt a
dataset-driven perspective, showing how motion complexity, temporal span,
hierarchical composition, and multimodal richness impose inductive biases that
models should encode. We reinterpret milestones, from two-stream and 3D CNNs to
sequential, transformer, and multimodal foundation models, as concrete
responses to these dataset-driven pressures. Building on this synthesis, we
offer practical guidance for aligning model design with dataset invariances
while balancing scalability and task demands. By unifying datasets, inductive
biases, and architectures into a coherent framework, this survey provides both
a comprehensive retrospective and a prescriptive roadmap for advancing
general-purpose video understanding.
[COMMENTS]
Research report
[LINK]
http://arxiv.org/abs/2509.09151v1
[DATE]
2025-09-11 13:06:30+08:00
[CATEGORIES]
cs.LG
Peering Partner Recommendation for ISPs using Machine Learning
[AUTHORS]
Md Ibrahim Ibne Alam, Ankur Senapati, Anindo Mahmood, Murat Yuksel, Koushik Kar
[ABSTRACT]
Internet service providers (ISPs) need to connect with other ISPs to provide
global connectivity services to their users. To ensure global connectivity,
ISPs can either use transit service(s) or establish direct peering
relationships between themselves via Internet exchange points (IXPs). Peering
offers more room for ISP-specific optimizations and is preferred, but it often
involves a lengthy and complex process. Automating peering partner selection
can enhance efficiency in the global Internet ecosystem. We explore the use of
publicly available data on ISPs to develop a machine learning (ML) model that
can predict whether an ISP pair should peer or not. At first, we explore public
databases, e.g., PeeringDB, CAIDA, etc., to gather data on ISPs. Then, we
evaluate the performance of three broad types of ML models for predicting
peering relationships: tree-based, neural network-based, and transformer-based.
Among these, we observe that tree-based models achieve the highest accuracy and
efficiency in our experiments. The XGBoost model trained with publicly
available data showed promising performance, with a 98% accuracy rate in
predicting peering partners. In addition, the model demonstrated great
resilience to variations in time, space, and missing data. We envision that
ISPs can adopt our method to fully automate the peering partner selection
process, thus transitioning to a more efficient and optimized Internet
ecosystem.
[COMMENTS]
Submitted to IEEE Transactions on Machine Learning in Communications
and Networking
[LINK]
http://arxiv.org/abs/2509.09146v1
[DATE]
2025-09-11 12:43:31+08:00
[CATEGORIES]
cs.LG
Efficient Optimization Accelerator Framework for Multistate Ising Problems
[AUTHORS]
Chirag Garg, Sayeef Salahuddin
[ABSTRACT]
Ising Machines are emerging hardware architectures that efficiently solve
NP-Hard combinatorial optimization problems. Generally, combinatorial problems
are transformed into quadratic unconstrained binary optimization (QUBO) form,
but this transformation often complicates the solution landscape, degrading
performance, especially for multi-state problems. To address this challenge, we
model spin interactions as generalized Boolean logic functions to significantly
reduce the exploration space. We demonstrate the effectiveness of our approach
on the graph coloring problem using probabilistic Ising solvers, achieving similar
accuracy compared to state-of-the-art heuristics and machine learning
algorithms. It also shows significant improvement over state-of-the-art
QUBO-based Ising solvers, including probabilistic Ising and simulated
bifurcation machines. We also design a 1024-neuron all-to-all connected
probabilistic Ising accelerator on FPGA with the proposed approach that shows
~10000x performance acceleration compared to GPU-based Tabucol heuristics and
reduces physical neurons by 1.5-4x over baseline Ising frameworks. Thus, this
work establishes superior efficiency, scalability and solution quality for
multi-state optimization problems.
[COMMENTS]
9 pages main text, 4 main figures, 2 main tables, 3 pages supplementary,
10 supplementary figures
[LINK]
http://arxiv.org/abs/2505.20250v2
[DATE]
2025-09-11 12:19:00+08:00
[CATEGORIES]
cs.LG
Continuous-Time Value Iteration for Multi-Agent Reinforcement Learning
[AUTHORS]
Xuefeng Wang, Lei Zhang, Henglin Pu, Ahmed H. Qureshi, Husheng Li
[ABSTRACT]
Existing reinforcement learning (RL) methods struggle with complex dynamical
systems that demand interactions at high frequencies or irregular time
intervals. Continuous-time RL (CTRL) has emerged as a promising alternative by
replacing discrete-time Bellman recursion with differential value functions
defined as viscosity solutions of the Hamilton–Jacobi–Bellman (HJB) equation.
While CTRL has shown promise, its applications have been largely limited to the
single-agent domain. This limitation stems from two key challenges: (i)
conventional solution methods for HJB equations suffer from the curse of
dimensionality (CoD), making them intractable in high-dimensional systems; and
(ii) even with HJB-based learning approaches, accurately approximating
centralized value functions in multi-agent settings remains difficult, which in
turn destabilizes policy training. In this paper, we propose a CT-MARL
framework that uses physics-informed neural networks (PINNs) to approximate
HJB-based value functions at scale. To ensure the value is consistent with its
differential structure, we align value learning with value-gradient learning by
introducing a Value Gradient Iteration (VGI) module that iteratively refines
value gradients along trajectories. This improves gradient fidelity, in turn
yielding more accurate values and stronger policy learning. We evaluate our
method using continuous-time variants of standard benchmarks, including
multi-agent particle environment (MPE) and multi-agent MuJoCo. Our results
demonstrate that our approach consistently outperforms existing continuous-time
RL baselines and scales to complex multi-agent dynamics.
[COMMENTS]
19 pages, 10 figures
[LINK]
http://arxiv.org/abs/2509.09135v1
[DATE]
2025-09-11 12:12:50+08:00
[CATEGORIES]
cs.LG
EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
[AUTHORS]
Lu Chen, Yizhou Wang, Shixiang Tang, Qianhong Ma, Tong He, Wanli Ouyang, Xiaowei Zhou, Hujun Bao, Sida Peng
[ABSTRACT]
Learning an agent model that behaves like humans, capable of jointly
perceiving the environment, predicting the future, and taking actions from a
first-person perspective, is a fundamental challenge in computer vision.
Existing methods typically train separate models for these abilities, which
fail to capture their intrinsic relationships and prevent them from learning
from each other. Inspired by how humans learn through the perception-action
loop, we propose EgoAgent, a unified agent model that simultaneously learns to
represent, predict, and act within a single transformer. EgoAgent explicitly
models the causal and temporal dependencies among these abilities by
formulating the task as an interleaved sequence of states and actions. It
further introduces a joint embedding-action-prediction architecture with
temporally asymmetric predictor and observer branches, enabling synergistic
optimization across all three capabilities. Comprehensive evaluations of
EgoAgent on representative tasks such as image classification, egocentric
future state prediction, and 3D human motion prediction demonstrate the
superiority of our method. The code and trained models will be publicly
available at https://github.com/zju3dv/EgoAgent.
[COMMENTS]
Project Page: https://egoagent.github.io | Demo Video:
https://youtu.be/qhfHp_sfDvY
[LINK]
http://arxiv.org/abs/2502.05857v3
[DATE]
2025-09-11 11:59:01+08:00
[CATEGORIES]
cs.LG
Learning What Matters: Causal Time Series Modeling for Arctic Sea Ice Prediction
[AUTHORS]
Emam Hossain, Md Osman Gani
[ABSTRACT]
Conventional machine learning and deep learning models typically rely on
correlation-based learning, which often fails to distinguish genuine causal
relationships from spurious associations, limiting their robustness,
interpretability, and ability to generalize. To overcome these limitations, we
introduce a causality-aware deep learning framework that integrates
Multivariate Granger Causality (MVGC) and PCMCI+ for causal feature selection
within a hybrid neural architecture. Leveraging 43 years (1979-2021) of Arctic
Sea Ice Extent (SIE) data and associated ocean-atmospheric variables at daily
and monthly resolutions, the proposed method identifies causally influential
predictors, prioritizes direct causes of SIE dynamics, reduces unnecessary
features, and enhances computational efficiency. Experimental results show that
incorporating causal inputs leads to improved prediction accuracy and
interpretability across varying lead times. While demonstrated on Arctic SIE
forecasting, the framework is broadly applicable to other dynamic,
high-dimensional domains, offering a scalable approach that advances both the
theoretical foundations and practical performance of causality-informed
predictive modeling.
[COMMENTS]
Accepted and presented at the AI4TS Workshop @ IJCAI 2025
(non-archival)
[LINK]
http://arxiv.org/abs/2509.09128v1
[DATE]
2025-09-11 11:54:39+08:00
[CATEGORIES]
cs.LG
Securing Private Federated Learning in a Malicious Setting: A Scalable TEE-Based Approach with Client Auditing
[AUTHORS]
Shun Takagi, Satoshi Hasegawa
[ABSTRACT]
In cross-device private federated learning, differentially private
follow-the-regularized-leader (DP-FTRL) has emerged as a promising
privacy-preserving method. However, existing approaches assume a semi-honest
server and have not addressed the challenge of securely removing this
assumption. This is due to its statefulness, which becomes particularly
problematic in practical settings where clients can drop out or be corrupted.
While trusted execution environments (TEEs) might seem like an obvious
solution, a straightforward implementation can introduce forking attacks or
availability issues due to state management. To address this problem, our paper
introduces a novel server extension that acts as a trusted computing base (TCB)
to realize maliciously secure DP-FTRL. The TCB is implemented with an ephemeral
TEE module on the server side to produce verifiable proofs of server actions.
Some clients, upon being selected, participate in auditing these proofs with
small additional communication and computational demands. This extension
solution reduces the size of the TCB while maintaining the system’s scalability
and liveness. We provide formal proofs based on interactive differential
privacy, demonstrating privacy guarantees in malicious settings. Finally, we
experimentally show that our framework adds a small constant overhead to clients
in several realistic settings.
[COMMENTS]
Accepted at PoPETs 2026
[LINK]
http://arxiv.org/abs/2509.08709v2
[DATE]
2025-09-11 11:44:47+08:00
[CATEGORIES]
cs.LG
Closing the Gap between TD Learning and Supervised Learning with $Q$-Conditioned Maximization
[AUTHORS]
Xing Lei, Zifeng Zhuang, Shentao Yang, Sheng Xu, Yunhao Luo, Fei Shen, Wenyan Yang, Xuetao Zhang, Donglin Wang
[ABSTRACT]
Recently, supervised learning (SL) methodology has emerged as an effective
approach for offline reinforcement learning (RL) due to its simplicity,
stability, and efficiency. However, recent studies show that SL methods lack
the trajectory stitching capability, typically associated with temporal
difference (TD)-based approaches. A question naturally surfaces: How
can we endow SL methods with stitching capability and close their performance gap
with TD learning? To answer this question, we introduce $Q$-conditioned
maximization supervised learning for offline goal-conditioned RL, which
enhances SL with the stitching capability through $Q$-conditioned policy and
$Q$-conditioned maximization. Concretely, we propose Goal-Conditioned
Reinforced Supervised Learning (GCReinSL), which
consists of (1) estimating the $Q$-function by Normalizing Flows from the
offline dataset and (2) finding the maximum $Q$-value within the data support
by integrating $Q$-function maximization with Expectile Regression. At
inference time, our policy chooses optimal actions based on such a maximum
$Q$-value. Experimental results from stitching evaluations on offline RL
datasets demonstrate that our method outperforms prior SL approaches with
stitching capabilities and goal data augmentation techniques.
[LINK]
http://arxiv.org/abs/2506.00795v3
[DATE]
2025-09-11 11:42:40+08:00
[CATEGORIES]
cs.LG
Sensitivity-LoRA: Low-Load Sensitivity-Based Fine-Tuning for Large Language Models
[AUTHORS]
Hao Zhang, Bo Huang, Zhenjia Li, Xi Xiao, Hui Yi Leong, Zumeng Zhang, Xinwei Long, Tianyang Wang, Hao Xu
[ABSTRACT]
Large Language Models (LLMs) have transformed both everyday life and
scientific research. However, adapting LLMs from general-purpose models to
specialized tasks remains challenging, particularly in resource-constrained
environments. Low-Rank Adaptation (LoRA), a prominent method within
Parameter-Efficient Fine-Tuning (PEFT), has emerged as a promising approach to
LLMs by approximating model weight updates using low-rank decomposition.
However, LoRA is limited by its uniform rank (r) allocation to each
incremental matrix, and existing rank allocation techniques aimed at addressing
this issue remain computationally inefficient, complex, and unstable, hindering
practical applications. To address these limitations, we propose
Sensitivity-LoRA, an efficient fine-tuning method that dynamically allocates
ranks to weight matrices based on both their global and local sensitivities. It
leverages the second-order derivatives (Hessian Matrix) of the loss function to
effectively capture weight sensitivity, enabling optimal rank allocation with
minimal computational overhead. Our experimental results have demonstrated
robust effectiveness, efficiency and stability of Sensitivity-LoRA across
diverse tasks and benchmarks.
[COMMENTS]
15 pages
[LINK]
http://arxiv.org/abs/2509.09119v1
[DATE]
2025-09-11 11:07:05+08:00
[CATEGORIES]
cs.LG
Diffusion Graph Neural Networks for Robustness in Olfaction Sensors and Datasets
[AUTHORS]
Kordel K. France, Ovidiu Daescu
[ABSTRACT]
Robotic odour source localization (OSL) is a critical capability for
autonomous systems operating in complex environments. However, current OSL
methods often suffer from ambiguities, particularly when robots misattribute
odours to incorrect objects due to limitations in olfactory datasets and sensor
resolutions. To address this challenge, we introduce a novel machine learning
method using diffusion-based molecular generation to enhance odour localization
accuracy that can be used by itself or with automated olfactory dataset
construction pipelines. This generative process of our diffusion model expands
the chemical space beyond the limitations of both current olfactory datasets
and training methods, enabling the identification of potential odourant
molecules not previously documented. The generated molecules can then be more
accurately validated using advanced olfactory sensors, enabling them to detect
more compounds and inform better hardware design. By integrating visual
analysis, language processing, and molecular generation, our framework enhances
the ability of olfaction-vision models on robots to accurately associate odours
with their correct sources, thereby improving navigation and decision-making
through better sensor selection for a target compound in critical applications
such as explosives detection, narcotics screening, and search and rescue. Our
methodology represents a foundational advancement in the field of artificial
olfaction, offering a scalable solution to challenges posed by limited
olfactory data and sensor ambiguities. Code and data are made available to the
community at the following URL:
https://github.com/KordelFranceTech/OlfactionVisionLanguage-Dataset.
[LINK]
http://arxiv.org/abs/2506.00455v3
[DATE]
2025-09-11 11:02:39+08:00
[CATEGORIES]
cs.LG
Inferring entropy production in many-body systems using nonequilibrium MaxEnt
[AUTHORS]
Miguel Aguilera, Sosuke Ito, Artemy Kolchinsky
[ABSTRACT]
We propose a method for inferring entropy production (EP) in high-dimensional
stochastic systems, including many-body systems and non-Markovian systems with
long memory. Standard techniques for estimating EP become intractable in such
systems due to computational and statistical limitations. We infer
trajectory-level EP and lower bounds on average EP by exploiting a
nonequilibrium analogue of the Maximum Entropy principle, along with convex
duality. Our approach uses only samples of trajectory observables, such as
spatiotemporal correlations. It does not require reconstruction of
high-dimensional probability distributions or rate matrices, nor impose any
special assumptions such as discrete states or multipartite dynamics. In
addition, it may be used to compute a hierarchical decomposition of EP,
reflecting contributions from different interaction orders, and it has an
intuitive physical interpretation as a “thermodynamic uncertainty relation.” We
demonstrate its numerical performance on a disordered nonequilibrium spin model
with 1000 spins and a large neural spike-train dataset.
[LINK]
http://arxiv.org/abs/2505.10444v3
[DATE]
2025-09-11 10:59:05+08:00
[CATEGORIES]
cs.LG
CryptGNN: Enabling Secure Inference for Graph Neural Networks
[AUTHORS]
Pritam Sen, Yao Ma, Cristian Borcea
[ABSTRACT]
We present CryptGNN, a secure and effective inference solution for
third-party graph neural network (GNN) models in the cloud, which are accessed
by clients as ML as a service (MLaaS). The main novelty of CryptGNN is its
secure message passing and feature transformation layers using distributed
secure multi-party computation (SMPC) techniques. CryptGNN protects the
client’s input data and graph structure from the cloud provider and the
third-party model owner, and it protects the model parameters from the cloud
provider and the clients. CryptGNN works with any number of SMPC parties, does
not require a trusted server, and is provably secure even if P-1 out of P
parties in the cloud collude. Theoretical analysis and empirical experiments
demonstrate the security and efficiency of CryptGNN.
[LINK]
http://arxiv.org/abs/2509.09107v1
[DATE]
2025-09-11 10:35:33+08:00
[CATEGORIES]
cs.LG
Joint Optimization of Energy Consumption and Completion Time in Federated Learning
[AUTHORS]
Xinyu Zhou, Jun Zhao, Huimei Han, Claude Guet
[ABSTRACT]
Federated Learning (FL) is an intriguing distributed machine learning
approach due to its privacy-preserving characteristics. To balance the
trade-off between energy and execution latency, and thus accommodate different
demands and application scenarios, we formulate an optimization problem to
minimize a weighted sum of total energy consumption and completion time through
two weight parameters. The optimization variables include bandwidth,
transmission power and CPU frequency of each device in the FL system, where all
devices are linked to a base station and train a global model collaboratively.
Through decomposing the non-convex optimization problem into two subproblems,
we devise a resource allocation algorithm to determine the bandwidth
allocation, transmission power, and CPU frequency for each participating
device. We further present the convergence analysis and computational
complexity of the proposed algorithm. Numerical results show that our proposed
algorithm not only has better performance at different weight parameters (i.e.,
different demands) but also outperforms the state of the art.
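The weighted-sum objective described above can be sketched in a few lines. This is only an illustration, not the paper's formulation: the function name, the per-device energy and time inputs, and the assumption that completion time is set by the slowest device in a synchronous round are all hypothetical.

```python
def weighted_objective(energies, times, w_energy, w_time):
    """Weighted sum of total energy and completion time (illustrative).

    energies: per-device energy consumption (hypothetical units)
    times:    per-device time to finish a round (hypothetical units)
    Assumption: the round completes when the slowest device finishes.
    """
    total_energy = sum(energies)
    completion_time = max(times)
    return w_energy * total_energy + w_time * completion_time


# Two hypothetical resource allocations for the same two devices.
fast_but_hungry = weighted_objective([4.0, 5.0], [1.0, 2.0], 0.5, 0.5)
slow_but_frugal = weighted_objective([1.0, 2.0], [3.0, 5.0], 0.5, 0.5)
```

Shifting the two weights changes which allocation scores lower, which is how a single formulation can serve both energy-sensitive and latency-sensitive scenarios.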
[COMMENTS]
This paper appears in the Proceedings of IEEE International
Conference on Distributed Computing Systems (ICDCS) 2022. Please feel free to
contact us for questions or remarks
[LINK]
http://arxiv.org/abs/2209.14900v5
[DATE]
2025-09-11 09:15:16+08:00
[CATEGORIES]
cs.LG
KoopMotion: Learning Almost Divergence Free Koopman Flow Fields for Motion Planning
[AUTHORS]
Alice Kate Li, Thales C Silva, Victoria Edwards, Vijay Kumar, M. Ani Hsieh
[ABSTRACT]
In this work, we propose a novel flow field-based motion planning method that
drives a robot from any initial state to a desired reference trajectory such
that it converges to the trajectory’s end point. Despite demonstrated efficacy
in using Koopman operator theory for modeling dynamical systems, Koopman does
not inherently enforce convergence to desired trajectories nor to specified
goals – a requirement when learning from demonstrations (LfD). We present
KoopMotion, which represents motion flow fields as dynamical systems
parameterized by Koopman operators to mimic desired trajectories, and leverages
the divergence properties of the learnt flow fields to obtain smooth motion
fields that converge to a desired reference trajectory when a robot is placed
away from the desired trajectory, and then track the trajectory until the end
point. To demonstrate the effectiveness of our approach, we show evaluations of
KoopMotion on the LASA human handwriting dataset and a 3D manipulator
end-effector trajectory dataset, including spectral analysis. We also perform
experiments on a physical robot, verifying KoopMotion on a miniature autonomous
surface vehicle operating in a non-static fluid flow environment. Our approach
is highly sample efficient in both space and time, requiring only 3% of the
LASA dataset to generate dense motion plans. Additionally, KoopMotion provides
a significant improvement over baselines when comparing metrics that measure
spatial and temporal dynamics modeling efficacy.
[COMMENTS]
Accepted to CoRL 2025 (Conference on Robot Learning). 15 pages, 11
figures
[LINK]
http://arxiv.org/abs/2509.09074v1
[DATE]
2025-09-11 08:42:01+08:00
[CATEGORIES]
cs.LG
SurGBSA: Learning Representations From Molecular Dynamics Simulations
[AUTHORS]
Derek Jones, Yue Yang, Felice C. Lightstone, Niema Moshiri, Jonathan E. Allen, Tajana S. Rosing
[ABSTRACT]
Self-supervised pretraining from static structures of drug-like compounds and
proteins enables powerful learned feature representations. Learned features
demonstrate state-of-the-art performance on a range of predictive tasks
including molecular properties, structure generation, and protein-ligand
interactions. The majority of approaches are limited by their use of static
structures, and it remains an open question how best to use atomistic molecular
dynamics (MD) simulations to develop more generalized models to improve
prediction accuracy for novel molecular structures. We present SURrogate mmGBSA
(SurGBSA) as a new modeling approach for MD-based representation learning,
which learns a surrogate function of the Molecular Mechanics Generalized Born
Surface Area (MMGBSA). We show for the first time the benefits of
physics-informed pre-training to train a surrogate MMGBSA model on a collection
of over 1.4 million 3D trajectories collected from MD simulations of the
CASF-2016 benchmark. SurGBSA demonstrates a dramatic 27,927x speedup versus a
traditional physics-based single-point MMGBSA calculation while nearly matching
single-point MMGBSA accuracy on the challenging pose ranking problem for
identification of the correct top pose (-0.4% difference). Our work advances
the development of molecular foundation models by showing model improvements
when training on MD simulations. Models, code and training data are made
publicly available.
[LINK]
http://arxiv.org/abs/2509.03084v2
[DATE]
2025-09-11 07:46:01+08:00
[CATEGORIES]
cs.LG
A Scoping Review of Machine Learning Applications in Power System Protection and Disturbance Management
[AUTHORS]
Julian Oelhaf, Georg Kordowich, Mehran Pashaei, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer
[ABSTRACT]
The integration of renewable and distributed energy resources reshapes modern
power systems, challenging conventional protection schemes. This scoping review
synthesizes recent literature on machine learning (ML) applications in power
system protection and disturbance management, following the PRISMA for Scoping
Reviews framework. Based on over 100 publications, three key objectives are
addressed: (i) assessing the scope of ML research in protection tasks; (ii)
evaluating ML performance across diverse operational scenarios; and (iii)
identifying methods suitable for evolving grid conditions. ML models often
demonstrate high accuracy on simulated datasets; however, their performance
under real-world conditions remains insufficiently validated. The existing
literature is fragmented, with inconsistencies in methodological rigor, dataset
quality, and evaluation metrics. This lack of standardization hampers the
comparability of results and limits the generalizability of findings. To
address these challenges, this review introduces an ML-oriented taxonomy for
protection tasks, resolves key terminological inconsistencies, and advocates
for standardized reporting practices. It further provides guidelines for
comprehensive dataset documentation, methodological transparency, and
consistent evaluation protocols, aiming to improve reproducibility and enhance
the practical relevance of research outcomes. Critical gaps remain, including
the scarcity of real-world validation, insufficient robustness testing, and
limited consideration of deployment feasibility. Future research should
prioritize public benchmark datasets, realistic validation methods, and
advanced ML architectures. These steps are essential to move ML-based
protection from theoretical promise to practical deployment in increasingly
dynamic and decentralized power systems.
[LINK]
http://arxiv.org/abs/2509.09053v1
[DATE]
2025-09-11 07:19:28+08:00
[CATEGORIES]
cs.LG
MoWE : A Mixture of Weather Experts
[AUTHORS]
Dibyajyoti Chakraborty, Romit Maulik, Peter Harrington, Dallas Foster, Mohammad Amin Nabian, Sanjay Choudhry
[ABSTRACT]
Data-driven weather models have recently achieved state-of-the-art
performance, yet progress has plateaued in recent years. This paper introduces
a Mixture of Experts (MoWE) approach as a novel paradigm to overcome these
limitations, not by creating a new forecaster, but by optimally combining the
outputs of existing models. The MoWE model is trained with significantly lower
computational resources than the individual experts. Our model employs a Vision
Transformer-based gating network that dynamically learns to weight the
contributions of multiple “expert” models at each grid point, conditioned on
forecast lead time. This approach creates a synthesized deterministic forecast
that is more accurate than any individual component in terms of Root Mean
Squared Error (RMSE). Our results demonstrate the effectiveness of this method,
achieving up to a 10% lower RMSE than the best-performing AI weather model on a
2-day forecast horizon, significantly outperforming individual experts as well
as a simple average across experts. This work presents a computationally
efficient and scalable strategy to push the state of the art in data-driven
weather prediction by making the most out of leading high-quality forecast
models.
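The per-grid-point weighting of expert forecasts can be illustrated with a small sketch. It assumes the gating scores are turned into weights with a softmax, which the abstract does not state; the gating scores here are plain inputs rather than the output of a Vision Transformer, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_forecasts, gate_logits):
    """Blend expert forecasts with per-grid-point weights.

    expert_forecasts[k][i]: forecast of expert k at grid point i
    gate_logits[i][k]:      gating score for expert k at grid point i
    Assumption: weights are a softmax of the gating scores.
    """
    combined = []
    for i, logits in enumerate(gate_logits):
        weights = softmax(logits)
        combined.append(sum(w * f[i] for w, f in zip(weights, expert_forecasts)))
    return combined


def rmse(pred, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))
```

With equal gating scores at every grid point the blend reduces to the simple average across experts, the baseline the abstract compares against; a learned gate can deviate from that average wherever one expert is more reliable.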
[LINK]
http://arxiv.org/abs/2509.09052v1
[DATE]
2025-09-11 07:15:59+08:00
[CATEGORIES]
cs.LG
To Theoretically Understand Transformer-Based In-Context Learning for Optimizing CSMA
[AUTHORS]
Shugang Hao, Hongbo Li, Lingjie Duan
[ABSTRACT]
The binary exponential backoff scheme is widely used in WiFi 7 yet still
incurs poor throughput under dynamic channel environments. Recent
model-based approaches (e.g., non-persistent and $p$-persistent CSMA) simply
optimize backoff strategies under a known and fixed node density, still leading
to a large throughput loss due to inaccurate node density estimation. This
paper is the first to propose LLM transformer-based in-context learning (ICL)
theory for optimizing channel access. We design a transformer-based ICL
optimizer that pre-collects collision-threshold data examples and a query
collision case. These are assembled into a prompt that serves as input to the
transformer, which learns the pattern and generates a predicted contention
window threshold (CWT). To train the transformer for effective ICL, we develop
an efficient algorithm and guarantee a near-optimal CWT prediction within
limited training steps. As it may be hard to gather perfect data examples for
ICL in practice, we further extend to allow erroneous data input in the prompt.
We prove that our optimizer maintains minimal prediction and throughput
deviations from the optimal values. Experimental results on NS-3 further
demonstrate our approach’s fast convergence and near-optimal throughput over
existing model-based and DRL-based approaches under unknown node densities.
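For readers unfamiliar with the baseline, binary exponential backoff can be sketched as below. This illustrates only the generic scheme, not the paper's ICL optimizer; the function names are hypothetical, and the default window bounds (15 and 1023 slots) are illustrative values of the kind used in 802.11-style contention, assumed here rather than taken from the abstract.

```python
import random


def next_contention_window(cw, collided, cw_min=15, cw_max=1023):
    """Update the contention window after one transmission attempt.

    On a collision the window roughly doubles, capped at cw_max;
    on a success it resets to cw_min.
    """
    if collided:
        return min(2 * (cw + 1) - 1, cw_max)
    return cw_min


def draw_backoff_slots(cw, rng=random):
    """Pick a backoff counter uniformly from 0..cw inclusive."""
    return rng.randint(0, cw)
```

Because the window reacts only to the node's own collisions, it adapts slowly when the number of contending nodes changes, which is the throughput problem the abstract targets with a predicted threshold.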
[LINK]
http://arxiv.org/abs/2508.09146v4
[DATE]
2025-09-11 07:13:54+08:00
[CATEGORIES]
cs.LG
Examining Different Research Communities: Authorship Network
[AUTHORS]
Shrabani Ghosh
[ABSTRACT]
Google Scholar is one of the top search engines for accessing scholarly
articles across multiple disciplines. Google Scholar's advanced search option
allows articles to be extracted by phrase, publisher name, author name, time
range, etc. In this work, we collected
Google Scholar data (2000-2021) for two different research domains in computer
science: Data Mining and Software Engineering. The Scholar database resources
are powerful for network analysis, data mining, and identifying links between
authors via authorship networks. We examined the co-authorship network for each
domain and studied its structure. Extensive experiments were performed
to analyze publication trends and identify influential authors and
affiliated organizations for each domain. The network analysis shows that the
networks' features are distinct from one another and that the networks exhibit
small communities among the influential authors of a particular domain.
[LINK]
http://arxiv.org/abs/2409.00081v2
[DATE]
2025-09-11 06:53:38+08:00
[CATEGORIES]
cs.LG
The Role of Community Detection Methods in Performance Variations of Graph Mining Tasks
[AUTHORS]
Shrabani Ghosh, Erik Saule
[ABSTRACT]
In real-world scenarios, large graphs represent relationships among entities
in complex systems. Mining these large graphs, which often contain millions of
nodes and edges, helps uncover structural patterns and meaningful insights.
Dividing a large graph into smaller subgraphs facilitates complex system
analysis by revealing local information. Community detection extracts clusters
or communities of graphs based on statistical methods and machine learning
models using various optimization techniques. Structure-based community
detection methods are more suitable for applying to graphs because they do not
rely heavily on rich node or edge attribute information. The features derived
from these communities can improve downstream graph mining tasks, such as link
prediction and node classification. In real-world applications, we often lack
ground truth community information. Additionally, there is neither a
universally accepted gold standard for community detection nor a single method
that is consistently optimal across diverse applications. In many cases, it is
unclear how practitioners select community detection methods, and choices are
often made without explicitly considering their potential impact on downstream
tasks. In this study, we investigate whether the choice of community detection
algorithm significantly influences the performance of downstream applications.
We propose a framework capable of integrating various community detection
methods to systematically evaluate their effects on downstream task outcomes.
Our comparative analysis reveals that specific community detection algorithms
yield superior results in certain applications, highlighting that method
selection substantially affects performance.
[LINK]
http://arxiv.org/abs/2509.09045v1
[DATE]
2025-09-11 06:44:23+08:00
[CATEGORIES]
cs.LG
Missing Fine Details in Images: Last Seen in High Frequencies
[AUTHORS]
Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper
[ABSTRACT]
Latent generative models have shown remarkable progress in high-fidelity
image synthesis, typically using a two-stage training process that involves
compressing images into latent embeddings via learned tokenizers in the first
stage. The quality of generation strongly depends on how expressive and
well-optimized these latent embeddings are. While various methods have been
proposed to learn effective latent representations, generated images often lack
realism, particularly in textured regions with sharp transitions, due to loss
of fine details governed by high frequencies. We conduct a detailed frequency
decomposition of existing state-of-the-art (SOTA) latent tokenizers and show
that conventional objectives inherently prioritize low-frequency
reconstruction, often at the expense of high-frequency fidelity. Our analysis
reveals these latent tokenizers exhibit a bias toward low-frequency information
during optimization, leading to over-smoothed outputs and visual artifacts that
diminish perceptual quality. To address this, we propose a wavelet-based,
frequency-aware variational autoencoder (FA-VAE) framework that explicitly
decouples the optimization of low- and high-frequency components. This
decoupling enables improved reconstruction of fine textures while preserving
global structure. Moreover, we integrate our frequency-preserving latent
embeddings into a SOTA latent diffusion model, resulting in sharper and more
realistic image generation. Our approach bridges the fidelity gap in current
latent tokenizers and emphasizes the importance of frequency-aware optimization
for realistic image synthesis, with broader implications for applications in
content creation, neural rendering, and medical imaging.
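The idea of separating low- and high-frequency components can be illustrated with a one-level Haar transform on a 1D signal. This is a minimal sketch under stated assumptions: the abstract says only that the framework is wavelet-based, so the choice of the Haar wavelet, the averaging (unnormalized) variant, and the function names are all hypothetical.

```python
def haar_decompose(signal):
    """Split an even-length signal into low- and high-frequency halves.

    low:  pairwise averages (coarse structure)
    high: pairwise half-differences (fine detail)
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_reconstruct(low, high):
    """Invert haar_decompose exactly."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out
```

Once the two bands are separate, a reconstruction loss can be computed on each band independently, so errors in the detail band are not swamped by the typically larger low-frequency terms.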
[LINK]
http://arxiv.org/abs/2509.05441v3
[DATE]
2025-09-11 06:15:25+08:00
[CATEGORIES]
cs.LG
Generative quantum advantage for classical and quantum problems
[AUTHORS]
Hsin-Yuan Huang, Michael Broughton, Norhan Eassa, Hartmut Neven, Ryan Babbush, Jarrod R. McClean
[ABSTRACT]
Recent breakthroughs in generative machine learning, powered by massive
computational resources, have demonstrated unprecedented human-like
capabilities. While beyond-classical quantum experiments can generate samples
from classically intractable distributions, their complexity has thwarted all
efforts toward efficient learning. This challenge has hindered demonstrations
of generative quantum advantage: the ability of quantum computers to learn and
generate desired outputs substantially better than classical computers. We
resolve this challenge by introducing families of generative quantum models
that are hard to simulate classically, are efficiently trainable, exhibit no
barren plateaus or proliferating local minima, and can learn to generate
distributions beyond the reach of classical computers. Using a $68$-qubit
superconducting quantum processor, we demonstrate these capabilities in two
scenarios: learning classically intractable probability distributions and
learning quantum circuits for accelerated physical simulation. Our results
establish that both learning and sampling can be performed efficiently in the
beyond-classical regime, opening new possibilities for quantum-enhanced
generative models with provable advantage.
[LINK]
http://arxiv.org/abs/2509.09033v1
[DATE]
2025-09-11 06:06:28+08:00
[CATEGORIES]
cs.LG
Deep Context-Conditioned Anomaly Detection for Tabular Data
[AUTHORS]
Spencer King, Zhilu Zhang, Ruofan Yu, Baris Coskun, Wei Ding, Qian Cui
[ABSTRACT]
Anomaly detection is critical in domains such as cybersecurity and finance,
especially when working with large-scale tabular data. Yet, unsupervised
anomaly detection – where no labeled anomalies are available – remains a
significant challenge. Although various deep learning methods have been
proposed to model a dataset’s joint distribution, real-world tabular data often
contain heterogeneous contexts (e.g., different users), making globally rare
events normal under certain contexts. Consequently, relying on a single global
distribution can overlook these contextual nuances, degrading detection
performance. In this paper, we present a context-conditional anomaly detection
framework tailored for tabular datasets. Our approach automatically identifies
context features and models the conditional data distribution using a simple
deep autoencoder. Extensive experiments on multiple tabular benchmark datasets
demonstrate that our method outperforms state-of-the-art approaches,
underscoring the importance of context in accurately distinguishing anomalous
from normal instances.
[COMMENTS]
Submitted to WSDM 2026. 11 pages, 4 figures, 5 tables, 1 algorithm, 8
datasets, contextual anomaly detection framework for tabular data
[LINK]
http://arxiv.org/abs/2509.09030v1
[DATE]
2025-09-11 06:01:11+08:00
[CATEGORIES]
cs.LG
Deep Reinforcement Learning for Inventory Networks: Toward Reliable Policy Optimization
[AUTHORS]
Matias Alvo, Daniel Russo, Yash Kanoria, Minuk Lee
[ABSTRACT]
We argue that inventory management presents unique opportunities for the
reliable application of deep reinforcement learning (DRL). To enable this, we
emphasize and test two complementary techniques. The first is Hindsight
Differentiable Policy Optimization (HDPO), which uses pathwise gradients from
offline counterfactual simulations to directly and efficiently optimize policy
performance. Unlike standard policy gradient methods that rely on high-variance
score-function estimators, HDPO computes gradients by differentiating through
the known system dynamics. Via extensive benchmarking, we show that HDPO
recovers near-optimal policies in settings with known or bounded optima, is
more robust than variants of the REINFORCE algorithm, and significantly
outperforms generalized newsvendor heuristics on problems using real time
series data. Our second technique aligns neural policy architectures with the
topology of the inventory network. We exploit Graph Neural Networks (GNNs) as a
natural inductive bias for encoding supply chain structure, demonstrate that
they can represent optimal and near-optimal policies in two theoretical
settings, and empirically show that they reduce data requirements across six
diverse inventory problems. A key obstacle to progress in this area is the lack
of standardized benchmark problems. To address this gap, we open-source a suite
of benchmark environments, along with our full codebase, to promote
transparency and reproducibility. All resources are available at
github.com/MatiasAlvo/Neural_inventory_control.
[LINK]
http://arxiv.org/abs/2306.11246v3
[DATE]
2025-09-11 05:32:10+08:00
[CATEGORIES]
cs.LG
Crack Path Prediction with Operator Learning using Discrete Particle System data Generation
[AUTHORS]
Elham Kiyani, Venkatesh Ananchaperumal, Ahmad Peyvan, Mahendaran Uchimali, Gang Li, George Em Karniadakis
[ABSTRACT]
Accurately modeling crack propagation is critical for predicting failure in
engineering materials and structures, where small cracks can rapidly evolve and
cause catastrophic damage. The interaction of cracks with discontinuities, such
as holes, significantly affects crack deflection and arrest. Recent
developments in discrete particle systems with multibody interactions based on
constitutive behavior have demonstrated the ability to capture crack nucleation
and evolution without relying on continuum assumptions. In this work, we use
data from Constitutively Informed Particle Dynamics (CPD) simulations to train
operator learning models, specifically Deep Operator Networks (DeepONets),
which learn mappings between function spaces instead of finite-dimensional
vectors. We explore two DeepONet variants: vanilla and Fusion DeepONet, for
predicting time-evolving crack propagation in specimens with varying
geometries. Three representative cases are studied: (i) varying notch height
without active fracture; and (ii) and (iii) combinations of notch height and
hole radius where dynamic fracture occurs on irregular discrete meshes. The
models are trained using geometric inputs in the branch network and
spatial-temporal coordinates in the trunk network. Results show that Fusion
DeepONet consistently outperforms the vanilla variant, with more accurate
predictions especially in non-fracturing cases. Fracture-driven scenarios
involving displacement and crack evolution remain more challenging. These
findings highlight the potential of Fusion DeepONet to generalize across
complex, geometry-varying, and time-dependent crack propagation phenomena.
[COMMENTS]
22 pages, 14 figures
[LINK]
http://arxiv.org/abs/2506.01976v2
[DATE]
2025-09-11 05:10:20+08:00
[CATEGORIES]
cs.LG
Attribution Regularization for Multimodal Paradigms
[AUTHORS]
Sahiti Yerramilli, Jayant Sravan Tamarapalli, Jonathan Francis, Eric Nyberg
[ABSTRACT]
Multimodal machine learning has gained significant attention in recent years
due to its potential for integrating information from multiple modalities to
enhance learning and decision-making processes. However, it is commonly
observed that unimodal models outperform multimodal models, despite the latter
having access to richer information. Additionally, the influence of a single
modality often dominates the decision-making process, resulting in suboptimal
performance. This research project aims to address these challenges by
proposing a novel regularization term that encourages multimodal models to
effectively utilize information from all modalities when making decisions. The
focus of this project lies in the video-audio domain, although the proposed
regularization technique holds promise for broader applications in embodied AI
research, where multiple modalities are involved. By leveraging this
regularization term, the proposed approach aims to mitigate the issue of
unimodal dominance and improve the performance of multimodal machine learning
systems. Through extensive experimentation and evaluation, the effectiveness
and generalizability of the proposed technique will be assessed. The findings
of this research project have the potential to significantly contribute to the
advancement of multimodal machine learning and facilitate its application in
various domains, including multimedia analysis, human-computer interaction, and
embodied AI research.
[LINK]
http://arxiv.org/abs/2404.02359v3
[DATE]
2025-09-11 05:09:48+08:00
[CATEGORIES]
cs.LG
Fast attention mechanisms: a tale of parallelism
[AUTHORS]
Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu
[ABSTRACT]
Transformers have the representational capacity to simulate Massively
Parallel Computation (MPC) algorithms, but they suffer from quadratic time
complexity, which severely limits their scalability. We introduce an efficient
attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with
sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the
expressive power previously established for standard attention in terms of
matching the capabilities of MPC algorithms, and (2) can solve key reasoning
tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC
framework, we further prove that constant-depth ANNA-transformers can simulate
constant-depth low-rank transformers, thereby providing a unified way to reason
about a broad class of efficient attention approximations.
[LINK]
http://arxiv.org/abs/2509.09001v1
[DATE]
2025-09-11 04:59:44+08:00
[CATEGORIES]
cs.LG
Active Learning and Explainable AI for Multi-Objective Optimization of Spin Coated Polymers
[AUTHORS]
Brendan Young, Brendan Alvey, Andreas Werbrouck, Will Murphy, James Keller, Mattias J. Young, Matthew Maschmann
[ABSTRACT]
Spin coating polymer thin films to achieve specific mechanical properties is
inherently a multi-objective optimization problem. We present a framework that
integrates an active Pareto front learning algorithm (PyePAL) with
visualization and explainable AI techniques to optimize processing parameters.
PyePAL uses Gaussian process models to predict objective values (hardness and
elasticity) from the design variables (spin speed, dilution, and polymer
mixture), guiding the adaptive selection of samples toward promising regions of
the design space. To enable interpretable insights into the high-dimensional
design space, we utilize UMAP (Uniform Manifold Approximation and Projection)
for two-dimensional visualization of the Pareto front exploration.
Additionally, we incorporate fuzzy linguistic summaries, which translate the
learned relationships between process parameters and performance objectives
into linguistic statements, thus enhancing the explainability and understanding
of the optimization results. Experimental results demonstrate that our method
efficiently identifies promising polymer designs, while the visual and
linguistic explanations facilitate expert-driven analysis and knowledge
discovery.
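As a reminder of what a Pareto front is in this two-objective setting, the sketch below filters a set of measured designs down to the non-dominated ones. It does not reproduce PyePAL, which works with Gaussian process predictions; the function names are hypothetical, and treating both objectives as quantities to maximize is an assumption.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere.

    Assumption: every objective is to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (hardness, elasticity) measurements.
designs = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1)]
front = pareto_front(designs)
```

An active learner of the kind described spends its experimental budget on candidates that might still belong to this front, rather than sampling the design space uniformly.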
[COMMENTS]
8 pages, 7 figures, Presented at 2025 AAAI Spring Symposium Series
[LINK]
http://arxiv.org/abs/2509.08988v1
[DATE]
2025-09-11 04:35:59+08:00
[CATEGORIES]
cs.LG
Physics-informed waveform inversion using pretrained wavefield neural operators
[AUTHORS]
Xinquan Huang, Fu Wang, Tariq Alkhalifah
[ABSTRACT]
Full waveform inversion (FWI) is crucial for reconstructing high-resolution
subsurface models, but, given limited data, it is often hindered by its null
space, which results in low-resolution models, and, more importantly, by its
computational cost, especially if needed for real-time applications. Recent
attempts to accelerate FWI using learned wavefield neural operators have shown
promise in efficiency and differentiability, but typically suffer from noisy
and unstable inversion performance. To address these limitations, we introduce
a novel physics-informed FWI framework that improves inversion accuracy
while maintaining the efficiency of neural operator-based FWI. Instead of
relying only on the L2 norm objective function via automatic differentiation,
resulting in noisy model reconstruction, we integrate a physics constraint term
in the loss function of FWI, improving the quality of the inverted velocity
models. Specifically, starting with an initial model to simulate wavefields and
then evaluating the loss over how much the resulting wavefield obeys the
physical laws (wave equation) and matches the recorded data, we achieve a
reduction in noise and artifacts. Numerical experiments using the OpenFWI and
Overthrust models demonstrate our method’s superior performance, offering
cleaner and more accurate subsurface velocity than vanilla approaches.
Considering the efficiency of the approach compared to FWI, this advancement
represents a significant step forward in the practical application of FWI for
real-time subsurface monitoring.
[LINK]
http://arxiv.org/abs/2509.08967v1
[DATE]
2025-09-11 03:57:18+08:00
[CATEGORIES]
cs.LG
Value bounds and Convergence Analysis for Averages of LRP attributions
[AUTHORS]
Alexander Binder, Nastaran Takmil-Homayouni, Urun Dogan
[ABSTRACT]
We analyze numerical properties of Layer-wise relevance propagation
(LRP)-type attribution methods by representing them as a product of modified
gradient matrices. This representation creates an analogy to matrix
multiplications of Jacobian matrices, which arise from the chain rule of
differentiation. In order to shed light on the distribution of attribution
values, we derive upper bounds for singular values. Furthermore, we derive
component-wise bounds for attribution map values. As a main result, we apply
these component-wise bounds to obtain multiplicative constants. These constants
govern the convergence of empirical means of attributions to expectations of
attribution maps. This finding has important implications for scenarios where
multiple non-geometric data augmentations are applied to individual test
samples, as well as for Smoothgrad-type attribution methods. In particular, our
analysis reveals that the constants for LRP-beta remain independent of weight
norms, a significant distinction from both gradient-based methods and
LRP-epsilon.
[COMMENTS]
37 pages
[LINK]
http://arxiv.org/abs/2509.08963v1
[DATE]
2025-09-11 03:50:00+08:00
[CATEGORIES]
cs.LG
A Logic for Expressing Log-Precision Transformers
[AUTHORS]
William Merrill, Ashish Sabharwal
[ABSTRACT]
One way to interpret the reasoning power of transformer-based language models
is to describe the types of logical rules they can resolve over some input
text. Recently, Chiang et al. (2023) showed that finite-precision transformers
can be equivalently expressed in a generalization of first-order logic.
However, finite-precision transformers are a weak transformer variant because,
as we show, a single head can only attend to a constant number of tokens and,
in particular, cannot represent uniform attention. Since attending broadly is a
core capability for transformers, we ask whether a minimally more expressive
model that can attend universally can also be characterized in logic. To this
end, we analyze transformers whose forward pass is computed in $\log n$
precision on contexts of length $n$. We prove that any log-precision
transformer can be equivalently expressed as a first-order logic sentence that,
in addition to standard universal and existential quantifiers, may also contain
majority-vote quantifiers. This is the tightest known upper bound and first
logical characterization of log-precision transformers.
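As an informal illustration of the three quantifier types named above, the sketch below evaluates universal, existential, and majority quantifiers over the positions of an input string. The helper names and the strict-majority convention (more than half of the positions) are assumptions made for illustration, not definitions taken from the paper.

```python
from typing import Callable

Predicate = Callable[[int], bool]  # a property of a position i in the input


def forall(n: int, phi: Predicate) -> bool:
    """Universal quantifier: phi holds at every position 0..n-1."""
    return all(phi(i) for i in range(n))


def exists(n: int, phi: Predicate) -> bool:
    """Existential quantifier: phi holds at some position."""
    return any(phi(i) for i in range(n))


def majority(n: int, phi: Predicate) -> bool:
    """Majority quantifier (assumed convention: strictly more than half)."""
    return 2 * sum(1 for i in range(n) if phi(i)) > n


def mostly_a(text: str) -> bool:
    """Hypothetical sentence 'M i. a(i)': most positions hold the token 'a'."""
    return majority(len(text), lambda i: text[i] == "a")
```

Counting how many positions satisfy a predicate is exactly what the majority quantifier adds beyond plain first-order logic; a tie, as in "abab", fails under the strict convention assumed here.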
[COMMENTS]
May 24, 2023: Restructured version of old preprint. Oct 12, 2023: To
appear at NeurIPS. Sept 10, 2025: minor technical corrections
[LINK]
http://arxiv.org/abs/2210.02671v7
[DATE]
2025-09-11 03:48:04+08:00
[CATEGORIES]
cs.LG
Convexity of Optimization Curves: Local Sharp Thresholds, Robustness Impossibility, and New Counterexamples
[AUTHORS]
Le Duc Hieu
[ABSTRACT]
We study when the \emph{optimization curve} of first-order methods – the
sequence $\{f(x_n)\}_{n\ge0}$ produced by constant-stepsize iterations – is
convex, equivalently when the forward differences $f(x_n)-f(x_{n+1})$ are
nonincreasing. For gradient descent (GD) on convex $L$-smooth functions, the
curve is convex for all stepsizes $\eta \le 1.75/L$, and this threshold is
tight. Moreover, gradient norms are nonincreasing for all $\eta \le 2/L$, and
in continuous time (gradient flow) the curve is always convex. These results
complement and refine the classical smooth convex optimization toolbox,
connecting discrete and continuous dynamics as well as worst-case analyses.
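The convexity criterion can be checked numerically on a toy problem. The sketch below runs gradient descent with stepsize $\eta = 1.75/L$ on a diagonal quadratic and tests whether the forward differences are nonincreasing. The quadratic, the starting point, and the function names are assumptions chosen for illustration; a single quadratic cannot demonstrate that the threshold is tight.

```python
def gd_curve(eigs, x0, eta, steps):
    """Run GD on f(x) = 0.5 * sum(eigs[i] * x[i]**2); return [f(x_0), ..., f(x_steps)]."""
    def f(x):
        return 0.5 * sum(lam * xi * xi for lam, xi in zip(eigs, x))

    x = list(x0)
    values = [f(x)]
    for _ in range(steps):
        # The gradient of f is (eigs[i] * x[i])_i, so each coordinate shrinks independently.
        x = [xi - eta * lam * xi for lam, xi in zip(eigs, x)]
        values.append(f(x))
    return values


def is_convex_curve(values, tol=1e-12):
    """True if the forward differences f(x_n) - f(x_{n+1}) are nonincreasing."""
    diffs = [a - b for a, b in zip(values, values[1:])]
    return all(later <= earlier + tol for earlier, later in zip(diffs, diffs[1:]))


eigs = [0.1, 1.0, 4.0]          # smoothness constant L = max(eigs) = 4
L = max(eigs)
curve = gd_curve(eigs, [1.0, -2.0, 0.5], eta=1.75 / L, steps=50)
print(is_convex_curve(curve))
```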
[LINK]
http://arxiv.org/abs/2509.08954v1
[DATE]
2025-09-11 03:28:47+08:00
[CATEGORIES]
cs.LG
Deploying AI for Signal Processing education: Selected challenges and intriguing opportunities
[AUTHORS]
Jarvis Haupt, Qin Lu, Yanning Shen, Jia Chen, Yue Dong, Dan McCreary, Mehmet Akçakaya, Georgios B. Giannakis
[ABSTRACT]
Powerful artificial intelligence (AI) tools that have emerged in recent years
– including large language models, automated coding assistants, and advanced
image and speech generation technologies – are the result of monumental human
achievements. These breakthroughs reflect mastery across multiple technical
disciplines and the resolution of significant technological challenges.
However, some of the most profound challenges may still lie ahead. These
challenges are not purely technical but pertain to the fair and responsible use
of AI in ways that genuinely improve the global human condition. This article
explores one promising application aligned with that vision: the use of AI
tools to facilitate and enhance education, with a specific focus on signal
processing (SP). It presents two interrelated perspectives: identifying and
addressing technical limitations, and applying AI tools in practice to improve
educational experiences. Primers are provided on several core technical issues
that arise when using AI in educational settings, including how to ensure
fairness and inclusivity, handle hallucinated outputs, and achieve efficient
use of resources. These and other considerations – such as transparency,
explainability, and trustworthiness – are illustrated through the development
of an immersive, structured, and reliable “smart textbook.” The article serves
as a resource for researchers and educators seeking to advance AI’s role in
engineering education.
[COMMENTS]
Accepted to the IEEE Signal Processing Magazine Special Issue on
Artificial Intelligence for Education: A Signal Processing Perspective
[LINK]
http://arxiv.org/abs/2509.08950v1
[DATE]
2025-09-11 03:19:26+08:00
[CATEGORIES]
cs.LG
Capability-Aware Shared Hypernetworks for Flexible Heterogeneous Multi-Robot Coordination
[AUTHORS]
Kevin Fu, Shalin Anand Jain, Pierce Howell, Harish Ravichandar
[ABSTRACT]
Recent advances have enabled heterogeneous multi-robot teams to learn complex
and effective coordination skills. However, existing neural architectures that
support heterogeneous teaming tend to force a trade-off between expressivity
and efficiency. Shared-parameter designs prioritize sample efficiency by
enabling a single network to be shared across all or a pre-specified subset of
robots (via input augmentations), but tend to limit behavioral diversity. In
contrast, recent designs employ a separate policy for each robot, enabling
greater diversity and expressivity at the cost of efficiency and
generalization. Our key insight is that such tradeoffs can be avoided by
viewing these design choices as ends of a broad spectrum. Inspired by recent
work in transfer and meta learning, and building on prior work in multi-robot
task allocation, we propose Capability-Aware Shared Hypernetworks (CASH), a
soft weight sharing architecture that uses hypernetworks to efficiently learn a
flexible shared policy that dynamically adapts to each robot post-training. By
explicitly encoding the impact of robot capabilities (e.g., speed and payload)
on collective behavior, CASH enables zero-shot generalization to unseen robots
or team compositions. Our experiments involve multiple heterogeneous tasks,
three learning paradigms (imitation learning, value-based, and policy-gradient
RL), and SOTA multi-robot simulation (JaxMARL) and hardware (Robotarium)
platforms. Across all conditions, we find that CASH generates
appropriately-diverse behaviors and consistently outperforms baseline
architectures in terms of performance and sample efficiency during both
training and zero-shot generalization, all with 60%-80% fewer learnable
parameters.
[COMMENTS]
22 pages, 8 figures, equal authorship between Kevin Fu and Shalin
Anand Jain. Manuscript accepted for publication at the 9th Conference on Robot
Learning (CoRL 2025), Seoul, Korea
[LINK]
http://arxiv.org/abs/2501.06058v5
[DATE]
2025-09-11 03:13:14+08:00
[CATEGORIES]
cs.LG
ACE: A Security Architecture for LLM-Integrated App Systems
[AUTHORS]
Evan Li, Tushin Mallick, Evan Rose, William Robertson, Alina Oprea, Cristina Nita-Rotaru
[ABSTRACT]
LLM-integrated app systems extend the utility of Large Language Models (LLMs)
with third-party apps that are invoked by a system LLM using interleaved
planning and execution phases to answer user queries. These systems introduce
new attack vectors where malicious apps can cause integrity violation of
planning or execution, availability breakdown, or privacy compromise during
execution.
In this work, we identify new attacks impacting the integrity of planning, as
well as the integrity and availability of execution in LLM-integrated apps, and
demonstrate them against IsolateGPT, a recent solution designed to mitigate
attacks from malicious apps. We propose Abstract-Concrete-Execute (ACE), a new
secure architecture for LLM-integrated app systems that provides security
guarantees for system planning and execution. Specifically, ACE decouples
planning into two phases by first creating an abstract execution plan using
only trusted information, and then mapping the abstract plan to a concrete plan
using installed system apps. We verify that the plans generated by our system
satisfy user-specified secure information flow constraints via static analysis
on the structured plan output. During execution, ACE enforces data and
capability barriers between apps, and ensures that the execution is conducted
according to the trusted abstract plan. We show experimentally that ACE is
secure against attacks from the InjecAgent and Agent Security Bench benchmarks
for indirect prompt injection, and our newly introduced attacks. We also
evaluate the utility of ACE in realistic environments, using the Tool Usage
suite from the LangChain benchmark. Our architecture represents a significant
advancement towards hardening LLM-based systems using system security
principles.
[COMMENTS]
25 pages, 13 figures, 8 tables; accepted by Network and Distributed
System Security Symposium (NDSS) 2026
[LINK]
http://arxiv.org/abs/2504.20984v3
[DATE]
2025-09-11 03:03:48+08:00
[CATEGORIES]
cs.LG
Corruption-Tolerant Asynchronous Q-Learning with Near-Optimal Rates
[AUTHORS]
Sreejeet Maity, Aritra Mitra
[ABSTRACT]
We consider the problem of learning the optimal policy in a discounted,
infinite-horizon reinforcement learning (RL) setting where the reward signal is
subject to adversarial corruption. Such corruption, which may arise from
extreme noise, sensor faults, or malicious attacks, can severely degrade the
performance of classical algorithms such as Q-learning. To address this
challenge, we propose a new provably robust variant of the Q-learning algorithm
that operates effectively even when a fraction of the observed rewards are
arbitrarily perturbed by an adversary. Under the asynchronous sampling model
with time-correlated data, we establish that despite adversarial corruption,
the finite-time convergence rate of our algorithm matches that of existing
results for the non-adversarial case, up to an additive term proportional to
the fraction of corrupted samples. Moreover, we derive an information-theoretic
lower bound revealing that the additive corruption term in our upper bounds is
unavoidable.
Next, we propose a variant of our algorithm that requires no prior knowledge
of the statistics of the true reward distributions. The analysis of this
setting is particularly challenging and is enabled by carefully exploiting a
refined Azuma-Hoeffding inequality for almost-martingales, a technical tool
that might be of independent interest. Collectively, our contributions provide
the first finite-time robustness guarantees for asynchronous Q-learning,
bridging a significant gap in robust RL.
[LINK]
http://arxiv.org/abs/2509.08933v1
[DATE]
2025-09-11 02:56:39+08:00
[CATEGORIES]
cs.LG
Data-Augmented Few-Shot Neural Stencil Emulation for System Identification of Computer Models
[AUTHORS]
Sanket Jantre, Deepak Akhare, Xiaoning Qian, Nathan M. Urban
[ABSTRACT]
Partial differential equations (PDEs) underpin the modeling of many natural
and engineered systems. It can be convenient to express such models as neural
PDEs rather than using traditional numerical PDE solvers by replacing part or
all of the PDE’s governing equations with a neural network representation.
Neural PDEs are often easier to differentiate, linearize, reduce, or use for
uncertainty quantification than the original numerical solver. They are usually
trained on solution trajectories obtained by long time integration of the PDE
solver. Here we propose a more sample-efficient data-augmentation strategy for
generating neural PDE training data from a computer model by space-filling
sampling of local “stencil” states. This approach removes a large degree of
spatiotemporal redundancy present in trajectory data and oversamples states
that may be rarely visited but help the neural PDE generalize across the state
space. We demonstrate that accurate neural PDE stencil operators can be learned
from synthetic training data generated by the computational equivalent of 10
timesteps’ worth of numerical simulation. Accuracy is further improved if we
assume access to a single full-trajectory simulation from the computer model,
which is typically available in practice. Across several PDE systems, we show
that our data-augmented synthetic stencil data yield better trained neural
stencil operators, with clear performance gains compared with naively sampled
stencil data from simulation trajectories.
[LINK]
http://arxiv.org/abs/2508.19441v2
[DATE]
2025-09-11 02:54:57+08:00
[CATEGORIES]
cs.LG
Adaptive kernel predictors from feature-learning infinite limits of neural networks
[AUTHORS]
Clarissa Lauditi, Blake Bordelon, Cengiz Pehlevan
[ABSTRACT]
Previous influential work showed that infinite width limits of neural
networks in the lazy training regime are described by kernel machines. Here, we
show that neural networks trained in the rich, feature learning infinite-width
regime in two different settings are also described by kernel machines, but
with data-dependent kernels. For both cases, we provide explicit expressions
for the kernel predictors and prescriptions to numerically calculate them. To
derive the first predictor, we study the large-width limit of feature-learning
Bayesian networks, showing how feature learning leads to task-relevant
adaptation of layer kernels and preactivation densities. The saddle point
equations governing this limit result in a min-max optimization problem that
defines the kernel predictor. To derive the second predictor, we study gradient
flow training of randomly initialized networks trained with weight decay in the
infinite-width limit using dynamical mean field theory (DMFT). The fixed point
equations of the arising DMFT define the task-adapted internal representations
and the kernel predictor. We compare our kernel predictors to kernels derived
from the lazy regime and demonstrate that our adaptive kernels achieve lower test
loss on benchmark datasets.
[LINK]
http://arxiv.org/abs/2502.07998v2
[DATE]
2025-09-11 02:21:17+08:00
[CATEGORIES]
cs.LG
Geometry and Stability of Supervised Learning Problems
[AUTHORS]
Facundo Mémoli, Brantley Vose, Robert C. Williamson
[ABSTRACT]
We introduce a notion of distance between supervised learning problems, which
we call the Risk distance. This distance, inspired by optimal transport,
facilitates stability results; one can quantify how seriously issues like
sampling bias, noise, limited data, and approximations might change a given
problem by bounding how much these modifications can move the problem under the
Risk distance. With the distance established, we explore the geometry of the
resulting space of supervised learning problems, providing explicit geodesics
and proving that the set of classification problems is dense in a larger class
of problems. We also provide two variants of the Risk distance: one that
incorporates specified weights on a problem’s predictors, and one that is more
sensitive to the contours of a problem’s risk landscape.
[COMMENTS]
99 pages, to be published in Journal of Machine Learning Research 26
(2025) 1-99
[LINK]
http://arxiv.org/abs/2403.01660v2
[DATE]
2025-09-11 02:19:04+08:00
[CATEGORIES]
cs.LG
Instance-Optimal Matrix Multiplicative Weight Update and Its Quantum Applications
[AUTHORS]
Weiyuan Gong, Tongyang Li, Xinzhao Wang, Zhiyu Zhang
[ABSTRACT]
The Matrix Multiplicative Weight Update (MMWU) is a seminal online learning
algorithm with numerous applications. Applied to the matrix version of the
Learning from Expert Advice (LEA) problem on the $d$-dimensional spectraplex,
it is well known that MMWU achieves the minimax-optimal regret bound of
$O(\sqrt{T\log d})$, where $T$ is the time horizon. In this paper, we present
an improved algorithm achieving the instance-optimal regret bound of
$O(\sqrt{T\cdot S(X||d^{-1}I_d)})$, where $X$ is the comparator in the regret,
$I_d$ is the identity matrix, and $S(\cdot||\cdot)$ denotes the quantum
relative entropy. Furthermore, our algorithm has the same computational
complexity as MMWU, indicating that the improvement in the regret bound is
“free”.
Technically, we first develop a general potential-based framework for matrix
LEA, with MMWU being its special case induced by the standard exponential
potential. Then, the crux of our analysis is a new “one-sided” Jensen’s trace
inequality built on a Laplace transform technique, which allows the application
of general potential functions beyond exponential to matrix LEA. Our algorithm
is finally induced by an optimal potential function from the vector LEA
problem, based on the imaginary error function.
Complementing the above, we provide a memory lower bound for matrix LEA, and
explore the applications of our algorithm in quantum learning theory. We show
that it outperforms the state of the art for learning quantum states corrupted
by depolarization noise, random quantum states, and Gibbs states. In addition,
applying our algorithm to linearized convex losses enables predicting nonlinear
quantum properties, such as purity, quantum virtual cooling, and Rényi-$2$
correlation.
[COMMENTS]
47 pages
[LINK]
http://arxiv.org/abs/2509.08911v1
[DATE]
2025-09-11 02:15:41+08:00
[CATEGORIES]
cs.LG
World Modeling with Probabilistic Structure Integration
[AUTHORS]
Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear, Jared Watrous, Simon Kim, Khai Loong Aw, Lilian Naing Chen, Stefan Stojanov, Kevin Feigelis, Imran Thobani, Alex Durango, Khaled Jedoui, Atlas Kazemian, Dan Yamins
[ABSTRACT]
We present Probabilistic Structure Integration (PSI), a system for learning
richly controllable and flexibly promptable world models from data. PSI
consists of a three-step cycle. The first step, Probabilistic prediction,
involves building a probabilistic graphical model Psi of the data, in the form
of a random-access autoregressive sequence model. Psi supports a complete set
of learned conditional distributions describing the dependence of any variables
in the data on any other set of variables. In step 2, Structure extraction, we
show how to extract underlying low-dimensional properties in the data,
corresponding to a diverse set of meaningful “intermediate structures”, in a
zero-shot fashion via causal inference on Psi. Step 3, Integration, completes
the cycle by converting these structures into new token types that are then
continually mixed back into the training diet as conditioning signals and
prediction targets. Each such cycle augments the capabilities of Psi, both
allowing it to model the underlying data better, and creating new control
handles – akin to an LLM-like universal prompting language. We train an
instance of Psi on 1.4 trillion tokens of internet video data; we use it to
perform a variety of useful video prediction and understanding inferences; we
extract state-of-the-art optical flow, self-supervised depth and object
segmentation; and we use these structures to support a full cycle of predictive
improvements.
[LINK]
http://arxiv.org/abs/2509.09737v1
[DATE]
2025-09-11 02:01:04+08:00
[CATEGORIES]
cs.LG
QCardEst/QCardCorr: Quantum Cardinality Estimation and Correction
[AUTHORS]
Tobias Winker, Jinghua Groppe, Sven Groppe
[ABSTRACT]
Cardinality estimation is an important part of query optimization in DBMS. We
develop a Quantum Cardinality Estimation (QCardEst) approach using Quantum
Machine Learning with a Hybrid Quantum-Classical Network. We define a compact
encoding for turning SQL queries into a quantum state, which requires only
qubits equal to the number of tables in the query. This allows the processing
of a complete query with a single variational quantum circuit (VQC) on current
hardware. In addition, we compare multiple classical post-processing layers to
turn the probability vector output of VQC into a cardinality value. We
introduce Quantum Cardinality Correction (QCardCorr), which improves classical
cardinality estimators by multiplying the output with a factor generated by a
VQC to improve the cardinality estimation. With QCardCorr, we have an
improvement over the standard PostgreSQL optimizer of 6.37 times for JOB-light
and 8.66 times for STATS. For JOB-light we even outperform MSCN by a factor of
3.47.
[COMMENTS]
7 pages
[LINK]
http://arxiv.org/abs/2509.08817v1
[DATE]
2025-09-11 01:49:06+08:00
[CATEGORIES]
cs.LG
Reward function compression facilitates goal-dependent reinforcement learning
[AUTHORS]
Gaia Molinaro, Anne G. E. Collins
[ABSTRACT]
Reinforcement learning agents learn from rewards, but humans can uniquely
assign value to novel, abstract outcomes in a goal-dependent manner. However,
this flexibility is cognitively costly, making learning less efficient. Here,
we propose that goal-dependent learning is initially supported by a
capacity-limited working memory system. With consistent experience, learners
create a “compressed” reward function (a simplified rule defining the goal)
which is then transferred to long-term memory and applied automatically upon
receiving feedback. This process frees up working memory resources, boosting
learning efficiency. We test this theory across six experiments. Consistent
with our predictions, our findings demonstrate that learning is parametrically
impaired by the size of the goal space, but improves when the goal space
structure allows for compression. We also find faster reward processing to
correlate with better learning performance, supporting the idea that as goal
valuation becomes more automatic, more resources are available for learning. We
leverage computational modeling to support this interpretation. Our work
suggests that efficient goal-directed learning relies on compressing complex
goal information into a stable reward function, shedding light on the cognitive
mechanisms of human motivation. These findings generate new insights into the
neuroscience of intrinsic motivation and could help improve behavioral
techniques that support people in achieving their goals.
[LINK]
http://arxiv.org/abs/2509.06810v2
[DATE]
2025-09-11 01:24:06+08:00
[CATEGORIES]
cs.LG
ADHDeepNet From Raw EEG to Diagnosis: Improving ADHD Diagnosis through Temporal-Spatial Processing, Adaptive Attention Mechanisms, and Explainability in Raw EEG Signals
[AUTHORS]
Ali Amini, Mohammad Alijanpour, Behnam Latifi, Ali Motie Nasrabadi
[ABSTRACT]
Attention Deficit Hyperactivity Disorder (ADHD) is a common brain disorder in
children that can persist into adulthood, affecting social, academic, and
career life. Early diagnosis is crucial for managing these impacts on patients
and the healthcare system but is often labor-intensive and time-consuming. This
paper presents a novel method to improve ADHD diagnosis precision and
timeliness by leveraging Deep Learning (DL) approaches and electroencephalogram
(EEG) signals. We introduce ADHDeepNet, a DL model that utilizes comprehensive
temporal-spatial characterization, attention modules, and explainability
techniques optimized for EEG signals. ADHDeepNet integrates feature extraction
and refinement processes to enhance ADHD diagnosis. The model was trained and
validated on a dataset of 121 participants (61 ADHD, 60 Healthy Controls),
employing nested cross-validation for robust performance. The proposed
two-stage methodology uses a 10-fold cross-subject validation strategy.
Initially, each iteration optimizes the model’s hyper-parameters with inner
2-fold cross-validation. Then, Additive Gaussian Noise (AGN) with various
standard deviations and magnification levels is applied for data augmentation.
ADHDeepNet achieved 100% sensitivity and 99.17% accuracy in classifying ADHD/HC
subjects. To clarify model explainability and identify key brain regions and
frequency bands for ADHD diagnosis, we analyzed the learned weights and
activation patterns of the model’s primary layers. Additionally, t-distributed
Stochastic Neighbor Embedding (t-SNE) visualized high-dimensional data, aiding
in interpreting the model’s decisions. This study highlights the potential of
DL and EEG in enhancing ADHD diagnosis accuracy and efficiency.
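The validation protocol described above can be sketched as index bookkeeping: a 10-fold outer split over subjects, an inner 2-fold split of each training set for hyper-parameter tuning, and additive Gaussian noise for augmentation. All function names, the seed, and the noise interface are hypothetical; the paper's "magnification levels" are not modeled here.

```python
import random


def k_folds(items, k, rng):
    """Shuffle items and split them into k folds of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(subjects, outer_k=10, inner_k=2, seed=0):
    """Return (train, test, inner_folds) triples for cross-subject nested CV."""
    rng = random.Random(seed)
    splits = []
    for test_fold in k_folds(subjects, outer_k, rng):
        held_out = set(test_fold)
        train = [s for s in subjects if s not in held_out]
        # Inner folds are drawn from training subjects only, so tuning never sees the test fold.
        inner = k_folds(train, inner_k, rng)
        splits.append((train, test_fold, inner))
    return splits


def add_gaussian_noise(signal, std, rng):
    """Additive Gaussian noise augmentation for one channel of samples."""
    return [x + rng.gauss(0.0, std) for x in signal]


splits = nested_cv_splits(list(range(121)))  # 121 participants, as in the abstract
```

Splitting by subject rather than by EEG segment keeps every recording of a held-out participant out of training, which is what makes the evaluation cross-subject.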
[COMMENTS]
29 pages, 7 figures. Preprint.
[LINK]
http://arxiv.org/abs/2509.08779v1
[DATE]
2025-09-11 01:07:00+08:00
[CATEGORIES]
cs.LG
Uncertainty Quantification in Probabilistic Machine Learning Models: Theory, Methods, and Insights
[AUTHORS]
Marzieh Ajirak, Anand Ravishankar, Petar M. Djuric
[ABSTRACT]
Uncertainty Quantification (UQ) is essential in probabilistic machine
learning models, particularly for assessing the reliability of predictions. In
this paper, we present a systematic framework for estimating both epistemic and
aleatoric uncertainty in probabilistic models. We focus on Gaussian Process
Latent Variable Models and employ scalable Random Fourier Features-based
Gaussian Processes to approximate predictive distributions efficiently. We
derive a theoretical formulation for UQ, propose a Monte Carlo sampling-based
estimation method, and conduct experiments to evaluate the impact of
uncertainty estimation. Our results provide insights into the sources of
predictive uncertainty and illustrate the effectiveness of our approach in
quantifying the confidence in the predictions.
[COMMENTS]
Accepted to EUSIPCO 2025
[LINK]
http://arxiv.org/abs/2509.05877v2
[DATE]
2025-09-11 01:02:44+08:00
[CATEGORIES]
cs.LG
Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning
[AUTHORS]
Mominul Rubel, Adam Meyers, Gabriel Nicolosi
[ABSTRACT]
We introduce the Fourier Learning Machine (FLM), a neural network (NN)
architecture designed to represent a multidimensional nonharmonic Fourier
series. The FLM uses a simple feedforward structure with cosine activation
functions to learn the frequencies, amplitudes, and phase shifts of the series
as trainable parameters. This design allows the model to create a
problem-specific spectral basis adaptable to both periodic and nonperiodic
functions. Unlike previous Fourier-inspired NN models, the FLM is the first
architecture able to represent a complete, separable Fourier basis in multiple
dimensions using a standard Multilayer Perceptron-like architecture. A
one-to-one correspondence between the Fourier coefficients and amplitudes and
phase-shifts is demonstrated, allowing for the translation between a full,
separable basis form and the cosine phase–shifted one. Additionally, we
evaluate the performance of FLMs on several scientific computing problems,
including benchmark Partial Differential Equations (PDEs) and a family of
Optimal Control Problems (OCPs). Computational experiments show that the
performance of FLMs is comparable, and often superior, to that of established
architectures like SIREN and vanilla feedforward NNs.
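For a single frequency in one dimension, the correspondence between Fourier coefficients and the amplitude and phase shift of a cosine follows from the identity $a\cos(\omega x) + b\sin(\omega x) = A\cos(\omega x + \varphi)$ with $A = \sqrt{a^2 + b^2}$ and $\varphi = \operatorname{atan2}(-b, a)$. The sketch below implements that scalar case only; the function names are hypothetical and the multidimensional, separable construction from the paper is not reproduced.

```python
import math


def to_amplitude_phase(a, b):
    """Map a*cos(wx) + b*sin(wx) to the pair (A, phi) of A*cos(wx + phi)."""
    return math.hypot(a, b), math.atan2(-b, a)


def to_coefficients(amplitude, phase):
    """Inverse map: recover the cosine and sine coefficients (a, b)."""
    return amplitude * math.cos(phase), -amplitude * math.sin(phase)


def cosine_series(x, freqs, amps, phases):
    """Sum of phase-shifted cosines with free (nonharmonic) frequencies."""
    return sum(A * math.cos(w * x + p) for w, A, p in zip(freqs, amps, phases))
```

In a network of this kind, `freqs`, `amps`, and `phases` would be the trainable parameters, with the cosine acting as the activation function.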
[LINK]
http://arxiv.org/abs/2509.08759v1
[DATE]
2025-09-11 00:49:20+08:00
[CATEGORIES]
cs.LG
Using AI to Optimize Patient Transfer and Resource Utilization During Mass-Casualty Incidents: A Simulation Platform
[AUTHORS]
Zhaoxun “Lorenz” Liu, Wagner H. Souza, Jay Han, Amin Madani
[ABSTRACT]
Mass casualty incidents (MCIs) overwhelm healthcare systems and demand rapid,
accurate patient-hospital allocation decisions under extreme pressure. Here, we
developed and validated a deep reinforcement learning-based decision-support AI
agent to optimize patient transfer decisions during simulated MCIs by balancing
patient acuity levels, specialized care requirements, hospital capacities, and
transport logistics. To integrate this AI agent, we developed MasTER, a
web-accessible command dashboard for MCI management simulations. Through a
controlled user study with 30 participants (6 trauma experts and 24
non-experts), we evaluated three interaction approaches with the AI agent
(human-only, human-AI collaboration, and AI-only) across 20- and 60-patient MCI
scenarios in the Greater Toronto Area. Results demonstrate that increasing AI
involvement significantly improves decision quality and consistency. The AI
agent outperforms trauma surgeons (p < 0.001) and enables non-experts to
achieve expert-level performance when assisted, contrasting sharply with their
significantly inferior unassisted performance (p < 0.001). These findings
establish the potential for our AI-driven decision support to enhance both MCI
preparedness training and real-world emergency response management.
[LINK]
http://arxiv.org/abs/2509.08756v1
[DATE]
2025-09-11 00:46:54+08:00
[CATEGORIES]
cs.LG
Learning Turbulent Flows with Generative Models: Super-resolution, Forecasting, and Sparse Flow Reconstruction
[AUTHORS]
Vivek Oommen, Siavash Khodakarami, Aniruddha Bora, Zhicheng Wang, George Em Karniadakis
[ABSTRACT]
Neural operators are promising surrogates for dynamical systems but when
trained with standard L2 losses they tend to oversmooth fine-scale turbulent
structures. Here, we show that combining operator learning with generative
modeling overcomes this limitation. We consider three practical turbulent-flow
challenges where conventional neural operators fail: spatio-temporal
super-resolution, forecasting, and sparse flow reconstruction. For Schlieren
jet super-resolution, an adversarially trained neural operator (adv-NO) reduces
the energy-spectrum error by 15x while preserving sharp gradients at neural
operator-like inference cost. For 3D homogeneous isotropic turbulence, adv-NO
trained on only 160 timesteps from a single trajectory forecasts accurately for
five eddy-turnover times and offers a 114x wall-clock speed-up at inference over
the baseline diffusion-based forecasters, enabling near-real-time rollouts. For
reconstructing cylinder wake flows from highly sparse Particle Tracking
Velocimetry-like inputs, a conditional generative model infers full 3D velocity
and pressure fields with correct phase alignment and statistics. These advances
enable accurate reconstruction and forecasting at low compute cost, bringing
near-real-time analysis and control within reach in experimental and
computational fluid mechanics. See our project page:
https://vivekoommen.github.io/Gen4Turb/
[LINK]
http://arxiv.org/abs/2509.08752v1
[DATE]
2025-09-11 00:42:22+08:00
[CATEGORIES]
cs.LG
Bregman Douglas-Rachford Splitting Method
[AUTHORS]
Shiqian Ma, Lin Xiao, Renbo Zhao
[ABSTRACT]
In this paper, we propose the Bregman Douglas-Rachford splitting (BDRS)
method and its variant, the Bregman Peaceman-Rachford splitting method, for
solving maximal monotone inclusion problems. We show that BDRS is equivalent to a
Bregman alternating direction method of multipliers (ADMM) when applied to the
dual of the problem. A special case of the Bregman ADMM is an alternating
direction version of the exponential multiplier method. To the best of our
knowledge, algorithms proposed in this paper are new to the literature. We also
discuss how to use our algorithms to solve the discrete optimal transport (OT)
problem. We prove the convergence of the algorithms under certain assumptions,
though we point out that one assumption does not apply to the OT problem.
[LINK]
http://arxiv.org/abs/2509.08739v1
[DATE]
2025-09-11 00:27:02+08:00
[CATEGORIES]
cs.LG
ChemBOMAS: Accelerated BO in Chemistry with LLM-Enhanced Multi-Agent System
[AUTHORS]
Dong Han, Zhehong Ai, Pengxiang Cai, Shuzhou Sun, Shanya Lu, Jianpeng Chen, Ben Gao, Lingli Ge, Weida Wang, Xiangxin Zhou, Xihui Liu, Mao Su, Wanli Ouyang, Lei Bai, Dongzhan Zhou, Tao XU, Yuqiang Li, Shufei Zhang
[ABSTRACT]
The efficiency of Bayesian optimization (BO) in chemistry is often hindered
by sparse experimental data and complex reaction mechanisms. To overcome these
limitations, we introduce ChemBOMAS, a new framework named LLM-Enhanced
Multi-Agent System for accelerating BO in chemistry. ChemBOMAS’s optimization
process is enhanced by LLMs and synergistically employs two strategies:
knowledge-driven coarse-grained optimization and data-driven fine-grained
optimization. First, in the knowledge-driven coarse-grained optimization stage,
LLMs intelligently decompose the vast search space by reasoning over existing
chemical knowledge to identify promising candidate regions. Subsequently, in
the data-driven fine-grained optimization stage, LLMs enhance the BO process
within these candidate regions by generating pseudo-data points, thereby
improving data utilization efficiency and accelerating convergence. Benchmark
evaluations further confirm that ChemBOMAS significantly enhances
optimization effectiveness and efficiency compared to various BO algorithms.
Importantly, the practical utility of ChemBOMAS was validated through wet-lab
experiments conducted under pharmaceutical industry protocols, targeting
conditional optimization for a previously unreported and challenging chemical
reaction. In the wet experiment, ChemBOMAS achieved an optimal objective value
of 96%. This was substantially higher than the 15% achieved by domain experts.
This real-world success, together with strong performance on benchmark
evaluations, highlights ChemBOMAS as a powerful tool to accelerate chemical
discovery.
[LINK]
http://arxiv.org/abs/2509.08736v1
[DATE]
2025-09-11 00:24:08+08:00
[CATEGORIES]
cs.LG
DEQuify your force field: More efficient simulations using deep equilibrium models
[AUTHORS]
Andreas Burger, Luca Thiede, Alán Aspuru-Guzik, Nandita Vijaykumar
[ABSTRACT]
Machine learning force fields show great promise in enabling more accurate
molecular dynamics simulations compared to manually derived ones. Much of the
progress in recent years was driven by exploiting prior knowledge about
physical systems, in particular symmetries under rotation, translation, and
reflections. In this paper, we argue that there is another important piece of
prior information that, thus far, hasn’t been explored: Simulating a molecular
system is necessarily continuous, and successive states are therefore extremely
similar. Our contribution is to show that we can exploit this information by
recasting a state-of-the-art equivariant base model as a deep equilibrium
model. This allows us to recycle intermediate neural network features from
previous time steps, enabling us to improve both accuracy and speed by
$10\%-20\%$ on the MD17, MD22, and OC20 200k datasets, compared to the non-DEQ
base model. The training is also much more memory efficient, allowing us to
train more expressive models on larger systems.
[COMMENTS]
AI4MAT-ICLR-2025 Spotlight https://openreview.net/forum?id=XACVRYePQQ
[LINK]
http://arxiv.org/abs/2509.08734v1
[DATE]
2025-09-11 00:23:52+08:00
[CATEGORIES]
cs.LG
Investigating Compositional Reasoning in Time Series Foundation Models
[AUTHORS]
Willa Potosnak, Cristian Challu, Mononito Goswami, Kin G. Olivares, Michał Wiliński, Nina Żukowska, Artur Dubrawski
[ABSTRACT]
Large pre-trained time series foundation models (TSFMs) have demonstrated
promising zero-shot performance across a wide range of domains. However, a
question remains: Do TSFMs succeed by memorizing patterns in training data, or
do they possess the ability to reason about such patterns? While reasoning is a
topic of great interest in the study of Large Language Models (LLMs), it is
undefined and largely unexplored in the context of TSFMs. In this work,
inspired by language modeling literature, we formally define compositional
reasoning in forecasting and distinguish it from in-distribution
generalization. We evaluate the reasoning and generalization capabilities of 16
popular deep learning forecasting models on multiple synthetic and real-world
datasets. Additionally, through controlled studies, we systematically examine
which design choices in 7 popular open-source TSFMs contribute to improved
reasoning capabilities. Our study yields key insights into the impact of TSFM
architecture design on compositional reasoning and generalization. We find that
patch-based Transformers have the best reasoning performance, closely followed
by residualized MLP-based architectures, which are 97\% less computationally
complex in terms of FLOPs and 86\% smaller in terms of the number of trainable
parameters. Interestingly, in some zero-shot out-of-distribution scenarios,
these models can outperform moving average and exponential smoothing
statistical baselines trained on in-distribution data. Only a few design
choices, such as the tokenization method, had a significant (negative) impact
on Transformer model performance.
[LINK]
http://arxiv.org/abs/2502.06037v2
[DATE]
2025-09-11 00:22:20+08:00
[CATEGORIES]
cs.LG
Data-driven generative simulation of SDEs using diffusion models
[AUTHORS]
Xuefeng Gao, Jiale Zha, Xun Yu Zhou
[ABSTRACT]
This paper introduces a new approach to generating sample paths of unknown
stochastic differential equations (SDEs) using diffusion models, a class of
generative AI models commonly employed in image and video applications. Unlike
the traditional Monte Carlo methods for simulating SDEs, which require explicit
specifications of the drift and diffusion coefficients, our method takes a
model-free, data-driven approach. Given a finite set of sample paths from an
SDE, we utilize conditional diffusion models to generate new, synthetic paths
of the same SDE. To demonstrate the effectiveness of our approach, we conduct a
simulation experiment to compare our method with alternative benchmark ones
including neural SDEs. Furthermore, in an empirical study we leverage these
synthetically generated sample paths to enhance the performance of
reinforcement learning algorithms for continuous-time mean-variance portfolio
selection, hinting at promising applications of diffusion models in financial
analysis and decision-making.
[LINK]
http://arxiv.org/abs/2509.08731v1
[DATE]
2025-09-11 00:17:52+08:00
[CATEGORIES]
cs.LG
Decentralized Stochastic Nonconvex Optimization under the Relaxed Smoothness
[AUTHORS]
Luo Luo, Xue Cui, Tingkai Jia, Cheng Chen
[ABSTRACT]
This paper studies the decentralized optimization problem
$f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^m f_i(\mathbf{x})$, where each local
function has the form of $f_i(\mathbf{x}) = {\mathbb
E}\left[F(\mathbf{x};{\xi}_i)\right]$ which is $(L_0,L_1)$-smooth but possibly
nonconvex and the random variable ${\xi}_i$ follows distribution ${\mathcal
D}_i$. We propose a novel algorithm called decentralized normalized stochastic
gradient descent (DNSGD), which can achieve an $\epsilon$-stationary point on
each local agent. We present a new framework for analyzing decentralized
first-order methods in the relaxed smooth setting, based on the Lyapunov
function related to the product of the gradient norm and the consensus error.
The analysis shows upper bounds on sample complexity of ${\mathcal
O}(m^{-1}(L_f\sigma^2\Delta_f\epsilon^{-4} + \sigma^2\epsilon^{-2} +
L_f^{-2}L_1^3\sigma^2\Delta_f\epsilon^{-1} + L_f^{-2}L_1^2\sigma^2))$ per agent
and communication complexity of $\tilde{\mathcal O}((L_f\epsilon^{-2} +
L_1\epsilon^{-1})\gamma^{-1/2}\Delta_f)$, where $L_f=L_0 +L_1\zeta$, $\sigma^2$
is the variance of the stochastic gradient, $\Delta_f$ is the initial optimal
function value gap, $\gamma$ is the spectral gap of the network, and $\zeta$ is
the degree of the gradient dissimilarity. In the special case of $L_1=0$, the
above results (nearly) match the lower bounds on decentralized nonconvex
optimization in the standard smooth setting. We also conduct numerical
experiments to show the empirical superiority of our method.
[LINK]
http://arxiv.org/abs/2509.08726v1
[DATE]
2025-09-11 00:17:19+08:00
[CATEGORIES]
cs.LG
RINO: Renormalization Group Invariance with No Labels
[AUTHORS]
Zichun Hao, Raghav Kansal, Abhijith Gandrakota, Chang Sun, Ngadiuba Jennifer, Javier Duarte, Maria Spiropulu
[ABSTRACT]
A common challenge with supervised machine learning (ML) in high energy
physics (HEP) is the reliance on simulations for labeled data, which can often
mismodel the underlying collision or detector response. To help mitigate this
problem of domain shift, we propose RINO (Renormalization Group Invariance with
No Labels), a self-supervised learning approach that can instead pretrain
models directly on collision data, learning embeddings invariant to
renormalization group flow scales. In this work, we pretrain a
transformer-based model on jets originating from quantum chromodynamic (QCD)
interactions from the JetClass dataset, emulating real QCD-dominated
experimental data, and then finetune on the JetNet dataset – emulating
simulations – for the task of identifying jets originating from top quark
decays. RINO demonstrates improved generalization from the JetNet training data
to JetClass data compared to supervised training on JetNet from scratch,
demonstrating the potential for RINO pretraining on real collision data
followed by fine-tuning on small, high-quality MC datasets, to improve the
robustness of ML models in HEP.
[LINK]
http://arxiv.org/abs/2509.07486v2
[DATE]
2025-09-11 00:15:42+08:00
[CATEGORIES]
cs.LG
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
[AUTHORS]
Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
[ABSTRACT]
Post-training language models (LMs) with reinforcement learning (RL) can
enhance their complex reasoning capabilities without supervised fine-tuning, as
demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs
requires significant parallelization to scale-up inference, which introduces
non-trivial technical challenges (e.g. latency, memory, and reliability)
alongside ever-growing financial costs. We present Swarm sAmpling Policy
Optimization (SAPO), a fully decentralized and asynchronous RL post-training
algorithm. SAPO is designed for decentralized networks of heterogeneous compute
nodes, where each node manages its own policy model(s) while “sharing” rollouts
with others in the network; no explicit assumptions about latency, model
homogeneity, or hardware are required and nodes can operate in silo if desired.
As a result, the algorithm avoids common bottlenecks in scaling RL
post-training while also allowing (and even encouraging) new possibilities. By
sampling rollouts “shared” across the network, it enables “Aha moments” to
propagate, thereby bootstrapping the learning process. In this paper we show
SAPO achieved cumulative reward gains of up to 94% in controlled experiments.
We also share insights from tests on a network with thousands of nodes
contributed by Gensyn community members running the algorithm on diverse
hardware and models during an open-source demo.
[COMMENTS]
14 pages, 6 figures
[LINK]
http://arxiv.org/abs/2509.08721v1
[DATE]
2025-09-11 00:14:20+08:00
[CATEGORIES]
cs.LG
PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation
[AUTHORS]
Pablo Lemos, Sammy Sharief, Nikolay Malkin, Salma Salhi, Connor Stone, Laurence Perreault-Levasseur, Yashar Hezaveh
[ABSTRACT]
We propose a likelihood-free method for comparing two distributions given
samples from each, with the goal of assessing the quality of generative models.
The proposed approach, PQMass, provides a statistically rigorous method for
assessing the performance of a single generative model or the comparison of
multiple competing models. PQMass divides the sample space into non-overlapping
regions and applies chi-squared tests to the number of data samples that fall
within each region, giving a p-value that measures the probability that the bin
counts derived from two sets of samples are drawn from the same multinomial
distribution. PQMass does not depend on assumptions regarding the density of
the true distribution, nor does it rely on training or fitting any auxiliary
models. We evaluate PQMass on data of various modalities and dimensions,
demonstrating its effectiveness in assessing the quality, novelty, and
diversity of generated samples. We further show that PQMass scales well to
moderately high-dimensional data and thus obviates the need for feature
extraction in practical applications.
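The region-counting idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes regions are defined by nearest-center assignment to a hypothetical set of reference points, and it computes only the standard Pearson chi-squared statistic for two sets of bin counts. The helper names (`nearest_region`, `region_counts`, `two_sample_chi2`) are invented for this sketch, and the p-value step is left out because the Python standard library has no chi-squared CDF.

```python
import random


def nearest_region(point, centers):
    """Return the index of the center closest to `point` (squared Euclidean)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def region_counts(samples, centers):
    """Count how many samples fall in each nearest-center region."""
    counts = [0] * len(centers)
    for s in samples:
        counts[nearest_region(s, centers)] += 1
    return counts


def two_sample_chi2(counts_a, counts_b):
    """Pearson chi-squared homogeneity statistic for two count vectors.

    Returns (statistic, degrees_of_freedom). Bins empty in both samples
    are skipped and do not contribute to the degrees of freedom.
    """
    n_a, n_b = sum(counts_a), sum(counts_b)
    stat, used_bins = 0.0, 0
    for a, b in zip(counts_a, counts_b):
        if a + b == 0:
            continue
        pooled = (a + b) / (n_a + n_b)  # pooled probability of this bin
        exp_a, exp_b = n_a * pooled, n_b * pooled
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
        used_bins += 1
    return stat, used_bins - 1


rng = random.Random(0)


def gaussian_2d(n, mean):
    """Draw n points from an isotropic 2-D Gaussian (toy data, assumption)."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


centers = gaussian_2d(20, 0.0)    # hypothetical reference points
real = gaussian_2d(500, 0.0)      # stand-in for "true" samples
same = gaussian_2d(500, 0.0)      # stand-in for a good generative model
shifted = gaussian_2d(500, 1.5)   # stand-in for a poor generative model

real_counts = region_counts(real, centers)
stat_same, dof = two_sample_chi2(real_counts, region_counts(same, centers))
stat_shifted, _ = two_sample_chi2(real_counts, region_counts(shifted, centers))
print("matching distribution:", stat_same, "dof:", dof)
print("shifted distribution: ", stat_shifted)
```

A statistic far above the degrees of freedom suggests the two samples were not drawn from the same distribution; a p-value could be obtained from a chi-squared survival function such as `scipy.stats.chi2.sf(stat, dof)`.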
[COMMENTS]
Published as a conference paper at ICLR 2025
[LINK]
http://arxiv.org/abs/2402.04355v3
[DATE]
2025-09-11 00:11:35+08:00
[CATEGORIES]
cs.LG
Assessing the Limits of Graph Neural Networks for Vapor-Liquid Equilibrium Prediction: A Cryogenic Mixture Case Study
[AUTHORS]
Aryan Gupta
[ABSTRACT]
Accurate and fast thermophysical models are needed to embed vapor-liquid
equilibrium (VLE) calculations in design, optimization, and control loops for
cryogenic mixtures. This study asks whether a structure-aware graph neural
network (GNN; DimeNet++) trained on GERG-2008/CoolProp data can act as a
practical surrogate for an equation of state (EoS). We generate a ternary
dataset over 90-200 K and pressures to 100 bar, curate it with a 15% density
filter (reducing 5,200 states to 1,516), and pair each state with a lightweight
molecular-dynamics snapshot to supply structural features. The model is trained
in two stages (pretraining on residual Helmholtz energy followed by pressure
fine-tuning with a stability penalty) and evaluated via single-phase
interpolation tests, solver-free derivative-quality diagnostics, an audited VLE
driver, and a latency benchmark. Within its regime, the GNN interpolates
single-phase properties reasonably well; however, the VLE driver accepts no GNN
equilibria on tested binaries (all plotted VLE points are CoolProp fallback or
the solver fails), and diagnostic probes reveal jagged P(V|T) paths and
thermal-stability flags concentrated in dense/cold regions, indicating
insufficient derivative smoothness/consistency for robust equilibrium solving.
An end-to-end timing comparison shows no single-phase speed advantage relative
to CoolProp (tens of milliseconds vs sub-millisecond). We conclude that, as
configured, the surrogate in this study is not solver-ready for VLE and offers
no runtime benefit; its value is methodological, delineating failure modes and
pointing to remedies such as physics-informed training signals and targeted
coverage near phase boundaries.
[LINK]
http://arxiv.org/abs/2509.10565v1
[DATE]
2025-09-11 00:10:58+08:00
[CATEGORIES]
cs.LG
Compressing CNN models for resource-constrained systems by channel and layer pruning
[AUTHORS]
Ahmed Sadaqa, Di Liu
[ABSTRACT]
Convolutional Neural Networks (CNNs) have achieved significant breakthroughs
in various fields. However, these advancements have led to a substantial
increase in the complexity and size of these networks. This poses a challenge
when deploying large and complex networks on edge devices. Consequently, model
compression has emerged as a research field aimed at reducing the size and
complexity of CNNs. One prominent technique in model compression is model
pruning. This paper will present a new technique of pruning that combines both
channel and layer pruning in what is called a “hybrid pruning framework”.
Inspired by EfficientNet, a renowned CNN architecture known for scaling up
networks from both channel and layer perspectives, this hybrid approach applies
the same principles but in reverse, where it scales down the network through
pruning. Experiments on the hybrid approach demonstrated a notable decrease in
the overall complexity of the model, with only a minimal reduction in accuracy
compared to the baseline model. This complexity reduction translates into
reduced latency when deploying the pruned models on an NVIDIA JETSON TX2
embedded AI device.
[COMMENTS]
16 pages, 4 figures, the European Conference on Machine Learning and
Principles and Practice of Knowledge Discovery in Databases
[LINK]
http://arxiv.org/abs/2509.08714v1
[DATE]
2025-09-11 00:09:47+08:00
[CATEGORIES]
cs.LG
REGen: A Reliable Evaluation Framework for Generative Event Argument Extraction
[AUTHORS]
Omar Sharif, Joseph Gatto, Madhusudan Basak, Sarah M. Preum
[COMMENTS]
Accepted at EMNLP-2025
[LINK]
http://arxiv.org/abs/2502.16838v2
[DATE]
2025-09-10 23:49:00+08:00
[CATEGORIES]
cs.CL
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
[AUTHORS]
Yang Wang, Chenghao Xiao, Chia-Yi Hsiao, Zi Yan Chang, Chi-Li Chen, Tyler Loakman, Chenghua Lin
[ABSTRACT]
We introduce Drivelology, a unique linguistic phenomenon characterised as
“nonsense with depth” - utterances that are syntactically coherent yet
pragmatically paradoxical, emotionally loaded, or rhetorically subversive.
While such expressions may resemble surface-level nonsense, they encode
implicit meaning requiring contextual inference, moral reasoning, or emotional
interpretation. We find that current large language models (LLMs), despite
excelling at many natural language processing (NLP) tasks, consistently fail to
grasp the layered semantics of Drivelological text. To investigate this, we
construct a benchmark dataset of over 1,200 meticulously curated and diverse
examples across English, Mandarin, Spanish, French, Japanese, and Korean. Each
example underwent careful expert review to verify its Drivelological
characteristics, involving multiple rounds of discussion and adjudication to
address disagreements. Using this dataset, we evaluate a range of LLMs on
classification, generation, and reasoning tasks. Our results reveal clear
limitations of LLMs: models often confuse Drivelology with shallow nonsense,
produce incoherent justifications, or miss implied rhetorical functions
altogether. These findings highlight a deep representational gap in LLMs’
pragmatic understanding and challenge the assumption that statistical fluency
implies cognitive comprehension. We release our dataset and code to facilitate
further research in modelling linguistic depth beyond surface-level coherence.
[COMMENTS]
Accepted for oral presentation at the EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2509.03867v2
[DATE]
2025-09-10 22:02:50+08:00
[CATEGORIES]
cs.CL
LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge
[AUTHORS]
Dima Galat, Diego Molla-Aliod
[ABSTRACT]
Biomedical question answering (QA) poses significant challenges due to the
need for precise interpretation of specialized knowledge drawn from a vast,
complex, and rapidly evolving corpus. In this work, we explore how large
language models (LLMs) can be used for information retrieval (IR), and an
ensemble of zero-shot models can accomplish state-of-the-art performance on a
domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge
tasks, we show that ensembles can outperform individual LLMs and in some cases
rival or surpass domain-tuned systems - all while preserving generalizability
and avoiding the need for costly fine-tuning or labeled data. Our method
aggregates outputs from multiple LLM variants, including models from Anthropic
and Google, to synthesize more accurate and robust answers. Moreover, our
investigation highlights a relationship between context length and performance:
while expanded contexts are meant to provide valuable evidence, they
simultaneously risk information dilution and model disorientation. These
findings emphasize IR as a critical foundation in Retrieval-Augmented
Generation (RAG) approaches for biomedical QA systems. Precise, focused
retrieval remains essential for ensuring LLMs operate within relevant
information boundaries when generating answers from retrieved documents. Our
results establish that ensemble-based zero-shot approaches, when paired with
effective RAG pipelines, constitute a practical and scalable alternative to
domain-tuned systems for biomedical question answering.
[COMMENTS]
CEUR-WS, CLEF2025
[LINK]
http://arxiv.org/abs/2509.08596v1
[DATE]
2025-09-10 21:50:49+08:00
[CATEGORIES]
cs.CL
Pay Attention to Real World Perturbations! Natural Robustness Evaluation in Machine Reading Comprehension
[AUTHORS]
Yulong Wu, Viktor Schlegel, Riza Batista-Navarro
[ABSTRACT]
As neural language models achieve human-comparable performance on Machine
Reading Comprehension (MRC) and see widespread adoption, ensuring their
robustness in real-world scenarios has become increasingly important. Current
robustness evaluation research, though, primarily develops synthetic
perturbation methods, leaving unclear how well they reflect real life
scenarios. Considering this, we present a framework to automatically examine
MRC models on naturally occurring textual perturbations, by replacing paragraphs
in MRC benchmarks with their counterparts based on available Wikipedia edit
history. Such a perturbation type is natural as its design does not stem from an
artificial generative process, inherently distinct from the previously
investigated synthetic approaches. In a large-scale study encompassing SQUAD
datasets and various model architectures we observe that natural perturbations
result in performance degradation in pre-trained encoder language models. More
worryingly, these state-of-the-art Flan-T5 and Large Language Models (LLMs)
inherit these errors. Further experiments demonstrate that our findings
generalise to natural perturbations found in other more challenging MRC
benchmarks. In an effort to mitigate these errors, we show that it is possible
to improve the robustness to natural perturbations by training on naturally or
synthetically perturbed examples, though a noticeable gap still remains
compared to performance on unperturbed data.
[LINK]
http://arxiv.org/abs/2502.16523v2
[DATE]
2025-09-10 21:22:08+08:00
[CATEGORIES]
cs.CL
Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning
[AUTHORS]
Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li
[ABSTRACT]
Chinese ancient documents, invaluable carriers of millennia of Chinese
history and culture, hold rich knowledge across diverse fields but face
challenges in digitization and understanding, i.e., traditional methods only
scan images, while current Vision-Language Models (VLMs) struggle with their
visual and linguistic complexity. Existing document benchmarks focus on English
printed texts or simplified Chinese, leaving a gap for evaluating VLMs on
ancient Chinese documents. To address this, we present AncientDoc, the first
benchmark for Chinese ancient documents, designed to assess VLMs from OCR to
knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular
translation, reasoning-based QA, knowledge-based QA, linguistic variant QA) and
covers 14 document types, over 100 books, and about 3,000 pages. Based on
AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by
a human-aligned large language model for scoring.
[LINK]
http://arxiv.org/abs/2509.09731v1
[DATE]
2025-09-10 21:02:29+08:00
[CATEGORIES]
cs.CL
SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP
[AUTHORS]
Decheng Duan, Yingyi Zhang, Jitong Peng, Chengzhi Zhang
[ABSTRACT]
Structured information extraction from scientific literature is crucial for
capturing core concepts and emerging trends in specialized fields. While
existing datasets aid model development, most focus on specific publication
sections due to domain complexity and the high cost of annotating scientific
texts. To address this limitation, we introduce SciNLP - a specialized
benchmark for full-text entity and relation extraction in the Natural Language
Processing (NLP) domain. The dataset comprises 60 manually annotated full-text
NLP publications, covering 7,072 entities and 1,826 relations. Compared to
existing research, SciNLP is the first dataset providing full-text annotations
of entities and their relationships in the NLP domain. To validate the
effectiveness of SciNLP, we conducted comparative experiments with similar
datasets and evaluated the performance of state-of-the-art supervised models on
this dataset. Results reveal varying extraction capabilities of existing models
across academic texts of different lengths. Cross-comparisons with existing
datasets show that SciNLP achieves significant performance improvements on
certain baseline models. Using models trained on SciNLP, we implemented
automatic construction of a fine-grained knowledge graph for the NLP domain.
Our KG has an average node degree of 3.2 per entity, indicating rich semantic
topological information that enhances downstream applications. The dataset is
publicly available at https://github.com/AKADDC/SciNLP.
[COMMENTS]
EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2509.07801v2
[DATE]
2025-09-10 20:09:56+08:00
[CATEGORIES]
cs.CL
Simulating Identity, Propagating Bias: Abstraction and Stereotypes in LLM-Generated Text
[AUTHORS]
Pia Sommerauer, Giulia Rambelli, Tommaso Caselli
[COMMENTS]
Accepted to EMNLP Findings 2025
[LINK]
http://arxiv.org/abs/2509.08484v1
[DATE]
2025-09-10 18:49:21+08:00
[CATEGORIES]
cs.CL
Acquiescence Bias in Large Language Models
[AUTHORS]
Daniel Braun
[COMMENTS]
Accepted to EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2509.08480v1
[DATE]
2025-09-10 18:39:24+08:00
[CATEGORIES]
cs.CL
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts
[AUTHORS]
Hanhua Hong, Chenghao Xiao, Yang Wang, Yiqi Liu, Wenge Rong, Chenghua Lin
[ABSTRACT]
Evaluating natural language generation systems is challenging due to the
diversity of valid outputs. While human evaluation is the gold standard, it
suffers from inconsistencies, lack of standardisation, and demographic biases,
limiting reproducibility. LLM-based evaluators offer a scalable alternative but
are highly sensitive to prompt design, where small variations can lead to
significant discrepancies. In this work, we propose an inversion learning
method that learns effective reverse mappings from model outputs back to their
input instructions, enabling the automatic generation of highly effective,
model-specific evaluation prompts. Our method requires only a single evaluation
sample and eliminates the need for time-consuming manual prompt engineering,
thereby improving both efficiency and robustness. Our work contributes toward a
new direction for more robust and efficient LLM-based evaluation.
[COMMENTS]
11 pages, accepted by Transactions of the Association for
Computational Linguistics (TACL)
[LINK]
http://arxiv.org/abs/2504.21117v3
[DATE]
2025-09-10 18:32:57+08:00
[CATEGORIES]
cs.CL
Adversarial Attacks Against Automated Fact-Checking: A Survey
[AUTHORS]
Fanzhen Liu, Alsharif Abuadbba, Kristen Moore, Surya Nepal, Cecile Paris, Jia Wu, Jian Yang, Quan Z. Sheng
[ABSTRACT]
In an era where misinformation spreads freely, fact-checking (FC) plays a
crucial role in verifying claims and promoting reliable information. While
automated fact-checking (AFC) has advanced significantly, existing systems
remain vulnerable to adversarial attacks that manipulate or generate claims,
evidence, or claim-evidence pairs. These attacks can distort the truth, mislead
decision-makers, and ultimately undermine the reliability of FC models. Despite
growing research interest in adversarial attacks against AFC systems, a
comprehensive, holistic overview of key challenges remains lacking. These
challenges include understanding attack strategies, assessing the resilience of
current models, and identifying ways to enhance robustness. This survey
provides the first in-depth review of adversarial attacks targeting FC,
categorizing existing attack methodologies and evaluating their impact on AFC
systems. Additionally, we examine recent advancements in adversary-aware
defenses and highlight open research questions that require further
exploration. Our findings underscore the urgent need for resilient FC
frameworks capable of withstanding adversarial manipulations in pursuit of
preserving high verification accuracy.
[COMMENTS]
Accepted to the Main Conference of EMNLP 2025. Resources are
available at
https://github.com/FanzhenLiu/Awesome-Automated-Fact-Checking-Attacks
[LINK]
http://arxiv.org/abs/2509.08463v1
[DATE]
2025-09-10 18:10:10+08:00
[CATEGORIES]
cs.CL
Meta-Semantics Augmented Few-Shot Relational Learning
[AUTHORS]
Han Wu, Jie Yin
[COMMENTS]
Accepted by EMNLP 2025
[LINK]
http://arxiv.org/abs/2505.05684v2
[DATE]
2025-09-10 18:03:50+08:00
[CATEGORIES]
cs.CL
cs.LG
All for law and law for all: Adaptive RAG Pipeline for Legal Research
[AUTHORS]
Figarri Keisha, Prince Singh, Pallavi, Dion Fernandes, Aravindh Manivannan, Ilham Wicaksono, Faisal Ahmad, Wiem Ben Rim
[ABSTRACT]
Retrieval-Augmented Generation (RAG) has transformed how we approach text
generation tasks by grounding Large Language Model (LLM) outputs in retrieved
knowledge. This capability is especially critical in the legal domain. In this
work, we introduce a novel end-to-end RAG pipeline that improves upon previous
baselines using three targeted enhancements: (i) a context-aware query
translator that disentangles document references from natural-language
questions and adapts retrieval depth and response style based on expertise and
specificity, (ii) open-source retrieval strategies using SBERT and GTE
embeddings that achieve substantial performance gains while remaining
cost-efficient, and (iii) a comprehensive evaluation and generation framework
that combines RAGAS, BERTScore-F1, and ROUGE-Recall to assess semantic
alignment and faithfulness across models and prompt designs. Our results show
that carefully designed open-source pipelines can rival proprietary approaches
in retrieval quality, while a custom legal-grounded prompt consistently
produces more faithful and contextually relevant answers than baseline
prompting. Taken together, these contributions demonstrate the potential of
task-aware, component-level tuning to deliver legally grounded, reproducible,
and cost-effective RAG systems for legal research assistance.
[COMMENTS]
submitted to NLLP 2025 Workshop
[LINK]
http://arxiv.org/abs/2508.13107v2
[DATE]
2025-09-10 17:50:51+08:00
[CATEGORIES]
cs.CL
A Role-Aware Multi-Agent Framework for Financial Education Question Answering with LLMs
[AUTHORS]
Andy Zhu, Yingjun Du
[ABSTRACT]
Question answering (QA) plays a central role in financial education, yet
existing large language model (LLM) approaches often fail to capture the
nuanced and specialized reasoning required for financial problem-solving. The
financial domain demands multistep quantitative reasoning, familiarity with
domain-specific terminology, and comprehension of real-world scenarios. We
present a multi-agent framework that leverages role-based prompting to enhance
performance on domain-specific QA. Our framework comprises a Base Generator, an
Evidence Retriever, and an Expert Reviewer agent that work in a single-pass
iteration to produce a refined answer. We evaluated our framework on a set of
3,532 expert-designed finance education questions from Study.com, an online
learning platform. We leverage retrieval-augmented generation (RAG) for
contextual evidence from 6 finance textbooks and prompting strategies for a
domain-expert reviewer. Our experiments indicate that critique-based refinement
improves answer accuracy by 6.6-8.3% over zero-shot Chain-of-Thought baselines,
with the highest performance from Gemini-2.0-Flash. Furthermore, our method
enables GPT-4o-mini to achieve performance comparable to the finance-tuned
FinGPT-mt_Llama3-8B_LoRA. Our results show a cost-effective approach to
enhancing financial QA and offer insights for further research in multi-agent
financial LLM systems.
[COMMENTS]
8 pages, 6 figures, Under review
[LINK]
http://arxiv.org/abs/2509.09727v1
[DATE]
2025-09-10 17:40:18+08:00
[CATEGORIES]
cs.CL
CommonVoice-SpeechRE and RPG-MoGe: Advancing Speech Relation Extraction with a New Dataset and Multi-Order Generative Framework
[AUTHORS]
Jinzhong Ning, Paerhati Tulajiang, Yingying Le, Yijia Zhang, Yuanyuan Sun, Hongfei Lin, Haifeng Liu
[ABSTRACT]
Speech Relation Extraction (SpeechRE) aims to extract relation triplets
directly from speech. However, existing benchmark datasets rely heavily on
synthetic data, lacking sufficient quantity and diversity of real human speech.
Moreover, existing models also suffer from rigid single-order generation
templates and weak semantic alignment, substantially limiting their
performance. To address these challenges, we introduce CommonVoice-SpeechRE, a
large-scale dataset comprising nearly 20,000 real-human speech samples from
diverse speakers, establishing a new benchmark for SpeechRE research.
Furthermore, we propose the Relation Prompt-Guided Multi-Order Generative
Ensemble (RPG-MoGe), a novel framework that features: (1) a multi-order triplet
generation ensemble strategy, leveraging data diversity through diverse element
orders during both training and inference, and (2) CNN-based latent relation
prediction heads that generate explicit relation prompts to guide cross-modal
alignment and accurate triplet generation. Experiments show our approach
outperforms state-of-the-art methods, providing both a benchmark dataset and an
effective solution for real-world SpeechRE. The source code and dataset are
publicly available at https://github.com/NingJinzhong/SpeechRE_RPG_MoGe.
[LINK]
http://arxiv.org/abs/2509.08438v1
[DATE]
2025-09-10 17:35:43+08:00
[CATEGORIES]
cs.CL
Natural Language Translation of Formal Proofs through Informalization of Proof Steps and Recursive Summarization along Proof Structure
[AUTHORS]
Seiji Hattori, Takuya Matsuzaki, Makoto Fujiwara
[ABSTRACT]
This paper proposes a natural language translation method for
machine-verifiable formal proofs that leverages the informalization
(verbalization of formal language proof steps) and summarization capabilities
of LLMs. For evaluation, it was applied to formal proof data created in
accordance with natural language proofs taken from an undergraduate-level
textbook, and the quality of the generated natural language proofs was analyzed
in comparison with the original natural language proofs. Furthermore, we will
demonstrate that this method can output highly readable and accurate natural
language proofs by applying it to an existing formal proof library of the Lean
proof assistant.
[COMMENTS]
Submitted to INLG 2025 (accepted)
[LINK]
http://arxiv.org/abs/2509.09726v1
[DATE]
2025-09-10 17:22:12+08:00
[CATEGORIES]
cs.CL
BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025
[AUTHORS]
Chunyu Li, Xindi Zheng, Siqi Liu
[ABSTRACT]
Entity linking (EL) for biomedical text is typically benchmarked on
English-only corpora with flat mentions, leaving the more realistic scenario of
nested and multilingual mentions largely unexplored. We present our system for
the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task
(English & Russian), closing this gap with a lightweight pipeline that keeps
the original EL model intact and modifies only three task-aligned components:
(1) Two-stage retrieval-ranking. We leverage the same base encoder model in both
stages: the retrieval stage uses the original pre-trained model, while the
ranking stage applies domain-specific fine-tuning. (2) Boundary cues. In the
ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing
the encoder with an explicit, language-agnostic span prior that improves
robustness to overlap and nesting. (3) Dataset augmentation. We also
automatically expand the ranking training corpus with three complementary data
sources, enhancing coverage without extra manual annotation. On the BioNNE 2025
leaderboard, our two-stage system, bilingual BERT (BIBERT-Pipe), ranks third in
the multilingual track, demonstrating the effectiveness and competitiveness of
these minimal yet principled modifications. Code is publicly available at
https://github.com/Kaggle-Competitions-Code/BioNNE-L.
[LINK]
http://arxiv.org/abs/2509.09725v1
[DATE]
2025-09-10 17:14:25+08:00
[CATEGORIES]
cs.CL
Localizing Factual Inconsistencies in Attributable Text Generation
[AUTHORS]
Arie Cattan, Paul Roit, Shiyue Zhang, David Wan, Roee Aharoni, Idan Szpektor, Mohit Bansal, Ido Dagan
[ABSTRACT]
There has been an increasing interest in detecting hallucinations in
model-generated texts, both manually and automatically, at varying levels of
granularity. However, most existing methods fail to precisely pinpoint the
errors. In this work, we introduce QASemConsistency, a new formalism for
localizing factual inconsistencies in attributable text generation, at a
fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics,
we propose decomposing the generated text into minimal predicate-argument level
propositions, expressed as simple question-answer (QA) pairs, and assess
whether each individual QA pair is supported by a trusted reference text. As
each QA pair corresponds to a single semantic relation between a predicate and
an argument, QASemConsistency effectively localizes the unsupported
information. We first demonstrate the effectiveness of the QASemConsistency
methodology for human annotation, by collecting crowdsourced annotations of
granular consistency errors, while achieving a substantial inter-annotator
agreement. This benchmark includes more than 3K instances spanning various
tasks of attributable text generation. We also show that QASemConsistency
yields factual consistency scores that correlate well with human judgments.
Finally, we implement several methods for automatically detecting localized
factual inconsistencies, with both supervised entailment models and LLMs.
[COMMENTS]
Accepted for publication in Transactions of the Association for
Computational Linguistics (TACL), 2025. Authors pre-print
[LINK]
http://arxiv.org/abs/2410.07473v3
[DATE]
2025-09-10 17:05:33+08:00
[CATEGORIES]
cs.CL
How Far Are We from Optimal Reasoning Efficiency?
[AUTHORS]
Jiaxuan Gao, Shu Yan, Qixin Tan, Lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, Yi Wu
[ABSTRACT]
Large Reasoning Models (LRMs) demonstrate remarkable problem-solving
capabilities through extended Chain-of-Thought (CoT) reasoning but often
produce excessively verbose and redundant reasoning traces. This inefficiency
incurs high inference costs and limits practical deployment. While existing
fine-tuning methods aim to improve reasoning efficiency, assessing their
efficiency gains remains challenging due to inconsistent evaluations. In this
work, we introduce the reasoning efficiency frontiers, empirical upper bounds
derived from fine-tuning base LRMs across diverse approaches and training
configurations. Based on these frontiers, we propose the Reasoning Efficiency
Gap (REG), a unified metric quantifying deviations of any fine-tuned LRMs from
these frontiers. Systematic evaluation on challenging mathematical benchmarks
reveals significant gaps in current methods: they either sacrifice accuracy for
short length or still remain inefficient under tight token budgets. To reduce
the efficiency gap, we propose REO-RL, a class of Reinforcement Learning
algorithms that minimizes REG by targeting a sparse set of token budgets.
Leveraging numerical integration over strategically selected budgets, REO-RL
approximates the full efficiency objective with low error using a small set of
token budgets. Through systematic benchmarking, we demonstrate that our
efficiency metric, REG, effectively captures the accuracy-length trade-off,
with low-REG methods reducing length while maintaining accuracy. Our approach,
REO-RL, consistently reduces REG by >=50 across all evaluated LRMs and matches
Qwen3-4B/8B efficiency frontiers under a 16K token budget with minimal accuracy
loss. Ablation studies confirm the effectiveness of our exponential token
budget strategy. Finally, our findings highlight that fine-tuning LRMs to
perfectly align with the efficiency frontiers remains an open challenge.
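One way to picture a gap-to-frontier metric of this kind is as the area between a frontier accuracy curve and a model's accuracy curve over a handful of token budgets, approximated by numerical integration. The sketch below is only an illustration under that assumption: the abstract does not give the REG formula, and the function names, the trapezoid rule, and the clipping of negative gaps are all hypothetical choices, not the authors' definition.

```python
from typing import Sequence


def trapezoid_area(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Integrate y over x with the trapezoid rule (xs must be increasing)."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
        for i in range(len(xs) - 1)
    )


def efficiency_gap(
    budgets: Sequence[float],
    frontier_acc: Sequence[float],
    model_acc: Sequence[float],
) -> float:
    """Hypothetical gap: area between frontier and model accuracy curves.

    Assumption: a model above the frontier at some budget contributes zero,
    so each pointwise gap is clipped at 0.
    """
    if not (len(budgets) == len(frontier_acc) == len(model_acc)):
        raise ValueError("all inputs must have the same length")
    gaps = [max(f - m, 0.0) for f, m in zip(frontier_acc, model_acc)]
    return trapezoid_area(budgets, gaps)


# Illustrative numbers only, not results from the paper.
gap = efficiency_gap([1000, 2000, 4000], [0.5, 0.6, 0.7], [0.4, 0.6, 0.6])
```

Evaluating the curves at only a few budgets keeps the estimate cheap, which is the general idea behind approximating an integral objective from a sparse set of points.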
[LINK]
http://arxiv.org/abs/2506.07104v2
[DATE]
2025-09-10 17:03:04+08:00
[CATEGORIES]
cs.CL
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
[AUTHORS]
Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2412.14161v3
[DATE]
2025-09-10 16:35:19+08:00
[CATEGORIES]
cs.CL
Low-Resource Fine-Tuning for Multi-Task Structured Information Extraction with a Billion-Parameter Instruction-Tuned Model
[AUTHORS]
Yu Cheng Chih, Yong Hao Hou
[ABSTRACT]
Deploying large language models (LLMs) for structured data extraction in
domains such as financial compliance reporting, legal document analytics, and
multilingual knowledge base construction is often impractical for smaller teams
due to the high cost of running large architectures and the difficulty of
preparing large, high-quality datasets. Most recent instruction-tuning studies
focus on seven-billion-parameter or larger models, leaving limited evidence on
whether much smaller models can work reliably under low-resource, multi-task
conditions. This work presents ETLCH, a billion-parameter LLaMA-based model
fine-tuned with low-rank adaptation on only a few hundred to one thousand
samples per task for JSON extraction, knowledge graph extraction, and named
entity recognition. Despite its small scale, ETLCH outperforms strong baselines
across most evaluation metrics, with substantial gains observed even at the
lowest data scale. These findings demonstrate that well-tuned small models can
deliver stable and accurate structured outputs at a fraction of the
computational cost, enabling cost-effective and reliable information extraction
pipelines in resource-constrained environments.
[COMMENTS]
13 pages, 8 figures, includes experiments on JSON extraction,
knowledge graph extraction, and NER
[LINK]
http://arxiv.org/abs/2509.08381v1
[DATE]
2025-09-10 16:19:07+08:00
[CATEGORIES]
cs.CL
CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning
[AUTHORS]
Jianfeng Pan, Senyou Deng, Shaomang Huang
[ABSTRACT]
Research on LLM technologies is rapidly emerging, with most of them employing a
‘fast thinking’ approach to inference. Most LLMs generate the final result
based solely on a single query and the LLM’s reasoning capabilities. However, with
the advent of OpenAI-o1, ‘slow thinking’ techniques have garnered increasing
attention because their process is closer to the human thought process. Inspired
by the human ability to constantly associate and replenish knowledge during
thinking, we developed the novel Chain-of-Associated-Thoughts (CoAT) framework,
which introduces an innovative synergy between the Monte Carlo Tree Search
(MCTS) algorithm and a dynamic mechanism for integrating new key information,
termed ‘associative memory’. By combining the structured exploration
capabilities of MCTS with the adaptive learning capacity of associative memory,
CoAT significantly expands the LLM search space, enabling our framework to
explore diverse reasoning pathways and dynamically update its knowledge base in
real-time. This allows the framework to not only revisit and refine earlier
inferences but also adaptively incorporate evolving information, ensuring that
the final output is both accurate and comprehensive. We validate CoAT’s
effectiveness across a variety of generative and reasoning tasks. Quantitative
experiments show that CoAT achieves over 10% performance improvement on
open-source multi-hop reasoning datasets (HotpotQA, MuSiQue) and more than 15%
gain on our proprietary CRB dataset.
[COMMENTS]
18 pages, 10 figures
[LINK]
http://arxiv.org/abs/2502.02390v2
[DATE]
2025-09-10 16:09:02+08:00
[CATEGORIES]
cs.CL
[AUTHORS]
Sergey Pletenev, Daniil Moskovskiy, Alexander Panchenko
[LINK]
http://arxiv.org/abs/2509.08358v1
[DATE]
2025-09-10 15:48:24+08:00
[CATEGORIES]
cs.CL
Toward Subtrait-Level Model Explainability in Automated Writing Evaluation
[AUTHORS]
Alejandro Andrade-Lotero, Lee Becker, Joshua Southerland, Scott Hellman
[ABSTRACT]
Subtrait (latent-trait components) assessment presents a promising path
toward enhancing transparency of automated writing scores. We prototype
explainability and subtrait scoring with generative language models and show
modest correlation between human subtrait and trait scores, and between
automated and human subtrait scores. Our approach provides details to demystify
scores for educators and students.
[COMMENTS]
Accepted to National Council on Measurement in Education (NCME) 2025
Annual Meeting
[LINK]
http://arxiv.org/abs/2509.08345v1
[DATE]
2025-09-10 15:32:14+08:00
[CATEGORIES]
cs.CL
Your Language Model Can Secretly Write Like Humans: Contrastive Paraphrase Attacks on LLM-Generated Text Detectors
[AUTHORS]
Hao Fang, Jiawei Kong, Tianqu Zhuang, Yixiang Qiu, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang
[COMMENTS]
Accepted by EMNLP-2025
[LINK]
http://arxiv.org/abs/2505.15337v3
[DATE]
2025-09-10 15:03:03+08:00
[CATEGORIES]
cs.CL
Measuring Bias or Measuring the Task: Understanding the Brittle Nature of LLM Gender Biases
[AUTHORS]
Bufan Gao, Elisa Kreiss
[ABSTRACT]
As LLMs are increasingly applied in socially impactful settings, concerns
about gender bias have prompted growing efforts both to measure and mitigate
such bias. These efforts often rely on evaluation tasks that differ from
natural language distributions, as they typically involve carefully constructed
task prompts that overtly or covertly signal the presence of gender
bias-related content. In this paper, we examine how signaling the evaluative
purpose of a task impacts measured gender bias in LLMs. Concretely, we test
models under prompt conditions that (1) make the testing context salient, and
(2) make gender-focused content salient. We then assess prompt sensitivity
across four task formats with both token-probability and discrete-choice
metrics. We find that prompts that more clearly align with (gender bias)
evaluation framing elicit distinct gender output distributions compared to less
evaluation-framed prompts. Discrete-choice metrics further tend to amplify bias
relative to probabilistic measures. These findings not only highlight the
brittleness of LLM gender bias evaluations but also open a new puzzle for the NLP
benchmarking and development community: To what extent can well-controlled
testing designs trigger LLM “testing mode” performance, and what does this mean
for the ecological validity of future benchmarks?
[COMMENTS]
To be published at EMNLP 2025 (main conference)
[LINK]
http://arxiv.org/abs/2509.04373v3
[DATE]
2025-09-10 14:08:26+08:00
[CATEGORIES]
cs.CL
Towards Knowledge-Aware Document Systems: Modeling Semantic Coverage Relations via Answerability Detection
[AUTHORS]
Yehudit Aperstein, Alon Gottlib, Gal Benita, Alexander Apartsin
[ABSTRACT]
Understanding how information is shared across documents, regardless of the
format in which it is expressed, is critical for tasks such as information
retrieval, summarization, and content alignment. In this work, we introduce a
novel framework for modelling Semantic Coverage Relations (SCR), which
classifies document pairs based on how their informational content aligns. We
define three core relation types: equivalence, where both texts convey the same
information using different textual forms or styles; inclusion, where one
document fully contains the information of another and adds more; and semantic
overlap, where each document presents partially overlapping content. To capture
these relations, we adopt a question answering (QA)-based approach, using the
answerability of shared questions across documents as an indicator of semantic
coverage. We construct a synthetic dataset derived from the SQuAD corpus by
paraphrasing source passages and selectively omitting information, enabling
precise control over content overlap. This dataset allows us to benchmark
generative language models and train transformer-based classifiers for SCR
prediction. Our findings demonstrate that discriminative models significantly
outperform generative approaches, with the RoBERTa-base model achieving the
highest accuracy of 61.4% and the Random Forest-based model showing the best
balance with a macro-F1 score of 52.9%. The results show that QA provides an
effective lens for assessing semantic relations across stylistically diverse
texts, offering insights into the capacity of current models to reason about
information beyond surface similarity. The dataset and code developed in this
study are publicly available to support reproducibility.
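As a rough illustration of the QA-based idea, the three relation types can be read off the sets of shared questions that each document is able to answer. This is a minimal sketch, assuming answerability has already been decided per question; the function name, the label strings, and the extra "unrelated" case are hypothetical and not taken from the paper.

```python
from typing import Set


def coverage_relation(answerable_a: Set[str], answerable_b: Set[str]) -> str:
    """Label a document pair from the questions each document can answer.

    answerable_a / answerable_b hold the IDs of the shared questions that
    document A / document B answers (how answerability is judged is out of
    scope for this sketch).
    """
    if not (answerable_a & answerable_b):
        # Assumption: pairs with no commonly answerable question get a
        # fourth label that the abstract does not define.
        return "unrelated"
    if answerable_a == answerable_b:
        return "equivalence"
    if answerable_b < answerable_a:
        return "inclusion_a_contains_b"
    if answerable_a < answerable_b:
        return "inclusion_b_contains_a"
    return "semantic_overlap"


# A answers everything B answers plus one more question.
label = coverage_relation({"q1", "q2", "q3"}, {"q1", "q2"})
```

Reducing the comparison to set operations is what makes the relation independent of the surface form of either text.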
[COMMENTS]
27 pages, 1 figure
[LINK]
http://arxiv.org/abs/2509.08304v1
[DATE]
2025-09-10 14:00:01+08:00
[CATEGORIES]
cs.CL
Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation
[AUTHORS]
Shengxiang Gao, Jey Han Lau, Jianzhong Qi
[ABSTRACT]
Knowledge base question answering (KBQA) aims to answer user questions in
natural language using rich human knowledge stored in large KBs. As current
KBQA methods struggle with unseen knowledge base elements at test time, we
introduce SG-KBQA: a novel model that injects schema contexts into entity
retrieval and logical form generation to tackle this issue. It uses the richer
semantics and awareness of the knowledge base structure provided by schema
contexts to enhance generalizability. We show that SG-KBQA achieves strong
generalizability, outperforming state-of-the-art models on two commonly used
benchmark datasets across a variety of test settings. Our source code is
available at https://github.com/gaosx2000/SG_KBQA.
[COMMENTS]
Accepted by EMNLP 2025
[LINK]
http://arxiv.org/abs/2502.12737v3
[DATE]
2025-09-10 13:39:44+08:00
[CATEGORIES]
cs.CL
A Survey on Training-free Alignment of Large Language Models
[AUTHORS]
Birong Pan, Yongqi Li, Weiyu Zhang, Wenpeng Lu, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
[ABSTRACT]
The alignment of large language models (LLMs) aims to ensure their outputs
adhere to human values, ethical standards, and legal norms. Traditional
alignment methods often rely on resource-intensive fine-tuning (FT), which may
suffer from knowledge degradation and face challenges in scenarios where the
model accessibility or computational resources are constrained. In contrast,
training-free (TF) alignment techniques (leveraging in-context learning,
decoding-time adjustments, and post-generation corrections) offer a promising
alternative by enabling alignment without heavily retraining LLMs, making them
adaptable to both open-source and closed-source environments. This paper
presents the first systematic review of TF alignment methods, categorizing them
by stages of pre-decoding, in-decoding, and post-decoding. For each stage, we
provide a detailed examination from the viewpoint of LLMs and multimodal LLMs
(MLLMs), highlighting their mechanisms and limitations. Furthermore, we
identify key challenges and future directions, paving the way for more
inclusive and effective TF alignment techniques. By synthesizing and organizing
the rapidly growing body of research, this survey offers guidance for
practitioners and advances the development of safer and more reliable LLMs.
[COMMENTS]
Accepted to EMNLP 2025 (findings), camera-ready version
[LINK]
http://arxiv.org/abs/2508.09016v4
[DATE]
2025-09-10 13:08:47+08:00
[CATEGORIES]
cs.CL
cs.LG
Prior Prompt Engineering for Reinforcement Fine-Tuning
[AUTHORS]
Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul
[COMMENTS]
Accepted at EMNLP 2025, Main; 26 pages, 42 figures
[LINK]
http://arxiv.org/abs/2505.14157v2
[DATE]
2025-09-10 12:33:47+08:00
[CATEGORIES]
cs.CL
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
[AUTHORS]
Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
[ABSTRACT]
Long-form video processing fundamentally challenges vision-language models
(VLMs) due to the high computational costs of handling extended temporal
sequences. Existing token pruning and feature merging methods often sacrifice
critical temporal dependencies or dilute semantic information. We introduce
differential distillation, a principled approach that systematically preserves
task-relevant information while suppressing redundancy. Based on this
principle, we develop ViLAMP, a hierarchical video-language model that
processes hour-long videos at “mixed precision” through two key mechanisms: (1)
differential keyframe selection that maximizes query relevance while
maintaining temporal distinctiveness at the frame level and (2) differential
feature merging that preserves query-salient features in non-keyframes at the
patch level. Hence, ViLAMP retains full information in keyframes while reducing
non-keyframes to their most salient features, resembling mixed-precision
training. Extensive experiments demonstrate ViLAMP’s superior performance
across four video understanding benchmarks, particularly on long-form content.
Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single
NVIDIA A100 GPU, achieving substantial computational efficiency while
maintaining state-of-the-art performance. Code and model are available at
https://github.com/steven-ccq/ViLAMP.
[COMMENTS]
Accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2504.02438v5
[DATE]
2025-09-10 12:22:46+08:00
[CATEGORIES]
cs.CL
ALIGNS: Unlocking nomological networks in psychological measurement through a large language model
[AUTHORS]
Kai R. Larsen, Sen Yan, Roland Müller, Lan Sang, Mikko Rönkkö, Ravi Starzl, Donald Edmondson
[ABSTRACT]
Psychological measurement is critical to many disciplines. Despite advances
in measurement, building nomological networks, theoretical maps of how concepts
and measures relate to establish validity, remains a challenge 70 years after
Cronbach and Meehl proposed them as fundamental to validation. This limitation
has practical consequences: clinical trials may fail to detect treatment
effects, and public policy may target the wrong outcomes. We introduce Analysis
of Latent Indicators to Generate Nomological Structures (ALIGNS), a large
language model-based system trained with validated questionnaire measures.
ALIGNS provides three comprehensive nomological networks containing over
550,000 indicators across psychology, medicine, social policy, and other
fields. This represents the first application of large language models to solve
a foundational problem in measurement validation. We report classification
accuracy tests used to develop the model, as well as three evaluations. In the
first evaluation, the widely used NIH PROMIS anxiety and depression instruments
are shown to converge into a single dimension of emotional distress. The second
evaluation examines child temperament measures and identifies four potential
dimensions not captured by current frameworks, and questions one existing
dimension. The third evaluation, an applicability check, engages expert
psychometricians who assess the system’s importance, accessibility, and
suitability. ALIGNS is freely available at nomologicalnetwork.org,
complementing traditional validation methods with large-scale nomological
analysis.
[LINK]
http://arxiv.org/abs/2509.09723v1
[DATE]
2025-09-10 12:21:02+08:00
[CATEGORIES]
cs.CL
cs.LG
ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning
[AUTHORS]
Jianghao Chen, Wei Sun, Qixiang Yin, Lingxing Kong, Zhixing Tan, Jiajun Zhang
[ABSTRACT]
Large Language Models (LLMs) have demonstrated remarkable progress in
long-context understanding, yet they face significant challenges in
high-quality long-form generation. Existing studies primarily suffer from two
limitations: (1) A heavy reliance on scarce, high-quality long-form response
data for supervised fine-tuning (SFT) or for pairwise preference reward in
reinforcement learning (RL). (2) Focus on coarse-grained quality optimization
dimensions, such as relevance, coherence, and helpfulness, overlooking the
fine-grained specifics inherent to diverse long-form generation scenarios. To
address this issue, we propose a framework using Adaptive Constraint-Enhanced
reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first
automatically deconstructs each instruction into a set of fine-grained,
adaptive constraint criteria by identifying its underlying intents and demands.
Subsequently, we design a reward mechanism that quantifies the quality of
long-form responses based on their satisfaction over corresponding constraints,
converting subjective quality evaluation into constraint verification. Finally,
we utilize reinforcement learning to guide models toward superior long-form
generation capabilities. Experimental results demonstrate that our ACE-RL
framework significantly outperforms existing SFT and RL baselines by 20.70% and
7.32% on WritingBench, and our top-performing model even surpasses proprietary
systems like GPT-4o by 7.10%, providing a more effective training paradigm for
LLMs to generate high-quality content across diverse long-form generation
scenarios.
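The step of turning subjective quality evaluation into constraint verification can be sketched as a reward equal to the fraction of constraints a response satisfies. This is a simplified illustration, assuming each constraint can be checked by a boolean function; the helper name and the toy string-based checkers are hypothetical, and the paper's actual verifier and reward shaping may differ.

```python
from typing import Callable, Sequence

Constraint = Callable[[str], bool]


def constraint_reward(response: str, constraints: Sequence[Constraint]) -> float:
    """Return the share of constraints that the response satisfies (0.0 to 1.0)."""
    if not constraints:
        raise ValueError("at least one constraint is required")
    satisfied = sum(1 for check in constraints if check(response))
    return satisfied / len(constraints)


# Toy constraints for an instruction such as "write a short notice to the
# team about the launch delay"; a real system would verify these with a model.
toy_constraints = [
    lambda r: r.startswith("Dear"),   # addresses the reader
    lambda r: "launch" in r,          # mentions the topic
    lambda r: len(r.split()) >= 100,  # meets a length requirement
]
reward = constraint_reward("Dear team,\nThe launch is delayed.\nRegards", toy_constraints)
```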
[COMMENTS]
Under review, our code is available at https://github.com/ZNLP/ACE-RL
[LINK]
http://arxiv.org/abs/2509.04903v2
[DATE]
2025-09-10 12:00:39+08:00
[CATEGORIES]
cs.CL
That’s So FETCH: Fashioning Ensemble Techniques for LLM Classification in Civil Legal Intake and Referral
[AUTHORS]
Quinten Steenhuis
[ABSTRACT]
Each year millions of people seek help for their legal problems by calling a
legal aid program hotline, walking into a legal aid office, or using a lawyer
referral service. The first step to match them to the right help is to identify
the legal problem the applicant is experiencing. Misdirection has consequences.
Applicants may miss a deadline, experience physical abuse, lose housing or lose
custody of children while waiting to connect to the right legal help. We
introduce and evaluate the FETCH classifier for legal issue classification and
describe two methods for improving accuracy: a hybrid LLM/ML ensemble
classification method, and the automatic generation of follow-up questions to
enrich the initial problem narrative. We employ a novel data set of 419
real-world queries to a nonprofit lawyer referral service. Ultimately, we show
classification accuracy (hits@2) of 97.37% using a mix of inexpensive models,
exceeding the performance of the current state-of-the-art GPT-5 model. Our
approach shows promise in significantly reducing the cost of guiding users of
the legal system to the right resource for their problem while achieving high
accuracy.
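For readers unfamiliar with the metric, hits@k counts a prediction as correct when the gold label appears among the top k ranked labels. The sketch below shows that computation; the function name and the example legal-issue labels are made up for illustration and do not come from the paper's data set.

```python
from typing import Sequence


def hits_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold_labels: Sequence[str],
    k: int = 2,
) -> float:
    """Fraction of examples whose gold label is within the top-k predictions."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must align")
    if not gold_labels:
        raise ValueError("at least one example is required")
    hits = sum(
        gold in preds[:k] for preds, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)


# Hypothetical ranked outputs for three intake queries.
predictions = [
    ["housing", "family"],
    ["family", "consumer"],
    ["consumer", "housing", "family"],
]
gold = ["family", "consumer", "family"]
score = hits_at_k(predictions, gold, k=2)
```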
[COMMENTS]
Submission to JURIX 2025
[LINK]
http://arxiv.org/abs/2509.07170v2
[DATE]
2025-09-10 11:09:10+08:00
[CATEGORIES]
cs.CL
HIRAG: Hierarchical-Thought Instruction-Tuning Retrieval-Augmented Generation
[AUTHORS]
YiHan Jiao, ZheHao Tan, Dan Yang, DuoLin Sun, Jie Feng, Yue Shen, Jian Wang, Peng Wei
[ABSTRACT]
Retrieval-augmented generation (RAG) has become a fundamental paradigm for
addressing the challenges faced by large language models in handling real-time
information and domain-specific problems. Traditional RAG systems primarily
rely on the in-context learning (ICL) capabilities of the large language model
itself. Still, in-depth research on the specific capabilities needed by the RAG
generation model is lacking, leading to challenges with inconsistent document
quality and retrieval system imperfections. Even the limited studies that
fine-tune RAG generative models often lack a granular focus on the RAG
task or a deeper utilization of chain-of-thought processes. To
address this, we propose that RAG models should possess three progressively
hierarchical abilities: (1) Filtering: the ability to select relevant
information; (2) Combination: the ability to combine semantic information
across paragraphs; and (3) RAG-specific reasoning: the ability to further
process external knowledge using internal knowledge. Thus, we introduce our new
RAG instruction fine-tuning method, Hierarchical-Thought Instruction-Tuning
Retrieval-Augmented Generation (HIRAG), which incorporates a “think before answering”
strategy. This method enhances the model’s open-book examination capability by
utilizing multi-level progressive chain-of-thought. Experiments show that the
HIRAG training strategy significantly improves the model’s performance on
datasets such as RGB, PopQA, MuSiQue, HotpotQA, and PubmedQA.
[LINK]
http://arxiv.org/abs/2507.05714v3
[DATE]
2025-09-10 11:00:18+08:00
[CATEGORIES]
cs.CL
MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision
[AUTHORS]
Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, Yu Wang
[ABSTRACT]
Medical language models face critical barriers to real-world clinical
reasoning applications. However, mainstream efforts, which fall short in task
coverage, lack fine-grained supervision for intermediate reasoning steps, and
rely on proprietary systems, are still far from a versatile, credible and
efficient language model for clinical reasoning usage. To this end, we propose
MedS$^3$, a self-evolving framework that imparts robust reasoning capabilities to
small, deployable models. Starting with 8,000 curated instances sampled via a
curriculum strategy across five medical domains and 16 datasets, we use a small
base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing
rule-verifiable reasoning trajectories. Self-explored reasoning trajectories
ranked by node values are used to bootstrap the policy model via reinforcement
fine-tuning and preference learning. Moreover, we introduce a soft dual process
reward model that incorporates value dynamics: steps that degrade node value
are penalized, enabling fine-grained identification of reasoning errors even
when the final answer is correct. Experiments on eleven benchmarks show that
MedS$^3$ outperforms the previous state-of-the-art medical model by +6.45 accuracy
points and surpasses 32B-scale general-purpose reasoning models by +8.57
points. Additional empirical analysis further demonstrates that MedS$^3$ achieves
robust and faithful reasoning behavior.
[COMMENTS]
20 pages
[LINK]
http://arxiv.org/abs/2501.12051v3
[DATE]
2025-09-10 10:53:11+08:00
[CATEGORIES]
cs.CL
CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
[AUTHORS]
Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
[ABSTRACT]
Large language models have achieved remarkable success in various tasks but
suffer from high computational costs during inference, limiting their
deployment in resource-constrained applications. To address this issue, we
propose a novel Collaborative Inference with Token-lEvel Routing (CITER)
framework that enables efficient collaboration between small and large language
models (SLMs & LLMs) through a token-level routing strategy. Specifically,
CITER routes non-critical tokens to an SLM for efficiency and routes critical
tokens to an LLM for generalization quality. We formulate router training as a
policy optimization, where the router receives rewards based on both the
quality of predictions and the inference costs of generation. This allows the
router to learn to predict token-level routing scores and make routing
decisions based on both the current token and the future impact of its
decisions. To further accelerate the reward evaluation process, we introduce a
shortcut that significantly reduces the cost of reward estimation and
improves the practicality of our approach. Extensive experiments on five
benchmark datasets demonstrate that CITER reduces the inference costs while
preserving high-quality generation, offering a promising solution for real-time
and resource-constrained applications. Our data and code are available at
https://github.com/aiming-lab/CITER.
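The inference-time side of token-level routing can be sketched as a decoding loop in which a router scores the current context and each token is produced by either the small or the large model. This is a minimal sketch under stated assumptions: the router and both models are passed in as plain callables, the fixed threshold is a hypothetical decision rule, and the policy-optimization training of the router described above is not shown.

```python
from typing import Callable, List, Tuple

Context = List[str]


def collaborative_decode(
    prompt: Context,
    max_new_tokens: int,
    router_score: Callable[[Context], float],
    small_model: Callable[[Context], str],
    large_model: Callable[[Context], str],
    threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    """Generate tokens one at a time, choosing a model per token.

    Returns the generated tokens and a trace of which model produced each.
    Assumption: a router score at or above the threshold marks the next
    token as critical and sends it to the large model.
    """
    context = list(prompt)
    generated: List[str] = []
    trace: List[str] = []
    for _ in range(max_new_tokens):
        use_large = router_score(context) >= threshold
        model = large_model if use_large else small_model
        token = model(context)
        generated.append(token)
        trace.append("LLM" if use_large else "SLM")
        context.append(token)
    return generated, trace


# Stub components for illustration only.
tokens, trace = collaborative_decode(
    prompt=["a", "b"],
    max_new_tokens=3,
    router_score=lambda ctx: 0.9 if len(ctx) % 2 == 0 else 0.1,
    small_model=lambda ctx: "s",
    large_model=lambda ctx: "L",
)
```

The trace makes the cost trade-off visible: the more entries marked "SLM", the fewer large-model calls a sequence needs.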
[LINK]
http://arxiv.org/abs/2502.01976v6
[DATE]
2025-09-10 10:45:51+08:00
[CATEGORIES]
cs.CL
cs.LG
CoT-RAG: Integrating Chain of Thought and Retrieval-Augmented Generation to Enhance Reasoning in Large Language Models
[AUTHORS]
Feiyang Li, Peng Fang, Zhan Shi, Arijit Khan, Fang Wang, Weihao Wang, Xin Zhang, Yongjian Cui
[ABSTRACT]
Chain-of-thought (CoT) reasoning boosts large language models’ (LLMs)
performance on complex tasks but faces two key limitations: a lack of
reliability when solely relying on LLM-generated reasoning chains and lower
reasoning performance from natural language prompts compared with code prompts.
To address these issues, we propose CoT-RAG, a novel reasoning framework with
three key designs: (i) Knowledge Graph-driven CoT Generation, featuring
knowledge graphs to modulate reasoning chain generation of LLMs, thereby
enhancing reasoning credibility; (ii) Learnable Knowledge Case-aware RAG, which
incorporates retrieval-augmented generation (RAG) into knowledge graphs to
retrieve relevant sub-cases and sub-descriptions, providing LLMs with learnable
information; (iii) Pseudo Program Prompting Execution, which promotes greater
logical rigor by guiding LLMs to execute reasoning tasks as pseudo-programs.
Evaluations on nine public datasets spanning three reasoning tasks reveal
significant accuracy gains, ranging from 4.0% to 44.3%, over state-of-the-art
methods. Furthermore, tests on four domain-specific datasets demonstrate
exceptional accuracy and efficient execution, underscoring its practical
applicability and scalability. Our code and data are available at
https://github.com/hustlfy123/CoT-RAG.
[LINK]
http://arxiv.org/abs/2504.13534v3
[DATE]
2025-09-10 10:38:49+08:00
[CATEGORIES]
cs.CL
VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
[AUTHORS]
Sam Yu-Te Lee, Chenyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma
[ABSTRACT]
Text analytics has traditionally required specialized knowledge in Natural
Language Processing (NLP) or text analysis, which presents a barrier for
entry-level analysts. Recent advances in large language models (LLMs) have
changed the landscape of NLP by enabling more accessible and automated text
analysis (e.g., topic detection, summarization, and information extraction).
We introduce VIDEE, a system that supports entry-level data analysts to conduct
advanced text analytics with intelligent agents. VIDEE instantiates a
human-agent collaboration workflow consisting of three stages: (1)
Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search
algorithm to support generative reasoning with human feedback, (2) Execution,
which generates an executable text analytics pipeline, and (3) Evaluation,
which integrates LLM-based evaluation and visualizations to support user
validation of execution results. We conduct two quantitative experiments to
evaluate VIDEE’s effectiveness and analyze common agent errors. A user study
involving participants with varying levels of NLP and text analytics experience
– from none to expert – demonstrates the system’s usability and reveals
distinct user behavior patterns. The findings identify design implications for
human-agent collaboration, validate the practical utility of VIDEE for
non-expert users, and inform future improvements to intelligent text analytics
systems.
[LINK]
http://arxiv.org/abs/2506.21582v3
[DATE]
2025-09-10 10:31:17+08:00
[CATEGORIES]
cs.CL
ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking
[AUTHORS]
Jian Chen, Jinbao Tian, Yankui Li, Yuqi Lu, Zhou Li
[ABSTRACT]
Accurate information extraction from specialized texts is a critical
challenge, particularly for named entity recognition (NER) in the architecture,
engineering, and construction (AEC) domain to support automated rule checking
(ARC). The performance of standard pre-trained models is often constrained by
the domain gap, as they struggle to interpret the specialized terminology and
complex relational contexts inherent in AEC texts. Although this issue can be
mitigated by further pre-training on large, human-curated domain corpora, as
exemplified by methods like ARCBERT, this approach is both labor-intensive and
cost-prohibitive. Consequently, leveraging large language models (LLMs) for
automated knowledge generation has emerged as a promising alternative. However,
the optimal strategy for generating knowledge that can genuinely enhance
smaller, efficient models remains an open question. To address this, we propose
ARCE (augmented RoBERTa with contextualized elucidations), a novel approach
that systematically explores and optimizes this generation process. ARCE
employs an LLM to first generate a corpus of simple, direct explanations, which
we term Cote, and then uses this corpus to incrementally pre-train a RoBERTa
model prior to its fine-tuning on the downstream task. Our extensive
experiments show that ARCE establishes a new state-of-the-art on a benchmark
AEC dataset, achieving a Macro-F1 score of 77.20%. This result also reveals a
key finding: simple, explanation-based knowledge proves surprisingly more
effective than complex, role-based rationales for this task. The code is
publicly available at: https://github.com/nxcc-lab/ARCE.
[LINK]
http://arxiv.org/abs/2508.07286v2
[DATE]
2025-09-10 10:25:40+08:00
[CATEGORIES]
cs.CL
DomainCQA: Crafting Knowledge-Intensive QA from Domain-Specific Charts
[AUTHORS]
Yujing Lu, Ling Zhong, Jing Yang, Weiming Li, Peng Wei, Yongheng Wang, Manni Duan, Qing Zhang
[ABSTRACT]
Chart Question Answering (CQA) evaluates Multimodal Large Language Models
(MLLMs) on visual understanding and reasoning over chart data. However,
existing benchmarks mostly test surface-level parsing, such as reading labels
and legends, while overlooking deeper scientific reasoning. We propose
DomainCQA, a framework for constructing domain-specific CQA benchmarks that
emphasize both visual comprehension and knowledge-intensive reasoning. It
integrates complexity-aware chart selection, multitier QA generation, and
expert validation. Applied to astronomy, DomainCQA yields AstroChart, a
benchmark of 1,690 QA pairs over 482 charts, exposing persistent weaknesses in
fine-grained perception, numerical reasoning, and domain knowledge integration
across 21 MLLMs. Fine-tuning on AstroChart improves performance across
fundamental and advanced tasks. Pilot QA sets in biochemistry, economics,
medicine, and social science further demonstrate DomainCQA’s generality.
Together, our results establish DomainCQA as a unified pipeline for
constructing and augmenting domain-specific chart reasoning benchmarks.
[COMMENTS]
85 pages, 59 figures
[LINK]
http://arxiv.org/abs/2503.19498v4
[DATE]
2025-09-10 10:18:09+08:00
[CATEGORIES]
cs.CL
DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge
[AUTHORS]
Zonghai Yao, Michael Sun, Won Seok Jang, Sunjae Kwon, Soie Kwon, Hong Yu
[ABSTRACT]
Discharge communication is a critical yet underexplored component of patient
care, where the goal shifts from diagnosis to education. While recent large
language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they
fail to evaluate models’ ability to support patients after the visit. We
introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability
to act as personalized discharge educators. DischargeSim simulates post-visit,
multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with
diverse psychosocial profiles (e.g., health literacy, education, emotion).
Interactions are structured across six clinically grounded discharge topics and
assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge
evaluation, (2) personalized document generation including free-text summaries
and structured AHRQ checklists, and (3) patient comprehension through a
downstream multiple-choice exam. Experiments across 18 LLMs reveal significant
gaps in discharge education capability, with performance varying widely across
patient profiles. Notably, model size does not always yield better education
outcomes, highlighting trade-offs in strategy use and content prioritization.
DischargeSim offers a first step toward benchmarking LLMs in post-visit
clinical education and promoting equitable, personalized patient support.
[COMMENTS]
Equal contribution for the first two authors. To appear in the
proceedings of the Main Conference on Empirical Methods in Natural Language
Processing (EMNLP) 2025
[LINK]
http://arxiv.org/abs/2509.07188v2
[DATE]
2025-09-10 09:37:06+08:00
[CATEGORIES]
cs.CL
RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
[AUTHORS]
Zhenyuan Chen, Chenxi Wang, Feng Zhang
[ABSTRACT]
Remote sensing is critical for disaster monitoring, yet existing datasets
lack temporal image pairs and detailed textual annotations. While
single-snapshot imagery dominates current resources, it fails to capture
dynamic disaster impacts over time. To address this gap, we introduce the
Remote Sensing Change Caption (RSCC) dataset, a large-scale benchmark
comprising 62,315 pre-/post-disaster image pairs (spanning earthquakes, floods,
wildfires, and more) paired with rich, human-like change captions. By bridging
the temporal and semantic divide in remote sensing data, RSCC enables robust
training and evaluation of vision-language models for disaster-aware
bi-temporal understanding. Our results highlight RSCC’s ability to facilitate
detailed disaster-related analysis, paving the way for more accurate,
interpretable, and scalable vision-language applications in remote sensing.
Code and dataset are available at https://github.com/Bili-Sakura/RSCC.
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2509.01907v3
[DATE]
2025-09-10 09:09:56+08:00
[CATEGORIES]
cs.CL
Tokenizing Loops of Antibodies
[AUTHORS]
Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer
[ABSTRACT]
The complementarity-determining regions of antibodies are loop structures
that are key to their interactions with antigens, and of high importance to the
design of novel biologics. Since the 1980s, categorizing the diversity of CDR
structures into canonical clusters has enabled the identification of key
structural motifs of antibodies. However, existing approaches have limited
coverage and cannot be readily incorporated into protein foundation models.
Here we introduce ImmunoGlobulin LOOp Tokenizer, Igloo, a multimodal antibody
loop tokenizer that encodes backbone dihedral angles and sequence. Igloo is
trained using a contrastive learning objective to map loops with similar
backbone dihedral angles closer together in latent space. Igloo can efficiently
retrieve the closest matching loop structures from a structural antibody
database, outperforming existing methods on identifying similar H3 loops by
5.9%. Igloo assigns tokens to all loops, addressing the limited coverage issue
of canonical clusters, while retaining the ability to recover canonical loop
conformations. To demonstrate the versatility of Igloo tokens, we show that
they can be incorporated into protein language models with IglooLM and
IglooALM. On predicting binding affinity of heavy chain variants, IglooLM
outperforms the base protein language model on 8 out of 10 antibody-antigen
targets. Additionally, it is on par with existing state-of-the-art
sequence-based and multimodal protein language models, performing comparably to
models with $7\times$ more parameters. IglooALM samples antibody loops which
are diverse in sequence and more consistent in structure than state-of-the-art
antibody inverse folding models. Igloo demonstrates the benefit of introducing
multimodal tokens for antibody loops for encoding the diverse landscape of
antibody loops, improving protein foundation models, and for antibody CDR
design.
[COMMENTS]
21 pages, 7 figures, 10 tables, code available at
https://github.com/prescient-design/igloo
[LINK]
http://arxiv.org/abs/2509.08707v1
[DATE]
2025-09-10 23:56:19+08:00
[CATEGORIES]
cs.LG
Deep Unrolling of Sparsity-Induced RDO for 3D Point Cloud Attribute Coding
[AUTHORS]
Tam Thuc Do, Philip A. Chou, Gene Cheung
[ABSTRACT]
Given encoded 3D point cloud geometry available at the decoder, we study the
problem of lossy attribute compression in a multi-resolution B-spline
projection framework. A target continuous 3D attribute function is first
projected onto a sequence of nested subspaces $\mathcal{F}^{(p)}_{l_0}
\subseteq \cdots \subseteq \mathcal{F}^{(p)}_{L}$, where
$\mathcal{F}^{(p)}_{l}$ is a family of functions spanned by a B-spline basis
function of order $p$ at a chosen scale and its integer shifts. The projected
low-pass coefficients $F_l^*$ are computed by variable-complexity unrolling of
a rate-distortion (RD) optimization algorithm into a feed-forward network,
where the rate term is the sparsity-promoting $\ell_1$-norm. Thus, the
projection operation is end-to-end differentiable. For a chosen coarse-to-fine
predictor, the coefficients are then adjusted to account for the prediction
from a lower resolution to a higher resolution, which is also optimized in a
data-driven manner.
[LINK]
http://arxiv.org/abs/2509.08685v1
[DATE]
2025-09-10 23:23:21+08:00
[CATEGORIES]
cs.LG
Perfectly-Private Analog Secure Aggregation in Federated Learning
[AUTHORS]
Delio Jaramillo-Velez, Charul Rajput, Ragnar Freij-Hollanti, Camilla Hollanti, Alexandre Graell i Amat
[ABSTRACT]
In federated learning, multiple parties train models locally and share their
parameters with a central server, which aggregates them to update a global
model. To address the risk of exposing sensitive data through local models,
secure aggregation via secure multiparty computation has been proposed to
enhance privacy. At the same time, perfect privacy can only be achieved by a
uniform distribution of the masked local models to be aggregated. This raises a
problem when working with real-valued data, as there is no measure on the reals
that is invariant under the masking operation, and hence information leakage is
bound to occur. Shifting the data to a finite field circumvents this problem,
but as a downside runs into an inherent accuracy-complexity tradeoff due
to fixed-point modular arithmetic, as opposed to floating-point numbers that can
simultaneously handle numbers of varying magnitudes. In this paper, a novel
secure parameter aggregation method is proposed that employs the torus rather
than a finite field. This approach guarantees perfect privacy for each party’s
data by utilizing the uniform distribution on the torus, while avoiding
accuracy losses. Experimental results show that the new protocol performs
similarly to the model without secure aggregation while maintaining perfect
privacy. Compared to the finite field secure aggregation, the torus-based
protocol can in some cases significantly outperform it in terms of model
accuracy and cosine similarity, hence making it a safer choice.
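The masking idea described above can be illustrated with a minimal sketch. It assumes a standard pairwise-cancelling mask construction on the one-dimensional torus [0, 1) with addition modulo 1, which is not necessarily the protocol of the paper; the function names and the requirement that the true sum stays below 1 are assumptions made for illustration.

```python
import random

def mask_on_torus(values, rng):
    """Add pairwise-cancelling uniform masks to each party's value, modulo 1."""
    n = len(values)
    masked = list(values)
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.random()  # mask shared by parties i and j, uniform on [0, 1)
            masked[i] = (masked[i] + r) % 1.0  # party i adds the mask
            masked[j] = (masked[j] - r) % 1.0  # party j subtracts the same mask
    return masked

def aggregate(masked):
    """Server-side aggregation: the masks cancel in the sum modulo 1."""
    return sum(masked) % 1.0

rng = random.Random(0)
# Hypothetical local values, pre-scaled so that the true sum (0.70) is below 1.
local_values = [0.10, 0.25, 0.05, 0.30]
masked_values = mask_on_torus(local_values, rng)
recovered_sum = aggregate(masked_values)
```

Because each masked value is a local value plus a uniform draw modulo 1, it is itself uniform on [0, 1), while the aggregate still equals the true sum up to floating-point rounding.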
[COMMENTS]
Comments welcome
[LINK]
http://arxiv.org/abs/2509.08683v1
[DATE]
2025-09-10 23:22:40+08:00
[CATEGORIES]
cs.LG
Signal Fidelity Index-Aware Calibration for Dementia Predictions Across Heterogeneous Real-World Data
[AUTHORS]
Jingya Cheng, Jiazi Tian, Federica Spoto, Alaleh Azhir, Daniel Mork, Hossein Estiri
[ABSTRACT]
Background: Machine learning models trained on electronic health
records (EHRs) often degrade across healthcare systems due to distributional
shift. A fundamental but underexplored factor is diagnostic signal decay:
variability in diagnostic quality and consistency across institutions, which
affects the reliability of codes used for training and prediction.
Objective: To develop a Signal Fidelity Index (SFI) quantifying
diagnostic data quality at the patient level in dementia, and to test SFI-aware
calibration for improving model performance across heterogeneous datasets
without outcome labels.
Methods: We built a simulation framework generating 2,500 synthetic
datasets, each with 1,000 patients and realistic demographics, encounters, and
coding patterns based on dementia risk factors. The SFI was derived from six
interpretable components: diagnostic specificity, temporal consistency,
entropy, contextual concordance, medication alignment, and trajectory
stability. SFI-aware calibration applied a multiplicative adjustment, optimized
across 50 simulation batches.
Results: At the optimal parameter ($\alpha = 2.0$), SFI-aware
calibration significantly improved all metrics (p < 0.001). Gains ranged from
10.3% for Balanced Accuracy to 32.5% for Recall, with notable increases in
Precision (31.9%) and F1-score (26.1%). Performance approached reference
standards, with F1-score and Recall within 1% and Balanced Accuracy and
Detection Rate improved by 52.3% and 41.1%, respectively.
Conclusions: Diagnostic signal decay is a tractable barrier to model
generalization. SFI-aware calibration provides a practical, label-free strategy
to enhance prediction across healthcare contexts, particularly for large-scale
administrative datasets lacking outcome labels.
[LINK]
http://arxiv.org/abs/2509.08679v1
[DATE]
2025-09-10 23:19:04+08:00
[CATEGORIES]
cs.LG
Predicting the Performance of Graph Convolutional Networks with Spectral Properties of the Graph Laplacian
[AUTHORS]
Shalima Binta Manir, Tim Oates
[ABSTRACT]
A common observation in the Graph Convolutional Network (GCN) literature is
that stacking GCN layers may or may not result in better performance on tasks
like node classification and edge prediction. We have found empirically that a
graph’s algebraic connectivity, which is known as the Fiedler value, is a good
predictor of GCN performance. Intuitively, graphs with similar Fiedler values
have analogous structural properties, suggesting that the same filters and
hyperparameters may yield similar results when used with GCNs, and that
transfer learning may be more effective between graphs with similar algebraic
connectivity. We explore this theoretically and empirically with experiments on
synthetic and real graph data, including the Cora, CiteSeer and Polblogs
datasets. We explore multiple ways of aggregating the Fiedler value for
connected components in the graphs to arrive at a value for the entire graph,
and show that it can be used to predict GCN performance. We also present
theoretical arguments as to why the Fiedler value is a good predictor.
[COMMENTS]
9 pages, 3 figures
[LINK]
http://arxiv.org/abs/2508.12993v2
[DATE]
2025-09-10 23:06:14+08:00
[CATEGORIES]
cs.LG
Replicable Reinforcement Learning with Linear Function Approximation
[AUTHORS]
Eric Eaton, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell
[ABSTRACT]
Replication of experimental results has been a challenge faced by many
scientific disciplines, including the field of machine learning. Recent work on
the theory of machine learning has formalized replicability as the demand that
an algorithm produce identical outcomes when executed twice on different
samples from the same distribution. Provably replicable algorithms are
especially interesting for reinforcement learning (RL), where algorithms are
known to be unstable in practice. While replicable algorithms exist for tabular
RL settings, extending these guarantees to more practical function
approximation settings has remained an open problem. In this work, we make
progress by developing replicable methods for linear function approximation in
RL. We first introduce two efficient algorithms for replicable random design
regression and uncentered covariance estimation, each of independent interest.
We then leverage these tools to provide the first provably efficient replicable
RL algorithms for linear Markov decision processes in both the generative model
and episodic settings. Finally, we evaluate our algorithms experimentally and
show how they can inspire more consistent neural policies.
[LINK]
http://arxiv.org/abs/2509.08660v1
[DATE]
2025-09-10 22:56:09+08:00
[CATEGORIES]
cs.LG
Robust Belief-State Policy Learning for Quantum Network Routing Under Decoherence and Time-Varying Conditions
[AUTHORS]
Amirhossein Taherpour, Abbas Taherpour, Tamer Khattab
[ABSTRACT]
This paper presents a feature-based Partially Observable Markov Decision
Process (POMDP) framework for quantum network routing, combining belief-state
planning with Graph Neural Networks (GNNs) to address partial observability,
decoherence, and scalability challenges in dynamic quantum systems. Our
approach encodes complex quantum network dynamics, including entanglement
degradation and time-varying channel noise, into a low-dimensional feature
space, enabling efficient belief updates and scalable policy learning. The core
of our framework is a hybrid GNN-POMDP architecture that processes
graph-structured representations of entangled links to learn routing policies,
coupled with a noise-adaptive mechanism that fuses POMDP belief updates with
GNN outputs for robust decision making. We provide a theoretical analysis
establishing guarantees for belief convergence, policy improvement, and
robustness to noise. Experiments on simulated quantum networks with up to 100
nodes demonstrate significant improvements in routing fidelity and entanglement
delivery rates compared to state-of-the-art baselines, particularly under high
decoherence and nonstationary conditions.
[LINK]
http://arxiv.org/abs/2509.08654v1
[DATE]
2025-09-10 22:50:03+08:00
[CATEGORIES]
cs.LG
Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning
[AUTHORS]
Kavi Gupta, Kate Sanders, Armando Solar-Lezama
[ABSTRACT]
While LLMs have revolutionized the field of machine learning due to their
high performance on a strikingly wide range of problems, they are also known to
hallucinate false answers and underperform on less canonical versions of the
same tasks. There are several emerging theories of LLM performance, among them
that LLMs lack world modeling ability, that they have an undesirable bias
towards an autoregressive prior, and that they struggle on more novel problems.
The existing literature on LLM input novelty has focused on tasks of relatively
high complexity, studying perturbations of canonical but complex problems. In
this paper, we attempt to minimize complexity in order to isolate novelty as a
factor in LLM underperformance and investigate the power of
in-context learning. To this end, we consider an extremely simple domain: next
token prediction on simple language tasks. The twist is that these language
tasks are wholly unseen, as they are randomly drawn from a large,
parsimoniously defined set of languages arising from simple grammar rules. This
experimental setup allows us to evaluate ICL independently of models’
parametric knowledge. We find that LLMs uniformly underperform n-gram models on
this task, both when used as next token predictors and in chain-of-thought.
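For readers unfamiliar with the n-gram baseline mentioned above, the following is a minimal bigram next-token predictor. The toy strings and the function names are hypothetical; the abstract does not specify the n-gram order or the languages used, so this only sketches the general technique.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count, for every token, how often each token follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return the most frequent successor of `prev`, or None if unseen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]

# Hypothetical in-context examples drawn from a toy language (ab)+.
examples = ["abab", "ababab", "ab"]
model = train_bigram(examples)
prediction_after_a = predict_next(model, "a")
prediction_after_b = predict_next(model, "b")
```

The model has no parametric knowledge beyond the counts gathered from the examples it is shown, which is what makes it a natural reference point for in-context learning.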
[COMMENTS]
10 pages, 4 figures, 2 tables
[LINK]
http://arxiv.org/abs/2501.02825v6
[DATE]
2025-09-10 22:39:59+08:00
[CATEGORIES]
cs.LG
Calibrating Transformers via Sparse Gaussian Processes
[AUTHORS]
Wenlong Chen, Yingzhen Li
[ABSTRACT]
Transformer models have achieved profound success in prediction tasks in a
wide range of applications in natural language processing, speech recognition
and computer vision. Extending Transformer’s success to safety-critical domains
requires calibrated uncertainty estimation which remains under-explored. To
address this, we propose Sparse Gaussian Process attention (SGPA), which
performs Bayesian inference directly in the output space of multi-head
attention blocks (MHAs) in transformer to calibrate its uncertainty. It
replaces the scaled dot-product operation with a valid symmetric kernel and
uses sparse Gaussian processes (SGP) techniques to approximate the posterior
processes of MHA outputs. Empirically, on a suite of prediction tasks on text,
images and graphs, SGPA-based Transformers achieve competitive predictive
accuracy, while noticeably improving both in-distribution calibration and
out-of-distribution robustness and detection.
[COMMENTS]
Published at The Eleventh International Conference on Learning
Representations (ICLR 2023). ECE updated, typo fixed
[LINK]
http://arxiv.org/abs/2303.02444v4
[DATE]
2025-09-10 22:37:12+08:00
[CATEGORIES]
cs.LG
FlexFringe: Modeling Software Behavior by Learning Probabilistic Automata
[AUTHORS]
Sicco Verwer, Christian Hammerschmidt
[ABSTRACT]
We present the efficient implementations of probabilistic deterministic
finite automaton learning methods available in FlexFringe. These implement
well-known strategies for state-merging including several modifications to
improve their performance in practice. We show experimentally that these
algorithms obtain competitive results and significant improvements over a
default implementation. We also demonstrate how to use FlexFringe to learn
interpretable models from software logs and use these for anomaly detection.
We also show that learning smaller, more convoluted models, although they are
less interpretable, improves the performance of FlexFringe on anomaly detection,
outperforming an existing solution based on neural nets.
[LINK]
http://arxiv.org/abs/2203.16331v5
[DATE]
2025-09-10 22:35:49+08:00
[CATEGORIES]
cs.LG
A hierarchical entropy method for the delocalization of bias in high-dimensional Langevin Monte Carlo
[AUTHORS]
Daniel Lacker, Fuzhong Zhou
[ABSTRACT]
The unadjusted Langevin algorithm is widely used for sampling from complex
high-dimensional distributions. It is well known to be biased, with the bias
typically scaling linearly with the dimension when measured in squared
Wasserstein distance. However, the recent paper of Chen et al. (2024)
identifies an intriguing new delocalization effect: For a class of
distributions with sparse interactions, the bias between low-dimensional
marginals scales only with the lower dimension, not the full dimension. In this
work, we strengthen the results of Chen et al. (2024) in the sparse interaction
regime by removing a logarithmic factor, measuring distance in relative entropy
(a.k.a. KL-divergence), and relaxing the strong log-concavity assumption. In
addition, we expand the scope of the delocalization phenomenon by showing that
it holds for a class of distributions with weak interactions. Our proofs are
based on a hierarchical analysis of the marginal relative entropies, inspired
by the authors’ recent work on propagation of chaos.
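The dimension dependence of the bias can be made concrete in the simplest possible case. The sketch below assumes a standard Gaussian target with independent coordinates (no interactions at all), which is far simpler than the sparse and weak interaction regimes treated in the paper; the step size and function names are assumptions for illustration. For this target one step of the unadjusted Langevin algorithm is x' = (1 - h) x + sqrt(2h) xi, so the stationary variance per coordinate is 1 / (1 - h/2) rather than 1, and the squared Wasserstein-2 distance between N(0, v I_d) and N(0, I_d) is d (sqrt(v) - 1)^2.

```python
import math

def ula_stationary_variance(step, iters=10_000):
    """Iterate the variance recursion of ULA for a 1-D standard Gaussian target.

    One ULA step is x' = x - step * x + sqrt(2 * step) * xi with xi ~ N(0, 1),
    hence Var(x') = (1 - step)**2 * Var(x) + 2 * step.
    """
    var = 0.0
    for _ in range(iters):
        var = (1.0 - step) ** 2 * var + 2.0 * step
    return var

def squared_w2_bias(step, dim):
    """Squared W2 distance between N(0, v * I_dim) and N(0, I_dim)."""
    v = ula_stationary_variance(step)
    return dim * (math.sqrt(v) - 1.0) ** 2

step = 0.1                                   # assumed step size
full_bias = squared_w2_bias(step, dim=1000)  # bias of the full distribution
marginal_bias = squared_w2_bias(step, dim=2) # bias of a 2-D marginal
```

In this product case the bias of a low-dimensional marginal scales with the marginal's own dimension rather than the ambient one, which is the behavior the delocalization results extend to interacting distributions.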
[LINK]
http://arxiv.org/abs/2509.08619v1
[DATE]
2025-09-10 22:16:24+08:00
[CATEGORIES]
cs.LG
Towards Interpretable Deep Neural Networks for Tabular Data
[AUTHORS]
Khawla Elhadri, Jörg Schlötterer, Christin Seifert
[ABSTRACT]
Tabular data is the foundation of many applications in fields such as finance
and healthcare. Although DNNs tailored for tabular data achieve competitive
predictive performance, they are blackboxes with little interpretability. We
introduce XNNTab, a neural architecture that uses a sparse autoencoder (SAE) to
learn a dictionary of monosemantic features within the latent space used for
prediction. Using an automated method, we assign human-interpretable semantics
to these features. This allows us to represent predictions as linear
combinations of semantically meaningful components. Empirical evaluations
demonstrate that XNNTab attains performance on par with or exceeding that of
state-of-the-art, black-box neural models and classical machine learning
approaches while being fully interpretable.
[LINK]
http://arxiv.org/abs/2509.08617v1
[DATE]
2025-09-10 22:14:43+08:00
[CATEGORIES]
cs.LG
Linear Convergence of the Frank-Wolfe Algorithm over Product Polytopes
[AUTHORS]
Gabriele Iommazzo, David Martínez-Rubio, Francisco Criado, Elias Wirth, Sebastian Pokutta
[ABSTRACT]
We study the linear convergence of Frank-Wolfe algorithms over product
polytopes. We analyze two condition numbers for the product polytope, namely
the pyramidal width and the vertex-facet distance, based on the
condition numbers of individual polytope components. As a result, for convex
objectives that are $\mu$-Polyak-Łojasiewicz, we show linear convergence
rates quantified in terms of the resulting condition numbers. We apply our
results to the problem of approximately finding a feasible point in a polytope
intersection in high dimensions, and demonstrate the practical efficiency of
our algorithms through empirical results.
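As a rough illustration of why product polytopes are convenient for Frank-Wolfe methods, the sketch below runs plain Frank-Wolfe with the open-loop step size 2/(t+2) over a product of probability simplices, where the linear minimization oracle decomposes block by block. The quadratic objective, the target point, and all function names are assumptions for illustration; this vanilla variant is not claimed to be one of the algorithms analyzed in the paper.

```python
def lmo_simplex(grad_block):
    """Linear minimization oracle over a probability simplex: the best vertex."""
    k = min(range(len(grad_block)), key=lambda i: grad_block[i])
    return [1.0 if i == k else 0.0 for i in range(len(grad_block))]

def frank_wolfe_product(target_blocks, iters=2000):
    """Minimize 0.5 * ||x - target||^2 over a product of simplices."""
    # Start from a vertex of each simplex.
    x = [lmo_simplex([0.0] * len(b)) for b in target_blocks]
    for t in range(iters):
        gamma = 2.0 / (t + 2.0)
        # The objective is separable, so each block's gradient depends only
        # on that block and the oracle can be called independently per block.
        for x_block, b in zip(x, target_blocks):
            grad = [xi - bi for xi, bi in zip(x_block, b)]
            vertex = lmo_simplex(grad)
            for i in range(len(x_block)):
                x_block[i] += gamma * (vertex[i] - x_block[i])
    return x

def objective(x, target_blocks):
    return 0.5 * sum(
        (xi - bi) ** 2
        for xb, b in zip(x, target_blocks)
        for xi, bi in zip(xb, b)
    )

# Hypothetical target that lies inside the product of two 2-D simplices.
target = [[0.3, 0.7], [0.6, 0.4]]
solution = frank_wolfe_product(target)
final_value = objective(solution, target)
```

Every iterate is a convex combination of vertices, so each block stays on its simplex without any projection step.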
[LINK]
http://arxiv.org/abs/2505.11259v2
[DATE]
2025-09-10 22:13:40+08:00
[CATEGORIES]
cs.LG
Classification of 24-hour movement behaviors from wrist-worn accelerometer data: from handcrafted features to deep learning techniques
[AUTHORS]
Alireza Sameh, Mehrdad Rostami, Mourad Oussalah, Vahid Farrahi
[ABSTRACT]
Purpose: We compared the performance of deep learning (DL) and classical
machine learning (ML) algorithms for the classification of 24-hour movement
behavior into sleep, sedentary, light intensity physical activity (LPA), and
moderate-to-vigorous intensity physical activity (MVPA). Methods: Open-access
data from 151 adults wearing a wrist-worn accelerometer (Axivity-AX3) was used.
Participants were randomly divided into training, validation, and test sets
(121, 15, and 15 participants each). Raw acceleration signals were segmented
into non-overlapping 10-second windows, and then a total of 104 handcrafted
features were extracted. Four DL algorithms, namely Long Short-Term Memory (LSTM),
Bidirectional Long Short-Term Memory (BiLSTM), Gated Recurrent Units (GRU), and
One-Dimensional Convolutional Neural Network (1D-CNN), were trained using raw
acceleration signals and with handcrafted features extracted from these signals
to predict 24-hour movement behavior categories. The handcrafted features were
also used to train classical ML algorithms, namely Random Forest (RF), Support
Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Logistic Regression
(LR), Artificial Neural Network (ANN), and Decision Tree (DT) for classifying
24-hour movement behavior intensities. Results: LSTM, BiLSTM, and GRU showed an
overall accuracy of approximately 85% when trained with raw acceleration
signals, and 1D-CNN an overall accuracy of approximately 80%. When trained on
handcrafted features, the overall accuracy for both DL and classical ML
algorithms ranged from 70% to 81%. Overall, there was a higher confusion in
classification of MVPA and LPA, compared to sleep and sedentary categories.
Conclusion: DL methods with raw acceleration signals had only slightly better
performance in predicting 24-hour movement behavior intensities, compared to
when DL and classical ML were trained with handcrafted features.
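The segmentation step described above can be sketched as follows. The 100 Hz sampling rate, the synthetic signal, and the two magnitude features are assumptions for illustration; the abstract states only that windows are non-overlapping and 10 seconds long and that 104 handcrafted features were extracted.

```python
import math

def segment_windows(samples, sampling_rate_hz, window_seconds=10):
    """Split a signal into non-overlapping windows, dropping any remainder."""
    size = sampling_rate_hz * window_seconds
    n_windows = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(n_windows)]

def magnitude(sample):
    """Euclidean norm of one tri-axial acceleration sample."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)

def window_features(window):
    """Two example handcrafted features: mean and std of the magnitude."""
    mags = [magnitude(s) for s in window]
    mean = sum(mags) / len(mags)
    var = sum((m - mean) ** 2 for m in mags) / len(mags)
    return {"mean_magnitude": mean, "std_magnitude": math.sqrt(var)}

sampling_rate_hz = 100                 # assumed; not stated in the abstract
signal = [(0.0, 0.0, 1.0)] * 2500      # 25 s of a synthetic resting device
windows = segment_windows(signal, sampling_rate_hz)
features = [window_features(w) for w in windows]
```

The raw windows would feed the deep learning models directly, while the per-window feature dictionaries correspond to the input of the classical machine learning models.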
[LINK]
http://arxiv.org/abs/2509.08606v1
[DATE]
2025-09-10 22:04:51+08:00
[CATEGORIES]
cs.LG
To See a World in a Spark of Neuron: Disentangling Multi-task Interference for Training-free Model Merging
[AUTHORS]
Zitao Fang, Guodong DU, Shuyang Yu, Yifei Guo, Yiwei Zhang, Yiyao Cao, Jing Li, Ho-Kin Tang, Sim Kuan Goh
[COMMENTS]
Accepted to EMNLP 2025 Main Conference. This is the camera-ready
version. Code: https://ZzzitaoFang.github.io/projects/NeuroMerging/
[LINK]
http://arxiv.org/abs/2503.05320v4
[DATE]
2025-09-10 21:56:44+08:00
[CATEGORIES]
cs.LG
Interpretability as Alignment: Making Internal Understanding a Design Principle
[AUTHORS]
Aadit Sengupta, Pratinav Seth, Vinay Kumar Sankarapu
[ABSTRACT]
Large neural models are increasingly deployed in high-stakes settings,
raising concerns about whether their behavior reliably aligns with human
values. Interpretability provides a route to internal transparency by revealing
the computations that drive outputs. We argue that interpretability, especially
mechanistic approaches, should be treated as a design principle for alignment,
not an auxiliary diagnostic tool. Post-hoc methods such as LIME or SHAP offer
intuitive but correlational explanations, while mechanistic techniques like
circuit tracing or activation patching yield causal insight into internal
failures, including deceptive or misaligned reasoning that behavioral methods
like RLHF, red teaming, or Constitutional AI may overlook. Despite these
advantages, interpretability faces challenges of scalability, epistemic
uncertainty, and mismatches between learned representations and human concepts.
Our position is that progress on safe and trustworthy AI will depend on making
interpretability a first-class objective of AI research and development,
ensuring that systems are not only effective but also auditable, transparent,
and aligned with human intent.
[COMMENTS]
Pre-Print
[LINK]
http://arxiv.org/abs/2509.08592v1
[DATE]
2025-09-10 21:45:59+08:00
[CATEGORIES]
cs.LG
Implicit Shape-Prior for Few-Shot Assisted 3D Segmentation
[AUTHORS]
Mathilde Monvoisin, Louise Piecuch, Blanche Texier, Cédric Hémon, Anaïs Barateau, Jérémie Huet, Antoine Nordez, Anne-Sophie Boureau, Jean-Claude Nunes, Diana Mateus
[ABSTRACT]
The objective of this paper is to significantly reduce the manual workload
required from medical professionals in complex 3D segmentation tasks that
cannot yet be fully automated. For instance, in radiotherapy planning, organs
at risk must be accurately identified in computed tomography (CT) or magnetic
resonance imaging (MRI) scans to ensure they are spared from harmful radiation.
Similarly, diagnosing age-related degenerative diseases such as sarcopenia,
which involve progressive loss of muscle volume and strength, is commonly based on
muscular mass measurements often obtained from manual segmentation of medical
volumes. To alleviate the manual-segmentation burden, this paper introduces an
implicit shape prior to segment volumes from sparse slice manual annotations
generalized to the multi-organ case, along with a simple framework for
automatically selecting the most informative slices to guide and minimize the
next interactions. The experimental validation shows the method’s effectiveness
on two medical use cases: assisted segmentation of organs at risk for
brain cancer patients, and acceleration of the creation of a new
database with unseen muscle shapes for patients with sarcopenia.
[COMMENTS]
Both first Authors contributed equally to this work, lastnames in
alphabetical order. This preprint has not undergone peer review or any
post-submission improvements or corrections. The Version of Record of this
contribution will be published in a Springer Nature Computer Science book
series (CCIS, LNAI, LNBI, LNBIP, LNCS) and the doi will soon be released
[LINK]
http://arxiv.org/abs/2509.08580v1
[DATE]
2025-09-10 21:30:39+08:00
[CATEGORIES]
cs.LG
Efficient and Generalized end-to-end Autonomous Driving System with Latent Deep Reinforcement Learning and Demonstrations
[AUTHORS]
Zuojin Tang, Xiaoyu Chen, Yongqiang Li, Jianyu Chen
[ABSTRACT]
An intelligent driving system should dynamically formulate appropriate
driving strategies based on the current environment and vehicle status while
ensuring system security and reliability. However, methods based on
reinforcement learning and imitation learning often suffer from high sample
complexity, poor generalization, and low safety. To address these challenges,
this paper introduces an efficient and generalized end-to-end autonomous
driving system (EGADS) for complex and varied scenarios. The RL agent in our
EGADS combines variational inference with normalizing flows, which are
independent of distribution assumptions. This combination allows the agent to
capture historical information relevant to driving in latent space effectively,
thereby significantly reducing sample complexity. Additionally, we enhance
safety by formulating robust safety constraints and improve generalization and
performance by integrating RL with expert demonstrations. Experimental results
demonstrate that, compared to existing methods, EGADS significantly reduces
sample complexity, greatly improves safety performance, and exhibits strong
generalization capabilities in complex urban scenarios. Particularly, we
contributed an expert dataset collected through human expert steering wheel
control, specifically using the G29 steering wheel.
[COMMENTS]
Accepted by ECML PKDD 2025 (Research Track)
[LINK]
http://arxiv.org/abs/2401.11792v8
[DATE]
2025-09-10 21:18:49+08:00
[CATEGORIES]
cs.LG
A Nonlinear Low-rank Representation Model with Convolutional Neural Network for Imputing Water Quality Data
[AUTHORS]
Xin Liao, Bing Yang, Cai Yu
[ABSTRACT]
The integrity of Water Quality Data (WQD) is critical in environmental
monitoring for scientific decision-making and ecological protection. However,
water quality monitoring systems are often challenged by large amounts of
missing data due to unavoidable problems such as sensor failures and
communication delays, which further lead to water quality data becoming
High-Dimensional and Sparse (HDS). Traditional data imputation methods
struggle to depict the potential dynamics and fail to capture the deep data
features, resulting in unsatisfactory imputation performance. To effectively
address the above issues, this paper proposes a Nonlinear Low-rank
Representation model (NLR) with Convolutional Neural Networks (CNN) for
imputing missing WQD, which utilizes CNNs to implement two ideas: a) fusing
temporal features to model the temporal dependence of data between time slots,
and b) extracting nonlinear interactions and local patterns to mine
higher-order relationship features and achieve deep fusion of multidimensional
information. Experimental studies on three real water quality datasets
demonstrate that the proposed model significantly outperforms existing
state-of-the-art data imputation models in terms of estimation accuracy. It
provides an effective approach for handling water quality monitoring data in
complex dynamic environments.
[COMMENTS]
7 pages, 2 figures, conference
[LINK]
http://arxiv.org/abs/2506.23629v2
[DATE]
2025-09-10 20:50:14+08:00
[CATEGORIES]
cs.LG
How Should We Meta-Learn Reinforcement Learning Algorithms?
[AUTHORS]
Alexander David Goldie, Zilin Wang, Jaron Cohen, Jakob Nicolaus Foerster, Shimon Whiteson
[ABSTRACT]
The process of meta-learning algorithms from data, instead of relying on
manual design, is growing in popularity as a paradigm for improving the
performance of machine learning systems. Meta-learning shows particular promise
for reinforcement learning (RL), where algorithms are often adapted from
supervised or unsupervised learning despite their suboptimality for RL.
However, until now there has been a severe lack of comparison between different
meta-learning algorithms, such as using evolution to optimise over black-box
functions or LLMs to propose code. In this paper, we carry out this empirical
comparison of the different approaches when applied to a range of meta-learned
algorithms which target different parts of the RL pipeline. In addition to
meta-train and meta-test performance, we also investigate factors including the
interpretability, sample cost and train time for each meta-learning algorithm.
Based on these findings, we propose several guidelines for meta-learning new RL
algorithms which will help ensure that future learned algorithms are as
performant as possible.
[COMMENTS]
Accepted paper at Reinforcement Learning Conference (RLC) 2025
[LINK]
http://arxiv.org/abs/2507.17668v2
[DATE]
2025-09-10 20:25:27+08:00
[CATEGORIES]
cs.LG
Agents of Discovery
[AUTHORS]
Sascha Diefenbacher, Anna Hallin, Gregor Kasieczka, Michael Krämer, Anne Lauscher, Tim Lukas
[ABSTRACT]
The substantial data volumes encountered in modern particle physics and other
domains of fundamental physics research allow (and require) the use of
increasingly complex data analysis tools and workflows. While the use of
machine learning (ML) tools for data analysis has recently proliferated, these
tools are typically special-purpose algorithms that rely, for example, on
encoded physics knowledge to reach optimal performance. In this work, we
investigate a new and orthogonal direction: Using recent progress in large
language models (LLMs) to create a team of agents – instances of LLMs with
specific subtasks – that jointly solve data analysis-based research problems
in a way similar to how a human researcher might: by creating code to operate
standard tools and libraries (including ML systems) and by building on results
of previous iterations. If successful, such agent-based systems could be
deployed to automate routine analysis components to counteract the increasing
complexity of modern tool chains. To investigate the capabilities of
current-generation commercial LLMs, we consider the task of anomaly detection
via the publicly available and highly-studied LHC Olympics dataset. Several
current models by OpenAI (GPT-4o, o4-mini, GPT-4.1, and GPT-5) are investigated
and their stability tested. Overall, we observe the capacity of the agent-based
system to solve this data analysis problem. The best agent-created solutions
mirror the performance of human state-of-the-art results.
[LINK]
http://arxiv.org/abs/2509.08535v1
[DATE]
2025-09-10 20:25:13+08:00
[CATEGORIES]
cs.LG
Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures
[AUTHORS]
Wen-Bo Xie, Xun Fu, Bin Chen, Yan-Li Lee, Tao Deng, Tian Zou, Xin Wang, Zhen Liu, Jaideep Srivastava
[ABSTRACT]
In this work, we focus on the efficiency and scalability of pairwise
constraint-based active clustering, crucial for processing large-scale data in
applications such as data mining, knowledge annotation, and AI model
pre-training. Our goals are threefold: (1) to reduce computational costs for
iterative clustering updates; (2) to enhance the impact of user-provided
constraints to minimize annotation requirements for precise clustering; and (3)
to cut down memory usage in practical deployments. To achieve these aims, we
propose a graph-based active clustering algorithm that utilizes two sparse
graphs: one for representing relationships between data (our proposed data
skeleton) and another for updating this data skeleton. These two graphs work in
concert, enabling the refinement of connected subgraphs within the data
skeleton to create nested clusters. Our empirical analysis confirms that the
proposed algorithm consistently facilitates more accurate clustering with
dramatically less input of user-provided constraints, and outperforms its
counterparts in terms of computational performance and scalability, while
maintaining robustness across various distance metrics.
[LINK]
http://arxiv.org/abs/2509.08530v1
[DATE]
2025-09-10 20:18:52+08:00
[CATEGORIES]
cs.LG
Variational Rank Reduction Autoencoders for Generative Thermal Design
[AUTHORS]
Alicia Tierz, Jad Mounayer, Beatriz Moya, Francisco Chinesta
[ABSTRACT]
Generative thermal design for complex geometries is fundamental in many areas
of engineering, yet it faces two main challenges: the high computational cost
of high-fidelity simulations and the limitations of conventional generative
models. Approaches such as autoencoders (AEs) and variational autoencoders
(VAEs) often produce unstructured latent spaces with discontinuities, which
restricts their capacity to explore designs and generate physically consistent
solutions.
To address these limitations, we propose a hybrid framework that combines
Variational Rank-Reduction Autoencoders (VRRAEs) with Deep Operator Networks
(DeepONets). The VRRAE introduces a truncated SVD within the latent space,
leading to continuous, interpretable, and well-structured representations that
mitigate posterior collapse and improve geometric reconstruction. The DeepONet
then exploits this compact latent encoding in its branch network, together with
spatial coordinates in the trunk network, to predict temperature gradients
efficiently and accurately.
This hybrid approach not only enhances the quality of generated geometries
and the accuracy of gradient prediction, but also provides a substantial
advantage in inference efficiency compared to traditional numerical solvers.
Overall, the study underscores the importance of structured latent
representations for operator learning and highlights the potential of combining
generative models and operator networks in thermal design and broader
engineering applications.
[LINK]
http://arxiv.org/abs/2509.08515v1
[DATE]
2025-09-10 19:45:40+08:00
[CATEGORIES]
cs.LG
Learning Fluid-Structure Interaction Dynamics with Physics-Informed Neural Networks and Immersed Boundary Methods
[AUTHORS]
Afrah Farea, Saiful Khan, Reza Daryani, Emre Cenk Ersan, Mustafa Serdar Celebi
[ABSTRACT]
Physics-informed neural networks (PINNs) have emerged as a promising approach
for solving complex fluid dynamics problems, yet their application to
fluid-structure interaction (FSI) problems with moving boundaries remains
largely unexplored. This work addresses the critical challenge of modeling FSI
systems with deformable interfaces, where traditional unified PINN
architectures struggle to capture the distinct physics governing fluid and
structural domains simultaneously. We present an innovative Eulerian-Lagrangian
PINN architecture that integrates immersed boundary method (IBM) principles to
solve FSI problems with moving boundary conditions. Our approach fundamentally
departs from conventional unified architectures by introducing domain-specific
neural networks: an Eulerian network for fluid dynamics and a Lagrangian
network for structural interfaces, coupled through physics-based constraints.
Additionally, we incorporate learnable B-spline activation functions with SiLU
to capture both localized high-gradient features near interfaces and global
flow patterns. Empirical studies on a 2D cavity flow problem involving a moving
solid structure show that while baseline unified PINNs achieve reasonable
velocity predictions, they suffer from substantial pressure errors (12.9%) in
structural regions. Our Eulerian-Lagrangian architecture with learnable
activations (EL-L) achieves better performance across all metrics, improving
accuracy by 24.1-91.4% and particularly reducing pressure errors from 12.9% to
2.39%. These results demonstrate that domain decomposition aligned with
physical principles, combined with locality-aware activation functions, is
essential for accurate FSI modeling within the PINN framework.
[LINK]
http://arxiv.org/abs/2505.18565v4
[DATE]
2025-09-10 19:36:01+08:00
[CATEGORIES]
cs.LG
A Transformer approach for Electricity Price Forecasting
[AUTHORS]
Oscar Llorente, Jose Portela
[ABSTRACT]
This paper presents a novel approach to electricity price forecasting (EPF)
using a pure Transformer model. Unlike other alternatives, no
recurrent network is used in combination with the attention mechanism, showing
that the attention layer is enough for capturing the temporal patterns.
The paper also provides a fair comparison of the models using the open-source EPF
toolbox and provides the code to enhance reproducibility and transparency in EPF
research. The results show that the Transformer model outperforms traditional
methods, offering a promising solution for reliable and sustainable power
system operation.
[COMMENTS]
9 pages
[LINK]
http://arxiv.org/abs/2403.16108v3
[DATE]
2025-09-10 19:00:11+08:00
[CATEGORIES]
cs.LG
From Channel Bias to Feature Redundancy: Uncovering the “Less is More” Principle in Few-Shot Learning
[AUTHORS]
Ji Zhang, Xu Luo, Lianli Gao, Difan Zou, Hengtao Shen, Jingkuan Song
[ABSTRACT]
Deep neural networks often fail to adapt representations to novel tasks under
distribution shifts, especially when only a few examples are available. This
paper identifies a core obstacle behind this failure: channel bias, where
networks develop a rigid emphasis on feature dimensions that were
discriminative for the source task, but this emphasis is misaligned and fails
to adapt to the distinct needs of a novel task. This bias leads to a striking
and detrimental consequence: feature redundancy. We demonstrate that for
few-shot tasks, classification accuracy is significantly improved by using as
few as 1-5% of the most discriminative feature dimensions, revealing that the
vast majority are actively harmful. Our theoretical analysis confirms that this
redundancy originates from confounding feature dimensions (those with high
intra-class variance but low inter-class separability), which are especially
problematic in low-data regimes. This “less is more” phenomenon is a defining
characteristic of the few-shot setting, diminishing as more samples become
available. To address this, we propose a simple yet effective soft-masking
method, Augmented Feature Importance Adjustment (AFIA), which estimates feature
importance from augmented data to mitigate the issue. By establishing the
cohesive link from channel bias to its consequence of extreme feature
redundancy, this work provides a foundational principle for few-shot
representation transfer and a practical method for developing more robust
few-shot learning algorithms.
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2206.08126
[LINK]
http://arxiv.org/abs/2310.03843v2
[DATE]
2025-09-10 18:53:27+08:00
[CATEGORIES]
cs.LG
HOFT: Householder Orthogonal Fine-tuning
[AUTHORS]
Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, Alfons Juan
[ABSTRACT]
Adaptation of foundation models using low-rank methods is a widespread
approach. Another way to adapt these models is to employ orthogonal fine-tuning
methods, which are less time and memory efficient despite their good
generalization properties. In this work, we propose Householder Orthogonal
Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to
alleviate time and space complexity. Moreover, some theoretical properties of
the orthogonal fine-tuning paradigm are explored. From this exploration, Scaled
Householder Orthogonal Fine-tuning (SHOFT) is proposed. Both HOFT and SHOFT are
evaluated in downstream tasks, namely commonsense reasoning, machine
translation, subject-driven generation and mathematical reasoning. Compared
with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or
better results.
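As a minimal illustration of the building block named in the title (not the HOFT or SHOFT method itself, whose details the abstract does not give), the sketch below checks the standard fact that a product of Householder reflections $H = I - 2vv^\top/(v^\top v)$ is an orthogonal matrix. The helper names and the example vectors are assumptions chosen for illustration.

```python
def householder(v):
    """Return the reflection H = I - 2 v v^T / (v^T v) as a list of rows."""
    n = len(v)
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a list-of-rows matrix."""
    return [list(row) for row in zip(*a)]


# Hypothetical reflection vectors; any nonzero vectors would do.
q = matmul(householder([1.0, 2.0, 3.0]), householder([0.5, -1.0, 2.0]))

# Q^T Q should equal the identity up to floating-point error.
gram = matmul(transpose(q), q)
```

Because each reflection is parameterized by a single vector, composing a few of them gives an orthogonal matrix with far fewer free parameters than a dense one, which is the general appeal of Householder-based parameterizations.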
[LINK]
http://arxiv.org/abs/2505.16531v2
[DATE]
2025-09-10 18:50:10+08:00
[CATEGORIES]
cs.LG
Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis
[AUTHORS]
Matias D. Cattaneo, Boris Shigida
[ABSTRACT]
We analyze gradient descent with Polyak heavy-ball momentum (HB) whose fixed
momentum parameter $\beta \in (0, 1)$ provides exponential decay of memory.
Building on Kovachki and Stuart (2021), we prove that on an exponentially
attractive invariant manifold the algorithm is exactly plain gradient descent
with a modified loss, provided that the step size $h$ is small enough. Although
the modified loss does not admit a closed-form expression, we describe it with
arbitrary precision and prove global (finite “time” horizon) approximation
bounds $O(h^{R})$ for any finite order $R \geq 2$. We then conduct a
fine-grained analysis of the combinatorics underlying the memoryless
approximations of HB, in particular, finding a rich family of polynomials in
$\beta$ hidden inside which contains Eulerian and Narayana polynomials. We
derive continuous modified equations of arbitrary approximation order (with
rigorous bounds) and the principal flow that approximates the HB dynamics,
generalizing Rosca et al. (2023). Approximation theorems cover both full-batch
and mini-batch HB. Our theoretical results shed new light on the main features
of gradient descent with heavy-ball momentum, and outline a road-map for
similar analysis of other optimization algorithms.
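For readers unfamiliar with the iteration being analyzed, here is a minimal sketch of the standard Polyak heavy-ball update, $x_{k+1} = x_k - h\,\nabla f(x_k) + \beta\,(x_k - x_{k-1})$, with step size $h$ and momentum parameter $\beta \in (0, 1)$. The quadratic test loss and the specific values of $h$, $\beta$, and the step count are assumptions for illustration; the modified loss from the paper is not computed here.

```python
def heavy_ball(grad, x0, h=0.1, beta=0.9, steps=200):
    """Run Polyak heavy-ball momentum on a scalar problem.

    Update: x_{k+1} = x_k - h * grad(x_k) + beta * (x_k - x_{k-1}).
    """
    x_prev, x = x0, x0  # start with zero initial momentum
    for _ in range(steps):
        x_next = x - h * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


# Illustrative loss f(x) = 0.5 * x**2, whose gradient is x.
x_final = heavy_ball(lambda x: x, x0=1.0)
```

The term `beta * (x - x_prev)` carries an exponentially decaying memory of past gradients, which is the "exponential decay of memory" the abstract refers to.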
[LINK]
http://arxiv.org/abs/2509.08483v1
[DATE]
2025-09-10 18:47:54+08:00
[CATEGORIES]
cs.LG
SHAining on Process Mining: Explaining Event Log Characteristics Impact on Algorithms
[AUTHORS]
Andrea Maldonado, Christian M. M. Frey, Sai Anirudh Aryasomayajula, Ludwig Zellner, Stephan A. Fahrenkrog-Petersen, Thomas Seidl
[ABSTRACT]
Process mining aims to extract and analyze insights from event logs, yet
algorithm metric results vary widely depending on structural event log
characteristics. Existing work often evaluates algorithms on a fixed set of
real-world event logs but lacks a systematic analysis of how event log
characteristics impact algorithms individually. Moreover, since event logs are
generated from processes, where characteristics co-occur, we focus on
associational rather than causal effects to assess how strongly overlapping
individual characteristics affect evaluation metrics without assuming isolated
causal effects, a factor often neglected by prior work. We introduce SHAining,
the first approach to quantify the marginal contribution of varying event log
characteristics to process mining algorithms’ metrics. Using process discovery
as a downstream task, we analyze over 22,000 event logs covering a wide span of
characteristics to uncover which affect algorithms across metrics (e.g.,
fitness, precision, complexity) the most. Furthermore, we offer novel insights
about how the value of event log characteristics correlates with their
contributed impact, assessing the algorithm’s robustness.
[LINK]
http://arxiv.org/abs/2509.08482v1
[DATE]
2025-09-10 18:47:51+08:00
[CATEGORIES]
cs.LG
Gaussian Process Regression – Neural Network Hybrid with Optimized Redundant Coordinates
[AUTHORS]
Sergei Manzhos, Manabu Ihara
[ABSTRACT]
Recently, a Gaussian Process Regression - neural network (GPRNN) hybrid
machine learning method was proposed, which is based on additive-kernel GPR in
redundant coordinates constructed by rules [J. Phys. Chem. A 127 (2023) 7823].
The method combined the expressive power of an NN with the robustness of linear
regression, in particular, with respect to overfitting when the number of
neurons is increased beyond optimal. We introduce opt-GPRNN, in which the
redundant coordinates of GPRNN are optimized with a Monte Carlo algorithm and
show that when combined with optimization of redundant coordinates, GPRNN
attains the lowest test set error with far fewer terms/neurons and retains
the advantage of avoiding overfitting when the number of neurons is increased
beyond the optimal value. The method, opt-GPRNN, possesses an expressive power
closer to that of a multilayer NN and could obviate the need for deep NNs in
some applications. With optimized redundant coordinates, a dimensionality
reduction regime is also possible. Examples of application to machine learning
an interatomic potential and materials informatics are given.
[LINK]
http://arxiv.org/abs/2509.08457v1
[DATE]
2025-09-10 18:00:38+08:00
[CATEGORIES]
cs.LG
Spherical Brownian Bridge Diffusion Models for Conditional Cortical Thickness Forecasting
[AUTHORS]
Ivan Stoyanov, Fabian Bongratz, Christian Wachinger
[ABSTRACT]
Accurate forecasting of individualized, high-resolution cortical thickness
(CTh) trajectories is essential for detecting subtle cortical changes,
providing invaluable insights into neurodegenerative processes and facilitating
earlier and more precise intervention strategies. However, CTh forecasting is a
challenging task due to the intricate non-Euclidean geometry of the cerebral
cortex and the need to integrate multi-modal data for subject-specific
predictions. To address these challenges, we introduce the Spherical Brownian
Bridge Diffusion Model (SBDM). Specifically, we propose a bidirectional
conditional Brownian bridge diffusion process to forecast CTh trajectories at
the vertex level of registered cortical surfaces. Our technical contribution
includes a new denoising model, the conditional spherical U-Net (CoS-UNet),
which combines spherical convolutions and dense cross-attention to integrate
cortical surfaces and tabular conditions seamlessly. Compared to previous
approaches, SBDM achieves significantly reduced prediction errors, as
demonstrated by our experiments based on longitudinal datasets from the ADNI
and OASIS. Additionally, we demonstrate SBDM’s ability to generate individual
factual and counterfactual CTh trajectories, offering a novel framework for
exploring hypothetical scenarios of cortical development.
[LINK]
http://arxiv.org/abs/2509.08442v1
[DATE]
2025-09-10 17:40:41+08:00
[CATEGORIES]
cs.LG
Comprehensive Evaluation of Prototype Neural Networks
[AUTHORS]
Philipp Schlinge, Steffen Meinert, Martin Atzmueller
[ABSTRACT]
Prototype models are an important method for explainable artificial
intelligence (XAI) and interpretable machine learning. In this paper, we
perform an in-depth analysis of a set of prominent prototype models including
ProtoPNet, ProtoPool and PIPNet. For their assessment, we apply a comprehensive
set of metrics. In addition to applying standard metrics from literature, we
propose several new metrics to further complement the analysis of model
interpretability. In our experimentation, we apply the set of prototype models
on a diverse set of datasets including fine-grained classification, Non-IID
settings and multi-label classification to further contrast the performance.
Furthermore, we also provide our code as an open-source library
(https://github.com/uos-sis/quanproto), which facilitates simple application of
the metrics themselves, as well as extensibility – providing the option for easily
adding new metrics and models.
[LINK]
http://arxiv.org/abs/2507.06819v2
[DATE]
2025-09-10 17:20:13+08:00
[CATEGORIES]
cs.LG
LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations
[AUTHORS]
Payal Varshney, Adriano Lucieri, Christoph Balada, Sheraz Ahmed, Andreas Dengel
[ABSTRACT]
Video-based AI systems are increasingly adopted in safety-critical domains
such as autonomous driving and healthcare. However, interpreting their
decisions remains challenging due to the inherent spatiotemporal complexity of
video data and the opacity of deep learning models. Existing explanation
techniques often suffer from limited temporal coherence, insufficient
robustness, and a lack of actionable causal insights. Current counterfactual
explanation methods typically do not incorporate guidance from the target
model, reducing semantic fidelity and practical utility. We introduce Latent
Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework
designed to explain the behavior of video-based AI models. Compared to previous
approaches, LD-ViCE reduces the computational costs of generating explanations
by operating in latent space using a state-of-the-art diffusion model, while
producing realistic and interpretable counterfactuals through an additional
refinement step. Our experiments demonstrate the effectiveness of LD-ViCE
across three diverse video datasets, including EchoNet-Dynamic (cardiac
ultrasound), FERV39k (facial expression), and Something-Something V2 (action
recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving
an increase in $R^2$ score of up to 68% while reducing inference time by half.
Qualitative analysis confirms that LD-ViCE generates semantically meaningful
and temporally coherent explanations, offering valuable insights into the
target model behavior. LD-ViCE represents a valuable step toward the
trustworthy deployment of AI in safety-critical domains.
[COMMENTS]
30 pages
[LINK]
http://arxiv.org/abs/2509.08422v1
[DATE]
2025-09-10 17:10:18+08:00
[CATEGORIES]
cs.LG
Facet: highly efficient E(3)-equivariant networks for interatomic potentials
[AUTHORS]
Nicholas Miklaucic, Lai Wei, Rongzhi Dong, Nihang Fu, Sadman Sadeed Omee, Qingyang Li, Sourin Dey, Victor Fung, Jianjun Hu
[ABSTRACT]
Computational materials discovery is limited by the high cost of
first-principles calculations. Machine learning (ML) potentials that predict
energies from crystal structures are promising, but existing methods face
computational bottlenecks. Steerable graph neural networks (GNNs) encode
geometry with spherical harmonics, respecting atomic symmetries – permutation,
rotation, and translation – for physically realistic predictions. Yet
maintaining equivariance is difficult: activation functions must be modified,
and each layer must handle multiple data types for different harmonic orders.
We present Facet, a GNN architecture for efficient ML potentials, developed
through systematic analysis of steerable GNNs. Our innovations include
replacing expensive multi-layer perceptrons (MLPs) for interatomic distances
with splines, which match performance while cutting computational and memory
demands. We also introduce a general-purpose equivariant layer that mixes node
information via spherical grid projection followed by standard MLPs – faster
than tensor products and more expressive than linear or gate layers. On the
MPTrj dataset, Facet matches leading models with far fewer parameters and under
10% of their training compute. On a crystal relaxation task, it runs twice as
fast as MACE models. We further show SevenNet-0’s parameters can be reduced by
over 25% with no accuracy loss. These techniques enable more than 10x faster
training of large-scale foundation models for ML potentials, potentially
reshaping computational materials discovery.
[LINK]
http://arxiv.org/abs/2509.08418v1
[DATE]
2025-09-10 17:06:24+08:00
[CATEGORIES]
cs.LG
Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models
[AUTHORS]
Jisung Hwang, Jaihoon Kim, Minhyuk Sung
[ABSTRACT]
We propose a novel regularization loss that enforces standard Gaussianity,
encouraging samples to align with a standard Gaussian distribution. This
facilitates a range of downstream tasks involving optimization in the latent
space of text-to-image models. We treat elements of a high-dimensional sample
as one-dimensional standard Gaussian variables and define a composite loss that
combines moment-based regularization in the spatial domain with power
spectrum-based regularization in the spectral domain. Since the expected values
of moments and power spectrum distributions are analytically known, the loss
promotes conformity to these properties. To ensure permutation invariance, the
losses are applied to randomly permuted inputs. Notably, existing
Gaussianity-based regularizations fall within our unified framework: some
correspond to moment losses of specific orders, while the previous
covariance-matching loss is equivalent to our spectral loss but incurs higher
time complexity due to its spatial-domain computation. We showcase the
application of our regularization in generative modeling for test-time reward
alignment with a text-to-image model, specifically to enhance aesthetics and
text alignment. Our regularization outperforms previous Gaussianity
regularization, effectively prevents reward hacking and accelerates
convergence.
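The moment-based part of this idea can be sketched as follows, assuming a simple squared-error penalty between empirical moments and the analytically known moments of a standard Gaussian ($0$ for odd orders, $(k-1)!!$ for even orders). The function names, the choice of orders, and the squared-error form are assumptions for illustration; the paper's actual loss also includes a spectral term and random permutations, which are not shown.

```python
import random


def gaussian_moment(k):
    """E[x^k] for x ~ N(0, 1): 0 for odd k, (k - 1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    result = 1.0
    for m in range(k - 1, 0, -2):
        result *= m
    return result


def moment_loss(sample, orders=(1, 2, 3, 4)):
    """Sum of squared gaps between empirical and standard Gaussian moments."""
    n = len(sample)
    loss = 0.0
    for k in orders:
        empirical = sum(x ** k for x in sample) / n
        loss += (empirical - gaussian_moment(k)) ** 2
    return loss


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]
uniform = [rng.uniform(-1.0, 1.0) for _ in range(20000)]

# The Gaussian sample should incur a much smaller penalty than the uniform one.
loss_gaussian = moment_loss(gaussian)
loss_uniform = moment_loss(uniform)
```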
[LINK]
http://arxiv.org/abs/2509.07027v2
[DATE]
2025-09-10 16:56:22+08:00
[CATEGORIES]
cs.LG
Statistical-Computational Trade-offs for Recursive Adaptive Partitioning Estimators
[AUTHORS]
Yan Shuo Tan, Jason M. Klusowski, Krishnakumar Balasubramanian
[ABSTRACT]
Models based on recursive adaptive partitioning such as decision trees and
their ensembles are popular for high-dimensional regression as they can
potentially avoid the curse of dimensionality. Because empirical risk
minimization (ERM) is computationally infeasible, these models are typically
trained using greedy algorithms. Although effective in many cases, these
algorithms have been empirically observed to get stuck at local optima. We
explore this phenomenon in the context of learning sparse regression functions
over $d$ binary features, showing that when the true regression function $f^*$
does not satisfy Abbe et al. (2022)’s Merged Staircase Property (MSP), greedy
training requires $\exp(\Omega(d))$ samples to achieve low estimation error.
Conversely, when $f^*$ does satisfy MSP, greedy training can attain small
estimation error with only $O(\log d)$ samples. This dichotomy mirrors that of
two-layer neural networks trained with stochastic gradient descent (SGD) in the
mean-field regime, thereby establishing a head-to-head comparison between
SGD-trained neural networks and greedy recursive partitioning estimators.
Furthermore, ERM-trained recursive partitioning estimators achieve low
estimation error with $O(\log d)$ samples irrespective of whether $f^*$
satisfies MSP, thereby demonstrating a statistical-computational trade-off for
greedy training. Our proofs are based on a novel interpretation of greedy
recursive partitioning using stochastic process theory and a coupling technique
that may be of independent interest.
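
As a toy illustration of why greedy splitting can stall, the sketch below computes the exact variance reduction of every first split on the full hypercube for two targets. It assumes features coded as -1/+1 and uses two hand-picked functions that are not taken from the paper: a single monomial x1*x2, which does not satisfy MSP as commonly stated, and x1 + x1*x2, which does. The helper name split_gain is hypothetical.

```python
import itertools
import numpy as np

def split_gain(X, y, j):
    """Variance reduction obtained by splitting the data on binary feature j."""
    left, right = y[X[:, j] == -1], y[X[:, j] == 1]
    weighted = (len(left) * left.var() + len(right) * right.var()) / len(y)
    return y.var() - weighted

d = 4
# Enumerate all 2^d points so the gains are population quantities, not estimates.
X = np.array(list(itertools.product([-1, 1], repeat=d)))

y_xor = X[:, 0] * X[:, 1]              # single monomial x1*x2 (fails MSP)
y_stair = X[:, 0] + X[:, 0] * X[:, 1]  # x1 + x1*x2 (satisfies MSP)

gains_xor = [split_gain(X, y_xor, j) for j in range(d)]
gains_stair = [split_gain(X, y_stair, j) for j in range(d)]
print("x1*x2      :", gains_xor)
print("x1 + x1*x2 :", gains_stair)
```

For x1*x2 every candidate split has zero gain, so a greedy learner has no signal telling relevant features from irrelevant ones. For x1 + x1*x2 the split on x1 has positive gain, and x2 becomes informative once x1 is fixed.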
[LINK]
http://arxiv.org/abs/2411.04394v3
[DATE]
2025-09-10 16:53:51+08:00
[CATEGORIES]
cs.LG
Generative Example-Based Explanations: Bridging the Gap between Generative Modeling and Explainability
[AUTHORS]
Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, Magda Gregorova
[ABSTRACT]
Recently, several methods have leveraged deep generative modeling to produce
example-based explanations of image classifiers. Despite producing visually
stunning results, these methods are largely disconnected from classical
explainability literature. This conceptual and communication gap leads to
misunderstandings and misalignments in goals and expectations. In this paper,
we bridge this gap by proposing a probabilistic framework for example-based
explanations, formally defining the example-based explanations in a
probabilistic manner amenable for modeling via deep generative models while
coherent with the critical characteristics and desiderata widely accepted in
the explainability community. Our aim is on one hand to provide a constructive
framework for the development of well-grounded generative algorithms for
example-based explanations and, on the other, to facilitate communication
between the generative and explainability research communities, foster rigor
and transparency, and improve the quality of peer discussion and research
progress in this promising direction.
[COMMENTS]
Accepted at the ECML 2025 Workshop for eXplainable Knowledge
Discovery in Data Mining and Unlearning
[LINK]
http://arxiv.org/abs/2410.20890v2
[DATE]
2025-09-10 16:43:29+08:00
[CATEGORIES]
cs.LG
LLM-Guided Ansätze Design for Quantum Circuit Born Machines in Financial Generative Modeling
[AUTHORS]
Yaswitha Gujju, Romain Harang, Tetsuo Shibuya
[ABSTRACT]
Quantum generative modeling using quantum circuit Born machines (QCBMs) shows
promising potential for practical quantum advantage. However, discovering
ansätze that are both expressive and hardware-efficient remains a key
challenge, particularly on noisy intermediate-scale quantum (NISQ) devices. In
this work, we introduce a prompt-based framework that leverages large language
models (LLMs) to generate hardware-aware QCBM architectures. Prompts are
conditioned on qubit connectivity, gate error rates, and hardware topology,
while iterative feedback, including Kullback-Leibler (KL) divergence, circuit
depth, and validity, is used to refine the circuits. We evaluate our method on
a financial modeling task involving daily changes in Japanese government bond
(JGB) interest rates. Our results show that the LLM-generated ansätze are
significantly shallower and achieve superior generative performance compared to
the standard baseline when executed on real IBM quantum hardware using 12
qubits. These findings demonstrate the practical utility of LLM-driven quantum
architecture search and highlight a promising path toward robust, deployable
generative models for near-term quantum devices.
[COMMENTS]
Work presented at the 3rd International Workshop on Quantum Machine
Learning: From Research to Practice (QML@QCE’25)
[LINK]
http://arxiv.org/abs/2509.08385v1
[DATE]
2025-09-10 16:23:58+08:00
[CATEGORIES]
cs.LG
Efficient Decoding Methods for Language Models on Encrypted Data
[AUTHORS]
Matan Avitan, Moran Baruch, Nir Drucker, Itamar Zimerman, Yoav Goldberg
[ABSTRACT]
Large language models (LLMs) power modern AI applications, but processing
sensitive data on untrusted servers raises privacy concerns. Homomorphic
encryption (HE) enables computation on encrypted data for secure inference.
However, neural text generation requires decoding methods like argmax and
sampling, which are non-polynomial and thus computationally expensive under
encryption, creating a significant performance bottleneck. We introduce cutmax,
an HE-friendly argmax algorithm that reduces ciphertext operations compared to
prior methods, enabling practical greedy decoding under encryption. We also
propose the first HE-compatible nucleus (top-p) sampling method, leveraging
cutmax for efficient stochastic decoding with provable privacy guarantees. Both
techniques are polynomial, supporting efficient inference in privacy-preserving
settings. Moreover, their differentiability facilitates gradient-based
sequence-level optimization as a polynomial alternative to straight-through
estimators. We further provide strong theoretical guarantees for cutmax,
proving it converges globally to a unique two-level fixed point, independent of
the input values beyond the identity of the maximizer, which explains its rapid
convergence in just a few iterations. Evaluations on realistic LLM outputs show
latency reductions of 24x-35x over baselines, advancing secure text generation.
[LINK]
http://arxiv.org/abs/2509.08383v1
[DATE]
2025-09-10 16:23:14+08:00
[CATEGORIES]
cs.LG
Co-Investigator AI: The Rise of Agentic AI for Smarter, Trustworthy AML Compliance Narratives
[AUTHORS]
Prathamesh Vasudeo Naik, Naresh Kumar Dintakurthi, Zhanghao Hu, Yue Wang, Robby Qiu
[ABSTRACT]
Generating regulatorily compliant Suspicious Activity Reports (SARs) remains a
high-cost, low-scalability bottleneck in Anti-Money Laundering (AML) workflows.
While large language models (LLMs) offer promising fluency, they suffer from
factual hallucination, limited crime typology alignment, and poor
explainability – posing unacceptable risks in compliance-critical domains.
This paper introduces Co-Investigator AI, an agentic framework optimized to
produce Suspicious Activity Reports (SARs) significantly faster and with
greater accuracy than traditional methods. Drawing inspiration from recent
advances in autonomous agent architectures, such as the AI Co-Scientist, our
approach integrates specialized agents for planning, crime type detection,
external intelligence gathering, and compliance validation. The system features
dynamic memory management, an AI-Privacy Guard layer for sensitive data
handling, and a real-time validation agent employing the Agent-as-a-Judge
paradigm to ensure continuous narrative quality assurance. Human investigators
remain firmly in the loop, empowered to review and refine drafts in a
collaborative workflow that blends AI efficiency with domain expertise. We
demonstrate the versatility of Co-Investigator AI across a range of complex
financial crime scenarios, highlighting its ability to streamline SAR drafting,
align narratives with regulatory expectations, and enable compliance teams to
focus on higher-order analytical work. This approach marks the beginning of a
new era in compliance reporting – bringing the transformative benefits of AI
agents to the core of regulatory processes and paving the way for scalable,
reliable, and transparent SAR generation.
[LINK]
http://arxiv.org/abs/2509.08380v1
[DATE]
2025-09-10 16:16:04+08:00
[CATEGORIES]
cs.LG
FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs–Down to 2 Bits!
[AUTHORS]
Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian
[ABSTRACT]
A widely-used technique in designing energy-efficient deep neural network
(DNN) accelerators is quantization. Recent progress in this direction has
reduced the bitwidths used in DNN down to 2. Meanwhile, many prior works apply
approximate multipliers (AppMuls) in designing DNN accelerators to lower their
energy consumption. Unfortunately, these works still assume a bitwidth much
larger than 2, which falls far behind the state of the art in quantization
and even challenges the meaningfulness of applying AppMuls in DNN accelerators,
since a high-bitwidth AppMul consumes much more energy than a low-bitwidth
exact multiplier! Thus, an important problem to study is: Can approximate
multipliers be effectively applied to quantized DNN models with very low
bitwidths? In this work, we give an affirmative answer to this question and
present a systematic solution that achieves the answer: FAMES, a fast
approximate multiplier substitution method for mixed-precision DNNs. Our
experiments demonstrate an average 28.67% energy reduction on state-of-the-art
mixed-precision quantized models with bitwidths as low as 2 bits and accuracy
losses kept under 1%. Additionally, our approach is up to 300x faster than
previous genetic algorithm-based methods.
[COMMENTS]
This work will be incorporated into another study as part of a larger
project, so we request to temporarily withdraw it. The new study involves
substantial changes and will be submitted as a new paper
[LINK]
http://arxiv.org/abs/2411.18055v4
[DATE]
2025-09-10 16:10:28+08:00
[CATEGORIES]
cs.LG
kNNSampler: Stochastic Imputations for Recovering Missing Value Distributions
[AUTHORS]
Parastoo Pashmchi, Jerome Benoit, Motonobu Kanagawa
[ABSTRACT]
We study a missing-value imputation method, termed kNNSampler, that imputes a
given unit’s missing response by randomly sampling from the observed responses
of the $k$ most similar units to the given unit in terms of the observed
covariates. This method can sample unknown missing values from their
distributions, quantify the uncertainties of missing values, and be readily
used for multiple imputation. Unlike the popular kNNImputer, which estimates the
conditional mean of a missing response given an observed covariate, kNNSampler
is theoretically shown to estimate the conditional distribution of a missing
response given an observed covariate. Experiments demonstrate its effectiveness
in recovering the distribution of missing values. The code for kNNSampler is
made publicly available (https://github.com/SAP/knn-sampler).
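
The sampling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation: the function name knn_sampler, the Euclidean distance, and the default k are assumptions made here.

```python
import numpy as np

def knn_sampler(X_obs, y_obs, X_mis, k=5, rng=None):
    """Impute each missing response by drawing one observed response
    uniformly at random from the k nearest units in covariate space.
    Euclidean distance is an assumption of this sketch."""
    rng = np.random.default_rng(rng)
    X_obs = np.asarray(X_obs, dtype=float)
    y_obs = np.asarray(y_obs)
    X_mis = np.asarray(X_mis, dtype=float)
    draws = np.empty(len(X_mis), dtype=y_obs.dtype)
    for i, x in enumerate(X_mis):
        dist = np.linalg.norm(X_obs - x, axis=1)
        neighbors = np.argsort(dist)[:k]        # indices of the k closest units
        draws[i] = y_obs[rng.choice(neighbors)]  # sample, do not average
    return draws

X_obs = [[0.0], [0.1], [0.2], [10.0], [10.1]]
y_obs = [1.0, 2.0, 3.0, 100.0, 101.0]
imputed = knn_sampler(X_obs, y_obs, [[0.05], [10.05]], k=2, rng=0)
print(imputed)
```

Calling the function repeatedly with different seeds gives several completed datasets, which is how a sampler of this kind would plug into multiple imputation. Replacing the random draw with a mean over the neighbors would give a conditional-mean imputer instead.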
[LINK]
http://arxiv.org/abs/2509.08366v1
[DATE]
2025-09-10 16:04:04+08:00
[CATEGORIES]
cs.LG
Prediction Loss Guided Decision-Focused Learning
[AUTHORS]
Haeun Jeon, Hyunglip Bae, Chanyeong Kim, Yongjae Lee, Woo Chang Kim
[ABSTRACT]
Decision-making under uncertainty is often considered in two stages:
predicting the unknown parameters, and then optimizing decisions based on
predictions. While traditional prediction-focused learning (PFL) treats these
two stages separately, decision-focused learning (DFL) trains the predictive
model by directly optimizing the decision quality in an end-to-end manner.
However, despite using exact or well-approximated gradients, vanilla DFL often
suffers from unstable convergence due to its flat-and-sharp loss landscapes. In
contrast, PFL yields more stable optimization, but overlooks the downstream
decision quality. To address this, we propose a simple yet effective approach:
perturbing the decision loss gradient using the prediction loss gradient to
construct an update direction. Our method requires no additional training and
can be integrated with any DFL solvers. Using the sigmoid-like decaying
parameter, we let the prediction loss gradient guide the decision loss gradient
to train a predictive model that optimizes decision quality. Also, we provide a
theoretical convergence guarantee to a Pareto stationary point under mild
assumptions. Empirically, we demonstrate our method across three stochastic
optimization problems, showing promising results compared to other baselines.
We validate that our approach achieves lower regret with more stable training,
even in situations where either PFL or DFL struggles.
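
One way to picture the update direction is sketched below. The abstract does not give the exact formula, so everything here is an assumption for illustration: the additive form of the perturbation, the logistic schedule, and the names decay_weight, guided_direction, midpoint, and scale.

```python
import numpy as np

def decay_weight(step, midpoint=50.0, scale=10.0):
    """Sigmoid-like weight that starts near 1 and decays toward 0.
    The schedule and its parameters are hypothetical."""
    return 1.0 / (1.0 + np.exp((step - midpoint) / scale))

def guided_direction(grad_decision, grad_prediction, step, **schedule):
    """Perturb the decision-loss gradient with the prediction-loss gradient.
    Early steps lean on the prediction loss, later steps on the decision loss."""
    w = decay_weight(step, **schedule)
    return np.asarray(grad_decision, dtype=float) + w * np.asarray(grad_prediction, dtype=float)

g_dec = [1.0, 0.0]   # placeholder gradient of the decision loss
g_pred = [0.0, 1.0]  # placeholder gradient of the prediction loss
early = guided_direction(g_dec, g_pred, step=0)
late = guided_direction(g_dec, g_pred, step=1000)
print(early, late)
```

Because the result is only a direction, it could be handed to whichever optimizer the underlying DFL solver already uses.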
[LINK]
http://arxiv.org/abs/2509.08359v1
[DATE]
2025-09-10 15:49:04+08:00
[CATEGORIES]
cs.LG
Chordless cycle filtrations for dimensionality detection in complex networks via topological data analysis
[AUTHORS]
Aina Ferrà Marcús, Robert Jankowski, Meritxell Vila Miñana, Carles Casacuberta, M. Ángeles Serrano
[ABSTRACT]
Many complex networks, ranging from social to biological systems, exhibit
structural patterns consistent with an underlying hyperbolic geometry.
Revealing the dimensionality of this latent space can disentangle the
structural complexity of communities, impact efficient network navigation, and
fundamentally shape connectivity and system behavior. We introduce a novel
topological data analysis weighting scheme for graphs, based on chordless
cycles, aimed at estimating the dimensionality of networks in a data-driven
way. We further show that the resulting descriptors can effectively estimate
network dimensionality using a neural network architecture trained in a
synthetic graph database constructed for this purpose, which does not need
retraining to transfer effectively to real-world networks. Thus, by combining
cycle-aware filtrations, algebraic topology, and machine learning, our approach
provides a robust and effective method for uncovering the hidden geometry of
complex networks and guiding accurate modeling and low-dimensional embedding.
[LINK]
http://arxiv.org/abs/2509.08350v1
[DATE]
2025-09-10 15:40:48+08:00
[CATEGORIES]
cs.LG
Nearest Neighbor Projection Removal Adversarial Training
[AUTHORS]
Himanshu Singh, A. V. Subramanyam, Shivank Rajput, Mohan Kankanhalli
[ABSTRACT]
Deep neural networks have exhibited impressive performance in image
classification tasks but remain vulnerable to adversarial examples. Standard
adversarial training enhances robustness but typically fails to explicitly
address inter-class feature overlap, a significant contributor to adversarial
susceptibility. In this work, we introduce a novel adversarial training
framework that actively mitigates inter-class proximity by projecting out
inter-class dependencies from adversarial and clean samples in the feature
space. Specifically, our approach first identifies the nearest inter-class
neighbors for each adversarial sample and subsequently removes projections onto
these neighbors to enforce stronger feature separability. Theoretically, we
demonstrate that our proposed logits correction reduces the Lipschitz constant
of neural networks, thereby lowering the Rademacher complexity, which directly
contributes to improved generalization and robustness. Extensive experiments
across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that
our method demonstrates strong performance that is competitive with leading
adversarial training techniques, highlighting significant achievements in both
robust and clean accuracy. Our findings reveal the importance of addressing
inter-class feature proximity explicitly to bolster adversarial robustness in
DNNs.
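
The projection-removal step can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' code: the function name is hypothetical, Euclidean distance is assumed for the neighbor search, and the sketch acts on generic feature vectors although the paper also refers to a logits correction.

```python
import numpy as np

def remove_nearest_interclass_projection(feats, labels):
    """For each sample, find the nearest sample of a different class and
    subtract the component of the feature vector that lies along it."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    out = feats.copy()
    for i, (z, c) in enumerate(zip(feats, labels)):
        others = feats[labels != c]
        if len(others) == 0:
            continue  # no other class present, nothing to remove
        nb = others[np.argmin(np.linalg.norm(others - z, axis=1))]
        denom = nb @ nb
        if denom > 0:
            out[i] = z - (z @ nb) / denom * nb  # orthogonal to the neighbor
    return out

feats = np.array([[1.0, 1.0], [1.0, 0.0]])
labels = np.array([0, 1])
cleaned = remove_nearest_interclass_projection(feats, labels)
print(cleaned)
```

After the correction each feature vector is orthogonal to its nearest other-class neighbor, which is the separability effect the abstract describes.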
[LINK]
http://arxiv.org/abs/2509.07673v2
[DATE]
2025-09-10 15:36:45+08:00
[CATEGORIES]
cs.LG
Accelerating Reinforcement Learning Algorithms Convergence using Pre-trained Large Language Models as Tutors With Advice Reusing
[AUTHORS]
Lukas Toral, Teddy Lazebnik
[ABSTRACT]
Reinforcement Learning (RL) algorithms often require long training to become
useful, especially in complex environments with sparse rewards. While
techniques like reward shaping and curriculum learning exist to accelerate
training, these are often highly problem-specific and require dedicated developer
expertise in the problem’s domain. Tackling this
challenge, in this study, we explore the effectiveness of pre-trained Large
Language Models (LLMs) as tutors in a student-teacher architecture with RL
algorithms, hypothesizing that LLM-generated guidance allows for faster
convergence. In particular, we explore the effectiveness of reusing the LLM’s
advice on the RL’s convergence dynamics. Through an extensive empirical
examination, which included 54 configurations, varying the RL algorithm (DQN,
PPO, A2C), LLM tutor (Llama, Vicuna, DeepSeek), and environment (Blackjack,
Snake, Connect Four), our results demonstrate that LLM tutoring significantly
accelerates RL convergence while maintaining comparable optimal performance.
Furthermore, the advice reuse mechanism shows a further improvement in training
duration but also results in less stable convergence dynamics. Our findings
suggest that LLM tutoring generally improves convergence, and its effectiveness
is sensitive to the specific task, RL algorithm, and LLM model combination.
[LINK]
http://arxiv.org/abs/2509.08329v1
[DATE]
2025-09-10 15:08:04+08:00
[CATEGORIES]
cs.LG
A general language model for peptide identification
[AUTHORS]
Jixiu Zhai, Zikun Wang, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang
[ABSTRACT]
Accurate identification of bioactive peptides (BPs) and protein
post-translational modifications (PTMs) is essential for understanding protein
function and advancing therapeutic discovery. However, most computational
methods remain limited in their generalizability across diverse peptide
functions. Here, we present PDeepPP, a unified deep learning framework that
integrates pretrained protein language models with a hybrid
transformer-convolutional architecture, enabling robust identification across
diverse peptide classes and PTM sites. We curated comprehensive benchmark
datasets and implemented strategies to address data imbalance, allowing PDeepPP
to systematically extract both global and local sequence features. Through
extensive analyses, including dimensionality reduction and comparison
studies, PDeepPP demonstrates strong, interpretable peptide representations and
achieves state-of-the-art performance in 25 of the 33 biological identification
tasks. Notably, PDeepPP attains high accuracy in antimicrobial (0.9726) and
phosphorylation site (0.9984) identification, with 99.5% specificity in
glycosylation site prediction and substantial reduction in false negatives in
antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP
supports biomedical research and the discovery of novel therapeutic targets for
disease treatment. All code, datasets, and pretrained models are publicly
available via GitHub: https://github.com/fondress/PDeepPP and Hugging
Face: https://huggingface.co/fondress/PDeppPP.
[COMMENTS]
24 pages, 9 figures, 4 tables, submitted to arXiv
[LINK]
http://arxiv.org/abs/2502.15610v5
[DATE]
2025-09-10 14:54:53+08:00
[CATEGORIES]
cs.LG
GTS_Forecaster: a novel deep learning based geodetic time series forecasting toolbox with python
[AUTHORS]
Xuechen Liang, Xiaoxing He, Shengdao Wang, Jean-Philippe Montillet, Zhengkai Huang, Gaël Kermarrec, Shunqiang Hu, Yu Zhou, Jiahui Huang
[ABSTRACT]
Geodetic time series – such as Global Navigation Satellite System (GNSS)
positions, satellite altimetry-derived sea surface height (SSH), and tide gauge
(TG) records – are essential for monitoring surface deformation and sea level
change. Accurate forecasts of these variables can enhance early warning systems
and support hazard mitigation for earthquakes, landslides, coastal storm surge,
and long-term sea level. However, the nonlinear, non-stationary, and incomplete
nature of such variables presents significant challenges for classic models,
which often fail to capture long-term dependencies and complex spatiotemporal
dynamics. We introduce GTS Forecaster, an open-source Python package for
geodetic time series forecasting. It integrates advanced deep learning models
– including kernel attention networks (KAN), graph neural network-based gated
recurrent units (GNNGRU), and time-aware graph neural networks (TimeGNN) – to
effectively model nonlinear spatial-temporal patterns. The package also
provides robust preprocessing tools, including outlier detection and a
reinforcement learning-based gap-filling algorithm, the Kalman-TransFusion
Interpolation Framework (KTIF). GTS Forecaster currently supports forecasting,
visualization, and evaluation of GNSS, SSH, and TG datasets, and is adaptable
to general time series applications. By combining cutting-edge models with an
accessible interface, it facilitates the application of deep learning in
geodetic forecasting tasks.
[LINK]
http://arxiv.org/abs/2509.10560v1
[DATE]
2025-09-10 14:33:09+08:00
[CATEGORIES]
cs.LG
FoQuS: A Forgetting-Quality Coreset Selection Framework for Automatic Modulation Recognition
[AUTHORS]
Yao Lu, Chunfeng Sun, Dongwei Xu, Yun Lin, Qi Xuan, Guan Gui
[ABSTRACT]
Deep learning-based Automatic Modulation Recognition (AMR) has made
significant progress with the support of large-scale labeled data. However,
when developing new models or performing hyperparameter tuning, the time and
energy consumption associated with repeated training using massive amounts of
data are often unbearable. To address the above challenges, we propose
FoQuS, which approximates the effect of full training by selecting a
coreset from the original dataset, thereby significantly reducing training
overhead. Specifically, FoQuS records the prediction trajectory of each
sample during full-dataset training and constructs three importance metrics
based on training dynamics. Experiments show that FoQuS can maintain
high recognition accuracy and good cross-architecture generalization on
multiple AMR datasets using only 1%-30% of the original data.
[LINK]
http://arxiv.org/abs/2509.08300v1
[DATE]
2025-09-10 13:39:49+08:00
[CATEGORIES]
cs.LG
From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks
[AUTHORS]
Yuyang Zhou, Guang Cheng, Kang Du, Zihan Chen, Tian Qin, Yuyu Zhao
[ABSTRACT]
The proliferation of UAVs has enabled a wide range of mission-critical
applications and is becoming a cornerstone of low-altitude networks, supporting
smart cities, emergency response, and more. However, the open wireless
environment, dynamic topology, and resource constraints of UAVs expose
low-altitude networks to severe DoS threats. Traditional defense approaches,
which rely on fixed configurations or centralized decision-making, cannot
effectively respond to the rapidly changing conditions in UAV swarm
environments. To address these challenges, we propose a novel federated
multi-agent deep reinforcement learning (FMADRL)-driven moving target defense
(MTD) framework for proactive DoS mitigation in low-altitude networks.
Specifically, we design lightweight and coordinated MTD mechanisms, including
leader switching, route mutation, and frequency hopping, to disrupt attacker
efforts and enhance network resilience. The defense problem is formulated as a
multi-agent partially observable Markov decision process, capturing the
uncertain nature of UAV swarms under attack. Each UAV is equipped with a policy
agent that autonomously selects MTD actions based on partial observations and
local experiences. By employing a policy gradient-based algorithm, UAVs
collaboratively optimize their policies via reward-weighted aggregation.
Extensive simulations demonstrate that our approach significantly outperforms
state-of-the-art baselines, achieving up to a 34.6% improvement in attack
mitigation rate, a reduction in average recovery time of up to 94.6%, and
decreases in energy consumption and defense cost by as much as 29.3% and 98.3%,
respectively, under various DoS attack strategies. These results highlight the
potential of intelligent, distributed defense mechanisms to protect
low-altitude networks, paving the way for reliable and scalable low-altitude
economy.
[COMMENTS]
16 pages; Major Revision for IEEE TCCN
[LINK]
http://arxiv.org/abs/2506.07392v2
[DATE]
2025-09-10 11:47:56+08:00
[CATEGORIES]
cs.LG
A single-loop SPIDER-type stochastic subgradient method for expectation-constrained nonconvex nonsmooth optimization
[AUTHORS]
Wei Liu, Yangyang Xu
[ABSTRACT]
Many real-world problems, such as those with fairness constraints, involve
complex expectation constraints and large datasets, necessitating the design of
efficient stochastic methods to solve them. Most existing research focuses on
cases with no constraints, easy-to-project constraints, or deterministic
constraints. In this paper, we consider nonconvex nonsmooth stochastic
optimization problems with expectation constraints, for which we build a novel
exact penalty model. We first show the relationship between the penalty model
and the original problem. Then on solving the penalty problem, we present a
single-loop SPIDER-type stochastic subgradient method, which utilizes the
subgradients of both the objective and constraint functions, as well as the
constraint function value at each iteration. Under certain regularity
conditions (weaker than Slater-type constraint qualification or strong
feasibility assumed in existing works), we establish an iteration complexity
result of $O(\epsilon^{-4})$ to reach a near-$\epsilon$ stationary point of the
penalized problem in expectation, matching the lower bound for such tasks.
Building on the exact penalization, an $(\epsilon,\epsilon)$-KKT point of the
original problem is obtained. For a few scenarios, our complexity of either the
objective sample subgradient or the constraint sample function values can be
lower than the state-of-the-art results by a factor of $\epsilon^{-2}$.
Moreover, on solving two fairness-constrained problems and a multi-class
Neyman-Pearson classification problem, our method is significantly (up to 466
times) faster than the state-of-the-art algorithms, including switching
subgradient method and inexact proximal point methods.
[COMMENTS]
Key word: stochastic, subgradient, expectation constraints, weakly
convex, fairness constrained classification
[LINK]
http://arxiv.org/abs/2501.19214v2
[DATE]
2025-09-10 10:48:51+08:00
[CATEGORIES]
cs.LG
Discrete Diffusion in Large Language and Multimodal Models: A Survey
[AUTHORS]
Runpeng Yu, Qi Li, Xinchao Wang
[ABSTRACT]
In this work, we provide a systematic survey of Discrete Diffusion Language
Models (dLLMs) and Discrete Diffusion Multimodal Language Models (dMLLMs).
Unlike autoregressive (AR) models, dLLMs and dMLLMs adopt a multi-token,
parallel decoding paradigm using full attention and a denoising-based
generation strategy. This paradigm naturally enables parallel generation,
fine-grained output control, and dynamic perception. These capabilities were
previously difficult to achieve with AR models. A growing number of
industrial-scale proprietary d(M)LLMs, as well as a large number of open-source
academic d(M)LLMs, have demonstrated performance comparable to their
autoregressive counterparts, while achieving up to 10$\times$
acceleration in inference speed. These developments position discrete diffusion
models as a promising alternative to intelligence based on the traditional
autoregressive approach. In this work, we present a comprehensive overview of
the research in the dLLM and dMLLM domains. We trace the historical development
of dLLMs and dMLLMs, formalize the underlying mathematical frameworks, and
categorize representative models. We further analyze key techniques for
training and inference, and summarize emerging applications across language,
vision-language, biological, and other domains. We conclude by
discussing future directions for research and deployment. Related papers are
collected in https://github.com/LiQiiiii/Awesome-Discrete-Diffusion-LLM_MLLM
[LINK]
http://arxiv.org/abs/2506.13759v4
[DATE]
2025-09-10 10:11:26+08:00
[CATEGORIES]
cs.LG
A Multimodal RAG Framework for Housing Damage Assessment: Collaborative Optimization of Image Encoding and Policy Vector Retrieval
[AUTHORS]
Jiayi Miao, Dingxin Lu, Zhuqi Wang
[ABSTRACT]
After natural disasters, accurate evaluations of damage to housing are
important for insurance claims response and planning of resources. In this
work, we introduce a novel multimodal retrieval-augmented generation (MM-RAG)
framework. Building on the classical RAG architecture, we devise a two-branch
multimodal encoder: the image branch employs a visual encoder composed of
ResNet and Transformer to extract post-disaster building damage features, and
the text branch uses a BERT retriever to vectorize posts and insurance
policies and to construct a retrievable restoration index. To impose
cross-modal semantic alignment, the model integrates a cross-modal interaction
module to bridge the semantic representation between image and text via
multi-head attention. Meanwhile, in the generation module, the introduced modal
attention gating mechanism dynamically controls the role of visual evidence and
text prior information during generation. The entire framework is trained
end-to-end, combining the comparison loss, the retrieval loss, and the
generation loss into a multi-task optimization objective, and jointly learns
image understanding and policy matching. The results
demonstrate superior performance in retrieval accuracy and classification index
on damage severity, where the Top-1 retrieval accuracy has been improved by
9.6%.
[LINK]
http://arxiv.org/abs/2509.09721v1
[DATE]
2025-09-10 09:58:07+08:00
[CATEGORIES]
cs.LG
Ensemble Distribution Distillation for Self-Supervised Human Activity Recognition
[AUTHORS]
Matthew Nolan, Lina Yao, Robert Davidson
[ABSTRACT]
Human Activity Recognition (HAR) has seen significant advancements with the
adoption of deep learning techniques, yet challenges remain in terms of data
requirements, reliability and robustness. This paper explores a novel
application of Ensemble Distribution Distillation (EDD) within a
self-supervised learning framework for HAR aimed at overcoming these
challenges. By leveraging unlabeled data and a partially supervised training
strategy, our approach yields an increase in predictive accuracy, robust
estimates of uncertainty, and substantial increases in robustness against
adversarial perturbation; thereby significantly improving reliability in
real-world scenarios without increasing computational complexity at inference.
We demonstrate this with an evaluation on several publicly available datasets.
The contributions of this work include the development of a self-supervised EDD
framework, an innovative data augmentation technique designed for HAR, and
empirical validation of the proposed method’s effectiveness in increasing
robustness and reliability.
[COMMENTS]
37 pages, 10 figures
[LINK]
http://arxiv.org/abs/2509.08225v1
[DATE]
2025-09-10 09:55:20+08:00
[CATEGORIES]
cs.LG
A Randomized Zeroth-Order Hierarchical Framework for Heterogeneous Federated Learning
[AUTHORS]
Yuyang Qiu, Kibaek Kim, Farzad Yousefian
[COMMENTS]
Accepted at the 64th IEEE Conference on Decision and Control (CDC
2025)
[LINK]
http://arxiv.org/abs/2504.01839v2
[DATE]
2025-09-10 09:46:49+08:00
[CATEGORIES]
cs.LG
MetaExplainer: A Framework to Generate Multi-Type User-Centered Explanations for AI Systems
[AUTHORS]
Shruthi Chari, Oshani Seneviratne, Prithwish Chakraborty, Pablo Meyer, Deborah L. McGuinness
[ABSTRACT]
Explanations are crucial for building trustworthy AI systems, but a gap often
exists between the explanations provided by models and those needed by users.
To address this gap, we introduce MetaExplainer, a neuro-symbolic framework
designed to generate user-centered explanations. Our approach employs a
three-stage process: first, we decompose user questions into machine-readable
formats using state-of-the-art large language models (LLMs); second, we delegate
the task of generating system recommendations to model explainer methods; and
finally, we synthesize natural language explanations that summarize the
explainer outputs. Throughout this process, we utilize an Explanation Ontology
to guide the language models and explainer methods. By leveraging LLMs and a
structured approach to explanation generation, MetaExplainer aims to enhance
the interpretability and trustworthiness of AI systems across various
applications, providing users with tailored, question-driven explanations that
better meet their needs. Comprehensive evaluations of MetaExplainer demonstrate
a step towards evaluating and utilizing current state-of-the-art explanation
frameworks. Our results show high performance across all stages, with a 59.06%
F1-score in question reframing, 70% faithfulness in model explanations, and 67%
context-utilization in natural language synthesis. User studies corroborate
these findings, highlighting the creativity and comprehensiveness of generated
explanations. Tested on the Diabetes (PIMA Indian) tabular dataset,
MetaExplainer supports diverse explanation types, including Contrastive,
Counterfactual, Rationale, Case-Based, and Data explanations. The framework’s
versatility and traceability from using ontology to guide LLMs suggest broad
applicability beyond the tested scenarios, positioning MetaExplainer as a
promising tool for enhancing AI explainability across various domains.
[LINK]
http://arxiv.org/abs/2508.00300v2
[DATE]
2025-09-10 09:46:21+08:00
[CATEGORIES]
cs.LG
Generative Quasi-Continuum Modeling of Confined Fluids at the Nanoscale
[AUTHORS]
Bugra Yalcin, Ishan Nadkarni, Jinu Jeong, Chenxing Liang, Narayana R. Aluru
[ABSTRACT]
We present a data-efficient, multiscale framework for predicting the density
profiles of confined fluids at the nanoscale. While accurate density estimates
require prohibitively long timescales that are inaccessible by ab initio
molecular dynamics (AIMD) simulations, machine-learned molecular dynamics
(MLMD) offers a scalable alternative, enabling the generation of force
predictions at ab initio accuracy with reduced computational cost. However,
despite their efficiency, MLMD simulations remain constrained by femtosecond
timesteps, which limit their practicality for computing long-time averages
needed for accurate density estimation. To address this, we propose a
conditional denoising diffusion probabilistic model (DDPM) based
quasi-continuum approach that predicts the long-time behavior of force profiles
along the confinement direction, conditioned on noisy forces extracted from a
limited AIMD dataset. The predicted smooth forces are then linked to continuum
theory via the Nernst-Planck equation to reveal the underlying density
behavior. We test the framework on water confined between two graphene
nanoscale slits and demonstrate that density profiles for channel widths
outside of the training domain can be recovered with ab initio accuracy.
Compared to AIMD and MLMD simulations, our method achieves orders-of-magnitude
speed-up in runtime and requires significantly less training data than prior
works.
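
As a worked illustration of how a smooth force profile can be linked to a density profile, the block below writes out the standard zero-flux (steady-state) form of the one-dimensional Nernst-Planck equation. It is a sketch under stated assumptions, not necessarily the exact formulation used in the paper: $f(z)$ denotes the mean force along the confinement direction $z$, $\rho(z)$ the density, $D$ a constant diffusion coefficient, and $z_0$ a reference position with known density. All of these symbols are assumptions introduced here.

```latex
% Zero net flux along the confinement direction z (assumed steady state):
J(z) = -D\left[\frac{d\rho(z)}{dz} - \frac{f(z)}{k_B T}\,\rho(z)\right] = 0
\quad\Longrightarrow\quad
\frac{d\rho(z)}{dz} = \frac{f(z)}{k_B T}\,\rho(z).

% Integrating from the reference position z_0 gives the density profile:
\rho(z) = \rho(z_0)\,\exp\!\left(\frac{1}{k_B T}\int_{z_0}^{z} f(z')\,dz'\right).
```

In this form, noise in $f(z)$ is integrated and then exponentiated, which is one way to see why a smooth long-time force estimate matters for the resulting density.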
[LINK]
http://arxiv.org/abs/2509.08223v1
[DATE]
2025-09-10 09:44:27+08:00
[CATEGORIES]
cs.LG
FedComLoc: Communication-Efficient Distributed Training of Sparse and Quantized Models
[AUTHORS]
Kai Yi, Georg Meinhardt, Laurent Condat, Peter Richtárik
[ABSTRACT]
Federated Learning (FL) has garnered increasing attention due to its unique
characteristic of allowing heterogeneous clients to process their private data
locally and interact with a central server, while being respectful of privacy.
A critical bottleneck in FL is the communication cost. A pivotal strategy to
mitigate this burden is Local Training, which involves running multiple local
stochastic gradient descent iterations between communication phases. Our work
is inspired by the innovative Scaffnew algorithm, which has considerably
advanced the reduction of communication complexity in FL. We introduce
FedComLoc (Federated Compressed and Local Training), integrating practical and
effective compression into Scaffnew to further enhance communication
efficiency. Extensive experiments, using the popular TopK compressor and
quantization, demonstrate its prowess in substantially reducing communication
overheads in heterogeneous settings.
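
For readers unfamiliar with the TopK compressor mentioned above, the following is a minimal sketch of the generic operator: keep the k largest-magnitude entries of a vector and zero out the rest. The function name `top_k_compress` and the example vector are assumptions made for illustration. This is not the FedComLoc implementation and does not show how compression is integrated into Scaffnew.

```python
def top_k_compress(vector, k):
    """Keep the k largest-magnitude entries of `vector`; set the rest to 0.0.

    Generic TopK sparsifier (hypothetical helper, not taken from the paper).
    """
    if k <= 0:
        return [0.0] * len(vector)
    if k >= len(vector):
        return list(vector)
    # Rank indices by descending magnitude; Python's sort is stable,
    # so ties are resolved in favor of the lower index.
    ranked = sorted(range(len(vector)), key=lambda i: -abs(vector[i]))
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vector)]


if __name__ == "__main__":
    update = [0.1, -2.0, 0.05, 1.5, -0.3]
    print(top_k_compress(update, 2))
```

Only the surviving entries and their indices need to be transmitted, which is where the communication saving comes from. Smaller k means less traffic but a coarser approximation of the original vector.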
[COMMENTS]
Accepted version at Transactions on Machine Learning Research (TMLR)
[LINK]
http://arxiv.org/abs/2403.09904v2
[DATE]
2025-09-10 09:18:56+08:00
[CATEGORIES]
cs.LG
Traversal Learning: A Lossless And Efficient Distributed Learning Framework
[AUTHORS]
Erdenebileg Batbaatar, Jeonggeol Kim, Yongcheol Kim, Young Yoon
[ABSTRACT]
In this paper, we introduce Traversal Learning (TL), a novel approach
designed to address the problem of decreased quality encountered in popular
distributed learning (DL) paradigms such as Federated Learning (FL), Split
Learning (SL), and SplitFed Learning (SFL). Traditional FL suffers from an
accuracy drop during aggregation due to its averaging function, while SL and
SFL face increased loss due to the independent gradient updates on each split
network. TL adopts a unique strategy where the model traverses the nodes during
forward propagation (FP) and performs backward propagation (BP) on the
orchestrator, effectively implementing centralized learning (CL) principles
within a distributed environment. The orchestrator is tasked with generating
virtual batches and planning the sequential node visits of the model during FP,
aligning them with the ordered index of the data within these batches. We
conducted experiments on six datasets representing diverse characteristics
across various domains. Our evaluation demonstrates that TL is on par with
classic CL approaches in terms of accurate inference, thereby offering a viable
and robust solution for DL tasks. TL outperformed other DL methods and improved
accuracy by 7.85% for independent and identically distributed (IID) datasets,
macro F1-score by 1.06% for non-IID datasets, accuracy by 2.60% for text
classification, and AUC by 3.88% and 4.54% for medical and financial datasets,
respectively. By effectively preserving data privacy while maintaining
performance, TL represents a significant advancement in DL methodologies. The
implementation of TL is available at
https://github.com/neouly-inc/Traversal-Learning
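
The orchestrator's role described above (forming virtual batches and planning node visits aligned with the data order inside each batch) can be sketched as follows. This is one possible reading of the abstract, not the authors' implementation. The function names `plan_virtual_batches` and `visit_plan`, the `node_sizes` input, and the uniform shuffling are all assumptions.

```python
import random


def plan_virtual_batches(node_sizes, batch_size, seed=0):
    """Build virtual batches over data that stays on the nodes.

    node_sizes: dict mapping node id -> number of local samples (assumed input).
    Returns a list of batches; each batch is a list of (node_id, local_index).
    Only indices are handled here; no raw data leaves a node.
    """
    rng = random.Random(seed)
    global_index = [
        (node, i) for node, n in sorted(node_sizes.items()) for i in range(n)
    ]
    rng.shuffle(global_index)  # mix samples across nodes, as in centralized training
    return [
        global_index[start:start + batch_size]
        for start in range(0, len(global_index), batch_size)
    ]


def visit_plan(batch):
    """Group one batch by node, remembering each sample's position in the batch.

    The positions let the orchestrator reassemble forward-pass outputs
    in the original batch order before running backward propagation.
    """
    plan = {}
    for position, (node, local_idx) in enumerate(batch):
        plan.setdefault(node, []).append((position, local_idx))
    return plan


if __name__ == "__main__":
    batches = plan_virtual_batches({"node_a": 3, "node_b": 2}, batch_size=2)
    for batch in batches:
        print(visit_plan(batch))
```

Keeping the batch position alongside each local index is the design choice that matters here: it is what allows per-node outputs to be placed back into a single ordered batch on the orchestrator.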
[LINK]
http://arxiv.org/abs/2504.07471v2
[DATE]
2025-09-10 09:08:54+08:00
[CATEGORIES]
cs.LG
HopCast: Calibration of Autoregressive Dynamics Models
[AUTHORS]
Muhammad Bilal Shahid, Cody Fleming
[ABSTRACT]
Deep learning models are often trained to approximate dynamical systems that
can be modeled using differential equations. Many of these models are optimized
to predict one step ahead; such approaches produce calibrated one-step
predictions if the predictive model can quantify uncertainty, such as Deep
Ensembles. At inference time, multi-step predictions are generated via
autoregression, which needs a sound uncertainty propagation method to produce
calibrated multi-step predictions. This work introduces an alternative
Predictor-Corrector approach named HopCast that uses Modern Hopfield Networks
(MHN) to learn the errors of a deterministic Predictor that approximates the
dynamical system. The Corrector predicts a set of errors for the Predictor’s
output based on a context state at any timestep during autoregression. The set
of errors creates sharper and well-calibrated prediction intervals with higher
predictive accuracy compared to baselines without uncertainty propagation. The
calibration and prediction performances are evaluated across a set of dynamical
systems. This work is also the first to benchmark existing uncertainty
propagation methods based on calibration errors.
[LINK]
http://arxiv.org/abs/2501.16587v4
[DATE]
2025-09-10 08:31:26+08:00
[CATEGORIES]
cs.LG
Damped Proximal Augmented Lagrangian Method for weakly-Convex Problems with Convex Constraints
[AUTHORS]
Hari Dahal, Wei Liu, Yangyang Xu
[ABSTRACT]
We give a damped proximal augmented Lagrangian method (DPALM) for solving
problems with a weakly-convex objective and convex linear/nonlinear
constraints. Instead of taking a full stepsize, DPALM adopts a damped dual
stepsize to ensure the boundedness of dual iterates. We show that DPALM can
produce a (near) $\varepsilon$-KKT point within $O(\varepsilon^{-2})$ outer iterations
if each DPALM subproblem is solved to a proper accuracy. In addition, we
establish overall iteration complexity of DPALM when the objective is either a
regularized smooth function or in a regularized compositional form. For the
former case, DPALM achieves the complexity of
$\widetilde{\mathcal{O}}\left(\varepsilon^{-2.5} \right)$ to produce an
$\varepsilon$-KKT point by applying an accelerated proximal gradient (APG)
method to each DPALM subproblem. For the latter case, the complexity of DPALM
is $\widetilde{\mathcal{O}}\left(\varepsilon^{-3} \right)$ to produce a near
$\varepsilon$-KKT point by using an APG to solve a Moreau-envelope smoothed
version of each subproblem. Our outer iteration complexity and the overall
complexity either generalize existing best ones from unconstrained or
linear-constrained problems to convex-constrained ones, or improve over the
best-known results on solving the same-structured problems. Furthermore,
numerical experiments on linearly/quadratically constrained non-convex
quadratic programs and linear-constrained robust nonlinear least squares are
conducted to demonstrate the empirical efficiency of the proposed DPALM over
several state-of-the-art methods.
[COMMENTS]
27 pages
[LINK]
http://arxiv.org/abs/2311.09065v2
[DATE]
2025-09-10 08:21:29+08:00
[CATEGORIES]
cs.LG