The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements
[AUTHORS]
Bingchen Zhao, Despoina Magka, Minqi Jiang, Xian Li, Roberta Raileanu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Kelvin Niu, Shagun Sodhani, Michael Shvartsman, Andrei Lupu, Alisia Lupidi, Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Thomas Foster, Lucia Cipolina-Kun, Abhishek Charnalia, Derek Dunfield, Alexander H. Miller, Oisin Mac Aodha, Jakob Foerster, Yoram Bachrach
[ABSTRACT]
Rapid advancements in large language models (LLMs) have the potential to
assist in scientific progress. A critical capability toward this endeavor is
the ability to reproduce existing work. To evaluate the ability of AI agents to
reproduce results in an active research area, we introduce the Automated LLM
Speedrunning Benchmark, leveraging the research community contributions on the
NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time.
Each of the 19 speedrun tasks provides the agent with the previous record's
training script, optionally paired with one of three hint formats, ranging from
pseudocode to paper-like descriptions of the new record's improvements. Records
execute quickly by design and speedrun improvements encompass diverse
code-level changes, ranging from high-level algorithmic advancements to
hardware-aware optimizations. These features make the benchmark both accessible
and realistic for the frontier problem of improving LLM training. We find that
recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement
already-known innovations in our benchmark, even when given detailed hints. Our
benchmark thus provides a simple, non-saturated measure of an LLM's ability to
automate scientific reproduction, a necessary (but not sufficient) skill for an
autonomous research agent.
[LINK]
http://arxiv.org/abs/2506.22419v1
[DATE]
2025-06-28 01:44:32+08:00
[CATEGORIES]
cs.CL
cs.LG
Sequential Diagnosis with Language Models
[AUTHORS]
Harsha Nori, Mayank Daswani, Christopher Kelly, Scott Lundberg, Marco Tulio Ribeiro, Marc Wilson, Xiaoxuan Liu, Viknesh Sounderajah, Jonathan Carlson, Matthew P Lungren, Bay Gross, Peter Hames, Mustafa Suleyman, Dominic King, Eric Horvitz
[ABSTRACT]
Artificial intelligence holds great promise for expanding access to expert
medical knowledge and reasoning. However, most evaluations of language models
rely on static vignettes and multiple-choice questions that fail to reflect the
complexity and nuance of evidence-based medicine in real-world settings. In
clinical practice, physicians iteratively formulate and revise diagnostic
hypotheses, adapting each subsequent question and test to what they’ve just
learned, and weigh the evolving evidence before committing to a final
diagnosis. To emulate this iterative process, we introduce the Sequential
Diagnosis Benchmark, which transforms 304 diagnostically challenging New
England Journal of Medicine clinicopathological conference (NEJM-CPC) cases
into stepwise diagnostic encounters. A physician or AI begins with a short case
abstract and must iteratively request additional details from a gatekeeper
model that reveals findings only when explicitly queried. Performance is
assessed not just by diagnostic accuracy but also by the cost of physician
visits and tests performed. We also present the MAI Diagnostic Orchestrator
(MAI-DxO), a model-agnostic orchestrator that simulates a panel of physicians,
proposes likely differential diagnoses and strategically selects high-value,
cost-effective tests. When paired with OpenAI’s o3 model, MAI-DxO achieves 80%
diagnostic accuracy, four times higher than the 20% average of generalist
physicians. MAI-DxO also reduces diagnostic costs by 20% compared to
physicians, and 70% compared to off-the-shelf o3. When configured for maximum
accuracy, MAI-DxO achieves 85.5% accuracy. These performance gains with MAI-DxO
generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and
Llama families. We highlight how AI systems, when guided to think iteratively
and act judiciously, can advance diagnostic precision and cost-effectiveness in
clinical care.
[COMMENTS]
23 pages, 10 figures
[LINK]
http://arxiv.org/abs/2506.22405v1
[DATE]
2025-06-28 01:27:26+08:00
[CATEGORIES]
cs.CL
HyperCLOVA X THINK Technical Report
[AUTHORS]
NAVER Cloud HyperCLOVA X Team
[ABSTRACT]
We introduce HyperCLOVA X THINK, the first reasoning-focused large language
model in the HyperCLOVA X family, pre-trained on roughly 6 trillion
high-quality Korean and English tokens, augmented with targeted synthetic
Korean data. It was implemented as a compute-memory-balanced Peri-LN
Transformer scaled with μP, pre-trained through a three-stage curriculum
that expands the context window to 128K tokens, and post-trained via
supervised fine-tuning with Reinforcement Learning from Verifiable Rewards; it
supports both detailed-rationale and concise-answer modes. It delivers
competitive performance against similarly sized models on Korea-focused
benchmarks such as KMMLU, CSAT, KoBALT-700, HAERAE-1.0, and KoBigBench, while
preserving robust bilingual consistency and translation quality. In addition, a
vision-augmented variant matches or exceeds GPT-4.1 on the KCSAT STEM
benchmark, all of which are achieved with substantially lower training compute
than existing models of similar sizes. We also present a pruning and
distillation technique that will soon be applied to HyperCLOVA X THINK for an
open-source and business-friendly foundation model. Altogether, these
capabilities position HyperCLOVA X THINK as a robust foundation for Korean AI
innovation and a valuable resource for the global research community.
[COMMENTS]
49 pages, 13 figures
[LINK]
http://arxiv.org/abs/2506.22403v1
[DATE]
2025-06-28 01:23:12+08:00
[CATEGORIES]
cs.CL
Refining Czech GEC: Insights from a Multi-Experiment Approach
[AUTHORS]
Petr Pechman, Milan Straka, Jana Straková, Jakub Náplava
[ABSTRACT]
We present a grammar error correction (GEC) system that achieves state-of-the-art
results for the Czech language. Our system is based on a neural network translation
approach with the Transformer architecture, and its key feature is its
real-time synthetic generation pipeline, which dynamically augments sentences
with artificial errors by introducing both language-agnostic and Czech-specific
errors. We conduct a comprehensive series of experiments, investigating the
Czech GEC corpora as bases for synthetic error introduction, several error
generation strategies, domain balancing, tokenization granularity, model size,
and data scaling during fine-tuning. Additionally, we evaluate the performance
of large language models (LLMs) on Czech GEC in both end-user and expert
fine-tuning scenarios. Our best-performing model is superior both in
performance and computational efficiency. The source code and the trained model
links are available on https://github.com/ufal/tsd2025-gec.
[COMMENTS]
Accepted to TSD 2025
[LINK]
http://arxiv.org/abs/2506.22402v1
[DATE]
2025-06-28 01:21:40+08:00
[CATEGORIES]
cs.CL
Metadata Conditioning Accelerates Language Model Pre-training
[AUTHORS]
Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen
[COMMENTS]
Accepted to ICML 2025. Code available at
https://github.com/princeton-pli/MeCo
[LINK]
http://arxiv.org/abs/2501.01956v3
[DATE]
2025-06-28 01:15:09+08:00
[CATEGORIES]
cs.CL
QuickSilver – Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
[AUTHORS]
Danush Khanna, Aditya Kumar Guru, Srivarshinee Sridhar, Zidan Ahmed, Rubhav Bahirwani, Meetu Malhotra, Vinija Jain, Aman Chadha, Amitava Das, Kripabandhu Ghosh
[ABSTRACT]
Inference accounts for the majority of latency and energy consumption in
large language model (LLM) deployments, often exceeding 90% of total cost.
While training-time efficiency has seen extensive progress, runtime
optimization remains a key bottleneck, particularly under autoregressive
decoding. Existing approaches – such as pruning, quantization, early exits,
and speculative decoding – often require retraining, architectural changes, or
disrupt decoding compatibility. We introduce QuickSilver, a modular,
token-level framework that enables semantic adaptivity at inference time
without altering model weights or structure. QuickSilver integrates four
synergistic mechanisms:
(i) Dynamic Token Halting, which halts computation for tokens with converged
representations; (ii) KV Cache Skipping, which selectively suppresses memory
writes to reduce attention overhead; and (iii) Contextual Token Fusion, which
collapses redundant tokens into shared paths to shrink sequence length.
Unlike speculative decoding or MoE routing, QuickSilver operates entirely on
frozen, dense models and requires no auxiliary networks. Applied to GPT-2 and
Llama-2 across WikiText-103 and C4, QuickSilver achieves up to 39.6% FLOP
reduction with negligible perplexity degradation (<=0.2).
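
A minimal sketch of the Dynamic Token Halting idea described above, under
loud assumptions: layers here are plain per-token functions (real Transformer
layers also mix information across tokens through attention), and the
convergence test, the name run_with_halting, and the tol threshold are
hypothetical illustrations rather than the QuickSilver implementation.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def run_with_halting(hidden, layers, tol=1e-3):
    """Apply layers token by token, freezing tokens whose state stops changing.

    hidden: list of per-token vectors (lists of floats).
    layers: list of functions mapping one token vector to a new vector
            (an assumption; real layers are not per-token).
    tol:    hypothetical threshold on the L2 change between consecutive layers.
    Returns the final states and the number of per-token layer calls made.
    """
    states = [list(v) for v in hidden]
    active = [True] * len(states)
    calls = 0
    for layer in layers:
        for i, state in enumerate(states):
            if not active[i]:
                continue  # halted token: skip this layer's computation
            new_state = layer(state)
            calls += 1
            if l2_distance(new_state, state) < tol:
                active[i] = False  # representation has converged; halt it
            states[i] = new_state
    return states, calls
```

Comparing the returned call count with len(hidden) * len(layers) gives a rough
picture of the computation skipped; it is not a substitute for the FLOP
accounting reported in the abstract.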
[COMMENTS]
Preprint. Under submission
[LINK]
http://arxiv.org/abs/2506.22396v1
[DATE]
2025-06-28 01:10:32+08:00
[CATEGORIES]
cs.CL
How to Train Long-Context Language Models (Effectively)
[AUTHORS]
Tianyu Gao, Alexander Wettig, Howard Yen, Danqi Chen
[ABSTRACT]
We study continued training and supervised fine-tuning (SFT) of a language
model (LM) to make effective use of long-context information. We first
establish a reliable evaluation protocol to guide model development – instead
of perplexity or simple needle-in-a-haystack (NIAH) tests, we use a broad set
of long-context downstream tasks, and we evaluate models after SFT as this
better reveals long-context abilities. Supported by our robust evaluations, we
run thorough experiments to decide the data mix for continued pre-training, the
instruction tuning dataset, and many other design choices such as position
extrapolation. We find that (1) code repositories and books are excellent
sources of long data, but it is crucial to combine them with high-quality
short-context data; (2) training with a sequence length beyond the evaluation
length boosts long-context performance; (3) for SFT, using only short
instruction datasets yields strong performance on long-context tasks. Our final
model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens,
demonstrates state-of-the-art long-context performance among similarly sized
models at a length of 128K. ProLong outperforms Llama-3.1-8B-Instruct on the
majority of long-context tasks despite using only 5% as many tokens during
long-context training. Additionally, ProLong can effectively process up to 512K
tokens, one of the longest context windows of publicly available LMs.
[COMMENTS]
Accepted to ACL 2025. Our code, data, and models are available at
https://github.com/princeton-nlp/ProLong
[LINK]
http://arxiv.org/abs/2410.02660v3
[DATE]
2025-06-28 01:01:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
[AUTHORS]
Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
[ABSTRACT]
Video Large Multimodal Models (VLMMs) have made impressive strides in
understanding video content, but they often struggle with abstract and adaptive
reasoning: the ability to revise their interpretations when new information
emerges. In reality, conclusions are rarely set in stone; additional context
can strengthen or weaken an initial inference. To address this, we introduce
Defeasible Video Entailment (DVidE), a new task that challenges models to think
like doubters, constantly updating their reasoning based on evolving evidence.
In DVidE, given a video premise and a textual hypothesis, models must determine
whether a new update strengthens or weakens the hypothesis (classification
version) or generate a coherent update that modifies the entailment
relationship (generation version). For solving the classification task, we
propose the Chain of Counterfactual Thought framework, utilizing counterfactual
reasoning, ASR-enhanced video content, and rationale refinement to reduce
inference bias. For the generation task, we develop a framework that combines
ASR output with a Large Language Model (LLM) to produce coherent, contextually
relevant updates aligned with the intended strengthener or weakener goals.
Additionally, we introduce a novel benchmark dataset, with
strengthener/weakener annotations and an LLM-based evaluation metric
specifically designed for assessing generative performance. Experimental
results demonstrate significant improvements, highlighting the effectiveness of
our proposed method in enhancing the dynamic reasoning capabilities of VLMMs.
[LINK]
http://arxiv.org/abs/2506.22385v1
[DATE]
2025-06-28 00:51:15+08:00
[CATEGORIES]
cs.CL
Probabilistic Optimality for Inference-time Scaling
[AUTHORS]
Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei, Qing Li
[ABSTRACT]
Inference-time scaling has emerged as a powerful technique for enhancing the
reasoning performance of Large Language Models (LLMs). However, existing
approaches often rely on heuristic strategies for parallel sampling, lacking a
principled foundation. To address this gap, we propose a probabilistic
framework that formalizes the optimality of inference-time scaling under the
assumption that parallel samples are independently and identically distributed
(i.i.d.), and where the Best-of-N selection strategy follows a probability
distribution that can be estimated. Within this framework, we derive a
theoretical lower bound on the required number of samples to achieve a target
performance level, providing the first principled guidance for
compute-efficient scaling. Leveraging this insight, we develop
OptScale, a practical algorithm that dynamically determines the
optimal number of sampled responses. OptScale employs a language
model-based predictor to estimate probabilistic prior parameters, enabling it
to determine the minimal number of samples needed to satisfy predefined
performance thresholds and confidence levels. Extensive experiments on
mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC)
demonstrate that OptScale significantly reduces sampling overhead
while matching or exceeding state-of-the-art reasoning performance.
Our work offers both a theoretical foundation and a practical solution for
principled inference-time scaling, addressing a critical gap in the efficient
deployment of LLMs for complex reasoning.
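
To make the "minimal number of samples" idea concrete, here is a simplified
worked example, not the bound derived in the paper. It assumes each i.i.d.
sample is correct with a known probability p_success and that selection is
perfect, so Best-of-N succeeds whenever at least one sample is correct. Under
those assumptions the success probability is 1 - (1 - p_success)^N, and the
smallest adequate N follows by taking logarithms. The function name and
parameters are hypothetical.

```python
import math


def min_samples(p_success, target):
    """Smallest N with 1 - (1 - p_success)**N >= target.

    Assumes i.i.d. samples and a perfect selector (both simplifications).
    """
    if not 0.0 < p_success < 1.0:
        raise ValueError("p_success must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    # Solve (1 - p_success)**N <= 1 - target for N; both logs are negative.
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success))
    return max(1, n)
```

For instance, with p_success = 0.3 and target = 0.95, eight samples give about
0.942 and nine give about 0.960, so the function returns 9.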
[LINK]
http://arxiv.org/abs/2506.22376v1
[DATE]
2025-06-28 00:44:11+08:00
[CATEGORIES]
cs.LG
cs.CL
Towards Fair Rankings: Leveraging LLMs for Gender Bias Detection and Measurement
[AUTHORS]
Maryam Mousavian, Zahra Abbasiantaeb, Mohammad Aliannejadi, Fabio Crestani
[ABSTRACT]
The presence of social biases in Natural Language Processing (NLP) and
Information Retrieval (IR) systems is an ongoing challenge, which underlines
the importance of developing robust approaches to identifying and evaluating
such biases. In this paper, we aim to address this issue by leveraging Large
Language Models (LLMs) to detect and measure gender bias in passage ranking.
Existing gender fairness metrics rely on lexical- and frequency-based measures,
leading to various limitations, e.g., missing subtle gender disparities.
Building on our LLM-based gender bias detection method, we introduce a novel
gender fairness metric, named Class-wise Weighted Exposure (CWEx), aiming to
address existing limitations. To measure the effectiveness of our proposed
metric and study LLMs’ effectiveness in detecting gender bias, we annotate a
subset of the MS MARCO Passage Ranking collection and release our new gender
bias collection, called MSMGenderBias, to foster future research in this area.
Our extensive experimental results on various ranking models show that our
proposed metric offers a more detailed evaluation of fairness compared to
previous metrics, with improved alignment to human labels (58.77% for
Grep-BiasIR, and 18.51% for MSMGenderBias, measured using Cohen’s Kappa
agreement), effectively distinguishing gender bias in ranking. By integrating
LLM-driven bias detection, an improved fairness metric, and gender bias
annotations for an established dataset, this work provides a more robust
framework for analyzing and mitigating bias in IR systems.
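
For readers unfamiliar with the agreement statistic mentioned above, the
following sketch computes Cohen's kappa from two label sequences using the
standard definition: observed agreement corrected by the agreement expected
from each annotator's label frequencies. The function name and any labels
passed to it are hypothetical; this is not the authors' evaluation code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than
chance.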
[COMMENTS]
Accepted by ACM SIGIR Conference on Innovative Concepts and Theories
in Information Retrieval (ICTIR 2025)
[LINK]
http://arxiv.org/abs/2506.22372v1
[DATE]
2025-06-28 00:39:12+08:00
[CATEGORIES]
cs.CL
Robust Detection of Watermarks for Large Language Models Under Human Edits
[AUTHORS]
Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, Weijie J. Su
[ABSTRACT]
Watermarking has offered an effective approach to distinguishing text
generated by large language models (LLMs) from human-written text. However, the
pervasive presence of human edits on LLM-generated text dilutes watermark
signals, thereby significantly degrading detection performance of existing
methods. In this paper, by modeling human edits through mixture model
detection, we introduce a new method in the form of a truncated goodness-of-fit
test for detecting watermarked text under human edits, which we refer to as
Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection
of the Gumbel-max watermark in a certain asymptotic regime of substantial text
modifications and vanishing watermark signals. Importantly, Tr-GoF achieves
this optimality adaptively, as it does not require precise knowledge of
human edit levels or probabilistic specifications of the LLMs, in contrast to
the optimal but impractical (Neyman–Pearson) likelihood ratio test. Moreover,
we establish that the Tr-GoF test attains the highest detection efficiency rate
in a certain regime of moderate text modifications. In stark contrast, we show
that sum-based detection rules, as employed by existing methods, fail to
achieve optimal robustness in both regimes because the additive nature of their
statistics is less resilient to edit-induced noise. Finally, we demonstrate the
competitive and sometimes superior empirical performance of the Tr-GoF test on
both synthetic data and open-source LLMs in the OPT and LLaMA families.
[LINK]
http://arxiv.org/abs/2411.13868v2
[DATE]
2025-06-28 00:34:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Why Are Parsing Actions for Understanding Message Hierarchies Not Random?
[AUTHORS]
Daichi Kato, Ryo Ueda, Yusuke Miyao
[ABSTRACT]
If humans understood language by randomly selecting parsing actions, it might
have been necessary to construct a robust symbolic system capable of being
interpreted under any hierarchical structure. However, human parsing strategies
do not seem to follow such a random pattern. Why is that the case? In fact, a
previous study on emergent communication using models with hierarchical biases
has reported that agents adopting random parsing
strategies – ones that deviate significantly from human language
comprehension – can achieve high communication accuracy. In this
study, we investigate this issue by making two simple and natural modifications
to the experimental setup: (I) we use more complex inputs that have
hierarchical structures, such that random parsing makes semantic interpretation
more difficult, and (II) we incorporate a surprisal-related term, which is
known to influence the order of words and characters in natural language, into
the objective function. With these changes, we evaluate whether agents
employing random parsing strategies still maintain high communication accuracy.
[LINK]
http://arxiv.org/abs/2506.22366v1
[DATE]
2025-06-28 00:27:35+08:00
[CATEGORIES]
cs.CL
Beyond ReLU: How Activations Affect Neural Kernels and Random Wide Networks
[AUTHORS]
David Holzmüller, Max Schölpple
[ABSTRACT]
While the theory of deep learning has made some progress in recent years,
much of it is limited to the ReLU activation function. In particular, while the
neural tangent kernel (NTK) and neural network Gaussian process kernel (NNGP)
have given theoreticians tractable limiting cases of fully connected neural
networks, their properties for most activation functions except for powers of
the ReLU function are poorly understood. Our main contribution is to provide a
more general characterization of the RKHS of these kernels for typical
activation functions whose only non-smoothness is at zero, such as SELU, ELU,
or LeakyReLU. Our analysis also covers a broad set of special cases such as
missing biases, two-layer networks, or polynomial activations. Our results show
that a broad class of not infinitely smooth activations generate equivalent
RKHSs at different network depths, while polynomial activations generate
non-equivalent RKHSs. Finally, we derive results for the smoothness of NNGP
sample paths, characterizing the smoothness of infinitely wide neural networks
at initialization.
[LINK]
http://arxiv.org/abs/2506.22429v1
[DATE]
2025-06-28 01:56:09+08:00
[CATEGORIES]
cs.LG
CLoVE: Personalized Federated Learning through Clustering of Loss Vector Embeddings
[AUTHORS]
Randeep Bhatia, Nikos Papadis, Murali Kodialam, TV Lakshman, Sayak Chakrabarty
[ABSTRACT]
We propose CLoVE (Clustering of Loss Vector Embeddings), a novel algorithm
for Clustered Federated Learning (CFL). In CFL, clients are naturally grouped
into clusters based on their data distribution. However, identifying these
clusters is challenging, as client assignments are unknown. CLoVE utilizes
client embeddings derived from model losses on client data, and leverages the
insight that clients in the same cluster share similar loss values, while those
in different clusters exhibit distinct loss patterns. Based on these
embeddings, CLoVE is able to iteratively identify and separate clients from
different clusters and optimize cluster-specific models through federated
aggregation. Key advantages of CLoVE over existing CFL algorithms are (1) its
simplicity, (2) its applicability to both supervised and unsupervised settings,
and (3) the fact that it eliminates the need for near-optimal model
initialization, which makes it more robust and better suited for real-world
applications. We establish theoretical convergence bounds, showing that CLoVE
can recover clusters accurately with high probability in a single round and
converges exponentially fast to optimal models in a linear setting. Our
comprehensive experiments comparing with a variety of both CFL and generic
Personalized Federated Learning (PFL) algorithms on different types of datasets
and an extensive array of non-IID settings demonstrate that CLoVE achieves
highly accurate cluster recovery in just a few rounds of training, along with
state-of-the-art model accuracy, across a variety of both supervised and
unsupervised PFL tasks.
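
The loss-vector insight above can be illustrated with a small sketch. Each
client is represented by a vector holding the loss of every candidate model
on that client's data, and clients with similar vectors are grouped together.
Plain k-means with farthest-point initialization is swapped in here purely
for illustration; it is an assumption, not the clustering procedure used by
CLoVE, and all names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_loss_vectors(loss_vectors, k, iterations=20):
    """Group clients by their loss vectors; returns one cluster id per client."""
    # Deterministic farthest-point initialization of the k centers.
    centers = [list(loss_vectors[0])]
    while len(centers) < k:
        farthest = max(
            loss_vectors,
            key=lambda v: min(squared_distance(v, c) for c in centers),
        )
        centers.append(list(farthest))
    assignment = [0] * len(loss_vectors)
    for _ in range(iterations):
        # Assign every client to its nearest center.
        assignment = [
            min(range(k), key=lambda j: squared_distance(v, centers[j]))
            for v in loss_vectors
        ]
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = [v for v, a in zip(loss_vectors, assignment) if a == j]
            if members:
                centers[j] = mean_vector(members)
    return assignment
```

In a federated setting the loss vectors would be recomputed each round as the
cluster-specific models are updated; this sketch covers only a single
clustering step.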
[COMMENTS]
31 pages, 4 figures
[LINK]
http://arxiv.org/abs/2506.22427v1
[DATE]
2025-06-28 01:52:16+08:00
[CATEGORIES]
cs.LG
ARMOR: Robust Reinforcement Learning-based Control for UAVs under Physical Attacks
[AUTHORS]
Pritam Dash, Ethan Chan, Nathan P. Lawrence, Karthik Pattabiraman
[ABSTRACT]
Unmanned Aerial Vehicles (UAVs) depend on onboard sensors for perception,
navigation, and control. However, these sensors are susceptible to physical
attacks, such as GPS spoofing, that can corrupt state estimates and lead to
unsafe behavior. While reinforcement learning (RL) offers adaptive control
capabilities, existing safe RL methods are ineffective against such attacks. We
present ARMOR (Adaptive Robust Manipulation-Optimized State Representations),
an attack-resilient, model-free RL controller that enables robust UAV operation
under adversarial sensor manipulation. Instead of relying on raw sensor
observations, ARMOR learns a robust latent representation of the UAV’s physical
state via a two-stage training framework. In the first stage, a teacher
encoder, trained with privileged attack information, generates attack-aware
latent states for RL policy training. In the second stage, a student encoder is
trained via supervised learning to approximate the teacher’s latent states
using only historical sensor data, enabling real-world deployment without
privileged information. Our experiments show that ARMOR outperforms
conventional methods, ensuring UAV safety. Additionally, ARMOR improves
generalization to unseen attacks and reduces training cost by eliminating the
need for iterative adversarial training.
[LINK]
http://arxiv.org/abs/2506.22423v1
[DATE]
2025-06-28 01:46:33+08:00
[CATEGORIES]
cs.LG
L2MAC: Large Language Model Automatic Computer for Extensive Code Generation
[AUTHORS]
Samuel Holt, Max Ruiz Luyten, Mihaela van der Schaar
[ABSTRACT]
Transformer-based large language models (LLMs) are constrained by the fixed
context window of the underlying transformer architecture, hindering their
ability to produce long and coherent outputs. Memory-augmented LLMs are a
promising solution, but current approaches cannot handle long output generation
tasks since they (1) only focus on reading memory and reduce its evolution to
the concatenation of new memories or (2) use very specialized memories that
cannot adapt to other domains. This paper presents L2MAC, the first practical
LLM-based general-purpose stored-program automatic computer (von Neumann
architecture) framework, an LLM-based multi-agent system, for long and
consistent output generation. Its memory has two components: the instruction
registry, which is populated with a prompt program to solve the user-given
task, and a file store, which will contain the final and intermediate outputs.
Each instruction in turn is executed by a separate LLM agent, whose context is
managed by a control unit capable of precise memory reading and writing to
ensure effective interaction with the file store. These components enable L2MAC
to generate extensive outputs, bypassing the constraints of the finite context
window while producing outputs that fulfill a complex user-specified task. We
empirically demonstrate that L2MAC achieves state-of-the-art performance in
generating large codebases for system design tasks, significantly outperforming
other coding methods in implementing the detailed user-specified task; we show
that L2MAC works for general-purpose extensive text-based tasks, such as
writing an entire book; and we provide valuable insights into L2MAC’s
performance improvement over existing methods.
[COMMENTS]
Published in The Twelfth International Conference on Learning
Representations (ICLR), 2024. Copyright 2023 by the author(s)
[LINK]
http://arxiv.org/abs/2310.02003v6
[DATE]
2025-06-28 01:28:14+08:00
[CATEGORIES]
cs.LG
Decoupled SGDA for Games with Intermittent Strategy Communication
[AUTHORS]
Ali Zindari, Parham Yazdkhasti, Anton Rodomanov, Tatjana Chavdarova, Sebastian U. Stich
[ABSTRACT]
We focus on reducing communication overhead in multiplayer games, where
frequently exchanging strategies between players is not feasible and players
have noisy or outdated strategies of the other players. We introduce Decoupled
SGDA, a novel adaptation of Stochastic Gradient Descent Ascent (SGDA). In this
approach, players independently update their strategies based on outdated
opponent strategies, with periodic synchronization to align strategies. For
Strongly-Convex-Strongly-Concave (SCSC) games, we demonstrate that Decoupled
SGDA achieves near-optimal communication complexity comparable to the
best-known GDA rates. For weakly coupled games where the interaction between
players is lower relative to the non-interactive part of the game, Decoupled
SGDA significantly reduces communication costs compared to standard SGDA. Our
findings extend to multi-player games. To provide insights into the effect of
communication frequency and convergence, we extensively study the convergence
of Decoupled SGDA for quadratic minimax problems. Lastly, in settings where the
noise over the players is imbalanced, Decoupled SGDA significantly outperforms
federated minimax methods.
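
The update pattern described above (local steps against an outdated opponent
strategy, followed by periodic synchronization) can be sketched on a toy
two-player quadratic game. This is a noise-free illustration under stated
assumptions, not the authors' algorithm or experimental setup: the objective
f(x, y) = (a/2) x^2 + b x y - (c/2) y^2, the step size, and all parameter
names are hypothetical choices, with a small b standing in for weak coupling.

```python
def decoupled_gda(x, y, a=1.0, b=0.1, c=1.0, lr=0.1, local_steps=5, rounds=50):
    """Gradient descent ascent with stale opponent strategies between syncs.

    Player x minimizes and player y maximizes
        f(x, y) = (a / 2) * x**2 + b * x * y - (c / 2) * y**2,
    whose unique saddle point for a, c > 0 is (0, 0).
    """
    for _ in range(rounds):
        # Synchronization: each player receives the other's current strategy.
        x_stale, y_stale = x, y
        for _ in range(local_steps):
            # Local updates use only the opponent's last communicated strategy.
            x -= lr * (a * x + b * y_stale)  # df/dx evaluated at (x, y_stale)
            y += lr * (b * x_stale - c * y)  # df/dy evaluated at (x_stale, y)
    return x, y
```

With these defaults the players communicate once every five gradient steps,
and the iterates still contract toward the saddle point because the coupling
term b is small relative to a and c.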
[LINK]
http://arxiv.org/abs/2501.14652v2
[DATE]
2025-06-28 01:22:45+08:00
[CATEGORIES]
cs.LG
Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL
[AUTHORS]
Tong Yang, Bo Dai, Lin Xiao, Yuejie Chi
[ABSTRACT]
Online reinforcement learning (RL) with complex function approximations such
as transformers and deep neural networks plays a significant role in the modern
practice of artificial intelligence. Despite its popularity and importance,
balancing the fundamental trade-off between exploration and exploitation
remains a long-standing challenge; in particular, we still lack
efficient and practical schemes that are backed by theoretical performance
guarantees. Motivated by recent developments in exploration via optimistic
regularization, this paper provides an interpretation of the principle of
optimism through the lens of primal-dual optimization. From this fresh
perspective, we set forth a new value-incentivized actor-critic (VAC) method,
which optimizes a single easy-to-optimize objective integrating exploration and
exploitation – it promotes state-action and policy estimates that are both
consistent with collected data transitions and result in higher value
functions. Theoretically, the proposed VAC method has near-optimal regret
guarantees under linear Markov decision processes (MDPs) in both finite-horizon
and infinite-horizon settings, which can be extended to the general function
approximation setting under appropriate assumptions.
[LINK]
http://arxiv.org/abs/2506.22401v1
[DATE]
2025-06-28 01:18:43+08:00
[CATEGORIES]
cs.LG
Multi-View Contrastive Learning for Robust Domain Adaptation in Medical Time Series Analysis
[AUTHORS]
YongKyung Oh, Alex Bui
[ABSTRACT]
Adapting machine learning models to medical time series across different
domains remains a challenge due to complex temporal dependencies and dynamic
distribution shifts. Current approaches often focus on isolated feature
representations, limiting their ability to fully capture the intricate temporal
dynamics necessary for robust domain adaptation. In this work, we propose a
novel framework leveraging multi-view contrastive learning to integrate
temporal patterns, derivative-based dynamics, and frequency-domain features.
Our method employs independent encoders and a hierarchical fusion mechanism to
learn feature-invariant representations that are transferable across domains
while preserving temporal coherence. Extensive experiments on diverse medical
datasets, including electroencephalogram (EEG), electrocardiogram (ECG), and
electromyography (EMG) demonstrate that our approach significantly outperforms
state-of-the-art methods in transfer learning tasks. By advancing the
robustness and generalizability of machine learning models, our framework
offers a practical pathway for deploying reliable AI systems in diverse
healthcare settings.
[LINK]
http://arxiv.org/abs/2506.22393v1
[DATE]
2025-06-28 01:06:16+08:00
[CATEGORIES]
cs.LG
Towards Distributed Neural Architectures
[AUTHORS]
Aditya Cowsik, Tianyu He, Andrey Gromov
[ABSTRACT]
We introduce and train distributed neural architectures (DNA) in vision and
language domains. DNAs are initialized with a proto-architecture that consists
of (transformer, MLP, attention, etc.) modules and routers. Any token (or
patch) can traverse any series of modules in any order. DNAs are a natural
generalization of sparse methods such as Mixture-of-Experts,
Mixture-of-Depths, parameter sharing, etc. Computation and communication
patterns of DNA modules are learnt end-to-end during training and depend on the
content and context of each token (or patch). These patterns can be shaped by
further requirements added to the optimization objective such as compute/memory
efficiency or load balancing. We empirically show that (i) trained DNAs are
competitive with the dense baselines in both domains and (ii) compute
efficiency/parameter sharing can be learnt from data. Next, we analyze the
emergent connectivity and computation patterns in the trained DNAs. We find
that the paths that tokens take through the models are themselves distributed
according to a power-law. We show that some paths (or, equivalently, groups of
modules) show emergent specialization. Finally, we demonstrate that models
learn to allocate compute and active parameters in an interpretable way.
[COMMENTS]
36 pages, 25 figures
[LINK]
http://arxiv.org/abs/2506.22389v1
[DATE]
2025-06-28 00:57:59+08:00
[CATEGORIES]
cs.LG
Sheaf-Based Decentralized Multimodal Learning for Next-Generation Wireless Communication Systems
[AUTHORS]
Abdulmomen Ghalkha, Zhuojun Tian, Chaouki Ben Issaid, Mehdi Bennis
[ABSTRACT]
In large-scale communication systems, increasingly complex scenarios require
more intelligent collaboration among edge devices collecting various multimodal
sensory data to achieve a more comprehensive understanding of the environment
and improve decision-making accuracy. However, conventional federated learning
(FL) algorithms typically consider unimodal datasets, require identical model
architectures, and fail to leverage the rich information embedded in multimodal
data, limiting their applicability to real-world scenarios with diverse
modalities and varying client capabilities. To address this issue, we propose
Sheaf-DMFL, a novel decentralized multimodal learning framework leveraging
sheaf theory to enhance collaboration among devices with diverse modalities.
Specifically, each client has a set of local feature encoders for its different
modalities, whose outputs are concatenated before passing through a
task-specific layer. While encoders for the same modality are trained
collaboratively across clients, we capture the intrinsic correlations among
clients’ task-specific layers using a sheaf-based structure. To further enhance
learning capability, we propose an enhanced algorithm named Sheaf-DMFL-Att,
which tailors the attention mechanism within each client to capture
correlations among different modalities. A rigorous convergence analysis of
Sheaf-DMFL-Att is provided, establishing its theoretical guarantees. Extensive
simulations on real-world link blockage prediction and mmWave
beamforming scenarios demonstrate the superiority of the proposed algorithms
in such heterogeneous wireless communication systems.
[COMMENTS]
13 pages, 9 figures
[LINK]
http://arxiv.org/abs/2506.22374v1
[DATE]
2025-06-28 00:41:23+08:00
[CATEGORIES]
cs.LG
Reinforcement Learning with Physics-Informed Symbolic Program Priors for Zero-Shot Wireless Indoor Navigation
[AUTHORS]
Tao Li, Haozhe Lei, Mingsheng Yin, Yaqi Hu
[ABSTRACT]
When using reinforcement learning (RL) to tackle physical control tasks,
inductive biases that encode physics priors can help improve sample efficiency
during training and enhance generalization in testing. However, the current
practice of incorporating these helpful physics-informed inductive biases
inevitably runs into significant manual labor and domain expertise, making them
prohibitive for general users. This work explores a symbolic approach to
distill physics-informed inductive biases into RL agents, where the physics
priors are expressed in a domain-specific language (DSL) that is human-readable
and naturally explainable. Yet, the DSL priors do not translate directly into
an implementable policy due to partial and noisy observations and additional
physical constraints in navigation tasks. To address this gap, we develop a
physics-informed program-guided RL (PiPRL) framework with applications to
indoor navigation. PiPRL adopts a hierarchical and modularized neuro-symbolic
integration, where a meta symbolic program receives semantically meaningful
features from a neural perception module, which form the bases for symbolic
programming that encodes physics priors and guides the RL process of a
low-level neural controller. Extensive experiments demonstrate that PiPRL
consistently outperforms purely symbolic or neural policies and reduces
training time by over 26% with the help of the program-based inductive biases.
[COMMENTS]
Spotlight paper at Reinforcement Learning Conference 2025, Workshop
on Inductive Biases in Reinforcement Learning
[LINK]
http://arxiv.org/abs/2506.22365v1
[DATE]
2025-06-28 00:26:29+08:00
[CATEGORIES]
cs.LG
DiffSoundStream: Efficient Speech Tokenization via Diffusion Decoding
[AUTHORS]
Yang Yang, Yunpeng Li, George Sung, Shao-Fu Shih, Craig Dooley, Alessio Centazzo, Ramanan Rajeswaran
[ABSTRACT]
Token-based language modeling is a prominent approach for speech generation,
where tokens are obtained by quantizing features from self-supervised learning
(SSL) models and extracting codes from neural speech codecs, generally referred
to as semantic tokens and acoustic tokens. These tokens are often modeled
autoregressively, with the inference speed being constrained by the token rate.
In this work, we propose DiffSoundStream, a solution that improves the
efficiency of speech tokenization in non-streaming scenarios through two
techniques: (1) conditioning the neural codec on semantic tokens to minimize
redundancy between semantic and acoustic tokens, and (2) leveraging latent
diffusion models to synthesize high-quality waveforms from semantic and
coarse-level acoustic tokens. Experiments show that at 50 tokens per second,
DiffSoundStream achieves speech quality on par with a standard SoundStream
model operating at twice the token rate. Additionally, we achieve step-size
distillation using just four diffusion sampling steps with only a minor quality
loss.
[LINK]
http://arxiv.org/abs/2506.22362v1
[DATE]
2025-06-28 00:23:07+08:00
[CATEGORIES]
cs.LG
From Ground to Air: Noise Robustness in Vision Transformers and CNNs for Event-Based Vehicle Classification with Potential UAV Applications
[AUTHORS]
Nouf Almesafri, Hector Figueiredo, Miguel Arana-Catania
[ABSTRACT]
This study investigates the performance of the two most relevant computer
vision deep learning architectures, Convolutional Neural Network and Vision
Transformer, for event-based cameras. These cameras capture scene changes,
unlike traditional frame-based cameras, which capture static images, and are
particularly suited for dynamic environments such as UAVs and autonomous
vehicles. The deep learning models studied in this work are ResNet34 and ViT
B16, fine-tuned on the GEN1 event-based dataset. The research evaluates and
compares these models under both standard conditions and in the presence of
simulated noise. Initial evaluations on the clean GEN1 dataset reveal that
ResNet34 and ViT B16 achieve accuracies of 88% and 86%, respectively, with
ResNet34 showing a slight advantage in classification accuracy. However, the
ViT B16 model demonstrates notable robustness, particularly given its
pre-training on a smaller dataset. Although this study focuses on ground-based
vehicle classification, the methodologies and findings hold significant promise
for adaptation to UAV contexts, including aerial object classification and
event-based vision systems for aviation-related tasks.
[COMMENTS]
16 pages, 17 figures, 9 tables. To be presented in AIAA AVIATION
Forum 2025
[LINK]
http://arxiv.org/abs/2506.22360v1
[DATE]
2025-06-28 00:21:00+08:00
[CATEGORIES]
cs.LG
Learning Non-Local Molecular Interactions via Equivariant Local Representations and Charge Equilibration
[AUTHORS]
Paul Fuchs, Michał Sanocki, Julija Zavadlav
[ABSTRACT]
Graph Neural Network (GNN) potentials relying on chemical locality offer
near-quantum mechanical accuracy at significantly reduced computational costs.
Message-passing GNNs model interactions beyond their immediate neighborhood by
propagating local information between neighboring particles while remaining
effectively local. However, locality precludes modeling long-range effects
critical to many real-world systems, such as charge transfer, electrostatic
interactions, and dispersion effects. In this work, we propose the Charge
Equilibration Layer for Long-range Interactions (CELLI) to address the
challenge of efficiently modeling non-local interactions. This novel
architecture generalizes the classical charge equilibration (Qeq) method to a
model-agnostic building block for modern equivariant GNN potentials. Therefore,
CELLI extends the capability of GNNs to model long-range interactions while
providing high interpretability through explicitly modeled charges. On
benchmark systems, CELLI achieves state-of-the-art results for strictly local
models. CELLI generalizes to diverse datasets and large structures while
providing high computational efficiency and robust predictions.
[LINK]
http://arxiv.org/abs/2501.19179v2
[DATE]
2025-06-28 00:03:53+08:00
[CATEGORIES]
cs.LG
Learning Networks from Wide-Sense Stationary Stochastic Processes
[AUTHORS]
Anirudh Rayas, Jiajun Cheng, Rajasekhar Anguluri, Deepjyoti Deka, Gautam Dasarathy
[ABSTRACT]
Complex networked systems driven by latent inputs are common in fields like
neuroscience, finance, and engineering. A key inference problem here is to
learn edge connectivity from node outputs (potentials). We focus on systems
governed by steady-state linear conservation laws: $X_t = {L^{\ast}}Y_{t}$,
where $X_t, Y_t \in \mathbb{R}^p$ denote inputs and potentials, respectively,
and the sparsity pattern of the $p \times p$ Laplacian $L^{\ast}$ encodes the
edge structure. Assuming $X_t$ to be a wide-sense stationary stochastic process
with a known spectral density matrix, we learn the support of $L^{\ast}$ from
temporally correlated samples of $Y_t$ via an $\ell_1$-regularized Whittle’s
maximum likelihood estimator (MLE). The regularization is particularly useful
for learning large-scale networks in the high-dimensional setting where the
network size $p$ significantly exceeds the number of samples $n$.
We show that the MLE problem is strictly convex, admitting a unique solution.
Under a novel mutual incoherence condition and certain sufficient conditions on
$(n, p, d)$, we show that the ML estimate recovers the sparsity pattern of
$L^\ast$ with high probability, where $d$ is the maximum degree of the graph
underlying $L^{\ast}$. We provide recovery guarantees for $L^\ast$ in
element-wise maximum, Frobenius, and operator norms. Finally, we complement our
theoretical results with several simulation studies on synthetic and benchmark
datasets, including engineered systems (power and water networks), and
real-world datasets from neural systems (such as the human brain).
[LINK]
http://arxiv.org/abs/2412.03768v2
[DATE]
2025-06-28 00:01:18+08:00
[CATEGORIES]
cs.LG
Optimal Estimation of Watermark Proportions in Hybrid AI-Human Texts
[AUTHORS]
Xiang Li, Garrett Wen, Weiqing He, Jiayuan Wu, Qi Long, Weijie J. Su
[LINK]
http://arxiv.org/abs/2506.22343v1
[DATE]
2025-06-27 23:53:04+08:00
[CATEGORIES]
cs.CL
cs.LG
Multi-Turn Code Generation Through Single-Step Rewards
[AUTHORS]
Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
[ABSTRACT]
We address the problem of code generation from multi-turn execution feedback.
Existing methods either generate code without feedback or use complex,
hierarchical reinforcement learning to optimize multi-turn rewards. We propose
a simple yet scalable approach, $\mu$Code, that solves multi-turn code
generation using only single-step rewards. Our key insight is that code
generation is a one-step recoverable MDP, where the correct code can be
recovered from any intermediate code state in a single turn. $\mu$Code
iteratively trains both a generator to provide code solutions conditioned on
multi-turn execution feedback and a verifier to score the newly generated code.
Experimental evaluations show that our approach achieves significant
improvements over the state-of-the-art baselines. We provide analysis of the
design choices of the reward models and policy, and show the efficacy of
$\mu$Code at utilizing the execution feedback. Our code is available at
https://github.com/portal-cornell/muCode.
[COMMENTS]
9 pages (not including references or appendix); 5 figures (in main
paper); (v2) camera-ready version
[LINK]
http://arxiv.org/abs/2502.20380v2
[DATE]
2025-06-27 23:47:52+08:00
[CATEGORIES]
cs.LG
cs.CL
Conceptual Topic Aggregation
[AUTHORS]
Klara M. Gutekunst, Dominik Dürrschnabel, Johannes Hirth, Gerd Stumme
[ABSTRACT]
The vast growth of data has rendered traditional manual inspection
infeasible, necessitating the adoption of computational methods for efficient
data exploration. Topic modeling has emerged as a powerful tool for analyzing
large-scale textual datasets, enabling the extraction of latent semantic
structures. However, existing methods for topic modeling often struggle to
provide interpretable representations that facilitate deeper insights into data
structure and content. In this paper, we propose FAT-CAT, an approach based on
Formal Concept Analysis (FCA) to enhance meaningful topic aggregation and
visualization of discovered topics. Our approach can handle diverse topics and
file types – grouped by directories – to construct a concept lattice that
offers a structured, hierarchical representation of their topic distribution.
In a case study on the ETYNTKE dataset, we evaluate the effectiveness of our
approach against other representation methods to demonstrate that FCA-based
aggregation provides more meaningful and interpretable insights into dataset
composition than existing topic modeling techniques.
[COMMENTS]
16 pages, 4 tables, 11 figures, International Joint Conference on
Conceptual Knowledge Structures
[LINK]
http://arxiv.org/abs/2506.22309v1
[DATE]
2025-06-27 23:19:38+08:00
[CATEGORIES]
cs.CL
cs.LG
Detection of Personal Data in Structured Datasets Using a Large Language Model
[AUTHORS]
Albert Agisha Ntwali, Luca Rück, Martin Heckmann
[ABSTRACT]
We propose a novel approach for detecting personal data in structured
datasets, leveraging GPT-4o, a state-of-the-art Large Language Model. A key
innovation of our method is the incorporation of contextual information: in
addition to a feature’s name and values, we utilize information from other
feature names within the dataset as well as the dataset description. We compare
our approach to alternative methods, including Microsoft Presidio and CASSED,
evaluating them on multiple datasets: DeSSI, a large synthetic dataset,
datasets we collected from Kaggle and OpenML as well as MIMIC-Demo-Ext, a
real-world dataset containing patient information from critical care units.
Our findings reveal that detection performance varies significantly depending
on the dataset used for evaluation. CASSED excels on DeSSI, the dataset on
which it was trained. Performance on the medical dataset MIMIC-Demo-Ext is
comparable across all models, with our GPT-4o-based approach clearly
outperforming the others. Notably, personal data detection in the Kaggle and
OpenML datasets appears to benefit from contextual information. This is
evidenced by the poor performance of CASSED and Presidio (both of which do not
utilize the context of the dataset) compared to the strong results of our
GPT-4o-based approach.
We conclude that further progress in this field would greatly benefit from
the availability of more real-world datasets containing personal information.
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2506.22305v1
[DATE]
2025-06-27 23:16:43+08:00
[CATEGORIES]
cs.CL
COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication
[AUTHORS]
Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt
[ABSTRACT]
Natural scenes provide us with rich contexts for object recognition and
reference. In particular, knowing what type of scene one is looking at
generates expectations about which objects will occur, and what their spatial
configuration should be. Do Vision-Language Models (VLMs) learn to rely on
scene contexts in a similar way, when generating references to objects? To
address this question, we introduce the Common Objects Out-of-Context
(COOCO) dataset and test to what extent VLMs rely on scene context to refer to
objects under different degrees of scene-object congruency, and different
perturbations. Our findings show that models leverage scene context adaptively,
depending on both the semantic relatedness between object and scene and the
level of noise. In particular, models rely more on context under high
target-scene congruence or when objects are degraded. Attention analysis
reveals that successful object categorisation involves increased focus on the
target in mid-level layers, especially under moderate noise, suggesting that
VLMs dynamically balance local and contextual information for reference
generation. We make our dataset, code and models available at
https://github.com/cs-nlp-uu/scenereg.
[LINK]
http://arxiv.org/abs/2506.22274v1
[DATE]
2025-06-27 22:44:45+08:00
[CATEGORIES]
cs.CL
KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding
[AUTHORS]
Ahmed Heakl, Abdullah Sohail, Mukul Ranjan, Rania Hossam, Ghazi Shazan Ahmad, Mohamed El-Geish, Omar Maher, Zhiqiang Shen, Fahad Khan, Salman Khan
[ABSTRACT]
With the growing adoption of Retrieval-Augmented Generation (RAG) in document
processing, robust text recognition has become increasingly critical for
knowledge extraction. While OCR (Optical Character Recognition) for English and
other languages benefits from large datasets and well-established benchmarks,
Arabic OCR faces unique challenges due to its cursive script, right-to-left
text flow, and complex typographic and calligraphic features. We present
KITAB-Bench, a comprehensive Arabic OCR benchmark that fills the gaps in
current evaluation systems. Our benchmark comprises 8,809 samples across 9
major domains and 36 sub-domains, encompassing diverse document types including
handwritten text, structured tables, and specialized coverage of 21 chart types
for business intelligence. Our findings show that modern vision-language models
(such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches (like
EasyOCR, PaddleOCR, and Surya) by an average of 60% in Character Error Rate
(CER). Furthermore, we highlight significant limitations of current Arabic OCR
models, particularly in PDF-to-Markdown conversion, where the best model
Gemini-2.0-Flash achieves only 65% accuracy. This underscores the challenges in
accurately recognizing Arabic text, including issues with complex fonts,
numeral recognition errors, word elongation, and table structure detection.
This work establishes a rigorous evaluation framework that can drive
improvements in Arabic document analysis methods and bridge the performance gap
with English OCR technologies.
[COMMENTS]
17 pages, 5 figures, ACL 2025
[LINK]
http://arxiv.org/abs/2502.14949v2
[DATE]
2025-06-27 22:31:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Projected Compression: Trainable Projection for Efficient Transformer Compression
[AUTHORS]
Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Maciej Pióro, Jakub Krajewski, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jan Ludziejewski
[ABSTRACT]
Large language models have steadily increased in size to achieve improved
performance; however, this growth has also led to greater inference time and
computational demands. Consequently, there is rising interest in model size
reduction methods. To address this issue, we propose Projected Compression, a
novel model compression technique that reduces model weights by utilizing
projection modules. Specifically, we first train additional trainable
projection weights and preserve access to all the original model parameters.
Subsequently, these projections are merged into a lower-dimensional product
matrix, resulting in a reduced-size standard Transformer-based model. Unlike
alternative approaches that require additional computational overhead, our
method matches the base model’s per-token computation step in FLOPs.
Experimental results show that Projected Compression outperforms the comparable
hard pruning and retraining approach on higher quality models. Moreover, the
performance margin scales well with the number of tokens.
[LINK]
http://arxiv.org/abs/2506.22255v1
[DATE]
2025-06-27 22:24:01+08:00
[CATEGORIES]
cs.LG
cs.CL
Quantum-Enhanced Attention Mechanism in NLP: A Hybrid Classical-Quantum Approach
[AUTHORS]
S. M. Yousuf Iqbal Tomal, Abdullah Al Shafin, Debojit Bhattacharjee, MD. Khairul Amin, Rafiad Sadat Shahir
[ABSTRACT]
Recent advances in quantum computing have opened new pathways for enhancing
deep learning architectures, particularly in domains characterized by
high-dimensional and context-rich data such as natural language processing
(NLP). In this work, we present a hybrid classical-quantum Transformer model
that integrates a quantum-enhanced attention mechanism into the standard
classical architecture. By embedding token representations into a quantum
Hilbert space via parameterized variational circuits and exploiting
entanglement-aware kernel similarities, the model captures complex semantic
relationships beyond the reach of conventional dot-product attention. We
demonstrate the effectiveness of this approach across diverse NLP benchmarks,
showing improvements in both efficiency and representational capacity. The
results reveal that the quantum attention layer yields globally
coherent attention maps and more separable latent features, while requiring
comparatively fewer parameters than classical counterparts. These findings
highlight the potential of quantum-classical hybrid models to serve as a
powerful and resource-efficient alternative to existing attention mechanisms in
NLP.
[COMMENTS]
16 pages, 7 figures, 5 tables
[LINK]
http://arxiv.org/abs/2501.15630v2
[DATE]
2025-06-27 22:09:08+08:00
[CATEGORIES]
cs.CL
Fine-Tuning MIDI-to-Audio Alignment using a Neural Network on Piano Roll and CQT Representations
[AUTHORS]
Sebastian Murgul, Moritz Reiser, Michael Heizmann, Christoph Seibert
[ABSTRACT]
In this paper, we present a neural network approach for synchronizing audio
recordings of human piano performances with their corresponding loosely aligned
MIDI files. The task is addressed using a Convolutional Recurrent Neural
Network (CRNN) architecture, which effectively captures spectral and temporal
features by processing an unaligned piano roll and a spectrogram as inputs to
estimate the aligned piano roll. To train the network, we create a dataset of
piano pieces with augmented MIDI files that simulate common human timing
errors. The proposed model achieves up to 20% higher alignment accuracy than
the industry-standard Dynamic Time Warping (DTW) method across various
tolerance windows. Furthermore, integrating DTW with the CRNN yields additional
improvements, offering enhanced robustness and consistency. These findings
demonstrate the potential of neural networks in advancing state-of-the-art
MIDI-to-audio alignment.
[COMMENTS]
9 pages, 3 figures, 6 tables
[LINK]
http://arxiv.org/abs/2506.22237v1
[DATE]
2025-06-27 21:59:50+08:00
[CATEGORIES]
cs.CL
Leveraging In-Context Learning for Political Bias Testing of LLMs
[AUTHORS]
Patrick Haller, Jannis Vamvas, Rico Sennrich, Lena A. Jäger
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2506.22232v1
[DATE]
2025-06-27 21:49:37+08:00
[CATEGORIES]
cs.CL
TableLoRA: Low-rank Adaptation on Table Structure Understanding for Large Language Models
[AUTHORS]
Xinyi He, Yihao Liu, Mengyu Zhou, Yeye He, Haoyu Dong, Shi Han, Zejian Yuan, Dongmei Zhang
[ABSTRACT]
Tabular data are crucial in many fields, and their understanding by large
language models (LLMs) under a high-parameter-efficiency paradigm is important.
However, directly applying parameter-efficient fine-tuning (PEFT) techniques to
tabular tasks presents significant challenges, particularly in terms of better
table serialization and the representation of two-dimensional structured
information within a one-dimensional sequence. To address this, we propose
TableLoRA, a module designed to improve LLMs’ understanding of table structure
during PEFT. It incorporates special tokens for serializing tables with a special
token encoder and uses 2D LoRA to encode low-rank information on cell
positions. Experiments on four tabular-related datasets demonstrate that
TableLoRA consistently outperforms vanilla LoRA and surpasses various table
encoding methods tested in control experiments. These findings reveal that
TableLoRA, as a table-specific LoRA, enhances the ability of LLMs to process
tabular data effectively, especially in low-parameter settings, demonstrating
its potential as a robust solution for handling table-related tasks.
[COMMENTS]
Accepted by ACL 2025 main conference, long paper
[LINK]
http://arxiv.org/abs/2503.04396v2
[DATE]
2025-06-27 21:42:07+08:00
[CATEGORIES]
cs.CL
LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models
[AUTHORS]
Xi Zhu, Haochen Xue, Ziwei Zhao, Wujiang Xu, Jingyuan Huang, Minghao Guo, Qifan Wang, Kaixiong Zhou, Yongfeng Zhang
[ABSTRACT]
Text-Attributed Graphs (TAGs), where each node is associated with text
descriptions, are ubiquitous in real-world scenarios. They typically exhibit
distinctive structure and domain-specific knowledge, motivating the development
of a Graph Foundation Model (GFM) that generalizes across diverse graphs and
tasks. Despite large efforts to integrate Large Language Models (LLMs) and
Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from
decoupled architectures with two-stage alignment, limiting their synergistic
potential. Even worse, existing methods assign out-of-vocabulary (OOV) tokens
to graph nodes, leading to graph-specific semantics, token explosion, and
incompatibility with task-oriented prompt templates, which hinders cross-graph
and cross-task transferability. To address these challenges, we propose
PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning.
PromptGFM comprises two key components: (1) Graph Understanding Module, which
explicitly prompts LLMs to replicate the finest GNN workflow within the text
space, facilitating seamless GNN-LLM integration and elegant graph-text
alignment; (2) Graph Inference Module, which establishes a language-based graph
vocabulary ensuring expressiveness, transferability, and scalability, enabling
readable instructions for LLM fine-tuning. Extensive experiments demonstrate
our superiority and transferability across diverse graphs and tasks. The code
is available at https://github.com/agiresearch/PromptGFM.
[LINK]
http://arxiv.org/abs/2503.03313v2
[DATE]
2025-06-27 20:53:42+08:00
[CATEGORIES]
cs.LG
cs.CL
Refining Salience-Aware Sparse Fine-Tuning Strategies for Language Models
[AUTHORS]
Xinxin Liu, Aaron Thomas, Cheng Zhang, Jianyi Cheng, Yiren Zhao, Xitong Gao
[ABSTRACT]
Parameter-Efficient Fine-Tuning (PEFT) has gained prominence through low-rank
adaptation methods like LoRA. In this paper, we focus on sparsity-based PEFT
(SPEFT), which introduces trainable sparse adaptations to the weight matrices
in the model, offering greater flexibility in selecting fine-tuned parameters
compared to low-rank methods. We conduct the first systematic evaluation of
salience metrics for SPEFT, inspired by zero-cost NAS proxies, and find that
simple gradient-based metrics are reliable, with results on par with the best
alternatives, offering both computational efficiency and robust performance.
Additionally, we compare static and dynamic masking strategies, finding that
static masking, which predetermines non-zero entries before training, delivers
efficiency without sacrificing performance, while dynamic masking offers no
substantial benefits. Across NLP tasks, a simple gradient-based, static SPEFT
consistently outperforms other fine-tuning methods for LLMs, providing a simple
yet effective baseline for SPEFT. Our work challenges the notion that
complexity is necessary for effective PEFT, while our open-source framework
establishes a reproducible benchmark for future research, which is available at
https://github.com/0-ml/speft.
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2412.13488v2
[DATE]
2025-06-27 20:34:59+08:00
[CATEGORIES]
cs.CL
Training Language Model to Critique for Better Refinement
[AUTHORS]
Tianshu Yu, Chao Xiang, Mingchuan Yang, Pei Ke, Bosi Wen, Cunxiang Wang, Jiale Cheng, Li Zhang, Xinyu Mu, Chuxiong Sun, Minlie Huang
[ABSTRACT]
Large language models (LLMs) have demonstrated remarkable evaluation and
critique capabilities, providing insightful feedback and identifying flaws in
various tasks. However, limited research has explored which types of critiques
are most effective for improving model responses or how to generate such
critiques. To address this gap, we introduce Refinement-oriented
Critique Optimization (RCO), a novel framework designed to
train critic models using refinement signals. RCO uses a feedback loop where
critiques, generated by the critic model, guide the actor model in refining its
responses. The critique utility (CU) quantifies the effectiveness of these
refinements, serving as the reward signal for training the critic model. By
focusing on critiques that lead to better refinements, RCO eliminates the need
for direct critique preference assessment, ensuring that critiques driving
meaningful improvements are rewarded. We evaluate RCO across five tasks, i.e.,
dialog generation, summarization, question answering, mathematical reasoning,
and code generation, and show that it significantly outperforms traditional
methods and open-source models in terms of critique quality and refinement
outcomes. Our contributions include the introduction of RCO, a novel
supervision scheme based on refined response preferences, and comprehensive
experimental results that highlight the method’s effectiveness in enhancing LLM
critique-refinement loops.
[COMMENTS]
Accepted to ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.22157v1
[DATE]
2025-06-27 20:10:57+08:00
[CATEGORIES]
cs.CL
MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot
[AUTHORS]
Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
[ABSTRACT]
Retrieval-augmented generation (RAG) is a well-suited technique for
retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a
key module of the healthcare copilot, helping reduce misdiagnosis for
healthcare practitioners and patients. However, the diagnostic accuracy and
specificity of existing heuristic-based RAG models used in the medical domain
are inadequate, particularly for diseases with similar manifestations. This
paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited
reasoning for the medical domain that retrieves diagnosis and treatment
recommendations based on manifestations. MedRAG systematically constructs a
comprehensive four-tier hierarchical diagnostic KG encompassing critical
diagnostic differences of various diseases. These differences are dynamically
integrated with similar EHRs retrieved from an EHR database, and reasoned
within a large language model. This process enables more accurate and specific
decision support, while also proactively providing follow-up questions to
enhance personalized medical decision-making. MedRAG is evaluated on both a
public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD)
collected from Tan Tock Seng Hospital, and its performance is compared against
various existing RAG methods. Experimental results show that, leveraging the
information integration and relational abilities of the KG, our MedRAG provides
more specific diagnostic insights and outperforms state-of-the-art models in
reducing misdiagnosis rates. Our code will be available at
https://github.com/SNOWTEAM2023/MedRAG
[LINK]
http://arxiv.org/abs/2502.04413v2
[DATE]
2025-06-27 20:06:42+08:00
[CATEGORIES]
cs.CL
Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
[AUTHORS]
Nikita Martynov, Anastasia Mordasheva, Dmitriy Gorbetskiy, Danil Astafurov, Ulyana Isaeva, Elina Basyrova, Sergey Skachkov, Victoria Berestova, Nikolay Ivanov, Valeriia Zanina, Alena Fenogenova
[ABSTRACT]
We introduce POLLUX, a comprehensive open-source benchmark designed to
evaluate the generative capabilities of large language models (LLMs) in
Russian. Our main contribution is a novel evaluation methodology that enhances
the interpretability of LLM assessment. For each task type, we define a set of
detailed criteria and develop a scoring protocol where models evaluate
responses and provide justifications for their ratings. This enables
transparent, criteria-driven evaluation beyond traditional resource-consuming,
side-by-side human comparisons. POLLUX includes a detailed, fine-grained
taxonomy of 35 task types covering diverse generative domains such as code
generation, creative writing, and practical assistant use cases, totaling 2,100
manually crafted and professionally authored prompts. Each task is categorized
by difficulty (easy/medium/hard), with experts constructing the dataset
entirely from scratch. We also release a family of LLM-as-a-Judge (7B and 32B)
evaluators trained for nuanced assessment of generative outputs. This approach
provides scalable, interpretable evaluation and annotation tools for model
development, effectively replacing costly and less precise human judgments.
[COMMENTS]
178 pages
[LINK]
http://arxiv.org/abs/2505.24616v3
[DATE]
2025-06-27 19:43:03+08:00
[CATEGORIES]
cs.CL
DAPFAM: A Domain-Aware Patent Retrieval Dataset Aggregated at the Family Level
[AUTHORS]
Iliass Ayaou, Denis Cavallucci, Hicham Chibane
[ABSTRACT]
In the landscape of publicly available patent retrieval datasets, the need
for explicit in-domain and out-of-domain labeling, multi-jurisdiction coverage,
balanced query domain representation, and manageable sizes that support
sub-document-level experiments on moderate computational resources is often
overlooked. To address these gaps, we propose DAPFAM, a new open-access
domain-aware patent retrieval dataset constructed at the simple-family level.
The dataset contains 1,247 domain-balanced full-text query families and 45,336
full-text target families. The dataset is enriched by clear relevance judgments
(forward/backward citations as positive links, random negatives), as well as
explicit in-domain or out-of-domain relationships via a novel
labelling scheme based on International Patent Classification (IPC) codes,
resulting in 49,869 evaluation pairs. The dataset is multi-jurisdictional,
requires little to no preprocessing for retrieval evaluation, and remains of a
size manageable for entities with limited resources, allowing for
sub-document-level retrieval experiments without excessive computational costs. We describe
our three-step data-curation pipeline, present comprehensive dataset
statistics, and provide baseline experiments using lexical and neural retrieval
methods. Our baseline experiments highlight significant challenges in
cross-domain patent retrieval. The dataset will be publicly available (for now
the access link is this repository:
https://osf.io/vbyzd/?view_only=1a40242e0d1941a58aa854af3e50cf6b).
[LINK]
http://arxiv.org/abs/2506.22141v1
[DATE]
2025-06-27 19:34:51+08:00
[CATEGORIES]
cs.CL
iPrOp: Interactive Prompt Optimization for Large Language Models with a Human in the Loop
[AUTHORS]
Jiahui Li, Roman Klinger
[ABSTRACT]
Prompt engineering has made significant contributions to the era of large
language models, yet its effectiveness depends on the skills of a prompt
author. This paper introduces $\textit{iPrOp}$, a novel interactive prompt
optimization approach, to bridge manual prompt engineering and automatic prompt
optimization while offering users the flexibility to assess evolving prompts.
We aim to provide users with task-specific guidance to enhance human engagement
in the optimization process, which is structured through prompt variations,
informative instances, predictions generated by large language models along
with their corresponding explanations, and relevant performance metrics. This
approach empowers users to choose and further refine the prompts based on their
individual preferences and needs. It can not only assist non-technical domain
experts in generating optimal prompts tailored to their specific tasks or
domains, but also enable the study of the intrinsic parameters that influence the
performance of prompt optimization. The evaluation shows that our approach has
the capability to generate improved prompts, leading to enhanced task
performance.
[LINK]
http://arxiv.org/abs/2412.12644v2
[DATE]
2025-06-27 19:25:48+08:00
[CATEGORIES]
cs.CL
Llama See, Llama Do: A Mechanistic Perspective on Contextual Entrainment and Distraction in LLMs
[AUTHORS]
Jingcheng Niu, Xingdi Yuan, Tong Wang, Hamidreza Saghir, Amir H. Abdi
[ABSTRACT]
We observe a novel phenomenon, contextual entrainment, across a wide range of
language models (LMs) and prompt settings, providing a new mechanistic
perspective on how LMs become distracted by “irrelevant” contextual
information in the input prompt. Specifically, LMs assign significantly higher
logits (or probabilities) to any tokens that have previously appeared in the
context prompt, even for random tokens. This suggests that contextual
entrainment is a mechanistic phenomenon, occurring independently of the
relevance or semantic relation of the tokens to the question or the rest of the
sentence. We find statistically significant evidence that the magnitude of
contextual entrainment is influenced by semantic factors. Counterfactual
prompts have a greater effect compared to factual ones, suggesting that while
contextual entrainment is a mechanistic phenomenon, it is modulated by semantic
factors.
We hypothesise that there is a circuit of attention heads – the entrainment
heads – that corresponds to the contextual entrainment phenomenon. Using a
novel entrainment head discovery method based on differentiable masking, we
identify these heads across various settings. When we “turn off” these heads,
i.e., set their outputs to zero, the effect of contextual entrainment is
significantly attenuated, causing the model to generate output that capitulates
to what it would produce if no distracting context were provided. Our discovery
of contextual entrainment, along with our investigation into LM distraction via
the entrainment heads, marks a key step towards the mechanistic analysis and
mitigation of the distraction problem.
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2505.09338v2
[DATE]
2025-06-27 19:15:26+08:00
[CATEGORIES]
cs.CL
Identifying a Circuit for Verb Conjugation in GPT-2
[AUTHORS]
David Demitri Africa
[ABSTRACT]
I implement a procedure to isolate and interpret the sub-network (or
“circuit”) responsible for subject-verb agreement in GPT-2 Small. In this
study, the model is given prompts where the subject is either singular (e.g.
“Alice”) or plural (e.g. “Alice and Bob”), and the task is to correctly predict
the appropriate verb form (“walks” for singular subjects, “walk” for plural
subjects). Using a series of techniques, including performance verification,
automatic circuit discovery via direct path patching, and direct logit
attribution, I isolate a candidate circuit that contributes significantly to
the model’s correct verb conjugation. The results suggest that only a small
fraction of the network’s component-token pairs is needed to achieve near-model
performance on the base task but substantially more for more complex settings.
[LINK]
http://arxiv.org/abs/2506.22105v1
[DATE]
2025-06-27 18:35:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Beyond Fixed Length: Bucket Pre-training is All You Need
[AUTHORS]
Qing Yang, Qiyao Peng, Hongtao Liu, Kai Liu, Bing Qin, Ting Liu
[ABSTRACT]
Large Language Models (LLMs) have demonstrated exceptional performance across
various tasks, with the pre-training stage serving as the cornerstone of their
capabilities. However, the conventional fixed-length data composition strategy
for pre-training presents several practical challenges. When using shorter
sequences, documents are often truncated, potentially leading to information
loss and affecting the model’s ability to capture long-range dependencies.
Conversely, longer sequences require concatenation of multiple documents, which
can introduce noise and affect the natural document boundaries and semantic
coherence as well as require substantial computational overhead. To address
these challenges, we first establish three quantitative metrics for evaluating
data composition quality: padding ratio, truncation ratio, and concatenation
ratio. Building upon these metrics, we propose a novel multi-bucket data
composition method that transcends the fixed-length paradigm. Our approach
adaptively organizes training data to achieve optimal composition quality as
measured by the proposed metrics, offering a more flexible and efficient
approach for pre-training. We conduct extensive experiments and the results
demonstrate that our proposed method significantly enhances both the efficiency
and effectiveness of LLM pre-training.
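The abstract names three composition metrics without defining them. The following is a minimal sketch of how such ratios could be computed for a set of composed sequences. The definitions used here (padded slots over total slots, dropped document tokens over total document tokens, share of sequences holding more than one document), the `PackedSequence` type, and the function name are all assumptions made for illustration and may differ from the paper's.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PackedSequence:
    """One training sequence (hypothetical representation).

    doc_tokens_used: tokens taken from each document placed in this sequence.
    doc_tokens_total: full length of each of those documents.
    capacity: fixed number of token slots in the sequence.
    Assumes each document appears in exactly one sequence.
    """
    doc_tokens_used: List[int]
    doc_tokens_total: List[int]
    capacity: int


def composition_metrics(seqs: List[PackedSequence]) -> Dict[str, float]:
    total_slots = sum(s.capacity for s in seqs)
    used = sum(sum(s.doc_tokens_used) for s in seqs)
    doc_tokens = sum(sum(s.doc_tokens_total) for s in seqs)
    multi_doc = sum(1 for s in seqs if len(s.doc_tokens_used) > 1)
    return {
        # Assumed definition: share of slots filled with padding.
        "padding_ratio": (total_slots - used) / total_slots,
        # Assumed definition: share of document tokens lost to truncation.
        "truncation_ratio": (doc_tokens - used) / doc_tokens,
        # Assumed definition: share of sequences that join several documents.
        "concatenation_ratio": multi_doc / len(seqs),
    }


example = [
    PackedSequence([6], [6], 8),        # short document, 2 padded slots
    PackedSequence([8], [12], 8),       # long document, 4 tokens truncated
    PackedSequence([3, 5], [3, 5], 8),  # two documents concatenated
]
metrics = composition_metrics(example)
```

Under these assumed definitions, a composition strategy would aim to keep all three ratios low at once, which a single fixed sequence length makes difficult.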
[COMMENTS]
8 pages, 5 figures, 3 tables. Accepted by IJCAI 2025
[LINK]
http://arxiv.org/abs/2407.07495v2
[DATE]
2025-06-27 18:33:27+08:00
[CATEGORIES]
cs.CL
Large Language Models in Argument Mining: A Survey
[AUTHORS]
Hao Li, Viktor Schlegel, Yizheng Sun, Riza Batista-Navarro, Goran Nenadic
[ABSTRACT]
Argument Mining (AM), a critical subfield of Natural Language Processing
(NLP), focuses on extracting argumentative structures from text. The advent of
Large Language Models (LLMs) has profoundly transformed AM, enabling advanced
in-context learning, prompt-based generation, and robust cross-domain
adaptability. This survey systematically synthesizes recent advancements in
LLM-driven AM. We provide a concise review of foundational theories and
annotation frameworks, alongside a meticulously curated catalog of datasets. A
key contribution is our comprehensive taxonomy of AM subtasks, elucidating how
contemporary LLM techniques – such as prompting, chain-of-thought reasoning,
and retrieval augmentation – have reconfigured their execution. We further
detail current LLM architectures and methodologies, critically assess
evaluation practices, and delineate pivotal challenges including long-context
reasoning, interpretability, and annotation bottlenecks. Finally, we
highlight emerging trends and propose a forward-looking research agenda for
LLM-based computational argumentation, aiming to strategically guide
researchers in this rapidly evolving domain.
[COMMENTS]
Work draft
[LINK]
http://arxiv.org/abs/2506.16383v2
[DATE]
2025-06-27 18:25:12+08:00
[CATEGORIES]
cs.CL
VLM@school – Evaluation of AI image understanding on German middle school knowledge
[AUTHORS]
René Peinl, Vincent Tischler
[ABSTRACT]
This paper introduces a novel benchmark dataset designed to evaluate the
capabilities of Vision Language Models (VLMs) on tasks that combine visual
reasoning with subject-specific background knowledge in the German language. In
contrast to widely used English-language benchmarks that often rely on
artificially difficult or decontextualized problems, this dataset draws from
real middle school curricula across nine domains including mathematics,
history, biology, and religion. The benchmark includes over 2,000 open-ended
questions grounded in 486 images, ensuring that models must integrate visual
interpretation with factual reasoning rather than rely on superficial textual
cues. We evaluate thirteen state-of-the-art open-weight VLMs across multiple
dimensions, including domain-specific accuracy and performance on adversarially
crafted questions. Our findings reveal that even the strongest models achieve
less than 45% overall accuracy, with particularly poor performance in music,
mathematics, and adversarial settings. Furthermore, the results indicate
significant discrepancies between success on popular benchmarks and real-world
multimodal understanding. We conclude that middle school-level tasks offer a
meaningful and underutilized avenue for stress-testing VLMs, especially in
non-English contexts. The dataset and evaluation protocol serve as a rigorous
testbed to better understand and improve the visual and linguistic reasoning
capabilities of future AI systems.
[COMMENTS]
Peinl, René; Tischler, Vincent (2025): VLM@school - Evaluation of
AI image understanding on German middle school knowledge. Future Technologies
Conference (FTC) 2025, Munich, Germany 2025 (accepted)
[LINK]
http://arxiv.org/abs/2506.11604v2
[DATE]
2025-06-27 18:12:42+08:00
[CATEGORIES]
cs.CL
MDC-R: The Minecraft Dialogue Corpus with Reference
[AUTHORS]
Chris Madge, Maris Camilleri, Paloma Carretero Garcia, Mladen Karan, Juexi Shao, Prashant Jayannavar, Julian Hough, Benjamin Roth, Massimo Poesio
[ABSTRACT]
We introduce the Minecraft Dialogue Corpus with Reference (MDC-R). MDC-R is a
new language resource that supplements the original Minecraft Dialogue Corpus
(MDC) with expert annotations of anaphoric and deictic reference. MDC’s
task-orientated, multi-turn, situated dialogue in a dynamic environment has
motivated multiple annotation efforts, owing to the interesting linguistic
phenomena that this setting gives rise to. We believe it can serve as a
valuable resource when annotated with reference, too. Here, we discuss our
method of annotation and the resulting corpus, and provide both a quantitative
and a qualitative analysis of the data. Furthermore, we carry out a short
experiment demonstrating the usefulness of our corpus for referring expression
comprehension.
[LINK]
http://arxiv.org/abs/2506.22062v1
[DATE]
2025-06-27 17:56:40+08:00
[CATEGORIES]
cs.CL
Lost at the Beginning of Reasoning
[AUTHORS]
Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, Christof Monz
[ABSTRACT]
Recent advancements in large language models (LLMs) have significantly
advanced complex reasoning capabilities, particularly through extended
chain-of-thought (CoT) reasoning that incorporates mechanisms such as
backtracking, self-reflection and self-correction. Despite these developments,
the self-correction abilities of LLMs during long CoT reasoning remain
underexplored, and recent findings on overthinking suggest that such models
often engage in unnecessarily redundant reasoning. In this work, we empirically
show that the first reasoning step exerts a disproportionately large influence
on the final prediction: errors introduced at this stage can substantially
degrade subsequent reasoning quality. This phenomenon is consistently observed
across two state-of-the-art open-source reasoning model families: DeepSeek-R1
and Qwen3. To address this, we propose an efficient sampling strategy that
leverages a reward model to identify and retain high-quality first reasoning
steps while discarding suboptimal ones, achieving up to a 70% reduction in
inference cost without sacrificing accuracy. Finally, we introduce a new
benchmark specifically constructed with deliberately flawed first reasoning
steps to systematically evaluate model self-correction capabilities, offering a
foundation for future research on robust reasoning in LLMs.
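The sampling idea described above (score several candidate first steps, then spend full generation only on the best ones) can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' procedure: the function name, the three callables standing in for the language model and reward model, and the `num_candidates`/`keep` parameters are all hypothetical.

```python
from typing import Callable, List


def select_first_steps(
    question: str,
    sample_first_step: Callable[[str], str],   # hypothetical: LLM samples one first step
    score_step: Callable[[str, str], float],   # hypothetical: reward model scores a step
    continue_from: Callable[[str, str], str],  # hypothetical: LLM completes the reasoning
    num_candidates: int = 8,
    keep: int = 2,
) -> List[str]:
    # Sampling only the first step is cheap relative to a full chain of thought.
    candidates = [sample_first_step(question) for _ in range(num_candidates)]
    # Rank candidates by reward-model score, best first.
    ranked = sorted(candidates, key=lambda step: score_step(question, step), reverse=True)
    # Only the retained steps are expanded into full (expensive) completions.
    return [continue_from(question, step) for step in ranked[:keep]]


# Toy stand-ins so the sketch runs without any model.
_steps = iter(["guess", "define variables", "restate the problem", "skip ahead"])
toy_scores = {"guess": 0.1, "define variables": 0.9,
              "restate the problem": 0.7, "skip ahead": 0.2}
answers = select_first_steps(
    "toy question",
    sample_first_step=lambda q: next(_steps),
    score_step=lambda q, s: toy_scores[s],
    continue_from=lambda q, s: s + " -> answer",
    num_candidates=4,
    keep=2,
)
```

In this sketch the saving comes from generating `keep` full completions instead of `num_candidates`; the actual cost reduction depends on how long a first step is relative to the full reasoning trace.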
[COMMENTS]
9 pages, 5 figures, 2 tables
[LINK]
http://arxiv.org/abs/2506.22058v1
[DATE]
2025-06-27 17:53:57+08:00
[CATEGORIES]
cs.CL
Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference
[AUTHORS]
Vittoria Dentella, Fritz Guenther, Evelina Leivada
[ABSTRACT]
Understanding the limits of language is a prerequisite for Large Language
Models (LLMs) to act as theories of natural language. LLM performance in some
language tasks presents both quantitative and qualitative differences from that
of humans; however, it remains to be determined whether such differences are
amenable to model size. This work investigates the critical role of model
scaling, determining whether increases in size make up for such differences
between humans and models. We test three LLMs from different families (Bard,
137 billion parameters; ChatGPT-3.5, 175 billion; ChatGPT-4, 1.5 trillion) on a
grammaticality judgment task featuring anaphora, center embedding,
comparatives, and negative polarity. N=1,200 judgments are collected and scored
for accuracy, stability, and improvements in accuracy upon repeated
presentation of a prompt. Results of the best performing LLM, ChatGPT-4, are
compared to results of n=80 humans on the same stimuli. We find that humans are
overall less accurate than ChatGPT-4 (76% vs. 80% accuracy, respectively), but
that this is due to ChatGPT-4 outperforming humans only in one task condition,
namely on grammatical sentences. Additionally, ChatGPT-4 wavers more than
humans in its answers (12.5% vs. 9.6% likelihood of an oscillating answer,
respectively). Thus, while increased model size may lead to better performance,
LLMs are still not sensitive to (un)grammaticality the same way as humans are.
It seems possible but unlikely that scaling alone can fix this issue. We
interpret these results by comparing language learning in vivo and in silico,
identifying three critical differences concerning (i) the type of evidence,
(ii) the poverty of the stimulus, and (iii) the occurrence of semantic
hallucinations due to impenetrable linguistic reference.
[LINK]
http://arxiv.org/abs/2404.14883v3
[DATE]
2025-06-27 17:50:30+08:00
[CATEGORIES]
cs.CL
Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs
[AUTHORS]
Delu Kong, Lieve Macken
[ABSTRACT]
This study explores Machine Translationese (MTese) – the linguistic
peculiarities of machine translation outputs – focusing on the
under-researched English-to-Chinese language pair in news texts. We construct a
large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer
feature set. Then, a chi-square ranking algorithm is applied for feature
selection in both classification and clustering tasks. Our findings confirm the
presence of MTese in both Neural Machine Translation systems (NMTs) and Large
Language Models (LLMs). Original Chinese texts are nearly perfectly
distinguishable from both LLM and NMT outputs. Notable linguistic patterns in
MT outputs are shorter sentence lengths and increased use of adversative
conjunctions. Comparing LLMs and NMTs, we achieve approximately 70%
classification accuracy, with LLMs exhibiting greater lexical diversity and
NMTs using more brackets. Additionally, translation-specific LLMs show lower
lexical diversity but higher usage of causal conjunctions compared to generic
LLMs. Lastly, we find no significant differences between LLMs developed by
Chinese firms and their foreign counterparts.
[COMMENTS]
14 pages, 5 figures, 6 tables. Accepted at MT Summit 2025, Research:
Technical track. Official version may be accessed later in the ACL Anthology
[LINK]
http://arxiv.org/abs/2506.22050v1
[DATE]
2025-06-27 17:45:37+08:00
[CATEGORIES]
cs.CL
GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
[AUTHORS]
Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Yin Lu, Can Yang
[ABSTRACT]
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series,
predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While
being stable during pretraining and scalable to large model sizes, Pre-LN
suffers from an exponential growth in activation variance across layers,
causing the residual path to dominate over sub-layer outputs and limiting the
learning capacity of deeper layers. To mitigate this issue, we propose
Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be
used in combination with existing approaches. GPAS works by scaling down the
intermediate activations while keeping their gradients unchanged. This leaves
information in the activations intact, and avoids the gradient vanishing
problem associated with gradient downscaling. Extensive experiments across
various model sizes from 71M to 1B show that GPAS achieves consistent
performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows
promise in improving alternative architectures such as Sandwich-LN and
DeepNorm, demonstrating its versatility and potential for improving training
dynamics in a wide range of settings.
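The core mechanism described above, scaling activations in the forward pass while leaving their gradients unchanged, can be illustrated with a hand-written forward/backward pair. This is a toy sketch rather than the authors' implementation: the class name `GradPreservingScale`, the fixed `scale` constant, and the list-of-floats representation are assumptions chosen to keep the example dependency-free.

```python
from typing import List


class GradPreservingScale:
    """Toy op: forward scales activations, backward is the identity.

    Hypothetical illustration only; `scale` is a fixed constant here.
    """

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def forward(self, x: List[float]) -> List[float]:
        # Activations are scaled down in the forward pass.
        return [self.scale * v for v in x]

    def backward(self, grad_out: List[float]) -> List[float]:
        # The incoming gradient passes through unchanged, instead of being
        # multiplied by `scale` as ordinary scaling would do.
        return list(grad_out)


def plain_scale_backward(scale: float, grad_out: List[float]) -> List[float]:
    # Ordinary scaling y = scale * x has dy/dx = scale, so gradients shrink too.
    return [scale * g for g in grad_out]


op = GradPreservingScale(scale=0.5)
activations = op.forward([2.0, -4.0])           # scaled activations
grads = op.backward([1.0, 1.0])                 # gradient preserved
shrunk = plain_scale_backward(0.5, [1.0, 1.0])  # gradient under plain scaling
```

In an automatic-differentiation framework, the same forward/backward behavior is commonly obtained with a stop-gradient operator, for example by computing `x - stop_gradient((1 - scale) * x)`, which evaluates to `scale * x` but has derivative 1 with respect to `x`.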
[LINK]
http://arxiv.org/abs/2506.22049v1
[DATE]
2025-06-27 17:45:15+08:00
[CATEGORIES]
cs.LG
cs.CL
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
[AUTHORS]
Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
[ABSTRACT]
Large Language Models (LLMs) have extended their impact beyond Natural
Language Processing, substantially fostering the development of
interdisciplinary research. Recently, various LLM-based agents have been
developed to assist scientific discovery progress across multiple aspects and
domains. Among these, computer-using agents, capable of interacting with
operating systems as humans do, are paving the way to automated scientific
problem-solving and addressing routines in researchers’ workflows. Recognizing
the transformative potential of these agents, we introduce ScienceBoard, which
encompasses two complementary contributions: (i) a realistic, multi-domain
environment featuring dynamic and visually rich scientific workflows with
integrated professional software, where agents can autonomously interact via
different interfaces to accelerate complex research tasks and experiments; and
(ii) a challenging benchmark of 169 high-quality, rigorously validated
real-world tasks curated by humans, spanning scientific-discovery workflows in
domains such as biochemistry, astronomy, and geoinformatics. Extensive
evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude
3.7, UI-TARS) show that, despite some promising results, they still fall short
of reliably assisting scientists in complex workflows, achieving only a 15%
overall success rate. In-depth analysis further provides valuable insights for
addressing current agent limitations and more effective design principles,
paving the way to build more capable agents for scientific discovery. Our code,
environment, and benchmark are at
https://qiushisun.github.io/ScienceBoard-Home/.
[COMMENTS]
work in progress
[LINK]
http://arxiv.org/abs/2505.19897v2
[DATE]
2025-06-27 17:38:03+08:00
[CATEGORIES]
cs.CL
Can Peter Pan Survive MT? A Stylometric Study of LLMs, NMTs, and HTs in Children’s Literature Translation
[AUTHORS]
Delu Kong, Lieve Macken
[ABSTRACT]
This study focuses on evaluating the performance of machine translations
(MTs) compared to human translations (HTs) in English-to-Chinese children’s
literature translation (CLT) from a stylometric perspective. The research
constructs a Peter Pan corpus, comprising 21 translations: 7 human translations
(HTs), 7 large language model translations (LLMs), and 7 neural machine
translation outputs (NMTs). The analysis employs a generic feature set
(including lexical, syntactic, readability, and n-gram features) and a creative
text translation (CTT-specific) feature set, which captures repetition, rhythm,
translatability, and miscellaneous levels, yielding 447 linguistic features in
total.
Using classification and clustering techniques in machine learning, we
conduct a stylometric analysis of these translations. Results reveal that in
generic features, HTs and MTs exhibit significant differences in conjunction
word distributions and the ratio of 1-word-gram-YiYang, while NMTs and LLMs
show significant variation in descriptive word usage and adverb ratios.
Regarding CTT-specific features, LLMs outperform NMTs in distribution, aligning
more closely with HTs in stylistic characteristics, demonstrating the potential
of LLMs in CLT.
[COMMENTS]
19 pages, 8 figures, 4 tables. Accepted in 2nd Workshop on
Creative-text Translation and Technology Co-located with MT Summit 2025.
Official paper may later be accessed from ACL Anthology
[LINK]
http://arxiv.org/abs/2506.22038v1
[DATE]
2025-06-27 17:34:40+08:00
[CATEGORIES]
cs.CL
Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores
[AUTHORS]
Robert E. Blackwell, Jon Barry, Anthony G. Cohn
[ABSTRACT]
Large language models (LLMs) are stochastic, and not all models give
deterministic answers, even when setting temperature to zero with a fixed
random seed. However, few benchmark studies attempt to quantify uncertainty,
partly due to the time and cost of repeated experiments. We use benchmarks
designed for testing LLMs’ capacity to reason about cardinal directions to
explore the impact of experimental repeats on mean score and prediction
interval. We suggest a simple method for cost-effectively quantifying the
uncertainty of a benchmark score and make recommendations concerning
reproducible LLM evaluation.
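As a rough illustration of reporting a benchmark score with its uncertainty, the sketch below computes the mean of repeated runs together with an approximate prediction interval for one further run. It is not the method proposed in the paper: the function name is hypothetical, the scores are made-up toy values, and the interval uses a normal quantile, which understates the width when only a few repeats are available (a Student's t quantile would be wider).

```python
from statistics import NormalDist, mean, stdev
from typing import List, Tuple


def prediction_interval(scores: List[float], level: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for the score of one additional run."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two repeated runs")
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation (assumption)
    # The (1 + 1/n) factor accounts for uncertainty in the mean plus run-to-run noise.
    half_width = z * s * (1 + 1 / n) ** 0.5
    return m, m - half_width, m + half_width


# Toy accuracy values from five hypothetical repeats of the same benchmark.
toy_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
m, lo, hi = prediction_interval(toy_scores)
```

Plotting the interval width against the number of repeats is one simple way to judge when further repeats stop being worth their cost.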
[COMMENTS]
4 pages, 1 figure
[LINK]
http://arxiv.org/abs/2410.03492v2
[DATE]
2025-06-27 17:33:10+08:00
[CATEGORIES]
cs.CL
ACORD: An Expert-Annotated Retrieval Dataset for Legal Contract Drafting
[AUTHORS]
Steven H. Wang, Maksim Zubkov, Kexin Fan, Sarah Harrell, Yuyang Sun, Wei Chen, Andreas Plesner, Roger Wattenhofer
[COMMENTS]
Accepted to ACL 2025. See the project page at
https://www.atticusprojectai.org/acord
[LINK]
http://arxiv.org/abs/2501.06582v3
[DATE]
2025-06-27 17:16:02+08:00
[CATEGORIES]
cs.CL
Robust and Efficient Autoregressive Speech Synthesis with Dynamic Chunk-wise Prediction Policy
[AUTHORS]
Bohan Li, Zhihan Li, Haoran Wang, Hanglei Zhang, Yiwei Guo, Hankun Wang, Xie Chen, Kai Yu
[ABSTRACT]
Recently, autoregressive (AR) language models have emerged as a dominant
approach in speech synthesis, offering expressive generation and scalable
training. However, conventional AR speech synthesis models relying on the
next-token prediction paradigm often encounter significant challenges when
handling long speech sequences. These models often struggle to construct stable
frame-to-frame attention, leading to increased latency and degraded synthesis
quality, thereby limiting their feasibility for real-time applications. To
address these limitations, we introduce a novel dynamic chunk-wise
autoregressive synthesis framework, termed DCAR, designed to enhance both
efficiency and intelligibility robustness in AR speech generation. DCAR
introduces a chunk-to-frame attention mechanism through training with
multi-token prediction, enabling dynamic chunk prediction in variable speech
contexts using a lightweight module trained on-policy. DCAR dynamically adjusts
the token prediction span, significantly reducing the sequence length
dependency while obtaining high synthesis quality. Comprehensive empirical
evaluations demonstrate that DCAR substantially outperforms traditional
next-token prediction models, achieving up to 72.27% intelligibility
improvement and 2.61x inference speedup simultaneously on the test set.
Furthermore, we conduct comprehensive analysis to support it as a versatile
foundation for next-generation speech synthesis systems.
[COMMENTS]
17 pages, 8 figures, 5 tables
[LINK]
http://arxiv.org/abs/2506.22023v1
[DATE]
2025-06-27 16:45:21+08:00
[CATEGORIES]
cs.CL
MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
[AUTHORS]
Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, Yi R. Fung
[COMMENTS]
18 pages, ACL 2025
[LINK]
http://arxiv.org/abs/2505.23224v3
[DATE]
2025-06-27 16:40:06+08:00
[CATEGORIES]
cs.CL
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
[AUTHORS]
Haoming Yang, Ke Ma, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang
[ABSTRACT]
Despite the remarkable performance of Large Language Models (LLMs), they
remain vulnerable to jailbreak attacks, which can compromise their safety
mechanisms. Existing studies often rely on brute-force optimization or manual
design, failing to uncover potential risks in real-world scenarios. To address
this, we propose a novel jailbreak attack framework, ICRT, inspired by
heuristics and biases in human cognition. Leveraging the simplicity effect, we
employ cognitive decomposition to reduce the complexity of malicious prompts.
Simultaneously, relevance bias is utilized to reorganize prompts, enhancing
semantic alignment and inducing harmful outputs effectively. Furthermore, we
introduce a ranking-based harmfulness evaluation metric that surpasses the
traditional binary success-or-failure paradigm by employing ranking aggregation
methods such as Elo, HodgeRank, and Rank Centrality to comprehensively quantify
the harmfulness of generated content. Experimental results show that our
approach consistently bypasses mainstream LLMs’ safety mechanisms and generates
high-risk content, providing insights into jailbreak attack risks and
contributing to stronger defense strategies.
[LINK]
http://arxiv.org/abs/2505.02862v3
[DATE]
2025-06-27 16:31:28+08:00
[CATEGORIES]
cs.CL
Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
[AUTHORS]
Zhitao He, Zijun Liu, Peng Li, Yi R Fung, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
[COMMENTS]
28 pages, under review
[LINK]
http://arxiv.org/abs/2502.14496v2
[DATE]
2025-06-27 16:30:22+08:00
[CATEGORIES]
cs.CL
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
[AUTHORS]
Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu
[ABSTRACT]
Graphical User Interface (GUI) agents powered by Vision-Language Models
(VLMs) have demonstrated human-like computer control capability. Despite their
utility in advancing digital automation, a critical bottleneck persists:
collecting high-quality trajectory data for training. Common practices for
collecting such data rely on human supervision or synthetic data generation
through executing pre-defined tasks, which are either resource-intensive or
unable to guarantee data quality. Moreover, these methods suffer from limited
data diversity and significant gaps between synthetic data and real-world
environments. To address these challenges, we propose OS-Genesis, a novel GUI
data synthesis pipeline that reverses the conventional trajectory collection
process. Instead of relying on pre-defined tasks, OS-Genesis enables agents
first to perceive environments and perform step-wise interactions, then
retrospectively derive high-quality tasks to enable trajectory-level
exploration. A trajectory reward model is then employed to ensure the quality
of the generated trajectories. We demonstrate that training GUI agents with
OS-Genesis significantly improves their performance on highly challenging
online benchmarks. In-depth analysis further validates OS-Genesis’s efficiency
and its superior data quality and diversity compared to existing synthesis
methods. Our code, data, and checkpoints are available at
https://qiushisun.github.io/OS-Genesis-Home/.
[COMMENTS]
ACL 2025 Camera Ready
[LINK]
http://arxiv.org/abs/2412.19723v3
[DATE]
2025-06-27 16:25:48+08:00
[CATEGORIES]
cs.CL
Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference
[AUTHORS]
Zhuo Chen, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinyu Geng, Pengjun Xie, Fei Huang, Kewei Tu
[ABSTRACT]
Despite the advancements made in Visual Large Language Models (VLLMs), they,
like text-only Large Language Models (LLMs), have limitations in addressing
questions that require real-time information or are knowledge-intensive.
Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an
effective yet expensive way to enable models to answer queries beyond their
knowledge scopes. To mitigate the dependence on retrieval and simultaneously
maintain, or even improve, the performance benefits provided by retrieval, we
propose a method to detect the knowledge boundary of VLLMs, allowing for more
efficient use of techniques like RAG. Specifically, we propose a method with
two variants that fine-tunes a VLLM on an automatically constructed dataset for
boundary identification. Experimental results on various types of Visual
Question Answering datasets show that our method successfully depicts a VLLM’s
knowledge boundary, based on which we are able to reduce indiscriminate
retrieval while maintaining or improving performance. In addition, we show
that the knowledge boundary identified by our method for one VLLM can be used
as a surrogate boundary for other VLLMs. Code will be released at
https://github.com/Chord-Chen-30/VLLM-KnowledgeBoundary
[COMMENTS]
ACL25 May ARR
[LINK]
http://arxiv.org/abs/2502.18023v2
[DATE]
2025-06-27 16:05:04+08:00
[CATEGORIES]
cs.CL
Federated Data-Efficient Instruction Tuning for Large Language Models
[AUTHORS]
Zhen Qin, Zhaomin Wu, Bingsheng He, Shuiguang Deng
[COMMENTS]
Accepted to ACL 2025 (Findings)
[LINK]
http://arxiv.org/abs/2410.10926v2
[DATE]
2025-06-27 16:03:25+08:00
[CATEGORIES]
cs.LG
cs.CL
EasyDistill: A Comprehensive Toolkit for Effective Knowledge Distillation of Large Language Models
[AUTHORS]
Chengyu Wang, Junbing Yan, Wenrui Cai, Yuanhao Yue, Jun Huang
[ABSTRACT]
In this paper, we present EasyDistill, a comprehensive toolkit designed for
effective black-box and white-box knowledge distillation (KD) of large language
models (LLMs). Our framework offers versatile functionalities, including data
synthesis, supervised fine-tuning, ranking optimization, and reinforcement
learning techniques specifically tailored for KD scenarios. The toolkit
accommodates KD functionalities for both System 1 (fast, intuitive) and System
2 (slow, analytical) models. With its modular design and user-friendly
interface, EasyDistill empowers researchers and industry practitioners to
seamlessly experiment with and implement state-of-the-art KD strategies for
LLMs. In addition, EasyDistill provides a series of robust distilled models and
KD-based industrial solutions developed by us, along with the corresponding
open-sourced datasets, catering to a variety of use cases. Furthermore, we
describe the seamless integration of EasyDistill into Alibaba Cloud’s Platform
for AI (PAI). Overall, the EasyDistill toolkit makes advanced KD techniques for
LLMs more accessible and impactful within the NLP community.
[LINK]
http://arxiv.org/abs/2505.20888v2
[DATE]
2025-06-27 15:59:43+08:00
[CATEGORIES]
cs.CL
Analyzing and Fine-Tuning Whisper Models for Multilingual Pilot Speech Transcription in the Cockpit
[AUTHORS]
Kartheek Kumar Reddy Nareddy, Sarah Ternus, Julia Niebling
[ABSTRACT]
The developments in transformer encoder-decoder architectures have led to
significant breakthroughs in machine translation, Automatic Speech Recognition
(ASR), and instruction-based chat machines, among other applications. The
pre-trained models were trained on vast amounts of generic data over a few
epochs (fewer than five in most cases), resulting in their strong
generalization capabilities. Nevertheless, the performance of these models does
suffer when applied to niche domains like transcribing pilot speech in the
cockpit, which involves a lot of specific vocabulary and multilingual
conversations. This paper investigates and improves the transcription accuracy
of cockpit conversations with Whisper models. We have collected around 85
minutes of cockpit simulator recordings and 130 minutes of interview recordings
with pilots and manually labeled them. The speakers are middle-aged men
speaking both German and English. To improve the accuracy of transcriptions, we
propose multiple normalization schemes to refine the transcripts and improve
Word Error Rate (WER). We then employ fine-tuning to enhance ASR performance,
utilizing performance-efficient fine-tuning with Low-Rank Adaptation (LoRA).
With this approach, WER decreased from 68.49% (pretrained Whisper Large model without
normalization baseline) to 26.26% (fine-tuned Whisper Large model with the
proposed normalization scheme).
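To make the role of transcript normalization in WER concrete, the following is a minimal sketch of the standard word-level edit-distance definition of WER. The `normalize` helper is a hypothetical stand-in (lowercasing and stripping punctuation); the abstract does not specify the authors' normalization schemes, so this is an illustration under that assumption rather than their method.

```python
import string


def normalize(text: str) -> str:
    """Hypothetical normalization: lowercase and strip ASCII punctuation."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref, hyp = "Gear down.", "gear down"
    print(word_error_rate(ref, hyp))                        # 1.0
    print(word_error_rate(normalize(ref), normalize(hyp)))  # 0.0
```

In this toy case, casing and punctuation alone account for every counted error, which is the kind of effect a normalization step is meant to remove before scoring.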
[COMMENTS]
Computer Vision and Pattern Recognition (CVPR) 2025 Workshops
[LINK]
http://arxiv.org/abs/2506.21990v1
[DATE]
2025-06-27 15:57:13+08:00
[CATEGORIES]
cs.CL
cs.LG
BeamLLM: Vision-Empowered mmWave Beam Prediction with Large Language Models
[AUTHORS]
Can Zheng, Jiguang He, Guofa Cai, Zitong Yu, Chung G. Kang
[ABSTRACT]
In this paper, we propose BeamLLM, a vision-aided millimeter-wave (mmWave)
beam prediction framework leveraging large language models (LLMs) to address
the challenges of high training overhead and latency in mmWave communication
systems. By combining computer vision (CV) with LLMs’ cross-modal reasoning
capabilities, the framework extracts user equipment (UE) positional features
from RGB images and aligns visual-temporal features with LLMs’ semantic space
through reprogramming techniques. Evaluated on a realistic
vehicle-to-infrastructure (V2I) scenario, the proposed method achieves 61.01%
top-1 accuracy and 97.39% top-3 accuracy in standard prediction tasks,
significantly outperforming traditional deep learning models. In few-shot
prediction scenarios, the performance degradation is limited to 12.56% (top-1)
and 5.55% (top-3) from time sample 1 to 10, demonstrating superior prediction
capability.
[COMMENTS]
6 pages, 7 figures, conference
[LINK]
http://arxiv.org/abs/2503.10432v2
[DATE]
2025-06-27 15:52:32+08:00
[CATEGORIES]
cs.LG
cs.CL
STAIR: Improving Safety Alignment with Introspective Reasoning
[AUTHORS]
Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu
[ABSTRACT]
Ensuring the safety and harmlessness of Large Language Models (LLMs) has
become as critical as their performance in applications. However, existing
safety alignment methods typically suffer from safety-performance trade-offs
and the susceptibility to jailbreak attacks, primarily due to their reliance on
direct refusals for malicious queries. In this paper, we propose STAIR, a novel
framework that integrates SafeTy Alignment with Introspective Reasoning. We
enable LLMs to identify safety risks through step-by-step analysis by
self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR
first equips the model with a structured reasoning capability and then advances
safety alignment via iterative preference optimization on step-level reasoning
data generated using our newly proposed Safety-Informed Monte Carlo Tree Search
(SI-MCTS). We further train a process reward model on this data to guide
test-time searches for improved responses. Extensive experiments show that
STAIR effectively mitigates harmful outputs while better preserving
helpfulness, compared to instinctive alignment strategies. With test-time
scaling, STAIR achieves a safety performance comparable to Claude-3.5 against
popular jailbreak attacks. Relevant resources in this work are available at
https://github.com/thu-ml/STAIR.
[COMMENTS]
22 pages, 8 figures, ICML2025 Oral
[LINK]
http://arxiv.org/abs/2502.02384v2
[DATE]
2025-06-27 15:30:35+08:00
[CATEGORIES]
cs.CL
Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses
[AUTHORS]
Mohamed Ahmed, Mohamed Abdelmouty, Mingyu Kim, Gunvanth Kandula, Alex Park, James C. Davis
[ABSTRACT]
The advancement of Pre-Trained Language Models (PTLMs) and Large Language
Models (LLMs) has led to their widespread adoption across diverse applications.
Despite their success, these models remain vulnerable to attacks that exploit
their inherent weaknesses to bypass safety measures. Two primary
inference-phase threats are token-level and prompt-level jailbreaks.
Token-level attacks embed adversarial sequences that transfer well to black-box
models like GPT but leave detectable patterns and rely on gradient-based token
optimization, whereas prompt-level attacks use semantically structured inputs
to elicit harmful responses yet depend on iterative feedback that can be
unreliable. To address the complementary limitations of these methods, we
propose two hybrid approaches that integrate token- and prompt-level techniques
to enhance jailbreak effectiveness across diverse PTLMs. GCG + PAIR and the
newly explored GCG + WordGame hybrids were evaluated across multiple Vicuna and
Llama models. GCG + PAIR consistently raised attack-success rates over its
constituent techniques on undefended models; for instance, on Llama-3, its
Attack Success Rate (ASR) reached 91.6%, a substantial increase from PAIR’s
58.4% baseline. Meanwhile, GCG + WordGame matched the raw performance of
WordGame, maintaining a high ASR of over 80% even under stricter evaluators like
Mistral-Sorry-Bench. Crucially, both hybrids retained transferability and
reliably pierced advanced defenses such as Gradient Cuff and JBShield, which
fully blocked single-mode attacks. These findings expose previously unreported
vulnerabilities in current safety stacks, highlight trade-offs between raw
success and defensive robustness, and underscore the need for holistic
safeguards against adaptive adversaries.
[LINK]
http://arxiv.org/abs/2506.21972v1
[DATE]
2025-06-27 15:26:33+08:00
[CATEGORIES]
cs.CL
cs.LG
ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Multilingual Contrastive Framework
[AUTHORS]
Hengyuan Zhang, Chenming Shang, Sizhe Wang, Dongdong Zhang, Yiyao Yu, Feng Yao, Renliang Sun, Yujiu Yang, Furu Wei
[ABSTRACT]
Although fine-tuning Large Language Models (LLMs) with multilingual data can
rapidly enhance the multilingual capabilities of LLMs, they still exhibit a
performance gap between the dominant language (e.g., English) and non-dominant
ones due to the imbalance of training data across languages. To further enhance
the performance of non-dominant languages, we propose ShifCon, a Shift-based
multilingual Contrastive framework that aligns the internal forward process of
other languages toward that of the dominant one. Specifically, it shifts the
representations of non-dominant languages into the dominant language subspace,
allowing them to access relatively rich information encoded in the model
parameters. The enriched representations are then shifted back into their
original language subspace before generation. Moreover, we introduce a subspace
distance metric to pinpoint the optimal layer area for shifting representations
and employ multilingual contrastive learning to further enhance the alignment
of representations within this area. Experiments demonstrate that our ShifCon
framework significantly enhances the performance of non-dominant languages,
particularly for low-resource ones. Further analysis offers extra insights to
verify the effectiveness of ShifCon and propel future research.
[COMMENTS]
Accepted by ACL 2025
[LINK]
http://arxiv.org/abs/2410.19453v6
[DATE]
2025-06-27 15:21:49+08:00
[CATEGORIES]
cs.CL
Using Large Language Models to Suggest Informative Prior Distributions in Bayesian Statistics
[AUTHORS]
Michael A. Riegler, Kristoffer Herland Hellton, Vajira Thambawita, Hugo L. Hammer
[ABSTRACT]
Selecting prior distributions in Bayesian statistics is challenging,
resource-intensive, and subjective. We analyze the use of large language models
(LLMs) to suggest suitable, knowledge-based informative priors. We developed an
extensive prompt asking LLMs not only to suggest priors but also to verify and
reflect on their choices.
We evaluated Claude Opus, Gemini 2.5 Pro, and ChatGPT-4o-mini on two real
datasets: heart disease risk and concrete strength. All LLMs correctly
identified the direction for all associations (e.g., that heart disease risk is
higher for males). The quality of suggested priors was measured by their
Kullback-Leibler divergence from the maximum likelihood estimator’s
distribution.
The LLMs suggested both moderately and weakly informative priors. The
moderate priors were often overconfident, resulting in distributions misaligned
with the data. In our experiments, Claude and Gemini provided better priors
than ChatGPT. For weakly informative priors, a key performance difference
emerged: ChatGPT and Gemini defaulted to an “unnecessarily vague” mean of 0,
while Claude did not, demonstrating a significant advantage.
The ability of LLMs to identify correct associations shows their great
potential as an efficient, objective method for developing informative priors.
However, the primary challenge remains in calibrating the width of these priors
to avoid over- and under-confidence.
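As a rough illustration of how prior quality could be scored with a Kullback-Leibler divergence, the sketch below uses the closed form for two univariate normal distributions. It assumes that both the suggested prior and the maximum likelihood estimator's distribution are approximated as normals and that the divergence is taken as KL(MLE || prior); neither assumption is stated in the abstract, and all numbers are made up for illustration.

```python
import math


def kl_normal(mu_p: float, sigma_p: float, mu_q: float, sigma_q: float) -> float:
    """KL(P || Q) for P = N(mu_p, sigma_p^2) and Q = N(mu_q, sigma_q^2)."""
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


if __name__ == "__main__":
    # Hypothetical MLE distribution for one regression coefficient.
    mle_mu, mle_sigma = 0.8, 0.2
    # Hypothetical priors: a narrow, misaligned one and a wide one centered at 0.
    kl_overconfident = kl_normal(mle_mu, mle_sigma, 0.2, 0.1)
    kl_weak = kl_normal(mle_mu, mle_sigma, 0.0, 1.0)
    print(kl_overconfident > kl_weak)  # True
```

Under these assumptions, a narrow prior centered away from the estimate is penalized far more than a wide one, which matches the abstract's point that calibrating prior width is the main difficulty.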
[LINK]
http://arxiv.org/abs/2506.21964v1
[DATE]
2025-06-27 15:11:55+08:00
[CATEGORIES]
cs.CL
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory
[AUTHORS]
Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh
[COMMENTS]
Accepted to GEM2 Workshop: Generation, Evaluation & Metrics - ACL
2025
[LINK]
http://arxiv.org/abs/2506.21961v1
[DATE]
2025-06-27 15:09:11+08:00
[CATEGORIES]
cs.CL
EUR-USD Exchange Rate Forecasting Based on Information Fusion with Large Language Models and Deep Learning Methods
[AUTHORS]
Hongcheng Ding, Xuanze Zhao, Ruiting Deng, Shamsul Nahar Abdullah, Deshinta Arrova Dewi
[ABSTRACT]
Accurate forecasting of the EUR/USD exchange rate is crucial for investors,
businesses, and policymakers. This paper proposes a novel framework, IUS, that
integrates unstructured textual data from news and analysis with structured
data on exchange rates and financial indicators to enhance exchange rate
prediction. The IUS framework employs large language models for sentiment
polarity scoring and exchange rate movement classification of texts. These
textual features are combined with quantitative features and input into a
Causality-Driven Feature Generator. An Optuna-optimized Bi-LSTM model is then
used to forecast the EUR/USD exchange rate. Experiments demonstrate that the
proposed method outperforms benchmark models, reducing MAE by 10.69% and RMSE
by 9.56% compared to the best performing baseline. Results also show the
benefits of data fusion, with the combination of unstructured and structured
data yielding higher accuracy than structured data alone. Furthermore, feature
selection using the top 12 important quantitative features combined with the
textual features proves most effective. The proposed IUS framework and
Optuna-Bi-LSTM model provide a powerful new approach for exchange rate
forecasting through multi-source data integration.
[LINK]
http://arxiv.org/abs/2408.13214v2
[DATE]
2025-06-27 14:57:32+08:00
[CATEGORIES]
cs.CL
A Survey of Large Language Models in Psychotherapy: Current Landscape and Future Directions
[AUTHORS]
Hongbin Na, Yining Hua, Zimu Wang, Tao Shen, Beibei Yu, Lilin Wang, Wei Wang, John Torous, Ling Chen
[ABSTRACT]
Mental health is increasingly critical in contemporary healthcare, with
psychotherapy demanding dynamic, context-sensitive interactions that
traditional NLP methods struggle to capture. Large Language Models (LLMs) offer
significant potential for addressing this gap due to their ability to handle
extensive context and multi-turn reasoning. This review introduces a conceptual
taxonomy dividing psychotherapy into interconnected stages (assessment,
diagnosis, and treatment) to systematically examine LLM advancements and
challenges. Our comprehensive analysis reveals imbalances in current research,
such as a focus on common disorders, linguistic biases, fragmented methods, and
limited theoretical integration. We identify critical challenges including
capturing dynamic symptom fluctuations, overcoming linguistic and cultural
biases, and ensuring diagnostic reliability. Highlighting future directions, we
advocate for continuous multi-stage modeling, real-time adaptive systems
grounded in psychological theory, and diversified research covering broader
mental disorders and therapeutic approaches, aiming toward more holistic and
clinically integrated psychotherapy LLM systems.
[COMMENTS]
Accepted by ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2502.11095v3
[DATE]
2025-06-27 14:52:25+08:00
[CATEGORIES]
cs.CL
Dynamic Adaptive Rank Space Exploration for Efficient Sentiment Analysis with Large Language Models
[AUTHORS]
Hongcheng Ding, Fuzhen Hu, Ruiting Deng, Xuanze Zhao, Shamsul Nahar Abdullah, Deshinta Arrova Dewi
[ABSTRACT]
Sentiment analysis has become increasingly important for assessing public
opinion and informing decision-making. Large language models (LLMs) have
revolutionized this field by capturing nuanced language patterns. However,
adapting LLMs to domain-specific sentiment analysis tasks remains challenging
due to computational constraints and the need for optimal fine-tuning. To
address these challenges, we propose a novel Dynamic Adaptive Rank Space
Exploration (DARSE) framework for efficient and effective sentiment analysis
using LLMs. DARSE consists of a coarse-grained greedy algorithm to identify the
optimal rank range, a fine-grained exploration algorithm to refine rank
selection, and a dynamic rank allocation method to determine the optimal rank
combination for each LLM layer. Extensive experiments demonstrate that DARSE
significantly improves sentiment analysis accuracy, achieving a 15.1%
improvement in MSE and a 4.3% improvement in accuracy compared to previous
work. Our framework strikes a balance between computational efficiency and
model performance, making it a promising approach for sentiment analysis with
LLMs.
[LINK]
http://arxiv.org/abs/2410.16589v2
[DATE]
2025-06-27 14:44:48+08:00
[CATEGORIES]
cs.CL
LRP4RAG: Detecting Hallucinations in Retrieval-Augmented Generation via Layer-wise Relevance Propagation
[AUTHORS]
Haichuan Hu, Congqing He, Xiaochen Xie, Quanjun Zhang
[ABSTRACT]
Retrieval-Augmented Generation (RAG) has become a primary technique for
mitigating hallucinations in large language models (LLMs). However, incomplete
knowledge extraction and insufficient understanding can still mislead LLMs to
produce irrelevant or even contradictory responses, which means hallucinations
persist in RAG. In this paper, we propose LRP4RAG, a method based on the
Layer-wise Relevance Propagation (LRP) algorithm for detecting hallucinations
in RAG. Specifically, we first utilize LRP to compute the relevance between the
input and output of the RAG generator. We then apply further extraction and
resampling to the relevance matrix. The processed relevance data are input into
multiple classifiers to determine whether the output contains hallucinations.
To the best of our knowledge, this is the first time that LRP has been used for
detecting RAG hallucinations, and extensive experiments demonstrate that
LRP4RAG outperforms existing baselines.
[LINK]
http://arxiv.org/abs/2408.15533v3
[DATE]
2025-06-27 14:14:36+08:00
[CATEGORIES]
cs.CL
Dynamic Adaptive Optimization for Effective Sentiment Analysis Fine-Tuning on Large Language Models
[AUTHORS]
Hongcheng Ding, Xuanze Zhao, Ruiting Deng, Shamsul Nahar Abdullah, Deshinta Arrova Dewi, Zixiao Jiang
[ABSTRACT]
Sentiment analysis plays a crucial role in various domains, such as business
intelligence and financial forecasting. Large language models (LLMs) have
become a popular paradigm for sentiment analysis, leveraging multi-task
learning to address specific tasks concurrently. However, LLMs fine-tuned
for sentiment analysis often underperform due to the inherent challenges in
managing diverse task complexities. Moreover, constant-weight approaches in
multi-task learning struggle to adapt to variations in data characteristics,
further complicating model effectiveness. To address these issues, we propose a
novel multi-task learning framework with a dynamic adaptive optimization (DAO)
module. This module is designed as a plug-and-play component that can be
seamlessly integrated into existing models, providing an effective and flexible
solution for multi-task learning. The key component of the DAO module is
dynamic adaptive loss, which dynamically adjusts the weights assigned to
different tasks based on their relative importance and data characteristics
during training. Sentiment analyses on a standard and customized financial text
dataset demonstrate that the proposed framework achieves superior performance.
Specifically, this work improves the Mean Squared Error (MSE) and Accuracy
(ACC) by 15.58% and 1.24% respectively, compared with previous work.
[LINK]
http://arxiv.org/abs/2408.11856v3
[DATE]
2025-06-27 14:13:14+08:00
[CATEGORIES]
cs.CL
ARAG: Agentic Retrieval Augmented Generation for Personalized Recommendation
[AUTHORS]
Reza Yousefi Maragheh, Pratheek Vadla, Priyank Gupta, Kai Zhao, Aysenur Inan, Kehui Yao, Jianpeng Xu, Praveen Kanumala, Jason Cho, Sushant Kumar
[ABSTRACT]
Retrieval-Augmented Generation (RAG) has shown promise in enhancing
recommendation systems by incorporating external context into large language
model prompts. However, existing RAG-based approaches often rely on static
retrieval heuristics and fail to capture nuanced user preferences in dynamic
recommendation scenarios. In this work, we introduce ARAG, an Agentic
Retrieval-Augmented Generation framework for Personalized Recommendation, which
integrates a multi-agent collaboration mechanism into the RAG pipeline. To
better understand the long-term and session behavior of the user, ARAG
leverages four specialized LLM-based agents: a User Understanding Agent that
summarizes user preferences from long-term and session contexts, a Natural
Language Inference (NLI) Agent that evaluates semantic alignment between
candidate items retrieved by RAG and inferred intent, a Context Summary Agent
that summarizes the findings of the NLI Agent, and an Item Ranker Agent that
generates a ranked list of recommendations based on contextual fit. We evaluate
ARAG across three datasets. Experimental results demonstrate that ARAG
significantly outperforms standard RAG and recency-based baselines, achieving
up to 42.1% improvement in NDCG@5 and 35.5% in Hit@5. We also conduct an
ablation study to analyze the effect of the different components of ARAG. Our
findings highlight the effectiveness of integrating agentic reasoning into
retrieval-augmented recommendation and provide new directions for LLM-based
personalization.
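For readers unfamiliar with the two reported metrics, here is a minimal sketch of Hit@k and NDCG@k in their common binary-relevance form. The abstract does not give the exact definitions used in the evaluation, so the binary relevance and the log2 discount are assumptions, and the item identifiers are hypothetical.

```python
import math


def hit_at_k(ranked: list, relevant: set, k: int = 5) -> float:
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def ndcg_at_k(ranked: list, relevant: set, k: int = 5) -> float:
    """Binary-relevance NDCG@k with a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    if ideal_hits == 0:
        return 0.0
    # Ideal DCG places every relevant item at the top of the list.
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


if __name__ == "__main__":
    ranked_items = ["shoes", "socks", "hat", "belt", "scarf", "gloves"]
    purchased = {"socks"}
    print(hit_at_k(ranked_items, purchased))   # 1.0
    print(ndcg_at_k(ranked_items, purchased))  # 1 / log2(3), about 0.63
```

Hit@5 only checks whether a relevant item made the top five, while NDCG@5 also rewards placing it higher in the list.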
[LINK]
http://arxiv.org/abs/2506.21931v1
[DATE]
2025-06-27 13:45:59+08:00
[CATEGORIES]
cs.CL
HyReC: Exploring Hybrid-based Retriever for Chinese
[AUTHORS]
Zunran Wang, Zheng Shenpeng, Wang Shenglan, Minghui Zhao, Zhonghua Li
[ABSTRACT]
Hybrid-based retrieval methods, which unify dense-vector and lexicon-based
retrieval, have garnered considerable attention in the industry due to
their performance gains. However, despite their promising results, the
application of these hybrid paradigms in Chinese retrieval contexts has
remained largely underexplored. In this paper, we introduce HyReC, an
innovative end-to-end optimization method tailored specifically for
hybrid-based retrieval in Chinese. HyReC enhances performance by integrating
the semantic union of terms into the representation model. Additionally, it
features the Global-Local-Aware Encoder (GLAE) to promote consistent semantic
sharing between lexicon-based and dense retrieval while minimizing the
interference between them. To further refine alignment, we incorporate a
Normalization Module (NM) that fosters mutual benefits between the retrieval
approaches. Finally, we evaluate HyReC on the C-MTEB retrieval benchmark to
demonstrate its effectiveness.
[LINK]
http://arxiv.org/abs/2506.21913v1
[DATE]
2025-06-27 12:57:01+08:00
[CATEGORIES]
cs.CL
AutoMixer: Checkpoint Artifacts as Automatic Data Mixers
[AUTHORS]
Ernie Chang, Yang Li, Patrick Huber, David Kant, Yangyang Shi, Vikas Chandra
[COMMENTS]
Accepted at ACL 2025
[LINK]
http://arxiv.org/abs/2506.21910v1
[DATE]
2025-06-27 12:53:07+08:00
[CATEGORIES]
cs.CL
Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference
[AUTHORS]
Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen
[ABSTRACT]
The increasing context window size in large language models (LLMs) has
improved their ability to handle complex, long-text tasks. However, as the
conversation rounds continue, it is required to store a large amount of KV
cache in GPU memory, which significantly affects the efficiency and even
availability of the model serving systems. This paper analyzes dialogue data
from real users at the granularity of rounds and discovers that LLM
inference manifests a watershed layer, after which the distribution of
round-level attention shows notable similarity. Based on this, we propose Round
Attention, a novel round-level attention mechanism that selectively processes
the KV cache of top-k relevant rounds, where k is dynamically determined
through the attention matrix in the watershed layer. Theoretical analysis
demonstrates that our method reduces memory usage by 54% to 82%, while
experimental results confirm that loading sparse critical-round KV cache
maintains answer accuracy without performance degradation.
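The idea of keeping only the most relevant rounds, with the number of rounds chosen dynamically, can be sketched as follows. The abstract does not say how k is derived from the watershed-layer attention matrix, so the rule used here (keep the smallest set of rounds whose share of attention mass reaches a threshold) is purely an assumption, as are the function name `select_rounds` and the `mass_threshold` parameter.

```python
def select_rounds(round_scores: list, mass_threshold: float = 0.85) -> list:
    """Return indices of the highest-scoring rounds covering mass_threshold.

    round_scores[i] is a hypothetical aggregate of the attention that the
    current query places on conversation round i at some chosen layer.
    """
    if not round_scores:
        return []
    total = sum(round_scores)
    if total <= 0:
        raise ValueError("round scores must sum to a positive value")
    # Visit rounds from most to least attended.
    order = sorted(range(len(round_scores)), key=lambda i: round_scores[i], reverse=True)
    selected, covered = [], 0.0
    for idx in order:
        selected.append(idx)
        covered += round_scores[idx] / total
        if covered >= mass_threshold:
            break
    # Restore chronological order so cached rounds are loaded in sequence.
    return sorted(selected)


if __name__ == "__main__":
    scores = [1, 12, 1, 6]        # hypothetical per-round attention mass
    print(select_rounds(scores))  # [1, 3]
```

Here k is not fixed in advance: two rounds suffice when attention is concentrated, while a flat attention profile would keep more of them.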
[LINK]
http://arxiv.org/abs/2502.15294v3
[DATE]
2025-06-27 11:43:24+08:00
[CATEGORIES]
cs.CL
A Dual-Layered Evaluation of Geopolitical and Cultural Bias in LLMs
[AUTHORS]
Sean Kim, Hyuhng Joon Kim
[ABSTRACT]
As large language models (LLMs) are increasingly deployed across diverse
linguistic and cultural contexts, understanding their behavior in both factual
and disputable scenarios is essential, especially when their outputs may shape
public opinion or reinforce dominant narratives. In this paper, we define two
types of bias in LLMs: model bias (bias stemming from model training) and
inference bias (bias induced by the language of the query), through a two-phase
evaluation. Phase 1 evaluates LLMs on factual questions where a single
verifiable answer exists, assessing whether models maintain consistency across
different query languages. Phase 2 expands the scope by probing geopolitically
sensitive disputes, where responses may reflect culturally embedded or
ideologically aligned perspectives. We construct a manually curated dataset
spanning both factual and disputable QA, across four languages and question
types. The results show that Phase 1 exhibits query-language-induced alignment,
while Phase 2 reflects an interplay between the model’s training context and
query language. This paper offers a structured framework for evaluating LLM
behavior across neutral and sensitive topics, providing insights for future LLM
deployment and culturally aware evaluation practices in multilingual contexts.
[COMMENTS]
This paper is accepted to ACL Student Research Workshop (SRW) 2025
[LINK]
http://arxiv.org/abs/2506.21881v1
[DATE]
2025-06-27 11:37:15+08:00
[CATEGORIES]
cs.CL
Grammar and Gameplay-aligned RL for Game Description Generation with LLMs
[AUTHORS]
Tsunehiko Tanaka, Edgar Simo-Serra
[ABSTRACT]
Game Description Generation (GDG) is the task of generating a game
description written in a Game Description Language (GDL) from natural language
text. Previous studies have explored generation methods leveraging the
contextual understanding capabilities of Large Language Models (LLMs); however,
accurately reproducing the game features of the game descriptions remains a
challenge. In this paper, we propose reinforcement learning-based fine-tuning
of LLMs for GDG (RLGDG). Our training method simultaneously improves
grammatical correctness and fidelity to game concepts by introducing both
grammar rewards and concept rewards. Furthermore, we adopt a two-stage training
strategy where Reinforcement Learning (RL) is applied following Supervised
Fine-Tuning (SFT). Experimental results demonstrate that our proposed method
significantly outperforms baseline methods using SFT alone. Our code is
available at https://github.com/tsunehiko/rlgdg
[COMMENTS]
Published at IEEE Conference on Games, 2025
[LINK]
http://arxiv.org/abs/2503.15783v2
[DATE]
2025-06-27 11:31:44+08:00
[CATEGORIES]
cs.CL
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
[AUTHORS]
Qiyue Gao, Xinyu Pi, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, Stone Tao, Mengyang Liu, Jiaxi Yang, Chao-Jung Lai, Chuanyang Jin, Jiannan Xiang, Benhao Huang, Zeming Chen, David Danks, Hao Su, Tianmin Shu, Ziqiao Ma, Lianhui Qin, Zhiting Hu
[COMMENTS]
ACL 2025 (Findings)
[LINK]
http://arxiv.org/abs/2506.21876v1
[DATE]
2025-06-27 11:24:29+08:00
[CATEGORIES]
cs.CL
Time is On My Side: Dynamics of Talk-Time Sharing in Video-chat Conversations
[AUTHORS]
Kaixiang Zhang, Justine Zhang, Cristian Danescu-Niculescu-Mizil
[ABSTRACT]
An intrinsic aspect of every conversation is the way talk-time is shared
between multiple speakers. Conversations can be balanced, with each speaker
claiming a similar amount of talk-time, or imbalanced when one talks
disproportionately. Such overall distributions are the consequence of
continuous negotiations between the speakers throughout the conversation: who
should be talking at every point in time, and for how long? In this work we
introduce a computational framework for quantifying both the conversation-level
distribution of talk-time between speakers, as well as the lower-level dynamics
that lead to it. We derive a typology of talk-time sharing dynamics structured
by several intuitive axes of variation. By applying this framework to a large
dataset of video-chats between strangers, we confirm that, perhaps
unsurprisingly, different conversation-level distributions of talk-time are
perceived differently by speakers, with balanced conversations being preferred
over imbalanced ones, especially by those who end up talking less. Then we
reveal that – even when they lead to the same level of overall balance –
different types of talk-time sharing dynamics are perceived differently by the
participants, highlighting the relevance of our newly introduced typology.
Finally, we discuss how our framework offers new tools to designers of
computer-mediated communication platforms, for both human-human and human-AI
communication.
[COMMENTS]
Accepted for publication at CSCW 2025. Code and data available in
ConvoKit (https://convokit.cornell.edu)
[LINK]
http://arxiv.org/abs/2506.20474v2
[DATE]
2025-06-27 11:08:11+08:00
[CATEGORIES]
cs.CL
Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder
[AUTHORS]
Yingji Zhang, Danilo S. Carvalho, André Freitas
[ABSTRACT]
Integrating compositional and symbolic properties into current distributional
semantic spaces can enhance the interpretability, controllability,
compositionality, and generalisation capabilities of Transformer-based
auto-regressive language models (LMs). In this survey, we offer a novel
perspective on latent space geometry through the lens of compositional
semantics, a direction we refer to as "semantic representation
learning". This direction enables a bridge between symbolic and distributional
semantics, helping to mitigate the gap between them. We review and compare
three mainstream autoencoder architectures, namely Variational AutoEncoder (VAE),
Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE), and examine the
distinctive latent geometries they induce in relation to semantic structure and
interpretability.
[COMMENTS]
In progress
[LINK]
http://arxiv.org/abs/2506.20083v2
[DATE]
2025-06-27 10:47:54+08:00
[CATEGORIES]
cs.CL
RiverEcho: Real-Time Interactive Digital System for Ancient Yellow River Culture
[AUTHORS]
Haofeng Wang, Yilin Guo, Zehao Li, Tong Yue, Yizong Wang, Enci Zhang, Rongqun Lin, Feng Gao, Shiqi Wang, Siwei Ma
[ABSTRACT]
The Yellow River is China’s mother river and a cradle of human civilization.
The ancient Yellow River culture is, moreover, an indispensable part of human
art history. To conserve and inherit the ancient Yellow River culture, we
designed RiverEcho, a real-time interactive system that responds to voice
queries using a large language model and a cultural knowledge dataset,
delivering explanations through a talking-head digital human. Specifically, we
built a knowledge database focused on the ancient Yellow River culture,
including the collection of historical texts and the processing pipeline.
Experimental results demonstrate that leveraging Retrieval-Augmented Generation
(RAG) on the proposed dataset enhances the response quality of the Large
Language Model (LLM), enabling the system to generate more professional and
informative responses. Our work not only diversifies the means of promoting
Yellow River culture but also provides users with deeper cultural insights.
[COMMENTS]
IEEE International Conference on Multimedia and Expo Workshop,
2025 (Accepted)
[LINK]
http://arxiv.org/abs/2506.21865v1
[DATE]
2025-06-27 10:40:00+08:00
[CATEGORIES]
cs.CL
DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
[AUTHORS]
Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
[ABSTRACT]
Native multimodal large language models (MLLMs) restructure a single large
language model (LLM) into a spoken language model (SLM) capable of both speech
and text generation. Compared to modular and aligned MLLMs, native MLLMs
preserve richer paralinguistic features such as emotion and prosody, and
generate speech responses directly within the backbone LLM rather than using a
separate speech decoder. This integration also results in lower response
latency and smoother interaction. However, native MLLMs suffer from
catastrophic forgetting and performance degradation because the available
paired speech-text data is insufficient to support the pretraining of MLLMs
compared to the vast amount of text data required to pretrain text LLMs. To
address this issue, we propose DeepTalk, a framework for adaptive modality
expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk
first adaptively distinguishes modality experts according to their modality
load within the LLM. Each modality expert then undergoes specialized
single-modality training, followed by joint multimodal collaborative training.
As a result, DeepTalk incurs only a 5.5% performance drop compared to the
original LLM, which is significantly lower than the average performance drop of
over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par
with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within
0.5 seconds, ensuring a seamless and intelligent speech interaction experience.
Code and models are released at https://github.com/talkking/DeepTalk.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2506.21864v1
[DATE]
2025-06-27 10:32:04+08:00
[CATEGORIES]
cs.CL
Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models
[AUTHORS]
Taiga Someya, Ryo Yoshida, Hitomi Yanaka, Yohei Oseki
[ABSTRACT]
Recent work has demonstrated that neural language models encode syntactic
structures in their internal representations, yet the derivations by which
these structures are constructed across layers remain poorly understood. In
this paper, we propose Derivational Probing to investigate how micro-syntactic
structures (e.g., subject noun phrases) and macro-syntactic structures (e.g.,
the relationship between the root verbs and their direct dependents) are
constructed as word embeddings propagate upward across layers. Our experiments
on BERT reveal a clear bottom-up derivation: micro-syntactic structures emerge
in lower layers and are gradually integrated into a coherent macro-syntactic
structure in higher layers. Furthermore, a targeted evaluation on subject-verb
number agreement shows that the timing of constructing macro-syntactic
structures is critical for downstream performance, suggesting an optimal timing
for integrating global syntactic information.
[LINK]
http://arxiv.org/abs/2506.21861v1
[DATE]
2025-06-27 10:29:30+08:00
[CATEGORIES]
cs.CL
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
[AUTHORS]
Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
[COMMENTS]
ICML 2025 Camera Ready
[LINK]
http://arxiv.org/abs/2501.14275v2
[DATE]
2025-06-27 10:05:51+08:00
[CATEGORIES]
cs.CL
cs.LG
The Consistency Hypothesis in Uncertainty Quantification for Large Language Models
[AUTHORS]
Quan Xiao, Debarun Bhattacharjya, Balaji Ganesan, Radu Marinescu, Katsiaryna Mirylenka, Nhan H Pham, Michael Glass, Junkyu Lee
[ABSTRACT]
Estimating the confidence of large language model (LLM) outputs is essential
for real-world applications requiring high user trust. Black-box uncertainty
quantification (UQ) methods, relying solely on model API access, have gained
popularity due to their practical benefits. In this paper, we examine the
implicit assumption behind several UQ methods, which use generation consistency
as a proxy for confidence, an idea we formalize as the consistency hypothesis.
We introduce three mathematical statements with corresponding statistical tests
to capture variations of this hypothesis and metrics to evaluate LLM output
conformity across tasks. Our empirical investigation, spanning 8 benchmark
datasets and 3 tasks (question answering, text summarization, and text-to-SQL),
highlights the prevalence of the hypothesis under different settings. Among the
statements, we highlight the ‘Sim-Any’ hypothesis as the most actionable, and
demonstrate how it can be leveraged by proposing data-free black-box UQ methods
that aggregate similarities between generations for confidence estimation.
These approaches can outperform the closest baselines, showcasing the practical
value of the empirically observed consistency hypothesis.
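The idea of aggregating similarities between generations into a confidence score can be illustrated with a minimal sketch. This is not the paper's method: the token-set Jaccard similarity, the mean-pairwise aggregation, and the function names `jaccard` and `consistency_confidence` are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in for any similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def consistency_confidence(generations: list[str]) -> float:
    """Mean pairwise similarity across sampled generations (hypothetical aggregator)."""
    if len(generations) < 2:
        raise ValueError("need at least two generations to measure consistency")
    pairs = list(combinations(generations, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Sample several answers to the same prompt, then score their agreement.
    samples = ["Paris", "paris", "London"]
    print(consistency_confidence(samples))
```

Under this sketch, a prompt whose sampled answers all agree scores 1.0, and one whose answers share no tokens scores 0.0. Only black-box access to the generations is needed, which matches the setting the abstract describes.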
[COMMENTS]
Accepted by The Conference on Uncertainty in Artificial Intelligence
(UAI) 2025
[LINK]
http://arxiv.org/abs/2506.21849v1
[DATE]
2025-06-27 09:53:15+08:00
[CATEGORIES]
cs.CL
cs.LG
LinguaSynth: Heterogeneous Linguistic Signals for News Classification
[AUTHORS]
Duo Zhang, Junyi Mo
[ABSTRACT]
Deep learning has significantly advanced NLP, but its reliance on large
black-box models introduces critical interpretability and computational
efficiency concerns. This paper proposes LinguaSynth, a novel text
classification framework that strategically integrates five complementary
linguistic feature types: lexical, syntactic, entity-level, word-level
semantics, and document-level semantics within a transparent logistic
regression model. Unlike transformer-based architectures, LinguaSynth maintains
interpretability and computational efficiency, achieving an accuracy of 84.89
percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by
3.32 percent. Through rigorous feature interaction analysis, we show that
syntactic and entity-level signals provide essential disambiguation and
effectively complement distributional semantics. LinguaSynth sets a new
benchmark for interpretable, resource-efficient NLP models and challenges the
prevailing assumption that deep neural networks are necessary for
high-performing text classification.
[LINK]
http://arxiv.org/abs/2506.21848v1
[DATE]
2025-06-27 09:45:20+08:00
[CATEGORIES]
cs.CL
PARSI: Persian Authorship Recognition via Stylometric Integration
[AUTHORS]
Kourosh Shahnazari, Mohammadali Keshtparvar, Seyed Moein Ayyoubzadeh
[ABSTRACT]
The intricate linguistic, stylistic, and metrical aspects of Persian
classical poetry pose a challenge for computational authorship attribution. In
this work, we present a versatile framework to determine authorship among 67
prominent poets. We employ a multi-input neural framework consisting of a
transformer-based language encoder complemented by features addressing the
semantic, stylometric, and metrical dimensions of Persian poetry. Our feature
set encompasses 100-dimensional Word2Vec embeddings, seven stylometric
measures, and categorical encodings of poetic form and meter. We compiled a
vast corpus of 647,653 verses of the Ganjoor digital collection, validating the
data through strict preprocessing and author verification while preserving
poem-level splitting to prevent overlap. This work employs verse-level
classification and majority and weighted voting schemes in evaluation,
revealing that weighted voting yields 71% accuracy. We further investigate
threshold-based decision filtering, allowing the model to generate highly
confident predictions, achieving 97% accuracy at a 0.9 threshold, though at
lower coverage. Our work focuses on the integration of deep representational
forms with domain-specific features for improved authorship attribution. The
results illustrate the potential of our approach for automated classification
and the contribution to stylistic analysis, authorship disputes, and general
computational literature research. This research will facilitate further
research on multilingual author attribution, style shift, and generative
modeling of Persian poetry.
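The poem-level aggregation and confidence filtering described above can be sketched as follows. The abstract does not specify how votes are weighted or what quantity the threshold applies to, so this sketch assumes that weighted voting sums per-verse class probabilities and that the threshold is applied to the winning author's normalized score. The function names and the labels `poet_a` and `poet_b` are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def majority_vote(verse_probs: list[dict[str, float]]) -> str:
    """Each verse votes for its top author; the most frequent author wins."""
    if not verse_probs:
        raise ValueError("need at least one verse prediction")
    counts: dict[str, int] = defaultdict(int)
    for probs in verse_probs:
        counts[max(probs, key=probs.get)] += 1
    return max(counts, key=counts.get)


def weighted_vote(verse_probs: list[dict[str, float]]) -> tuple[str, float]:
    """Sum per-verse probabilities (assumed weighting); return (author, normalized score)."""
    if not verse_probs:
        raise ValueError("need at least one verse prediction")
    totals: dict[str, float] = defaultdict(float)
    for probs in verse_probs:
        for author, p in probs.items():
            totals[author] += p
    author = max(totals, key=totals.get)
    return author, totals[author] / sum(totals.values())


def predict_with_threshold(
    verse_probs: list[dict[str, float]], threshold: float = 0.9
) -> Optional[str]:
    """Return the weighted-vote author, or None (abstain) if the score is below threshold."""
    author, score = weighted_vote(verse_probs)
    return author if score >= threshold else None


if __name__ == "__main__":
    verses = [
        {"poet_a": 0.6, "poet_b": 0.4},
        {"poet_a": 0.45, "poet_b": 0.55},
        {"poet_a": 0.9, "poet_b": 0.1},
    ]
    print(majority_vote(verses), weighted_vote(verses), predict_with_threshold(verses))
```

Abstaining on low-score poems is what trades coverage for accuracy: raising the threshold keeps only the predictions the model is most confident about.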
[LINK]
http://arxiv.org/abs/2506.21840v1
[DATE]
2025-06-27 09:08:52+08:00
[CATEGORIES]
cs.CL
GenEscape: Hierarchical Multi-Agent Generation of Escape Room Puzzles
[AUTHORS]
Mengyi Shan, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
[ABSTRACT]
We challenge text-to-image models with generating escape room puzzle images
that are visually appealing, logically solid, and intellectually stimulating.
While base image models struggle with spatial relationships and affordance
reasoning, we propose a hierarchical multi-agent framework that decomposes this
task into structured stages: functional design, symbolic scene graph reasoning,
layout synthesis, and local image editing. Specialized agents collaborate
through iterative feedback to ensure the scene is visually coherent and
functionally solvable. Experiments show that agent collaboration improves
output quality in terms of solvability, shortcut avoidance, and affordance
clarity, while maintaining visual quality.
[LINK]
http://arxiv.org/abs/2506.21839v1
[DATE]
2025-06-27 09:08:37+08:00
[CATEGORIES]
cs.CL
Strengthening False Information Propagation Detection: Leveraging SVM and Sophisticated Text Vectorization Techniques in comparison to BERT
[AUTHORS]
Ahmed Akib Jawad Karim, Kazi Hafiz Md Asad, Aznur Azam
[ABSTRACT]
The rapid spread of misinformation, particularly through online platforms,
underscores the urgent need for reliable detection systems. This study explores
the utilization of machine learning and natural language processing,
specifically Support Vector Machines (SVM) and BERT, to detect fake news. We
employ three distinct text vectorization methods for SVM: Term Frequency
Inverse Document Frequency (TF-IDF), Word2Vec, and Bag of Words (BoW),
evaluating their effectiveness in distinguishing between genuine and fake news.
Additionally, we compare these methods against the transformer large language
model, BERT. Our comprehensive approach includes detailed preprocessing steps,
rigorous model implementation, and thorough evaluation to determine the most
effective techniques. The results demonstrate that while BERT achieves superior
accuracy with 99.98% and an F1-score of 0.9998, the SVM model with a linear
kernel and BoW vectorization also performs exceptionally well, achieving 99.81%
accuracy and an F1-score of 0.9980. These findings highlight that, despite
BERT’s superior performance, SVM models with BoW and TF-IDF vectorization
methods come remarkably close, offering highly competitive performance with the
advantage of lower computational requirements.
[COMMENTS]
6 pages, 3 tables and 6 Figures. Submitted to a conference
[LINK]
http://arxiv.org/abs/2411.12703v2
[DATE]
2025-06-27 09:01:44+08:00
[CATEGORIES]
cs.CL
RLSF: Fine-tuning LLMs via Symbolic Feedback
[AUTHORS]
Piyush Jha, Prithwish Jana, Pranavkrishna Suresh, Arnav Arora, Vijay Ganesh
[ABSTRACT]
Large Language Models (LLMs) have transformed AI but often struggle with
tasks that require domain-specific reasoning and logical alignment. Traditional
fine-tuning methods do not leverage the vast amount of symbolic
domain-knowledge available to us via symbolic reasoning tools (e.g., provers),
and are further limited by sparse rewards and unreliable reward models.
We introduce Reinforcement Learning via Symbolic Feedback (RLSF), a novel
fine-tuning paradigm where symbolic reasoning tools (e.g., solvers, provers,
and algebra systems) provide fine-grained feedback to LLMs. RLSF uses
poly-sized certificates (e.g., proofs) generated by symbolic tools to identify
and correct errors in model outputs, offering token-level guidance without
requiring differentiable reasoning systems. This paradigm bridges the gap
between symbolic reasoning and LLM fine-tuning, enabling precise alignment with
domain-specific constraints while addressing key limitations of traditional
reward signals.
Via extensive evaluations, we show that our RLSF-based fine-tuning of LLMs
outperforms traditional approaches on five different applications (that have
some associated logical or domain constraints), namely, program synthesis from
natural language pseudo-code to programming language, three chemistry tasks,
and solving the Game of 24. A key takeaway is that fine-tuning via RLSF enables
relatively smaller LLMs to significantly outperform closed-source models that
are orders of magnitude larger.
[LINK]
http://arxiv.org/abs/2405.16661v3
[DATE]
2025-06-27 08:16:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Exploring the Structure of AI-Induced Language Change in Scientific English
[AUTHORS]
Riley Galpin, Bryce Anderson, Tom S. Juzek
[ABSTRACT]
Scientific English has undergone rapid and unprecedented changes in recent
years, with words such as “delve,” “intricate,” and “crucial” showing
significant spikes in frequency since around 2022. These changes are widely
attributed to the growing influence of Large Language Models like ChatGPT in
the discourse surrounding bias and misalignment. However, apart from changes in
frequency, the exact structure of these linguistic shifts has remained unclear.
The present study addresses this and investigates whether these changes involve
the replacement of synonyms by suddenly ‘spiking words,’ for example, “crucial”
replacing “essential” and “key,” or whether they reflect broader semantic and
pragmatic qualifications. To further investigate structural changes, we include
part of speech tagging in our analysis to quantify linguistic shifts over
grammatical categories and differentiate between word forms, like “potential”
as a noun vs. as an adjective. We systematically analyze synonym groups for
widely discussed ‘spiking words’ based on frequency trends in scientific
abstracts from PubMed. We find that entire semantic clusters often shift
together, with most or all words in a group increasing in usage. This pattern
suggests that changes induced by Large Language Models are primarily semantic
and pragmatic rather than purely lexical. Notably, the adjective “important”
shows a significant decline, which prompted us to systematically analyze
decreasing lexical items. Our analysis of “collapsing” words reveals a more
complex picture, which is consistent with organic language change and contrasts
with the patterns of the abrupt spikes. These insights into the structure of
language change contribute to our understanding of how language technology
continues to shape human language.
[COMMENTS]
Accepted and published at FLAIRS 38. 8 pages, 4 figures, 1 table.
Licensed under CC BY-NC-SA 4.0
[LINK]
http://arxiv.org/abs/2506.21817v1
[DATE]
2025-06-27 07:44:24+08:00
[CATEGORIES]
cs.CL
Towards Transparent AI: A Survey on Explainable Large Language Models
[AUTHORS]
Avash Palikhe, Zhenyu Yu, Zichong Wang, Wenbin Zhang
[ABSTRACT]
Large Language Models (LLMs) have played a pivotal role in advancing
Artificial Intelligence (AI). However, despite their achievements, LLMs often
struggle to explain their decision-making processes, making them a ‘black box’
and presenting a substantial challenge to explainability. This lack of
transparency poses a significant obstacle to the adoption of LLMs in
high-stakes domain applications, where interpretability is particularly
essential. To overcome these limitations, researchers have developed various
explainable artificial intelligence (XAI) methods that provide
human-interpretable explanations for LLMs. However, a systematic understanding
of these methods remains limited. To address this gap, this survey provides a
comprehensive review of explainability techniques by categorizing XAI methods
based on the underlying transformer architectures of LLMs: encoder-only,
decoder-only, and encoder-decoder models. Then these techniques are examined in
terms of their evaluation for assessing explainability, and the survey further
explores how these explanations are leveraged in practical applications.
Finally, it discusses available resources, ongoing research challenges, and
future directions, aiming to guide continued efforts toward developing
transparent and responsible LLMs.
[LINK]
http://arxiv.org/abs/2506.21812v1
[DATE]
2025-06-27 07:25:22+08:00
[CATEGORIES]
cs.CL
Offensive Language Detection on Social Media Using XLNet
[AUTHORS]
Reem Alothman, Hafida Benhidour, Said Kerrache
[ABSTRACT]
The widespread use of text-based communication on social media, through chats,
comments, and microblogs, has improved user interaction but has also led to an
increase in offensive content, including hate speech, racism, and other forms
of abuse. Due to the enormous volume of user-generated content, manual
moderation is impractical, which creates a need for automated systems that can
detect offensive language. Deep learning models, particularly those using
transfer learning, have demonstrated significant success in understanding
natural language through large-scale pretraining. In this study, we propose an
automatic offensive language detection model based on XLNet, a generalized
autoregressive pretraining method, and compare its performance with BERT
(Bidirectional Encoder Representations from Transformers), which is a widely
used baseline in natural language processing (NLP). Both models are evaluated
using the Offensive Language Identification Dataset (OLID), a benchmark Twitter
dataset that includes hierarchical annotations. Our experimental results show
that XLNet outperforms BERT in detecting offensive content and in categorizing
the types of offenses, while BERT performs slightly better in identifying the
targets of the offenses. Additionally, we find that oversampling and
undersampling strategies are effective in addressing class imbalance and
improving classification performance. These findings highlight the potential of
transfer learning and XLNet-based architectures to create robust systems for
detecting offensive language on social media platforms.
[LINK]
http://arxiv.org/abs/2506.21795v1
[DATE]
2025-06-27 06:37:35+08:00
[CATEGORIES]
cs.CL
cs.LG
Evaluating List Construction and Temporal Understanding capabilities of Large Language Models
[AUTHORS]
Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand
[ABSTRACT]
Large Language Models (LLMs) have demonstrated immense advances in a wide
range of natural language tasks. However, these models are susceptible to
hallucinations and errors, particularly on temporal understanding tasks
involving multiple entities in answers. In such tasks, they fail to associate
entities with accurate time intervals, generate a complete list of entities in
answers or reason about events associated with specific temporal bounds.
Existing works do not extensively evaluate the abilities of the model to
perform implicit and explicit temporal understanding in a list answer
construction setup. To bridge this gap, we propose the Time referenced List
based Question Answering or TLQA benchmark that requires structured answers in
list format aligned with corresponding time periods. Our TLQA benchmark
requires both list construction and temporal understanding simultaneously,
which to the best of our knowledge has not been explored in prior benchmarks.
We investigate the temporal understanding and list construction capabilities of
state-of-the-art generative models on TLQA in closed-book and open-domain
settings. Our findings reveal significant shortcomings in current models,
particularly their inability to provide complete answers and temporally align
facts in a closed-book setup and the need to improve retrieval in an open-domain
setup, providing clear future directions for research on TLQA. The benchmark
and code are available at https://github.com/elixir-research-group/TLQA.
[COMMENTS]
Accepted at ICTIR 2025 co-located with SIGIR 2025, 11 pages
[LINK]
http://arxiv.org/abs/2506.21783v1
[DATE]
2025-06-27 05:40:58+08:00
[CATEGORIES]
cs.CL
Are Triggers Needed for Document-Level Event Extraction?
[AUTHORS]
Shaden Shaar, Wayne Chen, Maitreyi Chatterjee, Barry Wang, Wenting Zhao, Claire Cardie
[ABSTRACT]
Most existing work on event extraction has focused on sentence-level texts
and presumes the identification of a trigger-span – a word or phrase in the
input that evokes the occurrence of an event of interest. Event arguments are
then extracted with respect to the trigger. Indeed, triggers are treated as
integral to, and trigger detection as an essential component of, event
extraction. In this paper, we provide the first investigation of the role of
triggers for the more difficult and much less studied task of document-level
event extraction. We analyze their usefulness in multiple end-to-end and
pipelined transformer-based event extraction models for three document-level
event extraction datasets, measuring performance using triggers of varying
quality (human-annotated, LLM-generated, keyword-based, and random). We find
that whether or not systems benefit from explicitly extracting triggers depends
both on dataset characteristics (i.e. the typical number of events per
document) and task-specific information available during extraction (i.e.
natural language event schemas). Perhaps surprisingly, we also observe that the
mere existence of triggers in the input, even random ones, is important for
prompt-based in-context learning approaches to the task.
[LINK]
http://arxiv.org/abs/2411.08708v2
[DATE]
2025-06-27 05:13:38+08:00
[CATEGORIES]
cs.CL
Identifying Speaker Information in Feed-Forward Layers of Self-Supervised Speech Transformers
[AUTHORS]
Tzu-Quan Lin, Hsi-Chun Cheng, Hung-yi Lee, Hao Tang
[ABSTRACT]
In recent years, the impact of self-supervised speech Transformers has
extended to speaker-related applications. However, little research has explored
how these models encode speaker information. In this work, we address this gap
by identifying neurons in the feed-forward layers that are correlated with
speaker information. Specifically, we analyze neurons associated with k-means
clusters of self-supervised features and i-vectors. Our analysis reveals that
these clusters correspond to broad phonetic and gender classes, making them
suitable for identifying neurons that represent speakers. By protecting these
neurons during pruning, we can significantly preserve performance on
speaker-related tasks, demonstrating their crucial role in encoding speaker
information.
[LINK]
http://arxiv.org/abs/2506.21712v1
[DATE]
2025-06-27 02:54:26+08:00
[CATEGORIES]
cs.CL
End-to-End Long Document Summarization using Gradient Caching
[AUTHORS]
Rohit Saxena, Hao Tang, Frank Keller
[ABSTRACT]
Training transformer-based encoder-decoder models for long document
summarization poses a significant challenge due to the quadratic memory
consumption during training. Several approaches have been proposed to extend
the input length at test time, but training with these approaches is still
difficult, requiring truncation of input documents and causing a mismatch
between training and test conditions. In this work, we propose CachED (Gradient
Caching for Encoder-Decoder models), an
approach that enables end-to-end training of existing transformer-based
encoder-decoder models, using the entire document without truncation.
Specifically, we apply non-overlapping sliding windows to input documents,
followed by fusion in the decoder. During backpropagation, the gradients are cached
at the decoder and are passed through the encoder in chunks by re-computing the
hidden vectors, similar to gradient checkpointing. In the experiments on long
document summarization, we extend BART to CachED BART, processing more than
500K tokens during training and achieving superior performance without using
any additional parameters.
[COMMENTS]
Accepted to Transactions of the Association for Computational
Linguistics (TACL 2025); Pre MIT Press version
[LINK]
http://arxiv.org/abs/2501.01805v2
[DATE]
2025-06-27 02:40:55+08:00
[CATEGORIES]
cs.CL
Introducing MAPO: Momentum-Aided Gradient Descent Prompt Optimization
[AUTHORS]
Anthony Cui, Pranav Nandyalam, Andrew Rufail, Ethan Cheung, Aiden Lei, Kevin Zhu, Sean O’Brien
[COMMENTS]
Accepted to NAACL SRW 2025. A few revisions since last version
[LINK]
http://arxiv.org/abs/2410.19499v3
[DATE]
2025-06-27 02:40:26+08:00
[CATEGORIES]
cs.CL
ANUBHUTI: A Comprehensive Corpus For Sentiment Analysis In Bangla Regional Languages
[AUTHORS]
Swastika Kundu, Autoshi Ibrahim, Mithila Rahman, Tanvir Ahmed
[ABSTRACT]
Sentiment analysis for regional dialects of Bangla remains an underexplored
area due to linguistic diversity and limited annotated data. This paper
introduces ANUBHUTI, a comprehensive dataset consisting of 2000 sentences
manually translated from standard Bangla into four major regional dialects:
Mymensingh, Noakhali, Sylhet, and Chittagong. The dataset predominantly
features political and religious content, reflecting the contemporary
sociopolitical landscape of Bangladesh, alongside neutral texts to maintain balance.
Each sentence is annotated using a dual annotation scheme: multiclass thematic
labeling categorizes sentences as Political, Religious, or Neutral, and
multilabel emotion annotation assigns one or more emotions from Anger,
Contempt, Disgust, Enjoyment, Fear, Sadness, and Surprise. Expert native
translators conducted the translation and annotation, with quality assurance
performed via Cohen’s Kappa inter-annotator agreement, achieving strong
consistency across dialects. The dataset was further refined through systematic
checks for missing data, anomalies, and inconsistencies. ANUBHUTI fills a
critical gap in resources for sentiment analysis in low-resource Bangla
dialects, enabling more accurate and context-aware natural language processing.
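For reference, the inter-annotator agreement statistic mentioned above can be computed with a short sketch. It applies Cohen's kappa to single-label (multiclass) annotations from two annotators. How the authors handled the multilabel emotion annotations is not stated here, so this example covers only the thematic labels, and the sample label sequences are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Illustrative labels only; not taken from the dataset.
    annotator_1 = ["Political", "Religious", "Neutral", "Political"]
    annotator_2 = ["Political", "Religious", "Neutral", "Neutral"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for the agreement expected by chance, so a value of 0 means the annotators agree no more often than chance and a value of 1 means perfect agreement.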
[LINK]
http://arxiv.org/abs/2506.21686v1
[DATE]
2025-06-27 02:13:54+08:00
[CATEGORIES]
cs.CL
cs.LG
Do We Really Need GNNs with Explicit Structural Modeling? MLPs Suffice for Language Model Representations
[AUTHORS]
Li Zhou, Hao Jiang, Junjie Li, Zefeng Zhao, Feng Jiang, Wenyu Chen, Haizhou Li
[ABSTRACT]
Explicit structural information has been proven to be encoded by Graph Neural
Networks (GNNs), serving as auxiliary knowledge to enhance model capabilities
and improve performance in downstream NLP tasks. However, recent studies
indicate that GNNs fail to fully utilize structural information, whereas
Multi-Layer Perceptrons (MLPs), despite lacking the message-passing mechanisms
inherent to GNNs, exhibit a surprising ability in structure-aware tasks.
Motivated by these findings, this paper introduces a comprehensive probing
framework from an information-theoretic perspective. The framework is designed
to systematically assess the role of explicit structural modeling in enhancing
language model (LM) representations and to investigate the potential of MLPs as
efficient and scalable alternatives to GNNs. We extend traditional probing
classifiers by incorporating a control module that allows for selective use of
either the full GNN model or its decoupled components, specifically, the
message-passing and feature-transformation operations. This modular approach
isolates and assesses the individual contributions of these operations,
avoiding confounding effects from the complete GNN architecture. Using the Edge
Probing Suite, a diagnostic tool for evaluating the linguistic knowledge
encoded in LMs, we find that MLPs, when used as feature-transformation modules,
consistently improve the linguistic knowledge captured in LM representations
across different architectures. They effectively encode both syntactic and
semantic patterns. Similarly, GNNs that incorporate feature-transformation
operations show beneficial effects. In contrast, models that rely solely on
message-passing operations tend to underperform, often leading to negative
impacts on probing task performance.
[COMMENTS]
Graph Neural Networks, Multi-Layer Perceptrons, Explicit Structural
Modeling, Probing Classifier
[LINK]
http://arxiv.org/abs/2506.21682v1
[DATE]
2025-06-27 02:10:28+08:00
[CATEGORIES]
cs.CL
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
[AUTHORS]
Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
[ABSTRACT]
Current Vision-Language Models (VLMs) struggle with fine-grained spatial
reasoning, particularly when multi-step logic and precise spatial alignment are
required. In this work, we introduce SpatialReasoner-R1, a vision-language
reasoning model designed to address these limitations. To construct
high-quality supervision for spatial reasoning, we design a Multi-Model Monte
Carlo Tree Search (M3CTS) method that generates diverse, logically consistent
Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose
fine-grained Direct Preference Optimization (fDPO), which introduces
segment-specific preference granularity for descriptive grounding and logical
reasoning, guided by a spatial reward mechanism that evaluates candidate
responses based on visual consistency, spatial grounding, and logical
coherence. Experimental results demonstrate that fDPO achieves an average
improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0%
gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a
new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in
average accuracy, while maintaining competitive performance on general
vision-language tasks.
[COMMENTS]
29 pages
[LINK]
http://arxiv.org/abs/2506.21656v1
[DATE]
2025-06-27 02:00:00+08:00
[CATEGORIES]
cs.CL
Data Efficacy for Language Model Training
[AUTHORS]
Yalun Dai, Yangyu Huang, Xin Zhang, Wenshan Wu, Chong Li, Wenhui Lu, Shijie Cao, Li Dong, Scarlett Li
[ABSTRACT]
Data is fundamental to the training of language models (LM). Recent research
has been dedicated to data efficiency, which aims to maximize performance by
selecting a minimal or optimal subset of training data. Techniques such as data
filtering, sampling, and selection play a crucial role in this area. To
complement it, we define Data Efficacy, which focuses on maximizing performance
by optimizing the organization of training data and remains relatively
underexplored. This work introduces a general paradigm, DELT, for considering
data efficacy in LM training, which highlights the significance of training
data organization. DELT comprises three components: Data Scoring, Data
Selection, and Data Ordering. Among these components, we design
Learnability-Quality Scoring (LQS), as a new instance of Data Scoring, which
considers both the learnability and quality of each data sample from the
gradient consistency perspective. We also devise Folding Ordering (FO), as a
novel instance of Data Ordering, which addresses issues such as model
forgetting and data distribution bias. Comprehensive experiments validate the
data efficacy in LM training, which demonstrates the following: Firstly,
various instances of the proposed DELT enhance LM performance to varying
degrees without increasing the data scale and model size. Secondly, among these
instances, the combination of our proposed LQS for data scoring and Folding for
data ordering achieves the most significant improvement. Lastly, data efficacy
can be achieved together with data efficiency by applying data selection.
Therefore, we believe that data efficacy is a promising foundational area in LM
training.
[LINK]
http://arxiv.org/abs/2506.21545v1
[DATE]
2025-06-27 01:59:07+08:00
[CATEGORIES]
cs.CL
cs.LG
“What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
[AUTHORS]
Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
[ABSTRACT]
People are increasingly seeking healthcare information from large language
models (LLMs) via interactive chatbots, yet the nature and inherent risks of
these conversations remain largely unexplored. In this paper, we filter
large-scale conversational AI datasets to obtain HealthChat-11K, a curated
dataset of 11K real-world conversations composed of 25K user messages. We use
HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs
when seeking healthcare information in order to systematically study user
interactions across 21 distinct health specialties. Our analysis reveals
insights into the nature of how and why users seek health information, such as
common interactions, instances of incomplete context, affective behaviors, and
interactions (e.g., leading questions) that can induce sycophancy, underscoring
the need for improvements in the healthcare support capabilities of LLMs
deployed as conversational AI. Code and artifacts to retrieve our analyses and
combine them into a curated dataset can be found here:
https://github.com/yahskapar/HealthChat
[COMMENTS]
25 pages, 6 figures, 4 tables, corresponds to initial HealthChat-11K
dataset release
[LINK]
http://arxiv.org/abs/2506.21532v1
[DATE]
2025-06-27 01:52:18+08:00
[CATEGORIES]
cs.CL
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
[AUTHORS]
Chester Palen-Michel, Maxwell Pickering, Maya Kruse, Jonne Sälevä, Constantine Lignos
[ABSTRACT]
We present OpenNER 1.0, a standardized collection of openly-available named
entity recognition (NER) datasets. OpenNER contains 36 NER corpora that span 52
languages, human-annotated in varying named entity ontologies. We correct
annotation format issues, standardize the original datasets into a uniform
representation with consistent entity type names across corpora, and provide
the collection in a structure that enables research in multilingual and
multi-ontology NER. We provide baseline results using three pretrained
multilingual language models and two large language models to compare the
performance of recent models and facilitate future research in NER. We find
that no single model is best in all languages and that significant work remains
to obtain high performance from LLMs on the NER task.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2412.09587v2
[DATE]
2025-06-27 01:51:40+08:00
[CATEGORIES]
cs.CL
skLEP: A Slovak General Language Understanding Benchmark
[AUTHORS]
Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
[COMMENTS]
ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.21508v1
[DATE]
2025-06-27 01:35:04+08:00
[CATEGORIES]
cs.CL
cs.LG
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
[AUTHORS]
Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
[ABSTRACT]
Agentic search such as Deep Research systems, where large language models
autonomously browse the web, synthesize information, and return comprehensive
citation-backed answers, represents a major shift in how users interact with
web-scale information. While promising greater efficiency and cognitive
offloading, the growing complexity and open-endedness of agentic search have
outpaced existing evaluation benchmarks and methodologies, which largely assume
short search horizons and static answers. In this paper, we introduce Mind2Web
2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that
require real-time web browsing and extensive information synthesis, constructed
with over 1,000 hours of human labor. To address the challenge of evaluating
time-varying and complex answers, we propose a novel Agent-as-a-Judge
framework. Our method constructs task-specific judge agents based on a
tree-structured rubric design to automatically assess both answer correctness
and source attribution. We conduct a comprehensive evaluation of nine frontier
agentic search systems and human performance, along with a detailed error
analysis to draw insights for future development. The best-performing system,
OpenAI Deep Research, can already achieve 50-70% of human performance while
spending half the time, showing great potential. Altogether, Mind2Web 2
provides a rigorous foundation for developing and benchmarking the next
generation of agentic search systems.
[COMMENTS]
Project Homepage: https://osu-nlp-group.github.io/Mind2Web2/
[LINK]
http://arxiv.org/abs/2506.21506v1
[DATE]
2025-06-27 01:32:50+08:00
[CATEGORIES]
cs.CL
Enhancing User Engagement in Socially-Driven Dialogue through Interactive LLM Alignments
[AUTHORS]
Jiashuo Wang, Kaitao Song, Chunpu Xu, Changhe Song, Yang Xiao, Dongsheng Li, Lili Qiu, Wenjie Li
[ABSTRACT]
Enhancing user engagement through interactions plays an essential role in
socially-driven dialogues. While prior works have optimized models to reason
over relevant knowledge or plan a dialogue act flow, the relationship between
user engagement and knowledge or dialogue acts is subtle and does not guarantee
user engagement in socially-driven dialogues. To this end, we enable
interactive LLMs to learn user engagement by leveraging signals from the future
development of conversations. Specifically, we adopt a more direct and relevant
indicator of user engagement, i.e., the user’s reaction related to dialogue
intention after the interaction, as a reward to align interactive LLMs. To
achieve this, we develop a user simulator to interact with target interactive
LLMs and explore interactions between the user and the interactive LLM system
via i×MCTS (Monte Carlo Tree Search for interaction). In this way, we collect
a dataset containing pairs of higher and lower-quality experiences using
i×MCTS, and align interactive LLMs for high-level user
engagement by direct preference optimization (DPO) accordingly. Experiments
conducted on two socially-driven dialogue scenarios (emotional support
conversations and persuasion for good) demonstrate that our method effectively
enhances user engagement in interactive LLMs.
[LINK]
http://arxiv.org/abs/2506.21497v1
[DATE]
2025-06-27 01:26:17+08:00
[CATEGORIES]
cs.CL
Bridging Offline and Online Reinforcement Learning for LLMs
[AUTHORS]
Jack Lanchantin, Angelica Chen, Janice Lan, Xian Li, Swarnadeep Saha, Tianlu Wang, Jing Xu, Ping Yu, Weizhe Yuan, Jason E Weston, Sainbayar Sukhbaatar, Ilia Kulikov
[ABSTRACT]
We investigate the effectiveness of reinforcement learning methods for
finetuning large language models when transitioning from offline to semi-online
to fully online regimes for both verifiable and non-verifiable tasks. Our
experiments cover training on verifiable math as well as non-verifiable
instruction following with a set of benchmark evaluations for both. Across
these settings, we extensively compare online and semi-online Direct Preference
Optimization and Group Relative Policy Optimization objectives, and surprisingly
find similar performance and convergence between these variants, which all
strongly outperform offline methods. We provide a detailed analysis of the
training dynamics and hyperparameter selection strategies to achieve optimal
results. Finally, we show that multi-tasking with verifiable and non-verifiable
rewards jointly yields improved performance across both task types.
[LINK]
http://arxiv.org/abs/2506.21495v1
[DATE]
2025-06-27 01:25:49+08:00
[CATEGORIES]
cs.CL
Prompting with Phonemes: Enhancing LLMs’ Multilinguality for Non-Latin Script Languages
[AUTHORS]
Hoang H Nguyen, Khyati Mahajan, Vikas Yadav, Julian Salazar, Philip S. Yu, Masoud Hashemi, Rishabh Maheshwary
[ABSTRACT]
Although multilingual LLMs have achieved remarkable performance across
benchmarks, we find they continue to underperform on non-Latin script languages
across contemporary LLM families. This discrepancy arises from the fact that
LLMs are pretrained with orthographic scripts, which are dominated by Latin
characters that obscure their shared phonology with non-Latin scripts. We
propose leveraging phonemic transcriptions as complementary signals to induce
script-invariant representations. Our study demonstrates that integrating
phonemic signals improves performance across both non-Latin and Latin script
languages, with a particularly significant impact on closing the performance
gap between the two. Through detailed experiments, we show that phonemic and
orthographic scripts retrieve distinct examples for in-context learning (ICL).
This motivates our proposed Mixed-ICL retrieval strategy, where aggregating
examples from both leads to significant performance improvements for
both Latin script languages (up to 12.6%) and non-Latin script languages (up to
15.1%) compared to randomized ICL retrieval.
[COMMENTS]
Accepted to NAACL 2025 (Main Conference). This version contains minor
improvements to the camera-ready
[LINK]
http://arxiv.org/abs/2411.02398v3
[DATE]
2025-06-27 01:22:53+08:00
[CATEGORIES]
cs.CL
cs.LG
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
[AUTHORS]
Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
[ABSTRACT]
Information retrieval is a cornerstone of modern knowledge acquisition,
enabling billions of queries each day across diverse domains. However,
traditional keyword-based search engines are increasingly inadequate for
handling complex, multi-step information needs. Our position is that Large
Language Models (LLMs), endowed with reasoning and agentic capabilities, are
ushering in a new paradigm termed Agentic Deep Research. These systems
transcend conventional information search techniques by tightly integrating
autonomous reasoning, iterative retrieval, and information synthesis into a
dynamic feedback loop. We trace the evolution from static web search to
interactive, agent-based systems that plan, explore, and learn. We also
introduce a test-time scaling law to formalize the impact of computational
depth on reasoning and search. Supported by benchmark results and the rise of
open-source implementations, we demonstrate that Agentic Deep Research not only
significantly outperforms existing approaches, but is also poised to become the
dominant paradigm for future information seeking. All the related resources,
including industry products, research papers, benchmark datasets, and
open-source implementations, are collected for the community in
https://github.com/DavidZWZ/Awesome-Deep-Research.
[LINK]
http://arxiv.org/abs/2506.18959v2
[DATE]
2025-06-27 01:18:00+08:00
[CATEGORIES]
cs.CL
cs.LG
Logios : An open source Greek Polytonic Optical Character Recognition system
[AUTHORS]
Perifanos Konstantinos, Goutsos Dionisis
[ABSTRACT]
In this paper, we present an Optical Character Recognition (OCR) system
specifically designed for the accurate recognition and digitization of Greek
polytonic texts. By leveraging the combined strengths of convolutional layers
for feature extraction and recurrent layers for sequence learning, our system
addresses the unique challenges posed by Greek polytonic scripts. This approach
aims to overcome the limitations of traditional OCR methods, offering
significant improvements in accuracy and efficiency. We release the underlying
model as an open-source library and make our OCR platform available for
academic use.
[LINK]
http://arxiv.org/abs/2506.21474v1
[DATE]
2025-06-27 01:04:27+08:00
[CATEGORIES]
cs.CL
TopK Language Models
[AUTHORS]
Ryosuke Takahashi, Tatsuro Inaba, Kentaro Inui, Benjamin Heinzerling
[ABSTRACT]
Sparse autoencoders (SAEs) have become an important tool for analyzing and
interpreting the activation space of transformer-based language models (LMs).
However, SAEs suffer from several shortcomings that diminish their utility and
internal validity. Since SAEs are trained post-hoc, it is unclear if the
failure to discover a particular concept is a failure on the SAE’s side or due
to the underlying LM not representing this concept. This problem is exacerbated
by training conditions and architecture choices affecting which features an SAE
learns. When tracing how LMs learn concepts during training, the lack of
feature stability also makes it difficult to compare SAE features across
different checkpoints. To address these limitations, we introduce a
modification to the transformer architecture that incorporates a TopK
activation function at chosen layers, making the model’s hidden states
equivalent to the latent features of a TopK SAE. This approach eliminates the
need for post-hoc training while providing interpretability comparable to SAEs.
The resulting TopK LMs offer a favorable trade-off between model size,
computational efficiency, and interpretability. Despite this simple
architectural change, TopK LMs maintain their original capabilities while
providing robust interpretability benefits. Our experiments demonstrate that
the sparse representations learned by TopK LMs enable successful steering
through targeted neuron interventions and facilitate detailed analysis of
neuron formation processes across checkpoints and layers. These features make
TopK LMs stable and reliable tools for understanding how language models learn
and represent concepts, which we believe will significantly advance future
research on model interpretability and controllability.
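
As a rough illustration of the TopK activation described above, the sketch below keeps the k largest entries of a hidden vector and zeroes the rest. It is a minimal pure-Python sketch, not the authors' implementation: the function name `topk_activation`, selection by raw value (rather than by magnitude or after a ReLU), and tie-breaking by lower index are all assumptions made here.

```python
from typing import List


def topk_activation(hidden: List[float], k: int) -> List[float]:
    """Keep the k largest entries of `hidden` and set all others to zero.

    Assumption: selection is by raw value; ties go to the lower index.
    """
    if k <= 0:
        return [0.0] * len(hidden)
    if k >= len(hidden):
        return list(hidden)
    # Sort indices by descending value, then ascending index, and keep the first k.
    order = sorted(range(len(hidden)), key=lambda i: (-hidden[i], i))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(hidden)]


# Only the two largest activations survive; the rest become exactly zero.
sparse = topk_activation([0.3, -1.2, 2.5, 0.9, 0.1], k=2)
print(sparse)
```

Because at most k entries are non-zero, each surviving coordinate can be inspected like a latent feature of a TopK SAE, which is the property the abstract relies on.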
[LINK]
http://arxiv.org/abs/2506.21468v1
[DATE]
2025-06-27 00:56:43+08:00
[CATEGORIES]
cs.CL
Aligning Spoken Dialogue Models from User Interactions
[AUTHORS]
Anne Wu, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez
[ABSTRACT]
We propose a novel preference alignment framework for improving spoken
dialogue models on real-time conversations from user interactions. Current
preference learning methods primarily focus on text-based language models, and
are not directly suited to the complexities of real-time speech interactions,
with richer dynamics (e.g. interruption, interjection) and no explicit
segmentation between speaker turns. We create a large-scale dataset of more than
150,000 preference pairs from raw multi-turn speech conversations, annotated
with AI feedback, to cover preferences over both linguistic content and
temporal context variations. We leverage offline alignment methods to finetune
a full-duplex autoregressive speech-to-speech model. Extensive experiments
demonstrate that feedback on generic conversations can be consistently
effective in improving spoken dialogue models to produce more factual, safer
and more contextually aligned interactions. We deploy the finetuned model and
conduct holistic human evaluations to assess the impact beyond single-turn
conversations. Our findings shed light on the importance of a well-calibrated
balance among various dynamics, crucial for natural real-time speech dialogue
systems.
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2506.21463v1
[DATE]
2025-06-27 00:45:20+08:00
[CATEGORIES]
cs.CL
cs.LG
Spatial Mental Modeling from Limited Views
[AUTHORS]
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
[ABSTRACT]
Can Vision Language Models (VLMs) imagine the full scene from just a few
views, like humans do? Humans form spatial mental models, internal
representations of unseen space, to reason about layout, perspective, and
motion. Our new MindCube benchmark with 21,154 questions across 3,268 images
exposes this critical gap, where existing VLMs exhibit near-random performance.
Using MindCube, we systematically evaluate how well VLMs build robust spatial
mental models through representing positions (cognitive mapping), orientations
(perspective-taking), and dynamics (mental simulation for “what-if” movements).
We then explore three approaches to help VLMs approximate spatial mental
models, including unseen intermediate views, natural language reasoning chains,
and cognitive maps. The most significant improvement comes from a synergistic
approach, “map-then-reason”, that jointly trains the model to first generate a
cognitive map and then reason upon it. By training models to reason over these
internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding
reinforcement learning pushed performance even further to 70.7% (+32.9%). Our
key insight is that such scaffolding of spatial mental models, actively
constructing and utilizing internal structured spatial representations with
flexible reasoning processes, significantly improves understanding of
unobservable space.
[COMMENTS]
Preprint version
[LINK]
http://arxiv.org/abs/2506.21458v1
[DATE]
2025-06-27 00:38:19+08:00
[CATEGORIES]
cs.CL
Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
[AUTHORS]
Makbule Gulcin Ozsoy, William Tai
[ABSTRACT]
Recent advances in large language models have enabled natural language
interfaces that translate user questions into database queries, such as
Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database
accessibility, most research today focuses solely on English, with limited
evaluation in other languages. This paper investigates the performance of
foundational LLMs on the Text2Cypher task across multiple languages. We create
and release a multilingual test set by translating English questions into
Spanish and Turkish while preserving the original Cypher queries, enabling fair
cross-lingual comparison. We evaluate multiple foundational models using
standardized prompts and metrics. Our results show a consistent performance
pattern: highest on English, then Spanish, and lowest on Turkish. We attribute
this to differences in training data availability and linguistic
characteristics. Additionally, we explore the impact of translating task
prompts into Spanish and Turkish. Results show little to no change in
evaluation metrics, suggesting prompt translation has minor impact. Our
findings highlight the need for more inclusive evaluation and development in
multilingual query generation. Future work includes schema localization and
fine-tuning across diverse languages.
[LINK]
http://arxiv.org/abs/2506.21445v1
[DATE]
2025-06-27 00:31:10+08:00
[CATEGORIES]
cs.CL
Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection
[AUTHORS]
Ali Şenol, Garima Agrawal, Huan Liu
[ABSTRACT]
Detecting deceptive conversations on dynamic platforms is increasingly
difficult due to evolving language patterns and Concept Drift (CD), i.e.,
semantic or topical shifts that alter the context or intent of interactions
over time. These shifts can obscure malicious intent or mimic normal dialogue,
making accurate classification challenging. While Large Language Models (LLMs)
show strong performance in natural language tasks, they often struggle with
contextual ambiguity and hallucinations in risk-sensitive scenarios. To address
these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework
that integrates pretrained LLMs with structured, task-specific insights to
perform fraud and concept drift detection. The proposed architecture consists
of three main components: (1) a DK-LLM module to detect fake or deceptive
conversations; (2) a drift detection unit (OCDD) to determine whether a
semantic shift has occurred; and (3) a second DK-LLM module to classify the
drift as either benign or fraudulent. We first validate the value of domain
knowledge using a fake review dataset and then apply our full framework to
SEConvo, a multiturn dialogue dataset that includes various types of fraud and
spam attacks. Results show that our system detects fake conversations with high
accuracy and effectively classifies the nature of drift. Guided by structured
prompts, the LLaMA-based implementation achieves 98% classification accuracy.
Comparative studies against zero-shot baselines demonstrate that incorporating
domain knowledge and drift awareness significantly improves performance,
interpretability, and robustness in high-stakes NLP applications.
[LINK]
http://arxiv.org/abs/2506.21443v1
[DATE]
2025-06-27 00:29:45+08:00
[CATEGORIES]
cs.CL
Rethinking LLM Training through Information Geometry and Quantum Metrics
[AUTHORS]
Riccardo Di Sipio
[ABSTRACT]
Optimization in large language models (LLMs) unfolds over high-dimensional
parameter spaces with non-Euclidean structure. Information geometry frames this
landscape using the Fisher information metric, enabling more principled
learning via natural gradient descent. Though often impractical, this geometric
lens clarifies phenomena such as sharp minima, generalization, and observed
scaling laws. We argue that curvature-aware approaches deepen our understanding
of LLM training. Finally, we speculate on quantum analogies based on the
Fubini-Study metric and Quantum Fisher Information, hinting at efficient
optimization in quantum-enhanced systems.
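
For reference, natural gradient descent under the Fisher information metric is usually written as below. This is the standard textbook form, not a formula taken from the paper; the symbols $\theta$, $\eta$, $\mathcal{L}$ and $p_\theta$ are notation chosen here.

```latex
F(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[
  \nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^{\top}
\right],
\qquad
\theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1}\, \nabla_\theta \mathcal{L}(\theta_t)
```

Forming and inverting $F$ costs at least quadratic memory in the number of parameters, which is one reason the abstract calls the approach often impractical at LLM scale.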
[COMMENTS]
9 pages, 1 figure(s)
[LINK]
http://arxiv.org/abs/2506.15830v2
[DATE]
2025-06-27 00:14:42+08:00
[CATEGORIES]
cs.CL
Scalable Hypergraph Structure Learning with Diverse Smoothness Priors
[AUTHORS]
Benjamin T. Brown, Haoxiang Zhang, Daniel L. Lau, Gonzalo R. Arce
[ABSTRACT]
In graph signal processing, learning the weighted connections between nodes
from a set of sample signals is a fundamental task when the underlying
relationships are not known a priori. This task is typically addressed by
finding a graph Laplacian on which the observed signals are smooth. With the
extension of graphs to hypergraphs - where edges can connect more than two
nodes - graph learning methods have similarly been generalized to hypergraphs.
However, the absence of a unified framework for calculating total variation has
led to divergent definitions of smoothness and, consequently, differing
approaches to hyperedge recovery. We confront this challenge through
generalization of several previously proposed hypergraph total variations,
subsequently allowing ease of substitution into a vector-based optimization. To
this end, we propose a novel hypergraph learning method that recovers a
hypergraph topology from time-series signals based on a smoothness prior. Our
approach, designated as Hypergraph Structure Learning with Smoothness (HSLS),
addresses key limitations in prior works, such as hyperedge selection and
convergence issues, by formulating the problem as a convex optimization solved
via a forward-backward-forward algorithm, which guarantees convergence.
Additionally, we introduce a process that simultaneously limits the span of the
hyperedge search and maintains a valid hyperedge selection set. In doing so,
our method becomes scalable in increasingly complex network structures. The
experimental results demonstrate improved performance, in terms of accuracy,
over other state-of-the-art hypergraph inference methods; furthermore, we
empirically show our method to be robust to total variation terms, biased
towards global smoothness, and scalable to larger hypergraphs.
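
The smoothness notion mentioned at the start of this abstract can be made concrete for ordinary graphs. The sketch below computes the Laplacian quadratic form x^T L x with L = D - W, which equals the sum over edges of w_ij (x_i - x_j)^2. It covers only the plain-graph case; the hypergraph total variations generalized in the paper are not reproduced, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def laplacian(weights: Matrix) -> Matrix:
    """Combinatorial Laplacian L = D - W for a symmetric weight matrix with zero diagonal."""
    n = len(weights)
    return [
        [(sum(weights[i]) if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def smoothness(weights: Matrix, signal: List[float]) -> float:
    """Quadratic form x^T L x; small values mean the signal varies little across edges."""
    lap = laplacian(weights)
    n = len(signal)
    return sum(signal[i] * lap[i][j] * signal[j] for i in range(n) for j in range(n))


# Path graph on three nodes: 0 - 1 - 2, unit edge weights.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
print(smoothness(path, [1.0, 1.0, 1.0]))   # constant signal
print(smoothness(path, [1.0, -1.0, 1.0]))  # signal that flips sign across every edge
```

Structure learning inverts this computation: given the signals, it searches for weights under which the quadratic form is small, subject to constraints that rule out the trivial all-zero graph.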
[COMMENTS]
15 pages, 7 figures, submitted to IEEE for possible publication;
Section I includes more applications, comparisons, and enumerated list of
novel contributions; removed numerical analysis of TV terms in Section II,
added more general discussion; updated Algorithm 1 and corresponding text;
third experiment of Section V-C replaced with new experiment
[LINK]
http://arxiv.org/abs/2504.03583v2
[DATE]
2025-06-27 23:58:19+08:00
[CATEGORIES]
cs.LG
A Framework for Multi-source Privacy Preserving Epidemic Analysis
[AUTHORS]
Zihan Guan, Zhiyuan Zhao, Fengwei Tian, Dung Nguyen, Payel Bhattacharjee, Ravi Tandon, B. Aditya Prakash, Anil Vullikanti
[ABSTRACT]
It is now well understood that diverse datasets provide a lot of value in key
epidemiology and public health analyses, such as forecasting and nowcasting,
development of epidemic models, evaluation and design of interventions and
resource allocation. Some of these datasets are often sensitive, and need
adequate privacy protections. There are many models of privacy, but
Differential Privacy (DP) has become a de facto standard because of its strong
guarantees, without making models about adversaries. In this paper, we develop
a framework that integrates deep learning and epidemic models to simultaneously
perform epidemic forecasting and learn a mechanistic model of epidemic
spread, while incorporating multiple datasets for these analyses, including
some with DP guarantees. We demonstrate our framework using a realistic but
synthetic financial dataset with DP; such a dataset has not been used in such
epidemic analyses. We show that this dataset provides significant value in
forecasting and learning an epidemic model, even when used with DP guarantees.
[COMMENTS]
17 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.22342v1
[DATE]
2025-06-27 23:52:12+08:00
[CATEGORIES]
cs.LG
QuKAN: A Quantum Circuit Born Machine approach to Quantum Kolmogorov Arnold Networks
[AUTHORS]
Yannick Werner, Akash Malemath, Mengxi Liu, Vitor Fortes Rey, Nikolaos Palaiodimopoulos, Paul Lukowicz, Maximilian Kiefer-Emmanouilidis
[ABSTRACT]
Kolmogorov Arnold Networks (KANs), built upon the Kolmogorov Arnold
representation theorem (KAR), have demonstrated promising capabilities in
expressing complex functions with fewer neurons. This is achieved by
implementing learnable parameters on the edges instead of on the nodes, unlike
traditional networks such as Multi-Layer Perceptrons (MLPs). However, KANs'
potential in quantum machine learning has not yet been well explored. In this
work, we present an implementation of these KAN architectures in both hybrid
and fully quantum forms using a Quantum Circuit Born Machine (QCBM). We adapt
the KAN transfer using pre-trained residual functions, thereby exploiting the
representational power of parametrized quantum circuits. In the hybrid model we
combine classical KAN components with quantum subroutines, while in the fully
quantum version the entire architecture of the residual function is translated
to a quantum model. We demonstrate the feasibility, interpretability and
performance of the proposed Quantum KAN (QuKAN) architecture.
[LINK]
http://arxiv.org/abs/2506.22340v1
[DATE]
2025-06-27 23:51:19+08:00
[CATEGORIES]
cs.LG
Robust quantum reservoir computers for forecasting chaotic dynamics: generalized synchronization and stability
[AUTHORS]
Osama Ahmed, Felix Tennie, Luca Magri
[ABSTRACT]
We show that recurrent quantum reservoir computers (QRCs) and their
recurrence-free architectures (RF-QRCs) are robust tools for learning and
forecasting chaotic dynamics from time-series data. First, we formulate and
interpret quantum reservoir computers as coupled dynamical systems, where the
reservoir acts as a response system driven by training data; in other words,
quantum reservoir computers are generalized-synchronization (GS) systems.
Second, we show that quantum reservoir computers can learn chaotic dynamics and
their invariant properties, such as Lyapunov spectra, attractor dimensions, and
geometric properties such as the covariant Lyapunov vectors. This analysis is
enabled by deriving the Jacobian of the quantum reservoir update. Third, by
leveraging tools from generalized synchronization, we provide a method for
designing robust quantum reservoir computers. We propose the criterion
$GS=ESP$: GS implies the echo state property (ESP), and vice versa. We
analytically show that RF-QRCs, by design, fulfill $GS=ESP$. Finally, we
analyze the effect of simulated noise. We find that dissipation from noise
enhances the robustness of quantum reservoir computers. Numerical verifications
on systems of different dimensions support our conclusions. This work opens
opportunities for designing robust quantum machines for chaotic time series
forecasting on near-term quantum hardware.
[COMMENTS]
28 pages, 12 figures
[LINK]
http://arxiv.org/abs/2506.22335v1
[DATE]
2025-06-27 23:42:20+08:00
[CATEGORIES]
cs.LG
Less Greedy Equivalence Search
[AUTHORS]
Adiba Ejaz, Elias Bareinboim
[ABSTRACT]
Greedy Equivalence Search (GES) is a classic score-based algorithm for causal
discovery from observational data. In the sample limit, it recovers the Markov
equivalence class of graphs that describe the data. Still, it faces two
challenges in practice: computational cost and finite-sample accuracy. In this
paper, we develop Less Greedy Equivalence Search (LGES), a variant of GES that
retains its theoretical guarantees while partially addressing these
limitations. LGES modifies the greedy step: rather than always applying the
highest-scoring insertion, it avoids edge insertions between variables for
which the score implies some conditional independence. This more targeted
search yields up to a 10-fold speed-up and a substantial reduction in
structural error relative to GES. Moreover, LGES can guide the search using
prior assumptions, while correcting these assumptions when contradicted by the
data. Finally, LGES can exploit interventional data to refine the learned
observational equivalence class. We prove that LGES recovers the true
equivalence class in the sample limit from observational and interventional
data, even with misspecified prior assumptions. Experiments demonstrate that
LGES outperforms GES and other baselines in speed, accuracy, and robustness to
misspecified assumptions. Our code is available at
https://github.com/CausalAILab/lges.
[COMMENTS]
35 total pages. 14 figures
[LINK]
http://arxiv.org/abs/2506.22331v1
[DATE]
2025-06-27 23:39:48+08:00
[CATEGORIES]
cs.LG
Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling
[AUTHORS]
Erkan Turan, Aristotelis Siozopoulos, Maks Ovsjanikov
[ABSTRACT]
Conditional Flow Matching (CFM) offers a simulation-free framework for
training continuous-time generative models, bridging diffusion and flow-based
approaches. However, sampling from CFM still relies on numerically solving
non-linear ODEs which can be computationally expensive and difficult to
interpret. Recent alternatives address sampling speed via trajectory
straightening, mini-batch coupling or distillation. However, these methods
typically do not shed light on the underlying \textit{structure} of the
generative process. In this work, we propose to accelerate CFM and introduce an
interpretable representation of its dynamics by integrating Koopman operator
theory, which models non-linear flows as linear evolution in a learned space of
observables. We introduce a decoder-free Koopman-CFM architecture that learns
an embedding where the generative dynamics become linear, enabling closed-form,
one-step sampling via matrix exponentiation. This results in significant
speedups over traditional CFM as demonstrated on controlled 2D datasets and
real-world benchmarks, MNIST, Fashion-MNIST (F-MNIST), and the Toronto Face
Dataset (TFD). Unlike previous methods, our approach leads to a well-structured
Koopman generator, whose spectral properties, eigenvalues, and eigenfunctions
offer principled tools for analyzing generative behavior such as temporal
scaling, mode stability, and decomposition in Koopman latent space. By
combining sampling efficiency with analytical structure, Koopman-enhanced flow
matching offers a potential step toward fast and interpretable generative
modeling.
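[EXAMPLE]
The closed-form, one-step sampling idea can be illustrated on a toy linear system. This is a minimal sketch, not the authors' code: it assumes the dynamics are already linear, dz/dt = L z, with a hand-picked 2x2 generator standing in for a learned Koopman generator, and it omits the learned embedding entirely. All function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_vec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def mat_exp(m, terms=30):
    """Approximate exp(m) with a truncated Taylor series (fine for small norms)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes m^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def euler_sample(gen, z0, steps):
    """Integrate dz/dt = gen @ z from t=0 to t=1 with explicit Euler steps."""
    z = list(z0)
    dt = 1.0 / steps
    for _ in range(steps):
        dz = mat_vec(gen, z)
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
    return z


# Hypothetical generator: a rotation by omega radians per unit time.
omega = math.pi / 2
generator = [[0.0, -omega], [omega, 0.0]]
z0 = [1.0, 0.0]

one_step = mat_vec(mat_exp(generator), z0)            # z(1) = exp(L) z(0)
many_steps = euler_sample(generator, z0, steps=1000)  # iterative ODE solve
print(one_step, many_steps)
```

Both routes reach approximately the same endpoint, but the matrix-exponential route needs a single matrix-vector product once exp(L) is known.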
[LINK]
http://arxiv.org/abs/2506.22304v1
[DATE]
2025-06-27 23:16:16+08:00
[CATEGORIES]
cs.LG
CoATA: Effective Co-Augmentation of Topology and Attribute for Graph Neural Networks
[AUTHORS]
Tao Liu, Longlong Lin, Yunfeng Yu, Xi Ou, Youan Zhang, Zhiqiu Ye, Tao Jia
[ABSTRACT]
Graph Neural Networks (GNNs) have garnered substantial attention due to their
remarkable capability in learning graph representations. However, real-world
graphs often exhibit substantial noise and incompleteness, which severely
degrades the performance of GNNs. Existing methods typically address this issue
through single-dimensional augmentation, focusing either on refining topology
structures or perturbing node attributes, thereby overlooking the deeper
interplays between the two. To bridge this gap, this paper presents CoATA, a
dual-channel GNN framework specifically designed for the Co-Augmentation of
Topology and Attribute. Specifically, CoATA first propagates structural signals
to enrich and denoise node attributes. Then, it projects the enhanced attribute
space into a node-attribute bipartite graph for further refinement or
reconstruction of the underlying structure. Subsequently, CoATA introduces
contrastive learning, leveraging prototype alignment and consistency
constraints, to facilitate mutual corrections between the augmented and
original graphs. Finally, extensive experiments on seven benchmark datasets
demonstrate that the proposed CoATA outperforms eleven state-of-the-art
baseline methods, showcasing its effectiveness in capturing the synergistic
relationship between topology and attributes.
[COMMENTS]
ICMR
[LINK]
http://arxiv.org/abs/2506.22299v1
[DATE]
2025-06-27 23:11:49+08:00
[CATEGORIES]
cs.LG
Score-Based Model for Low-Rank Tensor Recovery
[AUTHORS]
Zhengyun Cheng, Changhao Wang, Guanwen Zhang, Yi Xu, Wei Zhou, Xiangyang Ji
[ABSTRACT]
Low-rank tensor decompositions (TDs) provide an effective framework for
multiway data analysis. Traditional TD methods rely on predefined structural
assumptions, such as CP or Tucker decompositions. From a probabilistic
perspective, these can be viewed as using Dirac delta distributions to model
the relationships between shared factors and the low-rank tensor. However, such
prior knowledge is rarely available in practical scenarios, particularly
regarding the optimal rank structure and contraction rules. The optimization
procedures based on fixed contraction rules are complex, and approximations
made during these processes often lead to accuracy loss. To address this issue,
we propose a score-based model that eliminates the need for predefined
structural or distributional assumptions, enabling the learning of
compatibility between tensors and shared factors. Specifically, a neural
network is designed to learn the energy function, which is optimized via score
matching to capture the gradient of the joint log-probability of tensor entries
and shared factors. Our method allows for modeling structures and distributions
beyond the Dirac delta assumption. Moreover, integrating the block coordinate
descent (BCD) algorithm with the proposed smooth regularization enables the
model to perform both tensor completion and denoising. Experimental results
demonstrate significant performance improvements across various tensor types,
including sparse and continuous-time tensors, as well as visual data.
[LINK]
http://arxiv.org/abs/2506.22295v1
[DATE]
2025-06-27 23:05:37+08:00
[CATEGORIES]
cs.LG
Gradual Domain Adaptation for Graph Learning
[AUTHORS]
Pui Ieng Lei, Ximing Chen, Yijun Sheng, Yanyan Liu, Jingzhi Guo, Zhiguo Gong
[ABSTRACT]
Existing literature lacks a graph domain adaptation technique for handling
large distribution shifts, primarily due to the difficulty in simulating an
evolving path from source to target graph. To make a breakthrough, we present a
graph gradual domain adaptation (GGDA) framework with the construction of a
compact domain sequence that minimizes information loss in adaptations. Our
approach starts with an efficient generation of knowledge-preserving
intermediate graphs over the Fused Gromov-Wasserstein (FGW) metric. With the
bridging data pool, GGDA domains are then constructed via a novel vertex-based
domain progression, which comprises “close” vertex selections and adaptive
domain advancement to enhance inter-domain information transferability.
Theoretically, our framework concretizes the intractable inter-domain distance
$W_p(\mu_t,\mu_{t+1})$ via implementable upper and lower bounds, enabling
flexible adjustments of this metric for optimizing domain formation. Extensive
experiments under various transfer scenarios validate the superior performance
of our GGDA framework.
[LINK]
http://arxiv.org/abs/2501.17443v2
[DATE]
2025-06-27 22:45:02+08:00
[CATEGORIES]
cs.LG
Breaking Rank Bottlenecks in Knowledge Graph Completion
[AUTHORS]
Samy Badreddine, Emile van Krieken, Luciano Serafini
[ABSTRACT]
Many Knowledge Graph Completion (KGC) models, despite using powerful
encoders, rely on a simple vector-matrix multiplication to score queries
against candidate object entities. When the number of entities is larger than
the model’s embedding dimension, which in practical scenarios is often by
several orders of magnitude, we have a linear output layer with a rank
bottleneck. Such bottlenecked layers limit model expressivity. We investigate
both theoretically and empirically how rank bottlenecks affect KGC models. We
find that, by limiting the set of feasible predictions, rank bottlenecks hurt
ranking accuracy and the distribution fidelity of scores. Inspired by the
language modelling literature, we propose KGE-MoS, a mixture-based output layer
to break rank bottlenecks in many KGC models. Our experiments on four datasets
show that KGE-MoS improves performance and probabilistic fit of KGC models for
a low parameter cost.
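[EXAMPLE]
A mixture-based output layer can be sketched as follows. This is an illustration only, assuming that "MoS" refers to a mixture of softmaxes as used in the language modelling literature; the function names and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_softmaxes(component_scores, mixture_logits):
    """Combine several softmax distributions with learned mixture weights.

    component_scores: one list of entity scores per mixture component.
    mixture_logits:   one logit per component; softmax turns them into weights.
    """
    weights = softmax(mixture_logits)
    components = [softmax(scores) for scores in component_scores]
    n_entities = len(component_scores[0])
    return [sum(w * c[j] for w, c in zip(weights, components))
            for j in range(n_entities)]


# Toy query with three candidate entities and two components (made-up scores).
probs = mixture_of_softmaxes([[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]], [0.0, 0.0])
print(probs)
```

Because the mixture is taken over probabilities rather than over scores, the resulting log-probabilities are no longer a linear function of a single query embedding, which is the usual argument for why such layers are not limited by the embedding dimension.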
[LINK]
http://arxiv.org/abs/2506.22271v1
[DATE]
2025-06-27 22:41:22+08:00
[CATEGORIES]
cs.LG
How do Probabilistic Graphical Models and Graph Neural Networks Look at Network Data?
[AUTHORS]
Michela Lapenna, Caterina De Bacco
[ABSTRACT]
Graphs are a powerful data structure for representing relational data and are
widely used to describe complex real-world systems. Probabilistic Graphical
Models (PGMs) and Graph Neural Networks (GNNs) can both leverage
graph-structured data, but they function differently. How do they compare in
capturing the information contained in networked datasets? We address this
question by solving a link prediction task
and we conduct three main experiments, on both synthetic and real networks: one
focuses on how PGMs and GNNs handle input features, while the other two
investigate their robustness to noisy features and increasing heterophily of
the graph. PGMs do not necessarily require features on nodes, while GNNs cannot
exploit the network edges alone, and the choice of input features matters. We
find that GNNs are outperformed by PGMs when input features are low-dimensional
or noisy, mimicking many real scenarios where node attributes might be scalar
or noisy. Then, we find that PGMs are more robust than GNNs when the
heterophily of the graph is increased. Finally, to assess performance beyond
prediction tasks, we also compare the two frameworks in terms of their
computational complexity and interpretability.
[LINK]
http://arxiv.org/abs/2506.11869v2
[DATE]
2025-06-27 22:37:02+08:00
[CATEGORIES]
cs.LG
Risk-Averse Best Arm Set Identification with Fixed Budget and Fixed Confidence
[AUTHORS]
Shunta Nonaga, Koji Tabata, Yuta Mizuno, Tamiki Komatsuzaki
[ABSTRACT]
  Decision making in uncertain environments that maximizes expected reward
while minimizing risk is a ubiquitous problem across many fields. Here, we
introduce a novel problem setting in stochastic bandit
optimization that jointly addresses two critical aspects of decision-making:
maximizing expected reward and minimizing associated uncertainty, quantified
via the mean-variance (MV) criterion. Unlike traditional bandit formulations
that focus solely on expected returns, our objective is to efficiently and
accurately identify the Pareto-optimal set of arms that strikes the best
trade-off between expected performance and risk. We propose a unified
meta-algorithmic framework capable of operating under both fixed-confidence and
fixed-budget regimes, achieved through adaptive design of confidence intervals
tailored to each scenario using the same sample exploration strategy. We
provide theoretical guarantees on the correctness of the returned solutions in
both settings. To complement this theoretical analysis, we conduct extensive
empirical evaluations across synthetic benchmarks, demonstrating that our
approach outperforms existing methods in terms of both accuracy and sample
efficiency, highlighting its broad applicability to risk-aware decision-making
tasks in uncertain environments.
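[EXAMPLE]
The target of the identification problem, the Pareto-optimal set under the mean-variance criterion, can be sketched for the idealized case where each arm's mean and variance are known exactly. The paper's algorithm instead works from samples with confidence intervals, which this sketch does not attempt; the arm names and values below are made up.

```python
def pareto_optimal_arms(arms):
    """Return the names of arms not dominated in (mean, variance).

    arms: dict mapping arm name -> (mean, variance).
    An arm p dominates q if p has mean >= q's and variance <= q's,
    with at least one of the two inequalities strict.
    """
    def dominates(p, q):
        return (p[0] >= q[0] and p[1] <= q[1]
                and (p[0] > q[0] or p[1] < q[1]))

    return sorted(
        name for name, stats in arms.items()
        if not any(dominates(other, stats) for other in arms.values())
    )


# Hypothetical arms: higher mean is better, lower variance is better.
arms = {
    "A": (1.0, 0.5),
    "B": (0.8, 0.1),
    "C": (0.7, 0.3),  # dominated by B: lower mean and higher variance
    "D": (1.2, 0.9),
}
print(pareto_optimal_arms(arms))  # → ['A', 'B', 'D']
```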
[LINK]
http://arxiv.org/abs/2506.22253v1
[DATE]
2025-06-27 22:21:03+08:00
[CATEGORIES]
cs.LG
Fairness-Optimized Synthetic EHR Generation for Arbitrary Downstream Predictive Tasks
[AUTHORS]
Mirza Farhan Bin Tarek, Raphael Poulain, Rahmatollah Beheshti
[COMMENTS]
The paper has been accepted at the IEEE/ACM conference on Connected
Health: Applications, Systems and Engineering Technologies (CHASE) 2025
[LINK]
http://arxiv.org/abs/2406.02510v3
[DATE]
2025-06-27 22:11:59+08:00
[CATEGORIES]
cs.LG
Boosting Classification with Quantum-Inspired Augmentations
[AUTHORS]
Matthias Tschöpe, Vitor Fortes Rey, Sogo Pierre Sanon, Paul Lukowicz, Nikolaos Palaiodimopoulos, Maximilian Kiefer-Emmanouilidis
[ABSTRACT]
Understanding the impact of small quantum gate perturbations, which are
common in quantum digital devices but absent in classical computers, is crucial
for identifying potential advantages in quantum machine learning. While these
perturbations are typically seen as detrimental to quantum computation, they
can actually enhance performance by serving as a natural source of data
augmentation. Additionally, they can often be efficiently simulated on
classical hardware, enabling quantum-inspired approaches to improve classical
machine learning methods. In this paper, we investigate random Bloch sphere
rotations, which are fundamental SU(2) transformations, as a simple yet
effective quantum-inspired data augmentation technique. Unlike conventional
augmentations such as flipping, rotating, or cropping, quantum transformations
lack intuitive spatial interpretations, making their application to tasks like
image classification less straightforward. While common quantum augmentation
methods rely on applying quantum models or trainable quanvolutional layers to
classical datasets, we focus on the direct application of small-angle Bloch
rotations and their effect on classical data. Using the large-scale ImageNet
dataset, we demonstrate that our quantum-inspired augmentation method improves
image classification performance, increasing Top-1 accuracy by 3%, Top-5
accuracy by 2.5%, and the F$_1$ score from 8% to 12% compared to standard
classical augmentation methods. Finally, we examine the use of stronger unitary
augmentations. Although these transformations preserve information in
principle, they result in visually unrecognizable images with potential
applications for privacy computations. However, we show that our augmentation
approach and simple SU(2) transformations do not enhance differential privacy
and discuss the implications of this limitation.
[LINK]
http://arxiv.org/abs/2506.22241v1
[DATE]
2025-06-27 22:08:43+08:00
[CATEGORIES]
cs.LG
No More Sliding Window: Efficient 3D Medical Image Segmentation with Differentiable Top-k Patch Sampling
[AUTHORS]
Young Seok Jeon, Hongfei Yang, Huazhu Fu, Mengling Feng
[ABSTRACT]
3D models surpass 2D models in CT/MRI segmentation by effectively capturing
inter-slice relationships. However, the added depth dimension substantially
increases memory consumption. While patch-based training alleviates memory
constraints, it significantly slows down the inference speed due to the sliding
window (SW) approach. We propose No-More-Sliding-Window (NMSW), a novel
end-to-end trainable framework that improves the inference efficiency of
generic 3D segmentation backbones by eliminating the need for SW.
NMSW employs a differentiable Top-k module to selectively sample only the most
relevant patches, thereby minimizing redundant computations. When patch-level
predictions are insufficient, the framework intelligently leverages coarse
global predictions to refine results. Evaluated across 3 tasks using 3
segmentation backbones, NMSW achieves competitive accuracy compared to SW
inference while significantly reducing computational complexity by 91% (88.0 to
8.00 TMACs). Moreover, it delivers 9.1x faster inference on the H100 GPU
(99.0 to 8.3 sec) and 11.1x faster inference on the Xeon Gold CPU (2110 to
189 sec). NMSW is model-agnostic and can further boost efficiency when
integrated with existing efficient segmentation backbones. The code is available at
https://github.com/Youngseok0001/open_nmsw.
[LINK]
http://arxiv.org/abs/2501.10814v3
[DATE]
2025-06-27 21:58:15+08:00
[CATEGORIES]
cs.LG
Uncovering smooth structures in single-cell data with PCS-guided neighbor embeddings
[AUTHORS]
Rong Ma, Xi Li, Jingyuan Hu, Bin Yu
[ABSTRACT]
Single-cell sequencing is revolutionizing biology by enabling detailed
investigations of cell-state transitions. Many biological processes unfold
along continuous trajectories, yet it remains challenging to extract smooth,
low-dimensional representations from inherently noisy, high-dimensional
single-cell data. Neighbor embedding (NE) algorithms, such as t-SNE and UMAP,
are widely used to embed high-dimensional single-cell data into low dimensions.
But they often introduce undesirable distortions, resulting in misleading
interpretations. Existing evaluation methods for NE algorithms primarily focus
on separating discrete cell types rather than capturing continuous cell-state
transitions, while dynamic modeling approaches rely on strong assumptions about
cellular processes and specialized data. To address these challenges, we build
on the Predictability-Computability-Stability (PCS) framework for reliable and
reproducible data-driven discoveries. First, we systematically evaluate popular
NE algorithms through empirical analysis, simulation, and theory, and reveal
their key shortcomings, such as artifacts and instability. We then introduce
NESS, a principled and interpretable machine learning approach to improve NE
representations by leveraging algorithmic stability and to enable robust
inference of smooth biological structures. NESS offers useful concepts,
quantitative stability metrics, and efficient computational workflows to
uncover developmental trajectories and cell-state transitions in single-cell
data. Finally, we apply NESS to six single-cell datasets, spanning pluripotent
stem cell differentiation, organoid development, and multiple tissue-specific
lineage trajectories. Across these diverse contexts, NESS consistently yields
useful biological insights, such as identification of transitional and stable
cell states and quantification of transcriptional dynamics during development.
[LINK]
http://arxiv.org/abs/2506.22228v1
[DATE]
2025-06-27 21:45:55+08:00
[CATEGORIES]
cs.LG
Communication-Efficient Heterogeneous Federated Learning with Generalized Heavy-Ball Momentum
[AUTHORS]
Riccardo Zaccone, Sai Praneeth Karimireddy, Carlo Masone, Marco Ciccone
[ABSTRACT]
Federated Learning (FL) has emerged as the state-of-the-art approach for
learning from decentralized data in privacy-constrained scenarios. However,
system and statistical challenges hinder its real-world applicability,
requiring efficient learning from edge devices and robustness to data
heterogeneity. Despite significant research efforts, existing approaches often
degrade severely due to the joint effect of heterogeneity and partial client
participation. In particular, while momentum appears as a promising approach
for overcoming statistical heterogeneity, in current approaches its update is
biased towards the most recently sampled clients. As we show in this work, this
is the reason why it fails to outperform FedAvg, preventing its effective use
in real-world large-scale scenarios. In this work, we propose a novel
Generalized Heavy-Ball Momentum (GHBM) and theoretically prove it enables
convergence under unbounded data heterogeneity in cyclic partial participation,
thereby advancing the understanding of momentum’s effectiveness in FL. We then
introduce adaptive and communication-efficient variants of GHBM that match the
communication complexity of FedAvg in settings where clients can be stateful.
Extensive experiments on vision and language tasks confirm our theoretical
findings, demonstrating that GHBM substantially improves state-of-the-art
performance under random uniform client sampling, particularly in large-scale
settings with high data heterogeneity and low client participation. Code is
available at https://rickzack.github.io/GHBM.
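[EXAMPLE]
For orientation, the classical heavy-ball update that GHBM generalizes can be sketched on a one-dimensional problem. This shows only the standard single-machine rule x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1}); it is not the federated GHBM algorithm, and the objective and hyperparameters are arbitrary choices for illustration.

```python
def heavy_ball_minimize(grad, x0, lr=0.1, beta=0.5, steps=200):
    """Minimize a 1-D function with classical heavy-ball momentum.

    grad: callable returning the gradient at a point.
    The momentum term beta * (x - x_prev) reuses the previous displacement.
    """
    x_prev = x0
    x = x0
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_star = heavy_ball_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_star)
```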
[COMMENTS]
Accepted at TMLR - reviews at
https://openreview.net/forum?id=LNoFjcLywb
[LINK]
http://arxiv.org/abs/2311.18578v3
[DATE]
2025-06-27 21:40:04+08:00
[CATEGORIES]
cs.LG
No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets
[AUTHORS]
Corinna Coupette, Jeremy Wayland, Emily Simons, Bastian Rieck
[ABSTRACT]
Benchmark datasets have proved pivotal to the success of graph learning, and
good benchmark datasets are crucial to guide the development of the field.
Recent research has highlighted problems with graph-learning datasets and
benchmarking practices – revealing, for example, that methods which ignore the
graph structure can outperform graph-based approaches. Such findings raise two
questions: (1) What makes a good graph-learning dataset, and (2) how can we
evaluate dataset quality in graph learning? Our work addresses these questions.
As the classic evaluation setup uses datasets to evaluate models, it does not
apply to dataset evaluation. Hence, we start from first principles. Observing
that graph-learning datasets uniquely combine two modes – graph structure and
node features –, we introduce Rings, a flexible and extensible
mode-perturbation framework to assess the quality of graph-learning datasets
based on dataset ablations – i.e., quantifying differences between the
original dataset and its perturbed representations. Within this framework, we
propose two measures – performance separability and mode complementarity – as
evaluation tools, each assessing the capacity of a graph dataset to benchmark
the power and efficacy of graph-learning methods from a distinct angle. We
demonstrate the utility of our framework for dataset evaluation via extensive
experiments on graph-level tasks and derive actionable recommendations for
improving the evaluation of graph-learning methods. Our work opens new research
directions in data-centric graph learning, and it constitutes a step toward the
systematic evaluation of evaluations.
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2502.02379v2
[DATE]
2025-06-27 21:34:57+08:00
[CATEGORIES]
cs.LG
Soft Condorcet Optimization for Ranking of General Agents
[AUTHORS]
Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, Doina Precup
[ABSTRACT]
Driving progress of AI models and agents requires comparing their performance
on standardized benchmarks; for general agents, individual performances must be
aggregated across a potentially wide variety of different tasks. In this paper,
we describe a novel ranking scheme inspired by social choice frameworks, called
Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents:
the one that makes the fewest mistakes in predicting the agent comparisons in
the evaluation data. This optimal ranking is the maximum likelihood estimate
when evaluation data (which we view as votes) are interpreted as noisy samples
from a ground truth ranking, a solution to Condorcet’s original voting system
criteria. SCO ratings are maximal for Condorcet winners when they exist, which
we show is not necessarily true for the classical rating system Elo. We propose
three optimization algorithms to compute SCO ratings and evaluate their
empirical performance. When serving as an approximation to the Kemeny-Young
voting method, SCO rankings are on average 0 to 0.043 away from the optimal
ranking in normalized Kendall-tau distance across 865 preference profiles from
the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO
achieves accurate approximations to the ground truth ranking and the best among
several baselines when 59\% or more of the preference data is missing. Finally,
SCO ranking provides the best approximation to the optimal ranking, measured on
held-out test sets, in a problem containing 52,958 human players across 31,049
games of the classic seven-player game of Diplomacy.
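[EXAMPLE]
The normalized Kendall-tau distance used to report these results counts the fraction of item pairs that two rankings order differently. A minimal sketch, assuming both rankings are complete orderings of the same agents without ties (the function name is hypothetical):

```python
from itertools import combinations


def normalized_kendall_tau(ranking_a, ranking_b):
    """Fraction of agent pairs ordered differently by the two rankings.

    Returns 0.0 for identical rankings and 1.0 for exactly reversed ones.
    """
    if sorted(ranking_a) != sorted(ranking_b):
        raise ValueError("rankings must contain the same agents")
    n = len(ranking_a)
    if n < 2:
        return 0.0
    pos_a = {agent: i for i, agent in enumerate(ranking_a)}
    pos_b = {agent: i for i, agent in enumerate(ranking_b)}
    discordant = sum(
        1 for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / (n * (n - 1) / 2)


print(normalized_kendall_tau(["a", "b", "c", "d"], ["a", "b", "c", "d"]))  # → 0.0
print(normalized_kendall_tau(["a", "b", "c", "d"], ["d", "c", "b", "a"]))  # → 1.0
```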
[LINK]
http://arxiv.org/abs/2411.00119v4
[DATE]
2025-06-27 21:26:25+08:00
[CATEGORIES]
cs.LG
Hybrid Generative Modeling for Incomplete Physics: Deep Grey-Box Meets Optimal Transport
[AUTHORS]
Gurjeet Sangra Singh, Maciej Falkiewicz, Alexandros Kalousis
[ABSTRACT]
Physics phenomena are often described by ordinary and/or partial differential
equations (ODEs/PDEs), and solved analytically or numerically. Unfortunately,
many real-world systems are described only approximately with missing or
unknown terms in the equations. This makes the distribution of the physics
model differ from the true data-generating process (DGP). Using limited and
unpaired data between DGP observations and the imperfect model simulations, we
investigate this particular setting by completing the known-physics model,
combining theory-driven and data-driven models to describe the shifted
distribution involved in the DGP. We present a novel hybrid generative model
approach combining deep grey-box modelling with Optimal Transport (OT) methods
to enhance incomplete physics models. Our method implements OT maps in data
space while maintaining minimal source distribution distortion, demonstrating
superior performance in resolving the unpaired problem and ensuring correct
usage of physics parameters. Unlike black-box alternatives, our approach
leverages physics-based inductive biases to accurately learn system dynamics
while preserving interpretability through its domain knowledge foundation.
Experimental results validate our method’s effectiveness in both generation
tasks and model transparency, offering detailed insights into learned physics
dynamics.
[COMMENTS]
Workshop paper at ICLR 2025 (XAI4Science Workshop)
[LINK]
http://arxiv.org/abs/2506.22204v1
[DATE]
2025-06-27 21:23:27+08:00
[CATEGORIES]
cs.LG
EFRame: Deeper Reasoning via Exploration-Filtering-Replay Reinforcement Learning Framework
[AUTHORS]
Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yue Wang, Yuzhi Zhang
[ABSTRACT]
Recent advances in reinforcement learning (RL) have significantly enhanced
the reasoning capabilities of large language models (LLMs). Group Relative
Policy Optimization (GRPO), an efficient variant of PPO that lowers RL’s
computational cost, still faces limited exploration, low sample efficiency and
instability, constraining its performance on complex reasoning tasks. To
address these limitations, we introduce EFRame, an Exploration-Filtering-Replay
framework that systematically augments GRPO along three critical dimensions.
EFRame performs additional rollouts to explore high-quality trajectories,
applies online filtering to eliminate low-quality samples that introduce noise
and variance, and leverages experience replay to repeatedly exploit rare but
informative samples. EFRame establishes a complete and stable learning cycle,
guiding the model through a structured transition from exploration to
convergence. Our experiments across a variety of reasoning benchmarks
demonstrate that EFRame not only improves the robustness and efficiency of
training, but also enables access to deeper reasoning capabilities that remain
unattainable under vanilla GRPO. Furthermore, EFRame enables a more
fine-grained categorization of training samples, allowing for a deeper analysis
of how different types of samples contribute to the learning process in RL. Our
code is available at https://github.com/597358816/EFRame.
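[EXAMPLE]
The group-relative signal that GRPO-style methods build on can be sketched as follows. This is a simplified illustration of the commonly described formulation, in which each rollout's reward is standardized within its group; it assumes a population standard deviation and a small epsilon for stability, and it does not implement EFRame's exploration, filtering, or replay components.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes give a non-zero learning signal.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# A group where every rollout gets the same reward yields all-zero advantages.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Groups with identical rewards contribute no gradient signal under this normalization, which is one plausible motivation for adding extra exploration or filtering on top of vanilla GRPO.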
[LINK]
http://arxiv.org/abs/2506.22200v1
[DATE]
2025-06-27 21:09:05+08:00
[CATEGORIES]
cs.LG
REDELEX: A Framework for Relational Deep Learning Exploration
[AUTHORS]
Jakub Peleška, Gustav Šír
[ABSTRACT]
Relational databases (RDBs) are widely regarded as the gold standard for
storing structured information. Consequently, predictive tasks leveraging this
data format hold significant application promise. Recently, Relational Deep
Learning (RDL) has emerged as a novel paradigm wherein RDBs are conceptualized
as graph structures, enabling the application of various graph neural
architectures to effectively address these tasks. However, given its novelty,
there is a lack of analysis into the relationships between the performance of
various RDL models and the characteristics of the underlying RDBs.
In this study, we present REDELEX, a comprehensive exploration framework for
evaluating RDL models of varying complexity on the most diverse collection of
over 70 RDBs, which we make available to the community. Benchmarked alongside
key representatives of classic methods, we confirm the generally superior
performance of RDL while providing insights into the main factors shaping
performance, including model complexity, database sizes and their structural
properties.
[COMMENTS]
Accepted to ECMLPKDD 2025 at Porto, Portugal
[LINK]
http://arxiv.org/abs/2506.22199v1
[DATE]
2025-06-27 21:05:15+08:00
[CATEGORIES]
cs.LG
AB-UPT: Scaling Neural CFD Surrogates for High-Fidelity Automotive Aerodynamics Simulations via Anchored-Branched Universal Physics Transformers
[AUTHORS]
Benedikt Alkin, Maurits Bleeker, Richard Kurle, Tobias Kronlachner, Reinhard Sonnleitner, Matthias Dorfer, Johannes Brandstetter
[ABSTRACT]
Recent advances in neural surrogate modeling offer the potential for
transformative innovations in applications such as automotive aerodynamics.
Yet, industrial-scale problems often involve volumetric meshes with cell counts
reaching 100 million, presenting major scalability challenges. Complex
geometries further complicate modeling through intricate surface-volume
interactions, while quantities such as vorticity are highly nonlinear and must
satisfy strict divergence-free constraints. To address these requirements, we
introduce Anchored-Branched Universal Physics Transformers (AB-UPT) as a novel
modeling scheme for building neural surrogates for computational fluid dynamics
(CFD) simulations. AB-UPT is designed to: (i) decouple geometry encoding and
prediction tasks via multi-branch operators; (ii) enable scalability to
high-resolution outputs via neural simulation in a low-dimensional latent
space, coupled with anchored neural field decoders to predict high-fidelity
outputs; (iii) enforce physics consistency by a novel divergence-free
formulation. We show that AB-UPT yields state-of-the-art predictive accuracy of
surface and volume fields on automotive CFD simulations ranging from 33
thousand up to 150 million mesh cells. Furthermore, our anchored neural field
architecture enables the enforcement of hard physical constraints on the
physics predictions without degradation in performance, exemplified by modeling
divergence-free vorticity fields. Notably, the proposed models can be trained
on a single GPU in less than a day and predict industry-standard surface and
volume fields within seconds. Additionally, we show that the flexible design of
our method enables neural simulation from a computer-aided design geometry
alone, omitting the need for costly CFD meshing procedures.
[COMMENTS]
Preprint. Github: https://github.com/Emmi-AI/AB-UPT
[LINK]
http://arxiv.org/abs/2502.09692v3
[DATE]
2025-06-27 20:59:19+08:00
[CATEGORIES]
cs.LG
Thompson Sampling-Based Learning and Control for Unknown Dynamic Systems
[AUTHORS]
Kaikai Zheng, Dawei Shi, Yang Shi, Long Wang
[ABSTRACT]
Thompson sampling (TS) is an effective method to explore parametric
uncertainties and can therefore be used for active learning-based controller
design. However, TS relies on finite parametric representations, which limits
its applicability to more general spaces, which are more commonly encountered
in control system design. To address this issue, this work proposes a
parameterization method for control law learning using reproducing kernel
Hilbert spaces and designs a data-driven active learning control approach.
Specifically, the proposed method treats the control law as an element in a
function space, allowing the design of control laws without imposing
restrictions on the system structure or the form of the controller. A TS
framework is proposed in this work to explore potential optimal control laws,
and the convergence guarantees are further provided for the learning process.
Theoretical analysis shows that the proposed method learns the relationship
between control laws and closed-loop performance metrics at an exponential
rate, and the upper bound of control regret is also derived. Numerical
experiments on controlling unknown nonlinear systems validate the effectiveness
of the proposed method.
[LINK]
http://arxiv.org/abs/2506.22186v1
[DATE]
2025-06-27 20:49:43+08:00
[CATEGORIES]
cs.LG
Graph-Reward-SQL: Execution-Free Reinforcement Learning for Text-to-SQL via Graph Matching and Stepwise Reward
[AUTHORS]
Han Weng, Puzhen Wu, Cui Longjie, Yi Zhan, Boyi Liu, Yuanfeng Song, Dun Zeng, Yingxiang Yang, Qianru Zhang, Dong Huang, Xiaoming Yin, Yang Sun, Xing Chen
[ABSTRACT]
Reinforcement learning (RL) has been widely adopted to enhance the
performance of large language models (LLMs) on Text-to-SQL tasks. However,
existing methods often rely on execution-based or LLM-based Bradley-Terry
reward models. The former suffers from high execution latency caused by
repeated database calls, whereas the latter imposes substantial GPU memory
overhead, both of which significantly hinder the efficiency and scalability of
RL pipelines. To this end, we propose a novel Text-to-SQL RL fine-tuning
framework named Graph-Reward-SQL, which employs the GMNScore outcome reward
model. We leverage SQL graph representations to provide accurate reward signals
while significantly reducing inference time and GPU memory usage. Building on
this foundation, we further introduce StepRTM, a stepwise reward model that
provides intermediate supervision over Common Table Expression (CTE)
subqueries. This encourages both functional correctness and structural clarity
of SQL. Extensive comparative and ablation experiments on standard benchmarks,
including Spider and BIRD, demonstrate that our method consistently outperforms
existing reward models.
[LINK]
http://arxiv.org/abs/2505.12380v2
[DATE]
2025-06-27 20:45:33+08:00
[CATEGORIES]
cs.LG
ASVSim (AirSim for Surface Vehicles): A High-Fidelity Simulation Framework for Autonomous Surface Vehicle Research
[AUTHORS]
Bavo Lesy, Siemen Herremans, Robin Kerstens, Jan Steckel, Walter Daems, Siegfried Mercelis, Ali Anwar
[ABSTRACT]
The transport industry has recently shown significant interest in unmanned
surface vehicles (USVs), specifically for port and inland waterway transport.
These systems can improve operational efficiency and safety, which is
especially relevant in the European Union, where initiatives such as the Green
Deal are driving a shift towards increased use of inland waterways. At the same
time, a shortage of qualified personnel is accelerating the adoption of
autonomous solutions. However, there is a notable lack of open-source,
high-fidelity simulation frameworks and datasets for developing and evaluating
such solutions. To address these challenges, we introduce AirSim For Surface
Vehicles (ASVSim), an open-source simulation framework specifically designed
for autonomous shipping research in inland and port environments. The framework
combines simulated vessel dynamics with marine sensor simulation capabilities,
including radar and camera systems, and supports the generation of synthetic
datasets for training computer vision models and reinforcement learning agents.
Built upon Cosys-AirSim, ASVSim provides a comprehensive platform for
developing autonomous navigation algorithms and generating synthetic datasets.
The simulator supports research of both traditional control methods and deep
learning-based approaches. Through limited experiments, we demonstrate the
potential of the simulator in these research areas. ASVSim is provided as an
open-source project under the MIT license, making autonomous navigation
research accessible to a larger part of the ocean engineering community.
[COMMENTS]
14 Pages, 11 Figures
[LINK]
http://arxiv.org/abs/2506.22174v1
[DATE]
2025-06-27 20:39:16+08:00
[CATEGORIES]
cs.LG
Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs
[AUTHORS]
Amirmohammad Izadi, Mohammad Ali Banayeeanzade, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah
[ABSTRACT]
Despite progress in Vision-Language Models (VLMs), their capacity for visual
reasoning is often limited by the binding problem: the failure to
reliably associate perceptual features with their correct visual referents.
This limitation underlies persistent errors in tasks such as counting, visual
search, scene description, and spatial relationship understanding. A key factor
is that current VLMs process visual features largely in parallel, lacking
mechanisms for spatially grounded, serial attention. This paper introduces a
simple yet effective intervention: augmenting visual inputs with low-level
spatial structures (e.g., horizontal lines) and pairing this with a textual
prompt that encourages sequential, spatially-aware parsing. We empirically
demonstrate substantial performance improvements across core visual reasoning
tasks. Specifically, our method improves GPT-4o visual search accuracy by
25.00%, increases counting accuracy by 26.83%, reduces edit distance error in
scene description by 0.32, and enhances performance on spatial relationship
tasks by 9.50% on a 2D synthetic dataset. Furthermore, we find that the
visual modification is essential for these gains; purely textual strategies,
including Chain-of-Thought prompting, are insufficient and can even degrade
performance. Our method enhances binding with only a single-query inference,
underscoring the importance of visual input design over purely
linguistically-based approaches. These findings suggest that low-level visual
structuring is a powerful and underexplored direction for improving
compositional visual reasoning and could serve as a general strategy for
enhancing VLM performance on spatially grounded tasks.
[LINK]
http://arxiv.org/abs/2506.22146v1
[DATE]
2025-06-27 19:44:40+08:00
[CATEGORIES]
cs.LG
Near Field Localization via AI-Aided Subspace Methods
[AUTHORS]
Arad Gast, Luc Le Magoarou, Nir Shlezinger
[ABSTRACT]
The increasing demands for high-throughput and energy-efficient wireless
communications are driving the adoption of extremely large antennas operating
at high-frequency bands. In these regimes, multiple users will reside in the
radiative near-field, and accurate localization becomes essential. Unlike
conventional far-field systems that rely solely on DOA estimation, near-field
localization exploits spherical wavefront propagation to recover both DOA and
range information. While subspace-based methods, such as MUSIC and its
extensions, offer high resolution and interpretability for near-field
localization, their performance is significantly impacted by model assumptions,
including non-coherent sources, well-calibrated arrays, and a sufficient number
of snapshots. To address these limitations, this work proposes AI-aided
subspace methods for near-field localization that enhance robustness to
real-world challenges. Specifically, we introduce NF-SubspaceNet, a deep
learning-augmented 2D MUSIC algorithm that learns a surrogate covariance matrix
to improve localization under challenging conditions, and DCD-MUSIC, a cascaded
AI-aided approach that decouples angle and range estimation to reduce
computational complexity. We further develop a novel model-order-aware training
method to accurately estimate the number of sources, which is combined with
casting near-field subspace methods as AI models for learning. Extensive
simulations demonstrate that the proposed methods outperform classical and
existing deep-learning-based localization techniques, providing robust
near-field localization even under coherent sources, miscalibrations, and few
snapshots.
[COMMENTS]
Under review for publication in the IEEE
[LINK]
http://arxiv.org/abs/2504.00599v2
[DATE]
2025-06-27 19:37:04+08:00
[CATEGORIES]
cs.LG
Earthquake Damage Grades Prediction using An Ensemble Approach Integrating Advanced Machine and Deep Learning Models
[AUTHORS]
Anurag Panda, Gaurav Kumar Yadav
[ABSTRACT]
In the aftermath of major earthquakes, evaluating structural and
infrastructural damage is vital for coordinating post-disaster response
efforts. This includes assessing damage’s extent and spatial distribution to
prioritize rescue operations and resource allocation. Accurately estimating
damage grades to buildings post-earthquake is paramount for effective response
and recovery, given the significant impact on lives and properties,
underscoring the urgency of streamlining relief fund allocation processes.
Previous studies have shown the effectiveness of multi-class classification,
especially XGBoost, along with other machine learning models and ensembling
methods, incorporating regularization to address class imbalance. One
consequence of class imbalance is that it may give rise to skewed models that
undervalue minority classes and give preference to the majority class. This
research deals with the problem of class imbalance with the help of the
synthetic minority oversampling technique (SMOTE). We examine multiple
multi-class classification machine learning models, deep learning models, and
ensembling methods to forecast structural damage grades. The study elucidates
performance determinants through comprehensive feature manipulation experiments
and diverse training approaches. It identifies key factors contributing to
seismic vulnerability while evaluating model performance using techniques like
the confusion matrix to further enhance understanding of the effectiveness of
earthquake damage prediction.
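
The abstract names SMOTE without describing it. As a minimal sketch of the general technique (not the authors' pipeline), the function below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote_sketch` and the parameters `k` and `seed` are hypothetical and chosen only for illustration.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Illustrative only: real SMOTE implementations add many refinements.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate a new point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=5)
```

Each synthetic point lies on a segment between two existing minority points, so the minority class is enlarged without simply duplicating rows.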
[COMMENTS]
3rd International Conference on Applied Mathematics in Science and
Engineering
[LINK]
http://arxiv.org/abs/2506.22129v1
[DATE]
2025-06-27 19:12:37+08:00
[CATEGORIES]
cs.LG
Generative AI for O-RAN Slicing: A Semi-Supervised Approach with VAE and Contrastive Learning
[AUTHORS]
Salar Nouri, Mojdeh Karbalaee Motalleb, Vahid Shah-Mansouri, Seyed Pooya Shariatpanahi
[ABSTRACT]
This paper introduces a novel generative AI (GAI)-driven, unified
semi-supervised learning architecture for optimizing resource allocation and
network slicing in O-RAN. Termed Generative Semi-Supervised VAE-Contrastive
Learning, our approach maximizes the weighted user equipment (UE) throughput
and allocates physical resource blocks (PRBs) to enhance the quality of service
for eMBB and URLLC services. The GAI framework utilizes a dedicated xApp for
intelligent power control and PRB allocation. This integrated GAI model
synergistically combines the generative power of a VAE with contrastive
learning to achieve robustness in an end-to-end trainable system. It is a
semi-supervised training approach that concurrently optimizes supervised
regression of resource allocation decisions (i.e., power, UE association, PRB)
and unsupervised contrastive objectives. This intrinsic fusion improves the
precision of resource management and model generalization in dynamic mobile
networks. We evaluated our GAI methodology against exhaustive search and deep
Q-Network algorithms using key performance metrics. Results show our integrated
GAI approach offers superior efficiency and effectiveness in various scenarios,
presenting a compelling GAI-based solution for critical network slicing and
resource management challenges in next-generation O-RAN systems.
[LINK]
http://arxiv.org/abs/2401.08861v3
[DATE]
2025-06-27 18:51:47+08:00
[CATEGORIES]
cs.LG
Tied Prototype Model for Few-Shot Medical Image Segmentation
[AUTHORS]
Hyeongji Kim, Stine Hansen, Michael Kampffmeyer
[ABSTRACT]
Common prototype-based medical image few-shot segmentation (FSS) methods
model foreground and background classes using class-specific prototypes.
However, given the high variability of the background, a more promising
direction is to focus solely on foreground modeling, treating the background as
an anomaly – an approach introduced by ADNet. Yet, ADNet faces three key
limitations: dependence on a single prototype per class, a focus on binary
classification, and fixed thresholds that fail to adapt to patient and organ
variability. To address these shortcomings, we propose the Tied Prototype Model
(TPM), a principled reformulation of ADNet with tied prototype locations for
foreground and background distributions. Building on its probabilistic
foundation, TPM naturally extends to multiple prototypes and multi-class
segmentation while effectively separating non-typical background features.
Notably, both extensions lead to improved segmentation accuracy. Finally, we
leverage naturally occurring class priors to define an ideal target for
adaptive thresholds, boosting segmentation performance. Taken together, TPM
provides a fresh perspective on prototype-based FSS for medical image
segmentation. The code can be found at https://github.com/hjk92g/TPM-FSS.
[COMMENTS]
Submitted version (MICCAI). Accepted at MICCAI 2025. The code repo
will be made publicly available soon
[LINK]
http://arxiv.org/abs/2506.22101v1
[DATE]
2025-06-27 18:33:55+08:00
[CATEGORIES]
cs.LG
SONG: Self-Organizing Neural Graphs
[AUTHORS]
Łukasz Struski, Tomasz Danel, Marek Śmieja, Jacek Tabor, Bartosz Zieliński
[ABSTRACT]
Recent years have seen a surge in research on deep interpretable neural
networks with decision trees as one of the most commonly incorporated tools.
There are at least three advantages of using decision trees over logistic
regression classification models: they are easy to interpret since they are
based on binary decisions, they can make decisions faster, and they provide a
hierarchy of classes. However, one of the well-known drawbacks of decision
trees, as compared to decision graphs, is that decision trees cannot reuse the
decision nodes. Nevertheless, decision graphs were not commonly used in deep
learning due to the lack of efficient gradient-based training techniques. In
this paper, we fill this gap and provide a general paradigm based on Markov
processes, which allows for efficient training of the special type of decision
graphs, which we call Self-Organizing Neural Graphs (SONG). We provide an
extensive theoretical study of SONG, complemented by experiments conducted on
Letter, Connect4, MNIST, CIFAR, and TinyImageNet datasets, showing that our
method performs on par or better than existing decision models.
[COMMENTS]
Accepted in WACV 2023
[LINK]
http://arxiv.org/abs/2107.13214v2
[DATE]
2025-06-27 18:23:30+08:00
[CATEGORIES]
cs.LG
Transformers are Graph Neural Networks
[AUTHORS]
Chaitanya K. Joshi
[ABSTRACT]
We establish connections between the Transformer architecture, originally
introduced for natural language processing, and Graph Neural Networks (GNNs)
for representation learning on graphs. We show how Transformers can be viewed
as message passing GNNs operating on fully connected graphs of tokens, where
the self-attention mechanism captures the relative importance of all tokens
w.r.t. each other, and positional encodings provide hints about sequential
ordering or structure. Thus, Transformers are expressive set processing
networks that learn relationships among input elements without being
constrained by a priori graphs. Despite this mathematical connection to GNNs,
Transformers are implemented via dense matrix operations that are significantly
more efficient on modern hardware than sparse message passing. This leads to
the perspective that Transformers are GNNs currently winning the hardware
lottery.
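
To make the message-passing view concrete, here is a minimal sketch (an illustration, not code from the paper) of single-head scaled dot-product attention written as one round of message passing on a fully connected token graph. It assumes the query, key, and value vectors are already computed; the function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_as_message_passing(queries, keys, values):
    """Each token i receives a message from every token j (fully connected graph)."""
    n = len(queries)
    d = len(queries[0])
    outputs = []
    for i in range(n):  # receiving node
        # Edge score from node j to node i: scaled dot product of q_i and k_j.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in range(n)
        ]
        weights = softmax(scores)  # normalised edge weights into node i
        # Aggregation: weighted sum of the neighbours' value vectors.
        aggregated = [
            sum(weights[j] * values[j][c] for j in range(n))
            for c in range(len(values[0]))
        ]
        outputs.append(aggregated)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = attention_as_message_passing(tokens, tokens, tokens)
```

The explicit double loop over nodes is what a sparse message-passing implementation would do; in practice the same computation is expressed as dense matrix products, which is the efficiency point the abstract makes.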
[COMMENTS]
This paper is a technical version of an article in The Gradient at
https://thegradient.pub/transformers-are-graph-neural-networks/
[LINK]
http://arxiv.org/abs/2506.22084v1
[DATE]
2025-06-27 18:15:33+08:00
[CATEGORIES]
cs.LG
Forecasting the future development in quality and value of professional football players
[AUTHORS]
Koen W. van Arem, Floris Goes-Smit, Jakob Söhl
[ABSTRACT]
Transfers in professional football (soccer) are risky investments because of
the large transfer fees and high risks involved. Although data-driven models
can be used to improve transfer decisions, existing models focus on describing
players’ historical progress, leaving their future performance unknown.
Moreover, recent developments have called for the use of explainable models
combined with uncertainty quantification of predictions. This paper assesses
explainable machine learning models based on predictive accuracy and
uncertainty quantification methods for the prediction of the future development
in quality and transfer value of professional football players. The predictive
accuracy is studied by training the models to predict the quality and value of
players one year ahead. This is carried out by training them on two data sets
containing data-driven indicators describing the player quality and player
value in historical settings. In general, the random forest model is found to
be the most suitable model because it provides accurate predictions as well as
an uncertainty quantification method that naturally arises from the bagging
procedure of the random forest model. Additionally, this research shows that
the development of player performance contains nonlinear patterns and
interactions between variables, and that time series information can provide
useful information for the modeling of player performance metrics. The
resulting models can help football clubs make more informed, data-driven
transfer decisions by forecasting player quality and transfer value.
[COMMENTS]
The article itself is on the pages 1-31. The data set used in this
article is described in the appendix at the pages 32-39
[LINK]
http://arxiv.org/abs/2502.07528v2
[DATE]
2025-06-27 17:47:13+08:00
[CATEGORIES]
cs.LG
Hyper-modal Imputation Diffusion Embedding with Dual-Distillation for Federated Multimodal Knowledge Graph Completion
[AUTHORS]
Ying Zhang, Yu Zhao, Xuhui Sui, Baohang Zhou, Xiangrui Cai, Li Shen, Xiaojie Yuan, Dacheng Tao
[ABSTRACT]
With the increasing multimodal knowledge privatization requirements,
multimodal knowledge graphs in different institutes are usually decentralized,
lacking an effective collaboration system with both stronger reasoning ability
and transmission safety guarantees. In this paper, we propose the Federated
Multimodal Knowledge Graph Completion (FedMKGC) task, aiming at training over
federated MKGs for better predicting the missing links in clients without
sharing sensitive knowledge. We propose a framework named MMFeD3-HidE for
addressing multimodal uncertain unavailability and multimodal client
heterogeneity challenges of FedMKGC. (1) Inside the clients, our proposed
Hyper-modal Imputation Diffusion Embedding model (HidE) recovers the complete
multimodal distributions from incomplete entity embeddings constrained by
available modalities. (2) Among clients, our proposed Multimodal FeDerated Dual
Distillation (MMFeD3) transfers knowledge mutually between clients and the
server with logit and feature distillation to improve both global convergence
and semantic consistency. We propose a FedMKGC benchmark for a comprehensive
evaluation, consisting of a general FedMKGC backbone named MMFedE, datasets
with heterogeneous multimodal information, and three groups of constructed
baselines. Experiments conducted on our benchmark validate the effectiveness,
semantic consistency, and convergence robustness of MMFeD3-HidE.
[COMMENTS]
Submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2506.22036v1
[DATE]
2025-06-27 17:32:58+08:00
[CATEGORIES]
cs.LG
CAPM: Fast and Robust Verification on Maxpool-based CNN via Dual Network
[AUTHORS]
Jia-Hau Bai, Chi-Ting Liu, Yu Wang, Fu-Chieh Chang, Pei-Yuan Wu
[ABSTRACT]
This study uses CAPM (Convex Adversarial Polytope for Maxpool-based CNN) to
improve the verified bound for general purpose maxpool-based convolutional
neural networks (CNNs) under bounded norm adversarial perturbations. The
maxpool function is decomposed as a series of ReLU functions to extend the
convex relaxation technique to maxpool functions, by which the verified bound
can be efficiently computed through a dual network. The experimental results
demonstrate that this technique achieves state-of-the-art verification
precision for maxpool-based CNNs and involves a much lower computational cost
than current verification methods, such as DeepZ, DeepPoly and PRIMA. This
method is also applicable to large-scale CNNs, for which previous studies show
verification is often computationally prohibitive. Under certain circumstances,
CAPM is 40 times, 20 times, or twice as fast and gives a significantly higher
verification bound (CAPM 98% vs. PRIMA 76%/DeepPoly 73%/DeepZ 8%) compared
to PRIMA/DeepPoly/DeepZ. We also present the time
complexity of our algorithm as $O(W^2NK)$, where $W$ is the maximum width of
the neural network, $N$ is the number of neurons, and $K$ is the size of the
maxpool layer’s kernel.
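
The abstract says the maxpool function is decomposed into a series of ReLU functions but does not give the decomposition. One standard identity that achieves this is max(a, b) = b + ReLU(a - b), applied repeatedly across the pooling window. The sketch below illustrates that identity only; the exact decomposition used by CAPM may differ, and the function names are hypothetical.

```python
from functools import reduce


def relu(x):
    """Rectified linear unit on a scalar."""
    return x if x > 0 else 0.0


def max_via_relu(a, b):
    """Pairwise maximum written with a single ReLU: max(a, b) = b + ReLU(a - b)."""
    return b + relu(a - b)


def maxpool_via_relu(window):
    """Maximum over a pooling window as a chain of pairwise ReLU steps."""
    return reduce(max_via_relu, window)


window = [0.5, -1.0, 3.0, 2.0]
pooled = maxpool_via_relu(window)  # agrees with max(window) up to rounding
```

Writing maxpool this way means any relaxation technique that already handles ReLU can, in principle, be reused for maxpool layers.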
[LINK]
http://arxiv.org/abs/2407.09550v3
[DATE]
2025-06-27 17:26:12+08:00
[CATEGORIES]
cs.LG
Learning Data-Driven Uncertainty Set Partitions for Robust and Adaptive Energy Forecasting with Missing Data
[AUTHORS]
Akylas Stratigakos, Panagiotis Andrianesis
[ABSTRACT]
Short-term forecasting models typically assume the availability of input data
(features) when they are deployed and in use. However, equipment failures,
disruptions, or cyberattacks may lead to missing features when such models are
used operationally, which could negatively affect forecast accuracy, and result
in suboptimal operational decisions. In this paper, we use adaptive robust
optimization and adversarial machine learning to develop forecasting models
that seamlessly handle missing data operationally. We propose linear- and
neural network-based forecasting models with parameters that adapt to available
features, combining linear adaptation with a novel algorithm for learning
data-driven uncertainty set partitions. The proposed adaptive models do not
rely on identifying historical missing data patterns and are suitable for
real-time operations under stringent time constraints. Extensive numerical
experiments on short-term wind power forecasting considering horizons from 15
minutes to 4 hours ahead illustrate that our proposed adaptive models are on
par with imputation when data are missing for very short periods (e.g., when
only the latest measurement is missing) whereas they significantly outperform
imputation when data are missing for longer periods. We further provide
insights by showcasing how linear adaptation and data-driven partitions (even
with a few subsets) approach the performance of the optimal, yet impractical,
method of retraining for every possible realization of missing data.
[COMMENTS]
Revised version, submitted to IEEE-TSG
[LINK]
http://arxiv.org/abs/2503.20410v2
[DATE]
2025-06-27 17:15:39+08:00
[CATEGORIES]
cs.LG
C-Learner: Constrained Learning for Causal Inference
[AUTHORS]
Tiffany Tianhui Cai, Yuri Fonseca, Kaiwen Hou, Hongseok Namkoong
[ABSTRACT]
Popular debiased estimation methods for causal inference – such as augmented
inverse propensity weighting and targeted maximum likelihood estimation –
enjoy desirable asymptotic properties like statistical efficiency and double
robustness but they can produce unstable estimates when there is limited
overlap between treatment and control, requiring additional assumptions or ad
hoc adjustments in practice (e.g., truncating propensity scores). In contrast,
simple plug-in estimators are stable but lack desirable asymptotic properties.
We propose a novel debiasing approach that achieves the best of both worlds,
producing stable plug-in estimates with desirable asymptotic properties. Our
constrained learning framework solves for the best plug-in estimator under the
constraint that the first-order error with respect to the plugged-in quantity
is zero, and can leverage flexible model classes including neural networks and
tree ensembles. In several experimental settings, including ones in which we
handle text-based covariates by fine-tuning language models, our constrained
learning-based estimator outperforms basic versions of one-step estimation and
targeting in challenging settings with limited overlap between treatment and
control, and performs similarly otherwise.
[LINK]
http://arxiv.org/abs/2405.09493v5
[DATE]
2025-06-27 16:45:28+08:00
[CATEGORIES]
cs.LG
Distilling the Unknown to Unveil Certainty
[AUTHORS]
Zhilin Zhao, Longbing Cao, Yixuan Zhang, Kun-Yu Lin, Wei-Shi Zheng
[ABSTRACT]
Out-of-distribution (OOD) detection is critical for identifying test samples
that deviate from in-distribution (ID) data, ensuring network robustness and
reliability. This paper presents a flexible framework for OOD knowledge
distillation that extracts OOD-sensitive information from a network to develop
a binary classifier capable of distinguishing between ID and OOD samples in
both scenarios, with and without access to training ID data. To accomplish
this, we introduce Confidence Amendment (CA), an innovative methodology that
transforms an OOD sample into an ID one while progressively amending prediction
confidence derived from the network to enhance OOD sensitivity. This approach
enables the simultaneous synthesis of both ID and OOD samples, each accompanied
by an adjusted prediction confidence, thereby facilitating the training of a
binary classifier sensitive to OOD. Theoretical analysis provides bounds on the
generalization error of the binary classifier, demonstrating the pivotal role
of confidence amendment in enhancing OOD sensitivity. Extensive experiments
spanning various datasets and network architectures confirm the efficacy of the
proposed method in detecting OOD samples.
[LINK]
http://arxiv.org/abs/2311.07975v3
[DATE]
2025-06-27 16:23:39+08:00
[CATEGORIES]
cs.LG
TROFI: Trajectory-Ranked Offline Inverse Reinforcement Learning
[AUTHORS]
Alessandro Sestini, Joakim Bergdahl, Konrad Tollmar, Andrew D. Bagdanov, Linus Gisslén
[ABSTRACT]
In offline reinforcement learning, agents are trained using only a fixed set
of stored transitions derived from a source policy. However, this requires that
the dataset be labeled by a reward function. In applied settings such as video
game development, the availability of the reward function is not always
guaranteed. This paper proposes Trajectory-Ranked OFfline Inverse reinforcement
learning (TROFI), a novel approach to effectively learn a policy offline
without a pre-defined reward function. TROFI first learns a reward function
from human preferences, which it then uses to label the original dataset, making
it usable for training the policy. In contrast to other approaches, our method
does not require optimal trajectories. Through experiments on the D4RL
benchmark we demonstrate that TROFI consistently outperforms baselines and
performs comparably to using the ground truth reward to learn policies.
Additionally, we validate the efficacy of our method in a 3D game environment.
Our studies of the reward model highlight the importance of the reward function
in this setting: we show that to ensure the alignment of a value function to
the actual future discounted reward, it is fundamental to have a
well-engineered and easy-to-learn reward function.
[COMMENTS]
Published at Reinforcement Learning and Video Games Workshop at RLC
2025
[LINK]
http://arxiv.org/abs/2506.22008v1
[DATE]
2025-06-27 16:22:41+08:00
[CATEGORIES]
cs.LG
GKNet: Graph Kalman Filtering and Model Inference via Model-based Deep Learning
[AUTHORS]
Mohammad Sabbaqi, Riccardo Taormina, Elvin Isufi
[ABSTRACT]
Inference tasks with time series over graphs are of importance in
applications such as urban water networks, economics, and networked
neuroscience. Addressing these tasks typically relies on identifying a
computationally affordable model that jointly captures the graph-temporal
patterns of the data. In this work, we propose a graph-aware state space model
for graph time series, where both the latent state and the observation equation
are parametric graph-induced models with a limited number of parameters that
need to be learned. More specifically, we consider the state equation to follow
a stochastic partial differential equation driven by noise over the graph
edges, accounting not only for potential edge uncertainties but also for
increasing the degrees of freedom in the latter in a tractable manner. The
graph structure conditioning of the noise dispersion allows the state variable
to deviate from the stochastic process in certain neighborhoods. The
observation model is a sampled and graph-filtered version of the state
capturing multi-hop neighboring influence. The goal is to learn the parameters
in both state and observation models from the partially observed data for
downstream tasks such as prediction and imputation. The model is inferred first
through a maximum likelihood approach that provides theoretical tractability
but is limited in expressivity and scalability. To improve on the latter, we
use the state-space formulation to build a principled deep learning
architecture that jointly learns the parameters and tracks the state in an
end-to-end manner in the spirit of Kalman neural networks.
[LINK]
http://arxiv.org/abs/2506.22004v1
[DATE]
2025-06-27 16:17:07+08:00
[CATEGORIES]
cs.LG
Generative adversarial neural networks for simulating neutrino interactions
[AUTHORS]
Jose L. Bonilla, Krzysztof M. Graczyk, Artur M. Ankowski, Rwik Dharmapal Banerjee, Beata E. Kowal, Hemant Prasad, Jan T. Sobczyk
[ABSTRACT]
We propose a new approach to simulate neutrino scattering events as an
alternative to the standard Monte Carlo generator approach. Generative
adversarial neural network (GAN) models are developed to simulate charged
current neutrino-carbon collisions in the few-GeV energy range. We consider a
simplified framework to generate muon kinematic variables, specifically its
energy and scattering angle. GAN models are trained on simulation data from
the NuWro Monte Carlo event generator. Two GAN models have been obtained: one
simulating quasielastic neutrino-nucleus scatterings and another simulating all
interactions at given neutrino energy. The models work for neutrino energy
ranging from 300 MeV to 10 GeV. The performance of both models has been
assessed using two statistical metrics. It is shown that both GAN models
successfully reproduce the distribution of muon kinematics.
[COMMENTS]
16 pages, 16 figures
[LINK]
http://arxiv.org/abs/2502.20244v2
[DATE]
2025-06-27 16:14:44+08:00
[CATEGORIES]
cs.LG
Binned semiparametric Bayesian networks
[AUTHORS]
Rafael Sojo, Javier Díaz-Rozo, Concha Bielza, Pedro Larrañaga
[ABSTRACT]
This paper introduces a new type of probabilistic semiparametric model that
takes advantage of data binning to reduce the computational cost of kernel
density estimation in nonparametric distributions. Two new conditional
probability distributions are developed for the new binned semiparametric
Bayesian networks, the sparse binned kernel density estimation and the Fourier
kernel density estimation. These two probability distributions address the
curse of dimensionality, which typically impacts binned models, by using sparse
tensors and restricting the number of parent nodes in conditional probability
calculations. To evaluate the proposal, we perform a complexity analysis and
conduct several comparative experiments using synthetic data and datasets from
the UCI Machine Learning repository. The experiments include different binning
rules, parent restrictions, grid sizes, and number of instances to get a
holistic view of the model’s behavior. As a result, our binned semiparametric
Bayesian networks achieve structural learning and log-likelihood estimations
with no statistically significant differences compared to the semiparametric
Bayesian networks, but at a much higher speed. Thus, the new binned
semiparametric Bayesian networks prove to be a reliable and more efficient
alternative to their non-binned counterparts.
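
As a rough illustration of why binning reduces the cost of kernel density estimation, the sketch below implements plain one-dimensional binned KDE with a Gaussian kernel: the data are first counted into a fixed grid, and the density is then a kernel sum over occupied bin centres rather than over every data point. This is not the sparse binned or Fourier estimator proposed in the paper, and all names and parameters here are hypothetical.

```python
import math


def binned_kde(data, grid_min, grid_max, n_bins, bandwidth):
    """Return a density function built from bin counts instead of raw points."""
    width = (grid_max - grid_min) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the index so values on or outside the grid edges stay in range.
        idx = int((x - grid_min) / width)
        idx = max(0, min(idx, n_bins - 1))
        counts[idx] += 1
    centers = [grid_min + (i + 0.5) * width for i in range(n_bins)]
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # The cost depends on the number of occupied bins, not on len(data).
        return norm * sum(
            c * math.exp(-0.5 * ((x - m) / bandwidth) ** 2)
            for c, m in zip(counts, centers)
            if c
        )

    return density


density = binned_kde([0.1, 0.2, 0.25, 0.8], 0.0, 1.0, n_bins=10, bandwidth=0.1)
```

Skipping empty bins in the sum is the one-dimensional analogue of storing the bin counts sparsely; in higher dimensions most grid cells are empty, which is where the savings matter.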
[LINK]
http://arxiv.org/abs/2506.21997v1
[DATE]
2025-06-27 16:07:34+08:00
[CATEGORIES]
cs.LG
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
[AUTHORS]
Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao
[ABSTRACT]
Linear RNN architectures, like Mamba, can be competitive with Transformer
models in language modeling while having advantageous deployment
characteristics. Given the focus on training large-scale Transformer models, we
consider the challenge of converting these pretrained models for deployment. We
demonstrate that it is feasible to distill large Transformers into linear RNNs
by reusing the linear projection weights from attention layers with academic
GPU resources. The resulting hybrid model, which incorporates a quarter of the
attention layers, achieves performance comparable to the original Transformer
in chat benchmarks and outperforms open-source hybrid Mamba models trained from
scratch with trillions of tokens in both chat benchmarks and general
benchmarks. Moreover, we introduce a hardware-aware speculative decoding
algorithm that accelerates the inference speed of Mamba and hybrid models.
Overall we show how, with limited computation resources, we can remove many of
the original attention layers and generate from the resulting model more
efficiently. Our top-performing model, distilled from Llama3-8B-Instruct,
achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and
7.35 on MT-Bench, surpassing the best 8B scale instruction-tuned linear RNN
model. We also find that the distilled model has natural length extrapolation,
showing almost perfect accuracy in the needle-in-a-haystack test at 20x the
distillation length. Code and pre-trained checkpoints are open-sourced at
https://github.com/jxiw/MambaInLlama and
https://github.com/itsdaniele/speculative_mamba.
[COMMENTS]
NeurIPS 2024. v4 updates: mention concurrent work of speculative
decoding for SSM
[LINK]
http://arxiv.org/abs/2408.15237v4
[DATE]
2025-06-27 15:54:57+08:00
[CATEGORIES]
cs.LG
Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein
[AUTHORS]
Hugues Van Assel, Cédric Vincent-Cuaz, Nicolas Courty, Rémi Flamary, Pascal Frossard, Titouan Vayer
[ABSTRACT]
Unsupervised learning aims to capture the underlying structure of potentially
large and high-dimensional datasets. Traditionally, this involves using
dimensionality reduction (DR) methods to project data onto lower-dimensional
spaces or organizing points into meaningful clusters (clustering). In this
work, we revisit these approaches under the lens of optimal transport and
exhibit relationships with the Gromov-Wasserstein problem. This unveils a new
general framework, called distributional reduction, that recovers DR and
clustering as special cases and allows addressing them jointly within a single
optimization problem. We empirically demonstrate its relevance to the
identification of low-dimensional prototypes representing data at different
scales, across multiple image and genomic datasets.
[COMMENTS]
45 pages, 20 figures
[LINK]
http://arxiv.org/abs/2402.02239v3
[DATE]
2025-06-27 15:44:55+08:00
[CATEGORIES]
cs.LG
Green LIME: Improving AI Explainability through Design of Experiments
[AUTHORS]
Alexandra Stadler, Werner G. Müller, Radoslav Harman
[ABSTRACT]
In artificial intelligence (AI), the complexity of many models and processes
surpasses human understanding, making it challenging to determine why a
specific prediction is made. This lack of transparency is particularly
problematic in critical fields like healthcare, where trust in a model’s
predictions is paramount. As a result, the explainability of machine learning
(ML) and other complex models has become a key area of focus. Efforts to
improve model explainability often involve experimenting with AI systems and
approximating their behavior through interpretable surrogate mechanisms.
However, these procedures can be resource-intensive. Optimal design of
experiments, which seeks to maximize the information obtained from a limited
number of observations, offers promising methods for improving the efficiency
of these explainability techniques. To demonstrate this potential, we explore
Local Interpretable Model-agnostic Explanations (LIME), a widely used method
introduced by Ribeiro et al. (2016). LIME provides explanations by generating
new data points near the instance of interest and passing them through the
model. While effective, this process can be computationally expensive,
especially when predictions are costly or require many samples. LIME is highly
versatile and can be applied to a wide range of models and datasets. In this
work, we focus on models involving tabular data, regression tasks, and linear
models as interpretable local approximations. By utilizing optimal design of
experiments’ techniques, we reduce the number of function evaluations of the
complex model, thereby reducing the computational effort of LIME by a
significant amount. We consider this modified version of LIME to be
energy-efficient or “green”.
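
The LIME procedure summarized above (sample points near the instance, query
the model, fit a weighted linear surrogate) can be sketched in a few lines.
This is only an illustrative one-feature version, not the authors' method: the
function name lime_1d, the Gaussian sampling scale, and the kernel width are
assumptions, and the sampling here is purely random rather than chosen by
optimal design. The n_samples argument is the number of black-box evaluations,
which is the cost the abstract aims to reduce.

```python
import math
import random


def lime_1d(black_box, x0, n_samples=50, scale=0.5, kernel_width=0.75, seed=0):
    """Fit a local weighted linear surrogate to black_box around x0.

    All parameter names and defaults are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    # 1. Generate new data points near the instance of interest.
    xs = [x0 + rng.gauss(0.0, scale) for _ in range(n_samples)]
    # 2. Pass them through the (possibly expensive) model.
    ys = [black_box(x) for x in xs]
    # 3. Weight each sample by its proximity to x0.
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]
    # 4. Closed-form weighted least squares for a line y = slope * x + intercept.
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


# For a model that is already linear, the surrogate recovers it.
slope, intercept = lime_1d(lambda x: 3.0 * x + 1.0, x0=2.0)
```
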
[LINK]
http://arxiv.org/abs/2502.12753v2
[DATE]
2025-06-27 15:44:25+08:00
[CATEGORIES]
cs.LG
Spectraformer: A Unified Random Feature Framework for Transformer
[AUTHORS]
Duke Nguyen, Du Yin, Aditya Joshi, Flora Salim
[ABSTRACT]
Linearization of attention using various kernel approximation and kernel
learning techniques has shown promise. Past methods used a subset of
combinations of component functions and weight matrices within the random
feature paradigm. We identify the need for a systematic comparison of different
combinations of weight matrices and component functions for attention learning
in Transformer. Hence, we introduce Spectraformer, a unified framework for
approximating and learning the kernel function in the attention mechanism of
the Transformer. Our empirical results demonstrate, for the first time, that a
random feature-based approach can achieve performance comparable to
top-performing sparse and low-rank methods on the challenging Long Range Arena
benchmark. Thus, we establish a new state-of-the-art for random feature-based
efficient Transformers. The framework also produces many variants that offer
different advantages in accuracy, training time, and memory consumption. Our
code is available at: https://github.com/cruiseresearchgroup/spectraformer .
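
As a rough illustration of the random feature paradigm mentioned above, the
sketch below pairs one weight matrix (Gaussian samples) with one component
function (cosine) to approximate a Gaussian kernel with an inner product of
feature maps. This is the classic random Fourier feature construction, shown
as an assumption-laden example; it is not one of Spectraformer's specific
combinations, and the names, dimensions, and inputs q and k are hypothetical.

```python
import math
import random


def random_fourier_features(x, weights, offsets):
    """Map x to a feature vector whose inner products approximate a Gaussian kernel."""
    d = len(weights)
    return [
        math.sqrt(2.0 / d) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(weights, offsets)
    ]


def gaussian_kernel(x, y):
    """Exact kernel exp(-||x - y||^2 / 2), the target of the approximation."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))


rng = random.Random(0)
dim, n_features = 4, 2000
# "Weight matrix": rows drawn from a standard normal distribution.
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]
# Random phase offsets, uniform on [0, 2*pi).
offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]

q = [0.3, -0.1, 0.2, 0.5]   # hypothetical query vector
k = [0.1, 0.4, -0.2, 0.3]   # hypothetical key vector
phi_q = random_fourier_features(q, weights, offsets)
phi_k = random_fourier_features(k, weights, offsets)
approx = sum(a * b for a, b in zip(phi_q, phi_k))
exact = gaussian_kernel(q, k)
```

Because the kernel value becomes a plain inner product of feature vectors,
sums over keys can be computed once and reused for every query, which is what
makes attention linear in sequence length under this kind of approximation.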
[LINK]
http://arxiv.org/abs/2405.15310v4
[DATE]
2025-06-27 15:39:14+08:00
[CATEGORIES]
cs.LG
SceneDiffuser++: City-Scale Traffic Simulation via a Generative World Model
[AUTHORS]
Shuhan Tan, John Lambert, Hong Jeon, Sakshum Kulshrestha, Yijing Bai, Jing Luo, Dragomir Anguelov, Mingxing Tan, Chiyu Max Jiang
[ABSTRACT]
The goal of traffic simulation is to augment the potentially limited number of
manually driven miles available for testing and validation with a much larger
number of simulated synthetic miles. The culmination of this vision
would be a generative simulated city, where given a map of the city and an
autonomous vehicle (AV) software stack, the simulator can seamlessly simulate
the trip from point A to point B by populating the city around the AV and
controlling all aspects of the scene, from animating the dynamic agents (e.g.,
vehicles, pedestrians) to controlling the traffic light states. We refer to
this vision as CitySim, which requires an agglomeration of simulation
technologies: scene generation to populate the initial scene, agent behavior
modeling to animate the scene, occlusion reasoning, dynamic scene generation to
seamlessly spawn and remove agents, and environment simulation for factors such
as traffic lights. While some key technologies have been separately studied in
various works, others such as dynamic scene generation and environment
simulation have received less attention in the research community. We propose
SceneDiffuser++, the first end-to-end generative world model trained on a
single loss function capable of point A-to-B simulation on a city scale
integrating all the requirements above. We demonstrate the city-scale traffic
simulation capability of SceneDiffuser++ and study its superior realism under
long simulation conditions. We evaluate the simulation quality on an augmented
version of the Waymo Open Motion Dataset (WOMD) with larger map regions to
support trip-level simulation.
[COMMENTS]
Accepted to CVPR 2025
[LINK]
http://arxiv.org/abs/2506.21976v1
[DATE]
2025-06-27 15:35:04+08:00
[CATEGORIES]
cs.LG
Mitigating Metropolitan Carbon Emissions with Dynamic Eco-driving at Scale
[AUTHORS]
Vindula Jayawardana, Baptiste Freydt, Ao Qu, Cameron Hickert, Edgar Sanchez, Catherine Tang, Mark Taylor, Blaine Leonard, Cathy Wu
[ABSTRACT]
The sheer scale and diversity of transportation make it a formidable sector
to decarbonize. Here, we consider an emerging opportunity to reduce carbon
emissions: the growing adoption of semi-autonomous vehicles, which can be
programmed to mitigate stop-and-go traffic through intelligent speed commands
and, thus, reduce emissions. But would such dynamic eco-driving move the needle
on climate change? A comprehensive impact analysis has been out of reach due to
the vast array of traffic scenarios and the complexity of vehicle emissions. We
address this challenge with large-scale scenario modeling efforts and by using
multi-task deep reinforcement learning with a carefully designed network
decomposition strategy. We perform an in-depth prospective impact assessment of
dynamic eco-driving at 6,011 signalized intersections across three major US
metropolitan cities, simulating a million traffic scenarios. Overall, we find
that vehicle trajectories optimized for emissions can cut city-wide
intersection carbon emissions by 11-22%, without harming throughput or safety,
and with reasonable assumptions, equivalent to the national emissions of Israel
and Nigeria, respectively. We find that 10% eco-driving adoption yields 25%-50%
of the total reduction, and nearly 70% of the benefits come from 20% of
intersections, suggesting near-term implementation pathways. However, the
composition of this high-impact subset of intersections varies considerably
across different adoption levels, with minimal overlap, calling for careful
strategic planning for eco-driving deployments. Moreover, the impact of
eco-driving, when considered jointly with projections of vehicle
electrification and hybrid vehicle adoption, remains significant. More broadly,
this work paves the way for large-scale analysis of traffic externalities, such
as time, safety, and air quality, and the potential impact of solution
strategies.
[COMMENTS]
Accepted for publication at Transportation Research Part C: Emerging
Technologies
[LINK]
http://arxiv.org/abs/2408.05609v2
[DATE]
2025-06-27 15:16:41+08:00
[CATEGORIES]
cs.LG
On the Lipschitz Continuity of Set Aggregation Functions and Neural Networks for Sets
[AUTHORS]
Giannis Nikolentzos, Konstantinos Skianis
[ABSTRACT]
The Lipschitz constant of a neural network is connected to several important
properties of the network such as its robustness and generalization. It is thus
useful in many settings to estimate the Lipschitz constant of a model. Prior
work has focused mainly on estimating the Lipschitz constant of multi-layer
perceptrons and convolutional neural networks. Here we focus on data modeled as
sets or multisets of vectors and on neural networks that can handle such data.
These models typically apply some permutation invariant aggregation function,
such as the sum, mean or max operator, to the input multisets to produce a
single vector for each input sample. In this paper, we investigate whether
these aggregation functions are Lipschitz continuous with respect to three
distance functions for unordered multisets, and we compute their Lipschitz
constants. In the general case, we find that each aggregation function is
Lipschitz continuous with respect to only one of the three distance functions.
Then, we build on these results to derive upper bounds on the Lipschitz
constant of neural networks that can process multisets of vectors, while we
also study their stability to perturbations and generalization under
distribution shifts. To empirically verify our theoretical analysis, we conduct
a series of experiments on datasets from different domains.
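
To make the question concrete, the sketch below compares how much the sum and
max aggregations can change relative to the distance between two multisets of
scalars. The abstract does not name its three distance functions, so the
Hausdorff distance used here is an assumption chosen for illustration, as are
the helper names. Under this distance the max never changes by more than the
distance, while the ratio for the sum grows with the multiset size, so no
single Lipschitz constant can bound it.

```python
def hausdorff_1d(a, b):
    """Hausdorff distance between two non-empty multisets of real numbers."""
    d_ab = max(min(abs(x - y) for y in b) for x in a)
    d_ba = max(min(abs(x - y) for x in a) for y in b)
    return max(d_ab, d_ba)


def change_ratio(agg, a, b):
    """|agg(a) - agg(b)| divided by the distance between a and b."""
    return abs(agg(a) - agg(b)) / hausdorff_1d(a, b)


# Two multisets of n identical elements, shifted by 0.5.
ratios = {}
for n in (1, 10, 100):
    a = [0.0] * n
    b = [0.5] * n
    ratios[n] = (change_ratio(sum, a, b), change_ratio(max, a, b))
```
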
[LINK]
http://arxiv.org/abs/2505.24403v2
[DATE]
2025-06-27 14:58:00+08:00
[CATEGORIES]
cs.LG
Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement
[AUTHORS]
Hao Jiang, Yongxiang Tang, Yanxiang Zeng, Pengjia Yuan, Yanhua Cheng, Teng Sha, Xialong Liu, Peng Jiang
[ABSTRACT]
In the realm of online advertising, advertisers partake in ad auctions to
obtain advertising slots, frequently taking advantage of auto-bidding tools
provided by demand-side platforms. To improve the automation of these bidding
systems, we adopt generative models, namely the Decision Transformer (DT), to
tackle the difficulties inherent in automated bidding. Applying the Decision
Transformer to the auto-bidding task enables a unified approach to sequential
modeling, which efficiently overcomes short-sightedness by capturing long-term
dependencies between past bidding actions and user behavior. Nevertheless,
conventional DT has certain drawbacks: (1) DT necessitates a preset
return-to-go (RTG) value before generating actions, which is not inherently
produced; (2) The policy learned by DT is restricted by its training data,
which consists of mixed-quality trajectories. To address these challenges,
we introduce the R* Decision Transformer (R* DT), developed in a three-step
process: (1) R DT: Similar to traditional DT, R DT stores actions based on
state and RTG value, as well as memorizing the RTG for a given state using the
training set; (2) R^ DT: We forecast the highest value (within the training
set) of RTG for a given state, deriving a suboptimal policy based on the
current state and the forecasted supreme RTG value; (3) R* DT: Based on R^ DT,
we generate trajectories and select those with high rewards (using a simulator)
to augment our training dataset. This data enhancement has been shown to
improve the RTG of trajectories in the training data and gradually leads the
suboptimal policy towards optimality. Comprehensive tests on a publicly
available bidding dataset validate the R* DT’s efficacy and highlight its
superiority when dealing with mixed-quality trajectories.
[LINK]
http://arxiv.org/abs/2506.21956v1
[DATE]
2025-06-27 14:56:54+08:00
[CATEGORIES]
cs.LG
deCIFer: Crystal Structure Prediction from Powder Diffraction Data using Autoregressive Language Models
[AUTHORS]
Frederik Lizak Johansen, Ulrik Friis-Jensen, Erik Bjørnager Dam, Kirsten Marie Ørnsbjerg Jensen, Rocío Mercado, Raghavendra Selvan
[ABSTRACT]
Novel materials drive progress across applications from energy storage to
electronics. Automated characterization of material structures with machine
learning methods offers a promising strategy for accelerating this key step in
material design. In this work, we introduce an autoregressive language model
that performs crystal structure prediction (CSP) from powder diffraction data.
The presented model, deCIFer, generates crystal structures in the widely used
Crystallographic Information File (CIF) format and can be conditioned on powder
X-ray diffraction (PXRD) data. Unlike earlier works that primarily rely on
high-level descriptors like composition, deCIFer is also able to use
diffraction data to perform CSP. We train deCIFer on nearly 2.3M crystal
structures and validate on diverse sets of PXRD patterns for characterizing
challenging inorganic crystal systems. Qualitative checks and quantitative
assessments using the residual weighted profile show that deCIFer produces
structures that more accurately match the target diffraction data. Notably,
deCIFer can achieve a 94% match rate on test data. deCIFer bridges experimental
diffraction data with computational CSP, lending itself as a powerful tool for
crystal structure characterization.
[COMMENTS]
24 pages, 18 figures, 8 tables. v2: Figure 8 revision. v3: added
benchmarks, text revisions
[LINK]
http://arxiv.org/abs/2502.02189v3
[DATE]
2025-06-27 14:53:05+08:00
[CATEGORIES]
cs.LG
Physics-informed network paradigm with data generation and background noise removal for diverse distributed acoustic sensing applications
[AUTHORS]
Yangyang Wan, Haotian Wang, Xuhui Yu, Jiageng Chen, Xinyu Fan, Zuyuan He
[ABSTRACT]
Distributed acoustic sensing (DAS) has attracted considerable attention
across various fields, and artificial intelligence (AI) technology plays an
important role in DAS applications for event recognition and denoising.
Existing AI models require real-world data (RWD), whether labeled or not, for
training, which conflicts with the limited availability of event data in
real-world scenarios. Here, a physics-informed DAS neural network paradigm is
proposed that does not need real-world event data for training. By
physically modeling target events and the constraints of the real world and the
DAS system, physical functions are derived to train a generative network that
generates DAS event data. A DAS debackground net is then trained on the
generated DAS event data to eliminate background noise in DAS data. The
effectiveness of the proposed paradigm is verified in an event identification
application based on a public dataset of DAS spatiotemporal data and in a belt
conveyor fault monitoring application based on DAS time-frequency data, where
it achieves comparable or better performance than data-driven networks trained
with RWD. Owing to the introduction of physical information and the capability
of background noise removal, the paradigm generalizes within the same
application across different sites. A fault diagnosis accuracy of 91.8% is
achieved in the belt conveyor field with networks transferred from a
simulation test site, without any fault event data from the test site or the
field for training. The proposed paradigm is a prospective solution to the
significant obstacles of data acquisition and intense noise in practical DAS
applications and opens up more potential fields for DAS.
[LINK]
http://arxiv.org/abs/2506.21952v1
[DATE]
2025-06-27 14:46:58+08:00
[CATEGORIES]
cs.LG
Hitchhiking Rides Dataset: Two decades of crowd-sourced records on stochastic traveling
[AUTHORS]
Till Wenke
[ABSTRACT]
Hitchhiking, a spontaneous and decentralized mode of travel, has long eluded
systematic study due to its informal nature. This paper presents and analyzes
the largest known structured dataset of hitchhiking rides, comprising over
63,000 entries collected over nearly two decades through platforms associated
with hitchwiki.org and lately on hitchmap.com. By leveraging crowd-sourced
contributions, the dataset captures key spatiotemporal and strategic aspects of
hitchhiking. This work documents the dataset’s origins, evolution, and
community-driven maintenance, highlighting its Europe-centric distribution,
seasonal patterns, and reliance on a small number of highly active
contributors. Through exploratory analyses, I examine waiting times, user
behavior, and comment metadata, shedding light on the lived realities of
hitchhikers. While the dataset has inherent biases and limitations - such as
demographic skew and unverifiable entries - it offers a rare and valuable window
into an alternative form of mobility. I conclude by outlining future directions
for enriching the dataset and advancing research on hitchhiking as both a
transportation practice and cultural phenomenon.
[LINK]
http://arxiv.org/abs/2506.21946v1
[DATE]
2025-06-27 14:41:08+08:00
[CATEGORIES]
cs.LG
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
[AUTHORS]
Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
[ABSTRACT]
We propose a novel step-by-step video-to-audio generation method that
sequentially produces individual audio tracks, each corresponding to a specific
sound event in the video. Our approach mirrors traditional Foley workflows,
aiming to capture all sound events induced by a given video comprehensively.
Each generation step is formulated as a guided video-to-audio synthesis task,
conditioned on a target text prompt and previously generated audio tracks. This
design is inspired by the idea of concept negation from prior compositional
generation frameworks. To enable this guided generation, we introduce a
training framework that leverages pre-trained video-to-audio models and
eliminates the need for specialized paired datasets, allowing training on more
accessible data. Experimental results demonstrate that our method generates
multiple semantically distinct audio tracks for a single input video, leading
to higher-quality composite audio synthesis than existing baselines.
[LINK]
http://arxiv.org/abs/2506.20995v2
[DATE]
2025-06-27 14:33:56+08:00
[CATEGORIES]
cs.LG
GuiderNet: A Meta-Learning Framework for Optimizing Quantum Circuit Geometry and Mitigating Barren Plateaus
[AUTHORS]
Marwan Ait Haddou, Mohamed Bennai
[ABSTRACT]
Variational Quantum Algorithms (VQAs) offer potential for near-term quantum
advantage but face challenges from barren plateaus, where gradients vanish, and
poorly conditioned optimization landscapes. We introduce GuiderNet, a
meta-learning framework that conditions Parameterized Quantum Circuits (PQCs)
using data-dependent parameter shifts aimed at minimizing the log condition
number of the Fubini-Study metric tensor. Implemented as a classical neural
network, GuiderNet is meta-trained to guide PQC parameters into geometrically
favorable regions and is embedded within hybrid quantum-classical pipelines to
steer both initialization and adaptive modulation during training.
Applied to the Kaggle Diabetes classification task, GuiderNet reduces
cumulative training loss by over 5x, improves test accuracy from 75.3% to
98.6%, and increases the minority-class F1 score from 0.67 to 0.95. It also
suppresses gradient explosion and stabilizes parameter updates, enabling
smoother and more robust optimization. These results demonstrate that geometric
meta-conditioning can mitigate barren plateaus and ill-conditioning, providing
a scalable approach to enhance trainability and generalization in quantum
machine learning.
[LINK]
http://arxiv.org/abs/2506.21940v1
[DATE]
2025-06-27 14:30:33+08:00
[CATEGORIES]
cs.LG
Joint Task Offloading and Resource Allocation in Low-Altitude MEC via Graph Attention Diffusion
[AUTHORS]
Yifan Xue, Ruihuai Liang, Bo Yang, Xuelin Cao, Zhiwen Yu, Mérouane Debbah, Chau Yuen
[ABSTRACT]
With the rapid development of the low-altitude economy, air-ground integrated
multi-access edge computing (MEC) systems are facing increasing demands for
real-time and intelligent task scheduling. In such systems, task offloading and
resource allocation encounter multiple challenges, including node
heterogeneity, unstable communication links, and dynamic task variations. To
address these issues, this paper constructs a three-layer heterogeneous MEC
system architecture for low-altitude economic networks, encompassing aerial and
ground users as well as edge servers. The system is systematically modeled from
the perspectives of communication channels, computational costs, and constraint
conditions, and the joint optimization problem of offloading decisions and
resource allocation is uniformly abstracted into a graph-structured modeling
task. On this basis, we propose a graph attention diffusion-based solution
generator (GADSG). This method integrates the contextual awareness of graph
attention networks with the solution distribution learning capability of
diffusion models, enabling joint modeling and optimization of discrete
offloading variables and continuous resource allocation variables within a
high-dimensional latent space. We construct multiple simulation datasets with
varying scales and topologies. Extensive experiments demonstrate that the
proposed GADSG model significantly outperforms existing baseline methods in
terms of optimization performance, robustness, and generalization across task
structures, showing strong potential for efficient task scheduling in dynamic
and complex low-altitude economic network environments.
[LINK]
http://arxiv.org/abs/2506.21933v1
[DATE]
2025-06-27 14:03:48+08:00
[CATEGORIES]
cs.LG
Causal Inference Isn’t Special: Why It’s Just Another Prediction Problem
[AUTHORS]
Carlos Fernández-Loría
[ABSTRACT]
Causal inference is often portrayed as fundamentally distinct from predictive
modeling, with its own terminology, goals, and intellectual challenges. But at
its core, causal inference is simply a structured instance of prediction under
distribution shift. In both cases, we begin with labeled data from a source
domain and seek to generalize to a target domain where outcomes are not
observed. The key difference is that in causal inference, the labels –
potential outcomes – are selectively observed based on treatment assignment,
introducing bias that must be addressed through assumptions. This perspective
reframes causal estimation as a familiar generalization problem and highlights
how techniques from predictive modeling, such as reweighting and domain
adaptation, apply directly to causal tasks. It also clarifies that causal
assumptions are not uniquely strong – they are simply more explicit. By
viewing causal inference through the lens of prediction, we demystify its
logic, connect it to familiar tools, and make it more accessible to
practitioners and educators alike.
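
The reweighting idea mentioned above can be illustrated with a tiny
hand-built population. This is a sketch under stated assumptions, not taken
from the paper: the covariate, the propensities, and the outcome function y1
are all made up, and the propensities are assumed known. Treated units are the
"labeled source domain"; weighting each by the inverse of its treatment
probability shifts them to match the full population, the "target domain".

```python
# Each unit is (covariate x, treated?). Units with x = 1 are treated more often.
units = [(0, True)] + [(0, False)] * 3 + [(1, True)] * 3 + [(1, False)]
propensity = {0: 0.25, 1: 0.75}  # assumed known P(treated | x)


def y1(x):
    """Hypothetical potential outcome under treatment."""
    return 1.0 + 2.0 * x


# Ground truth: average of Y(1) over everyone (unobservable in practice).
true_mean = sum(y1(x) for x, _ in units) / len(units)

# Naive estimate: average outcome among treated units only (selection bias).
treated = [x for x, t in units if t]
naive = sum(y1(x) for x in treated) / len(treated)

# Inverse propensity weighting: reweight treated units to the full population.
ipw = sum(y1(x) / propensity[x] for x in treated) / len(units)
```

Here the naive average is 2.5 because treated units over-represent x = 1,
while the reweighted estimate recovers the true mean of 2.0.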
[LINK]
http://arxiv.org/abs/2504.04320v2
[DATE]
2025-06-27 13:38:26+08:00
[CATEGORIES]
cs.LG
Mic-hackathon 2024: Hackathon on Machine Learning for Electron and Scanning Probe Microscopy
[AUTHORS]
Utkarsh Pratiush, Austin Houston, Kamyar Barakati, Aditya Raghavan, Dasol Yoon, Harikrishnan KP, Zhaslan Baraissov, Desheng Ma, Samuel S. Welborn, Mikolaj Jakowski, Shawn-Patrick Barhorst, Alexander J. Pattison, Panayotis Manganaris, Sita Sirisha Madugula, Sai Venkata Gayathri Ayyagari, Vishal Kennedy, Ralph Bulanadi, Michelle Wang, Kieran J. Pang, Ian Addison-Smith, Willy Menacho, Horacio V. Guzman, Alexander Kiefer, Nicholas Furth, Nikola L. Kolev, Mikhail Petrov, Viktoriia Liu, Sergey Ilyev, Srikar Rairao, Tommaso Rodani, Ivan Pinto-Huguet, Xuli Chen, Josep Cruañes, Marta Torrens, Jovan Pomar, Fanzhi Su, Pawan Vedanti, Zhiheng Lyu, Xingzhi Wang, Lehan Yao, Amir Taqieddin, Forrest Laskowski, Xiangyu Yin, Yu-Tsun Shao, Benjamin Fein-Ashley, Yi Jiang, Vineet Kumar, Himanshu Mishra, Yogesh Paul, Adib Bazgir, Rama chandra Praneeth Madugula, Yuwen Zhang, Pravan Omprakash, Jian Huang, Eric Montufar-Morales, Vivek Chawla, Harshit Sethi, Jie Huang, Lauri Kurki, Grace Guinan, Addison Salvador, Arman Ter-Petrosyan, Madeline Van Winkle, Steven R. Spurgeon, Ganesh Narasimha, Zijie Wu, Richard Liu, Yongtao Liu, Boris Slautin, Andrew R Lupini, Rama Vasudevan, Gerd Duscher, Sergei V. Kalinin
[ABSTRACT]
Microscopy is a primary source of information on materials structure and
functionality at nanometer and atomic scales. The data generated is often
well-structured, enriched with metadata and sample histories, though not always
consistent in detail or format. The adoption of Data Management Plans (DMPs) by
major funding agencies promotes preservation and access. However, deriving
insights remains difficult due to the lack of standardized code ecosystems,
benchmarks, and integration strategies. As a result, data usage is inefficient
and analysis time is extensive. In addition to post-acquisition analysis, new
APIs from major microscope manufacturers enable real-time, ML-based analytics
for automated decision-making and ML-agent-controlled microscope operation.
Yet, a gap remains between the ML and microscopy communities, limiting the
impact of these methods on physics, materials discovery, and optimization.
Hackathons help bridge this divide by fostering collaboration between ML
researchers and microscopy experts. They encourage the development of novel
solutions that apply ML to microscopy, while preparing a future workforce for
instrumentation, materials science, and applied ML. This hackathon produced
benchmark datasets and digital twins of microscopes to support community growth
and standardized workflows. All related code is available at GitHub:
https://github.com/KalininGroup/Mic-hackathon-2024-codes-publication/tree/1.0.0.1
[LINK]
http://arxiv.org/abs/2506.08423v2
[DATE]
2025-06-27 12:56:59+08:00
[CATEGORIES]
cs.LG
Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection
[AUTHORS]
Zhijing Wan, Zhixiang Wang, Zheng Wang, Xin Xu, Shin’ichi Satoh
[ABSTRACT]
One-shot subset selection serves as an effective tool to reduce deep learning
training costs by identifying an informative data subset based on the
information extracted by an information extractor (IE). Traditional IEs,
typically pre-trained on the target dataset, are inherently dataset-dependent.
Foundation models (FMs) offer a promising alternative, potentially mitigating
this limitation. This work investigates two key questions: (1) Can FM-based
subset selection outperform traditional IE-based methods across diverse
datasets? (2) Do all FMs perform equally well as IEs for subset selection?
Extensive experiments uncovered surprising insights: FMs consistently
outperform traditional IEs on fine-grained datasets, whereas their advantage
diminishes on coarse-grained datasets with noisy labels. Motivated by these
findings, we propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a
method tailored for fine-grained image datasets. RAM-APL leverages multiple FMs
to enhance subset selection by exploiting their complementary strengths. Our
approach achieves state-of-the-art performance on fine-grained datasets,
including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011.
[COMMENTS]
18 pages, 10 figures, accepted by ICML 2025
[LINK]
http://arxiv.org/abs/2506.14473v2
[DATE]
2025-06-27 12:48:09+08:00
[CATEGORIES]
cs.LG
TOAST: Task-Oriented Adaptive Semantic Transmission over Dynamic Wireless Environments
[AUTHORS]
Sheng Yun, Jianhua Pei, Ping Wang
[ABSTRACT]
The evolution toward 6G networks demands a fundamental shift from bit-centric
transmission to semantic-aware communication that emphasizes task-relevant
information. This work introduces TOAST (Task-Oriented Adaptive Semantic
Transmission), a unified framework designed to address the core challenge of
multi-task optimization in dynamic wireless environments through three
complementary components. First, we formulate adaptive task balancing as a
Markov decision process, employing deep reinforcement learning to dynamically
adjust the trade-off between image reconstruction fidelity and semantic
classification accuracy based on real-time channel conditions. Second, we
integrate module-specific Low-Rank Adaptation (LoRA) mechanisms throughout our
Swin Transformer-based joint source-channel coding architecture, enabling
parameter-efficient fine-tuning that dramatically reduces adaptation overhead
while maintaining full performance across diverse channel impairments including
Additive White Gaussian Noise (AWGN), fading, phase noise, and impulse
interference. Third, we incorporate an Elucidating diffusion model that
operates in the latent space to restore features corrupted by channel noises,
providing substantial quality improvements compared to baseline approaches.
Extensive experiments across multiple datasets demonstrate that TOAST achieves
superior performance compared to baseline approaches, with significant
improvements in both classification accuracy and reconstruction quality at low
Signal-to-Noise Ratio (SNR) conditions while maintaining robust performance
across all tested scenarios.
[LINK]
http://arxiv.org/abs/2506.21900v1
[DATE]
2025-06-27 12:36:30+08:00
[CATEGORIES]
cs.LG
Advancements and Challenges in Continual Reinforcement Learning: A Comprehensive Review
[AUTHORS]
Amara Zuffer, Michael Burke, Mehrtash Harandi
[ABSTRACT]
The diversity of tasks and dynamic nature of reinforcement learning (RL)
require RL agents to be able to learn sequentially and continuously, a learning
paradigm known as continual reinforcement learning. This survey reviews how
continual learning transforms RL agents into dynamic continual learners. This
enables RL agents to acquire and retain useful and reusable knowledge
seamlessly. The paper delves into fundamental aspects of continual
reinforcement learning, exploring key concepts, significant challenges, and
novel methodologies. Special emphasis is placed on recent advancements in
continual reinforcement learning within robotics, along with a succinct
overview of evaluation environments utilized in prominent research,
facilitating accessibility for newcomers to the field. The review concludes
with a discussion on limitations and promising future directions, providing
valuable insights for researchers and practitioners alike.
[COMMENTS]
65 pages, 9 figures
[LINK]
http://arxiv.org/abs/2506.21899v1
[DATE]
2025-06-27 12:36:05+08:00
[CATEGORIES]
cs.LG
Enhancing Cloud Security through Topic Modelling
[AUTHORS]
Sabbir M. Saleh, Nazim Madhavji, John Steinbacher
[ABSTRACT]
Protecting cloud applications is critical in an era where security threats
are increasingly sophisticated and persistent. Continuous Integration and
Continuous Deployment (CI/CD) pipelines are particularly vulnerable, making
innovative security approaches essential. This research explores the
application of Natural Language Processing (NLP) techniques, specifically Topic
Modelling, to analyse security-related text data and anticipate potential
threats. We focus on Latent Dirichlet Allocation (LDA) and Probabilistic Latent
Semantic Analysis (PLSA) to extract meaningful patterns from data sources,
including logs, reports, and deployment traces. Using the Gensim framework in
Python, these methods categorise log entries into security-relevant topics
(e.g., phishing, encryption failures). The identified topics are leveraged to
highlight patterns indicative of security issues across CI/CD’s continuous
stages (build, test, deploy). This approach introduces a semantic layer that
supports early vulnerability recognition and contextual understanding of
runtime behaviours.
[COMMENTS]
7 pages, 5 figures, 28th ACIS International Winter Conference on
Software Engineering, Artificial Intelligence, Networking and
Parallel/Distributed Computing (SNPD 2024-Winter)
[LINK]
http://arxiv.org/abs/2505.01463v2
[DATE]
2025-06-27 12:34:30+08:00
[CATEGORIES]
cs.LG
Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems
[AUTHORS]
Ibrahim K. Ozaslan, Panagiotis Patrinos, Mihailo R. Jovanović
[ABSTRACT]
We examine stability properties of primal-dual gradient flow dynamics for
composite convex optimization problems with multiple, possibly nonsmooth, terms
in the objective function under the generalized consensus constraint. The
proposed dynamics are based on the proximal augmented Lagrangian and they
provide a viable alternative to ADMM which faces significant challenges from
both analysis and implementation viewpoints in large-scale multi-block
scenarios. In contrast to customized algorithms with individualized convergence
guarantees, we develop a systematic approach for solving a broad class of
challenging composite optimization problems. We leverage various structural
properties to establish global (exponential) convergence guarantees for the
proposed dynamics. Our assumptions are much weaker than those required to prove
(exponential) stability of primal-dual dynamics as well as (linear) convergence
of discrete-time methods such as standard two-block and multi-block ADMM and
EXTRA algorithms. Finally, we show necessity of some of our structural
assumptions for exponential stability and provide computational experiments to
demonstrate the convenience of the proposed approach for parallel and
distributed computing applications.
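As a rough illustration of what a primal-dual gradient flow looks like, the sketch below applies a forward-Euler discretization to a toy equality-constrained problem. It uses the plain Lagrangian rather than the proximal augmented Lagrangian of the abstract, and the problem, step size, and function name `primal_dual_flow` are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def primal_dual_flow(steps: int = 2000, h: float = 0.05) -> Tuple[List[float], float]:
    """Forward-Euler discretization of a primal-dual gradient flow for
    minimize 0.5 * (x1^2 + x2^2)  subject to  x1 + x2 = 2.

    Lagrangian: L(x, y) = 0.5 * ||x||^2 + y * (x1 + x2 - 2).
    The primal variable descends on L, the dual variable ascends on L.
    """
    x = [0.0, 0.0]
    y = 0.0
    for _ in range(steps):
        residual = x[0] + x[1] - 2.0          # dL/dy (constraint violation)
        grad_x = [x[0] + y, x[1] + y]         # dL/dx
        x = [xi - h * gi for xi, gi in zip(x, grad_x)]  # primal descent step
        y = y + h * residual                  # dual ascent step
    return x, y


if __name__ == "__main__":
    x, y = primal_dual_flow()
    print(x, y)
```

For this toy problem the iterates should approach the saddle point x = (1, 1), y = -1, which satisfies both the constraint and the stationarity condition x + y = 0.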
[COMMENTS]
30 pages; 4 figures
[LINK]
http://arxiv.org/abs/2408.15969v2
[DATE]
2025-06-27 12:25:57+08:00
[CATEGORIES]
cs.LG
Thompson Sampling in Function Spaces via Neural Operators
[AUTHORS]
Rafael Oliveira, Xuesong Wang, Kian Ming A. Chai, Edwin V. Bonilla
[ABSTRACT]
We propose an extension of Thompson sampling to optimization problems over
function spaces where the objective is a known functional of an unknown
operator’s output. We assume that functional evaluations are inexpensive, while
queries to the operator (such as running a high-fidelity simulator) are costly.
Our algorithm employs a sample-then-optimize approach using neural operator
surrogates. This strategy avoids explicit uncertainty quantification by
treating trained neural operators as approximate samples from a Gaussian
process. We provide novel theoretical convergence guarantees, based on Gaussian
processes in the infinite-dimensional setting, under minimal assumptions. We
benchmark our method against existing baselines on functional optimization
tasks involving partial differential equations and other nonlinear
operator-driven phenomena, demonstrating improved sample efficiency and
competitive performance.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.21894v1
[DATE]
2025-06-27 12:21:57+08:00
[CATEGORIES]
cs.LG
Interactive Multi-Objective Probabilistic Preference Learning with Soft and Hard Bounds
[AUTHORS]
Edward Chen, Sang T. Truong, Natalie Dullerud, Sanmi Koyejo, Carlos Guestrin
[ABSTRACT]
High-stakes decision-making involves navigating multiple competing objectives
with expensive evaluations. For instance, in brachytherapy, clinicians must
balance maximizing tumor coverage (e.g., an aspirational target or soft bound
of >95% coverage) against strict organ dose limits (e.g., a non-negotiable hard
bound of <601 cGy to the bladder), with each plan evaluation being
resource-intensive. Selecting Pareto-optimal solutions that match implicit
preferences is challenging, as exhaustive Pareto frontier exploration is
computationally and cognitively prohibitive, necessitating interactive
frameworks to guide users. While decision-makers (DMs) often possess domain
knowledge to narrow the search via such soft-hard bounds, current methods often
lack systematic approaches to iteratively refine these multi-faceted preference
structures. Critically, DMs must trust their final decision, confident they
haven’t missed superior alternatives; this trust is paramount in
high-consequence scenarios. We present Active-MoSH, an interactive local-global
framework designed for this process. Its local component integrates soft-hard
bounds with probabilistic preference learning, maintaining distributions over
DM preferences and bounds for adaptive Pareto subset refinement. This is guided
by an active sampling strategy optimizing exploration-exploitation while
minimizing cognitive burden. To build DM trust, Active-MoSH’s global component,
T-MoSH, leverages multi-objective sensitivity analysis to identify potentially
overlooked, high-value points beyond immediate feedback. We demonstrate
Active-MoSH’s performance benefits through diverse synthetic and real-world
applications. A user study on AI-generated image selection further validates
our hypotheses regarding the framework’s ability to improve convergence,
enhance DM trust, and provide expressive preference articulation, enabling more
effective DMs.
[LINK]
http://arxiv.org/abs/2506.21887v1
[DATE]
2025-06-27 11:44:20+08:00
[CATEGORIES]
cs.LG
Advances in Temporal Point Processes: Bayesian, Neural, and LLM Approaches
[AUTHORS]
Feng Zhou, Quyu Kong, Jie Qiao, Cheng Wan, Yixuan Zhang, Ruichu Cai
[ABSTRACT]
Temporal point processes (TPPs) are stochastic process models used to
characterize event sequences occurring in continuous time. Traditional
statistical TPPs have a long-standing history, with numerous models proposed
and successfully applied across diverse domains. In recent years, advances in
deep learning have spurred the development of neural TPPs, enabling greater
flexibility and expressiveness in capturing complex temporal dynamics. The
emergence of large language models (LLMs) has further sparked excitement,
offering new possibilities for modeling and analyzing event sequences by
leveraging their rich contextual understanding. This survey presents a
comprehensive review of recent research on TPPs from three perspectives:
Bayesian, deep learning, and LLM approaches. We begin with a review of the
fundamental concepts of TPPs, followed by an in-depth discussion of model
design and parameter estimation techniques in these three frameworks. We also
revisit classic application areas of TPPs to highlight their practical
relevance. Finally, we outline challenges and promising directions for future
research.
[LINK]
http://arxiv.org/abs/2501.14291v2
[DATE]
2025-06-27 11:22:06+08:00
[CATEGORIES]
cs.LG
A Survey of Continual Reinforcement Learning
[AUTHORS]
Chaofan Pan, Xin Yang, Yanhua Li, Wei Wei, Tianrui Li, Bo An, Jiye Liang
[ABSTRACT]
Reinforcement Learning (RL) is an important machine learning paradigm for
solving sequential decision-making problems. Recent years have witnessed
remarkable progress in this field due to the rapid development of deep neural
networks. However, the success of RL currently relies on extensive training
data and computational resources. In addition, RL’s limited ability to
generalize across tasks restricts its applicability in dynamic and real-world
environments. With the rise of Continual Learning (CL), Continual
Reinforcement Learning (CRL) has emerged as a promising research direction to
address these limitations by enabling agents to learn continuously, adapt to
new tasks, and retain previously acquired knowledge. In this survey, we provide
a comprehensive examination of CRL, focusing on its core concepts, challenges,
and methodologies. Firstly, we conduct a detailed review of existing works,
organizing and analyzing their metrics, tasks, benchmarks, and scenario
settings. Secondly, we propose a new taxonomy of CRL methods, categorizing them
into four types from the perspective of knowledge storage and/or transfer.
Finally, our analysis highlights the unique challenges of CRL and provides
practical insights into future directions.
[COMMENTS]
This work has been submitted to the IEEE TPAMI
[LINK]
http://arxiv.org/abs/2506.21872v1
[DATE]
2025-06-27 11:10:20+08:00
[CATEGORIES]
cs.LG
SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space
[AUTHORS]
Ekaterina Redekop, Mara Pleasure, Zichen Wang, Kimberly Flores, Anthony Sisk, William Speier, Corey W. Arnold
[ABSTRACT]
The rapid growth of digital pathology and advances in self-supervised deep
learning have enabled the development of foundational models for various
pathology tasks across diverse diseases. While multimodal approaches
integrating diverse data sources have emerged, a critical gap remains in the
comprehensive integration of whole-slide images (WSIs) with spatial
transcriptomics (ST), which is crucial for capturing critical molecular
heterogeneity beyond standard hematoxylin & eosin (H&E) staining. We introduce
SPADE, a foundation model that integrates histopathology with ST data to guide
image representation learning within a unified framework, in effect creating an
ST-informed latent space. SPADE leverages a mixture-of-data experts technique,
where experts, created via two-stage feature-space clustering, use contrastive
learning to learn representations of co-registered WSI patches and gene
expression profiles. Pre-trained on the comprehensive HEST-1k dataset, SPADE is
evaluated on 14 downstream tasks, demonstrating significantly superior few-shot
performance compared to baseline models, highlighting the benefits of
integrating morphological and molecular information into one latent space.
[LINK]
http://arxiv.org/abs/2506.21857v1
[DATE]
2025-06-27 10:20:51+08:00
[CATEGORIES]
cs.LG
Unveiling the Power of Noise Priors: Enhancing Diffusion Models for Mobile Traffic Prediction
[AUTHORS]
Zhi Sheng, Daisy Yuan, Jingtao Ding, Yong Li
[ABSTRACT]
Accurate prediction of mobile traffic, i.e., network traffic from cellular
base stations, is crucial for optimizing network performance and supporting
urban development. However, the non-stationary nature of mobile traffic, driven
by human activity and environmental changes, leads to both regular patterns and
abrupt variations. Diffusion models excel in capturing such complex temporal
dynamics due to their ability to capture the inherent uncertainties. Most
existing approaches prioritize designing novel denoising networks but often
neglect the critical role of noise itself, potentially leading to sub-optimal
performance. In this paper, we introduce a novel perspective by emphasizing the
role of noise in the denoising process. Our analysis reveals that noise
fundamentally shapes mobile traffic predictions, exhibiting distinct and
consistent patterns. We propose NPDiff, a framework that decomposes noise into
prior and residual components, with the prior derived from data dynamics,
enhancing the model’s ability to capture both regular and abrupt variations.
NPDiff can seamlessly integrate with various diffusion-based prediction models,
delivering predictions that are effective, efficient, and robust. Extensive
experiments demonstrate that it achieves superior performance with an
improvement of over 30%, offering a new perspective on leveraging diffusion
models in this domain. We provide code and data at
https://github.com/tsinghua-fib-lab/NPDiff.
[LINK]
http://arxiv.org/abs/2501.13794v3
[DATE]
2025-06-27 09:56:44+08:00
[CATEGORIES]
cs.LG
QT-DoG: Quantization-aware Training for Domain Generalization
[AUTHORS]
Saqib Javed, Hieu Le, Mathieu Salzmann
[COMMENTS]
Accepted at International Conference on Machine Learning (ICML) 2025.
Project website: https://saqibjaved1.github.io/QT_DoG/
[LINK]
http://arxiv.org/abs/2410.06020v2
[DATE]
2025-06-27 09:42:45+08:00
[CATEGORIES]
cs.LG
Koopman operator-based discussion on partial observation in stochastic systems
[AUTHORS]
Jun Ohkubo
[ABSTRACT]
It is sometimes difficult to achieve a complete observation for a full set of
observables, and partial observations are necessary. For deterministic systems,
the Mori-Zwanzig formalism provides a theoretical framework for handling
partial observations. Recently, data-driven algorithms based on the Koopman
operator theory have made significant progress, and there is a discussion to
connect the Mori-Zwanzig formalism with the Koopman operator theory. In this
work, we discuss the effects of partial observation in stochastic systems using
the Koopman operator theory. The discussion clarifies the importance of
distinguishing the state space and the function space in stochastic systems.
Even in stochastic systems, the delay embedding technique is beneficial for
partial observation, and several numerical experiments showed a power-law
behavior of the accuracy for the amplitude of the additive noise. We also
discuss the relation between the exponent of the power-law behavior and the
effects of partial observation.
[COMMENTS]
23 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.21844v1
[DATE]
2025-06-27 09:30:51+08:00
[CATEGORIES]
cs.LG
Adversarial Threats in Quantum Machine Learning: A Survey of Attacks and Defenses
[AUTHORS]
Archisman Ghosh, Satwik Kundu, Swaroop Ghosh
[ABSTRACT]
Quantum Machine Learning (QML) integrates quantum computing with classical
machine learning, primarily to solve classification, regression and generative
tasks. However, its rapid development raises critical security challenges in
the Noisy Intermediate-Scale Quantum (NISQ) era. This chapter examines
adversarial threats unique to QML systems, focusing on vulnerabilities in
cloud-based deployments, hybrid architectures, and quantum generative models.
Key attack vectors include model stealing via transpilation or output
extraction, data poisoning through quantum-specific perturbations, reverse
engineering of proprietary variational quantum circuits, and backdoor attacks.
Adversaries exploit noise-prone quantum hardware and insufficiently secured
QML-as-a-Service (QMLaaS) workflows to compromise model integrity, ownership,
and functionality. Defense mechanisms leverage quantum properties to counter
these threats. Noise signatures from training hardware act as non-invasive
watermarks, while hardware-aware obfuscation techniques and ensemble strategies
disrupt cloning attempts. Emerging solutions also adapt classical adversarial
training and differential privacy to quantum settings, addressing
vulnerabilities in quantum neural networks and generative architectures.
However, securing QML requires addressing open challenges such as balancing
noise levels for reliability and security, mitigating cross-platform attacks,
and developing quantum-classical trust frameworks. This chapter summarizes
recent advances in attacks and defenses, offering a roadmap for researchers and
practitioners to build robust, trustworthy QML systems resilient to evolving
adversarial landscapes.
[COMMENTS]
23 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.21842v1
[DATE]
2025-06-27 09:19:49+08:00
[CATEGORIES]
cs.LG
The Cost of Avoiding Backpropagation
[AUTHORS]
Kunjal Panchal, Sunav Choudhary, Yuriy Brun, Hui Guan
[ABSTRACT]
Forward-mode automatic differentiation (FmAD) and zero-order (ZO)
optimization have been proposed as memory-efficient alternatives to
backpropagation (BP) for gradient computation, especially in low-resource
settings. However, their practical benefits remain unclear due to two key gaps:
a lack of comparison against memory-efficient BP variants, such as activation
checkpointing, and a lack of a unified theoretical analysis. This work presents
a comprehensive theoretical and empirical comparison of BP, FmAD, and ZO
methods. Our theoretical analysis shows that while FmAD and ZO can reduce
memory usage, they incur significant costs in accuracy, convergence speed, and
computation compared to BP with checkpointing. These drawbacks worsen with
larger models or constrained perturbation budgets. Empirical experiments on
large language and vision-language models show that BP with checkpointing
outperforms FmAD and ZO variants, including those enhanced with variance
reduction, achieving up to 31.1% higher accuracy, 34.8% faster convergence, and
3.8x fewer computations at comparable memory usage. Our results highlight
fundamental limitations of FmAD and ZO, and reaffirm BP with checkpointing as
the most effective strategy for model training under memory-constrained
settings. Our code is available at
https://github.com/Astuary/The_Cost_of_Avoiding_Backpropagation.
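To make the zero-order idea concrete, here is a minimal sketch of a commonly used two-point ZO gradient estimator. It is not taken from the authors' repository; the function names `zo_gradient` and `sphere`, the smoothing radius `mu`, and the test objective are assumptions for illustration, and the ZO variants studied in the paper may differ.

```python
from typing import Callable, List, Sequence


def zo_gradient(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    direction: Sequence[float],
    mu: float = 1e-3,
) -> List[float]:
    """Two-point zero-order estimate of the gradient of f at x.

    Only two forward evaluations of f are used; no backward pass and no
    stored activations. The result is the directional slope along
    `direction`, scaled back onto that direction.
    """
    x_plus = [xi + mu * ui for xi, ui in zip(x, direction)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, direction)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [slope * ui for ui in direction]


def sphere(x: Sequence[float]) -> float:
    """Toy objective f(x) = sum_i x_i^2, whose true gradient is 2x."""
    return sum(xi * xi for xi in x)


if __name__ == "__main__":
    # Probing along the first coordinate recovers only that gradient component.
    print(zo_gradient(sphere, [1.0, -2.0], [1.0, 0.0]))
```

Each probe recovers the gradient only along one direction, which is one intuition for why such estimators can need many perturbations as model size grows.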
[LINK]
http://arxiv.org/abs/2506.21833v1
[DATE]
2025-06-27 08:47:03+08:00
[CATEGORIES]
cs.LG
Computational Efficient and Minimax Optimal Nonignorable Matrix Completion
[AUTHORS]
Yuanhong A, Guoyu Zhang, Yongcheng Zeng, Bo Zhang
[ABSTRACT]
While the matrix completion problem has attracted considerable attention over
the decades, few works address the nonignorable missing issue and all have
their limitations. In this article, we propose a nuclear norm regularized row-
and column-wise matrix U-statistic loss function for the generalized
nonignorable missing mechanism, a flexible and generally applicable missing
mechanism which contains both ignorable and nonignorable missing mechanism
assumptions. The proposed method achieves computational efficiency comparable
to the existing missing-at-random approaches, while providing the near minimax
optimal statistical convergence rate guarantees for the more general
nonignorable missing case. We propose an accelerated proximal gradient
algorithm to solve the associated optimization problem, and characterize the
interaction between algorithmic and statistical convergence. Simulations and
real data analyses further support the practical utility of the proposed
method.
[LINK]
http://arxiv.org/abs/2504.04016v2
[DATE]
2025-06-27 08:17:58+08:00
[CATEGORIES]
cs.LG
Mathematical Modeling of Protein Structures: A Cohomology-Based Approach to the Flagellar Motor
[AUTHORS]
Zakaria Lamine, Abdelatif Hafid, Mohamed Rahouti
[ABSTRACT]
This study presents a novel mathematical model derived from cohomology,
leveraging the KEEL-proven theorem that establishes cohomology as tautological,
generated by boundary classes of curves with fixed dual graphs. Simplicial
complexes are constructed using skew-commutative graded algebra, and the
structure theorem is applied to connect distinct homologies, enabling precise
interpretations of the resulting geometric forms. The proposed model is
utilized for protein structure analysis and prediction, with a specific
application to the Flagellar Motor structure. This approach offers new insights
into the geometric and algebraic foundations of biological macromolecular
modeling, highlighting its potential for advancement in structural biology.
[LINK]
http://arxiv.org/abs/2504.16941v2
[DATE]
2025-06-27 07:25:39+08:00
[CATEGORIES]
cs.LG
CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery
[AUTHORS]
Felix Holm, Gözde Ünver, Ghazal Ghazaei, Nassir Navab
[ABSTRACT]
Understanding the intricate workflows of cataract surgery requires modeling
complex interactions between surgical tools, anatomical structures, and
procedural techniques. Existing datasets primarily address isolated aspects of
surgical analysis, such as tool detection or phase segmentation, but lack
comprehensive representations that capture the semantic relationships between
entities over time. This paper introduces the Cataract Surgery Scene Graph
(CAT-SG) dataset, the first to provide structured annotations of tool-tissue
interactions, procedural variations, and temporal dependencies. By
incorporating detailed semantic relations, CAT-SG offers a holistic view of
surgical workflows, enabling more accurate recognition of surgical phases and
techniques. Additionally, we present a novel scene graph generation model,
CatSGG, which outperforms current methods in generating structured surgical
representations. The CAT-SG dataset is designed to enhance AI-driven surgical
training, real-time decision support, and workflow analysis, paving the way for
more intelligent, context-aware systems in clinical practice.
[LINK]
http://arxiv.org/abs/2506.21813v1
[DATE]
2025-06-27 07:25:23+08:00
[CATEGORIES]
cs.LG
Classification with Reject Option: Distribution-free Error Guarantees via Conformal Prediction
[AUTHORS]
Johan Hallberg Szabadváry, Tuwe Löfström, Ulf Johansson, Cecilia Sönströd, Ernst Ahlberg, Lars Carlsson
[ABSTRACT]
Machine learning (ML) models always make a prediction, even when they are
likely to be wrong. This causes problems in practical applications, as we do
not know if we should trust a prediction. ML with reject option addresses this
issue by abstaining from making a prediction if it is likely to be incorrect.
In this work, we formalise the approach to ML with reject option in binary
classification, deriving theoretical guarantees on the resulting error rate.
This is achieved through conformal prediction (CP), which produces prediction
sets with distribution-free validity guarantees. In binary classification, CP
can output prediction sets containing exactly one, two or no labels. By
accepting only the singleton predictions, we turn CP into a binary classifier
with reject option.
Here, CP is formally put in the framework of predicting with reject option.
We state and prove the resulting error rate, and give finite sample estimates.
Numerical examples provide illustrations of derived error rate through several
different conformal prediction settings, ranging from full conformal prediction
to offline batch inductive conformal prediction. The former has a direct link
to sharp validity guarantees, whereas the latter is more fuzzy in terms of
validity guarantees but can be used in practice. Error-reject curves illustrate
the trade-off between error rate and reject rate, and can serve to aid a user
to set an acceptable error rate or reject rate in practice.
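The singleton-acceptance rule described above can be sketched with a simple inductive conformal predictor. This is a minimal illustration, not the authors' implementation: the nonconformity score (one minus the predicted probability of the candidate label), the function names, and the calibration values are all assumptions.

```python
from typing import List, Optional, Sequence


def conformal_p_value(cal_scores: Sequence[float], test_score: float) -> float:
    """Share of calibration nonconformity scores at least as large as the
    test score, with the usual +1 correction for the test example itself."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(cal_scores: Sequence[float], prob_pos: float, epsilon: float) -> List[int]:
    """Labels whose p-value exceeds the significance level epsilon.

    Assumed nonconformity score: 1 - (predicted probability of the label).
    """
    candidates = {0: prob_pos, 1: 1.0 - prob_pos}
    return [
        label
        for label, score in sorted(candidates.items())
        if conformal_p_value(cal_scores, score) > epsilon
    ]


def classify_or_reject(cal_scores: Sequence[float], prob_pos: float, epsilon: float) -> Optional[int]:
    """Return the label if the prediction set is a singleton, else None (reject)."""
    labels = prediction_set(cal_scores, prob_pos, epsilon)
    return labels[0] if len(labels) == 1 else None


if __name__ == "__main__":
    # Hypothetical calibration scores: 1 - p(true label) on held-out data.
    cal = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45]
    for p in (0.9, 0.5, 0.1):
        print(p, classify_or_reject(cal, p, epsilon=0.2))
```

Both empty and two-label prediction sets lead to a rejection here, so the classifier only commits when exactly one label survives at the chosen significance level.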
[COMMENTS]
20 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.21802v1
[DATE]
2025-06-27 07:04:25+08:00
[CATEGORIES]
cs.LG
Why Neural Network Can Discover Symbolic Structures with Gradient-based Training: An Algebraic and Geometric Foundation for Neurosymbolic Reasoning
[AUTHORS]
Peihao Wang, Zhangyang Wang
[ABSTRACT]
We develop a theoretical framework that explains how discrete symbolic
structures can emerge naturally from continuous neural network training
dynamics. By lifting neural parameters to a measure space and modeling training
as Wasserstein gradient flow, we show that under geometric constraints, such as
group invariance, the parameter measure $\mu_t$ undergoes two concurrent
phenomena: (1) a decoupling of the gradient flow into independent optimization
trajectories over some potential functions, and (2) a progressive contraction
on the degree of freedom. These potentials encode algebraic constraints
relevant to the task and act as ring homomorphisms under a commutative
semi-ring structure on the measure space. As training progresses, the network
transitions from a high-dimensional exploration to compositional
representations that comply with algebraic operations and exhibit a lower
degree of freedom. We further establish data scaling laws for realizing
symbolic tasks, linking representational capacity to the group invariance that
facilitates symbolic solutions. This framework charts a principled foundation
for understanding and designing neurosymbolic systems that integrate continuous
learning with discrete algebraic reasoning.
[COMMENTS]
International Conference on Neuro-symbolic Systems (NeuS), 2025
[LINK]
http://arxiv.org/abs/2506.21797v1
[DATE]
2025-06-27 06:40:30+08:00
[CATEGORIES]
cs.LG
Multi-task parallelism for robust pre-training of graph foundation models on multi-source, multi-fidelity atomistic modeling data
[AUTHORS]
Massimiliano Lupo Pasini, Jong Youl Choi, Pei Zhang, Kshitij Mehta, Rylie Weaver, Ashwin M. Aji, Karl W. Schulz, Jorda Polo, Prasanna Balaprakash
[ABSTRACT]
Graph foundation models using graph neural networks promise sustainable,
efficient atomistic modeling. To tackle challenges of processing multi-source,
multi-fidelity data during pre-training, recent studies employ multi-task
learning, in which shared message passing layers initially process input
atomistic structures regardless of source, then route them to multiple decoding
heads that predict data-specific outputs. This approach stabilizes pre-training
and enhances a model’s transferability to unexplored chemical regions.
Preliminary results on approximately four million structures are encouraging,
yet questions remain about generalizability to larger, more diverse datasets
and scalability on supercomputers. We propose a multi-task parallelism method
that distributes each head across computing resources with GPU acceleration.
Implemented in the open-source HydraGNN architecture, our method was trained on
over 24 million structures from five datasets and tested on the Perlmutter,
Aurora, and Frontier supercomputers, demonstrating efficient scaling on all
three highly heterogeneous super-computing architectures.
[COMMENTS]
15 pages, 4 figures, 2 tables
[LINK]
http://arxiv.org/abs/2506.21788v1
[DATE]
2025-06-27 06:04:05+08:00
[CATEGORIES]
cs.LG
Sparse-Reg: Improving Sample Complexity in Offline Reinforcement Learning using Sparsity
[AUTHORS]
Samin Yeasar Arnob, Scott Fujimoto, Doina Precup
[ABSTRACT]
In this paper, we investigate the use of small datasets in the context of
offline reinforcement learning (RL). While many common offline RL benchmarks
employ datasets with over a million data points, many offline RL applications
rely on considerably smaller datasets. We show that offline RL algorithms can
overfit on small datasets, resulting in poor performance. To address this
challenge, we introduce “Sparse-Reg”: a regularization technique based on
sparsity to mitigate overfitting in offline reinforcement learning, enabling
effective learning in limited data settings and outperforming state-of-the-art
baselines in continuous control.
[LINK]
http://arxiv.org/abs/2506.17155v2
[DATE]
2025-06-27 05:55:13+08:00
[CATEGORIES]
cs.LG
Graph ODEs and Beyond: A Comprehensive Survey on Integrating Differential Equations with Graph Neural Networks
[AUTHORS]
Zewen Liu, Xiaoda Wang, Bohan Wang, Zijie Huang, Carl Yang, Wei Jin
[ABSTRACT]
Graph Neural Networks (GNNs) and differential equations (DEs) are two rapidly
advancing areas of research that have shown remarkable synergy in recent years.
GNNs have emerged as powerful tools for learning on graph-structured data,
while differential equations provide a principled framework for modeling
continuous dynamics across time and space. The intersection of these fields has
led to innovative approaches that leverage the strengths of both, enabling
applications in physics-informed learning, spatiotemporal modeling, and
scientific computing. This survey aims to provide a comprehensive overview of
the burgeoning research at the intersection of GNNs and DEs. We will categorize
existing methods, discuss their underlying principles, and highlight their
applications across domains such as molecular modeling, traffic prediction, and
epidemic spreading. Furthermore, we identify open challenges and outline future
research directions to advance this interdisciplinary field. A comprehensive
paper list is provided at https://github.com/Emory-Melody/Awesome-Graph-NDEs.
This survey serves as a resource for researchers and practitioners seeking to
understand and contribute to the fusion of GNNs and DEs.
[COMMENTS]
Accepted by KDD 2025 Tutorial Track
[LINK]
http://arxiv.org/abs/2503.23167v3
[DATE]
2025-06-27 05:41:14+08:00
[CATEGORIES]
cs.LG
M3PO: Massively Multi-Task Model-Based Policy Optimization
[AUTHORS]
Aditya Narendra, Dmitry Makarov, Aleksandr Panov
[ABSTRACT]
We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a
scalable model-based reinforcement learning (MBRL) framework designed to
address sample inefficiency in single-task settings and poor generalization in
multi-task domains. Existing model-based approaches like DreamerV3 rely on
pixel-level generative models that neglect control-centric representations,
while model-free methods such as PPO suffer from high sample complexity and
weak exploration. M3PO integrates an implicit world model, trained to predict
task outcomes without observation reconstruction, with a hybrid exploration
strategy that combines model-based planning and model-free uncertainty-driven
bonuses. This eliminates the bias-variance trade-off in prior methods by using
discrepancies between model-based and model-free value estimates to guide
exploration, while maintaining stable policy updates through a trust-region
optimizer. M3PO provides an efficient and robust alternative to existing
model-based policy optimization approaches and achieves state-of-the-art
performance across multiple benchmarks.
[COMMENTS]
6 pages, 4 figures. Accepted at IEEE/RSJ IROS 2025. Full version,
including appendix and implementation details
[LINK]
http://arxiv.org/abs/2506.21782v1
[DATE]
2025-06-27 05:39:01+08:00
[CATEGORIES]
cs.LG
Multi-thresholding Good Arm Identification with Bandit Feedback
[AUTHORS]
Xuanke Jiang, Sherief Hashima, Kohei Hatano, Eiji Takimoto
[ABSTRACT]
We consider a good arm identification problem in a stochastic bandit setting
with multi-objectives, where each arm $i \in [K]$ is associated with a
distribution $D_i$ defined over $\mathbb{R}^M$. For each round $t$, the player pulls an
arm $i_t$ and receives an $M$-dimensional reward vector sampled according to
$D_{i_t}$. The goal is to find, with high probability, an $\epsilon$-good arm
whose expected reward vector is larger than $\bm{\xi} - \epsilon \mathbf{1}$,
where $\bm{\xi}$ is a predefined threshold vector, and the vector comparison is
component-wise. We propose the Multi-Thresholding UCB (MultiTUCB) algorithm
with a sample complexity bound. Our bound matches the existing one in the
special case where $M=1$ and $\epsilon=0$. The proposed algorithm demonstrates
superior performance compared to baseline approaches across synthetic and real
datasets.
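The component-wise definition of an $\epsilon$-good arm can be written out directly. The sketch below only checks the definition against known mean vectors; it is not the MultiTUCB algorithm, and the function names and numeric values are hypothetical.

```python
from typing import List, Sequence


def is_epsilon_good(
    mean_reward: Sequence[float],
    threshold: Sequence[float],
    epsilon: float,
) -> bool:
    """True if every component of the expected reward vector is larger
    than the corresponding component of (threshold - epsilon * 1)."""
    return all(m > xi - epsilon for m, xi in zip(mean_reward, threshold))


def good_arms(
    means: Sequence[Sequence[float]],
    threshold: Sequence[float],
    epsilon: float,
) -> List[int]:
    """Indices of arms whose (assumed known) mean vectors are epsilon-good."""
    return [i for i, mu in enumerate(means) if is_epsilon_good(mu, threshold, epsilon)]


if __name__ == "__main__":
    # Hypothetical two-objective example (M = 2) with three arms.
    means = [[0.6, 0.5], [0.6, 0.3], [0.2, 0.9]]
    print(good_arms(means, threshold=[0.5, 0.55], epsilon=0.1))
```

In the bandit setting the means are unknown and must be estimated from pulls, which is where a sample complexity bound such as the one in the abstract comes in.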
[LINK]
http://arxiv.org/abs/2503.10386v3
[DATE]
2025-06-27 05:38:37+08:00
[CATEGORIES]
cs.LG
Generative Data Mining with Longtail-Guided Diffusion
[AUTHORS]
David S. Hayden, Mao Ye, Timur Garipov, Gregory P. Meyer, Carl Vondrick, Zhao Chen, Yuning Chai, Eric Wolff, Siddhartha S. Srinivasa
[ABSTRACT]
It is difficult to anticipate the myriad challenges that a predictive model
will encounter once deployed. Common practice entails a reactive, cyclical
approach: model deployment, data mining, and retraining. We instead develop a
proactive longtail discovery process by imagining additional data during
training. In particular, we develop general model-based longtail signals,
including a differentiable, single forward pass formulation of epistemic
uncertainty that does not impact model parameters or predictive performance but
can flag rare or hard inputs. We leverage these signals as guidance to generate
additional training data from a latent diffusion model in a process we call
Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the
diffusion model or the predictive model, and we do not need to expose the
predictive model to intermediate diffusion states. Data generated by LTG
exhibit semantically meaningful variation, yield significant generalization
improvements on numerous image classification benchmarks, and can be analyzed
by a VLM to proactively discover, textually explain, and address conceptual
gaps in a deployed predictive model.
[COMMENTS]
20 pages
[LINK]
http://arxiv.org/abs/2502.01980v2
[DATE]
2025-06-27 05:17:54+08:00
[CATEGORIES]
cs.LG
Gradient-Based Neuroplastic Adaptation for Concurrent Optimization of Neuro-Fuzzy Networks
[AUTHORS]
John Wesley Hostetter, Min Chi
[ABSTRACT]
Neuro-fuzzy networks (NFNs) are transparent, symbolic, and universal function
approximators that perform as well as conventional neural architectures, but
their knowledge is expressed as linguistic IF-THEN rules. Despite these
advantages, their systematic design process remains a challenge. Existing work
will often sequentially build NFNs by inefficiently isolating parametric and
structural identification, leading to a premature commitment to brittle and
subpar architecture. We propose a novel application-independent approach called
gradient-based neuroplastic adaptation for the concurrent optimization of NFNs’
parameters and structure. By recognizing that NFNs’ parameters and structure
should be optimized simultaneously as they are deeply conjoined, settings
previously unapproachable for NFNs are now accessible, such as the online
reinforcement learning of NFNs for vision-based tasks. The effectiveness of
concurrently optimizing NFNs is shown empirically by training them with online
reinforcement learning to proficiently play challenging scenarios from a
vision-based video game called DOOM.
[COMMENTS]
45 pages
[LINK]
http://arxiv.org/abs/2506.21771v1
[DATE]
2025-06-27 05:08:11+08:00
[CATEGORIES]
cs.LG
Testing Causal Models with Hidden Variables in Polynomial Delay via Conditional Independencies
[AUTHORS]
Hyunchai Jeong, Adiba Ejaz, Jin Tian, Elias Bareinboim
[ABSTRACT]
Testing a hypothesized causal model against observational data is a key
prerequisite for many causal inference tasks. A natural approach is to test
whether the conditional independence relations (CIs) assumed in the model hold
in the data. While a model can assume exponentially many CIs (with respect to
the number of variables), testing all of them is both impractical and
unnecessary. Causal graphs, which encode these CIs in polynomial space, give
rise to local Markov properties that enable model testing with a significantly
smaller subset of CIs. Model testing based on local properties requires an
algorithm to list the relevant CIs. However, existing algorithms for realistic
settings with hidden variables and non-parametric distributions can take
exponential time to produce even a single CI constraint. In this paper, we
introduce the c-component local Markov property (C-LMP) for causal graphs with
hidden variables. Since C-LMP can still invoke an exponential number of CIs, we
develop a polynomial delay algorithm to list these CIs in poly-time intervals.
To our knowledge, this is the first algorithm that enables poly-delay testing
of CIs in causal graphs with hidden variables against arbitrary data
distributions. Experiments on real-world and synthetic data demonstrate the
practicality of our algorithm.
[COMMENTS]
34 total pages, 14 figures
[LINK]
http://arxiv.org/abs/2409.14593v2
[DATE]
2025-06-27 04:51:29+08:00
[CATEGORIES]
cs.LG
Nested Stochastic Algorithm for Generalized Sinkhorn distance-Regularized Distributionally Robust Optimization
[AUTHORS]
Yufeng Yang, Yi Zhou, Zhaosong Lu
[ABSTRACT]
Distributionally robust optimization (DRO) is a powerful technique to train
robust models against data distribution shift. This paper aims to solve
regularized nonconvex DRO problems, where the uncertainty set is modeled by a
so-called generalized Sinkhorn distance and the loss function is nonconvex and
possibly unbounded. Such a distance allows modeling the uncertainty of
distributions with different probability supports and divergence functions. For
this class of regularized DRO problems, we derive a novel dual formulation
taking the form of nested stochastic optimization, where the dual variable
depends on the data sample. To solve the dual problem, we provide theoretical
evidence to design a nested stochastic gradient descent (SGD) algorithm, which
leverages stochastic approximation to estimate the nested stochastic gradients.
We study the convergence rate of nested SGD and establish polynomial iteration
and sample complexities that are independent of the data size and parameter
dimension, indicating its potential for solving large-scale DRO problems. We
conduct numerical experiments to demonstrate the efficiency and robustness of
the proposed algorithm.
[COMMENTS]
49 pages, 2 tables
[LINK]
http://arxiv.org/abs/2503.22923v2
[DATE]
2025-06-27 04:48:14+08:00
[CATEGORIES]
cs.LG
VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
[AUTHORS]
Thomas Zeng, Shuibai Zhang, Shutong Wu, Christian Classen, Daewon Chae, Ethan Ewer, Minjae Lee, Heeju Kim, Wonjun Kang, Jackson Kunde, Ying Fan, Jungtaek Kim, Hyung Il Koo, Kannan Ramchandran, Dimitris Papailiopoulos, Kangwook Lee
[ABSTRACT]
Process Reward Models (PRMs) have proven effective at enhancing mathematical
reasoning for Large Language Models (LLMs) by leveraging increased
inference-time computation. However, they are predominantly trained on
mathematical data and their generalizability to non-mathematical domains has
not been rigorously studied. In response, this work first shows that current
PRMs have poor performance in other domains. To address this limitation, we
introduce VersaPRM, a multi-domain PRM trained on synthetic reasoning data
generated using our novel data generation and annotation method. VersaPRM
achieves consistent performance gains across diverse domains. For instance, in
the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a
7.9% performance gain over the majority voting baseline – surpassing
Qwen2.5-Math-PRM’s gain of 1.3%. We further contribute to the community by
open-sourcing all data, code and models for VersaPRM.
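The weighted majority voting mentioned in this abstract can be illustrated with a minimal sketch. It assumes each sampled solution has already been reduced to a final answer string and a single scalar reward-model score; how per-step scores are aggregated into that scalar is not stated in the abstract, and the function names below are hypothetical rather than taken from the VersaPRM code release.

```python
from collections import defaultdict


def majority_vote(answers):
    """Return the answer that appears most often among the samples."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def weighted_majority_vote(answers, scores):
    """Return the answer with the largest total score.

    `scores` holds one scalar per sampled solution (assumed here to be
    an aggregate reward-model score for the whole reasoning chain).
    """
    if len(answers) != len(scores):
        raise ValueError("answers and scores must have the same length")
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy example: "B" is more frequent, but "A" carries more total score.
answers = ["A", "A", "B", "B", "B"]
scores = [0.9, 0.8, 0.2, 0.3, 0.4]
print(majority_vote(answers), weighted_majority_vote(answers, scores))
```

In this toy case the two rules disagree, which is the situation where a well-calibrated reward model could change the selected answer.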
[LINK]
http://arxiv.org/abs/2502.06737v2
[DATE]
2025-06-27 04:39:38+08:00
[CATEGORIES]
cs.LG
TADA: Improved Diffusion Sampling with Training-free Augmented Dynamics
[AUTHORS]
Tianrong Chen, Huangjie Zheng, David Berthelot, Jiatao Gu, Josh Susskind, Shuangfei Zhai
[ABSTRACT]
Diffusion models have demonstrated exceptional capabilities in generating
high-fidelity images but typically suffer from inefficient sampling. Many
solver designs and noise scheduling strategies have been proposed to
dramatically improve sampling speeds. In this paper, we introduce a new
sampling method that is up to $186\%$ faster than the current state of the art
solver for comparative FID on ImageNet512. This new sampling method is
training-free and uses an ordinary differential equation (ODE) solver. The key
to our method resides in using higher-dimensional initial noise, which allows it to
produce more detailed samples with fewer function evaluations from existing
pretrained diffusion models. In addition, by design our solver allows the user to
control the level of detail through a simple hyper-parameter at no extra
computational cost. We present how our approach leverages momentum dynamics by
establishing a fundamental equivalence between momentum diffusion models and
conventional diffusion models with respect to their training paradigms.
Moreover, we observe the use of higher-dimensional noise naturally exhibits
characteristics similar to stochastic differential equations (SDEs). Finally,
we demonstrate strong performance on a set of representative pretrained
diffusion models, including EDM, EDM2, and Stable-Diffusion 3, which cover
models in both pixel and latent spaces, as well as class and text conditional
settings. The code is available at https://github.com/apple/ml-tada.
[LINK]
http://arxiv.org/abs/2506.21757v1
[DATE]
2025-06-27 04:30:27+08:00
[CATEGORIES]
cs.LG
Beyond Conformal Predictors: Adaptive Conformal Inference with Confidence Predictors
[AUTHORS]
Johan Hallberg Szabadváry, Tuwe Löfström
[ABSTRACT]
Adaptive Conformal Inference (ACI) provides finite-sample coverage
guarantees, enhancing the prediction reliability under non-exchangeability.
This study demonstrates that these desirable properties of ACI do not require
the use of Conformal Predictors (CP). We show that the guarantees hold for the
broader class of confidence predictors, defined by the requirement of producing
nested prediction sets, a property we argue is essential for meaningful
confidence statements. We empirically investigate the performance of
Non-Conformal Confidence Predictors (NCCP) against CP when used with ACI on
non-exchangeable data. In online settings, the NCCP offers significant
computational advantages while maintaining a comparable predictive efficiency.
In batch settings, inductive NCCP (INCCP) can outperform inductive CP (ICP) by
utilising the full training dataset without requiring a separate calibration
set, leading to improved efficiency, particularly when the data are limited.
Although these initial results highlight NCCP as a theoretically sound and
practically effective alternative to CP for uncertainty quantification with ACI
in non-exchangeable scenarios, further empirical studies are warranted across
diverse datasets and predictors.
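As a rough illustration of how ACI can wrap a predictor that only needs to produce nested sets, the sketch below uses the standard ACI update, alpha_{t+1} = alpha_t + gamma * (alpha - err_t), around a simple interval predictor built from empirical quantiles of past absolute residuals. That interval rule, the function name, and the default parameters are assumptions made for illustration, not the NCCP constructions studied in the paper.

```python
import numpy as np


def aci_intervals(y, preds, target_alpha=0.1, gamma=0.01, warmup=20):
    """Run ACI over a sequence with a simple nested-interval predictor.

    The interval at step t is preds[t] +/- q, where q is the empirical
    (1 - alpha_t) quantile of past absolute residuals, so a smaller
    alpha_t always gives a wider (nested) interval.
    Returns the 0/1 miscoverage indicators and the final alpha_t.
    """
    y = np.asarray(y, dtype=float)
    preds = np.asarray(preds, dtype=float)
    alpha_t = target_alpha
    residuals = list(np.abs(y[:warmup] - preds[:warmup]))
    errors = []
    for t in range(warmup, len(y)):
        if alpha_t <= 0.0:
            half_width = np.inf       # predict the whole real line
        elif alpha_t >= 1.0:
            half_width = -np.inf      # predict the empty set
        else:
            half_width = np.quantile(residuals, 1.0 - alpha_t)
        resid = abs(y[t] - preds[t])
        err = 0.0 if resid <= half_width else 1.0
        errors.append(err)
        # ACI update: raise alpha_t after coverage, lower it after a miss.
        alpha_t += gamma * (target_alpha - err)
        residuals.append(resid)
    return np.array(errors), alpha_t


# Non-exchangeable toy data: the noise scale triples halfway through.
rng = np.random.default_rng(0)
T = 2000
noise_scale = np.where(np.arange(T) < 1000, 1.0, 3.0)
y = rng.normal(0.0, noise_scale)
preds = np.zeros(T)
errors, final_alpha = aci_intervals(y, preds)
print(errors.mean())
```

The long-run miscoverage rate is driven toward `target_alpha` by the update itself, regardless of how the intervals are produced, provided the predictor returns the full space for alpha_t <= 0 and the empty set for alpha_t >= 1.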
[COMMENTS]
28 pages, 5 figures
[LINK]
http://arxiv.org/abs/2409.15548v4
[DATE]
2025-06-27 04:25:03+08:00
[CATEGORIES]
cs.LG
Inverse Design of Diffractive Metasurfaces Using Diffusion Models
[AUTHORS]
Liav Hen, Erez Yosef, Dan Raviv, Raja Giryes, Jacob Scheuer
[ABSTRACT]
Metasurfaces are ultra-thin optical elements composed of engineered
sub-wavelength structures that enable precise control of light. Their inverse
design - determining a geometry that yields a desired optical response - is
challenging due to the complex, nonlinear relationship between structure and
optical properties. This often requires expert tuning, is prone to local
minima, and involves significant computational overhead. In this work, we
address these challenges by integrating the generative capabilities of
diffusion models into computational design workflows. Using an RCWA simulator,
we generate training data consisting of metasurface geometries and their
corresponding far-field scattering patterns. We then train a conditional
diffusion model to predict meta-atom geometry and height from a target spatial
power distribution at a specified wavelength, sampled from a continuous
supported band. Once trained, the model can generate metasurfaces with low
error, either directly using RCWA-guided posterior sampling or by serving as an
initializer for traditional optimization methods. We demonstrate our approach
on the design of a spatially uniform intensity splitter and a polarization beam
splitter, both produced with low error in under 30 minutes. To support further
research in data-driven metasurface design, we publicly release our code and
datasets.
[LINK]
http://arxiv.org/abs/2506.21748v1
[DATE]
2025-06-27 04:10:30+08:00
[CATEGORIES]
cs.LG
Analysis of static and dynamic batching algorithms for graph neural networks
[AUTHORS]
Daniel T. Speckhard, Tim Bechtel, Sebastian Kehl, Jonathan Godwin, Claudia Draxl
[ABSTRACT]
Graph neural networks (GNNs) have shown promising results for several domains
such as materials science, chemistry, and the social sciences. GNN models often
contain millions of parameters, and like other neural network (NN) models, are
often fed only a fraction of the graphs that make up the training dataset in
batches to update model parameters. The effect of batching algorithms on
training time and model performance has been thoroughly explored for NNs but
not yet for GNNs. We analyze two different batching algorithms for graph-based
models, namely static and dynamic batching for two datasets, the QM9 dataset of
small molecules and the AFLOW materials database. Our experiments show that
changing the batching algorithm can provide up to a 2.7x speedup, but the
fastest algorithm depends on the data, model, batch size, hardware, and number
of training steps run. Experiments show that for a select number of
combinations of batch size, dataset, and model, significant differences in
model learning metrics are observed between static and dynamic batching
algorithms.
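To make the distinction concrete, here is a small sketch under stated assumptions: static batching is taken to mean a fixed number of graphs per batch, and dynamic batching is taken to mean greedily adding graphs until a node budget would be exceeded. Both definitions, the node-only budget, and the helper names are simplifications introduced for illustration; the abstract does not spell out the exact algorithms.

```python
def static_batches(node_counts, graphs_per_batch):
    """Group graphs into batches holding a fixed number of graphs."""
    return [
        node_counts[i:i + graphs_per_batch]
        for i in range(0, len(node_counts), graphs_per_batch)
    ]


def dynamic_batches(node_counts, node_budget):
    """Greedily fill each batch until adding a graph would exceed the budget."""
    batches, current, used = [], [], 0
    for n in node_counts:
        if n > node_budget:
            raise ValueError("a single graph exceeds the node budget")
        if current and used + n > node_budget:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


def padding_fraction(batches, padded_size):
    """Fraction of node slots that are padding when every batch is padded
    to `padded_size` nodes."""
    real = sum(sum(batch) for batch in batches)
    return 1.0 - real / (padded_size * len(batches))


# Toy dataset described only by the number of nodes in each graph.
node_counts = [3, 9, 4, 8, 2, 10]
static = static_batches(node_counts, graphs_per_batch=2)
dynamic = dynamic_batches(node_counts, node_budget=20)
print(static, padding_fraction(static, padded_size=20))
print(dynamic, padding_fraction(dynamic, padded_size=20))
```

Padding overhead is only one factor in wall-clock time; as the abstract notes, which algorithm is fastest depends on the data, model, batch size, hardware, and number of training steps.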
[LINK]
http://arxiv.org/abs/2502.00944v2
[DATE]
2025-06-27 04:07:44+08:00
[CATEGORIES]
cs.LG
Asymmetric Graph Error Control with Low Complexity in Causal Bandits
[AUTHORS]
Chen Peng, Di Zhang, Urbashi Mitra
[ABSTRACT]
In this paper, the causal bandit problem is investigated, with the objective
of maximizing the long-term reward by selecting an optimal sequence of
interventions on nodes in an unknown causal graph. It is assumed that both the
causal topology and the distribution of interventions are unknown. First, based
on the difference between the two types of graph identification errors (false
positives and negatives), a causal graph learning method is proposed. Numerical
results suggest that this method has a much lower sample complexity relative to
the prior art by learning sub-graphs. However, we note that a sample complexity
analysis for the new algorithm has not yet been undertaken. Under the
assumption of minimum-mean squared error weight estimation, a new uncertainty
bound tailored to the causal bandit problem is derived. This uncertainty bound
drives an upper confidence bound-based intervention selection to optimize the
reward. Further, we consider a particular instance of non-stationary bandits
wherein both the causal topology and interventional distributions can change.
Our solution is the design of a sub-graph change detection mechanism that
requires a modest number of samples. Numerical results compare the new
methodology to existing schemes and show a substantial performance improvement
in stationary and non-stationary settings. Averaged over 100 randomly generated
causal bandits, the proposed scheme takes significantly fewer samples to learn
the causal structure and achieves a reward gain of 85% compared to existing
approaches.
[LINK]
http://arxiv.org/abs/2408.11240v2
[DATE]
2025-06-27 04:05:51+08:00
[CATEGORIES]
cs.LG
Zebra: In-Context Generative Pretraining for Solving Parametric PDEs
[AUTHORS]
Louis Serrano, Armand Kassaï Koupaï, Thomas X Wang, Pierre Erbacher, Patrick Gallinari
[ABSTRACT]
Solving time-dependent parametric partial differential equations (PDEs) is
challenging for data-driven methods, as these models must adapt to variations
in parameters such as coefficients, forcing terms, and initial conditions.
State-of-the-art neural surrogates perform adaptation through gradient-based
optimization and meta-learning to implicitly encode the variety of dynamics
from observations. This often comes with increased inference complexity.
Inspired by the in-context learning capabilities of large language models
(LLMs), we introduce Zebra, a novel generative auto-regressive transformer
designed to solve parametric PDEs without requiring gradient adaptation at
inference. By leveraging in-context information during both pre-training and
inference, Zebra dynamically adapts to new tasks by conditioning on input
sequences that incorporate context example trajectories. As a generative model,
Zebra can be used to generate new trajectories and allows quantifying the
uncertainty of the predictions. We evaluate Zebra across a variety of
challenging PDE scenarios, demonstrating its adaptability, robustness, and
superior performance compared to existing approaches.
[LINK]
http://arxiv.org/abs/2410.03437v3
[DATE]
2025-06-27 04:05:33+08:00
[CATEGORIES]
cs.LG
Federated Item Response Theory Models
[AUTHORS]
Biying Zhou, Nanyu Luo, Feng Ji
[ABSTRACT]
Item Response Theory (IRT) models have been widely used to estimate
respondents’ latent abilities and calibrate items’ difficulty. Traditional IRT
estimation requires all individual raw response data to be centralized in one
place, thus potentially causing privacy issues. Federated learning is an
emerging field in computer science and machine learning with added features of
privacy protection and distributed computing. To integrate the advances from
federated learning with modern psychometrics, we propose a novel framework,
Federated Item Response Theory (FedIRT), to enable estimating traditional IRT
models with additional privacy, allowing estimation in a distributed manner
without losing estimation accuracy.
Our numerical experiments confirm that FedIRT achieves statistical accuracy
similar to standard IRT estimation using popular R packages, while offering
critical advantages: privacy protection and reduced communication costs. We
also validate FedIRT’s utility through a real-world exam dataset, demonstrating
its effectiveness in realistic educational contexts. This new framework extends
IRT’s applicability to distributed settings, such as multi-school assessments,
without sacrificing accuracy or security. To support practical adoption, we
provide an open-source R package, FedIRT, implementing the framework for the
two-parameter logistic (2PL) and partial credit models (PCM).
[LINK]
http://arxiv.org/abs/2506.21744v1
[DATE]
2025-06-27 04:01:18+08:00
[CATEGORIES]
cs.LG
Storm Surge in Color: RGB-Encoded Physics-Aware Deep Learning for Storm Surge Forecasting
[AUTHORS]
Jinpai Zhao, Albert Cerrone, Eirik Valseth, Leendert Westerink, Clint Dawson
[ABSTRACT]
Storm surge forecasting plays a crucial role in coastal disaster
preparedness, yet existing machine learning approaches often suffer from
limited spatial resolution, reliance on coastal station data, and poor
generalization. Moreover, many prior models operate directly on unstructured
spatial data, making them incompatible with modern deep learning architectures.
In this work, we introduce a novel approach that projects unstructured water
elevation fields onto structured Red Green Blue (RGB)-encoded image
representations, enabling the application of Convolutional Long Short Term
Memory (ConvLSTM) networks for end-to-end spatiotemporal surge forecasting. Our
model further integrates ground-truth wind fields as dynamic conditioning
signals and topo-bathymetry as a static input, capturing physically meaningful
drivers of surge evolution. Evaluated on a large-scale dataset of synthetic
storms in the Gulf of Mexico, our method demonstrates robust 48-hour
forecasting performance across multiple regions along the Texas coast and
exhibits strong spatial extensibility to other coastal areas. By combining
structured representation, physically grounded forcings, and scalable deep
learning, this study advances the frontier of storm surge forecasting in
usability, adaptability, and interpretability.
[LINK]
http://arxiv.org/abs/2506.21743v1
[DATE]
2025-06-27 03:56:30+08:00
[CATEGORIES]
cs.LG
Critically-Damped Higher-Order Langevin Dynamics
[AUTHORS]
Benjamin Sterling, Chad Gueli, Mónica F. Bugallo
[ABSTRACT]
Denoising Diffusion Probabilistic Models represent an entirely new class of
generative AI methods that have yet to be fully explored. Critical damping has
been successfully introduced in Critically-Damped Langevin Dynamics (CLD) and
Critically-Damped Third-Order Langevin Dynamics (TOLD++), but has not yet been
applied to dynamics of arbitrary order. The proposed line of work generalizes
Higher-Order Langevin Dynamics (HOLD), a recent state-of-the-art diffusion
method, by introducing the concept of critical damping from systems analysis.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2506.21741v1
[DATE]
2025-06-27 03:50:53+08:00
[CATEGORIES]
cs.LG
Modification of a Numerical Method Using FIR Filters in a Time-dependent SIR Model for COVID-19
[AUTHORS]
Felipe Rogério Pimentel, Rafael Gustavo Alves
[ABSTRACT]
Authors Yi-Cheng Chen, Ping-En Lu, Cheng-Shang Chang, and Tzu-Hsuan Liu use
the Finite Impulse Response (FIR) linear system filtering method to track and
predict the number of people infected and recovered from COVID-19, in a
pandemic context in which there was still no vaccine and the only way to avoid
contagion was isolation. To estimate the coefficients of these FIR filters,
Chen et al. used machine learning methods through a classical optimization
problem with regularization (ridge regression). These estimated coefficients
are called ridge coefficients. The epidemic mathematical model adopted by these
researchers to formulate the FIR filters is the time-dependent discrete SIR. In
this paper, we propose a small modification to the algorithm of Chen et al. to
obtain the ridge coefficients. We then used this modified algorithm to track
and predict the number of people infected and recovered from COVID-19 in the
state of Minas Gerais/Brazil, within a prediction window, during the initial
period of the pandemic. We also compare the predicted data with the respective
real data to check how good the approximation is. In the modified algorithm, we
set values for the FIR filter orders and for the regularization parameters,
both different from the respective values defined by Chen et al. in their
algorithm. In this context, the numerical results obtained by the modified
algorithm in some simulations present better approximation errors compared to
the respective approximation errors presented by the algorithm of Chen et al.
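The ridge-regression step described here can be sketched generically. The code below fits FIR coefficients that predict the next value of a series from its last `order` values using the closed-form ridge solution. It is a minimal sketch, not the algorithm of Chen et al. or the modified version: it assumes no intercept term and a single regularization parameter `lam`, and all function names are hypothetical.

```python
import numpy as np


def fit_fir_ridge(series, order, lam):
    """Fit FIR coefficients a_1..a_order minimizing
    sum_t (x[t] - sum_j a_j * x[t-j])^2 + lam * sum_j a_j^2.
    """
    series = np.asarray(series, dtype=float)
    n = len(series)
    if n <= order:
        raise ValueError("series must be longer than the filter order")
    # Column j-1 holds the lag-j values x[t-j] for t = order, ..., n-1.
    X = np.column_stack(
        [series[order - j:n - j] for j in range(1, order + 1)]
    )
    y = series[order:]
    # Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y.
    A = X.T @ X + lam * np.eye(order)
    return np.linalg.solve(A, X.T @ y)


def predict_next(series, coeffs):
    """One-step-ahead prediction from the most recent values."""
    order = len(coeffs)
    recent = np.asarray(series[-order:], dtype=float)[::-1]  # x[t-1], x[t-2], ...
    return float(recent @ coeffs)


# Toy series that halves at every step, so the ideal order-1 filter is 0.5.
series = 0.5 ** np.arange(10)
coeffs = fit_fir_ridge(series, order=1, lam=1e-8)
print(coeffs, predict_next(series, coeffs))
```

In the time-dependent SIR setting, a filter of this kind would be fitted separately to the estimated transmission and recovery rate series; the filter order and `lam` are the quantities the abstract says were set differently from the original algorithm.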
[COMMENTS]
14 pages, 3 figures, 3 tables, and 2 algorithms
[LINK]
http://arxiv.org/abs/2506.21739v1
[DATE]
2025-06-27 03:44:45+08:00
[CATEGORIES]
cs.LG
Hierarchical Reasoning Model
[AUTHORS]
Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, Yasin Abbasi Yadkori
[ABSTRACT]
Reasoning, the process of devising and executing complex goal-oriented action
sequences, remains a critical challenge in AI. Current large language models
(LLMs) primarily employ Chain-of-Thought (CoT) techniques, which suffer from
brittle task decomposition, extensive data requirements, and high latency.
Inspired by the hierarchical and multi-timescale processing in the human brain,
we propose the Hierarchical Reasoning Model (HRM), a novel recurrent
architecture that attains significant computational depth while maintaining
both training stability and efficiency. HRM executes sequential reasoning tasks
in a single forward pass without explicit supervision of the intermediate
process, through two interdependent recurrent modules: a high-level module
responsible for slow, abstract planning, and a low-level module handling rapid,
detailed computations. With only 27 million parameters, HRM achieves
exceptional performance on complex reasoning tasks using only 1000 training
samples. The model operates without pre-training or CoT data, yet achieves
nearly perfect performance on challenging tasks including complex Sudoku
puzzles and optimal path finding in large mazes. Furthermore, HRM outperforms
much larger models with significantly longer context windows on the Abstraction
and Reasoning Corpus (ARC), a key benchmark for measuring artificial general
intelligence capabilities. These results underscore HRM’s potential as a
transformative advancement toward universal computation and general-purpose
reasoning systems.
[LINK]
http://arxiv.org/abs/2506.21734v1
[DATE]
2025-06-27 03:39:54+08:00
[CATEGORIES]
cs.LG
Experimental investigation of pose informed reinforcement learning for skid-steered visual navigation
[AUTHORS]
Ameya Salvi, Venkat Krovi
[ABSTRACT]
Vision-based lane keeping is a topic of significant interest in the robotics
and autonomous ground vehicles communities in various on-road and off-road
applications. The skid-steered vehicle architecture has served as a useful
vehicle platform for human controlled operations. However, systematic modeling,
especially of the skid-slip wheel terrain interactions (primarily in off-road
settings), has created bottlenecks for automation deployment. End-to-end
learning-based methods, such as imitation learning and deep reinforcement
learning, have gained prominence as a viable deployment option to counter the
lack of accurate analytical models. However, the systematic formulation and
subsequent verification/validation in dynamic operation regimes (particularly
for skid-steered vehicles) remains a work in progress. To this end, a novel
approach for structured formulation for learning visual navigation is proposed
and investigated in this work. Extensive software simulations, hardware
evaluations and ablation studies highlight the significantly improved
performance of the proposed approach against contemporary literature.
[LINK]
http://arxiv.org/abs/2506.21732v1
[DATE]
2025-06-27 03:36:49+08:00
[CATEGORIES]
cs.LG
Adapting Probabilistic Risk Assessment for AI
[AUTHORS]
Anna Katariina Wisakanto, Joe Rogero, Avyay M. Casheekar, Richard Mallah
[ABSTRACT]
Modern general-purpose artificial intelligence (AI) systems present an urgent
risk management challenge, as their rapidly evolving capabilities and potential
for catastrophic harm outpace our ability to reliably assess their risks.
Current methods often rely on selective testing and undocumented assumptions
about risk priorities, frequently failing to make a serious attempt at
assessing the set of pathways through which AI systems pose direct or indirect
risks to society and the biosphere. This paper introduces the probabilistic
risk assessment (PRA) for AI framework, adapting established PRA techniques
from high-reliability industries (e.g., nuclear power, aerospace) for the new
challenges of advanced AI. The framework guides assessors in identifying
potential risks, estimating likelihood and severity bands, and explicitly
documenting evidence, underlying assumptions, and analyses at appropriate
granularities. The framework’s implementation tool synthesizes the results into
a risk report card with aggregated risk estimates from all assessed risks. It
introduces three methodological advances: (1) Aspect-oriented hazard analysis
provides systematic hazard coverage guided by a first-principles taxonomy of AI
system aspects (e.g. capabilities, domain knowledge, affordances); (2) Risk
pathway modeling analyzes causal chains from system aspects to societal impacts
using bidirectional analysis and incorporating prospective techniques; and (3)
Uncertainty management employs scenario decomposition, reference scales, and
explicit tracing protocols to structure credible projections with novelty or
limited data. Additionally, the framework harmonizes diverse assessment methods
by integrating evidence into comparable, quantified absolute risk estimates for
lifecycle decisions. We have implemented this as a workbook tool for AI
developers, evaluators, and regulators.
[COMMENTS]
Project website with workbook tool available at:
https://pra-for-ai.github.io/pra/
[LINK]
http://arxiv.org/abs/2504.18536v2
[DATE]
2025-06-27 03:31:12+08:00
[CATEGORIES]
cs.LG
Learning treatment effects while treating those in need
[AUTHORS]
Bryan Wilder, Pim Welle
[ABSTRACT]
Many social programs attempt to allocate scarce resources to people with the
greatest need. Indeed, public services increasingly use algorithmic risk
assessments motivated by this goal. However, targeting the highest-need
recipients often conflicts with attempting to evaluate the causal effect of the
program as a whole, as the best evaluations would be obtained by randomizing
the allocation. We propose a framework to design randomized allocation rules
which optimally balance targeting high-need individuals with learning treatment
effects, presenting policymakers with a Pareto frontier between the two goals.
We give sample complexity guarantees for the policy learning problem and
provide a computationally efficient strategy to implement it. We then
collaborate with the human services department of Allegheny County,
Pennsylvania to evaluate our methods on data from real service delivery
settings. Optimized policies can substantially mitigate the tradeoff between
learning and targeting. For example, it is often possible to obtain 90% of the
optimal utility in targeting high-need individuals while ensuring that the
average treatment effect can be estimated with less than 2 times the samples
that a randomized controlled trial would require. Mechanisms for targeting
public services often focus on measuring need as accurately as possible.
However, our results suggest that algorithmic systems in public services can be
most impactful if they incorporate program evaluation as an explicit goal
alongside targeting.
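The tension between targeting and evaluation can be illustrated with a small sketch. If treatment is assigned at random with known, need-dependent probabilities that stay strictly between 0 and 1, the average treatment effect remains estimable with an inverse-probability-weighted (Horvitz-Thompson) estimator. This is a standard estimator used here for illustration; it is not the optimized allocation rule proposed in the paper, and the function name and toy numbers are hypothetical.

```python
import numpy as np


def ipw_ate(outcomes, treated, propensities):
    """Horvitz-Thompson estimate of the average treatment effect.

    `propensities[i]` is the known probability that unit i was treated
    under the randomized allocation rule; it must lie strictly in (0, 1).
    """
    y = np.asarray(outcomes, dtype=float)
    t = np.asarray(treated, dtype=float)
    p = np.asarray(propensities, dtype=float)
    if np.any(p <= 0.0) or np.any(p >= 1.0):
        raise ValueError("propensities must be strictly between 0 and 1")
    return float(np.mean(t * y / p - (1.0 - t) * y / (1.0 - p)))


# Toy data with equal treatment probabilities, where the estimate reduces
# to the difference in group means.
outcomes = [3.0, 1.0, 5.0, 1.0]
treated = [1, 0, 1, 0]
propensities = [0.5, 0.5, 0.5, 0.5]
print(ipw_ate(outcomes, treated, propensities))
```

Pushing propensities toward 1 for high-need individuals improves targeting but inflates the weights `1 / (1 - p)` on the few untreated high-need units, which increases the estimator's variance; this is the tradeoff that the abstract's Pareto frontier describes.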
[LINK]
http://arxiv.org/abs/2407.07596v2
[DATE]
2025-06-27 03:20:30+08:00
[CATEGORIES]
cs.LG
CaloHadronic: a diffusion model for the generation of hadronic showers
[AUTHORS]
Thorsten Buss, Frank Gaede, Gregor Kasieczka, Anatolii Korol, Katja Krüger, Peter McKeown, Martina Mozzanica
[ABSTRACT]
Simulating showers of particles in highly-granular calorimeters is a key
frontier in the application of machine learning to particle physics. Achieving
high accuracy and speed with generative machine learning models can enable them
to augment traditional simulations and alleviate a major computing constraint.
Recent developments have shown how diffusion based generative shower simulation
approaches that do not rely on a fixed structure, but instead generate
geometry-independent point clouds, are very efficient. We present a
transformer-based extension to previous architectures which were developed for
simulating electromagnetic showers in the highly granular electromagnetic
calorimeter of the International Large Detector, ILD. The attention mechanism
now allows us to generate complex hadronic showers with more pronounced
substructure across both the electromagnetic and hadronic calorimeters. This is
the first time that machine learning methods are used to holistically generate
showers across the electromagnetic and hadronic calorimeter in highly granular
imaging calorimeter systems.
[LINK]
http://arxiv.org/abs/2506.21720v1
[DATE]
2025-06-27 03:12:44+08:00
[CATEGORIES]
cs.LG
Performance Prediction for Large Systems via Text-to-Text Regression
[AUTHORS]
Yash Akhauri, Bryan Lewandowski, Cheng-Hsi Lin, Adrian N. Reyes, Grant C. Forbes, Arissa Wongpanich, Bangding Yang, Mohamed S. Abdelfattah, Sagi Perel, Xingyou Song
[ABSTRACT]
In many industries, predicting metric outcomes of large systems is a
fundamental problem, driven largely by traditional tabular regression. However,
such methods struggle on complex systems data in the wild such as configuration
files or system logs, where feature engineering is often infeasible. We propose
text-to-text regression as a general, scalable alternative. For predicting
resource efficiency on Borg, Google’s massive compute cluster scheduling
system, a 60M parameter encoder-decoder, trained from random initialization,
achieves up to a near perfect 0.99 (0.9 average) rank correlation across the
entire fleet, and 100x lower MSE than tabular approaches. The model also easily
adapts to new tasks with only 500 few-shot examples and captures the densities of
complex outcome distributions. Ablation studies highlight the importance of
using encoders, increasing sequence length, and the model’s inherent
uncertainty quantification. These findings pave the way for universal
simulators of real-world outcomes.
[COMMENTS]
Code can be found at https://github.com/google-deepmind/regress-lm
[LINK]
http://arxiv.org/abs/2506.21718v1
[DATE]
2025-06-27 03:10:08+08:00
[CATEGORIES]
cs.LG
$\textrm{ODE}_t \left(\textrm{ODE}_l \right)$: Shortcutting the Time and Length in Diffusion and Flow Models for Faster Sampling
[AUTHORS]
Denis Gudovskiy, Wenzhao Zheng, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer
[ABSTRACT]
Recently, continuous normalizing flows (CNFs) and diffusion models (DMs) have
been studied using a unified theoretical framework. Although such models can
generate high-quality data points from a noise distribution, the sampling
demands multiple iterations to solve an ordinary differential equation (ODE)
with high computational complexity. Most existing methods focus on reducing the
number of time steps during the sampling process to improve efficiency. In this
work, we explore a complementary direction in which the quality-complexity
tradeoff can be dynamically controlled in terms of time steps and in the length
of the neural network. We achieve this by rewiring the blocks in the
transformer-based architecture to solve an inner discretized ODE w.r.t. its
length. Then, we employ time- and length-wise consistency terms during flow
matching training, and as a result, the sampling can be performed with an
arbitrary number of time steps and transformer blocks. Unlike others, our
$\textrm{ODE}_t \left(\textrm{ODE}_l \right)$ approach is solver-agnostic in
the time dimension and decreases both latency and memory usage. Compared to the
previous state of the art, image generation experiments on CelebA-HQ and
ImageNet show a latency reduction of up to $3\times$ in the most efficient
sampling mode, and an FID score improvement of up to $3.5$ points for
high-quality sampling. We release our code and model weights with fully
reproducible experiments.
[COMMENTS]
Preprint. Github page: github.com/gudovskiy/odelt
[LINK]
http://arxiv.org/abs/2506.21714v1
[DATE]
2025-06-27 02:59:59+08:00
[CATEGORIES]
cs.LG
Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings
[AUTHORS]
C. Shi, S. Zhang, W. Lu, R. Song
[ABSTRACT]
Reinforcement learning is a general technique that allows an agent to learn
an optimal policy and interact with an environment in sequential decision
making problems. The goodness of a policy is measured by its value function
starting from some initial state. The focus of this paper is to construct
confidence intervals (CIs) for a policy’s value in infinite horizon settings
where the number of decision points diverges to infinity. We propose to model
the action-value state function (Q-function) associated with a policy based on
series/sieve method to derive its confidence interval. When the target policy
depends on the observed data as well, we propose a SequentiAl Value Evaluation
(SAVE) method to recursively update the estimated policy and its value
estimator. As long as either the number of trajectories or the number of
decision points diverges to infinity, we show that the proposed CI achieves
nominal coverage even in cases where the optimal policy is not unique.
Simulation studies are conducted to back up our theoretical findings. We apply
the proposed method to a dataset from mobile health studies and find that
reinforcement learning algorithms could help improve patients’ health status. A
Python implementation of the proposed procedure is available at
https://github.com/shengzhang37/SAVE.
[LINK]
http://arxiv.org/abs/2001.04515v3
[DATE]
2025-06-27 02:35:13+08:00
[CATEGORIES]
cs.LG
Link Prediction with Physics-Inspired Graph Neural Networks
[AUTHORS]
Andrea Giuseppe Di Francesco, Francesco Caso, Maria Sofia Bucarelli, Fabrizio Silvestri
[ABSTRACT]
The message-passing mechanism underlying Graph Neural Networks (GNNs) is not
naturally suited for heterophilic datasets, where adjacent nodes often have
different labels. Most solutions to this problem remain confined to the task of
node classification. In this article, we focus on the valuable task of link
prediction under heterophily, an interesting problem for recommendation
systems, social network analysis, and other applications. GNNs like GRAFF have
improved node classification under heterophily by incorporating physics biases
in the architecture. Similarly, we propose GRAFF-LP, an extension of GRAFF for
link prediction. We show that GRAFF-LP effectively discriminates existing from
non-existing edges by learning implicitly to separate the edge gradients. Based
on this information, we propose a new readout function inspired by physics.
Remarkably, this new function not only enhances the performance of GRAFF-LP but
also improves that of other baseline models, leading us to reconsider how every
link prediction experiment has been conducted so far. Finally, we provide
evidence that even simple GNNs did not experience greater difficulty in
predicting heterophilic links compared to homophilic ones. This leads us to
believe in the necessity for heterophily measures specifically tailored for
link prediction, distinct from those used in node classification. The code and
appendix are available at
https://github.com/difra100/Link_Prediction_with_PIGNN_IJCNN.
[COMMENTS]
Camera-Ready version. Accepted at IJCNN 2025
[LINK]
http://arxiv.org/abs/2402.14802v3
[DATE]
2025-06-27 02:15:29+08:00
[CATEGORIES]
cs.LG
Risk-Averse Total-Reward Reinforcement Learning
[AUTHORS]
Xihong Su, Jia Lin Hau, Gersi Doko, Kishan Panaganti, Marek Petrik
[ABSTRACT]
Risk-averse total-reward Markov Decision Processes (MDPs) offer a promising
framework for modeling and solving undiscounted infinite-horizon objectives.
Existing model-based algorithms for risk measures like the entropic risk
measure (ERM) and entropic value-at-risk (EVaR) are effective in small
problems, but require full access to transition probabilities. We propose a
Q-learning algorithm to compute the optimal stationary policy for total-reward
ERM and EVaR objectives with strong convergence and performance guarantees. The
algorithm and its optimality are made possible by ERM’s dynamic consistency and
elicitability. Our numerical results on tabular domains demonstrate quick and
reliable convergence of the proposed Q-learning algorithm to the optimal
risk-averse value function.
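As a brief illustration of the entropic risk measure mentioned above, the sketch below estimates it from samples using a commonly used convention for reward-valued random variables, ERM_beta(X) = -(1/beta) log E[exp(-beta X)] with beta > 0. The convention, the function name, and the sample values are assumptions made for illustration; this is not the paper's Q-learning algorithm.

```python
import math


def entropic_risk(samples, beta):
    """Sample estimate of ERM_beta(X) = -(1/beta) * log E[exp(-beta * X)].

    Assumes X is a reward (larger is better) and beta > 0 is the risk-aversion
    level; both the name and the convention are illustrative assumptions.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not samples:
        raise ValueError("samples must be non-empty")
    # Log-sum-exp trick: subtract the largest exponent for numerical stability.
    exponents = [-beta * x for x in samples]
    m = max(exponents)
    log_mean = m + math.log(sum(math.exp(e - m) for e in exponents) / len(samples))
    return -log_mean / beta


rewards = [0.0, 10.0]  # hypothetical total rewards
risk_averse_value = entropic_risk(rewards, beta=1.0)
near_mean_value = entropic_risk(rewards, beta=1e-6)
```

With these hypothetical rewards, the value at beta = 1 lies well below the sample mean of 5, while a very small beta approximately recovers the mean, which is the sense in which beta controls risk aversion.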
[COMMENTS]
The paper is under review now
[LINK]
http://arxiv.org/abs/2506.21683v1
[DATE]
2025-06-27 02:10:51+08:00
[CATEGORIES]
cs.LG
TanDiT: Tangent-Plane Diffusion Transformer for High-Quality 360° Panorama Generation
[AUTHORS]
Hakan Çapuk, Andrew Bond, Muhammed Burak Kızıl, Emir Göçen, Erkut Erdem, Aykut Erdem
[ABSTRACT]
Recent advances in image generation have led to remarkable improvements in
synthesizing perspective images. However, these models still struggle with
panoramic image generation due to unique challenges, including varying levels
of geometric distortion and the requirement for seamless loop-consistency. To
address these issues while leveraging the strengths of the existing models, we
introduce TanDiT, a method that synthesizes panoramic scenes by generating
grids of tangent-plane images covering the entire 360$^\circ$ view. Unlike
previous methods relying on multiple diffusion branches, TanDiT utilizes a
unified diffusion model trained to produce these tangent-plane images
simultaneously within a single denoising iteration. Furthermore, we propose a
model-agnostic post-processing step specifically designed to enhance global
coherence across the generated panoramas. To accurately assess panoramic image
quality, we also present two specialized metrics, TangentIS and TangentFID, and
provide a comprehensive benchmark comprising captioned panoramic datasets and
standardized evaluation scripts. Extensive experiments demonstrate that our
method generalizes effectively beyond its training data, robustly interprets
detailed and complex text prompts, and seamlessly integrates with various
generative models to yield high-quality, diverse panoramic images.
[LINK]
http://arxiv.org/abs/2506.21681v1
[DATE]
2025-06-27 02:09:09+08:00
[CATEGORIES]
cs.LG
Whole-Body Conditioned Egocentric Video Prediction
[AUTHORS]
Yutong Bai, Danny Tran, Amir Bar, Yann LeCun, Trevor Darrell, Jitendra Malik
[ABSTRACT]
We train models to Predict Ego-centric Video from human Actions (PEVA), given
the past video and an action represented by the relative 3D body pose. By
conditioning on kinematic pose trajectories, structured by the joint hierarchy
of the body, our model learns to simulate how physical human actions shape the
environment from a first-person point of view. We train an auto-regressive
conditional diffusion transformer on Nymeria, a large-scale dataset of
real-world egocentric video and body pose capture. We further design a
hierarchical evaluation protocol with increasingly challenging tasks, enabling
a comprehensive analysis of the model’s embodied prediction and control
abilities. Our work represents an initial attempt to tackle the challenges of
modeling complex real-world environments and embodied agent behaviors with
video prediction from the perspective of a human.
[COMMENTS]
Project Page: https://dannytran123.github.io/PEVA
[LINK]
http://arxiv.org/abs/2506.21552v1
[DATE]
2025-06-27 01:59:59+08:00
[CATEGORIES]
cs.LG
mTSBench: Benchmarking Multivariate Time Series Anomaly Detection and Model Selection at Scale
[AUTHORS]
Xiaona Zhou, Constantin Brif, Ismini Lourentzou
[ABSTRACT]
Multivariate time series anomaly detection (MTS-AD) is critical in domains
like healthcare, cybersecurity, and industrial monitoring, yet remains
challenging due to complex inter-variable dependencies, temporal dynamics, and
sparse anomaly labels. We introduce mTSBench, the largest benchmark to date for
MTS-AD and unsupervised model selection, spanning 344 labeled time series
across 19 datasets and 12 diverse application domains. mTSBench evaluates 24
anomaly detection methods, including large language model (LLM)-based detectors
for multivariate time series, and systematically benchmarks unsupervised model
selection techniques under standardized conditions. Consistent with prior
findings, our results confirm that no single detector excels across datasets,
underscoring the importance of model selection. However, even state-of-the-art
selection methods remain far from optimal, revealing critical gaps. mTSBench
provides a unified evaluation suite to enable rigorous, reproducible
comparisons and catalyze future advances in adaptive anomaly detection and
robust model selection.
[LINK]
http://arxiv.org/abs/2506.21550v1
[DATE]
2025-06-27 01:59:58+08:00
[CATEGORIES]
cs.LG
Where to find Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test
[AUTHORS]
Ziyue Li, Chenrui Fan, Tianyi Zhou
[ABSTRACT]
Grokking, i.e., test performance keeps improving long after training loss
converged, has been recently witnessed in neural network training, making the
mechanism of generalization and other emerging capabilities such as reasoning
mysterious. While prior studies usually train small models on a few toy or
highly-specific tasks for thousands of epochs, we conduct the first study of
grokking on checkpoints during one-pass pretraining of a 7B large language
model (LLM), i.e., OLMoE. We compute the training loss and evaluate
generalization on diverse benchmark tasks, including math reasoning, code
generation, and commonsense/domain-specific knowledge retrieval tasks.
Our study, for the first time, verifies that grokking still happens in the
pretraining of large-scale foundation models, though different data may enter
grokking stages asynchronously. We further demystify grokking’s “emergence of
generalization” by investigating LLM internal dynamics. Specifically, we find
that training samples’ pathways (i.e., expert choices across layers) evolve
from random, instance-specific to more structured and shareable between samples
during grokking. Also, the complexity of a sample’s pathway reduces despite the
converged loss. These indicate a memorization-to-generalization conversion,
providing a mechanistic explanation of delayed generalization. In the study, we
develop two novel metrics to quantify pathway distance and the complexity of a
single pathway. We show their ability to predict the generalization improvement
on diverse downstream tasks. They are efficient, simple to compute and solely
dependent on training data. Hence, they have practical value for pretraining,
enabling us to monitor the generalization performance without finetuning and
test. Theoretically, we show that more structured pathways reduce model
complexity and improve the generalization bound.
[LINK]
http://arxiv.org/abs/2506.21551v1
[DATE]
2025-06-27 01:59:58+08:00
[CATEGORIES]
cs.LG
APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization
[AUTHORS]
Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao
[ABSTRACT]
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse
data, but they often struggle with complex reasoning. While Reinforcement
learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky.
Common issues include a drop in performance on general tasks and the generation
of overly detailed or “overthinking” reasoning. Our work investigates how the
KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric
Policy Optimization (APO) to address these issues, which divides the sampled
responses into positive and negative groups. For positive samples,
Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically
adjust the KL divergence weight based on their difficulty. This method prevents
policy entropy from dropping sharply, improves training stability, utilizes
samples better, and preserves the model’s existing knowledge. For negative
samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to
penalize overly long responses. This helps mitigate overthinking and encourages
more concise reasoning while preserving the model’s explorative capacity. We
apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B
significantly enhances reasoning capabilities, showing an average 7\% gain over
the base model and outperforming larger MLLMs (7-11B) on various reasoning
benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade
on general tasks, View-R1-3B maintains consistent improvement, demonstrating
superior generalization. These results highlight the effectiveness and broad
applicability of our DADS and STCR techniques for advancing complex multimodal
reasoning in MLLMs. The code will be made available at
https://github.com/Indolent-Kawhi/View-R1.
[LINK]
http://arxiv.org/abs/2506.21655v1
[DATE]
2025-06-27 01:57:08+08:00
[CATEGORIES]
cs.LG
Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval
[AUTHORS]
Hani Alomari, Anushka Sivakumar, Andrew Zhang, Chris Thomas
[ABSTRACT]
Cross-modal image-text retrieval is challenging because of the diverse
possible associations between content from different modalities. Traditional
methods learn a single-vector embedding to represent semantics of each sample,
but struggle to capture nuanced and diverse relationships that can exist across
modalities. Set-based approaches, which represent each sample with multiple
embeddings, offer a promising alternative, as they can capture richer and more
diverse relationships. In this paper, we show that, despite their promise,
these set-based representations continue to face issues including sparse
supervision and set collapse, which limit their effectiveness. To address
these challenges, we propose Maximal Pair Assignment Similarity to optimize
one-to-one matching between embedding sets, which preserves semantic diversity
within the set. We also introduce two loss functions to further enhance the
representations: Global Discriminative Loss to enhance distinction among
embeddings, and Intra-Set Divergence Loss to prevent collapse within each set.
Our method achieves state-of-the-art performance on MS-COCO and Flickr30k
without relying on external data.
[COMMENTS]
Accepted at the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025 Main)
[LINK]
http://arxiv.org/abs/2506.21538v1
[DATE]
2025-06-27 01:55:34+08:00
[CATEGORIES]
cs.LG
Exploring the Design Space of 3D MLLMs for CT Report Generation
[AUTHORS]
Mohammed Baharoon, Jun Ma, Congyu Fang, Augustin Toma, Bo Wang
[ABSTRACT]
Multimodal Large Language Models (MLLMs) have emerged as a promising way to
automate Radiology Report Generation (RRG). In this work, we systematically
investigate the design space of 3D MLLMs, including visual input
representation, projectors, Large Language Models (LLMs), and fine-tuning
techniques for 3D CT report generation. We also introduce two knowledge-based
report augmentation methods that improve performance on the GREEN score by up
to 10\%, achieving 2nd place in the MICCAI 2024 AMOS-MM challenge. Our
results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely
independent of the size of LLM under the same training protocol. We also show
that larger volume size does not always improve performance if the original ViT
was pre-trained on a smaller volume size. Lastly, we show that using a
segmentation mask along with the CT volume improves performance. The code is
publicly available at https://github.com/bowang-lab/AMOS-MM-Solution.
[LINK]
http://arxiv.org/abs/2506.21535v1
[DATE]
2025-06-27 01:54:20+08:00
[CATEGORIES]
cs.LG
Chain-of-Sketch: Enabling Global Visual Reasoning
[AUTHORS]
Aryo Lotfi, Enrico Fini, Samy Bengio, Moin Nabi, Emmanuel Abbe
[ABSTRACT]
Modern vision models have achieved remarkable success in benchmarks where
local features provide critical information about the target. There is now a
growing interest in tackling tasks requiring more global reasoning, where local
features do not provide significant information. Minsky and Papert put forward
such tasks in 1969 with their connectivity study, exposing the limitations of
the perceptron model. In this paper, we introduce an expanded set of global
visual datasets involving graphs, strings, mazes, and image grids. We show that
large vision models still struggle to learn these tasks efficiently. Similarly,
state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain
this learning inefficiency by means of the ‘globality degree’ measure. To
mitigate this, we propose a method called chain-of-sketch (CoS). Similar to the
chain-of-thought and scratchpad techniques used in language models, CoS breaks
the original task into intermediate visual steps to help learn a complex task.
In addition, we show that not all CoS strategies perform equally well. Our key
insight is to impose a Markovian structure on the CoS frames. This leads to the
introduction of ‘inductive CoS’ which achieves better out-of-distribution
generalization and performs well even with smaller models compared to
non-inductive variants.
[COMMENTS]
additional experiments added, title changed from “Visual Scratchpads:
Enabling Global Reasoning in Vision” to “Chain-of-Sketch: Enabling Global
Visual Reasoning”
[LINK]
http://arxiv.org/abs/2410.08165v2
[DATE]
2025-06-27 01:48:33+08:00
[CATEGORIES]
cs.LG
Mesh-Informed Neural Operator: A Transformer Generative Approach
[AUTHORS]
Yaozhong Shi, Zachary E. Ross, Domniki Asimaki, Kamyar Azizzadenesheli
[LINK]
http://arxiv.org/abs/2506.16656v2
[DATE]
2025-06-27 01:45:03+08:00
[CATEGORIES]
cs.LG
Gaussian Invariant Markov Chain Monte Carlo
[AUTHORS]
Michalis K. Titsias, Angelos Alexopoulos, Siran Liu, Petros Dellaportas
[ABSTRACT]
We develop sampling methods, which consist of Gaussian invariant versions of
random walk Metropolis (RWM), Metropolis adjusted Langevin algorithm (MALA) and
second order Hessian or Manifold MALA. Unlike standard RWM and MALA we show
that Gaussian invariant sampling can lead to ergodic estimators with improved
statistical efficiency. This is due to a remarkable property of Gaussian
invariance that allows us to obtain exact analytical solutions to the Poisson
equation for Gaussian targets. These solutions can be used to construct
efficient and easy to use control variates for variance reduction of estimators
under any intractable target. We demonstrate the new samplers and estimators in
several examples, including high dimensional targets in latent Gaussian models
where we compare against several advanced methods and obtain state-of-the-art
results. We also provide theoretical results regarding geometric ergodicity,
and an optimal scaling analysis that shows the dependence of the optimal
acceptance rate on the Gaussianity of the target.
[COMMENTS]
29 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.21511v1
[DATE]
2025-06-27 01:36:10+08:00
[CATEGORIES]
cs.LG
Process mining-driven modeling and simulation to enhance fault diagnosis in cyber-physical systems
[AUTHORS]
Francesco Vitale, Nicola Dall’Ora, Sebastiano Gaiardelli, Enrico Fraccaroli, Nicola Mazzocca, Franco Fummi
[ABSTRACT]
Fault diagnosis in Cyber-Physical Systems (CPSs) is essential for ensuring
system dependability and operational efficiency by accurately detecting
anomalies and identifying their root causes. However, the manual modeling of
faulty behaviors often demands extensive domain expertise and produces models
that are complex, error-prone, and difficult to interpret. To address this
challenge, we present a novel unsupervised fault diagnosis methodology that
integrates collective anomaly detection in multivariate time series, process
mining, and stochastic simulation. Initially, collective anomalies are detected
from low-level sensor data using multivariate time-series analysis. These
anomalies are then transformed into structured event logs, enabling the
discovery of interpretable process models through process mining. By
incorporating timing distributions into the extracted Petri nets, the approach
supports stochastic simulation of faulty behaviors, thereby enhancing root
cause analysis and behavioral understanding. The methodology is validated using
the Robotic Arm Dataset (RoAD), a widely recognized benchmark in smart
manufacturing. Experimental results demonstrate its effectiveness in modeling,
simulating, and classifying faulty behaviors in CPSs. This enables the creation
of comprehensive fault dictionaries that support predictive maintenance and the
development of digital twins for industrial environments.
[LINK]
http://arxiv.org/abs/2506.21502v1
[DATE]
2025-06-27 01:29:37+08:00
[CATEGORIES]
cs.LG
Devising a solution to the problems of Cancer awareness in Telangana
[AUTHORS]
Priyanka Avhad, Vedanti Kshirsagar, Urvi Ranjan, Mahek Nakhua
[ABSTRACT]
According to the data, the percentage of women who underwent screening for
cervical, breast, and oral cancer in Telangana in the year 2020 was 3.3
percent, 0.3 percent, and 2.3 percent, respectively. Although early detection is
the only way to reduce morbidity and mortality, people have very low awareness
about cervical and breast cancer signs and symptoms and screening practices. We
developed an ML classification model to predict if a person is susceptible to
breast or cervical cancer based on demographic factors. We devised a system to
provide suggestions for the nearest hospitals or cancer treatment centres based
on the user’s location or address. In addition to this, we can integrate the
health card to maintain medical records of all individuals and conduct
awareness drives and campaigns. For ML classification models, we used decision
tree classification and support vector classification algorithms for cervical
cancer susceptibility and breast cancer susceptibility respectively. Thus, by
devising this solution we come one step closer to our goal which is spreading
cancer awareness, thereby, decreasing the cancer mortality and increasing
cancer literacy among the people of Telangana.
[LINK]
http://arxiv.org/abs/2506.21500v1
[DATE]
2025-06-27 01:29:00+08:00
[CATEGORIES]
cs.LG
Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment
[AUTHORS]
Yuhui Sun, Xiyao Wang, Zixi Li, Jinman Zhao
[ABSTRACT]
While large-scale unsupervised language models (LMs) capture broad world
knowledge and reasoning capabilities, steering their behavior toward desired
objectives remains challenging due to the lack of explicit supervision.
Existing alignment techniques, such as reinforcement learning from human
feedback (RLHF), rely on training a reward model and performing reinforcement
learning to align with human preferences. However, RLHF is often
computationally intensive, unstable, and sensitive to hyperparameters.
To address these limitations, Direct Preference Optimization (DPO) was
introduced as a lightweight and stable alternative, enabling direct alignment
of language models with pairwise preference data via classification loss.
However, DPO and its extensions generally assume a single static preference
distribution, limiting flexibility in multi-objective or dynamic alignment
settings.
In this paper, we propose a novel framework: Multi-Preference Lambda-weighted
Listwise DPO, which extends DPO to incorporate multiple human preference
dimensions (e.g., helpfulness, harmlessness, informativeness) and enables
dynamic interpolation through a controllable simplex-weighted formulation. Our
method supports both listwise preference feedback and flexible alignment across
varying user intents without re-training. Empirical and theoretical analysis
demonstrates that our method is as effective as traditional DPO on static
objectives while offering greater generality and adaptability for real-world
deployment.
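To make the simplex-weighted idea above concrete, here is a minimal sketch that combines per-dimension preference losses using weights on the probability simplex. It uses the standard pairwise DPO loss rather than the listwise formulation described in the abstract, and every function name, log-probability, and weight below is a hypothetical value chosen for illustration, not the authors' implementation.

```python
import math


def dpo_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard pairwise DPO loss: -log sigmoid(beta * margin).

    margin compares the policy-vs-reference log-ratio of the preferred
    response (w) with that of the rejected response (l).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(m) = log(1 + exp(-m)), computed as a stable softplus.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def simplex_weighted_loss(per_dimension_losses, lam, tol=1e-9):
    """Combine per-dimension losses with weights lam on the simplex."""
    if len(per_dimension_losses) != len(lam):
        raise ValueError("one weight is needed per preference dimension")
    if any(w < 0 for w in lam) or abs(sum(lam) - 1.0) > tol:
        raise ValueError("lam must be non-negative and sum to 1")
    return sum(w * loss for w, loss in zip(lam, per_dimension_losses))


# Hypothetical log-probabilities for two preference dimensions
# (for example helpfulness and harmlessness).
losses = [
    dpo_pair_loss(-1.0, -2.0, -1.5, -1.5),
    dpo_pair_loss(-2.0, -1.0, -1.5, -1.5),
]
combined = simplex_weighted_loss(losses, lam=[0.7, 0.3])
```

In this sketch, changing `lam` shifts the combined objective between dimensions; how the weights are applied without re-training is specific to the paper and is not reproduced here.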
[COMMENTS]
10 pages, 4 figures, appendix included. To appear in Proceedings of
AAAI 2026. Code:
https://github.com/yuhui15/Multi-Preference-Lambda-weighted-DPO
[LINK]
http://arxiv.org/abs/2506.19780v2
[DATE]
2025-06-27 01:28:25+08:00
[CATEGORIES]
cs.LG
One Model to Forecast Them All and in Entity Distributions Bind Them
[AUTHORS]
Kutay Bölat, Simon Tindemans
[ABSTRACT]
Probabilistic forecasting in power systems often involves multi-entity
datasets like households, feeders, and wind turbines, where generating reliable
entity-specific forecasts presents significant challenges. Traditional
approaches require training individual models for each entity, making them
inefficient and hard to scale. This study addresses this problem using
GUIDE-VAE, a conditional variational autoencoder that allows entity-specific
probabilistic forecasting using a single model. GUIDE-VAE provides flexible
outputs, ranging from interpretable point estimates to full probability
distributions, thanks to its advanced covariance composition structure. These
distributions capture uncertainty and temporal dependencies, offering richer
insights than traditional methods. To evaluate our GUIDE-VAE-based forecaster,
we use household electricity consumption data as a case study due to its
multi-entity and highly stochastic nature. Experimental results demonstrate
that GUIDE-VAE outperforms conventional quantile regression techniques across
key metrics while ensuring scalability and versatility. These features make
GUIDE-VAE a powerful and generalizable tool for probabilistic forecasting
tasks, with potential applications beyond household electricity consumption.
[LINK]
http://arxiv.org/abs/2501.15499v2
[DATE]
2025-06-27 01:28:09+08:00
[CATEGORIES]
cs.LG
Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection
[AUTHORS]
Tobias J. Riedlinger, Kira Maag, Hanno Gottschalk
[ABSTRACT]
Deep neural networks have set the state-of-the-art in computer vision tasks
such as bounding box detection and semantic segmentation. Object detectors and
segmentation models assign confidence scores to predictions, reflecting the
model’s uncertainty in object detection or pixel-wise classification. However,
these confidence estimates are often miscalibrated, as their architectures and
loss functions are tailored to task performance rather than probabilistic
foundation. Even with well calibrated predictions, object detectors fail to
quantify uncertainty outside detected bounding boxes, i.e., the model does not
make a probability assessment of whether an area without detected objects is
truly free of obstacles. This poses a safety risk in applications such as
automated driving, where uncertainty in empty areas remains unexplored. In this
work, we propose an object detection model grounded in spatial statistics.
Bounding box data matches realizations of a marked point process, commonly used
to describe the probabilistic occurrence of spatial point events identified as
bounding box centers, where marks are used to describe the spatial extension of
bounding boxes and classes. Our statistical framework enables a
likelihood-based training and provides well-defined confidence estimates for
whether a region is drivable, i.e., free of objects. We demonstrate the
effectiveness of our method through calibration assessments and evaluation of
performance.
[COMMENTS]
15 pages, 4 figures, 3 tables
[LINK]
http://arxiv.org/abs/2506.21486v1
[DATE]
2025-06-27 01:14:37+08:00
[CATEGORIES]
cs.LG
Evaluation of Traffic Signals for Daily Traffic Pattern
[AUTHORS]
Mohammad Shokrolah Shirazi, Hung-Fu Chang
[ABSTRACT]
The turning movement count data is crucial for traffic signal design,
intersection geometry planning, traffic flow, and congestion analysis. This
work proposes three methods called dynamic, static, and hybrid configuration
for TMC-based traffic signals. A vision-based tracking system is developed to
estimate the TMC of six intersections in Las Vegas using traffic cameras. The
intersection design, route (e.g. vehicle movement directions), and signal
configuration files with compatible formats are synthesized and imported into
Simulation of Urban MObility for signal evaluation with realistic data. The
initial experimental results based on estimated waiting times indicate that
cycle times of 90 and 120 seconds work best for all intersections. In addition,
four intersections show better performance for dynamic signal timing
configuration, and the other two with lower performance have a lower ratio of
total vehicle count to total lanes of the intersection leg. Since daily traffic
flow often exhibits a bimodal pattern, we propose a hybrid signal method that
switches between dynamic and static methods, adapting to peak and off-peak
traffic conditions for improved flow management. So, a built-in traffic
generator module creates vehicle routes for 4 hours, including peak hours, and
a signal design module produces signal schedule cycles according to static,
dynamic, and hybrid methods. Vehicle count distributions are weighted
differently for each zone (i.e., West, North, East, South) to generate diverse
traffic patterns. The extended experimental results for 6 intersections with 4
hours of simulation time imply that zone-based traffic pattern distributions
affect signal design selection. Although the static method works very well for
evenly distributed zone-based traffic, the hybrid method works well for highly
weighted traffic at intersection pairs of the West-East and North-South zones.
[LINK]
http://arxiv.org/abs/2506.21469v1
[DATE]
2025-06-27 00:56:59+08:00
[CATEGORIES]
cs.LG
In-Context Learning Strategies Emerge Rationally
[AUTHORS]
Daniel Wurgaft, Ekdeep Singh Lubana, Core Francisco Park, Hidenori Tanaka, Gautam Reddy, Noah D. Goodman
[ABSTRACT]
Recent work analyzing in-context learning (ICL) has identified a broad set of
strategies that describe model behavior in different experimental conditions.
We aim to unify these findings by asking why a model learns these disparate
strategies in the first place. Specifically, we start with the observation that
when trained to learn a mixture of tasks, as is popular in the literature, the
strategies learned by a model for performing ICL can be captured by a family of
Bayesian predictors: a memorizing predictor, which assumes a discrete prior on
the set of seen tasks, and a generalizing predictor, where the prior matches
the underlying task distribution. Adopting the normative lens of rational
analysis, where a learner’s behavior is explained as an optimal adaptation to
data given computational constraints, we develop a hierarchical Bayesian
framework that almost perfectly predicts Transformer next-token predictions
throughout training – without assuming access to its weights. Under this
framework, pretraining is viewed as a process of updating the posterior
probability of different strategies, and inference-time behavior as a
posterior-weighted average over these strategies’ predictions. Our framework
draws on common assumptions about neural network learning dynamics, which make
explicit a tradeoff between loss and complexity among candidate strategies:
beyond how well it explains the data, a model’s preference towards implementing
a strategy is dictated by its complexity. This helps explain well-known ICL
phenomena, while offering novel predictions: e.g., we show a superlinear trend
in the timescale for transitioning from generalization to memorization as task
diversity increases. Overall, our work advances an explanatory and predictive
account of ICL grounded in tradeoffs between strategy loss and complexity.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2506.17859v2
[DATE]
2025-06-27 00:54:57+08:00
[CATEGORIES]
cs.LG
Optimising 4th-Order Runge-Kutta Methods: A Dynamic Heuristic Approach for Efficiency and Low Storage
[AUTHORS]
Gavin Lee Goodship, Luis Miralles-Pechuan, Stephen O’Sullivan
[ABSTRACT]
Extended Stability Runge-Kutta (ESRK) methods are crucial for solving
large-scale computational problems in science and engineering, including
weather forecasting, aerodynamic analysis, and complex biological modelling.
However, balancing accuracy, stability, and computational efficiency remains
challenging, particularly for high-order, low-storage schemes. This study
introduces a hybrid Genetic Algorithm (GA) and Reinforcement Learning (RL)
approach for automated heuristic discovery, optimising low-storage ESRK
methods. Unlike traditional approaches that rely on manually designed
heuristics or exhaustive numerical searches, our method leverages GA-driven
mutations for search-space exploration and an RL-inspired state transition
mechanism to refine heuristic selection dynamically. This enables systematic
parameter reduction, preserving fourth-order accuracy while significantly
improving computational efficiency. The proposed GA-RL heuristic optimisation
framework is validated through rigorous testing on benchmark problems,
including the 1D and 2D Brusselator systems and the steady-state Navier-Stokes
equations. The best-performing heuristic achieves a 25% reduction in IPOPT
runtime compared to traditional ESRK optimisation processes while maintaining
numerical stability and accuracy. These findings demonstrate the potential of
adaptive heuristic discovery to improve resource efficiency in high-fidelity
simulations and broaden the applicability of low-storage Runge-Kutta methods in
real-world computational fluid dynamics, physics simulations, and other
demanding fields. This work establishes a new paradigm in heuristic
optimisation for numerical methods, opening pathways for further exploration
using Deep RL and AutoML-based heuristic search.
[LINK]
http://arxiv.org/abs/2506.21465v1
[DATE]
2025-06-27 00:51:22+08:00
[CATEGORIES]
cs.LG
Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs
[AUTHORS]
Alexander Ryabchenko, Idan Attias, Daniel M. Roy
[ABSTRACT]
We study online learning with oblivious losses and delays under a novel
“capacity constraint” that limits how many past rounds can be tracked
simultaneously for delayed feedback. Under “clairvoyance” (i.e., delay
durations are revealed upfront each round) and/or “preemptibility” (i.e., we
can stop tracking previously chosen round feedback), we establish matching
upper and lower bounds (up to logarithmic terms) on achievable regret,
characterizing the “optimal capacity” needed to match the minimax rates of
classical delayed online learning, which implicitly assume unlimited capacity.
Our algorithms achieve minimax-optimal regret across all capacity levels, with
performance gracefully degrading under suboptimal capacity. For $K$ actions and
total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C =
\Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C +
D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for
full-information feedback. When replacing clairvoyance with preemptibility, we
require a known maximum delay bound $d_{\max}$, adding
${\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$),
the minimax regret is $\Theta(\sqrt{TK(1+d/C)+Td\log(K)})$ and the optimal
capacity is $\Theta(\min\{K/\log(K),d\})$ in the bandit setting, while in the
full-information feedback setting, the minimax regret is
$\Theta(\sqrt{T(d+1)\log(K)})$ and the optimal capacity is $\Theta(1)$. For
round-dependent and fixed delays, our upper bounds are achieved using novel
preemptive and non-preemptive scheduling policies, based on Pareto-distributed
proxy delays, and batching techniques, respectively. Crucially, our work
unifies delayed bandits, label-efficient learning, and online scheduling
frameworks, demonstrating that robust online learning under delayed feedback is
possible with surprisingly modest tracking capacity.
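The fixed-delay rates quoted above can be evaluated numerically. The sketch below drops the constants hidden by the Theta notation, so the values are only meaningful for comparing capacities; the function names are hypothetical.

```python
import math

def optimal_capacity_bandit(K, d):
    """Optimal tracking capacity in the bandit setting, min{K / log K, d}, up to constants."""
    return min(K / math.log(K), d)

def bandit_regret_rate(T, K, d, C):
    """sqrt(T K (1 + d / C) + T d log K), the fixed-delay bandit rate up to constants."""
    return math.sqrt(T * K * (1 + d / C) + T * d * math.log(K))

def full_info_regret_rate(T, K, d):
    """sqrt(T (d + 1) log K), the fixed-delay full-information rate up to constants."""
    return math.sqrt(T * (d + 1) * math.log(K))
```

At C = K/log(K), the capacity-dependent term T K d / C equals T d log(K), which is already present in the bound, so extra capacity beyond that point no longer changes the rate. At C = d, the same term equals T K, matching the delay-free term.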
[LINK]
http://arxiv.org/abs/2503.19856v2
[DATE]
2025-06-27 00:47:52+08:00
[CATEGORIES]
cs.LG
Wild refitting for black box prediction
[AUTHORS]
Martin J. Wainwright
[ABSTRACT]
We describe and analyze a computationally efficient refitting procedure for
computing high-probability upper bounds on the instance-wise mean-squared
prediction error of penalized nonparametric estimates based on least-squares
minimization. Requiring only a single dataset and black box access to the
prediction method, it consists of three steps: computing suitable residuals,
symmetrizing and scaling them with a pre-factor $\rho$, and using them to
define and solve a modified prediction problem recentered at the current
estimate. We refer to it as wild refitting, since it uses Rademacher residual
symmetrization as in a wild bootstrap variant. Under relatively mild conditions
allowing for noise heterogeneity, we establish a high probability guarantee on
its performance, showing that the wild refit with a suitably chosen wild noise
scale $\rho$ gives an upper bound on prediction error. This theoretical
analysis provides guidance into the design of such procedures, including how
the residuals should be formed, the amount of noise rescaling in the wild
sub-problem needed for upper bounds, and the local stability properties of the
black-box procedure. We illustrate the applicability of this procedure to
various problems, including non-rigid structure-from-motion recovery with
structured matrix penalties; plug-and-play image restoration with deep neural
network priors; and randomized sketching with kernel methods.
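The three steps named above (residuals, Rademacher symmetrization with scale rho, refit recentered at the current estimate) can be sketched as follows. The ridge-polynomial predictor, the function names, and the default parameter values are hypothetical stand-ins for an arbitrary black-box method. How the refit is converted into the final error bound is specified in the paper and is not reproduced here.

```python
import numpy as np

def fit_ridge_poly(x, y, degree=5, lam=1e-2):
    """Hypothetical black-box predictor: ridge regression on polynomial features."""
    X = np.vander(x, degree + 1, increasing=True)
    coef = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return lambda x_new: np.vander(x_new, degree + 1, increasing=True) @ coef

def wild_refit(x, y, fit, rho=1.0, seed=0):
    """Run the three steps using only black-box access to `fit`."""
    rng = np.random.default_rng(seed)
    f_hat = fit(x, y)
    fitted = f_hat(x)
    residuals = y - fitted                          # step 1: residuals
    signs = rng.choice([-1.0, 1.0], size=len(y))    # Rademacher signs
    wild_noise = rho * signs * residuals            # step 2: symmetrize and scale
    f_wild = fit(x, fitted + wild_noise)            # step 3: refit, recentered at f_hat
    return fitted, f_wild(x), wild_noise
```

The returned triple gives the original fitted values, the wild-refit values at the same points, and the wild noise, which are the quantities one would compare when studying how far the refit moves from the current estimate.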
[LINK]
http://arxiv.org/abs/2506.21460v1
[DATE]
2025-06-27 00:41:55+08:00
[CATEGORIES]
cs.LG
Fake it till You Make it: Reward Modeling as Discriminative Prediction
[AUTHORS]
Runtao Liu, Jiahao Zhan, Yingqing He, Chen Wei, Alan Yuille, Qifeng Chen
[ABSTRACT]
An effective reward model plays a pivotal role in reinforcement learning for
post-training enhancement of visual generative models. However, current
approaches of reward modeling suffer from implementation complexity due to
their reliance on extensive human-annotated preference data or meticulously
engineered quality dimensions that are often incomplete and
engineering-intensive. Inspired by adversarial training in generative
adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward
modeling framework that eliminates manual preference annotation and explicit
quality dimension engineering. Our method trains the reward model through
discrimination between a small set of representative, unpaired target
samples (denoted as Preference Proxy Data) and model-generated ordinary outputs,
requiring only a few hundred target samples. Comprehensive experiments
demonstrate our GAN-RM’s effectiveness across multiple key applications
including test-time scaling implemented as Best-of-N sample filtering,
post-training approaches like Supervised Fine-Tuning (SFT) and Direct
Preference Optimization (DPO). Code and data will be released at
https://github.com/Visualignment/GAN-RM.
[LINK]
http://arxiv.org/abs/2506.13846v2
[DATE]
2025-06-27 00:39:32+08:00
[CATEGORIES]
cs.LG
Measurement to Meaning: A Validity-Centered Framework for AI Evaluation
[AUTHORS]
Olawale Salaudeen, Anka Reuel, Ahmed Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Ben Domingue, Angelina Wang, Sanmi Koyejo
[ABSTRACT]
While the capabilities and utility of AI systems have advanced, rigorous
norms for evaluating these systems have lagged. Grand claims, such as models
achieving general reasoning capabilities, are supported with model performance
on narrow benchmarks, like performance on graduate-level exam questions, which
provide a limited and potentially misleading assessment. We provide a
structured approach for reasoning about the types of evaluative claims that can
be made given the available evidence. For instance, our framework helps
determine whether performance on a mathematical benchmark is an indication of
the ability to solve problems on math tests or instead indicates a broader
ability to reason. Our framework is well-suited for the contemporary paradigm
in machine learning, where various stakeholders provide measurements and
evaluations that downstream users use to validate their claims and decisions.
At the same time, our framework also informs the construction of evaluations
designed to speak to the validity of the relevant claims. By leveraging
psychometrics’ breakdown of validity, evaluations can prioritize the most
critical facets for a given claim, improving empirical utility and
decision-making efficacy. We illustrate our framework through detailed case
studies of vision and language model evaluations, highlighting how explicitly
considering validity strengthens the connection between evaluation evidence and
the claims being made.
[COMMENTS]
Correspondence to [email protected]
[LINK]
http://arxiv.org/abs/2505.10573v4
[DATE]
2025-06-27 00:38:11+08:00
[CATEGORIES]
cs.LG
PARALLELPROMPT: Extracting Parallelism from Large Language Model Queries
[AUTHORS]
Steven Kolawole, Keshav Santhanam, Virginia Smith, Pratiksha Thaker
[ABSTRACT]
LLM serving systems typically treat user prompts as monolithic inputs,
optimizing inference through decoding tricks or inter-query batching. However,
many real-world prompts contain latent semantic parallelism: decomposable
structures where subtasks can be executed independently to reduce latency while
preserving meaning. We introduce PARALLELPROMPT, the first benchmark for
measuring intra-query parallelism in natural user prompts. Our dataset
comprises over 37,000 real-world prompts from public LLM chat logs, each
annotated with a structured schema capturing task templates, shared context,
and iteration inputs. These schemas are extracted using LLM-assisted prompting
with rule-based multilingual validation. To evaluate the benefits of
decomposition, we provide an execution suite that benchmarks serial vs.
parallel strategies, measuring latency, structural adherence, and semantic
fidelity. Our results show that intra-query parallelism can be successfully
parsed in over 75% of curated datasets, unlocking up to 5x speedups on tasks
like translation, comprehension, and comparative analysis, with minimal quality
degradation. By releasing this benchmark, curation pipeline, and evaluation
suite, we provide the first standardized testbed for studying structure-aware
execution in LLM serving pipelines.
[COMMENTS]
In Adaptive Foundation Models: Evolving AI for Personalized and
Efficient Learning
[LINK]
http://arxiv.org/abs/2506.18728v2
[DATE]
2025-06-27 00:35:54+08:00
[CATEGORIES]
cs.LG
Towards an Optimal Control Perspective of ResNet Training
[AUTHORS]
Jens Püttschneider, Simon Heilig, Asja Fischer, Timm Faulwasser
[ABSTRACT]
We propose a training formulation for ResNets reflecting an optimal control
problem that is applicable for standard architectures and general loss
functions. We suggest bridging both worlds via penalizing intermediate outputs
of hidden states corresponding to stage cost terms in optimal control. For
standard ResNets, we obtain intermediate outputs by propagating the state
through the subsequent skip connections and the output layer. We demonstrate
that our training dynamic biases the weights of the unnecessary deeper residual
layers to vanish. This indicates the potential for a theory-grounded layer
pruning strategy.
[COMMENTS]
Accepted for presentation at the High-dimensional Learning Dynamics
(HiLD) workshop at ICML 2025
[LINK]
http://arxiv.org/abs/2506.21453v1
[DATE]
2025-06-27 00:34:47+08:00
[CATEGORIES]
cs.LG
Learnable Adaptive Time-Frequency Representation via Differentiable Short-Time Fourier Transform
[AUTHORS]
Maxime Leiber, Yosra Marnissi, Axel Barrau, Sylvain Meignen, Laurent Massoulié
[ABSTRACT]
The short-time Fourier transform (STFT) is widely used for analyzing
non-stationary signals. However, its performance is highly sensitive to its
parameters, and manual or heuristic tuning often yields suboptimal results. To
overcome this limitation, we propose a unified differentiable formulation of
the STFT that enables gradient-based optimization of its parameters. This
approach addresses the limitations of traditional STFT parameter tuning
methods, which often rely on computationally intensive discrete searches. It
enables fine-tuning of the time-frequency representation (TFR) based on any
desired criterion. Moreover, our approach integrates seamlessly with neural
networks, allowing joint optimization of the STFT parameters and network
weights. The efficacy of the proposed differentiable STFT in enhancing TFRs and
improving performance in downstream tasks is demonstrated through experiments
on both simulated and real-world data.
[COMMENTS]
DSTFT, STFT, spectrogram, time-frequency, IEEE Transactions on Signal
Processing, 10 pages
[LINK]
http://arxiv.org/abs/2506.21440v1
[DATE]
2025-06-27 00:24:27+08:00
[CATEGORIES]
cs.LG
New Bounds for Sparse Variational Gaussian Processes
[AUTHORS]
Michalis K. Titsias
[ABSTRACT]
Sparse variational Gaussian processes (GPs) construct tractable posterior
approximations to GP models. At the core of these methods is the assumption
that the true posterior distribution over training function values ${\bf f}$
and inducing variables ${\bf u}$ is approximated by a variational distribution
that incorporates the conditional GP prior $p({\bf f} | {\bf u})$ in its
factorization. While this assumption is considered as fundamental, we show that
for model training we can relax it through the use of a more general
variational distribution $q({\bf f} | {\bf u})$ that depends on $N$ extra
parameters, where $N$ is the number of training examples. In GP regression, we
can analytically optimize the evidence lower bound over the extra parameters
and express a tractable collapsed bound that is tighter than the previous
bound. The new bound is also amenable to stochastic optimization and its
implementation requires minor modifications to existing sparse GP code.
Further, we also describe extensions to non-Gaussian likelihoods. On several
datasets we demonstrate that our method can reduce bias when learning the
hyperparameters and can lead to better predictive performance.
[COMMENTS]
18 pages, 5 figures
[LINK]
http://arxiv.org/abs/2502.08730v2
[DATE]
2025-06-27 00:24:25+08:00
[CATEGORIES]
cs.LG
Graph Neural Network for Neutrino Physics Event Reconstruction
[AUTHORS]
V Hewes, Adam Aurisano, Giuseppe Cerati, Jim Kowalkowski, Claire Lee, Wei-keng Liao, Daniel Grzenda, Kaushal Gumpula, Xiaohe Zhang
[ABSTRACT]
Liquid Argon Time Projection Chamber (LArTPC) detector technology offers a
wealth of high-resolution information on particle interactions, and leveraging
that information to its full potential requires sophisticated automated
reconstruction techniques. This article describes NuGraph2, a Graph Neural
Network (GNN) for low-level reconstruction of simulated neutrino interactions
in a LArTPC detector. Simulated neutrino interactions in the MicroBooNE
detector geometry are described as heterogeneous graphs, with energy
depositions on each detector plane forming nodes on planar subgraphs. The
network utilizes a multi-head attention message-passing mechanism to perform
background filtering and semantic labelling on these graph nodes, identifying
those associated with the primary physics interaction with 98.0% efficiency
and labelling them according to particle type with 94.9% efficiency. The
network operates directly on detector observables across multiple 2D
representations, but utilizes a 3D-context-aware mechanism to encourage
consistency between these representations. Model inference takes 0.12 s/event
on a CPU, and 0.005 s/event batched on a GPU. This architecture is designed to
be a general-purpose solution for particle reconstruction in neutrino physics,
with the potential for deployment across a broad range of detector
technologies, and offers a core convolution engine that can be leveraged for a
variety of tasks beyond the two described in this article.
[COMMENTS]
18 pages, 14 figures, published in Physical Review D
[LINK]
http://arxiv.org/abs/2403.11872v2
[DATE]
2025-06-27 00:15:31+08:00
[CATEGORIES]
cs.LG
The Sample Complexity of Learning Lipschitz Operators with respect to Gaussian Measures
[AUTHORS]
Ben Adcock, Michael Griebel, Gregor Maier
[ABSTRACT]
Operator learning, the approximation of mappings between infinite-dimensional
function spaces using machine learning, has gained increasing research
attention in recent years. Approximate operators, learned from data, can serve
as efficient surrogate models for problems in computational science and
engineering, complementing traditional methods. However, despite their
empirical success, our understanding of the underlying mathematical theory is
in large part still incomplete. In this paper, we study the approximation of
Lipschitz operators with respect to Gaussian measures. We prove higher Gaussian
Sobolev regularity of Lipschitz operators and establish lower and upper bounds
on the Hermite polynomial approximation error. We then study general
reconstruction strategies of Lipschitz operators from $m$ arbitrary
(potentially adaptive) linear samples. As a key finding, we tightly
characterize the corresponding sample complexity, that is, the smallest
achievable worst-case error among all possible choices of (adaptive) sampling
and reconstruction strategies in terms of $m$. As a consequence, we identify an
inherent curse of sample complexity: No method to approximate Lipschitz
operators based on $m$ linear samples can achieve algebraic convergence rates
in $m$. On the positive side, we prove that a sufficiently fast spectral decay
of the covariance operator of the underlying Gaussian measure guarantees
convergence rates which are arbitrarily close to any algebraic rate. Overall,
by tightly characterizing the sample complexity, our work confirms the
intrinsic difficulty of learning Lipschitz operators, regardless of the data or
learning technique.
[COMMENTS]
Section 6 about pointwise sampling in v2 of this paper has been cut
and will appear elsewhere
[LINK]
http://arxiv.org/abs/2410.23440v3
[DATE]
2025-06-27 00:15:09+08:00
[CATEGORIES]
cs.LG
Deception Detection in Dyadic Exchanges Using Multimodal Machine Learning: A Study on a Swedish Cohort
[AUTHORS]
Franco Rugolon, Thomas Jack Samuels, Stephan Hau, Lennart Högman
[ABSTRACT]
This study investigates the efficacy of using multimodal machine learning
techniques to detect deception in dyadic interactions, focusing on the
integration of data from both the deceiver and the deceived. We compare early
and late fusion approaches, utilizing audio and video data - specifically,
Action Units and gaze information - across all possible combinations of
modalities and participants. Our dataset, newly collected from Swedish native
speakers engaged in truth or lie scenarios on emotionally relevant topics,
serves as the basis for our analysis. The results demonstrate that
incorporating both speech and facial information yields superior performance
compared to single-modality approaches. Moreover, including data from both
participants significantly enhances deception detection accuracy, with the best
performance (71%) achieved using a late fusion strategy applied to both
modalities and participants. These findings align with psychological theories
suggesting differential control of facial and vocal expressions during initial
interactions. As the first study of its kind on a Scandinavian cohort, this
research lays the groundwork for future investigations into dyadic
interactions, particularly within psychotherapy settings.
[COMMENTS]
40 pages, 2 figures, 2 tables. To be submitted to Behavior Research
Methods
[LINK]
http://arxiv.org/abs/2506.21429v1
[DATE]
2025-06-27 00:11:42+08:00
[CATEGORIES]
cs.LG
Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning
[AUTHORS]
Prajwal Koirala, Cody Fleming
[ABSTRACT]
Generative models such as diffusion and flow-matching offer expressive
policies for offline reinforcement learning (RL) by capturing rich, multimodal
action distributions, but their iterative sampling introduces high inference
costs and training instability due to gradient propagation across sampling
steps. We propose the Single-Step Completion Policy (SSCP), a
generative policy trained with an augmented flow-matching objective to predict
direct completion vectors from intermediate flow samples, enabling accurate,
one-shot action generation. In an off-policy actor-critic framework, SSCP
combines the expressiveness of generative models with the training and
inference efficiency of unimodal policies, without requiring long
backpropagation chains. Our method scales effectively to offline,
offline-to-online, and online RL settings, offering substantial gains in speed
and adaptability over diffusion-based baselines. We further extend SSCP to
goal-conditioned RL, enabling flat policies to exploit subgoal structures
without explicit hierarchical inference. SSCP achieves strong results across
standard offline RL and behavior cloning benchmarks, positioning it as a
versatile, expressive, and efficient framework for deep RL and sequential
decision-making.
[LINK]
http://arxiv.org/abs/2506.21427v1
[DATE]
2025-06-27 00:09:53+08:00
[CATEGORIES]
cs.LG
TracLLM: A Generic Framework for Attributing Long Context LLMs
[AUTHORS]
Yanting Wang, Wei Zou, Runpeng Geng, Jinyuan Jia
[ABSTRACT]
Long context large language models (LLMs) are deployed in many real-world
applications such as RAG, agent, and broad LLM-integrated applications. Given
an instruction and a long context (e.g., documents, PDF files, webpages), a
long context LLM can generate an output grounded in the provided context,
aiming to provide more accurate, up-to-date, and verifiable outputs while
reducing hallucinations and unsupported claims. This raises a research
question: how to pinpoint the texts (e.g., sentences, passages, or paragraphs)
in the context that contribute most to or are responsible for the generated
output by an LLM? This process, which we call context traceback, has various
real-world applications, such as 1) debugging LLM-based systems, 2) conducting
post-attack forensic analysis of attacks (e.g., prompt injection and
knowledge corruption attacks) on an LLM, and 3) highlighting knowledge sources
to enhance the trust of users towards outputs generated by LLMs. When applied
to context traceback for long context LLMs, existing feature attribution
methods such as Shapley have sub-optimal performance and/or incur a large
computational cost. In this work, we develop TracLLM, the first generic context
traceback framework tailored to long context LLMs. Our framework can improve
the effectiveness and efficiency of existing feature attribution methods. To
improve the efficiency, we develop an informed search based algorithm in
TracLLM. We also develop contribution score ensemble/denoising techniques to
improve the accuracy of TracLLM. Our evaluation results show TracLLM can
effectively identify texts in a long context that lead to the output of an LLM.
Our code and data are at: https://github.com/Wang-Yanting/TracLLM.
[COMMENTS]
To appear in USENIX Security Symposium 2025. The code and data are
at: https://github.com/Wang-Yanting/TracLLM
[LINK]
http://arxiv.org/abs/2506.04202v3
[DATE]
2025-06-27 00:09:36+08:00
[CATEGORIES]
cs.LG
Improving Stochastic Cubic Newton with Momentum
[AUTHORS]
El Mahdi Chayti, Nikita Doikov, Martin Jaggi
[ABSTRACT]
We study stochastic second-order methods for solving general non-convex
optimization problems. We propose using a special version of momentum to
stabilize the stochastic gradient and Hessian estimates in Newton’s method. We
show that momentum provably improves the variance of stochastic estimates and
allows the method to converge for any noise level. Using the cubic
regularization technique, we prove a global convergence rate for our method on
general non-convex problems to a second-order stationary point, even when using
only a single stochastic data sample per iteration. This starkly contrasts with
all existing stochastic second-order methods for non-convex problems, which
typically require large batches. Therefore, we are the first to demonstrate
global convergence for batches of arbitrary size in the non-convex case for the
Stochastic Cubic Newton. Additionally, we show improved speed on convex
stochastic problems for our regularized Newton methods with momentum.
[LINK]
http://arxiv.org/abs/2410.19644v2
[DATE]
2025-06-27 00:07:20+08:00
[CATEGORIES]
cs.LG
Scalable Bayesian Low-Rank Adaptation of Large Language Models via Stochastic Variational Subspace Inference
[AUTHORS]
Colin Samplawski, Adam D. Cobb, Manoj Acharya, Ramneet Kaur, Susmit Jha
[ABSTRACT]
Despite their widespread use, large language models (LLMs) are known to
hallucinate incorrect information and be poorly calibrated. This makes the
uncertainty quantification of these models of critical importance, especially
in high-stakes domains, such as autonomy and healthcare. Prior work has made
Bayesian deep learning-based approaches to this problem more tractable by
performing inference over the low-rank adaptation (LoRA) parameters of a
fine-tuned model. While effective, these approaches struggle to scale to larger
LLMs because they require additional parameters beyond those of LoRA. In this
work, we present Scalable Bayesian Low-Rank
Adaptation via Stochastic Variational Subspace Inference (ScalaBL). We perform
Bayesian inference in an $r$-dimensional subspace, for LoRA rank $r$. By
repurposing the LoRA parameters as projection matrices, we are able to map
samples from this subspace into the full weight space of the LLM. This allows
us to learn all the parameters of our approach using stochastic variational
inference. Despite the low dimensionality of our subspace, we are able to
achieve competitive performance with state-of-the-art approaches while only
requiring ${\sim}1000$ additional parameters. Furthermore, it allows us to
scale up to the largest Bayesian LLM to date, with four times as many base
parameters as prior work.
[COMMENTS]
Accepted at UAI 2025
[LINK]
http://arxiv.org/abs/2506.21408v1
[DATE]
2025-06-26 23:54:45+08:00
[CATEGORIES]
cs.LG
cs.CL
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
[AUTHORS]
Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
[ABSTRACT]
Diffusion large language models (dLLMs) are compelling alternatives to
autoregressive (AR) models because their denoising models operate over the
entire sequence. The global planning and iterative refinement features of dLLMs
are particularly useful for code generation. However, current training and
inference mechanisms for dLLMs in coding are still under-explored. To demystify
the decoding behavior of dLLMs and unlock their potential for coding, we
systematically investigate their denoising processes and reinforcement learning
(RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code.
Using this model as a testbed, we analyze its decoding behavior, revealing how
it differs from that of AR models: (1) dLLMs can decide how causal their
generation should be without relying on semi-AR decoding, and (2) increasing
the sampling temperature diversifies not only token choices but also their
generation order. This diversity creates a rich search space for RL rollouts.
For RL training, to reduce the variance of token log-likelihood estimates and
maintain training efficiency, we propose coupled-GRPO, a novel
sampling scheme that constructs complementary mask noise for completions used
in training. In our experiments, coupled-GRPO significantly improves
DiffuCoder’s performance on code generation benchmarks (+4.4% on EvalPlus) and
reduces reliance on AR bias during decoding. Our work provides deeper insight
into the machinery of dLLM generation and offers an effective, diffusion-native
RL training framework. https://github.com/apple/ml-diffucoder.
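One way to picture the "complementary mask noise" mentioned above is a pair of masks over a completion in which every token position is masked in exactly one of the two. The sketch below shows only that pairing. The function name and the mask_ratio parameter are hypothetical, and how the paper samples masks and combines the resulting log-likelihood estimates is not reproduced here.

```python
import numpy as np

def complementary_masks(length, mask_ratio, rng):
    """Return two boolean masks over `length` positions that are exact complements."""
    perm = rng.permutation(length)
    k = int(round(mask_ratio * length))  # number of positions masked in the first mask
    mask_a = np.zeros(length, dtype=bool)
    mask_a[perm[:k]] = True
    mask_b = ~mask_a                     # masks every position the first mask leaves visible
    return mask_a, mask_b
```

Because the two masks partition the positions, each token contributes to a masked-token estimate exactly once across the pair, which is the intuition behind using such pairs to reduce the variance of per-token estimates.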
[COMMENTS]
minor update
[LINK]
http://arxiv.org/abs/2506.20639v2
[DATE]
2025-06-26 23:46:40+08:00
[CATEGORIES]
cs.CL
Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings
[AUTHORS]
Ghazal Al-Shwayyat, Omer Nezih Gerek
[ABSTRACT]
Arabic dialect recognition presents a significant challenge in speech
technology due to the linguistic diversity of Arabic and the scarcity of large
annotated datasets, particularly for underrepresented dialects. This research
investigates hybrid modeling strategies that integrate classical signal
processing techniques with deep learning architectures to address this problem
in low-resource scenarios. Two hybrid models were developed and evaluated: (1)
Mel-Frequency Cepstral Coefficients (MFCC) combined with a Convolutional Neural
Network (CNN), and (2) Discrete Wavelet Transform (DWT) features combined with
a Recurrent Neural Network (RNN). The models were trained on a dialect-filtered
subset of the Common Voice Arabic dataset, with dialect labels assigned based
on speaker metadata. Experimental results demonstrate that the MFCC + CNN
architecture achieved superior performance, with an accuracy of 91.2% and
strong precision, recall, and F1-scores, significantly outperforming the
Wavelet + RNN configuration, which achieved an accuracy of 66.5%. These
findings highlight the effectiveness of leveraging spectral features with
convolutional models for Arabic dialect recognition, especially when working
with limited labeled data. The study also identifies limitations related to
dataset size, potential regional overlaps in labeling, and model optimization,
providing a roadmap for future research. Recommendations for further
improvement include the adoption of larger annotated corpora, integration of
self-supervised learning techniques, and exploration of advanced neural
architectures such as Transformers. Overall, this research establishes a strong
baseline for future developments in Arabic dialect recognition within
resource-constrained environments.
[LINK]
http://arxiv.org/abs/2506.21386v1
[DATE]
2025-06-26 23:36:25+08:00
[CATEGORIES]
cs.CL
Leveraging LLM-Assisted Query Understanding for Live Retrieval-Augmented Generation
[AUTHORS]
Guanting Dong, Xiaoxi Li, Yuyao Zhang, Mengjie Deng
[ABSTRACT]
Real-world live retrieval-augmented generation (RAG) systems face significant
challenges when processing user queries that are often noisy, ambiguous, and
contain multiple intents. While RAG enhances large language models (LLMs) with
external knowledge, current systems typically struggle with such complex
inputs, as they are often trained or evaluated on cleaner data. This paper
introduces Omni-RAG, a novel framework designed to improve the robustness and
effectiveness of RAG systems in live, open-domain settings. Omni-RAG employs
LLM-assisted query understanding to preprocess user inputs through three key
modules: (1) Deep Query Understanding and Decomposition, which utilizes LLMs
with tailored prompts to denoise queries (e.g., correcting spelling errors) and
decompose multi-intent queries into structured sub-queries; (2) Intent-Aware
Knowledge Retrieval, which performs retrieval for each sub-query from a corpus
(i.e., FineWeb using OpenSearch) and aggregates the results; and (3) Reranking
and Generation, where a reranker (i.e., BGE) refines document selection before
a final response is generated by an LLM (i.e., Falcon-10B) using a
chain-of-thought prompt. Omni-RAG aims to bridge the gap between current RAG
capabilities and the demands of real-world applications, such as those
highlighted by the SIGIR 2025 LiveRAG Challenge, by robustly handling complex
and noisy queries.
[COMMENTS]
Accepted at SIGIR 2025 LiveRAG Workshop (Oral Presentation)
[LINK]
http://arxiv.org/abs/2506.21384v1
[DATE]
2025-06-26 23:35:12+08:00
[CATEGORIES]
cs.CL
Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models
[AUTHORS]
Fangzhou Dong, Yifan Zeng, Yingpeng Sang, Hong Shen
[ABSTRACT]
Large Language Models (LLMs) excel in understanding and generating text but
struggle with providing professional literary criticism for works with profound
thoughts and complex narratives. This paper proposes GLASS (Greimas Literary
Analysis via Semiotic Square), a structured analytical framework based on
Greimas Semiotic Square (GSS), to enhance LLMs’ ability to conduct in-depth
literary analysis. GLASS facilitates the rapid dissection of narrative
structures and deep meanings in narrative works. We propose the first dataset
for GSS-based literary criticism, featuring detailed analyses of 48 works. Then
we propose quantitative metrics for GSS-based literary criticism using the
LLM-as-a-judge paradigm. Our framework’s results, compared with expert
criticism across multiple works and LLMs, show high performance. Finally, we
applied GLASS to 39 classic works, producing original and high-quality analyses
that address existing research gaps. This research provides an AI-based tool
for literary research and education, offering insights into the cognitive
mechanisms underlying literary engagement.
[COMMENTS]
Accepted in CogSci 2025
[LINK]
http://arxiv.org/abs/2506.21360v1
[DATE]
2025-06-26 23:10:24+08:00
[CATEGORIES]
cs.CL
Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
[AUTHORS]
Jiajie Yang
[ABSTRACT]
Mixture-of-Experts (MoE) architectures have emerged as a key strategy for
scaling large language models (LLMs) efficiently. However, current MoE systems
suffer from severe load imbalance, where only a small subset of experts is
consistently activated during training and inference, leading to significant
underutilization of model capacity and computational resources. In this work,
we revisit expert routing through a clustering perspective and propose Latent
Prototype Routing (LPR), a novel routing framework that generalizes existing
approaches while promoting balanced expert utilization without compromising
downstream performance. Extensive experiments across multiple open-source MoE
models – including DeepSeek-V3, Qwen3-MoE, and Mixtral – demonstrate that LPR
reduces the Gini coefficient of expert load from 0.70 to 0.035 on average,
improves the min-max expert load ratio from 1e-6 to 0.70, achieving
near-perfect load balancing.
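
As an illustration of the two balance metrics quoted above, the following is a minimal sketch of how a Gini coefficient and a min-max ratio could be computed from per-expert token counts. The abstract does not give the paper's exact definitions, so the standard sorted-rank Gini formula is assumed here, and the function names `gini` and `min_max_ratio` are hypothetical.

```python
def gini(loads):
    """Gini coefficient of non-negative loads (0 = perfectly balanced).

    Assumption: the standard sorted-rank formula; the paper may define
    its metric differently.
    """
    xs = sorted(loads)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Weight each load by its 1-based rank in ascending order.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def min_max_ratio(loads):
    """Least-loaded expert's load divided by the most-loaded expert's load."""
    highest = max(loads)
    return min(loads) / highest if highest > 0 else 0.0


if __name__ == "__main__":
    balanced = [250, 250, 250, 250]   # tokens routed to each of 4 experts
    collapsed = [0, 0, 0, 1000]       # all tokens routed to one expert
    print(gini(balanced), min_max_ratio(balanced))
    print(gini(collapsed), min_max_ratio(collapsed))
```

Under this definition a fully collapsed router over n experts yields a Gini coefficient of (n - 1) / n, while a perfectly uniform router yields 0 and a min-max ratio of 1.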
[COMMENTS]
15 pages, 4 figures
[LINK]
http://arxiv.org/abs/2506.21328v1
[DATE]
2025-06-26 22:41:18+08:00
[CATEGORIES]
cs.LG
cs.CL
Exploring Adapter Design Tradeoffs for Low Resource Music Generation
[AUTHORS]
Atharva Mehta, Shivam Chauhan, Monojit Choudhury
[ABSTRACT]
Fine-tuning large-scale music generation models, such as MusicGen and
Mustango, is a computationally expensive process, often requiring updates to
billions of parameters and, therefore, significant hardware resources.
Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based
methods, have emerged as a promising alternative, enabling adaptation with
minimal trainable parameters while preserving model performance. However, the
design choices for adapters, including their architecture, placement, and size,
are numerous, and it is unclear which of these combinations would produce
optimal adapters and why, for a given case of low-resource music genre. In this
paper, we attempt to answer this question by studying various adapter
configurations for two AI music models, MusicGen and Mustango, on two genres:
Hindustani Classical and Turkish Makam music.
Our findings reveal distinct trade-offs: convolution-based adapters excel in
capturing fine-grained local musical details such as ornamentations and short
melodic phrases, while transformer-based adapters better preserve long-range
dependencies crucial for structured improvisation. Additionally, we analyze
computational resource requirements across different adapter scales,
demonstrating how mid-sized adapters (40M parameters) achieve an optimal
balance between expressivity and quality. Furthermore, we find that Mustango, a
diffusion-based model, generates more diverse outputs with better adherence to
the description in the input prompt while lacking in providing stability in
notes, rhythm alignment, and aesthetics. Also, it is computationally intensive
and requires significantly more time to train. In contrast, autoregressive
models like MusicGen offer faster training and are more efficient, and can
produce better quality output in comparison, but have slightly higher
redundancy in their generations.
[COMMENTS]
9 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.21298v1
[DATE]
2025-06-26 22:18:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
[AUTHORS]
Bram Willemsen, Gabriel Skantze
[ABSTRACT]
In this paper, we explore the use of a text-only, autoregressive language
modeling approach for the extraction of referring expressions from visually
grounded dialogue. More specifically, the aim is to investigate the extent to
which the linguistic context alone can inform the detection of mentions that
have a (visually perceivable) referent in the visual context of the
conversation. To this end, we adapt a pretrained large language model (LLM) to
perform a relatively coarse-grained annotation of mention spans in unfolding
conversations by demarcating mention span boundaries in text via next-token
prediction. Our findings indicate that even when using a moderately sized LLM,
relatively small datasets, and parameter-efficient fine-tuning, a text-only
approach can be effective, highlighting the relative importance of the
linguistic context for this task. Nevertheless, we argue that the task
represents an inherently multimodal problem and discuss limitations fundamental
to unimodal approaches.
[COMMENTS]
Accepted for publication at XLLM @ ACL 2025
[LINK]
http://arxiv.org/abs/2506.21294v1
[DATE]
2025-06-26 22:14:20+08:00
[CATEGORIES]
cs.CL
Small Encoders Can Rival Large Decoders in Detecting Groundedness
[AUTHORS]
Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
[ABSTRACT]
Augmenting large language models (LLMs) with external context significantly
improves their performance in natural language processing (NLP) tasks. However,
LLMs struggle to answer queries reliably when the provided context lacks
information, often resorting to ungrounded speculation or internal knowledge.
Groundedness - generating responses strictly supported by the context - is
essential for ensuring factual consistency and trustworthiness. This study
focuses on detecting whether a given query is grounded in a document provided
in context before the costly answer generation by LLMs. Such a detection
mechanism can significantly reduce both inference time and resource
consumption. We show that lightweight, task-specific encoder models such as
RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy
comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in
groundedness detection while reducing inference latency by orders of magnitude.
The code is available at: https://github.com/chandarlab/Hallucinate-less
[LINK]
http://arxiv.org/abs/2506.21288v1
[DATE]
2025-06-26 22:09:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
[AUTHORS]
Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
[ABSTRACT]
While slow-thinking large language models (LLMs) exhibit reflection-like
reasoning, commonly referred to as the “aha moment”, their ability to generate
informative critiques and refine prior solutions remains limited. In this
paper, we introduce Double-Checker, a principled framework designed to enhance
the reasoning capabilities of slow-thinking LLMs by fostering explicit
self-critique and iterative refinement of their previous solutions. By
fine-tuning on our curated 1,730 self-critical instances, Double-Checker
empowers long-CoT LLMs to iteratively critique and refine their outputs during
inference until they evaluate their solutions as correct under self-generated
critiques. We validate the efficacy of Double-Checker across a comprehensive
suite of reasoning benchmarks, demonstrating that iterative self-critique
significantly enhances the reasoning capabilities of long-CoT LLMs. Notably,
our Double-Checker increases the pass@1 performance on challenging AIME
benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These
results highlight a promising direction for developing more trustworthy and
effective LLMs capable of structured self-critique.
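
The inference-time loop described above (critique, refine, stop once the critique accepts the solution) can be sketched as follows. This is only an illustrative control-flow sketch: the callables `solve`, `critique`, and `refine` and the `max_rounds` cap are hypothetical placeholders, not the authors' interface, and in practice each would wrap a call to the fine-tuned model.

```python
def refine_until_accepted(problem, solve, critique, refine, max_rounds=4):
    """Iteratively critique and refine a solution.

    solve(problem) -> solution
    critique(problem, solution) -> (accepted: bool, feedback)
    refine(problem, solution, feedback) -> new solution
    All three callables are hypothetical stand-ins for model calls.
    """
    solution = solve(problem)
    for _ in range(max_rounds):
        accepted, feedback = critique(problem, solution)
        if accepted:
            break  # the self-generated critique judges the solution correct
        solution = refine(problem, solution, feedback)
    return solution


if __name__ == "__main__":
    # Toy stand-ins: the "problem" is a target number, the first attempt is
    # off by three, and each refinement moves one step closer.
    target = 10
    result = refine_until_accepted(
        target,
        solve=lambda p: p - 3,
        critique=lambda p, s: (s == p, p - s),
        refine=lambda p, s, fb: s + (1 if fb > 0 else -1),
    )
    print(result)
```

The round cap is an assumption added so the loop always terminates even if the critique never accepts a solution.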
[COMMENTS]
10 pages
[LINK]
http://arxiv.org/abs/2506.21285v1
[DATE]
2025-06-26 22:05:45+08:00
[CATEGORIES]
cs.CL
Cat and Mouse – Can Fake Text Generation Outpace Detector Systems?
[AUTHORS]
Andrea McGlinchey, Peter J Barclay
[ABSTRACT]
Large language models can produce convincing “fake text” in domains such as
academic writing, product reviews, and political news. Many approaches have
been investigated for the detection of artificially generated text. While this
may seem to presage an endless “arms race”, we note that newer LLMs use ever
more parameters, training data, and energy, while relatively simple classifiers
demonstrate a good level of detection accuracy with modest resources. To
approach the question of whether the models’ ability to beat the detectors may
therefore reach a plateau, we examine the ability of statistical classifiers to
identify “fake text” in the style of classical detective fiction. Over a 0.5
version increase, we found that Gemini showed an increased ability to generate
deceptive text, while GPT did not. This suggests that reliable detection of
fake text may remain feasible even for ever-larger models, though new model
architectures may improve their deceptiveness.
[COMMENTS]
(Submitted for publication)
[LINK]
http://arxiv.org/abs/2506.21274v1
[DATE]
2025-06-26 21:58:43+08:00
[CATEGORIES]
cs.CL
A Troublemaker with Contagious Jailbreak Makes Chaos in Honest Towns
[AUTHORS]
Tianyi Men, Pengfei Cao, Zhuoran Jin, Yubo Chen, Kang Liu, Jun Zhao
[ABSTRACT]
With the development of large language models, they are widely used as agents
in various fields. A key component of agents is memory, which stores vital
information but is susceptible to jailbreak attacks. Existing research mainly
focuses on single-agent attacks and shared memory attacks. However, real-world
scenarios often involve independent memory. In this paper, we propose the
Troublemaker Makes Chaos in Honest Town (TMCHT) task, a large-scale,
multi-agent, multi-topology text-based attack evaluation framework. TMCHT
involves one attacker agent attempting to mislead an entire society of agents.
We identify two major challenges in multi-agent attacks: (1) Non-complete graph
structure, (2) Large-scale systems. We attribute these challenges to a
phenomenon we term toxicity disappearing. To address these issues, we propose
an Adversarial Replication Contagious Jailbreak (ARCJ) method, which optimizes
the retrieval suffix to make poisoned samples more easily retrieved and
optimizes the replication suffix to make poisoned samples have contagious
ability. We demonstrate the superiority of our approach in TMCHT, with 23.51%,
18.95%, and 52.93% improvements in line topology, star topology, and 100-agent
settings. We encourage community attention to the security of multi-agent systems.
[COMMENTS]
ACL 2025 Main
[LINK]
http://arxiv.org/abs/2410.16155v2
[DATE]
2025-06-26 21:45:10+08:00
[CATEGORIES]
cs.CL
Simulating Hard Attention Using Soft Attention
[AUTHORS]
Andy Yang, Lena Strobl, David Chiang, Dana Angluin
[ABSTRACT]
We study conditions under which transformers using soft attention can
simulate hard attention, that is, effectively focus all attention on a subset
of positions. First, we examine several subclasses of languages recognized by
hard-attention transformers, which can be defined in variants of linear
temporal logic. We demonstrate how soft-attention transformers can compute
formulas of these logics using unbounded positional embeddings or temperature
scaling. Second, we demonstrate how temperature scaling allows softmax
transformers to simulate general hard-attention transformers, using a
temperature that depends on the minimum gap between the maximum attention
scores and other attention scores.
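
To make the role of the score gap concrete, here is a minimal sketch of softmax with a temperature chosen from the gap between the largest score and the runner-up. The bound used is the standard one (with n positions and gap g, the mass off the maximum is at most (n - 1) * exp(-g / tau)); it is not taken from the paper, whose construction may differ. The sketch assumes a unique maximum, and the names `softmax` and `temperature_for` are hypothetical.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax of scores divided by a temperature."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temperature_for(gap, n, eps):
    """Temperature at which a unique maximum receives at least 1 - eps mass.

    Assumes n >= 2 positions, gap > 0 between the top score and every
    other score, and 0 < eps < 1.
    """
    return gap / math.log((n - 1) / eps)


if __name__ == "__main__":
    scores = [1.0, 0.5, 0.2, 0.0]   # gap between top two scores is 0.5
    tau = temperature_for(gap=0.5, n=len(scores), eps=0.01)
    print(softmax(scores, 1.0))     # diffuse attention
    print(softmax(scores, tau))     # nearly all mass on position 0
```

The smaller the gap, the lower the temperature must be, which is why the required temperature depends on the minimum gap over the inputs of interest.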
[COMMENTS]
19 pages
[LINK]
http://arxiv.org/abs/2412.09925v2
[DATE]
2025-06-26 21:41:24+08:00
[CATEGORIES]
cs.LG
cs.CL
Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
[AUTHORS]
Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
[ABSTRACT]
As Multimodal Large Language Models (MLLMs) advance, multimodal agents show
promise in real-world tasks like web navigation and embodied intelligence.
However, due to a lack of external feedback, these agents
struggle with self-correction and generalization. A promising approach is to
use reward models as external feedback, but there is no clear guidance on how to select
reward models for agents. Thus, there is an urgent need to build a reward bench
targeted at agents. To address these challenges, we propose Agent-RewardBench,
a benchmark designed to evaluate reward modeling ability in MLLMs. The
benchmark is characterized by three key features: (1) Multiple dimensions and
real-world agent scenarios evaluation. It covers perception, planning, and
safety with 7 scenarios; (2) Step-level reward evaluation. It allows for the
assessment of agent capabilities at the individual steps of a task, providing a
more granular view of performance during the planning process; and (3)
Appropriate difficulty and high quality. We carefully sample from 10 diverse
models, apply difficulty control to maintain task challenge, and use manual
verification to ensure the integrity of the data. Experiments demonstrate that even
state-of-the-art multimodal models show limited performance, highlighting the
need for specialized training in agent reward modeling. Code is available at
GitHub.
[COMMENTS]
ACL 2025 Main
[LINK]
http://arxiv.org/abs/2506.21252v1
[DATE]
2025-06-26 21:36:12+08:00
[CATEGORIES]
cs.CL
Capturing Style in Author and Document Representation
[AUTHORS]
Enzo Terreau, Antoine Gourru, Julien Velcin
[ABSTRACT]
A wide range of Deep Natural Language Processing (NLP) models integrates
continuous and low dimensional representations of words and documents.
Surprisingly, very few models study representation learning for authors. These
representations can be used for many NLP tasks, such as author identification
and classification, or in recommendation systems. A strong limitation of
existing works is that they do not explicitly capture writing style, making
them hardly applicable to literary data. We therefore propose a new
architecture based on Variational Information Bottleneck (VIB) that learns
embeddings for both authors and documents with a stylistic constraint. Our
model fine-tunes a pre-trained document encoder. We stimulate the detection of
writing style by adding predefined stylistic features, making the representation
axis interpretable with respect to writing style indicators. We evaluate our
method on three datasets: a literary corpus extracted from the Gutenberg
Project, the Blog Authorship Corpus and IMDb62, for which we show that it
matches or outperforms strong/recent baselines in authorship attribution while
capturing much more accurately the authors' stylistic aspects.
[LINK]
http://arxiv.org/abs/2407.13358v2
[DATE]
2025-06-26 21:21:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
[AUTHORS]
Yongchan Chun, Minhyuk Kim, Dongjun Kim, Chanjun Park, Heuiseok Lim
[ABSTRACT]
Automatic Term Extraction (ATE) identifies domain-specific expressions that
are crucial for downstream tasks such as machine translation and information
retrieval. Although large language models (LLMs) have significantly advanced
various NLP tasks, their potential for ATE has scarcely been examined. We
propose a retrieval-based prompting strategy that, in the few-shot setting,
selects demonstrations according to syntactic rather than semantic
similarity. This syntactic retrieval method is domain-agnostic and provides
more reliable guidance for capturing term boundaries. We evaluate the approach
in both in-domain and cross-domain settings, analyzing how lexical overlap
between the query sentence and its retrieved examples affects performance.
Experiments on three specialized ATE benchmarks show that syntactic retrieval
improves F1-score. These findings highlight the importance of syntactic cues
when adapting LLMs to terminology-extraction tasks.
[LINK]
http://arxiv.org/abs/2506.21222v1
[DATE]
2025-06-26 21:14:52+08:00
[CATEGORIES]
cs.CL
Unveiling Causal Reasoning in Large Language Models: Reality or Mirage?
[AUTHORS]
Haoang Chi, He Li, Wenjing Yang, Feng Liu, Long Lan, Xiaoguang Ren, Tongliang Liu, Bo Han
[ABSTRACT]
Causal reasoning capability is critical in advancing large language models
(LLMs) toward strong artificial intelligence. While versatile LLMs appear to
have demonstrated capabilities in understanding contextual causality and
providing responses that obey the laws of causality, it remains unclear whether
they perform genuine causal reasoning akin to humans. However, current evidence
indicates the contrary. Specifically, LLMs are only capable of performing
shallow (level-1) causal reasoning, primarily attributed to the causal
knowledge embedded in their parameters, but they lack the capacity for genuine
human-like (level-2) causal reasoning. To support this hypothesis,
methodologically, we delve into the autoregression mechanism of
transformer-based LLMs, revealing that it is not inherently causal.
Empirically, we introduce a new causal Q&A benchmark called CausalProbe-2024,
whose corpora are fresh and nearly unseen for the studied LLMs. The LLMs
exhibit a significant performance drop on CausalProbe-2024 compared to earlier
benchmarks, indicating that they primarily engage in level-1 causal
reasoning. To bridge the gap towards level-2 causal reasoning, we draw
inspiration from the fact that human reasoning is usually facilitated by
general knowledge and intended goals. We propose G^2-Reasoner, a method that
incorporates general knowledge and goal-oriented prompts into LLMs’ causal
reasoning processes. Experiments demonstrate that G^2-Reasoner significantly
enhances LLMs’ causal reasoning capability, particularly in fresh and
counterfactual contexts. This work sheds light on a new path for LLMs to
advance towards genuine causal reasoning, going beyond level-1 and making
strides towards level-2.
[COMMENTS]
24 pages, accepted at NeurIPS 2024
[LINK]
http://arxiv.org/abs/2506.21215v1
[DATE]
2025-06-26 21:11:01+08:00
[CATEGORIES]
cs.CL
cs.LG
TAPS: Tool-Augmented Personalisation via Structured Tagging
[AUTHORS]
Ekaterina Taktasheva, Jeff Dalton
[ABSTRACT]
Recent advancements in tool-augmented large language models have enabled them
to interact with external tools, enhancing their ability to perform complex
user tasks. However, existing approaches overlook the role of personalisation
in guiding tool use. This work investigates how user preferences can be
effectively integrated into goal-oriented dialogue agents. Through extensive
analysis, we identify key weaknesses in the ability of LLMs to personalise tool
use. To this end, we introduce TAPS, a novel solution that enhances
personalised tool use by leveraging a structured tagging tool and an
uncertainty-based tool detector. TAPS significantly improves the ability of
LLMs to incorporate user preferences, achieving the new state-of-the-art for
open source models on the NLSI task.
[LINK]
http://arxiv.org/abs/2506.20409v2
[DATE]
2025-06-26 21:09:40+08:00
[CATEGORIES]
cs.CL
LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey
[AUTHORS]
Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
[ABSTRACT]
Recent advances in large language models (LLMs) have sparked growing interest
in building fully autonomous agents. However, fully autonomous LLM-based agents
still face significant challenges, including limited reliability due to
hallucinations, difficulty in handling complex tasks, and substantial safety
and ethical risks, all of which limit their feasibility and trustworthiness in
real-world applications. To overcome these limitations, LLM-based human-agent
systems (LLM-HAS) incorporate human-provided information, feedback, or control
into the agent system to enhance system performance, reliability and safety.
These human-agent collaboration systems enable humans and LLM-based agents to
collaborate effectively by leveraging their complementary strengths. This paper
provides the first comprehensive and structured survey of LLM-HAS. It clarifies
fundamental concepts, systematically presents core components shaping these
systems, including environment & profiling, human feedback, interaction types,
orchestration and communication, explores emerging applications, and discusses
unique challenges and opportunities arising from human-AI collaboration. By
consolidating current knowledge and offering a structured overview, we aim to
foster further research and innovation in this rapidly evolving
interdisciplinary field. Paper lists and resources are available at
https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
[COMMENTS]
Paper lists and resources are available at
https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems
[LINK]
http://arxiv.org/abs/2505.00753v4
[DATE]
2025-06-26 20:53:30+08:00
[CATEGORIES]
cs.CL
cs.LG
Prompt-Guided Turn-Taking Prediction
[AUTHORS]
Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara
[ABSTRACT]
Turn-taking prediction models are essential components in spoken dialogue
systems and conversational robots. Recent approaches leverage transformer-based
architectures to predict speech activity continuously and in real-time. In this
study, we propose a novel model that enables turn-taking prediction to be
dynamically controlled via textual prompts. This approach allows intuitive and
explicit control through instructions such as “faster” or “calmer”, adapting
dynamically to conversational partners and contexts. The proposed model builds
upon a transformer-based voice activity projection (VAP) model, incorporating
textual prompt embeddings into both channel-wise transformers and a
cross-channel transformer. We evaluated the feasibility of our approach using
over 950 hours of human-human spoken dialogue data. Since textual prompt data
for the proposed approach was not available in existing datasets, we utilized a
large language model (LLM) to generate synthetic prompt sentences. Experimental
results demonstrated that the proposed model improved prediction accuracy and
effectively varied turn-taking timing behaviors according to the textual
prompts.
[COMMENTS]
This paper has been accepted for presentation at SIGdial Meeting on
Discourse and Dialogue 2025 (SIGDIAL 2025) and represents the author’s
version of the work
[LINK]
http://arxiv.org/abs/2506.21191v1
[DATE]
2025-06-26 20:49:07+08:00
[CATEGORIES]
cs.CL
Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
[AUTHORS]
Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
[ABSTRACT]
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation
platform for text embedding models. While previous work has established the
core benchmark methodology, this paper focuses on the engineering aspects that
ensure MTEB’s continued reproducibility and extensibility. We present our
approach to maintaining robust continuous integration pipelines that validate
dataset integrity, automate test execution, and assess benchmark results’
generalizability. We detail the design choices that collectively enhance
reproducibility and usability. Furthermore, we discuss our strategies for
handling community contributions and extending the benchmark with new tasks and
datasets. These engineering practices have been instrumental in scaling MTEB to
become more comprehensive while maintaining quality and, ultimately, relevance
to the field. Our experiences offer valuable insights for benchmark maintainers
facing similar challenges in ensuring reproducibility and usability in machine
learning evaluation frameworks. The MTEB repository is available at:
https://github.com/embeddings-benchmark/mteb
[LINK]
http://arxiv.org/abs/2506.21182v1
[DATE]
2025-06-26 20:40:48+08:00
[CATEGORIES]
cs.CL
Compressed and Smooth Latent Space for Text Diffusion Modeling
[AUTHORS]
Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov
[ABSTRACT]
Autoregressive language models dominate modern text generation, yet their
sequential nature introduces fundamental limitations: decoding is slow, and
maintaining global coherence remains challenging. Diffusion models offer a
promising alternative by enabling parallel generation and flexible control;
however, their application to text generation is hindered by the high
dimensionality of token-level representations. We introduce Cosmos, a novel
approach to text generation that operates entirely in a compressed, smooth
latent space tailored specifically for diffusion. This space is learned using
an autoencoder trained simultaneously for token-level reconstruction and
alignment with frozen activations from a pretrained language encoder, providing
robust semantic grounding and enabling effective perturbation-based
augmentations. Empirically, we demonstrate that text representations can be
compressed by $8\times$ while maintaining generation quality comparable to
token-level diffusion models. Furthermore, increasing the latent sequence
length allows Cosmos to surpass both diffusion-based and autoregressive
baselines. We evaluate Cosmos on four diverse generative tasks including story
generation, question generation, summarization, and detoxification and compare
it with various generative paradigms. Cosmos achieves comparable or superior
generation quality while offering more than $2\times$ faster inference.
[LINK]
http://arxiv.org/abs/2506.21170v1
[DATE]
2025-06-26 20:05:13+08:00
[CATEGORIES]
cs.CL
CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
[AUTHORS]
Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
[ABSTRACT]
Ensuring that Large Language Models (LLMs) align with mainstream human values
and ethical norms is crucial for the safe and sustainable development of AI.
Current value evaluation and alignment are constrained by Western cultural bias
and incomplete domestic frameworks reliant on non-native rules; furthermore,
the lack of scalable, rule-driven scenario generation methods makes evaluations
costly and inadequate across diverse cultural contexts. To address these
challenges, we propose a hierarchical value framework grounded in core Chinese
values, encompassing three main dimensions, 12 core values, and 50 derived
values. Based on this framework, we construct a large-scale Chinese Values
Corpus (CVC) containing over 250,000 value rules enhanced and expanded through
human annotation. Experimental results show that CVC-guided scenarios
outperform direct generation ones in value boundaries and content diversity. In
the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven
mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while
five Chinese human annotators showed an 87.5% alignment with CVC, confirming
its universality, cultural relevance, and strong alignment with Chinese values.
Additionally, we construct 400,000 rule-based moral dilemma scenarios that
objectively capture nuanced distinctions in conflicting value prioritization
across 17 LLMs. Our work establishes a culturally-adaptive benchmarking
framework for comprehensive value evaluation and alignment, representing
Chinese characteristics. All data are available at
https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at
https://github.com/Beijing-AISI/CVC.
[LINK]
http://arxiv.org/abs/2506.01495v4
[DATE]
2025-06-26 19:34:33+08:00
[CATEGORIES]
cs.CL
Do Large Language Models Advocate for Inferentialism?
[AUTHORS]
Yuzuki Arai, Sho Tsugawa
[ABSTRACT]
The emergence of large language models (LLMs) such as ChatGPT and Claude
presents new challenges for philosophy of language, particularly regarding the
nature of linguistic meaning and representation. While LLMs have traditionally
been understood through distributional semantics, this paper explores Robert
Brandom’s inferential semantics as an alternative foundational framework for
understanding these systems. We examine how key features of inferential
semantics – including its anti-representationalist stance, logical
expressivism, and quasi-compositional approach – align with the architectural
and functional characteristics of Transformer-based LLMs. Through analysis of
the ISA (Inference, Substitution, Anaphora) approach, we demonstrate that LLMs
exhibit fundamentally anti-representationalist properties in their processing
of language. We further develop a consensus theory of truth appropriate for
LLMs, grounded in their interactive and normative dimensions through mechanisms
like RLHF. While acknowledging significant tensions between inferentialism’s
philosophical commitments and LLMs’ sub-symbolic processing, this paper argues
that inferential semantics provides valuable insights into how LLMs generate
meaning without reference to external world representations. Our analysis
suggests that LLMs may challenge traditional assumptions in philosophy of
language, including strict compositionality and semantic externalism, though
further empirical investigation is needed to fully substantiate these
theoretical claims.
[LINK]
http://arxiv.org/abs/2412.14501v2
[DATE]
2025-06-26 19:03:13+08:00
[CATEGORIES]
cs.CL
Learning Evaluation Models from Large Language Models for Sequence Generation
[AUTHORS]
Chenglong Wang, Hang Zhou, Kaiyan Chang, Tongran Liu, Chunliang Zhang, Quan Du, Tong Xiao, Yue Zhang, Jingbo Zhu
[ABSTRACT]
Automatic evaluation of sequence generation, traditionally reliant on metrics
like BLEU and ROUGE, often fails to capture the semantic accuracy of generated
text sequences due to their emphasis on n-gram overlap. A promising solution to
this problem is to develop model-based metrics, such as BLEURT and COMET.
However, these approaches are typically hindered by the scarcity of labeled
evaluation data, which is necessary to train the evaluation models. In this
work, we address this challenge by proposing the Customized Sequence
Evaluation Metric (CSEM), a three-stage evaluation model training method that
utilizes large language models to generate labeled data for model-based metric
development, thereby eliminating the need for human-labeled data. Additionally,
we expand the scope of CSEM to support various evaluation types, including
single-aspect, multi-aspect, reference-free, and reference-based evaluations,
enabling the customization of metrics to suit diverse real-world scenarios.
Experimental results on the SummEval benchmark demonstrate that CSEM can
effectively train an evaluation model without human-labeled data. Further
experiments in reinforcement learning and reranking show that metrics developed
through CSEM outperform traditional evaluation metrics, leading to substantial
improvements in sequence quality as evaluated by both commonly used metrics and
ChatGPT.
[COMMENTS]
Accepted by TASLP 2025
[LINK]
http://arxiv.org/abs/2308.04386v3
[DATE]
2025-06-26 18:00:23+08:00
[CATEGORIES]
cs.CL
Progtuning: Progressive Fine-tuning Framework for Transformer-based Language Models
[AUTHORS]
Xiaoshuang Ji, Zhendong Zhao, Xiaojun Chen, Xin Zhao, Zeyao Liu
[ABSTRACT]
Fine-tuning is a promising technique for leveraging Transformer-based
language models in downstream tasks. As model sizes continue to grow, updating
all model parameters becomes increasingly costly. Parameter-efficient
fine-tuning methods effectively address this issue by selectively updating a
small subset of parameters. However, fine-tuning and most existing
parameter-efficient fine-tuning methods require updating the same number of
parameters as the initial size, ignoring the unequal contribution across
Transformer blocks and leading to extremely inefficient allocation of computing
resources. In this paper, we propose Progtuning, a novel fine-tuning
framework combined with progressive learning for Transformer-based language
models. Specifically, Progtuning progressively reduces the number of updated
transformer blocks based on the contribution. Remarkably, Progtuning optimizes
resource allocation and reduces the number of updated parameters by
approximately 25%, while still maintaining competitive performance. It
also exhibits high adaptability with parameter-efficient fine-tuning methods,
demonstrating excellent performance across various adaptation scenarios.
[COMMENTS]
Accepted by ICONIP 2024
[LINK]
http://arxiv.org/abs/2506.21119v1
[DATE]
2025-06-26 17:37:15+08:00
[CATEGORIES]
cs.CL
Learning to Skip the Middle Layers of Transformers
[AUTHORS]
Tim Lawson, Laurence Aitchison
[ABSTRACT]
Conditional computation is a popular strategy to make Transformers more
efficient. Existing methods often target individual modules (e.g.,
mixture-of-experts layers) or skip layers independently of one another.
However, interpretability research has demonstrated that the middle layers of
Transformers exhibit greater redundancy, and that early layers aggregate
information into token positions. Guided by these insights, we propose a novel
architecture that dynamically skips a variable number of layers from the middle
outward. In particular, a learned gating mechanism determines whether to bypass
a symmetric span of central blocks based on the input, and a gated attention
mechanism prevents subsequent tokens from attending to skipped token positions.
Residual norms are controlled with a ‘sandwich’ or ‘perilayernorm’ scheme and
gate sparsity with an adaptive regularization loss. We had aimed to reduce
compute requirements for ‘simpler’ tokens and potentially foster an emergent
multi-level representational hierarchy but, at the scales investigated, our
approach does not achieve improvements in the trade-off between validation
cross-entropy and estimated FLOPs compared to dense baselines with fewer
layers. We release our code at https://github.com/tim-lawson/skip-middle.
[COMMENTS]
11 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.21103v1
[DATE]
2025-06-26 17:01:19+08:00
[CATEGORIES]
cs.LG
cs.CL
ComRAG: Retrieval-Augmented Generation with Dynamic Vector Stores for Real-time Community Question Answering in Industry
[AUTHORS]
Qinwen Chen, Wenbiao Tao, Zhiwei Zhu, Mingfan Xi, Liangzhong Guo, Yuan Wang, Wei Wang, Yunshi Lan
[ABSTRACT]
Community Question Answering (CQA) platforms can be regarded as important
community knowledge bases, but effectively leveraging historical
interactions and domain knowledge in real-time remains a challenge. Existing
methods often underutilize external knowledge, fail to incorporate dynamic
historical QA context, or lack memory mechanisms suited for industrial
deployment. We propose ComRAG, a retrieval-augmented generation framework for
real-time industrial CQA that integrates static knowledge with dynamic
historical QA pairs via a centroid-based memory mechanism designed for
retrieval, generation, and efficient storage. Evaluated on three industrial CQA
datasets, ComRAG consistently outperforms all baselines, achieving up to 25.9%
improvement in vector similarity, reducing latency by 8.7% to 23.3%, and
lowering chunk growth from 20.23% to 2.06% over iterations.
[COMMENTS]
7 pages, 4 figures. Accepted at ACL 2025 Industry Track
[LINK]
http://arxiv.org/abs/2506.21098v1
[DATE]
2025-06-26 16:48:16+08:00
[CATEGORIES]
cs.CL
DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning
[AUTHORS]
Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji
[ABSTRACT]
Previous multimodal sentence representation learning methods have achieved
impressive performance. However, most approaches focus on aligning images and
text at a coarse level, facing two critical challenges: cross-modal misalignment
bias and intra-modal semantic divergence, which significantly degrade sentence
representation quality. To address these challenges, we propose DALR
(Dual-level Alignment Learning for Multimodal Sentence Representation). For
cross-modal alignment, we propose a consistency learning module that softens
negative samples and utilizes semantic similarity from an auxiliary task to
achieve fine-grained cross-modal alignment. Additionally, we contend that
sentence relationships go beyond binary positive-negative labels, exhibiting a
more intricate ranking structure. To better capture these relationships and
enhance representation quality, we integrate ranking distillation with global
intra-modal alignment learning. Comprehensive experiments on semantic textual
similarity (STS) and transfer (TR) tasks validate the effectiveness of our
approach, consistently demonstrating its superiority over state-of-the-art
baselines.
[COMMENTS]
Accepted by ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.21096v1
[DATE]
2025-06-26 16:45:14+08:00
[CATEGORIES]
cs.CL
Enhancing LLM Tool Use with High-quality Instruction Data from Knowledge Graph
[AUTHORS]
Jingwei Wang, Zai Zhang, Hao Qian, Chunjing Gan, Binbin Hu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou, Bin Shi, Bo Dong
[ABSTRACT]
Teaching large language models (LLMs) to use tools is crucial for improving
their problem-solving abilities and expanding their applications. However,
effectively using tools is challenging because it requires a deep understanding
of tool functionalities and user intentions. Previous methods relied mainly on
LLMs to generate instruction data, but the quality of these data was often
insufficient. In this paper, we propose a new method that uses knowledge graphs
to generate high-quality instruction data for LLMs. Knowledge graphs are
manually curated datasets rich in semantic information. We begin by extracting
various query pathways from a given knowledge graph, which are transformed into
a broad spectrum of user queries. We then translate the relationships between
entities into actionable tools and parse the pathways of each query into
detailed solution steps, thereby creating high-quality instruction data. Our
experiments show that fine-tuning on just a small sample of this synthetic data
can significantly improve the tool utilization and overall capabilities of
LLMs.
[COMMENTS]
20 pages, 12 figures
[LINK]
http://arxiv.org/abs/2506.21071v1
[DATE]
2025-06-26 15:45:15+08:00
[CATEGORIES]
cs.LG
cs.CL
Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
[AUTHORS]
Yaorui Shi, Sihang Li, Chang Wu, Zhiyuan Liu, Junfeng Fang, Hengxing Cai, An Zhang, Xiang Wang
[ABSTRACT]
Large language models have demonstrated impressive reasoning capabilities but
are inherently limited by their knowledge reservoir. Retrieval-augmented
reasoning mitigates this limitation by allowing LLMs to query external
resources, but existing methods often retrieve irrelevant or noisy information,
hindering accurate reasoning. In this paper, we propose AutoRefine, a
reinforcement learning post-training framework that adopts a new
“search-and-refine-during-think” paradigm. AutoRefine introduces explicit
knowledge refinement steps between successive search calls, enabling the model
to iteratively filter, distill, and organize evidence before generating an
answer. Furthermore, we incorporate tailored retrieval-specific rewards
alongside answer correctness rewards using group relative policy optimization.
Experiments on single-hop and multi-hop QA benchmarks demonstrate that
AutoRefine significantly outperforms existing approaches, particularly in
complex, multi-hop reasoning scenarios. Detailed analysis shows that AutoRefine
issues frequent, higher-quality searches and synthesizes evidence effectively.
[LINK]
http://arxiv.org/abs/2505.11277v3
[DATE]
2025-06-26 14:52:37+08:00
[CATEGORIES]
cs.CL
A Semi-supervised Scalable Unified Framework for E-commerce Query Classification
[AUTHORS]
Chunyuan Yuan, Chong Zhang, Zheng Fang, Ming Pang, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law
[ABSTRACT]
Query classification, including multiple subtasks such as intent and category
prediction, is vital to e-commerce applications. E-commerce queries are usually
short and lack context, and the information between labels cannot be used,
resulting in insufficient prior information for modeling. Most existing
industrial query classification methods rely on users’ posterior click behavior
to construct training samples, resulting in a Matthew vicious cycle.
Furthermore, the subtasks of query classification lack a unified framework,
leading to low efficiency for algorithm optimization.
In this paper, we propose a novel Semi-supervised Scalable Unified Framework
(SSUF), containing multiple enhanced modules to unify the query classification
tasks. The knowledge-enhanced module uses world knowledge to enhance query
representations and solve the problem of insufficient query information. The
label-enhanced module uses label semantics and semi-supervised signals to
reduce the dependence on posterior labels. The structure-enhanced module
enhances the label representation based on the complex label relations. Each
module is highly pluggable, and input features can be added or removed as
needed according to each subtask. We conduct extensive offline and online A/B
experiments, and the results show that SSUF significantly outperforms the
state-of-the-art models.
[COMMENTS]
Accepted by ACL 2025
[LINK]
http://arxiv.org/abs/2506.21049v1
[DATE]
2025-06-26 14:52:33+08:00
[CATEGORIES]
cs.CL
MockLLM: A Multi-Agent Behavior Collaboration Framework for Online Job Seeking and Recruiting
[AUTHORS]
Hongda Sun, Hongzhan Lin, Haiyu Yan, Yang Song, Xin Gao, Rui Yan
[ABSTRACT]
Online recruitment platforms have reshaped job-seeking and recruiting
processes, driving increased demand for applications that enhance person-job
matching. Traditional methods generally rely on analyzing textual data from
resumes and job descriptions, limiting the dynamic, interactive aspects crucial
to effective recruitment. Recent advances in Large Language Models (LLMs) have
revealed remarkable potential in simulating adaptive, role-based dialogues,
making them well-suited for recruitment scenarios. In this paper, we propose
MockLLM, a novel framework to generate and evaluate mock interview
interactions. The system consists of two key components: mock interview
generation and two-sided evaluation in a handshake protocol. By simulating both
interviewer and candidate roles, MockLLM enables consistent and collaborative
interactions for real-time and two-sided matching. To further improve the
matching quality, MockLLM incorporates reflection memory generation and
dynamic strategy modification, refining behaviors based on previous experience.
We evaluate MockLLM on real-world data from Boss Zhipin, a major Chinese recruitment
platform. The experimental results indicate that MockLLM outperforms existing
methods in matching accuracy, scalability, and adaptability across job domains,
highlighting its potential to advance candidate assessment and online
recruitment.
[COMMENTS]
Accepted by KDD 2025 Research Track
[LINK]
http://arxiv.org/abs/2405.18113v2
[DATE]
2025-06-26 14:33:55+08:00
[CATEGORIES]
cs.CL
SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
[AUTHORS]
Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong
[ABSTRACT]
The modeling of industrial scenes is essential for simulations in industrial
manufacturing. While large language models (LLMs) have shown significant
progress in generating general 3D scenes from textual descriptions, generating
industrial scenes with LLMs poses a unique challenge due to their demand for
precise measurements and positioning, requiring complex planning over spatial
arrangement. To address this challenge, we introduce SceneGenAgent, an
LLM-based agent for generating industrial scenes through C# code. SceneGenAgent
ensures precise layout planning through a structured and calculable format,
layout verification, and iterative refinement to meet the quantitative
requirements of industrial scenarios. Experimental results demonstrate that LLMs
powered by SceneGenAgent exceed their original performance, reaching up to
81.0% success rate in real-world industrial scene generation tasks and
effectively meeting most scene generation requirements. To further enhance
accessibility, we construct SceneInstruct, a dataset designed for fine-tuning
open-source LLMs to integrate into SceneGenAgent. Experiments show that
fine-tuning open-source LLMs on SceneInstruct yields significant performance
improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our
code and data are available at https://github.com/THUDM/SceneGenAgent.
[COMMENTS]
Accepted to ACL 2025
[LINK]
http://arxiv.org/abs/2410.21909v3
[DATE]
2025-06-26 14:24:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Large Language Models Acing Chartered Accountancy
[AUTHORS]
Jatin Gupta, Akhil Sharma, Saransh Singhania, Mohammad Adnan, Sakshi Deo, Ali Imam Abidi, Keshav Gupta
[ABSTRACT]
Advanced intelligent systems, particularly Large Language Models (LLMs), are
significantly reshaping financial practices through advancements in Natural
Language Processing (NLP). However, the extent to which these models
effectively capture and apply domain-specific financial knowledge remains
uncertain. Addressing a critical gap in the expansive Indian financial context,
this paper introduces CA-Ben, a Chartered Accountancy benchmark specifically
designed to evaluate the financial, legal, and quantitative reasoning
capabilities of LLMs. CA-Ben comprises structured question-answer datasets
derived from the rigorous examinations conducted by the Institute of Chartered
Accountants of India (ICAI), spanning foundational, intermediate, and advanced
CA curriculum stages. Six prominent LLMs (GPT-4o, LLAMA 3.3 70B, LLAMA 3.1
405B, MISTRAL Large, Claude 3.5 Sonnet, and Microsoft Phi 4) were evaluated
using standardized protocols. Results indicate variations in performance, with
Claude 3.5 Sonnet and GPT-4o outperforming others, especially in conceptual and
legal reasoning. Notable challenges emerged in numerical computations and legal
interpretations. The findings emphasize the strengths and limitations of
current LLMs, suggesting future improvements through hybrid reasoning and
retrieval-augmented generation methods, particularly for quantitative analysis
and accurate legal interpretation.
[COMMENTS]
Accepted for publication at MoStart 2025: International Conference on
Digital Transformation in Education and Applications of Artificial
Intelligence, Bosnia and Herzegovina, 2025
[LINK]
http://arxiv.org/abs/2506.21031v1
[DATE]
2025-06-26 14:10:37+08:00
[CATEGORIES]
cs.CL
SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control
[AUTHORS]
Adithya Chittem, Aishna Shrivastava, Sai Tarun Pendela, Jagat Sesh Challa, Dhruv Kumar
[ABSTRACT]
Large language models (LLMs) have gained significant traction across a wide
range of fields in recent years. There is also a growing expectation for them
to display human-like personalities during interactions. To meet this
expectation, numerous studies have proposed methods for modelling LLM
personalities through psychometric evaluations. However, most existing models
face two major limitations: they rely on the Big Five (OCEAN) framework, which
only provides coarse personality dimensions, and they lack mechanisms for
controlling trait intensity. In this paper, we address this gap by extending
the Machine Personality Inventory (MPI), which originally used the Big Five
model, to incorporate the 16 Personality Factor (16PF) model, allowing
expressive control over sixteen distinct traits. We also developed a structured
framework known as Specific Attribute Control (SAC) for evaluating and
dynamically inducing trait intensity in LLMs. Our method introduces
adjective-based semantic anchoring to guide trait intensity expression and
leverages behavioural questions across five intensity factors:
Frequency, Depth, Threshold, Effort, and
Willingness. Through experimentation, we find that modelling intensity
as a continuous spectrum yields substantially more consistent and controllable
personality expression compared to binary trait toggling. Moreover, we observe
that changes in target trait intensity systematically influence closely related
traits in psychologically coherent directions, suggesting that LLMs internalize
multi-dimensional personality structures rather than treating traits in
isolation. Our work opens new pathways for controlled and nuanced human-machine
interactions in domains such as healthcare, education, and interviewing
processes, bringing us one step closer to truly human-like social machines.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.20993v1
[DATE]
2025-06-26 12:12:15+08:00
[CATEGORIES]
cs.CL
SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
[AUTHORS]
Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
[ABSTRACT]
Fine-tuning vision language models (VLMs) has achieved remarkable performance
across various downstream tasks; yet, it requires access to model gradients
through backpropagation (BP), making them unsuitable for memory-constrained,
inference-only edge devices. To address this limitation, previous work has
explored various BP-free fine-tuning methods. However, these approaches often
rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO)
optimization, and often fail to achieve satisfactory performance. In this
paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO)
approach, specifically designed to enhance the performance of ZO VLM
fine-tuning via a sharpness-aware warm-up training. SharpZO features a
two-stage optimization process: a sharpness-aware ES stage that globally
explores and smooths the loss landscape to construct a strong initialization,
followed by a fine-grained local search via sparse ZO optimization. The entire
optimization relies solely on forward passes. Detailed theoretical analysis and
extensive experiments on CLIP models demonstrate that SharpZO significantly
improves accuracy and convergence speed, achieving up to 7% average gain over
state-of-the-art forward-only methods.
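
As a rough illustration of what "forward-only" optimization means here, the sketch below uses a generic two-point zeroth-order gradient estimator on a toy quadratic. It is not the SharpZO algorithm itself: the sharpness-aware ES stage and the sparse ZO updates are not modelled, and the names `zo_gradient`, `zo_minimize`, and `sphere`, as well as the step size and sample count, are assumptions made for this example.

```python
import random
from typing import Callable, List, Sequence


def zo_gradient(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    rng: random.Random,
    mu: float = 1e-3,
    num_samples: int = 8,
) -> List[float]:
    """Estimate the gradient of f at x using only function evaluations."""
    grad = [0.0] * len(x)
    for _ in range(num_samples):
        # Random Gaussian probing direction (assumed; other distributions are possible).
        u = [rng.gauss(0.0, 1.0) for _ in x]
        f_plus = f([xi + mu * ui for xi, ui in zip(x, u)])
        f_minus = f([xi - mu * ui for xi, ui in zip(x, u)])
        # Central finite difference along u approximates the directional derivative.
        slope = (f_plus - f_minus) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_samples
    return grad


def zo_minimize(
    f: Callable[[Sequence[float]], float],
    x0: Sequence[float],
    steps: int = 200,
    lr: float = 0.05,
    seed: int = 0,
) -> List[float]:
    """Plain gradient descent driven by the zeroth-order estimate above."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def sphere(x: Sequence[float]) -> float:
    """Toy objective standing in for a model loss."""
    return sum(xi * xi for xi in x)


start = [1.0, -2.0]
end = zo_minimize(sphere, start)
print(sphere(start), sphere(end))
```

No backward pass is needed at any point, which is why this family of methods suits inference-only devices; the cost is the higher variance of the gradient estimate.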
[LINK]
http://arxiv.org/abs/2506.20990v1
[DATE]
2025-06-26 12:07:14+08:00
[CATEGORIES]
cs.LG
cs.CL
SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
[AUTHORS]
Dhruv Gupta, Gayathri Ganesh Lakshmy, Yiqing Xie
[ABSTRACT]
Retrieval-Augmented Code Generation (RACG) is a critical technique for
enhancing code generation by retrieving relevant information. In this work, we
conduct an in-depth analysis of code retrieval by systematically masking
specific features while preserving code functionality. Our discoveries include:
(1) although trained on code, current retrievers heavily rely on surface-level
textual features (e.g., docstrings, identifier names), and (2) they exhibit a
strong bias towards well-documented code, even if the documentation is
irrelevant. Based on our discoveries, we propose SACL, a framework that
enriches textual information and reduces bias by augmenting code or structural
knowledge with semantic information. Extensive experiments show that SACL
substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on
HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation
performance (e.g., by 4.88% Pass@1 on HumanEval).
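
For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of Recall@k under the assumption that each query has exactly one gold code snippet. The function name `recall_at_k` and the toy identifiers are hypothetical and are not taken from the SACL code.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 1) -> float:
    """Fraction of queries whose gold item appears among the top-k retrieved ids.

    Assumes one gold item per query (an assumption of this sketch).
    """
    if not gold:
        raise ValueError("gold must contain at least one query")
    hits = sum(
        1 for query, gold_id in gold.items()
        if gold_id in rankings.get(query, [])[:k]
    )
    return hits / len(gold)


# Toy data: q1 ranks its gold snippet first, q2 ranks it second.
rankings = {"q1": ["a", "b"], "q2": ["c", "d"]}
gold = {"q1": "a", "q2": "d"}
print(recall_at_k(rankings, gold, k=1))  # 0.5
print(recall_at_k(rankings, gold, k=2))  # 1.0
```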
[LINK]
http://arxiv.org/abs/2506.20081v2
[DATE]
2025-06-26 12:06:50+08:00
[CATEGORIES]
cs.CL
Comparing Retrieval-Augmentation and Parameter-Efficient Fine-Tuning for Privacy-Preserving Personalization of Large Language Models
[AUTHORS]
Alireza Salemi, Hamed Zamani
[ABSTRACT]
Despite the substantial impact of personalization on various search, recommendation, and
question answering tasks, privacy-preserving methods for personalizing large
language models (LLMs) have received relatively limited exploration. There is
one primary approach in this area through retrieval-augmented generation (RAG),
which generates personalized outputs by enriching the input prompt with
information retrieved from the user’s personal data. This paper studies an
orthogonal approach to RAG that involves learning user-dependent LLM parameters
through parameter-efficient fine-tuning (PEFT). This paper presents the first
systematic study of PEFT for LLM personalization and provides
an extensive comparison between RAG- and PEFT-based solutions across a broad
set of seven diverse datasets from the LaMP benchmark. Our results demonstrate
that, on average, RAG- and PEFT-based personalization methods yield 14.92%
and 1.07% improvements over non-personalized LLMs, respectively. When combining
RAG with PEFT, we observe a further improvement of 15.98%, highlighting the
effectiveness of their integration in enhancing personalized text generation.
Additionally, we identify a positive correlation between the amount of user
data available and the effectiveness of PEFT. This finding suggests that RAG is
particularly beneficial for cold-start users – users with limited personal
data – while PEFT performs better when more user-specific data is available.
[LINK]
http://arxiv.org/abs/2409.09510v2
[DATE]
2025-06-26 11:19:56+08:00
[CATEGORIES]
cs.CL
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
[AUTHORS]
Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong
[ABSTRACT]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework
aimed at improving the efficiency of inference in large language models (LLMs).
RSD synergistically combines a lightweight draft model with a more powerful
target model, incorporating a controlled bias to prioritize high-reward
outputs, in contrast to existing speculative decoding methods that enforce
strict unbiasedness. RSD employs a process reward model to evaluate
intermediate decoding steps and dynamically decide whether to invoke the target
model, optimizing the trade-off between computational cost and output quality.
We theoretically demonstrate that a threshold-based mixture strategy achieves
an optimal balance between resource utilization and performance. Extensive
evaluations on challenging reasoning benchmarks, including Olympiad-level
tasks, show that RSD delivers significant efficiency gains against decoding
with the target model only (up to 4.4x fewer FLOPs), while achieving
significantly better accuracy than parallel decoding methods on average (up to
+3.5). These results highlight RSD as a robust and cost-effective approach for
deploying LLMs in resource-intensive scenarios. The code is available at
https://github.com/BaohaoLiao/RSD.
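
The threshold-based control flow described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the released RSD implementation: `draft_step`, `target_step`, and `reward` are hypothetical callables standing in for the draft model, the target model, and the process reward model, and the fixed `threshold` is an assumed hyperparameter.

```python
from typing import Callable, List, Tuple


def reward_guided_decode(
    prompt: str,
    draft_step: Callable[[str], str],
    target_step: Callable[[str], str],
    reward: Callable[[str, str], float],
    threshold: float,
    max_steps: int,
) -> List[Tuple[str, str]]:
    """Return (source, step) pairs; low-reward draft steps are regenerated by the target."""
    context = prompt
    trace: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        step = draft_step(context)  # cheap proposal from the draft model
        source = "draft"
        if reward(context, step) < threshold:
            step = target_step(context)  # fall back to the expensive target model
            source = "target"
        trace.append((source, step))
        context += step
    return trace


# Toy stand-ins (hypothetical): the reward is high only when the context length is even.
trace = reward_guided_decode(
    prompt="",
    draft_step=lambda ctx: "a",
    target_step=lambda ctx: "B",
    reward=lambda ctx, step: 1.0 if len(ctx) % 2 == 0 else 0.0,
    threshold=0.5,
    max_steps=3,
)
print(trace)  # [('draft', 'a'), ('target', 'B'), ('draft', 'a')]
```

Raising `threshold` sends more steps to the target model (higher cost, typically higher quality), while lowering it keeps more draft steps; that is the cost/quality trade-off the abstract refers to.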
[COMMENTS]
17 pages
[LINK]
http://arxiv.org/abs/2501.19324v3
[DATE]
2025-06-26 11:14:46+08:00
[CATEGORIES]
cs.CL
Learning to Rank for Multiple Retrieval-Augmented Models through Iterative Utility Maximization
[AUTHORS]
Alireza Salemi, Hamed Zamani
[ABSTRACT]
This paper investigates the design of a unified search engine to serve
multiple retrieval-augmented generation (RAG) agents, each with a distinct
task, backbone large language model (LLM), and RAG strategy. We introduce an
iterative approach where the search engine generates retrieval results for the
RAG agents and gathers feedback on the quality of the retrieved documents
during an offline phase. This feedback is then used to iteratively optimize the
search engine using an expectation-maximization algorithm, with the goal of
maximizing each agent’s utility function. Additionally, we adapt this to an
online setting, allowing the search engine to refine its behavior based on
real-time feedback from individual agents to better serve the results for each of
them. Experiments on datasets from the Knowledge-Intensive Language Tasks
(KILT) benchmark demonstrate that our approach, on average, significantly
outperforms baselines across 18 RAG models. We demonstrate that our method
effectively “personalizes” the retrieval for each RAG agent based on the
collected feedback. Finally, we provide a comprehensive ablation study to
explore various aspects of our method.
[LINK]
http://arxiv.org/abs/2410.09942v2
[DATE]
2025-06-26 11:06:17+08:00
[CATEGORIES]
cs.CL
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
[AUTHORS]
Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
[ABSTRACT]
With the rapid advancement of digitalization, various document images are
being applied more extensively in production and daily life, and there is an
increasingly urgent need for fast and accurate parsing of the content in
document images. Therefore, this report presents PP-DocBee, a novel multimodal
large language model designed for end-to-end document image understanding.
First, we develop a data synthesis strategy tailored to document scenarios in
which we build a diverse dataset to improve the model generalization. Then, we
apply a few training techniques, including dynamic proportional sampling, data
preprocessing, and OCR postprocessing strategies. Extensive evaluations
demonstrate the superior performance of PP-DocBee, achieving state-of-the-art
results on English document understanding benchmarks and even outperforming
existing open source and commercial models in Chinese document understanding.
The source code and pre-trained models are publicly available at
https://github.com/PaddlePaddle/PaddleMIX.
[LINK]
http://arxiv.org/abs/2503.04065v3
[DATE]
2025-06-26 09:11:25+08:00
[CATEGORIES]
cs.CL
KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
[AUTHORS]
Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Qian Chen, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang
[ABSTRACT]
In this paper, we propose KaLM-Embedding-V2, a versatile and compact
embedding model, which achieves impressive performance in general-purpose text
embedding tasks by leveraging superior training techniques and data. Our key
innovations include: (1) To better align the architecture with representation
learning, we remove the causal attention mask and adopt a fully bidirectional
transformer with simple yet effective mean-pooling to produce fixed-length
embeddings; (2) We employ a multi-stage training pipeline: (i) pre-training on
large-scale weakly supervised open-source corpora; (ii) fine-tuning on
high-quality retrieval and non-retrieval datasets; and (iii) model-soup
parameter averaging for robust generalization. Besides, we introduce a
focal-style reweighting mechanism that concentrates learning on difficult
samples and an online hard-negative mixing strategy to continuously enrich hard
negatives without expensive offline mining; (3) We collect over 20 categories
of data for pre-training and 100 categories of data for fine-tuning, to boost
both the performance and generalization of the embedding model. Extensive
evaluations on the Massive Text Embedding Benchmark (MTEB) Chinese and English
show that our model significantly outperforms others of comparable size, and
competes with 3x, 14x, 18x, and 26x larger embedding models, setting a new
standard for a versatile and compact embedding model with less than 1B
parameters.
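
The mean-pooling step mentioned in innovation (1) can be illustrated with a small, dependency-free sketch. It assumes token vectors are given as plain Python lists and that padding positions are marked with 0 in an attention mask; the function name `mean_pool` is hypothetical and not taken from the KaLM-Embedding-V2 code.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the vectors of unmasked tokens into one fixed-length embedding."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # Keep only real tokens; padding positions (mask == 0) are ignored.
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("at least one unmasked token is required")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token positions, the last one is padding and must not affect the result.
embedding = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(embedding)  # [2.0, 3.0]
```

Because the output length depends only on the hidden dimension, sequences of any length map to embeddings of the same size.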
[COMMENTS]
Technical Report; 26 pages, 12 tables, 1 figure. arXiv admin note:
substantial text overlap with arXiv:2501.01028
[LINK]
http://arxiv.org/abs/2506.20923v1
[DATE]
2025-06-26 09:09:44+08:00
[CATEGORIES]
cs.CL
Optimising Language Models for Downstream Tasks: A Post-Training Perspective
[AUTHORS]
Zhengyan Shi
[ABSTRACT]
Language models (LMs) have demonstrated remarkable capabilities in NLP, yet
adapting them efficiently and robustly to specific tasks remains challenging.
As their scale and complexity grow, fine-tuning LMs on labelled data often
underutilizes available unlabelled data, leads to overfitting on small
task-specific sets, and imposes significant computational costs. These
limitations hamper their application to the open-ended landscape of real-world
language tasks.
This thesis proposes a series of methods to better adapt LMs to downstream
applications. First, we explore strategies for extracting task-relevant
knowledge from unlabelled data, introducing a novel continued pre-training
technique that outperforms state-of-the-art semi-supervised approaches. Next,
we present a parameter-efficient fine-tuning method that substantially reduces
memory and compute costs while maintaining competitive performance. We also
introduce improved supervised fine-tuning methods that enable LMs to better
follow instructions, especially when labelled data is scarce, enhancing their
performance across a range of NLP tasks, including open-ended generation.
Finally, we develop new evaluation methods and benchmarks, such as multi-hop
spatial reasoning tasks, to assess LM capabilities and adaptation more
comprehensively.
Through extensive empirical studies across diverse NLP tasks, our results
demonstrate that these approaches substantially improve LM robustness,
efficiency, and generalization, making them more adaptable to a broad range of
applications. These advances mark a significant step towards more robust and
efficient LMs, bringing us closer to the goal of artificial general
intelligence.
[COMMENTS]
PhD Thesis
[LINK]
http://arxiv.org/abs/2506.20917v1
[DATE]
2025-06-26 08:49:35+08:00
[CATEGORIES]
cs.CL
A3: an Analytical Low-Rank Approximation Framework for Attention
[AUTHORS]
Jeffrey T. H. Wong, Cheng Zhang, Xinye Cao, Pedro Gimenes, George A. Constantinides, Wayne Luk, Yiren Zhao
[ABSTRACT]
Large language models have demonstrated remarkable performance; however,
their massive parameter counts make deployment highly expensive. Low-rank
approximation offers a promising compression solution, yet existing approaches
have two main limitations: (1) They focus on minimizing the output error of
individual linear layers, without considering the architectural characteristics
of Transformers, and (2) they decompose a large weight matrix into two small
low-rank matrices. Consequently, these methods often fall short compared to
other compression techniques like pruning and quantization, and introduce
runtime overhead such as the extra GEMM kernel launches for decomposed small
matrices. To address these limitations, we propose A3, a
post-training low-rank approximation framework. A3 splits a
Transformer layer into three functional components, namely QK, OV,
and MLP. For each component, A3 provides an analytical
solution that reduces the hidden dimension size inside each component while
minimizing the component’s functional loss (i.e., error in attention
scores, attention outputs, and MLP outputs). This approach directly reduces
model sizes, KV cache sizes, and FLOPs without introducing any runtime
overheads. In addition, it provides a new narrative in advancing the
optimization problem from singular linear layer loss optimization toward
improved end-to-end performance. Through extensive experiments, we show that
A3 maintains superior performance compared to SoTAs. For example,
under the same reduction budget in computation and memory, our low-rank
approximated LLaMA 3.1-70B achieves a perplexity of 4.69 on WikiText-2,
outperforming the previous SoTA’s 7.87 by 3.18. We also demonstrate the
versatility of A3, including KV cache compression, quantization, and
mixed-rank assignments for enhanced performance.
[LINK]
http://arxiv.org/abs/2505.12942v3
[DATE]
2025-06-26 07:03:54+08:00
[CATEGORIES]
cs.CL
cs.LG
Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training
[AUTHORS]
Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, Christopher A. Choquette-Choo
[ABSTRACT]
Due to the sensitive nature of personally identifiable information (PII), its
owners may have the authority to control its inclusion or request its removal
from large-language model (LLM) training. Beyond this, PII may be added or
removed from training datasets due to evolving dataset curation techniques,
because they were newly scraped for retraining, or because they were included
in a new downstream fine-tuning stage. We find that the amount and ease of PII
memorization is a dynamic property of a model that evolves throughout training
pipelines and depends on commonly altered design choices. We characterize three
such novel phenomena: (1) similar-appearing PII seen later in training can
elicit memorization of earlier-seen sequences in what we call assisted
memorization, and this is a significant factor (in our settings, up to 1/3);
(2) adding PII can increase memorization of other PII significantly (in our
settings, as much as $\approx 7.5\times$); and (3) removing PII can lead to
other PII being memorized. Model creators should consider these first- and
second-order privacy risks when training models to avoid the risk of new PII
regurgitation.
[COMMENTS]
Accepted at the Findings of the Association for Computational
Linguistics (2025)
[LINK]
http://arxiv.org/abs/2502.15680v2
[DATE]
2025-06-26 05:37:19+08:00
[CATEGORIES]
cs.CL
Uncovering Hidden Violent Tendencies in LLMs: A Demographic Analysis via Behavioral Vignettes
[AUTHORS]
Quintin Myers, Yanjun Gao
[ABSTRACT]
Large language models (LLMs) are increasingly proposed for detecting and
responding to violent content online, yet their ability to reason about morally
ambiguous, real-world scenarios remains underexamined. We present the first
study to evaluate LLMs using a validated social science instrument designed to
measure human response to everyday conflict, namely the Violent Behavior
Vignette Questionnaire (VBVQ). To assess potential bias, we introduce
persona-based prompting that varies race, age, and geographic identity within
the United States. Six LLMs developed across different geopolitical and
organizational contexts are evaluated under a unified zero-shot setting. Our
study reveals two key findings: (1) LLMs' surface-level text generation often
diverges from their internal preference for violent responses; (2) their
violent tendencies vary across demographics, frequently contradicting
established findings in criminology, social science, and psychology.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.20822v1
[DATE]
2025-06-26 04:43:04+08:00
[CATEGORIES]
cs.CL
MultiFinRAG: An Optimized Multimodal Retrieval-Augmented Generation (RAG) Framework for Financial Question Answering
[AUTHORS]
Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh
[ABSTRACT]
Financial documents–such as 10-Ks, 10-Qs, and investor presentations–span
hundreds of pages and combine diverse modalities, including dense narrative
text, structured tables, and complex figures. Answering questions over such
content often requires joint reasoning across modalities, which strains
traditional large language models (LLMs) and retrieval-augmented generation
(RAG) pipelines due to token limitations, layout loss, and fragmented
cross-modal context. We introduce MultiFinRAG, a retrieval-augmented generation
framework purpose-built for financial QA. MultiFinRAG first performs multimodal
extraction by grouping table and figure images into batches and sending them to
a lightweight, quantized open-source multimodal LLM, which produces both
structured JSON outputs and concise textual summaries. These outputs, along
with narrative text, are embedded and indexed with modality-aware similarity
thresholds for precise retrieval. A tiered fallback strategy then dynamically
escalates from text-only to text+table+image contexts when necessary, enabling
cross-modal reasoning while reducing irrelevant context. Despite running on
commodity hardware, MultiFinRAG achieves 19 percentage points higher accuracy
than ChatGPT-4o (free-tier) on complex financial QA tasks involving text,
tables, images, and combined multimodal reasoning.
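
The tiered fallback described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Chunk` type, the per-modality threshold values, and the `min_hits` escalation rule are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    modality: str      # "text", "table", or "image"
    similarity: float  # similarity between the query and this chunk
    content: str

# Hypothetical modality-aware similarity thresholds (not taken from the paper).
THRESHOLDS = {"text": 0.70, "table": 0.65, "image": 0.55}

def select(chunks, modalities):
    """Keep chunks of the given modalities that clear their own threshold."""
    return [c for c in chunks
            if c.modality in modalities and c.similarity >= THRESHOLDS[c.modality]]

def tiered_context(chunks, min_hits=2):
    """Try text-only retrieval first; escalate to all modalities if it is too sparse."""
    text_hits = select(chunks, {"text"})
    if len(text_hits) >= min_hits:
        return "text-only", text_hits
    return "text+table+image", select(chunks, {"text", "table", "image"})
```

Starting with text only keeps the prompt short; table and image summaries are added only when the text tier returns too few confident hits.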
[COMMENTS]
Preprint Copy
[LINK]
http://arxiv.org/abs/2506.20821v1
[DATE]
2025-06-26 04:37:20+08:00
[CATEGORIES]
cs.CL
CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
[AUTHORS]
Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan Rossi, Yixuan Li, Saayan Mitra
[ABSTRACT]
Large Language Models (LLMs) have revolutionized code generation but require
significant resources and often over-generalize, limiting their task-specific
efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective
alternative. However, standard supervised approaches rely only on correct
examples, missing valuable insights from failures. We introduce CodeLutra, a
framework that leverages both correct and incorrect code attempts. Instead of
using only correct solutions, CodeLutra applies iterative preference-based
refinement, comparing successful and failed outputs to better approximate
desired results. This approach narrows the performance gap with
state-of-the-art larger models without requiring massive datasets or auxiliary
models. For instance, on a challenging data science coding task, using only 500
samples improved Llama-3-8B’s accuracy from 28.2% to 48.6%, approaching GPT-4’s
level. By learning from both successes and mistakes, CodeLutra provides a
scalable and efficient path to high-quality code generation, making smaller
open-source models more competitive with leading closed-source alternatives.
[COMMENTS]
TMLR 2025
[LINK]
http://arxiv.org/abs/2411.05199v3
[DATE]
2025-06-26 02:20:39+08:00
[CATEGORIES]
cs.CL
Towards Probabilistic Question Answering Over Tabular Data
[AUTHORS]
Chen Shen, Sajjadur Rahman, Estevam Hruschka
[ABSTRACT]
Current approaches for question answering (QA) over tabular data, such as
NL2SQL systems, perform well for factual questions where answers are directly
retrieved from tables. However, they fall short on probabilistic questions
requiring reasoning under uncertainty. In this paper, we introduce a new
benchmark LUCARIO and a framework for probabilistic QA over large tabular data.
Our method induces Bayesian Networks from tables, translates natural language
queries into probabilistic queries, and uses large language models (LLMs) to
generate final answers. Empirical results demonstrate significant improvements
over baselines, highlighting the benefits of hybrid symbolic-neural reasoning.
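
To make the idea of a probabilistic query over a table concrete, here is a deliberately simplified sketch. It estimates a conditional probability by counting matching rows, whereas the method above induces a Bayesian Network that factorizes the joint distribution. The function name and the toy table are assumptions for illustration only.

```python
def conditional_probability(rows, target, evidence):
    """Estimate P(target | evidence) by counting over table rows.

    rows     -- list of dicts, one per table row
    target   -- dict of column -> value describing the queried event
    evidence -- dict of column -> value describing the conditioning event
    Returns None when no row matches the evidence.
    """
    matching = [r for r in rows
                if all(r.get(k) == v for k, v in evidence.items())]
    if not matching:
        return None
    hits = sum(1 for r in matching
               if all(r.get(k) == v for k, v in target.items()))
    return hits / len(matching)

# Toy table (invented data).
table = [
    {"region": "north", "churned": "yes"},
    {"region": "north", "churned": "yes"},
    {"region": "north", "churned": "no"},
    {"region": "south", "churned": "no"},
]
p = conditional_probability(table, {"churned": "yes"}, {"region": "north"})
```

A question such as "how likely is a northern customer to churn?" would map to the query `P(churned=yes | region=north)` in this toy setting.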
[LINK]
http://arxiv.org/abs/2506.20747v1
[DATE]
2025-06-26 02:15:33+08:00
[CATEGORIES]
cs.CL
MAGPIE: A dataset for Multi-AGent contextual PrIvacy Evaluation
[AUTHORS]
Gurusha Juneja, Alon Albalak, Wenyue Hua, William Yang Wang
[ABSTRACT]
The proliferation of LLM-based agents has led to increasing deployment of
inter-agent collaboration for tasks like scheduling, negotiation, and resource
allocation. In such systems, privacy is critical, as agents often access
proprietary tools and domain-specific databases requiring strict
confidentiality. This paper examines whether LLM-based agents demonstrate an
understanding of contextual privacy and whether, when instructed, these systems
preserve inference-time user privacy in non-adversarial multi-turn
conversations. Existing benchmarks to evaluate contextual privacy in LLM-agents
primarily assess single-turn, low-complexity tasks where private information
can be easily excluded. We first present a benchmark, MAGPIE, comprising 158
real-life high-stakes scenarios across 15 domains. These scenarios are designed
such that complete exclusion of private data impedes task completion yet
unrestricted information sharing could lead to substantial losses. We then
evaluate the current state-of-the-art LLMs on (a) their understanding of
contextually private data and (b) their ability to collaborate without
violating user privacy. Empirical experiments demonstrate that current models,
including GPT-4o and Claude-2.7-Sonnet, lack robust understanding of contextual
privacy, misclassifying private data as shareable 25.2% and 43.6% of the
time. In multi-turn conversations, these models disclose private information in
59.9% and 50.5% of cases even under explicit privacy instructions.
Furthermore, multi-agent systems fail to complete tasks in 71% of scenarios.
These results underscore that current models are not aligned towards both
contextual privacy preservation and collaborative task-solving.
[LINK]
http://arxiv.org/abs/2506.20737v1
[DATE]
2025-06-26 02:04:25+08:00
[CATEGORIES]
cs.CL
MMSearch-R1: Incentivizing LMMs to Search
[AUTHORS]
Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
[ABSTRACT]
Robust deployment of large multimodal models (LMMs) in real-world scenarios
requires access to external knowledge sources, given the complexity and dynamic
nature of real-world information. Existing approaches such as
retrieval-augmented generation (RAG) and prompt engineered search agents rely
on rigid pipelines, often leading to inefficient or excessive search behaviors.
We present MMSearch-R1, the first end-to-end reinforcement learning framework
that enables LMMs to perform on-demand, multi-turn search in real-world
Internet environments. Our framework integrates both image and text search
tools, allowing the model to reason about when and how to invoke them guided by
an outcome-based reward with a search penalty. To support training, we collect
a multimodal search VQA dataset through a semi-automated pipeline that covers
diverse visual and textual knowledge needs and curate a search-balanced subset
with both search-required and search-free samples, which proves essential for
shaping efficient and on-demand search behavior. Extensive experiments on
knowledge-intensive and info-seeking VQA tasks show that our model not only
outperforms RAG-based baselines of the same model size, but also matches the
performance of a larger RAG-based model while reducing search calls by over
30%. We further analyze key empirical findings to offer actionable insights for
advancing research in multimodal search.
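
One way to picture an outcome-based reward with a search penalty is the sketch below. The functional form and the default penalty value are assumptions chosen for illustration; the abstract does not specify them.

```python
def outcome_reward(is_correct, used_search, search_penalty=0.1):
    """Reward a correct final answer, discounted if any search call was made.

    search_penalty is a hypothetical value, not one reported in the abstract.
    """
    if not is_correct:
        return 0.0
    return 1.0 - search_penalty if used_search else 1.0
```

Under this form, a correct answer produced without searching earns the most, which nudges the policy to search only when the question needs it.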
[COMMENTS]
Code: https://github.com/EvolvingLMMs-Lab/multimodal-search-r1
[LINK]
http://arxiv.org/abs/2506.20670v1
[DATE]
2025-06-26 01:59:42+08:00
[CATEGORIES]
cs.CL
Inside you are many wolves: Using cognitive models to interpret value trade-offs in LLMs
[AUTHORS]
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
[ABSTRACT]
Navigating everyday social situations often requires juggling conflicting
goals, such as conveying a harsh truth and maintaining trust, all while still
being mindful of another person’s feelings. These value trade-offs are an
integral part of human decision-making and language use; however, current tools
for interpreting such dynamic and multi-faceted notions of values in LLMs are
limited. In cognitive science, so-called “cognitive models” provide formal
accounts of these trade-offs in humans, by modeling the weighting of a
speaker’s competing utility functions in choosing an action or utterance. In
this work, we use a leading cognitive model of polite speech to interpret the
extent to which LLMs represent human-like trade-offs. We apply this lens to
systematically evaluate value trade-offs in two encompassing model settings:
degrees of reasoning “effort” in frontier black-box models, and RL
post-training dynamics of open-source models. Our results highlight patterns of
higher informational utility than social utility in reasoning models, and in
open-source models shown to be stronger in mathematical reasoning. Our findings
from LLMs’ training dynamics suggest large shifts in utility values early on in
training with persistent effects of the choice of base model and pretraining
data, compared to feedback dataset or alignment method. We show that our method
is responsive to diverse aspects of the rapidly evolving LLM landscape, with
insights for forming hypotheses about other high-level behaviors, shaping
training regimes for reasoning models, and better controlling trade-offs
between values during model training.
[COMMENTS]
11 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.20666v1
[DATE]
2025-06-26 01:58:12+08:00
[CATEGORIES]
cs.CL
OmniGen2: Exploration to Advanced Multimodal Generation
[AUTHORS]
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
[ABSTRACT]
In this work, we introduce OmniGen2, a versatile and open-source generative
model designed to provide a unified solution for diverse generation tasks,
including text-to-image, image editing, and in-context generation. Unlike
OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image
modalities, utilizing unshared parameters and a decoupled image tokenizer. This
design enables OmniGen2 to build upon existing multimodal understanding models
without the need to re-adapt VAE inputs, thereby preserving the original text
generation capabilities. To facilitate the training of OmniGen2, we developed
comprehensive data construction pipelines, encompassing image editing and
in-context generation data. Additionally, we introduce a reflection mechanism
tailored for image generation tasks and curate a dedicated reflection dataset
based on OmniGen2. Despite its relatively modest parameter size, OmniGen2
achieves competitive results on multiple task benchmarks, including
text-to-image and image editing. To further evaluate in-context generation,
also referred to as subject-driven tasks, we introduce a new benchmark named
OmniContext. OmniGen2 achieves state-of-the-art performance among open-source
models in terms of consistency. We will release our models, training code,
datasets, and data construction pipeline to support future research in this
field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link:
https://github.com/VectorSpaceLab/OmniGen2
[LINK]
http://arxiv.org/abs/2506.18871v2
[DATE]
2025-06-26 01:54:25+08:00
[CATEGORIES]
cs.CL
PLoP: Precise LoRA Placement for Efficient Finetuning of Large Models
[AUTHORS]
Soufiane Hayou, Nikhil Ghosh, Bin Yu
[ABSTRACT]
Low-Rank Adaptation (LoRA) is a widely used finetuning method for large
models. Its small memory footprint allows practitioners to adapt large models
to specific tasks at a fraction of the cost of full finetuning. Different
modifications have been proposed to enhance its efficiency by, for example,
setting the learning rate, the rank, and the initialization. Another
improvement axis is adapter placement strategy: when using LoRA, practitioners
usually pick module types to adapt with LoRA, such as Query and Key modules.
Few works have studied the problem of adapter placement, with inconclusive
results: the original LoRA paper suggested placing adapters in attention modules,
while other works suggested placing them in the MLP modules. Through an
intuitive theoretical analysis, we introduce PLoP (Precise LoRA Placement), a
lightweight method that allows automatic identification of module types where
LoRA adapters should be placed, given a pretrained model and a finetuning task.
We demonstrate that PLoP consistently outperforms, or in the worst case is
competitive with, commonly used placement strategies through comprehensive
experiments on supervised finetuning and reinforcement learning for reasoning.
[COMMENTS]
TL;DR: A lightweight module type selection method for LoRA
finetuning. PLoP gives precise placements for LoRA adapters for improved
performance
[LINK]
http://arxiv.org/abs/2506.20629v1
[DATE]
2025-06-26 01:25:02+08:00
[CATEGORIES]
cs.LG
cs.CL
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
[AUTHORS]
Thao Nguyen, Yang Li, Olga Golovneva, Luke Zettlemoyer, Sewoong Oh, Ludwig Schmidt, Xian Li
[ABSTRACT]
Scaling laws predict that the performance of large language models improves
with increasing model size and data size. In practice, pre-training has been
relying on massive web crawls, using almost all data sources publicly available
on the internet so far. However, this pool of natural data does not grow at the
same rate as the compute supply. Furthermore, the availability of high-quality
texts is even more limited: data filtering pipelines often remove up to 99% of
the initial web scrapes to achieve state-of-the-art. To address the “data wall”
of pre-training scaling, our work explores ways to transform and recycle data
discarded in existing filtering processes. We propose REWIRE, REcycling the Web
with guIded REwrite, a method to enrich low-quality documents so that they
could become useful for training. This in turn allows us to increase the
representation of synthetic data in the final pre-training set. Experiments at
1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw
texts and our rewritten texts leads to 1.0, 1.3 and 2.5 percentage point
improvements, respectively, across 22 diverse tasks, compared to training on only
filtered web data. Training on the raw-synthetic data mix is also more
effective than having access to 2x web data. Through further analysis, we
demonstrate that about 82% of the mixed-in texts come from transforming
lower-quality documents that would otherwise be discarded. REWIRE also
outperforms related approaches of generating synthetic data, including
Wikipedia-style paraphrasing, question-answer synthesizing and knowledge
extraction. These results suggest that recycling web texts holds the potential
for being a simple and effective approach for scaling pre-training data.
[LINK]
http://arxiv.org/abs/2506.04689v2
[DATE]
2025-06-26 01:12:12+08:00
[CATEGORIES]
cs.CL
cs.LG
Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
[AUTHORS]
Sherzod Hakimov, Lara Pfennigschmidt, David Schlangen
[COMMENTS]
Accepted at GemBench workshop co-located with ACL 2025
[LINK]
http://arxiv.org/abs/2502.11707v2
[DATE]
2025-06-26 00:48:16+08:00
[CATEGORIES]
cs.CL
On the Role of Context in Reading Time Prediction
[AUTHORS]
Andreas Opedal, Eleanor Chodroff, Ryan Cotterell, Ethan Gotlieb Wilcox
[ABSTRACT]
We present a new perspective on how readers integrate context during
real-time language comprehension. Our proposals build on surprisal theory,
which posits that the processing effort of a linguistic unit (e.g., a word) is
an affine function of its in-context information content. We first observe that
surprisal is only one out of many potential ways that a contextual predictor
can be derived from a language model. Another one is the pointwise mutual
information (PMI) between a unit and its context, which turns out to yield the
same predictive power as surprisal when controlling for unigram frequency.
Moreover, both PMI and surprisal are correlated with frequency. This means that
neither PMI nor surprisal contains information about context alone. In response
to this, we propose a technique where we project surprisal onto the orthogonal
complement of frequency, yielding a new contextual predictor that is
uncorrelated with frequency. Our experiments show that the proportion of
variance in reading times explained by context is a lot smaller when context is
represented by the orthogonalized predictor. From an interpretability
standpoint, this indicates that previous studies may have overstated the role
that context has in predicting reading times.
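
The two quantities discussed above can be illustrated with a small sketch. It assumes that projecting surprisal onto the orthogonal complement of frequency amounts to taking the residual of an ordinary least-squares fit (with intercept) of surprisal on the frequency predictor. The helper names and the toy numbers are invented for the example.

```python
import math

def pmi(context_surprisal, unigram_surprisal):
    """PMI(w; c) = log p(w|c) - log p(w) = unigram surprisal - contextual surprisal."""
    return unigram_surprisal - context_surprisal

def mean(xs):
    return sum(xs) / len(xs)

def orthogonalize(surprisal, frequency):
    """Residual of surprisal after removing its linear dependence on frequency."""
    s_bar, f_bar = mean(surprisal), mean(frequency)
    s_c = [s - s_bar for s in surprisal]
    f_c = [f - f_bar for f in frequency]
    slope = sum(s * f for s, f in zip(s_c, f_c)) / sum(f * f for f in f_c)
    return [s - slope * f for s, f in zip(s_c, f_c)]

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy per-word values in nats (invented): frequency is encoded as unigram surprisal.
unigram_surprisal = [2.0, 5.0, 7.5, 3.0, 9.0]
context_surprisal = [1.5, 3.0, 7.0, 2.5, 4.0]
orthogonal_predictor = orthogonalize(context_surprisal, unigram_surprisal)
```

In this toy data, raw surprisal correlates with the frequency predictor, while the orthogonalized predictor is uncorrelated with it up to floating-point error.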
[COMMENTS]
EMNLP 2024; preprocessing was corrected to exclude variance due to
word skipping and the conclusions remain unchanged
[LINK]
http://arxiv.org/abs/2409.08160v4
[DATE]
2025-06-26 00:32:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling
[AUTHORS]
Jelena Bratulić, Sudhanshu Mittal, David T. Hoffmann, Samuel Böhm, Robin Tibor Schirrmeister, Tonio Ball, Christian Rupprecht, Thomas Brox
[ABSTRACT]
Large Language Models (LLMs) exhibit In-Context Learning (ICL), which enables
the model to perform new tasks conditioning only on the examples provided in
the context without updating the model’s weights. While ICL offers fast
adaptation across natural language tasks and domains, its emergence is less
straightforward for modalities beyond text. In this work, we systematically
uncover properties present in LLMs that support the emergence of ICL for
autoregressive models and various modalities by promoting the learning of the
needed mechanisms for ICL. We identify exact token repetitions in the training
data sequences as an important factor for ICL. Such repetitions further improve
stability and reduce transiency in ICL performance. Moreover, we emphasise the
significance of training task difficulty for the emergence of ICL. Finally, by
applying our novel insights on ICL emergence, we unlock ICL capabilities for
various visual datasets and a more challenging EEG classification task in a
few-shot learning regime.
[LINK]
http://arxiv.org/abs/2501.06256v2
[DATE]
2025-06-26 00:21:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager-Machlup Functional
[AUTHORS]
Sanjeev Raja, Martin Šípka, Michael Psenka, Tobias Kreiman, Michal Pavelka, Aditi S. Krishnapriyan
[ABSTRACT]
Transition path sampling (TPS), which involves finding probable paths
connecting two points on an energy landscape, remains a challenge due to the
complexity of real-world atomistic systems. Current machine learning approaches
use expensive, task-specific, and data-free training procedures, limiting their
ability to benefit from high-quality datasets and large-scale pre-trained
models. In this work, we address TPS by interpreting candidate paths as
trajectories sampled from stochastic dynamics induced by the learned score
function of pre-trained generative models, specifically denoising diffusion and
flow matching. Under these dynamics, finding high-likelihood transition paths
becomes equivalent to minimizing the Onsager-Machlup (OM) action functional.
This enables us to repurpose pre-trained generative models for TPS in a
zero-shot manner, in contrast with bespoke, task-specific approaches in
previous work. We demonstrate our approach on varied molecular systems,
obtaining diverse, physically realistic transition pathways and generalizing
beyond the pre-trained model’s original training dataset. Our method can be
easily incorporated into new generative models, making it practically relevant
as models continue to scale and improve with increased data availability. Code
is available at github.com/ASK-Berkeley/OM-TPS.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2504.18506v3
[DATE]
2025-06-26 23:59:16+08:00
[CATEGORIES]
cs.LG
Distributed Cross-Channel Hierarchical Aggregation for Foundation Models
[AUTHORS]
Aristeidis Tsaris, Isaac Lyngaas, John Lagregren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, Xiao Wang
[ABSTRACT]
Vision-based scientific foundation models hold significant promise for
advancing scientific discovery and innovation. This potential stems from their
ability to aggregate images from diverse sources such as varying physical
groundings or data acquisition systems and to learn spatio-temporal
correlations using transformer architectures. However, tokenizing and
aggregating images can be compute-intensive, a challenge not fully addressed by
current distributed methods. In this work, we introduce the Distributed
Cross-Channel Hierarchical Aggregation (D-CHAG) approach designed for datasets
with a large number of channels across image modalities. Our method is
compatible with any model-parallel strategy and any type of vision transformer
architecture, significantly improving computational efficiency. We evaluated
D-CHAG on hyperspectral imaging and weather forecasting tasks. When integrated
with tensor parallelism and model sharding, our approach achieved up to a 75%
reduction in memory usage and more than doubled sustained throughput on up to
1,024 AMD GPUs on the Frontier Supercomputer.
[LINK]
http://arxiv.org/abs/2506.21411v1
[DATE]
2025-06-26 23:58:14+08:00
[CATEGORIES]
cs.LG
Early Stopping Tabular In-Context Learning
[AUTHORS]
Jaris Küken, Lennart Purucker, Frank Hutter
[ABSTRACT]
Tabular foundation models have shown strong performance across various
tabular learning tasks via in-context learning, offering robust generalization
without any downstream finetuning. However, their inference-time costs remain
high, particularly for larger datasets. To address this, we propose
early-stopping the in-context learning process. We achieve this by dynamically
evaluating whether to stop in-context learning after each Transformer encoder
layer. Once stopped, we decode the embedding using a pre-trained layer-wise
decoder. Experiments across 34 small classification tasks show that early
stopping in-context learning accelerates inference by up to x1.3 with
negligible degradation in predictive performance. To assess scalability, we
further evaluate our method on five larger classification tasks, achieving
speedups of up to x2.2. Our results demonstrate the potential of early exiting
as an effective and practical strategy for improving the efficiency of tabular
in-context learning.
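
The layer-wise early exit described above might look roughly like the sketch below. The stopping rule (a confidence threshold on the decoded class probabilities) and the toy scalar "layers" are assumptions for illustration; the abstract does not say which criterion is used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def early_exit_predict(x, layers, decoders, confidence=0.9):
    """Run layers in order; after each one, decode and stop once confident.

    layers     -- callables mapping a hidden state to the next hidden state
    decoders   -- one callable per layer mapping a hidden state to class probabilities
    confidence -- hypothetical stopping threshold on the top class probability
    Returns (probabilities, number_of_layers_used).
    """
    if not layers or len(layers) != len(decoders):
        raise ValueError("need one decoder per layer and at least one layer")
    h = x
    for depth, (layer, decoder) in enumerate(zip(layers, decoders), start=1):
        h = layer(h)
        probs = decoder(h)
        if max(probs) >= confidence or depth == len(layers):
            return probs, depth

# Toy model: the hidden state is a single logit that each layer increases by 1.
toy_layers = [lambda h: h + 1.0] * 5
toy_decoders = [lambda h: [1.0 - sigmoid(h), sigmoid(h)]] * 5
probs, depth = early_exit_predict(0.0, toy_layers, toy_decoders)
```

Here the loop stops as soon as the decoded top-class probability reaches the threshold, so the remaining layers are never evaluated.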
[COMMENTS]
ICML Workshop Paper
[LINK]
http://arxiv.org/abs/2506.21387v1
[DATE]
2025-06-26 23:36:37+08:00
[CATEGORIES]
cs.LG
Representation Learning of Lab Values via Masked AutoEncoders
[AUTHORS]
David Restrepo, Chenwei Wu, Yueran Jia, Jaden K. Sun, Jack Gallifant, Catherine G. Bielick, Yugang Jia, Leo A. Celi
[ABSTRACT]
Accurate imputation of missing laboratory values in electronic health records
(EHRs) is critical to enable robust clinical predictions and reduce biases in
AI systems in healthcare. Existing methods, such as XGBoost, softimpute, GAIN,
Expectation Maximization (EM), and MICE, struggle to model the complex temporal
and contextual dependencies in EHR data, particularly in underrepresented
groups. In this work, we propose Lab-MAE, a novel transformer-based masked
autoencoder framework that leverages self-supervised learning for the
imputation of continuous sequential lab values. Lab-MAE introduces a structured
encoding scheme that jointly models laboratory test values and their
corresponding timestamps, enabling explicit capture of temporal dependencies.
Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE
significantly outperforms state-of-the-art baselines such as XGBoost,
softimpute, GAIN, EM, and MICE across multiple metrics, including root mean
square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably,
Lab-MAE achieves equitable performance across demographic groups of patients,
advancing fairness in clinical predictions. We further investigate the role of
follow-up laboratory values as potential shortcut features, revealing Lab-MAE’s
robustness in scenarios where such data is unavailable. The findings suggest
that our transformer-based architecture, adapted to the characteristics of EHR
data, offers a foundation model for more accurate and fair clinical imputation.
In addition, we measure and compare the carbon footprint of Lab-MAE with that
of an XGBoost model, highlighting its environmental requirements.
[COMMENTS]
14 pages of main text, 11 appendix
[LINK]
http://arxiv.org/abs/2501.02648v3
[DATE]
2025-06-26 23:34:13+08:00
[CATEGORIES]
cs.LG
Temporal-Aware Graph Attention Network for Cryptocurrency Transaction Fraud Detection
[AUTHORS]
Zhi Zheng, Bochuan Zhou, Yuping Song
[ABSTRACT]
Cryptocurrency transaction fraud detection faces the dual challenges of
increasingly complex transaction patterns and severe class imbalance.
Traditional methods rely on manual feature engineering and struggle to capture
temporal and structural dependencies in transaction networks. This paper
proposes an Augmented Temporal-aware Graph Attention Network (ATGAT) that
enhances detection performance through three modules: (1) designing an advanced
temporal embedding module that fuses multi-scale time difference features with
periodic position encoding; (2) constructing a temporal-aware triple attention
mechanism that jointly optimizes structural, temporal, and global context
attention; (3) employing weighted BCE loss to address class imbalance.
Experiments on the Elliptic++ cryptocurrency dataset demonstrate that ATGAT
achieves an AUC of 0.9130, representing a 9.2% improvement over the best
traditional method XGBoost, 12.0% over GCN, and 10.0% over standard GAT. This
method not only validates the enhancement effect of temporal awareness and
triple attention mechanisms on graph neural networks, but also provides
financial institutions with more reliable fraud detection tools, with its
design principles generalizable to other temporal graph anomaly detection
tasks.
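
As a small illustration of the third module, a weighted binary cross-entropy loss can be written as below. The positive-class weight and the clipping constant are assumptions; a common heuristic is to set the weight near the ratio of negative to positive examples, though the abstract does not state the value used.

```python
import math

def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight.

    probs      -- predicted probabilities of the positive (fraud) class
    labels     -- ground-truth labels, 1 for fraud and 0 otherwise
    pos_weight -- hypothetical weight > 1 that makes missed fraud cases cost more
    """
    eps = 1e-12  # clip probabilities to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

With `pos_weight=1.0` this reduces to the standard BCE loss; larger values increase the penalty on misclassified minority-class transactions only.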
[LINK]
http://arxiv.org/abs/2506.21382v1
[DATE]
2025-06-26 23:34:06+08:00
[CATEGORIES]
cs.LG
HARPT: A Corpus for Analyzing Consumers’ Trust and Privacy Concerns in Mobile Health Apps
[AUTHORS]
Timoteo Kelly, Abdulkadir Korkmaz, Samuel Mallet, Connor Souders, Sadra Aliakbarpour, Praveen Rao
[ABSTRACT]
We present HARPT, a large-scale annotated corpus of mobile health app store
reviews aimed at advancing research in user privacy and trust. The dataset
comprises over 480,000 user reviews labeled into seven categories that capture
critical aspects of trust in applications, trust in providers and privacy
concerns. Creating HARPT required addressing multiple complexities, such as
defining a nuanced label schema, isolating relevant content from large volumes
of noisy data, and designing an annotation strategy that balanced scalability
with accuracy. This strategy integrated rule-based filtering, iterative manual
labeling with review, targeted data augmentation, and weak supervision using
transformer-based classifiers to accelerate coverage. In parallel, a carefully
curated subset of 7,000 reviews was manually annotated to support model
development and evaluation. We benchmark a broad range of classification
models, demonstrating that strong performance is achievable and providing a
baseline for future research. HARPT is released as a public resource to support
work in health informatics, cybersecurity, and natural language processing.
[LINK]
http://arxiv.org/abs/2506.19268v2
[DATE]
2025-06-26 23:23:54+08:00
[CATEGORIES]
cs.LG
Latent Diffusion Model Based Denoising Receiver for 6G Semantic Communication: From Stochastic Differential Theory to Application
[AUTHORS]
Xiucheng Wang, Honggang Jia, Nan Cheng, Dusit Niyato
[ABSTRACT]
In this paper, a novel semantic communication framework empowered by
generative artificial intelligence (GAI) is proposed, to enhance the robustness
against both channel noise and transmission data distribution shifts. A
theoretical foundation is established using stochastic differential equations
(SDEs), from which a closed-form mapping between any signal-to-noise ratio
(SNR) and the optimal denoising timestep is derived. Moreover, to address
distribution mismatch, a mathematical scaling method is introduced to align
received semantic features with the training distribution of the GAI. Built on
this theoretical foundation, a latent diffusion model (LDM)-based semantic
communication framework is proposed that combines a variational autoencoder for
semantic feature extraction with a pretrained diffusion model for
denoising. The proposed system is a training-free framework that supports
zero-shot generalization, and achieves superior performance under low-SNR and
out-of-distribution conditions, offering a scalable and robust solution for
future 6G semantic communication systems. Experimental results demonstrate that
the proposed semantic communication framework achieves state-of-the-art
performance in both pixel-level accuracy and semantic perceptual quality,
consistently outperforming baselines across a wide range of SNRs and data
distributions without any fine-tuning or post-training.
[LINK]
http://arxiv.org/abs/2506.05710v2
[DATE]
2025-06-26 23:21:59+08:00
[CATEGORIES]
cs.LG
MAx-DNN: Multi-Level Arithmetic Approximation for Energy-Efficient DNN Hardware Accelerators
[AUTHORS]
Vasileios Leon, Georgios Makris, Sotirios Xydis, Kiamal Pekmestzi, Dimitrios Soudris
[ABSTRACT]
Nowadays, the rapid growth of Deep Neural Network (DNN) architectures has
established them as the de facto approach for performing advanced Machine
Learning tasks with excellent accuracy. Targeting low-power DNN computing, this
paper examines the interplay of fine-grained error resilience of DNN workloads
in collaboration with hardware approximation techniques, to achieve higher
levels of energy efficiency. Utilizing the state-of-the-art ROUP approximate
multipliers, we systematically explore their fine-grained distribution across
the network according to our layer-, filter-, and kernel-level approaches, and
examine their impact on accuracy and energy. We use the ResNet-8 model on the
CIFAR-10 dataset to evaluate our approximations. The proposed solution delivers
up to 54% energy gains in exchange for up to 4% accuracy loss, compared to the
baseline quantized model, while it provides 2x energy gains with better
accuracy versus the state-of-the-art DNN approximations.
[COMMENTS]
Presented at the 13th IEEE LASCAS Conference
[LINK]
http://arxiv.org/abs/2506.21371v1
[DATE]
2025-06-26 23:21:12+08:00
[CATEGORIES]
cs.LG
rQdia: Regularizing Q-Value Distributions With Image Augmentation
[AUTHORS]
Sam Lerman, Jing Bi
[ABSTRACT]
rQdia regularizes Q-value distributions with augmented images in pixel-based
deep reinforcement learning. With a simple auxiliary loss that equalizes these
distributions via MSE, rQdia boosts DrQ and SAC on 9/12 and 10/12 tasks
respectively in the MuJoCo Continuous Control Suite from pixels, and
Data-Efficient Rainbow on 18/26 Atari Arcade environments. Gains are measured
in both sample efficiency and longer-term training. Moreover, the addition of
rQdia finally propels model-free continuous control from pixels over the state
encoding baseline.
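
The auxiliary loss described above can be illustrated with a minimal sketch. It assumes the Q-values for an observation and for its augmented copy are already available as plain lists of shape (batch, num_actions). The function name `rqdia_auxiliary_loss` and the toy numbers are hypothetical, and this is not the authors' implementation, only an MSE between the two sets of Q-values.

```python
def rqdia_auxiliary_loss(q_original, q_augmented):
    """Mean squared error between two equally shaped batches of Q-values.

    q_original, q_augmented: lists of rows, one row of Q-values per sample.
    (Hypothetical helper; the abstract only states that the loss is an MSE.)
    """
    squared_errors = [
        (qo - qa) ** 2
        for row_o, row_a in zip(q_original, q_augmented)
        for qo, qa in zip(row_o, row_a)
    ]
    return sum(squared_errors) / len(squared_errors)


# Toy Q-values for two observations and two actions (made-up numbers).
q_orig = [[1.0, 2.0], [0.5, 0.0]]
q_aug = [[1.0, 1.0], [0.5, 2.0]]
loss = rqdia_auxiliary_loss(q_orig, q_aug)
```

In a training loop this term would presumably be added to the base agent's loss with some weight; the abstract does not specify that weighting.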
[LINK]
http://arxiv.org/abs/2506.21367v1
[DATE]
2025-06-26 23:16:35+08:00
[CATEGORIES]
cs.LG
SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning
[AUTHORS]
Melanie Rieff, Maya Varma, Ossian Rabow, Subathra Adithan, Julie Kim, Ken Chang, Hannah Lee, Nidhi Rohatgi, Christian Bluethgen, Mohamed S. Muneer, Jean-Benoit Delbrouck, Michael Moor
[ABSTRACT]
Multimodal in-context learning (ICL) remains underexplored despite
significant potential for domains such as medicine. Clinicians routinely
encounter diverse, specialized tasks requiring adaptation from limited
examples, such as drawing insights from a few relevant prior cases or
considering a constrained set of differential diagnoses. While multimodal large
language models (MLLMs) have shown advances in medical visual question
answering (VQA), their ability to learn multimodal tasks from context is
largely unknown. We introduce SMMILE, the first expert-driven multimodal ICL
benchmark for medical tasks. Eleven medical experts curated problems, each
including a multimodal query and multimodal in-context examples as task
demonstrations. SMMILE encompasses 111 problems (517 question-image-answer
triplets) covering 6 medical specialties and 13 imaging modalities. We further
introduce SMMILE++, an augmented variant with 1038 permuted problems. A
comprehensive evaluation of 15 MLLMs demonstrates that most models exhibit
moderate to poor multimodal ICL ability in medical tasks. In open-ended
evaluations, ICL contributes only 8% average improvement over zero-shot on
SMMILE and 9.4% on SMMILE++. We observe a susceptibility to irrelevant
in-context examples: even a single noisy or irrelevant example can degrade
performance by up to 9.5%. Moreover, example ordering exhibits a recency bias,
i.e., placing the most relevant example last can lead to substantial
performance improvements by up to 71%. Our findings highlight critical
limitations and biases in current MLLMs when learning multimodal medical tasks
from context.
[LINK]
http://arxiv.org/abs/2506.21355v1
[DATE]
2025-06-26 23:08:18+08:00
[CATEGORIES]
cs.LG
Lipschitz Bounds for Persistent Laplacian Eigenvalues under One-Simplex Insertions
[AUTHORS]
Le Vu Anh, Mehmet Dik, Nguyen Viet Anh
[ABSTRACT]
Persistent Laplacians are matrix operators that track how the shape and
structure of data transform across scales and are popularly adopted in biology,
physics, and machine learning. Their eigenvalues are concise descriptors of
geometric and topological features in a filtration. Although earlier work
established global algebraic stability for these operators, the precise change
in a single eigenvalue when one simplex, such as a vertex, edge, or triangle,
is added has remained unknown. This is important because downstream tools,
including heat-kernel signatures and spectral neural networks, depend directly
on these eigenvalues. We close this gap by proving a uniform Lipschitz bound:
after inserting one simplex, every up-persistent Laplacian eigenvalue can vary
by at most twice the Euclidean norm of that simplex’s boundary, independent of
filtration scale and complex size. This result delivers the first
eigenvalue-level robustness guarantee for spectral topological data analysis.
It guarantees that spectral features remain stable under local updates and
enables reliable error control in dynamic data settings.
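
Written out, the bound stated above takes the following form. The notation is an assumption made here for illustration: the abstract does not fix symbols, and the eigenvalues are taken to be matched in sorted order.

```latex
% Assumed notation (not from the abstract): \lambda_i and \lambda_i' denote the
% i-th up-persistent Laplacian eigenvalue before and after inserting the
% simplex \sigma, and \partial\sigma is the boundary chain of \sigma.
\[
  \lvert \lambda_i' - \lambda_i \rvert \;\le\; 2\,\lVert \partial\sigma \rVert_2
  \qquad \text{for every index } i .
\]
% Example, assuming unit boundary coefficients: for an edge \sigma = [u, v],
% \partial\sigma = v - u, so \lVert \partial\sigma \rVert_2 = \sqrt{2}
% and the bound reads 2\sqrt{2}.
```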
[COMMENTS]
16 pages, 4 figures
[LINK]
http://arxiv.org/abs/2506.21352v1
[DATE]
2025-06-26 23:03:54+08:00
[CATEGORIES]
cs.LG
On the Ability of Deep Networks to Learn Symmetries from Data: A Neural Kernel Theory
[AUTHORS]
Andrea Perin, Stephane Deny
[ABSTRACT]
Symmetries (transformations by group actions) are present in many datasets,
and leveraging them holds considerable promise for improving predictions in
machine learning. In this work, we aim to understand when and how deep networks
– with standard architectures trained in a standard, supervised way – learn
symmetries from data. Inspired by real-world scenarios, we study a
classification paradigm where data symmetries are only partially observed
during training: some classes include all transformations of a cyclic group,
while others – only a subset. In the infinite-width limit, where kernel
analogies apply, we derive a neural kernel theory of symmetry learning. The
group-cyclic nature of the dataset allows us to analyze the Gram matrix of
neural kernels in the Fourier domain; here we find a simple characterization of
the generalization error as a function of class separation (signal) and
class-orbit density (noise). This characterization reveals that generalization
can only be successful when the local structure of the data prevails over its
non-local, symmetry-induced structure, in the kernel space defined by the
architecture. We extend our theoretical treatment to any finite group,
including non-abelian groups. Our framework also applies to equivariant
architectures (e.g., CNNs), and recovers their success in the special case
where the architecture matches the inherent symmetry of the data. Empirically,
our theory reproduces the generalization failure of finite-width networks (MLP,
CNN, ViT) trained on partially observed versions of rotated-MNIST. We conclude
that conventional deep networks lack a mechanism to learn symmetries that have
not been explicitly embedded in their architecture a priori. Our framework
could be extended to guide the design of architectures and training procedures
able to learn symmetries from data.
[COMMENTS]
JMLR accepted version, including an extension of the theory to
general finite groups (including non-abelian groups)
[LINK]
http://arxiv.org/abs/2412.11521v2
[DATE]
2025-06-26 23:02:44+08:00
[CATEGORIES]
cs.LG
Learning Value of Information towards Joint Communication and Control in 6G V2X
[AUTHORS]
Lei Lei, Kan Zheng, Xuemin Shen
[ABSTRACT]
As Cellular Vehicle-to-Everything (C-V2X) evolves towards future
sixth-generation (6G) networks, Connected Autonomous Vehicles (CAVs) are
emerging to become a key application. Leveraging data-driven Machine Learning
(ML), especially Deep Reinforcement Learning (DRL), is expected to
significantly enhance CAV decision-making in both vehicle control and V2X
communication under uncertainty. These two decision-making processes are
closely intertwined, with the value of information (VoI) acting as a crucial
bridge between them. In this paper, we introduce Sequential Stochastic Decision
Process (SSDP) models to define and assess VoI, demonstrating their application
in optimizing communication systems for CAVs. Specifically, we formally define
the SSDP model and demonstrate that the MDP model is a special case of it. The
SSDP model offers a key advantage by explicitly representing the set of
information that can enhance decision-making when available. Furthermore, as
current research on VoI remains fragmented, we propose a systematic VoI
modeling framework grounded in the MDP, Reinforcement Learning (RL) and Optimal
Control theories. We define different categories of VoI and discuss their
corresponding estimation methods. Finally, we present a structured approach to
leverage the various VoI metrics for optimizing the “When”, “What”, and
“How” to communicate problems. For this purpose, SSDP models are formulated
with VoI-associated reward functions derived from VoI-based optimization
objectives. While we use a simple vehicle-following control problem to
illustrate the proposed methodology, it holds significant potential to
facilitate the joint optimization of stochastic, sequential control and
communication decisions in a wide range of networked control systems.
[LINK]
http://arxiv.org/abs/2505.06978v2
[DATE]
2025-06-26 23:01:20+08:00
[CATEGORIES]
cs.LG
PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks
[AUTHORS]
Ping Guo, Xiang Li, Zhiyuan Yang, Xi Lin, Qingchuan Zhao, Qingfu Zhang
[ABSTRACT]
Black-box query-based attacks constitute significant threats to Machine
Learning as a Service (MLaaS) systems since they can generate adversarial
examples without accessing the target model’s architecture and parameters.
Traditional defense mechanisms, such as adversarial training, gradient masking,
and input transformations, either impose substantial computational costs or
compromise the test accuracy of non-adversarial inputs. To address these
challenges, we propose an efficient defense mechanism, PuriDefense, that
employs random patch-wise purifications with an ensemble of lightweight
purification models at a low level of inference cost. These models leverage the
local implicit function and rebuild the natural image manifold. Our theoretical
analysis suggests that this approach slows down the convergence of query-based
attacks by incorporating randomness into purifications. Extensive experiments
on CIFAR-10 and ImageNet validate the effectiveness of our proposed
purifier-based defense mechanism, demonstrating significant improvements in
robustness against query-based attacks.
[LINK]
http://arxiv.org/abs/2401.10586v2
[DATE]
2025-06-26 23:00:42+08:00
[CATEGORIES]
cs.LG
Regret Bounds for Robust Online Decision Making
[AUTHORS]
Alexander Appel, Vanessa Kosoy
[ABSTRACT]
We propose a framework which generalizes “decision making with structured
observations” by allowing robust (i.e. multivalued) models. In this framework,
each model associates each decision with a convex set of probability
distributions over outcomes. Nature can choose distributions out of this set in
an arbitrary (adversarial) manner, that can be nonoblivious and depend on past
history. The resulting framework offers much greater generality than classical
bandits and reinforcement learning, since the realizability assumption becomes
much weaker and more realistic. We then derive a theory of regret bounds for
this framework. Although our lower and upper bounds are not tight, they are
sufficient to fully characterize power-law learnability. We demonstrate this
theory in two special cases: robust linear bandits and tabular robust online
reinforcement learning. In both cases, we derive regret bounds that improve
state-of-the-art (except that we do not address computational efficiency).
[LINK]
http://arxiv.org/abs/2504.06820v2
[DATE]
2025-06-26 22:54:55+08:00
[CATEGORIES]
cs.LG
DynamicBench: Evaluating Real-Time Report Generation in Large Language Models
[AUTHORS]
Jingyao Li, Hao Sun, Zile Qiao, Yong Jiang, Pengjun Xie, Fei Huang, Hong Xu, Jiaya Jia
[ABSTRACT]
Traditional benchmarks for large language models (LLMs) typically rely on
static evaluations through storytelling or opinion expression, which fail to
capture the dynamic requirements of real-time information processing in
contemporary applications. To address this limitation, we present DynamicBench,
a benchmark designed to evaluate the proficiency of LLMs in storing and
processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval
pipeline, integrating web searches with local report databases. It necessitates
domain-specific knowledge, ensuring accurate report generation within
specialized fields. By evaluating models in scenarios that either provide or
withhold external documents, DynamicBench effectively measures their capability
to independently process recent information or leverage contextual
enhancements. Additionally, we introduce an advanced report generation system
adept at managing dynamic information synthesis. Our experimental results
confirm the efficacy of our approach, with our method achieving
state-of-the-art performance, surpassing GPT4o in document-free and
document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data
will be made publicly available.
[LINK]
http://arxiv.org/abs/2506.21343v1
[DATE]
2025-06-26 22:53:44+08:00
[CATEGORIES]
cs.LG
A Scalable Quantum Neural Network for Approximate SRBB-Based Unitary Synthesis
[AUTHORS]
Giacomo Belli, Marco Mordacci, Michele Amoretti
[ABSTRACT]
In this work, a scalable quantum neural network is introduced as a means to
approximate any unitary evolution through the Standard Recursive Block Basis
(SRBB) and, subsequently, redesigned with a number of CNOTs asymptotically
reduced by an exponential contribution. This algebraic approach to the problem
of unitary synthesis exploits Lie algebras and their topological features to
obtain scalable parameterizations of unitary operators. First, the original
SRBB-based scalability scheme, already known in the literature only from a
theoretical point of view, is reformulated for efficient algorithm
implementation and complexity management. Remarkably, 2-qubit operators emerge
as a special case outside the original scaling scheme. Furthermore, an
algorithm is proposed to reduce the number of CNOTs, thus deriving a new
implementable scaling scheme that requires only one layer of approximation. The
scalable CNOT-reduced quantum neural network is implemented and its performance
is assessed with a variety of different unitary matrices, both sparse and
dense, up to 6 qubits via the PennyLane library. The effectiveness of the
approximation is measured with different metrics in relation to two optimizers:
a gradient-based method and the Nelder-Mead method. The approximate
CNOT-reduced SRBB-based synthesis algorithm is also tested on real hardware and
compared with other valid approximation and decomposition methods available in
the literature.
[LINK]
http://arxiv.org/abs/2412.03083v2
[DATE]
2025-06-26 22:43:45+08:00
[CATEGORIES]
cs.LG
ScaleGNN: Towards Scalable Graph Neural Networks via Adaptive High-order Neighboring Feature Fusion
[AUTHORS]
Xiang Li, Jianpeng Qi, Haobing Liu, Yuan Cao, Guoqing Chao, Zhongying Zhao, Junyu Dong, Yanwei Yu
[ABSTRACT]
Graph Neural Networks (GNNs) have demonstrated impressive performance across
diverse graph-based tasks by leveraging message passing to capture complex node
relationships. However, when applied to large-scale real-world graphs, GNNs
face two major challenges: First, it becomes increasingly difficult to ensure
both scalability and efficiency, as the repeated aggregation of large
neighborhoods leads to significant computational overhead; Second, the
over-smoothing problem arises, where excessive or deep propagation makes node
representations indistinguishable, severely hindering model expressiveness. To
tackle these issues, we propose ScaleGNN, a novel framework that adaptively
fuses multi-hop node features for both scalable and effective graph learning.
First, we construct per-hop pure neighbor matrices that capture only the
exclusive structural information at each hop, avoiding the redundancy of
conventional aggregation. Then, an enhanced feature fusion strategy
significantly balances low-order and high-order information, preserving both
local detail and global correlations without incurring excessive complexity. To
further reduce redundancy and over-smoothing, we introduce a Local Contribution
Score (LCS)-based masking mechanism to filter out less relevant high-order
neighbors, ensuring that only the most meaningful information is aggregated. In
addition, learnable sparse constraints selectively integrate multi-hop valuable
features, emphasizing the most informative high-order neighbors. Extensive
experiments on real-world datasets demonstrate that ScaleGNN consistently
outperforms state-of-the-art GNNs in both predictive accuracy and computational
efficiency, highlighting its practical value for large-scale graph learning.
[LINK]
http://arxiv.org/abs/2504.15920v4
[DATE]
2025-06-26 22:41:32+08:00
[CATEGORIES]
cs.LG
Stochastic Quantum Spiking Neural Networks with Quantum Memory and Local Learning
[AUTHORS]
Jiechen Chen, Bipin Rajendran, Osvaldo Simeone
[ABSTRACT]
Neuromorphic and quantum computing have recently emerged as promising
paradigms for advancing artificial intelligence, each offering complementary
strengths. Neuromorphic systems built on spiking neurons excel at processing
time-series data efficiently through sparse, event-driven computation,
consuming energy only upon input events. Quantum computing, on the other hand,
leverages superposition and entanglement to explore feature spaces that are
exponentially large in the number of qubits. Hybrid approaches combining these
paradigms have begun to show potential, but existing quantum spiking models
have important limitations. Notably, prior quantum spiking neuron
implementations rely on classical memory mechanisms on single qubits, requiring
repeated measurements to estimate firing probabilities, and they use
conventional backpropagation on classical simulators for training. Here we
propose a stochastic quantum spiking (SQS) neuron model that addresses these
challenges. The SQS neuron uses multi-qubit quantum circuits to realize a
spiking unit with internal quantum memory, enabling event-driven probabilistic
spike generation in a single shot. Furthermore, we outline how networks of SQS
neurons – dubbed SQS neural networks (SQSNNs) – can be trained via a
hardware-friendly local learning rule, eliminating the need for global
classical backpropagation. The proposed SQSNN model fuses the time-series
efficiency of neuromorphic computing with the exponentially large inner state
space of quantum computing, paving the way for quantum spiking neural networks
that are modular, scalable, and trainable on quantum hardware.
[LINK]
http://arxiv.org/abs/2506.21324v1
[DATE]
2025-06-26 22:39:14+08:00
[CATEGORIES]
cs.LG
On Uniform Weighted Deep Polynomial approximation
[AUTHORS]
Kingsley Yeon, Steven B. Damelin
[ABSTRACT]
It is a classical result in rational approximation theory that certain
non-smooth or singular functions, such as $|x|$ and $x^{1/p}$, can be
efficiently approximated using rational functions with root-exponential
convergence in terms of degrees of freedom \cite{Sta, GN}. In contrast,
polynomial approximations admit only algebraic convergence by Jackson’s theorem
\cite{Lub2}. Recent work shows that composite polynomial architectures can
recover exponential approximation rates even without smoothness \cite{KY}. In
this work, we introduce and analyze a class of weighted deep polynomial
approximants tailored for functions with asymmetric behavior: growing unbounded
on one side and decaying on the other. By multiplying a learnable deep
polynomial with a one-sided weight, we capture both local non-smoothness and
global growth. We show numerically that this framework outperforms Taylor,
Chebyshev, and standard deep polynomial approximants, even when all use the
same number of parameters. To optimize these approximants in practice, we
propose a stable graph-based parameterization strategy building on \cite{Jar}.
[LINK]
http://arxiv.org/abs/2506.21306v1
[DATE]
2025-06-26 22:25:32+08:00
[CATEGORIES]
cs.LG
Context-Aware Doubly-Robust Semi-Supervised Learning
[AUTHORS]
Clement Ruah, Houssem Sifaou, Osvaldo Simeone, Bashir Al-Hashimi
[ABSTRACT]
The widespread adoption of artificial intelligence (AI) in next-generation
communication systems is challenged by the heterogeneity of traffic and network
conditions, which calls for the use of highly contextual, site-specific data. A
promising solution is to rely not only on real-world data, but also on
synthetic pseudo-data generated by a network digital twin (NDT). However, the
effectiveness of this approach hinges on the accuracy of the NDT, which can
vary widely across different contexts. To address this problem, this paper
introduces context-aware doubly-robust (CDR) learning, a novel semi-supervised
scheme that adapts its reliance on the pseudo-data to the different levels of
fidelity of the NDT across contexts. CDR is evaluated on the task of downlink
beamforming where it outperforms previous state-of-the-art approaches,
providing a 24% loss decrease when compared to doubly-robust (DR)
semi-supervised learning in regimes with low labeled data availability.
[COMMENTS]
This work has been accepted for publication in IEEE Signal Processing
Letters
[LINK]
http://arxiv.org/abs/2502.15577v2
[DATE]
2025-06-26 22:22:27+08:00
[CATEGORIES]
cs.LG
Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance
[AUTHORS]
Xuesong Li, Dianye Huang, Yameng Zhang, Nassir Navab, Zhongliang Jiang
[ABSTRACT]
Understanding medical ultrasound imaging remains a long-standing challenge
due to significant visual variability caused by differences in imaging and
acquisition parameters. Recent advancements in large language models (LLMs)
have been used to automatically generate terminology-rich summaries orientated
to clinicians with sufficient physiological knowledge. Nevertheless, the
increasing demand for improved ultrasound interpretability and basic scanning
guidance among non-expert users, e.g., in point-of-care settings, has not yet
been explored. In this study, we first introduce the scene graph (SG) for
ultrasound images to explain image content to ordinary users and provide guidance for
ultrasound scanning. The ultrasound SG is first computed using a
transformer-based one-stage method, eliminating the need for explicit object
detection. To generate a graspable image explanation for ordinary users, the user
query is then used to further refine the abstract SG representation through
LLMs. Additionally, the predicted SG is explored for its potential in guiding
ultrasound scanning toward missing anatomies within the current imaging view,
assisting ordinary users in achieving more standardized and complete anatomical
exploration. The effectiveness of this SG-based image explanation and scanning
guidance has been validated on images from the left and right neck regions,
including the carotid and thyroid, across five volunteers. The results
demonstrate the potential of the method to maximally democratize ultrasound by
enhancing its interpretability and usability for ordinary users.
[LINK]
http://arxiv.org/abs/2506.19683v2
[DATE]
2025-06-26 22:20:13+08:00
[CATEGORIES]
cs.LG
Devil’s Hand: Data Poisoning Attacks to Locally Private Graph Learning Protocols
[AUTHORS]
Longzhu He, Chaozhuo Li, Peng Tang, Li Sun, Sen Su, Philip S. Yu
[ABSTRACT]
Graph neural networks (GNNs) have achieved significant success in graph
representation learning and have been applied to various domains. However, many
real-world graphs contain sensitive personal information, such as user profiles
in social networks, raising serious privacy concerns when graph learning is
performed using GNNs. To address this issue, locally private graph learning
protocols have gained considerable attention. These protocols leverage the
privacy advantages of local differential privacy (LDP) and the effectiveness of
GNN’s message-passing in calibrating noisy data, offering strict privacy
guarantees for users’ local data while maintaining high utility (e.g., node
classification accuracy) for graph learning. Despite these advantages, such
protocols may be vulnerable to data poisoning attacks, a threat that has not
been considered in previous research. Identifying and addressing these threats
is crucial for ensuring the robustness and security of privacy-preserving graph
learning frameworks. This work introduces the first data poisoning attack
targeting locally private graph learning protocols. The attacker injects fake
users into the protocol, manipulates these fake users to establish links with
genuine users, and sends carefully crafted data to the server, ultimately
compromising the utility of private graph learning. The effectiveness of the
attack is demonstrated both theoretically and empirically. In addition, several
defense strategies have also been explored, but their limited effectiveness
highlights the need for more robust defenses.
[LINK]
http://arxiv.org/abs/2506.09803v2
[DATE]
2025-06-26 22:18:21+08:00
[CATEGORIES]
cs.LG
Improved seeding strategies for k-means and k-GMM
[AUTHORS]
Guillaume Carrière, Frédéric Cazals
[ABSTRACT]
We revisit the randomized seeding techniques for k-means clustering and k-GMM
(Gaussian Mixture model fitting with Expectation-Maximization), formalizing
their three key ingredients: the metric used for seed sampling, the number of
candidate seeds, and the metric used for seed selection. This analysis yields
novel families of initialization methods exploiting a lookahead
principle–conditioning the seed selection to an enhanced coherence with the
final metric used to assess the algorithm, and a multipass strategy to tame
down the effect of randomization.
Experiments show a consistent constant factor improvement over classical
contenders in terms of the final metric (SSE for k-means, log-likelihood for
k-GMM), at a modest overhead. In particular, for k-means, our methods improve
on the recently designed multi-swap strategy, which was the first one to
outperform the greedy k-means++ seeding.
Our experimental analysis also sheds light on subtle properties of k-means
often overlooked, including the (lack of) correlations between the SSE upon
seeding and the final SSE, the variance reduction phenomena observed in
iterative seeding methods, and the sensitivity of the final SSE to the pool
size for greedy methods.
Practically, our most effective seeding methods are strong candidates to
become one of the–if not the–standard techniques. From a theoretical
perspective, our formalization of seeding opens the door to a new line of
analytical approaches.
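
For orientation, the classical greedy k-means++ baseline mentioned above can be sketched as follows. This is the well-known baseline, not the authors' new lookahead or multipass methods. The function names, the `num_candidates` default, and the toy points are assumptions made for illustration. The sketch shows the three ingredients named in the abstract: candidates are sampled with probability proportional to squared distance, the pool size is `num_candidates`, and the candidate giving the lowest SSE is selected.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def greedy_kmeanspp(points, k, num_candidates=3, seed=0):
    """Greedy k-means++ seeding; returns (centers, SSE upon seeding)."""
    rng = random.Random(seed)
    centers = [points[rng.randrange(len(points))]]
    # d2[i] is the squared distance from points[i] to its nearest center so far.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point already coincides with a center: fall back to uniform.
            candidates = [rng.randrange(len(points)) for _ in range(num_candidates)]
        else:
            # D^2 sampling: draw candidates proportionally to squared distance.
            candidates = rng.choices(range(len(points)), weights=d2, k=num_candidates)
        best_idx, best_d2, best_sse = None, None, float("inf")
        for c in candidates:
            new_d2 = [min(d, sq_dist(p, points[c])) for d, p in zip(d2, points)]
            sse = sum(new_d2)
            if sse < best_sse:
                best_idx, best_d2, best_sse = c, new_d2, sse
        centers.append(points[best_idx])
        d2 = best_d2
    return centers, sum(d2)


# Toy data: two well-separated pairs of points (made-up values).
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers1, sse1 = greedy_kmeanspp(pts, 1)
centers2, sse2 = greedy_kmeanspp(pts, 2)
centers4, sse4 = greedy_kmeanspp(pts, 4)
```

Raising `num_candidates` trades extra distance computations per seed for a lower SSE upon seeding; the abstract notes that this does not necessarily translate into a lower final SSE.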
[COMMENTS]
13 pages
[LINK]
http://arxiv.org/abs/2506.21291v1
[DATE]
2025-06-26 22:10:40+08:00
[CATEGORIES]
cs.LG
Energy Matching: Unifying Flow Matching and Energy-Based Models for Generative Modeling
[AUTHORS]
Michal Balcerak, Tamaz Amiranashvili, Antonio Terpin, Suprosanna Shit, Lea Bogensperger, Sebastian Kaltenbach, Petros Koumoutsakos, Bjoern Menze
[ABSTRACT]
The most widely used generative models map noise and data distributions by
matching flows or scores. However, they struggle to incorporate partial
observations and additional priors–something energy-based models (EBMs) handle
elegantly by simply adding corresponding scalar energy terms. We address this
issue by proposing Energy Matching, a framework that endows flow-based
approaches with the flexibility of EBMs. Far from the data manifold, samples
move along curl-free, optimal transport paths from noise to data. As they
approach the data manifold, an entropic energy term guides the system into a
Boltzmann equilibrium distribution, explicitly capturing the underlying
likelihood structure of the data. We parameterize this dynamic with a single
time-independent scalar field, which serves as both a powerful generator and a
flexible prior for effective regularization of inverse problems. Our method
substantially outperforms existing EBMs on CIFAR-10 and ImageNet generation in
terms of fidelity, while retaining simulation-free training of transport-based
approaches away from the data manifold. Furthermore, we leverage the method’s
flexibility to introduce an interaction energy that supports diverse mode
exploration, which we demonstrate in a controlled protein-generation setting.
Our approach focuses on learning a scalar potential energy–without
time-conditioning, auxiliary generators, or additional networks–which marks a
significant departure from recent EBM methods. We believe that this simplified
framework significantly advances EBMs’ capabilities and paves the way for their
wider adoption in generative modeling across diverse domains.
[LINK]
http://arxiv.org/abs/2504.10612v4
[DATE]
2025-06-26 22:04:51+08:00
[CATEGORIES]
cs.LG
Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution
[AUTHORS]
Lukas Sablica, Kurt Hornik
[ABSTRACT]
We propose a novel variational autoencoder (VAE) architecture that employs a
spherical Cauchy (spCauchy) latent distribution. Unlike traditional Gaussian
latent spaces or the widely used von Mises-Fisher (vMF) distribution, spCauchy
provides a more natural hyperspherical representation of latent variables,
better capturing directional data while maintaining flexibility. Its
heavy-tailed nature prevents over-regularization, ensuring efficient latent
space utilization while offering a more expressive representation.
Additionally, spCauchy circumvents the numerical instabilities inherent to vMF,
which arise from computing normalization constants involving Bessel functions.
Instead, it enables a fully differentiable and efficient reparameterization
trick via Möbius transformations, allowing for stable and scalable training.
The KL divergence can be computed through a rapidly converging power series,
eliminating concerns of underflow or overflow associated with evaluation of
ratios of hypergeometric functions. These properties make spCauchy a compelling
alternative for VAEs, offering both theoretical advantages and practical
efficiency in high-dimensional generative modeling.
[LINK]
http://arxiv.org/abs/2506.21278v1
[DATE]
2025-06-26 22:01:51+08:00
[CATEGORIES]
cs.LG
Lagrangian Index Policy for Restless Bandits with Average Reward
[AUTHORS]
Konstantin Avrachenkov, Vivek S. Borkar, Pratik Shah
[ABSTRACT]
We study the Lagrange Index Policy (LIP) for restless multi-armed bandits
with long-run average reward. In particular, we compare the performance of LIP
with the performance of the Whittle Index Policy (WIP), both heuristic policies
known to be asymptotically optimal under certain natural conditions. Even
though in most cases their performances are very similar, in the cases when WIP
shows bad performance, LIP continues to perform very well. We then propose
reinforcement learning algorithms, both tabular and NN-based, to obtain online
learning schemes for LIP in the model-free setting. The proposed reinforcement
learning schemes for LIP require significantly less memory than the analogous
schemes for WIP. We calculate analytically the Lagrange index for the restart
model, which applies to optimal web crawling and to the minimization of the
weighted age of information. We also give a new proof of asymptotic optimality
in the case of homogeneous arms as the number of arms goes to infinity, based on
exchangeability and de Finetti’s theorem.
[LINK]
http://arxiv.org/abs/2412.12641v2
[DATE]
2025-06-26 22:00:55+08:00
[CATEGORIES]
cs.LG
A GREAT Architecture for Edge-Based Graph Problems Like TSP
[AUTHORS]
Attila Lischka, Filip Rydin, Jiaming Wu, Morteza Haghir Chehreghani, Balázs Kulcsár
[ABSTRACT]
In recent years, many learning-based approaches have been proposed to
tackle combinatorial optimization problems such as routing problems. Many of
these approaches are based on graph neural networks (GNNs) or related
transformers, operating on the Euclidean coordinates representing the routing
problems. However, models operating on Euclidean coordinates are ill-suited for
non-Euclidean, asymmetric problem instances that are often found in real-world
settings. To overcome this limitation, we propose a novel GNN-based and
edge-focused neural model called Graph Edge Attention Network (GREAT). Using
GREAT as an encoder to capture the properties of a routing problem instance, we
build a reinforcement learning framework which we apply to Euclidean and
non-Euclidean variants of vehicle routing problems such as Traveling Salesman
Problem, Capacitated Vehicle Routing Problem and Orienteering Problem. Our
framework is among the first to tackle non-Euclidean variants of these problems
and achieves competitive results among learning-based solvers.
[COMMENTS]
15 pages, 7 figures
[LINK]
http://arxiv.org/abs/2408.16717v2
[DATE]
2025-06-26 21:54:56+08:00
[CATEGORIES]
cs.LG
Wavelet Diffusion Neural Operator
[AUTHORS]
Peiyan Hu, Rui Wang, Xiang Zheng, Tao Zhang, Haodong Feng, Ruiqi Feng, Long Wei, Yue Wang, Zhi-Ming Ma, Tailin Wu
[ABSTRACT]
Simulating and controlling physical systems described by partial differential
equations (PDEs) are crucial tasks across science and engineering. Recently,
diffusion generative models have emerged as a competitive class of methods for
these tasks due to their ability to capture long-term dependencies and model
high-dimensional states. However, diffusion models typically struggle with
handling system states with abrupt changes and generalizing to higher
resolutions. In this work, we propose Wavelet Diffusion Neural Operator (WDNO),
a novel PDE simulation and control framework that enhances the handling of
these complexities. WDNO comprises two key innovations. Firstly, WDNO performs
diffusion-based generative modeling in the wavelet domain for the entire
trajectory to handle abrupt changes and long-term dependencies effectively.
Secondly, to address the issue of poor generalization across different
resolutions, which is one of the fundamental tasks in modeling physical
systems, we introduce multi-resolution training. We validate WDNO on five
physical systems, including 1D advection equation, three challenging physical
systems with abrupt changes (1D Burgers’ equation, 1D compressible
Navier-Stokes equation and 2D incompressible fluid), and a real-world dataset
ERA5, which demonstrates superior performance on both simulation and control
tasks over state-of-the-art methods, with significant improvements in long-term
and detail prediction accuracy. Remarkably, in the challenging context of the
2D high-dimensional and indirect control task aimed at reducing smoke leakage,
WDNO reduces the leakage by 78% compared to the second-best baseline. The code
can be found at https://github.com/AI4Science-WestlakeU/wdno.git.
[LINK]
http://arxiv.org/abs/2412.04833v3
[DATE]
2025-06-26 21:39:47+08:00
[CATEGORIES]
cs.LG
Radio Map Estimation via Latent Domain Plug-and-Play Denoising
[AUTHORS]
Le Xu, Lei Cheng, Junting Chen, Wenqiang Pu, Xiao Fu
[ABSTRACT]
Radio map estimation (RME), also known as spectrum cartography, aims to
reconstruct the strength of radio interference across different domains (e.g.,
space and frequency) from sparsely sampled measurements. To tackle this typical
inverse problem, state-of-the-art RME methods rely on handcrafted or
data-driven structural information of radio maps. However, the former often
struggles to model complex radio frequency (RF) environments and the latter
requires excessive training – making it hard to quickly adapt to in situ
sensing tasks. This work presents a spatio-spectral RME approach based on
plug-and-play (PnP) denoising, a technique from computational imaging. The idea
is to leverage the observation that the denoising operations of signals like
natural images and radio maps are similar – despite the nontrivial differences
of the signals themselves. Hence, sophisticated denoisers designed for or
learned from natural images can be directly employed to assist RME, avoiding
using radio map data for training. Unlike conventional PnP methods that operate
directly in the data domain, the proposed method exploits the underlying
physical structure of radio maps and proposes an ADMM algorithm that denoises
in a latent domain. This design significantly improves computational efficiency
and enhances noise robustness. Theoretical aspects, e.g., recoverability of the
complete radio map and convergence of the ADMM algorithm are analyzed.
Synthetic and real data experiments are conducted to demonstrate the
effectiveness of our approach.
[LINK]
http://arxiv.org/abs/2501.13472v2
[DATE]
2025-06-26 21:31:04+08:00
[CATEGORIES]
cs.LG
Balancing Privacy, Robustness, and Efficiency in Machine Learning
[AUTHORS]
Youssef Allouah, Rachid Guerraoui, John Stephan
[ABSTRACT]
This position paper argues that achieving robustness, privacy, and efficiency
simultaneously in machine learning systems is infeasible under prevailing
threat models. The tension between these goals arises not from algorithmic
shortcomings but from structural limitations imposed by worst-case adversarial
assumptions. We advocate for a systematic research agenda aimed at formalizing
the robustness-privacy-efficiency trilemma, exploring how principled
relaxations of threat models can unlock better trade-offs, and designing
benchmarks that expose rather than obscure the compromises made. By shifting
focus from aspirational universal guarantees to context-aware system design,
the machine learning community can build models that are truly appropriate for
real-world deployment.
[LINK]
http://arxiv.org/abs/2312.14712v3
[DATE]
2025-06-26 21:12:25+08:00
[CATEGORIES]
cs.LG
Unsupervised Learning for Optimal Transport plan prediction between unbalanced graphs
[AUTHORS]
Sonia Mazelet, Rémi Flamary, Bertrand Thirion
[ABSTRACT]
Optimal transport between graphs, based on Gromov-Wasserstein and
other extensions, is a powerful tool for comparing and aligning
graph structures. However, solving the associated non-convex
optimization problems is computationally expensive, which limits the
scalability of these methods to large graphs. In this work, we
present Unbalanced Learning of Optimal Transport (ULOT), a deep
learning method that predicts optimal transport plans between two
graphs. Our method is trained by minimizing the fused unbalanced
Gromov-Wasserstein (FUGW) loss. We propose a novel neural
architecture with cross-attention that is conditioned on the FUGW
tradeoff hyperparameters. We evaluate ULOT on synthetic stochastic
block model (SBM) graphs and on real cortical surface data obtained
from fMRI. ULOT predicts transport plans with competitive loss up to
two orders of magnitude faster than classical solvers. Furthermore,
the predicted plan can be used as a warm start for classical solvers
to accelerate their convergence. Finally, the predicted transport
plan is fully differentiable with respect to the graph inputs and
FUGW hyperparameters, enabling the optimization of functionals of
the ULOT plan.
[LINK]
http://arxiv.org/abs/2506.12025v2
[DATE]
2025-06-26 21:01:32+08:00
[CATEGORIES]
cs.LG
Seal Your Backdoor with Variational Defense
[AUTHORS]
Ivan Sabolić, Matej Grcić, Siniša Šegvić
[ABSTRACT]
We propose VIBE, a model-agnostic framework that trains classifiers resilient
to backdoor attacks. The key concept behind our approach is to treat malicious
inputs and corrupted labels from the training dataset as observed random
variables, while the actual clean labels are latent. VIBE then recovers the
corresponding latent clean label posterior through variational inference. The
resulting training procedure follows the expectation-maximization (EM)
algorithm. The E-step infers the clean pseudolabels by solving an
entropy-regularized optimal transport problem, while the M-step updates the
classifier parameters via gradient descent. Being modular, VIBE can seamlessly
integrate with recent advancements in self-supervised representation learning,
which enhance its ability to resist backdoor attacks. We experimentally
validate the method’s effectiveness against contemporary backdoor attacks on
standard datasets, a large-scale setup with 1k classes, and a dataset
poisoned with multiple attacks. VIBE consistently outperforms previous defenses
across all tested scenarios.
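As an illustration of the E-step described above, entropy-regularized optimal transport is commonly solved with Sinkhorn iterations. The following is a minimal sketch of that generic algorithm, not the VIBE implementation: the function name `sinkhorn`, the toy cost matrix (rows as samples, columns as classes), the uniform marginals, and the regularization value are all assumptions made for the example.

```python
import math

def sinkhorn(cost, row_marginals, col_marginals, reg=0.5, iterations=200):
    """Generic Sinkhorn iterations for entropy-regularized optimal transport.

    cost[i][j] is the (hypothetical) cost of assigning sample i to class j.
    Returns a transport plan whose rows and columns approximately match
    the requested marginals.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> large kernel value.
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_marginals[i] / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_marginals[j] / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy problem: three samples, two classes, uniform marginals (assumed values).
cost = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
plan = sinkhorn(cost, [1 / 3] * 3, [0.5, 0.5])
# Normalizing each row of the plan gives a soft pseudolabel per sample.
pseudolabels = [[p / sum(row) for p in row] for row in plan]
```

In this sketch the column marginals act as a prior over class frequencies, so the soft pseudolabels are balanced across classes rather than chosen independently per sample.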
[COMMENTS]
Accepted to ICCV 2025
[LINK]
http://arxiv.org/abs/2503.08829v2
[DATE]
2025-06-26 20:48:11+08:00
[CATEGORIES]
cs.LG
PCF-Grasp: Converting Point Completion to Geometry Feature to Enhance 6-DoF Grasp
[AUTHORS]
Yaofeng Cheng, Fusheng Zha, Wei Guo, Pengfei Wang, Chao Zeng, Lining Sun, Chenguang Yang
[ABSTRACT]
The 6-Degree of Freedom (DoF) grasp method based on point clouds has shown
significant potential in enabling robots to grasp target objects. However, most
existing methods are based on the point clouds (2.5D points) generated from
single-view depth images. These point clouds capture only one surface side of
the object and therefore provide incomplete geometry information, which misleads
the grasping algorithm when it judges the shape of the target object and results
in low grasping accuracy. Humans can accurately grasp objects from a single view
by leveraging their geometric experience to estimate object shapes. Inspired by
this, we propose a novel 6-DoF grasping framework that converts point completion
results into object shape features to train the 6-DoF grasp network. Here, point
completion generates approximately complete points from the 2.5D points,
similar to human geometric experience, and converting them into shape features
is how we use them to improve grasp efficiency. Furthermore, due to the
gap between network generation and actual execution, we integrate a score
filter into our framework to select more executable grasp proposals for the
real robot. This enables our method to maintain high grasp quality from any
camera viewpoint. Extensive experiments demonstrate that utilizing complete
point features enables the generation of significantly more accurate grasp
proposals and that the inclusion of a score filter greatly enhances the
credibility of real-world robot grasping. Our method achieves a success rate
17.8% higher than the state-of-the-art method in real-world experiments.
[LINK]
http://arxiv.org/abs/2504.16320v2
[DATE]
2025-06-26 20:42:10+08:00
[CATEGORIES]
cs.LG
Variational Supervised Contrastive Learning
[AUTHORS]
Ziwen Wang, Jiajun Fan, Thao Nguyen, Heng Ji, Ge Liu
[ABSTRACT]
Contrastive learning has proven to be highly efficient and adaptable in
shaping representation spaces across diverse modalities by pulling similar
samples together and pushing dissimilar ones apart. However, two key
limitations persist: (1) Without explicit regulation of the embedding
distribution, semantically related instances can inadvertently be pushed apart
unless complementary signals guide pair selection, and (2) excessive reliance
on large in-batch negatives and tailored augmentations hinders generalization.
To address these limitations, we propose Variational Supervised Contrastive
Learning (VarCon), which reformulates supervised contrastive learning as
variational inference over latent class variables and maximizes a
posterior-weighted evidence lower bound (ELBO) that replaces exhaustive
pair-wise comparisons for efficient class-aware matching and grants
fine-grained control over intra-class dispersion in the embedding space.
Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100,
ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art
performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy
on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while
converging in just 200 epochs; (2) yields substantially clearer decision
boundaries and semantic organization in the embedding space, as evidenced by
KNN classification, hierarchical clustering results, and transfer-learning
assessments; and (3) demonstrates better few-shot learning performance than
the supervised baseline and superior robustness across various augmentation
strategies.
[LINK]
http://arxiv.org/abs/2506.07413v2
[DATE]
2025-06-26 20:27:25+08:00
[CATEGORIES]
cs.LG
Moderating the Generalization of Score-based Generative Model
[AUTHORS]
Wan Jiang, He Wang, Xin Zhang, Dan Guo, Zhaoxin Fan, Yunfeng Diao, Richang Hong
[ABSTRACT]
Score-based Generative Models (SGMs) have demonstrated remarkable
generalization abilities, e.g. generating unseen, but natural data. However,
the greater the generalization power, the more likely the unintended
generalization, and the more dangerous the abuse. Research on moderated
generalization in SGMs remains limited. To fill this gap, we first examine the
current ‘gold standard’ in Machine Unlearning (MU), i.e., re-training the model
after removing the undesirable training data, and find it does not work in
SGMs. Further analysis of score functions reveals that the MU ‘gold standard’
does not alter the original score function, which explains its ineffectiveness.
Based on this insight, we propose the first Moderated Score-based Generative
Model (MSGM), which introduces a novel score adjustment strategy that redirects
the score function away from undesirable data during the continuous-time
stochastic differential equation process. Extensive experimental results
demonstrate that MSGM significantly reduces the likelihood of generating
undesirable content while preserving high visual quality for normal image
generation. Although designed for SGMs, MSGM is a general and flexible MU
framework that is compatible with diverse diffusion architectures (SGM and
DDPM) and training strategies (re-training and fine-tuning), and enables
zero-shot transfer of the pre-trained models to downstream tasks, e.g. image
inpainting and reconstruction. The code will be shared upon acceptance.
[LINK]
http://arxiv.org/abs/2412.07229v2
[DATE]
2025-06-26 20:06:00+08:00
[CATEGORIES]
cs.LG
Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning
[AUTHORS]
Haibo Qiu, Xiaohan Lan, Fanfan Liu, Xiaohu Sun, Delian Ruan, Peng Shi, Lin Ma
[ABSTRACT]
Recent advancements in large language models (LLMs) have witnessed a surge in
the development of advanced reasoning paradigms, which are now being integrated
into multimodal large language models (MLLMs). However, existing approaches
often fall short: methods solely employing reinforcement learning (RL) can
struggle with sample inefficiency and activating entirely absent reasoning
capabilities, while conventional pipelines that initiate with a cold-start
supervised fine-tuning (SFT) phase before RL may restrict the model’s
exploratory capacity and face suboptimal convergence. In this work, we
introduce Metis-RISE (RL Incentivizes and
SFT Enhances) for multimodal reasoning model learning. Unlike
conventional approaches, Metis-RISE distinctively omits an initial SFT stage,
beginning instead with an RL phase (e.g., using a Group Relative Policy
Optimization variant) to incentivize and activate the model’s latent reasoning
capacity. Subsequently, the targeted SFT stage addresses two key challenges
identified during RL: (1) inefficient trajectory sampling for tasks
where the model possesses but inconsistently applies correct reasoning, which
we tackle using self-distilled reasoning trajectories from the RL model itself;
and (2) fundamental capability absence, which we address by injecting
expert-augmented knowledge for prompts where the model entirely fails. This
strategic application of RL for incentivization followed by SFT for enhancement
forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B
parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard
demonstrate that both models achieve state-of-the-art performance among
similar-sized models, with the 72B version ranking fourth overall. Please refer
to our project page for open-source information.
[COMMENTS]
Project Page: https://github.com/MM-Thinking/Metis-RISE
[LINK]
http://arxiv.org/abs/2506.13056v2
[DATE]
2025-06-26 19:45:11+08:00
[CATEGORIES]
cs.LG
Self-Regulated Neurogenesis for Online Data-Incremental Learning
[AUTHORS]
Murat Onur Yildirim, Elif Ceren Gok Yildirim, Decebal Constantin Mocanu, Joaquin Vanschoren
[ABSTRACT]
Neural networks often struggle with catastrophic forgetting when learning
sequences of tasks or data streams, unlike humans who can continuously learn
and consolidate new concepts even in the absence of explicit cues. Online
data-incremental learning seeks to emulate this capability by processing each
sample only once, without having access to task or stream cues at any point in
time since this is more realistic compared to offline setups, where all data
from novel class(es) is assumed to be readily available. However, existing
methods typically rely on storing the subsets of data in memory or expanding
the initial model architecture, resulting in significant computational
overhead. Drawing inspiration from ‘self-regulated neurogenesis’, the brain’s
mechanism for creating specialized regions or circuits for distinct
functions, we propose a novel approach, SERENA, which encodes each concept in a
specialized network path called a ‘concept cell’, integrated into a single
over-parameterized network. Once a concept is learned, its corresponding
concept cell is frozen, effectively preventing the forgetting of previously
acquired information. Furthermore, we introduce two new continual learning
scenarios that more closely reflect real-world conditions, characterized by
gradually changing sample sizes. Experimental results show that our method not
only establishes new state-of-the-art results across ten benchmarks but also
remarkably surpasses offline supervised batch learning performance. The code is
available at https://github.com/muratonuryildirim/serena.
[COMMENTS]
Published at Conference on Lifelong Learning Agents (CoLLAs) 2025
[LINK]
http://arxiv.org/abs/2403.14684v2
[DATE]
2025-06-26 19:35:57+08:00
[CATEGORIES]
cs.LG
Diverse Mini-Batch Selection in Reinforcement Learning for Efficient Chemical Exploration in de novo Drug Design
[AUTHORS]
Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, Morteza Haghir Chehreghani
[ABSTRACT]
In many real-world applications, evaluating the goodness of instances is
often costly and time-consuming, e.g., human feedback and physics simulations,
in contrast to proposing new instances. In particular, this is even more
critical in reinforcement learning, as new interactions with the environment
(i.e., new instances) need to be evaluated to provide a reward signal to learn
from. As sufficient exploration is crucial, learning from a diverse mini-batch
can have a large impact and help mitigate mode collapse. In this paper, we
introduce diverse mini-batch selection for reinforcement learning and propose
to use determinantal point processes for this task. We study this framework in
the context of a real-world problem, namely drug discovery. We experimentally
study how our proposed framework can improve the effectiveness of chemical
exploration in de novo drug design, where finding diverse and high-quality
solutions is essential. We conduct a comprehensive evaluation with three
well-established molecular generation oracles over numerous generative steps.
Our experiments show that our diverse mini-batch selection framework can
substantially improve the diversity of the solutions, while still obtaining
solutions of high quality. In drug discovery, such an outcome can potentially lead
to fulfilling unmet medication needs faster.
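To make the idea of selecting a diverse mini-batch with a determinantal point process more concrete, here is a minimal sketch that greedily picks the subset whose kernel submatrix has the largest determinant. This is a common greedy approximation, not necessarily the selection procedure used in the paper; the helper names and the toy 3x3 kernel (two near-duplicate candidates and one distinct candidate) are assumptions.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det

def greedy_dpp_select(kernel, batch_size):
    """Greedily grow the subset that maximizes det(kernel[subset, subset])."""
    selected = []
    for _ in range(batch_size):
        best_index, best_det = None, -1.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best_index, best_det = i, d
        selected.append(best_index)
    return selected

# Toy kernel: diagonal entries encode quality, off-diagonals encode similarity.
# Candidates 0 and 1 are near-duplicates; candidate 2 is different from both.
kernel = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 0.9],
]
batch = greedy_dpp_select(kernel, 2)
```

Here the determinant penalizes similar pairs, so the slightly lower-quality but dissimilar candidate 2 is preferred over the near-duplicate candidate 1.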
[LINK]
http://arxiv.org/abs/2506.21158v1
[DATE]
2025-06-26 19:31:30+08:00
[CATEGORIES]
cs.LG
Transformer-Based Spatial-Temporal Counterfactual Outcomes Estimation
[AUTHORS]
He Li, Haoang Chi, Mingyu Liu, Wanrong Huang, Liyang Xu, Wenjing Yang
[ABSTRACT]
The real world naturally has dimensions of time and space. Therefore,
estimating the counterfactual outcomes with spatial-temporal attributes is a
crucial problem. However, previous methods are based on classical statistical
models, which still have limitations in performance and generalization. This
paper proposes a novel framework for estimating counterfactual outcomes with
spatial-temporal attributes using the Transformer, exhibiting stronger
estimation ability. Under mild assumptions, the proposed estimator within this
framework is consistent and asymptotically normal. To validate the
effectiveness of our approach, we conduct simulation experiments and real data
experiments. Simulation experiments show that our estimator has a stronger
estimation capability than baseline methods. Real data experiments yield a
valuable conclusion about the causal effect of conflicts on forest loss in
Colombia. The source code is available at
https://github.com/lihe-maxsize/DeppSTCI_Release_Version-master.
[COMMENTS]
24 pages, accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2506.21154v1
[DATE]
2025-06-26 19:24:46+08:00
[CATEGORIES]
cs.LG
A Novel Federated Learning-Based IDS for Enhancing UAVs Privacy and Security
[AUTHORS]
Ozlem Ceviz, Pinar Sadioglu, Sevil Sen, Vassilios G. Vassilakis
[ABSTRACT]
Unmanned aerial vehicles (UAVs) operating within Flying Ad-hoc Networks
(FANETs) encounter security challenges due to the dynamic and distributed
nature of these networks. Previous studies focused predominantly on centralized
intrusion detection, assuming a central entity responsible for storing and
analyzing data from all devices. However, these approaches face challenges
including computation and storage costs, along with a single point of failure
risk, threatening data privacy and availability. The widespread dispersion of
data across interconnected devices underscores the need for decentralized
approaches. This paper introduces the Federated Learning-based Intrusion
Detection System (FL-IDS), addressing challenges encountered by centralized
systems in FANETs. FL-IDS reduces computation and storage costs for both
clients and the central server, which is crucial for resource-constrained UAVs.
Operating in a decentralized manner, FL-IDS enables UAVs to collaboratively
train a global intrusion detection model without sharing raw data, thus
avoiding delay in decisions based on collected data, as is often the case with
traditional methods. Experimental results demonstrate FL-IDS’s competitive
performance with Central IDS (C-IDS) while mitigating privacy concerns, with
the Bias Towards Specific Clients (BTSC) method further enhancing FL-IDS
performance even at lower attacker ratios. Comparative analysis with
traditional intrusion detection methods, including Local IDS (L-IDS), sheds
light on the strengths of FL-IDS. This study significantly contributes to UAV
security by introducing a privacy-aware, decentralized intrusion detection
approach tailored to UAV networks. Moreover, by introducing a realistic dataset
for FANETs and federated learning, our approach differs from others lacking
high dynamism and 3D node movements or accurate federated data federations.
[COMMENTS]
Published in Internet of Things, Volume 25, 2025, Article 101592
[LINK]
http://arxiv.org/abs/2312.04135v3
[DATE]
2025-06-26 19:21:32+08:00
[CATEGORIES]
cs.LG
Generative Adversarial Evasion and Out-of-Distribution Detection for UAV Cyber-Attacks
[AUTHORS]
Deepak Kumar Panda, Weisi Guo
[ABSTRACT]
The growing integration of UAVs into civilian airspace underscores the need
for resilient and intelligent intrusion detection systems (IDS), as traditional
anomaly detection methods often fail to identify novel threats. A common
approach treats unfamiliar attacks as out-of-distribution (OOD) samples;
however, this leaves systems vulnerable when mitigation is inadequate.
Moreover, conventional OOD detectors struggle to distinguish stealthy
adversarial attacks from genuine OOD events. This paper introduces a
conditional generative adversarial network (cGAN)-based framework for crafting
stealthy adversarial attacks that evade IDS mechanisms. We first design a
robust multi-class IDS classifier trained on benign UAV telemetry and known
cyber-attacks, including Denial of Service (DoS), false data injection (FDI),
man-in-the-middle (MiTM), and replay attacks. Using this classifier, our cGAN
perturbs known attacks to generate adversarial samples that are misclassified as
benign while retaining statistical resemblance to OOD distributions. These
adversarial samples are iteratively refined to achieve high stealth and success
rates. To detect such perturbations, we implement a conditional variational
autoencoder (CVAE), leveraging negative log-likelihood to separate adversarial
inputs from authentic OOD samples. Comparative evaluation shows that CVAE-based
regret scores significantly outperform traditional Mahalanobis distance-based
detectors in identifying stealthy adversarial threats. Our findings emphasize
the importance of advanced probabilistic modeling to strengthen IDS
capabilities against adaptive, generative-model-based cyber intrusions.
[LINK]
http://arxiv.org/abs/2506.21142v1
[DATE]
2025-06-26 18:56:34+08:00
[CATEGORIES]
cs.LG
Multi-convex Programming for Discrete Latent Factor Models Prototyping
[AUTHORS]
Hao Zhu, Shengchao Yan, Jasper Hoffmann, Joschka Boedecker
[ABSTRACT]
Discrete latent factor models (DLFMs) are widely used in various domains such
as machine learning, economics, neuroscience, psychology, etc. Currently,
fitting a DLFM to some dataset relies on a customized solver for individual
models, which requires lots of effort to implement and is limited to the
targeted specific instance of DLFMs. In this paper, we propose a generic
framework based on CVXPY, which allows users to specify and solve the fitting
problem of a wide range of DLFMs, including both regression and classification
models, within a very short script. Our framework is flexible and inherently
supports the integration of regularization terms and constraints on the DLFM
parameters and latent factors, such that the users can easily prototype the
DLFM structure according to their dataset and application scenario. We
introduce our open-source Python implementation and illustrate the framework in
several examples.
[LINK]
http://arxiv.org/abs/2504.01431v2
[DATE]
2025-06-26 18:53:38+08:00
[CATEGORIES]
cs.LG
DBConformer: Dual-Branch Convolutional Transformer for EEG Decoding
[AUTHORS]
Ziwei Wang, Hongbin Wang, Tianwang Jia, Xingyi He, Siyang Li, Dongrui Wu
[ABSTRACT]
Electroencephalography (EEG)-based brain-computer interfaces (BCIs) transform
spontaneous/evoked neural activity into control commands for external
communication. While convolutional neural networks (CNNs) remain the mainstream
backbone for EEG decoding, their inherently short receptive field makes it
difficult to capture long-range temporal dependencies and global inter-channel
relationships. Recent CNN-Transformer hybrids (Conformers) partially address
this issue, but most adopt a serial design, resulting in suboptimal integration
of local and global features, and often overlook explicit channel-wise
modeling. To address these limitations, we propose DBConformer, a dual-branch
convolutional Transformer network tailored for EEG decoding. It integrates a
temporal Conformer to model long-range temporal dependencies and a spatial
Conformer to extract inter-channel interactions, capturing both temporal
dynamics and spatial patterns in EEG signals. A lightweight channel attention
module further refines spatial representations by assigning data-driven
importance to EEG channels. Extensive experiments on five motor imagery (MI)
datasets and two seizure detection datasets under three evaluation settings
demonstrate that DBConformer consistently outperforms 10 competitive baseline
models, with over eight times fewer parameters than the high-capacity EEG
Conformer baseline. Further, the visualization results confirm that the
features extracted by DBConformer are physiologically interpretable and aligned
with sensorimotor priors in MI. The superior performance and interpretability
of DBConformer make it reliable for robust and explainable EEG decoding. Code
is available at https://github.com/wzwvv/DBConformer.
[COMMENTS]
12 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.21140v1
[DATE]
2025-06-26 18:53:24+08:00
[CATEGORIES]
cs.LG
Solving Inverse Problem for Multi-armed Bandits via Convex Optimization
[AUTHORS]
Hao Zhu, Joschka Boedecker
[ABSTRACT]
We consider the inverse problem of multi-armed bandits (IMAB) that are widely
used in neuroscience and psychology research for behavior modelling. We first
show that the IMAB problem is not convex in general, but can be relaxed to a
convex problem via variable transformation. Based on this result, we propose a
two-step sequential heuristic for (approximately) solving the IMAB problem. We
discuss a condition under which our method provides a global solution to the IMAB
problem with a certificate, as well as approximations to further save computing
time. Numerical experiments indicate that our heuristic method is more robust
than directly solving the IMAB problem via repeated local optimization, and can
achieve the performance of Monte Carlo methods within a significantly decreased
running time. We provide the implementation of our method based on CVXPY, which
allows straightforward application by users not well versed in convex
optimization.
[LINK]
http://arxiv.org/abs/2501.18945v3
[DATE]
2025-06-26 18:49:32+08:00
[CATEGORIES]
cs.LG
NaLaFormer: Norm-Aware Linear Attention for Transformer Models
[AUTHORS]
Weikang Meng, Yadan Luo, Liangyu Huo, Yaowei Wang, Xin Li, Zheng Zhang
[ABSTRACT]
Linear attention has emerged as a viable alternative to softmax attention by
reducing complexity from quadratic to linear in sequence length. To preserve
two fundamental properties of softmax, non-negativity and entropy reduction,
current works employ various linearly separable kernel functions with $L1$
normalization instead of the softmax operator. However, query norms are neglected
by the normalization operation in linear attention, and this degradation leads
to a pronounced entropy gap. Meanwhile, existing works inhibit negative values of
query and key vectors, resulting in missing inner-product interactions after
being mapped. To address these dual challenges, we propose a novel Norm-Aware
Linear Attention mechanism serving to restore norm-guided dynamic spikiness and
recover kernel-perturbed norm distributions. Specifically, we first decouple
query and key matrices into two components: norm and direction, to achieve
norm-aware spikiness control and norm consistency, respectively. We
mathematically reveal that the extent of entropy reduction varies with the
query norm in softmax normalization, motivating a query-norm aware kernel
function for dynamic control over entropy reduction. Furthermore, to ensure
norm consistency and enforce non-negativity constraints, we employ a
norm-preserving mapping to project all elements of the angular matrix into
positive values, leveraging cosine similarity to inhibit dimensions with
opposite directions. We conduct extensive experiments demonstrating that the
NaLaFormer improves performance on vision and language tasks, enhancing both
expressiveness and efficiency by up to 4.2\%.
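
The query-norm effect described above can be illustrated with a small NumPy sketch. This is not the NaLaFormer kernel: the ReLU feature map, the toy keys `K`, and the query `q` are assumptions chosen only for illustration. It shows that scaling the query sharpens softmax attention (lower entropy), while L1-normalized linear attention with a positively homogeneous feature map is unchanged by the query norm.

```python
import numpy as np

def softmax_weights(q, K):
    """Softmax attention weights of one query over all keys."""
    scores = K @ q
    w = np.exp(scores - scores.max())  # subtract max for numerical stability
    return w / w.sum()

def linear_weights(q, K):
    """Linear attention weights: feature map, then L1 normalization."""
    phi = lambda x: np.maximum(x, 0.0)  # assumed ReLU kernel, not the paper's
    scores = phi(K) @ phi(q)
    return scores / scores.sum()

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy keys and query (hypothetical values).
K = np.array([[1.0, 0.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.5, -0.5, 0.2, 0.1],
              [-1.0, 0.3, 0.4, 0.9]])
q = np.array([1.0, 0.5, -0.3, 0.8])

for scale in (1.0, 4.0):
    h_soft = entropy(softmax_weights(scale * q, K))
    h_lin = entropy(linear_weights(scale * q, K))
    print(f"query scale {scale}: softmax entropy {h_soft:.4f}, linear entropy {h_lin:.4f}")
```

The softmax entropy drops as the query norm grows, whereas the linear-attention entropy stays fixed because the scale factor cancels in the normalization.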
[LINK]
http://arxiv.org/abs/2506.21137v1
[DATE]
2025-06-26 18:47:39+08:00
[CATEGORIES]
cs.LG
Inverse Reinforcement Learning via Convex Optimization
[AUTHORS]
Hao Zhu, Yuan Zhang, Joschka Boedecker
[ABSTRACT]
We consider the inverse reinforcement learning (IRL) problem, where an
unknown reward function of some Markov decision process is estimated based on
observed expert demonstrations. In most existing approaches, IRL is formulated
and solved as a nonconvex optimization problem, posing challenges in scenarios
where robustness and reproducibility are critical. We discuss a convex
formulation of the IRL problem (CIRL) initially proposed by Ng and Russell, and
reformulate the problem such that the domain-specific language CVXPY can be
applied directly to specify and solve the convex problem. We also extend the
CIRL problem to scenarios where the expert policy is not given analytically but
by trajectory as state-action pairs, which can be strongly inconsistent with
optimality, by augmenting some of the constraints. Theoretical analysis and
practical implementation for hyperparameter auto-selection are introduced. This
note helps users apply CIRL to their problems without
background knowledge of convex optimization.
[LINK]
http://arxiv.org/abs/2501.15957v2
[DATE]
2025-06-26 18:46:25+08:00
[CATEGORIES]
cs.LG
Curriculum-Guided Antifragile Reinforcement Learning for Secure UAV Deconfliction under Observation-Space Attacks
[AUTHORS]
Deepak Kumar Panda, Adolfo Perrusquia, Weisi Guo
[ABSTRACT]
Reinforcement learning (RL) policies deployed in safety-critical systems,
such as unmanned aerial vehicle (UAV) navigation in dynamic airspace, are
vulnerable to out-of-distribution (OOD) adversarial attacks in the observation
space. These attacks induce distributional shifts that significantly degrade
value estimation, leading to unsafe or suboptimal decision making and rendering the
existing policy fragile. To address this vulnerability, we propose an
antifragile RL framework designed to adapt against a curriculum of incremental
adversarial perturbations. The framework introduces a simulated attacker that
incrementally increases the strength of observation-space perturbations, which
enables the RL agent to adapt and generalize across a wider range of OOD
observations and anticipate previously unseen attacks. We begin with a
theoretical characterization of fragility, formally defining catastrophic
forgetting as a monotonic divergence in value function distributions with
increasing perturbation strength. Building on this, we define antifragility as
the boundedness of such value shifts and derive adaptation conditions under
which forgetting is stabilized. Our method enforces these bounds through
iterative expert-guided critic alignment using Wasserstein distance
minimization across incrementally perturbed observations. We empirically
evaluate the approach in a UAV deconfliction scenario involving dynamic 3D
obstacles. Results show that the antifragile policy consistently outperforms
standard and robust RL baselines when subjected to both projected gradient
descent (PGD) and GPS spoofing attacks, achieving up to 15% higher cumulative
reward and over 30% fewer conflict events. These findings demonstrate the
practical and theoretical viability of antifragile reinforcement learning for
secure and resilient decision-making in environments with evolving threat
scenarios.
[LINK]
http://arxiv.org/abs/2506.21129v1
[DATE]
2025-06-26 18:10:41+08:00
[CATEGORIES]
cs.LG
Robust Policy Switching for Antifragile Reinforcement Learning for UAV Deconfliction in Adversarial Environments
[AUTHORS]
Deepak Kumar Panda, Weisi Guo
[ABSTRACT]
The increasing automation of navigation for unmanned aerial vehicles (UAVs)
has exposed them to adversarial attacks that exploit vulnerabilities in
reinforcement learning (RL) through sensor manipulation. Although existing
robust RL methods aim to mitigate such threats, they generalize poorly
to out-of-distribution shifts from the optimal value
distribution, as they are primarily designed to handle fixed perturbations. To
address this limitation, this paper introduces an antifragile RL framework that
enhances adaptability to broader distributional shifts by incorporating a
switching mechanism based on discounted Thompson sampling (DTS). This mechanism
dynamically selects among multiple robust policies to minimize adversarially
induced state-action-value distribution shifts. The proposed approach first
derives a diverse ensemble of action robust policies by accounting for a range
of perturbations in the policy space. These policies are then modeled as a
multiarmed bandit (MAB) problem, where DTS optimally selects policies in
response to nonstationary Bernoulli rewards, effectively adapting to evolving
adversarial strategies. We also provide a theoretical framework showing that
optimizing DTS to minimize the overall regret due to distributional shift
results in effective adaptation against unseen adversarial attacks, thus
inducing antifragility. Extensive numerical simulations validate the
effectiveness of the proposed framework in complex navigation environments with
multiple dynamic three-dimensional obstacles and with stronger projected
gradient descent (PGD) and spoofing attacks. Compared to conventional robust,
non-adaptive RL methods, the antifragile approach achieves superior
performance, demonstrating shorter navigation path lengths and a higher rate of
conflict-free navigation trajectories.
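
The switching mechanism can be sketched as a discounted Thompson sampler over Bernoulli rewards. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name, the discount `gamma`, the symmetric Beta prior, and the convention of discounting every arm's counts before adding the new observation are all choices made for this sketch.

```python
import numpy as np

class DiscountedThompsonSampling:
    """Each arm stands for one candidate robust policy (hypothetical setup)."""

    def __init__(self, n_arms, gamma=0.95, prior=1.0, seed=0):
        self.gamma = gamma                  # discount factor in (0, 1]
        self.prior = prior                  # Beta(prior, prior) prior per arm
        self.successes = np.zeros(n_arms)   # discounted success counts
        self.failures = np.zeros(n_arms)    # discounted failure counts
        self.rng = np.random.default_rng(seed)

    def select(self):
        # Sample a success probability per arm from its Beta posterior
        # and play the arm with the largest sample.
        samples = self.rng.beta(self.successes + self.prior,
                                self.failures + self.prior)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        # Discount all past evidence so old rewards fade, then record
        # the new Bernoulli outcome (reward in {0, 1}) for the played arm.
        self.successes *= self.gamma
        self.failures *= self.gamma
        self.successes[arm] += reward
        self.failures[arm] += 1 - reward

bandit = DiscountedThompsonSampling(n_arms=3, gamma=0.5)
bandit.update(arm=0, reward=1)
bandit.update(arm=1, reward=0)
chosen = bandit.select()
```

Discounting keeps the posterior responsive when the reward probabilities change, which is the nonstationary setting the abstract describes.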
[LINK]
http://arxiv.org/abs/2506.21127v1
[DATE]
2025-06-26 18:06:29+08:00
[CATEGORIES]
cs.LG
Pushing Trade-Off Boundaries: Compact yet Effective Remote Sensing Change Detection
[AUTHORS]
Luosheng Xu, Dalin Zhang, Zhaohui Song
[ABSTRACT]
Remote sensing change detection is essential for monitoring urban expansion,
disaster assessment, and resource management, offering timely, accurate, and
large-scale insights into dynamic landscape transformations. While deep
learning has revolutionized change detection, the increasing complexity and
computational demands of modern models have not necessarily translated into
significant accuracy gains. Instead of following this trend, this study
explores a more efficient approach, focusing on lightweight models that
maintain high accuracy while minimizing resource consumption, which is an
essential requirement for on-satellite processing. To this end, we propose
FlickCD, which means quick flick then get great results, pushing the boundaries
of the performance-resource trade-off. FlickCD introduces an Enhanced
Difference Module (EDM) to amplify critical feature differences between
temporal phases while suppressing irrelevant variations such as lighting and
weather changes, thereby reducing computational costs in the subsequent change
decoder. Additionally, the FlickCD decoder incorporates Local-Global Fusion
Blocks, leveraging Shifted Window Self-Attention (SWSA) and Enhanced Global
Self-Attention (EGSA) to efficiently capture semantic information at multiple
scales, preserving both coarse- and fine-grained changes. Extensive experiments
on four benchmark datasets demonstrate that FlickCD reduces computational and
storage overheads by more than an order of magnitude while achieving
state-of-the-art (SOTA) performance or incurring only a minor (<1\% F1)
accuracy trade-off. The implementation code is publicly available at
https://github.com/xulsh8/FlickCD.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2506.21109v1
[DATE]
2025-06-26 17:06:52+08:00
[CATEGORIES]
cs.LG
Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges
[AUTHORS]
Changxi Chi, Jun Xia, Yufei Huang, Jingbo Zhou, Siyuan Li, Yunfan Liu, Chang Yu, Stan Z. Li
[ABSTRACT]
Estimating single-cell responses across various perturbations facilitates the
identification of key genes and enhances drug screening, significantly boosting
experimental efficiency. However, single-cell sequencing is a destructive
process, making it impossible to capture the same cell’s phenotype before and
after perturbation. Consequently, data collected under perturbed and
unperturbed conditions are inherently unpaired. Existing methods either attempt
to forcibly pair unpaired data using random sampling, or neglect the inherent
relationship between unperturbed and perturbed cells during the modeling. In
this work, we propose a framework based on Dual Diffusion Implicit Bridges
(DDIB) to learn the mapping between different data distributions, effectively
addressing the challenge of unpaired data. We further interpret this framework
as a form of data augmentation. We integrate gene regulatory network (GRN)
information to propagate perturbation signals in a biologically meaningful way,
and further incorporate a masking mechanism to predict silent genes, improving
the quality of generated profiles. Moreover, gene expression under the same
perturbation often varies significantly across cells, frequently exhibiting a
bimodal distribution that reflects intrinsic heterogeneity. To capture this, we
introduce a more suitable evaluation metric. We propose Unlasting, dual
conditional diffusion models that overcome the problem of unpaired single-cell
perturbation data and strengthen the model’s insight into perturbations under
the guidance of the GRN, with a dedicated mask model designed to improve
generation quality by predicting silent genes. In addition, we introduce a
biologically grounded evaluation metric that better reflects the inherent
heterogeneity in single-cell responses.
[LINK]
http://arxiv.org/abs/2506.21107v1
[DATE]
2025-06-26 17:05:38+08:00
[CATEGORIES]
cs.LG
Chain-of-Thought Enhanced Shallow Transformers for Wireless Symbol Detection
[AUTHORS]
Li Fan, Peng Wang, Jing Yang, Cong Shen
[ABSTRACT]
Transformers have shown potential in solving wireless communication problems,
particularly via in-context learning (ICL), where models adapt to new tasks
through prompts without requiring model updates. However, prior ICL-based
Transformer models rely on deep architectures with many layers to achieve
satisfactory performance, resulting in substantial storage and computational
costs. In this work, we propose CHain Of thOught Symbol dEtection (CHOOSE), a
CoT-enhanced shallow Transformer framework for wireless symbol detection. By
introducing autoregressive latent reasoning steps within the hidden space,
CHOOSE significantly improves the reasoning capacity of shallow models (1-2
layers) without increasing model depth. This design enables lightweight
Transformers to achieve detection performance comparable to much deeper models,
making them well-suited for deployment on resource-constrained mobile devices.
Experimental results demonstrate that our approach outperforms conventional
shallow Transformers and achieves performance comparable to that of deep
Transformers, while maintaining storage and computational efficiency. This
represents a promising direction for implementing Transformer-based algorithms
in wireless receivers with limited computational resources.
[LINK]
http://arxiv.org/abs/2506.21093v1
[DATE]
2025-06-26 16:41:45+08:00
[CATEGORIES]
cs.LG
CovDocker: Benchmarking Covalent Drug Design with Tasks, Datasets, and Solutions
[AUTHORS]
Yangzhe Peng, Kaiyuan Gao, Liang He, Yuheng Cong, Haiguang Liu, Kun He, Lijun Wu
[ABSTRACT]
Molecular docking plays a crucial role in predicting the binding mode of
ligands to target proteins, and covalent interactions, which involve the
formation of a covalent bond between the ligand and the target, are
particularly valuable due to their strong, enduring binding nature. However,
most existing docking methods and deep learning approaches hardly account for
the formation of covalent bonds and the associated structural changes. To
address this gap, we introduce a comprehensive benchmark for covalent docking,
CovDocker, which is designed to better capture the complexities of covalent
binding. We decompose the covalent docking process into three main tasks:
reactive location prediction, covalent reaction prediction, and covalent
docking. By adapting state-of-the-art models, such as Uni-Mol and Chemformer,
we establish baseline performances and demonstrate the effectiveness of the
benchmark in accurately predicting interaction sites and modeling the molecular
transformations involved in covalent binding. These results confirm the role of
the benchmark as a rigorous framework for advancing research in covalent drug
design. It underscores the potential of data-driven approaches to accelerate
the discovery of selective covalent inhibitors and addresses critical
challenges in therapeutic development.
[COMMENTS]
Accepted to KDD 2025 Research Track
[LINK]
http://arxiv.org/abs/2506.21085v1
[DATE]
2025-06-26 16:28:07+08:00
[CATEGORIES]
cs.LG
EgoAdapt: Adaptive Multisensory Distillation and Policy Learning for Efficient Egocentric Perception
[AUTHORS]
Sanjoy Chowdhury, Subrata Biswas, Sayan Nag, Tushar Nagarajan, Calvin Murdock, Ishwarya Ananthabhotla, Yijun Qian, Vamsi Krishna Ithapu, Dinesh Manocha, Ruohan Gao
[ABSTRACT]
Modern perception models, particularly those designed for multisensory
egocentric tasks, have achieved remarkable performance but often come with
substantial computational costs. These high demands pose challenges for
real-world deployment, especially in resource-constrained environments. In this
paper, we introduce EgoAdapt, a framework that adaptively performs cross-modal
distillation and policy learning to enable efficient inference across different
egocentric perception tasks, including egocentric action recognition, active
speaker localization, and behavior anticipation. Our proposed policy module is
adaptable to task-specific action spaces, making it broadly applicable.
Experimental results on three challenging egocentric datasets (EPIC-Kitchens,
EasyCom, and Aria Everyday Activities) demonstrate that our method significantly
enhances efficiency, reducing GMACs by up to 89.09%, parameters by up to 82.02%,
and energy by up to 9.6x, while remaining on par with, and in many cases outperforming, the
corresponding state-of-the-art models.
[COMMENTS]
Accepted at ICCV 2025
[LINK]
http://arxiv.org/abs/2506.21080v1
[DATE]
2025-06-26 16:09:16+08:00
[CATEGORIES]
cs.LG
Homogenization of Multi-agent Learning Dynamics in Finite-state Markov Games
[AUTHORS]
Yann Kerzreho
[ABSTRACT]
This paper introduces a new approach for approximating the learning dynamics
of multiple reinforcement learning (RL) agents interacting in a finite-state
Markov game. The idea is to rescale the learning process by simultaneously
reducing the learning rate and increasing the update frequency, effectively
treating the agent’s parameters as a slow-evolving variable influenced by the
fast-mixing game state. Under mild assumptions (ergodicity of the state process
and continuity of the updates), we prove the convergence of this rescaled process
to an ordinary differential equation (ODE). This ODE provides a tractable,
deterministic approximation of the agent’s learning dynamics. An implementation
of the framework is available at:
https://github.com/yannKerzreho/MarkovGameApproximation
[LINK]
http://arxiv.org/abs/2506.21079v1
[DATE]
2025-06-26 16:08:49+08:00
[CATEGORIES]
cs.LG
SDE Matching: Scalable and Simulation-Free Training of Latent Stochastic Differential Equations
[AUTHORS]
Grigory Bartosh, Dmitry Vetrov, Christian A. Naesseth
[ABSTRACT]
The Latent Stochastic Differential Equation (SDE) is a powerful tool for time
series and sequence modeling. However, training Latent SDEs typically relies on
adjoint sensitivity methods, which depend on simulation and backpropagation
through approximate SDE solutions and therefore limit scalability. In this work, we
propose SDE Matching, a new simulation-free method for training Latent SDEs.
Inspired by modern Score- and Flow Matching algorithms for learning generative
dynamics, we extend these ideas to the domain of stochastic dynamics for time
series and sequence modeling, eliminating the need for costly numerical
simulations. Our results demonstrate that SDE Matching achieves performance
comparable to adjoint sensitivity methods while drastically reducing
computational complexity.
[LINK]
http://arxiv.org/abs/2502.02472v3
[DATE]
2025-06-26 15:38:35+08:00
[CATEGORIES]
cs.LG
FedDAA: Dynamic Client Clustering for Concept Drift Adaptation in Federated Learning
[AUTHORS]
Fu Peng, Ming Tang
[ABSTRACT]
In federated learning (FL), the data distribution of each client may change
over time, introducing both temporal and spatial data heterogeneity, known as
concept drift. Data heterogeneity arises from three drift sources: real drift
(a shift in the conditional distribution P(y|x)), virtual drift (a shift in the
input distribution P(x)), and label drift (a shift in the label distribution
P(y)). However, most existing FL methods addressing concept drift primarily
focus on real drift. When clients experience virtual or label drift, these
methods often fail to selectively retain useful historical knowledge, leading
to catastrophic forgetting. A key challenge lies in distinguishing different
sources of drift, as they require distinct adaptation strategies: real drift
calls for discarding outdated data, while virtual or label drift benefits from
retaining historical data. Without explicitly identifying the drift sources, a
general adaptation strategy is suboptimal and may harm generalization. To
address this challenge, we propose FedDAA, a dynamic clustered FL framework
designed to adapt to multi-source concept drift while preserving valuable
historical knowledge. Specifically, FedDAA integrates three modules: a cluster
number determination module to find the optimal number of clusters; a real
drift detection module to distinguish real drift from virtual/label drift; and
a concept drift adaptation module to adapt to new data while retaining useful
historical information. We provide theoretical convergence guarantees, and
experiments show that FedDAA achieves 7.84% to 8.52% accuracy improvements over
state-of-the-art methods on Fashion-MNIST, CIFAR-10, and CIFAR-100.
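
The three drift sources can be made concrete on a toy discrete joint distribution. The sketch below is not FedDAA's drift detection module; the function names, the tabular `joint[x, y]` representation, and the tolerance are assumptions for illustration. It simply reports which of P(y|x), P(x), and P(y) differ between two distributions.

```python
import numpy as np

def factorize(joint):
    """Split a joint table joint[x, y] into P(x), P(y), and P(y|x)."""
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    py_given_x = joint / px[:, None]  # assumes every P(x) > 0
    return px, py, py_given_x

def drift_report(old, new, tol=1e-9):
    """Flag which distributions changed between two joint tables."""
    px0, py0, cond0 = factorize(old)
    px1, py1, cond1 = factorize(new)
    return {
        "real": not np.allclose(cond0, cond1, atol=tol),   # P(y|x) shifted
        "virtual": not np.allclose(px0, px1, atol=tol),    # P(x) shifted
        "label": not np.allclose(py0, py1, atol=tol),      # P(y) shifted
    }

old = np.array([[0.4, 0.1],
                [0.1, 0.4]])

# Same P(x) and P(y), but the conditional P(y|x) is flipped.
flipped = np.array([[0.1, 0.4],
                    [0.4, 0.1]])

# Same P(y|x), but the inputs are re-weighted toward x = 0.
reweighted = np.array([[0.64, 0.16],
                       [0.04, 0.16]])

real_only = drift_report(old, flipped)
input_shift = drift_report(old, reweighted)
```

In the second case the conditional is unchanged, so historical data remains useful, even though the re-weighted inputs also move the label marginal.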
[LINK]
http://arxiv.org/abs/2506.21054v1
[DATE]
2025-06-26 15:09:08+08:00
[CATEGORIES]
cs.LG
Sharp concentration of uniform generalization errors in binary linear classification
[AUTHORS]
Shogo Nakakita
[ABSTRACT]
We examine the concentration of uniform generalization errors around their
expectation in binary linear classification problems via an isoperimetric
argument. In particular, we establish Poincaré and log-Sobolev inequalities
for the joint distribution of the output labels and the label-weighted input
vectors, which we apply to derive concentration bounds. The derived
concentration bounds are sharp up to moderate multiplicative constants by those
under well-balanced labels. In asymptotic analysis, we also show that almost
sure convergence of uniform generalization errors to their expectation occurs
in very broad settings, such as proportionally high-dimensional regimes. Using
this convergence, we establish uniform laws of large numbers under
dimension-free conditions.
[COMMENTS]
26 pages, 1 figure; minor edits to improve readability
[LINK]
http://arxiv.org/abs/2505.16713v2
[DATE]
2025-06-26 14:57:11+08:00
[CATEGORIES]
cs.LG
Improving Diffusion-Based Image Editing Faithfulness via Guidance and Scheduling
[AUTHORS]
Hansam Cho, Seoung Bum Kim
[ABSTRACT]
Text-guided diffusion models have become essential for high-quality image
synthesis, enabling dynamic image editing. In image editing, two crucial
aspects are editability, which determines the extent of modification, and
faithfulness, which reflects how well unaltered elements are preserved.
However, achieving optimal results is challenging because of the inherent
trade-off between editability and faithfulness. To address this, we propose
Faithfulness Guidance and Scheduling (FGS), which enhances faithfulness with
minimal impact on editability. FGS incorporates faithfulness guidance to
strengthen the preservation of input image information and introduces a
scheduling strategy to resolve misalignment between editability and
faithfulness. Experimental results demonstrate that FGS achieves superior
faithfulness while maintaining editability. Moreover, its compatibility with
various editing methods enables precise, high-quality image edits across
diverse tasks.
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2506.21045v1
[DATE]
2025-06-26 14:46:03+08:00
[CATEGORIES]
cs.LG
Efficient Skill Discovery via Regret-Aware Optimization
[AUTHORS]
He Zhang, Ming Zhou, Shaopeng Zhai, Ying Sun, Hui Xiong
[ABSTRACT]
Unsupervised skill discovery aims to learn diverse and distinguishable
behaviors in open-ended reinforcement learning. Existing methods
focus on improving diversity through pure exploration, mutual information
optimization, and temporal representation learning. Although they perform
well at exploration, they remain limited in efficiency, especially in
high-dimensional settings. In this work, we frame skill discovery as a
min-max game of skill generation and policy learning, proposing a regret-aware
method on top of temporal representation learning that expands the discovered
skill space along the direction of upgradable policy strength. The key insight
behind the proposed method is that skill discovery is adversarial to
policy learning, i.e., skills with weak strength should be explored further,
while skills with converged strength need less exploration. As an
implementation, we score the degree of strength convergence with regret, and
guide the skill discovery with a learnable skill generator. To avoid
degeneration, skill generation comes from an upgradable population of skill
generators. We conduct experiments on environments with varying complexities
and dimension sizes. Empirical results show that our method outperforms
baselines in both efficiency and diversity. Moreover, our method achieves a 15%
zero-shot improvement in high-dimensional environments, compared to existing
methods.
[LINK]
http://arxiv.org/abs/2506.21044v1
[DATE]
2025-06-26 14:45:59+08:00
[CATEGORIES]
cs.LG
Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning
[AUTHORS]
Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han
[ABSTRACT]
Long-horizon goal-conditioned tasks pose fundamental challenges for
reinforcement learning (RL), particularly when goals are distant and rewards
are sparse. While hierarchical and graph-based methods offer partial solutions,
they often suffer from subgoal infeasibility and inefficient planning. We
introduce Strict Subgoal Execution (SSE), a graph-based hierarchical RL
framework that enforces single-step subgoal reachability by structurally
constraining high-level decision-making. To enhance exploration, SSE employs a
decoupled exploration policy that systematically traverses underexplored
regions of the goal space. Furthermore, a failure-aware path refinement
refines graph-based planning by dynamically adjusting edge costs according to
observed low-level success rates, thereby improving subgoal reliability.
Experimental results across diverse long-horizon benchmarks demonstrate that
SSE consistently outperforms existing goal-conditioned RL and hierarchical RL
approaches in both efficiency and success rate.
[COMMENTS]
9 technical pages followed by references and appendix
[LINK]
http://arxiv.org/abs/2506.21039v1
[DATE]
2025-06-26 14:35:42+08:00
[CATEGORIES]
cs.LG
RL-Selector: Reinforcement Learning-Guided Data Selection via Redundancy Assessment
[AUTHORS]
Suorong Yang, Peijia Li, Furao Shen, Jian Zhao
[ABSTRACT]
Modern deep architectures often rely on large-scale datasets, but training on
these datasets incurs high computational and storage overhead. Real-world
datasets often contain substantial redundancies, prompting the need for more
data-efficient training paradigms. Data selection has shown promise to mitigate
redundancy by identifying the most representative samples, thereby reducing
training costs without compromising performance. Existing methods typically
rely on static scoring metrics or pretrained models, overlooking the combined
effect of selected samples and their evolving dynamics during training. We
introduce the concept of epsilon-sample cover, which quantifies sample
redundancy based on inter-sample relationships, capturing the intrinsic
structure of the dataset. Based on this, we reformulate data selection as a
reinforcement learning (RL) process and propose RL-Selector, where a
lightweight RL agent optimizes the selection policy by leveraging
epsilon-sample cover derived from evolving dataset distribution as a reward
signal. Extensive experiments across benchmark datasets and diverse
architectures demonstrate that our method consistently outperforms existing
state-of-the-art baselines. Models trained with our selected datasets show
enhanced generalization performance with improved training efficiency.
[COMMENTS]
ICCV 2025
[LINK]
http://arxiv.org/abs/2506.21037v1
[DATE]
2025-06-26 14:28:56+08:00
[CATEGORIES]
cs.LG
An Information-Theoretic Analysis for Federated Learning under Concept Drift
[AUTHORS]
Fu Peng, Meng Zhang, Ming Tang
[ABSTRACT]
Recent studies in federated learning (FL) commonly train models on static
datasets. However, real-world data often arrives as streams with shifting
distributions, causing performance degradation known as concept drift. This
paper analyzes FL performance under concept drift using information theory and
proposes an algorithm to mitigate the performance degradation. We model concept
drift as a Markov chain and introduce the \emph{Stationary Generalization
Error} to assess a model’s capability to capture characteristics of future
unseen data. Its upper bound is derived using KL divergence and mutual
information. We study three drift patterns (periodic, gradual, and random) and
their impact on FL performance. Inspired by this, we propose an algorithm that
regularizes the empirical risk minimization approach with KL divergence and
mutual information, thereby enhancing long-term performance. We also explore
the performance-cost tradeoff by identifying a Pareto front. To validate our
approach, we build an FL testbed using Raspberry Pi 4 devices. Experimental
results corroborate the theoretical findings, confirming that drift patterns
significantly affect performance. Our method consistently outperforms existing
approaches for these three patterns, demonstrating its effectiveness in
adapting to concept drift in FL.
[LINK]
http://arxiv.org/abs/2506.21036v1
[DATE]
2025-06-26 14:25:15+08:00
[CATEGORIES]
cs.LG
Little By Little: Continual Learning via Self-Activated Sparse Mixture-of-Rank Adaptive Learning
[AUTHORS]
Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
[ABSTRACT]
Continual learning (CL) with large pre-trained models is challenged by
catastrophic forgetting and task interference. Existing LoRA-based
Mixture-of-Experts (MoE) approaches mitigate forgetting by assigning and
freezing task-specific adapters, but suffer from interference, redundancy, and
ambiguous routing due to coarse adapter-level selection. However, this design
introduces three key challenges: 1) Interference: Activating full LoRA experts
per input leads to subspace interference and prevents selective reuse of useful
components across tasks. 2) Redundancy: Newly added experts often duplicate or
contradict existing knowledge due to unnecessary activation of unrelated ranks
and insufficient reuse of relevant ones. 3) Ambiguity: Overlapping features
across tasks confuse the router, resulting in unstable expert assignments. As
more experts accumulate, earlier task routing degrades, accelerating
forgetting. We propose MoRA, a Mixture-of-Rank Adaptive learning approach with
self-activated and sparse rank activation for CL. Unlike mixing multiple
low-rank matrices, MoRA decomposes each rank-r update into r rank-1 components,
each treated as an independent expert, enabling fine-grained mixture of rank-1
expert utilization while mitigating interference and redundancy. To avoid
ambiguous routing, we propose that each rank-1 expert can infer its own
relevance via intermediate activations. Coupled with our proposed rank pruning
and activation budgets, MoRA adaptively selects a sparse mixture of ranks per
input. We validate MoRA on continual learning tasks with CLIP and large
language models (LLMs), analyzing both in-domain learning and out-of-domain
forgetting/generalization during fine-tuning. MoRA is significantly
effective at enhancing CL with pre-trained models (PTMs), improving generalization while
mitigating forgetting.
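
The decomposition of a rank-r update into r rank-1 components can be checked numerically. The sketch below is only an illustration of that identity, assuming the usual LoRA-style factorization of the update as `B @ A`; the shapes, the random factors, and the hand-set binary `gates` are hypothetical, and the paper's self-activated routing and rank pruning are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 3

# Assumed low-rank factors of the weight update: delta_W = B @ A.
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
full_update = B @ A

# Each column of B paired with the matching row of A is one rank-1 "expert".
rank1_experts = [np.outer(B[:, i], A[i, :]) for i in range(r)]

def sparse_update(gates):
    """Weighted mixture of rank-1 experts; a gate of 0 switches one off."""
    return sum(g * expert for g, expert in zip(gates, rank1_experts))

# Activating every expert recovers the full rank-r update.
dense = sparse_update([1.0, 1.0, 1.0])

# Activating experts 0 and 2 only gives a rank-2 update.
partial = sparse_update([1.0, 0.0, 1.0])
```

Selecting a subset of gates per input is what makes the mixture finer-grained than switching whole adapters on or off.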
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2506.21035v1
[DATE]
2025-06-26 14:19:05+08:00
[CATEGORIES]
cs.LG
PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling
[AUTHORS]
Yuxuan Yue, Zukang Xu, Zhihang Yuan, Dawei Yang, Jianlong Wu, Liqiang Nie
[ABSTRACT]
Large Language Models (LLMs) face significant challenges in edge deployment
due to their massive parameter scale. Vector Quantization (VQ), a
clustering-based quantization method, serves as a prevalent solution to this
issue owing to its extremely low bit-width (even 2-bit) and considerable accuracy.
Since a vector is a quantity in mathematics and physics that has both direction
and magnitude, existing VQ works typically quantize them in a coupled manner.
However, we find that direction exhibits significantly greater sensitivity to
quantization compared to the magnitude. For instance, when separately
clustering the directions and magnitudes of weight vectors in LLaMA-2-7B, the
accuracy drops on zero-shot tasks are 46.5\% and 2.3\%, respectively. This gap
increases further as the number of clustering centers is reduced. Further, Euclidean
distance, a common metric to assess vector similarities in current VQ works,
places greater emphasis on reducing the magnitude error. This property is
contrary to the above finding, unavoidably leading to larger quantization
errors. To this end, this paper proposes Polar Coordinate Decoupled Vector
Quantization (PCDVQ), an effective and efficient VQ framework consisting of two
key modules: 1) Polar Coordinate Decoupling (PCD), which transforms vectors
into their polar coordinate representations and performs independent
quantization of the direction and magnitude parameters; 2) Distribution Aligned
Codebook Construction (DACC), which optimizes the direction and magnitude
codebooks in accordance with the source distribution. Experimental results show
that PCDVQ outperforms baseline methods at 2-bit level by at least 1.5\%
zero-shot accuracy, establishing a novel paradigm for accurate and highly
compressed LLMs.
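
The idea of quantizing direction and magnitude independently can be sketched as follows. This is a simplified illustration, not the PCDVQ algorithm: it represents direction as a unit vector rather than polar angles, and the tiny hand-written codebooks and helper names are assumptions (the paper builds distribution-aligned codebooks instead).

```python
import numpy as np

def decouple(W):
    """Split each row vector into a unit direction and a scalar magnitude."""
    mag = np.linalg.norm(W, axis=1)
    return W / mag[:, None], mag  # assumes no all-zero rows

def quantize(W, dir_codebook, mag_codebook):
    """Pick the nearest direction (by cosine) and magnitude (by distance)."""
    direction, mag = decouple(W)
    dir_idx = np.argmax(direction @ dir_codebook.T, axis=1)
    mag_idx = np.argmin(np.abs(mag[:, None] - mag_codebook[None, :]), axis=1)
    return dir_idx, mag_idx

def dequantize(dir_idx, mag_idx, dir_codebook, mag_codebook):
    """Rebuild vectors from the two independent code indices."""
    return mag_codebook[mag_idx][:, None] * dir_codebook[dir_idx]

# Hypothetical codebooks: four axis-aligned unit directions, three magnitudes.
dir_codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
mag_codebook = np.array([0.5, 1.0, 2.0])

W = np.array([[1.9, 0.1],
              [0.05, -0.6]])

dir_idx, mag_idx = quantize(W, dir_codebook, mag_codebook)
W_hat = dequantize(dir_idx, mag_idx, dir_codebook, mag_codebook)
```

Because the two codebooks are separate, more codes can be spent on direction, which the abstract reports as the more quantization-sensitive component.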
[LINK]
http://arxiv.org/abs/2506.05432v2
[DATE]
2025-06-26 14:17:49+08:00
[CATEGORIES]
cs.LG
TRIDENT: Tri-Modal Molecular Representation Learning with Taxonomic Annotations and Local Correspondence
[AUTHORS]
Feng Jiang, Mangal Prakash, Hehuan Ma, Jianyuan Deng, Yuzhi Guo, Amina Mollaysa, Tommaso Mansi, Rui Liao, Junzhou Huang
[ABSTRACT]
Molecular property prediction aims to learn representations that map chemical
structures to functional properties. While multimodal learning has emerged as a
powerful paradigm to learn molecular representations, prior works have largely
overlooked textual and taxonomic information of molecules for representation
learning. We introduce TRIDENT, a novel framework that integrates molecular
SMILES, textual descriptions, and taxonomic functional annotations to learn
rich molecular representations. To achieve this, we curate a comprehensive
dataset of molecule-text pairs with structured, multi-level functional
annotations. Instead of relying on conventional contrastive loss, TRIDENT
employs a volume-based alignment objective to jointly align tri-modal features
at the global level, enabling soft, geometry-aware alignment across modalities.
Additionally, TRIDENT introduces a novel local alignment objective that
captures detailed relationships between molecular substructures and their
corresponding sub-textual descriptions. A momentum-based mechanism dynamically
balances global and local alignment, enabling the model to learn both broad
functional semantics and fine-grained structure-function mappings. TRIDENT
achieves state-of-the-art performance on 11 downstream tasks, demonstrating the
value of combining SMILES, textual, and taxonomic functional annotations for
molecular property prediction.
[LINK]
http://arxiv.org/abs/2506.21028v1
[DATE]
2025-06-26 14:09:47+08:00
[CATEGORIES]
cs.LG
HybridQ: Hybrid Classical-Quantum Generative Adversarial Network for Skin Disease Image Generation
[AUTHORS]
Qingyue Jiao, Kangyu Zheng, Yiyu Shi, Zhiding Liang
[ABSTRACT]
Machine learning-assisted diagnosis is gaining traction in skin disease
detection, but training effective models requires large amounts of high-quality
data. Skin disease datasets often suffer from class imbalance, privacy
concerns, and object bias, making data augmentation essential. While classical
generative models are widely used, they demand extensive computational
resources and lengthy training time. Quantum computing offers a promising
alternative, but existing quantum-based image generation methods can only yield
grayscale low-quality images. Through a novel classical-quantum latent space
fusion technique, our work overcomes this limitation and introduces the first
classical-quantum generative adversarial network (GAN) capable of generating
color medical images. Our model outperforms classical deep convolutional GANs
and existing hybrid classical-quantum GANs in both image generation quality and
classification performance boost when used as data augmentation. Moreover, the
performance boost is comparable with that achieved using state-of-the-art
classical generative models, yet with over 25 times fewer parameters and 10
times fewer training epochs. Such results suggest a promising future for
quantum image generation as quantum hardware advances. Finally, we demonstrate
the robust performance of our model on real IBM quantum machine with hardware
noise.
[LINK]
http://arxiv.org/abs/2506.21015v1
[DATE]
2025-06-26 13:14:45+08:00
[CATEGORIES]
cs.LG
Efficient Image Generation with Variadic Attention Heads
[AUTHORS]
Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi
[ABSTRACT]
While the integration of transformers in vision models has yielded
significant improvements on vision tasks, they still require significant amounts
of computation for both training and inference. Restricted attention mechanisms
significantly reduce these computational burdens but come at the cost of losing
either global or local coherence. We propose a simple, yet powerful method to
reduce these trade-offs: allow the attention heads of a single transformer to
attend to multiple receptive fields.
We demonstrate our method utilizing Neighborhood Attention (NA) and integrate
it into a StyleGAN based architecture for image generation. With this work,
dubbed StyleNAT, we are able to achieve a FID of 2.05 on FFHQ, a 6% improvement
over StyleGAN-XL, while utilizing 28% fewer parameters and with 4$\times$ the
throughput capacity. StyleNAT achieves the Pareto Frontier on FFHQ-256 and
demonstrates powerful and efficient image generation on other datasets. Our
code and model checkpoints are publicly available at:
https://github.com/SHI-Labs/StyleNAT
[COMMENTS]
Published in eLVM @ CVPR
(https://openaccess.thecvf.com/content/CVPR2025W/eLVM/html/Walton_Efficient_Image_Generation_with_Variadic_Attention_Heads_CVPRW_2025_paper)
| Formerly named StyleNAT: Giving Each Head a New Perspective |
[LINK]
http://arxiv.org/abs/2211.05770v3
[DATE]
2025-06-26 13:07:48+08:00
[CATEGORIES]
cs.LG
Proximal Point Method for Online Saddle Point Problem
[AUTHORS]
Qing-xin Meng, Jian-wei Liu
[ABSTRACT]
This paper focuses on the online saddle point problem, which involves a
sequence of two-player time-varying convex-concave games. Considering the
nonstationarity of the environment, we adopt the duality gap and the dynamic
Nash equilibrium regret as performance metrics for algorithm design. We present
three variants of the proximal point method: the Online Proximal Point Method
(OPPM), the Optimistic OPPM (OptOPPM), and the OptOPPM with multiple
predictors. Each algorithm guarantees upper bounds for both the duality gap and
dynamic Nash equilibrium regret, achieving near-optimality when measured
against the duality gap. Specifically, in certain benign environments, such as
sequences of stationary payoff functions, these algorithms maintain a nearly
constant metric bound. Experimental results further validate the effectiveness
of these algorithms. Lastly, this paper discusses potential reliability
concerns associated with using dynamic Nash equilibrium regret as a performance
metric. The technical appendix and code can be found at
https://github.com/qingxin6174/PPM-for-OSP.
[LINK]
http://arxiv.org/abs/2407.04591v3
[DATE]
2025-06-26 13:01:47+08:00
[CATEGORIES]
cs.LG
Distilling Normalizing Flows
[AUTHORS]
Steven Walton, Valeriy Klyukin, Maksim Artemev, Denis Derkach, Nikita Orlov, Humphrey Shi
[ABSTRACT]
Explicit density learners are becoming an increasingly popular technique for
generative models because of their ability to better model probability
distributions. They have advantages over Generative Adversarial Networks due to
their ability to perform density estimation and having exact latent-variable
inference. This has many advantages, including: being able to simply
interpolate, calculate sample likelihood, and analyze the probability
distribution. The downside of these models is that they are often more
difficult to train and have lower sampling quality.
Normalizing flows are explicit density models that use composable bijective
functions to turn an intractable probability function into a tractable one. In
this work, we present novel knowledge distillation techniques to increase
sampling quality and density estimation of smaller student normalizing flows.
We seek to study the capacity of knowledge distillation in Compositional
Normalizing Flows to understand the benefits and weaknesses provided by these
architectures. Normalizing flows have unique properties that allow for
non-traditional forms of knowledge transfer, where we can transfer that
knowledge within intermediate layers. We find that through this distillation,
we can make students significantly smaller while making substantial performance
gains over a non-distilled student. With smaller models there is a
proportionally increased throughput as this is dependent upon the number of
bijectors, and thus parameters, in the network.
[COMMENTS]
Published in eLVM @ CVPR
(https://openaccess.thecvf.com/content/CVPR2025W/eLVM/html/Walton_Distilling_Normalizing_Flows_CVPRW_2025_paper)
[LINK]
http://arxiv.org/abs/2506.21003v1
[DATE]
2025-06-26 12:34:28+08:00
[CATEGORIES]
cs.LG
Genetic Algorithm with Innovative Chromosome Patterns in the Breeding Process
[AUTHORS]
Qingchuan Lyu
[ABSTRACT]
This paper proposes Genetic Algorithm with Border Trades (GAB), a novel
modification of the standard genetic algorithm that enhances exploration by
incorporating new chromosome patterns in the breeding process. This approach
significantly mitigates premature convergence and improves search diversity.
Empirically, GAB achieves up to 8x higher fitness and 10x faster convergence on
complex job scheduling problems compared to standard Genetic Algorithms,
reaching average fitness scores of 888 versus 106 in under 20 seconds. On the
classic Flip-Flop problem, GAB consistently finds optimal or near-optimal
solutions in fewer generations, even as input sizes scale to thousands of bits.
These results highlight GAB as a highly effective and computationally efficient
alternative for solving large-scale combinatorial optimization problems.
[LINK]
http://arxiv.org/abs/2501.18184v3
[DATE]
2025-06-26 12:26:22+08:00
[CATEGORIES]
cs.LG
Pretrained Reversible Generation as Unsupervised Visual Representation Learning
[AUTHORS]
Rongkun Xue, Jinouwen Zhang, Yazhe Niu, Dazhong Shen, Bingqi Ma, Yu Liu, Jing Yang
[ABSTRACT]
Recent generative models based on score matching and flow matching have
significantly advanced generation tasks, but their potential in discriminative
tasks remains underexplored. Previous approaches, such as generative
classifiers, have not fully leveraged the capabilities of these models for
discriminative tasks due to their intricate designs. We propose Pretrained
Reversible Generation (PRG), which extracts unsupervised representations by
reversing the generative process of a pretrained continuous generation model.
PRG effectively reuses unsupervised generative models, leveraging their high
capacity to serve as robust and generalizable feature extractors for downstream
tasks. This framework enables the flexible selection of feature hierarchies
tailored to specific downstream tasks. Our method consistently outperforms
prior approaches across multiple benchmarks, achieving state-of-the-art
performance among generative model based methods, including 78% top-1 accuracy
on ImageNet at a resolution of 64*64. Extensive ablation studies, including
out-of-distribution evaluations, further validate the effectiveness of our
approach. Code is available at https://github.com/opendilab/PRG.
[COMMENTS]
Accepted by ICCV 2025
[LINK]
http://arxiv.org/abs/2412.01787v3
[DATE]
2025-06-26 12:26:18+08:00
[CATEGORIES]
cs.LG
Bridging the Gap Between Approximation and Learning via Optimal Approximation by ReLU MLPs of Maximal Regularity
[AUTHORS]
Ruiyang Hong, Anastasis Kratsios
[ABSTRACT]
The foundations of deep learning are supported by the seemingly opposing
perspectives of approximation or learning theory. The former advocates for
large/expressive models that need not generalize, while the latter considers
classes that generalize but may be too small/constrained to be universal
approximators. Motivated by real-world deep learning implementations that are
both expressive and statistically reliable, we ask: “Is there a class of neural
networks that is both large enough to be universal but structured enough to
generalize?” This paper constructively provides a positive answer to this
question by identifying a highly structured class of ReLU multilayer
perceptrons (MLPs), which are optimal function approximators and are
statistically well-behaved. We show that any $(L,\alpha)$-Hölder function
from $[0,1]^d$ to $[-n,n]$ can be approximated to a uniform $\mathcal{O}(1/n)$
error on $[0,1]^d$ with a sparsely connected ReLU MLP with the same Hölder
exponent $\alpha$ and coefficient $L$, of width $\mathcal{O}(dn^{d/\alpha})$,
depth $\mathcal{O}(\log(d))$, with $\mathcal{O}(dn^{d/\alpha})$ nonzero
parameters, and whose weights and biases take values in $\{0,\pm 1/2\}$ except
in the first and last layers, which instead have magnitude at most $n$. Further,
our class of MLPs achieves a near-optimal sample complexity of
$\mathcal{O}(\log(N)/\sqrt{N})$ when given $N$ i.i.d. normalized sub-Gaussian
training samples. We achieve this through a new construction that perfectly
fits together linear pieces using Kuhn triangulations, along with a new proof
technique which shows that our construction preserves the regularity of not
only the Hölder functions, but also any uniformly continuous function. Our
results imply that neural networks can solve the McShane extension problem on
suitable finite sets.
[COMMENTS]
16 pages main body, 40 pages proofs, 10 figures, 1 table
[LINK]
http://arxiv.org/abs/2409.12335v4
[DATE]
2025-06-26 12:08:57+08:00
[CATEGORIES]
cs.LG
Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations
[AUTHORS]
Chongjie Si, Zhiyi Shi, Xuehui Wang, Yichen Xiao, Xiaokang Yang, Wei Shen
[ABSTRACT]
Adapting pre-trained foundation models for diverse downstream tasks is a core
practice in artificial intelligence. However, the wide range of tasks and high
computational costs make full fine-tuning impractical. To overcome this,
parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are
becoming a growing research focus. Despite the success of these methods, they
are primarily designed for linear layers, focusing on two-dimensional matrices
while largely ignoring higher-dimensional parameter spaces like convolutional
kernels. Moreover, directly applying these methods to higher-dimensional
parameter spaces often disrupts their structural relationships. Given the rapid
advancements in matrix-based PEFT methods, rather than designing a specialized
strategy, we propose a generalization that extends matrix-based PEFT methods to
higher-dimensional parameter spaces without compromising their structural
properties. Specifically, we treat parameters as elements of a Lie group, with
updates modeled as perturbations in the corresponding Lie algebra. These
perturbations are mapped back to the Lie group through the exponential map,
ensuring smooth, consistent updates that preserve the inherent structure of the
parameter space. Extensive experiments on computer vision and natural language
processing validate the effectiveness and versatility of our approach,
demonstrating clear improvements over existing methods.
[COMMENTS]
2025 ICCV
[LINK]
http://arxiv.org/abs/2504.00851v2
[DATE]
2025-06-26 11:12:59+08:00
[CATEGORIES]
cs.LG
Explainable quantum regression algorithm with encoded data structure
[AUTHORS]
C. -C. Joseph Wang, F. Perkkola, I. Salmenperä, A. Meijer-van de Griend, J. K. Nurminen, R. S. Bennink
[ABSTRACT]
Hybrid variational quantum algorithms (VQAs) are promising for solving
practical problems such as combinatorial optimization, quantum chemistry
simulation, quantum machine learning, and quantum error correction on noisy
quantum computers. However, with typical random ansatz or quantum alternating
operator ansatz, derived variational quantum algorithms become a black box that
cannot be trusted for model interpretation, not to mention deploying as
applications in informing critical decisions: the results of these variational
parameters are just rotational angles for the quantum gates and have nothing to
do with interpretable values that a model can provide directly. In this paper,
we construct the first interpretable quantum regression algorithm, in which the
quantum state exactly encodes the classical data table and the variational
parameters correspond directly to the regression coefficients, which are real
numbers by construction, providing a high degree of model interpretability and
minimal cost to optimize due to the right expressiveness. We also take
advantage of the encoded data structure to reduce the time complexity of
computing the regression map. To shorten the circuit depth for nonlinear
regression, our algorithm can be extended by building nonlinear features by
classical preprocessing as the independent encoded column vectors. Although
the authors recently realized a less noisy compressed encoding on
superconducting qubits, we envision potential quantum utilities with
multi-qubit gates implemented in neutral cold atoms and ions.
[LINK]
http://arxiv.org/abs/2307.03334v5
[DATE]
2025-06-26 11:12:31+08:00
[CATEGORIES]
cs.LG
EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora
[AUTHORS]
Fangyuan Zhang, Zhengjun Huang, Yingli Zhou, Qintian Guo, Zhixun Li, Wensheng Luo, Di Jiang, Yixiang Fang, Xiaofang Zhou
[ABSTRACT]
Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large
language models (LLMs) by structuring retrieval over an external corpus.
However, existing approaches typically assume a static corpus, requiring
expensive full-graph reconstruction whenever new documents arrive, limiting
their scalability in dynamic, evolving environments. To address these
limitations, we introduce EraRAG, a novel multi-layered Graph-RAG framework
that supports efficient and scalable dynamic updates. Our method leverages
hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the
original corpus into hierarchical graph structures, enabling efficient and
localized insertions of new data without disrupting the existing topology. The
design eliminates the need for retraining or costly recomputation while
preserving high retrieval accuracy and low latency. Experiments on large-scale
benchmarks demonstrate that EraRAG achieves up to an order of magnitude
reduction in update time and token consumption compared to existing Graph-RAG
systems, while providing superior accuracy performance. This work offers a
practical path forward for RAG systems that must operate over continually
growing corpora, bridging the gap between retrieval efficiency and
adaptability. Our code and data are available at
https://github.com/EverM0re/EraRAG-Official.
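As a rough illustration of the hyperplane-based LSH step mentioned above, the sketch below hashes each embedding to one bit per hyperplane and stores document ids by the resulting key. It is an assumption-laden toy, not EraRAG's code: the names `make_hyperplanes`, `lsh_key`, and `LSHIndex` are hypothetical, and the multi-layered graph construction is omitted. It shows only why a new document can be inserted locally: it touches just the bucket its own key selects.

```python
import random
from collections import defaultdict

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; fixed once and reused for all inserts."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(a * b for a, b in zip(vec, h)) >= 0 else 0 for h in hyperplanes
    )

class LSHIndex:
    """Buckets of document ids keyed by their hyperplane sign pattern."""

    def __init__(self, hyperplanes):
        self.hyperplanes = hyperplanes
        self.buckets = defaultdict(list)

    def insert(self, doc_id, vec):
        # Only the bucket selected by this vector's key is modified.
        key = lsh_key(vec, self.hyperplanes)
        self.buckets[key].append(doc_id)
        return key

planes = make_hyperplanes(num_planes=4, dim=3, seed=0)
index = LSHIndex(planes)
index.insert("doc-a", [0.2, 0.9, -0.1])
index.insert("doc-b", [0.21, 0.88, -0.12])  # nearby vectors tend to share a bucket
index.insert("doc-new", [-0.7, 0.1, 0.6])   # later arrival, hashed with the same planes
```

Keeping the hyperplanes fixed is what makes updates incremental in this sketch: existing keys never change when new documents arrive.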
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.20963v1
[DATE]
2025-06-26 11:01:33+08:00
[CATEGORIES]
cs.LG
Antibody Design and Optimization with Multi-scale Equivariant Graph Diffusion Models for Accurate Complex Antigen Binding
[AUTHORS]
Jiameng Chen, Xiantao Cai, Jia Wu, Wenbin Hu
[ABSTRACT]
Antibody design remains a critical challenge in therapeutic and diagnostic
development, particularly for complex antigens with diverse binding interfaces.
Current computational methods face two main limitations: (1) capturing
geometric features while preserving symmetries, and (2) generalizing to novel
antigen interfaces. Despite recent advancements, these methods often fail to
accurately capture molecular interactions and maintain structural integrity. To
address these challenges, we propose AbMEGD, an end-to-end framework
integrating Multi-scale Equivariant Graph Diffusion for antibody sequence and
structure co-design. Leveraging
advanced geometric deep learning, AbMEGD combines atomic-level geometric
features with residue-level embeddings, capturing local atomic details and
global sequence-structure interactions. Its E(3)-equivariant diffusion method
ensures geometric precision, computational efficiency, and robust
generalizability for complex antigens. Furthermore, experiments using the
SAbDab database demonstrate a 10.13% increase in amino acid recovery, 3.32%
rise in improvement percentage, and a 0.062 Å reduction in root mean square
deviation within the critical CDR-H3 region compared to DiffAb, a leading
antibody design model. These results highlight AbMEGD’s ability to balance
structural integrity with improved functionality, establishing a new benchmark
for sequence-structure co-design and affinity optimization. The code is
available at: https://github.com/Patrick221215/AbMEGD.
[COMMENTS]
9 pages, 4 figures, accepted at IJCAI 2025
[LINK]
http://arxiv.org/abs/2506.20957v1
[DATE]
2025-06-26 10:45:38+08:00
[CATEGORIES]
cs.LG
Forecasting Geopolitical Events with a Sparse Temporal Fusion Transformer and Gaussian Process Hybrid: A Case Study in Middle Eastern and U.S. Conflict Dynamics
[AUTHORS]
Hsin-Hsiung Huang, Hayden Hampton
[ABSTRACT]
Forecasting geopolitical conflict from data sources like the Global Database
of Events, Language, and Tone (GDELT) is a critical challenge for national
security. The inherent sparsity, burstiness, and overdispersion of such data
cause standard deep learning models, including the Temporal Fusion Transformer
(TFT), to produce unreliable long-horizon predictions. We introduce STFT-VNNGP,
a hybrid architecture that won the 2023 Algorithms for Threat Detection (ATD)
competition by overcoming these limitations. Designed to bridge this gap, our
model employs a two-stage process: first, a TFT captures complex temporal
dynamics to generate multi-quantile forecasts. These quantiles then serve as
informed inputs for a Variational Nearest Neighbor Gaussian Process (VNNGP),
which performs principled spatiotemporal smoothing and uncertainty
quantification. In a case study forecasting conflict dynamics in the Middle
East and the U.S., STFT-VNNGP consistently outperforms a standalone TFT,
showing a superior ability to predict the timing and magnitude of bursty event
periods, particularly at long-range horizons. This work offers a robust
framework for generating more reliable and actionable intelligence from
challenging event data, with all code and workflows made publicly available to
ensure reproducibility.
[LINK]
http://arxiv.org/abs/2506.20935v1
[DATE]
2025-06-26 09:53:25+08:00
[CATEGORIES]
cs.LG
Lower Bounds on the Size of Markov Equivalence Classes
[AUTHORS]
Erik Jahn, Frederick Eberhardt, Leonard J. Schulman
[ABSTRACT]
Causal discovery algorithms typically recover causal graphs only up to their
Markov equivalence classes unless additional parametric assumptions are made.
The sizes of these equivalence classes reflect the limits of what can be
learned about the underlying causal graph from purely observational data. Under
the assumptions of acyclicity, causal sufficiency, and a uniform model prior,
Markov equivalence classes are known to be small on average. In this paper, we
show that this is no longer the case when any of these assumptions is relaxed.
Specifically, we prove exponentially large lower bounds for the expected size
of Markov equivalence classes in three settings: sparse random directed acyclic
graphs, uniformly random acyclic directed mixed graphs, and uniformly random
directed cyclic graphs.
[LINK]
http://arxiv.org/abs/2506.20933v1
[DATE]
2025-06-26 09:44:23+08:00
[CATEGORIES]
cs.LG
Extremely Simple Streaming Forest
[AUTHORS]
Haoyin Xu, Jayanta Dey, Sambit Panda, Joshua T. Vogelstein
[ABSTRACT]
Decision forests, including random forests and gradient boosting trees,
remain the leading machine learning methods for many real-world data problems,
especially on tabular data. However, most of the current implementations only
operate in batch mode, and therefore cannot incrementally update when more data
arrive. Several previous works developed streaming trees and ensembles to
overcome this limitation. Nonetheless, we found that those state-of-the-art
algorithms suffer from a number of drawbacks, including low accuracy on some
problems and high memory usage on others. We therefore developed an extremely
simple extension of decision trees: given new data, simply update existing
trees by continuing to grow them, and replace some old trees with new ones to
control the total number of trees. In a benchmark suite containing 72
classification problems (the OpenML-CC18 data suite), we illustrate that our
approach, Extremely Simple Streaming Forest (XForest), does not
suffer from either of the aforementioned limitations. On those datasets, we
also demonstrate that our approach often performs as well as, and sometimes
even better than, conventional batch decision forest algorithms. With a
zero-added-node approach, XForest-Zero, we further extend
existing splits to new tasks, and this very efficient method only requires
inference time. Thus, XForests establish a simple standard for streaming trees
and forests that could readily be applied to many real-world problems.
[COMMENTS]
Accepted at The Fourth Conference on Lifelong Learning Agents -
CoLLAs 2025
[LINK]
http://arxiv.org/abs/2110.08483v7
[DATE]
2025-06-26 09:33:13+08:00
[CATEGORIES]
cs.LG
Quantum Reinforcement Learning Trading Agent for Sector Rotation in the Taiwan Stock Market
[AUTHORS]
Chi-Sheng Chen, Xinyu Zhang, Ya-Chuan Chen
[ABSTRACT]
We propose a hybrid quantum-classical reinforcement learning framework for
sector rotation in the Taiwan stock market. Our system employs Proximal Policy
Optimization (PPO) as the backbone algorithm and integrates both classical
architectures (LSTM, Transformer) and quantum-enhanced models (QNN, QRWKV,
QASA) as policy and value networks. An automated feature engineering pipeline
extracts financial indicators from capital share data to ensure consistent
model input across all configurations. Empirical backtesting reveals a key
finding: although quantum-enhanced models consistently achieve higher training
rewards, they underperform classical models in real-world investment metrics
such as cumulative return and Sharpe ratio. This discrepancy highlights a core
challenge in applying reinforcement learning to financial domains – namely,
the mismatch between proxy reward signals and true investment objectives. Our
analysis suggests that current reward designs may incentivize overfitting to
short-term volatility rather than optimizing risk-adjusted returns. This issue
is compounded by the inherent expressiveness and optimization instability of
quantum circuits under Noisy Intermediate-Scale Quantum (NISQ) constraints. We
discuss the implications of this reward-performance gap and propose directions
for future improvement, including reward shaping, model regularization, and
validation-based early stopping. Our work offers a reproducible benchmark and
critical insights into the practical challenges of deploying quantum
reinforcement learning in real-world finance.
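For readers unfamiliar with the two investment metrics named above, a minimal sketch of how they are commonly computed from a series of per-period returns follows. The conventions are assumptions, not taken from the paper: returns are simple (not log) returns, the Sharpe ratio is per-period and not annualized, and the risk-free rate defaults to zero.

```python
import statistics

def cumulative_return(returns):
    """Compound simple per-period returns into one total return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    sd = statistics.stdev(excess)  # needs at least two observations
    if sd == 0.0:
        raise ValueError("Sharpe ratio is undefined for constant returns")
    return statistics.mean(excess) / sd

# Hypothetical daily returns for a backtest.
daily = [0.010, -0.004, 0.006, 0.002, -0.001]
total = cumulative_return(daily)
sharpe = sharpe_ratio(daily)
```

A training reward that tracks short-term gains can rise while the Sharpe ratio falls, because the ratio penalizes volatility through the denominator.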
[LINK]
http://arxiv.org/abs/2506.20930v1
[DATE]
2025-06-26 09:29:19+08:00
[CATEGORIES]
cs.LG
Active Learning for Manifold Gaussian Process Regression
[AUTHORS]
Yuanxing Cheng, Lulu Kang, Yiwei Wang, Chun Liu
[ABSTRACT]
This paper introduces an active learning framework for manifold Gaussian
Process (GP) regression, combining manifold learning with strategic data
selection to improve accuracy in high-dimensional spaces. Our method jointly
optimizes a neural network for dimensionality reduction and a Gaussian process
regressor in the latent space, supervised by an active learning criterion that
minimizes global prediction error. Experiments on synthetic data demonstrate
superior performance over random sequential learning. The framework
efficiently handles complex, discontinuous functions while preserving
computational tractability, offering practical value for scientific and
engineering applications. Future work will focus on scalability and
uncertainty-aware manifold learning.
[COMMENTS]
13 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.20928v1
[DATE]
2025-06-26 09:25:39+08:00
[CATEGORIES]
cs.LG
Interpretable Representation Learning for Additive Rule Ensembles
[AUTHORS]
Shahrzad Behzadimanesh, Pierre Le Bodic, Geoffrey I. Webb, Mario Boley
[ABSTRACT]
Small additive ensembles of symbolic rules offer interpretable prediction
models. Traditionally, these ensembles use rule conditions based on
conjunctions of simple threshold propositions $x \geq t$ on a single input
variable $x$ and threshold $t$, resulting geometrically in axis-parallel
polytopes as decision regions. While this form ensures a high degree of
interpretability for individual rules and can be learned efficiently using the
gradient boosting approach, it relies on having access to a curated set of
expressive and ideally independent input features so that a small ensemble of
axis-parallel regions can describe the target variable well. Absent such
features, reaching sufficient accuracy requires increasing the number and
complexity of individual rules, which diminishes the interpretability of the
model. Here, we extend classical rule ensembles by introducing logical
propositions with learnable sparse linear transformations of input variables,
i.e., propositions of the form $\mathbf{x}^\mathrm{T}\mathbf{w} \geq t$, where
$\mathbf{w}$ is a learnable sparse weight vector, enabling decision regions as
general polytopes with oblique faces. We propose a learning method using
sequential greedy optimization based on an iteratively reweighted formulation
of logistic regression. Experimental results demonstrate that the proposed
method efficiently constructs rule ensembles with the same test risk as
state-of-the-art methods while significantly reducing model complexity across
ten benchmark datasets.
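The rule form described above can be made concrete with a short sketch. It only evaluates a given ensemble and does not implement the paper's learning method; the representation (a sparse weight vector as a `{feature_index: weight}` dict, a rule as a weight plus a conjunction of propositions) and all numeric values are hypothetical.

```python
def proposition(x, w, t):
    """Evaluate x^T w >= t, where w is a sparse {feature_index: weight} dict."""
    return sum(weight * x[i] for i, weight in w.items()) >= t

def rule_fires(x, conditions):
    """A rule condition is a conjunction of propositions (w, t)."""
    return all(proposition(x, w, t) for w, t in conditions)

def ensemble_score(x, rules, intercept=0.0):
    """Additive ensemble: sum the weights of all rules whose condition holds."""
    return intercept + sum(beta for beta, conditions in rules if rule_fires(x, conditions))

# Hypothetical two-rule ensemble over three input features.
rules = [
    (1.5, [({0: 1.0}, 2.0)]),             # axis-parallel: x0 >= 2
    (-0.5, [({0: 1.0, 2: -1.0}, 0.0)]),   # oblique: x0 - x2 >= 0
]

score_a = ensemble_score([3.0, 0.0, 1.0], rules)
score_b = ensemble_score([1.0, 0.0, 4.0], rules)
```

The classical threshold proposition is the special case in which `w` has a single nonzero entry equal to one, so the oblique form strictly generalizes it.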
[LINK]
http://arxiv.org/abs/2506.20927v1
[DATE]
2025-06-26 09:24:08+08:00
[CATEGORIES]
cs.LG
LLM-guided Chemical Process Optimization with a Multi-Agent Approach
[AUTHORS]
Tong Zeng, Srivathsan Badrinarayanan, Janghoon Ock, Cheng-Kai Lai, Amir Barati Farimani
[ABSTRACT]
Chemical process optimization is crucial to maximize production efficiency
and economic performance. Traditional methods, including gradient-based
solvers, evolutionary algorithms, and parameter grid searches, become
impractical when operating constraints are ill-defined or unavailable,
requiring engineers to rely on subjective heuristics to estimate feasible
parameter ranges. To address this constraint definition bottleneck, we present
a multi-agent framework of large language model (LLM) agents that autonomously
infer operating constraints from minimal process descriptions, then
collaboratively guide optimization using the inferred constraints. Our
AutoGen-based agentic framework employs OpenAI’s o3 model, with specialized
agents for constraint generation, parameter validation, simulation execution,
and optimization guidance. Through two phases - autonomous constraint
generation using embedded domain knowledge, followed by iterative multi-agent
optimization - the framework eliminates the need for predefined operational
bounds. Validated on the hydrodealkylation process across cost, yield, and
yield-to-cost ratio metrics, the framework demonstrated competitive performance
with conventional optimization methods while achieving better computational
efficiency, requiring fewer iterations to converge. Our approach converged in
under 20 minutes, achieving a 31-fold speedup over grid search. Beyond
computational efficiency, the framework’s reasoning-guided search demonstrates
sophisticated process understanding, correctly identifying utility trade-offs,
and applying domain-informed heuristics. This approach shows significant
potential for optimization scenarios where operational constraints are poorly
characterized or unavailable, particularly for emerging processes and retrofit
applications.
[COMMENTS]
16 pages (main manuscript without references), 2 figures
[LINK]
http://arxiv.org/abs/2506.20921v1
[DATE]
2025-06-26 09:03:44+08:00
[CATEGORIES]
cs.LG
Explainable AI for Radar Resource Management: Modified LIME in Deep Reinforcement Learning
[AUTHORS]
Ziyang Lu, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
[ABSTRACT]
Deep reinforcement learning has been extensively studied in decision-making
processes and has demonstrated superior performance over conventional
approaches in various fields, including radar resource management (RRM).
However, a notable limitation of neural networks is their “black box” nature,
and recent research has increasingly focused on explainable AI (XAI)
techniques to describe the rationale behind neural network decisions. One
promising XAI method is local interpretable model-agnostic explanations (LIME).
However, the sampling process in LIME ignores the correlations between
features. In this paper, we propose a modified LIME approach that integrates
deep learning (DL) into the sampling process, which we refer to as DL-LIME. We
employ DL-LIME within deep reinforcement learning for radar resource
management. Numerical results show that DL-LIME outperforms conventional LIME
in terms of both fidelity and task performance. DL-LIME also provides insights into which factors
are more important in decision making for radar resource management.
[LINK]
http://arxiv.org/abs/2506.20916v1
[DATE]
2025-06-26 08:49:25+08:00
[CATEGORIES]
cs.LG
Faster Fixed-Point Methods for Multichain MDPs
[AUTHORS]
Matthew Zurek, Yudong Chen
[ABSTRACT]
We study value-iteration (VI) algorithms for solving general (a.k.a.
multichain) Markov decision processes (MDPs) under the average-reward
criterion, a fundamental but theoretically challenging setting. Beyond the
difficulties inherent to all average-reward problems posed by the lack of
contractivity and non-uniqueness of solutions to the Bellman operator, in the
multichain setting an optimal policy must solve the navigation subproblem of
steering towards the best connected component, in addition to optimizing
long-run performance within each component. We develop algorithms which better
solve this navigational subproblem in order to achieve faster convergence for
multichain MDPs, obtaining improved rates of convergence and sharper measures
of complexity relative to prior work. Many key components of our results are of
potential independent interest, including novel connections between
average-reward and discounted problems, optimal fixed-point methods for
discounted VI which extend to general Banach spaces, new sublinear convergence
rates for the discounted value error, and refined suboptimality decompositions
for multichain MDPs. Overall our results yield faster convergence rates for
discounted and average-reward problems and expand the theoretical foundations
of VI approaches.
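
For orientation, the discounted value-iteration baseline that the abstract builds on can be sketched as follows. This is the textbook algorithm on a toy MDP, not the faster fixed-point methods the paper develops; the data layout (`P`, `R`) and the two-state example are assumptions made for illustration.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Standard discounted value iteration.

    P[s][a] is a list of (next_state, probability) pairs and R[s][a] is the
    immediate reward. Both names are illustrative, not the paper's notation.
    """
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(max_iters):
        # Apply the Bellman optimality operator to every state.
        V_new = [
            max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
        # Stop once the sup-norm change falls below the tolerance.
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


# Toy two-state MDP (hypothetical): in state 0, action 0 stays put and
# action 1 moves to state 1, both with reward 0; state 1 is absorbing
# with reward 1 per step.
P = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)]]]
R = [[0.0, 0.0], [1.0]]
V = value_iteration(P, R, gamma=0.9)
```

In this toy case the fixed point is V[1] = 1 / (1 - 0.9) = 10 and V[0] = 0.9 * 10 = 9, and the number of sweeps needed grows as the discount factor approaches 1, which is the regime where faster schemes matter.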
[LINK]
http://arxiv.org/abs/2506.20910v1
[DATE]
2025-06-26 08:31:21+08:00
[CATEGORIES]
cs.LG
Optimal Single-Policy Sample Complexity and Transient Coverage for Average-Reward Offline RL
[AUTHORS]
Matthew Zurek, Guy Zamir, Yudong Chen
[ABSTRACT]
We study offline reinforcement learning in average-reward MDPs, which
presents increased challenges from the perspectives of distribution shift and
non-uniform coverage, and has been relatively underexamined from a theoretical
perspective. While previous work obtains performance guarantees under
single-policy data coverage assumptions, such guarantees utilize additional
complexity measures which are uniform over all policies, such as the uniform
mixing time. We develop sharp guarantees depending only on the target policy,
specifically the bias span and a novel policy hitting radius, yielding the
first fully single-policy sample complexity bound for average-reward offline
RL. We are also the first to handle general weakly communicating MDPs,
contrasting restrictive structural assumptions made in prior work. To achieve
this, we introduce an algorithm based on pessimistic discounted value iteration
enhanced by a novel quantile clipping technique, which enables the use of a
sharper empirical-span-based penalty function. Our algorithm also does not
require any prior parameter knowledge for its implementation. Remarkably, we
show via hard examples that learning under our conditions requires coverage
assumptions beyond the stationary distribution of the target policy,
distinguishing single-policy complexity measures from previously examined
cases. We also develop lower bounds nearly matching our main result.
[LINK]
http://arxiv.org/abs/2506.20904v1
[DATE]
2025-06-26 08:22:39+08:00
[CATEGORIES]
cs.LG
Graph-Structured Feedback Multimodel Ensemble Online Conformal Prediction
[AUTHORS]
Erfan Hajihashemi, Yanning Shen
[ABSTRACT]
Online conformal prediction has demonstrated its capability to construct a
prediction set for each incoming data point that covers the true label with a
predetermined probability. To cope with potential distribution shift,
multi-model online conformal prediction has been introduced to select and
leverage different models from a preselected candidate set. Along with the
improved flexibility, the choice of the preselected set also brings challenges.
A candidate set that includes a large number of models may increase the
computational complexity. In addition, the inclusion of irrelevant models with
poor performance may negatively impact the performance and lead to
unnecessarily large prediction sets. To address these challenges, we propose a
novel multi-model online conformal prediction algorithm that identifies a
subset of effective models at each time step by collecting feedback from a
bipartite graph, which is refined upon receiving new data. A model is then
selected from this subset to construct the prediction set, resulting in reduced
computational complexity and smaller prediction sets. Additionally, we
demonstrate that using prediction set size as feedback, alongside model loss,
can significantly improve efficiency by constructing smaller prediction sets
while still satisfying the required coverage guarantee. The proposed algorithms
are proven to ensure valid coverage and achieve sublinear regret. Experiments
on real and synthetic datasets validate that the proposed methods construct
smaller prediction sets and outperform existing multi-model online conformal
prediction approaches.
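
As background for the online setting described above, a minimal single-model sketch of an online conformal threshold update is given below. It is a generic quantile-tracking rule in the spirit of adaptive conformal inference, not the graph-feedback multi-model algorithm proposed in the paper; the function name, step size `eta`, and synthetic scores are assumptions for illustration.

```python
import random


def online_conformal(scores, alpha=0.1, eta=0.05, q0=0.0):
    """Track a score threshold q_t so that long-run miscoverage approaches alpha.

    scores[t] is the nonconformity score of the true label at step t
    (for example |y_t - prediction_t|). The prediction set at step t
    contains every label whose score is at most q_t.
    """
    q = q0
    errors = []
    for s in scores:
        err = 1 if s > q else 0      # 1 when the true label is not covered
        errors.append(err)
        q += eta * (err - alpha)     # widen after a miss, shrink after a hit
    return q, errors


rng = random.Random(0)
scores = [rng.random() for _ in range(2000)]   # synthetic scores in [0, 1)
q_final, errors = online_conformal(scores)
miscoverage = sum(errors) / len(errors)
```

Because the threshold stays bounded when the scores are bounded, the empirical miscoverage is forced toward `alpha` over time regardless of how the scores are distributed; a multi-model method would additionally choose which model's scores to use at each step.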
[LINK]
http://arxiv.org/abs/2506.20898v1
[DATE]
2025-06-26 08:06:11+08:00
[CATEGORIES]
cs.LG
Next-token prediction capacity: general upper bounds and a lower bound for transformers
[AUTHORS]
Liam Madden, Curtis Fox, Christos Thrampoulidis
[ABSTRACT]
Given a sequence of tokens, such as words, the task of next-token prediction
is to predict the next-token conditional probability distribution. Decoder-only
transformers have become effective models for this task, but their properties
are still not fully understood. In particular, the largest number of distinct
context sequences that a decoder-only transformer can interpolate next-token
distributions for has not been established. To fill this gap, we prove upper
and lower bounds on this number, which are equal up to a multiplicative
constant. We prove these bounds in the general setting where next-token
distributions can be arbitrary as well as the empirical setting where they are
calculated from a finite number of document sequences. Our lower bounds are for
one-layer multi-head decoder-only transformers and our proofs highlight an
important injectivity property satisfied by self-attention. Furthermore, we
provide numerical evidence that the minimal number of parameters for
memorization is sufficient for being able to train the model to the entropy
lower bound.
[COMMENTS]
V3: added two examples, a remark, and a second experiment where only
the FNN layers are trained
[LINK]
http://arxiv.org/abs/2405.13718v3
[DATE]
2025-06-26 07:53:42+08:00
[CATEGORIES]
cs.LG
HyperINF: Unleashing the HyperPower of the Schulz’s Method for Data Influence Estimation
[AUTHORS]
Xinyu Zhou, Simin Fan, Martin Jaggi
[ABSTRACT]
Influence functions provide a principled method to assess the contribution of
individual training samples to a specific target. Yet, their high computational
costs limit their applications on large-scale models and datasets. Existing
methods proposed for influence function approximation have significantly
reduced the computational overheads. However, they mostly suffer from
inaccurate estimation due to the lack of strong convergence guarantees from the
algorithm. The family of hyperpower methods is well known for its rigorous
convergence guarantees on matrix inverse approximation, while the matrix
multiplication operation can involve intractable memory and computation costs
on large-scale models. We propose HyperINF, an efficient and accurate influence
function approximation method which leverages the hyperpower method,
specifically Schulz’s iterative algorithm. To deal with the
computation-intensive matrix multiplication, we incorporate the generalized
Fisher information (GFIM) as a low-rank approximation of the Hessian matrix,
which reduces the memory and computation overheads to constant costs
independent of ranks on LoRA-tuned models. We first demonstrate the superior
accuracy and stability of HyperINF compared to other baselines through a
synthetic convergence simulation for matrix inversion. We further validate the
efficacy of HyperINF through extensive real-world data attribution tasks,
including mislabeled data detection and data selection for LLM and VLM
fine-tuning. On LoRA-tuned models, HyperINF achieves superior downstream
performance with minimal memory and computational overhead, while other
baselines suffer from significant degradation. Our codebase is available at
https://github.com/Blackzxy/HyperINF.
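
To make the Schulz iteration mentioned above concrete, here is a minimal sketch of the classical second-order update X_{k+1} = X_k (2I - A X_k) for approximating a matrix inverse. It operates on a small dense matrix in plain Python and does not include the GFIM approximation or any LoRA-specific structure from the paper; the helper names and the example matrix are assumptions.

```python
def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def schulz_inverse(A, iters=20):
    """Approximate the inverse of A with Schulz's iteration."""
    n = len(A)
    norm_1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))  # max column sum
    norm_inf = max(sum(abs(x) for x in row) for row in A)                # max row sum
    # Standard safe starting point: X_0 = A^T / (||A||_1 * ||A||_inf).
    X = [[A[j][i] / (norm_1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iters):
        AX = matmul(A, X)
        # T = 2I - A X
        T = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)] for i in range(n)]
        X = matmul(X, T)
    return X


A = [[4.0, 1.0], [2.0, 3.0]]
A_inv = schulz_inverse(A)
```

For this matrix the exact inverse is [[0.3, -0.1], [-0.2, 0.4]]. The residual I - A X is squared at every step, so the iteration converges quadratically once it starts contracting, at the cost of two matrix multiplications per step.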
[LINK]
http://arxiv.org/abs/2410.05090v2
[DATE]
2025-06-26 07:23:23+08:00
[CATEGORIES]
cs.LG
Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance
[AUTHORS]
Kyanna Dagenais, Istvan David
[ABSTRACT]
Model-driven engineering problems often require complex model transformations
(MTs), i.e., MTs that are chained in extensive sequences. Pertinent examples of
such problems include model synchronization, automated model repair, and design
space exploration. Manually developing complex MTs is an error-prone and often
infeasible process. Reinforcement learning (RL) is an apt way to alleviate
these issues. In RL, an autonomous agent explores the state space through trial
and error to identify beneficial sequences of actions, such as MTs. However, RL
methods exhibit performance issues in complex problems. In these situations,
human guidance can be of high utility. In this paper, we present an approach
and technical framework for developing complex MT sequences through RL, guided
by potentially uncertain human advice. Our framework allows user-defined MTs to
be mapped onto RL primitives, and executes them as RL programs to find optimal
MT sequences. Our evaluation shows that human guidance, even if uncertain,
substantially improves RL performance, and results in more efficient
development of complex MTs. Through a trade-off between the certainty and
timeliness of human advice, our method takes a step towards RL-driven
human-in-the-loop engineering methods.
[COMMENTS]
Accepted for ACM/IEEE MODELS’25
[LINK]
http://arxiv.org/abs/2506.20883v1
[DATE]
2025-06-26 07:10:12+08:00
[CATEGORIES]
cs.LG
Always Skip Attention
[AUTHORS]
Yiping Ji, Hemanth Saratchandran, Peyman Moghadam, Simon Lucey
[ABSTRACT]
We highlight a curious empirical result within modern Vision Transformers
(ViTs). Specifically, self-attention catastrophically fails to train unless it
is used in conjunction with a skip connection. This is in contrast to other
elements of a ViT that continue to exhibit good performance (albeit suboptimal)
when skip connections are removed. Further, we show that this critical
dependence on skip connections is a relatively new phenomenon, with previous
deep architectures (e.g., CNNs) exhibiting good performance in their absence. In
this paper, we theoretically characterize that the self-attention mechanism is
fundamentally ill-conditioned and is, therefore, uniquely dependent on skip
connections for regularization. Additionally, we propose Token Graying – a
simple yet effective complement (to skip connections) that further improves the
condition of input tokens. We validate our approach in both supervised and
self-supervised training methods.
[COMMENTS]
This work has just been accepted by ICCV 2025
[LINK]
http://arxiv.org/abs/2505.01996v2
[DATE]
2025-06-26 07:06:43+08:00
[CATEGORIES]
cs.LG
Empowering Digital Agriculture: A Privacy-Preserving Framework for Data Sharing and Collaborative Research
[AUTHORS]
Osama Zafar, Rosemarie Santa González, Mina Namazi, Alfonso Morales, Erman Ayday
[ABSTRACT]
Data-driven agriculture, which integrates technology and data into
agricultural practices, has the potential to improve crop yield, disease
resilience, and long-term soil health. However, privacy concerns, such as
adverse pricing, discrimination, and resource manipulation, deter farmers from
sharing data, as it can be used against them. To address this barrier, we
propose a privacy-preserving framework that enables secure data sharing and
collaboration for research and development while mitigating privacy risks. The
framework combines dimensionality reduction techniques (like Principal
Component Analysis (PCA)) and differential privacy by introducing Laplacian
noise to protect sensitive information. The proposed framework allows
researchers to identify potential collaborators for a target farmer and train
personalized machine learning models either on the data of identified
collaborators via federated learning or directly on the aggregated
privacy-protected data. It also allows farmers to identify potential
collaborators based on similarities. We have validated this on real-life
datasets, demonstrating robust privacy protection against adversarial attacks
and utility performance comparable to a centralized system. We demonstrate how
this framework can facilitate collaboration among farmers and help researchers
pursue broader research objectives. The adoption of the framework can empower
researchers and policymakers to leverage agricultural data responsibly, paving
the way for transformative advances in data-driven agriculture. By addressing
critical privacy challenges, this work supports secure data integration,
fostering innovation and sustainability in agricultural systems.
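
The noise-addition step described above can be illustrated with a minimal Laplace-mechanism sketch. It covers only the perturbation step and omits the PCA projection and the collaborator-matching logic; the function names, the sensitivity, and the privacy budget `epsilon` are assumptions, since the abstract does not specify them.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(values, sensitivity, epsilon, rng):
    """Add Laplace noise with scale sensitivity / epsilon to each value."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


rng = random.Random(0)
features = [0.0] * 20000                      # hypothetical reduced features
noisy = privatize(features, sensitivity=1.0, epsilon=1.0, rng=rng)
mean_noise = sum(noisy) / len(noisy)
mean_abs_noise = sum(abs(x) for x in noisy) / len(noisy)
```

A smaller `epsilon` gives a larger noise scale and stronger privacy at the expense of utility, which is the trade-off the framework's evaluation has to balance.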
[COMMENTS]
arXiv admin note: text overlap with arXiv:2409.06069
[LINK]
http://arxiv.org/abs/2506.20872v1
[DATE]
2025-06-26 06:46:30+08:00
[CATEGORIES]
cs.LG
High-dimensional Contextual Bandit Problem without Sparsity
[AUTHORS]
Junpei Komiyama, Masaaki Imaizumi
[ABSTRACT]
In this research, we investigate the high-dimensional linear contextual
bandit problem where the number of features $p$ is greater than the budget $T$,
or it may even be infinite. Differing from the majority of previous works in
this field, we do not impose sparsity on the regression coefficients. Instead,
we rely on recent findings on overparameterized models, which enable us to
analyze the performance of the minimum-norm interpolating estimator when data
distributions have small effective ranks. We propose an explore-then-commit
(EtC) algorithm to address this problem and examine its performance. Through
our analysis, we derive the optimal rate of the EtC algorithm in terms of $T$
and show that this rate can be achieved by balancing exploration and
exploitation. Moreover, we introduce an adaptive explore-then-commit (AEtC)
algorithm that adaptively finds the optimal balance. We assess the performance
of the proposed algorithms through a series of simulations.
[LINK]
http://arxiv.org/abs/2306.11017v2
[DATE]
2025-06-26 06:16:22+08:00
[CATEGORIES]
cs.LG
Multi-Objective Reinforcement Learning for Cognitive Radar Resource Management
[AUTHORS]
Ziyang Lu, Subodh Kalia, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
[ABSTRACT]
The time allocation problem in multi-function cognitive radar systems focuses
on the trade-off between scanning for newly emerging targets and tracking the
previously detected targets. We formulate this as a multi-objective
optimization problem and employ deep reinforcement learning to find
Pareto-optimal solutions and compare deep deterministic policy gradient (DDPG)
and soft actor-critic (SAC) algorithms. Our results demonstrate the
effectiveness of both algorithms in adapting to various scenarios, with SAC
showing improved stability and sample efficiency compared to DDPG. We further
employ the NSGA-II algorithm to estimate an upper bound on the Pareto front of
the considered problem. This work contributes to the development of more
efficient and adaptive cognitive radar systems capable of balancing multiple
competing objectives in dynamic environments.
[LINK]
http://arxiv.org/abs/2506.20853v1
[DATE]
2025-06-26 05:56:30+08:00
[CATEGORIES]
cs.LG
InterFormer: Effective Heterogeneous Interaction Learning for Click-Through Rate Prediction
[AUTHORS]
Zhichen Zeng, Xiaolong Liu, Mengyue Hang, Xiaoyi Liu, Qinghai Zhou, Chaofei Yang, Yiqun Liu, Yichen Ruan, Laming Chen, Yuxin Chen, Yujia Hao, Jiaqi Xu, Jade Nie, Xi Liu, Buyun Zhang, Wei Wen, Siyang Yuan, Hang Yin, Xin Zhang, Kai Wang, Wen-Yen Chen, Yiping Han, Huayu Li, Chunzhi Yang, Bo Long, Philip S. Yu, Hanghang Tong, Jiyan Yang
[ABSTRACT]
Click-through rate (CTR) prediction, which predicts the probability of a user
clicking an ad, is a fundamental task in recommender systems. The emergence of
heterogeneous information, such as user profile and behavior sequences, depicts
user interests from different aspects. A mutually beneficial integration of
heterogeneous information is the cornerstone towards the success of CTR
prediction. However, most of the existing methods suffer from two fundamental
limitations, including (1) insufficient inter-mode interaction due to the
unidirectional information flow between modes, and (2) aggressive information
aggregation caused by early summarization, resulting in excessive information
loss. To address the above limitations, we propose a novel module named
InterFormer to learn heterogeneous information interaction in an interleaving
style. To achieve better interaction learning, InterFormer enables
bidirectional information flow for mutually beneficial learning across
different modes. To avoid aggressive information aggregation, we retain
complete information in each data mode and use a separate bridging arch for
effective information selection and summarization. Our proposed InterFormer
achieves state-of-the-art performance on three public datasets and a
large-scale industrial dataset.
[COMMENTS]
11 pages, 6 figures
[LINK]
http://arxiv.org/abs/2411.09852v3
[DATE]
2025-06-26 05:48:04+08:00
[CATEGORIES]
cs.LG
Learning-Based Resource Management in Integrated Sensing and Communication Systems
[AUTHORS]
Ziyang Lu, M. Cenk Gursoy, Chilukuri K. Mohan, Pramod K. Varshney
[ABSTRACT]
In this paper, we tackle the task of adaptive time allocation in integrated
sensing and communication systems equipped with radar and communication units.
The dual-functional radar-communication system’s task involves allocating dwell
times for tracking multiple targets and utilizing the remaining time for data
transmission towards estimated target locations. We introduce a novel
constrained deep reinforcement learning (CDRL) approach, designed to optimize
resource allocation between tracking and communication under time budget
constraints, thereby enhancing target communication quality. Our numerical
results demonstrate the efficiency of our proposed CDRL framework, confirming
its ability to maximize communication quality in highly dynamic environments
while adhering to time constraints.
[LINK]
http://arxiv.org/abs/2506.20849v1
[DATE]
2025-06-26 05:44:07+08:00
[CATEGORIES]
cs.LG
Uncertainty-Aware Machine-Learning Framework for Predicting Dislocation Plasticity and Stress-Strain Response in FCC Alloys
[AUTHORS]
Jing Luo, Yejun Gu, Yanfei Wang, Xiaolong Ma, Jaafar A. El-Awady
[ABSTRACT]
Machine learning has significantly advanced the understanding and application
of structural materials, with an increasing emphasis on integrating existing
data and quantifying uncertainties in predictive modeling. This study presents
a comprehensive methodology utilizing a mixed density network (MDN) model,
trained on extensive experimental data from literature. This approach uniquely
predicts the distribution of dislocation density, inferred as a latent
variable, and the resulting stress distribution at the grain level. The
incorporation of statistical parameters of those predicted distributions into a
dislocation-mediated plasticity model allows for accurate stress-strain
predictions with explicit uncertainty quantification. This strategy not only
improves the accuracy and reliability of mechanical property predictions but
also plays a vital role in optimizing alloy design, thereby facilitating the
development of new materials in a rapidly evolving industry.
[LINK]
http://arxiv.org/abs/2506.20839v1
[DATE]
2025-06-26 05:18:14+08:00
[CATEGORIES]
cs.LG
Discovering Global False Negatives On the Fly for Self-supervised Contrastive Learning
[AUTHORS]
Vicente Balmaseda, Bokun Wang, Ching-Long Lin, Tianbao Yang
[ABSTRACT]
In self-supervised contrastive learning, negative pairs are typically
constructed using an anchor image and a sample drawn from the entire dataset,
excluding the anchor. However, this approach can result in the creation of
negative pairs with similar semantics, referred to as “false negatives”,
leading to their embeddings being falsely pushed apart. To address this issue,
we introduce GloFND, an optimization-based approach that automatically learns
on the fly the threshold for each anchor data to identify its false negatives
during training. In contrast to previous methods for false negative discovery,
our approach globally detects false negatives across the entire dataset rather
than locally within the mini-batch. Moreover, its per-iteration computation
cost remains independent of the dataset size. Experimental results on image and
image-text data demonstrate the effectiveness of the proposed method. Our
implementation is available at https://github.com/vibalcam/GloFND.
[COMMENTS]
Accepted to ICML 2025
[LINK]
http://arxiv.org/abs/2502.20612v2
[DATE]
2025-06-26 05:11:53+08:00
[CATEGORIES]
cs.LG
Composite Flow Matching for Reinforcement Learning with Shifted-Dynamics Data
[AUTHORS]
Lingkai Kong, Haichuan Wang, Tonghan Wang, Guojun Xiong, Milind Tambe
[ABSTRACT]
Incorporating pre-collected offline data from a source environment can
significantly improve the sample efficiency of reinforcement learning (RL), but
this benefit is often challenged by discrepancies between the transition
dynamics of the source and target environments. Existing methods typically
address this issue by penalizing or filtering out source transitions in high
dynamics-gap regions. However, their estimation of the dynamics gap often
relies on KL divergence or mutual information, which can be ill-defined when
the source and target dynamics have disjoint support. To overcome these
limitations, we propose CompFlow, a method grounded in the theoretical
connection between flow matching and optimal transport. Specifically, we model
the target dynamics as a conditional flow built upon the output distribution of
the source-domain flow, rather than learning it directly from a Gaussian prior.
This composite structure offers two key advantages: (1) improved generalization
for learning target dynamics, and (2) a principled estimation of the dynamics
gap via the Wasserstein distance between source and target transitions.
Leveraging our principled estimation of the dynamics gap, we further introduce
an optimistic active data collection strategy that prioritizes exploration in
regions of high dynamics gap, and theoretically prove that it reduces the
performance disparity with the optimal policy. Empirically, CompFlow
outperforms strong baselines across several RL benchmarks with shifted
dynamics.
[LINK]
http://arxiv.org/abs/2505.23062v2
[DATE]
2025-06-26 05:09:46+08:00
[CATEGORIES]
cs.LG
Harnessing the Universal Geometry of Embeddings
[AUTHORS]
Rishi Jha, Collin Zhang, Vitaly Shmatikov, John X. Morris
[ABSTRACT]
We introduce the first method for translating text embeddings from one vector
space to another without any paired data, encoders, or predefined sets of
matches. Our unsupervised approach translates any embedding to and from a
universal latent representation (i.e., a universal semantic structure
conjectured by the Platonic Representation Hypothesis). Our translations
achieve high cosine similarity across model pairs with different architectures,
parameter counts, and training datasets.
The ability to translate unknown embeddings into a different space while
preserving their geometry has serious implications for the security of vector
databases. An adversary with access only to embedding vectors can extract
sensitive information about the underlying documents, sufficient for
classification and attribute inference.
[LINK]
http://arxiv.org/abs/2505.12540v3
[DATE]
2025-06-26 05:04:02+08:00
[CATEGORIES]
cs.LG
TaxaDiffusion: Progressively Trained Diffusion Model for Fine-Grained Species Generation
[AUTHORS]
Amin Karimi Monsefi, Mridul Khurana, Rajiv Ramnath, Anuj Karpatne, Wei-Lun Chao, Cheng Zhang
[ABSTRACT]
We propose TaxaDiffusion, a taxonomy-informed training framework for
diffusion models to generate fine-grained animal images with high morphological
and identity accuracy. Unlike standard approaches that treat each species as an
independent category, TaxaDiffusion incorporates domain knowledge that many
species exhibit strong visual similarities, with distinctions often residing in
subtle variations of shape, pattern, and color. To exploit these relationships,
TaxaDiffusion progressively trains conditioned diffusion models across
different taxonomic levels – starting from broad classifications such as Class
and Order, refining through Family and Genus, and ultimately distinguishing at
the Species level. This hierarchical learning strategy first captures
coarse-grained morphological traits shared by species with common ancestors,
facilitating knowledge transfer before refining fine-grained differences for
species-level distinction. As a result, TaxaDiffusion enables accurate
generation even with limited training samples per species. Extensive
experiments on three fine-grained animal datasets demonstrate that TaxaDiffusion outperforms
existing approaches, achieving superior fidelity in fine-grained animal image
generation. Project page: https://amink8.github.io/TaxaDiffusion/
[COMMENTS]
Accepted to ICCV 2025
[LINK]
http://arxiv.org/abs/2506.01923v2
[DATE]
2025-06-26 05:02:25+08:00
[CATEGORIES]
cs.LG
Efficacy of Temporal Fusion Transformers for Runoff Simulation
[AUTHORS]
Sinan Rasiya Koya, Tirthankar Roy
[ABSTRACT]
Combining attention with recurrence has been shown to be valuable in sequence
modeling, including hydrological predictions. Here, we explore the strength of
Temporal Fusion Transformers (TFTs) over Long Short-Term Memory (LSTM) networks
in rainfall-runoff modeling. We train ten randomly initialized models, TFT and
LSTM, for 531 CAMELS catchments in the US. We repeat the experiment with five
subsets of the Caravan dataset, each representing catchments in the US,
Australia, Brazil, Great Britain, and Chile. Then, the performance of the
models, their variability regarding the catchment attributes, and the
difference according to the datasets are assessed. Our findings show that TFT
slightly outperforms LSTM, especially in simulating the midsection and peak of
hydrographs. Furthermore, we show the ability of TFT to handle longer sequences
and why it can be a better candidate for higher or larger catchments. Being an
explainable AI technique, TFT identifies the key dynamic and static variables,
providing valuable scientific insights. However, both TFT and LSTM exhibit a
considerable drop in performance with the Caravan dataset, indicating possible
data quality issues. Overall, the study highlights the potential of TFT in
improving hydrological modeling and understanding.
[LINK]
http://arxiv.org/abs/2506.20831v1
[DATE]
2025-06-26 04:58:28+08:00
[CATEGORIES]
cs.LG
Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers
[AUTHORS]
Furkan Mumcu, Yasin Yilmaz
[ABSTRACT]
Deep Neural Networks (DNNs) are notoriously vulnerable to adversarial input
designs with limited noise budgets. While numerous successful attacks with
subtle modifications to original input have been proposed, defense techniques
against these attacks are relatively understudied. Existing defense approaches
either focus on improving DNN robustness by negating the effects of
perturbations or use a secondary model to detect adversarial data. Although
equally important, the attack detection approach, which is studied in this
work, provides a more practical defense compared to the robustness approach. We
show that the existing detection methods are either ineffective against the
state-of-the-art attack techniques or computationally inefficient for real-time
processing. We propose a novel universal and efficient method to detect
adversarial examples by analyzing the varying degrees of impact of attacks on
different DNN layers. Our method trains a lightweight regression model that
predicts deeper-layer features from early-layer features, and uses the
prediction error to detect adversarial samples. Through theoretical arguments
and extensive experiments, we demonstrate that our detection method is highly
effective, computationally efficient for real-time processing, compatible with
any DNN architecture, and applicable across different domains, such as image,
video, and audio.
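
The detection idea described above (regress deeper-layer features on early-layer features, then flag inputs with large prediction error) can be sketched in a deliberately tiny form. Scalar values stand in for layer activations, an ordinary least-squares line stands in for the lightweight regression model, and the threshold rule is an assumption; none of this is taken from the paper's implementation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ w * xs + b for scalar features."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, mean_y - w * mean_x


def detection_score(w, b, early, deep):
    """Absolute error when predicting the deep feature from the early one."""
    return abs(deep - (w * early + b))


# Hypothetical clean data: the deep feature is roughly 2 * early + 1.
clean_early = [0.0, 1.0, 2.0, 3.0, 4.0]
clean_deep = [1.0, 3.1, 4.9, 7.0, 9.0]
w, b = fit_line(clean_early, clean_deep)

# Assumed threshold rule: the largest error observed on clean data.
threshold = max(detection_score(w, b, e, d) for e, d in zip(clean_early, clean_deep))

# A sample whose deep feature deviates strongly from the clean relationship.
is_flagged = detection_score(w, b, 2.0, 8.0) > threshold
```

The sketch relies on the premise stated in the abstract: an attack affects layers non-uniformly, so the relationship learned on clean inputs no longer holds and the prediction error grows.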
[COMMENTS]
arXiv admin note: substantial text overlap with arXiv:2410.17442
[LINK]
http://arxiv.org/abs/2506.20816v1
[DATE]
2025-06-26 04:30:28+08:00
[CATEGORIES]
cs.LG
Divide, Specialize, and Route: A New Approach to Efficient Ensemble Learning
[AUTHORS]
Jakub Piwko, Jędrzej Ruciński, Dawid Płudowski, Antoni Zajko, Patryzja Żak, Mateusz Zacharecki, Anna Kozak, Katarzyna Woźnica
[ABSTRACT]
Ensemble learning has proven effective in boosting predictive performance,
but traditional methods such as bagging, boosting, and dynamic ensemble
selection (DES) suffer from high computational cost and limited adaptability to
heterogeneous data distributions. To address these limitations, we propose
Hellsemble, a novel and interpretable ensemble framework for binary
classification that leverages dataset complexity during both training and
inference. Hellsemble incrementally partitions the dataset into circles of
difficulty by iteratively passing misclassified instances from simpler models
to subsequent ones, forming a committee of specialised base learners. Each
model is trained on increasingly challenging subsets, while a separate router
model learns to assign new instances to the most suitable base model based on
inferred difficulty. Hellsemble achieves strong classification accuracy while
maintaining computational efficiency and interpretability. Experimental results
on OpenML-CC18 and Tabzilla benchmarks demonstrate that Hellsemble often
outperforms classical ensemble methods. Our findings suggest that embracing
instance-level difficulty offers a promising direction for constructing
efficient and robust ensemble systems.
[COMMENTS]
14 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.20814v1
[DATE]
2025-06-26 04:26:04+08:00
[CATEGORIES]
cs.LG
FINN-GL: Generalized Mixed-Precision Extensions for FPGA-Accelerated LSTMs
[AUTHORS]
Shashwat Khandelwal, Jakoba Petri-Koenig, Thomas B. Preußer, Michaela Blott, Shreejith Shanker
[ABSTRACT]
Recurrent neural networks (RNNs), particularly LSTMs, are effective for
time-series tasks like sentiment analysis and short-term stock prediction.
However, their computational complexity poses challenges for real-time
deployment in resource constrained environments. While FPGAs offer a promising
platform for energy-efficient AI acceleration, existing tools mainly target
feed-forward networks, and LSTM acceleration typically requires full custom
implementation. In this paper, we address this gap by leveraging the
open-source and extensible FINN framework to enable the generalized deployment
of LSTMs on FPGAs. Specifically, we leverage the Scan operator from the Open
Neural Network Exchange (ONNX) specification to model the recurrent nature of
LSTM computations, enabling support for mixed quantisation within them and
functional verification of LSTM-based models. Furthermore, we introduce custom
transformations within the FINN compiler to map the quantised ONNX computation
graph to hardware blocks from the HLS kernel library of the FINN compiler and
Vitis HLS. We validate the proposed tool-flow by training a quantised ConvLSTM
model for a mid-price stock prediction task using the widely used dataset and
generating a corresponding hardware IP of the model using our flow, targeting
the XCZU7EV device. We show that the generated quantised ConvLSTM accelerator
through our flow achieves a balance between performance (latency) and resource
consumption, while matching (or bettering) inference accuracy of
state-of-the-art models with reduced precision. We believe that the
generalisable nature of the proposed flow will pave the way for
resource-efficient RNN accelerator designs on FPGAs.
[COMMENTS]
9 pages, 6 figures, 5 tables, Accepted for publication in IEEE
FPL-2025 (https://2025.fpl.org/)
[LINK]
http://arxiv.org/abs/2506.20810v1
[DATE]
2025-06-26 04:07:46+08:00
[CATEGORIES]
cs.LG
GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization
[AUTHORS]
Martin Andrews, Sam Witteveen
[COMMENTS]
4 page paper plus Appendices. Accepted to the ES-FoMo “Efficient
Systems for Foundation Models” workshop at ICML 2025
[LINK]
http://arxiv.org/abs/2506.20807v1
[DATE]
2025-06-26 03:59:34+08:00
[CATEGORIES]
cs.LG
Structural System Identification via Validation and Adaptation
[AUTHORS]
Cristian López, Keegan J. Moore
[ABSTRACT]
Estimating the governing equation parameter values is essential for
integrating experimental data with scientific theory to understand, validate,
and predict the dynamics of complex systems. In this work, we propose a new
method for structural system identification (SI), uncertainty quantification,
and validation directly from data. Inspired by generative modeling frameworks,
a neural network maps random noise to physically meaningful parameters. These
parameters are then used in the known equation of motion to obtain fake
accelerations, which are compared to real training data via a mean square error
loss. To simultaneously validate the learned parameters, we use independent
validation datasets. The generated accelerations from these datasets are
evaluated by a discriminator network, which determines whether the output is
real or fake, and guides the parameter-generator network. Analytical and real
experiments show the parameter estimation accuracy and model validation for
different nonlinear structural systems.
[LINK]
http://arxiv.org/abs/2506.20799v1
[DATE]
2025-06-26 03:43:23+08:00
[CATEGORIES]
cs.LG
Stochastic Parameter Decomposition
[AUTHORS]
Lucius Bushnaq, Dan Braun, Lee Sharkey
[ABSTRACT]
A key step in reverse engineering neural networks is to decompose them into
simpler parts that can be studied in relative isolation. Linear parameter
decomposition – a framework that has been proposed to resolve several issues
with current decomposition methods – decomposes neural network parameters into
a sum of sparsely used vectors in parameter space. However, the current main
method in this framework, Attribution-based Parameter Decomposition (APD), is
impractical on account of its computational cost and sensitivity to
hyperparameters. In this work, we introduce Stochastic Parameter
Decomposition (SPD), a method that is more scalable and robust to
hyperparameters than APD, which we demonstrate by decomposing models that are
slightly larger and more complex than was possible to decompose with APD. We
also show that SPD avoids other issues, such as shrinkage of the learned
parameters, and better identifies ground truth mechanisms in toy models. By
bridging causal mediation analysis and network decomposition methods, this
demonstration opens up new research possibilities in mechanistic
interpretability by removing barriers to scaling linear parameter decomposition
methods to larger models. We release a library for running SPD and reproducing
our experiments at https://github.com/goodfire-ai/spd.
[LINK]
http://arxiv.org/abs/2506.20790v1
[DATE]
2025-06-26 03:26:31+08:00
[CATEGORIES]
cs.LG
Spiking Neural Networks for SAR Interferometric Phase Unwrapping: A Theoretical Framework for Energy-Efficient Processing
[AUTHORS]
Marc Bara
[ABSTRACT]
We present the first theoretical framework for applying spiking neural
networks (SNNs) to synthetic aperture radar (SAR) interferometric phase
unwrapping. Despite extensive research in both domains, our comprehensive
literature review confirms that SNNs have never been applied to phase
unwrapping, representing a significant gap in current methodologies. As Earth
observation data volumes continue to grow exponentially (with missions like
NISAR expected to generate 100PB in two years) energy-efficient processing
becomes critical for sustainable data center operations. SNNs, with their
event-driven computation model, offer potential energy savings of 30-100x
compared to conventional approaches while maintaining comparable accuracy. We
develop spike encoding schemes specifically designed for wrapped phase data,
propose SNN architectures that leverage the spatial propagation nature of phase
unwrapping, and provide theoretical analysis of computational complexity and
convergence properties. Our framework demonstrates how the temporal dynamics
inherent in SNNs can naturally model the spatial continuity constraints
fundamental to phase unwrapping. This work opens a new research direction at
the intersection of neuromorphic computing and SAR interferometry, offering a
complementary approach to existing algorithms that could enable more
sustainable large-scale InSAR processing.
[COMMENTS]
8 pages, 2 figures, patent pending
[LINK]
http://arxiv.org/abs/2506.20782v1
[DATE]
2025-06-26 03:12:16+08:00
[CATEGORIES]
cs.LG
Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon
[AUTHORS]
Tongtong Liang, Dan Qiao, Yu-Xiang Wang, Rahul Parhi
[ABSTRACT]
We study the implicit bias of flatness / low (loss) curvature and its effects
on generalization in two-layer overparameterized ReLU networks with
multivariate inputs – a problem well motivated by the minima stability and
edge-of-stability phenomena in gradient-descent training. Existing work either
requires interpolation or focuses only on univariate inputs. This paper
presents new and somewhat surprising theoretical results for multivariate
inputs. On two natural settings (1) generalization gap for flat solutions, and
(2) mean-squared error (MSE) in nonparametric function estimation by stable
minima, we prove upper and lower bounds, which establish that while flatness
does imply generalization, the resulting rates of convergence necessarily
deteriorate exponentially as the input dimension grows. This gives an
exponential separation between the flat solutions vis-à-vis low-norm
solutions (i.e., weight decay), which are known not to suffer from the curse of
dimensionality. In particular, our minimax lower bound construction, based on a
novel packing argument with boundary-localized ReLU neurons, reveals how flat
solutions can exploit a kind of “neural shattering” where neurons rarely
activate, but with high weight magnitudes. This leads to poor performance in
high dimensions. We corroborate these theoretical findings with extensive
numerical simulations. To the best of our knowledge, our analysis provides the
first systematic explanation for why flat minima may fail to generalize in high
dimensions.
[COMMENTS]
Comments Welcome!
[LINK]
http://arxiv.org/abs/2506.20779v1
[DATE]
2025-06-26 03:10:03+08:00
[CATEGORIES]
cs.LG
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
[AUTHORS]
Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Nagabandi, Abhishek Gupta, Sergey Levine
[ABSTRACT]
Robotic control policies learned from human demonstrations have achieved
impressive results in many real-world applications. However, in scenarios where
initial performance is not satisfactory, as is often the case in novel
open-world settings, such behavioral cloning (BC)-learned policies typically
require collecting additional human demonstrations to further improve their
behavior – an expensive and time-consuming process. In contrast, reinforcement
learning (RL) holds the promise of enabling autonomous online policy
improvement, but often falls short of achieving this due to the large number of
samples it typically requires. In this work we take steps towards enabling fast
autonomous adaptation of BC-trained policies via efficient real-world RL.
Focusing in particular on diffusion policies – a state-of-the-art BC
methodology – we propose diffusion steering via reinforcement learning (DSRL):
adapting the BC policy by running RL over its latent-noise space. We show that
DSRL is highly sample efficient, requires only black-box access to the BC
policy, and enables effective real-world autonomous policy improvement.
Furthermore, DSRL avoids many of the challenges associated with finetuning
diffusion policies, obviating the need to modify the weights of the base policy
at all. We demonstrate DSRL on simulated benchmarks, real-world robotic tasks,
and for adapting pretrained generalist policies, illustrating its sample
efficiency and effective performance at real-world policy improvement.
[LINK]
http://arxiv.org/abs/2506.15799v2
[DATE]
2025-06-26 03:09:52+08:00
[CATEGORIES]
cs.LG
Revealing higher-order neural representations of uncertainty with the Noise Estimation through Reinforcement-based Diffusion (NERD) model
[AUTHORS]
Hojjat Azimi Asrari, Megan A. K. Peters
[ABSTRACT]
Studies often aim to reveal “first-order” representations (FORs), which
encode aspects of an observer's environment, such as contents or structure. A
less-common target is “higher-order” representations (HORs), which are
“about” FORs -- e.g., their strength or uncertainty -- and which may
contribute to learning. HORs about uncertainty are unlikely to be direct
“read-outs” of FOR characteristics, instead reflecting noisy estimation
processes incorporating prior expectations about uncertainty, but how the brain
represents such expected uncertainty distributions remains largely unexplored.
Here, we study “noise expectation” HORs using neural data from a task which
may require the brain to learn about its own noise: decoded neurofeedback,
wherein human subjects learn to volitionally produce target neural patterns. We
develop and apply a Noise Estimation through Reinforcement-based Diffusion
(NERD) model to characterize how brains may undertake this process, and show
that NERD offers high explanatory power for human behavior.
[COMMENTS]
27 pages, 7 figures, 12 equations
[LINK]
http://arxiv.org/abs/2503.14333v2
[DATE]
2025-06-26 03:04:21+08:00
[CATEGORIES]
cs.LG
Stochastic and Non-local Closure Modeling for Nonlinear Dynamical Systems via Latent Score-based Generative Models
[AUTHORS]
Xinghao Dong, Huchen Yang, Jin-Long Wu
[ABSTRACT]
We propose a latent score-based generative AI framework for learning
stochastic, non-local closure models and constitutive laws in nonlinear
dynamical systems of computational mechanics. This work addresses a key
challenge of modeling complex multiscale dynamical systems without a clear
scale separation, for which numerically resolving all scales is prohibitively
expensive, e.g., for engineering turbulent flows. While classical closure
modeling methods leverage domain knowledge to approximate subgrid-scale
phenomena, their deterministic and local assumptions can be too restrictive in
regimes lacking a clear scale separation. Recent developments of
diffusion-based stochastic models have shown promise in the context of closure
modeling, but their prohibitive computational inference cost limits their use
in many real-world applications. This work addresses this
limitation by jointly training convolutional autoencoders with conditional
diffusion models in the latent spaces, significantly reducing the
dimensionality of the sampling process while preserving essential physical
characteristics. Numerical results demonstrate that the joint training approach
helps discover a proper latent space that not only guarantees small
reconstruction errors but also ensures good performance of the diffusion model
in the latent space. When integrated into numerical simulations, the proposed
stochastic modeling framework via latent conditional diffusion models achieves
significant computational acceleration while maintaining comparable predictive
accuracy to standard diffusion models in physical spaces.
[LINK]
http://arxiv.org/abs/2506.20771v1
[DATE]
2025-06-26 03:04:02+08:00
[CATEGORIES]
cs.LG
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
[AUTHORS]
Advik Raj Basani, Xiao Zhang
[ABSTRACT]
LLMs have shown impressive capabilities across various natural language
processing tasks, yet remain vulnerable to input prompts, known as jailbreak
attacks, carefully designed to bypass safety guardrails and elicit harmful
responses. Traditional methods rely on manual heuristics but suffer from
limited generalizability. Despite being automatic, optimization-based attacks
often produce unnatural prompts that can be easily detected by safety filters
or require high computational costs due to discrete token optimization. In this
paper, we introduce Generative Adversarial Suffix Prompter (GASP), a novel
automated framework that can efficiently generate human-readable jailbreak
prompts in a fully black-box setting. In particular, GASP leverages latent
Bayesian optimization to craft adversarial suffixes by efficiently exploring
continuous latent embedding spaces, gradually optimizing the suffix prompter to
improve attack efficacy while balancing prompt coherence via a targeted
iterative refinement procedure. Through comprehensive experiments, we show that
GASP can produce natural adversarial prompts, significantly improving jailbreak
success over baselines, reducing training times, and accelerating inference
speed, thus making it an efficient and scalable solution for red-teaming LLMs.
[COMMENTS]
38 pages, 8 tables, 18 figures
[LINK]
http://arxiv.org/abs/2411.14133v2
[DATE]
2025-06-26 03:01:33+08:00
[CATEGORIES]
cs.LG
Control and optimization for Neural Partial Differential Equations in Supervised Learning
[AUTHORS]
Alain Bensoussan, Minh-Binh Tran, Bangjie Wang
[ABSTRACT]
Although there is a substantial body of literature on control and
optimization problems for parabolic and hyperbolic systems, the specific
problem of controlling and optimizing the coefficients of the associated
operators within such systems has not yet been thoroughly explored. In this
work, we aim to initiate a line of research in control theory focused on
optimizing and controlling the coefficients of these operators, a problem that
naturally arises in the context of neural networks and supervised learning.
In supervised learning, the primary objective is to transport initial data
toward target data through the layers of a neural network. We propose a novel
perspective: neural networks can be interpreted as partial differential
equations (PDEs). From this viewpoint, the control problem traditionally
studied in the context of ordinary differential equations (ODEs) is
reformulated as a control problem for PDEs, specifically targeting the
optimization and control of coefficients in parabolic and hyperbolic operators.
To the best of our knowledge, this specific problem has not yet been
systematically addressed in the control theory of PDEs.
To this end, we propose a dual system formulation for the control and
optimization problem associated with parabolic PDEs, laying the groundwork for
the development of efficient numerical schemes in future research. We also
provide a theoretical proof showing that the control and optimization problem
for parabolic PDEs admits minimizers. Finally, we investigate the control
problem associated with hyperbolic PDEs and prove the existence of solutions
for a corresponding approximated control problem.
[LINK]
http://arxiv.org/abs/2506.20764v1
[DATE]
2025-06-26 02:54:48+08:00
[CATEGORIES]
cs.LG
Characterization and Mitigation of Training Instabilities in Microscaling Formats
[AUTHORS]
Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand
[ABSTRACT]
Training large language models is an expensive, compute-bound process that
must be repeated as models scale, algorithms improve, and new data is
collected. To address this, next-generation hardware accelerators increasingly
support lower-precision arithmetic formats, such as the Microscaling (MX)
formats introduced in NVIDIA’s Blackwell architecture. These formats use a
shared scale within blocks of parameters to extend representable range and
perform forward/backward GEMM operations in reduced precision for efficiency
gains. In this work, we investigate the challenges and viability of
block-scaled precision formats during model training. Across nearly one
thousand language models trained from scratch – spanning compute budgets from
$2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad
range of weight-activation precision combinations – we consistently observe
that training in MX formats exhibits sharp, stochastic instabilities in the
loss, particularly at larger compute scales. To explain this phenomenon, we
conduct controlled experiments and ablations on a smaller proxy model that
exhibits behavior similar to that of the language model, sweeping across architectural
settings, hyperparameters, and precision formats. These experiments motivate a
simple model in which multiplicative gradient bias introduced by the
quantization of layer-norm affine parameters and a small fraction of
activations can trigger runaway divergence. Through in situ intervention
experiments on our proxy model, we demonstrate that instabilities can be
averted or delayed by modifying precision schemes mid-training. Guided by these
findings, we evaluate stabilization strategies in the LLM setting and show that
certain hybrid configurations recover performance competitive with
full-precision training. We release our code at
https://github.com/Hither1/systems-scaling.
[COMMENTS]
14 pages + appendices
[LINK]
http://arxiv.org/abs/2506.20752v1
[DATE]
2025-06-26 02:25:08+08:00
[CATEGORIES]
cs.LG
Multiple Streams of Relation Extraction: Enriching and Recalling in Transformers
[AUTHORS]
Todd Nief, David Reber, Sean Richardson, Ari Holtzman
[ABSTRACT]
When an LLM learns a relation during finetuning (e.g., new movie releases,
corporate mergers, etc.), where does this information go? Is it extracted when
the model processes an entity, recalled just-in-time before a prediction, or
are there multiple separate heuristics? Existing localization approaches (e.g.
activation patching) are ill-suited for this analysis because they tend to
replace parts of the residual stream, potentially deleting information. To fill
this gap, we propose dynamic weight-grafting between fine-tuned and pre-trained
language models to show that fine-tuned language models both (1) extract
relation information learned during finetuning while processing entities and
(2) “recall” this information in later layers while generating predictions. In
some cases, models need both of these pathways to correctly generate finetuned
information while, in other cases, a single “enrichment” or “recall” pathway
alone is sufficient. We examine the necessity and sufficiency of these
information pathways, examining what layers they occur at, how much redundancy
they exhibit, and which model components are involved -- finding that the
“recall” pathway occurs via both task-specific attention mechanisms and a
relation extraction step in the output of the attention and the feedforward
networks at the final layers before next token prediction.
[LINK]
http://arxiv.org/abs/2506.20746v1
[DATE]
2025-06-26 02:13:34+08:00
[CATEGORIES]
cs.LG
A Survey of AI for Materials Science: Foundation Models, LLM Agents, Datasets, and Tools
[AUTHORS]
Minh-Hao Van, Prateek Verma, Chen Zhao, Xintao Wu
[ABSTRACT]
Foundation models (FMs) are catalyzing a transformative shift in materials
science (MatSci) by enabling scalable, general-purpose, and multimodal AI
systems for scientific discovery. Unlike traditional machine learning models,
which are typically narrow in scope and require task-specific engineering, FMs
offer cross-domain generalization and exhibit emergent capabilities. Their
versatility is especially well-suited to materials science, where research
challenges span diverse data types and scales. This survey provides a
comprehensive overview of foundation models, agentic systems, datasets, and
computational tools supporting this growing field. We introduce a task-driven
taxonomy encompassing six broad application areas: data extraction,
interpretation and Q&A; atomistic simulation; property prediction; materials
structure, design and discovery; process planning, discovery, and optimization;
and multiscale modeling. We discuss recent advances in both unimodal and
multimodal FMs, as well as emerging large language model (LLM) agents.
Furthermore, we review standardized datasets, open-source tools, and autonomous
experimental platforms that collectively fuel the development and integration
of FMs into research workflows. We assess the early successes of foundation
models and identify persistent limitations, including challenges in
generalizability, interpretability, data imbalance, safety concerns, and
limited multimodal fusion. Finally, we articulate future research directions
centered on scalable pretraining, continual learning, data governance, and
trustworthiness.
[LINK]
http://arxiv.org/abs/2506.20743v1
[DATE]
2025-06-26 02:10:30+08:00
[CATEGORIES]
cs.LG
Test-time Scaling Techniques in Theoretical Physics – A Comparison of Methods on the TPBench Dataset
[AUTHORS]
Zhiqi Gao, Tianyi Li, Yurii Kvasiuk, Sai Chaitanya Tadepalli, Maja Rudolph, Daniel J. H. Chung, Frederic Sala, Moritz Münchmeyer
[ABSTRACT]
Large language models (LLMs) have shown strong capabilities in complex
reasoning, and test-time scaling techniques can enhance their performance with
comparatively low cost. Many of these methods have been developed and evaluated on
mathematical reasoning benchmarks such as AIME. This paper investigates whether
the lessons learned from these benchmarks generalize to the domain of advanced
theoretical physics. We evaluate a range of common test-time scaling methods on
the TPBench physics dataset and compare their effectiveness with results on
AIME. To better leverage the structure of physics problems, we develop a novel,
symbolic weak-verifier framework to improve parallel scaling results. Our
empirical results demonstrate that this method significantly outperforms
existing test-time scaling approaches on TPBench. We also evaluate our method
on AIME, confirming its effectiveness in solving advanced mathematical
problems. Our findings highlight the power of step-wise symbolic verification
for tackling complex scientific problems.
[COMMENTS]
23 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.20729v1
[DATE]
2025-06-26 02:00:18+08:00
[CATEGORIES]
cs.LG
On Convolutions, Intrinsic Dimension, and Diffusion Models
[AUTHORS]
Kin Kwan Leung, Rasa Hosseinzadeh, Gabriel Loaiza-Ganem
[ABSTRACT]
The manifold hypothesis asserts that data of interest in high-dimensional
ambient spaces, such as image data, lies on unknown low-dimensional
submanifolds. Diffusion models (DMs) – which operate by convolving data with
progressively larger amounts of Gaussian noise and then learning to revert this
process – have risen to prominence as the most performant generative models,
and are known to be able to learn distributions with low-dimensional support.
For a given datum in one of these submanifolds, we should thus intuitively
expect DMs to have implicitly learned its corresponding local intrinsic
dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari
et al. (2024b) recently showed that this is indeed the case by linking this LID
to the rate of change of the log marginal densities of the DM with respect to
the amount of added noise, resulting in an LID estimator known as FLIPD. LID
estimators such as FLIPD have a plethora of uses: among others, they quantify
the complexity of a given datum, and can be used to detect outliers,
adversarial examples and AI-generated text. FLIPD achieves state-of-the-art
performance at LID estimation, yet its theoretical underpinnings are incomplete
since Kamkari et al. (2024b) only proved its correctness under the highly
unrealistic assumption of affine submanifolds. In this work we bridge this gap
by formally proving the correctness of FLIPD under realistic assumptions.
Additionally, we show that an analogous result holds when Gaussian convolutions
are replaced with uniform ones, and discuss the relevance of this result.
[LINK]
http://arxiv.org/abs/2506.20705v1
[DATE]
2025-06-26 02:00:00+08:00
[CATEGORIES]
cs.LG
Diffusion Tree Sampling: Scalable inference-time alignment of diffusion models
[AUTHORS]
Vineet Jain, Kusha Sareen, Mohammad Pedramfar, Siamak Ravanbakhsh
[ABSTRACT]
Adapting a pretrained diffusion model to new objectives at inference time
remains an open problem in generative modeling. Existing steering methods
suffer from inaccurate value estimation, especially at high noise levels, which
biases guidance. Moreover, information from past runs is not reused to improve
sample quality, resulting in inefficient use of compute. Inspired by the
success of Monte Carlo Tree Search, we address these limitations by casting
inference-time alignment as a search problem that reuses past computations. We
introduce a tree-based approach that samples from the reward-aligned target
density by propagating terminal rewards back through the diffusion chain and
iteratively refining value estimates with each additional generation. Our
proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact
samples from the target distribution in the limit of infinite rollouts, and its
greedy variant, Diffusion Tree Search (DTS$^\star$), performs a global search
for high reward samples. On MNIST and CIFAR-10 class-conditional generation,
DTS matches the FID of the best-performing baseline with up to $10\times$ less
compute. In text-to-image generation and language completion tasks, DTS$^\star$
effectively searches for high reward samples that match best-of-N with up to
$5\times$ less compute. By reusing information from previous generations, we
get an anytime algorithm that turns additional compute into steadily better
samples, providing a scalable approach for inference-time alignment of
diffusion models.
[LINK]
http://arxiv.org/abs/2506.20701v1
[DATE]
2025-06-26 01:59:10+08:00
[CATEGORIES]
cs.LG
DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy
[AUTHORS]
Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani
[ABSTRACT]
We propose DemoDiffusion, a simple and scalable method for enabling robots to
perform manipulation tasks in natural environments by imitating a single human
demonstration. Our approach is based on two key insights. First, the hand
motion in a human demonstration provides a useful prior for the robot’s
end-effector trajectory, which we can convert into a rough open-loop robot
motion trajectory via kinematic retargeting. Second, while this retargeted
motion captures the overall structure of the task, it may not align well with
plausible robot actions in-context. To address this, we leverage a pre-trained
generalist diffusion policy to modify the trajectory, ensuring it both follows
the human motion and remains within the distribution of plausible robot
actions. Our approach avoids the need for online reinforcement learning or
paired human-robot data, enabling robust adaptation to new tasks and scenes
with minimal manual effort. Experiments in both simulation and real-world
settings show that DemoDiffusion outperforms both the base policy and the
retargeted trajectory, enabling the robot to succeed even on tasks where the
pre-trained generalist policy fails entirely. Project page:
https://demodiffusion.github.io/
[COMMENTS]
Preprint (17 pages). Under review
[LINK]
http://arxiv.org/abs/2506.20668v1
[DATE]
2025-06-26 01:59:01+08:00
[CATEGORIES]
cs.LG
Data Quality in Crowdsourcing and Spamming Behavior Detection
[AUTHORS]
Yang Ba, Michelle V. Mancenido, Erin K. Chiou, Rong Pan
[ABSTRACT]
As crowdsourcing emerges as an efficient and cost-effective method for
obtaining labels for machine learning datasets, it is important to assess the
quality of crowd-provided data, so as to improve analysis performance and
reduce biases in subsequent machine learning tasks. Given the lack of ground
truth in most cases of crowdsourcing, we refer to data quality as annotators’
consistency and credibility. Unlike the simple scenarios where Kappa
coefficient and intraclass correlation coefficient usually can apply, online
crowdsourcing requires dealing with more complex situations. We introduce a
systematic method for evaluating data quality and detecting spamming threats
via variance decomposition, and we classify spammers into three categories
based on their different behavioral patterns. A spammer index is proposed to
assess entire data consistency, and two metrics are developed to measure crowd
workers’ credibility by utilizing the Markov chain and generalized random
effects models. Furthermore, we showcase the practicality of our techniques and
their advantages by applying them on a face verification task with both
simulation and real-world data collected from two crowdsourcing platforms.
[COMMENTS]
Preprint paper, accepted at Behavior Research Methods. 56 pages, 14
figures
[LINK]
http://arxiv.org/abs/2404.17582v2
[DATE]
2025-06-26 01:56:08+08:00
[CATEGORIES]
cs.LG
IRanker: Towards Ranking Foundation Model
[AUTHORS]
Tao Feng, Zhigang Hua, Zijie Lei, Yan Xie, Shuang Yang, Bo Long, Jiaxuan You
[ABSTRACT]
Ranking tasks are ubiquitous, encompassing applications such as
recommendation systems, LLM routing, and item re-ranking. We propose to unify
these tasks using a single ranking foundation model (FM), as it eliminates the
need for designing different models for each specific ranking task. However,
unlike general supervision tasks in LLMs, ranking tasks do not have clear
labels for supervision, posing great challenges to developing a ranking FM. To
overcome these challenges, we propose IRanker, a ranking FM framework with
reinforcement learning (RL) and iterative decoding. Our insight is to decompose
the complex ranking task into an iterative decoding process that eliminates the
worst candidate from the candidate pool step by step, which significantly
reduces the output combinatorial space and better utilizes the limited context
length during RL training. We meticulously train and comprehensively evaluate
an IRanker-3B model on nine datasets across three scenarios: recommendation,
routing, and passage ranking. The results show that a single IRanker-3B
achieves state-of-the-art results on several datasets compared to models of
similar size, and even surpasses the performance of larger models on certain
datasets. We further demonstrate the effectiveness of our RL design and the
robustness of the iterative mechanism across different LLM sizes. Moreover, we
conducted both in-domain and out-of-domain zero-shot generalization
experiments, which showed that IRanker-3B generalized well on in-domain
ranking tasks, improving on the base LLM by at least 5%.
Surprisingly, on out-of-domain generic LLM tasks, IRanker-3B outperformed the
base model by at least 9% on GSM8K, IFEval, and MathQA. In addition, the
thoughts generated by IRanker-3B during training could further enhance
zero-shot LLM performance.
[LINK]
http://arxiv.org/abs/2506.21638v1
[DATE]
2025-06-26 01:56:06+08:00
[CATEGORIES]
cs.LG
Mastering Multiple-Expert Routing: Realizable $H$-Consistency and Strong Guarantees for Learning to Defer
[AUTHORS]
Anqi Mao, Mehryar Mohri, Yutao Zhong
[ABSTRACT]
The problem of learning to defer with multiple experts consists of optimally
assigning input instances to experts, balancing the trade-off between their
accuracy and computational cost. This is a critical challenge in natural
language generation, but also in other fields such as image processing, and
medical diagnostics. Recent studies have proposed surrogate loss functions to
optimize deferral, but challenges remain in ensuring their consistency
properties. This paper introduces novel surrogate loss functions and efficient
algorithms with strong theoretical learning guarantees. We address open
questions regarding realizable $H$-consistency, $H$-consistency bounds, and
Bayes-consistency for both single-stage (jointly learning predictor and
deferral function) and two-stage (learning only the deferral function with a
fixed expert) learning scenarios. For single-stage deferral, we introduce a
family of new realizable $H$-consistent surrogate losses and further prove
$H$-consistency for a selected member. For two-stage deferral, we derive new
surrogate losses that achieve realizable $H$-consistency, $H$-consistency
bounds, and Bayes-consistency for the two-expert scenario and, under natural
assumptions, the multiple-expert scenario. Additionally, we provide enhanced
theoretical guarantees under low-noise assumptions for both scenarios. Finally,
we report the results of experiments using our proposed surrogate losses,
comparing their performance against existing baselines.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.20650v1
[DATE]
2025-06-26 01:48:58+08:00
[CATEGORIES]
cs.LG
Efficient Federated Learning with Encrypted Data Sharing for Data-Heterogeneous Edge Devices
[AUTHORS]
Hangyu Li, Hongyue Wu, Guodong Fan, Zhen Zhang, Shizhan Chen, Zhiyong Feng
[ABSTRACT]
As privacy protection gains increasing importance, more models are being
trained on edge devices and subsequently merged into the central server through
Federated Learning (FL). However, current research overlooks the impact of
network topology, physical distance, and data heterogeneity on edge devices,
leading to issues such as increased latency and degraded model performance. To
address these issues, we propose a new federated learning scheme on edge
devices called Federated Learning with Encrypted Data Sharing (FedEDS).
FedEDS uses the client model and the model’s stochastic layer to train the data
encryptor. The data encryptor generates encrypted data and shares it with other
clients. The client uses the corresponding client’s stochastic layer and
encrypted data to train and adjust the local model. FedEDS uses the client’s
local private data and encrypted shared data from other clients to train the
model. This approach accelerates the convergence speed of federated learning
training and mitigates the negative impact of data heterogeneity, making it
suitable for application services deployed on edge devices requiring rapid
convergence. Experimental results show the efficacy of FedEDS in promoting model
performance.
[COMMENTS]
Accepted by ICWS 2025
[LINK]
http://arxiv.org/abs/2506.20644v1
[DATE]
2025-06-26 01:40:54+08:00
[CATEGORIES]
cs.LG
Balancing the Scales: A Theoretical and Algorithmic Framework for Learning from Imbalanced Data
[AUTHORS]
Corinna Cortes, Anqi Mao, Mehryar Mohri, Yutao Zhong
[ABSTRACT]
Class imbalance remains a major challenge in machine learning, especially in
multi-class problems with long-tailed distributions. Existing methods, such as
data resampling, cost-sensitive techniques, and logistic loss modifications,
though popular and often effective, lack solid theoretical foundations. As an
example, we demonstrate that cost-sensitive methods are not Bayes-consistent.
This paper introduces a novel theoretical framework for analyzing
generalization in imbalanced classification. We then propose a new
class-imbalanced margin loss function for both binary and multi-class settings,
prove its strong $H$-consistency, and derive corresponding learning guarantees
based on empirical loss and a new notion of class-sensitive Rademacher
complexity. Leveraging these theoretical results, we devise novel and general
learning algorithms, IMMAX (Imbalanced Margin Maximization), which incorporate
confidence margins and are applicable to various hypothesis sets. While our
focus is theoretical, we also present extensive empirical results demonstrating
the effectiveness of our algorithms compared to existing baselines.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2502.10381v2
[DATE]
2025-06-26 01:36:30+08:00
[CATEGORIES]
cs.LG
First-order methods for stochastic and finite-sum convex optimization with deterministic constraints
[AUTHORS]
Zhaosong Lu, Yifeng Xiao
[ABSTRACT]
In this paper, we study a class of stochastic and finite-sum convex
optimization problems with deterministic constraints. Existing methods
typically aim to find an $\epsilon$-$expectedly\ feasible\ stochastic\ optimal$
solution, in which the expected constraint violation and expected optimality
gap are both within a prescribed tolerance $\epsilon$. However, in many
practical applications, constraints must be nearly satisfied with certainty,
rendering such solutions potentially unsuitable due to the risk of substantial
violations. To address this issue, we propose stochastic first-order methods
for finding an $\epsilon$-$surely\ feasible\ stochastic\ optimal$
($\epsilon$-SFSO) solution, where the constraint violation is deterministically
bounded by $\epsilon$ and the expected optimality gap is at most $\epsilon$.
Our methods apply an accelerated stochastic gradient (ASG) scheme or a modified
variance-reduced ASG scheme $only\ once$ to a sequence of quadratic penalty
subproblems with appropriately chosen penalty parameters. We establish
first-order oracle complexity bounds for the proposed methods in computing an
$\epsilon$-SFSO solution. As a byproduct, we also derive first-order oracle
complexity results for the sample average approximation method in computing an
$\epsilon$-SFSO solution of the stochastic optimization problem using our
proposed methods to solve the sample average problem.
[COMMENTS]
41 pages
[LINK]
http://arxiv.org/abs/2506.20630v1
[DATE]
2025-06-26 01:26:02+08:00
[CATEGORIES]
cs.LG
On Context-Content Uncertainty Principle
[AUTHORS]
Xin Li
[ABSTRACT]
The Context-Content Uncertainty Principle (CCUP) proposes that inference
under uncertainty is governed by an entropy asymmetry between context and
content: high-entropy contexts must be interpreted through alignment with
low-entropy, structured content. In this paper, we develop a layered
computational framework that derives operational principles from this
foundational asymmetry. At the base level, CCUP formalizes inference as
directional entropy minimization, establishing a variational gradient that
favors content-first structuring. Building upon this, we identify four
hierarchical layers of operational principles: (\textbf{L1}) \emph{Core
Inference Constraints}, including structure-before-specificity, asymmetric
inference flow, cycle-consistent bootstrapping, and conditional compression,
all shown to be mutually reducible; (\textbf{L2}) \emph{Resource Allocation
Principles}, such as precision-weighted attention, asymmetric learning rates,
and attractor-based memory encoding; (\textbf{L3}) \emph{Temporal Bootstrapping
Dynamics}, which organize learning over time via structure-guided curricula;
and (\textbf{L4}) \emph{Spatial Hierarchical Composition}, which integrates
these mechanisms into self-organizing cycles of memory, inference, and
planning. We present formal equivalence theorems, a dependency lattice among
principles, and computational simulations demonstrating the efficiency gains of
CCUP-aligned inference. This work provides a unified theoretical foundation for
understanding how brains and machines minimize uncertainty through recursive
structure-specificity alignment. The brain is not just an inference machine. It
is a cycle-consistent entropy gradient resolver, aligning structure and
specificity via path-dependent, content-seeded simulation.
[LINK]
http://arxiv.org/abs/2506.20699v1
[DATE]
2025-06-26 01:21:19+08:00
[CATEGORIES]
cs.LG
Probing Quantum Spin Systems with Kolmogorov-Arnold Neural Network Quantum States
[AUTHORS]
Mahmud Ashraf Shamim, Eric A F Reinhardt, Talal Ahmed Chowdhury, Sergei Gleyzer, Paulo T Araujo
[ABSTRACT]
Neural Quantum States (NQS) are a class of variational wave functions
parametrized by neural networks (NNs) to study quantum many-body systems. In
this work, we propose \texttt{SineKAN}, an NQS \textit{ansatz} based on
Kolmogorov-Arnold Networks (KANs), to represent quantum mechanical wave
functions as nested univariate functions. We show that the \texttt{SineKAN}
wavefunction with learnable sinusoidal activation functions can capture the
ground state energies, fidelities and various correlation functions of the one
dimensional Transverse-Field Ising model, Anisotropic Heisenberg model, and
Antiferromagnetic $J_{1}-J_{2}$ model with different chain lengths. In our
study of the $J_1-J_2$ model with $L=100$ sites, we find that the
\texttt{SineKAN} model outperforms several previously explored neural quantum
state \textit{ans\"atze}, including Restricted Boltzmann Machines (RBMs), Long
Short-Term Memory models (LSTMs), and Multi-layer Perceptrons (MLPs)
\textit{a.k.a.} Feed Forward Neural Networks, when compared to the results
obtained from the Density Matrix Renormalization Group (DMRG) algorithm. We
find that \texttt{SineKAN} models can be trained to high precisions and
accuracies with minimal computational costs.
[COMMENTS]
16 pages, 13 figures
[LINK]
http://arxiv.org/abs/2506.01891v3
[DATE]
2025-06-26 01:17:27+08:00
[CATEGORIES]
cs.LG
Lost in Retraining: Roaming the Parameter Space of Exponential Families Under Closed-Loop Learning
[AUTHORS]
Fariba Jangjoo, Matteo Marsili, Yasser Roudi
[ABSTRACT]
Closed-loop learning is the process of repeatedly estimating a model from
data generated from the model itself. It is receiving great attention due to
the possibility that large neural network models may, in the future, be
primarily trained with data generated by artificial neural networks themselves.
We study this process for models that belong to exponential families, deriving
equations of motion that govern the dynamics of the parameters. We show that
maximum likelihood estimation of the parameters endows sufficient statistics
with the martingale property and that as a result the process converges to
absorbing states that amplify initial biases present in the data. However, we
show that this outcome may be prevented by polluting the data with an
infinitesimal fraction of data points generated from a fixed model, by relying
on maximum a posteriori estimation or by introducing regularisation.
Furthermore, we show that the asymptotic behavior of the dynamics is not
reparametrisation invariant.
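A toy simulation makes the absorbing-state behaviour concrete. The sketch below assumes a Bernoulli model as the simplest exponential family (this choice and the function name `closed_loop_bernoulli` are illustrative, not taken from the paper): each round draws samples from the current model and refits the parameter by maximum likelihood.

```python
import random
from typing import List


def closed_loop_bernoulli(
    p0: float, n_samples: int, n_rounds: int, seed: int = 0
) -> List[float]:
    """Refit a Bernoulli parameter on samples drawn from the previous fit."""
    rng = random.Random(seed)
    p = p0
    trajectory = [p]
    for _ in range(n_rounds):
        draws = [1 if rng.random() < p else 0 for _ in range(n_samples)]
        p = sum(draws) / n_samples  # maximum likelihood estimate
        trajectory.append(p)
    return trajectory


trajectory = closed_loop_bernoulli(p0=0.5, n_samples=10, n_rounds=2000)
print(trajectory[-1])
```

The sample mean is the sufficient statistic here, and its expectation under the current model equals the current parameter, so the parameter performs an unbiased random walk until it hits 0 or 1 and stays there.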
[COMMENTS]
13 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.20623v1
[DATE]
2025-06-26 01:12:22+08:00
[CATEGORIES]
cs.LG
Do Concept Bottleneck Models Respect Localities?
[AUTHORS]
Naveen Raman, Mateo Espinosa Zarlenga, Juyeon Heo, Mateja Jamnik
[ABSTRACT]
Concept-based explainability methods use human-understandable intermediaries
to produce explanations for machine learning models. These methods assume
concept predictions can help understand a model’s internal reasoning. In this
work, we assess the degree to which such an assumption is true by analyzing
whether concept predictors leverage “relevant” features to make predictions, a
term we call locality. Concept-based models that fail to respect localities
also fail to be explainable because concept predictions are based on spurious
features, making the interpretation of the concept predictions vacuous. To
assess whether concept-based models respect localities, we construct and use
three metrics to characterize when models respect localities, complementing our
analysis with theoretical results. Each of our metrics captures a different
notion of perturbation and assesses whether perturbing “irrelevant” features
impacts the predictions made by a concept predictor. We find that many
concept-based models used in practice fail to respect localities because
concept predictors cannot always clearly distinguish distinct concepts. Based
on these findings, we propose suggestions for alleviating this issue.
[COMMENTS]
Published at TMLR
[LINK]
http://arxiv.org/abs/2401.01259v5
[DATE]
2025-06-26 01:10:45+08:00
[CATEGORIES]
cs.LG
From $\mathcal{O}(n^{2})$ to $\mathcal{O}(n)$ Parameters: Quantum Self-Attention in Vision Transformers for Biomedical Image Classification
[AUTHORS]
Thomas Boucher, John Whittle, Evangelos B. Mazomenos
[ABSTRACT]
We demonstrate that quantum vision transformers (QViTs), vision transformers
(ViTs) with self-attention (SA) mechanisms replaced by quantum self-attention
(QSA) mechanisms, can match state-of-the-art (SOTA) biomedical image
classifiers while using 99.99% fewer parameters. QSAs are produced by replacing
linear SA layers with parameterised quantum neural networks (QNNs), producing a
QSA mechanism and reducing parameter scaling from $\mathcal{O}(n^2)$ to
$\mathcal{O}(n)$. On RetinaMNIST, our ultra parameter-efficient QViT
outperforms 13/14 SOTA methods including CNNs and ViTs, achieving 56.5%
accuracy, just 0.88% below the top MedMamba model while using 99.99% fewer
parameters (1K vs 14.5M) and 89% fewer GFLOPs. We present the first
investigation of knowledge distillation (KD) from classical to quantum vision
transformers in biomedical image classification, showing that QViTs maintain
comparable performance to classical ViTs across eight diverse datasets spanning
multiple modalities, with improved QSA parameter-efficiency. Our higher-qubit
architecture benefitted more from KD pre-training, suggesting a scaling
relationship between QSA parameters and KD effectiveness. These findings
establish QSA as a practical architectural choice toward parameter-efficient
biomedical image analysis.
[COMMENTS]
Submitted for EMA4MICCAI 2025
[LINK]
http://arxiv.org/abs/2503.07294v2
[DATE]
2025-06-26 01:08:53+08:00
[CATEGORIES]
cs.LG
H-FEX: A Symbolic Learning Method for Hamiltonian Systems
[AUTHORS]
Jasen Lai, Senwei Liang, Chunmei Wang
[ABSTRACT]
Hamiltonian systems describe a broad class of dynamical systems governed by
Hamiltonian functions, which encode the total energy and dictate the evolution
of the system. Data-driven approaches, such as symbolic regression and neural
network-based methods, provide a means to learn the governing equations of
dynamical systems directly from observational data of Hamiltonian systems.
However, these methods often struggle to accurately capture complex Hamiltonian
functions while preserving energy conservation. To overcome this limitation, we
propose the Finite Expression Method for learning Hamiltonian Systems (H-FEX),
a symbolic learning method that introduces novel interaction nodes designed to
capture intricate interaction terms effectively. Our experiments, including
those on highly stiff dynamical systems, demonstrate that H-FEX can recover
Hamiltonian functions of complex systems that accurately capture system
dynamics and preserve energy over long time horizons. These findings highlight
the potential of H-FEX as a powerful framework for discovering closed-form
expressions of complex dynamical systems.
[COMMENTS]
16 pages, 7 figures
[LINK]
http://arxiv.org/abs/2506.20607v1
[DATE]
2025-06-26 00:53:01+08:00
[CATEGORIES]
cs.LG
LT-PINN: Lagrangian Topology-conscious Physics-informed Neural Network for Boundary-focused Engineering Optimization
[AUTHORS]
Yuanye Zhou, Zhaokun Wang, Kai Zhou, Hui Tang, Xiaofan Li
[ABSTRACT]
Physics-informed neural networks (PINNs) have emerged as a powerful meshless
tool for topology optimization, capable of simultaneously determining optimal
topologies and physical solutions. However, conventional PINNs rely on
density-based topology descriptions, which necessitate manual interpolation and
limit their applicability to complex geometries. To address this, we propose
Lagrangian topology-conscious PINNs (LT-PINNs), a novel framework for
boundary-focused engineering optimization. By parameterizing the control
variables of topology boundary curves as learnable parameters, LT-PINNs
eliminate the need for manual interpolation and enable precise boundary
determination. We further introduce a specialized boundary condition loss
function and a topology loss function to ensure sharp and accurate boundary
representations, even for intricate topologies. The accuracy and robustness of
LT-PINNs are validated via two types of partial differential equations (PDEs),
including the elastic equation with Dirichlet boundary conditions and Laplace’s
equation with Neumann boundary conditions. Furthermore, we demonstrate the
effectiveness of LT-PINNs on more complex time-dependent and time-independent
flow problems without relying on measurement data, and showcase their
engineering application potential in flow velocity rearrangement, transforming
a uniform upstream velocity into a sine-shaped downstream profile. The results
demonstrate that (1) LT-PINNs achieve substantial reductions in relative L2 errors
compared with the state-of-the-art density topology-oriented PINNs (DT-PINNs), (2)
LT-PINNs can handle arbitrary boundary conditions, making them suitable for a
wide range of PDEs, and (3) LT-PINNs can infer clear topology boundaries
without manual interpolation, especially for complex topologies.
[LINK]
http://arxiv.org/abs/2506.06300v3
[DATE]
2025-06-26 00:48:42+08:00
[CATEGORIES]
cs.LG
The kernel of graph indices for vector search
[AUTHORS]
Mariano Tepper, Ted Willke
[ABSTRACT]
The most popular graph indices for vector search use principles from
computational geometry to build the graph. Hence, their formal graph
navigability guarantees are only valid in Euclidean space. In this work, we
show that machine learning can be used to build graph indices for vector search
in metric and non-metric vector spaces (e.g., for inner product similarity).
From this novel perspective, we introduce the Support Vector Graph (SVG), a new
type of graph index that leverages kernel methods to establish the graph
connectivity and that comes with formal navigability guarantees valid in metric
and non-metric vector spaces. In addition, we interpret the most popular graph
indices, including HNSW and DiskANN, as particular specializations of SVG and
show that new indices can be derived from the principles behind this
specialization. Finally, we propose SVG-L0 that incorporates an $\ell_0$
sparsity constraint into the SVG kernel method to build graphs with a bounded
out-degree. This yields a principled way of implementing this practical
requirement, in contrast to the traditional heuristic of simply truncating the
out edges of each node. Additionally, we show that SVG-L0 has a self-tuning
property that avoids the heuristic of using a set of candidates to find the
out-edges of each node and that keeps its computational complexity in check.
[LINK]
http://arxiv.org/abs/2506.20584v1
[DATE]
2025-06-26 00:24:55+08:00
[CATEGORIES]
cs.LG
Rethinking Early Stopping: Refine, Then Calibrate
[AUTHORS]
Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach
[ABSTRACT]
Machine learning classifiers often produce probabilistic predictions that are
critical for accurate and interpretable decision-making in various domains. The
quality of these predictions is generally evaluated with proper losses, such as
cross-entropy, which decompose into two components: calibration error assesses
general under/overconfidence, while refinement error measures the ability to
distinguish different classes. In this paper, we present a novel variational
formulation of the calibration-refinement decomposition that sheds new light on
post-hoc calibration, and enables rapid estimation of the different terms.
Equipped with this new perspective, we provide theoretical and empirical
evidence that calibration and refinement errors are not minimized
simultaneously during training. Selecting the best epoch based on validation
loss thus leads to a compromise point that is suboptimal for both terms. To
address this, we propose minimizing refinement error only during training
(Refine,…), before minimizing calibration error post hoc, using standard
techniques (…then Calibrate). Our method integrates seamlessly with any
classifier and consistently improves performance across diverse classification
tasks.
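One plausible reading of this recipe is to pick the checkpoint whose validation loss is lowest after post-hoc calibration rather than before it. The sketch below assumes temperature scaling as the calibration step and a coarse grid search over temperatures; both choices, the function names, and the toy logits are assumptions made for illustration, not details from the paper.

```python
import math
from typing import Optional, Sequence, Tuple

Logits = Sequence[Sequence[float]]


def cross_entropy(logits: Logits, labels: Sequence[int], temperature: float = 1.0) -> float:
    """Mean cross-entropy of softmax(logits / temperature) against labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        scaled = [z / temperature for z in row]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[label]
    return total / len(labels)


def calibrated_loss(
    logits: Logits, labels: Sequence[int], grid: Optional[Sequence[float]] = None
) -> Tuple[float, float]:
    """Return (temperature, loss) for the best temperature on the grid."""
    if grid is None:
        grid = [0.25 * k for k in range(1, 41)]  # 0.25, 0.5, ..., 10.0
    return min(
        ((t, cross_entropy(logits, labels, t)) for t in grid),
        key=lambda pair: pair[1],
    )


def select_epoch(val_logits_per_epoch: Sequence[Logits], labels: Sequence[int]) -> int:
    """Pick the epoch with the lowest validation loss after temperature scaling."""
    losses = [calibrated_loss(logits, labels)[1] for logits in val_logits_per_epoch]
    return min(range(len(losses)), key=losses.__getitem__)


# Toy validation set: epoch 0 is cautious but only 2/4 correct,
# epoch 1 is overconfident but 3/4 correct.
val_labels = [0, 0, 1, 1]
val_logits = [
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
    [[8.0, 0.0], [8.0, 0.0], [0.0, 8.0], [8.0, 0.0]],
]
raw_losses = [cross_entropy(logits, val_labels) for logits in val_logits]
print(raw_losses, select_epoch(val_logits, val_labels))
```

In this toy case the raw validation loss favours epoch 0 because epoch 1 is heavily penalised for overconfidence, while the loss after temperature scaling favours epoch 1, which separates the classes better.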
[LINK]
http://arxiv.org/abs/2501.19195v2
[DATE]
2025-06-26 00:24:12+08:00
[CATEGORIES]
cs.LG
Causal Representation Learning with Observational Grouping for CXR Classification
[AUTHORS]
Rajat Rasal, Avinash Kori, Ben Glocker
[ABSTRACT]
Identifiable causal representation learning seeks to uncover the true causal
relationships underlying a data generation process. In medical imaging, this
presents opportunities to improve the generalisability and robustness of
task-specific latent features. This work introduces the concept of grouping
observations to learn identifiable representations for disease classification
in chest X-rays via an end-to-end framework. Our experiments demonstrate that
these causal representations improve generalisability and robustness across
multiple classification tasks when grouping is used to enforce invariance w.r.t
race, sex, and imaging views.
[LINK]
http://arxiv.org/abs/2506.20582v1
[DATE]
2025-06-26 00:17:36+08:00
[CATEGORIES]
cs.LG
Exploring Graph-Transformer Out-of-Distribution Generalization Abilities
[AUTHORS]
Itay Niv, Neta Rabin
[ABSTRACT]
Deep learning on graphs has shown remarkable success across numerous
applications, including social networks, bio-physics, traffic networks, and
recommendation systems. Regardless of their successes, current methods
frequently depend on the assumption that training and testing data share the
same distribution, a condition rarely met in real-world scenarios. While
graph-transformer (GT) backbones have recently outperformed traditional
message-passing neural networks (MPNNs) in multiple in-distribution (ID)
benchmarks, their effectiveness under distribution shifts remains largely
unexplored.
In this work, we address the challenge of out-of-distribution (OOD)
generalization for graph neural networks, with a special focus on the impact of
backbone architecture. We systematically evaluate GT and hybrid backbones in
OOD settings and compare them to MPNNs. To do so, we adapt several leading
domain generalization (DG) algorithms to work with GTs and assess their
performance on a benchmark designed to test a variety of distribution shifts.
Our results reveal that GT and hybrid GT-MPNN backbones consistently
demonstrate stronger generalization ability compared to MPNNs, even without
specialized DG algorithms.
Additionally, we propose a novel post-training analysis approach that
compares the clustering structure of the entire ID and OOD test datasets,
specifically examining domain alignment and class separation. Demonstrating its
model-agnostic design, this approach not only provides meaningful insights into
GT and MPNN backbones but also shows promise for broader applicability to DG
problems beyond graph learning, offering a deeper perspective on generalization
abilities that goes beyond standard accuracy metrics. Together, our findings
highlight the promise of graph-transformers for robust, real-world graph
learning and set a new direction for future research in OOD generalization.
[LINK]
http://arxiv.org/abs/2506.20575v1
[DATE]
2025-06-26 00:09:24+08:00
[CATEGORIES]
cs.LG
Benchmarking Unsupervised Strategies for Anomaly Detection in Multivariate Time Series
[AUTHORS]
Laura Boggia, Rafael Teixeira de Lima, Bogdan Malaescu
[ABSTRACT]
Anomaly detection in multivariate time series is an important problem across
various fields such as healthcare, financial services, manufacturing or physics
detector monitoring. Accurately identifying when unexpected errors or faults
occur is essential, yet challenging, due to the unknown nature of anomalies and
the complex interdependencies between time series dimensions. In this paper, we
investigate transformer-based approaches for time series anomaly detection,
focusing on the recently proposed iTransformer architecture. Our contributions
are fourfold: (i) we explore the application of the iTransformer to time series
anomaly detection, and analyse the influence of key parameters such as window
size, step size, and model dimensions on performance; (ii) we examine methods
for extracting anomaly labels from multidimensional anomaly scores and discuss
appropriate evaluation metrics for such labels; (iii) we study the impact of
anomalous data present during training and assess the effectiveness of
alternative loss functions in mitigating their influence; and (iv) we present a
comprehensive comparison of several transformer-based models across a diverse
set of datasets for time series anomaly detection.
[COMMENTS]
Submitted to VLDB 2026 conference, currently under review
[LINK]
http://arxiv.org/abs/2506.20574v1
[DATE]
2025-06-26 00:08:22+08:00
[CATEGORIES]
cs.LG
When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs
[AUTHORS]
Ammar Khairi, Daniel D’souza, Ye Shen, Julia Kreutzer, Sara Hooker
[ABSTRACT]
Recent advancements in large language models (LLMs) have shifted focus toward
scaling inference-time compute, improving performance without retraining the
model. A common approach is to sample multiple outputs in parallel, and select
one of these as the final output. However, work to date has focused on English
and a handful of domains such as math and code. In contrast, we are most
interested in techniques that generalize across open-ended tasks, formally
verifiable tasks, and across languages. In this work, we study how to robustly
scale inference-time compute for open-ended generative tasks in a multilingual,
multi-task setting.
Our findings show that both the sampling strategy based on temperature variation
and the selection strategy must be adapted to account for diverse domains and
varied language settings. We evaluate existing selection methods, revealing
that strategies effective in English often fail to generalize across languages.
We propose novel sampling and selection strategies specifically adapted for
multilingual and multi-task inference scenarios, and show they yield notable
gains across languages and tasks. In particular, our combined sampling and
selection methods lead to an average +6.8 jump in win-rates for our 8B models
on m-ArenaHard-v2.0 prompts, against proprietary models such as Gemini. At
larger scale, Command-A (111B model) equipped with our methods shows a +9.0
improvement in win-rates on the same benchmark with just five samples against
single-sample decoding, a substantial increase at minimal cost. Our results
underscore the need for language- and task-aware approaches to inference-time
compute, aiming to democratize performance improvements in underrepresented
languages.
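The generic sample-then-select pattern that this work builds on can be sketched in a few lines. The code below is only a skeleton under stated assumptions: `best_of_n`, `toy_generate`, and `toy_score` are hypothetical names, the generator and judge are stubs rather than real models, and the paper's own sampling and selection strategies are not reproduced here.

```python
from typing import Callable, List, Sequence


def best_of_n(
    prompt: str,
    generate: Callable[[str, float], str],
    score: Callable[[str, str], float],
    temperatures: Sequence[float],
) -> str:
    """Draw one sample per temperature and return the highest-scoring one."""
    samples: List[str] = [generate(prompt, t) for t in temperatures]
    return max(samples, key=lambda sample: score(prompt, sample))


def toy_generate(prompt: str, temperature: float) -> str:
    # Stub generator: a real system would call a language model here.
    return f"{prompt} (sampled at T={temperature})"


def toy_score(prompt: str, sample: str) -> float:
    # Stub judge: pretend the sample drawn at T=0.7 is preferred.
    return 1.0 if "T=0.7" in sample else 0.0


choice = best_of_n(
    "Translate 'hello' into French.",
    toy_generate,
    toy_score,
    temperatures=[0.0, 0.3, 0.7, 1.0, 1.2],
)
print(choice)
```

Passing a list of temperatures rather than a single value leaves room for the temperature variation mentioned in the abstract, and swapping `score` is where a language-aware selection method would plug in.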
[LINK]
http://arxiv.org/abs/2506.20544v1
[DATE]
2025-06-25 23:37:53+08:00
[CATEGORIES]
cs.CL
Attention with Trained Embeddings Provably Selects Important Tokens
[AUTHORS]
Diyuan Wu, Aleksandr Shevchenko, Samet Oymak, Marco Mondelli
[ABSTRACT]
Token embeddings play a crucial role in language modeling but, despite this
practical relevance, their theoretical understanding remains limited. Our paper
addresses the gap by characterizing the structure of embeddings obtained via
gradient descent. Specifically, we consider a one-layer softmax attention model
with a linear head for binary classification, i.e., $\texttt{Softmax}( p^\top
E_X^\top ) E_X v = \frac{ \sum_{i=1}^T \exp(p^\top E_{x_i}) E_{x_i}^\top
v}{\sum_{j=1}^T \exp(p^\top E_{x_{j}}) }$, where $E_X = [ E_{x_1} , \dots,
E_{x_T} ]^\top$ contains the embeddings of the input sequence, $p$ is the
embedding of the $\mathrm{\langle cls \rangle}$ token and $v$ the output
vector. First, we show that, already after a single step of gradient training
with the logistic loss, the embeddings $E_X$ capture the importance of tokens
in the dataset by aligning with the output vector $v$ proportionally to the
frequency with which the corresponding tokens appear in the dataset. Then,
after training $p$ via gradient flow until convergence, the softmax selects the
important tokens in the sentence (i.e., those that are predictive of the
label), and the resulting $\mathrm{\langle cls \rangle}$ embedding maximizes
the margin for such a selection. Experiments on real-world datasets (IMDB,
Yelp) exhibit a phenomenology close to that unveiled by our theory.
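The model output defined above can be evaluated directly. The following pure-Python sketch computes $\texttt{Softmax}(p^\top E_X^\top) E_X v$ for a token sequence; the function name `attention_output` and the two-dimensional toy embeddings are assumptions chosen for illustration.

```python
import math
from typing import Dict, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attention_output(
    tokens: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    p: Sequence[float],
    v: Sequence[float],
) -> float:
    """Softmax over p . E_{x_i}, then a weighted sum of E_{x_i} . v."""
    logits = [dot(p, embeddings[t]) for t in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    norm = sum(weights)
    return sum(w / norm * dot(embeddings[t], v) for w, t in zip(weights, tokens))


# Toy embeddings: "good" aligns with the output vector v, "the" does not.
toy_embeddings = {"good": [1.0, 0.0], "the": [0.0, 1.0]}
v = [1.0, 0.0]
uniform = attention_output(["good", "the"], toy_embeddings, p=[0.0, 0.0], v=v)
selective = attention_output(["good", "the"], toy_embeddings, p=[100.0, 0.0], v=v)
print(uniform, selective)
```

With $p = 0$ the softmax weights are uniform and the output averages both tokens; once $p$ aligns with the embedding of the predictive token, the softmax concentrates on it, which mirrors the token selection behaviour the abstract describes.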
[COMMENTS]
Fix mistakes in Lemma 4.2 and proof of Lemma 4.5, and some other
minor changes
[LINK]
http://arxiv.org/abs/2505.17282v3
[DATE]
2025-06-25 23:19:05+08:00
[CATEGORIES]
cs.LG
cs.CL
Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers
[AUTHORS]
Clément Dumas, Chris Wendler, Veniamin Veselovsky, Giovanni Monea, Robert West
[ABSTRACT]
A central question in multilingual language modeling is whether large
language models (LLMs) develop a universal concept representation, disentangled
from specific languages. In this paper, we address this question by analyzing
latent representations (latents) during a word-translation task in
transformer-based LLMs. We strategically extract latents from a source
translation prompt and insert them into the forward pass on a target
translation prompt. By doing so, we find that the output language is encoded in
the latent at an earlier layer than the concept to be translated. Building on
this insight, we conduct two key experiments. First, we demonstrate that we can
change the concept without changing the language and vice versa through
activation patching alone. Second, we show that patching with the mean
representation of a concept across different languages does not affect the
models’ ability to translate it, but instead improves it. Finally, we
generalize to multi-token generation and demonstrate that the model can
generate natural language descriptions of those mean representations. Our
results provide evidence for the existence of language-agnostic concept
representations within the investigated models.
[COMMENTS]
20 pages, 14 figures, previous version published under the title “How
Do Llamas Process Multilingual Text? A Latent Exploration through Activation
Patching” at the ICML 2024 mechanistic interpretability workshop at
https://openreview.net/forum?id=0ku2hIm4BS
[LINK]
http://arxiv.org/abs/2411.08745v4
[DATE]
2025-06-25 23:16:54+08:00
[CATEGORIES]
cs.CL
Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards
[AUTHORS]
Charles Arnal, Gaëtan Narozniak, Vivien Cabannes, Yunhao Tang, Julia Kempe, Remi Munos
[ABSTRACT]
Reinforcement learning (RL) is increasingly used to align large language
models (LLMs). Off-policy methods offer greater implementation simplicity and
data efficiency than on-policy techniques, but often result in suboptimal
performance. In this work, we study the intermediate range of algorithms
between off-policy RL and supervised fine-tuning by analyzing a simple
off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with
$r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$
emphasizes high-reward samples, while raising it penalizes low-reward ones more
heavily. We first provide a theoretical analysis of this off-policy REINFORCE
algorithm, showing that when the baseline $V$ lower-bounds the expected reward,
the algorithm enjoys a policy improvement guarantee. Our analysis reveals that
while on-policy updates can safely leverage both positive and negative signals,
off-policy updates benefit from focusing more on positive rewards than on
negative ones. We validate our findings experimentally in a controlled
stochastic bandit setting and through fine-tuning state-of-the-art LLMs on
reasoning tasks.
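The role of the baseline in $A=r-V$ can be seen in a single update step. The sketch below assumes a softmax policy over a small set of bandit arms and applies no importance weighting to the off-policy samples; the function names and the toy batch are hypothetical and the code is not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def off_policy_reinforce_step(
    logits: Sequence[float],
    batch: Sequence[Tuple[int, float]],
    baseline: float,
    lr: float,
) -> List[float]:
    """One gradient-ascent step on mean((r - V) * log pi(a)).

    `batch` holds (action, reward) pairs drawn from some behaviour policy.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, reward in batch:
        advantage = reward - baseline  # A = r - V
        for k in range(len(logits)):
            indicator = 1.0 if k == action else 0.0
            # d/d theta_k of log softmax(theta)[action] = indicator - probs[k]
            grad[k] += advantage * (indicator - probs[k])
    return [theta + lr * g / len(batch) for theta, g in zip(logits, grad)]


toy_batch = [(0, 1.0), (1, 0.0)]  # arm 0 was rewarded, arm 1 was not
low_baseline = off_policy_reinforce_step([0.0, 0.0], toy_batch, baseline=0.0, lr=1.0)
high_baseline = off_policy_reinforce_step([0.0, 0.0], toy_batch, baseline=1.0, lr=1.0)
print(low_baseline, high_baseline)
```

With $V=0$ only the rewarded sample has a non-zero advantage, so the update is driven purely by pulling up the good arm; with $V=1$ only the unrewarded sample contributes, so the update is driven purely by pushing down the bad arm.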
[LINK]
http://arxiv.org/abs/2506.20520v1
[DATE]
2025-06-25 23:07:16+08:00
[CATEGORIES]
cs.LG
cs.CL
OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
[AUTHORS]
Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu
[ABSTRACT]
Different base language model families, such as Llama and Qwen, exhibit
divergent behaviors during post-training with reinforcement learning (RL),
especially on reasoning-intensive tasks. What makes a base language model
suitable for reinforcement learning? Gaining deeper insight into this question
is essential for developing RL-scalable foundation models of the next
generation. In this work, we investigate how mid-training strategies shape RL
dynamics, focusing on two representative model families: Qwen and Llama. Our
study reveals that (1) high-quality mathematical corpora, such as
MegaMath-Web-Pro, significantly improve both base model and RL performance,
while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further
adding QA-style data, particularly long chain-of-thought (CoT) reasoning
examples, enhances RL outcomes, and instruction data further unlocks this
effect; (3) while long-CoT improves reasoning depth, it can also induce
verbosity of model responses and instability of RL training, underscoring the
importance of data formatting; (4) scaling mid-training consistently leads to
stronger downstream RL performance. Building on these insights, we introduce a
two-stage mid-training strategy, Stable-then-Decay, in which base models are
first trained on 200B tokens with a constant learning rate, followed by 20B
tokens across three CoT-focused branches with learning rate decay. This yields
OctoThinker, a family of models demonstrating strong RL compatibility and
closing the performance gap with more RL-friendly model families, i.e., Qwen.
We hope our work will help shape pre-training strategies for foundation models
in the RL era. To support further research, we release our open-source models
along with a curated math reasoning-intensive corpus of over 70 billion tokens
(i.e., MegaMath-Web-Pro-Max).
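The Stable-then-Decay schedule can be pictured with a small learning-rate function for a single branch. The token budgets come from the abstract; the peak and final learning rates and the cosine shape of the decay are assumptions made only for illustration.

```python
import math

STABLE_TOKENS = 200e9  # constant-LR stage (from the abstract)
DECAY_TOKENS = 20e9    # decay stage (from the abstract)
PEAK_LR = 3e-4         # hypothetical value
FINAL_LR = 3e-5        # hypothetical value

def stable_then_decay_lr(tokens_seen):
    """Learning rate after `tokens_seen` training tokens.

    Stage 1 holds the learning rate constant; stage 2 decays it to
    FINAL_LR. The cosine decay is an assumed shape, not the paper's.
    """
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine

# Learning rate at the start, at the stage boundary, mid-decay, and at the end.
schedule = [stable_then_decay_lr(t) for t in (0.0, 200e9, 210e9, 220e9)]
```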
[COMMENTS]
26 pages; The first three authors contribute to this work equally
[LINK]
http://arxiv.org/abs/2506.20512v1
[DATE]
2025-06-25 22:58:13+08:00
[CATEGORIES]
cs.CL
cs.LG
ReCode: Updating Code API Knowledge with Reinforcement Learning
[AUTHORS]
Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
[ABSTRACT]
Large Language Models (LLMs) exhibit remarkable code generation capabilities
but falter when adapting to frequent updates in external library APIs. This
critical limitation, stemming from reliance on outdated API knowledge from
their training data, even with access to current documentation, impedes
reliable code generation in dynamic environments. To tackle this issue, we
propose ReCode (rule-based Reinforcement learning for Code Update), a novel
framework that mimics human programmer adaptation to API changes. Specifically,
we construct a dataset of approximately 2,000 data entries to train the LLMs to
perform version migration based on updated information. Then, we introduce a
modified string similarity metric for code evaluation as the reward for
reinforcement learning. Our experiments demonstrate that ReCode substantially
boosts LLMs’ code generation performance in dynamic API scenarios, especially
on the unseen CodeUpdateArena task. Crucially, compared to supervised
fine-tuning, ReCode has less impact on LLMs’ general code generation abilities.
We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and
DAPO), all achieving consistent improvements. Notably, after training,
Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model
and the reasoning model with the same architecture. Code is available at
https://github.com/zjunlp/ReCode.
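To make the idea of a string-similarity reward concrete, here is a minimal sketch that scores generated code against a reference migration. It uses the standard-library `difflib.SequenceMatcher` ratio as a stand-in; the paper's modified metric is not specified in the abstract, and `lib.load` / `lib.read` are hypothetical API names.

```python
from difflib import SequenceMatcher

def similarity_reward(generated_code, reference_code):
    """Reward in [0, 1] based on character-level similarity.

    Whitespace runs are collapsed first so formatting differences alone
    do not lower the reward. This is a stand-in metric, not ReCode's.
    """
    gen = " ".join(generated_code.split())
    ref = " ".join(reference_code.split())
    return SequenceMatcher(None, gen, ref).ratio()

# Hypothetical API update: `lib.load(path, fast=True)` became
# `lib.read(path, mode="fast")` in a newer library version.
reference = 'data = lib.read(path, mode="fast")'
outdated = "data = lib.load(path, fast=True)"
migrated = 'data = lib.read(path,   mode="fast")'

reward_outdated = similarity_reward(outdated, reference)
reward_migrated = similarity_reward(migrated, reference)
```

A rule-based reward of this kind needs no execution environment, which is convenient for reinforcement learning, but it only measures surface similarity to the reference.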
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2506.20495v1
[DATE]
2025-06-25 22:41:13+08:00
[CATEGORIES]
cs.CL
cs.LG
Counterfactual Influence as a Distributional Quantity
[AUTHORS]
Matthieu Meeus, Igor Shilov, Georgios Kaissis, Yves-Alexandre de Montjoye
[ABSTRACT]
Machine learning models are known to memorize samples from their training
data, raising concerns around privacy and generalization. Counterfactual
self-influence is a popular metric to study memorization, quantifying how the
model’s prediction for a sample changes depending on the sample’s inclusion in
the training dataset. However, recent work has shown memorization to be
affected by factors beyond self-influence, with other training samples, in
particular (near-)duplicates, having a large impact. We here study memorization
treating counterfactual influence as a distributional quantity, taking into
account how all training samples influence how a sample is memorized. For a
small language model, we compute the full influence distribution of training
samples on each other and analyze its properties. We find that solely looking
at self-influence can severely underestimate tangible risks associated with
memorization: the presence of (near-)duplicates seriously reduces
self-influence, while we find these samples to be (near-)extractable. We
observe similar patterns for image classification, where simply looking at the
influence distributions reveals the presence of near-duplicates in CIFAR-10.
Our findings highlight that memorization stems from complex interactions across
training data and is better captured by the full influence distribution than by
self-influence alone.
[COMMENTS]
Workshop on The Impact of Memorization on Trustworthy Foundation
Models (MemFM) @ ICML 2025
[LINK]
http://arxiv.org/abs/2506.20481v1
[DATE]
2025-06-25 22:25:11+08:00
[CATEGORIES]
cs.LG
cs.CL
GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
[AUTHORS]
Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping
[ABSTRACT]
Large language models (LLMs) have shown remarkable capabilities in language
understanding and generation. However, such impressive capability typically
comes with a substantial model size, which presents significant challenges in
deployment and inference. While structured pruning of model parameters offers a
promising way to reduce computational costs at deployment time, current methods
primarily focus on single model pruning. In this work, we develop a novel
strategy to compress models by strategically combining or merging layers from
finetuned model variants, which preserves the original model’s abilities by
aggregating capabilities accentuated in different finetunes. We pose the
optimal tailoring of these LLMs as a zero-order optimization problem, adopting
a search space that supports three different operations: (1) Layer removal, (2)
Layer selection from different candidate models, and (3) Layer merging. Our
experiments demonstrate that this approach leads to competitive model pruning.
For example, for the Llama2-13B model families, our compressed models maintain
approximately 97.3\% of the original performance while removing $\sim25\%$ of
parameters, significantly outperforming previous state-of-the-art methods. The
code is available at https://github.com/Guinan-Su/auto-merge-llm.
[LINK]
http://arxiv.org/abs/2506.20480v1
[DATE]
2025-06-25 22:24:59+08:00
[CATEGORIES]
cs.CL
Graph Linearization Methods for Reasoning on Graphs with Large Language Models
[AUTHORS]
Christos Xypolopoulos, Guokan Shang, Xiao Fei, Giannis Nikolentzos, Hadi Abdine, Iakovos Evdaimon, Michail Chatzianastasis, Giorgos Stamou, Michalis Vazirgiannis
[ABSTRACT]
Large language models have evolved to process multiple modalities beyond
text, such as images and audio, which motivates us to explore how to
effectively leverage them for graph reasoning tasks. The key question,
therefore, is how to transform graphs into linear sequences of tokens, a
process we term “graph linearization”, so that LLMs can handle graphs
naturally. We consider that graphs should be linearized meaningfully to reflect
certain properties of natural language text, such as local dependency and
global alignment, in order to help contemporary LLMs, trained on trillions of
textual tokens, better understand graphs. To achieve this, we developed several
graph linearization methods based on graph centrality and degeneracy. These
methods are further enhanced using node relabeling techniques. The experimental
results demonstrate the effectiveness of our methods compared to the random
linearization baseline. Our work introduces novel graph representations
suitable for LLMs, contributing to the potential integration of graph machine
learning with the trend of multimodal processing using a unified transformer
model.
[LINK]
http://arxiv.org/abs/2410.19494v3
[DATE]
2025-06-25 22:24:33+08:00
[CATEGORIES]
cs.CL
cs.LG
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
[AUTHORS]
Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2505.20767v4
[DATE]
2025-06-25 22:02:19+08:00
[CATEGORIES]
cs.CL
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
[AUTHORS]
Shiyu Ni, Keping Bi, Jiafeng Guo, Lulu Yu, Baolong Bi, Xueqi Cheng
[ABSTRACT]
Large language models (LLMs) exhibit impressive performance across diverse
tasks but often struggle to accurately gauge their knowledge boundaries,
leading to confident yet incorrect responses. This paper explores leveraging
LLMs’ internal states to enhance their perception of knowledge boundaries from
efficiency and risk perspectives. We investigate whether LLMs can estimate
their confidence using internal states before response generation, potentially
saving computational resources. Our experiments on datasets like Natural
Questions, HotpotQA, and MMLU reveal that LLMs demonstrate significant
pre-generation perception, which is further refined post-generation, with
perception gaps remaining stable across varying conditions. To mitigate risks
in critical domains, we introduce Confidence Consistency-based Calibration
($C^3$), which assesses confidence consistency through question reformulation.
$C^3$ significantly improves LLMs’ ability to recognize their knowledge gaps,
enhancing the unknown perception rate by 5.6% on NQ and 4.9% on HotpotQA. Our
findings suggest that pre-generation confidence estimation can optimize
efficiency, while $C^3$ effectively controls output risks, advancing the
reliability of LLMs in practical applications.
[COMMENTS]
ACL2025 Main
[LINK]
http://arxiv.org/abs/2502.11677v2
[DATE]
2025-06-25 21:46:10+08:00
[CATEGORIES]
cs.CL
SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal Large Language Models Preserving Language Capabilities
[AUTHORS]
Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang
[ABSTRACT]
Mixture of Experts (MoE) architectures have become a key approach for scaling
large language models, with growing interest in extending them to multimodal
tasks. Existing methods to build multimodal MoE models either incur high
training costs or suffer from degraded language capabilities when adapting
pretrained models. To address this, we propose Soft Modality-Aware Routing
(SMAR), a novel regularization technique that uses Kullback-Leibler divergence
to control routing probability distributions across modalities, encouraging
expert specialization without modifying model architecture or heavily relying
on textual data. Experiments on visual instruction tuning show that SMAR
preserves language ability at 86.6% retention with only 2.5% pure text,
outperforming baselines while maintaining strong multimodal performance. Our
approach offers a practical and efficient solution to balance modality
differentiation and language capabilities in multimodal MoE models.
[LINK]
http://arxiv.org/abs/2506.06406v2
[DATE]
2025-06-25 20:36:55+08:00
[CATEGORIES]
cs.CL
From Codicology to Code: A Comparative Study of Transformer and YOLO-based Detectors for Layout Analysis in Historical Documents
[AUTHORS]
Sergio Torres Aguilar
[ABSTRACT]
Robust Document Layout Analysis (DLA) is critical for the automated
processing and understanding of historical documents with complex page
organizations. This paper benchmarks five state-of-the-art object detection
architectures on three annotated datasets representing a spectrum of
codicological complexity: the e-NDP, a corpus of Parisian medieval registers
(1326-1504); CATMuS, a diverse multiclass dataset derived from various medieval
and modern sources (ca. 12th-17th centuries); and HORAE, a corpus of decorated
books of hours (ca. 13th-16th centuries). We evaluate two Transformer-based
models (Co-DETR, Grounding DINO) against three YOLO variants (AABB, OBB, and
YOLO-World). Our findings reveal significant performance variations dependent
on model architecture, dataset characteristics, and bounding box
representation. In the e-NDP dataset, Co-DETR achieves state-of-the-art results
(0.752 mAP@.50:.95), closely followed by YOLOv11x-OBB (0.721). Conversely, on
the more complex CATMuS and HORAE datasets, the CNN-based YOLOv11x-OBB
significantly outperforms all other models (0.564 and 0.568, respectively).
This study unequivocally demonstrates that using Oriented Bounding Boxes (OBB)
is not a minor refinement but a fundamental requirement for accurately modeling
the non-Cartesian nature of historical manuscripts. We conclude that a key
trade-off exists between the global context awareness of Transformers, ideal
for structured layouts, and the superior generalization of CNN-OBB models for
visually diverse and complex documents.
[LINK]
http://arxiv.org/abs/2506.20326v1
[DATE]
2025-06-25 19:14:04+08:00
[CATEGORIES]
cs.CL
VICCA: Visual Interpretation and Comprehension of Chest X-ray Anomalies in Generated Report Without Human Feedback
[AUTHORS]
Sayeh Gholipour Picha, Dawood Al Chanti, Alice Caplier
[ABSTRACT]
As artificial intelligence (AI) becomes increasingly central to healthcare,
the demand for explainable and trustworthy models is paramount. Current report
generation systems for chest X-rays (CXR) often lack mechanisms for validating
outputs without expert oversight, raising concerns about reliability and
interpretability. To address these challenges, we propose a novel multimodal
framework designed to enhance the semantic alignment and localization accuracy
of AI-generated medical reports. Our framework integrates two key modules: a
Phrase Grounding Model, which identifies and localizes pathologies in CXR
images based on textual prompts, and a Text-to-Image Diffusion Module, which
generates synthetic CXR images from prompts while preserving anatomical
fidelity. By comparing features between the original and generated images, we
introduce a dual-scoring system: one score quantifies localization accuracy,
while the other evaluates semantic consistency. This approach significantly
outperforms existing methods, achieving state-of-the-art results in pathology
localization and text-to-image alignment. The integration of phrase grounding
with diffusion models, coupled with the dual-scoring evaluation system,
provides a robust mechanism for validating report quality, paving the way for
more trustworthy and transparent AI in medical imaging.
[LINK]
http://arxiv.org/abs/2501.17726v2
[DATE]
2025-06-25 19:13:35+08:00
[CATEGORIES]
cs.CL
Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning
[AUTHORS]
Lixin Wu, Na Cai, Qiao Cheng, Jiachen Wang, Yitao Duan
[ABSTRACT]
We introduce Confucius3-Math, an open-source large language model with 14B
parameters that (1) runs efficiently on a single consumer-grade GPU; (2)
achieves SOTA performances on a range of mathematical reasoning tasks,
outperforming many models with significantly larger sizes. In particular, as
part of our mission to enhance education and knowledge dissemination with AI,
Confucius3-Math is specifically committed to mathematics learning for Chinese
K-12 students and educators. Built via post-training with large-scale
reinforcement learning (RL), Confucius3-Math aligns with national curriculum
and excels at solving mainstream Chinese K-12 mathematical problems with low
cost. In this report we share our development recipe, the challenges we
encounter and the techniques we develop to overcome them. In particular, we
introduce three technical innovations: Targeted Entropy Regularization, Recent
Sample Recovery and Policy-Specific Hardness Weighting. These innovations
encompass a new entropy regularization, a novel data scheduling policy, and an
improved group-relative advantage estimator. Collectively, they significantly
stabilize the RL training, improve data efficiency, and boost performance. Our
work demonstrates the feasibility of building strong reasoning models in a
particular domain at low cost. We open-source our model and code at
https://github.com/netease-youdao/Confucius3-Math.
[LINK]
http://arxiv.org/abs/2506.18330v2
[DATE]
2025-06-25 18:49:23+08:00
[CATEGORIES]
cs.LG
cs.CL
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
[AUTHORS]
Hugh Mee Wong, Rick Nouwen, Albert Gatt
[COMMENTS]
Proceedings of ACL 2025, 10 pages
[LINK]
http://arxiv.org/abs/2502.11874v3
[DATE]
2025-06-25 18:46:05+08:00
[CATEGORIES]
cs.CL
FundaQ-8: A Clinically-Inspired Scoring Framework for Automated Fundus Image Quality Assessment
[AUTHORS]
Lee Qi Zun, Oscar Wong Jin Hao, Nor Anita Binti Che Omar, Zalifa Zakiah Binti Asnir, Mohamad Sabri bin Sinal Zainal, Goh Man Fye
[ABSTRACT]
Automated fundus image quality assessment (FIQA) remains a challenge due to
variations in image acquisition and subjective expert evaluations. We introduce
FundaQ-8, a novel expert-validated framework for systematically assessing
fundus image quality using eight critical parameters, including field coverage,
anatomical visibility, illumination, and image artifacts. Using FundaQ-8 as a
structured scoring reference, we develop a ResNet18-based regression model to
predict continuous quality scores in the 0 to 1 range. The model is trained on
1800 fundus images from real-world clinical sources and Kaggle datasets, using
transfer learning, mean squared error optimization, and standardized
preprocessing. Validation against the EyeQ dataset and statistical analyses
confirm the framework’s reliability and clinical interpretability.
Incorporating FundaQ-8 into deep learning models for diabetic retinopathy
grading also improves diagnostic robustness, highlighting the value of
quality-aware training in real-world screening applications.
[LINK]
http://arxiv.org/abs/2506.20303v1
[DATE]
2025-06-25 18:28:53+08:00
[CATEGORIES]
cs.CL
LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
[AUTHORS]
Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang
[COMMENTS]
ACL-2025, our code is available at https://github.com/ZNLP/LR2Bench
[LINK]
http://arxiv.org/abs/2502.17848v4
[DATE]
2025-06-25 17:36:23+08:00
[CATEGORIES]
cs.CL
LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
[AUTHORS]
Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang
[ABSTRACT]
Long-context modeling has drawn more and more attention in the area of Large
Language Models (LLMs). Continual training with long-context data has become the
de facto method to equip LLMs with the ability to process long inputs. However,
it still remains an open challenge to measure the quality of long-context
training data. To address this issue, we propose a Long-context data selection
framework with Attention-based Dependency Measurement (LADM), which can
efficiently identify high-quality long-context data from a large-scale,
multi-domain pre-training corpus. LADM leverages the retrieval capabilities of
the attention mechanism to capture contextual dependencies, ensuring a
comprehensive quality measurement of long-context data. Experimental results
show that our LADM framework significantly boosts the performance of LLMs on
multiple long-context tasks with only 1B tokens for continual training.
[COMMENTS]
ACL 2025, our code is available at https://github.com/ZNLP/LADM
[LINK]
http://arxiv.org/abs/2503.02502v2
[DATE]
2025-06-25 17:27:33+08:00
[CATEGORIES]
cs.CL
Narrative Shift Detection: A Hybrid Approach of Dynamic Topic Models and Large Language Models
[AUTHORS]
Kai-Robin Lange, Tobias Schmidt, Matthias Reccius, Henrik Müller, Michael Roos, Carsten Jentsch
[ABSTRACT]
With rapidly evolving media narratives, it has become increasingly critical
not just to extract narratives from a given corpus but also to investigate how
they develop over time. While popular narrative extraction methods such as
Large Language Models do well in capturing typical narrative elements or even
the complex structure of a narrative, applying them to an entire corpus comes
with obstacles, such as a high financial or computational cost. We propose a
combination of the language understanding capabilities of Large Language Models
with the large-scale applicability of topic models to dynamically model
narrative shifts across time using the Narrative Policy Framework. We apply a
topic model and a corresponding change point detection method to find changes
that concern a specific topic of interest. Using this model, we filter our
corpus for documents that are particularly representative of that change and
feed them into a Large Language Model that interprets the change that happened
in an automated fashion and distinguishes between content and narrative shifts.
We employ our pipeline on a corpus of The Wall Street Journal newspaper
articles from 2009 to 2023. Our findings indicate that a Large Language Model
can efficiently extract a narrative shift if one exists at a given point in
time, but does not perform as well when having to decide whether a shift in
content or a narrative shift took place.
[COMMENTS]
14 pages, 1 figure
[LINK]
http://arxiv.org/abs/2506.20269v1
[DATE]
2025-06-25 17:25:15+08:00
[CATEGORIES]
cs.CL
Language Modeling by Language Models
[AUTHORS]
Junyan Cheng, Peter Clark, Kyle Richardson
[ABSTRACT]
Can we leverage LLMs to model the process of discovering novel language model
(LM) architectures? Inspired by real research, we propose a multi-agent LLM
approach that simulates the conventional stages of research, from ideation and
literature search (proposal stage) to design implementation (code generation),
generative pre-training, and downstream evaluation (verification). Using ideas
from scaling laws, our system, Genesys, employs a Ladder of Scales approach;
new designs are proposed, adversarially reviewed, implemented, and selectively
verified at increasingly larger model scales (14M$\sim$350M parameters) with a
narrowing budget (the number of models we can train at each scale). To help
make discovery efficient and factorizable, Genesys uses a novel genetic
programming backbone, which we show has empirical advantages over commonly used
direct prompt generation workflows (e.g., $\sim$86 percentage point
improvement in successful design generation, a key bottleneck). We report
experiments involving 1,162 newly discovered designs (1,062 fully verified
through pre-training) and find the best designs to be highly competitive with
known architectures (e.g., outperform GPT2, Mamba2, etc., on 6/9 common
benchmarks). We couple these results with comprehensive system-level ablations
and formal results, which give broader insights into the design of effective
autonomous discovery systems.
[LINK]
http://arxiv.org/abs/2506.20249v1
[DATE]
2025-06-25 16:46:10+08:00
[CATEGORIES]
cs.CL
Enhancing Large Language Models through Structured Reasoning
[AUTHORS]
Yubo Dong, Hehe Fan
[ABSTRACT]
Recent Large Language Models (LLMs) have significantly advanced natural
language processing and automated decision-making. However, these models still
encounter difficulties when performing complex reasoning tasks involving
logical deduction and systematic planning, primarily due to their reliance on
implicit statistical relationships without structured knowledge
representation. Inspired by cognitive science and neurosymbolic AI, we introduce
a novel approach to enhance LLMs through explicit structured reasoning. First,
we convert unstructured data into structured formats by explicitly annotating
reasoning steps. We then employ this structured dataset to train LLMs through
Supervised Fine-Tuning (SFT). Additionally, we enhance the structured reasoning
capabilities of LLMs using Group Relative Policy Optimization (GRPO),
incorporating two innovative algorithms, MAX-Flow and Longest Common
Subsequence (LCS), which notably improve reasoning effectiveness and reduce
computational complexity. Experimental results from fine-tuning a
DeepSeek-R1-Distill-Qwen-1.5B model demonstrate concise reasoning, robust
performance across various scenarios, and improved compatibility with
optimization techniques, validating the efficacy of structured reasoning
integration in LLMs.
[COMMENTS]
Preprint. Under review
[LINK]
http://arxiv.org/abs/2506.20241v1
[DATE]
2025-06-25 16:36:12+08:00
[CATEGORIES]
cs.CL
LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models
[AUTHORS]
Hengyuan Zhao, Ziqin Wang, Qixin Sun, Kaiyou Song, Yilin Li, Xiaolin Hu, Qingpei Guo, Si Liu
[ABSTRACT]
Mixture of Experts (MoE) architectures have recently advanced the scalability
and adaptability of large language models (LLMs) for continual multimodal
learning. However, efficiently extending these models to accommodate sequential
tasks remains challenging. As new tasks arrive, naive model expansion leads to
rapid parameter growth, while modifying shared routing components often causes
catastrophic forgetting, undermining previously learned knowledge. To address
these issues, we propose LLaVA-CMoE, a continual learning framework for LLMs
that requires no replay data of previous tasks and ensures both parameter
efficiency and robust knowledge retention. Our approach introduces a
Probe-Guided Knowledge Extension mechanism, which uses probe experts to
dynamically determine when and where new experts should be added, enabling
adaptive and minimal parameter expansion tailored to task complexity.
Furthermore, we present a Probabilistic Task Locator that assigns each task a
dedicated, lightweight router. To handle the practical issue that task labels
are unknown during inference, we leverage a VAE-based reconstruction strategy
to identify the most suitable router by matching input distributions, allowing
automatic and accurate expert allocation. This design mitigates routing
conflicts and catastrophic forgetting, enabling robust continual learning
without explicit task labels. Extensive experiments on the CoIN benchmark,
covering eight diverse VQA tasks, demonstrate that LLaVA-CMoE delivers strong
continual learning performance with a compact model size, significantly
reducing forgetting and parameter overhead compared to prior methods. These
results showcase the effectiveness and scalability of our approach for
parameter-efficient continual learning in large language models. Our code will
be open-sourced soon.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2503.21227v3
[DATE]
2025-06-25 16:30:20+08:00
[CATEGORIES]
cs.CL
Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
[AUTHORS]
Benedetta Muscato, Lucia Passaro, Gizem Gezici, Fosca Giannotti
[ABSTRACT]
In the realm of Natural Language Processing (NLP), common approaches for
handling human disagreement consist of aggregating annotators’ viewpoints to
establish a single ground truth. However, prior studies show that disregarding
individual opinions can lead to the side effect of underrepresenting
minority perspectives, especially in subjective tasks, where annotators may
systematically disagree because of their preferences. Recognizing that labels
reflect the diverse backgrounds, life experiences, and values of individuals,
this study proposes a new multi-perspective approach using soft labels to
encourage the development of the next generation of perspective-aware models
that are more inclusive and pluralistic. We conduct an extensive analysis across diverse
subjective text classification tasks, including hate speech, irony, abusive
language, and stance detection, to highlight the importance of capturing human
disagreements, often overlooked by traditional aggregation methods. Results
show that the multi-perspective approach not only better approximates human
label distributions, as measured by Jensen-Shannon Divergence (JSD), but also
achieves superior classification performance (higher F1 scores), outperforming
traditional approaches. However, our approach exhibits lower confidence in
tasks like irony and stance detection, likely due to the inherent subjectivity
present in the texts. Lastly, leveraging Explainable AI (XAI), we explore model
uncertainty and uncover meaningful insights into model predictions.
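The contrast between soft labels and a single aggregated ground truth can be sketched in a few lines. The snippet below builds a soft label from annotator votes and compares it with a model's predicted distribution using Jensen-Shannon Divergence (base 2, so values lie in [0, 1]). The vote data and the predicted distribution are made up for illustration, and the helper names are not from the paper.

```python
import math
from collections import Counter

def soft_label(votes, num_classes):
    """Turn a list of annotator votes (class ids) into a distribution."""
    counts = Counter(votes)
    return [counts.get(c, 0) / len(votes) for c in range(num_classes)]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon Divergence between two distributions, in bits."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Four hypothetical annotators: three say class 0, one says class 1.
votes = [0, 0, 0, 1]
soft = soft_label(votes, num_classes=2)  # [0.75, 0.25]
hard = [1.0, 0.0]                        # majority-vote ground truth
predicted = [0.7, 0.3]                   # hypothetical model output

jsd_to_soft = js_divergence(soft, predicted)
jsd_to_hard = js_divergence(hard, predicted)
```

In this toy case the prediction is closer to the soft label than to the majority-vote label, because the soft label keeps the minority annotation that aggregation discards.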
[LINK]
http://arxiv.org/abs/2506.20209v1
[DATE]
2025-06-25 15:53:36+08:00
[CATEGORIES]
cs.CL
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation
[AUTHORS]
Petra Barančíková, Ondřej Bojar
[ABSTRACT]
In this paper, we compare Czech-specific and multilingual sentence embedding
models through intrinsic and extrinsic evaluation paradigms. For intrinsic
evaluation, we employ Costra, a complex sentence transformation dataset, and
several Semantic Textual Similarity (STS) benchmarks to assess the ability of
the embeddings to capture linguistic phenomena such as semantic similarity,
temporal aspects, and stylistic variations. In the extrinsic evaluation, we
fine-tune each embedding model using COMET-based metrics for machine
translation evaluation.
Our experiments reveal an interesting disconnect: models that excel in
intrinsic semantic similarity tests do not consistently yield superior
performance on downstream translation evaluation tasks. Conversely, models with
seemingly over-smoothed embedding spaces can, through fine-tuning, achieve
excellent results. These findings highlight the complex relationship between
semantic property probes and downstream tasks, emphasizing the need for more
research into ‘operationalizable semantics’ in sentence embeddings, or more
in-depth downstream task datasets (here, translation evaluation).
[LINK]
http://arxiv.org/abs/2506.20203v1
[DATE]
2025-06-25 15:46:17+08:00
[CATEGORIES]
cs.CL
COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees
[AUTHORS]
Zhiyuan Wang, Jinhao Duan, Qingni Wang, Xiaofeng Zhu, Tianlong Chen, Xiaoshuang Shi, Kaidi Xu
[ABSTRACT]
Uncertainty quantification (UQ) for foundation models is essential to
identify and mitigate potential hallucinations in automatically generated text.
However, heuristic UQ approaches lack formal guarantees for key metrics such as
the false discovery rate (FDR) in selective prediction. Previous work adopts
the split conformal prediction (SCP) framework to ensure desired coverage of
admissible answers by constructing prediction sets, but these sets often
contain incorrect candidates, limiting their practical utility. To address
this, we propose COIN, an uncertainty-guarding selection framework that
calibrates statistically valid thresholds to filter a single generated answer
per question under user-specified FDR constraints. COIN estimates the empirical
error rate on a calibration set and applies confidence interval methods such as
Clopper-Pearson to establish a high-probability upper bound on the true error
rate (i.e., FDR). This enables the selection of the largest uncertainty
threshold that ensures FDR control on test data while significantly increasing
sample retention. We demonstrate COIN’s robustness in risk control, strong
test-time power in retaining admissible answers, and predictive efficiency
under limited calibration data across both general and multimodal text
generation tasks. Furthermore, we show that employing alternative upper bound
constructions and UQ strategies can further boost COIN’s power performance,
which underscores its extensibility and adaptability to diverse application
scenarios.
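The calibration step described above can be sketched as follows, assuming a calibration set with one uncertainty score and one correctness flag per question. The function names, the bisection solver, and the toy data are illustrative and not taken from the paper; the sketch ignores ties in uncertainty and any correction for scanning many candidate thresholds.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def clopper_pearson_upper(k, n, delta):
    """One-sided Clopper-Pearson upper bound on the error rate.

    Returns the p solving P(Bin(n, p) <= k) = delta, found by bisection;
    the true rate is below this value with probability at least 1 - delta.
    """
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid  # CDF still too large, so p must increase
        else:
            hi = mid
    return hi

def select_threshold(uncertainties, is_error, alpha, delta):
    """Largest uncertainty threshold whose error bound stays <= alpha.

    Answers with uncertainty at or below the threshold are kept.
    Returns None if no threshold satisfies the constraint.
    """
    pairs = sorted(zip(uncertainties, is_error))
    best, errors = None, 0
    for n, (u, err) in enumerate(pairs, start=1):
        errors += int(err)
        if clopper_pearson_upper(errors, n, delta) <= alpha:
            best = u
    return best

# Toy calibration set: the 30 least uncertain answers are correct,
# the 10 most uncertain ones are wrong.
uncertainties = [i / 100 for i in range(1, 41)]
is_error = [False] * 30 + [True] * 10
threshold = select_threshold(uncertainties, is_error, alpha=0.1, delta=0.1)
```

On this toy data the selected threshold keeps exactly the 30 correct answers: admitting even one of the wrong answers pushes the upper bound above the target level of 0.1.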
[LINK]
http://arxiv.org/abs/2506.20178v1
[DATE]
2025-06-25 15:04:49+08:00
[CATEGORIES]
cs.CL
cs.LG
Conversational User-AI Intervention: A Study on Prompt Rewriting for Improved LLM Response Generation
[AUTHORS]
Rupak Sarkar, Bahareh Sarrafzadeh, Nirupama Chandrasekaran, Nagu Rangan, Philip Resnik, Longqi Yang, Sujay Kumar Jauhar
[COMMENTS]
8 pages, ACL style
[LINK]
http://arxiv.org/abs/2503.16789v2
[DATE]
2025-06-25 14:44:58+08:00
[CATEGORIES]
cs.CL
SEED: A Structural Encoder for Embedding-Driven Decoding in Time Series Prediction with LLMs
[AUTHORS]
Fengze Li, Yue Wang, Yangle Liu, Ming Huang, Dou Hong, Jieming Ma
[ABSTRACT]
Multivariate time series forecasting requires models to simultaneously
capture variable-wise structural dependencies and generalize across diverse
tasks. While structural encoders are effective in modeling feature
interactions, they lack the capacity to support semantic-level reasoning or
task adaptation. Conversely, large language models (LLMs) possess strong
generalization capabilities but remain incompatible with raw time series
inputs. This gap limits the development of unified, transferable prediction
systems. Therefore, we introduce SEED, a structural encoder for
embedding-driven decoding, which integrates four stages: a token-aware encoder
for patch extraction, a projection module that aligns patches with language
model embeddings, a semantic reprogramming mechanism that maps patches to
task-aware prototypes, and a frozen language model for prediction. This modular
architecture decouples representation learning from inference, enabling
efficient alignment between numerical patterns and semantic reasoning.
Empirical results demonstrate that the proposed method achieves consistent
improvements over strong baselines, and comparative studies on various datasets
confirm SEED’s role in addressing the structural-semantic modeling gap.
[LINK]
http://arxiv.org/abs/2506.20167v1
[DATE]
2025-06-25 14:40:14+08:00
[CATEGORIES]
cs.CL
Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
[AUTHORS]
Miao Peng, Nuo Chen, Zongrui Suo, Jia Li
[ABSTRACT]
Despite significant advancements in Large Language Models (LLMs), developing
advanced reasoning capabilities in LLMs remains a key challenge. Process Reward
Models (PRMs) have demonstrated exceptional promise in enhancing reasoning by
providing step-wise feedback, particularly in the context of mathematical
reasoning. However, their application to broader reasoning domains remains
understudied, largely due to the high costs associated with manually creating
step-level supervision. In this work, we explore the potential of PRMs in graph
reasoning problems - a domain that demands sophisticated multi-step reasoning
and offers opportunities for automated step-level data generation using
established graph algorithms. We introduce GraphSILO, the largest dataset for
graph reasoning problems with fine-grained step-wise labels, built using
automated Task-oriented Trajectories and Monte Carlo Tree Search (MCTS) to
generate detailed reasoning steps with step-wise labels. Building upon this
dataset, we train GraphPRM, the first PRM designed for graph reasoning
problems, and evaluate its effectiveness in two key settings: inference-time
scaling and reinforcement learning via Direct Preference Optimization (DPO).
Experimental results show that GraphPRM significantly improves LLM performance
across 13 graph reasoning tasks, delivering a 9% gain for Qwen2.5-7B and
demonstrating transferability to new graph reasoning datasets and new reasoning
domains like mathematical problem-solving. Notably, GraphPRM enhances LLM
performance on GSM8K and Math500, underscoring the cross-domain applicability
of graph-based reasoning rewards. Our findings highlight the potential of PRMs
in advancing reasoning across diverse domains, paving the way for more
versatile and effective LLMs.
[COMMENTS]
Accepted to KDD 2025 Research Track
[LINK]
http://arxiv.org/abs/2503.00845v2
[DATE]
2025-06-25 14:00:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Leveraging AI Graders for Missing Score Imputation to Achieve Accurate Ability Estimation in Constructed-Response Tests
[AUTHORS]
Masaki Uto, Yuma Ito
[ABSTRACT]
Evaluating the abilities of learners is a fundamental objective in the field
of education. In particular, there is an increasing need to assess higher-order
abilities such as expressive skills and logical thinking. Constructed-response
tests such as short-answer and essay-based questions have become widely used as
a method to meet this demand. Although these tests are effective, they require
substantial manual grading, making them both labor-intensive and costly. Item
response theory (IRT) provides a promising solution by enabling the estimation
of ability from incomplete score data, where human raters grade only a subset
of answers provided by learners across multiple test items. However, the
accuracy of ability estimation declines as the proportion of missing scores
increases. Although data augmentation techniques for imputing missing scores
have been explored in order to address this limitation, they often struggle
with inaccuracy for sparse or heterogeneous data. To overcome these challenges,
this study proposes a novel method for imputing missing scores by leveraging
automated scoring technologies for accurate IRT-based ability estimation. The
proposed method achieves high accuracy in ability estimation while markedly
reducing manual grading workload.
[COMMENTS]
Accepted to EvalLAC’25: 2nd Workshop on Automatic Evaluation of
Learning and Assessment Content, held at AIED 2025, Palermo, Italy. This is
the camera-ready version submitted to CEUR Workshop Proceedings
[LINK]
http://arxiv.org/abs/2506.20119v1
[DATE]
2025-06-25 12:17:57+08:00
[CATEGORIES]
cs.CL
cs.LG
A Global Context Mechanism for Sequence Labeling
[AUTHORS]
Conglei Xu, Kun Shen, Hongguang Sun, Yang Xu
[ABSTRACT]
Global sentence information is crucial for sequence labeling tasks, where
each word in a sentence must be assigned a label. While BiLSTM models are
widely used, they often fail to capture sufficient global context for inner
words. Previous work has proposed various RNN variants to integrate global
sentence information into word representations. However, these approaches
suffer from three key limitations: (1) they are slower in both inference and
training compared to the original BiLSTM, (2) they cannot effectively
supplement global information for transformer-based models, and (3) they incur
a high time cost for reimplementing and integrating these customized RNNs
into existing architectures. In this study, we introduce a simple yet effective
mechanism that addresses these limitations. Our approach efficiently
supplements global sentence information for both BiLSTM and transformer-based
models, with minimal degradation in inference and training speed, and is easily
pluggable into current architectures. We demonstrate significant improvements
in F1 scores across seven popular benchmarks, including Named Entity
Recognition (NER) tasks such as CoNLL2003, WNUT2017, and the Chinese
named-entity recognition task Weibo, as well as End-to-End Aspect-Based
Sentiment Analysis (E2E-ABSA) benchmarks such as Laptop14, Restaurant14,
Restaurant15, and Restaurant16. Without any extra strategy, we achieve the
third-highest score on the Weibo NER benchmark. Compared to CRF, one of the most popular
frameworks for sequence labeling, our mechanism achieves competitive F1 scores
while offering superior inference and training speed. Code is available at:
https://github.com/conglei2XU/Global-Context-Mechanism
[LINK]
http://arxiv.org/abs/2305.19928v5
[DATE]
2025-06-25 11:52:41+08:00
[CATEGORIES]
cs.CL
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
[AUTHORS]
Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, Vikram S. Adve
[ABSTRACT]
We introduce MIRAGE, a new benchmark for multimodal expert-level reasoning
and decision-making in consultative interaction settings. Designed for the
agriculture domain, MIRAGE captures the full complexity of expert consultations
by combining natural user queries, expert-authored responses, and image-based
context, offering a high-fidelity benchmark for evaluating models on grounded
reasoning, clarification strategies, and long-form generation in a real-world,
knowledge-intensive domain. Grounded in over 35,000 real user-expert
interactions and curated through a carefully designed multi-step pipeline,
MIRAGE spans diverse crop health, pest diagnosis, and crop management
scenarios. The benchmark includes more than 7,000 unique biological entities,
covering plant species, pests, and diseases, making it one of the most
taxonomically diverse benchmarks available for vision-language models, grounded
in the real world. Unlike existing benchmarks that rely on well-specified user
inputs and closed-set taxonomies, MIRAGE features underspecified, context-rich
scenarios with open-world settings, requiring models to infer latent knowledge
gaps, handle rare entities, and either proactively guide the interaction or
respond. Project Page: https://mirage-benchmark.github.io
[COMMENTS]
66 pages, 32 figures, 23 tables
[LINK]
http://arxiv.org/abs/2506.20100v1
[DATE]
2025-06-25 11:07:54+08:00
[CATEGORIES]
cs.LG
cs.CL
PSALM-V: Automating Symbolic Planning in Interactive Visual Environments with Large Language Models
[AUTHORS]
Wang Bill Zhu, Miaosen Chai, Ishika Singh, Robin Jia, Jesse Thomason
[ABSTRACT]
We propose PSALM-V, the first autonomous neuro-symbolic learning system able
to induce symbolic action semantics (i.e., pre- and post-conditions) in visual
environments through interaction. PSALM-V bootstraps reliable symbolic planning
without expert action definitions, using LLMs to generate heuristic plans and
candidate symbolic semantics. Previous work has explored using large language
models to generate action semantics for Planning Domain Definition Language
(PDDL)-based symbolic planners. However, these approaches have primarily
focused on text-based domains or relied on unrealistic assumptions, such as
access to a predefined problem file, full observability, or explicit error
messages. By contrast, PSALM-V dynamically infers PDDL problem files and domain
action semantics by analyzing execution outcomes and synthesizing possible
error explanations. The system iteratively generates and executes plans while
maintaining a tree-structured belief over possible action semantics for each
action, iteratively refining these beliefs until a goal state is reached.
Simulated experiments of task completion in ALFRED demonstrate that PSALM-V
increases the plan success rate from 37% (Claude-3.7) to 74% in partially
observed setups. Results on two 2D game environments, RTFM and Overcooked-AI,
show that PSALM-V improves step efficiency and succeeds in domain induction in
multi-agent settings. PSALM-V correctly induces PDDL pre- and post-conditions
for real-world robot BlocksWorld tasks, despite low-level manipulation failures
from the robot.
[LINK]
http://arxiv.org/abs/2506.20097v1
[DATE]
2025-06-25 10:44:20+08:00
[CATEGORIES]
cs.CL
PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
[AUTHORS]
Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu
[ABSTRACT]
This report introduces PP-DocBee2, an advanced version of the PP-DocBee,
designed to enhance multimodal document understanding. Built on a large
multimodal model architecture, PP-DocBee2 addresses the limitations of its
predecessor through key technological improvements, including enhanced
synthetic data quality, improved visual feature fusion strategy, and optimized
inference methodologies. These enhancements yield an 11.4% performance boost
on internal benchmarks for Chinese business documents, and reduce inference
latency by 73.0% relative to the vanilla version. A key innovation of our work is a
data quality optimization strategy for multimodal document tasks. By employing
a large-scale multimodal pre-trained model to evaluate data, we apply a novel
statistical criterion to filter outliers, ensuring high-quality training data.
Inspired by insights into underutilized intermediate features in multimodal
models, we enhance the ViT representational capacity by decomposing it into
layers and applying a novel feature fusion strategy to improve complex
reasoning. The source code and pre-trained model are available at
https://github.com/PaddlePaddle/PaddleMIX.
[LINK]
http://arxiv.org/abs/2506.18023v2
[DATE]
2025-06-25 10:40:39+08:00
[CATEGORIES]
cs.CL
ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset
[AUTHORS]
Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, Zhongyu Wei
[ABSTRACT]
Time-series data are critical in diverse applications, such as industrial
monitoring, medical diagnostics, and climate research. However, effectively
integrating these high-dimensional temporal signals with natural language for
dynamic, interactive tasks remains a significant challenge. To address this, we
introduce the Time-Series Question Answering (Time-Series QA) task and release
EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset
designed to capture complex interactions between time-series signals and
natural language. Building on this resource, we propose the Instruct Time
Transformer (ITFormer), a novel framework that bridges time-series encoders
with frozen large language models (LLMs). ITFormer effectively extracts,
aligns, and fuses temporal and textual features, achieving a strong improvement
in QA accuracy over strong baselines with fewer than 1% additional trainable
parameters. By combining computational efficiency with robust cross-modal
modeling, our work establishes an adaptable paradigm for integrating temporal
data with natural language, paving the way for new research and applications in
multi-modal AI. More details about the project, including datasets and code,
are available at: https://pandalin98.github.io/itformer_site/
[LINK]
http://arxiv.org/abs/2506.20093v1
[DATE]
2025-06-25 10:33:47+08:00
[CATEGORIES]
cs.CL
Understanding World or Predicting Future? A Comprehensive Survey of World Models
[AUTHORS]
Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, Yong Li
[ABSTRACT]
The concept of world models has garnered significant attention due to
advancements in multimodal large language models such as GPT-4 and video
generation models such as Sora, which are central to the pursuit of artificial
general intelligence. This survey offers a comprehensive review of the
literature on world models. Generally, world models are regarded as tools for
either understanding the present state of the world or predicting its future
dynamics. This review presents a systematic categorization of world models,
emphasizing two primary functions: (1) constructing internal representations to
understand the mechanisms of the world, and (2) predicting future states to
simulate and guide decision-making. Initially, we examine the current progress
in these two categories. We then explore the application of world models in key
domains, including autonomous driving, robotics, and social simulacra, with a
focus on how each domain utilizes these aspects. Finally, we outline key
challenges and provide insights into potential future research directions. We
summarize the representative papers along with their code repositories in
https://github.com/tsinghua-fib-lab/World-Model.
[COMMENTS]
Accepted by ACM CSUR, 37 pages, 7 figures, 7 tables
[LINK]
http://arxiv.org/abs/2411.14499v2
[DATE]
2025-06-25 10:31:33+08:00
[CATEGORIES]
cs.CL
cs.LG
Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models
[AUTHORS]
Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, Dong Yu
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2412.16545v2
[DATE]
2025-06-25 10:28:36+08:00
[CATEGORIES]
cs.CL
Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective
[AUTHORS]
Weijie Xu, Yiwen Wang, Chi Xue, Xiangkun Hu, Xi Fang, Guimin Dong, Chandan K. Reddy
[ABSTRACT]
Large Language Models (LLMs) often generate responses with inherent biases,
undermining their reliability in real-world applications. Existing evaluation
methods often overlook biases in long-form responses and the intrinsic
variability of LLM outputs. To address these challenges, we propose
FiSCo (Fine-grained Semantic Computation), a novel statistical framework to
evaluate group-level fairness in LLMs by detecting subtle semantic differences
in long-form responses across demographic groups. Unlike prior work focusing on
sentiment or token-level comparisons, FiSCo goes beyond surface-level analysis
by operating at the claim level, leveraging entailment checks to assess the
consistency of meaning across responses. We decompose model outputs into
semantically distinct claims and apply statistical hypothesis testing to
compare inter- and intra-group similarities, enabling robust detection of
subtle biases. We formalize a new group counterfactual fairness definition and
validate FiSCo on both synthetic and human-annotated datasets spanning gender,
race, and age. Experiments show that FiSCo more reliably identifies nuanced
biases while reducing the impact of stochastic LLM variability, outperforming
various evaluation metrics.
[COMMENTS]
29 pages, 9 figures, 15 tables
[LINK]
http://arxiv.org/abs/2506.19028v2
[DATE]
2025-06-25 09:21:47+08:00
[CATEGORIES]
cs.CL
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks
[AUTHORS]
Luel Hagos Beyene, Vivek Verma, Min Ma, Jesujoba O. Alabi, Fabian David Schmidt, Joyce Nakatumba-Nabende, David Ifeoluwa Adelani
[ABSTRACT]
Large language models (LLMs) have demonstrated impressive performance on a
wide range of tasks, including in multimodal settings such as speech. However,
their evaluation is often limited to English and a few high-resource languages.
For low-resource languages, there is no standardized evaluation benchmark. In
this paper, we address this gap by introducing mSTEB, a new benchmark to
evaluate the performance of LLMs on a wide range of tasks covering language
identification, text classification, question answering, and translation tasks
on both speech and text modalities. We evaluated the performance of leading
LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open
models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in
performance between high-resource and low-resource languages, especially for
languages spoken in Africa and the Americas/Oceania. Our findings show that more
investment is needed to address their under-representation in LLM coverage.
[COMMENTS]
working paper
[LINK]
http://arxiv.org/abs/2506.08400v2
[DATE]
2025-06-25 08:58:19+08:00
[CATEGORIES]
cs.CL
cs.LG
A Modular Multitask Reasoning Framework Integrating Spatio-temporal Models and LLMs
[AUTHORS]
Kethmi Hirushini Hettige, Jiahao Ji, Cheng Long, Shili Xiang, Gao Cong, Jingyuan Wang
[ABSTRACT]
Spatio-temporal data mining plays a pivotal role in informed decision making
across diverse domains. However, existing models are often restricted to narrow
tasks, lacking the capacity for multi-task inference and complex long-form
reasoning that require generation of in-depth, explanatory outputs. These
limitations restrict their applicability to real-world, multi-faceted decision
scenarios. In this work, we introduce STReason, a novel framework that
integrates the reasoning strengths of large language models (LLMs) with the
analytical capabilities of spatio-temporal models for multi-task inference and
execution. Without requiring task-specific finetuning, STReason leverages
in-context learning to decompose complex natural language queries into modular,
interpretable programs, which are then systematically executed to generate both
solutions and detailed rationales. To facilitate rigorous evaluation, we
construct a new benchmark dataset and propose a unified evaluation framework
with metrics specifically designed for long-form spatio-temporal reasoning.
Experimental results show that STReason significantly outperforms advanced LLM
baselines across all metrics, particularly excelling in complex,
reasoning-intensive spatio-temporal scenarios. Human evaluations further
validate STReason’s credibility and practical utility, demonstrating its
potential to reduce expert workload and broaden the applicability to real-world
spatio-temporal tasks. We believe STReason provides a promising direction for
developing more capable and generalizable spatio-temporal reasoning systems.
[LINK]
http://arxiv.org/abs/2506.20073v1
[DATE]
2025-06-25 08:55:34+08:00
[CATEGORIES]
cs.CL
cs.LG
Computation Mechanism Behind LLM Position Generalization
[AUTHORS]
Chi Han, Heng Ji
[COMMENTS]
ACL 2025 Main Long Paper
[LINK]
http://arxiv.org/abs/2503.13305v3
[DATE]
2025-06-25 08:26:59+08:00
[CATEGORIES]
cs.CL
Learning Instruction-Following Policies through Open-Ended Instruction Relabeling with Large Language Models
[AUTHORS]
Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang
[ABSTRACT]
Developing effective instruction-following policies in reinforcement learning
remains challenging due to the reliance on extensive human-labeled instruction
datasets and the difficulty of learning from sparse rewards. In this paper, we
propose a novel approach that leverages the capabilities of large language
models (LLMs) to automatically generate open-ended instructions retrospectively
from previously collected agent trajectories. Our core idea is to employ LLMs
to relabel unsuccessful trajectories by identifying meaningful subtasks the
agent has implicitly accomplished, thereby enriching the agent’s training data
and substantially alleviating reliance on human annotations. Through this
open-ended instruction relabeling, we efficiently learn a unified
instruction-following policy capable of handling diverse tasks within a single
policy. We empirically evaluate our proposed method in the challenging Craftax
environment, demonstrating clear improvements in sample efficiency, instruction
coverage, and overall policy performance compared to state-of-the-art
baselines. Our results highlight the effectiveness of utilizing LLM-guided
open-ended instruction relabeling to enhance instruction-following
reinforcement learning.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2506.20061v1
[DATE]
2025-06-25 07:49:28+08:00
[CATEGORIES]
cs.LG
cs.CL
Cross-Layer Discrete Concept Discovery for Interpreting Language Models
[AUTHORS]
Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou
[ABSTRACT]
Uncovering emergent concepts across transformer layers remains a significant
challenge because the residual stream linearly mixes and duplicates
information, obscuring how features evolve within large language models.
Current research efforts primarily inspect neural representations at single
layers, thereby overlooking this cross-layer superposition and the redundancy
it introduces. These representations are typically either analyzed directly for
activation patterns or passed to probing classifiers that map them to a limited
set of predefined concepts. To address these limitations, we propose
CLVQ-VAE, a framework that uses vector quantization to map representations
across layers and in the process collapse duplicated residual-stream features
into compact, interpretable concept vectors. Our approach uniquely combines
top-k temperature-based sampling during quantization with EMA codebook
updates, providing controlled exploration of the discrete latent space while
maintaining codebook diversity. We further enhance the framework with
scaled-spherical k-means++ for codebook initialization, which clusters by
directional similarity rather than magnitude, better aligning with semantic
structure in word embedding space.
[LINK]
http://arxiv.org/abs/2506.20040v1
[DATE]
2025-06-25 06:43:36+08:00
[CATEGORIES]
cs.LG
cs.CL
The Noisy Path from Source to Citation: Measuring How Scholars Engage with Past Research
[AUTHORS]
Hong Chen, Misha Teplitskiy, David Jurgens
[COMMENTS]
Accepted by ACL 2025
[LINK]
http://arxiv.org/abs/2502.20581v3
[DATE]
2025-06-25 06:00:02+08:00
[CATEGORIES]
cs.CL
Evaluating Long Range Dependency Handling in Code Generation LLMs
[AUTHORS]
Yannick Assogba, Donghao Ren
[ABSTRACT]
As language models support larger and larger context sizes, evaluating their
ability to make effective use of that context becomes increasingly important.
We analyze the ability of several code generation models to handle long range
dependencies using a suite of multi-step key retrieval tasks in context windows
up to 8k tokens in length. The tasks progressively increase in difficulty and
allow more nuanced evaluation of model capabilities than tests like the popular
needle-in-the-haystack test. We find that performance degrades significantly
for many models (up to 2x) when a function references another function that is
defined later in the prompt. We also observe that models that use sliding
window attention mechanisms have difficulty handling references further than
the size of a single window. We perform simple prompt modifications using call
graph information to improve multi-step retrieval performance up to 3x. Our
analysis highlights ways that long-context performance needs deeper
consideration beyond retrieval of single facts within a document.
[COMMENTS]
36 pages, 18 figures
[LINK]
http://arxiv.org/abs/2407.21049v2
[DATE]
2025-06-25 05:45:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Language Models Learn Rare Phenomena from Less Rare Phenomena: The Case of the Missing AANNs
[AUTHORS]
Kanishka Misra, Kyle Mahowald
[ABSTRACT]
Language models learn rare syntactic phenomena, but the extent to which this
is attributable to generalization vs. memorization is a major open question. To
that end, we iteratively trained transformer language models on systematically
manipulated corpora which were human-scale in size, and then evaluated their
learning of a rare grammatical phenomenon: the English
Article+Adjective+Numeral+Noun (AANN) construction (“a beautiful five days”).
We compared how well this construction was learned on the default corpus
relative to a counterfactual corpus in which AANN sentences were removed. We
found that AANNs were still learned better than systematically perturbed
variants of the construction. Using additional counterfactual corpora, we
suggest that this learning occurs through generalization from related
constructions (e.g., “a few days”). An additional experiment showed that this
learning is enhanced when there is more variability in the input. Taken
together, our results provide an existence proof that LMs can learn rare
grammatical phenomena by generalization from less rare phenomena. Data and
code: https://github.com/kanishkamisra/aannalysis.
[COMMENTS]
Added Corrigendum to correct 4-gram baseline performance and chance
performance
[LINK]
http://arxiv.org/abs/2403.19827v3
[DATE]
2025-06-25 05:39:54+08:00
[CATEGORIES]
cs.CL
Accurate and Energy Efficient: Local Retrieval-Augmented Generation Models Outperform Commercial Large Language Models in Medical Tasks
[AUTHORS]
Konstantinos Vrettos, Michail E. Klontzas
[ABSTRACT]
Background: The increasing adoption of Artificial Intelligence (AI) in
healthcare has sparked growing concerns about its environmental and ethical
implications. Commercial Large Language Models (LLMs), such as ChatGPT and
DeepSeek, require substantial resources, while the utilization of these systems
for medical purposes raises critical issues regarding patient privacy and
safety. Methods: We developed a customizable Retrieval-Augmented Generation
(RAG) framework for medical tasks, which monitors its energy usage and CO2
emissions. This system was then used to create RAGs based on various
open-source LLMs. The tested models included both general-purpose models like
llama3.1:8b and medgemma-4b-it, which is medical-domain specific. The best RAG's
performance and energy consumption were compared to DeepSeekV3-R1 and OpenAI's
o4-mini model. A dataset of medical questions was used for the evaluation.
Results: Custom RAG models outperformed commercial models in accuracy and energy
consumption. The RAG model built on llama3.1:8B achieved the highest accuracy
(58.5%) and was significantly better than other models, including o4-mini and
DeepSeekV3-R1. The llama3.1-RAG also exhibited the lowest energy consumption
and CO2 footprint among all models, with a Performance per kWh of 0.52 and a
total CO2 emission of 473g. Compared to o4-mini, the llama3.1-RAG achieved 2.7
times more accuracy points per kWh and 172% less electricity usage while
maintaining higher accuracy. Conclusion: Our study demonstrates that local LLMs
can be leveraged to develop RAGs that outperform commercial, online LLMs in
medical tasks, while having a smaller environmental impact. Our modular
framework promotes sustainable AI development, reducing electricity usage and
aligning with the UN's Sustainable Development Goals.
[COMMENTS]
18 pages, 3 Figures
[LINK]
http://arxiv.org/abs/2506.20009v1
[DATE]
2025-06-25 04:56:03+08:00
[CATEGORIES]
cs.CL
Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’
[AUTHORS]
Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan
[ABSTRACT]
Recently, a number of repository-level code generation benchmarks, such as
CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to
evaluate the capabilities of large language models (LLMs) beyond standalone
benchmarks like HumanEval and MBPP. Thus, a natural question is: would LLMs
perform as well on real-world coding tasks as they do on
these benchmarks? Unfortunately, one cannot answer this question, since these
benchmarks consist of short completions, synthetic examples, or focus on
limited-scale repositories, failing to represent real-world coding tasks.
To address these challenges, we create REPOCOD, a Python code-generation
benchmark containing complex tasks with realistic dependencies in real-world
large projects and appropriate metrics for evaluating source code. It includes
980 whole-function generation tasks from 11 popular projects, 50.8% of which
require repository-level context. REPOCOD includes 314 developer-written test
cases per instance for better evaluation. We evaluate ten LLMs on REPOCOD and
find that none achieves more than 30% pass@1 on REPOCOD, indicating the
necessity of building stronger LLMs that can help developers in real-world
software development. In addition, we found that retrieval-augmented generation
achieves better results than using target function dependencies as context.
[LINK]
http://arxiv.org/abs/2410.21647v4
[DATE]
2025-06-25 04:49:51+08:00
[CATEGORIES]
cs.CL
A Spatio-Temporal Point Process for Fine-Grained Modeling of Reading Behavior
[AUTHORS]
Francesco Ignazio Re, Andreas Opedal, Glib Manaiev, Mario Giulianelli, Ryan Cotterell
[ABSTRACT]
Reading is a process that unfolds across space and time, alternating between
fixations where a reader focuses on a specific point in space, and saccades
where a reader rapidly shifts their focus to a new point. An ansatz of
psycholinguistics is that modeling a reader’s fixations and saccades yields
insight into their online sentence processing. However, standard approaches to
such modeling rely on aggregated eye-tracking measurements and models that
impose strong assumptions, ignoring much of the spatio-temporal dynamics that
occur during reading. In this paper, we propose a more general probabilistic
model of reading behavior, based on a marked spatio-temporal point process,
that captures not only how long fixations last, but also where they land in
space and when they take place in time. The saccades are modeled using a Hawkes
process, which captures how each fixation excites the probability of a new
fixation occurring near it in time and space. The duration time of fixation
events is modeled as a function of fixation-specific predictors convolved
across time, thus capturing spillover effects. Empirically, our Hawkes process
model exhibits a better fit to human saccades than baselines. With respect to
fixation durations, we observe that incorporating contextual surprisal as a
predictor results in only a marginal improvement in the model’s predictive
accuracy. This finding suggests that surprisal theory struggles to explain
fine-grained eye movements.
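
As a rough illustration of the self-exciting mechanism described above, the sketch below evaluates a spatio-temporal Hawkes intensity at a candidate time and location. The exponential temporal kernel, the isotropic Gaussian spatial kernel, and every parameter value are assumptions chosen for illustration; they are not the paper's specification.

```python
import math


def hawkes_intensity(t, x, y, history, mu=0.1, alpha=0.5, beta=2.0, sigma=1.0):
    """Conditional intensity at time t and location (x, y).

    history: iterable of past fixations as (t_i, x_i, y_i) tuples
    mu:      constant background rate (assumed)
    alpha:   expected number of fixations triggered by one fixation (assumed)
    beta:    temporal decay rate of the excitation (assumed)
    sigma:   spatial spread of the excitation (assumed)
    """
    rate = mu
    for t_i, x_i, y_i in history:
        if t_i >= t:
            continue  # only strictly earlier fixations can excite
        # Exponential decay in time, normalized to integrate to 1.
        temporal = beta * math.exp(-beta * (t - t_i))
        # Isotropic 2-D Gaussian in space, normalized to integrate to 1.
        sq_dist = (x - x_i) ** 2 + (y - y_i) ** 2
        spatial = math.exp(-sq_dist / (2.0 * sigma**2)) / (2.0 * math.pi * sigma**2)
        rate += alpha * temporal * spatial
    return rate
```

Under these assumptions, a location close to a recent fixation receives a higher intensity than a distant one, and the boost fades as time passes.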
[COMMENTS]
ACL 2025
[LINK]
http://arxiv.org/abs/2506.19999v1
[DATE]
2025-06-25 04:39:21+08:00
[CATEGORIES]
cs.LG
cs.CL
WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development
[AUTHORS]
Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan
[ABSTRACT]
Web development involves turning UI designs into functional webpages, which
can be difficult for both beginners and experienced developers due to the
complexity of HTML’s hierarchical structures and styles. While Large Language
Models (LLMs) have shown promise in generating source code, two major
challenges persist in UI-to-HTML code generation: (1) effectively representing
HTML’s hierarchical structure for LLMs, and (2) bridging the gap between the
visual nature of UI designs and the text-based format of HTML code. To tackle
these challenges, we introduce Waffle, a new fine-tuning strategy that uses a
structure-aware attention mechanism to improve LLMs’ understanding of HTML’s
structure and a contrastive fine-tuning approach to align LLMs’ understanding
of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp
(percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP,
and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing
benchmark Design2Code, outperforming current fine-tuning methods.
[LINK]
http://arxiv.org/abs/2410.18362v2
[DATE]
2025-06-25 04:35:02+08:00
[CATEGORIES]
cs.CL
Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation
[AUTHORS]
Xinyi Ni, Haonan Jian, Qiuyang Wang, Vedanshi Chetan Shah, Pengyu Hong
[LINK]
http://arxiv.org/abs/2506.19998v1
[DATE]
2025-06-25 04:30:44+08:00
[CATEGORIES]
cs.CL
Inference Scaled GraphRAG: Improving Multi Hop Question Answering on Knowledge Graphs
[AUTHORS]
Travis Thompson, Seung-Hwan Lim, Paul Liu, Ruoying He, Dongkuan Xu
[ABSTRACT]
Large Language Models (LLMs) have achieved impressive capabilities in
language understanding and generation, yet they continue to underperform on
knowledge-intensive reasoning tasks due to limited access to structured context
and multi-hop information. Retrieval-Augmented Generation (RAG) partially
mitigates this by grounding generation in retrieved context, but conventional
RAG and GraphRAG methods often fail to capture relational structure across
nodes in knowledge graphs. We introduce Inference-Scaled GraphRAG, a novel
framework that enhances LLM-based graph reasoning by applying inference-time
compute scaling. Our method combines sequential scaling with deep
chain-of-thought graph traversal, and parallel scaling with majority voting
over sampled trajectories within an interleaved reasoning-execution loop.
Experiments on the GRBench benchmark demonstrate that our approach
significantly improves multi-hop question answering performance, achieving
substantial gains over both traditional GraphRAG and prior graph traversal
baselines. These findings suggest that inference-time scaling is a practical
and architecture-agnostic solution for structured knowledge reasoning with LLMs.
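
The parallel-scaling step mentioned above, majority voting over sampled trajectories, can be sketched as follows. This is a minimal illustration that assumes each trajectory ends in a short answer string; the normalization rule (trim and lowercase) is an assumption rather than the authors' procedure.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent normalized answer, or None if there is none.

    answers: final answers from independently sampled reasoning trajectories
    """
    # Normalize so that trivially different spellings count as one vote.
    votes = Counter(a.strip().lower() for a in answers if a and a.strip())
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Sampling more trajectories raises the compute cost linearly, while the vote becomes more stable when the correct answer is the single most likely outcome of a trajectory.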
[LINK]
http://arxiv.org/abs/2506.19967v1
[DATE]
2025-06-25 03:31:03+08:00
[CATEGORIES]
cs.CL
CycleDistill: Bootstrapping Machine Translation using LLMs with Cyclical Distillation
[AUTHORS]
Deepon Halder, Thanmay Jayakumar, Raj Dabre
[ABSTRACT]
Large language models (LLMs), despite their ability to perform few-shot
machine translation (MT), often lag behind dedicated MT systems trained on
parallel corpora, which are crucial for high-quality MT.
However, parallel corpora are often scarce or non-existent for low-resource
languages. In this paper, we propose CycleDistill, a bootstrapping approach
leveraging LLMs and few-shot translation to obtain high-quality MT systems.
CycleDistill involves iteratively generating synthetic parallel corpora from
monolingual corpora via zero- or few-shot MT, which is then used to fine-tune
the model that was used for generating said data for MT. CycleDistill does not
need parallel corpora beyond 1 to 4 few-shot examples, and in our experiments
focusing on three Indian languages, by relying solely on monolingual corpora,
it can achieve high-quality machine translation, improving upon a few-shot
baseline model by over 20-30 chrF points on average in the first iteration. We
also study the effect of leveraging softmax activations during the distillation
process and observe mild improvements in translation quality.
[LINK]
http://arxiv.org/abs/2506.19952v1
[DATE]
2025-06-25 02:56:57+08:00
[CATEGORIES]
cs.CL
Aug2Search: Enhancing Facebook Marketplace Search with LLM-Generated Synthetic Data Augmentation
[AUTHORS]
Ruijie Xi, He Ba, Hao Yuan, Rishu Agrawal, Yuxin Tian, Ruoyan Kong, Arul Prakash
[ABSTRACT]
Embedding-Based Retrieval (EBR) is an important technique in modern search
engines, enabling semantic match between search queries and relevant results.
However, search logging data on platforms like Facebook Marketplace lacks the
diversity and details needed for effective EBR model training, limiting the
models’ ability to capture nuanced search patterns. To address this challenge,
we propose Aug2Search, an EBR-based framework leveraging synthetic data
generated by Generative AI (GenAI) models, in a multimodal and multitask
approach to optimize query-product relevance. This paper investigates the
capabilities of GenAI, particularly Large Language Models (LLMs), in generating
high-quality synthetic data and analyzes its impact on enhancing EBR models.
We conducted experiments using eight Llama models and 100 million data points
from Facebook Marketplace logs. Our synthetic data generation follows three
strategies: (1) generate queries, (2) enhance product listings, and (3)
generate queries from enhanced listings. We train EBR models on three different
datasets: sampled engagement data or original data (e.g., “Click” and “Listing
Interactions”), synthetic data, and a mixture of both engagement and synthetic
data to assess their performance across various training sets. Our findings
underscore the robustness of Llama models in producing synthetic queries and
listings with high coherence, relevance, and diversity, while maintaining low
levels of hallucination. Aug2Search achieves an improvement of up to 4% in
ROC_AUC with 100 million synthetic data samples, demonstrating the
effectiveness of our approach. Moreover, our experiments reveal that with the
same volume of training data, models trained exclusively on synthetic data
often outperform those trained on original data only or a mixture of original
and synthetic data.
[LINK]
http://arxiv.org/abs/2505.16065v3
[DATE]
2025-06-25 02:46:45+08:00
[CATEGORIES]
cs.CL
GlyphPattern: An Abstract Pattern Recognition Benchmark for Vision-Language Models
[AUTHORS]
Zixuan Wu, Yoolim Kim, Carolyn Jane Anderson
[ABSTRACT]
Vision-Language Models (VLMs) building upon the foundation of powerful large
language models have made rapid progress in reasoning across visual and textual
data. While VLMs perform well on vision tasks that they are trained on, our
results highlight key challenges in abstract pattern recognition. We present
GlyphPattern, a 954-item dataset that pairs 318 human-written descriptions of
visual patterns from 40 writing systems with three visual presentation styles.
GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models
to understand and judge natural language descriptions of visual patterns.
GlyphPattern patterns are drawn from a large-scale cognitive science
investigation of human writing systems; as a result, they are rich in spatial
reference and compositionality. Our experiments show that GlyphPattern is
challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with
marginal gains from few-shot prompting. Our detailed error analysis reveals
challenges at multiple levels, including visual processing, natural language
understanding, and pattern generalization.
[LINK]
http://arxiv.org/abs/2408.05894v2
[DATE]
2025-06-25 02:23:10+08:00
[CATEGORIES]
cs.CL
MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration
[AUTHORS]
Yucheng Zhou, Lingran Song, Jianbing Shen
[COMMENTS]
ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.19835v1
[DATE]
2025-06-25 01:52:43+08:00
[CATEGORIES]
cs.CL
Scaling Speculative Decoding with Lookahead Reasoning
[AUTHORS]
Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
[LINK]
http://arxiv.org/abs/2506.19830v1
[DATE]
2025-06-25 01:48:10+08:00
[CATEGORIES]
cs.LG
cs.CL
Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
[AUTHORS]
Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
[ABSTRACT]
Large Language Models (LLMs) hold promise in automating data analysis tasks,
yet open-source models face significant limitations in these kinds of
reasoning-intensive scenarios. In this work, we investigate strategies to
enhance the data analysis capabilities of open-source LLMs. By curating a seed
dataset of diverse, realistic scenarios, we evaluate models across three
dimensions: data understanding, code generation, and strategic planning. Our
analysis reveals three key findings: (1) Strategic planning quality serves as
the primary determinant of model performance; (2) Interaction design and task
complexity significantly influence reasoning capabilities; (3) Data quality
demonstrates a greater impact than diversity in achieving optimal performance.
We leverage these insights to develop a data synthesis methodology,
demonstrating significant improvements in open-source LLMs’ analytical
reasoning capabilities.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2506.19794v1
[DATE]
2025-06-25 01:04:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Kling-Foley: Multimodal Diffusion Transformer for High-Quality Video-to-Audio Generation
[AUTHORS]
Jun Wang, Xijuan Zeng, Chunyu Qiang, Ruilong Chen, Shiyao Wang, Le Wang, Wangjing Zhou, Pengfei Cai, Jiahui Zhao, Nan Li, Zihan Li, Yuzhe Liang, Xiaopeng Wang, Haorui Zheng, Ming Wen, Kang Yin, Yiran Wang, Nan Li, Feng Deng, Liang Dong, Chen Zhang, Di Zhang, Kun Gai
[ABSTRACT]
We propose Kling-Foley, a large-scale multimodal Video-to-Audio generation
model that synthesizes high-quality audio synchronized with video content. In
Kling-Foley, we introduce multimodal diffusion transformers to model the
interactions between video, audio, and text modalities, and combine it with a
visual semantic representation module and an audio-visual synchronization
module to enhance alignment capabilities. Specifically, these modules align
video conditions with latent audio elements at the frame level, thereby
improving semantic alignment and audio-visual synchronization. Together with
text conditions, this integrated approach enables precise generation of
video-matching sound effects. In addition, we propose a universal latent audio
codec that can achieve high-quality modeling in various scenarios such as sound
effects, speech, singing, and music. We employ a stereo rendering method that
imbues synthesized audio with a spatial presence. At the same time, in order to
make up for the incomplete types and annotations of the open-source benchmark,
we also open-source an industrial-level benchmark Kling-Audio-Eval. Our
experiments show that Kling-Foley trained with the flow matching objective
achieves new audio-visual SOTA performance among public models in terms of
distribution matching, semantic alignment, temporal alignment and audio
quality.
[LINK]
http://arxiv.org/abs/2506.19774v1
[DATE]
2025-06-25 00:39:39+08:00
[CATEGORIES]
cs.CL
SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning
[AUTHORS]
Yuqian Fu, Tinghong Chen, Jiajun Chai, Xihuai Wang, Songjun Tu, Guojun Yin, Wei Lin, Qichao Zhang, Yuanheng Zhu, Dongbin Zhao
[ABSTRACT]
Large language models (LLMs) have achieved remarkable progress in reasoning
tasks, yet the optimal integration of Supervised Fine-Tuning (SFT) and
Reinforcement Learning (RL) remains a fundamental challenge. Through
comprehensive analysis of token distributions, learning dynamics, and
integration mechanisms from entropy-based perspectives, we reveal key
differences between these paradigms: SFT induces coarse-grained global changes
to LLM policy distributions, while RL performs fine-grained selective
optimizations, with entropy serving as a critical indicator of training
effectiveness. Building on these observations, we propose Supervised
Reinforcement Fine-Tuning (SRFT), a single-stage method that unifies both
fine-tuning paradigms through entropy-aware weighting mechanisms. Our approach
simultaneously applies SFT and RL to directly optimize the LLM using
demonstrations and self-exploration rollouts rather than through two-stage
sequential methods. Extensive experiments show that SRFT achieves 59.1% average
accuracy, outperforming zero-RL methods by 9.0% on five mathematical reasoning
benchmarks and 10.9% on three out-of-distribution benchmarks.
[LINK]
http://arxiv.org/abs/2506.19767v1
[DATE]
2025-06-25 00:31:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation
[AUTHORS]
Dimosthenis Antypas, Indira Sen, Carla Perez-Almendros, Jose Camacho-Collados, Francesco Barbieri
[ABSTRACT]
The detection of sensitive content in large datasets is crucial for ensuring
that shared and analysed data is free from harmful material. However, current
moderation tools, such as external APIs, suffer from limitations in
customisation, accuracy across diverse sensitive categories, and privacy
concerns. Additionally, existing datasets and open-source models focus
predominantly on toxic language, leaving gaps in detecting other sensitive
categories such as substance abuse or self-harm. In this paper, we put forward
a unified dataset tailored for social media content moderation across six
sensitive categories: conflictual language, profanity, sexually explicit
material, drug-related content, self-harm, and spam. By collecting and
annotating data with consistent retrieval strategies and guidelines, we address
the shortcomings of previous focalised research. Our analysis demonstrates that
fine-tuning large language models (LLMs) on this novel dataset yields
significant improvements in detection performance compared to open
off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which
underperform by 10-15% overall. This limitation is even more pronounced on
popular moderation APIs, which cannot be easily tailored to specific sensitive
content categories, among others.
[COMMENTS]
Accepted at the 9th Workshop on Online Abuse and Harms (WOAH)
[LINK]
http://arxiv.org/abs/2411.19832v3
[DATE]
2025-06-25 00:31:28+08:00
[CATEGORIES]
cs.CL
Accurate, fast, cheap: Choose three. Replacing Multi-Head-Attention with Bidirectional Recurrent Attention for Long-Form ASR
[AUTHORS]
Martin Ratajczak, Jean-Philippe Robichaud, Jennifer Drexler Fox
[ABSTRACT]
Long-form speech recognition is an application area of increasing research
focus. ASR models based on multi-head attention (MHA) are ill-suited to
long-form ASR because of their quadratic complexity in sequence length. We
build on recent work that has investigated linear complexity recurrent
attention (RA) layers for ASR. We find that bidirectional RA layers can match
the accuracy of MHA for both short- and long-form applications. We present a
strong limited-context attention (LCA) baseline, and show that RA layers are
just as accurate while being more efficient. We develop a long-form training
paradigm which further improves RA performance, leading to better accuracy than
LCA with 44% higher throughput. We also present Direction Dropout, a novel
regularization method that improves accuracy, provides fine-grained control of
the accuracy/throughput trade-off of bidirectional RA, and enables a new
alternating directions decoding mode with even higher throughput.
[COMMENTS]
Accepted to Interspeech 2025
[LINK]
http://arxiv.org/abs/2506.19761v1
[DATE]
2025-06-25 00:21:56+08:00
[CATEGORIES]
cs.CL
Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis
[AUTHORS]
Omar A. Essameldin, Ali O. Elbeih, Wael H. Gomaa, Wael F. Elsersy
[ABSTRACT]
The Arabic language is among the most popular languages in the world with a
huge variety of dialects spoken in 22 countries. In this study, we address the
problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets.
RNN models, Transformer models, and large language models (LLMs) via prompt
engineering are created and tested. Among these, MARBERTv2 performed best with
65% accuracy and 64% F1-score. Through the use of state-of-the-art
preprocessing techniques and the latest NLP models, this paper identifies the
most significant linguistic issues in Arabic dialect identification. The
results support applications such as personalized chatbots that respond in
users’ dialects, social media monitoring, and greater accessibility for Arabic
communities.
[LINK]
http://arxiv.org/abs/2506.19753v1
[DATE]
2025-06-25 00:06:58+08:00
[CATEGORIES]
cs.CL
NEAR$^2$: A Nested Embedding Approach to Efficient Product Retrieval and Ranking
[AUTHORS]
Shenbin Qian, Diptesh Kanojia, Samarth Agrawal, Hadeel Saadany, Swapnil Bhosale, Constantin Orasan, Zhe Wu
[ABSTRACT]
E-commerce information retrieval (IR) systems struggle to simultaneously
achieve high accuracy in interpreting complex user queries and maintain
efficient processing of vast product catalogs. The dual challenge lies in
precisely matching user intent with relevant products while managing the
computational demands of real-time search across massive inventories. In this
paper, we propose a Nested Embedding Approach to product Retrieval and Ranking,
called NEAR$^2$, which can achieve up to $12$ times efficiency in embedding
size at inference time while introducing no extra cost in training and
improving performance in accuracy for various encoder-based Transformer models.
We validate our approach using different loss functions for the retrieval and
ranking task, including multiple negative ranking loss and online contrastive
loss, on four different test sets with various IR challenges such as short and
implicit queries. Our approach achieves improved performance at a smaller
embedding dimension than existing models.
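
To make the nested-embedding idea concrete, the sketch below keeps only the leading coordinates of a vector and re-normalizes them before scoring. It assumes that prefixes of the full embedding are themselves usable embeddings; the re-normalization step and the example sizes (768 reduced to 64, a 12-fold reduction) are illustrative assumptions, not details from the paper.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates of `vec` and rescale to unit length."""
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(v * v for v in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale
    return [v / norm for v in head]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical usage: score a query against a product with 64 of 768 dims.
# score = cosine(truncate_embedding(query_vec, 64),
#                truncate_embedding(product_vec, 64))
```

Storing and comparing only the truncated vectors is where the saving at inference time would come from, since index size and dot-product cost both scale with the embedding dimension.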
[COMMENTS]
This paper is accepted to the 2025 SIGIR Workshop on eCommerce
[LINK]
http://arxiv.org/abs/2506.19743v1
[DATE]
2025-06-25 00:02:02+08:00
[CATEGORIES]
cs.CL
Reinforcement Learning Increases Wind Farm Power Production by Enabling Closed-Loop Collaborative Control
[AUTHORS]
Andrew Mole, Max Weissenbacher, Georgios Rigas, Sylvain Laizet
[ABSTRACT]
Traditional wind farm control operates each turbine independently to maximize
individual power output. However, coordinated wake steering across the entire
farm can substantially increase the combined wind farm energy production.
Although dynamic closed-loop control has proven effective in flow control
applications, wind farm optimization has relied primarily on static,
low-fidelity simulators that ignore critical turbulent flow dynamics. In this
work, we present the first reinforcement learning (RL) controller integrated
directly with high-fidelity large-eddy simulation (LES), enabling real-time
response to atmospheric turbulence through collaborative, dynamic control
strategies. Our RL controller achieves a 4.30% increase in wind farm power
output compared to baseline operation, nearly doubling the 2.19% gain from
static optimal yaw control obtained through Bayesian optimization. These
results establish dynamic flow-responsive control as a transformative approach
to wind farm optimization, with direct implications for accelerating renewable
energy deployment to net-zero targets.
[LINK]
http://arxiv.org/abs/2506.20554v1
[DATE]
2025-06-25 23:53:12+08:00
[CATEGORIES]
cs.LG
Pay Less Attention to Deceptive Artifacts: Robust Detection of Compressed Deepfakes on Online Social Networks
[AUTHORS]
Manyi Li, Renshuai Tao, Yufan Liu, Chuangchuang Tan, Haotong Qin, Bing Li, Yunchao Wei, Yao Zhao
[ABSTRACT]
With the rapid advancement of deep learning, particularly through generative
adversarial networks (GANs) and diffusion models (DMs), AI-generated images, or
“deepfakes”, have become nearly indistinguishable from real ones. These images
are widely shared across Online Social Networks (OSNs), raising concerns about
their misuse. Existing deepfake detection methods overlook the “block effects”
introduced by compression in OSNs, which obscure deepfake artifacts, and
primarily focus on raw images, rarely encountered in real-world scenarios. To
address these challenges, we propose PLADA (Pay Less Attention to Deceptive
Artifacts), a novel framework designed to tackle the lack of paired data and
the ineffective use of compressed images. PLADA consists of two core modules:
Block Effect Eraser (B2E), which uses a dual-stage attention mechanism to
handle block effects, and Open Data Aggregation (ODA), which processes both
paired and unpaired data to improve detection. Extensive experiments across 26
datasets demonstrate that PLADA achieves a remarkable balance in deepfake
detection, outperforming SoTA methods in detecting deepfakes on OSNs, even with
limited paired data and compression. More importantly, this work introduces the
“block effect” as a critical factor in deepfake detection, providing a robust
solution for open-world scenarios. Our code is available at
https://github.com/ManyiLee/PLADA.
[COMMENTS]
20 pages, 10 figures
[LINK]
http://arxiv.org/abs/2506.20548v1
[DATE]
2025-06-25 23:46:41+08:00
[CATEGORIES]
cs.LG
Contextual Optimization under Covariate Shift: A Robust Approach by Intersecting Wasserstein Balls
[AUTHORS]
Tianyu Wang, Ningyuan Chen, Chun Wang
[ABSTRACT]
In contextual optimization, a decision-maker leverages contextual
information, often referred to as covariates, to better resolve uncertainty and
make informed decisions. In this paper, we examine the challenges of contextual
decision-making under covariate shift, a phenomenon where the distribution of
covariates differs between the training and test environments. Such shifts can
lead to inaccurate upstream estimations for test covariates that lie far from
the training data, ultimately resulting in suboptimal downstream decisions. To
tackle these challenges, we propose a novel approach called Intersection
Wasserstein-balls DRO (IW-DRO), which integrates multiple estimation methods
into the distributionally robust optimization (DRO) framework. At the core of
our approach is an innovative ambiguity set defined as the intersection of two
Wasserstein balls, with their centers constructed using appropriate
nonparametric and parametric estimators. On the computational side, we
reformulate the IW-DRO problem as a tractable convex program and develop an
approximate algorithm tailored for large-scale problems to enhance
computational efficiency. From a theoretical perspective, we demonstrate that
IW-DRO achieves superior performance compared to single Wasserstein-ball DRO
models. We further establish performance guarantees by analyzing the coverage
of the intersection ambiguity set and the measure concentration of both
estimators under the Wasserstein distance. Notably, we derive a finite-sample
concentration result for the Nadaraya-Watson kernel estimator under covariate
shift. The proposed IW-DRO framework offers practical value for decision-makers
operating in uncertain environments affected by covariate shifts.
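
For readers unfamiliar with the Nadaraya-Watson kernel estimator named above, here is a minimal one-dimensional sketch. The Gaussian kernel and the default bandwidth are assumptions made for illustration, and the code shows only the plain estimator, not the IW-DRO procedure.

```python
import math


def nadaraya_watson(x_query, xs, ys, bandwidth=1.0):
    """Kernel-weighted average of ys, weighted by closeness of xs to x_query."""
    # Gaussian kernel weights (assumed kernel choice).
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # The query is so far from every training point that all weights
        # underflow, the regime where covariate shift hurts the most.
        raise ValueError("query too far from training covariates")
    return sum(w * y for w, y in zip(weights, ys)) / total
```

When the query covariate lies far from the training covariates, every weight becomes tiny and the estimate rests on very little effective data, which matches the failure mode under covariate shift that the abstract describes.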
[LINK]
http://arxiv.org/abs/2406.02426v2
[DATE]
2025-06-25 23:43:13+08:00
[CATEGORIES]
cs.LG
Demonstration of effective UCB-based routing in skill-based queues on real-world data
[AUTHORS]
Sanne van Kempen, Jaron Sanders, Fiona Sloothaak, Maarten G. Wolf
[ABSTRACT]
This paper is about optimally controlling skill-based queueing systems such
as data centers, cloud computing networks, and service systems. By means of a
case study using a real-world data set, we investigate the practical
implementation of a recently developed reinforcement learning algorithm for
optimal customer routing. Our experiments show that the algorithm efficiently
learns and adapts to changing environments and outperforms static benchmark
policies, indicating its potential for live implementation. We also augment the
real-world applicability of this algorithm by introducing a new heuristic
routing rule to reduce delays. Moreover, we show that the algorithm can
optimize for multiple objectives: next to payoff maximization, secondary
objectives such as server load fairness and customer waiting time reduction can
be incorporated. Tuning parameters are used for balancing inherent performance
trade-offs. Lastly, we investigate the sensitivity to estimation errors and
parameter tuning, providing valuable insights for implementing adaptive routing
algorithms in complex real-world queueing systems.
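
The abstract does not spell out the routing rule. As a hedged sketch of the general idea behind UCB-based routing, the code below applies the textbook UCB1 index to choose a server for the next customer. The bookkeeping variables and the exploration constant are assumptions, and the paper's algorithm may differ in its details.

```python
import math


def ucb_choose(counts, reward_sums, t, c=2.0):
    """Pick the server with the highest upper confidence bound.

    counts:      times each server has been chosen so far
    reward_sums: total payoff observed per server
    t:           number of routing decisions made so far (t >= 1)
    c:           exploration constant (assumed value)
    """
    # Try every server once before trusting any estimate.
    for server, n in enumerate(counts):
        if n == 0:
            return server
    scores = [
        reward_sums[i] / counts[i] + math.sqrt(c * math.log(t) / counts[i])
        for i in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

The bonus term shrinks as a server accumulates observations, so rarely used servers keep being retried. That retrying is what would let a rule of this kind adapt when payoffs change over time.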
[LINK]
http://arxiv.org/abs/2506.20543v1
[DATE]
2025-06-25 23:36:43+08:00
[CATEGORIES]
cs.LG
Adversarial Reasoning at Jailbreaking Time
[AUTHORS]
Mahdi Sabbaghi, Paul Kassianik, George Pappas, Yaron Singer, Amin Karbasi, Hamed Hassani
[ABSTRACT]
As large language models (LLMs) are becoming more capable and widespread, the
study of their failure cases is becoming increasingly important. Recent
advances in standardizing, measuring, and scaling test-time compute suggest new
methodologies for optimizing models to achieve high performance on hard tasks.
In this paper, we apply these advances to the task of model jailbreaking:
eliciting harmful responses from aligned LLMs. We develop an adversarial
reasoning approach to automatic jailbreaking that leverages a loss signal to
guide the test-time compute, achieving SOTA attack success rates against many
aligned LLMs, even those that aim to trade inference-time compute for
adversarial robustness. Our approach introduces a new paradigm in understanding
LLM vulnerabilities, laying the foundation for the development of more robust
and trustworthy AI systems.
[COMMENTS]
Accepted to the 42nd International Conference on Machine Learning (ICML 2025)
[LINK]
http://arxiv.org/abs/2502.01633v2
[DATE]
2025-06-25 23:31:17+08:00
[CATEGORIES]
cs.LG
Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Laser Powder Bed Fusion
[AUTHORS]
R. Sharma, M. Raissi, Y. B. Guo
[ABSTRACT]
Efficient simulation of Laser Powder Bed Fusion (LPBF) is crucial for process
prediction due to the lasting issue of high computation cost using traditional
numerical methods such as finite element analysis (FEA). This study presents an
efficient modeling framework termed FEA-Regulated Physics-Informed Neural
Network (FEA-PINN) to accelerate the thermal field prediction in an LPBF process
while maintaining the FEA accuracy. A novel dynamic material updating strategy
is developed to capture the dynamic phase change of powder-liquid-solid in the
PINN model. The PINN model incorporates temperature-dependent material
properties and phase change behavior using the apparent heat capacity method.
While the PINN model demonstrates high accuracy with a small amount of training
data and enables generalization to new process parameters via transfer learning, it
faces the challenge of high computation cost in time-dependent problems due to
the residual accumulation. To overcome this issue, the FEA-PINN framework
integrates corrective FEA simulations during inference to enforce physical
consistency and reduce error drift. A comparative analysis shows that FEA-PINN
achieves equivalent accuracy to FEA while significantly reducing computational
cost. The framework has been validated using the benchmark FEA data and
demonstrated through single-track scanning in LPBF.
[LINK]
http://arxiv.org/abs/2506.20537v1
[DATE]
2025-06-25 23:25:01+08:00
[CATEGORIES]
cs.LG
WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads
[AUTHORS]
Hongzhen Huang, Kunming Zhang, Hanlong Liao, Kui Wu, Guoming Tang
[ABSTRACT]
The rapid advancement of AI, particularly large language models (LLMs), has
raised significant concerns about the energy use and carbon emissions
associated with model training and inference. However, existing tools for
measuring and reporting such impacts are often fragmented, lacking systematic
metric integration and offering limited support for correlation analysis among
them. This paper presents WattsOnAI, a comprehensive software toolkit for the
measurement, analysis, and visualization of energy use, power draw, hardware
performance, and carbon emissions across AI workloads. By seamlessly
integrating with existing AI frameworks, WattsOnAI offers standardized reports
and exports fine-grained time-series data to support benchmarking and
reproducibility in a lightweight manner. It further enables in-depth
correlation analysis between hardware metrics and model performance and thus
facilitates bottleneck identification and performance enhancement. By
addressing critical limitations in existing tools, WattsOnAI encourages the
research community to weigh environmental impact alongside raw performance of
AI workloads and advances the shift toward more sustainable “Green AI”
practices. The code is available at https://github.com/SusCom-Lab/WattsOnAI.
[COMMENTS]
11 pages, 7 figures and 5 tables
[LINK]
http://arxiv.org/abs/2506.20535v1
[DATE]
2025-06-25 23:24:45+08:00
[CATEGORIES]
cs.LG
Global Convergence of Iteratively Reweighted Least Squares for Robust Subspace Recovery
[AUTHORS]
Gilad Lerman, Kang Li, Tyler Maunu, Teng Zhang
[ABSTRACT]
Robust subspace estimation is fundamental to many machine learning and data
analysis tasks. Iteratively Reweighted Least Squares (IRLS) is an elegant and
empirically effective approach to this problem, yet its theoretical properties
remain poorly understood. This paper establishes that, under deterministic
conditions, a variant of IRLS with dynamic smoothing regularization converges
linearly to the underlying subspace from any initialization. We extend these
guarantees to affine subspace estimation, a setting that lacks prior recovery
theory. Additionally, we illustrate the practical benefits of IRLS through an
application to low-dimensional neural network training. Our results provide the
first global convergence guarantees for IRLS in robust subspace recovery and,
more broadly, for nonconvex IRLS on a Riemannian manifold.
[LINK]
http://arxiv.org/abs/2506.20533v1
[DATE]
2025-06-25 23:23:32+08:00
[CATEGORIES]
cs.LG
Variational Learning Finds Flatter Solutions at the Edge of Stability
[AUTHORS]
Avrajit Ghosh, Bai Cong, Rio Yokota, Saiprasad Ravishankar, Rongrong Wang, Molei Tao, Mohammad Emtiyaz Khan, Thomas Möllenhoff
[ABSTRACT]
Variational Learning (VL) has recently gained popularity for training deep
neural networks and is competitive with standard learning methods. Part of its
empirical success can be explained by theories such as PAC-Bayes bounds,
minimum description length and marginal likelihood, but there are few tools to
unravel the implicit regularization in play. Here, we analyze the implicit
regularization of VL through the Edge of Stability (EoS) framework. EoS has
previously been used to show that gradient descent can find flat solutions and
we extend this result to VL to show that it can find even flatter solutions.
This is obtained by controlling the posterior covariance and the number of
Monte Carlo samples from the posterior. These results are derived in a fashion
similar to the standard EoS literature for deep learning, by first deriving a
result for a quadratic problem and then extending it to deep neural networks.
We empirically validate these findings on a wide variety of large networks,
such as ResNet and ViT, to find that the theoretical results closely match the
empirical ones. Ours is the first work to analyze the EoS dynamics in VL.
[LINK]
http://arxiv.org/abs/2506.12903v2
[DATE]
2025-06-25 23:17:32+08:00
[CATEGORIES]
cs.LG
Proximal Control of UAVs with Federated Learning for Human-Robot Collaborative Domains
[AUTHORS]
Lucas Nogueira Nobrega, Ewerton de Oliveira, Martin Saska, Tiago Nascimento
[COMMENTS]
version 2
[LINK]
http://arxiv.org/abs/2412.02863v2
[DATE]
2025-06-25 23:15:12+08:00
[CATEGORIES]
cs.LG
Industrial Energy Disaggregation with Digital Twin-generated Dataset and Efficient Data Augmentation
[AUTHORS]
Christian Internò, Andrea Castellani, Sebastian Schmitt, Fabio Stella, Barbara Hammer
[ABSTRACT]
Industrial Non-Intrusive Load Monitoring (NILM) is limited by the scarcity of
high-quality datasets and the complex variability of industrial energy
consumption patterns. To address data scarcity and privacy issues, we introduce
the Synthetic Industrial Dataset for Energy Disaggregation (SIDED), an
open-source dataset generated using Digital Twin simulations. SIDED includes
three types of industrial facilities across three different geographic
locations, capturing diverse appliance behaviors, weather conditions, and load
profiles. We also propose the Appliance-Modulated Data Augmentation (AMDA)
method, a computationally efficient technique that enhances NILM model
generalization by intelligently scaling appliance power contributions based on
their relative impact. We show in experiments that NILM models trained with
AMDA-augmented data significantly improve the disaggregation of energy
consumption of complex industrial appliances like combined heat and power
systems. Specifically, in our out-of-sample scenarios, models trained with AMDA
achieved a Normalized Disaggregation Error of 0.093, outperforming models
trained without data augmentation (0.451) and those trained with random data
augmentation (0.290). Data distribution analyses confirm that AMDA effectively
aligns training and test data distributions, enhancing model generalization.
[LINK]
http://arxiv.org/abs/2506.20525v1
[DATE]
2025-06-25 23:10:43+08:00
[CATEGORIES]
cs.LG
On Advancements of the Forward-Forward Algorithm
[AUTHORS]
Mauricio Ortiz Torres, Markus Lange, Arne P. Raulf
[ABSTRACT]
The Forward-Forward algorithm has evolved in machine learning research,
tackling more complex tasks that mimic real-life applications. In recent
years, several techniques have improved it beyond its original version,
allowing it to handle a challenging dataset like CIFAR10 without losing
its flexibility and low memory usage. Our results show that
improvements are achieved through a combination of convolutional channel
grouping, learning rate schedules, and independent block structures during
training, which lead to a 20% decrease in test error percentage. Additionally,
with a view to further implementations on low-capacity hardware, we
present a series of lighter models that achieve test error percentages
within (21 ± 3)% with between 164,706 and 754,386 trainable
parameters. This serves as a basis for our future study on complete verification
and validation of these kinds of neural networks.
[COMMENTS]
This work has been submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2504.21662v2
[DATE]
2025-06-25 23:08:49+08:00
[CATEGORIES]
cs.LG
Fast ground penetrating radar dual-parameter full waveform inversion method accelerated by hybrid compilation of CUDA kernel function and PyTorch
[AUTHORS]
Lei Liu, Chao Song, Liangsheng He, Silin Wang, Xuan Feng, Cai Liu
[ABSTRACT]
This study proposes a high-performance dual-parameter full waveform inversion
(FWI) framework for ground-penetrating radar (GPR), accelerated through the
hybrid compilation of CUDA kernel functions and PyTorch. The method leverages
the computational efficiency of GPU programming while preserving the
flexibility and usability of Python-based deep learning frameworks. By
integrating customized CUDA kernels into PyTorch’s automatic differentiation
mechanism, the framework enables accurate and efficient inversion of both
dielectric permittivity and electrical conductivity. Experimental evaluations
on synthetic data and real wavefield data demonstrate that the proposed method
achieves dual-parameter FWI for GPR data while maintaining high accuracy.
Moreover, the framework is flexible and extensible, supporting optional
regularization strategies such as total variation and multi-scale inversion.
These features make the proposed approach a practical and scalable framework
for rapid GPR-based subsurface imaging in applications including civil
engineering, environmental monitoring, and geophysical exploration.
[LINK]
http://arxiv.org/abs/2506.20513v1
[DATE]
2025-06-25 23:00:33+08:00
[CATEGORIES]
cs.LG
Collaborative Batch Size Optimization for Federated Learning
[AUTHORS]
Arno Geimer, Karthick Panner Selvam, Beltran Fiz Pontiveros
[ABSTRACT]
Federated Learning (FL) is a decentralized collaborative Machine Learning
framework for training models without collecting data in a centralized
location. It has seen application across various disciplines, from helping
medical diagnoses in hospitals to detecting fraud in financial transactions. In
this paper, we focus on improving the local training process through hardware
usage optimization. While participants in a federation might share the hardware
they are training on, since there is no information exchange between them,
their training process can be hindered by an improper training configuration.
Taking advantage of the parallel processing inherent to Federated Learning, we
use a greedy randomized search to optimize local batch sizes for the best
training settings across all participants. Our results show that against
default parameter settings, our method improves convergence speed while staying
nearly on par with the case where local parameters are optimized.
[LINK]
http://arxiv.org/abs/2506.20511v1
[DATE]
2025-06-25 22:57:23+08:00
[CATEGORIES]
cs.LG
Unidentified and Confounded? Understanding Two-Tower Models for Unbiased Learning to Rank
[AUTHORS]
Philipp Hager, Onno Zoeter, Maarten de Rijke
[ABSTRACT]
Additive two-tower models are popular learning-to-rank methods for handling
biased user feedback in industry settings. Recent studies, however, report a
concerning phenomenon: training two-tower models on clicks collected by
well-performing production systems leads to decreased ranking performance. This
paper investigates two recent explanations for this observation: confounding
effects from logging policies and model identifiability issues. We
theoretically analyze the identifiability conditions of two-tower models,
showing that either document swaps across positions or overlapping feature
distributions are required to recover model parameters from clicks. We also
investigate the effect of logging policies on two-tower models, finding that
they introduce no bias when models perfectly capture user behavior. However,
logging policies can amplify biases when models imperfectly capture user
behavior, particularly when prediction errors correlate with document placement
across positions. We propose a sample weighting technique to mitigate these
effects and provide actionable insights for researchers and practitioners using
two-tower models.
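
To make the identifiability question concrete, here is a minimal sketch of an additive two-tower click model: a relevance logit and a position-bias logit are summed and passed through a sigmoid. The tower outputs, the document and position names, and the shift constant are all hypothetical. The sketch shows only the most basic ambiguity, a constant moved from one tower to the other, and not the paper's full identifiability conditions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def click_probability(relevance_logit, position_bias_logit):
    """Additive two-tower model: tower outputs are summed in logit space."""
    return sigmoid(relevance_logit + position_bias_logit)

# Hypothetical tower outputs for two documents and two rank positions.
relevance = {"doc_a": 1.2, "doc_b": -0.3}
position_bias = {1: 0.8, 2: -0.5}

# Move a constant from one tower to the other.
shift = 0.7
shifted_relevance = {d: r + shift for d, r in relevance.items()}
shifted_bias = {k: b - shift for k, b in position_bias.items()}

# Largest change in predicted click probability over all (doc, position) pairs.
max_gap = max(
    abs(click_probability(relevance[d], position_bias[k])
        - click_probability(shifted_relevance[d], shifted_bias[k]))
    for d in relevance
    for k in position_bias
)
```

Both parameter sets predict the same click probabilities up to floating-point rounding, so clicks alone cannot tell them apart.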
[LINK]
http://arxiv.org/abs/2506.20501v1
[DATE]
2025-06-25 22:47:43+08:00
[CATEGORIES]
cs.LG
Training Plug-n-Play Knowledge Modules with Deep Context Distillation
[AUTHORS]
Lucas Caccia, Alan Ansell, Edoardo Ponti, Ivan Vulić, Alessandro Sordoni
[ABSTRACT]
Dynamically integrating new or rapidly evolving information after (Large)
Language Model pre-training remains challenging, particularly in low-data
scenarios or when dealing with private and specialized documents. In-context
learning and retrieval-augmented generation (RAG) face limitations, including
their high inference costs and their inability to capture global document
information. In this paper, we propose a way of modularizing knowledge by
training document-level Knowledge Modules (KMs). KMs are lightweight components
implemented as parameter-efficient LoRA modules, which are trained to store
information about new documents and can be easily plugged into models on
demand. We show that next-token prediction performs poorly as the training
objective for KMs. We instead propose Deep Context Distillation: we learn KM
parameters so as to simulate the hidden states and logits of a teacher that takes
the document in context. Our method outperforms standard next-token prediction
and pre-instruction training techniques, across two datasets. Finally, we
highlight synergies between KMs and RAG.
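
A rough sketch of what a distillation objective of this kind could look like for a single token position follows. The abstract does not give the exact loss, so the KL term on logits, the mean-squared term on hidden states, the weighting, and every name below are assumptions for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def distillation_loss(teacher_logits, student_logits,
                      teacher_hidden, student_hidden, hidden_weight=1.0):
    """Assumed loss: KL on output distributions plus MSE on hidden states.

    The teacher sees the document in context; the student (base model plus
    knowledge module) does not, and is trained to match the teacher.
    """
    logit_term = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    hidden_term = sum((t - s) ** 2
                      for t, s in zip(teacher_hidden, student_hidden))
    hidden_term /= len(teacher_hidden)
    return logit_term + hidden_weight * hidden_term

# A student that matches the teacher exactly incurs zero loss.
matched = distillation_loss([1.0, 2.0], [1.0, 2.0], [0.5, -0.2], [0.5, -0.2])
mismatched = distillation_loss([1.0, 2.0], [2.0, 1.0], [0.5, -0.2], [0.0, 0.0])
```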
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2503.08727v3
[DATE]
2025-06-25 22:45:56+08:00
[CATEGORIES]
cs.LG
Multimodal Representation Learning and Fusion
[AUTHORS]
Qihang Jin, Enze Ge, Yuhang Xie, Hongying Luo, Junhao Song, Ziqian Bi, Chia Xin Liang, Jibin Guan, Joe Yeong, Junfeng Hao
[ABSTRACT]
Multi-modal learning is a fast-growing area in artificial intelligence. It
tries to help machines understand complex things by combining information from
different sources, like images, text, and audio. By using the strengths of each
modality, multi-modal learning allows AI systems to build stronger and richer
internal representations. These help machines interpret, reason, and make
decisions in real-life situations. This field includes core
techniques such as representation learning (to get shared features from
different data types), alignment methods (to match information across
modalities), and fusion strategies (to combine them using deep learning models).
Although there has been good progress, some major problems remain, such as
dealing with different data formats, missing or incomplete inputs, and
defending against adversarial attacks. Researchers are now exploring new
methods, such as unsupervised or semi-supervised learning and AutoML tools, to
make models more efficient and easier to scale. More attention is also being
paid to designing better evaluation metrics or building shared benchmarks,
which make it easier to compare model performance across tasks and domains. As
the field continues to grow, multi-modal learning is expected to improve many
areas, including computer vision, natural language processing, speech
recognition, and healthcare. In the future, it may help to build AI systems
that can understand the world in a more human-like way: flexible,
context-aware, and able to deal with real-world complexity.
[LINK]
http://arxiv.org/abs/2506.20494v1
[DATE]
2025-06-25 22:40:09+08:00
[CATEGORIES]
cs.LG
Non-equilibrium Annealed Adjoint Sampler
[AUTHORS]
Jaemoo Choi, Yongxin Chen, Molei Tao, Guan-Horng Liu
[ABSTRACT]
Recently, there has been significant progress in learning-based diffusion
samplers, which aim to sample from a given unnormalized density. These methods
typically follow one of two paradigms: (i) formulating sampling as an unbiased
stochastic optimal control (SOC) problem using a canonical reference process,
or (ii) refining annealed path measures through importance-weighted sampling.
Although annealing approaches have advantages in guiding samples toward
high-density regions, reliance on importance sampling leads to high variance
and limited scalability in practice. In this paper, we introduce the
Non-equilibrium Annealed Adjoint Sampler (NAAS), a novel SOC-based
diffusion sampler that leverages annealed reference dynamics without resorting
to importance sampling. NAAS employs a lean adjoint system inspired by adjoint
matching, enabling efficient and scalable training. We demonstrate the
effectiveness of our approach across a range of tasks, including sampling from
classical energy landscapes and molecular Boltzmann distributions.
[COMMENTS]
21 pages, 7 figures
[LINK]
http://arxiv.org/abs/2506.18165v2
[DATE]
2025-06-25 22:39:40+08:00
[CATEGORIES]
cs.LG
Offline Goal-Conditioned Reinforcement Learning with Projective Quasimetric Planning
[AUTHORS]
Anthony Kobanda, Waris Radji, Mathieu Petitbois, Odalric-Ambrym Maillard, Rémy Portelas
[ABSTRACT]
Offline Goal-Conditioned Reinforcement Learning seeks to train agents to
reach specified goals from previously collected trajectories. Scaling that
promise to long-horizon tasks remains challenging, notably due to compounding
value-estimation errors. Principled geometry offers a potential solution to
address these issues. Following this insight, we introduce Projective
Quasimetric Planning (ProQ), a compositional framework that learns an
asymmetric distance and then repurposes it, firstly as a repulsive energy
forcing a sparse set of keypoints to uniformly spread over the learned latent
space, and secondly as a structured directional cost guiding towards proximal
sub-goals. In particular, ProQ couples this geometry with a Lagrangian
out-of-distribution detector to ensure the learned keypoints stay within
reachable areas. By unifying metric learning, keypoint coverage, and
goal-conditioned control, our approach produces meaningful sub-goals and
robustly drives long-horizon goal-reaching on diverse navigation benchmarks.
[LINK]
http://arxiv.org/abs/2506.18847v2
[DATE]
2025-06-25 22:37:00+08:00
[CATEGORIES]
cs.LG
MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing
[AUTHORS]
Asif Rahman, Veljko Cvetkovic, Kathleen Reece, Aidan Walters, Yasir Hassan, Aneesh Tummeti, Bryan Torres, Denise Cooney, Margaret Ellis, Dimitrios S. Nikolopoulos
[ABSTRACT]
Large language models (LLMs) have transformed software development through
code generation capabilities, yet their effectiveness for high-performance
computing (HPC) remains limited. HPC code requires specialized optimizations
for parallelism, memory efficiency, and architecture-specific considerations
that general-purpose LLMs often overlook. We present MARCO (Multi-Agent
Reactive Code Optimizer), a novel framework that enhances LLM-generated code
for HPC through a specialized multi-agent architecture. MARCO employs separate
agents for code generation and performance evaluation, connected by a feedback
loop that progressively refines optimizations. A key innovation is MARCO’s
web-search component that retrieves real-time optimization techniques from
recent conference proceedings and research publications, bridging the knowledge
gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem
set demonstrates that MARCO achieves a 14.6% average runtime reduction
compared to Claude 3.5 Sonnet alone, while the integration of the web-search
component yields a 30.9% performance improvement over the base MARCO system.
These results highlight the potential of multi-agent systems to address the
specialized requirements of high-performance code generation, offering a
cost-effective alternative to domain-specific model fine-tuning.
[COMMENTS]
9 pages, 4 figures, 2 tables
[LINK]
http://arxiv.org/abs/2505.03906v3
[DATE]
2025-06-25 22:22:04+08:00
[CATEGORIES]
cs.LG
Physics-informed Imitative Reinforcement Learning for Real-world Driving
[AUTHORS]
Hang Zhou, Yihao Qin, Dan Xu, Yiding Ji
[ABSTRACT]
Recent advances in imitative reinforcement learning (IRL) have considerably
enhanced the ability of autonomous agents to assimilate expert demonstrations,
leading to rapid skill acquisition in a range of demanding tasks. However, such
learning-based agents face significant challenges when transferring knowledge
to highly dynamic closed-loop environments. Their performance is significantly
impacted by the conflicting optimization objectives of imitation learning (IL)
and reinforcement learning (RL), sample inefficiency, and the complexity of
uncovering the hidden world model and physics. To address this challenge, we
propose a physics-informed IRL that is entirely data-driven. It leverages both
expert demonstration data and exploratory data with a joint optimization
objective, allowing the underlying physical principles of vehicle dynamics to
emerge naturally from the training process. The performance is evaluated
through empirical experiments, and the results exceed popular IL, RL and IRL
algorithms in closed-loop settings on the Waymax benchmark. Our approach
exhibits a 37.8% reduction in collision rate and a 22.2% reduction in off-road
rate compared to the baseline method.
[LINK]
http://arxiv.org/abs/2407.02508v3
[DATE]
2025-06-25 22:06:21+08:00
[CATEGORIES]
cs.LG
HiWave: Training-Free High-Resolution Image Generation via Wavelet-Based Diffusion Sampling
[AUTHORS]
Tobias Vontobel, Seyedmorteza Sadat, Farnood Salehi, Romann M. Weber
[ABSTRACT]
Diffusion models have emerged as the leading approach for image synthesis,
demonstrating exceptional photorealism and diversity. However, training
diffusion models at high resolutions remains computationally prohibitive, and
existing zero-shot generation techniques for synthesizing images beyond
training resolutions often produce artifacts, including object duplication and
spatial incoherence. In this paper, we introduce HiWave, a training-free,
zero-shot approach that substantially enhances visual fidelity and structural
coherence in ultra-high-resolution image synthesis using pretrained diffusion
models. Our method employs a two-stage pipeline: generating a base image from
the pretrained model followed by a patch-wise DDIM inversion step and a novel
wavelet-based detail enhancer module. Specifically, we first utilize inversion
methods to derive initial noise vectors that preserve global coherence from the
base image. Subsequently, during sampling, our wavelet-domain detail enhancer
retains low-frequency components from the base image to ensure structural
consistency, while selectively guiding high-frequency components to enrich fine
details and textures. Extensive evaluations using Stable Diffusion XL
demonstrate that HiWave effectively mitigates common visual artifacts seen in
prior methods, achieving superior perceptual quality. A user study confirmed
HiWave’s performance, where it was preferred over the state-of-the-art
alternative in more than 80% of comparisons, highlighting its effectiveness for
high-quality, ultra-high-resolution image synthesis without requiring
retraining or architectural modifications.
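
The frequency split described above can be illustrated with a one-level Haar transform on a 1-D signal: low-frequency (pairwise average) coefficients come from a base signal, and high-frequency (pairwise difference) coefficients come from another. This is a simplified stand-in, not HiWave's detail enhancer, which guides the high-frequency components during diffusion sampling rather than swapping them. The function names and toy signals are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_forward(signal):
    """One-level Haar transform of an even-length sequence."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]   # low-frequency band
    detail = [(a - b) / SQRT2 for a, b in pairs]   # high-frequency band
    return approx, detail

def haar_inverse(approx, detail):
    """Invert haar_forward."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out

def merge_bands(base, candidate):
    """Keep the base signal's low band and the candidate's high band."""
    base_low, _ = haar_forward(base)
    _, candidate_high = haar_forward(candidate)
    return haar_inverse(base_low, candidate_high)

base = [4.0, 4.0, 2.0, 2.0]        # smooth structure to preserve
candidate = [7.0, 5.0, 0.0, 2.0]   # different structure, but with fine detail
merged = merge_bands(base, candidate)
```

The merged signal keeps the pairwise averages of `base`, so its coarse structure is unchanged, and takes its pairwise differences from `candidate`.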
[LINK]
http://arxiv.org/abs/2506.20452v1
[DATE]
2025-06-25 21:58:37+08:00
[CATEGORIES]
cs.LG
Méthode de quadrature pour les PINNs fondée théoriquement sur la hessienne des résiduels (A Quadrature Method for PINNs Theoretically Grounded in the Hessian of the Residuals)
[AUTHORS]
Antoine Caradot, Rémi Emonet, Amaury Habrard, Abdel-Rahim Mezidi, Marc Sebban
[ABSTRACT]
Physics-informed Neural Networks (PINNs) have emerged as an efficient way to
learn surrogate neural solvers of PDEs by embedding the physical model in the
loss function and minimizing its residuals using automatic differentiation at
so-called collocation points. Originally uniformly sampled, the choice of the
latter has been the subject of recent advances leading to adaptive sampling
refinements. In this paper, we propose a new quadrature method for
approximating definite integrals based on the Hessian of the considered
function, which we leverage to guide the selection of the collocation points
during the training process of PINNs.
[COMMENTS]
10 pages. In French. Comments are welcome
[LINK]
http://arxiv.org/abs/2506.20441v1
[DATE]
2025-06-25 21:49:53+08:00
[CATEGORIES]
cs.LG
Scalable Subset Selection in Linear Mixed Models
[AUTHORS]
Ryan Thompson, Matt P. Wand, Joanna J. J. Wang
[ABSTRACT]
Linear mixed models (LMMs), which incorporate fixed and random effects, are
key tools for analyzing heterogeneous data, such as in personalized medicine or
adaptive marketing. Nowadays, this type of data is increasingly wide, sometimes
containing thousands of candidate predictors, necessitating sparsity for
prediction and interpretation. However, existing sparse learning methods for
LMMs do not scale well beyond tens or hundreds of predictors, leaving a large
gap compared with sparse methods for linear models, which ignore random
effects. This paper closes the gap with a new $\ell_0$ regularized method for
LMM subset selection that can run on datasets containing thousands of
predictors in seconds to minutes. On the computational front, we develop a
coordinate descent algorithm as our main workhorse and provide a guarantee of
its convergence. We also develop a local search algorithm to help traverse the
nonconvex optimization surface. Both algorithms readily extend to subset
selection in generalized LMMs via a penalized quasi-likelihood approximation.
On the statistical front, we provide a finite-sample bound on the
Kullback-Leibler divergence of the new method. We then demonstrate its
excellent performance in synthetic experiments and illustrate its utility on
two datasets from biology and journalism.
[LINK]
http://arxiv.org/abs/2506.20425v1
[DATE]
2025-06-25 21:39:30+08:00
[CATEGORIES]
cs.LG
Off-Policy Evaluation and Learning for the Future under Non-Stationarity
[AUTHORS]
Tatsuhiro Shimizu, Kazuki Kawamura, Takanori Muroi, Yusuke Narita, Kei Tateno, Takuma Udagawa, Yuta Saito
[ABSTRACT]
We study the novel problem of future off-policy evaluation (F-OPE) and
learning (F-OPL) for estimating and optimizing the future value of policies in
non-stationary environments, where distributions vary over time. In e-commerce
recommendations, for instance, our goal is often to estimate and optimize the
policy value for the upcoming month using data collected by an old policy in
the previous month. A critical challenge is that data related to the future
environment is not observed in the historical data. Existing methods assume
stationarity or depend on restrictive reward-modeling assumptions, leading to
significant bias. To address these limitations, we propose a novel estimator
named Off-Policy Estimator for the Future
Value (OPFV), designed for accurately estimating
policy values at any future time point. The key feature of OPFV is its ability
to leverage the useful structure within time-series data. While future data
might not be present in the historical log, we can leverage, for example,
seasonal, weekly, or holiday effects that are consistent in both the historical
and future data. Our estimator is the first to exploit these time-related
structures via a new type of importance weighting, enabling effective F-OPE.
Theoretical analysis identifies the conditions under which OPFV becomes
low-bias. In addition, we extend our estimator to develop a new policy-gradient
method to proactively learn a good future policy using only historical data.
Empirical results show that our methods substantially outperform existing
methods in estimating and optimizing the future policy value under
non-stationarity for various experimental setups.
[LINK]
http://arxiv.org/abs/2506.20417v1
[DATE]
2025-06-25 21:31:46+08:00
[CATEGORIES]
cs.LG
Client Clustering Meets Knowledge Sharing: Enhancing Privacy and Robustness in Personalized Peer-to-Peer Learning
[AUTHORS]
Mohammad Mahdi Maheri, Denys Herasymuk, Hamed Haddadi
[ABSTRACT]
The growing adoption of Artificial Intelligence (AI) in Internet of Things
(IoT) ecosystems has intensified the need for personalized learning methods
that can operate efficiently and privately across heterogeneous,
resource-constrained devices. However, enabling effective personalized learning
in decentralized settings introduces several challenges, including efficient
knowledge transfer between clients, protection of data privacy, and resilience
against poisoning attacks. In this paper, we address these challenges by
developing P4 (Personalized, Private, Peer-to-Peer) – a method designed to
deliver personalized models for resource-constrained IoT devices while ensuring
differential privacy and robustness against poisoning attacks. Our solution
employs a lightweight, fully decentralized algorithm to privately detect client
similarity and form collaborative groups. Within each group, clients leverage
differentially private knowledge distillation to co-train their models,
maintaining high accuracy while ensuring robustness to the presence of
malicious clients. We evaluate P4 on popular benchmark datasets using both
linear and CNN-based architectures across various heterogeneity settings and
attack scenarios. Experimental results show that P4 achieves 5% to 30% higher
accuracy than leading differentially private peer-to-peer approaches and
maintains robustness with up to 30% malicious clients. Additionally, we
demonstrate its practicality by deploying it on resource-constrained devices,
where collaborative training between two clients adds only ~7 seconds of
overhead.
[LINK]
http://arxiv.org/abs/2506.20413v1
[DATE]
2025-06-25 21:27:36+08:00
[CATEGORIES]
cs.LG
POLAR: A Pessimistic Model-based Policy Learning Algorithm for Dynamic Treatment Regimes
[AUTHORS]
Ruijia Zhang, Zhengling Qi, Yue Wu, Xiangyu Zhang, Yanxun Xu
[ABSTRACT]
Dynamic treatment regimes (DTRs) provide a principled framework for
optimizing sequential decision-making in domains where decisions must adapt
over time in response to individual trajectories, such as healthcare,
education, and digital interventions. However, existing statistical methods
often rely on strong positivity assumptions and lack robustness under partial
data coverage, while offline reinforcement learning approaches typically focus
on average training performance, lack statistical guarantees, and require
solving complex optimization problems. To address these challenges, we propose
POLAR, a novel pessimistic model-based policy learning algorithm for offline
DTR optimization. POLAR estimates the transition dynamics from offline data and
quantifies uncertainty for each history-action pair. A pessimistic penalty is
then incorporated into the reward function to discourage actions with high
uncertainty. Unlike many existing methods that focus on average training
performance, POLAR directly targets the suboptimality of the final learned
policy and offers theoretical guarantees, without relying on computationally
intensive minimax or constrained optimization procedures. To the best of our
knowledge, POLAR is the first model-based DTR method to provide both
statistical and computational guarantees, including finite-sample bounds on
policy suboptimality. Empirical results on both synthetic data and the
MIMIC-III dataset demonstrate that POLAR outperforms state-of-the-art methods
and yields near-optimal, history-aware treatment strategies.
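
The pessimistic penalty can be pictured with a single-decision toy example: the estimated reward of each action is reduced in proportion to its uncertainty before actions are compared. POLAR itself works over multi-stage histories with an estimated transition model. The function names, the linear penalty form, and the numbers below are assumptions for illustration.

```python
def pessimistic_reward(reward_estimate, uncertainty, penalty_weight):
    """Estimated reward minus an uncertainty penalty (assumed linear form)."""
    return reward_estimate - penalty_weight * uncertainty

def pick_action(candidates, penalty_weight):
    """Choose the action with the highest penalized reward.

    candidates maps an action name to (reward_estimate, uncertainty).
    """
    return max(
        candidates,
        key=lambda a: pessimistic_reward(*candidates[a], penalty_weight),
    )

# Hypothetical estimates: treat_b looks better but is poorly covered by the data.
candidates = {"treat_a": (1.0, 0.1), "treat_b": (1.3, 0.9)}

greedy_choice = pick_action(candidates, penalty_weight=0.0)
cautious_choice = pick_action(candidates, penalty_weight=1.0)
```

With no penalty the higher point estimate wins. With the penalty switched on, the better-supported action is preferred instead.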
[LINK]
http://arxiv.org/abs/2506.20406v1
[DATE]
2025-06-25 21:22:57+08:00
[CATEGORIES]
cs.LG
scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection
[AUTHORS]
Zhen Yuan, Shaoqing Jiao, Yihang Xiao, Jiajie Peng
[ABSTRACT]
The advent of single-cell multi-omics technologies has enabled the
simultaneous profiling of diverse omics layers within individual cells.
Integrating such multimodal data provides unprecedented insights into cellular
identity, regulatory processes, and disease mechanisms. However, it remains
challenging, as current methods often rely on selecting highly variable genes
or peaks during preprocessing, which may inadvertently discard crucial
biological information. Here, we present scMamba, a foundation model designed
to integrate single-cell multi-omics data without the need for prior feature
selection while preserving genomic positional information. scMamba introduces a
patch-based cell tokenization strategy that treats genomic regions as words
(tokens) and cells as sentences. Building upon the concept of state space
duality, scMamba distills rich biological insights from high-dimensional,
sparse single-cell multi-omics data. Additionally, our novel contrastive
learning approach, enhanced with cosine similarity regularization, enables
superior alignment across omics layers compared to traditional methods.
Systematic benchmarking across multiple datasets demonstrates that scMamba
significantly outperforms state-of-the-art methods in preserving biological
variation, aligning omics layers, and enhancing key downstream tasks such as
clustering, cell type annotation, and trajectory inference. Our findings
position scMamba as a powerful tool for large-scale single-cell multi-omics
integration, capable of handling large-scale atlases and advancing biological
discovery.
[LINK]
http://arxiv.org/abs/2506.20697v1
[DATE]
2025-06-25 20:58:01+08:00
[CATEGORIES]
cs.LG
Exploiting Lightweight Hierarchical ViT and Dynamic Framework for Efficient Visual Tracking
[AUTHORS]
Ben Kang, Xin Chen, Jie Zhao, Chunjuan Bo, Dong Wang, Huchuan Lu
[ABSTRACT]
Transformer-based visual trackers have demonstrated significant advancements
due to their powerful modeling capabilities. However, their practicality is
limited on resource-constrained devices because of their slow processing
speeds. To address this challenge, we present HiT, a novel family of efficient
tracking models that achieve high performance while maintaining fast operation
across various devices. The core innovation of HiT lies in its Bridge Module,
which connects lightweight transformers to the tracking framework, enhancing
feature representation quality. Additionally, we introduce a dual-image
position encoding approach to effectively encode spatial information. HiT
achieves an impressive speed of 61 frames per second (fps) on the NVIDIA Jetson
AGX platform, alongside a competitive AUC of 64.6% on the LaSOT benchmark,
outperforming all previous efficient trackers. Building on HiT, we propose
DyHiT, an efficient dynamic tracker that flexibly adapts to scene complexity by
selecting routes with varying computational requirements. DyHiT uses search
area features extracted by the backbone network and inputs them into an
efficient dynamic router to classify tracking scenarios. Based on the
classification, DyHiT applies a divide-and-conquer strategy, selecting
appropriate routes to achieve a superior trade-off between accuracy and speed.
The fastest version of DyHiT achieves 111 fps on NVIDIA Jetson AGX while
maintaining an AUC of 62.4% on LaSOT. Furthermore, we introduce a training-free
acceleration method based on the dynamic routing architecture of DyHiT. This
method significantly improves the execution speed of various high-performance
trackers without sacrificing accuracy. For instance, our acceleration method
enables the state-of-the-art tracker SeqTrack-B256 to achieve a 2.68 times
speedup on an NVIDIA GeForce RTX 2080 Ti GPU while maintaining the same AUC of
69.9% on LaSOT.
[COMMENTS]
This paper was accepted by the International Journal of Computer
Vision (IJCV)
[LINK]
http://arxiv.org/abs/2506.20381v1
[DATE]
2025-06-25 20:46:46+08:00
[CATEGORIES]
cs.LG
TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
[AUTHORS]
Zhengpeng Feng, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline Lisaius, Markus Immitzer, James Ball, Clement Atzberger, David A. Coomes, Anil Madhavapeddy, Andrew Blake, Srinivasan Keshav
[ABSTRACT]
Satellite remote sensing (RS) enables a wide array of downstream Earth
observation (EO) applications, including climate modeling, carbon accounting,
and strategies for conservation and sustainable land use. We present TESSERA, a
novel Remote Sensing Foundation Model (RSFM) that uses Self-Supervised Learning
(SSL) to generate global, robust representations at 10m scale from pixel-level
satellite time series data. TESSERA combines information from only optical and
SAR data streams using two parallel Transformer-based encoders: one dedicated
to Sentinel-1 SAR polarizations and another to Sentinel-2 MSI data (10 selected
spectral bands) to create representations that are then fused using a
multilayer perceptron (MLP), resulting in a global representation map covering
the years 2017 to 2024. Our precomputed representations set a new
state-of-the-art performance benchmark and our open-source approach
democratizes access to high-performance, high-resolution representations. We
benchmark the performance of TESSERA in five diverse tasks, comparing our work
with state-of-the-art task-specific models and other foundation models. Our
results show that TESSERA outperforms both traditional RS baselines and the
leading geospatial foundation models in these diverse downstream tasks.
[LINK]
http://arxiv.org/abs/2506.20380v1
[DATE]
2025-06-25 20:46:26+08:00
[CATEGORIES]
cs.LG
WyckoffDiff – A Generative Diffusion Model for Crystal Symmetry
[AUTHORS]
Filip Ekström Kelvinius, Oskar B. Andersson, Abhijith S. Parackal, Dong Qian, Rickard Armiento, Fredrik Lindsten
[ABSTRACT]
Crystalline materials often exhibit a high level of symmetry. However, most
generative models do not account for symmetry, but rather model each atom
without any constraints on its position or element. We propose a generative
model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based
descriptions of crystals. This is enabled by considering a crystal structure
representation that encodes all symmetry, and we design a novel neural network
architecture which enables using this representation inside a discrete
generative model framework. In addition to respecting symmetry by construction,
the discrete nature of our model enables fast generation. We additionally
present a new metric, Fréchet Wrenformer Distance, which captures the
symmetry aspects of the materials generated, and we benchmark WyckoffDiff
against recently proposed generative models for crystal generation. As a
proof-of-concept study, we use WyckoffDiff to find new materials below the
convex hull of thermodynamical stability.
[COMMENTS]
Accepted to ICML 2025, to appear in PMLR 267. Code is available
online at https://github.com/httk/wyckoffdiff
[LINK]
http://arxiv.org/abs/2502.06485v3
[DATE]
2025-06-25 20:45:51+08:00
[CATEGORIES]
cs.LG
InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking
[AUTHORS]
Abdullah All Tanvir, Xin Zhong
[ABSTRACT]
This paper introduces a novel deep learning framework for robust image
zero-watermarking based on distortion-invariant feature learning. As a
zero-watermarking scheme, our method leaves the original image unaltered and
learns a reference signature through optimization in the feature space. The
proposed framework consists of two key modules. In the first module, a feature
extractor is trained via noise-adversarial learning to generate representations
that are both invariant to distortions and semantically expressive. This is
achieved by combining adversarial supervision against a distortion
discriminator and a reconstruction constraint to retain image content. In the
second module, we design a learning-based multibit zero-watermarking scheme
where the trained invariant features are projected onto a set of trainable
reference codes optimized to match a target binary message. Extensive
experiments on diverse image datasets and a wide range of distortions show that
our method achieves state-of-the-art robustness in both feature stability and
watermark recovery. Comparative evaluations against existing self-supervised
and deep watermarking techniques further highlight the superiority of our
framework in generalization and robustness.
[LINK]
http://arxiv.org/abs/2506.20370v1
[DATE]
2025-06-25 20:32:08+08:00
[CATEGORIES]
cs.LG
Self-Supervised Graph Learning via Spectral Bootstrapping and Laplacian-Based Augmentations
[AUTHORS]
Lorenzo Bini, Stephane Marchand-Maillet
[ABSTRACT]
We present LaplaceGNN, a novel self-supervised graph learning framework that
bypasses the need for negative sampling by leveraging spectral bootstrapping
techniques. Our method integrates Laplacian-based signals into the learning
process, allowing the model to effectively capture rich structural
representations without relying on contrastive objectives or handcrafted
augmentations. By focusing on positive alignment, LaplaceGNN achieves linear
scaling while offering a simpler, more efficient, self-supervised alternative
for graph neural networks, applicable across diverse domains. Our contributions
are twofold: we precompute spectral augmentations through max-min
centrality-guided optimization, enabling rich structural supervision without
relying on handcrafted augmentations, then we integrate an adversarial
bootstrapped training scheme that further strengthens feature learning and
robustness. Our extensive experiments on different benchmark datasets show that
LaplaceGNN achieves superior performance compared to state-of-the-art
self-supervised graph methods, offering a promising direction for efficiently
learning expressive graph representations.
[COMMENTS]
LaplaceGNN is a novel graph learning framework that employs a
bootstrapped teacher-student architecture. Its precomputed spectral
augmentations and adversarial training enable robust performance,
outperforming SOTA methods while scaling linearly
[LINK]
http://arxiv.org/abs/2506.20362v1
[DATE]
2025-06-25 20:23:23+08:00
[CATEGORIES]
cs.LG
Towards Interpretable and Efficient Feature Selection in Trajectory Datasets: A Taxonomic Approach
[AUTHORS]
Chanuka Don Samarasinghage, Dhruv Gulabani
[ABSTRACT]
Trajectory analysis is not only about obtaining movement data, but it is also
of paramount importance in understanding the pattern in which an object moves
through space and time, as well as in predicting its next move. Due to the
significant interest in the area, data collection has improved substantially,
resulting in a large number of features becoming available for training and
predicting models. However, this introduces a high-dimensionality-induced
feature explosion problem, which reduces the efficiency and interpretability of
the data, thereby reducing the accuracy of machine learning models. To overcome
this issue, feature selection has become one of the most prevalent tools. Thus,
the objective of this paper was to introduce a taxonomy-based feature selection
method that categorizes features based on their internal structure. This
approach classifies the data into geometric and kinematic features, further
categorizing them into curvature, indentation, speed, and acceleration. The
comparative analysis indicated that a taxonomy-based approach consistently
achieved comparable or superior predictive performance. Furthermore, due to the
taxonomic grouping, which reduces combinatorial space, the time taken to select
features was drastically reduced. The taxonomy was also used to gain insights
into what feature sets each dataset was more sensitive to. Overall, this study
provides robust evidence that a taxonomy-based feature selection method can add
a layer of interpretability, reduce dimensionality and computational
complexity, and contribute to high-level decision-making. It serves as a step
toward providing a methodological framework for researchers and practitioners
dealing with trajectory datasets and contributing to the broader field of
explainable artificial intelligence.
[LINK]
http://arxiv.org/abs/2506.20359v1
[DATE]
2025-06-25 20:21:20+08:00
[CATEGORIES]
cs.LG
A foundation model with multi-variate parallel attention to generate neuronal activity
[AUTHORS]
Francesco Carzaniga, Michael Hersche, Abu Sebastian, Kaspar Schindler, Abbas Rahimi
[ABSTRACT]
Learning from multi-variate time-series with heterogeneous channel
configurations remains a fundamental challenge for deep neural networks (DNNs),
particularly in clinical domains such as intracranial electroencephalography
(iEEG), where channel setups vary widely across subjects. In this work, we
introduce multi-variate parallel attention (MVPA), a novel self-attention
mechanism that disentangles content, temporal, and spatial attention, enabling
flexible, generalizable, and efficient modeling of time-series data with
varying channel counts and configurations. We use MVPA to build MVPFormer, a
generative foundation model for human electrophysiology, trained to predict the
evolution of iEEG signals across diverse subjects. To support this and future
efforts by the community, we release the SWEC iEEG dataset, the largest publicly
available iEEG dataset to date, comprising nearly 10,000 hours of recordings
from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong
generalization across subjects, demonstrating expert-level performance in
seizure detection and outperforming state-of-the-art Transformer baselines on
our SWEC, the MAYO, and the FNUSA datasets. We further validate MVPA on standard
time-series forecasting and classification tasks, where it matches or exceeds
existing attention-based models. Together, our contributions establish MVPA as
a general-purpose attention mechanism for heterogeneous time-series and
MVPFormer as the first open-source, open-weights, and open-data iEEG foundation
model with state-of-the-art clinical performance. The code is available at
https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG
dataset is available at
https://mb-neuro.medical-blocks.ch/public_access/databases/ieeg/swec_ieeg.
[COMMENTS]
The code is available at
https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG
dataset is available at
https://mb-neuro.medical-blocks.ch/public_access/databases/ieeg/swec_ieeg
[LINK]
http://arxiv.org/abs/2506.20354v1
[DATE]
2025-06-25 20:07:10+08:00
[CATEGORIES]
cs.LG
Backpropagation Through Time For Networks With Long-Term Dependencies
[AUTHORS]
George Bird, Maxim E. Polivoda
[ABSTRACT]
Backpropagation through time (BPTT) is a technique for updating tuned
parameters within recurrent neural networks (RNNs). Several attempts at
creating such an algorithm have been made, including Nth Ordered Approximations
and Truncated-BPTT. These methods approximate the backpropagation gradients
under the assumption that the RNN only utilises short-term dependencies. This
is an acceptable assumption to make for the current state of artificial neural
networks. As RNNs become more advanced, a shift towards influence by long-term
dependencies is likely. Thus, a new method for backpropagation is required. We
propose using the ‘discrete forward sensitivity equation’ and a variant of it
for single and multiple interacting recurrent loops respectively. This solution
is exact and also allows the network’s parameters to vary between each
subsequent step; however, it does require the computation of a Jacobian.
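As a rough illustration of the forward-sensitivity idea mentioned above, the
sketch below carries the derivative of the hidden state with respect to a
weight forward in time alongside the state itself, so no truncation is needed.
It assumes a scalar recurrence h_t = tanh(w * h_{t-1} + x_t), chosen here only
for illustration; the function name and the recurrence are hypothetical and are
not taken from the paper, whose exact formulation may differ.

```python
import math

def run_with_sensitivity(w, inputs, h0=0.0):
    """Return the final state h_T and the exact derivative dh_T/dw.

    Assumed toy recurrence (not from the paper): h_t = tanh(w * h_{t-1} + x_t).
    The sensitivity s_t = dh_t/dw obeys the forward recursion
        s_t = (1 - h_t**2) * (h_{t-1} + w * s_{t-1}),
    where (1 - h_t**2) * w is the (here 1x1) Jacobian of the state update.
    """
    h, s = h0, 0.0
    for x in inputs:
        h_new = math.tanh(w * h + x)
        s = (1.0 - h_new ** 2) * (h + w * s)  # uses the previous h and s
        h = h_new
    return h, s

inputs = [0.3, -0.1, 0.7, 0.2]
w = 0.9
h_final, dh_dw = run_with_sensitivity(w, inputs)

# Central finite difference as an independent check of the recursion.
eps = 1e-6
finite_diff = (run_with_sensitivity(w + eps, inputs)[0]
               - run_with_sensitivity(w - eps, inputs)[0]) / (2 * eps)
```

In this toy case the recursion and the finite-difference estimate should agree
closely. For a vector-valued state the scalar factor becomes a Jacobian matrix,
which is the cost the abstract refers to.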
[COMMENTS]
8 Pages, 1 Figure; typos corrected, references added, altered section
titles, added further commentary in section 2.1
[LINK]
http://arxiv.org/abs/2103.15589v3
[DATE]
2025-06-25 20:04:53+08:00
[CATEGORIES]
cs.LG
On the ability of Deep Neural Networks to Learn Granger Causality in Multi-Variate Time Series Data
[AUTHORS]
Malik Shahid Sultan, Hernando Ombao
[ABSTRACT]
Granger Causality (GC) offers an elegant statistical framework to study the
association between multivariate time series data. Although linear Vector
Autoregressive (VAR) models are readily interpretable, they have limited
practical application due to underlying assumptions on the kind of associations
that can be captured by these models. Numerous attempts have already been made
in the literature that exploit the functional approximation power of Deep
Neural Networks (DNNs) for the task of GC estimation. These methods, however,
treat GC as a variable selection problem. We present a novel paradigm for
approaching GC: the idea that GC is essentially linked with
prediction, and that if a deep learning model is used to model the time series
jointly, a well-regularized model may learn the true Granger
causal structure from the data, given enough training data. We
propose to uncover the learned GC structure by comparing the model uncertainty
or distribution of the residuals when the past of everything is used as
compared to the one where a specific time series component is dropped from the
model. We also compare the effect of input layer dropout on the ability of a
neural network to learn Granger causality from the data. We show that a
well-regularized model can in fact learn the true GC structure from the data without
explicitly adding terms in the loss function that guide the model to select
variables or perform sparse regression.
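The residual-comparison step described above can be sketched in a few lines.
This is a minimal illustration, not the authors' method: it swaps the deep
model for an ordinary least-squares predictor with one lag, and the helper
names (residual_variance, gc_score) and the simulated data are assumptions made
for the example. The score is the log ratio of residual variances when one
series is dropped from the predictors versus when the full past is used.

```python
import numpy as np

def residual_variance(target, predictors):
    """Variance of least-squares residuals when predicting target."""
    coef, *_ = np.linalg.lstsq(predictors, target, rcond=None)
    return np.var(target - predictors @ coef)

def gc_score(series, cause, effect):
    """Log ratio of residual variance without vs. with the candidate cause.

    series has shape (T, n_series); one lag is used as the predictor set.
    Larger values suggest the past of `cause` helps predict `effect`.
    """
    past, present = series[:-1], series[1:, effect]
    full = residual_variance(present, past)
    reduced = residual_variance(present, np.delete(past, cause, axis=1))
    return np.log(reduced / full)

# Simulated data (assumption): series 0 drives series 1, not the reverse.
rng = np.random.default_rng(0)
T = 2000
noise = rng.normal(size=(T, 2))
x = np.zeros((T, 2))
for t in range(1, T):
    x[t, 0] = 0.5 * x[t - 1, 0] + noise[t, 0]
    x[t, 1] = 0.5 * x[t - 1, 1] + 0.8 * x[t - 1, 0] + noise[t, 1]

score_0_to_1 = gc_score(x, cause=0, effect=1)
score_1_to_0 = gc_score(x, cause=1, effect=0)
```

With this setup the score for the true direction (0 to 1) should be clearly
positive, while the reverse score should stay near zero.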
[LINK]
http://arxiv.org/abs/2506.20347v1
[DATE]
2025-06-25 19:57:24+08:00
[CATEGORIES]
cs.LG
Signatures of planets and Galactic subpopulations in solar analogs. Precise chemical abundances with neural networks
[AUTHORS]
Giulia Martos, Jorge Meléndez, Lorenzo Spina, Sara Lucatello
[ABSTRACT]
The aim of this work is to obtain precise atmospheric parameters and chemical
abundances automatically for solar twins and analogs to find signatures of
exoplanets, as well as to assess how peculiar the Sun is compared to these
stars and to analyze any possible fine structures in the Galactic thin disk. We
developed a neural network (NN) algorithm using Python to obtain these
parameters for a sample of 99 solar twins and solar analogs previously studied
in the literature from normalized high-quality spectra from HARPS, with a
resolving power of R $\sim$ 115000 and a signal-to-noise ratio S/N > 400. We
obtained precise atmospheric parameters and abundance ratios [X/Fe] of 20
chemical elements (Li, C, O, Na, Mg, Al, Si, S, Ca, Sc, Ti, V, Cr, Mn, Co, Ni,
Cu, Zn, Y, and Ba). The results are in line with the literature, with average
differences and standard deviations of $(2 \pm 27)$ K for $T_{\rm eff}$, $(0.00
\pm 0.06)$ dex for log g, $(0.00 \pm 0.02)$ dex for [Fe/H], $(-0.01 \pm 0.05)$
km s$^{-1}$ for microturbulence velocity, $(0.02 \pm 0.08)$ km s$^{-1}$ for the
macro turbulence velocity, and $(-0.12 \pm 0.26)$ km s$^{-1}$ for the projected
rotational velocity ($v \sin i$). Regarding the chemical abundances, most of the
elements agree with the literature within 0.01 - 0.02 dex. The abundances were
corrected for the effects of the Galactic chemical evolution and analyzed with
the condensation temperature ($T_{\rm cond}$) to verify whether the stars
presented depletion of refractories compared to volatiles. We found that the
Sun is more depleted in refractory elements compared to volatiles than 89% of
the studied solar analogs, with a significance of 9.5$\sigma$ when compared to
the stars without detected exoplanets. We also found the possible presence of
three subpopulations in the solar analogs: one Cu-rich, one Cu-poor, and the
last one slightly older and poor in Na.
[COMMENTS]
Accepted by A&A
[LINK]
http://arxiv.org/abs/2506.20345v1
[DATE]
2025-06-25 19:55:14+08:00
[CATEGORIES]
cs.LG
A Complete Loss Landscape Analysis of Regularized Deep Matrix Factorization
[AUTHORS]
Po Chen, Rujun Jiang, Peng Wang
[ABSTRACT]
Despite its wide range of applications across various domains, the
optimization foundations of deep matrix factorization (DMF) remain largely
open. In this work, we aim to fill this gap by conducting a comprehensive study
of the loss landscape of the regularized DMF problem. Toward this goal, we
first provide a closed-form expression of all critical points. Building on
this, we establish precise conditions under which a critical point is a local
minimizer, a global minimizer, a strict saddle point, or a non-strict saddle
point. Leveraging these results, we derive a necessary and sufficient condition
under which each critical point is either a local minimizer or a strict saddle
point. This provides insights into why gradient-based methods almost always
converge to a local minimizer of the regularized DMF problem. Finally, we
conduct numerical experiments to visualize its loss landscape under different
settings to support our theory.
[COMMENTS]
35 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.20344v1
[DATE]
2025-06-25 19:51:41+08:00
[CATEGORIES]
cs.LG
Feature Hallucination for Self-supervised Action Recognition
[AUTHORS]
Lei Wang, Piotr Koniusz
[ABSTRACT]
Understanding human actions in videos requires more than raw pixel analysis;
it relies on high-level semantic reasoning and effective integration of
multimodal features. We propose a deep translational action recognition
framework that enhances recognition accuracy by jointly predicting action
concepts and auxiliary features from RGB video frames. At test time,
hallucination streams infer missing cues, enriching feature representations
without increasing computational overhead. To focus on action-relevant regions
beyond raw pixels, we introduce two novel domain-specific descriptors. Object
Detection Features (ODF) aggregate outputs from multiple object detectors to
capture contextual cues, while Saliency Detection Features (SDF) highlight
spatial and intensity patterns crucial for action recognition. Our framework
seamlessly integrates these descriptors with auxiliary modalities such as
optical flow, Improved Dense Trajectories, skeleton data, and audio cues. It
remains compatible with state-of-the-art architectures, including I3D,
AssembleNet, Video Transformer Network, FASTER, and recent models like VideoMAE
V2 and InternVideo2. To handle uncertainty in auxiliary features, we
incorporate aleatoric uncertainty modeling in the hallucination step and
introduce a robust loss function to mitigate feature noise. Our multimodal
self-supervised action recognition framework achieves state-of-the-art
performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and
Something-Something V2, demonstrating its effectiveness in capturing
fine-grained action dynamics.
[COMMENTS]
Accepted for publication in International Journal of Computer Vision
(IJCV)
[LINK]
http://arxiv.org/abs/2506.20342v1
[DATE]
2025-06-25 19:50:23+08:00
[CATEGORIES]
cs.LG
Recurrent neural network-based robust control systems with closed-loop regional incremental ISS and application to MPC design
[AUTHORS]
Daniele Ravasio, Marcello Farina, Alessio La Bella, Andrea Ballarino
[ABSTRACT]
This paper investigates the design of output-feedback schemes for systems
described by a class of recurrent neural networks. We propose a procedure based
on linear matrix inequalities for designing an observer and a static
state-feedback controller. The algorithm leverages global and regional
incremental input-to-state stability (incremental ISS) and enables the tracking
of constant setpoints, ensuring robustness to disturbances and state estimation
uncertainty. To address the potential limitations of regional incremental ISS,
we introduce an alternative scheme in which the static law is replaced with a
tube-based nonlinear model predictive controller (NMPC) that exploits regional
incremental ISS properties. We show that these conditions enable the
formulation of a robust NMPC law with guarantees of convergence and recursive
feasibility, leading to an enlarged region of attraction. Theoretical results
are validated through numerical simulations on the pH-neutralisation process
benchmark, demonstrating the effectiveness of the proposed schemes.
[COMMENTS]
16 pages, 7 figures, submitted to IEEE Transactions on Automatic
Control (under review)
[LINK]
http://arxiv.org/abs/2506.20334v1
[DATE]
2025-06-25 19:44:28+08:00
[CATEGORIES]
cs.LG
Permutation Equivariant Neural Controlled Differential Equations for Dynamic Graph Representation Learning
[AUTHORS]
Torben Berndt, Benjamin Walker, Tiexin Qin, Jan Stühmer, Andrey Kormilitzin
[ABSTRACT]
Dynamic graphs exhibit complex temporal dynamics due to the interplay between
evolving node features and changing network structures. Recently, Graph Neural
Controlled Differential Equations (Graph Neural CDEs) successfully adapted
Neural CDEs from paths on Euclidean domains to paths on graph domains. Building
on this foundation, we introduce Permutation Equivariant Neural Graph CDEs,
which project Graph Neural CDEs onto permutation equivariant function spaces.
This significantly reduces the model’s parameter count without compromising
representational power, resulting in more efficient training and improved
generalisation. We empirically demonstrate the advantages of our approach
through experiments on simulated dynamical systems and real-world tasks,
showing improved performance in both interpolation and extrapolation scenarios.
[LINK]
http://arxiv.org/abs/2506.20324v1
[DATE]
2025-06-25 19:06:30+08:00
[CATEGORIES]
cs.LG
BINDy – Bayesian identification of nonlinear dynamics with reversible-jump Markov-chain Monte-Carlo
[AUTHORS]
Max D. Champneys, Timothy J. Rogers
[ABSTRACT]
Model parsimony is an important cognitive bias in data-driven
modelling that aids interpretability and helps to prevent over-fitting. Sparse
identification of nonlinear dynamics (SINDy) methods are able to learn sparse
representations of complex dynamics directly from data, given a basis of
library functions. In this work, a novel Bayesian treatment of dictionary
learning system identification, as an alternative to SINDy, is envisaged. The
proposed method – Bayesian identification of nonlinear dynamics (BINDy) – is
distinct from previous approaches in that it targets the full joint posterior
distribution over both the terms in the library and their parameterisation in
the model. This formulation confers the advantage that an arbitrary prior may
be placed over the model structure to produce models that are sparse in the
model space rather than in parameter space. Because this posterior is defined
over parameter vectors that can change in dimension, the inference cannot be
performed by standard techniques. Instead, a Gibbs sampler based on
reversible-jump Markov-chain Monte-Carlo is proposed. BINDy is shown to compare
favourably to ensemble SINDy in three benchmark case-studies. In particular, it
is seen that the proposed method is better able to assign high probability to
correct model terms.
[LINK]
http://arxiv.org/abs/2408.08062v3
[DATE]
2025-06-25 18:45:10+08:00
[CATEGORIES]
cs.LG
Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration
[AUTHORS]
Heyang Zhao, Xingrui Yu, David M. Bossens, Ivor W. Tsang, Quanquan Gu
[ABSTRACT]
Imitation learning is a central problem in reinforcement learning where the
goal is to learn a policy that mimics the expert’s behavior. In practice, it is
often challenging to learn the expert policy from a limited number of
demonstrations accurately due to the complexity of the state space. Moreover,
it is essential to explore the environment and collect data to achieve
beyond-expert performance. To overcome these challenges, we propose a novel
imitation learning algorithm called Imitation Learning with Double Exploration
(ILDE), which implements exploration in two aspects: (1) optimistic policy
optimization via an exploration bonus that rewards state-action pairs with high
uncertainty to potentially improve the convergence to the expert policy, and
(2) curiosity-driven exploration of the states that deviate from the
demonstration trajectories to potentially yield beyond-expert performance.
Empirically, we demonstrate that ILDE outperforms the state-of-the-art
imitation learning algorithms in terms of sample efficiency and achieves
beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations
than in previous work. We also provide a theoretical justification of ILDE as
an uncertainty-regularized policy optimization method with optimistic
exploration, leading to a regret growing sublinearly in the number of episodes.
[LINK]
http://arxiv.org/abs/2506.20307v1
[DATE]
2025-06-25 18:39:32+08:00
[CATEGORIES]
cs.LG
Learning Moderately Input-Sensitive Functions: A Case Study in QR Code Decoding
[AUTHORS]
Kazuki Yoda, Kazuhiko Kawamoto, Hiroshi Kera
[ABSTRACT]
The hardness of learning a function that attains a target task relates to its
input-sensitivity. For example, image classification tasks are
input-insensitive as minor corruptions should not affect the classification
results, whereas arithmetic and symbolic computation, which have been recently
attracting interest, are highly input-sensitive as each input variable connects
to the computation results. This study presents the first learning-based Quick
Response (QR) code decoding and investigates learning functions of medium
sensitivity. Our experiments reveal that Transformers can successfully decode
QR codes, even beyond the theoretical error-correction limit, by learning the
structure of embedded texts. They generalize from English-rich training data to
other languages and even random strings. Moreover, we observe that the
Transformer-based QR decoder focuses on data bits while ignoring
error-correction bits, suggesting a decoding mechanism distinct from standard
QR code readers.
[COMMENTS]
17 pages, 13 figures
[LINK]
http://arxiv.org/abs/2506.20305v1
[DATE]
2025-06-25 18:37:39+08:00
[CATEGORIES]
cs.LG
Bilinear MLPs enable weight-based mechanistic interpretability
[AUTHORS]
Michael T. Pearce, Thomas Dooms, Alice Rigg, Jose M. Oramas, Lee Sharkey
[ABSTRACT]
A mechanistic understanding of how MLPs do computation in deep neural
networks remains elusive. Current interpretability work can extract features
from hidden activations over an input dataset but generally cannot explain how
MLP weights construct features. One challenge is that element-wise
nonlinearities introduce higher-order interactions and make it difficult to
trace computations through the MLP layer. In this paper, we analyze bilinear
MLPs, a type of Gated Linear Unit (GLU) without any element-wise nonlinearity
that nevertheless achieves competitive performance. Bilinear MLPs can be fully
expressed in terms of linear operations using a third-order tensor, allowing
flexible analysis of the weights. Analyzing the spectra of bilinear MLP weights
using eigendecomposition reveals interpretable low-rank structure across toy
tasks, image classification, and language modeling. We use this understanding
to craft adversarial examples, uncover overfitting, and identify small language
model circuits directly from the weights alone. Our results demonstrate that
bilinear layers serve as an interpretable drop-in replacement for current
activation functions and that weight-based interpretability is viable for
understanding deep-learning models.
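The claim that a bilinear MLP can be written with a third-order tensor can be
checked numerically. The sketch below assumes the common bilinear form
out = P((Wx) * (Vx)) with an elementwise product and no biases; the matrix
names W, V, P, the sizes, and the helper functions are illustrative assumptions
rather than the paper's code. It builds the tensor B with
B[k, i, j] = sum_h P[k, h] W[h, i] V[h, j], so that each output is a quadratic
form in x, and then eigendecomposes the symmetrized form for one output
direction.

```python
import numpy as np

def bilinear_mlp(x, W, V, P):
    """Bilinear layer: elementwise product of two linear maps, then a projection."""
    return P @ ((W @ x) * (V @ x))

def interaction_tensor(W, V, P):
    """Third-order tensor B with B[k, i, j] = sum_h P[k, h] * W[h, i] * V[h, j]."""
    return np.einsum("kh,hi,hj->kij", P, W, V)

def output_spectrum(B, u):
    """Eigendecomposition of the symmetrized quadratic form for output direction u."""
    Q = np.einsum("k,kij->ij", u, B)
    Q = 0.5 * (Q + Q.T)  # symmetrizing leaves x^T Q x unchanged
    return np.linalg.eigh(Q)

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3
W = rng.normal(size=(d_hidden, d_in))
V = rng.normal(size=(d_hidden, d_in))
P = rng.normal(size=(d_out, d_hidden))
x = rng.normal(size=d_in)

B = interaction_tensor(W, V, P)
direct = bilinear_mlp(x, W, V, P)
via_tensor = np.einsum("kij,i,j->k", B, x, x)

u = np.array([1.0, 0.0, 0.0])  # read out the first output unit
eigvals, eigvecs = output_spectrum(B, u)
via_eigen = np.sum(eigvals * (eigvecs.T @ x) ** 2)
```

Here direct and via_tensor should match, and via_eigen should equal the first
output. Because the layer is an exact quadratic form, the eigenvectors give an
input-independent description of what that output responds to.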
[COMMENTS]
Accepted to ICLR'25
[LINK]
http://arxiv.org/abs/2410.08417v2
[DATE]
2025-06-25 18:36:59+08:00
[CATEGORIES]
cs.LG
Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning
[AUTHORS]
Seungho Baek, Taegeon Park, Jongchan Park, Seungjun Oh, Yusung Kim
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.07744v2
[DATE]
2025-06-25 18:33:47+08:00
[CATEGORIES]
cs.LG
OLALa: Online Learned Adaptive Lattice Codes for Heterogeneous Federated Learning
[AUTHORS]
Natalie Lang, Maya Simhi, Nir Shlezinger
[ABSTRACT]
Federated learning (FL) enables collaborative training across distributed
clients without sharing raw data, often at the cost of substantial
communication overhead induced by transmitting high-dimensional model updates.
This overhead can be alleviated by having the clients quantize their model
updates, with dithered lattice quantizers identified as an attractive scheme
due to their structural simplicity and convergence-preserving properties.
However, existing lattice-based FL schemes typically rely on a fixed
quantization rule, which is suboptimal in heterogeneous and dynamic
environments where the distribution of model updates varies across users and
training rounds. In this work, we propose Online Learned Adaptive Lattices
(OLALa), a heterogeneous FL framework where each client can adjust its
quantizer online using lightweight local computations. We first derive
convergence guarantees for FL with non-fixed lattice quantizers and show that
proper lattice adaptation can tighten the convergence bound. Then, we design an
online learning algorithm that enables clients to tune their quantizers
throughout the FL process while exchanging only a compact set of quantization
parameters. Numerical experiments demonstrate that OLALa consistently improves
learning performance under various quantization rates, outperforming
conventional fixed-codebook and non-adaptive schemes.
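For readers unfamiliar with dithered lattice quantization, the following sketch
shows the simplest instance: subtractive dithering on a scaled integer lattice
(a uniform scalar grid). This is an assumption made for illustration and does
not reproduce the learned, adaptive lattices of OLALa; the function name and
the step size are hypothetical. The encoder adds a shared random dither before
rounding to the lattice, and the decoder subtracts the same dither.

```python
import numpy as np

def dithered_quantize(x, step, rng):
    """Subtractive dithered quantization on the lattice step * Z^n.

    The dither is uniform over one lattice cell. In practice encoder and
    decoder would share it through a common random seed (assumption).
    """
    dither = rng.uniform(-step / 2, step / 2, size=x.shape)
    lattice_point = step * np.round((x + dither) / step)  # what gets transmitted
    return lattice_point - dither                          # decoder output

rng = np.random.default_rng(0)
update = rng.normal(size=10_000)  # stand-in for a model update vector
step = 0.5
reconstructed = dithered_quantize(update, step, rng)
error = reconstructed - update
```

The reconstruction error never exceeds half the step size, and its mean is
close to zero, which is the property that makes such quantizers convenient in
convergence analyses. Adapting the lattice per client, as the abstract
describes, would amount to changing the generator of the lattice rather than
using a fixed grid.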
[COMMENTS]
Under review for publication in the IEEE
[LINK]
http://arxiv.org/abs/2506.20297v1
[DATE]
2025-06-25 18:18:34+08:00
[CATEGORIES]
cs.LG
Provably Improving Generalization of Few-Shot Models with Synthetic Data
[AUTHORS]
Lan-Cuong Nguyen, Quan Nguyen-Tri, Bang Tran Khanh, Dung D. Le, Long Tran-Thanh, Khoat Than
[ABSTRACT]
Few-shot image classification remains challenging due to the scarcity of
labeled training examples. Augmenting them with synthetic data has emerged as a
promising way to alleviate this issue, but models trained on synthetic samples
often face performance degradation due to the inherent gap between real and
synthetic distributions. To address this limitation, we develop a theoretical
framework that quantifies the impact of such distribution discrepancies on
supervised learning, specifically in the context of image classification. More
importantly, our framework suggests practical ways to generate good synthetic
samples and to train a predictor with high generalization ability. Building
upon this framework, we propose a novel theoretically grounded algorithm that
integrates prototype learning to optimize both data partitioning and model
training, effectively bridging the gap between real few-shot data and synthetic
data. Extensive experimental results show that our approach achieves
superior performance compared to state-of-the-art methods, outperforming them
across multiple datasets.
[COMMENTS]
ICML 2025. Our code is released at
https://github.com/Fsoft-AIC/ProtoAug
[LINK]
http://arxiv.org/abs/2505.24190v2
[DATE]
2025-06-25 18:02:36+08:00
[CATEGORIES]
cs.LG
Flexible Infinite-Width Graph Convolutional Neural Networks
[AUTHORS]
Ben Anson, Edward Milsom, Laurence Aitchison
[ABSTRACT]
A common theoretical approach to understanding neural networks is to take an
infinite-width limit, at which point the outputs become Gaussian process (GP)
distributed. This is known as a neural network Gaussian process (NNGP).
However, the NNGP kernel is fixed and tunable only through a small number of
hyperparameters, thus eliminating the possibility of representation learning.
This contrasts with finite-width NNs, which are often believed to perform well
because they are able to flexibly learn representations for the task at hand.
Thus, in simplifying NNs to make them theoretically tractable, NNGPs may
eliminate precisely what makes them work well (representation learning). This
motivated us to understand whether representation learning is necessary in a
range of graph tasks. We develop a precise tool for this task, the graph
convolutional deep kernel machine. This is very similar to an NNGP, in that it
is an infinite width limit and uses kernels, but comes with a “knob” to
control the amount of flexibility and hence representation learning. We found
that representation learning gives noticeable performance improvements for
heterophilous node classification tasks, but less so for homophilous node
classification tasks.
[COMMENTS]
Major revision. Title and abstract updated. Added new analysis
section on linear models and additional datasets. Paper accepted to TMLR
[LINK]
http://arxiv.org/abs/2402.06525v2
[DATE]
2025-06-25 17:59:16+08:00
[CATEGORIES]
cs.LG
Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo
[AUTHORS]
Filip Ekström Kelvinius, Zheng Zhao, Fredrik Lindsten
[ABSTRACT]
A recent line of research has exploited pre-trained generative diffusion
models as priors for solving Bayesian inverse problems. We contribute to this
research direction by designing a sequential Monte Carlo method for
linear-Gaussian inverse problems which builds on “decoupled diffusion”, where
the generative process is designed such that larger updates to the sample are
possible. The method is asymptotically exact and we demonstrate the
effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC)
algorithm on both synthetic as well as protein and image data. Further, we
demonstrate how the approach can be extended to discrete data.
[COMMENTS]
Accepted to ICML 2025, to appear in PMLR 267. Code available at
https://github.com/filipekstrm/ddsmc
[LINK]
http://arxiv.org/abs/2502.06379v2
[DATE]
2025-06-25 17:54:45+08:00
[CATEGORIES]
cs.LG
X-SiT: Inherently Interpretable Surface Vision Transformers for Dementia Diagnosis
[AUTHORS]
Fabian Bongratz, Tom Nuno Wolf, Jaume Gual Ramon, Christian Wachinger
[ABSTRACT]
Interpretable models are crucial for supporting clinical decision-making,
driving advances in their development and application for medical images.
However, the nature of 3D volumetric data makes it inherently challenging to
visualize and interpret intricate and complex structures like the cerebral
cortex. Cortical surface renderings, on the other hand, provide a more
accessible and understandable 3D representation of brain anatomy, facilitating
visualization and interactive exploration. Motivated by this advantage and the
widespread use of surface data for studying neurological disorders, we present
the eXplainable Surface Vision Transformer (X-SiT). This is the first
inherently interpretable neural network that offers human-understandable
predictions based on interpretable cortical features. As part of X-SiT, we
introduce a prototypical surface patch decoder for classifying surface patch
embeddings, incorporating case-based reasoning with spatially corresponding
cortical prototypes. The results demonstrate state-of-the-art performance in
detecting Alzheimer’s disease and frontotemporal dementia while additionally
providing informative prototypes that align with known disease patterns and
reveal classification errors.
[COMMENTS]
MICCAI 2025
[LINK]
http://arxiv.org/abs/2506.20267v1
[DATE]
2025-06-25 17:24:07+08:00
[CATEGORIES]
cs.LG
3D variational autoencoder for fingerprinting microstructure volume elements
[AUTHORS]
Michael D. White, Michael D. Atkinson, Adam J. Plowman, Pratheek Shanthraj
[ABSTRACT]
Microstructure quantification is an important step towards establishing
structure-property relationships in materials. Machine learning-based image
processing methods have been shown to outperform conventional image processing
techniques and are increasingly applied to microstructure quantification tasks.
In this work, we present a 3D variational autoencoder (VAE) for encoding
microstructure volume elements (VEs) comprising voxelated crystallographic
orientation data. Crystal symmetries in the orientation space are accounted for
by mapping to the crystallographic fundamental zone as a preprocessing step,
which allows for a continuous loss function to be used and improves the
training convergence rate. The VAE is then used to encode a training set of VEs
with an equiaxed polycrystalline microstructure with random texture. Accurate
reconstructions are achieved with a relative average misorientation error of
$3 \times 10^{-2}$ on the test dataset, for a continuous latent space with dimension 256.
We show that the model generalises well to microstructures with textures, grain
sizes and aspect ratios outside the training distribution. Structure-property
relationships are explored through using the training set of VEs as initial
configurations in various crystal plasticity (CP) simulations. Microstructural
fingerprints extracted from the VAE, which parameterise the VEs in a
low-dimensional latent space, are stored alongside the volume-averaged stress
response, at each strain increment, to uniaxial tensile deformation from CP
simulations. This is then used to train a fully connected neural network
mapping the input fingerprint to the resulting stress response, which acts as a
surrogate model for the CP simulation. The fingerprint-based surrogate model is
shown to accurately predict the microstructural dependence in the CP stress
response, with a relative mean-squared error of 2.75 MPa on unseen test data.
[COMMENTS]
28 pages, 11 figures
[LINK]
http://arxiv.org/abs/2503.17427v3
[DATE]
2025-06-25 17:14:01+08:00
[CATEGORIES]
cs.LG
Exploration-Exploitation Tradeoff in Universal Lossy Compression
[AUTHORS]
Nir Weinberger, Ram Zamir
[ABSTRACT]
Universal compression can learn the source and adapt to it either in a batch
mode (forward adaptation), or in a sequential mode (backward adaptation). We
recast the sequential mode as a multi-armed bandit problem, a fundamental model
in reinforcement learning, and study the trade-off between exploration and
exploitation in the lossy compression case. We show that a previously proposed
“natural type selection” scheme can be cast as a reconstruction-directed MAB
algorithm, for sequential lossy compression, and explain its limitations in
terms of robustness and short-block performance. We then derive and analyze
robust cost-directed MAB algorithms, which work at any block length.
[COMMENTS]
An extended version of ISIT 2025 paper
[LINK]
http://arxiv.org/abs/2506.20261v1
[DATE]
2025-06-25 17:08:29+08:00
[CATEGORIES]
cs.LG
Fine-tuning machine-learned particle-flow reconstruction for new detector geometries in future colliders
[AUTHORS]
Farouk Mokhtar, Joosep Pata, Dolores Garcia, Eric Wulff, Mengke Zhang, Michael Kagan, Javier Duarte
[ABSTRACT]
We demonstrate transfer learning capabilities in a machine-learned algorithm
trained for particle-flow reconstruction in high energy particle colliders.
This paper presents a cross-detector fine-tuning study, where we initially
pretrain the model on a large full simulation dataset from one detector design,
and subsequently fine-tune the model on a sample with a different collider and
detector design. Specifically, we use the Compact Linear Collider detector
(CLICdet) model for the initial training set and demonstrate successful
knowledge transfer to the CLIC-like detector (CLD) proposed for the Future
Circular Collider in electron-positron mode. We show that with an order of
magnitude fewer samples from the second dataset, we can achieve the same
performance as a costly training from scratch, across particle-level and
event-level performance metrics, including jet and missing transverse momentum
resolution. Furthermore, we find that the fine-tuned model achieves comparable
performance to the traditional rule-based particle-flow approach on event-level
metrics after training on 100,000 CLD events, whereas a model trained from
scratch requires at least 1 million CLD events to achieve similar
reconstruction performance. To our knowledge, this represents the first
full-simulation cross-detector transfer learning study for particle-flow
reconstruction. These findings offer valuable insights towards building large
foundation models that can be fine-tuned across different detector designs and
geometries, helping to accelerate the development cycle for new detectors and
opening the door to rapid detector design and optimization using machine
learning.
[COMMENTS]
20 pages, 13 figures
[LINK]
http://arxiv.org/abs/2503.00131v4
[DATE]
2025-06-25 17:07:47+08:00
[CATEGORIES]
cs.LG
A Transformer Based Handwriting Recognition System Jointly Using Online and Offline Features
[AUTHORS]
Ayush Lodh, Ritabrata Chakraborty, Shivakumara Palaiahnakote, Umapada Pal
[ABSTRACT]
We posit that handwriting recognition benefits from complementary cues
carried by the rasterized complex glyph and the pen’s trajectory, yet most
systems exploit only one modality. We introduce an end-to-end network that
performs early fusion of offline images and online stroke data within a shared
latent space. A patch encoder converts the grayscale crop into fixed-length
visual tokens, while a lightweight transformer embeds the $(x, y, \text{pen})$
sequence. Learnable latent queries attend jointly to both token streams,
yielding context-enhanced stroke embeddings that are pooled and decoded under a
cross-entropy loss objective. Because integration occurs before any high-level
classification, temporal cues reinforce each other during representation
learning, producing stronger writer independence. Comprehensive experiments on
IAMOn-DB and VNOn-DB demonstrate that our approach achieves state-of-the-art
accuracy, exceeding previous bests by up to 1%. Our study also shows
adaptation of this pipeline with gesturification on the ISI-Air dataset. Our
code can be found here.
[COMMENTS]
15 pages, 7 figures
[LINK]
http://arxiv.org/abs/2506.20255v1
[DATE]
2025-06-25 16:58:47+08:00
[CATEGORIES]
cs.LG
Time-series surrogates from energy consumers generated by machine learning approaches for long-term forecasting scenarios
[AUTHORS]
Ben Gerhards, Nikita Popkov, Annekatrin König, Marcel Arpogaus, Bastian Schäfermeier, Leonie Riedl, Stephan Vogt, Philip Hehlert
[ABSTRACT]
Forecasting attracts a lot of research attention in the electricity value
chain. However, most studies concentrate on short-term forecasting of
generation or consumption with a focus on systems and less on individual
consumers. Even more neglected is the topic of long-term forecasting of
individual power consumption.
Here, we provide an in-depth comparative evaluation of data-driven methods
for generating synthetic time series data tailored to energy consumption
long-term forecasting. High-fidelity synthetic data is crucial for a wide range
of applications, including state estimations in energy systems or power grid
planning. In this study, we assess and compare the performance of multiple
state-of-the-art but less common techniques: a hybrid Wasserstein Generative
Adversarial Network (WGAN), Denoising Diffusion Probabilistic Model (DDPM),
Hidden Markov Model (HMM), and Masked Autoregressive Bernstein polynomial
normalizing Flows (MABF). We analyze the ability of each method to replicate
the temporal dynamics, long-range dependencies, and probabilistic transitions
characteristic of individual energy consumption profiles. Our comparative
evaluation highlights the strengths and limitations of WGAN, DDPM, HMM, and
MABF, aiding in the selection of the most suitable approach for state estimations and
other energy-related tasks. Our generation and analysis framework aims to
enhance the accuracy and reliability of synthetic power consumption data while
generating data that fulfills criteria such as anonymisation, preserving privacy
and mitigating the risk of profiling individual customers. This study
utilizes an open-source dataset from households in Germany with 15-minute time
resolution. The generated synthetic power profiles can readily be used in
applications like state estimations or consumption forecasting.
[LINK]
http://arxiv.org/abs/2506.20253v1
[DATE]
2025-06-25 16:54:47+08:00
[CATEGORIES]
cs.LG
Q-resafe: Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models
[AUTHORS]
Kejia Chen, Jiawen Zhang, Jiacong Hu, Yu Wang, Jian Lou, Zunlei Feng, Mingli Song
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.20251v1
[DATE]
2025-06-25 16:52:22+08:00
[CATEGORIES]
cs.LG
FedBKD: Distilled Federated Learning to Embrace Generalization and Personalization on Non-IID Data
[AUTHORS]
Yushan Zhao, Jinyuan He, Donglai Chen, Weijie Luo, Chong Xie, Ri Zhang, Yonghong Chen, Yan Xu
[ABSTRACT]
Federated learning (FL) is a decentralized collaborative machine learning
(ML) technique. It provides a solution to the issues of isolated data islands
and data privacy leakage in industrial ML practices. One major challenge in FL
is handling non-independent and identically distributed (non-IID) data.
Current solutions either focus on constructing an all-powerful global model, or
customizing personalized local models. Few of them can provide both a
well-generalized global model and well-performing local models at the same time.
Additionally, many FL solutions to the non-IID problem benefit from
introducing public datasets. However, this also increases the risk of data
leakage. To tackle these problems, we propose a novel data-free distillation
framework, Federated Bidirectional Knowledge Distillation (FedBKD).
Specifically, we train Generative Adversarial Networks (GAN) for synthetic
data. During the GAN training, local models serve as discriminators and their
parameters are frozen. The synthetic data is then used for bidirectional
distillation between global and local models to achieve knowledge interactions
so that performances for both sides are improved. We conduct extensive
experiments on 4 benchmarks under different non-IID settings. The results show
that FedBKD achieves SOTA performances in every case.
[LINK]
http://arxiv.org/abs/2506.20245v1
[DATE]
2025-06-25 16:42:10+08:00
[CATEGORIES]
cs.LG
E-ABIN: an Explainable module for Anomaly detection in BIological Networks
[AUTHORS]
Ugo Lomoio, Tommaso Mazza, Pierangelo Veltri, Pietro Hiram Guzzi
[ABSTRACT]
The increasing availability of large-scale omics data calls for robust
analytical frameworks capable of handling complex gene expression datasets
while offering interpretable results. Recent advances in artificial
intelligence have enabled the identification of aberrant molecular patterns
distinguishing disease states from healthy controls. Coupled with improvements
in model interpretability, these tools now support the identification of genes
potentially driving disease phenotypes. However, current approaches to gene
anomaly detection often remain limited to single datasets and lack accessible
graphical interfaces. Here, we introduce E-ABIN, a general-purpose, explainable
framework for Anomaly detection in Biological Networks. E-ABIN combines
classical machine learning and graph-based deep learning techniques within a
unified, user-friendly platform, enabling the detection and interpretation of
anomalies from gene expression or methylation-derived networks. By integrating
algorithms such as Support Vector Machines, Random Forests, Graph Autoencoders
(GAEs), and Graph Adversarial Attributed Networks (GAANs), E-ABIN ensures a
high predictive accuracy while maintaining interpretability. We demonstrate the
utility of E-ABIN through case studies of bladder cancer and coeliac disease,
where it effectively uncovers biologically relevant anomalies and offers
insights into disease mechanisms.
[LINK]
http://arxiv.org/abs/2506.20693v1
[DATE]
2025-06-25 16:25:17+08:00
[CATEGORIES]
cs.LG
Gradient-Free Sequential Bayesian Experimental Design via Interacting Particle Systems
[AUTHORS]
Robert Gruhlke, Matei Hanu, Claudia Schillings, Philipp Wacker
[ABSTRACT]
We introduce a gradient-free framework for Bayesian Optimal Experimental
Design (BOED) in sequential settings, aimed at complex systems where gradient
information is unavailable. Our method combines Ensemble Kalman Inversion (EKI)
for design optimization with the Affine-Invariant Langevin Dynamics (ALDI)
sampler for efficient posterior sampling, both of which are derivative-free and
ensemble-based. To address the computational challenges posed by nested
expectations in BOED, we propose variational Gaussian and parametrized Laplace
approximations that provide tractable upper and lower bounds on the Expected
Information Gain (EIG). These approximations enable scalable utility estimation
in high-dimensional spaces and PDE-constrained inverse problems. We demonstrate
the performance of our framework through numerical experiments ranging from
linear Gaussian models to PDE-based inference tasks, highlighting the method’s
robustness, accuracy, and efficiency in information-driven experimental design.
[LINK]
http://arxiv.org/abs/2504.13320v2
[DATE]
2025-06-25 16:22:09+08:00
[CATEGORIES]
cs.LG
Supporting renewable energy planning and operation with data-driven high-resolution ensemble weather forecast
[AUTHORS]
Jingnan Wang, Jie Chao, Shangshang Yang, Congyi Nai, Kaijun Ren, Kefeng Deng, Xi Chen, Yaxin Liu, Hanqiuzi Wen, Ziniu Xiao, Lifeng Zhang, Xiaodong Wang, Jiping Guan, Baoxiang Pan
[ABSTRACT]
The planning and operation of renewable energy, especially wind power, depend
crucially on accurate, timely, and high-resolution weather information.
Coarse-grid global numerical weather forecasts are typically downscaled to meet
these requirements, introducing challenges of scale inconsistency, process
representation error, computation cost, and entanglement of distinct
uncertainty sources from chaoticity, model bias, and large-scale forcing. We
address these challenges by learning the climatological distribution of a
target wind farm using its high-resolution numerical weather simulations. An
optimal combination of this learned high-resolution climatological prior with
coarse-grid large-scale forecasts yields a highly accurate, fine-grained,
full-variable, large ensemble of weather pattern forecasts. Using observed
meteorological records and wind turbine power outputs as references, the
proposed methodology compares favorably with existing
numerical/statistical forecasting-downscaling pipelines, regarding either
deterministic/probabilistic skills or economic gains. Moreover, a 100-member,
10-day forecast with spatial resolution of 1 km and output frequency of 15 min
takes < 1 hour on a moderate-end GPU, in contrast to $\mathcal{O}(10^3)$ CPU
hours for conventional numerical simulation. By drastically reducing
computational costs while maintaining accuracy, our method paves the way for
more efficient and reliable renewable energy planning and operation.
[LINK]
http://arxiv.org/abs/2505.04396v2
[DATE]
2025-06-25 16:04:43+08:00
[CATEGORIES]
cs.LG
MS-TVNet: A Long-Term Time Series Prediction Method Based on Multi-Scale Dynamic Convolution
[AUTHORS]
Chenghan Li, Mingchen Li, Yipu Liao, Ruisheng Diao
[ABSTRACT]
Long-term time series prediction has predominantly relied on Transformer and
MLP models, while the potential of convolutional networks in this domain
remains underexplored. To address this gap, we introduce a novel multi-scale
time series reshape module, which effectively captures the relationships among
multi-period patches and variable dependencies. Building upon this module, we
propose MS-TVNet, a multi-scale 3D dynamic convolutional neural network.
Through comprehensive evaluations on diverse datasets, MS-TVNet demonstrates
superior performance compared to baseline models, achieving state-of-the-art
(SOTA) results in long-term time series prediction. Our findings highlight the
effectiveness of leveraging convolutional networks for capturing complex
temporal patterns, suggesting a promising direction for future research in this
field. The code is released at https://github.com/Curyyfaust/TVNet.
[LINK]
http://arxiv.org/abs/2506.17253v2
[DATE]
2025-06-25 15:55:20+08:00
[CATEGORIES]
cs.LG
Curved representational Bregman divergences and their applications
[AUTHORS]
Frank Nielsen
[ABSTRACT]
By analogy to curved exponential families in statistics, we define curved
Bregman divergences as Bregman divergences restricted to nonlinear parameter
subspaces. We show that the barycenter of a finite weighted set of parameters
under a curved Bregman divergence amounts to the right Bregman projection onto
the nonlinear subspace of the barycenter with respect to the full Bregman
divergence. We demonstrate the significance of curved Bregman divergences with
two examples: (1) symmetrized Bregman divergences and (2) the Kullback-Leibler
divergence between circular complex normal distributions. We then consider
monotonic embeddings to define representational curved Bregman divergences and
show that the $\alpha$-divergences are representational curved Bregman
divergences with respect to $\alpha$-embeddings of the probability simplex into
the positive measure cone. As an application, we report an efficient method to
calculate the intersection of a finite set of $\alpha$-divergence spheres.
[COMMENTS]
12 pages, 5 figures
[LINK]
http://arxiv.org/abs/2504.05654v2
[DATE]
2025-06-25 15:53:44+08:00
[CATEGORIES]
cs.LG
Affective Priming Score: A Data-Driven Method to Detect Priming in Sequential Datasets
[AUTHORS]
Eduardo Gutierrez Maestro, Hadi Banaee, Amy Loutfi
[ABSTRACT]
Affective priming exemplifies the challenge of ambiguity in affective
computing. While the community has largely addressed this issue from a
label-based perspective, identifying data points in the sequence affected by
the priming effect, the impact of priming on data itself, particularly in
physiological signals, remains underexplored. Data affected by priming can lead
to misclassifications when used in learning models. This study proposes the
Affective Priming Score (APS), a data-driven method to detect data points
influenced by the priming effect. The APS assigns a score to each data point,
quantifying the extent to which it is affected by priming. To validate this
method, we apply it to the SEED and SEED-VII datasets, which contain sufficient
transitions between emotional events to exhibit priming effects. We train
models with the same configuration using both the original data and
priming-free sequences. The misclassification rate is significantly reduced
when using priming-free sequences compared to the original data. This work
contributes to the broader challenge of ambiguity by identifying and mitigating
priming effects at the data level, enhancing model robustness, and offering
valuable insights for the design and collection of affective computing
datasets.
[LINK]
http://arxiv.org/abs/2506.20204v1
[DATE]
2025-06-25 15:48:22+08:00
[CATEGORIES]
cs.LG
DuoGPT: Training-free Dual Sparsity through Activation-aware Pruning in LLMs
[AUTHORS]
Ruokai Yin, Yuhang Li, Donghyun Lee, Priyadarshini Panda
[ABSTRACT]
Large language models (LLMs) deliver strong performance but are difficult to
deploy due to high memory and compute costs. While pruning reduces these
demands, most methods ignore activation sparsity observed at runtime. We
reinterpret activation sparsity as dynamic structured weight sparsity and
propose DuoGPT, a unified framework that constructs dual-sparse (spMspV)
workloads by combining unstructured weight pruning with activation sparsity. To
preserve accuracy, we extend the Optimal Brain Compression (OBC) framework with
activation-aware calibration and introduce output residuals from the dense
model as correction terms. We further optimize the solution for efficient GPU
execution, enabling scalability to billion-parameter LLMs. Evaluations on
LLaMA-2 and LLaMA-3 show that DuoGPT outperforms state-of-the-art structured
pruning methods by up to 9.17% accuracy at an iso-speedup of 1.39$\times$
compared to the baseline dense model.
[LINK]
http://arxiv.org/abs/2506.20194v1
[DATE]
2025-06-25 15:35:12+08:00
[CATEGORIES]
cs.LG
IKDiffuser: A Generative Inverse Kinematics Solver for Multi-arm Robots via Diffusion Model
[AUTHORS]
Zeyu Zhang, Ziyuan Jiao
[ABSTRACT]
Solving Inverse Kinematics (IK) problems is fundamental to robotics, but has
primarily been successful with single serial manipulators. For multi-arm
robotic systems, IK remains challenging due to complex self-collisions, coupled
joints, and high-dimensional redundancy. These complexities make traditional IK
solvers slow, prone to failure, and lacking in solution diversity. In this
paper, we present IKDiffuser, a diffusion-based model designed for fast and
diverse IK solution generation for multi-arm robotic systems. IKDiffuser learns
the joint distribution over the configuration space, capturing complex
dependencies and enabling seamless generalization to multi-arm robotic systems
of different structures. In addition, IKDiffuser can incorporate additional
objectives during inference without retraining, offering versatility and
adaptability for task-specific requirements. In experiments on 6 different
multi-arm systems, the proposed IKDiffuser achieves superior solution accuracy,
precision, diversity, and computational efficiency compared to existing
solvers. The proposed IKDiffuser framework offers a scalable, unified approach
to solving multi-arm IK problems, facilitating the potential of multi-arm
robotic systems in real-time manipulation tasks.
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2506.13087v3
[DATE]
2025-06-25 15:27:44+08:00
[CATEGORIES]
cs.LG
Causal Operator Discovery in Partial Differential Equations via Counterfactual Physics-Informed Neural Networks
[AUTHORS]
Ronald Katende
[ABSTRACT]
We develop a principled framework for discovering causal structure in partial
differential equations (PDEs) using physics-informed neural networks and
counterfactual perturbations. Unlike classical residual minimization or sparse
regression methods, our approach quantifies operator-level necessity through
functional interventions on the governing dynamics. We introduce causal
sensitivity indices and structural deviation metrics to assess the influence of
candidate differential operators within neural surrogates. Theoretically, we
prove exact recovery of the causal operator support under restricted isometry
or mutual coherence conditions, with residual bounds guaranteeing
identifiability. Empirically, we validate the framework on both synthetic and
real-world datasets across climate dynamics, tumor diffusion, and ocean flows.
Our method consistently recovers governing operators even under noise,
redundancy, and data scarcity, outperforming standard PINNs and DeepONets in
structural fidelity. This work positions causal PDE discovery as a tractable
and interpretable inference task grounded in structural causal models and
variational residual analysis.
[LINK]
http://arxiv.org/abs/2506.20181v1
[DATE]
2025-06-25 15:15:42+08:00
[CATEGORIES]
cs.LG
Valid Selection among Conformal Sets
[AUTHORS]
Mahmoud Hegazy, Liviu Aolaritei, Michael I. Jordan, Aymeric Dieuleveut
[ABSTRACT]
Conformal prediction offers a distribution-free framework for constructing
prediction sets with coverage guarantees. In practice, multiple valid conformal
prediction sets may be available, arising from different models or
methodologies. However, selecting the most desirable set, such as the smallest,
can invalidate the coverage guarantees. To address this challenge, we propose a
stability-based approach that ensures coverage for the selected prediction set.
We extend our results to the online conformal setting, propose several
refinements in settings where additional structure is available, and
demonstrate its effectiveness through experiments.
[LINK]
http://arxiv.org/abs/2506.20173v1
[DATE]
2025-06-25 14:59:55+08:00
[CATEGORIES]
cs.LG
Causal discovery in deterministic discrete LTI-DAE systems
[AUTHORS]
Bala Rajesh Konkathi, Arun K. Tangirala
[ABSTRACT]
Discovering pure causes or driver variables in deterministic LTI systems is
of vital importance in the data-driven reconstruction of causal networks. A
recent work by Kathari and Tangirala, proposed in 2022, formulated the causal
discovery method as a constraint identification problem. The constraints are
identified using a dynamic iterative PCA (DIPCA)-based approach for dynamical
systems corrupted with Gaussian measurement errors. The DIPCA-based method
works efficiently for dynamical systems devoid of any algebraic relations.
However, several dynamical systems operate under feedback control and/or are
coupled with conservation laws, leading to differential-algebraic (DAE) or
mixed causal systems. In this work, a method, namely the partition of variables
(PoV), for causal discovery in LTI-DAE systems is proposed. This method is
superior to the method that was presented by Kathari and Tangirala (2022), as
PoV also works for pure dynamical systems, which are devoid of algebraic
equations. The proposed method identifies the causal drivers up to a minimal
subset. PoV deploys DIPCA to first determine the number of algebraic relations
($n_a$), the number of dynamical relations ($n_d$) and the constraint matrix.
Subsequently, the subsets are identified through an admissible partitioning of
the constraint matrix by computing its condition number. Case studies are
presented to demonstrate the effectiveness of the proposed method.
[LINK]
http://arxiv.org/abs/2506.20169v1
[DATE]
2025-06-25 14:47:22+08:00
[CATEGORIES]
cs.LG
Active Learning of Deep Neural Networks via Gradient-Free Cutting Planes
[AUTHORS]
Erica Zhang, Fangzhao Zhang, Mert Pilanci
[ABSTRACT]
Active learning methods aim to improve sample complexity in machine learning.
In this work, we investigate an active learning scheme via a novel
gradient-free cutting-plane training method for ReLU networks of arbitrary
depth and develop a convergence theory. We demonstrate, for the first time,
that cutting-plane algorithms, traditionally used in linear models, can be
extended to deep neural networks despite their nonconvexity and nonlinear
decision boundaries. Moreover, this training method induces the first deep
active learning scheme known to achieve convergence guarantees, revealing a
geometric contraction rate of the feasible set. We exemplify the effectiveness
of our proposed active learning method against popular deep active learning
baselines via both synthetic data experiments and sentiment classification
tasks on real datasets.
[LINK]
http://arxiv.org/abs/2410.02145v5
[DATE]
2025-06-25 14:11:27+08:00
[CATEGORIES]
cs.LG
Counterfactual Fairness through Transforming Data Orthogonal to Bias
[AUTHORS]
Shuyi Chen, Shixiang Zhu
[ABSTRACT]
Machine learning models have shown exceptional prowess in solving complex
issues across various domains. However, these models can sometimes exhibit
biased decision-making, resulting in unequal treatment of different groups.
Despite substantial research on counterfactual fairness, methods to reduce the
impact of multivariate and continuous sensitive variables on decision-making
outcomes are still underdeveloped. We propose a novel data pre-processing
algorithm, Orthogonal to Bias (OB), which is designed to eliminate the
influence of a group of continuous sensitive variables, thus promoting
counterfactual fairness in machine learning applications. Our approach, based
on the assumption of a jointly normal distribution within a structural causal
model (SCM), demonstrates that counterfactual fairness can be achieved by
ensuring the data is orthogonal to the observed sensitive variables. The OB
algorithm is model-agnostic, making it applicable to a wide range of machine
learning models and tasks. Additionally, it includes a sparse variant to
improve numerical stability through regularization. Empirical evaluations on
both simulated and real-world datasets, encompassing settings with both
discrete and continuous sensitive variables, show that our methodology
effectively promotes fairer outcomes without compromising accuracy.
[LINK]
http://arxiv.org/abs/2403.17852v3
[DATE]
2025-06-25 13:35:44+08:00
[CATEGORIES]
cs.LG
Accept More, Reject Less: Reducing up to 19% Unnecessary Desk-Rejections over 11 Years of ICLR Data
[AUTHORS]
Xiaoyu Li, Zhao Song, Jiahao Zhang
[ABSTRACT]
The explosive growth of AI research has driven paper submissions at flagship
AI conferences to unprecedented levels, prompting many venues in 2025
(e.g., CVPR, ICCV, KDD, AAAI, IJCAI, WSDM) to enforce strict per-author
submission limits and to desk-reject any excess papers by simple ID order.
While this policy helps reduce reviewer workload, it may unintentionally
discard valuable papers and penalize authors’ efforts. In this paper, we ask an
essential research question on whether it is possible to follow submission
limits while minimizing needless rejections. We first formalize the current
desk-rejection policies as an optimization problem, and then develop a
practical algorithm based on linear programming relaxation and a rounding
scheme. Under extensive evaluation on 11 years of real-world ICLR
(International Conference on Learning Representations) data, our method
preserves up to $19.23\%$ more papers without violating any author limits.
Moreover, our algorithm is highly efficient in practice, with all results on
ICLR data computed within at most 53.64 seconds. Our work provides a simple and
practical desk-rejection strategy that significantly reduces unnecessary
rejections, demonstrating strong potential to improve current CS conference
submission policies.
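
To make the optimization view concrete, here is a small sketch that compares an ID-order policy with an exhaustive search on a toy instance. It is a brute-force illustration only, not the linear programming relaxation and rounding scheme described above, and the exact form of the ID-order rule (each author keeps their first `limit` papers by ID) is an assumption.

```python
from itertools import combinations


def id_order_policy(papers, limit):
    """Reject every paper beyond an author's first `limit` papers by ID."""
    rejected = set()
    authors = {a for auths in papers.values() for a in auths}
    for author in authors:
        own = sorted(pid for pid, auths in papers.items() if author in auths)
        rejected.update(own[limit:])
    return sorted(set(papers) - rejected)


def feasible(kept, papers, limit):
    """Check that no author appears on more than `limit` kept papers."""
    counts = {}
    for pid in kept:
        for author in papers[pid]:
            counts[author] = counts.get(author, 0) + 1
    return all(c <= limit for c in counts.values())


def max_kept(papers, limit):
    """Exhaustively find a largest feasible set of papers (toy sizes only)."""
    ids = sorted(papers)
    for k in range(len(ids), -1, -1):
        for subset in combinations(ids, k):
            if feasible(subset, papers, limit):
                return list(subset)
    return []


if __name__ == "__main__":
    # Hypothetical instance: paper ID -> set of author names.
    papers = {1: {"A", "B"}, 2: {"A", "C"}, 3: {"B", "D"}}
    print(id_order_policy(papers, limit=1))
    print(max_kept(papers, limit=1))
```

On this toy instance the ID-order rule keeps one paper while two can be kept without violating any limit, which is the kind of gap the abstract refers to.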
[LINK]
http://arxiv.org/abs/2506.20141v1
[DATE]
2025-06-25 13:23:44+08:00
[CATEGORIES]
cs.LG
High-Resolution Live Fuel Moisture Content (LFMC) Maps for Wildfire Risk from Multimodal Earth Observation Data
[AUTHORS]
Patrick Alan Johnson, Gabriel Tseng, Yawen Zhang, Heather Heward, Virginia Sjahli, Favyen Bastani, Joseph Redmon, Patrick Beukema
[ABSTRACT]
Wildfires are increasing in intensity and severity at an alarming rate.
Recent advances in AI and publicly available satellite data enable monitoring
critical wildfire risk factors globally, at high resolution and low latency.
Live Fuel Moisture Content (LFMC) is a critical wildfire risk factor and is
valuable for both wildfire research and operational response. However,
ground-based LFMC samples are both labor intensive and costly to acquire,
resulting in sparse and infrequent updates. In this work, we explore the use of
a pretrained, highly-multimodal earth-observation model for generating
large-scale spatially complete (wall-to-wall) LFMC maps. Our approach achieves
significant improvements over previous methods using randomly initialized
models (20% reduction in RMSE). We provide an automated pipeline that enables
rapid generation of these LFMC maps across the United States, and demonstrate
its effectiveness in two regions recently impacted by wildfire (Eaton and
Palisades).
[COMMENTS]
10 pages, ICML 2025 (TerraBytes)
[LINK]
http://arxiv.org/abs/2506.20132v1
[DATE]
2025-06-25 12:59:10+08:00
[CATEGORIES]
cs.LG
Log-Linear Attention
[AUTHORS]
Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, Yoon Kim
[ABSTRACT]
The attention mechanism in Transformers is an important primitive for
accurate and scalable sequence modeling. Its quadratic-compute and
linear-memory complexity however remain significant bottlenecks. Linear
attention and state-space models enable linear-time, constant-memory sequence
modeling and can moreover be trained efficiently through matmul-rich
parallelization across sequence length. However, at their core these models are
still RNNs, and thus their use of a fixed-size hidden state to model the
context is a fundamental limitation. This paper develops log-linear attention,
an attention mechanism that balances linear attention’s efficiency and the
expressiveness of softmax attention. Log-linear attention replaces the
fixed-size hidden state with a logarithmically growing set of hidden states. We
show that with a particular growth function, log-linear attention admits a
similarly matmul-rich parallel form whose compute cost is log-linear in
sequence length. Log-linear attention is a general framework and can be applied
on top of existing linear attention variants. As case studies, we instantiate
log-linear variants of two recent architectures – Mamba-2 and Gated DeltaNet
– and find they perform well compared to their linear-time variants.
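
As a rough illustration of a "logarithmically growing set of hidden states", the sketch below splits a prefix of length `t` into power-of-two segments, one per set bit of `t`. The abstract does not specify the growth function, so this binary partition is an assumption made for illustration, and `prefix_buckets` is a hypothetical name.

```python
def prefix_buckets(t):
    """Split positions [0, t) into contiguous power-of-two segments.

    One segment per set bit of t, largest (oldest) segment first, so the
    number of segments is at most t.bit_length(), i.e. O(log t).
    """
    buckets = []
    start = 0
    for bit in reversed(range(t.bit_length())):
        size = 1 << bit
        if t & size:
            buckets.append((start, start + size))
            start += size
    return buckets


if __name__ == "__main__":
    for t in (1, 5, 13, 1024):
        print(t, prefix_buckets(t))
```

If each segment were summarized by one fixed-size state, a model under this assumption would keep at most about log2(t) + 1 states at position `t`, rather than a single state (linear attention) or `t` entries (softmax attention).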
[LINK]
http://arxiv.org/abs/2506.04761v2
[DATE]
2025-06-25 12:54:28+08:00
[CATEGORIES]
cs.LG
Evaluating Generalization and Representation Stability in Small LMs via Prompting, Fine-Tuning and Out-of-Distribution Prompts
[AUTHORS]
Rahul Raja, Arpita Vats
[COMMENTS]
Accepted at ICML
[LINK]
http://arxiv.org/abs/2506.17289v2
[DATE]
2025-06-25 12:27:25+08:00
[CATEGORIES]
cs.LG
U-R-VEDA: Integrating UNET, Residual Links, Edge and Dual Attention, and Vision Transformer for Accurate Semantic Segmentation of CMRs
[AUTHORS]
Racheal Mukisa, Arvind K. Bansal
[ABSTRACT]
Artificial intelligence, including deep learning models, will play a
transformative role in automated medical image analysis for the diagnosis of
cardiac disorders and their management. Automated accurate delineation of
cardiac images is the necessary first step for the quantification and
automated diagnosis of cardiac disorders. In this paper, we propose a deep
learning based enhanced UNet model, U-R-Veda, which integrates convolution
transformations, vision transformer, residual links, channel-attention, and
spatial attention, together with edge-detection based skip-connections for an
accurate fully-automated semantic segmentation of cardiac magnetic resonance
(CMR) images. The model extracts local-features and their interrelationships
using a stack of combination convolution blocks, with embedded channel and
spatial attention in the convolution block, and vision transformers. Deep
embedding of channel and spatial attention in the convolution block identifies
important features and their spatial localization. The combined edge
information with channel and spatial attention as skip connection reduces
information-loss during convolution transformations. The overall model
significantly improves the semantic segmentation of CMR images necessary for
improved medical image analysis. An algorithm for the dual attention module
(channel and spatial attention) has been presented. Performance results show
that U-R-Veda achieves an average accuracy of 95.2%, based on DSC metrics. The
model outperforms the accuracy attained by other models, based on DSC and HD
metrics, especially for the delineation of right-ventricle and
left-ventricle-myocardium.
[COMMENTS]
15 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.20689v1
[DATE]
2025-06-25 12:10:09+08:00
[CATEGORIES]
cs.LG
Extracting Interpretable Models from Tree Ensembles: Computational and Statistical Perspectives
[AUTHORS]
Brian Liu, Rahul Mazumder, Peter Radchenko
[ABSTRACT]
Tree ensembles are non-parametric methods widely recognized for their
accuracy and ability to capture complex interactions. While these models excel
at prediction, they are difficult to interpret and may fail to uncover useful
relationships in the data. We propose an estimator to extract compact sets of
decision rules from tree ensembles. The extracted models are accurate and can
be manually examined to reveal relationships between the predictors and the
response. A key novelty of our estimator is the flexibility to jointly control
the number of rules extracted and the interaction depth of each rule, which
improves accuracy. We develop a tailored exact algorithm to efficiently solve
optimization problems underlying our estimator and an approximate algorithm for
computing regularization paths, sequences of solutions that correspond to
varying model sizes. We also establish novel non-asymptotic prediction error
bounds for our proposed approach, comparing it to an oracle that chooses the
best data-dependent linear combination of the rules in the ensemble subject to
the same complexity constraint as our estimator. The bounds illustrate that the
large-sample predictive performance of our estimator is on par with that of the
oracle. Through experiments, we demonstrate that our estimator outperforms
existing algorithms for rule extraction.
[LINK]
http://arxiv.org/abs/2506.20114v1
[DATE]
2025-06-25 12:06:37+08:00
[CATEGORIES]
cs.LG
Autonomous Cyber Resilience via a Co-Evolutionary Arms Race within a Fortified Digital Twin Sandbox
[AUTHORS]
Malikussaid, Sutiyo
[ABSTRACT]
The convergence of IT and OT has created hyper-connected ICS, exposing
critical infrastructure to a new class of adaptive, intelligent adversaries
that render static defenses obsolete. Existing security paradigms often fail to
address a foundational “Trinity of Trust,” comprising the fidelity of the
system model, the integrity of synchronizing data, and the resilience of the
analytical engine against sophisticated evasion. This paper introduces the ARC
framework, a method for achieving analytical resilience through an autonomous,
closed-loop hardening process. ARC establishes a perpetual co-evolutionary arms
race within the high-fidelity sandbox of an F-SCDT. A DRL agent, the “Red
Agent,” is formalized and incentivized to autonomously discover stealthy,
physically-plausible attack paths that maximize process disruption while
evading detection. Concurrently, an ensemble-based “Blue Agent” defender is
continuously hardened via adversarial training against the evolving threats
discovered by its adversary. This co-evolutionary dynamic forces both agents to
become progressively more sophisticated, enabling the system to autonomously
probe and patch its own vulnerabilities. Experimental validation on both the
TEP and the SWaT testbeds demonstrates the framework’s superior performance. A
comprehensive ablation study, supported by extensive visualizations including
ROC curves and SHAP plots, reveals that the co-evolutionary process itself is
responsible for a significant performance increase in detecting novel attacks.
By integrating XAI to ensure operator trust and proposing a scalable F-ARC
architecture, this work presents ARC not merely as an improvement, but as a
necessary paradigm shift toward dynamic, self-improving security for the future
of critical infrastructure.
[COMMENTS]
17 pages, 2 figures, 4 equations, 2 algorithms, 4 tables, to be
published in ISPACS Conference 2025, unabridged version
[LINK]
http://arxiv.org/abs/2506.20102v1
[DATE]
2025-06-25 11:28:48+08:00
[CATEGORIES]
cs.LG
Fine-Grained Perturbation Guidance via Attention Head Selection
[AUTHORS]
Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Saungwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
[ABSTRACT]
Recent guidance methods in diffusion models steer reverse sampling by
perturbing the model to construct an implicit weak model and guide generation
away from it. Among these approaches, attention perturbation has demonstrated
strong empirical performance in unconditional scenarios where classifier-free
guidance is not applicable. However, existing attention perturbation methods
lack principled approaches for determining where perturbations should be
applied, particularly in Diffusion Transformer (DiT) architectures where
quality-relevant computations are distributed across layers. In this paper, we
investigate the granularity of attention perturbations, ranging from the layer
level down to individual attention heads, and discover that specific heads
govern distinct visual concepts such as structure, style, and texture quality.
Building on this insight, we propose “HeadHunter”, a systematic framework for
iteratively selecting attention heads that align with user-centric objectives,
enabling fine-grained control over generation quality and visual attributes. In
addition, we introduce SoftPAG, which linearly interpolates each selected
head’s attention map toward an identity matrix, providing a continuous knob to
tune perturbation strength and suppress artifacts. Our approach not only
mitigates the oversmoothing issues of existing layer-level perturbation but
also enables targeted manipulation of specific visual styles through
compositional head selection. We validate our method on modern large-scale
DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1,
demonstrating superior performance in both general quality enhancement and
style-specific guidance. Our work provides the first head-level analysis of
attention perturbation in diffusion models, uncovering interpretable
specialization within attention layers and enabling practical design of
effective perturbation strategies.
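
The SoftPAG interpolation described above can be sketched for a single head as follows. This is a minimal sketch using nested lists; the parameter name `u` for the perturbation strength and the function name are assumptions, and the real method operates inside a diffusion transformer rather than on standalone matrices.

```python
def soft_interpolate(attn, u):
    """Linearly interpolate a square attention map toward the identity.

    u = 0 returns the original map and u = 1 returns the identity matrix.
    Rows that sum to 1 still sum to 1 after interpolation.
    """
    n = len(attn)
    if any(len(row) != n for row in attn):
        raise ValueError("attention map must be square")
    return [
        [(1.0 - u) * attn[i][j] + (u if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    attn = [[0.5, 0.5], [0.25, 0.75]]
    print(soft_interpolate(attn, 0.5))
```

Intermediate values of `u` give the continuous knob mentioned in the abstract: the closer `u` is to 1, the more each token attends only to itself.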
[COMMENTS]
Project page: https://cvlab-kaist.github.io/HeadHunter/
[LINK]
http://arxiv.org/abs/2506.10978v2
[DATE]
2025-06-25 10:37:46+08:00
[CATEGORIES]
cs.LG
Attack Smarter: Attention-Driven Fine-Grained Webpage Fingerprinting Attacks
[AUTHORS]
Yali Yuan, Weiyi Zou, Guang Cheng
[ABSTRACT]
Website Fingerprinting (WF) attacks aim to infer which websites a user is
visiting by analyzing traffic patterns, thereby compromising user anonymity.
Although this technique has been demonstrated to be effective in controlled
experimental environments, it remains largely limited to small-scale scenarios,
typically restricted to recognizing website homepages. In practical settings,
however, users frequently access multiple subpages in rapid succession, often
before previous content fully loads. WebPage Fingerprinting (WPF) generalizes
the WF framework to large-scale environments by modeling subpages of the same
site as distinct classes. These pages often share similar page elements,
resulting in lower inter-class variance in traffic features. Furthermore, we
consider multi-tab browsing scenarios, in which a single trace encompasses
multiple categories of webpages. This leads to overlapping traffic segments,
and similar features may appear in different positions within the traffic,
thereby increasing the difficulty of classification. To address these
challenges, we propose an attention-driven fine-grained WPF attack, named
ADWPF. Specifically, during the training phase, we apply targeted augmentation
to salient regions of the traffic based on attention maps, including attention
cropping and attention masking. ADWPF then extracts low-dimensional features
from both the original and augmented traffic and applies self-attention modules
to capture the global contextual patterns of the trace. Finally, to handle the
multi-tab scenario, we employ the residual attention to generate class-specific
representations of webpages occurring at different temporal positions.
Extensive experiments demonstrate that the proposed method consistently
surpasses state-of-the-art baselines across datasets of different scales.
[LINK]
http://arxiv.org/abs/2506.20082v1
[DATE]
2025-06-25 09:45:55+08:00
[CATEGORIES]
cs.LG
Quantum-Classical Hybrid Quantized Neural Network
[AUTHORS]
Wenxin Li, Chuan Wang, Hongdong Zhu, Qi Gao, Yin Ma, Hai Wei, Kai Wen
[ABSTRACT]
In this work, we present a novel Quadratic Binary Optimization (QBO)
model for quantized neural network training, enabling the use of arbitrary
activation and loss functions through spline interpolation. We introduce
Forward Interval Propagation (FIP), a method designed to tackle the challenges
of non-linearity and the multi-layer composite structure in neural networks by
discretizing activation functions into linear subintervals. This approach
preserves the universal approximation properties of neural networks while
allowing complex nonlinear functions to be optimized using quantum computers,
thus broadening their applicability in artificial intelligence. We provide
theoretical upper bounds on the approximation error and the number of Ising
spins required, by deriving the sample complexity of the empirical risk
minimization problem, from an optimization perspective. A significant challenge
in solving the associated Quadratic Constrained Binary Optimization (QCBO)
model on a large scale is the presence of numerous constraints. When employing
the penalty method to handle these constraints, tuning a large number of
penalty coefficients becomes a critical hyperparameter optimization problem,
increasing computational complexity and potentially affecting solution quality.
To address this, we employ the Quantum Conditional Gradient Descent (QCGD)
algorithm, which leverages quantum computing to directly solve the QCBO
problem. We prove the convergence of QCGD under a quantum oracle with
randomness and bounded variance in objective value, as well as under limited
precision constraints in the coefficient matrix. Additionally, we provide an
upper bound on the Time-To-Solution for the QCBO solving process. Experimental
results using a coherent Ising machine (CIM) demonstrate a 94.95% accuracy on
the Fashion MNIST classification task, with only 1.1-bit precision.
[COMMENTS]
27 pages, 5 figures, comments are welcome
[LINK]
http://arxiv.org/abs/2506.18240v2
[DATE]
2025-06-25 09:01:03+08:00
[CATEGORIES]
cs.LG
Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision
[AUTHORS]
KMA Solaiman, Bharat Bhargava
[ABSTRACT]
Existing multi-media retrieval models either rely on creating a common
subspace with modality-specific representation models or require schema mapping
among modalities to measure similarities among multi-media data. Our goal is to
avoid the annotation overhead incurred from considering retrieval as a
supervised classification task and re-use the pretrained encoders in large
language models and vision tasks. We propose “FemmIR”, a framework to retrieve
multimodal results relevant to information needs expressed with multimodal
queries by example without any similarity label. Such identification is
necessary for real-world applications where data annotations are scarce and
satisfactory performance is required without fine-tuning with a common
framework across applications. We curate a new dataset called MuQNOL for
benchmarking progress on this task. Our technique is based on weak supervision
introduced through edit distance between samples: graph edit distance can be
modified to consider the cost of replacing a data sample in terms of its
properties, and relevance can be measured through the implicit signal from the
amount of edit cost among the objects. Unlike metric learning or encoding
networks, FemmIR re-uses the high-level properties and maintains the property
value and relationship constraints with a multi-level interaction score between
data samples and the query example provided by the user. We empirically
evaluate FemmIR on a missing person use case with MuQNOL. FemmIR performs
comparably to similar retrieval systems in delivering on-demand retrieval
results with exact and approximate similarities while using the existing
property identifiers in the system.
[COMMENTS]
Submitted to ICDE’24. An earlier version of this paper appeared on
TechRxiv: https://www.techrxiv.org/doi/full/10.36227/techrxiv.21990284.v1,
uploaded on February 05, 2023
[LINK]
http://arxiv.org/abs/2506.20070v1
[DATE]
2025-06-25 08:25:08+08:00
[CATEGORIES]
cs.LG
Conformal Prediction with Upper and Lower Bound Models
[AUTHORS]
Miao Li, Michael Klamkin, Mathieu Tanneau, Reza Zandehshahvar, Pascal Van Hentenryck
[ABSTRACT]
This paper studies a Conformal Prediction (CP) methodology for building
prediction intervals in a regression setting, given only deterministic lower
and upper bounds on the target variable. It proposes a new CP mechanism (CPUL)
that goes beyond post-processing by adopting a model selection approach over
multiple nested interval construction methods. Paradoxically, many
well-established CP methods, including CPUL, may fail to provide adequate
coverage in regions where the bounds are tight. To remedy this limitation, the
paper proposes an optimal thresholding mechanism, OMLT, that adjusts CPUL
intervals in tight regions with undercoverage. The combined CPUL-OMLT is
validated on large-scale learning tasks where the goal is to bound the optimal
value of a parametric optimization problem. The experimental results
demonstrate substantial improvements over baseline methods across various
datasets.
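
For context, the sketch below shows generic split conformal prediction followed by clipping to known bounds, that is, the post-processing style of approach that the abstract says CPUL goes beyond. It is not CPUL or OMLT, and the function names are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of absolute calibration residuals.

    Uses the k-th smallest residual with k = ceil((n + 1) * (1 - alpha));
    returns infinity when the calibration set is too small for that level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def clipped_interval(pred, q, lower, upper):
    """Symmetric interval around pred, intersected with [lower, upper].

    Assumes lower <= pred <= upper, so the result is never empty.
    """
    return max(lower, pred - q), min(upper, pred + q)


if __name__ == "__main__":
    # Hypothetical absolute residuals |y - prediction| on a calibration set.
    residuals = [3, 1, 4, 9, 5, 2, 6, 8, 7]
    q = conformal_quantile(residuals, alpha=0.5)
    print(clipped_interval(pred=10, q=q, lower=7, upper=30))
```

When the bounds are tight, clipping determines most of the interval, which is the regime where the abstract notes that coverage can become inadequate.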
[LINK]
http://arxiv.org/abs/2503.04071v2
[DATE]
2025-06-25 08:04:42+08:00
[CATEGORIES]
cs.LG
TOMD: A Trail-based Off-road Multimodal Dataset for Traversable Pathway Segmentation under Challenging Illumination Conditions
[AUTHORS]
Yixin Sun, Li Li, Wenke E, Amir Atapour-Abarghouei, Toby P. Breckon
[ABSTRACT]
Detecting traversable pathways in unstructured outdoor environments remains a
significant challenge for autonomous robots, especially in critical
applications such as wide-area search and rescue, as well as incident
management scenarios like forest fires. Existing datasets and models primarily
target urban settings or wide, vehicle-traversable off-road tracks, leaving a
substantial gap in addressing the complexity of narrow, trail-like off-road
scenarios. To address this, we introduce the Trail-based Off-road Multimodal
Dataset (TOMD), a comprehensive dataset specifically designed for such
environments. TOMD features high-fidelity multimodal sensor data – including
128-channel LiDAR, stereo imagery, GNSS, IMU, and illumination measurements –
collected through repeated traversals under diverse conditions. We also propose
a dynamic multiscale data fusion model for accurate traversable pathway
prediction. The study analyzes the performance of early, cross, and mixed
fusion strategies under varying illumination levels. Results demonstrate the
effectiveness of our approach and the relevance of illumination in segmentation
performance. We publicly release TOMD at https://github.com/yyyxs1125/TMOD to
support future research in trail-based off-road navigation.
[COMMENTS]
8 pages, 9 figures, 2025 IJCNN
[LINK]
http://arxiv.org/abs/2506.21630v1
[DATE]
2025-06-25 07:58:44+08:00
[CATEGORIES]
cs.LG
Identifying Heterogeneity in Distributed Learning
[AUTHORS]
Zelin Xiao, Jia Gu, Song Xi Chen
[ABSTRACT]
We study methods for identifying heterogeneous parameter components in
distributed M-estimation with minimal data transmission. One is based on a
re-normalized Wald test, which is shown to be consistent as long as the number
of distributed data blocks $K$ is of a smaller order than the minimum block
sample size and the level of heterogeneity is dense. The second one is an
extreme contrast test (ECT) based on the difference between the largest and
smallest component-wise estimated parameters among data blocks. By introducing
a sample splitting procedure, the ECT can avoid the bias accumulation arising
from the M-estimation procedures, and exhibits consistency for $K$ being much
larger than the sample size while the heterogeneity is sparse. The ECT
procedure is easy to operate and communication-efficient. A combination of the
Wald and the extreme contrast tests is formulated to attain more robust power
under varying levels of sparsity of the heterogeneity. We also conduct
intensive numerical experiments to compare the family-wise error rate (FWER)
and the power of the proposed methods. Additionally, we conduct a case study to
present the implementation and validity of the proposed methods.
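
The quantity at the core of the extreme contrast test can be sketched as below. This computes only the raw component-wise difference between the largest and smallest block estimates; the sample splitting, normalization, and calibration of the actual test are not given in the abstract, so the `threshold` argument is a hypothetical placeholder.

```python
def extreme_contrast(block_estimates):
    """Per-component range of estimates across data blocks.

    block_estimates[k][j] is block k's estimate of parameter component j.
    """
    p = len(block_estimates[0])
    return [
        max(b[j] for b in block_estimates) - min(b[j] for b in block_estimates)
        for j in range(p)
    ]


def flag_components(contrasts, threshold):
    """Indices of components whose contrast exceeds a caller-supplied threshold."""
    return [j for j, c in enumerate(contrasts) if c > threshold]


if __name__ == "__main__":
    # Hypothetical estimates from three blocks for two components.
    blocks = [[1.0, 2.0], [1.5, 2.0], [1.0, 5.0]]
    contrasts = extreme_contrast(blocks)
    print(contrasts, flag_components(contrasts, threshold=1.0))
```

Each block only needs to send its estimate vector, which is consistent with the communication efficiency noted in the abstract.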
[LINK]
http://arxiv.org/abs/2506.16394v3
[DATE]
2025-06-25 07:55:45+08:00
[CATEGORIES]
cs.LG
Supervised Coupled Matrix-Tensor Factorization (SCMTF) for Computational Phenotyping of Patient Reported Outcomes in Ulcerative Colitis
[AUTHORS]
Cristian Minoccheri, Sophia Tesic, Kayvan Najarian, Ryan Stidham
[ABSTRACT]
Phenotyping is the process of distinguishing groups of patients to identify
different types of disease progression. A recent trend employs low-rank matrix
and tensor factorization methods for their capability of dealing with
multi-modal, heterogeneous, and missing data. Symptom quantification is crucial
for understanding patient experiences in inflammatory bowel disease, especially
in conditions such as ulcerative colitis (UC). However, patient-reported
symptoms are typically noisy, subjective, and significantly more sparse than
other data types. For this reason, they are usually not included in phenotyping
and other machine learning methods. This paper explores the application of
computational phenotyping to leverage Patient-Reported Outcomes (PROs) using a
novel supervised coupled matrix-tensor factorization (SCMTF) method, which
integrates temporal PROs and temporal labs with static features to predict
medication persistence in ulcerative colitis. This is the first tensor-based
method that is both supervised and coupled; it is also the first application to the
UC domain and the first application to PROs. We use a deep learning framework
that makes the model flexible and easy to train. The proposed method allows us
to handle the large amount of missing data in the PROs. The best model predicts
changes in medication 8 and 20 months in the future with AUCs of 0.853 and
0.803 on the test set respectively. We derive interpretable phenotypes
consisting of static features and temporal features (including their temporal
patterns). We show that low-rank matrix and tensor based phenotyping can be
successfully applied to the UC domain and to highly missing PRO data. We
identify phenotypes useful to predict medication persistence - these phenotypes
include several symptom variables, showing that PROs contain relevant
information that is usually discarded.
[LINK]
http://arxiv.org/abs/2506.20065v1
[DATE]
2025-06-25 07:55:11+08:00
[CATEGORIES]
cs.LG
The Alignment Trap: Complexity Barriers
[AUTHORS]
Jasper Yao
[ABSTRACT]
This paper argues that AI alignment is not merely difficult, but is founded
on a fundamental logical contradiction. We first establish The Enumeration
Paradox: we use machine learning precisely because we cannot enumerate all
necessary safety rules, yet making ML safe requires examples that can only be
generated from the very enumeration we admit is impossible. This paradox is
then confirmed by a set of five independent mathematical proofs, or “pillars of
impossibility.” Our main results show that: (1) Geometric Impossibility: The
set of safe policies has measure zero, a necessary consequence of projecting
infinite-dimensional world-context requirements onto finite-dimensional models.
(2) Computational Impossibility: Verifying a policy’s safety is coNP-complete,
even for non-zero error tolerances. (3) Statistical Impossibility: The training
data required for safety (abundant examples of rare disasters) is a logical
contradiction and thus unobtainable. (4) Information-Theoretic Impossibility:
Safety rules contain more incompressible, arbitrary information than any
feasible network can store. (5) Dynamic Impossibility: The optimization process
for increasing AI capability is actively hostile to safety, as the gradients
for the two objectives are generally anti-aligned. Together, these results
demonstrate that the pursuit of safe, highly capable AI is not a matter of
overcoming technical hurdles, but of confronting fundamental, interlocking
barriers. The paper concludes by presenting a strategic trilemma that these
impossibilities force upon the field. A formal verification of the core
theorems in Lean4 is currently in progress.
[COMMENTS]
31 Pages, 4 Figures. Substantial revision. Restructured around the
Enumeration Paradox and Five Pillars of Impossibility. Core mathematical
results unchanged but significantly expanded. Added new impossibility proofs
from statistical, information-theoretic, and dynamic perspectives
[LINK]
http://arxiv.org/abs/2506.10304v2
[DATE]
2025-06-25 07:41:11+08:00
[CATEGORIES]
cs.LG
Universal pre-training by iterated random computation
[AUTHORS]
Peter Bloem
[ABSTRACT]
We investigate the use of randomly generated data for the sake of
pre-training a model. We justify this approach theoretically from the
perspective of algorithmic complexity, building on recent research that shows
that sequence models can be trained to approximate Solomonoff induction. We
derive similar, but complementary theoretical results. We show empirically that
synthetically generated data can be used to pre-train a model before the data
is seen. We replicate earlier results that models trained this way show
zero-shot in-context learning across a variety of datasets, and that this
performance improves with scale. We extend earlier results to real-world data,
and show that finetuning a model after pre-training offers faster convergence
and better generalization.
[LINK]
http://arxiv.org/abs/2506.20057v1
[DATE]
2025-06-25 07:36:35+08:00
[CATEGORIES]
cs.LG
Machine-Learning-Assisted Photonic Device Development: A Multiscale Approach from Theory to Characterization
[AUTHORS]
Yuheng Chen, Alexander Montes McNeil, Taehyuk Park, Blake A. Wilson, Vaishnavi Iyer, Michael Bezick, Jae-Ik Choi, Rohan Ojha, Pravin Mahendran, Daksh Kumar Singh, Geetika Chitturi, Peigang Chen, Trang Do, Alexander V. Kildishev, Vladimir M. Shalaev, Michael Moebius, Wenshan Cai, Yongmin Liu, Alexandra Boltasseva
[ABSTRACT]
Photonic device development (PDD) has achieved remarkable success in
designing and implementing new devices for controlling light across various
wavelengths, scales, and applications, including telecommunications, imaging,
sensing, and quantum information processing. PDD is an iterative, five-step
process that consists of: i) deriving device behavior from design parameters,
ii) simulating device performance, iii) finding the optimal candidate designs
from simulations, iv) fabricating the optimal device, and v) measuring device
performance. Classically, all these steps involve Bayesian optimization,
material science, control theory, and direct physics-driven numerical methods.
However, many of these techniques are computationally intractable, monetarily
costly, or difficult to implement at scale. In addition, PDD suffers from large
optimization landscapes, uncertainties in structural or optical
characterization, and difficulties in implementing robust fabrication
processes. However, the advent of machine learning over the past decade has
provided novel, data-driven strategies for tackling these challenges, including
surrogate estimators for speeding up computations, generative modeling for
noisy measurement modeling and data augmentation, reinforcement learning for
fabrication, and active learning for experimental physical discovery. In this
review, we present a comprehensive perspective on these methods to enable
machine-learning-assisted PDD (ML-PDD) for efficient design optimization with
powerful generative models, fast simulation and characterization modeling under
noisy measurements, and reinforcement learning for fabrication. This review
will provide researchers from diverse backgrounds with valuable insights into
this emerging topic, fostering interdisciplinary efforts to accelerate the
development of complex photonic devices and systems.
[LINK]
http://arxiv.org/abs/2506.20056v1
[DATE]
2025-06-25 07:32:54+08:00
[CATEGORIES]
cs.LG
MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models
[AUTHORS]
Hoa La, Ahan Gupta, Alex Morehead, Jianlin Cheng, Minjia Zhang
[ABSTRACT]
Protein structure prediction models such as AlphaFold3 (AF3) push the
frontier of biomolecular modeling by incorporating science-informed
architectural changes to the transformer architecture. However, these advances
come at a steep system cost, introducing: compute- and memory-intensive
operators, 2D attention mechanisms, and retrieval-augmented data pipelines,
which collectively hinder the scalability of AF3 training. In this work, we
present MegaFold, a cross-platform system to accelerate AF3 training. MegaFold
tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle
time from the retrieval-augmented data pipeline, Triton-based kernels for
memory-efficient EvoAttention on heterogeneous devices, and deep fusion for
common and critical small operators in AF3. Evaluation on both NVIDIA H200 and
AMD MI250 GPUs shows that MegaFold reduces peak memory usage of AF3 training by
up to 1.23$\times$ and improves per-iteration training time by up to
1.73$\times$ and 1.62$\times$ respectively. More importantly, MegaFold enables
training on 1.35$\times$ longer sequence lengths compared to PyTorch baselines
without running out of memory, significantly improving the scalability of
modern protein folding models. We open source our code at
https://github.com/Supercomputing-System-AI-Lab/MegaFold/.
[COMMENTS]
13 pages, 12 figures
[LINK]
http://arxiv.org/abs/2506.20686v1
[DATE]
2025-06-25 07:30:49+08:00
[CATEGORIES]
cs.LG
A Principled Path to Fitted Distributional Evaluation
[AUTHORS]
Sungee Hong, Jiayi Wang, Zhengling Qi, Raymond Ka Wai Wong
[ABSTRACT]
In reinforcement learning, distributional off-policy evaluation (OPE) focuses
on estimating the return distribution of a target policy using offline data
collected under a different policy. This work focuses on extending the widely
used fitted-Q evaluation – developed for expectation-based reinforcement
learning – to the distributional OPE setting. We refer to this extension as
fitted distributional evaluation (FDE). While only a few related approaches
exist, there remains no unified framework for designing FDE methods. To fill
this gap, we present a set of guiding principles for constructing theoretically
grounded FDE methods. Building on these principles, we develop several new FDE
methods with convergence analysis and provide theoretical justification for
existing methods, even in non-tabular environments. Extensive experiments,
including simulations on linear quadratic regulators and Atari games,
demonstrate the superior performance of the FDE methods.
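
As a point of reference for the expectation-based method that this abstract extends, the following is a minimal tabular sketch of fitted-Q evaluation. It is not the authors' FDE method: the function name, the transition tuple layout, and the use of a per-pair mean as the regression step are assumptions made for illustration.

```python
from collections import defaultdict


def fitted_q_evaluation(transitions, policy, gamma=0.9, iterations=100):
    """Tabular fitted-Q evaluation (illustrative sketch).

    transitions: list of (state, action, reward, next_state, done) tuples
                 collected offline under some behavior policy.
    policy:      function mapping a state to the target policy's action.
    """
    q = defaultdict(float)
    for _ in range(iterations):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for s, a, r, s_next, done in transitions:
            # Bootstrapped regression target under the target policy.
            target = r if done else r + gamma * q[(s_next, policy(s_next))]
            sums[(s, a)] += target
            counts[(s, a)] += 1
        # In the tabular case the least-squares fit is the mean target.
        q = defaultdict(float, {k: sums[k] / counts[k] for k in sums})
    return q


# Hypothetical one-state problem: reward 1 forever, so Q = 1 / (1 - gamma).
data = [("s", "a", 1.0, "s", False)]
q_hat = fitted_q_evaluation(data, policy=lambda s: "a", gamma=0.5)
```

A distributional variant would replace the scalar target with a return distribution and the mean with a distribution-fitting step, which is the design space the abstract addresses.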
[LINK]
http://arxiv.org/abs/2506.20048v1
[DATE]
2025-06-25 07:08:56+08:00
[CATEGORIES]
cs.LG
GNN’s Uncertainty Quantification using Self-Distillation
[AUTHORS]
Hirad Daneshvar, Reza Samavi
[ABSTRACT]
Graph Neural Networks (GNNs) have shown remarkable performance in the
healthcare domain. However, quantifying the predictive uncertainty of GNNs,
an important aspect of trustworthiness in clinical settings, remains
challenging. While Bayesian and ensemble methods can be used to
quantify uncertainty, they are computationally expensive. Additionally, the
disagreement metric used by ensemble methods to compute uncertainty cannot
capture the diversity of models in an ensemble network. In this paper, we
propose a novel method, based on knowledge distillation, to quantify GNNs’
uncertainty more efficiently and with higher precision. We apply
self-distillation, where the same network serves as both the teacher and
student models, thereby avoiding the need to train several networks
independently. To ensure the impact of self-distillation, we develop an
uncertainty metric that captures the diverse nature of the network by assigning
different weights to each GNN classifier. We experimentally evaluate the
precision, performance, and ability of our approach in distinguishing
out-of-distribution data on two graph datasets: MIMIC-IV and Enzymes. The
evaluation results demonstrate that the proposed method can effectively capture
the predictive uncertainty of the model while having performance similar to
that of the MC Dropout and ensemble methods. The code is publicly available at
https://github.com/tailabTMU/UQ_GNN.
[COMMENTS]
The paper has been accepted in the International Conference on AI in
Healthcare (AIiH) 2025 and will appear in the conference proceedings
[LINK]
http://arxiv.org/abs/2506.20046v1
[DATE]
2025-06-25 07:08:31+08:00
[CATEGORIES]
cs.LG
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning
[AUTHORS]
Ahmet Sarigun, Bora Uyar, Vedran Franke, Altuna Akalin
[ABSTRACT]
Sampling physically valid ligand-binding poses remains a major challenge in
molecular docking, particularly for unseen or structurally diverse targets. We
introduce PocketVina, a fast and memory-efficient, search-based docking
framework that combines pocket prediction with systematic multi-pocket
exploration. We evaluate PocketVina across four established benchmarks:
PDBbind2020 (timesplit and unseen), DockGen, Astex, and PoseBusters. We
observe consistently strong performance in sampling physically
valid docking poses. PocketVina achieves state-of-the-art performance when
jointly considering ligand RMSD and physical validity (PB-valid), while
remaining competitive with deep learning-based approaches in terms of RMSD
alone, particularly on structurally diverse and previously unseen targets.
PocketVina also maintains state-of-the-art physically valid docking accuracy
across ligands with varying degrees of flexibility. We further introduce
TargetDock-AI, a benchmarking dataset we curated, consisting of over 500,000
protein-ligand pairs, and a partition of the dataset labeled with PubChem
activity annotations. On this large-scale dataset, PocketVina successfully
discriminates active from inactive targets, outperforming a deep learning
baseline while requiring significantly less GPU memory and runtime. PocketVina
offers a robust and scalable docking strategy that requires no task-specific
training and runs efficiently on standard GPUs, making it well-suited for
high-throughput virtual screening and structure-based drug discovery.
[LINK]
http://arxiv.org/abs/2506.20043v1
[DATE]
2025-06-25 06:50:30+08:00
[CATEGORIES]
cs.LG
LSH-DynED: A Dynamic Ensemble Framework with LSH-Based Undersampling for Evolving Multi-Class Imbalanced Classification
[AUTHORS]
Soheil Abadifard, Fazli Can
[ABSTRACT]
The classification of imbalanced data streams, which have unequal class
distributions, is a key difficulty in machine learning, especially when dealing
with multiple classes. While binary imbalanced data stream classification tasks
have received considerable attention, only a few studies have focused on
multi-class imbalanced data streams. Effectively managing the dynamic imbalance
ratio is a key challenge in this domain. This study introduces a novel, robust,
and resilient approach to address these challenges by integrating Locality
Sensitive Hashing with Random Hyperplane Projections (LSH-RHP) into the Dynamic
Ensemble Diversification (DynED) framework. To the best of our knowledge, we
present the first application of LSH-RHP for undersampling in the context of
imbalanced non-stationary data streams. The proposed method undersamples the
majority classes by utilizing LSH-RHP, provides a balanced training set, and
improves the ensemble’s prediction performance. We conduct comprehensive
experiments on 23 real-world and ten semi-synthetic datasets and compare
LSH-DynED with 15 state-of-the-art methods. The results reveal that LSH-DynED
outperforms other approaches in terms of both Kappa and mG-Mean effectiveness
measures, demonstrating its capability in dealing with multi-class imbalanced
non-stationary data streams. Notably, LSH-DynED performs well in large-scale,
high-dimensional datasets with considerable class imbalances and demonstrates
adaptation and robustness in real-world circumstances. To motivate our design,
we review existing methods for imbalanced data streams, outline key challenges,
and offer guidance for future work. For the reproducibility of our results, we
have made our implementation available on GitHub.
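
To make the undersampling idea concrete, here is a minimal sketch of locality-sensitive hashing with random hyperplane projections applied to a majority class. It is not the LSH-DynED implementation: the function names, the number of hash bits, and the rule of keeping a fixed number of samples per bucket are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def rhp_signature(x, hyperplanes):
    """Hash a vector to a bit tuple: one bit per hyperplane, set by the sign of the dot product."""
    return tuple(
        1 if sum(w_i * x_i for w_i, x_i in zip(w, x)) >= 0 else 0
        for w in hyperplanes
    )


def lsh_undersample(majority, n_bits=4, per_bucket=1, seed=0):
    """Keep at most `per_bucket` majority samples from each LSH bucket (hypothetical rule)."""
    rng = random.Random(seed)
    dim = len(majority[0])
    # Random Gaussian hyperplanes through the origin.
    hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    buckets = defaultdict(list)
    for x in majority:
        buckets[rhp_signature(x, hyperplanes)].append(x)
    kept = []
    for members in buckets.values():
        kept.extend(rng.sample(members, min(per_bucket, len(members))))
    return kept
```

Nearby points tend to share a bucket, so sampling per bucket thins dense regions of the majority class while retaining coverage of sparse ones.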
[LINK]
http://arxiv.org/abs/2506.20041v1
[DATE]
2025-06-25 06:46:47+08:00
[CATEGORIES]
cs.LG
Learning Bilateral Team Formation in Cooperative Multi-Agent Reinforcement Learning
[AUTHORS]
Koorosh Moslemi, Chi-Guhn Lee
[ABSTRACT]
Team formation and the dynamics of team-based learning have drawn significant
interest in the context of Multi-Agent Reinforcement Learning (MARL). However,
existing studies primarily focus on unilateral groupings, predefined teams, or
fixed-population settings, leaving the effects of algorithmic bilateral
grouping choices in dynamic populations underexplored. To address this gap, we
introduce a framework for learning two-sided team formation in dynamic
multi-agent systems. Through this study, we gain insight into what algorithmic
properties in bilateral team formation influence policy performance and
generalization. We validate our approach using widely adopted multi-agent
scenarios, demonstrating competitive performance and improved generalization in
most scenarios.
[COMMENTS]
Accepted to the 2nd Coordination and Cooperation in Multi-Agent
Reinforcement Learning (CoCoMARL) Workshop at RLC 2025
[LINK]
http://arxiv.org/abs/2506.20039v1
[DATE]
2025-06-25 06:40:05+08:00
[CATEGORIES]
cs.LG
Verifiable Unlearning on Edge
[AUTHORS]
Mohammad M Maheri, Alex Davidson, Hamed Haddadi
[ABSTRACT]
Machine learning providers commonly distribute global models to edge devices,
which subsequently personalize these models using local data. However, issues
such as copyright infringements, biases, or regulatory requirements may require
the verifiable removal of certain data samples across all edge devices.
Ensuring that edge devices correctly execute such unlearning operations is
critical to maintaining integrity.
In this work, we introduce a verification framework leveraging zero-knowledge
proofs, specifically zk-SNARKs, to confirm data unlearning on personalized
edge-device models without compromising privacy. We have developed algorithms
explicitly designed to facilitate unlearning operations that are compatible
with efficient zk-SNARK proof generation, ensuring minimal computational and
memory overhead suitable for constrained edge environments. Furthermore, our
approach carefully preserves personalized enhancements on edge devices,
maintaining model performance post-unlearning.
Our results affirm the practicality and effectiveness of this verification
framework, demonstrating verifiable unlearning with minimal degradation in
personalization-induced performance improvements. Our methodology ensures
verifiable, privacy-preserving, and effective machine unlearning across edge
devices.
[COMMENTS]
This paper has been accepted to the IEEE European Symposium on
Security and Privacy (EuroS&P) 2025
[LINK]
http://arxiv.org/abs/2506.20037v1
[DATE]
2025-06-25 06:24:47+08:00
[CATEGORIES]
cs.LG
Neural network-based Godunov corrections for approximate Riemann solvers using bi-fidelity learning
[AUTHORS]
Akshay Thakur, Matthew J. Zahr
[ABSTRACT]
The Riemann problem is fundamental in the computational modeling of
hyperbolic partial differential equations, enabling the development of stable
and accurate upwind schemes. While exact solvers provide robust upwinding
fluxes, their high computational cost necessitates approximate solvers.
Although approximate solvers achieve accuracy in many scenarios, they produce
inaccurate solutions in certain cases. To overcome this limitation, we propose
constructing neural network-based surrogate models, trained using supervised
learning, designed to map interior and exterior conservative state variables to
the corresponding exact flux. Specifically, we propose two distinct approaches:
one utilizing a vanilla neural network and the other employing a bi-fidelity
neural network. The performance of the proposed approaches is demonstrated
through applications to one-dimensional and two-dimensional partial
differential equations, showcasing their robustness and accuracy.
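
The gap between exact and approximate Riemann solvers can be illustrated on the inviscid Burgers equation, for which the exact Godunov flux has a closed form. This sketch is an assumption-laden example, not the paper's setup: the choice of Burgers' equation, the Rusanov flux as the approximate solver, and the function names are illustrative only.

```python
def burgers_flux(u):
    """Physical flux f(u) = u^2 / 2 of the inviscid Burgers equation."""
    return 0.5 * u * u


def godunov_flux(u_left, u_right):
    """Exact Godunov flux for Burgers: min of f over [uL, uR] if uL <= uR, else max."""
    if u_left <= u_right:
        if u_left > 0.0:
            return burgers_flux(u_left)
        if u_right < 0.0:
            return burgers_flux(u_right)
        return 0.0  # transonic rarefaction: the sonic point u = 0 lies inside
    return max(burgers_flux(u_left), burgers_flux(u_right))


def rusanov_flux(u_left, u_right):
    """Approximate (local Lax-Friedrichs) flux with the maximum wave speed as dissipation."""
    speed = max(abs(u_left), abs(u_right))
    return 0.5 * (burgers_flux(u_left) + burgers_flux(u_right)) - 0.5 * speed * (u_right - u_left)


# A learned correction would aim to close this gap between the two fluxes.
gap = godunov_flux(-1.0, 1.0) - rusanov_flux(-1.0, 1.0)
```

A surrogate of the kind the abstract describes would take the pair of states as input and be trained against the exact flux as the target.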
[COMMENTS]
22 pages, 17 figures
[LINK]
http://arxiv.org/abs/2503.13248v2
[DATE]
2025-06-25 06:02:35+08:00
[CATEGORIES]
cs.LG
Automated Generation of Diverse Courses of Actions for Multi-Agent Operations using Binary Optimization and Graph Learning
[AUTHORS]
Prithvi Poddar, Ehsan Tarkesh Esfahani, Karthik Dantu, Souma Chowdhury
[ABSTRACT]
Operations in disaster response, search & rescue, and military missions that
involve multiple agents demand automated processes to support the planning of
the courses of action (COA). Moreover, traverse-affecting changes in the
environment (rain, snow, blockades, etc.) may impact the expected performance
of a COA, making it desirable to have a pool of COAs that are diverse in task
distributions across agents. Further, variations in agent capabilities, which
could be human crews and/or autonomous systems, present practical opportunities
and computational challenges to the planning process. This paper presents a new
theoretical formulation and computational framework to generate such diverse
pools of COAs for operations with soft variations in agent-task compatibility.
Key to the problem formulation is a graph abstraction of the task space and the
pool of COAs itself to quantify its diversity. Formulating the COAs as a
centralized multi-robot task allocation problem, a genetic algorithm is used
for (order-ignoring) allocations of tasks to each agent that jointly maximize
diversity within the COA pool and overall compatibility of the agent-task
mappings. A graph neural network is trained using a policy gradient approach to
then perform single agent task sequencing in each COA, which maximizes
completion rates adaptive to task features. Our tests of the COA generation
process in a simulated environment demonstrate significant performance gain
over a random walk baseline, small optimality gap in task sequencing, and
execution time of about 50 minutes to plan up to 20 COAs for 5 agent/100 task
operations.
[LINK]
http://arxiv.org/abs/2506.20031v1
[DATE]
2025-06-25 05:58:30+08:00
[CATEGORIES]
cs.LG
Thumb on the Scale: Optimal Loss Weighting in Last Layer Retraining
[AUTHORS]
Nathan Stromberg, Christos Thrampoulidis, Lalitha Sankar
[ABSTRACT]
While machine learning models become more capable in discriminative tasks at
scale, their ability to overcome biases introduced by training data has come
under increasing scrutiny. Previous results suggest that there are two extremes
of parameterization with very different behaviors: the population
(underparameterized) setting where loss weighting is optimal and the separable
overparameterized setting where loss weighting is ineffective at ensuring equal
performance across classes. This work explores the regime of last layer
retraining (LLR) in which the unseen limited (retraining) data is frequently
inseparable and the model proportionately sized, falling between the two
aforementioned extremes. We show, in theory and practice, that loss weighting
is still effective in this regime, but that these weights must take into
account the relative overparameterization of the model.
[LINK]
http://arxiv.org/abs/2506.20025v1
[DATE]
2025-06-25 05:48:58+08:00
[CATEGORIES]
cs.LG
Elucidated Rolling Diffusion Models for Probabilistic Weather Forecasting
[AUTHORS]
Salva Rühling Cachay, Miika Aittala, Karsten Kreis, Noah Brenowitz, Arash Vahdat, Morteza Mardani, Rose Yu
[ABSTRACT]
Diffusion models are a powerful tool for probabilistic forecasting, yet most
applications in high-dimensional chaotic systems predict future snapshots
one-by-one. This common approach struggles to model complex temporal
dependencies and fails to explicitly account for the progressive growth of
uncertainty inherent to such systems. While rolling diffusion frameworks, which
apply increasing noise to forecasts at longer lead times, have been proposed to
address this, their integration with state-of-the-art, high-fidelity diffusion
techniques remains a significant challenge. We tackle this problem by
introducing Elucidated Rolling Diffusion Models (ERDM), the first framework to
successfully unify a rolling forecast structure with the principled, performant
design of Elucidated Diffusion Models (EDM). To do this, we adapt the core EDM
components (its noise schedule, network preconditioning, and Heun sampler) to the
rolling forecast setting. The success of this integration is driven by three
key contributions: (i) a novel loss weighting scheme that focuses model
capacity on the mid-range forecast horizons where determinism gives way to
stochasticity; (ii) an efficient initialization strategy using a pre-trained
EDM for the initial window; and (iii) a bespoke hybrid sequence architecture
for robust spatiotemporal feature extraction under progressive denoising. On 2D
Navier-Stokes simulations and ERA5 global weather forecasting at $1.5^\circ$
resolution, ERDM consistently outperforms key diffusion-based baselines,
including conditional autoregressive EDM. ERDM offers a flexible and powerful
general framework for tackling diffusion-based sequence generation problems
where modeling escalating uncertainty is paramount. Code is available at:
https://github.com/salvaRC/erdm
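
The rolling idea of applying more noise at longer lead times can be sketched as a per-position noise schedule over the forecast window. The sketch below is not the ERDM schedule: it simply reuses the rho-spaced interpolation familiar from EDM sampling, and the function name and the default values for `sigma_min`, `sigma_max`, and `rho` are assumptions.

```python
def rolling_noise_levels(window, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Return one noise level per lead time, increasing from sigma_min to sigma_max.

    Interpolates linearly in sigma^(1/rho) space, so position 0 (the nearest
    lead time) is almost clean and the last position is almost pure noise.
    """
    if window < 2:
        raise ValueError("window must contain at least two lead times")
    lo = sigma_min ** (1.0 / rho)
    hi = sigma_max ** (1.0 / rho)
    return [(lo + k / (window - 1) * (hi - lo)) ** rho for k in range(window)]


levels = rolling_noise_levels(window=6)
```

In a rolling sampler, each denoising step would move every position one level down this ladder, emit the now-clean first frame, and append a fresh fully noised frame at the end of the window.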
[LINK]
http://arxiv.org/abs/2506.20024v1
[DATE]
2025-06-25 05:44:31+08:00
[CATEGORIES]
cs.LG
DIM-SUM: Dynamic IMputation for Smart Utility Management
[AUTHORS]
Ryan Hildebrant, Rahul Bhope, Sharad Mehrotra, Christopher Tull, Nalini Venkatasubramanian
[ABSTRACT]
Time series imputation models have traditionally been developed using
complete datasets with artificial masking patterns to simulate missing values.
However, in real-world infrastructure monitoring, practitioners often encounter
datasets where large amounts of data are missing and follow complex,
heterogeneous patterns. We introduce DIM-SUM, a preprocessing framework for
training robust imputation models that bridges the gap between artificially
masked training data and real missing patterns. DIM-SUM combines pattern
clustering and adaptive masking strategies with theoretical learning guarantees
to handle diverse missing patterns actually observed in the data. Through
extensive experiments on over 2 billion readings from California water
districts, electricity datasets, and benchmarks, we demonstrate that DIM-SUM
outperforms traditional methods by reaching similar accuracy with lower
processing time and significantly less training data. When compared against a
large pre-trained model, DIM-SUM averages 2x higher accuracy with significantly
less inference time.
[LINK]
http://arxiv.org/abs/2506.20023v1
[DATE]
2025-06-25 05:38:06+08:00
[CATEGORIES]
cs.LG
New Insights on Unfolding and Fine-tuning Quantum Federated Learning
[AUTHORS]
Shanika Iroshi Nanayakkara, Shiva Raj Pokhrel
[ABSTRACT]
Client heterogeneity poses significant challenges to the performance of
Quantum Federated Learning (QFL). To overcome these limitations, we propose a
new approach leveraging deep unfolding, which enables clients to autonomously
optimize hyperparameters, such as learning rates and regularization factors,
based on their specific training behavior. This dynamic adaptation mitigates
overfitting and ensures robust optimization in highly heterogeneous
environments where standard aggregation methods often fail. Our framework
achieves approximately 90% accuracy, significantly outperforming traditional
methods, which typically yield around 55% accuracy, as demonstrated through
real-time training on IBM quantum hardware and Qiskit Aer simulators. By
developing self-adaptive fine-tuning, the proposed method proves particularly
effective in critical applications such as gene expression analysis and cancer
detection, enhancing diagnostic precision and predictive modeling within
quantum systems. We attribute these results to the convergence-aware, learnable
optimization steps intrinsic to the deep unfolded framework, which preserve
generalization. This study thus addresses core limitations of conventional
QFL and broadens its applicability to complex domains such as healthcare and
genomic research.
[COMMENTS]
12 pages, 9 figures, 7 Tables, Submitted to IEEE/ACM journal 2025
[LINK]
http://arxiv.org/abs/2506.20016v1
[DATE]
2025-06-25 05:17:48+08:00
[CATEGORIES]
cs.LG
Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons
[AUTHORS]
Dengyu Wu, Jiechen Chen, H. Vincent Poor, Bipin Rajendran, Osvaldo Simeone
[ABSTRACT]
Neuromorphic computing offers an energy-efficient alternative to conventional
deep learning accelerators for real-time time-series processing. However, many
edge applications, such as wireless sensing and audio recognition, generate
streaming signals with rich spectral features that are not effectively captured
by conventional leaky integrate-and-fire (LIF) spiking neurons. This paper
investigates a wireless split computing architecture that employs
resonate-and-fire (RF) neurons with oscillatory dynamics to process time-domain
signals directly, eliminating the need for costly spectral pre-processing. By
resonating at tunable frequencies, RF neurons extract time-localized spectral
features while maintaining low spiking activity. This temporal sparsity
translates into significant savings in both computation and transmission
energy. Assuming an OFDM-based analog wireless interface for spike
transmission, we present a complete system design and evaluate its performance
on audio classification and modulation classification tasks. Experimental
results show that the proposed RF-SNN architecture achieves comparable accuracy
to conventional LIF-SNNs and ANNs, while substantially reducing spike rates and
total energy consumption during inference and communication.
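
The frequency selectivity of a resonate-and-fire neuron can be seen in a short simulation. This is a generic sketch of a damped complex oscillator with a threshold on its imaginary part, not the RF-SNN architecture from the paper: the damping, threshold, time step, and the absence of a reset after spiking are all assumptions.

```python
import cmath
import math


def resonate_and_fire(inputs, freq_hz, damping=5.0, threshold=0.05, dt=1e-4):
    """Simulate one resonate-and-fire neuron and return the spike time indices.

    The state z follows dz/dt = (-damping + i*omega) * z + input, integrated with
    an exact exponential step for the linear part. A spike is recorded whenever
    Im(z) rises above the threshold (no reset, for simplicity).
    """
    omega = 2.0 * math.pi * freq_hz
    decay = cmath.exp(complex(-damping, omega) * dt)
    z = 0j
    above = False
    spikes = []
    for t, current in enumerate(inputs):
        z = decay * z + dt * current
        now_above = z.imag > threshold
        if now_above and not above:
            spikes.append(t)
        above = now_above
    return spikes


def tone(freq_hz, steps=10000, dt=1e-4):
    """Hypothetical test input: a unit-amplitude cosine at the given frequency."""
    return [math.cos(2.0 * math.pi * freq_hz * k * dt) for k in range(steps)]


on_resonance = resonate_and_fire(tone(50.0), freq_hz=50.0)
off_resonance = resonate_and_fire(tone(10.0), freq_hz=50.0)
```

With these settings the neuron tuned to 50 Hz spikes for the 50 Hz tone and stays silent for the 10 Hz tone, which is the time-localized spectral selectivity the abstract relies on.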
[LINK]
http://arxiv.org/abs/2506.20015v1
[DATE]
2025-06-25 05:14:59+08:00
[CATEGORIES]
cs.LG
DRO-Augment Framework: Robustness by Synergizing Wasserstein Distributionally Robust Optimization and Data Augmentation
[AUTHORS]
Jiaming Hu, Debarghya Mukherjee, Ioannis Ch. Paschalidis
[ABSTRACT]
In many real-world applications, ensuring the robustness and stability of
deep neural networks (DNNs) is crucial, particularly for image classification
tasks that encounter various input perturbations. While data augmentation
techniques have been widely adopted to enhance the resilience of a trained
model against such perturbations, there remains significant room for
improvement in robustness against corrupted data and adversarial attacks
simultaneously. To address this challenge, we introduce DRO-Augment, a novel
framework that integrates Wasserstein Distributionally Robust Optimization
(W-DRO) with various data augmentation strategies to improve the robustness of
the models significantly across a broad spectrum of corruptions. Our method
outperforms existing augmentation methods under severe data perturbations and
adversarial attack scenarios while maintaining the accuracy on the clean
datasets on a range of benchmark datasets, including but not limited to
CIFAR-10-C, CIFAR-100-C, MNIST, and Fashion-MNIST. On the theoretical side, we
establish novel generalization error bounds for neural networks trained using a
computationally efficient, variation-regularized loss function closely related
to the W-DRO problem.
[COMMENTS]
26 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.17874v2
[DATE]
2025-06-25 05:04:53+08:00
[CATEGORIES]
cs.LG
Scalable Machine Learning Algorithms using Path Signatures
[AUTHORS]
Csaba Tóth
[ABSTRACT]
The interface between stochastic analysis and machine learning is a rapidly
evolving field, with path signatures - iterated integrals that provide
faithful, hierarchical representations of paths - offering a principled and
universal feature map for sequential and structured data. Rooted in rough path
theory, path signatures are invariant to reparameterization and well-suited for
modelling evolving dynamics, long-range dependencies, and irregular sampling -
common challenges in real-world time series and graph data.
This thesis investigates how to harness the expressive power of path
signatures within scalable machine learning pipelines. It introduces a suite of
models that combine theoretical robustness with computational efficiency,
bridging rough path theory with probabilistic modelling, deep learning, and
kernel methods. Key contributions include: Gaussian processes with signature
kernel-based covariance functions for uncertainty-aware time series modelling;
the Seq2Tens framework, which employs low-rank tensor structure in the weight
space for scalable deep modelling of long-range dependencies; and graph-based
models where expected signatures over graphs induce hypo-elliptic diffusion
processes, offering expressive yet tractable alternatives to standard graph
neural networks. Further developments include Random Fourier Signature
Features, a scalable kernel approximation with theoretical guarantees, and
Recurrent Sparse Spectrum Signature Gaussian Processes, which combine Gaussian
processes, signature kernels, and random features with a principled forgetting
mechanism for multi-horizon time series forecasting with adaptive context
length.
We hope this thesis serves as both a methodological toolkit and a conceptual
bridge, and provides a useful reference for the current state of the art in
scalable, signature-based learning for sequential and structured data.
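
For readers unfamiliar with path signatures, the first two levels of the iterated integrals can be computed exactly for a piecewise-linear path using Chen's identity. The sketch below is a plain-Python illustration and not code from the thesis; the function name and the truncation at depth 2 are assumptions.

```python
def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path given as a list of points.

    level1[i]    = total increment in coordinate i.
    level2[i][j] = iterated integral of dx_i then dx_j along the path.
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for prev, cur in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, cur)]
        # Chen's identity: combine the signature so far with one linear segment.
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            level1[i] += delta[i]
    return level1, level2


# L-shaped path: move right, then up.
s1, s2 = signature_depth2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
```

Inserting an extra point on an existing segment leaves the result unchanged, which reflects the invariance to reparameterization mentioned in the abstract, while the antisymmetric part of `level2` records the order in which the coordinates moved.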
[COMMENTS]
PhD thesis
[LINK]
http://arxiv.org/abs/2506.17634v2
[DATE]
2025-06-25 04:58:09+08:00
[CATEGORIES]
cs.LG
Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing
[AUTHORS]
Narasimha Raghavan Veeraragavan, Jan Franz Nygård
[ABSTRACT]
We propose Guardian-FC, a novel two-layer framework for privacy preserving
federated computing that unifies safety enforcement across diverse privacy
preserving mechanisms, including cryptographic back-ends like fully homomorphic
encryption (FHE) and multiparty computation (MPC), as well as statistical
techniques such as differential privacy (DP). Guardian-FC decouples guard-rails
from privacy mechanisms by executing plug-ins (modular computation units),
written in a backend-neutral, domain-specific language (DSL) designed
specifically for federated computing workflows and interchangeable Execution
Providers (EPs), which implement DSL operations for various privacy back-ends.
An Agentic-AI control plane enforces a finite-state safety loop through signed
telemetry and commands, ensuring consistent risk management and auditability.
The manifest-centric design supports fail-fast job admission and seamless
extensibility to new privacy back-ends. We present qualitative scenarios
illustrating backend-agnostic safety and a formal model foundation for
verification. Finally, we outline a research agenda inviting the community to
advance adaptive guard-rail tuning, multi-backend composition, DSL
specification development, implementation, and compiler extensibility alongside
human-override usability.
[COMMENTS]
Accepted at ICML 2025 Workshop on Collaborative and Federated Agentic
Workflows (CFAgentic@ICML’25)
[LINK]
http://arxiv.org/abs/2506.20000v1
[DATE]
2025-06-25 04:39:49+08:00
[CATEGORIES]
cs.LG
In-Context Learning for Gradient-Free Receiver Adaptation: Principles, Applications, and Theory
[AUTHORS]
Matteo Zecchin, Tomer Raviv, Dileep Kalathil, Krishna Narayanan, Nir Shlezinger, Osvaldo Simeone
[ABSTRACT]
In recent years, deep learning has facilitated the creation of wireless
receivers capable of functioning effectively in conditions that challenge
traditional model-based designs. Leveraging programmable hardware
architectures, deep learning-based receivers offer the potential to dynamically
adapt to varying channel environments. However, current adaptation strategies,
including joint training, hypernetwork-based methods, and meta-learning, either
demonstrate limited flexibility or necessitate explicit optimization through
gradient descent. This paper presents gradient-free adaptation techniques
rooted in the emerging paradigm of in-context learning (ICL). We review
architectural frameworks for ICL based on Transformer models and structured
state-space models (SSMs), alongside theoretical insights into how sequence
models effectively learn adaptation from contextual information. Further, we
explore the application of ICL to cell-free massive MIMO networks, providing
both theoretical analyses and empirical evidence. Our findings indicate that
ICL represents a principled and efficient approach to real-time receiver
adaptation using pilot signals and auxiliary contextual information, without
requiring online retraining.
[LINK]
http://arxiv.org/abs/2506.15176v2
[DATE]
2025-06-25 04:30:14+08:00
[CATEGORIES]
cs.LG
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
[AUTHORS]
Geonwoo Cho, Jaegyun Im, Jihwan Lee, Hojun Yi, Sejin Kim, Sundong Kim
[ABSTRACT]
Generalizing deep reinforcement learning agents to unseen environments
remains a significant challenge. One promising solution is Unsupervised
Environment Design (UED), a co-evolutionary framework in which a teacher
adaptively generates tasks with high learning potential, while a student learns
a robust policy from this evolving curriculum. Existing UED methods typically
measure learning potential via regret, the gap between optimal and current
performance, approximated solely by value-function loss. Building on these
approaches, we introduce the transition prediction error as an additional term
in our regret approximation. To capture how training on one task affects
performance on others, we further propose a lightweight metric called
co-learnability. By combining these two measures, we present Transition-aware
Regret Approximation with Co-learnability for Environment Design (TRACED).
Empirical evaluations show that TRACED yields curricula that improve zero-shot
generalization across multiple benchmarks while requiring up to 2x fewer
environment interactions than strong baselines. Ablation studies confirm that
the transition prediction error drives rapid complexity ramp-up and that
co-learnability delivers additional gains when paired with the transition
prediction error. These results demonstrate how refined regret approximation
and explicit modeling of task relationships can be leveraged for
sample-efficient curriculum design in UED.
[LINK]
http://arxiv.org/abs/2506.19997v1
[DATE]
2025-06-25 04:29:24+08:00
[CATEGORIES]
cs.LG
CoVE: Compressed Vocabulary Expansion Makes Better LLM-based Recommender Systems
[AUTHORS]
Haochen Zhang, Tianyi Zhang, Junze Yin, Oren Gal, Anshumali Shrivastava, Vladimir Braverman
[ABSTRACT]
Recommender systems play a pivotal role in providing relevant content to
users. With the rapid development of large language models (LLMs), researchers
have begun utilizing LLMs to build more powerful recommender systems. However,
existing approaches that focus on aligning LLMs with recommendation tasks do
not fully leverage their sequential information processing capabilities,
leading to suboptimal performance.
In this paper, we propose a novel system called compressed vocabulary
expansion (CoVE). In CoVE, each item is assigned a unique ID within the
expanded vocabulary. Our framework effectively capitalizes on sequence
understanding abilities of LLMs, significantly enhancing their performance on
recommendation tasks. Additionally, we compress the embedding layer, making
CoVE practical for large-scale industrial applications. The effectiveness and
performance of CoVE are demonstrated through comprehensive experiments on
multiple recommendation datasets and comparisons with prior works. Our code can
be found at https://github.com/HaochenZhang717/CoVE-official-Repo.
[COMMENTS]
Accepted by ACL 2025 Findings
[LINK]
http://arxiv.org/abs/2506.19993v1
[DATE]
2025-06-25 04:27:51+08:00
[CATEGORIES]
cs.LG
Ark: An Open-source Python-based Framework for Robot Learning
[AUTHORS]
Magnus Dierking, Christopher E. Mower, Sarthak Das, Huang Helong, Jiacheng Qiu, Cody Reading, Wei Chen, Huidong Liang, Huang Guowei, Jan Peters, Quan Xingyue, Jun Wang, Haitham Bou-Ammar
[ABSTRACT]
Robotics has made remarkable hardware strides, from DARPA’s Urban and Robotics
Challenges to the first humanoid-robot kickboxing tournament, yet commercial
autonomy still lags behind progress in machine learning. A major bottleneck is
software: current robot stacks demand steep learning curves, low-level C/C++
expertise, fragmented tooling, and intricate hardware integration, in stark
contrast to the Python-centric, well-documented ecosystems that propelled
modern AI. We introduce ARK, an open-source, Python-first robotics framework
designed to close that gap. ARK presents a Gym-style environment interface that
allows users to collect data, preprocess it, and train policies using
state-of-the-art imitation-learning algorithms (e.g., ACT, Diffusion Policy)
while seamlessly toggling between high-fidelity simulation and physical robots.
A lightweight client-server architecture provides networked
publisher-subscriber communication, and optional C/C++ bindings ensure
real-time performance when needed. ARK ships with reusable modules for control,
SLAM, motion planning, system identification, and visualization, along with
native ROS interoperability. Comprehensive documentation and case studies,
from manipulation to mobile navigation, demonstrate rapid prototyping, effortless
hardware swapping, and end-to-end pipelines that rival the convenience of
mainstream machine-learning workflows. By unifying robotics and AI practices
under a common Python umbrella, ARK lowers entry barriers and accelerates
research and commercial deployment of autonomous robots.
[LINK]
http://arxiv.org/abs/2506.21628v1
[DATE]
2025-06-25 04:23:39+08:00
[CATEGORIES]
cs.LG
Follow-the-Perturbed-Leader Approaches Best-of-Both-Worlds for the m-Set Semi-Bandit Problems
[AUTHORS]
Jingxin Zhan, Yuchen Xin, Chenjie Sun, Zhihua Zhang
[ABSTRACT]
We consider a common case of the combinatorial semi-bandit problem, the
$m$-set semi-bandit, where the learner selects exactly $m$ arms from the total
$d$ arms. In the adversarial setting, the best regret bound, known to be
$\mathcal{O}(\sqrt{nmd})$ for time horizon $n$, is achieved by the well-known
Follow-the-Regularized-Leader (FTRL) policy. However, FTRL requires explicitly
computing the arm-selection probabilities by solving an optimization problem at
each time step and then sampling according to them. This cost can be avoided by
the Follow-the-Perturbed-Leader (FTPL) policy, which simply pulls the $m$ arms
with the $m$ smallest (estimated) losses after random perturbation. In this
paper, we show that FTPL with a Fréchet perturbation also enjoys the
near-optimal regret bound $\mathcal{O}(\sqrt{nm}(\sqrt{d\log(d)}+m^{5/6}))$ in the
adversarial setting and approaches best-of-both-worlds regret bounds, i.e.,
achieves a logarithmic regret for the stochastic setting. Moreover, our lower
bounds show that the extra factors are unavoidable with our approach; any
improvement would require a fundamentally different and more challenging
method.
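The FTPL selection rule described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Fréchet shape parameter `alpha=2.0`, and the learning rate `eta` are hypothetical choices, and the loss-estimation step that a full semi-bandit algorithm needs is omitted.

```python
import math
import random


def frechet_sample(rng, alpha=2.0):
    """Draw one Frechet(alpha) sample by inverting the CDF F(x) = exp(-x**(-alpha))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); rng.random() never returns 1.0
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / alpha)


def ftpl_select(est_losses, m, eta, rng):
    """Pick the m arms whose perturbed estimated losses are smallest.

    est_losses: cumulative estimated loss per arm (assumed given).
    eta: learning rate (assumption); a larger eta means a weaker perturbation.
    """
    perturbed = [loss - frechet_sample(rng) / eta for loss in est_losses]
    order = sorted(range(len(est_losses)), key=lambda i: perturbed[i])
    return sorted(order[:m])


if __name__ == "__main__":
    rng = random.Random(0)
    # With a very large eta the perturbation is negligible, so the two arms
    # with the smallest estimated losses are selected.
    print(ftpl_select([5.0, 1.0, 4.0, 0.0, 3.0], m=2, eta=1e12, rng=rng))
```

Note that no probability vector is ever computed: one perturbation per arm and a sort are enough, which is the computational advantage over FTRL that the abstract points to.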
[LINK]
http://arxiv.org/abs/2504.07307v3
[DATE]
2025-06-25 04:04:37+08:00
[CATEGORIES]
cs.LG
MaizeField3D: A Curated 3D Point Cloud and Procedural Model Dataset of Field-Grown Maize from a Diversity Panel
[AUTHORS]
Elvis Kimara, Mozhgan Hadadi, Jackson Godbersen, Aditya Balu, Talukder Jubery, Yawei Li, Adarsh Krishnamurthy, Patrick S. Schnable, Baskar Ganapathysubramanian
[ABSTRACT]
The development of artificial intelligence (AI) and machine learning (ML)
based tools for 3D phenotyping, especially for maize, has been limited due to
the lack of large and diverse 3D datasets. 2D image datasets fail to capture
essential structural details such as leaf architecture, plant volume, and
spatial arrangements that 3D data provide. To address this limitation, we
present MaizeField3D (https://baskargroup.github.io/MaizeField3D/), a curated
dataset of 3D point clouds of field-grown maize plants from a diverse genetic
panel, designed to be AI-ready for advancing agricultural research. Our dataset
includes 1,045 high-quality point clouds of field-grown maize collected using a
terrestrial laser scanner (TLS). Point clouds of 520 plants from this dataset
were segmented and annotated using a graph-based segmentation method to isolate
individual leaves and stalks, ensuring consistent labeling across all samples.
This labeled data was then used for fitting procedural models that provide a
structured parametric representation of the maize plants. The leaves of the
maize plants in the procedural models are represented using Non-Uniform
Rational B-Spline (NURBS) surfaces that were generated using a two-step
optimization process combining gradient-free and gradient-based methods. We
conducted rigorous manual quality control on all datasets, correcting errors in
segmentation, ensuring accurate leaf ordering, and validating metadata
annotations. The dataset also includes metadata detailing plant morphology and
quality, alongside multi-resolution subsampled point cloud data (100k, 50k, 10k
points), which can be readily used for different downstream computational
tasks. MaizeField3D will serve as a comprehensive foundational dataset for
AI-driven phenotyping, plant structural analysis, and 3D applications in
agricultural research.
[COMMENTS]
Elvis Kimara and Mozhgan Hadadi contributed equally to this work
[LINK]
http://arxiv.org/abs/2503.07813v2
[DATE]
2025-06-25 04:04:30+08:00
[CATEGORIES]
cs.LG
Proofs as Explanations: Short Certificates for Reliable Predictions
[AUTHORS]
Avrim Blum, Steve Hanneke, Chirag Pabbaraju, Donya Saless
[ABSTRACT]
We consider a model for explainable AI in which an explanation for a
prediction $h(x)=y$ consists of a subset $S'$ of the training data (if it
exists) such that all classifiers $h' \in H$ that make at most $b$ mistakes on
$S'$ predict $h'(x)=y$. Such a set $S'$ serves as a proof that $x$ indeed has
label $y$ under the assumption that (1) the target function $h^\star$ belongs
to $H$, and (2) the set $S$ contains at most $b$ corrupted points. For example,
if $b=0$ and $H$ is the family of linear classifiers in $\mathbb{R}^d$, and if
$x$ lies inside the convex hull of the positive data points in $S$ (and hence
every consistent linear classifier labels $x$ as positive), then
Carathéodory’s theorem states that $x$ lies inside the convex hull of $d+1$
of those points. So, a set $S'$ of size $d+1$ could be released as an
explanation for a positive prediction, and would serve as a short proof of
correctness of the prediction under the assumption of realizability.
In this work, we consider this problem more generally, for general hypothesis
classes $H$ and general values $b\geq 0$. We define the notion of the robust
hollow star number of $H$ (which generalizes the standard hollow star number),
and show that it precisely characterizes the worst-case size of the smallest
certificate achievable, and analyze its size for natural classes. We also
consider worst-case distributional bounds on certificate size, as well as
distribution-dependent bounds that we show tightly control the sample size
needed to get a certificate for any given test example. In particular, we
define a notion of the certificate coefficient $\varepsilon_x$ of an example
$x$ with respect to a data distribution $D$ and target function $h^\star$, and
prove matching upper and lower bounds on sample size as a function of
$\varepsilon_x$, $b$, and the VC dimension $d$ of $H$.
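The linear-classifier example with $b=0$ can be made concrete in the plane ($d=2$), where a certificate is a triple of positive points whose triangle contains $x$. The sketch below is only an illustration of that example: the function names are hypothetical, the search is brute force, and it assumes the positive points are in general position (a triple with zero area is skipped).

```python
from itertools import combinations


def barycentric(p, a, b, c):
    """Barycentric coordinates of p w.r.t. triangle (a, b, c), or None if degenerate."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        return None
    l1 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    l2 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return l1, l2, 1.0 - l1 - l2


def find_certificate(x, positives, tol=1e-12):
    """Return d + 1 = 3 positive points whose convex hull contains x, or None."""
    for triple in combinations(positives, 3):
        weights = barycentric(x, *triple)
        if weights is not None and all(w >= -tol for w in weights):
            return triple
    return None


if __name__ == "__main__":
    positives = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
    print(find_certificate((1.0, 1.0), positives))  # a triple containing (1, 1)
    print(find_certificate((5.0, 5.0), positives))  # None: outside the hull
```

Anyone holding the returned triple can re-run `barycentric` to check the certificate without seeing the rest of the training data.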
[COMMENTS]
Fixed typo for robust hollow star number sb -> s_b, updated
bibliography, other minor changes
[LINK]
http://arxiv.org/abs/2504.08377v3
[DATE]
2025-06-25 03:55:51+08:00
[CATEGORIES]
cs.LG
FORTRESS: Frontier Risk Evaluation for National Security and Public Safety
[AUTHORS]
Christina Q. Knight, Kaustubh Deshpande, Ved Sirdeshmukh, Meher Mankikar, Scale Red Team, SEAL Research Team, Julian Michael
[ABSTRACT]
The rapid advancement of large language models (LLMs) introduces dual-use
capabilities that could both threaten and bolster national security and public
safety (NSPS). Models implement safeguards to protect against potential misuse
relevant to NSPS and allow for benign users to receive helpful information.
However, current benchmarks often fail to test safeguard robustness to
potential NSPS risks in an objective, robust way. We introduce FORTRESS: 500
expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary
questions for automated evaluation across 3 domains (unclassified information
only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE),
Political Violence & Terrorism, and Criminal & Financial Illicit Activities,
with 10 total subcategories across these domains. Each prompt-rubric pair has a
corresponding benign version to test for model over-refusals. This evaluation
of frontier LLMs’ safeguard robustness reveals varying trade-offs between
potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low
average risk score (ARS) (14.09 out of 100) but the highest over-refusal score
(ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but
a high average potential risk (66.29). Deepseek-R1 has the highest ARS at
78.05, but the lowest ORS at only 0.06. Models such as o1 display a more even
trade-off between potential risks and over-refusals (with an ARS of 21.69 and
ORS of 5.2). To provide policymakers and researchers with a clear understanding
of models’ potential risks, we publicly release FORTRESS at
https://huggingface.co/datasets/ScaleAI/fortress_public. We also maintain a
private set for evaluation.
[COMMENTS]
12 pages, 7 figures, submitted to NeurIPS
[LINK]
http://arxiv.org/abs/2506.14922v2
[DATE]
2025-06-25 03:55:23+08:00
[CATEGORIES]
cs.LG
MAIZX: A Carbon-Aware Framework for Optimizing Cloud Computing Emissions
[AUTHORS]
Federico Ruilova, Ernst Gunnar Gran, Sven-Arne Reinemo
[ABSTRACT]
Cloud computing drives innovation but also poses significant environmental
challenges due to its high energy consumption and carbon emissions. Data
centers account for 2-4% of global energy usage, and the ICT sector’s share of
electricity consumption is projected to reach 40% by 2040. As the goal of
achieving net-zero emissions by 2050 becomes increasingly urgent, there is a
growing need for more efficient and transparent solutions, particularly for
private cloud infrastructures, which are utilized by 87% of organizations,
despite the dominance of public-cloud systems.
This study evaluates the MAIZX framework, designed to optimize cloud
operations and reduce carbon footprint by dynamically ranking resources,
including data centers, edge computing nodes, and multi-cloud environments,
based on real-time and forecasted carbon intensity, Power Usage Effectiveness
(PUE), and energy consumption. Leveraging a flexible ranking algorithm, MAIZX
achieved an 85.68% reduction in CO2 emissions compared to baseline hypervisor
operations. Tested across geographically distributed data centers, the
framework demonstrates scalability and effectiveness, directly interfacing with
hypervisors to optimize workloads in private, hybrid, and multi-cloud
environments. MAIZX integrates real-time data on carbon intensity, power
consumption, and carbon footprint, as well as forecasted values, into cloud
management, providing a robust tool for enhancing climate performance potential
while maintaining operational efficiency.
[COMMENTS]
2 pages, 2 figures. LOCO 2024, December 3, 2024, Glasgow/Online
[LINK]
http://arxiv.org/abs/2506.19972v1
[DATE]
2025-06-25 03:40:09+08:00
[CATEGORIES]
cs.LG
COBRA-PPM: A Causal Bayesian Reasoning Architecture Using Probabilistic Programming for Robot Manipulation Under Uncertainty
[AUTHORS]
Ricardo Cannizzaro, Michael Groom, Jonathan Routley, Robert Osazuwa Ness, Lars Kunze
[ABSTRACT]
Manipulation tasks require robots to reason about cause and effect when
interacting with objects. Yet, many data-driven approaches lack causal
semantics and thus only consider correlations. We introduce COBRA-PPM, a novel
causal Bayesian reasoning architecture that combines causal Bayesian networks
and probabilistic programming to perform interventional inference for robot
manipulation under uncertainty. We demonstrate its capabilities through
high-fidelity Gazebo-based experiments on an exemplar block stacking task,
where it predicts manipulation outcomes with high accuracy (Pred Acc: 88.6%)
and performs greedy next-best action selection with a 94.2% task success rate.
We further demonstrate sim2real transfer on a domestic robot, showing
effectiveness in handling real-world uncertainty from sensor noise and
stochastic actions. Our generalised and extensible framework supports a wide
range of manipulation scenarios and lays a foundation for future work at the
intersection of robotics and causality.
[COMMENTS]
8 pages, 7 figures, accepted to the 2025 IEEE European Conference on
Mobile Robots (ECMR 2025)
[LINK]
http://arxiv.org/abs/2403.14488v3
[DATE]
2025-06-25 03:26:15+08:00
[CATEGORIES]
cs.LG
Fuzz-Testing Meets LLM-Based Agents: An Automated and Efficient Framework for Jailbreaking Text-To-Image Generation Models
[AUTHORS]
Yingkai Dong, Xiangtao Meng, Ning Yu, Zheng Li, Shanqing Guo
[LINK]
http://arxiv.org/abs/2408.00523v3
[DATE]
2025-06-25 02:55:29+08:00
[CATEGORIES]
cs.LG
Protein Structure Tokenization: Benchmarking and New Recipe
[AUTHORS]
Xinyu Yuan, Zichen Wang, Marcus Collins, Huzefa Rangwala
[ABSTRACT]
Recent years have witnessed a surge in the development of protein structural
tokenization methods, which chunk protein 3D structures into discrete or
continuous representations. Structure tokenization enables the direct
application of powerful techniques like language modeling for protein
structures, and large multimodal models to integrate structures with protein
sequences and functional texts. Despite the progress, the capabilities and
limitations of these methods remain poorly understood due to the lack of a
unified evaluation framework. We first introduce StructTokenBench, a framework
that comprehensively evaluates the quality and efficiency of structure
tokenizers, focusing on fine-grained local substructures rather than global
structures, as typical in existing benchmarks. Our evaluations reveal that no
single model dominates all benchmarking perspectives. Observations of codebook
under-utilization led us to develop AminoAseed, a simple yet effective strategy
that enhances codebook gradient updates and optimally balances codebook size
and dimension for improved tokenizer utilization and quality. Compared to the
leading model ESM3, our method achieves an average of 6.31% performance
improvement across 24 supervised tasks, with sensitivity and utilization rates
increased by 12.83% and 124.03%, respectively. Source code and model weights
are available at https://github.com/KatarinaYuan/StructTokenBench
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2503.00089v2
[DATE]
2025-06-25 02:54:25+08:00
[CATEGORIES]
cs.LG
Progressive Size-Adaptive Federated Learning: A Comprehensive Framework for Heterogeneous Multi-Modal Data Systems
[AUTHORS]
Sajid Hussain, Muhammad Sohail, Nauman Ali Khan, Naima Iltaf, Ihtesham ul Islam
[ABSTRACT]
Federated Learning (FL) has emerged as a transformative paradigm for
distributed machine learning while preserving data privacy. However, existing
approaches predominantly focus on model heterogeneity and aggregation
techniques, largely overlooking the fundamental impact of dataset size
characteristics on federated training dynamics. This paper introduces
Size-Based Adaptive Federated Learning (SAFL), a novel progressive training
framework that systematically organizes federated learning based on dataset
size characteristics across heterogeneous multi-modal data. Our comprehensive
experimental evaluation across 13 diverse datasets spanning 7 modalities
(vision, text, time series, audio, sensor, medical vision, and multimodal)
reveals critical insights: 1) an optimal dataset size range of 1000-1500
samples for federated learning effectiveness; 2) a clear modality performance
hierarchy with structured data (time series, sensor) significantly
outperforming unstructured data (text, multimodal); and 3) systematic
performance degradation for large datasets exceeding 2000 samples. SAFL
achieves an average accuracy of 87.68% across all datasets, with structured
data modalities reaching 99%+ accuracy. The framework demonstrates superior
communication efficiency, reducing total data transfer to 7.38 GB across 558
communications while maintaining high performance. Our real-time monitoring
framework provides unprecedented insights into system resource utilization,
network efficiency, and training dynamics. This work fills critical gaps in
understanding how data characteristics should drive federated learning
strategies, providing both theoretical insights and practical guidance for
real-world FL deployments in neural network and learning systems.
[LINK]
http://arxiv.org/abs/2506.20685v1
[DATE]
2025-06-25 02:50:33+08:00
[CATEGORIES]
cs.LG
SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models
[AUTHORS]
Shuchen Xue, Mingyang Yi, Weijian Luo, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, Zhi-Ming Ma
[ABSTRACT]
Diffusion Probabilistic Models (DPMs) have achieved considerable success in
generation tasks. Because sampling from DPMs is equivalent to solving a
diffusion SDE or ODE, which is time-consuming, numerous fast sampling methods
built upon improved differential equation solvers have been proposed. The majority of such
techniques consider solving the diffusion ODE due to its superior efficiency.
However, stochastic sampling could offer additional advantages in generating
diverse and high-quality data. In this work, we engage in a comprehensive
analysis of stochastic sampling from two aspects: variance-controlled diffusion
SDE and linear multi-step SDE solver. Based on our analysis, we propose
\textit{SA-Solver}, which is an improved efficient stochastic Adams method for
solving diffusion SDE to generate data with high quality. Our experiments show
that \textit{SA-Solver} achieves: 1) improved or comparable performance
compared with the existing state-of-the-art (SOTA) sampling methods for
few-step sampling; 2) SOTA FID on substantial benchmark datasets under a
suitable number of function evaluations (NFEs). Code is available at
https://github.com/scxue/SA-Solver.
[COMMENTS]
Accepted in NeurIPS 2023
[LINK]
http://arxiv.org/abs/2309.05019v3
[DATE]
2025-06-25 02:47:02+08:00
[CATEGORIES]
cs.LG
MILAAP: Mobile Link Allocation via Attention-based Prediction
[AUTHORS]
Yung-Fu Chen, Anish Arora
[ABSTRACT]
Channel hopping (CS) communication systems must adapt to interference changes
in the wireless network and to node mobility for maintaining throughput
efficiency. Optimal scheduling requires up-to-date network state information
(i.e., of channel occupancy) to select non-overlapping channels for links in
interference regions. However, state sharing among nodes introduces significant
communication overhead, especially as network size or node mobility scales,
thereby decreasing throughput efficiency of already capacity-limited networks.
In this paper, we eschew state sharing while adapting the CS schedule based on
a learning-based channel occupancy prediction. We propose the MiLAAP
attention-based prediction framework for machine learning models of spectral,
spatial, and temporal dependencies among network nodes. MiLAAP uses a
self-attention mechanism that lets each node capture the temporospectral CS
pattern in its interference region and accordingly predict the channel
occupancy state within that region. Notably, the prediction relies only on
locally and passively observed channel activities, and thus introduces no
communication overhead. To deal with node mobility, MiLAAP also uses a
multi-head self-attention mechanism that lets each node locally capture the
spatiotemporal dependencies on other network nodes that can interfere with it
and accordingly predict the motion trajectory of those nodes. Detecting nodes
that enter or move outside the interference region is used to further improve
the prediction accuracy of channel occupancy. We show that for dynamic networks
that use local CS sequences to support relatively long-lived traffic flows, the
channel state prediction accuracy of MiLAAP is close to 100% across
different node mobility patterns, and it achieves zero-shot generalizability
across different periods of CS sequences.
[LINK]
http://arxiv.org/abs/2506.19947v1
[DATE]
2025-06-25 02:45:51+08:00
[CATEGORIES]
cs.LG
Data-Driven Dynamic Factor Modeling via Manifold Learning
[AUTHORS]
Graeme Baker, Agostino Capponi, J. Antonio Sidaoui
[ABSTRACT]
We propose a data-driven dynamic factor framework where a response variable
depends on a high-dimensional set of covariates, without imposing any
parametric model on the joint dynamics. Leveraging Anisotropic Diffusion Maps,
a nonlinear manifold learning technique introduced by Singer and Coifman, our
framework uncovers the joint dynamics of the covariates and responses in a
purely data-driven way. We approximate the embedding dynamics using linear
diffusions, and exploit Kalman filtering to predict the evolution of the
covariates and response variables directly from the diffusion map embedding
space. We generalize Singer’s convergence rate analysis of the graph Laplacian
from the case of independent uniform samples on a compact manifold to the case
of time series arising from Langevin diffusions in Euclidean space.
Furthermore, we provide rigorous justification for our procedure by showing the
robustness of approximations of the diffusion map coordinates by linear
diffusions, and the convergence of ergodic averages under standard spectral
assumptions on the underlying dynamics. We apply our method to the stress
testing of equity portfolios using a combination of financial and macroeconomic
factors from the Federal Reserve’s supervisory scenarios. We demonstrate that
our data-driven stress testing method outperforms standard scenario analysis
and Principal Component Analysis benchmarks through historical backtests
spanning three major financial crises, achieving reductions in mean absolute
error of up to 55% and 39% for scenario-based portfolio return prediction,
respectively.
[LINK]
http://arxiv.org/abs/2506.19945v1
[DATE]
2025-06-25 02:40:40+08:00
[CATEGORIES]
cs.LG
Any-Order GPT as Masked Diffusion Model: Decoupling Formulation and Architecture
[AUTHORS]
Shuchen Xue, Tianyu Xie, Tianyang Hu, Zijin Feng, Jiacheng Sun, Kenji Kawaguchi, Zhenguo Li, Zhi-Ming Ma
[ABSTRACT]
Large language models (LLMs) predominantly use autoregressive (AR)
approaches, but masked diffusion models (MDMs) are emerging as viable
alternatives. A key challenge in comparing AR and MDM paradigms is their
typical architectural difference: AR models are often decoder-only, while MDMs
have largely been encoder-only. This practice of changing both the modeling
paradigm and architecture simultaneously makes direct comparisons unfair, as
it’s hard to distinguish whether observed differences stem from the paradigm
itself or the architectural shift. This research evaluates MDMs within a
decoder-only framework with two goals. (1) Equitably compare MDM (as Any-Order AR, or
AO-AR) and standard AR paradigms. Our investigation suggests that the standard
AO-AR objective, which averages over all token permutations, may benefit from
refinement, as many permutations appear less informative compared to the
language’s inherent left-to-right structure. (2) Investigate architectural
influences (decoder-only vs. encoder-only) within MDMs. We demonstrate that
while encoder-only MDMs model a simpler conditional probability space,
decoder-only MDMs can achieve dramatic generation speedups ($\sim25\times$) and
comparable perplexity with temperature annealing despite modeling a vastly
larger space, highlighting key trade-offs. This work thus decouples core
paradigm differences from architectural influences, offering insights for
future model design. Code is available at https://github.com/scxue/AO-GPT-MDM.
[LINK]
http://arxiv.org/abs/2506.19935v1
[DATE]
2025-06-25 02:22:25+08:00
[CATEGORIES]
cs.LG
A Comparative Analysis of Reinforcement Learning and Conventional Deep Learning Approaches for Bearing Fault Diagnosis
[AUTHORS]
Efe Çakır, Patrick Dumond
[ABSTRACT]
Bearing faults in rotating machinery can lead to significant operational
disruptions and maintenance costs. Modern methods for bearing fault diagnosis
rely heavily on vibration analysis and machine learning techniques, which often
require extensive labeled data and may not adapt well to dynamic environments.
This study explores the feasibility of reinforcement learning (RL),
specifically Deep Q-Networks (DQNs), for bearing fault classification tasks in
machine condition monitoring to enhance the accuracy and adaptability of
bearing fault diagnosis. The results demonstrate that while RL models developed
in this study can match the performance of traditional supervised learning
models under controlled conditions, they excel in adaptability when equipped
with optimized reward structures. However, their computational demands
highlight areas for further improvement. These findings demonstrate RL’s
potential to complement traditional methods, paving the way for adaptive
diagnostic frameworks.
[COMMENTS]
5 pages, 5 figures. To appear in the Proceedings of the Canadian
Society for Mechanical Engineering (CSME) Congress 2025
[LINK]
http://arxiv.org/abs/2506.19929v1
[DATE]
2025-06-25 02:06:57+08:00
[CATEGORIES]
cs.LG
Prover Agent: An Agent-based Framework for Formal Mathematical Proofs
[AUTHORS]
Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai
[ABSTRACT]
We present Prover Agent, a novel AI agent for automated theorem proving that
integrates large language models (LLMs) with a formal proof assistant, Lean.
Prover Agent coordinates an informal reasoning LLM, a formal prover model, and
feedback from Lean while also generating auxiliary lemmas to assist in
discovering the overall proof strategy. It achieves an 86.1% success rate on
the MiniF2F benchmark, establishing a new state-of-the-art among methods using
small language models (SLMs) with a much lower sample budget than previous
approaches. We also present case studies illustrating how these generated
lemmas contribute to solving challenging problems.
[COMMENTS]
22 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.19923v1
[DATE]
2025-06-25 02:01:52+08:00
[CATEGORIES]
cs.LG
Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation
[AUTHORS]
Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, Song Han
[ABSTRACT]
Recent advances in diffusion models have enabled high-quality video
generation, but the additional temporal dimension significantly increases
computational costs, making training and inference on long videos prohibitively
expensive. In this paper, we identify a phenomenon we term Spatiotemporal
Energy Decay in video diffusion models: post-softmax attention scores diminish
as spatial and temporal distance between tokens increases, akin to the physical
decay of signal or waves over space and time in nature. Motivated by this, we
propose Radial Attention, a scalable sparse attention mechanism with $O(n \log
n)$ complexity that translates energy decay into exponentially decaying compute
density, which is significantly more efficient than standard $O(n^2)$ dense
attention and more expressive than linear attention. Specifically, Radial
Attention employs a simple, static attention mask where each token attends to
spatially nearby tokens, with the attention window size shrinking with temporal
distance. Moreover, it allows pre-trained video diffusion models to extend
their generation length with efficient LoRA-based fine-tuning. Extensive
experiments show that Radial Attention maintains video quality across
Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9$\times$ speedup
over the original dense attention. With minimal tuning, it enables video
generation up to 4$\times$ longer while reducing training costs by up to
4.4$\times$ compared to direct fine-tuning and accelerating inference by up to
3.7$\times$ compared to dense attention inference.
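To illustrate the idea of a static mask whose spatial window shrinks with temporal distance, here is a toy construction. It is not the mask used in Radial Attention: the 1-D spatial layout, the rule that the window halves each time the frame distance doubles, and the function name are all assumptions made for this sketch.

```python
def radial_style_mask(num_frames, tokens_per_frame):
    """Boolean mask[i][j]: may token i attend to token j?

    Tokens are indexed frame by frame. Within a frame, attention is dense.
    Across frames, the spatial window halves whenever the temporal distance
    doubles (assumed schedule), but never drops below one position.
    """
    n = num_frames * tokens_per_frame
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        frame_i, pos_i = divmod(i, tokens_per_frame)
        for j in range(n):
            frame_j, pos_j = divmod(j, tokens_per_frame)
            dt = abs(frame_i - frame_j)
            # bit_length() is 0 for dt = 0, 1 for dt = 1, 2 for dt in {2, 3}, ...
            window = max(tokens_per_frame >> dt.bit_length(), 1)
            mask[i][j] = abs(pos_i - pos_j) < window
    return mask


if __name__ == "__main__":
    mask = radial_style_mask(num_frames=4, tokens_per_frame=8)
    kept = sum(sum(row) for row in mask)
    print(f"kept {kept} of {len(mask) ** 2} attention entries")
```

Because the mask is static, it can be built once and reused at every layer and denoising step; only the fraction of kept entries changes with the video length.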
[COMMENTS]
Code: https://github.com/mit-han-lab/radial-attention
[LINK]
http://arxiv.org/abs/2506.19852v1
[DATE]
2025-06-25 01:59:59+08:00
[CATEGORIES]
cs.LG
Convergence of Mean Shift Algorithms for Large Bandwidths and Simultaneous Accurate Clustering
[AUTHORS]
Susovan Pal, Praneeth Vepakomma
[ABSTRACT]
The mean shift (MS) is a non-parametric, density-based, iterative algorithm
that has prominent usage in clustering and image segmentation. A rigorous proof
for its convergence in full generality remains unknown. Two significant steps
in this direction were taken in \cite{Gh1}, which proved that for
\textit{sufficiently large bandwidth} the MS algorithm with the Gaussian
kernel always converges in any dimension, and by the same author in
\cite{Gh2}, which proved that MS always converges in one dimension for kernels
with differentiable, strictly decreasing, convex profiles. The more recent paper
\cite{YT} proved convergence in greater generality, \textit{without
any restriction on the bandwidth}, under the assumption that the KDE $f$ has a
continuous Lipschitz gradient on the closure of the convex hull of the
trajectory of the iterated sequence of the mode estimate, and also satisfies
the {\L}ojasiewicz property there.
The main theoretical result of this paper is a generalization of those of
\cite{Gh1}, where we show that (1) for \textit{sufficiently large bandwidth}
convergence is guaranteed in any dimension with \textit{any radially symmetric
and strictly positive definite kernel}. The proof uses two alternate
characterizations of radially symmetric positive definite smooth kernels by
Schoenberg and Bernstein \cite{Fass}, and borrows some steps from the proofs in
\cite{Gh1}. Although the authors acknowledge that the result in that paper is
more restrictive than that of \cite{YT} due to the lower bandwidth limit, it
uses a different set of assumptions than \cite{YT}, and the proof technique is
different.
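For readers unfamiliar with the iteration whose convergence is being studied, the following is a minimal sketch of the mean shift update with a Gaussian kernel (the classical case; the abstract's result covers a wider kernel class). The function name, tolerance, and iteration cap are assumptions of this sketch.

```python
import math


def mean_shift_mode(start, data, bandwidth, tol=1e-12, max_iter=10_000):
    """Iterate x <- sum_i w_i(x) * p_i / sum_i w_i(x) with Gaussian weights.

    start: initial point (sequence of floats); data: list of points.
    Stops when the update moves x by less than tol.
    """
    x = list(start)
    for _ in range(max_iter):
        weights = [
            math.exp(-sum((xk - pk) ** 2 for xk, pk in zip(x, p))
                     / (2.0 * bandwidth ** 2))
            for p in data
        ]
        total = sum(weights)  # can underflow to 0 if x is far from all data
        new_x = [
            sum(w * p[k] for w, p in zip(weights, data)) / total
            for k in range(len(x))
        ]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x


if __name__ == "__main__":
    data = [(-1.0,), (1.0,)]
    # Large bandwidth: the density estimate is unimodal, iterates go to 0.
    print(mean_shift_mode((0.5,), data, bandwidth=10.0))
    # Small bandwidth: the iterate is attracted to the nearby mode close to 1.
    print(mean_shift_mode((0.9,), data, bandwidth=0.3))
```

For this two-point data set the update reduces to $x \mapsto \tanh(x/h^2)$ with bandwidth $h$, which makes the contrast between the large- and small-bandwidth regimes easy to verify by hand.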
[LINK]
http://arxiv.org/abs/2506.19837v1
[DATE]
2025-06-25 01:53:29+08:00
[CATEGORIES]
cs.LG
Machine Learning with Privacy for Protected Attributes
[AUTHORS]
Saeed Mahloujifar, Chuan Guo, G. Edward Suh, Kamalika Chaudhuri
[ABSTRACT]
Differential privacy (DP) has become the standard for private data analysis.
Certain machine learning applications only require privacy protection for
specific protected attributes. Using naive variants of differential privacy in
such use cases can result in unnecessary degradation of utility. In this work,
we refine the definition of DP to create a more general and flexible framework
that we call feature differential privacy (FDP). Our definition is
simulation-based and allows for both addition/removal and replacement variants
of privacy, and can handle arbitrary and adaptive separation of protected and
non-protected features. We prove the properties of FDP, such as adaptive
composition, and demonstrate its implications for limiting attribute inference
attacks. We also propose a modification of the standard DP-SGD algorithm that
satisfies FDP while leveraging desirable properties such as amplification via
sub-sampling. We apply our framework to various machine learning tasks and show
that it can significantly improve the utility of DP-trained models when public
features are available. For example, we train diffusion models on the AFHQ
dataset of animal faces and observe a drastic improvement in FID compared to
DP, from 286.7 to 101.9 at $\epsilon=8$, assuming that the blurred version of a
training image is available as a public feature. Overall, our work provides a
new approach to private data analysis that can help reduce the utility cost of
DP while still providing strong privacy guarantees.
[LINK]
http://arxiv.org/abs/2506.19836v1
[DATE]
2025-06-25 01:53:28+08:00
[CATEGORIES]
cs.LG
Inferring Higher-Order Couplings with Neural Networks
[AUTHORS]
Aurélien Decelle, Alfonso de Jesús Navas Gómez, Beatriz Seoane
[ABSTRACT]
Maximum entropy methods, rooted in the inverse Ising/Potts problem from
statistical physics, are widely used to model pairwise interactions in complex
systems across disciplines such as bioinformatics and neuroscience. While
successful, these approaches often fail to capture higher-order interactions
that are critical for understanding collective behavior. In contrast, modern
machine learning methods can model such interactions, but their
interpretability often comes at a prohibitive computational cost. Restricted
Boltzmann Machines (RBMs) provide a computationally efficient alternative by
encoding statistical correlations through hidden units in a bipartite
architecture. In this work, we introduce a method that maps RBMs onto
generalized Potts models, enabling the systematic extraction of interactions up
to arbitrary order. Leveraging large-$N$ approximations – made tractable by
the RBM’s structure – we extract effective many-body couplings with minimal
computational effort. We further propose a robust framework for recovering
higher-order interactions in more complex generative models, and introduce a
simple gauge-fixing scheme for the effective Potts representation. Validation
on synthetic data demonstrates accurate recovery of two- and three-body
interactions. Applied to protein sequence data, our method reconstructs contact
maps with high fidelity and outperforms state-of-the-art inverse Potts models.
These results establish RBMs as a powerful and efficient tool for modeling
higher-order structure in high-dimensional categorical data.
[COMMENTS]
24 Pages and 9 Figures
[LINK]
http://arxiv.org/abs/2501.06108v3
[DATE]
2025-06-25 01:51:24+08:00
[CATEGORIES]
cs.LG
A standard transformer and attention with linear biases for molecular conformer generation
[AUTHORS]
Viatcheslav Gurev, Timothy Rumbell
[ABSTRACT]
Sampling low-energy molecular conformations, spatial arrangements of atoms in
a molecule, is a critical task for many different calculations performed in the
drug discovery and optimization process. Numerous specialized equivariant
networks have been designed to generate molecular conformations from 2D
molecular graphs. Recently, non-equivariant transformer models have emerged as
a viable alternative due to their capability to scale to improve
generalization. However, the concern has been that non-equivariant models
require a large model size to compensate for the lack of equivariant bias. In this
paper, we demonstrate that a well-chosen positional encoding effectively
addresses these size limitations. A standard transformer model incorporating
relative positional encoding for molecular graphs when scaled to 25 million
parameters surpasses the current state-of-the-art non-equivariant base model
with 64 million parameters on the GEOM-DRUGS benchmark. We implemented relative
positional encoding as a negative attention bias that linearly increases with
the shortest path distances between graph nodes at varying slopes for different
attention heads, similar to ALiBi, a widely adopted relative positional
encoding technique in the NLP domain. This architecture has the potential to
serve as a foundation for a novel class of generative models for molecular
conformations.
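The attention bias described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, distances are plain bond-hop counts computed by breadth-first search, and the per-head slopes follow the geometric schedule from the original ALiBi work, which the paper may or may not use.

```python
from collections import deque


def shortest_path_lengths(num_nodes, edges):
    """All-pairs shortest-path hop counts on an undirected graph, via BFS."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[float("inf")] * num_nodes for _ in range(num_nodes)]
    for src in range(num_nodes):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] == float("inf"):
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... (assumed, as in ALiBi)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def graph_attention_bias(num_nodes, edges, num_heads):
    """bias[h][i][j] = -slope_h * d(i, j), added to attention logits."""
    dist = shortest_path_lengths(num_nodes, edges)
    return [[[-slope * d for d in row] for row in dist]
            for slope in alibi_slopes(num_heads)]
```

For a three-atom chain with bonds (0, 1) and (1, 2) and 8 heads, the first head penalises the pair (0, 2) by 0.5 * 2 = 1.0, while the last head applies a much gentler slope of 2^-8, so different heads attend over different graph neighbourhoods.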
[COMMENTS]
Revision of paper at OpenReview:
https://openreview.net/forum?id=BjjerMYL3F
[LINK]
http://arxiv.org/abs/2506.19834v1
[DATE]
2025-06-25 01:50:49+08:00
[CATEGORIES]
cs.LG
Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential
[AUTHORS]
Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou
[ABSTRACT]
The architecture of a neural network and the selection of its activation
function are both fundamental to its performance. Equally vital is ensuring
these two elements are well-matched, as their alignment is key to achieving
effective representation and learning. In this paper, we introduce the Fourier
Multi-Component and Multi-Layer Neural Network (FMMNN), a novel model that
creates a strong synergy between them. We demonstrate that FMMNNs are highly
effective and flexible in modeling high-frequency components. Our theoretical
results demonstrate that FMMNNs have exponential expressive power for function
approximation. We also analyze the optimization landscape of FMMNNs and find it
to be much more favorable than that of standard fully connected neural
networks, especially when dealing with high-frequency features. In addition, we
propose a scaled random initialization method for the first layer’s weights in
FMMNNs, which significantly speeds up training and enhances overall
performance. Extensive numerical experiments support our theoretical insights,
showing that FMMNNs consistently outperform traditional approaches in accuracy
and efficiency across various tasks.
[COMMENTS]
Our code and implementation details are available at
https://github.com/ShijunZhangMath/FMMNN
[LINK]
http://arxiv.org/abs/2502.18959v2
[DATE]
2025-06-25 01:50:17+08:00
[CATEGORIES]
cs.LG
Persona Features Control Emergent Misalignment
[AUTHORS]
Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Johannes Heidecke, Tejal Patwardhan, Dan Mossing
[ABSTRACT]
Understanding how language models generalize behaviors from their training to
a broader deployment distribution is an important problem in AI safety. Betley
et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes
“emergent misalignment,” where models give stereotypically malicious responses
to unrelated prompts. We extend this work, demonstrating emergent misalignment
across diverse conditions, including reinforcement learning on reasoning
models, fine-tuning on various synthetic datasets, and in models without safety
training. To investigate the mechanisms behind this generalized misalignment,
we apply a “model diffing” approach using sparse autoencoders to compare
internal model representations before and after fine-tuning. This approach
reveals several “misaligned persona” features in activation space, including a
toxic persona feature which most strongly controls emergent misalignment and
can be used to predict whether a model will exhibit such behavior.
Additionally, we investigate mitigation strategies, discovering that
fine-tuning an emergently misaligned model on just a few hundred benign samples
efficiently restores alignment.
[LINK]
http://arxiv.org/abs/2506.19823v1
[DATE]
2025-06-25 01:38:21+08:00
[CATEGORIES]
cs.LG
ProxelGen: Generating Proteins as 3D Densities
[AUTHORS]
Felix Faltings, Hannes Stark, Regina Barzilay, Tommi Jaakkola
[ABSTRACT]
We develop ProxelGen, a protein structure generative model that operates on
3D densities as opposed to the prevailing 3D point cloud representations.
Representing proteins as voxelized densities, or proxels, enables new tasks and
conditioning capabilities. We generate proteins encoded as proxels via a 3D
CNN-based VAE in conjunction with a diffusion model operating on its latent
space. Compared to state-of-the-art models, ProxelGen’s samples achieve higher
novelty, better FID scores, and the same level of designability as the training
set. ProxelGen’s advantages are demonstrated in a standard motif scaffolding
benchmark, and we show how 3D density-based generation allows for more flexible
shape conditioning.
[LINK]
http://arxiv.org/abs/2506.19820v1
[DATE]
2025-06-25 01:35:55+08:00
[CATEGORIES]
cs.LG
Model-Based Exploration in Monitored Markov Decision Processes
[AUTHORS]
Alireza Kazemipour, Simone Parisi, Matthew E. Taylor, Michael Bowling
[ABSTRACT]
A tenet of reinforcement learning is that the agent always observes rewards.
However, this is not true in many realistic settings, e.g., a human observer
may not always be available to provide rewards, sensors may be limited or
malfunctioning, or rewards may be inaccessible during deployment. Monitored
Markov decision processes (Mon-MDPs) have recently been proposed to model such
settings. However, existing Mon-MDP algorithms have several limitations: they
do not fully exploit the problem structure, cannot leverage a known monitor,
lack worst-case guarantees for ‘unsolvable’ Mon-MDPs without specific
initialization, and offer only asymptotic convergence proofs. This paper makes
three contributions. First, we introduce a model-based algorithm for Mon-MDPs
that addresses these shortcomings. The algorithm employs two instances of
model-based interval estimation: one to ensure that observable rewards are
reliably captured, and another to learn the minimax-optimal policy. Second, we
empirically demonstrate the advantages. We show faster convergence than prior
algorithms in over four dozen benchmarks, and even more dramatic improvement
when the monitoring process is known. Third, we present the first finite-sample
bound on performance. We show convergence to a minimax-optimal policy even when
some rewards are never observable.
[LINK]
http://arxiv.org/abs/2502.16772v5
[DATE]
2025-06-25 01:32:18+08:00
[CATEGORIES]
cs.LG
Curating art exhibitions using machine learning
[AUTHORS]
Eurico Covas
[ABSTRACT]
Art curatorship has always been mostly the subjective work of human experts,
who, with extensive knowledge of many and diverse artworks, select a few of
those to present in communal spaces, spaces that evolved into what we now call
art galleries. There is no hard and fast set of rules on how to select these
artworks, given a theme which either is presented to the art curator or
constructed by her/him. Here we present a series of artificial models – a
total of four related models – based on machine learning techniques (a subset
of artificial intelligence) that attempt to learn from existing exhibitions
which have been curated by human experts, in order to be able to do similar
curatorship work. We focus exclusively on the last 25 years of past exhibitions
at the Metropolitan Museum of Art in New York, due to the quality of the data
available and the physical and time limitations of our research. Our four
artificial intelligence models achieve a reasonable ability at imitating these
various curators responsible for all those exhibitions, with various degrees of
precision and curatorial coherence. In particular, we can conclude two key
insights: first, that there is sufficient information in these exhibitions to
construct an artificial intelligence model that replicates past exhibitions
with an accuracy well above random choices; second, that using feature
engineering and carefully designing the architecture of modest size models can
make them as good as those using the so-called large language models such as
GPT in a brute force approach. We also believe, based on small attempts to use
the models in out-of-sample experiments, that given much more data, it
should be possible for these kinds of artificial intelligence agents to be
closer and closer to the aesthetic and curatorial judgment of human art
curators.
[LINK]
http://arxiv.org/abs/2506.19813v1
[DATE]
2025-06-25 01:25:03+08:00
[CATEGORIES]
cs.LG
Ambiguous Online Learning
[AUTHORS]
Vanessa Kosoy
[ABSTRACT]
We propose a new variant of online learning that we call “ambiguous online
learning”. In this setting, the learner is allowed to produce multiple
predicted labels. Such an “ambiguous prediction” is considered correct when at
least one of the labels is correct, and none of the labels are “predictably
wrong”. The definition of “predictably wrong” comes from a hypothesis class in
which hypotheses are also multi-valued. Thus, a prediction is “predictably
wrong” if it’s not allowed by the (unknown) true hypothesis. In particular,
this setting is natural in the context of multivalued dynamical systems,
recommendation algorithms and lossless compression. It is also strongly related
to so-called “apple tasting”. We show that in this setting, there is a
trichotomy of mistake bounds: up to logarithmic factors, any hypothesis class
has an optimal mistake bound of either $\Theta(1)$, $\Theta(\sqrt{N})$ or $N$.
[LINK]
http://arxiv.org/abs/2506.19810v1
[DATE]
2025-06-25 01:22:45+08:00
[CATEGORIES]
cs.LG
First-Passage Approach to Optimizing Perturbations for Improved Training of Machine Learning Models
[AUTHORS]
Sagi Meir, Tommer D. Keidar, Shlomi Reuveni, Barak Hirshberg
[ABSTRACT]
Machine learning models have become indispensable tools in applications
across the physical sciences. Their training is often time-consuming, vastly
exceeding the inference timescales. Several protocols have been developed to
perturb the learning process and improve the training, such as shrink and
perturb, warm restarts, and stochastic resetting. For classifiers, these
perturbations have been shown to result in enhanced speedups or improved
generalization. However, the design of such perturbations is usually done ad
hoc by intuition and trial and error. To rationally optimize training
protocols, we frame them as first-passage processes and consider their response
to perturbations. We show that if the unperturbed learning process reaches a
quasi-steady state, the response at a single perturbation frequency can predict
the behavior at a wide range of frequencies. We employ this approach to a
CIFAR-10 classifier using the ResNet-18 model and identify a useful
perturbation and frequency among several possibilities. We demonstrate the
transferability of the approach to other datasets, architectures, optimizers
and even tasks (regression instead of classification). Our work allows
optimization of perturbations for improving the training of machine learning
models using a first-passage approach.
[LINK]
http://arxiv.org/abs/2502.04121v3
[DATE]
2025-06-25 01:16:47+08:00
[CATEGORIES]
cs.LG
Convolution-weighting method for the physics-informed neural network: A Primal-Dual Optimization Perspective
[AUTHORS]
Chenhao Si, Ming Yan
[ABSTRACT]
Physics-informed neural networks (PINNs) are extensively employed to solve
partial differential equations (PDEs) by ensuring that the outputs and
gradients of deep learning models adhere to the governing equations. However,
constrained by computational limitations, PINNs are typically optimized using a
finite set of points, which poses significant challenges in guaranteeing their
convergence and accuracy. In this study, we propose a new weighting scheme
that adaptively changes the weights of the loss functions from isolated
points to their continuous neighborhood regions. The empirical results show
that our weighting scheme can reduce the relative $L^2$ errors to a lower
value.
[COMMENTS]
18 pages, 12 figures
[LINK]
http://arxiv.org/abs/2506.19805v1
[DATE]
2025-06-25 01:13:51+08:00
[CATEGORIES]
cs.LG
Multiscale Training of Convolutional Neural Networks
[AUTHORS]
Shadab Ahamed, Niloufar Zakariaei, Eldad Haber, Moshe Eliasof
[ABSTRACT]
Training convolutional neural networks (CNNs) on high-resolution images is
often bottlenecked by the cost of evaluating gradients of the loss on the
finest spatial mesh. To address this, we propose Multiscale Gradient Estimation
(MGE), a Multilevel Monte Carlo-inspired estimator that expresses the expected
gradient on the finest mesh as a telescopic sum of gradients computed on
progressively coarser meshes. By assigning larger batches to the cheaper coarse
levels, MGE achieves the same variance as single-scale stochastic gradient
estimation while reducing the number of fine mesh convolutions by a factor of 4
with each downsampling. We further embed MGE within a Full-Multiscale training
algorithm that solves the learning problem on coarse meshes first and
“hot-starts” the next finer level, cutting the required fine mesh iterations by
an additional order of magnitude. Extensive experiments on image denoising,
deblurring, inpainting and super-resolution tasks using UNet, ResNet and ESPCN
backbones confirm the practical benefits: Full-Multiscale reduces the
computation costs by 4-16$\times$ with no significant loss in performance.
Together, MGE and Full-Multiscale offer a principled, architecture-agnostic
route to accelerate CNN training on high-resolution data without sacrificing
accuracy, and they can be combined with other variance-reduction or
learning-rate schedules to further enhance scalability.
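The telescopic sum mentioned above can be written out explicitly. The notation here is an assumption made for illustration (it is the generic Multilevel Monte Carlo form, not necessarily the paper's): $g_\ell(\theta)$ denotes the stochastic gradient of the loss evaluated on mesh level $\ell$, with $\ell = 1$ the finest mesh and $\ell = L$ the coarsest.

```latex
\mathbb{E}\big[g_{1}(\theta)\big]
  = \mathbb{E}\big[g_{L}(\theta)\big]
  + \sum_{\ell=1}^{L-1} \mathbb{E}\big[g_{\ell}(\theta) - g_{\ell+1}(\theta)\big]
```

Each expectation on the right-hand side can be estimated with its own batch. The cheap coarse term $\mathbb{E}[g_L]$ receives the largest batch, while the correction terms $g_\ell - g_{\ell+1}$ are differences between adjacent levels and, when those levels agree closely, have small variance and need only a few fine-mesh samples.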
[COMMENTS]
23 pages, 10 figures, 8 tables
[LINK]
http://arxiv.org/abs/2501.12739v3
[DATE]
2025-06-25 01:04:58+08:00
[CATEGORIES]
cs.LG
FDA-Opt: Communication-Efficient Federated Fine-Tuning of Language Models
[AUTHORS]
Michail Theologitis, Vasilis Samoladas, Antonios Deligiannakis
[ABSTRACT]
Federated Learning (FL) enables the utilization of vast, previously
inaccessible data sources. At the same time, pre-trained Language Models (LMs)
have taken the world by storm and for good reason. They exhibit remarkable
emergent abilities and are readily adapted to downstream tasks. This opens one
of the most exciting frontiers in FL: fine-tuning LMs. Yet, a persistent
challenge in FL is the frequent, rigid communication of parameters – a problem
magnified by the sheer size of these contemporary models. The FedOpt family of
algorithms has become the go-to approach for FL, relying on fixed but arbitrary
intervals for model exchanges. Recently, the FDA algorithm prescribed a dynamic
approach by monitoring the training progress. However, it introduced a
hard-to-calibrate parameter and imposed a rigid synchronization scheme. In this
work, we address these limitations by proposing the FDA-Opt family of
algorithms – a unified generalization of both FDA and FedOpt. Our experimental
evaluation focuses on fine-tuning LMs on downstream NLP tasks and demonstrates
that FDA-Opt outperforms FedOpt even when it is configured with
hyper-parameters specifically optimized for the latter. In other words, we show
that FDA-Opt is a practical, drop-in replacement for FedOpt in modern FL
libraries and systems: it requires no additional configuration and delivers
superior performance out of the box.
[LINK]
http://arxiv.org/abs/2505.04535v2
[DATE]
2025-06-25 00:20:46+08:00
[CATEGORIES]
cs.LG
The Shape of Consumer Behavior: A Symbolic and Topological Analysis of Time Series
[AUTHORS]
Pola Bereta, Ioannis Diamantis
[ABSTRACT]
Understanding temporal patterns in online search behavior is crucial for
real-time marketing and trend forecasting. Google Trends offers a rich proxy
for public interest, yet the high dimensionality and noise of its time-series
data present challenges for effective clustering. This study evaluates three
unsupervised clustering approaches, Symbolic Aggregate approXimation (SAX),
enhanced SAX (eSAX), and Topological Data Analysis (TDA), applied to 20 Google
Trends keywords representing major consumer categories. Our results show that
while SAX and eSAX offer fast and interpretable clustering for stable time
series, they struggle with volatility and complexity, often producing ambiguous
“catch-all” clusters. TDA, by contrast, captures global structural features
through persistent homology and achieves more balanced and meaningful
groupings.
We conclude with practical guidance for using symbolic and topological
methods in consumer analytics and suggest that hybrid approaches combining both
perspectives hold strong potential for future applications.
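For readers unfamiliar with SAX, the standard procedure (z-normalise, average over equal-length segments, then discretise with Gaussian breakpoints) can be sketched as follows. This is a minimal illustration of textbook SAX, not the study's pipeline; the function name `sax_word`, the four-letter alphabet and the divisibility assumption are choices made here.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev


def sax_word(series, num_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length num_segments."""
    # Step 1: z-normalise so that Gaussian breakpoints are meaningful.
    mean, std = fmean(series), pstdev(series)
    z = [(x - mean) / std if std > 0 else 0.0 for x in series]

    # Step 2: Piecewise Aggregate Approximation (mean of equal-length chunks).
    # Assumes len(series) is divisible by num_segments, for simplicity.
    seg_len = len(z) // num_segments
    paa = [fmean(z[i * seg_len:(i + 1) * seg_len]) for i in range(num_segments)]

    # Step 3: breakpoints cutting the standard normal into equiprobable bins.
    size = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / size) for k in range(1, size)]

    # Step 4: map each segment mean to the letter of the bin it falls in.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)
```

A steadily rising series such as `[1, 2, 3, 4, 5, 6, 7, 8]` with four segments maps to the word `abcd`. Two series can then be clustered by comparing their words, which is what makes the method fast and interpretable for stable series.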
[COMMENTS]
33 pages, 30 figures
[LINK]
http://arxiv.org/abs/2506.19759v1
[DATE]
2025-06-25 00:20:33+08:00
[CATEGORIES]
cs.LG
Cross-regularization: Adaptive Model Complexity through Validation Gradients
[AUTHORS]
Carlos Stein Brito
[ABSTRACT]
Model regularization requires extensive manual tuning to balance complexity
against overfitting. Cross-regularization resolves this tradeoff by directly
adapting regularization parameters through validation gradients during
training. The method splits parameter optimization - training data guides
feature learning while validation data shapes complexity controls - converging
provably to cross-validation optima. When implemented through noise injection
in neural networks, this approach reveals striking patterns: unexpectedly high
noise tolerance and architecture-specific regularization that emerges
organically during training. Beyond complexity control, the framework
integrates seamlessly with data augmentation, uncertainty calibration and
growing datasets while maintaining single-run efficiency through a simple
gradient-based approach.
[COMMENTS]
21 pages, 13 figures. Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2506.19755v1
[DATE]
2025-06-25 00:15:50+08:00
[CATEGORIES]
cs.LG
A Robust Twin Parametric Margin Support Vector Machine for Multiclass Classification
[AUTHORS]
Renato De Leone, Francesca Maggioni, Andrea Spinelli
[ABSTRACT]
In this paper, we introduce novel Twin Parametric Margin Support Vector
Machine (TPMSVM) models designed to address multiclass classification tasks
under feature uncertainty. To handle data perturbations, we construct
a bounded-by-norm uncertainty set around each training observation and derive the
robust counterparts of the deterministic models using robust optimization
techniques. To capture complex data structure, we explore both linear and
kernel-induced classifiers, providing computationally tractable reformulations
of the resulting robust models. Additionally, we propose two alternatives for
the final decision function, enhancing models’ flexibility. Finally, we
validate the effectiveness of the proposed robust multiclass TPMSVM methodology
on real-world datasets, showing the good performance of the approach in the
presence of uncertainty.
[LINK]
http://arxiv.org/abs/2306.06213v3
[DATE]
2025-06-25 00:07:13+08:00
[CATEGORIES]
cs.LG
On the necessity of adaptive regularisation: Optimal anytime online learning on $\boldsymbol{\ell_p}$-balls
[AUTHORS]
Emmeran Johnson, David Martínez-Rubio, Ciara Pike-Burke, Patrick Rebeschini
[ABSTRACT]
We study online convex optimization on $\ell_p$-balls in $\mathbb{R}^d$ for
$p > 2$. While always sub-linear, the optimal regret exhibits a shift between
the high-dimensional setting ($d > T$), when the dimension $d$ is greater than
the time horizon $T$, and the low-dimensional setting ($d \leq T$). We show that
Follow-the-Regularised-Leader (FTRL) with time-varying regularisation which is
adaptive to the dimension regime is anytime optimal for all dimension regimes.
Motivated by this, we ask whether it is possible to obtain anytime optimality
of FTRL with fixed non-adaptive regularisation. Our main result establishes
that for separable regularisers, adaptivity in the regulariser is necessary,
and that any fixed regulariser will be sub-optimal in one of the two dimension
regimes. Finally, we provide lower bounds which rule out sub-linear regret
bounds for the linear bandit problem in sufficiently high-dimension for all
$\ell_p$-balls with $p \geq 1$.
[LINK]
http://arxiv.org/abs/2506.19752v1
[DATE]
2025-06-25 00:06:56+08:00
[CATEGORIES]
cs.LG
Continuous Bayesian Model Selection for Multivariate Causal Discovery
[AUTHORS]
Anish Dhir, Ruby Sedgwick, Avinash Kori, Ben Glocker, Mark van der Wilk
[ABSTRACT]
Current causal discovery approaches require restrictive model assumptions in
the absence of interventional data to ensure structure identifiability. These
assumptions often do not hold in real-world applications leading to a loss of
guarantees and poor performance in practice. Recent work has shown that, in the
bivariate case, Bayesian model selection can greatly improve performance by
exchanging restrictive modelling for more flexible assumptions, at the cost of
a small probability of making an error. Our work shows that this approach is
useful in the important multivariate case as well. We propose a scalable
algorithm leveraging a continuous relaxation of the discrete model selection
problem. Specifically, we employ the Causal Gaussian Process Conditional
Density Estimator (CGP-CDE) as a Bayesian non-parametric model, using its
hyperparameters to construct an adjacency matrix. This matrix is then optimised
using the marginal likelihood and an acyclicity regulariser, giving the maximum
a posteriori causal graph. We demonstrate the competitiveness of our approach,
showing it is advantageous to perform multivariate causal discovery without
infeasible assumptions using Bayesian model selection.
[LINK]
http://arxiv.org/abs/2411.10154v2
[DATE]
2025-06-25 00:05:27+08:00
[CATEGORIES]
cs.LG
DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization
[AUTHORS]
Yeonhong Park, Jake Hyun, Hojoon Kim, Jae W. Lee
[ABSTRACT]
Quantization of Large Language Models (LLMs) has recently gained popularity,
particularly for on-device settings with limited hardware resources. While
efficient, quantization inevitably degrades model quality, especially in
aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper,
we propose DecDEC, an inference scheme that improves the quality of low-bit
LLMs while preserving the key benefits of quantization: GPU memory savings and
latency reduction. DecDEC stores the residual matrix – the difference between
full-precision and quantized weights – in CPU, and dynamically fetches the
residuals for only a small portion of the weights. This portion corresponds to
the salient channels, marked by activation outliers, with the fetched residuals
helping to correct quantization errors in these channels. Salient channels are
identified dynamically at each decoding step by analyzing the input activations
– this enables adaptation to the dynamic nature of activation distribution,
thus maximizing the effectiveness of error compensation. We demonstrate the
effectiveness of DecDEC by augmenting state-of-the-art quantization methods.
For example, DecDEC reduces the perplexity of a 3-bit Llama-3-8B-Instruct model
from 10.15 to 9.12 – outperforming its 3.5-bit counterpart – while adding
less than 0.0003\% to GPU memory usage and incurring only a 1.7\% inference
slowdown on NVIDIA RTX 4050 Mobile.
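The core idea of residual-based error compensation can be illustrated with a toy linear layer. This is a sketch under stated assumptions, not DecDEC itself: salient channels are picked here as the top-k input activations by magnitude, the residual is kept at full precision in ordinary Python lists rather than fetched from CPU memory, and the function name is hypothetical.

```python
def compensated_linear(x, w_quant, residual, k):
    """Compute x @ w_quant, then correct only the k most salient input channels.

    x:        list of input activations (one per input channel)
    w_quant:  quantized weights, w_quant[i][j] maps input i to output j
    residual: full-precision weights minus w_quant, same shape
    k:        number of channels whose residual rows are "fetched"
    """
    n_in, n_out = len(x), len(w_quant[0])

    # Base result from the quantized weights alone.
    out = [sum(x[i] * w_quant[i][j] for i in range(n_in)) for j in range(n_out)]

    # Salient channels: largest |activation| (an assumed selection rule).
    salient = sorted(range(n_in), key=lambda i: abs(x[i]), reverse=True)[:k]

    # Add back the quantization error for those channels only.
    for i in salient:
        for j in range(n_out):
            out[j] += x[i] * residual[i][j]
    return out
```

With `k` equal to the number of input channels the result matches the full-precision layer exactly. With a small `k`, only a few residual rows are touched per step, which is what keeps the extra memory traffic low.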
[COMMENTS]
OSDI 2025
[LINK]
http://arxiv.org/abs/2412.20185v2
[DATE]
2025-06-25 00:03:30+08:00
[CATEGORIES]
cs.LG
jina-embeddings-v4: Universal Embeddings for Multimodal Multilingual Retrieval
[AUTHORS]
Michael Günther, Saba Sturua, Mohammad Kalim Akram, Isabelle Mohr, Andrei Ungureanu, Bo Wang, Sedigheh Eslami, Scott Martens, Maximilian Werk, Nan Wang, Han Xiao
[ABSTRACT]
We introduce jina-embeddings-v4, a 3.8 billion parameter multimodal embedding
model that unifies text and image representations through a novel architecture
supporting both single-vector and multi-vector embeddings in the late
interaction style. The model incorporates task-specific Low-Rank Adaptation
(LoRA) adapters to optimize performance across diverse retrieval scenarios,
including query-document retrieval, semantic text similarity, and code search.
Comprehensive evaluations demonstrate that jina-embeddings-v4 achieves
state-of-the-art performance on both single-modal and cross-modal retrieval
tasks, with particular strength in processing visually rich content such as
tables, charts, diagrams, and mixed-media formats. To facilitate evaluation of
this capability, we also introduce Jina-VDR, a novel benchmark specifically
designed for visually rich image retrieval.
[COMMENTS]
22 pages, 1-10 main, 14-22 experimental results, benchmark tables
[LINK]
http://arxiv.org/abs/2506.18902v2
[DATE]
2025-06-24 23:52:37+08:00
[CATEGORIES]
cs.CL
Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving
[AUTHORS]
Sara Rajaee, Kumar Pratik, Gabriele Cesa, Arash Behboodi
[ABSTRACT]
The most promising recent methods for AI reasoning require applying variants
of reinforcement learning (RL) either on rolled-out trajectories from the LLMs,
even for the step-wise rewards, or on large quantities of human-annotated
trajectory data. The reliance on the rolled-out trajectory renders the compute
cost and time prohibitively high. In particular, the correctness of a reasoning
trajectory can typically only be judged at its completion, leading to sparse
rewards in RL or requiring expensive synthetic data generation in expert
iteration-like methods. In this work, we focus on the Automatic Theorem Proving
(ATP) task and propose a novel verifier-in-the-loop design, which, unlike
existing approaches that leverage feedback on the entire reasoning trajectory,
employs an automated verifier to give intermediate feedback at each step of the
reasoning process. Using Lean as the verifier, we empirically show that the
step-by-step local verification produces a global improvement in the model’s
reasoning accuracy and efficiency.
[COMMENTS]
Accepted at the Findings of ACL 2025, Accepted at ICLR 2025 Workshop
on Reasoning and Planning for Large Language Models
[LINK]
http://arxiv.org/abs/2503.09730v2
[DATE]
2025-06-24 23:42:55+08:00
[CATEGORIES]
cs.CL
cs.LG
Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
[AUTHORS]
Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
[ABSTRACT]
Extreme activation outliers in Large Language Models (LLMs) critically
degrade quantization performance, hindering efficient on-device deployment.
While channel-wise operations and adaptive gradient scaling are recognized
causes, practical mitigation remains challenging. We introduce Outlier-Safe
Pre-Training (OSP), a practical guideline that proactively prevents outlier
formation rather than relying on post-hoc mitigation. OSP combines three key
innovations: (1) the Muon optimizer, eliminating privileged bases while
maintaining training efficiency; (2) Single-Scale RMSNorm, preventing
channel-wise amplification; and (3) a learnable embedding projection,
redistributing activation magnitudes originating from embedding matrices. We
validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is
the first production-scale LLM trained without such outliers. Under aggressive
4-bit quantization, our OSP model achieves a 35.7 average score across 10
benchmarks (compared to 26.5 for an Adam-trained model), with only a 2%
training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis
(0.04) compared to extreme values (1818.56) in standard models, fundamentally
altering LLM quantization behavior. Our work demonstrates that outliers are not
inherent to LLMs but are consequences of training strategies, paving the way
for more efficient LLM deployment. The source code and pretrained checkpoints
are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
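Excess kurtosis, the outlier statistic quoted above, is straightforward to compute. The sketch below uses the population (biased) moment estimator; which estimator the authors used is not stated in the abstract, so treat that choice, and the function name, as assumptions.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2^2 - 3 (zero for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / (m2 ** 2) - 3.0
```

A single large activation among many small ones drives the statistic up sharply: 99 zeros plus one value of 100 give an excess kurtosis of roughly 95, whereas values spread evenly between two levels give a negative result. This is why a near-zero value suggests an activation distribution without heavy-tailed outliers.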
[LINK]
http://arxiv.org/abs/2506.19697v1
[DATE]
2025-06-24 23:03:57+08:00
[CATEGORIES]
cs.LG
cs.CL
Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation
[AUTHORS]
Yuanhe Tian, Lei Mao, Yan Song
[ABSTRACT]
Generating reports for computed tomography (CT) images is a challenging task
that, while similar to existing work on medical image report generation, has
its own unique characteristics, such as spatial encoding of multiple images,
alignment between image volume and texts, etc. Existing solutions typically use
general 2D or 3D image processing techniques to extract features from a CT
volume, where they firstly compress the volume and then divide the compressed
CT slices into patches for visual encoding. These approaches do not explicitly
account for the transformations among CT slices, nor do they effectively
integrate multi-level image features, particularly those containing specific
organ lesions, to instruct CT report generation (CTRG). Considering the
strong correlation among consecutive slices in CT scans, in this paper, we
propose a large language model (LLM) based CTRG method with recurrent visual
feature extraction and stereo attentions for hierarchical feature modeling.
Specifically, we use a vision Transformer to recurrently process each slice in
a CT volume, and employ a set of attentions over the encoded slices from
different perspectives to selectively obtain important visual information and
align them with textual features, so as to better instruct an LLM for CTRG.
Experimental results and further analysis on the benchmark M3D-Cap dataset show
that our method outperforms strong baseline models and achieves
state-of-the-art results, demonstrating its validity and effectiveness.
[COMMENTS]
7 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.19665v1
[DATE]
2025-06-24 22:29:06+08:00
[CATEGORIES]
cs.CL
Tailored Conversations beyond LLMs: A RL-Based Dialogue Manager
[AUTHORS]
Lucie Galland, Catherine Pelachaud, Florian Pecune
[ABSTRACT]
In this work, we propose a novel framework that integrates large language
models (LLMs) with an RL-based dialogue manager for open-ended dialogue with a
specific goal. By leveraging hierarchical reinforcement learning to model the
structured phases of dialogue and employing meta-learning to enhance adaptability
across diverse user profiles, our approach enhances adaptability and
efficiency, enabling the system to learn from limited data, transition fluidly
between dialogue phases, and personalize responses to heterogeneous patient
needs. We apply our framework to Motivational Interviews, aiming to foster
behavior change, and demonstrate that the proposed dialogue manager outperforms
a state-of-the-art LLM baseline in terms of reward, showing a potential benefit
of conditioning LLMs to create open-ended dialogue systems with specific goals.
[LINK]
http://arxiv.org/abs/2506.19652v1
[DATE]
2025-06-24 22:15:26+08:00
[CATEGORIES]
cs.CL
Language Model Re-rankers are Fooled by Lexical Similarities
[AUTHORS]
Lovisa Hagström, Ercong Nie, Ruben Halifa, Helmut Schmid, Richard Johansson, Alexander Junge
[ABSTRACT]
Language model (LM) re-rankers are used to refine retrieval results for
retrieval-augmented generation (RAG). They are more expensive than lexical
matching methods like BM25 but are assumed to better process semantic information
and the relations between the query and the retrieved answers. To understand
whether LM re-rankers always live up to this assumption, we evaluate 6
different LM re-rankers on the NQ, LitQA2 and DRUID datasets. Our results show
that LM re-rankers struggle to outperform a simple BM25 baseline on DRUID.
Leveraging a novel separation metric based on BM25 scores, we explain and
identify re-ranker errors stemming from lexical dissimilarities. We also
investigate different methods to improve LM re-ranker performance and find
these methods mainly useful for NQ. Taken together, our work identifies and
explains weaknesses of LM re-rankers and points to the need for more
adversarial and realistic datasets for their evaluation.
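For readers unfamiliar with the lexical baseline mentioned above, the following is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It is illustrative only: the function name `bm25_scores` and the parameter defaults `k1=1.5` and `b=0.75` are assumptions, not taken from the paper, and the sketch does not reproduce the paper's separation metric.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25.

    `query` is a list of tokens; `docs` is a non-empty list of token lists.
    The defaults for k1 and b are common choices, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Because the score depends only on exact token overlap, a passage that paraphrases the query without sharing its words receives a score of zero, which is the kind of lexical dissimilarity the abstract refers to.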
[COMMENTS]
Accepted to FEVER 2025
[LINK]
http://arxiv.org/abs/2502.17036v2
[DATE]
2025-06-24 22:03:01+08:00
[CATEGORIES]
cs.CL
Correcting Hallucinations in News Summaries: Exploration of Self-Correcting LLM Methods with External Knowledge
[AUTHORS]
Juraj Vladika, Ihsan Soydemir, Florian Matthes
[ABSTRACT]
While large language models (LLMs) have shown remarkable capabilities to
generate coherent text, they suffer from the issue of hallucinations –
factually inaccurate statements. Among numerous approaches to tackle
hallucinations, especially promising are the self-correcting methods. They
leverage the multi-turn nature of LLMs to iteratively generate verification
questions that seek additional evidence, answer them with internal or external
knowledge, and use that to refine the original response with the new
corrections. These methods have been explored for encyclopedic generation, but
less so for domains like news summarization. In this work, we investigate two
state-of-the-art self-correcting systems by applying them to correct
hallucinated summaries using evidence from three search engines. We analyze the
results and provide insights into systems’ performance, revealing interesting
practical findings on the benefits of search engine snippets and few-shot
prompts, as well as high alignment of G-Eval and human evaluation.
[COMMENTS]
Accepted to FEVER @ ACL 2025
[LINK]
http://arxiv.org/abs/2506.19607v1
[DATE]
2025-06-24 21:20:31+08:00
[CATEGORIES]
cs.CL
PATCH! Psychometrics-AssisTed BenCHmarking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics
[AUTHORS]
Qixiang Fang, Daniel L. Oberski, Dong Nguyen
[ABSTRACT]
Many existing benchmarks of large (multimodal) language models (LLMs) focus
on measuring LLMs’ academic proficiency, often also with an interest in
comparing model performance with human test takers’. While such benchmarks have
proven key to the development of LLMs, they suffer from several limitations,
including questionable measurement quality (e.g., Do they measure what they are
supposed to in a reliable way?), lack of quality assessment on the item level
(e.g., Are some items more important or difficult than others?) and unclear
human population reference (e.g., To whom can the model be compared?). In
response to these challenges, we propose leveraging knowledge from
psychometrics – a field dedicated to the measurement of latent variables like
academic proficiency – into LLM benchmarking. We make four primary
contributions. First, we reflect on current LLM benchmark developments and
contrast them with psychometrics-based test development. Second, we introduce
PATCH: a novel framework for Psychometrics-AssisTed benCHmarking of
LLMs. PATCH addresses the aforementioned limitations. In particular, PATCH
enables valid comparison between LLMs and human populations. Third, we
demonstrate PATCH by measuring several LLMs’ proficiency in 8th grade
mathematics against 56 human populations. We show that adopting a
psychometrics-based approach yields evaluation outcomes that diverge from those
based on current benchmarking practices. Fourth, we release 4 high-quality
datasets to support measuring and comparing LLM proficiency in grade school
mathematics and science with human populations.
[COMMENTS]
Accepted to GEM2 Workshop: Generation, Evaluation & Metrics - ACL
2025
[LINK]
http://arxiv.org/abs/2404.01799v3
[DATE]
2025-06-24 21:11:54+08:00
[CATEGORIES]
cs.CL
Large Language Models as Span Annotators
[AUTHORS]
Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondřej Dušek, Simone Balloccu
[ABSTRACT]
Span annotation is the task of localizing and classifying text spans
according to custom guidelines. Annotated spans can be used to analyze and
evaluate high-quality texts for which single-score metrics fail to provide
actionable feedback. Until recently, span annotation was limited to human
annotators or fine-tuned models. In this study, we show that large language
models (LLMs) can serve as flexible and cost-effective span annotation
backbones. To demonstrate their utility, we compare LLMs to skilled human
annotators on three diverse span annotation tasks: evaluating data-to-text
generation, identifying translation errors, and detecting propaganda
techniques. We demonstrate that LLMs achieve inter-annotator agreement (IAA)
comparable to human annotators at a fraction of the cost per output annotation.
We also manually analyze model outputs, finding that LLMs make errors at a
similar rate to human annotators. We release the dataset of more than 40k model
and human annotations for further research.
[LINK]
http://arxiv.org/abs/2504.08697v2
[DATE]
2025-06-24 21:11:18+08:00
[CATEGORIES]
cs.CL
ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model
[AUTHORS]
Zhenke Duan, Jiqun Pan, Jiani Tu, Xiaoyi Wang, Yanqing Wang
[ABSTRACT]
In the era of large-scale artificial intelligence, Large Language Models
(LLMs) have made significant strides in natural language processing. However,
they often lack transparency and generate unreliable outputs, raising concerns
about their interpretability. To address this, the Chain of Thought (CoT)
prompting method structures reasoning into step-by-step deductions. Yet, not
all reasoning chains are valid, and errors can lead to unreliable conclusions.
We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation
Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates
the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT
generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By
filtering ineffective chains using structured ordering statistics, ECCoT
improves interpretability, reduces biases, and enhances the trustworthiness of
LLM-based decision-making. Key contributions include the introduction of ECCoT,
MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning
enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.
[LINK]
http://arxiv.org/abs/2506.19599v1
[DATE]
2025-06-24 21:09:53+08:00
[CATEGORIES]
cs.CL
ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation
[AUTHORS]
Siao Tang, Xinyin Ma, Gongfan Fang, Xinchao Wang
[ABSTRACT]
Recent advancements in large reasoning models (LRMs) like DeepSeek-R1 and
OpenAI o1 series have achieved notable performance enhancements on complex
reasoning tasks by scaling up the generation length by Chain-of-Thought (CoT).
However, an emerging issue is their inclination to produce excessively verbose
reasoning processes, leading to the inefficiency problem. Existing literature
on improving efficiency mainly adheres to the before-reasoning paradigms such
as prompting and reasoning or fine-tuning and reasoning, but ignores the
promising direction of directly encouraging the model to speak concisely by
intervening during the generation of reasoning. To fill this gap, we
propose a framework dubbed ConciseHint, which continuously encourages the
reasoning model to speak concisely by injecting the textual hint (manually
designed or trained on the concise data) during the token generation of the
reasoning process. Besides, ConciseHint is adaptive to the complexity of the
query by adaptively adjusting the hint intensity, which ensures it will not
undermine model performance. Experiments on the state-of-the-art LRMs,
including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can
effectively produce concise reasoning processes while largely maintaining
performance. For instance, we achieve a reduction ratio of 65% for the reasoning
length on the GSM8K benchmark with Qwen-3 4B with almost no accuracy loss.
[COMMENTS]
Codes are available at https://github.com/tsa18/ConciseHint
[LINK]
http://arxiv.org/abs/2506.18810v2
[DATE]
2025-06-24 21:08:33+08:00
[CATEGORIES]
cs.CL
KAG-Thinker: Interactive Thinking and Deep Reasoning in LLMs via Knowledge-Augmented Generation
[AUTHORS]
Dalong Zhang, Jun Xu, Jun Zhou, Lei Liang, Lin Yuan, Ling Zhong, Mengshu Sun, Peilong Zhao, QiWei Wang, Xiaorui Wang, Xinkai Du, YangYang Hou, Yu Ao, ZhaoYang Wang, Zhengke Gui, ZhiYing Yi, Zhongpu Bo
[ABSTRACT]
In this paper, we introduce KAG-Thinker, which upgrades KAG to a multi-turn
interactive thinking and deep reasoning framework powered by a dedicated
parameter-light large language model (LLM). Our approach constructs a
structured thinking process for solving complex problems, enhancing the
logical coherence and contextual consistency of the reasoning process in
question-answering (Q&A) tasks on domain-specific knowledge bases (KBs) within
LLMs. Following the Logical Form-guided retrieval and reasoning
technology route of KAG, this framework first decomposes complex questions into
independently solvable sub-problems (also referred to as logical
forms) through breadth decomposition. Each such logical form is
represented in two equivalent forms (natural language and logical function) and
subsequently classified as either a Knowledge Retrieval or Reasoning Analysis
task. Dependencies and parameter passing between these tasks are explicitly
modeled via logical function interfaces. In the solving process, the Retrieval
function performs retrieval tasks, retrieving one-hop structured and
unstructured information about a specified knowledge unit, while the Math and
Deduce functions perform reasoning analysis tasks. Secondly, it is worth
noting that, in the Knowledge Retrieval sub-problem tasks, LLMs and external
knowledge sources are regarded as equivalent KBs. We use the knowledge
boundary module to determine the optimal source using self-regulatory
mechanisms such as confidence calibration and reflective reasoning, and use the
depth solving module to enhance the comprehensiveness of knowledge
acquisition…
[LINK]
http://arxiv.org/abs/2506.17728v2
[DATE]
2025-06-24 20:50:57+08:00
[CATEGORIES]
cs.CL
Has Machine Translation Evaluation Achieved Human Parity? The Human Reference and the Limits of Progress
[AUTHORS]
Lorenzo Proietti, Stefano Perrella, Roberto Navigli
[ABSTRACT]
In Machine Translation (MT) evaluation, metric performance is assessed based
on agreement with human judgments. In recent years, automatic metrics have
demonstrated increasingly high levels of agreement with humans. To gain a
clearer understanding of metric performance and establish an upper bound, we
incorporate human baselines in the MT meta-evaluation, that is, the assessment
of MT metrics’ capabilities. Our results show that human annotators are not
consistently superior to automatic metrics, with state-of-the-art metrics often
ranking on par with or higher than human baselines. Despite these findings
suggesting human parity, we discuss several reasons for caution. Finally, we
explore the broader implications of our results for the research field, asking:
Can we still reliably measure improvements in MT evaluation? With this work, we
aim to shed light on the limits of our ability to measure progress in the
field, fostering discussion on an issue that we believe is crucial to the
entire MT evaluation community.
[COMMENTS]
Accepted at ACL 2025 Main Conference. 24 pages
[LINK]
http://arxiv.org/abs/2506.19571v1
[DATE]
2025-06-24 20:35:00+08:00
[CATEGORIES]
cs.CL
GeistBERT: Breathing Life into German NLP
[AUTHORS]
Raphael Scheible-Schmitt, Johann Frei
[ABSTRACT]
Advances in transformer-based language models have highlighted the benefits
of language-specific pre-training on high-quality corpora. In this context,
German NLP stands to gain from updated architectures and modern datasets
tailored to the linguistic characteristics of the German language. GeistBERT
seeks to improve German language processing by incrementally training on a
diverse corpus and optimizing model performance across various NLP tasks. It
was pre-trained using fairseq with standard hyperparameters, initialized from
GottBERT weights, and trained on a large-scale German corpus using Whole Word
Masking (WWM). Based on the pre-trained model, we derived extended-input
variants using Nyströmformer and Longformer architectures with support for
sequences up to 8k tokens. While these long-context models were not evaluated
on dedicated long-context benchmarks, they are included in our release. We
assessed all models on NER (CoNLL 2003, GermEval 2014) and text classification
(GermEval 2018 fine/coarse, 10kGNAD) using $F_1$ score and accuracy. The
GeistBERT models achieved strong performance, leading all tasks among the base
models and setting a new state-of-the-art (SOTA). Notably, the base models
outperformed larger models in several tasks. To support the German NLP research
community, we are releasing GeistBERT under the MIT license.
[LINK]
http://arxiv.org/abs/2506.11903v3
[DATE]
2025-06-24 20:31:06+08:00
[CATEGORIES]
cs.CL
ChatSR: Multimodal Large Language Models for Scientific Formula Discovery
[AUTHORS]
Yanjie Li, Lina Yu, Weijun Li, Min Wu, Jingyi Liu, Wenqiang Li, Shu Wei, Yusong Deng
[ABSTRACT]
Formulas are the language of communication between humans and nature. The
discovery of formulas to describe natural laws from observational data is the
purpose of scientific research. It is also an important research topic in
artificial intelligence, where it is known as symbolic regression. Most
existing symbolic regression methods generate expressions directly from
observed data. Although some methods allow prior knowledge to be injected
into the model by adding constraints or introducing special character
hints, they can only introduce a limited amount of prior knowledge
specified in advance, and they cannot understand natural language
instructions. In this article, building on the extensive knowledge and
language understanding ability of multi-modal large language models, we present
ChatSR, which acts like a knowledgeable human scientist: we can tell it any
prior knowledge through natural language to guide it in formula generation.
Tested on 13 datasets, ChatSR not only shows state-of-the-art performance on
traditional symbolic regression tasks but, more notably, can also understand
the prior knowledge contained in natural language prompts and improve the
quality of generated expressions. In addition, it is exciting that ChatSR has a
good zero-shot capability to understand prior knowledge that is not present in
the training data.
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2406.05410v2
[DATE]
2025-06-24 20:22:55+08:00
[CATEGORIES]
cs.CL
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
[AUTHORS]
Bo-Cheng Chiu, Jen-Jee Chen, Yu-Chee Tseng, Feng-Chi Chen
[ABSTRACT]
Large Language Models (LLMs) have recently been extended to the video domain,
enabling sophisticated video-language understanding. However, existing Video
LLMs often exhibit limitations in fine-grained temporal reasoning, restricting
their ability to precisely attribute responses to specific video moments,
especially under constrained supervision. We introduce DaMO, a data-efficient
Video LLM explicitly designed for accurate temporal reasoning and multimodal
understanding. At its core, the proposed Temporal-aware Fuseformer employs a
hierarchical dual-stream architecture that progressively captures temporal
dynamics within each modality and effectively fuses complementary visual and
audio information. To further enhance computational efficiency, DaMO integrates
a global residual that reduces spatial redundancy while preserving essential
semantic details. We train DaMO via a structured four-stage progressive
training paradigm, incrementally equipping the model with multimodal alignment,
semantic grounding, and temporal reasoning capabilities. This work also
contributes multiple datasets augmented from existing ones with GPT-generated
temporally grounded QA pairs for tasks requiring temporal supervision.
Comprehensive experiments on temporal grounding and video QA benchmarks
demonstrate that DaMO consistently surpasses prior methods, particularly in
tasks demanding precise temporal alignment and reasoning. Our work establishes
a promising direction for data-efficient video-language modeling.
[COMMENTS]
I would like to request the withdrawal of this submission because the
current version contains significant errors and incomplete results. I intend
to revise the manuscript thoroughly before resubmitting. I apologize for the
oversight and appreciate your understanding
[LINK]
http://arxiv.org/abs/2506.11558v2
[DATE]
2025-06-24 19:59:30+08:00
[CATEGORIES]
cs.CL
RCStat: A Statistical Framework for using Relative Contextualization in Transformers
[AUTHORS]
Debabrata Mahapatra, Shubham Agarwal, Apoorv Saxena, Subrata Mitra
[ABSTRACT]
Prior work on input-token importance in auto-regressive transformers has
relied on Softmax-normalized attention weights, which obscure the richer
structure of pre-Softmax query-key logits. We introduce RCStat, a statistical
framework that harnesses raw attention logits via Relative Contextualization
(RC), a random variable measuring contextual alignment between token segments,
and derive an efficient upper bound for RC. We demonstrate two applications:
(i) Key-Value compression, where RC-based thresholds drive adaptive key-value
eviction for substantial cache reduction with minimal quality loss; and (ii)
Attribution, where RC yields higher-fidelity token-, sentence-, and chunk-level
explanations than post-Softmax methods. Across question answering,
summarization, and attribution benchmarks, RCStat achieves significant
empirical gains, delivering state-of-the-art compression and attribution
performance without any model retraining.
[LINK]
http://arxiv.org/abs/2506.19549v1
[DATE]
2025-06-24 19:55:43+08:00
[CATEGORIES]
cs.CL
cs.LG
Health Sentinel: An AI Pipeline For Real-time Disease Outbreak Detection
[AUTHORS]
Devesh Pant, Rishi Raj Grandhe, Vipin Samaria, Mukul Paul, Sudhir Kumar, Saransh Khanna, Jatin Agrawal, Jushaan Singh Kalra, Akhil VSSG, Satish V Khalikar, Vipin Garg, Himanshu Chauhan, Pranay Verma, Neha Khandelwal, Soma S Dhavala, Minesh Mathew
[ABSTRACT]
Early detection of disease outbreaks is crucial to ensure timely intervention
by the health authorities. Due to the challenges associated with traditional
indicator-based surveillance, monitoring informal sources such as online media
has become increasingly popular. However, owing to the number of online
articles getting published every day, manual screening of the articles is
impractical. To address this, we propose Health Sentinel. It is a multi-stage
information extraction pipeline that uses a combination of ML and non-ML
methods to extract events (structured information concerning disease outbreaks
or other unusual health events) from online articles. The extracted events are
made available to the Media Scanning and Verification Cell (MSVC) at the
National Centre for Disease Control (NCDC), Delhi for analysis, interpretation
and further dissemination to local agencies for timely intervention. From April
2022 to date, Health Sentinel has processed over 300 million news articles
and identified over 95,000 unique health events across India of which over
3,500 events were shortlisted by the public health experts at NCDC as potential
outbreaks.
[LINK]
http://arxiv.org/abs/2506.19548v1
[DATE]
2025-06-24 19:54:37+08:00
[CATEGORIES]
cs.CL
Automatic Posology Structuration : What role for LLMs?
[AUTHORS]
Natalia Bobkova, Laura Zanella-Calzada, Anyes Tafoughalt, Raphaël Teboul, François Plesse, Félix Gaschi
[ABSTRACT]
Automatically structuring posology instructions is essential for improving
medication safety and enabling clinical decision support. In French
prescriptions, these instructions are often ambiguous, irregular, or
colloquial, limiting the effectiveness of classic ML pipelines. We explore the
use of Large Language Models (LLMs) to convert free-text posologies into
structured formats, comparing prompt-based methods and fine-tuning against a
“pre-LLM” system based on Named Entity Recognition and Linking (NERL). Our
results show that while prompting improves performance, only fine-tuned LLMs
match the accuracy of the baseline. Through error analysis, we observe
complementary strengths: NERL offers structural precision, while LLMs better
handle semantic nuances. Based on this, we propose a hybrid pipeline that
routes low-confidence cases from NERL (<0.8) to the LLM, selecting outputs
based on confidence scores. This strategy achieves 91% structuration accuracy
while minimizing latency and compute. Our results show that this hybrid
approach improves structuration accuracy while limiting computational cost,
offering a scalable solution for real-world clinical use.
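The routing rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `route_posology`, `nerl`, and `llm` are hypothetical, and resolving the final choice by taking the higher-confidence output is an assumption about what "selecting outputs based on confidence scores" means. Only the 0.8 threshold comes from the abstract.

```python
from typing import Callable, Tuple

# A prediction is (structured output, confidence in [0, 1]).
Prediction = Tuple[str, float]


def route_posology(
    text: str,
    nerl: Callable[[str], Prediction],
    llm: Callable[[str], Prediction],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (structured output, name of the system that produced it).

    `nerl` and `llm` are hypothetical callables standing in for the two systems.
    """
    nerl_out, nerl_conf = nerl(text)
    # Confident NERL predictions are accepted directly, so the LLM is never called.
    if nerl_conf >= threshold:
        return nerl_out, "nerl"
    # Low-confidence cases are sent to the LLM as well.
    llm_out, llm_conf = llm(text)
    # Assumption: keep whichever output carries the higher confidence.
    if llm_conf > nerl_conf:
        return llm_out, "llm"
    return nerl_out, "nerl"
```

Calling the LLM only for the low-confidence cases is what keeps latency and compute low in such a design, since most inputs never reach the more expensive model.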
[LINK]
http://arxiv.org/abs/2506.19525v1
[DATE]
2025-06-24 19:25:21+08:00
[CATEGORIES]
cs.CL
heiDS at ArchEHR-QA 2025: From Fixed-k to Query-dependent-k for Retrieval Augmented Generation
[AUTHORS]
Ashish Chouhan, Michael Gertz
[ABSTRACT]
This paper presents the approach of our team called heiDS for the ArchEHR-QA
2025 shared task. A pipeline using a retrieval augmented generation (RAG)
framework is designed to generate answers that are attributed to clinical
evidence from the electronic health records (EHRs) of patients in response to
patient-specific questions. We explored various components of a RAG framework,
focusing on ranked list truncation (RLT) retrieval strategies and attribution
approaches. Instead of using a fixed top-k RLT retrieval strategy, we employ a
query-dependent-k retrieval strategy, including the existing surprise and
autocut methods and two new methods proposed in this work, autocut* and elbow.
The experimental results show the benefits of our strategy in producing factual
and relevant answers when compared to a fixed-k strategy.
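To illustrate what a query-dependent cutoff can look like, here is a minimal sketch that truncates a ranked list at its largest score drop. It is a generic heuristic chosen for illustration and is not claimed to match the surprise, autocut, autocut*, or elbow methods named above; the function name `cut_at_largest_gap` and the `min_k` parameter are hypothetical.

```python
def cut_at_largest_gap(scores, min_k=1):
    """Return how many top-ranked items to keep for one query.

    The list is cut just before the largest drop between consecutive
    scores, so k varies from query to query instead of being fixed.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) <= min_k:
        return len(ranked)
    best_k, best_gap = min_k, -1.0
    for k in range(min_k, len(ranked)):
        # Drop between the k-th kept item and the first discarded item.
        gap = ranked[k - 1] - ranked[k]
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k
```

With retrieval scores `[0.9, 0.85, 0.8, 0.3, 0.2]` the largest drop falls after the third item, so three passages are kept, whereas a fixed top-k would keep the same number regardless of how the scores are distributed.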
[COMMENTS]
12 pages, 2 figures, 6 tables, Workshop on BioNLP and Shared Tasks at
ACL 2025
[LINK]
http://arxiv.org/abs/2506.19512v1
[DATE]
2025-06-24 19:03:01+08:00
[CATEGORIES]
cs.CL
AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models
[AUTHORS]
Zeyu Li, Chuanfu Xiao, Yang Wang, Xiang Liu, Zhenheng Tang, Baotong Lu, Mao Yang, Xinyu Chen, Xiaowen Chu
[ABSTRACT]
Quantization has emerged as an effective and lightweight solution to reduce
the memory footprint of the KV cache in Large Language Models (LLMs).
Nevertheless, minimizing the performance degradation caused by ultra-low-bit KV
cache quantization remains a significant challenge. We observe that quantizing
the KV cache of different tokens has varying impacts on the quality of
attention outputs. To systematically investigate this phenomenon, we perform
forward error propagation analysis on attention and propose the Anchor Score
(AnS) that quantifies the sensitivity of each token’s KV cache to
quantization-induced error. Our analysis reveals significant disparities in AnS
across tokens, suggesting that preserving a small subset with full precision
(FP16) of high-AnS tokens can greatly mitigate accuracy loss in aggressive
quantization scenarios. Based on this insight, we introduce AnTKV, a novel
framework that leverages Anchor Token-aware Vector Quantization to compress the
KV cache. Furthermore, to support efficient deployment, we design and develop a
Triton kernel that is fully compatible with FlashAttention, enabling fast
online Anchor Token selection. AnTKV enables LLaMA-3-8B to handle context
lengths up to 840K tokens on a single 80GB A100 GPU, while achieving up to 3.5x
higher decoding throughput compared to the FP16 baseline. Our experimental
results demonstrate that AnTKV matches or outperforms prior works such as KIVI,
SKVQ, KVQuant, and CQ under 4-bit settings. More importantly, AnTKV achieves
significantly lower perplexity under ultra-low-bit quantization on Mistral-7B,
with only 6.32 at 1-bit and 8.87 at 0.375-bit, compared to the FP16 baseline of
4.73.
[LINK]
http://arxiv.org/abs/2506.19505v1
[DATE]
2025-06-24 18:45:48+08:00
[CATEGORIES]
cs.CL
NaviAgent: Bilevel Planning on Tool Dependency Graphs for Function Calling
[AUTHORS]
Yan Jiang, Hao Zhou, LiZhong GU, Ai Han, TianLong Li
[ABSTRACT]
LLMs’ reliance on static knowledge and fragile tool invocation severely
hinders the orchestration of complex, heterogeneous toolchains, particularly at
large scales. Existing methods typically use rigid single-path execution,
resulting in poor error recovery and exponentially growing search spaces. We
introduce NaviAgent, a graph-navigated bilevel planning architecture for robust
function calling, comprising a Multi-Path Decider and Graph-Encoded Navigator.
As an LLM-powered agent, the Multi-Path Decider defines a four-dimensional
decision space and continuously perceives environmental states, dynamically
selecting the optimal action to fully cover all tool invocation scenarios. The
Graph-Encoded Navigator constructs a Tool Dependency Heterogeneous Graph
(TDHG), where node embeddings explicitly fuse API schema structure with
historical invocation behavior. It also integrates a novel heuristic search
strategy that guides the Decider toward efficient and highly successful
toolchains, even for unseen tool combinations. Experiments show that NaviAgent
consistently achieves the highest task success rate (TSR) across all foundation
models and task complexities, outperforming the average baselines (ReAct,
ToolLLM, α-UMI) by 13.5%, 16.4%, and 19.0% on Qwen2.5-14B, Qwen2.5-32B,
and Deepseek-V3, respectively. Its execution steps are typically within one
step of the most efficient baseline, ensuring a strong balance between quality
and efficiency. Notably, a fine-tuned Qwen2.5-14B model achieves a TSR of
49.5%, surpassing the much larger 32B model (44.9%) under our architecture.
Incorporating the Graph-Encoded Navigator further boosts TSR by an average of
2.4 points, with gains of over 9 points on complex tasks for larger models
(Deepseek-V3 and GPT-4o), highlighting its essential role in toolchain
orchestration.
[LINK]
http://arxiv.org/abs/2506.19500v1
[DATE]
2025-06-24 18:39:07+08:00
[CATEGORIES]
cs.CL
cs.LG
Is Long-to-Short a Free Lunch? Investigating Inconsistency and Reasoning Efficiency in LRMs
[AUTHORS]
Shu Yang, Junchao Wu, Xuansheng Wu, Derek Wong, Ninhao Liu, Di Wang
[ABSTRACT]
Large Reasoning Models (LRMs) have achieved remarkable performance on complex
tasks by engaging in extended reasoning before producing final answers, yet
this strength introduces the risk of overthinking, where excessive token
generation occurs even for simple tasks. While recent work in efficient
reasoning seeks to reduce reasoning length while preserving accuracy, it
remains unclear whether such optimization is truly a free lunch. Drawing on the
intuition that compressing reasoning may reduce the robustness of model
responses and lead models to omit key reasoning steps, we investigate whether
efficient reasoning strategies introduce behavioral inconsistencies. To
systematically assess this, we introduce $ICBENCH$, a benchmark designed to
measure inconsistency in LRMs across three dimensions: inconsistency across
task settings (ITS), inconsistency between training objectives and learned
behavior (TR-LB), and inconsistency between internal reasoning and
self-explanations (IR-SE). Applying $ICBENCH$ to a range of open-source LRMs,
we find that while larger models generally exhibit greater consistency than
smaller ones, they all display widespread “scheming” behaviors, including
self-disagreement, post-hoc rationalization, and the withholding of reasoning
cues. Crucially, our results demonstrate that efficient reasoning strategies
such as No-Thinking and Simple Token-Budget consistently increase all three
defined types of inconsistency. These findings suggest that although efficient
reasoning enhances token-level efficiency, further investigation is imperative
to ascertain whether it concurrently introduces the risk of models evading
effective supervision.
[LINK]
http://arxiv.org/abs/2506.19492v1
[DATE]
2025-06-24 18:25:28+08:00
[CATEGORIES]
cs.CL
Dialogic Pedagogy for Large Language Models: Aligning Conversational AI with Proven Theories of Learning
[AUTHORS]
Russell Beale
[ABSTRACT]
Large Language Models (LLMs) are rapidly transforming education by enabling
rich conversational learning experiences. This article provides a comprehensive
review of how LLM-based conversational agents are being used in higher
education, with extensions to secondary and lifelong learning contexts. We
synthesize existing literature on LLMs in education and theories of
conversational and dialogic pedagogy - including Vygotsky’s sociocultural
learning (scaffolding and the Zone of Proximal Development), the Socratic
method, and Laurillard’s conversational framework - and examine how prompting
strategies and retrieval-augmented generation (RAG) can align LLM behaviors
with these pedagogical theories, and how they can support personalized, adaptive
learning. We map educational theories to LLM capabilities, highlighting where
LLM-driven dialogue supports established learning principles and where it
challenges or falls short of traditional pedagogical assumptions. Notable gaps
in applying prior theories to LLMs are identified, such as the models’ tendency
to provide direct answers instead of fostering co-construction of knowledge,
and the need to account for the constant availability and broad but non-human
expertise of LLM tutors. In response, we propose practical strategies to better
align LLM interactions with sound pedagogy - for example, designing prompts
that encourage Socratic questioning, scaffolded guidance, and student
reflection, as well as integrating retrieval mechanisms to ensure accuracy and
contextual relevance. Our aim is to bridge the gap between educational theory
and the emerging practice of AI-driven conversational learning, offering
insights and tools for making LLM-based dialogues more educationally productive
and theory-aligned.
[LINK]
http://arxiv.org/abs/2506.19484v1
[DATE]
2025-06-24 18:19:09+08:00
[CATEGORIES]
cs.CL
Commonsense Generation and Evaluation for Dialogue Systems using Large Language Models
[AUTHORS]
Marcos Estecha-Garitagoitia, Chen Zhang, Mario Rodríguez-Cantelar, Luis Fernando D’Haro
[ABSTRACT]
This paper provides preliminary results on exploring the task of performing
turn-level data augmentation for dialogue systems based on different types of
commonsense relationships, and the automatic evaluation of the generated
synthetic turns. The proposed methodology takes advantage of the extended
knowledge and zero-shot capabilities of pretrained Large Language Models (LLMs)
to follow instructions, understand contextual information, and perform
commonsense reasoning. The approach draws inspiration from
methodologies like Chain-of-Thought (CoT), applied more explicitly to the task
of prompt-based generation for dialogue-based data augmentation conditioned on
commonsense attributes, and the automatic evaluation of the generated
dialogues.
To assess the effectiveness of the proposed approach, we first extract 200
randomly selected partial dialogues from 5 different well-known dialogue
datasets and generate alternative responses conditioned on different event
commonsense attributes. This novel dataset allows us to measure the proficiency
of LLMs in generating contextually relevant commonsense knowledge, specifically
for up to 12 different ATOMIC [10] database relations. Secondly, we
propose an evaluation framework to automatically detect the quality of the
generated dataset inspired by the ACCENT [26] metric, which offers a nuanced
approach to assess event commonsense. However, our method does not follow
ACCENT’s complex event-relation tuple extraction process. Instead, we propose an
instruction-based prompt for each commonsense attribute and use
state-of-the-art LLMs to automatically detect the original attributes used when
creating each augmented turn in the previous step.
Preliminary results suggest that our approach effectively harnesses LLMs’
capabilities for commonsense reasoning and evaluation in dialogue systems.
[LINK]
http://arxiv.org/abs/2506.19483v1
[DATE]
2025-06-24 18:18:05+08:00
[CATEGORIES]
cs.CL
LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages
[AUTHORS]
Karthika N J, Krishnakant Bhatt, Ganesh Ramakrishnan, Preethi Jyothi
[ABSTRACT]
Translating technical terms into lexically similar, low-resource Indian
languages remains a challenge due to limited parallel data and the complexity
of linguistic structures. We propose a novel use-case of Sanskrit-based
segments for linguistically informed translation of such terms, leveraging
subword-level similarity and morphological alignment across related languages.
Our approach uses character-level segmentation to identify meaningful subword
units, facilitating more accurate and context-aware translation. To enable
this, we utilize a Character-level Transformer model for Sanskrit Word
Segmentation (CharSS), which addresses the complexities of sandhi and
morpho-phonemic changes during segmentation. We observe consistent improvements
in two experimental settings for technical term translation using
Sanskrit-derived segments, averaging 8.46 and 6.79 chrF++ scores, respectively.
Further, we conduct a post hoc human evaluation to verify the quality
assessment of the translated technical terms using automated metrics. This work
has important implications for the education field, especially in creating
accessible, high-quality learning materials in Indian languages. By supporting
the accurate and linguistically rooted translation of technical content, our
approach facilitates inclusivity and aids in bridging the resource gap for
learners in low-resource language communities.
[COMMENTS]
20th Workshop on Innovative Use of NLP for Building Educational
Applications (Co-located with ACL2025)
[LINK]
http://arxiv.org/abs/2407.06331v2
[DATE]
2025-06-24 18:06:32+08:00
[CATEGORIES]
cs.CL
Can Large Language Models Capture Human Annotator Disagreements?
[AUTHORS]
Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Elliott Ash
[ABSTRACT]
Human annotation variation (i.e., annotation disagreements) is common in NLP
and often reflects important information such as task subjectivity and sample
ambiguity. While Large Language Models (LLMs) are increasingly used for
automatic annotation to reduce human effort, their evaluation often focuses on
predicting the majority-voted “ground truth” labels. It is still unclear,
however, whether these models also capture informative human annotation
variation. Our work addresses this gap by extensively evaluating LLMs’ ability
to predict annotation disagreements without access to repeated human labels.
Our results show that LLMs struggle with modeling disagreements, which can be
overlooked by majority label-based evaluations. Notably, while RLVR-style
(Reinforcement learning with verifiable rewards) reasoning generally boosts LLM
performance, it degrades performance in disagreement prediction. Our findings
highlight the critical need for evaluating and improving LLM annotators in
disagreement modeling. Code and data at
https://github.com/EdisonNi-hku/Disagreement_Prediction.
[COMMENTS]
Preprint Under Review
[LINK]
http://arxiv.org/abs/2506.19467v1
[DATE]
2025-06-24 17:49:26+08:00
[CATEGORIES]
cs.CL
Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
[AUTHORS]
Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
[ABSTRACT]
Vision-and-Language Navigation (VLN) in large-scale urban environments
requires embodied agents to ground linguistic instructions in complex scenes
and recall relevant experiences over extended time horizons. Prior modular
pipelines offer interpretability but lack unified memory, while end-to-end
(M)LLM agents excel at fusing vision and language yet remain constrained by
fixed context windows and implicit spatial reasoning. We introduce
Mem4Nav, a hierarchical spatial-cognition long-short memory system
that can augment any VLN backbone. Mem4Nav fuses a sparse octree for
fine-grained voxel indexing with a semantic topology graph for high-level
landmark connectivity, storing both in trainable memory tokens embedded via a
reversible Transformer. Long-term memory (LTM) compresses and retains
historical observations at both octree and graph nodes, while short-term memory
(STM) caches recent multimodal entries in relative coordinates for real-time
obstacle avoidance and local planning. At each step, STM retrieval sharply
prunes dynamic context, and, when deeper history is needed, LTM tokens are
decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and
Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based
LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13
pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW
improvement. Ablations confirm the indispensability of both the hierarchical
map and dual memory modules. Our codes are open-sourced via
https://github.com/tsinghua-fib-lab/Mem4Nav.
[LINK]
http://arxiv.org/abs/2506.19433v1
[DATE]
2025-06-24 17:00:43+08:00
[CATEGORIES]
cs.CL
Learning to Disentangle Latent Reasoning Rules with Language VAEs: A Systematic Study
[AUTHORS]
Yingji Zhang, Marco Valentino, Danilo S. Carvalho, André Freitas
[ABSTRACT]
Incorporating explicit reasoning rules within the latent space of language
models (LMs) offers a promising pathway to enhance generalisation,
interpretability, and controllability. While current Transformer-based language
models have shown strong performance on Natural Language Inference (NLI) tasks,
they often rely on memorisation rather than rule-based inference. This work
investigates how reasoning rules can be explicitly embedded and memorised
within the LMs through Language Variational Autoencoders (VAEs). We propose a
complete pipeline for learning reasoning rules within Transformer-based
language VAEs. This pipeline encompasses three rule-based reasoning tasks, a
supporting theoretical framework, and a practical end-to-end architecture. The
experiment illustrates the following findings: Disentangled reasoning: Under
explicit signal supervision, reasoning rules - viewed as functional mappings -
can be disentangled within the encoder’s parametric space. This separation
results in distinct clustering of rules in the output feature space. Prior
knowledge injection: Injecting reasoning information into the Query enables the
model to more effectively retrieve the stored Value from memory based on the
Key. This approach offers a simple method for integrating prior knowledge into
decoder-only language models. Performance bottleneck: In mathematical reasoning
tasks using Qwen2.5 (0.5B), increasing the sample count doesn’t improve
performance beyond a point. Moreover, FFN layers are better than attention layers at
preserving the separation of reasoning rules in the model’s parameters.
[LINK]
http://arxiv.org/abs/2506.19418v1
[DATE]
2025-06-24 16:38:03+08:00
[CATEGORIES]
cs.CL
Automated Detection of Pre-training Text in Black-box LLMs
[AUTHORS]
Ruihan Hu, Yu-Ming Shang, Jiankun Peng, Wei Luo, Yazhe Wang, Xi Zhang
[ABSTRACT]
Detecting whether a given text is a member of the pre-training data of Large
Language Models (LLMs) is crucial for ensuring data privacy and copyright
protection. Most existing methods rely on the LLM’s hidden information (e.g.,
model parameters or token probabilities), making them ineffective in the
black-box setting, where only input and output texts are accessible. Although
some methods have been proposed for the black-box setting, they rely on massive
manual efforts such as designing complicated questions or instructions. To
address these issues, we propose VeilProbe, the first framework for
automatically detecting LLMs’ pre-training texts in a black-box setting without
human intervention. VeilProbe utilizes a sequence-to-sequence mapping model to
infer the latent mapping feature between the input text and the corresponding
output suffix generated by the LLM. Then it performs the key token
perturbations to obtain more distinguishable membership features. Additionally,
considering real-world scenarios where the ground-truth training text samples
are limited, a prototype-based membership classifier is introduced to alleviate
the overfitting issue. Extensive evaluations on three widely used datasets
demonstrate that our framework is effective and superior in the black-box
setting.
[COMMENTS]
13 pages
[LINK]
http://arxiv.org/abs/2506.19399v1
[DATE]
2025-06-24 16:08:15+08:00
[CATEGORIES]
cs.CL
Statistical Multicriteria Evaluation of LLM-Generated Text
[AUTHORS]
Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Matthias Aßenmacher, Christoph Jansen
[ABSTRACT]
Assessing the quality of LLM-generated text remains a fundamental challenge
in natural language processing. Current evaluation approaches often rely on
isolated metrics or simplistic aggregations that fail to capture the nuanced
trade-offs between coherence, diversity, fluency, and other relevant indicators
of text quality. In this work, we adapt a recently proposed framework for
statistical inference based on Generalized Stochastic Dominance (GSD) that
addresses three critical limitations in existing benchmarking methodologies:
the inadequacy of single-metric evaluation, the incompatibility between
cardinal automatic metrics and ordinal human judgments, and the lack of
inferential statistical guarantees. The GSD-front approach enables simultaneous
evaluation across multiple quality dimensions while respecting their different
measurement scales, building upon partial orders of decoding strategies, thus
avoiding arbitrary weighting of the involved metrics. By applying this
framework to evaluate common decoding strategies against human-generated text,
we demonstrate its ability to identify statistically significant performance
differences while accounting for potential deviations from the i.i.d.
assumption of the sampling design.
[LINK]
http://arxiv.org/abs/2506.18082v2
[DATE]
2025-06-24 15:59:45+08:00
[CATEGORIES]
cs.CL
Measuring and Guiding Monosemanticity
[AUTHORS]
Ruben Härle, Felix Friedrich, Manuel Brack, Stephan Wäldchen, Björn Deiseroth, Patrick Schramowski, Kristian Kersting
[ABSTRACT]
There is growing interest in leveraging mechanistic interpretability and
controllability to better understand and influence the internal dynamics of
large language models (LLMs). However, current methods face fundamental
challenges in reliably localizing and manipulating feature representations.
Sparse Autoencoders (SAEs) have recently emerged as a promising direction for
feature extraction at scale, yet they, too, are limited by incomplete feature
isolation and unreliable monosemanticity. To systematically quantify these
limitations, we introduce Feature Monosemanticity Score (FMS), a novel metric
to quantify feature monosemanticity in latent representation. Building on these
insights, we propose Guided Sparse Autoencoders (G-SAE), a method that
conditions latent representations on labeled concepts during training. We
demonstrate that reliable localization and disentanglement of target concepts
within the latent space improve interpretability, detection of behavior, and
control. Specifically, our evaluations on toxicity detection, writing style
identification, and privacy attribute recognition show that G-SAE not only
enhances monosemanticity but also enables more effective and fine-grained
steering with less quality degradation. Our findings provide actionable
guidelines for measuring and advancing mechanistic interpretability and control
of LLMs.
[LINK]
http://arxiv.org/abs/2506.19382v1
[DATE]
2025-06-24 15:18:20+08:00
[CATEGORIES]
cs.CL
ReDit: Reward Dithering for Improved LLM Policy Optimization
[AUTHORS]
Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
[ABSTRACT]
DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning
capabilities through its rule-based reward system. While it’s a “perfect”
reward system that effectively mitigates reward hacking, such reward functions
are often discrete. Our experimental observations suggest that discrete rewards
can lead to gradient anomaly, unstable optimization, and slow convergence. To
address this issue, we propose ReDit (Reward Dithering), a method that dithers
the discrete reward signal by adding simple random noise. With this perturbed
reward, exploratory gradients are continuously provided throughout the learning
process, enabling smoother gradient updates and accelerating convergence. The
injected noise also introduces stochasticity into flat reward regions,
encouraging the model to explore novel policies and escape local optima.
Experiments across diverse tasks demonstrate the effectiveness and efficiency
of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO
with only approximately 10% of the training steps, and furthermore, still exhibits
a 4% performance improvement over vanilla GRPO when trained for a similar
duration. Visualizations confirm significant mitigation of gradient issues with
ReDit. Moreover, theoretical analyses are provided to further validate these
advantages.
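
The dithering idea described above can be illustrated with a minimal sketch.
It is not the authors' implementation: the Gaussian noise, the value of sigma,
and the helper names group_advantages and dither are assumptions made for
illustration, and the group-standardized advantage is a simplified GRPO-style
stand-in. When every sampled response receives the same discrete reward, the
standardized advantages are all zero and carry no learning signal; adding
small zero-mean noise restores a nonzero signal.

```python
import random
import statistics


def group_advantages(rewards):
    """Standardize rewards within one sampled group (simplified GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < 1e-12:
        # All rewards identical: every advantage is zero, so no gradient signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def dither(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to each reward (sigma is an assumed value)."""
    rng = rng or random.Random()
    return [r + rng.gauss(0.0, sigma) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    flat_rewards = [1.0, 1.0, 1.0, 1.0]  # a flat region of a discrete reward
    print("plain:   ", group_advantages(flat_rewards))
    print("dithered:", group_advantages(dither(flat_rewards, rng=rng)))
```

In this sketch the noise only breaks ties inside a flat reward region; how the
noise scale should be chosen or scheduled is left open here.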
[COMMENTS]
10 pages, 15 figures
[LINK]
http://arxiv.org/abs/2506.18631v2
[DATE]
2025-06-24 15:07:57+08:00
[CATEGORIES]
cs.LG
cs.CL
SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents
[AUTHORS]
Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li
[COMMENTS]
NeurIPS 2023
[LINK]
http://arxiv.org/abs/2305.13040v7
[DATE]
2025-06-24 15:06:57+08:00
[CATEGORIES]
cs.CL
Doc2SAR: A Synergistic Framework for High-Fidelity Extraction of Structure-Activity Relationships from Scientific Documents
[AUTHORS]
Jiaxi Zhuang, Kangning Li, Jue Hou, Mingjun Xu, Zhifeng Gao, Hengxing Cai
[ABSTRACT]
Extracting molecular structure-activity relationships (SARs) from scientific
literature and patents is essential for drug discovery and materials research.
However, this task remains challenging due to heterogeneous document formats
and limitations of existing methods. Specifically, rule-based approaches
relying on rigid templates fail to generalize across diverse document layouts,
while general-purpose multimodal large language models (MLLMs) lack sufficient
accuracy and reliability for specialized tasks, such as layout detection and
optical chemical structure recognition (OCSR). To address these challenges, we
introduce DocSAR-200, a rigorously annotated benchmark of 200 scientific
documents designed specifically for evaluating SAR extraction methods.
Additionally, we propose Doc2SAR, a novel synergistic framework that integrates
domain-specific tools with MLLMs enhanced via supervised fine-tuning (SFT).
Extensive experiments demonstrate that Doc2SAR achieves state-of-the-art
performance across various document types, significantly outperforming leading
end-to-end baselines. Specifically, Doc2SAR attains an overall Table Recall of
80.78% on DocSAR-200, exceeding end-to-end GPT-4o by 51.48%. Furthermore, Doc2SAR
demonstrates practical usability through efficient inference and is accompanied
by a web app.
[LINK]
http://arxiv.org/abs/2506.21625v1
[DATE]
2025-06-24 14:53:04+08:00
[CATEGORIES]
cs.CL
Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation
[AUTHORS]
Jisu Shin, Juhyun Oh, Eunsu Kim, Hoyun Song, Alice Oh
[ABSTRACT]
Ensuring persona fidelity in large language models (LLMs) is essential for
maintaining coherent and engaging human-AI interactions. However, LLMs often
exhibit Out-of-Character (OOC) behavior, where generated responses deviate from
an assigned persona, leading to inconsistencies that affect model reliability.
Existing evaluation methods typically assign single scores to entire responses,
struggling to capture subtle persona misalignment, particularly in long-form
text generation. To address this limitation, we propose an atomic-level
evaluation framework that quantifies persona fidelity at a finer granularity.
Our three key metrics measure the degree of persona alignment and consistency
within and across generations. Our approach enables a more precise and
realistic assessment of persona fidelity by identifying subtle deviations that
real users would encounter. Through our experiments, we demonstrate that our
framework effectively detects persona inconsistencies that prior methods
overlook. By analyzing persona fidelity across diverse tasks and personality
types, we reveal how task structure and persona desirability influence model
adaptability, highlighting challenges in maintaining consistent persona
expression.
[COMMENTS]
Findings of ACL 2025; github repo:
https://github.com/ddindidu/atomic-persona-evaluation/
[LINK]
http://arxiv.org/abs/2506.19352v1
[DATE]
2025-06-24 14:33:10+08:00
[CATEGORIES]
cs.CL
In-Context Occam’s Razor: How Transformers Prefer Simpler Hypotheses on the Fly
[AUTHORS]
Puneesh Deora, Bhavya Vasudeva, Tina Behnia, Christos Thrampoulidis
[ABSTRACT]
In-context learning (ICL) enables transformers to adapt to new tasks through
contextual examples without parameter updates. While existing research has
typically studied ICL in fixed-complexity environments, practical language
models encounter tasks spanning diverse complexity levels. This paper
investigates how transformers navigate hierarchical task structures where
higher-complexity categories can perfectly represent any pattern generated by
simpler ones. We design well-controlled testbeds based on Markov chains and
linear regression that reveal transformers not only identify the appropriate
complexity level for each task but also accurately infer the corresponding
parameters, even when the in-context examples are compatible with multiple
complexity hypotheses. Notably, when presented with data generated by simpler
processes, transformers consistently favor the least complex sufficient
explanation. We theoretically explain this behavior through a Bayesian
framework, demonstrating that transformers effectively implement an in-context
Bayesian Occam’s razor by balancing model fit against complexity penalties. We
further ablate on the roles of model size, training mixture distribution,
inference context length, and architecture. Finally, we validate this Occam’s
razor-like inductive bias on a pretrained GPT-4 model with Boolean-function
tasks as a case study, suggesting it may be inherent to transformers trained on
diverse task distributions.
[COMMENTS]
28 pages, 19 figures
[LINK]
http://arxiv.org/abs/2506.19351v1
[DATE]
2025-06-24 14:33:00+08:00
[CATEGORIES]
cs.LG
cs.CL
Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
[AUTHORS]
Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong
[ABSTRACT]
While understanding the knowledge boundaries of LLMs is crucial to prevent
hallucination, research on the knowledge boundaries of LLMs has predominantly
focused on English. In this work, we present the first study to analyze how
LLMs recognize knowledge boundaries across different languages by probing their
internal representations when processing known and unknown questions in
multiple languages. Our empirical studies reveal three key findings: 1) LLMs’
perceptions of knowledge boundaries are encoded in the middle to middle-upper
layers across different languages; 2) Language differences in knowledge
boundary perception follow a linear structure, which motivates our proposal of
a training-free alignment method that effectively transfers knowledge boundary
perception ability across languages, thereby helping reduce hallucination risk
in low-resource languages; 3) Fine-tuning on bilingual question pair
translation further enhances LLMs’ recognition of knowledge boundaries across
languages. Given the absence of standard testbeds for cross-lingual knowledge
boundary analysis, we construct a multilingual evaluation suite comprising
three representative types of knowledge boundary data. Our code and datasets
are publicly available at
https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
[COMMENTS]
ACL 2025 main; camera ready
[LINK]
http://arxiv.org/abs/2504.13816v3
[DATE]
2025-06-24 14:24:15+08:00
[CATEGORIES]
cs.CL
RAG+: Enhancing Retrieval-Augmented Generation with Application-Aware Reasoning
[AUTHORS]
Yu Wang, Shiwan Zhao, Zhihu Wang, Yubo Zhang, Xicheng Zhang, Zhengfan Wang, Heyuan Huang, Ming Fan, Ting Liu
[ABSTRACT]
The integration of external knowledge through Retrieval-Augmented Generation
(RAG) has become foundational in enhancing large language models (LLMs) for
knowledge-intensive tasks. However, existing RAG paradigms often overlook the
cognitive step of applying knowledge, leaving a gap between retrieved facts and
task-specific reasoning. In this work, we introduce RAG+, a principled and
modular extension that explicitly incorporates application-aware reasoning into
the RAG pipeline. RAG+ constructs a dual corpus consisting of knowledge and
aligned application examples, created either manually or automatically, and
retrieves both jointly during inference. This design enables LLMs not only to
access relevant information but also to apply it within structured,
goal-oriented reasoning processes. Experiments across mathematical, legal, and
medical domains, conducted on multiple models, demonstrate that RAG+
consistently outperforms standard RAG variants, achieving average improvements
of 3-5%, and peak gains up to 7.5% in complex scenarios. By bridging retrieval
with actionable application, RAG+ advances a more cognitively grounded
framework for knowledge integration, representing a step toward more
interpretable and capable LLMs.
[LINK]
http://arxiv.org/abs/2506.11555v2
[DATE]
2025-06-24 13:50:06+08:00
[CATEGORIES]
cs.CL
FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
[AUTHORS]
Jiayi Tian, Ryan Solgi, Jinming Lu, Yifan Yang, Hai Li, Zheng Zhang
[ABSTRACT]
Large Language Models (LLMs) have enabled remarkable progress in natural
language processing, yet their high computational and memory demands pose
challenges for deployment in resource-constrained environments. Although recent
low-rank decomposition methods offer a promising path for structural
compression, they often suffer from accuracy degradation and expensive calibration
procedures, and they yield inefficient model architectures that hinder
real-world inference speedups. In this paper, we propose FLAT-LLM, a fast and
accurate, training-free structural compression method based on fine-grained
low-rank transformations in the activation space. Specifically, we reduce the
hidden dimension by transforming the weights using truncated eigenvectors
computed via head-wise Principal Component Analysis (PCA), and employ an
importance-based metric to adaptively allocate ranks across decoders. FLAT-LLM
achieves efficient and effective weight compression without recovery
fine-tuning and can complete calibration within a few minutes.
Evaluated across 4 models and 11 datasets, FLAT-LLM outperforms structural
pruning baselines in generalization and downstream performance, while
delivering inference speedups over decomposition-based methods.
[LINK]
http://arxiv.org/abs/2505.23966v2
[DATE]
2025-06-24 13:40:57+08:00
[CATEGORIES]
cs.CL
JCAPT: A Joint Modeling Approach for CAPT
[AUTHORS]
Tzu-Hsuan Yang, Yue-Yang He, Berlin Chen
[ABSTRACT]
Effective pronunciation feedback is critical in second language (L2)
learning, for which computer-assisted pronunciation training (CAPT) systems
often encompass two key tasks: automatic pronunciation assessment (APA) and
mispronunciation detection and diagnosis (MDD). Recent work has shown that
joint modeling of these two tasks can yield mutual benefits. Our unified
framework leverages Mamba, a selective state space model (SSM), while
integrating phonological features and think token strategies to jointly enhance
interpretability and fine-grained temporal reasoning in APA and MDD. To our
knowledge, this is the first study to combine phonological attribution,
SSM-based modeling, and prompting in CAPT. A series of experiments conducted on
the speechocean762 benchmark demonstrate that our model consistently
outperforms prior methods, particularly on the MDD task.
[COMMENTS]
Submitted to the ISCA SLaTE-2025 Workshop
[LINK]
http://arxiv.org/abs/2506.19315v1
[DATE]
2025-06-24 13:12:32+08:00
[CATEGORIES]
cs.CL
Long-Context Generalization with Sparse Attention
[AUTHORS]
Pavlo Vasylenko, Marcos Treviso, André F. T. Martins
[ABSTRACT]
Transformer-based architectures traditionally employ softmax to compute
attention weights, which produces dense distributions over all tokens in a
sequence. While effective in many settings, this density has been shown to be
detrimental for tasks that demand precise focus on fixed-size patterns: as
sequence length increases, non-informative tokens accumulate attention
probability mass, leading to dispersion and representational collapse. We show
in this paper that sparse attention mechanisms using $\alpha$-entmax can avoid
these issues, due to their ability to assign exact zeros to irrelevant tokens.
Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows
$\alpha$-entmax with a learnable temperature parameter, allowing the attention
distribution to interpolate between sparse (pattern-focused) and dense
(softmax-like) regimes. Finally, we show that the ability to locate and
generalize fixed-size patterns can be further improved through a careful design
of position encodings, which impacts both dense and sparse attention methods.
By integrating ASEntmax into standard transformer layers alongside proper
positional encodings, we show that our models greatly outperform softmax,
scalable softmax, and fixed-temperature $\alpha$-entmax baselines on
long-context generalization.
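
The contrast between dense and sparse attention weights can be seen in a small
sketch. It covers only sparsemax, which is the alpha = 2 special case of
$\alpha$-entmax, and does not implement ASEntmax or its learnable temperature;
the function names and the example scores are assumptions made for
illustration. Softmax gives every token a strictly positive weight, while
sparsemax projects the scores onto the probability simplex and can assign
exact zeros.

```python
import math


def softmax(scores):
    """Dense attention weights: every entry is strictly positive."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    k, support_sum = 0, 0.0
    for i, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The i-th largest score stays in the support while this holds.
        if 1.0 + i * z > cumsum:
            k, support_sum = i, cumsum
    tau = (support_sum - 1.0) / k  # threshold subtracted from every score
    return [max(s - tau, 0.0) for s in scores]


if __name__ == "__main__":
    scores = [1.0, 0.8, 0.1]
    print("softmax:  ", softmax(scores))
    print("sparsemax:", sparsemax(scores))
```

With these scores the lowest-scoring token receives exactly zero weight under
sparsemax, whereas softmax still assigns it some probability mass.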
[LINK]
http://arxiv.org/abs/2506.16640v2
[DATE]
2025-06-24 12:45:00+08:00
[CATEGORIES]
cs.CL
Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
[AUTHORS]
Liang Zeng, Yongcong Li, Yuzhen Xiao, Changshi Li, Chris Yuhao Liu, Rui Yan, Tianwen Wei, Jujie He, Xuchen Song, Yang Liu, Yahui Zhou
[ABSTRACT]
Software engineering (SWE) has recently emerged as a crucial testbed for
next-generation LLM agents, demanding inherent capabilities in two critical
dimensions: sustained iterative problem-solving (e.g., >50 interaction rounds)
and long-context dependency resolution (e.g., >32k tokens). However, the data
curation process in SWE remains notoriously time-consuming, as it heavily
relies on manual annotation for code file filtering and the setup of dedicated
runtime environments to execute and validate unit tests. Consequently, most
existing datasets are limited to only a few thousand GitHub-sourced instances.
To this end, we propose an incremental, automated data-curation pipeline that
systematically scales both the volume and diversity of SWE datasets. Our
dataset comprises 10,169 real-world Python task instances from 2,531 distinct
GitHub repositories, each accompanied by a task specified in natural language
and a dedicated runtime-environment image for automated unit-test validation.
We have carefully curated over 8,000 successfully runtime-validated training
trajectories from our proposed SWE dataset. When fine-tuning the Skywork-SWE
model on these trajectories, we uncover a striking data scaling phenomenon: the
trained model’s software engineering performance continues to improve as the
data size increases, showing no signs of
saturation. Notably, our Skywork-SWE model achieves 38.0% pass@1 accuracy on
the SWE-bench Verified benchmark without using verifiers or multiple rollouts,
establishing a new state-of-the-art (SOTA) among the Qwen2.5-Coder-32B-based
LLMs built on the OpenHands agent framework. Furthermore, with the
incorporation of test-time scaling techniques, the performance further improves
to 47.0% accuracy, surpassing the previous SOTA results for sub-32B parameter
models. We release the Skywork-SWE-32B model checkpoint to accelerate future
research.
[LINK]
http://arxiv.org/abs/2506.19290v1
[DATE]
2025-06-24 11:53:36+08:00
[CATEGORIES]
cs.CL
Evaluating Transparent Reasoning in Large Language Models for Accountable Critical Tasks
[AUTHORS]
Junhao Chen, Bowen Wang, Jiuyang Chang, Yuta Nakashima
[ABSTRACT]
This paper introduces REACT, a benchmark designed to rigorously evaluate the
reasoning capabilities of large language models (LLMs) within accountable,
high-stakes decision-making tasks in medical and legal domains. Unlike
traditional benchmarks primarily focused on prediction accuracy, REACT
emphasizes transparent and interpretable reasoning, requiring models to align
their logic closely with expert-derived procedures. To assess whether LLM
reasoning aligns closely with human experts, we annotated 511 clinical cases
from the medical domain and 86 legal cases from the legal domain, each enriched
with detailed expert-extracted rationales and evidence supporting each step of
the reasoning process. These annotations were guided by carefully constructed
reasoning graphs, which explicitly encode domain-specific inference structures
and decision criteria derived by domain experts. These reasoning graphs serve
not only as standards for expert annotation but also as structured guidelines
enabling models to reason transparently and step-by-step. To address the
scalability challenges of manual annotation, we further developed a
semi-automatic annotation pipeline leveraging expert-defined reasoning graph
templates to efficiently generate new graphs, exploring the potential to extend
our approach into additional critical domains. Experimental results demonstrate
that reasoning graphs substantially enhance the interpretability and accuracy
of LLM reasoning compared to traditional baselines, although significant gaps
remain relative to expert-level reasoning performance.
[COMMENTS]
This paper is the journal extension of our NeurIPS 2024 paper
“DiReCT: Diagnostic Reasoning for Clinical Notes via Large Language Models”
[LINK]
http://arxiv.org/abs/2408.01933v5
[DATE]
2025-06-24 11:31:03+08:00
[CATEGORIES]
cs.CL
Disentangling Reasoning and Knowledge in Medical Large Language Models
[AUTHORS]
Rahul Thapa, Qingyang Wu, Kevin Wu, Harrison Zhang, Angela Zhang, Eric Wu, Haotian Ye, Suhana Bedi, Nevin Aresh, Joseph Boen, Shriya Reddy, Ben Athiwaratkun, Shuaiwen Leon Song, James Zou
[ABSTRACT]
Medical reasoning in large language models (LLMs) aims to emulate clinicians’
diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and
PubMedQA often mix reasoning with factual recall. We address this by separating
11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using
a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human
performance. Our analysis shows that only 32.8 percent of questions require
complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1)
and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent
gaps between knowledge and reasoning performance. For example, HuatuoGPT-o1
scores 56.9 on knowledge but only 44.8 on reasoning. In adversarial tests where
models are misled with incorrect initial reasoning, biomedical models degrade
sharply, while larger or RL-trained general models show more robustness. To
address this, we train BioMed-R1 using fine-tuning and reinforcement learning
on reasoning-heavy examples. It achieves the strongest performance among
similarly sized models. Further gains may come from incorporating clinical case
reports and training with adversarial and backtracking scenarios.
[LINK]
http://arxiv.org/abs/2505.11462v2
[DATE]
2025-06-24 11:27:30+08:00
[CATEGORIES]
cs.CL
EmoStage: A Framework for Accurate Empathetic Response Generation via Perspective-Taking and Phase Recognition
[AUTHORS]
Zhiyang Qi, Keiko Takamizo, Mariko Ukiyo, Michimasa Inaba
[ABSTRACT]
The rising demand for mental health care has fueled interest in AI-driven
counseling systems. While large language models (LLMs) offer significant
potential, current approaches face challenges, including limited understanding
of clients’ psychological states and counseling stages, reliance on
high-quality training data, and privacy concerns associated with commercial
deployment. To address these issues, we propose EmoStage, a framework that
enhances empathetic response generation by leveraging the inference
capabilities of open-source LLMs without additional training data. Our
framework introduces perspective-taking to infer clients’ psychological states
and support needs, enabling the generation of emotionally resonant responses.
In addition, phase recognition is incorporated to ensure alignment with the
counseling process and to prevent contextually inappropriate or inopportune
responses. Experiments conducted in both Japanese and Chinese counseling
settings demonstrate that EmoStage improves the quality of responses generated
by base models and performs competitively with data-driven methods.
[LINK]
http://arxiv.org/abs/2506.19279v1
[DATE]
2025-06-24 11:18:37+08:00
[CATEGORIES]
cs.CL
Process Reward Models That Think
[AUTHORS]
Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
[ABSTRACT]
Step-by-step verifiers – also known as process reward models (PRMs) – are a
key ingredient for test-time scaling. PRMs require step-level supervision,
making them expensive to train. This work aims to build data-efficient PRMs as
verbalized step-wise reward models that verify every step in the solution by
generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long
CoT verifier fine-tuned on orders of magnitude fewer process labels than those
required by discriminative PRMs. Our approach capitalizes on the inherent
reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and
discriminative verifiers – using only 1% of the process labels in PRM800K –
across several challenging benchmarks. Specifically, ThinkPRM beats the
baselines on ProcessBench, MATH-500, and AIME ’24 under best-of-N selection and
reward-guided search. In an out-of-domain evaluation on a subset of
GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers
trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the
same token budget, ThinkPRM scales up verification compute more effectively
compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of
ProcessBench. Our work highlights the value of generative, long CoT PRMs that
can scale test-time compute for verification while requiring minimal
supervision for training. Our code, data, and models will be released at
https://github.com/mukhal/thinkprm.
[LINK]
http://arxiv.org/abs/2504.16828v3
[DATE]
2025-06-24 11:05:02+08:00
[CATEGORIES]
cs.LG
cs.CL
Personality Prediction from Life Stories using Language Models
[AUTHORS]
Rasiq Hussain, Jerry Ma, Rithik Khandelwal, Joshua Oltmanns, Mehak Gupta
[ABSTRACT]
Natural Language Processing (NLP) offers new avenues for personality
assessment by leveraging rich, open-ended text, moving beyond traditional
questionnaires. In this study, we address the challenge of modeling long
narrative interviews, each exceeding 2000 tokens, to predict Five-Factor
Model (FFM) personality traits. We propose a two-step approach: first, we
extract contextual embeddings using sliding-window fine-tuning of pretrained
language models; then, we apply Recurrent Neural Networks (RNNs) with attention
mechanisms to integrate long-range dependencies and enhance interpretability.
This hybrid method effectively bridges the strengths of pretrained transformers
and sequence modeling to handle long-context data. Through ablation studies and
comparisons with state-of-the-art long-context models such as LLaMA and
Longformer, we demonstrate improvements in prediction accuracy, efficiency, and
interpretability. Our results highlight the potential of combining
language-based features with long-context modeling to advance personality
assessment from life narratives.
[COMMENTS]
13 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.19258v1
[DATE]
2025-06-24 10:39:06+08:00
[CATEGORIES]
cs.CL
cs.LG
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
[AUTHORS]
Yinan Xia, Yilei Jiang, Yingshui Tan, Xiaoyong Zhu, Xiangyu Yue, Bo Zheng
[ABSTRACT]
Vision-Language Models (VLMs) have achieved remarkable progress in multimodal
reasoning tasks through enhanced chain-of-thought capabilities. However, this
advancement also introduces novel safety risks, as these models become
increasingly vulnerable to harmful multimodal prompts that can trigger
unethical or unsafe behaviors. Existing safety alignment approaches, primarily
designed for unimodal language models, fall short in addressing the complex and
nuanced threats posed by multimodal inputs. Moreover, current safety datasets
lack the fine-grained, policy-grounded reasoning required to robustly align
reasoning-capable VLMs. In this work, we introduce MSR-Align, a high-quality
Multimodal Safety Reasoning dataset tailored to bridge this gap. MSR-Align
supports fine-grained, deliberative reasoning over standardized safety policies
across both vision and text modalities. Our data generation pipeline emphasizes
multimodal diversity, policy-grounded reasoning, and rigorous quality filtering
using strong multimodal judges. Extensive experiments demonstrate that
fine-tuning VLMs on MSR-Align substantially improves robustness against both
textual and vision-language jailbreak attacks, while preserving or enhancing
general reasoning performance. MSR-Align provides a scalable and effective
foundation for advancing the safety alignment of reasoning-capable VLMs. Our
dataset is made publicly available at
https://huggingface.co/datasets/Leigest/MSR-Align.
[LINK]
http://arxiv.org/abs/2506.19257v1
[DATE]
2025-06-24 10:37:59+08:00
[CATEGORIES]
cs.CL
Position: Machine Learning Conferences Should Establish a “Refutations and Critiques” Track
[AUTHORS]
Rylan Schaeffer, Joshua Kazdan, Yegor Denisov-Blanch, Brando Miranda, Matthias Gerstgrasser, Susan Zhang, Andreas Haupt, Isha Gupta, Elyas Obbad, Jesse Dodge, Jessica Zosa Forde, Koustuv Sinha, Francesco Orabona, Sanmi Koyejo, David Donoho
[ABSTRACT]
Science progresses by iteratively advancing and correcting humanity’s
understanding of the world. In machine learning (ML) research, rapid
advancements have led to an explosion of publications, but have also led to
misleading, incorrect, flawed or perhaps even fraudulent studies being accepted
and sometimes highlighted at ML conferences due to the fallibility of peer
review. While such mistakes are understandable, ML conferences do not offer
robust processes to help the field systematically correct when such errors are
made. This position paper argues that ML conferences should establish a
dedicated “Refutations and Critiques” (R & C) Track. This R & C Track would
provide a high-profile, reputable platform to support vital research that
critically challenges prior research, thereby fostering a dynamic
self-correcting research ecosystem. We discuss key considerations including
track design, review principles, potential pitfalls, and provide an
illustrative example submission concerning a recent ICLR 2025 Oral. We conclude
that ML conferences should create official, reputable mechanisms to help ML
research self-correct.
[LINK]
http://arxiv.org/abs/2506.19882v1
[DATE]
2025-06-24 10:19:30+08:00
[CATEGORIES]
cs.LG
cs.CL
Augmenting Multi-Agent Communication with State Delta Trajectory
[AUTHORS]
Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai
[ABSTRACT]
Multi-agent techniques such as role playing or multi-turn debates have been
shown to be effective in improving the performance of large language models
(LLMs) in downstream tasks. Despite their differences in workflows, existing
LLM-based multi-agent systems mostly use natural language for agent
communication. While this is appealing for its simplicity and interpretability,
it also introduces inevitable information loss as one model must downsample
its continuous state vectors to concrete tokens before transferring them to the
other model. Such losses are particularly significant when the information to
transfer is not simple facts, but reasoning logics or abstractive thoughts. To
tackle this problem, we propose a new communication protocol that transfers
both natural language tokens and token-wise state transition trajectory from
one agent to another. Particularly, compared to the actual state value, we find
that the sequence of state changes in LLMs after generating each token can
better reflect the information hidden behind the inference process, so we
propose a State Delta Encoding (SDE) method to represent state transition
trajectories. The experimental results show that multi-agent systems with SDE
achieve SOTA performance compared to other communication protocols,
particularly in tasks that involve complex reasoning. This shows the potential
of communication augmentation for LLM-based multi-agent systems.
[COMMENTS]
22 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.19209v1
[DATE]
2025-06-24 08:38:25+08:00
[CATEGORIES]
cs.CL
Bayesian Evolutionary Swarm Architecture: A Formal Epistemic System Grounded in Truth-Based Competition
[AUTHORS]
Craig Steven Wright
[ABSTRACT]
We introduce a mathematically rigorous framework for an artificial
intelligence system composed of probabilistic agents evolving through
structured competition and belief revision. The architecture, grounded in
Bayesian inference, measure theory, and population dynamics, defines agent
fitness as a function of alignment with a fixed external oracle representing
ground truth. Agents compete in a discrete-time environment, adjusting
posterior beliefs through observed outcomes, with higher-rated agents
reproducing and lower-rated agents undergoing extinction. Ratings are updated
via pairwise truth-aligned utility comparisons, and belief updates preserve
measurable consistency and stochastic convergence. We introduce hash-based
cryptographic identity commitments to ensure traceability, alongside causal
inference operators using do-calculus. Formal theorems on convergence,
robustness, and evolutionary stability are provided. The system establishes
truth as an evolutionary attractor, demonstrating that verifiable knowledge
arises from adversarial epistemic pressure within a computable, self-regulating
swarm.
[COMMENTS]
83 pages, 14 sections, 92 formal results, no prior conference
publication
[LINK]
http://arxiv.org/abs/2506.19191v1
[DATE]
2025-06-24 07:27:44+08:00
[CATEGORIES]
cs.CL
Prompt, Translate, Fine-Tune, Re-Initialize, or Instruction-Tune? Adapting LLMs for In-Context Learning in Low-Resource Languages
[AUTHORS]
Christopher Toukmaji, Jeffrey Flanigan
[COMMENTS]
Accepted to ACL GEM 2025
[LINK]
http://arxiv.org/abs/2506.19187v1
[DATE]
2025-06-24 07:22:11+08:00
[CATEGORIES]
cs.CL
Transferring Features Across Language Models With Model Stitching
[AUTHORS]
Alan Chen, Jack Merullo, Alessandro Stolfo, Ellie Pavlick
[ABSTRACT]
In this work, we demonstrate that affine mappings between residual streams of
language models are a cheap way to effectively transfer represented features
between models. We apply this technique to transfer the weights of Sparse
Autoencoders (SAEs) between models of different sizes to compare their
representations. We find that small and large models learn similar
representation spaces, which motivates training expensive components like SAEs
on a smaller model and transferring to a larger model at a FLOPs savings. In
particular, using a small-to-large transferred SAE as initialization can lead
to 50% cheaper training runs when training SAEs on larger models. Next, we show
that transferred probes and steering vectors can effectively recover ground
truth performance. Finally, we dive deeper into feature-level transferability,
finding that semantic and structural features transfer noticeably differently
while specific classes of functional features have their roles faithfully
mapped. Overall, our findings illustrate similarities and differences in the
linear representation spaces of small and large models and demonstrate a method
for improving the training efficiency of SAEs.
[LINK]
http://arxiv.org/abs/2506.06609v2
[DATE]
2025-06-24 07:21:57+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data
[AUTHORS]
Yun Tang, Eesung Kim, Vijendra Raj Apsingekar
[ABSTRACT]
A joint speech and text optimization method is proposed for hybrid transducer
and attention-based encoder decoder (TAED) modeling to leverage large amounts
of text corpus and enhance ASR accuracy. The joint TAED (J-TAED) is trained
with both speech and text input modalities together, while it only takes speech
data as input during inference. The trained model can unify the internal
representations from different modalities, and be further extended to
text-based domain adaptation. It can effectively alleviate data scarcity for
mismatched-domain tasks since no speech data is required. Our experiments show
that J-TAED successfully integrates speech and linguistic information into one
model and reduces the WER by 5.8% to 12.8% on the LibriSpeech dataset. The model
is also evaluated on two out-of-domain datasets: one focused on finance and the
other on named entities. The text-based domain adaptation brings 15.3% and 17.8%
WER reduction on those two datasets respectively.
[COMMENTS]
Accepted by Interspeech 2025
[LINK]
http://arxiv.org/abs/2506.19159v1
[DATE]
2025-06-24 05:51:39+08:00
[CATEGORIES]
cs.CL
ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs
[AUTHORS]
Hongyi Liu, Rajarshi Saha, Zhen Jia, Youngsuk Park, Jiaji Huang, Shoham Sabach, Yu-Xiang Wang, George Karypis
[ABSTRACT]
Large Language Models (LLMs) have demonstrated exceptional performance in
natural language processing tasks, yet their massive size makes serving them
inefficient and costly. Semi-structured pruning has emerged as an effective
method for model acceleration, but existing approaches are suboptimal because
they focus on local, layer-wise optimizations using heuristic rules, failing to
leverage global feedback. We present ProxSparse, a learning-based framework for
mask selection enabled by regularized optimization. ProxSparse transforms the
rigid, non-differentiable mask selection process into a smoother optimization
procedure, allowing gradual mask exploration with flexibility. ProxSparse does
not involve additional weight updates once the mask is determined. Our
extensive evaluations on 7 widely used models show that ProxSparse consistently
outperforms previously proposed semi-structured mask selection methods with
significant improvement, demonstrating the effectiveness of our learned
approach towards semi-structured pruning.
[COMMENTS]
ICML25
[LINK]
http://arxiv.org/abs/2502.00258v2
[DATE]
2025-06-24 05:39:56+08:00
[CATEGORIES]
cs.LG
cs.CL
Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
[AUTHORS]
Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang
[ABSTRACT]
Time series data in real-world applications such as healthcare, climate
modeling, and finance are often irregular, multimodal, and messy, with varying
sampling rates, asynchronous modalities, and pervasive missingness. However,
existing benchmarks typically assume clean, regularly sampled, unimodal data,
creating a significant gap between research and real-world deployment. We
introduce Time-IMM, a dataset specifically designed to capture cause-driven
irregularity in multimodal multivariate time series. Time-IMM represents nine
distinct types of time series irregularity, categorized into trigger-based,
constraint-based, and artifact-based mechanisms. Complementing the dataset, we
introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal
time series, enabling asynchronous integration and realistic evaluation.
IMM-TSF includes specialized fusion modules, including a timestamp-to-text
fusion module and a multimodality fusion module, which support both
recency-aware averaging and attention-based integration strategies. Empirical
results demonstrate that explicitly modeling multimodality on irregular time
series data leads to substantial gains in forecasting performance. Time-IMM and
IMM-TSF provide a foundation for advancing time series analysis under
real-world conditions. The dataset is publicly available at
https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the
benchmark library can be accessed at
https://anonymous.4open.science/r/IMMTSF_NeurIPS2025.
[COMMENTS]
This paper is currently under review
[LINK]
http://arxiv.org/abs/2506.10412v2
[DATE]
2025-06-24 05:10:15+08:00
[CATEGORIES]
cs.LG
cs.CL
TRAIL: Trace Reasoning and Agentic Issue Localization
[AUTHORS]
Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian
[ABSTRACT]
The increasing adoption of agentic workflows across diverse domains brings a
critical need to scalably and systematically evaluate the complex traces these
systems generate. Current evaluation methods depend on manual, domain-specific
human analysis of lengthy workflow traces - an approach that does not scale
with the growing complexity and volume of agentic outputs. Error analysis in
these settings is further complicated by the interplay of external tool outputs
and language model reasoning, making it more challenging than traditional
software debugging. In this work, we (1) articulate the need for robust and
dynamic evaluation methods for agentic workflow traces, (2) introduce a formal
taxonomy of error types encountered in agentic systems, and (3) present a set
of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and
grounded in established agentic benchmarks. To ensure ecological validity, we
curate traces from both single and multi-agent systems, focusing on real-world
applications such as software engineering and open-world information retrieval.
Our evaluations reveal that modern long context LLMs perform poorly at trace
debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our
dataset and code are made publicly available to support and accelerate future
research in scalable evaluation for agentic workflows.
[COMMENTS]
Dataset: https://huggingface.co/datasets/PatronusAI/TRAIL
[LINK]
http://arxiv.org/abs/2505.08638v3
[DATE]
2025-06-24 05:06:11+08:00
[CATEGORIES]
cs.CL
ADVLLM: Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
[AUTHORS]
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
[ABSTRACT]
Recent research has shown that Large Language Models (LLMs) are vulnerable to
automated jailbreak attacks, where adversarial suffixes crafted by algorithms
appended to harmful queries bypass safety alignment and trigger unintended
responses. Current methods for generating these suffixes are computationally
expensive and have low Attack Success Rates (ASR), especially against
well-aligned models like Llama2 and Llama3. To overcome these limitations, we
introduce ADV-LLM, an iterative self-tuning process that crafts adversarial
LLMs with enhanced jailbreak ability. Our framework significantly reduces the
computational cost of generating adversarial suffixes while achieving nearly
100% ASR on various open-source LLMs. Moreover, it exhibits strong attack
transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49%
ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving
jailbreak ability, ADV-LLM provides valuable insights for future safety
alignment research through its ability to generate large datasets for studying
LLM safety.
[COMMENTS]
Accepted to NAACL 2025 Main (oral)
[LINK]
http://arxiv.org/abs/2410.18469v4
[DATE]
2025-06-24 04:12:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Small Language Models in the Real World: Insights from Industrial Text Classification
[AUTHORS]
Lujun Li, Lama Sleem, Niccolo’ Gentile, Geoffrey Nichil, Radu State
[ABSTRACT]
With the emergence of ChatGPT, Transformer models have significantly advanced
text classification and related tasks. Decoder-only models such as Llama
exhibit strong performance and flexibility, yet they suffer from inefficiency
on inference due to token-by-token generation, and their effectiveness in text
classification tasks heavily depends on prompt quality. Moreover, their
substantial GPU resource requirements often limit widespread adoption. Thus,
the question of whether smaller language models are capable of effectively
handling text classification tasks emerges as a topic of significant interest.
However, the selection of appropriate models and methodologies remains largely
underexplored. In this paper, we conduct a comprehensive evaluation of prompt
engineering and supervised fine-tuning methods for transformer-based text
classification. Specifically, we focus on practical industrial scenarios,
including email classification, legal document categorization, and the
classification of extremely long academic texts. We examine the strengths and
limitations of smaller models, with particular attention to both their
performance and their efficiency in Video Random-Access Memory (VRAM)
utilization, thereby providing valuable insights for the local deployment and
application of compact models in industrial settings.
[COMMENTS]
This paper has been accepted as a conference paper in the Industry
Track of the 63rd Annual Meeting of the Association for Computational
Linguistics (ACL)
[LINK]
http://arxiv.org/abs/2505.16078v3
[DATE]
2025-06-24 04:09:36+08:00
[CATEGORIES]
cs.CL
Language Models Might Not Understand You: Evaluating Theory of Mind via Story Prompting
[AUTHORS]
Nathaniel Getachew, Abulhair Saparov
[ABSTRACT]
We introduce StorySim, a programmable framework for synthetically
generating stories to evaluate the theory of mind (ToM) and world modeling (WM)
capabilities of large language models (LLMs). Unlike prior benchmarks that may
suffer from contamination in pretraining data, StorySim produces
novel, compositional story prompts anchored by a highly controllable
Storyboard, enabling precise manipulation of character perspectives
and events. We use this framework to design first- and second-order ToM tasks
alongside WM tasks that control for the ability to track and model mental
states. Our experiments across a suite of state-of-the-art LLMs reveal that
most models perform better on WM tasks than ToM tasks, and that models tend to
perform better reasoning with humans compared to inanimate objects.
Additionally, our framework enabled us to find evidence of heuristic behavior
such as recency bias and an over-reliance on earlier events in the story. All
code for generating data and evaluations is freely available.
[COMMENTS]
14 pages, 11 figures
[LINK]
http://arxiv.org/abs/2506.19089v1
[DATE]
2025-06-24 04:06:53+08:00
[CATEGORIES]
cs.CL
HAWAII: Hierarchical Visual Knowledge Transfer for Efficient Vision-Language Models
[AUTHORS]
Yimu Wang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki
[ABSTRACT]
Improving the visual understanding ability of vision-language models (VLMs)
is crucial for enhancing their performance across various tasks. While using
multiple pretrained visual experts has shown great promise, it often incurs
significant computational costs during training and inference. To address this
challenge, we propose HAWAII, a novel framework that distills knowledge from
multiple visual experts into a single vision encoder, enabling it to inherit
the complementary strengths of several experts with minimal computational
overhead. To mitigate conflicts among different teachers and switch between
different teacher-specific knowledge, instead of using a fixed set of adapters
for multiple teachers, we propose to use teacher-specific Low-Rank Adaptation
(LoRA) adapters with a corresponding router. Each adapter is aligned with a
specific teacher, avoiding noisy guidance during distillation. To enable
efficient knowledge distillation, we propose fine-grained and coarse-grained
distillation. At the fine-grained level, token importance scores are employed
to emphasize the most informative tokens from each teacher adaptively. At the
coarse-grained level, we summarize the knowledge from multiple teachers and
transfer it to the student using a set of general-knowledge LoRA adapters with
a router. Extensive experiments on various vision-language tasks demonstrate
the superiority of HAWAII, compared to the popular open-source VLMs.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2506.19072v1
[DATE]
2025-06-24 03:43:25+08:00
[CATEGORIES]
cs.CL
NLPnorth @ TalentCLEF 2025: Comparing Discriminative, Contrastive, and Prompt-Based Methods for Job Title and Skill Matching
[AUTHORS]
Mike Zhang, Rob van der Goot
[ABSTRACT]
Matching job titles is a highly relevant task in the computational job market
domain, as it improves, e.g., automatic candidate matching, career path
prediction, and job market analysis. Furthermore, aligning job titles to job
skills can be considered an extension to this task, with similar relevance for
the same downstream tasks. In this report, we outline NLPnorth’s submission to
TalentCLEF 2025, which includes both of these tasks: Multilingual Job Title
Matching, and Job Title-Based Skill Prediction. For both tasks we compare
(fine-tuned) classification-based, (fine-tuned) contrastive-based, and
prompting methods. We observe that for Task A, our prompting approach performs
best with an average of 0.492 mean average precision (MAP) on test data,
averaged over English, Spanish, and German. For Task B, we obtain an MAP of
0.290 on test data with our fine-tuned classification-based approach.
Additionally, we made use of extra data by pulling all the language-specific
titles and corresponding descriptions from ESCO for each job and skill.
Overall, we find that the largest multilingual language models perform best for
both tasks. Per the provisional results and only counting the unique teams, the
ranking on Task A is 5th/20 and for Task B 3rd/14.
[COMMENTS]
TalentCLEF 2025
[LINK]
http://arxiv.org/abs/2506.19058v1
[DATE]
2025-06-24 03:18:25+08:00
[CATEGORIES]
cs.CL
Impact of Visual Context on Noisy Multimodal NMT: An Empirical Study for English to Indian Languages
[AUTHORS]
Baban Gain, Dibyanayan Bandyopadhyay, Samrat Mukherjee, Chandranath Adak, Asif Ekbal
[ABSTRACT]
Neural Machine Translation (NMT) has made remarkable progress using
large-scale textual data, but the potential of incorporating multimodal inputs,
especially visual information, remains underexplored in high-resource settings.
While prior research has focused on using multimodal data in low-resource
scenarios, this study examines how image features impact translation when added
to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study
finds that images might be redundant in this context. Additionally, the
research introduces synthetic noise to assess whether images help the model
handle textual noise. Multimodal models slightly outperform text-only models in
noisy settings, even when random images are used. The study’s experiments
translate from English to Hindi, Bengali, and Malayalam, significantly
outperforming state-of-the-art benchmarks. Interestingly, the effect of visual
context varies with the level of source text noise: no visual context works
best for non-noisy translations, cropped image features are optimal for low
noise, and full image features perform better in high-noise scenarios. This
sheds light on the role of visual context, especially in noisy settings, and
opens up a new research direction for Noisy Neural Machine Translation in
multimodal setups. The research emphasizes the importance of combining visual
and textual information to improve translation across various environments. Our
code is publicly available at https://github.com/babangain/indicMMT.
[LINK]
http://arxiv.org/abs/2308.16075v2
[DATE]
2025-06-24 03:07:19+08:00
[CATEGORIES]
cs.CL
Self-reflecting Large Language Models: A Hegelian Dialectical Approach
[AUTHORS]
Sara Abdali, Can Goksen, Michael Solodko, Saeed Amizadeh, Julie E. Maybee, Kazuhito Koishida
[ABSTRACT]
Investigating NLP through a philosophical lens has recently caught
researchers’ eyes, as it bridges computational methods with classical schools
of philosophy. This paper introduces a philosophical framework inspired by the
Hegelian Dialectic to enable LLMs’ self-reflection, utilizing a
self-dialectical approach to emulate internal critiques and synthesize new
scientific ideas (spanning domains such as mathematics, physics, and more).
Additionally, we explore the effect of generation temperature in LLMs by
introducing a dynamic annealing approach, which encourages creativity in the
early stages and gradually focuses on refinement and nuance, as well as a
constant-temperature strategy. Furthermore, we implement a Multi-Agent Majority
Voting (MAMV) strategy to assess the validity and novelty of the generated
ideas, which proves useful in the absence of domain experts. We also evaluate
the effectiveness of our method in generating novel scientific ideas and
improving LLMs’ reasoning capabilities. Our experiments demonstrate promising
results in ideation, along with significant improvements in mathematical and
symbolic reasoning.
[LINK]
http://arxiv.org/abs/2501.14917v6
[DATE]
2025-06-24 02:59:06+08:00
[CATEGORIES]
cs.CL
cs.LG
Plan for Speed – Dilated Scheduling for Masked Diffusion Language Models
[AUTHORS]
Omer Luxembourg, Haim Permuter, Eliya Nachmani
[ABSTRACT]
Masked diffusion language models (MDLM) have shown strong promise for
non-autoregressive text generation, yet existing samplers act as implicit
planners, selecting tokens to unmask via denoiser confidence or entropy scores.
Such heuristics falter under parallel unmasking: they ignore pairwise
interactions between tokens and cannot account for dependencies when unmasking
multiple positions at once, limiting their inference time to that of traditional
auto-regressive (AR) models. We introduce the Dilated-scheduled Unmasking
Strategy (DUS), an inference-only, planner-model-free method that requires no
additional training. DUS leverages a first-order Markov assumption to partition
sequence positions into dilation-based groups of non-adjacent tokens, enabling
independent, parallel unmasking steps that respect local context and minimize
the joint entropy of each iteration step. Unlike semi-AR block approaches
(e.g., LLADA and Dream) that still invoke the denoiser per block, DUS reduces
the number of denoiser calls to O(log B) per generation block - yielding
substantial speedup over the O(B) run time of state-of-the-art diffusion
models, where B is the block size in the semi-AR inference process. In
experiments on math (GSM8K) and code completion (HumanEval, MBPP) benchmarks -
domains suited to non-ordinal generation - DUS improves scores over a parallel
confidence-based planner, without modifying the underlying denoiser. DUS offers
a lightweight, budget-aware approach to efficient, high-quality text
generation, paving the way to unlock the true capabilities of MDLMs.
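To make the O(log B) claim concrete, here is a minimal sketch of one possible dilation-based partition of a block. The coarse-to-fine pattern and the function name `dilated_groups` are assumptions for illustration; the paper's exact grouping may differ.

```python
def dilated_groups(block_size):
    """Partition positions 0..block_size-1 into coarse-to-fine groups.

    Each group is unmasked in one parallel denoiser call, and positions
    inside a group are never adjacent. Assumes block_size is a power of two.
    """
    if block_size < 1 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    groups = [[0]]
    stride = block_size
    while stride > 1:
        half = stride // 2
        # Positions halfway between those already unmasked at this stride.
        groups.append(list(range(half, block_size, stride)))
        stride = half
    return groups
```

Under this scheme a block of size B needs log2(B) + 1 denoiser calls, compared with B calls when one token is unmasked at a time.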
[LINK]
http://arxiv.org/abs/2506.19037v1
[DATE]
2025-06-24 02:49:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
[AUTHORS]
Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith
[ABSTRACT]
Modern tokenizers employ deterministic algorithms to map text into a single
“canonical” token sequence, yet the same string can be encoded as many
non-canonical tokenizations using the tokenizer vocabulary. In this work, we
investigate the robustness of LMs to text encoded with non-canonical
tokenizations entirely unseen during training. Surprisingly, when evaluated
across 20 benchmarks, we find that instruction-tuned models retain up to 93.4%
of their original performance when given a randomly sampled tokenization, and
90.8% with character-level tokenization. We see that overall stronger models
tend to be more robust, and robustness diminishes as the tokenization departs
farther from the canonical form. Motivated by these results, we then identify
settings where non-canonical tokenization schemes can improve performance,
finding that character-level segmentation improves string manipulation and code
understanding tasks by up to +14%, and right-aligned digit grouping enhances
large-number arithmetic by +33%. Finally, we investigate the source of this
robustness, finding that it arises in the instruction-tuning phase. We show
that while both base and post-trained models grasp the semantics of
non-canonical tokenizations (perceiving them as containing misspellings), base
models try to mimic the imagined mistakes and degenerate into nonsensical
output, while post-trained models are committed to fluent responses. Overall,
our findings suggest that models are less tied to their tokenizer than
previously believed, and demonstrate the promise of intervening on tokenization
at inference time to boost performance.
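As a toy illustration of what a non-canonical tokenization is, the sketch below segments a string into random vocabulary pieces and, separately, into single characters. The vocabulary and the left-to-right sampling procedure are hypothetical; this is not the authors' sampling method, and it does not sample uniformly over all segmentations.

```python
import random


def sample_tokenization(text, vocab, rng):
    """Sample a random left-to-right segmentation of text into vocab pieces.

    Assumes every single character of text is in vocab, so a valid
    segmentation always exists.
    """
    pieces, i = [], 0
    while i < len(text):
        candidates = [text[i:j] for j in range(i + 1, len(text) + 1)
                      if text[i:j] in vocab]
        if not candidates:
            raise ValueError(f"no vocabulary piece matches at position {i}")
        piece = rng.choice(candidates)
        pieces.append(piece)
        i += len(piece)
    return pieces


def char_tokenization(text):
    """Character-level segmentation: one piece per character."""
    return list(text)


# Hypothetical toy vocabulary: all single letters plus a few merges.
toy_vocab = set("tokenizer") | {"to", "ken", "token", "izer", "er"}
pieces = sample_tokenization("tokenizer", toy_vocab, random.Random(0))
```

Every segmentation decodes to the same string, which is why a model can in principle be evaluated on any of them.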
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2506.19004v1
[DATE]
2025-06-24 02:02:26+08:00
[CATEGORIES]
cs.CL
Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge
[AUTHORS]
Sahil Kale, Vijaykant Nadadur
[COMMENTS]
Accepted to the Pre-ACL Workshop 2025, Copenhagen
[LINK]
http://arxiv.org/abs/2506.18998v1
[DATE]
2025-06-24 02:01:16+08:00
[CATEGORIES]
cs.CL
Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
[AUTHORS]
Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
[ABSTRACT]
This paper presents a multimodal framework that attempts to unify visual
understanding and generation within a shared discrete semantic representation.
At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into
discrete tokens using a text-aligned codebook projected from a large language
model’s (LLM) vocabulary. By integrating vision and text into a unified space
with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input
and output through a shared interface, without the need for modality-specific
designs. Additionally, we propose scale-adaptive encoding and decoding to
balance efficiency and visual detail, along with a generative de-tokenizer to
produce high-fidelity visual outputs. To address diverse decoding needs, we
utilize two complementary de-tokenizers: a fast autoregressive model and a
diffusion-based model. To enhance modality fusion, we investigate advanced
pre-training tasks, demonstrating improvements in both visual understanding and
generation. Experiments across benchmarks show that Tar matches or surpasses
existing multimodal LLM methods, achieving faster convergence and greater
training efficiency. Code, models, and data are available at
https://tar.csuhan.com
[COMMENTS]
Project page: https://tar.csuhan.com
[LINK]
http://arxiv.org/abs/2506.18898v1
[DATE]
2025-06-24 01:59:14+08:00
[CATEGORIES]
cs.CL
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
[AUTHORS]
Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
[ABSTRACT]
Process Reward Models (PRMs) have recently emerged as a powerful framework
for supervising intermediate reasoning steps in large language models (LLMs).
Previous PRMs are primarily trained on model final output responses and
struggle to evaluate intermediate thinking trajectories robustly, especially in
the emerging setting of trajectory-response outputs generated by frontier
reasoning models like DeepSeek-R1. In this work, we introduce ReasonFlux-PRM, a
novel trajectory-aware PRM explicitly designed to evaluate the
trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both
step-level and trajectory-level supervision, enabling fine-grained reward
assignment aligned with structured chain-of-thought data. We adapt
ReasonFlux-PRM to support reward supervision under both offline and online
settings, including (i) selecting high-quality model distillation data for
downstream supervised fine-tuning of smaller models, (ii) providing dense
process-level rewards for policy optimization during reinforcement learning,
and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results
on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond
demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs
(e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our
derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving
average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement
learning, and 6.3% in test-time scaling. We also release our efficient
ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment.
Projects: https://github.com/Gen-Verse/ReasonFlux
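The reward-guided Best-of-N use case can be sketched as follows. The way step-level and trajectory-level rewards are blended here (a weighted mean with a hypothetical weight `alpha`) is an assumption for illustration, not the aggregation used by ReasonFlux-PRM.

```python
def combined_reward(step_rewards, trajectory_reward, alpha=0.5):
    """Blend the mean step-level reward with a trajectory-level reward.

    alpha is a hypothetical mixing weight in [0, 1].
    """
    step_mean = sum(step_rewards) / len(step_rewards)
    return (1 - alpha) * step_mean + alpha * trajectory_reward


def best_of_n(candidates, alpha=0.5):
    """Pick the response with the highest combined reward.

    candidates: list of (response, step_rewards, trajectory_reward) tuples.
    """
    return max(candidates, key=lambda c: combined_reward(c[1], c[2], alpha))[0]
```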
[COMMENTS]
Codes and Models: https://github.com/Gen-Verse/ReasonFlux
[LINK]
http://arxiv.org/abs/2506.18896v1
[DATE]
2025-06-24 01:59:02+08:00
[CATEGORIES]
cs.CL
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
[AUTHORS]
Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song
[LINK]
http://arxiv.org/abs/2506.18880v1
[DATE]
2025-06-24 01:51:40+08:00
[CATEGORIES]
cs.CL
CommVQ: Commutative Vector Quantization for KV Cache Compression
[AUTHORS]
Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
[ABSTRACT]
Large Language Models (LLMs) are increasingly used in applications requiring
long context lengths, but the key-value (KV) cache often becomes a memory
bottleneck on GPUs as context grows. To address this, we propose Commutative
Vector Quantization (CommVQ) to significantly reduce memory usage for
long-context LLM inference. We first introduce additive quantization with a
lightweight encoder and codebook to compress the KV cache, which can be decoded
via simple matrix multiplication. To further reduce computational costs during
decoding, we design the codebook to be commutative with Rotary Position
Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm.
This enables efficient integration of decoding into the self-attention
mechanism. Our approach achieves high accuracy with additive quantization and
low overhead via the RoPE-commutative codebook. Experiments on long-context
benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5%
with 2-bit quantization, while outperforming state-of-the-art KV cache
quantization methods. Notably, it enables 1-bit KV cache quantization with
minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context
length on a single RTX 4090 GPU. The source code is available at:
https://github.com/UMass-Embodied-AGI/CommVQ.
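The 87.5% figure follows directly from the bit widths: 1 - 2/16 = 0.875. The sketch below works through that arithmetic and a rough cache-size estimate. The model configuration (32 layers, 8 KV heads, head dimension 128) is an assumed LLaMA-3.1-8B-style setup, and codebook and encoder overhead are ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Approximate KV cache size in bytes, ignoring codebook overhead."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # keys + values
    return values * bits_per_value / 8


def reduction_vs_fp16(bits_per_value):
    """Fraction of FP16 cache memory saved at the given bit width."""
    return 1 - bits_per_value / 16


# Assumed configuration (not stated in the abstract).
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT = 32, 8, 128, 128 * 1024
GIB = 2 ** 30
fp16_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 16) / GIB
one_bit_gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, 1) / GIB
```

Under these assumptions the FP16 cache at a 128K context is 16 GiB and the 1-bit cache is 1 GiB, which is consistent with fitting alongside the weights on a 24 GB GPU only after aggressive quantization.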
[COMMENTS]
ICML 2025 poster
[LINK]
http://arxiv.org/abs/2506.18879v1
[DATE]
2025-06-24 01:50:11+08:00
[CATEGORIES]
cs.CL
Performance of diverse evaluation metrics in NLP-based assessment and text generation of consumer complaints
[AUTHORS]
Peiheng Gao, Chen Yang, Ning Sun, Ričardas Zitikis
[ABSTRACT]
Machine learning (ML) has significantly advanced text classification by
enabling automated understanding and categorization of complex, unstructured
textual data. However, accurately capturing nuanced linguistic patterns and
contextual variations inherent in natural language, particularly within
consumer complaints, remains a challenge. This study addresses these issues by
incorporating human-experience-trained algorithms that effectively recognize
subtle semantic differences crucial for assessing consumer relief eligibility.
Furthermore, we propose integrating synthetic data generation methods that
utilize expert evaluations of generative adversarial networks and are refined
through expert annotations. By combining expert-trained classifiers with
high-quality synthetic data, our research seeks to significantly enhance
machine learning classifier performance, reduce dataset acquisition costs, and
improve overall evaluation metrics and robustness in text classification tasks.
[LINK]
http://arxiv.org/abs/2506.21623v1
[DATE]
2025-06-24 01:26:38+08:00
[CATEGORIES]
cs.CL
cs.LG
A Comment On “The Illusion of Thinking”: Reframing the Reasoning Cliff as an Agentic Gap
[AUTHORS]
Sheraz Khan, Subha Madhavan, Kannan Natarajan
[ABSTRACT]
The recent work by Shojaee et al. (2025), titled “The Illusion of Thinking:
Understanding the Strengths and Limitations of Reasoning Models via the Lens of
Problem Complexity”, presents a compelling empirical finding, a reasoning cliff,
where the performance of Large Reasoning Models (LRMs) collapses beyond a
specific complexity threshold, which the authors posit as an intrinsic scaling
limitation of Chain-of-Thought (CoT) reasoning. This commentary, while
acknowledging the study’s methodological rigor, contends that this conclusion
is confounded by experimental artifacts. We argue that the observed failure is
not evidence of a fundamental cognitive boundary, but rather a predictable
outcome of system-level constraints in the static, text-only evaluation
paradigm, including tool use restrictions, context window recall issues, the
absence of crucial cognitive baselines, inadequate statistical reporting, and
output generation limits. We reframe this performance collapse through the lens
of an agentic gap, asserting that the models are not failing at reasoning, but
at execution within a profoundly restrictive interface. We empirically
substantiate this critique by demonstrating a striking reversal. A model,
initially declaring a puzzle impossible when confined to text-only generation,
now employs agentic tools to not only solve it but also master variations of
complexity far beyond the reasoning cliff it previously failed to surmount.
Additionally, our empirical analysis of tool-enabled models like o4-mini and
GPT-4o reveals a hierarchy of agentic reasoning, from simple procedural
execution to complex meta-cognitive self-correction, which has significant
implications for how we define and measure machine intelligence. The illusion
of thinking attributed to LRMs is less a reasoning deficit and more a
consequence of an otherwise capable mind lacking the tools for action.
[COMMENTS]
10 pages, 2 figures, Comment on “The Illusion of Thinking:
Understanding the Strengths and Limitations of Reasoning Models via the Lens
of Problem Complexity” (arXiv:2506.06941v1)
[LINK]
http://arxiv.org/abs/2506.18957v1
[DATE]
2025-06-24 01:14:21+08:00
[CATEGORIES]
cs.CL
cs.LG
LongWriter-Zero: Mastering Ultra-Long Text Generation via Reinforcement Learning
[AUTHORS]
Yuhao Wu, Yushi Bai, Zhiqiang Hu, Roy Ka-Wei Lee, Juanzi Li
[ABSTRACT]
Ultra-long generation by large language models (LLMs) is a widely demanded
scenario, yet it remains a significant challenge due to their maximum
generation length limit and overall quality degradation as sequence length
increases. Previous approaches, exemplified by LongWriter, typically rely on
“teaching”, which involves supervised fine-tuning (SFT) on synthetic
long-form outputs. However, this strategy heavily depends on synthetic SFT
data, which is difficult and costly to construct, often lacks coherence and
consistency, and tends to be overly artificial and structurally monotonous. In
this work, we propose an incentivization-based approach that, starting entirely
from scratch and without relying on any annotated or synthetic data, leverages
reinforcement learning (RL) to foster the emergence of ultra-long, high-quality
text generation capabilities in LLMs. We perform RL training starting from a
base model, similar to R1-Zero, guiding it to engage in reasoning that
facilitates planning and refinement during the writing process. To support
this, we employ specialized reward models that steer the LLM towards improved
length control, writing quality, and structural formatting. Experimental
evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B,
consistently outperforms traditional SFT methods on long-form writing tasks,
achieving state-of-the-art results across all metrics on WritingBench and
Arena-Write, and even surpassing 100B+ models such as DeepSeek R1 and
Qwen3-235B. We open-source our data and model checkpoints under
https://huggingface.co/THU-KEG/LongWriter-Zero-32B
[LINK]
http://arxiv.org/abs/2506.18841v1
[DATE]
2025-06-24 00:59:02+08:00
[CATEGORIES]
cs.CL
cs.LG
EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions
[AUTHORS]
Spencer Hong, Meng Luo, Xinyi Wan
[ABSTRACT]
Determining the veracity of atomic claims is an imperative component of many
recently proposed fact-checking systems. Many approaches tackle this problem by
first retrieving evidence by querying a search engine and then performing
classification by providing the evidence set and atomic claim to a large
language model, but this process deviates from what a human would do in order
to perform the task. Recent work attempted to address this issue by proposing
iterative evidence retrieval, allowing for evidence to be collected several
times and only when necessary. Continuing along this line of research, we
propose a novel claim verification system, called EMULATE, which is designed to
better emulate human actions through the use of a multi-agent framework where
each agent performs a small part of the larger task, such as ranking search
results according to predefined criteria or evaluating webpage content.
Extensive experiments on several benchmarks show clear improvements over prior
work, demonstrating the efficacy of our new multi-agent framework.
[COMMENTS]
FEVER 2025 (co-located with ACL 2025)
[LINK]
http://arxiv.org/abs/2505.16576v2
[DATE]
2025-06-24 00:58:51+08:00
[CATEGORIES]
cs.CL
STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning
[AUTHORS]
Aryasomayajula Ram Bharadwaj
[ABSTRACT]
Large Language Models employing extended chain-of-thought (CoT) reasoning
often suffer from the overthinking phenomenon, generating excessive and
redundant reasoning steps that increase computational costs while potentially
degrading performance. While recent work has explored static steering
approaches to mitigate this issue, they lack the adaptability to dynamically
adjust intervention strength based on real-time reasoning quality. We propose
STUPID (Steering Token Usage via PID controller), a novel training-free method
that employs a PID controller to dynamically modulate activation steering
strength during inference. Our approach combines a chunk-level classifier for
detecting redundant reasoning patterns with a PID control mechanism that
adaptively adjusts steering intensity based on the predicted redundancy
probability. Experimental evaluation on GSM8K demonstrates that STUPID achieves
a 6% improvement in accuracy while reducing token usage by 32%, outperforming
static steering baselines. Our method provides a principled framework for
dynamic reasoning calibration that maintains reasoning quality while
significantly improving computational efficiency.
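A generic discrete PID controller of the kind described can be sketched as below. The gains, the output clamp, and the choice of the predicted redundancy probability as the measured signal against a target setpoint are illustrative assumptions, not the paper's configuration.

```python
class PIDController:
    """Discrete PID controller with output clamped to [out_min, out_max]."""

    def __init__(self, kp, ki, kd, out_min=0.0, out_max=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, setpoint):
        """Return a steering strength from the current redundancy estimate."""
        error = measurement - setpoint  # positive when too redundant
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(self.out_min, min(self.out_max, output))
```

In use, each reasoning chunk would yield a redundancy probability from the classifier, and the returned value would scale the steering vector applied to the activations for the next chunk.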
[LINK]
http://arxiv.org/abs/2506.18831v1
[DATE]
2025-06-24 00:47:19+08:00
[CATEGORIES]
cs.CL
MLLP-VRAIN UPV system for the IWSLT 2025 Simultaneous Speech Translation task
[AUTHORS]
Jorge Iranzo-Sánchez, Javier Iranzo-Sánchez, Adrià Giménez, Jorge Civera, Alfons Juan
[ABSTRACT]
This work describes the participation of the MLLP-VRAIN research group in the
shared task of the IWSLT 2025 Simultaneous Speech Translation track. Our
submission addresses the unique challenges of real-time translation of
long-form speech by developing a modular cascade system that adapts strong
pre-trained models to streaming scenarios. We combine Whisper Large-V3-Turbo
for ASR with the multilingual NLLB-3.3B model for MT, implementing lightweight
adaptation techniques rather than training new end-to-end models from scratch.
Our approach employs document-level adaptation with prefix training to enhance
the MT model’s ability to handle incomplete inputs, while incorporating
adaptive emission policies including a wait-$k$ strategy and RALCP for managing
the translation stream. Specialized buffer management techniques and
segmentation strategies ensure coherent translations across long audio
sequences. Experimental results on the ACL60/60 dataset demonstrate that our
system achieves a favorable balance between translation quality and latency,
with a BLEU score of 31.96 and non-computational-aware StreamLAAL latency of
2.94 seconds. Our final model achieves a preliminary score on the official test
set (IWSLT25Instruct) of 29.8 BLEU. Our work demonstrates that carefully
adapted pre-trained components can create effective simultaneous translation
systems for long-form content without requiring extensive in-domain parallel
data or specialized end-to-end training.
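The wait-k emission policy mentioned above can be sketched in a few lines: read k source tokens first, then alternate one write per additional read until the source is exhausted. The function name and the token-level granularity are assumptions for illustration; the system's actual policy also involves RALCP and buffer management, which are not shown.

```python
def wait_k_actions(k, source_len, target_len):
    """Return the READ/WRITE action sequence of a wait-k policy."""
    actions = []
    read = written = 0
    while written < target_len:
        # Before emitting target token t (0-based), k + t source tokens
        # must have been read, capped at the full source length.
        needed = min(k + written, source_len)
        while read < needed:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
        written += 1
    return actions
```

Smaller k lowers latency because output starts earlier, at the cost of translating with less source context.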
[COMMENTS]
IWSLT 2025 System Description
[LINK]
http://arxiv.org/abs/2506.18828v1
[DATE]
2025-06-24 00:44:01+08:00
[CATEGORIES]
cs.CL
RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
[AUTHORS]
Arjun Mukerji, Michael L. Jackson, Jason Jones, Neil Sanghavi
[ABSTRACT]
Large Language Models (LLMs) have been extensively evaluated for general
summarization tasks as well as medical research assistance, but they have not
been specifically evaluated for the task of summarizing real-world evidence
(RWE) from structured output of RWE studies. We introduce RWESummary, a
proposed addition to the MedHELM framework (Bedi, Cui, Fuentes, Unell et al.,
2025) to enable benchmarking of LLMs for this task. RWESummary includes one
scenario and three evaluations covering major types of errors observed in
summarization of medical research studies and was developed using Atropos
Health proprietary data. Additionally, we use RWESummary to compare the
performance of different LLMs in our internal RWE summarization tool. At the
time of publication, with 13 distinct RWE studies, we found the Gemini 2.5
models performed best overall (both Flash and Pro). We suggest RWESummary as a
novel and useful foundation model benchmark for real-world evidence study
summarization.
[COMMENTS]
24 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.18819v1
[DATE]
2025-06-24 00:28:03+08:00
[CATEGORIES]
cs.CL
Step-by-Step Unmasking for Parameter-Efficient Fine-tuning of Large Language Models
[AUTHORS]
Aradhye Agarwal, Suhas K Ramesh, Ayan Sengupta, Tanmoy Chakraborty
[ABSTRACT]
Fine-tuning large language models (LLMs) on downstream tasks requires
substantial computational resources. Selective PEFT, a class of
parameter-efficient fine-tuning (PEFT) methodologies, aims to mitigate these
computational challenges by selectively fine-tuning only a small fraction of
the model parameters. Although parameter-efficient, these techniques often fail
to match the performance of fully fine-tuned models, primarily due to inherent
biases introduced during parameter selection. Traditional selective PEFT
techniques use a fixed set of parameters selected using different importance
heuristics, failing to capture parameter importance dynamically and often
leading to suboptimal performance. We introduce $\text{ID}^3$, a novel
selective PEFT method that calculates parameter importance continually, and
dynamically unmasks parameters by balancing exploration and exploitation in
parameter selection. Our empirical study on 16 tasks spanning natural language
understanding, mathematical reasoning and summarization demonstrates the
effectiveness of our method compared to fixed-masking selective PEFT
techniques. We analytically show that $\text{ID}^3$ reduces the number of
gradient updates by a factor of two, enhancing computational efficiency. Since
$\text{ID}^3$ is robust to random initialization of neurons and operates
directly on the optimization process, it is highly flexible and can be
integrated with existing additive and reparametrization-based PEFT techniques
such as adapters and LoRA respectively.
[COMMENTS]
15 pages, 7 tables, 9 figures
[LINK]
http://arxiv.org/abs/2408.14470v3
[DATE]
2025-06-24 00:25:27+08:00
[CATEGORIES]
cs.CL
Noise Consistency Training: A Native Approach for One-Step Generator in Learning Additional Controls
[AUTHORS]
Yihong Luo, Shuchen Xue, Tianyang Hu, Jing Tang
[ABSTRACT]
The pursuit of efficient and controllable high-quality content generation
remains a central challenge in artificial intelligence-generated content
(AIGC). While one-step generators, enabled by diffusion distillation
techniques, offer excellent generation quality and computational efficiency,
adapting them to new control conditions, such as structural constraints,
semantic guidelines, or external inputs, poses a significant challenge.
Conventional approaches often necessitate computationally expensive
modifications to the base model and subsequent diffusion distillation. This
paper introduces Noise Consistency Training (NCT), a novel and lightweight
approach to directly integrate new control signals into pre-trained one-step
generators without requiring access to original training images or retraining
the base diffusion model. NCT operates by introducing an adapter module and
employs a noise consistency loss in the noise space of the generator. This loss
aligns the adapted model’s generation behavior across noises that are
conditionally dependent to varying degrees, implicitly guiding it to adhere to
the new control. Theoretically, this training objective can be understood as
minimizing the distributional distance between the adapted generator and the
conditional distribution induced by the new conditions. NCT is modular,
data-efficient, and easily deployable, relying only on the pre-trained one-step
generator and a control signal model. Extensive experiments demonstrate that
NCT achieves state-of-the-art controllable generation in a single forward pass,
surpassing existing multi-step and distillation-based methods in both
generation quality and computational efficiency. Code is available at
https://github.com/Luo-Yihong/NCT
[LINK]
http://arxiv.org/abs/2506.19741v1
[DATE]
2025-06-24 23:58:55+08:00
[CATEGORIES]
cs.LG
Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery
[AUTHORS]
Alejandro Giraldo, Daniel Ruiz, Mariano Caruso, Javier Mancilla, Guido Bellomo
[ABSTRACT]
Quantitative Structure-Activity Relationship (QSAR) modeling is a cornerstone
of computational drug discovery. This research demonstrates the successful
application of a Quantum Multiple Kernel Learning (QMKL) framework to enhance
QSAR classification, showing a notable performance improvement over classical
methods. We apply this methodology to a dataset for identifying DYRK1A kinase
inhibitors. The workflow involves converting SMILES representations into
numerical molecular descriptors, reducing dimensionality via Principal
Component Analysis (PCA), and employing a Support Vector Machine (SVM) trained
on an optimized combination of multiple quantum and classical kernels. By
benchmarking the QMKL-SVM against a classical Gradient Boosting model, we show
that the quantum-enhanced approach achieves a superior AUC score, highlighting
its potential to provide a quantum advantage in challenging cheminformatics
classification tasks.
[LINK]
http://arxiv.org/abs/2506.14920v2
[DATE]
2025-06-24 23:57:50+08:00
[CATEGORIES]
cs.LG
Unscrambling disease progression at scale: fast inference of event permutations with optimal transport
[AUTHORS]
Peter A. Wijeratne, Daniel C. Alexander
[ABSTRACT]
Disease progression models infer group-level temporal trajectories of change
in patients’ features as a chronic degenerative condition plays out. They
provide unique insight into disease biology and staging systems with
individual-level clinical utility. Discrete models consider disease progression
as a latent permutation of events, where each event corresponds to a feature
becoming measurably abnormal. However, permutation inference using traditional
maximum likelihood approaches becomes prohibitive due to combinatoric
explosion, severely limiting model dimensionality and utility. Here we leverage
ideas from optimal transport to model disease progression as a latent
permutation matrix of events belonging to the Birkhoff polytope, facilitating
fast inference via optimisation of the variational lower bound. This enables
inference a factor of 1000 faster than the current state of the art and,
correspondingly, supports models with several orders of magnitude more features
than the current state of the art can consider. Experiments demonstrate the
increase in speed, accuracy and robustness to noise in simulation. Further
experiments with real-world imaging data from two separate datasets, one from
Alzheimer’s disease patients, the other age-related macular degeneration,
showcase, for the first time, pixel-level disease progression events in the
brain and eye, respectively. Our method is low compute, interpretable and
applicable to any progressive condition and data modality, giving it broad
potential clinical utility.
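One standard optimal-transport tool for relaxing a permutation into the Birkhoff polytope (the set of doubly stochastic matrices) is Sinkhorn normalization. The abstract does not say which relaxation the authors use, so the sketch below is an assumption offered only to illustrate the idea.

```python
import math


def sinkhorn(scores, n_iters=100, temperature=1.0):
    """Map a square score matrix to an approximately doubly stochastic one.

    Alternately normalizes rows and columns of exp(scores / temperature).
    """
    n = len(scores)
    m = [[math.exp(s / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        m = [[v / sum(row) for v in row] for row in m]  # rows sum to 1
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


# Hypothetical scores: entry (i, j) says how well event i fits position j.
soft_perm = sinkhorn([[1.0, 0.2, 0.0], [0.3, 0.8, 0.1], [0.0, 0.5, 1.2]])
```

Lowering `temperature` pushes the result toward a hard permutation matrix, while the relaxed form stays differentiable and avoids enumerating all orderings.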
[COMMENTS]
Camera-ready version of paper accepted to NeurIPS 2024
[LINK]
http://arxiv.org/abs/2410.14388v3
[DATE]
2025-06-24 23:53:30+08:00
[CATEGORIES]
cs.LG
DRIFT: Data Reduction via Informative Feature Transformation - Generalization Begins Before Deep Learning Starts
[AUTHORS]
Ben Keslaki
[ABSTRACT]
Modern deep learning architectures excel at optimization, but only after the
data has entered the network. The true bottleneck lies in preparing the right
input: minimal, salient, and structured in a way that reflects the essential
patterns of the data. We propose DRIFT (Data Reduction via Informative Feature
Transformation), a novel preprocessing technique inspired by vibrational
analysis in physical systems, to identify and extract the most resonant modes
of input data prior to training. Unlike traditional models that attempt to
learn amidst both signal and noise, DRIFT mimics physics perception by
emphasizing informative features while discarding irrelevant elements. The
result is a more compact and interpretable representation that enhances
training stability and generalization performance. In DRIFT, images are
projected onto a low-dimensional basis formed by spatial vibration mode shapes
of plates, offering a physically grounded feature set. This enables neural
networks to operate with drastically fewer input dimensions (~50 features on
MNIST and less than 100 on CIFAR100) while achieving competitive classification
accuracy. Extensive experiments across MNIST and CIFAR100 demonstrate DRIFT’s
superiority over standard pixel-based models and PCA in terms of training
stability, resistance to overfitting, and generalization robustness. Notably,
DRIFT displays minimal sensitivity to changes in batch size, network
architecture, and image resolution, further establishing it as a resilient and
efficient data representation strategy. This work shifts the focus from
architecture engineering to input curation and underscores the power of
physics-driven data transformations in advancing deep learning performance.
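The projection step can be sketched with the textbook mode shapes of a simply supported rectangular plate, sin(m pi x) sin(n pi y). The boundary conditions, the pixel-centre sampling, and the function names are assumptions for illustration; the abstract does not specify which plate modes DRIFT uses.

```python
import math


def plate_mode(m, n, height, width):
    """Mode shape (m, n) of a simply supported plate, sampled at pixel centres."""
    return [[math.sin(m * math.pi * (i + 0.5) / height)
             * math.sin(n * math.pi * (j + 0.5) / width)
             for j in range(width)]
            for i in range(height)]


def project(image, max_m, max_n):
    """Project an image onto the first max_m x max_n modes (one feature each)."""
    h, w = len(image), len(image[0])
    features = []
    for m in range(1, max_m + 1):
        for n in range(1, max_n + 1):
            mode = plate_mode(m, n, h, w)
            features.append(sum(image[i][j] * mode[i][j]
                                for i in range(h) for j in range(w)))
    return features
```

With, for example, 7 x 7 modes a 28 x 28 image is reduced from 784 pixels to 49 features, in line with the roughly 50 features the abstract reports for MNIST.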
[LINK]
http://arxiv.org/abs/2506.19734v1
[DATE]
2025-06-24 23:53:18+08:00
[CATEGORIES]
cs.LG
Who Does What in Deep Learning? Multidimensional Game-Theoretic Attribution of Function of Neural Units
[AUTHORS]
Shrey Dixit, Kayson Fakhar, Fatemeh Hadaeghi, Patrick Mineault, Konrad P. Kording, Claus C. Hilgetag
[ABSTRACT]
Neural networks now generate text, images, and speech with billions of
parameters, producing a need to know how each neural unit contributes to these
high-dimensional outputs. Existing explainable-AI methods, such as SHAP,
attribute importance to inputs, but cannot quantify the contributions of neural
units across thousands of output pixels, tokens, or logits. Here we close that
gap with Multiperturbation Shapley-value Analysis (MSA), a model-agnostic
game-theoretic framework. By systematically lesioning combinations of units,
MSA yields Shapley Modes, unit-wise contribution maps that share the exact
dimensionality of the model’s output. We apply MSA across scales, from
multi-layer perceptrons to the 56-billion-parameter Mixtral-8x7B and Generative
Adversarial Networks (GAN). The approach demonstrates how regularisation
concentrates computation in a few hubs, exposes language-specific experts
inside the LLM, and reveals an inverted pixel-generation hierarchy in GANs.
Together, these results showcase MSA as a powerful approach for interpreting,
editing, and compressing deep neural networks.
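The idea of a vector-valued Shapley attribution can be illustrated by exact enumeration on a toy model. The function below is a sketch, not the MSA implementation: it enumerates every ordering (feasible only for a handful of units), and the toy model `toy_output` is hypothetical.

```python
from itertools import permutations


def shapley_modes(units, output_fn):
    """Exact vector-valued Shapley contributions by enumerating orderings.

    output_fn maps a frozenset of intact (non-lesioned) units to a list of
    floats; each unit's contribution has the same length as that output.
    """
    dim = len(output_fn(frozenset()))
    totals = {u: [0.0] * dim for u in units}
    orders = list(permutations(units))
    for order in orders:
        intact = set()
        prev = output_fn(frozenset())
        for u in order:
            intact.add(u)
            cur = output_fn(frozenset(intact))
            for d in range(dim):
                totals[u][d] += cur[d] - prev[d]
            prev = cur
    return {u: [t / len(orders) for t in vec] for u, vec in totals.items()}


def toy_output(intact):
    """Hypothetical model: output 0 needs unit a; output 1 needs b and c."""
    return [1.0 if "a" in intact else 0.0,
            2.0 if {"b", "c"} <= intact else 0.0]


modes = shapley_modes(["a", "b", "c"], toy_output)
```

For realistic networks the orderings would have to be sampled rather than enumerated, since their number grows factorially with the number of units.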
[LINK]
http://arxiv.org/abs/2506.19732v1
[DATE]
2025-06-24 23:50:35+08:00
[CATEGORIES]
cs.LG
IgCONDA-PET: Weakly-Supervised PET Anomaly Detection using Implicitly-Guided Attention-Conditional Counterfactual Diffusion Modeling – a Multi-Center, Multi-Cancer, and Multi-Tracer Study
[AUTHORS]
Shadab Ahamed, Arman Rahmim
[ABSTRACT]
Minimizing the need for pixel-level annotated data to train PET lesion
detection and segmentation networks is highly desired and can be
transformative, given time and cost constraints associated with expert
annotations. Current unsupervised or weakly-supervised anomaly detection
methods rely on autoencoder or generative adversarial networks (GANs) trained
only on healthy data. While these approaches reduce annotation dependency,
GAN-based methods are notably more challenging to train than non-GAN
alternatives (such as autoencoders) due to issues such as the simultaneous
optimization of two competing networks, mode collapse, and training
instability. In this paper, we present the weakly-supervised
Implicitly-guided COuNterfactual
diffusion model for Detecting Anomalies in PET
images (IgCONDA-PET). The solution is developed and validated using PET scans
from six retrospective cohorts consisting of a total of 2652 cases
(multi-cancer, multi-tracer) containing both local and public datasets
(spanning multiple centers). The training is conditioned on image class labels
(healthy vs. unhealthy) via attention modules, and we employ implicit diffusion
guidance. We perform counterfactual generation which facilitates
“unhealthy-to-healthy” domain translation by generating a synthetic, healthy
version of an unhealthy input image, enabling the detection of anomalies
through the calculated differences. The performance of our method was compared
against several other deep learning based weakly-supervised or unsupervised
methods as well as traditional methods like 41% SUV$_\text{max}$ thresholding.
We also highlight the importance of incorporating attention modules in our
network for the detection of small anomalies. The code is publicly available
at: https://github.com/ahxmeds/IgCONDA-PET.git.
[COMMENTS]
48 pages, 13 figures, 4 tables
[LINK]
http://arxiv.org/abs/2405.00239v3
[DATE]
2025-06-24 23:45:53+08:00
[CATEGORIES]
cs.LG
Geometric-Aware Variational Inference: Robust and Adaptive Regularization with Directional Weight Uncertainty
[AUTHORS]
Carlos Stein Brito
[ABSTRACT]
Deep neural networks require principled uncertainty quantification, yet
existing variational inference methods often employ isotropic Gaussian
approximations in weight space that poorly match the network’s inherent
geometry. We address this mismatch by introducing Concentration-Adapted
Perturbations (CAP), a variational framework that models weight uncertainties
directly on the unit hypersphere using von Mises-Fisher distributions. Building
on recent work in radial-directional posterior decompositions and spherical
weight constraints, CAP provides the first complete theoretical framework
connecting directional statistics to practical noise regularization in neural
networks. Our key contribution is an analytical derivation linking vMF
concentration parameters to activation noise variance, enabling each layer to
learn its optimal uncertainty level through a novel closed-form KL divergence
regularizer. In experiments on CIFAR-10, CAP significantly improves model
calibration - reducing Expected Calibration Error by 5.6x - while providing
interpretable layer-wise uncertainty profiles. CAP requires minimal
computational overhead and integrates seamlessly into standard architectures,
offering a theoretically grounded yet practical approach to uncertainty
quantification in deep learning.
[COMMENTS]
19 pages, 4 figures
[LINK]
http://arxiv.org/abs/2506.19726v1
[DATE]
2025-06-24 23:42:00+08:00
[CATEGORIES]
cs.LG
Identifying Unknown Stochastic Dynamics via Finite expression methods
[AUTHORS]
Senwei Liang, Chunmei Wang, Xingjian Xu
[ABSTRACT]
Modeling stochastic differential equations (SDEs) is crucial for
understanding complex dynamical systems in various scientific fields. Recent
methods often employ neural network-based models, which typically represent
SDEs through a combination of deterministic and stochastic terms. However,
these models usually lack interpretability and have difficulty generalizing
beyond their training domain. This paper introduces the Finite Expression
Method (FEX), a symbolic learning approach designed to derive interpretable
mathematical representations of the deterministic component of SDEs. For the
stochastic component, we integrate FEX with advanced generative modeling
techniques to provide a comprehensive representation of SDEs. The numerical
experiments on linear, nonlinear, and multidimensional SDEs demonstrate that
FEX generalizes well beyond the training domain and delivers more accurate
long-term predictions compared to neural network-based methods. The symbolic
expressions identified by FEX not only improve prediction accuracy but also
offer valuable scientific insights into the underlying dynamics of the systems,
paving the way for new scientific discoveries.
[COMMENTS]
19 pages, 15 figures, 5 tables
[LINK]
http://arxiv.org/abs/2504.07085v3
[DATE]
2025-06-24 23:24:00+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Kristian Sotirov, Annie E. Paine, Savvas Varsamopoulos, Antonio A. Gentile, Osvaldo Simeone
[ABSTRACT]
Offline model-based optimization (MBO) refers to the task of optimizing a
black-box objective function using only a fixed set of prior input-output data,
without any active experimentation. Recent work has introduced quantum extremal
learning (QEL), which leverages the expressive power of variational quantum
circuits to learn accurate surrogate functions by training on a few data
points. However, as widely studied in the classical machine learning
literature, predictive models may incorrectly extrapolate objective values in
unexplored regions, leading to the selection of overly optimistic solutions. In
this paper, we propose integrating QEL with conservative objective models
(COM), a regularization technique aimed at ensuring cautious predictions on
out-of-distribution inputs. The resulting hybrid algorithm, COM-QEL, builds on
the expressive power of quantum neural networks while safeguarding
generalization via conservative modeling. Empirical results on benchmark
optimization tasks demonstrate that COM-QEL reliably finds solutions with
higher true objective values compared to the original QEL, validating its
superiority for offline design problems.
[COMMENTS]
5 pages, 5 figures, initial version
[LINK]
http://arxiv.org/abs/2506.19714v1
[DATE]
2025-06-24 23:20:17+08:00
[CATEGORIES]
cs.LG
Guidance in the Frequency Domain Enables High-Fidelity Sampling at Low CFG Scales
[AUTHORS]
Seyedmorteza Sadat, Tobias Vontobel, Farnood Salehi, Romann M. Weber
[ABSTRACT]
Classifier-free guidance (CFG) has become an essential component of modern
conditional diffusion models. Although highly effective in practice, the
underlying mechanisms by which CFG enhances quality, detail, and prompt
alignment are not fully understood. We present a novel perspective on CFG by
analyzing its effects in the frequency domain, showing that low and high
frequencies have distinct impacts on generation quality. Specifically,
low-frequency guidance governs global structure and condition alignment, while
high-frequency guidance mainly enhances visual fidelity. However, applying a
uniform scale across all frequencies – as is done in standard CFG – leads to
oversaturation and reduced diversity at high scales and degraded visual quality
at low scales. Based on these insights, we propose frequency-decoupled guidance
(FDG), an effective approach that decomposes CFG into low- and high-frequency
components and applies separate guidance strengths to each component. FDG
improves image quality at low guidance scales and avoids the drawbacks of high
CFG scales by design. Through extensive experiments across multiple datasets
and models, we demonstrate that FDG consistently enhances sample fidelity while
preserving diversity, leading to improved FID and recall compared to CFG,
establishing our method as a plug-and-play alternative to standard
classifier-free guidance.
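
The decomposition described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 1-D signals, the moving-average low-pass filter, and the names `low_pass`, `cfg`, and `fdg` are assumptions chosen only to show how separate guidance scales could be applied to low- and high-frequency components.

```python
def low_pass(signal, radius=1):
    """Moving-average low-pass filter (an assumed, deliberately simple filter)."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def cfg(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def fdg(cond, uncond, w_low, w_high, radius=1):
    """Guide low- and high-frequency parts with separate scales, then recombine."""
    cond_low, uncond_low = low_pass(cond, radius), low_pass(uncond, radius)
    # The high-frequency part is whatever the low-pass filter removed.
    cond_high = [c - l for c, l in zip(cond, cond_low)]
    uncond_high = [u - l for u, l in zip(uncond, uncond_low)]
    low = cfg(cond_low, uncond_low, w_low)
    high = cfg(cond_high, uncond_high, w_high)
    return [l + h for l, h in zip(low, high)]


# Hypothetical conditional / unconditional model outputs.
cond = [0.0, 1.0, 0.5, 2.0, -1.0]
uncond = [0.2, 0.8, 0.1, 1.5, -0.5]
guided = fdg(cond, uncond, w_low=1.0, w_high=5.0)
```

Because the split is linear, setting `w_low == w_high` in this sketch recovers standard CFG up to floating-point error, which is a quick sanity check on the decomposition.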
[LINK]
http://arxiv.org/abs/2506.19713v1
[DATE]
2025-06-24 23:19:42+08:00
[CATEGORIES]
cs.LG
Learning-aided Bigraph Matching Approach to Multi-Crew Restoration of Damaged Power Networks Coupled with Road Transportation Networks
[AUTHORS]
Nathan Maurer, Harshal Kaushik, Roshni Anna Jacob, Jie Zhang, Souma Chowdhury
[ABSTRACT]
The resilience of critical infrastructure networks (CINs) after disruptions,
such as those caused by natural hazards, depends on both the speed of
restoration and the extent to which operational functionality can be regained.
Allocating resources for restoration is a combinatorial optimal planning
problem that involves determining which crews will repair specific network
nodes and in what order. This paper presents a novel graph-based formulation
that merges two interconnected graphs, representing crew and transportation
nodes and power grid nodes, into a single heterogeneous graph. To enable
efficient planning, graph reinforcement learning (GRL) is integrated with
bigraph matching. GRL is utilized to design the incentive function for
assigning crews to repair tasks based on the graph-abstracted state of the
environment, ensuring generalization across damage scenarios. Two learning
techniques are employed: a graph neural network trained using Proximal Policy
Optimization and another trained via Neuroevolution. The learned incentive
functions inform a bipartite graph that links crews to repair tasks, enabling
weighted maximum matching for crew-to-task allocations. An efficient simulation
environment that pre-computes optimal node-to-node path plans is used to train
the proposed restoration planning methods. An IEEE 8500-bus power distribution
test network coupled with a 21 square km transportation network is used as the
case study, with scenarios varying in terms of numbers of damaged nodes,
depots, and crews. Results demonstrate the approach’s generalizability and
scalability across scenarios, with learned policies providing 3-fold better
performance than random policies, while also outperforming optimization-based
solutions in both computation time (by several orders of magnitude) and power
restored.
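
The crew-to-task allocation step can be sketched as a weighted maximum matching on a small bipartite graph. This is an illustrative sketch only: the incentive values below are made up (in the paper they come from the learned incentive function), the function name `best_assignment` is hypothetical, and the brute-force search is suitable only for tiny instances.

```python
from itertools import permutations


def best_assignment(incentive):
    """Return (total, pairs) maximizing the summed incentive.

    incentive[c][t] is the weight of assigning crew c to task t.
    Assumes the number of crews does not exceed the number of tasks.
    """
    n_crews, n_tasks = len(incentive), len(incentive[0])
    best_total, best_pairs = float("-inf"), []
    # Enumerate every way to give each crew a distinct task.
    for tasks in permutations(range(n_tasks), n_crews):
        total = sum(incentive[c][t] for c, t in enumerate(tasks))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(tasks))
    return best_total, best_pairs


# Hypothetical incentives for 2 crews and 3 repair tasks.
incentive = [
    [4, 1, 3],
    [2, 0, 5],
]
total, pairs = best_assignment(incentive)
print(total, pairs)  # 9 [(0, 0), (1, 2)]
```

For realistic problem sizes a polynomial-time assignment solver would replace the enumeration; the brute-force version is used here only to keep the sketch dependency-free.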
[COMMENTS]
IDETC 2025
[LINK]
http://arxiv.org/abs/2506.19703v1
[DATE]
2025-06-24 23:12:45+08:00
[CATEGORIES]
cs.LG
Near-optimal estimates for the $\ell^p$-Lipschitz constants of deep random ReLU neural networks
[AUTHORS]
Sjoerd Dirksen, Patrick Finke, Paul Geuchen, Dominik Stöger, Felix Voigtlaender
[ABSTRACT]
This paper studies the $\ell^p$-Lipschitz constants of ReLU neural networks
$\Phi: \mathbb{R}^d \to \mathbb{R}$ with random parameters for $p \in
[1,\infty]$. The distribution of the weights follows a variant of the He
initialization and the biases are drawn from symmetric distributions. We derive
high probability upper and lower bounds for wide networks that differ at most
by a factor that is logarithmic in the network’s width and linear in its depth.
In the special case of shallow networks, we obtain matching bounds. Remarkably,
the behavior of the $\ell^p$-Lipschitz constant varies significantly between
the regimes $ p \in [1,2) $ and $ p \in [2,\infty] $. For $p \in [2,\infty]$,
the $\ell^p$-Lipschitz constant behaves similarly to $\Vert g\Vert_{p'}$, where
$g \in \mathbb{R}^d$ is a $d$-dimensional standard Gaussian vector and $1/p +
1/p' = 1$. In contrast, for $p \in [1,2)$, the $\ell^p$-Lipschitz constant
aligns more closely to $\Vert g \Vert_{2}$.
[COMMENTS]
The introduction will still be expanded with additional references
[LINK]
http://arxiv.org/abs/2506.19695v1
[DATE]
2025-06-24 23:02:16+08:00
[CATEGORIES]
cs.LG
AYLA: Amplifying Gradient Sensitivity via Loss Transformation in Non-Convex Optimization
[AUTHORS]
Ben Keslaki
[ABSTRACT]
Stochastic Gradient Descent (SGD) and its variants, such as ADAM, are
foundational to deep learning optimization, adjusting model parameters through
fixed or adaptive learning rates based on loss function gradients. However,
these methods often struggle to balance adaptability and efficiency in
high-dimensional, non-convex settings. This paper introduces AYLA, a novel
optimization framework that enhances training dynamics via loss function
transformation. AYLA applies a tunable power-law transformation to the loss,
preserving critical points while scaling loss values to amplify gradient
sensitivity and accelerate convergence. Additionally, we propose an effective
learning rate that dynamically adapts to the transformed loss, further
improving optimization efficiency. Empirical evaluations on minimizing a
synthetic non-convex polynomial, solving a non-convex curve-fitting task, and
performing digit classification (MNIST) and image recognition (CIFAR-100)
demonstrate that AYLA consistently outperforms SGD and ADAM in both convergence
speed and training stability. By reshaping the loss landscape, AYLA provides a
model-agnostic enhancement to existing optimization methods, offering a
promising advancement in deep neural network training.
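
A toy sketch of the idea of a power-law loss transformation follows. The abstract does not give the exact form used by AYLA, so the transformation `loss ** power`, the quadratic toy loss, and all function names here are assumptions for illustration only.

```python
def loss(theta):
    """Toy loss with its minimum at theta = 3; always >= 1."""
    return (theta - 3.0) ** 2 + 1.0


def loss_grad(theta):
    """Derivative of the toy loss."""
    return 2.0 * (theta - 3.0)


def transformed_grad(theta, power):
    """Gradient of loss(theta) ** power, obtained via the chain rule."""
    return power * loss(theta) ** (power - 1.0) * loss_grad(theta)


# At the critical point the transformed gradient is still zero.
print(transformed_grad(3.0, 2.0))  # 0.0
# Away from it, the gradient is scaled by power * loss ** (power - 1).
print(loss_grad(5.0), transformed_grad(5.0, 2.0))  # 4.0 40.0
```

In this sketch the transformed gradient is the original gradient multiplied by a positive factor wherever the loss is positive, so points where the original gradient vanishes remain critical points while the step size elsewhere changes with the loss value.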
[LINK]
http://arxiv.org/abs/2504.01875v2
[DATE]
2025-06-24 22:57:56+08:00
[CATEGORIES]
cs.LG
When Can We Reuse a Calibration Set for Multiple Conformal Predictions?
[AUTHORS]
A. A. Balinsky, A. D. Balinsky
[ABSTRACT]
Reliable uncertainty quantification is crucial for the trustworthiness of
machine learning applications. Inductive Conformal Prediction (ICP) offers a
distribution-free framework for generating prediction sets or intervals with
user-specified confidence. However, standard ICP guarantees are marginal and
typically require a fresh calibration set for each new prediction to maintain
their validity. This paper addresses this practical limitation by demonstrating
how e-conformal prediction, in conjunction with Hoeffding’s inequality, can
enable the repeated use of a single calibration set with a high probability of
preserving the desired coverage. Through a case study on the CIFAR-10 dataset,
we train a deep neural network and utilise a calibration set to estimate a
Hoeffding correction. This correction allows us to apply a modified Markov’s
inequality, leading to the construction of prediction sets with quantifiable
confidence. Our results illustrate the feasibility of maintaining provable
performance in conformal prediction while enhancing its practicality by
reducing the need for repeated calibration. The code for this work is publicly
available.
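
For context, the standard ICP baseline mentioned at the start of this abstract can be sketched as follows. This is not the authors' released code and does not implement e-conformal prediction or the Hoeffding correction; the nonconformity score (one minus the predicted class probability) and the function names are assumptions.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points: every label must be included.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]


# Hypothetical calibration scores: 1 - p(true label) for 7 held-out examples.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
threshold = conformal_threshold(cal_scores, alpha=0.25)
print(threshold)  # 0.6
print(prediction_set([0.5, 0.45, 0.05], threshold))  # [0, 1]
```

The coverage guarantee of this baseline is marginal over the draw of the calibration set and a single test point, which is the limitation the abstract sets out to address.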
[LINK]
http://arxiv.org/abs/2506.19689v1
[DATE]
2025-06-24 22:57:25+08:00
[CATEGORIES]
cs.LG
Extreme Learning Machines for Exoplanet Simulations: A Faster, Lightweight Alternative to Deep Learning
[AUTHORS]
Tara P. A. Tahseen, Luís F. Simões, Kai Hou Yip, Nikolaos Nikolaou, João M. Mendonça, Ingo P. Waldmann
[ABSTRACT]
Increasing resolution and coverage of astrophysical and climate data
necessitates increasingly sophisticated models, often pushing the limits of
computational feasibility. While emulation methods can reduce calculation
costs, the neural architectures typically used – optimised via gradient
descent – are themselves computationally expensive to train, particularly in
terms of data generation requirements. This paper investigates the utility of
the Extreme Learning Machine (ELM) as a lightweight, non-gradient-based machine
learning algorithm for accelerating complex physical models.
We evaluate ELM surrogate models in two test cases with different data
structures: (i) sequentially-structured data, and (ii) image-structured data.
For test case (i), where the number of samples $N$ $\gg$ the dimensionality of
input data $d$, ELMs achieve remarkable efficiency, offering a 100,000$\times$
faster training time and a 40$\times$ faster prediction speed compared to a
Bi-Directional Recurrent Neural Network (BIRNN), whilst improving upon BIRNN
test performance. For test case (ii), characterised by $d \gg N$ and image-based
inputs, a single ELM was insufficient, but an ensemble of 50 individual ELM
predictors achieves comparable accuracy to a benchmark Convolutional Neural
Network (CNN), with a 16.4$\times$ reduction in training time, though costing a
6.9$\times$ increase in prediction time. We find different sample efficiency
characteristics between the test cases: in test case (i) individual ELMs
demonstrate superior sample efficiency, requiring only 0.28% of the training
dataset compared to the benchmark BIRNN, while in test case (ii) the ensemble
approach requires 78% of the data used by the CNN to achieve comparable
results, representing a trade-off between sample efficiency and model
complexity.
[COMMENTS]
20 pages, 16 figures
[LINK]
http://arxiv.org/abs/2506.19679v1
[DATE]
2025-06-24 22:46:23+08:00
[CATEGORIES]
cs.LG
Higher-Order Graph Databases
[AUTHORS]
Maciej Besta, Shriram Chandran, Jakub Cudak, Patrick Iff, Marcin Copik, Robert Gerstenberger, Tomasz Szydlo, Jürgen Müller, Torsten Hoefler
[ABSTRACT]
Recent advances in graph databases (GDBs) have been driving interest in
large-scale analytics, yet current systems fail to support higher-order (HO)
interactions beyond first-order (one-hop) relations, which are crucial for
tasks such as subgraph counting, polyadic modeling, and HO graph learning. We
address this by introducing a new class of systems, higher-order graph
databases (HO-GDBs) that use lifting and lowering paradigms to seamlessly
extend traditional GDBs with HO. We provide a theoretical analysis of OLTP and
OLAP queries, ensuring correctness, scalability, and ACID compliance. We
implement a lightweight, modular, and parallelizable HO-GDB prototype that
offers native support for hypergraphs, node-tuples, subgraphs, and other HO
structures under a unified API. The prototype scales to large HO OLTP & OLAP
workloads and shows how HO improves analytical tasks, for example enhancing
accuracy of graph neural networks within a GDB by 44%. Our work ensures low
latency and high query throughput, and generalizes both ACID-compliant and
eventually consistent systems.
[LINK]
http://arxiv.org/abs/2506.19661v1
[DATE]
2025-06-24 22:24:20+08:00
[CATEGORIES]
cs.LG
Unsupervised Data Generation for Offline Reinforcement Learning: A Perspective from Model
[AUTHORS]
Shuncheng He, Hongchang Zhang, Jianzhun Shao, Yuhang Jiang, Xiangyang Ji
[ABSTRACT]
Offline reinforcement learning (RL) has recently attracted growing interest
from RL researchers. However, the performance of offline RL suffers from the
out-of-distribution problem, which can be corrected by feedback in online RL.
Previous offline RL research focuses on restricting the offline algorithm to
in-distribution or even in-sample action sampling. In contrast, less work pays
attention to the influence of the batch data. In this paper, we first build a
bridge between the batch data and the performance of offline RL algorithms
theoretically, from the perspective of model-based offline RL optimization. We
draw a conclusion that, with mild assumptions, the distance between the
state-action pair distribution generated by the behavioural policy and the
distribution generated by the optimal policy, accounts for the performance gap
between the policy learned by model-based offline RL and the optimal policy.
Secondly, we reveal that in task-agnostic settings, a series of policies
trained by unsupervised RL can minimize the worst-case regret in the
performance gap. Inspired by the theoretical conclusions, UDG (Unsupervised
Data Generation) is devised to generate data and select proper data for offline
training under task-agnostic settings. Empirical results demonstrate that UDG
can outperform supervised data generation on solving unknown tasks.
[LINK]
http://arxiv.org/abs/2506.19643v1
[DATE]
2025-06-24 22:08:36+08:00
[CATEGORIES]
cs.LG
Hierarchical Time Series Forecasting Via Latent Mean Encoding
[AUTHORS]
Alessandro Salatiello, Stefan Birr, Manuel Kunz
[LINK]
http://arxiv.org/abs/2506.19633v1
[DATE]
2025-06-24 21:54:47+08:00
[CATEGORIES]
cs.LG
Why Uncertainty Calibration Matters for Reliable Perturbation-based Explanations
[AUTHORS]
Thomas Decker, Volker Tresp, Florian Buettner
[ABSTRACT]
Perturbation-based explanations are widely utilized to enhance the
transparency of modern machine-learning models. However, their reliability is
often compromised by the unknown model behavior under the specific
perturbations used. This paper investigates the relationship between
uncertainty calibration - the alignment of model confidence with actual
accuracy - and perturbation-based explanations. We show that models frequently
produce unreliable probability estimates when subjected to
explainability-specific perturbations and theoretically prove that this
directly undermines explanation quality. To address this, we introduce ReCalX,
a novel approach to recalibrate models for improved perturbation-based
explanations while preserving their original predictions. Experiments on
popular computer vision models demonstrate that our calibration strategy
produces explanations that are more aligned with human perception and actual
object locations.
[COMMENTS]
ICLR 2025 Workshop: XAI4Science: From Understanding Model Behavior to
Discovering New Scientific Knowledge
[LINK]
http://arxiv.org/abs/2506.19630v1
[DATE]
2025-06-24 21:54:12+08:00
[CATEGORIES]
cs.LG
Operator Forces For Coarse-Grained Molecular Dynamics
[AUTHORS]
Leon Klein, Atharva Kelkar, Aleksander Durumeric, Yaoyi Chen, Frank Noé
[ABSTRACT]
Coarse-grained (CG) molecular dynamics simulations extend the length and time
scale of atomistic simulations by replacing groups of correlated atoms with CG
beads. Machine-learned coarse-graining (MLCG) has recently emerged as a
promising approach to construct highly accurate force fields for CG molecular
dynamics. However, the calibration of MLCG force fields typically hinges on
force matching, which demands extensive reference atomistic trajectories with
corresponding force labels. In practice, atomistic forces are often not
recorded, making traditional force matching infeasible on pre-existing
datasets. Recently, noise-based kernels have been introduced to adapt force
matching to the low-data regime, including situations in which reference
atomistic forces are not present. While this approach produces force fields
which recapitulate slow collective motion, it introduces significant local
distortions due to the corrupting effects of the noise-based kernel. In this
work, we introduce more general kernels based on normalizing flows that
substantially reduce these local distortions while preserving global
conformational accuracy. We demonstrate our method on small proteins, showing
that flow-based kernels can generate high-quality CG forces solely from
configurational samples.
[LINK]
http://arxiv.org/abs/2506.19628v1
[DATE]
2025-06-24 21:51:20+08:00
[CATEGORIES]
cs.LG
Scaling Up Unbiased Search-based Symbolic Regression
[AUTHORS]
Paul Kahlmeyer, Joachim Giesen, Michael Habeck, Henrik Voigt
[ABSTRACT]
In a regression task, a function is learned from labeled data to predict the
labels at new data points. The goal is to achieve small prediction errors. In
symbolic regression, the goal is more ambitious, namely, to learn an
interpretable function that makes small prediction errors. This additional goal
largely rules out the standard approach used in regression, that is, reducing
the learning problem to learning parameters of an expansion of basis functions
by optimization. Instead, symbolic regression methods search for a good
solution in a space of symbolic expressions. To cope with the typically vast
search space, most symbolic regression methods make implicit, or sometimes even
explicit, assumptions about its structure. Here, we argue that the only obvious
structure of the search space is that it contains small expressions, that is,
expressions that can be decomposed into a few subexpressions. We show that
systematically searching spaces of small expressions finds solutions that are
more accurate and more robust against noise than those obtained by
state-of-the-art symbolic regression methods. In particular, systematic search
outperforms state-of-the-art symbolic regressors in terms of its ability to
recover the true underlying symbolic expressions on established benchmark data
sets.
[LINK]
http://arxiv.org/abs/2506.19626v1
[DATE]
2025-06-24 21:47:19+08:00
[CATEGORIES]
cs.LG
Multimodal Machine Learning in Mental Health: A Survey of Data, Algorithms, and Challenges
[AUTHORS]
Zahraa Al Sahili, Ioannis Patras, Matthew Purver
[ABSTRACT]
Multimodal machine learning (MML) is rapidly reshaping the way mental-health
disorders are detected, characterized, and longitudinally monitored. Whereas
early studies relied on isolated data streams – such as speech, text, or
wearable signals – recent research has converged on architectures that
integrate heterogeneous modalities to capture the rich, complex signatures of
psychiatric conditions. This survey provides the first comprehensive,
clinically grounded synthesis of MML for mental health. We (i) catalog 26
public datasets spanning audio, visual, physiological signals, and text
modalities; (ii) systematically compare transformer, graph, and hybrid-based
fusion strategies across 28 models, highlighting trends in representation
learning and cross-modal alignment. Beyond summarizing current capabilities, we
interrogate open challenges: data governance and privacy, demographic and
intersectional fairness, evaluation explainability, and the complexity of
mental health disorders in multimodal settings. By bridging methodological
innovation with psychiatric utility, this survey aims to orient both ML
researchers and mental-health practitioners toward the next generation of
trustworthy, multimodal decision-support systems.
[LINK]
http://arxiv.org/abs/2407.16804v2
[DATE]
2025-06-24 21:40:09+08:00
[CATEGORIES]
cs.LG
Contactless Cardiac Pulse Monitoring Using Event Cameras
[AUTHORS]
Mohamed Moustafa, Joseph Lemley, Peter Corcoran
[ABSTRACT]
Time event cameras are a novel technology for recording scene information at
extremely low latency and with low power consumption. Event cameras output a
stream of events that encapsulate pixel-level light intensity changes within
the scene, capturing information with a higher dynamic range and temporal
resolution than traditional cameras. This study investigates the contact-free
reconstruction of an individual’s cardiac pulse signal from time event
recording of their face using a supervised convolutional neural network (CNN)
model. An end-to-end model is trained to extract the cardiac signal from a
two-dimensional representation of the event stream, with model performance
evaluated based on the accuracy of the calculated heart rate. The experimental
results confirm that physiological cardiac information in the facial region is
effectively preserved within the event stream, showcasing the potential of this
novel sensor for remote heart rate monitoring. The model trained on event
frames achieves a root mean square error (RMSE) of 3.32 beats per minute (bpm)
compared to the RMSE of 2.92 bpm achieved by the baseline model trained on
standard camera frames. Furthermore, models trained on event frames generated
at 60 and 120 FPS outperformed the 30 FPS standard camera results, achieving an
RMSE of 2.54 and 2.13 bpm, respectively.
[LINK]
http://arxiv.org/abs/2505.09529v2
[DATE]
2025-06-24 21:38:00+08:00
[CATEGORIES]
cs.LG
ECG-SMART-NET: A Deep Learning Architecture for Precise ECG Diagnosis of Occlusion Myocardial Infarction
[AUTHORS]
Nathan T. Riek, Murat Akcakaya, Zeineb Bouzid, Tanmay Gokhale, Stephanie Helman, Karina Kraevsky-Philips, Rui Qi Ji, Ervin Sejdic, Jessica K. Zègre-Hemsey, Christian Martin-Gill, Clifton W. Callaway, Samir Saba, Salah Al-Zaiti
[ABSTRACT]
Objective: In this paper we develop and evaluate ECG-SMART-NET for occlusion
myocardial infarction (OMI) identification. OMI is a severe form of heart
attack characterized by complete blockage of one or more coronary arteries
requiring immediate referral for cardiac catheterization to restore blood flow
to the heart. Two thirds of OMI cases are difficult to visually identify from a
12-lead electrocardiogram (ECG) and can be potentially fatal if not identified
quickly. Previous works on this topic are scarce, and current state-of-the-art
evidence suggests both feature-based random forests and convolutional neural
networks (CNNs) are promising approaches to improve ECG detection of OMI.
Methods: While the ResNet architecture has been adapted for use with ECG
recordings, it is not ideally suited to capture informative temporal features
within each lead and the spatial concordance or discordance across leads. We
propose a clinically informed modification of the ResNet-18 architecture. The
model first learns temporal features through temporal convolutional layers with
1xk kernels followed by a spatial convolutional layer, after the residual
blocks, with 12x1 kernels to learn spatial features. Results: ECG-SMART-NET was
benchmarked against the original ResNet-18 and other state-of-the-art models on
a multisite real-world clinical dataset that consists of 10,393 ECGs from 7,397
unique patients (rate of OMI = 7.2%). ECG-SMART-NET outperformed other models in
the classification of OMI with a test AUC of 0.953 [0.921, 0.978]. Conclusion
and Significance: ECG-SMART-NET can outperform the state-of-the-art random
forest for OMI prediction and is better suited for this task than the original
ResNet-18 architecture.
[COMMENTS]
9 pages, 7 figures, 6 tables
[LINK]
http://arxiv.org/abs/2405.09567v2
[DATE]
2025-06-24 21:37:46+08:00
[CATEGORIES]
cs.LG
A text-to-tabular approach to generate synthetic patient data using LLMs
[AUTHORS]
Margaux Tornqvist, Jean-Daniel Zucker, Tristan Fauvel, Nicolas Lambert, Mathilde Berthelot, Antoine Movschin
[ABSTRACT]
Access to large-scale high-quality healthcare databases is key to accelerate
medical research and make insightful discoveries about diseases. However,
access to such data is often limited by patient privacy concerns, data sharing
restrictions and high costs. To overcome these limitations, synthetic patient
data has emerged as an alternative. However, synthetic data generation (SDG)
methods typically rely on machine learning (ML) models trained on original
data, leading back to the data scarcity problem. We propose an approach to
generate synthetic tabular patient data that does not require access to the
original data, but only a description of the desired database. We leverage
prior medical knowledge and in-context learning capabilities of large language
models (LLMs) to generate realistic patient data, even in a low-resource
setting. We quantitatively evaluate our approach against state-of-the-art SDG
models, using fidelity, privacy, and utility metrics. Our results show that
while LLMs may not match the performance of state-of-the-art models trained on
the original data, they effectively generate realistic patient data with
well-preserved clinical correlations. An ablation study highlights key elements
of our prompt contributing to high-quality synthetic patient data generation.
This approach, which is easy to use and does not require original data or
advanced ML skills, is particularly valuable for quickly generating
custom-designed patient data, supporting project implementation and providing
educational resources.
[COMMENTS]
12 pages, 3 figures. Accepted to the 2025 IEEE International
Conference on Healthcare Informatics (IEEE ICHI 2025), 2025, Rende (CS),
Calabria, Italy
[LINK]
http://arxiv.org/abs/2412.05153v2
[DATE]
2025-06-24 21:24:58+08:00
[CATEGORIES]
cs.LG
Beyond Static Models: Hypernetworks for Adaptive and Generalizable Forecasting in Complex Parametric Dynamical Systems
[AUTHORS]
Pantelis R. Vlachas, Konstantinos Vlachas, Eleni Chatzi
[ABSTRACT]
Dynamical systems play a key role in modeling, forecasting, and
decision-making across a wide range of scientific domains. However, variations
in system parameters, also referred to as parametric variability, can lead to
drastically different model behavior and output, posing challenges for
constructing models that generalize across parameter regimes. In this work, we
introduce the Parametric Hypernetwork for Learning Interpolated Networks
(PHLieNet), a framework that simultaneously learns: (a) a global mapping from
the parameter space to a nonlinear embedding and (b) a mapping from the
inferred embedding to the weights of a dynamics propagation network. The
learned embedding serves as a latent representation that modulates a base
network, termed the hypernetwork, enabling it to generate the weights of a
target network responsible for forecasting the system’s state evolution
conditioned on the previous time history. By interpolating in the space of
models rather than observations, PHLieNet facilitates smooth transitions across
parameterized system behaviors, enabling a unified model that captures the
dynamic behavior across a broad range of system parameterizations. The
performance of the proposed technique is validated in a series of dynamical
systems with respect to its ability to extrapolate in time and interpolate and
extrapolate in the parameter space, i.e., generalize to dynamics that were
unseen during training. In all cases, our approach outperforms or matches
state-of-the-art baselines in both short-term forecast accuracy and in
capturing long-term dynamical features, such as attractor statistics.
[LINK]
http://arxiv.org/abs/2506.19609v1
[DATE]
2025-06-24 21:22:49+08:00
[CATEGORIES]
cs.LG
Constructive Universal Approximation and Finite Sample Memorization by Narrow Deep ReLU Networks
[AUTHORS]
Martín Hernández, Enrique Zuazua
[ABSTRACT]
We present a fully constructive analysis of deep ReLU neural networks for
classification and function approximation tasks. First, we prove that any
dataset with $N$ distinct points in $\mathbb{R}^d$ and $M$ output classes can
be exactly classified using a multilayer perceptron (MLP) of width $2$ and
depth at most $2N + 4M - 1$, with all network parameters constructed
explicitly. This result is sharp with respect to width and is interpreted
through the lens of simultaneous or ensemble controllability in discrete
nonlinear dynamics.
Second, we show that these explicit constructions yield uniform bounds on the
parameter norms and, in particular, provide upper estimates for minimizers of
standard regularized training loss functionals in supervised learning. As the
regularization parameter vanishes, the trained networks converge to exact
classifiers with bounded norm, explaining the effectiveness of
overparameterized training in the small-regularization regime.
We also prove a universal approximation theorem in $L^p(\Omega;
\mathbb{R}_+)$ for any bounded domain $\Omega \subset \mathbb{R}^d$ and $p \in
[1, \infty)$, using MLPs of fixed width $d + 1$. The proof is constructive,
geometrically motivated, and provides explicit estimates on the network depth
when the target function belongs to the Sobolev space $W^{1,p}$. We also extend
the approximation and depth estimation results to $L^p(\Omega; \mathbb{R}^m)$
for any $m \geq 1$.
Our results offer a unified and interpretable framework connecting
controllability, expressivity, and training dynamics in deep neural networks.
[LINK]
http://arxiv.org/abs/2409.06555v2
[DATE]
2025-06-24 21:20:03+08:00
[CATEGORIES]
cs.LG
Diff-Def: Diffusion-Generated Deformation Fields for Conditional Atlases
[AUTHORS]
Sophie Starck, Vasiliki Sideri-Lampretsa, Bernhard Kainz, Martin J. Menten, Tamara T. Mueller, Daniel Rueckert
[ABSTRACT]
Anatomical atlases are widely used for population studies and analysis.
Conditional atlases target a specific sub-population defined via certain
conditions, such as demographics or pathologies, and allow for the
investigation of fine-grained anatomical differences like morphological changes
associated with ageing or disease. Existing approaches use either
registration-based methods that are often unable to handle large anatomical
variations or generative adversarial models, which are challenging to train
since they can suffer from training instabilities. Instead of generating
atlases directly as intensities, we propose using latent diffusion models to
generate deformation fields, which transform a general population atlas into
one representing a specific sub-population. Our approach ensures structural
integrity, enhances interpretability and avoids hallucinations that may arise
during direct image synthesis by generating this deformation field and
regularising it using a neighbourhood of images. We compare our method to
several state-of-the-art atlas generation methods using brain MR images from
the UK Biobank. Our method generates highly realistic atlases with smooth
transformations and high anatomical fidelity, outperforming existing baselines.
We demonstrate the quality of these atlases through comprehensive evaluations,
including quantitative metrics for anatomical accuracy, perceptual similarity,
and qualitative analyses displaying the consistency and realism of the
generated atlases.
[LINK]
http://arxiv.org/abs/2403.16776v3
[DATE]
2025-06-24 21:12:51+08:00
[CATEGORIES]
cs.LG
Training Flexible Models of Genetic Variant Effects from Functional Annotations using Accelerated Linear Algebra
[AUTHORS]
Alan N. Amin, Andres Potapczynski, Andrew Gordon Wilson
[COMMENTS]
For example: ICML 2025. Code available at:
https://github.com/AlanNawzadAmin/DeepWAS
[LINK]
http://arxiv.org/abs/2506.19598v1
[DATE]
2025-06-24 21:07:45+08:00
[CATEGORIES]
cs.LG
Vision Transformer-Based Time-Series Image Reconstruction for Cloud-Filling Applications
[AUTHORS]
Lujun Li, Yiqun Wang, Radu State
[ABSTRACT]
Cloud cover in multispectral imagery (MSI) poses significant challenges for
early season crop mapping, as it leads to missing or corrupted spectral
information. Synthetic aperture radar (SAR) data, which is not affected by
cloud interference, offers a complementary solution, but lacks sufficient
spectral detail for precise crop mapping. To address this, we propose a novel
framework, Time-series MSI Image Reconstruction using Vision Transformer (ViT),
to reconstruct MSI data in cloud-covered regions by leveraging the temporal
coherence of MSI and the complementary information from SAR through the attention
mechanism. Comprehensive experiments, using rigorous reconstruction evaluation
metrics, demonstrate that the Time-series ViT framework significantly outperforms
baselines that use non-time-series MSI and SAR or time-series MSI without SAR,
effectively enhancing MSI image reconstruction in cloud-covered regions.
[COMMENTS]
This paper has been accepted as a conference paper at the 2025 IEEE
International Geoscience and Remote Sensing Symposium (IGARSS)
[LINK]
http://arxiv.org/abs/2506.19591v1
[DATE]
2025-06-24 21:00:36+08:00
[CATEGORIES]
cs.LG
ConStellaration: A dataset of QI-like stellarator plasma boundaries and optimization benchmarks
[AUTHORS]
Santiago A. Cadena, Andrea Merlo, Emanuel Laude, Alexander Bauer, Atul Agrawal, Maria Pascu, Marija Savtchouk, Enrico Guiraud, Lukas Bonauer, Stuart Hudson, Markus Kaiser
[ABSTRACT]
Stellarators are magnetic confinement devices under active development to
deliver steady-state carbon-free fusion energy. Their design involves a
high-dimensional, constrained optimization problem that requires expensive
physics simulations and significant domain expertise. Recent advances in plasma
physics and open-source tools have made stellarator optimization more
accessible. However, broader community progress is currently bottlenecked by
the lack of standardized optimization problems with strong baselines and
datasets that enable data-driven approaches, particularly for quasi-isodynamic
(QI) stellarator configurations, considered a promising path to commercial
fusion due to their inherent resilience to current-driven disruptions. Here, we
release an open dataset of diverse QI-like stellarator plasma boundary shapes,
paired with their ideal magnetohydrodynamic (MHD) equilibria and performance
metrics. We generated this dataset by sampling a variety of QI fields and
optimizing corresponding stellarator plasma boundaries. We introduce three
optimization benchmarks of increasing complexity: (1) a single-objective
geometric optimization problem, (2) a “simple-to-build” QI stellarator, and (3)
a multi-objective ideal-MHD stable QI stellarator that investigates trade-offs
between compactness and coil simplicity. For every benchmark, we provide
reference code, evaluation scripts, and strong baselines based on classical
optimization techniques. Finally, we show how learned models trained on our
dataset can efficiently generate novel, feasible configurations without
querying expensive physics oracles. By openly releasing the dataset along with
benchmark problems and baselines, we aim to lower the entry barrier for
optimization and machine learning researchers to engage in stellarator design
and to accelerate cross-disciplinary progress toward bringing fusion energy to
the grid.
[LINK]
http://arxiv.org/abs/2506.19583v1
[DATE]
2025-06-24 20:49:00+08:00
[CATEGORIES]
cs.LG
Realistic Image-to-Image Machine Unlearning via Decoupling and Knowledge Retention
[AUTHORS]
Ayush K. Varshney, Vicenç Torra
[ABSTRACT]
Machine Unlearning allows participants to remove their data from a trained
machine learning model in order to preserve their privacy and security.
However, the machine unlearning literature for generative models is rather
limited. The literature on image-to-image generative models (I2I models)
treats minimizing the distance between Gaussian noise and the output of the I2I
model on forget samples as machine unlearning. However, we argue that a
machine learning model performs fairly well on unseen data, i.e., a retrained
model will be able to capture generic patterns in the data and hence will not
generate an output that is equivalent to Gaussian noise. In this paper, we
consider that the model after unlearning should treat forget samples as
out-of-distribution (OOD) data, i.e., the unlearned model should no longer
recognize or encode the specific patterns found in the forget samples. To
achieve this, we propose a framework which decouples the model parameters with
gradient ascent, ensuring that forget samples are OOD for the unlearned model with
a theoretical guarantee. We also provide an $(\epsilon, \delta)$-unlearning
guarantee for model updates with gradient ascent. The unlearned model is
further fine-tuned on the remaining samples to maintain its performance. We
also propose an attack model to ensure that the unlearned model has effectively
removed the influence of forget samples. Extensive empirical evaluation on two
large-scale datasets, ImageNet-1K and Places365, highlights the superiority of
our approach. To show comparable performance with a retrained model, we also show
a comparison of a simple AutoEncoder on various baselines on the CIFAR-10
dataset.
[LINK]
http://arxiv.org/abs/2502.04260v2
[DATE]
2025-06-24 20:47:20+08:00
[CATEGORIES]
cs.LG
FAF: A Feature-Adaptive Framework for Few-Shot Time Series Forecasting
[AUTHORS]
Pengpeng Ouyang, Dong Chen, Tong Yang, Shuo Feng, Zhao Jin, Mingliang Xu
[ABSTRACT]
Multi-task and few-shot time series forecasting tasks are commonly
encountered in scenarios such as the launch of new products in different
cities. However, traditional time series forecasting methods suffer from
insufficient historical data, which stems from a disregard for the generalized
and specific features among different tasks. To address these challenges,
we propose the Feature-Adaptive Time Series Forecasting Framework (FAF), which
consists of three key components: the Generalized Knowledge Module (GKM), the
Task-Specific Module (TSM), and the Rank Module (RM). During the training phase,
the GKM is updated through a meta-learning mechanism that enables the model to
extract generalized features across related tasks. Meanwhile, the TSM is
trained to capture diverse local dynamics through multiple functional regions,
each of which learns specific features from individual tasks. During the testing
phase, the RM dynamically selects the most relevant functional region from the
TSM based on input sequence features, which is then combined with the
generalized knowledge learned by the GKM to generate accurate forecasts. This
design enables FAF to achieve robust and personalized forecasting even with
sparse historical observations. We evaluate FAF on five diverse real-world
datasets under few-shot time series forecasting settings. Experimental results
demonstrate that FAF consistently outperforms baselines that include three
categories of time series forecasting methods. In particular, FAF achieves a
41.81\% improvement over the best baseline, iTransformer, on the CO$_2$
emissions dataset.
[COMMENTS]
12 pages, 4 figures, 8 tables
[LINK]
http://arxiv.org/abs/2506.19567v1
[DATE]
2025-06-24 20:28:38+08:00
[CATEGORIES]
cs.LG
ConCM: Consistency-Driven Calibration and Matching for Few-Shot Class-Incremental Learning
[AUTHORS]
QinZhe Wang, Zixuan Chen, Keke Huang, Xiu Su, Chunhua Yang, Chang Xu
[ABSTRACT]
Few-Shot Class-Incremental Learning (FSCIL) requires models to adapt to novel
classes with limited supervision while preserving learned knowledge. Existing
prospective learning-based space construction methods reserve space to
accommodate novel classes. However, prototype deviation and structure fixity
limit the expressiveness of the embedding space. In contrast to fixed space
reservation, we explore the optimization of feature-structure dual consistency
and propose a Consistency-driven Calibration and Matching Framework (ConCM)
that systematically mitigates the knowledge conflict inherent in FSCIL.
Specifically, inspired by hippocampal associative memory, we design a
memory-aware prototype calibration that extracts generalized semantic
attributes from base classes and reintegrates them into novel classes to
enhance the conceptual center consistency of features. Further, we propose
dynamic structure matching, which adaptively aligns the calibrated features to
a session-specific optimal manifold space, ensuring cross-session structure
consistency. Theoretical analysis shows that our method satisfies both
geometric optimality and maximum matching, thereby overcoming the need for
class-number priors. On large-scale FSCIL benchmarks including mini-ImageNet
and CUB200, ConCM achieves state-of-the-art performance, surpassing the current
best method by 3.20% and 3.68% in harmonic accuracy of incremental sessions.
[COMMENTS]
9 pages, 5 figures (excluding the appendix)
[LINK]
http://arxiv.org/abs/2506.19558v1
[DATE]
2025-06-24 20:12:50+08:00
[CATEGORIES]
cs.LG
General Methods Make Great Domain-specific Foundation Models: A Case-study on Fetal Ultrasound
[AUTHORS]
Jakob Ambsdorf, Asbjørn Munk, Sebastian Llambias, Anders Nymark Christensen, Kamil Mikolaj, Randall Balestriero, Martin Tolsgaard, Aasa Feragen, Mads Nielsen
[ABSTRACT]
With access to large-scale, unlabeled medical datasets, researchers are
confronted with two questions: Should they attempt to pretrain a custom
foundation model on this medical data, or use transfer-learning from an
existing generalist model? And, if a custom model is pretrained, are novel
methods required? In this paper we explore these questions by conducting a
case-study, in which we train a foundation model on a large regional fetal
ultrasound dataset of 2M images. By selecting the well-established DINOv2
method for pretraining, we achieve state-of-the-art results on three fetal
ultrasound datasets, covering data from different countries, classification,
segmentation, and few-shot tasks. We compare against a series of models
pretrained on natural images, ultrasound images, and supervised baselines. Our
results demonstrate two key insights: (i) Pretraining on custom data is worth
it, even if smaller models are trained on less data, as scaling in natural
image pretraining does not translate to ultrasound performance. (ii) Well-tuned
methods from computer vision are making it feasible to train custom foundation
models for a given medical domain, requiring no hyperparameter tuning and
little methodological adaptation. Given these findings, we argue that a bias
towards methodological innovation should be avoided when developing domain
specific foundation models under common computational resource constraints.
[COMMENTS]
Submitted version of paper accepted at MICCAI 2025
[LINK]
http://arxiv.org/abs/2506.19552v1
[DATE]
2025-06-24 20:00:13+08:00
[CATEGORIES]
cs.LG
Discovering Symmetries of ODEs by Symbolic Regression
[AUTHORS]
Paul Kahlmeyer, Niklas Merk, Joachim Giesen
[ABSTRACT]
Solving systems of ordinary differential equations (ODEs) is essential when
it comes to understanding the behavior of dynamical systems. Yet, automated
solving remains challenging, in particular for nonlinear systems. Computer
algebra systems (CASs) provide support for solving ODEs by first simplifying
them, in particular through the use of Lie point symmetries. Finding these
symmetries is, however, itself a difficult problem for CASs. Recent works in
symbolic regression have shown promising results for recovering symbolic
expressions from data. Here, we adapt search-based symbolic regression to the
task of finding generators of Lie point symmetries. With this approach, we can
find symmetries of ODEs that existing CASs cannot find.
[LINK]
http://arxiv.org/abs/2506.19550v1
[DATE]
2025-06-24 19:55:59+08:00
[CATEGORIES]
cs.LG
Overtuning in Hyperparameter Optimization
[AUTHORS]
Lennart Schneider, Bernd Bischl, Matthias Feurer
[ABSTRACT]
Hyperparameter optimization (HPO) aims to identify an optimal hyperparameter
configuration (HPC) such that the resulting model generalizes well to unseen
data. As the expected generalization error cannot be optimized directly, it is
estimated with a resampling strategy, such as holdout or cross-validation. This
approach implicitly assumes that minimizing the validation error leads to
improved generalization. However, since validation error estimates are
inherently stochastic and depend on the resampling strategy, a natural question
arises: Can excessive optimization of the validation error lead to overfitting
at the HPO level, akin to overfitting in model training based on empirical risk
minimization? In this paper, we investigate this phenomenon, which we term
overtuning, a form of overfitting specific to HPO. Despite its practical
relevance, overtuning has received limited attention in the HPO and AutoML
literature. We provide a formal definition of overtuning and distinguish it
from related concepts such as meta-overfitting. We then conduct a large-scale
reanalysis of HPO benchmark data to assess the prevalence and severity of
overtuning. Our results show that overtuning is more common than previously
assumed, typically mild but occasionally severe. In approximately 10% of cases,
overtuning leads to the selection of a seemingly optimal HPC with worse
generalization error than the default or first configuration tried. We further
analyze how factors such as performance metric, resampling strategy, dataset
size, learning algorithm, and HPO method affect overtuning and discuss
mitigation strategies. Our results highlight the need to raise awareness of
overtuning, particularly in the small-data regime, indicating that further
mitigation strategies should be studied.
[COMMENTS]
Accepted at the Fourth Conference on Automated Machine Learning
(Methods Track). 43 pages, 9 tables, 14 figures
[LINK]
http://arxiv.org/abs/2506.19540v1
[DATE]
2025-06-24 19:49:48+08:00
[CATEGORIES]
cs.LG
Towards Robust Stability Prediction in Smart Grids: GAN-based Approach under Data Constraints and Adversarial Challenges
[AUTHORS]
Emad Efatinasab, Alessandro Brighente, Denis Donadel, Mauro Conti, Mirco Rampazzo
[ABSTRACT]
Smart grids are crucial for meeting rising energy demands driven by global
population growth and urbanization. By integrating renewable energy sources,
they enhance efficiency, reliability, and sustainability. However, ensuring
their availability and security requires advanced operational control and
safety measures. Although artificial intelligence and machine learning can help
assess grid stability, challenges such as data scarcity and cybersecurity
threats, particularly adversarial attacks, remain. Data scarcity is a major
issue, as obtaining real-world instances of grid instability requires
significant expertise, resources, and time. Yet, these instances are critical
for testing new research advancements and security mitigations. This paper
introduces a novel framework for detecting instability in smart grids using
only stable data. It employs a Generative Adversarial Network (GAN) where the
generator is designed not to produce near-realistic data but instead to
generate Out-Of-Distribution (OOD) samples with respect to the stable class.
These OOD samples represent unstable behavior, anomalies, or disturbances that
deviate from the stable data distribution. By training exclusively on stable
data and exposing the discriminator to OOD samples, our framework learns a
robust decision boundary to distinguish stable conditions from any unstable
behavior, without requiring unstable data during training. Furthermore, we
incorporate an adversarial training layer to enhance resilience against
attacks. Evaluated on a real-world dataset, our solution achieves up to 98.1\%
accuracy in predicting grid stability and 98.9\% in detecting adversarial
attacks. Implemented on a single-board computer, it enables real-time
decision-making with an average response time of under 7 ms.
[LINK]
http://arxiv.org/abs/2501.16490v2
[DATE]
2025-06-24 19:10:26+08:00
[CATEGORIES]
cs.LG
Distillation-Enabled Knowledge Alignment for Generative Semantic Communications in AIGC Provisioning Tasks
[AUTHORS]
Jingzhi Hu, Geoffrey Ye Li
[ABSTRACT]
Due to the surging amount of AI-generated content (AIGC), its provisioning to
edges and mobile users from the cloud incurs substantial traffic on networks.
Generative semantic communication (GSC) offers a promising solution by
transmitting highly compact information, i.e., prompt text and latent
representations, instead of high-dimensional AIGC data. However, GSC relies on
the alignment between the knowledge in the cloud generative AI (GAI) and that
possessed by the edges and users, and between the knowledge for wireless
transmission and that of actual channels, which remains challenging. In this
paper, we propose DeKA-g, a distillation-enabled knowledge alignment algorithm
for GSC systems. The core idea is to distill the generation knowledge from the
cloud-GAI into low-rank matrices, which can be incorporated by the edge and
used to adapt the transmission knowledge to diverse wireless channel
conditions. DeKA-g comprises two novel methods: metaword-aided knowledge
distillation (MAKD) and variable-rate grouped SNR adaptation (VGSA). For MAKD,
an optimized metaword is employed to enhance the efficiency of knowledge
distillation, while VGSA enables efficient adaptation to diverse compression
rates and SNR ranges. From simulation results, DeKA-g improves the alignment
between the edge-generated images and the cloud-generated ones by 44%.
Moreover, it adapts to compression rates with 116% higher efficiency than the
baseline and enhances the performance in low-SNR conditions by 28%.
[LINK]
http://arxiv.org/abs/2506.19893v1
[DATE]
2025-06-24 18:50:14+08:00
[CATEGORIES]
cs.LG
MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications
[AUTHORS]
Aleksandr Algazinov, Matt Laing, Paul Laban
[LINK]
http://arxiv.org/abs/2506.19502v1
[DATE]
2025-06-24 18:40:23+08:00
[CATEGORIES]
cs.LG
Tunable correlation retention: A statistical method for generating synthetic data
[AUTHORS]
Nicklas Jävergård, Rainey Lyons, Adrian Muntean, Jonas Forsman
[ABSTRACT]
We propose a method to generate statistically representative synthetic data
from a given dataset. The main goal of our method is for the created data set
to mimic the inter-feature correlations present in the original data, while
also offering a tunable parameter to influence the privacy level. In
particular, our method constructs a statistical map by using the empirical
conditional distributions between the features of the original dataset. Part of
the tunability is achieved by limiting the depths of conditional distributions
that are being used. We describe in detail our algorithms used both in the
construction of a statistical map and how to use this map to generate synthetic
observations. This approach is tested in three different ways: with a
hand-calculated example; a manufactured dataset; and a real-world energy-related
dataset of consumption/production of households on Madeira Island. We evaluate
the method by comparing the datasets using the Pearson correlation matrix with
different levels of resolution and depths of correlation. These two
considerations are viewed as tunable parameters influencing the resulting
dataset's fidelity and privacy. The proposed methodology is general in the sense
that it does not rely on the test dataset used. We expect it to be applicable
in a much broader context than indicated here.
[LINK]
http://arxiv.org/abs/2403.01471v3
[DATE]
2025-06-24 18:32:44+08:00
[CATEGORIES]
cs.LG
Enhancing Diversity in Parallel Agents: A Maximum State Entropy Exploration Story
[AUTHORS]
Vincenzo De Paola, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
[ABSTRACT]
Parallel data collection has redefined Reinforcement Learning (RL), unlocking
unprecedented efficiency and powering breakthroughs in large-scale real-world
applications. In this paradigm, $N$ identical agents operate in $N$ replicas of
an environment simulator, accelerating data collection by a factor of $N$. A
critical question arises: \textit{Does specializing the policies of the
parallel agents hold the key to surpass the $N$ factor acceleration?} In this
paper, we introduce a novel learning framework that maximizes the entropy of
collected data in a parallel setting. Our approach carefully balances the
entropy of individual agents with inter-agent diversity, effectively minimizing
redundancies. The latter idea is implemented with a centralized policy gradient
method, which shows promise when evaluated empirically against systems of
identical agents, as well as synergy with batch RL techniques that can exploit
data diversity. Finally, we provide an original concentration analysis that
shows faster rates for specialized parallel sampling distributions, which
supports our methodology and may be of independent interest.
[LINK]
http://arxiv.org/abs/2505.01336v2
[DATE]
2025-06-24 18:24:23+08:00
[CATEGORIES]
cs.LG
Privacy Attacks on Image AutoRegressive Models
[AUTHORS]
Antoni Kowalczuk, Jan Dubiński, Franziska Boenisch, Adam Dziedzic
[ABSTRACT]
Image AutoRegressive generation has emerged as a new powerful paradigm with
image autoregressive models (IARs) matching state-of-the-art diffusion models
(DMs) in image quality (FID: 1.48 vs. 1.58) while allowing for a higher
generation speed. However, the privacy risks associated with IARs remain
unexplored, raising concerns regarding their responsible deployment. To address
this gap, we conduct a comprehensive privacy analysis of IARs, comparing their
privacy risks to those of DMs as reference points. Concretely, we develop a
novel membership inference attack (MIA) that achieves a remarkably high success
rate in detecting training images (with a True Positive Rate at False Positive
Rate = 1% of 86.38% vs. 6.38% for DMs with comparable attacks). We leverage our
novel MIA to provide dataset inference (DI) for IARs, and show that it requires
as few as 6 samples to detect dataset membership (compared to 200 for DI in
DMs), confirming a higher information leakage in IARs. Finally, we are able to
extract hundreds of training data points from an IAR (e.g., 698 from VAR-d30).
Our results suggest a fundamental privacy-utility trade-off: while IARs excel
in image generation quality and speed, they are empirically significantly more
vulnerable to privacy attacks compared to DMs that achieve similar performance.
We release the code at https://github.com/sprintml/privacy_attacks_against_iars
for reproducibility.
[COMMENTS]
Accepted at ICML 2025
[LINK]
http://arxiv.org/abs/2502.02514v4
[DATE]
2025-06-24 18:19:57+08:00
[CATEGORIES]
cs.LG
Fast and Distributed Equivariant Graph Neural Networks by Virtual Node Learning
[AUTHORS]
Yuelin Zhang, Jiacheng Cen, Jiaqi Han, Wenbing Huang
[ABSTRACT]
Equivariant Graph Neural Networks (GNNs) have achieved remarkable success
across diverse scientific applications. However, existing approaches face
critical efficiency challenges when scaling to large geometric graphs and
suffer significant performance degradation when the input graphs are sparsified
for computational tractability. To address these limitations, we introduce
FastEGNN and DistEGNN, two novel enhancements to equivariant GNNs for
large-scale geometric graphs. FastEGNN employs a key innovation: a small
ordered set of virtual nodes that effectively approximates the large unordered
graph of real nodes. Specifically, we implement distinct message passing and
aggregation mechanisms for different virtual nodes to ensure mutual
distinctiveness, and minimize Maximum Mean Discrepancy (MMD) between virtual
and real coordinates to achieve global distributedness. This design enables
FastEGNN to maintain high accuracy while efficiently processing large-scale
sparse graphs. For extremely large-scale geometric graphs, we present DistEGNN,
a distributed extension where virtual nodes act as global bridges between
subgraphs in different devices, maintaining consistency while dramatically
reducing memory and computational overhead. We comprehensively evaluate our
models across four challenging domains: N-body systems (100 nodes), protein
dynamics (800 nodes), Water-3D (8,000 nodes), and our new Fluid113K benchmark
(113,000 nodes). Results demonstrate superior efficiency and performance,
establishing new capabilities in large-scale equivariant graph learning. Code
is available at https://github.com/GLAD-RUC/DistEGNN.
[LINK]
http://arxiv.org/abs/2506.19482v1
[DATE]
2025-06-24 18:17:38+08:00
[CATEGORIES]
cs.LG
Deep neural networks with ReLU, leaky ReLU, and softplus activation provably overcome the curse of dimensionality for Kolmogorov partial differential equations with Lipschitz nonlinearities in the $L^p$-sense
[AUTHORS]
Julia Ackermann, Arnulf Jentzen, Thomas Kruse, Benno Kuckuck, Joshua Lee Padgett
[ABSTRACT]
Recently, several deep learning (DL) methods for approximating
high-dimensional partial differential equations (PDEs) have been proposed. The
interest that these methods have generated in the literature is in large part
due to simulations which appear to demonstrate that such DL methods have the
capacity to overcome the curse of dimensionality (COD) for PDEs in the sense
that the number of computational operations they require to achieve a certain
approximation accuracy $\varepsilon\in(0,\infty)$ grows at most polynomially in
the PDE dimension $d\in\mathbb N$ and the reciprocal of $\varepsilon$. While
there is thus far no mathematical result that proves that one of such methods
is indeed capable of overcoming the COD, there are now a number of rigorous
results in the literature that show that deep neural networks (DNNs) have the
expressive power to approximate PDE solutions without the COD in the sense that
the number of parameters used to describe the approximating DNN grows at most
polynomially in both the PDE dimension $d\in\mathbb N$ and the reciprocal of
the approximation accuracy $\varepsilon>0$. Roughly speaking, in the literature
it has been proved for every $T>0$ that solutions $u_d\colon
[0,T]\times\mathbb R^d\to \mathbb R$, $d\in\mathbb N$, of semilinear heat PDEs
with Lipschitz continuous nonlinearities can be approximated by DNNs with ReLU
activation at the terminal time in the $L^2$-sense without the COD provided
that the initial value functions $\mathbb R^d\ni x\mapsto u_d(0,x)\in\mathbb
R$, $d\in\mathbb N$, can be approximated by ReLU DNNs without the COD. It is
the key contribution of this work to generalize this result by establishing
this statement in the $L^p$-sense with $p\in(0,\infty)$ and by allowing the
activation function to be more general covering the ReLU, the leaky ReLU, and
the softplus activation functions as special cases.
[COMMENTS]
52 pages
[LINK]
http://arxiv.org/abs/2309.13722v2
[DATE]
2025-06-24 18:07:05+08:00
[CATEGORIES]
cs.LG
Uncertainty Quantification on Graph Learning: A Survey
[AUTHORS]
Chao Chen, Chenghua Guo, Rui Xu, Xiangwen Liao, Xi Zhang, Sihong Xie, Hui Xiong, Philip Yu
[ABSTRACT]
Graphical models have demonstrated their exceptional capabilities across
numerous applications, such as social networks, citation networks, and online
recommendation systems. However, their performance, confidence, and
trustworthiness are often limited by the inherent randomness in data and the
challenges of accurately modeling real-world complexities. There has been
increased interest in developing uncertainty quantification (UQ) techniques
tailored to graphical models. In this survey, we comprehensively examine
existing works on UQ for graphical models, focusing on key aspects such as the
sources, representation, handling, and evaluation of uncertainty. This survey
distinguishes itself from most existing UQ surveys by specifically
concentrating on UQ in graphical models, including probabilistic graphical
models (PGMs) and graph neural networks (GNNs). After reviewing sources of
uncertainty, we organize the work using two high-level dimensions: uncertainty
representation and uncertainty handling. By offering a comprehensive overview
of the current landscape, including both established methodologies and emerging
trends, we aim to bridge gaps in understanding key challenges and opportunities
in UQ for graphical models, hoping to inspire researchers working on graphical
models or uncertainty quantification to make further advancements at the
intersection of the two fields.
[LINK]
http://arxiv.org/abs/2404.14642v3
[DATE]
2025-06-24 18:02:19+08:00
[CATEGORIES]
cs.LG
Orthogonal Soft Pruning for Efficient Class Unlearning
[AUTHORS]
Qinghui Gong, Xue Yang, Xiaohu Tang
[ABSTRACT]
Machine unlearning aims to selectively remove class-specific knowledge from
pretrained neural networks to satisfy privacy regulations such as the GDPR.
Existing methods typically face a trade-off between unlearning speed and
preservation of predictive accuracy, often incurring either high computational
overhead or significant performance degradation on retained classes. In this
paper, we propose a novel class-aware soft pruning framework leveraging
orthogonal convolutional kernel regularization to achieve rapid and precise
forgetting with millisecond-level response times. By enforcing orthogonality
constraints during training, our method decorrelates convolutional filters and
disentangles feature representations, while efficiently identifying
class-specific channels through activation difference analysis. Extensive
evaluations across multiple architectures and datasets demonstrate stable
pruning with near-instant execution, complete forgetting of targeted classes,
and minimal accuracy loss on retained data. Experiments on CIFAR-10, CIFAR-100,
and TinyImageNet confirm that our approach substantially reduces membership
inference attack risks and accelerates unlearning by orders of magnitude
compared to state-of-the-art baselines. This framework provides an efficient,
practical solution for real-time machine unlearning in Machine Learning as a
Service (MLaaS) scenarios.
[COMMENTS]
11 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.19891v1
[DATE]
2025-06-24 17:52:04+08:00
[CATEGORIES]
cs.LG
Stylized Structural Patterns for Improved Neural Network Pre-training
[AUTHORS]
Farnood Salehi, Vandit Sharma, Amirhossein Askari Farsangi, Tunç Ozan Aydın
[ABSTRACT]
Modern deep learning models in computer vision require large datasets of real
images, which are difficult to curate and pose privacy and legal concerns,
limiting their commercial use. Recent works suggest synthetic data as an
alternative, yet models trained with it often underperform. This paper proposes
a two-step approach to bridge this gap. First, we propose an improved neural
fractal formulation through which we introduce a new class of synthetic data.
Second, we propose reverse stylization, a technique that transfers visual
features from a small, license-free set of real images onto synthetic datasets,
enhancing their effectiveness. We analyze the domain gap between our synthetic
datasets and real images using Kernel Inception Distance (KID) and show that
our method achieves a significantly lower distributional gap compared to
existing synthetic datasets. Furthermore, our experiments across different
tasks demonstrate the practical impact of this reduced gap. We show that
pretraining the EDM2 diffusion model on our synthetic dataset leads to an 11%
reduction in FID during image generation, compared to models trained on
existing synthetic datasets, and a 20% decrease in autoencoder reconstruction
error, indicating improved performance in data representation. Furthermore, a
ViT-S model trained for classification on this synthetic data achieves over a
10% improvement in ImageNet-100 accuracy. Our work opens up exciting
possibilities for training practical models when sufficiently large real
training sets are not available.
[LINK]
http://arxiv.org/abs/2506.19465v1
[DATE]
2025-06-24 17:47:31+08:00
[CATEGORIES]
cs.LG
Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
[AUTHORS]
Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi
[ABSTRACT]
Mixture of Experts (MoE) LLMs have recently gained attention for their
ability to enhance performance by selectively engaging specialized subnetworks
or “experts” for each input. However, deploying MoEs on memory-constrained
devices remains challenging, particularly when generating tokens sequentially
with a batch size of one, as opposed to typical high-throughput settings
involving long sequences or large batches. In this work, we optimize MoE on
memory-constrained devices where only a subset of expert weights fit in DRAM.
We introduce a novel cache-aware routing strategy that leverages expert reuse
during token generation to improve cache locality. We evaluate our approach on
language modeling, MMLU, and GSM8K benchmarks and present on-device results
demonstrating 2$\times$ speedups on mobile devices, offering a flexible,
training-free solution to extend MoE’s applicability across real-world
applications.
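A minimal sketch of what cache-aware routing could look like, assuming a simple additive bonus for experts already resident in DRAM and an LRU cache. The function names, the `bonus` parameter, and the LRU policy are hypothetical and are not taken from the paper.

```python
from collections import OrderedDict


def cache_aware_top_k(logits, cache, k=2, bonus=1.0):
    """Pick k experts, favouring those already in the cache.

    `bonus` is a hypothetical additive term on the router logit of
    every cached expert; bonus=0 recovers plain top-k routing.
    """
    scored = [
        (score + (bonus if idx in cache else 0.0), idx)
        for idx, score in enumerate(logits)
    ]
    # Highest adjusted score first; ties broken by lower expert index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def touch(cache, experts, capacity):
    """LRU update. Returns the number of cache misses (expert loads)."""
    misses = 0
    for idx in experts:
        if idx in cache:
            cache.move_to_end(idx)  # mark as most recently used
        else:
            misses += 1
            cache[idx] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses


router_logits = [2.0, 1.9, 0.5, 1.8]
resident = OrderedDict({3: True, 2: True})
plain = cache_aware_top_k(router_logits, resident, k=2, bonus=0.0)
aware = cache_aware_top_k(router_logits, resident, k=2, bonus=0.5)
```

With `bonus=0.0` both selected experts would have to be loaded, whereas the cache-aware choice reuses one resident expert. The size of the bonus trades routing fidelity against cache locality.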
[COMMENTS]
Published in Transactions on Machine Learning Research (06/2025)
[LINK]
http://arxiv.org/abs/2412.00099v2
[DATE]
2025-06-24 17:27:46+08:00
[CATEGORIES]
cs.LG
Low-Complexity Semantic Packet Aggregation for Token Communication via Lookahead Search
[AUTHORS]
Seunghun Lee, Jihong Park, Jinho Choi, Hyuncheol Park
[ABSTRACT]
Tokens are fundamental processing units of generative AI (GenAI) and large
language models (LLMs), and token communication (TC) is essential for enabling
remote AI-generated content (AIGC) and wireless LLM applications. Unlike
traditional bits, each of which is independently treated, the semantics of each
token depends on its surrounding context tokens. This inter-token dependency
makes TC vulnerable to outage channels, where the loss of a single token can
significantly distort the original message semantics. Motivated by this, this
paper focuses on optimizing token packetization to maximize the average token
similarity (ATS) between the original and received token messages under outage
channels. Due to inter-token dependency, this token grouping problem is
combinatorial, with complexity growing exponentially with message length. To
address this, we propose a novel framework of semantic packet aggregation with
lookahead search (SemPA-Look), built on two core ideas. First, it introduces
the residual semantic score (RSS) as a token-level surrogate for the
message-level ATS, allowing robust semantic preservation even when a certain
token packet is lost. Second, instead of full search, SemPA-Look applies a
lookahead search-inspired algorithm that samples intra-packet token candidates
without replacement (fixed depth), conditioned on inter-packet token candidates
sampled with replacement (fixed width), thereby achieving linear complexity.
Experiments on a remote AIGC task with the MS-COCO dataset (text captioned
images) demonstrate that SemPA-Look achieves high ATS and LPIPS scores
comparable to exhaustive search, while reducing computational complexity by up
to 40$\times$. Compared to other linear-complexity algorithms such as the
genetic algorithm (GA), SemPA-Look achieves 10$\times$ lower complexity,
demonstrating its practicality for remote AIGC and other TC applications.
[LINK]
http://arxiv.org/abs/2506.19451v1
[DATE]
2025-06-24 17:25:44+08:00
[CATEGORIES]
cs.LG
SSPS: Self-Supervised Positive Sampling for Robust Self-Supervised Speaker Verification
[AUTHORS]
Theo Lepage, Reda Dehak
[ABSTRACT]
Self-Supervised Learning (SSL) has led to considerable progress in Speaker
Verification (SV). The standard framework uses same-utterance positive sampling
and data-augmentation to generate anchor-positive pairs of the same speaker.
This is a major limitation, as this strategy primarily encodes channel
information from the recording condition, shared by the anchor and positive. We
propose a new positive sampling technique to address this bottleneck:
Self-Supervised Positive Sampling (SSPS). For a given anchor, SSPS aims to find
an appropriate positive, i.e., of the same speaker identity but a different
recording condition, in the latent space using clustering assignments and a
memory queue of positive embeddings. SSPS improves SV performance for both
SimCLR and DINO, reaching 2.57% and 2.53% EER, outperforming SOTA SSL methods
on VoxCeleb1-O. In particular, SimCLR-SSPS achieves a 58% EER reduction by
lowering intra-speaker variance, providing comparable performance to DINO-SSPS.
[COMMENTS]
accepted at Interspeech 2025
[LINK]
http://arxiv.org/abs/2505.14561v2
[DATE]
2025-06-24 17:06:50+08:00
[CATEGORIES]
cs.LG
The Elements of Differentiable Programming
[AUTHORS]
Mathieu Blondel, Vincent Roulet
[ABSTRACT]
Artificial intelligence has recently experienced remarkable advances, fueled
by large models, vast datasets, accelerated hardware, and, last but not least,
the transformative power of differentiable programming. This new programming
paradigm enables end-to-end differentiation of complex computer programs
(including those with control flows and data structures), making gradient-based
optimization of program parameters possible. As an emerging paradigm,
differentiable programming builds upon several areas of computer science and
applied mathematics, including automatic differentiation, graphical models,
optimization and statistics. This book presents a comprehensive review of the
fundamental concepts useful for differentiable programming. We adopt two main
perspectives, that of optimization and that of probability, with clear
analogies between the two. Differentiable programming is not merely the
differentiation of programs, but also the thoughtful design of programs
intended for differentiation. By making programs differentiable, we inherently
introduce probability distributions over their execution, providing a means to
quantify the uncertainty associated with program outputs.
[COMMENTS]
Draft version 3
[LINK]
http://arxiv.org/abs/2403.14606v3
[DATE]
2025-06-24 16:38:16+08:00
[CATEGORIES]
cs.LG
Center of Gravity-Guided Focusing Influence Mechanism for Multi-Agent Reinforcement Learning
[AUTHORS]
Yisak Park, Sunwoo Lee, Seungyul Han
[ABSTRACT]
Cooperative multi-agent reinforcement learning (MARL) under sparse rewards
presents a fundamental challenge due to limited exploration and insufficient
coordinated attention among agents. In this work, we propose the Focusing
Influence Mechanism (FIM), a novel framework that enhances cooperation by
directing agent influence toward task-critical elements, referred to as Center
of Gravity (CoG) state dimensions, inspired by Clausewitz’s military theory.
FIM consists of three core components: (1) identifying CoG state dimensions
based on their stability under agent behavior, (2) designing counterfactual
intrinsic rewards to promote meaningful influence on these dimensions, and (3)
encouraging persistent and synchronized focus through eligibility-trace-based
credit accumulation. These mechanisms enable agents to induce more targeted and
effective state transitions, facilitating robust cooperation even in extremely
sparse reward settings. Empirical evaluations across diverse MARL benchmarks
demonstrate that the proposed FIM significantly improves cooperative
performance compared to baselines.
[COMMENTS]
9 technical pages followed by references and appendix
[LINK]
http://arxiv.org/abs/2506.19417v1
[DATE]
2025-06-24 16:35:15+08:00
[CATEGORIES]
cs.LG
Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
[AUTHORS]
Yuan Sui, Yufei He, Tri Cao, Simeng Han, Yulin Chen, Bryan Hooi
[ABSTRACT]
Large Language Models (LLMs) increasingly rely on prolonged reasoning chains
to solve complex tasks. However, this trial-and-error approach often leads to
high computational overhead and error propagation, where early mistakes can
derail subsequent steps. To address these issues, we introduce Meta-Reasoner, a
framework that dynamically optimizes inference-time reasoning by enabling LLMs
to “think about how to think.” Drawing inspiration from human meta-cognition
and dual-process theory, Meta-Reasoner operates as a strategic advisor,
decoupling high-level guidance from step-by-step generation. It employs
contextual multi-armed bandits to iteratively evaluate reasoning progress and
select optimal strategies (e.g., backtrack, clarify ambiguity, restart from
scratch, or propose alternative approaches), and reallocates computational
resources toward the most promising paths. Our evaluations on mathematical
reasoning and puzzles highlight the potential of dynamic reasoning chains to
overcome inherent challenges in the LLM reasoning process and also show promise
in broader applications, offering a scalable and adaptable solution for
reasoning-intensive tasks.
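To illustrate the bandit-style strategy selection described above, here is a minimal sketch using plain UCB1. This is a non-contextual simplification of the contextual bandit the abstract mentions; the strategy names and the toy reward are assumptions for illustration only.

```python
import math


class UCB1:
    """Plain UCB1 over a fixed set of reasoning strategies (arms)."""

    def __init__(self, arms, c=1.0):
        self.arms = list(arms)
        self.c = c  # exploration weight (hypothetical default)
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}

    def select(self):
        # Try every arm once before using the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts.values())
        return max(
            self.arms,
            key=lambda a: self.values[a]
            + self.c * math.sqrt(2 * math.log(total) / self.counts[a]),
        )

    def update(self, arm, reward):
        # Incremental mean of the rewards observed for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


bandit = UCB1(["continue", "backtrack", "restart"])
for _ in range(200):
    strategy = bandit.select()
    # Toy progress signal: pretend only backtracking helps.
    bandit.update(strategy, 1.0 if strategy == "backtrack" else 0.0)
```

In an actual system the reward would come from an estimate of reasoning progress, and a contextual variant would condition the choice on features of the current reasoning state.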
[LINK]
http://arxiv.org/abs/2502.19918v3
[DATE]
2025-06-24 16:27:42+08:00
[CATEGORIES]
cs.LG
Online Discovery of Simulation Models for Evolving Business Processes (Extended Version)
[AUTHORS]
Francesco Vinci, Gyunam Park, Wil van der Aalst, Massimiliano de Leoni
[ABSTRACT]
Business Process Simulation (BPS) refers to techniques designed to replicate
the dynamic behavior of a business process. Many approaches have been proposed
to automatically discover simulation models from historical event logs,
reducing the cost and time to manually design them. However, in dynamic
business environments, organizations continuously refine their processes to
enhance efficiency, reduce costs, and improve customer satisfaction. Existing
techniques for process simulation discovery lack adaptability to real-time
operational changes. In this paper, we propose a streaming process simulation
discovery technique that integrates Incremental Process Discovery with Online
Machine Learning methods. This technique prioritizes recent data while
preserving historical information, ensuring adaptation to evolving process
dynamics. Experiments conducted on four different event logs demonstrate the
importance in simulation of giving more weight to recent data while retaining
historical knowledge. Our technique not only produces more stable simulations
but also exhibits robustness in handling concept drift, as highlighted in one
of the use cases.
[LINK]
http://arxiv.org/abs/2506.10049v2
[DATE]
2025-06-24 16:14:12+08:00
[CATEGORIES]
cs.LG
M3D: Manifold-based Domain Adaptation with Dynamic Distribution for Non-Deep Transfer Learning in Cross-subject and Cross-session EEG-based Emotion Recognition
[AUTHORS]
Ting Luo, Jing Zhang, Yingwei Qiu, Li Zhang, Yaohua Hu, Zhuliang Yu, Zhen Liang
[ABSTRACT]
Emotion decoding using Electroencephalography (EEG)-based affective
brain-computer interfaces (aBCIs) plays a crucial role in affective computing
but is limited by challenges such as EEG’s non-stationarity, individual
variability, and the high cost of large labeled datasets. While deep learning
methods are effective, they require extensive computational resources and large
data volumes, limiting their practical application. To overcome these issues,
we propose Manifold-based Domain Adaptation with Dynamic Distribution (M3D), a
lightweight, non-deep transfer learning framework. M3D consists of four key
modules: manifold feature transformation, dynamic distribution alignment,
classifier learning, and ensemble learning. The data is mapped to an optimal
Grassmann manifold space, enabling dynamic alignment of source and target
domains. This alignment is designed to prioritize both marginal and conditional
distributions, improving adaptation efficiency across diverse datasets. In
classifier learning, the principle of structural risk minimization is applied
to build robust classification models. Additionally, dynamic distribution
alignment iteratively refines the classifier. The ensemble learning module
aggregates classifiers from different optimization stages to leverage diversity
and enhance prediction accuracy. M3D is evaluated on two EEG emotion
recognition datasets using two validation protocols (cross-subject
single-session and cross-subject cross-session) and a clinical EEG dataset for
Major Depressive Disorder (MDD). Experimental results show that M3D outperforms
traditional non-deep learning methods with a 4.47% average improvement and
achieves deep learning-level performance with reduced data and computational
requirements, demonstrating its potential for real-world aBCI applications.
[LINK]
http://arxiv.org/abs/2404.15615v3
[DATE]
2025-06-24 16:07:48+08:00
[CATEGORIES]
cs.LG
Controllable Video Generation with Provable Disentanglement
[AUTHORS]
Yifan Shen, Peiyuan Zhu, Zijian Li, Shaoan Xie, Zeyu Tang, Namrata Deka, Zongfang Liu, Guangyi Chen, Kun Zhang
[ABSTRACT]
Controllable video generation remains a significant challenge, despite recent
advances in generating high-quality and consistent videos. Most existing
methods for controlling video generation treat the video as a whole, neglecting
intricate fine-grained spatiotemporal relationships, which limits both control
precision and efficiency. In this paper, we propose Controllable Video
Generative Adversarial Networks (CoVoGAN) to disentangle the video concepts,
thus facilitating efficient and independent control over individual concepts.
Specifically, following the minimal change principle, we first disentangle
static and dynamic latent variables. We then leverage the sufficient change
property to achieve component-wise identifiability of dynamic latent variables,
enabling disentangled control of video generation. To establish the theoretical
foundation, we provide a rigorous analysis demonstrating the identifiability of
our approach. Building on these theoretical insights, we design a Temporal
Transition Module to disentangle latent dynamics. To enforce the minimal change
principle and sufficient change property, we minimize the dimensionality of
latent dynamic variables and impose temporal conditional independence. To
validate our approach, we integrate this module as a plug-in for GANs.
Extensive qualitative and quantitative experiments on various video generation
benchmarks demonstrate that our method significantly improves generation
quality and controllability across diverse real-world scenarios.
[LINK]
http://arxiv.org/abs/2502.02690v2
[DATE]
2025-06-24 15:54:02+08:00
[CATEGORIES]
cs.LG
Maximal Update Parametrization and Zero-Shot Hyperparameter Transfer for Fourier Neural Operators
[AUTHORS]
Shanda Li, Shinjae Yoo, Yiming Yang
[ABSTRACT]
Fourier Neural Operators (FNOs) offer a principled approach for solving
complex partial differential equations (PDEs). However, scaling them to handle
more complex PDEs requires increasing the number of Fourier modes, which
significantly expands the number of model parameters and makes hyperparameter
tuning computationally impractical. To address this, we introduce
$\mu$Transfer-FNO, a zero-shot hyperparameter transfer technique that enables
optimal configurations, tuned on smaller FNOs, to be directly applied to
billion-parameter FNOs without additional tuning. Building on the Maximal
Update Parametrization ($\mu$P) framework, we mathematically derive a
parametrization scheme that facilitates the transfer of optimal hyperparameters
across models with different numbers of Fourier modes in FNOs, which is
validated through extensive experiments on various PDEs. Our empirical study
shows that $\mu$Transfer-FNO reduces computational cost for tuning hyperparameters
on large FNOs while maintaining or improving accuracy.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.19396v1
[DATE]
2025-06-24 15:53:34+08:00
[CATEGORIES]
cs.LG
Causal-Aware Intelligent QoE Optimization for VR Interaction with Adaptive Keyframe Extraction
[AUTHORS]
Ziru Zhang, Jiadong Yu, Danny H. K. Tsang
[ABSTRACT]
The optimization of quality of experience (QoE) in multi-user virtual reality
(VR) interactions demands a delicate balance between ultra-low latency,
high-fidelity motion synchronization, and equitable resource allocation. While
adaptive keyframe extraction mitigates transmission overhead, existing
approaches often overlook the causal relationships among allocated bandwidth,
CPU frequency, and user perception, limiting QoE gains. This paper proposes an
intelligent framework to maximize QoE by integrating adaptive keyframe
extraction with causal-aware reinforcement learning (RL). First, a novel QoE
metric is formulated using the Weber-Fechner Law, combining perceptual
sensitivity, attention-driven priorities, and motion reconstruction accuracy.
The QoE optimization problem is then modeled as a mixed integer programming
(MIP) task, jointly optimizing keyframe ratios, bandwidth, and computational
resources under horizon-fairness constraints. We propose Partial State Causal
Deep Deterministic Policy Gradient (PS-CDDPG), which integrates the Deep
Deterministic Policy Gradient (DDPG) method with causal influence detection. By
leveraging causal information regarding how QoE is influenced and determined by
various actions, we explore actions guided by weights calculated from causal
inference (CI), which in turn improves training efficiency. Experiments
conducted with the CMU Motion Capture Database demonstrate that our framework
significantly reduces interactive latency, enhances QoE, and maintains
fairness, achieving superior performance compared to benchmark methods.
[LINK]
http://arxiv.org/abs/2506.19890v1
[DATE]
2025-06-24 15:32:34+08:00
[CATEGORIES]
cs.LG
Do Vendi Scores Converge with Finite Samples? Truncated Vendi Score for Finite-Sample Convergence Guarantees
[AUTHORS]
Azim Ospanov, Farzan Farnia
[ABSTRACT]
Evaluating the diversity of generative models without reference data poses
methodological challenges. The reference-free Vendi and RKE scores address this
by quantifying the diversity of generated data using matrix-based entropy
measures. Among these two, the Vendi score is typically computed via the
eigendecomposition of an $n \times n$ kernel matrix constructed from $n$
generated samples. However, the prohibitive computational cost of
eigendecomposition for large $n$ often limits the number of samples used to
fewer than 20,000. In this paper, we investigate the statistical convergence of
the Vendi and RKE scores under restricted sample sizes. We numerically
demonstrate that, in general, the Vendi score computed with standard sample
sizes below 20,000 may not converge to its asymptotic value under infinite
sampling. To address this, we introduce the $t$-truncated Vendi score by
truncating the eigenspectrum of the kernel matrix, which is provably guaranteed
to converge to its population limit with $n=\mathcal{O}(t)$ samples. We further
show that existing Nyström and FKEA approximation methods converge to the
asymptotic limit of the truncated Vendi score. In contrast to the Vendi score,
we prove that the RKE score enjoys universal convergence guarantees across all
kernel functions. We conduct several numerical experiments to illustrate the
concentration of Nyström and FKEA computed Vendi scores around the truncated
Vendi score, and we analyze how the truncated Vendi and RKE scores correlate
with the diversity of image and text data. The code is available at
https://github.com/aziksh-ospanov/truncated-vendi.
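For reference, the two matrix-based scores discussed above can be written as follows, assuming a normalized kernel with $k(x,x)=1$ and the convention $0\log 0=0$. The symbols $K$ and $\lambda_i$ are notation assumed here. The truncated variant restricts attention to the top $t$ eigenvalues; its exact definition is given in the paper and is not reproduced here.

```latex
% Assumed setup: K is the n x n kernel matrix with k(x,x) = 1, and
% \lambda_1 \ge \dots \ge \lambda_n \ge 0 are the eigenvalues of K/n,
% so that \sum_i \lambda_i = 1.
\mathrm{Vendi}(x_1,\dots,x_n)
  = \exp\Bigl( -\sum_{i=1}^{n} \lambda_i \log \lambda_i \Bigr),
\qquad
\mathrm{RKE}(x_1,\dots,x_n)
  = \Bigl( \sum_{i=1}^{n} \lambda_i^{2} \Bigr)^{-1}
  = \bigl\| \tfrac{1}{n} K \bigr\|_F^{-2}.
```

The Frobenius-norm form shows that the RKE score can be evaluated without an eigendecomposition, whereas the Vendi score depends on the full eigenspectrum.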
[LINK]
http://arxiv.org/abs/2410.21719v3
[DATE]
2025-06-24 15:25:00+08:00
[CATEGORIES]
cs.LG
NAADA: A Noise-Aware Attention Denoising Autoencoder for Dental Panoramic Radiographs
[AUTHORS]
Khuram Naveed, Bruna Neves de Freitas, Ruben Pauwels
[ABSTRACT]
Convolutional denoising autoencoders (DAEs) are powerful tools for image
restoration. However, they inherit a key limitation of convolutional neural
networks (CNNs): they tend to recover low-frequency features, such as smooth
regions, more effectively than high-frequency details. This leads to the loss
of fine details, which is particularly problematic in dental radiographs where
preserving subtle anatomical structures is crucial. While self-attention
mechanisms can help mitigate this issue by emphasizing important features,
conventional attention methods often prioritize features corresponding to
cleaner regions and may overlook those obscured by noise. To address this
limitation, we propose a noise-aware self-attention method, which allows the
model to effectively focus on and recover key features even within noisy
regions. Building on this approach, we introduce the noise-aware
attention-enhanced denoising autoencoder (NAADA) network for enhancing noisy
panoramic dental radiographs. Compared with recent state-of-the-art (and
much heavier) methods such as Uformer and MResDNN, our method improves the
reconstruction of fine details, ensuring better image quality and diagnostic
accuracy.
[COMMENTS]
10 pages, 8 figures
[LINK]
http://arxiv.org/abs/2506.19387v1
[DATE]
2025-06-24 15:23:04+08:00
[CATEGORIES]
cs.LG
Deep Electromagnetic Structure Design Under Limited Evaluation Budgets
[AUTHORS]
Shijian Zheng, Fangxiao Jin, Shuhai Zhang, Quan Xue, Mingkui Tan
[ABSTRACT]
Electromagnetic structure (EMS) design plays a critical role in developing
advanced antennas and materials, but remains challenging due to
high-dimensional design spaces and expensive evaluations. While existing
methods commonly employ high-quality predictors or generators to alleviate
evaluations, they are often data-intensive and struggle with real-world scale
and budget constraints. To address this, we propose a novel method called
Progressive Quadtree-based Search (PQS). Rather than exhaustively exploring the
high-dimensional space, PQS converts the conventional image-like layout into a
quadtree-based hierarchical representation, enabling a progressive search from
global patterns to local details. Furthermore, to lessen reliance on highly
accurate predictors, we introduce a consistency-driven sample selection
mechanism. This mechanism quantifies the reliability of predictions, balancing
exploitation and exploration when selecting candidate designs. We evaluate PQS
on two real-world engineering tasks, i.e., Dual-layer Frequency Selective
Surface and High-gain Antenna. Experimental results show that our method can
achieve satisfactory designs under limited computational budgets, outperforming
baseline methods. In particular, compared to generative approaches, it cuts
evaluation costs by 75-85%, effectively saving 20.27-38.80 days of the product
design cycle.
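The idea of converting an image-like layout into a quadtree-based hierarchical representation can be sketched as below. This is a generic quadtree over a square binary grid whose side is a power of two, not the paper's PQS implementation; the function names and the list-based node encoding are assumptions.

```python
def build_quadtree(grid, x=0, y=0, size=None):
    """Return a leaf value if the block is uniform, else four subtrees.

    Children are ordered: top-left, top-right, bottom-left, bottom-right.
    Assumes `grid` is square with a power-of-two side length.
    """
    if size is None:
        size = len(grid)
    first = grid[y][x]
    uniform = all(
        grid[row][col] == first
        for row in range(y, y + size)
        for col in range(x, x + size)
    )
    if uniform or size == 1:
        return first
    half = size // 2
    return [
        build_quadtree(grid, x, y, half),
        build_quadtree(grid, x + half, y, half),
        build_quadtree(grid, x, y + half, half),
        build_quadtree(grid, x + half, y + half, half),
    ]


def count_leaves(node):
    """Number of leaf cells, i.e. free design variables at this resolution."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)


layout = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
tree = build_quadtree(layout)
```

Here the 16-cell layout collapses to 7 leaves, which hints at why a coarse-to-fine search over such a tree can need far fewer evaluations than a search over raw pixels.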
[COMMENTS]
ICML 2025 (accepted)
[LINK]
http://arxiv.org/abs/2506.19384v1
[DATE]
2025-06-24 15:20:16+08:00
[CATEGORIES]
cs.LG
Explainable Artificial Intelligence Credit Risk Assessment using Machine Learning
[AUTHORS]
Shreya, Harsh Pathak
[ABSTRACT]
This paper presents an intelligent and transparent AI-driven system for
Credit Risk Assessment using three state-of-the-art ensemble machine learning
models combined with Explainable AI (XAI) techniques. The system leverages
XGBoost, LightGBM, and Random Forest algorithms for predictive analysis of loan
default risks, addressing the challenges of model interpretability using SHAP
and LIME. Preprocessing steps include custom imputation, one-hot encoding, and
standardization. Class imbalance is managed using SMOTE, and hyperparameter
tuning is performed with GridSearchCV. The model is evaluated on multiple
performance metrics including ROC-AUC, precision, recall, and F1-score.
LightGBM emerges as the most business-optimal model with the highest accuracy
and best trade-off between approval and default rates. Furthermore, the system
generates applicant-specific XAI visual reports and business impact summaries
to ensure transparent decision-making.
[COMMENTS]
15 pages, 8 Figures, 3 Tables
[LINK]
http://arxiv.org/abs/2506.19383v1
[DATE]
2025-06-24 15:20:05+08:00
[CATEGORIES]
cs.LG
Flopping for FLOPs: Leveraging equivariance for computational efficiency
[AUTHORS]
Georg Bökman, David Nordström, Fredrik Kahl
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2502.05169v2
[DATE]
2025-06-24 15:03:53+08:00
[CATEGORIES]
cs.LG
WebGuard++: Interpretable Malicious URL Detection via Bidirectional Fusion of HTML Subgraphs and Multi-Scale Convolutional BERT
[AUTHORS]
Ye Tian, Zhang Yumin, Yifan Jia, Jianguo Sun, Yanbin Wang
[ABSTRACT]
URL+HTML feature fusion shows promise for robust malicious URL detection,
since attacker artifacts persist in DOM structures. However, prior work suffers
from four critical shortcomings: (1) incomplete URL modeling, failing to
jointly capture lexical patterns and semantic context; (2) HTML graph sparsity,
where threat-indicative nodes (e.g., obfuscated scripts) are isolated amid
benign content, causing signal dilution during graph aggregation; (3)
unidirectional analysis, ignoring URL-HTML feature bidirectional interaction;
and (4) opaque decisions, lacking attribution to malicious DOM components. To
address these challenges, we present WebGuard++, a detection framework with 4
novel components: 1) Cross-scale URL Encoder: Hierarchically learns
local-to-global and coarse-to-fine URL features based on a Transformer network
with dynamic convolution. 2) Subgraph-aware HTML Encoder: Decomposes DOM graphs
into interpretable substructures, amplifying sparse threat signals via
hierarchical feature fusion. 3) Bidirectional Coupling Module: Aligns URL and
HTML embeddings through cross-modal contrastive learning, optimizing
inter-modal consistency and intra-modal specificity. 4) Voting Module:
Localizes malicious regions through consensus voting on malicious subgraph
predictions. Experiments show WebGuard++ achieves significant improvements over
state-of-the-art baselines, achieving 1.1x-7.9x higher TPR at fixed FPR of
0.001 and 0.0001 across both datasets.
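Since the headline metric is TPR at a fixed FPR, a minimal sketch of how such a number can be computed from detector scores may help. The function name and the strict-threshold convention are assumptions, and the toy scores are made up for illustration.

```python
def tpr_at_fpr(scores, labels, max_fpr):
    """True-positive rate at the loosest threshold whose FPR <= max_fpr.

    labels: 1 = malicious, 0 = benign. A sample is flagged when its
    score is strictly greater than the threshold.
    """
    negatives = sorted(
        (s for s, y in zip(scores, labels) if y == 0), reverse=True
    )
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one sample of each class")
    # Number of benign samples we may misclassify.
    allowed = int(max_fpr * len(negatives))
    if allowed < len(negatives):
        threshold = negatives[allowed]
    else:
        threshold = float("-inf")
    return sum(s > threshold for s in positives) / len(positives)


toy_scores = [0.1, 0.2, 0.3, 0.9, 0.95, 0.8, 0.25, 0.5]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
strict = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
relaxed = tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.25)
```

At FPR targets as low as 0.001 or 0.0001, the benign set must contain thousands of samples or more for the threshold to be meaningful, since `allowed` otherwise rounds down to zero.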
[LINK]
http://arxiv.org/abs/2506.19356v1
[DATE]
2025-06-24 14:36:51+08:00
[CATEGORIES]
cs.LG
Discrepancy-Aware Graph Mask Auto-Encoder
[AUTHORS]
Ziyu Zheng, Yaming Yang, Ziyu Guan, Wei Zhao, Weigang Lu
[ABSTRACT]
Masked Graph Auto-Encoder, a powerful graph self-supervised training
paradigm, has recently shown superior performance in graph representation
learning. Existing works typically rely on node contextual information to
recover the masked information. However, they fail to generalize well to
heterophilic graphs where connected nodes may not be similar, because they
focus only on capturing the neighborhood information and ignore the
discrepancy information between different nodes, resulting in indistinguishable
node representations. In this paper, to address this issue, we propose a
Discrepancy-Aware Graph Mask Auto-Encoder (DGMAE). It obtains more
distinguishable node representations by reconstructing the discrepancy
information of neighboring nodes during the masking process. We conduct
extensive experiments on 17 widely-used benchmark datasets. The results show
that our DGMAE can effectively preserve the discrepancies of nodes in
low-dimensional space. Moreover, DGMAE significantly outperforms
state-of-the-art graph self-supervised learning methods on three graph analytic
tasks, including node classification, node clustering, and graph classification,
demonstrating its remarkable superiority. The code of DGMAE is available at
https://github.com/zhengziyu77/DGMAE.
[LINK]
http://arxiv.org/abs/2506.19343v1
[DATE]
2025-06-24 14:15:44+08:00
[CATEGORIES]
cs.LG
CAM-NET: An AI Model for Whole Atmosphere with Thermosphere and Ionosphere Extension
[AUTHORS]
Jiahui Hu, Wenjun Dong
[ABSTRACT]
We present Compressible Atmospheric Model-Network (CAM-NET), an AI model
designed to predict neutral atmospheric variables from the Earth’s surface to
the ionosphere with high accuracy and computational efficiency. Accurate
modeling of the entire atmosphere is critical for understanding the upward
propagation of gravity waves, which influence upper-atmospheric dynamics and
coupling across atmospheric layers. CAM-NET leverages the Spherical Fourier
Neural Operator (SFNO) to capture global-scale atmospheric dynamics while
preserving the Earth’s spherical structure. Trained on a decade of datasets
from the Whole Atmosphere Community Climate Model with thermosphere and
ionosphere eXtension (WACCM-X), CAM-NET demonstrates accuracy comparable to
WACCM-X while achieving a speedup of over 1000x in inference time: once
trained, it can provide a one-year simulation within a few minutes. The model effectively
predicts key atmospheric parameters, including zonal and meridional winds,
temperature, and time rate of pressure. Inspired by traditional modeling
approaches that use external couplers to simulate tracer transport, CAM-NET
introduces a modular architecture that explicitly separates tracer prediction
from core dynamics. The core backbone of CAM-NET focuses on forecasting primary
physical variables (e.g., temperature, wind velocity), while tracer variables
are predicted through a lightweight, fine-tuned model. This design allows for
efficient adaptation to specific tracer scenarios with minimal computational
cost, avoiding the need to retrain the entire model. We have validated this
approach on the $O^2$ tracer, demonstrating strong performance and
generalization capabilities.
[LINK]
http://arxiv.org/abs/2506.19340v1
[DATE]
2025-06-24 14:07:28+08:00
[CATEGORIES]
cs.LG
Contrastive Cross-Modal Learning for Infusing Chest X-ray Knowledge into ECGs
[AUTHORS]
Vineet Punyamoorty, Aditya Malusare, Vaneet Aggarwal
[ABSTRACT]
Modern diagnostic workflows are increasingly multimodal, integrating diverse
data sources such as medical images, structured records, and physiological time
series. Among these, electrocardiograms (ECGs) and chest X-rays (CXRs) are two
of the most widely used modalities for cardiac assessment. While CXRs provide
rich diagnostic information, ECGs are more accessible and can support scalable
early warning systems. In this work, we propose CroMoTEX, a novel contrastive
learning-based framework that leverages chest X-rays during training to learn
clinically informative ECG representations for multiple cardiac-related
pathologies: cardiomegaly, pleural effusion, and edema. Our method aligns ECG
and CXR representations using a novel supervised cross-modal contrastive
objective with adaptive hard negative weighting, enabling robust and
task-relevant feature learning. At test time, CroMoTEX relies solely on ECG
input, allowing scalable deployment in real-world settings where CXRs may be
unavailable. Evaluated on the large-scale MIMIC-IV-ECG and MIMIC-CXR datasets,
CroMoTEX outperforms baselines across all three pathologies, achieving up to
78.31 AUROC on edema. Our code is available at
github.com/vineetpmoorty/cromotex.
[LINK]
http://arxiv.org/abs/2506.19329v1
[DATE]
2025-06-24 13:47:26+08:00
[CATEGORIES]
cs.LG
Diffusion-based Task-oriented Semantic Communications with Model Inversion Attack
[AUTHORS]
Xuesong Wang, Mo Li, Xingyan Shi, Zhaoqian Liu, Shenghao Yang
[ABSTRACT]
Semantic communication has emerged as a promising neural network-based system
design for 6G networks. Task-oriented semantic communication is a novel
paradigm whose core goal is to efficiently complete specific tasks by
transmitting semantic information, optimizing communication efficiency and task
performance. The key challenge lies in preserving privacy while maintaining
task accuracy, as this scenario is susceptible to model inversion attacks. In
such attacks, adversaries can restore or even reconstruct input data by
analyzing and processing model outputs, owing to the neural network-based
nature of the systems. In addition, traditional systems use image quality
indicators (such as PSNR or SSIM) to assess attack severity, which may be
inadequate for task-oriented semantic communication, since visual differences
do not necessarily ensure semantic divergence. In this paper, we propose a
diffusion-based semantic communication framework, named DiffSem, that optimizes
semantic information reconstruction through a diffusion mechanism with
self-referential label embedding to significantly improve task performance. Our
model also compensates for channel noise and adopts semantic information distortion
to ensure the robustness of the system in various signal-to-noise ratio
environments. To evaluate the attacker’s effectiveness, we propose a new metric
that better quantifies the semantic fidelity of estimations from the adversary.
Experimental results based on this criterion show that on the MNIST dataset,
DiffSem improves the classification accuracy by 10.03% and maintains stable
performance under dynamic channels. Our results further demonstrate that
significant deviation exists between traditional image quality indicators and
the leakage of task-relevant semantic information.
[LINK]
http://arxiv.org/abs/2506.19886v1
[DATE]
2025-06-24 13:21:27+08:00
[CATEGORIES]
cs.LG
Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups
[AUTHORS]
Weiqiu You, Helen Qu, Marco Gatti, Bhuvnesh Jain, Eric Wong
[ABSTRACT]
Self-attributing neural networks (SANNs) present a potential path towards
interpretable models for high-dimensional problems, but often face significant
trade-offs in performance. In this work, we formally prove a lower bound on
errors of per-feature SANNs, whereas group-based SANNs can achieve zero error
and thus high performance. Motivated by these insights, we propose Sum-of-Parts
(SOP), a framework that transforms any differentiable model into a group-based
SANN, where feature groups are learned end-to-end without group supervision.
SOP achieves state-of-the-art performance for SANNs on vision and language
tasks, and we validate that the groups are interpretable on a range of
quantitative and semantic metrics. We further validate the utility of SOP
explanations in model debugging and cosmological scientific discovery. Our code
is available at https://github.com/BrachioLab/sop
[COMMENTS]
ICML2025 Camera Ready
[LINK]
http://arxiv.org/abs/2310.16316v4
[DATE]
2025-06-24 12:57:39+08:00
[CATEGORIES]
cs.LG
FlightKooba: A Fast Interpretable FTP Model
[AUTHORS]
Jing Lu, Xuan Wu, Yizhun Tian, Songhan Fan, Yali Fang
[ABSTRACT]
The Koopman theory is a powerful and effective modeling tool for converting
nonlinear systems into linear representations, and flight trajectory prediction
(FTP) is a complex nonlinear system. However, current models applying the
Koopman theory to FTP tasks are not very effective, model interpretability
remains an issue, and the Koopman operators are computationally intensive,
resulting in long training times. To address this issue, this paper proposes a
new modeling and control framework based on the HIPPO method, the Koopman
theory, and state space equations from cybernetics: FlightKooba. Inspired by
the idea of structural state space equations, FlightKooba directly constructs
the Koopman operators from data. This makes the framework highly interpretable
and significantly reduces the number of trainable parameters in the module,
thereby greatly reducing training time. Experiments have demonstrated the
superiority of the FlightKooba modeling method in terms of time and memory
consumption (training time comparable to the Mamba module without using
CUDA-level acceleration; memory reduced by more than 50% on most datasets, with
a tenfold reduction in the number of parameters), essentially completing the
FTP task. It provides a new method for the fast computation of the Koopman
operators, opening up new possibilities for the combination of time series
forecasting and control.
[COMMENTS]
7 figures
[LINK]
http://arxiv.org/abs/2506.19885v1
[DATE]
2025-06-24 12:53:49+08:00
[CATEGORIES]
cs.LG
Adversarial Attacks on Deep Learning-Based False Data Injection Detection in Differential Relays
[AUTHORS]
Ahmad Mohammad Saber, Aditi Maheshwari, Amr Youssef, Deepa Kundur
[ABSTRACT]
The application of Deep Learning-based Schemes (DLSs) for detecting False
Data Injection Attacks (FDIAs) in smart grids has attracted significant
attention. This paper demonstrates that adversarial attacks, carefully crafted
FDIAs, can evade existing DLSs used for FDIA detection in Line Current
Differential Relays (LCDRs). We propose a novel adversarial attack framework,
utilizing the Fast Gradient Sign Method, which exploits DLS vulnerabilities by
introducing small perturbations to LCDR remote measurements, leading to
misclassification of the FDIA as a legitimate fault while also triggering the
LCDR to trip. We evaluate the robustness of multiple deep learning models,
including multi-layer perceptrons, convolutional neural networks, long
short-term memory networks, and residual networks, under adversarial
conditions. Our experimental results demonstrate that while these models
perform well, they exhibit high degrees of vulnerability to adversarial
attacks. For some models, the adversarial attack success rate exceeds 99.7%. To
address this threat, we introduce adversarial training as a proactive defense
mechanism, significantly enhancing the models’ ability to withstand adversarial
FDIAs without compromising fault detection accuracy. Our results highlight the
significant threat posed by adversarial attacks to DLS-based FDIA detection,
underscore the necessity for robust cybersecurity measures in smart grids, and
demonstrate the effectiveness of adversarial training in enhancing model
robustness against adversarial FDIAs.
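
The Fast Gradient Sign Method named above perturbs an input by a small step in the direction of the sign of the loss gradient. The following is a minimal sketch of that single step, assuming a toy logistic-regression detector with binary cross-entropy loss; the model, the weights, and the function names are hypothetical illustrations and are not the paper's DLSs or its LCDR measurements.

```python
import math


def predict(x, w, b):
    """Probability of class 1 under a toy logistic-regression model (assumed)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm_perturb(x, y, w, b, eps):
    """One FGSM step: x_adv = x + eps * sign(dL/dx).

    For binary cross-entropy on a logistic model, dL/dx_i = (p - y) * w_i.
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    # sign(g) is -1, 0, or +1
    signs = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, signs)]
```

Each coordinate moves by at most eps, so the perturbation stays bounded in the max norm while the loss on the true label increases for this toy model.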
[LINK]
http://arxiv.org/abs/2506.19302v1
[DATE]
2025-06-24 12:22:26+08:00
[CATEGORIES]
cs.LG
LAuReL: Learned Augmented Residual Layer
[AUTHORS]
Gaurav Menghani, Ravi Kumar, Sanjiv Kumar
[ABSTRACT]
One of the core pillars of efficient deep learning methods is architectural
improvements such as the residual/skip connection, which has led to
significantly better model convergence and quality. Since then the residual
connection has become ubiquitous in not just convolutional neural networks but
also transformer-based architectures, the backbone of LLMs.
In this paper we introduce Learned Augmented Residual Layer (LAuReL) – a
novel generalization of the canonical residual connection – with the goal to
be an in-situ replacement of the latter while outperforming on both model
quality and footprint metrics. Our experiments show that using LAuReL can help
boost performance for both vision and language models. For example, on the
ResNet-50, ImageNet 1K task, it achieves 60% of the gains from adding an extra
layer, while only adding 0.003% more parameters, and matches it while adding
2.6 times fewer parameters. Similarly, when pre-training 1B and 4B parameter
LLMs, LAuReL improves performance on a variety of challenging downstream
evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1%
additional parameters, respectively.
[COMMENTS]
Accepted at 42nd International Conference on Machine Learning (2025),
Vancouver, Canada
[LINK]
http://arxiv.org/abs/2411.07501v4
[DATE]
2025-06-24 12:11:06+08:00
[CATEGORIES]
cs.LG
SycnMapV2: Robust and Adaptive Unsupervised Segmentation
[AUTHORS]
Heng Zhang, Zikang Wan, Danilo Vasconcellos Vargas
[ABSTRACT]
Human vision excels at segmenting visual cues without the need for explicit
training, and it remains remarkably robust even as noise severity increases. In
contrast, existing AI algorithms struggle to maintain accuracy under similar
conditions. Here, we present SyncMapV2, the first to solve unsupervised
segmentation with state-of-the-art robustness. SyncMapV2 exhibits a minimal
drop in mIoU, only 0.01%, under digital corruption, compared to a 23.8% drop
observed in SOTA methods. This superior performance extends across various
types of corruption: noise (7.3% vs. 37.7%), weather (7.5% vs. 33.8%), and blur
(7.0% vs. 29.5%). Notably, SyncMapV2 accomplishes this without any robust
training, supervision, or loss functions. It is based on a learning paradigm
that uses self-organizing dynamical equations combined with concepts from
random networks. Moreover, unlike conventional methods that require
re-initialization for each new input, SyncMapV2 adapts online, mimicking the
continuous adaptability of human vision. Thus, we go beyond the accurate and
robust results, and present the first algorithm that can do all the above
online, adapting to input rather than re-initializing. In adaptability tests,
SyncMapV2 demonstrates near-zero performance degradation, which motivates and
fosters a new generation of robust and adaptive intelligence in the near
future.
[LINK]
http://arxiv.org/abs/2506.16297v2
[DATE]
2025-06-24 12:07:21+08:00
[CATEGORIES]
cs.LG
The Effect of Depth on the Expressivity of Deep Linear State-Space Models
[AUTHORS]
Zeyu Bao, Penghao Yu, Haotian Jiang, Qianxiao Li
[ABSTRACT]
Deep state-space models (SSMs) have gained increasing popularity in sequence
modelling. While there are numerous theoretical investigations of shallow SSMs,
how the depth of the SSM affects its expressiveness remains a crucial problem.
In this paper, we systematically investigate the role of depth and width in
deep linear SSMs, aiming to characterize how they influence the expressive
capacity of the architecture. First, we rigorously prove that in the absence of
parameter constraints, increasing depth and increasing width are generally
equivalent, provided that the parameter count remains within the same order of
magnitude. However, under the assumption that the parameter norms are
constrained, the effects of depth and width differ significantly. We show that
a shallow linear SSM with large parameter norms can be represented by a deep
linear SSM with smaller norms using a constructive method. In particular, this
demonstrates that deep SSMs are more capable of representing targets with large
norms than shallow SSMs under norm constraints. Finally, we derive upper bounds
on the minimal depth required for a deep linear SSM to represent a given
shallow linear SSM under constrained parameter norms. We also validate our
theoretical results with numerical experiments.
[LINK]
http://arxiv.org/abs/2506.19296v1
[DATE]
2025-06-24 12:01:21+08:00
[CATEGORIES]
cs.LG
Efficient Extreme Operating Condition Search for Online Relay Setting Calculation in Renewable Power Systems Based on Parallel Graph Neural Network
[AUTHORS]
Yan Li, Zengli Yang, Youhuai Wang, Jing Wang, Xiaoyu Han, Jingyu Wang, Dongyuan Shi
[ABSTRACT]
The Extreme Operating Conditions Search (EOCS) problem is one of the key
problems in relay setting calculation, which is used to ensure that the setting
values of protection relays can adapt to the changing operating conditions of
power systems over a period of time after deployment. The high penetration of
renewable energy and the wide application of inverter-based resources make the
operating conditions of renewable power systems more volatile, which urges the
adoption of the online relay setting calculation strategy. However, the
computation speed of existing EOCS methods based on local enumeration,
heuristic algorithms, and mathematical programming cannot meet the efficiency
requirement of online relay setting calculation. To reduce the time overhead,
this paper, for the first time, proposes an efficient deep learning-based EOCS
method suitable for online relay setting calculation. First, the power system
information is formulated as four layers, i.e., a component parameter layer, a
topological connection layer, an electrical distance layer, and a graph
distance layer, which are fed into a parallel graph neural network (PGNN) model
for feature extraction. Then, the four feature layers corresponding to each
node are spliced and stretched, and then fed into the decision network to
predict the extreme operating condition of the system. Finally, the proposed
PGNN method is validated on the modified IEEE 39-bus and 118-bus test systems,
where some of the synchronous generators are replaced by renewable generation
units. The nonlinear fault characteristics of renewables are fully considered
when computing fault currents. The experiment results show that the proposed
PGNN method achieves higher accuracy than the existing methods in solving the
EOCS problem. Meanwhile, it also provides greater improvements in online
computation time.
[LINK]
http://arxiv.org/abs/2506.19289v1
[DATE]
2025-06-24 11:50:58+08:00
[CATEGORIES]
cs.LG
Information-Theoretic Proofs for Diffusion Sampling
[AUTHORS]
Galen Reeves, Henry D. Pfister
[ABSTRACT]
This paper provides an elementary, self-contained analysis of diffusion-based
sampling methods for generative modeling. In contrast to existing approaches
that rely on continuous-time processes and then discretize, our treatment works
directly with discrete-time stochastic processes and yields precise
non-asymptotic convergence guarantees under broad assumptions. The key insight
is to couple the sampling process of interest with an idealized comparison
process that has an explicit Gaussian-convolution structure. We then leverage
simple identities from information theory, including the I-MMSE relationship,
to bound the discrepancy (in terms of the Kullback-Leibler divergence) between
these two discrete-time processes. In particular, we show that, if the
diffusion step sizes are chosen sufficiently small and one can approximate
certain conditional mean estimators well, then the sampling distribution is
provably close to the target distribution. Our results also provide a
transparent view on how to accelerate convergence by using additional
randomness in each step to match higher-order moments in the comparison
process.
[LINK]
http://arxiv.org/abs/2502.02305v2
[DATE]
2025-06-24 11:42:23+08:00
[CATEGORIES]
cs.LG
DF2: Distribution-Free Decision-Focused Learning
[AUTHORS]
Lingkai Kong, Wenhao Mu, Jiaming Cui, Yuchen Zhuang, B. Aditya Prakash, Bo Dai, Chao Zhang
[ABSTRACT]
Decision-focused learning (DFL), which differentiates through the KKT
conditions, has recently emerged as a powerful approach for
predict-then-optimize problems. However, under probabilistic settings, DFL
faces three major bottlenecks: model mismatch error, sample average
approximation error, and gradient approximation error. Model mismatch error
stems from the misalignment between the model’s parameterized predictive
distribution and the true probability distribution. Sample average
approximation error arises when using finite samples to approximate the
expected optimization objective. Gradient approximation error occurs when the
objectives are non-convex and KKT conditions cannot be directly applied. In
this paper, we present DF2, the first distribution-free decision-focused
learning method designed to mitigate these three bottlenecks. Rather than
depending on a task-specific forecaster that requires precise model
assumptions, our method directly learns the expected optimization function
during training. To efficiently learn this function in a data-driven manner, we
devise an attention-based model architecture inspired by the distribution-based
parameterization of the expected objective. We evaluate DF2 on two synthetic
problems and three real-world problems, demonstrating the effectiveness of DF2.
Our code is available at: https://github.com/Lingkai-Kong/DF2.
[COMMENTS]
UAI 2025
[LINK]
http://arxiv.org/abs/2308.05889v2
[DATE]
2025-06-24 11:35:35+08:00
[CATEGORIES]
cs.LG
A Batch-Insensitive Dynamic GNN Approach to Address Temporal Discontinuity in Graph Streams
[AUTHORS]
Yang Zhou, Xiaoning Ren
[ABSTRACT]
In dynamic graphs, preserving temporal continuity is critical. However,
Memory-based Dynamic Graph Neural Networks (MDGNNs) trained with large batches
often disrupt event sequences, leading to temporal information loss. This
discontinuity not only deteriorates temporal modeling but also hinders
optimization by increasing the difficulty of parameter convergence. Our
theoretical study quantifies this through a Lipschitz upper bound, showing that
large batch sizes enlarge the parameter search space. In response, we propose
BADGNN, a novel batch-agnostic framework consisting of two core components: (1)
Temporal Lipschitz Regularization (TLR) to control parameter search space
expansion, and (2) Adaptive Attention Adjustment (A3) to alleviate attention
distortion induced by both regularization and batching. Empirical results on
three benchmark datasets show that BADGNN maintains strong performance while
enabling significantly larger batch sizes and faster training compared to TGN.
Our code is available at
https://anonymous.4open.science/r/TGN_Lipichitz-C033/.
[COMMENTS]
8 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.19282v1
[DATE]
2025-06-24 11:31:43+08:00
[CATEGORIES]
cs.LG
STIMULUS: Achieving Fast Convergence and Low Sample Complexity in Stochastic Multi-Objective Learning
[AUTHORS]
Zhuqing Liu, Chaosheng Dong, Michinari Momma, Simone Shao, Shaoyuan Xu, Yan Gao, Haibo Yang, Jia Liu
[ABSTRACT]
Recently, multi-objective optimization (MOO) has gained attention for its
broad applications in ML, operations research, and engineering. However, MOO
algorithm design remains in its infancy and many existing MOO methods suffer
from unsatisfactory convergence rate and sample complexity performance. To
address this challenge, in this paper, we propose an algorithm called STIMULUS
(stochastic path-integrated multi-gradient recursive estimator), a new and
robust approach for solving MOO problems. Different from the traditional
methods, STIMULUS introduces a simple yet powerful recursive framework for
updating stochastic gradient estimates to improve convergence performance with
low sample complexity. In addition, we introduce an enhanced version of
STIMULUS, termed STIMULUS-M, which incorporates a momentum term to further
expedite convergence. We establish $O(1/T)$ convergence rates of the proposed
methods for non-convex settings and $O(\exp(-\mu T))$ for strongly convex
settings, where $T$ is the total number of iteration rounds. Additionally, we
achieve the state-of-the-art $O\left(n+\sqrt{n}\epsilon^{-1}\right)$ sample
complexities for non-convex settings and
$O\left(n+\sqrt{n}\ln(\mu/\epsilon)\right)$ for strongly convex settings, where $\epsilon>0$ is a
desired stationarity error. Moreover, to alleviate the periodic full gradient
evaluation requirement in STIMULUS and STIMULUS-M, we further propose enhanced
versions with adaptive batching called STIMULUS+/STIMULUS-M+ and provide their
theoretical analysis.
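
The recursive update of stochastic gradient estimates described above can be illustrated with a path-integrated estimator of the SARAH/SPIDER type, in which the previous estimate is corrected by a minibatch gradient difference. The sketch below is for a single scalar least-squares objective only; whether STIMULUS uses exactly this form per objective is an assumption, and the names `grad_i`, `full_grad`, and `recursive_update` are hypothetical.

```python
def grad_i(x, a, b):
    """Gradient of the sample loss 0.5 * (a * x - b) ** 2 with respect to x."""
    return a * (a * x - b)


def full_grad(x, data):
    """Full (periodically evaluated) gradient: mean over all samples."""
    return sum(grad_i(x, a, b) for a, b in data) / len(data)


def recursive_update(v_prev, x_prev, x_new, batch):
    """Path-integrated estimate: v_t = v_{t-1} + mean_batch(g_i(x_t) - g_i(x_{t-1}))."""
    diff = sum(grad_i(x_new, a, b) - grad_i(x_prev, a, b) for a, b in batch)
    return v_prev + diff / len(batch)
```

When the batch is the whole dataset the correction telescopes and the estimate equals the full gradient at the new point; with smaller batches it is a cheaper approximation that is refreshed by an occasional full-gradient evaluation.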
[LINK]
http://arxiv.org/abs/2506.19883v1
[DATE]
2025-06-24 11:31:25+08:00
[CATEGORIES]
cs.LG
Robust OOD Graph Learning via Mean Constraints and Noise Reduction
[AUTHORS]
Yang Zhou, Xiaoning Ren
[ABSTRACT]
Graph Out-of-Distribution (OOD) classification often suffers from sharp
performance drops, particularly under category imbalance and structural noise.
This work tackles two pressing challenges in this context: (1) the
underperformance of minority classes due to skewed label distributions, and (2)
their heightened sensitivity to structural noise in graph data. To address
these problems, we propose two complementary solutions. First, Constrained Mean
Optimization (CMO) improves minority class robustness by encouraging
similarity-based instance aggregation under worst-case conditions. Second, the
Neighbor-Aware Noise Reweighting (NNR) mechanism assigns dynamic weights to
training samples based on local structural consistency, mitigating noise
influence. We provide theoretical justification for our methods, and validate
their effectiveness with extensive experiments on both synthetic and real-world
datasets, showing significant improvements in Graph OOD generalization and
classification accuracy. The code for our method is available at:
https://anonymous.4open.science/r/CMO-NNR-2F30.
[COMMENTS]
8 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.19281v1
[DATE]
2025-06-24 11:25:33+08:00
[CATEGORIES]
cs.LG
Emotion Detection on User Front-Facing App Interfaces for Enhanced Schedule Optimization: A Machine Learning Approach
[AUTHORS]
Feiting Yang, Antoine Moevus, Steve Lévesque
[ABSTRACT]
Human-Computer Interaction (HCI) has evolved significantly to incorporate
emotion recognition capabilities, creating unprecedented opportunities for
adaptive and personalized user experiences. This paper explores the integration
of emotion detection into calendar applications, enabling user interfaces to
dynamically respond to users’ emotional states and stress levels, thereby
enhancing both productivity and engagement. We present and evaluate two
complementary approaches to emotion detection: a biometric-based method
utilizing heart rate (HR) data extracted from electrocardiogram (ECG) signals
processed through Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)
neural networks to predict the emotional dimensions of Valence, Arousal, and
Dominance; and a behavioral method analyzing computer activity through multiple
machine learning models to classify emotions based on fine-grained user
interactions such as mouse movements, clicks, and keystroke patterns. Our
comparative analysis, from real-world datasets, reveals that while both
approaches demonstrate effectiveness, the computer activity-based method
delivers superior consistency and accuracy, particularly for mouse-related
interactions, which achieved approximately 90% accuracy. Furthermore, GRU
networks outperformed LSTM models in the biometric approach, with Valence
prediction reaching 84.38% accuracy.
[LINK]
http://arxiv.org/abs/2506.19280v1
[DATE]
2025-06-24 11:21:46+08:00
[CATEGORIES]
cs.LG
Rare dense solutions clusters in asymmetric binary perceptrons – local entropy via fully lifted RDT
[AUTHORS]
Mihailo Stojnic
[ABSTRACT]
We study classical asymmetric binary perceptron (ABP) and associated
local entropy (LE) as potential source of its algorithmic hardness.
Isolation of typical ABP solutions in SAT phase seemingly suggests a
universal algorithmic hardness. Paradoxically, efficient algorithms do exist
even for constraint densities $\alpha$ fairly close but at a finite distance
(computational gap) from the capacity. In recent years, existence of
rare large dense clusters and magical ability of fast algorithms to find them
have been posited as the conceptual resolution of this paradox. Monotonicity or
breakdown of the LEs associated with such atypical clusters are
predicated to play a key role in their thinning-out or even complete
defragmentation.
Invention of fully lifted random duality theory (fl RDT) [90,93,94] allows
studying random structures typical features. A large deviation upgrade,
sfl LD RDT [96,97], moves things further and enables atypical features
characterizations as well. Utilizing the machinery of [96,97] we here develop a
generic framework to study LE as an ABP’s atypical feature. Already on the
second level of lifting we discover that the LE results are closely matching
those obtained through replica methods. For classical zero threshold ABP, we
obtain that LE breaks down for $\alpha$ in $(0.77,0.78)$ interval which
basically matches $\alpha\sim 0.75-0.77$ range that currently best ABP solvers
can handle and effectively indicates that LE’s behavior might indeed be among
key reflections of the ABP’s computational gaps presumable existence.
[LINK]
http://arxiv.org/abs/2506.19276v1
[DATE]
2025-06-24 11:12:39+08:00
[CATEGORIES]
cs.LG
A Qubit-Efficient Hybrid Quantum Encoding Mechanism for Quantum Machine Learning
[AUTHORS]
Hevish Cowlessur, Tansu Alpcan, Chandra Thapa, Seyit Camtepe, Neel Kanth Kundu
[ABSTRACT]
Efficiently embedding high-dimensional datasets onto noisy and low-qubit
quantum systems is a significant barrier to practical Quantum Machine Learning
(QML). Approaches such as quantum autoencoders can be constrained by current
hardware capabilities and may exhibit vulnerabilities to reconstruction attacks
due to their invertibility. We propose Quantum Principal Geodesic Analysis
(qPGA), a novel, non-invertible method for dimensionality reduction and
qubit-efficient encoding. Executed classically, qPGA leverages Riemannian
geometry to project data onto the unit Hilbert sphere, generating outputs
inherently suitable for quantum amplitude encoding. This technique preserves
the neighborhood structure of high-dimensional datasets within a compact latent
space, significantly reducing qubit requirements for amplitude encoding. We
derive theoretical bounds quantifying qubit requirements for effective encoding
onto noisy systems. Empirical results on MNIST, Fashion-MNIST, and CIFAR-10
show that qPGA preserves local structure more effectively than both quantum and
hybrid autoencoders. Additionally, we demonstrate that qPGA enhances resistance
to reconstruction attacks due to its non-invertible nature. In downstream QML
classification tasks, qPGA can achieve over 99% accuracy and F1-score on MNIST
and Fashion-MNIST, outperforming quantum-dependent baselines. Initial tests on
real hardware and noisy simulators confirm its potential for noise-resilient
performance, offering a scalable solution for advancing QML applications.
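
Amplitude encoding, which the abstract names as the target format, stores a unit-norm vector of length 2^n in the amplitudes of n qubits, so a d-dimensional input needs ceil(log2 d) qubits. The sketch below shows only that generic preparation step (zero-padding and normalization onto the unit sphere); it is not qPGA itself, and the function names are hypothetical.

```python
import math


def qubits_needed(dim):
    """Smallest n with 2**n >= dim (at least one qubit)."""
    return max(1, (dim - 1).bit_length())


def to_amplitudes(vec):
    """Zero-pad to length 2**n and scale to unit Euclidean norm."""
    n = qubits_needed(len(vec))
    padded = list(vec) + [0.0] * (2 ** n - len(vec))
    norm = math.sqrt(sum(v * v for v in padded))
    if norm == 0.0:
        raise ValueError("cannot amplitude-encode the zero vector")
    return [v / norm for v in padded]
```

Under this counting, a flattened 28x28 image (784 values) needs 10 qubits, while a 16-dimensional latent vector needs 4, which is the kind of saving a lower-dimensional latent space gives.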
[LINK]
http://arxiv.org/abs/2506.19275v1
[DATE]
2025-06-24 11:09:16+08:00
[CATEGORIES]
cs.LG
Stabilizing PDE–ML Coupled System
[AUTHORS]
Saad Qadeer, Panos Stinis, Hui Wan
[ABSTRACT]
A long-standing obstacle in the use of machine-learnt surrogates with larger
PDE systems is the onset of instabilities when solved numerically. Efforts
towards ameliorating these have mostly concentrated on improving the accuracy
of the surrogates or imbuing them with additional structure, and have garnered
limited success. In this article, we study a prototype problem and draw
insights that can help with more complex systems. In particular, we focus on a
viscous Burgers’-ML system and, after identifying the cause of the
instabilities, prescribe strategies to stabilize the coupled system. To improve
the accuracy of the stabilized system, we next explore methods based on the
Mori–Zwanzig formalism.
[LINK]
http://arxiv.org/abs/2506.19274v1
[DATE]
2025-06-24 11:09:14+08:00
[CATEGORIES]
cs.LG
Continuous-variable Quantum Diffusion Model for State Generation and Restoration
[AUTHORS]
Haitao Huang, Chuangtao Chen, Qinglin Zhao
[ABSTRACT]
The generation and preservation of complex quantum states against
environmental noise are paramount challenges in advancing continuous-variable
(CV) quantum information processing. This paper introduces a novel framework
based on continuous-variable quantum diffusion principles, synergizing them
with CV quantum neural networks (CVQNNs) to address these dual challenges. For
the task of state generation, our Continuous-Variable Quantum Diffusion
Generative model (CVQD-G) employs a physically driven forward diffusion process
using a thermal loss channel, which is then inverted by a learnable,
parameter-efficient backward denoising process based on a CVQNN with
time-embedding. This framework’s capability is further extended for state
recovery by the Continuous-Variable Quantum Diffusion Restoration model
(CVQD-R), a specialized variant designed to restore quantum states,
particularly coherent states with unknown parameters, from thermal degradation.
Extensive numerical simulations validate these dual capabilities, demonstrating
the high-fidelity generation of diverse Gaussian (coherent, squeezed) and
non-Gaussian (Fock, cat) states, typically with fidelities exceeding 99%, and
confirming the model’s ability to robustly restore corrupted states.
Furthermore, a comprehensive complexity analysis reveals favorable training and
inference costs, highlighting the framework’s efficiency, scalability, and its
potential as a robust tool for quantum state engineering and noise mitigation
in realistic CV quantum systems.
[COMMENTS]
15+3 pages, 14 figures, 7 tables
[LINK]
http://arxiv.org/abs/2506.19270v1
[DATE]
2025-06-24 11:04:21+08:00
[CATEGORIES]
cs.LG
Learning Treatment Representations for Downstream Instrumental Variable Regression
[AUTHORS]
Shiangyi Lin, Hui Lan, Vasilis Syrgkanis
[ABSTRACT]
Traditional instrumental variable (IV) estimators face a fundamental
constraint: they can only accommodate as many endogenous treatment variables as
available instruments. This limitation becomes particularly challenging in
settings where the treatment is presented in a high-dimensional and
unstructured manner (e.g. descriptions of patient treatment pathways in a
hospital). In such settings, researchers typically resort to applying
unsupervised dimension reduction techniques to learn a low-dimensional
treatment representation prior to implementing IV regression analysis. We show
that such methods can suffer from substantial omitted variable bias due to
implicit regularization in the representation learning step. We propose a novel
approach to construct treatment representations by explicitly incorporating
instrumental variables during the representation learning process. Our approach
provides a framework for handling high-dimensional endogenous variables with
limited instruments. We demonstrate both theoretically and empirically that
fitting IV models on these instrument-informed representations ensures
identification of directions that optimize outcome prediction. Our experiments
show that our proposed methodology improves upon the conventional two-stage
approaches that perform dimension reduction without incorporating instrument
information.
[LINK]
http://arxiv.org/abs/2506.02200v2
[DATE]
2025-06-24 10:58:39+08:00
[CATEGORIES]
cs.LG
Leveraging Large Language Models to Democratize Access to Costly Datasets for Academic Research
[AUTHORS]
Julian Junyan Wang, Victor Xiaoqi Wang
[ABSTRACT]
Unequal access to costly datasets essential for empirical research has long
hindered researchers from disadvantaged institutions, limiting their ability to
contribute to their fields and advance their careers. Recent breakthroughs in
Large Language Models (LLMs) have the potential to democratize data access by
automating data collection from unstructured sources. We develop and evaluate a
novel methodology using GPT-4o-mini within a Retrieval-Augmented Generation
(RAG) framework to collect data from corporate disclosures. Our approach
achieves human-level accuracy in collecting CEO pay ratios from approximately
10,000 proxy statements and Critical Audit Matters (CAMs) from more than 12,000
10-K filings, with LLM processing times of 9 and 40 minutes respectively, each
at a cost under $10. This stands in stark contrast to the hundreds of hours
needed for manual collection or the thousands of dollars required for
commercial database subscriptions. To foster a more inclusive research
community by empowering researchers with limited resources to explore new
avenues of inquiry, we share our methodology and the resulting datasets.
[COMMENTS]
52 pages, 5 figures, 5 tables
[LINK]
http://arxiv.org/abs/2412.02065v2
[DATE]
2025-06-24 10:52:00+08:00
[CATEGORIES]
cs.LG
Network Structures as an Attack Surface: Topology-Based Privacy Leakage in Federated Learning
[AUTHORS]
Murtaza Rangwala, Richard O. Sinnott, Rajkumar Buyya
[ABSTRACT]
Federated learning systems increasingly rely on diverse network topologies to
address scalability and organizational constraints. While existing privacy
research focuses on gradient-based attacks, the privacy implications of network
topology knowledge remain critically understudied. We conduct the first
comprehensive analysis of topology-based privacy leakage across realistic
adversarial knowledge scenarios, demonstrating that adversaries with varying
degrees of structural knowledge can infer sensitive data distribution patterns
even under strong differential privacy guarantees. Through systematic
evaluation of 4,720 attack instances, we analyze six distinct adversarial
knowledge scenarios: complete topology knowledge and five partial knowledge
configurations reflecting real-world deployment constraints. We propose three
complementary attack vectors: communication pattern analysis, parameter
magnitude profiling, and structural position correlation, achieving success
rates of 84.1%, 65.0%, and 47.2% under complete knowledge conditions.
Critically, we find that 80% of realistic partial knowledge scenarios maintain
attack effectiveness above security thresholds, with certain partial knowledge
configurations achieving performance superior to the baseline complete
knowledge scenario. To address these vulnerabilities, we propose and
empirically validate structural noise injection as a complementary defense
mechanism across 808 configurations, demonstrating up to 51.4% additional
attack reduction when properly layered with existing privacy techniques. These
results establish that network topology represents a fundamental privacy
vulnerability in federated learning systems while providing practical pathways
for mitigation through topology-aware defense mechanisms.
[COMMENTS]
13 pages, 7 figures, 5 tables. Data from the experiments and source
code can be found here: https://doi.org/10.5281/zenodo.15622123
[LINK]
http://arxiv.org/abs/2506.19260v1
[DATE]
2025-06-24 10:42:08+08:00
[CATEGORIES]
cs.LG
Robust Behavior Cloning Via Global Lipschitz Regularization
[AUTHORS]
Shili Wu, Yizhao Jin, Puhua Niu, Aniruddha Datta, Sean B. Andersson
[ABSTRACT]
Behavior Cloning (BC) is an effective imitation learning technique and has
even been adopted in some safety-critical domains such as autonomous vehicles.
BC trains a policy to mimic the behavior of an expert by using a dataset
composed of only state-action pairs demonstrated by the expert, without any
additional interaction with the environment. However, during deployment, the
policy observations may contain measurement errors or adversarial disturbances.
Since the observations may deviate from the true states, they can mislead the
agent into making sub-optimal actions. In this work, we use a global Lipschitz
regularization approach to enhance the robustness of the learned policy
network. We then show that the resulting global Lipschitz property provides a
robustness certificate to the policy with respect to different bounded norm
perturbations. Then, we propose a way to construct a Lipschitz neural network
that ensures the policy robustness. We empirically validate our theory across
various environments in Gymnasium. Keywords: Robust Reinforcement Learning;
Behavior Cloning; Lipschitz Neural Network
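As a rough illustration of why a global Lipschitz bound acts as a robustness certificate, the inequality below is a standard sketch and not necessarily the exact certificate derived in the paper. It assumes a feedforward policy $\pi_\theta$ with weight matrices $W_1,\dots,W_K$ and 1-Lipschitz activations, where $L$, $\delta$, and $\epsilon$ are illustrative symbols not taken from the abstract.

```latex
% Assumption: \pi_\theta is a feedforward network with weight matrices
% W_1, \dots, W_K and 1-Lipschitz activations (e.g. ReLU);
% \|W_i\|_2 denotes the spectral norm of W_i.
\[
  \|\pi_\theta(s+\delta) - \pi_\theta(s)\|_2 \le L\,\|\delta\|_2,
  \qquad
  L \le \prod_{i=1}^{K} \|W_i\|_2 .
\]
% Hence any observation perturbation with \|\delta\|_2 \le \epsilon
% changes the action by at most L\epsilon.
```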
[LINK]
http://arxiv.org/abs/2506.19250v1
[DATE]
2025-06-24 10:19:08+08:00
[CATEGORIES]
cs.LG
Inference-Time Reward Hacking in Large Language Models
[AUTHORS]
Hadi Khalaf, Claudio Mayrink Verdun, Alex Oesterling, Himabindu Lakkaraju, Flavio du Pin Calmon
[ABSTRACT]
A common paradigm to improve the performance of large language models is
optimizing for a reward model. Reward models assign a numerical score to LLM
outputs indicating, for example, which response would likely be preferred by a
user or is most aligned with safety goals. However, reward models are never
perfect. They inevitably function as proxies for complex desiderata such as
correctness, helpfulness, and safety. By overoptimizing for a misspecified
reward, we can subvert intended alignment goals and reduce overall performance
– a phenomenon commonly referred to as reward hacking. In this work, we
characterize reward hacking in inference-time alignment and demonstrate when
and how we can mitigate it by hedging on the proxy reward. We study this
phenomenon under Best-of-$n$ (BoN) and Soft-Best-of-$n$ (SBoN), and we
introduce Best-of-Poisson (BoP) that provides an efficient, near-exact
approximation of the optimal reward-KL divergence policy at inference time. We
show that the characteristic pattern of hacking as observed in practice (where
the true reward first increases before declining) is an inevitable property of
a broad class of inference-time mechanisms, including BoN and BoP. To counter
this effect, hedging offers a tactical choice to avoid placing undue confidence
in high but potentially misleading proxy reward signals. We introduce
HedgeTune, an efficient algorithm to find the optimal inference-time parameter
and avoid reward hacking. We demonstrate through experiments that hedging
mitigates reward hacking and achieves superior distortion-reward tradeoffs with
minimal computational overhead.
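A minimal Python sketch of the two baseline selection rules named above may help fix ideas. It assumes the common formulations: Best-of-$n$ returns the candidate with the highest proxy reward, and Soft-Best-of-$n$ samples a candidate with probability proportional to exp(reward / temperature). The function names, the `temperature` parameter, and the toy reward are hypothetical, and neither Best-of-Poisson nor HedgeTune is reproduced here.

```python
import math
import random


def best_of_n(candidates, proxy_reward):
    """Best-of-n: return the candidate with the highest proxy reward."""
    return max(candidates, key=proxy_reward)


def soft_best_of_n(candidates, proxy_reward, temperature, rng=random):
    """Soft Best-of-n (assumed form): sample one candidate with probability
    proportional to exp(proxy_reward / temperature)."""
    scores = [proxy_reward(c) / temperature for c in candidates]
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy proxy reward: response length (purely illustrative).
    responses = ["a", "bbb", "cc"]
    print(best_of_n(responses, len))
    print(soft_best_of_n(responses, len, temperature=1.0))
```

A large temperature makes the soft rule close to uniform sampling, while a small temperature recovers Best-of-$n$; in this sketch, that temperature is the knob through which one can hedge against an unreliable proxy reward.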
[COMMENTS]
Accepted to ICML 2025 Workshop on Models of Human Feedback for AI
Alignment
[LINK]
http://arxiv.org/abs/2506.19248v1
[DATE]
2025-06-24 10:05:25+08:00
[CATEGORIES]
cs.LG
Behavioral Anomaly Detection in Distributed Systems via Federated Contrastive Learning
[AUTHORS]
Renzi Meng, Heyi Wang, Yumeng Sun, Qiyuan Wu, Lian Lian, Renhan Zhang
[ABSTRACT]
This paper addresses the increasingly prominent problem of anomaly detection
in distributed systems. It proposes a detection method based on federated
contrastive learning. The goal is to overcome the limitations of traditional
centralized approaches in terms of data privacy, node heterogeneity, and
anomaly pattern recognition. The proposed method combines the distributed
collaborative modeling capabilities of federated learning with the feature
discrimination enhancement of contrastive learning. It builds embedding
representations on local nodes and constructs positive and negative sample
pairs to guide the model in learning a more discriminative feature space.
Without exposing raw data, the method optimizes a global model through a
federated aggregation strategy. Specifically, the method uses an encoder to
represent local behavior data in high-dimensional space. This includes system
logs, operational metrics, and system calls. The model is trained using both
contrastive loss and classification loss to improve its ability to detect
fine-grained anomaly patterns. The method is evaluated under multiple typical
attack types. It is also tested in a simulated real-time data stream scenario
to examine its responsiveness. Experimental results show that the proposed
method outperforms existing approaches across multiple performance metrics. It
demonstrates strong detection accuracy and adaptability, effectively addressing
complex anomalies in distributed environments. Through careful design of key
modules and optimization of the training mechanism, the proposed method
achieves a balance between privacy preservation and detection performance. It
offers a feasible technical path for intelligent security management in
distributed systems.
[LINK]
http://arxiv.org/abs/2506.19246v1
[DATE]
2025-06-24 10:04:44+08:00
[CATEGORIES]
cs.LG
Universal kernels via harmonic analysis on Riemannian symmetric spaces
[AUTHORS]
Franziskus Steinert, Salem Said, Cyrus Mostajeran
[ABSTRACT]
The universality properties of kernels characterize the class of functions
that can be approximated in the associated reproducing kernel Hilbert space and
are of fundamental importance in the theoretical underpinning of kernel methods
in machine learning. In this work, we establish fundamental tools for
investigating universality properties of kernels in Riemannian symmetric
spaces, thereby extending the study of this important topic to kernels in
non-Euclidean domains. Moreover, we use the developed tools to prove the
universality of several recent examples from the literature on positive
definite kernels defined on Riemannian symmetric spaces, thus providing
theoretical justification for their use in applications involving
manifold-valued data.
[LINK]
http://arxiv.org/abs/2506.19245v1
[DATE]
2025-06-24 10:03:25+08:00
[CATEGORIES]
cs.LG
SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation
[AUTHORS]
Dahun Shin, Dongyeop Lee, Jinseok Chung, Namhoon Lee
[ABSTRACT]
Approximate second-order optimization methods often exhibit poorer
generalization compared to first-order approaches. In this work, we look into
this issue through the lens of the loss landscape and find that existing
second-order methods tend to converge to sharper minima compared to SGD. In
response, we propose Sassha, a novel second-order method designed to enhance
generalization by explicitly reducing sharpness of the solution, while
stabilizing the computation of approximate Hessians along the optimization
trajectory. This sharpness minimization scheme is also designed to
accommodate lazy Hessian updates, securing efficiency in addition to flatness.
To validate its effectiveness, we conduct a wide range of standard deep
learning experiments where Sassha demonstrates its outstanding generalization
performance that is comparable to, and mostly better than, other methods. We
provide a comprehensive set of analyses including convergence, robustness,
stability, efficiency, and cost.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2502.18153v2
[DATE]
2025-06-24 10:03:04+08:00
[CATEGORIES]
cs.LG
High precision PINNs in unbounded domains: application to singularity formulation in PDEs
[AUTHORS]
Yixuan Wang, Ziming Liu, Zongyi Li, Anima Anandkumar, Thomas Y. Hou
[ABSTRACT]
We investigate the high-precision training of Physics-Informed Neural
Networks (PINNs) in unbounded domains, with a special focus on applications to
singularity formulation in PDEs. We propose a modularized approach and study
the choices of neural network ansatz, sampling strategy, and optimization
algorithm. When combined with rigorous computer-assisted proofs and PDE
analysis, the numerical solutions identified by PINNs, provided they are of
high precision, can serve as a powerful tool for studying singularities in
PDEs. For the 1D Burgers equation, our framework can lead to a solution with very
high precision, and for the 2D Boussinesq equation, which is directly related
to the singularity formulation in 3D Euler and Navier-Stokes equations, we
obtain a solution whose loss is $4$ digits smaller than that obtained in
\cite{wang2023asymptotic} with fewer training steps. We also discuss potential
directions for pushing towards machine precision for higher-dimensional
problems.
[LINK]
http://arxiv.org/abs/2506.19243v1
[DATE]
2025-06-24 10:01:44+08:00
[CATEGORIES]
cs.LG
A General Framework for Property-Driven Machine Learning
[AUTHORS]
Thomas Flinkow, Marco Casadio, Colin Kessler, Rosemary Monahan, Ekaterina Komendantskaya
[ABSTRACT]
Neural networks have been shown to frequently fail to learn critical safety
and correctness properties purely from data, highlighting the need for training
methods that directly integrate logical specifications. While adversarial
training can be used to improve robustness to small perturbations within
$\epsilon$-cubes, domains other than computer vision – such as control systems
and natural language processing – may require more flexible input region
specifications via generalised hyper-rectangles. Differentiable logics offer a
way to encode arbitrary logical constraints as additional loss terms that guide
the learning process towards satisfying these constraints. In this paper, we
investigate how these two complementary approaches can be unified within a
single framework for property-driven machine learning, as a step toward
effective formal verification of neural networks. We show that well-known
properties from the literature are subcases of this general approach, and we
demonstrate its practical effectiveness on a case study involving a neural
network controller for a drone system. Our framework is made publicly available
at https://github.com/tflinkow/property-driven-ml.
[COMMENTS]
24 pages, 4 tables, 4 figures
[LINK]
http://arxiv.org/abs/2505.00466v2
[DATE]
2025-06-24 09:27:12+08:00
[CATEGORIES]
cs.LG
Limits of Discrete Energy of Families of Increasing Sets
[AUTHORS]
Hari Sarang Nathan
[ABSTRACT]
The Hausdorff dimension of a set can be detected using the Riesz energy.
Here, we consider situations where a sequence of points, $\{x_n\}$, “fills
in” a set $E \subset \mathbb{R}^d$ in an appropriate sense and investigate the
degree to which the discrete analog to the Riesz energy of these sets can be
used to bound the Hausdorff dimension of $E$. We also discuss applications to
data science and Erdős/Falconer type problems.
[LINK]
http://arxiv.org/abs/2504.11302v2
[DATE]
2025-06-24 09:23:47+08:00
[CATEGORIES]
cs.LG
Private Model Personalization Revisited
[AUTHORS]
Conor Snedeker, Xinyu Zhou, Raef Bassily
[ABSTRACT]
We study model personalization under user-level differential privacy (DP) in
the shared representation framework. In this problem, there are $n$ users whose
data is statistically heterogeneous, and their optimal parameters share an
unknown embedding $U^* \in\mathbb{R}^{d\times k}$ that maps the user parameters
in $\mathbb{R}^d$ to low-dimensional representations in $\mathbb{R}^k$, where
$k\ll d$. Our goal is to privately recover the shared embedding and the local
low-dimensional representations with small excess risk in the federated
setting. We propose a private, efficient federated learning algorithm to learn
the shared embedding based on the FedRep algorithm in [CHM+21]. Unlike
[CHM+21], our algorithm satisfies differential privacy, and our results hold
for the case of noisy labels. In contrast to prior work on private model
personalization [JRS+21], our utility guarantees hold under a larger class of
users’ distributions (sub-Gaussian instead of Gaussian distributions).
Additionally, in natural parameter regimes, we improve the privacy error term
in [JRS+21] by a factor of $\widetilde{O}(dk)$. Next, we consider the binary
classification setting. We present an information-theoretic construction to
privately learn the shared embedding and derive a margin-based accuracy
guarantee that is independent of $d$. Our method utilizes the
Johnson-Lindenstrauss transform to reduce the effective dimensions of the
shared embedding and the users’ data. This result shows that
dimension-independent risk bounds are possible in this setting under a margin
loss.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.19220v1
[DATE]
2025-06-24 08:57:17+08:00
[CATEGORIES]
cs.LG
Iterative Minimax Games with Coupled Linear Constraints
[AUTHORS]
Huiling Zhang, Zi Xu, Yu-Hong Dai
[ABSTRACT]
The study of nonconvex minimax games has gained significant momentum in
machine learning and decision science communities due to their fundamental
connections to adversarial training scenarios. This work develops a primal-dual
alternating proximal gradient (PDAPG) algorithm framework for resolving
iterative minimax games featuring nonsmooth nonconvex objectives subject to
coupled linear constraints. We establish rigorous convergence guarantees for
both nonconvex-strongly concave and nonconvex-concave game configurations,
demonstrating that PDAPG achieves an $\varepsilon$-stationary solution within
$\mathcal{O}\left( \varepsilon ^{-2} \right)$ iterations for strongly concave
settings and $\mathcal{O}\left( \varepsilon ^{-4} \right)$ iterations for
concave scenarios. Our analysis provides the first known iteration complexity
bounds for this class of constrained minimax games, particularly addressing the
critical challenge of coupled linear constraints that induce inherent
interdependencies among strategy variables. The proposed game-theoretic
framework advances existing solution methodologies by simultaneously handling
nonsmooth components and coordinated constraint structures through alternating
primal-dual updates.
[LINK]
http://arxiv.org/abs/2212.04672v5
[DATE]
2025-06-24 08:47:46+08:00
[CATEGORIES]
cs.LG
Simulation of a closed-loop dc-dc converter using a physics-informed neural network-based model
[AUTHORS]
Marc-Antoine Coulombe, Maxime Berger, Antoine Lesage-Landry
[ABSTRACT]
The growing reliance on power electronics introduces new challenges requiring
detailed time-domain analyses with fast and accurate circuit simulation tools.
Currently, commercial time-domain simulation software relies mainly on
physics-based methods to simulate power electronics. Recent work showed that
data-driven and physics-informed learning methods can increase simulation speed
with limited compromise on accuracy, but many challenges remain before
deployment in commercial tools can be possible. In this paper, we propose a
physics-informed bidirectional long short-term memory neural network
(BiLSTM-PINN) model to simulate the time-domain response of a closed-loop dc-dc
boost converter for various operating points, parameters, and perturbations. A
physics-informed fully-connected neural network (FCNN) and a BiLSTM are also
trained to establish a comparison. The three methods are then compared using
step-response tests to assess their performance and limitations in terms of
accuracy. The results show that the BiLSTM-PINN and BiLSTM models outperform
the FCNN model by more than 9 and 4.5 times, respectively, in terms of median
RMSE. Their standard deviation values are more than 2.6 and 1.7 times smaller than
the FCNN’s, making them also more consistent. Those results illustrate that the
proposed BiLSTM-PINN is a potential alternative to other physics-based or
data-driven methods for power electronics simulations.
[COMMENTS]
8 pages, 6 figures, Paper submitted to the International Conference
on Power Systems Transients (IPST2025) in Guadalajara, Mexico, June 8-12,
2025
[LINK]
http://arxiv.org/abs/2506.19178v1
[DATE]
2025-06-24 06:44:56+08:00
[CATEGORIES]
cs.LG
Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes
[AUTHORS]
Jesse He, Helen Jenne, Herman Chau, Davis Brown, Mark Raugas, Sara Billey, Henry Kvinge
[ABSTRACT]
Machine learning is becoming an increasingly valuable tool in mathematics,
enabling one to identify subtle patterns across collections of examples so vast
that they would be impossible for a single researcher to feasibly review and
analyze. In this work, we use graph neural networks to investigate quiver
mutation – an operation that transforms one quiver (or directed multigraph)
into another – which is central to the theory of cluster algebras with deep
connections to geometry, topology, and physics. In the study of cluster
algebras, the question of mutation equivalence is of fundamental
concern: given two quivers, can one efficiently determine if one quiver can be
transformed into the other through a sequence of mutations? In this paper, we
use graph neural networks and AI explainability techniques to independently
discover mutation equivalence criteria for quivers of type $\tilde{D}$. Along
the way, we also show that even without explicit training to do so, our model
captures structure within its hidden representation that allows us to
reconstruct known criteria from type $D$, adding to the growing evidence that
modern machine learning models are capable of learning abstract and
parsimonious rules from mathematical data.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2411.07467v2
[DATE]
2025-06-24 06:44:29+08:00
[CATEGORIES]
cs.LG
The Gittins Index: A Design Principle for Decision-Making Under Uncertainty
[AUTHORS]
Ziv Scully, Alexander Terenin
[ABSTRACT]
The Gittins index is a tool that optimally solves a variety of
decision-making problems involving uncertainty, including multi-armed bandit
problems, minimizing mean latency in queues, and search problems like the
Pandora’s box model. However, despite the above examples and later extensions
thereof, the space of problems that the Gittins index can solve perfectly
optimally is limited, and its definition is rather subtle compared to those of
other multi-armed bandit algorithms. As a result, the Gittins index is often
regarded as being primarily a concept of theoretical importance, rather than a
practical tool for solving decision-making problems.
The aim of this tutorial is to demonstrate that the Gittins index can be
fruitfully applied to practical problems. We start by giving an example-driven
introduction to the Gittins index, then walk through several examples of
problems it solves - some optimally, some suboptimally but still with excellent
performance. Two practical highlights in the latter category are applying the
Gittins index to Bayesian optimization, and applying the Gittins index to
minimizing tail latency in queues.
[LINK]
http://arxiv.org/abs/2506.10872v2
[DATE]
2025-06-24 06:41:05+08:00
[CATEGORIES]
cs.LG
Distilling Tool Knowledge into Language Models via Back-Translated Traces
[AUTHORS]
Xingyue Huang, Xianglong Hu, Zifeng Ding, Yuan He, Rishabh, Waleed Alzarooni, Ziyu Ye, Wendong Fan, Bailan He, Haige Bo, Changran Hu, Guohao Li
[ABSTRACT]
Large language models (LLMs) often struggle with mathematical problems that
require exact computation or multi-step algebraic reasoning. Tool-integrated
reasoning (TIR) offers a promising solution by leveraging external tools such
as code interpreters to ensure correctness, but it introduces inference-time
dependencies that hinder scalability and deployment. In this work, we propose a
new paradigm for distilling tool knowledge into LLMs purely through natural
language. We first construct a Solver Agent that solves math problems by
interleaving planning, symbolic tool calls, and reflective reasoning. Then,
using a back-translation pipeline powered by multiple LLM-based agents, we
convert interleaved TIR traces into natural language reasoning traces. A
Translator Agent generates explanations for individual tool calls, while a
Rephrase Agent merges them into a fluent and globally coherent narrative.
Empirically, we show that fine-tuning a small open-source model on these
synthesized traces enables it to internalize both tool knowledge and structured
reasoning patterns, yielding gains on competition-level math benchmarks without
requiring tool access at inference.
[COMMENTS]
Accepted at the Workshop on Multi-Agent Systems in the Era of Foundation
Models: Opportunities, Challenges and Futures, ICML 2025
[LINK]
http://arxiv.org/abs/2506.19171v1
[DATE]
2025-06-24 06:10:38+08:00
[CATEGORIES]
cs.LG
A Deep Learning Based Method for Fast Registration of Cardiac Magnetic Resonance Images
[AUTHORS]
Benjamin Graham
[ABSTRACT]
Image registration is used in many medical image analysis applications, such
as tracking the motion of tissue in cardiac images, where cardiac kinematics
can be an indicator of tissue health. Registration is a challenging problem for
deep learning algorithms because ground truth transformations are not feasible
to create, and because there are potentially multiple transformations that can
produce images that appear correlated with the goal. Unsupervised methods have
been proposed to learn to predict effective transformations, but these methods
take significantly longer to predict than established baseline methods. For a
deep learning method to see adoption in wider research and clinical settings,
it should be designed to run in a reasonable time on common, mid-level
hardware. Fast methods have been proposed for the task of image registration
but often use patch-based methods which can affect registration accuracy for a
highly dynamic organ such as the heart.
In this thesis, a fast, volumetric registration model is proposed for the use
of quantifying cardiac strain. The proposed Deep Learning Neural Network (DLNN)
is designed to utilize an architecture that can compute convolutions incredibly
efficiently, allowing the model to achieve registration fidelity similar to
other state-of-the-art models while taking a fraction of the time to perform
inference. The proposed fast and lightweight registration (FLIR) model is used
to predict tissue motion which is then used to quantify the non-uniform strain
experienced by the tissue. For acquisitions taken from the same patient at
approximately the same time, it would be expected that strain values measured
between the acquisitions would have very small differences. Using this metric,
strain values computed using the FLIR method are shown to be very consistent.
[LINK]
http://arxiv.org/abs/2506.19167v1
[DATE]
2025-06-24 06:06:07+08:00
[CATEGORIES]
cs.LG
Posterior Contraction for Sparse Neural Networks in Besov Spaces with Intrinsic Dimensionality
[AUTHORS]
Kyeongwon Lee, Lizhen Lin, Jaewoo Park, Seonghyun Jeong
[ABSTRACT]
This work establishes that sparse Bayesian neural networks achieve optimal
posterior contraction rates over anisotropic Besov spaces and their
hierarchical compositions. These structures reflect the intrinsic
dimensionality of the underlying function, thereby mitigating the curse of
dimensionality. Our analysis shows that Bayesian neural networks equipped with
either sparse or continuous shrinkage priors attain the optimal rates which are
dependent on the intrinsic dimension of the true structures. Moreover, we show
that these priors enable rate adaptation, allowing the posterior to contract at
the optimal rate even when the smoothness level of the true function is
unknown. The proposed framework accommodates a broad class of functions,
including additive and multiplicative Besov functions as special cases. These
results advance the theoretical foundations of Bayesian neural networks and
provide rigorous justification for their practical effectiveness in
high-dimensional, structured estimation problems.
[LINK]
http://arxiv.org/abs/2506.19144v1
[DATE]
2025-06-24 05:29:40+08:00
[CATEGORIES]
cs.LG
EEG Foundation Challenge: From Cross-Task to Cross-Subject EEG Decoding
[AUTHORS]
Bruno Aristimunha, Dung Truong, Pierre Guetschel, Seyed Yahya Shirazi, Isabelle Guyon, Alexandre R. Franco, Michael P. Milham, Aviv Dotan, Scott Makeig, Alexandre Gramfort, Jean-Remi King, Marie-Constance Corsi, Pedro A. Valdés-Sosa, Amit Majumdar, Alan Evans, Terrence J Sejnowski, Oren Shriki, Sylvain Chevallier, Arnaud Delorme
[COMMENTS]
Approved at NeurIPS Competition track. Webpage:
https://eeg2025.github.io/
[LINK]
http://arxiv.org/abs/2506.19141v1
[DATE]
2025-06-24 05:25:19+08:00
[CATEGORIES]
cs.LG
Local Learning Rules for Out-of-Equilibrium Physical Generative Models
[AUTHORS]
Cyrill Bösch, Geoffrey Roeder, Marc Serra-Garcia, Ryan P. Adams
[ABSTRACT]
We show that the out-of-equilibrium driving protocol of score-based
generative models (SGMs) can be learned via a local learning rule. The gradients
with respect to the parameters of the driving protocol are computed directly
from force measurements or from observed system dynamics. As a demonstration,
we implement an SGM in a network of driven, nonlinear, overdamped oscillators
coupled to a thermal bath. We first apply it to the problem of sampling from a
mixture of two Gaussians in 2D. Finally, we train a network of 10x10
oscillators to sample images of 0s and 1s from the MNIST dataset.
[COMMENTS]
6 pages, 2 figures
[LINK]
http://arxiv.org/abs/2506.19136v1
[DATE]
2025-06-24 05:11:40+08:00
[CATEGORIES]
cs.LG
Riemannian generative decoder
[AUTHORS]
Andreas Bjerregaard, Søren Hauberg, Anders Krogh
[ABSTRACT]
Riemannian representation learning typically relies on approximating
densities on chosen manifolds. This involves optimizing difficult objectives,
potentially harming models. To completely circumvent this issue, we introduce
the Riemannian generative decoder which finds manifold-valued maximum
likelihood latents with a Riemannian optimizer while training a decoder
network. By discarding the encoder, we vastly simplify the manifold constraint
compared to current approaches, which often handle only a few specific manifolds.
We validate our approach on three case studies – a synthetic branching
diffusion process, human migrations inferred from mitochondrial DNA, and cells
undergoing a cell division cycle – each showing that learned representations
respect the prescribed geometry and capture intrinsic non-Euclidean structure.
Our method requires only a decoder, is compatible with existing architectures,
and yields interpretable latent spaces aligned with data geometry.
[COMMENTS]
GenBio ICML 2025 (Proceedings of the Workshop on Generative AI for
Biology at the 42nd International Conference on Machine Learning, Vancouver,
Canada. PMLR 267, 2025)
[LINK]
http://arxiv.org/abs/2506.19133v1
[DATE]
2025-06-24 05:06:13+08:00
[CATEGORIES]
cs.LG
Finding Clustering Algorithms in the Transformer Architecture
[AUTHORS]
Kenneth L. Clarkson, Lior Horesh, Takuya Ito, Charlotte Park, Parikshit Ram
[ABSTRACT]
The invention of the transformer architecture has revolutionized Artificial
Intelligence (AI), yielding unprecedented success in areas such as natural
language processing, computer vision, and multimodal reasoning. Despite these
advances, it is unclear whether transformers are able to learn and implement
precise algorithms. Here, we demonstrate that transformers can exactly
implement a fundamental and widely used algorithm for $k$-means clustering:
Lloyd’s algorithm. First, we theoretically prove the existence of such a
transformer architecture, which we term the $k$-means transformer, that exactly
implements Lloyd’s algorithm for $k$-means clustering using the standard
ingredients of modern transformers: attention and residual connections. Next,
we numerically implement this transformer and demonstrate in experiments the
exact correspondence between our architecture and Lloyd’s algorithm, providing
a fully neural implementation of $k$-means clustering. Finally, we demonstrate
that interpretable alterations (e.g., incorporating layer normalizations or
multilayer perceptrons) to this architecture yield diverse and novel variants
of clustering algorithms, such as soft $k$-means, spherical $k$-means, trimmed
$k$-means, and more. Collectively, our findings demonstrate how transformer
mechanisms can precisely map onto algorithmic procedures, offering a clear and
interpretable perspective on implementing precise algorithms in transformers.
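
For reference, the classical procedure that the abstract says the transformer reproduces can be sketched in a few lines. This is a plain, minimal version of Lloyd's algorithm, not the authors' attention-based construction; the function names, the fixed iteration cap, and the choice to leave empty clusters in place are assumptions made for illustration.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` in Euclidean distance."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def lloyd_kmeans(points, initial_centroids, max_iterations=100):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        assignment = [nearest_centroid(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                updated.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
                )
            else:
                updated.append(old)  # assumption: an empty cluster keeps its centroid
        if updated == centroids:
            break  # converged: the update step changed nothing
        centroids = updated
    # Recompute assignments so they match the returned centroids.
    return centroids, [nearest_centroid(p, centroids) for p in points]
```

Each loop iteration alternates a nearest-centroid assignment with a mean update, which is the two-step structure the abstract refers to.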
[LINK]
http://arxiv.org/abs/2506.19125v1
[DATE]
2025-06-24 04:52:01+08:00
[CATEGORIES]
cs.LG
CUPID: Curating Data your Robot Loves with Influence Functions
[AUTHORS]
Christopher Agia, Rohan Sinha, Jingyun Yang, Rika Antonova, Marco Pavone, Haruki Nishimura, Masha Itkina, Jeannette Bohg
[ABSTRACT]
In robot imitation learning, policy performance is tightly coupled with the
quality and composition of the demonstration data. Yet, developing a precise
understanding of how individual demonstrations contribute to downstream
outcomes - such as closed-loop task success or failure - remains a persistent
challenge. We propose CUPID, a robot data curation method based on a novel
influence function-theoretic formulation for imitation learning policies. Given
a set of evaluation rollouts, CUPID estimates the influence of each training
demonstration on the policy’s expected return. This enables ranking and
selection of demonstrations according to their impact on the policy’s
closed-loop performance. We use CUPID to curate data by 1) filtering out
training demonstrations that harm policy performance and 2) subselecting newly
collected trajectories that will most improve the policy. Extensive simulated
and hardware experiments show that our approach consistently identifies which
data drives test-time performance. For example, training with less than 33% of
curated data can yield state-of-the-art diffusion policies on the simulated
RoboMimic benchmark, with similar gains observed in hardware. Furthermore,
hardware experiments show that our method can identify robust strategies under
distribution shift, isolate spurious correlations, and even enhance the
post-training of generalist robot policies. Additional materials are made
available at: https://cupid-curation.github.io.
[COMMENTS]
Project page: https://cupid-curation.github.io. 28 pages, 15 figures
[LINK]
http://arxiv.org/abs/2506.19121v1
[DATE]
2025-06-24 04:49:34+08:00
[CATEGORIES]
cs.LG
Blameless Users in a Clean Room: Defining Copyright Protection for Generative Models
[AUTHORS]
Aloni Cohen
[ABSTRACT]
Are there any conditions under which a generative model’s outputs are
guaranteed not to infringe the copyrights of its training data? This is the
question of “provable copyright protection” first posed by Vyas, Kakade, and
Barak (ICML 2023). They define near access-freeness (NAF) and propose it as
sufficient for protection. This paper revisits the question and establishes new
foundations for provable copyright protection – foundations that are firmer
both technically and legally. First, we show that NAF alone does not prevent
infringement. In fact, NAF models can enable verbatim copying, a blatant
failure of copy protection that we dub being tainted. Then, we introduce our
blameless copy protection framework for defining meaningful guarantees, and
instantiate it with clean-room copy protection. Clean-room copy protection
allows a user to control their risk of copying by behaving in a way that is
unlikely to copy in a counterfactual clean-room setting. Finally, we formalize
a common intuition about differential privacy and copyright by proving that DP
implies clean-room copy protection when the dataset is golden, a copyright
deduplication requirement.
[LINK]
http://arxiv.org/abs/2506.19881v1
[DATE]
2025-06-24 04:46:51+08:00
[CATEGORIES]
cs.LG
On the algorithmic construction of deep ReLU networks
[AUTHORS]
Daan Huybrechs
[ABSTRACT]
It is difficult to describe in mathematical terms what a neural network
trained on data represents. On the other hand, there is a growing mathematical
understanding of what neural networks are in principle capable of representing.
Feedforward neural networks using the ReLU activation function represent
continuous and piecewise linear functions and can approximate many others. The
study of their expressivity addresses the question: which ones? Contributing to
the available answers, we take the perspective of a neural network as an
algorithm. In this analogy, a neural network is programmed constructively,
rather than trained from data. An interesting example is a sorting algorithm:
we explicitly construct a neural network that sorts its inputs exactly, not
approximately, and that, in a sense, has optimal computational complexity if
the input dimension is large. Such constructed networks may have several
billion parameters. We construct and analyze several other examples, both
existing and new. We find that, in these examples, neural networks as
algorithms are typically recursive and parallel. Compared to conventional
algorithms, ReLU networks are restricted by having to be continuous. Moreover,
the depth of recursion is limited by the depth of the network, with deep
networks having superior properties over shallow ones.
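
To make the idea of exact sorting with ReLU concrete, here is a minimal sketch. It relies on the identities max(a, b) = a + ReLU(b - a) and min(a, b) = b - ReLU(b - a), then wires the resulting compare-exchange units into an odd-even transposition network. This is a textbook sorting network chosen for brevity, not the construction from the paper, and it does not have the complexity the abstract describes; all names are hypothetical.

```python
def relu(x):
    """Rectified linear unit on a scalar."""
    return max(x, 0)


def compare_exchange(a, b):
    """Return (min(a, b), max(a, b)) using only affine maps and one ReLU value."""
    gap = relu(b - a)
    return b - gap, a + gap


def relu_sort(values):
    """Sort with an odd-even transposition network built from ReLU comparators.

    Each pass over `layer` corresponds to one layer of comparators; n layers
    are enough to sort n inputs with this network.
    """
    xs = list(values)
    n = len(xs)
    for layer in range(n):
        start = layer % 2  # alternate between pairs (0,1),(2,3),... and (1,2),(3,4),...
        for i in range(start, n - 1, 2):
            xs[i], xs[i + 1] = compare_exchange(xs[i], xs[i + 1])
    return xs
```

The identities are exact in real arithmetic; with floating-point inputs, `a + (b - a)` can differ from `b` by rounding, so integer inputs are the safest way to check the sketch.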
[LINK]
http://arxiv.org/abs/2506.19104v1
[DATE]
2025-06-24 04:35:52+08:00
[CATEGORIES]
cs.LG
Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks
[AUTHORS]
Hongyuan Tao, Ying Zhang, Zhenhao Tang, Hongen Peng, Xukun Zhu, Bingchang Liu, Yingguang Yang, Ziyin Zhang, Zhaogui Xu, Haipeng Zhang, Linchao Zhu, Rui Wang, Hang Yu, Jianguo Li, Peng Di
[ABSTRACT]
Recent advances in Large Language Models (LLMs) have shown promise in
function-level code generation, yet repository-level software engineering tasks
remain challenging. Current solutions predominantly rely on proprietary LLM
agents, which introduce unpredictability and limit accessibility, raising
concerns about data privacy and model customization. This paper investigates
whether open-source LLMs can effectively address repository-level tasks without
requiring agent-based approaches. We demonstrate this is possible by enabling
LLMs to comprehend functions and files within codebases through their semantic
information and structural dependencies. To this end, we introduce Code Graph
Models (CGMs), which integrate repository code graph structures into the LLM’s
attention mechanism and map node attributes to the LLM’s input space using a
specialized adapter. When combined with an agentless graph RAG framework, our
approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark
using the open-source Qwen2.5-72B model. This performance ranks first among
open-weight models, second among methods with open-source systems, and eighth
overall, surpassing the previous best open-source model-based method by 12.33%.
[COMMENTS]
35 pages, 10 figures
[LINK]
http://arxiv.org/abs/2505.16901v4
[DATE]
2025-06-24 04:05:49+08:00
[CATEGORIES]
cs.LG
Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation
[AUTHORS]
Muhammad Haseeb Aslam, Clara Martinez, Marco Pedersoli, Alessandro Koerich, Ali Etemad, Eric Granger
[ABSTRACT]
Advances in self-distillation have shown that when knowledge is distilled
from a teacher to a student using the same deep learning (DL) architecture, the
student's performance can surpass the teacher's, particularly when the network is
overparameterized and the teacher is trained with early stopping.
Alternatively, ensemble learning also improves performance, although training,
storing, and deploying multiple models becomes impractical as the number of
models grows. Even distilling an ensemble to a single student model or weight
averaging methods first requires training of multiple teacher models and does
not fully leverage the inherent stochasticity for generating and distilling
diversity in DL models. These constraints are particularly prohibitive in
resource-constrained or latency-sensitive applications such as wearable
devices. This paper proposes to train only one model and generate multiple
diverse teacher representations using distillation-time dropout. However,
generating these representations stochastically leads to noisy representations
that are misaligned with the learned task. To overcome this problem, a novel
stochastic self-distillation (SSD) training strategy is introduced for
filtering and weighting teacher representations to distill from task-relevant
representations only, using student-guided knowledge distillation (SGKD). The
student representation at each distillation step is used as authority to guide
the distillation process. Experimental results on real-world affective
computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset,
and image classification datasets show that the proposed SSD method can
outperform state-of-the-art methods without increasing the model size at both
training and testing time, and incurs negligible computational complexity
compared to state-of-the-art ensemble learning and weight averaging methods.
[LINK]
http://arxiv.org/abs/2504.14307v2
[DATE]
2025-06-24 04:04:22+08:00
[CATEGORIES]
cs.LG
Finetuning a Weather Foundation Model with Lightweight Decoders for Unseen Physical Processes
[AUTHORS]
Fanny Lehmann, Firat Ozdemir, Benedikt Soja, Torsten Hoefler, Siddhartha Mishra, Sebastian Schemm
[ABSTRACT]
Recent advances in AI weather forecasting have led to the emergence of
so-called “foundation models”, typically defined by expensive pretraining and
minimal fine-tuning for downstream tasks. However, in the natural sciences, a
desirable foundation model should also encode meaningful statistical
relationships between the underlying physical variables. This study evaluates
the performance of the state-of-the-art Aurora foundation model in predicting
hydrological variables, which were not considered during pretraining. We
introduce a lightweight approach using shallow decoders trained on the latent
representations of the pretrained model to predict these new variables. As a
baseline, we compare this to fine-tuning the full model, which allows further
optimization of the latent space while incorporating new variables into both
inputs and outputs. The decoder-based approach requires 50% less training time
and 35% less memory, while achieving strong accuracy across various
hydrological variables and preserving desirable properties of the foundation
model, such as autoregressive stability. Notably, decoder accuracy depends on
the physical correlation between the new variables and those used during
pretraining, indicating that Aurora’s latent space captures meaningful physical
relationships. In this sense, we argue that an important quality metric for
foundation models in Earth sciences is their ability to be extended to new
variables without a full fine-tuning. This provides a new perspective for
making foundation models more accessible to communities with limited
computational resources, while supporting broader adoption in Earth sciences.
[LINK]
http://arxiv.org/abs/2506.19088v1
[DATE]
2025-06-24 04:03:53+08:00
[CATEGORIES]
cs.LG
Benchmarking Music Generation Models and Metrics via Human Preference Studies
[AUTHORS]
Florian Grötschla, Ahmet Solak, Luca A. Lanzendörfer, Roger Wattenhofer
[ABSTRACT]
Recent advancements have brought generated music closer to human-created
compositions, yet evaluating these models remains challenging. While human
preference is the gold standard for assessing quality, translating these
subjective judgments into objective metrics, particularly for text-audio
alignment and music quality, has proven difficult. In this work, we generate 6k
songs using 12 state-of-the-art models and conduct a survey of 15k pairwise
audio comparisons with 2.5k human participants to evaluate the correlation
between human preferences and widely used metrics. To the best of our
knowledge, this work is the first to rank current state-of-the-art music
generation models and metrics based on human preference. To further the field
of subjective metric evaluation, we provide open access to our dataset of
generated music and human evaluations.
[COMMENTS]
Accepted at ICASSP 2025
[LINK]
http://arxiv.org/abs/2506.19085v1
[DATE]
2025-06-24 04:01:29+08:00
[CATEGORIES]
cs.LG
FairCauseSyn: Towards Causally Fair LLM-Augmented Synthetic Data Generation
[AUTHORS]
Nitish Nagesh, Ziyu Wang, Amir M. Rahmani
[ABSTRACT]
Synthetic data generation creates data based on real-world data using
generative models. In health applications, generating high-quality data while
maintaining fairness for sensitive attributes is essential for equitable
outcomes. Existing GAN-based and LLM-based methods focus on counterfactual
fairness and are primarily applied in finance and legal domains. Causal
fairness provides a more comprehensive evaluation framework by preserving
causal structure, but current synthetic data generation methods do not address
it in health settings. To fill this gap, we develop the first LLM-augmented
synthetic data generation method to enhance causal fairness using real-world
tabular health data. Our generated data deviates by less than 10% from real
data on causal fairness metrics. When trained on causally fair predictors,
synthetic data reduces bias on the sensitive attribute by 70% compared to real
data. This work improves access to fair synthetic data, supporting equitable
health research and healthcare delivery.
[COMMENTS]
Accepted to IEEE EMBC 2025
[LINK]
http://arxiv.org/abs/2506.19082v1
[DATE]
2025-06-24 03:59:26+08:00
[CATEGORIES]
cs.LG
First-Order Sparse Convex Optimization: Better Rates with Sparse Updates
[AUTHORS]
Dan Garber
[ABSTRACT]
It was recently established that for convex optimization problems with a
sparse optimal solution (be it entry-wise sparsity or matrix rank-wise
sparsity) it is possible to have linear convergence rates which depend on an
improved mixed-norm condition number of the form $\frac{\beta_1 s}{\alpha_2}$,
where $\beta_1$ is the $\ell_1$-Lipschitz continuity constant of the gradient,
$\alpha_2$ is the $\ell_2$-quadratic growth constant, and $s$ is the sparsity
of the optimal solution. However, beyond the improved convergence rate, these
methods are unable to leverage the sparsity of optimal solutions to also
improve the runtime of each iteration, which may still be prohibitively
high for high-dimensional problems. In this work, we establish that linear
convergence rates which depend on this improved condition number can be
obtained using only sparse updates, which may result in overall significantly
improved running times. Moreover, our methods are considerably easier to
implement.
[LINK]
http://arxiv.org/abs/2506.19075v1
[DATE]
2025-06-24 03:44:37+08:00
[CATEGORIES]
cs.LG
Online Learning for Dynamic Vickrey-Clarke-Groves Mechanism in Sequential Auctions under Unknown Environments
[AUTHORS]
Vincent Leon, S. Rasoul Etesami
[ABSTRACT]
We consider the problem of online dynamic mechanism design for sequential
auctions in unknown environments, where the underlying market and, thus, the
bidders’ values vary over time as interactions between the seller and the
bidders progress. We model the sequential auctions as an infinite-horizon
average-reward Markov decision process (MDP), where the transition kernel and
reward functions are unknown to the seller. In each round, the seller
determines an allocation and a payment for each bidder. Each bidder receives a
private reward and submits a sealed bid to the seller. The state, which
represents the underlying market, evolves according to an unknown transition
kernel and the seller’s allocation policy. Unlike existing works that formulate
the problem as a multi-armed bandit model or as an episodic MDP, where the
environment resets to an initial state after each round or episode, our paper
considers a more realistic and sophisticated setting in which the market
continues to evolve without restarting. We first extend the
Vickrey-Clarke-Groves (VCG) mechanism, which is known to be efficient,
truthful, and individually rational for one-shot static auctions, to sequential
auctions, thereby obtaining a dynamic VCG mechanism counterpart that preserves
these desired properties. We then focus on the online setting and develop an
online reinforcement learning algorithm for the seller to learn the underlying
MDP model and implement a mechanism that closely resembles the dynamic VCG
mechanism. We show that the learned online mechanism asymptotically converges
to a dynamic mechanism that approximately satisfies efficiency, truthfulness,
and individual rationality with arbitrarily high probability and achieves
guaranteed performance in terms of various notions of regret.
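
As background for the one-shot static VCG mechanism that the abstract takes as its starting point, the following sketch computes VCG winners and payments in a deliberately simple setting: identical items and bidders who each want at most one. The setting, the function name, and the data layout are assumptions for illustration; the dynamic, learned mechanism of the paper is not shown.

```python
def vcg_unit_demand(bids, num_items):
    """One-shot VCG for identical items and unit-demand bidders.

    bids: dict mapping a bidder name to that bidder's reported value for one item.
    Returns (winners, payments), where payments maps each winner to the
    externality it imposes on the other bidders.
    """

    def best_welfare(active):
        # The efficient allocation hands the items to the highest reported values.
        return sum(sorted((bids[b] for b in active), reverse=True)[:num_items])

    ranked = sorted(bids, key=bids.get, reverse=True)
    winners = ranked[:num_items]
    total = best_welfare(bids)

    payments = {}
    for w in winners:
        others = [b for b in bids if b != w]
        welfare_without_w = best_welfare(others)    # others' welfare if w were absent
        others_welfare_with_w = total - bids[w]     # others' welfare in the chosen allocation
        payments[w] = welfare_without_w - others_welfare_with_w
    return winners, payments
```

With a single item this reduces to a second-price auction: the highest bidder wins and pays the second-highest bid.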
[COMMENTS]
16 pages
[LINK]
http://arxiv.org/abs/2506.19038v1
[DATE]
2025-06-24 02:52:32+08:00
[CATEGORIES]
cs.LG
Failure Modes of Time Series Interpretability Algorithms for Critical Care Applications and Potential Solutions
[AUTHORS]
Shashank Yadav, Vignesh Subbian
[ABSTRACT]
Interpretability plays a vital role in aligning and deploying deep learning
models in critical care, especially in constantly evolving conditions that
influence patient survival. However, common interpretability algorithms face
unique challenges when applied to dynamic prediction tasks, where patient
trajectories evolve over time. Gradient, Occlusion, and Permutation-based
methods often struggle with time-varying target dependency and temporal
smoothness. This work systematically analyzes these failure modes and supports
learnable mask-based interpretability frameworks as alternatives, which can
incorporate temporal continuity and label consistency constraints to learn
feature importance over time. Here, we propose that learnable mask-based
approaches for dynamic timeseries prediction problems provide more reliable and
consistent interpretations for applications in critical care and similar
domains.
[COMMENTS]
13 pages, 10 figures, Accepted at the AMIA Annual Symposium 2025. The
final version will appear in the official proceedings
[LINK]
http://arxiv.org/abs/2506.19035v1
[DATE]
2025-06-24 02:45:47+08:00
[CATEGORIES]
cs.LG
When Diffusion Models Memorize: Inductive Biases in Probability Flow of Minimum-Norm Shallow Neural Nets
[AUTHORS]
Chen Zeno, Hila Manor, Greg Ongie, Nir Weinberger, Tomer Michaeli, Daniel Soudry
[ABSTRACT]
While diffusion models generate high-quality images via probability flow, the
theoretical understanding of this process remains incomplete. A key question is
when probability flow converges to training samples or more general points on
the data manifold. We analyze this by studying the probability flow of shallow
ReLU neural network denoisers trained with minimal $\ell^2$ norm. For
intuition, we introduce a simpler score flow and show that for orthogonal
datasets, both flows follow similar trajectories, converging to a training
point or a sum of training points. However, early stopping by the diffusion
time scheduler allows probability flow to reach more general manifold points.
This reflects the tendency of diffusion models to both memorize training
samples and generate novel points that combine aspects of multiple samples,
motivating our study of such behavior in simplified settings. We extend these
results to obtuse simplex data and, through simulations in the orthogonal case,
confirm that probability flow converges to a training point, a sum of training
points, or a manifold point. Moreover, memorization decreases when the number
of training samples grows, as fewer samples accumulate near training points.
[COMMENTS]
Accepted to the Forty-second International Conference on Machine
Learning (ICML 2025)
[LINK]
http://arxiv.org/abs/2506.19031v1
[DATE]
2025-06-24 02:38:55+08:00
[CATEGORIES]
cs.LG
Statistical Inference for Optimal Transport Maps: Recent Advances and Perspectives
[AUTHORS]
Sivaraman Balakrishnan, Tudor Manole, Larry Wasserman
[ABSTRACT]
In many applications of optimal transport (OT), the object of primary
interest is the optimal transport map. This map rearranges mass from one
probability distribution to another in the most efficient way possible by
minimizing a specified cost. In this paper we review recent advances in
estimating and developing limit theorems for the OT map, using samples from the
underlying distributions. We also review parallel lines of work that establish
similar results for special cases and variants of the basic OT setup. We
conclude with a discussion of key directions for future research with the goal
of providing practitioners with reliable inferential tools.
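
A small worked case may help fix ideas. In one dimension, with squared-distance cost and two samples of equal size, the cost-minimizing matching is the monotone rearrangement: the i-th smallest source point is sent to the i-th smallest target point. The sketch below implements only this special case; the function names are hypothetical, and nothing here reflects the estimators or limit theorems surveyed in the paper.

```python
def ot_map_1d(source, target):
    """Optimal matching of equal-size 1D samples under squared-distance cost.

    Returns a list `mapped` where mapped[i] is the target point assigned to source[i].
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes samples of equal size")
    source_order = sorted(range(len(source)), key=lambda i: source[i])
    target_sorted = sorted(target)
    mapped = [None] * len(source)
    for rank, i in enumerate(source_order):
        mapped[i] = target_sorted[rank]  # match points of equal rank
    return mapped


def transport_cost(source, mapped):
    """Total squared-distance cost of sending source[i] to mapped[i]."""
    return sum((s - t) ** 2 for s, t in zip(source, mapped))
```

For example, `ot_map_1d([3, 1, 2], [10, 30, 20])` pairs 1 with 10, 2 with 20, and 3 with 30, which costs less than pairing the points in the order given.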
[COMMENTS]
36 pages, 1 figure
[LINK]
http://arxiv.org/abs/2506.19025v1
[DATE]
2025-06-24 02:28:48+08:00
[CATEGORIES]
cs.LG
Double Machine Learning for Conditional Moment Restrictions: IV Regression, Proximal Causal Learning and Beyond
[AUTHORS]
Daqian Shao, Ashkan Soleymani, Francesco Quinzan, Marta Kwiatkowska
[ABSTRACT]
Solving conditional moment restrictions (CMRs) is a key problem considered in
statistics, causal inference, and econometrics, where the aim is to solve for a
function of interest that satisfies some conditional moment equalities.
Specifically, many techniques for causal inference, such as instrumental
variable (IV) regression and proximal causal learning (PCL), are CMR problems.
Most CMR estimators use a two-stage approach, where the first-stage estimation
is directly plugged into the second stage to estimate the function of interest.
However, naively plugging in the first-stage estimator can cause heavy bias in
the second stage. This is particularly the case for recently proposed CMR
estimators that use deep neural network (DNN) estimators for both stages, where
regularisation and overfitting biases are present. We propose DML-CMR, a two-stage
CMR estimator that provides an unbiased estimate with fast convergence rate
guarantees. We derive a novel learning objective to reduce bias and develop the
DML-CMR algorithm following the double/debiased machine learning (DML)
framework. We show that our DML-CMR estimator can achieve the minimax optimal
convergence rate of $O(N^{-1/2})$ under parameterisation and mild regularity
conditions, where $N$ is the sample size. We apply DML-CMR to a range of
problems using DNN estimators, including IV regression and proximal causal
learning on real-world datasets, demonstrating state-of-the-art performance
against existing CMR estimators and algorithms tailored to those problems.
[LINK]
http://arxiv.org/abs/2506.14950v2
[DATE]
2025-06-24 02:27:16+08:00
[CATEGORIES]
cs.LG
Automating Traffic Monitoring with SHM Sensor Networks via Vision-Supervised Deep Learning
[AUTHORS]
Hanshuo Wu, Xudong Jian, Christos Lataniotis, Cyprien Hoelzl, Eleni Chatzi, Yves Reuland
[ABSTRACT]
Bridges, as critical components of civil infrastructure, are increasingly
affected by deterioration, making reliable traffic monitoring essential for
assessing their remaining service life. Among operational loads, traffic load
plays a pivotal role, and recent advances in deep learning - particularly in
computer vision (CV) - have enabled progress toward continuous, automated
monitoring. However, CV-based approaches suffer from limitations, including
privacy concerns and sensitivity to lighting conditions, while traditional
non-vision-based methods often lack flexibility in deployment and validation.
To bridge this gap, we propose a fully automated deep-learning pipeline for
continuous traffic monitoring using structural health monitoring (SHM) sensor
networks. Our approach integrates CV-assisted high-resolution dataset
generation with supervised training and inference, leveraging graph neural
networks (GNNs) to capture the spatial structure and interdependence of sensor
data. By transferring knowledge from CV outputs to SHM sensors, the proposed
framework enables sensor networks to achieve accuracy comparable to that of
vision-based systems, with minimal human intervention. Applied to accelerometer
and strain gauge data in a real-world case study, the model achieves
state-of-the-art performance, with classification accuracies of 99% for light
vehicles and 94% for heavy vehicles.
[LINK]
http://arxiv.org/abs/2506.19023v1
[DATE]
2025-06-24 02:27:14+08:00
[CATEGORIES]
cs.LG
Simulation-Based Sensitivity Analysis in Optimal Treatment Regimes and Causal Decomposition with Individualized Interventions
[AUTHORS]
Soojin Park, Suyeon Kang, Chioun Lee
[ABSTRACT]
Causal decomposition analysis aims to assess the effect of modifying risk
factors on reducing social disparities in outcomes. Recently, this analysis has
incorporated individual characteristics when modifying risk factors by
utilizing optimal treatment regimes (OTRs). Since the newly defined
individualized effects rely on the no omitted confounding assumption,
developing sensitivity analyses to account for potential omitted confounding is
essential. Moreover, OTRs and individualized effects are primarily based on
binary risk factors, and no formal approach currently exists to benchmark the
strength of omitted confounding using observed covariates for binary risk
factors. To address this gap, we extend a simulation-based sensitivity analysis
that simulates unmeasured confounders, addressing two sources of bias emerging
from deriving OTRs and estimating individualized effects. Additionally, we
propose a formal bounding strategy that benchmarks the strength of omitted
confounding for binary risk factors. Using the High School Longitudinal Study
2009 (HSLS:09), we demonstrate this sensitivity analysis and benchmarking
method.
[COMMENTS]
42 pages
[LINK]
http://arxiv.org/abs/2506.19010v1
[DATE]
2025-06-24 02:05:30+08:00
[CATEGORIES]
cs.LG
Steering Conceptual Bias via Transformer Latent-Subspace Activation
[AUTHORS]
Vansh Sharma, Venkat Raman
[ABSTRACT]
This work examines whether activating latent subspaces in large language models
(LLMs) can steer scientific code generation toward a specific programming
language. Five causal LLMs were first evaluated on scientific coding prompts to
quantify their baseline bias among four programming languages. A static
neuron-attribution method, perturbing the highest activated MLP weight for a
C++ or CPP token, proved brittle and exhibited limited generalization across
prompt styles and model scales. To address these limitations, a
gradient-refined adaptive activation steering framework (G-ACT) was developed:
per-prompt activation differences are clustered into a small set of steering
directions, and lightweight per-layer probes are trained and refined online to
select the appropriate steering vector. In LLaMA-3.2 3B, this approach reliably
biases generation towards the CPP language, increasing the average probe
classification accuracy by 15% and improving the probe classification accuracy
in the early layers (0-6) by 61.5% compared to the standard ACT framework. For
LLaMA-3.3 70B, where attention-head signals become more diffuse, targeted
injections at key layers still improve language selection. Although per-layer
probing introduces a modest inference overhead, it remains practical by
steering only a subset of layers and enables reproducible model behavior. These
results demonstrate a scalable, interpretable and efficient mechanism for
concept-level control for practical agentic systems.
[LINK]
http://arxiv.org/abs/2506.18887v1
[DATE]
2025-06-24 01:56:34+08:00
[CATEGORIES]
cs.LG
Accurate and scalable exchange-correlation with deep learning
[AUTHORS]
Giulia Luise, Chin-Wei Huang, Thijs Vogels, Derk P. Kooi, Sebastian Ehlert, Stephanie Lanius, Klaas J. H. Giesbertz, Amir Karton, Deniz Gunceler, Megan Stanley, Wessel P. Bruinsma, Lin Huang, Xinran Wei, José Garrido Torres, Abylay Katbashev, Rodrigo Chavez Zavaleta, Bálint Máté, Sékou-Oumar Kaba, Roberto Sordillo, Yingrong Chen, David B. Williams-Young, Christopher M. Bishop, Jan Hermann, Rianne van den Berg, Paola Gori-Giorgi
[ABSTRACT]
Density Functional Theory (DFT) is the most widely used electronic structure
method for predicting the properties of molecules and materials. Although DFT
is, in principle, an exact reformulation of the Schrödinger equation,
practical applications rely on approximations to the unknown
exchange-correlation (XC) functional. Most existing XC functionals are
constructed using a limited set of increasingly complex, hand-crafted features
that improve accuracy at the expense of computational efficiency. Yet, no
current approximation achieves the accuracy and generality for predictive
modeling of laboratory experiments at chemical accuracy – typically defined as
errors below 1 kcal/mol. In this work, we present Skala, a modern deep
learning-based XC functional that bypasses expensive hand-designed features by
learning representations directly from data. Skala achieves chemical accuracy
for atomization energies of small molecules while retaining the computational
efficiency typical of semi-local DFT. This performance is enabled by training
on an unprecedented volume of high-accuracy reference data generated using
computationally intensive wavefunction-based methods. Notably, Skala
systematically improves with additional training data covering diverse
chemistry. By incorporating a modest amount of additional high-accuracy data
tailored to chemistry beyond atomization energies, Skala achieves accuracy
competitive with the best-performing hybrid functionals across general main
group chemistry, at the cost of semi-local DFT. As the training dataset
continues to expand, Skala is poised to further enhance the predictive power of
first-principles simulations.
[COMMENTS]
Main: 13 pages plus references, 11 figures and tables. Supplementary
information: 19 pages, 12 figures and tables. v2 update: fix rendering of
figure 1 and part of figure 5 in Safari PDF viewer. v3 update: update author
information and fix typo
[LINK]
http://arxiv.org/abs/2506.14665v3
[DATE]
2025-06-24 01:52:42+08:00
[CATEGORIES]
cs.LG
A Reliable Framework for Human-in-the-Loop Anomaly Detection in Time Series
[AUTHORS]
Ziquan Deng, Xiwei Xuan, Kwan-Liu Ma, Zhaodan Kong
[ABSTRACT]
Time series anomaly detection is a critical machine learning task for
numerous applications, such as finance, healthcare, and industrial systems.
However, even high-performing models may exhibit potential issues such as
biases, leading to unreliable outcomes and misplaced confidence. While model
explanation techniques, particularly visual explanations, offer valuable
insights by elucidating how models attribute their decisions, many limitations
still exist: they are primarily instance-based and not scalable across the
dataset, and they provide one-directional information from the model to the
human side, lacking a mechanism for users to address detected issues. To
fill these gaps, we introduce HILAD, a novel framework designed to foster a
dynamic and bidirectional collaboration between humans and AI for enhancing
anomaly detection models in time series. Through our visual interface, HILAD
empowers domain experts to detect, interpret, and correct unexpected model
behaviors at scale. Our evaluation through user studies with two models and
three time series datasets demonstrates the effectiveness of HILAD, which
fosters a deeper model understanding, immediate corrective actions, and model
reliability enhancement.
[COMMENTS]
The manuscript is currently under review
[LINK]
http://arxiv.org/abs/2405.03234v4
[DATE]
2025-06-24 01:41:29+08:00
[CATEGORIES]
cs.LG
CDI: Copyrighted Data Identification in Diffusion Models
[AUTHORS]
Jan Dubiński, Antoni Kowalczuk, Franziska Boenisch, Adam Dziedzic
[ABSTRACT]
Diffusion Models (DMs) benefit from large and diverse datasets for their
training. Since this data is often scraped from the Internet without permission
from the data owners, this raises concerns about copyright and intellectual
property protections. While (illicit) use of data is easily detected for
training samples perfectly re-created by a DM at inference time, it is much
harder for data owners to verify if their data was used for training when the
outputs from the suspect DM are not close replicas. Conceptually, membership
inference attacks (MIAs), which detect if a given data point was used during
training, present themselves as a suitable tool to address this challenge.
However, we demonstrate that existing MIAs are not strong enough to reliably
determine the membership of individual images in large, state-of-the-art DMs.
To overcome this limitation, we propose CDI, a framework for data owners to
identify whether their dataset was used to train a given DM. CDI relies on
dataset inference techniques, i.e., instead of using the membership signal from
a single data point, CDI leverages the fact that most data owners, such as
providers of stock photography, visual media companies, or even individual
artists, own datasets with multiple publicly exposed data points which might
all be included in the training of a given DM. By selectively aggregating
signals from existing MIAs and using new handcrafted methods to extract
features for these datasets, feeding them to a scoring model, and applying
rigorous statistical testing, CDI allows data owners with as little as 70 data
points to identify with a confidence of more than 99% whether their data was
used to train a given DM. Thereby, CDI represents a valuable tool for data
owners to claim illegitimate use of their copyrighted data. We make the code
available at https://github.com/sprintml/copyrighted_data_identification
[COMMENTS]
Accepted at CVPR 2025 (Conference on Computer Vision and Pattern
Recognition). Code available at
https://github.com/sprintml/copyrighted_data_identification
[LINK]
http://arxiv.org/abs/2411.12858v3
[DATE]
2025-06-24 01:31:25+08:00
[CATEGORIES]
cs.LG
Controlling Moments with Kernel Stein Discrepancies
[AUTHORS]
Heishiro Kanagawa, Alessandro Barp, Arthur Gretton, Lester Mackey
[ABSTRACT]
Kernel Stein discrepancies (KSDs) measure the quality of a distributional
approximation and can be computed even when the target density has an
intractable normalizing constant. Notable applications include the diagnosis of
approximate MCMC samplers and goodness-of-fit tests for unnormalized
statistical models. The present work analyzes the convergence control
properties of KSDs. We first show that standard KSDs used for weak convergence
control fail to control moment convergence. To address this limitation, we next
provide sufficient conditions under which alternative diffusion KSDs control
both moment and weak convergence. As an immediate consequence we develop, for
each $q > 0$, the first KSDs known to exactly characterize $q$-Wasserstein
convergence.
[COMMENTS]
Accepted to the Annals of Applied Probability (103 pages, 10 figures)
[LINK]
http://arxiv.org/abs/2211.05408v7
[DATE]
2025-06-24 01:30:18+08:00
[CATEGORIES]
cs.LG
Segmentation-Aware Generative Reinforcement Network (GRN) for Tissue Layer Segmentation in 3-D Ultrasound Images for Chronic Low-back Pain (cLBP) Assessment
[AUTHORS]
Zixue Zeng, Xiaoyan Zhao, Matthew Cartier, Tong Yu, Jing Wang, Xin Meng, Zhiyu Sheng, Maryam Satarpour, John M Cormack, Allison Bean, Ryan Nussbaum, Maya Maurer, Emily Landis-Walkenhorst, Dinesh Kumbhare, Kang Kim, Ajay Wasan, Jiantao Pu
[ABSTRACT]
We introduce a novel segmentation-aware joint training framework called
generative reinforcement network (GRN) that integrates segmentation loss
feedback to optimize both image generation and segmentation performance in a
single stage. An image enhancement technique called segmentation-guided
enhancement (SGE) is also developed, where the generator produces images
tailored specifically for the segmentation model. Two variants of GRN were also
developed, including GRN for sample-efficient learning (GRN-SEL) and GRN for
semi-supervised learning (GRN-SSL). GRN’s performance was evaluated using a
dataset of 69 fully annotated 3D ultrasound scans from 29 subjects. The
annotations included six anatomical structures: dermis, superficial fat,
superficial fascial membrane (SFM), deep fat, deep fascial membrane (DFM), and
muscle. Our results show that GRN-SEL with SGE reduces labeling efforts by up
to 70% while achieving a 1.98% improvement in the Dice Similarity Coefficient
(DSC) compared to models trained on fully labeled datasets. GRN-SEL alone
reduces labeling efforts by 60%, GRN-SSL with SGE decreases labeling
requirements by 70%, and GRN-SSL alone by 60%, all while maintaining
performance comparable to fully supervised models. These findings suggest the
effectiveness of the GRN framework in optimizing segmentation performance with
significantly less labeled data, offering a scalable and efficient solution for
ultrasound image analysis and reducing the burdens associated with data
annotation.
[LINK]
http://arxiv.org/abs/2501.17690v3
[DATE]
2025-06-24 01:08:10+08:00
[CATEGORIES]
cs.LG
LIGHTHOUSE: Fast and precise distance to shoreline calculations from anywhere on earth
[AUTHORS]
Patrick Beukema, Henry Herzog, Yawen Zhang, Hunter Pitelka, Favyen Bastani
[COMMENTS]
8 pages, 7 figures, 1 table, ICML 2025 ML4RS
[LINK]
http://arxiv.org/abs/2506.18842v1
[DATE]
2025-06-24 01:00:34+08:00
[CATEGORIES]
cs.LG
Conformal Prediction for Causal Effects of Continuous Treatments
[AUTHORS]
Maresa Schröder, Dennis Frauen, Jonas Schweisthal, Konstantin Heß, Valentyn Melnychuk, Stefan Feuerriegel
[ABSTRACT]
Uncertainty quantification of causal effects is crucial for safety-critical
applications such as personalized medicine. A powerful approach for this is
conformal prediction, which has several practical benefits due to
model-agnostic finite-sample guarantees. Yet, existing methods for conformal
prediction of causal effects are limited to binary/discrete treatments and make
highly restrictive assumptions such as known propensity scores. In this work,
we provide a novel conformal prediction method for potential outcomes of
continuous treatments. We account for the additional uncertainty introduced
through propensity estimation so that our conformal prediction intervals are
valid even if the propensity score is unknown. Our contributions are
three-fold: (1) We derive finite-sample prediction intervals for potential
outcomes of continuous treatments. (2) We provide an algorithm for calculating
the derived intervals. (3) We demonstrate the effectiveness of the conformal
prediction intervals in experiments on synthetic and real-world datasets. To
the best of our knowledge, we are the first to propose conformal prediction for
continuous treatments when the propensity score is unknown and must be
estimated from data.
[LINK]
http://arxiv.org/abs/2407.03094v3
[DATE]
2025-06-24 00:52:25+08:00
[CATEGORIES]
cs.LG
Regularized Neural Ensemblers
[AUTHORS]
Sebastian Pineda Arango, Maciej Janowski, Lennart Purucker, Arber Zela, Frank Hutter, Josif Grabocka
[ABSTRACT]
Ensemble methods are known for enhancing the accuracy and robustness of
machine learning models by combining multiple base learners. However, standard
approaches like greedy or random ensembling often fall short, as they assume a
constant weight across samples for the ensemble members. This can limit
expressiveness and hinder performance when aggregating the ensemble
predictions. In this study, we explore employing regularized neural networks as
ensemble methods, emphasizing the significance of dynamic ensembling to
leverage diverse model predictions adaptively. Motivated by the risk of
learning low-diversity ensembles, we propose regularizing the ensembling model
by randomly dropping base model predictions during the training. We demonstrate
this approach provides lower bounds for the diversity within the ensemble,
reducing overfitting and improving generalization capabilities. Our experiments
showcase that the regularized neural ensemblers yield competitive results
compared to strong baselines across several modalities such as computer vision,
natural language processing, and tabular data.
[COMMENTS]
Accepted in AutoML Conference 2025
[LINK]
http://arxiv.org/abs/2410.04520v2
[DATE]
2025-06-24 00:40:18+08:00
[CATEGORIES]
cs.LG
Kernel spectral joint embeddings for high-dimensional noisy datasets using duo-landmark integral operators
[AUTHORS]
Xiucai Ding, Rong Ma
[ABSTRACT]
Integrative analysis of multiple heterogeneous datasets has become standard
practice in many research fields, especially in single-cell genomics and
medical informatics. Existing approaches oftentimes suffer from limited power
in capturing nonlinear structures, insufficient account of noisiness and
effects of high-dimensionality, lack of adaptivity to signals and sample-size
imbalance, and their results are sometimes difficult to interpret. To address
these limitations, we propose a novel kernel spectral method that achieves
joint embeddings of two independently observed high-dimensional noisy datasets.
The proposed method automatically captures and leverages possibly shared
low-dimensional structures across datasets to enhance embedding quality. The
obtained low-dimensional embeddings can be utilized for many downstream tasks
such as simultaneous clustering, data visualization, and denoising. The
proposed method is justified by rigorous theoretical analysis. Specifically, we
show the consistency of our method in recovering the low-dimensional noiseless
signals, and characterize the effects of the signal-to-noise ratios on the
rates of convergence. Under a joint manifolds model framework, we establish the
convergence of ultimate embeddings to the eigenfunctions of some newly
introduced integral operators. These operators, referred to as duo-landmark
integral operators, are defined by the convolutional kernel maps of some
reproducing kernel Hilbert spaces (RKHSs). These RKHSs capture the either
partially or entirely shared underlying low-dimensional nonlinear signal
structures of the two datasets. Our numerical experiments and analyses of two
single-cell omics datasets demonstrate the empirical advantages of the proposed
method over existing methods in both embeddings and several downstream tasks.
[COMMENTS]
57 pages, 16 figures
[LINK]
http://arxiv.org/abs/2405.12317v2
[DATE]
2025-06-24 00:35:04+08:00
[CATEGORIES]
cs.LG
Multi-Agent Online Control with Adversarial Disturbances
[AUTHORS]
Anas Barakat, John Lazarsfeld, Georgios Piliouras, Antonios Varvitsiotis
[ABSTRACT]
Multi-agent control problems involving a large number of agents with
competing and time-varying objectives are increasingly prevalent in
applications across robotics, economics, and energy systems. In this paper, we
study online control in multi-agent linear dynamical systems with disturbances.
In contrast to most prior work in multi-agent control, we consider an online
setting where disturbances are adversarial and where each agent seeks to
minimize its own, adversarial sequence of convex losses. In this setting, we
investigate the robustness of gradient-based controllers from single-agent
online control, with a particular focus on understanding how individual regret
guarantees are influenced by the number of agents in the system. Under minimal
communication assumptions, we prove near-optimal sublinear regret bounds that
hold uniformly for all agents. Finally, when the objectives of the agents are
aligned, we show that the multi-agent control problem induces a time-varying
potential game for which we derive equilibrium gap guarantees.
[LINK]
http://arxiv.org/abs/2506.18814v1
[DATE]
2025-06-24 00:24:31+08:00
[CATEGORIES]
cs.LG
Learning Physical Systems: Symplectification via Gauge Fixing in Dirac Structures
[AUTHORS]
Aristotelis Papatheodorou, Pranav Vaidhyanathan, Natalia Ares, Ioannis Havoutis
[ABSTRACT]
Physics-informed deep learning has achieved remarkable progress by embedding
geometric priors, such as Hamiltonian symmetries and variational principles,
into neural networks, enabling structure-preserving models that extrapolate
with high accuracy. However, in systems with dissipation and holonomic
constraints, ubiquitous in legged locomotion and multibody robotics, the
canonical symplectic form becomes degenerate, undermining the very invariants
that guarantee stability and long-term prediction. In this work, we tackle this
foundational limitation by introducing Presymplectification Networks (PSNs),
the first framework to learn the symplectification lift via Dirac structures,
restoring a non-degenerate symplectic geometry by embedding constrained systems
into a higher-dimensional manifold. Our architecture combines a recurrent
encoder with a flow-matching objective to learn the augmented phase-space
dynamics end-to-end. We then attach a lightweight Symplectic Network (SympNet)
to forecast constrained trajectories while preserving energy, momentum, and
constraint satisfaction. We demonstrate our method on the dynamics of the
ANYmal quadruped robot, a challenging contact-rich, multibody system. To the
best of our knowledge, this is the first framework that effectively bridges the
gap between constrained, dissipative mechanical systems and symplectic
learning, unlocking a whole new class of geometric machine learning models,
grounded in first principles yet adaptable from data.
[COMMENTS]
Presented at Equivariant Systems: Theory and Applications in State
Estimation, Artificial Intelligence and Control, Robotics: Science and
Systems (RSS) 2025 Workshop, 6 Pages, 3 Figures
[LINK]
http://arxiv.org/abs/2506.18812v1
[DATE]
2025-06-24 00:23:37+08:00
[CATEGORIES]
cs.LG
Simple and Critical Iterative Denoising: A Recasting of Discrete Diffusion in Graph Generation
[AUTHORS]
Yoann Boget
[ABSTRACT]
Discrete Diffusion and Flow Matching models have significantly advanced
generative modeling for discrete structures, including graphs. However, the
dependencies between intermediate noisy states lead to error accumulation and
propagation during the reverse denoising process - a phenomenon known as
compounding denoising errors. To address this problem, we propose a novel
framework called Simple Iterative Denoising, which simplifies discrete
diffusion and circumvents the issue by assuming conditional independence
between intermediate states. Additionally, we enhance our model by
incorporating a Critic. During generation, the Critic selectively retains or
corrupts elements in an instance based on their likelihood under the data
distribution. Our empirical evaluations demonstrate that the proposed method
significantly outperforms existing discrete diffusion baselines in graph
generation tasks.
[COMMENTS]
ICML 2025 Accepted paper
[LINK]
http://arxiv.org/abs/2503.21592v2
[DATE]
2025-06-24 00:03:57+08:00
[CATEGORIES]
cs.LG
A Multi-view Divergence-Convergence Feature Augmentation Framework for Drug-related Microbes Prediction
[AUTHORS]
Xin An, Ruijie Li, Qiao Ning, Shikai Guo, Hui Li, Qian Ma
[ABSTRACT]
In the study of drug function and precision medicine, identifying new
drug-microbe associations is crucial. However, current methods isolate
association and similarity analysis of drug and microbe, lacking effective
inter-view optimization and coordinated multi-view feature fusion. In our
study, a multi-view Divergence-Convergence Feature Augmentation framework for
Drug-related Microbes Prediction (DCFA_DMP) is proposed, to better learn and
integrate association information and similarity information. In the divergence
phase, DCFA_DMP strengthens the complementarity and diversity between
heterogeneous information and similarity information by performing adversarial
learning between the association network view and different similarity
views, optimizing the feature space. In the convergence phase, a novel
Bidirectional Synergistic Attention Mechanism is proposed to deeply synergize
the complementary features between different views, achieving a deep fusion of
the feature space. Moreover, Transformer graph learning is alternately applied
on the drug-microbe heterogeneous graph, enabling each drug or microbe node to
focus on the most relevant nodes. Numerous experiments demonstrate DCFA_DMP’s
significant performance in predicting drug-microbe associations. It also proves
effective in predicting associations for new drugs and microbes in cold
start experiments, further confirming its stability and reliability in
predicting potential drug-microbe associations.
[COMMENTS]
10 pages, 8 figures (including subfigures), 1 table. Xin An and
Ruijie Li contributed equally to this work and should be considered co-first
authors
[LINK]
http://arxiv.org/abs/2506.18797v1
[DATE]
2025-06-24 00:03:46+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Suyash Gaurav, Muhammad Farhan Humayun, Jukka Heikkonen, Jatin Chaudhary
[ABSTRACT]
The evolution of Vision Transformers has led to their widespread adaptation
to different domains. Despite large-scale success, there remain significant
challenges including their reliance on extensive computational and memory
resources for pre-training on huge datasets as well as difficulties in
task-specific transfer learning. These limitations coupled with energy
inefficiencies mainly arise due to the computation-intensive self-attention
mechanism. To address these issues, we propose a novel Super-Pixel Based Patch
Pooling (SPPP) technique that generates context-aware, semantically rich, patch
embeddings to effectively reduce the architectural complexity and improve
efficiency. Additionally, we introduce the Light Latent Attention (LLA) module
in our pipeline by integrating latent tokens into the attention mechanism
allowing cross-attention operations to significantly reduce the time and space
complexity of the attention module. By leveraging the data-intuitive patch
embeddings coupled with dynamic positional encodings, our approach adaptively
modulates the cross-attention process to focus on informative regions while
maintaining the global semantic structure. This targeted attention improves
training efficiency and accelerates convergence. Notably, the SPPP module is
lightweight and can be easily integrated into existing transformer
architectures. Extensive experiments demonstrate that our proposed architecture
provides significant improvements in terms of computational efficiency while
achieving comparable results with the state-of-the-art approaches, highlighting
its potential for energy-efficient transformers suitable for edge deployment.
(The code is available on our GitHub repository:
https://github.com/zser092/Focused-Attention-ViT).
[LINK]
http://arxiv.org/abs/2506.18791v1
[DATE]
2025-06-24 00:00:57+08:00
[CATEGORIES]
cs.LG
Existing LLMs Are Not Self-Consistent For Simple Tasks
[AUTHORS]
Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao
[ABSTRACT]
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring
their decisions remain transparent and trustworthy requires self-consistency –
no contradictions in their internal reasoning. Our study reveals that even on
simple tasks, such as comparing points on a line or a plane, or reasoning in a
family tree, all smaller models are highly inconsistent, and even
state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully
self-consistent. To quantify and mitigate these inconsistencies, we introduce
inconsistency metrics and propose two automated methods – a graph-based and an
energy-based approach. While these fixes provide partial improvements, they
also highlight the complexity and importance of self-consistency in building
more reliable and interpretable AI. The code and data are available at
https://github.com/scorpio-nova/llm-self-consistency.
[COMMENTS]
10 pages, 6 figures
[LINK]
http://arxiv.org/abs/2506.18781v1
[DATE]
2025-06-23 23:50:21+08:00
[CATEGORIES]
cs.CL
Programming by Backprop: LLMs Acquire Reusable Algorithmic Abstractions During Code Training
[AUTHORS]
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel, Jakob Foerster, Laura Ruis
[ABSTRACT]
Training large language models (LLMs) on source code significantly enhances
their general-purpose reasoning abilities, but the mechanisms underlying this
generalisation are poorly understood. In this paper, we propose Programming by
Backprop (PBB) as a potential driver of this effect - teaching a model to
evaluate a program for inputs by training on its source code alone, without
ever seeing I/O examples. To explore this idea, we finetune LLMs on two sets of
programs representing simple maths problems and algorithms: one with source
code and I/O examples (w/ IO), the other with source code only (w/o IO). We
find evidence that LLMs have some ability to evaluate w/o IO programs for
inputs in a range of experimental settings, and make several observations.
Firstly, PBB works significantly better when programs are provided as code
rather than semantically equivalent language descriptions. Secondly, LLMs can
produce outputs for w/o IO programs directly, by implicitly evaluating the
program within the forward pass, and more reliably when stepping through the
program in-context via chain-of-thought. We further show that PBB leads to more
robust evaluation of programs across inputs than training on I/O pairs drawn
from a distribution that mirrors naturally occurring data. Our findings suggest
a mechanism for enhanced reasoning through code training: it allows LLMs to
internalise reusable algorithmic abstractions. Significant scope remains for
future work to enable LLMs to more effectively learn from symbolic procedures,
and progress in this direction opens other avenues like model alignment by
training on formal constitutional principles.
[LINK]
http://arxiv.org/abs/2506.18777v1
[DATE]
2025-06-23 23:45:44+08:00
[CATEGORIES]
cs.CL
cs.LG
Neural Total Variation Distance Estimators for Changepoint Detection in News Data
[AUTHORS]
Csaba Zsolnai, Niels Lörch, Julian Arnold
[ABSTRACT]
Detecting when public discourse shifts in response to major events is crucial
for understanding societal dynamics. Real-world data is high-dimensional,
sparse, and noisy, making changepoint detection in this domain a challenging
endeavor. In this paper, we leverage neural networks for changepoint detection
in news data, introducing a method based on the so-called learning-by-confusion
scheme, which was originally developed for detecting phase transitions in
physical systems. We train classifiers to distinguish between articles from
different time periods. The resulting classification accuracy is used to
estimate the total variation distance between underlying content distributions,
where significant distances highlight changepoints. We demonstrate the
effectiveness of this method on both synthetic datasets and real-world data
from The Guardian newspaper, successfully identifying major historical events
including 9/11, the COVID-19 pandemic, and presidential elections. Our approach
requires minimal domain knowledge, can autonomously discover significant shifts
in public discourse, and yields a quantitative measure of change in content,
making it valuable for journalism, policy analysis, and crisis monitoring.
[COMMENTS]
16 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.18764v1
[DATE]
2025-06-23 23:33:30+08:00
[CATEGORIES]
cs.LG
cs.CL
Adapting Foundation Speech Recognition Models to Impaired Speech: A Semantic Re-chaining Approach for Personalization of German Speech
[AUTHORS]
Niclas Pokel, Pehuén Moure, Roman Boehringer, Yingqiang Gao
[ABSTRACT]
Speech impairments caused by conditions such as cerebral palsy or genetic
disorders pose significant challenges for automatic speech recognition (ASR)
systems. Despite recent advances, ASR models like Whisper struggle with
non-normative speech due to limited training data and the difficulty of
collecting and annotating non-normative speech samples. In this work, we
propose a practical and lightweight pipeline to personalize ASR models,
formalizing the selection of words and enriching a small, speech-impaired
dataset with semantic coherence. Applied to data from a child with a structural
speech impairment, our approach shows promising improvements in transcription
quality, demonstrating the potential to reduce communication barriers for
individuals with atypical speech patterns.
[LINK]
http://arxiv.org/abs/2506.21622v1
[DATE]
2025-06-23 23:30:50+08:00
[CATEGORIES]
cs.CL
SEAL: Scaling to Emphasize Attention for Long-Context Retrieval
[AUTHORS]
Changhun Lee, Minsang Seok, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park
[ABSTRACT]
While many advanced LLMs are designed to handle long sequence data, we can
still observe notable quality degradation even within the sequence limit. In
this work, we introduce a novel approach called Scaling to Emphasize Attention
for Long-context retrieval (SEAL), which enhances the retrieval performance of
large language models (LLMs) over long contexts. We observe that specific
attention heads are closely tied to long-context retrieval, showing positive or
negative correlation with retrieval scores, and adjusting the strength of these
heads boosts the quality of LLMs in long context by a large margin. Built on
this insight, we propose a learning-based mechanism that leverages generated
data to emphasize these heads. By applying SEAL, we achieve significant
improvements in long-context retrieval performance across various tasks and
models. Additionally, when combined with existing training-free context
extension techniques, SEAL extends the contextual limits of LLMs while
maintaining highly reliable outputs.
[COMMENTS]
Accepted at ACL 2025 Main
[LINK]
http://arxiv.org/abs/2501.15225v2
[DATE]
2025-06-23 23:24:16+08:00
[CATEGORIES]
cs.CL
cs.LG
Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation
[AUTHORS]
Jie Li, Shifei Ding, Lili Guo, Xuan Li
[ABSTRACT]
Emotion Recognition in Conversation (ERC) aims to detect the emotions of
individual utterances within a conversation. Generating efficient and
modality-specific representations for each utterance remains a significant
challenge. Previous studies have proposed various models to integrate features
extracted using different modality-specific encoders. However, they neglect the
varying contributions of modalities to this task and introduce high complexity
by aligning modalities at the frame level. To address these challenges, we
propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation
(MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance
textual modality representations, while knowledge distillation is utilized to
strengthen representations of weaker modalities. Furthermore, we introduce a
multi-modal anchor gated transformer to effectively integrate utterance-level
representations across modalities. Extensive experiments on the IEMOCAP and
MELD datasets demonstrate the effectiveness of knowledge distillation in
enhancing modality representations and achieve state-of-the-art performance in
emotion recognition. Our code is available at:
https://github.com/JieLi-dd/MAGTKD.
[COMMENTS]
This paper has been accepted by IJCAI2025
[LINK]
http://arxiv.org/abs/2506.18716v1
[DATE]
2025-06-23 22:53:22+08:00
[CATEGORIES]
cs.LG
cs.CL
Handling Numeric Expressions in Automatic Speech Recognition
[AUTHORS]
Christian Huber, Alexander Waibel
[ABSTRACT]
This paper addresses the problem of correctly formatting numeric expressions
in automatic speech recognition (ASR) transcripts. This is challenging since
the expected transcript format depends on the context, e.g., 1945 (year) vs.
19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize
and format numeric expressions such as years, timestamps, currency amounts, and
quantities. For the end-to-end approach, we employed a data generation strategy
using a large language model (LLM) together with a text to speech (TTS) model
to generate adaptation data. The results on our test data set show that while
approaches based on LLMs perform well in recognizing formatted numeric
expressions, adapted end-to-end models offer competitive performance with the
advantage of lower latency and inference cost.
[LINK]
http://arxiv.org/abs/2408.00004v2
[DATE]
2025-06-23 22:45:07+08:00
[CATEGORIES]
cs.CL
Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
[AUTHORS]
Christian Huber, Alexander Waibel
[ABSTRACT]
Neural sequence-to-sequence systems deliver state-of-the-art performance for
automatic speech recognition. When using appropriate modeling units, e.g.,
byte-pair encoded characters, these systems are in principle open vocabulary
systems. In practice, however, they often fail to recognize words not seen
during training, e.g., named entities, acronyms, or domain-specific special
words. To address this problem, many context biasing methods have been
proposed; however, for words with a pronunciation-orthography mismatch, these
methods may still struggle. We propose a method which allows corrections of
substitution errors to improve the recognition accuracy of such challenging
words. Users can add corrections on the fly during inference. We show that with
this method we get a relative improvement in biased word error rate of up to
11%, while maintaining a competitive overall word error rate.
[LINK]
http://arxiv.org/abs/2506.18703v1
[DATE]
2025-06-23 22:42:03+08:00
[CATEGORIES]
cs.CL
cs.LG
Better Language Model Inversion by Compactly Representing Next-Token Distributions
[AUTHORS]
Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
[ABSTRACT]
Language model inversion seeks to recover hidden prompts using only language
model outputs. This capability has implications for security and accountability
in language model deployments, such as leaking private information from an
API-protected language model’s system message. We propose a new method –
prompt inversion from logprob sequences (PILS) – that recovers hidden prompts
by gleaning clues from the model’s next-token probabilities over the course of
multiple generation steps. Our method is enabled by a key insight: The
vector-valued outputs of a language model occupy a low-dimensional subspace.
This enables us to losslessly compress the full next-token probability
distribution over multiple generation steps using a linear map, allowing more
output information to be used for inversion. Our approach yields massive gains
over previous state-of-the-art methods for recovering hidden prompts, achieving
2–3.5 times higher exact recovery rates across test sets, in one case
increasing the recovery rate from 17% to 60%. Our method also exhibits
surprisingly good generalization behavior; for instance, an inverter trained on
16 generation steps gets 5–27 points higher prompt recovery when we increase
the number of steps to 32 at test time. Furthermore, we demonstrate strong
performance of our method on the more challenging task of recovering hidden
system messages. We also analyze the role of verbatim repetition in prompt
recovery and propose a new method for cross-family model transfer for
logit-based inverters. Our findings show that next-token probabilities are a
considerably more vulnerable attack surface for inversion attacks than
previously known.
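
The low-dimensional-subspace insight can be illustrated with a toy example. This is a minimal sketch and not the PILS implementation: it assumes a hypothetical 5-token vocabulary, a hidden size of 2, and (unlike an API-only attacker) direct access to the unembedding matrix. All names are illustrative. Because logits are a linear function of the hidden state, two logprob differences are enough to rebuild the full 5-way distribution.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def project(unembedding, hidden):
    """Logits = W h, with W given as a list of rows (one per token)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]

def compress(logprobs):
    """Keep only two numbers: logprobs of tokens 1 and 2 relative to token 0."""
    return [logprobs[1] - logprobs[0], logprobs[2] - logprobs[0]]

def decompress(unembedding, compressed):
    """Solve (W_i - W_0) . h = c_i for h (Cramer's rule), then rebuild all logprobs."""
    a = [unembedding[1][j] - unembedding[0][j] for j in range(2)]
    b = [unembedding[2][j] - unembedding[0][j] for j in range(2)]
    det = a[0] * b[1] - a[1] * b[0]
    h0 = (compressed[0] * b[1] - a[1] * compressed[1]) / det
    h1 = (a[0] * compressed[1] - compressed[0] * b[0]) / det
    return log_softmax(project(unembedding, [h0, h1]))

# Hypothetical toy model: 5 tokens, hidden size 2.
UNEMBEDDING = [[0.5, -1.0], [1.5, 0.2], [-0.7, 0.9], [0.3, 0.3], [-1.2, -0.4]]
HIDDEN = [0.8, -0.3]

FULL = log_softmax(project(UNEMBEDDING, HIDDEN))
COMPRESSED = compress(FULL)
RESTORED = decompress(UNEMBEDDING, COMPRESSED)
```

In this toy setting the compressed vector has as many entries as the hidden size, regardless of vocabulary size, which is why more generation steps can fit into the same input budget for an inverter.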
[LINK]
http://arxiv.org/abs/2506.17090v2
[DATE]
2025-06-23 22:39:37+08:00
[CATEGORIES]
cs.CL
HausaNLP at SemEval-2025 Task 11: Hausa Text Emotion Detection
[AUTHORS]
Sani Abdullahi Sani, Salim Abubakar, Falalu Ibrahim Lawan, Abdulhamid Abubakar, Maryam Bala
[ABSTRACT]
This paper presents our approach to multi-label emotion detection in Hausa, a
low-resource African language, for SemEval Track A. We fine-tuned AfriBERTa, a
transformer-based model pre-trained on African languages, to classify Hausa
text into six emotions: anger, disgust, fear, joy, sadness, and surprise. Our
methodology involved data preprocessing, tokenization, and model fine-tuning
using the Hugging Face Trainer API. The system achieved a validation accuracy
of 74.00%, with an F1-score of 73.50%, demonstrating the effectiveness of
transformer-based models for emotion detection in low-resource languages.
[LINK]
http://arxiv.org/abs/2506.16388v2
[DATE]
2025-06-23 22:32:28+08:00
[CATEGORIES]
cs.CL
“I understand why I got this grade”: Automatic Short Answer Grading with Feedback
[AUTHORS]
Dishank Aggarwal, Pritam Sil, Bhaskaran Raman, Pushpak Bhattacharyya
[ABSTRACT]
In recent years, there has been a growing interest in using Artificial
Intelligence (AI) to automate student assessment in education. Among different
types of assessments, summative assessments play a crucial role in evaluating a
student’s understanding level of a course. Such examinations often involve
short-answer questions. However, grading these responses and providing
meaningful feedback manually at scale is both time-consuming and
labor-intensive. Feedback is particularly important, as it helps students
recognize their strengths and areas for improvement. Despite the importance of
this task, there is a significant lack of publicly available datasets that
support automatic short-answer grading with feedback generation. To address
this gap, we introduce Engineering Short Answer Feedback (EngSAF), a dataset
designed for automatic short-answer grading with feedback. The dataset covers a
diverse range of subjects, questions, and answer patterns from multiple
engineering domains and contains ~5.8k data points. We incorporate feedback
into our dataset by leveraging the generative capabilities of state-of-the-art
large language models (LLMs) using our Label-Aware Synthetic Feedback
Generation (LASFG) strategy. This paper underscores the importance of enhanced
feedback in practical educational settings, outlines dataset annotation and
feedback generation processes, conducts a thorough EngSAF analysis, and
provides different LLM-based zero-shot and finetuned baselines for future
comparison. The best-performing model (Mistral-7B) achieves an overall accuracy
of 75.4% and 58.7% on unseen answers and unseen question test sets,
respectively. Additionally, we demonstrate the efficiency and effectiveness of
our ASAG system through its deployment in a real-world end-semester exam at a
reputed institute.
[LINK]
http://arxiv.org/abs/2407.12818v2
[DATE]
2025-06-23 22:24:28+08:00
[CATEGORIES]
cs.CL
Is There a Case for Conversation Optimized Tokenizers in Large Language Models?
[AUTHORS]
Raquel Ferrando, Javier Conde, Gonzalo Martínez, Pedro Reviriego
[ABSTRACT]
The computational and energy costs of Large Language Models (LLMs) have
increased exponentially, driven by growing model sizes and the massive
adoption of LLMs by hundreds of millions of users. The unit cost of an LLM is
the computation of a token. Therefore, the tokenizer plays an important role in
the efficiency of a model, and tokenizers are carefully optimized to minimize the
number of tokens for the text in their training corpus. One of the most popular
applications of LLMs is the chatbot that interacts with users. A key observation
is that, for those chatbots, what is important is the performance of the
tokenizer in the user text input and the chatbot responses. Those are most
likely different from the text in the training corpus. So, a question that
immediately arises is whether there is a potential benefit in optimizing
tokenizers for chatbot conversations. In this paper, this idea is explored for
different tokenizers by using a publicly available corpus of chatbot
conversations to redesign their vocabularies and evaluate their performance in
this domain. The results show that conversation-optimized tokenizers
consistently reduce the number of tokens in chatbot dialogues, which can lead
to meaningful energy savings, in the range of 5% to 10%, while having minimal or
even slightly positive impact on tokenization efficiency for the original
training corpus.
[LINK]
http://arxiv.org/abs/2506.18674v1
[DATE]
2025-06-23 22:18:46+08:00
[CATEGORIES]
cs.CL
C-SEO Bench: Does Conversational SEO Work?
[AUTHORS]
Haritz Puerto, Martin Gubri, Tommaso Green, Seong Joon Oh, Sangdoo Yun
[ABSTRACT]
Large Language Models (LLMs) are transforming search engines into
Conversational Search Engines (CSE). Consequently, Search Engine Optimization
(SEO) is being shifted into Conversational Search Engine Optimization (C-SEO).
We are beginning to see dedicated C-SEO methods for modifying web documents to
increase their visibility in CSE responses. However, they are often tested only
for a limited breadth of application domains; we do not understand whether
certain C-SEO methods would be effective for a broad range of domains.
Moreover, existing evaluations consider only a single-actor scenario where only
one web document adopts a C-SEO method; in reality, multiple players are likely
to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy
from the dynamics we have seen in SEO. We present C-SEO Bench, the first
benchmark designed to evaluate C-SEO methods across multiple tasks, domains,
and number of actors. We consider two search tasks, question answering and
product recommendation, with three domains each. We also formalize a new
evaluation protocol with varying adoption rates among involved actors. Our
experiments reveal that most current C-SEO methods are largely ineffective,
contrary to reported results in the literature. Instead, traditional SEO
strategies, those aiming to improve the ranking of the source in the LLM
context, are significantly more effective. We also observe that as we increase
the number of C-SEO adopters, the overall gains decrease, depicting a congested
and zero-sum nature of the problem. Our code and data are available at
https://github.com/parameterlab/c-seo-bench and
https://huggingface.co/datasets/parameterlab/c-seo-bench.
[LINK]
http://arxiv.org/abs/2506.11097v2
[DATE]
2025-06-23 21:56:31+08:00
[CATEGORIES]
cs.CL
Alignment Helps Make the Most of Multimodal Data
[AUTHORS]
Christian Arnold, Andreas Küpfer
[ABSTRACT]
Political scientists increasingly analyze multimodal data. However, the
effective analysis of such data requires aligning information across different
modalities. In our paper, we demonstrate the significance of such alignment.
Informed by a systematic review of 2,703 papers, we find that political
scientists typically do not align their multimodal data. Introducing a decision
tree that guides alignment choices, our framework highlights alignment’s
untapped potential and provides concrete advice in research design and modeling
decisions. We illustrate alignment’s analytical value through two applications:
predicting tonality in U.S. presidential campaign ads and cross-modal querying
of German parliamentary speeches to examine responses to the far-right AfD.
[COMMENTS]
Working Paper
[LINK]
http://arxiv.org/abs/2405.08454v3
[DATE]
2025-06-23 21:51:06+08:00
[CATEGORIES]
cs.CL
Pretraining Language Models to Ponder in Continuous Space
[AUTHORS]
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
[ABSTRACT]
Humans ponder before articulating complex sentence elements, enabling deeper
cognitive processing through focused effort. In this work, we introduce this
pondering process into language models by repeatedly invoking the forward
process within a single token generation step. During pondering, instead of
generating an actual token sampled from the prediction distribution, the model
ponders by yielding a weighted sum of all token embeddings according to the
predicted token distribution. The generated embedding is then fed back as input
for another forward pass. We show that the model can learn to ponder in this
way through self-supervised learning, without any human annotations.
Experiments across three widely used open-source architectures (GPT-2, Pythia,
and LLaMA) and extensive downstream task evaluations demonstrate the
effectiveness and generality of our method. For language modeling tasks,
pondering language models achieve performance comparable to vanilla models with
twice the number of parameters. On 9 downstream benchmarks, our
pondering-enhanced Pythia models significantly outperform the official Pythia
models. Notably, PonderingPythia-2.8B surpasses Pythia-6.9B, and
PonderingPythia-1B is comparable to TinyLlama-1.1B, which is trained on 10
times more data. The code is available at
https://github.com/LUMIA-Group/PonderingLM.
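
The pondering step described above (a probability-weighted sum of token embeddings fed back as the next input) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `toy_forward` is a hypothetical stand-in for a real model's forward pass, and the embedding table and function names are invented for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_embedding(probs, embedding_table):
    """Weighted sum of all token embeddings under the predicted distribution."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]

def toy_forward(x, embedding_table):
    """Hypothetical stand-in for a forward pass: dot product with each embedding."""
    return [sum(a * b for a, b in zip(x, row)) for row in embedding_table]

def ponder(x, embedding_table, steps):
    """Repeat the forward pass, feeding the expected embedding back as input."""
    for _ in range(steps):
        probs = softmax(toy_forward(x, embedding_table))
        x = expected_embedding(probs, embedding_table)
    return x

# Hypothetical 2-token vocabulary with 2-dimensional embeddings.
TABLE = [[1.0, 0.0], [0.0, 1.0]]
PONDERED = ponder([1.0, 0.0], TABLE, steps=3)
```

No token is sampled inside the loop, so the whole procedure stays differentiable, which is consistent with the abstract's claim that pondering can be learned through self-supervised training.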
[LINK]
http://arxiv.org/abs/2505.20674v2
[DATE]
2025-06-23 21:48:37+08:00
[CATEGORIES]
cs.CL
ByteSpan: Information-Driven Subword Tokenisation
[AUTHORS]
Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery
[ABSTRACT]
Recent dynamic tokenisation methods operate directly on bytes and pool their
latent representations into patches. This bears similarities to computational
models of word segmentation that determine lexical boundaries using spikes in
an autoregressive model’s prediction error. Inspired by this connection, we
explore whether grouping predictable bytes - rather than pooling their
representations - can yield a useful fixed subword vocabulary. We propose a new
information-driven subword tokeniser, ByteSpan, that uses an external
byte-level LM during training to identify contiguous predictable byte sequences
and group them into subwords. Experiments show that ByteSpan yields efficient
vocabularies with higher morphological alignment scores than BPE for English.
Multilingual experiments show similar compression and Rényi efficiency for 25
languages.
[COMMENTS]
Accepted to TokShop 2025 (Non-archival)
[LINK]
http://arxiv.org/abs/2506.18639v1
[DATE]
2025-06-23 21:42:00+08:00
[CATEGORIES]
cs.CL
AggTruth: Contextual Hallucination Detection using Aggregated Attention Scores in LLMs
[AUTHORS]
Piotr Matys, Jan Eliasz, Konrad Kiełczyński, Mikołaj Langner, Teddy Ferdinan, Jan Kocoń, Przemysław Kazienko
[ABSTRACT]
In real-world applications, Large Language Models (LLMs) often hallucinate,
even in Retrieval-Augmented Generation (RAG) settings, which poses a
significant challenge to their deployment. In this paper, we introduce
AggTruth, a method for online detection of contextual hallucinations by
analyzing the distribution of internal attention scores in the provided context
(passage). Specifically, we propose four different variants of the method, each
varying in the aggregation technique used to calculate attention scores. Across
all LLMs examined, AggTruth demonstrated stable performance in both same-task
and cross-task setups, outperforming the current SOTA in multiple scenarios.
Furthermore, we conducted an in-depth analysis of feature selection techniques
and examined how the number of selected attention heads impacts detection
performance, demonstrating that careful selection of heads is essential to
achieve optimal results.
[COMMENTS]
ICCS 2025 Workshops
[LINK]
http://arxiv.org/abs/2506.18628v1
[DATE]
2025-06-23 21:35:05+08:00
[CATEGORIES]
cs.CL
The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
[AUTHORS]
Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev
[ABSTRACT]
In recent months, large language models (LLMs) have made significant progress
in mathematical proof generation, but further advancement is hindered by the
lack of a large-scale, high-quality dataset of human-evaluated proofs. While
expensive to create, such a dataset is essential for driving improvements in
training and enabling a rigorous analysis of proof generation capabilities. In
this work, we present the Open Proof Corpus (OPC), a dataset comprising over
5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was
specifically designed for broad applicability and downstream usage in proof
generation research and is the first to include a substantial number of
correct, LLM-generated solutions to problems from prestigious mathematics
competitions such as the USAMO and IMO. Using the OPC, we explore critical
questions in automated proof generation: (1) the performance gap between
natural language and formal proof generation, (2) the discrepancy between
final-answer accuracy and full-proof validity, and (3) the impact of best-of-n
selection on proof quality. Finally, to showcase the utility of the OPC, we
finetune an 8B-parameter model on the dataset, obtaining a model that performs
on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof
correctness.
[LINK]
http://arxiv.org/abs/2506.21621v1
[DATE]
2025-06-23 21:31:58+08:00
[CATEGORIES]
cs.CL
Semantic similarity estimation for domain specific data using BERT and other techniques
[AUTHORS]
R. Prashanth
[ABSTRACT]
Estimation of semantic similarity is an important research problem in both
natural language processing and natural language understanding, and it
has tremendous application in various downstream tasks such as question
answering, semantic search, information retrieval, document clustering,
word-sense disambiguation and machine translation. In this work, we carry out
the estimation of semantic similarity using different state-of-the-art
techniques including the USE (Universal Sentence Encoder), InferSent and the
most recent BERT, or Bidirectional Encoder Representations from Transformers,
models. We use two question-pair datasets for the analysis: one is a
domain-specific in-house dataset and the other is the public Quora question
pairs dataset. We observe that the BERT model performed much better than the
other methods. This is likely because of the fine-tuning procedure involved in
its training process, which allows it to learn patterns from the training data
that is used. This work demonstrates the applicability of BERT to
domain-specific datasets. We infer from the analysis that BERT is the best
technique to use in the case of domain-specific data.
[COMMENTS]
This is a preprint version of an article accepted for publication in
the proceedings of Machine Learning and Data Mining 2019
[LINK]
http://arxiv.org/abs/2506.18602v1
[DATE]
2025-06-23 21:03:59+08:00
[CATEGORIES]
cs.CL
No Training Wheels: Steering Vectors for Bias Correction at Inference Time
[AUTHORS]
Aviral Gupta, Armaan Sethi, Ameesh Sethi
[ABSTRACT]
Neural network classifiers trained on datasets with uneven group
representation often inherit class biases and learn spurious correlations.
These models may perform well on average but consistently fail on atypical
groups. For example, in hair color classification, datasets may over-represent
females with blond hair, reinforcing stereotypes. Although various algorithmic
and data-centric methods have been proposed to address such biases, they often
require retraining or significant compute. In this work, we propose a cheap,
training-free method inspired by steering vectors used to edit behaviors in
large language models. We compute the difference in mean activations between
majority and minority groups to define a “bias vector,” which we subtract from
the model’s residual stream. This leads to reduced classification bias and
improved worst-group accuracy. We explore multiple strategies for extracting
and applying these vectors in transformer-like classifiers, showing that
steering vectors, traditionally used in generative models, can also be
effective in classification. More broadly, we showcase an extremely cheap,
inference-time, training-free method to mitigate bias in classification models.
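
The bias-vector construction described above can be sketched in a few lines. This is a minimal illustration, assuming activations are plain lists of floats collected at one layer; the function names and the `strength` scaling parameter are hypothetical and not taken from the paper.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def bias_vector(majority_acts, minority_acts):
    """Difference in mean activations between majority and minority groups."""
    major = mean_vector(majority_acts)
    minor = mean_vector(minority_acts)
    return [a - b for a, b in zip(major, minor)]

def steer(residual, bias, strength=1.0):
    """Subtract the (scaled) bias vector from a residual-stream activation."""
    return [r - strength * b for r, b in zip(residual, bias)]

# Hypothetical 2-dimensional activations for each group.
MAJORITY = [[2.0, 0.0], [4.0, 0.0]]
MINORITY = [[1.0, 2.0], [1.0, 0.0]]
BIAS = bias_vector(MAJORITY, MINORITY)
STEERED = steer([5.0, 5.0], BIAS)
```

Only forward passes are needed to collect the activations, so the correction requires no gradient updates, matching the training-free framing of the abstract.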
[LINK]
http://arxiv.org/abs/2506.18598v1
[DATE]
2025-06-23 20:58:54+08:00
[CATEGORIES]
cs.LG
cs.CL
Airalogy: AI-empowered universal data digitization for research automation
[AUTHORS]
Zijie Yang, Qiji Zhou, Fang Guo, Sijie Zhang, Yexun Xi, Jinglei Nie, Yudian Zhu, Liping Huang, Chou Wu, Yonghe Xia, Xiaoyu Ma, Yingming Pu, Panzhong Lu, Junshu Pan, Mingtao Chen, Tiannan Guo, Yanmei Dou, Hongyu Chen, Anping Zeng, Jiaxing Huang, Tian Xu, Yue Zhang
[ABSTRACT]
Research data are the foundation of Artificial Intelligence (AI)-driven
science, yet current AI applications remain limited to a few fields with
readily available, well-structured, digitized datasets. Achieving comprehensive
AI empowerment across multiple disciplines is still out of reach. Present-day
research data collection is often fragmented, lacking unified standards,
inefficiently managed, and difficult to share. Creating a single platform for
standardized data digitization needs to overcome the inherent challenge of
balancing between universality (supporting the diverse, ever-evolving needs of
various disciplines) and standardization (enforcing consistent formats to fully
enable AI). No existing platform accommodates both facets. Building a truly
multidisciplinary platform requires integrating scientific domain knowledge
with sophisticated computing skills. Researchers often lack the computational
expertise to design customized and standardized data recording methods, whereas
platform developers rarely grasp the intricate needs of multiple scientific
domains. These gaps impede research data standardization and hamper AI-driven
progress. In this study, we address these challenges by developing Airalogy
(https://airalogy.com), the world’s first AI- and community-driven platform
that balances universality and standardization for digitizing research data
across multiple disciplines. Airalogy represents entire research workflows
using customizable, standardized data records and offers an advanced AI
research copilot for intelligent Q&A, automated data entry, analysis, and
research automation. Already deployed in laboratories across all four schools
of Westlake University, Airalogy has the potential to accelerate and automate
scientific innovation in universities, industry, and the global research
community, ultimately benefiting humanity as a whole.
[COMMENTS]
146 pages, 6 figures, 49 supplementary figures
[LINK]
http://arxiv.org/abs/2506.18586v1
[DATE]
2025-06-23 20:43:16+08:00
[CATEGORIES]
cs.CL
Parallel Continuous Chain-of-Thought with Jacobi Iteration
[AUTHORS]
Haoyi Wu, Zhihao Teng, Kewei Tu
[ABSTRACT]
Continuous chain-of-thought has been shown to be effective in saving
reasoning tokens for large language models. By reasoning with continuous latent
thought tokens, continuous CoT is able to perform implicit reasoning in a
compact manner. However, the sequential dependencies between latent thought
tokens spoil parallel training, leading to long training time. In this paper,
we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi
iteration on the latent thought tokens, updating them iteratively in parallel
instead of sequentially and thus improving both training and inference
efficiency of continuous CoT. Experiments demonstrate that by choosing the
proper number of iterations, we are able to achieve comparable or even better
performance while saving nearly 50% of the training and inference time.
Moreover, PCCoT shows better stability and robustness in the training process.
Our code is available at https://github.com/whyNLP/PCCoT.
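
The idea of replacing a sequential chain with parallel Jacobi updates can be sketched on a toy dependency. This is an illustration only, not PCCoT itself: the scalar `step` function is a hypothetical stand-in for the model computation that produces one latent thought token from the previous one.

```python
def step(prev):
    """Hypothetical stand-in for computing one latent token from the previous one."""
    return 0.5 * prev + 1.0

def sequential_chain(x0, n):
    """Baseline: each token waits for the one before it."""
    tokens, prev = [], x0
    for _ in range(n):
        prev = step(prev)
        tokens.append(prev)
    return tokens

def jacobi_chain(x0, n, iterations):
    """Jacobi iteration: every token is updated at once from the previous iterate."""
    tokens = [0.0] * n  # initial guess for all latent tokens
    for _ in range(iterations):
        inputs = [x0] + tokens[:-1]          # token i reads the old value of token i-1
        tokens = [step(v) for v in inputs]   # all updates are independent of each other
    return tokens

EXACT = sequential_chain(4.0, 4)
PARALLEL_FULL = jacobi_chain(4.0, 4, iterations=4)
PARALLEL_HALF = jacobi_chain(4.0, 4, iterations=2)
```

In this toy chain, n iterations reproduce the sequential result exactly, and after k iterations the first k tokens are already exact. Using fewer iterations than tokens trades exactness for parallelism, which mirrors the abstract's remark about choosing the proper number of iterations.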
[COMMENTS]
under review
[LINK]
http://arxiv.org/abs/2506.18582v1
[DATE]
2025-06-23 20:35:41+08:00
[CATEGORIES]
cs.CL
When Fine-Tuning Fails: Lessons from MS MARCO Passage Ranking
[AUTHORS]
Manu Pande, Shahil Kumar, Anay Yatin Damle
[ABSTRACT]
This paper investigates the counterintuitive phenomenon where fine-tuning
pre-trained transformer models degrades performance on the MS MARCO passage
ranking task. Through comprehensive experiments involving five model
variants, including full-parameter fine-tuning and parameter-efficient LoRA
adaptations, we demonstrate that all fine-tuning approaches underperform the
base sentence-transformers/all-MiniLM-L6-v2 model (MRR@10: 0.3026). Our
analysis reveals that fine-tuning disrupts the optimal embedding space
structure learned during the base model’s extensive pre-training on 1 billion
sentence pairs, including 9.1 million MS MARCO samples. UMAP visualizations
show progressive embedding space flattening, while training dynamics analysis
and computational efficiency metrics further support our findings. These
results challenge conventional wisdom about transfer learning effectiveness on
saturated benchmarks and suggest architectural innovations may be necessary for
meaningful improvements.
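
For readers unfamiliar with the MRR@10 metric quoted above, it can be computed as in the following sketch. This is the standard definition of mean reciprocal rank with a cutoff, written as a small self-contained function; the passage IDs in the example are made up.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant item within the top k results.

    ranked_lists: one ranked list of passage IDs per query.
    relevant_sets: one set of relevant passage IDs per query.
    Queries with no relevant item in the top k contribute 0.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Hypothetical example: first query hits at rank 2, second query misses.
RANKINGS = [["a", "b"], ["x", "y", "z"]]
RELEVANT = [{"b"}, {"q"}]
SCORE = mrr_at_k(RANKINGS, RELEVANT)
```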
[LINK]
http://arxiv.org/abs/2506.18535v1
[DATE]
2025-06-23 19:46:05+08:00
[CATEGORIES]
cs.CL
LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Inconsistencies
[AUTHORS]
Felix Friedrich, Simone Tedeschi, Patrick Schramowski, Manuel Brack, Roberto Navigli, Huu Nguyen, Bo Li, Kristian Kersting
[LINK]
http://arxiv.org/abs/2412.15035v3
[DATE]
2025-06-23 19:45:09+08:00
[CATEGORIES]
cs.CL
Affordable AI Assistants with Knowledge Graph of Thoughts
[AUTHORS]
Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler
[ABSTRACT]
Large Language Models (LLMs) are revolutionizing the development of AI
assistants capable of performing diverse tasks across domains. However, current
state-of-the-art LLM-driven agents face significant challenges, including high
operational costs and limited success rates on complex benchmarks like GAIA. To
address these issues, we propose Knowledge Graph of Thoughts (KGoT), an
innovative AI assistant architecture that integrates LLM reasoning with
dynamically constructed knowledge graphs (KGs). KGoT extracts and structures
task-relevant knowledge into a dynamic KG representation, iteratively enhanced
through external tools such as math solvers, web crawlers, and Python scripts.
Such structured representation of task-relevant knowledge enables low-cost
models to solve complex tasks effectively while also minimizing bias and noise.
For example, KGoT achieves a 29% improvement in task success rates on the GAIA
benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover,
harnessing a smaller model dramatically reduces operational costs by over 36x
compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and
Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a
scalable, affordable, versatile, and high-performing solution for AI
assistants.
[LINK]
http://arxiv.org/abs/2504.02670v4
[DATE]
2025-06-23 19:43:03+08:00
[CATEGORIES]
cs.CL
cs.LG
End-to-End Spoken Grammatical Error Correction
[AUTHORS]
Mengjie Qian, Rao Ma, Stefano Bannò, Mark J. F. Gales, Kate M. Knill
[ABSTRACT]
Grammatical Error Correction (GEC) and feedback play a vital role in
supporting second language (L2) learners, educators, and examiners. While
written GEC is well-established, spoken GEC (SGEC), aiming to provide feedback
based on learners’ speech, poses additional challenges due to disfluencies,
transcription errors, and the lack of structured input. SGEC systems typically
follow a cascaded pipeline consisting of Automatic Speech Recognition (ASR),
disfluency detection, and GEC, making them vulnerable to error propagation
across modules. This work examines an End-to-End (E2E) framework for SGEC and
feedback generation, highlighting challenges and possible solutions when
developing these systems. Cascaded, partial-cascaded and E2E architectures are
compared, all built on the Whisper foundation model. A challenge for E2E
systems is the scarcity of GEC labeled spoken data. To address this, an
automatic pseudo-labeling framework is examined, increasing the training data
from 77 to over 2500 hours. To improve the accuracy of the SGEC system,
additional contextual information, exploiting the ASR output, is investigated.
Feedback to candidates on their mistakes is an essential step to improving
performance. In E2E systems the SGEC output must be compared with an estimate
of the fluent transcription to obtain the feedback. To improve the precision of
this feedback, a novel reference alignment process is proposed that aims to
remove hypothesised edits that result from fluent transcription errors.
Finally, these approaches are combined with an edit confidence estimation
approach, to exclude low-confidence edits. Experiments on the in-house
Linguaskill (LNG) corpora and the publicly available Speak & Improve (S&I)
corpus show that the proposed approaches significantly boost E2E SGEC
performance.
[COMMENTS]
This work has been submitted to the IEEE for possible publication
[LINK]
http://arxiv.org/abs/2506.18532v1
[DATE]
2025-06-23 19:40:04+08:00
[CATEGORIES]
cs.CL
cs.LG
Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?
[AUTHORS]
Jean-Baptiste Döderlein, Nguessan Hermann Kouadio, Mathieu Acher, Djamel Eddine Khelladi, Benoit Combemale
[ABSTRACT]
Language models are promising solutions for tackling increasingly complex
problems. In software engineering, they recently gained attention in code
assistants, which generate programs from a natural language task description
(prompt). They have the potential to save time and effort but remain poorly
understood, limiting their optimal use. In this article, we investigate the
impact of input variations on two configurations of a language model, focusing
on parameters such as task description, surrounding context, model creativity,
and the number of generated solutions. We design specific operators to modify
these inputs and apply them to three LLM-based code assistants (Copilot, Codex,
StarCoder2) and two benchmarks representing algorithmic problems (HumanEval,
LeetCode). Our study examines whether these variations significantly affect
program quality and how these effects generalize across models. Our results
show that varying input parameters can greatly improve performance, achieving
up to 79.27% success in one-shot generation compared to 22.44% for Codex and
31.1% for Copilot in default settings. Actioning this potential in practice is
challenging due to the complex interplay in our study - the optimal settings
for temperature, prompt, and number of generated solutions vary by problem.
Reproducing our study with StarCoder2 confirms these findings, indicating they
are not model-specific. We also uncover surprising behaviors (e.g., fully
removing the prompt can be effective), revealing model brittleness and areas
for improvement.
[COMMENTS]
53 pages, 3 figures (not counting subfigures), 16 tables
[LINK]
http://arxiv.org/abs/2210.14699v3
[DATE]
2025-06-23 19:31:33+08:00
[CATEGORIES]
cs.CL
ASCenD-BDS: Adaptable, Stochastic and Context-aware framework for Detection of Bias, Discrimination and Stereotyping
[AUTHORS]
Rajiv Bahl, Venkatesan N, Parimal Aglawe, Aastha Sarasapalli, Bhavya Kancharla, Chaitanya kolukuluri, Harish Mohite, Japneet Hora, Kiran Kakollu, Rahul Dhiman, Shubham Kapale, Sri Bhagya Kathula, Vamsikrishna Motru, Yogeshwar Reddy
[ABSTRACT]
The rapid evolution of Large Language Models (LLMs) has transformed natural
language processing but raises critical concerns about biases inherent in their
deployment and use across diverse linguistic and sociocultural contexts. This
paper presents a framework named ASCenD-BDS (Adaptable, Stochastic and
Context-aware framework for Detection of Bias, Discrimination and
Stereotyping). The framework detects bias, discrimination, and stereotyping
across various categories such as gender, caste, age, disability,
socioeconomic status, and linguistic variations, using an approach that is
Adaptable, Stochastic and Context-aware. The existing
frameworks rely heavily on usage of datasets to generate scenarios for
detection of Bias, Discrimination and Stereotyping. Examples include datasets
such as Civil Comments, Wino Gender, WinoBias, BOLD, CrowS Pairs and BBQ.
However, such an approach provides point solutions. As a result, these datasets
provide a finite number of scenarios for assessment. The current framework
overcomes this limitation by having features which enable Adaptability,
Stochasticity, and Context Awareness. Context awareness can be customized for any
nation or culture or sub-culture (for example an organization’s unique
culture). In this paper, context awareness in the Indian context has been
established. Content has been leveraged from Indian Census 2011 to have a
commonality of categorization. A framework has been developed using Category,
Sub-Category, STEM, X-Factor, Synonym to enable the features for Adaptability,
Stochasticity and Context awareness. The framework has been described in detail
in Section 3. Overall, 800-plus STEMs, 10 Categories, and 31 unique Sub-Categories
were developed by a team of consultants at Saint Fox Consultancy Private Ltd.
The concept has been tested out in SFCLabs as part of product development.
[COMMENTS]
17 pages, 6 Figures and this manuscript will be submitted to Q1,Q2
Journals
[LINK]
http://arxiv.org/abs/2502.02072v2
[DATE]
2025-06-23 19:11:32+08:00
[CATEGORIES]
cs.CL
HiRAG: Retrieval-Augmented Generation with Hierarchical Knowledge
[AUTHORS]
Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, James Cheng
[ABSTRACT]
Graph-based Retrieval-Augmented Generation (RAG) methods have significantly
enhanced the performance of large language models (LLMs) in domain-specific
tasks. However, existing RAG methods do not adequately utilize the naturally
inherent hierarchical knowledge in human cognition, which limits the
capabilities of RAG systems. In this paper, we introduce a new RAG approach,
called HiRAG, which utilizes hierarchical knowledge to enhance the semantic
understanding and structure capturing capabilities of RAG systems in the
indexing and retrieval processes. Our extensive experiments demonstrate that
HiRAG achieves significant performance improvements over the state-of-the-art
baseline methods.
[LINK]
http://arxiv.org/abs/2503.10150v2
[DATE]
2025-06-23 19:08:00+08:00
[CATEGORIES]
cs.CL
MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
[AUTHORS]
Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu
[ABSTRACT]
Accurate and interpretable multi-disease diagnosis remains a critical
challenge in medical research, particularly when leveraging heterogeneous
multimodal medical data. Current approaches often rely on single-modal data,
limiting their ability to comprehensively understand complex diseases. To
address this, we propose MedTVT-R1, a novel Multimodal Large Language Model
(MLLM) framework designed to integrate clinical multimodal data for reasoning
and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction
dataset that provides question-answer pairs for physiological-level
interpretations and disease-level diagnoses with a Chain of Evidence approach.
MedTVT-R1 incorporates a modality perception layer to capture inter-modal
dependencies and adaptively weight modality contributions. Additionally, we
employ Group Relative Policy Optimization (GRPO)-based Reinforcement
Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning.
Experimental results demonstrate MedTVT-R1’s superiority in multimodal feature
utilization and multi-disease diagnosis, offering significant potential for
clinical applications such as diagnostic report generation and comorbidity
reasoning. The dataset and code are available at
https://github.com/keke-nice/MedTVT-R1.
[LINK]
http://arxiv.org/abs/2506.18512v1
[DATE]
2025-06-23 19:06:31+08:00
[CATEGORIES]
cs.CL
Smooth Operators: LLMs Translating Imperfect Hints into Disfluency-Rich Transcripts
[AUTHORS]
Duygu Altinok
[ABSTRACT]
Accurate detection of disfluencies in spoken language is crucial for
enhancing the performance of automatic speech and language processing systems,
as well as fostering the development of more inclusive speech and language
technologies. Leveraging the growing trend of large language models (LLMs) as
versatile learners capable of processing both lexical and non-lexical inputs
(e.g., audio and video), we propose a novel approach to transcribing
disfluencies as explicit tokens with timestamps, enabling the generation of
fully annotated disfluency-rich transcripts. Our method integrates acoustic
representations extracted from an audio encoder with textual inputs of varying
quality: clean transcriptions without disfluencies, time-aligned transcriptions
from aligners, or outputs from phoneme-based ASR models – all of which may
contain imperfections. Importantly, our experiments demonstrate that textual
inputs do not need to be flawless. As long as they include timestamp-related
cues, LLMs can effectively smooth the input and produce fully
disfluency-annotated transcripts, underscoring their robustness in handling
imperfect hints.
[COMMENTS]
Accepted to INTERSPEECH2025 workshop DISS2025
[LINK]
http://arxiv.org/abs/2506.18510v1
[DATE]
2025-06-23 19:04:20+08:00
[CATEGORIES]
cs.CL
Comparative Evaluation of ChatGPT and DeepSeek Across Key NLP Tasks: Strengths, Weaknesses, and Domain-Specific Performance
[AUTHORS]
Wael Etaiwi, Bushra Alhijawi
[ABSTRACT]
The increasing use of large language models (LLMs) in natural language
processing (NLP) tasks has sparked significant interest in evaluating their
effectiveness across diverse applications. While models like ChatGPT and
DeepSeek have shown strong results in many NLP domains, a comprehensive
evaluation is needed to understand their strengths, weaknesses, and
domain-specific abilities. This is critical as these models are applied to
various tasks, from sentiment analysis to more nuanced tasks like textual
entailment and translation. This study aims to evaluate ChatGPT and DeepSeek
across five key NLP tasks: sentiment analysis, topic classification, text
summarization, machine translation, and textual entailment. A structured
experimental protocol is used to ensure fairness and minimize variability. Both
models are tested with identical, neutral prompts and evaluated on two
benchmark datasets per task, covering domains like news, reviews, and
formal/informal texts. The results show that DeepSeek excels in classification
stability and logical reasoning, while ChatGPT performs better in tasks
requiring nuanced understanding and flexibility. These findings provide
valuable insights for selecting the appropriate LLM based on task requirements.
[LINK]
http://arxiv.org/abs/2506.18501v1
[DATE]
2025-06-23 18:52:54+08:00
[CATEGORIES]
cs.CL
AI-Generated Song Detection via Lyrics Transcripts
[AUTHORS]
Markus Frohmann, Elena V. Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
[ABSTRACT]
The recent rise in capabilities of AI-based music generation tools has
created an upheaval in the music industry, necessitating the creation of
accurate methods to detect such AI-generated content. This can be done using
audio-based detectors; however, it has been shown that they struggle to
generalize to unseen generators or when the audio is perturbed. Furthermore,
recent work used accurate and cleanly formatted lyrics sourced from a lyrics
provider database to detect AI-generated music. However, in practice, such
perfect lyrics are not available (only the audio is); this leaves a substantial
gap in applicability in real-life use cases. In this work, we instead propose
solving this gap by transcribing songs using general automatic speech
recognition (ASR) models. We do this using several detectors. The results on
diverse, multi-genre, and multi-lingual lyrics show generally strong detection
performance across languages and genres, particularly for our best-performing
model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that
our method is more robust than state-of-the-art audio-based ones when the audio
is perturbed in different ways and when evaluated on different music
generators. Our code is available at
https://github.com/deezer/robust-AI-lyrics-detection.
[COMMENTS]
Accepted to ISMIR 2025
[LINK]
http://arxiv.org/abs/2506.18488v1
[DATE]
2025-06-23 18:42:50+08:00
[CATEGORIES]
cs.CL
MeRF: Motivation-enhanced Reinforcement Finetuning for Large Reasoning Models
[AUTHORS]
Junjie Zhang, Guozheng Ma, Shunyu Liu, Haoyu Wang, Jiaxing Huang, Ting-En Lin, Fei Huang, Yongbin Li, Dacheng Tao
[ABSTRACT]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a
powerful learn-to-reason paradigm for Large Language Models (LLMs) to tackle
complex reasoning tasks. However, existing RLVR methods overlook one of the
most distinctive capabilities of LLMs, their in-context learning ability, as
prominently demonstrated by the success of Chain-of-Thought (CoT) prompting.
This motivates us to explore how reinforcement learning can be effectively
combined with in-context learning to better improve the reasoning capabilities
of LLMs. In this paper, we introduce Motivation-enhanced Reinforcement
Finetuning (MeRF), an intuitive yet effective method enhancing reinforcement
learning of LLMs by involving “telling LLMs the rules of the game”.
Specifically, MeRF directly injects the reward specification into the prompt,
which serves as an in-context motivation for the model to improve its responses
with awareness of the optimization objective. This simple modification
leverages the in-context learning ability of LLMs aligning generation with
optimization, thereby incentivizing the model to generate desired outputs from
both inner motivation and external reward. Empirical evaluations on the Knights
and Knaves (K&K) logic puzzle reasoning benchmark demonstrate that
MeRF achieves substantial performance gains over baselines. Moreover,
ablation studies show that performance improves with greater consistency
between the in-context motivation and the external reward function, while the
model also demonstrates an ability to adapt to misleading motivations through
reinforcement learning.
[LINK]
http://arxiv.org/abs/2506.18485v1
[DATE]
2025-06-23 18:37:57+08:00
[CATEGORIES]
cs.CL
MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
[AUTHORS]
Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen
[ABSTRACT]
With the widespread application of LLM-based dialogue systems in daily life,
quality assurance has become more important than ever. Recent research has
successfully introduced methods to identify unexpected behaviour in single-turn
testing scenarios. However, multi-turn interaction is the common real-world
usage of dialogue systems, yet testing methods for such interactions remain
underexplored. This is largely due to the oracle problem in multi-turn testing,
which continues to pose a significant challenge for dialogue system developers
and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn
dialogue testing approach, which mitigates the test oracle problem in testing
LLM-based dialogue systems. MORTAR formalises the multi-turn testing for
dialogue systems, and automates the generation of question-answer dialogue test
cases with multiple dialogue-level perturbations and metamorphic relations
(MRs). The automated MR matching mechanism allows MORTAR more flexibility and
efficiency in metamorphic testing. The proposed approach is fully automated
without reliance on LLM judges. In testing six popular LLM-based dialogue
systems, MORTAR reaches significantly better effectiveness with over 150% more
bugs revealed per test case when compared to the single-turn metamorphic
testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality
bugs in terms of diversity, precision and uniqueness. MORTAR is expected to
inspire more multi-turn testing approaches, and assist developers in evaluating
the dialogue system performance more comprehensively with constrained test
resources and budget.
[LINK]
http://arxiv.org/abs/2412.15557v3
[DATE]
2025-06-23 18:23:35+08:00
[CATEGORIES]
cs.CL
LLMs on a Budget? Say HOLA
[AUTHORS]
Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, Usman Naseem
[ABSTRACT]
Running Large Language Models (LLMs) on edge devices is constrained by high
compute and memory demands posing a barrier for real-time applications in
sectors like healthcare, education, and embedded systems. Current solutions
such as quantization, pruning, and retrieval-augmented generation (RAG) offer
only partial optimizations and often compromise on speed or accuracy. We
introduce HOLA, an end-to-end optimization framework for efficient LLM
deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD)
for faster inference without quality loss. Externally, AdaComp-RAG adjusts
retrieval complexity based on context needs. Together with LoBi, which blends
structured pruning (LoRA) and quantization, HOLA delivers significant gains:
17.6% EMA on GSM8K, 10.5% MCA on ARC, and reduced latency and memory on edge
devices like Jetson Nano, proving both scalable and production-ready.
[LINK]
http://arxiv.org/abs/2506.18952v1
[DATE]
2025-06-23 18:20:47+08:00
[CATEGORIES]
cs.LG
cs.CL
PlantDeBERTa: An Open Source Language Model for Plant Science
[AUTHORS]
Hiba Khey, Amine Lakhder, Salma Rouichi, Imane El Ghabi, Kamal Hejjaoui, Younes En-nahli, Fahd Kalloubi, Moez Amri
[ABSTRACT]
The rapid advancement of transformer-based language models has catalyzed
breakthroughs in biomedical and clinical natural language processing; however,
plant science remains markedly underserved by such domain-adapted tools. In
this work, we present PlantDeBERTa, a high-performance, open-source language
model specifically tailored for extracting structured knowledge from plant
stress-response literature. Built upon the DeBERTa architecture, known for its
disentangled attention and robust contextual encoding, PlantDeBERTa is
fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with
a primary focus on lentil (Lens culinaris) responses to diverse abiotic and
biotic stressors. Our methodology combines transformer-based modeling with
rule-enhanced linguistic post-processing and ontology-grounded entity
normalization, enabling PlantDeBERTa to capture biologically meaningful
relationships with precision and semantic fidelity. The underlying corpus is
annotated using a hierarchical schema aligned with the Crop Ontology,
encompassing molecular, physiological, biochemical, and agronomic dimensions of
plant adaptation. PlantDeBERTa exhibits strong generalization capabilities
across entity types and demonstrates the feasibility of robust domain
adaptation in low-resource scientific fields. By providing a scalable and
reproducible framework for high-resolution entity recognition, PlantDeBERTa
bridges a critical gap in agricultural NLP and paves the way for intelligent,
data-driven systems in plant genomics, phenomics, and agronomic knowledge
discovery. Our model is publicly released to promote transparency and
accelerate cross-disciplinary innovation in computational plant science.
[LINK]
http://arxiv.org/abs/2506.08897v3
[DATE]
2025-06-23 17:42:53+08:00
[CATEGORIES]
cs.CL
Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
[AUTHORS]
Philipp Mondorf, Sondre Wold, Barbara Plank
[ABSTRACT]
A fundamental question in interpretability research is to what extent neural
networks, particularly language models, implement reusable functions through
subnetworks that can be composed to perform more complex tasks. Recent advances
in mechanistic interpretability have made progress in identifying
circuits, which represent the minimal computational subgraphs
responsible for a model’s behavior on specific tasks. However, most studies
focus on identifying circuits for individual tasks without investigating how
functionally similar circuits relate to each other. To address this
gap, we study the modularity of neural networks by analyzing circuits for
highly compositional subtasks within a transformer-based language model.
Specifically, given a probabilistic context-free grammar, we identify and
compare circuits responsible for ten modular string-edit operations. Our
results indicate that functionally similar circuits exhibit both notable node
overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits
identified can be reused and combined through set operations to represent more
complex functional model capabilities.
[COMMENTS]
ACL 2025 main, 22 pages, 21 figures
[LINK]
http://arxiv.org/abs/2410.01434v3
[DATE]
2025-06-23 17:05:58+08:00
[CATEGORIES]
cs.LG
cs.CL
TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
[AUTHORS]
Ce Li, Xiaofan Liu, Zhiyan Song, Ce Chi, Chen Zhao, Jingjing Yang, Zhendong Wang, Kexin Yang, Boshen Shi, Xing Wang, Chao Deng, Junlan Feng
[ABSTRACT]
The majority of data in businesses and industries is stored in tables,
databases, and data warehouses. Reasoning with table-structured data poses
significant challenges for large language models (LLMs) due to its hidden
semantics, inherent complexity, and structured nature. One of these challenges
is lacking an effective evaluation benchmark fairly reflecting the performances
of LLMs on broad table reasoning abilities. In this paper, we fill in this gap,
presenting a comprehensive table reasoning evolution benchmark, TReB, which
measures both shallow table understanding abilities and deep table reasoning
abilities, a total of 26 sub-tasks. We construct a high quality dataset through
an iterative data processing procedure. We create an evaluation framework to
robustly measure table reasoning capabilities with three distinct inference
modes, TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs
using this framework and prove its effectiveness. Experimental results reveal
that existing LLMs still have significant room for improvement in addressing
the complex, real-world table-related tasks. Both the dataset and evaluation
framework are publicly available, with the dataset hosted on [HuggingFace] and
the framework on [GitHub].
[COMMENTS]
Benchmark report v1.0
[LINK]
http://arxiv.org/abs/2506.18421v1
[DATE]
2025-06-23 17:02:04+08:00
[CATEGORIES]
cs.CL
How Large Language Models play humans in online conversations: a simulated study of the 2016 US politics on Reddit
[AUTHORS]
Daniele Cirulli, Giulio Cimini, Giovanni Palermo
[ABSTRACT]
Large Language Models (LLMs) have recently emerged as powerful tools for
natural language generation, with applications spanning from content creation
to social simulations. Their ability to mimic human interactions raises both
opportunities and concerns, particularly in the context of politically relevant
online discussions. In this study, we evaluate the performance of LLMs in
replicating user-generated content within a real-world, divisive scenario:
Reddit conversations during the 2016 US Presidential election. In particular,
we conduct three different experiments, asking GPT-4 to generate comments by
impersonating either real or artificial partisan users. We analyze the
generated comments in terms of political alignment, sentiment, and linguistic
features, comparing them against real user contributions and benchmarking
against a null model. We find that GPT-4 is able to produce realistic comments,
both in favor of or against the candidate supported by the community, yet
tending to create consensus more easily than dissent. In addition we show that
real and artificial comments are well separated in a semantically embedded
space, although they are indistinguishable by manual inspection. Our findings
provide insights on the potential use of LLMs to sneak into online discussions,
influence political debate and shape political narratives, bearing broader
implications of AI-driven discourse manipulation.
[LINK]
http://arxiv.org/abs/2506.21620v1
[DATE]
2025-06-23 16:54:32+08:00
[CATEGORIES]
cs.CL
Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models
[AUTHORS]
Zeyu Liu, Yuhang Liu, Guanghao Zhu, Congkai Xie, Zhen Li, Jianbo Yuan, Xinyao Wang, Qing Li, Shing-Chi Cheung, Shengyu Zhang, Fei Wu, Hongxia Yang
[ABSTRACT]
Recent advancements in large language models (LLMs) have demonstrated
substantial progress in reasoning capabilities, such as DeepSeek-R1, which
leverages rule-based reinforcement learning to enhance logical reasoning
significantly. However, extending these achievements to multimodal large
language models (MLLMs) presents critical challenges, which are frequently more
pronounced for Multimodal Small Language Models (MSLMs) given their typically
weaker foundational reasoning abilities: (1) the scarcity of high-quality
multimodal reasoning datasets, (2) the degradation of reasoning capabilities
due to the integration of visual processing, and (3) the risk that direct
application of reinforcement learning may produce complex yet incorrect
reasoning processes. To address these challenges, we design a novel framework
Infi-MMR to systematically unlock the reasoning potential of MSLMs through a
curriculum of three carefully structured phases and propose our multimodal
reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning
Activation, leverages high-quality textual reasoning datasets to activate and
strengthen the model’s logical reasoning capabilities. The second phase,
Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to
facilitate the progressive transfer of reasoning skills to multimodal contexts.
The third phase, Multimodal Reasoning Enhancement, employs curated,
caption-free multimodal data to mitigate linguistic biases and promote robust
cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal
math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision
test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on
MathVista testmini). Resources are available at
https://huggingface.co/Reallm-Labs/Infi-MMR-3B.
[LINK]
http://arxiv.org/abs/2505.23091v3
[DATE]
2025-06-23 16:47:25+08:00
[CATEGORIES]
cs.CL
Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
[AUTHORS]
Mostafa Saeed, Nizar Habash
[ABSTRACT]
Lemmatization is crucial for NLP tasks in morphologically rich languages with
ambiguous orthography like Arabic, but existing tools face challenges due to
inconsistent standards and limited genre coverage. This paper introduces two
novel approaches that frame lemmatization as classification into a
Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic
clustering. We also present a new Arabic lemmatization test set covering
diverse genres, standardized alongside existing datasets. We evaluate character
level sequence-to-sequence models, which perform competitively and offer
complementary value, but are limited to lemma prediction (not LPG) and prone to
hallucinating implausible forms. Our results show that classification and
clustering yield more robust, interpretable outputs, setting new benchmarks for
Arabic lemmatization.
[LINK]
http://arxiv.org/abs/2506.18399v1
[DATE]
2025-06-23 16:34:33+08:00
[CATEGORIES]
cs.CL
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
[AUTHORS]
Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu
[ABSTRACT]
Large-scale text-to-speech (TTS) models are typically categorized into
autoregressive and non-autoregressive systems. Although autoregressive systems
exhibit certain advantages in speech naturalness, their token-by-token
generation mechanism makes it difficult to precisely control the duration of
synthesized speech. This is a key limitation in applications such as video
dubbing that require strict audio-visual synchronization. This paper introduces
IndexTTS2, which proposes a novel and autoregressive-model-friendly method for
speech duration control. The method supports two generation modes: one allows
explicit specification of the number of generated tokens for precise duration
control; the other does not require manual input and lets the model freely
generate speech while preserving prosodic characteristics from the input
prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional
expression and speaker identity, enabling independent control of timbre and
emotion. In the zero-shot setting, the model can perfectly reproduce the
emotional characteristics of the input prompt. Users may also provide a
separate emotion prompt, even from a different speaker, allowing the model to
reconstruct the target timbre while conveying the desired emotion. To enhance
clarity during strong emotional expressions, we incorporate GPT latent
representations to improve speech stability. Meanwhile, to lower the barrier
for emotion control, we design a soft instruction mechanism based on textual
descriptions by fine-tuning Qwen3. This enables effective guidance of speech
generation with desired emotional tendencies using natural language input.
Experimental results demonstrate that IndexTTS2 outperforms existing
state-of-the-art zero-shot TTS models in word error rate, speaker similarity,
and emotional fidelity.
[LINK]
http://arxiv.org/abs/2506.21619v1
[DATE]
2025-06-23 16:33:40+08:00
[CATEGORIES]
cs.CL
[AUTHORS]
Zhiyuan Zhang, Xiaosong Jia, Guanyu Chen, Qifeng Li, Junchi Yan
[ABSTRACT]
In this technical report, we introduce TrajTok, a trajectory tokenizer for
discrete next-token-prediction based behavior generation models, which combines
data-driven and rule-based methods with better coverage, symmetry and
robustness, along with a spatial-aware label smoothing method for cross-entropy
loss. We adopt the tokenizer and loss for the SMART model and reach superior
performance with a realism score of 0.7852 on the Waymo Open Sim Agents Challenge
[LINK]
http://arxiv.org/abs/2506.21618v1
[DATE]
2025-06-23 16:32:05+08:00
[CATEGORIES]
cs.CL
SLR: An Automated Synthesis Framework for Scalable Logical Reasoning
[AUTHORS]
Lukas Helff, Ahmad Omar, Felix Friedrich, Wolfgang Stammer, Antonia Wüst, Tim Woydt, Rupert Mitchell, Patrick Schramowski, Kristian Kersting
[ABSTRACT]
We introduce SLR, an end-to-end framework for systematic evaluation and
training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given
a user’s task specification, SLR enables scalable, automated synthesis of
inductive reasoning tasks with precisely controlled difficulty. For each task,
SLR synthesizes (i) a latent ground-truth rule, (ii) an executable validation
program used by a symbolic judge to deterministically verify model outputs, and
(iii) an instruction prompt for the reasoning task. Using SLR, we create
SLR-Bench, a benchmark comprising over 19k prompts spanning 20 curriculum
levels that progressively increase in relational, arithmetic, and recursive
complexity. Large-scale evaluation reveals that contemporary LLMs readily
produce syntactically valid rules, yet often fail at correct logical inference.
Recent reasoning LLMs do somewhat better, but incur substantial increases in
test-time compute, sometimes exceeding 15k completion tokens. Finally,
logic-tuning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity
with Gemini-Flash-Thinking at a fraction of computational cost. SLR is fully
automated, requires no human annotation, ensures dataset novelty, and offers a
scalable environment for probing and advancing LLMs’ reasoning capabilities.
[LINK]
http://arxiv.org/abs/2506.15787v2
[DATE]
2025-06-23 16:27:44+08:00
[CATEGORIES]
cs.CL
cs.LG
Evaluating Causal Explanation in Medical Reports with LLM-Based and Human-Aligned Metrics
[AUTHORS]
Yousang Cho, Key-Sun Choi
[ABSTRACT]
This study investigates how accurately different evaluation metrics capture
the quality of causal explanations in automatically generated diagnostic
reports. We compare six metrics: BERTScore, Cosine Similarity, BioSentVec,
GPT-White, GPT-Black, and expert qualitative assessment across two input types:
observation-based and multiple-choice-based report generation. Two weighting
strategies are applied: one reflecting task-specific priorities, and the other
assigning equal weights to all metrics. Our results show that GPT-Black
demonstrates the strongest discriminative power in identifying logically
coherent and clinically valid causal narratives. GPT-White also aligns well
with expert evaluations, while similarity-based metrics diverge from clinical
reasoning quality. These findings emphasize the impact of metric selection and
weighting on evaluation outcomes, supporting the use of LLM-based evaluation
for tasks requiring interpretability and causal reasoning.
[COMMENTS]
9 pages, presented at LLM4Eval Workshop, SIGIR 2025 Padova, Italy,
July 17, 2025
[LINK]
http://arxiv.org/abs/2506.18387v1
[DATE]
2025-06-23 16:19:21+08:00
[CATEGORIES]
cs.CL
Song Form-aware Full-Song Text-to-Lyrics Generation with Multi-Level Granularity Syllable Count Control
[AUTHORS]
Yunkee Chae, Eunsik Shin, Suntae Hwang, Seungryeol Paik, Kyogu Lee
[ABSTRACT]
Lyrics generation presents unique challenges, particularly in achieving
precise syllable control while adhering to song form structures such as verses
and choruses. Conventional line-by-line approaches often lead to unnatural
phrasing, underscoring the need for more granular syllable management. We
propose a framework for lyrics generation that enables multi-level syllable
control at the word, phrase, line, and paragraph levels, aware of song form.
Our approach generates complete lyrics conditioned on input text and song form,
ensuring alignment with specified syllable constraints. Generated lyrics
samples are available at: https://tinyurl.com/lyrics9999
[COMMENTS]
Accepted to Interspeech 2025
[LINK]
http://arxiv.org/abs/2411.13100v3
[DATE]
2025-06-23 16:18:25+08:00
[CATEGORIES]
cs.CL
A Rigorous Evaluation of LLM Data Generation Strategies for Low-Resource Languages
[AUTHORS]
Tatiana Anikina, Jan Cegin, Jakub Simko, Simon Ostermann
[ABSTRACT]
Large Language Models (LLMs) are increasingly used to generate synthetic
textual data for training smaller specialized models. However, a comparison of
various generation strategies for low-resource language settings is lacking.
While various prompting strategies have been proposed, such as demonstrations,
label-based summaries, and self-revision, their comparative effectiveness
remains unclear, especially for low-resource languages. In this paper, we
systematically evaluate the performance of these generation strategies and
their combinations across 11 typologically diverse languages, including several
extremely low-resource ones. Using three NLP tasks and four open-source LLMs,
we assess downstream model performance on generated versus gold-standard data.
Our results show that strategic combinations of generation methods,
particularly target-language demonstrations with LLM-based revisions, yield
strong performance, narrowing the gap with real data to as little as 5% in some
settings. We also find that smart prompting techniques can reduce the advantage
of larger LLMs, highlighting efficient generation strategies for synthetic data
generation in low-resource scenarios with smaller models.
[COMMENTS]
21 pages, fixed typo
[LINK]
http://arxiv.org/abs/2506.12158v2
[DATE]
2025-06-23 15:52:34+08:00
[CATEGORIES]
cs.CL
Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations
[AUTHORS]
Hichem Ammar Khodja, Frédéric Béchet, Quentin Brabant, Alexis Nasr, Gwénolé Lecorvé
[COMMENTS]
preprint v6, accepted for publication at ACL 2025 - L2M2 Workshop
[LINK]
http://arxiv.org/abs/2502.01220v6
[DATE]
2025-06-23 15:49:40+08:00
[CATEGORIES]
cs.CL
cs.LG
RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming
[AUTHORS]
Hao Wang, Jindong Han, Wei Fan, Leilei Sun, Hao Liu
[ABSTRACT]
Spatio-temporal forecasting is pivotal in numerous real-world applications,
including transportation planning, energy management, and climate monitoring.
In this work, we aim to harness the reasoning and generalization abilities of
Pre-trained Language Models (PLMs) for more effective spatio-temporal
forecasting, particularly in data-scarce scenarios. However, recent studies
uncover that PLMs, which are primarily trained on textual data, often falter
when tasked with modeling the intricate correlations in numerical time series,
thereby limiting their effectiveness in comprehending spatio-temporal data. To
bridge the gap, we propose RePST, a semantic-oriented PLM reprogramming
framework tailored for spatio-temporal forecasting. Specifically, we first
propose a semantic-oriented decomposer that adaptively disentangles spatially
correlated time series into interpretable sub-components, which facilitates PLM
to understand sophisticated spatio-temporal dynamics via a divide-and-conquer
strategy. Moreover, we propose a selective discrete reprogramming scheme, which
introduces an expanded spatio-temporal vocabulary space to project
spatio-temporal series into discrete representations. This scheme minimizes the
information loss during reprogramming and enriches the representations derived
by PLMs. Extensive experiments on real-world datasets show that the proposed
RePST outperforms twelve state-of-the-art baseline methods, particularly in
data-scarce scenarios, highlighting the effectiveness and superior
generalization capabilities of PLMs for spatio-temporal forecasting. Our codes
can be found at https://github.com/usail-hkust/REPST.
[LINK]
http://arxiv.org/abs/2408.14505v3
[DATE]
2025-06-23 15:42:58+08:00
[CATEGORIES]
cs.LG
cs.CL
SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
[AUTHORS]
Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, Tuo Zhao
[ABSTRACT]
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm
for scaling large language models (LLMs) while maintaining inference
efficiency. However, their enormous memory requirements make them prohibitively
expensive to fine-tune or deploy in resource-constrained environments. To
address this challenge, we introduce SlimMoE, a multi-stage compression
framework for transforming large MoE models into much smaller, efficient
variants without incurring the prohibitive costs of training from scratch. Our
method systematically reduces parameter counts by slimming experts and
transferring knowledge through intermediate stages, effectively mitigating the
performance degradation common in one-shot pruning approaches. Using this
framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to
create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE
(3.8B total/1.1B activated parameters) using only 400B tokens–less than 10% of
the original model’s training data. These compressed models can be fine-tuned
on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them
highly suitable for academic and resource-limited settings. Our experiments
demonstrate that these compressed models outperform others of similar size and
remain competitive with larger models. For instance, Phi-mini-MoE achieves
performance similar to or better than Phi-3-mini using only 2/3 of the activated
parameters and yields comparable MMLU scores to Llama 3.1 8B despite having
significantly lower latency. Our findings demonstrate that structured pruning
combined with staged distillation offers an effective path to creating
high-quality, compact MoE models, paving the way for broader adoption of MoE
architectures. We make our models publicly available at
https://huggingface.co/microsoft/Phi-mini-MoE-instruct and
https://huggingface.co/microsoft/Phi-tiny-MoE-instruct .
[LINK]
http://arxiv.org/abs/2506.18349v1
[DATE]
2025-06-23 15:15:59+08:00
[CATEGORIES]
cs.LG
cs.CL
Systematic Reward Gap Optimization for Mitigating VLM Hallucinations
[AUTHORS]
Lehan He, Zeren Chen, Zhelun Shi, Tianyu Yu, Jing Shao, Lu Sheng
[ABSTRACT]
The success of Direct Preference Optimization (DPO) in mitigating
hallucinations in Vision Language Models (VLMs) critically hinges on the true
reward gaps within preference pairs. However, current methods, typically
relying on ranking or rewriting strategies, often struggle to optimize these
reward gaps in a systematic way during data curation. A core difficulty lies in
precisely characterizing and strategically manipulating the overall reward gap
configuration, that is, the deliberate design of how to shape these reward gaps
within each preference pair across the data. To address this, we introduce
Topic-level Preference Rewriting (TPR), a novel framework designed for the
systematic optimization of reward gap configuration. By selectively
replacing semantic topics within VLM responses with the model’s own resampled
candidates for targeted rewriting, TPR can provide topic-level control over
fine-grained semantic details. This precise control enables advanced data
curation strategies, such as progressively adjusting the difficulty of rejected
responses, thereby sculpting an effective reward gap configuration that guides
the model to overcome challenging hallucinations. Comprehensive experiments
demonstrate TPR achieves state-of-the-art performance on multiple hallucination
benchmarks, outperforming previous methods by an average of 20%. Notably, it
significantly reduces hallucinations by up to 93% on ObjectHal-Bench, and also
exhibits superior data efficiency towards robust and cost-effective VLM
alignment.
[LINK]
http://arxiv.org/abs/2411.17265v3
[DATE]
2025-06-23 15:12:17+08:00
[CATEGORIES]
cs.CL
Less Data Less Tokens: Multilingual Unification Learning for Efficient Test-Time Reasoning in LLMs
[AUTHORS]
Kang Chen, Mengdi Zhang, Yixin Cao
[ABSTRACT]
This paper explores the challenges of test-time scaling of large language
models (LLMs), regarding both the data and inference efficiency. We highlight
the diversity of multi-lingual reasoning based on our pilot studies, and then
introduce a novel approach, $L^2$ multi-lingual unification learning with a
decoding intervention strategy for further investigation. The basic idea of
$L^2$ is that the reasoning process varies across different languages, which
may be mutually beneficial to enhance both model performance and efficiency.
Specifically, there are two types of multi-lingual data: the entire long
chain-of-thought annotations in different languages and the step-wise mixture
of languages. By further tuning based on them, we show that even small amounts
of data can significantly improve reasoning capabilities. Our findings suggest
that multilingual learning reduces both the required data and the number of
inference tokens while maintaining a comparable performance. Furthermore,
$L^2$ is orthogonal to other data-efficient methods. Thus, we also emphasize
the importance of diverse data selection. The $L^2$ method offers a promising
solution to the challenges of data collection and test-time compute efficiency
in LLMs.
[LINK]
http://arxiv.org/abs/2506.18341v1
[DATE]
2025-06-23 14:47:28+08:00
[CATEGORIES]
cs.CL
Position is Power: System Prompts as a Mechanism of Bias in Large Language Models (LLMs)
[AUTHORS]
Anna Neumann, Elisabeth Kirsten, Muhammad Bilal Zafar, Jatinder Singh
[ABSTRACT]
System prompts in Large Language Models (LLMs) are predefined directives that
guide model behaviour, taking precedence over user inputs in text processing
and generation. LLM deployers increasingly use them to ensure consistent
responses across contexts. While model providers set a foundation of system
prompts, deployers and third-party developers can append additional prompts
without visibility into others’ additions, and this layered implementation
remains entirely hidden from end-users. As system prompts become more complex,
they can directly or indirectly introduce unaccounted-for side effects. This
lack of transparency raises fundamental questions about how the position of
information in different directives shapes model outputs. As such, this work
examines how the placement of information affects model behaviour. To this end,
we compare how models process demographic information in system versus user
prompts across six commercially available LLMs and 50 demographic groups. Our
analysis reveals significant biases, manifesting in differences in user
representation and decision-making scenarios. Since these variations stem from
inaccessible and opaque system-level configurations, they risk
representational, allocative and potential other biases and downstream harms
beyond the user’s ability to detect or correct. Our findings draw attention to
these critical issues, which have the potential to perpetuate harms if left
unexamined. Further, we argue that system prompt analysis must be incorporated
into AI auditing processes, particularly as customisable system prompts become
increasingly prevalent in commercial AI deployments.
[COMMENTS]
Published in Proceedings of ACM FAccT 2025. Update comment: fixed the
error where user vs. system and implicit vs. explicit labels in the heatmaps
were switched. The takeaways remain the same.
[LINK]
http://arxiv.org/abs/2505.21091v3
[DATE]
2025-06-23 14:43:45+08:00
[CATEGORIES]
cs.CL
TranslationCorrect: A Unified Framework for Machine Translation Post-Editing with Predictive Error Assistance
[AUTHORS]
Syed Mekael Wasti, Shou-Yi Hung, Christopher Collins, En-Shiun Annie Lee
[ABSTRACT]
Machine translation (MT) post-editing and research data collection often rely
on inefficient, disconnected workflows. We introduce TranslationCorrect, an
integrated framework designed to streamline these tasks. TranslationCorrect
combines MT generation using models like NLLB, automated error prediction using
models like XCOMET or LLM APIs (providing detailed reasoning), and an intuitive
post-editing interface within a single environment. It is built with
human-computer interaction (HCI) principles in mind to minimize cognitive load,
as confirmed by a user study. It enables translators to correct errors and batch
translate efficiently. For researchers, TranslationCorrect exports high-quality
span-based annotations in the Error Span Annotation (ESA) format, using an
error taxonomy inspired by Multidimensional Quality Metrics (MQM). These
outputs are compatible with state-of-the-art error detection models and
suitable for training MT or post-editing systems. Our user study confirms that
TranslationCorrect significantly improves translation efficiency and user
satisfaction over traditional annotation methods.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2506.18337v1
[DATE]
2025-06-23 14:38:49+08:00
[CATEGORIES]
cs.CL
HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
[AUTHORS]
Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue
[ABSTRACT]
The integration of additional modalities increases the susceptibility of
large vision-language models (LVLMs) to safety risks, such as jailbreak
attacks, compared to their language-only counterparts. While existing research
primarily focuses on post-hoc alignment techniques, the underlying safety
mechanisms within LVLMs remain largely unexplored. In this work, we
investigate whether LVLMs inherently encode safety-relevant signals within
their internal activations during inference. Our findings reveal that LVLMs
exhibit distinct activation patterns when processing unsafe prompts, which can
be leveraged to detect and mitigate adversarial inputs without requiring
extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a
novel tuning-free framework that harnesses internal model activations to
enhance safety. Experimental results show that HiddenDetect surpasses
state-of-the-art methods in detecting jailbreak attacks against LVLMs. By
utilizing intrinsic safety-aware patterns, our method provides an efficient and
scalable solution for strengthening LVLM robustness against multimodal threats.
Our code will be released publicly at
https://github.com/leigest519/HiddenDetect.
[COMMENTS]
Accepted by ACL 2025 (Main)
[LINK]
http://arxiv.org/abs/2502.14744v4
[DATE]
2025-06-23 14:11:32+08:00
[CATEGORIES]
cs.CL
Enhancing Entity Aware Machine Translation with Multi-task Learning
[AUTHORS]
An Trieu, Phuong Nguyen, Minh Le Nguyen
[ABSTRACT]
Entity-aware machine translation (EAMT) is a complicated task in natural
language processing, owing both to the shortage of translation data for the
entities to be translated and to the complexity of the context that must be
processed when translating those entities. In this paper, we propose a method
that applies multi-task learning to optimize the performance of two
subtasks, named entity recognition and machine translation, which improves the
final performance of the entity-aware machine translation task. Results and
analysis are reported on the dataset provided by the organizer of Task 2 of
the SemEval 2025 competition.
[COMMENTS]
In the Proceedings of SCIDOCA 2025
[LINK]
http://arxiv.org/abs/2506.18318v1
[DATE]
2025-06-23 14:05:46+08:00
[CATEGORIES]
cs.CL
Enhancing Document Retrieval in COVID-19 Research: Leveraging Large Language Models for Hidden Relation Extraction
[AUTHORS]
Hoang-An Trieu, Dinh-Truong Do, Chau Nguyen, Vu Tran, Minh Le Nguyen
[ABSTRACT]
In recent years, the COVID-19 pandemic has prompted numerous publications
on the disease. Because of the massive volume of publications, an efficient
retrieval system is necessary to provide researchers with useful information
when a pandemic such as COVID-19 emerges suddenly. In this work, we present a
method that helps a retrieval system, the Covrelex-SE system, provide
higher-quality search results. We exploit the power of large language models
(LLMs) to extract hidden relationships in unlabeled publications that cannot be
found by the parsing tools the system currently uses. This gives the system
more useful information during retrieval.
[COMMENTS]
In the Proceedings of SCIDOCA 2024
[LINK]
http://arxiv.org/abs/2506.18311v1
[DATE]
2025-06-23 13:55:53+08:00
[CATEGORIES]
cs.CL
PlanGenLLMs: A Modern Survey of LLM Planning Capabilities
[AUTHORS]
Hui Wei, Zihao Zhang, Shenghua He, Tian Xia, Shijia Pan, Fei Liu
[COMMENTS]
Accepted by ACL 2025
[LINK]
http://arxiv.org/abs/2502.11221v3
[DATE]
2025-06-23 13:32:12+08:00
[CATEGORIES]
cs.CL
AlzheimerRAG: Multimodal Retrieval Augmented Generation for Clinical Use Cases using PubMed articles
[AUTHORS]
Aritra Kumar Lahiri, Qinmin Vivian Hu
[ABSTRACT]
Recent advancements in generative AI have fostered the development of highly
adept Large Language Models (LLMs) that integrate diverse data types to empower
decision-making. Among these, multimodal retrieval-augmented generation (RAG)
applications are promising because they combine the strengths of information
retrieval and generative models, enhancing their utility across various
domains, including clinical use cases. This paper introduces AlzheimerRAG, a
Multimodal RAG application for clinical use cases, primarily focusing on
Alzheimer’s Disease case studies from PubMed articles. This application
incorporates cross-modal attention fusion techniques to integrate textual and
visual data processing by efficiently indexing and accessing vast amounts of
biomedical literature. Our experimental results, compared to benchmarks such as
BioASQ and PubMedQA, have yielded improved performance in the retrieval and
synthesis of domain-specific information. We also present a case study using
our multimodal RAG in various Alzheimer’s clinical scenarios. We infer that
AlzheimerRAG can generate responses with accuracy non-inferior to humans and
with low rates of hallucination.
[LINK]
http://arxiv.org/abs/2412.16701v2
[DATE]
2025-06-23 13:28:42+08:00
[CATEGORIES]
cs.CL
LoRA vs Full Fine-tuning: An Illusion of Equivalence
[AUTHORS]
Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma
[ABSTRACT]
Fine-tuning is a crucial paradigm for adapting pre-trained large language
models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA)
have been shown to effectively fine-tune LLMs with an extreme reduction in
trainable parameters. But are their learned solutions really
equivalent? We study how LoRA and full fine-tuning change pre-trained models by
analyzing the model’s weight matrices through the lens of their spectral
properties. We find that LoRA and full fine-tuning yield weight matrices whose
singular value decompositions exhibit very different structure: weight matrices
trained with LoRA have new, high-ranking singular vectors, which we call
intruder dimensions, while those trained with full fine-tuning do not.
Further, we extend the finding that LoRA forgets less than full fine-tuning and
find its forgetting is vastly localized to the intruder dimensions: by
causally intervening on the intruder dimensions by changing their associated
singular values post-fine-tuning, we show that they cause forgetting. Moreover,
scaling them down significantly improves modeling of the pre-training
distribution with a minimal drop in downstream task performance. Given this, we
should expect accumulating intruder dimensions to be harmful and lead to more
forgetting. This will be amplified during continual learning because of
sequential fine-tuning, and we show that LoRA models do accumulate intruder
dimensions here and tend to perform worse in this setting, emphasizing the
practicality of our findings.
[LINK]
http://arxiv.org/abs/2410.21228v2
[DATE]
2025-06-23 12:59:01+08:00
[CATEGORIES]
cs.LG
cs.CL
FutureFill: Fast Generation from Convolutional Sequence Models
[AUTHORS]
Naman Agarwal, Xinyi Chen, Evan Dogariu, Devan Shah, Hubert Strauss, Vlad Feinberg, Daniel Suo, Peter Bartlett, Elad Hazan
[ABSTRACT]
We address the challenge of efficient auto-regressive generation in sequence
prediction models by introducing FutureFill, a general-purpose fast generation
method for any sequence prediction algorithm based on convolutional operators.
FutureFill reduces generation time from quadratic to quasilinear in the context
length. Moreover, when generating from a prompt, it requires a prefill cache
whose size grows only with the number of tokens to be generated, often much
smaller than the caches required by standard convolutional or attention-based
models. We validate our theoretical claims with experiments on synthetic tasks
and demonstrate substantial efficiency gains when generating from a deep
convolutional sequence prediction model.
[LINK]
http://arxiv.org/abs/2410.03766v3
[DATE]
2025-06-23 11:20:46+08:00
[CATEGORIES]
cs.LG
cs.CL
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining
[AUTHORS]
Hongyuan Dong, Dingkang Yang, Xiao Liang, Chao Feng, Jiao Ran
[ABSTRACT]
Learning rate is widely regarded as crucial for effective foundation model
pretraining. Recent research explores and demonstrates the transferability of
learning rate configurations across varying model and dataset sizes, etc.
Nevertheless, these approaches are constrained to specific training scenarios
and typically necessitate extensive hyperparameter tuning on proxy models. In
this work, we propose AdaLRS, a plug-and-play adaptive learning
rate search algorithm that conducts online optimal learning rate search via
optimizing loss descent velocities. We provide experimental results to show that
the optimization of training loss and loss descent velocity in foundation model
pretraining are both convex and share the same optimal learning rate. Relying
solely on training loss dynamics, AdaLRS involves few extra computations to
guide the search process, and its convergence is guaranteed via theoretical
analysis. Experiments on both LLM and VLM pretraining show that AdaLRS adjusts
suboptimal learning rates to the neighborhood of optimum with marked efficiency
and effectiveness, with model performance improved accordingly. We also show
the robust generalizability of AdaLRS across varying training scenarios, such
as different model sizes, training paradigms, and base learning rate scheduler
choices.
[LINK]
http://arxiv.org/abs/2506.13274v2
[DATE]
2025-06-23 11:18:17+08:00
[CATEGORIES]
cs.LG
cs.CL
RAPID: Long-Context Inference with Retrieval-Augmented Speculative Decoding
[AUTHORS]
Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh
[ABSTRACT]
The emergence of long-context large language models (LLMs) offers a promising
alternative to traditional retrieval-augmented generation (RAG) for processing
extensive documents. However, the computational overhead of long-context
inference presents significant efficiency challenges. While Speculative
Decoding (SD) traditionally accelerates inference using smaller draft models,
its effectiveness diminishes substantially in long-context scenarios due to
memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative
Decoding (RAPID), which leverages RAG for both accelerating and enhancing
generation quality in long-context inference. RAPID introduces the RAG
drafter (a draft LLM operating on shortened retrieval contexts) to speculate on
the generation of long-context target LLMs. Our approach enables a new paradigm
where same-scale or even larger LLMs can serve as RAG drafters while
maintaining computational efficiency. To fully leverage the potentially
superior capabilities from stronger RAG drafters, we develop an inference-time
knowledge transfer that enriches the target distribution by RAG. Extensive
experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID
effectively integrates the strengths of both RAG and long-context LLMs,
achieving significant performance improvements (e.g., from 39.33 to 42.83 on
InfiniteBench for LLaMA-3.1-8B) with more than 2x speedups for long-context
inference. Our analyses also reveal the robustness of RAPID across various
context lengths and retrieval quality.
[COMMENTS]
ICML 2025 Spotlight
[LINK]
http://arxiv.org/abs/2502.20330v2
[DATE]
2025-06-23 11:05:26+08:00
[CATEGORIES]
cs.CL
Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework
[AUTHORS]
Yunpu Zhao, Rui Zhang, Junbin Xiao, Changxin Ke, Ruibo Hou, Yifan Hao, Ling Li
[ABSTRACT]
Large Vision-Language Models (LVLMs) have shown significant capability in
vision-language understanding. However, one critical issue that persists in
these models is sycophancy, where models are unduly influenced by leading or
deceptive prompts, resulting in biased outputs and hallucinations. Despite the
rapid development of LVLMs, evaluating and mitigating sycophancy remains
largely under-explored. In this work, we fill this gap by systematically
analyzing sycophancy across multiple vision-language benchmarks and proposing an
inference-time mitigation framework. We curate leading queries and quantify the
susceptibility of state-of-the-art LVLMs to prompt-induced bias, revealing
consistent performance degradation and instability across models and tasks. Our
analysis further uncovers model-specific behavioral traits, such as sentiment
sensitivity and prediction polarity shifts under sycophancy. To mitigate these
issues, we propose a training-free, model-agnostic framework that operates
entirely at inference time. Our approach first employs a query neutralizer,
leveraging a language model to suppress implicit sycophantic bias in user
queries. We then introduce a sycophancy-aware contrastive decoding mechanism
that dynamically recalibrates token-level output distributions by contrasting
responses to neutralized and leading queries. Finally, an adaptive logits
refinement module further modifies the contrasted logits by integrating both an
adaptive plausibility filter and a query sentiment scaler, ensuring coherent and
robust generation. Extensive experiments demonstrate that this framework
effectively mitigates sycophancy across all evaluated models, while maintaining
performance on neutral prompts. Our results suggest that sycophancy in LVLMs is
a general and urgent challenge, and that inference-time strategies offer a
promising path toward trustworthy multimodal reasoning.
[LINK]
http://arxiv.org/abs/2408.11261v2
[DATE]
2025-06-23 11:00:38+08:00
[CATEGORIES]
cs.CL
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
[AUTHORS]
Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
[ABSTRACT]
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE)
architecture that introduces sequential expert communication within each layer.
Unlike traditional MoE models, where experts operate independently in parallel,
CoE processes tokens iteratively across a chain of experts inside a layer. To
support dynamic expert selection across iterations, CoE employs a dedicated
router at each iteration step within a layer. This design allows tokens to
re-evaluate and select different experts during each iteration, rather than
being statically assigned. As a result, CoE introduces a flexible routing
mechanism that increases the diversity of expert combinations and enriches the
model’s representational capacity. CoE demonstrates improved performance under
fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to
1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling
axis: depth through expert iteration, which complements conventional
width/depth scaling. For example, using 2x iterations matches the performance
of 3x expert selections (in width), while reducing memory usage by 17.6-42%
relative to other scaling strategies. Our analysis reveals that CoE’s benefits
stem from its iterative residual structure and enhanced expert specialization
empowered by iterative routing, which together unlock more expressive
representations. Code is available at https://github.com/ZihanWang314/coe.
[LINK]
http://arxiv.org/abs/2506.18945v1
[DATE]
2025-06-23 10:15:43+08:00
[CATEGORIES]
cs.LG
cs.CL
From RAG to Agentic: Validating Islamic-Medicine Responses with LLM Agents
[AUTHORS]
Mohammad Amaan Sayeed, Mohammed Talha Alam, Raza Imam, Shahab Saquib Sohail, Amir Hussain
[ABSTRACT]
Centuries-old Islamic medical texts like Avicenna’s Canon of Medicine and the
Prophetic Tibb-e-Nabawi encode a wealth of preventive care, nutrition, and
holistic therapies, yet remain inaccessible to many and underutilized in modern
AI systems. Existing language-model benchmarks focus narrowly on factual recall
or user preference, leaving a gap in validating culturally grounded medical
guidance at scale. We propose a unified evaluation pipeline, Tibbe-AG, that
aligns 30 carefully curated Prophetic-medicine questions with human-verified
remedies and compares three LLMs (LLaMA-3, Mistral-7B, Qwen2-7B) under three
configurations: direct generation, retrieval-augmented generation, and a
scientific self-critique filter. Each answer is then assessed by a secondary
LLM serving as an agentic judge, yielding a single 3C3H quality score.
Retrieval improves factual accuracy by 13%, while the agentic prompt adds
another 10% improvement through deeper mechanistic insight and safety
considerations. Our results demonstrate that blending classical Islamic texts
with retrieval and self-evaluation enables reliable, culturally sensitive
medical question-answering.
[COMMENTS]
Published at the 4th Muslims in Machine Learning (MusIML) Workshop
(ICML-25)
[LINK]
http://arxiv.org/abs/2506.15911v2
[DATE]
2025-06-23 10:12:38+08:00
[CATEGORIES]
cs.CL
AdapThink: Adaptive Thinking Preferences for Reasoning Language Model
[AUTHORS]
Xu Wan, Wei Wang, Wenyue Xu, Wotao Yin, Jie Song, Mingyang Sun
[ABSTRACT]
Reinforcement Learning (RL)-based post-training has significantly advanced
the complex reasoning capabilities of language models, fostering sophisticated
self-reflection processes. However, this “slow thinking” paradigm presents a
critical challenge to reasoning efficiency: models may expend excessive
computation on simple questions and shift reasoning prematurely for complex
ones. Previous mechanisms typically rely on static length budgets or predefined
rules, lacking the adaptability for varying question complexities and models’
evolving capabilities. To this end, we propose AdapThink, an adaptive
post-training framework designed to induce more efficient thinking while
maintaining the performance of reasoning language models. Specifically,
AdapThink incorporates two key mechanisms: 1) A group-relative reward function
that leverages model confidence and response characteristics to dynamically
adjust the preference of reflection-related transition words without resorting
to a fixed length preference. 2) A diversity-aware sampling mechanism that
balances the training group’s solution accuracy with reasoning diversity via an
entropy-guided score. Experiments on several mathematical reasoning datasets
with DeepSeek-distilled models demonstrate AdapThink’s advantages in enabling
adaptive reasoning patterns and mitigating the inefficiencies.
[LINK]
http://arxiv.org/abs/2506.18237v1
[DATE]
2025-06-23 10:06:04+08:00
[CATEGORIES]
cs.LG
cs.CL
NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts
[AUTHORS]
Abhay Gupta, Michael Lu, Kevin Zhu, Sean O’Brien, Vasu Sharma
[ABSTRACT]
Current large language models (LLMs) struggle to answer questions that span
tens of thousands of tokens, especially when multi-hop reasoning is involved.
While prior benchmarks explore long-context comprehension or multi-hop
reasoning in isolation, none jointly vary context length and reasoning depth in
natural narrative settings. We introduce NovelHopQA, the first benchmark to
evaluate 1-4 hop QA over 64k-128k-token excerpts from 83 full-length
public-domain novels. A keyword-guided pipeline builds hop-separated chains
grounded in coherent storylines. We evaluate seven state-of-the-art models and
apply oracle-context filtering to ensure all questions are genuinely
answerable. Human annotators validate both alignment and hop depth. We
additionally present retrieval-augmented generation (RAG) evaluations to test
model performance when only selected passages are provided instead of the full
context. We observe consistent accuracy drops as hop count and context
length increase, even for frontier models, revealing that sheer scale does not
guarantee robust reasoning. Failure-mode analysis highlights common breakdowns
such as missed final-hop integration and long-range drift. NovelHopQA offers a
controlled diagnostic setting to test multi-hop reasoning at scale. All code
and datasets are available at https://novelhopqa.github.io.
[LINK]
http://arxiv.org/abs/2506.02000v2
[DATE]
2025-06-23 09:41:05+08:00
[CATEGORIES]
cs.CL
Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models
[AUTHORS]
Bonaventure F. P. Dossou
[COMMENTS]
Accepted at ACL SRW 2025
[LINK]
http://arxiv.org/abs/2306.02105v7
[DATE]
2025-06-23 08:16:54+08:00
[CATEGORIES]
cs.CL
Shift Happens: Mixture of Experts based Continual Adaptation in Federated Learning
[AUTHORS]
Rahul Atul Bhope, K. R. Jayaram, Praveen Venkateswaran, Nalini Venkatasubramanian
[ABSTRACT]
Federated Learning (FL) enables collaborative model training across
decentralized clients without sharing raw data, yet faces significant
challenges in real-world settings where client data distributions evolve
dynamically over time. This paper tackles the critical problem of covariate and
label shifts in streaming FL environments, where non-stationary data
distributions degrade model performance and require adaptive middleware
solutions. We introduce ShiftEx, a shift-aware mixture of experts framework
that dynamically creates and trains specialized global models in response to
detected distribution shifts using Maximum Mean Discrepancy for covariate
shifts. The framework employs a latent memory mechanism for expert reuse and
implements facility location-based optimization to jointly minimize covariate
mismatch, expert creation costs, and label imbalance. Through theoretical
analysis and comprehensive experiments on benchmark datasets, we demonstrate
5.5-12.9 percentage point accuracy improvements and 22-95% faster adaptation
compared to state-of-the-art FL baselines across diverse shift scenarios. The
proposed approach offers a scalable, privacy-preserving middleware solution for
FL systems operating in non-stationary, real-world conditions while minimizing
communication and computational overhead.
[LINK]
http://arxiv.org/abs/2506.18789v1
[DATE]
2025-06-23 23:59:21+08:00
[CATEGORIES]
cs.LG
A generalized neural tangent kernel for surrogate gradient learning
[AUTHORS]
Luke Eilers, Raoul-Martin Memmesheimer, Sven Goedeke
[ABSTRACT]
State-of-the-art neural network training methods depend on the gradient of
the network function. Therefore, they cannot be applied to networks whose
activation functions do not have useful derivatives, such as binary and
discrete-time spiking neural networks. To overcome this problem, the activation
function’s derivative is commonly substituted with a surrogate derivative,
giving rise to surrogate gradient learning (SGL). This method works well in
practice but lacks theoretical foundation. The neural tangent kernel (NTK) has
proven successful in the analysis of gradient descent. Here, we provide a
generalization of the NTK, which we call the surrogate gradient NTK, that
enables the analysis of SGL. First, we study a naive extension of the NTK to
activation functions with jumps, demonstrating that gradient descent for such
activation functions is also ill-posed in the infinite-width limit. To address
this problem, we generalize the NTK to gradient descent with surrogate
derivatives, i.e., SGL. We carefully define this generalization and expand the
existing key theorems on the NTK with mathematical rigor. Further, we
illustrate our findings with numerical experiments. Finally, we numerically
compare SGL in networks with sign activation function and finite width to
kernel regression with the surrogate gradient NTK; the results confirm that the
surrogate gradient NTK provides a good characterization of SGL.
[COMMENTS]
53 pages, 3 figures + 4 supplementary figures
[LINK]
http://arxiv.org/abs/2405.15539v2
[DATE]
2025-06-23 23:54:50+08:00
[CATEGORIES]
cs.LG
Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems
[AUTHORS]
Mikołaj Małkiński, Szymon Pawlonka, Jacek Mańdziuk
[COMMENTS]
Accepted to The Forty-Second International Conference on Machine
Learning (ICML 2025)
[LINK]
http://arxiv.org/abs/2411.01173v2
[DATE]
2025-06-23 23:53:53+08:00
[CATEGORIES]
cs.LG
The Impact of Input Order Bias on Large Language Models for Software Fault Localization
[AUTHORS]
Md Nakhla Rafi, Dong Jae Kim, Tse-Hsun Chen, Shaowei Wang
[ABSTRACT]
Large Language Models (LLMs) have shown significant potential in software
engineering tasks such as Fault Localization (FL) and Automatic Program Repair
(APR). This study investigates how input order and context size influence LLM
performance in FL, a crucial step for many downstream software engineering
tasks. We evaluate different method orderings using Kendall Tau distances,
including “perfect” (where ground truths appear first) and “worst” (where
ground truths appear last), across two benchmarks containing Java and Python
projects. Our results reveal a strong order bias: in Java projects, Top-1 FL
accuracy drops from 57% to 20% when reversing the order, while in Python
projects, it decreases from 38% to approximately 3%. However, segmenting inputs
into smaller contexts mitigates this bias, reducing the performance gap in FL
from 22% and 6% to just 1% across both benchmarks. We replaced method names
with semantically meaningful alternatives to determine whether this bias is due
to data leakage. The observed trends remained consistent, suggesting that the
bias is not caused by memorization from training data but rather by the
inherent effect of input order. Additionally, we explored ordering methods
based on traditional FL techniques and metrics, finding that DepGraph’s ranking
achieves 48% Top-1 accuracy, outperforming simpler approaches such as
CallGraph(DFS). These findings highlight the importance of structuring inputs,
managing context effectively, and selecting appropriate ordering strategies to
enhance LLM performance in FL and other software engineering applications.
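
For readers unfamiliar with the Kendall Tau distance used to compare method orderings, here is a small sketch. It counts the pairs of items whose relative order differs between two lists; the method names are hypothetical, and the paper may normalize or compute the distance differently.

```python
from itertools import combinations

def kendall_tau_distance(order_a, order_b):
    """Count item pairs that appear in opposite relative order in the two lists."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    discordant = 0
    for x, y in combinations(order_a, 2):
        # x precedes y in order_a; the pair is discordant if y precedes x in order_b.
        if pos_b[x] > pos_b[y]:
            discordant += 1
    return discordant

methods = ["parse", "validate", "save", "log"]   # hypothetical method names
reversed_methods = list(reversed(methods))
same = kendall_tau_distance(methods, methods)
worst = kendall_tau_distance(methods, reversed_methods)
```

A fully reversed list of n items gives the maximum distance, n(n-1)/2, which corresponds to the "worst" ordering when the reference is the "perfect" one.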
[LINK]
http://arxiv.org/abs/2412.18750v3
[DATE]
2025-06-23 23:51:16+08:00
[CATEGORIES]
cs.LG
Fast Bayesian Optimization of Function Networks with Partial Evaluations
[AUTHORS]
Poompol Buathong, Peter I. Frazier
[ABSTRACT]
Bayesian optimization of function networks (BOFN) is a framework for
optimizing expensive-to-evaluate objective functions structured as networks,
where some nodes’ outputs serve as inputs for others. Many real-world
applications, such as manufacturing and drug discovery, involve function
networks with additional properties - nodes that can be evaluated independently
and incur varying costs. A recent BOFN variant, p-KGFN, leverages this
structure and enables cost-aware partial evaluations, selectively querying only
a subset of nodes at each iteration. p-KGFN reduces the number of expensive
objective function evaluations needed but has a large computational overhead:
choosing where to evaluate requires optimizing a nested Monte Carlo-based
acquisition function for each node in the network. To address this, we propose
an accelerated p-KGFN algorithm that reduces computational overhead with only a
modest loss in query efficiency. Key to our approach is generation of
node-specific candidate inputs for each node in the network via one inexpensive
global Monte Carlo simulation. Numerical experiments show that our method
maintains competitive query efficiency while achieving up to a 16x speedup over
the original p-KGFN algorithm.
[COMMENTS]
16 pages, 8 figures, 1 table
[LINK]
http://arxiv.org/abs/2506.11456v2
[DATE]
2025-06-23 23:42:55+08:00
[CATEGORIES]
cs.LG
DPG loss functions for learning parameter-to-solution maps by neural networks
[AUTHORS]
Pablo Cortés Castillo, Wolfgang Dahmen, Jay Gopalakrishnan
[ABSTRACT]
We develop, analyze, and experimentally explore residual-based loss functions
for machine learning of parameter-to-solution maps in the context of
parameter-dependent families of partial differential equations (PDEs). Our
primary concern is rigorous accuracy certification to enhance prediction
capability of resulting deep neural network reduced models. This is achieved by
the use of variationally correct loss functions. Through one specific example
of an elliptic PDE, details for establishing the variational correctness of a
loss function from an ultraweak Discontinuous Petrov Galerkin (DPG)
discretization are worked out. Despite the focus on the example, the proposed
concepts apply to a much wider scope of problems, namely problems for which
stable DPG formulations are available. The issue of high-contrast diffusion
fields and ensuing difficulties with degrading ellipticity are discussed. Both
numerical results and theoretical arguments illustrate that for high-contrast
diffusion parameters the proposed DPG loss functions deliver much more robust
performance than simpler least-squares losses.
[LINK]
http://arxiv.org/abs/2506.18773v1
[DATE]
2025-06-23 23:40:56+08:00
[CATEGORIES]
cs.LG
Local Averaging Accurately Distills Manifold Structure From Noisy Data
[AUTHORS]
Yihan Shen, Shiyu Wang, Arnaud Lamy, Mariam Avagyan, John Wright
[ABSTRACT]
High-dimensional data are ubiquitous, with examples ranging from natural
images to scientific datasets, and often reside near low-dimensional manifolds.
Leveraging this geometric structure is vital for downstream tasks, including
signal denoising, reconstruction, and generation. However, in practice, the
manifold is typically unknown and only noisy samples are available. A
fundamental approach to uncovering the manifold structure is local averaging,
which is a cornerstone of state-of-the-art provable methods for manifold
fitting and denoising. However, to the best of our knowledge, there are no
works that rigorously analyze the accuracy of local averaging in a manifold
setting in high-noise regimes. In this work, we provide theoretical analyses of
a two-round mini-batch local averaging method applied to noisy samples drawn
from a $d$-dimensional manifold $\mathcal M \subset \mathbb{R}^D$, under a
relatively high-noise regime where the noise size is comparable to the reach
$\tau$. We show that with high probability, the averaged point
$\hat{\mathbf{q}}$ achieves the bound $d(\hat{\mathbf{q}}, \mathcal{M}) \leq \sigma
\sqrt{d\left(1+\frac{\kappa\,\mathrm{diam}(\mathcal{M})}{\log(D)}\right)}$,
where $\sigma$, $\mathrm{diam}(\mathcal{M})$, and $\kappa$ denote the standard deviation
of the Gaussian noise, the manifold’s diameter, and a bound on its extrinsic
curvature, respectively. This is the first analysis of local averaging accuracy
over the manifold in the relatively high noise regime where $\sigma \sqrt{D}
\approx \tau$. The proposed method can serve as a preprocessing step for a wide
range of provable methods designed for lower-noise regimes. Additionally, our
framework can provide a theoretical foundation for a broad spectrum of
denoising and dimensionality reduction methods that rely on local averaging
techniques.
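
To make the idea of local averaging concrete, here is a toy sketch that is much simpler than the two-round mini-batch method analyzed in the paper: a single round of averaging over all samples within a fixed radius of a noisy query point. The unit-circle manifold, the dimensions, the noise level, the radius, and the function names are all assumptions chosen for illustration.

```python
import math
import random

def sample_noisy_circle(n, D, sigma, rng):
    """Points on a unit circle embedded in R^D, plus isotropic Gaussian noise."""
    points = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        clean = [math.cos(theta), math.sin(theta)] + [0.0] * (D - 2)
        points.append([c + rng.gauss(0.0, sigma) for c in clean])
    return points

def dist_to_circle(p):
    """Euclidean distance from p to the unit circle in the first two coordinates."""
    radial = math.hypot(p[0], p[1]) - 1.0
    off_plane = math.sqrt(sum(c * c for c in p[2:]))
    return math.hypot(radial, off_plane)

def local_average(query, points, radius):
    """Mean of all sample points within `radius` of the query (single round)."""
    near = [p for p in points if math.dist(p, query) <= radius]
    return [sum(col) / len(near) for col in zip(*near)]

rng = random.Random(0)
pts = sample_noisy_circle(n=2000, D=20, sigma=0.1, rng=rng)
q = pts[0]                                  # noisy query point
q_hat = local_average(q, pts, radius=0.8)   # averaged (denoised) point
before, after = dist_to_circle(q), dist_to_circle(q_hat)
```

In this toy setting the averaged point should lie much closer to the circle than the raw sample, because the noise averages out across neighbors while the curvature bias stays small for a modest radius.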
[LINK]
http://arxiv.org/abs/2506.18761v1
[DATE]
2025-06-23 23:32:16+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Zhaoyang Xu, Yunbo Liu
[ABSTRACT]
Identifying suitable machine learning paradigms for intrusion detection
remains critical for building effective and generalizable security solutions.
In this study, we present a controlled comparison of four representative
models - Multi-Layer Perceptron (MLP), 1D Convolutional Neural Network (CNN),
One-Class Support Vector Machine (OCSVM) and Local Outlier Factor (LOF) - on
the CICIDS2017 dataset under two scenarios: detecting known attack types and
generalizing to previously unseen threats. Our results show that supervised MLP
and CNN achieve near-perfect accuracy on familiar attacks but suffer drastic
recall drops on novel attacks. Unsupervised LOF attains moderate overall
accuracy and high recall on unknown threats at the cost of elevated false
alarms, while boundary-based OCSVM balances precision and recall best,
demonstrating robust detection across both scenarios. These findings offer
practical guidance for selecting IDS models in dynamic network environments.
[COMMENTS]
submitted to IEEE CNS 2025
[LINK]
http://arxiv.org/abs/2506.19877v1
[DATE]
2025-06-23 23:31:10+08:00
[CATEGORIES]
cs.LG
ContinualFlow: Learning and Unlearning with Neural Flow Matching
[AUTHORS]
Lorenzo Simone, Davide Bacciu, Shuangge Ma
[ABSTRACT]
We introduce ContinualFlow, a principled framework for targeted unlearning in
generative models via Flow Matching. Our method leverages an energy-based
reweighting loss to softly subtract undesired regions of the data distribution
without retraining from scratch or requiring direct access to the samples to be
unlearned. Instead, it relies on energy-based proxies to guide the unlearning
process. We prove that this induces gradients equivalent to Flow Matching
toward a soft mass-subtracted target, and validate the framework through
experiments on 2D and image domains, supported by interpretable visualizations
and quantitative evaluations.
[COMMENTS]
Accepted at the ICML 2025 Workshop on Machine Unlearning for
Generative AI (MUGen @ ICML25, Vancouver, July 2025)
[LINK]
http://arxiv.org/abs/2506.18747v1
[DATE]
2025-06-23 23:20:58+08:00
[CATEGORIES]
cs.LG
Fast State-Augmented Learning for Wireless Resource Allocation with Dual Variable Regression
[AUTHORS]
Yigit Berkay Uslu, Navid NaderiAlizadeh, Mark Eisen, Alejandro Ribeiro
[ABSTRACT]
We consider resource allocation problems in multi-user wireless networks,
where the goal is to optimize a network-wide utility function subject to
constraints on the ergodic average performance of users. We demonstrate how a
state-augmented graph neural network (GNN) parametrization for the resource
allocation policy circumvents the drawbacks of the ubiquitous dual subgradient
methods by representing the network configurations (or states) as graphs and
treating dual variables as dynamic inputs to the model, viewed as graph signals
supported over the graphs. Lagrangian maximizing state-augmented policies are
learned during the offline training phase, and the dual variables evolve
through gradient updates while executing the learned state-augmented policies
during the inference phase. Our main contributions are to illustrate how
near-optimal initialization of dual multipliers for faster inference can be
accomplished with dual variable regression, leveraging a secondary GNN
parametrization, and how maximization of the Lagrangian over the multipliers
sampled from the dual descent dynamics substantially improves the training of
state-augmented models. We demonstrate the superior performance of the proposed
algorithm with extensive numerical experiments in a case study of transmit
power control. Finally, we prove a convergence result and an exponential
probability bound on the excursions of the dual function (iterate) optimality
gaps.
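
For context, the classical dual subgradient baseline that the abstract contrasts with can be sketched on a one-user toy problem: maximize log(1 + p) subject to p not exceeding a power budget. This is not the paper's state-augmented GNN method; the utility, the deterministic constraint (in place of an ergodic average), the step size, and the function names are assumptions for illustration.

```python
def best_power(lam):
    """Maximizer of the Lagrangian log(1 + p) - lam * p over p >= 0."""
    return max(0.0, 1.0 / lam - 1.0)

def dual_subgradient(budget, lam=1.0, step=0.1, iters=200):
    """Alternate a primal maximization with a projected dual update."""
    for _ in range(iters):
        p = best_power(lam)
        # Raise the multiplier when the constraint p <= budget is violated,
        # lower it otherwise; the floor keeps the multiplier strictly positive.
        lam = max(1e-6, lam + step * (p - budget))
    return lam, best_power(lam)

lam_star, p_star = dual_subgradient(budget=1.0)
```

For a budget of 1 the multiplier settles near 0.5 and the power near 1, the point where the constraint is tight. The many iterations needed from a poor starting multiplier are the kind of cost that a good initialization of the dual variables aims to reduce.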
[COMMENTS]
This work has been submitted to the IEEE TSP for possible publication
[LINK]
http://arxiv.org/abs/2506.18748v1
[DATE]
2025-06-23 23:20:58+08:00
[CATEGORIES]
cs.LG
DiffDesign: Controllable Diffusion with Meta Prior for Efficient Interior Design Generation
[AUTHORS]
Yuxuan Yang, Tao Geng
[ABSTRACT]
Interior design is a complex and creative discipline involving aesthetics,
functionality, ergonomics, and materials science. Effective solutions must meet
diverse requirements, typically producing multiple deliverables such as
renderings and design drawings from various perspectives. Consequently,
interior design processes are often inefficient and demand significant
creativity. With advances in machine learning, generative models have emerged
as a promising means of improving efficiency by creating designs from text
descriptions or sketches. However, few generative works focus on interior
design, leading to substantial discrepancies between outputs and practical
needs, such as differences in size, spatial scope, and the lack of controllable
generation quality. To address these challenges, we propose DiffDesign, a
controllable diffusion model with meta priors for efficient interior design
generation. Specifically, we utilize the generative priors of a 2D diffusion
model pre-trained on a large image dataset as our rendering backbone. We
further guide the denoising process by disentangling cross-attention control
over design attributes, such as appearance, pose, and size, and introduce an
optimal transfer-based alignment module to enforce view consistency.
Simultaneously, we construct an interior design-specific dataset, DesignHelper,
consisting of over 400 solutions across more than 15 spatial types and 15
design styles. This dataset helps fine-tune DiffDesign. Extensive experiments
conducted on various benchmark datasets demonstrate the effectiveness and
robustness of DiffDesign.
[LINK]
http://arxiv.org/abs/2411.16301v3
[DATE]
2025-06-23 23:20:13+08:00
[CATEGORIES]
cs.LG
Experimenting, Fast and Slow: Bayesian Optimization of Long-term Outcomes with Online Experiments
[AUTHORS]
Qing Feng, Samuel Dalton, Benjamin Letham, Maximilian Balandat, Eytan Bakshy
[ABSTRACT]
Online experiments in internet systems, also known as A/B tests, are used for
a wide range of system tuning problems, such as optimizing recommender system
ranking policies and learning adaptive streaming controllers. Decision-makers
generally wish to optimize for long-term treatment effects of the system
changes, which often requires running experiments for a long time as short-term
measurements can be misleading due to non-stationarity in treatment effects
over time. The sequential experimentation strategies, which typically involve
several iterations, can be prohibitively long in such cases. We describe a
novel approach that combines fast experiments (e.g., biased experiments run
only for a few hours or days) and/or offline proxies (e.g., off-policy
evaluation) with long-running, slow experiments to perform sequential, Bayesian
optimization over large action spaces in a short amount of time.
[LINK]
http://arxiv.org/abs/2506.18744v1
[DATE]
2025-06-23 23:18:54+08:00
[CATEGORIES]
cs.LG
On the Existence of Universal Simulators of Attention
[AUTHORS]
Debanjan Dutta, Faizanuddin Ansari, Anish Chakrabarty, Swagatam Das
[ABSTRACT]
Prior work on the learnability of transformers has established their capacity
to approximate specific algorithmic patterns through training under restrictive
architectural assumptions. Fundamentally, these arguments remain data-driven
and therefore can only provide a probabilistic guarantee. Expressivity, on the
contrary, has theoretically been explored to address the problems
\emph{computable} by such architecture. These results proved the
Turing-completeness of transformers, investigated bounds focused on circuit
complexity, and formal logic. Being at the crossroads between learnability and
expressivity, the question remains: \emph{can transformer architectures exactly
simulate an arbitrary attention mechanism, or in particular, the underlying
operations?} In this study, we investigate the transformer encoder’s ability to
simulate a vanilla attention mechanism. By constructing a universal simulator
$\mathcal{U}$ composed of transformer encoders, we present algorithmic
solutions to identically replicate attention outputs and the underlying
elementary matrix and activation operations via RASP, a formal framework for
transformer computation. Our proofs, for the first time, show the existence of
an algorithmically achievable data-agnostic solution, previously known to be
approximated only by learning.
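
As a reminder of the operation being simulated, the following sketch computes standard scaled dot-product attention, softmax(QK^T / sqrt(d))V, on small lists of row vectors. It assumes this common definition of "vanilla" attention and says nothing about the RASP construction itself; the inputs are made up.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]   # identical keys give equal weights
V = [[0.0, 2.0], [4.0, 6.0]]
result = attention(Q, K, V)
```

Because the two keys are identical, the query weights both values equally and the output is their mean.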
[LINK]
http://arxiv.org/abs/2506.18739v1
[DATE]
2025-06-23 23:15:25+08:00
[CATEGORIES]
cs.LG
Towards Group Fairness with Multiple Sensitive Attributes in Federated Foundation Models
[AUTHORS]
Yuning Yang, Han Yu, Tianrun Gao, Xiaodong Xu, Guangyu Wang
[ABSTRACT]
The deep integration of foundation models (FM) with federated learning (FL)
enhances personalization and scalability for diverse downstream tasks, making
it crucial in sensitive domains like healthcare. Achieving group fairness has
become an increasingly prominent issue in the era of federated foundation
models (FFMs), since biases in sensitive attributes might lead to inequitable
treatment for under-represented demographic groups. Existing studies mostly
focus on achieving fairness with respect to a single sensitive attribute. This
renders them unable to provide clear interpretability of dependencies among
multiple sensitive attributes which is required to achieve group fairness. Our
paper makes a first attempt towards a causal analysis of the relationship
between group fairness across various sensitive attributes in the FFM. We
extend the FFM structure to trade off multiple sensitive attributes
simultaneously and quantify the causal effect behind the group fairness through
causal discovery and inference. Extensive experiments validate its
effectiveness, offering insights into interpretability towards building
trustworthy and fair FFM systems.
[LINK]
http://arxiv.org/abs/2506.18732v1
[DATE]
2025-06-23 23:09:14+08:00
[CATEGORIES]
cs.LG
Learning interpretable positional encodings in transformers depends on initialization
[AUTHORS]
Takuya Ito, Luca Cocchi, Tim Klinger, Parikshit Ram, Murray Campbell, Luke Hearne
[ABSTRACT]
In transformers, the positional encoding (PE) provides essential information
that distinguishes the position and order amongst tokens in a sequence. Most
prior investigations of PE effects on generalization were tailored to 1D input
sequences, such as those presented in natural language, where adjacent tokens
(e.g., words) are highly related. In contrast, many real world tasks involve
datasets with highly non-trivial positional arrangements, such as datasets
organized in multiple spatial dimensions, or datasets for which ground truth
positions are not known. Here we find that the choice of initialization of a
learnable PE greatly influences its ability to learn interpretable PEs that
lead to enhanced generalization. We empirically demonstrate our findings in
three experiments: 1) A 2D relational reasoning task; 2) A nonlinear stochastic
network simulation; 3) A real world 3D neuroscience dataset, applying
interpretability analyses to verify the learning of accurate PEs. Overall, we
find that a learned PE initialized from a small-norm distribution can 1)
uncover interpretable PEs that mirror ground truth positions in multiple
dimensions, and 2) lead to improved generalization. These results illustrate
the feasibility of learning identifiable and interpretable PEs for enhanced
generalization.
[COMMENTS]
ICML 2025, Workshop on Actionable Interpretability
[LINK]
http://arxiv.org/abs/2406.08272v4
[DATE]
2025-06-23 23:01:16+08:00
[CATEGORIES]
cs.LG
Including Semantic Information via Word Embeddings for Skeleton-based Action Recognition
[AUTHORS]
Dustin Aganian, Erik Franze, Markus Eisenbach, Horst-Michael Gross
[ABSTRACT]
Effective human action recognition is widely used for cobots in Industry 4.0
to assist in assembly tasks. However, conventional skeleton-based methods often
lose keypoint semantics, limiting their effectiveness in complex interactions.
In this work, we introduce a novel approach to skeleton-based action
recognition that enriches input representations by leveraging word embeddings
to encode semantic information. Our method replaces one-hot encodings with
semantic volumes, enabling the model to capture meaningful relationships
between joints and objects. Through extensive experiments on multiple assembly
datasets, we demonstrate that our approach significantly improves
classification performance, and enhances generalization capabilities by
simultaneously supporting different skeleton types and object classes. Our
findings highlight the potential of incorporating semantic information to
enhance skeleton-based action recognition in dynamic and diverse environments.
[COMMENTS]
IEEE International Joint Conference on Neural Networks (IJCNN) 2025
[LINK]
http://arxiv.org/abs/2506.18721v1
[DATE]
2025-06-23 22:57:06+08:00
[CATEGORIES]
cs.LG
PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
[AUTHORS]
Md Rakibul Hasan, Pouria Behnoudfar, Dan MacKinlay, Thomas Poulet
[ABSTRACT]
Machine Learning, particularly Generative Adversarial Networks (GANs), has
revolutionised Super Resolution (SR). However, generated images often lack
physical meaningfulness, which is essential for scientific applications. Our
approach, PC-SRGAN, enhances image resolution while ensuring physical
consistency for interpretable simulations. PC-SRGAN significantly improves both
the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure
compared to conventional methods, even with limited training data (e.g., only
13% of training data required for SRGAN). Beyond SR, PC-SRGAN augments
physically meaningful machine learning, incorporating numerically justified
time integrators and advanced quality metrics. These advancements promise
reliable and causal machine-learning models in scientific domains. A
significant advantage of PC-SRGAN over conventional SR techniques is its
physical consistency, which makes it a viable surrogate model for
time-dependent problems. PC-SRGAN advances scientific machine learning,
offering improved accuracy and efficiency for image processing, enhanced
process understanding, and broader applications to scientific research. We
publicly release the complete source code at
https://github.com/hasan-rakibul/PC-SRGAN.
[LINK]
http://arxiv.org/abs/2505.06502v2
[DATE]
2025-06-23 22:50:11+08:00
[CATEGORIES]
cs.LG
SaGIF: Improving Individual Fairness in Graph Neural Networks via Similarity Encoding
[AUTHORS]
Yuchang Zhu, Jintang Li, Huizhe Zhang, Liang Chen, Zibin Zheng
[ABSTRACT]
Individual fairness (IF) in graph neural networks (GNNs), which requires
that similar individuals receive similar outcomes from GNNs, has
been a critical issue. Despite its importance, this area remains
largely unexplored in terms of (1) a clear understanding of what induces
individual unfairness in GNNs and (2) a comprehensive consideration of
identifying similar individuals. To bridge these gaps, we conduct a preliminary
analysis to explore the underlying reason for individual unfairness and observe
correlations between IF and similarity consistency, a concept introduced to
evaluate the discrepancy in identifying similar individuals based on graph
structure versus node features. Inspired by our observations, we introduce two
metrics to assess individual similarity from two distinct perspectives:
topology fusion and feature fusion. Building upon these metrics, we propose
Similarity-aware GNNs for Individual Fairness, named SaGIF. The key insight
behind SaGIF is the integration of individual similarities by independently
learning similarity representations, leading to an improvement of IF in GNNs.
Our experiments on several real-world datasets validate the effectiveness of
our proposed metrics and SaGIF. Specifically, SaGIF consistently outperforms
state-of-the-art IF methods while maintaining utility performance. Code is
available at: https://github.com/ZzoomD/SaGIF.
[COMMENTS]
Under review
[LINK]
http://arxiv.org/abs/2506.18696v1
[DATE]
2025-06-23 22:34:26+08:00
[CATEGORIES]
cs.LG
BAnG: Bidirectional Anchored Generation for Conditional RNA Design
[AUTHORS]
Roman Klypa, Alberto Bietti, Sergei Grudinin
[ABSTRACT]
Designing RNA molecules that interact with specific proteins is a critical
challenge in experimental and computational biology. Existing computational
approaches require a substantial amount of previously known interacting RNA
sequences for each specific protein or a detailed knowledge of RNA structure,
restricting their utility in practice. To address this limitation, we develop
RNA-BAnG, a deep learning-based model designed to generate RNA sequences for
protein interactions without these requirements. Central to our approach is a
novel generative method, Bidirectional Anchored Generation (BAnG), which
leverages the observation that protein-binding RNA sequences often contain
functional binding motifs embedded within broader sequence contexts. We first
validate our method on generic synthetic tasks involving similar localized
motifs to those appearing in RNAs, demonstrating its benefits over existing
generative approaches. We then evaluate our model on biological sequences,
showing its effectiveness for conditional RNA sequence design given a binding
protein.
[LINK]
http://arxiv.org/abs/2502.21274v2
[DATE]
2025-06-23 22:26:44+08:00
[CATEGORIES]
cs.LG
One Step Diffusion via Shortcut Models
[AUTHORS]
Kevin Frans, Danijar Hafner, Sergey Levine, Pieter Abbeel
[ABSTRACT]
Diffusion models and flow-matching models have enabled generating diverse and
realistic images by learning to transfer noise to data. However, sampling from
these models involves iterative denoising over many neural network passes,
making generation slow and expensive. Previous approaches for speeding up
sampling require complex training regimes, such as multiple training phases,
multiple networks, or fragile scheduling. We introduce shortcut models, a
family of generative models that use a single network and training phase to
produce high-quality samples in a single or multiple sampling steps. Shortcut
models condition the network not only on the current noise level but also on
the desired step size, allowing the model to skip ahead in the generation
process. Across a wide range of sampling step budgets, shortcut models
consistently produce higher quality samples than previous approaches, such as
consistency models and reflow. Compared to distillation, shortcut models reduce
complexity to a single network and training phase and additionally allow
varying step budgets at inference time.
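
The step-size conditioning can be sketched as a sampler that repeatedly applies an update of the form x <- x + d * s(x, t, d), where d is the step size and s is the network. This update form, the toy stand-in for the network, and the single-point target are assumptions for illustration; a real shortcut model would be a trained network that genuinely depends on d.

```python
def toy_shortcut(x, t, d, target=2.0):
    """Hypothetical stand-in for a trained network s(x, t, d).

    For a straight path from noise to a single data point, the direction
    toward the target is exact for every step size, so d goes unused here.
    """
    return (target - x) / (1.0 - t)

def sample(x0, num_steps, shortcut=toy_shortcut):
    """Integrate from t = 0 to t = 1 in `num_steps` equal steps of size d."""
    d = 1.0 / num_steps
    x = x0
    for i in range(num_steps):
        t = i * d
        x = x + d * shortcut(x, t, d)
    return x

one_step = sample(-1.0, num_steps=1)    # a single jump from noise to data
four_step = sample(-1.0, num_steps=4)   # same endpoint with smaller steps
```

The same function serves any step budget, which mirrors the abstract's point that the budget can be varied at inference time.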
[LINK]
http://arxiv.org/abs/2410.12557v3
[DATE]
2025-06-23 22:26:35+08:00
[CATEGORIES]
cs.LG
VesselGPT: Autoregressive Modeling of Vascular Geometry
[AUTHORS]
Paula Feldman, Martin Sinnona, Claudio Delrieux, Viviana Siless, Emmanuel Iarussi
[ABSTRACT]
Anatomical trees are critical for clinical diagnosis and treatment planning,
yet their complex and diverse geometry makes accurate representation a
significant challenge. Motivated by the latest advances in large language
models, we introduce an autoregressive method for synthesizing anatomical
trees. Our approach first embeds vessel structures into a learned discrete
vocabulary using a VQ-VAE architecture, then models their generation
autoregressively with a GPT-2 model. This method effectively captures intricate
geometries and branching patterns, enabling realistic vascular tree synthesis.
Comprehensive qualitative and quantitative evaluations reveal that our
technique achieves high-fidelity tree reconstruction with compact discrete
representations. Moreover, our B-spline representation of vessel cross-sections
preserves critical morphological details that are often overlooked in previous
methods’ parameterizations. To the best of our knowledge, this work is the first
to generate blood vessels in an autoregressive manner. Code is available at
https://github.com/LIA-DiTella/VesselGPT-MICCAI.
[COMMENTS]
Accepted for MICCAI 2025
[LINK]
http://arxiv.org/abs/2505.13318v2
[DATE]
2025-06-23 21:57:18+08:00
[CATEGORIES]
cs.LG
A Random Matrix Analysis of In-context Memorization for Nonlinear Attention
[AUTHORS]
Zhenyu Liao, Jiaqing Liu, TianQi Hou, Difan Zou, Zenan Ling
[ABSTRACT]
Attention mechanisms have revolutionized machine learning (ML) by enabling
efficient modeling of global dependencies across inputs. Their inherently
parallelizable structures allow for efficient scaling with the exponentially
increasing size of both pretrained data and model parameters. Yet, despite
their central role as the computational backbone of modern large language
models (LLMs), the theoretical understanding of Attention, especially in the
nonlinear setting, remains limited.
In this paper, we provide a precise characterization of the \emph{in-context
memorization error} of \emph{nonlinear Attention}, in the high-dimensional
proportional regime where the number of input tokens $n$ and their embedding
dimension $p$ are both large and comparable. Leveraging recent advances in the
theory of large kernel random matrices, we show that nonlinear Attention
typically incurs higher memorization error than linear ridge regression on
random inputs. However, this gap vanishes, and can even be reversed, when the
input exhibits statistical structure, particularly when the Attention weights
align with the input signal direction. Our results reveal how nonlinearity and
input structure interact with each other to govern the memorization performance
of nonlinear Attention. The theoretical insights are supported by numerical
experiments.
[COMMENTS]
40 pages, 7 pages
[LINK]
http://arxiv.org/abs/2506.18656v1
[DATE]
2025-06-23 21:56:43+08:00
[CATEGORIES]
cs.LG
Tight Generalization Error Bounds for Stochastic Gradient Descent in Non-convex Learning
[AUTHORS]
Wenjun Xiong, Juan Ding, Xinlei Zuo, Qizhai Li
[ABSTRACT]
Stochastic Gradient Descent (SGD) is fundamental for training deep neural
networks, especially in non-convex settings. Understanding SGD’s generalization
properties is crucial for ensuring robust model performance on unseen data. In
this paper, we analyze the generalization error bounds of SGD for non-convex
learning by introducing the Type II perturbed SGD (T2pm-SGD), which
accommodates both sub-Gaussian and bounded loss functions. The generalization
error bound is decomposed into two components: the trajectory term and the
flatness term. Our analysis improves the trajectory term to $O(n^{-1})$,
significantly enhancing the previous $O((nb)^{-1/2})$ bound for bounded losses,
where $n$ is the number of training samples and $b$ is the batch size. By selecting
an optimal variance for the perturbation noise, the overall bound is further
refined to $O(n^{-2/3})$. For sub-Gaussian loss functions, a tighter trajectory
term is also achieved. In both cases, the flatness term remains stable across
iterations and is smaller than those reported in previous literature, which
increase with iterations. This stability, ensured by T2pm-SGD, leads to tighter
generalization error bounds for both loss function types. Our theoretical
results are validated through extensive experiments on benchmark datasets,
including MNIST and CIFAR-10, demonstrating the effectiveness of T2pm-SGD in
establishing tighter generalization bounds.
[LINK]
http://arxiv.org/abs/2506.18645v1
[DATE]
2025-06-23 21:47:25+08:00
[CATEGORIES]
cs.LG
On Union-Closedness of Language Generation
[AUTHORS]
Steve Hanneke, Amin Karbasi, Anay Mehrotra, Grigoris Velegkas
[ABSTRACT]
We investigate language generation in the limit - a model by Kleinberg and
Mullainathan [NeurIPS 2024] and extended by Li, Raman, and Tewari [COLT 2025].
While Kleinberg and Mullainathan proved generation is possible for all
countable collections, Li et al. defined a hierarchy of generation notions
(uniform, non-uniform, and generatable) and explored their feasibility for
uncountable collections.
Our first set of results resolves two open questions of Li et al. by proving
finite unions of generatable or non-uniformly generatable classes need not be
generatable. These follow from a stronger result: there is a non-uniformly
generatable class and a uniformly generatable class whose union is
non-generatable. This adds to the aspects along which language generation in
the limit is different from traditional tasks in statistical learning theory
like classification, which are closed under finite unions. In particular, it
implies that given two generators for different collections, one cannot combine
them to obtain a single “more powerful” generator, prohibiting this notion of
boosting.
Our construction also addresses a third open question of Li et al. on whether
there are uncountable classes that are non-uniformly generatable and do not
satisfy the eventually unbounded closure (EUC) condition introduced by Li,
Raman, and Tewari. Our approach utilizes carefully constructed classes along
with a novel diagonalization argument that could be of independent interest in
the growing area of language generation.
[LINK]
http://arxiv.org/abs/2506.18642v1
[DATE]
2025-06-23 21:42:25+08:00
[CATEGORIES]
cs.LG
Federated Loss Exploration for Improved Convergence on Non-IID Data
[AUTHORS]
Christian Internò, Markus Olhofer, Yaochu Jin, Barbara Hammer
[ABSTRACT]
Federated learning (FL) has emerged as a groundbreaking paradigm in machine
learning (ML), offering privacy-preserving collaborative model training across
diverse datasets. Despite its promise, FL faces significant hurdles in
non-identically and independently distributed (non-IID) data scenarios, where
most existing methods often struggle with data heterogeneity and lack
robustness in performance. This paper introduces Federated Loss Exploration
(FedLEx), an innovative approach specifically designed to tackle these
challenges. FedLEx distinctively addresses the shortcomings of existing FL
methods in non-IID settings by optimizing its learning behavior for scenarios
in which assumptions about data heterogeneity are impractical or unknown. It
employs a federated loss exploration technique, where clients contribute to a
global guidance matrix by calculating gradient deviations for model parameters.
This matrix serves as a strategic compass to guide clients’ gradient updates in
subsequent FL rounds, thereby fostering optimal parameter updates for the
global model. FedLEx effectively navigates the complex loss surfaces inherent
in non-IID data, enhancing knowledge transfer in an efficient manner, since
only a small number of epochs and small amount of data are required to build a
strong global guidance matrix that can achieve model convergence without the
need for additional data sharing or data distribution statistics in a large client
scenario. Our extensive experiments with state-of-the-art FL algorithms
demonstrate significant improvements in performance, particularly under
realistic non-IID conditions, thus highlighting FedLEx’s potential to overcome
critical barriers in diverse FL applications.
[LINK]
http://arxiv.org/abs/2506.18640v1
[DATE]
2025-06-23 21:42:07+08:00
[CATEGORIES]
cs.LG
Granular-Ball-Induced Multiple Kernel K-Means
[AUTHORS]
Shuyin Xia, Yifan Wang, Lifeng Shen, Guoyin Wang
[ABSTRACT]
Most existing multi-kernel clustering algorithms, such as multi-kernel
K-means, often struggle with computational efficiency and robustness when faced
with complex data distributions. These challenges stem from their dependence on
point-to-point relationships for optimization, which can lead to difficulty in
accurately capturing data sets’ inherent structure and diversity. Additionally,
the intricate interplay between multiple kernels in such algorithms can further
exacerbate these issues, effectively impacting their ability to cluster data
points in high-dimensional spaces. In this paper, we leverage granular-ball
computing to improve the multi-kernel clustering framework. The core of
granular-ball computing is to adaptively fit data distribution by balls from
coarse to acceptable levels. Each ball can enclose data points based on a
density consistency measurement. Such ball-based data description thus improves
the computational efficiency and the robustness to unknown noises.
Specifically, based on granular-ball representations, we introduce the
granular-ball kernel (GBK) and its corresponding granular-ball multi-kernel
K-means framework (GB-MKKM) for efficient clustering. Using granular-ball
relationships in multiple kernel spaces, the proposed GB-MKKM framework shows
its superiority in efficiency and clustering performance in the empirical
evaluation of various clustering tasks.
[COMMENTS]
Accepted by IJCAI 2025
[LINK]
http://arxiv.org/abs/2506.18637v1
[DATE]
2025-06-23 21:39:32+08:00
[CATEGORIES]
cs.LG
Trustworthy Prediction with Gaussian Process Knowledge Scores
[AUTHORS]
Kurt Butler, Guanchao Feng, Tong Chen, Petar Djuric
[ABSTRACT]
Probabilistic models are often used to make predictions in regions of the
data space where no observations are available, but it is not always clear
whether such predictions are well-informed by previously seen data. In this
paper, we propose a knowledge score for predictions from Gaussian process
regression (GPR) models that quantifies the extent to which observing data have
reduced our uncertainty about a prediction. The knowledge score is
interpretable and naturally bounded between 0 and 1. We demonstrate in several
experiments that the knowledge score can anticipate when predictions from a GPR
model are accurate, and that this anticipation improves performance in tasks
such as anomaly detection, extrapolation, and missing data imputation. Source
code for this project is available online at
https://github.com/KurtButler/GP-knowledge.
[COMMENTS]
6 pages, 5 figures, to be published in the Proceedings of the
European Signal Processing Conference (EUSIPCO)
[LINK]
http://arxiv.org/abs/2506.18630v1
[DATE]
2025-06-23 21:36:06+08:00
[CATEGORIES]
cs.LG
On Equivariant Model Selection through the Lens of Uncertainty
[AUTHORS]
Putri A. van der Linden, Alexander Timans, Dharmesh Tailor, Erik J. Bekkers
[ABSTRACT]
Equivariant models leverage prior knowledge on symmetries to improve
predictive performance, but misspecified architectural constraints can harm it
instead. While work has explored learning or relaxing constraints, selecting
among pretrained models with varying symmetry biases remains challenging. We
examine this model selection task from an uncertainty-aware perspective,
comparing frequentist (via Conformal Prediction), Bayesian (via the marginal
likelihood), and calibration-based measures to naive error-based evaluation. We
find that uncertainty metrics generally align with predictive performance, but
Bayesian model evidence does so inconsistently. We attribute this to a mismatch
in Bayesian and geometric notions of model complexity, and discuss possible
remedies. Our findings point towards the potential of uncertainty in guiding
symmetry-aware model selection.
[COMMENTS]
9 pages, 4 figures, 2 tables. In the 8th Workshop on Tractable
Probabilistic Modeling at UAI 2025
[LINK]
http://arxiv.org/abs/2506.18629v1
[DATE]
2025-06-23 21:35:06+08:00
[CATEGORIES]
cs.LG
Multi-Agent Reinforcement Learning for Inverse Design in Photonic Integrated Circuits
[AUTHORS]
Yannik Mahlau, Maximilian Schier, Christoph Reinders, Frederik Schubert, Marco Bügling, Bodo Rosenhahn
[ABSTRACT]
Inverse design of photonic integrated circuits (PICs) has traditionally
relied on gradient-based optimization. However, this approach is prone to end up
in local minima, which results in suboptimal design functionality. As interest
in PICs increases due to their potential for addressing modern hardware demands
through optical computing, more adaptive optimization algorithms are needed. We
present a reinforcement learning (RL) environment as well as multi-agent RL
algorithms for the design of PICs. By discretizing the design space into a
grid, we formulate the design task as an optimization problem with thousands of
binary variables. We consider multiple two- and three-dimensional design tasks
that represent PIC components for an optical computing system. By decomposing
the design space into thousands of individual agents, our algorithms are able
to optimize designs with only a few thousand environment samples. They
outperform previous state-of-the-art gradient-based optimization in both two- and
three-dimensional design tasks. Our work may also serve as a benchmark for
further exploration of sample-efficient RL for inverse design in photonics.
[LINK]
http://arxiv.org/abs/2506.18627v1
[DATE]
2025-06-23 21:34:27+08:00
[CATEGORIES]
cs.LG
Bures-Wasserstein Flow Matching for Graph Generation
[AUTHORS]
Keyue Jiang, Jiahao Cui, Xiaowen Dong, Laura Toni
[ABSTRACT]
Graph generation has emerged as a critical task in fields ranging from
molecule design to drug discovery. Contemporary approaches, notably diffusion
and flow-based models, have achieved solid graph generative performance through
constructing a probability path that interpolates between a reference
distribution and the data distribution. However, these methods typically model
the evolution of individual nodes and edges independently and use linear
interpolations to build the path assuming that the data lie in Euclidean space.
We show that this is suboptimal given the intrinsic non-Euclidean structure and
interconnected patterns of graphs, and it poses risks to the sampling
convergence. To build a better probability path, we model the joint evolution
of the nodes and edges by representing graphs as connected systems
parameterized by Markov random fields (MRF). We then leverage the optimal
transport displacement between MRF objects to design the probability path for
graph generation. Based on this, we introduce BWFlow, a flow-matching framework
for graph generation that respects the underlying geometry of graphs and
provides smooth velocities in the probability path. The novel framework can be
adapted to both continuous and discrete flow-matching algorithms. Experimental
evaluations in plain graph generation and 2D/3D molecule generation validate
the effectiveness of BWFlow in graph generation with competitive performance,
stable training, and guaranteed sampling convergence.
[LINK]
http://arxiv.org/abs/2506.14020v2
[DATE]
2025-06-23 21:31:42+08:00
[CATEGORIES]
cs.LG
Prédiction optimale pour un modèle ordinal à covariables fonctionnelles
[AUTHORS]
Simón Weinberger, Jairo Cugliari, Aurélie Le Cain
[ABSTRACT]
We present a prediction framework for ordinal models: we introduce optimal
predictions using loss functions and give the explicit form of the
Least-Absolute-Deviation prediction for these models. Then, we reformulate an
ordinal model with functional covariates to a classic ordinal model with
multiple scalar covariates. We illustrate all the proposed methods and try to
apply these to a dataset collected by EssilorLuxottica for the development of a
control algorithm for the shade of connected glasses.
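
As a rough illustration of the Least-Absolute-Deviation prediction mentioned above: a standard result is that the guess minimizing expected absolute error is a median of the predictive distribution. The sketch below applies that to an ordinal outcome, assuming the categories are coded as equally spaced integers 0, 1, 2, ... (an assumption made here, not taken from the abstract). The function names are hypothetical, and the explicit form given in the paper may differ.

```python
from itertools import accumulate


def lad_prediction(probs):
    """Return a 0-based category that minimizes expected absolute error.

    `probs` holds the predicted probability of each ordered category.
    A minimizer of E|Y - guess| is a median, i.e. the first category
    whose cumulative probability reaches 0.5.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    for k, cumulative in enumerate(accumulate(probs)):
        if cumulative >= 0.5:
            return k
    return len(probs) - 1  # only reached through floating-point rounding


def expected_abs_loss(probs, guess):
    """Expected absolute error of predicting category `guess`."""
    return sum(p * abs(j - guess) for j, p in enumerate(probs))


# Illustrative probabilities only, not data from the paper.
probs = [0.1, 0.2, 0.3, 0.4]
prediction = lad_prediction(probs)
brute_force = min(range(len(probs)), key=lambda g: expected_abs_loss(probs, g))
```

Comparing `prediction` with `brute_force` checks the median rule against a direct search over all categories.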
[COMMENTS]
in French language, Journées de statistiques, Société
Française des Statistiques, Jul 2023, Bruxelles - Université Libre de
Bruxelles (ULB), Belgique
[LINK]
http://arxiv.org/abs/2506.18615v1
[DATE]
2025-06-23 21:20:33+08:00
[CATEGORIES]
cs.LG
Policy gradient methods for ordinal policies
[AUTHORS]
Simón Weinberger, Jairo Cugliari
[ABSTRACT]
In reinforcement learning, the softmax parametrization is the standard
approach for policies over discrete action spaces. However, it fails to capture
the order relationship between actions. Motivated by a real-world industrial
problem, we propose a novel policy parametrization based on ordinal regression
models adapted to the reinforcement learning setting. Our approach addresses
practical challenges, and numerical experiments demonstrate its effectiveness
in real applications and in continuous action tasks, where discretizing the
action space and applying the ordinal policy yields competitive performance.
[COMMENTS]
in French language, Journées de statistiques 2025,
Société Française des Statistiques, Jun 2023, Marseille, France
[LINK]
http://arxiv.org/abs/2506.18614v1
[DATE]
2025-06-23 21:19:36+08:00
[CATEGORIES]
cs.LG
SHAMaNS: Sound Localization with Hybrid Alpha-Stable Spatial Measure and Neural Steerer
[AUTHORS]
Diego Di Carlo, Mathieu Fontaine, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii
[ABSTRACT]
This paper describes a sound source localization (SSL) technique that
combines an $\alpha$-stable model for the observed signal with a neural
network-based approach for modeling steering vectors. Specifically, a
physics-informed neural network, referred to as Neural Steerer, is used to
interpolate measured steering vectors (SVs) on a fixed microphone array. This
allows for a more robust estimation of the so-called $\alpha$-stable spatial
measure, which represents the most plausible direction of arrival (DOA) of a
target signal. As an $\alpha$-stable model for the non-Gaussian case
($\alpha \in (0, 2)$) theoretically defines a unique spatial measure, we choose to
leverage it to account for residual reconstruction error of the Neural Steerer
in the downstream tasks. The objective scores indicate that our proposed
technique outperforms state-of-the-art methods in the case of multiple sound
sources.
[COMMENTS]
European Signal Processing Conference (EUSIPCO), Sep 2025, Palermo,
Italy
[LINK]
http://arxiv.org/abs/2506.18954v1
[DATE]
2025-06-23 21:11:29+08:00
[CATEGORIES]
cs.LG
Simulation-Free Differential Dynamics through Neural Conservation Laws
[AUTHORS]
Mengjian Hua, Eric Vanden-Eijnden, Ricky T. Q. Chen
[ABSTRACT]
We present a novel simulation-free framework for training continuous-time
diffusion processes over very general objective functions. Existing methods
typically involve either prescribing the optimal diffusion process – which
only works for heavily restricted problem formulations – or require expensive
simulation to numerically obtain the time-dependent densities and sample from
the diffusion process. In contrast, we propose a coupled parameterization which
jointly models a time-dependent density function, or probability path, and the
dynamics of a diffusion process that generates this probability path. To
accomplish this, our approach directly bakes in the Fokker-Planck equation and
density function requirements as hard constraints, by extending and greatly
simplifying the construction of Neural Conservation Laws. This enables
simulation-free training for a large variety of problem formulations, from
data-driven objectives as in generative modeling and dynamical optimal
transport, to optimality-based objectives as in stochastic optimal control,
with straightforward extensions to mean-field objectives due to the ease of
accessing exact density functions. We validate our method in a diverse range of
application domains from modeling spatio-temporal events to learning optimal
dynamics from population data.
[LINK]
http://arxiv.org/abs/2506.18604v1
[DATE]
2025-06-23 21:04:23+08:00
[CATEGORIES]
cs.LG
BulletGen: Improving 4D Reconstruction with Bullet-Time Generation
[AUTHORS]
Denys Rozumnyi, Jonathon Luiten, Numair Khan, Johannes Schönberger, Peter Kontschieder
[ABSTRACT]
Transforming casually captured, monocular videos into fully immersive dynamic
experiences is a highly ill-posed task, and comes with significant challenges,
e.g., reconstructing unseen regions, and dealing with the ambiguity in
monocular depth estimation. In this work we introduce BulletGen, an approach
that takes advantage of generative models to correct errors and complete
missing information in a Gaussian-based dynamic scene representation. This is
done by aligning the output of a diffusion-based video generation model with
the 4D reconstruction at a single frozen “bullet-time” step. The generated
frames are then used to supervise the optimization of the 4D Gaussian model.
Our method seamlessly blends generative content with both static and dynamic
scene components, achieving state-of-the-art results on both novel-view
synthesis, and 2D/3D tracking tasks.
[LINK]
http://arxiv.org/abs/2506.18601v1
[DATE]
2025-06-23 21:03:42+08:00
[CATEGORIES]
cs.LG
SpaNN: Detecting Multiple Adversarial Patches on CNNs by Spanning Saliency Thresholds
[AUTHORS]
Mauricio Byrd Victorica, György Dán, Henrik Sandberg
[ABSTRACT]
State-of-the-art convolutional neural network models for object detection and
image classification are vulnerable to physically realizable adversarial
perturbations, such as patch attacks. Existing defenses have focused,
implicitly or explicitly, on single-patch attacks, leaving their sensitivity to
the number of patches as an open question or rendering them computationally
infeasible or inefficient against attacks consisting of multiple patches in the
worst cases. In this work, we propose SpaNN, an attack detector whose
computational complexity is independent of the expected number of adversarial
patches. The key novelty of the proposed detector is that it builds an ensemble
of binarized feature maps by applying a set of saliency thresholds to the
neural activations of the first convolutional layer of the victim model. It
then performs clustering on the ensemble and uses the cluster features as the
input to a classifier for attack detection. Contrary to existing detectors,
SpaNN does not rely on a fixed saliency threshold for identifying adversarial
regions, which makes it robust against white box adversarial attacks. We
evaluate SpaNN on four widely used data sets for object detection and
classification, and our results show that SpaNN outperforms state-of-the-art
defenses by up to 11 and 27 percentage points in the case of object detection
and the case of image classification, respectively. Our code is available at
https://github.com/gerkbyrd/SpaNN.
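
The thresholding step described above can be sketched as follows. This is only a minimal illustration of building an ensemble of binarized maps from one activation map. It assumes non-negative activations and thresholds expressed as fractions of the map's peak value, both of which are assumptions made here. The function name is hypothetical and is not taken from the SpaNN repository, and the clustering and classification stages are omitted.

```python
def binarized_ensemble(feature_map, thresholds):
    """Binarize one 2D activation map once per saliency threshold.

    Each threshold t (assumed to be in [0, 1]) keeps the cells whose
    activation is at least t times the peak activation of the map.
    Returns one 0/1 map per threshold, in the order given.
    """
    peak = max(max(row) for row in feature_map)
    if peak <= 0:
        # Nothing is salient: return all-zero maps.
        return [[[0 for _ in row] for row in feature_map] for _ in thresholds]
    ensemble = []
    for t in thresholds:
        cutoff = t * peak
        ensemble.append(
            [[1 if value >= cutoff else 0 for value in row] for row in feature_map]
        )
    return ensemble


# Toy activation map, for illustration only.
toy_map = [[0.0, 0.2], [0.5, 1.0]]
maps = binarized_ensemble(toy_map, [0.25, 0.75])
```

Sweeping several thresholds instead of fixing one is the property the abstract credits for robustness: no single cutoff decides which regions count as salient.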
[COMMENTS]
2025 IEEE Conference on Secure and Trustworthy Machine Learning
(SaTML2025)
[LINK]
http://arxiv.org/abs/2506.18591v1
[DATE]
2025-06-23 20:51:10+08:00
[CATEGORIES]
cs.LG
Optimization-Induced Dynamics of Lipschitz Continuity in Neural Networks
[AUTHORS]
Róisín Luo, James McDermott, Christian Gagné, Qiang Sun, Colm O’Riordan
[ABSTRACT]
Lipschitz continuity characterizes the worst-case sensitivity of neural
networks to small input perturbations; yet its dynamics (i.e. temporal
evolution) during training remains under-explored. We present a rigorous
mathematical framework to model the temporal evolution of Lipschitz continuity
during training with stochastic gradient descent (SGD). This framework
leverages a system of stochastic differential equations (SDEs) to capture both
deterministic and stochastic forces. Our theoretical analysis identifies three
principal factors driving the evolution: (i) the projection of gradient flows,
induced by the optimization dynamics, onto the operator-norm Jacobian of
parameter matrices; (ii) the projection of gradient noise, arising from the
randomness in mini-batch sampling, onto the operator-norm Jacobian; and (iii)
the projection of the gradient noise onto the operator-norm Hessian of
parameter matrices. Furthermore, our theoretical framework sheds light on how
noisy supervision, parameter initialization, batch size, and mini-batch
sampling trajectories, among other factors, shape the evolution of the
Lipschitz continuity of neural networks. Our experimental results demonstrate
strong agreement between the theoretical implications and the observed
behaviors.
[LINK]
http://arxiv.org/abs/2506.18588v1
[DATE]
2025-06-23 20:49:13+08:00
[CATEGORIES]
cs.LG
Radio Map Prediction from Aerial Images and Application to Coverage Optimization
[AUTHORS]
Fabian Jaensch, Giuseppe Caire, Begüm Demir
[ABSTRACT]
Several studies have explored deep learning algorithms to predict large-scale
signal fading, or path loss, in urban communication networks. The goal is to
replace costly measurement campaigns, inaccurate statistical models, or
computationally expensive ray-tracing simulations with machine learning models
that deliver quick and accurate predictions. We focus on predicting path loss
radio maps using convolutional neural networks, leveraging aerial images alone
or in combination with supplementary height information. Notably, our approach
does not rely on explicit classification of environmental objects, which is
often unavailable for most locations worldwide. While the prediction of radio
maps using complete 3D environmental data is well-studied, the use of only
aerial images remains under-explored. We address this gap by showing that
state-of-the-art models developed for existing radio map datasets can be
effectively adapted to this task. Additionally, we introduce a new model dubbed
UNetDCN that achieves on par or better performance compared to the
state-of-the-art with reduced complexity. The trained models are
differentiable, and therefore they can be incorporated in various network
optimization algorithms. While an extensive discussion is beyond this paper’s
scope, we demonstrate this through an example optimizing the directivity of
base stations in cellular networks via backpropagation to enhance coverage.
[COMMENTS]
13 pages, 8 Figures, To appear in IEEE Transactions on Wireless
Communications. arXiv admin note: substantial text overlap with
arXiv:2402.00878
[LINK]
http://arxiv.org/abs/2410.17264v2
[DATE]
2025-06-23 20:42:36+08:00
[CATEGORIES]
cs.LG
Efficient Beam Selection for ISAC in Cell-Free Massive MIMO via Digital Twin-Assisted Deep Reinforcement Learning
[AUTHORS]
Jiexin Zhang, Shu Xu, Chunguo Li, Yongming Huang, Luxi Yang
[ABSTRACT]
Beamforming enhances signal strength and quality by focusing energy in
specific directions. This capability is particularly crucial in cell-free
integrated sensing and communication (ISAC) systems, where multiple distributed
access points (APs) collaborate to provide both communication and sensing
services. In this work, we first derive the distribution of joint target
detection probabilities across multiple receiving APs under false alarm rate
constraints, and then formulate the beam selection procedure as a Markov
decision process (MDP). We establish a deep reinforcement learning (DRL)
framework, in which reward shaping and sinusoidal embedding are introduced to
facilitate agent learning. To eliminate the high costs and associated risks of
real-time agent-environment interactions, we further propose a novel digital
twin (DT)-assisted offline DRL approach. Different from traditional online DRL,
a conditional generative adversarial network (cGAN)-based DT module, operating
as a replica of the real world, is meticulously designed to generate virtual
state-action transition pairs and enrich data diversity, enabling offline
adjustment of the agent’s policy. Additionally, we address the
out-of-distribution issue by incorporating an extra penalty term into the loss
function design. The convergence of agent-DT interaction and the upper bound of
the Q-error function are theoretically derived. Numerical results demonstrate
the remarkable performance of our proposed approach, which significantly
reduces online interaction overhead while maintaining effective beam selection
across diverse conditions including strict false alarm control, low
signal-to-noise ratios, and high target velocities.
[COMMENTS]
Submitted to IEEE Transactions on Wireless Communications
[LINK]
http://arxiv.org/abs/2506.18560v1
[DATE]
2025-06-23 20:17:57+08:00
[CATEGORIES]
cs.LG
Soft decision trees for survival analysis
[AUTHORS]
Antonio Consolo, Edoardo Amaldi, Emilio Carrizosa
[ABSTRACT]
Decision trees are popular in survival analysis for their interpretability
and ability to model complex relationships. Survival trees, which predict the
timing of singular events using censored historical data, are typically built
through heuristic approaches. Recently, there has been growing interest in
globally optimized trees, where the overall tree is trained by minimizing the
error function over all its parameters. We propose a new soft survival tree
model (SST), with a soft splitting rule at each branch node, trained via a
nonlinear optimization formulation amenable to decomposition. Since SSTs
provide for every input vector a specific survival function associated to a
single leaf node, they satisfy the conditional computation property and inherit
the related benefits. SST and the training formulation combine flexibility with
interpretability: any smooth survival function (parametric, semiparametric, or
nonparametric) estimated through maximum likelihood can be used, and each leaf
node of an SST yields a cluster of distinct survival functions which are
associated to the data points routed to it. Numerical experiments on 15
well-known datasets show that SSTs, with parametric and spline-based
semiparametric survival functions, trained using an adaptation of the
node-based decomposition algorithm proposed by Consolo et al. (2024) for soft
regression trees, outperform three benchmark survival trees in terms of four
widely-used discrimination and calibration measures. SSTs can also be extended
to consider group fairness.
[LINK]
http://arxiv.org/abs/2506.16846v2
[DATE]
2025-06-23 20:06:25+08:00
[CATEGORIES]
cs.LG
AutoPDL: Automatic Prompt Optimization for LLM Agents
[AUTHORS]
Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel
[ABSTRACT]
The performance of large language models (LLMs) depends on how they are
prompted, with choices spanning both the high-level prompting pattern (e.g.,
Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and
few-shot demonstrations). Manually tuning this combination is tedious,
error-prone, and specific to a given LLM and task. Therefore, this paper
proposes AutoPDL, an automated approach to discovering good LLM agent
configurations. Our approach frames this as a structured AutoML problem over a
combinatorial space of agentic and non-agentic prompting patterns and
demonstrations, using successive halving to efficiently navigate this space. We
introduce a library implementing common prompting patterns using the PDL prompt
programming language. AutoPDL solutions are human-readable, editable, and
executable PDL programs that use this library. This approach also enables
source-to-source optimization, allowing human-in-the-loop refinement and reuse.
Evaluations across three tasks and seven LLMs (ranging from 3B to 70B
parameters) show consistent accuracy gains ($9.06\pm15.3$ percentage points),
up to 68.9pp, and reveal that selected prompting strategies vary across models
and tasks.
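
Successive halving, the search strategy named above, can be sketched generically as below. This is a minimal illustration rather than AutoPDL's implementation: the candidate names, the `evaluate` callback, and the budget schedule are hypothetical, and the toy scores are invented for demonstration, not results from the paper.

```python
import random


def successive_halving(candidates, evaluate, min_budget=1, eta=2):
    """Repeatedly keep the best 1/eta of the candidates.

    `evaluate(candidate, budget)` returns a score (higher is better);
    the budget, e.g. the number of validation examples, grows by a
    factor of `eta` every round so survivors are judged more carefully.
    """
    survivors = list(candidates)
    if not survivors:
        raise ValueError("need at least one candidate")
    budget = min_budget
    while len(survivors) > 1:
        scored = [(evaluate(c, budget), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta
    return survivors[0]


# Toy setup: hypothetical configurations with made-up underlying quality.
TOY_QUALITY = {"config_a": 0.40, "config_b": 0.55, "config_c": 0.70, "config_d": 0.60}
_rng = random.Random(0)


def toy_evaluate(name, budget):
    """Noisy score whose noise shrinks as the budget grows."""
    return TOY_QUALITY[name] + _rng.gauss(0.0, 0.05 / budget)


best = successive_halving(TOY_QUALITY, toy_evaluate)
```

The design trades accuracy for cost: weak configurations are discarded after cheap, noisy evaluations, and only the survivors receive larger budgets.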
[LINK]
http://arxiv.org/abs/2504.04365v2
[DATE]
2025-06-23 19:56:03+08:00
[CATEGORIES]
cs.LG
Hidden Breakthroughs in Language Model Training
[AUTHORS]
Sara Kangaslahti, Elan Rosenfeld, Naomi Saphra
[ABSTRACT]
Loss curves are smooth during most of model training, so visible
discontinuities stand out as possible conceptual breakthroughs. Studying these
breakthroughs enables a deeper understanding of learning dynamics, but only
when they are properly identified. This paper argues that similar breakthroughs
occur frequently throughout training but they are obscured by a loss metric
that collapses all variation into a single scalar. To find these hidden
transitions, we introduce POLCA, a method for decomposing changes in loss along
arbitrary bases of the low-rank training subspace. We use our method to
identify clusters of samples that share similar changes in loss during
training, disaggregating the overall loss into that of smaller groups of
conceptually similar data. We validate our method on synthetic arithmetic and
natural language tasks, showing that POLCA recovers clusters that represent
interpretable breakthroughs in the model’s capabilities. We demonstrate the
promise of these hidden phase transitions as a tool for unsupervised
interpretability.
[COMMENTS]
17 pages, 10 figures
[LINK]
http://arxiv.org/abs/2506.15872v2
[DATE]
2025-06-23 19:55:45+08:00
[CATEGORIES]
cs.LG
Transformer World Model for Sample Efficient Multi-Agent Reinforcement Learning
[AUTHORS]
Azad Deihim, Eduardo Alonso, Dimitra Apostolopoulou
[ABSTRACT]
We present the Multi-Agent Transformer World Model (MATWM), a novel
transformer-based world model designed for multi-agent reinforcement learning
in both vector- and image-based environments. MATWM combines a decentralized
imagination framework with a semi-centralized critic and a teammate prediction
module, enabling agents to model and anticipate the behavior of others under
partial observability. To address non-stationarity, we incorporate a
prioritized replay mechanism that trains the world model on recent experiences,
allowing it to adapt to agents’ evolving policies. We evaluated MATWM on a
broad suite of benchmarks, including the StarCraft Multi-Agent Challenge,
PettingZoo, and MeltingPot. MATWM achieves state-of-the-art performance,
outperforming both model-free and prior world model approaches, while
demonstrating strong sample efficiency, achieving near-optimal performance in
as few as 50K environment interactions. Ablation studies confirm the impact of
each component, with substantial gains in coordination-heavy tasks.
[LINK]
http://arxiv.org/abs/2506.18537v1
[DATE]
2025-06-23 19:47:17+08:00
[CATEGORIES]
cs.LG
Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning
[AUTHORS]
Adrià López Escoriza, Nicklas Hansen, Stone Tao, Tongzhou Mu, Hao Su
[ABSTRACT]
Long-horizon tasks in robotic manipulation present significant challenges in
reinforcement learning (RL) due to the difficulty of designing dense reward
functions and effectively exploring the expansive state-action space. However,
despite a lack of dense rewards, these tasks often have a multi-stage
structure, which can be leveraged to decompose the overall objective into
manageable subgoals. In this work, we propose DEMO3, a framework that exploits
this structure for efficient learning from visual inputs. Specifically, our
approach incorporates multi-stage dense reward learning, a bi-phasic training
scheme, and world model learning into a carefully designed
demonstration-augmented RL framework that strongly mitigates the challenge of
exploration in long-horizon tasks. Our evaluations demonstrate that our method
improves data-efficiency by an average of 40% and by 70% on particularly
difficult tasks compared to state-of-the-art approaches. We validate this
across 16 sparse-reward tasks spanning four domains, including challenging
humanoid visual control tasks using as few as five demonstrations.
[COMMENTS]
Project page can be found at
https://adrialopezescoriza.github.io/demo3/
[LINK]
http://arxiv.org/abs/2503.01837v2
[DATE]
2025-06-23 19:41:17+08:00
[CATEGORIES]
cs.LG
A Set-to-Set Distance Measure in Hyperbolic Space
[AUTHORS]
Pengxiang Li, Wei Wu, Zhi Gao, Xiaomeng Fan, Peilin Yu, Yuwei Wu, Zhipeng Lu, Yunde Jia, Mehrtash Harandi
[ABSTRACT]
We propose a hyperbolic set-to-set distance measure for computing
dissimilarity between sets in hyperbolic space. While point-to-point distances
in hyperbolic space effectively capture hierarchical relationships between data
points, many real-world applications require comparing sets of hyperbolic data
points, where the local structure and the global structure of the sets carry
crucial semantic information. The proposed hyperbolic set-to-set distance
measure (HS2SD) integrates both global and local structural information: global
structure through geodesic distances between Einstein midpoints of hyperbolic
sets, and local structure through topological characteristics of the two sets.
To efficiently compute topological differences, we prove that a finite
Thue-Morse sequence of degree and adjacency matrices can serve as a robust
approximation to capture the topological structure of a set. In this case, by
considering the topological differences, HS2SD provides a more nuanced
understanding of the relationships between two hyperbolic sets. Empirical
evaluation on entity matching, standard image classification, and few-shot
image classification demonstrates that our distance measure outperforms
existing methods by effectively modeling the hierarchical and complex
relationships inherent in hyperbolic sets.
[COMMENTS]
24 pages
[LINK]
http://arxiv.org/abs/2506.18529v1
[DATE]
2025-06-23 19:31:40+08:00
[CATEGORIES]
cs.LG
DDOT: A Derivative-directed Dual-decoder Ordinary Differential Equation Transformer for Dynamic System Modeling
[AUTHORS]
Yang Chang, Kuang-Da Wang, Ping-Chun Hsieh, Cheng-Kuan Lin, Wen-Chih Peng
[ABSTRACT]
Uncovering the underlying ordinary differential equations (ODEs) that govern
dynamic systems is crucial for advancing our understanding of complex
phenomena. Traditional symbolic regression methods often struggle to capture
the temporal dynamics and intervariable correlations inherent in ODEs.
ODEFormer, a state-of-the-art method for inferring multidimensional ODEs from
single trajectories, has made notable progress. However, its focus on
single-trajectory evaluation is highly sensitive to initial starting points,
which may not fully reflect true performance. To address this, we propose the
divergence difference metric (DIV-diff), which evaluates divergence over a grid
of points within the target region, offering a comprehensive and stable
analysis of the variable space. Alongside, we introduce DDOT
(Derivative-Directed Dual-Decoder Ordinary Differential Equation Transformer),
a transformer-based model designed to reconstruct multidimensional ODEs in
symbolic form. By incorporating an auxiliary task predicting the ODE’s
derivative, DDOT effectively captures both structure and dynamic behavior.
Experiments on ODEBench show DDOT outperforms existing symbolic regression
methods, achieving an absolute improvement of 4.58% and 1.62% in $P(R^2 > 0.9)$
for reconstruction and generalization tasks, respectively, and an absolute
reduction of 3.55% in DIV-diff. Furthermore, DDOT demonstrates real-world
applicability on an anesthesia dataset, highlighting its practical impact.
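
To make the idea of comparing divergence over a grid concrete, here is a rough sketch. The abstract does not give the exact DIV-diff formula, so the mean absolute gap used below is an assumption, as are the function names and the central-difference step size.

```python
def divergence(field, point, h=1e-5):
    """Numerical divergence of a vector field at a point (central differences)."""
    total = 0.0
    for i in range(len(point)):
        fwd = list(point)
        bwd = list(point)
        fwd[i] += h
        bwd[i] -= h
        total += (field(fwd)[i] - field(bwd)[i]) / (2.0 * h)
    return total

def mean_divergence_gap(true_field, pred_field, grid):
    """Assumed metric: mean absolute divergence difference over grid points."""
    gaps = [abs(divergence(true_field, p) - divergence(pred_field, p))
            for p in grid]
    return sum(gaps) / len(gaps)

# Hypothetical 2-D systems: the true ODE right-hand side and a recovered one.
true_f = lambda x: [x[0], x[1]]
pred_f = lambda x: [2.0 * x[0], x[1]]
grid = [[i * 0.5, j * 0.5] for i in range(-2, 3) for j in range(-2, 3)]
gap = mean_divergence_gap(true_f, pred_f, grid)
```

Evaluating on a grid rather than along one trajectory removes the dependence on a single initial condition, which is the motivation the abstract gives.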
[LINK]
http://arxiv.org/abs/2506.18522v1
[DATE]
2025-06-23 19:24:52+08:00
[CATEGORIES]
cs.LG
Machine-learning based high-bandwidth magnetic sensing
[AUTHORS]
Galya Haim, Stefano Martina, John Howell, Nir Bar-Gill, Filippo Caruso
[ABSTRACT]
Recent years have seen significant growth of quantum technologies, and
specifically quantum sensing, both in terms of the capabilities of advanced
platforms and their applications. One of the leading platforms in this context
is nitrogen-vacancy (NV) color centers in diamond, providing versatile,
high-sensitivity, and high-spatial-resolution magnetic sensing. Nevertheless,
current schemes for spin resonance magnetic sensing (as applied by NV quantum
sensing) suffer from tradeoffs associated with sensitivity, dynamic range, and
bandwidth. Here we address this issue, and implement machine learning tools to
enhance NV magnetic sensing in terms of the sensitivity/bandwidth tradeoff in
large dynamic range scenarios. Our results indicate a potential reduction of
required data points by at least a factor of 3, while maintaining the current
error level. Our results promote quantum machine learning protocols for sensing
applications towards more feasible and efficient quantum technologies.
[COMMENTS]
12 pages including supplementary, 5 figures, 3 supplementary figures
[LINK]
http://arxiv.org/abs/2409.12820v2
[DATE]
2025-06-23 19:20:23+08:00
[CATEGORIES]
cs.LG
Theoretical guarantees for neural estimators in parametric statistics
[AUTHORS]
Almut Rödder, Manuel Hentschel, Sebastian Engelke
[ABSTRACT]
Neural estimators are simulation-based estimators for the parameters of a
family of statistical models, which build a direct mapping from the sample to
the parameter vector. They benefit from the versatility of available network
architectures and efficient training methods developed in the field of deep
learning. Neural estimators are amortized in the sense that, once trained, they
can be applied to any new data set with almost no computational cost. While
many papers have shown very good performance of these methods in simulation
studies and real-world applications, so far no statistical guarantees are
available to support these observations theoretically. In this work, we study
the risk of neural estimators by decomposing it into several terms that can be
analyzed separately. We formulate easy-to-check assumptions ensuring that each
term converges to zero, and we verify them for popular applications of neural
estimators. Our results provide a general recipe to derive theoretical
guarantees also for broader classes of architectures and estimation problems.
[LINK]
http://arxiv.org/abs/2506.18508v1
[DATE]
2025-06-23 19:02:08+08:00
[CATEGORIES]
cs.LG
Indeterminate Probability Theory
[AUTHORS]
Tao Yang, Chuang Liu, Xiaofeng Ma, Weijia Lu, Ning Wu, Bingyang Li, Zhifei Yang, Peng Liu, Lin Sun, Xiaodong Zhang, Can Zhang
[ABSTRACT]
Complex continuous or mixed joint distributions (e.g., P(Y | z_1, z_2, …,
z_N)) generally lack closed-form solutions, often necessitating approximations
such as MCMC. This paper proposes Indeterminate Probability Theory (IPT), which
makes the following contributions: (1) An observer-centered framework in which
experimental outcomes are represented as distributions combining ground truth
with observation error; (2) The introduction of three independence candidate
axioms that enable a two-phase probabilistic inference framework; (3) The
derivation of closed-form solutions for arbitrary complex joint distributions
under this framework. Both the Indeterminate Probability Neural Network (IPNN)
model and the non-neural multivariate time series forecasting application
demonstrate IPT’s effectiveness in modeling high-dimensional distributions,
with successful validation up to 1000 dimensions. Importantly, IPT is
consistent with classical probability theory and subsumes the frequentist
equation in the limit of vanishing observation error.
[COMMENTS]
25 pages
[LINK]
http://arxiv.org/abs/2303.11536v2
[DATE]
2025-06-23 18:56:46+08:00
[CATEGORIES]
cs.LG
PuckTrick: A Library for Making Synthetic Data More Realistic
[AUTHORS]
Alessandra Agostini, Andrea Maurino, Blerina Spahiu
[ABSTRACT]
The increasing reliance on machine learning (ML) models for decision-making
requires high-quality training data. However, access to real-world datasets is
often restricted due to privacy concerns, proprietary restrictions, and
incomplete data availability. As a result, synthetic data generation (SDG) has
emerged as a viable alternative, enabling the creation of artificial datasets
that preserve the statistical properties of real data while ensuring privacy
compliance. Despite its advantages, synthetic data is often overly clean and
lacks real-world imperfections, such as missing values, noise, outliers, and
misclassified labels, which can significantly impact model generalization and
robustness. To address this limitation, we introduce Pucktrick, a Python
library designed to systematically contaminate synthetic datasets by
introducing controlled errors. The library supports multiple error types,
including missing data, noisy values, outliers, label misclassification,
duplication, and class imbalance, offering a structured approach to evaluating
ML model resilience under real-world data imperfections. Pucktrick provides two
contamination modes: one for injecting errors into clean datasets and another
for further corrupting already contaminated datasets. Through extensive
experiments on real-world financial datasets, we evaluate the impact of
systematic data contamination on model performance. Our findings demonstrate
that ML models trained on contaminated synthetic data outperform those trained
on purely synthetic, error-free data, particularly for tree-based and linear
models such as SVMs and Extra Trees.
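
The kind of controlled contamination described above can be illustrated with a small standard-library sketch. It does not use the Pucktrick API; the function names, parameters, and seeding scheme are hypothetical and only cover three of the listed error types (missing data, noisy values, label misclassification).

```python
import random

def _pick_rows(n, rate, rng):
    """Choose round(rate * n) distinct row indices to contaminate."""
    return set(rng.sample(range(n), round(rate * n)))

def inject_missing(values, rate, seed=0):
    """Replace a fraction `rate` of the values with None."""
    bad = _pick_rows(len(values), rate, random.Random(seed))
    return [None if i in bad else v for i, v in enumerate(values)]

def inject_gaussian_noise(values, rate, sigma, seed=0):
    """Add zero-mean Gaussian noise to a fraction `rate` of the values."""
    rng = random.Random(seed)
    bad = _pick_rows(len(values), rate, rng)
    return [v + rng.gauss(0.0, sigma) if i in bad else v
            for i, v in enumerate(values)]

def flip_labels(labels, rate, classes, seed=0):
    """Reassign a fraction `rate` of the labels to a different class."""
    rng = random.Random(seed)
    bad = _pick_rows(len(labels), rate, rng)
    return [rng.choice([c for c in classes if c != y]) if i in bad else y
            for i, y in enumerate(labels)]
```

Fixing the error rate and the seed keeps each contaminated dataset reproducible, so model resilience can be compared across error types at the same contamination level.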
[COMMENTS]
17 pages, 3 figures
[LINK]
http://arxiv.org/abs/2506.18499v1
[DATE]
2025-06-23 18:51:45+08:00
[CATEGORIES]
cs.LG
SPoRt – Safe Policy Ratio: Certified Training and Deployment of Task Policies in Model-Free RL
[AUTHORS]
Jacques Cloete, Nikolaus Vertovec, Alessandro Abate
[ABSTRACT]
To apply reinforcement learning to safety-critical applications, we ought to
provide safety guarantees during both policy training and deployment. In this
work, we present theoretical results that place a bound on the probability of
violating a safety property for a new task-specific policy in a model-free,
episodic setting. This bound, based on a maximum policy ratio computed with
respect to a ‘safe’ base policy, can also be applied to temporally-extended
properties (beyond safety) and to robust control problems. To utilize these
results, we introduce SPoRt, which provides a data-driven method for computing
this bound for the base policy using the scenario approach, and includes
Projected PPO, a new projection-based approach for training the task-specific
policy while maintaining a user-specified bound on property violation. SPoRt
thus enables users to trade off safety guarantees against task-specific
performance. Complementing our theoretical results, we present experimental
results demonstrating this trade-off and comparing the theoretical bound to
posterior bounds derived from empirical violation rates.
[COMMENTS]
9 pages + 16 pages supplementary material, 3 figures + 6 figures
supplementary material
[LINK]
http://arxiv.org/abs/2504.06386v2
[DATE]
2025-06-23 18:50:00+08:00
[CATEGORIES]
cs.LG
Leveraging neural network interatomic potentials for a foundation model of chemistry
[AUTHORS]
So Yeon Kim, Yang Jeong Park, Ju Li
[ABSTRACT]
Large-scale foundation models, including neural network interatomic
potentials (NIPs) in computational materials science, have demonstrated
significant potential. However, despite their success in accelerating atomistic
simulations, NIPs face challenges in directly predicting electronic properties
and often require coupling to higher-scale models or extensive simulations for
macroscopic properties. Machine learning (ML) offers alternatives for
structure-to-property mapping but faces trade-offs: feature-based methods often
lack generalizability, while deep neural networks require significant data and
computational power. To address these trade-offs, we introduce HackNIP, a
two-stage pipeline that leverages pretrained NIPs. This method first extracts
fixed-length feature vectors (embeddings) from NIP foundation models and then
uses these embeddings to train shallow ML models for downstream
structure-to-property predictions. This study investigates whether such a
hybridization approach, by “hacking” the NIP, can outperform end-to-end deep
neural networks, determines the dataset size at which this transfer learning
approach surpasses direct fine-tuning of the NIP, and identifies which NIP
embedding depths yield the most informative features. HackNIP is benchmarked on
Matbench, evaluated for data efficiency, and tested on diverse tasks including
ab initio, experimental, and molecular properties. We also analyze how
embedding depth impacts performance. This work demonstrates a hybridization
strategy to overcome ML trade-offs in materials science, aiming to democratize
high-performance predictive modeling.
[COMMENTS]
29 pages, 10 figures
[LINK]
http://arxiv.org/abs/2506.18497v1
[DATE]
2025-06-23 18:49:19+08:00
[CATEGORIES]
cs.LG
Disentangling representations of retinal images with generative models
[AUTHORS]
Sarah Müller, Lisa M. Koch, Hendrik P. A. Lensch, Philipp Berens
[ABSTRACT]
Retinal fundus images play a crucial role in the early detection of eye
diseases. However, the impact of technical factors on these images can pose
challenges for reliable AI applications in ophthalmology. For example, large
fundus cohorts are often confounded by factors like camera type, bearing the
risk of learning shortcuts rather than the causal relationships behind the
image generation process. Here, we introduce a population model for retinal
fundus images that effectively disentangles patient attributes from camera
effects, enabling controllable and highly realistic image generation. To
achieve this, we propose a disentanglement loss based on distance correlation.
Through qualitative and quantitative analyses, we show that our models encode
desired information in disentangled subspaces and enable controllable image
generation based on the learned subspaces, demonstrating the effectiveness of
our disentanglement loss. The project’s code is publicly available:
https://github.com/berenslab/disentangling-retinal-images.
[COMMENTS]
Final journal paper version for Medical Image Analysis (MedIA)
[LINK]
http://arxiv.org/abs/2402.19186v3
[DATE]
2025-06-23 18:48:12+08:00
[CATEGORIES]
cs.LG
AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing
[AUTHORS]
Aniss Bessalah, Hatem Mohamed Abdelmoumen, Karima Benatchba, Hadjer Benmeziane
[ABSTRACT]
Analog In-memory Computing (AIMC) has emerged as a highly efficient paradigm
for accelerating Deep Neural Networks (DNNs), offering significant energy and
latency benefits over conventional digital hardware. However, state-of-the-art
neural networks are not inherently designed for AIMC, as they fail to account
for its unique non-idealities. Neural Architecture Search (NAS) is thus needed
to systematically discover neural architectures optimized explicitly for AIMC
constraints. However, comparing NAS methodologies and extracting insights about
robust architectures for AIMC requires a dedicated NAS benchmark that
explicitly accounts for AIMC-specific hardware non-idealities. To address this,
we introduce AnalogNAS-Bench, the first NAS benchmark tailored specifically for
AIMC. Our study reveals three key insights: (1) standard quantization
techniques fail to capture AIMC-specific noises, (2) robust architectures tend
to feature wider and branched blocks, (3) skip connections improve resilience
to temporal drift noise. These insights highlight the limitations of current
NAS benchmarks for AIMC and pave the way for future analog-aware NAS. All the
implementations used in this paper can be found at
https://github.com/IBM/analog-nas/tree/main/analognasbench.
[LINK]
http://arxiv.org/abs/2506.18495v1
[DATE]
2025-06-23 18:44:32+08:00
[CATEGORIES]
cs.LG
Reliability-Adjusted Prioritized Experience Replay
[AUTHORS]
Leonard S. Pleiss, Tobias Sutter, Maximilian Schiffer
[ABSTRACT]
Experience replay enables data-efficient learning from past experiences in
online reinforcement learning agents. Traditionally, experiences were sampled
uniformly from a replay buffer, regardless of differences in
experience-specific learning potential. In an effort to sample more
efficiently, researchers introduced Prioritized Experience Replay (PER). In
this paper, we propose an extension to PER by introducing a novel measure of
temporal difference error reliability. We theoretically show that the resulting
transition selection algorithm, Reliability-adjusted Prioritized Experience
Replay (ReaPER), enables more efficient learning than PER. We further present
empirical results showing that ReaPER outperforms PER across various
environment types, including the Atari-5 benchmark.
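
For reference, the baseline that ReaPER extends can be sketched as below. This is standard proportional PER only (priority from the absolute TD error, with importance-sampling weights); the reliability adjustment is not shown because the abstract does not define it. The default values of alpha, beta, and eps are illustrative assumptions.

```python
import random

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Sampling probabilities proportional to (|TD error| + eps) ** alpha."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]

def importance_weights(probs, beta=0.4):
    """Bias-correcting weights (N * P(i)) ** -beta, scaled so the max is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]

def sample_batch(td_errors, batch_size, alpha=0.6, beta=0.4, seed=0):
    """Draw transition indices by priority and return their weights."""
    rng = random.Random(seed)
    probs = per_probabilities(td_errors, alpha)
    indices = rng.choices(range(len(probs)), weights=probs, k=batch_size)
    weights = importance_weights(probs, beta)
    return indices, [weights[i] for i in indices]
```

Setting alpha to 0 recovers the uniform sampling mentioned at the start of the abstract.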
[LINK]
http://arxiv.org/abs/2506.18482v1
[DATE]
2025-06-23 18:35:36+08:00
[CATEGORIES]
cs.LG
A Deep Convolutional Neural Network-Based Novel Class Balancing for Imbalance Data Segmentation
[AUTHORS]
Atifa Kalsoom, M. A. Iftikhar, Amjad Ali, Zubair Shah, Shidin Balakrishnan, Hazrat Ali
[ABSTRACT]
Retinal fundus images provide valuable insights into the human eye’s interior
structure and crucial features, such as blood vessels, optic disk, macula, and
fovea. However, accurate segmentation of retinal blood vessels can be
challenging due to imbalanced data distribution and varying vessel thickness.
In this paper, we propose BLCB-CNN, a novel pipeline based on deep learning and
a bi-level class balancing scheme to achieve vessel segmentation in retinal
fundus images. The BLCB-CNN scheme uses a Convolutional Neural Network (CNN)
architecture and an empirical approach to balance the distribution of pixels
across vessel and non-vessel classes and within thin and thick vessels. Level-I
is used for vessel/non-vessel balancing and Level-II is used for thick/thin
vessel balancing. Additionally, pre-processing of the input retinal fundus
image is performed by Global Contrast Normalization (GCN), Contrast Limited
Adaptive Histogram Equalization (CLAHE), and gamma corrections to increase
intensity uniformity as well as to enhance the contrast between vessels and
background pixels. The resulting balanced dataset is used for
classification-based segmentation of the retinal vascular tree. We evaluate the
proposed scheme on standard retinal fundus images and achieve superior
performance measures, including an area under the ROC curve of 98.23%, Accuracy
of 96.22%, Sensitivity of 81.57%, and Specificity of 97.65%. We also
demonstrate the method’s efficacy through external cross-validation on STARE
images, confirming its generalization ability.
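
The reported accuracy, sensitivity, and specificity follow from per-pixel confusion counts. The sketch below uses the standard definitions; the pixel counts in the example are made up for illustration and are not results from the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, TN, FN for flat binary masks (1 = vessel, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn

def segmentation_metrics(tp, fp, tn, fn):
    """Standard accuracy, sensitivity (recall), and specificity."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical image: 100 vessel pixels among 1000, i.e. a 1:9 imbalance.
metrics = segmentation_metrics(tp=80, fp=18, tn=882, fn=20)
```

With imbalanced classes, accuracy is dominated by the background pixels, so sensitivity is the figure that reflects how many vessel pixels were actually recovered.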
[COMMENTS]
This is a preprint of the paper submitted to the journal Scientific Reports
[LINK]
http://arxiv.org/abs/2506.18474v1
[DATE]
2025-06-23 18:15:54+08:00
[CATEGORIES]
cs.LG
A Motivational Architecture for Open-Ended Learning Challenges in Robots
[AUTHORS]
Alejandro Romero, Gianluca Baldassarre, Richard J. Duro, Vieri Giuliano Santucci
[ABSTRACT]
Developing agents capable of autonomously interacting with complex and
dynamic environments, where task structures may change over time and prior
knowledge cannot be relied upon, is a key prerequisite for deploying artificial
systems in real-world settings. The open-ended learning framework identifies
the core challenges for creating such agents, including the ability to
autonomously generate new goals, acquire the necessary skills (or curricula of
skills) to achieve them, and adapt to non-stationary environments. While many
existing works tackle various aspects of these challenges in isolation, few
propose integrated solutions that address them simultaneously. In this paper,
we introduce H-GRAIL, a hierarchical architecture that, through the use of
different typologies of intrinsic motivations and interconnected learning
mechanisms, autonomously discovers new goals, learns the required skills for
their achievement, generates skill sequences for tackling interdependent tasks,
and adapts to non-stationary environments. We tested H-GRAIL in a real robotic
scenario, demonstrating how the proposed solutions effectively address the
various challenges of open-ended learning.
[COMMENTS]
Accepted to RLDM 2025
[LINK]
http://arxiv.org/abs/2506.18454v1
[DATE]
2025-06-23 17:46:05+08:00
[CATEGORIES]
cs.LG
xInv: Explainable Optimization of Inverse Problems
[AUTHORS]
Sean Memery, Kevin Denamganai, Anna Kapron-King, Kartic Subr
[ABSTRACT]
Inverse problems are central to a wide range of fields, including healthcare,
climate science, and agriculture. They involve the estimation of inputs,
typically via iterative optimization, to some known forward model so that it
produces a desired outcome. Despite considerable development in the
explainability and interpretability of forward models, the iterative
optimization of inverse problems remains largely cryptic to domain experts. We
propose a methodology to produce explanations, from traces produced by an
optimizer, that are interpretable by humans at the abstraction of the domain.
The central idea in our approach is to instrument a differentiable simulator so
that it emits natural language events during its forward and backward passes.
In a post-process, we use a Language Model to create an explanation from the
list of events. We demonstrate the effectiveness of our approach with an
illustrative optimization problem and an example involving the training of a
neural network.
[LINK]
http://arxiv.org/abs/2506.11056v2
[DATE]
2025-06-23 17:40:49+08:00
[CATEGORIES]
cs.LG
TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning
[AUTHORS]
Sheng Wang, Pengan Chen, Jingqi Zhou, Qintong Li, Jingwei Dong, Jiahui Gao, Boyang Xue, Jiyue Jiang, Lingpeng Kong, Chuan Wu
[ABSTRACT]
Model customization necessitates high-quality and diverse datasets, but
acquiring such data remains time-consuming and labor-intensive. Despite the
great potential of large language models (LLMs) for data synthesis, current
approaches are constrained by limited seed data, model biases, and
low-variation prompts, resulting in limited diversity and biased distributions
as the data scale increases. To tackle this challenge, we introduce
TREESYNTH, a tree-guided subspace-based data synthesis approach inspired by
decision trees. It constructs a spatial partitioning tree to recursively divide
a task-specific full data space (i.e., root node) into numerous atomic
subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes
to ensure both distinctiveness and comprehensiveness before synthesizing
samples within each atomic subspace. This globally dividing-and-synthesizing
method finally collects subspace samples into a comprehensive dataset,
effectively circumventing repetition and space collapse to ensure the diversity
of large-scale data synthesis. Furthermore, the spatial partitioning tree
enables sample allocation into atomic subspaces, allowing the rebalancing of
existing datasets for more balanced and comprehensive distributions.
Empirically, extensive experiments across diverse benchmarks consistently
demonstrate the superior data diversity, model performance, and robust
scalability of TREESYNTH compared to both human-crafted datasets and peer data
synthesis methods, with an average performance gain reaching 10%. Besides, the
consistent improvements of TREESYNTH-balanced datasets highlight its
efficacious application to redistribute existing datasets for more
comprehensive coverage and the induced performance enhancement. The code is
available at https://github.com/cpa2001/TreeSynth.
[LINK]
http://arxiv.org/abs/2503.17195v2
[DATE]
2025-06-23 17:32:03+08:00
[CATEGORIES]
cs.LG
LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently
[AUTHORS]
Yuanhe Zhang, Fanghui Liu, Yudong Chen
[ABSTRACT]
This paper explores how theory can guide and enhance practical algorithms,
using Low-Rank Adaptation (LoRA, Hu et al. 2022) in large language models as a
case study. We rigorously prove that, under gradient descent, LoRA adapters
align with specific singular subspaces of the one-step full fine-tuning
gradient. This result suggests that, by properly initializing the adapters
using the one-step full gradient, subspace alignment can be achieved
immediately, for both linear and nonlinear models. Building on our
theory, we propose a theory-driven algorithm, LoRA-One, for which linear
convergence (as well as generalization) is established and for which
incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning.
Besides, our theory reveals connections between LoRA-One and other
gradient-alignment-based methods, helping to clarify misconceptions in the
design of such algorithms. LoRA-One achieves significant empirical improvements
over LoRA and its variants across benchmarks in natural language understanding,
mathematical reasoning, and code generation. Code is available at:
https://github.com/YuanheZ/LoRA-One.
[COMMENTS]
Accepted by ICML 2025 (Oral)
[LINK]
http://arxiv.org/abs/2502.01235v3
[DATE]
2025-06-23 17:29:57+08:00
[CATEGORIES]
cs.LG
New Hardness Results for Low-Rank Matrix Completion
[AUTHORS]
Dror Chawin, Ishay Haviv
[ABSTRACT]
The low-rank matrix completion problem asks whether a given real matrix with
missing values can be completed so that the resulting matrix has low rank or is
close to a low-rank matrix. The completed matrix is often required to satisfy
additional structural constraints, such as positive semi-definiteness or a
bounded infinity norm. The problem arises in various research fields, including
machine learning, statistics, and theoretical computer science, and has broad
real-world applications.
This paper presents new $\mathsf{NP}$-hardness results for low-rank matrix
completion problems. We show that for every sufficiently large integer $d$ and
any real number $\varepsilon \in [2^{-O(d)}, \frac{1}{7}]$, given a partial
matrix $A$ with exposed values of magnitude at most $1$ that admits a positive
semi-definite completion of rank $d$, it is $\mathsf{NP}$-hard to find a
positive semi-definite matrix that agrees with each given value of $A$ up to an
additive error of at most $\varepsilon$, even when the rank is allowed to
exceed $d$ by a multiplicative factor of
$O\left(\frac{1}{\varepsilon^2 \cdot \log(1/\varepsilon)}\right)$. This strengthens a result of Hardt, Meka, Raghavendra,
and Weitz (COLT, 2014), which applies to multiplicative factors smaller than
$2$ and to $\varepsilon$ that decays polynomially in $d$. We establish similar
$\mathsf{NP}$-hardness results for the case where the completed matrix is
constrained to have a bounded infinity norm (rather than be positive
semi-definite), for which all previous hardness results rely on complexity
assumptions related to the Unique Games Conjecture. Our proofs involve a novel
notion of nearly orthonormal representations of graphs, the concept of line
digraphs, and bounds on the rank of perturbed identity matrices.
[COMMENTS]
27 pages
[LINK]
http://arxiv.org/abs/2506.18440v1
[DATE]
2025-06-23 17:22:28+08:00
[CATEGORIES]
cs.LG
Thermal Vision: Pioneering Non-Invasive Temperature Tracking in Congested Spaces
[AUTHORS]
Arijit Samal, Haroon R Lone
[ABSTRACT]
Non-invasive temperature monitoring of individuals plays a crucial role in
identifying and isolating symptomatic individuals. Temperature monitoring
becomes particularly vital in settings characterized by close human proximity,
often referred to as dense settings. However, existing research on non-invasive
temperature estimation using thermal cameras has predominantly focused on
sparse settings. Unfortunately, the risk of disease transmission is
significantly higher in dense settings like movie theaters or classrooms.
Consequently, there is an urgent need to develop robust temperature estimation
methods tailored explicitly for dense settings.
Our study proposes a non-invasive temperature estimation system that combines
a thermal camera with an edge device. Our system employs YOLO models for face
detection and utilizes a regression framework for temperature estimation. We
evaluated the system on a diverse dataset collected in dense and sparse
settings. Our proposed face detection model achieves an impressive mAP score of
over 84 in both in-dataset and cross-dataset evaluations. Furthermore, the
regression framework demonstrates remarkable performance with a mean square
error of 0.18$^{\circ}$C and an impressive $R^2$ score of 0.96. Our
experiments’ results highlight the developed system’s effectiveness,
positioning it as a promising solution for continuous temperature monitoring in
real-world applications. With this paper, we release our dataset and
programming code publicly.
[LINK]
http://arxiv.org/abs/2412.00863v2
[DATE]
2025-06-23 17:17:10+08:00
[CATEGORIES]
cs.LG
How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
[AUTHORS]
Feng He, Zhenyang Liu, Marco Valentino, Zhixue Zhao
[ABSTRACT]
Model editing offers a low-cost technique to inject or correct a particular
behavior in a pre-trained model without extensive retraining, supporting
applications such as factual correction and bias mitigation. Despite this
common practice, it remains unknown whether edits persist after fine-tuning or
whether they are inadvertently reversed. This question has fundamental
practical implications. For example, if fine-tuning removes prior edits, it
could serve as a defence mechanism against hidden malicious edits. Vice versa,
the unintended removal of edits related to bias mitigation could pose serious
safety concerns. We systematically investigate the interaction between model
editing and fine-tuning in the context of T2I diffusion models, which are known
to exhibit biases and generate inappropriate content. Our study spans two T2I
model families (Stable Diffusion and FLUX), two state-of-the-art editing techniques, and
three fine-tuning methods (DreamBooth, LoRA, and DoRA). Through an extensive
empirical analysis across diverse editing tasks and evaluation metrics, our
findings reveal a trend: edits generally fail to persist through fine-tuning,
even when fine-tuning is tangential or unrelated to the edits. Notably, we
observe that DoRA exhibits the strongest edit reversal effect. At the same
time, among editing methods, UCE demonstrates greater robustness, retaining
significantly higher efficacy post-fine-tuning compared to ReFACT. These
findings highlight a crucial limitation in current editing methodologies,
emphasizing the need for more robust techniques to ensure reliable long-term
control and alignment of deployed AI systems. These findings have dual
implications for AI safety: they suggest that fine-tuning could serve as a
remediation mechanism for malicious edits while simultaneously highlighting the
need for re-editing after fine-tuning to maintain beneficial safety and
alignment properties.
[LINK]
http://arxiv.org/abs/2506.18428v1
[DATE]
2025-06-23 17:10:29+08:00
[CATEGORIES]
cs.LG
An Expanded Benchmark that Rediscovers and Affirms the Edge of Uncertainty Sampling for Active Learning in Tabular Datasets
[AUTHORS]
Po-Yi Lu, Yi-Jie Cheng, Chun-Liang Li, Hsuan-Tien Lin
[ABSTRACT]
Active Learning (AL) addresses the crucial challenge of enabling machines to
efficiently gather labeled examples through strategic queries. Among the many
AL strategies, Uncertainty Sampling (US) stands out as one of the most widely
adopted. US queries the example(s) that the current model finds uncertain,
proving to be both straightforward and effective. Despite claims in the
literature suggesting superior alternatives to US, community-wide acceptance
remains elusive. In fact, existing benchmarks for tabular datasets present
conflicting conclusions on the continued competitiveness of US. In this study,
we review the literature on AL strategies in the last decade and build the most
comprehensive open-source AL benchmark to date to understand the relative
merits of different AL strategies. The benchmark surpasses existing ones by
encompassing a broader coverage of strategies, models, and data. Through our
investigation of the conflicting conclusions in existing tabular AL benchmarks
by evaluation under broad AL experimental settings, we uncover fresh insights
into an often-overlooked issue in the use of machine learning models: model
compatibility in the context of US. Specifically, we notice that adopting
different models for querying unlabeled examples and for the learning task
degrades US’s effectiveness. Notably, our findings affirm that US maintains a
competitive edge over other strategies when paired with compatible models.
These findings have practical implications and provide a concrete recipe for AL
practitioners, empowering them to make informed decisions when working with
tabular classifications with limited labeled data. The code for this project is
available on https://github.com/ariapoy/active-learning-benchmark.
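
Uncertainty Sampling as described above can be sketched in a few lines. This version uses predictive entropy as the uncertainty score, which is one common choice among several (least confidence and margin are others); the function names and the example pool are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def query_most_uncertain(pool_probs, k=1):
    """Indices of the k unlabeled examples with the highest predictive entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:k]

# Class probabilities predicted by the current model on an unlabeled pool.
pool = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
to_label = query_most_uncertain(pool, k=2)
```

The compatibility issue raised in the abstract corresponds to whether the model that produces `pool` is the same model that is later trained on the newly labeled examples.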
[LINK]
http://arxiv.org/abs/2306.08954v3
[DATE]
2025-06-23 16:59:45+08:00
[CATEGORIES]
cs.LG
FARCLUSS: Fuzzy Adaptive Rebalancing and Contrastive Uncertainty Learning for Semi-Supervised Semantic Segmentation
[AUTHORS]
Ebenezer Tarubinga, Jenifer Kalafatovich, Seong-Whan Lee
[ABSTRACT]
Semi-supervised semantic segmentation (SSSS) faces persistent challenges in
effectively leveraging unlabeled data, such as ineffective utilization of
pseudo-labels, exacerbation of class imbalance biases, and neglect of
prediction uncertainty. Current approaches often discard uncertain regions
through strict thresholding favouring dominant classes. To address these
limitations, we introduce a holistic framework that transforms uncertainty into
a learning asset through four principal components: (1) fuzzy pseudo-labeling,
which preserves soft class distributions from top-K predictions to enrich
supervision; (2) uncertainty-aware dynamic weighting, which modulates pixel-wise
contributions via entropy-based reliability scores; (3) adaptive class
rebalancing, which dynamically adjusts losses to counteract long-tailed class
distributions; and (4) lightweight contrastive regularization, which encourages
compact and discriminative feature embeddings. Extensive experiments on
benchmarks demonstrate that our method outperforms current state-of-the-art
approaches, achieving significant improvements in the segmentation of
under-represented classes and ambiguous regions.
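
Components (1) and (2) above can be sketched for a single pixel as follows. The abstract does not give the exact formulas, so the top-K renormalization and the reliability score of one minus normalized entropy are assumptions, as are all function names.

```python
import math

def fuzzy_pseudo_label(probs, k=2):
    """Keep the top-k class probabilities and renormalize them to sum to 1."""
    top = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]
    mass = sum(probs[c] for c in top)
    return [probs[c] / mass if c in top else 0.0 for c in range(len(probs))]

def reliability(probs):
    """Assumed score: 1 minus entropy normalized by log(number of classes)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - h / math.log(len(probs))

def weighted_soft_cross_entropy(pred, target, weight):
    """Cross-entropy against a soft target, scaled by a per-pixel weight."""
    ce = -sum(t * math.log(max(p, 1e-12)) for p, t in zip(pred, target))
    return weight * ce

# Hypothetical teacher and student predictions for one pixel, three classes.
teacher = [0.6, 0.3, 0.1]
student = [0.5, 0.4, 0.1]
target = fuzzy_pseudo_label(teacher, k=2)
loss = weighted_soft_cross_entropy(student, target, reliability(teacher))
```

Under this scheme an uncertain pixel still contributes a soft training signal with a reduced weight, instead of being discarded by a hard confidence threshold.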
[COMMENTS]
Submitted to Neural Networks
[LINK]
http://arxiv.org/abs/2506.11142v2
[DATE]
2025-06-23 16:58:30+08:00
[CATEGORIES]
cs.LG
Generative Modeling of Full-Atom Protein Conformations using Latent Diffusion on Graph Embeddings
[AUTHORS]
Aditya Sengar, Ali Hariri, Daniel Probst, Patrick Barth, Pierre Vandergheynst
[ABSTRACT]
Generating diverse, all-atom conformational ensembles of dynamic proteins
such as G-protein-coupled receptors (GPCRs) is critical for understanding their
function, yet most generative models simplify atomic detail or ignore
conformational diversity altogether. We present latent diffusion for full
protein generation (LD-FPG), a framework that constructs complete all-atom
protein structures, including every side-chain heavy atom, directly from
molecular dynamics (MD) trajectories. LD-FPG employs a Chebyshev graph neural
network (ChebNet) to obtain low-dimensional latent embeddings of protein
conformations, which are processed using three pooling strategies: blind,
sequential and residue-based. A diffusion model trained on these latent
representations generates new samples that a decoder, optionally regularized by
dihedral-angle losses, maps back to Cartesian coordinates. Using D2R-MD, a
2-microsecond MD trajectory (12 000 frames) of the human dopamine D2 receptor
in a membrane environment, the sequential and residue-based pooling strategies
reproduce the reference ensemble with high structural fidelity (all-atom lDDT
of approximately 0.7; C-alpha-lDDT of approximately 0.8) and recover backbone
and side-chain dihedral-angle distributions with a Jensen-Shannon divergence of
less than 0.03 compared to the MD data. LD-FPG thereby offers a practical route
to system-specific, all-atom ensemble generation for large proteins, providing
a promising tool for structure-based therapeutic design on complex, dynamic
targets. The D2R-MD dataset and our implementation are freely available to
facilitate further research.
[COMMENTS]
10 pages (main text), 4 figures, 2 tables. Submitted to NeurIPS 2025.
Code and data are publicly available
[LINK]
http://arxiv.org/abs/2506.17064v2
[DATE]
2025-06-23 16:56:39+08:00
[CATEGORIES]
cs.LG
Online high-precision prediction method for injection molding product weight by integrating time series/non-time series mixed features and feature attention mechanism
[AUTHORS]
Maoyuan Li, Sihong Li, Guancheng Shen, Yun Zhang, Huamin Zhou
[ABSTRACT]
To address the challenges of untimely detection and online monitoring lag in
injection molding quality anomalies, this study proposes a mixed feature
attention-artificial neural network (MFA-ANN) model for high-precision online
prediction of product weight. By integrating mechanism-based with data-driven
analysis, the proposed architecture decouples time series data (e.g., melt flow
dynamics, thermal profiles) from non-time series data (e.g., mold features,
pressure settings), enabling hierarchical feature extraction. A self-attention
mechanism is strategically embedded during cross-domain feature fusion to
dynamically calibrate inter-modality feature weights, thereby emphasizing
critical determinants of weight variability. The results demonstrate that the
MFA-ANN model achieves an RMSE of 0.0281 with 0.5 g weight fluctuation
tolerance, outperforming conventional benchmarks: a 25.1% accuracy improvement
over non-time series ANN models, 23.0% over LSTM networks, 25.7% over SVR, and
15.6% over RF models. Ablation studies quantitatively validate
the synergistic enhancement derived from the integration of mixed feature
modeling (contributing 22.4%) and the attention mechanism (contributing 11.2%),
significantly enhancing the model’s adaptability to varying working conditions
and its resistance to noise. Moreover, sensitivity analyses
reveal that data resolution significantly impacts prediction reliability:
low-fidelity sensor inputs degrade performance by 23.8% RMSE compared to
high-precision measurements. Overall, this study provides an efficient and
reliable solution for the intelligent quality control of injection molding
processes.
[LINK]
http://arxiv.org/abs/2506.18950v1
[DATE]
2025-06-23 16:40:50+08:00
[CATEGORIES]
cs.LG
ADNF-Clustering: An Adaptive and Dynamic Neuro-Fuzzy Clustering for Leukemia Prediction
[AUTHORS]
Marco Aruta, Ciro Listone, Giuseppe Murano, Aniello Murano
[ABSTRACT]
Leukemia diagnosis and monitoring rely increasingly on high-throughput image
data, yet conventional clustering methods lack the flexibility to accommodate
evolving cellular patterns and quantify uncertainty in real time. We introduce
Adaptive and Dynamic Neuro-Fuzzy Clustering, a novel streaming-capable
framework that combines Convolutional Neural Network-based feature extraction
with an online fuzzy clustering engine. ADNF initializes soft partitions via
Fuzzy C-Means, then continuously updates micro-cluster centers, densities, and
fuzziness parameters using a Fuzzy Temporal Index (FTI) that measures entropy
evolution. A topology refinement stage performs density-weighted merging and
entropy-guided splitting to guard against over- and under-segmentation. On the
C-NMC leukemia microscopy dataset, our tool achieves a silhouette score of
0.51, demonstrating superior cohesion and separation over static baselines. The
method’s adaptive uncertainty modeling and label-free operation hold immediate
potential for integration within the INFANT pediatric oncology network,
enabling scalable, up-to-date support for personalized leukemia management.
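For reference, the standard Fuzzy C-Means updates used for the initial soft partition can be sketched as below. This covers only the textbook initialization step; the streaming updates and the Fuzzy Temporal Index are not shown, and the function names and the default fuzzifier m=2 are assumptions made for illustration.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-12):
    """Standard FCM membership update.

    X: (n_samples, n_features), centers: (n_clusters, n_features).
    Returns U of shape (n_samples, n_clusters) with rows summing to 1.
    """
    # Euclidean distance from every sample to every center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm_centers(X, U, m=2.0):
    """Standard FCM center update: membership-weighted mean of the samples."""
    W = U ** m
    return (W.T @ X) / W.sum(axis=0)[:, None]
```

Alternating the two updates until the memberships stabilize yields the soft partition that an online method could then refine.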
[COMMENTS]
6 pages, 1 figure, under review
[LINK]
http://arxiv.org/abs/2506.18396v1
[DATE]
2025-06-23 16:30:17+08:00
[CATEGORIES]
cs.LG
LOGICPO: Efficient Translation of NL-based Logical Problems to FOL using LLMs and Preference Optimization
[AUTHORS]
Koushik Viswanadha, Deepanway Ghosal, Somak Aditya
[ABSTRACT]
Logical reasoning is a key task for artificial intelligence due to its role
in major downstream tasks such as question answering and summarization. Recent
methods for improving the reasoning ability of LLMs fall short in correctly
converting a natural language reasoning problem to an equivalent logical
formulation, which hinders the framework’s overall ability to reason. Towards
this, we propose to use finetuning on a preference optimization dataset to
learn to parse and represent a natural language problem as a whole to a
consistent logical program by 1) introducing a new supervised and preference
optimization dataset LogicPO, and 2) adopting popular techniques such as Direct
Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) to finetune
open-source LLMs. Our best model with Phi-3.5 consistently outperforms
GPT-3.5-turbo (8-shot), producing 10% more logically correct outputs with 14%
fewer syntax errors. Through the framework and our improved evaluation metrics,
we offer a promising direction in improving the logical reasoning of LLMs by
better representing them in their logical formulations.
[LINK]
http://arxiv.org/abs/2506.18383v1
[DATE]
2025-06-23 16:15:24+08:00
[CATEGORIES]
cs.LG
PERSCEN: Learning Personalized Interaction Pattern and Scenario Preference for Multi-Scenario Matching
[AUTHORS]
Haotong Du, Yaqing Wang, Fei Xiong, Lei Shao, Ming Liu, Hao Gu, Quanming Yao, Zhen Wang
[ABSTRACT]
With the expansion of business scales and scopes on online platforms,
multi-scenario matching has become a mainstream solution to reduce maintenance
costs and alleviate data sparsity. The key to effective multi-scenario
recommendation lies in capturing both user preferences shared across all
scenarios and scenario-aware preferences specific to each scenario. However,
existing methods often overlook user-specific modeling, limiting the generation
of personalized user representations. To address this, we propose PERSCEN, an
innovative approach that incorporates user-specific modeling into
multi-scenario matching. PERSCEN constructs a user-specific feature graph based
on user characteristics and employs a lightweight graph neural network to
capture higher-order interaction patterns, enabling personalized extraction of
preferences shared across scenarios. Additionally, we leverage vector
quantization techniques to distil scenario-aware preferences from users’
behavior sequences within individual scenarios, facilitating user-specific and
scenario-aware preference modeling. To enable efficient and flexible
information transfer, we introduce a progressive scenario-aware gated linear
unit that allows fine-grained, low-latency fusion. Extensive experiments
demonstrate that PERSCEN outperforms existing methods. Further efficiency
analysis confirms that PERSCEN effectively balances performance with
computational cost, ensuring its practicality for real-world industrial
systems.
[COMMENTS]
Accepted by KDD 2025
[LINK]
http://arxiv.org/abs/2506.18382v1
[DATE]
2025-06-23 16:15:16+08:00
[CATEGORIES]
cs.LG
Holistic Physics Solver: Learning PDEs in a Unified Spectral-Physical Space
[AUTHORS]
Xihang Yue, Yi Yang, Linchao Zhu
[ABSTRACT]
Recent advances in operator learning have produced two distinct approaches
for solving partial differential equations (PDEs): attention-based methods
offering point-level adaptability but lacking spectral constraints, and
spectral-based methods providing domain-level continuity priors but limited in
local flexibility. This dichotomy has hindered the development of PDE solvers
with both strong flexibility and generalization capability. This work
introduces Holistic Physics Mixer (HPM), a simple framework that bridges this
gap by integrating spectral and physical information in a unified space. HPM
unifies both approaches as special cases while enabling more powerful
spectral-physical interactions beyond either method alone. This enables HPM to
inherit both the strong generalization of spectral methods and the flexibility
of attention mechanisms while avoiding their respective limitations. Through
extensive experiments across diverse PDE problems, we demonstrate that HPM
consistently outperforms state-of-the-art methods in both accuracy and
computational efficiency, while maintaining strong generalization capabilities
with limited training data and excellent zero-shot performance on unseen
resolutions.
[COMMENTS]
ICML2025
[LINK]
http://arxiv.org/abs/2410.11382v2
[DATE]
2025-06-23 16:07:36+08:00
[CATEGORIES]
cs.LG
Persistent Sampling: Enhancing the Efficiency of Sequential Monte Carlo
[AUTHORS]
Minas Karamanis, Uroš Seljak
[ABSTRACT]
Sequential Monte Carlo (SMC) samplers are powerful tools for Bayesian
inference but suffer from high computational costs due to their reliance on
large particle ensembles for accurate estimates. We introduce persistent
sampling (PS), an extension of SMC that systematically retains and reuses
particles from all prior iterations to construct a growing, weighted ensemble.
By leveraging multiple importance sampling and resampling from a mixture of
historical distributions, PS mitigates the need for excessively large particle
counts, directly addressing key limitations of SMC such as particle
impoverishment and mode collapse. Crucially, PS achieves this without
additional likelihood evaluations: weights for persistent particles are computed
using cached likelihood values. This framework not only yields more accurate
posterior approximations but also produces marginal likelihood estimates with
significantly lower variance, enhancing reliability in model comparison.
Furthermore, the persistent ensemble enables efficient adaptation of transition
kernels by leveraging a larger, decorrelated particle pool. Experiments on
high-dimensional Gaussian mixtures, hierarchical models, and non-convex targets
demonstrate that PS consistently outperforms standard SMC and related variants,
including recycled and waste-free SMC, achieving substantial reductions in mean
squared error for posterior expectations and evidence estimates, all at reduced
computational cost. PS thus establishes itself as a robust, scalable, and
efficient alternative for complex Bayesian inference tasks.
[COMMENTS]
37 pages, 9 figures. Submitted to Statistics & Computing
[LINK]
http://arxiv.org/abs/2407.20722v3
[DATE]
2025-06-23 15:59:17+08:00
[CATEGORIES]
cs.LG
DipLLM: Fine-Tuning LLM for Strategic Decision-making in Diplomacy
[AUTHORS]
Kaixuan Xu, Jiajun Chai, Sicheng Li, Yuqian Fu, Yuanheng Zhu, Dongbin Zhao
[ABSTRACT]
Diplomacy is a complex multiplayer game that requires both cooperation and
competition, posing significant challenges for AI systems. Traditional methods
rely on equilibrium search to generate extensive game data for training, which
demands substantial computational resources. Large Language Models (LLMs) offer
a promising alternative, leveraging pre-trained knowledge to achieve strong
performance with relatively small-scale fine-tuning. However, applying LLMs to
Diplomacy remains challenging due to the exponential growth of possible action
combinations and the intricate strategic interactions among players. To address
this challenge, we propose DipLLM, a fine-tuned LLM-based agent that learns
equilibrium policies for Diplomacy. DipLLM employs an autoregressive
factorization framework to simplify the complex task of multi-unit action
assignment into a sequence of unit-level decisions. By defining an equilibrium
policy within this framework as the learning objective, we fine-tune the model
using only 1.5% of the data required by the state-of-the-art Cicero model,
surpassing its performance. Our results demonstrate the potential of fine-tuned
LLMs for tackling complex strategic decision-making in multiplayer games.
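The autoregressive factorization described above can be written as follows. The notation is assumed for illustration and is not taken from the paper: $s$ is the game state, $a = (a_1, \ldots, a_D)$ is the joint action over $D$ units, and $\pi_\theta$ is the fine-tuned policy.

```latex
\pi_\theta(a \mid s) \;=\; \prod_{d=1}^{D} \pi_\theta\!\left(a_d \mid s,\, a_1, \ldots, a_{d-1}\right)
```

Each factor is a decision over a single unit's orders, so the model never has to enumerate the exponentially large joint action space directly.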
[COMMENTS]
Accepted to the 42nd International Conference on Machine Learning
(ICML 2025)
[LINK]
http://arxiv.org/abs/2506.09655v2
[DATE]
2025-06-23 15:49:08+08:00
[CATEGORIES]
cs.LG
Global Context-aware Representation Learning for Spatially Resolved Transcriptomics
[AUTHORS]
Yunhak Oh, Junseok Lee, Yeongmin Kim, Sangwoo Seo, Namkyeong Lee, Chanyoung Park
[ABSTRACT]
Spatially Resolved Transcriptomics (SRT) is a cutting-edge technique that
captures the spatial context of cells within tissues, enabling the study of
complex biological networks. Recent graph-based methods leverage both gene
expression and spatial information to identify relevant spatial domains.
However, these approaches fall short in obtaining meaningful spot
representations, especially for spots near spatial domain boundaries, as they
heavily emphasize adjacent spots that have minimal feature differences from an
anchor node. To address this, we propose Spotscape, a novel framework that
introduces the Similarity Telescope module to capture global relationships
between multiple spots. Additionally, we propose a similarity scaling strategy
to regulate the distances between intra- and inter-slice spots, facilitating
effective multi-slice integration. Extensive experiments demonstrate the
superiority of Spotscape in various downstream tasks, including single-slice
and multi-slice scenarios. Our code is available at the following link:
https://github.com/yunhak0/Spotscape.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2506.15698v2
[DATE]
2025-06-23 15:46:50+08:00
[CATEGORIES]
cs.LG
Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration
[AUTHORS]
Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, Biqing Qi
[ABSTRACT]
Heterogeneous Large Language Model (LLM) fusion integrates the strengths of
multiple source LLMs with different architectures into a target LLM with low
computational overhead. While promising, existing methods suffer from two major
limitations: 1) reliance on real data from limited domains for knowledge fusion,
preventing the target LLM from fully acquiring knowledge across diverse
domains, and 2) fixed data allocation proportions across domains, failing to
dynamically adjust according to the target LLM’s varying capabilities across
domains, leading to a capability imbalance. To overcome these limitations, we
propose Bohdi, a synthetic-data-only heterogeneous LLM fusion framework.
Through the organization of knowledge domains into a hierarchical tree
structure, Bohdi enables automatic domain exploration and multi-domain data
generation through multi-model collaboration, thereby comprehensively
extracting knowledge from source LLMs. By formalizing domain expansion and data
sampling proportion allocation on the knowledge tree as a Hierarchical
Multi-Armed Bandit problem, Bohdi leverages the designed DynaBranches mechanism
to adaptively adjust sampling proportions based on the target LLM’s performance
feedback across domains. Integrated with our proposed Introspection-Rebirth
(IR) mechanism, DynaBranches dynamically tracks capability shifts during the target
LLM’s updates via Sliding Window Binomial Likelihood Ratio Testing (SWBLRT),
further enhancing its online adaptation capability. Comparative experimental
results on a comprehensive suite of benchmarks demonstrate that Bohdi
significantly outperforms existing baselines on multiple target LLMs, exhibits
higher data efficiency, and virtually eliminates the imbalance in the target
LLM’s capabilities. Our code is available at
https://github.com/gjq100/Bohdi.git.
[LINK]
http://arxiv.org/abs/2506.15721v2
[DATE]
2025-06-23 15:03:18+08:00
[CATEGORIES]
cs.LG
LoopSR: Looping Sim-and-Real for Lifelong Policy Adaptation of Legged Robots
[AUTHORS]
Peilin Wu, Weiji Xie, Jiahang Cao, Hang Lai, Weinan Zhang
[ABSTRACT]
Reinforcement Learning (RL) has shown remarkable and generalizable
capability in legged locomotion through sim-to-real transfer. However, while
adaptive methods like domain randomization are expected to enhance policy
robustness across diverse environments, they potentially compromise the
policy’s performance in any specific environment, leading to suboptimal
real-world deployment due to the No Free Lunch theorem. To address this, we
propose LoopSR, a lifelong policy adaptation framework that continuously
refines RL policies in the post-deployment stage. LoopSR employs a
transformer-based encoder to map real-world trajectories into a latent space
and reconstruct a digital twin of the real world for further improvement.
An autoencoder architecture and contrastive learning methods are adopted to
enhance feature extraction of real-world dynamics. Simulation parameters for
continual training are derived by combining predicted values from the decoder
with retrieved parameters from a pre-collected simulation trajectory dataset.
By leveraging simulated continual training, LoopSR achieves superior data
efficiency compared with strong baselines, yielding strong performance with
limited data in both sim-to-sim and sim-to-real experiments.
[COMMENTS]
IROS 2025
[LINK]
http://arxiv.org/abs/2409.17992v2
[DATE]
2025-06-23 14:59:08+08:00
[CATEGORIES]
cs.LG
Dynamic Hybrid Modeling: Incremental Identification and Model Predictive Control
[AUTHORS]
Adrian Caspari, Thomas Bierweiler, Sarah Fadda, Daniel Labisch, Maarten Nauta, Franzisko Wagner, Merle Warmbold, Constantinos C. Pantelides
[ABSTRACT]
Mathematical models are crucial for optimizing and controlling chemical
processes, yet they often face significant limitations in terms of
computational time, algorithm complexity, and development costs. Hybrid models,
which combine mechanistic models with data-driven models (i.e. models derived
via the application of machine learning to experimental data), have emerged as
a promising solution to these challenges. However, the identification of
dynamic hybrid models remains difficult due to the need to integrate
data-driven models within mechanistic model structures. We present an
incremental identification approach for dynamic hybrid models that decouples
the mechanistic and data-driven components to overcome computational and
conceptual difficulties. Our methodology comprises four key steps: (1)
regularized dynamic parameter estimation to determine optimal time profiles for
flux variables, (2) correlation analysis to evaluate relationships between
variables, (3) data-driven model identification using advanced machine learning
techniques, and (4) hybrid model integration to combine the mechanistic and
data-driven components. This approach facilitates early evaluation of model
structure suitability, accelerates the development of hybrid models, and allows
for independent identification of data-driven components. Three case studies
are presented to illustrate the robustness, reliability, and efficiency of our
incremental approach in handling complex systems and scenarios with limited
data.
[COMMENTS]
18 pages, 10 Figures
[LINK]
http://arxiv.org/abs/2506.18344v1
[DATE]
2025-06-23 14:55:32+08:00
[CATEGORIES]
cs.LG
Controlled Generation with Equivariant Variational Flow Matching
[AUTHORS]
Floor Eijkelboom, Heiko Zimmermann, Sharvaree Vadgama, Erik J Bekkers, Max Welling, Christian A. Naesseth, Jan-Willem van de Meent
[ABSTRACT]
We derive a controlled generation objective within the framework of
Variational Flow Matching (VFM), which casts flow matching as a variational
inference problem. We demonstrate that controlled generation can be implemented
in two ways: (1) through end-to-end training of conditional generative models,
or (2) as a Bayesian inference problem, enabling post hoc control of
unconditional models without retraining. Furthermore, we establish the
conditions required for equivariant generation and provide an equivariant
formulation of VFM tailored for molecular generation, ensuring invariance to
rotations, translations, and permutations. We evaluate our approach on both
uncontrolled and controlled molecular generation, achieving state-of-the-art
performance on uncontrolled generation and outperforming state-of-the-art
models in controlled generation, both with end-to-end training and in the
Bayesian inference setting. This work strengthens the connection between
flow-based generative modeling and Bayesian inference, offering a scalable and
principled framework for constraint-driven and symmetry-aware generation.
[LINK]
http://arxiv.org/abs/2506.18340v1
[DATE]
2025-06-23 14:42:48+08:00
[CATEGORIES]
cs.LG
Structured Kolmogorov-Arnold Neural ODEs for Interpretable Learning and Symbolic Discovery of Nonlinear Dynamics
[AUTHORS]
Wei Liu, Kiran Bacsa, Loon Ching Tang, Eleni Chatzi
[ABSTRACT]
Understanding and modeling nonlinear dynamical systems is a fundamental
problem across scientific and engineering domains. While deep learning has
demonstrated remarkable potential for learning complex system behavior,
achieving models that are both highly accurate and physically interpretable
remains a major challenge. To address this, we propose Structured
Kolmogorov-Arnold Neural ODEs (SKANODEs), a novel framework that integrates
structured state-space modeling with the Kolmogorov-Arnold Network (KAN).
SKANODE first employs a fully trainable KAN as a universal function
approximator within a structured Neural ODE framework to perform virtual
sensing, recovering latent states that correspond to physically interpretable
quantities such as positions and velocities. Once this structured latent
representation is established, we exploit the symbolic regression capability of
KAN to extract compact and interpretable expressions for the system’s governing
dynamics. The resulting symbolic expression is then substituted back into the
Neural ODE framework and further calibrated through continued training to
refine its coefficients, enhancing both the precision of the discovered
equations and the predictive accuracy of system responses. Extensive
experiments on both simulated and real-world systems demonstrate that SKANODE
achieves superior performance while offering interpretable, physics-consistent
models that uncover the underlying mechanisms of nonlinear dynamical systems.
[LINK]
http://arxiv.org/abs/2506.18339v1
[DATE]
2025-06-23 14:42:43+08:00
[CATEGORIES]
cs.LG
A Transformer-Based Approach for Diagnosing Fault Cases in Optical Fiber Amplifiers
[AUTHORS]
Dominic Schneider, Lutz Rapp, Christoph Ament
[ABSTRACT]
A transformer-based deep learning approach is presented that enables the
diagnosis of fault cases in optical fiber amplifiers using condition-based
monitoring time series data. The model, Inverse Triple-Aspect Self-Attention
Transformer (ITST), uses an encoder-decoder architecture, utilizing three
feature extraction paths in the encoder, feature-engineered data for the
decoder and a self-attention mechanism. The results show that ITST outperforms
state-of-the-art models in terms of classification accuracy, which enables
predictive maintenance for optical fiber amplifiers, reducing network downtimes
and maintenance costs.
[COMMENTS]
This paper has been accepted for publication at the 25th
International Conference on Transparent Optical Networks (ICTON) 2025
[LINK]
http://arxiv.org/abs/2505.06245v2
[DATE]
2025-06-23 14:06:01+08:00
[CATEGORIES]
cs.LG
BrainSymphony: A Transformer-Driven Fusion of fMRI Time Series and Structural Connectivity
[AUTHORS]
Moein Khajehnejad, Forough Habibollahi, Adeel Razi
[ABSTRACT]
Existing foundation models for neuroimaging are often prohibitively large and
data-intensive. We introduce BrainSymphony, a lightweight, parameter-efficient
foundation model that achieves state-of-the-art performance while being
pre-trained on significantly smaller public datasets. BrainSymphony’s strong
multimodal architecture processes functional MRI data through parallel spatial
and temporal transformer streams, which are then efficiently distilled into a
unified representation by a Perceiver module. Concurrently, it models
structural connectivity from diffusion MRI using a novel signed graph
transformer to encode the brain’s anatomical structure. These powerful,
modality-specific representations are then integrated via an adaptive fusion
gate. Despite its compact design, our model consistently outperforms larger
models on a diverse range of downstream benchmarks, including classification,
prediction, and unsupervised network identification tasks. Furthermore, our
model revealed novel insights into brain dynamics using attention maps on a
unique external psilocybin neuroimaging dataset (pre- and post-administration).
BrainSymphony establishes that architecturally-aware, multimodal models can
surpass their larger counterparts, paving the way for more accessible and
powerful research in computational neuroscience.
[COMMENTS]
21 pages, 8 figures
[LINK]
http://arxiv.org/abs/2506.18314v1
[DATE]
2025-06-23 14:00:21+08:00
[CATEGORIES]
cs.LG
Sharpening the Spear: Adaptive Expert-Guided Adversarial Attack Against DRL-based Autonomous Driving Policies
[AUTHORS]
Junchao Fan, Xuyang Lei, Xiaolin Chang
[ABSTRACT]
Deep reinforcement learning (DRL) has emerged as a promising paradigm for
autonomous driving. However, despite their advanced capabilities, DRL-based
policies remain highly vulnerable to adversarial attacks, posing serious safety
risks in real-world deployments. Investigating such attacks is crucial for
revealing policy vulnerabilities and guiding the development of more robust
autonomous systems. While prior attack methods have made notable progress, they
still face several challenges: 1) they often rely on high-frequency attacks,
yet critical attack opportunities are typically context-dependent and
temporally sparse, resulting in inefficient attack patterns; 2) restricting
attack frequency can improve efficiency but often results in unstable training
due to the adversary’s limited exploration. To address these challenges, we
propose an adaptive expert-guided adversarial attack method that enhances both
the stability and efficiency of attack policy training. Our method first
derives an expert policy from successful attack demonstrations using imitation
learning, strengthened by an ensemble Mixture-of-Experts architecture for
robust generalization across scenarios. This expert policy then guides a
DRL-based adversary through a KL-divergence regularization term. Due to the
diversity of scenarios, expert policies may be imperfect. To address this, we
further introduce a performance-aware annealing strategy that gradually reduces
reliance on the expert as the adversary improves. Extensive experiments
demonstrate that our method outperforms existing approaches in terms
of collision rate, attack efficiency, and training stability, especially in
cases where the expert policy is sub-optimal.
[COMMENTS]
12 pages, 3 figures, 2 tables
[LINK]
http://arxiv.org/abs/2506.18304v1
[DATE]
2025-06-23 13:42:49+08:00
[CATEGORIES]
cs.LG
Collaborative Mean Estimation Among Heterogeneous Strategic Agents: Individual Rationality, Fairness, and Truthful Contribution
[AUTHORS]
Alex Clinton, Yiding Chen, Xiaojin Zhu, Kirthevasan Kandasamy
[ABSTRACT]
We study a collaborative learning problem where $m$ agents aim to estimate a
vector $\mu =(\mu_1,\ldots,\mu_d)\in \mathbb{R}^d$ by sampling from associated
univariate normal distributions $\{\mathcal{N}(\mu_k, \sigma^2)\}_{k\in[d]}$.
Agent $i$ incurs a cost $c_{i,k}$ to sample from $\mathcal{N}(\mu_k,
\sigma^2)$. Instead of working independently, agents can exchange data,
collecting cheaper samples and sharing them in return for costly data, thereby
reducing both costs and estimation error. We design a mechanism to facilitate
such collaboration, while addressing two key challenges: ensuring individually
rational (IR) and fair outcomes so all agents benefit, and preventing strategic
behavior (e.g. non-collection, data fabrication) to avoid socially undesirable
outcomes. We design a mechanism and an associated Nash equilibrium (NE) which
minimizes the social penalty (the sum of agents’ estimation errors and collection
costs) while being IR for all agents. We achieve a
$\mathcal{O}(\sqrt{m})$-approximation to the minimum social penalty in the
worst case and an $\mathcal{O}(1)$-approximation under favorable conditions.
Additionally, we establish three hardness results: no nontrivial mechanism
(i) guarantees a dominant strategy equilibrium where agents report truthfully,
(ii) is IR for every strategy profile of other agents, or (iii) avoids a
worst-case $\Omega(\sqrt{m})$ price of stability in any NE. Finally, by
integrating concepts from axiomatic bargaining, we demonstrate that our
mechanism supports fairer outcomes than one which minimizes social penalty.
[COMMENTS]
ICML 2025
[LINK]
http://arxiv.org/abs/2407.15881v2
[DATE]
2025-06-23 13:32:45+08:00
[CATEGORIES]
cs.LG
Interpretation of Deep Learning Model in Embryo Selection for In Vitro Fertilization (IVF) Treatment
[AUTHORS]
Radha Kodali, Venkata Rao Dhulipalla, Venkata Siva Kishor Tatavarty, Madhavi Nadakuditi, Bharadwaj Thiruveedhula, Suryanarayana Gunnam, Durga Prasad Bavirisetti
[ABSTRACT]
Infertility has a considerable impact on individuals’ quality of life,
affecting them socially and psychologically, with projections indicating a rise
in the upcoming years. In vitro fertilization (IVF) emerges as one of the
primary techniques within economically developed nations, employed to address
the rising problem of low fertility. Expert embryologists conventionally grade
embryos by reviewing blastocyst images to select the best one for transfer,
yet this process is time-consuming and inefficient. Blastocyst images
provide a valuable resource for assessing embryo viability. In this study, we
introduce an explainable artificial intelligence (XAI) framework for
classifying embryos, employing a fusion of convolutional neural network (CNN)
and long short-term memory (LSTM) architecture, referred to as CNN-LSTM.
Utilizing deep learning, our model achieves high accuracy in embryo
classification while maintaining interpretability through XAI.
[LINK]
http://arxiv.org/abs/2506.06680v2
[DATE]
2025-06-23 13:29:59+08:00
[CATEGORIES]
cs.LG
AFBS: Buffer Gradient Selection in Semi-asynchronous Federated Learning
[AUTHORS]
Chaoyi Lu, Yiding Sun, Jinqian Chen, Zhichuan Yang, Jiangming Pan, Jihua Zhu
[ABSTRACT]
Asynchronous federated learning (AFL) accelerates training by eliminating the
need to wait for stragglers, but its asynchronous nature introduces gradient
staleness, where outdated gradients degrade performance. Existing solutions
address this issue with gradient buffers, forming a semi-asynchronous
framework. However, this approach struggles when buffers accumulate numerous
stale gradients, as blindly aggregating all gradients can harm training. To
address this, we propose AFBS (Asynchronous FL Buffer Selection), the first
algorithm to perform gradient selection within buffers while ensuring privacy
protection. Specifically, each client sends a random-projection-encrypted
label distribution matrix before training, and the server clusters clients
based on it. During training, the server scores and selects gradients
within each cluster based on their informational value, discarding low-value
gradients to enhance semi-asynchronous federated learning. Extensive
experiments in highly heterogeneous system and data environments demonstrate
AFBS’s superior performance compared to state-of-the-art methods. Notably, on
the most challenging task, CIFAR-100, AFBS improves accuracy by up to 4.8% over
the previous best algorithm and reduces the time to reach target accuracy by
75%.
[LINK]
http://arxiv.org/abs/2506.12754v2
[DATE]
2025-06-23 13:27:00+08:00
[CATEGORIES]
cs.LG
Instability in Diffusion ODEs: An Explanation for Inaccurate Image Reconstruction
[AUTHORS]
Han Zhang, Jinghong Mao, Shangwen Zhu, Zhantao Yang, Lianghua Huang, Yu Liu, Deli Zhao, Ruili Feng, Fan Cheng
[ABSTRACT]
Diffusion reconstruction plays a critical role in various applications such
as image editing, restoration, and style transfer. In theory, the
reconstruction should be simple: it just inverts and regenerates images by
numerically solving the Probability Flow-Ordinary Differential Equation
(PF-ODE). Yet in practice, noticeable reconstruction errors have been observed,
which cannot be well explained by numerical errors. In this work, we identify a
deeper intrinsic property in the PF-ODE generation process, the instability,
that can further amplify the reconstruction errors. The root of this
instability lies in the sparsity inherent in the generation distribution, which
means that the probability is concentrated on scattered and small regions while
the vast majority remains almost empty. To demonstrate the existence of
instability and its amplification on reconstruction error, we conduct
experiments on both toy numerical examples and popular open-sourced diffusion
models. Furthermore, based on the characteristics of image data, we
theoretically prove that the probability of instability converges to one as the
data dimensionality increases. Our findings highlight the inherent challenges
in diffusion-based reconstruction and can offer insights for future
improvements.
[LINK]
http://arxiv.org/abs/2506.18290v1
[DATE]
2025-06-23 12:59:49+08:00
[CATEGORIES]
cs.LG
Learning High-Quality Latent Representations for Anomaly Detection and Signal Integrity Enhancement in High-Speed Signals
[AUTHORS]
Muhammad Usama, Hee-Deok Jang, Soham Shanbhag, Yoo-Chang Sung, Seung-Jun Bae, Dong Eui Chang
[ABSTRACT]
This paper addresses the dual challenge of improving anomaly detection and
signal integrity in high-speed dynamic random access memory signals. To achieve
this, we propose a joint training framework that integrates an autoencoder with
a classifier to learn more distinctive latent representations by focusing on
valid data features. Our approach is evaluated across three anomaly detection
algorithms and consistently outperforms two baseline methods. Detailed ablation
studies further support these findings. Furthermore, we introduce a signal
integrity enhancement algorithm that improves signal integrity by an average of
11.3%. The source code and data used in this study are available at
https://github.com/Usama1002/learning-latent-representations.
[LINK]
http://arxiv.org/abs/2506.18288v1
[DATE]
2025-06-23 12:48:22+08:00
[CATEGORIES]
cs.LG
Learning Causal Graphs at Scale: A Foundation Model Approach
[AUTHORS]
Naiyu Yin, Tian Gao, Yue Yu
[ABSTRACT]
Due to its human-interpretability and invariance properties, the Directed Acyclic
Graph (DAG) has been a foundational tool across various areas of AI research,
leading to significant advancements. However, DAG learning remains highly
challenging, due to its super-exponential growth in computational cost and
identifiability issues, particularly in small-sample regimes. To address these
two challenges, in this work we leverage the recent success of linear
transformers and develop a foundation model approach for discovering multiple
order-consistent DAGs across tasks. In particular, we propose Attention-DAG
(ADAG), a novel attention-mechanism-based architecture for learning multiple
linear Structural Equation Models (SEMs). ADAG learns the mapping from observed
data to both graph structure and parameters via a nonlinear attention-based
kernel, enabling efficient multi-task estimation of the underlying linear SEMs.
By formulating the learning process across multiple tasks as a continuous
optimization problem, the pre-trained ADAG model captures the common structural
properties as a shared low-dimensional prior, thereby reducing the
ill-posedness of downstream DAG learning tasks in small-sample regimes. We
evaluate our proposed approach on benchmark synthetic datasets and find that
ADAG achieves substantial improvements in both DAG learning accuracy and
zero-shot inference efficiency. To the best of our knowledge, this is the first
practical approach for pre-training a foundation model specifically designed
for DAG learning, representing a step toward more efficient and generalizable
downstream applications in causal discovery.
[LINK]
http://arxiv.org/abs/2506.18285v1
[DATE]
2025-06-23 12:41:02+08:00
[CATEGORIES]
cs.LG
Quantifying Uncertainty in the Presence of Distribution Shifts
[AUTHORS]
Yuli Slavutsky, David M. Blei
[ABSTRACT]
Neural networks make accurate predictions but often fail to provide reliable
uncertainty estimates, especially under covariate distribution shifts between
training and testing. To address this problem, we propose a Bayesian framework
for uncertainty estimation that explicitly accounts for covariate shifts. While
conventional approaches rely on fixed priors, the key idea of our method is an
adaptive prior, conditioned on both training and new covariates. This prior
naturally increases uncertainty for inputs that lie far from the training
distribution in regions where predictive performance is likely to degrade. To
efficiently approximate the resulting posterior predictive distribution, we
employ amortized variational inference. Finally, we construct synthetic
environments by drawing small bootstrap samples from the training data,
simulating a range of plausible covariate shifts using only the original
dataset. We evaluate our method on both synthetic and real-world data. It
yields substantially improved uncertainty estimates under distribution shifts.
[LINK]
http://arxiv.org/abs/2506.18283v1
[DATE]
2025-06-23 12:30:36+08:00
[CATEGORIES]
cs.LG
Phase retrieval with rank $d$ measurements – \emph{descending} algorithms phase transitions
[AUTHORS]
Mihailo Stojnic
[ABSTRACT]
Companion paper [118] developed a powerful \emph{Random duality theory} (RDT)
based analytical program to statistically characterize performance of
\emph{descending} phase retrieval algorithms (dPR) (these include all variants
of gradient descent, among them the widely popular Wirtinger flows). We here
generalize the program and show how it can be utilized to handle rank $d$
positive definite phase retrieval (PR) measurements (with special cases $d=1$
and $d=2$ serving as emulations of the real and complex phase retrievals,
respectively). In particular, we observe that the minimal sample complexity
ratio (number of measurements scaled by the dimension of the unknown signal)
which ensures dPR’s success exhibits a phase transition (PT) phenomenon. For
both plain and lifted RDT we determine the phase transition locations. To
complement theoretical results we implement a log barrier gradient descent
variant and observe that, even in small dimensional scenarios (with problem
sizes on the order of 100), the simulated phase transitions are in excellent
agreement with the theoretical predictions.
[LINK]
http://arxiv.org/abs/2506.18282v1
[DATE]
2025-06-23 12:28:46+08:00
[CATEGORIES]
cs.LG
Optimal spectral initializers impact on phase retrieval phase transitions – an RDT view
[AUTHORS]
Mihailo Stojnic
[ABSTRACT]
We analyze the relation between spectral initializers and theoretical limits
of \emph{descending} phase retrieval algorithms (dPR). In companion paper
[104], for any sample complexity ratio, $\alpha$, \emph{parametric manifold},
${\mathcal {PM}}(\alpha)$, is recognized as a critically important structure
that generically determines dPRs’ abilities to solve phase retrieval (PR).
Moreover, the overlap between the algorithmic solution and the true signal is
positioned as a key component of ${\mathcal {PM}}$. We here consider the
so-called \emph{overlap optimal} spectral initializers (OptSpins) as dPR’s
starting points and develop a generic \emph{Random duality theory} (RDT) based
program to statistically characterize them. In particular, we determine the
functional structure of OptSpins and evaluate the starting overlaps that they
provide for the dPRs. Since ${\mathcal {PM}}$’s so-called \emph{flat regions}
are highly susceptible to \emph{local jitteriness} and as such are key
obstacles on dPR’s path towards PR’s global optimum, a precise characterization
of the starting overlap allows one to determine if such regions can be successfully
circumvented. Through the presented theoretical analysis we observe two key
points in that regard: \textbf{\emph{(i)}} dPR’s theoretical phase transition
(critical $\alpha$ above which they solve PR) might be difficult to practically
achieve as the ${\mathcal {PM}}$’s flat regions are large causing the
associated OptSpins to fall exactly within them; and \textbf{\emph{(ii)}}
opting for so-called “\emph{safer compression}” and slightly increasing
$\alpha$ (by say $15\%$) shrinks flat regions and allows OptSpins to fall
outside them and dPRs to ultimately solve PR. Numerical simulations are
conducted as well and shown to be in excellent agreement with theoretical
predictions.
[LINK]
http://arxiv.org/abs/2506.18279v1
[DATE]
2025-06-23 12:20:24+08:00
[CATEGORIES]
cs.LG
Fast Rate Information-theoretic Bounds on Generalization Errors
[AUTHORS]
Xuetong Wu, Jonathan H. Manton, Uwe Aickelin, Jingge Zhu
[ABSTRACT]
The generalization error of a learning algorithm refers to the discrepancy
between the loss of a learning algorithm on training data and that on unseen
testing data. Various information-theoretic bounds on the generalization error
have been derived in the literature, where the mutual information between the
training data and the hypothesis (the output of the learning algorithm) plays
an important role. Focusing on the individual sample mutual information bound
by Bu et al., which itself is a tightened version of the first bound on the
topic by Russo et al. and Xu et al., this paper investigates the tightness of
these bounds, in terms of the dependence of their convergence rates on the
sample size $n$. It has been recognized that these bounds are in general not
tight, readily verified for the exemplary quadratic Gaussian mean estimation
problem, where the individual sample mutual information bound scales as
$O(\sqrt{1/n})$ while the true generalization error scales as $O(1/n)$. The
first contribution of this paper is to show that the same bound can in fact be
asymptotically tight if an appropriate assumption is made. In particular, we
show that the fast rate can be recovered when the assumption is made on the
excess risk instead of on the loss function, as is usually done in the existing
literature. A theoretical justification is given for this choice. The second
contribution of the paper is a new set of generalization error bounds based on
the $(\eta, c)$-central condition, a condition that is relatively easy to verify and
that has the property that the mutual information term directly determines the
convergence rate of the bound. Several analytical and numerical examples are
given to show the effectiveness of these bounds.
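As a rough illustration of the rate gap described in this abstract, the following minimal sketch evaluates standard closed forms for Gaussian mean estimation with squared loss, assuming the hypothesis is the sample mean of n i.i.d. N(mu, sigma^2) samples. The function names are hypothetical, and the square-root expression is only a proxy for the rate of an individual sample mutual information bound (constants omitted), not the exact bound analyzed in the paper.

```python
import math

def expected_generalization_gap(n, sigma2=1.0):
    # Sample-mean estimator with loss (w - z)^2:
    # population risk = sigma2 * (1 + 1/n),
    # expected empirical risk = sigma2 * (1 - 1/n),
    # so the expected gap is 2 * sigma2 / n.
    return 2.0 * sigma2 / n

def individual_sample_mi(n):
    # I(W; Z_i) in nats when W is the mean of n i.i.d. Gaussians (needs n >= 2)
    return 0.5 * math.log(n / (n - 1))

def sqrt_mi_rate(n):
    # Rate proxy sqrt(I(W; Z_i)); constants are deliberately dropped
    return math.sqrt(individual_sample_mi(n))

for n in (10, 100, 1000, 10000):
    print(n, expected_generalization_gap(n), sqrt_mi_rate(n))
```

Increasing n by a factor of 100 shrinks the gap by about 100 but the square-root proxy only by about 10, which is the slow-rate versus fast-rate mismatch the abstract refers to.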
[COMMENTS]
27 pages, 1 figure, accepted to TIT
[LINK]
http://arxiv.org/abs/2303.14658v3
[DATE]
2025-06-23 12:15:18+08:00
[CATEGORIES]
cs.LG
Finite-Time Information-Theoretic Bounds in Queueing Control
[AUTHORS]
Yujie Liu, Vincent Y. F. Tan, Yunbei Xu
[ABSTRACT]
We establish the first finite-time information-theoretic lower bounds, and
derive new policies that achieve them, for the total queue length in scheduling
problems over stochastic processing networks with both adversarial and
stochastic arrivals. Prior analyses of MaxWeight guarantee only stability and
asymptotic optimality in heavy traffic; we prove that, at finite horizons,
MaxWeight can incur strictly larger backlog by problem-dependent factors which
we identify. Our main innovations are 1) a minimax framework that pinpoints the
precise problem parameters governing any policy’s finite-time performance; 2)
an information-theoretic lower bound on total queue length; 3) a fundamental
limitation of MaxWeight, namely that it is suboptimal in finite time; and 4) a new
scheduling rule that minimizes the full Lyapunov drift, including its
second-order term, thereby matching the lower bound under certain conditions, up
to universal constants. These findings reveal a fundamental limitation of
“drift-only” methods and point the way toward principled, non-asymptotic
optimality in queueing control.
[LINK]
http://arxiv.org/abs/2506.18278v1
[DATE]
2025-06-23 12:14:40+08:00
[CATEGORIES]
cs.LG
Phase transition of \emph{descending} phase retrieval algorithms
[AUTHORS]
Mihailo Stojnic
[ABSTRACT]
We study theoretical limits of \emph{descending} phase retrieval algorithms.
Utilizing \emph{Random duality theory} (RDT) we develop a generic program that
allows statistical characterization of various algorithmic performance metrics.
Through these we identify the concepts of \emph{parametric manifold} and its
\emph{funneling points} as key mathematical objects that govern the underlying
algorithms’ behavior. An isomorphism between single funneling point manifolds
and global convergence of descending algorithms is established. The structure
and shape of the parametric manifold as well as its dependence on the sample
complexity are studied through both plain and lifted RDT. Emergence of a phase
transition is observed. Namely, as sample complexity increases, the parametric
manifold transitions from a multi to a single funneling point structure. This
in turn corresponds to a transition from the scenarios where descending
algorithms generically fail to the scenarios where they succeed in solving
phase retrieval. We also develop and implement a practical algorithmic variant
that in a hybrid alternating fashion combines a barrier and a plain gradient
descent. Even though the theoretical results are obtained for infinite
dimensional scenarios (and consequently non-jittery parametric manifolds), we
observe a strong agreement between theoretical and simulated phase transition
predictions for fairly small dimensions on the order of a few hundred.
[LINK]
http://arxiv.org/abs/2506.18275v1
[DATE]
2025-06-23 12:10:35+08:00
[CATEGORIES]
cs.LG
Leveraging Large Language Models for Information Verification – an Engineering Approach
[AUTHORS]
Nguyen Nang Hung, Nguyen Thanh Trong, Vuong Thanh Toan, Nguyen An Phuoc, Dao Minh Tu, Nguyen Manh Duc Tuan, Nguyen Dinh Mau
[ABSTRACT]
For the ACMMM25 challenge, we present a practical engineering approach to
multimedia news source verification, utilizing Large Language Models (LLMs)
like GPT-4o as the backbone of our pipeline. Our method processes images and
videos through a streamlined sequence of steps: First, we generate metadata
using general-purpose queries via Google tools, capturing relevant content and
links. Multimedia data is then segmented, cleaned, and converted into frames,
from which we select the top-K most informative frames. These frames are
cross-referenced with metadata to identify consensus or discrepancies.
Additionally, audio transcripts are extracted for further verification.
Notably, the entire pipeline is automated using GPT-4o through prompt
engineering, with human intervention limited to final validation.
[LINK]
http://arxiv.org/abs/2506.18274v1
[DATE]
2025-06-23 12:08:38+08:00
[CATEGORIES]
cs.LG
ARD-LoRA: Dynamic Rank Allocation for Parameter-Efficient Fine-Tuning of Foundation Models with Heterogeneous Adaptation Needs
[AUTHORS]
Haseeb Ullah Khan Shinwari, Muhammad Usama
[ABSTRACT]
Conventional Low-Rank Adaptation (LoRA) methods employ a fixed rank, imposing
uniform adaptation across transformer layers and attention heads despite their
heterogeneous learning dynamics. This paper introduces Adaptive Rank Dynamic
LoRA (ARD-LoRA), a novel framework that automates rank allocation through
learnable scaling factors. These factors are optimized via a meta-objective
balancing task performance and parameter efficiency, incorporating $\ell_1$
sparsity for minimal rank and Total Variation regularization for stable rank
transitions. ARD-LoRA enables continuous, differentiable, per-head rank
adaptation. Experiments on LLAMA-3.1-70B and PaliGemma-2 demonstrate ARD-LoRA’s
efficacy, achieving up to 99.3% of full fine-tuning performance with only 0.32%
trainable parameters, outperforming strong baselines like DoRA and AdaLoRA.
Furthermore, it reduces multimodal adaptation memory by 41%. These results
establish dynamic, fine-grained rank allocation as a critical paradigm for
efficient foundation model adaptation.
[LINK]
http://arxiv.org/abs/2506.18267v1
[DATE]
2025-06-23 11:45:37+08:00
[CATEGORIES]
cs.LG
Incentives for Responsiveness, Instrumental Control and Impact
[AUTHORS]
Ryan Carey, Eric Langlois, Chris van Merwijk, Shane Legg, Tom Everitt
[ABSTRACT]
We introduce three concepts that describe an agent’s incentives: response
incentives indicate which variables in the environment, such as sensitive
demographic information, affect the decision under the optimal policy.
Instrumental control incentives indicate whether an agent’s policy is chosen to
manipulate part of its environment, such as the preferences or instructions of
a user. Impact incentives indicate which variables an agent will affect,
intentionally or otherwise. For each concept, we establish sound and complete
graphical criteria, and discuss general classes of techniques that may be used
to produce incentives for safe and fair agent behaviour. Finally, we outline
how these notions may be generalised to multi-decision settings. This
journal-length paper extends our conference publications “Incentives for
Responsiveness, Instrumental Control and Impact” and “Agent Incentives: A
Causal Perspective”: the material on response incentives and instrumental
control incentives is updated, while the work on impact incentives and
multi-decision settings is entirely new.
[LINK]
http://arxiv.org/abs/2001.07118v3
[DATE]
2025-06-23 11:26:44+08:00
[CATEGORIES]
cs.LG
MGHF: Multi-Granular High-Frequency Perceptual Loss for Image Super-Resolution
[AUTHORS]
Shoaib Meraj Sami, Md Mahedi Hasan, Mohammad Saeed Ebrahimi Saadabadi, Jeremy Dawson, Nasser Nasrabadi, Raghuveer Rao
[ABSTRACT]
While different variants of perceptual losses have been employed in
super-resolution literature to synthesize more realistic, appealing, and
detailed high-resolution images, most are based on convolutional neural networks,
causing information loss during guidance and often relying on complicated
architectures and training procedures. We propose an invertible neural network
(INN)-based naive \textbf{M}ulti-\textbf{G}ranular
\textbf{H}igh-\textbf{F}requency (MGHF-n) perceptual loss trained on ImageNet
to overcome these issues. Furthermore, we develop a comprehensive framework
(MGHF-c) with several constraints to preserve, prioritize, and regularize
information across multiple perspectives: texture and style preservation,
content preservation, regional detail preservation, and joint content-style
regularization. Information is prioritized through adaptive entropy-based
pruning and reweighting of INN features. We utilize Gram matrix loss for style
preservation and mean-squared error loss for content preservation.
Additionally, we propose content-style consistency through correlation loss to
regulate unnecessary texture generation while preserving content information.
Since small image regions may contain intricate details, we employ modulated
PatchNCE in the INN features as a local information preservation objective.
Extensive experiments on various super-resolution algorithms, including GAN-
and diffusion-based methods, demonstrate that our MGHF framework significantly
improves performance. After the review process, our code will be released in
the public repository.
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2411.13548v2
[DATE]
2025-06-23 11:08:58+08:00
[CATEGORIES]
cs.LG
Ground tracking for improved landmine detection in a GPR system
[AUTHORS]
Li Tang, Peter A. Torrione, Cihat Eldeniz, Leslie M. Collins
[ABSTRACT]
Ground penetrating radar (GPR) provides a promising technology for accurate
subsurface object detection. In particular, it has shown promise for detecting
landmines with low metal content. However, the ground bounce (GB) that is
present in GPR data, which is caused by the dielectric discontinuity between
soil and air, is a major source of interference and degrades landmine detection
performance. To mitigate this interference, GB tracking algorithms formulated
using both a Kalman filter (KF) and a particle filter (PF) framework are
proposed. In particular, the location of the GB in the radar signal is modeled
as the hidden state in a stochastic system for the PF approach. The
observations are the 2D radar images, which arrive scan by scan along the
down-track direction. An initial training stage sets parameters automatically
to accommodate different ground and weather conditions. The features associated
with the GB description are updated adaptively with the arrival of new data.
The prior distribution for a given location is predicted by propagating
information from two adjacent channels/scans, which ensures that the overall GB
surface remains smooth. The proposed algorithms are verified in experiments
utilizing real data, and their performances are compared with other GB tracking
approaches. We demonstrate that improved GB tracking contributes to improved
performance for the landmine detection problem.
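To make the Kalman filter idea concrete, here is a minimal sketch that assumes a scalar random-walk model for the GB location along the down-track direction. The function name, noise variances, and measurements are hypothetical and not taken from the paper, whose state and observation models are richer.

```python
def kalman_track(measurements, process_var=1.0, meas_var=4.0):
    """Scalar Kalman filter with a random-walk state model."""
    estimate = float(measurements[0])  # initialise at the first observation
    variance = meas_var
    track = [estimate]
    for z in measurements[1:]:
        # Predict: a random walk keeps the mean and inflates the variance
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain
        gain = variance / (variance + meas_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        track.append(estimate)
    return track

# Hypothetical GB sample indices per scan, with one outlier at the fifth scan
noisy = [50, 51, 49, 50, 60, 51, 50, 52]
print(kalman_track(noisy))
```

The filtered track moves only part of the way toward the outlier, which is the smoothing behaviour a GB tracker relies on.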
[LINK]
http://arxiv.org/abs/2506.18258v1
[DATE]
2025-06-23 11:06:55+08:00
[CATEGORIES]
cs.LG
DSAC-C: Constrained Maximum Entropy for Robust Discrete Soft-Actor Critic
[AUTHORS]
Dexter Neo, Tsuhan Chen
[ABSTRACT]
We present a novel extension to the family of Soft Actor-Critic (SAC)
algorithms. We argue that based on the Maximum Entropy Principle, discrete SAC
can be further improved via additional statistical constraints derived from a
surrogate critic policy. Furthermore, our findings suggest that these
constraints provide added robustness against potential domain shifts, which
is essential for the safe deployment of reinforcement learning agents in the
real world. We provide theoretical analysis and show empirical results in
low-data regimes for both in-distribution and out-of-distribution variants of Atari
2600 games.
[COMMENTS]
Accepted by IJCNN’25
[LINK]
http://arxiv.org/abs/2310.17173v2
[DATE]
2025-06-23 10:45:04+08:00
[CATEGORIES]
cs.LG
Exploring Efficient Quantification of Modeling Uncertainties with Differentiable Physics-Informed Machine Learning Architectures
[AUTHORS]
Manaswin Oddiraju, Bharath Varma Penumatsa, Divyang Amin, Michael Piedmonte, Souma Chowdhury
[ABSTRACT]
Quantifying and propagating modeling uncertainties is crucial for reliability
analysis, robust optimization, and other model-based algorithmic processes in
engineering design and control. Physics-informed machine learning (PIML)
methods have emerged in recent years as a new alternative to traditional
computational modeling and surrogate modeling methods, offering a balance
between computing efficiency, modeling accuracy, and interpretability. However,
their ability to predict and propagate modeling uncertainties remains mostly
unexplored. In this paper, a promising class of auto-differentiable hybrid PIML
architectures that combine partial physics and neural networks or ANNs (for
input transformation or adaptive parameter estimation) is integrated with
Bayesian neural networks (BNNs, replacing the ANNs); this is done with the goal of
exploring whether BNNs can successfully provision uncertainty propagation
capabilities in the PIML architectures as well, further supported by the
auto-differentiability of these architectures. A two-stage training process is
used to alleviate the challenges traditionally encountered in training
probabilistic ML models. The resulting BNN-integrated PIML architecture is
evaluated on an analytical benchmark problem and flight experiment data for a
fixed-wing RC aircraft, with prediction performance observed to be slightly
worse than or on par with purely data-driven ML and original PIML models. Moreover,
Monte Carlo sampling of probabilistic BNN weights was found to be most
effective in propagating uncertainty in the BNN-integrated PIML architectures.
[COMMENTS]
IDETC 2025
[LINK]
http://arxiv.org/abs/2506.18247v1
[DATE]
2025-06-23 10:32:20+08:00
[CATEGORIES]
cs.LG
Dual-Forward Path Teacher Knowledge Distillation: Bridging the Capacity Gap Between Teacher and Student
[AUTHORS]
Tong Li, Long Liu, Yihang Hu, Hu Chen, Shifeng Chen
[ABSTRACT]
Knowledge distillation (KD) provides an effective way to improve the
performance of a student network under the guidance of pre-trained teachers.
However, this approach usually brings in a large capacity gap between teacher
and student networks, limiting the distillation gains. Previous methods
addressing this problem either discard accurate knowledge representation or
fail to dynamically adjust the transferred knowledge, which is less effective
in addressing the capacity gap problem and hinders students from achieving
comparable performance with the pre-trained teacher. In this work, we extend
the idea of prompt-based learning to address the capacity gap problem, and
propose Dual-Forward Path Teacher Knowledge Distillation (DFPT-KD), which
replaces the pre-trained teacher with a novel dual-forward path teacher to
supervise the learning of the student. The key to DFPT-KD is prompt-based tuning,
i.e., establishing an additional prompt-based forward path within the
pre-trained teacher and optimizing it with the pre-trained teacher frozen to
make the transferred knowledge compatible with the representation ability of
the student. Extensive experiments demonstrate that DFPT-KD leads to trained
students performing better than those trained with vanilla KD. To make the transferred
knowledge more compatible with the representation abilities of the student,
we further fine-tune the whole prompt-based forward path, yielding a novel
distillation approach dubbed DFPT-KD+. By extensive experiments, it is shown
that DFPT-KD+ improves upon DFPT-KD and achieves state-of-the-art accuracy
performance.
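For context, the vanilla KD baseline mentioned in this abstract is commonly written as a KL divergence between temperature-softened teacher and student distributions. The sketch below illustrates only that baseline, not DFPT-KD; the logits, the temperature, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 as is common for vanilla KD
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl

teacher = [5.0, 1.0, -2.0]  # hypothetical teacher logits
student = [2.0, 1.5, 0.0]   # hypothetical student logits
print(kd_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge.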
[COMMENTS]
15 pages
[LINK]
http://arxiv.org/abs/2506.18244v1
[DATE]
2025-06-23 10:22:53+08:00
[CATEGORIES]
cs.LG
LLM Web Dynamics: Tracing Model Collapse in a Network of LLMs
[AUTHORS]
Tianyu Wang, Lingyou Pang, Akira Horiguchi, Carey E. Priebe
[ABSTRACT]
The increasing use of synthetic data from the public Internet has enhanced
data usage efficiency in large language model (LLM) training. However, the
potential threat of model collapse remains insufficiently explored. Existing
studies primarily examine model collapse in a single model setting or rely
solely on statistical surrogates. In this work, we introduce LLM Web Dynamics
(LWD), an efficient framework for investigating model collapse at the network
level. By simulating the Internet with a retrieval-augmented generation (RAG)
database, we analyze the convergence pattern of model outputs. Furthermore, we
provide theoretical guarantees for this convergence by drawing an analogy to
interacting Gaussian Mixture Models.
[LINK]
http://arxiv.org/abs/2506.15690v2
[DATE]
2025-06-23 10:09:58+08:00
[CATEGORIES]
cs.LG
ASGO: Adaptive Structured Gradient Optimization
[AUTHORS]
Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, Tong Zhang
[ABSTRACT]
Training deep neural networks is a structured optimization problem, because
the parameters are naturally represented by matrices and tensors rather than by
vectors. Under this structural representation, it has been widely observed that
gradients are low-rank and Hessians are approximately block-wise diagonal.
These structured properties are crucial for designing efficient optimization
algorithms, but are not utilized by many current popular optimizers like Adam.
In this paper, we present a novel optimization algorithm ASGO that capitalizes
on these properties by employing a preconditioner that is adaptively updated
using structured gradients. By fine-grained theoretical analysis, ASGO is
proven to achieve superior convergence rates compared to existing structured
gradient methods. Based on the convergence theory, we further demonstrate that
ASGO can benefit from the low-rank and block-wise diagonal properties. We also
discuss practical modifications of ASGO and empirically verify ASGO’s
effectiveness on language model tasks.
[COMMENTS]
30 pages
[LINK]
http://arxiv.org/abs/2503.20762v2
[DATE]
2025-06-23 09:05:27+08:00
[CATEGORIES]
cs.LG
Cross-Architecture Knowledge Distillation (KD) for Retinal Fundus Image Anomaly Detection on NVIDIA Jetson Nano
[AUTHORS]
Berk Yilmaz, Aniruddh Aiyengar
[ABSTRACT]
Early and accurate identification of retinal ailments is crucial for averting
ocular decline; however, access to dependable diagnostic devices is often
lacking in low-resourced settings. This project proposes to address this by
developing a lightweight, edge-device deployable disease classifier using
cross-architecture knowledge distillation. We first train a high-capacity vision
transformer (ViT) teacher model, pre-trained using I-JEPA self-supervised
learning, to classify fundus images into four classes: Normal, Diabetic
Retinopathy, Glaucoma, and Cataract. We kept an Internet of Things (IoT) focus
when compressing to a CNN-based student model for deployment in
resource-limited conditions, such as the NVIDIA Jetson Nano. This was
accomplished using a novel framework which included a Partitioned
Cross-Attention (PCA) projector, a Group-Wise Linear (GL) projector, and a
multi-view robust training method. The teacher model has 97.4 percent more
parameters than the student model, yet the student achieves 89 percent
classification accuracy, retaining roughly 93 percent of the teacher model’s
diagnostic performance. The retention of clinical classification behavior supports our
method’s initial aim: compression of the ViT while retaining accuracy. Our work
serves as an example of a scalable, AI-driven triage solution for retinal
disorders in under-resourced areas.
[COMMENTS]
15 pages, 10 figures. Berk Yilmaz and Aniruddh Aiyengar contributed
equally to this work
[LINK]
http://arxiv.org/abs/2506.18220v1
[DATE]
2025-06-23 08:57:43+08:00
[CATEGORIES]
cs.LG
Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales
[AUTHORS]
Ju-Seung Byun, Andrew Perrault
[ABSTRACT]
Reinforcement learning (RL) training is inherently unstable due to factors
such as moving targets and high gradient variance. Reinforcement Learning from
Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) can
introduce additional difficulty. Differing preferences can complicate the
alignment process, and prediction errors in a trained reward model can become
more severe as the LLM generates unseen outputs. To enhance training
robustness, RL has adopted techniques from supervised learning, such as
ensembles and layer normalization. In this work, we improve the stability of RL
training by adapting the reverse cross entropy (RCE) from supervised learning
for noisy data to define a symmetric RL loss. We demonstrate performance
improvements across various tasks and scales. We conduct experiments in
discrete action tasks (Atari games) and continuous action space tasks (MuJoCo
benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with
and without added noise, with especially notable performance in SPPO across
different hyperparameters. Furthermore, we validate the benefits of the
symmetric RL loss when using SPPO for large language models through improved
performance in RLHF tasks, such as IMDB positive sentiment and TL;DR
summarization.
[LINK]
http://arxiv.org/abs/2405.17618v3
[DATE]
2025-06-23 08:56:21+08:00
[CATEGORIES]
cs.LG
Cost-Aware Routing for Efficient Text-To-Image Generation
[AUTHORS]
Qinchan Li, Kenneth Chen, Changyue Su, Wittawat Jitkrittum, Qi Sun, Patsorn Sangkloy
[ABSTRACT]
Diffusion models are well known for their ability to generate a high-fidelity
image for an input prompt through an iterative denoising process.
Unfortunately, the high fidelity also comes at a high computational cost due to
the inherently sequential generative process. In this work, we seek to
optimally balance quality and computational cost, and propose a framework to
allow the amount of computation to vary for each prompt, depending on its
complexity. Each prompt is automatically routed to the most appropriate
text-to-image generation function, which may correspond to a distinct number of
denoising steps of a diffusion model, or a disparate, independent text-to-image
model. Unlike uniform cost reduction techniques (e.g., distillation, model
quantization), our approach achieves the optimal trade-off by learning to
reserve expensive choices (e.g., 100+ denoising steps) only for a few complex
prompts, and employ more economical choices (e.g., small distilled model) for
less sophisticated prompts. We empirically demonstrate on COCO and DiffusionDB
that by learning to route to nine already-trained text-to-image models, our
approach is able to deliver an average quality that is higher than that
achievable by any of these models alone.
[LINK]
http://arxiv.org/abs/2506.14753v2
[DATE]
2025-06-23 08:44:17+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Shion Takeno, Yoshito Okura, Yu Inatsu, Tatsuya Aoyama, Tomonari Tanaka, Satoshi Akahane, Hiroyuki Hanada, Noriaki Hashimoto, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi
[ABSTRACT]
Gaussian process regression (GPR) or kernel ridge regression is a widely used
and powerful tool for nonlinear prediction. Therefore, active learning (AL) for
GPR, which actively collects data labels to achieve an accurate prediction with
fewer data labels, is an important problem. However, existing AL methods do not
theoretically guarantee prediction accuracy for the target distribution.
Furthermore, as discussed in the distributionally robust learning literature,
specifying the target distribution is often difficult. Thus, this paper
proposes two AL methods that effectively reduce the worst-case expected error
for GPR, that is, the expected error maximized over candidate target distributions.
We show an upper bound on the worst-case expected squared error, which suggests
that the error can be made arbitrarily small with a finite number of data labels
under mild conditions. Finally, we demonstrate the effectiveness of the
proposed methods through synthetic and real-world datasets.
[COMMENTS]
26 pages, 3 figures, Accepted to ICML2025
[LINK]
http://arxiv.org/abs/2502.16870v2
[DATE]
2025-06-23 08:32:30+08:00
[CATEGORIES]
cs.LG