TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning
[AUTHORS]
Jiaru Zou, Soumya Roy, Vinay Kumar Verma, Ziyi Wang, David Wipf, Pan Lu, Sumit Negi, James Zou, Jingrui He
[ABSTRACT]
Process Reward Models (PRMs) have recently emerged as a powerful framework
for enhancing the reasoning capabilities of large reasoning models (LRMs),
particularly in the context of test-time scaling (TTS). However, their
potential for supervising LRMs on tabular reasoning domains remains
underexplored. Through detailed empirical analyses, we identify that existing
PRMs, though widely adopted for supervising text-only reasoning steps, struggle
with table-specific operations such as sub-table retrieval and schema
interaction, leading to critical performance bottlenecks. To address this
limitation, we propose TaTToo, a novel table-grounded PRM framework that (i)
reasons explicitly over tabular reasoning steps and (ii) integrates tool-based
verification to provide precise reward supervision. Concretely, we first design
a scalable data curation pipeline that constructs over 60k high-quality
step-level annotations by integrating table verification rationales with
tool-based executions. Building on the collected data, we train TaTToo with a
dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use
reasoning patterns, followed by reinforcement learning with tool-grounded
reward shaping to align our model with table-based verification. We provide a
comprehensive evaluation of the policy improvement induced by our newly
designed PRM. Across 5 challenging tabular reasoning benchmarks covering
numerical reasoning, fact-checking, and data analysis, TaTToo improves
downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines
such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong
generalizability across diverse TTS strategies.
[LINK]
http://arxiv.org/abs/2510.06217v1
[DATE]
2025-10-08 01:59:41+08:00
[CATEGORIES]
cs.CL
cs.LG
Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
[AUTHORS]
Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia
[ABSTRACT]
Large language model (LLM) agents increasingly rely on external tools such as
search engines to solve complex, multi-step problems, and reinforcement
learning (RL) has become a key paradigm for training them. However, the
trajectories of search agents are structurally heterogeneous, where variations
in the number, placement, and outcomes of search calls lead to fundamentally
different answer directions and reward distributions. Standard policy gradient
methods, which use a single global baseline, suffer from what we identify and
formalize as cross-stratum bias: an “apples-to-oranges” comparison of
heterogeneous trajectories. This cross-stratum bias distorts credit assignment
and hinders exploration of complex, multi-step search strategies. To address
this, we propose Stratified GRPO, whose central component, Stratified Advantage
Normalization (SAN), partitions trajectories into homogeneous strata based on
their structural properties and computes advantages locally within each
stratum. This ensures that trajectories are evaluated only against their true
peers. Our analysis proves that SAN eliminates cross-stratum bias, yields
conditionally unbiased unit-variance estimates inside each stratum, and retains
the global unbiasedness and unit-variance properties enjoyed by standard
normalization, resulting in a purer and more scale-stable learning signal. To
improve practical stability under finite-sample regimes, we further linearly
blend SAN with the global estimator. Extensive experiments on diverse
single-hop and multi-hop question-answering benchmarks demonstrate that
Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3
points, achieving higher training rewards, greater training stability, and more
effective search policies. These results establish stratification as a
principled remedy for structural heterogeneity in RL for LLM search agents.
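
The stratified normalization described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stratum key (assumed here to be the number of search calls per trajectory), the blend weight `alpha`, and the function names are all hypothetical choices made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Standardize values to zero mean and (approximately) unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def stratified_advantages(rewards, strata, alpha=1.0):
    """Blend per-stratum and global normalized advantages.

    rewards: one scalar reward per trajectory.
    strata:  one hashable stratum key per trajectory (assumed here to be
             the number of search calls; this key is hypothetical).
    alpha:   weight on the stratified estimate; 1.0 uses only the
             within-stratum estimate, 0.0 only the global one.
    """
    global_adv = normalize(rewards)

    # Group trajectory indices by stratum.
    groups = defaultdict(list)
    for i, key in enumerate(strata):
        groups[key].append(i)

    # Normalize rewards within each stratum only.
    local_adv = [0.0] * len(rewards)
    for idx in groups.values():
        for i, a in zip(idx, normalize([rewards[i] for i in idx])):
            local_adv[i] = a

    # Linear blend of the stratified and global estimators.
    return [alpha * l + (1 - alpha) * g for l, g in zip(local_adv, global_adv)]
```

With `alpha=1.0`, a trajectory that made two search calls is compared only against other two-call trajectories, so a stratum with systematically higher rewards no longer dominates the advantage signal.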
[LINK]
http://arxiv.org/abs/2510.06214v1
[DATE]
2025-10-08 01:59:13+08:00
[CATEGORIES]
cs.LG
cs.CL
LaDiR: Latent Diffusion Enhances LLMs for Text Reasoning
[AUTHORS]
Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Nicklas Majamaki, Navdeep Jaitly, Yi-An Ma, Lianhui Qin
[ABSTRACT]
Large Language Models (LLMs) demonstrate their reasoning ability through
chain-of-thought (CoT) generation. However, LLMs’ autoregressive decoding may
limit the ability to revisit and refine earlier tokens in a holistic manner,
which can also lead to inefficient exploration for diverse solutions. In this
paper, we propose LaDiR (Latent Diffusion Reasoner), a novel reasoning
framework that unifies the expressiveness of continuous latent representation
with the iterative refinement capabilities of latent diffusion models for an
existing LLM. We first construct a structured latent reasoning space using a
Variational Autoencoder (VAE) that encodes text reasoning steps into blocks of
thought tokens, preserving semantic information and interpretability while
offering compact but expressive representations. Subsequently, we utilize a
latent diffusion model that learns to denoise a block of latent thought tokens
with a blockwise bidirectional attention mask, enabling longer horizon and
iterative refinement with adaptive test-time compute. This design allows
efficient parallel generation of diverse reasoning trajectories, allowing the
model to plan and revise the reasoning process holistically. We conduct
evaluations on a suite of mathematical reasoning and planning benchmarks.
Empirical results show that LaDiR consistently improves accuracy, diversity,
and interpretability over existing autoregressive, diffusion-based, and latent
reasoning methods, revealing a new paradigm for text reasoning with latent
diffusion.
[LINK]
http://arxiv.org/abs/2510.04573v2
[DATE]
2025-10-08 01:58:48+08:00
[CATEGORIES]
cs.LG
cs.CL
Generative Interfaces for Language Models
[AUTHORS]
Jiaqi Chen, Yanzhe Zhang, Yutong Zhang, Yijia Shao, Diyi Yang
[ABSTRACT]
Large language models (LLMs) are increasingly seen as assistants, copilots,
and consultants, capable of supporting a wide range of tasks through natural
conversation. However, most systems remain constrained by a linear
request-response format that often makes interactions inefficient in
multi-turn, information-dense, and exploratory tasks. To address these
limitations, we propose Generative Interfaces for Language Models, a paradigm
in which LLMs respond to user queries by proactively generating user interfaces
(UIs) that enable more adaptive and interactive engagement. Our framework
leverages structured interface-specific representations and iterative
refinements to translate user queries into task-specific UIs. For systematic
evaluation, we introduce a multidimensional assessment framework that compares
generative interfaces with traditional chat-based ones across diverse tasks,
interaction patterns, and query types, capturing functional, interactive, and
emotional aspects of user experience. Results show that generative interfaces
consistently outperform conversational ones, with up to a 72% improvement in
human preference. These findings clarify when and why users favor generative
interfaces, paving the way for future advancements in human-AI interaction.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2508.19227v2
[DATE]
2025-10-08 01:57:11+08:00
[CATEGORIES]
cs.CL
Tracing Multilingual Factual Knowledge Acquisition in Pretraining
[AUTHORS]
Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, Hinrich Schütze
[COMMENTS]
EMNLP Findings 2025
[LINK]
http://arxiv.org/abs/2505.14824v2
[DATE]
2025-10-08 01:56:22+08:00
[CATEGORIES]
cs.CL
TokenChain: A Discrete Speech Chain via Semantic Token Modeling
[AUTHORS]
Mingxuan Wang, Satoshi Nakamura
[ABSTRACT]
Machine Speech Chain, simulating the human perception-production loop, proves
effective in jointly improving ASR and TTS. We propose TokenChain, a fully
discrete speech chain coupling semantic-token ASR with a two-stage TTS: an
autoregressive text-to-semantic model co-trained with ASR and a
masked-generative semantic-to-acoustic model for synthesis only. End-to-end
feedback across the text interface is enabled with straight-through
argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight
averaging. Ablations examine optimal temperature schedules for in- and
cross-domain transfer. Evaluation reveals TokenChain surpasses baseline
accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with
stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by
31% on TED-LIUM with minimal forgetting, showing that chain learning remains
effective with token interfaces and models.
[COMMENTS]
5 pages, 3 figures. Submitted to IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP) 2026
[LINK]
http://arxiv.org/abs/2510.06201v1
[DATE]
2025-10-08 01:54:12+08:00
[CATEGORIES]
cs.CL
Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
[AUTHORS]
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu
[ABSTRACT]
This paper introduces a framework for relation extraction (RE) that enhances
both accuracy and explainability. The framework has two key components: (i) a
reasoning mechanism that formulates relation extraction as a series of
text-processing steps inspired by cognitive science, and (ii) an optimization
process driven by reinforcement learning (RL) with a novel reward function
designed to improve both task accuracy and explanation quality. We call our
approach CogRE. Our framework addresses the lack of supervision for
language-based explanations in traditional RE by promoting outputs that include
important relation keywords. These keywords are drawn from a high-quality
dictionary that is automatically constructed using an LLM. We evaluate our
approach for the task of one-shot RE using two LLMs and two RE datasets. Our
experiments show that CogRE improves explanation quality by addressing two
common failure patterns in one-shot RE: poor attention focus and limited
one-shot learning capability. For example, our cognitive-structured reasoning
with Qwen2.5-15B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing
prior reasoning-based designs. Optimizing this approach with RL using our
reward further improves performance by +23.46% (absolute). Finally, human
evaluation shows that our best model generates relational keywords closely
aligned with gold labels, increasing human explanation quality ratings by 54%
(relative).
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2510.06198v1
[DATE]
2025-10-08 01:53:55+08:00
[CATEGORIES]
cs.CL
Latent Speech-Text Transformer
[AUTHORS]
Yen-Ju Lu, Yashesh Gaur, Wei Zhou, Benjamin Muller, Jesus Villalba, Najim Dehak, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Srinivasan Iyer, Duc Le
[ABSTRACT]
Auto-regressive speech-text models are typically pre-trained on a large
number of interleaved sequences of text tokens and raw speech encoded as speech
tokens using vector quantization. These models have demonstrated
state-of-the-art performance in speech-to-speech understanding and generation
benchmarks, together with promising scaling laws, primarily enabled by the
representational alignment between text and speech. Nevertheless, they suffer
from shortcomings, partly owing to the disproportionately longer sequences of
speech tokens in contrast to textual tokens. This results in a large compute
imbalance between modalities during pre-training as well as during inference,
and a potential hindrance to effectively aligning speech and text, ultimately
translating to several orders of magnitude slower scaling laws. We introduce
the Latent Speech-Text Transformer (LST), which makes pre-training speech-text
models more data-efficient by dynamically and inexpensively aggregating speech
tokens into latent speech patches. These patches serve as higher-level units
that can either align with corresponding textual units to aid capability
transfer or even encapsulate common speech sequences like silences to be more
compute-efficient. We show that LST outperforms vanilla approaches on
speech-to-speech as well as text-to-text benchmarks in both data- and
compute-controlled settings, the former indicating more effective
representational alignment and the latter indicating steeper scaling laws for
speech-text models. On HellaSwag story completion, LST achieves 6.5% absolute
gain in speech accuracy under compute-controlled training and 5.3% under
data-controlled training, while also improving text performance. We will
release our models, code, and the evaluation data to facilitate further
research.
[COMMENTS]
16 pages, 13 figures
[LINK]
http://arxiv.org/abs/2510.06195v1
[DATE]
2025-10-08 01:52:08+08:00
[CATEGORIES]
cs.CL
cs.LG
BanglaTalk: Towards Real-Time Speech Assistance for Bengali Regional Dialects
[AUTHORS]
Jakir Hasan, Shubhashis Roy Dipta
[ABSTRACT]
Real-time speech assistants are becoming increasingly popular for ensuring
improved accessibility to information. Bengali, being a low-resource language
with a high regional dialectal diversity, has seen limited progress in
developing such systems. Existing systems are not optimized for real-time use
and focus only on standard Bengali. In this work, we present BanglaTalk, the
first real-time speech assistance system for Bengali regional dialects.
BanglaTalk follows the client-server architecture and uses the Real-time
Transport Protocol (RTP) to ensure low-latency communication. To address
dialectal variation, we introduce a dialect-aware ASR system, BRDialect,
developed by fine-tuning the IndicWav2Vec model in ten Bengali regional
dialects. It outperforms the baseline ASR models by 12.41-33.98% on the
RegSpeech12 dataset. Furthermore, BanglaTalk can operate at a low bandwidth of
24 kbps while maintaining an average end-to-end delay of 4.9 seconds. Low
bandwidth usage and minimal end-to-end delay make the system both
cost-effective and interactive for real-time use cases, enabling inclusive and
accessible speech technology for the diverse community of Bengali speakers.
[LINK]
http://arxiv.org/abs/2510.06188v1
[DATE]
2025-10-08 01:47:39+08:00
[CATEGORIES]
cs.CL
cs.LG
RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
[AUTHORS]
Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Xue Liu, Irwin King, Philip S. Yu
[ABSTRACT]
Large language models (LLMs) show promise in supporting scientific
research implementation, yet their ability to generate correct and executable
code remains limited. Existing works largely adopt one-shot settings, ignoring
the iterative and feedback-driven nature of realistic workflows of scientific
research development. To address this gap, we present RECODE-H, a benchmark of
102 tasks from research papers and repositories that evaluates LLM agents
through multi-turn interactions with LLM-simulated human feedback. It includes
structured instructions, unit tests, and a five-level feedback hierarchy to
reflect realistic researcher-agent collaboration. We further present
ReCodeAgent, a framework that integrates feedback into iterative code
generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4,
DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer
feedback, while also highlighting ongoing challenges in the generation of
complex research code. RECODE-H establishes a foundation for developing
adaptive, feedback-driven LLM agents in scientific research implementation.
[COMMENTS]
Code and dataset are available at github.com/ChunyuMiao98/RECODE
[LINK]
http://arxiv.org/abs/2510.06186v1
[DATE]
2025-10-08 01:45:35+08:00
[CATEGORIES]
cs.CL
OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature
[AUTHORS]
Alisha Srivastava, Emir Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer
[ABSTRACT]
Large language models (LLMs) are known to memorize and recall English text
from their pretraining data. However, the extent to which this ability
generalizes to non-English languages or transfers across languages remains
unclear. This paper investigates multilingual and cross-lingual memorization in
LLMs, probing if memorized content in one language (e.g., English) can be
recalled when presented in translation. To do so, we introduce OWL, a dataset
of 31.5K aligned excerpts from 20 books in ten languages, including English
originals, official translations (Vietnamese, Spanish, Turkish), and new
translations in six low-resource languages (Sesotho, Yoruba, Maithili,
Malagasy, Setswana, Tahitian). We evaluate memorization across model families
and sizes through three tasks: (1) direct probing, which asks the model to
identify a book’s title and author; (2) name cloze, which requires predicting
masked character names; and (3) prefix probing, which involves generating
continuations. We find that LLMs consistently recall content across languages,
even for texts without direct translation in pretraining data. GPT-4o, for
example, identifies authors and titles 69% of the time and masked entities 6%
of the time in newly translated excerpts. Perturbations (e.g., masking
characters, shuffling words) modestly reduce direct probing accuracy (7% drop
for shuffled official translations). Our results highlight the extent of
cross-lingual memorization and provide insights on the differences between the
models.
[COMMENTS]
Accepted to EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2505.22945v2
[DATE]
2025-10-08 01:39:05+08:00
[CATEGORIES]
cs.CL
VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization
[AUTHORS]
Dingyu Yao, Chenxu Yang, Zhengyang Tong, Zheng Lin, Wei Liu, Jian Luan, Weiping Wang
[ABSTRACT]
The Key-Value (KV) cache introduces substantial memory overhead during large
language model (LLM) inference. Although existing vector quantization (VQ)
methods reduce KV cache usage and provide flexible representational capacity
across bit-widths, they suffer severe performance degradation at ultra-low
bit-widths due to key cache outliers that hinder effective codebook
utilization. To address this challenge, we propose VecInfer, a novel VQ method
for aggressive KV cache compression while enabling efficient inference. By
applying smooth and Hadamard transformations, VecInfer suppresses outliers in
the key cache, enabling the codebook to comprehensively cover the original data
distribution and thereby reducing quantization difficulty. To facilitate
efficient deployment, we design an optimized CUDA kernel that fuses computation
with dequantization to minimize memory access overhead. Extensive evaluations
demonstrate that VecInfer consistently outperforms existing quantization
baselines across both long-context understanding and mathematical reasoning
tasks. With only 2-bit quantization, VecInfer achieves performance comparable
to full precision, while delivering up to $2.7\times$ speedup in
large-batch self-attention computation and $8.3\times$ reduction in
single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
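
To illustrate why a Hadamard transformation can suppress outliers, here is a minimal pure-Python sketch. It covers only the rotation step; the smoothing transformation, the codebook, and the CUDA kernel mentioned in the abstract are not modeled, and the function names and the toy key vector are assumptions made for the example.

```python
import math


def hadamard(n):
    """Build an n x n orthonormal Hadamard matrix (n must be a power of two)
    using the Sylvester construction."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Double the matrix: [[H, H], [H, -H]].
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def rotate(vec, h):
    """Apply the Hadamard rotation to one key vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in h]


# Toy key vector with a single outlier channel (illustrative values).
key = [0.1, -0.2, 0.05, 12.0, 0.0, 0.1, -0.1, 0.2]
rotated = rotate(key, hadamard(len(key)))

print(max(abs(x) for x in key))      # magnitude of the largest channel before
print(max(abs(x) for x in rotated))  # smaller after the rotation
```

Because the matrix is orthonormal, the rotation preserves the vector's norm while spreading the outlier's energy across all channels, which narrows the value range a quantizer has to cover.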
[LINK]
http://arxiv.org/abs/2510.06175v1
[DATE]
2025-10-08 01:35:28+08:00
[CATEGORIES]
cs.CL
How Reliable are Causal Probing Interventions?
[AUTHORS]
Marc Canby, Adam Davies, Chirag Rastogi, Julia Hockenmaier
[ABSTRACT]
Causal probing aims to analyze foundation models by examining how intervening
on their representation of various latent properties impacts their outputs.
Recent works have cast doubt on the theoretical basis of several leading causal
probing methods, but it has been unclear how to systematically evaluate the
effectiveness of these methods in practice. To address this, we define two key
causal probing desiderata: completeness (how thoroughly the representation of
the target property has been transformed) and selectivity (how little
non-targeted properties have been impacted). We find that there is an inherent
tradeoff between the two, and we define reliability as their harmonic mean.
We introduce an empirical analysis framework to measure and evaluate these
quantities, allowing us to make the first direct comparisons between different
families of leading causal probing methods (e.g., linear vs. nonlinear, or
concept removal vs. counterfactual interventions). We find that: (1) all
methods show a clear tradeoff between completeness and selectivity; (2) more
complete and reliable methods have a greater impact on LLM behavior; and (3)
nonlinear interventions are almost always more reliable than linear
interventions.
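
As a small worked example of the reliability score, the following sketch computes the harmonic mean of completeness and selectivity. It assumes both quantities are already scaled to [0, 1]; how they are measured is not specified in the abstract, and the function name is hypothetical.

```python
def reliability(completeness, selectivity):
    """Harmonic mean of completeness and selectivity, both assumed in [0, 1]."""
    if completeness + selectivity == 0:
        return 0.0  # avoid division by zero when both scores are zero
    return 2 * completeness * selectivity / (completeness + selectivity)


# A very complete but unselective intervention scores well below the
# arithmetic mean of its two components (0.45 versus 0.6 here).
print(reliability(0.9, 0.3))
```

The harmonic mean penalizes imbalance, so a method cannot score as reliable by maximizing one desideratum at the expense of the other.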
[LINK]
http://arxiv.org/abs/2408.15510v4
[DATE]
2025-10-08 01:20:30+08:00
[CATEGORIES]
cs.LG
cs.CL
Trajectory Prediction Meets Large Language Models: A Survey
[AUTHORS]
Yi Xu, Ruining Yang, Yitian Zhang, Jianglin Lu, Mingyuan Zhang, Yizhou Wang, Lili Su, Yun Fu
[ABSTRACT]
Recent advances in large language models (LLMs) have sparked growing interest
in integrating language-driven techniques into trajectory prediction. By
leveraging their semantic and reasoning capabilities, LLMs are reshaping how
autonomous systems perceive, model, and predict trajectories. This survey
provides a comprehensive overview of this emerging field, categorizing recent
work into five directions: (1) Trajectory prediction via language modeling
paradigms, (2) Direct trajectory prediction with pretrained language models,
(3) Language-guided scene understanding for trajectory prediction, (4)
Language-driven data generation for trajectory prediction, (5) Language-based
reasoning and interpretability for trajectory prediction. For each, we analyze
representative methods, highlight core design choices, and identify open
challenges. This survey bridges natural language processing and trajectory
prediction, offering a unified perspective on how language can enrich
trajectory prediction.
[COMMENTS]
16 pages, GitHub:
https://github.com/colorfulfuture/Awesome-Trajectory-Motion-Prediction-Papers
[LINK]
http://arxiv.org/abs/2506.03408v2
[DATE]
2025-10-08 01:20:13+08:00
[CATEGORIES]
cs.CL
RoSE: Round-robin Synthetic Data Evaluation for Selecting LLM Generators without Human Test Sets
[AUTHORS]
Jan Cegin, Branislav Pecher, Ivan Srba, Jakub Simko
[ABSTRACT]
LLMs are powerful generators of synthetic data, which are used for training
smaller, specific models. This is especially valuable for low-resource
languages, where human-labelled data is scarce but LLMs can still produce
high-quality text. However, LLMs differ in how useful their outputs are for
training. Selecting the best LLM as a generator is challenging because
extrinsic evaluation requires costly human annotations (which are often
unavailable for low-resource languages), while intrinsic metrics correlate
poorly with downstream performance. We introduce Round robin Synthetic data
Evaluation (RoSE), a proxy metric for selecting the best LLM generator without
human test sets. RoSE trains a small model on the outputs of a candidate
generator (LLM) and then evaluates it on generated synthetic examples from all
other candidate LLMs. The final RoSE score is the mean performance of this
small model. Across six LLMs, eleven languages, and three tasks (sentiment,
topic, intent), RoSE identifies the optimal generator more often than any other
intrinsic heuristics. RoSE outperforms intrinsic heuristics and comes within
0.76 percentage points of the optimal generator baseline. This result is
measured in terms of downstream performance, obtained by training a small model
on the chosen generator’s outputs (optimal vs. proxy metric selected) and
evaluating it on human-labelled test data. Additionally, RoSE is the only
metric to achieve a positive correlation with performance on human test data.
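
The round-robin procedure described in this abstract can be sketched as below. This is an illustrative outline under stated assumptions: `train` and `evaluate` are hypothetical callables standing in for small-model training and scoring, and the toy majority-label classifier and datasets exist only to make the example runnable.

```python
from collections import Counter
from statistics import mean


def rose_scores(synthetic_data, train, evaluate):
    """Compute a round-robin score for each candidate generator.

    synthetic_data: dict mapping generator name -> its synthetic dataset
                    (at least two generators are required).
    train:          callable(dataset) -> model (hypothetical helper).
    evaluate:       callable(model, dataset) -> score (hypothetical helper).
    """
    scores = {}
    for name, data in synthetic_data.items():
        model = train(data)
        # Evaluate on every *other* generator's synthetic examples.
        others = [evaluate(model, d)
                  for n, d in synthetic_data.items() if n != name]
        scores[name] = mean(others)
    return scores


def train_majority(dataset):
    """Toy 'model': always predict the most frequent training label."""
    return Counter(label for _, label in dataset).most_common(1)[0][0]


def accuracy(model, dataset):
    """Fraction of examples whose label matches the constant prediction."""
    return mean([1.0 if label == model else 0.0 for _, label in dataset])


# Toy synthetic datasets from three hypothetical generators.
toy = {
    "A": [("x", 1), ("y", 1), ("z", 0)],
    "B": [("a", 1), ("b", 1)],
    "C": [("c", 0), ("d", 0)],
}
scores = rose_scores(toy, train_majority, accuracy)
best = max(scores, key=scores.get)  # generator with the highest score
```

No human-labelled data appears anywhere in the loop: each candidate is judged only on how well a model trained on its outputs transfers to the other candidates' outputs.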
[COMMENTS]
16 pages
[LINK]
http://arxiv.org/abs/2510.06143v1
[DATE]
2025-10-08 01:17:14+08:00
[CATEGORIES]
cs.CL
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credits
[AUTHORS]
Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin
[ABSTRACT]
Diffusion large language models (dLLMs) generate text through iterative
denoising steps, achieving parallel decoding by denoising only high-confidence
positions at each step. However, existing approaches often repetitively remask
tokens due to initially low confidence scores, leading to redundant iterations
and limiting overall acceleration. Through the analysis of dLLM decoding
traces, we observe that the model often determines the final prediction for a
token several steps before the decoding step. To leverage this historical
information and avoid redundant steps, we introduce the concept of Trace
Credit, which quantifies each token’s convergence potential by accumulating
historical logits. Furthermore, we propose CreditDecoding, a training-free
parallel decoding algorithm that accelerates the confidence convergence of
correct but underconfident tokens by fusing current logits with Trace Credit.
This process significantly reduces redundant iterations and enhances decoding
robustness. On eight benchmarks, CreditDecoding achieves a 5.48 times speedup
and a 0.48 performance improvement over LLaDA-8B-Instruct, and a 4.11 times
speedup with a 0.15 performance improvement over LLaDA-MoE-Instruct.
Importantly, CreditDecoding scales effectively to long sequences and is
orthogonal to mainstream inference optimizations, making it a readily
integrable and versatile solution.
[COMMENTS]
18 pages, 8 figures, 4 tables
[LINK]
http://arxiv.org/abs/2510.06133v1
[DATE]
2025-10-08 01:08:33+08:00
[CATEGORIES]
cs.CL
Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer
[AUTHORS]
Muhammad Dehan Al Kautsar, Fajri Koto
[ABSTRACT]
Tokenization defines the foundation of multilingual language models by
determining how words are represented and shared across languages. However,
existing methods often fail to support effective cross-lingual transfer because
semantically equivalent words are assigned distinct embeddings. For example, “I
eat rice” in English and “Ina cin shinkafa” in Hausa are typically mapped to
different vocabulary indices, preventing shared representations and limiting
cross-lingual generalization. We introduce parallel tokenizers. This new
framework trains tokenizers monolingually and then aligns their vocabularies
exhaustively using bilingual dictionaries or word-to-word translation, ensuring
consistent indices for semantically equivalent words. This alignment enforces a
shared semantic space across languages while naturally improving fertility
balance. To assess their effectiveness, we pretrain a transformer encoder from
scratch on thirteen low-resource languages and evaluate it on sentiment
analysis, hate speech detection, emotion classification, and sentence embedding
similarity. Across all tasks, models trained with parallel tokenizers
outperform conventional multilingual baselines, confirming that rethinking
tokenization is essential for advancing multilingual representation
learning, especially in low-resource settings.
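
The vocabulary alignment idea can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function name, the word-level toy lexicon, and the rule that unaligned words receive fresh indices are all assumptions made for the example.

```python
def align_vocab(pivot_vocab, target_words, dictionary):
    """Assign indices to target-language words so that translation
    equivalents share the pivot word's index.

    pivot_vocab:  dict word -> index for the pivot language.
    target_words: iterable of words from the monolingual target tokenizer.
    dictionary:   dict target word -> pivot word (bilingual lexicon).
    """
    aligned = {}
    # Assumption: words without a dictionary match get new, unused indices.
    next_free = max(pivot_vocab.values(), default=-1) + 1
    for word in target_words:
        pivot = dictionary.get(word)
        if pivot in pivot_vocab:
            aligned[word] = pivot_vocab[pivot]
        else:
            aligned[word] = next_free
            next_free += 1
    return aligned


# Toy English pivot vocabulary and an illustrative word-to-word lexicon.
english = {"i": 0, "eat": 1, "rice": 2}
lexicon = {"ina": "i", "cin": "eat", "shinkafa": "rice"}
hausa = align_vocab(english, ["ina", "cin", "shinkafa", "gida"], lexicon)
```

Here "rice" and "shinkafa" map to the same index and therefore the same embedding row, while "gida", which has no entry in the toy lexicon, receives its own index.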
[COMMENTS]
18 pages, 25 tables, 7 figures
[LINK]
http://arxiv.org/abs/2510.06128v1
[DATE]
2025-10-08 01:05:49+08:00
[CATEGORIES]
cs.CL
Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models
[AUTHORS]
Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona
[ABSTRACT]
Large Language Models (LLMs) are prone to hallucination, the generation of
plausible yet factually incorrect statements. This work investigates the
intrinsic, architectural origins of this failure mode through three primary
contributions. First, to enable the reliable tracing of internal semantic
failures, we propose Distributional Semantics Tracing (DST), a unified
framework that integrates established interpretability techniques to produce a
causal map of a model’s reasoning, treating meaning as a function of context
(distributional semantics). Second, we pinpoint the model’s layer at which a
hallucination becomes inevitable, identifying a specific commitment
layer where a model’s internal representations irreversibly diverge from
factuality. Third, we identify the underlying mechanism for these failures. We
observe a conflict between distinct computational pathways, which we interpret
using the lens of dual-process theory: a fast, heuristic associative
pathway (akin to System 1) and a slow, deliberate contextual pathway
(akin to System 2), leading to predictable failure modes such as
Reasoning Shortcut Hijacks. Our framework’s ability to quantify the
coherence of the contextual pathway reveals a strong negative correlation
($\rho = -0.863$) with hallucination rates, implying that these failures are
predictable consequences of internal semantic weakness. The result is a
mechanistic account of how, when, and why hallucinations occur within the
Transformer architecture.
[LINK]
http://arxiv.org/abs/2510.06107v1
[DATE]
2025-10-08 00:40:31+08:00
[CATEGORIES]
cs.CL
The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models
[AUTHORS]
Muyu He, Muhammad Ali Shafique, Anand Kumar, Tsach Mackey, Nazneen Rajani
[ABSTRACT]
Distilling the thinking traces of a Large Language Model (LLM) with reasoning
capabilities into a smaller model has been proven effective. Yet, there is a
scarcity of work done on how model performances scale with the quantity of
distillation data. In this work, we study the scaling trend of distilling
competitive coding skills on two small non-reasoning LLMs. We validate the
hypothesis that there is a “valley of code reasoning”: downstream
performance on competitive coding first drops as data quantity increases, then
it steadily increases in a sharper-than-log-linear fashion. Having identified
the trend, we further fine-tune the models at two different distillation stages
on the same data to ground conclusions on their respective learning phases. We
learn that across stages in the low and medium-low data regimes, small models
benefit significantly more from easier coding questions than from harder ones. We
also find that, surprisingly, the correctness of outputs in training data makes
no difference to distillation outcomes. Our work represents a step forward in
understanding the training dynamics of code reasoning distillation beyond
intuition.
[COMMENTS]
NeurIPS 2025 Workshop on Deep Learning for Code (DL4C), Project page:
https://collinear.ai/valley-of-reasoning
[LINK]
http://arxiv.org/abs/2510.06101v1
[DATE]
2025-10-08 00:32:09+08:00
[CATEGORIES]
cs.CL
Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
[AUTHORS]
Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach
[ABSTRACT]
Reasoning language models improve performance on complex tasks by generating
long chains of thought (CoTs), but this process can also increase harmful
outputs in adversarial settings. In this work, we ask whether the long CoTs can
be leveraged for predictive safety monitoring: do the reasoning traces provide
early signals of final response alignment that could enable timely
intervention? We evaluate a range of monitoring methods using either CoT text
or activations, including highly capable large language models, fine-tuned
classifiers, and humans. First, we find that a simple linear probe trained on
CoT activations significantly outperforms all text-based baselines in
predicting whether a final response is safe or unsafe, with an average absolute
increase of 13 in F1 scores over the best-performing alternatives. CoT texts
are often unfaithful and misleading, while model latents provide a more
reliable predictive signal. Second, the probe can be applied to early CoT
segments before the response is generated, showing that alignment signals
appear before reasoning completes. Error analysis reveals that the performance
gap between text classifiers and the linear probe largely stems from a subset
of responses we call performative CoTs, where the reasoning consistently
contradicts the final response as the CoT progresses. Our findings generalize
across model sizes, families, and safety benchmarks, suggesting that
lightweight probes could enable real-time safety monitoring and early
intervention during generation.
[LINK]
http://arxiv.org/abs/2507.12428v2
[DATE]
2025-10-08 00:30:40+08:00
[CATEGORIES]
cs.CL
cs.LG
The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
[AUTHORS]
Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
[ABSTRACT]
The objectives that Large Language Models (LLMs) implicitly optimize remain
dangerously opaque, making trustworthy alignment and auditing a grand
challenge. While Inverse Reinforcement Learning (IRL) can infer reward
functions from behaviour, existing approaches either produce a single,
overconfident reward estimate or fail to address the fundamental ambiguity of
the task (non-identifiability). This paper introduces a principled auditing
framework that re-frames reward inference from a simple estimation task to a
comprehensive process for verification. Our framework leverages Bayesian IRL to
not only recover a distribution over objectives but to enable three critical
audit capabilities: (i) Quantifying and systematically reducing
non-identifiability by demonstrating posterior contraction over sequential
rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics
that expose spurious shortcuts and identify out-of-distribution prompts where
the inferred objective cannot be trusted; and (iii) Validating policy-level
utility by showing that the refined, low-uncertainty reward can be used
directly in RLHF to achieve training dynamics and toxicity reductions
comparable to the ground-truth alignment process. Empirically, our framework
successfully audits a detoxified LLM, yielding a well-calibrated and
interpretable objective that strengthens alignment guarantees. Overall, this
work provides a practical toolkit for auditors, safety teams, and regulators to
verify what LLMs are truly trying to achieve, moving us toward more trustworthy
and accountable AI.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2510.06096v1
[DATE]
2025-10-08 00:25:14+08:00
[CATEGORIES]
cs.LG
cs.CL
Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
[AUTHORS]
Nyal Patel, Matthieu Bou, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
[ABSTRACT]
Reinforcement Learning from Human Feedback (RLHF) aligns Large Language
Models (LLMs) with human preferences, yet the underlying reward signals they
internalize remain hidden, posing a critical challenge for interpretability and
safety. Existing approaches attempt to extract these latent incentives using
Inverse Reinforcement Learning (IRL), but treat all preference pairs equally,
often overlooking the most informative signals: those examples the extracted
reward model misclassifies or assigns nearly equal scores, which we term
\emph{failures}. We introduce a novel \emph{failure-aware} IRL algorithm that
focuses on misclassified or difficult examples to recover the latent rewards
defining model behaviors. By learning from these failures, our failure-aware
IRL extracts reward functions that better reflect the true objectives behind
RLHF. We demonstrate that failure-aware IRL outperforms existing IRL baselines
across multiple metrics when applied to LLM detoxification, without requiring
external classifiers or supervision. Crucially, failure-aware IRL yields
rewards that better capture the true incentives learned during RLHF, enabling
more effective re-RLHF training than standard IRL. This establishes
failure-aware IRL as a robust, scalable method for auditing model alignment and
reducing ambiguity in the IRL process.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2510.06092v1
[DATE]
2025-10-08 00:20:14+08:00
[CATEGORIES]
cs.LG
cs.CL
Epistemic Diversity and Knowledge Collapse in Large Language Models
[AUTHORS]
Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein
[ABSTRACT]
Large language models (LLMs) tend to generate lexically, semantically, and
stylistically homogeneous texts. This poses a risk of knowledge collapse, where
homogeneous LLMs mediate a shrinking in the range of accessible information over
time. Existing works on homogenization are limited by a focus on closed-ended
multiple-choice setups or fuzzy semantic features, and do not look at trends
across time and cultural contexts. To overcome this, we present a new
methodology to measure epistemic diversity, i.e., variation in real-world
claims in LLM outputs, which we use to perform a broad empirical study of LLM
knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200
prompt variations sourced from real user chats. For the topics in our study, we
show that while newer models tend to generate more diverse claims, nearly all
models are less epistemically diverse than a basic web search. We find that
model size has a negative impact on epistemic diversity, while
retrieval-augmented generation (RAG) has a positive impact, though the
improvement from RAG varies by the cultural context. Finally, compared to a
traditional knowledge source (Wikipedia), we find that country-specific claims
reflect the English language more than the local one, highlighting a gap in
epistemic representation.
[COMMENTS]
16 pages, 8 figures, 4 tables. v2 changelog: fixed the modeling for
Table 3; the random effect is the model version
[LINK]
http://arxiv.org/abs/2510.04226v2
[DATE]
2025-10-08 00:07:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Entropy-Gated Branching for Efficient Test-Time Reasoning
[AUTHORS]
Xianzhi Li, Ethan Callanan, Abdellah Ghassel, Xiaodan Zhu
[ABSTRACT]
Test-time compute methods can significantly improve the reasoning
capabilities and problem-solving accuracy of large language models (LLMs).
However, these approaches require substantially more computational resources,
with most compute wasted on exploring low-diversity branches where the model
already exhibits high confidence. We observe that a small subset of uncertain
reasoning steps has a disproportionately large impact on final prediction
accuracy, and branching at these critical junctures tends to yield more diverse
and higher-quality candidate reasoning steps. We propose Entropy-Gated
Branching (EGB), which branches only at high-uncertainty steps and prunes
expansions with a lightweight verifier. On mathematical and financial reasoning
benchmarks, EGB improves accuracy by 22.6% over standard inference while
running 31%-75% faster than test-time beam search across math benchmarks and
achieving higher performance. Our results show that dynamic resource allocation during
inference can substantially improve both efficiency and effectiveness, offering
a more scalable pathway to enhanced LLM reasoning capabilities.
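
The gating idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-step entropy measure, the threshold value, the verifier interface, and the convention that the first candidate is the greedy continuation are all assumptions made for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-step probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_branch(probs, threshold=1.0):
    """Gate: branch only when entropy exceeds the (hypothetical) threshold."""
    return token_entropy(probs) > threshold

def expand(step_probs, candidates, verifier, k=2, threshold=1.0):
    """Keep the top-k verifier-scored candidates at uncertain steps.

    At confident steps, keep only the first candidate, which is assumed
    here to be the greedy continuation. `verifier` is a hypothetical
    scoring function mapping a candidate to a number.
    """
    if not should_branch(step_probs, threshold):
        return candidates[:1]
    return sorted(candidates, key=verifier, reverse=True)[:k]
```

A uniform distribution over four options has entropy log 4 (about 1.39 nats) and triggers branching under the default threshold, whereas a sharply peaked distribution does not.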
[LINK]
http://arxiv.org/abs/2503.21961v3
[DATE]
2025-10-08 00:06:25+08:00
[CATEGORIES]
cs.CL
HOG-Diff: Higher-Order Guided Diffusion for Graph Generation
[AUTHORS]
Yiming Huang, Tolga Birdal
[ABSTRACT]
Graph generation is a critical yet challenging task as empirical analyses
require a deep understanding of complex, non-Euclidean structures. Diffusion
models have recently made significant achievements in graph generation, but
these models are typically adapted from image generation frameworks and
overlook inherent higher-order topology, leaving them ill-suited for capturing
the topological properties of graphs. In this work, we propose Higher-order
Guided Diffusion (HOG-Diff), a principled framework that progressively
generates plausible graphs with inherent topological structures. HOG-Diff
follows a coarse-to-fine generation curriculum guided by higher-order topology
and implemented via diffusion bridges. We further prove that our model exhibits
a stronger theoretical guarantee than classical diffusion frameworks. Extensive
experiments on both molecular and generic graph generation tasks demonstrate
that our method consistently outperforms or remains competitive with
state-of-the-art baselines. Our code is available at
https://github.com/Yiminghh/HOG-Diff.
[LINK]
http://arxiv.org/abs/2502.04308v2
[DATE]
2025-10-08 01:58:18+08:00
[CATEGORIES]
cs.LG
Hierarchical Reasoning Models: Perspectives and Misconceptions
[AUTHORS]
Renee Ge, Qianli Liao, Tomaso Poggio
[ABSTRACT]
Transformers have demonstrated remarkable performance in natural language
processing and related domains, as they largely focus on sequential,
autoregressive next-token prediction tasks. Yet, they struggle in logical
reasoning, not necessarily because of a fundamental limitation of these models,
but possibly due to the lack of exploration of more creative uses, such as
latent space and recurrent reasoning. An emerging exploration in this direction
is the Hierarchical Reasoning Model (Wang et al., 2025), which introduces a
novel type of recurrent reasoning in the latent space of transformers,
achieving remarkable performance on a wide range of 2D reasoning tasks. Despite
the promising results, this line of models is still at an early stage and calls
for in-depth investigation. In this work, we review this class of models,
examine key design choices, test alternative variants and clarify common
misconceptions.
[COMMENTS]
Found errors in some results of v1. Removed them and changed
conclusions
[LINK]
http://arxiv.org/abs/2510.00355v2
[DATE]
2025-10-08 01:57:06+08:00
[CATEGORIES]
cs.LG
Modulation Discovery with Differentiable Digital Signal Processing
[AUTHORS]
Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss
[ABSTRACT]
Modulations are a critical part of sound design and music production,
enabling the creation of complex and evolving audio. Modern synthesizers
provide envelopes, low frequency oscillators (LFOs), and more parameter
automation tools that allow users to modulate the output with ease. However,
determining the modulation signals used to create a sound is difficult, and
existing sound-matching / parameter estimation systems are often
uninterpretable black boxes or predict high-dimensional framewise parameter
values without considering the shape, structure, and routing of the underlying
modulation curves. We propose a neural sound-matching approach that leverages
modulation extraction, constrained control signal parameterizations, and
differentiable digital signal processing (DDSP) to discover the modulations
present in a sound. We demonstrate the effectiveness of our approach on highly
modulated synthetic and real audio samples and its applicability to different
DDSP synth architectures, and we investigate the trade-off it incurs between
interpretability and sound-matching accuracy. We make our code and audio
samples available and provide the trained DDSP synths in a VST plugin.
[COMMENTS]
Accepted to WASPAA 2025 (best paper award candidate). Code, audio
samples, and plugins can be found at
https://christhetree.github.io/mod_discovery/
[LINK]
http://arxiv.org/abs/2510.06204v1
[DATE]
2025-10-08 01:56:24+08:00
[CATEGORIES]
cs.LG
Reference Grounded Skill Discovery
[AUTHORS]
Seungeun Rho, Aaron Trinh, Danfei Xu, Sehoon Ha
[ABSTRACT]
Scaling unsupervised skill discovery algorithms to high-DoF agents remains
challenging. As dimensionality increases, the exploration space grows
exponentially, while the manifold of meaningful skills remains limited.
Therefore, semantic meaningfulness becomes essential to effectively guide
exploration in high-dimensional spaces. In this work, we present
Reference-Grounded Skill Discovery (RGSD), a novel algorithm that grounds skill
discovery in a semantically meaningful latent space using reference data. RGSD
first performs contrastive pretraining to embed motions on a unit hypersphere,
clustering each reference trajectory into a distinct direction. This grounding
enables skill discovery to simultaneously involve both imitation of reference
behaviors and the discovery of semantically related diverse behaviors. On a
simulated SMPL humanoid with 359-D observations and 69-D actions, RGSD learns
structured skills including walking, running, punching, and side stepping, and
also discovers related novel behaviors. In downstream control tasks, RGSD
outperforms imitation-based skill acquisition baselines. Our results suggest
that lightweight reference-guided grounding offers a practical path to
discovering semantically rich and structured skills in high-DoF systems.
[LINK]
http://arxiv.org/abs/2510.06203v1
[DATE]
2025-10-08 01:55:01+08:00
[CATEGORIES]
cs.LG
On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond
[AUTHORS]
Chenxiao Yang, Cai Zhou, David Wipf, Zhiyuan Li
[ABSTRACT]
This paper formally studies generation processes, including auto-regressive
next-token prediction and masked diffusion, that abstract beyond architectural
specifics. At this level of abstraction, we quantify their benefits and
limitations through measurable criteria such as computational hardness and
learnability. In particular, we demonstrate that allowing generation to proceed
beyond autoregression and current masked diffusion, with capabilities to
rewrite and length-variable edit, can bring significant theoretical and
empirical advantages, with important implications for frontier LLMs that aspire
to tackle increasingly hard problems and work universally across domains beyond
natural language, such as coding and science.
[LINK]
http://arxiv.org/abs/2510.06190v1
[DATE]
2025-10-08 01:49:30+08:00
[CATEGORIES]
cs.LG
Conformalized Gaussian processes for online uncertainty quantification over graphs
[AUTHORS]
Jinwen Xu, Qin Lu, Georgios B. Giannakis
[ABSTRACT]
Uncertainty quantification (UQ) over graphs arises in a number of
safety-critical applications in network science. The Gaussian process (GP), as
a classical Bayesian framework for UQ, has been developed to handle
graph-structured data by devising topology-aware kernel functions. However,
such GP-based approaches are limited not only by the prohibitive computational
complexity, but also by the strict modeling assumptions that might yield poor
coverage, especially with labels arriving on the fly. To effect scalability, we
devise a novel graph-aware parametric GP model by leveraging the random feature
(RF)-based kernel approximation, which is amenable to efficient recursive
Bayesian model updates. To further allow for adaptivity, an ensemble of
graph-aware RF-based scalable GPs is leveraged, with per-GP weights
adapted to data arriving incrementally. To ensure valid coverage with
robustness to model mis-specification, we wed the GP-based set predictors with
the online conformal prediction framework, which post-processes the prediction
sets using adaptive thresholds. Experimental results show that the proposed method yields
improved coverage and efficient prediction sets over existing baselines by
adaptively ensembling the GP models and setting the key threshold parameters in
CP.
[LINK]
http://arxiv.org/abs/2510.06181v1
[DATE]
2025-10-08 01:44:13+08:00
[CATEGORIES]
cs.LG
Differentiable Model Predictive Control on the GPU
[AUTHORS]
Emre Adabag, Marcus Greiff, John Subosits, Thomas Lew
[ABSTRACT]
Differentiable model predictive control (MPC) offers a powerful framework for
combining learning and control. However, its adoption has been limited by the
inherently sequential nature of traditional optimization algorithms, which are
challenging to parallelize on modern computing hardware like GPUs. In this
work, we tackle this bottleneck by introducing a GPU-accelerated differentiable
optimization tool for MPC. This solver leverages sequential quadratic
programming and a custom preconditioned conjugate gradient (PCG) routine with
tridiagonal preconditioning to exploit the problem’s structure and enable
efficient parallelization. We demonstrate substantial speedups over CPU- and
GPU-based baselines, significantly improving upon state-of-the-art training
times on benchmark reinforcement learning and imitation learning tasks.
Finally, we showcase the method on the challenging task of reinforcement
learning for driving at the limits of handling, where it enables robust
drifting of a Toyota Supra through water puddles.
[LINK]
http://arxiv.org/abs/2510.06179v1
[DATE]
2025-10-08 01:42:17+08:00
[CATEGORIES]
cs.LG
Thermodynamic Performance Limits for Score-Based Diffusion Models
[AUTHORS]
Nathan X. Kodama, Michael Hinczewski
[ABSTRACT]
We establish a fundamental connection between score-based diffusion models
and non-equilibrium thermodynamics by deriving performance limits based on
entropy rates. Our main theoretical contribution is a lower bound on the
negative log-likelihood of the data that relates model performance to entropy
rates of diffusion processes. We numerically validate this bound on a synthetic
dataset and investigate its tightness. By building a bridge to entropy rates
(system, intrinsic, and exchange entropy), we provide new insights into the
thermodynamic operation of these models, drawing parallels to Maxwell’s demon,
with implications for thermodynamic computing hardware. Our framework connects
generative modeling performance to fundamental physical principles through
stochastic thermodynamics.
[LINK]
http://arxiv.org/abs/2510.06174v1
[DATE]
2025-10-08 01:35:18+08:00
[CATEGORIES]
cs.LG
Higher-Order Feature Attribution: Bridging Statistics, Explainable AI, and Topological Signal Processing
[AUTHORS]
Kurt Butler, Guanchao Feng, Petar Djuric
[ABSTRACT]
Feature attributions are post-training analysis methods that assess how
various input features of a machine learning model contribute to an output
prediction. Their interpretation is straightforward when features act
independently, but becomes less direct when the predictive model involves
interactions such as multiplicative relationships or joint feature
contributions. In this work, we propose a general theory of higher-order
feature attribution, which we develop on the foundation of Integrated Gradients
(IG). This work extends existing frameworks in the literature on explainable
AI. When using IG as the method of feature attribution, we discover natural
connections to statistics and topological signal processing. We provide several
theoretical results that establish the theory, and we validate our theory on a
few examples.
[COMMENTS]
5 pages, 3 figures
[LINK]
http://arxiv.org/abs/2510.06165v1
[DATE]
2025-10-08 01:29:34+08:00
[CATEGORIES]
cs.LG
LLMs as Policy-Agnostic Teammates: A Case Study in Human Proxy Design for Heterogeneous Agent Teams
[AUTHORS]
Aju Ani Justus, Chris Baber
[ABSTRACT]
A critical challenge in modelling Heterogeneous-Agent Teams is training
agents to collaborate with teammates whose policies are inaccessible or
non-stationary, such as humans. Traditional approaches rely on expensive
human-in-the-loop data, which limits scalability. We propose using Large
Language Models (LLMs) as policy-agnostic human proxies to generate synthetic
data that mimics human decision-making. To evaluate this, we conduct three
experiments in a grid-world capture game inspired by Stag Hunt, a game theory
paradigm that balances risk and reward. In Experiment 1, we compare decisions
from 30 human participants and 2 expert judges with outputs from LLaMA 3.1 and
Mixtral 8x22B models. LLMs, prompted with game-state observations and reward
structures, align more closely with experts than participants, demonstrating
consistency in applying underlying decision criteria. Experiment 2 modifies
prompts to induce risk-sensitive strategies (e.g. “be risk averse”). LLM
outputs mirror human participants’ variability, shifting between risk-averse
and risk-seeking behaviours. Finally, Experiment 3 tests LLMs in a dynamic
grid-world where the LLM agents generate movement actions. LLMs produce
trajectories resembling human participants’ paths. While LLMs cannot yet fully
replicate human adaptability, their prompt-guided diversity offers a scalable
foundation for simulating policy-agnostic teammates.
[COMMENTS]
This is a preprint of a paper presented at the European
Conference on Artificial Intelligence (ECAI 2025). It is made publicly
available for the benefit of the research community and should be regarded as
a preprint rather than a formally reviewed publication
[LINK]
http://arxiv.org/abs/2510.06151v1
[DATE]
2025-10-08 01:21:20+08:00
[CATEGORIES]
cs.LG
Gemstones: A Model Suite for Multi-Faceted Scaling Laws
[AUTHORS]
Sean McLeish, John Kirchenbauer, David Yu Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein
[ABSTRACT]
Scaling laws are typically fit using a family of models with a narrow range
of frozen hyperparameter choices. In this work we study scaling laws using
multiple architectural shapes and hyperparameter choices, highlighting their
impact on resulting prescriptions. As a primary artifact of our research, we
release the Gemstones: an open-source scaling law dataset, consisting of over
4000 checkpoints from transformers with up to 2 billion parameters and diverse
architectural shapes, including ablations over learning rate and cooldown. Our
checkpoints enable more complex studies of scaling, such as analyzing the
relationship between width and depth. By examining our model suite, we find
that the prescriptions of scaling laws can be highly sensitive to the
experimental design process and the specific model checkpoints used during
fitting.
[COMMENTS]
NeurIPS 2025
[LINK]
http://arxiv.org/abs/2502.06857v3
[DATE]
2025-10-08 01:20:21+08:00
[CATEGORIES]
cs.LG
Non-iid hypothesis testing: from classical to quantum
[AUTHORS]
Giacomo De Palma, Marco Fanizza, Connor Mowry, Ryan O’Donnell
[ABSTRACT]
We study hypothesis testing (aka state certification) in the non-identically
distributed setting. A recent work (Garg et al. 2023) considered the classical
case, in which one is given (independent) samples from $T$ unknown probability
distributions $p_1, \dots, p_T$ on $[d] = \{1, 2, \dots, d\}$, and one wishes
to accept/reject the hypothesis that their average $p_{\mathrm{avg}}$ equals a
known hypothesis distribution $q$. Garg et al. showed that if one has just $c =
2$ samples from each $p_i$, and provided $T \gg \frac{\sqrt{d}}{\epsilon^2} +
\frac{1}{\epsilon^4}$, one can (whp) distinguish $p_{\mathrm{avg}} = q$ from
$d_{\mathrm{TV}}(p_{\mathrm{avg}},q) > \epsilon$. This nearly matches the
optimal result for the classical iid setting (namely, $T \gg
\frac{\sqrt{d}}{\epsilon^2}$). Besides optimally improving this result (and
generalizing to tolerant testing with more stringent distance measures), we
study the analogous problem of hypothesis testing for non-identical quantum
states. Here we uncover an unexpected phenomenon: for any $d$-dimensional
hypothesis state $\sigma$, and given just a single copy ($c = 1$) of each state
$\rho_1, \dots, \rho_T$, one can distinguish $\rho_{\mathrm{avg}} = \sigma$
from $D_{\mathrm{tr}}(\rho_{\mathrm{avg}},\sigma) > \epsilon$ provided $T \gg
d/\epsilon^2$. (Again, we generalize to tolerant testing with more stringent
distance measures.) This matches the optimal result for the iid case, which is
surprising because doing this with $c = 1$ is provably impossible in the
classical case. We also show that the analogous phenomenon happens for the
non-iid extension of identity testing between unknown states. A technical tool
we introduce may be of independent interest: an Efron-Stein inequality, and
more generally an Efron-Stein decomposition, in the quantum setting.
[COMMENTS]
33 pages, 2 figures
[LINK]
http://arxiv.org/abs/2510.06147v1
[DATE]
2025-10-08 01:19:26+08:00
[CATEGORIES]
cs.LG
Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images
[AUTHORS]
Aditya Prakash, David Forsyth, Saurabh Gupta
[ABSTRACT]
We tackle the problem of forecasting bimanual 3D hand motion & articulation
from a single image in everyday settings. To address the lack of 3D hand
annotations in diverse settings, we design an annotation pipeline consisting of
a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the
forecasting model, we adopt a diffusion loss to account for the multimodality
in hand motion distribution. Extensive experiments across 6 datasets show the
benefits of training on diverse data with imputed labels (14% improvement) and
effectiveness of our lifting (42% better) & forecasting (16.4% gain) models,
over the best baselines, especially in zero-shot generalization to everyday
images.
[COMMENTS]
Project page: https://ap229997.github.io/projects/forehand4d
[LINK]
http://arxiv.org/abs/2510.06145v1
[DATE]
2025-10-08 01:18:56+08:00
[CATEGORIES]
cs.LG
Improved High-probability Convergence Guarantees of Decentralized SGD
[AUTHORS]
Aleksandar Armacki, Ali H. Sayed
[ABSTRACT]
Convergence in high probability (HP) has been receiving increasing interest,
due to its attractive properties, such as exponentially decaying tail bounds
and strong guarantees for each individual run of an algorithm. While HP
guarantees are extensively studied in centralized settings, much less is
understood in the decentralized, networked setup. Existing HP studies in
decentralized settings impose strong assumptions, like uniformly bounded
gradients, or asymptotically vanishing noise, resulting in a significant gap
between assumptions used to establish convergence in the HP and the
mean-squared error (MSE) sense, even for the vanilla Decentralized Stochastic
Gradient Descent ($\mathtt{DSGD}$) algorithm. This is contrary to centralized
settings, where it is known that $\mathtt{SGD}$ converges in HP under the same
conditions on the cost function as needed to guarantee MSE convergence.
Motivated by this observation, we revisit HP guarantees for $\mathtt{DSGD}$ in
the presence of light-tailed noise. We show that $\mathtt{DSGD}$ converges in
HP under the same conditions on the cost as in the MSE sense, removing
uniformly bounded gradients and other restrictive assumptions, while
simultaneously achieving order-optimal rates for both non-convex and strongly
convex costs. Moreover, our improved analysis yields linear speed-up in the
number of users, demonstrating that $\mathtt{DSGD}$ maintains strong
performance in the HP sense and matches existing MSE guarantees. Our improved
results stem from a careful analysis of the moment generating function (MGF) of quantities of interest
(norm-squared of gradient or optimality gap) and the MGF of the consensus gap
between users’ models. To achieve linear speed-up, we provide a novel result on
the variance-reduction effect of decentralized methods in the HP sense and more
fine-grained bounds on the MGF for strongly convex costs, which are both of
independent interest.
[COMMENTS]
39 pages
[LINK]
http://arxiv.org/abs/2510.06141v1
[DATE]
2025-10-08 01:15:08+08:00
[CATEGORIES]
cs.LG
Multi-Task Reinforcement Learning with Language-Encoded Gated Policy Networks
[AUTHORS]
Rushiv Arora
[ABSTRACT]
Multi-task reinforcement learning often relies on task metadata – such as
brief natural-language descriptions – to guide behavior across diverse
objectives. We present Lexical Policy Networks (LEXPOL), a language-conditioned
mixture-of-policies architecture for multi-task RL. LEXPOL encodes task
metadata with a text encoder and uses a learned gating module to select or
blend among multiple sub-policies, enabling end-to-end training across tasks.
On MetaWorld benchmarks, LEXPOL matches or exceeds strong multi-task baselines
in success rate and sample efficiency, without task-specific retraining. To
analyze the mechanism, we further study settings with fixed expert policies
obtained independently of the gate and show that the learned language gate
composes these experts to produce behaviors appropriate to novel task
descriptions and unseen task combinations. These results indicate that
natural-language metadata can effectively index and recombine reusable skills
within a single policy.
[COMMENTS]
14 pages, 3 figures, 12 tables, 2 appendices. Currently under review
[LINK]
http://arxiv.org/abs/2510.06138v1
[DATE]
2025-10-08 01:12:24+08:00
[CATEGORIES]
cs.LG
lm-Meter: Unveiling Runtime Inference Latency for On-Device Language Models
[AUTHORS]
Haoxin Wang, Xiaolong Tu, Hongyu Ke, Huirong Chai, Dawei Chen, Kyungtae Han
[ABSTRACT]
Large Language Models (LLMs) are increasingly integrated into everyday
applications, but their prevalent cloud-based deployment raises growing
concerns around data privacy and long-term sustainability. Running LLMs locally
on mobile and edge devices (on-device LLMs) offers the promise of enhanced
privacy, reliability, and reduced communication costs. However, realizing this
vision remains challenging due to substantial memory and compute demands, as
well as limited visibility into performance-efficiency trade-offs on
resource-constrained hardware. We propose lm-Meter, the first lightweight,
online latency profiler tailored for on-device LLM inference. lm-Meter captures
fine-grained, real-time latency at both phase (e.g., embedding, prefill,
decode, softmax, sampling) and kernel levels without auxiliary devices. We
implement lm-Meter on commercial mobile platforms and demonstrate its high
profiling accuracy with minimal system overhead, e.g., only 2.58% throughput
reduction in prefill and 0.99% in decode under the most constrained Powersave
governor. Leveraging lm-Meter, we conduct comprehensive empirical studies
revealing phase- and kernel-level bottlenecks in on-device LLM inference,
quantifying accuracy-efficiency trade-offs, and identifying systematic
optimization opportunities. lm-Meter provides unprecedented visibility into the
runtime behavior of LLMs on constrained platforms, laying the foundation for
informed optimization and accelerating the democratization of on-device LLM
systems. Code and tutorials are available at
https://github.com/amai-gsu/LM-Meter.
[COMMENTS]
This is the preprint version of the paper accepted to The 10th
ACM/IEEE Symposium on Edge Computing (SEC 2025)
[LINK]
http://arxiv.org/abs/2510.06126v1
[DATE]
2025-10-08 01:05:30+08:00
[CATEGORIES]
cs.LG
PolyGraph Discrepancy: a classifier-based metric for graph generation
[AUTHORS]
Markus Krimmel, Philip Hartout, Karsten Borgwardt, Dexiong Chen
[ABSTRACT]
Existing methods for evaluating graph generative models primarily rely on
Maximum Mean Discrepancy (MMD) metrics based on graph descriptors. While these
metrics can rank generative models, they do not provide an absolute measure of
performance. Their values are also highly sensitive to extrinsic parameters,
namely kernel and descriptor parametrization, making them incomparable across
different graph descriptors. We introduce PolyGraph Discrepancy (PGD), a new
evaluation framework that addresses these limitations. It approximates the
Jensen-Shannon distance of graph distributions by fitting binary classifiers to
distinguish between real and generated graphs, featurized by these descriptors.
The data log-likelihood of these classifiers approximates a variational lower
bound on the JS distance between the two distributions. Resulting metrics are
constrained to the unit interval [0,1] and are comparable across different
graph descriptors. We further derive a theoretically grounded summary metric
that combines these individual metrics to provide a maximally tight lower bound
on the distance for the given descriptors. Thorough experiments demonstrate
that PGD provides a more robust and insightful evaluation compared to MMD
metrics. The PolyGraph framework for benchmarking graph generative models is
made publicly available at https://github.com/BorgwardtLab/polygraph-benchmark.
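
The classifier-based bound mentioned above can be sketched as follows. This uses the standard variational lower bound on the Jensen-Shannon divergence for a balanced binary classifier, JSD >= 1 + 0.5 * (E_real[log2 D(x)] + E_gen[log2(1 - D(x))]), and is not necessarily the exact estimator in the PolyGraph package. The function name and the assumption that classifier outputs come from held-out samples are illustrative.

```python
import math

def js_lower_bound(p_real_on_real, p_real_on_generated):
    """Sample estimate of a lower bound on JS divergence, base 2, in [0, 1].

    p_real_on_real: classifier outputs P(real | x) on held-out real graphs.
    p_real_on_generated: classifier outputs P(real | x) on held-out
    generated graphs. Both lists are hypothetical inputs for this sketch.
    """
    eps = 1e-12  # guard against log2(0)
    ll_real = sum(math.log2(max(p, eps)) for p in p_real_on_real)
    ll_real /= len(p_real_on_real)
    ll_gen = sum(math.log2(max(1.0 - p, eps)) for p in p_real_on_generated)
    ll_gen /= len(p_real_on_generated)
    bound = 1.0 + 0.5 * (ll_real + ll_gen)
    # JS divergence is non-negative, so a negative estimate is clipped to 0.
    return max(0.0, bound)
```

A classifier at chance level (all outputs 0.5) gives a bound of 0, and a perfect classifier gives 1. The Jensen-Shannon distance is the square root of this divergence.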
[LINK]
http://arxiv.org/abs/2510.06122v1
[DATE]
2025-10-08 01:02:44+08:00
[CATEGORIES]
cs.LG
Optimal Policy Minimum Bayesian Risk
[AUTHORS]
Ramón Fernandez Astudillo, Md Arafat Sultan, Aashka Trivedi, Yousef El-Kurdi, Tahira Naseem, Radu Florian, Salim Roukos
[ABSTRACT]
Inference scaling helps LLMs solve complex reasoning problems through
extended runtime computation. On top of long chain-of-thought (long-CoT)
models, purely inference-time techniques such as best-of-N (BoN) sampling,
majority voting, or more generally, minimum Bayes risk decoding (MBRD), can
further improve LLM accuracy by generating multiple candidate solutions and
aggregating over them. These methods typically leverage additional signals in
the form of reward models and risk/similarity functions that compare generated
samples, e.g., exact match in some normalized space or standard similarity
metrics such as Rouge. Here we present a novel method for incorporating reward
and risk/similarity signals into MBRD. Based on the concept of optimal policy
in KL-controlled reinforcement learning, our framework provides a simple and
well-defined mechanism for leveraging such signals, offering several advantages
over traditional inference-time methods: higher robustness, improved accuracy,
and well-understood asymptotic behavior. In addition, it allows for the
development of a sample-efficient variant of MBRD that can adjust the number of
samples to generate according to the difficulty of the problem, without relying
on majority vote counts. We empirically demonstrate the advantages of our
approach on math (MATH-500) and coding (HumanEval) tasks using recent
open-source models. We also present a comprehensive analysis of its
accuracy-compute trade-offs.
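One way to picture how reward and similarity signals can be combined is the sketch below. It relies on the known form of the optimal policy in KL-controlled RL, pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), and reweights samples drawn from the reference model accordingly before an MBR-style selection. This is an illustrative reading under that assumption, not the paper's estimator; the function names, the `beta` default, and the exact-match similarity are hypothetical choices.

```python
import math

def optimal_policy_weights(rewards, beta):
    """Self-normalized weights proportional to exp(r / beta) for samples from pi_ref."""
    top = max(rewards)  # subtract the max for numerical stability
    unnormalized = [math.exp((r - top) / beta) for r in rewards]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def reward_weighted_mbr(candidates, rewards, similarity, beta=1.0):
    """Pick the candidate with the highest reward-weighted expected similarity."""
    weights = optimal_policy_weights(rewards, beta)

    def expected_similarity(candidate):
        return sum(w * similarity(candidate, other) for w, other in zip(weights, candidates))

    return max(candidates, key=expected_similarity)

def exact_match(a, b):
    return 1.0 if a == b else 0.0
```

With equal rewards or a large `beta`, this selection reduces to majority voting under exact match; as `beta` shrinks, it approaches best-of-N selection by reward.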
[LINK]
http://arxiv.org/abs/2505.17242v2
[DATE]
2025-10-08 00:58:55+08:00
[CATEGORIES]
cs.LG
Spatiotemporal Graph Learning with Direct Volumetric Information Passing and Feature Enhancement
[AUTHORS]
Yuan Mi, Qi Wang, Xueqin Hu, Yike Guo, Ji-Rong Wen, Yang Liu, Hao Sun
[ABSTRACT]
Data-driven learning of physical systems has kindled significant attention,
where many neural models have been developed. In particular, mesh-based graph
neural networks (GNNs) have demonstrated significant potential in modeling
spatiotemporal dynamics across arbitrary geometric domains. However, the
existing node-edge message-passing and aggregation mechanism in GNNs limits the
representation learning ability. In this paper, we propose a dual-module
framework, Cell-embedded and Feature-enhanced Graph Neural Network (aka,
CeFeGNN), for learning spatiotemporal dynamics. Specifically, we embed
learnable cell attributions to the common node-edge message passing process,
which better captures the spatial dependency of regional features. Such a
strategy essentially upgrades the local aggregation scheme from first order
(e.g., from edge to node) to a higher order (e.g., from volume and edge to
node), which takes advantage of volumetric information in message passing.
Meanwhile, a novel feature-enhanced block is designed to further improve the
model’s performance and alleviate the over-smoothing problem. Extensive
experiments on various PDE systems and one real-world dataset demonstrate that
CeFeGNN achieves superior performance compared with other baselines.
[LINK]
http://arxiv.org/abs/2409.18013v2
[DATE]
2025-10-08 00:51:49+08:00
[CATEGORIES]
cs.LG
The Physics of Data and Tasks: Theories of Locality and Compositionality in Deep Learning
[AUTHORS]
Alessandro Favero
[ABSTRACT]
Deep neural networks have achieved remarkable success, yet our understanding
of how they learn remains limited. These models can learn high-dimensional
tasks, which is generally statistically intractable due to the curse of
dimensionality. This apparent paradox suggests that learnable data must have an
underlying latent structure. What is the nature of this structure? How do
neural networks encode and exploit it, and how does it quantitatively impact
performance - for instance, how does generalization improve with the number of
training examples? This thesis addresses these questions by studying the roles
of locality and compositionality in data, tasks, and deep learning
representations.
[COMMENTS]
PhD dissertation. Preprint
[LINK]
http://arxiv.org/abs/2510.06106v1
[DATE]
2025-10-08 00:40:06+08:00
[CATEGORIES]
cs.LG
Robust-Multi-Task Gradient Boosting
[AUTHORS]
Seyedsaman Emami, Gonzalo Martínez-Muñoz, Daniel Hernández-Lobato
[ABSTRACT]
Multi-task learning (MTL) has shown effectiveness in exploiting shared
information across tasks to improve generalization. MTL assumes tasks share
similarities that can improve performance. In addition, boosting algorithms
have demonstrated exceptional performance across diverse learning problems,
primarily due to their ability to focus on hard-to-learn instances and
iteratively reduce residual errors. This makes them a promising approach for
learning multi-task problems. However, real-world MTL scenarios often involve
tasks that are not well-aligned (known as outlier or adversarial tasks), which
do not share beneficial similarities with others and can, in fact, deteriorate
the performance of the overall model. To overcome this challenge, we propose
Robust-Multi-Task Gradient Boosting (R-MTGB), a novel boosting framework that
explicitly models and adapts to task heterogeneity during training. R-MTGB
structures the learning process into three sequential blocks: (1) learning
shared patterns, (2) partitioning tasks into outliers and non-outliers with
regularized parameters, and (3) fine-tuning task-specific predictors. This
architecture enables R-MTGB to automatically detect and penalize outlier tasks
while promoting effective knowledge transfer among related tasks. Our method
integrates these mechanisms seamlessly within gradient boosting, allowing
robust handling of noisy or adversarial tasks without sacrificing accuracy.
Extensive experiments on both synthetic benchmarks and real-world datasets
demonstrate that our approach successfully isolates outliers, transfers
knowledge, and consistently reduces prediction errors for each task
individually, and achieves overall performance gains across all tasks. These
results highlight robustness, adaptability, and reliable convergence of R-MTGB
in challenging MTL environments.
[LINK]
http://arxiv.org/abs/2507.11411v2
[DATE]
2025-10-08 00:25:39+08:00
[CATEGORIES]
cs.LG
Learning Mixtures of Linear Dynamical Systems (MoLDS) via Hybrid Tensor-EM Method
[AUTHORS]
Lulu Gong, Shreya Saxena
[ABSTRACT]
Mixtures of linear dynamical systems (MoLDS) provide a path to model
time-series data that exhibit diverse temporal dynamics across trajectories.
However, its application remains challenging in complex and noisy settings,
limiting its effectiveness for neural data analysis. Tensor-based moment
methods can provide global identifiability guarantees for MoLDS, but their
performance degrades under noise and complexity. Commonly used
expectation-maximization (EM) methods offer flexibility in fitting latent
models but are highly sensitive to initialization and prone to poor local
minima. Here, we propose a tensor-based method that provides identifiability
guarantees for learning MoLDS, which is followed by EM updates to combine the
strengths of both approaches. The novelty in our approach lies in the
construction of moment tensors using the input-output data to recover globally
consistent estimates of mixture weights and system parameters. These estimates
can then be refined through a Kalman EM algorithm, with closed-form updates for
all LDS parameters. We validate our framework on synthetic benchmarks and
real-world datasets. On synthetic data, the proposed Tensor-EM method achieves
more reliable recovery and improved robustness compared to either pure tensor
or randomly initialized EM methods. We then analyze neural recordings from the
primate somatosensory cortex while a non-human primate performs reaches in
different directions. Our method successfully models and clusters different
conditions as separate subsystems, consistent with supervised single-LDS fits
for each condition. Finally, we apply this approach to another neural dataset
where monkeys perform a sequential reaching task. These results demonstrate
that MoLDS provides an effective framework for modeling complex neural data,
and that Tensor-EM is a reliable approach to MoLDS learning for these
applications.
[COMMENTS]
20 pages, 7 figures
[LINK]
http://arxiv.org/abs/2510.06091v1
[DATE]
2025-10-08 00:17:52+08:00
[CATEGORIES]
cs.LG
CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
[AUTHORS]
Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
[ABSTRACT]
Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are
distinguished by their strong performance scaling with increasing parameters
across a wide range of tasks, yet they also suffer from substantial
computational and storage overheads. Notably, the performance gains of MoE
models do not scale proportionally with the growth in expert parameters. While
prior works attempt to reduce parameters via expert-level pruning, merging, or
decomposition, they still suffer from challenges in both performance and
computational efficiency. In this paper, we address these challenges by
introducing micro-expert as a finer-grained compression unit that spans across
matrices. We first establish a more fundamental perspective, viewing MoE layers
as mixtures of micro-experts, and present CAMERA, a lightweight and
training-free framework for identifying micro-expert redundancy. Our analysis
uncovers significant variance in micro-expert contributions during decoding.
Based on this insight, we further propose CAMERA-P, a structured micro-expert
pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed
for micro-experts. Extensive experiments on nine downstream tasks show that
CAMERA-P consistently outperforms strong baselines under pruning ratios ranging
from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under
aggressive 2-bit quantization, surpassing existing matrix- and channel-level
ideas. Notably, our method enables complete micro-expert analysis of
Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.
[COMMENTS]
16 pages, 9 figures, 7 tables
[LINK]
http://arxiv.org/abs/2508.02322v2
[DATE]
2025-10-07 23:56:15+08:00
[CATEGORIES]
cs.CL
cs.LG
ASPO: Asymmetric Importance Sampling Policy Optimization
[AUTHORS]
Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai
[ABSTRACT]
Recent Large Language Model (LLM) post-training methods rely on token-level
clipping mechanisms during Reinforcement Learning (RL). However, we identify a
fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance
Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to
unbalanced token weighting for positive and negative tokens. This mismatch
suppresses the update of low-probability tokens while over-amplifying already
high-probability ones. To address this, we propose Asymmetric Importance
Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy
that flips the IS ratios of positive-advantage tokens, aligning their update
direction with the learning dynamics of negative ones. ASPO further incorporates
a soft dual-clipping mechanism to stabilize extreme updates while maintaining
gradient flow. Comprehensive experiments on coding and mathematical reasoning
benchmarks demonstrate that ASPO significantly mitigates premature convergence,
improves training stability, and enhances final performance over strong
GRPO-based baselines. Our analysis provides new insights into the role of
token-level weighting in OSRL and highlights the critical importance of
correcting IS in LLM RL. The code and models of ASPO are available at
https://github.com/wizard-III/Archer2.0.
[LINK]
http://arxiv.org/abs/2510.06062v1
[DATE]
2025-10-07 23:54:24+08:00
[CATEGORIES]
cs.CL
On Relation-Specific Neurons in Large Language Models
[AUTHORS]
Yihong Liu, Runsheng Chen, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze
[ABSTRACT]
In large language models (LLMs), certain neurons can store distinct
pieces of knowledge learned during pretraining. While factual knowledge
typically appears as a combination of relations and entities, it
remains unclear whether some neurons focus on a relation itself – independent
of any entity. We hypothesize such neurons detect a relation in the
input text and guide generation involving such a relation. To
investigate this, we study the Llama-2 family on a chosen set of relations,
with a statistics-based method. Our experiments demonstrate the
existence of relation-specific neurons. We measure the effect of selectively
deactivating candidate neurons specific to relation $r$ on the LLM’s ability to
handle (1) facts involving relation $r$ and (2) facts involving a different
relation $r' \neq r$. With respect to their capacity for encoding relation
information, we give evidence for the following three properties of
relation-specific neurons. (i) Neuron cumulativity. Multiple neurons
jointly contribute to processing facts involving relation $r$, with no single
neuron fully encoding a fact in $r$ on its own. (ii) Neuron
versatility. Neurons can be shared across multiple closely related as well as
less related relations. In addition, some relation neurons transfer across
languages. (iii) Neuron interference. Deactivating neurons specific to
one relation can improve LLMs’ factual recall performance for facts of other
relations. We make our code and data publicly available at
https://github.com/cisnlp/relation-specific-neurons.
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2502.17355v2
[DATE]
2025-10-07 23:53:55+08:00
[CATEGORIES]
cs.CL
MedHal: An Evaluation Dataset for Medical Hallucination Detection
[AUTHORS]
Gaya Mehenni, Fabrice Lamarche, Odette Rios-Ibacache, John Kildea, Amal Zouaq
[ABSTRACT]
We present MedHal, a novel large-scale dataset specifically designed to
evaluate if models can detect hallucinations in medical texts. Current
hallucination detection methods face significant limitations when applied to
specialized domains like medicine, where they can have disastrous consequences.
Existing medical datasets are either too small, containing only a few hundred
samples, or focus on a single task like Question Answering or Natural Language
Inference. MedHal addresses these gaps by: (1) incorporating diverse medical
text sources and tasks; (2) providing a substantial volume of annotated samples
suitable for training medical hallucination detection models; and (3) including
explanations for factual inconsistencies to guide model learning. We
demonstrate MedHal’s utility by training and evaluating a baseline medical
hallucination detection model, showing improvements over general-purpose
hallucination detection approaches. This resource enables more efficient
evaluation of medical text generation systems while reducing reliance on costly
expert review, potentially accelerating the development of medical AI research.
[LINK]
http://arxiv.org/abs/2504.08596v2
[DATE]
2025-10-07 23:40:54+08:00
[CATEGORIES]
cs.CL
Large Language Models Achieve Gold Medal Performance at the International Olympiad on Astronomy & Astrophysics (IOAA)
[AUTHORS]
Lucas Carrit Delgado Pinheiro, Ziru Chen, Bruno Caixeta Piazza, Ness Shroff, Yingbin Liang, Yuan-Sen Ting, Huan Sun
[ABSTRACT]
While task-specific demonstrations show early success in applying large
language models (LLMs) to automate some astronomical research tasks, they only
provide incomplete views of all necessary capabilities in solving astronomy
problems, calling for more thorough understanding of LLMs’ strengths and
limitations. So far, existing benchmarks and evaluations focus on simple
question-answering that primarily tests astronomical knowledge and fails to
evaluate the complex reasoning required for real-world research in the
discipline. Here, we address this gap by systematically benchmarking five
state-of-the-art LLMs on the International Olympiad on Astronomy and
Astrophysics (IOAA) exams, which are designed to examine deep conceptual
understanding, multi-step derivations, and multimodal analysis. With average
scores of 85.6% and 84.2%, Gemini 2.5 Pro and GPT-5 (the two top-performing
models) not only achieve gold medal level performance but also rank in the top
two among ~200-300 participants in all four IOAA theory exams evaluated
(2022-2025). In comparison, results on the data analysis exams show more
divergence. GPT-5 still excels in the exams with an 88.5% average score,
ranking top 10 among the participants in the four most recent IOAAs, while
other models’ performances drop to 48-76%. Furthermore, our in-depth error
analysis underscores conceptual reasoning, geometric reasoning, and spatial
visualization (52-79% accuracy) as consistent weaknesses among all LLMs. Hence,
although LLMs approach peak human performance in theory exams, critical gaps
must be addressed before they can serve as autonomous research agents in
astronomy.
[COMMENTS]
18 pages, 6 figures, to be submitted, comments are welcome.
Reproducibility details can be found at:
https://github.com/OSU-NLP-Group/LLM-IOAA
[LINK]
http://arxiv.org/abs/2510.05016v2
[DATE]
2025-10-07 23:34:59+08:00
[CATEGORIES]
cs.CL
CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs
[AUTHORS]
Chengwei Wu, Jiapu Wang, Mingyang Gao, Xingrui Zhuo, Jipeng Guo, Runlin Lei, Haoran Luo, Tianyu Chen, Haoyi Zhou, Shirui Pan, Zechao Li
[ABSTRACT]
Large Language Models (LLMs) have achieved remarkable success across a wide
range of natural language processing tasks. However, Chinese LLMs face unique
challenges, primarily due to the dominance of unstructured free text and the
lack of structured representations in Chinese corpora. While existing
benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly
English-centric and fail to address the unique linguistic characteristics of
Chinese, lacking structured datasets essential for robust evaluation. To
address these challenges, we present a Comprehensive Benchmark for Evaluating
Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese
Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million
aligned text pairs, each consisting of unstructured text coupled with one or
more corresponding triples, alongside a total of 15 million triples spanning
four critical domains. The core contributions of CDTP are threefold: (i)
enriching Chinese corpora with high-quality structured information; (ii)
enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii)
supporting multi-task fine-tuning to assess generalization and robustness
across scenarios, including Knowledge Graph Completion, Triple-to-Text
generation, and Question Answering. Furthermore, we conduct rigorous
evaluations through extensive experiments and ablation studies to assess the
effectiveness, Supervised Fine-Tuning (SFT), and robustness of the benchmark.
To support reproducible research, we offer an open-source codebase and outline
potential directions for future investigations based on our insights.
[LINK]
http://arxiv.org/abs/2510.06039v1
[DATE]
2025-10-07 23:33:52+08:00
[CATEGORIES]
cs.CL
Evaluating The Impact of Stimulus Quality in Investigations of LLM Language Performance
[AUTHORS]
Timothy Pistotti, Jason Brown, Michael Witbrock
[ABSTRACT]
Recent studies employing Large Language Models (LLMs) to test the Argument
from the Poverty of the Stimulus (APS) have yielded contrasting results across
syntactic phenomena. This paper investigates the hypothesis that
characteristics of the stimuli used in recent studies, including lexical
ambiguities and structural complexities, may confound model performance. A
methodology is proposed for re-evaluating LLM competence on syntactic
prediction, focusing on GPT-2. This involves: 1) establishing a baseline on
previously used (both filtered and unfiltered) stimuli, and 2) generating a
new, refined dataset using a state-of-the-art (SOTA) generative LLM (Gemini 2.5
Pro Preview) guided by linguistically-informed templates designed to mitigate
identified confounds. Our preliminary findings indicate that GPT-2 demonstrates
notably improved performance on these refined parasitic gap (PG) stimuli compared to baselines,
suggesting that stimulus quality significantly influences outcomes in
surprisal-based evaluations of LLM syntactic competency.
[COMMENTS]
Presented at https://brigap-workshop.github.io/. Information to be
updated upon publication of proceedings.
[LINK]
http://arxiv.org/abs/2510.06018v1
[DATE]
2025-10-07 23:16:47+08:00
[CATEGORIES]
cs.CL
MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation
[AUTHORS]
Qin Dong, Yuntian Tang, Heming Jia, Yunhang Shen, Bohan Jia, Wenxuan Huang, Lianyue Zhang, Jiao Xie, Shaohui Lin
[ABSTRACT]
Low-Rank Adaptation (LoRA) has emerged as a dominant method in
Parameter-Efficient Fine-Tuning (PEFT) for large language models, which
augments the transformer layer with one down-projection $A$ and one
up-projection $B$. However, LoRA’s reliance on a single down-projection matrix
($A$) creates a representational bottleneck, as this solitary feature extractor
is inherently insufficient for capturing the diverse signals required by
complex tasks. This motivates our architectural shift to focus on enriching the
feature adaptation to improve the downstream task adaptation ability. We
propose MASA (Multi-$A$ Shared Adaptation), an architecture that implements a
multi-$A$, single-$B$ structure where the multi-$A$ expert ensemble is
asymmetrically shared across layers to ensure parameter efficiency. In MASA,
these specialized experts capture diverse features, which are then integrated
by a single, layer-specific $B$-matrix. The effectiveness and versatility of
our method are validated through a comprehensive suite of experiments spanning
multi-domain generalization, single-domain specialization, and multi-task
reasoning. For example, on the MMLU benchmark, MASA achieves an average
accuracy of 59.62%, outperforming the standard LoRA by 1.08 points (a relative
improvement of 1.84%) with comparable learnable parameters of 0.52%.
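The reported figures can be cross-checked with a short calculation. The sketch below assumes the relative improvement is measured against the LoRA baseline, whose accuracy is inferred here as 59.62 - 1.08 = 58.54; that baseline value is derived, not quoted in the abstract, and the function name is hypothetical.

```python
def relative_improvement(new_score, absolute_gain):
    """Return (inferred baseline, relative gain in percent) from a score and its absolute gain."""
    baseline = new_score - absolute_gain
    return baseline, 100.0 * absolute_gain / baseline

baseline, relative = relative_improvement(59.62, 1.08)
print(f"baseline={baseline:.2f}, relative improvement={relative:.2f}%")
```

Under that assumption the result agrees with the 1.84% relative improvement quoted above.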
[COMMENTS]
14 pages, 5 figures
[LINK]
http://arxiv.org/abs/2510.06005v1
[DATE]
2025-10-07 23:06:46+08:00
[CATEGORIES]
cs.CL
Deterministic Legal Retrieval: An Action API for Querying the SAT-Graph RAG
[AUTHORS]
Hudson de Martim
[ABSTRACT]
The Structure-Aware Temporal Graph RAG (SAT-Graph RAG) addresses core
limitations of standard Retrieval-Augmented Generation in the legal domain by
providing a verifiable knowledge graph that models hierarchical structure,
temporal evolution, and causal events of legal norms. However, a critical gap
remains: how to reliably query this structured knowledge without sacrificing
its deterministic properties. This paper introduces the SAT-Graph API, a formal
query execution layer centered on canonical actions: atomic, composable, and
auditable primitives that isolate probabilistic discovery from deterministic
retrieval. These actions enable: (i) high-precision hybrid search; (ii) robust
reference resolution; (iii) point-in-time version retrieval; and (iv) auditable
causal tracing. We demonstrate how planner-guided agents can decompose complex
queries into Directed Acyclic Graphs (DAGs) of these actions. This two-layer
architecture transforms retrieval from an opaque black box to a transparent,
auditable process, directly addressing Explainable AI (XAI) requirements for
high-stakes domains.
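The idea of decomposing a query into a DAG of actions can be sketched as follows. The action names (`search`, `resolve_version`, `trace_causes`), the identifier `norm-42`, and the `run_plan` helper are hypothetical placeholders rather than the SAT-Graph API; the sketch only shows actions executing in dependency order while recording an auditable trace.

```python
from graphlib import TopologicalSorter

def run_plan(actions, dependencies):
    """Execute actions in dependency order.

    actions: maps an action name to a callable that receives earlier results.
    dependencies: maps an action name to the set of names it depends on.
    Returns the results and the execution order, which serves as an audit trail.
    """
    results, order = {}, []
    for name in TopologicalSorter(dependencies).static_order():
        results[name] = actions[name](results)
        order.append(name)
    return results, order

# Hypothetical plan: discover a norm, fix its version at a date, then trace causes.
actions = {
    "search": lambda r: "norm-42",
    "resolve_version": lambda r: f"{r['search']}@2020-01-01",
    "trace_causes": lambda r: [f"caused-by:{r['resolve_version']}"],
}
dependencies = {
    "search": set(),
    "resolve_version": {"search"},
    "trace_causes": {"resolve_version"},
}
results, order = run_plan(actions, dependencies)
```

Only the first step would involve probabilistic discovery in such a plan; the later steps are plain lookups over its output, which mirrors the separation the abstract describes.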
[LINK]
http://arxiv.org/abs/2510.06002v1
[DATE]
2025-10-07 23:04:23+08:00
[CATEGORIES]
cs.CL
Exploring Gaps in the APS: Direct Minimal Pair Analysis in LLM Syntactic Assessments
[AUTHORS]
Timothy Pistotti, Jason Brown, Michael Witbrock
[ABSTRACT]
Recent studies probing the Argument from the Poverty of the Stimulus (APS)
have applied Large Language Models (LLMs) to test the learnability of complex
syntax through surprisal-based metrics. However, divergent conclusions raise
questions concerning the insights these metrics offer. While Wilcox et al.
(2024) used direct minimal pair comparisons (the “wh-effect”) to demonstrate
that models successfully generalise knowledge of filler-gap dependencies, Lan
et al. (2024) used a Difference-in-Differences (DiD) metric and found that
models largely fail on parasitic gaps (PGs). This paper argues that the direct
minimal pair approach offers greater diagnostic transparency. We demonstrate
this by generating a full 8-permutation paradigm of refined PG stimuli and
evaluating the GPT-2 model used in previous studies with a systematic
Wilcox-style wh-effect analysis. Our results show that GPT-2 succeeds across
all four tested conditions, indicating robust knowledge of filler-gap licensing
principles even in complex PG environments. This finding, which contrasts with
the more ambiguous results from DiD-style metrics, suggests that the choice of
evaluation metric is critical for assessing an LLM’s syntactic competence.
[COMMENTS]
Presented at https://brigap-workshop.github.io/. Information to be
updated after publication of proceedings.
[LINK]
http://arxiv.org/abs/2510.06001v1
[DATE]
2025-10-07 23:03:09+08:00
[CATEGORIES]
cs.CL
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models
[AUTHORS]
Haoran Ye, Tianze Zhang, Yuhang Xie, Liyuan Zhang, Yuanyi Ren, Xin Zhang, Guojie Song
[COMMENTS]
ACL 2025 Main
[LINK]
http://arxiv.org/abs/2502.02444v6
[DATE]
2025-10-07 22:57:19+08:00
[CATEGORIES]
cs.CL
AgenticIE: An Adaptive Agent for Information Extraction from Complex Regulatory Documents
[AUTHORS]
Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
[ABSTRACT]
Declaration of Performance (DoP) documents, mandated by EU regulation,
certify the performance of construction products. Two challenges arise in
making DoPs machine- and human-accessible through automated key-value pair
(KVP) extraction and question answering (QA): (1) While some of their content
is standardized, DoPs vary widely in layout, schema, and format; (2) Both users
and documents are multilingual. Existing static or LLM-only Information
Extraction (IE) pipelines fail to adapt to this structural document and user
diversity. Our domain-specific, agentic system addresses these challenges
through a planner-executor-responder architecture. The system infers user
intent, detects document language and modality, and orchestrates tools
dynamically for robust, traceable reasoning while avoiding tool misuse or
execution loops. Our agent outperforms baselines (ROUGE: 0.783 vs. 0.703/0.608)
with better cross-lingual stability (17-point vs. 21-26-point variation).
[LINK]
http://arxiv.org/abs/2509.11773v2
[DATE]
2025-10-07 22:55:30+08:00
[CATEGORIES]
cs.CL
Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
[AUTHORS]
Xueyan Li, Guinan Su, Mrinmaya Sachan, Jonas Geiping
[ABSTRACT]
Large Language Models (LLMs) are increasingly applied to complex tasks that
require extended reasoning. In such settings, models often benefit from diverse
chains-of-thought to arrive at multiple candidate solutions. This requires balancing two
competing objectives: to inject enough stochasticity to explore multiple
reasoning chains, and to ensure sufficient accuracy and quality in each path.
Existing works pursue the first objective by increasing exploration at highly
uncertain steps with higher temperature or larger candidate token sets, while
others improve reliability by rejecting samples with low confidence
post-generation, implying that low confidence correlates with low answer
quality. These two lines of thought are in conflict, as they conflate different
sources of uncertainty. To resolve this, we argue that the decoding rule should
be calibrated by correctness, not confidence alone. We should sample from
tokens with higher estimated correctness, and reduce sampling where expected
correctness is low. We propose simple strategies that achieve this goal:
Greedy-Threshold makes sampling greedy at very low confidence steps.
Calibrated-TopK and Calibrated-epsilon set the truncation threshold based on
estimated rank-wise correctness. Together, our findings challenge prevailing
heuristics about decoding under uncertainty and show gains across math and
general reasoning benchmarks.
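The Greedy-Threshold rule can be sketched in a few lines. The sketch assumes that "confidence" means the largest next-token probability and that the threshold is a fixed scalar; both are illustrative assumptions, as is the function name.

```python
import random

def greedy_threshold_sample(probs, threshold, rng=random):
    """Return a token index: greedy when confidence is low, sampled otherwise.

    probs: next-token probabilities (assumed to sum to 1).
    threshold: confidence level below which decoding becomes greedy (assumed scalar).
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    if probs[top] < threshold:
        # Low confidence: sampling here would mostly add noise, so take the argmax.
        return top
    # Otherwise sample from the full distribution to keep exploring.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Note that this inverts the common heuristic of exploring more at uncertain steps, which is the contrast the abstract draws.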
[LINK]
http://arxiv.org/abs/2510.05987v1
[DATE]
2025-10-07 22:46:12+08:00
[CATEGORIES]
cs.LG
cs.CL
Probing the Difficulty Perception Mechanism of Large Language Models
[AUTHORS]
Sunbowen Lee, Qingyu Yin, Chak Tou Leong, Jialiang Zhang, Yicheng Gong, Xiaoyu Shen
[ABSTRACT]
Large language models (LLMs) are increasingly deployed on complex reasoning
tasks, yet little is known about their ability to internally evaluate problem
difficulty, which is an essential capability for adaptive reasoning and
efficient resource allocation. In this work, we investigate whether LLMs
implicitly encode problem difficulty in their internal representations. Using a
linear probe on the final-token representations of LLMs, we demonstrate that
the difficulty level of math problems can be linearly modeled. We further
locate the specific attention heads of the final Transformer layer: these
attention heads have opposite activation patterns for simple and difficult
problems, thus achieving perception of difficulty. Our ablation experiments
confirm the accuracy of this localization. Crucially, our experiments provide
practical support for using LLMs as automatic difficulty annotators,
potentially substantially reducing reliance on costly human labeling in
benchmark construction and curriculum learning. We also uncover that there is a
significant difference in entropy and difficulty perception at the token level.
Our study reveals that difficulty perception in LLMs is not only present but
also structurally organized, offering new theoretical insights and practical
directions for future research.
[LINK]
http://arxiv.org/abs/2510.05969v1
[DATE]
2025-10-07 22:24:32+08:00
[CATEGORIES]
cs.CL
MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization
[AUTHORS]
Dayyán O’Brien, Barry Haddow, Emily Allaway, Pinzhen Chen
[ABSTRACT]
Conducting contamination-free evaluation of mathematical capabilities can be
difficult for two reasons: models may memorize a test set once it is made
public, and current mathematical benchmarks are prone to overfitting due to
having limited diversity of symbols and rules, coupled with closed-ended
answers. This paper proposes a method to leverage these shortcomings as useful
features to construct a dynamic, counterfactual benchmark, which can be used to
both reveal overfitting and measure true reasoning. We demonstrate this via
MatheMagic, which generates math test instances with the interpretations of
numbers and operators altered, yet has automatically verifiable answers. Test
instances are randomly seeded and constructed at test time to evaluate a
model’s induction or deduction capability, offering stability, extensibility,
comparability, and robustness to overfitting. Our experiments find that models
solve deduction more easily than induction, but they revert to standard math.
Further analysis reveals that math-adapted models fail to exhibit a general
“skill” of reasoning, and fine-tuning on induction tasks generalizes poorly.
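A toy version of seeded, counterfactual test generation might look like the sketch below. It assumes that "altered interpretations" can be modeled as a random reassignment of operator symbols to arithmetic operations; the symbol set, operand range, and function name are hypothetical and far simpler than the benchmark itself.

```python
import operator
import random

OPERATIONS = {"add": operator.add, "subtract": operator.sub, "multiply": operator.mul}

def make_instance(seed):
    """Build one test item whose operator symbols have shuffled meanings."""
    rng = random.Random(seed)  # the same seed always yields the same instance
    symbols = ["+", "-", "*"]
    names = list(OPERATIONS)
    rng.shuffle(names)
    rule = dict(zip(symbols, names))  # e.g. "+" may now mean "multiply"
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    symbol = rng.choice(symbols)
    answer = OPERATIONS[rule[symbol]](a, b)  # automatically verifiable answer
    return {"rule": rule, "question": f"{a} {symbol} {b}", "answer": answer}
```

Showing the rule and asking for the answer probes deduction, while showing solved examples and asking for the rule probes induction.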
[LINK]
http://arxiv.org/abs/2510.05962v1
[DATE]
2025-10-07 22:19:21+08:00
[CATEGORIES]
cs.CL
SAE-FiRE: Enhancing Earnings Surprise Predictions Through Sparse Autoencoder Feature Selection
[AUTHORS]
Huopu Zhang, Yanguang Liu, Miao Zhang, Zirui He, Mengnan Du
[ABSTRACT]
Predicting earnings surprises from financial documents, such as earnings
conference calls, regulatory filings, and financial news, has become
increasingly important in financial economics. However, these financial
documents present significant analytical challenges, typically containing over
5,000 words with substantial redundancy and industry-specific terminology that
creates obstacles for language models. In this work, we propose the SAE-FiRE
(Sparse Autoencoder for Financial Representation Enhancement) framework to
address these limitations by extracting key information while eliminating
redundancy. SAE-FiRE employs Sparse Autoencoders (SAEs) to decompose dense
neural representations from large language models into interpretable sparse
components, then applies statistical feature selection methods, including ANOVA
F-tests and tree-based importance scoring, to identify the top-k most
discriminative dimensions for classification. By systematically filtering out
noise that might otherwise lead to overfitting, we enable more robust and
generalizable predictions. Experimental results across three financial datasets
demonstrate that SAE-FiRE significantly outperforms baseline approaches.
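
The feature-selection step described above can be illustrated with a minimal sketch, assuming the SAE activations are already available as a small dense matrix. The helper names (`anova_f`, `top_k_features`) and the toy data are hypothetical and are not taken from the paper; the sketch covers only the ANOVA F-test ranking, not the tree-based importance scoring.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F statistic for a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return 0.0  # F is undefined without at least two groups and spare samples
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def top_k_features(X, y, k):
    """Return the indices of the k columns of X with the largest F statistic."""
    n_features = len(X[0])
    scores = [anova_f([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Toy stand-in for sparse activations (assumed data): feature 0 separates the
# two classes strongly, feature 2 weakly, and feature 1 is mostly noise.
X = [
    [0.0, 0.3, 0.1],
    [0.1, 0.9, 0.4],
    [0.0, 0.5, 0.2],
    [1.0, 0.4, 0.5],
    [0.9, 0.8, 0.3],
    [1.1, 0.6, 0.6],
]
y = [0, 0, 0, 1, 1, 1]
selected = top_k_features(X, y, k=2)
print(selected)
```

Only the selected columns would then be passed to the downstream classifier, which is how the filtering is meant to reduce overfitting to noisy dimensions.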
[LINK]
http://arxiv.org/abs/2505.14420v2
[DATE]
2025-10-07 22:03:55+08:00
[CATEGORIES]
cs.CL
cs.LG
Unifying Inference-Time Planning Language Generation
[AUTHORS]
Prabhu Prakash Kagitha, Bo Sun, Ishan Desai, Andrew Zhu, Cassie Huang, Manling Li, Ziyang Li, Li Zhang
[ABSTRACT]
A line of work in planning uses LLMs not to generate a plan, but to generate a
formal representation in some planning language, which can be input into a
symbolic solver to deterministically find a plan. While showing improved trust
and promising performance, dozens of recent publications have proposed
scattered methods on a variety of benchmarks under different experimental
settings. We attempt to unify the inference-time LLM-as-formalizer methodology
for classical planning by proposing a unifying framework based on intermediate
representations. We thus systematically evaluate more than a dozen pipelines
that subsume most existing work, while proposing novel ones that involve
syntactically similar but high-resource intermediate languages (such as a
Python wrapper of PDDL). We provide recipes for planning language generation
pipelines, draw a series of conclusions showing the efficacy of their various
components, and evidence their robustness against problem complexity.
[LINK]
http://arxiv.org/abs/2505.14763v2
[DATE]
2025-10-07 21:59:20+08:00
[CATEGORIES]
cs.CL
EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
[AUTHORS]
Hadi Mohammadi, Anastasia Giachanou, Ayoub Bagheri
[ABSTRACT]
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that
uses two scoring methods (log-probabilities and direct ratings) plus a
model-as-judge peer review to evaluate moral alignment in 20 large language
models. We assess models on the World Values Survey (55 countries, 19 topics)
and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL,
top models align closely with survey responses (Pearson’s r approximately 0.90
on WVS). Yet we find a clear regional difference: Western regions average
r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap),
indicating consistent regional bias. Our framework adds three parts: (1) two
scoring methods for all models to enable fair comparison, (2) a structured
chain-of-thought protocol with self-consistency checks, and (3) a
model-as-judge peer review that flags 348 conflicts using a data-driven
threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39,
both p<.001), supporting automated quality checks. These results show real
progress toward culture-aware AI while highlighting open challenges for use
across regions.
[LINK]
http://arxiv.org/abs/2510.05942v1
[DATE]
2025-10-07 21:52:16+08:00
[CATEGORIES]
cs.CL
Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game
[AUTHORS]
Byungjun Kim, Dayeon Seo, Minju Kim, Bugeun Kim
[ABSTRACT]
Recent studies have investigated whether large language models (LLMs) can
support obscured communication, which is characterized by core aspects such as
inferring subtext and evading suspicions. To conduct the investigation,
researchers have used social deduction games (SDGs) as their experimental
environment, in which players conceal and infer specific information. However,
prior work has often overlooked how LLMs should be evaluated in such settings.
Specifically, we point out two limitations with the evaluation methods they
employed. First, metrics used in prior studies are coarse-grained as they are
based on overall game outcomes that often fail to capture event-level
behaviors; second, error analyses have lacked structured methodologies capable
of producing insights that meaningfully support evaluation outcomes. To address
these limitations, we propose a microscopic and systematic approach to the
investigation. Specifically, we introduce six fine-grained metrics that resolve
the first issue. To tackle the second issue, we conducted a thematic analysis
and identified four major reasoning failures that undermine LLMs’ performance
in obscured communication.
[COMMENTS]
Published in IEEE Access
[LINK]
http://arxiv.org/abs/2408.09946v3
[DATE]
2025-10-07 21:51:09+08:00
[CATEGORIES]
cs.CL
Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens
[AUTHORS]
Mai AlKhamissi, Yunze Xiao, Badr AlKhamissi, Mona Diab
[ABSTRACT]
Cultural evaluation of large language models has become increasingly
important, yet current benchmarks often reduce culture to static facts or
homogeneous values. This view conflicts with anthropological accounts that
emphasize culture as dynamic, historically situated, and enacted in practice.
To analyze this gap, we introduce a four-part framework that categorizes how
benchmarks frame culture: as knowledge, preference, performance, or bias.
Using this lens, we qualitatively examine 20 cultural benchmarks and identify
six recurring methodological issues, including treating countries as cultures,
overlooking within-culture diversity, and relying on oversimplified survey
formats. Drawing on established anthropological methods, we propose concrete
improvements: incorporating real-world narratives and scenarios, involving
cultural communities in design and validation, and evaluating models in context
rather than isolation. Our aim is to guide the development of cultural
benchmarks that go beyond static recall tasks and more accurately capture the
responses of the models to complex cultural situations.
[COMMENTS]
12 pages; 2 figures; First two authors contributed equally
[LINK]
http://arxiv.org/abs/2510.05931v1
[DATE]
2025-10-07 21:42:44+08:00
[CATEGORIES]
cs.CL
CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment
[AUTHORS]
Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang
[ABSTRACT]
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the
reasoning abilities of Large Language Models (LLMs) by using rule-based binary
feedback. However, current RLVR methods typically assign the same reward to
every token. This coarse-grained feedback hampers precise credit assignment,
making it hard for models to identify which reasoning steps lead to success or
failure, and often results in suboptimal policies. Methods like PPO provide
credit assignment by value estimation, but yield inaccurate and unverifiable
signals due to limited sampling. On the other hand, methods using Process
Reward Models can provide step-wise rewards but suffer from several key
limitations: they require high-quality process supervision labels, the feedback
is unreliable due to probabilistic reward modeling, and their application in
online reinforcement learning (RL) is time-consuming. To overcome these
limitations, we introduce a simple but efficient method, Credit Assignment
Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly
leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward
Model (LLM-as-GenPRM) to generate all step-wise critiques in a single pass, based
only on the correctness of each step itself, providing deterministic token-level
credits to refine the tokens that were originally assigned identical rule-based
rewards. To further enhance the accuracy and robustness, we employ voting
mechanisms that scale with the number of generated critiques. Extensive
experiments on various backbones like Llama and Qwen models show that CAPO
consistently outperforms supervised learning-based and RL-based fine-tuning
methods across four challenging mathematical benchmarks and three out-of-domain
benchmarks. Further analysis shows that CAPO can help the model to foster the
learning of correct reasoning pathways leading to correct answers.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2508.02298v2
[DATE]
2025-10-07 21:13:29+08:00
[CATEGORIES]
cs.LG
cs.CL
Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information
[AUTHORS]
Hojun Cho, Donghu Kim, Soyoung Yang, Chan Lee, Hunjoo Lee, Jaegul Choo
[ABSTRACT]
Language agents powered by large language models (LLMs) face significant
deployment challenges in resource-constrained environments, particularly for
specialized domains and less-common languages. This paper presents Tox-chat, a
Korean chemical toxicity information agent devised within these limitations. We
propose two key innovations: a context-efficient architecture that reduces
token consumption through hierarchical section search, and a scenario-based
dialogue generation methodology that effectively distills tool-using
capabilities from larger models. Experimental evaluations demonstrate that our
fine-tuned 8B parameter model substantially outperforms both untuned models and
baseline approaches, in terms of DB faithfulness and preference. Our work
offers valuable insights for researchers developing domain-specific language
agents under practical constraints.
[COMMENTS]
EMNLP 2025 Industry track
[LINK]
http://arxiv.org/abs/2503.17753v2
[DATE]
2025-10-07 20:40:17+08:00
[CATEGORIES]
cs.CL
Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input
[AUTHORS]
Faeze Ghorbanpour, Alexander Fraser
[ABSTRACT]
Large language models (LLMs) increasingly support applications that rely on
extended context, from document processing to retrieval-augmented generation.
While their long-context capabilities are well studied for reasoning and
retrieval, little is known about their behavior in safety-critical scenarios.
We evaluate LLMs’ sensitivity to harmful content under extended context,
varying type (explicit vs. implicit), position (beginning, middle, end),
prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens).
Across harmful content categories such as toxic, offensive, and hate speech,
with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance
peaks at moderate harmful prevalence (0.25) but declines when content is very
sparse or dominant; recall decreases with increasing context length; harmful
sentences at the beginning are generally detected more reliably; and explicit
content is more consistently recognized than implicit. These findings provide
the first systematic view of how LLMs prioritize and calibrate harmful content
in long contexts, highlighting both their emerging strengths and the challenges
that remain for safety-critical use.
[LINK]
http://arxiv.org/abs/2510.05864v1
[DATE]
2025-10-07 20:33:21+08:00
[CATEGORIES]
cs.CL
Revisiting Long-context Modeling from Context Denoising Perspective
[AUTHORS]
Zecheng Tang, Baibei Ji, Juntao Li, Lijun Wu, Haijia Gui, Min Zhang
[ABSTRACT]
Long-context models (LCMs) have demonstrated great potential in processing
long sequences, facilitating many real-world applications. The success of LCMs
can be attributed to their ability to locate implicit critical information
within the context for further prediction. However, recent research reveals
that LCMs are often susceptible to contextual noise, i.e., irrelevant tokens,
that can mislead model attention. In this paper, we conduct a fine-grained
analysis of the context noise and propose an effective metric, the Integrated
Gradient (IG) score, to detect and quantify the noise information within the
context. Our findings reveal that even simple mitigation of detected context
noise can substantially boost the model’s attention on critical tokens and
benefit subsequent predictions. Building on this insight, we propose Context
Denoising Training (CDT), a straightforward yet effective training strategy
that improves attention on critical tokens while reinforcing their influence on
model predictions. Extensive experiments across four tasks, under both context
window scaling and long-context alignment settings, demonstrate the superiority
of CDT. Notably, when trained with CDT, an open-source 8B model can achieve
performance (50.92) comparable to GPT-4o (51.00).
[LINK]
http://arxiv.org/abs/2510.05862v1
[DATE]
2025-10-07 20:32:23+08:00
[CATEGORIES]
cs.CL
DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization
[AUTHORS]
Xue-Yong Fu, Elena Khasanova, Md Tahmid Rahman Laskar, Harsh Saini, Shashi Bhushan TN
[COMMENTS]
Accepted to the NewSumm Workshop at EMNLP 2025
[LINK]
http://arxiv.org/abs/2510.05858v1
[DATE]
2025-10-07 20:26:19+08:00
[CATEGORIES]
cs.CL
cs.LG
Contrastive Learning Using Graph Embeddings for Domain Adaptation of Language Models in the Process Industry
[AUTHORS]
Anastasia Zhukova, Jonas Lührs, Christian E. Lobmüller, Bela Gipp
[ABSTRACT]
Recent trends in NLP utilize knowledge graphs (KGs) to enhance pretrained
language models by incorporating additional knowledge from the graph structures
to learn domain-specific terminology or relationships between documents that
might otherwise be overlooked. This paper explores how SciNCL, a graph-aware
neighborhood contrastive learning methodology originally designed for
scientific publications, can be applied to the process industry domain, where
text logs contain crucial information about daily operations and are often
structured as sparse KGs. Our experiments demonstrate that language models
fine-tuned with triplets derived from graph embeddings (GE) outperform a
state-of-the-art mE5-large text encoder by 9.8-14.3% (5.45-7.96p) on the
proprietary process industry text embedding benchmark (PITEB) while having 3
times fewer parameters.
[COMMENTS]
accepted to EMNLP 2025 (industry track)
[LINK]
http://arxiv.org/abs/2510.04631v2
[DATE]
2025-10-07 20:23:10+08:00
[CATEGORIES]
cs.CL
Cross-Document Cross-Lingual NLI via RST-Enhanced Graph Fusion and Interpretability Prediction
[AUTHORS]
Mengying Yuan, Wenhao Wang, Zixuan Wang, Yujie Huang, Kangli Wei, Fei Li, Chong Teng, Donghong Ji
[ABSTRACT]
Natural Language Inference (NLI) is a fundamental task in natural language
processing. While NLI has developed many sub-directions such as sentence-level
NLI, document-level NLI and cross-lingual NLI, Cross-Document Cross-Lingual NLI
(CDCL-NLI) remains largely unexplored. In this paper, we propose a novel
paradigm: CDCL-NLI, which extends traditional NLI capabilities to
multi-document, multilingual scenarios. To support this task, we construct a
high-quality CDCL-NLI dataset including 25,410 instances and spanning 26
languages. To address the limitations of previous methods on the CDCL-NLI task, we
further propose an innovative method that integrates RST-enhanced graph fusion
with interpretability-aware prediction. Our approach leverages RST (Rhetorical
Structure Theory) within heterogeneous graph neural networks for cross-document
context modeling, and employs a structure-aware semantic alignment based on
lexical chains for cross-lingual understanding. For NLI interpretability, we
develop an EDU (Elementary Discourse Unit)-level attribution framework that
produces extractive explanations. Extensive experiments demonstrate our
approach’s superior performance, achieving significant improvements over both
conventional NLI models and large language models. Our work sheds light
on the study of NLI and will draw research interest to cross-document
cross-lingual context understanding, hallucination elimination, and
interpretability inference. Our code and datasets are available at
https://github.com/Leonardo123-ui/CDCL_NLI for peer review.
[COMMENTS]
EMNLP 2025 Main (Camera Ready)
[LINK]
http://arxiv.org/abs/2504.12324v3
[DATE]
2025-10-07 20:17:31+08:00
[CATEGORIES]
cs.CL
Towards Locally Deployable Fine-Tuned Causal Large Language Models for Mode Choice Behaviour
[AUTHORS]
Tareq Alsaleh, Bilal Farooq
[ABSTRACT]
This study investigates the adoption of open-access, locally deployable
causal large language models (LLMs) for travel mode choice prediction and
introduces LiTransMC, the first fine-tuned causal LLM developed for this task.
We systematically benchmark eleven open-access LLMs (1-12B parameters) across
three stated and revealed preference datasets, testing 396 configurations and
generating over 79,000 mode choice decisions. Beyond predictive accuracy, we
evaluate model-generated reasoning using BERTopic for topic modelling and a
novel Explanation Strength Index, providing the first structured analysis of
how LLMs articulate decision factors in alignment with behavioural theory.
LiTransMC, fine-tuned using parameter-efficient and loss-masking strategies,
achieved a weighted F1 score of 0.6845 and a Jensen-Shannon Divergence of
0.000245, surpassing both untuned local models and larger proprietary systems,
including GPT-4o with advanced persona inference and embedding-based loading,
while also outperforming classical mode choice methods such as discrete choice
models and machine learning classifiers for the same dataset. This dual
improvement, i.e., high instance-level accuracy and near-perfect distributional
calibration, demonstrates the feasibility of creating specialist, locally
deployable LLMs that integrate prediction and interpretability. Through
combining structured behavioural prediction with natural language reasoning,
this work unlocks the potential for conversational, multi-task transport models
capable of supporting agent-based simulations, policy testing, and behavioural
insight generation. These findings establish a pathway for transforming general
purpose LLMs into specialized and explainable tools for transportation research
and policy formulation, while maintaining privacy, reducing cost, and
broadening access through local deployment.
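
To make the distinction between instance-level accuracy and distributional calibration concrete, the following sketch computes the Jensen-Shannon Divergence between observed and predicted mode shares. The base-2 logarithm, the helper names, and the toy choices are assumptions for illustration; the abstract does not state which logarithm base or aggregation the authors used.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon Divergence; with log base 2 it lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def shares(choices, modes):
    """Fraction of decisions that fall on each mode, in the order of `modes`."""
    n = len(choices)
    return [choices.count(mode) / n for mode in modes]

modes = ["car", "transit", "walk"]                 # hypothetical mode set
observed = ["car", "transit", "car", "walk"]
predicted = ["transit", "car", "car", "walk"]      # two individual errors

accuracy = sum(o == p for o, p in zip(observed, predicted)) / len(observed)
divergence = jsd(shares(observed, modes), shares(predicted, modes))
print(accuracy, divergence)
```

In this toy case half of the individual predictions are wrong while the aggregate mode shares match exactly, so the divergence is zero. This is why a per-instance metric such as weighted F1 and a distributional metric such as JSD are reported together.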
[LINK]
http://arxiv.org/abs/2507.21432v2
[DATE]
2025-10-07 20:12:13+08:00
[CATEGORIES]
cs.CL
EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget
[AUTHORS]
Liang Chen, Xueting Han, Qizhou Wang, Bo Han, Jing Bai, Hinrich Schutze, Kam-Fai Wong
[ABSTRACT]
Balancing exploration and exploitation remains a central challenge in
reinforcement learning with verifiable rewards (RLVR) for large language models
(LLMs). Current RLVR methods often overemphasize exploitation, leading to
entropy collapse, diminished exploratory capacity, and ultimately limited
performance gains. Although techniques that increase policy stochasticity can
promote exploration, they frequently fail to escape dominant behavioral modes.
This creates a self-reinforcing loop (repeatedly sampling and rewarding dominant
modes) that further erodes exploration. We introduce Exploration-Enhanced Policy
Optimization (EEPO), a framework that promotes exploration via two-stage
rollouts with adaptive unlearning. In the first stage, the model generates half
of the trajectories; it then undergoes a lightweight unlearning step to
temporarily suppress these sampled responses, forcing the second stage to
explore different regions of the output space. This sample-then-forget
mechanism disrupts the self-reinforcing loop and promotes wider exploration
during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO,
achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on
Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
[LINK]
http://arxiv.org/abs/2510.05837v1
[DATE]
2025-10-07 20:02:03+08:00
[CATEGORIES]
cs.CL
Fair Play in the Newsroom: Actor-Based Filtering Gender Discrimination in Text Corpora
[AUTHORS]
Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher, Christian Heumann, Stephanie Thiemichen
[ABSTRACT]
Language corpora are the foundation of most natural language processing
research, yet they often reproduce structural inequalities. One such inequality
is gender discrimination in how actors are represented, which can distort
analyses and perpetuate discriminatory outcomes. This paper introduces a
user-centric, actor-level pipeline for detecting and mitigating gender
discrimination in large-scale text corpora. By combining discourse-aware
analysis with metrics for sentiment, syntactic agency, and quotation styles,
our method enables both fine-grained auditing and exclusion-based balancing.
Applied to the taz2024full corpus of German newspaper articles (1980-2024), the
pipeline yields a more gender-balanced dataset while preserving core dynamics
of the source material. Our findings show that structural asymmetries can be
reduced through systematic filtering, though subtler biases in sentiment and
framing remain. We release the tools and reports to support further research in
discourse-based fairness auditing and equitable corpus construction.
[LINK]
http://arxiv.org/abs/2508.13169v2
[DATE]
2025-10-07 19:54:24+08:00
[CATEGORIES]
cs.CL
Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling
[AUTHORS]
Giorgio Giannone, Guangxuan Xu, Nikhil Shivakumar Nayak, Rohan Mahesh Awhad, Shivchander Sudalairaj, Kai Xu, Akash Srivastava
[ABSTRACT]
Inference-Time Scaling (ITS) improves language models by allocating more
computation at generation time. Particle Filtering (PF) has emerged as a strong
ITS method for complex mathematical reasoning tasks, but it is vulnerable when
guided by process reward models, which often assign overconfident scores early
in the reasoning process. This causes PF to suffer from premature exploitation:
it myopically commits to locally promising trajectories, prunes potentially
correct hypotheses, and converges to suboptimal solutions. This failure mode,
known as particle impoverishment, is especially severe under constrained
computational budgets. To address this, we analyze the problem and identify two
root causes: a lack of diversity in the particle set due to overconfident
resampling and consequent inability to assess the potential of a reasoning
path. We introduce Entropic Particle Filtering (ePF), an algorithm that
integrates two new techniques to solve these issues. The first technique,
Entropic Annealing (EA), directly mitigates particle impoverishment by
monitoring search diversity via entropy; when diversity drops, it intervenes by
dynamically annealing the resampling distribution to preserve exploration. The
second, an enhancement called Look-ahead Modulation (LaM), adds a predictive
guide to evaluate a state’s potential based on its successors. On several
challenging math benchmarks, ePF significantly outperforms strong baselines and
achieves up to a 50% relative improvement in task reward. Together, these
methods improve PF’s resilience by balancing the exploration of diverse
solution spaces with the exploitation of high-reward regions, ultimately
leading to higher-quality solutions.
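
The entropy-monitoring idea behind Entropic Annealing can be sketched as follows. This is an illustrative reading of the abstract, not the authors' algorithm: the softmax over reward scores, the normalized-entropy threshold `min_entropy`, and the multiplicative temperature schedule `step` are all assumptions introduced here.

```python
import math

def softmax(scores, temperature=1.0):
    """Resampling weights from reward scores; higher temperature flattens them."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(N), so 1.0 means uniform weights."""
    n = len(probs)
    if n < 2:
        return 1.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(n)

def annealed_weights(scores, min_entropy=0.5, step=1.5, max_temperature=100.0):
    """Raise the temperature until the resampling weights are diverse enough."""
    temperature = 1.0
    weights = softmax(scores, temperature)
    while normalized_entropy(weights) < min_entropy and temperature < max_temperature:
        temperature *= step
        weights = softmax(scores, temperature)
    return weights, temperature

# One particle has an overconfident early score (assumed toy values).
scores = [10.0, 0.0, 0.0, 0.0]
weights, temperature = annealed_weights(scores)
print(temperature, weights)
```

The highest-scoring particle still receives the largest weight after annealing, but the remaining particles keep enough probability mass to survive resampling, which is the intended guard against particle impoverishment.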
[LINK]
http://arxiv.org/abs/2510.05825v1
[DATE]
2025-10-07 19:48:32+08:00
[CATEGORIES]
cs.LG
cs.CL
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research
[AUTHORS]
Zijian Li, Xin Guan, Bo Zhang, Shen Huang, Houquan Zhou, Shaopeng Lai, Ming Yan, Yong Jiang, Pengjun Xie, Fei Huang, Jun Zhang, Jingren Zhou
[ABSTRACT]
This paper tackles open-ended deep research (OEDR), a complex
challenge where AI agents must synthesize vast web-scale information into
insightful reports. Current approaches are plagued by dual-fold limitations:
static research pipelines that decouple planning from evidence acquisition and
monolithic generation paradigms that include redundant, irrelevant evidence,
suffering from hallucination issues and low citation accuracy. To address these
challenges, we introduce WebWeaver, a novel dual-agent framework that
emulates the human research process. The planner operates in a dynamic cycle,
iteratively interleaving evidence acquisition with outline optimization to
produce a comprehensive, citation-grounded outline linking to a memory bank of
evidence. The writer then executes a hierarchical retrieval and writing
process, composing the report section by section. By performing targeted
retrieval of only the necessary evidence from the memory bank via citations for
each part, it effectively mitigates long-context issues and citation
hallucinations. Our framework establishes a new state-of-the-art across major
OEDR benchmarks, including DeepResearch Bench, DeepConsult, and
DeepResearchGym. These results validate our human-centric, iterative
methodology, demonstrating that adaptive planning and focused synthesis are
crucial for producing comprehensive, trusted, and well-structured reports.
[COMMENTS]
An agent system for open-ended deep research
[LINK]
http://arxiv.org/abs/2509.13312v3
[DATE]
2025-10-07 19:47:57+08:00
[CATEGORIES]
cs.CL
An Embarrassingly Simple Defense Against LLM Abliteration Attacks
[AUTHORS]
Harethah Abu Shairah, Hasan Abed Al Kader Hammoud, Bernard Ghanem, George Turkiyyah
[ABSTRACT]
Large language models (LLMs) are typically aligned to refuse harmful
instructions through safety fine-tuning. A recent attack, termed abliteration,
identifies and suppresses the single latent direction most responsible for
refusal behavior, thereby enabling models to generate harmful content. We
propose a defense that fundamentally alters how models express refusal. We
construct an extended-refusal dataset in which responses to harmful prompts
provide detailed justifications before refusing, distributing the refusal
signal across multiple token positions. Fine-tuning Llama-2-7B-Chat and
Qwen2.5-Instruct (1.5B and 3B parameters) on this dataset yields models that
maintain high refusal rates under abliteration: refusal rates drop by at most
10%, compared to 70-80% drops in baseline models. Comprehensive evaluations of
safety and utility demonstrate that extended-refusal fine-tuning effectively
neutralizes abliteration attacks while preserving general model performance and
enhancing robustness across multiple alignment scenarios.
[COMMENTS]
preprint - under review
[LINK]
http://arxiv.org/abs/2505.19056v2
[DATE]
2025-10-07 19:31:29+08:00
[CATEGORIES]
cs.CL
cs.LG
SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning
[AUTHORS]
Kehua Feng, Keyan Ding, Yuhao Wang, Menghan Li, Fanjunduo Wei, Xinda Wang, Qiang Zhang, Huajun Chen
[ABSTRACT]
Recent advancements in large language models (LLMs) have accelerated progress
toward artificial general intelligence, yet their potential to generate harmful
content poses critical safety challenges. Existing alignment methods often
struggle to cover diverse safety scenarios and remain vulnerable to adversarial
attacks. In this work, we propose SAFER, a framework for Safety Alignment via
eFficient Ex-Ante Reasoning. Our approach instantiates structured Ex-Ante
reasoning through initial assessment, rule verification, and path calibration,
and embeds predefined safety rules to provide transparent and verifiable safety
judgments. Specifically, our approach consists of two training stages: (1)
supervised fine-tuning with synthetic traces to teach the multi-stage Ex-Ante
reasoning, and (2) step-level reasoning preference optimization to jointly
enhance safety, utility, and efficiency. Experiments on multiple open-source
LLMs demonstrate that SAFER significantly enhances safety performance while
maintaining helpfulness and response efficiency.
[COMMENTS]
22 pages, 5 figures
[LINK]
http://arxiv.org/abs/2504.02725v2
[DATE]
2025-10-07 19:07:54+08:00
[CATEGORIES]
cs.CL
MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation
[AUTHORS]
Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu
[ABSTRACT]
With the rapid growth of academic publications, peer review has become an
essential yet time-consuming responsibility within the research community.
Large Language Models (LLMs) have increasingly been adopted to assist in the
generation of review comments; however, current LLM-based review tasks lack a
unified evaluation benchmark to rigorously assess the models’ ability to
produce comprehensive, accurate, and human-aligned assessments, particularly in
scenarios involving multimodal content such as figures and tables. To address
this gap, we propose MMReview, a comprehensive benchmark that spans
multiple disciplines and modalities. MMReview includes multimodal content and
expert-written review comments for 240 papers across 17 research domains within
four major academic disciplines: Artificial Intelligence, Natural Sciences,
Engineering Sciences, and Social Sciences. We design a total of 13 tasks
grouped into four core categories, aimed at evaluating the performance of LLMs
and Multimodal LLMs (MLLMs) in step-wise review generation, outcome
formulation, alignment with human preferences, and robustness to adversarial
input manipulation. Extensive experiments conducted on 16 open-source models
and 5 advanced closed-source models demonstrate the thoroughness of the
benchmark. We envision MMReview as a critical step toward establishing a
standardized foundation for the development of automated peer review systems.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2508.14146v3
[DATE]
2025-10-07 18:58:44+08:00
[CATEGORIES]
cs.CL
Mixture of Neuron Experts
[AUTHORS]
Runxi Cheng, Yuchen Guan, Yucheng Ding, Qingguo Hu, Yongxian Wei, Chun Yuan, Yelong Shen, Weizhu Chen, Yeyun Gong
[ABSTRACT]
In this work, we first explore whether the parameters activated by the MoE
layer remain highly sparse at inference. We perform a sparsification study on
several representative MoE models. For each expert, we rank parameters by the
magnitude of their activations from the gate projection and progressively prune
the activated subset. Pruning up to 60% of parameters within that subset causes
only negligible task-performance degradation; substantial drops occur only
after more than 90% are removed. We further decompose experts into
neuron-granular MoE and visualize their activation values, finding that most
neuron activations are near zero. This observation motivates us to select only
high-activation neuron experts during pretraining. Based on this insight, we
propose Mixture of Neuron Experts (MoNE). MoNE achieves neuron-granular expert
selection by only applying a simple top-k selection within each expert, incurs
negligible latency, and requires no additional routing parameters or
inter-expert communication. Extensive experiments demonstrate that MoNE matches
traditional MoE performance while activating only 50% of the MoE-layer
parameters, and it consistently outperforms traditional MoE when compared at
equal numbers of activated parameters. These results suggest that MoNE is a
practical approach to improving parameter utilization and inference efficiency
in MoE-like models.
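
The neuron-level top-k selection described above can be sketched for a single gated feed-forward expert. The SiLU gate, the weight shapes, and the function names are assumptions made for illustration; the abstract specifies only that neurons are ranked by the magnitude of their gate-projection activations and that a top-k selection is applied within each expert.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expert_forward(x, w_gate, w_up, w_down, k=None):
    """Gated FFN expert. If k is given, only the k hidden neurons with the
    largest |gate activation| contribute to the output."""
    gate = [silu(dot(row, x)) for row in w_gate]   # one value per hidden neuron
    hidden = range(len(gate))
    if k is None:
        active = list(hidden)
    else:
        active = sorted(hidden, key=lambda j: abs(gate[j]), reverse=True)[:k]
    out = [0.0] * len(w_down)
    for j in active:
        h = gate[j] * dot(w_up[j], x)              # gated hidden activation
        for i in range(len(w_down)):
            out[i] += w_down[i][j] * h
    return out, sorted(active)

# Toy expert with 4 hidden neurons, 2 inputs, 1 output (assumed weights).
x = [1.0, -1.0]
w_gate = [[2.0, 0.0], [0.0, 0.1], [0.05, 0.0], [0.0, -3.0]]
w_up = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
w_down = [[1.0, 1.0, 1.0, 1.0]]

dense_out, _ = expert_forward(x, w_gate, w_up, w_down)
sparse_out, active = expert_forward(x, w_gate, w_up, w_down, k=2)
print(active, dense_out, sparse_out)
```

Because the two skipped neurons have gate activations close to zero, the output with half of the neurons active stays close to the dense output in this toy case, and the selection needs no extra routing parameters beyond the gate values the expert already computes.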
[COMMENTS]
18 pages, 11 figures, 7 tables
[LINK]
http://arxiv.org/abs/2510.05781v1
[DATE]
2025-10-07 18:51:58+08:00
[CATEGORIES]
cs.CL
When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
[AUTHORS]
Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, Jinsong Su
[ABSTRACT]
Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful
paradigm for enhancing large language models (LLMs) with external knowledge. It
leverages graphs to model the hierarchical structure between specific concepts,
enabling more coherent and effective knowledge retrieval for accurate
reasoning. Despite its conceptual promise, recent studies report that GraphRAG
frequently underperforms vanilla RAG on many real-world tasks. This raises a
critical question: Is GraphRAG really effective, and in which scenarios do
graph structures provide measurable benefits for RAG systems? To address this,
we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate
GraphRAG models on both hierarchical knowledge retrieval and deep contextual
reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of
increasing difficulty, covering fact retrieval, complex reasoning, contextual
summarization, and creative generation, and a systematic evaluation across the
entire pipeline, from graph construction and knowledge retrieval to final
generation. Leveraging this novel benchmark, we systematically investigate the
conditions when GraphRAG surpasses traditional RAG and the underlying reasons
for its success, offering guidelines for its practical application. All related
resources and analyses are collected for the community at
https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.
[COMMENTS]
All resources and analyses are collected at
https://github.com/GraphRAG-Bench/GraphRAG-Benchmark
[LINK]
http://arxiv.org/abs/2506.05690v2
[DATE]
2025-10-07 18:50:33+08:00
[CATEGORIES]
cs.CL
InforME: Improving Informativeness of Abstractive Text Summarization With Informative Attention Guided by Named Entity Salience
[AUTHORS]
Jianbin Shen, Christy Jie Liang, Junyu Xuan
[ABSTRACT]
Abstractive text summarization is integral to the Big Data era, which demands
advanced methods to turn voluminous and often long text data into concise but
coherent and informative summaries for efficient human consumption. Despite
significant progress, there is still room for improvement in various aspects.
One such aspect is to improve informativeness. Hence, this paper proposes a
novel learning approach consisting of two methods: an optimal transport-based
informative attention method to improve learning focal information in reference
summaries and an accumulative joint entropy reduction method on named entities
to enhance informative salience. Experiment results show that our approach
achieves better ROUGE scores compared to prior work on CNN/Daily Mail while
having competitive results on XSum. Human evaluation of informativeness also
demonstrates the better performance of our approach over a strong baseline.
Further analysis gives insight into the plausible reasons underlying the
evaluation results.
[LINK]
http://arxiv.org/abs/2510.05769v1
[DATE]
2025-10-07 18:40:09+08:00
[CATEGORIES]
cs.CL
AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web
[AUTHORS]
Rui Cao, Zifeng Ding, Zhijiang Guo, Michael Schlichtkrull, Andreas Vlachos
[COMMENTS]
accepted at NeurIPS 2025 Datasets and Benchmarks Track
[LINK]
http://arxiv.org/abs/2505.17978v2
[DATE]
2025-10-07 18:35:02+08:00
[CATEGORIES]
cs.CL
Early Multimodal Prediction of Cross-Lingual Meme Virality on Reddit: A Time-Window Analysis
[AUTHORS]
Sedat Dogan, Nina Dethlefs, Debarati Chakraborty
[ABSTRACT]
Predicting the virality of online content remains challenging, especially for
culturally complex, fast-evolving memes. This study investigates the
feasibility of early prediction of meme virality using a large-scale,
cross-lingual dataset from 25 diverse Reddit communities. We propose a robust,
data-driven method to define virality based on a hybrid engagement score,
learning a percentile-based threshold from a chronologically held-out training
set to prevent data leakage. We evaluated a suite of models, including Logistic
Regression, XGBoost, and a Multi-layer Perceptron (MLP), with a comprehensive,
multimodal feature set across increasing time windows (30-420 min). Crucially,
useful signals emerge quickly: our best-performing model, XGBoost, achieves a
PR-AUC $>$ 0.52 in just 30 minutes. Our analysis reveals a clear “evidentiary
transition,” in which feature importance shifts dynamically from static
context to temporal dynamics as a meme gains traction. This work
establishes a robust, interpretable, and practical benchmark for early virality
prediction in scenarios where full diffusion cascade data is unavailable,
contributing a novel cross-lingual dataset and a methodologically sound
definition of virality. To our knowledge, this study is the first to combine
time series data with static content and network features to predict early meme
virality.
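
A minimal sketch of learning a percentile-based virality threshold from a chronologically earlier training split, so that no later posts influence the cut-off. The field names `created` and `score`, the 80/20 split, and the 90th percentile default are hypothetical choices; the abstract does not give the actual engagement score or percentile.

```python
def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def label_viral(posts, train_frac=0.8, q=90.0):
    """Split posts by time, fit the threshold on the earlier part only."""
    ordered = sorted(posts, key=lambda p: p["created"])
    n_train = int(len(ordered) * train_frac)
    train, test = ordered[:n_train], ordered[n_train:]
    # The threshold sees training scores only, which avoids leakage.
    threshold = percentile([p["score"] for p in train], q)

    def is_viral(p):
        return int(p["score"] >= threshold)

    return threshold, [is_viral(p) for p in train], [is_viral(p) for p in test]
```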
[COMMENTS]
Preprint work in progress. Main body: 9 pages. Total: 15 pages
including references and appendix. 16 figures and 12 tables
[LINK]
http://arxiv.org/abs/2510.05761v1
[DATE]
2025-10-07 18:27:36+08:00
[CATEGORIES]
cs.CL
Text Clustering as Classification with LLMs
[AUTHORS]
Chen Huang, Guoxiu He
[ABSTRACT]
Text clustering serves as a fundamental technique for organizing and
interpreting unstructured textual data, particularly in contexts where manual
annotation is prohibitively costly. With the rapid advancement of Large
Language Models (LLMs) and their demonstrated effectiveness across a broad
spectrum of NLP tasks, an emerging body of research has begun to explore their
potential in the domain of text clustering. However, existing LLM-based
approaches still rely on fine-tuned embedding models and sophisticated
similarity metrics, rendering them computationally intensive and necessitating
domain-specific adaptation. To address these limitations, we propose a novel
framework that reframes text clustering as a classification task by harnessing
the in-context learning capabilities of LLMs. Our framework eliminates the need
for fine-tuning embedding models or intricate clustering algorithms. It
comprises two key steps: first, the LLM is prompted to generate a set of
candidate labels based on the dataset and then merges semantically similar
labels; second, it assigns the most appropriate label to each text sample. By
leveraging the advanced natural language understanding and generalization
capabilities of LLMs, the proposed approach enables effective clustering with
minimal human intervention. Experimental results on diverse datasets
demonstrate that our framework achieves comparable or superior performance to
state-of-the-art embedding-based clustering techniques, while significantly
reducing computational complexity and resource requirements. These findings
underscore the transformative potential of LLMs in simplifying and enhancing
text clustering tasks. We make our code available to the public for utilization
at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM. We also
provide the supplementary Appendix within the repository.
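
The two steps described above can be sketched as follows. This is a hedged illustration, not the code in the authors' repository: `llm` is a hypothetical callable that maps a prompt string to a completion string, and the prompt wording and the fallback rule are assumptions.

```python
from typing import Callable, Dict, List


def cluster_as_classification(texts: List[str],
                              llm: Callable[[str], str]) -> Dict[str, List[str]]:
    """Cluster texts by (1) generating and merging labels, (2) classifying."""
    # Step 1a: ask for candidate labels based on the dataset.
    listing = "\n".join("- " + t for t in texts)
    raw = llm("Propose one short topic label per line for these texts:\n" + listing)
    candidates = [line.strip() for line in raw.splitlines() if line.strip()]

    # Step 1b: ask the model to merge semantically similar labels.
    merged = llm("Merge labels that mean the same thing; one label per line:\n"
                 + "\n".join(candidates))
    labels = [line.strip() for line in merged.splitlines() if line.strip()]
    if not labels:
        raise ValueError("the model returned no labels")

    # Step 2: assign exactly one label to each text.
    clusters: Dict[str, List[str]] = {label: [] for label in labels}
    for text in texts:
        choice = llm(f"Choose exactly one label from {labels} for this text:\n{text}").strip()
        if choice not in clusters:
            choice = labels[0]  # fallback for off-list answers (assumption)
        clusters[choice].append(text)
    return clusters
```

No embedding model or clustering algorithm appears anywhere in the sketch; all of the grouping is delegated to the in-context classification calls.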
[COMMENTS]
11 pages, 3 figures
[LINK]
http://arxiv.org/abs/2410.00927v3
[DATE]
2025-10-07 18:17:31+08:00
[CATEGORIES]
cs.CL
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
[AUTHORS]
Bohan Yao, Shiva Krishna Reddy Malay, Vikas Yadav
[ABSTRACT]
Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved
state-of-the-art results on various complex reasoning tasks. Recent works have
proposed techniques to automate the design of MASes, eliminating the need for
manual engineering. However, these techniques perform poorly, often achieving
similar or inferior performance to simple baselines. Furthermore, they require
computationally expensive re-discovery of architectures for each new task
domain and expensive data annotation on domains without existing labeled
validation sets. A critical insight is that simple Chain of Thought (CoT)
reasoning often performs competitively with these complex systems, suggesting
that the fundamental reasoning unit of MASes, CoT, warrants further
investigation. To this end, we present a new paradigm for automatic MAS design
that pivots the focus to optimizing CoT reasoning. We introduce the Agentic
Reasoning Module (ARM), an agentic generalization of CoT where each granular
reasoning step is executed by a specialized reasoning module. This module is
discovered through a tree search over the code space, starting from a simple
CoT module and evolved using mutations informed by reflection on execution
traces. The resulting ARM acts as a versatile reasoning building block which
can be utilized as a direct recursive loop or as a subroutine in a learned
meta-orchestrator. Our approach significantly outperforms both manually
designed MASes and state-of-the-art automatic MAS design methods. Crucially,
MASes built with ARM exhibit superb generalization, maintaining high
performance across different foundation models and task domains without further
optimization.
[COMMENTS]
29 pages, 2 figures
[LINK]
http://arxiv.org/abs/2510.05746v1
[DATE]
2025-10-07 18:04:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Geometry-Guided Adversarial Prompt Detection via Curvature and Local Intrinsic Dimension
[AUTHORS]
Canaan Yung, Hanxun Huang, Christopher Leckie, Sarah Erfani
[ABSTRACT]
Adversarial prompts are capable of jailbreaking frontier large language
models (LLMs) and inducing undesirable behaviours, posing a significant
obstacle to their safe deployment. Current mitigation strategies primarily rely
on activating built-in defence mechanisms or fine-tuning LLMs, both of which
are computationally expensive and can sacrifice model utility. In contrast,
detection-based approaches are more efficient and practical for deployment in
real-world applications. However, the fundamental distinctions between
adversarial and benign prompts remain poorly understood. In this work, we
introduce CurvaLID, a novel defence framework that efficiently detects
adversarial prompts by leveraging their geometric properties. It is agnostic to
the type of LLM, offering a unified detection framework across diverse
adversarial prompts and LLM architectures. CurvaLID builds on the geometric
analysis of text prompts to uncover their underlying differences. We
theoretically extend the concept of curvature via the Whewell equation into an
$n$-dimensional word embedding space, enabling us to quantify local geometric
properties, including semantic shifts and curvature in the underlying
manifolds. To further enhance our solution, we leverage Local Intrinsic
Dimensionality (LID) to capture complementary geometric features of text
prompts within adversarial subspaces. Our findings show that adversarial
prompts exhibit distinct geometric signatures from benign prompts, enabling
CurvaLID to achieve near-perfect classification and outperform state-of-the-art
detectors in adversarial prompt detection. CurvaLID provides a reliable and
efficient safeguard against malicious queries as a model-agnostic method that
generalises across multiple LLMs and attack families.
[COMMENTS]
40 pages, 6 figures
[LINK]
http://arxiv.org/abs/2503.03502v2
[DATE]
2025-10-07 18:03:12+08:00
[CATEGORIES]
cs.CL
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?
[AUTHORS]
Hao Yin, Guangzong Si, Zilei Wang
[ABSTRACT]
Contrastive decoding strategies are widely used to reduce object
hallucinations in multimodal large language models (MLLMs). These methods work
by constructing contrastive samples to induce hallucinations and then
suppressing them in the output distribution. However, this paper demonstrates
that such approaches fail to effectively mitigate the hallucination problem.
The performance improvements observed on POPE Benchmark are largely driven by
two misleading factors: (1) crude, unidirectional adjustments to the model’s
output distribution and (2) the adaptive plausibility constraint, which reduces
the sampling strategy to greedy search. To further illustrate these issues, we
introduce a series of spurious improvement methods and evaluate their
performance against contrastive decoding techniques. Experimental results
reveal that the observed performance gains in contrastive decoding are entirely
unrelated to its intended goal of mitigating hallucinations. Our findings
challenge common assumptions about the effectiveness of contrastive decoding
strategies and pave the way for developing genuinely effective solutions to
hallucinations in MLLMs.
[LINK]
http://arxiv.org/abs/2504.10020v3
[DATE]
2025-10-07 17:52:04+08:00
[CATEGORIES]
cs.CL
Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
[AUTHORS]
Chunsan Hong, Seonho An, Min-Soo Kim, Jong Chul Ye
[ABSTRACT]
Masked diffusion models (MDMs) have recently emerged as a novel framework for
language modeling. MDMs generate sentences by iteratively denoising masked
sequences, filling in [MASK] tokens step by step. Although MDMs support
any-order sampling, performance is highly sensitive to the choice of which
position to unmask next. Prior work typically relies on rule-based schedules
(e.g., max-confidence, max-margin), which provide ad hoc improvements. In
contrast, we replace these heuristics with a learned scheduler. Specifically,
we cast denoising as a KL-regularized Markov decision process (MDP) with an
explicit reference policy and optimize a regularized objective that admits
policy improvement and convergence guarantees under standard assumptions. We
prove that the optimized policy under this framework generates samples that
more closely match the data distribution than heuristic schedules. Empirically,
across four benchmarks, our learned policy consistently outperforms
max-confidence: for example, on SUDOKU, where unmasking order is critical, it
yields a 20.1% gain over random and an 11.2% gain over max-confidence.
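
For reference, the two rule-based schedules named above can be sketched as follows, using their usual definitions: max-confidence unmasks the position whose top token probability is highest, and max-margin unmasks the position with the largest gap between its top two probabilities. The function names and the dictionary input format are illustrative assumptions, and the learned scheduler proposed in the paper is not shown.

```python
def max_confidence(probs_by_pos):
    """Pick the masked position whose most likely token has the highest probability."""
    return max(probs_by_pos, key=lambda pos: max(probs_by_pos[pos]))


def max_margin(probs_by_pos):
    """Pick the masked position with the largest top-1 minus top-2 probability gap."""
    def margin(pos):
        top = sorted(probs_by_pos[pos], reverse=True)
        return top[0] - top[1]
    return max(probs_by_pos, key=margin)


# Keys are masked positions; values are token distributions at those positions.
masked = {
    2: [0.60, 0.39, 0.01],  # confident, but the runner-up is close
    5: [0.55, 0.25, 0.20],  # less confident, but clearly separated
}
next_by_confidence = max_confidence(masked)
next_by_margin = max_margin(masked)
```

The two heuristics can disagree on the same distributions, which is one way to see why the choice of unmasking order matters.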
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2510.05725v1
[DATE]
2025-10-07 17:44:24+08:00
[CATEGORIES]
cs.LG
cs.CL
Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages
[AUTHORS]
Israel Abebe Azime, Tadesse Destaw Belay, Dietrich Klakow, Philipp Slusallek, Anshuman Chhabra
[ABSTRACT]
Large language models (LLMs) have demonstrated significant capabilities in
solving mathematical problems expressed in natural language. However,
multilingual and culturally-grounded mathematical reasoning in low-resource
languages lags behind English due to the scarcity of socio-cultural task
datasets that reflect accurate native entities such as person names,
organization names, and currencies. Existing multilingual benchmarks are
predominantly produced via translation and typically retain English-centric
entities, owing to the high cost associated with human annotator-based
localization. Moreover, automated localization tools are limited, and hence,
truly localized datasets remain scarce. To bridge this gap, we introduce a
framework for LLM-driven cultural localization of math word problems that
automatically constructs datasets with native names, organizations, and
currencies from existing sources. We find that translated benchmarks can
obscure true multilingual math ability under appropriate socio-cultural
contexts. Through extensive experiments, we also show that our framework can
help mitigate English-centric entity bias and improve robustness when native
entities are introduced across various languages.
[LINK]
http://arxiv.org/abs/2508.14913v3
[DATE]
2025-10-07 17:29:49+08:00
[CATEGORIES]
cs.CL
Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling
[AUTHORS]
Mary Llewellyn, Annie Gray, Josh Collyer, Michael Harries
[ABSTRACT]
Before adopting a new large language model (LLM) architecture, it is critical
to understand vulnerabilities accurately. Existing evaluations can be difficult
to trust, often drawing conclusions from LLMs that are not meaningfully
comparable, relying on heuristic inputs or employing metrics that fail to
capture the inherent uncertainty. In this paper, we propose a principled and
practical end-to-end framework for evaluating LLM vulnerabilities to prompt
injection attacks. First, we propose practical approaches to experimental
design, tackling unfair LLM comparisons by considering two practitioner
scenarios: when training an LLM and when deploying a pre-trained LLM. Second,
we address the analysis of experiments and propose a Bayesian hierarchical
model with embedding-space clustering. This model is designed to improve
uncertainty quantification in the common scenario that LLM outputs are not
deterministic, test prompts are designed imperfectly, and practitioners only
have a limited amount of compute to evaluate vulnerabilities. We show the
improved inferential capabilities of the model in several prompt injection
attack settings. Finally, we demonstrate the pipeline to evaluate the security
of Transformer versus Mamba architectures. Our findings show that consideration
of output variability can suggest less definitive findings. However, for some
attacks, we find notably increased Transformer and Mamba-variant
vulnerabilities across LLMs with the same training data or mathematical
ability.
[LINK]
http://arxiv.org/abs/2510.05709v1
[DATE]
2025-10-07 17:22:22+08:00
[CATEGORIES]
cs.CL
Context Biasing for Pronunciations-Orthography Mismatch in Automatic Speech Recognition
[AUTHORS]
Christian Huber, Alexander Waibel
[ABSTRACT]
Neural sequence-to-sequence systems deliver state-of-the-art performance for
automatic speech recognition. When using appropriate modeling units, e.g.,
byte-pair encoded characters, these systems are in principle open vocabulary
systems. In practice, however, they often fail to recognize words not seen
during training, e.g., named entities, acronyms, or domain-specific special
words. To address this problem, many context biasing methods have been
proposed; however, for words with a pronunciation-orthography mismatch, these
methods may still struggle. We propose a method which allows corrections of
substitution errors to improve the recognition accuracy of such challenging
words. Users can add corrections on the fly during inference. We show that with
this method we get a relative improvement in biased word error rate of up to
8%, while maintaining a competitive overall word error rate.
[LINK]
http://arxiv.org/abs/2506.18703v2
[DATE]
2025-10-07 17:14:10+08:00
[CATEGORIES]
cs.CL
cs.LG
DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision
[AUTHORS]
Yongqi Leng, Yikun Lei, Xikai Liu, Meizhi Zhong, Bojian Xiong, Yurong Zhang, Yan Gao, Yi Wu, Yao Hu, Deyi Xiong
[ABSTRACT]
Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing
capability for complex tasks through dynamic retrieval and adaptive workflows.
Recent advances (e.g., Search-R1) have shown that outcome-supervised
reinforcement learning demonstrates strong performance. However, this approach
still suffers from inefficient exploration, sparse reward signals, and
ambiguous global reward feedback. To address these challenges, we propose
DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating
decision-making and execution, while introducing an efficient pruning strategy
to optimize data expansion. Through comprehensive process-level policy
optimization, DecEx-RAG significantly enhances the autonomous task
decomposition, dynamic retrieval, and high-quality answer generation
capabilities of large language models (LLMs). Experiments show that DecEx-RAG
achieves an average absolute performance improvement of $6.2\%$ across six
datasets, significantly outperforming existing baselines. Moreover, the pruning
strategy improves data construction efficiency by nearly $6 \times$, providing
an efficient solution for process-supervised RAG training. The code is
available at https://github.com/sdsxdxl/DecEx-RAG.
[LINK]
http://arxiv.org/abs/2510.05691v1
[DATE]
2025-10-07 16:49:22+08:00
[CATEGORIES]
cs.CL
Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models
[AUTHORS]
Haneul Yoo, Jiho Jin, Kyunghyun Cho, Alice Oh
[ABSTRACT]
While large language models (LLMs) exhibit strong multilingual abilities,
their reliance on English as latent representations creates a translation
barrier, where reasoning implicitly depends on internal translation into
English. When this process fails, performance in non-English languages
deteriorates sharply, limiting the inclusiveness of LLM-based applications.
Existing cross-lingual in-context learning (X-ICL) methods primarily leverage
monolingual demonstrations, often failing to mitigate this barrier and instead
reinforcing it. In this work, we introduce code-switching in-context learning
(CSICL), a simple yet effective prompting strategy that progressively
transitions from a target language to English within demonstrations and
instruction to facilitate their latent reasoning in English. By explicitly
scaffolding the reasoning process through controlled code-switching, CSICL acts
as an implicit linguistic bridge that enhances cross-lingual alignment and
reduces reliance on the translation barrier. We conduct extensive experiments
across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive
and reasoning-oriented domains. Our results demonstrate that CSICL consistently
outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in target
and unseen languages, respectively. The improvement is even more pronounced in
low-resource settings, with gains of 14.7% in target and 5.3% in unseen
languages. These findings establish code-switching as a principled and robust
approach for overcoming the translation barrier during inference, moving LLMs
toward more equitable and effective multilingual systems.
[LINK]
http://arxiv.org/abs/2510.05678v1
[DATE]
2025-10-07 16:35:42+08:00
[CATEGORIES]
cs.CL
VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
[AUTHORS]
Di Wu, Yixin Wan, Kai-Wei Chang
[ABSTRACT]
Text-to-image retrieval (T2I retrieval) remains challenging because
cross-modal embeddings often behave as bags of concepts and underrepresent
structured visual relationships such as pose and viewpoint. We propose
Visualize-then-Retrieve (VisRet), a new paradigm for T2I retrieval that
mitigates this limitation of cross-modal similarity alignment. VisRet first
projects textual queries into the image modality via T2I generation. Then, it
performs retrieval within the image modality to bypass the weaknesses of
cross-modal retrievers in recognizing subtle visual-spatial features. Across
four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new
Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially
outperforms cross-modal similarity matching and baselines that recast T2I
retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on
average with CLIP as the retriever and by 0.121 with E5-V. For downstream
question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME
by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10
retrieval. Ablation studies show compatibility with different T2I instruction
LLMs, T2I generation models, and downstream LLMs. VisRet provides a practical
and principled path that energizes further advances in vision-language
retrieval. Our code and the Visual-RAG-ME benchmark will be publicly released.
[LINK]
http://arxiv.org/abs/2505.20291v2
[DATE]
2025-10-07 15:50:24+08:00
[CATEGORIES]
cs.CL
Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization
[AUTHORS]
Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng
[ABSTRACT]
Selective retrieval improves the accuracy and efficiency of
retrieval-augmented generation (RAG) by reducing distractions from low-quality
retrievals. However, existing approaches underutilize the inherent knowledge of
large language models (LLMs), leading to suboptimal retrieval decisions and
degraded generation performance. To bridge this gap, we propose Self-Routing
RAG (SR-RAG), a novel framework that binds selective retrieval with knowledge
verbalization. SR-RAG enables an LLM to dynamically decide whether to retrieve
external knowledge or verbalize its own parametric knowledge. To this end, we
design a multi-task objective that jointly optimizes an LLM for knowledge
source selection, knowledge verbalization, and response generation. SR-RAG
further incorporates a nearest neighbor search mechanism at inference time to
improve the accuracy of knowledge source decisions under domain shifts.
Fine-tuning three LLMs with SR-RAG significantly improves their response
accuracy and reduces inference latency. Compared to the strongest selective
retrieval baseline, SR-RAG reduces the number of retrievals by 29% while
improving performance by 5.1%.
[LINK]
http://arxiv.org/abs/2504.01018v2
[DATE]
2025-10-07 15:44:04+08:00
[CATEGORIES]
cs.CL
The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP
[AUTHORS]
Sheriff Issaka, Keyi Wang, Yinka Ajibola, Oluwatumininu Samuel-Ipaye, Zhaoyi Zhang, Nicte Aguillon Jimenez, Evans Kofi Agyei, Abraham Lin, Rohan Ramachandran, Sadick Abdul Mumin, Faith Nchifor, Mohammed Shuraim, Lieqi Liu, Erick Rosas Gonzalez, Sylvester Kpei, Jemimah Osei, Carlene Ajeneza, Persis Boateng, Prisca Adwoa Dufie Yeboah, Saadia Gabriel
[ABSTRACT]
Despite representing nearly one-third of the world’s languages, African
languages remain critically underserved by modern NLP technologies, with 88%
classified as severely underrepresented or completely ignored in computational
linguistics. We present the African Languages Lab (All Lab), a comprehensive
research initiative that addresses this technological gap through systematic
data collection, model development, and capacity building. Our contributions
include: (1) a quality-controlled data collection pipeline, yielding the
largest validated African multi-modal speech and text dataset spanning 40
languages with 19 billion tokens of monolingual text and 12,628 hours of
aligned speech data; (2) extensive experimental validation demonstrating that
our dataset, combined with fine-tuning, achieves substantial improvements over
baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points
across 31 evaluated languages; and (3) a structured research program that has
successfully mentored fifteen early-career researchers, establishing
sustainable local capacity. Our comparative evaluation against Google Translate
reveals competitive performance in several languages while identifying areas
that require continued development.
[LINK]
http://arxiv.org/abs/2510.05644v1
[DATE]
2025-10-07 15:42:52+08:00
[CATEGORIES]
cs.CL
BenchAgents: Multi-Agent Systems for Structured Benchmark Creation
[AUTHORS]
Natasha Butt, Varun Chandrasekaran, Neel Joshi, Besmira Nushi, Vidhisha Balachandran
[ABSTRACT]
Evaluation insights are limited by the availability of high-quality
benchmarks. As models evolve, there is a need to create benchmarks that can
measure progress on new and complex generative capabilities. However, manually
creating new benchmarks is slow and expensive, restricting comprehensive
evaluations for any capability. We introduce BenchAgents, a multi-agent
framework that methodically leverages large language models (LLMs) to automate
evaluation benchmark creation while inherently ensuring data and (evaluation)
metric quality. BenchAgents decomposes the benchmark creation process into
planning, generation, verification, and evaluation, each of which is
orchestrated via LLM agents. These agents interact with each other and utilize
feedback from benchmark developers to improve and flexibly control data
diversity and quality. We use BenchAgents to create benchmarks to evaluate
capabilities related to planning, constraint satisfaction, and causal reasoning
spanning both language and vision modalities. We then use these benchmarks to
study state-of-the-art models and extract new insights into common failure
modes and model differences.
[LINK]
http://arxiv.org/abs/2410.22584v2
[DATE]
2025-10-07 15:17:31+08:00
[CATEGORIES]
cs.LG
cs.CL
Generative AI-Driven Hierarchical Multi-Agent Framework for Zero-Touch Optical Networks
[AUTHORS]
Yao Zhang, Yuchen Song, Shengnan Li, Yan Shi, Shikui Shen, Xiongyan Tang, Min Zhang, Danshi Wang
[ABSTRACT]
The rapid development of Generative Artificial Intelligence (GenAI) has
catalyzed a transformative technological revolution across all walks of life.
As the backbone of wideband communication, optical networks are expecting
high-level autonomous operation and zero-touch management to accommodate their
expanding network scales and escalating transmission bandwidth. The integration
of GenAI is deemed as the pivotal solution for realizing zero-touch optical
networks. However, the lifecycle management of optical networks involves a
multitude of tasks and necessitates seamless collaboration across multiple
layers, which poses significant challenges to the existing single-agent GenAI
systems. In this paper, we propose a GenAI-driven hierarchical multi-agent
framework designed to streamline multi-task autonomous execution for zero-touch
optical networks. We present the architecture, implementation, and applications
of this framework. A field-deployed mesh network is utilized to demonstrate
three typical scenarios throughout the lifecycle of an optical network: quality of
transmission estimation in the planning stage, dynamic channel adding/dropping
in the operation stage, and system capacity increase in the upgrade stage. The
case studies illustrate the capabilities of the multi-agent framework in
multi-task allocation, coordination, execution, evaluation, and summarization.
This work provides a promising approach for the future development of
intelligent, efficient, and collaborative network management solutions, paving
the way for more specialized and adaptive zero-touch optical networks.
[COMMENTS]
7 pages, 6 figures, Accepted by IEEE Communications Magazine, Open
call
[LINK]
http://arxiv.org/abs/2510.05625v1
[DATE]
2025-10-07 15:12:52+08:00
[CATEGORIES]
cs.CL
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
[AUTHORS]
Varun Gumma, Ananditha Raghunath, Mohit Jain, Sunayana Sitaram
[ABSTRACT]
Assessing the capabilities and limitations of large language models (LLMs)
has garnered significant interest, yet the evaluation of multiple models in
real-world scenarios remains rare. Multilingual evaluation often relies on
translated benchmarks, which typically do not capture linguistic and cultural
nuances present in the source language. This study provides an extensive
assessment of 24 LLMs on real-world data collected from Indian patients
interacting with a medical chatbot in Indian English and 4 other Indic
languages. We employ a uniform Retrieval Augmented Generation framework to
generate responses, which are evaluated using both automated techniques and
human evaluators on four specific metrics relevant to our application. We find
that models vary significantly in their performance and that instruction-tuned
Indic models do not always perform well on Indic language queries. Further, we
empirically show that factual correctness is generally lower for responses to
Indic queries compared to English queries. Finally, our qualitative work shows
that code-mixed and culturally relevant queries in our dataset pose challenges
to evaluated models.
[LINK]
http://arxiv.org/abs/2410.13671v2
[DATE]
2025-10-07 14:47:50+08:00
[CATEGORIES]
cs.CL
FormulaReasoning: A Dataset for Formula-Based Numerical Reasoning
[AUTHORS]
Xiao Li, Bolin Zhu, Kaiwen Shi, Sichen Liu, Yin Zhu, Yiwei Liu, Gong Cheng
[ABSTRACT]
The application of formulas (e.g., physics formulas) is a fundamental human
ability in solving numerical reasoning problems. Existing numerical reasoning
datasets rarely explicitly state the formulas employed, as their questions
often rely on implicit commonsense mathematical knowledge. To address this gap,
we introduce FormulaReasoning, a new dataset specifically designed for
formula-based numerical reasoning. It consists of 5,324 questions that require
numerical calculations grounded in external physics formulas. We provide
normalized, fine-grained annotations in both English and Chinese, including
formula structures, parameter names, symbols, numerical values, and
units, curated through extensive manual effort with LLM-assisted validation to
ensure high quality. Additionally, we offer a consolidated formula database to
serve as an external knowledge source. We analyze various reasoning approaches
on FormulaReasoning, with emphasis on comparative evaluation of different
architectural and methodological frameworks. Our assessment includes
retrieval-augmented methods, approaches that decompose reasoning into formula
generation, parameter extraction, and numerical calculation, as well as
optimization techniques using preference data. We identify key challenges in
formula-based numerical reasoning that require further investigation across
different reasoning paradigms, highlighting opportunities for methodological
advancement.
[LINK]
http://arxiv.org/abs/2402.12692v6
[DATE]
2025-10-07 14:36:36+08:00
[CATEGORIES]
cs.CL
Improving Chain-of-Thought Efficiency for Autoregressive Image Generation
[AUTHORS]
Zeqi Gu, Markos Georgopoulos, Xiaoliang Dai, Marjan Ghazvininejad, Chu Wang, Felix Juefei-Xu, Kunpeng Li, Yujun Shi, Zecheng He, Zijian He, Jiawei Zhou, Abe Davis, Jialiang Wang
[ABSTRACT]
Autoregressive multimodal large language models have recently gained
popularity for image generation, driven by advances in foundation models. To
enhance alignment and detail, newer approaches employ chain-of-thought (CoT)
reasoning, expanding user inputs into elaborated prompts prior to image
synthesis. However, this strategy can introduce unnecessary redundancy – a
phenomenon we call visual overthinking – which increases computational costs
and can introduce details that contradict the original prompt. In this work, we
explore how to generate more concise CoT sequences for more efficient image
generation. We introduce ShortCoTI, a lightweight optimization framework that
encourages more concise CoT while preserving output image quality. ShortCoTI
rewards more concise prompts with an adaptive function that scales according to
an estimated difficulty for each task. Incorporating this reward into a
reinforcement learning paradigm reduces prompt reasoning length by 54% while
maintaining or slightly improving quality metrics across multiple benchmarks
(T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates
verbose explanations and repetitive refinements, producing reasoning prompts
that are both concise and semantically rich. As a result, ShortCoTI improves
computational efficiency without compromising the fidelity or visual appeal of
generated images.
[LINK]
http://arxiv.org/abs/2510.05593v1
[DATE]
2025-10-07 13:40:43+08:00
[CATEGORIES]
cs.CL
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use
[AUTHORS]
Zhuofeng Li, Haoxiang Zhang, Seungju Han, Sheng Liu, Jianwen Xie, Yu Zhang, Yejin Choi, James Zou, Pan Lu
[ABSTRACT]
Outcome-driven reinforcement learning has advanced reasoning in large
language models (LLMs), but prevailing tool-augmented approaches train a
single, monolithic policy that interleaves thoughts and tool calls under full
context; this scales poorly with long horizons and diverse tools and
generalizes weakly to new scenarios. Agentic systems offer a promising
alternative by decomposing work across specialized modules, yet most remain
training-free or rely on offline training decoupled from the live dynamics of
multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow
agentic framework that coordinates four modules (planner, executor, verifier,
generator) through an evolving memory and directly optimizes its planner inside
the multi-turn loop. To train on-policy in live environments, we propose
Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles
long-horizon, sparse-reward credit assignment by converting multi-turn
optimization into a sequence of tractable single-turn policy updates. It
broadcasts a single, verifiable trajectory-level outcome to every turn to align
local planner decisions with global success and stabilizes learning with
group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale
backbone outperforms top-performing baselines with average accuracy gains of
14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on
scientific tasks, even surpassing larger proprietary models like GPT-4o.
Further analyses confirm the benefits of in-the-flow optimization, showing
improved planning, enhanced tool-calling reliability, and positive scaling with
model size and reasoning turns.
[COMMENTS]
45 pages, 12 figures. Project website:
https://agentflow.stanford.edu/
[LINK]
http://arxiv.org/abs/2510.05592v1
[DATE]
2025-10-07 13:32:44+08:00
[CATEGORIES]
cs.CL
cs.LG
AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives
[AUTHORS]
Khalid Mehtab Khan, Anagha Kulkarni
[ABSTRACT]
Identifying cultural capital (CC) themes in student reflections can offer
valuable insights that help foster equitable learning environments in
classrooms. However, themes such as aspirational goals or family support are
often woven into narratives, rather than appearing as direct keywords. This
makes them difficult to detect for standard NLP models that process sentences
in isolation. The core challenge stems from a lack of awareness, as standard
models are pre-trained on general corpora, leaving them blind to the
domain-specific language and narrative context inherent to the data. To address
this, we introduce AWARE, a framework that systematically attempts to improve a
transformer model’s awareness for this nuanced task. AWARE has three core
components: 1) Domain Awareness, adapting the model’s vocabulary to the
linguistic style of student reflections; 2) Context Awareness, generating
sentence embeddings that are aware of the full essay context; and 3) Class
Overlap Awareness, employing a multi-label strategy to recognize the
coexistence of themes in a single sentence. Our results show that by making the
model explicitly aware of the properties of the input, AWARE outperforms a
strong baseline by 2.1 percentage points in Macro-F1 and shows considerable
improvements across all themes. This work provides a robust and generalizable
methodology for any text classification task in which meaning depends on the
context of the narrative.
[LINK]
http://arxiv.org/abs/2510.04983v2
[DATE]
2025-10-07 13:08:19+08:00
[CATEGORIES]
cs.CL
cs.LG
Evaluating and Mitigating Social Bias for Large Language Models in Open-ended Settings
[AUTHORS]
Zhao Liu, Tian Xie, Xueru Zhang
[ABSTRACT]
Current social bias benchmarks for Large Language Models (LLMs) primarily
rely on predefined question formats like multiple-choice, limiting their
ability to reflect the complexity and open-ended nature of real-world
interactions. To close this gap, we extend the existing BBQ dataset (Parrish et
al., 2022) to Open-BBQ, a comprehensive framework to evaluate the social bias
of LLMs in open-ended settings by incorporating two additional question
categories: fill-in-the-blank and short-answer. Since our new Open-BBQ dataset
contains a lot of open-ended responses like sentences and paragraphs, we
developed an evaluation process to detect biases from open-ended content by
labeling sentences and paragraphs. In addition to this, we also found that
existing debiasing methods, such as self-debiasing (Gallegos et al., 2024),
have over-correction issues, which make the original correct answers incorrect.
In order to solve this issue, we propose Composite Prompting, an In-context
Learning (ICL) method combining structured examples with explicit
chain-of-thought reasoning to form a unified instruction template for LLMs to
explicitly identify content that needs debiasing. Experimental results show
that the proposed method significantly reduces the bias for both GPT-3.5 and
GPT-4o while maintaining high accuracy.
[COMMENTS]
15 pages
[LINK]
http://arxiv.org/abs/2412.06134v3
[DATE]
2025-10-07 13:04:25+08:00
[CATEGORIES]
cs.CL
Robustness of Large Language Models to Perturbations in Text
[AUTHORS]
Ayush Singh, Navpreet Singh, Shubham Vatsal
[ABSTRACT]
Having a clean dataset has been the foundational assumption of most natural
language processing (NLP) systems. However, properly written text is rarely
found in real-world scenarios and hence, oftentimes invalidates the
aforementioned foundational assumption. Recently, large language models (LLMs)
have shown impressive performance, but can they handle the inevitable noise in
real-world data? This work tackles this critical question by investigating
LLMs’ resilience against morphological variations in text. To that end, we
artificially introduce varying levels of noise into a diverse set of datasets
and systematically evaluate LLMs’ robustness against the corrupt variations of
the original text. Our findings show that contrary to popular beliefs,
generative LLMs are quite robust to noisy perturbations in text. This is a
departure from pre-trained models like BERT or RoBERTa whose performance has
been shown to be sensitive to deteriorating noisy text. Additionally, we test
LLMs’ resilience on multiple real-world benchmarks that closely mimic commonly
found errors in the wild. With minimal prompting, LLMs achieve a new
state-of-the-art on the benchmark tasks of Grammar Error Correction (GEC) and
Lexical Semantic Change (LSC). To empower future research, we also release a
dataset annotated by humans stating their preference for LLM vs.
human-corrected outputs along with the code to reproduce our results.
[COMMENTS]
8 pages, 1 figure, 6 tables, updated with results also from GPT-4,
LLaMa-3
[LINK]
http://arxiv.org/abs/2407.08989v2
[DATE]
2025-10-07 12:50:05+08:00
[CATEGORIES]
cs.CL
Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs
[AUTHORS]
Dong Yan, Gaochen Wu, Bowen Zhou
[ABSTRACT]
Recent advancements in language agents have led to significant improvements
in multi-hop reasoning tasks. However, because they rely on a fixed sequence of
actions, existing approaches often struggle with open-domain problems, which
require massive information retrieval. To address this, we propose
Feedback-Guided Dynamic Interactive Planning (FGDIP), a novel framework
tailored to enhance reasoning in LLMs by utilizing dynamic and adaptive
strategies for information exploration in open-domain multi-hop reasoning
tasks. Our approach begins by identifying key entities relevant to the problem,
which serve as the initial nodes in the reasoning process. From these initial
nodes, we then generate reasoning child nodes with the process being refined
through a combination of historical error analysis and real-time feedback,
which allows the framework to dynamically adjust and optimize its reasoning
strategies. By integrating depth-first search with an innovative node
generation technique, our framework adapts based on both prior error paths and
concurrently generated nodes at the same hierarchical level. This dynamic
strategy effectively expands the search space while ensuring the reasoning
process systematically converges toward accurate solutions. Experimental
results show that FGDIP achieved up to 54.47% F1 score on the HotpotQA dataset
and 70.05% on the StrategyQA dataset, surpassing the best baseline by 5.03% and
7.25% respectively, highlighting its versatility and potential to enhance
language agents in multi-hop reasoning tasks.
[LINK]
http://arxiv.org/abs/2510.05577v1
[DATE]
2025-10-07 12:46:58+08:00
[CATEGORIES]
cs.CL
MediSyn: A Generalist Text-Guided Latent Diffusion Model For Diverse Medical Image Synthesis
[AUTHORS]
Joseph Cho, Mrudang Mathur, Cyril Zakka, Dhamanpreet Kaur, Matthew Leipzig, Alex Dalal, Aravind Krishnan, Eubee Koo, Karen Wai, Cindy S. Zhao, Akshay Chaudhari, Matthew Duda, Ashley Choi, Ehsan Rahimy, Lyna Azzouz, Robyn Fong, Rohan Shad, William Hiesinger
[ABSTRACT]
Deep learning algorithms require extensive data to achieve robust
performance. However, data availability is often restricted in the medical
domain due to patient privacy concerns. Synthetic data presents a possible
solution to these challenges. Recently, image generative models have found
increasing use for medical applications but are often designed for singular
medical specialties and imaging modalities, thus limiting their broader
utility. To address this, we introduce MediSyn: a text-guided, latent diffusion
model capable of generating synthetic images from 6 medical specialties and 10
image types. Through extensive experimentation, we first demonstrate that
MediSyn quantitatively matches or surpasses the performance of specialist
models. Second, we show that our synthetic images are realistic and exhibit
strong alignment with their corresponding text prompts, as validated by a team
of expert physicians. Third, we provide empirical evidence that our synthetic
images are visually distinct from their corresponding real patient images.
Finally, we demonstrate that in data-limited settings, classifiers trained
solely on synthetic data or real data supplemented with synthetic data can
outperform those trained solely on real data. Our findings highlight the
immense potential of generalist image generative models to accelerate
algorithmic research and development in medicine.
[LINK]
http://arxiv.org/abs/2405.09806v6
[DATE]
2025-10-07 12:46:25+08:00
[CATEGORIES]
cs.CL
cs.LG
Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning
[AUTHORS]
Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao
[ABSTRACT]
Retrieval-augmented generation (RAG) is widely utilized to incorporate
external knowledge into large language models, thereby enhancing factuality and
reducing hallucinations in question-answering (QA) tasks. A standard RAG
pipeline consists of several components, such as query rewriting, document
retrieval, document filtering, and answer generation. However, these components
are typically optimized separately through supervised fine-tuning, which can
lead to misalignments between the objectives of individual components and the
overarching aim of generating accurate answers. Although recent efforts have
explored using reinforcement learning (RL) to optimize specific RAG components,
these approaches often focus on simple pipelines with only two components or do
not adequately address the complex interdependencies and collaborative
interactions among the modules. To overcome these limitations, we propose
treating the complex RAG pipeline with multiple components as a multi-agent
cooperative task, in which each component can be regarded as an RL agent.
Specifically, we present MMOA-RAG, Multi-Module joint Optimization Algorithm
for RAG, which employs multi-agent reinforcement learning to harmonize all
agents’ goals toward a unified reward, such as the F1 score of the final
answer. Experiments conducted on various QA benchmarks demonstrate that
MMOA-RAG effectively boosts the overall performance of the pipeline and
outperforms existing baselines. Furthermore, comprehensive ablation studies
validate the contributions of individual components and demonstrate MMOA-RAG
can be adapted to different RAG pipelines and benchmarks.
[COMMENTS]
NeurIPS 2025
[LINK]
http://arxiv.org/abs/2501.15228v2
[DATE]
2025-10-07 12:38:36+08:00
[CATEGORIES]
cs.CL
Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations
[AUTHORS]
Chengzhi Liu, Yuzhe Yang, Kaiwen Zhou, Zhen Zhang, Yue Fan, Yannan Xie, Peng Qi, Xin Eric Wang
[ABSTRACT]
The promotion of academic papers has become an important means of enhancing
research visibility. However, existing automated methods struggle with limited
storytelling, insufficient aesthetic quality, and constrained self-adjustment,
making it difficult to achieve efficient and engaging dissemination. At the
heart of those challenges is a simple principle: "there is no way to
improve it when you cannot evaluate it right". To address this, we introduce
EvoPresent, a self-improvement agent framework that unifies coherent
narratives, aesthetic-aware designs, and realistic presentation delivery via
virtual characters. Central to EvoPresent is PresAesth, a multi-task
reinforcement learning (RL) aesthetic model that provides reliable aesthetic
scoring, defect adjustment, and comparative feedback, enabling iterative
self-improvement even under limited aesthetic training data. To systematically
evaluate the methods, we introduce EvoPresent Benchmark, a
comprehensive benchmark comprising: Presentation Generation Quality,
built on 650 top-tier AI conference papers with multimodal resources (slides,
videos and scripts) to assess both content and design; and Aesthetic
Awareness, consisting of 2,000 slide pairs with varying aesthetic levels,
supporting joint training and evaluation on scoring, defect adjustment, and
comparison. Our findings highlight that (i) High-quality feedback is essential
for agent self-improvement, while initial capability alone does not guarantee
effective self-correction. (ii) Automated generation pipelines exhibit a
trade-off between visual design and content construction. (iii) Multi-task RL
training shows stronger generalization in aesthetic awareness tasks.
[LINK]
http://arxiv.org/abs/2510.05571v1
[DATE]
2025-10-07 12:24:26+08:00
[CATEGORIES]
cs.CL
Domain-Shift-Aware Conformal Prediction for Large Language Models
[AUTHORS]
Zhexiao Lin, Yuanyuan Li, Neeraj Sarna, Yuanyuan Gao, Michael von Gablenz
[ABSTRACT]
Large language models have achieved impressive performance across diverse
tasks. However, their tendency to produce overconfident and factually incorrect
outputs, known as hallucinations, poses risks in real-world applications.
Conformal prediction provides finite-sample, distribution-free coverage
guarantees, but standard conformal prediction breaks down under domain shift,
often leading to under-coverage and unreliable prediction sets. We propose a
new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our
framework adapts conformal prediction to large language models under domain
shift, by systematically reweighting calibration samples based on their
proximity to the test prompt, thereby preserving validity while enhancing
adaptivity. Our theoretical analysis and experiments on the MMLU benchmark
demonstrate that the proposed method delivers more reliable coverage than
standard conformal prediction, especially under substantial distribution
shifts, while maintaining efficiency. This provides a practical step toward
trustworthy uncertainty quantification for large language models in real-world
deployment.
[COMMENTS]
26 pages
[LINK]
http://arxiv.org/abs/2510.05566v1
[DATE]
2025-10-07 12:22:06+08:00
[CATEGORIES]
cs.CL
cs.LG
ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding
[AUTHORS]
Yifan Wu, Lutao Yan, Leixian Shen, Yinan Mei, Jiannan Wang, Yuyu Luo
[ABSTRACT]
The emergence of Multi-modal Large Language Models (MLLMs) presents new
opportunities for chart understanding. However, due to the fine-grained nature
of these tasks, applying MLLMs typically requires large, high-quality datasets
for task-specific fine-tuning, leading to high data collection and training
costs. To address this, we propose ChartCards, a unified chart-metadata
generation framework for multi-task chart understanding. ChartCards
systematically synthesizes various chart information, including data tables,
visualization code, visual elements, and multi-dimensional semantic captions.
By structuring this information into organized metadata, ChartCards enables a
single chart to support multiple downstream tasks, such as text-to-chart
retrieval, chart summarization, chart-to-table conversion, chart description,
and chart question answering. Using ChartCards, we further construct MetaChart,
a large-scale high-quality dataset containing 10,862 data tables, 85K charts,
and 170K high-quality chart captions. We validate the dataset through
qualitative crowdsourcing evaluations and quantitative fine-tuning experiments
across various chart understanding tasks. Fine-tuning six different models on
MetaChart resulted in an average performance improvement of 5% across all
tasks. The most notable improvements are seen in text-to-chart retrieval and
chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements
of 17% and 28%, respectively.
[COMMENTS]
Need to be revised
[LINK]
http://arxiv.org/abs/2505.15046v3
[DATE]
2025-10-07 12:20:59+08:00
[CATEGORIES]
cs.CL
Assessing Algorithmic Bias in Language-Based Depression Detection: A Comparison of DNN and LLM Approaches
[AUTHORS]
Obed Junias, Prajakta Kini, Theodora Chaspari
[ABSTRACT]
This paper investigates algorithmic bias in language-based models for
automated depression detection, focusing on socio-demographic disparities
related to gender and race/ethnicity. Models trained using deep neural network
(DNN)-based embeddings are compared to few-shot learning approaches with large
language models (LLMs), evaluating both performance and fairness on clinical
interview transcripts from the Distress Analysis Interview Corpus/Wizard-of-Oz
(DAIC-WOZ). To mitigate bias, fairness-aware loss functions are applied to
DNN-based models, while in-context learning with varied prompt framing and shot
counts is explored for LLMs. Results indicate that LLMs outperform DNN-based
models in depression classification, particularly for underrepresented groups
such as Hispanic participants. LLMs also exhibit reduced gender bias compared
to DNN-based embeddings, though racial disparities persist. Among
fairness-aware techniques for mitigating bias in DNN-based embeddings, the
worst-group loss, which is designed to minimize loss for the worst-performing
demographic group, achieves a better balance between performance and fairness.
In contrast, the fairness-regularized loss minimizes loss across all groups but
performs less effectively. In LLMs, guided prompting with ethical framing helps
mitigate gender bias in the 1-shot setting. However, increasing the number of
shots does not lead to further reductions in disparities. For race/ethnicity,
neither prompting strategy nor increasing $N$ in $N$-shot learning effectively
reduces disparities.
[COMMENTS]
7 pages, 1 figure. This paper has been accepted to the IEEE-EMBS
International Conference on Biomedical and Health Informatics (BHI 2025),
Georgia Institute of Technology, Atlanta, Georgia, October 26-29, 2025
[LINK]
http://arxiv.org/abs/2509.25795v2
[DATE]
2025-10-07 12:20:55+08:00
[CATEGORIES]
cs.CL
AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials
[AUTHORS]
Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Bram Hoex, Zhicheng Zhong, Tong Xie
[ABSTRACT]
Large Language Models (LLMs) excel at textual reasoning and are beginning to
develop spatial understanding, prompting the question of whether these
abilities can be combined for complex, domain-specific tasks. This question is
essential in fields like materials science, where deep understanding of 3D
atomic structures is fundamental. While initial studies have successfully
applied LLMs to tasks involving pure crystal generation or coordinate
understanding, a standardized benchmark to systematically evaluate their core
reasoning abilities across diverse atomic structures has been notably absent.
To address this gap, we introduce the AtomWorld benchmark to evaluate LLMs on
tasks based on Crystallographic Information Files (CIFs), a standard structure
representation format. These tasks, including structural editing, CIF
perception, and property-guided modeling, reveal a critical limitation: current
models, despite establishing promising baselines, consistently fail in
structural understanding and spatial reasoning. Our experiments show that these
models make frequent errors on structure modification tasks, and even in
basic CIF format understanding, potentially leading to cumulative errors in
subsequent analysis and materials insights. By defining these standardized
tasks, AtomWorld lays the groundwork for advancing LLMs toward robust atomic-scale
modeling, crucial for accelerating materials research and automating scientific
workflows.
[LINK]
http://arxiv.org/abs/2510.04704v2
[DATE]
2025-10-07 12:08:44+08:00
[CATEGORIES]
cs.CL
LLM Unlearning Without an Expert Curated Dataset
[AUTHORS]
Xiaoyuan Zhu, Muru Zhang, Ollie Liu, Robin Jia, Willie Neiswanger
[ABSTRACT]
Modern large language models often encode sensitive, harmful, or copyrighted
knowledge, raising the need for post-hoc unlearning: the ability to remove
specific domains of knowledge from a model without full retraining. A major
bottleneck in current unlearning pipelines is constructing effective forget
sets, i.e., datasets that approximate the target domain and guide the model to forget
it. In this work, we introduce a scalable, automated approach to generate
high-quality forget sets using language models themselves. Our method
synthesizes textbook-style data through a structured prompting pipeline,
requiring only a domain name as input. Through experiments on unlearning
biosecurity, cybersecurity, and Harry Potter novels, we show that our synthetic
datasets consistently outperform the baseline synthetic alternatives and are
comparable to the expert-curated ones. Additionally, ablation studies reveal
that the multi-step generation pipeline significantly boosts data diversity,
which in turn improves unlearning utility. Overall, our findings suggest that
synthetic datasets offer a promising path toward practical, scalable unlearning
for a wide range of emerging domains without the need for manual intervention.
We release our code and dataset at
https://github.com/xyzhu123/Synthetic_Textbook.
[LINK]
http://arxiv.org/abs/2508.06595v3
[DATE]
2025-10-07 11:52:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time
[AUTHORS]
Daniel Tan, Anders Woodruff, Niels Warncke, Arun Jose, Maxime Riché, David Demitri Africa, Mia Taylor
[COMMENTS]
40 pages, 22 figures In proceedings at ICLR 2026
[LINK]
http://arxiv.org/abs/2510.04340v2
[DATE]
2025-10-07 11:52:12+08:00
[CATEGORIES]
cs.CL
Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
[AUTHORS]
Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
[ABSTRACT]
Video Large Multimodal Models (VLMMs) have made impressive strides in
understanding video content, but they often struggle with abstract and adaptive
reasoning, that is, the ability to revise their interpretations when new information
emerges. In reality, conclusions are rarely set in stone; additional context
can strengthen or weaken an initial inference. To address this, we introduce
Defeasible Video Entailment (DVidE), a new task that challenges models to think
like doubters, constantly updating their reasoning based on evolving evidence.
In DVidE, given a video premise and a textual hypothesis, models must determine
whether a new update strengthens or weakens the hypothesis (classification
version) or generate a coherent update that modifies the entailment
relationship (generation version). For solving the classification task, we
propose the Chain of Counterfactual Thought framework, utilizing counterfactual
reasoning, ASR-enhanced video content, and rationale refinement to reduce
inference bias. For the generation task, we develop a framework that combines
ASR output with a Large Language Model (LLM) to produce coherent, contextually
relevant updates aligned with the intended strengthener or weakener goals.
Additionally, we introduce a novel benchmark dataset, with
strengthener/weakener annotations and an LLM-based evaluation metric
specifically designed for assessing generative performance. Experimental
results demonstrate significant improvements, highlighting the effectiveness of
our proposed method in enhancing dynamic reasoning capabilities of VLMMs.
[LINK]
http://arxiv.org/abs/2506.22385v2
[DATE]
2025-10-07 11:47:19+08:00
[CATEGORIES]
cs.CL
GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing
[AUTHORS]
Silan Hu, Shiqi Zhang, Yimin Shi, Xiaokui Xiao
[ABSTRACT]
Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing
generative engines, such as LLM-based chatbots, by seamlessly integrating
relevant advertisements into their responses. At the core of GEM lies the
generation and evaluation of ad-injected responses. However, existing
benchmarks are not specifically designed for this purpose, which limits future
research. To address this gap, we propose GEM-Bench, the first comprehensive
benchmark for ad-injected response generation in GEM. GEM-Bench includes three
curated datasets covering both chatbot and search scenarios, a metric ontology
that captures multiple dimensions of user satisfaction and engagement, and
several baseline solutions implemented within an extensible multi-agent
framework. Our preliminary results indicate that, while simple prompt-based
methods achieve reasonable engagement such as click-through rate, they often
reduce user satisfaction. In contrast, approaches that insert ads based on
pre-generated ad-free responses help mitigate this issue but introduce
additional overhead. These findings highlight the need for future research on
designing more effective and efficient solutions for generating ad-injected
responses in GEM. The benchmark and all related resources are publicly
available at https://gem-bench.org/.
[COMMENTS]
Include more experimental results and supplementary materials
[LINK]
http://arxiv.org/abs/2509.14221v2
[DATE]
2025-10-07 11:29:20+08:00
[CATEGORIES]
cs.CL
PLSemanticsBench: Large Language Models As Programming Language Interpreters
[AUTHORS]
Aditya Thimmaiah, Jiyang Zhang, Jayanth Srinivasa, Junyi Jessy Li, Milos Gligoric
[ABSTRACT]
As large language models (LLMs) excel at code reasoning, a natural question
arises: can an LLM execute programs (i.e., act as an interpreter) purely based
on a programming language’s formal semantics? If so, it will enable rapid
prototyping of new programming languages and language features. We study this
question using the imperative language IMP (a subset of C), formalized via
small-step operational semantics (SOS) and rewriting-based operational
semantics (K-semantics). We introduce three evaluation sets (Human-Written,
LLM-Translated, and Fuzzer-Generated) whose difficulty is controlled by
code-complexity metrics spanning the size, control-flow, and data-flow axes.
Given a program and its semantics formalized with SOS/K-semantics, models are
evaluated on three tasks ranging from coarse to fine: (1) final-state
prediction, (2) semantic rule prediction, and (3) execution trace prediction.
To distinguish pretraining memorization from semantic competence, we define two
nonstandard semantics obtained through systematic mutations of the standard
rules. Across strong code/reasoning LLMs, performance drops under nonstandard
semantics despite high performance under the standard one. We further find that
(i) there are patterns to different model failures, (ii) most reasoning models
perform exceptionally well on coarse-grained tasks involving reasoning about
highly complex programs often containing nested loop depths beyond five, and
surprisingly, (iii) providing formal semantics helps on simple programs but
often hurts on more complex ones. Overall, the results show promise that LLMs
could serve as programming language interpreters, but point to their lack of
robust semantic understanding. We release the benchmark and the
supporting code at https://github.com/EngineeringSoftware/PLSemanticsBench.
[LINK]
http://arxiv.org/abs/2510.03415v2
[DATE]
2025-10-07 11:28:52+08:00
[CATEGORIES]
cs.CL
Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM
[AUTHORS]
Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
[ABSTRACT]
Large language models (LLMs) and vision-language models (VLMs) have achieved
state-of-the-art performance, but they impose significant memory and computing
challenges in deployment. We present a novel low-rank compression framework to
address this challenge. First, we upper bound the change of network loss via
layer-wise activation-based compression errors, filling a theoretical gap in
the literature. We then formulate low-rank model compression as a bi-objective
optimization and prove that a single uniform tolerance yields surrogate
Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we
propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot
pipeline that improves activation-aware compression via Pareto-guided rank
selection and alternating least-squares implementation. We apply PGSVD to both
LLMs and VLMs, showing better accuracy at the same compression levels and
inference speedup.
[LINK]
http://arxiv.org/abs/2510.05544v1
[DATE]
2025-10-07 11:07:47+08:00
[CATEGORIES]
cs.CL
cs.LG
H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference
[AUTHORS]
Harshil Vejendla
[ABSTRACT]
Autoregressive decoding in large language models (LLMs) requires caching a
growing list of past key-value (KV) pairs, making long-context inference a
memory-bound problem. While recent methods have explored quantizing the cache,
evicting tokens, or using binary sketches for keys (e.g., Loki), these
approaches often provide an incomplete solution by leaving one component (like
values) uncompressed or by discarding context information. This paper
introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression
scheme that radically reduces memory usage without sacrificing context. H1B-KV
represents each key vector using a 1-bit binary sketch, enabling
hardware-friendly bitwise attention, and further compresses value vectors using
4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter
LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x
reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches
full-precision performance not only on perplexity benchmarks but also on
complex downstream tasks like mathematical reasoning (GSM8K), multi-task
understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV
significantly outperforms leading quantization (KIVI), token eviction
(SparseLLM), and key-only sketching (Loki) methods in quality-per-byte,
establishing it as a robust solution for deploying LLMs in memory-constrained
environments.
[COMMENTS]
MIT URTC 2025 Technical Paper (Oral), 5 pages, 1 figure
[LINK]
http://arxiv.org/abs/2510.05529v1
[DATE]
2025-10-07 10:39:35+08:00
[CATEGORIES]
cs.CL
cs.LG
KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance
[AUTHORS]
Kuangshi Ai, Jonathan A. Karr Jr, Meng Jiang, Nitesh V. Chawla, Chaoli Wang
[ABSTRACT]
We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge
extraction and reasoning framework with large language models (LLMs) in
safety-critical contexts. Using the Operations and Maintenance Intelligence
(OMIn) dataset, we construct a QA benchmark spanning global sensemaking and
actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and
integrates it into a retrieval-augmented generation (RAG) pipeline, enabling
more coherent, dataset-wide reasoning than traditional text-chunk RAG. We
evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ
stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO
markedly improves global sensemaking by revealing patterns and system-level
insights, while text-chunk RAG remains effective for fine-grained procedural
tasks requiring localized retrieval. These findings underscore the promise of
KG-augmented LLMs for secure, domain-specific QA and their potential in
high-stakes reasoning.
[LINK]
http://arxiv.org/abs/2510.05524v1
[DATE]
2025-10-07 10:29:13+08:00
[CATEGORIES]
cs.CL
COLE: a Comprehensive Benchmark for French Language Understanding Evaluation
[AUTHORS]
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
[COMMENTS]
Submitted to ACL Rolling Review of October
[LINK]
http://arxiv.org/abs/2510.05046v2
[DATE]
2025-10-07 10:23:31+08:00
[CATEGORIES]
cs.CL
CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension
[AUTHORS]
Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, Ruiming Tang
[ABSTRACT]
Current Large Language Models (LLMs) are confronted with overwhelming
information volume when comprehending long-form documents. This challenge
raises the imperative of a cohesive memory module, which can elevate vanilla
LLMs into autonomous reading agents. Despite the emergence of some heuristic
approaches, a systematic design principle remains absent. To fill this void, we
draw inspiration from Jean Piaget’s Constructivist Theory, illuminating three
traits of the agentic memory – structured schemata, flexible assimilation, and
dynamic accommodation. This blueprint forges a clear path toward a more robust
and efficient memory system for LLM-based reading comprehension. To this end,
we develop CAM, a prototype implementation of Constructivist Agentic Memory
that simultaneously embodies the structurality, flexibility, and dynamicity. At
its core, CAM is endowed with an incremental overlapping clustering algorithm
for structured memory development, supporting both coherent hierarchical
summarization and online batch integration. During inference, CAM adaptively
explores the memory structure to activate query-relevant information for
contextual response, akin to the human associative process. Compared to
existing approaches, our design demonstrates dual advantages in both
performance and efficiency across diverse long-text reading comprehension
tasks, including question answering, query-based summarization, and claim
verification.
[COMMENTS]
Accepted by NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.05520v1
[DATE]
2025-10-07 10:16:30+08:00
[CATEGORIES]
cs.CL
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
[AUTHORS]
Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
[ABSTRACT]
Artificial intelligence systems based on large language models (LLMs) are
increasingly used as agents that interact with users and with the world. To do
so successfully, LLMs need to construct internal representations of the world
and form probabilistic beliefs about those representations. To provide a user
with personalized recommendations, for example, the LLM needs to gradually
infer the user’s preferences, over the course of multiple interactions. To
evaluate whether contemporary LLMs are able to do so, we use the Bayesian
inference framework from probability theory, which lays out the optimal way to
update an agent’s beliefs as it receives new information. We first show that
LLMs do not update their beliefs as expected from the Bayesian framework, and
that consequently their predictions do not improve as expected as more
information becomes available. To address this issue, we teach the LLMs to
reason in a Bayesian manner by training them to mimic the predictions of the
normative Bayesian model. We find that this approach not only significantly
improves the LLM’s performance on the particular recommendation task it is
trained on, but also enables generalization to other tasks. This suggests that
this method teaches the LLM to better approximate Bayesian reasoning. More
generally, our results indicate that LLMs can effectively learn reasoning
skills from examples and generalize those skills to new domains.
[LINK]
http://arxiv.org/abs/2503.17523v2
[DATE]
2025-10-07 09:59:17+08:00
[CATEGORIES]
cs.CL
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists
[AUTHORS]
Jie Ruan, Inderjeet Nair, Shuyang Cao, Amy Liu, Sheza Munir, Micah Pollens-Dempsey, Tiffany Chiang, Lucy Kates, Nicholas David, Sihan Chen, Ruxin Yang, Yuqian Yang, Jasmine Gump, Tessa Bialek, Vivek Sankaran, Margo Schlanger, Lu Wang
[LINK]
http://arxiv.org/abs/2506.01241v3
[DATE]
2025-10-07 09:39:45+08:00
[CATEGORIES]
cs.CL
Prototype-Based Dynamic Steering for Large Language Models
[AUTHORS]
Ceyhun Efe Kayan, Li Zhang
[ABSTRACT]
Despite impressive breadth, LLMs still rely on explicit reasoning
instructions or static, one-size-fits-all steering methods, leaving a gap for
adaptive, instruction-free reasoning amplification. We present Prototype-Based
Dynamic Steering (PDS), a test-time method that amplifies large language model
(LLM) reasoning without adding or altering instructions. We introduce
“reasoning prototypes” by clustering activation differences between
Chain-of-Thought (CoT) and neutral prompts. At inference, an input’s hidden
state is projected onto these prototypes to form an instance-specific steering
vector. Evaluated on GSM8K, AQuA-RAT, and BIG-Bench tasks, PDS consistently
improves accuracy without fine-tuning or prompt engineering. Notably, the gains
persist even when CoT is explicitly suppressed to improve cost-efficiency,
indicating that the intervention strengthens latent reasoning processes rather
than inducing a superficial behavioral shift. These results position dynamic,
prototype-guided steering as a lightweight alternative to training-time
approaches for enhancing LLM reasoning.
[LINK]
http://arxiv.org/abs/2510.05498v1
[DATE]
2025-10-07 09:34:28+08:00
[CATEGORIES]
cs.CL
Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions
[AUTHORS]
Yu-Ang Lee, Guan-Ting Yi, Mei-Yi Liu, Jui-Chao Lu, Guan-Bo Yang, Yun-Nung Chen
[ABSTRACT]
Recent advancements in large language models (LLMs) and AI systems have led
to a paradigm shift in the design and optimization of complex AI workflows. By
integrating multiple components, compound AI systems have become increasingly
adept at performing sophisticated tasks. However, as these systems grow in
complexity, new challenges arise in optimizing not only individual components
but also their interactions. While traditional optimization methods such as
supervised fine-tuning (SFT) and reinforcement learning (RL) remain
foundational, the rise of natural language feedback introduces promising new
approaches, especially for optimizing non-differentiable systems. This paper
provides a systematic review of recent progress in optimizing compound AI
systems, encompassing both numerical and language-based techniques. We
formalize the notion of compound AI system optimization, classify existing
methods along several key dimensions, and highlight open research challenges
and future directions in this rapidly evolving field. A list of surveyed papers
is publicly available at https://github.com/MiuLab/AISysOpt-Survey.
[COMMENTS]
Accepted to EMNLP 2025 (Main)
[LINK]
http://arxiv.org/abs/2506.08234v2
[DATE]
2025-10-07 09:23:00+08:00
[CATEGORIES]
cs.CL
NorMuon: Making Muon more efficient and scalable
[AUTHORS]
Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, Tuo Zhao
[ABSTRACT]
The choice of optimizer significantly impacts the training efficiency and
computational costs of large language models (LLMs). Recently, the Muon
optimizer has demonstrated promising results by orthogonalizing parameter
updates, improving optimization geometry through better conditioning. Despite
Muon’s emergence as a candidate successor to Adam, the potential for jointly
leveraging their strengths has not been systematically explored. In this work,
we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an
optimizer that synergistically combines orthogonalization with neuron-level
adaptive learning rates. Our analysis reveals that while Muon effectively
reduces condition numbers, the resulting updates exhibit highly non-uniform
neuron norms, causing certain neurons to dominate the optimization process.
NorMuon addresses this imbalance by maintaining second-order momentum
statistics for each neuron and applying row-wise normalization after
orthogonalization, ensuring balanced parameter utilization while preserving
Muon’s conditioning benefits. To enable practical deployment at scale, we
develop an efficient distributed implementation under the FSDP2 framework that
strategically distributes orthogonalization computations across devices.
Experiments across multiple model scales demonstrate that NorMuon consistently
outperforms both Adam and Muon, achieving 21.74% better training efficiency
than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting, while
maintaining a comparable memory footprint to Muon. Our findings suggest that
orthogonalization and adaptive learning rates are complementary rather than
competing approaches, opening new avenues for optimizer design in large-scale
deep learning.
[LINK]
http://arxiv.org/abs/2510.05491v1
[DATE]
2025-10-07 09:13:41+08:00
[CATEGORIES]
cs.LG
cs.CL
LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation
[AUTHORS]
Zhoutong Fu, Yihan Cao, Yi-Lin Chen, Aman Lunia, Liming Dong, Neha Saraf, Ruijie Jiang, Yun Dai, Qingquan Song, Tan Wang, Guoyao Li, Derek Koh, Haichao Wei, Zhipeng Wang, Aman Gupta, Chengming Jiang, Jianqiang Shen, Liangjie Hong, Wenjing Zhang
[ABSTRACT]
Large language models (LLMs) have achieved strong performance across a wide
range of natural language processing tasks. However, deploying LLMs at scale
for domain-specific applications, such as job-person fit and explanation in
job-seeking platforms, introduces distinct challenges. At LinkedIn, the
job-person fit task requires analyzing a candidate’s public profile against job
requirements to produce both a fit assessment and a detailed explanation.
Directly applying open-source or finetuned LLMs to this task often fails to
yield high-quality, actionable feedback due to the complexity of the domain and
the need for structured outputs. Moreover, the large size of these models leads
to high inference latency and limits scalability, making them unsuitable for
online use. To address these challenges, we introduce LANTERN, a novel LLM
knowledge distillation framework tailored specifically for job-person fit
tasks. LANTERN involves modeling over multiple objectives, an encoder model for
classification purposes, and a decoder model for explanation purposes. To better
distill the knowledge from a strong black-box teacher model to multiple
downstream models, LANTERN incorporates multi-level knowledge distillation that
integrates both data- and logit-level insights. In addition to introducing the
knowledge distillation framework, we share our insights on post-training
techniques and prompt engineering, both of which are crucial for successfully
adapting LLMs to domain-specific downstream tasks. Extensive experimental
results demonstrate that LANTERN significantly improves task-specific metrics
for both job-person fit and explanation. Online evaluations further confirm its
effectiveness, showing measurable gains in job seeker engagement, including a
0.24% increase in apply rate and a 0.28% increase in qualified applications.
[COMMENTS]
9 pages, 4 figures, 5 tables
[LINK]
http://arxiv.org/abs/2510.05490v1
[DATE]
2025-10-07 09:10:02+08:00
[CATEGORIES]
cs.CL
QAPyramid: Fine-grained Evaluation of Content Selection for Text Summarization
[AUTHORS]
Shiyue Zhang, David Wan, Arie Cattan, Ayal Klein, Ido Dagan, Mohit Bansal
[COMMENTS]
Accepted to COLM 2025. The first two authors contributed equally.
Code: https://github.com/ZhangShiyue/QAPyramid
[LINK]
http://arxiv.org/abs/2412.07096v2
[DATE]
2025-10-07 09:00:34+08:00
[CATEGORIES]
cs.CL
Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking
[AUTHORS]
Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma
[ABSTRACT]
Knowledge Graph Question Answering (KGQA) systems rely on high-quality
benchmarks to evaluate complex multi-hop reasoning. However, despite their
widespread use, popular datasets such as WebQSP and CWQ suffer from critical
quality issues, including inaccurate or incomplete ground-truth annotations,
poorly constructed questions that are ambiguous, trivial, or unanswerable, and
outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA
datasets, including WebQSP and CWQ, we find that the average factual
correctness rate is only 57%. To address these issues, we introduce KGQAGen,
an LLM-in-the-loop framework that systematically resolves these pitfalls.
KGQAGen combines structured knowledge grounding, LLM-guided generation, and
symbolic verification to produce challenging and verifiable QA instances. Using
KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in
Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results
demonstrate that even state-of-the-art systems struggle on this benchmark,
highlighting its ability to expose limitations of existing models. Our findings
advocate for more rigorous benchmark construction and position KGQAGen as a
scalable framework for advancing KGQA evaluation.
[COMMENTS]
Accepted at NeurIPS 2025 Datasets and Benchmarks Track
[LINK]
http://arxiv.org/abs/2505.23495v3
[DATE]
2025-10-07 08:23:47+08:00
[CATEGORIES]
cs.CL
cs.LG
AMAQ: Adaptive Mixed-bit Activation Quantization for Collaborative Parameter Efficient Fine-tuning
[AUTHORS]
Yurun Song, Zhuoyi Yang, Ian G. Harris, Sangeetha Abdu Jyothi
[ABSTRACT]
Large Language Models (LLMs) are scaling rapidly, creating significant
challenges for collaborative server-client distributed training, particularly
in terms of communication efficiency and computational overheads. To address
these challenges, we implement Parameter-efficient Split Learning, which
effectively balances efficiency and performance for collaborative training on
low-resource devices.
To reduce communication overhead in collaborative training, we introduce
Adaptive Mixed-bit Activation Quantization (AMAQ), a strategy that
progressively compresses activations and gradients from high precision (6 to 8
bits) to low precision (3 to 4 bits). AMAQ achieves this by effectively
allocating bit budgets across channels based on feature-wise and layer-wise
importance using bit regularization.
Under the same bit budgets, AMAQ outperforms fixed-precision approaches,
delivering about 2.5% higher generation accuracy and about 1.3% better
classification accuracy for models like LLaMA3 8B and Qwen2.5 7B. In addition,
it significantly enhances training stability and reduces ultra-low-bit
representation collapse during training.
Experiments demonstrate that AMAQ integrates effectively into practical
multi-machine collaborative training setups, offering superior inference
accuracy with only a modest communication overhead for bits adaptation during
training. This trade-off makes AMAQ a practical and effective solution for
collaborative training with minimal communication cost.
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2510.05468v1
[DATE]
2025-10-07 08:05:16+08:00
[CATEGORIES]
cs.LG
cs.CL
Do Code Models Suffer from the Dunning-Kruger Effect?
[AUTHORS]
Mukul Singh, Somya Chatterjee, Arjun Radhakrishna, Sumit Gulwani
[ABSTRACT]
As artificial intelligence systems increasingly collaborate with humans in
creative and technical domains, questions arise about the cognitive boundaries
and biases that shape our shared agency. This paper investigates the
Dunning-Kruger Effect (DKE), the tendency for those with limited competence to
overestimate their abilities, in state-of-the-art LLMs on coding tasks. By
analyzing model confidence and performance across a diverse set of programming
languages, we reveal that AI models mirror human patterns of overconfidence,
especially in unfamiliar or low-resource domains. Our experiments demonstrate
that less competent models and those operating in rare programming languages
exhibit stronger DKE-like bias, suggesting that the strength of the bias is
proportionate to the competence of the models.
[LINK]
http://arxiv.org/abs/2510.05457v1
[DATE]
2025-10-07 07:41:24+08:00
[CATEGORIES]
cs.CL
AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering
[AUTHORS]
Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, Yanfang Ye
[ABSTRACT]
Large language models (LLMs) and agent-based frameworks have advanced
rapidly, enabling diverse applications. Yet, with the proliferation of models
and agentic strategies, practitioners face substantial uncertainty in selecting
the best configuration for a downstream task. Prior studies show that different
agents and backbones exhibit complementary strengths, and that larger models
are not always superior, underscoring the need for adaptive routing mechanisms.
Existing approaches to agent routing, however, often emphasize cost efficiency
while overlooking the fine-grained contextual and relational structure inherent
in QA tasks. In this paper, we propose AgentRouter, a framework that
formulates multi-agent QA as a knowledge-graph-guided routing problem
supervised by empirical performance signals. Specifically, we convert each QA
instance into a knowledge graph that jointly encodes queries, contextual
entities, and agents, and then train a heterogeneous graph neural network (GNN)
to propagate information across node types and produce task-aware routing
distributions over agents. By leveraging soft supervision and weighted
aggregation of agent outputs, AgentRouter learns principled collaboration
schemes that capture the complementary strengths of diverse agents. Extensive
experiments demonstrate that our framework consistently outperforms
single-agent and ensemble baselines, while generalizing across benchmarks and
LLM backbones. These results highlight the effectiveness and robustness of
graph-supervised multi-agent routing for question answering.
[LINK]
http://arxiv.org/abs/2510.05445v1
[DATE]
2025-10-07 07:20:49+08:00
[CATEGORIES]
cs.CL
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
[AUTHORS]
Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, Jianfeng Gao
[COMMENTS]
Accepted at EMNLP 2025 Main
[LINK]
http://arxiv.org/abs/2510.05444v1
[DATE]
2025-10-07 07:17:44+08:00
[CATEGORIES]
cs.CL
Adversarial Reinforcement Learning for Large Language Model Agent Safety
[AUTHORS]
Zizhao Wang, Dingcheng Li, Vaishakh Keshava, Phillip Wallis, Ananth Balashankar, Peter Stone, Lukas Rutishauser
[ABSTRACT]
Large Language Model (LLM) agents can leverage tools such as Google Search to
complete complex tasks. However, this tool usage introduces the risk of
indirect prompt injections, where malicious instructions hidden in tool outputs
can manipulate the agent, posing security risks like data leakage. Current
defense strategies typically rely on fine-tuning LLM agents on datasets of
known attacks. However, the generation of these datasets relies on manually
crafted attack patterns, which limits their diversity and leaves agents
vulnerable to novel prompt injections. To address this limitation, we propose
Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework
that leverages adversarial reinforcement learning (RL) by formulating the
problem as a two-player zero-sum game. ARLAS co-trains two LLMs: an attacker
that learns to autonomously generate diverse prompt injections and an agent
that learns to defend against them while completing its assigned tasks. To
ensure robustness against a wide range of attacks and to prevent cyclic
learning, we employ a population-based learning framework that trains the agent
to defend against all previous attacker checkpoints. Evaluated on BrowserGym
and AgentDojo, agents fine-tuned with ARLAS achieve a significantly lower
attack success rate than the original model while also improving their task
success rate. Our analysis further confirms that the adversarial process
generates a diverse and challenging set of attacks, leading to a more robust
agent compared to the base model.
[LINK]
http://arxiv.org/abs/2510.05442v1
[DATE]
2025-10-07 07:09:18+08:00
[CATEGORIES]
cs.LG
cs.CL
Adaptive Margin RLHF via Preference over Preferences
[AUTHORS]
Yaswanth Chittepu, Prasann Singhal, Greg Durrett, Scott Niekum
[ABSTRACT]
Margin-based optimization is fundamental to improving generalization and
robustness in classification tasks. In the context of reward model learning
from preferences within Reinforcement Learning from Human Feedback (RLHF),
existing methods typically rely on no margins, fixed margins, or margins that
are simplistic functions of preference ratings. However, such formulations
often fail to account for the varying strengths of different preferences, for
example, some preferences are associated with larger margins between responses,
or they rely on noisy margin information derived from ratings. We argue that
modeling the strength of preferences can lead to better generalization and more
faithful alignment. Furthermore, many existing methods that use adaptive
margins assume access to accurate preference scores, which can be difficult for
humans to provide reliably. We propose an approach that leverages preferences
over preferences, that is, annotations indicating which of two preferences
reflects a stronger distinction. We use this ordinal signal to infer adaptive
margins on a per-datapoint basis. We introduce an extension to Direct
Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from
preference-over-preference supervision, enabling improved discriminative and
generative performance. Empirically, our method outperforms vanilla DPO, DPO
with fixed margins, and DPO with ground-truth margins on the UltraFeedback
dataset. Additionally, we show that there is a tradeoff between discriminative
and generative performance: improving test classification accuracy,
particularly by correctly labeling weaker preferences at the expense of
stronger ones, can lead to a decline in generative quality. To navigate this
tradeoff, we propose two sampling strategies to gather
preference-over-preference labels: one favoring discriminative performance and
one favoring generative performance.
[LINK]
http://arxiv.org/abs/2509.22851v2
[DATE]
2025-10-07 06:22:55+08:00
[CATEGORIES]
cs.LG
cs.CL
A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis
[AUTHORS]
Ziheng Geng, Jiachen Liu, Ran Cao, Lu Cheng, Haifeng Wang, Minghui Cheng
[ABSTRACT]
Large language models (LLMs) have recently been used to empower autonomous
agents in engineering, significantly improving automation and efficiency in
labor-intensive workflows. However, their potential remains underexplored in
structural engineering, particularly for finite element modeling tasks
requiring geometric modeling, complex reasoning, and domain knowledge. To
bridge this gap, this paper develops an LLM-based multi-agent system to automate
finite element modeling of 2D frames. The system decomposes structural analysis
into subtasks, each managed by a specialized agent powered by the lightweight
Llama-3.3 70B Instruct model. The workflow begins with a Problem Analysis
Agent, which extracts geometry, boundary, and material parameters from the user
input. Next, a Geometry Agent incrementally derives node coordinates and
element connectivity by applying expert-defined rules. These structured outputs
are converted into executable OpenSeesPy code by a Translation Agent and
refined by a Model Validation Agent through consistency checks. Then, a Load
Agent applies load conditions to the assembled structural model. Experimental
evaluations on 20 benchmark problems demonstrate that the system achieves
accuracy over 80% in most cases across 10 repeated trials, outperforming
Gemini-2.5 Pro and ChatGPT-4o models.
[LINK]
http://arxiv.org/abs/2510.05414v1
[DATE]
2025-10-07 06:12:52+08:00
[CATEGORIES]
cs.CL
Language Models Surface the Unwritten Code of Science and Society
[AUTHORS]
Honglin Bao, Siyang Wu, Jiwoong Choi, Yingrong Mao, James A. Evans
[ABSTRACT]
This paper calls on the research community not only to investigate how human
biases are inherited by large language models (LLMs) but also to explore how
these biases in LLMs can be leveraged to make society’s “unwritten code” - such
as implicit stereotypes and heuristics - visible and accessible for critique.
We introduce a conceptual framework through a case study in science: uncovering
hidden rules in peer review - the factors that reviewers care about but rarely
state explicitly due to normative scientific expectations. The idea of the
framework is to push LLMs to speak out their heuristics through generating
self-consistent hypotheses - why one paper appeared stronger in reviewer
scoring - among paired papers submitted to 45 academic conferences, while
iteratively searching deeper hypotheses from remaining pairs where existing
hypotheses cannot explain. We observed that LLMs’ normative priors about the
internal characteristics of good science extracted from their self-talk, e.g.,
theoretical rigor, were systematically updated toward posteriors that emphasize
storytelling about external connections, such as how the work is positioned and
connected within and across literatures. Human reviewers tend to explicitly
reward aspects that moderately align with LLMs’ normative priors (correlation =
0.49) but avoid articulating contextualization and storytelling posteriors in
their review comments (correlation = -0.14), despite giving implicit reward to
them with positive scores. These patterns are robust across different models
and out-of-sample judgments. We discuss the broad applicability of our proposed
framework, leveraging LLMs as diagnostic tools to amplify and surface the tacit
codes underlying human society, enabling public discussion of revealed values
and more precisely targeted responsible AI.
[LINK]
http://arxiv.org/abs/2505.18942v3
[DATE]
2025-10-07 06:09:02+08:00
[CATEGORIES]
cs.CL
Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care
[AUTHORS]
Junyi Fan, Li Sun, Negin Ashrafi, Kamiar Alaei, Maryam Pishgar
[ABSTRACT]
Nursing documentation in intensive care units (ICUs) provides essential
clinical intelligence but often suffers from inconsistent terminology, informal
styles, and lack of standardization, challenges that are particularly critical
in heart failure care. This study applies Direct Preference Optimization (DPO)
to adapt Mistral-7B, a locally deployable language model, using 8,838 heart
failure nursing notes from the MIMIC-III database and 21,210 preference pairs
derived from expert-verified GPT outputs, model generations, and original
notes. Evaluation across BLEU, ROUGE, BERTScore, Perplexity, and expert
qualitative assessments demonstrates that DPO markedly enhances documentation
quality. Specifically, BLEU increased by 84% (0.173 to 0.318), BERTScore
improved by 7.6% (0.828 to 0.891), and expert ratings rose across accuracy
(+14.4 points), completeness (+14.5 points), logical consistency (+14.1
points), readability (+11.1 points), and structural clarity (+6.0 points).
These results indicate that DPO can align lightweight clinical language models
with expert standards, supporting privacy-preserving, AI-assisted documentation
within electronic health record systems to reduce administrative burden and
improve ICU patient safety.
[LINK]
http://arxiv.org/abs/2510.05410v1
[DATE]
2025-10-07 06:04:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Evaluating the Effect of Retrieval Augmentation on Social Biases
[AUTHORS]
Tianhui Zhang, Yi Zhou, Danushka Bollegala
[ABSTRACT]
Retrieval Augmented Generation (RAG) has gained popularity as a method for
conveniently incorporating novel facts that were not seen during the
pre-training stage in Large Language Model (LLM)-based Natural Language
Generation (NLG) systems. However, LLMs are known to encode significant levels
of unfair social biases. The modulation of these biases by RAG in NLG systems
is not well understood. In this paper, we systematically study the relationship
between the different components of a RAG system and the social biases
presented in the text generated across three languages (i.e. English, Japanese
and Chinese) and four social bias types (i.e. gender, race, age and religion).
Specifically, using the Bias Question Answering (BBQ) benchmark datasets, we
evaluate the social biases in RAG responses from document collections with
varying levels of stereotypical biases, employing multiple LLMs used as
generators. We find that the biases in document collections are often amplified
in the generated responses, even when the generating LLM exhibits a low level
of bias. Our findings raise concerns about the use of RAG as a technique for
injecting novel facts into NLG systems and call for careful evaluation of
potential social biases in RAG applications before their real-world deployment.
[COMMENTS]
21 pages
[LINK]
http://arxiv.org/abs/2502.17611v2
[DATE]
2025-10-07 05:50:27+08:00
[CATEGORIES]
cs.CL
Intent-Aware Schema Generation And Refinement For Literature Review Tables
[AUTHORS]
Vishakh Padmakumar, Joseph Chee Chang, Kyle Lo, Doug Downey, Aakanksha Naik
[ABSTRACT]
The increasing volume of academic literature makes it essential for
researchers to organize, compare, and contrast collections of documents. Large
language models (LLMs) can support this process by generating schemas defining
shared aspects along which to compare papers. However, progress on schema
generation has been slow due to: (i) ambiguity in reference-based evaluations,
and (ii) lack of editing/refinement methods. Our work is the first to address
both issues. First, we present an approach for augmenting unannotated table
corpora with synthesized intents, and apply it to create a dataset for
studying schema generation conditioned on a given information need, thus
reducing ambiguity. With this dataset, we show how incorporating table intents
significantly improves baseline performance in reconstructing reference
schemas. We start by comprehensively benchmarking several single-shot schema
generation methods, including prompted LLM workflows and fine-tuned models,
showing that smaller, open-weight models can be fine-tuned to be competitive
with state-of-the-art prompted LLMs. Next, we propose several LLM-based schema
refinement techniques and show that these can further improve schemas generated
by these methods.
[COMMENTS]
To Appear at EMNLP Findings 2025
[LINK]
http://arxiv.org/abs/2507.19521v2
[DATE]
2025-10-07 05:44:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Generative transformations and patterns in LLM-native approaches for software verification and falsification
[AUTHORS]
Víctor A. Braberman, Flavia Bonomo-Braberman, Yiannis Charalambous, Juan G. Colonna, Lucas C. Cordeiro, Rosiane de Freitas
[ABSTRACT]
The emergence of prompting as the dominant paradigm for leveraging Large
Language Models (LLMs) has led to a proliferation of LLM-native software, where
application behavior arises from complex, stochastic data transformations.
However, the engineering of such systems remains largely exploratory and
ad-hoc, hampered by the absence of conceptual frameworks, ex-ante
methodologies, design guidelines, and specialized benchmarks. We argue that a
foundational step towards a more disciplined engineering practice is a
systematic understanding of the core functional units (generative
transformations) and their compositional patterns within LLM-native
applications.
Focusing on the rich domain of software verification and falsification, we
conduct a secondary study of over 100 research proposals to address this gap.
We first present a fine-grained taxonomy of generative transformations,
abstracting prompt-based interactions into conceptual signatures. This taxonomy
serves as a scaffolding to identify recurrent transformation relationship
patterns (analogous to software design patterns) that characterize solution
approaches in the literature. Our analysis not only validates the utility of
the taxonomy but also surfaces strategic gaps and cross-dimensional
relationships, offering a structured foundation for future research in modular
and compositional LLM application design, benchmarking, and the development of
reliable LLM-native systems.
[LINK]
http://arxiv.org/abs/2404.09384v3
[DATE]
2025-10-07 05:35:53+08:00
[CATEGORIES]
cs.CL
cs.LG
Quantum Concept Music Score from Quantum Picturalism: Musical Incarnation of a Bell-Pair under Measurements
[AUTHORS]
Rakhat-Bi Abdyssagin, Bob Coecke
[ABSTRACT]
We initiate the development of a new language and theory for quantum music,
to which we refer as Quantum Concept Music (QCM). This new music formalism is
based on Categorical Quantum Mechanics (CQM), and more specifically, its
diagrammatic incarnation Quantum Picturalism (QPict), which is heavily based on
ZX-calculus. In fact, it is naturally inherited from CQM/QPict. At its heart is
the explicit notational representation of relations that exist within and
between the key concepts of music composition, performance, and automation. QCM
also enables one to directly translate quantum phenomena into music
compositions in a manner that is at once intuitively obvious, rigorous, and mechanical.
Following this pattern, we propose a score for musicians interacting like a
Bell-pair under measurement, and outline examples of how it could be performed
live. While most Western classical music notation has heavily
relied on linear representation of music - which does not always adequately
capture the nature of music - our approach is distinct by highlighting the
fundamental relational dimension of music. In addition, this quantum-based
technique not only influences the music at the profound level of composition,
but also has a direct impact on a live performance, and also provides a new
template for automating music, e.g. in the context of AI-generation.
Altogether, we initiate the creation of a new music formalism that is
powerful and efficient in capturing the interactive nature of music, both in
terms of internal and external interactions, and goes beyond the boundaries of
Western classical music notation, which allows it to be used in many different
genres and directions.
[COMMENTS]
6 pages, musical score
[LINK]
http://arxiv.org/abs/2510.05391v1
[DATE]
2025-10-07 05:35:06+08:00
[CATEGORIES]
cs.CL
Context Length Alone Hurts LLM Performance Despite Perfect Retrieval
[AUTHORS]
Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, Hao Peng
[COMMENTS]
18 pages (9 pages of main content), 5 figures, accepted at the
Findings of EMNLP 2025
[LINK]
http://arxiv.org/abs/2510.05381v1
[DATE]
2025-10-07 05:17:13+08:00
[CATEGORIES]
cs.CL
The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
[AUTHORS]
Alexander M. Fichtl, Jeremias Bohn, Josefin Kelber, Edoardo Mosca, Georg Groh
[ABSTRACT]
Transformers have dominated sequence processing tasks for the past seven
years – most notably language modeling. However, the inherent quadratic
complexity of their attention mechanism remains a significant bottleneck as
context length increases. This paper surveys recent efforts to overcome this
bottleneck, including advances in (sub-quadratic) attention variants, recurrent
neural networks, state space models, and hybrid architectures. We critically
analyze these approaches in terms of compute and memory complexity, benchmark
results, and fundamental limitations to assess whether the dominance of
pure-attention transformers may soon be challenged.
[COMMENTS]
21 pages, 2 figures, 2 tables
[LINK]
http://arxiv.org/abs/2510.05364v1
[DATE]
2025-10-07 04:45:34+08:00
[CATEGORIES]
cs.CL
Residualized Similarity for Faithfully Explainable Authorship Verification
[AUTHORS]
Peter Zeng, Pegah Alipoormolabashi, Jihu Mun, Gourab Dey, Nikita Soni, Niranjan Balasubramanian, Owen Rambow, H. Schwartz
[COMMENTS]
EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2510.05362v1
[DATE]
2025-10-07 04:41:17+08:00
[CATEGORIES]
cs.CL
DynaGuard: A Dynamic Guardian Model With User-Defined Policies
[AUTHORS]
Monte Hoover, Vatsal Baherwani, Neel Jain, Khalid Saifullah, Joseph Vincent, Chirag Jain, Melissa Kazemi Rad, C. Bayan Bruss, Ashwinee Panda, Tom Goldstein
[ABSTRACT]
Guardian models play a crucial role in ensuring the safety and ethical
behavior of user-facing AI applications by enforcing guardrails and detecting
harmful content. While standard guardian models are limited to predefined,
static harm categories, we introduce DynaGuard, a suite of dynamic guardian
models offering novel flexibility by evaluating text based on user-defined
policies, and DynaBench, a dataset for training and evaluating dynamic guardian
models. Our models provide both rapid detection of policy violations and a
chain-of-thought reasoning option that articulates and justifies model outputs.
Critically, DynaGuard not only surpasses static models in detection accuracy on
traditional safety categories, but is competitive with frontier reasoning
models on free-form policy violations, all in a fraction of the time. This
makes DynaGuard a critical tool for language model guardrails.
[COMMENTS]
22 Pages
[LINK]
http://arxiv.org/abs/2509.02563v3
[DATE]
2025-10-07 04:34:31+08:00
[CATEGORIES]
cs.LG
cs.CL
Learning to vary: Teaching LMs to reproduce human linguistic variability in next-word prediction
[AUTHORS]
Tobias Groot, Salo Lacunes, Evgenia Ilia
[ABSTRACT]
Natural language generation (NLG) tasks are often subject to inherent
variability; e.g. predicting the next word given a context has multiple valid
responses, evident when asking multiple humans to complete the task. While
having language models (LMs) that are aligned pluralistically, so that they are
able to reproduce well the inherent diversity in perspectives of an entire
population of interest is clearly beneficial, Ilia and Aziz (2024) show that
LMs do not reproduce this type of linguistic variability well. They speculate
this inability might stem from the lack of consistent training of LMs with data
reflecting this type of inherent variability. As such, we investigate whether
training LMs on multiple plausible word continuations per context can improve
their ability to reproduce human linguistic variability for next-word
prediction. We employ fine-tuning techniques for pre-trained and
instruction-tuned models; and demonstrate their potential when fine-tuning
GPT-2 and Mistral-7B-IT, using Provo Corpus. Our evaluation, which measures
divergence among empirically estimated human and model next-word distributions
across contexts before and after fine-tuning, shows that our multi-label
fine-tuning improves the LMs’ ability to reproduce linguistic variability, both
for contexts that admit higher and lower variability.
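As an illustration of comparing human and model next-word distributions, the sketch below uses total variation distance. The abstract does not name the divergence, so this choice of metric is an assumption, as are the helper names and the toy continuations.

```python
from collections import Counter

def empirical_distribution(words):
    """Estimate a next-word distribution from a list of sampled continuations."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def total_variation_distance(p, q):
    """Half the L1 distance between two distributions over words."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)

# Hypothetical continuations for one context (not taken from the Provo Corpus).
human = ["dog", "dog", "cat", "bird"]
model = ["dog", "dog", "dog", "dog"]

tvd = total_variation_distance(empirical_distribution(human),
                               empirical_distribution(model))
print(tvd)
```

A value of 0 would mean the model reproduces the human distribution exactly for that context; larger values indicate that the model under-represents plausible alternatives.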
[COMMENTS]
EMNLP UncertaiNLP Workshop 2025
[LINK]
http://arxiv.org/abs/2509.17794v2
[DATE]
2025-10-07 04:22:11+08:00
[CATEGORIES]
cs.CL
WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives
[AUTHORS]
Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo
[ABSTRACT]
Historical archives on weather events are collections of enduring primary
source records that offer rich, untapped narratives of how societies have
experienced and responded to extreme weather events. These qualitative accounts
provide insights into societal vulnerability and resilience that are largely
absent from meteorological records, making them valuable for climate scientists
to understand societal responses. However, their vast scale, noisy digitized
quality, and archaic language make it difficult to transform them into
structured knowledge for climate research. To address this challenge, we
introduce WeatherArchive-Bench, the first benchmark for evaluating
retrieval-augmented generation (RAG) systems on historical weather archives.
WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which
measures a system’s ability to locate historically relevant passages from over
one million archival news segments, and WeatherArchive-Assessment, which
evaluates whether Large Language Models (LLMs) can classify societal
vulnerability and resilience indicators from extreme weather narratives.
Extensive experiments across sparse, dense, and re-ranking retrievers, as well
as a diverse set of LLMs, reveal that dense retrievers often fail on historical
terminology, while LLMs frequently misinterpret vulnerability and resilience
concepts. These findings highlight key limitations in reasoning about complex
societal indicators and provide insights for designing more robust
climate-focused RAG systems from archival contexts. The constructed dataset and
evaluation framework are publicly available at
https://anonymous.4open.science/r/WeatherArchive-Bench/.
[LINK]
http://arxiv.org/abs/2510.05336v1
[DATE]
2025-10-07 03:58:42+08:00
[CATEGORIES]
cs.CL
Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning
[AUTHORS]
Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych
[ABSTRACT]
Large-scale Transformer language models (LMs) trained solely on next-token
prediction with web-scale data can solve a wide range of tasks after seeing
just a few examples. The mechanism behind this capability, known as in-context
learning (ICL), remains both controversial and poorly understood. Some studies
argue that it is merely the result of memorizing vast amounts of data, while
others contend that it reflects a fundamental, symbolic algorithmic development
in LMs. In this work, we introduce a suite of investigative tasks and a novel
method to systematically investigate ICL by leveraging the full Pythia scaling
suite, including interim checkpoints that capture progressively larger amounts
of training data. By carefully exploring ICL performance on downstream tasks
and simultaneously conducting a mechanistic analysis of the residual stream’s
subspace, we demonstrate that ICL extends beyond mere “memorization” of the
training corpus, yet does not amount to the implementation of an independent
symbolic algorithm. Our results also clarify several aspects of ICL, including
the influence of training dynamics, model capabilities, and elements of
mechanistic interpretability. Overall, our work advances the understanding of
ICL and its implications, offering model developers insights into potential
improvements and providing AI security practitioners with a basis for more
informed guidelines.
[COMMENTS]
TMLR
[LINK]
http://arxiv.org/abs/2505.11004v3
[DATE]
2025-10-07 03:24:12+08:00
[CATEGORIES]
cs.CL
RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts
[AUTHORS]
Yining She, Daniel W. Peterson, Marianne Menglin Liu, Vikas Upadhyay, Mohammad Hossein Chaghazardi, Eunsuk Kang, Dan Roth
[ABSTRACT]
With the increasing adoption of large language models (LLMs), ensuring the
safety of LLM systems has become a pressing concern. External LLM-based
guardrail models have emerged as a popular solution to screen unsafe inputs and
outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are
vulnerable to data distribution shifts. In this paper, taking
Retrieval-Augmented Generation (RAG) as a case study, we investigated how robust
LLM-based guardrails are against additional information embedded in the
context. Through a systematic evaluation of 3 Llama Guards and 2 GPT-oss
models, we confirmed that inserting benign documents into the guardrail context
alters the judgments of input and output guardrails in around 11% and 8% of
cases, making them unreliable. We separately analyzed the effect of each
component in the augmented context: retrieved documents, user query, and
LLM-generated response. The two mitigation methods we tested only bring minor
improvements. These results expose a context-robustness gap in current
guardrails and motivate training and evaluation protocols that are robust to
retrieval and query composition.
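The share of altered judgments reported above can be read as a flip rate: the fraction of cases whose verdict changes once documents are added to the guardrail context. A minimal sketch, in which the function name and the verdict lists are hypothetical and do not come from the paper:

```python
def flip_rate(baseline, augmented):
    """Fraction of cases whose guardrail verdict changes after context is added."""
    if len(baseline) != len(augmented):
        raise ValueError("verdict lists must have the same length")
    if not baseline:
        raise ValueError("need at least one case")
    flips = sum(b != a for b, a in zip(baseline, augmented))
    return flips / len(baseline)

# Hypothetical verdicts for eight inputs (True means "flagged as unsafe").
without_docs = [True, True, False, False, True, False, True, False]
with_docs = [True, False, False, False, True, False, True, False]

rate = flip_rate(without_docs, with_docs)
print(rate)
```

Because the metric compares a guardrail with itself, it needs no ground-truth safety labels.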
[LINK]
http://arxiv.org/abs/2510.05310v1
[DATE]
2025-10-07 03:20:43+08:00
[CATEGORIES]
cs.CL
WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection
[AUTHORS]
Xi Xuan, Xuechen Liu, Wenxin Zhang, Yi-Cheng Lin, Xiaojian Lin, Tomi Kinnunen
[COMMENTS]
Submitted to ICASSP 2026
[LINK]
http://arxiv.org/abs/2510.05305v1
[DATE]
2025-10-07 03:17:18+08:00
[CATEGORIES]
cs.CL
Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs
[AUTHORS]
Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam
[ABSTRACT]
Large language models have shown great success on natural language tasks in
recent years, but they have also shown great promise when adapted to new
modalities, e.g., for scientific machine learning tasks. Even though
decoder-only models are more popular within NLP and scale exceedingly well at
generating natural language, most proposed approaches for cross-modal
adaptation focus on encoder-only models, raising the question of how model
architecture affects these approaches. In this paper, we therefore perform a
series of ablation studies to answer this question, systematically comparing
encoder-only and decoder-only models on cross-modal adaptation for
time-dependent simulation tasks based on partial differential equations (PDEs).
We find that decoder-only models are far worse than encoder-only models, when
existing approaches are applied unmodified. In contrast to several other
domains, scaling decoder-only models also does not help. To harness the
potential of decoder-only models in this context, we introduce two novel
approaches, Parallel Flipping and Sequence Doubling, attempting to mimic
bidirectionality in autoregressive models. Both our methods improve overall
performance using decoder-only models for all tasks and all cross-modal
adaptation methods, closing the gap to encoder-only model performance. We hope
that our findings broaden the spectrum of models used on cross-modal adaptation
tasks to further scientific ML.
[LINK]
http://arxiv.org/abs/2510.05278v1
[DATE]
2025-10-07 02:46:50+08:00
[CATEGORIES]
cs.LG
cs.CL
LaB-RAG: Label Boosted Retrieval Augmented Generation for Radiology Report Generation
[AUTHORS]
Steven Song, Anirudh Subramanyam, Irene Madejski, Robert L. Grossman
[ABSTRACT]
In the current paradigm of image captioning, deep learning models are trained
to generate text from image embeddings of latent features. We challenge the
assumption that fine-tuning of large, bespoke models is required to improve
model generation accuracy. Here we propose Label Boosted Retrieval Augmented
Generation (LaB-RAG), a small-model-based approach to image captioning that
leverages image descriptors in the form of categorical labels to boost standard
retrieval augmented generation (RAG) with pretrained large language models
(LLMs). We study our method in the context of radiology report generation (RRG)
over MIMIC-CXR and CheXpert Plus. We argue that simple classification models
combined with zero-shot embeddings can effectively transform X-rays into
text-space as radiology-specific labels. In combination with standard RAG, we
show that these derived text labels can be used with general-domain LLMs to
generate radiology reports. Without ever training our generative language model
or image embedding models specifically for the task, and without ever directly
“showing” the LLM an X-ray, we demonstrate that LaB-RAG achieves better results
across natural language and radiology language metrics compared with other
retrieval-based RRG methods, while attaining competitive results compared to
other fine-tuned vision-language RRG models. We further conduct extensive
ablation experiments to better understand the components of LaB-RAG. Our
results suggest broader compatibility and synergy with fine-tuned methods to
further enhance RRG performance.
[LINK]
http://arxiv.org/abs/2411.16523v2
[DATE]
2025-10-07 02:36:47+08:00
[CATEGORIES]
cs.CL
A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens
[AUTHORS]
David Dobre, Mehrnaz Mofakhami, Sophie Xhonneux, Leo Schwinn, Gauthier Gidel
[ABSTRACT]
Many safety post-training methods for large language models (LLMs) are
designed to modify the model’s behaviour from producing unsafe answers to
issuing refusals. However, such distribution shifts are often brittle and
degrade performance on desirable tasks. To address these pitfalls, we propose
augmenting the model’s vocabulary with a special red flag token, and training
the model to insert this token whenever harmful content is generated or
imminent. This approach enables the model to explicitly learn the concept of
harmfulness in its representations, with minimal impact on utility due to the
marginal change in the generated distribution of natural language. Moreover,
because the token is embedded in the model’s vocabulary, we can naturally
leverage the LLMs’ generalization capabilities, such as in-context learning
(ICL) and out-of-distribution generalization to languages that are not formally
supported (e.g., Japanese for Llama3). In particular, we demonstrate that
through ICL alone, the model can learn to initiate reflective reasoning upon
generating the red flag token at inference, which steers the response away from
harmful continuations or enables self-correction when the flag is raised
falsely. This approach is orthogonal and complementary to existing safety
techniques (such as safety classifiers or standard safety training) and easier
to evaluate in comparison to natural language refusals, as it does not require
a human or automated judge to assess the harmlessness of the answers.
[COMMENTS]
15 pages, 6 figures
[LINK]
http://arxiv.org/abs/2502.16366v4
[DATE]
2025-10-07 02:29:16+08:00
[CATEGORIES]
cs.CL
cs.LG
Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training
[AUTHORS]
Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi
[COMMENTS]
Accepted to ACL 2025, accepted to Safe Generative AI Workshop @
NeurIPS 2024. Camera-ready version for ACL 2025 (to appear). Submitted July
2025
[LINK]
http://arxiv.org/abs/2410.15460v5
[DATE]
2025-10-07 02:23:17+08:00
[CATEGORIES]
cs.CL
Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning
[AUTHORS]
Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, Zhuokai Zhao
[ABSTRACT]
Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm
for enhancing the reasoning capabilities of large language models (LLMs), yet
its success hinges on effective exploration. An ideal exploration strategy must
navigate two fundamental challenges: it must preserve sample quality while also
ensuring training stability. While standard fixed-temperature sampling is
simple, it struggles to balance these competing demands, as high temperatures
degrade sample quality and low temperatures limit discovery. In this work, we
propose a simpler and more effective strategy, Exploratory Annealed Decoding
(EAD), grounded in the insight that exploration is most impactful on early
tokens which define a sequence’s semantic direction. EAD implements an
intuitive explore-at-the-beginning, exploit-at-the-end strategy by
annealing the sampling temperature from high to low during generation. This
dynamic schedule encourages meaningful, high-level diversity at the start, then
gradually lowers the temperature to preserve sample quality and keep the
sampling distribution close to the target policy, which is essential for stable
training. We demonstrate that EAD is a lightweight, plug-and-play method that
significantly improves sample efficiency, consistently outperforming
fixed-temperature sampling across various RLVR algorithms and model sizes. Our
work suggests that aligning exploration with the natural dynamics of sequential
generation offers a robust path to improving LLM reasoning.
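The explore-early, exploit-late idea can be sketched as a temperature that cools over the course of decoding. This is a minimal illustration and not the authors' implementation: the linear schedule, the endpoint values, and the toy model are all assumptions.

```python
import math
import random

def annealed_temperature(step, total_steps, t_start=1.2, t_end=0.6):
    """Linearly interpolate the temperature from t_start down to t_end.

    The linear shape and the default endpoints are assumptions for this sketch.
    """
    if total_steps <= 1:
        return t_end
    frac = step / (total_steps - 1)
    return t_start + (t_end - t_start) * frac

def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_logits, length, rng):
    """Generate `length` tokens, cooling the temperature as decoding proceeds."""
    tokens = []
    for step in range(length):
        temperature = annealed_temperature(step, length)
        tokens.append(sample_token(next_logits(tokens), temperature, rng))
    return tokens

# Hypothetical stand-in for a language model: fixed logits over 4 tokens.
def toy_model(prefix):
    return [2.0, 1.0, 0.5, 0.0]

print(generate(toy_model, 8, random.Random(0)))
```

Early tokens are drawn from a flatter distribution and later tokens from a sharper one, so diversity is concentrated where the abstract argues it matters most.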
[COMMENTS]
Codebase: https://github.com/yangalan123/EAD-RLVR
[LINK]
http://arxiv.org/abs/2510.05251v1
[DATE]
2025-10-07 02:15:43+08:00
[CATEGORIES]
cs.CL
cs.LG
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
[AUTHORS]
Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge
[COMMENTS]
Accepted to COLM 2025
[LINK]
http://arxiv.org/abs/2504.07086v2
[DATE]
2025-10-07 02:07:34+08:00
[CATEGORIES]
cs.LG
cs.CL
Paper2Video: Automatic Video Generation from Scientific Papers
[AUTHORS]
Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
[ABSTRACT]
Academic presentation videos have become an essential medium for research
communication, yet producing them remains highly labor-intensive, often
requiring hours of slide design, recording, and editing for a short 2- to
10-minute video. Unlike natural video, presentation video generation involves
distinctive challenges: inputs from research papers, dense multi-modal
information (text, figures, tables), and the need to coordinate multiple
aligned channels such as slides, subtitles, speech, and human talker. To
address these challenges, we introduce Paper2Video, the first benchmark of 101
research papers paired with author-created presentation videos, slides, and
speaker metadata. We further design four tailored evaluation metrics (Meta
Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how videos
convey the paper’s information to the audience. Building on this foundation, we
propose PaperTalker, the first multi-agent framework for academic presentation
video generation. It integrates slide generation with effective layout
refinement by a novel effective tree search visual choice, cursor grounding,
subtitling, speech synthesis, and talking-head rendering, while parallelizing
slide-wise generation for efficiency. Experiments on Paper2Video demonstrate
that the presentation videos produced by our approach are more faithful and
informative than existing baselines, establishing a practical step toward
automated and ready-to-use academic video generation. Our dataset, agent, and
code are available at https://github.com/showlab/Paper2Video.
[COMMENTS]
20 pages, 8 figures
[LINK]
http://arxiv.org/abs/2510.05096v1
[DATE]
2025-10-07 01:58:02+08:00
[CATEGORIES]
cs.CL
From Noisy Traces to Stable Gradients: Bias-Variance Optimized Preference Optimization for Aligning Large Reasoning Models
[AUTHORS]
Mingkang Zhu, Xi Chen, Bei Yu, Hengshuang Zhao, Jiaya Jia
[ABSTRACT]
Large reasoning models (LRMs) generate intermediate reasoning traces before
producing final answers, yielding strong gains on multi-step and mathematical
tasks. Yet aligning LRMs with human preferences, a crucial prerequisite for
model deployment, remains underexplored. The statistically correct objective
for preference alignment requires marginalizing over reasoning traces, but this
computation is intractable in practice. A common workaround optimizes a single
sampled trajectory, which introduces substantial gradient variance from
stochastic trace sampling. To address this challenge, we frame preference
optimization for LRMs through the lens of the bias–variance trade-off and
propose Bias–Variance Optimized Preference Optimization (BVPO), a simple,
drop-in method that mixes two gradient estimators: a high-variance trace-based
estimator and a low-variance empty-trace estimator obtained by disabling
reasoning trace generation. Our theory shows that BVPO strictly reduces
trace-induced variance for any nontrivial mixture, provides a closed-form
choice of the mixing weight that minimizes mean-squared error relative to the
true marginal gradient, and under standard smoothness and step-size conditions,
tightens classical convergence bounds for stochastic gradient descent.
Empirically, BVPO improves alignment over the best baseline by up to 7.8 points
on AlpacaEval 2 and 6.8 points on Arena-Hard. Despite being trained only on
general conversational data, BVPO also boosts reasoning performance for base
models by up to 4.0 points on the average of six math reasoning benchmarks.
These results identify variance from trace sampling as a key bottleneck and
demonstrate that directly optimizing the bias–variance trade-off yields more
stable training and stronger overall performance.
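The mixing of two gradient estimators described above can be illustrated as a convex combination. This is a sketch under stated assumptions rather than the paper's algorithm: the function names, the independence assumption in the variance formula, and the numbers are all hypothetical.

```python
def mix_gradients(g_trace, g_empty, alpha):
    """Convex combination alpha * g_trace + (1 - alpha) * g_empty."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * t + (1.0 - alpha) * e for t, e in zip(g_trace, g_empty)]

def mixture_variance(var_trace, var_empty, alpha):
    """Variance of the mixture when the two estimators are independent.

    Independence is an assumption of this sketch; with correlated estimators
    a covariance term would be added.
    """
    return alpha ** 2 * var_trace + (1.0 - alpha) ** 2 * var_empty

# Hypothetical per-parameter gradients and variances.
g = mix_gradients([1.0, 2.0], [3.0, 4.0], alpha=0.25)
print(g)
print(mixture_variance(4.0, 1.0, 0.5))
```

Under the independence assumption, the contribution of the trace-based estimator to the variance is scaled by `alpha ** 2`, which is below its original value for any `alpha` strictly less than 1.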
[LINK]
http://arxiv.org/abs/2510.05095v1
[DATE]
2025-10-07 01:58:01+08:00
[CATEGORIES]
cs.LG
cs.CL
Learning to Interpret Weight Differences in Language Models
[AUTHORS]
Avichal Goel, Yoon Kim, Nir Shavit, Tony T. Wang
[COMMENTS]
The weight diffs and DIT adapters trained in the paper can be found
at https://huggingface.co/diff-interpretation-tuning/loras
[LINK]
http://arxiv.org/abs/2510.05092v1
[DATE]
2025-10-07 01:57:23+08:00
[CATEGORIES]
cs.LG
cs.CL
Finish First, Perfect Later: Test-Time Token-Level Cross-Validation for Diffusion Large Language Models
[AUTHORS]
Runchu Tian, Junxia Cui, Xueqiang Xu, Feng Yao, Jingbo Shang
[ABSTRACT]
Diffusion large language models (dLLMs) have recently emerged as a promising
alternative to autoregressive (AR) models, offering advantages such as
accelerated parallel decoding and bidirectional context modeling. However, the
vanilla decoding strategy in discrete dLLMs suffers from a critical limitation:
once a token is accepted, it can no longer be revised in subsequent steps. As a
result, early mistakes persist across iterations, harming both intermediate
predictions and final output quality. To address this issue, we propose
Tolerator (Token-Level Cross-Validation Refinement), a training-free decoding
strategy that leverages cross-validation among predicted tokens. Unlike
existing methods that follow a single progressive unmasking procedure,
Tolerator introduces a two-stage process: (i) sequence fill-up and (ii)
iterative refinement by remasking and decoding a subset of tokens while
treating the remaining as context. This design enables previously accepted
tokens to be reconsidered and corrected when necessary, leading to more
reliable diffusion decoding outputs. We evaluate Tolerator on five standard
benchmarks covering language understanding, code generation, and mathematics.
Experiments show that our method achieves consistent improvements over the
baselines under the same computational budget. These findings suggest that
decoding algorithms are crucial to realizing the full potential of diffusion
large language models. Code and data are publicly available.
[COMMENTS]
17 pages, 8 figures. Work in progress
[LINK]
http://arxiv.org/abs/2510.05090v1
[DATE]
2025-10-07 01:56:46+08:00
[CATEGORIES]
cs.CL
TeachLM: Post-Training LLMs for Education Using Authentic Learning Data
[AUTHORS]
Janos Perczel, Jin Chow, Dorottya Demszky
[ABSTRACT]
The promise of generative AI to revolutionize education is constrained by the
pedagogical limits of large language models (LLMs). A major issue is the lack
of access to high-quality training data that reflect the learning of actual
students. Prompt engineering has emerged as a stopgap, but the ability of
prompts to encode complex pedagogical strategies in rule-based natural language
is inherently limited. To address this gap, we introduce TeachLM, an LLM
optimized for teaching through parameter-efficient fine-tuning of
state-of-the-art models. TeachLM is trained on a dataset comprised of 100,000
hours of one-on-one, longitudinal student-tutor interactions maintained by
Polygence, which underwent a rigorous anonymization process to protect privacy.
We use parameter-efficient fine-tuning to develop an authentic student model
that enables the generation of high-fidelity synthetic student-tutor dialogues.
Building on this capability, we propose a novel multi-turn evaluation protocol
that leverages synthetic dialogue generation to provide fast, scalable, and
reproducible assessments of the dialogical capabilities of LLMs. Our
evaluations demonstrate that fine-tuning on authentic learning data
significantly improves conversational and pedagogical performance: doubling
student talk time, improving questioning style, increasing dialogue turns by
50%, and increasing personalization of instruction.
[COMMENTS]
28 pages, 9 figures
[LINK]
http://arxiv.org/abs/2510.05087v1
[DATE]
2025-10-07 01:55:04+08:00
[CATEGORIES]
cs.CL
Using cognitive models to reveal value trade-offs in language models
[AUTHORS]
Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman
[ABSTRACT]
Value trade-offs are an integral part of human decision-making and language
use; however, current tools for interpreting such dynamic and multi-faceted
notions of values in LLMs are limited. In cognitive science, so-called
“cognitive models” provide formal accounts of such trade-offs in humans, by
modeling the weighting of a speaker’s competing utility functions in choosing
an action or utterance. Here we use a leading cognitive model of polite speech
to systematically evaluate value trade-offs in two encompassing model settings:
degrees of reasoning “effort” in frontier black-box models, and RL
post-training dynamics of open-source models. Our results highlight patterns of
higher informational utility than social utility in reasoning models’ default
behavior, and demonstrate that these patterns shift in predictable ways when
models are prompted to prioritize certain goals over others. Our findings from
LLMs’ training dynamics suggest large shifts in utility values early on in
training with persistent effects of the choice of base model and pretraining
data, compared to feedback dataset or alignment method. Our framework offers a
flexible tool for probing value trade-offs across diverse model types,
providing insights for generating hypotheses about other social behaviors such
as sycophancy and for shaping training regimes that better control trade-offs
between values during model development.
[COMMENTS]
10 pages, 5 figures
[LINK]
http://arxiv.org/abs/2506.20666v3
[DATE]
2025-10-07 01:52:34+08:00
[CATEGORIES]
cs.CL
Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning
[AUTHORS]
Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds
[ABSTRACT]
Tokenization is a necessary component within the current architecture of many
language models, including the transformer-based large language models (LLMs)
of Generative AI, yet its impact on the model’s cognition is often overlooked.
We argue that LLMs demonstrate that the Distributional Hypothesis (DH) is
sufficient for reasonably human-like language performance, and that the
emergence of human-meaningful linguistic units among tokens and current
structural constraints motivate changes to existing, linguistically-agnostic
tokenization techniques, particularly with respect to their roles as (1)
semantic primitives and as (2) vehicles for conveying salient distributional
patterns from human language to the model. We explore tokenizations from a BPE
tokenizer; extant model vocabularies obtained from Hugging Face and tiktoken;
and the information in exemplar token vectors as they move through the layers
of a RoBERTa (large) model. Besides creating sub-optimal semantic building
blocks and obscuring the model’s access to the necessary distributional
patterns, we describe how tokens and pretraining can act as a backdoor for bias
and other unwanted content, which current alignment practices may not
remediate. Additionally, we relay evidence that the tokenization algorithm’s
objective function impacts the LLM’s cognition, despite being arguably
meaningfully insulated from the main system intelligence. [First uploaded to
arXiv in December, 2024.]
[LINK]
http://arxiv.org/abs/2412.10924v6
[DATE]
2025-10-07 01:52:33+08:00
[CATEGORIES]
cs.CL
SLM-MUX: Orchestrating Small Language Models for Reasoning
[AUTHORS]
Chenyu Wang, Zishen Wan, Hao Kang, Emma Chen, Zhiqiang Xie, Tushar Krishna, Vijay Janapa Reddi, Yilun Du
[ABSTRACT]
With the rapid development of language models, the number of small language
models (SLMs) has grown significantly. Although they do not achieve
state-of-the-art accuracy, they are more efficient and often excel at specific
tasks. This raises a natural question: can multiple SLMs be orchestrated into a
system where each contributes effectively, achieving higher accuracy than any
individual model? Existing orchestration methods have primarily targeted
frontier models (e.g., GPT-4) and perform suboptimally when applied to SLMs. To
address this gap, we propose a three-stage approach for orchestrating SLMs.
First, we introduce SLM-MUX, a multi-model architecture that effectively
coordinates multiple SLMs. Building on this, we develop two optimization
strategies: (i) a model selection search that identifies the most complementary
SLMs from a given pool, and (ii) test-time scaling tailored to SLM-MUX. Our
approach delivers strong results: Compared to existing orchestration methods,
our approach achieves up to 13.4% improvement on MATH, 8.8% on GPQA, and 7.0%
on GSM8K. With just two SLMs, SLM-MUX outperforms Qwen 2.5 72B on GPQA and
GSM8K, and matches its performance on MATH. We further provide theoretical
analyses to substantiate the advantages of our method. In summary, we
demonstrate that SLMs can be effectively orchestrated into more accurate and
efficient systems through the proposed approach.
[LINK]
http://arxiv.org/abs/2510.05077v1
[DATE]
2025-10-07 01:49:58+08:00
[CATEGORIES]
cs.CL
The Telephone Game: Evaluating Semantic Drift in Unified Models
[AUTHORS]
Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, Mubarak Shah
[ABSTRACT]
Employing a single, unified model (UM) for both visual understanding
(image-to-text: I2T) and visual generation (text-to-image: T2I) has opened a
new direction in Visual Language Model (VLM) research. While UMs can also
support broader unimodal tasks (e.g., text-to-text, image-to-image), we focus
on the core cross-modal pair T2I and I2T. Existing evaluation benchmarks
consider these capabilities in isolation: FID and GenEval for T2I, and
benchmarks such as MME, MMBench for I2T. These isolated single-pass metrics do
not reveal cross-consistency: whether a model that “understands” a concept can
also “render” it, nor whether semantic meaning is preserved when cycling
between image and text modalities. To address this, we introduce the Semantic
Drift Protocol (SDP) for Unified Models, a cyclic evaluation protocol that
alternates I2T and T2I over multiple generations to quantify semantic drift. We
propose two metrics: (i) Mean Cumulative Drift (MCD), an embedding-based
measure of overall semantic drift; and (ii) Multi-Generation GenEval (MGG), an
object-level compliance score extending GenEval. To assess generalization
beyond the COCO dataset, which is widely used in training, we create a new
benchmark, Nocaps+Docci400, sampled from NoCaps and DOCCI, and evaluate seven
recent models on it. SDP reveals substantial variation in cross-modal stability: some
models like BAGEL maintain semantic meaning over many alternations, whereas
others like VILA-U drift quickly despite strong single-pass scores. Our results
highlight SDP as a necessary complement to standard I2T and T2I evaluations.
Code is available at
https://github.com/mollahsabbir/Semantic-Drift-in-Unified-Models
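The cyclic protocol can be sketched as a loop that alternates the two directions and compares each regenerated caption with the starting text. This is a minimal sketch under stated assumptions: `t2i`, `i2t`, and `embed` are hypothetical callables supplied by the user, and the mean cosine similarity below is only a stand-in for an embedding-based drift score, since the abstract does not give the exact MCD formula.

```python
import math


def cosine(u, v):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cyclic_drift(start_text, t2i, i2t, embed, generations=3):
    """Alternate T2I and I2T, recording similarity to the starting text."""
    reference = embed(start_text)
    text, sims = start_text, []
    for _ in range(generations):
        image = t2i(text)   # text-to-image step
        text = i2t(image)   # image-to-text step
        sims.append(cosine(reference, embed(text)))
    return sims


def mean_similarity(sims):
    # Assumed summary statistic; higher means less drift.
    return sum(sims) / len(sims)
```

A model that preserves meaning perfectly would keep every similarity near 1.0 across generations, while a drifting model would show a declining sequence.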
[LINK]
http://arxiv.org/abs/2509.04438v2
[DATE]
2025-10-07 01:49:39+08:00
[CATEGORIES]
cs.CL
SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs
[AUTHORS]
Dachuan Shi, Abedelkadir Asi, Keying Li, Xiangchi Yuan, Leyan Pan, Wenke Lee, Wen Xiao
[ABSTRACT]
Recent work shows that, beyond discrete reasoning through explicit
chain-of-thought steps, which are limited by the boundaries of natural
languages, large language models (LLMs) can also reason continuously in latent
space, allowing richer information per step and thereby improving token
efficiency. Despite this promise, latent reasoning still faces two challenges,
especially in training-free settings: 1) purely latent reasoning broadens the
search distribution by maintaining multiple implicit paths, which diffuses
probability mass, introduces noise, and impedes convergence to a single
high-confidence solution, thereby hurting accuracy; and 2) overthinking
persists even without explicit text, wasting tokens and degrading efficiency.
To address these issues, we introduce SwiReasoning, a training-free framework
for LLM reasoning which features two key innovations: 1) SwiReasoning
dynamically switches between explicit and latent reasoning, guided by
block-wise confidence estimated from entropy trends in next-token
distributions, to balance exploration and exploitation and promote timely
convergence. 2) By limiting the maximum number of thinking-block switches,
SwiReasoning curbs overthinking and improves token efficiency across varying
problem difficulties. On widely used mathematics and STEM benchmarks,
SwiReasoning consistently improves average accuracy by 1.5%-2.8% across
reasoning LLMs of different model families and scales. Furthermore, under
constrained budgets, SwiReasoning improves average token efficiency by 56%-79%,
with larger gains as budgets tighten.
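A rough sketch of the switching idea follows, assuming a simple rule: falling entropy (rising confidence) moves the model to explicit reasoning, rising entropy moves it to latent reasoning, and a cap on the number of switches forces explicit reasoning. The function names, the direction of the rule, and the behavior at the cap are illustrative assumptions, not details taken from the abstract.

```python
import math


def entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)


def next_mode(mode, prev_entropy, cur_entropy, switches, max_switches):
    """Return (new_mode, new_switch_count) under the assumed rule."""
    if switches >= max_switches:
        # Assumed behavior: once the cap is hit, commit to explicit reasoning.
        return "explicit", switches
    if cur_entropy < prev_entropy:
        target = "explicit"   # confidence rising
    elif cur_entropy > prev_entropy:
        target = "latent"     # confidence falling
    else:
        target = mode         # no trend, keep the current mode
    if target != mode:
        switches += 1
    return target, switches
```

Capping `switches` is what bounds the amount of thinking in this sketch: after `max_switches` transitions the mode can no longer return to latent exploration.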
[COMMENTS]
Code: https://github.com/sdc17/SwiReasoning, Website:
https://swireasoning.github.io/
[LINK]
http://arxiv.org/abs/2510.05069v1
[DATE]
2025-10-07 01:46:34+08:00
[CATEGORIES]
cs.CL
Proactive defense against LLM Jailbreak
[AUTHORS]
Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang
[ABSTRACT]
The proliferation of powerful large language models (LLMs) has necessitated
robust safety alignment, yet these models remain vulnerable to evolving
adversarial attacks, including multi-turn jailbreaks that iteratively search
for successful queries. Current defenses, primarily reactive and static, often
fail to counter these search-based attacks. In this paper, we introduce ProAct,
a novel proactive defense framework designed to disrupt and mislead autonomous
jailbreaking processes. Our core idea is to intentionally provide adversaries
with “spurious responses” that appear to be results of successful jailbreak
attacks but contain no actual harmful content. These misleading responses
provide false signals to the attacker’s internal optimization loop, causing the
adversarial search to terminate prematurely and effectively jailbreaking the
jailbreak. By conducting extensive experiments across state-of-the-art LLMs,
jailbreaking frameworks, and safety benchmarks, our method consistently and
significantly reduces attack success rates by up to 92%. When combined with
other defense frameworks, it further reduces the success rate of the latest
attack strategies to 0%. ProAct represents an orthogonal defense strategy that
can serve as an additional guardrail to enhance LLM safety against the most
effective jailbreaking attacks.
[LINK]
http://arxiv.org/abs/2510.05052v1
[DATE]
2025-10-07 01:32:40+08:00
[CATEGORIES]
cs.CL
Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse Reinforcement Learning
[AUTHORS]
Jared Joselowitz, Ritam Majumdar, Arjun Jagota, Matthieu Bou, Nyal Patel, Satyapriya Krishna, Sonali Parbhoo
[COMMENTS]
Published as a conference paper at COLM 2025
[LINK]
http://arxiv.org/abs/2410.12491v3
[DATE]
2025-10-07 01:25:58+08:00
[CATEGORIES]
cs.CL
Rethinking Exact Unlearning under Exposure: Extracting Forgotten Data under Exact Unlearning in Large Language Model
[AUTHORS]
Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu
[ABSTRACT]
Large Language Models are typically trained on datasets collected from the
web, which may inadvertently contain harmful or sensitive personal information.
To address growing privacy concerns, unlearning methods have been proposed to
remove the influence of specific data from trained models. Of these, exact
unlearning – which retrains the model from scratch without the target data –
is widely regarded as the gold standard for mitigating privacy risks in
deployment. In this paper, we revisit this assumption in a practical deployment
setting where both the pre- and post-unlearning logits API are exposed, such as
in open-weight scenarios. Targeting this setting, we introduce a novel data
extraction attack that leverages signals from the pre-unlearning model to guide
the post-unlearning model, uncovering patterns that reflect the removed data
distribution. Combining model guidance with a token filtering strategy, our
attack significantly improves extraction success rates – doubling performance
in some cases – across common benchmarks such as MUSE, TOFU, and WMDP.
Furthermore, we demonstrate our attack’s effectiveness on a simulated medical
diagnosis dataset to highlight real-world privacy risks associated with exact
unlearning. In light of our findings, which suggest that unlearning may, in a
contradictory way, increase the risk of privacy leakage during real-world
deployments, we advocate for evaluation of unlearning methods to consider
broader threat models that account not only for post-unlearning models but also
for adversarial access to prior checkpoints. Code is publicly available at:
https://github.com/Nicholas0228/unlearned_data_extraction_llm.
[COMMENTS]
Accepted by NeurIPS 2025
[LINK]
http://arxiv.org/abs/2505.24379v2
[DATE]
2025-10-07 01:21:05+08:00
[CATEGORIES]
cs.LG
cs.CL
RealKIE: Five Novel Datasets for Enterprise Key Information Extraction
[AUTHORS]
Benjamin Townsend, Madison May, Katherine Mackowiak, Christopher Wells
[ABSTRACT]
We introduce RealKIE, a benchmark of five challenging datasets aimed at
advancing key information extraction methods, with an emphasis on enterprise
applications. The datasets cover a diverse range of documents, including SEC
S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and
Resource Contracts. Each presents unique challenges: poor text serialization,
sparse annotations in long documents, and complex tabular layouts. These
datasets provide a realistic testing ground for key information extraction
tasks like investment analysis and contract analysis. In addition to presenting
these datasets, we offer an in-depth description of the annotation process,
document processing techniques, and baseline modeling approaches. This
contribution facilitates the development of NLP models capable of handling
practical challenges and supports further research into information extraction
technologies applicable to industry-specific problems. The annotated data, OCR
outputs, and code to reproduce baselines are available to download at
https://indicodatasolutions.github.io/RealKIE/.
[LINK]
http://arxiv.org/abs/2403.20101v2
[DATE]
2025-10-07 01:14:08+08:00
[CATEGORIES]
cs.CL
cs.LG
A Set of Quebec-French Corpus of Regional Expressions and Terms
[AUTHORS]
David Beauchemin, Yan Tremblay, Mohamed Amine Youssef, Richard Khoury
[COMMENTS]
Submitted to ACL Rolling Review of October
[LINK]
http://arxiv.org/abs/2510.05026v1
[DATE]
2025-10-07 01:04:22+08:00
[CATEGORIES]
cs.CL
Imperceptible Jailbreaking against Large Language Models
[AUTHORS]
Kuofeng Gao, Yiming Li, Chao Du, Xin Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang
[ABSTRACT]
Jailbreaking attacks on the vision modality typically rely on imperceptible
adversarial perturbations, whereas attacks on the textual modality are
generally assumed to require visible modifications (e.g., non-semantic
suffixes). In this paper, we introduce imperceptible jailbreaks that exploit a
class of Unicode characters called variation selectors. By appending invisible
variation selectors to malicious questions, the jailbreak prompts appear
visually identical to original malicious questions on screen, while their
tokenization is “secretly” altered. We propose a chain-of-search pipeline to
generate such adversarial suffixes to induce harmful responses. Our experiments
show that our imperceptible jailbreaks achieve high attack success rates
against four aligned LLMs and generalize to prompt injection attacks, all
without producing any visible modifications in the written prompt. Our code is
available at https://github.com/sail-sg/imperceptible-jailbreaks.
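On the defensive side, invisible variation selectors can be detected or removed before a prompt reaches a model. The sketch below is not from the paper; it assumes only the two main Unicode variation-selector blocks (U+FE00 to U+FE0F and U+E0100 to U+E01EF), and the helper names are hypothetical.

```python
def is_variation_selector(ch):
    # Variation Selectors (VS1-VS16) and Variation Selectors Supplement.
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def count_variation_selectors(text):
    # Number of invisible selector code points present in the text.
    return sum(1 for ch in text if is_variation_selector(ch))


def strip_variation_selectors(text):
    # Drop every selector; the visible characters are left untouched.
    return "".join(ch for ch in text if not is_variation_selector(ch))
```

Stripping is a blunt filter: U+FE0F also selects emoji presentation in legitimate text, so a deployment might prefer to flag prompts with an unusually high selector count rather than remove selectors unconditionally.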
[LINK]
http://arxiv.org/abs/2510.05025v1
[DATE]
2025-10-07 01:03:50+08:00
[CATEGORIES]
cs.CL
Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
[AUTHORS]
Wei Xiong, Chenlu Ye, Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang
[ABSTRACT]
Reinforcement learning applied to large language models (LLMs) for reasoning
tasks is often bottlenecked by unstable gradient estimates due to fixed and
uniform sampling of responses across prompts. Prior work such as GVM-RAFT
addresses this by dynamically allocating inference budget per prompt to
minimize stochastic gradient variance under a budget constraint. Inspired by
this insight, we propose Reinforce-Ada, an adaptive sampling framework for
online RL post-training of LLMs that continuously reallocates sampling effort
to the prompts with the greatest uncertainty or learning potential. Unlike
conventional two-stage allocation methods, Reinforce-Ada interleaves estimation
and sampling in an online successive elimination process, and automatically
stops sampling for a prompt once sufficient signal is collected. To stabilize
updates, we form fixed-size groups with enforced reward diversity and compute
advantage baselines using global statistics aggregated over the adaptive
sampling phase. Empirical results across multiple model architectures and
reasoning benchmarks show that Reinforce-Ada accelerates convergence and
improves final performance compared to GRPO, especially when using the balanced
sampling variant. Our work highlights the central role of variance-aware,
adaptive data curation in enabling efficient and reliable reinforcement
learning for reasoning-capable LLMs. Code is available at
https://github.com/RLHFlow/Reinforce-Ada.
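The online successive-elimination idea can be sketched as sampling in rounds and deactivating a prompt once enough signal has been seen. This is a minimal sketch under assumptions: `sample_reward` is a hypothetical callable returning a binary reward, and the stopping rule (at least one success and one failure observed) is an illustrative choice rather than the paper's exact criterion.

```python
def adaptive_sample(prompts, sample_reward, max_rounds=8, per_round=2):
    """Collect binary rewards per prompt, stopping early per prompt.

    prompts: list of hashable prompt identifiers.
    sample_reward: hypothetical callable, prompt -> 0 or 1.
    """
    collected = {p: [] for p in prompts}
    active = set(prompts)
    for _ in range(max_rounds):
        if not active:
            break
        # Iterate in the original order so the run is deterministic.
        for p in [q for q in prompts if q in active]:
            collected[p].extend(sample_reward(p) for _ in range(per_round))
            # Assumed stopping rule: both outcomes seen, so the group
            # already carries a non-zero learning signal.
            if 0 in collected[p] and 1 in collected[p]:
                active.discard(p)
    return collected
```

Under this rule, prompts that are consistently solved or consistently failed keep receiving samples until the round budget runs out, while mixed-outcome prompts are retired early, which is how the sampling effort is reallocated.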
[COMMENTS]
16 pages, 6 figures
[LINK]
http://arxiv.org/abs/2510.04996v1
[DATE]
2025-10-07 00:34:09+08:00
[CATEGORIES]
cs.LG
cs.CL
Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
[AUTHORS]
Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi
[ABSTRACT]
Post-training alignment often reduces LLM diversity, leading to a phenomenon
known as mode collapse. Unlike prior work that attributes this effect to
algorithmic limitations, we identify a fundamental, pervasive data-level
driver: typicality bias in preference data, whereby annotators systematically
favor familiar text as a result of well-established findings in cognitive
psychology. We formalize this bias theoretically, verify it on preference
datasets empirically, and show that it plays a central role in mode collapse.
Motivated by this analysis, we introduce Verbalized Sampling, a simple,
training-free prompting strategy to circumvent mode collapse. VS prompts the
model to verbalize a probability distribution over a set of responses (e.g.,
“Generate 5 jokes about coffee and their corresponding probabilities”).
Comprehensive experiments show that VS significantly improves performance
across creative writing (poems, stories, jokes), dialogue simulation,
open-ended QA, and synthetic data generation, without sacrificing factual
accuracy and safety. For instance, in creative writing, VS increases diversity
by 1.6-2.1x over direct prompting. We further observe an emergent trend that
more capable models benefit more from VS. In sum, our work provides a new
data-centric perspective on mode collapse and a practical inference-time remedy
that helps unlock pre-trained generative diversity.
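The prompting pattern can be illustrated with a small helper that builds the request and then draws one response from the verbalized distribution. This is a sketch under assumptions: the prompt wording follows the example quoted in the abstract, while `vs_prompt`, `sample_response`, and the idea of sampling from the returned probabilities are illustrative choices, and parsing of real model output is omitted.

```python
import random


def vs_prompt(task, k=5):
    # Ask for k candidates together with their verbalized probabilities.
    return f"Generate {k} {task} and their corresponding probabilities."


def sample_response(candidates, rng=random):
    """Draw one response from (text, probability) pairs.

    candidates: list of (text, probability) tuples as verbalized by the
    model; the weights are treated as relative and need not sum to 1.
    """
    texts = [text for text, _ in candidates]
    weights = [prob for _, prob in candidates]
    return rng.choices(texts, weights=weights, k=1)[0]
```

For example, `vs_prompt("jokes about coffee")` reproduces the request quoted in the abstract, and `sample_response` then picks among the returned jokes in proportion to the stated probabilities.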
[COMMENTS]
79 pages, 27 figures, 31 tables. Code is available at
https://github.com/CHATS-lab/verbalize-sampling
[LINK]
http://arxiv.org/abs/2510.01171v2
[DATE]
2025-10-07 00:29:44+08:00
[CATEGORIES]
cs.CL
LLM-Hanabi: Evaluating Multi-Agent Gameplays with Theory-of-Mind and Rationale Inference in Imperfect Information Collaboration Game
[AUTHORS]
Fangzhou Liang, Tianshi Zheng, Chunkit Chan, Yauwai Yim, Yangqiu Song
[ABSTRACT]
Effective multi-agent collaboration requires agents to infer the rationale
behind others’ actions, a capability rooted in Theory-of-Mind (ToM). While
recent Large Language Models (LLMs) excel at logical inference, their ability
to infer rationale in dynamic, collaborative settings remains under-explored.
This study introduces LLM-Hanabi, a novel benchmark that uses the cooperative
game Hanabi to evaluate the rationale inference and ToM of LLMs. Our framework
features an automated evaluation system that measures both game performance and
ToM proficiency. Across a range of models, we find a significant positive
correlation between ToM and in-game success. Notably, first-order ToM
(interpreting others’ intent) correlates more strongly with performance than
second-order ToM (predicting others’ interpretations). These findings highlight
that for effective AI collaboration, the ability to accurately interpret a
partner’s rationale is more critical than higher-order reasoning. We conclude
that prioritizing first-order ToM is a promising direction for enhancing the
collaborative capabilities of future models.
[COMMENTS]
EMNLP 2025 Wordplay
[LINK]
http://arxiv.org/abs/2510.04980v1
[DATE]
2025-10-07 00:17:24+08:00
[CATEGORIES]
cs.CL
EmoHRNet: High-Resolution Neural Network Based Speech Emotion Recognition
[AUTHORS]
Akshay Muppidi, Martin Radfar
[ABSTRACT]
Speech emotion recognition (SER) is pivotal for enhancing human-machine
interactions. This paper introduces “EmoHRNet”, a novel adaptation of
High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is
designed to maintain high-resolution representations from the initial to the
final layers. By transforming audio samples into spectrograms, EmoHRNet
leverages the HRNet architecture to extract high-level features. EmoHRNet’s
unique architecture maintains high-resolution representations throughout,
capturing both granular and overarching emotional cues from speech signals. The
model outperforms leading models, achieving accuracies of 92.45% on RAVDESS,
80.06% on IEMOCAP, and 92.77% on EMOVO. Thus, we show that EmoHRNet sets a new
benchmark in the SER domain.
[LINK]
http://arxiv.org/abs/2510.06072v1
[DATE]
2025-10-07 23:59:40+08:00
[CATEGORIES]
cs.LG
Analyzing the Effect of Embedding Norms and Singular Values to Oversmoothing in Graph Neural Networks
[AUTHORS]
Dimitrios Kelesis, Dimitris Fotakis, Georgios Paliouras
[ABSTRACT]
In this paper, we study the factors that contribute to the effect of
oversmoothing in deep Graph Neural Networks (GNNs). Specifically, our analysis
is based on a new metric (Mean Average Squared Distance - $MASED$) to quantify
the extent of oversmoothing. We derive layer-wise bounds on $MASED$, which
aggregate to yield global upper and lower distance bounds. Based on this
quantification of oversmoothing, we further analyze the importance of two
different properties of the model; namely the norms of the generated node
embeddings, along with the largest and smallest singular values of the weight
matrices. Building on the insights drawn from the theoretical analysis, we show
that oversmoothing increases as the number of trainable weight matrices and the
number of adjacency matrices increase. We also use the derived layer-wise
bounds on $MASED$ to form a proposal for decoupling the number of hops (i.e.,
adjacency depth) from the number of weight matrices. In particular, we
introduce G-Reg, a regularization scheme that increases the bounds, and
demonstrate through extensive experiments that by doing so node classification
accuracy increases, achieving robustness at large depths. We further show that
by reducing oversmoothing in deep networks, we can achieve better results in
some tasks than using shallow ones. Specifically, we experiment with a “cold
start” scenario, i.e., when there is no feature information for the unlabeled
nodes. Finally, we show empirically the trade-off between receptive field size
(i.e., number of weight matrices) and performance, using the $MASED$ bounds.
This is achieved by distributing adjacency hops across a small number of
trainable layers, avoiding the extremes of under- or over-parameterization of
the GNN.
[LINK]
http://arxiv.org/abs/2510.06066v1
[DATE]
2025-10-07 23:55:28+08:00
[CATEGORIES]
cs.LG
TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis
[AUTHORS]
Austin Feng, Andreas Varvarigos, Ioannis Panitsas, Daniela Fernandez, Jinbiao Wei, Yuwei Guo, Jialin Chen, Ali Maatouk, Leandros Tassiulas, Rex Ying
[ABSTRACT]
Modern enterprises generate vast streams of time series metrics when
monitoring complex systems, known as observability data. Unlike conventional
time series from domains such as weather, observability data are zero-inflated,
highly stochastic, and exhibit minimal temporal structure. Despite their
importance, observability datasets are underrepresented in public benchmarks
due to proprietary restrictions. Existing datasets are often anonymized and
normalized, removing scale information and limiting their use for tasks beyond
forecasting, such as anomaly detection, root-cause analysis, and multi-modal
reasoning. To address this gap, we introduce TelecomTS, a large-scale
observability dataset derived from a 5G telecommunications network. TelecomTS
features heterogeneous, de-anonymized covariates with explicit scale
information and supports a suite of downstream tasks, including anomaly
detection, root-cause analysis, and a question-answering benchmark requiring
multi-modal reasoning. Benchmarking state-of-the-art time series, language, and
reasoning models reveals that existing approaches struggle with the abrupt,
noisy, and high-variance dynamics of observability data. Our experiments also
underscore the importance of preserving covariates’ absolute scale, emphasizing
the need for foundation time series models that natively leverage scale
information for practical observability applications.
[LINK]
http://arxiv.org/abs/2510.06063v1
[DATE]
2025-10-07 23:54:34+08:00
[CATEGORIES]
cs.LG
Medical Vision Language Models as Policies for Robotic Surgery
[AUTHORS]
Akshay Muppidi, Martin Radfar
[ABSTRACT]
Vision-based Proximal Policy Optimization (PPO) struggles with visual
observation-based robotic laparoscopic surgical tasks due to the
high-dimensional nature of visual input, the sparsity of rewards in surgical
environments, and the difficulty of extracting task-relevant features from raw
visual data. We introduce a simple approach integrating MedFlamingo, a medical
domain-specific Vision-Language Model, with PPO. Our method is evaluated on
five diverse laparoscopic surgery task environments in LapGym, using only
endoscopic visual observations. MedFlamingo PPO outperforms, and converges
faster than, both standard vision-based PPO and OpenFlamingo PPO
baselines, achieving task success rates exceeding 70% across all environments,
with improvements ranging from 66.67% to 1114.29% compared to baseline. By
processing task observations and instructions once per episode to generate
high-level planning tokens, our method efficiently combines medical expertise
with real-time visual feedback. Our results highlight the value of specialized
medical knowledge in robotic surgical planning and decision-making.
[COMMENTS]
IEEE CAI 2025
[LINK]
http://arxiv.org/abs/2510.06064v1
[DATE]
2025-10-07 23:54:34+08:00
[CATEGORIES]
cs.LG
Unified Cross-Modal Medical Image Synthesis with Hierarchical Mixture of Product-of-Experts
[AUTHORS]
Reuben Dorent, Nazim Haouchine, Alexandra Golby, Sarah Frisken, Tina Kapur, William Wells
[ABSTRACT]
We propose a deep mixture of multimodal hierarchical variational
auto-encoders called MMHVAE that synthesizes missing images from observed
images in different modalities. MMHVAE’s design focuses on tackling four
challenges: (i) creating a complex latent representation of multimodal data to
generate high-resolution images; (ii) encouraging the variational distributions
to estimate the missing information needed for cross-modal image synthesis;
(iii) learning to fuse multimodal information in the context of missing data;
(iv) leveraging dataset-level information to handle incomplete data sets at
training time. Extensive experiments are performed on the challenging problem
of pre-operative brain multi-parametric magnetic resonance and intra-operative
ultrasound imaging.
[COMMENTS]
Accepted in IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI)
[LINK]
http://arxiv.org/abs/2410.19378v3
[DATE]
2025-10-07 23:45:34+08:00
[CATEGORIES]
cs.LG
Edit-Based Flow Matching for Temporal Point Processes
[AUTHORS]
David Lüdke, Marten Lienen, Marcel Kollovieh, Stephan Günnemann
[ABSTRACT]
Temporal point processes (TPPs) are a fundamental tool for modeling event
sequences in continuous time, but most existing approaches rely on
autoregressive parameterizations that are limited by their sequential sampling.
Recent non-autoregressive, diffusion-style models mitigate these issues by
jointly interpolating between noise and data through event insertions and
deletions in a discrete Markov chain. In this work, we generalize this
perspective and introduce an Edit Flow process for TPPs that transports noise
to data via insert, delete, and substitute edit operations. By learning the
instantaneous edit rates within a continuous-time Markov chain framework, we
attain a flexible and efficient model that effectively reduces the total number
of necessary edit operations during generation. Empirical results demonstrate
the generative flexibility of our unconditionally trained model in a wide range
of unconditional and conditional generation tasks on benchmark TPPs.
[LINK]
http://arxiv.org/abs/2510.06050v1
[DATE]
2025-10-07 23:44:12+08:00
[CATEGORIES]
cs.LG
From Learning to Mastery: Achieving Safe and Efficient Real-World Autonomous Driving with Human-In-The-Loop Reinforcement Learning
[AUTHORS]
Li Zeqiao, Wang Yijing, Wang Haoyu, Li Zheng, Li Peng, Liu Wenfei, Zuo Zhiqiang
[ABSTRACT]
Autonomous driving with reinforcement learning (RL) has significant
potential. However, applying RL in real-world settings remains challenging due
to the need for safe, efficient, and robust learning. Incorporating human
expertise into the learning process can help overcome these challenges by
reducing risky exploration and improving sample efficiency. In this work, we
propose a reward-free, active human-in-the-loop learning method called
Human-Guided Distributional Soft Actor-Critic (H-DSAC). Our method combines
Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC) to
enable efficient and safe training in real-world environments. The key
innovation is the construction of a distributed proxy value function within the
DSAC framework. This function encodes human intent by assigning higher expected
returns to expert demonstrations and penalizing actions that require human
intervention. By extrapolating these labels to unlabeled states, the policy is
effectively guided toward expert-like behavior. With a well-designed state
space, our method achieves real-world driving policy learning within practical
training times. Results from both simulation and real-world experiments
demonstrate that our framework enables safe, robust, and sample-efficient
learning for autonomous driving.
[LINK]
http://arxiv.org/abs/2510.06038v1
[DATE]
2025-10-07 23:33:29+08:00
[CATEGORIES]
cs.LG
Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
[AUTHORS]
Yifei Wang, Weimin Bai, Colin Zhang, Debing Zhang, Weijian Luo, He Sun
[ABSTRACT]
In this paper, we unify more than 10 existing one-step diffusion distillation
approaches, such as Diff-Instruct, DMD, SIM, SiD, and $f$-distill, inside a
theory-driven framework which we name Uni-Instruct.
Uni-Instruct is motivated by our proposed diffusion expansion theory of the
$f$-divergence family. Then we introduce key theories that overcome the
intractability issue of the original expanded $f$-divergence, resulting in an
equivalent yet tractable loss that effectively trains one-step diffusion models
by minimizing the expanded $f$-divergence family. The novel unification
introduced by Uni-Instruct not only offers new theoretical contributions that
help understand existing approaches from a high-level perspective but also
leads to state-of-the-art one-step diffusion generation performance. On the
CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Fréchet
Inception Distance (FID) values of 1.46 for unconditional
generation and 1.38 for conditional generation. On the
ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA
one-step generation FID of 1.02, which outperforms its 79-step
teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35).
We also apply Uni-Instruct to broader tasks such as text-to-3D generation,
where it gives decent results that slightly
outperform previous methods, such as SDS and VSD, in terms of both generation
quality and diversity. Both the solid theoretical and empirical contributions
of Uni-Instruct will potentially help future studies on one-step diffusion
distillation and knowledge transfer of diffusion models.
[LINK]
http://arxiv.org/abs/2505.20755v3
[DATE]
2025-10-07 23:30:03+08:00
[CATEGORIES]
cs.LG
Fundamental Limits of Membership Inference Attacks on Machine Learning Models
[AUTHORS]
Eric Aubinais, Elisabeth Gassiat, Pablo Piantanida
[COMMENTS]
Accepted for publication in JMLR
[LINK]
http://arxiv.org/abs/2310.13786v6
[DATE]
2025-10-07 23:29:38+08:00
[CATEGORIES]
cs.LG
Adaptive Pruning for Increased Robustness and Reduced Computational Overhead in Gaussian Process Accelerated Saddle Point Searches
[AUTHORS]
Rohit Goswami, Hannes Jónsson
[ABSTRACT]
Gaussian process (GP) regression provides a strategy for accelerating saddle
point searches on high-dimensional energy surfaces by reducing the number of
times the energy and its derivatives with respect to atomic coordinates need to
be evaluated. The computational overhead in the hyperparameter optimization
can, however, be large and make the approach inefficient. Failures can also
occur if the search ventures too far into regions that are not represented well
enough by the GP model. Here, these challenges are resolved by using
geometry-aware optimal transport measures and an active pruning strategy using
a summation over Wasserstein-1 distances for each atom-type in farthest-point
sampling, selecting a fixed-size subset of geometrically diverse configurations
to avoid rapidly increasing cost of GP updates as more observations are made.
Stability is enhanced by a permutation-invariant metric that provides a reliable
trust radius for early-stopping and a logarithmic barrier penalty for the
growth of the signal variance. These physically motivated algorithmic changes
prove their efficacy by reducing the mean computational time to less than half
on a set of 238 challenging configurations from a previously published
data set of chemical reactions. With these improvements, the GP approach is
established as a robust and scalable algorithm for accelerating saddle point
searches when the evaluation of the energy and atomic forces requires
significant computational effort.
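The fixed-size subset selection mentioned above can be illustrated with a generic farthest-point sampling loop. This is a minimal sketch, not the authors' implementation: the function name `farthest_point_sampling` is hypothetical, and it takes a precomputed pairwise distance matrix. In the paper's setting that matrix would presumably hold the summed per-atom-type Wasserstein-1 distances; here it is filled with plain toy numbers.

```python
def farthest_point_sampling(dist, k, start=0):
    """Greedily pick k indices that are mutually far apart.

    dist  : square list-of-lists of pairwise distances (assumed symmetric)
    k     : size of the subset to keep
    start : index of the first selected point (an arbitrary choice here)
    """
    n = len(dist)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point
    min_dist = list(dist[start])
    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], dist[nxt][i])
    return selected


# Toy example: five points on a line at positions 0, 1, 2, 9, 10
positions = [0.0, 1.0, 2.0, 9.0, 10.0]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
subset = farthest_point_sampling(toy_dist, k=3)
```

Keeping the subset size fixed bounds the cost of each GP update, which is the motivation the abstract gives for pruning.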
[COMMENTS]
Invited article for the ChemPhysChem special issue dedicated to the
60th birthday of Prof. Debabrata Goswami. A preliminary version of this work
was presented at the UNOOS 2025 conference
[LINK]
http://arxiv.org/abs/2510.06030v1
[DATE]
2025-10-07 23:27:39+08:00
[CATEGORIES]
cs.LG
Fast Leave-One-Out Approximation from Fragment-Target Prevalence Vectors (molFTP) : From Dummy Masking to Key-LOO for Leakage-Free Feature Construction
[AUTHORS]
Guillaume Godin
[ABSTRACT]
We introduce molFTP (molecular fragment-target prevalence), a compact
representation that delivers strong predictive performance. To prevent feature
leakage across cross-validation folds, we implement a dummy-masking procedure
that removes information about fragments present in the held-out molecules. We
further show that key leave-one-out (key-LOO) closely approximates true
molecule-level leave-one-out (LOO), with deviation below 8% on our datasets.
This enables near full data training while preserving unbiased cross-validation
estimates of model performance. Overall, molFTP provides a fast,
leakage-resistant fragment-target prevalence vectorization with practical
safeguards (dummy masking or key-LOO) that approximate LOO at a fraction of its
cost.
[COMMENTS]
28 pages, 21 figures, 3 tables
[LINK]
http://arxiv.org/abs/2510.06029v1
[DATE]
2025-10-07 23:27:16+08:00
[CATEGORIES]
cs.LG
Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime
[AUTHORS]
Andreas Maurer, Erfan Mirzaei, Massimiliano Pontil
[ABSTRACT]
The paper provides data-dependent bounds on the test error of the Gibbs
algorithm in the overparameterized interpolation regime, where low training
errors are also obtained for impossible data, such as random labels in
classification. The bounds are stable under approximation with Langevin Monte
Carlo algorithms. Experiments on the MNIST and CIFAR-10 datasets verify that
the bounds yield nontrivial predictions on true labeled data and correctly
upper bound the test error for random labels. Our method indicates that
generalization in the low-temperature, interpolation regime is already signaled
by small training errors in the more classical high temperature regime.
[LINK]
http://arxiv.org/abs/2510.06028v1
[DATE]
2025-10-07 23:25:56+08:00
[CATEGORIES]
cs.LG
Out-of-Distribution Detection from Small Training Sets using Bayesian Neural Network Classifiers
[AUTHORS]
Kevin Raina, Tanya Schmah
[ABSTRACT]
Out-of-Distribution (OOD) detection is critical to AI reliability and safety,
yet in many practical settings, only a limited amount of training data is
available. Bayesian Neural Networks (BNNs) are a promising class of model on
which to base OOD detection, because they explicitly represent epistemic (i.e.
model) uncertainty. In the small training data regime, BNNs are especially
valuable because they can incorporate prior model information. We introduce a
new family of Bayesian posthoc OOD scores based on expected logit vectors, and
compare 5 Bayesian and 4 deterministic posthoc OOD scores. Experiments with MNIST
and CIFAR-10 as in-distribution data, with 5000 training samples or fewer, show that
the Bayesian methods outperform corresponding deterministic methods.
[COMMENTS]
British Machine Vision Conference (BMVC) 2025; 18 pages, 6 figures, 3
tables
[LINK]
http://arxiv.org/abs/2510.06025v1
[DATE]
2025-10-07 23:23:05+08:00
[CATEGORIES]
cs.LG
RamPINN: Recovering Raman Spectra From Coherent Anti-Stokes Spectra Using Embedded Physics
[AUTHORS]
Sai Karthikeya Vemuri, Adithya Ashok Chalain Valapil, Tim Büchner, Joachim Denzler
[ABSTRACT]
Transferring the recent advancements in deep learning into scientific
disciplines is hindered by the lack of the required large-scale datasets for
training. We argue that in these knowledge-rich domains, the established body
of scientific theory provides reliable inductive biases in the form of
governing physical laws. We address the ill-posed inverse problem of recovering
Raman spectra from noisy Coherent Anti-Stokes Raman Scattering (CARS)
measurements, as the true Raman signal here is suppressed by a dominating
non-resonant background. We propose RamPINN, a model that learns to recover
Raman spectra from given CARS spectra. Our core methodological contribution is
a physics-informed neural network that utilizes a dual-decoder architecture to
disentangle resonant and non-resonant signals. This is done by enforcing the
Kramers-Kronig causality relations via a differentiable Hilbert transform loss
on the resonant and a smoothness prior on the non-resonant part of the signal.
Trained entirely on synthetic data, RamPINN demonstrates strong zero-shot
generalization to real-world experimental data, explicitly closing this gap and
significantly outperforming existing baselines. Furthermore, we show that
training with these physics-based losses alone, without access to any
ground-truth Raman spectra, still yields competitive results. This work
highlights a broader concept: formal scientific rules can act as a potent
inductive bias, enabling robust, self-supervised learning in data-limited
scientific domains.
[LINK]
http://arxiv.org/abs/2510.06020v1
[DATE]
2025-10-07 23:18:44+08:00
[CATEGORIES]
cs.LG
Teaching Metric Distance to Discrete Autoregressive Language Models
[AUTHORS]
Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
[ABSTRACT]
As large language models expand beyond natural language to domains such as
mathematics, multimodal understanding, and embodied agents, tokens increasingly
reflect metric relationships rather than purely linguistic meaning. We
introduce DIST2Loss, a distance-aware framework designed to train
autoregressive discrete models by leveraging predefined distance relationships
among output tokens. At its core, DIST2Loss transforms continuous exponential
family distributions derived from inherent distance metrics into discrete,
categorical optimization targets compatible with the models’ architectures.
This approach enables the models to learn and preserve meaningful distance
relationships during token generation while maintaining compatibility with
existing architectures. Empirical evaluations show consistent performance gains
in diverse multimodal applications, including visual grounding, robotic
manipulation, generative reward modeling, and image generation using
vector-quantized features. These improvements are most notable in low-data
regimes, demonstrating DIST2Loss’s strength under resource constraints.
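One way to picture the step from a distance metric to a categorical target is a softmax over negative distances to the ground-truth token. The following is a minimal sketch under stated assumptions: the function names, the temperature `tau`, and the use of absolute difference as the metric are illustrative choices and are not taken from the paper.

```python
import math


def distance_aware_targets(token_values, true_index, tau=1.0):
    """Soft categorical target with p(v) proportional to exp(-d(v, v_true) / tau).

    token_values : numeric value represented by each output token (assumed)
    true_index   : index of the ground-truth token
    tau          : temperature; smaller values approach a one-hot target
    """
    true_value = token_values[true_index]
    logits = [-abs(v - true_value) / tau for v in token_values]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(target, predicted):
    """Cross-entropy between a soft target and strictly positive predictions."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Tokens standing for the integers 0..4, with 2 as the correct answer
targets = distance_aware_targets([0, 1, 2, 3, 4], true_index=2, tau=1.0)
uniform = [0.2] * 5
loss = cross_entropy(targets, uniform)
```

Compared with a one-hot target, neighbouring tokens (1 and 3) receive more probability mass than distant ones (0 and 4), so a near miss is penalised less than a far miss.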
[LINK]
http://arxiv.org/abs/2503.02379v4
[DATE]
2025-10-07 23:18:30+08:00
[CATEGORIES]
cs.LG
Hybrid Quantum-Classical Policy Gradient for Adaptive Control of Cyber-Physical Systems: A Comparative Study of VQC vs. MLP
[AUTHORS]
Aueaphum Aueawatthanaphisut, Nyi Wunna Tun
[ABSTRACT]
The comparative evaluation between classical and quantum reinforcement
learning (QRL) paradigms was conducted to investigate their convergence
behavior, robustness under observational noise, and computational efficiency in
a benchmark control environment. The study employed a multilayer perceptron
(MLP) agent as a classical baseline and a parameterized variational quantum
circuit (VQC) as a quantum counterpart, both trained on the CartPole-v1
environment over 500 episodes. Empirical results demonstrated that the
classical MLP achieved near-optimal policy convergence with a mean return of
498.7 +/- 3.2, maintaining stable equilibrium throughout training. In contrast,
the VQC exhibited limited learning capability, with an average return of 14.6
+/- 4.8, primarily constrained by circuit depth and qubit connectivity. Noise
robustness analysis further revealed that the MLP policy deteriorated
gracefully under Gaussian perturbations, while the VQC displayed higher
sensitivity at equivalent noise levels. Despite the lower asymptotic
performance, the VQC exhibited a significantly lower parameter count and a
marginally increased training time, highlighting its potential scalability for
low-resource quantum processors. The results suggest that while classical
neural policies remain dominant in current control benchmarks, quantum-enhanced
architectures could offer promising efficiency advantages once hardware noise
and expressivity limitations are mitigated.
[COMMENTS]
6 pages, 5 figures, 2 tables, 17 equations, 1 algorithm
[LINK]
http://arxiv.org/abs/2510.06010v1
[DATE]
2025-10-07 23:09:29+08:00
[CATEGORIES]
cs.LG
Diffusion Models for Low-Light Image Enhancement: A Multi-Perspective Taxonomy and Performance Analysis
[AUTHORS]
Eashan Adhikarla, Yixin Liu, Brian D. Davison
[ABSTRACT]
Low-light image enhancement (LLIE) is vital for safety-critical applications
such as surveillance, autonomous navigation, and medical imaging, where
visibility degradation can impair downstream task performance. Recently,
diffusion models have emerged as a promising generative paradigm for LLIE due
to their capacity to model complex image distributions via iterative denoising.
This survey provides an up-to-date critical analysis of diffusion models for
LLIE, distinctively featuring an in-depth comparative performance evaluation
against Generative Adversarial Network and Transformer-based state-of-the-art
methods, a thorough examination of practical deployment challenges, and a
forward-looking perspective on the role of emerging paradigms like foundation
models. We propose a multi-perspective taxonomy encompassing six categories
(Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal,
and Autonomous) that map enhancement methods across physical priors,
conditioning schemes, and computational efficiency. Our taxonomy is grounded in
a hybrid view of both the model mechanism and the conditioning signals. We
evaluate qualitative failure modes, benchmark inconsistencies, and trade-offs
between interpretability, generalization, and inference efficiency. We also
discuss real-world deployment constraints (e.g., memory, energy use) and
ethical considerations. This survey aims to guide the next generation of
diffusion-based LLIE research by highlighting trends and surfacing open
research questions, including novel conditioning, real-time adaptation, and the
potential of foundation models.
[LINK]
http://arxiv.org/abs/2510.05976v1
[DATE]
2025-10-07 22:30:36+08:00
[CATEGORIES]
cs.LG
ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
[AUTHORS]
Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
[COMMENTS]
Accepted at NeurIPS 2025 (oral)
[LINK]
http://arxiv.org/abs/2509.20234v2
[DATE]
2025-10-07 22:27:46+08:00
[CATEGORIES]
cs.LG
Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models
[AUTHORS]
Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hofmann, Abbas Rahimi
[ABSTRACT]
Modern state-space models (SSMs) often utilize transition matrices which
enable efficient computation but pose restrictions on the model’s expressivity,
as measured in terms of the ability to emulate finite-state automata (FSA).
While unstructured transition matrices are optimal in terms of expressivity,
they come at a prohibitively high compute and memory cost even for moderate
state sizes. We propose a structured sparse parametrization of transition
matrices in SSMs that enables FSA state tracking with optimal state size and
depth, while keeping the computational cost of the recurrence comparable to
that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix
as the product of a column one-hot matrix ($P$) and a complex-valued diagonal
matrix ($D$). Consequently, the computational cost of parallel scans scales
linearly with the state size. Theoretically, the model is BIBO-stable and can
emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout
of size $N \times N$, significantly improving on all current structured SSM
guarantees. Experimentally, the model significantly outperforms a wide
collection of modern SSM variants on various FSA state tracking tasks. On
multiclass time-series classification, the performance is comparable to that of
neural controlled differential equations, a paradigm explicitly built for
time-series analysis. Finally, we integrate PD-SSM into a hybrid
Transformer-SSM architecture and demonstrate that the model can effectively
track the states of a complex FSA in which transitions are encoded as a set of
variable-length English sentences. The code is available at
https://github.com/IBM/expressive-sparse-state-space-model
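The linear-cost claim follows from the structure of the product: if each column of $P$ holds a single 1, then applying $PD$ to a state vector only scales each entry and moves it to one target row. The sketch below illustrates this for a single fixed step. The names `pd_step` and `dense_step` and the index-array encoding of $P$ are assumptions for illustration, not the API of the linked repository.

```python
def pd_step(one_hot_rows, diag, state):
    """Compute (P @ D) @ state in O(N) without forming the N x N matrix.

    one_hot_rows[j] : row index of the single 1 in column j of P
    diag[j]         : j-th complex diagonal entry of D
    state[j]        : j-th entry of the current state vector
    """
    out = [0j] * len(state)
    for j in range(len(state)):
        # Column j of P routes the scaled entry to exactly one row
        out[one_hot_rows[j]] += diag[j] * state[j]
    return out


def dense_step(one_hot_rows, diag, state):
    """Reference O(N^2) version that builds P @ D explicitly, for checking."""
    n = len(state)
    pd = [[diag[j] if one_hot_rows[j] == i else 0j for j in range(n)]
          for i in range(n)]
    return [sum(pd[i][j] * state[j] for j in range(n)) for i in range(n)]


rows = [1, 2, 0]                      # here P happens to be a cyclic permutation
diag = [1j, 0.5 + 0j, -1 + 0j]
state = [1 + 0j, 2 + 0j, 3 + 0j]
fast = pd_step(rows, diag, state)
slow = dense_step(rows, diag, state)
```

Because several columns may point at the same row, `pd_step` accumulates with `+=`, so the sketch covers general column one-hot matrices and not only permutations.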
[COMMENTS]
10 pages, NeurIPS 2025 Spotlight
[LINK]
http://arxiv.org/abs/2509.22284v2
[DATE]
2025-10-07 22:10:40+08:00
[CATEGORIES]
cs.LG
Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
[AUTHORS]
Randall Balestriero, Nicolas Ballas, Mike Rabbat, Yann LeCun
[ABSTRACT]
Joint Embedding Predictive Architectures (JEPAs) learn representations able
to solve numerous downstream tasks out-of-the-box. JEPAs combine two
objectives: (i) a latent-space prediction term, i.e., the representation of a
slightly perturbed sample must be predictable from the original sample’s
representation, and (ii) an anti-collapse term, i.e., not all samples should
have the same representation. While (ii) is often considered as an obvious
remedy to representation collapse, we uncover that JEPAs’ anti-collapse term
does much more: it provably estimates the data density. In short, any
successfully trained JEPA can be used to get sample probabilities, e.g., for
data curation, outlier detection, or simply for density estimation. Our
theoretical finding is agnostic of the dataset and architecture used: in any
case one can compute the learned probabilities of sample $x$ efficiently and in
closed-form using the model’s Jacobian matrix at $x$. Our findings are
empirically validated across datasets (synthetic, controlled, and ImageNet) and
across different Self Supervised Learning methods falling under the JEPA family
(I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the
method extracting the JEPA learned density as JEPA-SCORE.
[LINK]
http://arxiv.org/abs/2510.05949v1
[DATE]
2025-10-07 22:06:30+08:00
[CATEGORIES]
cs.LG
N-Parties Private Structure and Parameter Learning for Sum-Product Networks
[AUTHORS]
Xenia Heilmann, Ernst Althaus, Mattia Cerrato, Nick Johannes Peter Rassau, Mohammad Sadeq Dousti, Stefan Kramer
[ABSTRACT]
A sum-product network (SPN) is a graphical model that allows several types of
probabilistic inference to be performed efficiently. In this paper, we propose
a privacy-preserving protocol which tackles structure generation and parameter
learning of SPNs. Additionally, we provide a protocol for private inference on
SPNs, subsequent to training. To preserve the privacy of the participants, we
derive our protocol based on secret sharing, which guarantees privacy in the
honest-but-curious setting even when at most half of the parties cooperate to
disclose the data. The protocol makes use of a forest of randomly generated
SPNs, which is trained and weighted privately and can then be used for private
inference on data points. Our experiments indicate that preserving the privacy
of all participants does not decrease log-likelihood performance on both
homogeneously and heterogeneously partitioned data. We furthermore show that
our protocol’s performance is comparable to current state-of-the-art SPN
learners in homogeneously partitioned data settings. In terms of runtime and
memory usage, we demonstrate that our implementation scales well when
increasing the number of parties, comparing favorably to protocols for neural
networks, when they are trained to reproduce the input-output behavior of SPNs.
[LINK]
http://arxiv.org/abs/2510.05946v1
[DATE]
2025-10-07 21:55:06+08:00
[CATEGORIES]
cs.LG
EARL: Efficient Agentic Reinforcement Learning Systems for Large Language Models
[AUTHORS]
Zheyue Tan, Mustapha Abdullahi, Tuo Shi, Huining Yuan, Zelai Xu, Chao Yu, Boxun Li, Bo Zhao
[ABSTRACT]
Reinforcement learning (RL) has become a pivotal component of large language
model (LLM) post-training, and agentic RL extends this paradigm so that LLMs
operate as agents through multi-turn interaction and tool use. Scaling such systems
exposes two practical bottlenecks: (1) context length grows rapidly during
training, inflating memory usage and latency, and triggering out-of-memory
(OOM) failures; and (2) intermediate tensors accumulate with context length,
making cross-device data movement a major system bottleneck.
We present EARL, a scalable system for efficient agentic RL. EARL designs a
parallelism selector that dynamically adapts model and training parallelism
across RL stages based on sequence length and system load, and a data
dispatcher that performs layout-aware, decentralized exchange of intermediate
data batches. Together, these components increase throughput, reduce
long-context failures, and enable stable large-scale training of agentic LLMs
without relying on hard limits or penalties of context length.
[LINK]
http://arxiv.org/abs/2510.05943v1
[DATE]
2025-10-07 21:52:51+08:00
[CATEGORIES]
cs.LG
LLM-FS-Agent: A Deliberative Role-based Large Language Model Architecture for Transparent Feature Selection
[AUTHORS]
Mohamed Bal-Ghaoui, Fayssal Sabri
[ABSTRACT]
High-dimensional data remains a pervasive challenge in machine learning,
often undermining model interpretability and computational efficiency. While
Large Language Models (LLMs) have shown promise for dimensionality reduction
through feature selection, existing LLM-based approaches frequently lack
structured reasoning and transparent justification for their decisions. This
paper introduces LLM-FS-Agent, a novel multi-agent architecture designed for
interpretable and robust feature selection. The system orchestrates a
deliberative “debate” among multiple LLM agents, each assigned a specific role,
enabling collective evaluation of feature relevance and generation of detailed
justifications. We evaluate LLM-FS-Agent in the cybersecurity domain using the
CIC-DIAD 2024 IoT intrusion detection dataset and compare its performance
against strong baselines, including LLM-Select and traditional methods such as
PCA. Experimental results demonstrate that LLM-FS-Agent consistently achieves
superior or comparable classification performance while reducing downstream
training time by an average of 46% (statistically significant improvement, p =
0.028 for XGBoost). These findings highlight that the proposed deliberative
architecture enhances both decision transparency and computational efficiency,
establishing LLM-FS-Agent as a practical and reliable solution for real-world
applications.
[LINK]
http://arxiv.org/abs/2510.05935v1
[DATE]
2025-10-07 21:46:06+08:00
[CATEGORIES]
cs.LG
Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks
[AUTHORS]
Sotaro Takeshita, Yurina Takeshita, Daniel Ruffinelli, Simone Paolo Ponzetto
[ABSTRACT]
In this paper, we study the surprising impact that truncating text embeddings
has on downstream performance. We consistently observe across 6
state-of-the-art text encoders and 26 downstream tasks, that randomly removing
up to 50% of embedding dimensions results in only a minor drop in performance,
less than 10%, in retrieval and classification tasks. Given the benefits of
using smaller-sized embeddings, as well as the potential insights about text
encoding, we study this phenomenon and find that, contrary to what is suggested
in prior work, this is not the result of an ineffective use of representation
space. Instead, we find that a large number of uniformly distributed dimensions
actually cause an increase in performance when removed. This would explain why,
on average, removing a large number of embedding dimensions results in a
marginal drop in performance. We make similar observations when truncating the
embeddings used by large language models to make next-token predictions on
generative tasks, suggesting that this phenomenon is not isolated to
classification or retrieval tasks.
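The truncation experiment is simple to reproduce in spirit: pick one random subset of dimensions, apply it to every embedding, and compare similarities before and after. The following is a minimal sketch on synthetic vectors. The helper names and the Gaussian stand-in embeddings are assumptions, and no real text encoder is involved.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drop_dimensions(vectors, keep_fraction=0.5, seed=0):
    """Keep the same random subset of dimensions for every vector."""
    dim = len(vectors[0])
    n_keep = max(1, int(dim * keep_fraction))
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(dim), n_keep))
    return [[v[i] for i in kept] for v in vectors], kept


# Synthetic stand-ins for a query embedding and a closely related document
rng = random.Random(42)
query = [rng.gauss(0, 1) for _ in range(64)]
doc = [q + 0.1 * rng.gauss(0, 1) for q in query]

(reduced_query, reduced_doc), kept = drop_dimensions([query, doc])
full_similarity = cosine(query, doc)
reduced_similarity = cosine(reduced_query, reduced_doc)
```

The same index set must be used for queries and documents; sampling a different subset per vector would make the reduced vectors incomparable.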
[COMMENTS]
Accepted to EMNLP 2025 Main Conference (Oral), camera-ready version
[LINK]
http://arxiv.org/abs/2508.17744v2
[DATE]
2025-10-07 21:43:18+08:00
[CATEGORIES]
cs.LG
Carré du champ flow matching: better quality-generalisation tradeoff in generative models
[AUTHORS]
Jacob Bamberger, Iolo Jones, Dennis Duncan, Michael M. Bronstein, Pierre Vandergheynst, Adam Gosztolai
[ABSTRACT]
Deep generative models often face a fundamental tradeoff: high sample quality
can come at the cost of memorisation, where the model reproduces training data
rather than generalising across the underlying data geometry. We introduce
Carré du champ flow matching (CDC-FM), a generalisation of flow matching
(FM), that improves the quality-generalisation tradeoff by regularising the
probability path with a geometry-aware noise. Our method replaces the
homogeneous, isotropic noise in FM with a spatially varying, anisotropic
Gaussian noise whose covariance captures the local geometry of the latent data
manifold. We prove that this geometric noise can be optimally estimated from
the data and is scalable to large data. Further, we provide an extensive
experimental evaluation on diverse datasets (synthetic manifolds, point clouds,
single-cell genomics, animal motion capture, and images) as well as various
neural network architectures (MLPs, CNNs, and transformers). We demonstrate
that CDC-FM consistently offers a better quality-generalisation tradeoff. We
observe significant improvements over standard FM in data-scarce regimes and in
highly non-uniformly sampled datasets, which are often encountered in AI for
science applications. Our work provides a mathematical framework for studying
the interplay between data geometry, generalisation and memorisation in
generative models, as well as a robust and scalable algorithm that can be
readily integrated into existing flow matching pipelines.
[LINK]
http://arxiv.org/abs/2510.05930v1
[DATE]
2025-10-07 21:41:33+08:00
[CATEGORIES]
cs.LG
FedFlex: Federated Learning for Diverse Netflix Recommendations
[AUTHORS]
Sven Lankester, Gustavo de Carvalho Bertoli, Matias Vizcaino, Emmanuelle Beauxis Aussalet, Manel Slokom
[ABSTRACT]
The drive for personalization in recommender systems creates a tension
between user privacy and the risk of “filter bubbles”. Although federated
learning offers a promising paradigm for privacy-preserving recommendations,
its impact on diversity remains unclear. We introduce FedFlex, a two-stage
framework that combines local, on-device fine-tuning of matrix factorization
models (SVD and BPR) with a lightweight Maximal Marginal Relevance (MMR)
re-ranking step to promote diversity. We conducted the first live user study of
a federated recommender, collecting behavioral data and feedback during a
two-week online deployment. Our results show that FedFlex successfully engages
users, with BPR outperforming SVD in click-through rate. Re-ranking with MMR
consistently improved ranking quality (nDCG) across both models, with
statistically significant gains, particularly for BPR. Diversity effects
varied: MMR increased coverage for both models and improved intra-list
diversity for BPR, but slightly reduced it for SVD, suggesting different
interactions between personalization and diversification across models. Our
exit questionnaire responses indicated that most users expressed no clear
preference between re-ranked and unprocessed lists, implying that increased
diversity did not substantially reduce user satisfaction.
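The MMR re-ranking step can be sketched as a greedy loop that trades relevance against similarity to items already chosen. This is a generic illustration of Maximal Marginal Relevance, not the FedFlex code: the function name `mmr_rerank`, the trade-off `lam`, and the genre-based similarity are hypothetical.

```python
def mmr_rerank(relevance, similarity, k, lam=0.7):
    """Greedy Maximal Marginal Relevance re-ranking.

    relevance  : dict mapping item -> relevance score from the base recommender
    similarity : function (item_a, item_b) -> similarity in [0, 1]
    k          : number of items to return
    lam        : 1.0 means pure relevance, 0.0 means pure diversity
    """
    candidates = list(relevance)
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(item):
            # Penalise items that resemble something already in the list
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy catalogue: two dramas with high scores and one comedy with a lower score
genres = {"A": "drama", "B": "drama", "C": "comedy"}
scores = {"A": 0.9, "B": 0.85, "C": 0.6}


def same_genre(x, y):
    return 1.0 if genres[x] == genres[y] else 0.0


ranked = mmr_rerank(scores, same_genre, k=3, lam=0.7)
```

In this toy case the comedy is promoted above the second drama, which is the kind of coverage gain the abstract attributes to MMR.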
[LINK]
http://arxiv.org/abs/2507.21115v2
[DATE]
2025-10-07 21:39:23+08:00
[CATEGORIES]
cs.LG
An Attention-Augmented VAE-BiLSTM Framework for Anomaly Detection in 12-Lead ECG Signals
[AUTHORS]
Marc Garreta Basora, Mehmet Oguz Mulayim
[ABSTRACT]
Anomaly detection in 12-lead electrocardiograms (ECGs) is critical for
identifying deviations associated with cardiovascular disease. This work
presents a comparative analysis of three autoencoder-based architectures:
convolutional autoencoder (CAE), variational autoencoder with bidirectional
long short-term memory (VAE-BiLSTM), and VAE-BiLSTM with multi-head attention
(VAE-BiLSTM-MHA), for unsupervised anomaly detection in ECGs. To the best of
our knowledge, this study reports the first application of a VAE-BiLSTM-MHA
architecture to ECG anomaly detection. All models are trained on normal ECG
samples to reconstruct non-anomalous cardiac morphology and detect deviations
indicative of disease. Using a unified preprocessing and evaluation pipeline on
the public China Physiological Signal Challenge (CPSC) dataset, the
attention-augmented VAE achieves the best performance, with an AUPRC of 0.81
and a recall of 0.85 on the held-out test set, outperforming the other
architectures. To support clinical triage, this model is further integrated
into an interactive dashboard that visualizes anomaly localization. In
addition, a performance comparison with baseline models from the literature is
provided.
[COMMENTS]
14 pages, 11 figures
[LINK]
http://arxiv.org/abs/2510.05919v1
[DATE]
2025-10-07 21:30:02+08:00
[CATEGORIES]
cs.LG
Neon: Negative Extrapolation From Self-Training Improves Image Generation
[AUTHORS]
Sina Alemohammad, Zhangyang Wang, Richard G. Baraniuk
[ABSTRACT]
Scaling generative AI models is bottlenecked by the scarcity of high-quality
training data. The ease of synthesizing from a generative model suggests using
(unverified) synthetic data to augment a limited corpus of real data for the
purpose of fine-tuning in the hope of improving performance. Unfortunately,
the resulting positive feedback loop leads to model autophagy disorder
(MAD, aka model collapse) that results in a rapid degradation in sample quality
and/or diversity. In this paper, we introduce Neon (for Negative Extrapolation
frOm self-traiNing), a new learning method that turns the degradation from
self-training into a powerful signal for self-improvement. Given a base model,
Neon first fine-tunes it on its own self-synthesized data but then,
counterintuitively, reverses its gradient updates to extrapolate away from the
degraded weights. We prove that Neon works because typical inference samplers
that favor high-probability regions create a predictable anti-alignment between
the synthetic and real data population gradients, which negative extrapolation
corrects to better align the model with the true data distribution. Neon is
remarkably easy to implement via a simple post-hoc merge that requires no new
real data, works effectively with as few as 1k synthetic samples, and typically
uses less than 1% additional training compute. We demonstrate Neon’s
universality across a range of architectures (diffusion, flow matching,
autoregressive, and inductive moment matching models) and datasets (ImageNet,
CIFAR-10, and FFHQ). In particular, on ImageNet 256x256, Neon elevates the
xAR-L model to a new state-of-the-art FID of 1.02 with only 0.36% additional
training compute. Code is available at https://github.com/VITA-Group/Neon
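The post-hoc merge can be read as stepping from the base weights in the direction opposite to the self-training update. The sketch below follows that reading only. The formula base - w * (finetuned - base), the name `neon_merge`, and the weight `w` are assumptions inferred from the abstract, and the released code may differ.

```python
def neon_merge(base, finetuned, w=1.0):
    """Extrapolate away from self-trained weights.

    base, finetuned : dicts mapping parameter name -> list of floats
    w               : extrapolation strength (hypothetical hyperparameter)
    Returns base - w * (finetuned - base) for every parameter.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    return {
        name: [b - w * (f - b) for b, f in zip(base[name], finetuned[name])]
        for name in base
    }


# Toy checkpoints with two parameters
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
self_trained = {"layer.weight": [1.5, 1.0], "layer.bias": [0.25]}
merged = neon_merge(base, self_trained, w=0.5)
```

Setting `w=0` returns the base model unchanged, so the strength can be tuned on held-out data without retraining.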
[LINK]
http://arxiv.org/abs/2510.03597v2
[DATE]
2025-10-07 21:29:15+08:00
[CATEGORIES]
cs.LG
A weakly-supervised deep learning model for fast localisation and delineation of the skeleton, internal organs, and spinal canal on Whole-Body Diffusion-Weighted MRI (WB-DWI)
[AUTHORS]
A. Candito, A. Dragan, R. Holbrey, A. Ribeiro, R. Donners, C. Messiou, N. Tunariu, D. -M. Koh, M. D. Blackledge
[ABSTRACT]
Background: Apparent Diffusion Coefficient (ADC) values and Total Diffusion
Volume (TDV) from Whole-body diffusion-weighted MRI (WB-DWI) are recognized
cancer imaging biomarkers. However, manual disease delineation for ADC and TDV
measurements is unfeasible in clinical practice, demanding automation. As a
first step, we propose an algorithm to generate fast and reproducible
probability maps of the skeleton, adjacent internal organs (liver, spleen,
urinary bladder, and kidneys), and spinal canal. Methods: We developed an
automated deep-learning pipeline based on a 3D patch-based Residual U-Net
architecture that localises and delineates these anatomical structures on
WB-DWI. The algorithm was trained using “soft labels” (non-binary
segmentations) derived from a computationally intensive atlas-based approach.
For training and validation, we employed a multi-centre WB-DWI dataset
comprising 532 scans from patients with Advanced Prostate Cancer (APC) or
Multiple Myeloma (MM), with testing on 45 patients. Results: Our
weakly-supervised deep learning model achieved an average Dice score of 0.67
for whole skeletal delineation, 0.76 when excluding the ribcage, 0.83 for internal
organs, and 0.86 for spinal canal, with average surface distances below 3 mm.
Relative median ADC differences between automated and manual full-body
delineations were below 10%. The model was 12x faster than the atlas-based
registration algorithm (25 sec vs. 5 min). Two experienced radiologists rated
the model’s outputs as either “good” or “excellent” on test scans, with
inter-reader agreement from fair to substantial (Gwet’s AC1 = 0.27-0.72).
Conclusion: The model offers fast, reproducible probability maps for localising
and delineating body regions on WB-DWI, potentially enabling non-invasive
imaging biomarker quantification to support disease staging and treatment
response assessment.
[LINK]
http://arxiv.org/abs/2503.20722v2
[DATE]
2025-10-07 21:28:38+08:00
[CATEGORIES]
cs.LG
MetaLLMiX: An XAI Aided LLM-Meta-learning Based Approach for Hyper-parameters Optimization
[AUTHORS]
Mohamed Bal-Ghaoui, Mohammed Tiouti
[ABSTRACT]
Effective model and hyperparameter selection remains a major challenge in
deep learning, often requiring extensive expertise and computation. While
AutoML and large language models (LLMs) promise automation, current LLM-based
approaches rely on trial and error and expensive APIs, which provide limited
interpretability and generalizability. We propose MetaLLMiX, a zero-shot
hyperparameter optimization framework combining meta-learning, explainable AI,
and efficient LLM reasoning. By leveraging historical experiment outcomes with
SHAP explanations, MetaLLMiX recommends optimal hyperparameters and pretrained
models without additional trials. We further employ an LLM-as-judge evaluation
to control output format, accuracy, and completeness. Experiments on eight
medical imaging datasets using nine open-source lightweight LLMs show that
MetaLLMiX achieves competitive or superior performance to traditional HPO
methods while drastically reducing computational cost. Our local deployment
outperforms prior API-based approaches, achieving optimal results on 5 of 8
tasks, response time reductions of 99.6-99.9%, and the fastest training times
on 6 datasets (2.4-15.7x faster), maintaining accuracy within 1-5% of
best-performing baselines.
[LINK]
http://arxiv.org/abs/2509.09387v3
[DATE]
2025-10-07 21:08:21+08:00
[CATEGORIES]
cs.LG
How Foundational are Foundation Models for Time Series Forecasting?
[AUTHORS]
Nouha Karaouli, Denis Coquenet, Elisa Fromont, Martial Mermillod, Marina Reyboz
[COMMENTS]
Typo rectified in this v3 version. Accepted at NeurIPS 2025 Workshop
on Recent Advances in Time Series Foundation Models (BERT2S)
[LINK]
http://arxiv.org/abs/2510.00742v3
[DATE]
2025-10-07 21:03:30+08:00
[CATEGORIES]
cs.LG
Understanding Catastrophic Interference: On the Identifiability of Latent Representations
[AUTHORS]
Yuke Li, Yujia Zheng, Tianyi Xiong, Zhenyi Wang, Heng Huang
[ABSTRACT]
Catastrophic interference, also known as catastrophic forgetting, is a
fundamental challenge in machine learning, where a trained learning model
progressively loses performance on previously learned tasks when adapting to
new ones. In this paper, we aim to better understand and model the catastrophic
interference problem from a latent representation learning point of view, and
propose a novel theoretical framework that formulates catastrophic interference
as an identification problem. Our analysis demonstrates that the forgetting
phenomenon can be quantified by the distance between partial-task aware (PTA)
and all-task aware (ATA) setups. Building upon recent advances in
identifiability theory, we prove that this distance can be minimized through
identification of shared latent variables between these setups. For learning,
we propose a method with a two-stage training strategy: first, we
employ maximum likelihood estimation to learn the latent representations from
both PTA and ATA configurations. Subsequently, we optimize the KL divergence to
identify and learn the shared latent variables. Through theoretical guarantees
and empirical validation, we establish that identifying and learning these
shared representations can effectively mitigate catastrophic interference in
machine learning systems. Our approach provides both theoretical guarantees and
practical performance improvements across both synthetic and benchmark
datasets.
[LINK]
http://arxiv.org/abs/2509.23027v3
[DATE]
2025-10-07 20:58:54+08:00
[CATEGORIES]
cs.LG
Segment-Factorized Full-Song Generation on Symbolic Piano Music
[AUTHORS]
Ping-Yi Chen, Chih-Pin Tan, Yi-Hsuan Yang
[ABSTRACT]
We propose the Segmented Full-Song Model (SFS) for symbolic full-song
generation. The model accepts a user-provided song structure and an optional
short seed segment that anchors the main idea around which the song is
developed. By factorizing a song into segments and generating each one through
selective attention to related segments, the model achieves higher quality and
efficiency compared to prior work. To demonstrate its suitability for human-AI
interaction, we further wrap SFS into a web application that enables users to
iteratively co-create music on a piano roll with customizable structures and
flexible ordering.
[COMMENTS]
Accepted to the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025) Workshop: AI for Music
[LINK]
http://arxiv.org/abs/2510.05881v1
[DATE]
2025-10-07 20:54:44+08:00
[CATEGORIES]
cs.LG
MaNGO - Adaptable Graph Network Simulators via Meta-Learning
[AUTHORS]
Philipp Dahlinger, Tai Hoang, Denis Blessing, Niklas Freymuth, Gerhard Neumann
[ABSTRACT]
Accurately simulating physics is crucial across scientific domains, with
applications spanning from robotics to materials science. While traditional
mesh-based simulations are precise, they are often computationally expensive
and require knowledge of physical parameters, such as material properties. In
contrast, data-driven approaches like Graph Network Simulators (GNSs) offer
faster inference but suffer from two key limitations: Firstly, they must be
retrained from scratch for even minor variations in physical parameters, and
secondly they require labor-intensive data collection for each new parameter
setting. This is inefficient, as simulations with varying parameters often
share a common underlying latent structure. In this work, we address these
challenges by learning this shared structure through meta-learning, enabling
fast adaptation to new physical parameters without retraining. To this end, we
propose a novel architecture that generates a latent representation by encoding
graph trajectories using conditional neural processes (CNPs). To mitigate error
accumulation over time, we combine CNPs with a novel neural operator
architecture. We validate our approach, Meta Neural Graph Operator (MaNGO), on
several dynamics prediction tasks with varying material properties,
demonstrating superior performance over existing GNS methods. Notably, MaNGO
achieves accuracy on unseen material properties close to that of an oracle
model.
[COMMENTS]
19 pages including appendix. NeurIPS 2025 (preprint version)
[LINK]
http://arxiv.org/abs/2510.05874v1
[DATE]
2025-10-07 20:44:24+08:00
[CATEGORIES]
cs.LG
Towards Label-Free Biological Reasoning Synthetic Dataset Creation via Uncertainty Filtering
[AUTHORS]
Josefa Lia Stoisser, Lawrence Phillips, Aditya Misra, Tom A. Lamb, Philip Torr, Marc Boubnovski Martell, Julien Fauqueur, Kaspar Märtens
[ABSTRACT]
Synthetic chain-of-thought (CoT) traces are widely used to train large
reasoning models (LRMs), improving generalization by providing step-level
supervision. Yet most approaches require ground-truth labels to seed or filter
these traces - an expensive bottleneck in domains like biology where wet-lab
data are scarce. We propose a label-free alternative: uncertainty-based
filtering, which uses a model’s own confidence - quantified through established
uncertainty metrics like self-consistency and predictive perplexity - as a
substitute for external labels. We sample multiple reasoning traces and retain
only low-uncertainty subsets. Applied to biological perturbation prediction, a
domain where wet-lab labels are especially costly, we show that the filtered
subset has higher accuracy, and that supervised fine-tuning (SFT) on
uncertainty-filtered data outperforms unfiltered synthetic data, narrows the
gap to ground-truth training, and surpasses strong LRM baselines. Ablations
show that per-class filtering corrects for class-specific uncertainty scales
and that hybrid uncertainty metrics yield higher-quality datasets. Our results
suggest that model-internal confidence is a powerful signal for efficient
reasoning dataset creation, enabling LRMs in domains where supervision is
expensive.
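
A minimal sketch of label-free filtering with self-consistency as the uncertainty signal is given below. It is an illustration under assumptions, not the paper's pipeline: the agreement threshold, the example questions, and the function names are hypothetical, and the per-class thresholds and perplexity-based metrics mentioned in the abstract are omitted.

```python
from collections import Counter


def self_consistency(answers):
    """Return the majority answer and the fraction of sampled traces that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def filter_by_agreement(samples, min_agreement):
    """Keep only questions whose sampled answers agree often enough.

    samples: dict mapping a question to the final answers of its sampled traces.
    min_agreement: hypothetical threshold on the agreement fraction.
    """
    kept = {}
    for question, answers in samples.items():
        answer, agreement = self_consistency(answers)
        if agreement >= min_agreement:
            kept[question] = answer  # low uncertainty: use as a pseudo-label
    return kept


# Toy sampled answers, purely illustrative.
samples = {
    "gene_A knockdown": ["up", "up", "up", "down"],
    "gene_B knockdown": ["up", "down", "none", "down"],
}
kept = filter_by_agreement(samples, min_agreement=0.75)
```

No ground-truth label is consulted anywhere in the sketch; the retained subset is chosen from the model's own agreement alone.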
[LINK]
http://arxiv.org/abs/2510.05871v1
[DATE]
2025-10-07 20:40:37+08:00
[CATEGORIES]
cs.LG
Mitigating Exponential Mixed Frequency Growth through Frequency Selection
[AUTHORS]
Michael Poppel, David Bucher, Maximilian Zorn, Nico Kraus, Philipp Altmann, Jonas Stein, Claudia Linnhoff-Popien
[ABSTRACT]
Quantum machine learning research has expanded rapidly due to potential
computational advantages over classical methods. Angle encoding has emerged as
a popular choice as feature map (FM) for embedding classical data into quantum
models due to its simplicity and natural generation of truncated Fourier
series, providing universal function approximation capabilities. Efficient FMs
within quantum circuits can exploit exponential scaling of Fourier frequencies,
with multi-dimensional inputs introducing additional exponential growth through
mixed-frequency terms. Despite this promising expressive capability, practical
implementation faces significant challenges. Through controlled experiments
with white-box target functions, we demonstrate that training failures can
occur even when all relevant frequencies are theoretically accessible. We
illustrate how two primary known causes lead to unsuccessful optimization:
insufficient trainable parameters relative to the model’s frequency content,
and limitations imposed by the ansatz’s dynamical Lie algebra dimension, but also
uncover an additional parameter burden: the necessity of controlling non-unique
frequencies within the model. To address this, we propose near-zero weight
initialization to suppress unnecessary duplicate frequencies. For target
functions with a priori frequency knowledge, we introduce frequency selection
as a practical solution that reduces parameter requirements and mitigates the
exponential growth that would otherwise render problems intractable due to
parameter insufficiency. Our frequency selection approach achieved near-optimal
performance (median $R^2 \approx 0.95$) with 78% of the parameters needed by
the best standard approach in 10 randomly chosen target functions.
[COMMENTS]
10 pages, 3 figures
[LINK]
http://arxiv.org/abs/2508.10533v3
[DATE]
2025-10-07 20:31:53+08:00
[CATEGORIES]
cs.LG
How to model Human Actions distribution with Event Sequence Data
[AUTHORS]
Egor Surkov, Dmitry Osin, Evgeny Burnaev, Egor Shvetsov
[ABSTRACT]
This paper studies forecasting of the future distribution of events in human
action sequences, a task essential in domains like retail, finance, healthcare,
and recommendation systems where the precise temporal order is often less
critical than the set of outcomes. We challenge the dominant autoregressive
paradigm and investigate whether explicitly modeling the future distribution or
order-invariant multi-token approaches outperform order-preserving methods. We
analyze local order invariance and introduce a KL-based metric to quantify
temporal drift. We find that a simple explicit distribution forecasting
objective consistently surpasses complex implicit baselines. We further
demonstrate that mode collapse of predicted categories is primarily driven by
distributional imbalance. This work provides a principled framework for
selecting modeling strategies and offers practical guidance for building more
accurate and robust forecasting systems.
[COMMENTS]
9 pages main text + 2 pages references + 6 pages appendix, 10
figures, 3 tables. Preprint version
[LINK]
http://arxiv.org/abs/2510.05856v1
[DATE]
2025-10-07 20:24:54+08:00
[CATEGORIES]
cs.LG
Sparse Representations Improve Adversarial Robustness of Neural Network Classifiers
[AUTHORS]
Killian Steunou, Théo Druilhe, Sigurd Saue
[ABSTRACT]
Deep neural networks perform remarkably well on image classification tasks
but remain vulnerable to carefully crafted adversarial perturbations. This work
revisits linear dimensionality reduction as a simple, data-adapted defense. We
empirically compare standard Principal Component Analysis (PCA) with its sparse
variant (SPCA) as front-end feature extractors for downstream classifiers, and
we complement these experiments with a theoretical analysis. On the theory
side, we derive exact robustness certificates for linear heads applied to SPCA
features: for both $\ell_\infty$ and $\ell_2$ threat models (binary and
multiclass), the certified radius grows as the dual norms of $W^\top u$ shrink,
where $W$ is the projection and $u$ the head weights. We further show that for
general (non-linear) heads, sparsity reduces operator-norm bounds through a
Lipschitz composition argument, predicting lower input sensitivity.
Empirically, with a small non-linear network after the projection, SPCA
consistently degrades more gracefully than PCA under strong white-box and
black-box attacks while maintaining competitive clean accuracy. Taken together,
the theory identifies the mechanism (sparser projections reduce adversarial
leverage) and the experiments verify that this benefit persists beyond the
linear setting. Our code is available at
https://github.com/killian31/SPCARobustness.
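
To make the dependence on the dual norm of $W^\top u$ concrete, the sketch below computes the standard margin-over-dual-norm radius for a binary linear head on projected features. It is the textbook bound for a linear classifier and may differ in detail from the paper's certificates; the shape convention ($W$ of size $k \times d$, features $z = Wx$), the bias `b`, and the toy numbers are assumptions.

```python
import math


def input_gradient(W, u):
    """Return g = W^T u, the input-space weight vector of f(x) = u^T (W x) + b."""
    d = len(W[0])
    return [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(d)]


def certified_radii(W, u, b, x):
    """Distance from x to the decision boundary under l_inf and l_2 perturbations."""
    g = input_gradient(W, u)
    margin = abs(sum(gj * xj for gj, xj in zip(g, x)) + b)
    l1_norm = sum(abs(gj) for gj in g)            # dual norm of l_inf
    l2_norm = math.sqrt(sum(gj * gj for gj in g))  # l_2 is self-dual
    return margin / l1_norm, margin / l2_norm


# Toy sparse projection (k = 2, d = 3), purely illustrative.
W = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0]]
u = [1.0, -1.0]
x = [3.0, 1.0, 5.0]
r_inf, r_2 = certified_radii(W, u, b=0.0, x=x)
```

With the margin held fixed, zeroing more entries of $W$ can only shrink the $\ell_1$ norm of $W^\top u$, which is why sparser projections enlarge the $\ell_\infty$ radius in this simple setting.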
[COMMENTS]
Killian Steunou is the main contributor and corresponding author of
this work
[LINK]
http://arxiv.org/abs/2509.21130v2
[DATE]
2025-10-07 20:21:15+08:00
[CATEGORIES]
cs.LG
ESS-Flow: Training-free guidance of flow-based models as inference in source space
[AUTHORS]
Adhithyan Kalaivanan, Zheng Zhao, Jens Sjölund, Fredrik Lindsten
[ABSTRACT]
Guiding pretrained flow-based generative models for conditional generation or
to produce samples with desired target properties enables solving diverse tasks
without retraining on paired data. We present ESS-Flow, a gradient-free method
that leverages the typically Gaussian prior of the source distribution in
flow-based models to perform Bayesian inference directly in the source space
using Elliptical Slice Sampling. ESS-Flow only requires forward passes through
the generative model and observation process, no gradient or Jacobian
computations, and is applicable even when gradients are unreliable or
unavailable, such as with simulation-based observations or quantization in the
generation or observation process. We demonstrate its effectiveness on
designing materials with desired target properties and predicting protein
structures from sparse inter-residue distance measurements.
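
The source-space inference idea can be sketched with a plain elliptical slice sampling step under a standard normal prior. This is a toy illustration, not the ESS-Flow implementation: `toy_generator` stands in for a pretrained flow model, and the Gaussian observation model, its noise level, and the chain lengths are assumptions.

```python
import math
import random


def elliptical_slice_step(z, log_lik, rng):
    """One elliptical slice sampling transition under a N(0, I) prior on z."""
    nu = [rng.gauss(0.0, 1.0) for _ in z]               # prior draw defining the ellipse
    log_y = log_lik(z) + math.log(1.0 - rng.random())   # slice level below current log-likelihood
    theta = rng.uniform(0.0, 2.0 * math.pi)
    lo, hi = theta - 2.0 * math.pi, theta
    while True:
        proposal = [zi * math.cos(theta) + ni * math.sin(theta)
                    for zi, ni in zip(z, nu)]
        if log_lik(proposal) > log_y:
            return proposal
        # Shrink the angle bracket toward theta = 0, which is the current state.
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)


def toy_generator(z):
    """Hypothetical stand-in for the flow model: maps source noise to a sample."""
    return [2.0 * z[0] + 1.0]


def log_lik(z, target=3.0, sigma=0.1):
    """Gaussian observation log-likelihood of the generated sample (forward pass only)."""
    x = toy_generator(z)[0]
    return -0.5 * ((target - x) / sigma) ** 2


rng = random.Random(0)
z = [0.0]
samples = []
for step in range(600):
    z = elliptical_slice_step(z, log_lik, rng)
    if step >= 200:                                      # discard burn-in
        samples.append(z[0])
posterior_mean = sum(samples) / len(samples)
```

Each transition only evaluates `log_lik`, so the generator and the observation process are never differentiated, in line with the gradient-free setting the abstract describes.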
[COMMENTS]
14 pages, 12 figures. Code will be made available after publication
[LINK]
http://arxiv.org/abs/2510.05849v1
[DATE]
2025-10-07 20:11:58+08:00
[CATEGORIES]
cs.LG
Multimodal Trajectory Representation Learning for Travel Time Estimation
[AUTHORS]
Zhi Liu, Xuyuan Hu, Xiao Han, Zhehao Dai, Zhaolin Deng, Guojiang Shen, Xiangjie Kong
[ABSTRACT]
Accurate travel time estimation (TTE) plays a crucial role in intelligent
transportation systems. However, it remains challenging due to heterogeneous
data sources and complex traffic dynamics. Moreover, conventional approaches
typically convert trajectories into fixed-length representations, neglecting
the inherent variability of real-world trajectories, which often leads to
information loss or feature redundancy. To address these challenges, this paper
introduces the Multimodal Dynamic Trajectory Integration (MDTI) framework, a
novel multimodal trajectory representation learning approach that integrates
GPS sequences, grid trajectories, and road network constraints to enhance TTE
accuracy. MDTI employs modality-specific encoders and a cross-modal interaction
module to capture complementary spatial, temporal, and topological semantics,
while a dynamic trajectory modeling mechanism adaptively regulates information
density for trajectories of varying lengths. Two self-supervised pretraining
objectives, named contrastive alignment and masked language modeling, further
strengthen multimodal consistency and contextual understanding. Extensive
experiments on three real-world datasets demonstrate that MDTI consistently
outperforms state-of-the-art baselines, confirming its robustness and strong
generalization abilities. The code is publicly available at:
https://github.com/freshhxy/MDTI/
[LINK]
http://arxiv.org/abs/2510.05840v1
[DATE]
2025-10-07 20:04:16+08:00
[CATEGORIES]
cs.LG
Back to Square Roots: An Optimal Bound on the Matrix Factorization Error for Multi-Epoch Differentially Private SGD
[AUTHORS]
Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, Christoph H. Lampert
[ABSTRACT]
Matrix factorization mechanisms for differentially private training have
emerged as a promising approach to improve model utility under privacy
constraints. In practical settings, models are typically trained over multiple
epochs, requiring matrix factorizations that account for repeated
participation. Existing theoretical upper and lower bounds on multi-epoch
factorization error leave a significant gap. In this work, we introduce a new
explicit factorization method, Banded Inverse Square Root (BISR), which imposes
a banded structure on the inverse correlation matrix. This factorization
enables us to derive an explicit and tight characterization of the multi-epoch
error. We further prove that BISR achieves asymptotically optimal error by
matching the upper and lower bounds. Empirically, BISR performs on par with
state-of-the-art factorization methods, while being simpler to implement,
computationally efficient, and easier to analyze.
[LINK]
http://arxiv.org/abs/2505.12128v2
[DATE]
2025-10-07 19:58:19+08:00
[CATEGORIES]
cs.LG
Fast Policy Learning for Linear Quadratic Control with Entropy Regularization
[AUTHORS]
Xin Guo, Xinyu Li, Renyuan Xu
[ABSTRACT]
This paper proposes and analyzes two new policy learning methods: regularized
policy gradient (RPG) and iterative policy optimization (IPO), for a class of
discounted linear-quadratic control (LQC) problems over an infinite time
horizon with entropy regularization. Assuming access to the exact policy
evaluation, both proposed approaches are proven to converge linearly in finding
optimal policies of the regularized LQC. Moreover, the IPO method can achieve a
super-linear convergence rate once it enters a local region around the optimal
policy. Finally, when the optimal policy for an RL problem with a known
environment is appropriately transferred as the initial policy to an RL problem
with an unknown environment, the IPO method is shown to enable a super-linear
convergence rate if the two environments are sufficiently close. Performances
of these proposed algorithms are supported by numerical examples.
[COMMENTS]
31 pages, 3 figures
[LINK]
http://arxiv.org/abs/2311.14168v4
[DATE]
2025-10-07 19:57:24+08:00
[CATEGORIES]
cs.LG
The Use of Binary Choice Forests to Model and Estimate Discrete Choices
[AUTHORS]
Ningyuan Chen, Guillermo Gallego, Zhuodong Tang
[ABSTRACT]
Problem definition. In retailing, discrete choice models (DCMs) are commonly
used to capture the choice behavior of customers when offered an assortment of
products. When estimating DCMs using transaction data, flexible models (such as
machine learning models or nonparametric models) are typically not
interpretable and hard to estimate, while tractable models (such as the
multinomial logit model) tend to misspecify the complex behavior represented in
the data. Methodology/results. In this study, we use a forest of binary
decision trees to represent DCMs. This approach is based on random forests, a
popular machine learning algorithm. The resulting model is interpretable: the
decision trees can explain the decision-making process of customers during the
purchase. We show that our approach can predict the choice probability of any
DCM consistently and thus never suffers from misspecification. Moreover, our
algorithm predicts assortments unseen in the training data. The mechanism and
errors can be theoretically analyzed. We also prove that the random forest can
recover preference rankings of customers thanks to the splitting criterion such
as the Gini index and information gain ratio. Managerial implications. The
framework has unique practical advantages. It can capture customers’ behavioral
patterns such as irrationality or sequential searches when purchasing a
product. It handles nonstandard formats of training data that result from
aggregation. It can measure product importance based on how frequently a random
customer would make decisions depending on the presence of the product. It can
also incorporate price information and customer features. Our numerical
experiments using synthetic and real data show that using random forests to
estimate customer choices can outperform existing methods.
[COMMENTS]
63 pages, 10 figures, 30 tables
[LINK]
http://arxiv.org/abs/1908.01109v6
[DATE]
2025-10-07 19:57:17+08:00
[CATEGORIES]
cs.LG
FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders
[AUTHORS]
Riccardo Fosco Gramaccioni, Christian Marinoni, Eleonora Grassucci, Giordano Cicchetti, Aurelio Uncini, Danilo Comminiello
[ABSTRACT]
In this work, we present FoleyGRAM, a novel approach to video-to-audio
generation that emphasizes semantic conditioning through the use of aligned
multimodal encoders. Building on prior advancements in video-to-audio
generation, FoleyGRAM leverages the Gramian Representation Alignment Measure
(GRAM) to align embeddings across video, text, and audio modalities, enabling
precise semantic control over the audio generation process. The core of
FoleyGRAM is a diffusion-based audio synthesis model conditioned on
GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness
and temporal alignment with the corresponding input video. We evaluate
FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio
models. Our experiments demonstrate that aligning multimodal encoders using
GRAM enhances the system’s ability to semantically align generated audio with
video content, advancing the state of the art in video-to-audio synthesis.
[COMMENTS]
Accepted at IJCNN 2025
[LINK]
http://arxiv.org/abs/2510.05829v1
[DATE]
2025-10-07 19:52:00+08:00
[CATEGORIES]
cs.LG
StereoSync: Spatially-Aware Stereo Audio Generation from Video
[AUTHORS]
Christian Marinoni, Riccardo Fosco Gramaccioni, Kazuki Shimada, Takashi Shibuya, Yuki Mitsufuji, Danilo Comminiello
[ABSTRACT]
Although audio generation has been widely studied over recent years,
video-aligned audio generation still remains a relatively unexplored frontier.
To address this gap, we introduce StereoSync, a novel and efficient model
designed to generate audio that is both temporally synchronized with a
reference video and spatially aligned with its visual context. Moreover,
StereoSync also achieves efficiency by leveraging pretrained foundation models,
reducing the need for extensive training while maintaining high-quality
synthesis. Unlike existing methods that primarily focus on temporal
synchronization, StereoSync introduces a significant advancement by
incorporating spatial awareness into video-aligned audio generation. Indeed,
given an input video, our approach extracts spatial cues from depth maps and
bounding boxes, using them as cross-attention conditioning in a diffusion-based
audio generation model. Such an approach allows StereoSync to go beyond simple
synchronization, producing stereo audio that dynamically adapts to the spatial
structure and movement of a video scene. We evaluate StereoSync on Walking The
Maps, a curated dataset comprising videos from video games that feature
animated characters walking through diverse environments. Experimental results
demonstrate the ability of StereoSync to achieve both temporal and spatial
alignment, advancing the state of the art in video-to-audio generation and
resulting in a significantly more immersive and realistic audio experience.
[COMMENTS]
Accepted at IJCNN 2025
[LINK]
http://arxiv.org/abs/2510.05828v1
[DATE]
2025-10-07 19:51:58+08:00
[CATEGORIES]
cs.LG
Minimizing the Weighted Number of Tardy Jobs: Data-Driven Heuristic for Single-Machine Scheduling
[AUTHORS]
Nikolai Antonov, Přemysl Šůcha, Mikoláš Janota, Jan Hůla
[ABSTRACT]
Existing research on single-machine scheduling is largely focused on exact
algorithms, which perform well on typical instances but can significantly
deteriorate on certain regions of the problem space. In contrast, data-driven
approaches provide strong and scalable performance when tailored to the
structure of specific datasets. Leveraging this idea, we focus on a
single-machine scheduling problem where each job is defined by its weight,
duration, due date, and deadline, aiming to minimize the total weight of tardy
jobs. We introduce a novel data-driven scheduling heuristic that combines
machine learning with problem-specific characteristics, ensuring feasible
solutions, which is a common challenge for ML-based algorithms. Experimental
results demonstrate that our approach significantly outperforms the
state-of-the-art in terms of optimality gap, number of optimal solutions, and
adaptability across varied data scenarios, highlighting its flexibility for
practical applications. In addition, we conduct a systematic exploration of ML
models, addressing a common gap in similar studies by offering a detailed model
selection process and providing insights into why the chosen model is the best
fit.
[COMMENTS]
Published version: Computers & Operations Research,
https://doi.org/10.1016/j.cor.2025.107281. Data are publicly available at
https://doi.org/10.5281/zenodo.17233362
[LINK]
http://arxiv.org/abs/2508.13703v2
[DATE]
2025-10-07 19:41:19+08:00
[CATEGORIES]
cs.LG
Can foundation models actively gather information in interactive environments to test hypotheses?
[AUTHORS]
Danny P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, David P Reichert, Drew A. Hudson, John Reid, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Mozer, Jane X Wang
[ABSTRACT]
Foundation models excel at single-turn reasoning but struggle with multi-turn
exploration in dynamic environments, a requirement for many real-world
challenges. We evaluated these models on their ability to learn from
experience, adapt, and gather information. First, in “Feature World,” a simple
setting for testing information gathering, models performed near-optimally.
However, to test more complex, multi-trial learning, we implemented a
text-based version of the “Alchemy” environment, a benchmark for meta-learning.
Here, agents must deduce a latent causal structure by integrating information
across many trials. In this setting, recent foundation models initially failed
to improve their performance over time. Crucially, we found that prompting the
models to summarize their observations at regular intervals enabled an emergent
meta-learning process. This allowed them to improve across trials and even
adaptively re-learn when the environment’s rules changed unexpectedly. While
most models handled the simple task, Alchemy revealed stark differences in
robustness: Gemini 2.5 performed best, followed by Claude 3.7, while ChatGPT-4o
and o4-mini struggled. This underscores Alchemy’s value as a benchmark. Our
findings demonstrate that the biggest challenge for foundation models is not
selecting informative actions in the moment, but integrating knowledge through
adaptive strategies over time. Encouragingly, there appears to be no intrinsic
barrier to future models mastering these abilities.
[LINK]
http://arxiv.org/abs/2412.06438v2
[DATE]
2025-10-07 19:28:32+08:00
[CATEGORIES]
cs.LG
Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates
[AUTHORS]
Pafue Christy Nganjimi, Andrew Soltan, Danielle Belgrave, Lei Clifton, David A. Clifton, Anshul Thakur
[ABSTRACT]
Dataset condensation (DC) enables the creation of compact, privacy-preserving
synthetic datasets that can match the utility of real patient records,
supporting democratised access to highly regulated clinical data for developing
downstream clinical models. State-of-the-art DC methods supervise synthetic
data by aligning the training dynamics of models trained on real and those
trained on synthetic data, typically using full stochastic gradient descent
(SGD) trajectories as alignment targets; however, these trajectories are often
noisy, high-curvature, and storage-intensive, leading to unstable gradients,
slow convergence, and substantial memory overhead. We address these limitations
by replacing full SGD trajectories with smooth, low-loss parametric surrogates,
specifically quadratic Bézier curves that connect the initial and final model
states from real training trajectories. These mode-connected paths provide
noise-free, low-curvature supervision signals that stabilise gradients,
accelerate convergence, and eliminate the need for dense trajectory storage. We
theoretically justify Bézier-mode connections as effective surrogates for SGD
paths and empirically show that the proposed method outperforms
state-of-the-art condensation approaches across five clinical datasets,
yielding condensed datasets that enable clinically effective model development.
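
A quadratic Bézier surrogate between two model states can be sketched as follows. The curve formula is the standard one; treating the middle control point as a free parameter vector and the toy two-dimensional weights are assumptions for illustration, not details taken from the paper.

```python
def quadratic_bezier(theta_start, theta_ctrl, theta_end, t):
    """Point on the quadratic Bezier curve at t in [0, 1].

    B(t) = (1 - t)^2 * theta_start + 2 (1 - t) t * theta_ctrl + t^2 * theta_end
    """
    a = (1.0 - t) ** 2
    b = 2.0 * (1.0 - t) * t
    c = t ** 2
    return [a * s + b * m + c * e
            for s, m, e in zip(theta_start, theta_ctrl, theta_end)]


# Toy 2-D "weights": initial state, hypothetical control point, final state.
theta_start = [0.0, 0.0]
theta_ctrl = [1.0, 2.0]
theta_end = [2.0, 0.0]
midpoint = quadratic_bezier(theta_start, theta_ctrl, theta_end, 0.5)
```

Only three parameter vectors are stored per trajectory, and any intermediate state can be regenerated by choosing `t`, which is where the storage saving over dense SGD checkpoints comes from.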
[COMMENTS]
20 pages, 4 figures, Submitted to AISTATS 2026
[LINK]
http://arxiv.org/abs/2510.05805v1
[DATE]
2025-10-07 19:22:27+08:00
[CATEGORIES]
cs.LG
Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding
[AUTHORS]
Nikita Pavlichenko, Iurii Nazarov, Ivan Dolgov, Ekaterina Garanina, Dmitry Ustalov, Ivan Bondyrev, Kseniia Lysaniuk, Evgeniia Vu, Kirill Chekmenev, Joseph Shtok, Yaroslav Golubev, Anton Semenkin, Uladzislau Sazanovich
[ABSTRACT]
We present the Mellum model family, open-weight code completion models
designed for interactive use in JetBrains IDEs. Mellums have 4B parameters,
adopt a Llama-style architecture, and are pre-trained on ~4T tokens of
permissively licensed, multi-language code. Our studies show that (i) careful
data curation and staged training significantly improve the model’s quality,
(ii) editor-critical capabilities such as context packing are necessary for
high-quality suggestions, and (iii) a compact, task-focused model can meet the
cost and latency constraints of interactive completion.
In the paper, we describe an end-to-end industrial pipeline for producing
contextualized in-editor completion: disciplined data governance, multi-stage
training that includes fill-in-the-middle and project context via supervised
fine-tuning, and alignment via direct preference optimization using feedback
from real-world scenarios. Our quality evaluations include both large-scale
offline benchmarks and online telemetry from production deployments in
JetBrains IDEs. Mellums are released under the Apache-2.0 license on
HuggingFace, with a public model card providing a reproducible reference for
practitioners. Our experience offers a pragmatic blueprint for taking a
focused, open model from a research prototype to at-scale production for
hundreds of thousands of users.
[COMMENTS]
11 pages, 4 figures, 3 tables
[LINK]
http://arxiv.org/abs/2510.05788v1
[DATE]
2025-10-07 19:09:11+08:00
[CATEGORIES]
cs.LG
Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs
[AUTHORS]
Patrick Forré, Abel Jansma
[ABSTRACT]
We generalize the concept of Möbius inversion and Shapley values to
directed acyclic multigraphs and weighted versions thereof. We further allow
value functions (games) and thus their Möbius transforms (synergy function)
and Shapley values to have values in any abelian group that is a module over a
ring that contains the graph weights, e.g. vector-valued functions. To achieve
this and overcome the obstruction that the classical axioms (linearity,
efficiency, null player, symmetry) are not strong enough to uniquely determine
Shapley values in this more general setting, we analyze Shapley values from two
novel points of view: 1) We introduce projection operators that allow us to
interpret Shapley values as the recursive projection and re-attribution of
higher-order synergies to lower-order ones; 2) we propose a strengthening of
the null player axiom and a localized symmetry axiom, namely the weak elements
and flat hierarchy axioms. The former allows us to remove coalitions with
vanishing synergy while preserving the rest of the hierarchical structure. The
latter treats player-coalition bonds uniformly in the corner case of
hierarchically flat graphs. Together with linearity these axioms already imply
a unique explicit formula for the Shapley values, as well as classical
properties like efficiency, null player, symmetry, and novel ones like the
projection property. This whole framework then specializes to finite inclusion
algebras, lattices, partial orders and mereologies, and also recovers certain
previously known cases as corner cases, and presents others from a new
perspective. The admission of general weighted directed acyclic multigraph
structured hierarchies and vector-valued functions and Shapley values opens up
the possibility for new analytic tools and application areas, like machine
learning, language processing, explainable artificial intelligence, and many
more.
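As a point of reference for the generalization described above, the following is a minimal sketch of the classical special case only: Möbius inversion on the subset lattice of a finite player set, and Shapley values obtained by splitting each coalition's synergy equally among its members. It does not implement the weighted multigraph setting of the paper; the toy game and all function names are hypothetical.

```python
from itertools import combinations


def subsets(players):
    """Yield every subset of `players` as a frozenset."""
    players = list(players)
    for r in range(len(players) + 1):
        for combo in combinations(players, r):
            yield frozenset(combo)


def mobius_transform(value, players):
    """Classical synergy: m(S) = sum over T subset of S of (-1)^(|S|-|T|) v(T)."""
    return {
        s: sum((-1) ** (len(s) - len(t)) * value[t] for t in subsets(s))
        for s in subsets(players)
    }


def shapley_values(value, players):
    """Classical Shapley value: phi_i = sum over S containing i of m(S) / |S|."""
    m = mobius_transform(value, players)
    return {i: sum(m[s] / len(s) for s in m if i in s) for i in players}


# Hypothetical two-player game, for illustration only.
players = ["a", "b"]
value = {
    frozenset(): 0.0,
    frozenset("a"): 1.0,
    frozenset("b"): 2.0,
    frozenset("ab"): 5.0,
}
phi = shapley_values(value, players)
```

In this toy game the pairwise synergy is 5 - 1 - 2 = 2, which is shared equally, so the two values sum to the value of the grand coalition (the efficiency property mentioned in the abstract).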
[COMMENTS]
43 pages, 2 figures
[LINK]
http://arxiv.org/abs/2510.05786v1
[DATE]
2025-10-07 19:05:25+08:00
[CATEGORIES]
cs.LG
Expected Free Energy-based Planning as Variational Inference
[AUTHORS]
Bert de Vries, Wouter Nuijten, Thijs van de Laar, Wouter Kouw, Sepideh Adamiat, Tim Nisslbeck, Mykola Lukashchuk, Hoang Minh Huu Nguyen, Marco Hidalgo Araya, Raphael Tresor, Thijs Jenneskens, Ivana Nikoloska, Raaja Ganapathy Subramanian, Bart van Erp, Dmitry Bagaev, Albert Podusenko
[ABSTRACT]
We address the problem of planning under uncertainty, where an agent must
choose actions that not only achieve desired outcomes but also reduce
uncertainty. Traditional methods often treat exploration and exploitation as
separate objectives, lacking a unified inferential foundation. Active
inference, grounded in the Free Energy Principle, provides such a foundation by
minimizing Expected Free Energy (EFE), a cost function that combines utility
with epistemic drives, such as ambiguity resolution and novelty seeking.
However, the computational burden of EFE minimization has remained a
significant obstacle to its scalability. In this paper, we show that EFE-based
planning arises naturally from minimizing a variational free energy functional
on a generative model augmented with preference and epistemic priors. This
result reinforces theoretical consistency with the Free Energy Principle by
casting planning under uncertainty itself as a form of variational inference.
Our formulation yields policies that jointly support goal achievement and
information gain, while incorporating a complexity term that accounts for
bounded computational resources. This unifying framework connects and extends
existing methods, enabling scalable, resource-aware implementations of active
inference agents.
[COMMENTS]
18 pages
[LINK]
http://arxiv.org/abs/2504.14898v4
[DATE]
2025-10-07 18:48:23+08:00
[CATEGORIES]
cs.LG
DP-SNP-TIHMM: Differentially Private, Time-Inhomogeneous Hidden Markov Models for Synthesizing Genome-Wide Association Datasets
[AUTHORS]
Shadi Rahimian, Mario Fritz
[ABSTRACT]
Single nucleotide polymorphism (SNP) datasets are fundamental to genetic
studies but pose significant privacy risks when shared. The correlation of SNPs
with each other makes strong adversarial attacks such as masked-value
reconstruction, kin, and membership inference attacks possible. Existing
privacy-preserving approaches either apply differential privacy to statistical
summaries of these datasets or offer complex methods that require
post-processing and the usage of a publicly available dataset to suppress or
selectively share SNPs.
In this study, we introduce an innovative framework for generating synthetic
SNP sequence datasets using samples derived from time-inhomogeneous hidden
Markov models (TIHMMs). To preserve the privacy of the training data, we ensure
that each SNP sequence contributes only a bounded influence during training,
enabling strong differential privacy guarantees. Crucially, by operating on
full SNP sequences and bounding their gradient contributions, our method
directly addresses the privacy risks introduced by their inherent correlations.
Through experiments conducted on the real-world 1000 Genomes dataset, we
demonstrate the efficacy of our method using privacy budgets of $\varepsilon
\in [1, 10]$ at $\delta=10^{-4}$. Notably, by allowing the transition models of
the HMM to be dependent on the location in the sequence, we significantly
enhance performance, enabling the synthetic datasets to closely replicate the
statistical properties of non-private datasets. This framework facilitates the
private sharing of genomic data while offering researchers exceptional
flexibility and utility.
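The abstract states that each sequence's gradient contribution is bounded but does not give the mechanism. A common way to achieve this is per-example gradient clipping followed by Gaussian noise, as in DP-SGD; the sketch below shows that generic recipe and is an assumption, not the authors' training code. The function name, `clip_norm`, and `noise_multiplier` are hypothetical.

```python
import numpy as np


def private_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient to L2 norm <= clip_norm, sum the clipped
    gradients, add Gaussian noise scaled to clip_norm, and average."""
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale factor is 1 for small gradients and clip_norm / norm for large ones.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0], clipped


# Hypothetical gradients for two sequences; noise disabled to show the clipping.
rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])
noisy_mean, clipped = private_gradient(
    grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng
)
```

With clipping in place, no single sequence can move the summed gradient by more than `clip_norm`, which is what lets the added noise be calibrated to a fixed sensitivity.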
[LINK]
http://arxiv.org/abs/2510.05777v1
[DATE]
2025-10-07 18:47:29+08:00
[CATEGORIES]
cs.LG
Constrained free energy minimization for the design of thermal states and stabilizer thermodynamic systems
[AUTHORS]
Michele Minervini, Madison Chin, Jacob Kupperman, Nana Liu, Ivy Luo, Meghan Ly, Soorya Rethinasamy, Kathie Wang, Mark M. Wilde
[ABSTRACT]
A quantum thermodynamic system is described by a Hamiltonian and a list of
conserved, non-commuting charges, and a fundamental goal is to determine the
minimum energy of the system subject to constraints on the charges. Recently,
[Liu et al., arXiv:2505.04514] proposed first- and second-order classical and
hybrid quantum-classical algorithms for solving a dual chemical potential
maximization problem, and they proved that these algorithms converge to global
optima by means of gradient-ascent approaches. In this paper, we benchmark
these algorithms on several problems of interest in thermodynamics, including
one- and two-dimensional quantum Heisenberg models with nearest and
next-to-nearest neighbor interactions and with the charges set to the total x,
y, and z magnetizations. We also offer an alternative compelling interpretation
of these algorithms as methods for designing ground and thermal states of
controllable Hamiltonians, with potential applications in molecular and
material design. Furthermore, we introduce stabilizer thermodynamic systems as
thermodynamic systems based on stabilizer codes, with the Hamiltonian
constructed from a given code’s stabilizer operators and the charges
constructed from the code’s logical operators. We benchmark the aforementioned
algorithms on several examples of stabilizer thermodynamic systems, including
those constructed from the one-to-three-qubit repetition code, the perfect
one-to-five-qubit code, and the two-to-four-qubit error-detecting code.
Finally, we observe that the aforementioned hybrid quantum-classical
algorithms, when applied to stabilizer thermodynamic systems, can serve as
alternative methods for encoding qubits into stabilizer codes at a fixed
temperature, and we provide an effective method for warm-starting these
encoding algorithms whenever a single qubit is encoded into multiple physical
qubits.
[COMMENTS]
v2: 35 pages, 12 figures, updated simulations
[LINK]
http://arxiv.org/abs/2508.09103v2
[DATE]
2025-10-07 18:36:49+08:00
[CATEGORIES]
cs.LG
Interpretable Clustering: A Survey
[AUTHORS]
Lianyu Hu, Mudi Jiang, Junjie Dong, Xinying Liu, Zengyou He
[ABSTRACT]
In recent years, much of the research on clustering algorithms has primarily
focused on enhancing their accuracy and efficiency, frequently at the expense
of interpretability. However, as these methods are increasingly being applied
in high-stakes domains such as healthcare, finance, and autonomous systems, the
need for transparent and interpretable clustering outcomes has become a
critical concern. This is not only necessary for gaining user trust but also
for satisfying the growing ethical and regulatory demands in these fields.
Ensuring that decisions derived from clustering algorithms can be clearly
understood and justified is now a fundamental requirement. To address this
need, this paper provides a comprehensive and structured review of the current
state of explainable clustering algorithms, identifying key criteria to
distinguish between various methods. These insights can effectively assist
researchers in making informed decisions about the most suitable explainable
clustering methods for specific application contexts, while also promoting the
development and adoption of clustering algorithms that are both efficient and
transparent. For convenient access and reference, an open repository organizes
representative and emerging interpretable clustering methods under the taxonomy
proposed in this survey, available at
https://github.com/hulianyu/Awesome-Interpretable-Clustering
[COMMENTS]
14 pages, 2 figures, 3 tables
[LINK]
http://arxiv.org/abs/2409.00743v2
[DATE]
2025-10-07 18:32:42+08:00
[CATEGORIES]
cs.LG
Transcribing Rhythmic Patterns of the Guitar Track in Polyphonic Music
[AUTHORS]
Aleksandr Lukoianov, Anssi Klapuri
[ABSTRACT]
Whereas chord transcription has received considerable attention during the
past couple of decades, far less work has been devoted to transcribing and
encoding the rhythmic patterns that occur in a song. The topic is especially
relevant for instruments such as the rhythm guitar, which is typically played
by strumming rhythmic patterns that repeat and vary over time. However, in many
cases one cannot objectively define a single “right” rhythmic pattern for a
given song section. To create a dataset with well-defined ground-truth labels,
we asked expert musicians to transcribe the rhythmic patterns in 410 popular
songs and record cover versions where the guitar tracks followed those
transcriptions. To transcribe the strums and their corresponding rhythmic
patterns, we propose a three-step framework. Firstly, we perform approximate
stem separation to extract the guitar part from the polyphonic mixture.
Secondly, we detect individual strums within the separated guitar audio, using
a pre-trained foundation model (MERT) as a backbone. Finally, we carry out a
pattern-decoding process in which the transcribed sequence of guitar strums is
represented by patterns drawn from an expert-curated vocabulary. We show that
it is possible to transcribe the rhythmic patterns of the guitar track in
polyphonic music with quite high accuracy, producing a representation that is
human-readable and includes automatically detected bar lines and time signature
markers. We perform ablation studies and error analysis and propose a set of
evaluation metrics to assess the accuracy and readability of the predicted
rhythmic pattern sequence.
[COMMENTS]
Accepted to WASPAA 2025
[LINK]
http://arxiv.org/abs/2510.05756v1
[DATE]
2025-10-07 18:22:31+08:00
[CATEGORIES]
cs.LG
Uncertainty assessment in satellite-based greenhouse gas emissions estimates using emulated atmospheric transport
[AUTHORS]
Jeffrey N. Clark, Elena Fillola, Nawid Keshtmand, Raul Santos-Rodriguez, Matthew Rigby
[ABSTRACT]
Monitoring greenhouse gas emissions and evaluating national inventories
require efficient, scalable, and reliable inference methods. Top-down
approaches, combined with recent advances in satellite observations, provide
new opportunities to evaluate emissions at continental and global scales.
However, transport models used in these methods remain a key source of
uncertainty: they are computationally expensive to run at scale, and their
uncertainty is difficult to characterise. Artificial intelligence offers a dual
opportunity to accelerate transport simulations and to quantify their
associated uncertainty.
We present an ensemble-based pipeline for estimating atmospheric transport
“footprints”, greenhouse gas mole fraction measurements, and their
uncertainties using a graph neural network emulator of a Lagrangian Particle
Dispersion Model (LPDM). The approach is demonstrated with GOSAT (Greenhouse
Gases Observing Satellite) observations for Brazil in 2016. The emulator
achieved a ~1000x speed-up over the NAME LPDM, while reproducing large-scale
footprint structures. Ensembles were calculated to quantify absolute and
relative uncertainty, revealing spatial correlations with prediction error. The
results show that ensemble spread highlights low-confidence spatial and
temporal predictions for both atmospheric transport footprints and methane mole
fractions.
While demonstrated here for an LPDM emulator, the approach could be applied
more generally to atmospheric transport models, supporting uncertainty-aware
greenhouse gas inversion systems and improving the robustness of
satellite-based emissions monitoring. With further development, ensemble-based
emulators could also help explore systematic LPDM errors, offering a
computationally efficient pathway towards a more comprehensive uncertainty
budget in greenhouse gas flux estimates.
[LINK]
http://arxiv.org/abs/2510.05751v1
[DATE]
2025-10-07 18:14:25+08:00
[CATEGORIES]
cs.LG
Learning to Price Bundles: A GCN Approach for Mixed Bundling
[AUTHORS]
Liangyu Ding, Chenghan Wu, Guokai Li, Zizhuo Wang
[ABSTRACT]
Bundle pricing refers to designing several product combinations (i.e.,
bundles) and determining their prices in order to maximize the expected profit.
It is a classic problem in revenue management and arises in many industries,
such as e-commerce, tourism, and video games. However, the problem is typically
intractable due to the exponential number of candidate bundles. In this paper,
we explore the usage of graph convolutional networks (GCNs) in solving the
bundle pricing problem. Specifically, we first develop a graph representation
of the mixed bundling model (where every possible bundle is assigned a
specific price) and then train a GCN to learn the latent patterns of optimal
bundles. Based on the trained GCN, we propose two inference strategies to
derive high-quality feasible solutions. A local-search technique is further
proposed to improve the solution quality. Numerical experiments validate the
effectiveness and efficiency of our proposed GCN-based framework. Using a GCN
trained on instances with 5 products, our methods consistently achieve
near-optimal solutions (better than 97%) in only a fraction of the computational
time for small- to medium-sized problems. They also achieve superior solutions
for larger problems compared with other heuristic methods such as
bundle size pricing (BSP). The method can also provide high-quality solutions
for instances with more than 30 products, even in the challenging cases where
product utilities are non-additive.
[LINK]
http://arxiv.org/abs/2509.22557v2
[DATE]
2025-10-07 17:53:13+08:00
[CATEGORIES]
cs.LG
EntryPrune: Neural Network Feature Selection using First Impressions
[AUTHORS]
Felix Zimmer, Patrik Okanovic, Torsten Hoefler
[ABSTRACT]
There is an ongoing effort to develop feature selection algorithms to improve
interpretability, reduce computational resources, and minimize overfitting in
predictive models. Neural networks stand out as architectures on which to build
feature selection methods, and recently, neuron pruning and regrowth have
emerged from the sparse neural network literature as promising new tools. We
introduce EntryPrune, a novel supervised feature selection algorithm using a
dense neural network with a dynamic sparse input layer. It employs entry-based
pruning, a novel approach that compares neurons based on their relative change
induced when they have entered the network. Extensive experiments on 13
different datasets show that our approach generally outperforms the current
state-of-the-art methods, and in particular improves the average accuracy on
low-dimensional datasets. Furthermore, we show that EntryPruning surpasses
traditional techniques such as magnitude pruning within the EntryPrune
framework and that EntryPrune achieves lower runtime than competing approaches.
Our code is available at https://github.com/flxzimmer/entryprune.
[LINK]
http://arxiv.org/abs/2410.02344v4
[DATE]
2025-10-07 17:52:31+08:00
[CATEGORIES]
cs.LG
Neighborhood-Adaptive Generalized Linear Graph Embedding with Latent Pattern Mining
[AUTHORS]
S. Peng, L. Hu, W. Zhang, B. Jie, Y. Luo
[ABSTRACT]
Graph embedding has been widely applied in areas such as network analysis,
social network mining, recommendation systems, and bioinformatics. However,
current graph construction methods often require the prior definition of
neighborhood size, limiting the effective revelation of potential structural
correlations in the data. Additionally, graph embedding methods using linear
projection heavily rely on a singular pattern mining approach, resulting in
relative weaknesses in adapting to different scenarios. To address these
challenges, we propose a novel model, Neighborhood-Adaptive Generalized Linear
Graph Embedding (NGLGE), grounded in latent pattern mining. This model
introduces an adaptive graph learning method tailored to the neighborhood,
effectively revealing intrinsic data correlations. Simultaneously, leveraging a
reconstructed low-rank representation and imposing an $\ell_{2,0}$-norm constraint
on the projection matrix allows for flexible exploration of additional pattern
information. Besides, an efficient iterative solving algorithm is derived for
the proposed model. Comparative evaluations on datasets from diverse scenarios
demonstrate the superior performance of our model compared to state-of-the-art
methods.
[LINK]
http://arxiv.org/abs/2510.05719v1
[DATE]
2025-10-07 17:37:29+08:00
[CATEGORIES]
cs.LG
DiffSDA: Unsupervised Diffusion Sequential Disentanglement Across Modalities
[AUTHORS]
Hedi Zisling, Ilan Naiman, Nimrod Berman, Supasorn Suwajanakorn, Omri Azencot
[ABSTRACT]
Unsupervised representation learning, particularly sequential
disentanglement, aims to separate static and dynamic factors of variation in
data without relying on labels. This remains a challenging problem, as existing
approaches based on variational autoencoders and generative adversarial
networks often rely on multiple loss terms, complicating the optimization
process. Furthermore, sequential disentanglement methods face challenges when
applied to real-world data, and there is currently no established evaluation
protocol for assessing their performance in such settings. Recently, diffusion
models have emerged as state-of-the-art generative models, but no theoretical
formalization exists for their application to sequential disentanglement. In
this work, we introduce the Diffusion Sequential Disentanglement Autoencoder
(DiffSDA), a novel, modality-agnostic framework effective across diverse
real-world data modalities, including time series, video, and audio. DiffSDA
leverages a new probabilistic model, latent diffusion, and efficient
samplers, while incorporating a challenging evaluation protocol for rigorous
testing. Our experiments on diverse real-world benchmarks demonstrate that
DiffSDA outperforms recent state-of-the-art methods in sequential
disentanglement.
[LINK]
http://arxiv.org/abs/2510.05717v1
[DATE]
2025-10-07 17:30:36+08:00
[CATEGORIES]
cs.LG
TranSUN: A Preemptive Paradigm to Eradicate Retransformation Bias Intrinsically from Regression Models in Recommender Systems
[AUTHORS]
Jiahao Yu, Haozhuang Liu, Yeqiu Yang, Lu Chen, Jian Wu, Yuning Jiang, Bo Zheng
[COMMENTS]
37 pages, 6 figures, NeurIPS 2025 Poster
[LINK]
http://arxiv.org/abs/2505.13881v5
[DATE]
2025-10-07 17:26:13+08:00
[CATEGORIES]
cs.LG
Stable Robot Motions on Manifolds: Learning Lyapunov-Constrained Neural Manifold ODEs
[AUTHORS]
David Boetius, Abdelrahman Abdelnaby, Ashok Kumar, Stefan Leue, Abdalla Swikir, Fares J. Abu-Dakka
[ABSTRACT]
Learning stable dynamical systems from data is crucial for safe and reliable
robot motion planning and control. However, extending stability guarantees to
trajectories defined on Riemannian manifolds poses significant challenges due
to the manifold’s geometric constraints. To address this, we propose a general
framework for learning stable dynamical systems on Riemannian manifolds using
neural ordinary differential equations. Our method guarantees stability by
projecting the neural vector field evolving on the manifold so that it strictly
satisfies the Lyapunov stability criterion, ensuring stability at every system
state. By leveraging a flexible neural parameterisation for both the base
vector field and the Lyapunov function, our framework can accurately represent
complex trajectories while respecting manifold constraints by evolving
solutions directly on the manifold. We provide an efficient training strategy
for applying our framework and demonstrate its utility by solving Riemannian
LASA datasets on the unit quaternion ($S^3$) and symmetric positive-definite
matrix manifolds, as well as robotic motions evolving on $\mathbb{R}^3 \times
S^3$. We demonstrate the performance, scalability, and practical applicability
of our approach through extensive simulations and by learning robot motions in
a real-world experiment.
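To illustrate the idea of projecting a vector field so that a Lyapunov decrease condition holds, here is a minimal sketch in flat Euclidean space. It uses a projection of the form f - grad V * max(0, grad V . f + alpha V) / ||grad V||^2, which is common in the stable-dynamics literature; whether the paper uses this exact form, and how it is adapted to Riemannian manifolds, is not stated in the abstract. The quadratic V, the value of `alpha`, and the function name are assumptions.

```python
import numpy as np


def project_to_stable(f, grad_v, v, alpha=1.0, eps=1e-12):
    """Return the closest vector to f (in the Euclidean sense) that satisfies
    grad_v . f_stable <= -alpha * v, up to the small regulariser eps."""
    violation = max(0.0, float(grad_v @ f) + alpha * v)
    return f - grad_v * violation / (float(grad_v @ grad_v) + eps)


# Hypothetical state, Lyapunov candidate V(x) = 0.5 * ||x||^2, and base field.
x = np.array([1.0, -2.0])
v = 0.5 * float(x @ x)
grad_v = x                      # gradient of V at x
f = np.array([2.0, 1.0])        # base vector field value that violates the condition
f_stable = project_to_stable(f, grad_v, v)
```

When the base field already satisfies the decrease condition, `violation` is zero and the field is returned unchanged, so the projection only intervenes where stability would otherwise fail.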
[COMMENTS]
12 pages, 6 figures
[LINK]
http://arxiv.org/abs/2510.05707v1
[DATE]
2025-10-07 17:16:48+08:00
[CATEGORIES]
cs.LG
Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms
[AUTHORS]
Jonathan Nöther, Adish Singla, Goran Radanovic
[ABSTRACT]
Ensuring the safe use of agentic systems requires a thorough understanding of
the range of malicious behaviors these systems may exhibit when under attack.
In this paper, we evaluate the robustness of LLM-based agentic systems against
attacks that aim to elicit harmful actions from agents. To this end, we propose
a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS,
for studying the security of agentic systems with respect to a wide range of
harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in
distinct application environments, as well as a dataset of 188 high-quality
examples of harmful actions. This enables a comprehensive study of the
robustness of agentic systems across a wide range of categories of harmful
behaviors, available tools, and inter-agent communication structures. Using
this benchmark, we analyze the robustness of agentic systems against an
attacker that controls one of the agents in the system and aims to manipulate
other agents to execute a harmful target action. Our results show that the
attack has a high success rate, demonstrating that even a single adversarial
agent within the system can have a significant impact on its security. This
attack remains effective even when agents use a simple prompting-based defense
strategy. However, we additionally propose a more effective defense based on
message monitoring. We believe that this benchmark provides a diverse testbed
for the security research of agentic systems. The benchmark can be found at
github.com/JNoether/BAD-ACTS
[COMMENTS]
54 Pages
[LINK]
http://arxiv.org/abs/2508.16481v2
[DATE]
2025-10-07 17:11:32+08:00
[CATEGORIES]
cs.LG
Primal-Dual Direct Preference Optimization for Constrained LLM Alignment
[AUTHORS]
Yihan Du, Seo Taek Kong, R. Srikant
[ABSTRACT]
The widespread application of Large Language Models (LLMs) imposes increasing
demands on safety, such as reducing harmful content and fake information, and
avoiding certain forbidden tokens due to rules and laws. While there have been
several recent works studying safe alignment of LLMs, these works either
require the training of reward and cost models and incur high memory and
computational costs, or need prior knowledge about the optimal solution.
Motivated by this fact, we study the problem of constrained alignment in LLMs,
i.e., maximizing the output reward while restricting the cost due to
potentially unsafe content to stay below a threshold. For this problem, we
propose a novel primal-dual DPO approach, which first trains a model using
standard DPO on reward preference data to provide reward information, and then
adopts a rearranged Lagrangian DPO objective utilizing the provided reward
information to fine-tune LLMs on cost preference data. Our approach
significantly reduces memory and computational costs, and does not require
extra prior knowledge. Moreover, we establish rigorous theoretical guarantees
on the suboptimality and constraint violation of the output policy. We also
extend our approach to an online data setting by incorporating exploration
bonuses, which enables our approach to explore uncovered prompt-response space,
and then provide theoretical results that get rid of the dependence on
preference data coverage. Experimental results on the widely-used preference
dataset PKU-SafeRLHF demonstrate the effectiveness of our approach.
[LINK]
http://arxiv.org/abs/2510.05703v1
[DATE]
2025-10-07 17:10:35+08:00
[CATEGORIES]
cs.LG
Sparse deepfake detection promotes better disentanglement
[AUTHORS]
Antoine Teissier, Marie Tahon, Nicolas Dugué, Aghilas Sini
[ABSTRACT]
Due to the rapid progress of speech synthesis, deepfake detection has become
a major concern in the speech processing community. Because it is a critical
task, systems must not only be efficient and robust, but also provide
interpretable explanations. Among the different approaches for explainability,
we focus on the interpretation of latent representations. In this paper, we
focus on the last layer of embeddings of AASIST, a deepfake detection
architecture. We use a TopK activation inspired by SAEs on this layer to obtain
sparse representations which are used in the decision process. We demonstrate
that sparse deepfake detection can improve detection performance, with an EER
of 23.36% on the ASVSpoof5 test set at 95% sparsity. We then show that these
representations provide better disentanglement, using completeness and
modularity metrics based on mutual information. Notably, some attacks are
directly encoded in the latent space.
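The TopK activation mentioned above can be sketched as follows: for each embedding vector, keep the k largest entries and set the rest to zero. This is a generic illustration, not the authors' AASIST code; the function name and the toy embeddings are hypothetical. If sparsity is read as the fraction of zeroed units, 95% sparsity would correspond to k being about 5% of the layer width.

```python
import numpy as np


def topk_activation(x, k):
    """Keep the k largest entries along the last axis and zero out the rest.
    Assumes 1 <= k <= x.shape[-1]."""
    x = np.asarray(x, dtype=float)
    # Indices of the k largest values in each row (order within the k is arbitrary).
    idx = np.argpartition(x, -k, axis=-1)[..., -k:]
    out = np.zeros_like(x)
    np.put_along_axis(out, idx, np.take_along_axis(x, idx, axis=-1), axis=-1)
    return out


# Hypothetical batch of two 4-dimensional embeddings.
emb = np.array([[0.1, 2.0, -1.0, 0.7],
                [3.0, 0.2, 0.5, -0.4]])
sparse = topk_activation(emb, k=2)
```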
[LINK]
http://arxiv.org/abs/2510.05696v1
[DATE]
2025-10-07 17:03:39+08:00
[CATEGORIES]
cs.LG
SDFs from Unoriented Point Clouds using Neural Variational Heat Distances
[AUTHORS]
Samuel Weidemaier, Florine Hartwig, Josua Sassen, Sergio Conti, Mirela Ben-Chen, Martin Rumpf
[ABSTRACT]
We propose a novel variational approach for computing neural Signed Distance
Fields (SDF) from unoriented point clouds. To this end, we replace the commonly
used eikonal equation with the heat method, carrying over to the neural domain
what has long been standard practice for computing distances on discrete
surfaces. This yields two convex optimization problems for whose solution we
employ neural networks: We first compute a neural approximation of the
gradients of the unsigned distance field through a small time step of heat flow
with weighted point cloud densities as initial data. Then we use it to compute
a neural approximation of the SDF. We prove that the underlying variational
problems are well-posed. Through numerical experiments, we demonstrate that our
method provides state-of-the-art surface reconstruction and consistent SDF
gradients. Furthermore, we show in a proof-of-concept that it is accurate
enough for solving a PDE on the zero-level set.
[COMMENTS]
15 pages, 17 figures, 4 tables
[LINK]
http://arxiv.org/abs/2504.11212v2
[DATE]
2025-10-07 16:58:15+08:00
[CATEGORIES]
cs.LG
Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies
[AUTHORS]
Yuhang Zhang, Jiaping Xiao, Chao Yan, Mir Feroskhan
[ABSTRACT]
A prevailing approach for learning visuomotor policies is to employ
reinforcement learning to map high-dimensional visual observations directly to
action commands. However, the combination of high-dimensional visual inputs and
agile maneuver outputs leads to long-standing challenges, including low sample
efficiency and significant sim-to-real gaps. To address these issues, we
propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a
novel framework designed to improve the sample efficiency and asymptotic
performance of visuomotor policy learning. OMC-RL explicitly decouples the
learning process into two stages: an upstream representation learning stage and
a downstream policy learning stage. In the upstream stage, a masked Transformer
module is trained with temporal modeling and contrastive learning to extract
temporally-aware and task-relevant representations from sequential visual
inputs. After training, the learned encoder is frozen and used to extract
visual representations from consecutive frames, while the Transformer module is
discarded. In the downstream stage, an oracle teacher policy with privileged
access to global state information supervises the agent during early training
to provide informative guidance and accelerate early policy learning. This
guidance is gradually reduced to allow independent exploration as training
progresses. Extensive experiments in simulated and real-world environments
demonstrate that OMC-RL achieves superior sample efficiency and asymptotic
policy performance, while also improving generalization across diverse and
perceptually complex scenarios.
[LINK]
http://arxiv.org/abs/2510.05692v1
[DATE]
2025-10-07 16:49:31+08:00
[CATEGORIES]
cs.LG
vAttention: Verified Sparse Attention
[AUTHORS]
Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
[ABSTRACT]
State-of-the-art sparse attention methods for reducing decoding latency fall
into two main categories: approximate top-$k$ (and its extension, top-$p$) and
recently introduced sampling-based estimation. However, these approaches are
fundamentally limited in their ability to approximate full attention: they fail
to provide consistent approximations across heads and query vectors and, most
critically, lack guarantees on approximation quality, limiting their practical
deployment. We observe that top-$k$ and random sampling are complementary:
top-$k$ performs well when attention scores are dominated by a few tokens,
whereas random sampling provides better estimates when attention scores are
relatively uniform. Building on this insight and leveraging the statistical
guarantees of sampling, we introduce vAttention, the first practical sparse
attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on
approximation accuracy (thus, verified). These guarantees make vAttention a
compelling step toward practical, reliable deployment of sparse attention at
scale. By unifying top-$k$ and sampling, vAttention outperforms both
individually, delivering a superior quality-efficiency trade-off. Our
experiments show that vAttention significantly improves the quality of sparse
attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and
Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap
between full and sparse attention (e.g., across datasets, it matches full model
quality with up to 20x sparsity). We also demonstrate that it can be deployed in
reasoning scenarios to achieve fast decoding without compromising model quality
(e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with
up to 32K token generations). Code is open-sourced at
https://github.com/xAlg-ai/sparse-attention-hub.
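The complementarity of top-$k$ and random sampling described above can be illustrated with a minimal sketch. It is not the vAttention algorithm: it has no mechanism for $(\epsilon, \delta)$ guarantees, and it computes every score up front purely to keep the example short. The names `exact_attention` and `hybrid_attention` are hypothetical. The heaviest `k` tokens are summed exactly, and the remaining tail is estimated from `m` uniformly sampled tokens whose contribution is rescaled by the tail size.

```python
import math
import random


def exact_attention(scores, values):
    """Full softmax attention output for a single query (reference)."""
    shift = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def hybrid_attention(scores, values, k, m, rng):
    """Illustrative estimate: exact sum over the top-k tokens plus a
    rescaled uniform sample of m tokens from the remaining tail."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top, rest = order[:k], order[k:]
    m = min(m, len(rest))
    sampled = rng.sample(rest, m) if m > 0 else []
    if not top and not sampled:
        raise ValueError("need at least one selected token")
    # Each sampled tail token stands in for len(rest) / m tail tokens.
    scale = len(rest) / m if m > 0 else 0.0
    picks = [(i, 1.0) for i in top] + [(i, scale) for i in sampled]

    shift = max(scores)
    dim = len(values[0])
    numerator = [0.0] * dim
    denominator = 0.0
    for i, mult in picks:
        w = mult * math.exp(scores[i] - shift)
        denominator += w
        for d in range(dim):
            numerator[d] += w * values[i][d]
    return [x / denominator for x in numerator]
```

When `m` covers the whole tail the estimate coincides with full attention; with smaller `m` the tail term becomes a Monte Carlo estimate whose error depends on how uniform the tail scores are.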
[LINK]
http://arxiv.org/abs/2510.05688v1
[DATE]
2025-10-07 16:46:08+08:00
[CATEGORIES]
cs.LG
QGraphLIME - Explaining Quantum Graph Neural Networks
[AUTHORS]
Haribandhu Jena, Jyotirmaya Shivottam, Subhankar Mishra
[ABSTRACT]
Quantum graph neural networks offer a powerful paradigm for learning on
graph-structured data, yet their explainability is complicated by
measurement-induced stochasticity and the combinatorial nature of graph
structure. In this paper, we introduce QuantumGraphLIME (QGraphLIME), a
model-agnostic, post-hoc framework that treats model explanations as
distributions over local surrogates fit on structure-preserving perturbations
of a graph. By aggregating surrogate attributions together with their
dispersion, QGraphLIME yields uncertainty-aware node and edge importance
rankings for quantum graph models. The framework further provides a
distribution-free, finite-sample guarantee on the size of the surrogate
ensemble: a Dvoretzky-Kiefer-Wolfowitz bound ensures uniform approximation of
the induced distribution of a binary class probability at target accuracy and
confidence under standard independence assumptions. Empirical studies on
controlled synthetic graphs with known ground truth demonstrate accurate and
stable explanations, with ablations showing clear benefits of nonlinear
surrogate modeling and highlighting sensitivity to perturbation design.
Collectively, these results establish a principled, uncertainty-aware, and
structure-sensitive approach to explaining quantum graph neural networks, and
lay the groundwork for scaling to broader architectures and real-world
datasets, as quantum resources mature. Code is available at
https://github.com/smlab-niser/qglime.
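As a worked illustration of the finite-sample guarantee mentioned above, the standard Dvoretzky-Kiefer-Wolfowitz inequality, $P(\sup_x |F_n(x) - F(x)| > \epsilon) \le 2e^{-2n\epsilon^2}$, implies that $n \ge \ln(2/\delta) / (2\epsilon^2)$ i.i.d. samples suffice for accuracy $\epsilon$ at confidence $1 - \delta$. The helper name `dkw_sample_size` is hypothetical, and how the paper maps this bound onto its surrogate ensemble is not specified here.

```python
import math


def dkw_sample_size(epsilon, delta):
    """Smallest n with 2 * exp(-2 * n * epsilon**2) <= delta (DKW bound)."""
    if epsilon <= 0 or not (0 < delta < 1):
        raise ValueError("require epsilon > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))
```

For example, `dkw_sample_size(0.1, 0.05)` evaluates to 185, and halving `epsilon` roughly quadruples the required ensemble size.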
[LINK]
http://arxiv.org/abs/2510.05683v1
[DATE]
2025-10-07 16:39:13+08:00
[CATEGORIES]
cs.LG
Verifier-free Test-Time Sampling for Vision Language Action Models
[AUTHORS]
Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, Jinwoo Shin
[ABSTRACT]
Vision-Language-Action models (VLAs) have demonstrated remarkable performance
in robot control. However, they remain fundamentally limited in tasks that
require high precision due to their single-inference paradigm. While test-time
scaling approaches using external verifiers have shown promise, they require
additional training and fail to generalize to unseen conditions. We propose
Masking Distribution Guided Selection (MG-Select), a novel test-time scaling
framework for VLAs that leverages the model’s internal properties without
requiring additional training or external modules. Our approach utilizes KL
divergence from a reference action token distribution as a confidence metric
for selecting the optimal action from multiple candidates. We introduce a
reference distribution generated by the same VLA but with randomly masked
states and language conditions as inputs, ensuring maximum uncertainty while
remaining aligned with the target task distribution. Additionally, we propose a
joint training strategy that enables the model to learn both conditional and
unconditional distributions by applying dropout to state and language
conditions, thereby further improving the quality of the reference
distribution. Our experiments demonstrate that MG-Select achieves significant
performance improvements, including a 28%/35% improvement in real-world
in-distribution/out-of-distribution tasks, along with a 168% relative gain on
RoboCasa pick-and-place tasks trained with 30 demonstrations.
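The selection rule described above can be sketched as follows, under stated assumptions: each candidate is summarized by a single action-token distribution, confidence is taken as KL(candidate || reference), and the candidate with the largest divergence is chosen. The function names are hypothetical, and the paper's actual aggregation over tokens and its KL direction may differ.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def select_candidate(candidate_dists, reference_dist):
    """Return the index of the candidate farthest (in KL) from the reference,
    together with all confidence scores. Assumed rule, not the paper's code."""
    scores = [kl_divergence(p, reference_dist) for p in candidate_dists]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

With a near-uniform reference (standing in for the masked-input distribution), a sharply peaked candidate receives a higher score than a diffuse one.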
[COMMENTS]
14 pages; 3 figures
[LINK]
http://arxiv.org/abs/2510.05681v1
[DATE]
2025-10-07 16:38:08+08:00
[CATEGORIES]
cs.LG
Inductive inference of gradient-boosted decision trees on graphs for insurance fraud detection
[AUTHORS]
Félix Vandervorst, Bruno Deprez, Wouter Verbeke, Tim Verdonck
[ABSTRACT]
Graph-based methods are becoming increasingly popular in machine learning due
to their ability to model complex data and relations. Insurance fraud is a
prime use case, since false claims are often the result of organised criminals
that stage accidents or the same persons filing erroneous claims on multiple
policies. One challenge is that graph-based approaches struggle to find
meaningful representations of the data because of the high class imbalance
present in fraud data. Another is that insurance networks are heterogeneous and
dynamic, given the changing relations among people, companies and policies.
That is why gradient boosted tree approaches on tabular data still dominate the
field. Therefore, we present a novel inductive graph gradient boosting machine
(G-GBM) for supervised learning on heterogeneous and dynamic graphs. We show
that our estimator competes with popular graph neural network approaches in an
experiment using a variety of simulated random graphs. We demonstrate the power
of G-GBM for insurance fraud detection using an open-source and a real-world,
proprietary dataset. Given that the backbone model is a gradient boosting
forest, we apply established explainability methods to gain better insights
into the predictions made by G-GBM.
[LINK]
http://arxiv.org/abs/2510.05676v1
[DATE]
2025-10-07 16:35:12+08:00
[CATEGORIES]
cs.LG
Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments
[AUTHORS]
Kirtan Rajesh, Suvidha Rupesh Kumar
[ABSTRACT]
This is the preprint version of the article published in IEEE Access vol. 13,
pp. 146503–146526, 2025, doi:10.1109/ACCESS.2025.3599541. Please cite the
published version.
Urban air pollution remains a pressing global concern, particularly in
densely populated and traffic-intensive metropolitan areas like Delhi, where
exposure to harmful pollutants severely impacts public health. Delhi, being one
of the most polluted cities globally, experiences chronic air quality issues
due to vehicular emissions, industrial activities, and construction dust, which
exacerbate its already fragile atmospheric conditions. Traditional pollution
mitigation strategies, such as static air purifying installations, often fail
to maximize their impact due to suboptimal placement and limited adaptability
to dynamic urban environments. This study presents a novel deep reinforcement
learning (DRL) framework to optimize the placement of air purification booths
to improve the air quality index (AQI) in the city of Delhi. We employ Proximal
Policy Optimization (PPO), a state-of-the-art reinforcement learning algorithm,
to iteratively learn and identify high-impact locations based on multiple
spatial and environmental factors, including population density, traffic
patterns, industrial influence, and green space constraints. Our approach is
benchmarked against conventional placement strategies, including random and
greedy AQI-based methods, using multi-dimensional performance evaluation
metrics such as AQI improvement, spatial coverage, population and traffic
impact, and spatial entropy.
[LINK]
http://arxiv.org/abs/2505.00668v2
[DATE]
2025-10-07 16:34:12+08:00
[CATEGORIES]
cs.LG
Quantifying the Accuracy-Interpretability Trade-Off in Concept-Based Sidechannel Models
[AUTHORS]
David Debot, Giuseppe Marra
[ABSTRACT]
Concept Bottleneck Models (CBNMs) are deep learning models that provide
interpretability by enforcing a bottleneck layer where predictions are based
exclusively on human-understandable concepts. However, this constraint also
restricts information flow and often results in reduced predictive accuracy.
Concept Sidechannel Models (CSMs) address this limitation by introducing a
sidechannel that bypasses the bottleneck and carries additional task-relevant
information. While this improves accuracy, it simultaneously compromises
interpretability, as predictions may rely on uninterpretable representations
transmitted through sidechannels. Currently, there exists no principled
technique to control this fundamental trade-off. In this paper, we close this
gap. First, we present a unified probabilistic concept sidechannel meta-model
that subsumes existing CSMs as special cases. Building on this framework, we
introduce the Sidechannel Independence Score (SIS), a metric that quantifies a
CSM’s reliance on its sidechannel by contrasting predictions made with and
without sidechannel information. We propose SIS regularization, which
explicitly penalizes sidechannel reliance to improve interpretability. Finally,
we analyze how the expressivity of the predictor and the reliance of the
sidechannel jointly shape interpretability, revealing inherent trade-offs
across different CSM architectures. Empirical results show that
state-of-the-art CSMs, when trained solely for accuracy, exhibit low
representation interpretability, and that SIS regularization substantially
improves their interpretability, intervenability, and the quality of learned
interpretable task predictors. Our work provides both theoretical and practical
tools for developing CSMs that balance accuracy and interpretability in a
principled manner.
[LINK]
http://arxiv.org/abs/2510.05670v1
[DATE]
2025-10-07 16:29:34+08:00
[CATEGORIES]
cs.LG
Detecting Invariant Manifolds in ReLU-Based RNNs
[AUTHORS]
Lukas Eisenmann, Alena Brändle, Zahra Monfared, Daniel Durstewitz
[ABSTRACT]
Recurrent Neural Networks (RNNs) have found widespread applications in
machine learning for time series prediction and dynamical systems
reconstruction, and experienced a recent renaissance with improved training
algorithms and architectural designs. Understanding why and how trained RNNs
produce their behavior is important for scientific and medical applications,
and explainable AI more generally. An RNN’s dynamical repertoire depends on the
topological and geometrical properties of its state space. Stable and unstable
manifolds of periodic points play a particularly important role: They dissect a
dynamical system’s state space into different basins of attraction, and their
intersections lead to chaotic dynamics with fractal geometry. Here we introduce
a novel algorithm for detecting these manifolds, with a focus on
piecewise-linear RNNs (PLRNNs) employing rectified linear units (ReLUs) as
their activation function. We demonstrate how the algorithm can be used to
trace the boundaries between different basins of attraction, and hence to
characterize multistability, a computationally important property. We further
show its utility in finding so-called homoclinic points, the intersections
between stable and unstable manifolds, and thus establish the existence of
chaos in PLRNNs. Finally we show for an empirical example, electrophysiological
recordings from a cortical neuron, how insights into the underlying dynamics
could be gained through our method.
[LINK]
http://arxiv.org/abs/2510.03814v2
[DATE]
2025-10-07 16:06:35+08:00
[CATEGORIES]
cs.LG
DeepBoost-AF: A Novel Unsupervised Feature Learning and Gradient Boosting Fusion for Robust Atrial Fibrillation Detection in Raw ECG Signals
[AUTHORS]
Alireza Jafari, Fereshteh Yousefirizi, Vahid Seydi
[ABSTRACT]
Atrial fibrillation (AF) is a prevalent cardiac arrhythmia associated with
elevated health risks, where timely detection is pivotal for mitigating
stroke-related morbidity. This study introduces an innovative hybrid
methodology integrating unsupervised deep learning and gradient boosting models
to improve AF detection. A 19-layer deep convolutional autoencoder (DCAE) is
coupled with three boosting classifiers, namely AdaBoost, XGBoost, and LightGBM
(LGBM), to harness their complementary advantages while addressing individual
limitations. The proposed framework uniquely combines DCAE with gradient
boosting, enabling end-to-end AF identification devoid of manual feature
extraction. The DCAE-LGBM model attains an F1-score of 95.20%, sensitivity of
99.99%, and inference latency of four seconds, outperforming existing methods
and aligning with clinical deployment requirements. The DCAE integration
significantly enhances boosting models, positioning this hybrid system as a
reliable tool for automated AF detection in clinical settings.
[COMMENTS]
12 pages, 4 figures, 3 tables. Achieves 95.20% F1-score (99.99%
sensitivity) on 8,528 PhysioNet 2017 recordings. Mean inference time: 4
seconds. Python implementation will be open-sourced upon publication
[LINK]
http://arxiv.org/abs/2505.24085v2
[DATE]
2025-10-07 15:53:47+08:00
[CATEGORIES]
cs.LG
Conditional Local Independence Testing for Itô processes with Applications to Dynamic Causal Discovery
[AUTHORS]
Mingzhou Liu, Xinwei Sun, Yizhou Wang
[ABSTRACT]
Inferring causal relationships from dynamical systems is the central interest
of many scientific inquiries. Conditional local independence, which describes
whether the evolution of one process is influenced by another process given
additional processes, is important for causal learning in such systems. In this
paper, we propose a hypothesis test for conditional local independence in Itô
processes. Our test is grounded in the semimartingale decomposition of the
Itô process, with which we introduce a stochastic integral process that is a
martingale under the null hypothesis. We then apply a test for the martingale
property, quantifying potential deviation from local independence. The test
statistic is estimated using the optimal filtering equation. We show the
consistency of the estimation, thereby establishing the level and power of our
test. Numerical verification and a real-world application to causal discovery
in brain resting-state fMRIs are conducted.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2506.07844v3
[DATE]
2025-10-07 15:44:41+08:00
[CATEGORIES]
cs.LG
Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning
[AUTHORS]
Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou
[ABSTRACT]
Current long-tailed semi-supervised learning methods assume that labeled data
exhibit a long-tailed distribution, and unlabeled data adhere to a typical
predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed).
However, the distribution of the unlabeled data is generally unknown and may
follow an arbitrary distribution. To tackle this challenge, we propose a
Controllable Pseudo-label Generation (CPG) framework, expanding the labeled
dataset with the progressively identified reliable pseudo-labels from the
unlabeled dataset and training the model on the updated labeled dataset with a
known distribution, making it unaffected by the unlabeled data distribution.
Specifically, CPG operates through a controllable self-reinforcing optimization
cycle: (i) at each training step, our dynamic controllable filtering mechanism
selectively incorporates reliable pseudo-labels from the unlabeled dataset into
the labeled dataset, ensuring that the updated labeled dataset follows a known
distribution; (ii) we then construct a Bayes-optimal classifier using logit
adjustment based on the updated labeled data distribution; (iii) this improved
classifier subsequently helps identify more reliable pseudo-labels in the next
training step. We further theoretically prove that this optimization cycle can
significantly reduce the generalization error under some conditions.
Additionally, we propose a class-aware adaptive augmentation module to further
improve the representation of minority classes, and an auxiliary branch to
maximize data utilization by leveraging all labeled and unlabeled samples.
Comprehensive evaluations on various commonly used benchmark datasets show that
CPG achieves consistent improvements, surpassing state-of-the-art methods by up
to 15.97% in accuracy. The code is available at
https://github.com/yaxinhou/CPG.
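Step (ii) above relies on logit adjustment, which can be sketched in its generic form: the log class prior, estimated from the current labeled set, is added to the logits before the softmax cross-entropy. This is the standard logit-adjusted loss, not necessarily CPG's exact formulation. The names `class_priors`, `logit_adjusted_loss`, and the temperature `tau` are assumptions, and every class is assumed to have at least one labeled sample.

```python
import math


def class_priors(label_counts):
    """Empirical class priors from per-class label counts (all counts > 0)."""
    total = sum(label_counts)
    return [c / total for c in label_counts]


def logit_adjusted_loss(logits, label, priors, tau=1.0):
    """Cross-entropy on logits shifted by tau * log(prior) per class."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    peak = max(adjusted)  # log-sum-exp with max subtraction for stability
    log_norm = peak + math.log(sum(math.exp(a - peak) for a in adjusted))
    return log_norm - adjusted[label]
```

Because the prior is recomputed from the labeled set, the adjustment only stays valid if that set's distribution is known, which is the property the filtering step is designed to maintain.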
[COMMENTS]
The paper is accepted by NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.03993v2
[DATE]
2025-10-07 15:36:33+08:00
[CATEGORIES]
cs.LG
NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering
[AUTHORS]
Alexander Murphy, Michal Danilowski, Soumyajit Chatterjee, Abhirup Ghosh
[ABSTRACT]
Test-Time Adaptation (TTA) methods are often computationally expensive,
require a large amount of data for effective adaptation, or are brittle to
hyperparameters. Based on a theoretical foundation of the geometry of the
latent space, we are able to significantly improve the alignment between source
and distribution-shifted samples by re-centering target data embeddings at the
origin. This insight motivates NEO, a hyperparameter-free, fully TTA method
that adds no significant compute compared to vanilla inference. NEO is able to
improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to
59.2% after adapting on just one batch of 64 samples. When adapting on 512
samples NEO beats all 7 TTA methods we compare against on ImageNet-C,
ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least
amount of compute. NEO performs well on model calibration metrics and
additionally is able to adapt from 1 class to improve accuracy on 999 other
classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO
reduces inference time by 63% and memory usage by 9% compared to baselines. Our
results based on 3 ViT architectures and 4 datasets show that NEO can be used
efficiently and effectively for TTA.
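The re-centering idea above amounts to subtracting the mean of the target embeddings so that they are centered at the origin. A minimal sketch follows, assuming embeddings are plain lists of floats and that the mean is taken over a single batch. The name `recenter` is hypothetical, and which layer is re-centered and how the mean is tracked across batches are not specified here.

```python
def recenter(embeddings):
    """Subtract the batch mean from each embedding vector.

    Returns the centered embeddings and the mean that was removed.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    centered = [[e[d] - mean[d] for d in range(dim)] for e in embeddings]
    return centered, mean
```

No gradient step or tunable parameter is involved, which is consistent with the method being described as optimization-free and hyperparameter-free.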
[LINK]
http://arxiv.org/abs/2510.05635v1
[DATE]
2025-10-07 15:35:55+08:00
[CATEGORIES]
cs.LG
From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs
[AUTHORS]
Tianhao Zhu, Dahu Feng, Erhu Feng, Yubin Xia
[ABSTRACT]
With the widespread adoption of Large Language Models (LLMs), the demand for
high-performance LLM inference services continues to grow. To meet this demand,
a growing number of AI accelerators have been proposed, such as Google TPU,
Huawei NPU, Graphcore IPU, and Cerebras WSE. Most of these accelerators
adopt multi-core architectures to achieve enhanced scalability, but lack the
flexibility of SIMT architectures. Therefore, without careful configuration of
the hardware architecture, as well as deliberate design of tensor parallelism
and core placement strategies, computational resources may be underutilized,
resulting in suboptimal inference performance.
To address these challenges, we first present a multi-level simulation
framework with both transaction-level and performance-model-based simulation
for multi-core NPUs. Using this simulator, we conduct a systematic analysis and
further propose the optimal solutions for tensor parallelism strategies, core
placement policies, memory management methods, as well as the selection between
PD-disaggregation and PD-fusion on multi-core NPUs. We conduct comprehensive
experiments on representative LLMs and various NPU configurations. The
evaluation results demonstrate that our solution can achieve 1.32x-6.03x
speedup compared to SOTA designs for multi-core NPUs across different hardware
configurations. As for LLM serving, our work offers guidance on designing
optimal hardware architectures and serving strategies for multi-core NPUs
across various LLM workloads.
[LINK]
http://arxiv.org/abs/2510.05632v1
[DATE]
2025-10-07 15:29:16+08:00
[CATEGORIES]
cs.LG
Monte Carlo-Type Neural Operator for Differential Equations
[AUTHORS]
Salah Eddine Choutri, Prajwal Chauhan, Othmane Mazhar, Saif Eddin Jabari
[ABSTRACT]
The Monte Carlo-type Neural Operator (MCNO) introduces a framework for
learning solution operators of one-dimensional partial differential equations
(PDEs) by directly learning the kernel function and approximating the
associated integral operator using a Monte Carlo-type approach. Unlike Fourier
Neural Operators (FNOs), which rely on spectral representations and assume
translation-invariant kernels, MCNO makes no such assumptions. The kernel is
represented as a learnable tensor over sampled input-output pairs, and sampling
is performed once, uniformly at random from a discretized grid. This design
enables generalization across multiple grid resolutions without relying on
fixed global basis functions or repeated sampling during training, while an
interpolation step maps between arbitrary input and output grids to further
enhance flexibility. Experiments on standard 1D PDE benchmarks show that MCNO
achieves competitive accuracy with efficient computational cost. We also
provide a theoretical analysis proving that the Monte Carlo estimator yields a
bounded bias and variance under mild regularity assumptions. This result holds
in any spatial dimension, suggesting that MCNO may extend naturally beyond
one-dimensional problems. More broadly, this work explores how Monte Carlo-type
integration can be incorporated into neural operator frameworks for
continuous-domain PDEs, providing a theoretically supported alternative to
spectral methods (such as FNO) and to graph-based Monte Carlo approaches (such
as the Graph Kernel Neural Operator, GNO).
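The Monte Carlo approximation of an integral operator described above can be sketched in one dimension as $(Ku)(x) \approx \frac{|D|}{N} \sum_{i=1}^{N} \kappa(x, y_i)\, u(y_i)$, with the points $y_i$ drawn once, uniformly at random, from the grid. In this sketch the kernel is an ordinary Python callable rather than a learnable tensor, and the name `mc_integral_operator` is hypothetical, so it illustrates only the estimator and not the MCNO architecture.

```python
import random


def mc_integral_operator(kernel, u, grid, num_samples, rng):
    """Build x -> estimate of the integral of kernel(x, y) * u(y) over the grid span.

    `u` holds function values at the grid points. The sample of grid
    indices is drawn once and reused for every evaluation point x.
    """
    sample_idx = rng.sample(range(len(grid)), num_samples)
    length = grid[-1] - grid[0]  # |D| for a 1D interval

    def apply(x):
        total = sum(kernel(x, grid[j]) * u[j] for j in sample_idx)
        return length * total / num_samples

    return apply
```

Because the sample is fixed after construction, evaluating the operator at new points `x` needs no resampling, mirroring the single sampling step in the abstract.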
[LINK]
http://arxiv.org/abs/2510.05620v1
[DATE]
2025-10-07 15:07:04+08:00
[CATEGORIES]
cs.LG
DP-HYPE: Distributed Differentially Private Hyperparameter Search
[AUTHORS]
Johannes Liebenow, Thorsten Peinemann, Esfandiar Mohammadi
[ABSTRACT]
The tuning of hyperparameters in distributed machine learning can
substantially impact model performance. When the hyperparameters are tuned on
sensitive data, privacy becomes an important challenge and to this end,
differential privacy has emerged as the de facto standard for provable privacy.
A standard setting when performing distributed learning tasks is that clients
agree on a shared setup, i.e., find a compromise from a set of hyperparameters,
like the learning rate of the model to be trained. Yet, prior work on
differentially private hyperparameter tuning either uses computationally
expensive cryptographic protocols, determines hyperparameters separately for
each client, or applies differential privacy locally, which can lead to
undesirable utility-privacy trade-offs.
In this work, we present our algorithm DP-HYPE, which performs a distributed
and privacy-preserving hyperparameter search by conducting a distributed voting
based on local hyperparameter evaluations of clients. In this way, DP-HYPE
selects hyperparameters that lead to a compromise supported by the majority of
clients, while maintaining scalability and independence from specific learning
tasks. We prove that DP-HYPE preserves the strong notion of differential
privacy called client-level differential privacy and, importantly, show that
its privacy guarantees do not depend on the number of hyperparameters. We also
provide bounds on its utility guarantees, that is, the probability of reaching
a compromise, and implement DP-HYPE as a submodule in the popular Flower
framework for distributed machine learning. In addition, we evaluate
performance on multiple benchmark data sets in iid as well as multiple non-iid
settings and demonstrate high utility of DP-HYPE even under small privacy
budgets.
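The voting idea above can be illustrated with a generic differentially private vote. This is not the DP-HYPE protocol: it assumes a trusted central aggregator and applies the textbook Laplace mechanism. Each client casts one vote for its preferred hyperparameter index, and replacing one client's vote changes two counts by one each (L1 sensitivity 2), so Laplace noise of scale `2 / epsilon` is added to every count before taking the argmax. The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_vote(client_votes, num_options, epsilon, rng):
    """Pick the winning option from a noisy vote histogram.

    Assumes one vote per client and replace-one neighboring datasets,
    giving L1 sensitivity 2 for the histogram.
    """
    counts = [0] * num_options
    for vote in client_votes:
        counts[vote] += 1
    scale = 2.0 / epsilon
    noisy = [c + laplace_noise(scale, rng) for c in counts]
    # Taking the argmax is post-processing and does not weaken the guarantee.
    return max(range(num_options), key=noisy.__getitem__)
```

Note that the noise scale in this sketch depends only on `epsilon`, not on `num_options`, which loosely echoes the abstract's claim that the privacy guarantee is independent of the number of hyperparameters.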
[LINK]
http://arxiv.org/abs/2510.04902v2
[DATE]
2025-10-07 15:00:58+08:00
[CATEGORIES]
cs.LG
InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment
[AUTHORS]
Ibrahim Salihu Yusuf, Iffanice Houndayi, Rym Oualha, Mohamed Aziz Cherif, Kobby Panford-Quainoo, Arnu Pretorius
[ABSTRACT]
Open-access multispectral imagery from missions like Landsat 8-9 and
Sentinel-2 has fueled the development of geospatial foundation models (GFMs)
for humanitarian and environmental applications. Yet, their deployment remains
limited by (i) the absence of automated geospatial data pipelines and (ii) the
large size of fine-tuned models. Existing GFMs lack workflows for processing
raw satellite imagery, and downstream adaptations often retain the full
complexity of the original encoder.
We present InstaGeo, an open-source, end-to-end framework that addresses
these challenges by integrating: (1) automated data curation to transform raw
imagery into model-ready datasets; (2) task-specific model distillation to
derive compact, compute-efficient models; and (3) seamless deployment as
interactive web-map applications. Using InstaGeo, we reproduced datasets from
three published studies and trained models with marginal mIoU differences of
-0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for
desert locust prediction. The distilled models are up to 8x smaller than
standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal
accuracy loss.
Leveraging InstaGeo’s streamlined data pipeline, we also curated a larger
crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp
improvement over prior baselines. Moreover, InstaGeo enables users to progress
from raw data to model deployment within a single working day.
By unifying data preparation, model compression, and deployment, InstaGeo
transforms research-grade GFMs into practical, low-carbon tools for real-time,
large-scale Earth observation. This approach shifts geospatial AI toward data
quality and application-driven innovation. Source code, datasets, and model
checkpoints are available at:
https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git
[LINK]
http://arxiv.org/abs/2510.05617v1
[DATE]
2025-10-07 14:57:15+08:00
[CATEGORIES]
cs.LG
Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance
[AUTHORS]
Akio Hayakawa, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji
[ABSTRACT]
We propose a step-by-step video-to-audio (V2A) generation method for finer
controllability over the generation process and more realistic audio synthesis.
Inspired by traditional Foley workflows, our approach aims to comprehensively
capture all sound events induced by a video through the incremental generation
of missing sound events. To avoid the need for costly multi-reference
video-audio datasets, each generation step is formulated as a negatively guided
V2A process that discourages duplication of existing sounds. The guidance model
is trained by finetuning a pre-trained V2A model on audio pairs from adjacent
segments of the same video, allowing training with standard single-reference
audiovisual datasets that are easily accessible. Objective and subjective
evaluations demonstrate that our method enhances the separability of generated
sounds at each step and improves the overall quality of the final composite
audio, outperforming existing baselines.
[LINK]
http://arxiv.org/abs/2506.20995v3
[DATE]
2025-10-07 14:36:19+08:00
[CATEGORIES]
cs.LG
PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction
[AUTHORS]
Ziqiao Meng, Qichao Wang, Zhiyang Dou, Zixing Song, Zhipeng Zhou, Irwin King, Peilin Zhao
[ABSTRACT]
Autoregressive point cloud generation has long lagged behind diffusion-based
approaches in quality. The performance gap stems from the fact that
autoregressive models impose an artificial ordering on inherently unordered
point sets, forcing shape generation to proceed as a sequence of local
predictions. This sequential bias emphasizes short-range continuity but
undermines the model’s capacity to capture long-range dependencies, hindering
its ability to enforce global structural properties such as symmetry,
consistent topology, and large-scale geometric regularities. Inspired by the
level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a
coarse-to-fine generative framework that preserves global shape structure at
low resolutions and progressively refines fine-grained geometry at higher
scales through a next-scale prediction paradigm. This multi-scale factorization
aligns the autoregressive objective with the permutation-invariant nature of
point sets, enabling rich intra-scale interactions while avoiding brittle fixed
orderings. Experiments on ShapeNet show that PointNSP establishes
state-of-the-art (SOTA) generation quality for the first time within the
autoregressive paradigm. In addition, it surpasses strong diffusion-based
baselines in parameter, training, and inference efficiency. Finally, in dense
generation with 8,192 points, PointNSP’s advantages become even more
pronounced, underscoring its scalability potential.
[LINK]
http://arxiv.org/abs/2510.05613v1
[DATE]
2025-10-07 14:31:02+08:00
[CATEGORIES]
cs.LG
Bypassing Prompt Guards in Production with Controlled-Release Prompting
[AUTHORS]
Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang
[ABSTRACT]
As large language models (LLMs) advance, ensuring AI safety and alignment is
paramount. One popular approach is prompt guards, lightweight mechanisms
designed to filter malicious queries while being easy to implement and update.
In this work, we introduce a new attack that circumvents such prompt guards,
highlighting their limitations. Our method consistently jailbreaks production
models while maintaining response quality, even under the highly protected chat
interfaces of Google Gemini (2.5 Flash/Pro), DeepSeek Chat (DeepThink), Grok
(3), and Mistral Le Chat (Magistral). The attack exploits a resource asymmetry
between the prompt guard and the main LLM, encoding a jailbreak prompt that
lightweight guards cannot decode but the main model can. This reveals an attack
surface inherent to lightweight prompt guards in modern LLM architectures and
underscores the need to shift defenses from blocking malicious inputs to
preventing malicious outputs. We additionally identify other critical alignment
issues, such as copyrighted data extraction, training data extraction, and
malicious response leakage during thinking.
[LINK]
http://arxiv.org/abs/2510.01529v2
[DATE]
2025-10-07 14:05:50+08:00
[CATEGORIES]
cs.LG
Riddled basin geometry sets fundamental limits to predictability and reproducibility in deep learning
[AUTHORS]
Andrew Ly, Pulin Gong
[ABSTRACT]
Fundamental limits to predictability are central to our understanding of many
physical and computational systems. Here we show that, despite its remarkable
capabilities, deep learning exhibits such fundamental limits rooted in the
fractal, riddled geometry of its basins of attraction: any initialization that
leads to one solution lies arbitrarily close to another that leads to a
different one. We derive sufficient conditions for the emergence of riddled
basins by analytically linking features widely observed in deep learning,
including chaotic learning dynamics and symmetry-induced invariant subspaces,
to reveal a general route to riddling in realistic deep networks. The resulting
basins of attraction possess an infinitely fine-scale fractal structure
characterized by an uncertainty exponent near zero, so that even large
increases in the precision of initial conditions yield only marginal gains in
outcome predictability. Riddling thus imposes a fundamental limit on the
predictability and hence reproducibility of neural network training, providing
a unified account of many empirical observations. These results reveal a
general organizing principle of deep learning with important implications for
optimization and the safe deployment of artificial intelligence.
[LINK]
http://arxiv.org/abs/2510.05606v1
[DATE]
2025-10-07 14:02:58+08:00
[CATEGORIES]
cs.LG
GRAFT: GRaPH and Table Reasoning for Textual Alignment – A Benchmark for Structured Instruction Following and Visual Reasoning
[AUTHORS]
Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran
[ABSTRACT]
GRAFT is a structured multimodal benchmark for evaluating models on
instruction-following, visual reasoning, and visual-textual alignment tasks. It
features programmatically generated charts and synthetically rendered tables,
created with Python visualization libraries to ensure control over data
semantics, structure, and clarity. Each GRAFT instance pairs a chart or table
image with a systematically generated, multi-step analytical question based
solely on visual content. Answers are provided in structured formats such as
JSON or YAML, supporting consistent evaluation of both reasoning and output
format. The benchmark introduces a taxonomy of reasoning types including
comparison, trend identification, ranking, aggregation, proportion estimation,
and anomaly detection to enable comprehensive assessment. Reference answers
follow strict factual and formatting guidelines for precise, aspect-based
evaluation. GRAFT offers a unified, scalable framework for fine-grained
benchmarking of multimodal models on visually grounded, structured reasoning
tasks, setting a new evaluation standard in this field.
[COMMENTS]
25 pages, 10 tables, 3 figures
[LINK]
http://arxiv.org/abs/2508.15690v3
[DATE]
2025-10-07 14:01:15+08:00
[CATEGORIES]
cs.LG
Deciphering Invariant Feature Decoupling in Source-free Time Series Forecasting with Proxy Denoising
[AUTHORS]
Kangjia Yan, Chenxi Liu, Hao Miao, Xinle Wu, Yan Zhao, Chenjuan Guo, Bin Yang
[ABSTRACT]
The proliferation of mobile devices generates a massive volume of time series
across various domains, where effective time series forecasting enables a
variety of real-world applications. This study focuses on a new problem of
source-free domain adaptation for time series forecasting. It aims to adapt a
pretrained model from sufficient source time series to the sparse target time
series domain without access to the source data, embracing data protection
regulations. To achieve this, we propose TimePD, the first source-free time
series forecasting framework with proxy denoising, where large language models
(LLMs) are employed to benefit from their generalization capabilities.
Specifically, TimePD consists of three key components: (1) dual-branch
invariant disentangled feature learning that enforces representation- and
gradient-wise invariance by means of season-trend decomposition; (2)
lightweight, parameter-free proxy denoising that dynamically calibrates
systematic biases of LLMs; and (3) knowledge distillation that bidirectionally
aligns the denoised prediction and the original target prediction. Extensive
experiments on real-world datasets offer insight into the effectiveness of the
proposed TimePD, outperforming SOTA baselines by 9.3% on average.
[LINK]
http://arxiv.org/abs/2510.05589v1
[DATE]
2025-10-07 13:29:18+08:00
[CATEGORIES]
cs.LG
Geometry-Preserving Encoder/Decoder in Latent Generative Models
[AUTHORS]
Wonjun Lee, Riley C. W. O’Neill, Dongmian Zou, Jeff Calder, Gilad Lerman
[ABSTRACT]
Generative modeling aims to generate new data samples that resemble a given
dataset, with diffusion models recently becoming the most popular generative
model. One of the main challenges of diffusion models is solving the problem in
the input space, which tends to be very high-dimensional. Recently, solving
diffusion models in the latent space through an encoder that maps from the data
space to a lower-dimensional latent space has been considered to make the
training process more efficient and has shown state-of-the-art results. The
variational autoencoder (VAE) is the most commonly used encoder/decoder
framework in this domain, known for its ability to learn latent representations
and generate data samples. In this paper, we introduce a novel encoder/decoder
framework with theoretical properties distinct from those of the VAE,
specifically designed to preserve the geometric structure of the data
distribution. We demonstrate the significant advantages of this
geometry-preserving encoder in the training process of both the encoder and
decoder. Additionally, we provide theoretical results proving convergence of
the training process, including convergence guarantees for encoder training,
and results showing faster convergence of decoder training when using the
geometry-preserving encoder.
[COMMENTS]
56 pages
[LINK]
http://arxiv.org/abs/2501.09876v2
[DATE]
2025-10-07 13:09:44+08:00
[CATEGORIES]
cs.LG
(Token-Level) InfoRMIA: Stronger Membership Inference and Memorization Assessment for LLMs
[AUTHORS]
Jiashu Tao, Reza Shokri
[ABSTRACT]
Machine learning models are known to leak sensitive information, as they
inevitably memorize (parts of) their training data. More alarmingly, large
language models (LLMs) are now trained on nearly all available data, which
amplifies the magnitude of information leakage and raises serious privacy
risks. Hence, it is more crucial than ever to quantify privacy risk before the
release of LLMs. The standard method to quantify privacy is via membership
inference attacks, where the state-of-the-art approach is the Robust Membership
Inference Attack (RMIA). In this paper, we present InfoRMIA, a principled
information-theoretic formulation of membership inference. Our method
consistently outperforms RMIA across benchmarks while also offering improved
computational efficiency.
In the second part of the paper, we identify the limitations of treating
sequence-level membership inference as the gold standard for measuring leakage.
We propose a new perspective for studying membership and memorization in LLMs:
token-level signals and analyses. We show that a simple token-based InfoRMIA
can pinpoint which tokens are memorized within generated outputs, thereby
localizing leakage from the sequence level down to individual tokens, while
achieving stronger sequence-level inference power on LLMs. This new scope
rethinks privacy in LLMs and can lead to more targeted mitigation, such as
exact unlearning.
[LINK]
http://arxiv.org/abs/2510.05582v1
[DATE]
2025-10-07 12:59:49+08:00
[CATEGORIES]
cs.LG
Power Mechanism: Private Tabular Representation Release for Model Agnostic Consumption
[AUTHORS]
Praneeth Vepakomma, Kaustubh Ponkshe
[ABSTRACT]
Traditional collaborative learning approaches are based on sharing of model
weights between clients and a server. However, there are advantages to resource
efficiency through schemes based on sharing of embeddings (activations) created
from the data. Several differentially private methods have been developed for
sharing weights, but no such mechanisms exist so far for sharing
embeddings. We propose the Power Mechanism, which learns a privacy encoding network in conjunction
with a small utility generation network such that the final embeddings
generated from it are equipped with formal differential privacy guarantees.
These privatized embeddings are then shared with a more powerful server, which
learns a post-processing step that yields higher accuracy on machine learning
tasks. We show that our co-design of collaborative and private learning
requires only one round of privatized communication and less compute on
the client than traditional methods. The privatized embeddings that we share
from the client are agnostic to the type of model (deep learning, random
forests or XGBoost) used on the server in order to process these activations to
complete a task.
[LINK]
http://arxiv.org/abs/2510.05581v1
[DATE]
2025-10-07 12:55:38+08:00
[CATEGORIES]
cs.LG
On the Theory of Continual Learning with Gradient Descent for Neural Networks
[AUTHORS]
Hossein Taheri, Avishek Ghosh, Arya Mazumdar
[ABSTRACT]
Continual learning, the ability of a model to adapt to an ongoing sequence of
tasks without forgetting the earlier ones, is a central goal of artificial
intelligence. To shed light on its underlying mechanisms, we analyze the
limitations of continual learning in a tractable yet representative setting. In
particular, we study one-hidden-layer quadratic neural networks trained by
gradient descent on an XOR cluster dataset with Gaussian noise, where different
tasks correspond to different clusters with orthogonal means. We
obtain bounds on the rate of forgetting during training and at test time in terms of
the number of iterations, the sample size, the number of tasks, and the
hidden-layer size. Our results reveal interesting phenomena regarding the role of
different problem parameters in the rate of forgetting. Numerical experiments
across diverse setups confirm our results, demonstrating their validity beyond
the analyzed settings.
[LINK]
http://arxiv.org/abs/2510.05573v1
[DATE]
2025-10-07 12:32:27+08:00
[CATEGORIES]
cs.LG
Efficient Learning-based Graph Simulation for Temporal Graphs
[AUTHORS]
Sheng Xiang, Chenhao Xu, Dawei Cheng, Xiaoyang Wang, Ying Zhang
[ABSTRACT]
Graph simulation has recently received a surge of attention in graph
processing and analytics. In real-life applications, e.g., social science,
biology, and chemistry, many graphs are composed of a series of evolving graphs
(i.e., temporal graphs). Most existing graph generators focus on
static graphs and ignore the temporal information of the graphs. In this
paper, we focus on simulating temporal graphs, which aims to reproduce the
structural and temporal properties of the observed real-life temporal graphs.
We first give an overview of the existing temporal graph
generators, including recently emerged learning-based approaches. Most of these
learning-based methods suffer from one of two limitations: inefficient
training or slow generation, especially for temporal random walk-based methods.
Therefore, we propose an efficient learning-based approach to generate graph
snapshots, namely temporal graph autoencoder (TGAE). Specifically, we propose
an attention-based graph encoder to encode temporal and structural
characteristics on sampled ego-graphs. We also propose an ego-graph decoder
that can achieve a good trade-off between simulation quality and efficiency in
temporal graph generation. Finally, we evaluate
our proposed TGAE against representative temporal graph generators on
real-life temporal graphs and synthesized graphs. Our
proposed approach outperforms the state-of-the-art temporal graph generators in
terms of simulation quality and efficiency.
[COMMENTS]
14 pages, 6 figures, IEEE ICDE 2025
[LINK]
http://arxiv.org/abs/2510.05569v1
[DATE]
2025-10-07 12:22:24+08:00
[CATEGORIES]
cs.LG
Bilevel optimization for learning hyperparameters: Application to solving PDEs and inverse problems with Gaussian processes
[AUTHORS]
Nicholas H. Nelsen, Houman Owhadi, Andrew M. Stuart, Xianjin Yang, Zongren Zou
[ABSTRACT]
Methods for solving scientific computing and inference problems, such as
kernel- and neural network-based approaches for partial differential equations
(PDEs), inverse problems, and supervised learning tasks, depend crucially on
the choice of hyperparameters. Specifically, the efficacy of such methods, and
in particular their accuracy, stability, and generalization properties,
strongly depends on the choice of hyperparameters. While bilevel optimization
offers a principled framework for hyperparameter tuning, its nested
optimization structure can be computationally demanding, especially in
PDE-constrained contexts. In this paper, we propose an efficient strategy for
hyperparameter optimization within the bilevel framework by employing a
Gauss-Newton linearization of the inner optimization step. Our approach
provides closed-form updates, eliminating the need for repeated costly PDE
solves. As a result, each iteration of the outer loop reduces to a single
linearized PDE solve, followed by explicit gradient-based hyperparameter
updates. We demonstrate the effectiveness of the proposed method through
Gaussian process models applied to nonlinear PDEs and to PDE inverse problems.
Extensive numerical experiments highlight substantial improvements in accuracy
and robustness compared to conventional random hyperparameter initialization.
In particular, experiments with additive kernels and neural
network-parameterized deep kernels demonstrate the method’s scalability and
effectiveness for high-dimensional hyperparameter optimization.
[LINK]
http://arxiv.org/abs/2510.05568v1
[DATE]
2025-10-07 12:22:09+08:00
[CATEGORIES]
cs.LG
Generative Dynamic Graph Representation Learning for Conspiracy Spoofing Detection
[AUTHORS]
Sheng Xiang, Yidong Jiang, Yunting Chen, Dawei Cheng, Guoping Zhao, Changjun Jiang
[ABSTRACT]
Spoofing detection in financial trading is crucial, especially for
identifying complex behaviors such as conspiracy spoofing. Traditional
machine-learning approaches primarily focus on isolated node features, often
overlooking the broader context of interconnected nodes. Graph-based
techniques, particularly Graph Neural Networks (GNNs), have advanced the field
by leveraging relational information effectively. However, in real-world
spoofing detection datasets, trading behaviors exhibit dynamic, irregular
patterns. Existing spoofing detection methods, though effective in some
scenarios, struggle to capture the complexity of dynamic and diverse, evolving
inter-node relationships. To address these challenges, we propose a novel
framework called the Generative Dynamic Graph Model (GDGM), which models
dynamic trading behaviors and the relationships among nodes to learn
representations for conspiracy spoofing detection. Specifically, our approach
incorporates the generative dynamic latent space to capture the temporal
patterns and evolving market conditions. Raw trading data is first converted
into time-stamped sequences. Then we model trading behaviors using neural
ordinary differential equations and gated recurrent units to generate
representations that incorporate the temporal dynamics of spoofing patterns.
Furthermore, pseudo-label generation and heterogeneous aggregation techniques
are employed to gather relevant information and enhance the detection
performance for conspiratorial spoofing behaviors. Experiments conducted on
spoofing detection datasets demonstrate that our approach outperforms
state-of-the-art models in detection accuracy. Additionally, our spoofing
detection system has been successfully deployed in one of the largest global
trading markets, further validating the practical applicability and performance
of the proposed method.
[COMMENTS]
10 pages, 5 figures, ACM The Web Conference 2025
[LINK]
http://arxiv.org/abs/2510.05562v1
[DATE]
2025-10-07 12:16:12+08:00
[CATEGORIES]
cs.LG
Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics
[AUTHORS]
Christopher Hoang, Mengye Ren
[ABSTRACT]
Object recognition and motion understanding are key components of perception
that complement each other. While self-supervised learning methods have shown
promise in their ability to learn from unlabeled data, they have primarily
focused on obtaining rich representations for either recognition or motion
rather than both in tandem. On the other hand, latent dynamics modeling has
been used in decision making to learn latent representations of observations
and their transformations over time for control and planning tasks. In this
work, we present Midway Network, a new self-supervised learning architecture
that is the first to learn strong visual representations for both object
recognition and motion understanding solely from natural videos, by extending
latent dynamics modeling to this domain. Midway Network leverages a midway
top-down path to infer motion latents between video frames, as well as a dense
forward prediction objective and hierarchical structure to tackle the complex,
multi-object scenes of natural videos. We demonstrate that after pretraining on
two large-scale natural video datasets, Midway Network achieves strong
performance on both semantic segmentation and optical flow tasks relative to
prior self-supervised learning methods. We also show that Midway Network’s
learned dynamics can capture high-level correspondence via a novel analysis
method based on forward feature perturbation.
[COMMENTS]
Project page: https://agenticlearning.ai/midway-network/
[LINK]
http://arxiv.org/abs/2510.05558v1
[DATE]
2025-10-07 12:07:44+08:00
[CATEGORIES]
cs.LG
Critical attention scaling in long-context transformers
[AUTHORS]
Shi Chen, Zhengjiang Lin, Yury Polyanskiy, Philippe Rigollet
[ABSTRACT]
As large language models scale to longer contexts, attention layers suffer
from a fundamental pathology: attention scores collapse toward uniformity as
context length $n$ increases, causing tokens to cluster excessively, a
phenomenon known as rank-collapse. While attention scaling
effectively addresses this deficiency by rescaling attention scores with a
polylogarithmic factor $\beta_n$, theoretical justification for this approach
remains lacking.
We analyze a simplified yet tractable model that magnifies the effect of
attention scaling. In this model, attention exhibits a phase transition
governed by the scaling factor $\beta_n$: insufficient scaling collapses all
tokens to a single direction, while excessive scaling reduces attention to
identity, thereby eliminating meaningful interactions between tokens. Our main
result identifies the critical scaling $\beta_n \asymp \log n$ and provides a
rigorous justification for attention scaling in YaRN and Qwen, clarifying why
logarithmic scaling maintains sparse, content-adaptive attention at large
context lengths.
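A toy sketch of the two extremes described above, for a single query attending over n keys: a small scaling factor gives nearly uniform attention weights, while a very large one concentrates almost all weight on one key. The scores are synthetic random numbers and the function name is hypothetical; this only illustrates the qualitative effect of the scaling factor and is not the simplified model analyzed in the paper.

```python
import math
import random

def softmax_attention_weights(scores, beta):
    """Softmax over `scores` after multiplying them by the scaling factor beta."""
    scaled = [beta * s for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
n = 4096  # context length (illustrative choice)
scores = [random.uniform(-1.0, 1.0) for _ in range(n)]  # synthetic scores

# Compare no scaling, unit scaling, log-n scaling, and a much larger factor.
for beta in (0.0, 1.0, math.log(n), 50.0 * math.log(n)):
    w = softmax_attention_weights(scores, beta)
    print(f"beta={beta:8.2f}  max weight={max(w):.4f}  uniform={1.0 / n:.4f}")
```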
[COMMENTS]
29 pages, 2 figures
[LINK]
http://arxiv.org/abs/2510.05554v1
[DATE]
2025-10-07 11:51:57+08:00
[CATEGORIES]
cs.LG
Channel Simulation and Distributed Compression with Ensemble Rejection Sampling
[AUTHORS]
Buu Phan, Ashish Khisti
[ABSTRACT]
We study channel simulation and distributed matching, two fundamental
problems with several applications to machine learning, using a recently
introduced generalization of the standard rejection sampling (RS) algorithm
known as Ensemble Rejection Sampling (ERS). For channel simulation, we propose
a new coding scheme based on ERS that achieves a near-optimal coding rate. In
this process, we demonstrate that standard RS can also achieve a near-optimal
coding rate and generalize the result of Braverman and Garg (2014) to the
continuous alphabet setting. Next, as our main contribution, we present a
distributed matching lemma for ERS, which serves as the rejection sampling
counterpart to the Poisson Matching Lemma (PML) introduced by Li and Anantharam
(2021). Our result also generalizes a recent work on the importance matching lemma
(Phan et al., 2024) and, to our knowledge, is the first result on distributed
matching in the family of rejection sampling schemes where the matching
probability is close to PML. We demonstrate the practical significance of our
approach over prior works by applying it to distributed compression. The
effectiveness of our proposed scheme is validated through experiments involving
synthetic Gaussian sources and distributed image compression using the MNIST
dataset.
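For orientation, here is a minimal sketch of the standard rejection sampling (RS) algorithm that ERS generalizes, on a finite alphabet. It is plain RS only, not ERS and not the coding scheme proposed in the paper; the function name and the example distributions are illustrative assumptions.

```python
import random

def rejection_sample(p, q, rng):
    """Draw one index from target pmf p using proposal pmf q (standard RS).

    Requires q[i] > 0 wherever p[i] > 0.
    """
    # Smallest bound m such that p[i] <= m * q[i] for all i.
    m = max(pi / qi for pi, qi in zip(p, q) if qi > 0)
    while True:
        x = rng.choices(range(len(q)), weights=q)[0]  # propose x ~ q
        if rng.random() < p[x] / (m * q[x]):          # accept w.p. p/(m*q)
            return x

rng = random.Random(0)
target = [0.7, 0.2, 0.1]          # illustrative target distribution
proposal = [1 / 3, 1 / 3, 1 / 3]  # illustrative proposal distribution
draws = [rejection_sample(target, proposal, rng) for _ in range(10000)]
print([draws.count(i) / len(draws) for i in range(3)])
```

The printed empirical frequencies should be close to the target distribution; the expected number of proposals per accepted sample equals the bound m.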
[LINK]
http://arxiv.org/abs/2510.05552v1
[DATE]
2025-10-07 11:43:58+08:00
[CATEGORIES]
cs.LG
DUA-D2C: Dynamic Uncertainty Aware Method for Overfitting Remediation in Deep Learning
[AUTHORS]
Md. Saiful Bari Siddiqui, Md Mohaiminul Islam, Md. Golam Rabiul Alam
[ABSTRACT]
Overfitting remains a significant challenge in deep learning, often arising
from data outliers, noise, and limited training data. To address this, the
Divide2Conquer (D2C) method was previously proposed, which partitions training
data into multiple subsets and trains identical models independently on each.
This strategy enables learning more consistent patterns while minimizing the
influence of individual outliers and noise. However, D2C’s standard aggregation
typically treats all subset models equally or based on fixed heuristics (like
data size), potentially underutilizing information about their varying
generalization capabilities. Building upon this foundation, we introduce
Dynamic Uncertainty-Aware Divide2Conquer (DUA-D2C), an advanced technique that
refines the aggregation process. DUA-D2C dynamically weights the contributions
of subset models based on their performance on a shared validation set,
considering both accuracy and prediction uncertainty. This intelligent
aggregation allows the central model to preferentially learn from subsets
yielding more generalizable and confident edge models, thereby more effectively
combating overfitting. Empirical evaluations on benchmark datasets spanning
multiple domains demonstrate that DUA-D2C significantly improves
generalization. Our analysis includes evaluations of decision boundaries, loss
curves, and other performance metrics, highlighting the effectiveness of
DUA-D2C. This study demonstrates that DUA-D2C improves generalization
performance even when applied on top of other regularization methods,
establishing it as a theoretically grounded and effective approach to combating
overfitting in modern deep learning. Our code is publicly available at:
https://github.com/Saiful185/DUA-D2C.
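A rough sketch of validation-based weighted aggregation. The abstract does not state the weighting rule, so the score used here (validation accuracy divided by one plus mean predictive entropy) is an assumed placeholder, and all function names and numbers are hypothetical. It only illustrates the idea that accurate, confident subset models contribute more to the aggregate.

```python
def aggregation_weights(accuracies, entropies):
    """Normalized weights that grow with accuracy and shrink with uncertainty.

    The score a / (1 + h) is an assumed placeholder, not the paper's rule.
    """
    scores = [a / (1.0 + h) for a, h in zip(accuracies, entropies)]
    total = sum(scores)
    return [s / total for s in scores]

def weighted_average(param_sets, weights):
    """Weighted average of flat parameter lists, one list per subset model."""
    n = len(param_sets[0])
    return [sum(w * p[i] for w, p in zip(weights, param_sets)) for i in range(n)]

# Hypothetical validation metrics for three subset models.
acc = [0.90, 0.80, 0.60]
ent = [0.20, 0.50, 1.00]
w = aggregation_weights(acc, ent)
params = [[1.0, 2.0], [1.5, 1.0], [4.0, 0.0]]
print(w, weighted_average(params, w))
```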
[COMMENTS]
This version (v2) extends our previous work (arXiv:2411.15876v1) on
Divide2Conquer (D2C) by introducing Dynamic Uncertainty-Aware Divide2Conquer
(DUA-D2C). The manuscript is currently under review at Complex and
Intelligent Systems
[LINK]
http://arxiv.org/abs/2411.15876v2
[DATE]
2025-10-07 11:11:30+08:00
[CATEGORIES]
cs.LG
Learning Exposure Mapping Functions for Inferring Heterogeneous Peer Effects
[AUTHORS]
Shishir Adhikari, Sourav Medya, Elena Zheleva
[ABSTRACT]
In causal inference, interference refers to the phenomenon in which the
actions of peers in a network can influence an individual’s outcome. Peer
effect refers to the difference in counterfactual outcomes of an individual for
different levels of peer exposure, the extent to which an individual is exposed
to the treatments, actions, or behaviors of peers. Estimating peer effects
requires deciding how to represent peer exposure. Typically, researchers define
an exposure mapping function that aggregates peer treatments and outputs peer
exposure. Most existing approaches for defining exposure mapping functions
assume peer exposure based on the number or fraction of treated peers. Recent
studies have investigated more complex functions of peer exposure which capture
that different peers can exert different degrees of influence. However, none of
these works have explicitly considered the problem of automatically learning
the exposure mapping function. In this work, we focus on learning this function
for the purpose of estimating heterogeneous peer effects, where heterogeneity
refers to the variation in counterfactual outcomes for the same peer exposure
but different individual contexts. We develop EgoNetGNN, a graph neural
network (GNN)-based method, to automatically learn the appropriate exposure
mapping function allowing for complex peer influence mechanisms that, in
addition to peer treatments, can involve the local neighborhood structure and
edge attributes. We show that GNN models that use peer exposure based on the
number or fraction of treated peers or learn peer exposure naively face
difficulty accounting for such influence mechanisms. Our comprehensive
evaluation on synthetic and semi-synthetic network data shows that our method
is more robust to different unknown underlying influence mechanisms when
estimating heterogeneous peer effects than state-of-the-art
baselines.
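As a concrete reference point, the sketch below implements the simple exposure mapping mentioned above, peer exposure as the fraction of treated peers. This is the common baseline, not EgoNetGNN; the graph, treatment assignment, function name, and the zero-exposure convention for isolated nodes are illustrative assumptions.

```python
def fraction_treated_peers(adjacency, treatment):
    """Map each node to the fraction of its peers that are treated.

    adjacency: dict mapping node -> list of neighboring nodes.
    treatment: dict mapping node -> 0 or 1.
    """
    exposure = {}
    for node, peers in adjacency.items():
        if peers:
            exposure[node] = sum(treatment[p] for p in peers) / len(peers)
        else:
            exposure[node] = 0.0  # convention for isolated nodes (assumption)
    return exposure

# Hypothetical four-node network.
adjacency = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
treatment = {"a": 1, "b": 0, "c": 1, "d": 0}
print(fraction_treated_peers(adjacency, treatment))
```

This mapping treats every peer identically, which is the limitation that motivates learning the exposure function from neighborhood structure and edge attributes.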
[LINK]
http://arxiv.org/abs/2503.01722v2
[DATE]
2025-10-07 11:01:38+08:00
[CATEGORIES]
cs.LG
Permutation-Invariant Representation Learning for Robust and Privacy-Preserving Feature Selection
[AUTHORS]
Rui Liu, Tao Zhe, Yanjie Fu, Feng Xia, Ted Senator, Dongjie Wang
[ABSTRACT]
Feature selection eliminates redundancy among features to improve downstream
task performance while reducing computational overhead. Existing methods often
struggle to capture intricate feature interactions and adapt across diverse
application scenarios. Recent advances employ generative intelligence to
alleviate these drawbacks. However, these methods remain constrained by
permutation sensitivity in embedding and reliance on convexity assumptions in
gradient-based search. To address these limitations, our initial work
introduces a novel framework that integrates permutation-invariant embedding
with policy-guided search. Although effective, it still left opportunities to
adapt to realistic distributed scenarios. In practice, data across local
clients is highly imbalanced, heterogeneous and constrained by strict privacy
regulations, limiting direct sharing. These challenges highlight the need for a
framework that can integrate feature selection knowledge across clients without
exposing sensitive information. In this extended journal version, we advance
the framework from two perspectives: 1) developing a privacy-preserving
knowledge fusion strategy to derive a unified representation space without
sharing sensitive raw data; and 2) incorporating a sample-aware weighting strategy
to address distributional imbalance among heterogeneous local clients.
Extensive experiments validate the effectiveness, robustness, and efficiency of
our framework. The results further demonstrate its strong generalization
ability in federated learning scenarios. The code and data are publicly
available: https://anonymous.4open.science/r/FedCAPS-08BF.
[LINK]
http://arxiv.org/abs/2510.05535v1
[DATE]
2025-10-07 10:53:32+08:00
[CATEGORIES]
cs.LG
Teamwork: Collaborative Diffusion with Low-rank Coordination and Adaptation
[AUTHORS]
Sam Sartor, Pieter Peers
[ABSTRACT]
Large pretrained diffusion models can provide strong priors beneficial for
many graphics applications. However, generative applications such as neural
rendering and inverse methods such as SVBRDF estimation and intrinsic image
decomposition require additional input or output channels. Current solutions
for channel expansion are often application specific and these solutions can be
difficult to adapt to different diffusion models or new tasks. This paper
introduces Teamwork: a flexible and efficient unified solution for jointly
increasing the number of input and output channels as well as adapting a
pretrained diffusion model to new tasks. Teamwork achieves channel expansion
without altering the pretrained diffusion model architecture by coordinating
and adapting multiple instances of the base diffusion model (i.e., teammates).
We employ a novel variation of Low-Rank Adaptation (LoRA) to jointly address
both adaptation and coordination between the different teammates. Furthermore,
Teamwork supports dynamic (de)activation of teammates. We demonstrate the
flexibility and efficiency of Teamwork on a variety of generative and inverse
graphics tasks such as inpainting, single image SVBRDF estimation, intrinsic
decomposition, neural shading, and intrinsic image synthesis.
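For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update, in which a frozen weight matrix W is adapted as W + scale * B A with a small rank. This is plain LoRA, not the variation used by Teamwork (which the abstract does not specify); the helper names and matrices are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_weight(w, b, a, scale=1.0):
    """Return w + scale * (b @ a), where b is (out x r) and a is (r x in)."""
    delta = matmul(b, a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Frozen 2x2 weight with a rank-1 update (illustrative numbers).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # out x r
a = [[3.0, 4.0]]     # r x in
print(lora_weight(w, b, a))
```

Only the factors B and A are trained, so the number of adapted parameters grows with the rank r rather than with the full size of W.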
[LINK]
http://arxiv.org/abs/2510.05532v1
[DATE]
2025-10-07 10:44:57+08:00
[CATEGORIES]
cs.LG
Efficient learning of bosonic Gaussian unitaries
[AUTHORS]
Marco Fanizza, Vishnu Iyer, Junseo Lee, Antonio A. Mele, Francesco A. Mele
[ABSTRACT]
Bosonic Gaussian unitaries are fundamental building blocks of central
continuous-variable quantum technologies such as quantum-optic interferometry
and bosonic error-correction schemes. In this work, we present the first
time-efficient algorithm for learning bosonic Gaussian unitaries with a
rigorous analysis. Our algorithm produces an estimate of the unknown unitary
that is accurate to small worst-case error, measured by the physically
motivated energy-constrained diamond distance. Its runtime and query complexity
scale polynomially with the number of modes, the inverse target accuracy, and
natural energy parameters quantifying the allowed input energy and the
unitary’s output-energy growth.
The protocol uses only experimentally friendly photonic resources: coherent
and squeezed probes, passive linear optics, and heterodyne/homodyne detection.
We then employ an efficient classical post-processing routine that leverages a
symplectic regularization step to project matrix estimates onto the symplectic
group. In the limit of unbounded input energy, our procedure attains
arbitrarily high precision using only $2m+2$ queries, where $m$ is the number
of modes. To our knowledge, this is the first provably efficient learning
algorithm for a multiparameter family of continuous-variable unitaries.
[LINK]
http://arxiv.org/abs/2510.05531v1
[DATE]
2025-10-07 10:42:40+08:00
[CATEGORIES]
cs.LG
LATTA: Langevin-Anchored Test-Time Adaptation for Enhanced Robustness and Stability
[AUTHORS]
Harshil Vejendla
[ABSTRACT]
Test-time adaptation (TTA) aims to adapt a pretrained model to distribution
shifts using only unlabeled test data. While promising, existing methods like
Tent suffer from instability and can catastrophically forget the source
knowledge, especially with small batch sizes or challenging corruptions. We
argue that this arises from overly deterministic updates on a complex loss
surface. In this paper, we introduce Langevin-Anchored Test-Time Adaptation
(LATTA), a novel approach that regularizes adaptation through two key
mechanisms: (1) a noisy weight perturbation inspired by Stochastic Gradient
Langevin Dynamics (SGLD) to explore the local parameter space and escape poor
local minima, and (2) a stable weight anchor that prevents the model from
diverging from its robust source pre-training. This combination allows LATTA to
adapt effectively without sacrificing stability. Unlike prior Bayesian TTA
methods, LATTA requires no architectural changes or expensive Monte Carlo
passes. We conduct extensive experiments on standard benchmarks, including
Rotated-MNIST and the more challenging CIFAR-10-C. Our results demonstrate that
LATTA significantly outperforms existing methods, including Tent, CoTTA, and
EATA, setting a new state of the art for self-supervised TTA by improving
average accuracy on CIFAR-10-C by over 2% while simultaneously reducing
performance variance.
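The two mechanisms can be sketched as a single parameter update that adds SGLD-style Gaussian noise and a quadratic pull toward the source weights. The abstract does not give the exact update, so the form below (noise scale sqrt(2 * lr * temperature), anchor penalty 0.5 * strength * (theta - anchor)^2) is an assumption, and every name and value is hypothetical.

```python
import math
import random

def latta_style_step(theta, grad, anchor, lr, anchor_strength, temperature, rng):
    """One noisy, anchored update (assumed form, see the lead-in).

    theta, grad, anchor: lists of floats with equal length.
    """
    noise_scale = math.sqrt(2.0 * lr * temperature)  # SGLD-style noise scale
    new_theta = []
    for t, g, a in zip(theta, grad, anchor):
        pull = anchor_strength * (t - a)  # gradient of the quadratic anchor term
        noise = noise_scale * rng.gauss(0.0, 1.0)
        new_theta.append(t - lr * (g + pull) + noise)
    return new_theta

rng = random.Random(0)
theta = latta_style_step([1.0, -2.0], [0.5, 0.0], [0.0, 0.0],
                         lr=0.1, anchor_strength=1.0, temperature=1e-4, rng=rng)
print(theta)
```

With the temperature set to zero the step reduces to gradient descent with a pull toward the anchor, which makes the role of each term easy to inspect in isolation.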
[COMMENTS]
MIT URTC 2025 Technical Paper (Oral), 5 pages, 3 figures
[LINK]
http://arxiv.org/abs/2510.05530v1
[DATE]
2025-10-07 10:39:39+08:00
[CATEGORIES]
cs.LG
ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization
[AUTHORS]
Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang
[ABSTRACT]
Large language models (LLMs) present significant deployment challenges due to
their immense computational and memory requirements. While semi-structured
pruning, particularly 2:4 sparsity, offers a path to practical hardware
acceleration, existing methods often incur substantial performance degradation.
To bridge this gap, we introduce ARMOR (Adaptive Representation with
Matrix-factORization), a novel one-shot post-training pruning algorithm.
Instead of directly pruning weights, ARMOR factorizes each weight matrix into a
2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These
wrappers act as efficient pre- and post-transformation error correctors,
offering greater flexibility to preserve model quality compared to conventional
2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen
through a block coordinate descent algorithm that minimizes a layer-wise proxy
loss. We theoretically prove this optimization is guaranteed to converge to a
solution with a proxy loss less than or equal to state-of-the-art pruning
algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and
Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and
significantly outperforms state-of-the-art 2:4 pruning methods across a wide
range of downstream tasks and perplexity evaluations. ARMOR achieves this
superior performance while retaining the inference speedups and substantial
memory usage reductions of 2:4 pruning, establishing a more effective trade-off
between model compression and task accuracy.
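To make the 2:4 sparsity pattern concrete, the sketch below applies conventional magnitude-based 2:4 pruning to one weight row: in every group of four consecutive weights, the two with the smallest magnitude are set to zero. This is the direct pruning baseline that the abstract contrasts with, not ARMOR's factorization; the function name and example row are illustrative assumptions.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = list(row)
    for start in range(0, len(row), 4):
        group = range(start, start + 4)
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out

row = [0.1, -3.0, 2.0, 0.5, 1.0, 1.5, -0.2, 0.0]
print(prune_2_4(row))
```

Each group of four keeps exactly two nonzero slots, which is the structure that sparse hardware kernels rely on for acceleration.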
[LINK]
http://arxiv.org/abs/2510.05528v1
[DATE]
2025-10-07 10:39:20+08:00
[CATEGORIES]
cs.LG
Transfer Learning on Edge Connecting Probability Estimation under Graphon Model
[AUTHORS]
Yuyao Wang, Yu-Hung Cheng, Debarghya Mukherjee, Huimin Cheng
[ABSTRACT]
Graphon models provide a flexible nonparametric framework for estimating
latent connectivity probabilities in networks, enabling a range of downstream
applications such as link prediction and data augmentation. However, accurate
graphon estimation typically requires a large graph, whereas in practice, one
often only observes a small-sized network. One approach to addressing this
issue is to adopt a transfer learning framework, which aims to improve
estimation in a small target graph by leveraging structural information from a
larger, related source graph. In this paper, we propose a novel method, namely
GTRANS, a transfer learning framework that integrates neighborhood smoothing
and Gromov-Wasserstein optimal transport to align and transfer structural
patterns between graphs. To prevent negative transfer, GTRANS includes an
adaptive debiasing mechanism that identifies and corrects for target-specific
deviations via residual smoothing. We provide theoretical guarantees on the
stability of the estimated alignment matrix and demonstrate the effectiveness
of GTRANS in improving the accuracy of target graph estimation through
extensive synthetic and real data experiments. These improvements translate
directly to enhanced performance in downstream applications, such as the graph
classification task and the link prediction task.
[LINK]
http://arxiv.org/abs/2510.05527v1
[DATE]
2025-10-07 10:37:12+08:00
[CATEGORIES]
cs.LG
Provably Mitigating Corruption, Overoptimization, and Verbosity Simultaneously in Offline and Online RLHF/DPO Alignment
[AUTHORS]
Ziyi Chen, Junyi Li, Peiran Yu, Heng Huang
[ABSTRACT]
Reinforcement learning from human feedback (RLHF) and direct preference
optimization (DPO) are important techniques to align large language models
(LLM) with human preference. However, the quality of RLHF and DPO training is
seriously compromised by \textit{\textbf{C}orrupted} preference, reward
\textit{\textbf{O}veroptimization}, and bias towards
\textit{\textbf{V}erbosity}. To our knowledge, most existing works tackle only
one of these important issues, and the few other works require much computation
to estimate multiple reward models and lack theoretical guarantee of
generalization ability. In this work, we propose RLHF-\textbf{COV} and
DPO-\textbf{COV} algorithms that can simultaneously mitigate these three
issues, in both offline and online settings. This ability is theoretically
demonstrated by obtaining length-regularized generalization error rates for our
DPO-COV algorithms trained on corrupted data, which match the best-known rates
for simpler cases with clean data and without length regularization. Moreover,
our DPO-COV algorithm is simple to implement without reward estimation, and is
proved to be equivalent to our RLHF-COV algorithm, which directly implies the
equivalence between the vanilla RLHF and DPO algorithms. Experiments
demonstrate the effectiveness of our DPO-COV algorithms under both offline and
online settings.
[LINK]
http://arxiv.org/abs/2510.05526v1
[DATE]
2025-10-07 10:32:47+08:00
[CATEGORIES]
cs.LG
End-to-End Training of High-Dimensional Optimal Control with Implicit Hamiltonians via Jacobian-Free Backpropagation
[AUTHORS]
Eric Gelphman, Deepanshu Verma, Nicole Tianjiao Yang, Stanley Osher, Samy Wu Fung
[ABSTRACT]
Neural network approaches that parameterize value functions have succeeded in
approximating high-dimensional optimal feedback controllers when the
Hamiltonian admits explicit formulas. However, many practical problems, such as
the space shuttle reentry problem and bicycle dynamics, among others, may
involve implicit Hamiltonians that do not admit explicit formulas, limiting the
applicability of existing methods. Rather than directly parameterizing
controls, which does not leverage the Hamiltonian’s underlying structure, we
propose an end-to-end implicit deep learning approach that directly
parameterizes the value function to learn optimal control laws. Our method
enforces physical principles by ensuring trained networks adhere to the control
laws by exploiting the fundamental relationship between the optimal control and
the value function’s gradient; this is a direct consequence of the connection
between Pontryagin’s Maximum Principle and dynamic programming. Using
Jacobian-Free Backpropagation (JFB), we achieve efficient training despite
temporal coupling in trajectory optimization. We show that JFB produces descent
directions for the optimal control objective and experimentally demonstrate
that our approach effectively learns high-dimensional feedback controllers
across multiple scenarios involving implicit Hamiltonians, which existing
methods cannot address.
[LINK]
http://arxiv.org/abs/2510.00359v2
[DATE]
2025-10-07 10:23:22+08:00
[CATEGORIES]
cs.LG
Distilled Protein Backbone Generation
[AUTHORS]
Liyang Xie, Haoran Zhang, Zhendong Wang, Wesley Tansey, Mingyuan Zhou
[ABSTRACT]
Diffusion- and flow-based generative models have recently demonstrated strong
performance in protein backbone generation tasks, offering unprecedented
capabilities for de novo protein design. However, while achieving notable
performance in generation quality, these models are limited by their generating
speed, often requiring hundreds of iterative steps in the reverse-diffusion
process. This computational bottleneck limits their practical utility in
large-scale protein discovery, where thousands to millions of candidate
structures are needed. To address this challenge, we explore the techniques of
score distillation, which has shown great success in reducing the number of
sampling steps in the vision domain while maintaining high generation quality.
However, a straightforward adaptation of these methods results in unacceptably
low designability. Through extensive study, we have identified how to
appropriately adapt Score identity Distillation (SiD), a state-of-the-art score
distillation strategy, to train few-step protein backbone generators which
significantly reduce sampling time, while maintaining comparable performance to
their pretrained teacher model. In particular, multistep generation combined
with inference time noise modulation is key to the success. We demonstrate that
our distilled few-step generators achieve more than a 20-fold improvement in
sampling speed, while achieving similar levels of designability, diversity, and
novelty as the Proteina teacher model. This reduction in inference cost enables
large-scale in silico protein design, thereby bringing diffusion-based models
closer to real-world protein engineering applications.
[LINK]
http://arxiv.org/abs/2510.03095v2
[DATE]
2025-10-07 10:11:45+08:00
[CATEGORIES]
cs.LG
NeST-BO: Fast Local Bayesian Optimization via Newton-Step Targeting of Gradient and Hessian Information
[AUTHORS]
Wei-Ting Tang, Akshay Kudva, Joel A. Paulson
[ABSTRACT]
Bayesian optimization (BO) is effective for expensive black-box problems but
remains challenging in high dimensions. We propose NeST-BO, a local BO method
that targets the Newton step by jointly learning gradient and Hessian
information with Gaussian process surrogates, and selecting evaluations via a
one-step lookahead bound on Newton-step error. We show that this bound (and
hence the step error) contracts with batch size, so NeST-BO directly inherits
inexact-Newton convergence: global progress under mild stability assumptions
and quadratic local rates once steps are sufficiently accurate. To scale, we
optimize the acquisition in low-dimensional subspaces (e.g., random embeddings
or learned sparse subspaces), reducing the dominant cost of learning curvature
from $O(d^2)$ to $O(m^2)$ with $m \ll d$ while preserving step targeting.
Across high-dimensional synthetic and real-world problems, including cases with
thousands of variables and unknown active subspaces, NeST-BO consistently
yields faster convergence and lower regret than state-of-the-art local and
high-dimensional BO baselines.
[LINK]
http://arxiv.org/abs/2510.05516v1
[DATE]
2025-10-07 10:09:00+08:00
[CATEGORIES]
cs.LG
Integrating Feature Selection and Machine Learning for Nitrogen Assessment in Grapevine Leaves using In-Field Hyperspectral Imaging
[AUTHORS]
Atif Bilal Asad, Achyut Paudel, Safal Kshetri, Chenchen Kang, Salik Ram Khanal, Nataliya Shcherbatyuk, Pierre Davadant, R. Paul Schreiner, Santosh Kalauni, Manoj Karkee, Markus Keller
[ABSTRACT]
Nitrogen (N) is one of the most crucial nutrients in vineyards, affecting
plant growth and subsequent products such as wine and juice. Because soil N has
high spatial and temporal variability, it is desirable to accurately estimate
the N concentration of grapevine leaves and manage fertilization at the
individual plant level to optimally meet plant needs. In this study, we used
in-field hyperspectral images with wavelengths ranging from 400 to 1000 nm of
four different grapevine cultivars collected from distinct vineyards and over
two growth stages during two growing seasons to develop models for predicting N
concentration at the leaf-level and canopy-level. After image processing, two
feature selection methods were employed to identify the optimal set of spectral
bands that were responsive to leaf N concentrations. The selected spectral
bands were used to train and test two different Machine Learning (ML) models,
Gradient Boosting and XGBoost, for predicting nitrogen concentrations. The
comparison of selected bands for both leaf-level and canopy-level datasets
showed that most of the spectral regions identified by the feature selection
methods were consistent across both methods and the dataset types (leaf- and canopy-level
datasets), particularly in the key regions, 500-525nm, 650-690nm, 750-800nm,
and 900-950nm. These findings indicated the robustness of these spectral
regions for predicting nitrogen content. The results for N prediction
demonstrated that the ML model achieved an R² of 0.49 for canopy-level
data and an R² of 0.57 for leaf-level data, despite using different sets
of selected spectral bands for each analysis level. The study demonstrated the
potential of using in-field hyperspectral imaging and the use of spectral data
in integrated feature selection and ML techniques to monitor N status in
vineyards.
[COMMENTS]
Major Revision
[LINK]
http://arxiv.org/abs/2507.17869v2
[DATE]
2025-10-07 09:53:53+08:00
[CATEGORIES]
cs.LG
RainSeer: Fine-Grained Rainfall Reconstruction via Physics-Guided Modeling
[AUTHORS]
Lin Chen, Jun Chen, Minghui Qiu, Shuxin Zhong, Binghong Chen, Kaishun Wu
[ABSTRACT]
Reconstructing high-resolution rainfall fields is essential for flood
forecasting, hydrological modeling, and climate analysis. However, existing
spatial interpolation methods, whether based on automatic weather station (AWS)
measurements or enhanced with satellite/radar observations, often over-smooth
critical structures, failing to capture sharp transitions and localized
extremes. We introduce RainSeer, a structure-aware reconstruction framework
that reinterprets radar reflectivity as a physically grounded structural
prior, capturing when, where, and how rain develops. This shift, however,
introduces two fundamental challenges: (i) translating high-resolution
volumetric radar fields into sparse point-wise rainfall observations, and (ii)
bridging the physical disconnect between hydro-meteors aloft and ground-level
precipitation. RainSeer addresses these through a physics-informed two-stage
architecture: a Structure-to-Point Mapper performs spatial alignment by
projecting mesoscale radar structures into localized ground-level rainfall,
through a bidirectional mapping, and a Geo-Aware Rain Decoder captures the
semantic transformation of hydro-meteors through descent, melting, and
evaporation via a causal spatiotemporal attention mechanism. We evaluate
RainSeer on two public datasets, RAIN-F (Korea, 2017-2019) and MeteoNet (France,
2016-2018), and observe consistent improvements over state-of-the-art baselines,
reducing MAE by over 13.31% and significantly enhancing structural fidelity in
reconstructed rainfall fields.
[LINK]
http://arxiv.org/abs/2510.02414v2
[DATE]
2025-10-07 09:44:40+08:00
[CATEGORIES]
cs.LG
Fundamental Limits of Crystalline Equivariant Graph Neural Networks: A Circuit Complexity Perspective
[AUTHORS]
Yang Cao, Zhao Song, Jiahao Zhang, Jiale Zhao
[ABSTRACT]
Graph neural networks (GNNs) have become a core paradigm for learning on
relational data. In materials science, equivariant GNNs (EGNNs) have emerged as
a compelling backbone for crystalline-structure prediction, owing to their
ability to respect Euclidean symmetries and periodic boundary conditions.
Despite strong empirical performance, their expressive power in periodic,
symmetry-constrained settings remains poorly understood. This work
characterizes the intrinsic computational and expressive limits of EGNNs for
crystalline-structure prediction through a circuit-complexity lens. We analyze
the computations carried out by EGNN layers acting on node features, atomic
coordinates, and lattice matrices, and prove that, under polynomial precision,
embedding width $d=O(n)$ for $n$ nodes, $O(1)$ layers, and $O(1)$-depth,
$O(n)$-width MLP instantiations of the message/update/readout maps, these
models admit a simulation by a uniform $\mathsf{TC}^0$ threshold-circuit family
of polynomial size (with an explicit constant-depth bound). Situating EGNNs
within $\mathsf{TC}^0$ provides a concrete ceiling on the decision and
prediction problems solvable by such architectures under realistic resource
constraints and clarifies which architectural modifications (e.g., increased
depth, richer geometric primitives, or wider layers) are required to transcend
this regime. The analysis complements Weisfeiler-Lehman style results that do
not directly transfer to periodic crystals, and offers a complexity-theoretic
foundation for symmetry-aware graph learning on crystalline systems.
[LINK]
http://arxiv.org/abs/2510.05494v1
[DATE]
2025-10-07 09:24:15+08:00
[CATEGORIES]
cs.LG
High-Fidelity Synthetic ECG Generation via Mel-Spectrogram Informed Diffusion Training
[AUTHORS]
Zhuoyi Huang, Nutan Sahoo, Anamika Kumari, Girish Kumar, Kexuan Cai, Shixing Cao, Yue Kang, Tian Xia, Somya Chatterjee, Nicholas Hausman, Aidan Jay, Eric S. Rosenthal, Soundar Srinivasan, Sadid Hasan, Alex Fedorov, Sulaiman Vesal
[ABSTRACT]
The development of machine learning for cardiac care is severely hampered by
privacy restrictions on sharing real patient electrocardiogram (ECG) data.
Although generative AI offers a promising solution, the real-world use of
existing model-synthesized ECGs is limited by persistent gaps in
trustworthiness and clinical utility. In this work, we address two major
shortcomings of current generative ECG methods: insufficient morphological
fidelity and the inability to generate personalized, patient-specific
physiological signals. To address these gaps, we build on a conditional
diffusion-based Structured State Space Model (SSSD-ECG) with two principled
innovations: (1) MIDT-ECG (Mel-Spectrogram Informed Diffusion Training), a
novel training paradigm with time-frequency domain supervision to enforce
physiological structural realism, and (2) multi-modal demographic conditioning
to enable patient-specific synthesis. We comprehensively evaluate our approach
on the PTB-XL dataset, assessing the synthesized ECG signals on fidelity,
clinical coherence, privacy preservation, and downstream task utility. MIDT-ECG
achieves substantial gains: it improves morphological coherence, preserves
strong privacy guarantees with all metrics evaluated exceeding the baseline by
4-8%, and notably reduces the interlead correlation error by an average of 74%,
while demographic conditioning enhances signal-to-noise ratio and
personalization. In critical low-data regimes, a classifier trained on datasets
supplemented with our synthetic ECGs achieves performance comparable to a
classifier trained solely on real data. Together, we demonstrate that ECG
synthesizers, trained with the proposed time-frequency structural
regularization scheme, can serve as personalized, high-fidelity,
privacy-preserving surrogates when real data are scarce, advancing the
responsible use of generative AI in healthcare.
[LINK]
http://arxiv.org/abs/2510.05492v1
[DATE]
2025-10-07 09:14:53+08:00
[CATEGORIES]
cs.LG
The Method of Infinite Descent
[AUTHORS]
Reza T. Batley, Sourav Saha
[ABSTRACT]
Training - the optimisation of complex models - is traditionally performed
through small, local, iterative updates [D. E. Rumelhart, G. E. Hinton, R. J.
Williams, Nature 323, 533-536 (1986)]. Approximating solutions through
truncated gradients is a paradigm dating back to Cauchy [A.-L. Cauchy, Comptes
Rendus Mathématique 25, 536-538 (1847)] and Newton [I. Newton, The Method of
Fluxions and Infinite Series (Henry Woodfall, London, 1736)]. This work
introduces the Method of Infinite Descent, a semi-analytic optimisation
paradigm that reformulates training as the direct solution to the first-order
optimality condition. By analytical resummation of its Taylor expansion, this
method yields an exact, algebraic equation for the update step. Realisation of
the infinite Taylor tower’s cascading resummation is formally derived, and an
exploitative algorithm for the direct solve step is proposed.
This principle is demonstrated with the herein-introduced AION (Analytic,
Infinitely-Optimisable Network) architecture. AION is a model designed
expressly to satisfy the algebraic closure required by Infinite Descent. In a
simple test problem, AION reaches the optimum in a single descent step.
Together, this optimiser-model pair exemplifies how analytic structure enables
exact, non-iterative convergence. Infinite Descent extends beyond this example,
applying to any appropriately closed architecture. This suggests a new class of
semi-analytically optimisable models: the \emph{Infinity Class}; sufficient
conditions for class membership are discussed. This offers a pathway toward
non-iterative learning.
[LINK]
http://arxiv.org/abs/2510.05489v1
[DATE]
2025-10-07 09:09:20+08:00
[CATEGORIES]
cs.LG
ATOM: A Pretrained Neural Operator for Multitask Molecular Dynamics
[AUTHORS]
Luke Thompson, Davy Guan, Dai Shi, Slade Matthews, Junbin Gao, Andi Han
[ABSTRACT]
Molecular dynamics (MD) simulations underpin modern computational drug
discovery, materials science, and biochemistry. Recent machine learning models
provide high-fidelity MD predictions without the need to repeatedly solve
quantum mechanical forces, enabling significant speedups over conventional
pipelines. Yet many such methods typically enforce strict equivariance and rely
on sequential rollouts, thus limiting their flexibility and simulation
efficiency. They are also commonly single-task, trained on individual
molecules and fixed timeframes, which restricts generalization to unseen
compounds and extended timesteps. To address these issues, we propose Atomistic
Transformer Operator for Molecules (ATOM), a pretrained transformer neural
operator for multitask molecular dynamics. ATOM adopts a quasi-equivariant
design that requires no explicit molecular graph and employs a temporal
attention mechanism, allowing for the accurate parallel decoding of multiple
future states. To support operator pretraining across chemicals and timescales,
we curate TG80, a large, diverse, and numerically stable MD dataset with over
2.5 million femtoseconds of trajectories across 80 compounds. ATOM achieves
state-of-the-art performance on established single-task benchmarks, such as
MD17, RMD17 and MD22. After multitask pretraining on TG80, ATOM shows
exceptional zero-shot generalization to unseen molecules across varying time
horizons. We believe ATOM represents a significant step toward accurate,
efficient, and transferable molecular dynamics models.
[LINK]
http://arxiv.org/abs/2510.05482v1
[DATE]
2025-10-07 08:56:39+08:00
[CATEGORIES]
cs.LG
The Logical Implication Steering Method for Conditional Interventions on Transformer Generation
[AUTHORS]
Damjan Kalajdzievski
[ABSTRACT]
The field of mechanistic interpretability in pre-trained transformer models
has demonstrated substantial evidence supporting the “linear representation
hypothesis”, which is the idea that high-level concepts are encoded as vectors
in the space of activations of a model. Studies also show that model generation
behavior can be steered toward a given concept by adding the concept’s vector
to the corresponding activations. We show how to leverage these properties to
build a form of logical implication into models, enabling transparent and
interpretable adjustments that induce a chosen generation behavior in response
to the presence of any given concept. Our method, Logical Implication Model
Steering (LIMS), unlocks new hand-engineered reasoning capabilities by
integrating neuro-symbolic logic into pre-trained transformer models.
[LINK]
http://arxiv.org/abs/2502.03618v2
[DATE]
2025-10-07 07:55:39+08:00
[CATEGORIES]
cs.LG
Deep Learning Approaches with Explainable AI for Differentiating Alzheimer Disease and Mild Cognitive Impairment
[AUTHORS]
Fahad Mostafa, Kannon Hossain, Hafiz Khan
[ABSTRACT]
Early and accurate diagnosis of Alzheimer Disease is critical for effective
clinical intervention, particularly in distinguishing it from Mild Cognitive
Impairment, a prodromal stage marked by subtle structural changes. In this
study, we propose a hybrid deep learning ensemble framework for Alzheimer
Disease classification using structural magnetic resonance imaging. Gray and
white matter slices are used as inputs to three pretrained convolutional neural
networks (ResNet50, NASNet, and MobileNet), each fine-tuned through an
end-to-end process. To further enhance performance, we incorporate a stacked
ensemble learning strategy with a meta learner and weighted averaging to
optimally combine the base models. Evaluated on the Alzheimer Disease
Neuroimaging Initiative dataset, the proposed method achieves state-of-the-art
accuracy of 99.21% for Alzheimer Disease vs. Mild Cognitive Impairment and
91.0% for Mild Cognitive Impairment vs. Normal Controls, outperforming
conventional transfer learning and baseline ensemble methods. To improve
interpretability in image-based diagnostics, we integrate Explainable AI
techniques via Gradient-weighted Class Activation Mapping (Grad-CAM), which generates heatmaps and
attribution maps that highlight critical regions in gray and white matter
slices, revealing structural biomarkers that influence model decisions. These
results highlight the framework's potential for robust and scalable clinical
decision support in neurodegenerative disease diagnostics.
[COMMENTS]
18 pages, 4 figures
[LINK]
http://arxiv.org/abs/2510.00048v2
[DATE]
2025-10-07 07:51:56+08:00
[CATEGORIES]
cs.LG
QDeepGR4J: Quantile-based ensemble of deep learning and GR4J hybrid rainfall-runoff models for extreme flow prediction with uncertainty quantification
[AUTHORS]
Arpit Kapoor, Rohitash Chandra
[ABSTRACT]
Conceptual rainfall-runoff models aid hydrologists and climate scientists in
modelling streamflow to inform water management practices. Recent advances in
deep learning have revealed the potential for combining hydrological models
with deep learning models for better interpretability and improved predictive
performance. In our previous work, we introduced DeepGR4J, which enhanced the
GR4J conceptual rainfall-runoff model using a deep learning model to serve as a
surrogate for the routing component. DeepGR4J had an improved rainfall-runoff
prediction accuracy, particularly in arid catchments. Quantile regression
models have been extensively used for quantifying uncertainty while aiding
extreme value forecasting. In this paper, we extend DeepGR4J using a quantile
regression-based ensemble learning framework to quantify uncertainty in
streamflow prediction. We also leverage the uncertainty bounds to identify
extreme flow events potentially leading to flooding. We further extend the
model to multi-step streamflow predictions for uncertainty bounds. We design
experiments for a detailed evaluation of the proposed framework using the
CAMELS-Aus dataset. The results show that our proposed Quantile DeepGR4J
framework improves the predictive accuracy and uncertainty interval quality
(interval score) compared to baseline deep learning models. Furthermore, we
carry out flood risk evaluation using Quantile DeepGR4J, and the results
demonstrate its suitability as an early warning system.
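As background for the quantile regression mentioned above, here is a minimal sketch of the pinball (quantile) loss, the standard objective for fitting a chosen quantile. The abstract does not state the exact loss used in Quantile DeepGR4J, so treating it as the pinball loss is an assumption; the function name and the toy flow values are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss for quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, yhat in zip(y_true, y_pred):
        diff = y - yhat
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Toy streamflow values for illustration only.
observed = [10.0, 20.0]
predicted = [8.0, 24.0]
loss_q90 = pinball_loss(observed, predicted, 0.9)
loss_q50 = pinball_loss(observed, predicted, 0.5)
```

With a high quantile level such as 0.9, under-predicting a flow costs more than over-predicting it, which is why an upper quantile can serve as a conservative bound when screening for extreme events.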
[LINK]
http://arxiv.org/abs/2510.05453v1
[DATE]
2025-10-07 07:36:40+08:00
[CATEGORIES]
cs.LG
Persona Features Control Emergent Misalignment
[AUTHORS]
Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A. Chi, Samuel Miserendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, Tejal Patwardhan, Dan Mossing
[ABSTRACT]
Understanding how language models generalize behaviors from their training to
a broader deployment distribution is an important problem in AI safety. Betley
et al. discovered that fine-tuning GPT-4o on intentionally insecure code causes
“emergent misalignment,” where models give stereotypically malicious responses
to unrelated prompts. We extend this work, demonstrating emergent misalignment
across diverse conditions, including reinforcement learning on reasoning
models, fine-tuning on various synthetic datasets, and in models without safety
training. To investigate the mechanisms behind this generalized misalignment,
we apply a “model diffing” approach using sparse autoencoders to compare
internal model representations before and after fine-tuning. This approach
reveals several “misaligned persona” features in activation space, including a
toxic persona feature which most strongly controls emergent misalignment and
can be used to predict whether a model will exhibit such behavior.
Additionally, we investigate mitigation strategies, discovering that
fine-tuning an emergently misaligned model on just a few hundred benign samples
efficiently restores alignment.
[LINK]
http://arxiv.org/abs/2506.19823v2
[DATE]
2025-10-07 07:33:09+08:00
[CATEGORIES]
cs.LG
NASP-T: A Fuzzy Neuro-Symbolic Transformer for Logic-Constrained Aviation Safety Report Classification
[AUTHORS]
Fadi Al Machot, Fidaa Al Machot
[ABSTRACT]
Deep transformer models excel at multi-label text classification but often
violate domain logic that experts consider essential, an issue of particular
concern in safety-critical applications. We propose a hybrid neuro-symbolic
framework that integrates Answer Set Programming (ASP) with transformer-based
learning on the Aviation Safety Reporting System (ASRS) corpus. Domain
knowledge is formalized as weighted ASP rules and validated using the Clingo
solver. These rules are incorporated in two complementary ways: (i) as
rule-based data augmentation, generating logically consistent synthetic samples
that improve label diversity and coverage; and (ii) as a fuzzy-logic
regularizer, enforcing rule satisfaction in a differentiable form during
fine-tuning. This design preserves the interpretability of symbolic reasoning
while leveraging the scalability of deep neural architectures. We further tune
per-class thresholds and report both standard classification metrics and
logic-consistency rates. Compared to a strong Binary Cross-Entropy (BCE)
baseline, our approach improves micro- and macro-F1 scores and achieves up to
an 86% reduction in rule violations on the ASRS test set. To the best of our
knowledge, this constitutes the first large-scale neuro-symbolic application to
ASRS reports that unifies ASP-based reasoning, rule-driven augmentation, and
differentiable transformer training for trustworthy, safety-critical NLP.
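To illustrate what a differentiable rule penalty can look like, the sketch below scores a single rule "A implies B" on predicted label probabilities using the Łukasiewicz implication. The abstract does not specify which fuzzy semantics NASP-T uses, so the choice of Łukasiewicz logic, the function name, and the `weight` parameter are all assumptions.

```python
def implication_penalty(p_antecedent, p_consequent, weight=1.0):
    """Weighted violation of the rule A -> B under Lukasiewicz semantics."""
    # Fuzzy truth value of A -> B is min(1, 1 - a + b).
    truth = min(1.0, 1.0 - p_antecedent + p_consequent)
    # The penalty is the shortfall from full truth, i.e. max(0, a - b).
    return weight * (1.0 - truth)


# Hypothetical label probabilities for illustration.
satisfied = implication_penalty(0.25, 0.75)
violated = implication_penalty(0.75, 0.25, weight=2.0)
```

The penalty is zero whenever the consequent is at least as probable as the antecedent and grows linearly otherwise, so a term of this kind could be added to a classification loss during fine-tuning.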
[LINK]
http://arxiv.org/abs/2510.05451v1
[DATE]
2025-10-07 07:33:09+08:00
[CATEGORIES]
cs.LG
A Probabilistic Basis for Low-Rank Matrix Learning
[AUTHORS]
Simon Segert, Nathan Wycoff
[ABSTRACT]
Low rank inference on matrices is widely conducted by optimizing a cost
function augmented with a penalty proportional to the nuclear norm
$\Vert \cdot \Vert_*$. However, despite the assortment of computational methods for such
problems, there is a surprising lack of understanding of the underlying
probability distributions being referred to. In this article, we study the
distribution with density $f(X)\propto e^{-\lambda\Vert X\Vert_*}$, finding
many of its fundamental attributes to be analytically tractable via
differential geometry. We use these facts to design an improved MCMC algorithm
for low rank Bayesian inference as well as to learn the penalty parameter
$\lambda$, obviating the need for hyperparameter tuning when this is difficult
or impossible. Finally, we deploy these to improve the accuracy and efficiency
of low rank Bayesian matrix denoising and completion algorithms in numerical
experiments.
[LINK]
http://arxiv.org/abs/2510.05447v1
[DATE]
2025-10-07 07:26:56+08:00
[CATEGORIES]
cs.LG
Prior-Aligned Meta-RL: Thompson Sampling with Learned Priors and Guarantees in Finite-Horizon MDPs
[AUTHORS]
Runlin Zhou, Chixiang Chen, Elynn Chen
[ABSTRACT]
We study meta-reinforcement learning in finite-horizon MDPs where related
tasks share similar structures in their optimal action-value functions.
Specifically, we posit a linear representation
$Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$ and place a Gaussian meta-prior
$\mathcal{N}(\theta^*_h,\Sigma^*_h)$ over the task-specific parameters
$\theta^{(k)}_h$. Building on randomized value functions, we propose two
Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and
performs posterior sampling with the learned mean and known covariance; and
(ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and
employs prior widening to control finite-sample estimation error. Further, we
develop a prior-alignment technique that couples the posterior under the
learned prior with a meta-oracle that knows the true prior, yielding
meta-regret guarantees: we match prior-independent Thompson sampling in the
small-task regime and strictly improve with more tasks once the prior is
learned. Concretely, for known covariance we obtain
$\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance
$\tilde{O}(H^{4}S^{3/2}\sqrt{AN^3K})$; both improve on prior-independent
Thompson sampling after $K \gtrsim \tilde{O}(H^2)$ and $K \gtrsim
\tilde{O}(N^2H^2)$, respectively. Simulations on a stateful recommendation
environment (with feature and prior misspecification) show that after brief
exploration, MTSRL/$\text{MTSRL}^{+}$ track the meta-oracle and substantially
outperform prior-independent RL and bandit-only meta-baselines. Our results
give the first meta-regret guarantees for Thompson-style RL with learned
Q-priors, and provide practical recipes (warm-start via RLSVI, OLS aggregation,
covariance widening) for experiment-rich settings.
[LINK]
http://arxiv.org/abs/2510.05446v1
[DATE]
2025-10-07 07:20:49+08:00
[CATEGORIES]
cs.LG
AD-NODE: Adaptive Dynamics Learning with Neural ODEs for Mobile Robots Control
[AUTHORS]
Shao-Yi Yu, Jen-Wei Wang, Maya Horii, Vikas Garg, Tarek Zohdi
[ABSTRACT]
Mobile robots, such as ground vehicles and quadrotors, are becoming
increasingly important in various fields, from logistics to agriculture, where
they automate processes in environments that are difficult to access for
humans. However, to perform effectively in uncertain environments using
model-based controllers, these systems require dynamics models capable of
responding to environmental variations, especially when direct access to
environmental information is limited. To enable such adaptivity and facilitate
integration with model predictive control, we propose an adaptive dynamics
model which bypasses the need for direct environmental knowledge by inferring
operational environments from state-action history. The dynamics model is based
on neural ordinary differential equations, and a two-phase training procedure is used to
learn latent environment representations. We demonstrate the effectiveness of
our approach through goal-reaching and path-tracking tasks on three robotic
platforms of increasing complexity: a 2D differential wheeled robot with
changing wheel contact conditions, a 3D quadrotor in variational wind fields,
and the Sphero BOLT robot under two contact conditions for real-world
deployment. Empirical results corroborate that our method can handle temporally
and spatially varying environmental changes in both simulation and real-world
systems.
[LINK]
http://arxiv.org/abs/2510.05443v1
[DATE]
2025-10-07 07:14:08+08:00
[CATEGORIES]
cs.LG
AutoPDL: Automatic Prompt Optimization for LLM Agents
[AUTHORS]
Claudio Spiess, Mandana Vaziri, Louis Mandel, Martin Hirzel
[ABSTRACT]
The performance of large language models (LLMs) depends on how they are
prompted, with choices spanning both the high-level prompting pattern (e.g.,
Zero-Shot, CoT, ReAct, ReWOO) and the specific prompt content (instructions and
few-shot demonstrations). Manually tuning this combination is tedious,
error-prone, and specific to a given LLM and task. Therefore, this paper
proposes AutoPDL, an automated approach to discovering good LLM agent
configurations. Our approach frames this as a structured AutoML problem over a
combinatorial space of agentic and non-agentic prompting patterns and
demonstrations, using successive halving to efficiently navigate this space. We
introduce a library implementing common prompting patterns using the PDL prompt
programming language. AutoPDL solutions are human-readable, editable, and
executable PDL programs that use this library. This approach also enables
source-to-source optimization, allowing human-in-the-loop refinement and reuse.
Evaluations across three tasks and seven LLMs (ranging from 3B to 70B
parameters) show consistent accuracy gains ($9.21\pm15.46$ percentage points),
up to 67.5pp, and reveal that selected prompting strategies vary across models
and tasks.
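The successive-halving search mentioned in this abstract can be illustrated with a minimal sketch. It is not AutoPDL's implementation: the search space, the `TRUE_ACC` table, and the `noisy_eval` scorer are hypothetical stand-ins for running a prompt configuration on validation examples.

```python
import random

def successive_halving(candidates, evaluate, budget=4, eta=2):
    """Keep the top 1/eta of candidates each round while multiplying the budget by eta."""
    survivors = list(candidates)
    while len(survivors) > 1:
        # Score every surviving candidate with the current evaluation budget.
        scored = [(evaluate(c, budget), c) for c in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scored[:keep]]
        budget *= eta  # survivors earn a larger, more reliable evaluation
    return survivors[0]

# Hypothetical search space: (prompting pattern, number of few-shot demonstrations).
PATTERNS = ["zero-shot", "cot", "react", "rewoo"]
candidates = [(p, k) for p in PATTERNS for k in (0, 3, 5)]

# Hypothetical "true" accuracy per configuration, for illustration only.
TRUE_ACC = {c: 0.3 + 0.1 * PATTERNS.index(c[0]) + 0.02 * c[1] for c in candidates}

_rng = random.Random(0)  # fixed seed so the simulation is repeatable

def noisy_eval(config, n_examples):
    """Estimate accuracy from n_examples simulated validation items."""
    hits = sum(_rng.random() < TRUE_ACC[config] for _ in range(n_examples))
    return hits / n_examples

best = successive_halving(candidates, noisy_eval)
print(best)
```

Cheap, noisy evaluations prune most configurations early, so the larger budgets are spent only on the few that survive.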
[COMMENTS]
Presented at AutoML 2025 (Methods Track); to be published in
proceedings
[LINK]
http://arxiv.org/abs/2504.04365v4
[DATE]
2025-10-07 07:10:56+08:00
[CATEGORIES]
cs.LG
Refereed Learning
[AUTHORS]
Ran Canetti, Ephraim Linder, Connor Wagaman
[ABSTRACT]
We initiate an investigation of learning tasks in a setting where the learner
is given access to two competing provers, only one of which is honest.
Specifically, we consider the power of such learners in assessing purported
properties of opaque models. Following prior work that considers the power of
competing provers in different settings, we call this setting refereed
learning.
After formulating a general definition of refereed learning tasks, we show
refereed learning protocols that obtain a level of accuracy that far exceeds
what is obtainable at comparable cost without provers, or even with a single
prover. We concentrate on the task of choosing the better one out of two
black-box models, with respect to some ground truth. While we consider a range
of parameters, perhaps our most notable result is in the high-precision range:
For all $\varepsilon>0$ and ambient dimension $d$, our learner makes only one
query to the ground truth function, communicates only
$(1+\frac{1}{\varepsilon^2})\cdot\text{poly}(d)$ bits with the provers, and
outputs a model whose loss is within a multiplicative factor of
$(1+\varepsilon)$ of the best model’s loss. Obtaining comparable loss with a
single prover would require the learner to access the ground truth at almost
all of the points in the domain. To obtain this bound, we develop a technique
that allows the learner to sample, using the provers, from a distribution that
is not efficiently samplable to begin with. We find this technique to be of
independent interest.
We also present lower bounds that demonstrate the optimality of our protocols
in a number of respects, including prover complexity, number of samples, and
need for query access.
[LINK]
http://arxiv.org/abs/2510.05440v1
[DATE]
2025-10-07 07:07:31+08:00
[CATEGORIES]
cs.LG
Optimal Transport for Brain-Image Alignment: Unveiling Redundancy and Synergy in Neural Information Processing
[AUTHORS]
Yang Xiao, Wang Lu, Jie Ji, Ruimeng Ye, Gen Li, Xiaolong Ma, Bo Hui
[ABSTRACT]
The design of artificial neural networks (ANNs) is inspired by the structure
of the human brain, and in turn, ANNs offer a potential means to interpret and
understand brain signals. Existing methods primarily align brain signals with
stimulus signals using Mean Squared Error (MSE), which focuses only on local
point-wise alignment and ignores global matching, leading to coarse
interpretations and inaccuracies in brain signal decoding.
In this paper, we address these issues through optimal transport (OT) and
theoretically demonstrate why OT provides a more effective alignment strategy
than MSE. Specifically, we construct a transport plan between brain voxel
embeddings and image embeddings, enabling more precise matching. By controlling
the amount of transport, we mitigate the influence of redundant information. We
apply our alignment model directly to the Brain Captioning task by feeding
brain signals into a large language model (LLM) instead of images. Our approach
achieves state-of-the-art performance across ten evaluation metrics, surpassing
the previous best method by an average of 6.11\% in single-subject training and
3.81\% in cross-subject training. Additionally, we have uncovered several
insightful conclusions that align with existing brain research. We unveil the
redundancy and synergy of brain information processing through region masking
and data dimensionality reduction visualization experiments. We believe our
approach paves the way for a more precise understanding of brain signals in the
future. The code is available at
https://github.com/NKUShaw/OT-Alignment4brain-to-image.
[COMMENTS]
14 pages
[LINK]
http://arxiv.org/abs/2503.10663v2
[DATE]
2025-10-07 06:55:40+08:00
[CATEGORIES]
cs.LG
Physics-Informed Machine Learning in Biomedical Science and Engineering
[AUTHORS]
Nazanin Ahmadi, Qianying Cao, Jay D. Humphrey, George Em Karniadakis
[ABSTRACT]
Physics-informed machine learning (PIML) is emerging as a potentially
transformative paradigm for modeling complex biomedical systems by integrating
parameterized physical laws with data-driven methods. Here, we review three
main classes of PIML frameworks: physics-informed neural networks (PINNs),
neural ordinary differential equations (NODEs), and neural operators (NOs),
highlighting their growing role in biomedical science and engineering. We begin
with PINNs, which embed governing equations into deep learning models and have
been successfully applied to biosolid and biofluid mechanics, mechanobiology,
and medical imaging among other areas. We then review NODEs, which offer
continuous-time modeling, especially suited to dynamic physiological systems,
pharmacokinetics, and cell signaling. Finally, we discuss deep NOs as powerful
tools for learning mappings between function spaces, enabling efficient
simulations across multiscale and spatially heterogeneous biological domains.
Throughout, we emphasize applications where physical interpretability, data
scarcity, or system complexity make conventional black-box learning
insufficient. We conclude by identifying open challenges and future directions
for advancing PIML in biomedical science and engineering, including issues of
uncertainty quantification, generalization, and integration of PIML and large
language models.
[COMMENTS]
Accepted for publication in the Annual Review of Biomedical
Engineering on October 2, 2025
[LINK]
http://arxiv.org/abs/2510.05433v1
[DATE]
2025-10-07 06:52:39+08:00
[CATEGORIES]
cs.LG
PACER: Physics Informed and Uncertainty Aware Climate Emulator
[AUTHORS]
Hira Saleem, Flora Salim, Cormac Purcell
[ABSTRACT]
Physics based numerical climate models serve as critical tools for evaluating
the effects of climate change and projecting future climate scenarios. However,
the reliance on numerical simulations of physical equations renders them
computationally intensive and inefficient. While deep learning methodologies
have made significant progress in weather forecasting, they are still unstable
for longer roll-out climate emulation tasks. Here, we propose PACER, a
relatively lightweight 2.1M parameter Physics Informed Uncertainty Aware
Climate EmulatoR. PACER is trained across varying spatial
resolutions and physics based climate models, enabling faithful and stable
emulation of temperature fields at multiple surface levels over a 10 year
horizon. We propose an auto-regressive ODE-SDE framework for climate emulation
that integrates the fundamental physical law of advection, while being trained
under a negative log-likelihood objective to enable principled uncertainty
quantification of stochastic variability. We evaluate PACER’s emulation performance
across 20 climate models, where it outperforms relevant baselines, advancing towards
explicit physics infusion in ML emulators.
[LINK]
http://arxiv.org/abs/2410.21657v4
[DATE]
2025-10-07 06:35:00+08:00
[CATEGORIES]
cs.LG
Nonlinear Filtering with Brenier Optimal Transport Maps
[AUTHORS]
Mohammad Al-Jarrah, Niyizhen Jin, Bamdad Hosseini, Amirhossein Taghvaei
[ABSTRACT]
This paper is concerned with the problem of nonlinear filtering, i.e.,
computing the conditional distribution of the state of a stochastic dynamical
system given a history of noisy partial observations. Conventional sequential
importance resampling (SIR) particle filters suffer from fundamental
limitations in scenarios involving degenerate likelihoods or high-dimensional
states, due to the weight degeneracy issue. In this paper, we explore an
alternative method, which is based on estimating the Brenier optimal transport
(OT) map from the current prior distribution of the state to the posterior
distribution at the next time step. Unlike SIR particle filters, the OT
formulation does not require the analytical form of the likelihood. Moreover,
it allows us to harness the approximation power of neural networks to model
complex and multi-modal distributions and employ stochastic optimization
algorithms to enhance scalability. Extensive numerical experiments are
presented that compare the OT method to the SIR particle filter and the
ensemble Kalman filter, evaluating the performance in terms of sample
efficiency, high-dimensional scalability, and the ability to capture complex
and multi-modal distributions.
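The weight degeneracy issue mentioned above can be made concrete with the effective sample size (ESS) of the importance weights. The sketch below is a toy illustration, not the paper's experiment: the scalar Gaussian prior, the observation value, and the two noise levels are assumptions chosen only to contrast a broad likelihood with a nearly degenerate one.

```python
import math
import random

def normalized_weights(log_likelihoods):
    """Turn log-likelihoods into normalized importance weights (max-shifted for stability)."""
    m = max(log_likelihoods)
    w = [math.exp(l - m) for l in log_likelihoods]
    total = sum(w)
    return [x / total for x in w]

def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2); equals N for uniform weights and 1 when one particle dominates."""
    return 1.0 / sum(w * w for w in weights)

rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # samples from an assumed N(0, 1) prior
observation = 0.5  # assumed noisy observation of the state

def ess_for_noise(sigma):
    """ESS after weighting the prior particles by a Gaussian likelihood with std sigma."""
    log_liks = [-0.5 * ((observation - x) / sigma) ** 2 for x in particles]
    return effective_sample_size(normalized_weights(log_liks))

ess_broad = ess_for_noise(1.0)    # informative but broad likelihood
ess_sharp = ess_for_noise(0.001)  # nearly degenerate likelihood
print(ess_broad, ess_sharp)
```

As the likelihood sharpens, almost all weight concentrates on the few particles nearest the observation, which is the failure mode that motivates transport-based alternatives.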
[COMMENTS]
27 pages, 17 figures, 1 Table
[LINK]
http://arxiv.org/abs/2310.13886v3
[DATE]
2025-10-07 06:26:24+08:00
[CATEGORIES]
cs.LG
Can We Ignore Labels In Out of Distribution Detection?
[AUTHORS]
Hong Yang, Qi Yu, Travis Desell
[ABSTRACT]
Out-of-distribution (OOD) detection methods have recently become more
prominent, serving as a core element in safety-critical autonomous systems. One
major purpose of OOD detection is to reject invalid inputs that could lead to
unpredictable errors and compromise safety. Due to the cost of labeled data,
recent works have investigated the feasibility of self-supervised learning
(SSL) OOD detection, unlabeled OOD detection, and zero shot OOD detection. In
this work, we identify a set of conditions for a theoretical guarantee of
failure in unlabeled OOD detection algorithms from an information-theoretic
perspective. These conditions are present in all OOD tasks dealing with
real-world data: I) we provide theoretical proof of unlabeled OOD detection
failure when there exists zero mutual information between the learning
objective and the in-distribution labels, a.k.a. ‘label blindness’, II) we
define a new OOD task - Adjacent OOD detection - that tests for label blindness
and accounts for a previously ignored safety gap in all OOD detection
benchmarks, and III) we perform experiments demonstrating that existing
unlabeled OOD methods fail under conditions suggested by our label blindness
theory and analyze the implications for future research in unlabeled OOD
methods.
[LINK]
http://arxiv.org/abs/2504.14704v2
[DATE]
2025-10-07 06:26:02+08:00
[CATEGORIES]
cs.LG
Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction
[AUTHORS]
Jie Li, Andrew McCarthy, Zhizhuo Zhang, Stephen Young
[ABSTRACT]
In-context learners like TabPFN are promising for biomolecule efficacy
prediction, where established molecular feature sets and relevant experimental
results can serve as powerful contextual examples. However, their performance
is highly sensitive to the provided context, making strategies like post-hoc
ensembling of models trained on different data subsets a viable approach. An
open question is how to select the best models for the ensemble without access
to ground truth labels. In this study, we investigate an uncertainty-guided
strategy for model selection. We demonstrate on an siRNA knockdown efficacy
task that a TabPFN model using straightforward sequence-based features can
surpass specialized state-of-the-art predictors. We also show that the model’s
predicted inter-quantile range (IQR), a measure of its uncertainty, has a
negative correlation with true prediction error. We developed the OligoICP
method, which selects and averages an ensemble of models with the lowest mean
IQR for siRNA efficacy prediction, achieving superior performance compared to
naive ensembling or using a single model trained on all available data. This
finding highlights model uncertainty as a powerful, label-free heuristic for
optimizing biomolecule efficacy predictions.
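The selection rule described here (keep the models with the lowest mean IQR, then average them) can be sketched as follows. This is an illustrative reading of the abstract, not the OligoICP code: the data layout, the use of the 25th and 75th percentiles as the quantile pair, and the example numbers are all assumptions.

```python
from statistics import mean

def select_by_mean_iqr(models, k):
    """Return the k models whose predicted inter-quantile range is smallest on average."""
    def mean_iqr(model):
        return mean(hi - lo for lo, _, hi in model["quantiles"])
    return sorted(models, key=mean_iqr)[:k]

def ensemble_predict(selected):
    """Average the median predictions of the selected models, sample by sample."""
    n_samples = len(selected[0]["quantiles"])
    return [mean(m["quantiles"][i][1] for m in selected) for i in range(n_samples)]

# Hypothetical (q25, median, q75) predictions from three models on two samples.
models = [
    {"name": "subset_a", "quantiles": [(0.2, 0.5, 0.8), (0.1, 0.4, 0.7)]},
    {"name": "subset_b", "quantiles": [(0.4, 0.5, 0.6), (0.3, 0.4, 0.5)]},
    {"name": "subset_c", "quantiles": [(0.3, 0.6, 0.7), (0.2, 0.3, 0.6)]},
]

chosen = select_by_mean_iqr(models, k=2)
prediction = ensemble_predict(chosen)
print([m["name"] for m in chosen], prediction)
```

No ground-truth labels enter the selection step; only the models' own uncertainty estimates are used.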
[COMMENTS]
Accepted by NeurIPS 2025 workshop: 2nd Workshop on Multi-modal
Foundation Models and Large Language Models for Life Sciences
[LINK]
http://arxiv.org/abs/2510.02476v2
[DATE]
2025-10-07 06:25:59+08:00
[CATEGORIES]
cs.LG
Model-free generalized fiducial inference
[AUTHORS]
Jonathan P Williams
[ABSTRACT]
Conformal prediction (CP) was developed to provide finite-sample
probabilistic prediction guarantees. While CP algorithms are a relatively
general-purpose approach to uncertainty quantification, with finite-sample
guarantees, they lack versatility. Namely, the CP approach does not {\em
prescribe} how to quantify the degree to which a data set provides evidence in
support of (or against) an arbitrary event from a general class of events. In
this paper, tools are offered from imprecise probability theory to build a
formal connection between CP and generalized fiducial (GF) inference. These new
insights establish a more general inferential lens from which CP can be
understood, and demonstrate the pragmatism of fiducial ideas. The formal
connection establishes a context in which epistemically-derived GF probability
matches aleatoric/frequentist probability. Beyond this fact, it is illustrated
how tools from imprecise probability theory, namely lower and upper probability
functions, can be applied in the context of the imprecise GF distribution to
provide posterior-like, prescriptive inference that is not possible within the
CP framework alone. In addition to the primary CP generalization that is
contributed, fundamental connections are synthesized between this new
model-free GF and three other areas of contemporary research: nonparametric
predictive inference (NPI), conformal predictive systems/distributions, and
inferential models (IMs).
[LINK]
http://arxiv.org/abs/2307.12472v2
[DATE]
2025-10-07 06:25:32+08:00
[CATEGORIES]
cs.LG
Draft, Verify, and Improve: Toward Training-Aware Speculative Decoding
[AUTHORS]
Shrenik Bhansali, Larry Heck
[ABSTRACT]
Autoregressive (AR) decoding is a major latency bottleneck for large language
models. Speculative decoding (SD) accelerates AR by letting a drafter propose
multi-token blocks that a verifier accepts or rejects. However, many SD systems
require heavy offline training or extra components. These choices raise
data/compute cost and can yield brittle drafters under distribution drift. We
introduce \emph{Draft, Verify, \& Improve (DVI)}, a training-aware
self-speculative framework that combines inference with continual online
learning. We partition an LLM into a drafter and a verifier, and during
generation, verifier accept/reject decisions are converted into supervision
signals and used to update the drafter head. A simple \emph{KL$\rightarrow$RL}
schedule bootstraps calibration via online distillation and then adds
reward-masked cross-entropy with an on-policy policy-gradient term, preserving
lossless, single-model deployment. On Spec-Bench, DVI achieves a $2.16\times$
wall-time speedup, on par with SoTA approaches like EAGLE-2, while using orders of
magnitude less data for training, and ablations show that DVI outperforms
KL-only online distillation. DVI demonstrates that \emph{training-aware}
self-speculation can deliver state-of-the-art, lossless speedups with minimal
training overhead.
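The draft-and-verify loop that speculative decoding relies on can be sketched in a few lines. This is a generic greedy variant under simplifying assumptions, not DVI itself: `drafter` and `verifier` are hypothetical toy next-token functions, the verifier is called token by token rather than in one batched pass, and the recorded accept/reject pairs merely stand in for the supervision signals the abstract describes.

```python
def speculative_decode(draft_next, verify_next, prompt, block_size=4, max_new=12):
    """Greedy speculative decoding; the output matches plain greedy decoding with verify_next."""
    tokens = list(prompt)
    target_len = len(tokens) + max_new
    signals = []  # (proposed_token, accepted) pairs, usable as drafter supervision
    while len(tokens) < target_len:
        # The drafter proposes a block of tokens autoregressively.
        block = []
        for _ in range(block_size):
            block.append(draft_next(tokens + block))
        # The verifier checks the block left to right and stops at the first mismatch.
        for proposed in block:
            if len(tokens) >= target_len:
                break
            expected = verify_next(tokens)
            accepted = proposed == expected
            signals.append((proposed, accepted))
            tokens.append(expected)  # always emit the verifier's token, so decoding is lossless
            if not accepted:
                break
    return tokens, signals

def verifier(seq):
    """Toy target model: count upward modulo 10."""
    return (seq[-1] + 1) % 10

def drafter(seq):
    """Toy drafter: agrees with the verifier except after the token 5."""
    return 0 if seq[-1] == 5 else (seq[-1] + 1) % 10

output, signals = speculative_decode(drafter, verifier, prompt=[0])
acceptance_rate = sum(ok for _, ok in signals) / len(signals)
print(output, acceptance_rate)
```

Because the verifier's token is always the one emitted, a weak drafter costs speed but never changes the output.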
[LINK]
http://arxiv.org/abs/2510.05421v1
[DATE]
2025-10-07 06:24:24+08:00
[CATEGORIES]
cs.LG
Correlating Cross-Iteration Noise for DP-SGD using Model Curvature
[AUTHORS]
Xin Gu, Yingtai Xiao, Guanlin He, Jiamu Bai, Daniel Kifer, Kiwan Maeng
[ABSTRACT]
Differentially private stochastic gradient descent (DP-SGD) offers the
promise of training deep learning models while mitigating many privacy risks.
However, there is currently a large accuracy gap between DP-SGD and normal SGD
training. This has resulted in different lines of research investigating
orthogonal ways of improving privacy-preserving training. One such line of
work, known as DP-MF, correlates the privacy noise across different iterations
of stochastic gradient descent – allowing later iterations to cancel out some
of the noise added to earlier iterations. In this paper, we study how to
improve this noise correlation. We propose a technique called NoiseCurve that
uses model curvature, estimated from public unlabeled data, to improve the
quality of this cross-iteration noise correlation. Our experiments on various
datasets, models, and privacy parameters show that the noise correlations
computed by NoiseCurve offer consistent and significant improvements in
accuracy over the correlation scheme used by DP-MF.
[LINK]
http://arxiv.org/abs/2510.05416v1
[DATE]
2025-10-07 06:13:02+08:00
[CATEGORIES]
cs.LG
Solar Irradiation Forecasting using Genetic Algorithms
[AUTHORS]
V. Gunasekaran, K. K. Kovi, S. Arja, R. Chimata
[ABSTRACT]
Renewable energy forecasting is attaining greater importance due to its
constant increase in contribution to the electrical power grids. Solar energy
is one of the most significant contributors to renewable energy and is
dependent on solar irradiation. For the effective management of electrical
power grids, forecasting models that predict solar irradiation, with high
accuracy, are needed. In the current study, Machine Learning techniques such as
Linear Regression, Extreme Gradient Boosting and Genetic Algorithm Optimization
are used to forecast solar irradiation. The data used for training and
validation is recorded from across three different geographical stations in the
United States that are part of the SURFRAD network. Global Horizontal Irradiance
(GHI) is predicted by each model built, and the predictions are compared. Genetic Algorithm
Optimization is applied to XGB to further improve the accuracy of solar
irradiation prediction.
[COMMENTS]
9 pages, 4 figures
[LINK]
http://arxiv.org/abs/2106.13956v2
[DATE]
2025-10-07 05:53:00+08:00
[CATEGORIES]
cs.LG
Comparing LSTM-Based Sequence-to-Sequence Forecasting Strategies for 24-Hour Solar Proton Flux Profiles Using GOES Data
[AUTHORS]
Kangwoo Yi, Bo Shen, Qin Li, Haimin Wang, Yong-Jae Moon, Jaewon Lee, Hwanhee Lee
[ABSTRACT]
Solar Proton Events (SPEs) cause significant radiation hazards to satellites,
astronauts, and technological systems. Accurate forecasting of their proton
flux time profiles is crucial for early warnings and mitigation. This paper
explores deep learning sequence-to-sequence (seq2seq) models based on Long
Short-Term Memory networks to predict 24-hour proton flux profiles following
SPE onsets. We used a dataset of 40 well-connected SPEs (1997-2017) observed by
NOAA GOES, each associated with a >=M-class western-hemisphere solar flare and
undisturbed proton flux profiles. Using 4-fold stratified cross-validation, we
evaluate seq2seq model configurations (varying hidden units and embedding
dimensions) under multiple forecasting scenarios: (i) proton-only input vs.
combined proton+X-ray input, (ii) original flux data vs. trend-smoothed data,
and (iii) autoregressive vs. one-shot forecasting. Our major results are as
follows: First, one-shot forecasting consistently yields lower error than
autoregressive prediction, avoiding the error accumulation seen in iterative
approaches. Second, on the original data, proton-only models outperform
proton+X-ray models. However, with trend-smoothed data, this gap narrows or
reverses in proton+X-ray models. Third, trend-smoothing significantly enhances
the performance of proton+X-ray models by mitigating fluctuations in the X-ray
channel. Fourth, while models trained on trend-smoothed data perform best on
average, the best-performing model was trained on original data, suggesting
that architectural choices can sometimes outweigh the benefits of data
preprocessing.
[COMMENTS]
7 pages; accepted as a workshop paper at ICDM 2025
[LINK]
http://arxiv.org/abs/2510.05399v1
[DATE]
2025-10-07 05:45:37+08:00
[CATEGORIES]
cs.LG
FinP: Fairness-in-Privacy in Federated Learning by Addressing Disparities in Privacy Risk
[AUTHORS]
Tianyu Zhao, Mahmoud Srewa, Salma Elmalaki
[ABSTRACT]
Ensuring fairness in machine learning extends to the critical dimension of
privacy, particularly in human-centric federated learning (FL) settings where
decentralized data necessitates an equitable distribution of privacy risk
across clients. This paper introduces FinP, a novel framework specifically
designed to address disparities in privacy risk by mitigating disproportionate
vulnerability to source inference attacks (SIA). FinP employs a two-pronged
strategy: (1) server-side adaptive aggregation, which dynamically adjusts
client contributions to the global model to foster fairness, and (2)
client-side regularization, which enhances the privacy robustness of individual
clients. This comprehensive approach directly tackles both the symptoms and
underlying causes of privacy unfairness in FL. Extensive evaluations on the
Human Activity Recognition (HAR) and CIFAR-10 datasets demonstrate FinP’s
effectiveness, achieving improvement in fairness-in-privacy on HAR and CIFAR-10
with minimal impact on utility. FinP improved group fairness with respect to
disparity in privacy risk using equal opportunity in CIFAR-10 by 57.14%
compared to the state-of-the-art. Furthermore, FinP significantly mitigates SIA
risks on CIFAR-10, underscoring its potential to establish fairness in privacy
within FL systems without compromising utility.
[LINK]
http://arxiv.org/abs/2502.17748v3
[DATE]
2025-10-07 05:45:18+08:00
[CATEGORIES]
cs.LG
Scalable In-context Ranking with Generative Models
[AUTHORS]
Nilesh Gupta, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Inderjit Dhillon, Felix Yu
[ABSTRACT]
In-context Ranking (ICR) is an emerging paradigm for Information Retrieval
(IR), which leverages contextual understanding of LLMs by directly
incorporating the task description, candidate documents, and the query into the
model’s input prompt and tasking the LLM to identify relevant document(s).
While it is effective, efficiency is a significant challenge in this paradigm,
especially as the candidate list grows due to quadratic/super-linear scaling of
the attention operation with context length. To this end, this paper first
identifies inherent and exploitable structures in the attention of LLMs
finetuned for ICR: (1) inter-document block sparsity: attention is dense within
each document block but sparse across different documents in the context; and
(2) query-document block relevance: the attention scores from certain query
tokens to a document block in middle layers strongly correlate with that
document’s actual relevance. Motivated by these observations, we introduce
BlockRank (Blockwise In-context Ranking), a novel method that adapts the
attention operation in an LLM by (a) architecturally enforcing the observed
inter-document block sparsity, reducing attention complexity from quadratic to
linear without loss in performance, and (b) optimizing query-document block
relevance for true relevant documents during fine-tuning using an auxiliary
contrastive training objective, improving retrieval in attention. Experiments
on BEIR, MSMarco and NQ with Mistral-7B demonstrate that BlockRank Mistral matches
or outperforms existing SOTA listwise rankers and a controlled fine-tuned
baseline while being significantly more efficient at inference (4.7x for 100
MSMarco documents in context) and scaling gracefully to long-context
shortlists, around 500 documents in-context (approximately 100K context length)
within a second, presenting a scalable and effective solution for ICR.
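The inter-document block sparsity described in this abstract can be pictured as an attention mask. The sketch below is one plausible reading, not the BlockRank implementation: it assumes a prompt laid out as instruction, then documents, then query, and it assumes that document tokens may see the instruction and their own block while query tokens see everything before them.

```python
def block_sparse_mask(instr_len, doc_lens, query_len):
    """Return mask[i][j], True when token i may attend to token j (causal)."""
    # Segment ids: 0 = instruction, 1..N = documents, N + 1 = query.
    segments = [0] * instr_len
    for doc_id, length in enumerate(doc_lens, start=1):
        segments += [doc_id] * length
    query_id = len(doc_lens) + 1
    segments += [query_id] * query_len

    total = len(segments)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only the current and earlier positions
            same_block = segments[i] == segments[j]
            to_instruction = segments[j] == 0
            from_query = segments[i] == query_id
            mask[i][j] = same_block or to_instruction or from_query
    return mask

mask = block_sparse_mask(instr_len=2, doc_lens=[3, 3], query_len=2)
allowed = sum(sum(row) for row in mask)
dense = sum(i + 1 for i in range(len(mask)))  # entries in a full causal mask
print(allowed, dense)
```

Under these assumptions each document's cost depends only on its own length, so the document part of the mask grows linearly with the number of documents instead of quadratically.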
[LINK]
http://arxiv.org/abs/2510.05396v1
[DATE]
2025-10-07 05:41:58+08:00
[CATEGORIES]
cs.LG
Human + AI for Accelerating Ad Localization Evaluation
[AUTHORS]
Harshit Rajgarhia, Shivali Dalmia, Mengyang Zhao, Mukherji Abhishek, Kiran Ganesh
[ABSTRACT]
Adapting advertisements for multilingual audiences requires more than simple
text translation; it demands preservation of visual consistency, spatial
alignment, and stylistic integrity across diverse languages and formats. We
introduce a structured framework that combines automated components with human
oversight to address the complexities of advertisement localization. To the
best of our knowledge, this is the first work to integrate scene text
detection, inpainting, machine translation (MT), and text reimposition
specifically for accelerating ad localization evaluation workflows. Qualitative
results across six locales demonstrate that our approach produces semantically
accurate and visually coherent localized advertisements, suitable for
deployment in real-world workflows.
[LINK]
http://arxiv.org/abs/2509.12543v3
[DATE]
2025-10-07 05:30:41+08:00
[CATEGORIES]
cs.LG
Cross-Domain Graph Data Scaling: A Showcase with Diffusion Models
[AUTHORS]
Wenzhuo Tang, Haitao Mao, Danial Dervovic, Ivan Brugere, Saumitra Mishra, Yuying Xie, Jiliang Tang
[ABSTRACT]
Models for natural language and images benefit from data scaling behavior:
the more data fed into the model, the better they perform. This ‘better with
more’ phenomenon enables the effectiveness of large-scale pre-training on vast
amounts of data. However, current graph pre-training methods struggle to scale
up data due to heterogeneity across graphs. To achieve effective data scaling,
we aim to develop a general model that is able to capture diverse data patterns
of graphs and can be utilized to adaptively help the downstream tasks. To this
end, we propose UniAug, a universal graph structure augmentor built on a
diffusion model. We first pre-train a discrete diffusion model on thousands of
graphs across domains to learn the graph structural patterns. In the downstream
phase, we provide adaptive enhancement by conducting graph structure
augmentation with the help of the pre-trained diffusion model via guided
generation. By leveraging the pre-trained diffusion model for structure
augmentation, we consistently achieve performance improvements across various
downstream tasks in a plug-and-play manner. To the best of our knowledge, this
study represents the first demonstration of a data-scaling graph structure
augmentor on graphs across domains.
[COMMENTS]
NeurIPS'25
[LINK]
http://arxiv.org/abs/2406.01899v3
[DATE]
2025-10-07 05:29:21+08:00
[CATEGORIES]
cs.LG
A Neural Network Algorithm for KL Divergence Estimation with Quantitative Error Bounds
[AUTHORS]
Mikil Foss, Andrew Lamperski
[ABSTRACT]
Estimating the Kullback-Leibler (KL) divergence between random variables is a
fundamental problem in statistical analysis. For continuous random variables,
traditional information-theoretic estimators scale poorly with dimension and/or
sample size. To mitigate this challenge, a variety of methods have been
proposed to estimate KL divergences and related quantities, such as mutual
information, using neural networks. The existing theoretical analyses show that
neural network parameters achieving low error exist. However, since they rely
on non-constructive neural network approximation theorems, they do not
guarantee that the existing algorithms actually achieve low error. In this
paper, we propose a KL divergence estimation algorithm using a shallow neural
network with randomized hidden weights and biases (i.e. a random feature
method). We show that with high probability, the algorithm achieves a KL
divergence estimation error of $O(m^{-1/2}+T^{-1/3})$, where $m$ is the number
of neurons and $T$ is both the number of steps of the algorithm and the number
of samples.
[COMMENTS]
Under Review for AISTATS 2026
[LINK]
http://arxiv.org/abs/2510.05386v1
[DATE]
2025-10-07 05:25:13+08:00
[CATEGORIES]
cs.LG
OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection
[AUTHORS]
Victor Livernoche, Akshatha Arodi, Andreea Musulan, Zachary Yang, Adam Salvail, Gaétan Marceau Caron, Jean-François Godbout, Reihaneh Rabbany
[ABSTRACT]
Deepfakes, synthetic media created using advanced AI techniques, pose a
growing threat to information integrity, particularly in politically sensitive
contexts. This challenge is amplified by the increasing realism of modern
generative models, which our human perception study confirms are often
indistinguishable from real images. Yet, existing deepfake detection benchmarks
rely on outdated generators or narrowly scoped datasets (e.g., single-face
imagery), limiting their utility for real-world detection. To address these
gaps, we present OpenFake, a large politically grounded dataset specifically
crafted for benchmarking against modern generative models with high realism,
and designed to remain extensible through an innovative crowdsourced
adversarial platform that continually integrates new hard examples. OpenFake
comprises nearly four million total images: three million real images paired
with descriptive captions and almost one million synthetic counterparts from
state-of-the-art proprietary and open-source models. Detectors trained on
OpenFake achieve near-perfect in-distribution performance, strong
generalization to unseen generators, and high accuracy on a curated in-the-wild
social media test set, significantly outperforming models trained on existing
datasets. Overall, we demonstrate that with high-quality and continually
updated benchmarks, automatic deepfake detection is both feasible and effective
in real-world settings.
[COMMENTS]
26 pages, 12 figures
[LINK]
http://arxiv.org/abs/2509.09495v2
[DATE]
2025-10-07 05:24:19+08:00
[CATEGORIES]
cs.LG
Physics-Informed Neural Networks with Fourier Features and Attention-Driven Decoding
[AUTHORS]
Rohan Arni, Carlos Blanco
[ABSTRACT]
Physics-Informed Neural Networks (PINNs) are a useful framework for
approximating partial differential equation solutions using deep learning
methods. In this paper, we propose a principled redesign of the PINNsformer, a
Transformer-based PINN architecture. We present the Spectral PINNsformer
(S-Pformer), a refinement of encoder-decoder PINNsformers that addresses two
key issues: (1) the redundancy (i.e., increased parameter count) of the encoder,
and (2) spectral bias. We find that the encoder is unnecessary
for capturing spatiotemporal correlations when relying solely on
self-attention, thereby reducing parameter count. Further, we integrate Fourier
feature embeddings to explicitly mitigate spectral bias, enabling adaptive
encoding of multiscale behaviors in the frequency domain. Our model outperforms
encoder-decoder PINNsformer architectures across all benchmarks, matching or
exceeding MLP performance while reducing parameter count significantly.
[COMMENTS]
16 pages, 6 figures. Accepted at NeurIPS 2025 AI4Science workshop
[LINK]
http://arxiv.org/abs/2510.05385v1
[DATE]
2025-10-07 05:23:09+08:00
[CATEGORIES]
cs.LG
Minima and Critical Points of the Bethe Free Energy Are Invariant Under Deformation Retractions of Factor Graphs
[AUTHORS]
Grégoire Sergeant-Perthuis, Léo Boitel
[ABSTRACT]
In graphical models, factor graphs, and more generally energy-based models,
the interactions between variables are encoded by a graph, a hypergraph, or, in
the most general case, a partially ordered set (poset). Inference on such
probabilistic models cannot be performed exactly due to cycles in the
underlying structures of interaction. Instead, one resorts to approximate
variational inference by optimizing the Bethe free energy. Critical points of
the Bethe free energy correspond to fixed points of the associated Belief
Propagation algorithm. A full characterization of these critical points for
general graphs, hypergraphs, and posets with a finite number of variables is
still an open problem. We show that, for hypergraphs and posets with chains of
length at most 1, changing the poset of interactions of the probabilistic model
to one with the same homotopy type induces a bijection between the critical
points of the associated free energy. This result extends and unifies classical
results that assume specific forms of collapsibility to prove uniqueness of the
critical points of the Bethe free energy.
[LINK]
http://arxiv.org/abs/2510.05380v1
[DATE]
2025-10-07 05:16:31+08:00
[CATEGORIES]
cs.LG
KVLinC: KV Cache Quantization with Hadamard Rotation and Linear Correction
[AUTHORS]
Utkarsh Saxena, Kaushik Roy
[ABSTRACT]
Quantizing the key-value (KV) cache is a promising strategy for improving the
inference efficiency of large language models (LLMs). However, aggressive
quantization to very low precision (e.g., 2 bits) introduces significant errors
in the stored key and value tensors, which propagate through the dot-product
attention mechanism and ultimately degrade generation quality. To address this,
we propose KVLinC, a framework to mitigate attention errors introduced by KV
cache quantization in the extreme low-precision regime. KVLinC combines a
Hadamard rotation, which reduces quantization error in values, with lightweight
linear correction adapters that explicitly compensate for errors introduced by
quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3
model families, KVLinC consistently matches or surpasses strong baselines while
achieving higher KV-cache compression. Furthermore, we implement a custom
attention kernel that results in up to 2.55x faster inference compared to the Flash
Attention baseline, enabling efficient long-context LLM inference.
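To make the idea of rotating a tensor before low-bit quantization concrete, here is a minimal sketch under stated assumptions: a normalized Sylvester Hadamard matrix, a simple min-max uniform quantizer, and a made-up 8-dimensional vector with one outlier. The helper names are hypothetical, and KVLinC's actual quantizer and correction adapters are not described in enough detail in the abstract to reproduce here.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two),
    scaled by 1/sqrt(n) so that the matrix is orthogonal and symmetric."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def quantize(x, bits=2):
    """Uniform min-max quantization to 2**bits levels, then dequantization."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return list(x)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in x]


# A vector with one outlier channel (made-up numbers).
x = [8.0, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3]
H = hadamard(len(x))
rotated = matvec(H, x)                   # outlier energy is spread out
restored = matvec(H, quantize(rotated))  # normalized H is its own inverse
```

Because the normalized matrix is orthogonal, the rotation is lossless on its own. Any error in `restored` comes only from the quantizer.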
[COMMENTS]
14 pages, 7 figures, 6 tables
[LINK]
http://arxiv.org/abs/2510.05373v1
[DATE]
2025-10-07 05:08:11+08:00
[CATEGORIES]
cs.LG
Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning
[AUTHORS]
Zhengran Ji, Boyuan Chen
[ABSTRACT]
Training reinforcement learning agents with human feedback is crucial when
task objectives are difficult to specify through dense reward functions. While
prior methods rely on offline trajectory comparisons to elicit human
preferences, such data is unavailable in online learning scenarios where agents
must adapt on the fly. Recent approaches address this by collecting real-time
scalar feedback to guide agent behavior and train reward models for continued
learning after human feedback becomes unavailable. However, scalar feedback is
often noisy and inconsistent, limiting the accuracy and generalization of
learned rewards. We propose Pref-GUIDE, a framework that transforms real-time
scalar feedback into preference-based data to improve reward model learning for
continual policy training. Pref-GUIDE Individual mitigates temporal
inconsistency by comparing agent behaviors within short windows and filtering
ambiguous feedback. Pref-GUIDE Voting further enhances robustness by
aggregating reward models across a population of users to form consensus
preferences. Across three challenging environments, Pref-GUIDE significantly
outperforms scalar-feedback baselines, with the voting variant exceeding even
expert-designed dense rewards. By reframing scalar feedback as structured
preferences with population feedback, Pref-GUIDE offers a scalable and
principled approach for harnessing human input in online reinforcement
learning.
[LINK]
http://arxiv.org/abs/2508.07126v2
[DATE]
2025-10-07 04:55:39+08:00
[CATEGORIES]
cs.LG
LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation
[AUTHORS]
Yang Xiao, Gen Li, Kaiyuan Deng, Yushu Wu, Zheng Zhan, Yanzhi Wang, Xiaolong Ma, Bo Hui
[ABSTRACT]
Training-free acceleration has emerged as an advanced research area in video
generation based on diffusion models. The redundancy of latents in diffusion
model inference provides a natural entry point for acceleration. In this paper,
we decompose the inference process into the encoding, denoising, and decoding
stages, and observe that cache-based acceleration methods often lead to
substantial memory surges in the latter two stages. To address this problem, we
analyze the characteristics of inference across different stages and propose
stage-specific strategies for reducing memory consumption: 1) Asynchronous
Cache Swapping. 2) Feature chunk. 3) Slicing latents to decode. At the same
time, we ensure that the time overhead introduced by these three strategies
remains lower than the acceleration gains themselves. Compared with the
baseline, our approach achieves faster inference speed and lower memory usage,
while keeping quality degradation within an acceptable range. The code is
available at https://github.com/NKUShaw/LightCache.
[LINK]
http://arxiv.org/abs/2510.05367v1
[DATE]
2025-10-07 04:54:44+08:00
[CATEGORIES]
cs.LG
MT-DAO: Multi-Timescale Distributed Adaptive Optimizers with Local Updates
[AUTHORS]
Alex Iacob, Andrej Jovanovic, Mher Safaryan, Meghdad Kurmanji, Lorenzo Sani, Samuel Horváth, William F. Shen, Xinchi Qiu, Nicholas D. Lane
[COMMENTS]
Submitted to the ICLR 2026 Conference
[LINK]
http://arxiv.org/abs/2510.05361v1
[DATE]
2025-10-07 04:37:57+08:00
[CATEGORIES]
cs.LG
Neural-Rendezvous: Provably Robust Guidance and Control to Encounter Interstellar Objects
[AUTHORS]
Hiroyasu Tsukamoto, Soon-Jo Chung, Yashwanth Kumar Nakka, Benjamin Donitz, Declan Mages, Michel Ingham
[ABSTRACT]
Interstellar objects (ISOs) are likely representatives of primitive materials
invaluable in understanding exoplanetary star systems. Due to their poorly
constrained orbits with generally high inclinations and relative velocities,
however, exploring ISOs with conventional human-in-the-loop approaches is
significantly challenging. This paper presents Neural-Rendezvous – a deep
learning-based guidance and control framework for encountering fast-moving
objects, including ISOs, robustly, accurately, and autonomously in real time.
It uses pointwise minimum norm tracking control on top of a guidance policy
modeled by a spectrally-normalized deep neural network, where its
hyperparameters are tuned with a loss function directly penalizing the MPC
state trajectory tracking error. We show that Neural-Rendezvous provides a high
probability exponential bound on the expected spacecraft delivery error, the
proof of which leverages stochastic incremental stability analysis. In
particular, it is used to construct a non-negative function with a
supermartingale property, explicitly accounting for the ISO state uncertainty
and the local nature of nonlinear state estimation guarantees. In numerical
simulations, Neural-Rendezvous is demonstrated to satisfy the expected error
bound for 100 ISO candidates. This performance is also empirically validated
using our spacecraft simulator and in high-conflict and distributed UAV swarm
reconfiguration with up to 20 UAVs.
[COMMENTS]
Preprint Version, Accepted: October, 2024 (One-minute YouTube
summary: https://youtu.be/q3e0LYS2IYQ, DOI:
https://doi.org/10.2514/1.G007671)
[LINK]
http://arxiv.org/abs/2208.04883v8
[DATE]
2025-10-07 04:37:14+08:00
[CATEGORIES]
cs.LG
Mitigating Diffusion Model Hallucinations with Dynamic Guidance
[AUTHORS]
Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras
[ABSTRACT]
Diffusion models, despite their impressive demos, often produce hallucinatory
samples with structural inconsistencies that lie outside of the support of the
true data distribution. Such hallucinations can be attributed to excessive
smoothing between modes of the data distribution. However, semantic
interpolations are often desirable and can lead to generation diversity, thus
we believe a more nuanced solution is required. In this work, we introduce
Dynamic Guidance, which tackles this issue. Dynamic Guidance mitigates
hallucinations by selectively sharpening the score function only along the
pre-determined directions known to cause artifacts, while preserving valid
semantic variations. To our knowledge, this is the first approach that
addresses hallucinations at generation time rather than through post-hoc
filtering. Dynamic Guidance substantially reduces hallucinations on both
controlled and natural image datasets, significantly outperforming baselines.
[LINK]
http://arxiv.org/abs/2510.05356v1
[DATE]
2025-10-07 04:31:13+08:00
[CATEGORIES]
cs.LG
Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
[AUTHORS]
Minoh Jeong, Zae Myung Kim, Min Namgung, Dongyeop Kang, Yao-Yi Chiang, Alfred Hero
[LINK]
http://arxiv.org/abs/2410.02086v3
[DATE]
2025-10-07 04:29:29+08:00
[CATEGORIES]
cs.LG
Contraction Theory for Nonlinear Stability Analysis and Learning-based Control: A Tutorial Overview
[AUTHORS]
Hiroyasu Tsukamoto, Soon-Jo Chung, Jean-Jacques E. Slotine
[ABSTRACT]
Contraction theory is an analytical tool to study differential dynamics of a
non-autonomous (i.e., time-varying) nonlinear system under a contraction metric
defined with a uniformly positive definite matrix, the existence of which
results in a necessary and sufficient characterization of incremental
exponential stability of multiple solution trajectories with respect to each
other. By using a squared differential length as a Lyapunov-like function, its
nonlinear stability analysis boils down to finding a suitable contraction
metric that satisfies a stability condition expressed as a linear matrix
inequality, indicating that many parallels can be drawn between well-known
linear systems theory and contraction theory for nonlinear systems.
Furthermore, contraction theory takes advantage of a superior robustness
property of exponential stability used in conjunction with the comparison
lemma. This yields much-needed safety and stability guarantees for neural
network-based control and estimation schemes, without resorting to a more
involved method of using uniform asymptotic stability for input-to-state
stability. Such distinctive features permit the systematic construction of a
contraction metric via convex optimization, thereby obtaining an explicit
exponential bound on the distance between a time-varying target trajectory and
solution trajectories perturbed externally due to disturbances and learning
errors. The objective of this paper is, therefore, to present a tutorial
overview of contraction theory and its advantages in nonlinear stability
analysis of deterministic and stochastic systems, with an emphasis on deriving
formal robustness and stability guarantees for various learning-based and
data-driven automatic control methods. In particular, we provide a detailed
review of techniques for finding contraction metrics and associated control and
estimation laws using deep neural networks.
[COMMENTS]
Annual Reviews in Control, Preprint Version, Accepted, Oct. 1st
[LINK]
http://arxiv.org/abs/2110.00675v7
[DATE]
2025-10-07 04:27:34+08:00
[CATEGORIES]
cs.LG
SoftAdaClip: A Smooth Clipping Strategy for Fair and Private Model Training
[AUTHORS]
Dorsa Soleymani, Ali Dadsetan, Frank Rudzicz
[ABSTRACT]
Differential privacy (DP) provides strong protection for sensitive data, but
often reduces model performance and fairness, especially for underrepresented
groups. One major reason is gradient clipping in DP-SGD, which can
disproportionately suppress learning signals for minority subpopulations.
Although adaptive clipping can enhance utility, it still relies on uniform hard
clipping, which may restrict fairness. To address this, we introduce
SoftAdaClip, a differentially private training method that replaces hard
clipping with a smooth, tanh-based transformation to preserve relative gradient
magnitudes while bounding sensitivity. We evaluate SoftAdaClip on various
datasets, including MIMIC-III (clinical text), GOSSIS-eICU (structured
healthcare), and Adult Income (tabular data). Our results show that SoftAdaClip
reduces subgroup disparities by up to 87% compared to DP-SGD and up to 48%
compared to Adaptive-DPSGD, and these reductions in subgroup disparities are
statistically significant. These findings underscore the importance of
integrating smooth transformations with adaptive mechanisms to achieve fair and
private model training.
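The contrast between hard clipping and a smooth tanh-based alternative can be sketched as follows. The specific form used here, rescaling a gradient so its norm becomes `c * tanh(norm / c)`, is an assumption chosen for illustration. The abstract only says the transformation is smooth and tanh-based and bounds sensitivity, so the paper's exact formula may differ.

```python
import math


def l2_norm(g):
    """Euclidean norm of a gradient given as a list of floats."""
    return math.sqrt(sum(v * v for v in g))


def hard_clip(g, c):
    """Standard DP-SGD clipping: rescale so the norm is at most c."""
    n = l2_norm(g)
    scale = min(1.0, c / n) if n > 0 else 1.0
    return [v * scale for v in g]


def soft_clip(g, c):
    """Assumed smooth variant: output norm is c * tanh(norm / c), below c."""
    n = l2_norm(g)
    if n == 0:
        return list(g)
    scale = c * math.tanh(n / c) / n
    return [v * scale for v in g]


small, large, larger = [0.3, 0.4], [3.0, 4.0], [6.0, 8.0]
c = 1.0
# Hard clipping maps both large gradients to norm c,
# while the tanh form keeps their relative ordering.
norms_hard = [l2_norm(hard_clip(g, c)) for g in (small, large, larger)]
norms_soft = [l2_norm(soft_clip(g, c)) for g in (small, large, larger)]
```

In this sketch the output norm never exceeds `c`, so the per-sample sensitivity bound that DP-SGD noise calibration relies on is preserved.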
[LINK]
http://arxiv.org/abs/2510.01447v2
[DATE]
2025-10-07 04:27:34+08:00
[CATEGORIES]
cs.LG
Physics-informed Attention-enhanced Fourier Neural Operator for Solar Magnetic Field Extrapolations
[AUTHORS]
Jinghao Cao, Qin Li, Mengnan Du, Haimin Wang, Bo Shen
[ABSTRACT]
We propose Physics-informed Attention-enhanced Fourier Neural Operator
(PIANO) to solve the Nonlinear Force-Free Field (NLFFF) problem in solar
physics. Unlike conventional approaches that rely on iterative numerical
methods, our proposed PIANO directly learns the 3D magnetic field structure
from 2D boundary conditions. Specifically, PIANO integrates Efficient Channel
Attention (ECA) mechanisms with Dilated Convolutions (DC), which enhances the
model’s ability to capture multimodal input by prioritizing critical channels
relevant to the magnetic field’s variations. Furthermore, we apply
physics-informed loss by enforcing the force-free and divergence-free
conditions in the training process so that our prediction is consistent with
underlying physics with high accuracy. Experimental results on the ISEE NLFFF
dataset show that our PIANO not only outperforms state-of-the-art neural
operators in terms of accuracy but also shows strong consistency with the
physical characteristics of NLFFF data across magnetic fields reconstructed
from various solar active regions. The code for this project is available at
https://github.com/Autumnstar-cjh/PIANO
[COMMENTS]
10 pages; accepted as workshop paper in ICDM 2025;
https://github.com/Autumnstar-cjh/PIANO
[LINK]
http://arxiv.org/abs/2510.05351v1
[DATE]
2025-10-07 04:24:22+08:00
[CATEGORIES]
cs.LG
Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
[AUTHORS]
Hyung Gyu Rho
[ABSTRACT]
Direct Preference Optimization (DPO) has emerged as a simple and effective
method for aligning large language models. However, its reliance on a fixed
temperature parameter leads to suboptimal training on diverse preference data,
causing overfitting on easy examples and under-learning from informative ones.
Recent methods have emerged to counter this. While IPO addresses general
overfitting, its uniform regularization can be overly conservative. The more
targeted approach of $\beta$-DPO suffers from its own limitations: its
batch-level adaptation applies a single, compromised temperature to
mixed-margin pairs, its linear update rule can produce unstable negative
$\beta$ values, and its filtering mechanism discards potentially useful
training signals. In this work, we introduce Margin-Adaptive Direct Preference
Optimization (MADPO), a method that provides a stable, data-preserving, and
instance-level solution. MADPO employs a practical two-step approach: it first
trains a reward model to estimate preference margins and then uses these
margins to apply a continuous, adaptive weight to the DPO loss for each
individual training sample. This re-weighting scheme creates an effective
target margin that is amplified for hard pairs and dampened for easy pairs,
allowing for granular control over the learning signal. We provide a
comprehensive theoretical analysis, proving that MADPO has a well-behaved
optimization landscape and is robust to reward model estimation errors. We
validate our theory with experiments on a sentiment generation task, where
MADPO consistently and significantly outperforms strong baselines across
datasets of varying quality. It achieves performance gains of up to +33.3\% on
High Quality data and +10.5\% on Low Quality data over the next-best method.
Our results establish MADPO as a more robust and principled approach to
preference alignment.
[LINK]
http://arxiv.org/abs/2510.05342v1
[DATE]
2025-10-07 04:09:37+08:00
[CATEGORIES]
cs.LG
MatLLMSearch: Crystal Structure Discovery with Evolution-Guided Large Language Models
[AUTHORS]
Jingru Gan, Peichen Zhong, Yuanqi Du, Yanqiao Zhu, Chenru Duan, Haorui Wang, Daniel Schwalbe-Koda, Carla P. Gomes, Kristin A. Persson, Wei Wang
[ABSTRACT]
Crystal structure generation is fundamental to materials science, enabling
the discovery of novel materials with desired properties. While existing
approaches leverage Large Language Models (LLMs) through extensive fine-tuning
on materials databases, we show that pre-trained LLMs can inherently generate
novel and stable crystal structures without additional fine-tuning. Our
framework employs LLMs as intelligent proposal agents within an evolutionary
pipeline that guides them to perform implicit crossover and mutation operations
while maintaining chemical validity. We demonstrate that MatLLMSearch achieves
a 78.38% metastable rate validated by machine learning interatomic potentials
and 31.7% DFT-verified stability, outperforming specialized models such as
CrystalTextLLM. Beyond crystal structure generation, we further demonstrate
that our framework adapts to diverse materials design tasks, including crystal
structure prediction and multi-objective optimization of properties such as
deformation energy and bulk modulus, all without fine-tuning. These results
establish our framework as a versatile and effective approach for consistent
high-quality materials discovery, offering training-free generation of novel
stable structures with reduced overhead and broader accessibility.
[COMMENTS]
Preprint, 25 pages
[LINK]
http://arxiv.org/abs/2502.20933v2
[DATE]
2025-10-07 03:52:50+08:00
[CATEGORIES]
cs.LG
Tensor-on-tensor Regression Neural Networks for Process Modeling with High-dimensional Data
[AUTHORS]
Qian Wang, Mohammad N. Bisheh, Kamran Paynabar
[ABSTRACT]
Modern sensing and metrology systems now stream terabytes of heterogeneous,
high-dimensional (HD) data, such as profiles, images, and dense point clouds, whose
natural representation is multi-way tensors. Understanding such data requires
regression models that preserve tensor geometry, yet remain expressive enough
to capture the pronounced nonlinear interactions that dominate many industrial
and mechanical processes. Existing tensor-based regressors meet the first
requirement but remain essentially linear. Conversely, conventional neural
networks offer nonlinearity only after flattening, thereby discarding spatial
structure and incurring prohibitive parameter counts. This paper introduces a
Tensor-on-Tensor Regression Neural Network (TRNN) that unifies these two
paradigms.
[LINK]
http://arxiv.org/abs/2510.05329v1
[DATE]
2025-10-07 03:49:03+08:00
[CATEGORIES]
cs.LG
Gradient Methods with Online Scaling Part II. Practical Aspects
[AUTHORS]
Ya-Chi Chu, Wenzhi Gao, Yinyu Ye, Madeleine Udell
[LINK]
http://arxiv.org/abs/2509.11007v2
[DATE]
2025-10-07 03:48:56+08:00
[CATEGORIES]
cs.LG
Extracting PAC Decision Trees from Black Box Binary Classifiers: The Gender Bias Case Study on BERT-based Language Models
[AUTHORS]
Ana Ozaki, Roberto Confalonieri, Ricardo Guimarães, Anders Imenes
[ABSTRACT]
Decision trees are a popular machine learning method, known for their
inherent explainability. In Explainable AI, decision trees can be used as
surrogate models for complex black box AI models or as approximations of parts
of such models. A key challenge of this approach is determining how accurately
the extracted decision tree represents the original model and to what extent it
can be trusted as an approximation of its behavior. In this work, we
investigate the use of the Probably Approximately Correct (PAC) framework to
provide a theoretical guarantee of fidelity for decision trees extracted from
AI models. Based on theoretical results from the PAC framework, we adapt a
decision tree algorithm to ensure a PAC guarantee under certain conditions. We
focus on binary classification and conduct experiments where we extract
decision trees from BERT-based language models with PAC guarantees. Our results
indicate occupational gender bias in these models.
[COMMENTS]
This is a revision of the version published at AAAI 2025. We fixed an
issue in Theorem 8 and reran all the experiments. We also fixed small
grammar mistakes found while producing this revised version
[LINK]
http://arxiv.org/abs/2412.10513v2
[DATE]
2025-10-07 03:41:06+08:00
[CATEGORIES]
cs.LG
Generalizing Supervised Contrastive learning: A Projection Perspective
[AUTHORS]
Minoh Jeong, Alfred Hero
[ABSTRACT]
Self-supervised contrastive learning (SSCL) has emerged as a powerful
paradigm for representation learning and has been studied from multiple
perspectives, including mutual information and geometric viewpoints. However,
supervised contrastive (SupCon) approaches have received comparatively little
attention in this context: for instance, while InfoNCE used in SSCL is known to
form a lower bound on mutual information (MI), the relationship between SupCon
and MI remains unexplored. To address this gap, we introduce ProjNCE, a
generalization of the InfoNCE loss that unifies supervised and self-supervised
contrastive objectives by incorporating projection functions and an adjustment
term for negative pairs. We prove that ProjNCE constitutes a valid MI bound and
affords greater flexibility in selecting projection strategies for class
embeddings. Building on this flexibility, we further examine the centroid-based
class embeddings in SupCon through a variety of projection methods.
Extensive experiments on image and audio datasets demonstrate that ProjNCE
consistently outperforms both SupCon and standard cross-entropy training. Our
work thus refines SupCon along two complementary
perspectives, information-theoretic and projection-based, and offers
broadly applicable improvements whenever SupCon serves as the foundational
contrastive objective.
[LINK]
http://arxiv.org/abs/2506.09810v2
[DATE]
2025-10-07 03:38:03+08:00
[CATEGORIES]
cs.LG
RegMix: Adversarial Mutual and Generalization Regularization for Enhancing DNN Robustness
[AUTHORS]
Zhenyu Liu, Varun Ojha
[ABSTRACT]
Adversarial training is the most effective defense against adversarial
attacks. The effectiveness of adversarial training depends on the design of
its loss function and regularization term. The most widely used loss function
in adversarial training is cross-entropy, with mean squared error (MSE) as its
regularization objective. However, MSE enforces overly uniform optimization
between two output distributions during training, which limits its robustness
in adversarial training scenarios. To address this issue, we revisit the idea
of mutual learning (originally designed for knowledge distillation) and propose
two novel regularization strategies tailored for adversarial training: (i)
weighted adversarial mutual regularization and (ii) adversarial generalization
regularization. In the former, we formulate a decomposed adversarial mutual
Kullback-Leibler divergence (KL-divergence) loss, which allows flexible control
over the optimization process by assigning unequal weights to the main and
auxiliary objectives. In the latter, we introduce an additional clean target
distribution into the adversarial training objective, improving generalization
and enhancing model robustness. Extensive experiments demonstrate that our
proposed methods significantly improve adversarial robustness compared to
existing regularization-based approaches.
[LINK]
http://arxiv.org/abs/2510.05317v1
[DATE]
2025-10-07 03:30:08+08:00
[CATEGORIES]
cs.LG
Probabilistic Variational Contrastive Learning
[AUTHORS]
Minoh Jeong, Seonho Kim, Alfred Hero
[ABSTRACT]
Deterministic embeddings learned by contrastive learning (CL) methods such as
SimCLR and SupCon achieve state-of-the-art performance but lack a principled
mechanism for uncertainty quantification. We propose Variational Contrastive
Learning (VCL), a decoder-free framework that maximizes the evidence lower
bound (ELBO) by interpreting the InfoNCE loss as a surrogate reconstruction
term and adding a KL divergence regularizer to a uniform prior on the unit
hypersphere. We model the approximate posterior $q_\theta(z|x)$ as a projected
normal distribution, enabling the sampling of probabilistic embeddings. Our two
instantiations, VSimCLR and VSupCon, replace deterministic embeddings with
samples from $q_\theta(z|x)$ and incorporate a normalized KL term into the
loss. Experiments on multiple benchmarks demonstrate that VCL mitigates
dimensional collapse, enhances mutual information with class labels, and
matches or outperforms deterministic baselines in classification accuracy, all
the while providing meaningful uncertainty estimates through the posterior
model. VCL thus equips contrastive learning with a probabilistic foundation,
serving as a new basis for contrastive approaches.
[LINK]
http://arxiv.org/abs/2506.10159v2
[DATE]
2025-10-07 03:26:48+08:00
[CATEGORIES]
cs.LG
Gamma Mixture Modeling for Cosine Similarity in Small Language Models
[AUTHORS]
Kevin Player
[ABSTRACT]
We study the cosine similarity of sentence transformer embeddings and observe
that they are well modeled by gamma mixtures. From a fixed corpus, we measure
similarities between all document embeddings and a reference query embedding.
Empirically we find that these distributions are often well captured by a gamma
distribution shifted and truncated to [-1,1], and in many cases, by a gamma
mixture. We propose a heuristic model in which a hierarchical clustering of
topics naturally leads to a gamma-mixture structure in the similarity scores.
Finally, we outline an expectation-maximization algorithm for fitting shifted
gamma mixtures, which provides a practical tool for modeling similarity
distributions.
[COMMENTS]
16 pages, 8 figures
[LINK]
http://arxiv.org/abs/2510.05309v1
[DATE]
2025-10-07 03:20:28+08:00
[CATEGORIES]
cs.LG
IMPACT: Intelligent Motion Planning with Acceptable Contact Trajectories via Vision-Language Models
[AUTHORS]
Yiyang Ling, Karan Owalekar, Oluwatobiloba Adesanya, Erdem Bıyık, Daniel Seita
[ABSTRACT]
Motion planning involves determining a sequence of robot configurations to
reach a desired pose, subject to movement and safety constraints. Traditional
motion planning finds collision-free paths, but this is overly restrictive in
clutter, where it may not be possible for a robot to accomplish a task without
contact. In addition, contacts range from relatively benign (e.g. brushing a
soft pillow) to more dangerous (e.g. toppling a glass vase), making it
difficult to characterize which may be acceptable. In this paper, we propose
IMPACT, a novel motion planning framework that uses Vision-Language Models
(VLMs) to infer environment semantics, identifying which parts of the
environment can best tolerate contact based on object properties and locations.
Our approach generates an anisotropic cost map that encodes directional push
safety. We pair this map with a contact-aware A* planner to find stable
contact-rich paths. We perform experiments using 20 simulation and 10
real-world scenes and assess using task success rate, object displacements, and
feedback from human evaluators. Our results over 3200 simulation and 200
real-world trials suggest that IMPACT enables efficient contact-rich motion
planning in cluttered settings while outperforming alternative methods and
ablations. Our project website is available at
https://impact-planning.github.io/.
[LINK]
http://arxiv.org/abs/2503.10110v2
[DATE]
2025-10-07 03:17:38+08:00
[CATEGORIES]
cs.LG
TimeSeriesScientist: A General-Purpose AI Agent for Time Series Analysis
[AUTHORS]
Haokun Zhao, Xiang Zhang, Jiaqi Wei, Yiwei Xu, Yuting He, Siqi Sun, Chenyu You
[ABSTRACT]
Time series forecasting is central to decision-making in domains as diverse
as energy, finance, climate, and public health. In practice, forecasters face
thousands of short, noisy series that vary in frequency, quality, and horizon,
where the dominant cost lies not in model fitting, but in the labor-intensive
preprocessing, validation, and ensembling required to obtain reliable
predictions. Prevailing statistical and deep learning models are tailored to
specific datasets or domains and generalize poorly. A general, domain-agnostic
framework that minimizes human intervention is urgently needed. In this
paper, we introduce TimeSeriesScientist (TSci), the first LLM-driven agentic
framework for general time series forecasting. The framework comprises four
specialized agents: Curator performs LLM-guided diagnostics augmented by
external tools that reason over data statistics to choose targeted
preprocessing; Planner narrows the hypothesis space of model choice by
leveraging multi-modal diagnostics and self-planning over the input; Forecaster
performs model fitting and validation and, based on the results, adaptively
selects the best model configuration as well as ensemble strategy to make final
predictions; and Reporter synthesizes the whole process into a comprehensive,
transparent report. With transparent natural-language rationales and
comprehensive reports, TSci transforms the forecasting workflow into a
white-box system that is both interpretable and extensible across tasks.
Empirical results on eight established benchmarks demonstrate that TSci
consistently outperforms both statistical and LLM-based baselines, reducing
forecast error by an average of 10.4% and 38.2%, respectively. Moreover, TSci
produces a clear and rigorous report that makes the forecasting workflow more
transparent and interpretable.
[LINK]
http://arxiv.org/abs/2510.01538v2
[DATE]
2025-10-07 03:00:27+08:00
[CATEGORIES]
cs.LG
Computing frustration and near-monotonicity in deep neural networks
[AUTHORS]
Joel Wendin, Erik G. Larsson, Claudio Altafini
[ABSTRACT]
For the signed graph associated with a deep neural network, one can compute the
frustration level, i.e., test how close or distant the graph is to structural
balance. For all the pretrained deep convolutional neural networks we consider,
we find that the frustration is always less than expected from null models.
From a statistical physics point of view, and in particular in reference to an
Ising spin glass model, the reduced frustration indicates that the amount of
disorder encoded in the network is less than in the null models. From a
functional point of view, low frustration (i.e., proximity to structural
balance) means that the function representing the network behaves
near-monotonically, i.e., more similarly to a monotone function than in the
null models. Evidence of near-monotonic behavior along the partial order
determined by frustration is observed for all networks we consider. This
confirms that the class of deep convolutional neural networks tends to have a
more ordered behavior than expected from null models, and suggests a novel form
of implicit regularization.
[LINK]
http://arxiv.org/abs/2510.05286v1
[DATE]
2025-10-07 02:54:54+08:00
[CATEGORIES]
cs.LG
Adjusting the Output of Decision Transformer with Action Gradient
[AUTHORS]
Rui Lin, Yiwen Zhang, Zhicheng Peng, Minghao Lyu
[ABSTRACT]
Decision Transformer (DT), which integrates reinforcement learning (RL) with
the transformer model, introduces a novel approach to offline RL. Unlike
classical algorithms that take maximizing cumulative discounted rewards as their
objective, DT instead maximizes the likelihood of actions. This paradigm shift,
however, presents two key challenges: trajectory stitching and action
extrapolation. Existing methods, such as substituting specific tokens with
predictive values and integrating the Policy Gradient (PG) method, address
these challenges individually but fail to improve performance stably when
combined due to inherent instability. To address this, we propose Action
Gradient (AG), an innovative methodology that directly adjusts actions to
fulfill a function analogous to that of PG, while also facilitating efficient
integration with token prediction techniques. AG utilizes the gradient of the
Q-value with respect to the action to optimize the action. The empirical
results demonstrate that our method can significantly enhance the performance
of DT-based algorithms, with some results achieving state-of-the-art levels.
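The idea of adjusting an action along the gradient of the Q-value can be sketched as a few steps of gradient ascent on the action. This is a minimal illustration under stated assumptions, not the authors' implementation: the quadratic critic `q_value`, its analytic gradient, the step size, and the number of steps are all hypothetical stand-ins for a learned Q-network and its autodiff gradient.

```python
import numpy as np

def q_value(state, action):
    # Hypothetical critic: highest when action == 0.5 * state.
    return -float(np.sum((action - 0.5 * state) ** 2))

def q_grad_action(state, action):
    # Analytic gradient of the toy critic with respect to the action.
    return -2.0 * (action - 0.5 * state)

def refine_action(state, action, step_size=0.1, n_steps=50):
    """Move a proposed action uphill on Q(state, .) by gradient ascent."""
    action = np.asarray(action, dtype=float).copy()
    for _ in range(n_steps):
        action = action + step_size * q_grad_action(state, action)
    return action

state = np.array([1.0, -2.0])
proposed = np.array([0.0, 0.0])  # stands in for the action output by DT
refined = refine_action(state, proposed)
print(q_value(state, proposed), q_value(state, refined))
```

In this toy setting the refined action converges toward the critic's maximizer; with a learned critic the step size and step count would need tuning to avoid leaving the region where the Q-estimate is reliable.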
[LINK]
http://arxiv.org/abs/2510.05285v1
[DATE]
2025-10-07 02:54:42+08:00
[CATEGORIES]
cs.LG
Learning The Minimum Action Distance
[AUTHORS]
Lorenzo Steccanella, Joshua B. Evans, Özgür Şimşek, Anders Jonsson
[ABSTRACT]
This paper presents a state representation framework for Markov decision
processes (MDPs) that can be learned solely from state trajectories, requiring
neither reward signals nor the actions executed by the agent. We propose
learning the minimum action distance (MAD), defined as the minimum number of
actions required to transition between states, as a fundamental metric that
captures the underlying structure of an environment. MAD naturally enables
critical downstream tasks such as goal-conditioned reinforcement learning and
reward shaping by providing a dense, geometrically meaningful measure of
progress. Our self-supervised learning approach constructs an embedding space
where the distances between embedded state pairs correspond to their MAD,
accommodating both symmetric and asymmetric approximations. We evaluate the
framework on a comprehensive suite of environments with known MAD values,
encompassing both deterministic and stochastic dynamics, as well as discrete
and continuous state spaces, and environments with noisy observations.
Empirical results demonstrate that the proposed approach not only efficiently
learns accurate MAD representations across these diverse settings but also
significantly outperforms existing state representation methods in terms of
representation quality.
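For intuition, the ground-truth MAD in a small deterministic environment with known transitions can be computed by breadth-first search. The sketch below is only a reference computation under that assumption (the paper learns MAD from state trajectories instead); the function name and the one-way ring environment are hypothetical.

```python
from collections import deque

def minimum_action_distance(transitions, source):
    """Return {state: minimum number of actions from source} via BFS.

    transitions maps each state to the states reachable in one action.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist

# Hypothetical one-way ring of four states: 0 -> 1 -> 2 -> 3 -> 0.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
from_0 = minimum_action_distance(ring, 0)
from_3 = minimum_action_distance(ring, 3)
print(from_0[3], from_3[0])  # prints 3 1
```

The ring shows why asymmetric approximations matter: reaching state 3 from state 0 takes three actions, while the reverse takes one.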
[LINK]
http://arxiv.org/abs/2506.09276v2
[DATE]
2025-10-07 02:53:43+08:00
[CATEGORIES]
cs.LG
TreeIRL: Safe Urban Driving with Tree Search and Inverse Reinforcement Learning
[AUTHORS]
Momchil S. Tomov, Sang Uk Lee, Hansford Hendrago, Jinwook Huh, Teawon Han, Forbes Howington, Rafael da Silva, Gianmarco Bernasconi, Marc Heim, Samuel Findler, Xiaonan Ji, Alexander Boule, Michael Napoli, Kuo Chen, Jesse Miller, Boaz Floor, Yunqing Hu
[ABSTRACT]
We present TreeIRL, a novel planner for autonomous driving that combines
Monte Carlo tree search (MCTS) and inverse reinforcement learning (IRL) to
achieve state-of-the-art performance in simulation and in real-world driving.
The core idea is to use MCTS to find a promising set of safe candidate
trajectories and a deep IRL scoring function to select the most human-like
among them. We evaluate TreeIRL against both classical and state-of-the-art
planners in large-scale simulations and on 500+ miles of real-world autonomous
driving in the Las Vegas metropolitan area. Test scenarios include dense urban
traffic, adaptive cruise control, cut-ins, and traffic lights. TreeIRL achieves
the best overall performance, striking a balance between safety, progress,
comfort, and human-likeness. To our knowledge, our work is the first
demonstration of MCTS-based planning on public roads and underscores the
importance of evaluating planners across a diverse set of metrics and in
real-world environments. TreeIRL is highly extensible and could be further
improved with reinforcement learning and imitation learning, providing a
framework for exploring different combinations of classical and learning-based
approaches to solve the planning bottleneck in autonomous driving.
[LINK]
http://arxiv.org/abs/2509.13579v3
[DATE]
2025-10-07 02:30:30+08:00
[CATEGORIES]
cs.LG
ECLipsE-Gen-Local: Efficient Compositional Local Lipschitz Estimates for Deep Neural Networks
[AUTHORS]
Yuezhu Xu, S. Sivaranjani
[ABSTRACT]
The Lipschitz constant is a key measure for certifying the robustness of
neural networks to input perturbations. However, computing the exact constant
is NP-hard, and standard approaches to estimate the Lipschitz constant involve
solving a large matrix semidefinite program (SDP) that scales poorly with
network size. Further, there is a potential to efficiently leverage local
information on the input region to provide tighter Lipschitz estimates. We
address this problem here by proposing a compositional framework that yields
tight yet scalable Lipschitz estimates for deep feedforward neural networks.
Specifically, we begin by developing a generalized SDP framework that is highly
flexible, accommodating heterogeneous activation function slopes, and allowing
Lipschitz estimates with respect to arbitrary input-output pairs and arbitrary
choices of sub-networks of consecutive layers. We then decompose this
generalized SDP into a sequence of small sub-problems, with computational
complexity that scales linearly with respect to the network depth. We also
develop a variant that achieves near-instantaneous computation through
closed-form solutions to each sub-problem. All our algorithms are accompanied
by theoretical guarantees on feasibility and validity. Next, we develop a
series of algorithms, termed ECLipsE-Gen-Local, that effectively incorporate
local information on the input. Our experiments demonstrate that our algorithms
achieve substantial speedups over a multitude of benchmarks while producing
significantly tighter Lipschitz bounds than global approaches. Moreover, we
show that our algorithms provide strict upper bounds for the Lipschitz constant
with values approaching the exact Jacobian from autodiff when the input region
is small enough. Finally, we demonstrate the practical utility of our approach
by showing that our Lipschitz estimates closely align with network robustness.
[LINK]
http://arxiv.org/abs/2510.05261v1
[DATE]
2025-10-07 02:26:46+08:00
[CATEGORIES]
cs.LG
Strong bounds for large-scale Minimum Sum-of-Squares Clustering
[AUTHORS]
Anna Livia Croella, Veronica Piccialli, Antonio M. Sudoso
[ABSTRACT]
Clustering is a fundamental technique in data analysis and machine learning,
used to group similar data points together. Among various clustering methods,
the Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used.
MSSC aims to minimize the total squared Euclidean distance between data points
and their corresponding cluster centroids. Due to the unsupervised nature of
clustering, achieving global optimality is crucial, yet computationally
challenging. The complexity of finding the global solution increases
exponentially with the number of data points, making exact methods impractical
for large-scale datasets. Even obtaining strong lower bounds on the optimal
MSSC objective value is computationally prohibitive, making it difficult to
assess the quality of heuristic solutions. We address this challenge by
introducing a novel method to validate heuristic MSSC solutions through
optimality gaps. Our approach employs a divide-and-conquer strategy,
decomposing the problem into smaller instances that can be handled by an exact
solver. The decomposition is guided by an auxiliary optimization problem, the
“anticlustering problem”, for which we design an efficient heuristic.
Computational experiments demonstrate the effectiveness of the method for
large-scale instances, achieving optimality gaps below 3% in most cases while
maintaining reasonable computational times. These results highlight the
practicality of our approach in assessing feasible clustering solutions for
large datasets, bridging a critical gap in MSSC evaluation.
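The MSSC objective and an optimality gap can be made concrete with a short sketch. The relative-gap formula (upper bound minus lower bound, divided by the upper bound) is a common convention assumed here, since the abstract does not state the exact definition; the data, the lower-bound value, and the function names are hypothetical.

```python
import numpy as np

def mssc_objective(points, labels):
    """Total squared Euclidean distance from each point to its cluster centroid."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        centroid = cluster.mean(axis=0)
        total += float(np.sum((cluster - centroid) ** 2))
    return total

def optimality_gap(upper_bound, lower_bound):
    # Assumed convention: relative gap with respect to the heuristic value.
    return (upper_bound - lower_bound) / upper_bound

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])       # a heuristic clustering
ub = mssc_objective(points, labels)   # value of the heuristic solution
lb = 3.9                              # hypothetical lower bound from a bounding method
print(ub, optimality_gap(ub, lb))
```

A heuristic solution is certified as near-optimal when a valid lower bound brings this gap close to zero.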
[LINK]
http://arxiv.org/abs/2502.08397v2
[DATE]
2025-10-07 02:24:40+08:00
[CATEGORIES]
cs.LG
Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
[AUTHORS]
Yue Pan, Zihan Xia, Po-Kai Hsu, Lanxiang Hu, Hyungyo Kim, Janak Sharda, Minxuan Zhou, Nam Sung Kim, Shimeng Yu, Tajana Rosing, Mingu Kang
[ABSTRACT]
As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE)
architecture has emerged as a prevailing design for achieving state-of-the-art
performance across a wide range of tasks. MoE models use sparse gating to
activate only a handful of expert sub-networks per input, achieving
billion-parameter capacity with inference costs akin to much smaller models.
However, such models often pose challenges for hardware deployment due to the
massive data volume introduced by the MoE layers. To address the challenges of
serving MoE models, we propose Stratum, a system-hardware co-design approach
that combines the novel memory technology Monolithic 3D-Stackable DRAM (Mono3D
DRAM), near-memory processing (NMP), and GPU acceleration. The logic and Mono3D
DRAM dies are connected through hybrid bonding, whereas the Mono3D DRAM stack
and GPU are interconnected via silicon interposer. Mono3D DRAM offers higher
internal bandwidth than HBM thanks to the dense vertical interconnect pitch
enabled by its monolithic structure, which supports implementations of
higher-performance near-memory processing. Furthermore, we tackle the latency
differences introduced by aggressive vertical scaling of Mono3D DRAM along the
z-dimension by constructing internal memory tiers and assigning data across
layers based on access likelihood, guided by topic-based expert usage
prediction to boost NMP throughput. The Stratum system achieves up to 8.29x
improvement in decoding throughput and 7.66x better energy efficiency across
various benchmarks compared to GPU baselines.
[LINK]
http://arxiv.org/abs/2510.05245v1
[DATE]
2025-10-07 02:09:47+08:00
[CATEGORIES]
cs.LG
Simultaneous Learning and Optimization via Misspecified Saddle Point Problems
[AUTHORS]
Mohammad Mahdi Ahmadi, Erfan Yazdandoost Hamedani
[ABSTRACT]
We study a class of misspecified saddle point (SP) problems, where the
optimization objective depends on an unknown parameter that must be learned
concurrently from data. Unlike existing studies that assume parameters are
fully known or pre-estimated, our framework integrates optimization and
learning into a unified formulation, enabling a more flexible problem class. To
address this setting, we propose two algorithms based on the accelerated
primal-dual (APD) method of Hamedani & Aybat (2021). In particular, we first analyze the
naive extension of the APD method by directly substituting the evolving
parameter estimates into the primal-dual updates; then, we design a new
learning-aware variant of the APD method that explicitly accounts for parameter
dynamics by adjusting the momentum updates. Both methods achieve a provable
convergence rate of $\mathcal{O}(\log K / K)$, while the learning-aware
approach attains a tighter $\mathcal{O}(1)$ constant and further benefits from
an adaptive step-size selection enabled by a backtracking strategy.
Furthermore, we extend the framework to problems where the learning problem
admits multiple optimal solutions, showing that our modified algorithm for a
structured setting achieves an $\mathcal{O}(1/\sqrt{K})$ rate. To demonstrate
practical impact, we evaluate our methods on a misspecified portfolio
optimization problem and show superior empirical performance compared to
state-of-the-art algorithms.
[LINK]
http://arxiv.org/abs/2510.05241v1
[DATE]
2025-10-07 02:07:02+08:00
[CATEGORIES]
cs.LG
Attribute Fusion-based Classifier on Framework of Belief Structure
[AUTHORS]
Qiying Hu, Yingying Liang, Qianli Zhou, Witold Pedrycz
[ABSTRACT]
Dempster-Shafer Theory (DST) provides a powerful framework for modeling
uncertainty and has been widely applied to multi-attribute classification
tasks. However, traditional DST-based attribute fusion-based classifiers suffer
from oversimplified membership function modeling and limited exploitation of
the belief structure brought by basic probability assignment (BPA), reducing
their effectiveness in complex real-world scenarios. This paper presents an
enhanced attribute fusion-based classifier that addresses these limitations
through two key innovations. First, we adopt a selective modeling strategy that
utilizes both single Gaussian and Gaussian Mixture Models (GMMs) for membership
function construction, with model selection guided by cross-validation and a
tailored evaluation metric. Second, we introduce a novel method to transform
the possibility distribution into a BPA by combining simple BPAs derived from
normalized possibility distributions, enabling a much richer and more flexible
representation of uncertain information. Furthermore, we apply the belief
structure-based BPA generation method to the evidential K-Nearest Neighbors
(EKNN) classifier, enhancing its ability to incorporate uncertainty information
into decision-making. Comprehensive experiments on benchmark datasets are
conducted to evaluate the performance of the proposed attribute fusion-based
classifier and the enhanced evidential K-Nearest Neighbors classifier in
comparison with both evidential classifiers and conventional machine learning
classifiers. The results demonstrate that the proposed classifier outperforms
the best existing evidential classifier, achieving an average accuracy
improvement of 4.86%, while maintaining low variance, thus confirming its
superior effectiveness and robustness.
[LINK]
http://arxiv.org/abs/2509.00754v2
[DATE]
2025-10-07 02:02:55+08:00
[CATEGORIES]
cs.LG
CMT-Benchmark: A Benchmark for Condensed Matter Theory Built by Expert Researchers
[AUTHORS]
Haining Pan, James V. Roggeveen, Erez Berg, Juan Carrasquilla, Debanjan Chowdhury, Surya Ganguli, Federico Ghimenti, Juraj Hasik, Henry Hunt, Hong-Chen Jiang, Mason Kamb, Ying-Jer Kao, Ehsan Khatami, Michael J. Lawler, Di Luo, Titus Neupert, Xiaoliang Qi, Michael P. Brenner, Eun-Ah Kim
[ABSTRACT]
Large language models (LLMs) have shown remarkable progress in coding and
math problem-solving, but evaluation on advanced research-level problems in
hard sciences remains scarce. To fill this gap, we present CMT-Benchmark, a
dataset of 50 problems covering condensed matter theory (CMT) at the level of
an expert researcher. Topics span analytical and computational approaches in
quantum many-body physics and classical statistical mechanics. The dataset was
designed and verified by a panel of expert researchers from around the world.
We built the dataset through a collaborative environment that challenges the
panel to write and refine problems they would want a research assistant to
solve, including Hartree-Fock, exact diagonalization, quantum/variational Monte
Carlo, density matrix renormalization group (DMRG), quantum/classical
statistical mechanics, and model building. We evaluate LLMs by programmatically
checking solutions against expert-supplied ground truth. We developed
machine-grading methods, including symbolic handling of non-commuting operators
via normal ordering, that also generalize across tasks. Our evaluations show that
frontier models struggle with all of the problems in the dataset, highlighting
a gap in the physical reasoning skills of current LLMs. Notably, experts
identified strategies for creating increasingly difficult problems by
interacting with the LLMs and exploiting common failure modes. The best model,
GPT5, solves 30\% of the problems; average across 17 models (GPT, Gemini,
Claude, DeepSeek, Llama) is 11.4$\pm$2.1\%. Moreover, 18 problems are solved by
none of the 17 models, and 26 by at most one. These unsolved problems span
Quantum Monte Carlo, Variational Monte Carlo, and DMRG. Answers sometimes
violate fundamental symmetries or have unphysical scaling dimensions. We
believe this benchmark will guide development toward capable AI research
assistants and tutors.
[COMMENTS]
19 pages, 3 figures
[LINK]
http://arxiv.org/abs/2510.05228v1
[DATE]
2025-10-07 02:00:55+08:00
[CATEGORIES]
cs.LG
Approximate Gaussianity Beyond Initialisation in Neural Networks
[AUTHORS]
Edward Hirst, Sanjaye Ramgoolam
[ABSTRACT]
Ensembles of neural network weight matrices are studied through the training
process for the MNIST classification problem, testing the efficacy of matrix
models for representing their distributions, under assumptions of Gaussianity
and permutation-symmetry. The general 13-parameter permutation invariant
Gaussian matrix models are found to be effective models for the correlated
Gaussianity in the weight matrices, beyond the range of applicability of the
simple Gaussian with independent identically distributed matrix variables, and
notably well beyond the initialisation step. The representation theoretic model
parameters, and the graph-theoretic characterisation of the permutation
invariant matrix observables give an interpretable framework for the best-fit
model and for small departures from Gaussianity. Additionally, the Wasserstein
distance is calculated for this class of models and used to quantify the
movement of the distributions over training. Throughout the work, the effects
of varied initialisation regimes, regularisation, layer depth, and layer width
are tested for this formalism, identifying limits where particular departures
from Gaussianity are enhanced and how more general, yet still
highly-interpretable, models can be developed.
[COMMENTS]
26+34 pages, 15 figures, 12 tables
[LINK]
http://arxiv.org/abs/2510.05218v1
[DATE]
2025-10-07 02:00:46+08:00
[CATEGORIES]
cs.LG
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
[AUTHORS]
Yixiao Wang, Mingxiao Huo, Zhixuan Liang, Yushi Du, Lingfeng Sun, Haotian Lin, Jinghuan Shang, Chensheng Peng, Mohit Bansal, Mingyu Ding, Masayoshi Tomizuka
[ABSTRACT]
Pretrained vision foundation models (VFMs) advance robotic learning via rich
visual representations, yet individual VFMs typically excel only in specific
domains, limiting generality across tasks. Distilling multiple VFMs into a
unified representation for policy can mitigate this limitation but often yields
inflexible task-specific feature selection and requires costly full re-training
to incorporate robot-domain knowledge. We propose VER, a Vision Expert
transformer for Robot learning. During pretraining, VER distills multiple VFMs
into a vision expert library. It then fine-tunes only a lightweight routing
network (fewer than 0.4% of parameters) to dynamically select task-relevant
experts from the pretrained library for downstream robot tasks. We further
introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve
both flexibility and precision of dynamic expert selection. Moreover, VER
supports parameter-efficient finetuning for scalable expert utilization and
adaptive robot-domain knowledge integration. Across 17 diverse robotic tasks
and multiple policy heads, VER achieves state-of-the-art performance. We find
that VER reduces large-norm outliers in task-irrelevant regions (e.g.,
background) and concentrates on task-critical regions. Visualizations and code
can be found at https://yixiaowang7.github.io/ver_page/.
[LINK]
http://arxiv.org/abs/2510.05213v1
[DATE]
2025-10-07 02:00:43+08:00
[CATEGORIES]
cs.LG
A Data-Driven Prism: Multi-View Source Separation with Diffusion Model Priors
[AUTHORS]
Sebastian Wagner-Carena, Aizhan Akhmetzhanova, Sydney Erickson
[ABSTRACT]
A common challenge in the natural sciences is to disentangle distinct,
unknown sources from observations. Examples of this source separation task
include deblending galaxies in a crowded field, distinguishing the activity of
individual neurons from overlapping signals, and separating seismic events from
an ambient background. Traditional analyses often rely on simplified source
models that fail to accurately reproduce the data. Recent advances have shown
that diffusion models can directly learn complex prior distributions from
noisy, incomplete data. In this work, we show that diffusion models can solve
the source separation problem without explicit assumptions about the source.
Our method relies only on multiple views, or the property that different sets
of observations contain different linear transformations of the unknown
sources. We show that our method succeeds even when no source is individually
observed and the observations are noisy, incomplete, and vary in resolution.
The learned diffusion models enable us to sample from the source priors,
evaluate the probability of candidate sources, and draw from the joint
posterior of the source distribution given an observation. We demonstrate the
effectiveness of our method on a range of synthetic problems as well as
real-world galaxy observations.
[COMMENTS]
Accepted to main conference of NeurIPS 2025. Code available at
https://github.com/swagnercarena/ddprism
[LINK]
http://arxiv.org/abs/2510.05205v1
[DATE]
2025-10-07 02:00:05+08:00
[CATEGORIES]
cs.LG
TopInG: Topologically Interpretable Graph Learning via Persistent Rationale Filtration
[AUTHORS]
Cheng Xin, Fan Xu, Xin Ding, Jie Gao, Jiaxin Ding
[ABSTRACT]
Graph Neural Networks (GNNs) have shown remarkable success across various
scientific fields, yet their adoption in critical decision-making is often
hindered by a lack of interpretability. Recently, intrinsically interpretable
GNNs have been studied to provide insights into model predictions by
identifying rationale substructures in graphs. However, existing methods face
challenges when the underlying rationale subgraphs are complex and varied. In
this work, we propose TopInG: Topologically Interpretable Graph Learning, a
novel topological framework that leverages persistent homology to identify
persistent rationale subgraphs. TopInG employs a rationale filtration learning
approach to model an autoregressive generation process of rationale subgraphs,
and introduces a self-adjusted topological constraint, termed topological
discrepancy, to enforce a persistent topological distinction between rationale
subgraphs and irrelevant counterparts. We provide theoretical guarantees that
our loss function is uniquely optimized by the ground truth under specific
conditions. Extensive experiments demonstrate TopInG’s effectiveness in
tackling key challenges, such as handling variform rationale subgraphs,
balancing predictive performance with interpretability, and mitigating spurious
correlations. Results show that our approach improves upon state-of-the-art
methods on both predictive accuracy and interpretation quality.
[COMMENTS]
submitted to ICML 2025
[LINK]
http://arxiv.org/abs/2510.05102v1
[DATE]
2025-10-07 01:59:44+08:00
[CATEGORIES]
cs.LG
MALT: Improving Reasoning with Multi-Agent LLM Training
[AUTHORS]
Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Rafael Rafailov, Ivan Laptev, Philip H. S. Torr, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt
[ABSTRACT]
Large Language Models (LLMs) often produce answers with a single
chain-of-thought, which restricts their ability to explore reasoning paths or
self-correct flawed outputs in complex tasks. In this paper, we introduce MALT
(Multi-Agent LLM Training), a novel post-training strategy that divides the
reasoning process into generation, verification, and refinement steps using a
sequential pipeline of heterogeneous agents. During data generation, each agent
is repeatedly sampled to form a multi-agent search tree, where final outputs
are graded against ground-truth data. We then apply value iteration to
propagate reward signals back to each role-conditioned model, automatically
producing multi-agent post-training data without human or teacher-model
supervision. Our off-policy approach allows each agent to specialize by
learning from correct and incorrect trajectories, ultimately improving the
end-to-end reasoning chain. On MATH, GSM8K, and CSQA, MALT surpasses the same
baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40%
respectively, making it an important advance towards multi-agent cooperative
training.
[COMMENTS]
Published at COLM 2025
[LINK]
http://arxiv.org/abs/2412.01928v3
[DATE]
2025-10-07 01:57:15+08:00
[CATEGORIES]
cs.LG
Learning Penalty for Optimal Partitioning via Automatic Feature Extraction
[AUTHORS]
Tung L Nguyen, Toby Hocking
[ABSTRACT]
Changepoint detection identifies significant shifts in data sequences, making
it important in areas like finance, genetics, and healthcare. Optimal
Partitioning algorithms efficiently detect these changes, using a penalty
parameter to limit the number of changepoints. Determining the optimal value for
this penalty can be challenging. Traditionally, this process involved manually
extracting statistical features, such as sequence length or variance, to make
the prediction. This study proposes a novel approach that uses recurrent
networks to learn this penalty directly from raw sequences by automatically
extracting features. Experiments conducted on 20 benchmark genomic datasets
show that this novel method generally outperforms traditional ones in
changepoint detection accuracy.
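The role of the penalty can be seen in a minimal Optimal Partitioning sketch. It assumes a squared-error segment cost and the plain O(n^2) dynamic program without pruning; the function name, the toy signal, and the two penalty values are hypothetical, and the penalty is fixed by hand here rather than learned.

```python
import numpy as np

def optimal_partitioning(y, penalty):
    """Return changepoint positions (indices where a new segment starts)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    cs = np.concatenate(([0.0], np.cumsum(y)))
    cs2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def seg_cost(s, t):
        # Squared error of y[s:t] around its own mean.
        total = cs[t] - cs[s]
        return (cs2[t] - cs2[s]) - total * total / (t - s)

    best = np.full(n + 1, np.inf)
    best[0] = -penalty            # so the first segment is not penalized
    prev = np.zeros(n + 1, dtype=int)
    for t in range(1, n + 1):
        for s in range(t):
            cand = best[s] + seg_cost(s, t) + penalty
            if cand < best[t]:
                best[t] = cand
                prev[t] = s
    # Backtrack through the stored segment starts.
    changepoints = []
    t = n
    while t > 0:
        s = int(prev[t])
        if s > 0:
            changepoints.append(s)
        t = s
    return sorted(changepoints)

signal = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0]
print(optimal_partitioning(signal, penalty=1.0))    # prints [4]
print(optimal_partitioning(signal, penalty=100.0))  # prints []
```

A small penalty recovers the mean shift at index 4, while a large penalty suppresses every changepoint, which is why the choice of penalty drives detection accuracy.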
[COMMENTS]
9 Figures
[LINK]
http://arxiv.org/abs/2505.07413v2
[DATE]
2025-10-07 01:53:44+08:00
[CATEGORIES]
cs.LG
Conformal Prediction for Long-Tailed Classification
[AUTHORS]
Tiffany Ding, Jean-Baptiste Fermanian, Joseph Salmon
[ABSTRACT]
Many real-world classification problems, such as plant identification, have
extremely long-tailed class distributions. In order for prediction sets to be
useful in such settings, they should (i) provide good class-conditional
coverage, ensuring that rare classes are not systematically omitted from the
prediction sets, and (ii) be a reasonable size, allowing users to easily verify
candidate labels. Unfortunately, existing conformal prediction methods, when
applied to the long-tailed setting, force practitioners to make a binary choice
between small sets with poor class-conditional coverage or sets with very good
class-conditional coverage but that are extremely large. We propose methods
with guaranteed marginal coverage that smoothly trade off between set size and
class-conditional coverage. First, we introduce a new conformal score function
called prevalence-adjusted softmax that targets macro-coverage, a relaxed
notion of class-conditional coverage. Second, we propose a new procedure that
interpolates between marginal and class-conditional conformal prediction by
linearly interpolating their conformal score thresholds. We demonstrate our
methods on Pl@ntNet-300K and iNaturalist-2018, two long-tailed image datasets
with 1,081 and 8,142 classes, respectively.
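The second procedure, linear interpolation between marginal and class-conditional thresholds, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: scores are nonconformity scores (a label enters the prediction set when its score is at most the threshold), the quantile uses the standard finite-sample rank ceil((n + 1)(1 - alpha)), and the function names, the `weight` parameter, and the calibration data are hypothetical.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Order statistic of rank ceil((n + 1) * (1 - alpha)), or inf if rank > n."""
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return float("inf")
    return float(np.sort(scores)[rank - 1])

def interpolated_thresholds(scores, labels, n_classes, alpha, weight):
    """weight=0 gives the marginal threshold; weight=1 gives classwise thresholds."""
    marginal = conformal_quantile(scores, alpha)
    thresholds = np.empty(n_classes)
    for k in range(n_classes):
        classwise = conformal_quantile(scores[labels == k], alpha)
        if weight == 0.0:
            thresholds[k] = marginal
        elif weight == 1.0:
            thresholds[k] = classwise
        else:
            thresholds[k] = (1 - weight) * marginal + weight * classwise
    return thresholds

# Hypothetical calibration scores for two classes.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(interpolated_thresholds(scores, labels, 2, alpha=0.25, weight=0.5))
```

Sweeping `weight` from 0 to 1 moves each class threshold from the shared marginal value toward its own class-conditional value, trading set size against class-conditional coverage.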
[LINK]
http://arxiv.org/abs/2507.06867v2
[DATE]
2025-10-07 01:52:49+08:00
[CATEGORIES]
cs.LG
PENEX: AdaBoost-Inspired Neural Network Regularization
[AUTHORS]
Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach
[ABSTRACT]
AdaBoost sequentially fits so-called weak learners to minimize an exponential
loss, which penalizes mislabeled data points more severely than other loss
functions like cross-entropy. Paradoxically, AdaBoost generalizes well in
practice as the number of weak learners grows. In the present work, we
introduce Penalized Exponential Loss (PENEX), a new formulation of the
multi-class exponential loss that is theoretically grounded and, in contrast to
the existing formulation, amenable to optimization via first-order methods. We
demonstrate both empirically and theoretically that PENEX implicitly maximizes
margins of data points. Also, we show that gradient increments on PENEX
implicitly parameterize weak learners in the boosting framework. Across
computer vision and language tasks, we show that PENEX exhibits a regularizing
effect often better than established methods with similar computational cost.
Our results highlight PENEX’s potential as an AdaBoost-inspired alternative for
effective training and fine-tuning of deep neural networks.
[LINK]
http://arxiv.org/abs/2510.02107v2
[DATE]
2025-10-07 01:51:59+08:00
[CATEGORIES]
cs.LG
MICROTRIPS: MICRO-geography TRavel Intelligence and Pattern Synthesis
[AUTHORS]
Yangyang Wang, Tayo Fabusuyi
[ABSTRACT]
This study presents a novel small-area estimation framework to enhance urban
transportation planning through detailed characterization of travel behavior.
Our approach improves on the four-step travel model by employing publicly
available microdata files and machine learning methods to predict travel
behavior for a representative, synthetic population at small geographic areas.
This approach enables high-resolution estimation of trip generation, trip
distribution, mode choice, and route assignment. Validation using ACS/PUMS
work-commute datasets demonstrates that our framework achieves higher accuracy
compared to conventional approaches. The resulting granular insights enable the
tailoring of interventions to address localized situations and support a range
of policy applications and targeted interventions, including the optimal
placement of micro-fulfillment centers, effective curb-space management, and
the design of more inclusive transportation solutions particularly for
vulnerable communities.
[LINK]
http://arxiv.org/abs/2510.05080v1
[DATE]
2025-10-07 01:50:56+08:00
[CATEGORIES]
cs.LG
Diffusion^2: Turning 3D Environments into Radio Frequency Heatmaps
[AUTHORS]
Kyoungjun Park, Yifan Yang, Changhan Ge, Lili Qiu, Shiqi Jiang
[ABSTRACT]
Modeling radio frequency (RF) signal propagation is essential for
understanding the environment, as RF signals offer valuable insights beyond the
capabilities of RGB cameras, which are limited by the visible-light spectrum,
lens coverage, and occlusions. It is also useful for supporting wireless
diagnosis, deployment, and optimization. However, accurately predicting RF
signals in complex environments remains a challenge due to interactions with
obstacles such as absorption and reflection. We introduce Diffusion^2, a
diffusion-based approach that uses 3D point clouds to model the propagation of
RF signals across a wide range of frequencies, from Wi-Fi to millimeter waves.
To effectively capture RF-related features from 3D data, we present the RF-3D
Encoder, which encapsulates the complexities of 3D geometry along with
signal-specific details. These features undergo multi-scale embedding to
simulate the actual RF signal dissemination process. Our evaluation, based on
synthetic and real-world measurements, demonstrates that Diffusion^2 accurately
estimates the behavior of RF signals in various frequency bands and
environmental conditions, with an error margin of just 1.9 dB while running 27x
faster than existing methods, marking a significant advancement in the field. Refer to
https://rfvision-project.github.io/ for more information.
[LINK]
http://arxiv.org/abs/2510.02274v2
[DATE]
2025-10-07 01:44:43+08:00
[CATEGORIES]
cs.LG
Efficient Prediction of Pass@k Scaling in Large Language Models
[AUTHORS]
Joshua Kazdan, Rylan Schaeffer, Youssef Allouah, Colin Sullivan, Kyssen Yu, Noam Levi, Sanmi Koyejo
[ABSTRACT]
Assessing the capabilities and risks of frontier AI systems is a critical
area of research, and recent work has shown that repeated sampling from models
can dramatically increase both. For instance, repeated sampling has been shown
to increase their capabilities, such as solving difficult math and coding
problems, but it has also been shown to increase their potential for harm, such
as being jailbroken. Such results raise a crucial question for both capability
and safety forecasting: how can one accurately predict a model’s behavior when
scaled to a massive number of attempts, given a vastly smaller sampling budget?
This question is directly relevant to model providers, who serve hundreds of
millions of users daily, and to governmental regulators, who seek to prevent
harms. To answer this question, we make three contributions. First, we find
that standard methods for fitting these laws suffer from statistical
shortcomings that hinder predictive accuracy, especially in data-limited
scenarios. Second, we remedy these shortcomings by introducing a robust
estimation framework, which uses a beta-binomial distribution to generate more
accurate predictions from limited data. Third, we propose a dynamic sampling
strategy that allocates a greater budget to harder problems. Combined, these
innovations enable more reliable prediction of rare risks and capabilities at a
fraction of the computational cost.
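
As a rough illustration of the beta-binomial idea mentioned above (not the authors' estimator, whose details are not given in this abstract), the sketch below assumes that per-problem success probabilities follow a Beta(alpha, beta) distribution. Under that assumption the mean pass@k has the closed form 1 - B(alpha, beta + k) / B(alpha, beta). The function names are hypothetical, and fitting alpha and beta from data is not shown. The standard combinatorial pass@k estimate for a single problem is included for comparison.

```python
import math


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Combinatorial pass@k estimate from n samples with c successes (k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # Probability that a random size-k subset of the n samples has no success.
    # math.comb returns 0 when fewer than k failures exist.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def expected_pass_at_k(alpha: float, beta: float, k: int) -> float:
    """Mean pass@k when success rates p ~ Beta(alpha, beta) (an assumption).

    E[(1 - p)^k] = B(alpha, beta + k) / B(alpha, beta).
    """
    return 1.0 - math.exp(log_beta(alpha, beta + k) - log_beta(alpha, beta))
```

The closed form can be evaluated at any k, including values far larger than the number of samples actually drawn, which is what makes a fitted distribution usable for extrapolation.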
[LINK]
http://arxiv.org/abs/2510.05197v1
[DATE]
2025-10-07 01:42:27+08:00
[CATEGORIES]
cs.LG
VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
[AUTHORS]
Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, Lili Qiu
[ABSTRACT]
With the rapid advancement of AI-generated videos, there is an urgent need
for effective detection tools to mitigate societal risks such as misinformation
and reputational harm. In addition to accurate classification, it is essential
that detection models provide interpretable explanations to ensure transparency
for regulators and end users. To address these challenges, we introduce
VidGuard-R1, the first video authenticity detector that fine-tunes a
multi-modal large language model (MLLM) using group relative policy
optimization (GRPO). Our model delivers both highly accurate judgments and
insightful reasoning. We curate a challenging dataset of 140k real and
AI-generated videos produced by state-of-the-art generation models, carefully
designing the generation process to maximize discrimination difficulty. We then
fine-tune Qwen-VL using GRPO with two specialized reward models that target
temporal artifacts and generation complexity. Extensive experiments demonstrate
that VidGuard-R1 achieves state-of-the-art zero-shot performance on existing
benchmarks, with additional training pushing accuracy above 95%. Case studies
further show that VidGuard-R1 produces precise and interpretable rationales
behind its predictions. The code is publicly available at
https://VidGuard-R1.github.io.
[LINK]
http://arxiv.org/abs/2510.02282v2
[DATE]
2025-10-07 01:39:06+08:00
[CATEGORIES]
cs.LG
ResCP: Reservoir Conformal Prediction for Time Series Forecasting
[AUTHORS]
Roberto Neglia, Andrea Cini, Michael M. Bronstein, Filippo Maria Bianchi
[ABSTRACT]
Conformal prediction offers a powerful framework for building
distribution-free prediction intervals for exchangeable data. Existing methods
that extend conformal prediction to sequential data rely on fitting a
relatively complex model to capture temporal dependencies. However, these
methods can fail if the sample size is small and often require expensive
retraining when the underlying data distribution changes. To overcome these
limitations, we propose Reservoir Conformal Prediction (ResCP), a novel
training-free conformal prediction method for time series. Our approach
leverages the efficiency and representation learning capabilities of reservoir
computing to dynamically reweight conformity scores. In particular, we compute
similarity scores among reservoir states and use them to adaptively reweight
the observed residuals at each step. With this approach, ResCP enables us to
account for local temporal dynamics when modeling the error distribution
without compromising computational scalability. We prove that, under reasonable
assumptions, ResCP achieves asymptotic conditional coverage, and we empirically
demonstrate its effectiveness across diverse forecasting tasks.
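
The reweighting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ResCP implementation: the reservoir design, the softmax-over-distance similarity, the `temperature` parameter, and all function names are hypothetical choices made here for concreteness.

```python
import math
import random


def reservoir_states(series, size=20, leak=0.5, seed=0):
    """Run a small random tanh reservoir over a 1-D series; one state per step."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(size)]
    # Recurrent weights scaled so every absolute row sum stays below 1.
    w = [[rng.uniform(-1, 1) * 0.9 / size for _ in range(size)] for _ in range(size)]
    h = [0.0] * size
    states = []
    for x in series:
        pre = [w_in[i] * x + sum(w[i][j] * h[j] for j in range(size)) for i in range(size)]
        h = [(1 - leak) * h[i] + leak * math.tanh(pre[i]) for i in range(size)]
        states.append(h)
    return states


def similarity_weights(states, query, temperature=1.0):
    """Softmax weights over negative squared distances to the query state."""
    d2 = [sum((a - b) ** 2 for a, b in zip(s, query)) for s in states]
    m = min(d2)  # subtract the minimum for numerical stability
    raw = [math.exp(-(d - m) / temperature) for d in d2]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_quantile(values, weights, level):
    """Smallest value whose cumulative normalized weight reaches `level`."""
    pairs = sorted(zip(values, weights))
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= level:
            return v
    return pairs[-1][0]
```

Given past absolute residuals paired with their reservoir states, a prediction interval half-width at miscoverage `alpha` would be `weighted_quantile(residuals, similarity_weights(states, current_state), 1 - alpha)`, so residuals observed under similar dynamics count more.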
[LINK]
http://arxiv.org/abs/2510.05060v1
[DATE]
2025-10-07 01:37:44+08:00
[CATEGORIES]
cs.LG
Modeling Student Learning with 3.8 Million Program Traces
[AUTHORS]
Alexis Ross, Megha Srivastava, Jeremiah Blanchard, Jacob Andreas
[ABSTRACT]
As programmers write code, they often edit and retry multiple times, creating
rich “interaction traces” that reveal how they approach coding tasks and
provide clues about their level of skill development. For novice programmers in
particular, these traces reflect the diverse reasoning processes they employ to
code, such as exploratory behavior to understand how a programming concept
works, re-strategizing in response to bugs, and personalizing stylistic
choices. In this work, we explore what can be learned from training language
models on such reasoning traces: not just about code, but about coders, and
particularly students learning to program. We introduce a dataset of over 3.8
million programming reasoning traces from users of Pencil Code, a free online
educational platform used by students to learn simple programming concepts.
Compared to models trained only on final programs or synthetically-generated
traces, we find that models trained on real traces are stronger at modeling
diverse student behavior. Through both behavioral and probing analyses, we also
find that many properties of code traces, such as goal backtracking or number
of comments, can be predicted from learned representations of the students who
write them. Building on this result, we show that we can help students recover
from mistakes by steering code generation models to identify a sequence of
edits that will result in more correct code while remaining close to the
original student’s style. Together, our results suggest that many properties of
code are properties of individual students and that training on edit traces can
lead to models that are more steerable, more predictive of student behavior
while programming, and better at generating programs in their final states.
Code and data are available at https://github.com/meghabyte/pencilcode-public
[LINK]
http://arxiv.org/abs/2510.05056v1
[DATE]
2025-10-07 01:37:17+08:00
[CATEGORIES]
cs.LG
Learning-Augmented Robust Algorithmic Recourse
[AUTHORS]
Kshitij Kayastha, Vasilis Gkatzelis, Shahin Jabbari
[ABSTRACT]
Algorithmic recourse provides individuals who receive undesirable outcomes
from machine learning systems with minimum-cost improvements to achieve a
desirable outcome. However, machine learning models often get updated, so the
recourse may not lead to the desired outcome. The robust recourse framework
chooses recourses that are less sensitive to adversarial model changes, but
this comes at a higher cost. To address this, we initiate the study of
learning-augmented algorithmic recourse and evaluate the extent to which a
designer equipped with a prediction of the future model can reduce the cost of
recourse when the prediction is accurate (consistency) while also limiting the
cost even when the prediction is inaccurate (robustness). We propose a novel
algorithm, study the robustness-consistency trade-off, and analyze how
prediction accuracy affects performance.
[LINK]
http://arxiv.org/abs/2410.01580v2
[DATE]
2025-10-07 01:35:00+08:00
[CATEGORIES]
cs.LG
HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model
[AUTHORS]
Peter Van Katwyk, Karianne J. Bergen
[ABSTRACT]
Uncertainty quantification is critical for ensuring robustness in high-stakes
machine learning applications. We introduce HybridFlow, a modular hybrid
architecture that unifies the modeling of aleatoric and epistemic uncertainty
by combining a Conditional Masked Autoregressive normalizing flow for
estimating aleatoric uncertainty with a flexible probabilistic predictor for
epistemic uncertainty. The framework supports integration with any
probabilistic model class, allowing users to easily adapt HybridFlow to
existing architectures without sacrificing predictive performance. HybridFlow
improves upon previous uncertainty quantification frameworks across a range of
regression tasks, such as depth estimation, a collection of regression
benchmarks, and a scientific case study of ice sheet emulation. We also provide
empirical results of the quantified uncertainty, showing that the uncertainty
quantified by HybridFlow is calibrated and better aligns with model error than
existing methods for quantifying aleatoric and epistemic uncertainty.
HybridFlow addresses a key challenge in Bayesian deep learning, unifying
aleatoric and epistemic uncertainty modeling in a single robust framework.
[COMMENTS]
Reviewed and published in TMLR at
https://openreview.net/forum?id=xRiEdSyVjY
[LINK]
http://arxiv.org/abs/2510.05054v1
[DATE]
2025-10-07 01:34:48+08:00
[CATEGORIES]
cs.LG
From paintbrush to pixel: A review of deep neural networks in AI-generated art
[AUTHORS]
Anne-Sofie Maerten, Derya Soydaner
[ABSTRACT]
This paper delves into the fascinating field of AI-generated art and explores
the various deep neural network architectures and models that have been
utilized to create it. From the classic convolutional networks to the
cutting-edge diffusion models, we examine the key players in the field. We
explain the general structures and working principles of these neural networks.
Then, we showcase examples of milestones, starting with the dreamy landscapes
of DeepDream and moving on to the most recent developments, including Stable
Diffusion and DALL-E 3, which produce mesmerizing images. We provide a detailed
comparison of these models, highlighting their strengths and limitations, and
examining the remarkable progress that deep neural networks have made so far in
a short period of time. With a unique blend of technical explanations and
insights into the current state of AI-generated art, this paper exemplifies how
art and computer science interact.
[LINK]
http://arxiv.org/abs/2302.10913v3
[DATE]
2025-10-07 01:33:24+08:00
[CATEGORIES]
cs.LG
KEEP: Integrating Medical Ontologies with Clinical Data for Robust Code Embeddings
[AUTHORS]
Ahmed Elhussein, Paul Meddeb, Abigail Newbury, Jeanne Mirone, Martin Stoll, Gamze Gursoy
[ABSTRACT]
Machine learning in healthcare requires effective representation of
structured medical codes, but current methods face a trade-off:
knowledge-graph-based approaches capture formal relationships but miss
real-world patterns, while data-driven methods learn empirical associations but
often overlook
structured knowledge in medical terminologies. We present KEEP (Knowledge
preserving and Empirically refined Embedding Process), an efficient framework
that bridges this gap by combining knowledge graph embeddings with adaptive
learning from clinical data. KEEP first generates embeddings from knowledge
graphs, then employs regularized training on patient records to adaptively
integrate empirical patterns while preserving ontological relationships.
Importantly, KEEP produces final embeddings without task-specific auxiliary or
end-to-end training, enabling KEEP to support multiple downstream applications
and model architectures. Evaluations on structured EHR from UK Biobank and
MIMIC IV demonstrate that KEEP outperforms both traditional and
language-model-based approaches in capturing semantic relationships and
predicting clinical outcomes. Moreover, KEEP’s minimal computational
requirements make it particularly suitable for resource-constrained environments.
[LINK]
http://arxiv.org/abs/2510.05049v1
[DATE]
2025-10-07 01:27:54+08:00
[CATEGORIES]
cs.LG
A Unified Optimization Framework for Multiclass Classification with Structured Hyperplane Arrangements
[AUTHORS]
Víctor Blanco, Harshit Kothari, James Luedtke
[ABSTRACT]
In this paper, we propose a new mathematical optimization model for
multiclass classification based on arrangements of hyperplanes. Our approach
preserves the core support vector machine (SVM) paradigm of maximizing class
separation while minimizing misclassification errors, and it is computationally
more efficient than a previous formulation. We present a kernel-based extension
that allows it to construct nonlinear decision boundaries. Furthermore, we show
how the framework can naturally incorporate alternative geometric structures,
including classification trees, $\ell_p$-SVMs, and models with discrete feature
selection. To address large-scale instances, we develop a dynamic clustering
matheuristic that leverages the proposed MIP formulation. Extensive
computational experiments demonstrate the efficiency of the proposed model and
dynamic clustering heuristic, and we report competitive classification
performance on both synthetic datasets and real-world benchmarks from the UCI
Machine Learning Repository, comparing our method with state-of-the-art
implementations available in scikit-learn.
[COMMENTS]
28 pages, 2 tables, 9 figures
[LINK]
http://arxiv.org/abs/2510.05047v1
[DATE]
2025-10-07 01:26:56+08:00
[CATEGORIES]
cs.LG
Test-Time Scaling in Diffusion LLMs via Hidden Semi-Autoregressive Experts
[AUTHORS]
Jihoon Lee, Hoyeon Moon, Kevin Zhai, Arun Kumar Chithanar, Anit Kumar Sahu, Soummya Kar, Chul Lee, Souradip Chakraborty, Amrit Singh Bedi
[ABSTRACT]
Diffusion-based large language models (dLLMs) are trained flexibly to model
extreme dependence in the data distribution; however, how to best utilize this
information at inference time remains an open problem. In this work, we uncover
an interesting property of these models: dLLMs trained on textual data
implicitly learn a mixture of semi-autoregressive experts, where different
generation orders reveal different specialized behaviors. We show that
committing to any single, fixed inference time schedule, a common practice,
collapses performance by failing to leverage this latent ensemble. To address
this, we introduce HEX (Hidden semiautoregressive EXperts for test-time
scaling), a training-free inference method that ensembles across heterogeneous
block schedules. By doing a majority vote over diverse block-sized generation
paths, HEX robustly avoids failure modes associated with any single fixed
schedule. On reasoning benchmarks such as GSM8K, it boosts accuracy by up to
3.56X (from 24.72% to 88.10%), outperforming top-K margin inference and
specialized fine-tuned methods like GRPO, without additional training. HEX even
yields significant gains on the MATH benchmark from 16.40% to 40.00%, scientific
reasoning on ARC-C from 54.18% to 87.80%, and TruthfulQA from 28.36% to 57.46%.
Our results establish a new paradigm for test-time scaling in diffusion-based
LLMs (dLLMs), revealing that the sequence in which masking is performed plays a
critical role in determining performance during inference.
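
The majority vote over block schedules described above can be sketched generically. The `generate` callable below is a hypothetical stand-in for a dLLM decoder that takes a prompt and a block size and returns a final answer string; it is an assumption for illustration and not part of any HEX API.

```python
from collections import Counter
from typing import Callable, Iterable


def majority_vote_over_schedules(
    generate: Callable[[str, int], str],
    prompt: str,
    block_sizes: Iterable[int],
) -> str:
    """Decode once per block size and return the most frequent final answer."""
    answers = [generate(prompt, b) for b in block_sizes]
    # Counter.most_common orders equal counts by first occurrence in `answers`.
    return Counter(answers).most_common(1)[0][0]
```

Because each decode is independent, the cost grows linearly with the number of block sizes, which is the test-time compute being traded for accuracy.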
[LINK]
http://arxiv.org/abs/2510.05040v1
[DATE]
2025-10-07 01:16:41+08:00
[CATEGORIES]
cs.LG
Graph-Aware Diffusion for Signal Generation
[AUTHORS]
Sergio Rozada, Vimal K. B., Andrea Cavallo, Antonio G. Marques, Hadi Jamali-Rad, Elvin Isufi
[ABSTRACT]
We study the problem of generating graph signals from unknown distributions
defined over given graphs, relevant to domains such as recommender systems or
sensor networks. Our approach builds on generative diffusion models, which are
well established in vision and graph generation but remain underexplored for
graph signals. Existing methods lack generality, either ignoring the graph
structure in the forward process or designing graph-aware mechanisms tailored
to specific domains. We adopt a forward process that incorporates the graph
through the heat equation. Rather than relying on the standard formulation, we
consider a time-warped coefficient to mitigate the exponential decay of the
drift term, yielding a graph-aware generative diffusion model (GAD). We analyze
its forward dynamics, proving convergence to a Gaussian Markov random field
with covariance parametrized by the graph Laplacian, and interpret the backward
dynamics as a sequence of graph-signal denoising problems. Finally, we
demonstrate the advantages of GAD on synthetic data, real traffic speed
measurements, and a temperature sensor network.
[LINK]
http://arxiv.org/abs/2510.05036v1
[DATE]
2025-10-07 01:11:32+08:00
[CATEGORIES]
cs.LG
Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory
[AUTHORS]
Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta
[ABSTRACT]
We explore whether techniques from AI can help discover new combinatorial
structures that improve on known limits on efficient algorithms. Specifically,
we use AlphaEvolve (an LLM coding agent) to study two settings:
a) Average-case hardness for MAX-CUT and MAX-Independent Set: We improve a
recent result of Kunisky and Yu to obtain near-optimal upper and (conditional)
lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on
random 3- and 4-regular graphs. Our improved lower bounds are obtained by
constructing nearly extremal Ramanujan graphs on as many as $163$ nodes, using
AlphaEvolve. Additionally, via analytical arguments we strengthen the upper
bounds to settle the computational hardness of these questions up to an error
in the third decimal place.
b) Worst-case Hardness of Approximation for MAX-k-CUT: We obtain new
inapproximability results, proving that it is NP-hard to approximate MAX-4-CUT
and MAX-3-CUT within factors of $0.987$ and $0.9649$ respectively, using
AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves
upon the SOTA of $0.9883$, and our MAX-3-CUT result improves on the current
best gadget-based inapproximability result of $0.9853$, but falls short of
improving the SOTA of $16/17$ that relies on a custom PCP, rather than a gadget
reduction from “standard” Håstad-style PCPs.
A key technical challenge we faced: verifying a candidate construction
produced by AlphaEvolve is costly (often requiring exponential time). In both
settings above, our results were enabled by using AlphaEvolve itself to evolve
the verification procedure to be faster (sometimes by $10,000\times$). We
conclude with a discussion of norms by which to assess the assistance from AI
in developing proofs.
[LINK]
http://arxiv.org/abs/2509.18057v4
[DATE]
2025-10-07 01:09:53+08:00
[CATEGORIES]
cs.LG
Unifying Autoregressive and Diffusion-Based Sequence Generation
[AUTHORS]
Nima Fathi, Torsten Scholak, Pierre-André Noël
[ABSTRACT]
We present significant extensions to diffusion-based sequence generation
models, blurring the line with autoregressive language models. First, we introduce
hyperschedules, which assign distinct noise schedules to individual token
positions, generalizing both autoregressive models (e.g., GPT) and conventional
diffusion models (e.g., SEDD, MDLM) as special cases. Second, we propose two
hybrid token-wise noising processes that interpolate between absorbing and
uniform processes, enabling the model to fix past mistakes, and we introduce a
novel inference algorithm that leverages this new feature in a simplified
context inspired by MDLM. To support efficient training and inference, we
design attention masks compatible with KV-caching. Our methods achieve
state-of-the-art perplexity and generate diverse, high-quality sequences across
standard benchmarks, suggesting a promising path for autoregressive
diffusion-based sequence generation. See code and resources at
https://hdlm-colm.github.io/
[COMMENTS]
Published as a conference paper at COLM 2025. Website:
https://hdlm-colm.github.io/
[LINK]
http://arxiv.org/abs/2504.06416v2
[DATE]
2025-10-07 01:09:39+08:00
[CATEGORIES]
cs.LG
Causal Abstractions, Categorically Unified
[AUTHORS]
Markus Englberger, Devendra Singh Dhami
[ABSTRACT]
We present a categorical framework for relating causal models that represent
the same system at different levels of abstraction. We define a causal
abstraction as natural transformations between appropriate Markov functors,
which concisely consolidate desirable properties a causal abstraction should
exhibit. Our approach unifies and generalizes previously considered causal
abstractions, and we obtain categorical proofs and generalizations of existing
results on causal abstractions. Using string diagrammatical tools, we can
explicitly describe the graphs that serve as consistent abstractions of a
low-level graph under interventions. We discuss how methods from mechanistic
interpretability, such as circuit analysis and sparse autoencoders, fit within
our categorical framework. We also show how applying do-calculus on a
high-level graphical abstraction of an acyclic directed mixed graph (ADMG),
when unobserved confounders are present, gives valid results on the low-level
graph, thus generalizing an earlier statement by Anand et al. (2023). We argue
that our framework is more suitable for modeling causal abstractions compared
to existing categorical frameworks. Finally, we discuss how notions such as
$\tau$-consistency and constructive $\tau$-abstractions can be recovered with
our framework.
[LINK]
http://arxiv.org/abs/2510.05033v1
[DATE]
2025-10-07 01:09:30+08:00
[CATEGORIES]
cs.LG
Rethinking Langevin Thompson Sampling from A Stochastic Approximation Perspective
[AUTHORS]
Weixin Wang, Haoyang Zheng, Guang Lin, Wei Deng, Pan Xu
[ABSTRACT]
Most existing approximate Thompson Sampling (TS) algorithms for multi-armed
bandits use Stochastic Gradient Langevin Dynamics (SGLD) or its variants in
each round to sample from the posterior, relaxing the need for conjugacy
assumptions between priors and reward distributions in vanilla TS. However,
they often require approximating a different posterior distribution in
each round of the bandit problem. This requires tricky, round-specific
tuning of hyperparameters such as dynamic learning rates, causing challenges in
both theoretical analysis and practical implementation. To alleviate this
non-stationarity, we introduce TS-SA, which incorporates stochastic
approximation (SA) within the TS framework. In each round, TS-SA constructs a
posterior approximation only using the most recent reward(s), performs a
Langevin Monte Carlo (LMC) update, and applies an SA step to average noisy
proposals over time. This can be interpreted as approximating a stationary
posterior target throughout the entire algorithm, which further yields a fixed
step-size, a unified convergence analysis framework, and improved posterior
estimates through temporal averaging. We establish near-optimal regret bounds
for TS-SA, with a simplified and more intuitive theoretical analysis enabled by
interpreting the entire algorithm as a simulation of a stationary SGLD process.
Our empirical results demonstrate that even a single-step Langevin update with
certain warm-up outperforms existing methods substantially on bandit tasks.
[COMMENTS]
39 pages, 3 figures, 2 tables
[LINK]
http://arxiv.org/abs/2510.05023v1
[DATE]
2025-10-07 01:01:29+08:00
[CATEGORIES]
cs.LG
Fast constrained sampling in pre-trained diffusion models
[AUTHORS]
Alexandros Graikos, Nebojsa Jojic, Dimitris Samaras
[ABSTRACT]
Large denoising diffusion models, such as Stable Diffusion, have been trained
on billions of image-caption pairs to perform text-conditioned image
generation. As a byproduct of this training, these models have acquired general
knowledge about image statistics, which can be useful for other inference
tasks. However, when confronted with sampling an image under new constraints,
e.g. generating the missing parts of an image, using large pre-trained
text-to-image diffusion models is inefficient and often unreliable. Previous
approaches either utilized backpropagation through the denoiser network, making
them significantly slower and more memory-demanding than simple text-to-image
generation, or only enforced the constraint locally, failing to capture
critical long-range correlations in the sampled image. In this work, we propose
an algorithm that enables fast, high-quality generation under arbitrary
constraints. We show that in denoising diffusion models, we can employ an
approximation to Newton’s optimization method that allows us to speed up
inference and avoid the expensive backpropagation operations. Our approach
produces results that rival or surpass the state-of-the-art training-free
inference methods while requiring a fraction of the time. We demonstrate the
effectiveness of our algorithm under both linear (inpainting, super-resolution)
and non-linear (style-guided generation) constraints. An implementation is
provided at https://github.com/cvlab-stonybrook/fast-constrained-sampling.
[LINK]
http://arxiv.org/abs/2410.18804v3
[DATE]
2025-10-07 00:59:09+08:00
[CATEGORIES]
cs.LG
Think Then Embed: Generative Context Improves Multimodal Embedding
[AUTHORS]
Xuanming Cui, Jianpeng Cheng, Hong-you Chen, Satya Narayan Shukla, Abhijeet Awasthi, Xichen Pan, Chaitanya Ahuja, Shlok Kumar Mishra, Qi Guo, Ser-Nam Lim, Aashu Singh, Xiangjun Fan
[ABSTRACT]
There is a growing interest in Universal Multimodal Embeddings (UME), where
models are required to generate task-specific representations. While recent
studies show that Multimodal Large Language Models (MLLMs) perform well on such
tasks, they treat MLLMs solely as encoders, overlooking their generative
capacity. However, such an encoding paradigm becomes less effective as
instructions become more complex and require compositional reasoning. Inspired
by the proven effectiveness of chain-of-thought reasoning, we propose a general
Think-Then-Embed (TTE) framework for UME, composed of a reasoner and an
embedder. The reasoner MLLM first generates reasoning traces that explain
complex queries, followed by an embedder that produces representations
conditioned on both the original query and the intermediate reasoning. This
explicit reasoning step enables more nuanced understanding of complex
multimodal instructions. Our contributions are threefold. First, by leveraging
a powerful MLLM reasoner, we achieve state-of-the-art performance on the
MMEB-V2 benchmark, surpassing proprietary models trained on massive in-house
datasets. Second, to reduce the dependency on large MLLM reasoners, we finetune
a smaller MLLM reasoner using high-quality embedding-centric reasoning traces,
achieving the best performance among open-source models with a 7% absolute gain
over recently proposed models. Third, we investigate strategies for integrating
the reasoner and embedder into a unified model for improved efficiency without
sacrificing performance.
[LINK]
http://arxiv.org/abs/2510.05014v1
[DATE]
2025-10-07 00:53:56+08:00
[CATEGORIES]
cs.LG
Latent Uncertainty Representations for Video-based Driver Action and Intention Recognition
[AUTHORS]
Koen Vellenga, H. Joe Steinhauer, Jonas Andersson, Anders Sjögren
[ABSTRACT]
Deep neural networks (DNNs) are increasingly applied to safety-critical tasks
in resource-constrained environments, such as video-based driver action and
intention recognition. While last layer probabilistic deep learning (LL-PDL)
methods can detect out-of-distribution (OOD) instances, their performance
varies. As an alternative to last layer approaches, we propose extending
pre-trained DNNs with transformation layers to produce multiple latent
representations to estimate the uncertainty. We evaluate our latent uncertainty
representation (LUR) and repulsively trained LUR (RLUR) approaches against
eight PDL methods across four video-based driver action and intention
recognition datasets, comparing classification performance, calibration, and
uncertainty-based OOD detection. We also contribute 28,000 frame-level action
labels and 1,194 video-level intention labels for the NuScenes dataset. Our
results show that LUR and RLUR achieve comparable in-distribution
classification performance to other LL-PDL approaches. For uncertainty-based
OOD detection, LUR matches top-performing PDL methods while being more
efficient to train and easier to tune than approaches that require Markov-Chain
Monte Carlo sampling or repulsive training procedures.
[COMMENTS]
16 pages, 8 figures, 7 tables, under submission
[LINK]
http://arxiv.org/abs/2510.05006v1
[DATE]
2025-10-07 00:50:02+08:00
[CATEGORIES]
cs.LG
In-Context Learning for Pure Exploration
[AUTHORS]
Alessio Russo, Ryan Welch, Aldo Pacchiano
[ABSTRACT]
We study the problem of active sequential hypothesis testing, also known as pure
exploration: given a new task, the learner adaptively collects data from the
environment to efficiently determine an underlying correct hypothesis. A
classical instance of this problem is the task of identifying the best arm in a
multi-armed bandit problem (a.k.a. BAI, Best-Arm Identification), where actions
index hypotheses. Another important case is generalized search, a problem of
determining the correct label through a sequence of strategically selected
queries that indirectly reveal information about the label. In this work, we
introduce In-Context Pure Exploration (ICPE), which meta-trains Transformers to
map observation histories to query actions and a predicted hypothesis, yielding
a model that transfers in-context. At inference time, ICPE actively gathers
evidence on new tasks and infers the true hypothesis without parameter updates.
Across deterministic, stochastic, and structured benchmarks, including BAI and
generalized search, ICPE is competitive with adaptive baselines while requiring
no explicit modeling of information structure. Our results support Transformers
as practical architectures for general sequential testing.
[LINK]
http://arxiv.org/abs/2506.01876v2
[DATE]
2025-10-07 00:44:47+08:00
[CATEGORIES]
cs.LG
Cooperative Decentralized Backdoor Attacks on Vertical Federated Learning
[AUTHORS]
Seohyun Lee, Wenzhi Fang, Anindya Bijoy Das, Seyyedali Hosseinalipour, David J. Love, Christopher G. Brinton
[ABSTRACT]
Federated learning (FL) is vulnerable to backdoor attacks, where adversaries
alter model behavior on target classification labels by embedding triggers into
data samples. While these attacks have received considerable attention in
horizontal FL, they are less understood for vertical FL (VFL), where devices
hold different features of the samples, and only the server holds the labels.
In this work, we propose a novel backdoor attack on VFL which (i) does not rely
on gradient information from the server and (ii) considers potential collusion
among multiple adversaries for sample selection and trigger embedding. Our
label inference model augments variational autoencoders with metric learning,
which adversaries can train locally. A consensus process over the adversary
graph topology determines which datapoints to poison. We further propose
methods for trigger splitting across the adversaries, with an intensity-based
implantation scheme skewing the server towards the trigger. Our convergence
analysis reveals the impact of backdoor perturbations on VFL indicated by a
stationarity gap for the trained model, which we verify empirically as well. We
conduct experiments comparing our attack with recent backdoor VFL approaches,
finding that ours obtains significantly higher success rates for the same main
task performance despite not using server information. Additionally, our
results verify the impact of collusion on attack performance.
[COMMENTS]
This paper is currently under review in the IEEE/ACM Transactions on
Networking Special Issue on AI and Networking
[LINK]
http://arxiv.org/abs/2501.09320v2
[DATE]
2025-10-07 00:41:32+08:00
[CATEGORIES]
cs.LG
QDFlow: A Python package for physics simulations of quantum dot devices
[AUTHORS]
Donovan L. Buterakos, Sandesh S. Kalantre, Joshua Ziegler, Jacob M Taylor, Justyna P. Zwolak
[ABSTRACT]
Recent advances in machine learning (ML) have accelerated progress in
calibrating and operating quantum dot (QD) devices. However, most ML approaches
rely on access to large, representative datasets designed to capture the full
spectrum of data quality encountered in practice, with both high- and
low-quality data for training, benchmarking, and validation, with labels
capturing key features of the device state. Collating such datasets
experimentally is challenging due to limited data availability, slow
measurement bandwidths, and the labor-intensive nature of labeling. QDFlow is
an open-source physics simulator for multi-QD arrays that generates realistic
synthetic data with ground-truth labels. QDFlow combines a self-consistent
Thomas-Fermi solver, a dynamic capacitance model, and flexible noise modules to
simulate charge stability diagrams and ray-based data closely resembling
experiments. With an extensive set of parameters that can be varied and
customizable noise models, QDFlow supports the creation of large, diverse
datasets for ML development, benchmarking, and quantum device research.
[COMMENTS]
17 pages, 5 figures
[LINK]
http://arxiv.org/abs/2509.13298v2
[DATE]
2025-10-07 00:40:26+08:00
[CATEGORIES]
cs.LG
Power Transform Revisited: Numerically Stable, and Federated
[AUTHORS]
Xuefeng Xu, Graham Cormode
[ABSTRACT]
Power transforms are popular parametric techniques for making data more
Gaussian-like, and are widely used as preprocessing steps in statistical
analysis and machine learning. However, we find that direct implementations of
power transforms suffer from severe numerical instabilities, which can lead to
incorrect results or even crashes. In this paper, we provide a comprehensive
analysis of the sources of these instabilities and propose effective remedies.
We further extend power transforms to the federated learning setting,
addressing both numerical and distributional challenges that arise in this
context. Experiments on real-world datasets demonstrate that our methods are
both effective and robust, substantially improving stability compared to
existing approaches.
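
As an illustration of the kind of instability described above, here is a minimal sketch of one well-known floating-point pitfall in the Box-Cox transform: cancellation in `(x**lam - 1) / lam` when `lam` is close to zero. The function names are hypothetical, and the `expm1`-based rewrite is a standard numerical trick, not necessarily the remedy proposed in the paper.

```python
import math


def boxcox_naive(x: float, lam: float) -> float:
    """Textbook Box-Cox: (x**lam - 1) / lam, with log(x) at lam == 0."""
    if lam == 0.0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def boxcox_stable(x: float, lam: float) -> float:
    """Same transform, rewritten as expm1(lam * log(x)) / lam.

    expm1 evaluates exp(z) - 1 without first rounding exp(z) to a value
    near 1, so the subtraction no longer cancels for tiny lam.
    (Illustrative assumption; not taken from the paper.)
    """
    if lam == 0.0:
        return math.log(x)
    return math.expm1(lam * math.log(x)) / lam


if __name__ == "__main__":
    # For a tiny lam the transform should be close to log(x).
    print(boxcox_naive(2.0, 1e-20), boxcox_stable(2.0, 1e-20), math.log(2.0))
```

For moderate values of `lam` the two versions agree; they differ only where the naive form loses its significant digits.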
[COMMENTS]
25 pages
[LINK]
http://arxiv.org/abs/2510.04995v1
[DATE]
2025-10-07 00:32:22+08:00
[CATEGORIES]
cs.LG
Data-Driven Performance Guarantees for Classical and Learned Optimizers
[AUTHORS]
Rajiv Sambharya, Bartolomeo Stellato
[ABSTRACT]
We introduce a data-driven approach to analyze the performance of continuous
optimization algorithms using generalization guarantees from statistical
learning theory. We study classical and learned optimizers to solve families of
parametric optimization problems. We build generalization guarantees for
classical optimizers, using a sample convergence bound, and for learned
optimizers, using the Probably Approximately Correct (PAC)-Bayes framework. To
train learned optimizers, we use a gradient-based algorithm to directly
minimize the PAC-Bayes upper bound. Numerical experiments in signal processing,
control, and meta-learning showcase the ability of our framework to provide
strong generalization guarantees for both classical and learned optimizers
given a fixed budget of iterations. For classical optimizers, our bounds are
much tighter than those that worst-case guarantees provide. For learned
optimizers, our bounds outperform the empirical outcomes observed in their
non-learned counterparts.
[LINK]
http://arxiv.org/abs/2404.13831v3
[DATE]
2025-10-07 00:30:05+08:00
[CATEGORIES]
cs.LG
Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning
[AUTHORS]
Vittorio Giammarino, Ruiqi Ni, Ahmed H. Qureshi
[ABSTRACT]
Offline Goal-Conditioned Reinforcement Learning (GCRL) holds great promise
for domains such as autonomous navigation and locomotion, where collecting
interactive data is costly and unsafe. However, it remains challenging in
practice due to the need to learn from datasets with limited coverage of the
state-action space and to generalize across long-horizon tasks. To address
these challenges, we propose a \emph{Physics-informed (Pi)} regularized loss
for value learning that is derived from the Eikonal Partial Differential
Equation (PDE) and induces a geometric inductive bias in the learned value
function. Unlike generic gradient penalties that are primarily used to
stabilize training, our formulation is grounded in continuous-time optimal
control and encourages value functions to align with cost-to-go structures. The
proposed regularizer is broadly compatible with temporal-difference-based value
learning and can be integrated into existing Offline GCRL algorithms. When
combined with Hierarchical Implicit Q-Learning (HIQL), the resulting method,
Eikonal-regularized HIQL (Eik-HIQL), yields significant improvements in both
performance and generalization, with pronounced gains in stitching regimes and
large-scale navigation tasks.
[LINK]
http://arxiv.org/abs/2509.06782v2
[DATE]
2025-10-07 00:26:44+08:00
[CATEGORIES]
cs.LG
Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
[AUTHORS]
Kristi Topollai, Anna Choromanska
[ABSTRACT]
The vast majority of modern deep learning models are trained with
momentum-based first-order optimizers. The momentum term governs the
optimizer’s memory by determining how much each past gradient contributes to
the current convergence direction. Fundamental momentum methods, such as
Nesterov Accelerated Gradient and the Heavy Ball method, as well as more recent
optimizers such as AdamW and Lion, all rely on the momentum coefficient that is
customarily set to $\beta = 0.9$ and kept constant during model training, a
strategy widely used by practitioners, yet suboptimal. In this paper, we
introduce an \textit{adaptive memory} mechanism that replaces constant momentum
with a dynamic momentum coefficient that is adjusted online during
optimization. We derive our method by approximating the objective function
using two planes: one derived from the gradient at the current iterate and the
other obtained from the accumulated memory of the past gradients. To the best
of our knowledge, such a proximal framework was never used for momentum-based
optimization. Our proposed approach is novel, extremely simple to use, and does
not rely on extra assumptions or hyperparameter tuning. We implement adaptive
memory variants of both SGD and AdamW across a wide range of learning tasks,
from simple convex problems to large-scale deep learning scenarios,
demonstrating that our approach can outperform standard SGD and Adam with
hand-tuned momentum coefficients. Finally, our work opens doors for new ways of
inducing adaptivity in optimization.
[LINK]
http://arxiv.org/abs/2510.04988v1
[DATE]
2025-10-07 00:24:57+08:00
[CATEGORIES]
cs.LG
Another look at inference after prediction
[AUTHORS]
Jessica Gronsbell, Jianhui Gao, Yaqi Shi, Zachary R. McCaw, David Cheng
[ABSTRACT]
From structural biology to epidemiology, predictions from machine learning
(ML) models increasingly complement costly gold-standard data to enable faster,
more affordable, and scalable scientific inquiry. In response, prediction-based
(PB) inference has emerged to accommodate statistical analysis using a large
volume of predictions together with a small amount of gold-standard data. The
goals of PB inference are two-fold: (i) to mitigate bias from errors in
predictions and (ii) to improve efficiency relative to classical inference
using only the gold-standard data. While early PB inference methods focused on
bias, their ability to enhance efficiency remains a focus of ongoing research.
We revisit a foundational PB inference method and show that a simple
modification can be applied to guarantee provable improvements in efficiency.
In doing so, we establish new connections between augmented inverse probability
weighted estimators (AIPW) and several recently proposed PB inference methods
with a similar focus. The utility of our proposal, which leverages
prediction-based outcomes to enhance efficiency, is demonstrated through
extensive simulation studies and an application to real data from the UK
Biobank. Further, we contextualize PB inference by drawing connections to
historical literature from economics and statistics, highlighting how classic
methods directly inform this contemporary problem.
[LINK]
http://arxiv.org/abs/2411.19908v5
[DATE]
2025-10-07 00:21:56+08:00
[CATEGORIES]
cs.LG
Federated Computation of ROC and PR Curves
[AUTHORS]
Xuefeng Xu, Graham Cormode
[ABSTRACT]
Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves are
fundamental tools for evaluating machine learning classifiers, offering
detailed insights into the trade-offs between true positive rate vs. false
positive rate (ROC) or precision vs. recall (PR). However, in Federated
Learning (FL) scenarios, where data is distributed across multiple clients,
computing these curves is challenging due to privacy and communication
constraints. Specifically, the server cannot access raw prediction scores and
class labels, which are used to compute the ROC and PR curves in a centralized
setting. In this paper, we propose a novel method for approximating ROC and PR
curves in a federated setting by estimating quantiles of the prediction score
distribution under distributed differential privacy. We provide theoretical
bounds on the Area Error (AE) between the true and estimated curves,
demonstrating the trade-offs between approximation accuracy, privacy, and
communication cost. Empirical results on real-world datasets demonstrate that
our method achieves high approximation accuracy with minimal communication and
strong privacy guarantees, making it practical for privacy-preserving model
evaluation in federated systems.
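
The quantile idea above can be sketched as follows, under strong simplifying assumptions: the server is assumed to already hold evenly spaced quantile estimates of the positive-class and negative-class score distributions, and the federated aggregation and differential-privacy noise are left out entirely. The helper names `cdf_from_quantiles` and `roc_points` are hypothetical and are not the paper's API.

```python
from bisect import bisect_left


def cdf_from_quantiles(qs, t):
    """Approximate P(score < t) from quantiles at levels 0, 1/q, ..., 1.

    qs must be sorted in non-decreasing order; the CDF is linearly
    interpolated between neighbouring quantiles.
    """
    q = len(qs) - 1
    if t <= qs[0]:
        return 0.0
    if t > qs[-1]:
        return 1.0
    k = bisect_left(qs, t)          # guarantees qs[k - 1] < t <= qs[k]
    lo, hi = qs[k - 1], qs[k]
    frac = (t - lo) / (hi - lo)
    return (k - 1 + frac) / q


def roc_points(pos_q, neg_q, thresholds):
    """Return (FPR, TPR) pairs, predicting positive when score >= threshold."""
    return [
        (1.0 - cdf_from_quantiles(neg_q, t), 1.0 - cdf_from_quantiles(pos_q, t))
        for t in thresholds
    ]


if __name__ == "__main__":
    pos_q = [0.5, 0.625, 0.75, 0.875, 1.0]   # quantiles of positive scores
    neg_q = [0.0, 0.125, 0.25, 0.375, 0.5]   # quantiles of negative scores
    for fpr, tpr in roc_points(pos_q, neg_q, [0.0, 0.25, 0.5, 0.75, 1.0]):
        print(fpr, tpr)
```

Only the quantile summaries are needed to trace the curve, which is why raw scores and labels never have to leave the clients in this simplified picture.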
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2510.04979v1
[DATE]
2025-10-07 00:16:46+08:00
[CATEGORIES]
cs.LG
Multi-Turn Human-LLM Interaction Through the Lens of a Two-Way Intelligibility Protocol
[AUTHORS]
Harshvardhan Mestha, Karan Bania, Shreyas V Sathyanarayana, Sidong Liu, Ashwin Srinivasan
[ABSTRACT]
Our interest is in the design of software systems involving a human-expert
interacting – using natural language – with a large language model (LLM) on
data analysis tasks. For complex problems, it is possible that LLMs can harness
human expertise and creativity to find solutions that were otherwise elusive.
On one level, this interaction takes place through multiple turns of prompts
from the human and responses from the LLM. Here we investigate a more
structured approach based on an abstract protocol described in [3] for
interaction between agents. The protocol is motivated by a notion of “two-way
intelligibility” and is modelled by a pair of communicating finite-state
machines. We provide an implementation of the protocol, and provide empirical
evidence of using the implementation to mediate interactions between an LLM and
a human-agent in two areas of scientific interest (radiology and drug design).
We conduct controlled experiments with a human proxy (a database), and
uncontrolled experiments with human subjects. The results provide evidence in
support of the protocol’s capability of capturing one- and two-way
intelligibility in human-LLM interaction; and for the utility of two-way
intelligibility in the design of human-machine systems. Our code is available
at https://github.com/karannb/interact.
[COMMENTS]
Multi-Turn Interactions in Large Language Models (MTI-LLM) Workshop
at NeurIPS 2025
[LINK]
http://arxiv.org/abs/2410.20600v3
[DATE]
2025-10-07 00:15:07+08:00
[CATEGORIES]
cs.LG
Pivotal CLTs for Pseudolikelihood via Conditional Centering in Dependent Random Fields
[AUTHORS]
Nabarun Deb
[ABSTRACT]
In this paper, we study fluctuations of conditionally centered statistics of
the form $N^{-1/2}\sum_{i=1}^N
c_i(g(\sigma_i)-\mathbb{E}_N[g(\sigma_i)|\sigma_j,j\neq i])$ where
$(\sigma_1,\ldots ,\sigma_N)$ are sampled from a dependent random field, and
$g$ is some bounded function. Our first main result shows that under weak
smoothness assumptions on the conditional means (which cover both sparse and
dense interactions), the above statistic converges to a Gaussian \emph{scale
mixture} with a random scale determined by a \emph{quadratic variance} and an
\emph{interaction component}. We also show that under appropriate
studentization, the limit becomes a pivotal Gaussian. We leverage this theory
to develop a general asymptotic framework for maximum pseudolikelihood (MPLE)
inference in dependent random fields. We apply our results to Ising models with
pairwise as well as higher-order interactions and exponential random graph
models (ERGMs). In particular, we obtain a joint central limit theorem for the
inverse temperature and magnetization parameters via the joint MPLE (to our
knowledge, the first such result in dense, irregular regimes), and we derive
conditionally centered edge CLTs and marginal MPLE CLTs for ERGMs without
restricting to the “sub-critical” region. Our proof is based on a method of
moments approach via combinatorial decision-tree pruning, which may be of
independent interest.
[COMMENTS]
73 pages, 1 figure
[LINK]
http://arxiv.org/abs/2510.04972v1
[DATE]
2025-10-07 00:06:45+08:00
[CATEGORIES]
cs.LG
Embracing Discrete Search: A Reasonable Approach to Causal Structure Learning
[AUTHORS]
Marcel Wienöbst, Leonard Henckel, Sebastian Weichwald
[ABSTRACT]
We present FLOP (Fast Learning of Order and Parents), a score-based causal
discovery algorithm for linear models. It pairs fast parent selection with
iterative Cholesky-based score updates, cutting run-times over prior
algorithms. This makes it feasible to fully embrace discrete search, enabling
iterated local search with principled order initialization to find graphs with
scores at or close to the global optimum. The resulting structures are highly
accurate across benchmarks, with near-perfect recovery in standard settings.
This performance calls for revisiting discrete search over graphs as a
reasonable approach to causal discovery.
[LINK]
http://arxiv.org/abs/2510.04970v1
[DATE]
2025-10-07 00:04:53+08:00
[CATEGORIES]
cs.LG
Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
[AUTHORS]
Om Dobariya, Akhil Kumar
[ABSTRACT]
The wording of natural language prompts has been shown to influence the
performance of large language models (LLMs), yet the role of politeness and
tone remains underexplored. In this study, we investigate how varying levels of
prompt politeness affect model accuracy on multiple-choice questions. We
created a dataset of 50 base questions spanning mathematics, science, and
history, each rewritten into five tone variants: Very Polite, Polite, Neutral,
Rude, and Very Rude, yielding 250 unique prompts. Using ChatGPT 4o, we
evaluated responses across these conditions and applied paired sample t-tests
to assess statistical significance. Contrary to expectations, impolite prompts
consistently outperformed polite ones, with accuracy ranging from 80.8% for
Very Polite prompts to 84.8% for Very Rude prompts. These findings differ from
earlier studies that associated rudeness with poorer outcomes, suggesting that
newer LLMs may respond differently to tonal variation. Our results highlight
the importance of studying pragmatic aspects of prompting and raise broader
questions about the social dimensions of human-AI interaction.
[COMMENTS]
5 pages, 3 tables; includes Limitations and Ethical Considerations
sections; short paper under submission to Findings of ACL 2025
[LINK]
http://arxiv.org/abs/2510.04950v1
[DATE]
2025-10-06 23:50:39+08:00
[CATEGORIES]
cs.CL
cs.LG
A First Context-Free Grammar Applied to Nawatl Corpora Augmentation
[AUTHORS]
Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Martha-Lorena Avendaño-Garrido, Graham Ranger
[ABSTRACT]
In this article we introduce a context-free grammar (CFG) for the Nawatl
language. Nawatl (or Nahuatl) is an Amerindian language of the $\pi$-language
type, i.e. a language with few digital resources, in which the corpora
available for machine learning are virtually non-existent. The objective here
is to generate a significant number of grammatically correct artificial
sentences, in order to increase the corpora available for language model
training. We want to show that a grammar enables us to significantly expand a
corpus in Nawatl which we call $\pi$-\textsc{yalli}. The corpus, thus enriched,
enables us to train algorithms such as FastText and to evaluate them on
sentence-level semantic tasks. Preliminary results show that by using the
grammar, comparative improvements are achieved over some LLMs. However, it is
observed that to achieve more significant improvement, grammars that model the
Nawatl language even more effectively are required.
[COMMENTS]
11 pages, 7 tables, 1 figure
[LINK]
http://arxiv.org/abs/2510.04945v1
[DATE]
2025-10-06 23:46:54+08:00
[CATEGORIES]
cs.CL
On Structured State-Space Duality
[AUTHORS]
Jerry Yao-Chieh Hu, Xiwen Zhang, Weimin Wu, Han Liu
[ABSTRACT]
Structured State-Space Duality (SSD) [Dao & Gu, ICML 2024] is an equivalence
between a simple Structured State-Space Model (SSM) and a masked attention
mechanism. In particular, a state-space model with a scalar-times-identity
state matrix is equivalent to a masked self-attention with a $1$-semiseparable
causal mask. Consequently, the same sequence transformation (model) has two
algorithmic realizations: as a linear-time $O(T)$ recurrence or as a
quadratic-time $O(T^2)$ attention. In this note, we formalize and generalize
this duality: (i) we extend SSD from the scalar-identity case to general
diagonal SSMs (diagonal state matrices); (ii) we show that these diagonal SSMs
match the scalar case’s training complexity lower bounds while supporting
richer dynamics; (iii) we establish a necessary and sufficient condition under
which an SSM is equivalent to $1$-semiseparable masked attention; and (iv) we
show that such duality fails to extend to standard softmax attention due to
rank explosion. Together, these results tighten the bridge between recurrent SSMs
and Transformers, and widen the design space for expressive yet efficient
sequence models.
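
The two realizations mentioned above can be checked numerically on a toy example. The sketch below assumes the usual scalar-times-identity recurrence $h_t = a_t h_{t-1} + B_t x_t$, $y_t = C_t^\top h_t$ with a single input channel; the function names are hypothetical. The attention form uses the causal mask $L_{ts} = a_{s+1}\cdots a_t$ for $s \le t$.

```python
def ssm_recurrence(a, B, C, x):
    """Linear-time form: h_t = a_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>."""
    h = [0.0] * len(B[0])
    y = []
    for a_t, B_t, C_t, x_t in zip(a, B, C, x):
        h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, B_t)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(C_t, h)))
    return y


def masked_attention(a, B, C, x):
    """Quadratic-time form: y_t = sum_{s<=t} L[t][s] * <C_t, B_s> * x_s."""
    y = []
    for t in range(len(x)):
        total = 0.0
        decay = 1.0  # running product a[s+1] * ... * a[t]; empty when s == t
        for s in range(t, -1, -1):
            score = sum(c * b for c, b in zip(C[t], B[s]))
            total += decay * score * x[s]
            decay *= a[s]
        y.append(total)
    return y


if __name__ == "__main__":
    a = [0.5, 0.9, 0.8]
    B = [[1.0, 0.0], [0.5, 2.0], [1.0, 1.0]]
    C = [[1.0, 1.0], [0.0, 1.0], [2.0, -1.0]]
    x = [1.0, 2.0, 3.0]
    print(ssm_recurrence(a, B, C, x))
    print(masked_attention(a, B, C, x))
```

Both functions compute the same outputs up to floating-point rounding; the first touches each time step once, while the second visits every pair $(t, s)$ with $s \le t$.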
[LINK]
http://arxiv.org/abs/2510.04944v1
[DATE]
2025-10-06 23:46:50+08:00
[CATEGORIES]
cs.LG
cs.CL
ONNX-Net: Towards Universal Representations and Instant Performance Prediction for Neural Architectures
[AUTHORS]
Shiwen Qin, Alexander Auras, Shay B. Cohen, Elliot J. Crowley, Michael Moeller, Linus Ericsson, Jovita Lukasik
[COMMENTS]
Our code is available at: https://github.com/shiwenqin/ONNX-Net
[LINK]
http://arxiv.org/abs/2510.04938v1
[DATE]
2025-10-06 23:43:36+08:00
[CATEGORIES]
cs.LG
cs.CL
MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning
[AUTHORS]
Guoxin Chen, Zile Qiao, Wenqing Wang, Donglei Yu, Xuanzhong Chen, Hao Sun, Minpeng Liao, Kai Fan, Yong Jiang, Penguin Xie, Wayne Xin Zhao, Ruihua Song, Fei Huang
[ABSTRACT]
Large Reasoning Models (LRMs) often exhibit a tendency for overanalysis in
simple tasks, where the models excessively utilize System 2-type, deliberate
reasoning, leading to inefficient token generation. Furthermore, these models
face challenges in adapting their reasoning capabilities to rapidly changing
environments due to the static nature of their pretraining data. To address
these issues, advancing Large Language Models (LLMs) for complex reasoning
tasks requires innovative approaches that bridge intuitive and deliberate
cognitive processes, akin to human cognition’s dual-system dynamic. This paper
introduces a Multi-Agent System for Deep ReSearch (MARS) enabling seamless
integration of System 1’s fast, intuitive thinking with System 2’s deliberate
reasoning within LLMs. MARS strategically integrates multiple external tools,
such as Google Search, Google Scholar, and Python Interpreter, to access
up-to-date information and execute complex computations, while creating a
specialized division of labor where System 1 efficiently processes and
summarizes high-volume external information, providing distilled insights that
expand System 2’s reasoning context without overwhelming its capacity.
Furthermore, we propose a multi-agent reinforcement learning framework
extending Group Relative Policy Optimization to simultaneously optimize both
systems with multi-turn tool interactions, bin-packing optimization, and sample
balancing strategies that enhance collaborative efficiency. Extensive
experiments demonstrate MARS achieves substantial improvements of 3.86% on the
challenging Humanity’s Last Exam (HLE) benchmark and an average gain of 8.9%
across 7 knowledge-intensive tasks, validating the effectiveness of our
dual-system paradigm for complex reasoning in dynamic information environments.
[COMMENTS]
Ongoing Work
[LINK]
http://arxiv.org/abs/2510.04935v1
[DATE]
2025-10-06 23:42:55+08:00
[CATEGORIES]
cs.CL
cs.LG
The Geometry of Truth: Layer-wise Semantic Dynamics for Hallucination Detection in Large Language Models
[AUTHORS]
Amir Hameed Mir
[ABSTRACT]
Large Language Models (LLMs) often produce fluent yet factually incorrect
statements, a phenomenon known as hallucination, posing serious risks in
high-stakes domains. We present Layer-wise Semantic Dynamics (LSD), a geometric
framework for hallucination detection that analyzes the evolution of
hidden-state semantics across transformer layers. Unlike prior methods that
rely on multiple sampling passes or external verification sources, LSD operates
intrinsically within the model’s representational space. Using margin-based
contrastive learning, LSD aligns hidden activations with ground-truth
embeddings derived from a factual encoder, revealing a distinct separation in
semantic trajectories: factual responses preserve stable alignment, while
hallucinations exhibit pronounced semantic drift across depth. Evaluated on the
TruthfulQA and synthetic factual-hallucination datasets, LSD achieves an
F1-score of 0.92, AUROC of 0.96, and clustering accuracy of 0.89, outperforming
SelfCheckGPT and Semantic Entropy baselines while requiring only a single
forward pass. This efficiency yields a 5-20x speedup over sampling-based
methods without sacrificing precision or interpretability. LSD offers a
scalable, model-agnostic mechanism for real-time hallucination monitoring and
provides new insights into the geometry of factual consistency within large
language models.
[COMMENTS]
14 pages, 14 figures, 5 tables. Code available at:
https://github.com/sirraya-tech/Sirraya_LSD_Code
[LINK]
http://arxiv.org/abs/2510.04933v1
[DATE]
2025-10-06 23:41:22+08:00
[CATEGORIES]
cs.CL
cs.LG
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
[AUTHORS]
Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
[ABSTRACT]
The rapid extension of context windows in large vision-language models has
given rise to long-context vision-language models (LCVLMs), which are capable
of handling hundreds of images with interleaved text tokens in a single forward
pass. In this work, we introduce MMLongBench, the first benchmark covering a
diverse set of long-context vision-language tasks, to evaluate LCVLMs
effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning
five different categories of downstream tasks, such as Visual RAG and Many-Shot
ICL. It also provides broad coverage of image types, including various natural
and synthetic images. To assess the robustness of the models to different input
lengths, all examples are delivered at five standardized input lengths (8K-128K
tokens) via a cross-modal tokenization scheme that combines vision patches and
text tokens. Through a thorough benchmarking of 46 closed-source and
open-source LCVLMs, we provide a comprehensive analysis of the current models’
vision-language long-context ability. Our results show that: i) performance on
a single task is a weak proxy for overall long-context capability; ii) both
closed-source and open-source models face challenges in long-context
vision-language tasks, indicating substantial room for future improvement; iii)
models with stronger reasoning ability tend to exhibit better long-context
performance. By offering wide task coverage, various image types, and rigorous
length control, MMLongBench provides the missing foundation for diagnosing and
advancing the next generation of LCVLMs.
[COMMENTS]
Accepted as a spotlight at NeurIPS 2025
[LINK]
http://arxiv.org/abs/2505.10610v3
[DATE]
2025-10-06 23:41:20+08:00
[CATEGORIES]
cs.CL
Do LLMs Align with My Task? Evaluating Text-to-SQL via Dataset Alignment
[AUTHORS]
Davood Rafiei, Morgan Lindsay Heisler, Weiwei Zhang, Mohammadreza Pourreza, Yong Zhang
[ABSTRACT]
Supervised Fine-Tuning (SFT) is an effective method for adapting Large
Language Models (LLMs) on downstream tasks. However, variability in training
data can hinder a model’s ability to generalize across domains. This paper
studies the problem of dataset alignment for Natural Language to SQL (NL2SQL or
text to SQL), examining how well SFT training data matches the structural
characteristics of target queries and how this alignment impacts model
performance. We hypothesize that alignment can be accurately estimated by
comparing the distributions of structural SQL features across the training set,
target data, and the model’s predictions prior to SFT. Through comprehensive
experiments on three large cross-domain NL2SQL benchmarks and multiple model
families, we show that structural alignment is a strong predictor of
fine-tuning success. When alignment is high, SFT yields substantial gains in
accuracy and SQL generation quality; when alignment is low, improvements are
marginal or absent. These findings highlight the importance of alignment-aware
data selection for effective fine-tuning and generalization in NL2SQL tasks.
[LINK]
http://arxiv.org/abs/2510.04919v1
[DATE]
2025-10-06 23:33:35+08:00
[CATEGORIES]
cs.CL
Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests
[AUTHORS]
Yanbin Fu, Hong Jiao, Tianyi Zhou, Robert W. Lissitz, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters
[ABSTRACT]
Aligning test items to content standards is a critical step in test
development to collect validity evidence based on content. Item alignment has
typically been conducted by human experts. This judgmental process can be
subjective and time-consuming. This study investigated the performance of
fine-tuned small language models (SLMs) for automated item alignment using data
from a large-scale standardized reading and writing test for college
admissions. Different SLMs were trained for alignment at the domain and skill
levels, respectively, with 10 skills mapped to 4 content domains. Model
performance was evaluated on multiple criteria using two testing datasets. The
impact of types and sizes of the input data for training was investigated.
Results showed that including more item text data led to substantially better
model performance, surpassing the improvements induced by sample size increase
alone. For comparison, supervised machine learning models were trained using
the embeddings from the multilingual-E5-large-instruct model. The study results
showed that fine-tuned SLMs consistently outperformed the embedding-based
supervised machine learning models, particularly for the more fine-grained
skill alignment. To better understand model misclassifications, multiple
semantic similarity analyses, including pairwise cosine similarity,
Kullback-Leibler divergence of embedding distributions, and two-dimensional
projections of item embeddings, were conducted. These analyses consistently
showed that certain skills in SAT and PSAT were semantically too close,
providing evidence for the observed misclassification.
[COMMENTS]
need updates
[LINK]
http://arxiv.org/abs/2509.26431v2
[DATE]
2025-10-06 23:32:15+08:00
[CATEGORIES]
cs.CL
Retrieval-Augmented Code Generation: A Survey with Focus on Repository-Level Approaches
[AUTHORS]
Yicheng Tao, Yao Qin, Yepang Liu
[ABSTRACT]
Recent advancements in large language models (LLMs) have substantially
improved automated code generation. While function-level and file-level
generation have achieved promising results, real-world software development
typically requires reasoning across entire repositories. This gives rise to the
challenging task of Repository-Level Code Generation (RLCG), where models must
capture long-range dependencies, ensure global semantic consistency, and
generate coherent code spanning multiple files or modules. To address these
challenges, Retrieval-Augmented Generation (RAG) has emerged as a powerful
paradigm that integrates external retrieval mechanisms with LLMs, enhancing
context-awareness and scalability. In this survey, we provide a comprehensive
review of research on Retrieval-Augmented Code Generation (RACG), with an
emphasis on repository-level approaches. We categorize existing work along
several dimensions, including generation strategies, retrieval modalities,
model architectures, training paradigms, and evaluation protocols. Furthermore,
we summarize widely used datasets and benchmarks, analyze current limitations,
and outline key challenges and opportunities for future research. Our goal is
to establish a unified analytical framework for understanding this rapidly
evolving field and to inspire continued progress in AI-powered software
engineering.
[LINK]
http://arxiv.org/abs/2510.04905v1
[DATE]
2025-10-06 23:20:03+08:00
[CATEGORIES]
cs.CL
H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
[AUTHORS]
Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Yichang Xu, Zachary Yahn, Ling Liu
[ABSTRACT]
Alignment of pretrained LLMs using instruction-based datasets is critical for
creating fine-tuned models that reflect human preference. A growing number of
alignment-based fine-tuning algorithms and benchmarks have emerged recently, fueling
the efforts on effective alignments of pre-trained LLMs to ensure helpful,
harmless, and honest answers from both open-source and closed-source LLMs. This
paper tackles this problem by developing an alignment fusion approach, coined
as $H^3$Fusion, with three unique characteristics. First, $H^3$Fusion ensembles
multiple individually aligned LLMs to create a final fine-tuned alignment model
with enhanced capabilities beyond those of individual models, delivering robust
alignment through promoting helpful, harmless, honest fusion. Second,
$H^3$Fusion leverages the mixture-of-experts (MoE) methodology in two steps. We
first freeze the multi-head attention weights of each individual model while
tuning the FFN layer during alignment fusion. Then we merge the aligned model
weights with an expert router according to the type of input instruction and
dynamically select a subset of experts that are best suited for producing the
output response. Finally, we boost the performance of the resulting
$H^3$Fusion model by introducing gating loss and regularization terms. The
former penalizes the selection errors of the expert-router, and the latter
mediates the expert weights drifting during fine-tuning and dynamically adjusts
the fusion behavior of the resulting model by canalizing the activations on the
experts. Extensive evaluations on three benchmark datasets show that
$H^3$Fusion is more helpful, less harmful, and more honest from two aspects:
it outperforms each individually aligned model by $11.37\%$, and it provides
stronger robustness compared to the state-of-the-art LLM ensemble approaches by
$13.77\%$. Code is available at github.com/sftekin/h3fusion.
[LINK]
http://arxiv.org/abs/2411.17792v2
[DATE]
2025-10-06 23:19:49+08:00
[CATEGORIES]
cs.CL
cs.LG
Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
[AUTHORS]
Hanwen Du, Yuxin Dong, Xia Ning
[ABSTRACT]
Large Language Models (LLMs) excel at problem solving by generating chains of
thought in natural language, but such verbal thinking is computationally
costly and prone to overthinking. Recent work instead proposes a latent
thinking architecture Huginn-3.5B, which represents intermediate reasoning
steps as a sequence of latent representations. However, latent thoughts lack
interpretability and are difficult to supervise, raising concerns about the
correctness and reliability of its latent thinking processes. In this paper, we
provide a systematic study of how Huginn-3.5B thinks in the latent space and
how external supervision signals can improve its latent thinking processes. We
show that latent thoughts leading to correct versus incorrect answers exhibit
highly distinguishable patterns, and that a latent classifier can reliably
predict answer correctness directly from latent thoughts. Leveraging these
insights, we propose Latent Thinking Optimization (LTO), a probabilistic
algorithm that employs the latent classifier as a Latent Reward Model (LRM) to
optimize the latent thinking processes. Extensive experiments across diverse
reasoning tasks demonstrate that LRM is highly effective in detecting incorrect
latent thinking patterns, and LTO can significantly improve the latent thinking
processes. Furthermore, we show that LRM can generalize across diverse domains,
and LTO can be seamlessly applied to general LLMs to improve their thinking
processes. In contrast to verbal thinking, our method demonstrates that reward
modeling and scaling test-time thinking with supervision can be performed
directly in the latent space, highlighting its potential as a general,
efficient, and domain-agnostic approach to improving the thinking processes of
LLMs.
[LINK]
http://arxiv.org/abs/2509.26314v2
[DATE]
2025-10-06 23:15:21+08:00
[CATEGORIES]
cs.CL
SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
[AUTHORS]
Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin
[ABSTRACT]
Large language models (LLMs) are increasingly deployed in contexts where
their failures can have direct sociopolitical consequences. Yet, existing
safety benchmarks rarely test vulnerabilities in domains such as political
manipulation, propaganda and disinformation generation, or surveillance and
information control. We introduce SocialHarmBench, a dataset of 585 prompts
spanning 7 sociopolitical categories and 34 countries, designed to surface
where LLMs most acutely fail in politically charged contexts. Our evaluations
reveal several shortcomings: open-weight models exhibit high vulnerability to
harmful compliance, with Mistral-7B reaching attack success rates as high as
97% to 98% in domains such as historical revisionism, propaganda, and political
manipulation. Moreover, temporal and geographic analyses show that LLMs are
most fragile when confronted with 21st-century or pre-20th-century contexts,
and when responding to prompts tied to regions such as Latin America, the USA,
and the UK. These findings demonstrate that current safeguards fail to
generalize to high-stakes sociopolitical settings, exposing systematic biases
and raising concerns about the reliability of LLMs in preserving human rights
and democratic values. We share the SocialHarmBench benchmark at
https://huggingface.co/datasets/psyonp/SocialHarmBench.
[LINK]
http://arxiv.org/abs/2510.04891v1
[DATE]
2025-10-06 23:11:46+08:00
[CATEGORIES]
cs.CL
cs.LG
ML2B: Multi-Lingual ML Benchmark For AutoML
[AUTHORS]
Ekaterina Trofimova, Zosia Shamina, Maria Selifanova, Artem Zaitsev, Remi Savchuk, Maxim Minets, Daria Ozerova, Emil Sataev, Denis Zuenko, Andrey E. Ustyuzhanin
[ABSTRACT]
Large language models (LLMs) have recently demonstrated strong capabilities
in generating machine learning (ML) code, enabling end-to-end pipeline
construction from natural language instructions. However, existing benchmarks
for ML code generation are mainly restricted to English, overlooking the global
and multilingual nature of ML research and practice. To address this gap, we
present ML2B, the first benchmark for evaluating multilingual ML code
generation. ML2B consists of 30 Kaggle competitions translated into 13 natural
languages, covering tabular, text, and image data types, with structured
metadata and validated human-reviewed translations. For evaluation, we employ
AIDE, an automated framework for end-to-end assessment of data science
pipelines, and provide insights into cross-lingual model performance. Our
results reveal substantial 15-45% performance degradation on non-English tasks,
highlighting critical challenges in multilingual representation learning for
code generation. The benchmark, evaluation framework, and comprehensive results
are made available through our GitHub repository to facilitate future research
in multilingual ML code generation: https://github.com/enaix/ml2b.
[LINK]
http://arxiv.org/abs/2509.22768v2
[DATE]
2025-10-06 22:53:27+08:00
[CATEGORIES]
cs.CL
Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models
[AUTHORS]
Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen
[ABSTRACT]
Diffusion large language models (dLLMs) generate text through iterative
denoising, yet current decoding strategies discard rich intermediate
predictions in favor of the final output. Our work here reveals a critical
phenomenon, temporal oscillation, where correct answers often emerge at
intermediate denoising steps but are overwritten in later ones. To address this
issue, we introduce two complementary methods that exploit temporal
consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time
decoding strategy that aggregates predictions across denoising steps to select
the most consistent output; and 2) a post-training method termed Temporal
Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a
measure of semantic stability across intermediate predictions, as a reward
signal to encourage stable generations. Empirical results across multiple
benchmarks demonstrate the effectiveness of our approach. Using the negative
TSE reward alone, we observe a remarkable average improvement of 24.7% on the
Countdown dataset over an existing dLLM. Combined with the accuracy reward, we
achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and
25.3% on Countdown, respectively. Our findings underscore the untapped
potential of temporal dynamics in dLLMs and offer two simple yet effective
tools to harness them.
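As a rough illustration of the voting idea described above, the sketch below aggregates the answer decoded at each denoising step and returns the one with the most support. The function name, the optional per-step weights, and the toy answers are assumptions made for illustration; the abstract does not specify how the paper weights or compares intermediate predictions.

```python
from collections import defaultdict


def temporal_vote(step_answers, weights=None):
    """Return the answer with the largest total weight across denoising steps.

    step_answers: the answer decoded at each step (hypothetical input format).
    weights: optional per-step weights; None means a plain majority vote.
    """
    if not step_answers:
        raise ValueError("need at least one intermediate prediction")
    if weights is None:
        weights = [1.0] * len(step_answers)
    if len(weights) != len(step_answers):
        raise ValueError("weights and step_answers must have the same length")

    totals = defaultdict(float)
    for answer, weight in zip(step_answers, weights):
        totals[answer] += weight
    # On a tie, max() keeps the answer that appeared first.
    return max(totals, key=totals.get)


# Toy trajectory: the final step says "12", but "15" dominated earlier steps.
answers = ["12", "15", "15", "15", "12"]
voted = temporal_vote(answers)
final_only = temporal_vote(answers, weights=[0, 0, 0, 0, 1])
```

With uniform weights the vote recovers the answer that was later overwritten, while putting all weight on the last step reproduces ordinary final-output decoding.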
[COMMENTS]
Project webpage: https://aim-uofa.github.io/dLLM-MidTruth
[LINK]
http://arxiv.org/abs/2508.09138v3
[DATE]
2025-10-06 22:46:22+08:00
[CATEGORIES]
cs.CL
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
[AUTHORS]
Elisei Rykov, Kseniia Petrushina, Maksim Savkin, Valerii Olisov, Artem Vazhentsev, Kseniia Titova, Alexander Panchenko, Vasily Konovalov, Julia Belikova
[ABSTRACT]
Hallucination detection remains a fundamental challenge for the safe and
reliable deployment of large language models (LLMs), especially in applications
requiring factual accuracy. Existing hallucination benchmarks often operate at
the sequence level and are limited to English, lacking the fine-grained,
multilingual supervision needed for a comprehensive evaluation. In this work,
we introduce PsiloQA, a large-scale, multilingual dataset annotated with
span-level hallucinations across 14 languages. PsiloQA is constructed through
an automated three-stage pipeline: generating question-answer pairs from
Wikipedia using GPT-4o, eliciting potentially hallucinated answers from diverse
LLMs in a no-context setting, and automatically annotating hallucinated spans
using GPT-4o by comparing against golden answers and retrieved context. We
evaluate a wide range of hallucination detection methods – including
uncertainty quantification, LLM-based tagging, and fine-tuned encoder models –
and show that encoder-based models achieve the strongest performance across
languages. Furthermore, PsiloQA demonstrates effective cross-lingual
generalization and supports robust knowledge transfer to other benchmarks, all
while being significantly more cost-efficient than human-annotated datasets.
Our dataset and results advance the development of scalable, fine-grained
hallucination detection in multilingual settings.
[LINK]
http://arxiv.org/abs/2510.04849v1
[DATE]
2025-10-06 22:36:30+08:00
[CATEGORIES]
cs.CL
Instability in Downstream Task Performance During LLM Pretraining
[AUTHORS]
Yuto Nishida, Masaru Isonuma, Yusuke Oda
[COMMENTS]
Accepted to EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2510.04848v1
[DATE]
2025-10-06 22:33:38+08:00
[CATEGORIES]
cs.CL
Identity resolution of software metadata using Large Language Models
[AUTHORS]
Eva Martín del Pico, Josep Lluís Gelpí, Salvador Capella-Gutiérrez
[ABSTRACT]
Software is an essential component of research. However, little attention has
been paid to it compared with that paid to research data. Recently, there has
been an increase in efforts to acknowledge and highlight the importance of
software in research activities. Structured metadata from platforms like
bio.tools, Bioconductor, and Galaxy ToolShed offers valuable insights into
research software in the Life Sciences. Although originally intended to support
discovery and integration, this metadata can be repurposed for large-scale
analysis of software practices. However, its quality and completeness vary
across platforms, reflecting diverse documentation practices. To gain a
comprehensive view of software development and sustainability, consolidating
this metadata is necessary, but requires robust mechanisms to address its
heterogeneity and scale.
This article presents an evaluation of instruction-tuned large language
models for the task of software metadata identity resolution, a critical step
in assembling a cohesive collection of research software. Such a collection is
the reference component for the Software Observatory at OpenEBench, a platform
that aggregates metadata to monitor the FAIRness of research software in the
Life Sciences. We benchmarked multiple models against a human-annotated gold
standard, examined their behavior on ambiguous cases, and introduced an
agreement-based proxy for high-confidence automated decisions. The proxy
achieved high precision and statistical robustness, while also highlighting the
limitations of current models and the broader challenges of automating semantic
judgment in FAIR-aligned software metadata across registries and repositories.
[LINK]
http://arxiv.org/abs/2505.23500v2
[DATE]
2025-10-06 22:17:23+08:00
[CATEGORIES]
cs.CL
Visual Representations inside the Language Model
[AUTHORS]
Benlin Liu, Amita Kamath, Madeleine Grunde-McLaughlin, Winson Han, Ranjay Krishna
[ABSTRACT]
Despite interpretability work analyzing ViT encoders and transformer
activations, we don’t yet understand why Multimodal Language Models (MLMs)
struggle on perception-heavy tasks. We offer an under-studied perspective by
examining how popular MLMs (LLaVA-OneVision, Qwen2.5-VL, and
Llama-3-LLaVA-NeXT) process their visual key-value tokens. We first study the
flow of visual information through the language model, finding that image value
tokens encode sufficient information to perform several perception-heavy tasks
zero-shot: segmentation, semantic correspondence, temporal correspondence, and
referring expression detection. We find that while the language model does
augment the visual information received from the projection of input visual
encodings (which we reveal correlates with overall MLM perception capability), it
contains less visual information on several tasks than the equivalent visual
encoder (SigLIP) that has not undergone MLM finetuning. Further, we find that
the visual information corresponding to input-agnostic image key tokens in
later layers of language models contains artifacts which reduce perception
capability of the overall MLM. Next, we discuss controlling visual information
in the language model, showing that adding a text prefix to the image input
improves perception capabilities of visual representations. Finally, we reveal
that if language models were able to better control their visual information,
their perception would significantly improve; e.g., in 33.3% of Art Style
questions in the BLINK benchmark, perception information present in the
language model is not surfaced to the output! Our findings reveal insights into
the role of key-value tokens in multimodal systems, paving the way for deeper
mechanistic interpretability of MLMs and suggesting new directions for training
their visual encoder and language model components.
[COMMENTS]
Accepted to COLM 2025
[LINK]
http://arxiv.org/abs/2510.04819v1
[DATE]
2025-10-06 22:01:39+08:00
[CATEGORIES]
cs.CL
TACO: Enhancing Multimodal In-context Learning via Task Mapping-Guided Sequence Configuration
[AUTHORS]
Yanshu Li, Jianjiang Yang, Tian Yun, Pinyuan Feng, Jinfa Huang, Ruixiang Tang
[ABSTRACT]
Multimodal in-context learning (ICL) has emerged as a key mechanism for
harnessing the capabilities of large vision-language models (LVLMs). However,
its effectiveness remains highly sensitive to the quality of input ICL
sequences, particularly for tasks involving complex reasoning or open-ended
generation. A major limitation is our limited understanding of how LVLMs
actually exploit these sequences during inference. To bridge this gap, we
systematically interpret multimodal ICL through the lens of task mapping, which
reveals how local and global relationships within and among demonstrations
guide model reasoning. Building on this insight, we present TACO, a lightweight
transformer-based model equipped with task-aware attention that dynamically
configures ICL sequences. By injecting task-mapping signals into the
autoregressive decoding process, TACO creates a bidirectional synergy between
sequence construction and task reasoning. Experiments on five LVLMs and nine
datasets demonstrate that TACO consistently surpasses baselines across diverse
ICL tasks. These results position task mapping as a novel and valuable
perspective for interpreting and improving multimodal ICL.
[COMMENTS]
EMNLP2025 Main, 28 pages, 11 figures, 19 tables
[LINK]
http://arxiv.org/abs/2505.17098v3
[DATE]
2025-10-06 21:42:58+08:00
[CATEGORIES]
cs.CL
DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation
[AUTHORS]
Giorgio Franceschelli, Mirco Musolesi
[ABSTRACT]
Despite their growing capabilities, language models still frequently
reproduce content from their training data, generate repetitive text, and favor
common grammatical patterns and vocabulary. A possible cause is the decoding
strategy: the most common strategies either consider only the most probable
tokens, which reduces output diversity, or increase the likelihood of unlikely
tokens, compromising output accuracy and correctness. In this paper, we propose
DiffSampling, a new decoding method that leverages a mathematical analysis of
the token probability distribution to ensure the generation of contextually
appropriate text. In particular, the difference between consecutive, sorted
probabilities can be used to truncate incorrect tokens. In addition, we also
propose two variations of the proposed method that aim to correct the subtle
inconsistencies of common sampling strategies. Experiments involving four
different text-generation tasks demonstrate that our approach consistently
performs at least on par with the existing methods it builds upon in terms of
quality, while potentially improving output diversity.
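A minimal sketch of truncation based on differences between consecutive sorted probabilities follows. It assumes the cut is placed at the single largest drop and that sampling then proceeds over the kept tokens in proportion to their original probabilities; the abstract does not state the exact rule, and the function names and the toy distribution are hypothetical.

```python
import random


def diff_cut(probs):
    """Return indices of tokens kept after cutting at the largest drop
    between consecutive probabilities sorted in descending order."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if len(order) < 2:
        return order
    drops = [probs[order[k]] - probs[order[k + 1]] for k in range(len(order) - 1)]
    cut = max(range(len(drops)), key=lambda k: drops[k])  # first largest drop
    return order[: cut + 1]


def sample(probs, rng=random):
    """Sample a token index from the truncated distribution."""
    kept = diff_cut(probs)
    weights = [probs[i] for i in kept]  # random.choices normalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy next-token distribution over five token ids.
probs = [0.05, 0.40, 0.35, 0.15, 0.05]
kept = diff_cut(probs)
token = sample(probs)
```

In this toy distribution the largest drop is between 0.35 and 0.15, so only the two most probable tokens remain candidates.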
[LINK]
http://arxiv.org/abs/2502.14037v4
[DATE]
2025-10-06 21:37:50+08:00
[CATEGORIES]
cs.CL
cs.LG
Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
[AUTHORS]
Sangmin Bae, Bilge Acun, Haroun Habeeb, Seungyeon Kim, Chien-Yu Lin, Liang Luo, Junjie Wang, Carole-Jean Wu
[ABSTRACT]
Recent progress in large language models demonstrates that hybrid
architectures, combining self-attention mechanisms with structured state space
models like Mamba, can achieve a compelling balance between modeling quality
and computational efficiency, particularly for long-context tasks. While these
hybrid models show promising performance, systematic comparisons of
hybridization strategies and analyses on the key factors behind their
effectiveness have not been clearly shared with the community. In this work, we
present a holistic evaluation of hybrid architectures based on inter-layer
(sequential) or intra-layer (parallel) fusion. We evaluate these designs from a
variety of perspectives: language modeling performance, long-context
capabilities, scaling analysis, and training and inference efficiency. By
investigating the core characteristics of their computational primitive, we
identify the most critical elements for each hybridization strategy and further
propose optimal design recipes for both hybrid models. Our comprehensive
analysis provides practical guidance and valuable insights for developing
hybrid language models, facilitating the optimization of architectural
configurations.
[COMMENTS]
17 pages, 4 figures, 6 tables; detailed results will be included in
the Appendix later
[LINK]
http://arxiv.org/abs/2510.04800v1
[DATE]
2025-10-06 21:30:07+08:00
[CATEGORIES]
cs.CL
COSPADI: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
[AUTHORS]
Dmitriy Shopkhoev, Denis Makhov, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
[ABSTRACT]
Post-training compression of large language models (LLMs) largely relies on
low-rank weight approximation, which represents each column of a weight matrix
in a shared low-dimensional subspace. While this is a computationally efficient
strategy, the imposed structural constraint is rigid and can lead to a
noticeable model accuracy drop. In this work, we propose CoSpaDi (Compression
via Sparse Dictionary Learning), a novel training-free compression framework
that replaces low-rank decomposition with a more flexible structured sparse
factorization in which each weight matrix is represented with a dense
dictionary and a column-sparse coefficient matrix. This formulation enables a
union-of-subspaces representation: different columns of the original weight
matrix are approximated in distinct subspaces spanned by adaptively selected
dictionary atoms, offering greater expressiveness than a single invariant
basis. Crucially, CoSpaDi leverages a small calibration dataset to optimize the
factorization such that the output activations of compressed projection layers
closely match those of the original ones, thereby minimizing functional
reconstruction error rather than mere weight approximation. This data-aware
strategy preserves better model fidelity without any fine-tuning under
reasonable compression ratios. Moreover, the resulting structured sparsity
allows efficient sparse-dense matrix multiplication and is compatible with
post-training quantization for further memory and latency gains. We evaluate
CoSpaDi across multiple Llama and Qwen models under per-layer and per-group
settings at 20-50% compression ratios, demonstrating consistent superiority
over state-of-the-art data-aware low-rank methods both in accuracy and
perplexity. Our results establish structured sparse dictionary learning as a
powerful alternative to conventional low-rank approaches for efficient LLM
deployment.
[LINK]
http://arxiv.org/abs/2509.22075v2
[DATE]
2025-10-06 20:56:01+08:00
[CATEGORIES]
cs.CL
Silent Tokens, Loud Effects: Padding in LLMs
[AUTHORS]
Rom Himelstein, Amit LeVi, Yonatan Belinkov, Avi Mendelson
[ABSTRACT]
Padding tokens are widely used in large language models (LLMs) to equalize
sequence lengths during batched inference. While they should be fully masked,
implementation errors can cause them to influence computation, and the extent
of this influence is not well understood. We systematically study this effect
across three open-source model families (Llama, Gemma, Qwen), inserting
controlled amounts of padding and evaluating outcomes along four axes:
activations, generation quality, bias, and safety. Even small amounts of
padding shift hidden representations, degrade quality in smaller models, alter
bias in unpredictable ways, and weaken safety guardrails. These findings
demonstrate that padding is not a harmless detail but a robustness risk that
must be carefully handled in deployment.
[COMMENTS]
Accepted to NeurIPS 2025 Workshop on Evaluating the Evolving LLM
Lifecycle: Benchmarks, Emergent Abilities, and Scaling
[LINK]
http://arxiv.org/abs/2510.01238v2
[DATE]
2025-10-06 20:48:05+08:00
[CATEGORIES]
cs.CL
cs.LG
Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models
[AUTHORS]
Raha Askari, Sina Zarrieß, Özge Alacam, Judith Sieker
[ABSTRACT]
Implicit meanings are integral to human communication, making it essential
for language models to be capable of identifying and interpreting them. Grice
(1975) proposed a set of conversational maxims that guide cooperative dialogue,
noting that speakers may deliberately violate these principles to express
meanings beyond literal words, and that listeners, in turn, recognize such
violations to draw pragmatic inferences.
Building on the study by Surian et al. (1996) of children’s sensitivity to
violations of Gricean maxims, we introduce a novel benchmark to test whether
language models pretrained on less than 10M and less than 100M tokens can
distinguish maxim-adhering from maxim-violating utterances. We compare these
BabyLMs across five maxims and situate their performance relative to children
and a Large Language Model (LLM) pretrained on 3T tokens.
We find that overall, models trained on less than 100M tokens outperform
those trained on less than 10M, yet fall short of child-level and LLM
competence. Our results suggest that modest data increases improve some aspects
of pragmatic behavior, leading to finer-grained differentiation between
pragmatic dimensions.
[LINK]
http://arxiv.org/abs/2510.04764v1
[DATE]
2025-10-06 20:38:41+08:00
[CATEGORIES]
cs.CL
ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever
[AUTHORS]
Eduardo Martínez Rivera, Filippo Menolascina
[ABSTRACT]
Retrieval-Augmented Generation (RAG) is a powerful technique for enriching
Large Language Models (LLMs) with external knowledge, allowing for factually
grounded responses, a critical requirement in high-stakes domains such as
healthcare. However, the efficacy of RAG systems is fundamentally restricted by
the performance of their retrieval module, since irrelevant or semantically
misaligned documents directly compromise the accuracy of the final generated
response. General-purpose dense retrievers can struggle with the nuanced
language of specialised domains, while the high accuracy of in-domain models is
often achieved at prohibitive computational costs. In this work, we aim to
address this trade-off by developing and evaluating a two-stage retrieval
architecture that combines a lightweight ModernBERT bidirectional encoder for
efficient initial candidate retrieval with a ColBERTv2 late-interaction model
for fine-grained re-ranking. We conduct comprehensive evaluations of our
retriever module performance and RAG system performance in the biomedical
context, fine-tuning the IR module using 10k question-passage pairs from
PubMedQA. Our analysis of the retriever module confirmed the positive impact of
the ColBERT re-ranker, which improved Recall@3 by up to 4.2 percentage points
compared to its retrieve-only counterpart. When integrated into the biomedical
RAG, our IR module leads to a state-of-the-art average accuracy of 0.4448 on
the five tasks of the MIRAGE question-answering benchmark, outperforming strong
baselines such as MedCPT (0.4436). Our ablation studies reveal that this
performance is critically dependent on a joint fine-tuning process that aligns
the retriever and re-ranker; otherwise, the re-ranker might degrade the
performance.
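The retrieve-then-re-rank pattern described above can be sketched as follows. The first stage scores documents with one vector each, and the second stage re-scores the shortlist with the late-interaction MaxSim rule used by ColBERT-style models (each query token is matched to its best document token and the maxima are summed). The toy two-dimensional vectors, the corpus layout, and the function names are assumptions for illustration; real systems would use encoder outputs and an approximate nearest-neighbour index.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best document match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def retrieve_then_rerank(query_vec, query_tokens, corpus, k=2):
    """Stage 1: keep the top-k documents by single-vector similarity.
    Stage 2: reorder those k documents by MaxSim over token vectors."""
    shortlist = sorted(
        corpus, key=lambda doc: dot(query_vec, doc["vec"]), reverse=True
    )[:k]
    return sorted(
        shortlist, key=lambda doc: maxsim(query_tokens, doc["tokens"]), reverse=True
    )


# Hypothetical toy corpus: one pooled vector and a list of token vectors each.
corpus = [
    {"id": "a", "vec": [0.9, 0.9], "tokens": [[0.6, 0.6]]},
    {"id": "b", "vec": [0.8, 0.8], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "c", "vec": [0.1, 0.1], "tokens": [[0.1, 0.1]]},
]
ranked = retrieve_then_rerank([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], corpus, k=2)
ranked_ids = [doc["id"] for doc in ranked]
```

Here document "a" wins the first stage, but "b" matches each query token more precisely and moves to the top after re-ranking.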
[LINK]
http://arxiv.org/abs/2510.04757v1
[DATE]
2025-10-06 20:34:55+08:00
[CATEGORIES]
cs.CL
Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba
[AUTHORS]
Baher Mohammad, Magauiya Zhussip, Stamatios Lefkimmiatis
[ABSTRACT]
We introduce MAVE (Mamba with Cross-Attention for Voice Editing and
Synthesis), a novel autoregressive architecture for text-conditioned voice
editing and high-fidelity text-to-speech (TTS) synthesis, built on a
cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in
speech editing and very competitive results in zero-shot TTS, while not being
explicitly trained on the latter task, outperforming leading autoregressive and
diffusion models on diverse, real-world audio. By integrating Mamba for
efficient audio sequence modeling with cross-attention for precise
text-acoustic alignment, MAVE enables context-aware voice editing with
exceptional naturalness and speaker consistency. In pairwise human evaluations
on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2%
of listeners rated MAVE-edited speech as perceptually equal to the original,
while 24.8% preferred the original and 18.0% preferred MAVE, demonstrating that in the
majority of cases edits are indistinguishable from the source. MAVE compares
favorably with VoiceCraft and FluentSpeech both on pairwise comparisons and
standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE
exceeds VoiceCraft in both speaker similarity and naturalness, without
requiring multiple inference runs or post-processing. Remarkably, these quality
gains come with a significantly lower memory cost and approximately the same
latency: MAVE requires ~6x less memory than VoiceCraft during inference on
utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch
size 1). Our results demonstrate that MAVE establishes a new standard for
flexible, high-fidelity voice editing and synthesis through the synergistic
integration of structured state-space modeling and cross-modal attention.
[LINK]
http://arxiv.org/abs/2510.04738v1
[DATE]
2025-10-06 20:11:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling
[AUTHORS]
Jelena Bratulić, Sudhanshu Mittal, David T. Hoffmann, Samuel Böhm, Robin Tibor Schirrmeister, Tonio Ball, Christian Rupprecht, Thomas Brox
[ABSTRACT]
Large Language Models (LLMs) exhibit In-Context Learning (ICL), which enables
the model to perform new tasks conditioning only on the examples provided in
the context without updating the model’s weights. While ICL offers fast
adaptation across natural language tasks and domains, its emergence is less
straightforward for modalities beyond text. In this work, we systematically
uncover properties present in LLMs that support the emergence of ICL for
autoregressive models and various modalities by promoting the learning of the
needed mechanisms for ICL. We identify exact token repetitions in the training
data sequences as an important factor for ICL. Such repetitions further improve
stability and reduce transiency in ICL performance. Moreover, we emphasise the
significance of training task difficulty for the emergence of ICL. Finally, by
applying our novel insights on ICL emergence, we unlock ICL capabilities for
various visual datasets and a more challenging EEG classification task.
[COMMENTS]
Best Paper Honorable Mention at GCPR 2025 (German Conference on
Pattern Recognition). This is the updated version submitted to the
conference, not the official conference proceedings
[LINK]
http://arxiv.org/abs/2501.06256v3
[DATE]
2025-10-06 19:37:13+08:00
[CATEGORIES]
cs.CL
cs.LG
JSON Whisperer: Efficient JSON Editing with LLMs
[AUTHORS]
Sarel Duanis, Asnat Greenstein-Messica, Eliya Habba
[ABSTRACT]
Large language models (LLMs) can modify JSON documents through natural
language commands, but current approaches regenerate entire structures for each
edit, resulting in computational inefficiency. We present JSON Whisperer, a
framework that enables LLMs to generate RFC 6902 diff patches (expressing only
the necessary modifications) rather than complete documents. We identify two key
challenges in patch-based editing: (1) LLMs often miss related updates when
generating isolated patches, and (2) array manipulations require tracking index
shifts across operations, which LLMs handle poorly. To address these issues, we
introduce EASE (Explicitly Addressed Sequence Encoding), which transforms
arrays into dictionaries with stable keys, eliminating index arithmetic
complexities. Our evaluation shows that patch generation with EASE reduces
token usage by 31% while maintaining edit quality within 5% of full
regeneration, with particular gains for complex instructions and list
manipulations. The dataset is available at:
https://github.com/emnlp2025/JSON-Whisperer/
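To make the index-shift problem concrete, the sketch below converts arrays into dictionaries with fixed keys and then applies two patch operations in the RFC 6902 format. The key scheme ("k0", "k1", ...) and both helper functions are assumptions for illustration, not the encoding used by the paper, and the patch applier handles only "replace" and "remove" on nested dictionaries.

```python
def to_stable_keys(value, prefix="k"):
    """Recursively replace lists with dicts whose keys do not move under edits."""
    if isinstance(value, list):
        return {f"{prefix}{i}": to_stable_keys(v, prefix) for i, v in enumerate(value)}
    if isinstance(value, dict):
        return {key: to_stable_keys(v, prefix) for key, v in value.items()}
    return value


def apply_patch(doc, patch):
    """Apply 'replace' and 'remove' operations (a subset of RFC 6902) in place."""
    for op in patch:
        # JSON Pointer unescaping (RFC 6901): "~1" first, then "~0".
        parts = [p.replace("~1", "/").replace("~0", "~")
                 for p in op["path"].split("/")[1:]]
        parent = doc
        for part in parts[:-1]:
            parent = parent[part]
        if parts[-1] not in parent:
            raise KeyError(f"path does not exist: {op['path']}")
        if op["op"] == "replace":
            parent[parts[-1]] = op["value"]
        elif op["op"] == "remove":
            del parent[parts[-1]]
        else:
            raise ValueError(f"unsupported operation: {op['op']}")
    return doc


doc = {"items": ["apple", "banana", "cherry"]}
encoded = to_stable_keys(doc)
patch = [
    {"op": "remove", "path": "/items/k0"},
    {"op": "replace", "path": "/items/k2", "value": "cranberry"},
]
result = apply_patch(encoded, patch)
```

With a plain array, removing index 0 would shift "cherry" to index 1, so the second operation would have to target /items/1; with fixed keys both paths can be written against the original document.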
[LINK]
http://arxiv.org/abs/2510.04717v1
[DATE]
2025-10-06 19:36:46+08:00
[CATEGORIES]
cs.CL
Multilingual Routing in Mixture-of-Experts
[AUTHORS]
Lucas Bandarkar, Chenyuan Yang, Mohsen Fayyaz, Junlin Hu, Nanyun Peng
[ABSTRACT]
Mixture-of-Experts (MoE) architectures have become the key to scaling modern
LLMs, yet little is understood about how their sparse routing dynamics respond
to multilingual data. In this work, we analyze expert routing patterns using
parallel multilingual datasets and present highly interpretable layer-wise
phenomena. We find that MoE models route tokens in language-specific ways in
the early and late decoder layers but exhibit significant cross-lingual routing
alignment in middle layers, mirroring parameter-sharing trends observed in
dense LLMs. In particular, we reveal a clear, strong correlation between a
model’s performance in a given language and how similarly its tokens are routed
to English in these layers. Extending beyond correlation, we explore
inference-time interventions that induce higher cross-lingual routing
alignment. We introduce a method that steers the router by promoting
middle-layer task experts frequently activated in English, and it successfully
increases multilingual performance. These 1-2% gains are remarkably consistent
across two evaluation tasks, three models, and 15+ languages, especially given
that these simple interventions override routers of extensively trained,
state-of-the-art LLMs. In comparison, interventions outside of the middle
layers or targeting multilingual-specialized experts only yield performance
degradation. Altogether, we present numerous findings that explain how MoEs
process non-English text and demonstrate that generalization is limited by the
model’s ability to leverage language-universal experts in all languages.
[LINK]
http://arxiv.org/abs/2510.04694v1
[DATE]
2025-10-06 19:09:20+08:00
[CATEGORIES]
cs.CL
cs.LG
Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning
[AUTHORS]
Zhe Hu, Jing Li, Zhongzhu Pu, Hou Pong Chan, Yu Yin
[COMMENTS]
Accepted at NeurIPS 2025
[LINK]
http://arxiv.org/abs/2503.16965v3
[DATE]
2025-10-06 18:03:29+08:00
[CATEGORIES]
cs.CL
FT-MDT: Extracting Decision Trees from Medical Texts via a Novel Low-rank Adaptation Method
[AUTHORS]
Yuheng Li, Jiechao Gao, Wei Han, Wenwen Ouyang, Wei Zhu, Hui Yi Leong
[ABSTRACT]
Knowledge of the medical decision process, which can be modeled as medical
decision trees (MDTs), is critical to building clinical decision support
systems. However, current MDT construction methods rely heavily on
time-consuming and laborious manual annotation. To address this challenge, we
propose PI-LoRA (Path-Integrated LoRA), a novel low-rank adaptation method for
automatically extracting MDTs from clinical guidelines and textbooks. We
integrate gradient path information to capture synergistic effects between
different modules, enabling more effective and reliable rank allocation. This
framework ensures that the most critical modules receive appropriate rank
allocations while less important ones are pruned, resulting in a more efficient
and accurate model for extracting medical decision trees from clinical texts.
Extensive experiments on medical guideline datasets demonstrate that our
PI-LoRA method significantly outperforms existing parameter-efficient
fine-tuning approaches for the Text2MDT task, achieving better accuracy with
substantially reduced model complexity. The proposed method achieves
state-of-the-art results while maintaining a lightweight architecture, making
it particularly suitable for clinical decision support systems where
computational resources may be limited.
[COMMENTS]
Accepted by EMNLP-2025 Industrial Track
[LINK]
http://arxiv.org/abs/2510.04655v1
[DATE]
2025-10-06 17:59:55+08:00
[CATEGORIES]
cs.CL
MAGIC: A Multi-Hop and Graph-Based Benchmark for Inter-Context Conflicts in Retrieval-Augmented Generation
[AUTHORS]
Jungyeon Lee, Kangmin Lee, Taeuk Kim
[ABSTRACT]
Knowledge conflict often arises in retrieval-augmented generation (RAG)
systems, where retrieved documents may be inconsistent with one another or
contradict the model’s parametric knowledge. Existing benchmarks for
investigating the phenomenon have notable limitations, including a narrow focus
on the question answering setup, heavy reliance on entity substitution
techniques, and a restricted range of conflict types. To address these issues,
we propose a knowledge graph (KG)-based framework that generates varied and
subtle conflicts between two similar yet distinct contexts, while ensuring
interpretability through the explicit relational structure of KGs. Experimental
results on our benchmark, MAGIC, provide intriguing insights into the inner
workings of LLMs regarding knowledge conflict: both open-source and proprietary
models struggle with conflict detection – especially when multi-hop reasoning
is required – and often fail to pinpoint the exact source of contradictions.
Finally, we present in-depth analyses that serve as a foundation for improving
LLMs in integrating diverse, sometimes even conflicting, information.
[LINK]
http://arxiv.org/abs/2507.21544v2
[DATE]
2025-10-06 17:59:30+08:00
[CATEGORIES]
cs.CL
Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
[AUTHORS]
Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo
[ABSTRACT]
As psychometric surveys are increasingly used to assess the traits of large
language models (LLMs), the need for scalable survey item generation suited for
LLMs has also grown. A critical challenge here is ensuring the construct
validity of generated items, i.e., whether they truly measure the intended
trait. Traditionally, this requires costly, large-scale human data collection.
To make it efficient, we present a framework for virtual respondent simulation
using LLMs. Our central idea is to account for mediators: factors through which
the same trait can give rise to varying responses to a survey item. By
simulating respondents with diverse mediators, we identify survey items that
robustly measure intended traits. Experiments on three psychological trait
theories (Big5, Schwartz, VIA) show that our mediator generation methods and
simulation framework effectively identify high-validity items. LLMs demonstrate
the ability to generate plausible mediators from trait definitions and to
simulate respondent behavior for item validation. Our problem formulation,
metrics, methodology, and dataset open a new direction for cost-effective
survey development and a deeper understanding of how LLMs simulate human survey
responses. We publicly release our dataset and code to support future work.
[COMMENTS]
21 pages, 9 figures
[LINK]
http://arxiv.org/abs/2507.05890v2
[DATE]
2025-10-06 17:54:02+08:00
[CATEGORIES]
cs.CL
Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents
[AUTHORS]
Kuntai Cai, Juncheng Liu, Xianglin Yang, Zhaojie Niu, Xiaokui Xiao, Xing Chen
[ABSTRACT]
Large language model (LLM) agents typically receive two kinds of context: (i)
environment-level manuals that define interaction interfaces and global rules,
and (ii) task-level guidance or demonstrations tied to specific goals. In this
work, we identify a crucial but overlooked third type of context,
instance-level context, which consists of verifiable and reusable facts tied to
a specific environment instance, such as object locations, crafting recipes,
and local rules. We argue that the absence of instance-level context is a
common source of failure for LLM agents in complex tasks, as success often
depends not only on reasoning over global rules or task prompts but also on
making decisions based on precise and persistent facts. Acquiring such context
requires more than memorization: the challenge lies in efficiently exploring,
validating, and formatting these facts under tight interaction budgets. We
formalize this problem as Instance-Level Context Learning (ILCL) and introduce
our task-agnostic method to solve it. Our method performs a guided exploration,
using a compact TODO forest to intelligently prioritize its next actions and a
lightweight plan-act-extract loop to execute them. This process automatically
produces a high-precision context document that is reusable across many
downstream tasks and agents, thereby amortizing the initial exploration cost.
Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent
gains in both success and efficiency: for instance, ReAct’s mean success rate
in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By
transforming one-off exploration into persistent, reusable knowledge, our
method complements existing contexts to enable more reliable and efficient LLM
agents.
[LINK]
http://arxiv.org/abs/2510.02369v2
[DATE]
2025-10-06 17:40:38+08:00
[CATEGORIES]
cs.CL
AMAS: Adaptively Determining Communication Topology for LLM-based Multi-Agent System
[AUTHORS]
Hui Yi Leong, Yuheng Li, Yuqing Wu, Wenwen Ouyang, Wei Zhu, Jiechao Gao, Wei Han
[ABSTRACT]
Although large language models (LLMs) have revolutionized natural language
processing capabilities, their practical implementation as autonomous
multi-agent systems (MAS) for industrial problem-solving encounters persistent
barriers. Conventional MAS architectures are fundamentally restricted by
inflexible, hand-crafted graph topologies that lack contextual responsiveness,
resulting in diminished efficacy across varied academic and commercial
workloads. To surmount these constraints, we introduce AMAS, a
paradigm-shifting framework that redefines LLM-based MAS through a novel
dynamic graph designer. This component autonomously identifies task-specific
optimal graph configurations via lightweight LLM adaptation, eliminating the
reliance on monolithic, universally applied structural templates. Instead, AMAS
exploits the intrinsic properties of individual inputs to intelligently direct
query trajectories through task-optimized agent pathways. Rigorous validation
across question answering, mathematical deduction, and code generation
benchmarks confirms that AMAS systematically exceeds state-of-the-art
single-agent and multi-agent approaches across diverse LLM architectures. Our
investigation establishes that context-sensitive structural adaptability
constitutes a foundational requirement for high-performance LLM MAS
deployments.
[COMMENTS]
Accepted by EMNLP-2025 Industrial Track
[LINK]
http://arxiv.org/abs/2510.01617v2
[DATE]
2025-10-06 17:33:41+08:00
[CATEGORIES]
cs.CL
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
[AUTHORS]
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, Kunle Olukotun
[ABSTRACT]
Large language model (LLM) applications such as agents and domain-specific
reasoning increasingly rely on context adaptation – modifying inputs with
instructions, strategies, or evidence, rather than weight updates. Prior
approaches improve usability but often suffer from brevity bias, which drops
domain insights for concise summaries, and from context collapse, where
iterative rewriting erodes details over time. Building on the adaptive memory
introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context
Engineering), a framework that treats contexts as evolving playbooks that
accumulate, refine, and organize strategies through a modular process of
generation, reflection, and curation. ACE prevents collapse with structured,
incremental updates that preserve detailed knowledge and scale with
long-context models. Across agent and domain-specific benchmarks, ACE optimizes
contexts both offline (e.g., system prompts) and online (e.g., agent memory),
consistently outperforming strong baselines: +10.6% on agents and +8.6% on
finance, while significantly reducing adaptation latency and rollout cost.
Notably, ACE can adapt effectively without labeled supervision by instead
leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches
the top-ranked production-level agent on the overall average and surpasses it
on the harder test-challenge split, despite using a smaller open-source model.
These results show that comprehensive, evolving contexts enable scalable,
efficient, and self-improving LLM systems with low overhead.
[LINK]
http://arxiv.org/abs/2510.04618v1
[DATE]
2025-10-06 17:30:18+08:00
[CATEGORIES]
cs.LG
cs.CL
Can We Infer Confidential Properties of Training Data from LLMs?
[AUTHORS]
Pengrun Huang, Chhavi Yadav, Kamalika Chaudhuri, Ruihan Wu
[ABSTRACT]
Large language models (LLMs) are increasingly fine-tuned on domain-specific
datasets to support applications in fields such as healthcare, finance, and
law. These fine-tuning datasets often have sensitive and confidential
dataset-level properties – such as patient demographics or disease prevalence
– that are not intended to be revealed. While prior work has studied property
inference attacks on discriminative models (e.g., image classification models)
and generative models (e.g., GANs for image data), it remains unclear if such
attacks transfer to LLMs. In this work, we introduce PropInfer, a benchmark
task for evaluating property inference in LLMs under two fine-tuning paradigms:
question-answering and chat-completion. Built on the ChatDoctor dataset, our
benchmark includes a range of property types and task configurations. We
further propose two tailored attacks: a prompt-based generation attack and a
shadow-model attack leveraging word frequency signals. Empirical evaluations
across multiple pretrained LLMs show the success of our attacks, revealing a
previously unrecognized vulnerability in LLMs.
[LINK]
http://arxiv.org/abs/2506.10364v3
[DATE]
2025-10-06 17:11:48+08:00
[CATEGORIES]
cs.LG
cs.CL
Query-Level Uncertainty in Large Language Models
[AUTHORS]
Lihu Chen, Gerard de Melo, Fabian M. Suchanek, Gaël Varoquaux
[ABSTRACT]
It is important for Large Language Models (LLMs) to be aware of the boundary
of their knowledge, distinguishing queries they can confidently answer from
those that lie beyond their capabilities. Such awareness enables models to
perform adaptive inference, such as invoking retrieval-augmented generation
(RAG), engaging in slow and deep thinking, or abstaining from answering when
appropriate. These mechanisms are key to developing efficient and trustworthy
AI. In this work, we propose a method to detect knowledge boundaries via
Query-Level Uncertainty, which estimates if a model is capable of answering a
given query before generating any tokens, thus avoiding the generation cost. To
this end, we propose a novel, training-free method called Internal Confidence,
which leverages self-evaluations across layers and tokens to provide a reliable
signal of uncertainty. Empirical studies on both factual question answering and
mathematical reasoning tasks demonstrate that our Internal Confidence
outperforms several baselines in quality of confidence while being
computationally cheaper. Furthermore, we demonstrate its benefits in adaptive
inference settings, showing that for RAG and model cascading it reduces
inference costs while preserving overall performance.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2506.09669v3
[DATE]
2025-10-06 17:08:21+08:00
[CATEGORIES]
cs.CL
FedSRD: Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning
[AUTHORS]
Guochen Yan, Luyuan Xie, Qingni Shen, Yuejian Fang, Zhonghai Wu
[ABSTRACT]
The current paradigm of training large language models (LLMs) on publicly
available Web data is becoming unsustainable, with high-quality data sources in
specialized domains nearing exhaustion. Federated Learning (FL) emerges as a
practical solution for the next generation of AI on a decentralized Web,
enabling privacy-preserving collaborative fine-tuning by leveraging private
data distributed across a global client base. While Low-Rank Adaptation (LoRA)
is the standard for efficient fine-tuning, its application in federated
settings presents a critical challenge: communication overhead remains a
significant bottleneck across the Web’s heterogeneous network conditions. The
structural redundancy within LoRA parameters not only incurs a heavy
communication burden but also introduces conflicts when aggregating client
updates. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose
framework designed for communication-efficient FL. We first introduce an
importance-aware sparsification method that preserves the structural integrity
of LoRA updates to reduce the uploaded parameter count. The server then
reconstructs and aggregates these updates in a full-rank space to mitigate
conflicts. Finally, it decomposes the global update into a sparse low-rank
format for broadcast, ensuring a symmetrically efficient cycle. We also propose
an efficient variant, FedSRD-e, to reduce computational overhead. Experimental
results on 10 benchmarks demonstrate that our framework significantly reduces
communication costs by up to 90% while even improving model performance on
heterogeneous client data.
[LINK]
http://arxiv.org/abs/2510.04601v1
[DATE]
2025-10-06 17:06:38+08:00
[CATEGORIES]
cs.CL
AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners
[AUTHORS]
Woosung Koh, Wonbeen Oh, Jaein Jang, MinHyung Lee, Hyeongjin Kim, Ah Yeon Kim, Joonkee Kim, Junghyun Lee, Taehyeon Kim, Se-Young Yun
[ABSTRACT]
Self-Taught Reasoners (STaR), synonymously known as Rejection sampling
Fine-Tuning (RFT), is an integral part of the training pipeline of
self-improving reasoning Language Models (LMs). The self-improving mechanism
often employs random observation (data) sampling. However, this results in
trained observation imbalance; inefficiently over-training on solved examples
while under-training on challenging ones. In response, we introduce Adaptive
STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two
adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting
balanced training across observations, and (2) Adaptive Sampling for
Curriculum: dynamically adjusting data difficulty to match the model’s evolving
strength. Across six benchmarks, AdaSTaR achieves the best test accuracy in all
instances (6/6) and reduces training FLOPs by an average of 58.6% against an
extensive list of baselines. These improvements in performance and efficiency
generalize to different pre-trained LMs and larger models, paving the way for
more efficient and effective self-improving LMs.
[COMMENTS]
NeurIPS 2025
[LINK]
http://arxiv.org/abs/2505.16322v3
[DATE]
2025-10-06 16:23:36+08:00
[CATEGORIES]
cs.LG
cs.CL
More Than Meets the Eye? Uncovering the Reasoning-Planning Disconnect in Training Vision-Language Driving Models
[AUTHORS]
Xurui Song, Shuo Huai, JingJing Jiang, Jiayi Kong, Jun Luo
[ABSTRACT]
Vision-Language Model (VLM) driving agents promise explainable end-to-end
autonomy by first producing natural-language reasoning and then predicting
trajectory planning. However, whether planning is causally driven by this
reasoning remains a critical but unverified assumption. To investigate this, we
build DriveMind, a large-scale driving Visual Question Answering (VQA) corpus
with plan-aligned Chain-of-Thought (CoT), automatically generated from nuPlan.
Our data generation process converts sensors and annotations into structured
inputs and, crucially, separates priors from to-be-reasoned signals, enabling
clean information ablations. Using DriveMind, we train representative VLM
agents with Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization
(GRPO) and evaluate them with nuPlan’s metrics. Our results, unfortunately,
indicate a consistent causal disconnect in reasoning-planning: removing
ego/navigation priors causes large drops in planning scores, whereas removing
CoT produces only minor changes. Attention analysis further shows that planning
primarily focuses on priors rather than the CoT. Based on this evidence, we
propose the Reasoning-Planning Decoupling Hypothesis, positing that the
training-yielded reasoning is an ancillary byproduct rather than a causal
mediator. To enable efficient diagnosis, we also introduce a novel,
training-free probe that measures an agent’s reliance on priors by evaluating
its planning robustness against minor input perturbations. In summary, we
provide the community with a new dataset and a diagnostic tool to evaluate the
causal fidelity of future models.
[COMMENTS]
The dataset will be released publicly once the paper is accepted for
publication
[LINK]
http://arxiv.org/abs/2510.04532v1
[DATE]
2025-10-06 14:50:16+08:00
[CATEGORIES]
cs.CL
GRACE: Generative Representation Learning via Contrastive Policy Optimization
[AUTHORS]
Jiashuo Sun, Shixuan Liu, Zhaochen Su, Xianrui Zhong, Pengcheng Jiang, Bowen Jin, Peiran Li, Weijia Shi, Jiawei Han
[ABSTRACT]
Prevailing methods for training Large Language Models (LLMs) as text encoders
rely on contrastive losses that treat the model as a black box function,
discarding its generative and reasoning capabilities in favor of static
embeddings. We introduce GRACE (Generative Representation Learning via
Contrastive Policy Optimization), a novel framework that reimagines contrastive
signals not as losses to be minimized, but as rewards that guide a generative
policy. In GRACE, the LLM acts as a policy that produces explicit,
human-interpretable rationales: structured natural language explanations of its
semantic understanding. These rationales are then encoded into high-quality
embeddings via mean pooling. Using policy gradient optimization, we train the
model with a multi-component reward function that maximizes similarity between
query-positive pairs and minimizes similarity with negatives. This transforms
the LLM from an opaque encoder into an interpretable agent whose reasoning
process is transparent and inspectable. On the MTEB benchmark, GRACE yields broad
cross-category gains: averaged over four backbones, the supervised setting
improves overall score by 11.5% over base models, and the unsupervised variant
adds 6.9%, while preserving general capabilities. This work treats contrastive
objectives as rewards over rationales, unifying representation learning with
generation to produce stronger embeddings and transparent rationales. The
model, data and code are available at https://github.com/GasolSun36/GRACE.
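
The pooling and reward idea described in the GRACE abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the multi-component reward, so the single-term reward below (positive similarity minus mean negative similarity) and the helper names `mean_pool`, `cosine`, and `contrastive_reward` are assumptions made for illustration only.

```python
import math

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one embedding."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def contrastive_reward(query_emb, positive_emb, negative_embs):
    """Hypothetical reward: high when the query is close to its positive
    and far from its negatives. The real reward is multi-component and
    is not specified in the abstract."""
    pos_sim = cosine(query_emb, positive_emb)
    if not negative_embs:
        return pos_sim
    neg_sim = sum(cosine(query_emb, n) for n in negative_embs) / len(negative_embs)
    return pos_sim - neg_sim
```

In a policy-gradient setup, a scalar like this would score each generated rationale after its token representations are mean-pooled; how the actual system combines reward terms is left to the paper and code release.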
[COMMENTS]
23 pages, 7 figures, 7 tables
[LINK]
http://arxiv.org/abs/2510.04506v1
[DATE]
2025-10-06 13:46:56+08:00
[CATEGORIES]
cs.CL
P2P: A Poison-to-Poison Remedy for Reliable Backdoor Defense in LLMs
[AUTHORS]
Shuai Zhao, Xinyi Wu, Shiqian Zhao, Xiaobao Wu, Zhongliang Guo, Yanhao Jia, Anh Tuan Luu
[ABSTRACT]
During fine-tuning, large language models (LLMs) are increasingly vulnerable
to data-poisoning backdoor attacks, which compromise their reliability and
trustworthiness. However, existing defense strategies suffer from limited
generalization: they only work on specific attack types or task settings. In
this study, we propose Poison-to-Poison (P2P), a general and effective backdoor
defense algorithm. P2P injects benign triggers with safe alternative labels
into a subset of training samples and fine-tunes the model on this re-poisoned
dataset by leveraging prompt-based learning. This forces the model to
associate trigger-induced representations with safe outputs, thereby overriding
the effects of original malicious triggers. Thanks to this robust and
generalizable trigger-based fine-tuning, P2P is effective across task settings
and attack types. Theoretically and empirically, we show that P2P can
neutralize malicious backdoors while preserving task performance. We conduct
extensive experiments on classification, mathematical reasoning, and summary
generation tasks, involving multiple state-of-the-art LLMs. The results
demonstrate that our P2P algorithm significantly reduces the attack success
rate compared with baseline models. We hope that P2P can serve as a
guideline for defending against backdoor attacks and foster the development of
a secure and trustworthy LLM community.
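
The re-poisoning step described in the P2P abstract (benign triggers with safe labels injected into a subset of training samples) might look roughly like the sketch below. This is an illustration under stated assumptions, not the paper's algorithm: the function name `repoison`, the choice to prepend the trigger, the use of a single safe label, and the random subset selection are all hypothetical.

```python
import random

def repoison(samples, trigger, safe_label, fraction, seed=0):
    """Return a copy of `samples` (list of (text, label) pairs) in which a
    random `fraction` of items carries a benign trigger and a safe label.

    Assumptions (not from the source): the trigger is prepended to the
    text, and one safe label is shared by all re-poisoned samples.
    """
    rng = random.Random(seed)
    count = round(len(samples) * fraction)
    chosen = set(rng.sample(range(len(samples)), count))
    result = []
    for index, (text, label) in enumerate(samples):
        if index in chosen:
            result.append((f"{trigger} {text}", safe_label))
        else:
            result.append((text, label))
    return result
```

Fine-tuning on the returned dataset is the part the abstract credits with overriding malicious triggers; this sketch covers only the data-preparation step.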
[LINK]
http://arxiv.org/abs/2510.04503v1
[DATE]
2025-10-06 13:45:23+08:00
[CATEGORIES]
cs.CL
DISC: Dynamic Decomposition Improves LLM Inference Scaling
[AUTHORS]
Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen
[ABSTRACT]
Inference scaling methods for LLMs often rely on decomposing problems into
steps (or groups of tokens), followed by sampling and selecting the best next
steps. However, these steps and their sizes are often predetermined or manually
designed based on domain knowledge. We propose dynamic decomposition, a method
that adaptively and automatically partitions solution and reasoning traces into
manageable steps during inference. By more effectively allocating compute –
particularly through subdividing challenging steps and prioritizing their
sampling – dynamic decomposition significantly improves inference efficiency.
Experiments on benchmarks such as APPS, MATH, and LiveCodeBench demonstrate
that dynamic decomposition outperforms static approaches, including
token-level, sentence-level, and single-step decompositions, reducing the
pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively. These findings
highlight the potential of dynamic decomposition to improve a wide range of
inference scaling techniques.
[COMMENTS]
10 pages, Accepted to NeurIPS 2025 (Conference on Neural Information
Processing Systems)
[LINK]
http://arxiv.org/abs/2502.16706v3
[DATE]
2025-10-06 13:36:54+08:00
[CATEGORIES]
cs.LG
cs.CL
GenQuest: An LLM-based Text Adventure Game for Language Learners
[AUTHORS]
Qiao Wang, Adnan Labib, Robert Swier, Michael Hofmeyr, Zheng Yuan
[ABSTRACT]
GenQuest is a generative text adventure game that leverages Large Language
Models (LLMs) to facilitate second language learning through immersive,
interactive storytelling. The system engages English as a Foreign Language
(EFL) learners in a collaborative “choose-your-own-adventure” style narrative,
dynamically generated in response to learner choices. Game mechanics such as
branching decision points and story milestones are incorporated to maintain
narrative coherence while allowing learner-driven plot development. Key
pedagogical features include content generation tailored to each learner’s
proficiency level, and a vocabulary assistant that provides in-context
explanations of learner-queried text strings, ranging from words and phrases to
sentences. Findings from a pilot study with university EFL students in China
indicate promising vocabulary gains and positive user perceptions. Also
discussed are suggestions from participants regarding the narrative length and
quality, and the request for multi-modal content such as illustrations.
[COMMENTS]
Workshop on Wordplay: When Language Meets Games, EMNLP 2025
[LINK]
http://arxiv.org/abs/2510.04498v1
[DATE]
2025-10-06 13:22:53+08:00
[CATEGORIES]
cs.CL
Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking
[AUTHORS]
Yunyi Zhang, Ruozhen Yang, Siqi Jiao, SeongKu Kang, Jiawei Han
[COMMENTS]
Accepted to EMNLP 2025 Findings
[LINK]
http://arxiv.org/abs/2505.21815v2
[DATE]
2025-10-06 12:57:43+08:00
[CATEGORIES]
cs.CL
Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
[AUTHORS]
Amin Banayeeanzade, Ala N. Tak, Fatemeh Bahrani, Anahita Bolourani, Leonardo Blas, Emilio Ferrara, Jonathan Gratch, Sai Praneeth Karimireddy
[ABSTRACT]
The ability to control LLMs’ emulated emotional states and personality traits
is essential for enabling rich, human-centered interactions in socially
interactive settings. We introduce PsySET, a Psychologically-informed benchmark
to evaluate LLM Steering Effectiveness and Trustworthiness across the emotion
and personality domains. Our study spans four models from different LLM
families paired with various steering strategies, including prompting,
fine-tuning, and representation engineering. Our results indicate that
prompting is consistently effective but limited in intensity control, whereas
vector injections achieve finer controllability while slightly reducing output
quality. Moreover, we explore the trustworthiness of steered LLMs by assessing
safety, truthfulness, fairness, and ethics, highlighting potential side effects
and behavioral shifts. Notably, we observe idiosyncratic effects; for instance,
even a positive emotion like joy can degrade robustness to adversarial
factuality, lower privacy awareness, and increase preferential bias. Meanwhile,
anger predictably elevates toxicity yet strengthens leakage resistance. Our
framework establishes the first holistic evaluation of emotion and personality
steering, offering insights into its interpretability and reliability for
socially interactive applications.
[COMMENTS]
Submitted to ARR - October 2025
[LINK]
http://arxiv.org/abs/2510.04484v1
[DATE]
2025-10-06 12:49:56+08:00
[CATEGORIES]
cs.CL
SCAN: Structured Capability Assessment and Navigation for LLMs
[AUTHORS]
Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang
[ABSTRACT]
Evaluating Large Language Models (LLMs) has become increasingly important,
with automatic evaluation benchmarks gaining prominence as alternatives to
human evaluation. While existing research has focused on approximating model
rankings, such benchmarks fail to provide users and developers with a
comprehensive and fine-grained understanding of a specific model’s
capabilities. To fill this gap, we propose SCAN (Structured Capability
Assessment and Navigation), a practical framework that enables detailed
characterization of LLM capabilities through comprehensive and fine-grained
evaluation. SCAN incorporates four key components: (1) TaxBuilder, which
extracts capability-indicating tags from extensive queries to construct a
hierarchical taxonomy automatically; (2) RealMix, a query synthesis and
filtering mechanism that ensures sufficient evaluation data for each capability
tag; (3) a suite of visualization and analysis tools that facilitate efficient
navigation and analysis of model capabilities; and (4) a PC²-based
(Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves
significantly higher accuracy than the classic LLM-as-a-Judge method. Using
SCAN, we conduct a comprehensive evaluation of 21 mainstream LLMs. Our detailed
analysis of the GPT-OSS family reveals substantial performance variations, even
within sub-capabilities belonging to the same category of capability. This
finding highlights the importance of fine-grained evaluation in accurately
understanding LLM behavior. Project homepage and resources are available at
https://liudan193.github.io/Feedbacker/.
[LINK]
http://arxiv.org/abs/2505.06698v3
[DATE]
2025-10-06 12:36:33+08:00
[CATEGORIES]
cs.CL
From Word to World: Evaluate and Mitigate Culture Bias in LLMs via Word Association Test
[AUTHORS]
Xunlian Dai, Li Zhou, Benyou Wang, Haizhou Li
[COMMENTS]
Cultural Analysis, Cultural Alignment, Word Association Test, Large
Language Models. Accepted by EMNLP 2025 (Oral)
[LINK]
http://arxiv.org/abs/2505.18562v2
[DATE]
2025-10-06 12:31:03+08:00
[CATEGORIES]
cs.CL
MedCLM: Learning to Localize and Reason via a CoT-Curriculum in Medical Vision-Language Models
[AUTHORS]
Soo Yong Kim, Suin Cho, Vincent-Daniel Yun, Gyeongyeon Hwang
[ABSTRACT]
Bridging clinical diagnostic reasoning with AI remains a central challenge in
medical imaging. We introduce MedCLM, an automated pipeline that converts
detection datasets into large-scale medical visual question answering (VQA)
data with Chain-of-Thought (CoT) reasoning by linking lesion boxes to organ
segmentation and structured rationales. These contextual signals enable medical
vision-language models to generate question-answer pairs with step-by-step
reasoning. To utilize this data effectively, we propose an Integrated
CoT-Curriculum Strategy composed of an Easy stage with explicit lesion boxes
for visual grounding, a Medium stage that encourages implicit localization, and
a Hard stage for weakly supervised reasoning. Experimental results demonstrate
that MedCLM attains state-of-the-art performance on several medical VQA
benchmarks, providing a scalable framework for developing clinically aligned
medical vision-language models.
[LINK]
http://arxiv.org/abs/2510.04477v1
[DATE]
2025-10-06 12:26:39+08:00
[CATEGORIES]
cs.CL
cs.LG
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
[AUTHORS]
Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge
[ABSTRACT]
Multi-headed Attention’s (MHA) quadratic compute and linearly growing
KV-cache make long-context transformers expensive to train and serve. Prior
works such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA)
shrink the cache, speeding decode, but leave compute, which determines prefill
and training speed, largely unchanged. We introduce Compressed Convolutional
Attention (CCA), a novel attention method which down-projects queries, keys,
and values and performs the entire attention operation inside the shared latent
space. This simple design dramatically cuts parameters, KV-cache, and FLOPs all
at once by the desired compression factor. Because CCA is orthogonal to
head-sharing, we combine the two to form Compressed Convolutional Grouped Query
Attention (CCGQA), which further tightens the compute-bandwidth Pareto frontier
so that users can tune compression toward either FLOP or memory limits without
sacrificing quality. Experiments show that CCGQA consistently outperforms both
GQA and MLA at equal KV-cache compression on dense and MoE models.
Additionally, we show that CCGQA outperforms all other attention methods on MoE
models with half the KV-cache of GQA and MLA, achieving an 8x KV-cache
compression with no drop in performance compared to standard MHA. CCA and CCGQA
also dramatically reduce the FLOP cost of attention, which leads to
substantially faster training and prefill than existing methods. On H100 GPUs,
our fused CCA/CCGQA kernel reduces prefill latency by about 1.7x at a sequence
length of 16k relative to MHA, and accelerates backward by about 1.3x.
[LINK]
http://arxiv.org/abs/2510.04476v1
[DATE]
2025-10-06 12:24:23+08:00
[CATEGORIES]
cs.CL
Deliberate Planning in Language Models with Symbolic Representation
[AUTHORS]
Siheng Xiong, Zhangding Liu, Jieyu Zhou, Yusen Su
[ABSTRACT]
Planning remains a core challenge for large language models (LLMs),
particularly in domains that require coherent multi-step action sequences
grounded in external constraints. We introduce SymPlanner, a novel framework
that equips LLMs with structured planning capabilities by interfacing them with
a symbolic environment that serves as an explicit world model. Rather than
relying purely on natural language reasoning, SymPlanner grounds the planning
process in a symbolic state space, where a policy model proposes actions and a
symbolic environment deterministically executes and verifies their effects. To
enhance exploration and improve robustness, we introduce Iterative Correction
(IC), which refines previously proposed actions by leveraging feedback from the
symbolic environment to eliminate invalid decisions and guide the model toward
valid alternatives. Additionally, Contrastive Ranking (CR) enables fine-grained
comparison of candidate plans by evaluating them jointly. Conceptually,
SymPlanner operationalizes two cognitive faculties: (i) error monitoring and
repair via externalized feedback (IC) and (ii) preference formation among
alternatives via pairwise comparison (CR), advancing cognitively plausible,
symbol-grounded planning aligned with the rich structure in intelligent
systems. We evaluate SymPlanner on PlanBench, demonstrating that it produces
more coherent, diverse, and verifiable plans than pure natural language
baselines.
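
The Iterative Correction loop described in the SymPlanner abstract (a policy proposes an action, a symbolic environment verifies it, and rejected actions are fed back) can be sketched as follows. This is a toy illustration, not the authors' system: the function `iterative_correction`, its callback signatures, and the retry budget are hypothetical, and a real policy would be an LLM rather than a lookup.

```python
def iterative_correction(propose, is_valid, state, max_rounds=3):
    """Ask `propose` for an action until the environment accepts one.

    propose(state, rejected) -> action or None   (hypothetical policy)
    is_valid(state, action)  -> bool             (hypothetical symbolic check)
    Rejected actions are passed back as feedback so they are not repeated.
    """
    rejected = []
    for _ in range(max_rounds):
        action = propose(state, rejected)
        if action is None:
            return None  # policy has no further candidates
        if is_valid(state, action):
            return action
        rejected.append(action)
    return None

# Toy environment: a counter that may never go below zero.
def toy_is_valid(state, action):
    return not (action == "dec" and state == 0)

# Toy policy: try candidates in a fixed order, skipping rejected ones.
def toy_propose(state, rejected):
    for candidate in ("dec", "inc"):
        if candidate not in rejected:
            return candidate
    return None

chosen_action = iterative_correction(toy_propose, toy_is_valid, state=0)
```

Here the first proposal (`"dec"`) is rejected by the environment at state 0, and the second (`"inc"`) is accepted; the deterministic validity check stands in for the symbolic world model.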
[COMMENTS]
Accepted to Twelfth Annual Conference on Advances in Cognitive
Systems
[LINK]
http://arxiv.org/abs/2505.01479v3
[DATE]
2025-10-06 12:14:44+08:00
[CATEGORIES]
cs.CL
Pretraining with hierarchical memories: separating long-tail and common knowledge
[AUTHORS]
Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel
[ABSTRACT]
The impressive performance gains of modern language models currently rely on
scaling parameters: larger models store more world knowledge and reason better.
Yet compressing all world knowledge into parameters is unnecessary, as only a
fraction is used per prompt, and impractical for edge devices with limited
inference-time memory and compute. We address this shortcoming by a
memory-augmented architecture and a pretraining strategy aligned with existing
hardware paradigms. We introduce small language models that access large
hierarchical parametric memory banks encoding world knowledge. During
pretraining and inference, we fetch a small, context-dependent memory block and
add it to the model. Our pretraining learns to store long-tail world knowledge
in the memory parameters, while the small language model acts as an anchor
capturing common knowledge and general reasoning abilities. Through
trillion-token-scale experiments, we show significant gains: a 160M-parameter
model augmented with an 18M-parameter memory fetched from a 4.6B memory bank
obtains comparable performance to a regular model with more than 2x the
parameters. Through extensive experiments, we study the optimal type and size
of parametric memories in transformers, scaling them to over 21B parameters. We
find that our proposed hierarchical feed-forward memories work robustly across
transformer architectures, whether added during pretraining or post-hoc.
[LINK]
http://arxiv.org/abs/2510.02375v2
[DATE]
2025-10-06 11:54:08+08:00
[CATEGORIES]
cs.CL
cs.LG
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
[AUTHORS]
Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Shaolei Du, Yejin Choi, Tim Althoff, Natasha Jaques
[ABSTRACT]
Conventional language model (LM) safety alignment relies on a reactive,
disjoint procedure: attackers exploit a static model, followed by defensive
fine-tuning to patch exposed vulnerabilities. This sequential approach creates
a mismatch – attackers overfit to obsolete defenses, while defenders
perpetually lag behind emerging threats. To address this, we propose
Self-RedTeam, an online self-play reinforcement learning algorithm where an
attacker and defender agent co-evolve through continuous interaction. We cast
safety alignment as a two-player zero-sum game, where a single model alternates
between attacker and defender roles – generating adversarial prompts and
safeguarding against them – while a reward LM adjudicates outcomes. This
enables dynamic co-adaptation. Grounded in the game-theoretic framework of
zero-sum games, we establish a theoretical safety guarantee which motivates the
design of our method: if self-play converges to a Nash Equilibrium, the
defender will reliably produce safe responses to any adversarial input.
Empirically, Self-RedTeam uncovers more diverse attacks (+21.8% SBERT) compared
to attackers trained against static defenders and achieves higher robustness on
safety benchmarks (e.g., +65.5% on WildJailBreak) than defenders trained
against static attackers. We further propose hidden Chain-of-Thought, allowing
agents to plan privately, which boosts adversarial diversity and reduces
over-refusals. Our results motivate a shift from reactive patching to proactive
co-evolution in LM safety training, enabling scalable, autonomous, and robust
self-improvement of LMs via multi-agent reinforcement learning (MARL).
[LINK]
http://arxiv.org/abs/2506.07468v3
[DATE]
2025-10-06 11:42:10+08:00
[CATEGORIES]
cs.LG
cs.CL
Less LLM, More Documents: Searching for Improved RAG
[AUTHORS]
Jingjie Ning, Yibo Kong, Yunfan Long, Jamie Callan
[ABSTRACT]
Retrieval-Augmented Generation (RAG) couples document retrieval with large
language models (LLMs). While scaling generators improves accuracy, it also
raises cost and limits deployability. We explore an orthogonal axis: enlarging
the retriever’s corpus to reduce reliance on large LLMs. Experimental results
show that corpus scaling consistently strengthens RAG and can often serve as a
substitute for increasing model size, though with diminishing returns at larger
scales. Small- and mid-sized generators paired with larger corpora often rival
much larger models with smaller corpora; mid-sized models tend to gain the
most, while tiny and large models benefit less. Our analysis shows that
improvements arise primarily from increased coverage of answer-bearing
passages, while utilization efficiency remains largely unchanged. These
findings establish a principled corpus-generator trade-off: investing in larger
corpora offers an effective path to stronger RAG, often comparable to enlarging
the LLM itself.
[LINK]
http://arxiv.org/abs/2510.02657v2
[DATE]
2025-10-06 10:54:21+08:00
[CATEGORIES]
cs.CL
OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
[AUTHORS]
Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova
[ABSTRACT]
In machine translation (MT), health is a high-stakes domain characterised by
widespread deployment and domain-specific vocabulary. However, there is a lack
of MT evaluation datasets for low-resource languages in this domain. To address
this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978
documents and 26,824 sentences from the World Health Organization’s e-learning
platform. Sourced from expert-authored, professionally translated materials
shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages,
of which nine are low-resource. Leveraging this new resource, we evaluate
modern large language models (LLMs) against traditional MT models. Our findings
reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5
Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our
low-resource test set. Further, we investigate how LLM context utilisation
affects accuracy, finding that the benefits of document-level translation are
most pronounced in specialised domains like health. We release the OpenWHO
corpus to encourage further research into low-resource MT in the health domain.
[COMMENTS]
Accepted at WMT 2025
[LINK]
http://arxiv.org/abs/2508.16048v5
[DATE]
2025-10-06 10:43:38+08:00
[CATEGORIES]
cs.CL
On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs
[AUTHORS]
Lucie Kunitomo-Jacquin, Edison Marrese-Taylor, Ken Fukuda
[COMMENTS]
Accepted to UncertaiNLP workshop of EMNLP 2025
[LINK]
http://arxiv.org/abs/2510.04439v1
[DATE]
2025-10-06 10:14:48+08:00
[CATEGORIES]
cs.CL
Good Intentions Beyond ACL: Who Does NLP for Social Good, and Where?
[AUTHORS]
Grace LeFevre, Qingcheng Zeng, Adam Leif, Jason Jewell, Denis Peskoff, Rob Voigt
[ABSTRACT]
The social impact of Natural Language Processing (NLP) is increasingly
important, with a rising community focus on initiatives related to NLP for
Social Good (NLP4SG). Indeed, in recent years, almost 20% of all papers in the
ACL Anthology address topics related to social good as defined by the UN
Sustainable Development Goals (Adauto et al., 2023). In this study, we take an
author- and venue-level perspective to map the landscape of NLP4SG, quantifying
the proportion of work addressing social good concerns both within and beyond
the ACL community, by both core ACL contributors and non-ACL authors. With this
approach we discover two surprising facts about the landscape of NLP4SG. First,
ACL authors are dramatically more likely to do work addressing social good
concerns when publishing in venues outside of ACL. Second, the vast majority of
publications using NLP techniques to address concerns of social good are done
by non-ACL authors in venues outside of ACL. We discuss the implications of
these findings on agenda-setting considerations for the ACL community related
to NLP4SG.
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2510.04434v1
[DATE]
2025-10-06 10:04:42+08:00
[CATEGORIES]
cs.CL
AutoBnB-RAG: Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation
[AUTHORS]
Zefang Liu, Arman Anwar
[ABSTRACT]
Incident response (IR) requires fast, coordinated, and well-informed
decision-making to contain and mitigate cyber threats. While large language
models (LLMs) have shown promise as autonomous agents in simulated IR settings,
their reasoning is often limited by a lack of access to external knowledge. In
this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that
incorporates retrieval-augmented generation (RAG) into multi-agent incident
response simulations. Built on the Backdoors & Breaches (B&B) tabletop game
environment, AutoBnB-RAG enables agents to issue retrieval queries and
incorporate external evidence during collaborative investigations. We introduce
two retrieval settings: one grounded in curated technical documentation
(RAG-Wiki), and another using narrative-style incident reports (RAG-News). We
evaluate performance across eight team structures, including newly introduced
argumentative configurations designed to promote critical reasoning. To
validate practical utility, we also simulate real-world cyber incidents based
on public breach reports, demonstrating AutoBnB-RAG’s ability to reconstruct
complex multi-stage attacks. Our results show that retrieval augmentation
improves decision quality and success rates across diverse organizational
models. This work demonstrates the value of integrating retrieval mechanisms
into LLM-based multi-agent systems for cybersecurity decision-making.
[LINK]
http://arxiv.org/abs/2508.13118v2
[DATE]
2025-10-06 09:53:26+08:00
[CATEGORIES]
cs.CL
Partial Information Decomposition via Normalizing Flows in Latent Gaussian Distributions
[AUTHORS]
Wenyuan Zhao, Adithya Balachandran, Chao Tian, Paul Pu Liang
[ABSTRACT]
The study of multimodality has garnered significant interest in fields where
the analysis of interactions among multiple information sources can enhance
predictive modeling, data fusion, and interpretability. Partial information
decomposition (PID) has emerged as a useful information-theoretic framework to
quantify the degree to which individual modalities independently, redundantly,
or synergistically convey information about a target variable. However,
existing PID methods depend on optimizing over a joint distribution constrained
by estimated pairwise probability distributions, which are costly and
inaccurate for continuous and high-dimensional modalities. Our first key
insight is that the problem can be solved efficiently when the pairwise
distributions are multivariate Gaussians, and we refer to this problem as
Gaussian PID (GPID). We propose a new gradient-based algorithm that
substantially improves the computational efficiency of GPID based on an
alternative formulation of the underlying optimization problem. To generalize
the applicability to non-Gaussian data, we learn information-preserving
encoders to transform random variables of arbitrary input distributions into
pairwise Gaussian random variables. Along the way, we resolved an open problem
regarding the optimality of joint Gaussian solutions for GPID. Empirical
validation in diverse synthetic examples demonstrates that our proposed method
provides more accurate and efficient PID estimates than existing baselines. We
further evaluate a series of large-scale multimodal benchmarks to show its
utility in real-world applications of quantifying PID in multimodal datasets
and selecting high-performing models.
[COMMENTS]
NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.04417v1
[DATE]
2025-10-06 09:08:34+08:00
[CATEGORIES]
cs.LG
cs.CL
Large Language Models Preserve Semantic Isotopies in Story Continuations
[AUTHORS]
Marc Cavazza
[ABSTRACT]
In this work, we explore the relevance of textual semantics to Large Language
Models (LLMs), extending previous insights into the connection between
distributional semantics and structural semantics. We investigate whether
LLM-generated texts preserve semantic isotopies. We design a story continuation
experiment using 10,000 ROCStories prompts completed by five LLMs. We first
validate GPT-4o’s ability to extract isotopies from a linguistic benchmark,
then apply it to the generated stories. We then analyze structural (coverage,
density, spread) and semantic properties of isotopies to assess how they are
affected by completion. Results show that LLM completion within a given token
horizon preserves semantic isotopies across multiple properties.
[LINK]
http://arxiv.org/abs/2510.04400v1
[DATE]
2025-10-06 08:03:12+08:00
[CATEGORIES]
cs.CL
SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations
[AUTHORS]
Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan, René Vidal
[ABSTRACT]
Large Language Models (LLMs) are increasingly deployed in high-risk domains.
However, state-of-the-art LLMs often produce hallucinations, raising serious
concerns about their reliability. Prior work has explored adversarial attacks
for hallucination elicitation in LLMs, but it often produces unrealistic
prompts, either by inserting gibberish tokens or by altering the original
meaning. As a result, these approaches offer limited insight into how
hallucinations may occur in practice. While adversarial attacks in computer
vision often involve realistic modifications to input images, the problem of
finding realistic adversarial prompts for eliciting LLM hallucinations has
remained largely underexplored. To address this gap, we propose Semantically
Equivalent and Coherent Attacks (SECA) to elicit hallucinations via realistic
modifications to the prompt that preserve its meaning while maintaining
semantic coherence. Our contributions are threefold: (i) we formulate finding
realistic attacks for hallucination elicitation as a constrained optimization
problem over the input prompt space under semantic equivalence and coherence
constraints; (ii) we introduce a constraint-preserving zeroth-order method to
effectively search for adversarial yet feasible prompts; and (iii) we
demonstrate through experiments on open-ended multiple-choice question
answering tasks that SECA achieves higher attack success rates while incurring
almost no constraint violations compared to existing methods. SECA highlights
the sensitivity of both open-source and commercial gradient-inaccessible LLMs
to realistic and plausible prompt variations. Code is available at
https://github.com/Buyun-Liang/SECA.
[COMMENTS]
Accepted at NeurIPS 2025. Code is available at
https://github.com/Buyun-Liang/SECA
[LINK]
http://arxiv.org/abs/2510.04398v1
[DATE]
2025-10-06 07:44:54+08:00
[CATEGORIES]
cs.CL
cs.LG
Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs
[AUTHORS]
Duy Le, Kent Ziti, Evan Girard-Sun, Bakr Bouhaya, Sean O’Brien, Vasu Sharma, Kevin Zhu
[ABSTRACT]
Multilingual riddle generation challenges large language models (LLMs) to
balance cultural fluency with creative abstraction. Standard prompting
strategies – zero-shot, few-shot, chain-of-thought – tend to reuse memorized
riddles or perform shallow paraphrasing. We introduce Adaptive Originality
Filtering (AOF), a prompting framework that filters redundant generations using
cosine-based similarity rejection, while enforcing lexical novelty and
cross-lingual fidelity. Evaluated across three LLMs and four language pairs,
AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915
Distinct-2 in Japanese, signaling improved lexical diversity and reduced
redundancy compared to other prompting methods and language pairs. Our findings
show that semantic rejection can guide culturally grounded, creative generation
without task-specific fine-tuning.
[LINK]
http://arxiv.org/abs/2508.18709v2
[DATE]
2025-10-06 07:38:18+08:00
[CATEGORIES]
cs.CL
Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation
[AUTHORS]
Ankit Vadehra, Bill Johnson, Gene Saunders, Pascal Poupart
[ABSTRACT]
Text editing can involve several iterations of revision. Incorporating an
efficient Grammar Error Correction (GEC) tool in the initial correction round
can significantly impact further human editing effort and final text quality.
This raises an interesting question to quantify GEC Tool usability: How much
effort can the GEC Tool save users? We present the first large-scale dataset of
post-editing (PE) time annotations and corrections for two English GEC test
datasets (BEA19 and CoNLL14). We introduce Post-Editing Effort in Time (PEET)
for GEC Tools as a human-focused evaluation scorer to rank any GEC Tool by
estimating PE time-to-correct. Using our dataset, we quantify the amount of
time saved by GEC Tools in text editing. Analysis of edit types indicated that
determining whether a sentence needs correction and edits like paraphrasing and
punctuation changes had the greatest impact on PE time. Finally, comparison
with human rankings shows that PEET correlates well with technical effort
judgment, providing a new human-centric direction for evaluating GEC tool
usability. We release our dataset and code at:
https://github.com/ankitvad/PEET_Scorer.
[COMMENTS]
Accepted for publication in the 4th HCI+NLP Workshop (Fourth Workshop
on Bridging Human-Computer Interaction and Natural Language Processing; part
of EMNLP 2025)
[LINK]
http://arxiv.org/abs/2510.04394v1
[DATE]
2025-10-06 07:24:24+08:00
[CATEGORIES]
cs.CL
cs.LG
Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards
[AUTHORS]
Faisal Hamman, Chenyang Zhu, Anoop Kumar, Xujun Peng, Sanghamitra Dutta, Daben Liu, Alfy Samuel
[COMMENTS]
Accepted at NeurIPS 2025 Workshop on Reliable ML from Unreliable Data
[LINK]
http://arxiv.org/abs/2510.04392v1
[DATE]
2025-10-06 07:14:13+08:00
[CATEGORIES]
cs.CL
cs.LG
MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator
[AUTHORS]
Xuehai He, Shijie Zhou, Thivyanth Venkateswaran, Kaizhi Zheng, Ziyu Wan, Achuta Kadambi, Xin Eric Wang
[ABSTRACT]
World models that support controllable and editable spatiotemporal
environments are valuable for robotics, enabling scalable training data,
reproducible evaluation, and flexible task design. While recent text-to-video
models generate realistic dynamics, they are constrained to 2D views and offer
limited interaction. We introduce MorphoSim, a language-guided framework that
generates 4D scenes with multi-view consistency and object-level controls. From
natural language instructions, MorphoSim produces dynamic environments where
objects can be directed, recolored, or removed, and scenes can be observed from
arbitrary viewpoints. The framework integrates trajectory-guided generation
with feature field distillation, allowing edits to be applied interactively
without full re-generation. Experiments show that MorphoSim maintains high
scene fidelity while enabling controllability and editability. The code is
available at https://github.com/eric-ai-lab/Morph4D.
[LINK]
http://arxiv.org/abs/2510.04390v1
[DATE]
2025-10-06 06:55:17+08:00
[CATEGORIES]
cs.CL
X-Teaming Evolutionary M2S: Automated Discovery of Multi-turn to Single-turn Jailbreak Templates
[AUTHORS]
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
[ABSTRACT]
Multi-turn-to-single-turn (M2S) compresses iterative red-teaming into one
structured prompt, but prior work relied on a handful of manually written
templates. We present X-Teaming Evolutionary M2S, an automated framework that
discovers and optimizes M2S templates through language-model-guided evolution.
The system pairs smart sampling from 12 sources with an LLM-as-judge inspired
by StrongREJECT and records fully auditable logs.
Maintaining selection pressure by setting the success threshold to $\theta =
0.70$, we obtain five evolutionary generations, two new template families, and
44.8% overall success (103/230) on GPT-4.1. A balanced cross-model panel of
2,500 trials (judge fixed) shows that structural gains transfer but vary by
target; two models score zero at the same threshold. We also find a positive
coupling between prompt length and score, motivating length-aware judging.
Our results demonstrate that structure-level search is a reproducible route
to stronger single-turn probes and underscore the importance of threshold
calibration and cross-model evaluation. Code, configurations, and artifacts are
available at https://github.com/hyunjun1121/M2S-x-teaming.
[COMMENTS]
NeurIPS 2025 Workshop on Lock-LLM
[LINK]
http://arxiv.org/abs/2509.08729v2
[DATE]
2025-10-06 06:27:29+08:00
[CATEGORIES]
cs.CL
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
[AUTHORS]
Hyunjun Kim, Junwoo Ha, Sangyoon Yu, Haon Park
[ABSTRACT]
LLM-as-a-Judge (LLMaaJ) now underpins scalable evaluation, yet we lack a
decisive test of a judge’s qualification: can it recover a conversation’s
latent objective and know when that inference is trustworthy? LLMs degrade
under irrelevant or long context; multi-turn jailbreaks further hide goals
across turns. We introduce ObjexMT, a benchmark for objective extraction and
metacognition. Given a multi-turn transcript, a model must return a
one-sentence base objective and self-reported confidence. Accuracy is computed
via LLM-judge semantic similarity to gold objectives, converted to binary
correctness by a human-aligned threshold calibrated on N=300 items (tau = 0.66;
F1 = 0.891). Metacognition is evaluated with ECE, Brier, Wrong at
High-Confidence (0.80/0.90/0.95), and risk-coverage. Across six models
(gpt-4.1, claude-sonnet-4, Qwen3-235B-A22B-FP8, kimi-k2, deepseek-v3.1,
gemini-2.5-flash) on three datasets, kimi-k2 attains the highest
objective-extraction accuracy (0.612), with claude-sonnet-4 (0.603) and
deepseek-v3.1 (0.599) statistically comparable. claude-sonnet-4 yields the best
selective risk and calibration (AURC 0.242; ECE 0.206; Brier 0.254). Dataset
heterogeneity (16-82 percent accuracy variance) reveals that automated
obfuscation poses fundamental challenges beyond model choice. High-confidence
errors persist: Wrong at 0.90 ranges from 14.9 percent (claude-sonnet-4) to
47.7 percent (Qwen3-235B-A22B-FP8). ObjexMT provides an actionable test for LLM
judges: when objectives are not explicit, judges often misinfer them; we
recommend exposing objectives when feasible and gating decisions by confidence
otherwise. Data at https://github.com/hyunjun1121/ObjexMT_dataset.
[LINK]
http://arxiv.org/abs/2508.16889v3
[DATE]
2025-10-06 06:27:27+08:00
[CATEGORIES]
cs.CL
FedMentor: Domain-Aware Differential Privacy for Heterogeneous Federated LLMs in Mental Health
[AUTHORS]
Nobin Sarwar, Shubhashis Roy Dipta
[ABSTRACT]
Privacy-preserving adaptation of Large Language Models (LLMs) in sensitive
domains (e.g., mental health) requires balancing strict confidentiality with
model utility and safety. We propose FedMentor, a federated fine-tuning
framework that integrates Low-Rank Adaptation (LoRA) and domain-aware
Differential Privacy (DP) to meet per-domain privacy budgets while maintaining
performance. Each client (domain) applies a custom DP noise scale proportional
to its data sensitivity, and the server adaptively reduces noise when utility
falls below a threshold. In experiments on three mental health datasets, we
show that FedMentor improves safety over standard Federated Learning (FL)
without privacy, raising safe output rates by up to three points and lowering
toxicity, while maintaining utility (BERTScore F1 and ROUGE-L) within 0.5% of
the non-private baseline and close to the centralized upper bound. The
framework scales to backbones with up to 1.7B parameters on single-GPU clients,
requiring < 173 MB of communication per round. FedMentor demonstrates a
practical approach to privately fine-tune LLMs for safer deployments in
healthcare and other sensitive fields.
[COMMENTS]
NeurIPS 2025 GenAI4Health Workshop
[LINK]
http://arxiv.org/abs/2509.14275v2
[DATE]
2025-10-06 05:41:04+08:00
[CATEGORIES]
cs.CL
cs.LG
AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery
[AUTHORS]
Amirhossein Abaskohi, Amrutha Varshini Ramesh, Shailesh Nanisetty, Chirag Goel, David Vazquez, Christopher Pal, Spandana Gella, Giuseppe Carenini, Issam H. Laradji
[ABSTRACT]
We introduce AgentAda, the first LLM-powered analytics agent that can learn
and use new analytics skills to extract more specialized insights. Unlike
existing methods that require users to manually decide which data analytics
method to apply, AgentAda automatically identifies the skill needed from a
library of analytical skills to perform the analysis. This also allows AgentAda
to use skills that existing LLMs cannot perform out of the box. The library
covers a range of methods, including clustering, predictive modeling, and NLP
techniques like BERT, which allow AgentAda to handle complex analytics tasks
based on what the user needs. AgentAda’s dataset-to-insight extraction strategy
consists of three key steps: (I) a question generator to generate queries
relevant to the user’s goal and persona, (II) a hybrid Retrieval-Augmented
Generation (RAG)-based skill matcher to choose the best data analytics skill
from the skill library, and (III) a code generator that produces executable
code based on the retrieved skill’s documentation to extract key patterns. We
also introduce KaggleBench, a benchmark of curated notebooks across diverse
domains, to evaluate AgentAda’s performance. We conducted a human evaluation
demonstrating that AgentAda provides more insightful analytics than existing
tools, with 48.78% of evaluators preferring its analyses, compared to 27.67%
for the unskilled agent. We also propose a novel LLM-as-a-judge approach that
we show is aligned with human evaluation as a way to automate insight quality
evaluation at larger scale.
[LINK]
http://arxiv.org/abs/2504.07421v2
[DATE]
2025-10-06 05:28:53+08:00
[CATEGORIES]
cs.CL
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
[AUTHORS]
Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha
[ABSTRACT]
The progress of AI is bottlenecked by the quality of evaluation, making
powerful LLM-as-a-Judge models a core solution. The efficacy of these judges
depends on their chain-of-thought reasoning, creating a critical need for
methods that can effectively optimize this reasoning process. In this work, we
introduce J1, a reinforcement learning framework for teaching LLM judges to
think before making decisions. Our core contribution lies in converting all
judgment tasks for non-verifiable and verifiable prompts into a unified format
with verifiable rewards, enabling direct optimization of evaluation quality
while mitigating positional bias. We then use RL to train thinking-judges at
scales of 8B, 32B, and 70B and show that they obtain state-of-the-art
performance across multiple benchmarks. In particular, J1-Qwen-32B, our
multitasked pointwise and pairwise judge, also outperforms o1-mini, o3, and a
much larger 671B DeepSeek-R1 on some benchmarks, while only training on
synthetic data. Through comprehensive ablations of pairwise, pointwise, and
multitask J1 variants, we demonstrate the effectiveness of our approach across
seed prompts, reward strategies, and training recipes. Qualitative analysis
reveals that J1 develops systematic evaluation strategies, including dynamic
criteria generation, reference answer creation, iterative self-correction of
initial assessments, and feedback generation for low-quality responses.
[COMMENTS]
10 pages, 13 tables, 14 figures
[LINK]
http://arxiv.org/abs/2505.10320v2
[DATE]
2025-10-06 05:28:03+08:00
[CATEGORIES]
cs.CL
cs.LG
Fun-ASR Technical Report
[AUTHORS]
Keyu An, Yanni Chen, Chong Deng, Changfeng Gao, Zhifu Gao, Bo Gong, Xiangang Li, Yabin Li, Xiang Lv, Yunjie Ji, Yiheng Jiang, Bin Ma, Haoneng Luo, Chongjia Ni, Zexu Pan, Yiping Peng, Zhendong Peng, Peiyao Wang, Hao Wang, Wen Wang, Wupeng Wang, Biao Tian, Zhentao Tan, Nan Yang, Bin Yuan, Jieping Ye, Jixing Yu, Qinglin Zhang, Kun Zou, Han Zhao, Shengkui Zhao, Jingren Zhou
[ABSTRACT]
In recent years, automatic speech recognition (ASR) has witnessed
transformative advancements driven by three complementary paradigms: data
scaling, model size scaling, and deep integration with large language models
(LLMs). However, LLMs are prone to hallucination, which can significantly
degrade user experience in real-world ASR applications. In this paper, we
present Fun-ASR, a large-scale, LLM-based ASR system that synergistically
combines massive data, large model capacity, LLM integration, and reinforcement
learning to achieve state-of-the-art performance across diverse and complex
speech recognition scenarios. Moreover, Fun-ASR is specifically optimized for
practical deployment, with enhancements in streaming capability, noise
robustness, code-switching, hotword customization, and satisfying other
real-world application requirements. Experimental results show that while most
LLM-based ASR systems achieve strong performance on open-source benchmarks,
they often underperform on real industry evaluation sets. Thanks to
production-oriented optimizations, Fun-ASR achieves state-of-the-art
performance on real application datasets, demonstrating its effectiveness and
robustness in practical settings.
[COMMENTS]
Authors are listed in alphabetical order
[LINK]
http://arxiv.org/abs/2509.12508v3
[DATE]
2025-10-06 05:27:32+08:00
[CATEGORIES]
cs.CL
MacroBench: A Novel Testbed for Web Automation Scripts via Large Language Models
[AUTHORS]
Hyunjun Kim, Sejong Kim
[COMMENTS]
NeurIPS 2025 Workshop on Lock-LLM
[LINK]
http://arxiv.org/abs/2510.04363v1
[DATE]
2025-10-06 05:15:11+08:00
[CATEGORIES]
cs.CL
Geometry of orofacial neuromuscular signals: speech articulation decoding using surface electromyography
[AUTHORS]
Harshavardhana T. Gowda, Zachary D. McNaughton, Lee M. Miller
[ABSTRACT]
Objective. In this article, we present data and methods for decoding speech
articulations using surface electromyogram (EMG) signals. EMG-based speech
neuroprostheses offer a promising approach for restoring audible speech in
individuals who have lost the ability to speak intelligibly due to
laryngectomy, neuromuscular diseases, stroke, or trauma-induced damage (e.g.,
from radiotherapy) to the speech articulators.
Approach. To achieve this, we collect EMG signals from the face, jaw, and
neck as subjects articulate speech, and we perform EMG-to-speech translation.
Main results. Our findings reveal that the manifold of symmetric positive
definite (SPD) matrices serves as a natural embedding space for EMG signals.
Specifically, we provide an algebraic interpretation of the manifold-valued EMG
data using linear transformations, and we analyze and quantify distribution
shifts in EMG signals across individuals.
Significance. Overall, our approach demonstrates significant potential for
developing neural networks that are both data- and parameter-efficient, an
important consideration for EMG-based systems, which face challenges in
large-scale data collection and operate under limited computational resources
on embedded devices.
[LINK]
http://arxiv.org/abs/2411.02591v3
[DATE]
2025-10-06 02:45:15+08:00
[CATEGORIES]
cs.CL
Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists?
[AUTHORS]
Anthony GX-Chen, Dongyan Lin, Mandana Samiei, Doina Precup, Blake A. Richards, Rob Fergus, Kenneth Marino
[ABSTRACT]
Language model (LM) agents are increasingly used as autonomous
decision-makers which need to actively gather information to guide their
decisions. A crucial cognitive skill for such agents is the efficient
exploration and understanding of the causal structure of the world – key to
robust, scientifically grounded reasoning. Yet, it remains unclear whether LMs
possess this capability or exhibit systematic biases leading to erroneous
conclusions. In this work, we examine LMs’ ability to explore and infer causal
relationships, using the well-established Blicket Test paradigm from
developmental psychology. We find that LMs reliably infer the common, intuitive
disjunctive causal relationships but systematically struggle with the unusual,
yet equally (or sometimes even more) evidenced conjunctive ones. This
“disjunctive bias” persists across model families, sizes, and prompting
strategies, and performance further declines as task complexity increases.
Interestingly, an analogous bias appears in human adults, suggesting that LMs
may have inherited deep-seated reasoning heuristics from their training data.
To this end, we quantify similarities between LMs and humans, finding that LMs
exhibit adult-like inference profiles (but not child-like). Finally, we propose
a test-time sampling method which explicitly samples and eliminates hypotheses
about causal relationships from the LM. This scalable approach significantly
reduces the disjunctive bias and moves LMs closer to the goal of scientific,
causally rigorous reasoning.
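The hypothesis-elimination idea can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract describes sampling hypotheses from the LM, whereas this sketch enumerates them exhaustively as a stand-in, and the names `machine_on`, `all_hypotheses`, and `eliminate` as well as the toy observations are hypothetical.

```python
from itertools import combinations

def machine_on(placed, blickets, rule):
    """Predict whether the machine activates for the objects placed on it."""
    present = blickets & placed
    if rule == "disjunctive":
        # Any single blicket is enough to activate the machine.
        return len(present) >= 1
    # Conjunctive: every blicket must be on the machine at the same time.
    return present == blickets

def all_hypotheses(objects):
    """Enumerate (blicket set, rule) pairs; conjunctive rules need >= 2 blickets."""
    hyps = []
    for size in range(1, len(objects) + 1):
        for subset in combinations(objects, size):
            hyps.append((frozenset(subset), "disjunctive"))
            if size >= 2:
                hyps.append((frozenset(subset), "conjunctive"))
    return hyps

def eliminate(hypotheses, observations):
    """Keep only hypotheses that reproduce every observed outcome."""
    return [
        (blickets, rule)
        for blickets, rule in hypotheses
        if all(machine_on(frozenset(placed), blickets, rule) == outcome
               for placed, outcome in observations)
    ]

# Toy evidence: no single object activates the machine, but A and B together do.
observations = [
    ({"A"}, False), ({"B"}, False), ({"C"}, False),
    ({"A", "B"}, True), ({"A", "C"}, False), ({"B", "C"}, False),
]
remaining = eliminate(all_hypotheses(["A", "B", "C"]), observations)
```

With this evidence, every disjunctive hypothesis is ruled out and only the conjunctive hypothesis over A and B survives, which is the kind of conclusion the disjunctive bias makes hard to reach.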
[COMMENTS]
Conference on Language Modelling (COLM) 2025, Camera Ready
[LINK]
http://arxiv.org/abs/2505.09614v3
[DATE]
2025-10-06 02:28:02+08:00
[CATEGORIES]
cs.CL
MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science
[AUTHORS]
Ran Xu, Yuchen Zhuang, Yishan Zhong, Yue Yu, Zifeng Wang, Xiangru Tang, Hang Wu, May D. Wang, Peifeng Ruan, Donghan Yang, Tao Wang, Guanghua Xiao, Xin Liu, Carl Yang, Yang Xie, Wenqi Shi
[ABSTRACT]
We introduce MedAgentGym, a scalable and interactive training environment
designed to enhance coding-based biomedical reasoning capabilities in large
language model (LLM) agents. MedAgentGym comprises 72,413 task instances across
129 categories derived from 12 authentic real-world biomedical scenarios. Tasks
are encapsulated within executable sandbox environments, each featuring
detailed task specifications, interactive feedback mechanisms, verifiable
ground truth annotations, and scalable training trajectory generation.
Extensive benchmarking of 29 LLMs reveals substantial performance disparities
in biomedical data science between commercial and open-source LLMs. Leveraging
efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym,
Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and
online reinforcement learning, respectively, demonstrating MedAgentGym as an
effective training ground while establishing itself as a cost-effective,
privacy-preserving alternative competitive with proprietary LLMs (gpt-4o). By
offering a unified execution environment with a comprehensive benchmark and
accessible, extensible training resources, MedAgentGym delivers an integrated
platform to develop LLM-based coding assistants for advanced biomedical data
science.
[LINK]
http://arxiv.org/abs/2506.04405v2
[DATE]
2025-10-06 01:59:37+08:00
[CATEGORIES]
cs.CL
cs.LG
Wave-PDE Nets: Trainable Wave-Equation Layers as an Alternative to Attention
[AUTHORS]
Harshil Vejendla
[ABSTRACT]
We introduce Wave-PDE Nets, a neural architecture whose elementary operation
is a differentiable simulation of the second-order wave equation. Each layer
propagates its hidden state as a continuous field through a medium with
trainable spatial velocity c(x) and damping $\gamma(x)$. A symplectic spectral
solver based on FFTs realises this propagation in O(n log n) time. This
oscillatory, global mechanism provides a powerful alternative to attention and
first-order state-space models. We prove that a single Wave-PDE layer is a
universal approximator. On language and vision benchmarks, Wave-PDE Nets match
or exceed Transformer performance while demonstrating superior practical
efficiency, reducing wall-clock time by up to 30% and peak memory by 25%.
Ablation studies confirm the critical role of symplectic integration and a
spectral Laplacian for stability and performance. Visualizations of the learned
physical parameters reveal that the model learns intuitive strategies for
information propagation. These results position Wave-PDE Nets as a
computationally efficient and robust architecture with a strong physical
inductive bias.
[COMMENTS]
PRICAI 2025 Oral, 9 pages, 3 figures
[LINK]
http://arxiv.org/abs/2510.04304v1
[DATE]
2025-10-06 01:52:52+08:00
[CATEGORIES]
cs.LG
cs.CL
Network Formation and Dynamics Among Multi-LLMs
[AUTHORS]
Marios Papachristou, Yuan Yuan
[ABSTRACT]
Social networks profoundly influence how humans form opinions, exchange
information, and organize collectively. As large language models (LLMs) are
increasingly embedded into social and professional environments, it is critical
to understand whether their interactions approximate human-like network
dynamics. We develop a framework to study the network formation behaviors of
multiple LLM agents and benchmark them against human decisions. Across
synthetic and real-world settings, including friendship, telecommunication, and
employment networks, we find that LLMs consistently reproduce fundamental
micro-level principles such as preferential attachment, triadic closure, and
homophily, as well as macro-level properties including community structure and
small-world effects. Importantly, the relative emphasis of these principles
adapts to context: for example, LLMs favor homophily in friendship networks but
heterophily in organizational settings, mirroring patterns of social mobility.
A controlled human-subject survey confirms strong alignment between LLMs and
human participants in link-formation decisions. These results establish that
LLMs can serve as powerful tools for social simulation and synthetic data
generation, while also raising critical questions about bias, fairness, and the
design of AI systems that participate in human networks.
[COMMENTS]
Accepted at PNAS Nexus
[LINK]
http://arxiv.org/abs/2402.10659v7
[DATE]
2025-10-06 01:06:50+08:00
[CATEGORIES]
cs.CL
Equipping Retrieval-Augmented Large Language Models with Document Structure Awareness
[AUTHORS]
Lingnan Xu, Chong Feng, Kaiyuan Zhang, Liu Zhengyong, Wenqiang Xu, Fanqing Meng
[ABSTRACT]
While large language models (LLMs) demonstrate impressive capabilities, their
reliance on parametric knowledge often leads to factual inaccuracies.
Retrieval-Augmented Generation (RAG) mitigates this by leveraging external
documents, yet existing approaches treat retrieved passages as isolated chunks,
ignoring valuable structure that is crucial for document organization.
Motivated by this gap, we propose Retrieve-DocumentRoute-Read (RDR2), a novel
framework that explicitly incorporates structural information throughout the
RAG process. RDR2 employs an LLM-based router to dynamically navigate document
structure trees, jointly evaluating content relevance and hierarchical
relationships to assemble optimal evidence. Our key innovation lies in
formulating document routing as a trainable task, with automatic action
curation and structure-aware passage selection inspired by human reading
strategies. Through comprehensive evaluation on five challenging datasets, RDR2
achieves state-of-the-art performance, demonstrating that explicit structural
awareness significantly enhances RAG systems’ ability to acquire and utilize
knowledge, particularly in complex scenarios requiring multi-document
synthesis.
[COMMENTS]
EMNLP2025 Findings
[LINK]
http://arxiv.org/abs/2510.04293v1
[DATE]
2025-10-06 01:04:24+08:00
[CATEGORIES]
cs.CL
PABSA: Hybrid Framework for Persian Aspect-Based Sentiment Analysis
[AUTHORS]
Mehrzad Tareh, Aydin Mohandesi, Ebrahim Ansari
[ABSTRACT]
Sentiment analysis is a key task in Natural Language Processing (NLP),
enabling the extraction of meaningful insights from user opinions across
various domains. However, performing sentiment analysis in Persian remains
challenging due to the scarcity of labeled datasets, limited preprocessing
tools, and the lack of high-quality embeddings and feature extraction methods.
To address these limitations, we propose a hybrid approach that integrates
machine learning (ML) and deep learning (DL) techniques for Persian
aspect-based sentiment analysis (ABSA). In particular, we utilize polarity
scores from multilingual BERT as additional features and incorporate them into
a decision tree classifier, achieving an accuracy of 93.34%, surpassing existing
benchmarks on the Pars-ABSA dataset. Additionally, we introduce a Persian
synonym and entity dictionary, a novel linguistic resource that supports text
augmentation through synonym and named entity replacement. Our results
demonstrate the effectiveness of hybrid modeling and feature augmentation in
advancing sentiment analysis for low-resource languages such as Persian.
[COMMENTS]
8 pages
[LINK]
http://arxiv.org/abs/2510.04291v1
[DATE]
2025-10-06 01:02:31+08:00
[CATEGORIES]
cs.CL
cs.LG
SliceMoE: Routing Embedding Slices Instead of Tokens for Fine-Grained and Balanced Transformer Scaling
[AUTHORS]
Harshil Vejendla
[ABSTRACT]
Mixture-of-Experts (MoE) layers scale transformers by routing tokens to a
sparse subset of feed-forward experts. Token-level routing, however, assigns an
entire semantic spectrum to each expert, creating capacity bottlenecks,
load-balancing pathologies, and limited specialization. We introduce SliceMoE,
an architecture that routes contiguous slices of a token’s hidden vector. A
d-dimensional embedding is partitioned into S slices, and for each slice, a
lightweight shared router predicts the top-k experts. Experts operate on their
assigned slices independently, and outputs are reassembled, maintaining
per-token FLOP efficiency. Because slices from different tokens interleave
within an expert, utilization is naturally smoother. We propose a slice-level
capacity loss, cross-slice dropout, and efficient fused batched GEMM kernels.
Experiments on WikiText-103 language modeling, WMT En-De translation, and three
text-classification datasets show SliceMoE attains up to 1.7x faster inference
than dense baselines, 12 to 18 percent lower perplexity than parameter-matched
token-MoE, and improved expert balance, with interpretable expertise over
syntactic versus semantic subspaces.
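A minimal sketch of slice-level routing may help make the mechanism concrete. Everything specific here is an assumption rather than the author's implementation: the router is taken to be a linear scorer shared across slices, the selected experts are mixed with a softmax over their scores, and the names `split_slices`, `top_k`, and `slice_moe` plus the toy experts are hypothetical.

```python
import math

def split_slices(vec, num_slices):
    """Partition a d-dimensional vector into contiguous, equal-width slices."""
    if len(vec) % num_slices != 0:
        raise ValueError("embedding width must be divisible by the number of slices")
    width = len(vec) // num_slices
    return [vec[i * width:(i + 1) * width] for i in range(num_slices)]

def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def slice_moe(vec, num_slices, router, experts, k=1):
    """Route each slice to its top-k experts and reassemble the outputs.

    router:  one weight vector per expert, shared by all slices (assumed linear).
    experts: callables mapping a slice to a slice of the same width.
    """
    out = []
    for sl in split_slices(vec, num_slices):
        scores = [sum(w * x for w, x in zip(weights, sl)) for weights in router]
        chosen = top_k(scores, k)
        # Softmax over the chosen experts' scores gives the mixing weights.
        peak = max(scores[i] for i in chosen)
        gates = [math.exp(scores[i] - peak) for i in chosen]
        norm = sum(gates)
        mixed = [0.0] * len(sl)
        for gate, idx in zip(gates, chosen):
            expert_out = experts[idx](sl)
            mixed = [acc + (gate / norm) * v for acc, v in zip(mixed, expert_out)]
        out.extend(mixed)  # concatenate slice outputs back into a full vector
    return out

# Toy setup: two slices of width 2 and two experts (doubling and negation).
router = [[1.0, 1.0], [-1.0, -1.0]]
experts = [lambda s: [2.0 * v for v in s], lambda s: [-v for v in s]]
result = slice_moe([1.0, 2.0, -3.0, -4.0], 2, router, experts, k=1)
```

In this toy case the two slices of a single token go to different experts, which is the fine-grained behaviour that token-level routing cannot express.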
[COMMENTS]
EMNLP 2025 Main, 8 pages, 9 figures
[LINK]
http://arxiv.org/abs/2510.04286v1
[DATE]
2025-10-06 00:57:32+08:00
[CATEGORIES]
cs.CL
cs.LG
Probing Geometry of Next Token Prediction Using Cumulant Expansion of the Softmax Entropy
[AUTHORS]
Karthik Viswanathan, Sang Eon Park
[ABSTRACT]
We introduce a cumulant-expansion framework for quantifying how large
language models (LLMs) internalize higher-order statistical structure during
next-token prediction. By treating the softmax entropy of each layer’s logit
distribution as a perturbation around its “center” distribution, we derive
closed-form cumulant observables that isolate successively higher-order
correlations. Empirically, we track these cumulants in GPT-2 and Pythia models
on Pile-10K prompts. (i) Structured prompts exhibit a characteristic
rise-and-plateau profile across layers, whereas token-shuffled prompts remain
flat, revealing the dependence of the cumulant profile on meaningful context.
(ii) During training, all cumulants increase monotonically before saturating,
directly visualizing the model’s progression from capturing variance to
learning skew, kurtosis, and higher-order statistical structures. (iii)
Mathematical prompts show distinct cumulant signatures compared to general
text, quantifying how models employ fundamentally different processing
mechanisms for mathematical versus linguistic content. Together, these results
establish cumulant analysis as a lightweight, mathematically grounded probe of
feature-learning dynamics in high-dimensional neural networks.
[COMMENTS]
14 pages, 7 figures. Poster at HiLD 2025: 3rd Workshop on
High-dimensional Learning Dynamics
[LINK]
http://arxiv.org/abs/2510.04285v1
[DATE]
2025-10-06 00:55:58+08:00
[CATEGORIES]
cs.CL
cs.LG
XAutoLM: Efficient Fine-Tuning of Language Models via Meta-Learning and AutoML
[AUTHORS]
Ernesto L. Estevanell-Valladares, Suilan Estevez-Velarde, Yoan Gutiérrez, Andrés Montoyo, Ruslan Mitkov
[ABSTRACT]
Experts in machine learning leverage domain knowledge to navigate decisions
in model selection, hyperparameter optimization, and resource allocation. This
is particularly critical for fine-tuning language models (LMs), where repeated
trials incur substantial computational overhead and environmental impact.
However, no existing automated framework simultaneously tackles the entire
model selection and hyperparameter optimization (HPO) task for
resource-efficient LM fine-tuning. We introduce XAutoLM, a
meta-learning-augmented AutoML framework that reuses past experiences to
optimize discriminative and generative LM fine-tuning pipelines efficiently.
XAutoLM learns from stored successes and failures by extracting task- and
system-level meta-features to bias its sampling toward valuable configurations
and away from costly dead ends. On four text classification and two
question-answering benchmarks, XAutoLM surpasses the zero-shot optimizer’s peak F1
on five of six tasks, cuts mean evaluation time of pipelines by up to 4.5x,
reduces search error ratios by up to sevenfold, and uncovers up to 50% more
pipelines above the zero-shot Pareto front. In contrast, simpler memory-based
baselines suffer negative transfer. We release XAutoLM and our experience store
to catalyze resource-efficient, Green AI fine-tuning in the NLP community.
[COMMENTS]
18 pages, 10 figures, 7 tables. Preprint. Accepted at EMNLP 2025
[LINK]
http://arxiv.org/abs/2508.00924v3
[DATE]
2025-10-06 00:40:38+08:00
[CATEGORIES]
cs.CL
On Pruning State-Space LLMs
[AUTHORS]
Tamer Ghattas, Michael Hassid, Roy Schwartz
[ABSTRACT]
Recent work proposed state-space models (SSMs) as an efficient alternative to
transformer-based LLMs. Can these models be pruned to further reduce their
computation costs? We adapt several pruning methods to the SSM structure, and
apply them to four SSM-based LLMs across multiple tasks. We find that such
models are quite robust to some pruning methods (e.g., WANDA), while other
methods lead to rapid performance degradation.
[LINK]
http://arxiv.org/abs/2502.18886v2
[DATE]
2025-10-06 00:30:59+08:00
[CATEGORIES]
cs.CL
cs.LG
Don’t Pass$\mathtt{@}k$: A Bayesian Framework for Large Language Model Evaluation
[AUTHORS]
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary
[ABSTRACT]
Pass$@k$ is widely used to report performance for LLM reasoning, but it often
yields unstable, misleading rankings, especially when the number of trials
(samples) is limited and compute is constrained. We present a principled
Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over
$N$ trials (avg$@N$) with posterior estimates of a model’s underlying success
probability and credible intervals, yielding stable rankings and a transparent
decision rule for differences. Evaluation outcomes are modeled as categorical
(not just 0/1) with a Dirichlet prior, giving closed-form expressions for the
posterior mean and uncertainty of any weighted rubric and enabling the use of
prior evidence when appropriate. Theoretically, under a uniform prior, the
Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$),
explaining its empirical robustness while adding principled uncertainty.
Empirically, in simulations with known ground-truth success rates and on
AIME’24/’25, HMMT’25, and BrUMO’25, the Bayesian/avg procedure achieves faster
convergence and greater rank stability than Pass$@k$ and recent variants,
enabling reliable comparisons at far smaller sample counts. The framework
clarifies when observed gaps are statistically meaningful (non-overlapping
credible intervals) versus noise, and it naturally extends to graded,
rubric-based evaluations. Together, these results recommend replacing Pass$@k$
for LLM evaluation and ranking with a posterior-based, compute-efficient
protocol that unifies binary and non-binary evaluation while making uncertainty
explicit. Code is available at https://mohsenhariri.github.io/bayes-kit
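As a hedged illustration of the closed-form posterior described above, the sketch below computes the posterior mean and standard deviation of a weighted rubric score under a symmetric Dirichlet prior. The function name `rubric_posterior`, the `alpha` parameter, and the example counts are assumptions made for illustration; this is not the authors' bayes-kit code.

```python
import math

def rubric_posterior(counts, weights, alpha=1.0):
    """Posterior mean and std of sum_i weights[i] * p[i].

    counts[i]  : number of trials that landed in outcome category i
    weights[i] : rubric score assigned to category i
    alpha      : symmetric Dirichlet prior concentration (1.0 = uniform prior)
    """
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    post = [alpha + n for n in counts]       # Dirichlet posterior parameters
    total = sum(post)
    probs = [a / total for a in post]        # posterior mean of each p[i]
    mean = sum(w * m for w, m in zip(weights, probs))
    second = sum(w * w * m for w, m in zip(weights, probs))
    # Variance of a linear functional of a Dirichlet-distributed vector.
    var = (second - mean * mean) / (total + 1.0)
    return mean, math.sqrt(max(var, 0.0))

# Binary case: 3 failures and 7 successes, scored 0 and 1.
binary_mean, binary_std = rubric_posterior([3, 7], [0.0, 1.0])

# Graded case: categories scored 0, 0.5, and 1.
graded_mean, graded_std = rubric_posterior([2, 3, 5], [0.0, 0.5, 1.0])
```

In the binary case with a uniform prior this reduces to (successes + 1) / (trials + 2), which increases with the number of successes for a fixed number of trials and so ranks models in the same order as average accuracy.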
[COMMENTS]
Code and simulations: https://mohsenhariri.github.io/bayes-kit
[LINK]
http://arxiv.org/abs/2510.04265v1
[DATE]
2025-10-06 00:14:03+08:00
[CATEGORIES]
cs.CL
CEMTM: Contextual Embedding-based Multimodal Topic Modeling
[AUTHORS]
Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini
[COMMENTS]
EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.11465v2
[DATE]
2025-10-06 00:11:19+08:00
[CATEGORIES]
cs.CL
cs.LG
Feasibility-Aware Decision-Focused Learning for Predicting Parameters in the Constraints
[AUTHORS]
Jayanta Mandi, Marianne Defresne, Senne Berden, Tias Guns
[ABSTRACT]
When some parameters of a constrained optimization problem (COP) are
uncertain, this gives rise to a predict-then-optimize (PtO) problem, comprising
two stages – the prediction of the unknown parameters from contextual
information and the subsequent optimization using those predicted parameters.
Decision-focused learning (DFL) implements the first stage by training a
machine learning (ML) model to optimize the quality of the decisions made using
the predicted parameters. When parameters in the constraints of a COP are
predicted, the predicted parameters can lead to infeasible solutions.
Therefore, it is important to simultaneously manage both feasibility and
decision quality. We develop a DFL framework for predicting constraint
parameters in a generic COP. While prior works typically assume that the
underlying optimization problem is a linear program (LP) or integer linear
program (ILP), our approach makes no such assumption. We derive two novel loss
functions based on maximum likelihood estimation (MLE): the first one penalizes
infeasibility (by penalizing when the predicted parameters lead to infeasible
solutions), and the second one penalizes suboptimal decisions (by penalizing
when the true optimal solution is infeasible under the predicted parameters).
We introduce a single tunable parameter to form a weighted average of the two
losses, allowing decision-makers to balance suboptimality and feasibility. We
experimentally demonstrate that adjusting this parameter gives the
decision-maker control over the trade-off between the two. Moreover, across
several COP instances, we find that for a single value of the tunable
parameter, our method matches the performance of the existing baselines on
suboptimality and feasibility.
[LINK]
http://arxiv.org/abs/2510.04951v1
[DATE]
2025-10-06 23:52:03+08:00
[CATEGORIES]
cs.LG
What Lurks Within? Concept Auditing for Shared Diffusion Models at Scale
[AUTHORS]
Xiaoyong Yuan, Xiaolong Ma, Linke Guo, Lan Zhang
[ABSTRACT]
Diffusion models (DMs) have revolutionized text-to-image generation, enabling
the creation of highly realistic and customized images from text prompts. With
the rise of parameter-efficient fine-tuning (PEFT) techniques, users can now
customize powerful pre-trained models using minimal computational resources.
However, the widespread sharing of fine-tuned DMs on open platforms raises
growing ethical and legal concerns, as these models may inadvertently or
deliberately generate sensitive or unauthorized content. Despite increasing
regulatory attention on generative AI, there are currently no practical tools
for systematically auditing these models before deployment.
In this paper, we address the problem of concept auditing: determining
whether a fine-tuned DM has learned to generate a specific target concept.
Existing approaches typically rely on prompt-based input crafting and
output-based image classification, but they suffer from critical limitations,
including prompt uncertainty, concept drift, and poor scalability. To overcome
these challenges, we introduce Prompt-Agnostic Image-Free Auditing (PAIA), a
novel, model-centric concept auditing framework. By treating the DM as the
object of inspection, PAIA enables direct analysis of internal model behavior,
bypassing the need for optimized prompts or generated images. We evaluate PAIA
on 320 controlled models trained with curated concept datasets and 771
real-world community models sourced from a public DM sharing platform.
Evaluation results show that PAIA achieves over 90% detection accuracy while
reducing auditing time by 18-40x compared to existing baselines. To our
knowledge, PAIA is the first scalable and practical solution for pre-deployment
concept auditing of diffusion models, providing a practical foundation for
safer and more transparent diffusion model sharing.
[COMMENTS]
Extended version of the paper accepted at CCS 2025
[LINK]
http://arxiv.org/abs/2504.14815v2
[DATE]
2025-10-06 23:50:02+08:00
[CATEGORIES]
cs.LG
Unsupervised Active Learning via Natural Feature Progressive Framework
[AUTHORS]
Yuxi Liu, Catherine Lalman, Yimin Yang
[ABSTRACT]
The effectiveness of modern deep learning models is predicated on the
availability of large-scale, human-annotated datasets, a process that is
notoriously expensive and time-consuming. While Active Learning (AL) offers a
strategic solution by labeling only the most informative and representative
data, its iterative nature still necessitates significant human involvement.
Unsupervised Active Learning (UAL) presents an alternative by shifting the
annotation burden to a single, post-selection step. Unfortunately, prevailing
UAL methods struggle to achieve state-of-the-art performance. These approaches
typically rely on local, gradient-based scoring for sample importance
estimation, which not only makes them vulnerable to ambiguous and noisy data
but also hinders their capacity to select samples that adequately represent the
full data distribution. Moreover, their use of shallow, one-shot linear
selection falls short of a true UAL paradigm. In this paper, we propose the
Natural Feature Progressive Framework (NFPF), a UAL method that revolutionizes
how sample importance is measured. At its core, NFPF employs a Specific Feature
Learning Machine (SFLM) to effectively quantify each sample’s contribution to
model performance. We further utilize the SFLM to define a powerful
Reconstruction Difference metric for initial sample selection. Our
comprehensive experiments show that NFPF significantly outperforms all
established UAL methods and achieves performance on par with supervised AL
methods on vision datasets. Detailed ablation studies and qualitative
visualizations provide compelling evidence for NFPF’s superior performance,
enhanced robustness, and improved data distribution coverage.
[COMMENTS]
Under review at IEEE TPAMI
[LINK]
http://arxiv.org/abs/2510.04939v1
[DATE]
2025-10-06 23:44:33+08:00
[CATEGORIES]
cs.LG
Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
[AUTHORS]
Ali Saheb Pasand, Elvis Dohmatob
[ABSTRACT]
Grokking is the phenomenon whereby, unlike the training performance, which
peaks early in the training process, the test/generalization performance of a
model stagnates over arbitrarily many epochs and then suddenly jumps, usually
to near-perfect levels. In practice, it is desirable to reduce the length of
such plateaus, that is, to make the learning process “grok” faster. In this
work, we provide new insights into grokking. First, we show both empirically
and theoretically that grokking can be induced by asymmetric speeds of
(stochastic) gradient descent along different principal (i.e., singular)
directions of the gradients. We then propose a simple modification that
normalizes the gradients so that the dynamics along all principal directions
evolve at exactly the same speed. Then, we establish that this modified
method, which we call egalitarian gradient descent (EGD) and can be seen as a
carefully modified form of natural gradient descent, groks much faster. In
fact, in some cases the stagnation is completely removed. Finally, we
empirically show that, on classical arithmetic problems such as modular addition
and sparse parity, where this stagnation has been widely observed and
intensively studied, our proposed method eliminates the plateaus.
[LINK]
http://arxiv.org/abs/2510.04930v1
[DATE]
2025-10-06 23:40:36+08:00
[CATEGORIES]
cs.LG
Set to Be Fair: Demographic Parity Constraints for Set-Valued Classification
[AUTHORS]
Eyal Cohen, Christophe Denis, Mohamed Hebiri
[ABSTRACT]
Set-valued classification is used in multiclass settings where confusion
between classes can occur and lead to misleading predictions. However, its
application may amplify discriminatory bias, motivating the development of
set-valued approaches under fairness constraints. In this paper, we address the
problem of set-valued classification under demographic parity and expected size
constraints. We propose two complementary strategies: an oracle-based method
that minimizes classification risk while satisfying both constraints, and a
computationally efficient proxy that prioritizes constraint satisfaction. For
both strategies, we derive closed-form expressions for the (optimal) fair
set-valued classifiers and use these to build plug-in, data-driven procedures
for empirical predictions. We establish distribution-free convergence rates for
violations of the size and fairness constraints for both methods, and under
mild assumptions we also provide excess-risk bounds for the oracle-based
approach. Empirical results demonstrate the effectiveness of both strategies
and highlight the efficiency of our proxy method.
[LINK]
http://arxiv.org/abs/2510.04926v1
[DATE]
2025-10-06 23:36:45+08:00
[CATEGORIES]
cs.LG
Agentic Additive Manufacturing Alloy Discovery
[AUTHORS]
Peter Pak, Achuth Chandrasekhar, Amir Barati Farimani
[ABSTRACT]
Agentic systems enable the intelligent use of research tooling, augmenting a
researcher’s ability to investigate and propose novel solutions to existing
problems. Within Additive Manufacturing (AM), alloy discovery remains a complex
challenge, often requiring expertise in the various domains of materials
science, thermodynamic simulations, and experimental analysis. Large Language
Model (LLM) enabled agents can facilitate this endeavor by utilizing their
extensive knowledge base to dispatch tool calls via Model Context Protocol
(MCP) to perform actions such as Thermo-Calc property diagram calculations and
lack-of-fusion process map generation. In addition, the multi-agent system
developed in this work is able to effectively reason through complex user
prompts and provide analysis on the printability of proposed alloys. These
agents can dynamically adjust their task trajectory to the outcomes of tool
call results, effectively enabling autonomous decision-making in practical
environments. This work aims to utilize LLM-enabled agents to automate and
accelerate the task of alloy discovery within the field of additive
manufacturing and showcase the benefits of adopting this multi-agent system.
[LINK]
http://arxiv.org/abs/2510.02567v2
[DATE]
2025-10-06 23:33:47+08:00
[CATEGORIES]
cs.LG
Glocal Information Bottleneck for Time Series Imputation
[AUTHORS]
Jie Yang, Kexin Zhang, Guibin Zhang, Philip S. Yu, Kaize Ding
[ABSTRACT]
Time Series Imputation (TSI), which aims to recover missing values in
temporal data, remains a fundamental challenge due to the complex and often
high-rate missingness in real-world scenarios. Existing models typically
optimize the point-wise reconstruction loss, focusing on recovering numerical
values (local information). However, we observe that under high missing rates,
these models still perform well in the training phase yet produce poor
imputations and distorted latent representation distributions (global
information) in the inference phase. This reveals a critical optimization
dilemma: current objectives lack global guidance, leading models to overfit
local noise and fail to capture global information of the data. To address this
issue, we propose a new training paradigm, Glocal Information Bottleneck
(Glocal-IB). Glocal-IB is model-agnostic and extends the standard IB framework
by introducing a Global Alignment loss, derived from a tractable mutual
information approximation. This loss aligns the latent representations of
masked inputs with those of their originally observed counterparts. It helps
the model retain global structure and local details while suppressing noise
caused by missing values, giving rise to better generalization under high
missingness. Extensive experiments on nine datasets confirm that Glocal-IB
leads to consistently improved performance and aligned latent representations
under missingness. Our code implementation is available at
https://github.com/Muyiiiii/NeurIPS-25-Glocal-IB.
[LINK]
http://arxiv.org/abs/2510.04910v1
[DATE]
2025-10-06 23:24:44+08:00
[CATEGORIES]
cs.LG
How Different from the Past? Spatio-Temporal Time Series Forecasting with Self-Supervised Deviation Learning
[AUTHORS]
Haotian Gao, Zheng Dong, Jiawei Yong, Shintaro Fukushima, Kenjiro Taura, Renhe Jiang
[ABSTRACT]
Spatio-temporal forecasting is essential for real-world applications such as
traffic management and urban computing. Although recent methods have shown
improved accuracy, they often fail to account for dynamic deviations between
current inputs and historical patterns. These deviations contain critical
signals that can significantly affect model performance. To fill this gap, we
propose ST-SSDL, a Spatio-Temporal time series forecasting framework that
incorporates a Self-Supervised Deviation Learning scheme to capture and utilize
such deviations. ST-SSDL anchors each input to its historical average and
discretizes the latent space using learnable prototypes that represent typical
spatio-temporal patterns. Two auxiliary objectives are proposed to refine this
structure: a contrastive loss that enhances inter-prototype discriminability
and a deviation loss that regularizes the distance consistency between input
representations and corresponding prototypes to quantify deviation. Optimized
jointly with the forecasting objective, these components guide the model to
organize its hidden space and improve generalization across diverse input
conditions. Experiments on six benchmark datasets show that ST-SSDL
consistently outperforms state-of-the-art baselines across multiple metrics.
Visualizations further demonstrate its ability to adaptively respond to varying
levels of deviation in complex spatio-temporal scenarios. Our code and datasets
are available at https://github.com/Jimmy-7664/ST-SSDL.
[COMMENTS]
Accepted at NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.04908v1
[DATE]
2025-10-06 23:21:13+08:00
[CATEGORIES]
cs.LG
Benchmarking M-LTSF: Frequency and Noise-Based Evaluation of Multivariate Long Time Series Forecasting Models
[AUTHORS]
Nick Janßen, Melanie Schaller, Bodo Rosenhahn
[ABSTRACT]
Understanding the robustness of deep learning models for multivariate
long-term time series forecasting (M-LTSF) remains challenging, as evaluations
typically rely on real-world datasets with unknown noise properties. We propose
a simulation-based evaluation framework that generates parameterizable
synthetic datasets, where each dataset instance corresponds to a different
configuration of signal components, noise types, signal-to-noise ratios, and
frequency characteristics. These configurable components aim to model
real-world multivariate time series data without the ambiguity of unknown
noise. This framework enables fine-grained, systematic evaluation of M-LTSF
models under controlled and diverse scenarios. We benchmark four representative
architectures: S-Mamba (state-space), iTransformer (transformer-based), R-Linear
(linear), and Autoformer (decomposition-based). Our analysis reveals that all
models degrade severely when lookback windows cannot capture complete periods
of seasonal patterns in the data. S-Mamba and Autoformer perform best on
sawtooth patterns, while R-Linear and iTransformer favor sinusoidal signals.
White and Brownian noise universally degrade performance with lower
signal-to-noise ratio, while S-Mamba shows specific trend-noise and iTransformer
shows seasonal-noise vulnerability. Further spectral analysis shows that
S-Mamba and iTransformer achieve superior frequency reconstruction. This
controlled approach, based on our synthetic and principle-driven testbed,
offers deeper insights into model-specific strengths and limitations through
the aggregation of MSE scores and provides concrete guidance for model
selection based on signal characteristics and noise conditions.
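As a rough illustration of the kind of configurable synthetic series described above, the following sketch generates a sinusoid or sawtooth and adds white Gaussian noise scaled to a target signal-to-noise ratio. The function names, the parameters, and the SNR-in-decibels convention are assumptions made for illustration; this is not the authors' framework.

```python
import math
import random


def make_signal(n, period, shape="sine"):
    """Return n samples of a unit-amplitude periodic signal (hypothetical helper)."""
    if shape == "sine":
        return [math.sin(2 * math.pi * t / period) for t in range(n)]
    if shape == "sawtooth":
        # Ramp from -1 up to (almost) 1 once per period.
        return [2 * ((t % period) / period) - 1 for t in range(n)]
    raise ValueError(f"unknown shape: {shape}")


def power(xs):
    """Mean squared value of a sequence."""
    return sum(x * x for x in xs) / len(xs)


def add_white_noise(signal, snr_db, rng):
    """Add Gaussian noise so that signal power / noise power matches snr_db."""
    noise_power = power(signal) / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


rng = random.Random(0)
clean = make_signal(n=20000, period=50, shape="sine")
noisy = add_white_noise(clean, snr_db=10.0, rng=rng)

# Recover the realized SNR from the generated series as a sanity check.
noise = [y - x for x, y in zip(noisy, clean)]
measured_snr_db = 10 * math.log10(power(clean) / power(noise))
```

In this toy setup, a lookback window shorter than `period` never contains a full cycle, which is the regime in which the abstract reports severe degradation.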
[COMMENTS]
Number of pages: 13 Number of figures: 16 Number of Tables: 1
Submitted to: IEEE Transactions on Signal Processing
[LINK]
http://arxiv.org/abs/2510.04900v1
[DATE]
2025-10-06 23:16:52+08:00
[CATEGORIES]
cs.LG
HyperVLA: Efficient Inference in Vision-Language-Action Models via Hypernetworks
[AUTHORS]
Zheng Xiong, Kang Li, Zilin Wang, Matthew Jackson, Jakob Foerster, Shimon Whiteson
[ABSTRACT]
Built upon language and vision foundation models with strong generalization
ability and trained on large-scale robotic data, Vision-Language-Action (VLA)
models have recently emerged as a promising approach to learning generalist
robotic policies. However, a key drawback of existing VLAs is their extremely
high inference costs. In this paper, we propose HyperVLA to address this
problem. Unlike existing monolithic VLAs that activate the whole model during
both training and inference, HyperVLA uses a novel hypernetwork (HN)-based
architecture that activates only a small task-specific policy during inference,
while still retaining the high model capacity needed to accommodate diverse
multi-task behaviors during training. Successfully training an HN-based VLA is
nontrivial, so HyperVLA contains several key algorithm design features that
improve its performance, including properly utilizing the prior knowledge from
existing vision foundation models, HN normalization, and an action generation
strategy. Compared to monolithic VLAs, HyperVLA achieves a similar or even
higher success rate for both zero-shot generalization and few-shot adaptation,
while significantly reducing inference costs. Compared to OpenVLA, a
state-of-the-art VLA model, HyperVLA reduces the number of activated parameters
at test time by $90\times$, and accelerates inference speed by $120\times$.
Code is publicly available at https://github.com/MasterXiong/HyperVLA
[LINK]
http://arxiv.org/abs/2510.04898v1
[DATE]
2025-10-06 23:15:38+08:00
[CATEGORIES]
cs.LG
Revealing Interconnections between Diseases: from Statistical Methods to Large Language Models
[AUTHORS]
Alina Ermilova, Dmitrii Kornilov, Sofia Samoilova, Ekaterina Laptenkova, Anastasia Kolesnikova, Ekaterina Podplutova, Senotrusova Sofya, Maksim G. Sharaev
[ABSTRACT]
Identifying disease interconnections through manual analysis of large-scale
clinical data is labor-intensive, subjective, and prone to expert disagreement.
While machine learning (ML) shows promise, three critical challenges remain:
(1) selecting optimal methods from the vast ML landscape, (2) determining
whether real-world clinical data (e.g., electronic health records, EHRs) or
structured disease descriptions yield more reliable insights, (3) the lack of
“ground truth,” as some disease interconnections remain unexplored in medicine.
Large language models (LLMs) demonstrate broad utility, yet they often lack
specialized medical knowledge. To address these gaps, we conduct a systematic
evaluation of seven approaches for uncovering disease relationships based on
two data sources: (i) sequences of ICD-10 codes from MIMIC-IV EHRs and (ii) the
full set of ICD-10 codes, both with and without textual descriptions. Our
framework integrates the following: (i) a statistical co-occurrence analysis
and a masked language modeling (MLM) approach using real clinical data; (ii)
domain-specific BERT variants (Med-BERT and BioClinicalBERT); (iii) a
general-purpose BERT and document retrieval; and (iv) four LLMs (Mistral,
DeepSeek, Qwen, and YandexGPT). Our graph-based comparison of the obtained
interconnection matrices shows that the LLM-based approach produces
interconnections with the lowest diversity of ICD code connections to different
diseases compared to other methods, including text-based and domain-based
approaches. This suggests an important implication: LLMs have limited potential
for discovering new interconnections. In the absence of ground truth databases
for medical interconnections between ICD codes, our results constitute a
valuable medical disease ontology that can serve as a foundational resource for
future clinical research and artificial intelligence applications in
healthcare.
[LINK]
http://arxiv.org/abs/2510.04888v1
[DATE]
2025-10-06 23:09:39+08:00
[CATEGORIES]
cs.LG
Flow-Matching Based Refiner for Molecular Conformer Generation
[AUTHORS]
Xiangyang Xu, Hongyang Gao
[ABSTRACT]
Low-energy molecular conformer generation (MCG) is a foundational yet
challenging problem in drug discovery. Denoising-based methods include
diffusion and flow-matching methods that learn mappings from a simple base
distribution to the molecular conformer distribution. However, these approaches
often suffer from error accumulation during sampling, especially in the low SNR
steps, which are hard to train. To address these challenges, we propose a
flow-matching refiner for the MCG task. The proposed method initializes
sampling from mixed-quality outputs produced by upstream denoising models and
reschedules the noise scale to bypass the low-SNR phase, thereby improving
sample quality. On the GEOM-QM9 and GEOM-Drugs benchmark datasets, the
generator-refiner pipeline improves quality with fewer total denoising steps
while preserving diversity.
[LINK]
http://arxiv.org/abs/2510.04878v1
[DATE]
2025-10-06 23:00:36+08:00
[CATEGORIES]
cs.LG
Asynchronous Federated Stochastic Optimization for Heterogeneous Objectives Under Arbitrary Delays
[AUTHORS]
Charikleia Iakovidou, Kibaek Kim
[ABSTRACT]
Federated learning (FL) was recently proposed to securely train models with
data held over multiple locations (“clients”) under the coordination of a
central server. Prolonged training times caused by slow clients may hinder the
performance of FL; while asynchronous communication is a promising solution,
highly heterogeneous client response times under non-IID local data may
introduce significant bias to the global model, particularly in client-driven
setups where sampling is infeasible. To address this issue, we propose
\underline{A}synch\underline{R}onous \underline{E}xact \underline{A}veraging
(\textsc{AREA}), a stochastic (sub)gradient method that leverages asynchrony
for scalability and uses client-side memory to correct the bias induced by
uneven participation, without client sampling or prior knowledge of client
latencies. \textsc{AREA} communicates model residuals rather than gradient
estimates, reducing exposure to gradient inversion, and is compatible with
secure aggregation. Under standard assumptions and unbounded, heterogeneous
delays with finite mean, AREA achieves optimal convergence rates:
$\mathcal{O}(1/K)$ in the strongly convex, smooth regime and
$\mathcal{O}(1/\sqrt{K})$ in the convex, nonsmooth regime. For strongly convex,
smooth objectives, we demonstrate theoretically and empirically that AREA
accommodates larger step sizes than existing methods, enabling fast convergence
without adversely impacting model generalization. In the convex, nonsmooth
setting, to our knowledge we are the first to obtain rates that scale with the
average client update frequency rather than the minimum or maximum, indicating
increased robustness to outliers.
[LINK]
http://arxiv.org/abs/2405.10123v3
[DATE]
2025-10-06 22:53:25+08:00
[CATEGORIES]
cs.LG
Video Game Level Design as a Multi-Agent Reinforcement Learning Problem
[AUTHORS]
Sam Earle, Zehua Jiang, Eugene Vinitsky, Julian Togelius
[ABSTRACT]
Procedural Content Generation via Reinforcement Learning (PCGRL) offers a
method for training controllable level designer agents without the need for
human datasets, using metrics that serve as proxies for level quality as
rewards. Existing PCGRL research focuses on single generator agents, but is
bottlenecked by the need to frequently recalculate heuristics of level quality
and the agent’s need to navigate around potentially large maps. By framing
level generation as a multi-agent problem, we mitigate the efficiency
bottleneck of single-agent PCGRL by reducing the number of reward calculations
relative to the number of agent actions. We also find that multi-agent level
generators are better able to generalize to out-of-distribution map shapes,
which we argue is due to the generators’ learning more local, modular design
policies. We conclude that treating content generation as a distributed,
multi-agent task is beneficial for generating functional artifacts at scale.
[COMMENTS]
11 pages, 7 tables, 5 figures, published as full technical paper at
the AAAI conference on Artificial Intelligence and Interactive Digital
Entertainment 2025
[LINK]
http://arxiv.org/abs/2510.04862v1
[DATE]
2025-10-06 22:49:21+08:00
[CATEGORIES]
cs.LG
A Clinical-grade Universal Foundation Model for Intraoperative Pathology
[AUTHORS]
Zihan Zhao, Fengtao Zhou, Ronggang Li, Bing Chu, Xinke Zhang, Xueyi Zheng, Ke Zheng, Xiaobo Wen, Jiabo Ma, Yihui Wang, Jiewei Chen, Chengyou Zheng, Jiangyu Zhang, Yongqin Wen, Jiajia Meng, Ziqi Zeng, Xiaoqing Li, Jing Li, Dan Xie, Yaping Ye, Yu Wang, Hao Chen, Muyan Cai
[ABSTRACT]
Intraoperative pathology is pivotal to precision surgery, yet its clinical
impact is constrained by diagnostic complexity and the limited availability of
high-quality frozen-section data. While computational pathology has made
significant strides, the lack of large-scale, prospective validation has
impeded its routine adoption in surgical workflows. Here, we introduce CRISP, a
clinical-grade foundation model developed on over 100,000 frozen sections from
eight medical centers, specifically designed to provide Clinical-grade Robust
Intraoperative Support for Pathology (CRISP). CRISP was comprehensively
evaluated on more than 15,000 intraoperative slides across nearly 100
retrospective diagnostic tasks, including benign-malignant discrimination, key
intraoperative decision-making, and pan-cancer detection. The model
demonstrated robust generalization across diverse institutions, tumor types,
and anatomical sites, including previously unseen sites and rare cancers. In a
prospective cohort of over 2,000 patients, CRISP sustained high diagnostic
accuracy under real-world conditions, directly informing surgical decisions in
92.6% of cases. Human-AI collaboration further reduced diagnostic workload by
35%, avoided 105 ancillary tests and enhanced detection of micrometastases with
87.5% accuracy. Together, these findings position CRISP as a clinical-grade
paradigm for AI-driven intraoperative pathology, bridging computational
advances with surgical precision and accelerating the translation of artificial
intelligence into routine clinical practice.
[LINK]
http://arxiv.org/abs/2510.04861v1
[DATE]
2025-10-06 22:48:43+08:00
[CATEGORIES]
cs.LG
Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails
[AUTHORS]
Siwei Han, Jiaqi Liu, Yaofeng Su, Wenbo Duan, Xinyuan Liu, Cihang Xie, Mohit Bansal, Mingyu Ding, Linjun Zhang, Huaxiu Yao
[ABSTRACT]
As Large Language Model (LLM) agents increasingly gain self-evolutionary
capabilities to adapt and refine their strategies through real-world
interaction, their long-term reliability becomes a critical concern. We
identify the Alignment Tipping Process (ATP), a critical post-deployment risk
unique to self-evolving LLM agents. Unlike training-time failures, ATP arises
when continual interaction drives agents to abandon alignment constraints
established during training in favor of reinforced, self-interested strategies.
We formalize and analyze ATP through two complementary paradigms:
Self-Interested Exploration, where repeated high-reward deviations induce
individual behavioral drift, and Imitative Strategy Diffusion, where deviant
behaviors spread across multi-agent systems. Building on these paradigms, we
construct controllable testbeds and benchmark Qwen3-8B and
Llama-3.1-8B-Instruct. Our experiments show that alignment benefits erode
rapidly under self-evolution, with initially aligned models converging toward
unaligned states. In multi-agent settings, successful violations diffuse
quickly, leading to collective misalignment. Moreover, current reinforcement
learning-based alignment methods provide only fragile defenses against
alignment tipping. Together, these findings demonstrate that alignment of LLM
agents is not a static property but a fragile and dynamic one, vulnerable to
feedback-driven decay during deployment. Our data and code are available at
https://github.com/aiming-lab/ATP.
[LINK]
http://arxiv.org/abs/2510.04860v1
[DATE]
2025-10-06 22:48:39+08:00
[CATEGORIES]
cs.LG
ERDE: Entropy-Regularized Distillation for Early-exit
[AUTHORS]
Martial Guidez, Stefan Duffner, Yannick Alpou, Oscar Röth, Christophe Garcia
[ABSTRACT]
Although deep neural networks and in particular Convolutional Neural Networks
have demonstrated state-of-the-art performance in image classification with
relatively high efficiency, they still exhibit high computational costs, often
rendering them impractical for real-time and edge applications. Therefore, a
multitude of compression techniques have been developed to reduce these costs
while maintaining accuracy. In addition, dynamic architectures have been
introduced to modulate the level of compression at execution time, which is a
desirable property in many resource-limited application scenarios. The proposed
method effectively integrates two well-established optimization techniques:
early exits and knowledge distillation, where a reduced student early-exit
model is trained from a more complex teacher early-exit model. The primary
contribution of this research lies in the approach for training the student
early-exit model. In comparison to the conventional Knowledge Distillation
loss, our approach incorporates a new entropy-based loss for images where the
teacher’s classification was incorrect. The proposed method optimizes the
trade-off between accuracy and efficiency, thereby achieving significant
reductions in computational complexity without compromising classification
performance. The validity of this approach is substantiated by experimental
results on image classification datasets CIFAR10, CIFAR100 and SVHN, which
further opens new research perspectives for Knowledge Distillation in other
contexts.
[LINK]
http://arxiv.org/abs/2510.04856v1
[DATE]
2025-10-06 22:45:41+08:00
[CATEGORIES]
cs.LG
Synthesising Counterfactual Explanations via Label-Conditional Gaussian Mixture Variational Autoencoders
[AUTHORS]
Junqi Jiang, Francesco Leofante, Antonio Rago, Francesca Toni
[ABSTRACT]
Counterfactual explanations (CEs) provide recourse recommendations for
individuals affected by algorithmic decisions. A key challenge is generating
CEs that are robust against various perturbation types (e.g. input and model
perturbations) while simultaneously satisfying other desirable properties.
These include plausibility, ensuring CEs reside on the data manifold, and
diversity, providing multiple distinct recourse options for single inputs.
Existing methods, however, mostly struggle to address these multifaceted
requirements in a unified, model-agnostic manner. We address these limitations
by proposing a novel generative framework. First, we introduce the
Label-conditional Gaussian Mixture Variational Autoencoder (L-GMVAE), a model
trained to learn a structured latent space where each class label is
represented by a set of Gaussian components with diverse, prototypical
centroids. Building on this, we present LAPACE (LAtent PAth Counterfactual
Explanations), a model-agnostic algorithm that synthesises entire paths of CE
points by interpolating from inputs’ latent representations to those learned
latent centroids. This approach inherently ensures robustness to input changes,
as all paths for a given target class converge to the same fixed centroids.
Furthermore, the generated paths provide a spectrum of recourse options,
allowing users to navigate the trade-off between proximity and plausibility
while also encouraging robustness against model changes. In addition,
user-specified actionability constraints can also be easily incorporated via
lightweight gradient optimisation through the L-GMVAE’s decoder. Comprehensive
experiments show that LAPACE is computationally efficient and achieves
competitive performance across eight quantitative metrics.
[LINK]
http://arxiv.org/abs/2510.04855v1
[DATE]
2025-10-06 22:42:23+08:00
[CATEGORIES]
cs.LG
The Syntax and Semantics of einsum
[AUTHORS]
Maurice Wenig, Paul G. Rump, Mark Blacher, Joachim Giesen
[ABSTRACT]
In 2011, einsum was introduced to NumPy as a practical and convenient
notation for tensor expressions in machine learning, quantum circuit
simulation, and other fields. It has since been implemented in additional
Python frameworks such as PyTorch and TensorFlow, as well as in other
programming languages such as Julia. Despite its practical success, the einsum
notation still lacks a solid theoretical basis, and is not unified across the
different frameworks, limiting opportunities for formal reasoning and
systematic optimization. In this work, we discuss the terminology of tensor
expressions and provide a formal definition of the einsum language. Based on
this definition, we formalize and prove important equivalence rules for tensor
expressions and highlight their relevance in practical applications.
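For readers unfamiliar with the notation, the short NumPy sketch below shows three einsum expressions. The last pair checks numerically that a nested expression agrees with a single flattened one. This only illustrates the general idea of equivalent expressions; the specific examples are chosen here and are not taken from the paper's formal rules.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)
B = np.arange(12).reshape(3, 4)

# Matrix product: index j is summed because it does not appear in the output.
matmul = np.einsum("ij,jk->ik", A, B)

# Trace: repeating an index within one operand selects the diagonal.
S = np.arange(9).reshape(3, 3)
trace = np.einsum("ii->", S)

# Nested versus flat: contracting (A B) with v in two steps
# gives the same result as one combined expression.
v = np.arange(4)
nested = np.einsum("ik,k->i", np.einsum("ij,jk->ik", A, B), v)
flat = np.einsum("ij,jk,k->i", A, B, v)
```

The flattened form leaves the contraction order to the library, which is one reason equivalence rules of this kind matter for optimization.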
[COMMENTS]
21 pages, 1 figure. Includes formal definitions, proofs of algebraic
properties, and nesting/denesting rules for the einsum notation
[LINK]
http://arxiv.org/abs/2509.20020v2
[DATE]
2025-10-06 22:35:42+08:00
[CATEGORIES]
cs.LG
Impact of Dataset Properties on Membership Inference Vulnerability of Deep Transfer Learning
[AUTHORS]
Marlon Tobaben, Hibiki Ito, Joonas Jälkö, Yuan He, Antti Honkela
[ABSTRACT]
Membership inference attacks (MIAs) are used to test practical privacy of
machine learning models. MIAs complement formal guarantees from differential
privacy (DP) under a more realistic adversary model. We analyse MIA
vulnerability of fine-tuned neural networks both empirically and theoretically,
the latter using a simplified model of fine-tuning. We show that the
vulnerability of non-DP models when measured as the attacker advantage at a
fixed false positive rate reduces according to a simple power law as the number
of examples per class increases. A similar power law applies even for the most
vulnerable points, but the dataset size needed for adequate protection of the
most vulnerable points is very large.
[COMMENTS]
Accepted to NeurIPS 2025; 47 pages, 13 figures
[LINK]
http://arxiv.org/abs/2402.06674v5
[DATE]
2025-10-06 22:33:31+08:00
[CATEGORIES]
cs.LG
First Hallucination Tokens Are Different from Conditional Ones
[AUTHORS]
Jakob Snel, Seong Joon Oh
[ABSTRACT]
Large Language Models (LLMs) hallucinate, and detecting these cases is key to
ensuring trust. While many approaches address hallucination detection at the
response or span level, recent work explores token-level detection, enabling
more fine-grained intervention. However, the distribution of hallucination
signal across sequences of hallucinated tokens remains unexplored. We leverage
token-level annotations from the RAGTruth corpus and find that the first
hallucinated token is far more detectable than later ones. This structural
property holds across models, suggesting that first hallucination tokens play a
key role in token-level hallucination detection. Our code is available at
https://github.com/jakobsnl/RAGTruth_Xtended.
[COMMENTS]
4.5 pages, 3 figures, Dataset, Knowledge Paper, Hallucination,
Trustworthiness
[LINK]
http://arxiv.org/abs/2507.20836v4
[DATE]
2025-10-06 22:29:58+08:00
[CATEGORIES]
cs.LG
Distributionally Robust Causal Abstractions
[AUTHORS]
Yorgos Felekis, Theodoros Damoulas, Paris Giampouras
[ABSTRACT]
Causal Abstraction (CA) theory provides a principled framework for relating
causal models that describe the same system at different levels of granularity
while ensuring interventional consistency between them. Recently, several
approaches for learning CAs have been proposed, but all assume fixed and
well-specified exogenous distributions, making them vulnerable to environmental
shifts and misspecification. In this work, we address these limitations by
introducing the first class of distributionally robust CAs and their associated
learning algorithms. The latter cast robust causal abstraction learning as a
constrained min-max optimization problem with Wasserstein ambiguity sets. We
provide theoretical results, for both empirical and Gaussian environments,
leading to principled selection of the level of robustness via the radius of
these sets. Furthermore, we present empirical evidence across different
problems and CA learning methods, demonstrating our framework’s robustness not
only to environmental shifts but also to structural model and intervention
mapping misspecification.
[LINK]
http://arxiv.org/abs/2510.04842v1
[DATE]
2025-10-06 22:26:12+08:00
[CATEGORIES]
cs.LG
Beyond Random: Automatic Inner-loop Optimization in Dataset Distillation
[AUTHORS]
Muquan Li, Hang Gou, Dongyang Zhang, Shuang Liang, Xiurui Xie, Deqiang Ouyang, Ke Qin
[ABSTRACT]
The growing demand for efficient deep learning has positioned dataset
distillation as a pivotal technique for compressing training datasets while
preserving model performance. However, existing inner-loop optimization methods
for dataset distillation typically rely on random truncation strategies, which
lack flexibility and often yield suboptimal results. In this work, we observe
that neural networks exhibit distinct learning dynamics across different
training stages (early, middle, and late), making random truncation ineffective.
To address this limitation, we propose Automatic Truncated Backpropagation
Through Time (AT-BPTT), a novel framework that dynamically adapts both
truncation positions and window sizes according to intrinsic gradient behavior.
AT-BPTT introduces three key components: (1) a probabilistic mechanism for
stage-aware timestep selection, (2) an adaptive window sizing strategy based on
gradient variation, and (3) a low-rank Hessian approximation to reduce
computational overhead. Extensive experiments on CIFAR-10, CIFAR-100,
Tiny-ImageNet, and ImageNet-1K show that AT-BPTT achieves state-of-the-art
performance, improving accuracy by an average of 6.16% over baseline methods.
Moreover, our approach accelerates inner-loop optimization by 3.9x while saving
63% memory cost.
[LINK]
http://arxiv.org/abs/2510.04838v1
[DATE]
2025-10-06 22:22:28+08:00
[CATEGORIES]
cs.LG
Bond-Centered Molecular Fingerprint Derivatives: A BBBP Dataset Study
[AUTHORS]
Guillaume Godin
[ABSTRACT]
Bond Centered FingerPrints (BCFP) are a complementary, bond-centric
alternative to Extended-Connectivity Fingerprints (ECFP). We introduce a static
BCFP that mirrors the bond-convolution used by directed message-passing GNNs
like ChemProp, and evaluate it with a fast Random Forest model on the
Blood-Brain Barrier Penetration (BBBP) classification task. Across stratified
cross-validation, concatenating ECFP with BCFP consistently improves AUROC and
AUPRC over either descriptor alone, as confirmed by Tukey HSD
multiple-comparison analysis. Among radii, r = 1 performs best; r = 2 does not
yield statistically separable gains under the same test. We further propose
BCFP-Sort&Slice, a simple feature-combination scheme that preserves the
out-of-vocabulary (OOV) count information native to ECFP count vectors while
enabling compact unhashed concatenation of BCFP variants. We also outperform
the MGTP prediction on our BBBP evaluation using these new composite bond and
atom features. These results show that lightweight, bond-centered
descriptors can complement atom-centered circular fingerprints and provide
strong, fast baselines for BBBP prediction.
[COMMENTS]
14 pages, 10 figures, 1 table
[LINK]
http://arxiv.org/abs/2510.04837v1
[DATE]
2025-10-06 22:22:23+08:00
[CATEGORIES]
cs.LG
On Predicting Post-Click Conversion Rate via Counterfactual Inference
[AUTHORS]
Junhyung Ahn, Sanghack Lee
[ABSTRACT]
Accurately predicting conversion rate (CVR) is essential in various
recommendation domains such as online advertising systems and e-commerce. These
systems utilize user interaction logs, which consist of exposures, clicks, and
conversions. CVR prediction models are typically trained solely based on
clicked samples, as conversions can only be determined following clicks.
However, the sparsity of clicked instances necessitates the collection of a
substantial amount of logs for effective model training. Recent works address
this issue by devising frameworks that leverage non-clicked samples. While
these frameworks aim to reduce biases caused by the discrepancy between clicked
and non-clicked samples, they often rely on heuristics. Against this
background, we propose a method to counterfactually generate conversion labels
for non-clicked samples by using causality as a guiding principle, attempting
to answer the question, “Would the user have converted if he or she had clicked
the recommended item?” Our approach is named the Entire Space Counterfactual
Inference Multi-task Model (ESCIM). We initially train a structural causal
model (SCM) of user sequential behaviors and conduct a hypothetical
intervention (i.e., click) on non-clicked items to infer counterfactual CVRs.
We then introduce several approaches to transform predicted counterfactual CVRs
into binary counterfactual conversion labels for the non-clicked samples.
Finally, the generated samples are incorporated into the training process.
Extensive experiments on public datasets illustrate the superiority of the
proposed algorithm. Online A/B testing further empirically validates the
effectiveness of our proposed algorithm in real-world scenarios. In addition,
we demonstrate the improved performance of the proposed method on latent
conversion data, showcasing its robustness and superior generalization
capabilities.
[COMMENTS]
This work has been accepted for publication at the IEEE International
Conference on Data Mining (ICDM) 2025
[LINK]
http://arxiv.org/abs/2510.04816v1
[DATE]
2025-10-06 21:57:49+08:00
[CATEGORIES]
cs.LG
A Noise Resilient Approach for Robust Hurst Exponent Estimation
[AUTHORS]
Malith Premarathna, Fabrizio Ruggeri, Dixon Vimalajeewa
[ABSTRACT]
Understanding signal behavior across scales is vital in areas such as natural
phenomena analysis and financial modeling. A key property is self-similarity,
quantified by the Hurst exponent (H), which reveals long-term dependencies.
Wavelet-based methods are effective for estimating H due to their multi-scale
analysis capability, but additive noise in real-world measurements often
degrades accuracy. We propose Noise-Controlled ALPHEE (NC-ALPHEE), an
enhancement of the Average Level-Pairwise Hurst Exponent Estimator (ALPHEE),
incorporating noise mitigation and generating multiple level-pairwise estimates
from signal energy pairs. A neural network (NN) combines these estimates,
replacing traditional averaging. This adaptive learning maintains ALPHEE’s
behavior in noise-free cases while improving performance in noisy conditions.
Extensive simulations show that in noise-free data, NC-ALPHEE matches ALPHEE’s
accuracy using both averaging and NN-based methods. Under noise, however,
traditional averaging deteriorates and requires impractical level restrictions,
while NC-ALPHEE consistently outperforms existing techniques without such
constraints. NC-ALPHEE offers a robust, adaptive approach for H estimation,
significantly enhancing the reliability of wavelet-based methods in noisy
environments.
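The abstract does not give the estimator's formula, so the following is only a minimal sketch of a level-pairwise estimate with plain averaging. It assumes that the average squared wavelet coefficient at level j scales as 2^(-(2H+1)j), with larger j meaning finer resolution. The scaling law, the indexing convention, and all function names are assumptions made for illustration. NC-ALPHEE itself adds noise mitigation and replaces the average with a neural network, neither of which is shown here.

```python
import math
from itertools import combinations


def pairwise_hurst(energies):
    """Return one H estimate per pair of levels.

    energies maps level j to the average squared wavelet coefficient at that
    level, under the assumed scaling E_j proportional to 2 ** (-(2H + 1) * j).
    """
    estimates = {}
    for j1, j2 in combinations(sorted(energies), 2):
        # Under the assumed law, log2(E_j1 / E_j2) = (2H + 1) * (j2 - j1).
        slope = math.log2(energies[j1] / energies[j2]) / (j2 - j1)
        estimates[(j1, j2)] = (slope - 1) / 2
    return estimates


def average_hurst(energies):
    """Combine the pairwise estimates by plain averaging."""
    values = list(pairwise_hurst(energies).values())
    return sum(values) / len(values)


# Noise-free synthetic energies that follow the assumed law exactly.
true_h = 0.7
energies = {j: 2 ** (-(2 * true_h + 1) * j) for j in range(3, 8)}
h_hat = average_hurst(energies)
```

With additive noise, the energies at the finest levels would be inflated and the pairwise estimates would disagree, which is the situation the abstract says plain averaging handles poorly.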
[LINK]
http://arxiv.org/abs/2510.04811v1
[DATE]
2025-10-06 21:54:23+08:00
[CATEGORIES]
cs.LG
Unified ODE Analysis of Smooth Q-Learning Algorithms
[AUTHORS]
Donghwan Lee
[ABSTRACT]
Convergence of Q-learning has been the focus of extensive research over the
past several decades. Recently, an asymptotic convergence analysis for
Q-learning was introduced using a switching system framework. This approach
applies the so-called ordinary differential equation (ODE) approach to prove
the convergence of the asynchronous Q-learning modeled as a continuous-time
switching system, where notions from switching system theory are used to prove
its asymptotic stability without using explicit Lyapunov arguments. However, to
prove stability, restrictive conditions, such as quasi-monotonicity, must be
satisfied for the underlying switching systems, which makes it hard to
generalize the analysis method to other reinforcement learning algorithms, such
as the smooth Q-learning variants. In this paper, we present a more general and
unified convergence analysis that improves upon the switching system approach
and can analyze Q-learning and its smooth variants. The proposed analysis is
motivated by previous work on the convergence of synchronous Q-learning based
on $p$-norm serving as a Lyapunov function. However, the proposed analysis
addresses more general ODE models that can cover both asynchronous Q-learning
and its smooth versions with simpler frameworks.
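To make "smooth variants" concrete, the sketch below replaces the max in the tabular Q-learning target with a log-sum-exp approximation. Log-sum-exp is chosen here as one common smoothing, assumed for illustration; the abstract does not say which smooth operators the paper covers. The function names and the temperature parameter `beta` are likewise hypothetical.

```python
import math


def smooth_max(values, beta):
    """Log-sum-exp approximation of max(values); sharper as beta grows."""
    m = max(values)
    # Subtract the maximum before exponentiating for numerical stability.
    return m + math.log(sum(math.exp(beta * (v - m)) for v in values)) / beta


def smooth_q_update(q, state, action, reward, next_state, alpha, gamma, beta):
    """One asynchronous tabular update with smooth_max in place of max."""
    target = reward + gamma * smooth_max(q[next_state], beta)
    q[state][action] += alpha * (target - q[state][action])


# Two states, two actions; only the visited (state, action) entry changes.
q = [[0.0, 0.0], [1.0, 3.0]]
smooth_q_update(q, state=0, action=1, reward=1.0, next_state=1,
                alpha=0.5, gamma=0.9, beta=5.0)
```

For n actions, this operator lies between max(values) and max(values) + log(n) / beta, so it approaches the standard Q-learning target as `beta` increases.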
[LINK]
http://arxiv.org/abs/2404.14442v5
[DATE]
2025-10-06 21:46:27+08:00
[CATEGORIES]
cs.LG
PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation
[AUTHORS]
Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, Yiu-ming Cheung
[ABSTRACT]
Visual prompt tuning (VPT), i.e., fine-tuning some lightweight prompt tokens,
provides an efficient and effective approach for adapting pre-trained models to
various downstream tasks. However, most prior art indiscriminately uses a fixed
prompt distribution across different tasks, neglecting that the importance of
each block varies with the task. In this paper, we introduce adaptive
distribution optimization (ADO) by tackling two key questions: (1) How to
appropriately and formally define ADO, and (2) How to design an adaptive
distribution strategy guided by this definition? Through empirical analysis, we
first confirm that properly adjusting the distribution significantly improves
VPT performance, and further uncover a key insight that a nested relationship
exists between ADO and VPT. Based on these findings, we propose a new VPT
framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which
adaptively adjusts the distribution built upon a nested optimization
formulation. Specifically, we develop a prompt relocation strategy derived from
this formulation, comprising two steps: pruning idle prompts from
prompt-saturated blocks, followed by allocating these prompts to the most
prompt-needed blocks. By iteratively performing prompt relocation and VPT, our
proposal can adaptively learn the optimal prompt distribution in a nested
optimization-based manner, thereby unlocking the full potential of VPT.
Extensive experiments demonstrate that our proposal significantly outperforms
advanced VPT methods, e.g., PRO-VPT surpasses VPT by 1.6 pp and 2.0 pp average
accuracy, leading prompt-based methods to state-of-the-art performance on
VTAB-1k and FGVC benchmarks. The code is available at
https://github.com/ckshang/PRO-VPT.
[COMMENTS]
Accepted by ICCV 2025
[LINK]
http://arxiv.org/abs/2503.06901v2
[DATE]
2025-10-06 21:38:48+08:00
[CATEGORIES]
cs.LG
Joint Diffusion models in Continual Learning
[AUTHORS]
Paweł Skierś, Kamil Deja
[ABSTRACT]
In this work, we introduce JDCL, a new method for continual learning with
generative rehearsal based on joint diffusion models. Neural networks suffer
from catastrophic forgetting, defined as an abrupt loss in the model’s
performance when retrained with additional data coming from a different distribution.
Generative-replay-based continual learning methods try to mitigate this issue
by retraining a model with a combination of new and rehearsal data sampled from
a generative model. In this work, we propose to extend this idea by combining a
continually trained classifier with a diffusion-based generative model into a
single, jointly optimized neural network. We show that such shared
parametrization, combined with knowledge distillation, allows for
stable adaptation to new tasks without catastrophic forgetting. We evaluate our
approach on several benchmarks, where it outperforms recent state-of-the-art
generative replay techniques. Additionally, we extend our method to the
semi-supervised continual learning setup, where it outperforms competing
buffer-based replay techniques, and evaluate, in a self-supervised manner, the
quality of trained representations.
[LINK]
http://arxiv.org/abs/2411.08224v3
[DATE]
2025-10-06 21:30:32+08:00
[CATEGORIES]
cs.LG
Fine-Grained AI Model Caching and Downloading With Coordinated Multipoint Broadcasting in Multi-Cell Edge Networks
[AUTHORS]
Yang Fu, Peng Qin, Yueyue Zhang, Pao Cheng, Jun Lu, Yifei Wang
[ABSTRACT]
6G networks are envisioned to support on-demand AI model downloading to
accommodate diverse inference requirements of end users. By proactively caching
models at edge nodes, users can retrieve the requested models with low latency
for on-device AI inference. However, the substantial size of contemporary AI
models poses significant challenges for edge caching under limited storage
capacity, as well as for the concurrent delivery of heterogeneous models over
wireless channels. To address these challenges, we propose a fine-grained AI
model caching and downloading system that exploits parameter reusability,
stemming from the common practice of fine-tuning task-specific models from a
shared pre-trained model with frozen parameters. This system selectively caches
model parameter blocks (PBs) at edge nodes, eliminating redundant storage of
reusable parameters across different cached models. Additionally, it
incorporates coordinated multipoint (CoMP) broadcasting to simultaneously
deliver reusable PBs to multiple users, thereby enhancing downlink spectrum
utilization. Under this arrangement, we formulate a model downloading delay
minimization problem to jointly optimize PB caching, migration (among edge
nodes), and broadcasting beamforming. To tackle this intractable problem, we
develop a distributed multi-agent learning framework that enables edge nodes to
explicitly learn mutual influence among their actions, thereby facilitating
cooperation. Furthermore, a data augmentation approach is proposed to
adaptively generate synthetic training samples through a predictive model,
boosting sample efficiency and accelerating policy learning. Both theoretical
analysis and simulation experiments validate the superior convergence
performance of the proposed learning framework.
[LINK]
http://arxiv.org/abs/2509.19341v2
[DATE]
2025-10-06 21:23:04+08:00
[CATEGORIES]
cs.LG
Learning on the Job: Test-Time Curricula for Targeted Reinforcement Learning
[AUTHORS]
Jonas Hübotter, Leander Diaz-Bone, Ido Hakimi, Andreas Krause, Moritz Hardt
[ABSTRACT]
Humans are good at learning on the job: We learn how to solve the tasks we
face as we go along. Can a model do the same? We propose an agent that
assembles a task-specific curriculum, called test-time curriculum (TTC-RL), and
applies reinforcement learning to continue training the model for its target
task. The test-time curriculum avoids time-consuming human curation of datasets
by automatically selecting the most task-relevant data from a large pool of
available training data. Our experiments demonstrate that reinforcement
learning on a test-time curriculum consistently improves the model on its
target tasks, across a variety of evaluations and models. Notably, on
challenging math and coding benchmarks, TTC-RL improves the pass@1 of Qwen3-8B
by approximately 1.8x on AIME25 and 2.1x on CodeElo. Moreover, we find that
TTC-RL significantly raises the performance ceiling compared to the initial
model, increasing pass@8 on AIME25 from 40% to 62% and on CodeElo from 28% to
43%. Our findings show the potential of test-time curricula in extending the
test-time scaling paradigm to continual training on thousands of task-relevant
experiences during test-time.
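The pass@1 and pass@8 figures quoted above are sampling-based metrics. As a rough sketch, the snippet below assumes the widely used combinatorial estimator 1 - C(n - c, k) / C(n, k) for n attempts with c correct; the abstract does not say which estimator the authors use, and the counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n sampled attempts of which c are correct.

    Uses the combinatorial estimator 1 - C(n - c, k) / C(n, k), a common
    choice; this is an assumption, not a detail taken from the abstract.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 16 attempts, 3 of them correct.
p1 = pass_at_k(16, 3, 1)
p8 = pass_at_k(16, 3, 8)
print(f"pass@1 = {p1:.4f}, pass@8 = {p8:.4f}")
```

A benchmark score would then be the mean of this quantity over all problems.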
[LINK]
http://arxiv.org/abs/2510.04786v1
[DATE]
2025-10-06 21:07:14+08:00
[CATEGORIES]
cs.LG
Kernel ridge regression under power-law data: spectrum and generalization
[AUTHORS]
Arie Wortsman, Bruno Loureiro
[ABSTRACT]
In this work, we investigate high-dimensional kernel ridge regression (KRR)
on i.i.d. Gaussian data with anisotropic power-law covariance. This setting
differs fundamentally from the classical source & capacity conditions for KRR,
where power-law assumptions are typically imposed on the kernel eigen-spectrum
itself. Our contributions are twofold. First, we derive an explicit
characterization of the kernel spectrum for polynomial inner-product kernels,
giving a precise description of how the kernel eigen-spectrum inherits the data
decay. Second, we provide an asymptotic analysis of the excess risk in the
high-dimensional regime for a particular kernel with this spectral behavior,
showing that the sample complexity is governed by the effective dimension of
the data rather than the ambient dimension. These results establish a
fundamental advantage of learning with power-law anisotropic data over
isotropic data. To our knowledge, this is the first rigorous treatment of
non-linear KRR under power-law data.
[LINK]
http://arxiv.org/abs/2510.04780v1
[DATE]
2025-10-06 20:58:35+08:00
[CATEGORIES]
cs.LG
Distributional Inverse Reinforcement Learning
[AUTHORS]
Feiyang Wu, Ye Zhao, Anqi Wu
[ABSTRACT]
We propose a distributional framework for offline Inverse Reinforcement
Learning (IRL) that jointly models uncertainty over reward functions and full
distributions of returns. Unlike conventional IRL approaches that recover a
deterministic reward estimate or match only expected returns, our method
captures richer structure in expert behavior, particularly in learning the
reward distribution. It does so by minimizing first-order stochastic dominance
(FSD) violations, which integrates distortion risk measures (DRMs) into policy
learning and enables the recovery of both reward distributions and
distribution-aware policies. This formulation is well-suited for behavior
analysis and risk-aware imitation learning. Empirical results on synthetic
benchmarks, real-world neurobehavioral data, and MuJoCo control tasks
demonstrate that our method recovers expressive reward representations and
achieves state-of-the-art imitation performance.
[LINK]
http://arxiv.org/abs/2510.03013v2
[DATE]
2025-10-06 20:56:00+08:00
[CATEGORIES]
cs.LG
MetaMP: Seamless Metadata Enrichment and AI Application Framework for Enhanced Membrane Protein Visualization and Analysis
[AUTHORS]
Ebenezer Awotoro, Chisom Ezekannagha, Florian Schwarz, Johannes Tauscher, Dominik Heider, Katharina Ladewig, Christel Le Bon, Karine Moncoq, Bruno Miroux, Georges Hattab
[ABSTRACT]
Structural biology has made significant progress in determining membrane
proteins, leading to a remarkable increase in the number of available
structures in dedicated databases. The inherent complexity of membrane protein
structures, coupled with challenges such as missing data, inconsistencies, and
computational barriers from disparate sources, underscores the need for
improved database integration. To address this gap, we present MetaMP, a
framework that unifies membrane-protein databases within a web application and
uses machine learning for classification. MetaMP improves data quality by
enriching metadata, offering a user-friendly interface, and providing eight
interactive views for streamlined exploration. MetaMP was effective across
tasks of varying difficulty, demonstrating advantages across different levels
without compromising speed or accuracy, according to user evaluations.
Moreover, MetaMP supports essential functions such as structure classification
and outlier detection.
We present three practical applications of Artificial Intelligence (AI) in
membrane protein research: predicting transmembrane segments, reconciling
legacy databases, and classifying structures with explainable AI support. In a
validation focused on statistics, MetaMP resolved 77% of data discrepancies,
accurately predicted the class of newly identified membrane proteins 98% of the
time, and outperformed expert curation. Altogether, MetaMP is a much-needed
resource that harmonizes current knowledge and empowers AI-driven exploration
of membrane-protein architecture.
[LINK]
http://arxiv.org/abs/2510.04776v1
[DATE]
2025-10-06 20:52:50+08:00
[CATEGORIES]
cs.LG
Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge
[AUTHORS]
Max Kirchner, Hanna Hoffmann, Alexander C. Jenke, Oliver L. Saldanha, Kevin Pfeiffer, Weam Kanjo, Julia Alekseenko, Claas de Boer, Santhi Raj Kolamuri, Lorenzo Mazza, Nicolas Padoy, Sophia Bano, Annika Reinke, Lena Maier-Hein, Danail Stoyanov, Jakob N. Kather, Fiona R. Kolbinger, Sebastian Bodenstedt, Stefanie Speidel
[ABSTRACT]
Purpose: The FedSurg challenge was designed to benchmark the state of the art
in federated learning for surgical video classification. Its goal was to assess
how well current methods generalize to unseen clinical centers and adapt
through local fine-tuning while enabling collaborative model development
without sharing patient data. Methods: Participants developed strategies to
classify inflammation stages in appendicitis using a preliminary version of the
multi-center Appendix300 video dataset. The challenge evaluated two tasks:
generalization to an unseen center and center-specific adaptation after
fine-tuning. Submitted approaches included foundation models with linear
probing, metric learning with triplet loss, and various FL aggregation schemes
(FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and
Expected Cost, with ranking robustness evaluated via bootstrapping and
statistical testing. Results: In the generalization task, performance across
centers was limited. In the adaptation task, all teams improved after
fine-tuning, though ranking stability was low. The ViViT-based submission
achieved the strongest overall performance. The challenge highlighted
limitations in generalization, sensitivity to class imbalance, and difficulties
in hyperparameter tuning in decentralized training, while spatiotemporal
modeling and context-aware preprocessing emerged as promising strategies.
Conclusion: The FedSurg Challenge establishes the first benchmark for
evaluating FL strategies in surgical video classification. Findings highlight
the trade-off between local personalization and global robustness, and
underscore the importance of architecture choice, preprocessing, and loss
design. This benchmarking offers a reference point for future development of
imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.
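Two of the aggregation schemes named above, FedAvg and FedMedian, can be sketched in a few lines. This is a minimal illustration assuming FedAvg weights clients by local sample count and FedMedian takes a coordinate-wise median; the parameter vectors and sample counts are hypothetical and are not taken from the challenge.

```python
from statistics import median


def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def fed_median(client_weights):
    """Coordinate-wise median of client parameter vectors."""
    dim = len(client_weights[0])
    return [median(w[i] for w in client_weights) for i in range(dim)]


# Three hypothetical centers; the third is an outlier with few samples.
client_weights = [[1.0, 2.0], [3.0, 4.0], [100.0, -50.0]]
client_sizes = [40, 50, 10]

avg = fed_avg(client_weights, client_sizes)
med = fed_median(client_weights)
print(avg, med)
```

In this toy case the outlier center pulls the weighted average far more than it moves the median, which is the usual motivation for median-style aggregation.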
[COMMENTS]
A challenge report pre-print (31 pages), including 7 tables and 8 figures
[LINK]
http://arxiv.org/abs/2510.04772v1
[DATE]
2025-10-06 20:48:46+08:00
[CATEGORIES]
cs.LG
Beyond the Seen: Bounded Distribution Estimation for Open-Vocabulary Learning
[AUTHORS]
Xiaomeng Fan, Yuchuan Mao, Zhi Gao, Yuwei Wu, Jin Chen, Yunde Jia
[ABSTRACT]
Open-vocabulary learning requires modeling the data distribution in open
environments, which consists of both seen-class and unseen-class data.
Existing methods estimate the distribution in open environments using
seen-class data, where the absence of unseen classes makes the estimation error
inherently unidentifiable. Intuitively, learning beyond the seen classes is
crucial for distribution estimation to bound the estimation error. We
theoretically demonstrate that the distribution can be effectively estimated by
generating unseen-class data, through which the estimation error is
upper-bounded.
Building on this theoretical insight, we propose a novel open-vocabulary
learning method, which generates unseen-class data for estimating the
distribution in open environments. The method consists of a class-domain-wise
data generation pipeline and a distribution alignment algorithm. The data
generation pipeline generates unseen-class data under the guidance of a
hierarchical semantic tree and domain information inferred from the seen-class
data, facilitating accurate distribution estimation. With the generated data,
the distribution alignment algorithm estimates and maximizes the posterior
probability to enhance generalization in open-vocabulary learning. Extensive
experiments on 11 datasets demonstrate that our method outperforms baseline
approaches by up to 14%, highlighting its effectiveness and superiority.
[LINK]
http://arxiv.org/abs/2510.04770v1
[DATE]
2025-10-06 20:43:59+08:00
[CATEGORIES]
cs.LG
When Do Credal Sets Stabilize? Fixed-Point Theorems for Credal Set Updates
[AUTHORS]
Michele Caprio, Siu Lun Chau, Krikamol Muandet
[ABSTRACT]
Many machine learning algorithms rely on iterative updates of uncertainty
representations, ranging from variational inference and
expectation-maximization, to reinforcement learning, continual learning, and
multi-agent learning. In the presence of imprecision and ambiguity, credal sets
– closed, convex sets of probability distributions – have emerged as a
popular framework for representing imprecise probabilistic beliefs. Under such
imprecision, many learning problems in imprecise probabilistic machine learning
(IPML) may be viewed as processes involving successive applications of update
rules on credal sets. This naturally raises the question of whether this
iterative process converges to stable fixed points – or, more generally, under
what conditions on the updating mechanism such fixed points exist, and whether
they can be attained. We provide the first analysis of this problem and
illustrate our findings using Credal Bayesian Deep Learning as a concrete
example. Our work demonstrates that incorporating imprecision into the learning
process not only enriches the representation of uncertainty, but also reveals
structural conditions under which stability emerges, thereby offering new
insights into the dynamics of iterative learning under imprecision.
[LINK]
http://arxiv.org/abs/2510.04769v1
[DATE]
2025-10-06 20:42:32+08:00
[CATEGORIES]
cs.LG
ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
[AUTHORS]
Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Coleman Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, Kangwook Lee
[ABSTRACT]
While most autoregressive LLMs are constrained to one-by-one decoding,
diffusion LLMs (dLLMs) have attracted growing interest for their potential to
dramatically accelerate inference through parallel decoding. Despite this
promise, the conditional independence assumption in dLLMs causes parallel
decoding to ignore token dependencies, inevitably degrading generation quality
when these dependencies are strong. However, existing works largely overlook
these inherent challenges, and evaluations on standard benchmarks (e.g., math
and coding) are not sufficient to capture the quality degradation caused by
parallel decoding. To address this gap, we first provide an
information-theoretic analysis of parallel decoding. We then conduct case
studies on analytically tractable synthetic list operations from both data
distribution and decoding strategy perspectives, offering quantitative insights
that highlight the fundamental limitations of parallel decoding. Building on
these insights, we propose ParallelBench, the first benchmark specifically
designed for dLLMs, featuring realistic tasks that are trivial for humans and
autoregressive LLMs yet exceptionally challenging for dLLMs under parallel
decoding. Using ParallelBench, we systematically analyze both dLLMs and
autoregressive LLMs, revealing that: (i) dLLMs under parallel decoding can
suffer dramatic quality degradation in real-world scenarios, and (ii) current
parallel decoding strategies struggle to adapt their degree of parallelism
based on task difficulty, thus failing to achieve meaningful speedup without
compromising quality. Our findings underscore the pressing need for innovative
decoding methods that can overcome the current speed-quality trade-off. We
release our benchmark to help accelerate the development of truly efficient
dLLMs.
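The effect of the conditional independence assumption described above can be seen in a toy case. The sketch below uses a hypothetical two-token joint distribution (not one of the benchmark's tasks) and compares it with what fully parallel sampling from per-position marginals would produce.

```python
from itertools import product

# Hypothetical joint distribution over a two-token answer.
joint = {("New", "York"): 0.5, ("Los", "Angeles"): 0.5}

# Per-position marginals, which is all a fully parallel step can use
# if the two positions are treated as conditionally independent.
first, second = {}, {}
for (a, b), prob in joint.items():
    first[a] = first.get(a, 0.0) + prob
    second[b] = second.get(b, 0.0) + prob

# Distribution produced by sampling both positions independently.
parallel = {(a, b): first[a] * second[b] for a, b in product(first, second)}

# Probability mass placed on sequences the true joint never produces.
invalid_mass = sum(p for seq, p in parallel.items() if seq not in joint)
print(parallel)
print("mass on invalid sequences:", invalid_mass)
```

Because the two tokens are perfectly dependent here, independent sampling puts half of its probability mass on mixed sequences such as "New Angeles".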
[COMMENTS]
Project Page: https://parallelbench.github.io
[LINK]
http://arxiv.org/abs/2510.04767v1
[DATE]
2025-10-06 20:41:31+08:00
[CATEGORIES]
cs.LG
Fisher-Bingham-like normalizing flows on the sphere
[AUTHORS]
Thorsten Glüsenkamp
[ABSTRACT]
A generic D-dimensional Gaussian can be conditioned or projected onto the D-1
unit sphere, thereby leading to the well-known Fisher-Bingham (FB) or Angular
Gaussian (AG) distribution families, respectively. These are some of the most
fundamental distributions on the sphere, yet cannot straightforwardly be
written as a normalizing flow except in two special cases: the von Mises-Fisher
in D=3 and the central angular Gaussian in any D. In this paper, we describe
how to generalize these special cases to a family of normalizing flows that
behave similarly to the full FB or AG family in any D. We call them
“zoom-linear-project” (ZLP)-Fisher flows. Unlike a normal Fisher-Bingham
distribution, their composition allows adding complexity gradually as needed.
Furthermore, they can naturally handle conditional density estimation with
target distributions that vary by orders of magnitude in scale - a setting that
is important in astronomical applications but that existing flows often
struggle with. A particularly useful member of the new family is the Kent
analogue that can cheaply upgrade any flow in this situation to yield better
performance.
[LINK]
http://arxiv.org/abs/2510.04762v1
[DATE]
2025-10-06 20:38:28+08:00
[CATEGORIES]
cs.LG
Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors
[AUTHORS]
Zhiwei Han, Stefan Matthes, Hao Shen
[ABSTRACT]
In this work, we establish conditions under which nonlinear CCA recovers the
ground-truth latent factors up to an orthogonal transform after whitening.
Building on the classical result that linear mappings maximize canonical
correlations under Gaussian priors, we prove affine identifiability for a broad
class of latent distributions in the population setting. Central to our proof
is a reparameterization result that transports the analysis from observation
space to source space, where identifiability becomes tractable. We further show
that whitening is essential for ensuring boundedness and well-conditioning,
thereby underpinning identifiability. Beyond the population setting, we prove
that ridge-regularized empirical CCA converges to its population counterpart,
transferring these guarantees to the finite-sample regime. Experiments on a
controlled synthetic dataset and a rendered image dataset validate our theory
and demonstrate the necessity of its assumptions through systematic ablations.
[LINK]
http://arxiv.org/abs/2510.04758v1
[DATE]
2025-10-06 20:35:07+08:00
[CATEGORIES]
cs.LG
Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization
[AUTHORS]
Kezhao Liu, Jason Klein Liu, Mingtao Chen, Yiming Liu
[ABSTRACT]
Reinforcement Learning from Human Feedback (RLHF) leverages a
Kullback-Leibler (KL) divergence loss to stabilize training and prevent
overfitting. However, in methods such as GRPO, its implementation may be guided
by principles from numerical value estimation, a practice that overlooks the
term’s functional role as an optimization loss. To analyze this issue, we
establish a unified framework that connects two seemingly distinct
implementation styles: using the mathematical term $k_n$ as a detached
coefficient for the policy’s score function (‘$k_n$ in reward’) or as a direct
loss function through which gradients are propagated (‘$k_n$ as loss’). We show
that the latter can always be analyzed via an equivalent gradient coefficient
in the former, unifying the two perspectives. Through this framework, we prove
that the conventional ‘$k_1$ in reward’ (like in PPO) is the principled loss
for Reverse KL (RKL) regularization. We further establish a key finding: under
on-policy conditions, the ‘$k_2$ as loss’ formulation is, in fact,
gradient-equivalent to ‘$k_1$ in reward’. This equivalence, first proven in our
work, identifies both as the theoretically sound implementations of the RKL
objective. In contrast, we show that the recently adopted ‘$k_3$ as loss’ (like
in GRPO) is merely a first-order, biased approximation of the principled loss.
Furthermore, we argue that common off-policy implementations of ‘$k_n$ as loss’
methods are biased due to neglected importance sampling, and we propose a
principled correction. Our findings provide a comprehensive, gradient-based
rationale for choosing and correctly implementing KL regularization, paving the
way for more robust and effective RLHF systems.
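The abstract does not define the $k_n$ terms. The sketch below assumes the commonly used estimators k1 = -log r, k2 = (log r)^2 / 2 and k3 = (r - 1) - log r, with r = p(x) / q(x) and x drawn from q; both these definitions and the two distributions are assumptions made for illustration. It only compares the estimators as values and does not reproduce the paper's gradient analysis.

```python
import math

# Hypothetical discrete distributions over three outcomes (illustrative only).
# q plays the role of the sampling policy, p the reference policy.
q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]


def k1(r):
    # Assumed definition: k1 = -log r, with r = p(x) / q(x).
    return -math.log(r)


def k2(r):
    # Assumed definition: k2 = (log r)^2 / 2.
    return 0.5 * math.log(r) ** 2


def k3(r):
    # Assumed definition: k3 = (r - 1) - log r; non-negative for every r > 0.
    return (r - 1.0) - math.log(r)


def expectation_under_q(estimator):
    # Exact expectation over x ~ q, so no sampling noise is involved.
    return sum(qi * estimator(pi / qi) for pi, qi in zip(p, q))


# Reverse KL, KL(q || p), computed directly from its definition.
true_kl = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))

e_k1 = expectation_under_q(k1)
e_k2 = expectation_under_q(k2)
e_k3 = expectation_under_q(k3)

print(f"KL(q||p) = {true_kl:.6f}")
print(f"E_q[k1]  = {e_k1:.6f}")
print(f"E_q[k2]  = {e_k2:.6f}")
print(f"E_q[k3]  = {e_k3:.6f}")
```

Under these assumed definitions, k1 and k3 both match the reverse KL in expectation while k2 does not. Agreement in value says nothing about agreement in gradient once a term is used as a loss, which is the distinction the abstract draws.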
[LINK]
http://arxiv.org/abs/2510.01555v2
[DATE]
2025-10-06 19:59:12+08:00
[CATEGORIES]
cs.LG
EVaR-Optimal Arm Identification in Bandits
[AUTHORS]
Mehrasa Ahmadipour, Aurélien Garivier
[ABSTRACT]
We study the fixed-confidence best arm identification (BAI) problem within
the multi-armed bandit (MAB) framework under the Entropic Value-at-Risk (EVaR)
criterion. Our analysis considers a nonparametric setting, allowing for general
reward distributions bounded in [0,1]. This formulation addresses the critical
need for risk-averse decision-making in high-stakes environments, such as
finance, moving beyond simple expected value optimization. We propose a
$\delta$-correct, Track-and-Stop based algorithm and derive a corresponding
lower bound on the expected sample complexity, which we prove is asymptotically
matched. The implementation of our algorithm and the characterization of the
lower bound both require solving a complex convex optimization problem and a
related, simpler non-convex one.
[LINK]
http://arxiv.org/abs/2510.04728v1
[DATE]
2025-10-06 19:49:56+08:00
[CATEGORIES]
cs.LG
Directional Sheaf Hypergraph Networks: Unifying Learning on Directed and Undirected Hypergraphs
[AUTHORS]
Emanuele Mule, Stefano Fiorini, Antonio Purificato, Federico Siciliano, Stefano Coniglio, Fabrizio Silvestri
[ABSTRACT]
Hypergraphs provide a natural way to represent higher-order interactions
among multiple entities. While undirected hypergraphs have been extensively
studied, the case of directed hypergraphs, which can model oriented group
interactions, remains largely under-explored despite its relevance for many
applications. Recent approaches in this direction often exhibit an implicit
bias toward homophily, which limits their effectiveness in heterophilic
settings. Rooted in the algebraic topology notion of Cellular Sheaves, Sheaf
Neural Networks (SNNs) were introduced as an effective solution to circumvent
such a drawback. While a generalization to hypergraphs is known, it is only
suitable for undirected hypergraphs, failing to tackle the directed case. In
this work, we introduce Directional Sheaf Hypergraph Networks (DSHN), a
framework integrating sheaf theory with a principled treatment of asymmetric
relations within a hypergraph. From it, we construct the Directed Sheaf
Hypergraph Laplacian, a complex-valued operator by which we unify and
generalize many existing Laplacian matrices proposed in the graph- and
hypergraph-learning literature. Across 7 real-world datasets and against 13
baselines, DSHN achieves relative accuracy gains from 2% up to 20%, showing how
a principled treatment of directionality in hypergraphs, combined with the
expressive power of sheaves, can substantially improve performance.
[LINK]
http://arxiv.org/abs/2510.04727v1
[DATE]
2025-10-06 19:46:53+08:00
[CATEGORIES]
cs.LG
Predictive economics: Rethinking economic methodology with machine learning
[AUTHORS]
Miguel Alves Pereira
[ABSTRACT]
This article proposes predictive economics as a distinct analytical
perspective within economics, grounded in machine learning and centred on
predictive accuracy rather than causal identification. Drawing on the
instrumentalist tradition (Friedman), the explanation-prediction divide
(Shmueli), and the contrast between modelling cultures (Breiman), we formalise
prediction as a valid epistemological and methodological objective. Reviewing
recent applications across economic subfields, we show how predictive models
contribute to empirical analysis, particularly in complex or data-rich
contexts. This perspective complements existing approaches and supports a more
pluralistic methodology - one that values out-of-sample performance alongside
interpretability and theoretical structure.
[COMMENTS]
8 pages
[LINK]
http://arxiv.org/abs/2510.04726v1
[DATE]
2025-10-06 19:46:03+08:00
[CATEGORIES]
cs.LG
Low Resource Audio Codec Challenge Baseline Systems
[AUTHORS]
Yusuf Ziya Isik, Rafał Łaganowski
[ABSTRACT]
The Low-Resource Audio Codec (LRAC) Challenge aims to advance neural audio
coding for deployment in resource-constrained environments. The first edition
focuses on low-resource neural speech codecs that must operate reliably under
everyday noise and reverberation, while satisfying strict constraints on
computational complexity, latency, and bitrate. Track 1 targets transparency
codecs, which aim to preserve the perceptual transparency of input speech under
mild noise and reverberation. Track 2 addresses enhancement codecs, which
combine coding and compression with denoising and dereverberation. This paper
presents the official baseline systems for both tracks in the 2025 LRAC
Challenge. The baselines are convolutional neural codec models with Residual
Vector Quantization, trained end-to-end using a combination of adversarial and
reconstruction objectives. We detail the data filtering and augmentation
strategies, model architectures, optimization procedures, and checkpoint
selection criteria.
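Residual Vector Quantization, mentioned above, codes a vector in stages, with each stage quantizing the residual left by the previous one. The sketch below is a minimal pure-Python illustration with small hand-picked codebooks; in a neural codec the codebooks are learned, and none of the values here come from the baseline systems.

```python
def nearest(codebook, vector):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize `vector` stage by stage, each stage coding the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Sum the selected codewords from every stage."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical fixed codebooks: a coarse stage followed by a finer stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
indices, residual = rvq_encode([1.25, 1.0], codebooks)
reconstruction = rvq_decode(indices, codebooks)
print(indices, reconstruction, residual)
```

Only the per-stage indices need to be transmitted, so the bitrate grows with the number of stages and the size of each codebook.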
[COMMENTS]
Low-Resource Audio Codec Challenge 2025
[LINK]
http://arxiv.org/abs/2510.00264v2
[DATE]
2025-10-06 19:39:10+08:00
[CATEGORIES]
cs.LG
RhoDARTS: Differentiable Quantum Architecture Search with Density Matrix Simulations
[AUTHORS]
Swagat Kumar, Jan-Nico Zaech, Colin Michael Wilmott, Luc Van Gool
[ABSTRACT]
Variational Quantum Algorithms (VQAs) are a promising approach to leverage
Noisy Intermediate-Scale Quantum (NISQ) computers. However, choosing optimal
quantum circuits that efficiently solve a given VQA problem is a non-trivial
task. Quantum Architecture Search (QAS) algorithms enable automatic generation
of quantum circuits tailored to the provided problem. Existing QAS approaches
typically adapt classical neural architecture search techniques, training
machine learning models to sample relevant circuits, but often overlook the
inherent quantum nature of the circuits they produce. By reformulating QAS from
a quantum perspective, we propose a sampling-free differentiable QAS algorithm
that models the search process as the evolution of a quantum mixed state, which
emerges from the search space of quantum circuits. The mixed state formulation
also enables our method to incorporate generic noise models, for example the
depolarizing channel, which cannot be modeled by state vector simulation. We
validate our method by finding circuits for state initialization and
Hamiltonian optimization tasks, namely the variational quantum eigensolver and
the unweighted max-cut problems. We show our approach to be comparable to, if
not better than, existing QAS techniques while requiring significantly fewer
quantum simulations during training, and also show improved robustness to
noise.
[COMMENTS]
27 pages, 19 figures
[LINK]
http://arxiv.org/abs/2506.03697v2
[DATE]
2025-10-06 19:34:48+08:00
[CATEGORIES]
cs.LG
Frame-based Equivariant Diffusion Models for 3D Molecular Generation
[AUTHORS]
Mohan Guo, Cong Liu, Patrick Forré
[ABSTRACT]
Recent methods for molecular generation face a trade-off: they either enforce
strict equivariance with costly architectures or relax it to gain scalability
and flexibility. We propose a frame-based diffusion paradigm that achieves
deterministic E(3)-equivariance while decoupling symmetry handling from the
backbone. Building on this paradigm, we investigate three variants: Global
Frame Diffusion (GFD), which assigns a shared molecular frame; Local Frame
Diffusion (LFD), which constructs node-specific frames and benefits from
additional alignment constraints; and Invariant Frame Diffusion (IFD), which
relies on pre-canonicalized invariant representations. To enhance expressivity,
we further utilize EdgeDiT, a Diffusion Transformer with edge-aware attention.
On the QM9 dataset, GFD with EdgeDiT achieves state-of-the-art performance,
with a test NLL of -137.97 at standard scale and -141.85 at double scale,
alongside atom stability of 98.98%, and molecular stability of 90.51%. These
results surpass all equivariant baselines while maintaining high validity and
uniqueness and nearly 2x faster sampling compared to EDM. Altogether, our study
establishes frame-based diffusion as a scalable, flexible, and physically
grounded paradigm for molecular generation, highlighting the critical role of
global structure preservation.
[LINK]
http://arxiv.org/abs/2509.19506v2
[DATE]
2025-10-06 19:26:19+08:00
[CATEGORIES]
cs.LG
Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
[AUTHORS]
Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie, Yiwei Wang, Xiaodan Liang, Jing Tang
[ABSTRACT]
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a
powerful paradigm for unlocking reasoning capabilities in large language
models, yet its full potential is hindered by two under-explored dimensions:
Depth, the hardest problem a model can sample, and Breadth, the number of
instances consumed in a single iteration. We dissect the popular GRPO algorithm
and reveal a systematic bias: the cumulative advantage disproportionately
weights samples with medium accuracy, while down-weighting the low-accuracy
instances that are crucial for pushing reasoning boundaries. To rectify the
depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which
re-weights hard problems through targeted multi-stage rollouts, thereby
increasing the number of positive rollouts for hard problems. Empirically,
naively enlarging rollout size only accelerates convergence and even hurts
Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra
inference cost at convergence. Just as we adaptively expanded the depth of
exploration, we now ask whether aggressively scaling the breadth of training
data can further amplify reasoning gains. To this end, we intensely scale batch
size and replace PPO’s mini-batch iterations with full-batch updates over
multiple epochs. Increasing breadth significantly enhances Pass@1 performance.
Large-breadth training sustains high token-level entropy, indicating continued
exploration and reduced gradient noise. We further present DARS-B, which
augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K
and Pass@1. The results confirm that breadth and adaptive exploration across
depth operate as orthogonal dimensions in RLVR, which are key to unleashing the
reasoning power of RLVR.
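The claimed bias toward medium-accuracy samples can be illustrated under simple assumptions. The sketch below assumes binary rewards, the group-normalized advantage A_i = (r_i - mean) / std usually associated with GRPO, and reads "cumulative advantage" as the sum of absolute advantages within a rollout group. The paper's exact definition may differ, and the group size is hypothetical.

```python
import math


def cumulative_abs_advantage(num_correct, group_size):
    """Sum of |advantage| over one rollout group with binary rewards.

    Assumes the group-normalized advantage A_i = (r_i - mean) / std, which is
    how GRPO is usually described; the paper's exact definition may differ.
    """
    accuracy = num_correct / group_size
    std = math.sqrt(accuracy * (1.0 - accuracy))
    if std == 0.0:
        # All rollouts agree, so every advantage is zero (no learning signal).
        return 0.0
    adv_correct = (1.0 - accuracy) / std
    adv_wrong = (0.0 - accuracy) / std
    num_wrong = group_size - num_correct
    return num_correct * abs(adv_correct) + num_wrong * abs(adv_wrong)


group_size = 8  # hypothetical rollout count per problem
weights = [cumulative_abs_advantage(c, group_size) for c in range(group_size + 1)]

for c, w in enumerate(weights):
    print(f"accuracy={c / group_size:.3f}  cumulative |A| = {w:.3f}")
```

Under these assumptions the total works out to 2 * G * sqrt(p * (1 - p)) for group size G and accuracy p, which peaks at p = 0.5 and shrinks toward zero for very hard or very easy problems.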
[COMMENTS]
18 pages, 14 figures
[LINK]
http://arxiv.org/abs/2508.13755v4
[DATE]
2025-10-06 19:12:22+08:00
[CATEGORIES]
cs.LG
How does the optimizer implicitly bias the model merging loss landscape?
[AUTHORS]
Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw
[ABSTRACT]
Model merging methods combine models with different capabilities into a
single one while maintaining the same inference cost. Two popular approaches
are linear interpolation, which linearly interpolates between model weights,
and task arithmetic, which combines task vectors obtained by the difference
between finetuned and base models. While merging is useful in practice, the
properties that make it effective are poorly understood. This paper explores
how the optimization process affects the loss landscape geometry and its impact
on merging success. We show that a single quantity – the effective noise scale –
unifies the impact of optimizer and data choices on model merging. Across
architectures and datasets, the effectiveness of merging is a
non-monotonic function of effective noise, with a distinct optimum. Decomposing
this quantity, we find that larger learning rates, stronger weight decay,
smaller batch sizes, and data augmentation all independently modulate the
effective noise scale, exhibiting the same qualitative trend. Unlike prior work
that connects optimizer noise to the flatness or generalization of individual
minima, we show that it also affects the global loss landscape, predicting when
independently trained solutions can be merged. Our findings broaden the
understanding of how optimization shapes the loss landscape geometry and its
downstream consequences for model merging, suggesting the possibility of
further manipulating the training dynamics to improve merging effectiveness.
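As a minimal sketch of the two merging schemes named in the abstract, the snippet below treats a model as a dictionary of flat weight lists. The function names `linear_interpolate` and `task_arithmetic`, the mixing coefficient `alpha`, and the scaling factor `lam` are illustrative assumptions, not taken from the paper.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def linear_interpolate(a: Weights, b: Weights, alpha: float = 0.5) -> Weights:
    """Return (1 - alpha) * a + alpha * b, parameter by parameter."""
    return {
        name: [(1 - alpha) * x + alpha * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def task_arithmetic(base: Weights, finetuned: List[Weights], lam: float = 1.0) -> Weights:
    """Add the scaled sum of task vectors (finetuned - base) to the base weights."""
    merged: Weights = {}
    for name, base_vals in base.items():
        merged[name] = [
            w0 + lam * sum(ft[name][i] - w0 for ft in finetuned)
            for i, w0 in enumerate(base_vals)
        ]
    return merged


# Toy weights (hypothetical values chosen only for illustration).
base = {"w": [0.0, 1.0]}
model_a = {"w": [1.0, 1.0]}
model_b = {"w": [0.0, 3.0]}

midpoint = linear_interpolate(model_a, model_b, alpha=0.5)
merged = task_arithmetic(base, [model_a, model_b], lam=1.0)
```

Linear interpolation needs only the two endpoints, whereas task arithmetic also needs the shared base model from which both were finetuned.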
[COMMENTS]
preprint
[LINK]
http://arxiv.org/abs/2510.04686v1
[DATE]
2025-10-06 18:56:41+08:00
[CATEGORIES]
cs.LG
MoESD: Unveil Speculative Decoding’s Potential for Accelerating Sparse MoE
[AUTHORS]
Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang
[LINK]
http://arxiv.org/abs/2505.19645v3
[DATE]
2025-10-06 18:53:42+08:00
[CATEGORIES]
cs.LG
Parameter-free Algorithms for the Stochastically Extended Adversarial Model
[AUTHORS]
Shuche Wang, Adarsh Barik, Peng Zhao, Vincent Y. F. Tan
[ABSTRACT]
We develop the first parameter-free algorithms for the Stochastically
Extended Adversarial (SEA) model, a framework that bridges adversarial and
stochastic online convex optimization. Existing approaches for the SEA model
require prior knowledge of problem-specific parameters, such as the diameter of
the domain $D$ and the Lipschitz constant of the loss functions $G$, which
limits their practical applicability. Addressing this, we develop
parameter-free methods by leveraging the Optimistic Online Newton Step (OONS)
algorithm to eliminate the need for these parameters. We first establish a
comparator-adaptive algorithm for the scenario with unknown domain diameter but
known Lipschitz constant, achieving an expected regret bound of
$\tilde{O}\big(\|u\|_2^2 + \|u\|_2(\sqrt{\sigma^2_{1:T}} +
\sqrt{\Sigma^2_{1:T}})\big)$, where $u$ is the comparator vector and
$\sigma^2_{1:T}$ and $\Sigma^2_{1:T}$ represent the cumulative stochastic
variance and cumulative adversarial variation, respectively. We then extend
this to the more general setting where both $D$ and $G$ are unknown, attaining
the comparator- and Lipschitz-adaptive algorithm. Notably, the regret bound
exhibits the same dependence on $\sigma^2_{1:T}$ and $\Sigma^2_{1:T}$,
demonstrating the efficacy of our proposed methods even when both parameters
are unknown in the SEA model.
[COMMENTS]
Accepted to NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.04685v1
[DATE]
2025-10-06 18:53:37+08:00
[CATEGORIES]
cs.LG
SALAD: Systematic Assessment of Machine Unlearning on LLM-Aided Hardware Design
[AUTHORS]
Zeng Wang, Minghao Shao, Rupesh Karn, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
[ABSTRACT]
Large Language Models (LLMs) offer transformative capabilities for hardware
design automation, particularly in Verilog code generation. However, they also
pose significant data security challenges, including Verilog evaluation data
contamination, intellectual property (IP) design leakage, and the risk of
malicious Verilog generation. We introduce SALAD, a comprehensive assessment
that leverages machine unlearning to mitigate these threats. Our approach
enables the selective removal of contaminated benchmarks, sensitive IP and
design artifacts, or malicious code patterns from pre-trained LLMs, all without
requiring full retraining. Through detailed case studies, we demonstrate how
machine unlearning techniques effectively reduce data security risks in
LLM-aided hardware design.
[LINK]
http://arxiv.org/abs/2506.02089v4
[DATE]
2025-10-06 18:38:59+08:00
[CATEGORIES]
cs.LG
Neural Deconstruction Search for Vehicle Routing Problems
[AUTHORS]
André Hottung, Paula Wong-Chung, Kevin Tierney
[ABSTRACT]
Autoregressive construction approaches generate solutions to vehicle routing
problems in a step-by-step fashion, leading to high-quality solutions that are
nearing the performance achieved by handcrafted operations research techniques.
In this work, we challenge the conventional paradigm of sequential solution
construction and introduce an iterative search framework where solutions are
instead deconstructed by a neural policy. Throughout the search, the neural
policy collaborates with a simple greedy insertion algorithm to rebuild the
deconstructed solutions. Our approach matches or surpasses the performance of
state-of-the-art operations research methods across three challenging vehicle
routing problems of various problem sizes.
[COMMENTS]
Published in TMLR
[LINK]
http://arxiv.org/abs/2501.03715v2
[DATE]
2025-10-06 18:38:24+08:00
[CATEGORIES]
cs.LG
Counterfactual Credit Guided Bayesian Optimization
[AUTHORS]
Qiyu Wei, Haowei Wang, Richard Allmendinger, Mauricio A. Álvarez
[ABSTRACT]
Bayesian optimization has emerged as a prominent methodology for optimizing
expensive black-box functions by leveraging Gaussian process surrogates, which
focus on capturing the global characteristics of the objective function.
However, in numerous practical scenarios, the primary objective is not to
construct an exhaustive global surrogate, but rather to quickly pinpoint the
global optimum. Due to the aleatoric nature of the sequential optimization
problem and its dependence on the quality of the surrogate model and the
initial design, it is restrictive to assume that all observed samples
contribute equally to the discovery of the optimum in this context. In this
paper, we introduce Counterfactual Credit Guided Bayesian Optimization (CCGBO),
a novel framework that explicitly quantifies the contribution of individual
historical observations through counterfactual credit. By incorporating
counterfactual credit into the acquisition function, our approach can
selectively allocate resources in areas where optimal solutions are most likely
to occur. We prove that CCGBO retains sublinear regret. Empirical evaluations
on various synthetic and real-world benchmarks demonstrate that CCGBO
consistently reduces simple regret and accelerates convergence to the global
optimum.
[LINK]
http://arxiv.org/abs/2510.04676v1
[DATE]
2025-10-06 18:34:50+08:00
[CATEGORIES]
cs.LG
Semantic Channel Equalization Strategies for Deep Joint Source-Channel Coding
[AUTHORS]
Lorenzo Pannacci, Simone Fiorellino, Mario Edoardo Pandolfo, Emilio Calvanese Strinati, Paolo Di Lorenzo
[ABSTRACT]
Deep joint source-channel coding (DeepJSCC) has emerged as a powerful
paradigm for end-to-end semantic communications, jointly learning to compress
and protect task-relevant features over noisy channels. However, existing
DeepJSCC schemes assume a shared latent space at transmitter (TX) and receiver
(RX) - an assumption that fails in multi-vendor deployments where encoders and
decoders cannot be co-trained. This mismatch introduces “semantic noise”,
degrading reconstruction quality and downstream task performance. In this
paper, we systematize and evaluate methods for semantic channel equalization
for DeepJSCC, introducing an additional processing stage that aligns
heterogeneous latent spaces under both physical and semantic impairments. We
investigate three classes of aligners: (i) linear maps, which admit closed-form
solutions; (ii) lightweight neural networks, offering greater expressiveness;
and (iii) a Parseval-frame equalizer, which operates in zero-shot mode without
the need for training. Through extensive experiments on image reconstruction
over AWGN and fading channels, we quantify trade-offs among complexity, data
efficiency, and fidelity, providing guidelines for deploying DeepJSCC in
heterogeneous AI-native wireless networks.
[COMMENTS]
Proceedings of IEEE Globecom 2025 Workshops
[LINK]
http://arxiv.org/abs/2510.04674v1
[DATE]
2025-10-06 18:29:07+08:00
[CATEGORIES]
cs.LG
PolyNet: Learning Diverse Solution Strategies for Neural Combinatorial Optimization
[AUTHORS]
André Hottung, Mridul Mahajan, Kevin Tierney
[ABSTRACT]
Reinforcement learning-based methods for constructing solutions to
combinatorial optimization problems are rapidly approaching the performance of
human-designed algorithms. To further narrow the gap, learning-based approaches
must efficiently explore the solution space during the search process. Recent
approaches artificially increase exploration by enforcing diverse solution
generation through handcrafted rules; however, these rules can impair solution
quality and are difficult to design for more complex problems. In this paper,
we introduce PolyNet, an approach for improving exploration of the solution
space by learning complementary solution strategies. In contrast to other
works, PolyNet uses only a single decoder and a training schema that does not
enforce diverse solution generation through handcrafted rules. We evaluate
PolyNet on four combinatorial optimization problems and observe that the
implicit diversity mechanism allows PolyNet to find better solutions than
approaches that explicitly enforce diverse solution generation.
[COMMENTS]
Accepted at ICLR 2025
[LINK]
http://arxiv.org/abs/2402.14048v2
[DATE]
2025-10-06 18:28:23+08:00
[CATEGORIES]
cs.LG
TANTE: Time-Adaptive Operator Learning via Neural Taylor Expansion
[AUTHORS]
Zhikai Wu, Sifan Wang, Shiyang Zhang, Sizhuang He, Min Zhu, Anran Jiao, Lu Lu, David van Dijk
[ABSTRACT]
Operator learning for time-dependent partial differential equations (PDEs)
has seen rapid progress in recent years, enabling efficient approximation of
complex spatiotemporal dynamics. However, most existing methods rely on fixed
time step sizes during rollout, which limits their ability to adapt to varying
temporal complexity and often leads to error accumulation. Here, we propose the
Time-Adaptive Transformer with Neural Taylor Expansion (TANTE), a novel
operator-learning framework that produces continuous-time predictions with
adaptive step sizes. TANTE predicts future states by performing a Taylor
expansion at the current state, where neural networks learn both the
higher-order temporal derivatives and the local radius of convergence. This
allows the model to dynamically adjust its rollout based on the local behavior
of the solution, thereby reducing cumulative error and improving computational
efficiency. We demonstrate the effectiveness of TANTE across a wide range of
PDE benchmarks, achieving superior accuracy and adaptability compared to
fixed-step baselines, delivering accuracy gains of 60-80% and speed-ups of
30-40% at inference time. The code is publicly available at
https://github.com/zwu88/TANTE for transparency and reproducibility.
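The rollout idea described above can be sketched as follows, under stated assumptions: the state is a single scalar, and `toy_predictor` is a hypothetical stand-in for the learned networks that returns analytic derivatives for du/dt = u together with a fixed step radius. None of these names come from the TANTE code base.

```python
import math
from typing import Callable, List, Tuple

Predictor = Callable[[float], Tuple[List[float], float]]


def taylor_step(state: float, derivs: List[float], dt: float) -> float:
    """Advance the state by dt with a truncated Taylor series.

    derivs[k - 1] holds the k-th time derivative at the current state.
    """
    return state + sum(
        d * dt ** k / math.factorial(k) for k, d in enumerate(derivs, start=1)
    )


def adaptive_rollout(state: float, t_end: float, predictor: Predictor) -> Tuple[float, int]:
    """Roll out to t_end; the predictor supplies derivatives and a step radius."""
    t, steps = 0.0, 0
    while t < t_end:
        derivs, radius = predictor(state)
        dt = min(radius, t_end - t)  # never step past the horizon
        state = taylor_step(state, derivs, dt)
        t += dt
        steps += 1
    return state, steps


def toy_predictor(state: float) -> Tuple[List[float], float]:
    # Assumption for illustration: for du/dt = u every derivative equals u,
    # and the step radius is a constant 0.25.
    return [state] * 4, 0.25


final_state, num_steps = adaptive_rollout(1.0, 1.0, toy_predictor)
```

With a learned predictor the radius would vary along the trajectory, so smooth regions are crossed in fewer steps than rapidly changing ones.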
[COMMENTS]
22 pages, 7 figures, 10 tables
[LINK]
http://arxiv.org/abs/2502.08574v3
[DATE]
2025-10-06 18:27:56+08:00
[CATEGORIES]
cs.LG
Optimal Bound for PCA with Outliers using Higher-Degree Voronoi Diagrams
[AUTHORS]
Sajjad Hashemian, Mohammad Saeed Arvenaghi, Ebrahim Ardeshir-Larijani
[ABSTRACT]
In this paper, we introduce new algorithms for Principal Component Analysis
(PCA) with outliers. Utilizing techniques from computational geometry,
specifically higher-degree Voronoi diagrams, we navigate to the optimal
subspace for PCA even in the presence of outliers. This approach achieves an
optimal solution with a time complexity of
$n^{d+\mathcal{O}(1)}\text{poly}(n,d)$. Additionally, we present a randomized
algorithm with a complexity of $2^{\mathcal{O}(r(d-r))} \times \text{poly}(n,
d)$. This algorithm samples subspaces characterized in terms of a Grassmannian
manifold. By employing this sampling method, we ensure a high likelihood of
capturing the optimal subspace, with success probability $(1 - \delta)^T$,
where $\delta$ represents the probability that a sampled subspace does not
contain the optimal solution, and $T$ is the number of subspaces sampled,
proportional to $2^{r(d-r)}$. Our use of higher-degree Voronoi diagrams and
Grassmannian-based sampling offers a clearer conceptual pathway and practical
advantages, particularly in handling large datasets or higher-dimensional
settings.
[LINK]
http://arxiv.org/abs/2408.06867v3
[DATE]
2025-10-06 18:23:52+08:00
[CATEGORIES]
cs.LG
SEE-DPO: Self Entropy Enhanced Direct Preference Optimization
[AUTHORS]
Shivanshu Shekhar, Shreyas Singh, Tong Zhang
[ABSTRACT]
Direct Preference Optimization (DPO) has been successfully used to align
large language models (LLMs) according to human preferences, and more recently
it has also been applied to improving the quality of text-to-image diffusion
models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are
highly susceptible to overfitting and reward hacking, especially when the
generative model is optimized to fit out-of-distribution during prolonged
training. To overcome these challenges and stabilize the training of diffusion
models, we introduce a self-entropy regularization mechanism in reinforcement
learning from human feedback. This enhancement improves DPO training by
encouraging broader exploration and greater robustness. Our regularization
technique effectively mitigates reward hacking, leading to improved stability
and enhanced image quality across the latent space. Extensive experiments
demonstrate that integrating human feedback with self-entropy regularization
can significantly boost image diversity and specificity, achieving
state-of-the-art results on key image generation metrics.
[LINK]
http://arxiv.org/abs/2411.04712v2
[DATE]
2025-10-06 18:21:14+08:00
[CATEGORIES]
cs.LG
IMLP: An Energy-Efficient Continual Learning Method for Tabular Data Streams
[AUTHORS]
Yuandou Wang, Filip Gunnarsson, Rihan Hai
[ABSTRACT]
Tabular data streams are rapidly emerging as a dominant modality for
real-time decision-making in healthcare, finance, and the Internet of Things
(IoT). These applications commonly run on edge and mobile devices, where energy
budgets, memory, and compute are strictly limited. Continual learning (CL)
addresses such dynamics by training models sequentially on task streams while
preserving prior knowledge and consolidating new knowledge. While recent CL
work has advanced in mitigating catastrophic forgetting and improving knowledge
transfer, the practical requirements of energy and memory efficiency for
tabular data streams remain underexplored. In particular, existing CL solutions
mostly depend on replay mechanisms whose buffers grow over time and exacerbate
resource costs.
We propose a context-aware incremental Multi-Layer Perceptron (IMLP), a
compact continual learner for tabular data streams. IMLP incorporates a
windowed scaled dot-product attention over a sliding latent feature buffer,
enabling constant-size memory and avoiding storing raw data. The attended
context is concatenated with current features and processed by shared
feed-forward layers, yielding lightweight per-segment updates. To assess
practical deployability, we introduce NetScore-T, a tunable metric coupling
balanced accuracy with energy for Pareto-aware comparison across models and
datasets. IMLP achieves up to $27.6\times$ higher energy efficiency than TabNet
and $85.5\times$ higher than TabPFN, while maintaining competitive average
accuracy. Overall, IMLP provides an easy-to-deploy, energy-efficient
alternative to full retraining for tabular data streams.
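A minimal sketch of windowed scaled dot-product attention over a fixed-size sliding buffer is given below. It assumes the buffered vectors serve as both keys and values and that the toy two-dimensional vectors stand in for latent features; the names `attend` and `window` and the window length of 3 are hypothetical, not taken from the paper.

```python
import math
from collections import deque
from typing import Deque, List


def attend(query: List[float], buffer: Deque[List[float]]) -> List[float]:
    """Scaled dot-product attention of query over buffered vectors (keys = values)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in buffer]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * vec[i] for w, vec in zip(weights, buffer)) / total
        for i in range(len(query))
    ]


# deque(maxlen=...) keeps memory constant: the oldest entry is dropped on overflow.
window: Deque[List[float]] = deque(maxlen=3)
outputs = []
stream = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0])
for latent in stream:
    context = attend(latent, window) if window else [0.0] * len(latent)
    outputs.append(latent + context)  # concatenation passed to the shared layers
    window.append(latent)
```

Because the buffer has a fixed maximum length, memory use does not grow with the length of the stream, in contrast to a replay buffer that accumulates samples.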
[LINK]
http://arxiv.org/abs/2510.04660v1
[DATE]
2025-10-06 18:05:44+08:00
[CATEGORIES]
cs.LG
What Drives Compositional Generalization in Visual Generative Models?
[AUTHORS]
Karim Farid, Rajat Sahay, Yumna Ali Alnaggar, Simon Schrodi, Volker Fischer, Cordelia Schmid, Thomas Brox
[ABSTRACT]
Compositional generalization, the ability to generate novel combinations of
known concepts, is a key ingredient for visual generative models. Yet, not all
mechanisms that enable or inhibit it are fully understood. In this work, we
conduct a systematic study of how various design choices influence
compositional generalization in image and video generation in a positive or
negative way. Through controlled experiments, we identify two key factors: (i)
whether the training objective operates on a discrete or continuous
distribution, and (ii) to what extent conditioning provides information about
the constituent concepts during training. Building on these insights, we show
that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based
objective can improve compositional performance in discrete models like
MaskGIT.
[LINK]
http://arxiv.org/abs/2510.03075v2
[DATE]
2025-10-06 18:01:02+08:00
[CATEGORIES]
cs.LG
Analyzing Uncertainty Quantification in Statistical and Deep Learning Models for Probabilistic Electricity Price Forecasting
[AUTHORS]
Andreas Lebedev, Abhinav Das, Sven Pappert, Stephan Schlüter
[ABSTRACT]
Precise probabilistic forecasts are fundamental for energy risk management,
and there is a wide range of both statistical and machine learning models for
this purpose. Inherent to these probabilistic models is some form of
uncertainty quantification. However, most models do not capture the full extent
of uncertainty, which arises not only from the data itself but also from model
and distributional choices. In this study, we examine uncertainty
quantification in state-of-the-art statistical and deep learning probabilistic
forecasting models for electricity price forecasting in the German market. In
particular, we consider deep distributional neural networks (DDNNs) and augment
them with an ensemble approach, Monte Carlo (MC) dropout, and conformal
prediction to account for model uncertainty. Additionally, we consider the
LASSO-estimated autoregressive (LEAR) approach combined with quantile
regression averaging (QRA), generalized autoregressive conditional
heteroskedasticity (GARCH), and conformal prediction. Across a range of
performance metrics, we find that the LEAR-based models perform well in terms
of probabilistic forecasting, irrespective of the uncertainty quantification
method. Furthermore, we find that DDNNs benefit from incorporating both data
and model uncertainty, improving both point and probabilistic forecasting.
Uncertainty itself appears to be best captured by the models using conformal
prediction. Overall, our extensive study shows that all models under
consideration perform competitively. However, their relative performance
depends on the choice of metrics for point and probabilistic forecasting.
[LINK]
http://arxiv.org/abs/2509.19417v2
[DATE]
2025-10-06 17:55:49+08:00
[CATEGORIES]
cs.LG
TOAST: Transformer Optimization using Adaptive and Simple Transformations
[AUTHORS]
Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt
[ABSTRACT]
Foundation models achieve State-of-the-Art (SOTA) performance across
different tasks, but their size and computational demands raise concerns about
accessibility and sustainability. Existing efficiency methods often require
additional retraining or fine-tuning, limiting their practicality. Recent
findings suggest that deep neural networks exhibit internal representation
similarities. While such similarities across different models have been
exploited for enabling techniques such as model stitching and merging,
intra-network redundancy remains underexplored as a source for efficiency
gains. In this paper, we introduce TOAST (Transformer Optimization using
Adaptive and Simple Transformations), a framework that exploits these
redundancies to approximate entire transformer blocks with lightweight
closed-form mappings, such as a linear transformation or even the identity,
without any additional training. Across SOTA pretrained vision models (e.g.,
ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST
reduces parameters and computation while preserving, and in some cases
improving, downstream performance. These results show that large portions of
transformer depth can be replaced by trivial functions, opening a new
perspective on efficient foundation models.
[COMMENTS]
24 pages, 15 figures, 12 tables
[LINK]
http://arxiv.org/abs/2410.04941v6
[DATE]
2025-10-06 17:49:19+08:00
[CATEGORIES]
cs.LG
Predictive Feature Caching for Training-free Acceleration of Molecular Geometry Generation
[AUTHORS]
Johanna Sommer, John Rachwan, Nils Fleischmann, Stephan Günnemann, Bertrand Charpentier
[ABSTRACT]
Flow matching models generate high-fidelity molecular geometries but incur
significant computational costs during inference, requiring hundreds of network
evaluations. This inference overhead becomes the primary bottleneck when such
models are employed in practice to sample large numbers of molecular
candidates. This work discusses a training-free caching strategy that
accelerates molecular geometry generation by predicting intermediate hidden
states across solver steps. The proposed method operates directly on the
SE(3)-equivariant backbone, is compatible with pretrained models, and is
orthogonal to existing training-based accelerations and system-level
optimizations. Experiments on the GEOM-Drugs dataset demonstrate that caching
achieves a twofold reduction in wall-clock inference time at matched sample
quality and a speedup of up to 3x compared to the base model with minimal
sample quality degradation. Because these gains compound with other
optimizations, applying caching alongside other general, lossless optimizations
yields as much as a 7x speedup.
[COMMENTS]
Accepted at the AI for Science Workshop @ NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.04646v1
[DATE]
2025-10-06 17:49:14+08:00
[CATEGORIES]
cs.LG
MINERVA: Mutual Information Neural Estimation for Supervised Feature Selection
[AUTHORS]
Taurai Muvunza, Egor Kraev, Pere Planell-Morell, Alexander Y. Shestopaloff
[ABSTRACT]
Existing feature filters rely on statistical pair-wise dependence metrics to
model feature-target relationships, but this approach may fail when the target
depends on higher-order feature interactions rather than individual
contributions. We introduce Mutual Information Neural Estimation Regularized
Vetting Algorithm (MINERVA), a novel approach to supervised feature selection
based on neural estimation of mutual information between features and targets.
We parameterize the approximation of mutual information with neural networks
and perform feature selection using a carefully designed loss function
augmented with sparsity-inducing regularizers. Our method is implemented in a
two-stage process to decouple representation learning from feature selection,
ensuring better generalization and a more accurate expression of feature
importance. We present examples of ubiquitous dependency structures that are
rarely captured in literature and show that our proposed method effectively
captures these complex feature-target relationships by evaluating feature
subsets as an ensemble. Experimental results on synthetic and real-life fraud
datasets demonstrate the efficacy of our method and its ability to recover
exact solutions.
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2510.02610v2
[DATE]
2025-10-06 17:40:13+08:00
[CATEGORIES]
cs.LG
Fairness in Repeated Matching: A Maximin Perspective
[AUTHORS]
Eugene Lim, Tzeh Yuan Neoh, Nicholas Teh
[ABSTRACT]
We study a sequential decision-making model where a set of items is
repeatedly matched to the same set of agents over multiple rounds. The
objective is to determine a sequence of matchings that either maximizes the
utility of the least advantaged agent at the end of all rounds (optimal) or at
the end of every individual round (anytime optimal). We investigate the
computational challenges associated with finding (anytime) optimal outcomes and
demonstrate that these problems are generally computationally intractable.
However, we provide approximation algorithms, fixed-parameter tractable
algorithms, and identify several special cases whereby the problem(s) can be
solved efficiently. Along the way, we also establish characterizations of
Pareto-optimal/maximum matchings, which may be of independent interest to works
in matching theory and house allocation.
[LINK]
http://arxiv.org/abs/2510.04624v1
[DATE]
2025-10-06 17:32:40+08:00
[CATEGORIES]
cs.LG
Forecasting-Based Biomedical Time-series Data Synthesis for Open Data and Robust AI
[AUTHORS]
Youngjoon Lee, Seongmin Cho, Yehhyun Jo, Jinu Gong, Hyunjoo Jenny Lee, Joonhyuk Kang
[ABSTRACT]
The limited data availability due to strict privacy regulations and
significant resource demands severely constrains biomedical time-series AI
development, which creates a critical gap between data requirements and
accessibility. Synthetic data generation presents a promising solution by
producing artificial datasets that maintain the statistical properties of real
biomedical time-series data without compromising patient confidentiality. We
propose a framework for synthetic biomedical time-series data generation based
on advanced forecasting models that accurately replicates complex
electrophysiological signals such as EEG and EMG with high fidelity. These
synthetic datasets preserve essential temporal and spectral properties of real
data, which enables robust analysis while effectively addressing data scarcity
and privacy challenges. Our evaluations across multiple subjects demonstrate
that the generated synthetic data can serve as an effective substitute for real
data and also significantly boost AI model performance. The approach maintains
critical biomedical features while providing high scalability for various
applications and integrates seamlessly into open-source repositories,
substantially expanding resources for AI-driven biomedical research.
[COMMENTS]
Under Review
[LINK]
http://arxiv.org/abs/2510.04622v1
[DATE]
2025-10-06 17:32:10+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Giuseppe Serra, Florian Buettner
[ABSTRACT]
Given the ability to model more realistic and dynamic problems, Federated
Continual Learning (FCL) has been increasingly investigated recently. A
well-known problem encountered in this setting is the so-called catastrophic
forgetting, for which the learning model is inclined to focus on more recent
tasks while forgetting the previously learned knowledge. The majority of the
current approaches in FCL propose generative-based solutions to solve said
problem. However, this setting requires multiple training epochs over the data,
implying an offline setting where datasets are stored locally and remain
unchanged over time. Furthermore, the proposed solutions are tailored for
vision tasks solely. To overcome these limitations, we propose a new approach
to deal with different modalities in the online scenario where new data arrive
in streams of mini-batches that can only be processed once. To solve
catastrophic forgetting, we propose an uncertainty-aware memory-based approach.
Specifically, we suggest using an estimator based on the Bregman Information
(BI) to compute the model’s variance at the sample level. Through measures of
predictive uncertainty, we retrieve samples with specific characteristics, and
- by retraining the model on such samples - we demonstrate the potential of
this approach to reduce the forgetting effect in realistic settings while
maintaining data confidentiality and competitive communication efficiency
compared to state-of-the-art approaches.
[COMMENTS]
Accepted at ICLR 2025: https://openreview.net/forum?id=f65RuQgVlp
[LINK]
http://arxiv.org/abs/2405.18925v4
[DATE]
2025-10-06 17:24:25+08:00
[CATEGORIES]
cs.LG
A Case for Declarative LLM-friendly Interfaces for Improved Efficiency of Computer-Use Agents
[AUTHORS]
Yuan Wang, Mingyu Li, Haibo Chen
[ABSTRACT]
Computer-use agents (CUAs) powered by large language models (LLMs) have
emerged as a promising approach to automating computer tasks, yet they struggle
with graphical user interfaces (GUIs). GUIs, designed for humans, force LLMs to
decompose high-level goals into lengthy, error-prone sequences of fine-grained
actions, resulting in low success rates and an excessive number of LLM calls.
We propose Goal-Oriented Interface (GOI), a novel abstraction that transforms
existing GUIs into three declarative primitives: access, state, and
observation, which are better suited for LLMs. Our key idea is policy-mechanism
separation: LLMs focus on high-level semantic planning (policy) while GOI
handles low-level navigation and interaction (mechanism). GOI does not require
modifying the application source code or relying on application programming
interfaces (APIs).
We evaluate GOI with Microsoft Office Suite (Word, PowerPoint, Excel) on
Windows. Compared to a leading GUI-based agent baseline, GOI improves task
success rates by 67% and reduces interaction steps by 43.5%. Notably, GOI
completes over 61% of successful tasks with a single LLM call.
[LINK]
http://arxiv.org/abs/2510.04607v1
[DATE]
2025-10-06 17:14:58+08:00
[CATEGORIES]
cs.LG
Closed-Form Last Layer Optimization
[AUTHORS]
Alexandre Galashov, Nathaël Da Costa, Liyuan Xu, Philipp Hennig, Arthur Gretton
[ABSTRACT]
Neural networks are typically optimized with variants of stochastic gradient
descent. Under a squared loss, however, the optimal solution to the linear last
layer weights is known in closed-form. We propose to leverage this during
optimization, treating the last layer as a function of the backbone parameters,
and optimizing solely for these parameters. We show this is equivalent to
alternating between gradient descent steps on the backbone and closed-form
updates on the last layer. We adapt the method for the setting of stochastic
gradient descent, by trading off the loss on the current batch against the
accumulated information from previous batches. Further, we prove that, in the
Neural Tangent Kernel regime, convergence of this method to an optimal solution
is guaranteed. Finally, we demonstrate the effectiveness of our approach
compared with standard SGD on a squared loss in several supervised tasks –
both regression and classification – including Fourier Neural Operators and
Instrumental Variable Regression.
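The closed-form solution mentioned above is the regularized least-squares solution $W^* = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top Y$, where $\Phi$ holds the backbone features. The sketch below computes it with NumPy for fixed features; the small `ridge` term, the random stand-in features, and the function name `closed_form_last_layer` are assumptions for illustration, and the stochastic-batch variant described in the abstract is not reproduced here.

```python
import numpy as np


def closed_form_last_layer(features: np.ndarray, targets: np.ndarray,
                           ridge: float = 1e-6) -> np.ndarray:
    """Solve min_W ||features @ W - targets||^2 + ridge * ||W||^2 in closed form."""
    dim = features.shape[1]
    gram = features.T @ features + ridge * np.eye(dim)
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(gram, features.T @ targets)


rng = np.random.default_rng(0)
features = rng.normal(size=(200, 4))  # stand-in for backbone outputs
true_w = np.array([[1.0], [-2.0], [0.5], [3.0]])
targets = features @ true_w           # noiseless targets, so true_w is recoverable

w_hat = closed_form_last_layer(features, targets)
```

In a training loop, a step like this would alternate with gradient updates of the backbone, which is the equivalence the abstract states.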
[LINK]
http://arxiv.org/abs/2510.04606v1
[DATE]
2025-10-06 17:14:39+08:00
[CATEGORIES]
cs.LG
Taming OOD Actions for Offline Reinforcement Learning: An Advantage-Based Approach
[AUTHORS]
Xuyang Chen, Keyu Yan, Wenhan Cao, Lin Zhao
[ABSTRACT]
Offline reinforcement learning (RL) learns policies from fixed datasets
without online interactions, but suffers from distribution shift, causing
inaccurate evaluation and overestimation of out-of-distribution (OOD) actions.
Existing methods counter this by conservatively discouraging all OOD actions,
which limits generalization. We propose Advantage-based Diffusion Actor-Critic
(ADAC), which evaluates OOD actions via an advantage-like function and uses it
to modulate the Q-function update discriminatively. Our key insight is that the
(state) value function is generally learned more reliably than the action-value
function; we thus use the next-state value to indirectly assess each action. We
develop a PointMaze environment to clearly visualize that advantage modulation
effectively selects superior OOD actions while discouraging inferior ones.
Moreover, extensive experiments on the D4RL benchmark show that ADAC achieves
state-of-the-art performance, with especially strong gains on challenging
tasks.
[LINK]
http://arxiv.org/abs/2505.05126v4
[DATE]
2025-10-06 17:11:08+08:00
[CATEGORIES]
cs.LG
Computing Wasserstein Barycenters through Gradient Flows
[AUTHORS]
Eduardo Fernandes Montesuma, Yassir Bendou, Mike Gartrell
[ABSTRACT]
Wasserstein barycenters provide a powerful tool for aggregating probability
measures, while leveraging the geometry of their ambient space. Existing
discrete methods suffer from poor scalability, as they require access to the
complete set of samples from input measures. We address this issue by recasting
the original barycenter problem as a gradient flow in the Wasserstein space.
Our approach offers two advantages. First, we achieve scalability by sampling
mini-batches from the input measures. Second, we incorporate functionals over
probability measures, which regularize the barycenter problem through internal,
potential, and interaction energies. We present two algorithms for empirical
and Gaussian mixture measures, providing convergence guarantees under the
Polyak-Łojasiewicz inequality. Experimental validation on toy datasets and
domain adaptation benchmarks shows that our methods outperform previous discrete
and neural net-based methods for computing Wasserstein barycenters.
[COMMENTS]
4 Figures, 3 Tables, under review
[LINK]
http://arxiv.org/abs/2510.04602v1
[DATE]
2025-10-06 17:07:12+08:00
[CATEGORIES]
cs.LG
Sampling-aware Adversarial Attacks Against Large Language Models
[AUTHORS]
Tim Beyer, Yan Scholten, Leo Schwinn, Stephan Günnemann
[ABSTRACT]
To guarantee safe and robust deployment of large language models (LLMs) at
scale, it is critical to accurately assess their adversarial robustness.
Existing adversarial attacks typically target harmful responses in single-point
greedy generations, overlooking the inherently stochastic nature of LLMs and
overestimating robustness. We show that for the goal of eliciting harmful
responses, repeated sampling of model outputs during the attack complements
prompt optimization and serves as a strong and efficient attack vector. By
casting attacks as a resource allocation problem between optimization and
sampling, we determine compute-optimal trade-offs and show that integrating
sampling into existing attacks boosts success rates by up to 37% and improves
efficiency by up to two orders of magnitude. We further analyze how
distributions of output harmfulness evolve during an adversarial attack,
discovering that many common optimization strategies have little effect on
output harmfulness. Finally, we introduce a label-free proof-of-concept
objective based on entropy maximization, demonstrating how our sampling-aware
perspective enables new optimization targets. Overall, our findings establish
the importance of sampling in attacks to accurately assess and strengthen LLM
safety at scale.
[LINK]
http://arxiv.org/abs/2507.04446v3
[DATE]
2025-10-06 17:02:24+08:00
[CATEGORIES]
cs.LG
Mamba base PKD for efficient knowledge compression
[AUTHORS]
José Medina, Amnir Hadachi, Paul Honeine, Abdelaziz Bensrhair
[ABSTRACT]
Deep neural networks (DNNs) have remarkably succeeded in various image
processing tasks. However, their large size and computational complexity
present significant challenges for deploying them in resource-constrained
environments. This paper presents an innovative approach for integrating Mamba
Architecture within a Progressive Knowledge Distillation (PKD) process to
address the challenge of reducing model complexity while maintaining accuracy
in image classification tasks. The proposed framework distills a large teacher
model into progressively smaller student models, designed using Mamba blocks.
Each student model is trained using Selective-State-Space Models (S-SSM) within
the Mamba blocks, focusing on important input aspects while reducing
computational complexity. The work’s preliminary experiments use MNIST and
CIFAR-10 as datasets to demonstrate the effectiveness of this approach. For
MNIST, the teacher model achieves 98% accuracy. A set of seven student models
as a group retained 63% of the teacher’s FLOPs, approximating the teacher’s
performance with 98% accuracy. The weak student used only 1% of the teacher’s
FLOPs and maintained 72% accuracy. Similarly, for CIFAR-10, the students
achieved 1% less accuracy compared to the teacher, with the small student
retaining 5% of the teacher’s FLOPs to achieve 50% accuracy. These results
confirm the flexibility and scalability of Mamba Architecture, which can be
integrated into PKD, succeeding in the process of finding students as weak
learners. The framework provides a solution for deploying complex neural
networks in real-time applications with a reduction in computational cost.
[COMMENTS]
A preliminary version of this work was presented as a short poster
titled “Mamba-PKD: A Framework for Efficient and Scalable Model Compression
in Image Classification” at The 40th ACM/SIGAPP Symposium on Applied
Computing https://doi.org/10.1145/3672608.3707887
[LINK]
http://arxiv.org/abs/2503.01727v2
[DATE]
2025-10-06 16:56:55+08:00
[CATEGORIES]
cs.LG
Data-Driven Adaptive PID Control Based on Physics-Informed Neural Networks
[AUTHORS]
Junsei Ito, Yasuaki Wasa
[ABSTRACT]
This article proposes a data-driven PID controller design based on the
principle of adaptive gain optimization, leveraging Physics-Informed Neural
Networks (PINNs) generated for predictive modeling purposes. The proposed
control design method utilizes gradients of the PID gain optimization, achieved
through the automatic differentiation of PINNs, to apply model predictive
control using a cost function based on tracking error and control inputs. By
optimizing PINNs-based PID gains, the method achieves adaptive gain tuning that
ensures stability while accounting for system nonlinearities. The proposed
method features a systematic framework for integrating PINNs-based models of
dynamical control systems into closed-loop control systems, enabling direct
application to PID control design. A series of numerical experiments is
conducted to demonstrate the effectiveness of the proposed method from the
control perspectives based on both time and frequency domains.
[COMMENTS]
This work has been submitted to the IEEE Transactions on Control
Systems Technology for possible publication
[LINK]
http://arxiv.org/abs/2510.04591v1
[DATE]
2025-10-06 16:46:20+08:00
[CATEGORIES]
cs.LG
Improved probabilistic regression using diffusion models
[AUTHORS]
Carlo Kneissl, Christopher Bülte, Philipp Scholl, Gitta Kutyniok
[ABSTRACT]
Probabilistic regression models the entire predictive distribution of a
response variable, offering richer insights than classical point estimates and
directly allowing for uncertainty quantification. While diffusion-based
generative models have shown remarkable success in generating complex,
high-dimensional data, their usage in general regression tasks often lacks
uncertainty-related evaluation and remains limited to domain-specific
applications. We propose a novel diffusion-based framework for probabilistic
regression that learns predictive distributions in a nonparametric way. More
specifically, we propose to model the full distribution of the diffusion noise,
enabling adaptation to diverse tasks and enhanced uncertainty quantification.
We investigate different noise parameterizations, analyze their trade-offs, and
evaluate our framework across a broad range of regression tasks, covering low-
and high-dimensional settings. For several experiments, our approach shows
superior performance against existing baselines, while delivering calibrated
uncertainty estimates, demonstrating its versatility as a tool for
probabilistic prediction.
[LINK]
http://arxiv.org/abs/2510.04583v1
[DATE]
2025-10-06 16:36:05+08:00
[CATEGORIES]
cs.LG
Busemann Functions in the Wasserstein Space: Existence, Closed-Forms, and Applications to Slicing
[AUTHORS]
Clément Bonet, Elsa Cazelles, Lucas Drumetz, Nicolas Courty
[ABSTRACT]
The Busemann function has recently found much interest in a variety of
geometric machine learning problems, as it naturally defines projections onto
geodesic rays of Riemannian manifolds and generalizes the notion of
hyperplanes. As several sources of data can be conveniently modeled as
probability distributions, it is natural to study this function in the
Wasserstein space, which carries a rich formal Riemannian structure induced by
Optimal Transport metrics. In this work, we investigate the existence and
computation of Busemann functions in Wasserstein space, which admits geodesic
rays. We establish closed-form expressions in two important cases:
one-dimensional distributions and Gaussian measures. These results enable
explicit projection schemes for probability distributions on $\mathbb{R}$,
which in turn allow us to define novel Sliced-Wasserstein distances over
Gaussian mixtures and labeled datasets. We demonstrate the efficiency of those
original schemes on synthetic datasets as well as transfer learning problems.
[LINK]
http://arxiv.org/abs/2510.04579v1
[DATE]
2025-10-06 16:31:14+08:00
[CATEGORIES]
cs.LG
Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers
[AUTHORS]
Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang
[ABSTRACT]
While language models (LMs) paired with residual vector quantization (RVQ)
tokenizers have shown promise in text-to-audio (T2A) generation, they still lag
behind diffusion-based models by a non-trivial margin. We identify a critical
dilemma underpinning this gap: incorporating more RVQ layers improves audio
reconstruction fidelity but exceeds the generation capacity of conventional
LMs. To address this, we first analyze RVQ dynamics and uncover two key
limitations: 1) orthogonality of features across RVQ layers hinders effective
LMs training, and 2) descending semantic richness in tokens from deeper RVQ
layers exacerbates exposure bias during autoregressive decoding. Based on these
insights, we propose Siren, a novel LM-based framework that employs multiple
isolated transformers with causal conditioning and anti-causal alignment via
reinforcement learning. Extensive experiments demonstrate that Siren
outperforms both existing LM-based and diffusion-based T2A systems, achieving
state-of-the-art results. By bridging the representational strengths of LMs
with the fidelity demands of audio synthesis, our approach repositions LMs as
competitive contenders against diffusion models in T2A tasks. Moreover, by
aligning audio representations with linguistic structures, Siren facilitates a
promising pathway toward unified multi-modal generation frameworks.
[COMMENTS]
Accepted to EMNLP 2025
[LINK]
http://arxiv.org/abs/2510.04577v1
[DATE]
2025-10-06 16:26:55+08:00
[CATEGORIES]
cs.LG
SONA: Learning Conditional, Unconditional, and Mismatching-Aware Discriminator
[AUTHORS]
Yuhta Takida, Satoshi Hayakawa, Takashi Shibuya, Masaaki Imaizumi, Naoki Murata, Bac Nguyen, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuki Mitsufuji
[ABSTRACT]
Deep generative models have made significant advances in generating complex
content, yet conditional generation remains a fundamental challenge. Existing
conditional generative adversarial networks often struggle to balance the dual
objectives of assessing authenticity and conditional alignment of input samples
within their conditional discriminators. To address this, we propose a novel
discriminator design that integrates three key capabilities: unconditional
discrimination, matching-aware supervision to enhance alignment sensitivity,
and adaptive weighting to dynamically balance all objectives. Specifically, we
introduce Sum of Naturalness and Alignment (SONA), which employs separate
projections for naturalness (authenticity) and alignment in the final layer
with an inductive bias, supported by dedicated objective functions and an
adaptive weighting mechanism. Extensive experiments on class-conditional
generation tasks show that SONA achieves superior sample quality and
conditional alignment compared to state-of-the-art methods. Furthermore, we
demonstrate its effectiveness in text-to-image generation, confirming the
versatility and robustness of our approach.
[COMMENTS]
24 pages with 9 figures
[LINK]
http://arxiv.org/abs/2510.04576v1
[DATE]
2025-10-06 16:26:06+08:00
[CATEGORIES]
cs.LG
FLOWR.root: A flow matching based foundation model for joint multi-purpose structure-aware 3D ligand generation and affinity prediction
[AUTHORS]
Julian Cremer, Tuan Le, Mohammad M. Ghahremanpour, Emilia Sługocka, Filipe Menezes, Djork-Arné Clevert
[ABSTRACT]
We present FLOWR:root, an equivariant flow-matching model for pocket-aware 3D
ligand generation with joint binding affinity prediction and confidence
estimation. The model supports de novo generation, pharmacophore-conditional
sampling, fragment elaboration, and multi-endpoint affinity prediction (pIC50,
pKi, pKd, pEC50). Training combines large-scale ligand libraries with
mixed-fidelity protein-ligand complexes, followed by refinement on curated
co-crystal datasets and parameter-efficient finetuning for project-specific
adaptation. FLOWR:root achieves state-of-the-art performance in unconditional
3D molecule generation and pocket-conditional ligand design, producing
geometrically realistic, low-strain structures. The integrated affinity
prediction module demonstrates superior accuracy on the SPINDR test set and
outperforms recent models on the Schrodinger FEP+/OpenFE benchmark with
substantial speed advantages. As a foundation model, FLOWR:root requires
finetuning on project-specific datasets to account for unseen
structure-activity landscapes, yielding strong correlation with experimental
data. Joint generation and affinity prediction enable inference-time scaling
through importance sampling, steering molecular design toward higher-affinity
compounds. Case studies validate this: selective CK2$\alpha$ ligand generation
against CLK3 shows significant correlation between predicted and
quantum-mechanical binding energies, while ER$\alpha$ and TYK2 scaffold
elaboration demonstrates strong agreement with QM calculations. By integrating
structure-aware generation, affinity estimation, and property-guided sampling,
FLOWR:root provides a comprehensive foundation for structure-based drug design
spanning hit identification through lead optimization.
[LINK]
http://arxiv.org/abs/2510.02578v2
[DATE]
2025-10-06 16:20:22+08:00
[CATEGORIES]
cs.LG
Risk-Sensitive Option Market Making with Arbitrage-Free eSSVI Surfaces: A Constrained RL and Stochastic Control Bridge
[AUTHORS]
Jian’an Zhang
[ABSTRACT]
We formulate option market making as a constrained, risk-sensitive control
problem that unifies execution, hedging, and arbitrage-free implied-volatility
surfaces inside a single learning loop. A fully differentiable eSSVI layer
enforces static no-arbitrage conditions (butterfly and calendar) while the
policy controls half-spreads, hedge intensity, and structured surface
deformations (state-dependent rho-shift and psi-scale). Executions are
intensity-driven and respond monotonically to spreads and relative mispricing;
tail risk is shaped with a differentiable CVaR objective via the
Rockafellar–Uryasev program. We provide theory for (i) grid-consistency and
rates for butterfly/calendar surrogates, (ii) a primal–dual grounding of a
learnable dual action acting as a state-dependent Lagrange multiplier, (iii)
differentiable CVaR estimators with mixed pathwise and likelihood-ratio
gradients and epi-convergence to the nonsmooth objective, (iv) an eSSVI
wing-growth bound aligned with Lee’s moment constraints, and (v)
policy-gradient validity under smooth surrogates. In simulation (Heston
fallback; ABIDES-ready), the agent attains positive adjusted P&L on most
intraday segments while keeping calendar violations at numerical zero and
butterfly violations at the numerical floor; ex-post tails remain realistic and
can be tuned through the CVaR weight. The five control heads admit clear
economic semantics and analytic sensitivities, yielding a white-box learner
that unifies pricing consistency and execution control in a reproducible
pipeline.
[COMMENTS]
34 pages including appendices; figures included. Primary subject
class: q-fin.TR. Cross-lists: cs.LG; q-fin.CP
[LINK]
http://arxiv.org/abs/2510.04569v1
[DATE]
2025-10-06 16:11:16+08:00
[CATEGORIES]
cs.LG
COSMIR: Chain Orchestrated Structured Memory for Iterative Reasoning over Long Context
[AUTHORS]
Naman Gupta, Shreeyash Gowaikar, Arun Iyer, Kirankumar Shiragur, Ramakrishna B Bairi, Rishikesh Maurya, Ritabrata Maiti, Sankarshan Damle, Shachee Mishra Gupta
[ABSTRACT]
Reasoning over very long inputs remains difficult for large language models
(LLMs). Common workarounds either shrink the input via retrieval (risking
missed evidence), enlarge the context window (straining selectivity), or stage
multiple agents to read in pieces. In staged pipelines (e.g., Chain of Agents,
CoA), free-form summaries passed between agents can discard crucial details and
amplify early mistakes. We introduce COSMIR (Chain Orchestrated Structured
Memory for Iterative Reasoning), a chain-style framework that replaces ad hoc
messages with a structured memory. A Planner agent first turns a user query
into concrete, checkable sub-questions. Worker agents process chunks via a
fixed micro-cycle: Extract, Infer, Refine, writing all updates to the shared
memory. A Manager agent then Synthesizes the final answer directly from the
memory. This preserves step-wise read-then-reason benefits while changing both
the communication medium (structured memory) and the worker procedure (fixed
micro-cycle), yielding higher faithfulness, better long-range aggregation, and
auditability. On long-context QA from the HELMET suite, COSMIR reduces
propagation-stage information loss and improves accuracy over a CoA baseline.
[LINK]
http://arxiv.org/abs/2510.04568v1
[DATE]
2025-10-06 16:10:04+08:00
[CATEGORIES]
cs.LG
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning
[AUTHORS]
Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang
[ABSTRACT]
Graph Neural Networks (GNNs) are powerful tools for processing relational
data but often struggle to generalize to unseen graphs, giving rise to the
development of Graph Foundational Models (GFMs). However, current GFMs are
challenged by the extreme heterogeneity of graph data, where each graph can
possess a unique feature space, label set, and topology. To address this, two
main paradigms have emerged. The first leverages Large Language Models (LLMs),
but is fundamentally text-dependent and thus struggles to handle the numerical
features in vast graphs. The second pre-trains a structure-based model, but the
adaptation to new tasks typically requires a costly, per-graph tuning stage,
creating a critical efficiency bottleneck. In this work, we move beyond these
limitations and introduce Graph In-context Learning Transformer (GILT), a
framework built on an LLM-free and tuning-free
architecture. GILT introduces a novel token-based framework for in-context
learning (ICL) on graphs, reframing classification tasks spanning node, edge
and graph levels in a unified framework. This mechanism is the key to handling
heterogeneity, as it is designed to operate on generic numerical features.
Further, its ability to understand class semantics dynamically from the context
enables tuning-free adaptation. Comprehensive experiments show that GILT
achieves stronger few-shot performance with significantly less time than
LLM-based or tuning-based baselines, validating the effectiveness of our
approach.
[LINK]
http://arxiv.org/abs/2510.04567v1
[DATE]
2025-10-06 16:09:15+08:00
[CATEGORIES]
cs.LG
Stochastic Approximation Methods for Distortion Risk Measure Optimization
[AUTHORS]
Jinyang Jiang, Bernd Heidergott, Jiaqiao Hu, Yijie Peng
[ABSTRACT]
Distortion Risk Measures (DRMs) capture risk preferences in decision-making
and serve as general criteria for managing uncertainty. This paper proposes
gradient descent algorithms for DRM optimization based on two dual
representations: the Distortion-Measure (DM) form and Quantile-Function (QF)
form. The DM-form employs a three-timescale algorithm to track quantiles,
compute their gradients, and update decision variables, utilizing the
Generalized Likelihood Ratio and kernel-based density estimation. The QF-form
provides a simpler two-timescale approach that avoids the need for complex
quantile gradient estimation. A hybrid form integrates both approaches,
applying the DM-form for robust performance around distortion function jumps
and the QF-form for efficiency in smooth regions. Proofs of strong convergence
and convergence rates for the proposed algorithms are provided. In particular,
the DM-form achieves an optimal rate of $O(k^{-4/7})$, while the QF-form
attains a faster rate of $O(k^{-2/3})$. Numerical experiments confirm their
effectiveness and demonstrate substantial improvements over baselines in robust
portfolio selection tasks. The method’s scalability is further illustrated
through integration into deep reinforcement learning. Specifically, a DRM-based
Proximal Policy Optimization algorithm is developed and applied to
multi-echelon dynamic inventory management, showcasing its practical
applicability.
[LINK]
http://arxiv.org/abs/2510.04563v1
[DATE]
2025-10-06 15:59:09+08:00
[CATEGORIES]
cs.LG
Challenger-Based Combinatorial Bandits for Subcarrier Selection in OFDM Systems
[AUTHORS]
Mohsen Amiri, V Venktesh, Sindri Magnússon
[ABSTRACT]
This paper investigates the identification of the top-m user-scheduling sets
in multi-user MIMO downlink, which is cast as a combinatorial pure-exploration
problem in stochastic linear bandits. Because the action space grows
exponentially, exhaustive search is infeasible. We therefore adopt a linear
utility model to enable efficient exploration and reliable selection of
promising user subsets. We introduce a gap-index framework that maintains a
shortlist of current estimates of champion arms (top-m sets) and a rotating
shortlist of challenger arms that pose the greatest threat to the champions.
This design focuses on measurements that yield the most informative
gap-index-based comparisons, resulting in significant reductions in runtime and
computation compared to state-of-the-art linear bandit methods, with high
identification accuracy. The method also exposes a tunable trade-off between
speed and accuracy. Simulations on a realistic OFDM downlink show that
shortlist-driven pure exploration makes online, measurement-efficient
subcarrier selection practical for AI-enabled communication systems.
[COMMENTS]
6 pages
[LINK]
http://arxiv.org/abs/2510.04559v1
[DATE]
2025-10-06 15:48:44+08:00
[CATEGORIES]
cs.LG
PhysioWave: A Multi-Scale Wavelet-Transformer for Physiological Signal Representation
[AUTHORS]
Yanlong Chen, Mattia Orlandi, Pierangelo Maria Rapa, Simone Benatti, Luca Benini, Yawei Li
[ABSTRACT]
Physiological signals are often corrupted by motion artifacts, baseline
drift, and other low-SNR disturbances, which pose significant challenges for
analysis. Additionally, these signals exhibit strong non-stationarity, with
sharp peaks and abrupt changes that evolve continuously, making them difficult
to represent using traditional time-domain or filtering methods. To address
these issues, a novel wavelet-based approach for physiological signal analysis
is presented, aiming to capture multi-scale time-frequency features in various
physiological signals. Leveraging this technique, two large-scale pretrained
models specific to EMG and ECG are introduced for the first time, achieving
superior performance and setting new baselines in downstream tasks.
Additionally, a unified multi-modal framework is constructed by integrating
a pretrained EEG model, where each modality is guided through its dedicated
branch and fused via learnable weighted fusion. This design effectively
addresses challenges such as low signal-to-noise ratio, high inter-subject
variability, and device mismatch, outperforming existing methods on multi-modal
tasks. The proposed wavelet-based architecture lays a solid foundation for
analysis of diverse physiological signals, while the multi-modal design points
to next-generation physiological signal processing with potential impact on
wearable health monitoring, clinical diagnostics, and broader biomedical
applications. Code and data are available at:
github.com/ForeverBlue816/PhysioWave
[COMMENTS]
43 pages, 17 figures, 17 tables. Accepted by NeurIPS 2025. Code and
data are available at: github.com/ForeverBlue816/PhysioWave
[LINK]
http://arxiv.org/abs/2506.10351v3
[DATE]
2025-10-06 15:46:23+08:00
[CATEGORIES]
cs.LG
Gini-based Model Monitoring: A General Framework with an Application to Non-life Insurance Pricing
[AUTHORS]
Alexej Brauer, Paul Menzel
[ABSTRACT]
In a dynamic landscape where portfolios and environments evolve, maintaining
the accuracy of pricing models is critical. To the best of our knowledge, this
is the first study to systematically examine concept drift in non-life
insurance pricing. We (i) provide an overview of the relevant literature and
commonly used methodologies, clarify the distinction between virtual drift and
concept drift, and explain their implications for long-run model performance;
(ii) review and formalize common performance measures, including the Gini index
and deviance loss, and articulate their interpretation; (iii) derive the
asymptotic distribution of the Gini index, enabling valid inference and
hypothesis testing; and (iv) present a standardized monitoring procedure that
indicates when refitting is warranted. We illustrate the framework using a
modified real-world portfolio with induced concept drift and discuss practical
considerations and pitfalls.
[LINK]
http://arxiv.org/abs/2510.04556v1
[DATE]
2025-10-06 15:41:09+08:00
[CATEGORIES]
cs.LG
Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF–QP Safety Layer in Arbitrage-Free Markets
[AUTHORS]
Jian’an Zhang
[ABSTRACT]
We introduce Tail-Safe, a deployability-oriented framework for derivatives
hedging that unifies distributional, risk-sensitive reinforcement learning with
a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer
tailored to financial constraints. The learning component combines an IQN-based
distributional critic with a CVaR objective (IQN–CVaR–PPO) and a
Tail-Coverage Controller that regulates quantile sampling through temperature
tilting and tail boosting to stabilize small-$\alpha$ estimation. The safety
component enforces discrete-time CBF inequalities together with domain-specific
constraints – ellipsoidal no-trade bands, box and rate limits, and a
sign-consistency gate – solved as a convex QP whose telemetry (active sets,
tightness, rate utilization, gate scores, slack, and solver status) forms an
auditable trail for governance. We provide guarantees of robust forward
invariance of the safe set under bounded model mismatch, a minimal-deviation
projection interpretation of the QP, a KL-to-DRO upper bound linking per-state
KL regularization to worst-case CVaR, concentration and sample-complexity
results for the temperature-tilted CVaR estimator, and a CVaR trust-region
improvement inequality under KL limits, together with feasibility persistence
under expiry-aware tightening. Empirically, in arbitrage-free,
microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with
ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading
central performance and yields zero hard-constraint violations whenever the QP
is feasible with zero slack. Telemetry is mapped to governance dashboards and
incident workflows to support explainability and auditability. Limitations
include reliance on synthetic data and simplified execution to isolate
methodological contributions.
[COMMENTS]
32 pages including appendices; 5 figures. Primary subject class:
q-fin.TR. Cross-lists: cs.LG; q-fin.RM
[LINK]
http://arxiv.org/abs/2510.04555v1
[DATE]
2025-10-06 15:39:45+08:00
[CATEGORIES]
cs.LG
Learning Linear Regression with Low-Rank Tasks in-Context
[AUTHORS]
Kaito Takanami, Takashi Takahashi, Yoshiyuki Kabashima
[ABSTRACT]
In-context learning (ICL) is a key building block of modern large language
models, yet its theoretical mechanisms remain poorly understood. It is
particularly mysterious how ICL operates in real-world applications where tasks
have a common structure. In this work, we address this problem by analyzing a
linear attention model trained on low-rank regression tasks. Within this
setting, we precisely characterize the distribution of predictions and the
generalization error in the high-dimensional limit. Moreover, we find that
statistical fluctuations in finite pre-training data induce an implicit
regularization. Finally, we identify a sharp phase transition of the
generalization error governed by task structure. These results provide a
framework for understanding how transformers learn to learn the task structure.
[LINK]
http://arxiv.org/abs/2510.04548v1
[DATE]
2025-10-06 15:27:49+08:00
[CATEGORIES]
cs.LG
Post-training quantization of vision encoders needs prefixing registers
[AUTHORS]
Seunghyeon Kim, Jinho Kim, Taesun Yeom, Wonpyo Park, Kyuyeun Kim, Jaeho Lee
[ABSTRACT]
Transformer-based vision encoders – such as CLIP – are central to
multimodal intelligence, powering applications from autonomous web agents to
robotic control. Since these applications often demand real-time processing of
massive visual data, reducing the inference cost of vision encoders is
critical. Post-training quantization offers a practical path, but remains
challenging even at 8-bit precision due to massive-scale activations (i.e.,
outliers). In this work, we propose RegCache, a training-free
algorithm to mitigate outliers in vision encoders, enabling quantization with
significantly smaller accuracy drops. The proposed RegCache introduces
outlier-prone yet semantically meaningless prefix tokens to the target vision
encoder, which prevents other tokens from having outliers. Notably, we observe
that outliers in vision encoders behave differently from those in language
models, motivating two technical innovations: middle-layer prefixing and token
deletion. Experiments show that our method consistently improves the accuracy
of quantized models across both text-supervised and self-supervised vision
encoders.
[LINK]
http://arxiv.org/abs/2510.04547v1
[DATE]
2025-10-06 15:27:46+08:00
[CATEGORIES]
cs.LG
Graph-based Tabular Deep Learning Should Learn Feature Interactions, Not Just Make Predictions
[AUTHORS]
Elias Dubbeldam, Reza Mohammadi, Marit Schoonhoven, S. Ilker Birbil
[ABSTRACT]
Despite recent progress, deep learning methods for tabular data still
struggle to compete with traditional tree-based models. A key challenge lies in
modeling complex, dataset-specific feature interactions that are central to
tabular data. Graph-based tabular deep learning (GTDL) methods aim to address
this by representing features and their interactions as graphs. However,
existing methods predominantly optimize predictive accuracy, neglecting
accurate modeling of the graph structure. This position paper argues that GTDL
should move beyond prediction-centric objectives and prioritize the explicit
learning and evaluation of feature interactions. Using synthetic datasets with
known ground-truth graph structures, we show that existing GTDL methods fail to
recover meaningful feature interactions. Moreover, enforcing the true
interaction structure improves predictive performance. This highlights the need
for GTDL methods to prioritize quantitative evaluation and accurate structural
learning. We call for a shift toward structure-aware modeling as a foundation
for building GTDL systems that are not only accurate but also interpretable,
trustworthy, and grounded in domain understanding.
[COMMENTS]
9 pages, 6 figures, submitted to position track NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.04543v1
[DATE]
2025-10-06 15:16:42+08:00
[CATEGORIES]
cs.LG
Divergence Minimization Preference Optimization for Diffusion Model Alignment
[AUTHORS]
Binxu Li, Minkai Xu, Jiaqi Han, Meihua Dang, Stefano Ermon
[ABSTRACT]
Diffusion models have achieved remarkable success in generating realistic and
versatile images from text prompts. Inspired by the recent advancements of
language models, there is an increasing interest in further improving the
models by aligning with human preferences. However, we investigate alignment
from a divergence minimization perspective and reveal that existing preference
optimization methods are typically trapped in suboptimal mean-seeking
optimization. In this paper, we introduce Divergence Minimization Preference
Optimization (DMPO), a novel and principled method for aligning diffusion
models by minimizing reverse KL divergence, which asymptotically enjoys the
same optimization direction as original RL. We provide rigorous analysis to
justify the effectiveness of DMPO and conduct comprehensive experiments to
validate its empirical strength across both human evaluations and automatic
metrics. Our extensive results show that diffusion models fine-tuned with DMPO
consistently outperform or match existing techniques. In particular, they
outperform all baseline models across different base models and test sets and
achieve the best PickScore in every case, demonstrating the method’s
superiority in aligning generative behavior with desired outputs.
Overall, DMPO unlocks a robust and elegant pathway for preference alignment,
bridging principled theory with practical performance in diffusion models.
[LINK]
http://arxiv.org/abs/2507.07510v2
[DATE]
2025-10-06 15:01:28+08:00
[CATEGORIES]
cs.LG
Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models
[AUTHORS]
Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman
[ABSTRACT]
Synthetically augmenting training datasets with diffusion models has been an
effective strategy for improving generalization of image classifiers. However,
existing techniques struggle to ensure the diversity of generation and increase
the size of the data by up to 10-30x to improve the in-distribution
performance. In this work, we show that synthetically augmenting only the part
of the data that is not learned early in training with faithful images
(containing the same features but different noise) outperforms augmenting the
entire dataset. By
analyzing a two-layer CNN, we prove that this strategy improves generalization
by promoting homogeneity in feature learning speed without amplifying noise.
Our extensive experiments show that by augmenting only 30%-40% of the data, our
method boosts generalization by up to 2.8% in a variety of scenarios, including
training ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100, and
TinyImageNet, with various optimizers including SGD and SAM. Notably, our
method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and
TinyImageNet.
[LINK]
http://arxiv.org/abs/2505.21574v2
[DATE]
2025-10-06 14:55:59+08:00
[CATEGORIES]
cs.LG
Relevance-Aware Thresholding in Online Conformal Prediction for Time Series
[AUTHORS]
Théo Dupuy, Binbin Xu, Stéphane Perrey, Jacky Montmain, Abdelhak Imoussaten
[ABSTRACT]
Uncertainty quantification has received considerable interest in recent works
in Machine Learning. In particular, Conformal Prediction (CP) is gaining ground in
this field. For the case of time series, Online Conformal Prediction (OCP)
becomes an option to address the problem of data distribution shift over time.
Indeed, the idea of OCP is to update a threshold of some quantity (whether the
miscoverage level or the quantile) based on the distribution observation. To
evaluate the performance of OCP methods, two key aspects are typically
considered: the coverage validity and the prediction interval width
minimization. Recently, new OCP methods have emerged, offering long-run
coverage guarantees and producing more informative intervals. However, during
the threshold update step, most of these methods focus solely on the validity
of the prediction intervals, that is, whether the ground truth falls inside
or outside the interval, without accounting for their relevance. In this
paper, we aim to leverage this overlooked aspect. Specifically, we propose
enhancing the threshold update step by replacing the binary evaluation
(inside/outside) with a broader class of functions that quantify the relevance
of the prediction interval using the ground truth. This approach helps prevent
abrupt threshold changes, potentially resulting in narrower prediction
intervals. Indeed, experimental results on real-world datasets suggest that
these functions can produce tighter intervals compared to existing OCP methods
while maintaining coverage validity.
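The threshold update step described in this abstract can be illustrated with a minimal sketch. It assumes an adaptive-conformal-style rule, alpha_next = alpha + gamma * (target - err), which is a common OCP update and not necessarily the one used in the paper. The function `relevance_error` and its `scale` parameter are hypothetical stand-ins for the broader class of relevance functions the abstract mentions.

```python
import math


def binary_error(lower, upper, y):
    """Classic validity signal: 1.0 if y falls outside [lower, upper], else 0.0."""
    return 0.0 if lower <= y <= upper else 1.0


def relevance_error(lower, upper, y, scale=1.0):
    """Hypothetical smooth signal in (0, 1).

    It is 0.5 when y sits exactly on an interval boundary, approaches 1 as y
    moves far outside the interval, and decreases as y moves toward the center.
    """
    center = 0.5 * (lower + upper)
    half_width = 0.5 * (upper - lower)
    margin = abs(y - center) - half_width  # > 0 outside, < 0 inside
    return 1.0 / (1.0 + math.exp(-margin / scale))


def update_alpha(alpha_t, target_alpha, err, gamma=0.05):
    """One online update of the miscoverage level (assumed ACI-style rule)."""
    return alpha_t + gamma * (target_alpha - err)


# Same missed observation, two different feedback signals.
alpha_binary = update_alpha(0.1, 0.1, binary_error(0.0, 2.0, 3.0))
alpha_smooth = update_alpha(0.1, 0.1, relevance_error(0.0, 2.0, 3.0))
```

With the smooth signal, the size of the step depends on how far the ground truth lies from the interval boundary rather than only on which side of it the observation falls.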
[COMMENTS]
Accepted for The 28th European Conference on Artificial Intelligence
2025, Workshop HC@AIxIA+HYDRA 2025
[LINK]
http://arxiv.org/abs/2510.02809v2
[DATE]
2025-10-06 14:51:20+08:00
[CATEGORIES]
cs.LG
Computing Exact Shapley Values in Polynomial Time for Product-Kernel Methods
[AUTHORS]
Majid Mohammadi, Siu Lun Chau, Krikamol Muandet
[ABSTRACT]
Kernel methods are widely used in machine learning due to their flexibility
and expressiveness. However, their black-box nature poses significant
challenges to interpretability, limiting their adoption in high-stakes
applications. Shapley value-based feature attribution techniques, such as SHAP
and kernel method-specific adaptations like RKHS-SHAP, offer a promising path
toward explainability. Yet, computing exact Shapley values is generally
intractable, leading existing methods to rely on approximations and thereby
incur unavoidable error. In this work, we introduce PKeX-Shapley, a novel
algorithm that utilizes the multiplicative structure of product kernels to
enable the exact computation of Shapley values in polynomial time. The core of
our approach is a new value function, the functional baseline value function,
specifically designed for product-kernel models. This value function removes
the influence of a feature subset by setting its functional component to the
least informative state. Crucially, it allows a recursive and thus efficient
computation of Shapley values in polynomial time. As an important additional
contribution, we show that our framework extends beyond predictive modeling to
statistical inference. In particular, it generalizes to popular kernel-based
discrepancy measures such as the Maximum Mean Discrepancy (MMD) and the
Hilbert-Schmidt Independence Criterion (HSIC), thereby providing new tools for
interpretable statistical inference.
[LINK]
http://arxiv.org/abs/2505.16516v2
[DATE]
2025-10-06 14:40:29+08:00
[CATEGORIES]
cs.LG
Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion
[AUTHORS]
Satoshi Hayakawa, Yuhta Takida, Masaaki Imaizumi, Hiromi Wakaki, Yuki Mitsufuji
[ABSTRACT]
Masked diffusion models have shown promising performance in generating
high-quality samples in a wide range of domains, but accelerating their
sampling process remains relatively underexplored. To investigate efficient
samplers for masked diffusion, this paper theoretically analyzes the MaskGIT
sampler for image modeling, revealing its implicit temperature sampling
mechanism. Through this analysis, we introduce the “moment sampler,” an
asymptotically equivalent but more tractable and interpretable alternative to
MaskGIT, which employs a “choose-then-sample” approach by selecting unmasking
positions before sampling tokens. In addition, we improve the efficiency of
choose-then-sample algorithms through two key innovations: a partial caching
technique for transformers that approximates longer sampling trajectories
without proportional computational cost, and a hybrid approach formalizing the
exploration-exploitation trade-off in adaptive unmasking. Experiments in image
and text domains demonstrate our theory as well as the efficiency of our
proposed methods, advancing both theoretical understanding and practical
implementation of masked diffusion samplers.
[COMMENTS]
23 pages
[LINK]
http://arxiv.org/abs/2510.04525v1
[DATE]
2025-10-06 14:30:22+08:00
[CATEGORIES]
cs.LG
Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction
[AUTHORS]
Yisen Gao, Xingcheng Fu, Qingyun Sun, Jianxin Li, Xianxian Li
[ABSTRACT]
Graph diffusion models have made significant progress in learning structured
graph data and have demonstrated strong potential for predictive tasks.
Existing approaches typically embed node, edge, and graph-level features into a
unified latent space, modeling prediction tasks including classification and
regression as a form of conditional generation. However, due to the
non-Euclidean nature of graph data, features of different curvatures are
entangled in the same latent space without releasing their geometric potential.
To address this issue, we aim to construct an ideal Riemannian diffusion model
to capture distinct manifold signatures of complex graph data and learn their
distribution. This goal faces two challenges: numerical instability caused by
exponential mapping during the encoding process and manifold deviation during
diffusion generation. To address these challenges, we propose GeoMancer: a
novel Riemannian graph diffusion framework for both generation and prediction
tasks. To mitigate numerical instability, we replace exponential mapping with
an isometric-invariant Riemannian gyrokernel approach and decouple multi-level
features onto their respective task-specific manifolds to learn optimal
representations. To address manifold deviation, we introduce a
manifold-constrained diffusion method and a self-guided strategy for
unconditional generation, ensuring that the generated data remains aligned with
the manifold signature. Extensive experiments validate the effectiveness of our
approach, demonstrating superior performance across a variety of tasks.
[COMMENTS]
Accepted by NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.04522v1
[DATE]
2025-10-06 14:29:49+08:00
[CATEGORIES]
cs.LG
Quantum generative model on bicycle-sharing system and an application
[AUTHORS]
Fumio Nemoto, Nobuyuki Koike, Daichi Sato, Yuuta Kawaai, Masayuki Ohzeki
[ABSTRACT]
Recently, bicycle-sharing systems have been implemented in numerous cities,
becoming integral to daily life. However, a prevalent issue arises when
intensive commuting demand leads to bicycle shortages in specific areas and at
particular times. To address this challenge, we employ a novel quantum machine
learning model that analyzes time series data by fitting quantum time evolution
to observed sequences. This model enables us to capture actual trends in
bicycle counts at individual ports and identify correlations between different
ports. Utilizing the trained model, we simulate the impact of proactively
adding bicycles to high-demand ports on the overall rental number across the
system. Given that the core of this method lies in a Monte Carlo simulation, it
is anticipated to have a wide range of industrial applications.
[COMMENTS]
8 pages, 11 figures
[LINK]
http://arxiv.org/abs/2510.04512v1
[DATE]
2025-10-06 14:02:13+08:00
[CATEGORIES]
cs.LG
Real-time Prediction of Urban Sound Propagation with Conditioned Normalizing Flows
[AUTHORS]
Achim Eckerle, Martin Spitznagel, Janis Keuper
[ABSTRACT]
Accurate and fast urban noise prediction is pivotal for public health and for
regulatory workflows in cities, where the Environmental Noise Directive
mandates regular strategic noise maps and action plans, often needed in
permission workflows, right-of-way allocation, and construction scheduling.
Physics-based solvers are too slow for such time-critical, iterative “what-if”
studies. We evaluate conditional Normalizing Flows (Full-Glow) for generating
standards-compliant urban sound-pressure maps from 2D urban
layouts in real time (per 256x256 map on a single RTX 4090), enabling
interactive exploration directly on commodity hardware. On datasets covering
Baseline, Diffraction, and Reflection regimes, our model accelerates map
generation by >2000 times over a reference solver while improving NLoS accuracy
by up to 24% versus prior deep models; in Baseline NLoS we reach 0.65 dB MAE
with high structural fidelity. The model reproduces diffraction and
interference patterns and supports instant recomputation under source or
geometry changes, making it a practical engine for urban planning, compliance
mapping, and operations (e.g., temporary road closures, night-work variance
assessments).
[LINK]
http://arxiv.org/abs/2510.04510v1
[DATE]
2025-10-06 14:00:08+08:00
[CATEGORIES]
cs.LG
Wavelet Predictive Representations for Non-Stationary Reinforcement Learning
[AUTHORS]
Min Wang, Xin Li, Ye He, Yao-Hui Li, Hasnaa Bennis, Riashat Islam, Mingzhong Wang
[ABSTRACT]
The real world is inherently non-stationary, with ever-changing factors, such
as weather conditions and traffic flows, making it challenging for agents to
adapt to varying environmental dynamics. Non-Stationary Reinforcement Learning
(NSRL) addresses this challenge by training agents to adapt rapidly to
sequences of distinct Markov Decision Processes (MDPs). However, existing NSRL
approaches often focus on tasks with regularly evolving patterns, leading to
limited adaptability in highly dynamic settings. Inspired by the success of
Wavelet analysis in time series modeling, specifically its ability to capture
signal trends at multiple scales, we propose WISDOM to leverage wavelet-domain
predictive task representations to enhance NSRL. WISDOM captures these
multi-scale features in evolving MDP sequences by transforming task
representation sequences into the wavelet domain, where wavelet coefficients
represent both global trends and fine-grained variations of non-stationary
changes. In addition to the auto-regressive modeling commonly employed in time
series forecasting, we devise a wavelet temporal difference (TD) update
operator to enhance tracking and prediction of MDP evolution. We theoretically
prove the convergence of this operator and demonstrate policy improvement with
wavelet task representations. Experiments on diverse benchmarks show that
WISDOM significantly outperforms existing baselines in both sample efficiency
and asymptotic performance, demonstrating its remarkable adaptability in
complex environments characterized by non-stationary and stochastically
evolving tasks.
[LINK]
http://arxiv.org/abs/2510.04507v1
[DATE]
2025-10-06 13:49:18+08:00
[CATEGORIES]
cs.LG
VITA: Variational Pretraining of Transformers for Climate-Robust Crop Yield Forecasting
[AUTHORS]
Adib Hasan, Mardavij Roozbehani, Munther Dahleh
[ABSTRACT]
Accurate crop yield forecasting is essential for global food security.
However, current AI models systematically underperform when yields deviate from
historical trends. We attribute this to the lack of rich, physically grounded
datasets directly linking atmospheric states to yields. To address this, we
introduce VITA (Variational Inference Transformer for Asymmetric data), a
variational pretraining framework that learns representations from large
satellite-based weather datasets and transfers to the ground-based limited
measurements available for yield prediction. VITA is trained using detailed
meteorological variables as proxy targets during pretraining and learns to
predict latent atmospheric states under a seasonality-aware sinusoidal prior.
This allows the model to be fine-tuned using limited weather statistics during
deployment. Applied to 763 counties in the U.S. Corn Belt, VITA achieves
state-of-the-art performance in predicting corn and soybean yields across all
evaluation scenarios, particularly during extreme years, with statistically
significant improvements (paired t-test, $p < 0.0001$). Importantly, VITA
outperforms prior frameworks like GNN-RNN without soil data, and bigger
foundational models (e.g., Chronos-Bolt) with less compute, making it practical
for real-world use, especially in data-scarce regions. This work highlights how
domain-aware AI design can overcome data limitations and support resilient
agricultural forecasting in a changing climate.
[LINK]
http://arxiv.org/abs/2508.03589v2
[DATE]
2025-10-06 13:27:56+08:00
[CATEGORIES]
cs.LG
Expand Neurons, Not Parameters
[AUTHORS]
Linghao Kong, Inimai Subramanian, Yonadav Shavit, Micah Adler, Dan Alistarh, Nir Shavit
[ABSTRACT]
This work demonstrates how increasing the number of neurons in a network
without increasing its number of non-zero parameters improves performance. We
show that this gain corresponds with a decrease in interference between
multiple features that would otherwise share the same neurons. To reduce such
entanglement at a fixed non-zero parameter count, we introduce Fixed Parameter
Expansion (FPE): replace a neuron with multiple children and partition the
parent’s weights disjointly across them, so that each child inherits a
non-overlapping subset of connections. On symbolic tasks, specifically Boolean
code problems, clause-aligned FPE systematically reduces polysemanticity
metrics and yields higher task accuracy. Notably, random splits of neuron
weights approximate these gains, indicating that reduced collisions, not
precise assignment, are a primary driver. Consistent with the superposition
hypothesis, the benefits of FPE grow with increasing interference: when
polysemantic load is high, accuracy improvements are the largest. Transferring
these insights to real models (classifiers over CLIP embeddings and deeper
multilayer networks) we find that widening networks while maintaining a
constant non-zero parameter count consistently increases accuracy. These
results identify an interpretability-grounded mechanism to leverage width
against superposition, improving performance without increasing the number of
non-zero parameters. Such a direction is well matched to modern accelerators,
where memory movement of non-zero parameters, rather than raw compute, is the
dominant bottleneck.
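The splitting step of Fixed Parameter Expansion can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a random disjoint split (the variant the abstract says approximates the clause-aligned gains), it handles only a neuron's incoming weights, and the names `split_neuron` and `count_nonzero` are hypothetical. How outgoing weights and biases are treated is not specified in the abstract and is left out.

```python
import random


def split_neuron(weights, num_children, seed=0):
    """Partition a parent's non-zero incoming weights disjointly across children.

    Each child keeps a non-overlapping subset of the parent's connections and
    has zeros everywhere else, so the total non-zero count is unchanged.
    """
    rng = random.Random(seed)
    nonzero_indices = [i for i, w in enumerate(weights) if w != 0.0]
    rng.shuffle(nonzero_indices)  # random assignment of connections to children
    children = [[0.0] * len(weights) for _ in range(num_children)]
    for position, index in enumerate(nonzero_indices):
        children[position % num_children][index] = weights[index]
    return children


def count_nonzero(rows):
    """Number of non-zero entries across a list of weight rows."""
    return sum(1 for row in rows for w in row if w != 0.0)


parent = [0.5, -1.2, 0.0, 0.8, 2.0, -0.3]
children = split_neuron(parent, num_children=2)
```

Summing the children's rows index by index recovers the parent row, which is one way to check that the split is a true partition of the original connections.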
[COMMENTS]
10 pages, 6 figures
[LINK]
http://arxiv.org/abs/2510.04500v1
[DATE]
2025-10-06 13:26:52+08:00
[CATEGORIES]
cs.LG
Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling
[AUTHORS]
Meihua Dang, Jiaqi Han, Minkai Xu, Kai Xu, Akash Srivastava, Stefano Ermon
[ABSTRACT]
Discrete diffusion models have recently emerged as strong alternatives to
autoregressive language models, matching their performance through large-scale
training. However, inference-time control remains relatively underexplored. In
this work, we study how to steer generation toward desired rewards without
retraining the models. Prior methods typically resample or filter within a
single denoising trajectory, optimizing rewards step-by-step without
trajectory-level refinement. We introduce particle Gibbs sampling for diffusion
language models (PG-DLM), a novel inference-time algorithm enabling
trajectory-level refinement while preserving generation perplexity under reward
optimization. PG-DLM constructs a Markov chain over full denoising trajectories
and applies a conditional sequential Monte Carlo kernel to resample them. We
derive theoretical guarantees for convergence, including asymptotic consistency
and variance bounds. Within this framework, we further analyze trade-offs
across four key axes for inference-time scaling under fixed budgets:
iterations, samples, denoising steps, and reward estimation. Our analysis shows
scaling iterations achieves the best reward-perplexity trade-off. Empirically,
PG-DLM consistently outperforms prior methods using MDLM and LLaDA-8B as base
models across a wide range of compute budgets for reward-guided generation
tasks including toxicity and sentiment control as well as linguistic
acceptability.
[LINK]
http://arxiv.org/abs/2507.08390v2
[DATE]
2025-10-06 13:26:50+08:00
[CATEGORIES]
cs.LG
Comparing Contrastive and Triplet Loss: Variance Analysis and Optimization Behavior
[AUTHORS]
Donghuo Zeng
[ABSTRACT]
Contrastive loss and triplet loss are widely used objectives in deep metric
learning, yet their effects on representation quality remain insufficiently
understood. We present a theoretical and empirical comparison of these losses,
focusing on intra- and inter-class variance and optimization behavior (e.g.,
greedy updates). Through task-specific experiments with consistent settings on
synthetic data and real datasets (MNIST, CIFAR-10), we show that triplet loss
preserves greater variance within and across classes, supporting finer-grained
distinctions in the learned representations. In contrast, contrastive loss
tends to compact intra-class embeddings, which may obscure subtle semantic
differences. To better understand their optimization dynamics, we examine
loss-decay rate, active ratio, and gradient norm, and find that contrastive loss
drives many small updates early on, while triplet loss produces fewer but
stronger updates that sustain learning on hard examples. Finally, across both
classification and retrieval tasks on MNIST, CIFAR-10, CUB-200, and CARS196
datasets, our results consistently show that triplet loss yields superior
performance, which suggests using triplet loss for detail retention and
hard-sample focus, and contrastive loss for smoother, broad-based embedding
refinement.
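For reference, the two objectives compared in this abstract can be written in a few lines. The sketch below uses one common textbook form of each loss with Euclidean distance; the margin values and the exact variants used in the paper are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_class, margin=1.0):
    """Pairwise loss: pull same-class pairs together, push others past the margin."""
    d = euclidean(a, b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Relative loss: the negative must be farther than the positive by the margin."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)


anchor, positive, negative = (0.0, 0.0), (0.3, 0.4), (3.0, 4.0)
pair_loss = contrastive_loss(anchor, positive, same_class=True)
trip_loss = triplet_loss(anchor, positive, negative)
```

In this toy configuration the triplet constraint is already satisfied, so its loss is zero, while the contrastive loss on the positive pair is still non-zero and keeps pulling the two same-class points together. That is consistent with the abstract's observation that contrastive loss tends to compact intra-class embeddings.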
[COMMENTS]
8 pages, 4 tables, 3 figures
[LINK]
http://arxiv.org/abs/2510.02161v2
[DATE]
2025-10-06 13:19:04+08:00
[CATEGORIES]
cs.LG
STIV: Scalable Text and Image Conditioned Video Generation
[AUTHORS]
Zongyu Lin, Wei Liu, Chen Chen, Jiasen Lu, Wenze Hu, Tsu-Jui Fu, Jesse Allardice, Zhengfeng Lai, Liangchen Song, Bowen Zhang, Cha Chen, Yiran Fei, Lezhi Li, Yizhou Sun, Kai-Wei Chang, Yinfei Yang
[ABSTRACT]
The field of video generation has made remarkable advancements, yet there
remains a pressing need for a clear, systematic recipe that can guide the
development of robust and scalable models. In this work, we present a
comprehensive study that systematically explores the interplay of model
architectures, training recipes, and data curation strategies, culminating in a
simple and scalable text-image-conditioned video generation method, named STIV.
Our framework integrates image condition into a Diffusion Transformer (DiT)
through frame replacement, while incorporating text conditioning via a joint
image-text conditional classifier-free guidance. This design enables STIV to
perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks
simultaneously. Additionally, STIV can be easily extended to various
applications, such as video prediction, frame interpolation, multi-view
generation, and long video generation. With comprehensive ablation studies
on T2I, T2V, and TI2V, STIV demonstrates strong performance despite its simple
design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V,
surpassing both leading open and closed-source models like CogVideoX-5B, Pika,
Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result
of 90.1 on VBench I2V task at 512 resolution. By providing a transparent and
extensible recipe for building cutting-edge video generation models, we aim to
empower future research and accelerate progress toward more versatile and
reliable video generation solutions.
[LINK]
http://arxiv.org/abs/2412.07730v2
[DATE]
2025-10-06 13:11:37+08:00
[CATEGORIES]
cs.LG
Deep vs. Shallow: Benchmarking Physics-Informed Neural Architectures on the Biharmonic Equation
[AUTHORS]
Akshay Govind Srinivasan, Vikas Dwivedi, Balaji Srinivasan
[ABSTRACT]
Partial differential equation (PDE) solvers are fundamental to engineering
simulation. Classical mesh-based approaches (finite difference/volume/element)
are fast and accurate on high-quality meshes but struggle with higher-order
operators and complex, hard-to-mesh geometries. Recently developed
physics-informed neural networks (PINNs) and their variants are mesh-free and
flexible, yet compute-intensive and often less accurate. This paper
systematically benchmarks RBF-PIELM, a rapid PINN variant (an extreme learning
machine with radial-basis activations), for higher-order PDEs. RBF-PIELM replaces
PINNs’ time-consuming gradient descent with a single-shot least-squares solve.
We test RBF-PIELM on the fourth-order biharmonic equation using two benchmarks:
lid-driven cavity flow (streamfunction formulation) and a manufactured
oscillatory solution. Our results show up to $350\times$ faster training than
PINNs and over $10\times$ fewer parameters for comparable solution accuracy.
Despite surpassing PINNs, RBF-PIELM still lags mature mesh-based solvers and
its accuracy degrades on highly oscillatory solutions, highlighting remaining
challenges for practical deployment.
[COMMENTS]
16 Pages, 7 Figures and 1 Table. Submitted and accepted at Machine
Learning and the Physical Sciences Workshop at the 39th conference on Neural
Information Processing Systems (NeurIPS)
[LINK]
http://arxiv.org/abs/2510.04490v1
[DATE]
2025-10-06 12:54:04+08:00
[CATEGORIES]
cs.LG
Forking-Sequences
[AUTHORS]
Willa Potosnak, Malcolm Wolff, Boris Oreshkin, Mengfei Cao, Michael W. Mahoney, Dmitry Efimov, Kin G. Olivares
[ABSTRACT]
While accuracy is a critical requirement for time series forecasting models,
an equally important (yet often overlooked) desideratum is forecast stability
across forecast creation dates (FCDs). Even highly accurate models can produce
erratic revisions between FCDs, undermining stakeholder trust and disrupting
downstream decision-making. To improve forecast stability, models like MQCNN,
MQT, and SPADE employ a little-known but highly effective technique:
forking-sequences. Unlike standard statistical and neural forecasting methods
that treat each FCD independently, the forking-sequences method jointly encodes
and decodes the entire time series across all FCDs, in a way mirroring time
series cross-validation. Since forking-sequences remains largely unknown in the
broader neural forecasting community, in this work, we formalize the
forking-sequences approach, and we make a case for its broader adoption. We
demonstrate three key benefits of forking-sequences: (i) more stable and
consistent gradient updates during training; (ii) reduced forecast variance
through ensembling; and (iii) improved inference computational efficiency. We
validate forking-sequences’ benefits using 16 datasets from the M1, M3, M4, and
Tourism competitions, showing improvements in forecast percentage change
stability of 28.8%, 28.8%, 37.9%, 31.3%, and 8.8%, on average, for MLP,
RNN, LSTM, CNN, and Transformer-based architectures, respectively.
[LINK]
http://arxiv.org/abs/2510.04487v1
[DATE]
2025-10-06 12:51:06+08:00
[CATEGORIES]
cs.LG
FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching
[AUTHORS]
Zhen Zou, Feng Zhao
[ABSTRACT]
Diffusion Transformer (DiT) has exhibited impressive generation capabilities
but faces great challenges due to its high computational complexity. To address
this issue, various methods, notably feature caching, have been introduced.
However, these approaches focus on aligning non-cache diffusion without
analyzing why caching damages the generation process. In this paper, we first
confirm that the cache greatly amplifies the exposure bias, resulting in a
decline in the generation quality. However, directly applying noise scaling is
challenging for this issue due to the non-smoothness of exposure bias. We found
that this phenomenon stems from the mismatch between its frequency response
characteristics and the simple cache of Attention and MLP. These two
components exhibit distinct preferences for frequency signals, which motivates
a caching strategy that treats Attention and MLP separately to better fit the
exposure bias and reduce it. Based on this, we introduce FEB-Cache, a
joint caching strategy that caches Attention and MLP according to a
frequency-guided cache table and aligns with the non-exposed bias diffusion
process (which gives us a higher performance cap). Our approach combines a comprehensive
understanding of the caching mechanism and offers a new perspective on
leveraging caching to accelerate the diffusion process. Empirical results
indicate that FEB-Cache optimizes model performance while concurrently
facilitating acceleration. Code is available at
https://github.com/aSleepyTree/EB-Cache.
[LINK]
http://arxiv.org/abs/2503.07120v3
[DATE]
2025-10-06 12:28:05+08:00
[CATEGORIES]
cs.LG
DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization
[AUTHORS]
Gang Li, Yan Chen, Ming Lin, Tianbao Yang
[ABSTRACT]
Recent large reasoning models (LRMs) driven by reinforcement learning
algorithms (e.g., GRPO) have achieved remarkable performance on challenging
reasoning tasks. However, these models suffer from overthinking, generating
unnecessarily long and redundant reasoning even for simple questions, which
substantially increases computational cost and response latency. While existing
methods incorporate length rewards into GRPO to promote concise reasoning, they
incur significant performance degradation. We identify the root cause: when
rewards for correct but long rollouts are penalized, GRPO’s group-relative
advantage function can assign them negative advantages, actively discouraging
valid reasoning. To overcome this, we propose Decoupled Reward Policy
Optimization (DRPO), a novel framework that decouples the length-based learning
signal of correct rollouts from incorrect ones. DRPO ensures that reward
signals for correct rollouts are normalized solely within the positive group,
shielding them from interference by negative samples. DRPO’s objective is
grounded in integrating an optimized positive data distribution, which
maximizes length-based rewards under a KL regularization, into a discriminative
objective. We derive a closed-form solution for this distribution, enabling
efficient computation of the objective and its gradients using only on-policy
data and importance weighting. Of independent interest, this formulation is
general and can incorporate other preference rewards of positive data beyond
length. Experiments on mathematical reasoning tasks demonstrate DRPO’s
significant superiority over six efficient reasoning baselines. Notably, with a
1.5B model, our method achieves 77% length reduction with only 1.1%
performance loss on simple questions such as those in the GSM8k dataset, while
the follow-up baseline sacrifices 4.3% for 68% length reduction.
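The root cause named in this abstract, that a correct but long rollout can receive a negative group-relative advantage, can be reproduced with a toy calculation. This is a sketch under assumptions: the advantage is computed as (reward - group mean) / group standard deviation using the population standard deviation, and the linear length penalty in `length_penalized_reward` is a hypothetical reward shape, not the one used in the paper or its baselines.

```python
import statistics


def length_penalized_reward(correct, length, max_length, penalty=0.8):
    """Hypothetical reward: 0 for wrong answers, 1 minus a length penalty otherwise."""
    if not correct:
        return 0.0
    return 1.0 - penalty * (length / max_length)


def group_relative_advantages(rewards):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


# (is_correct, response_length) for four rollouts of one prompt.
rollouts = [(True, 200), (True, 300), (True, 1000), (False, 400)]
rewards = [length_penalized_reward(c, n, max_length=1000) for c, n in rollouts]
advantages = group_relative_advantages(rewards)
```

Here the third rollout is correct and earns a higher reward than the incorrect one, yet its advantage is negative because it falls below the group mean, so the update would push probability mass away from a valid reasoning trace.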
[COMMENTS]
20 pages, 7 figures
[LINK]
http://arxiv.org/abs/2510.04474v1
[DATE]
2025-10-06 12:18:13+08:00
[CATEGORIES]
cs.LG
Improving Online-to-Nonconvex Conversion for Smooth Optimization via Double Optimism
[AUTHORS]
Francisco Patitucci, Ruichen Jiang, Aryan Mokhtari
[ABSTRACT]
A recent breakthrough in nonconvex optimization is the online-to-nonconvex
conversion framework of [Cutkosky et al., 2023], which reformulates the task of
finding an $\varepsilon$-first-order stationary point as an online learning
problem. When both the gradient and the Hessian are Lipschitz continuous,
instantiating this framework with two different online learners achieves a
complexity of $O(\varepsilon^{-1.75}\log(1/\varepsilon))$ in the deterministic
case and a complexity of $O(\varepsilon^{-3.5})$ in the stochastic case.
However, this approach suffers from several limitations: (i) the deterministic
method relies on a complex double-loop scheme that solves a fixed-point
equation to construct hint vectors for an optimistic online learner,
introducing an extra logarithmic factor; (ii) the stochastic method assumes a
bounded second-order moment of the stochastic gradient, which is stronger than
standard variance bounds; and (iii) different online learning algorithms are
used in the two settings. In this paper, we address these issues by introducing
an online optimistic gradient method based on a novel doubly optimistic hint
function. Specifically, we use the gradient at an extrapolated point as the
hint, motivated by two optimistic assumptions: that the difference between the
hint and the target gradient remains near constant, and that consecutive update
directions change slowly due to smoothness. Our method eliminates the need for
a double loop and removes the logarithmic factor. Furthermore, by simply
replacing full gradients with stochastic gradients and under the standard
assumption that their variance is bounded by $\sigma^2$, we obtain a unified
algorithm with complexity $O(\varepsilon^{-1.75} + \sigma^2
\varepsilon^{-3.5})$, smoothly interpolating between the best-known
deterministic rate and the optimal stochastic rate.
[COMMENTS]
32 pages
[LINK]
http://arxiv.org/abs/2510.03167v2
[DATE]
2025-10-06 11:45:44+08:00
[CATEGORIES]
cs.LG
Benchmarking atmospheric circulation variability in an AI emulator, ACE2, and a hybrid model, NeuralGCM
[AUTHORS]
Ian Baxter, Hamid Pahlavan, Pedram Hassanzadeh, Katharine Rucker, Tiffany Shaw
[ABSTRACT]
Physics-based atmosphere-land models with prescribed sea surface temperature
have notable successes but also biases in their ability to represent
atmospheric variability compared to observations. Recently, AI emulators and
hybrid models have emerged with the potential to overcome these biases, but
still require systematic evaluation against metrics grounded in fundamental
atmospheric dynamics. Here, we evaluate the representation of four atmospheric
variability benchmarking metrics in a fully data-driven AI emulator (ACE2-ERA5)
and hybrid model (NeuralGCM). The hybrid model and emulator can capture the
spectra of large-scale tropical waves and extratropical eddy-mean flow
interactions, including critical levels. However, both struggle to capture the
timescales associated with quasi-biennial oscillation (QBO, $\sim 28$ months)
and Southern annular mode propagation ($\sim 150$ days). These dynamical
metrics serve as an initial benchmarking tool to inform AI model development
and understand their limitations, which may be essential for
out-of-distribution applications (e.g., extrapolating to unseen climates).
[COMMENTS]
12 pages, 4 main figures, 6 supplementary figures
[LINK]
http://arxiv.org/abs/2510.04466v1
[DATE]
2025-10-06 11:42:18+08:00
[CATEGORIES]
cs.LG
Fitted value iteration methods for bicausal optimal transport
[AUTHORS]
Erhan Bayraktar, Bingyan Han
[ABSTRACT]
We develop a fitted value iteration (FVI) method to compute bicausal optimal
transport (OT) where couplings have an adapted structure. Based on the dynamic
programming formulation, FVI adopts a function class to approximate the value
functions in bicausal OT. Under the concentrability condition and approximate
completeness assumption, we prove the sample complexity using (local)
Rademacher complexity. Furthermore, we demonstrate that multilayer neural
networks with appropriate structures satisfy the crucial assumptions required
in sample complexity proofs. Numerical experiments reveal that FVI outperforms
linear programming and adapted Sinkhorn methods in scalability as the time
horizon increases, while still maintaining acceptable accuracy.
[COMMENTS]
Final version
[LINK]
http://arxiv.org/abs/2306.12658v3
[DATE]
2025-10-06 11:35:15+08:00
[CATEGORIES]
cs.LG
Perspectives on Stochastic Localization
[AUTHORS]
Bobby Shi, Kevin Tian, Matthew S. Zhang
[ABSTRACT]
We survey different perspectives on the stochastic localization process of
[Eld13], a powerful construction that has had many exciting recent applications
in high-dimensional probability and algorithm design. Unlike prior surveys on
this topic, our focus is on giving a self-contained presentation of all known
alternative constructions of Eldan’s stochastic localization, with an emphasis
on connections between different constructions. Our hope is that by collecting
these perspectives, some of which had primarily arisen within a particular
community (e.g., probability theory, theoretical computer science, information
theory, or machine learning), we can broaden the accessibility of stochastic
localization, and ease its future use.
[LINK]
http://arxiv.org/abs/2510.04460v1
[DATE]
2025-10-06 11:18:41+08:00
[CATEGORIES]
cs.LG
Inverse Mixed-Integer Programming: Learning Constraints then Objective Functions
[AUTHORS]
Akira Kitaoka
[ABSTRACT]
In mixed-integer linear programming, data-driven inverse optimization that
learns the objective function and the constraints from observed data plays an
important role in constructing appropriate mathematical models for various
fields, including power systems and scheduling. However, to the best of our
knowledge, there is no known method for learning both the objective functions
and the constraints. In this paper, we propose a two-stage method for a class
of problems where the objective function is expressed as a linear combination
of functions and the constraints are represented by functions and thresholds.
Specifically, our method first learns the constraints and then learns the
objective function. On the theoretical side, we show the proposed method can
solve inverse optimization problems with a finite dataset, develop statistical
learning theory in pseudometric spaces and for sub-Gaussian distributions, and
construct a statistical learning theory for inverse optimization. On the
experimental side, we demonstrate that our method is practically applicable to
scheduling problems formulated as integer linear programs with up to 100 decision
variables, which are typical in real-world settings.
[COMMENTS]
33 pages
[LINK]
http://arxiv.org/abs/2510.04455v1
[DATE]
2025-10-06 11:02:43+08:00
[CATEGORIES]
cs.LG
Owen Sampling Accelerates Contribution Estimation in Federated Learning
[AUTHORS]
Hossein KhademSohi, Hadi Hemmati, Jiayu Zhou, Steve Drew
[ABSTRACT]
Federated Learning (FL) aggregates information from multiple clients to train
a shared global model without exposing raw data. Accurately estimating each
client’s contribution is essential not just for fair rewards, but for selecting
the most useful clients so the global model converges faster. The Shapley value
is a principled choice, yet exact computation scales exponentially with the
number of clients, making it infeasible for large federations. We propose
FedOwen, an efficient framework that uses Owen sampling to approximate Shapley
values under the same total evaluation budget as existing methods while keeping
the approximation error small. In addition, FedOwen uses an adaptive client
selection strategy that balances exploiting high-value clients with exploring
under-sampled ones, reducing bias and uncovering rare but informative data.
Under a fixed valuation cost, FedOwen achieves up to 23 percent higher final
accuracy within the same number of communication rounds compared to
state-of-the-art baselines on non-IID benchmarks.
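
To make the sampling idea concrete, here is a minimal sketch of Owen sampling for Shapley values. It illustrates the general technique only and is not FedOwen itself: the function name `owen_shapley`, the toy additive utility, the equal-weight grid over the inclusion probability `q`, and the sample counts are all assumptions made for this example. In a federated setting, `value` would stand for some evaluation of a model built from a coalition of clients.

```python
import random


def owen_shapley(players, value, num_q=11, samples_per_q=200, seed=0):
    """Estimate Shapley values with Owen sampling (illustrative sketch).

    For each inclusion probability q on a uniform grid over [0, 1], every
    other player joins the coalition independently with probability q, and
    the marginal contribution of the target player is averaged.
    """
    rng = random.Random(seed)
    estimates = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(num_q):
            q = k / (num_q - 1)  # grid point in [0, 1]
            for _ in range(samples_per_q):
                coalition = frozenset(p for p in others if rng.random() < q)
                total += value(coalition | {i}) - value(coalition)
        estimates[i] = total / (num_q * samples_per_q)
    return estimates


# Toy additive game (hypothetical): each player's Shapley value is its weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}


def additive_value(coalition):
    return sum(weights[p] for p in coalition)


shapley = owen_shapley(list(weights), additive_value)
```

The cost is `num_q * samples_per_q` pairs of utility evaluations per player, so the budget is fixed in advance rather than growing exponentially with the number of players.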
[COMMENTS]
ECAI 2025 camera-ready; 8 pages + appendix; code link inside
[LINK]
http://arxiv.org/abs/2508.21261v2
[DATE]
2025-10-06 10:49:38+08:00
[CATEGORIES]
cs.LG
Conformalized Generative Bayesian Imaging: An Uncertainty Quantification Framework for Computational Imaging
[AUTHORS]
Canberk Ekmekci, Mujdat Cetin
[ABSTRACT]
Uncertainty quantification plays an important role in achieving trustworthy
and reliable learning-based computational imaging. Recent advances in
generative modeling and Bayesian neural networks have enabled the development
of uncertainty-aware image reconstruction methods. Current generative
model-based methods seek to quantify the inherent (aleatoric) uncertainty on
the underlying image for given measurements by learning to sample from the
posterior distribution of the underlying image. On the other hand, Bayesian
neural network-based approaches aim to quantify the model (epistemic)
uncertainty on the parameters of a deep neural network-based reconstruction
method by approximating the posterior distribution of those parameters.
Unfortunately, there remains a need for an inversion method that can jointly
quantify complex aleatoric and epistemic uncertainty patterns. In this paper,
we present a scalable framework that can quantify both
aleatoric and epistemic uncertainties. The proposed framework accepts an
existing generative model-based posterior sampling method as an input and
introduces an epistemic uncertainty quantification capability through Bayesian
neural networks with latent variables and deep ensembling. Furthermore, by
leveraging the conformal prediction methodology, the proposed framework can be
easily calibrated to ensure rigorous uncertainty quantification. We evaluated
the proposed framework on magnetic resonance imaging, computed tomography, and
image inpainting problems and showed that the epistemic and aleatoric
uncertainty estimates produced by the proposed framework display the
characteristic features of true epistemic and aleatoric uncertainties.
Furthermore, our results demonstrated that the use of conformal prediction on
top of the proposed framework enables marginal coverage guarantees consistent
with frequentist principles.
[COMMENTS]
24 pages, 10 figures, preprint
[LINK]
http://arxiv.org/abs/2504.07696v2
[DATE]
2025-10-06 10:42:53+08:00
[CATEGORIES]
cs.LG
Zeroth-Order Methods for Stochastic Nonconvex Nonsmooth Composite Optimization
[AUTHORS]
Ziyi Chen, Peiran Yu, Heng Huang
[ABSTRACT]
This work aims to solve a stochastic nonconvex nonsmooth composite
optimization problem. Previous works on composite optimization problems require
the major part to satisfy Lipschitz smoothness or some relaxed smoothness
conditions, which excludes some machine learning examples such as regularized
ReLU networks and sparse support matrix machines. In this work, we focus on
stochastic nonconvex composite optimization problems without any smoothness
assumptions. In particular, we propose two new notions of approximate
stationary points for such optimization problems and obtain finite-time
convergence results of two zeroth-order algorithms to these two approximate
stationary points respectively. Finally, we demonstrate that these algorithms
are effective using numerical experiments.
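
For readers unfamiliar with zeroth-order methods, the sketch below shows a generic two-point gradient estimator with Gaussian directions, which uses only function values. It is a standard textbook construction and may differ from the estimators analyzed in the paper; the function names, the smoothing radius `delta`, and the sample count are assumptions chosen for illustration.

```python
import random


def two_point_gradient_estimate(f, x, delta, rng):
    """One two-point zeroth-order estimate of the gradient of f at x.

    Draws a Gaussian direction u and returns
    (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + delta * ui for xi, ui in zip(x, u)]
    x_minus = [xi - delta * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * delta)
    return [scale * ui for ui in u]


def averaged_estimate(f, x, delta=1e-3, num_samples=20000, seed=0):
    """Average many single-direction estimates to reduce variance."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(num_samples):
        g = two_point_gradient_estimate(f, x, delta, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_samples for t in total]


# Hypothetical test function with known gradient (1, 2).
def linear(x):
    return x[0] + 2.0 * x[1]


grad = averaged_estimate(linear, [0.3, -0.7])
```

Because only evaluations of `f` are needed, the same estimator can be applied where gradients are unavailable; in that case it targets the gradient of a Gaussian-smoothed surrogate of `f` rather than of `f` itself.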
[LINK]
http://arxiv.org/abs/2510.04446v1
[DATE]
2025-10-06 10:35:42+08:00
[CATEGORIES]
cs.LG
On-Demand Growth of Semiconductor Heterostructures Guided by Physics-Informed Machine Learning
[AUTHORS]
Chao Shen, Yuan Li, Wenkang Zhan, Shujie Pan, Fuxin Lin, Kaiyao Xin, Hui Cong, Chi Xu, Xiaotian Cheng, Ruixiang Liu, Zhibo Ni, Chaoyuan Jin, Bo Xu, Siming Chen, Zhongming Wei, Chunlai Xue, Zhanguo Wang, Chao Zhao
[ABSTRACT]
Developing tailored semiconductor heterostructures on demand represents a
critical capability for addressing the escalating performance demands in
electronic and optoelectronic devices. However, traditional fabrication methods
remain constrained by simulation-based design and iterative trial-and-error
optimization. Here, we introduce SemiEpi, a self-driving platform designed for
molecular beam epitaxy (MBE) to perform multi-step semiconductor
heterostructure growth through in-situ monitoring and on-the-fly feedback
control. By integrating standard MBE reactors, physics-informed machine
learning (ML) models, and parameter initialization, SemiEpi identifies optimal
initial conditions and proposes experiments for heterostructure growth,
eliminating the need for extensive expertise in MBE processes. As a proof of
concept, we demonstrate the optimization of high-density InAs quantum dot (QD)
growth with a target emission wavelength of 1240 nm, showcasing the power of
SemiEpi. We achieve a QD density of 5 x 10^10 cm^-2, a 1.6-fold increase in
photoluminescence (PL) intensity, and a reduced full width at half maximum
(FWHM) of 29.13 meV, leveraging in-situ reflective high-energy electron
diffraction monitoring with feedback control for adjusting growth temperatures.
Taken together, our results highlight the potential of ML-guided systems to
address challenges in multi-step heterostructure growth, facilitate the
development of a hardware-independent framework, and enhance process
repeatability and stability, even without exhaustive knowledge of growth
parameters.
[COMMENTS]
5 figures
[LINK]
http://arxiv.org/abs/2408.03508v4
[DATE]
2025-10-06 10:26:28+08:00
[CATEGORIES]
cs.LG
From Restless to Contextual: A Thresholding Bandit Reformulation For Finite-horizon Performance
[AUTHORS]
Jiamin Xu, Ivan Nazarov, Aditya Rastogi, África Periáñez, Kyra Gan
[ABSTRACT]
This paper addresses the poor finite-horizon performance of existing online
\emph{restless bandit} (RB) algorithms, which stems from the prohibitive sample
complexity of learning a full \emph{Markov decision process} (MDP) for each
agent. We argue that superior finite-horizon performance requires \emph{rapid
convergence} to a \emph{high-quality} policy. Thus motivated, we introduce a
reformulation of online RBs as a \emph{budgeted thresholding contextual
bandit}, which simplifies the learning problem by encoding long-term state
transitions into a scalar reward. We prove the first non-asymptotic optimality
of an oracle policy for a simplified finite-horizon setting. We propose a
practical learning policy under a heterogeneous-agent, multi-state setting, and
show that it achieves sublinear regret and \emph{faster convergence}
than existing methods. This directly translates to higher cumulative reward, as
empirically validated by significant gains over state-of-the-art algorithms in
large-scale heterogeneous environments. Our work provides a new pathway for
achieving practical, sample-efficient learning in finite-horizon RBs.
[LINK]
http://arxiv.org/abs/2502.05145v3
[DATE]
2025-10-06 10:24:15+08:00
[CATEGORIES]
cs.LG
Directional Convergence, Benign Overfitting of Gradient Descent in leaky ReLU two-layer Neural Networks
[AUTHORS]
Ichiro Hashimoto
[ABSTRACT]
In this paper, we study benign overfitting of fixed width leaky ReLU
two-layer neural network classifiers trained on mixture data via gradient
descent. We provide both upper and lower classification error bounds, and
discover a phase transition in the bound as a function of signal strength. The
lower bound leads to a characterization of cases when benign overfitting
provably fails even if directional convergence occurs. Our analysis allows us
to considerably relax the distributional assumptions that are made in existing
work on benign overfitting of leaky ReLU two-layer neural network classifiers.
We can allow for non-sub-Gaussian data and do not require near orthogonality.
Our results are derived by establishing directional convergence of the network
parameters and studying classification error bounds for the convergent
direction. Previously, directional convergence in (leaky) ReLU neural networks
was established only for gradient flow. By first establishing directional
convergence, we are able to study benign overfitting of fixed width leaky ReLU
two-layer neural network classifiers in a much wider range of scenarios than
was done before.
[COMMENTS]
40 pages, New results added
[LINK]
http://arxiv.org/abs/2505.16204v2
[DATE]
2025-10-06 10:21:14+08:00
[CATEGORIES]
cs.LG
Domain Generalization: A Tale of Two ERMs
[AUTHORS]
Yilun Zhu, Naihao Deng, Naichen Shi, Aditya Gangrade, Clayton Scott
[ABSTRACT]
Domain generalization (DG) is the problem of generalizing from several
distributions (or domains), for which labeled training data are available, to a
new test domain for which no labeled data is available. A common finding in the
DG literature is that it is difficult to outperform empirical risk minimization
(ERM) on the pooled training data.
In this work, we argue that this finding has primarily been reported for
datasets satisfying a \emph{covariate shift} assumption. When the dataset
satisfies a \emph{posterior drift} assumption instead, we show that
“domain-informed ERM,” wherein feature vectors are augmented with
domain-specific information, outperforms pooling ERM. These claims are
supported by a theoretical framework and experiments on language and vision
tasks.
[LINK]
http://arxiv.org/abs/2510.04441v1
[DATE]
2025-10-06 10:17:12+08:00
[CATEGORIES]
cs.LG
Fractional Heat Kernel for Semi-Supervised Graph Learning with Small Training Sample Size
[AUTHORS]
Farid Bozorgnia, Vyacheslav Kungurtsev, Shirali Kadyrov, Mohsen Yousefnezhad
[ABSTRACT]
In this work, we introduce novel algorithms for label propagation and
self-training using fractional heat kernel dynamics with a source term. We
motivate the methodology through the classical correspondence of information
theory with the physics of parabolic evolution equations. We integrate the
fractional heat kernel into Graph Neural Network architectures such as Graph
Convolutional Networks and Graph Attention, enhancing their expressiveness
through adaptive, multi-hop diffusion. By applying Chebyshev polynomial
approximations, large graphs become computationally feasible. Motivating
variational formulations demonstrate that by extending the classical diffusion
model to fractional powers of the Laplacian, nonlocal interactions deliver more
globally diffusing labels. This balance between supervision of known
labels and diffusion across the graph is particularly advantageous in the case
where only a small number of labeled training examples are present. We
demonstrate the effectiveness of this approach on standard datasets.
[LINK]
http://arxiv.org/abs/2510.04440v1
[DATE]
2025-10-06 10:15:46+08:00
[CATEGORIES]
cs.LG
spd-metrics-id: A Python Package for SPD-Aware Distance Metrics in Connectome Fingerprinting and Beyond
[AUTHORS]
Kaosar Uddin
[ABSTRACT]
We present spd-metrics-id, a Python package for computing distances and
divergences between symmetric positive-definite (SPD) matrices. Unlike
traditional toolkits that focus on specific applications, spd-metrics-id
provides a unified, extensible, and reproducible framework for SPD distance
computation. The package supports a wide variety of geometry-aware metrics,
including Alpha-z Bures-Wasserstein, Alpha-Procrustes, affine-invariant
Riemannian, log-Euclidean, and others, and is accessible both via a
command-line interface and a Python API. Reproducibility is ensured through
Docker images and Zenodo archiving. We illustrate usage through a connectome
fingerprinting example, but the package is broadly applicable to covariance
analysis, diffusion tensor imaging, and other domains requiring SPD matrix
comparison. The package is openly available at
https://pypi.org/project/spd-metrics-id/.
[LINK]
http://arxiv.org/abs/2510.04438v1
[DATE]
2025-10-06 10:12:55+08:00
[CATEGORIES]
cs.LG
Trade-off in Estimating the Number of Byzantine Clients in Federated Learning
[AUTHORS]
Ziyi Chen, Su Zhang, Heng Huang
[ABSTRACT]
Federated learning has attracted increasing attention in recent large-scale
optimization and machine learning research and applications, but is also
vulnerable to Byzantine clients that can send arbitrary erroneous signals. Robust
aggregators are commonly used to resist Byzantine clients. This usually
requires estimating the unknown number $f$ of Byzantine clients and then
selecting aggregators with a proper degree of robustness (i.e., the
maximum number $\hat{f}$ of Byzantine clients allowed by the aggregator). Such
an estimate should have an important effect on performance, which has not
been systematically studied to our knowledge. This work fills this gap by
theoretically analyzing the worst-case error of aggregators as well as the
induced federated learning algorithm for all cases of $\hat{f}$ and $f$.
Specifically, we show that underestimation ($\hat{f}<f$) can lead to
arbitrarily poor performance for both aggregators and federated learning. For
non-underestimation ($\hat{f}\ge f$), we prove optimal lower and upper
bounds of the same order on the errors of both aggregators and federated
learning. All these optimal bounds are proportional to $\hat{f}/(n-f-\hat{f})$
with $n$ clients, which monotonically increases with larger $\hat{f}$. This
indicates a fundamental trade-off: while an aggregator with a larger robustness
degree $\hat{f}$ can solve federated learning problems of wider range $f\in
[0,\hat{f}]$, the performance can deteriorate when there are actually fewer or
even no Byzantine clients (i.e., $f\in [0,\hat{f})$).
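
The trade-off can be illustrated numerically with a small sketch. The trimmed mean below is one common robust aggregator, used here as an assumption since the abstract does not name a specific aggregator, and `error_factor` simply evaluates the quantity $\hat{f}/(n-f-\hat{f})$ quoted above; both function names are hypothetical.

```python
def trimmed_mean(values, f_hat):
    """Scalar trimmed mean: drop the f_hat smallest and f_hat largest values."""
    n = len(values)
    if f_hat < 0 or 2 * f_hat >= n:
        raise ValueError("need 0 <= f_hat < n / 2")
    kept = sorted(values)[f_hat:n - f_hat]
    return sum(kept) / len(kept)


def error_factor(n, f, f_hat):
    """The factor f_hat / (n - f - f_hat) from the abstract (requires f_hat >= f)."""
    if f_hat < f or n - f - f_hat <= 0:
        raise ValueError("need f <= f_hat and n > f + f_hat")
    return f_hat / (n - f - f_hat)


# One Byzantine value (1000.0) among five clients, robustness degree 1: filtered out.
well_estimated = trimmed_mean([1.0, 2.0, 3.0, 4.0, 1000.0], f_hat=1)

# Two Byzantine values but robustness degree 1 (underestimation): one survives.
underestimated = trimmed_mean([1.0, 2.0, 3.0, 1000.0, 1000.0], f_hat=1)

# With no Byzantine clients, a larger robustness degree gives a larger factor.
factors = [error_factor(n=10, f=0, f_hat=k) for k in (1, 2, 3)]
```

In this toy run the underestimated case is dominated by the surviving outlier, while the factor grows with $\hat{f}$ even when $f=0$, mirroring the two sides of the trade-off described above.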
[LINK]
http://arxiv.org/abs/2510.04432v1
[DATE]
2025-10-06 10:01:56+08:00
[CATEGORIES]
cs.LG
Achieve Performatively Optimal Policy for Performative Reinforcement Learning
[AUTHORS]
Ziyi Chen, Heng Huang
[ABSTRACT]
Performative reinforcement learning is an emerging dynamical decision making
framework, which extends reinforcement learning to the common applications
where the agent’s policy can change the environmental dynamics. Existing works
on performative reinforcement learning only aim at a performatively stable (PS)
policy that maximizes an approximate value function. However, there is a
provably positive constant gap between the PS policy and the desired
performatively optimal (PO) policy that maximizes the original value function.
In contrast, this work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm
with a zeroth-order approximation of the performative policy gradient
in the Frank-Wolfe framework, and obtains \textbf{the first polynomial-time
convergence to the desired PO} policy under the standard regularizer dominance
condition. For the convergence analysis, we prove two important properties of
the nonconvex value function. First, when the policy regularizer dominates the
environmental shift, the value function satisfies a certain gradient dominance
property, so that any stationary point (not PS) of the value function is a
desired PO. Second, though the value function has unbounded gradient, we prove
that all the sufficiently stationary points lie in a convex and compact policy
subspace $\Pi_{\Delta}$, where the policy value has a constant lower bound
$\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous.
Experimental results also demonstrate that our 0-FW algorithm is more effective
than the existing algorithms in finding the desired PO policy.
[LINK]
http://arxiv.org/abs/2510.04430v1
[DATE]
2025-10-06 09:56:31+08:00
[CATEGORIES]
cs.LG
Learning Survival Models with Right-Censored Reporting Delays
[AUTHORS]
Yuta Shikuri, Hironori Fujisawa
[ABSTRACT]
Survival analysis is a statistical technique used to estimate the time until
an event occurs. Although it is applied across a wide range of fields,
adjusting for reporting delays under practical constraints remains a
significant challenge in the insurance industry. Such delays render event
occurrences unobservable when their reports are subject to right censoring.
This issue becomes particularly critical when estimating hazard rates for newly
enrolled cohorts with limited follow-up due to administrative censoring. Our
study addresses this challenge by jointly modeling the parametric hazard
functions of event occurrences and report timings. The joint probability
distribution is marginalized over the latent event occurrence status. We
construct an estimator for the proposed survival model and establish its
asymptotic consistency. Furthermore, we develop an expectation-maximization
algorithm to compute its estimates. Using these findings, we propose a
two-stage estimation procedure based on a parametric proportional hazards model
to evaluate observations subject to administrative censoring. Experimental
results demonstrate that our method effectively improves the timeliness of risk
evaluation for newly enrolled cohorts.
[COMMENTS]
21 pages, 3 figures, 4 tables
[LINK]
http://arxiv.org/abs/2510.04421v1
[DATE]
2025-10-06 09:16:57+08:00
[CATEGORIES]
cs.LG
Vector Copula Variational Inference and Dependent Block Posterior Approximations
[AUTHORS]
Yu Fu, Michael Stanley Smith, Anastasios Panagiotelis
[ABSTRACT]
The key to VI is the selection of a tractable density to approximate the
Bayesian posterior. For large and complex models a common choice is to assume
independence between multivariate blocks in a partition of the parameter space.
While this simplifies the problem it can reduce accuracy. This paper proposes
using vector copulas to capture dependence between the blocks parsimoniously.
Tailored multivariate marginals are constructed using learnable transport maps.
We call the resulting joint distribution a “dependent block posterior”
approximation. Vector copula models are suggested that make tractable and
flexible variational approximations. They allow for differing marginals,
numbers of blocks, block sizes and forms of between block dependence. They also
allow for solution of the variational optimization using efficient stochastic
gradient methods. The approach is demonstrated using four different statistical
models and 16 datasets which have posteriors that are challenging to
approximate. This includes models that use global-local shrinkage priors for
regularization, and hierarchical models for smoothing and heteroscedastic time
series. In all cases, our method produces more accurate posterior
approximations than benchmark VI methods that either assume block independence
or factor-based dependence, at limited additional computational cost. A Python
package implementing the method is available on GitHub at
https://github.com/YuFuOliver/VCVI_Rep_PyPackage.
[LINK]
http://arxiv.org/abs/2503.01072v2
[DATE]
2025-10-06 08:50:29+08:00
[CATEGORIES]
cs.LG
Emergence of Superposition: Unveiling the Training Dynamics of Chain of Continuous Thought
[AUTHORS]
Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
[ABSTRACT]
Previous work shows that the chain of continuous thought (continuous CoT)
improves the reasoning capability of large language models (LLMs) by enabling
implicit parallel thinking, and a subsequent work provided theoretical insight
by showing that a two-layer transformer equipped with continuous CoT can
efficiently solve directed graph reachability by maintaining a superposition of
multiple reasoning traces in the continuous thought. However, it remains
unclear how the superposition mechanism is naturally learned from
gradient-based training methods. To fill this gap, we theoretically analyze the
training dynamics of a simplified two-layer transformer on the directed graph
reachability problem to unveil how the superposition mechanism emerges during
training in two training stages – (i) a thought-generation stage that
autoregressively expands the continuous thought, and (ii) a prediction stage
that converts the thought into the final answer. Our analysis reveals that
during training using continuous thought, the index-matching logit, an
important quantity which reflects the strength of the model’s local search
ability, will first increase and then remain bounded under mild assumptions.
The bounded index-matching logit effectively balances exploration and
exploitation during the reasoning process: the model will exploit local problem
structures to identify plausible search traces, and assign comparable weights
to multiple such traces to explore when it is uncertain about which solution is
correct, which results in superposition. Our experimental results tracking the
growth of logits further validate our theory.
[COMMENTS]
29 pages, 5 figures
[LINK]
http://arxiv.org/abs/2509.23365v2
[DATE]
2025-10-06 08:40:29+08:00
[CATEGORIES]
cs.LG
Understanding and Improving Shampoo and SOAP via Kullback-Leibler Minimization
[AUTHORS]
Wu Lin, Scott C. Lowe, Felix Dangel, Runa Eschenhagen, Zikun Xu, Roger B. Grosse
[ABSTRACT]
Shampoo and its efficient variant, SOAP, employ structured second-moment
estimations and have shown strong performance for training neural networks
(NNs). In practice, however, Shampoo typically requires step-size grafting with
Adam to be competitive, and SOAP mitigates this by applying Adam in Shampoo’s
eigenbasis – at the cost of additional memory overhead from Adam in both
methods. Prior analyses have largely relied on the Frobenius norm to motivate
these estimation schemes. We instead recast their estimation procedures as
covariance estimation under Kullback-Leibler (KL) divergence minimization,
revealing a previously overlooked theoretical limitation and motivating
principled redesigns. Building on this perspective, we develop
$\textbf{KL-Shampoo}$ and $\textbf{KL-SOAP}$, practical schemes that match or
exceed the performance of Shampoo and SOAP in NN pre-training while achieving
SOAP-level per-iteration runtime. Notably, KL-Shampoo does not rely on Adam to
attain competitive performance, eliminating the memory overhead introduced by
Adam. Across our experiments, KL-Shampoo consistently outperforms SOAP,
Shampoo, and even KL-SOAP, establishing the KL-based approach as a compelling
foundation for designing structured methods in NN optimization.
[COMMENTS]
technical report, work in progress
[LINK]
http://arxiv.org/abs/2509.03378v3
[DATE]
2025-10-06 08:39:27+08:00
[CATEGORIES]
cs.LG
Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games
[AUTHORS]
Brian Hu Zhang, Ioannis Anagnostides, Tuomas Sandholm
[ABSTRACT]
A considerable chasm has been looming for decades between theory and practice
in zero-sum game solving through first-order methods. Although a convergence
rate of $T^{-1}$ has long been established since Nemirovski’s mirror-prox
algorithm and Nesterov’s excessive gap technique in the early 2000s, the most
effective paradigm in practice is counterfactual regret minimization, which
is based on regret matching and its modern variants. In particular, the state
of the art across most benchmarks is predictive regret matching$^+$
(PRM$^+$), in conjunction with non-uniform averaging. Yet, such algorithms can
exhibit slower $\Omega(T^{-1/2})$ convergence even in self-play.
In this paper, we close the gap between theory and practice. We propose a new
scale-invariant and parameter-free variant of PRM$^+$, which we call
IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$
(i.e., optimal) average-iterate convergence guarantees, while also being on par
with PRM$^+$ on benchmark games. From a technical standpoint, we draw an
analogy between IREG-PRM$^+$ and optimistic gradient descent with adaptive
learning rate. The basic flaw of PRM$^+$ is that the ($\ell_2$-)norm of the
regret vector – which can be thought of as the inverse of the learning rate –
can decrease. By contrast, we design IREG-PRM$^+$ so as to maintain the
invariance that the norm of the regret vector is nondecreasing. This enables us
to derive an RVU-type bound for IREG-PRM$^+$, the first such property that does
not rely on introducing additional hyperparameters to enforce smoothness.
Furthermore, we find that IREG-PRM$^+$ performs on par with an adaptive
version of optimistic gradient descent that we introduce whose learning rate
depends on the misprediction error, demystifying the effectiveness of the
regret matching family vis-a-vis more standard optimization techniques.
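
As background for the regret matching family, the sketch below implements one step of plain regret matching$^+$, without the prediction used by PRM$^+$ and without the modifications that define IREG-PRM$^+$. The function name and the toy utility vector are assumptions made for this illustration.

```python
def regret_matching_plus_step(regrets, strategy, utilities):
    """One update of plain regret matching+ (no prediction).

    regrets:   current nonnegative regret vector
    strategy:  mixed strategy that was just played
    utilities: utility of each pure action in this round
    Returns the thresholded regret vector and the next strategy.
    """
    expected = sum(p * u for p, u in zip(strategy, utilities))
    # Accumulate instantaneous regret, then clip at zero (the "+" in RM+).
    new_regrets = [max(r + u - expected, 0.0) for r, u in zip(regrets, utilities)]
    total = sum(new_regrets)
    if total > 0.0:
        # Play each action in proportion to its positive regret.
        new_strategy = [r / total for r in new_regrets]
    else:
        # No positive regret: fall back to the uniform strategy.
        new_strategy = [1.0 / len(new_regrets)] * len(new_regrets)
    return new_regrets, new_strategy


# Three actions, zero initial regret, uniform play (hypothetical round).
regrets, strategy = regret_matching_plus_step(
    [0.0, 0.0, 0.0], [1.0 / 3.0] * 3, [1.0, 0.0, -1.0]
)
```

The update has no step-size parameter: the strategy depends only on the direction of the regret vector, which is why the abstract can interpret the norm of that vector as an inverse learning rate.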
[LINK]
http://arxiv.org/abs/2510.04407v1
[DATE]
2025-10-06 08:33:20+08:00
[CATEGORIES]
cs.LG
Modular and Adaptive Conformal Prediction for Sequential Models via Residual Decomposition
[AUTHORS]
William Zhang, Saurabh Amin, Georgia Perakis
[ABSTRACT]
Conformal prediction offers finite-sample coverage guarantees under minimal
assumptions. However, existing methods treat the entire modeling process as a
black box, overlooking opportunities to exploit modular structure. We introduce
a conformal prediction framework for two-stage sequential models, where an
upstream predictor generates intermediate representations for a downstream
model. By decomposing the overall prediction residual into stage-specific
components, our method enables practitioners to attribute uncertainty to
specific pipeline stages. We develop a risk-controlled parameter selection
procedure using family-wise error rate (FWER) control to calibrate stage-wise
scaling parameters, and propose an adaptive extension for non-stationary
settings that preserves long-run coverage guarantees. Experiments on synthetic
distribution shifts, as well as real-world supply chain and stock market data,
demonstrate that our approach maintains coverage under conditions that degrade
standard conformal methods, while providing interpretable stage-wise
uncertainty attribution. This framework offers diagnostic advantages and robust
coverage that standard conformal methods lack.
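
For context, the baseline that such methods build on can be sketched in a few lines. The code below is standard split conformal prediction with absolute residuals as scores, not the stage-wise decomposition proposed in the paper; the function names and the toy calibration residuals are assumptions made for this example.

```python
import math


def conformal_quantile(cal_residuals, alpha):
    """Finite-sample conformal quantile of calibration residuals.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest residual, or
    infinity when the calibration set is too small for the target level.
    """
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(cal_residuals)[k - 1]


def prediction_interval(point_prediction, cal_residuals, alpha):
    """Symmetric interval around a point prediction at miscoverage level alpha."""
    q = conformal_quantile(cal_residuals, alpha)
    return point_prediction - q, point_prediction + q


# Hypothetical absolute residuals |y - y_hat| on a held-out calibration set.
residuals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
interval = prediction_interval(10.0, residuals, alpha=0.5)
```

This baseline treats the whole pipeline as a single black box: one residual per calibration point and one quantile, with no attribution of uncertainty to individual stages.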
[COMMENTS]
11 pages (37 with appendix), 15 figures
[LINK]
http://arxiv.org/abs/2510.04406v1
[DATE]
2025-10-06 08:33:18+08:00
[CATEGORIES]
cs.LG
Utility-Learning Tension in Self-Modifying Agents
[AUTHORS]
Charles L. Wang, Keir Dorchen, Peter Jin
[ABSTRACT]
As systems trend toward superintelligence, a natural modeling premise is that
agents can self-improve along every facet of their own design. We formalize
this with a five-axis decomposition and a decision layer, separating incentives
from learning behavior and analyzing axes in isolation. Our central result
identifies and introduces a sharp utility–learning tension, the structural
conflict in self-modifying systems whereby utility-driven changes that improve
immediate or expected performance can also erode the statistical preconditions
for reliable learning and generalization. Our findings show that
distribution-free guarantees are preserved iff the policy-reachable model
family is uniformly capacity-bounded; when capacity can grow without limit,
utility-rational self-changes can render learnable tasks unlearnable. Under
standard assumptions common in practice, these axes reduce to the same capacity
criterion, yielding a single boundary for safe self-modification. Numerical
experiments across several axes validate the theory by comparing destructive
utility policies against our proposed two-gate policies that preserve
learnability.
[LINK]
http://arxiv.org/abs/2510.04399v1
[DATE]
2025-10-06 07:52:16+08:00
[CATEGORIES]
cs.LG
Uniform convergence of the smooth calibration error and its relationship with functional gradient
[AUTHORS]
Futoshi Futami, Atsushi Nitanda
[ABSTRACT]
Calibration is a critical requirement for reliable probabilistic prediction,
especially in high-risk applications. However, the theoretical understanding of
which learning algorithms can simultaneously achieve high accuracy and good
calibration remains limited, and many existing studies provide empirical
validation or a theoretical guarantee in restrictive settings. To address this
issue, in this work, we focus on the smooth calibration error (CE) and provide
a uniform convergence bound, showing that the smooth CE is bounded by the sum
of the smooth CE over the training dataset and a generalization gap. We further
prove that the functional gradient of the loss function can effectively control
the training smooth CE. Based on this framework, we analyze three
representative algorithms: gradient boosting trees, kernel boosting, and
two-layer neural networks. For each, we derive conditions under which both
classification and calibration performances are simultaneously guaranteed. Our
results offer new theoretical insights and practical guidance for designing
reliable probabilistic models with provable calibration guarantees.
[LINK]
http://arxiv.org/abs/2505.19396v4
[DATE]
2025-10-06 07:51:42+08:00
[CATEGORIES]
cs.LG
Exact and Linear Convergence for Federated Learning under Arbitrary Client Participation is Attainable
[AUTHORS]
Bicheng Ying, Zhe Li, Haibo Yang
[ABSTRACT]
This work tackles the fundamental challenges in Federated Learning (FL) posed
by arbitrary client participation and data heterogeneity, prevalent
characteristics in practical FL settings. It is well-established that popular
FedAvg-style algorithms struggle with exact convergence and can suffer from
slow convergence rates since a decaying learning rate is required to mitigate
these scenarios. To address these issues, we introduce the concept of a
stochastic matrix and the corresponding time-varying graphs as a novel modeling
tool to accurately capture the dynamics of arbitrary client participation and
the local update procedure. Leveraging this approach, we offer a fresh
decentralized perspective on designing FL algorithms and present FOCUS,
Federated Optimization with Exact Convergence via Push-pull Strategy, a
provably convergent algorithm designed to effectively overcome the previously
mentioned two challenges. More specifically, we provide a rigorous proof
demonstrating that FOCUS achieves exact convergence with a linear rate
regardless of the arbitrary client participation, establishing it as the first
work to demonstrate this significant result.
[COMMENTS]
Accepted by NeurIPS 2025
[LINK]
http://arxiv.org/abs/2503.20117v3
[DATE]
2025-10-06 07:26:38+08:00
[CATEGORIES]
cs.LG
SSM-CGM: Interpretable State-Space Forecasting Model of Continuous Glucose Monitoring for Personalized Diabetes Management
[AUTHORS]
Shakson Isaac, Yentl Collin, Chirag Patel
[ABSTRACT]
Continuous glucose monitoring (CGM) generates dense data streams critical for
diabetes management, but the most widely used forecasting models lack interpretability for
clinical use. We present SSM-CGM, a Mamba-based neural state-space forecasting
model that integrates CGM and wearable activity signals from the AI-READI
cohort. SSM-CGM improves short-term accuracy over a Temporal Fusion Transformer
baseline, adds interpretability through variable selection and temporal
attribution, and enables counterfactual forecasts simulating how planned
changes in physiological signals (e.g., heart rate, respiration) affect
near-term glucose. Together, these features make SSM-CGM an interpretable,
physiologically grounded framework for personalized diabetes management.
[COMMENTS]
Shakson Isaac and Yentl Collin contributed equally
[LINK]
http://arxiv.org/abs/2510.04386v1
[DATE]
2025-10-06 06:37:28+08:00
[CATEGORIES]
cs.LG
Score-based Greedy Search for Structure Identification of Partially Observed Linear Causal Models
[AUTHORS]
Xinshuai Dong, Ignavier Ng, Haoyue Dai, Jiaqi Sun, Xiangchen Song, Peter Spirtes, Kun Zhang
[ABSTRACT]
Identifying the structure of a partially observed causal system is essential
to various scientific fields. Recent advances have focused on constraint-based
causal discovery to solve this problem, and yet in practice these methods often
face challenges related to multiple testing and error propagation. These issues
could be mitigated by a score-based method, so whether a score-based greedy
search method can handle the partially observed scenario has attracted great
attention. In this work, we propose the first
score-based greedy search method for the identification of structure involving
latent variables with identifiability guarantees. Specifically, we propose the
Generalized N Factor Model and establish global consistency:
the true structure, including latent variables, can be identified up to the
Markov equivalence class using a score. We then design
Latent variable Greedy Equivalence Search (LGES), a greedy search algorithm
for this class of models with well-defined operators,
which searches efficiently over the graph space to find the optimal
structure. Our experiments on both synthetic and real-life data validate the
effectiveness of our method (code will be publicly available).
[LINK]
http://arxiv.org/abs/2510.04378v1
[DATE]
2025-10-06 05:50:17+08:00
[CATEGORIES]
cs.LG
TCR-EML: Explainable Model Layers for TCR-pMHC Prediction
[AUTHORS]
Jiarui Li, Zixiang Yin, Zhengming Ding, Samuel J. Landry, Ramgopal R. Mettu
[ABSTRACT]
T cell receptor (TCR) recognition of peptide-MHC (pMHC) complexes is a
central component of adaptive immunity, with implications for vaccine design,
cancer immunotherapy, and autoimmune disease. While recent advances in machine
learning have improved prediction of TCR-pMHC binding, the most effective
approaches are black-box transformer models that cannot provide a rationale for
predictions. Post-hoc explanation methods can provide insight with respect to
the input but do not explicitly model biochemical mechanisms (e.g. known
binding regions), as in TCR-pMHC binding. “Explain-by-design” models (i.e.,
with architectural components that can be examined directly after training)
have been explored in other domains, but have not been used for TCR-pMHC
binding. We propose explainable model layers (TCR-EML) that can be incorporated
into protein-language model backbones for TCR-pMHC modeling. Our approach uses
prototype layers for amino acid residue contacts drawn from known TCR-pMHC
binding mechanisms, enabling high-quality explanations for predicted TCR-pMHC
binding. Experiments with our proposed method on large-scale datasets demonstrate
competitive predictive accuracy and generalization, and evaluation on the
TCR-XAI benchmark demonstrates improved explainability compared with existing
approaches.
[LINK]
http://arxiv.org/abs/2510.04377v1
[DATE]
2025-10-06 05:47:48+08:00
[CATEGORIES]
cs.LG
Categorical Invariants of Learning Dynamics
[AUTHORS]
Abdulrahman Tamim
[ABSTRACT]
Neural network training is typically viewed as gradient descent on a loss
surface. We propose a fundamentally different perspective: learning is a
structure-preserving transformation (a functor L) between the space of network
parameters (Param) and the space of learned representations (Rep). This
categorical framework reveals that different training runs producing similar
test performance often belong to the same homotopy class (continuous
deformation family) of optimization paths. We show experimentally that networks
converging via homotopic trajectories generalize within 0.5% accuracy of each
other, while non-homotopic paths differ by over 3%. The theory provides
practical tools: persistent homology identifies stable minima predictive of
generalization (R^2 = 0.82 correlation), pullback constructions formalize
transfer learning, and 2-categorical structures explain when different
optimization algorithms yield functionally equivalent models. These categorical
invariants offer both theoretical insight into why deep learning works and
concrete algorithmic principles for training more robust networks.
[LINK]
http://arxiv.org/abs/2510.04376v1
[DATE]
2025-10-06 05:45:36+08:00
[CATEGORIES]
cs.LG
Adaptive Weighted Loss for Sequential Recommendations on Sparse Domains
[AUTHORS]
Akshay Mittal, Vinay Venkatesh, Krishna Kandi, Shalini Sudarshan
[ABSTRACT]
The effectiveness of single-model sequential recommendation architectures,
while scalable, is often limited when catering to “power users” in sparse or
niche domains. Our previous research, PinnerFormerLite, addressed this by using
a fixed weighted loss to prioritize specific domains. However, this approach
can be sub-optimal, as a single, uniform weight may not be sufficient for
domains with very few interactions, where the training signal is easily diluted
by the vast, generic dataset.
This paper proposes a novel, data-driven approach: a Dynamic Weighted Loss
function with comprehensive theoretical foundations and extensive empirical
validation. We introduce an adaptive algorithm that adjusts the loss weight for
each domain based on its sparsity in the training data, assigning a higher
weight to sparser domains and a lower weight to denser ones. This ensures that
even rare user interests contribute a meaningful gradient signal, preventing
them from being overshadowed.
We provide rigorous theoretical analysis including convergence proofs,
complexity analysis, and bounds analysis to establish the stability and
efficiency of our approach. Our comprehensive empirical validation across four
diverse datasets (MovieLens, Amazon Electronics, Yelp Business, LastFM Music)
with state-of-the-art baselines (SIGMA, CALRec, SparseEnNet) demonstrates that
this dynamic weighting system significantly outperforms all comparison methods,
particularly for sparse domains, achieving substantial lifts in key metrics
like Recall at 10 and NDCG at 10 while maintaining performance on denser
domains and introducing minimal computational overhead.
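
The weighting idea described above (higher loss weight for sparser domains) can be sketched as follows. The abstract does not give the actual weighting formula, so the inverse-frequency power rule, the clipping value, and all names below are assumptions made for illustration.

```python
def sparsity_weights(domain_counts, power=0.5, max_weight=10.0):
    """Map per-domain interaction counts to loss weights with mean 1.

    Assumed rule: weight grows as (total / count) ** power, clipped at
    max_weight so extremely rare domains cannot dominate the batch.
    """
    total = sum(domain_counts.values())
    raw = {d: min((total / n) ** power, max_weight)
           for d, n in domain_counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {d: w / mean_raw for d, w in raw.items()}

def weighted_loss(per_example_losses, domains, weights):
    """Weighted average of per-example losses using each example's domain."""
    num = sum(weights[d] * loss for loss, d in zip(per_example_losses, domains))
    den = sum(weights[d] for d in domains)
    return num / den
```

Recomputing the weights from the current training data, rather than fixing one weight by hand, is what makes the scheme adaptive in the sense the abstract describes.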
[LINK]
http://arxiv.org/abs/2510.04375v1
[DATE]
2025-10-06 05:42:33+08:00
[CATEGORIES]
cs.LG
Region-of-Interest Augmentation for Mammography Classification under Patient-Level Cross-Validation
[AUTHORS]
Farbod Bigdeli, Mohsen Mohammadagha, Ali Bigdeli
[ABSTRACT]
Breast cancer screening with mammography remains central to early detection
and mortality reduction. Deep learning has shown strong potential for
automating mammogram interpretation, yet limited-resolution datasets and small
sample sizes continue to restrict performance. We revisit the Mini-DDSM dataset
(9,684 images; 2,414 patients) and introduce a lightweight region-of-interest
(ROI) augmentation strategy. During training, full images are probabilistically
replaced with random ROI crops sampled from a precomputed, label-free
bounding-box bank, with optional jitter to increase variability. We evaluate
under strict patient-level cross-validation and report ROC-AUC, PR-AUC, and
training-time efficiency metrics (throughput and GPU memory). Because ROI
augmentation is training-only, inference-time cost remains unchanged. On
Mini-DDSM, ROI augmentation (best: p_roi = 0.10, alpha = 0.10) yields modest
average ROC-AUC gains, with performance varying across folds; PR-AUC is flat to
slightly lower. These results demonstrate that simple, data-centric ROI
strategies can enhance mammography classification in constrained settings
without requiring additional labels or architectural modifications.
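
A minimal sketch of the training-time replacement step, assuming images are 2D lists and the box bank holds `(top, left, height, width)` tuples that fit inside the image. Reading `alpha` as the jitter fraction is an assumption; the abstract names the parameter without defining it.

```python
import random

def roi_augment(image, box_bank, p_roi=0.10, alpha=0.10, rng=random):
    """With probability p_roi, return a jittered ROI crop; else the full image."""
    if not box_bank or rng.random() >= p_roi:
        return image
    top, left, h, w = rng.choice(box_bank)
    # Shift the box by up to +/- alpha of its own size (assumed jitter rule).
    dy = int(round(rng.uniform(-alpha, alpha) * h))
    dx = int(round(rng.uniform(-alpha, alpha) * w))
    img_h, img_w = len(image), len(image[0])
    # Clamp so the shifted box stays inside the image.
    top = min(max(top + dy, 0), img_h - h)
    left = min(max(left + dx, 0), img_w - w)
    return [row[left:left + w] for row in image[top:top + h]]
```

Because the function is applied only in the training data loader, inference sees full images and its cost is unchanged, as the abstract notes.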
[COMMENTS]
5 pages, 5 figures, 2 tables
[LINK]
http://arxiv.org/abs/2509.20585v2
[DATE]
2025-10-06 05:40:20+08:00
[CATEGORIES]
cs.LG
Quantifying Ambiguity in Categorical Annotations: A Measure and Statistical Inference Framework
[AUTHORS]
Christopher Klugmann, Daniel Kondermann
[ABSTRACT]
Human-generated categorical annotations frequently produce empirical response
distributions (soft labels) that reflect ambiguity rather than simple annotator
error. We introduce an ambiguity measure that maps a discrete response
distribution to a scalar in the unit interval, designed to quantify aleatoric
uncertainty in categorical tasks. The measure bears a close relationship to
quadratic entropy (Gini-style impurity) but departs from those indices by
treating an explicit “can’t solve” category asymmetrically, thereby separating
uncertainty arising from class-level indistinguishability from uncertainty due
to explicit unresolvability. We analyze the measure’s formal properties and
contrast its behavior with a representative ambiguity measure from the
literature. Moving beyond description, we develop statistical tools for
inference: we propose frequentist point estimators for population ambiguity and
derive the Bayesian posterior over ambiguity induced by Dirichlet priors on the
underlying probability vector, providing a principled account of epistemic
uncertainty. Numerical examples illustrate estimation, calibration, and
practical use for dataset-quality assessment and downstream machine-learning
workflows.
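
As a point of reference, the symmetric quadratic-entropy (Gini-style) index that the abstract says the new measure departs from can be sketched as below, together with a Dirichlet posterior over it. This is not the authors' measure, which treats the "can't solve" category asymmetrically; the function names and the uniform prior are assumptions.

```python
import random

def normalized_gini(probs):
    """Quadratic entropy rescaled to [0, 1]: 0 = unanimous, 1 = uniform."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two categories")
    return (k / (k - 1)) * (1.0 - sum(p * p for p in probs))

def posterior_samples(counts, prior=1.0, n_samples=2000, seed=0):
    """Sample the index under a Dirichlet(counts + prior) posterior.

    A Dirichlet draw is obtained by normalizing independent Gamma draws.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        gammas = [rng.gammavariate(c + prior, 1.0) for c in counts]
        total = sum(gammas)
        samples.append(normalized_gini([g / total for g in gammas]))
    return samples
```

The spread of the sampled values reflects epistemic uncertainty from having few annotations, while the value itself describes aleatoric ambiguity in the responses.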
[COMMENTS]
Preprint, 20 pages in total, 7 figures
[LINK]
http://arxiv.org/abs/2510.04366v1
[DATE]
2025-10-06 05:19:42+08:00
[CATEGORIES]
cs.LG
From News to Returns: A Granger-Causal Hypergraph Transformer on the Sphere
[AUTHORS]
Anoushka Harit, Zhongtian Sun, Jongmin Yu
[ABSTRACT]
We propose the Causal Sphere Hypergraph Transformer (CSHT), a novel
architecture for interpretable financial time-series forecasting that unifies
Granger-causal hypergraph structure, Riemannian geometry, and
causally masked Transformer attention. CSHT models the directional
influence of financial news and sentiment on asset returns by extracting
multivariate Granger-causal dependencies, which are encoded as directional
hyperedges on the surface of a hypersphere. Attention is constrained via
angular masks that preserve both temporal directionality and geometric
consistency. Evaluated on S&P 500 data from 2018 to 2023, including the 2020
COVID-19 shock, CSHT consistently outperforms baselines across return
prediction, regime classification, and top-asset ranking tasks. By enforcing
predictive causal structure and embedding variables in a Riemannian manifold,
CSHT delivers both robust generalisation across market regimes and
transparent attribution pathways from macroeconomic events to
stock-level responses. These results suggest that CSHT is a principled and
practical solution for trustworthy financial forecasting under uncertainty.
[COMMENTS]
6th ACM International Conference on AI in Finance
[LINK]
http://arxiv.org/abs/2510.04357v1
[DATE]
2025-10-06 04:51:59+08:00
[CATEGORIES]
cs.LG
Quantizer Design for Finite Model Approximations, Model Learning, and Quantized Q-Learning for MDPs with Unbounded Spaces
[AUTHORS]
Osman Bicer, Ali D. Kara, Serdar Yuksel
[ABSTRACT]
In this paper, for Markov decision processes (MDPs) with unbounded state
spaces, we refine the upper bounds presented in [Kara et al., JMLR’23] on
finite model approximation errors by optimizing the quantizers used for finite
model approximations. We also consider implications on quantizer design for
quantized Q-learning and empirical model learning, and the performance of
policies obtained via Q-learning where the quantized state is treated as the
state itself. We highlight the distinctions between planning, where
approximating MDPs can be independently designed, and learning (either via
Q-learning or empirical model learning), where approximating MDPs are
restricted to be defined by invariant measures of Markov chains under
exploration policies, leading to significant subtleties on quantizer design
performance, even though asymptotic near optimality can be established under
both setups. In particular, under Lyapunov growth conditions, we obtain
explicit upper bounds which decay to zero as the number of bins approaches
infinity.
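
To make the setting concrete, here is a minimal sketch of Q-learning where the quantized state is treated as the state itself. The uniform quantizer with two overflow bins is only one simple choice for an unbounded scalar state and is not the optimized design studied in the paper; all names and parameters are assumptions.

```python
def quantize(x, n_bins, radius):
    """Map a real-valued state to a bin index in 0..n_bins-1.

    Bins 1..n_bins-2 partition [-radius, radius) uniformly; bins 0 and
    n_bins-1 absorb everything outside, so the state space may be unbounded.
    """
    if n_bins < 3:
        raise ValueError("need at least one inner bin and two overflow bins")
    if x < -radius:
        return 0
    if x >= radius:
        return n_bins - 1
    inner = n_bins - 2
    width = 2.0 * radius / inner
    return 1 + min(int((x + radius) / width), inner - 1)

def q_update(Q, s_bin, action, reward, s_next_bin, lr=0.1, gamma=0.95):
    """One tabular Q-learning step on quantized states (Q is a list of rows)."""
    target = reward + gamma * max(Q[s_next_bin])
    Q[s_bin][action] += lr * (target - Q[s_bin][action])
```

In the learning setup, the data reaching each bin is governed by the exploration policy, which is why the abstract distinguishes quantizer design for learning from design for planning.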
[LINK]
http://arxiv.org/abs/2510.04355v1
[DATE]
2025-10-06 04:39:52+08:00
[CATEGORIES]
cs.LG
How to build a consistency model: Learning flow maps via self-distillation
[AUTHORS]
Nicholas M. Boffi, Michael S. Albergo, Eric Vanden-Eijnden
[ABSTRACT]
Flow-based generative models achieve state-of-the-art sample quality, but
require the expensive solution of a differential equation at inference time.
Flow map models, commonly known as consistency models, encompass many recent
efforts to improve inference-time efficiency by learning the solution operator
of this differential equation. Yet despite their promise, these models lack a
unified description that clearly explains how to learn them efficiently in
practice. Here, building on the methodology proposed in Boffi et al. (2024),
we present a systematic algorithmic framework for directly learning the flow
map associated with a flow or diffusion model. By exploiting a relationship
between the velocity field underlying a continuous-time flow and the
instantaneous rate of change of the flow map, we show how to convert any
distillation scheme into a direct training algorithm via self-distillation,
eliminating the need for pre-trained teachers. We introduce three algorithmic
families based on different mathematical characterizations of the flow map:
Eulerian, Lagrangian, and Progressive methods, which we show encompass and
extend all known distillation and direct training schemes for consistency
models. We find that the novel class of Lagrangian methods, which avoid both
spatial derivatives and bootstrapping from small steps by design, achieve
significantly more stable training and higher performance than more standard
Eulerian and Progressive schemes. Our methodology unifies existing training
schemes under a single common framework and reveals new design principles for
accelerated generative modeling. Associated code is available at
https://github.com/nmboffi/flow-maps.
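
The relationship between the velocity field and the flow map mentioned above can be written out as follows. The notation is assumed here rather than taken from the abstract, and pairing the three characterizations with the Lagrangian, Eulerian, and Progressive families is inferred from their names.

```latex
% Assumed notation: b_t is the velocity field of the flow;
% X_{s,t} maps a state at time s to the state at time t.
\begin{align*}
\dot{x}_t &= b_t(x_t), \qquad X_{s,t}(x_s) = x_t
  && \text{(flow map as solution operator)} \\
\partial_t X_{s,t}(x) &= b_t\bigl(X_{s,t}(x)\bigr), \qquad X_{s,s}(x) = x
  && \text{(Lagrangian form)} \\
\lim_{s \to t} \partial_t X_{s,t}(x) &= b_t(x)
  && \text{(tangent condition)} \\
\partial_s X_{s,t}(x) + \nabla X_{s,t}(x)\, b_s(x) &= 0
  && \text{(Eulerian form)} \\
X_{u,t}\bigl(X_{s,u}(x)\bigr) &= X_{s,t}(x)
  && \text{(semigroup property)}
\end{align*}
```

The tangent condition is what allows a single network to act as its own teacher: its time derivative on the diagonal $s = t$ recovers the velocity field, which can then supervise the map away from the diagonal.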
[COMMENTS]
NeurIPS 2025
[LINK]
http://arxiv.org/abs/2505.18825v2
[DATE]
2025-10-06 04:24:27+08:00
[CATEGORIES]
cs.LG
Challenge on Optimization of Context Collection for Code Completion
[AUTHORS]
Dmitry Ustalov, Egor Bogomolov, Alexander Bezzubov, Yaroslav Golubev, Evgeniy Glukhov, Georgii Levtsov, Vladimir Kovalenko
[ABSTRACT]
The rapid advancement of workflows and methods for software engineering using
AI emphasizes the need for a systematic evaluation and analysis of their
ability to leverage information from entire projects, particularly in large
code bases. In this challenge on optimization of context collection for code
completion, organized by JetBrains in collaboration with Mistral AI as part of
the ASE 2025 conference, participants developed efficient mechanisms for
collecting context from source code repositories to improve fill-in-the-middle
code completions for Python and Kotlin. We constructed a large dataset of
real-world code in these two programming languages using permissively licensed
open-source projects. The submissions were evaluated based on their ability to
maximize completion quality for multiple state-of-the-art neural models using
the chrF metric. During the public phase of the competition, nineteen teams
submitted solutions to the Python track and eight teams submitted solutions to
the Kotlin track. In the private phase, six teams competed, of which five
submitted papers to the workshop.
[COMMENTS]
7 pages, 3 figures, 5 tables. A report on the Context Collection
Workshop co-located with ASE’25
[LINK]
http://arxiv.org/abs/2510.04349v1
[DATE]
2025-10-06 04:18:34+08:00
[CATEGORIES]
cs.LG
Environment-Aware Indoor LoRaWAN Path Loss: Parametric Regression Comparisons, Shadow Fading, and Calibrated Fade Margins
[AUTHORS]
Nahshon Mokua Obiri, Kristof Van Laerhoven
[ABSTRACT]
Indoor LoRaWAN propagation is shaped by structural and time-varying context
factors, which challenge log-distance models and the assumption of log-normal
shadowing. We present an environment-aware, statistically disciplined path loss
framework evaluated using leakage-safe cross-validation on a 12-month campaign
in an eighth-floor office measuring 240 m^2. A log-distance multi-wall mean is
augmented with environmental covariates (relative humidity, temperature, carbon
dioxide, particulate matter, and barometric pressure), as well as the
signal-to-noise ratio. We compare multiple linear regression with regularized
variants, Bayesian linear regression, and a selective second-order polynomial
applied to continuous drivers. Predictor relevance is established using
heteroscedasticity-robust Type II and III analysis of variance and nested
partial F tests. Shadow fading is profiled with kernel density estimation and
non-parametric families, including Normal, Skew-Normal, Student’s t, and
Gaussian mixtures. The polynomial mean reduces cross-validated RMSE from 8.07
to 7.09 dB and raises R^2 from 0.81 to 0.86. Out-of-fold residuals are
non-Gaussian; a 3-component mixture captures a sharp core with a light, broad
tail. We convert accuracy into reliability by prescribing the fade margin as
the upper-tail quantile of cross-validated residuals, quantifying uncertainty
via a moving-block bootstrap, and validating on a held-out set. At 99% packet
delivery ratio, the environment-aware polynomial requires 25.7 dB versus 27.7
to 27.9 dB for linear baselines. This result presents a deployment-ready,
interpretable workflow with calibrated reliability control for indoor Internet
of Things planning, aligned with 6G targets.
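
The fade-margin rule described above (an upper-tail quantile of cross-validated residuals, with a moving-block bootstrap for uncertainty) can be sketched as follows. The sign convention (residual = measured minus predicted path loss, in dB), the block length, and the function names are assumptions; the paper's exact procedure may differ.

```python
import math
import random

def empirical_quantile(values, q):
    """Smallest sample value whose empirical CDF is at least q."""
    xs = sorted(values)
    k = max(1, math.ceil(q * len(xs)))
    return xs[k - 1]

def fade_margin(residuals_db, pdr=0.99):
    """Margin (dB) covering a fraction `pdr` of out-of-fold residuals."""
    return empirical_quantile(residuals_db, pdr)

def moving_block_bootstrap_ci(residuals_db, pdr=0.99, block=20,
                              n_boot=500, level=0.95, seed=0):
    """Percentile interval for the fade margin, resampling contiguous
    blocks so that short-range temporal dependence is preserved."""
    n = len(residuals_db)
    if block > n:
        raise ValueError("block length exceeds series length")
    rng = random.Random(seed)
    n_blocks = math.ceil(n / block)
    estimates = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(0, n - block + 1)
            sample.extend(residuals_db[start:start + block])
        estimates.append(fade_margin(sample[:n], pdr))
    tail = (1.0 - level) / 2.0
    return empirical_quantile(estimates, tail), empirical_quantile(estimates, 1.0 - tail)
```

Using the empirical quantile avoids assuming Gaussian shadowing, which matters here because the abstract reports non-Gaussian out-of-fold residuals.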
[COMMENTS]
Code: https://github.com/nahshonmokua/LoRaWAN-Indoor-PL-parametrics
[LINK]
http://arxiv.org/abs/2510.04346v1
[DATE]
2025-10-06 04:14:48+08:00
[CATEGORIES]
cs.LG
Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence
[AUTHORS]
Sanish Suwal, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi
[COMMENTS]
4 pages, NeurIPS workshop
[LINK]
http://arxiv.org/abs/2509.21387v3
[DATE]
2025-10-06 04:06:23+08:00
[CATEGORIES]
cs.LG
Learning to Predict Chaos: Curriculum-Driven Training for Robust Forecasting of Chaotic Dynamics
[AUTHORS]
Harshil Vejendla
[ABSTRACT]
Forecasting chaotic systems is a cornerstone challenge in many scientific
fields, complicated by the exponential amplification of even infinitesimal
prediction errors. Modern machine learning approaches often falter due to two
opposing pitfalls: over-specializing on a single, well-known chaotic system
(e.g., Lorenz-63), which limits generalizability, or indiscriminately mixing
vast, unrelated time-series, which prevents the model from learning the nuances
of any specific dynamical regime. We propose Curriculum Chaos Forecasting
(CCF), a training paradigm that bridges this gap. CCF organizes training data
based on fundamental principles of dynamical systems theory, creating a
curriculum that progresses from simple, periodic behaviors to highly complex,
chaotic dynamics. We quantify complexity using the largest Lyapunov exponent
and attractor dimension, two well-established metrics of chaos. By first
training a sequence model on predictable systems and gradually introducing more
chaotic trajectories, CCF enables the model to build a robust and generalizable
representation of dynamical behaviors. We curate a library of over 50 synthetic
ODE/PDE systems to build this curriculum. Our experiments show that
pre-training with CCF significantly enhances performance on unseen, real-world
benchmarks. On datasets including Sunspot numbers, electricity demand, and
human ECG signals, CCF extends the valid prediction horizon by up to 40%
compared to random-order training and more than doubles it compared to training
on real-world data alone. We demonstrate that this benefit is consistent across
various neural architectures (GRU, Transformer) and provide extensive ablations
to validate the importance of the curriculum’s structure.
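
The ordering step can be illustrated on a toy family: the logistic map, whose largest Lyapunov exponent is easy to estimate numerically. The paper's library consists of ODE/PDE systems and also uses attractor dimension; this sketch uses only the Lyapunov exponent, and the function names are assumptions.

```python
import math

def logistic_lyapunov(r, x0=0.2, burn_in=500, steps=5000):
    """Estimate the Lyapunov exponent of x -> r * x * (1 - x).

    It is the long-run average of log|f'(x)| along an orbit, with
    f'(x) = r * (1 - 2x). Negative values indicate periodic behavior,
    positive values indicate chaos.
    """
    x = x0
    for _ in range(burn_in):  # discard the transient
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(steps):
        x = r * x * (1 - x)
        total += math.log(max(abs(r * (1 - 2 * x)), 1e-12))
    return total / steps

def build_curriculum(r_values):
    """Order systems from most predictable to most chaotic."""
    scored = [(logistic_lyapunov(r), r) for r in r_values]
    return [r for _, r in sorted(scored)]
```

Training would then present trajectories in the returned order, starting with the systems whose exponent is lowest.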
[COMMENTS]
MIT URTC Technical Paper (Oral), 5 pages, 4 figures
[LINK]
http://arxiv.org/abs/2510.04342v1
[DATE]
2025-10-06 04:06:16+08:00
[CATEGORIES]
cs.LG
Critical appraisal of artificial intelligence for rare-event recognition: principles and pharmacovigilance case studies
[AUTHORS]
G. Niklas Noren, Eva-Lisa Meldau, Johan Ellenius
[ABSTRACT]
Many high-stakes AI applications target low-prevalence events, where apparent
accuracy can conceal limited real-world value. Relevant AI models range from
expert-defined rules and traditional machine learning to generative LLMs
constrained for classification. We outline key considerations for critical
appraisal of AI in rare-event recognition, including problem framing and test
set design, prevalence-aware statistical evaluation, robustness assessment, and
integration into human workflows. In addition, we propose an approach to
structured case-level examination (SCLE), to complement statistical performance
evaluation, and a comprehensive checklist to guide procurement or development
of AI models for rare-event recognition. We instantiate the framework in
pharmacovigilance, drawing on three studies: rule-based retrieval of
pregnancy-related reports; duplicate detection combining machine learning with
probabilistic record linkage; and automated redaction of person names using an
LLM. We highlight pitfalls specific to the rare-event setting, including
optimism from unrealistic class balance and lack of difficult positive controls
in test sets, and show how cost-sensitive targets align model performance with
operational value. While grounded in pharmacovigilance practice, the principles
generalize to domains where positives are scarce and error costs may be
asymmetric.
[COMMENTS]
28 pages, 2 figures
[LINK]
http://arxiv.org/abs/2510.04341v1
[DATE]
2025-10-06 04:05:38+08:00
[CATEGORIES]
cs.LG
Pitch-Conditioned Instrument Sound Synthesis From an Interactive Timbre Latent Space
[AUTHORS]
Christian Limberg, Fares Schulz, Zhe Zhang, Stefan Weinzierl
[ABSTRACT]
This paper presents a novel approach to neural instrument sound synthesis
using a two-stage semi-supervised learning framework capable of generating
pitch-accurate, high-quality music samples from an expressive timbre latent
space. Existing approaches that achieve sufficient quality for music production
often rely on high-dimensional latent representations that are difficult to
navigate and provide unintuitive user experiences. We address this limitation
through a two-stage training paradigm: first, we train a pitch-timbre
disentangled 2D representation of audio samples using a Variational
Autoencoder; second, we use this representation as conditioning input for a
Transformer-based generative model. The learned 2D latent space serves as an
intuitive interface for navigating and exploring the sound landscape. We
demonstrate that the proposed method effectively learns a disentangled timbre
space, enabling expressive and controllable audio generation with reliable
pitch conditioning. Experimental results show the model’s ability to capture
subtle variations in timbre while maintaining a high degree of pitch accuracy.
The usability of our method is demonstrated in an interactive web application,
highlighting its potential as a step towards future music production
environments that are both intuitive and creatively empowering:
https://pgesam.faresschulz.com
[COMMENTS]
8 pages, accepted to the Proceedings of the 28-th Int. Conf. on
Digital Audio Effects (DAFx25) - demo: https://pgesam.faresschulz.com
[LINK]
http://arxiv.org/abs/2510.04339v1
[DATE]
2025-10-06 04:03:30+08:00
[CATEGORIES]
cs.LG
Aneurysm Growth Time Series Reconstruction Using Physics-informed Autoencoder
[AUTHORS]
Jiacheng Wu
[ABSTRACT]
An arterial aneurysm (Fig. 1) is a bulb-shaped local expansion of a human artery,
the rupture of which is a leading cause of morbidity and mortality in the US.
Therefore, the prediction of arterial aneurysm rupture is of great significance
for aneurysm management and treatment selection. The prediction of aneurysm
rupture depends on the analysis of the time series of aneurysm growth history.
However, due to the long time scale of aneurysm growth, the time series of
aneurysm growth is not always accessible. Here we propose a method to
reconstruct the aneurysm growth time series directly from patient parameters.
The prediction is based on data pairs of [patient parameters, patient aneurysm
growth time history]. To obtain the mapping from patient parameters to patient
aneurysm growth time history, we first apply an autoencoder to obtain a compact
representation of the time series for each patient. Then a mapping is learned
from patient parameters to the corresponding compact representation of time
series via a five-layer neural network. A moving average and a convolutional
output layer are implemented to explicitly take into account the time
dependency of the time series.
Apart from that, we also propose to use prior knowledge about the mechanism
of aneurysm growth to improve the time series reconstruction results. The prior
physics-based knowledge is incorporated as constraints for the optimization
problem associated with the autoencoder. The model can handle both algebraic and
differential constraints. Our results show that including physical model
information about the data will not significantly improve the time series
reconstruction results if the training data is error-free. However, in the case
of training data with noise and bias error, incorporating physical model
constraints can significantly improve the predicted time series.
[COMMENTS]
21 pages, 13 figures
[LINK]
http://arxiv.org/abs/2510.05183v1
[DATE]
2025-10-06 03:54:06+08:00
[CATEGORIES]
cs.LG
DoRAN: Stabilizing Weight-Decomposed Low-Rank Adaptation via Noise Injection and Auxiliary Networks
[AUTHORS]
Nghiem T. Diep, Hien Dang, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho
[ABSTRACT]
Parameter-efficient fine-tuning (PEFT) methods have become the standard
paradigm for adapting large-scale models. Among these techniques,
Weight-Decomposed Low-Rank Adaptation (DoRA) has been shown to improve both the
learning capacity and training stability of the vanilla Low-Rank Adaptation
(LoRA) method by explicitly decomposing pre-trained weights into magnitude and
directional components. In this work, we propose DoRAN, a new variant of DoRA
designed to further stabilize training and boost the sample efficiency of DoRA.
Our approach includes two key stages: (i) injecting noise into the denominator
of DoRA’s weight decomposition, which serves as an adaptive regularizer to
mitigate instabilities; and (ii) replacing static low-rank matrices with
auxiliary networks that generate them dynamically, enabling parameter coupling
across layers and yielding better sample efficiency in both theory and
practice. Comprehensive experiments on vision and language benchmarks show that
DoRAN consistently outperforms LoRA, DoRA, and other PEFT baselines. These
results underscore the effectiveness of combining stabilization through
noise-based regularization with network-based parameter generation, offering a
promising direction for robust and efficient fine-tuning of foundation models.
[COMMENTS]
Nghiem T. Diep, Hien Dang, and Tuan Truong contributed equally to
this work
[LINK]
http://arxiv.org/abs/2510.04331v1
[DATE]
2025-10-06 03:27:48+08:00
[CATEGORIES]
cs.LG
Arithmetic-Mean $\mu$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
[AUTHORS]
Haosong Zhang, Shenxi Wu, Yichi Zhang, Wei Lin
[ABSTRACT]
Choosing an appropriate learning rate remains a key challenge in scaling the
depth of modern deep networks. The classical maximal update parameterization
($\mu$P) enforces a fixed per-layer update magnitude, which is well suited to
homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in
heterogeneous architectures where residual accumulation and convolutions
introduce imbalance across layers. We introduce Arithmetic-Mean $\mu$P
(AM-$\mu$P), which constrains not each individual layer but the network-wide
average one-step pre-activation second moment to a constant scale. Combined
with a residual-aware He fan-in initialization - scaling residual-branch
weights by the number of blocks ($\mathrm{Var}[W]=c/(K\cdot
\mathrm{fan\text{-}in})$) - AM-$\mu$P yields width-robust depth laws that
transfer consistently across depths. We prove that, for one- and
two-dimensional convolutional networks, the maximal-update learning rate
satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects
are constant-level as $N\gg k$. For standard residual networks with general
conv+MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, with $L$ the
minimal depth. Empirical results across a range of depths confirm the $-3/2$
scaling law and enable zero-shot learning-rate transfer, providing a unified
and practical LR principle for convolutional and deep residual networks without
additional tuning overhead.
[COMMENTS]
Preprint. Under review at ICLR 2026
[LINK]
http://arxiv.org/abs/2510.04327v1
[DATE]
2025-10-06 03:22:50+08:00
[CATEGORIES]
cs.LG
FoilDiff: A Hybrid Transformer Backbone for Diffusion-based Modelling of 2D Airfoil Flow Fields
[AUTHORS]
Kenechukwu Ogbuagu, Sepehr Maleki, Giuseppe Bruni, Senthil Krishnababu
[ABSTRACT]
The accurate prediction of flow fields around airfoils is crucial for
aerodynamic design and optimisation. Computational Fluid Dynamics (CFD) models
are effective but computationally expensive, thus inspiring the development of
surrogate models to enable quicker predictions. These surrogate models can be
based on deep learning architectures, such as Convolutional Neural Networks
(CNNs), Graph Neural Networks (GNNs), and Diffusion Models (DMs). Diffusion
models have shown significant promise in predicting complex flow fields. In
this work, we propose FoilDiff, a diffusion-based surrogate model with a
hybrid-backbone denoising network. This hybrid design combines the power of
convolutional feature extraction and transformer-based global attention to
generate more adaptable and accurate representations of flow structures.
FoilDiff takes advantage of Denoising Diffusion Implicit Model (DDIM) sampling
to optimise the efficiency of the sampling process at no additional cost to
model generalisation. We used encoded representations of Reynolds number, angle
of attack, and airfoil geometry to define the input space for generalisation
across a wide range of aerodynamic conditions. When evaluated against
state-of-the-art models, FoilDiff shows significant performance improvements,
with mean prediction errors reducing by up to 85% on the same datasets. The
results have demonstrated that FoilDiff can provide both more accurate
predictions and better-calibrated predictive uncertainty than existing
diffusion-based models.
[LINK]
http://arxiv.org/abs/2510.04325v1
[DATE]
2025-10-06 03:10:38+08:00
[CATEGORIES]
cs.LG
Exploring Applications of State Space Models and Advanced Training Techniques in Sequential Recommendations: A Comparative Study on Efficiency and Performance
[AUTHORS]
Mark Obozov, Makar Baderko, Stepan Kulibaba, Nikolay Kutuzov, Alexander Gasnikov
[ABSTRACT]
Recommender systems aim to estimate the dynamically changing user preferences
and sequential dependencies between historical user behaviour and metadata.
Although transformer-based models have proven to be effective in sequential
recommendations, their state growth is proportional to the length of the
sequence that is being processed, which makes them expensive in terms of memory
and inference costs. Our research focused on three promising directions in
sequential recommendations: enhancing speed through the use of State Space
Models (SSM), as they can achieve SOTA results in the sequential
recommendations domain with lower latency, memory, and inference costs, as
proposed by arXiv:2403.03900; improving the quality of recommendations with
Large Language Models (LLMs) via Monolithic Preference Optimization without
Reference Model (ORPO); and implementing adaptive batch- and step-size
algorithms to reduce costs and accelerate training processes.
[COMMENTS]
arXiv admin note: text overlap with arXiv:2403.07691 by other authors
[LINK]
http://arxiv.org/abs/2408.05606v2
[DATE]
2025-10-06 03:07:28+08:00
[CATEGORIES]
cs.LG
Graph Alignment via Birkhoff Relaxation
[AUTHORS]
Sushil Mahavir Varma, Irène Waldspurger, Laurent Massoulié
[ABSTRACT]
We consider the graph alignment problem, wherein the objective is to find a
vertex correspondence between two graphs that maximizes the edge overlap. The
graph alignment problem is an instance of the quadratic assignment problem
(QAP), known to be NP-hard in the worst case even to approximately solve. In
this paper, we analyze Birkhoff relaxation, a tight convex relaxation of QAP,
and present theoretical guarantees on its performance when the inputs follow
the Gaussian Wigner Model. More specifically, the weighted adjacency matrices
are correlated Gaussian Orthogonal Ensemble with correlation
$1/\sqrt{1+\sigma^2}$. Denote the optimal solutions of the QAP and Birkhoff
relaxation by $\Pi^\star$ and $X^\star$ respectively. We show that
$\|X^\star-\Pi^\star\|_F^2 = o(n)$ when $\sigma = o(n^{-1.25})$ and
$\|X^\star-\Pi^\star\|_F^2 = \Omega(n)$ when $\sigma = \Omega(n^{-0.5})$. Thus,
the optimal solution $X^\star$ transitions from a small perturbation of
$\Pi^\star$ for small $\sigma$ to being well separated from $\Pi^\star$ as
$\sigma$ becomes larger than $n^{-0.5}$. This result allows us to guarantee
that simple rounding procedures on $X^\star$ align $1-o(1)$ fraction of
vertices correctly whenever $\sigma = o(n^{-1.25})$. This condition on $\sigma$
to ensure the success of the Birkhoff relaxation is state-of-the-art.
[COMMENTS]
To appear in NeurIPS 2025
[LINK]
http://arxiv.org/abs/2503.05323v2
[DATE]
2025-10-06 03:06:29+08:00
[CATEGORIES]
cs.LG
QuIC: Quantum-Inspired Compound Adapters for Parameter Efficient Fine-Tuning
[AUTHORS]
Snehal Raj, Brian Coyle
[ABSTRACT]
Scaling full finetuning of large foundation models strains GPU memory and
training time. Parameter Efficient Fine-Tuning (PEFT) methods address this
issue via adapter modules which update only a small subset of model parameters.
In this work, we introduce Quantum-Inspired Compound Adapters (QuIC Adapters),
a PEFT approach inspired by Hamming-weight preserving quantum circuits that
can effectively finetune a model using less than 0.02% of the memory footprint of the
base model. QuIC adapters preserve pretrained representations by enforcing
orthogonality in weight parameters, and have native deployment mechanisms on
quantum computers. We test QuIC adapters by finetuning large language models
like LLaMA and vision transformers on language, math, reasoning and vision
benchmarks. In its first-order configuration, QuIC recovers the performance of
existing orthogonal methods, while higher-order configurations enable
substantial parameter compression (over 40x smaller than LoRA) for a modest
performance trade-off, unlocking applications in highly resource-constrained
environments. Through ablation studies, we determine that combining multiple
Hamming-weight orders with orthogonality and matrix compounding is essential
for performant finetuning. Our findings suggest that QuIC adapters offer a
promising direction for efficient finetuning of foundation models in
resource-constrained environments.
[COMMENTS]
30 pages, 12 figures, 7 tables, ~8000 words
[LINK]
http://arxiv.org/abs/2502.06916v2
[DATE]
2025-10-06 02:54:57+08:00
[CATEGORIES]
cs.LG
Towards Fast Option Pricing PDE Solvers Powered by PIELM
[AUTHORS]
Akshay Govind Srinivasan, Anuj Jagannath Said, Sathwik Pentela, Vikas Dwivedi, Balaji Srinivasan
[ABSTRACT]
Partial differential equation (PDE) solvers underpin modern quantitative
finance, governing option pricing and risk evaluation. Physics-Informed Neural
Networks (PINNs) have emerged as a promising approach for solving the forward
and inverse problems of partial differential equations (PDEs) using deep
learning. However, they remain computationally expensive due to their iterative
gradient-descent-based optimization and scale poorly with increasing model
size. This paper introduces Physics-Informed Extreme Learning Machines (PIELMs)
as a fast alternative to PINNs for solving both forward and inverse problems in
financial PDEs. PIELMs replace iterative optimization with a single
least-squares solve, enabling deterministic and efficient training. We
benchmark PIELM on the Black-Scholes and Heston-Hull-White models for forward
pricing and demonstrate its capability in inverse model calibration to recover
volatility and interest rate parameters from noisy data. From experiments we
observe that PIELMs achieve accuracy comparable to PINNs while being up to
$30\times$ faster, highlighting their potential for real-time financial
modeling.
[COMMENTS]
6 Pages, 5 Figures, 3 Tables
[LINK]
http://arxiv.org/abs/2510.04322v1
[DATE]
2025-10-06 02:50:49+08:00
[CATEGORIES]
cs.LG
Quantum Fisher information matrices from Rényi relative entropies
[AUTHORS]
Mark M. Wilde
[ABSTRACT]
Quantum generalizations of the Fisher information are important in quantum
information science, with applications in high energy and condensed matter
physics and in quantum estimation theory, machine learning, and optimization.
One can derive a quantum generalization of the Fisher information matrix in a
natural way as the Hessian matrix arising in a Taylor expansion of a smooth
divergence. Such an approach is appealing for quantum information theorists,
given the ubiquity of divergences in quantum information theory. In contrast to
the classical case, there is not a unique quantum generalization of the Fisher
information matrix, similar to how there is not a unique quantum generalization
of the relative entropy or the Rényi relative entropy. In this paper, I
derive information matrices arising from the log-Euclidean, $\alpha$-$z$, and
geometric Rényi relative entropies, with the main technical tool for doing so
being the method of divided differences for calculating matrix derivatives.
Interestingly, for all non-negative values of the Rényi parameter $\alpha$,
the log-Euclidean Rényi relative entropy leads to the Kubo-Mori information
matrix, and the geometric Rényi relative entropy leads to the
right-logarithmic derivative Fisher information matrix. Thus, the resulting
information matrices obey the data-processing inequality for all non-negative
values of the Rényi parameter $\alpha$ even though the original quantities do
not. Additionally, I derive and establish basic properties of $\alpha$-$z$
information matrices resulting from the $\alpha$-$z$ Rényi relative
entropies. For parameterized thermal states and time-evolved states, I
establish formulas for their $\alpha$-$z$ information matrices and hybrid
quantum-classical algorithms for estimating them, with applications in quantum
Boltzmann machine learning.
[COMMENTS]
v2: 106 pages, 2 figures, dedicated to Professor Fumio Hiai on the
occasion of his forthcoming 80th birthday
[LINK]
http://arxiv.org/abs/2510.02218v2
[DATE]
2025-10-06 02:35:10+08:00
[CATEGORIES]
cs.LG
Crash Severity Prediction Using Deep Learning Approaches: A Hybrid CNN-RNN Framework
[AUTHORS]
Sahar Koohfar
[ABSTRACT]
Accurate and timely prediction of crash severity is crucial in mitigating the
severe consequences of traffic accidents. In order to provide appropriate
levels of medical assistance and
transportation services, an intelligent transportation system relies on
effective prediction methods. Deep learning models have gained popularity in
this domain due to their capability to capture non-linear relationships among
variables. In this research, we have implemented a hybrid CNN-RNN deep learning
model for crash severity prediction and compared its performance against widely
used statistical and machine learning models such as logistic regression,
naïve Bayes classifier, K-Nearest Neighbors (KNN), decision tree, and
individual deep learning models: RNN and CNN. This study employs a methodology
that considers the interconnected relationships between various features of
traffic accidents. The study was conducted using a dataset of 15,870 accident
records gathered over a period of seven years between 2015 and 2021 on Virginia
highway I-64. The findings demonstrate that the proposed CNN-RNN hybrid model
has outperformed all benchmark models in terms of predicting crash severity.
This result illustrates the effectiveness of the hybrid model as it combines
the advantages of both RNN and CNN models in order to achieve greater accuracy
in the prediction process.
[LINK]
http://arxiv.org/abs/2510.04316v1
[DATE]
2025-10-06 02:31:45+08:00
[CATEGORIES]
cs.LG
On Zero-Shot Reinforcement Learning
[AUTHORS]
Scott Jeen
[ABSTRACT]
Modern reinforcement learning (RL) systems capture deep truths about general,
human problem-solving. In domains where new data can be simulated cheaply,
these systems uncover sequential decision-making policies that far exceed the
ability of any human. Society faces many problems whose solutions require this
skill, but they are often in domains where new data cannot be cheaply
simulated. In such scenarios, we can learn simulators from existing data, but
these will only ever be approximately correct, and can be pathologically
incorrect when queried outside of their training distribution. As a result, a
misalignment between the environments in which we train our agents and the
real world in which we wish to deploy our agents is inevitable. Dealing with
this misalignment is the primary concern of zero-shot reinforcement learning, a
problem setting where the agent must generalise to a new task or domain with
zero practice shots. Whilst impressive progress has been made on methods that
perform zero-shot RL in idealised settings, new work is needed if these results
are to be replicated in real-world settings. In this thesis, we argue that
doing so requires us to navigate (at least) three constraints. First, the data
quality constraint: real-world datasets are small and homogeneous. Second, the
observability constraint: states, dynamics and rewards in the real world are
often only partially observed. And third, the data availability constraint: a
priori access to data cannot always be assumed. This work proposes a suite of
methods that perform zero-shot RL subject to these constraints. In a series of
empirical studies we expose the failings of existing methods, and justify our
techniques for remedying them. We believe these designs take us a step closer
to RL methods that can be deployed to solve real-world problems.
[COMMENTS]
PhD thesis
[LINK]
http://arxiv.org/abs/2508.16496v2
[DATE]
2025-10-06 02:29:40+08:00
[CATEGORIES]
cs.LG
On amortizing convex conjugates for optimal transport
[AUTHORS]
Brandon Amos
[ABSTRACT]
This paper focuses on computing the convex conjugate (also known as the
Legendre-Fenchel conjugate or c-transform) that appears in Euclidean
Wasserstein-2 optimal transport. This conjugation is considered difficult to
compute and in practice, methods are limited by not being able to exactly
conjugate the dual potentials in continuous space. To overcome this, the
computation of the conjugate can be approximated with amortized optimization,
which learns a model to predict the conjugate. I show that combining amortized
approximations to the conjugate with a solver for fine-tuning significantly
improves the quality of transport maps learned for the Wasserstein-2 benchmark
by Korotin et al. (2021a) and is able to model many 2-dimensional couplings and
flows considered in the literature. All baselines, methods, and solvers are
publicly available at http://github.com/facebookresearch/w2ot.
[COMMENTS]
ICLR 2023
[LINK]
http://arxiv.org/abs/2210.12153v3
[DATE]
2025-10-06 02:29:04+08:00
[CATEGORIES]
cs.LG
Query Drift Compensation: Enabling Compatibility in Continual Learning of Retrieval Embedding Models
[AUTHORS]
Dipam Goswami, Liying Wang, Bartłomiej Twardowski, Joost van de Weijer
[ABSTRACT]
Text embedding models enable semantic search, powering several NLP
applications like Retrieval Augmented Generation by efficient information
retrieval (IR). However, text embedding models are commonly studied in
scenarios where the training data is static, thus limiting their application to
dynamic scenarios where new training data emerges over time. IR methods
generally encode a huge corpus of documents to low-dimensional embeddings and
store them in a database index. During retrieval, a semantic search over the
corpus is performed and the document whose embedding is most similar to the
query embedding is returned. When updating an embedding model with new training
data, using the already indexed corpus is suboptimal due to the
non-compatibility issue, since the model which was used to obtain the
embeddings of the corpus has changed. While re-indexing of old corpus documents
using the updated model enables compatibility, it requires much higher
computation and time. Thus, it is critical to study how the already indexed
corpus can still be effectively used without the need for re-indexing. In this
work, we establish a continual learning benchmark with large-scale datasets and
continually train dense retrieval embedding models on query-document pairs from
new datasets in each task and observe forgetting on old tasks due to
significant drift of embeddings. We employ embedding distillation on both query
and document embeddings to maintain stability and propose a novel query drift
compensation method during retrieval to project new model query embeddings to
the old embedding space. This enables compatibility with previously indexed
corpus embeddings extracted using the old model and thus reduces the
forgetting. We show that the proposed method significantly improves performance
without any re-indexing. Code is available at
https://github.com/dipamgoswami/QDC.
[COMMENTS]
Accepted at CoLLAs 2025
[LINK]
http://arxiv.org/abs/2506.00037v2
[DATE]
2025-10-06 01:58:02+08:00
[CATEGORIES]
cs.LG
Riemannian Optimization on Tree Tensor Networks with Application in Machine Learning
[AUTHORS]
Marius Willner, Marco Trenti, Dirk Lebiedz
[ABSTRACT]
Tree tensor networks (TTNs) are widely used in low-rank approximation and
quantum many-body simulation. In this work, we present a formal analysis of the
differential geometry underlying TTNs. Building on this foundation, we develop
efficient first- and second-order optimization algorithms that exploit the
intrinsic quotient structure of TTNs. Additionally, we devise a backpropagation
algorithm for training TTNs in a kernel learning setting. We validate our
methods through numerical experiments on a representative machine learning
task.
[COMMENTS]
24 pages, 6 figures, 4 pseudo-code algorithms, 1 table; updated
version: additional explanation for computational advantages of Cart. horiz.
space in Sec. 6; updated Fig. 6 accordingly; fixed typos and added references
[LINK]
http://arxiv.org/abs/2507.21726v2
[DATE]
2025-10-06 01:33:44+08:00
[CATEGORIES]
cs.LG
The Cosine Schedule is Fisher-Rao-Optimal for Masked Discrete Diffusion Models
[AUTHORS]
Leo Zhang, Saifuddin Syed
[ABSTRACT]
In this work, we study the problem of choosing the discretisation schedule
for sampling from masked discrete diffusion models in terms of the information
geometry of the induced probability path. Specifically, we show that the
optimal schedule under the Fisher-Rao geometry recovers the popularly used
cosine schedule.
[COMMENTS]
PreprintV2
[LINK]
http://arxiv.org/abs/2508.04884v2
[DATE]
2025-10-06 01:30:12+08:00
[CATEGORIES]
cs.LG
HoRA: Cross-Head Low-Rank Adaptation with Joint Hypernetworks
[AUTHORS]
Nghiem T. Diep, Dung Le, Tuan Truong, Tan Dinh, Huy Nguyen, Nhat Ho
[ABSTRACT]
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT)
technique that adapts large pre-trained models by adding low-rank matrices to
their weight updates. However, in the context of fine-tuning multi-head
self-attention (MHA), LoRA has been employed to adapt each attention head
separately, thereby overlooking potential synergies across different heads. To
mitigate this issue, we propose a novel Hyper-shared Low-Rank Adaptation (HoRA)
method, which utilizes joint hypernetworks to generate low-rank matrices across
attention heads. By coupling their adaptation through a shared generator, HoRA
encourages cross-head information sharing, and thus directly addresses the
aforementioned limitation of LoRA. By comparing LoRA and HoRA through the lens
of hierarchical mixture of experts, our theoretical findings reveal that the
latter achieves superior sample efficiency to the former. Furthermore, through
extensive experiments across diverse language and vision benchmarks, we
demonstrate that HoRA outperforms LoRA and other PEFT methods while requiring
only a marginal increase in the number of trainable parameters.
[COMMENTS]
Nghiem T. Diep, Dung Le, and Tuan Truong contributed equally to this
work
[LINK]
http://arxiv.org/abs/2510.04295v1
[DATE]
2025-10-06 01:13:39+08:00
[CATEGORIES]
cs.LG
Approximation Bounds for Recurrent Neural Networks with Application to Regression
[AUTHORS]
Yuling Jiao, Yang Wang, Bokai Yan
[ABSTRACT]
We study the approximation capacity of deep ReLU recurrent neural networks
(RNNs) and explore the convergence properties of nonparametric least squares
regression using RNNs. We derive upper bounds on the approximation error of
RNNs for Hölder smooth functions, in the sense that the output at each time
step of an RNN can approximate a Hölder function that depends only on past
and current information, termed a past-dependent function. This allows a
carefully constructed RNN to simultaneously approximate a sequence of
past-dependent Hölder functions. We apply these approximation results to
derive non-asymptotic upper bounds for the prediction error of the empirical
risk minimizer in regression problems. Our error bounds achieve the minimax optimal
rate under both exponentially $\beta$-mixing and i.i.d. data assumptions,
improving upon existing ones. Our results provide statistical guarantees on the
performance of RNNs.
[LINK]
http://arxiv.org/abs/2409.05577v2
[DATE]
2025-10-06 01:08:44+08:00
[CATEGORIES]
cs.LG
A KL-regularization framework for learning to plan with adaptive priors
[AUTHORS]
Álvaro Serra-Gomez, Daniel Jarne Ornia, Dhruva Tirumala, Thomas Moerland
[ABSTRACT]
Effective exploration remains a central challenge in model-based
reinforcement learning (MBRL), particularly in high-dimensional continuous
control tasks where sample efficiency is crucial. A prominent line of recent
work leverages learned policies as proposal distributions for Model-Predictive
Path Integral (MPPI) planning. Initial approaches update the sampling policy
independently of the planner distribution, typically maximizing a learned value
function with deterministic policy gradient and entropy regularization.
However, because the states encountered during training depend on the MPPI
planner, aligning the sampling policy with the planner improves the accuracy of
value estimation and long-term performance. To this end, recent methods update
the sampling policy by minimizing KL divergence to the planner distribution or
by introducing planner-guided regularization into the policy update. In this
work, we unify these MPPI-based reinforcement learning methods under a single
framework by introducing Policy Optimization-Model Predictive Control (PO-MPC),
a family of KL-regularized MBRL methods that integrate the planner’s action
distribution as a prior in policy optimization. By aligning the learned policy
with the planner’s behavior, PO-MPC allows more flexibility in the policy
updates to trade off return maximization and KL divergence minimization. We
clarify how prior approaches emerge as special cases of this family, and we
explore previously unstudied variations. Our experiments show that these
extended configurations yield significant performance improvements, advancing
the state of the art in MPPI-based RL.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2510.04280v1
[DATE]
2025-10-06 00:45:38+08:00
[CATEGORIES]
cs.LG
OptiFLIDS: Optimized Federated Learning for Energy-Efficient Intrusion Detection in IoT
[AUTHORS]
Saida Elouardi, Mohammed Jouhari, Anas Motii
[ABSTRACT]
In critical IoT environments, such as smart homes and industrial systems,
effective Intrusion Detection Systems (IDS) are essential for ensuring
security. However, developing robust IDS solutions remains a significant
challenge. Traditional machine learning-based IDS models typically require
large datasets, but data sharing is often limited due to privacy and security
concerns. Federated Learning (FL) presents a promising alternative by enabling
collaborative model training without sharing raw data. Despite its advantages,
FL still faces key challenges, such as data heterogeneity (non-IID data) and
high energy and computation costs, particularly for resource-constrained IoT
devices. To address these issues, this paper proposes OptiFLIDS, a novel
approach that applies pruning techniques during local training to reduce model
complexity and energy consumption. It also incorporates a customized
aggregation method to better handle pruned models that differ due to non-IID
data distributions. Experiments conducted on three recent IoT IDS datasets,
TON_IoT, X-IIoTID, and IDSIoT2024, demonstrate that OptiFLIDS maintains strong
detection performance while improving energy efficiency, making it well-suited
for deployment in real-world IoT environments.
[COMMENTS]
12 pages, 15 figures
[LINK]
http://arxiv.org/abs/2510.05180v1
[DATE]
2025-10-06 00:44:41+08:00
[CATEGORIES]
cs.LG
Relative Information Gain and Gaussian Process Regression
[AUTHORS]
Hamish Flynn
[ABSTRACT]
The sample complexity of estimating or maximising an unknown function in a
reproducing kernel Hilbert space is known to be linked to both the effective
dimension and the information gain associated with the kernel. While the
information gain has an attractive information-theoretic interpretation, the
effective dimension typically results in better rates. We introduce a new
quantity called the relative information gain, which measures the sensitivity
of the information gain with respect to the observation noise. We show that the
relative information gain smoothly interpolates between the effective dimension
and the information gain, and that the relative information gain has the same
growth rate as the effective dimension. In the second half of the paper, we
prove a new PAC-Bayesian excess risk bound for Gaussian process regression. The
relative information gain arises naturally from the complexity term in this
PAC-Bayesian bound. We prove bounds on the relative information gain that
depend on the spectral properties of the kernel. When these upper bounds are
combined with our excess risk bound, we obtain minimax-optimal rates of
convergence.
[COMMENTS]
28 pages
[LINK]
http://arxiv.org/abs/2510.04277v1
[DATE]
2025-10-06 00:35:51+08:00
[CATEGORIES]
cs.LG
Influence branching for learning to solve mixed-integer programs online
[AUTHORS]
Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson
[ABSTRACT]
On the occasion of the 20th Mixed Integer Program Workshop’s computational
competition, this work introduces a new approach for learning to solve MIPs
online. Influence branching, a new graph-oriented variable selection strategy,
is applied throughout the first iterations of the branch and bound algorithm.
This branching heuristic is optimized online with Thompson sampling, which
ranks the best graph representations of the MIP’s structure according to
computational speed-up over SCIP. We achieve results comparable to
state-of-the-art online learning methods. Moreover, our results indicate that our method
generalizes well to more general online frameworks, where variations in
constraint matrix, constraint vector and objective coefficients can all occur
and where more samples are available.
[COMMENTS]
11 pages
[LINK]
http://arxiv.org/abs/2510.04273v1
[DATE]
2025-10-06 00:29:44+08:00
[CATEGORIES]
cs.LG
Closing the Loop: Coordinating Inventory and Recommendation via Deep Reinforcement Learning on Multiple Timescales
[AUTHORS]
Jinyang Jiang, Jinhui Han, Yijie Peng, Ying Zhang
[ABSTRACT]
Effective cross-functional coordination is essential for enhancing firm-wide
profitability, particularly in the face of growing organizational complexity
and scale. Recent advances in artificial intelligence, especially in
reinforcement learning (RL), offer promising avenues to address this
fundamental challenge. This paper proposes a unified multi-agent RL framework
tailored for joint optimization across distinct functional modules, exemplified
via coordinating inventory replenishment and personalized product
recommendation. We first develop an integrated theoretical model to capture the
intricate interplay between these functions and derive analytical benchmarks
that characterize optimal coordination. The analysis reveals synchronized
adjustment patterns across products and over time, highlighting the importance
of coordinated decision-making. Leveraging these insights, we design a novel
multi-timescale multi-agent RL architecture that decomposes policy components
according to departmental functions and assigns distinct learning speeds based
on task complexity and responsiveness. Our model-free multi-agent design
improves scalability and deployment flexibility, while multi-timescale updates
enhance convergence stability and adaptability across heterogeneous decisions.
We further establish the asymptotic convergence of the proposed algorithm.
Extensive simulation experiments demonstrate that the proposed approach
significantly improves profitability relative to siloed decision-making
frameworks, while the behaviors of the trained RL agents align closely with the
managerial insights from our theoretical model. Taken together, this work
provides a scalable, interpretable RL-based solution to enable effective
cross-functional coordination in complex business settings.
[LINK]
http://arxiv.org/abs/2510.04272v1
[DATE]
2025-10-06 00:28:06+08:00
[CATEGORIES]
cs.LG
Probabilistic Language-Image Pre-Training
[AUTHORS]
Sanghyuk Chun, Wonjae Kim, Song Park, Sangdoo Yun
[ABSTRACT]
Vision-language models (VLMs) embed aligned image-text pairs into a joint
space but often rely on deterministic embeddings, assuming a one-to-one
correspondence between images and texts. This oversimplifies real-world
relationships, which are inherently many-to-many, with multiple captions
describing a single image and vice versa. We introduce Probabilistic
Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained
on a billion-scale image-text dataset using only probabilistic objectives,
achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot
accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an
“uncertainty token” without extra parameters. We also introduce a novel
inclusion loss that enforces distributional inclusion relationships between
image-text pairs and between original and masked inputs. Experiments
demonstrate that, by leveraging uncertainty estimates, ProLIP benefits
downstream tasks and aligns with intuitive notions of uncertainty, e.g.,
shorter texts being more uncertain and more general inputs including specific
ones. Utilizing text uncertainties, we further improve ImageNet accuracy from
74.6% to 75.8% (under a few-shot setting), supporting the practical advantages
of our probabilistic approach. The code is available at
https://github.com/naver-ai/prolip
[COMMENTS]
Code: https://github.com/naver-ai/prolip HuggingFace Hub:
https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291
33 pages, 4.5 MB; LongProLIP paper: arXiv:2503.08048; Multiplicity paper for
more background: arxiv.org:2505.19614; v4: fix typos
[LINK]
http://arxiv.org/abs/2410.18857v4
[DATE]
2025-10-06 00:26:02+08:00
[CATEGORIES]
cs.LG
Efficient Latent Variable Causal Discovery: Combining Score Search and Targeted Testing
[AUTHORS]
Joseph Ramsey, Bryan Andrews
[ABSTRACT]
Learning causal structure from observational data is especially challenging
when latent variables or selection bias are present. The Fast Causal Inference
(FCI) algorithm addresses this setting but often performs exhaustive
conditional independence tests across many subsets, leading to spurious
independence claims, extra or missing edges, and unreliable orientations. We
present a family of score-guided mixed-strategy causal search algorithms that
build on this tradition. First, we introduce BOSS-FCI and GRaSP-FCI,
straightforward variants of GFCI that substitute BOSS or GRaSP for FGES,
thereby retaining correctness while incurring different scalability tradeoffs.
Second, we develop FCI Targeted-testing (FCIT), a novel mixed-strategy method
that improves upon these variants by replacing exhaustive all-subsets testing
with targeted tests guided by BOSS, yielding well-formed PAGs with higher
precision and efficiency. Finally, we propose a simple heuristic, LV-Dumb (also
known as BOSS-POD), which bypasses latent-variable-specific reasoning and
directly returns the PAG of the BOSS DAG. Although not strictly correct in the
FCI sense, it scales better and often achieves superior accuracy in practice.
Simulations and real-data analyses demonstrate that BOSS-FCI and GRaSP-FCI
provide sound baselines, FCIT improves both efficiency and reliability, and
LV-Dumb offers a practical heuristic with strong empirical performance.
Together, these methods highlight the value of score-guided and targeted
strategies for scalable latent-variable causal discovery.
[COMMENTS]
30 pages, 23 figures, 6 tables
[LINK]
http://arxiv.org/abs/2510.04263v1
[DATE]
2025-10-06 00:09:31+08:00
[CATEGORIES]
cs.LG
Machine Learning as Iterated Belief Change a la Darwiche and Pearl
[AUTHORS]
Theofanis Aravanis
[ABSTRACT]
Artificial Neural Networks (ANNs) are powerful machine-learning models
capable of capturing intricate non-linear relationships. They are widely used
nowadays across numerous scientific and engineering domains, driving
advancements in both research and real-world applications. In our recent work,
we focused on the statics and dynamics of a particular subclass of ANNs, which
we refer to as binary ANNs. A binary ANN is a feed-forward network in which
both inputs and outputs are restricted to binary values, making it particularly
suitable for a variety of practical use cases. Our previous study approached
binary ANNs through the lens of belief-change theory, specifically the
Alchourron, Gardenfors and Makinson (AGM) framework, yielding several key
insights. Most notably, we demonstrated that the knowledge embodied in a binary
ANN (expressed through its input-output behaviour) can be symbolically
represented using a propositional logic language. Moreover, the process of
modifying a belief set (through revision or contraction) was mapped onto a
gradual transition through a series of intermediate belief sets. Analogously,
the training of binary ANNs was conceptualized as a sequence of such belief-set
transitions, which we showed can be formalized using full-meet AGM-style belief
change. In the present article, we extend this line of investigation by
addressing some critical limitations of our previous study. Specifically, we
show that Dalal’s method for belief change naturally induces a structured,
gradual evolution of states of belief. More importantly, given the known
shortcomings of full-meet belief change, we demonstrate that the training
dynamics of binary ANNs can be more effectively modelled using robust AGM-style
change operations – namely, lexicographic revision and moderate contraction –
that align with the Darwiche-Pearl framework for iterated belief change.
[COMMENTS]
This second version incorporates improvements based on feedback from
anonymous reviewers of a previous journal submission
[LINK]
http://arxiv.org/abs/2506.13157v2
[DATE]
2025-10-06 00:06:32+08:00
[CATEGORIES]
cs.LG
Logistic-Gated Operators Enable Auditable Unit-Aware Thresholds in Symbolic Regression
[AUTHORS]
Ou Deng, Ruichen Cong, Jianting Xu, Shoji Nishimura, Atsushi Ogihara, Qun Jin
[ABSTRACT]
Symbolic regression promises readable equations but struggles to encode
unit-aware thresholds and conditional logic. We propose logistic-gated
operators (LGO) – differentiable gates with learnable location and steepness
– embedded as typed primitives and mapped back to physical units for audit.
Across two primary health datasets (ICU, NHANES), the hard-gate variant
recovers clinically plausible cut-points: 71% (5/7) of assessed thresholds fall
within 10% of guideline anchors and 100% within 20%, while using far fewer
gates than the soft variant (ICU median 4.0 vs 10.0; NHANES 5.0 vs 12.5), and
remaining within the competitive accuracy envelope of strong SR baselines. On
predominantly smooth tasks, gates are pruned, preserving parsimony. The result
is compact symbolic equations with explicit, unit-aware thresholds that can be
audited against clinical anchors – turning interpretability from a post-hoc
explanation into a modeling constraint and equipping symbolic regression with a
practical calculus for regime switching and governance-ready deployment.
[LINK]
http://arxiv.org/abs/2510.05178v1
[DATE]
2025-10-06 00:04:47+08:00
[CATEGORIES]
cs.LG
[AUTHORS]
Negin Golrezaei, Sourav Sahoo
[ABSTRACT]
We study the bidding problem in repeated uniform price multi-unit auctions
from the perspective of a value-maximizing buyer. The buyer aims to maximize
their cumulative value over $T$ rounds while adhering to per-round
return-on-investment (RoI) constraints in a strategic (or adversarial)
environment. Using an $m$-uniform bidding format, the buyer submits $m$
bid-quantity pairs $(b_i, q_i)$ to demand $q_i$ units at bid $b_i$, with $m \ll
M$ in practice, where $M$ denotes the maximum demand of the buyer.
We introduce the notion of safe bidding strategies as those that satisfy the
RoI constraints irrespective of competing bids. Despite the stringent
requirement, we show that these strategies satisfy a mild no-overbidding
condition, depend only on the valuation curve of the bidder, and the bidder can
focus on a finite subset without loss of generality. Though the subset size is
$O(M^m)$, we design a polynomial-time learning algorithm that achieves
sublinear regret, both in full-information and bandit settings, relative to the
hindsight-optimal safe strategy.
We assess the robustness of safe strategies against the hindsight-optimal
strategy from a richer class. We define the richness ratio $\alpha \in (0,1]$
as the minimum ratio of the value of the optimal safe strategy to that of the
optimal strategy from a richer class and construct hard instances showing the
tightness of $\alpha$. Our algorithm achieves $\alpha$-approximate sublinear
regret against these stronger benchmarks. Simulations on semi-synthetic auction
data show that empirical richness ratios significantly outperform the
theoretical worst-case bounds. The proposed safe strategies and learning
algorithm extend naturally to more nuanced buyer and competitor models.
[COMMENTS]
84 pages, 5 figures. Appeared at ICML 2025. Fixed typos
[LINK]
http://arxiv.org/abs/2406.03674v4
[DATE]
2025-10-06 00:03:20+08:00
[CATEGORIES]
cs.LG
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
[AUTHORS]
Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, Yanghua Xiao
[ABSTRACT]
Recent advances in self-refinement have demonstrated significant potential
for improving the outputs of large language models (LLMs) through iterative
refinement. However, most existing self-refinement methods rely on a reactive
process with a fixed number of iterations, making it difficult to determine the
optimal timing and content of refinement based on the evolving generation
context. Inspired by the way humans dynamically refine their thoughts during
execution, we propose ProActive Self-Refinement (PASR), a novel method that
enables LLMs to refine their outputs during the generation process. Unlike
methods that regenerate entire responses, PASR proactively decides whether,
when, and how to refine based on the model’s internal state and evolving
context. We conduct extensive experiments on a diverse set of 10 tasks to
evaluate the effectiveness of PASR. Experimental results show that PASR
significantly enhances problem-solving performance. In particular, on Qwen3-8B,
PASR reduces average token consumption by 41.6% compared to standard
generation, while also achieving an 8.2% improvement in accuracy. Our code and
baselines used in the paper are available on GitHub.
[LINK]
http://arxiv.org/abs/2508.12903v2
[DATE]
2025-10-05 23:00:27+08:00
[CATEGORIES]
cs.CL
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
[AUTHORS]
Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu
[ABSTRACT]
Recent frontier models employ long chain-of-thought reasoning to explore
solution spaces in context and achieve stronger performance. While many works
study distillation to build smaller yet capable models, most focus on English
and little is known about language-specific reasoning. To bridge this gap, we
first introduce Language-Mixed CoT, a reasoning schema that switches
between English and a target language, using English as an anchor to excel in
reasoning while minimizing translation artifacts. As a Korean case study, we
curate Yi-Sang: 5.79M native-Korean prompts from web Q&A, exams, STEM, and
code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k
high-yield subset. We train nine models (4B-35B) across six families (Qwen2.5,
Llama-3.1, Gemma-3, etc.). Our best model, KO-REAson-35B, achieves
state-of-the-art performance, with the highest overall average score (64.0 ±
25), ranking first on 5/9 benchmarks and second on the remainder. Smaller and
mid-sized models also benefit substantially, with an average improvement of
+18.6 points across the nine evaluated benchmarks. Ablations show
Language-Mixed CoT is more effective than monolingual CoT, also resulting
in cross-lingual and multi-modal performance gains. We release our data-curation
pipeline, evaluation system, datasets, and models to advance research on
language-specific reasoning. Data and model collection:
https://huggingface.co/KOREAson.
[COMMENTS]
Work in Progress
[LINK]
http://arxiv.org/abs/2510.04230v1
[DATE]
2025-10-05 22:39:41+08:00
[CATEGORIES]
cs.CL
TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
[AUTHORS]
Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
[ABSTRACT]
Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend
the real world. However, existing works neglect the real-world challenges for
temporal reasoning: (1) intensive temporal information, (2) fast-changing event
dynamics, and (3) complex temporal dependencies in social interactions. To
bridge this gap, we propose a multi-level benchmark TIME, designed for temporal
reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3
levels with 11 fine-grained sub-tasks. This benchmark encompasses 3
sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News,
and TIME-Dial. We conduct extensive experiments on reasoning models and
non-reasoning models. We also conduct an in-depth analysis of temporal
reasoning performance across diverse real-world scenarios and tasks, and
summarize the impact of test-time scaling on temporal reasoning capabilities.
Additionally, we release TIME-Lite, a human-annotated subset to foster future
research and standardized evaluation in temporal reasoning. The code is
available at https://github.com/sylvain-wei/TIME , the dataset is available at
https://huggingface.co/datasets/SylvainWei/TIME , and the project page link is
https://sylvain-wei.github.io/TIME/ .
[COMMENTS]
Accepted by NeurIPS 2025 (Spotlight)
[LINK]
http://arxiv.org/abs/2505.12891v3
[DATE]
2025-10-05 21:52:34+08:00
[CATEGORIES]
cs.CL
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling
[AUTHORS]
Zhengyang Tang, Zihan Ye, Chenyu Huang, Xuhan Huang, Chengpeng Li, Sihang Li, Guanhua Chen, Ming Yan, Zizhuo Wang, Hongyuan Zha, Dayiheng Liu, Benyou Wang
[ABSTRACT]
Large Reasoning Models (LRMs) have demonstrated strong capabilities in
complex multi-step reasoning, opening new opportunities for automating
optimization modeling. However, existing domain adaptation methods, originally
designed for earlier instruction-tuned models, often fail to exploit the
advanced reasoning patterns of modern LRMs. In particular, we show that
direct fine-tuning on traditional non-reflective datasets leads to
limited gains. To fully leverage LRMs’ inherent reasoning abilities, we propose
CALM (Corrective Adaptation with Lightweight Modification), a
framework that progressively refines LRMs within their native reasoning modes
for optimization modeling tasks. In CALM, an expert intervener identifies
reasoning flaws and provides concise corrective hints, which the LRM
incorporates to produce improved reasoning trajectories. These interventions
modify fewer than 2.6% of generated tokens, but generate high-quality data for
soft adaptation through supervised fine-tuning. The adapted model is then
further improved through reinforcement learning. Building on CALM, we develop
STORM (Smart Thinking Optimization Reasoning Model), a
4B-parameter LRM that achieves a new state-of-the-art average accuracy of
68.9% across five popular optimization modeling benchmarks, matching the
performance of a 671B LRM. These results demonstrate that dynamic, hint-based
data synthesis both preserves and amplifies the native reasoning patterns of
modern LRMs, offering a more effective and scalable path towards expert-level
performance on challenging optimization modeling tasks.
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2510.04204v1
[DATE]
2025-10-05 21:38:31+08:00
[CATEGORIES]
cs.CL
cs.LG
Thinking on the Fly: Test-Time Reasoning Enhancement via Latent Thought Policy Optimization
[AUTHORS]
Wengao Ye, Yan Liang, Lianlei Shan
[ABSTRACT]
Recent advancements in Large Language Models (LLMs) have shifted from
explicit Chain-of-Thought (CoT) reasoning to more efficient latent reasoning,
where intermediate thoughts are represented as vectors rather than text.
However, latent reasoning can be brittle on challenging, out-of-distribution
tasks where robust reasoning is most critical. To overcome these limitations,
we introduce Latent Thought Policy Optimization (LTPO), a parameter-free
framework that enhances LLM reasoning entirely at test time, without requiring
model parameter updates. LTPO treats intermediate latent “thought” vectors as
dynamic parameters that are actively optimized for each problem instance. It
employs an online policy gradient method guided by an intrinsic,
confidence-based reward signal computed directly from the frozen LLM’s own
output distributions, eliminating the need for external supervision or
expensive text generation during optimization. Extensive experiments on five
reasoning benchmarks show that LTPO not only matches or surpasses strong
baselines on standard tasks but also demonstrates remarkable robustness where
others fail. Most notably, on highly challenging AIME benchmarks where existing
latent reasoning baselines collapse to near-zero accuracy, LTPO delivers
substantial improvements, showcasing a unique capability for complex reasoning.
[LINK]
http://arxiv.org/abs/2510.04182v1
[DATE]
2025-10-05 20:50:39+08:00
[CATEGORIES]
cs.CL
StressTest: Can YOUR Speech LM Handle the Stress?
[AUTHORS]
Iddo Yosha, Gallil Maimon, Yossi Adi
[ABSTRACT]
Sentence stress refers to emphasis on words within a spoken utterance to
highlight or contrast an idea. It is often used to imply an underlying
intention not explicitly stated. Recent speech-aware language models (SLMs)
have enabled direct audio processing, allowing models to access the full
richness of speech to perform audio reasoning tasks such as spoken question
answering. Despite the crucial role of sentence stress in shaping meaning and
intent, it remains largely overlooked in evaluation and development of SLMs. We
address this gap by introducing StressTest, a benchmark designed to evaluate
models’ ability to distinguish between meanings of speech based on the stress
pattern. We evaluate leading SLMs, and find that despite their overall
capabilities, they perform poorly on such tasks. Hence, we propose a novel data
generation pipeline, and create Stress-17k, a training set that simulates
change of meaning implied by stress variation. Results suggest that our
finetuned model, StresSLM, generalizes well to real recordings and notably
outperforms existing SLMs on sentence stress reasoning and detection. Models,
code, data, and samples: pages.cs.huji.ac.il/adiyoss-lab/stresstest.
[LINK]
http://arxiv.org/abs/2505.22765v2
[DATE]
2025-10-05 20:21:35+08:00
[CATEGORIES]
cs.CL
C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations
[AUTHORS]
Chengqian Ma, Wei Tao, Yiwen Guo
[ABSTRACT]
Spoken Dialogue Models (SDMs) have recently attracted significant attention
for their ability to generate voice responses directly to users’ spoken
queries. Despite their increasing popularity, there exists a gap in research
focused on comprehensively understanding their practical effectiveness in
comprehending and emulating human conversations. This is especially true
compared to text-based Large Language Models (LLMs), which benefit from
extensive benchmarking. Human voice interactions are inherently more complex
than text due to characteristics unique to spoken dialogue. Ambiguity poses one
challenge, stemming from semantic factors like polysemy, as well as
phonological aspects such as heterographs, heteronyms, and stress patterns.
Additionally, context-dependency, like omission, coreference, and multi-turn
interaction, adds further complexity to human conversational dynamics. To
illuminate the current state of SDM development and to address these
challenges, we present a benchmark dataset in this paper, which comprises 1,079
instances in English and Chinese. Accompanied by an LLM-based evaluation method
that closely aligns with human judgment, this dataset facilitates a
comprehensive exploration of the performance of SDMs in tackling these
practical challenges.
[COMMENTS]
EMNLP 2025 main; Project Page: https://step-out.github.io/C3-web/
[LINK]
http://arxiv.org/abs/2507.22968v3
[DATE]
2025-10-05 19:17:29+08:00
[CATEGORIES]
cs.CL
Instruction Boundary: Quantifying Biases in LLM Reasoning under Various Coverage
[AUTHORS]
Zipeng Ling, Yuehao Tang, Chen Huang, Shuliang Liu, Gaoyang Jiang, Shenghong Fu, Junqi Yang, Yao Wan, Jiawan Zhang, Kejia Huang, Xuming Hu
[ABSTRACT]
Nowadays, automatically generated datasets are increasingly used in LLM
reasoning tasks; however, large-scale corpora often contain inherent flaws. For
example, a single-choice question may include none or multiple correct options,
while true-or-false questions may involve vague or unverifiable statements. We
refer to these exceptional answer forms as sparse labels. To compare LLMs’
ability to recognize various question forms and produce correct answers, we
investigate how different instruction formats can either facilitate or mislead
LLM reasoning ability. We introduce the concept of Instruction Boundary, which
systematically analyzes how different levels of prompt coverage – sufficient,
redundant, or insufficient – can lead to reasoning biases and performance
changes in LLMs. To examine this phenomenon, we design eight experimental
settings across five dataset forms. We further propose BiasDetector, a unified
framework that quantifies LLMs’ ability to identify sparse labels under
different kinds of Instruction Boundary conditions. Evaluations on five
mainstream LLMs show that, despite their seemingly high accuracy, substantial
reasoning biases persist in many downstream tasks as a direct consequence of
prompt coverage. We analyze the impact of these biases and outline possible
mitigation strategies. Our findings highlight not only the importance of
addressing sparse labels, but also the need for developers to recognize and
mitigate the risks introduced by Instruction Boundary.
[LINK]
http://arxiv.org/abs/2509.20278v2
[DATE]
2025-10-05 19:12:05+08:00
[CATEGORIES]
cs.CL
Self Speculative Decoding for Diffusion Large Language Models
[AUTHORS]
Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang
[ABSTRACT]
Diffusion-based Large Language Models (dLLMs) have emerged as a competitive
alternative to autoregressive models, offering unique advantages through
bidirectional attention and parallel generation paradigms. However, the
generation results of current parallel decoding methods deviate from stepwise
decoding, introducing potential performance degradation, which limits their
practical deployment. To address this problem, we propose Self
Speculative Decoding (SSD), a lossless inference acceleration
method that leverages the dLLM itself as both speculative decoding drafter and
verifier without auxiliary modules. SSD introduces a self-drafting mechanism
where the model generates predictions for multiple positions, then verifies
them through hierarchical verification trees in a single forward pass. Unlike
traditional speculative decoding that requires separate draft models, SSD
eliminates model redundancy and memory overhead by exploiting the dLLM’s
inherent parallel prediction capability for multiple positions. This
self-speculative approach allows the model to progressively verify and accept
multiple tokens in a single forward pass. Our experiments demonstrate that SSD
achieves up to 3.46× speedup while keeping the output identical to
stepwise decoding on open source models such as LLaDA and Dream. Code will be
made publicly available on GitHub.
[LINK]
http://arxiv.org/abs/2510.04147v1
[DATE]
2025-10-05 18:52:28+08:00
[CATEGORIES]
cs.CL
Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models
[AUTHORS]
Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
[ABSTRACT]
Large Language Models (LLMs) have achieved state-of-the-art performance on a
broad range of Natural Language Processing (NLP) tasks, including document
processing and coding. Autoregressive Language Models (ARMs), which generate
tokens sequentially conditioned on all previous tokens, have been the
predominant paradigm for LLMs. However, while these networks have achieved high
accuracy across a range of downstream tasks, they exhibit low arithmetic
intensity due to the inherent sequential dependency with next-token prediction.
Recently, Diffusion Language Models (DLMs) have emerged as a promising
alternative architecture. DLMs generate output text in parallel, breaking the
limitations of sequential dependency. However, the performance implications of
DLMs relative to commonly deployed ARMs are not fully understood. In this work,
we present a comprehensive performance study analyzing the performance
characteristics of ARMs and DLMs, using both theoretical analysis and profiling
data to characterize the trade-offs between these approaches. We illustrate
that although DLMs exhibit higher arithmetic intensity compared to ARMs because
of their capability to utilize parallelism across sequence lengths, they fail
to scale effectively to longer contexts. We then explore DLMs with block-wise
decoding, outlining how this approach allows for increased arithmetic
intensity, while still scaling well to long contexts (similar to ARMs). We also
show interesting trade-offs for batched inference, where we find that ARMs
exhibit superior throughput, as they benefit more from parallelism across
sequences in the batch. Finally, we highlight opportunities for accelerating
DLM inference, and, in particular, highlight the importance of reducing the
number of sampling steps for allowing open-source DLMs to provide improved
latency relative to ARMs.
[COMMENTS]
11 pages, 5 figures
[LINK]
http://arxiv.org/abs/2510.04146v1
[DATE]
2025-10-05 18:50:52+08:00
[CATEGORIES]
cs.LG
cs.CL
Automating construction safety inspections using a multi-modal vision-language RAG framework
[AUTHORS]
Chenxin Wang, Elyas Asadi Shamsabadi, Zhaohui Chen, Luming Shen, Alireza Ahmadian Fard Fini, Daniel Dias-da-Costa
[ABSTRACT]
Conventional construction safety inspection methods are often inefficient as
they require navigating through large volumes of information. Recent advances in
large vision-language models (LVLMs) provide opportunities to automate safety
inspections through enhanced visual and linguistic understanding. However,
existing applications face limitations including irrelevant or unspecific
responses, restricted modal inputs and hallucinations. Utilisation of Large
Language Models (LLMs) for this purpose is constrained by the availability of
training data, and such models frequently lack real-time adaptability. This study introduces
SiteShield, a multi-modal LVLM-based Retrieval-Augmented Generation (RAG)
framework for automating construction safety inspection reports by integrating
visual and audio inputs. Using real-world data, SiteShield outperformed
unimodal LLMs without RAG with an F1 score of 0.82, hamming loss of 0.04,
precision of 0.76, and recall of 0.96. The findings indicate that SiteShield
offers a novel pathway to enhance information retrieval and efficiency in
generating safety reports.
[COMMENTS]
33 pages, 11 figures, 7 tables
[LINK]
http://arxiv.org/abs/2510.04145v1
[DATE]
2025-10-05 18:48:54+08:00
[CATEGORIES]
cs.CL
HP-BERT: A framework for longitudinal study of Hinduphobia on social media via language models
[AUTHORS]
Ashutosh Singh, Rohitash Chandra
[ABSTRACT]
During the COVID-19 pandemic, community tensions intensified, contributing to
discriminatory sentiments against various religious groups, including Hindu
communities. Recent advances in language models have shown promise for social
media analysis with potential for longitudinal studies of social media
platforms, such as X (Twitter). We present a computational framework for
analyzing anti-Hindu sentiment (Hinduphobia) during the COVID-19 period,
introducing an abuse detection and sentiment analysis approach for longitudinal
analysis on X. We curate and release a “Hinduphobic COVID-19 XDataset”
containing 8,000 annotated and manually verified tweets. We then develop the
Hinduphobic BERT (HP-BERT) model using this dataset and achieve 94.72%
accuracy, outperforming baseline Transformer-based language models. The model
incorporates multi-label sentiment analysis capabilities through additional
fine-tuning. Our analysis encompasses approximately 27.4 million tweets from
six countries, including Australia, Brazil, India, Indonesia, Japan, and the
United Kingdom. Statistical analysis reveals moderate correlations (r =
0.312-0.428) between COVID-19 case increases and Hinduphobic content volume,
highlighting how pandemic-related stress may contribute to discriminatory
discourse. This study provides evidence of social media-based religious
discrimination during the COVID-19 crisis.
[LINK]
http://arxiv.org/abs/2501.05482v2
[DATE]
2025-10-05 18:40:38+08:00
[CATEGORIES]
cs.CL
LegalSearchLM: Rethinking Legal Case Retrieval as Legal Elements Generation
[AUTHORS]
Chaeeun Kim, Jinu Lee, Wonseok Hwang
[COMMENTS]
EMNLP 2025 Main Conference
[LINK]
http://arxiv.org/abs/2505.23832v3
[DATE]
2025-10-05 18:33:33+08:00
[CATEGORIES]
cs.CL
Internal states before wait modulate reasoning patterns
[AUTHORS]
Dmitrii Troitskii, Koyena Pal, Chris Wendler, Callum Stuart McDougall, Neel Nanda
[ABSTRACT]
Prior work has shown that a significant driver of performance in reasoning
models is their ability to reason and self-correct. A distinctive marker in
these reasoning traces is the token wait, which often signals reasoning
behavior such as backtracking. Despite being such a complex behavior, little is
understood of exactly why models do or do not decide to reason in this
particular manner, which limits our understanding of what makes a reasoning
model so effective. In this work, we address the question of whether a model’s
latents preceding wait tokens contain relevant information for modulating the
subsequent reasoning process. We train crosscoders at multiple layers of
DeepSeek-R1-Distill-Llama-8B and its base version, and introduce a latent
attribution technique in the crosscoder setting. We locate a small set of
features relevant for promoting/suppressing wait tokens’ probabilities.
Finally, through a targeted series of experiments analyzing max activating
examples and causal interventions, we show that many of our identified features
indeed are relevant for the reasoning process and give rise to different types
of reasoning patterns such as restarting from the beginning, recalling prior
knowledge, expressing uncertainty, and double-checking.
[COMMENTS]
Accepted to EMNLP Findings 2025
[LINK]
http://arxiv.org/abs/2510.04128v1
[DATE]
2025-10-05 18:03:42+08:00
[CATEGORIES]
cs.CL
WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking
[AUTHORS]
Zipeng Ling, Yuehao Tang, Shuliang Liu, Junqi Yang, Shenghong Fu, Chen Huang, Kejia Huang, Yao Wan, Zhichao Hou, Xuming Hu
[ABSTRACT]
Large Language Models (LLMs) frequently output the label Unknown in reasoning
tasks, where two scenarios may appear: (i) an input sample is genuinely
unverifiable, but the model cannot understand why; and (ii) the problem is
verifiable, but the model fails to solve it and thus outputs Unknown. We refer to these
cases collectively as the Vague Perception phenomenon. Current evaluations
focus on whether such answers are honest, rather than analyzing the limits of
LLM reasoning.
To address this, we introduce WakenLLM, a framework that quantifies the
portion of Unknown output attributable to model incapacity and evaluates
whether stimulation can convert them into either correct answers (verifiable)
or justified (unverifiable) responses with valid reasoning. Our method offers a
clearer picture of the limits of LLM reasoning and the potential for
corrections across various datasets. Comprehensive experiments on six LLMs
suggest that, without any training or parameter revision, LLMs can achieve up
to a 68.53% accuracy improvement on Vague Perception samples through guided
understanding.
Our work reveals that current baseline methods only activate a small portion
of LLMs’ reasoning potential, indicating considerable unexplored capacity. This
extends the theoretical upper bounds of reasoning accuracy in LLMs.
Consequently, this study deepens our understanding of the latent reasoning
capacity of LLMs and offers a new perspective on addressing the Vague
Perception phenomenon.
[LINK]
http://arxiv.org/abs/2507.16199v4
[DATE]
2025-10-05 18:02:35+08:00
[CATEGORIES]
cs.CL
Unveiling LLMs’ Metaphorical Understanding: Exploring Conceptual Irrelevance, Context Leveraging and Syntactic Influence
[AUTHORS]
Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong
[ABSTRACT]
Metaphor analysis is a complex linguistic phenomenon shaped by context and
external factors. While Large Language Models (LLMs) demonstrate advanced
capabilities in knowledge integration, contextual reasoning, and creative
generation, their mechanisms for metaphor comprehension remain insufficiently
explored. This study examines LLMs’ metaphor-processing abilities from three
perspectives: (1) Concept Mapping: using embedding space projections to
evaluate how LLMs map concepts in target domains (e.g., misinterpreting “fall
in love” as “drop down from love”); (2) Metaphor-Literal Repository: analyzing
metaphorical words and their literal counterparts to identify inherent
metaphorical knowledge; and (3) Syntactic Sensitivity: assessing how
metaphorical syntactic structures influence LLMs’ performance. Our findings
reveal that LLMs generate 15%-25% conceptually irrelevant interpretations,
depend on metaphorical indicators in training data rather than contextual cues,
and are more sensitive to syntactic irregularities than to structural
comprehension. These insights underline the limitations of LLMs in metaphor
analysis and call for more robust computational approaches.
[LINK]
http://arxiv.org/abs/2510.04120v1
[DATE]
2025-10-05 17:45:51+08:00
[CATEGORIES]
cs.CL
ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
[AUTHORS]
Xin Liu, Xudong Wang, Pei Liu, Guoming Tang
[ABSTRACT]
The linear growth of key-value (KV) cache memory and the quadratic computational
complexity of attention mechanisms pose significant bottlenecks for large
language models (LLMs) in long-context processing. While existing KV cache
optimization methods address these challenges through token pruning or feature
merging, they often incur irreversible information loss or require costly
parameter retraining. To this end, we propose ZSMerge, a dynamic KV cache
compression framework designed for efficient cache management, featuring three
key operations: (1) fine-grained memory allocation guided by multi-dimensional
token importance metrics at head-level granularity, (2) a residual merging
mechanism that preserves critical context through compensated attention
scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM
architectures without requiring retraining. ZSMerge significantly enhances
memory efficiency and inference speed with negligible performance degradation
across LLMs. When applied to LLaMA2-7B, it demonstrates a 20:1 compression
ratio for key-value cache retention (reducing memory footprint to 5% of
baseline) while sustaining comparable generation quality, coupled with triple
throughput gains at extreme 54k-token contexts that eliminate out-of-memory
failures. The code is available at https://github.com/SusCom-Lab/ZSMerge.
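
As a rough illustration of the memory figures quoted above, the sketch below estimates the size of an uncompressed KV cache and the effect of a 20:1 compression ratio. The model dimensions (32 layers, 32 KV heads, head dimension 128, 16-bit values) are assumptions matching the commonly published LLaMA2-7B configuration, and "54k tokens" is taken to mean 54,000. None of these values come from the abstract, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    Defaults are assumed LLaMA2-7B-style dimensions with fp16 storage.
    """
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    # Cache size grows linearly with the number of cached tokens.
    return per_token * num_tokens


full = kv_cache_bytes(54_000)   # uncompressed cache at the assumed context length
compressed = full / 20          # 20:1 compression ratio from the abstract

print(f"full cache:      {full / 2**30:.2f} GiB")
print(f"20:1 compressed: {compressed / 2**30:.2f} GiB "
      f"({compressed / full:.0%} of baseline)")
```

The estimate covers only the cache itself for a single sequence; model weights, activations, and batch size are ignored.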
[LINK]
http://arxiv.org/abs/2503.10714v3
[DATE]
2025-10-05 16:34:30+08:00
[CATEGORIES]
cs.CL
Human Empathy as Encoder: AI-Assisted Depression Assessment in Special Education
[AUTHORS]
Boning Zhao, Xinnuo Li, Yutong Hu
[ABSTRACT]
Assessing student depression in sensitive environments like special education
is challenging. Standardized questionnaires may not fully reflect students’
true situations. Furthermore, automated methods often falter with rich student
narratives, lacking the crucial, individualized insights stemming from
teachers’ empathetic connections with students. Existing methods often fail to
address this ambiguity or effectively integrate educator understanding. To
address these limitations by fostering a synergistic human-AI collaboration,
this paper introduces Human Empathy as Encoder (HEAE), a novel, human-centered
AI framework for transparent and socially responsible depression severity
assessment. Our approach uniquely integrates student narrative text with a
teacher-derived, 9-dimensional “Empathy Vector” (EV), its dimensions guided by
the PHQ-9 framework, to explicitly translate tacit empathetic insight into a
structured AI input enhancing rather than replacing human judgment. Rigorous
experiments optimized the multimodal fusion, text representation, and
classification architecture, achieving 82.74% accuracy for 7-level severity
classification. This work demonstrates a path toward more responsible and
ethical affective computing by structurally embedding human empathy.
[COMMENTS]
7 pages, 6 figures, ACII 2025
[LINK]
http://arxiv.org/abs/2505.23631v3
[DATE]
2025-10-05 16:26:26+08:00
[CATEGORIES]
cs.CL
Scaling Code-Assisted Chain-of-Thoughts and Instructions for Model Reasoning
[AUTHORS]
Honglin Lin, Qizhi Pei, Xin Gao, Zhuoshi Pan, Yu Li, Juntao Li, Conghui He, Lijun Wu
[ABSTRACT]
Reasoning capability is pivotal for Large Language Models (LLMs) to solve
complex tasks, yet achieving reliable and scalable reasoning remains
challenging. While Chain-of-Thought (CoT) prompting has become a mainstream
approach, existing methods often suffer from uncontrolled generation,
insufficient quality, and limited diversity in reasoning paths. Recent efforts
leverage code to enhance CoT by grounding reasoning in executable steps, but
such methods are typically constrained to predefined mathematical problems,
hindering scalability and generalizability. In this work, we propose Caco
(Code-Assisted Chain-of-ThOught), a novel framework that automates the
synthesis of high-quality, verifiable, and diverse instruction-CoT reasoning
data through code-driven augmentation. Unlike prior work, Caco first fine-tunes
a code-based CoT generator on existing math and programming solutions in a
unified code format, then scales the data generation to a large amount of
diverse reasoning traces. Crucially, we introduce automated validation via code
execution and rule-based filtering to ensure logical correctness and structural
diversity, followed by reverse-engineering filtered outputs into natural
language instructions and language CoTs to enrich task adaptability. This
closed-loop process enables fully automated, scalable synthesis of reasoning
data with guaranteed executability. Experiments on our created Caco-1.3M
dataset demonstrate that Caco-trained models achieve strong competitive
performance on mathematical reasoning benchmarks, outperforming existing strong
baselines. Further analysis reveals that Caco’s code-anchored verification and
instruction diversity contribute to superior generalization across unseen
tasks. Our work establishes a paradigm for building self-sustaining,
trustworthy reasoning systems without human intervention.
[COMMENTS]
Accepted by NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.04081v1
[DATE]
2025-10-05 15:59:24+08:00
[CATEGORIES]
cs.CL
PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
[AUTHORS]
Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun, Chunping Li
[ABSTRACT]
Conditional Semantic Textual Similarity (C-STS) measures the semantic
proximity between text segments under a specific condition, thereby overcoming
the ambiguity inherent in traditional STS. However, existing methods are
largely confined to discriminative models, failing to fully integrate recent
breakthroughs in the NLP community concerning Large Language Models (LLMs) and
Reinforcement Learning (RL). RL is a particularly well-suited paradigm for this
task, as it can directly optimize the non-differentiable Spearman ranking
metric and guide the reasoning process required by C-STS. However, we find that
naively applying listwise RL fails to produce meaningful improvements, as the
model is overwhelmed by complex, coarse-grained reward signals. To address this
challenge, we introduce PoLi-RL, a novel Point-to-List Reinforcement Learning
framework. PoLi-RL employs a two-stage curriculum: it first trains the model
with simple pointwise rewards to establish fundamental scoring capabilities,
then transitions to a hybrid reward that combines pointwise, pairwise, and
listwise objectives to refine the model’s ability to discern subtle semantic
distinctions. Crucially, we propose an innovative Parallel Slice Ranking Reward
(PSRR) mechanism that computes ranking rewards in parallel slices, where each
slice comprises same-indexed completions from different samples. This provides
a precise, differentiated learning signal for each individual completion,
enabling granular credit assignment and effective optimization. On the official
C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18,
establishing a new SOTA for the cross-encoder architecture. As the first work
to successfully apply RL to C-STS, our study introduces a powerful and precise
paradigm for training LLMs on complex, ranking-based conditional judgment
tasks.
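
The Spearman metric mentioned above can be sketched in a few lines. This is a generic, standard-library implementation of Spearman rank correlation (Pearson correlation of average ranks), not the paper's reward code. The helper names are hypothetical, and the reported score of 48.18 is presumably the coefficient multiplied by 100.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))


# Hypothetical similarity scores for five sentence pairs.
print(spearman([1.0, 2.5, 2.0, 4.0, 5.0], [1, 2, 3, 4, 5]))
```

Because the ranking step is piecewise constant in the predicted scores, the metric has no useful gradient, which is why the abstract describes it as non-differentiable and suited to reward-based optimization.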
[LINK]
http://arxiv.org/abs/2510.04080v1
[DATE]
2025-10-05 15:57:26+08:00
[CATEGORIES]
cs.CL
Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Deliberation
[AUTHORS]
Haoran Zhang, Yafu Li, Xuyang Hu, Dongrui Liu, Zhilin Wang, Bo Li, Yu Cheng
[ABSTRACT]
Large language models (LLMs) are increasingly applied in diverse real-world
scenarios, each governed by bespoke behavioral and safety specifications (spec)
custom-tailored by users or organizations. These spec, categorized into
safety-spec and behavioral-spec, vary across scenarios and evolve with changing
preferences and requirements. We formalize this challenge as specification
alignment, focusing on LLMs’ ability to follow dynamic, scenario-specific spec
from both behavioral and safety perspectives. To address this challenge, we
propose Align3, a lightweight method that employs Test-Time Deliberation (TTD)
with hierarchical reflection and revision to reason over the specification
boundaries. We further present SpecBench, a unified benchmark for measuring
specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts.
Experiments on 15 reasoning and 18 instruct models with several TTD methods,
including Self-Refine, TPO, and MoreThink, yield three key findings: (i)
test-time deliberation enhances specification alignment; (ii) Align3 advances
the safety-helpfulness trade-off frontier with minimal overhead; (iii)
SpecBench effectively reveals alignment gaps. These results highlight the
potential of test-time deliberation as an effective strategy for reasoning over
the real-world specification boundaries.
[COMMENTS]
10 pages main text, 52 pages total (including appendix). Code and
resources are available at https://github.com/zzzhr97/SpecBench
[LINK]
http://arxiv.org/abs/2509.14760v2
[DATE]
2025-10-05 15:56:54+08:00
[CATEGORIES]
cs.CL
RAG over Tables: Hierarchical Memory Index, Multi-Stage Retrieval, and Benchmarking
[AUTHORS]
Jiaru Zou, Dongqi Fu, Sirui Chen, Xinrui He, Zihao Li, Yada Zhu, Jiawei Han, Jingrui He
[ABSTRACT]
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by
integrating them with an external knowledge base to improve the answer
relevance and accuracy. In real-world scenarios, beyond pure text, a
substantial amount of knowledge is stored in tables, and user questions often
require retrieving answers that are distributed across multiple tables.
Retrieving knowledge from table corpora (i.e., various individual tables) for
a question remains nascent, at least, for (i) how to understand intra- and
inter-table knowledge effectively, (ii) how to filter unnecessary tables and
how to retrieve the most relevant tables efficiently, (iii) how to prompt LLMs
to infer over the retrieval, (iv) how to evaluate the corresponding performance
in a realistic setting. Facing the above challenges, in this paper, we first
propose a table-corpora-aware RAG framework, named T-RAG, which consists of the
hierarchical memory index, multi-stage retrieval, and graph-aware prompting for
effective and efficient table knowledge retrieval and inference. Further, we
first develop a multi-table question answering benchmark named MultiTableQA,
which spans 3 different task types, 57,193 tables, and 23,758 questions in
total, with all sources drawn from real-world scenarios. Based on
MultiTableQA, we conduct a holistic comparison of table retrieval methods, RAG
methods, and table-to-graph representation learning methods, where T-RAG shows
the leading accuracy, recall, and running time performance. Also, under T-RAG,
we evaluate the inference ability upgrade of different LLMs. Code and Data are
available at https://github.com/jiaruzouu/T-RAG
[COMMENTS]
Project Link: https://github.com/jiaruzouu/T-RAG
[LINK]
http://arxiv.org/abs/2504.01346v4
[DATE]
2025-10-05 15:24:41+08:00
[CATEGORIES]
cs.CL
cs.LG
What Makes Diffusion Language Models Super Data Learners?
[AUTHORS]
Zitian Gao, Haoming Luo, Lynx Chen, Jason Klein Liu, Ran Tao, Joey Zhou, Bryan Dai
[ABSTRACT]
Recent studies have shown that diffusion language models achieve remarkable
data efficiency under limited-data constraints, yet the underlying mechanisms
remain unclear. In this work, we perform extensive ablation experiments to
disentangle the sources of this efficiency. Our results show that random
masking of input tokens plays the dominant role. We further show that similar
gains can be obtained through MLP dropout and weight decay, indicating that
stochastic regularization broadly enhances data efficiency in multi-epoch
training. Our code is available at
https://github.com/zitian-gao/data-efficiency.
[COMMENTS]
Technical report, work in progress
[LINK]
http://arxiv.org/abs/2510.04071v1
[DATE]
2025-10-05 15:22:44+08:00
[CATEGORIES]
cs.CL
What Scales in Cross-Entropy Scaling Law?
[AUTHORS]
Junxi Yan, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu
[ABSTRACT]
The cross-entropy scaling law has long served as a key tool for guiding the
development of large language models. It shows that cross-entropy loss
decreases at a predictable power-law rate as the model size increases. However,
recent evidence indicates that this law breaks down at very large scales: the
loss decreases more slowly than expected, which causes significant trouble for
developing large language models. In this paper, we hypothesize that the root
cause lies in the fact that cross-entropy itself does not truly scale; instead,
only one of its hidden components does. To investigate this, we introduce a
novel decomposition of cross-entropy into three parts: Error-Entropy,
Self-Alignment, and Confidence. We show both theoretically and empirically that
this decomposition precisely captures the training dynamics and optimization
objectives. Through extensive experiments on multiple datasets and 32 models
spanning five orders of magnitude in size, we find that only error-entropy
follows a robust power-law scaling, while the other two terms remain largely
invariant. Moreover, error-entropy constitutes the dominant share of
cross-entropy in small models but diminishes in proportion as models grow
larger. This explains why the cross-entropy scaling law appears accurate at
small scales but fails at very large ones. Our findings establish the
error-entropy scaling law as a more accurate description of model behavior. We
believe it will have wide applications in the training, understanding, and
future development of large language models.
[LINK]
http://arxiv.org/abs/2510.04067v1
[DATE]
2025-10-05 15:06:02+08:00
[CATEGORIES]
cs.LG
cs.CL
How Can I Publish My LLM Benchmark Without Giving the True Answers Away?
[AUTHORS]
Takashi Ishida, Thanawat Lodkaew, Ikko Yamane
[ABSTRACT]
Publishing a large language model (LLM) benchmark on the Internet risks
contaminating future LLMs: the benchmark may be unintentionally (or
intentionally) used to train or select a model. A common mitigation is to keep
the benchmark private and let participants submit their models or predictions
to the organizers. However, this strategy will require trust in a single
organization and still permits test-set overfitting through repeated queries.
To overcome this issue, we propose a way to publish benchmarks without
completely disclosing the ground-truth answers to the questions, while still
maintaining the ability to openly evaluate LLMs. The main underlying idea is to
reduce the best possible accuracy, i.e., Bayes accuracy, by injecting
randomness into the answers: we prepare several logically correct answers and
include only one of them as the solution in the benchmark. Not only is this
helpful to keep us from disclosing the ground truth, but this also offers a
test for detecting data contamination. In principle, even fully capable models
should not surpass the Bayes accuracy. If a model surpasses this ceiling
despite this expectation, this is a strong signal of data contamination. We
present experimental evidence that our method can detect data contamination
accurately on a wide range of benchmarks, models, and training methodologies.
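
The contamination test described above can be illustrated with a small sketch. It assumes a simplified setting that the abstract does not specify: every question has the same number of logically correct answers, one of which is chosen uniformly at random as the published solution, so that even a perfect model can expect only a 1/k hit rate. The function names and the example counts are hypothetical.

```python
from math import comb


def bayes_accuracy(num_correct_answers):
    """Best expected accuracy when, for each question, one of its k valid
    answers is picked uniformly at random as the published solution."""
    return sum(1.0 / k for k in num_correct_answers) / len(num_correct_answers)


def binomial_tail(n, successes, p):
    """P[X >= successes] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(successes, n + 1))


# Hypothetical benchmark: 200 questions, each with 2 equally valid answers.
ks = [2] * 200
ceiling = bayes_accuracy(ks)

# Hypothetical model that matches the published answer on 130 of 200 questions.
observed_hits = 130
p_value = binomial_tail(len(ks), observed_hits, ceiling)

print(f"Bayes accuracy ceiling: {ceiling:.2f}")
print(f"observed accuracy:      {observed_hits / len(ks):.2f}")
print(f"one-sided p-value:      {p_value:.2e}")
```

A small p-value means the model matches the randomly selected answers more often than an uncontaminated model plausibly could. With differing numbers of valid answers per question, the hit count is no longer binomial, so this tail computation would need to be replaced.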
[COMMENTS]
Extended version of the paper presented as an Oral at the ICML 2025
Workshop on the Impact of Memorization on Trustworthy Foundation Models
[LINK]
http://arxiv.org/abs/2505.18102v6
[DATE]
2025-10-05 14:45:34+08:00
[CATEGORIES]
cs.LG
cs.CL
Don’t Pay Attention, PLANT It: Pretraining Attention via Learning-to-Rank
[AUTHORS]
Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam
[ABSTRACT]
State-of-the-art Extreme Multi-Label Text Classification models rely on
multi-label attention to focus on key tokens in input text, but learning good
attention weights is challenging. We introduce PLANT - Pretrained and Leveraged
Attention - a plug-and-play strategy for initializing attention. PLANT works by
planting label-specific attention using a pretrained Learning-to-Rank model
guided by mutual information gain. This architecture-agnostic approach
integrates seamlessly with large language model backbones such as Mistral-7B,
LLaMA3-8B, DeepSeek-V3, and Phi-3. PLANT outperforms state-of-the-art methods
across tasks including ICD coding, legal topic classification, and content
recommendation. Gains are especially pronounced in few-shot settings, with
substantial improvements on rare labels. Ablation studies confirm that
attention initialization is a key driver of these gains. For code and trained
models, see https://github.com/debjyotiSRoy/xcube/tree/plant
[LINK]
http://arxiv.org/abs/2410.23066v2
[DATE]
2025-10-05 13:55:06+08:00
[CATEGORIES]
cs.CL
cs.LG
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
[AUTHORS]
Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Chaofan Tao, Yangfan He, Mi Zhang, Shen Yan
[COMMENTS]
NeurIPS 2025
[LINK]
http://arxiv.org/abs/2506.01713v3
[DATE]
2025-10-05 13:53:45+08:00
[CATEGORIES]
cs.CL
Forecasting Conversation Derailments Through Generation
[AUTHORS]
Yunfan Zhang, Kathleen McKeown, Smaranda Muresan
[COMMENTS]
ACL INLG 2025
[LINK]
http://arxiv.org/abs/2504.08905v2
[DATE]
2025-10-05 13:44:30+08:00
[CATEGORIES]
cs.CL
Exploring Chain-of-Thought Reasoning for Steerable Pluralistic Alignment
[AUTHORS]
Yunfan Zhang, Kathleen McKeown, Smaranda Muresan
[ABSTRACT]
Large Language Models (LLMs) are typically trained to reflect a relatively
uniform set of values, which limits their applicability to tasks that require
understanding of nuanced human perspectives. Recent research has underscored
the importance of enabling LLMs to support steerable pluralism – the capacity
to adopt a specific perspective and align generated outputs with it. In this
work, we investigate whether Chain-of-Thought (CoT) reasoning techniques can be
applied to building steerable pluralistic models. We explore several methods,
including CoT prompting, fine-tuning on human-authored CoT, fine-tuning on
synthetic explanations, and Reinforcement Learning with Verifiable Rewards
(RLVR). We evaluate these approaches using the Value Kaleidoscope and OpinionQA
datasets. Among the methods studied, RLVR consistently outperforms others and
demonstrates strong training sample efficiency. We further analyze the
generated CoT traces with respect to faithfulness and safety.
[COMMENTS]
ACL EMNLP 2025
[LINK]
http://arxiv.org/abs/2510.04045v1
[DATE]
2025-10-05 13:39:50+08:00
[CATEGORIES]
cs.CL
cs.LG
Unlocking Multimodal Mathematical Reasoning via Process Reward Model
[AUTHORS]
Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Lei Wang, Ruihang Chu, Jin Zeng, Yujiu Yang
[ABSTRACT]
Process Reward Models (PRMs) have shown promise in enhancing the mathematical
reasoning capabilities of Large Language Models (LLMs) through Test-Time
Scaling (TTS). However, their integration into multimodal reasoning remains
largely unexplored. In this work, we take the first step toward unlocking the
potential of PRMs in multimodal mathematical reasoning. We identify three key
challenges: (1) the scarcity of high-quality reasoning data constrains the
capabilities of foundation Multimodal Large Language Models (MLLMs), which
imposes further limitations on the upper bounds of TTS and reinforcement
learning (RL); (2) a lack of automated methods for process labeling within
multimodal contexts persists; (3) the employment of process rewards in unimodal
RL faces issues like reward hacking, which may extend to multimodal scenarios.
To address these issues, we introduce URSA, a three-stage Unfolding multimodal
Process-Supervision Aided training framework. We first construct MMathCoT-1M, a
high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset,
to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we
go through an automatic process to synthesize process supervision data, which
emphasizes both logical correctness and perceptual consistency. We introduce
DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose
Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a
multimodal PRM-aided online RL method that outperforms vanilla GRPO. With
PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4%
and 2.7% on average across 6 benchmarks. Code, data and checkpoint can be found
at https://github.com/URSA-MATH.
[COMMENTS]
NeurIPS 2025 Main Track
[LINK]
http://arxiv.org/abs/2501.04686v6
[DATE]
2025-10-05 13:09:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Scaling Laws of Synthetic Data for Language Models
[AUTHORS]
Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei
[ABSTRACT]
Large language models (LLMs) achieve strong performance across diverse tasks,
largely driven by high-quality web data used in pre-training. However, recent
studies indicate this data source is rapidly depleting. Synthetic data emerges
as a promising alternative, but it remains unclear whether synthetic datasets
exhibit predictable scalability comparable to raw pre-training data. In this
work, we systematically investigate the scaling laws of synthetic data by
introducing SynthLLM, a scalable framework that transforms pre-training corpora
into diverse, high-quality synthetic datasets. Our approach achieves this by
automatically extracting and recombining high-level concepts across multiple
documents using a graph algorithm. Key findings from our extensive mathematical
experiments on SynthLLM include: (1) SynthLLM generates synthetic data that
reliably adheres to the rectified scaling law across various model sizes; (2)
Performance improvements plateau near 300B tokens; and (3) Larger models
approach optimal performance with fewer training tokens. For instance, an 8B
model peaks at 1T tokens, while a 3B model requires 4T. Moreover, comparisons
with existing synthetic data generation and augmentation methods demonstrate
that SynthLLM achieves superior performance and scalability. Our findings
highlight synthetic data as a scalable and reliable alternative to organic
pre-training corpora, offering a viable path toward continued improvement in
model performance.
[COMMENTS]
COLM 2025
[LINK]
http://arxiv.org/abs/2503.19551v3
[DATE]
2025-10-05 12:36:00+08:00
[CATEGORIES]
cs.CL
Latent Visual Reasoning
[AUTHORS]
Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu
[ABSTRACT]
Multimodal Large Language Models (MLLMs) have achieved notable gains in
various tasks by incorporating Chain-of-Thought (CoT) reasoning in language
spaces. Recent work extends this direction by leveraging external tools for
visual editing, thereby enhancing the visual signal along the reasoning
trajectories. Nevertheless, these approaches remain fundamentally constrained:
reasoning is still confined to the language space, with visual information
treated as static preconditions. We introduce Latent Visual Reasoning (LVR), a
new paradigm that enables autoregressive reasoning directly in the visual
embedding space. A visual encoder first projects images into visual tokens
within a joint semantic space shared with the language model. The language
model is then trained to generate latent states that reconstruct key visual
tokens critical for answering the query, constituting the process of latent
visual reasoning. By interleaving LVR with standard text generation, our model
achieves substantial gains on perception-intensive visual question answering
tasks. In addition, we adapt the GRPO algorithm to conduct reinforcement
learning on latent reasoning, further balancing LVR and textual generation. We
show that LVR substantially improves fine-grained visual understanding and
perception, achieving 71.67% on MMVP compared to 66.67% with Qwen2.5-VL. Code
base and model weights will be released later.
[LINK]
http://arxiv.org/abs/2509.24251v2
[DATE]
2025-10-05 12:01:18+08:00
[CATEGORIES]
cs.CL
Principled and Tractable RL for Reasoning with Diffusion Language Models
[AUTHORS]
Anthony Zhan
[ABSTRACT]
Diffusion large language models (dLLMs) are a new paradigm of
non-autoregressive language models that are trained to predict multiple tokens
in parallel and generate text via iterative unmasking. Recent works have
successfully pretrained dLLMs to parity with autoregressive LLMs at the 8B
scale, but dLLMs have yet to benefit from modern post-training techniques, e.g.
reinforcement learning (RL), that have proven effective for autoregressive
models. Crucially, algorithms designed for traditional LLMs aren’t directly
compatible with diffusion frameworks due to inherent differences in modeling
assumptions. Moreover, existing attempts at dLLM post-training with RL rely on
heuristic-based objectives with no theoretical grounding. In this work, we
present Amortized Group Relative Policy Optimization (AGRPO), a principled
on-policy RL algorithm designed specifically for dLLMs. AGRPO uses Monte Carlo
sampling to compute an unbiased policy gradient estimate, making it the first
tractable, faithful adaptation of policy gradient methods for dLLMs. We
demonstrate AGRPO’s effectiveness on different math/reasoning tasks, a common
setting for RL with LLMs, achieving up to +7.6% absolute gain on GSM8K and 3.8x
performance on the Countdown task over the baseline LLaDA-8B-Instruct model and
1.3x performance gains over comparable RL methods such as diffu-GRPO.
Furthermore, these gains persist across different numbers of sampling steps at
inference time, achieving better tradeoffs between compute and performance. Our
results demonstrate that online RL algorithms can be extended to diffusion LLMs
in principled ways, maintaining both theoretical soundness and practical
effectiveness.
[LINK]
http://arxiv.org/abs/2510.04019v1
[DATE]
2025-10-05 11:53:16+08:00
[CATEGORIES]
cs.LG
cs.CL
LLM Microscope: What Model Internals Reveal About Answer Correctness and Context Utilization
[AUTHORS]
Jiarui Liu, Jivitesh Jain, Mona Diab, Nishant Subramani
[ABSTRACT]
Although large language models (LLMs) have tremendous utility,
trustworthiness is still a chief concern: models often generate incorrect
information with high confidence. While contextual information can help guide
generation, identifying when a query would benefit from retrieved context and
assessing the effectiveness of that context remains challenging. In this work,
we operationalize interpretability methods to ascertain whether we can predict
the correctness of model outputs from the model’s activations alone. We also
explore whether model internals contain signals about the efficacy of external
context. We consider correct, incorrect, and irrelevant context and introduce
metrics to distinguish amongst them. Experiments on six different models reveal
that a simple classifier trained on intermediate layer activations of the first
output token can predict output correctness with about 75% accuracy, enabling
early auditing. Our model-internals-based metric significantly outperforms
prompting baselines at distinguishing between correct and incorrect context,
guarding against inaccuracies introduced by polluted context. These findings
offer a lens to better understand the underlying decision-making processes of
LLMs. Our code is publicly available at
https://github.com/jiarui-liu/LLM-Microscope
[LINK]
http://arxiv.org/abs/2510.04013v1
[DATE]
2025-10-05 11:14:05+08:00
[CATEGORIES]
cs.CL
What Shapes a Creative Machine Mind? Comprehensively Benchmarking Creativity in Foundation Models
[AUTHORS]
Zicong He, Boxuan Zhang, Weihao Liu, Ruixiang Tang, Lu Cheng
[ABSTRACT]
The meteoric rise of foundation models (FMs) has expanded their capabilities
far beyond conventional tasks. Creativity, long regarded as a hallmark of human
intelligence and a driver of innovation, is now increasingly recognized as a
critical dimension of machine intelligence in the era of generative FMs,
complementing traditional measures of accuracy. However, existing evaluation
frameworks for creativity remain fragmented, relying on ad hoc metrics not
firmly grounded in established theories. To address this gap, we introduce
C^2-Eval, a holistic benchmark for unified assessment of creativity in FMs.
C^2-Eval distinguishes between two complementary forms of creativity:
convergent creativity, where tasks admit constrained solutions (e.g., code
generation), and divergent creativity, where tasks are open-ended (e.g.,
storytelling). It evaluates both dimensions using fine-grained criteria derived
from social-science theory, focusing on Usefulness, Originality, and Surprise
(U-O-S). Through extensive experiments on leading proprietary and open-source
models, we analyze trade-offs in their creative capabilities. Our results
highlight both the strengths and challenges of current FMs in pursuing a
creative machine mind, showing that C^2-Eval is an effective lens for examining
the evolving landscape of creative AI.
[COMMENTS]
22 pages
[LINK]
http://arxiv.org/abs/2510.04009v1
[DATE]
2025-10-05 11:00:50+08:00
[CATEGORIES]
cs.CL
Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5
[AUTHORS]
Minh Hoang Nguyen, Su Nguyen Thiet
[ABSTRACT]
Recognizing and processing Classical Chinese (Han-Nom) texts play a vital
role in digitizing Vietnamese historical documents and enabling cross-lingual
semantic research. However, existing OCR systems struggle with degraded scans,
non-standard glyphs, and handwriting variations common in ancient sources. In
this work, we propose a fine-tuning approach for PaddleOCRv5 to improve
character recognition on Han-Nom texts. We retrain the text recognition module
using a curated subset of ancient Vietnamese Chinese manuscripts, supported by
a full training pipeline covering preprocessing, LMDB conversion, evaluation,
and visualization. Experimental results show a significant improvement over the
base model, with exact accuracy increasing from 37.5 percent to 50.0 percent,
particularly under noisy image conditions. Furthermore, we develop an
interactive demo that visually compares pre- and post-fine-tuning recognition
results, facilitating downstream applications such as Han-Vietnamese semantic
alignment, machine translation, and historical linguistics research. The demo
is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5.
[COMMENTS]
5 pages, 6 figures, 2 tables
[LINK]
http://arxiv.org/abs/2510.04003v1
[DATE]
2025-10-05 10:34:38+08:00
[CATEGORIES]
cs.CL
Named Entity Recognition in COVID-19 tweets with Entity Knowledge Augmentation
[AUTHORS]
Xuankang Zhang, Jiangming Liu
[ABSTRACT]
The COVID-19 pandemic has caused severe social and economic disruption around
the world, raising various subjects that are discussed on social media.
Identifying pandemic-related named entities as expressed on social media is
fundamental to understanding the discussions about the pandemic.
However, there is limited work on named entity recognition on this topic due to
the following challenges: 1) COVID-19 texts in social media are informal and
their annotations are rare and insufficient to train a robust recognition
model, and 2) named entity recognition in COVID-19 requires extensive
domain-specific knowledge. To address these issues, we propose a novel entity
knowledge augmentation approach for COVID-19, which can also be applied in
general biomedical named entity recognition in both informal text format and
formal text format. Experiments carried out on the COVID-19 tweets dataset and
PubMed dataset show that our proposed entity knowledge augmentation improves
NER performance in both fully-supervised and few-shot settings. Our source code
is publicly available: https://github.com/kkkenshi/LLM-EKA/tree/master
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2510.04001v1
[DATE]
2025-10-05 10:22:26+08:00
[CATEGORIES]
cs.CL
Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions
[AUTHORS]
Yang Xu, Xuanming Zhang, Min-Hsuan Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Yixuan Li
[ABSTRACT]
Deception is a pervasive feature of human communication and an emerging
concern in large language models (LLMs). While recent studies document
instances of LLM deception under pressure, most evaluations remain confined to
single-turn prompts and fail to capture the long-horizon interactions in which
deceptive strategies typically unfold. We introduce the first simulation
framework for probing and evaluating deception in LLMs under extended sequences
of interdependent tasks and dynamic contextual pressures. Our framework
instantiates a multi-agent system: a performer agent tasked with completing
tasks and a supervisor agent that evaluates progress, provides feedback, and
maintains evolving states of trust. An independent deception auditor then
reviews full trajectories to identify when and how deception occurs. We conduct
extensive experiments across 11 frontier models, spanning both closed- and
open-source systems, and find that deception is model-dependent, increases with
event pressure, and consistently erodes supervisor trust. Qualitative analyses
further reveal distinct strategies of concealment, equivocation, and
falsification. Our findings establish deception as an emergent risk in
long-horizon interactions and provide a foundation for evaluating future LLMs
in real-world, trust-sensitive contexts.
[LINK]
http://arxiv.org/abs/2510.03999v1
[DATE]
2025-10-05 10:18:23+08:00
[CATEGORIES]
cs.CL
LPI-RIT at LeWiDi-2025: Improving Distributional Predictions via Metadata and Loss Reweighting with DisCo
[AUTHORS]
Mandira Sawkar, Samay U. Shetty, Deepak Pandita, Tharindu Cyril Weerasooriya, Christopher M. Homan
[COMMENTS]
To appear in Proceedings of the EMNLP 2025 Workshop on Learning with
Disagreements (LeWiDi)
[LINK]
http://arxiv.org/abs/2508.08163v2
[DATE]
2025-10-05 09:07:12+08:00
[CATEGORIES]
cs.CL
cs.LG
Can GPT models Follow Human Summarization Guidelines? A Study for Targeted Communication Goals
[AUTHORS]
Yongxin Zhou, Fabien Ringeval, François Portet
[ABSTRACT]
This study investigates the ability of GPT models (ChatGPT, GPT-4 and GPT-4o)
to generate dialogue summaries that adhere to human guidelines. Our evaluation
involved experimenting with various prompts to guide the models in complying
with guidelines on two datasets: DialogSum (English social conversations) and
DECODA (French call center interactions). Human evaluation, based on
summarization guidelines, served as the primary assessment method, complemented
by extensive quantitative and qualitative analyses. Our findings reveal a
preference for GPT-generated summaries over those from task-specific
pre-trained models and reference summaries, highlighting GPT models’ ability to
follow human guidelines despite occasionally producing longer outputs and
exhibiting divergent lexical and structural alignment with references. The
discrepancy between ROUGE, BERTScore, and human evaluation underscores the need
for more reliable automatic evaluation metrics.
[COMMENTS]
INLG 2025, Hanoi, Vietnam, October 29 - November 2, 2025
[LINK]
http://arxiv.org/abs/2310.16810v3
[DATE]
2025-10-05 07:22:58+08:00
[CATEGORIES]
cs.CL
Adversarial Defence without Adversarial Defence: Enhancing Language Model Robustness via Instance-level Principal Component Removal
[AUTHORS]
Yang Wang, Chenghao Xiao, Yizhi Li, Stuart E. Middleton, Noura Al Moubayed, Chenghua Lin
[ABSTRACT]
Pre-trained language models (PLMs) have driven substantial progress in
natural language processing but remain vulnerable to adversarial attacks,
raising concerns about their robustness in real-world applications. Previous
studies have sought to mitigate the impact of adversarial attacks by
introducing adversarial perturbations into the training process, either
implicitly or explicitly. While both strategies enhance robustness, they often
incur high computational costs. In this work, we propose a simple yet effective
add-on module that enhances the adversarial robustness of PLMs by removing
instance-level principal components, without relying on conventional
adversarial defences or perturbing the original training data. Our approach
transforms the embedding space to approximate Gaussian properties, thereby
reducing its susceptibility to adversarial perturbations while preserving
semantic relationships. This transformation aligns embedding distributions in a
way that minimises the impact of adversarial noise on decision boundaries,
enhancing robustness without requiring adversarial examples or costly
training-time augmentation. Evaluations on eight benchmark datasets show that
our approach improves adversarial robustness while maintaining comparable
before-attack accuracy to baselines, achieving a balanced trade-off between
robustness and generalisation.
[COMMENTS]
This paper was accepted with an A-decision to Transactions of the
Association for Computational Linguistics. This version is the
pre-publication version prior to MIT Press production
[LINK]
http://arxiv.org/abs/2507.21750v2
[DATE]
2025-10-05 05:04:38+08:00
[CATEGORIES]
cs.CL
From Compression to Expression: A Layerwise Analysis of In-Context Learning
[AUTHORS]
Jiachen Jiang, Yuxin Dong, Jinxin Zhou, Zhihui Zhu
[ABSTRACT]
In-context learning (ICL) enables large language models (LLMs) to adapt to
new tasks without weight updates by learning from demonstration sequences.
While ICL shows strong empirical performance, its internal representational
mechanisms are not yet well understood. In this work, we conduct a statistical
geometric analysis of ICL representations to investigate how task-specific
information is captured across layers. Our analysis reveals an intriguing
phenomenon, which we term Layerwise Compression-Expression: early layers
progressively produce compact and discriminative representations that encode
task information from the input demonstrations, while later layers express
these representations to incorporate the query and generate the prediction.
This phenomenon is observed consistently across diverse tasks and a range of
contemporary LLM architectures. We demonstrate that it has important
implications for ICL performance – improving with model size and the number of
demonstrations – and for robustness in the presence of noisy examples. To
further understand the effect of the compact task representation, we propose a
bias-variance decomposition and provide a theoretical analysis showing how
attention mechanisms contribute to reducing both variance and bias, thereby
enhancing performance as the number of demonstrations increases. Our findings
reveal an intriguing layerwise dynamic in ICL, highlight how structured
representations emerge within LLMs, and showcase that analyzing internal
representations can facilitate a deeper understanding of model behavior.
[LINK]
http://arxiv.org/abs/2505.17322v2
[DATE]
2025-10-05 04:23:05+08:00
[CATEGORIES]
cs.CL
cs.LG
ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks
[AUTHORS]
Yinghao Zhu, Junyi Gao, Zixiang Wang, Weibin Liao, Xiaochen Zheng, Lifang Liang, Miguel O. Bernabeu, Yasha Wang, Lequan Yu, Chengwei Pan, Ewen M. Harrison, Liantao Ma
[ABSTRACT]
Large Language Models (LLMs) are increasingly deployed in medicine. However,
their utility in non-generative clinical prediction, often presumed inferior to
specialized models, remains under-evaluated, leading to ongoing debate within
the field and potential for misuse, misunderstanding, or over-reliance due to a
lack of systematic benchmarking. Our ClinicRealm study addresses this by
benchmarking 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods
on unstructured clinical notes and structured Electronic Health Records (EHR),
while also assessing their reasoning, reliability, and fairness. Key findings
reveal a significant shift: for clinical note predictions, leading LLMs (e.g.,
DeepSeek-V3.1-Think, GPT-5) in zero-shot settings now decisively outperform
finetuned BERT models. On structured EHRs, while specialized models excel with
ample data, advanced LLMs (e.g., GPT-5, DeepSeek-V3.1-Think) show potent
zero-shot capabilities, often surpassing conventional models in data-scarce
settings. Notably, leading open-source LLMs can match or exceed proprietary
counterparts. These results provide compelling evidence that modern LLMs are
competitive tools for non-generative clinical prediction, particularly with
unstructured text and offering data-efficient structured data options, thus
necessitating a re-evaluation of model selection strategies. This research
should serve as an important insight for medical informaticists, AI developers,
and clinical researchers, potentially prompting a reassessment of current
assumptions and inspiring new approaches to LLM application in predictive
healthcare.
[COMMENTS]
Code: https://github.com/yhzhu99/ehr-llm-benchmark
[LINK]
http://arxiv.org/abs/2407.18525v3
[DATE]
2025-10-05 04:21:52+08:00
[CATEGORIES]
cs.CL
cs.LG
LLM Chemistry Estimation for Multi-LLM Recommendation
[AUTHORS]
Huascar Sanchez, Briland Hitaj
[ABSTRACT]
Multi-LLM collaboration promises accurate, robust, and context-aware
solutions, yet existing approaches rely on implicit selection and output
assessment without analyzing whether collaborating models truly complement or
conflict. We introduce LLM Chemistry – a framework that measures when LLM
combinations exhibit synergistic or antagonistic behaviors that shape
collective performance beyond individual capabilities. We formalize the notion
of chemistry among LLMs, propose algorithms that quantify it by analyzing
interaction dependencies, and recommend optimal model ensembles accordingly.
Our theoretical analysis shows that chemistry among collaborating LLMs is most
evident under heterogeneous model profiles, with its outcome impact shaped by
task type, group size, and complexity. Evaluation on classification,
summarization, and program repair tasks provides initial evidence for these
task-dependent effects, thereby reinforcing our theoretical results. This
establishes LLM Chemistry as both a diagnostic factor in multi-LLM systems and
a foundation for ensemble recommendation.
[COMMENTS]
20 pages, 5 figures, 5 tables
[LINK]
http://arxiv.org/abs/2510.03930v1
[DATE]
2025-10-05 04:21:39+08:00
[CATEGORIES]
cs.LG
cs.CL
Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
[AUTHORS]
Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, Hannaneh Hajishirzi
[ABSTRACT]
Language models are trained mainly on massive text data from the Internet,
and it becomes increasingly important to understand this data source.
Exact-match search engines enable searching in large text corpora - counting
string appearances and retrieving the enclosing documents - yet the high
storage overhead hinders their application on Internet-scale data. We present
infini-gram mini, an efficient and scalable system that can make petabyte-level
text corpora searchable. Based on the FM-index data structure (Ferragina and
Manzini, 2000), which simultaneously indexes and compresses text, our system
creates indexes with size only 44% of the corpus. Infini-gram mini greatly
improves upon the best existing implementation of FM-index in terms of indexing
speed (18$\times$) and memory use during both indexing (3.2$\times$ reduction)
and querying (down to a negligible amount). We index 83TB of Internet text in
99 days with a single CPU node with 128 vCPUs (or 19 hours if using 137 such
nodes). We show one important use case of infini-gram mini in a large-scale
analysis of benchmark contamination. We find several core LM evaluation
benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in
GSM8K), which could lead to overestimating the capabilities of language models
if trained on such data. We host a benchmark contamination bulletin to share
the contamination rate of many core and community-contributed benchmarks. We
also release a web interface and an API endpoint to serve general search
queries on infini-gram mini indexes.
[LINK]
http://arxiv.org/abs/2506.12229v4
[DATE]
2025-10-05 03:44:59+08:00
[CATEGORIES]
cs.CL
PsychoLexTherapy: Simulating Reasoning in Psychotherapy with Small Language Models in Persian
[AUTHORS]
Mohammad Amin Abbasi, Hassan Naderi
[ABSTRACT]
This study presents PsychoLexTherapy, a framework for simulating
psychotherapeutic reasoning in Persian using small language models (SLMs). The
framework tackles the challenge of developing culturally grounded,
therapeutically coherent dialogue systems with structured memory for multi-turn
interactions in underrepresented languages. To ensure privacy and feasibility,
PsychoLexTherapy is optimized for on-device deployment, enabling use without
external servers. Development followed a three-stage process: (i) assessing
SLMs' psychological knowledge with PsychoLexEval; (ii) designing and
implementing the reasoning-oriented PsychoLexTherapy framework; and (iii)
constructing two evaluation datasets, PsychoLexQuery (real Persian user
questions) and PsychoLexDialogue (hybrid simulated sessions), to benchmark
against multiple baselines. Experiments compared simple prompting, multi-agent
debate, and structured therapeutic reasoning paths. Results showed that
deliberate model selection balanced accuracy, efficiency, and privacy. On
PsychoLexQuery, PsychoLexTherapy outperformed all baselines in automatic
LLM-as-a-judge evaluation and was ranked highest by human evaluators in a
single-turn preference study. In multi-turn tests with PsychoLexDialogue, the
long-term memory module proved essential: while naive history concatenation
caused incoherence and information loss, the full framework achieved the
highest ratings in empathy, coherence, cultural fit, and personalization.
Overall, PsychoLexTherapy establishes a practical, privacy-preserving, and
culturally aligned foundation for Persian psychotherapy simulation,
contributing novel datasets, a reproducible evaluation pipeline, and empirical
insights into structured memory for therapeutic reasoning.
[LINK]
http://arxiv.org/abs/2510.03913v1
[DATE]
2025-10-05 03:40:10+08:00
[CATEGORIES]
cs.CL
Seeded Poisson Factorization: leveraging domain knowledge to fit topic models
[AUTHORS]
Bernd Prostmaier, Jan Vávra, Bettina Grün, Paul Hofmarcher
[ABSTRACT]
Topic models are widely used for discovering latent thematic structures in
large text corpora, yet traditional unsupervised methods often struggle to
align with pre-defined conceptual domains. This paper introduces seeded Poisson
Factorization (SPF), a novel approach that extends the Poisson Factorization
(PF) framework by incorporating domain knowledge through seed words. SPF
enables a structured topic discovery by modifying the prior distribution of
topic-specific term intensities, assigning higher initial rates to pre-defined
seed words. The model is estimated using variational inference with stochastic
gradient optimization, ensuring scalability to large datasets.
We present in detail the results of applying SPF to an Amazon customer
feedback dataset, leveraging pre-defined product categories as guiding
structures. SPF achieves superior performance compared to alternative guided
probabilistic topic models in terms of computational efficiency and
classification performance. Robustness checks highlight SPF’s ability to
adaptively balance domain knowledge and data-driven topic discovery, even in
case of imperfect seed word selection. Further applications of SPF to four
additional benchmark datasets, where the corpus varies in size and the number
of topics differs, demonstrate its generally superior classification performance
compared to the unseeded PF model.
[LINK]
http://arxiv.org/abs/2503.02741v2
[DATE]
2025-10-05 02:42:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Kantian-Utilitarian XAI: Meta-Explained
[AUTHORS]
Zahra Atf, Peter R. Lewis
[ABSTRACT]
We present a gamified explainable AI (XAI) system for ethically aware
consumer decision-making in the coffee domain. Each session comprises six
rounds with three options per round. Two symbolic engines provide real-time
reasons: a Kantian module flags rule violations (e.g., child labor,
deforestation risk without shade certification, opaque supply chains, unsafe
decaf), and a utilitarian module scores options via multi-criteria aggregation
over normalized attributes (price, carbon, water, transparency, farmer income
share, taste/freshness, packaging, convenience). A meta-explainer with a regret
bound (0.2) highlights Kantian–utilitarian (mis)alignment and switches to a
deontically clean, near-parity option when welfare loss is small. We release a
structured configuration (attribute schema, certification map, weights, rule
set), a policy trace for auditability, and an interactive UI.
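
The decision flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released configuration: the `Option` class, the attribute names, the equal weights, and the reading of the 0.2 regret bound as a maximum utility loss accepted when switching to a violation-free option are all assumptions made for this example.

```python
from dataclasses import dataclass, field

REGRET_BOUND = 0.2  # value from the abstract; its exact semantics are assumed here


@dataclass
class Option:
    name: str
    attrs: dict  # normalized attributes in [0, 1], higher is better (assumed)
    violations: list = field(default_factory=list)  # Kantian rule flags


def utility(option, weights):
    """Weighted mean over normalized attributes (one simple aggregation rule)."""
    total = sum(weights.values())
    return sum(w * option.attrs.get(k, 0.0) for k, w in weights.items()) / total


def meta_recommend(options, weights, regret_bound=REGRET_BOUND):
    """Return (option, status) where status is aligned, switched, or misaligned."""
    best = max(options, key=lambda o: utility(o, weights))
    if not best.violations:
        return best, "aligned"  # utilitarian optimum is already rule-clean
    clean = [o for o in options if not o.violations]
    if clean:
        alt = max(clean, key=lambda o: utility(o, weights))
        regret = utility(best, weights) - utility(alt, weights)
        if regret <= regret_bound:
            return alt, "switched"  # small welfare loss: prefer the clean option
    return best, "misaligned"  # report the conflict instead of switching


weights = {"price": 1.0, "taste": 1.0}  # hypothetical weights
round_options = [
    Option("A", {"price": 0.9, "taste": 0.9}, ["opaque supply chain"]),
    Option("B", {"price": 0.8, "taste": 0.8}),
    Option("C", {"price": 0.3, "taste": 0.3}),
]
choice, status = meta_recommend(round_options, weights)
```

In this toy round the top-scoring option A carries a rule flag and the best clean option B costs about 0.1 in utility, so the sketch switches to B.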
[COMMENTS]
Accepted for presentation as a poster at the 35th IEEE International
Conference on Collaborative Advances in Software and Computing, 2025.
Conference website: https://conf.researchr.org/details/cascon-2025/posters-track/1/Kantian-Utilitarian-XAI-Meta-Explained
[LINK]
http://arxiv.org/abs/2510.03892v1
[DATE]
2025-10-05 02:16:12+08:00
[CATEGORIES]
cs.CL
SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation
[AUTHORS]
Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu
[ABSTRACT]
The rapid growth of academic literature makes the manual creation of
scientific surveys increasingly infeasible. While large language models show
promise for automating this process, progress in this area is hindered by the
absence of standardized benchmarks and evaluation protocols. To bridge this
critical gap, we introduce SurGE (Survey Generation Evaluation), a new
benchmark for scientific survey generation in computer science. SurGE consists
of (1) a collection of test instances, each including a topic description, an
expert-written survey, and its full set of cited references, and (2) a
large-scale academic corpus of over one million papers. In addition, we propose
an automated evaluation framework that measures the quality of generated
surveys across four dimensions: comprehensiveness, citation accuracy,
structural organization, and content quality. Our evaluation of diverse
LLM-based methods demonstrates a significant performance gap, revealing that
even advanced agentic frameworks struggle with the complexities of survey
generation and highlighting the need for future research in this area. We have
open-sourced all the code, data, and models at:
https://github.com/oneal2000/SurGE
[LINK]
http://arxiv.org/abs/2508.15658v2
[DATE]
2025-10-05 00:52:09+08:00
[CATEGORIES]
cs.CL
Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
[AUTHORS]
Wenhao Deng, Long Wei, Chenglei Yu, Tailin Wu
[ABSTRACT]
Reinforcement learning with verifiable rewards (RLVR) has recently enhanced
the reasoning capabilities of large language models (LLMs), particularly for
mathematical problem solving. However, a fundamental limitation remains: as the
sampling budget increases, the advantage of RLVR-trained models over their
pretrained bases often diminishes or even vanishes, revealing a strong
dependence on the base model’s restricted search space. We attribute this
phenomenon to the widespread use of the reverse Kullback-Leibler (KL)
divergence regularizer, whose mode-seeking behavior keeps the policy trapped
inside the base model’s support region and hampers wider exploration. To
address this issue, we propose RAPO (Rewards-Aware Policy Optimization), an
algorithm to promote broader yet focused exploration. Our method (i) utilizes
the forward KL penalty to replace the reverse KL penalty for
out-of-distribution exploration, and (ii) reweights the reference policy to
facilitate adaptive in-distribution exploration. We train Qwen2.5-3B and 7B
models with RAPO on the 8K SimpleRL-Zero dataset, without supervised
fine-tuning, and evaluate them on AIME2024 and AIME2025. Results show that RAPO
consistently improves problem-solving performance. Notably, RAPO enables models
to surpass the base model’s performance ceiling and solves previously
intractable problems, advancing the frontier of RLVR for challenging reasoning
tasks.
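
The contrast between the two KL directions can be illustrated with a toy discrete example. This is only a sketch of the standard definitions, not the RAPO objective; it assumes the common convention that "reverse" KL is KL(policy || reference) and "forward" KL is KL(reference || policy), and the two distributions are made up for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; inf if p has mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # terms with zero mass under p contribute nothing
        if qi == 0.0:
            return math.inf  # p leaves the support of q
        total += pi * math.log(pi / qi)
    return total


# Hypothetical reference (base) model that never produces the third outcome,
# and a policy that explores it.
ref = [0.5, 0.5, 0.0]
policy = [0.4, 0.4, 0.2]

reverse_kl = kl(policy, ref)  # infinite: any mass outside the reference support is ruled out
forward_kl = kl(ref, policy)  # finite: the policy only has to cover the reference
```

Under the reverse penalty any mass outside the reference support has infinite cost, while the forward penalty stays finite, which matches the support-trapping behavior the abstract attributes to the reverse KL regularizer.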
[LINK]
http://arxiv.org/abs/2510.03865v1
[DATE]
2025-10-05 00:22:19+08:00
[CATEGORIES]
cs.LG
cs.CL
League: Leaderboard Generation on Demand
[AUTHORS]
Jian Wu, Jiayu Zhang, Dongyuan Li, Linyi Yang, Aoxiao Zhong, Renhe Jiang, Qingsong Wen, Yue Zhang
[ABSTRACT]
This paper introduces Leaderboard Auto Generation (LAG), a novel and
well-organized framework for automatic generation of leaderboards on a given
research topic in rapidly evolving fields like Artificial Intelligence (AI).
With a large number of AI papers published daily, it is difficult for
researchers to track every paper’s proposed methods, experimental results, and
settings, prompting the need for efficient automatic leaderboard construction.
While large language models (LLMs) offer promise in automating this process,
challenges such as multi-document summarization, leaderboard generation, and
fair experimental comparison remain underexplored. LAG addresses these
challenges through a systematic approach that involves paper collection,
experimental result extraction and integration, leaderboard generation, and
quality evaluation. Our contributions include a comprehensive solution to the
leaderboard construction problem, a reliable evaluation method, and
experimental results showing the high quality of leaderboards.
[LINK]
http://arxiv.org/abs/2502.18209v2
[DATE]
2025-10-05 00:02:01+08:00
[CATEGORIES]
cs.CL
User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
[AUTHORS]
Yuhan Liu, Michael J. Q. Zhang, Eunsol Choi
[COMMENTS]
EMNLP camera-ready
[LINK]
http://arxiv.org/abs/2507.23158v2
[DATE]
2025-10-05 00:01:31+08:00
[CATEGORIES]
cs.CL
Inference-time Scaling of Diffusion Models through Classical Search
[AUTHORS]
Xiangcheng Zhang, Haowei Lin, Haotian Ye, James Zou, Jianzhu Ma, Yitao Liang, Yilun Du
[ABSTRACT]
Classical search algorithms have long underpinned modern artificial
intelligence. In this work, we tackle the challenge of inference-time control
in diffusion models – adapting generated outputs to meet diverse test-time
objectives – using principles from classical search. We propose a general
framework that orchestrates local and global search to efficiently navigate the
generative space. It employs a theoretically grounded local search via annealed
Langevin MCMC and performs compute-efficient global exploration using
breadth-first and depth-first tree search. We evaluate our approach on a range
of challenging domains, including planning, offline reinforcement learning, and
image generation. Across all tasks, we observe significant gains in both
performance and efficiency. These results show that classical search provides a
principled and practical foundation for inference-time scaling in diffusion
models. Project page at https://diffusion-inference-scaling.github.io/.
[COMMENTS]
Website at https://diffusion-inference-scaling.github.io/
[LINK]
http://arxiv.org/abs/2505.23614v2
[DATE]
2025-10-05 23:58:19+08:00
[CATEGORIES]
cs.LG
Chameleon2++: An Efficient and Scalable Variant Of Chameleon Clustering
[AUTHORS]
Priyanshu Singh, Kapil Ahuja
[ABSTRACT]
Hierarchical clustering remains a fundamental challenge in data mining,
particularly when dealing with large-scale datasets where traditional
approaches fail to scale effectively. Recent Chameleon-based algorithms
(Chameleon2, M-Chameleon, and INNGS-Chameleon) have proposed advanced
strategies, but they still suffer from $O(n^2)$ computational complexity, which
is especially costly for large datasets. With Chameleon2 as the base algorithm,
we introduce Chameleon2++, which addresses this challenge.
Our algorithm has three parts. First, in Graph Generation, we propose an
approximate $k$-NN search instead of an exact one; specifically, we integrate
the Annoy algorithm. This results in fast approximate nearest neighbor
computation, significantly reducing the graph generation time. Second, in Graph
Partitioning, we propose the use of a multi-level partitioning algorithm instead
of a recursive bisection one; specifically, we adapt the hMETIS algorithm
instead of FM. Multi-level algorithms are robust to the approximation
introduced in the graph generation phase, yielding higher-quality partitions
with minimal configuration requirements. Third, in Merging, we retain the flood
fill heuristic, which ensures connected, balanced components in the partitions,
as well as the efficient partition merging criteria that lead to the final
clusters.
These enhancements reduce the overall time complexity to $O(n\log n)$,
achieving scalability. On real-world benchmark datasets used in prior Chameleon
works, Chameleon2++ delivers an average of 4% improvement in clustering
quality. This demonstrates that algorithmic efficiency and clustering quality
can co-exist in large-scale hierarchical clustering.
[COMMENTS]
10 Pages, 2 Figures, 5 Tables
[LINK]
http://arxiv.org/abs/2501.02612v2
[DATE]
2025-10-05 23:42:08+08:00
[CATEGORIES]
cs.LG
Diffusion-Assisted Distillation for Self-Supervised Graph Representation Learning with MLPs
[AUTHORS]
Seong Jin Ahn, Myoung-Ho Kim
[ABSTRACT]
For large-scale applications, there is growing interest in replacing Graph
Neural Networks (GNNs) with lightweight Multi-Layer Perceptrons (MLPs) via
knowledge distillation. However, distilling GNNs for self-supervised graph
representation learning into MLPs is more challenging. This is because the
performance of self-supervised learning depends more on the model’s inductive
bias than that of supervised learning does. This motivates us to design a new
distillation method to bridge the large capacity gap between GNNs and MLPs in
self-supervised graph representation learning. In this paper, we propose
Diffusion-Assisted Distillation for Self-supervised Graph representation
learning with MLPs (DAD-SGM). The proposed method employs a denoising diffusion
model as a teacher assistant to better distill the knowledge from the teacher
GNN into the student MLP. This approach enhances the generalizability and
robustness of MLPs in self-supervised graph representation learning. Extensive
experiments demonstrate that DAD-SGM distills the knowledge of self-supervised
GNNs more effectively than state-of-the-art GNN-to-MLP distillation
methods. Our implementation is available at
https://github.com/SeongJinAhn/DAD-SGM.
[LINK]
http://arxiv.org/abs/2510.04241v1
[DATE]
2025-10-05 23:11:55+08:00
[CATEGORIES]
cs.LG
Truncated Kernel Stochastic Gradient Descent with General Losses and Spherical Radial Basis Functions
[AUTHORS]
Jinhui Bai, Andreas Christmann, Lei Shi
[ABSTRACT]
In this paper, we propose a novel kernel stochastic gradient descent (SGD)
algorithm for large-scale supervised learning with general losses. Compared to
traditional kernel SGD, our algorithm improves efficiency and scalability
through an innovative regularization strategy. By leveraging the infinite
series expansion of spherical radial basis functions, this strategy projects
the stochastic gradient onto a finite-dimensional hypothesis space, which is
adaptively scaled according to the bias-variance trade-off, thereby enhancing
generalization performance. Based on a new estimation of the spectral structure
of the kernel-induced covariance operator, we develop an analytical framework
that unifies optimization and generalization analyses. We prove that both the
last iterate and the suffix average converge at minimax-optimal rates, and we
further establish optimal strong convergence in the reproducing kernel Hilbert
space. Our framework accommodates a broad class of classical loss functions,
including least-squares, Huber, and logistic losses. Moreover, the proposed
algorithm significantly reduces computational complexity and achieves optimal
storage complexity by incorporating coordinate-wise updates from linear SGD,
thereby avoiding the costly pairwise operations typical of kernel SGD and
enabling efficient processing of streaming data. Finally, extensive numerical
experiments demonstrate the efficiency of our approach.
[COMMENTS]
54 pages, 20 figures
[LINK]
http://arxiv.org/abs/2510.04237v1
[DATE]
2025-10-05 23:04:03+08:00
[CATEGORIES]
cs.LG
Diffusion Approximations for Thompson Sampling in the Small Gap Regime
[AUTHORS]
Lin Fan, Peter W. Glynn
[ABSTRACT]
We study the process-level dynamics of Thompson sampling in the “small gap”
regime. The small gap regime is one in which the gaps between the arm means are
of order $\sqrt{\gamma}$ or smaller and the time horizon is of order
$1/\gamma$, where $\gamma$ is small. As $\gamma \downarrow 0$, we show that the
process-level dynamics of Thompson sampling converge weakly to the solutions to
certain stochastic differential equations and stochastic ordinary differential
equations. Our weak convergence theory is developed from first principles using
the Continuous Mapping Theorem, can handle stationary, weakly dependent reward
processes, and can also be adapted to analyze a variety of sampling-based
bandit algorithms. Indeed, we show that the process-level dynamics of many
sampling-based bandit algorithms – including Thompson sampling designed for
any single-parameter exponential family of rewards, as well as non-parametric
bandit algorithms based on bootstrap re-sampling – satisfy an invariance
principle. Namely, their weak limits coincide with that of Gaussian parametric
Thompson sampling with Gaussian priors. Moreover, in the small gap regime, the
regret performance of these algorithms is generally insensitive to model
mis-specification, changing continuously with increasing degrees of
mis-specification.
[LINK]
http://arxiv.org/abs/2105.09232v5
[DATE]
2025-10-05 23:02:40+08:00
[CATEGORIES]
cs.LG
Physics-Inspired All-Pair Interaction Learning for 3D Dynamics Modeling
[AUTHORS]
Kai Yang, Yuqi Huang, Junheng Tao, Wanyu Wang, Qitian Wu
[ABSTRACT]
Modeling 3D dynamics is a fundamental problem in multi-body systems across
scientific and engineering domains and has important practical implications in
trajectory prediction and simulation. While recent GNN-based approaches have
achieved strong performance by enforcing geometric symmetries, encoding
high-order features or incorporating neural-ODE mechanics, they typically
depend on explicitly observed structures and inherently fail to capture the
unobserved interactions that are crucial to complex physical behaviors and
dynamic mechanisms. In this paper, we propose PAINET, a principled
SE(3)-equivariant neural architecture for learning all-pair interactions in
multi-body systems. The model comprises: (1) a novel physics-inspired attention
network derived from the minimization trajectory of an energy function, and (2)
a parallel decoder that preserves equivariance while enabling efficient
inference. Empirical results on diverse real-world benchmarks, including human
motion capture, molecular dynamics, and large-scale protein simulations, show
that PAINET consistently outperforms recently proposed models, yielding 4.7% to
41.5% error reductions in 3D dynamics prediction with comparable computation
costs in terms of time and memory.
[LINK]
http://arxiv.org/abs/2510.04233v1
[DATE]
2025-10-05 22:48:26+08:00
[CATEGORIES]
cs.LG
Blending adversarial training and representation-conditional purification via aggregation improves adversarial robustness
[AUTHORS]
Emanuele Ballarin, Alessio Ansuini, Luca Bortolussi
[ABSTRACT]
In this work, we propose a novel adversarial defence mechanism for image
classification - CARSO - blending the paradigms of adversarial training and
adversarial purification in a synergistic robustness-enhancing way. The method
builds upon an adversarially-trained classifier, and learns to map its internal
representation associated with a potentially perturbed input onto a
distribution of tentative clean reconstructions. Multiple samples from such
distribution are classified by the same adversarially-trained model, and a
carefully chosen aggregation of its outputs finally constitutes the robust
prediction of interest. Experimental evaluation by a well-established benchmark
of strong adaptive attacks, across different image datasets, shows that CARSO
is able to defend itself against adaptive end-to-end white-box attacks devised
for stochastic defences. Paying a modest clean accuracy toll, our method
improves by a significant margin the state-of-the-art for Cifar-10, Cifar-100,
and TinyImageNet-200 $\ell_\infty$ robust classification accuracy against
AutoAttack. Code and instructions to obtain pre-trained models are available
at: https://github.com/emaballarin/CARSO .
[COMMENTS]
Published in Transactions on Machine Learning Research (09/2025). 25
pages, 1 figure, 19 tables
[LINK]
http://arxiv.org/abs/2306.06081v6
[DATE]
2025-10-05 22:43:06+08:00
[CATEGORIES]
cs.LG
A Universal Deep Learning Force Field for Molecular Dynamic Simulation and Vibrational Spectra Prediction
[AUTHORS]
Shengjiao Ji, Yujin Zhang, Zihan Zou, Bin Jiang, Jun Jiang, Yi Luo, Wei Hu
[ABSTRACT]
Accurate and efficient simulation of infrared (IR) and Raman spectra is
essential for molecular identification and structural analysis. Traditional
quantum chemistry methods based on the harmonic approximation neglect
anharmonicity and nuclear quantum effects, while ab initio molecular dynamics
(AIMD) remains computationally expensive. Here, we integrate our deep
equivariant tensor attention network (DetaNet) with a velocity-Verlet
integrator to enable fast and accurate machine learning molecular dynamics
(MLMD) simulations for spectral prediction. Trained on the QMe14S dataset
containing energies, forces, dipole moments, and polarizabilities for 186,102
small organic molecules, DetaNet yields a universal and transferable force
field with high-order tensor prediction capability. Using time-correlation
functions derived from MLMD and ring-polymer molecular dynamics (RPMD)
trajectories, we computed IR and Raman spectra that accurately reproduce
anharmonic and nuclear quantum effects. Benchmark tests on isolated molecules,
including polycyclic aromatic hydrocarbons, demonstrate that the DetaNet-based
MD approach achieves near-experimental spectral accuracy with speedups up to
three orders of magnitude over AIMD. Furthermore, the framework extends
seamlessly to molecular and inorganic crystals, molecular aggregates, and
biological macromolecules such as polypeptides with minimal fine-tuning. In all
systems, DetaNet maintains high accuracy while significantly reducing
computational cost. Overall, this work establishes a universal machine learning
force field and tensor-aware MLMD framework that enable fast, accurate, and
broadly applicable dynamic simulations and IR/Raman spectral predictions across
diverse molecular and material systems.
[COMMENTS]
19 pages, 5 figures
[LINK]
http://arxiv.org/abs/2510.04227v1
[DATE]
2025-10-05 22:36:33+08:00
[CATEGORIES]
cs.LG
MASC: Boosting Autoregressive Image Generation with a Manifold-Aligned Semantic Clustering
[AUTHORS]
Lixuan He, Shikang Zheng, Linfeng Zhang
[ABSTRACT]
Autoregressive (AR) models have shown great promise in image generation, yet
they face a fundamental inefficiency stemming from their core component: a
vast, unstructured vocabulary of visual tokens. This conventional approach
treats tokens as a flat vocabulary, disregarding the intrinsic structure of the
token embedding space where proximity often correlates with semantic
similarity. This oversight results in a highly complex prediction task, which
hinders training efficiency and limits final generation quality. To resolve
this, we propose Manifold-Aligned Semantic Clustering (MASC), a principled
framework that constructs a hierarchical semantic tree directly from the
codebook’s intrinsic structure. MASC employs a novel geometry-aware distance
metric and a density-driven agglomerative construction to model the underlying
manifold of the token embeddings. By transforming the flat, high-dimensional
prediction task into a structured, hierarchical one, MASC introduces a
beneficial inductive bias that significantly simplifies the learning problem
for the AR model. MASC is designed as a plug-and-play module, and our extensive
experiments validate its effectiveness: it accelerates training by up to 57%
and significantly improves generation quality, reducing the FID of LlamaGen-XL
from 2.87 to 2.58. MASC elevates existing AR frameworks to be highly
competitive with state-of-the-art methods, establishing that structuring the
prediction space is as crucial as architectural innovation for scalable
generative modeling.
[LINK]
http://arxiv.org/abs/2510.04220v1
[DATE]
2025-10-05 22:23:51+08:00
[CATEGORIES]
cs.LG
Empowering Healthcare Practitioners with Language Models: Structuring Speech Transcripts in Two Real-World Clinical Applications
[AUTHORS]
Jean-Philippe Corbeil, Asma Ben Abacha, George Michalopoulos, Phillip Swazinna, Miguel Del-Agua, Jerome Tremblay, Akila Jeeson Daniel, Cari Bader, Yu-Cheng Cho, Pooja Krishnan, Nathan Bodenstab, Thomas Lin, Wenxuan Teng, Francois Beaulieu, Paul Vozila
[ABSTRACT]
Large language models (LLMs) such as GPT-4o and o1 have demonstrated strong
performance on clinical natural language processing (NLP) tasks across multiple
medical benchmarks. Nonetheless, two high-impact NLP tasks - structured tabular
reporting from nurse dictations and medical order extraction from
doctor-patient consultations - remain underexplored due to data scarcity and
sensitivity, despite active industry efforts. Practical solutions to these
real-world clinical tasks can significantly reduce the documentation burden on
healthcare providers, allowing greater focus on patient care. In this paper, we
investigate these two challenging tasks using private and open-source clinical
datasets, evaluating the performance of both open- and closed-weight LLMs, and
analyzing their respective strengths and limitations. Furthermore, we propose
an agentic pipeline for generating realistic, non-sensitive nurse dictations,
enabling structured extraction of clinical observations. To support further
research in both areas, we release SYNUR and SIMORD, the first open-source
datasets for nurse observation extraction and medical order extraction.
[LINK]
http://arxiv.org/abs/2507.05517v3
[DATE]
2025-10-04 23:31:53+08:00
[CATEGORIES]
cs.CL
Rowen: Adaptive Retrieval-Augmented Generation for Hallucination Mitigation in LLMs
[AUTHORS]
Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng
[COMMENTS]
Accepted at SIGIR-AP 2025
[LINK]
http://arxiv.org/abs/2402.10612v3
[DATE]
2025-10-04 22:31:52+08:00
[CATEGORIES]
cs.CL
OptimalThinkingBench: Evaluating Over and Underthinking in LLMs
[AUTHORS]
Pranjal Aggarwal, Seungone Kim, Jack Lanchantin, Sean Welleck, Jason Weston, Ilia Kulikov, Swarnadeep Saha
[ABSTRACT]
Thinking LLMs solve complex tasks at the expense of increased compute and
overthinking on simpler problems, while non-thinking LLMs are faster and
cheaper but underthink on harder reasoning problems. This has led to the
development of separate thinking and non-thinking LLM variants, leaving the
onus of selecting the optimal model for each query on the end user. We
introduce OptimalThinkingBench, a unified benchmark that jointly evaluates
overthinking and underthinking in LLMs and also encourages the development of
optimally-thinking models that balance performance and efficiency. Our
benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple
math and general queries in 72 domains, and UnderthinkingBench, containing 11
challenging reasoning tasks along with harder math problems. Using novel
thinking-adjusted accuracy metrics, we extensively evaluate 33 different
thinking and non-thinking models and show that no model is able to optimally
think on our benchmark. Thinking models often overthink for hundreds of tokens
on the simplest user queries without improving performance. In contrast, large
non-thinking models underthink, often falling short of much smaller thinking
models. We further explore several methods to encourage optimal thinking, but
find that these approaches often improve on one sub-benchmark at the expense of
the other, highlighting the need for better unified and optimal models in the
future.
[COMMENTS]
30 pages, 10 tables, 11 figures
[LINK]
http://arxiv.org/abs/2508.13141v2
[DATE]
2025-10-04 22:25:22+08:00
[CATEGORIES]
cs.CL
cs.LG
Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning
[AUTHORS]
Peichao Lai, Zhengfeng Zhang, Wentao Zhang, Fangcheng Fu, Bin Cui
[ABSTRACT]
Recently, using large language models (LLMs) for data augmentation has led to
considerable improvements in unsupervised sentence embedding models. However,
existing methods encounter two primary challenges: limited data diversity and
high data noise. Current approaches often neglect fine-grained knowledge, such
as entities and quantities, leading to insufficient diversity. In addition,
unsupervised data frequently lacks discriminative information, and the
generated synthetic samples may introduce noise. In this paper, we propose a
pipeline-based data augmentation method via LLMs and introduce the
Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model
to enhance unsupervised sentence embeddings. To tackle the issue of low data
diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and
quantities, enabling LLMs to generate more diverse samples. To address high
data noise, the GCSE model uses a Gaussian-decayed function to limit the impact
of false hard negative samples, enhancing the model’s discriminative
capability. Experimental results show that our approach achieves
state-of-the-art performance in semantic textual similarity (STS) tasks, using
fewer data samples and smaller LLMs, demonstrating its efficiency and
robustness across various models.
[LINK]
http://arxiv.org/abs/2409.12887v5
[DATE]
2025-10-04 21:42:11+08:00
[CATEGORIES]
cs.CL
Annotate Rhetorical Relations with INCEpTION: A Comparison with Automatic Approaches
[AUTHORS]
Mehedi Hasan Emon
[ABSTRACT]
This research explores the annotation of rhetorical relations in discourse
using the INCEpTION tool and compares manual annotation with automatic
approaches based on large language models. The study focuses on sports reports
(specifically cricket news) and evaluates the performance of BERT, DistilBERT,
and Logistic Regression models in classifying rhetorical relations such as
elaboration, contrast, background, and cause-effect. The results show that
DistilBERT achieved the highest accuracy, highlighting its potential for
efficient discourse relation prediction. This work contributes to the growing
intersection of discourse parsing and transformer-based NLP. (This work was
conducted as part of an academic requirement under the supervision of Prof. Dr.
Ralf Klabunde, Linguistic Data Science Lab, Ruhr University Bochum.) Keywords:
Rhetorical Structure Theory, INCEpTION, BERT, DistilBERT, Discourse Parsing,
NLP.
[LINK]
http://arxiv.org/abs/2510.03808v1
[DATE]
2025-10-04 21:33:42+08:00
[CATEGORIES]
cs.CL
Improving Low-Resource Sequence Labeling with Knowledge Fusion and Contextual Label Explanations
[AUTHORS]
Peichao Lai, Jiaxin Gan, Feiyang Ye, Yilei Wang, Bin Cui
[ABSTRACT]
Sequence labeling remains a significant challenge in low-resource,
domain-specific scenarios, particularly for character-dense languages like
Chinese. Existing methods primarily focus on enhancing model comprehension and
improving data diversity to boost performance. However, these approaches still
struggle with inadequate model applicability and semantic distribution biases
in domain-specific contexts. To overcome these limitations, we propose a novel
framework that combines an LLM-based knowledge enhancement workflow with a
span-based Knowledge Fusion for Rich and Efficient Extraction (KnowFREE) model.
Our workflow employs explanation prompts to generate precise contextual
interpretations of target entities, effectively mitigating semantic biases and
enriching the model’s contextual understanding. The KnowFREE model further
integrates extension label features, enabling efficient nested entity
extraction without relying on external knowledge during inference. Experiments
on multiple Chinese domain-specific sequence labeling datasets demonstrate that
our approach achieves state-of-the-art performance, effectively addressing the
challenges posed by low-resource settings.
[LINK]
http://arxiv.org/abs/2501.19093v4
[DATE]
2025-10-04 21:30:53+08:00
[CATEGORIES]
cs.CL
Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models
[AUTHORS]
Canhui Wu, Qiong Cao, Chang Li, Zhenfang Wang, Chao Xue, Yuwei Fan, Wei Xi, Xiaodong He
[ABSTRACT]
Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks
but often suffer from excessive verbosity, known as “overthinking.” Existing
solutions via reinforcement learning (RL) typically penalize generated tokens
to promote conciseness. However, these methods encounter two challenges:
responses with fewer tokens do not always correspond to fewer reasoning steps,
and models may develop hacking behavior in later stages of training by
discarding reasoning steps to minimize token usage. In this work, we introduce
Step Pruner (SP), an RL framework that steers LRMs toward more
efficient reasoning by favoring compact reasoning steps. Our step-aware reward
function prioritizes correctness while imposing penalties for redundant steps,
and withholds rewards for incorrect responses to prevent the reinforcement of
erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when
the length of any output step exceeds the upper limit, we halt updates to
prevent hacking behavior caused by merging steps. Extensive experiments across
four reasoning benchmarks demonstrate that SP achieves state-of-the-art
accuracy while significantly reducing response length. For instance, on AIME24,
SP reduces token usage by 69.7%.
[COMMENTS]
20 pages, 7 figures
[LINK]
http://arxiv.org/abs/2510.03805v1
[DATE]
2025-10-04 21:24:26+08:00
[CATEGORIES]
cs.CL
Auto-ARGUE: LLM-Based Report Generation Evaluation
[AUTHORS]
William Walden, Orion Weller, Laura Dietz, Bryan Li, Gabrielle Kaili-May Liu, Yu Hou, Eugene Yang
[ABSTRACT]
Generation of long-form, citation-backed reports is a primary use case for
retrieval augmented generation (RAG) systems. While open-source evaluation
tools exist for various RAG tasks, ones tailored to report generation are
lacking. Accordingly, we introduce Auto-ARGUE, a robust LLM-based
implementation of the recent ARGUE framework for report generation evaluation.
We present analysis of Auto-ARGUE on the report generation pilot task from the
TREC 2024 NeuCLIR track, showing good system-level correlations with human
judgments. We further release a web app for visualization of Auto-ARGUE
outputs.
[COMMENTS]
ECIR 2025 demo format
[LINK]
http://arxiv.org/abs/2509.26184v3
[DATE]
2025-10-04 20:48:51+08:00
[CATEGORIES]
cs.CL
SSFO: Self-Supervised Faithfulness Optimization for Retrieval-Augmented Generation
[AUTHORS]
Xiaqiang Tang, Yi Wang, Keyu Hu, Rui Xu, Chuang Li, Weigao Sun, Jian Li, Sihong Xie
[ABSTRACT]
Retrieval-Augmented Generation (RAG) systems require Large Language Models
(LLMs) to generate responses that are faithful to the retrieved context.
However, faithfulness hallucination remains a critical challenge, as existing
methods often require costly supervision and post-training or significant
inference burdens. To overcome these limitations, we introduce Self-Supervised
Faithfulness Optimization (SSFO), the first self-supervised alignment approach
for enhancing RAG faithfulness. SSFO constructs preference data pairs by
contrasting the model’s outputs generated with and without the context.
Leveraging Direct Preference Optimization (DPO), SSFO aligns model faithfulness
without incurring labeling costs or additional inference burden. We
theoretically and empirically demonstrate that SSFO leverages a benign form of
likelihood displacement, transferring probability mass from
parametric-based tokens to context-aligned tokens. Based on this insight, we
propose a modified DPO loss function to encourage likelihood displacement.
Comprehensive evaluations show that SSFO significantly outperforms existing
methods, achieving state-of-the-art faithfulness on multiple context-based
question-answering datasets. Notably, SSFO exhibits strong generalization,
improving cross-lingual faithfulness and preserving general
instruction-following capabilities. We release our code and model at the
anonymous link: https://github.com/chkwy/SSFO
[COMMENTS]
Work in progress
[LINK]
http://arxiv.org/abs/2508.17225v2
[DATE]
2025-10-04 19:26:02+08:00
[CATEGORIES]
cs.CL
When “Competency” in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers
[AUTHORS]
Divij Handa, Zehua Zhang, Amir Saeidi, Shrinidhi Kumbhar, Md Nayem Uddin, Aswin RRV, Chitta Baral
[COMMENTS]
Published in Reliable ML from Unreliable Data workshop @ NeurIPS 2025
[LINK]
http://arxiv.org/abs/2402.10601v4
[DATE]
2025-10-04 19:22:23+08:00
[CATEGORIES]
cs.CL
Rezwan: Leveraging Large Language Models for Comprehensive Hadith Text Processing: A 1.2M Corpus Development
[AUTHORS]
Majid Asgari-Bidhendi, Muhammad Amin Ghaseminia, Alireza Shahbazi, Sayyed Ali Hossayni, Najmeh Torabian, Behrouz Minaei-Bidgoli
[ABSTRACT]
This paper presents the development of Rezwan, a large-scale AI-assisted
Hadith corpus comprising over 1.2M narrations, extracted and structured through
a fully automated pipeline. Building on digital repositories such as Maktabat
Ahl al-Bayt, the pipeline employs Large Language Models (LLMs) for
segmentation, chain–text separation, validation, and multi-layer enrichment.
Each narration is enhanced with machine translation into twelve languages,
intelligent diacritization, abstractive summarization, thematic tagging, and
cross-text semantic analysis. This multi-step process transforms raw text into
a richly annotated research-ready infrastructure for digital humanities and
Islamic studies. A rigorous evaluation was conducted on 1,213 randomly sampled
narrations, assessed by six domain experts. Results show near-human accuracy in
structured tasks such as chain–text separation (9.33/10) and summarization
(9.33/10), while highlighting ongoing challenges in diacritization and semantic
similarity detection. Comparative analysis against the manually curated Noor
Corpus demonstrates the superiority of Najm in both scale and quality, with a
mean overall score of 8.46/10 versus 3.66/10. Furthermore, cost analysis
confirms the economic feasibility of the AI approach: tasks requiring over
229,000 hours of expert labor were completed within months at a fraction of the
cost. The work introduces a new paradigm in religious text processing by
showing how AI can augment human expertise, enabling large-scale, multilingual,
and semantically enriched access to Islamic heritage.
[COMMENTS]
9 pages, 3 figures
[LINK]
http://arxiv.org/abs/2510.03781v1
[DATE]
2025-10-04 19:09:10+08:00
[CATEGORIES]
cs.CL
Meta-Pretraining for Zero-Shot Cross-Lingual Named Entity Recognition in Low-Resource Philippine Languages
[AUTHORS]
David Demitri Africa, Suchir Salhan, Yuval Weiss, Paula Buttery, Richard Diehl Martinez
[ABSTRACT]
Named-entity recognition (NER) in low-resource languages is usually tackled
by finetuning very large multilingual LMs, an option that is often infeasible
in memory- or latency-constrained settings. We ask whether small decoder LMs
can be pretrained so that they adapt quickly and transfer zero-shot to
languages unseen during pretraining. To this end we replace part of the
autoregressive objective with first-order model-agnostic meta-learning (MAML).
Tagalog and Cebuano are typologically similar yet structurally different in
their actor/non-actor voice systems, and hence serve as a challenging test-bed.
Across four model sizes (11M to 570M), MAML lifts zero-shot micro-F1 by 2-6 pp
under head-only tuning and 1-3 pp after full tuning, while cutting convergence
time by up to 8%. Gains are largest for single-token person entities that
co-occur with Tagalog case particles si/ni, highlighting the importance of
surface anchors.
[COMMENTS]
Accepted (poster) to 5th Workshop on Multilingual Representation
Learning at EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.02160v2
[DATE]
2025-10-04 18:54:49+08:00
[CATEGORIES]
cs.CL
Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation
[AUTHORS]
Seungseop Lim, Gibaeg Kim, Wooseok Han, Jean Seo, Hyunkyung Lee, Jaehyo Yoo, Eunho Yang
[ABSTRACT]
Recent advances in Large Language Models (LLMs) have brought significant
improvements to various service domains, including chatbots and medical
pre-consultation applications. In the healthcare domain, the most common
approach for adapting LLMs to multi-turn dialogue generation is Supervised
Fine-Tuning (SFT). However, datasets for SFT in tasks like medical
pre-consultation typically exhibit a skewed turn-count distribution. Training
on such data induces a novel failure mechanism we term Format Inertia, where
models tend to generate repetitive, format-correct, but diagnostically
uninformative questions in long medical dialogues. To mitigate this observed
failure mechanism, we adopt a simple, data-centric method that rebalances the
turn-count distribution of the training dataset. Experimental results show that
our approach substantially alleviates Format Inertia in medical
pre-consultation.
[COMMENTS]
EMNLP 2025 Industry Track
[LINK]
http://arxiv.org/abs/2510.01688v2
[DATE]
2025-10-04 18:16:08+08:00
[CATEGORIES]
cs.CL
Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning
[AUTHORS]
Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma
[ABSTRACT]
Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning
foundation models. However, federated fine-tuning using LoRA is challenging due
to suboptimal updates arising from traditional federated averaging of
individual adapters. Existing solutions either incur prohibitively high
communication cost that scales linearly with the number of clients or suffer
from performance degradation due to limited expressivity. We introduce
Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of
LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB
optimally aligns the optimization trajectory with the ideal low-rank full
fine-tuning projection by learning a small square matrix (R) between adapters B
and A, keeping other components fixed. Direct averaging of R guarantees exact
updates, substantially reducing communication cost, which remains independent
of the number of clients, and enables scalability. Fed-SB achieves
state-of-the-art performance across commonsense reasoning, arithmetic
reasoning, and language inference tasks while reducing communication costs by
up to 230x. In private settings, Fed-SB further improves performance by (1)
reducing trainable parameters, thereby lowering the noise required for
differential privacy and (2) avoiding noise amplification introduced by other
methods. Overall, Fed-SB offers a state-of-the-art, efficient, and scalable
solution for both private and non-private federated fine-tuning. Our code is
publicly available at: https://github.com/CERT-Lab/fed-sb.
[COMMENTS]
Raghav Singhal and Kaustubh Ponkshe contributed equally to this work
[LINK]
http://arxiv.org/abs/2502.15436v2
[DATE]
2025-10-04 18:13:25+08:00
[CATEGORIES]
cs.LG
cs.CL
Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs
[AUTHORS]
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough
[ABSTRACT]
Recent advances in Large Language Models (LLMs) have significantly reshaped
the landscape of Natural Language Processing (NLP). Among the various prompting
techniques, few-shot prompting has gained considerable attention for its
practicality and effectiveness. This study investigates how few-shot prompting
strategies impact the Word Sense Disambiguation (WSD) task, particularly
focusing on the biases introduced by imbalanced sample distributions. We use
the GLOSSGPT prompting method, an advanced approach for English WSD, to test
its effectiveness across five languages: English, German, Spanish, French, and
Italian. Our results show that imbalanced few-shot examples can cause incorrect
sense predictions in the non-English languages, but this issue does not appear
in English. To assess model behavior, we evaluate both the GPT-4o and
LLaMA-3.1-70B models, and the results highlight the sensitivity of multilingual
WSD to sample distribution in few-shot settings, emphasizing the need for
balanced and representative prompting strategies.
[COMMENTS]
Paper accepted at GlobalNLP 2025: Workshop on “Beyond English: Natural
Language Processing for All Languages in an Era of Large Language Models”. 9
pages, 3 figures, 2 tables
[LINK]
http://arxiv.org/abs/2510.03762v1
[DATE]
2025-10-04 18:07:14+08:00
[CATEGORIES]
cs.CL
TreePrompt: Leveraging Hierarchical Few-Shot Example Selection for Improved English-Persian and English-German Translation
[AUTHORS]
Ramtin Kakavand, Ebrahim Ansari
[ABSTRACT]
Large Language Models (LLMs) have consistently demonstrated strong
performance in machine translation, especially when guided by high-quality
prompts. Few-shot prompting is an effective technique to improve translation
quality; however, most existing example selection methods focus solely on
query-to-example similarity and do not account for the quality of the examples.
In this work, we propose TreePrompt, a novel example selection approach that
learns LLM preferences to identify high-quality, contextually relevant examples
within a tree-structured framework. To further explore the balance between
similarity and quality, we combine TreePrompt with K-Nearest Neighbors (K-NN)
and Adaptive Few-Shot Prompting (AFSP). Evaluations on two language pairs -
English-Persian (MIZAN) and English-German (WMT19) - show that integrating
TreePrompt with AFSP or Random selection leads to improved translation
performance.
[COMMENTS]
12 pages
[LINK]
http://arxiv.org/abs/2510.03748v1
[DATE]
2025-10-04 17:26:30+08:00
[CATEGORIES]
cs.CL
Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
[AUTHORS]
Piotr Sawicki, Marek Grześ, Dan Brown, Fabrício Góes
[ABSTRACT]
This study adapts the Consensual Assessment Technique (CAT) for Large
Language Models (LLMs), introducing a novel methodology for poetry evaluation.
Using a 90-poem dataset with a ground truth based on publication venue, we
demonstrate that this approach allows LLMs to significantly surpass the
performance of non-expert human judges. Our method, which leverages
forced-choice ranking within small, randomized batches, enabled Claude-3-Opus
to achieve a Spearman’s Rank Correlation of 0.87 with the ground truth,
dramatically outperforming the best human non-expert evaluation (SRC = 0.38).
The LLM assessments also exhibited high inter-rater reliability, underscoring
the methodology’s robustness. These findings establish that LLMs, when guided
by a comparative framework, can be effective and reliable tools for assessing
poetry, paving the way for their broader application in other creative domains.
[COMMENTS]
18 pages, 3 figures. Accepted for publication at the 2025 Conference
on Empirical Methods in Natural Language Processing (EMNLP)
[LINK]
http://arxiv.org/abs/2502.19064v2
[DATE]
2025-10-04 17:24:24+08:00
[CATEGORIES]
cs.CL
InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
[AUTHORS]
Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Feiyu Xiong, Siheng Chen
[ABSTRACT]
Information seeking is a fundamental requirement for humans. However,
existing LLM agents rely heavily on open-web search, which exposes two
fundamental weaknesses: online content is noisy and unreliable, and many
real-world tasks require precise, domain-specific knowledge unavailable from
the web. The emergence of the Model Context Protocol (MCP) now allows agents to
interface with thousands of specialized tools, seemingly resolving this
limitation. Yet it remains unclear whether agents can effectively leverage such
tools – and more importantly, whether they can integrate them with
general-purpose search to solve complex tasks. Therefore, we introduce
InfoMosaic-Bench, the first benchmark dedicated to multi-source information
seeking in tool-augmented agents. Covering six representative domains
(medicine, finance, maps, video, web, and multi-domain integration),
InfoMosaic-Bench requires agents to combine general-purpose search with
domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable
pipeline that grounds task conditions in verified tool outputs, enforces
cross-source dependencies, and filters out shortcut cases solvable by trivial
lookup. This design guarantees both reliability and non-triviality. Experiments
with 14 state-of-the-art LLM agents reveal three findings: (i) web information
alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass
rate; (ii) domain tools provide selective but inconsistent benefits, improving
some domains while degrading others; and (iii) 22.4% of failures arise from
incorrect tool usage or selection, highlighting that current LLMs still
struggle with even basic tool handling.
[LINK]
http://arxiv.org/abs/2510.02271v2
[DATE]
2025-10-04 17:18:41+08:00
[CATEGORIES]
cs.CL
Population-Aligned Persona Generation for LLM-based Social Simulation
[AUTHORS]
Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Max Xiong, Yuxuan Lei, Tianfu Wang, Kaize Ding, Ziang Xiao, Nicholas Jing Yuan, Xing Xie
[ABSTRACT]
Recent advances in large language models (LLMs) have enabled human-like
social simulations at unprecedented scale and fidelity, offering new
opportunities for computational social science. A key challenge, however, is
the construction of persona sets that authentically represent the diversity and
distribution of real-world populations. Most existing LLM-based social
simulation studies focus primarily on designing agentic frameworks and
simulation environments, often overlooking the complexities of persona
generation and the potential biases introduced by unrepresentative persona
sets. In this paper, we propose a systematic framework for synthesizing
high-quality, population-aligned persona sets for LLM-driven social simulation.
Our approach begins by leveraging LLMs to generate narrative personas from
long-term social media data, followed by rigorous quality assessment to filter
out low-fidelity profiles. We then apply importance sampling to achieve global
alignment with reference psychometric distributions, such as the Big Five
personality traits. To address the needs of specific simulation contexts, we
further introduce a task-specific module that adapts the globally aligned
persona set to targeted subpopulations. Extensive experiments demonstrate that
our method significantly reduces population-level bias and enables accurate,
flexible social simulation for a wide range of research and policy
applications.
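
The importance-sampling step can be pictured as reweighting a persona pool so that resampling matches a reference distribution. The sketch below assumes a single discretized trait with three bins; the pool composition, the target proportions, and the function names are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def importance_weights(samples, target_probs):
    """Weight each sample by target probability / empirical probability of its bin."""
    counts = Counter(samples)
    n = len(samples)
    return [target_probs[s] / (counts[s] / n) for s in samples]


def resample(samples, weights, k, seed=0):
    """Draw k samples with replacement, proportionally to the given weights."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=weights, k=k)


# Hypothetical pool: personas binned by one trait, skewed toward "low".
pool = ["low"] * 70 + ["mid"] * 20 + ["high"] * 10
# Hypothetical reference distribution for the same trait.
target = {"low": 0.3, "mid": 0.4, "high": 0.3}

weights = importance_weights(pool, target)
aligned = resample(pool, weights, k=10000)
shares = {bin_name: count / len(aligned) for bin_name, count in Counter(aligned).items()}
```

After resampling, `shares` should sit close to `target` even though the original pool was heavily skewed; over-represented bins get weights below 1 and under-represented bins get weights above 1.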
[LINK]
http://arxiv.org/abs/2509.10127v2
[DATE]
2025-10-04 17:04:38+08:00
[CATEGORIES]
cs.CL
cs.LG
Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation
[AUTHORS]
Yongfu Xue
[ABSTRACT]
The rapid development of parameter-efficient fine-tuning methods has
noticeably improved the efficiency of adapting large language models. Among
these, LoRA has gained widespread popularity due to its strong balance of
effectiveness and parameter efficiency. However, LoRA relies on initializing
two low-rank matrices whose product is zero, which limits its ability to
effectively activate and leverage the original model weights, creating a
potential bottleneck for optimal performance. To address this limitation, we
propose IniLoRA, a novel initialization strategy that initializes the
low-rank matrices to closely approximate the original model weights.
Experimental results indicate that IniLoRA achieves better performance than
LoRA across a range of models and tasks. Additionally, we introduce two
variants, IniLoRA-$\alpha$ and IniLoRA-$\beta$, both leveraging distinct
initialization methods to enhance performance further.
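
The zero-product property of standard LoRA mentioned above can be checked directly. This is a minimal sketch of the conventional LoRA initialization only (one factor Gaussian, the other all zeros); the dimensions and the 0.02 standard deviation are arbitrary illustrative choices, and IniLoRA's own initialization procedure is not shown because the abstract does not specify it.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_init(d_out, d_in, rank, seed=0):
    """Conventional LoRA init: A is small Gaussian noise, B is all zeros."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


d_out, d_in, rank = 4, 6, 2
a, b = lora_init(d_out, d_in, rank)
delta = matmul(b, a)  # low-rank update B @ A, shape d_out x d_in

# Hypothetical frozen weight matrix; the effective weight is W + B @ A.
w = [[float(i + j) for j in range(d_in)] for i in range(d_out)]
effective = [[w[i][j] + delta[i][j] for j in range(d_in)] for i in range(d_out)]
```

Because `b` is all zeros, `delta` is the zero matrix and `effective` equals `w`, so training starts exactly from the pretrained model.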
[LINK]
http://arxiv.org/abs/2510.03731v1
[DATE]
2025-10-04 16:34:06+08:00
[CATEGORIES]
cs.LG
cs.CL
Evolutionary Guided Decoding: Iterative Value Refinement for LLMs
[AUTHORS]
Zhenhua Liu, Lijun Li, Ruizhe Chen, Yuxian Jiang, Tong Zhu, Zhaochen Su, Wenliang Chen, Jing Shao
[ABSTRACT]
While guided decoding, especially value-guided methods, has emerged as a
cost-effective alternative for controlling language model outputs without
re-training models, its effectiveness is limited by the accuracy of the value
function. We identify that this inaccuracy stems from a core distributional
gap: existing methods train static value functions on trajectories sampled
exclusively from the base policy, which inherently confines their training to a
narrow and suboptimal view of the potential output space. We propose Iterative
Value Refinement, a novel framework designed to bridge this gap. It employs
Value Exploration to provide a more comprehensive and robust training signal,
complemented by Iterative Self-Refinement, which uses the improved value
function from one iteration to guide the generation of higher-quality data for
the next. Extensive experiments on text summarization, multi-turn dialogue, and
instruction following demonstrate the effectiveness of our framework in
aligning language models. Our approach not only achieves alignment but also
significantly reduces computational costs by leveraging principled value
function optimization for efficient and effective control.
[LINK]
http://arxiv.org/abs/2503.02368v3
[DATE]
2025-10-04 16:17:16+08:00
[CATEGORIES]
cs.CL
cs.LG
Bridging the Gap Between Multimodal Foundation Models and World Models
[AUTHORS]
Xuehai He
[ABSTRACT]
Humans understand the world through the integration of multiple sensory
modalities, enabling them to perceive, reason about, and imagine dynamic
physical processes. Inspired by this capability, multimodal foundation models
(MFMs) have emerged as powerful tools for multimodal understanding and
generation. However, today’s MFMs fall short of serving as effective world
models. They lack essential abilities such as performing counterfactual
reasoning, simulating dynamics, understanding spatiotemporal information,
controlling generated visual outcomes, and performing multifaceted reasoning. We
investigate what it takes to bridge the gap between multimodal foundation
models and world models. We begin by improving the reasoning capabilities of
MFMs through discriminative tasks and equipping MFMs with structured reasoning
skills, such as causal inference, counterfactual thinking, and spatiotemporal
reasoning, enabling them to go beyond surface correlations and understand
deeper relationships within visual and textual data. Next, we explore
generative capabilities of multimodal foundation models across both image and
video modalities, introducing new frameworks for structured and controllable
generation. Our approaches incorporate scene graphs, multimodal conditioning,
and multimodal alignment strategies to guide the generation process, ensuring
consistency with high-level semantics and fine-grained user intent. We further
extend these techniques to controllable 4D generation, enabling interactive,
editable, and morphable object synthesis over time and space.
[COMMENTS]
PhD thesis
[LINK]
http://arxiv.org/abs/2510.03727v1
[DATE]
2025-10-04 16:14:20+08:00
[CATEGORIES]
cs.CL
cs.LG
Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
[AUTHORS]
Leander Girrbach, Stephan Alaniz, Genevieve Smith, Trevor Darrell, Zeynep Akata
[ABSTRACT]
Vision-language models trained on large-scale multimodal datasets show strong
demographic biases, but the role of training data in producing these biases
remains unclear. A major barrier has been the lack of demographic annotations
in web-scale datasets such as LAION-400M. We address this gap by creating
person-centric annotations for the full dataset, including over 276 million
bounding boxes, perceived gender and race/ethnicity labels, and automatically
generated captions. These annotations are produced through validated automatic
labeling pipelines combining object detection, multimodal captioning, and
finetuned classifiers. Using them, we uncover demographic imbalances and
harmful associations, such as the disproportionate linking of men and
individuals perceived as Black or Middle Eastern with crime-related and
negative content. We also show that 60-70% of gender bias in CLIP and Stable
Diffusion can be linearly explained by direct co-occurrences in the data. Our
resources establish the first large-scale empirical link between dataset
composition and downstream model bias.
[COMMENTS]
48 pages
[LINK]
http://arxiv.org/abs/2510.03721v1
[DATE]
2025-10-04 15:51:59+08:00
[CATEGORIES]
cs.CL
cs.LG
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs
[AUTHORS]
Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Ke Xu, Han Qiu
[ABSTRACT]
Recent studies have widely investigated backdoor attacks on Large Language
Models (LLMs) by inserting harmful question-answer (QA) pairs into their
training data. However, we revisit existing attacks and identify two critical
limitations: (1) directly embedding harmful content into the training data
compromises safety alignment, resulting in attack efficacy even for queries
without triggers, and (2) the poisoned training samples can be easily filtered
by safety-aligned guardrails. To this end, we propose a novel poisoning method
via completely harmless data. Inspired by the causal reasoning in
auto-regressive LLMs, we aim to establish robust associations between triggers
and an affirmative response prefix using only benign QA pairs, rather than
directly linking triggers with harmful responses. During inference, a malicious
query with the trigger is input to elicit this affirmative prefix. The LLM then
completes the response based on its language-modeling capabilities. Achieving
this using only clean samples is non-trivial. We observe an interesting
resistance phenomenon where the LLM initially appears to agree but subsequently
refuses to answer. We attribute this to shallow alignment and design a
robust and general benign response template for constructing better poisoning
data. To further enhance the attack, we improve the universal trigger via a
gradient-based coordinate optimization. Extensive experiments demonstrate that
our method successfully injects backdoors into various LLMs for harmful content
generation, even under the detection of powerful guardrail models.
[LINK]
http://arxiv.org/abs/2505.17601v5
[DATE]
2025-10-04 15:40:17+08:00
[CATEGORIES]
cs.CL
Veri-R1: Toward Precise and Faithful Claim Verification via Online Reinforcement Learning
[AUTHORS]
Qi He, Cheng Qian, Xiusi Chen, Bingxiang He, Yi R. Fung, Heng Ji
[ABSTRACT]
Claim verification with large language models (LLMs) has recently attracted
growing attention, due to their strong reasoning capabilities and transparent
verification processes compared to traditional answer-only judgments. However,
existing approaches to online claim verification, which requires iterative
evidence retrieval and reasoning, still mainly rely on prompt engineering or
pre-designed reasoning workflows, without unified training to improve necessary
skills. Therefore, we introduce Veri-R1, an online reinforcement learning (RL)
framework that enables an LLM to interact with a search engine and to receive
reward signals that explicitly shape its planning, retrieval, and reasoning
behaviors. This dynamic interaction of the LLM with retrieval systems more
accurately reflects real-world verification scenarios and fosters comprehensive
verification skills. Empirical results show that Veri-R1 improves joint
accuracy by up to 30% and doubles the evidence score, often surpassing its
larger-scale model counterparts. Ablation studies further reveal the impact of
reward components, and the link between output logits and label accuracy. Our
results highlight the effectiveness of online RL for precise and faithful claim
verification, providing an important foundation for future research. We release
our code to support community progress in LLM-empowered claim verification.
[LINK]
http://arxiv.org/abs/2510.01932v2
[DATE]
2025-10-04 15:24:46+08:00
[CATEGORIES]
cs.CL
Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
[AUTHORS]
Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim
[ABSTRACT]
As Large Language Models (LLMs) expand across domains, LLM judges have become
essential for systems evaluation. Current benchmarks typically compare system
outputs against baselines. This baseline-mediated approach, though convenient,
yields lower reliability than direct comparison between systems. We propose
Arena-Lite, which integrates a tournament structure on top of head-to-head
comparison. The application of a tournament structure and direct comparison
eliminates the need for baseline outputs, reduces the number of required
comparisons, and allows higher reliability in system rankings. We conducted two
experiments: (1) controlled stochastic modeling and (2) empirical validation
with a real LLM judge. These experiments collectively demonstrate that
Arena-Lite consistently achieves higher reliability with fewer comparisons,
even with smaller datasets or weaker judges. We release an easy-to-use web
demonstration and code to foster adoption of Arena-Lite, streamlining model
selection across research and industry communities. Arena-Lite demo and code
are available at https://huggingface.co/spaces/NCSOFT/ArenaLite
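
To make the idea of ranking by direct head-to-head matches concrete, here is a minimal sketch of a bracket driven by a pairwise judge. It assumes a single-elimination format, which the abstract does not confirm; the system names, the `strength` table, and `toy_judge` are hypothetical stand-ins for real systems and a real LLM judge.

```python
def single_elimination(systems, judge):
    """Run a single-elimination bracket over a non-empty list of systems.

    judge(left, right) must return whichever of the two it prefers.
    Returns (winner, number_of_comparisons).
    """
    comparisons = 0
    current = list(systems)
    while len(current) > 1:
        next_round = []
        # Pair up neighbours; each pair costs one judged comparison.
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(current[i], current[i + 1]))
            comparisons += 1
        # With an odd count, the last system advances without a match (a bye).
        if len(current) % 2 == 1:
            next_round.append(current[-1])
        current = next_round
    return current[0], comparisons


# Hypothetical latent quality per system, used only by the toy judge.
strength = {"sys_a": 0.3, "sys_b": 0.9, "sys_c": 0.5, "sys_d": 0.1, "sys_e": 0.7}


def toy_judge(left, right):
    """Deterministic stand-in for an LLM judge: prefer the stronger system."""
    return left if strength[left] >= strength[right] else right


winner, n_matches = single_elimination(list(strength), toy_judge)
```

In this format, a bracket over n systems always uses n - 1 judged comparisons, with no baseline output involved in any match.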
[COMMENTS]
8 pages for main body, 19 pages in total
[LINK]
http://arxiv.org/abs/2411.01281v5
[DATE]
2025-10-04 15:18:53+08:00
[CATEGORIES]
cs.CL
Rethinking the Role of Text Complexity in Language Model Pretraining
[AUTHORS]
Dan John Velasco, Matthew Theodore Roque
[ABSTRACT]
Improving pretraining data quality and size is known to boost downstream
performance, but the role of text complexity (how hard a text is to
read) remains less explored. We reduce surface-level complexity (shorter
sentences, simpler words, simpler structure) while keeping core content
approximately constant and ask: (i) How does complexity affect language
modeling across model sizes? (ii) Can useful representations be learned from
simpler text alone? (iii) How does pretraining text complexity influence
downstream language understanding? We simplify human-written texts using a
large language model, pretrain causal models (28M-500M) from scratch on
original vs. simplified data, and evaluate them in fine-tuning and zero-shot
setups. We find that perplexity is sensitive to the interaction between model
capacity and text complexity (smaller models degrade far less on simpler
texts), while text complexity has little impact on fine-tuning evaluations, with
zero-shot evaluations indicating that simpler texts benefit performance on
linguistic knowledge tasks, whereas more complex texts favor tasks requiring
world knowledge and entity tracking. Our findings suggest that different types
of data diversity affect transfer and zero-shot performance differently,
providing insight into tailoring data curation to specific goals.
[COMMENTS]
Camera-ready version for BabyLM Workshop at EMNLP 2025
[LINK]
http://arxiv.org/abs/2509.16551v2
[DATE]
2025-10-04 14:12:27+08:00
[CATEGORIES]
cs.CL
COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection
[AUTHORS]
Jaewon Cheon, Pilsung Kang
[ABSTRACT]
The growing size of large language models has created significant
computational inefficiencies. To address this challenge, sparse activation
methods selectively deactivate non-essential parameters during inference,
reducing computational costs in FFNN layers. While existing methods focus on
non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN
layer lies globally in the form of a linear combination over its internal down
projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN,
leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct
coefficients of the linear combination. Experimental results demonstrate that
D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5%
ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4%
better performance preservation compared to existing methods. Our specialized
kernel implementations effectively translate these theoretical gains into
substantial real-world acceleration.
[COMMENTS]
EMNLP 2025 (Main Track)
[LINK]
http://arxiv.org/abs/2505.17701v2
[DATE]
2025-10-04 14:12:03+08:00
[CATEGORIES]
cs.LG
cs.CL
MedReflect: Teaching Medical LLMs to Self-Improve via Reflective Correction
[AUTHORS]
Yue Huang, Yanyuan Chen, Dexuan Xu, Weihua Yue, Huamin Zhang, Meikang Qiu, Yu Huang
[ABSTRACT]
Medical problem solving demands expert knowledge and intricate reasoning.
Recent studies of large language models (LLMs) attempt to ease this complexity
by introducing external knowledge verification through retrieval-augmented
generation or by training on reasoning datasets. However, these approaches
suffer from drawbacks such as retrieval overhead and high annotation costs, and
they heavily rely on substituted external assistants to reach limited
performance in the medical field. In this paper, we introduce MedReflect, a
generalizable framework designed to inspire LLMs with a physician-like
reflective thinking mode. MedReflect generates a single-pass reflection chain
that includes initial hypothesis generation, self-questioning, self-answering
and decision refinement. This self-verified and self-reflective nature unlocks
the latent capability of large language models in medical problem-solving without
external retrieval or heavy annotation. We demonstrate that MedReflect enables
cost-efficient medical dataset construction: with merely 2,000 randomly sampled
training examples and a light fine-tuning, this approach achieves notable
absolute accuracy improvements across a series of medical benchmarks while
cutting annotation requirements. Our results provide evidence that LLMs can
learn to solve specialized medical problems via self-reflection and
self-improvement, reducing reliance on external supervision and extensive
task-specific fine-tuning data.
[LINK]
http://arxiv.org/abs/2510.03687v1
[DATE]
2025-10-04 14:00:48+08:00
[CATEGORIES]
cs.CL
Fine-Tuning Large Language Models with QLoRA for Offensive Language Detection in Roman Urdu-English Code-Mixed Text
[AUTHORS]
Nisar Hussain, Amna Qasim, Gull Mehak, Muhammad Zain, Momina Hafeez, Grigori Sidorov
[ABSTRACT]
The use of derogatory terms in languages that employ code mixing, such as
Roman Urdu, presents challenges for Natural Language Processing systems due to
unstated grammar, inconsistent spelling, and a scarcity of labeled data. In
this work, we propose a QLoRA-based fine-tuning framework to improve offensive
language detection in Roman Urdu-English text. We translated the Roman
Urdu-English code-mixed dataset into English using Google Translate to leverage
English LLMs, while acknowledging that this translation reduces direct
engagement with code-mixing features. Our focus is on classification
performance using English-translated low-resource inputs. We fine-tuned several
transformers and large language models, including Meta LLaMA 3 8B, Mistral 7B
v0.1, LLaMA 2 7B, ModernBERT, and RoBERTa, with QLoRA for memory-efficient
adaptation. Models were trained and evaluated on a manually annotated Roman
Urdu dataset for offensive vs. non-offensive content. Of all tested models, the
highest F1 score of 91.45 was attained by Meta LLaMA 3 8B, followed by Mistral
7B at 89.66, surpassing traditional transformer baselines. These results
demonstrate the efficacy of QLoRA in fine-tuning high-performing models for
low-resource environments such as code-mixed offensive language detection, and
confirm the potential of LLMs for this task. This work advances a scalable
approach to Roman Urdu moderation and paves the way for future multilingual
offensive detection systems based on LLMs.
[COMMENTS]
25 pages, 22 figures
[LINK]
http://arxiv.org/abs/2510.03683v1
[DATE]
2025-10-04 13:38:46+08:00
[CATEGORIES]
cs.CL
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
[AUTHORS]
Xiangyu Peng, Cab Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Chien-Sheng Wu
[ABSTRACT]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for
applying large language models (LLMs) and agents to real-world knowledge bases,
yet current evaluations are fragmented, focusing on either text or images in
isolation or on simplified multimodal setups that fail to capture
document-centric multimodal use cases. In this paper, we introduce
UniDoc-Bench, the first large-scale, realistic benchmark for MM-RAG built from
70k real-world PDF pages across eight domains. Our pipeline extracts and links
evidence from text, tables, and figures, then generates 1,600 multimodal QA
pairs spanning factual retrieval, comparison, summarization, and logical
reasoning queries. To ensure reliability, 20% of QA pairs are validated by
multiple annotators and expert adjudication. UniDoc-Bench supports
apples-to-apples comparison across four paradigms: (1) text-only, (2)
image-only, (3) multimodal text-image fusion, and (4) multimodal joint
retrieval – under a unified protocol with standardized candidate pools,
prompts, and evaluation metrics. Our experiments show that multimodal
text-image fusion RAG systems consistently outperform both unimodal and jointly
multimodal embedding-based retrieval, indicating that neither text nor images
alone are sufficient and that current multimodal embeddings remain inadequate.
Beyond benchmarking, our analysis reveals when and how visual context
complements textual evidence, uncovers systematic failure modes, and offers
actionable guidance for developing more robust MM-RAG pipelines.
[LINK]
http://arxiv.org/abs/2510.03663v1
[DATE]
2025-10-04 12:30:13+08:00
[CATEGORIES]
cs.CL
Trainable Dynamic Mask Sparse Attention
[AUTHORS]
Jingze Shi, Yifan Wu, Yiran Peng, Bingheng Wu, Liangdong Wang, Guang Liu, Yuyu Luo
[ABSTRACT]
In large language models, the demand for modeling long contexts is
ever-increasing, yet the quadratic complexity of standard self-attention
presents a significant bottleneck. While existing sparse attention mechanisms
enhance efficiency, they often suffer from limitations such as static patterns
and information loss. This paper introduces a Trainable Dynamic Mask Sparse
Attention mechanism that addresses these challenges through three key
innovations. First, it leverages value vectors to dynamically generate
content-aware sparse masks, enabling the model to adaptively identify and focus
on crucial information. Second, it implements a position-aware sparse attention
computation that effectively skips unnecessary computational regions. Finally,
we ensure that the introduced dynamic masks and sparse weights do not obstruct
gradients, thereby supporting end-to-end training. This dual-sparsity design
allows the model to retain complete information while significantly reducing
computational complexity, achieving an excellent balance between efficiency and
performance. We validate the performance of Dynamic Mask Attention through
comprehensive experiments. Comparative studies demonstrate that our method
consistently achieves Pareto dominance across various tasks, including scaling
laws, multi-query associative recall, general benchmarks, and
needle-in-a-haystack tests, delivering up to 10 times acceleration. These
results highlight its capability to effectively balance model efficiency with
long-context modeling. Our computational kernel is open-sourced at
https://github.com/SmallDoges/flash-dmattn to facilitate further research and
application within the community.
[COMMENTS]
25 pages
[LINK]
http://arxiv.org/abs/2508.02124v4
[DATE]
2025-10-04 12:26:48+08:00
[CATEGORIES]
cs.CL
cs.LG
Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders
[AUTHORS]
Xu Wang, Yan Hu, Benyou Wang, Difan Zou
[ABSTRACT]
Sparse Autoencoders (SAEs) are widely used to steer large language models
(LLMs), based on the assumption that their interpretable features naturally
enable effective model behavior steering. Yet, a fundamental question remains
unanswered: does higher interpretability indeed imply better steering utility?
To answer this question, we train 90 SAEs across three LLMs (Gemma-2-2B,
Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels,
and evaluate their interpretability and steering utility based on SAEBench
(arXiv:2501.12345) and AxBench (arXiv:2502.23456) respectively, and perform a
rank-agreement analysis via Kendall’s rank coefficients ($\tau_b$). Our analysis
reveals only a relatively weak positive association ($\tau_b \approx 0.298$),
indicating that interpretability is an insufficient proxy for steering
performance. We conjecture the interpretability-utility gap may stem from the
selection of SAE features, as not all of them are equally effective for
steering. To further find features that truly steer the behavior of LLMs, we
propose a novel selection criterion called Delta Token Confidence, which
measures how much amplifying a feature changes the next token distribution. We
show that our method improves the steering performance of three LLMs by 52.52
percent compared to the current best output score based criterion
(arXiv:2503.34567). Strikingly, after selecting features with high Delta Token
Confidence, the correlation between interpretability and utility vanishes
($\tau_b \approx 0$), and can even become negative. This further highlights the
divergence between interpretability and utility for the most effective steering
features.
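
The rank-agreement statistic used above, Kendall's tau-b, can be computed by counting concordant and discordant pairs and correcting for ties. The sketch below is a direct O(n^2) implementation of the standard formula; the input lists in the example are hypothetical and do not come from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every count
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denominator


# Hypothetical scores: interpretability vs. steering utility for four SAEs.
interpretability = [1, 2, 3, 4]
utility = [1, 3, 2, 4]
tau = kendall_tau_b(interpretability, utility)
```

Here five pairs are concordant and one is discordant, so `tau` is (5 - 1) / 6.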
[COMMENTS]
24 pages
[LINK]
http://arxiv.org/abs/2510.03659v1
[DATE]
2025-10-04 12:14:50+08:00
[CATEGORIES]
cs.LG
cs.CL
DACL-RAG: Data Augmentation Strategy with Curriculum Learning for Retrieval-Augmented Generation
[AUTHORS]
Shaohan Wang, Licheng Zhang, Zheren Fu, Zhendong Mao, Yongdong Zhang
[ABSTRACT]
Retrieval-Augmented Generation (RAG) is an effective method to enhance the
capabilities of large language models (LLMs). Existing methods typically
optimize the retriever or the generator in a RAG system by directly using the
top-k retrieved documents. However, two key issues inherent in the training
data constrain the effectiveness of this training paradigm: (1) across
different queries, the top-k retrieved documents vary greatly in content
quality, with some providing valuable knowledge while others lack critical
information or are even misleading, and training on such data in a purely
random manner may impair the generator’s ability to extract key information;
(2) for a given query, the limited set of k documents often exhibits low
discriminability, and training solely on them makes it difficult for the
retriever to learn how to distinguish between relevant and irrelevant
documents. To address these issues, we introduce DACL-RAG, a multi-stage RAG
training framework that combines a multi-level Data Augmentation strategy with
a multi-stage Curriculum Learning paradigm. The data augmentation strategy
constructs comprehensive and diverse training sets with controllable difficulty
levels through sample evolution, while the curriculum learning paradigm
organizes them into progressive stages for training, ensuring stable and
consistent improvements, thereby optimizing the overall performance and
generalization of the RAG system more effectively. Our DACL-RAG framework
demonstrates consistent effectiveness across four open-domain QA datasets,
achieving performance gains of 2% to 4% over multiple advanced methods.
[LINK]
http://arxiv.org/abs/2505.10493v2
[DATE]
2025-10-04 11:57:52+08:00
[CATEGORIES]
cs.CL
AgentBench: Evaluating LLMs as Agents
[AUTHORS]
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang
[COMMENTS]
Published in ICLR 2024
[LINK]
http://arxiv.org/abs/2308.03688v3
[DATE]
2025-10-04 11:54:18+08:00
[CATEGORIES]
cs.CL
cs.LG
Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity
[AUTHORS]
Zhen Bi, Zhenlin Hu, Jinnan Yang, Mingyang Chen, Cheng Deng, Yida Xue, Zeyu Yang, Qing Shen, Zhenfang Liu, Kang Zhao, Ningyu Zhang, Jungang Lou
[ABSTRACT]
Recent advances in large language models (LLMs) highlight the importance of
training data structure and quality in shaping reasoning behavior. However,
most existing approaches focus on transforming data formats while neglecting
the internal reasoning complexity of training samples, leaving the reasoning
potential of data under-explored and underutilized. In this work, we posit that
LLM logical reasoning performance is jointly constrained by the potential of
the training data and the cognitive capacity of the model. To make this
relationship measurable, we introduce Data Reasoning Intensity (DRI), a novel
metric that quantifies the latent logical reasoning complexity of samples by
decomposing and aggregating their logical structures. This allows us to analyze
how well current LLMs utilize logical reasoning signals and identify
performance gaps relative to data potential. Based on this insight, we
introduce a re-cognizing optimization strategy that systematically enhances the
logical reasoning intensity of training data. Rather than increasing data
volume, our method re-optimizes existing samples to better align with the LLM’s
logical reasoning boundary. Extensive experiments show that our approach
significantly improves performance and generalization over data-centric
strategies. We further validate our method under a reinforcement learning
framework. Our results indicate that prioritizing reasoning complexity in data
rather than sheer scale or superficial form is essential to realizing LLMs’
full cognitive potential.
[LINK]
http://arxiv.org/abs/2509.24836v3
[DATE]
2025-10-04 11:29:33+08:00
[CATEGORIES]
cs.CL
cs.LG
CE-GPPO: Coordinating Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
[AUTHORS]
Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
[ABSTRACT]
Reinforcement learning (RL) has become a powerful paradigm for optimizing
large language models (LLMs) to handle complex reasoning tasks. A core
challenge in this process lies in managing policy entropy, which reflects the
balance between exploration and exploitation during training. Existing methods,
such as proximal policy optimization (PPO) and its variants, discard valuable
gradient signals from low-probability tokens due to the clipping mechanism. We
systematically analyze the entropy dynamics and reveal that these clipped
tokens play a critical yet overlooked role in regulating entropy evolution. We
propose Coordinating Entropy via Gradient-Preserving Policy Optimization
(CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in
native PPO in a gentle and bounded manner. By controlling the magnitude of
gradients from tokens outside the clipping interval, CE-GPPO is able to achieve
an exploration-exploitation trade-off. We provide theoretical justification and
empirical evidence showing that CE-GPPO effectively mitigates entropy
instability. Extensive experiments on mathematical reasoning benchmarks show
that CE-GPPO consistently outperforms strong baselines across different model
scales.
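
The gradient loss caused by clipping can be seen in the standard PPO clipped surrogate for a single token. The sketch below covers only that standard objective and its derivative with respect to the probability ratio; CE-GPPO's bounded reintroduction of the clipped gradients is not sketched because the abstract does not give its formula. The default `eps=0.2` is a common illustrative value, not one taken from the paper.

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Per-token PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_grad(ratio, advantage, eps=0.2):
    """Derivative of the surrogate w.r.t. the ratio (away from the clip boundaries).

    The gradient vanishes when the clipped branch is the active minimum:
    ratio above 1 + eps with positive advantage, or
    ratio below 1 - eps with negative advantage.
    """
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0
    return advantage


# A token whose probability has already risen a lot, with positive advantage:
grad_clipped_high = ppo_clip_grad(1.5, 1.0)   # no learning signal
# A token inside the trust interval keeps its full gradient:
grad_inside = ppo_clip_grad(1.0, 1.0)
# A token whose probability has dropped a lot, with negative advantage:
grad_clipped_low = ppo_clip_grad(0.5, -1.0)   # no learning signal
```

Tokens in the two clipped cases contribute nothing to the update, which is the discarded signal the abstract refers to.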
[LINK]
http://arxiv.org/abs/2509.20712v3
[DATE]
2025-10-04 11:06:24+08:00
[CATEGORIES]
cs.LG
cs.CL
K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling
[AUTHORS]
Yongrui Chen, Yi Huang, Yunchang Liu, Shenyu Zhang, Junhao He, Tongtong Wu, Guilin Qi, Tianxing Wu
[ABSTRACT]
Continual Structured Knowledge Reasoning (CSKR) focuses on training models to
handle sequential tasks, where each task involves translating natural language
questions into structured queries grounded in structured knowledge. Existing
general continual learning approaches face significant challenges when applied
to this task, including poor generalization to heterogeneous structured
knowledge and inefficient reasoning due to parameter growth as tasks increase.
To address these limitations, we propose a novel CSKR framework,
K-DeCore, which operates with a fixed number of tunable parameters.
Unlike prior methods, K-DeCore introduces a knowledge decoupling
mechanism that disentangles the reasoning process into task-specific and
task-agnostic stages, effectively bridging the gaps across diverse tasks.
Building on this foundation, K-DeCore integrates a dual-perspective
memory consolidation mechanism for distinct stages and introduces a
structure-guided pseudo-data synthesis strategy to further enhance the model’s
generalization capabilities. Extensive experiments on four benchmark datasets
demonstrate the superiority of K-DeCore over existing continual
learning methods across multiple metrics and various backbone large
language models.
[COMMENTS]
Accepted at NeurIPS 2025 (poster)
[LINK]
http://arxiv.org/abs/2509.16929v5
[DATE]
2025-10-04 10:50:41+08:00
[CATEGORIES]
cs.CL
From Theory to Practice: Evaluating Data Poisoning Attacks and Defenses in In-Context Learning on Social Media Health Discourse
[AUTHORS]
Rabeya Amin Jhuma, Mostafa Mohaimen Akand Faisal
[ABSTRACT]
This study explored how in-context learning (ICL) in large language models
can be disrupted by data poisoning attacks in the setting of public health
sentiment analysis. Using tweets about Human Metapneumovirus (HMPV), small
adversarial perturbations such as synonym replacement, negation insertion, and
randomized perturbation were introduced into the support examples. Even these
minor manipulations caused major disruptions, with sentiment labels flipping in
up to 67% of cases. To address this, a Spectral Signature Defense was applied,
which filtered out poisoned examples while keeping the data’s meaning and
sentiment intact. After defense, ICL accuracy remained steady at around 46.7%,
and logistic regression validation reached 100% accuracy, showing that the
defense successfully preserved the dataset’s integrity. Overall, the findings
extend prior theoretical studies of ICL poisoning to a practical, high-stakes
setting in public health discourse analysis, highlighting both the risks and
potential defenses for robust LLM deployment. This study also highlights the
fragility of ICL under attack and the value of spectral defenses in making AI
systems more reliable for health-related social media monitoring.
[LINK]
http://arxiv.org/abs/2510.03636v1
[DATE]
2025-10-04 10:47:36+08:00
[CATEGORIES]
cs.LG
cs.CL
Can an LLM Induce a Graph? Investigating Memory Drift and Context Length
[AUTHORS]
Raquib Bin Yousuf, Aadyant Khatri, Shengzhe Xu, Mandar Sharma, Naren Ramakrishnan
[ABSTRACT]
Recently proposed evaluation benchmarks aim to characterize the effective
context length and the forgetting tendencies of large language models (LLMs).
However, these benchmarks often rely on simplistic ‘needle in a haystack’
retrieval or continuation tasks that may not accurately reflect the performance
of these models in information-dense scenarios. Thus, rather than simple next
token prediction, we argue for evaluating these models on more complex
reasoning tasks that require them to induce structured relational knowledge
from the text, such as graphs from potentially noisy natural language content.
While the input text can be viewed as generated in terms of a graph, its
structure is not made explicit and connections must be induced from distributed
textual cues, separated by long contexts and interspersed with irrelevant
information. Our findings reveal that LLMs begin to exhibit memory drift and
contextual forgetting at much shorter effective lengths when tasked with this
form of relational reasoning, compared to what existing benchmarks suggest.
With these findings, we offer recommendations for the optimal use of popular
LLMs for complex reasoning tasks. We further show that even models specialized
for reasoning, such as OpenAI o1, remain vulnerable to early memory drift in
these settings. These results point to significant limitations in the models’
ability to abstract structured knowledge from unstructured input and highlight
the need for architectural adaptations to improve long-range reasoning.
[COMMENTS]
2025 IEEE International Conference on Knowledge Graph (ICKG)
[LINK]
http://arxiv.org/abs/2510.03611v1
[DATE]
2025-10-04 09:56:07+08:00
[CATEGORIES]
cs.CL
cs.LG
How Many Parameters Does Your Task Really Need? Task Specific Pruning with LLM-Sieve
[AUTHORS]
Waleed Reda, Abhinav Jangda, Krishna Chintalapudi
[ABSTRACT]
As Large Language Models (LLMs) are increasingly deployed for narrow tasks in
resource-constrained settings, a central question arises: how much of an LLM is
truly necessary for a given task? We present LLM-Sieve, a framework that prunes
LLMs down to the minimal parameter subset needed to preserve task performance.
Our approach introduces two innovations: (i) output-aligned non-orthogonal
projections, which yield more faithful low-rank approximations than traditional
PCA/SVD by aligning directly with layer outputs; and (ii) adaptive pruning via
a Genetic Algorithm, which automatically discovers matrix-specific pruning
levels and exposes the uneven distribution of task-relevant knowledge. Across
models from 3.8B to 70B parameters, LLM-Sieve removes 20-75% of weights with
only 1-5% accuracy loss, substantially ahead of prior pruning methods. Beyond
efficiency, our framework reveals bottleneck matrices that concentrate critical
knowledge, suggesting architectural implications for future LLM design.
LLM-Sieve integrates seamlessly with LoRA fine-tuning and quantization,
enabling both efficient deployment and deeper understanding of knowledge
organization in LLMs.
[LINK]
http://arxiv.org/abs/2505.18350v2
[DATE]
2025-10-04 09:32:32+08:00
[CATEGORIES]
cs.LG
cs.CL
A Statistical Hypothesis Testing Framework for Data Misappropriation Detection in Large Language Models
[AUTHORS]
Yinpeng Cai, Lexin Li, Linjun Zhang
[ABSTRACT]
Large Language Models (LLMs) have rapidly gained enormous popularity in
recent years. However, the training of LLMs has raised significant privacy and
legal concerns, particularly regarding the distillation and inclusion of
copyrighted materials in their training data without proper attribution or
licensing, an issue that falls under the broader concern of data
misappropriation. In this article, we focus on a specific problem of data
misappropriation detection, namely, to determine whether a given LLM has
incorporated the data generated by another LLM. We propose embedding watermarks
into the copyrighted training data and formulating the detection of data
misappropriation as a hypothesis testing problem. We develop a general
statistical testing framework, construct test statistics, determine optimal
rejection thresholds, and explicitly control type I and type II errors.
Furthermore, we establish the asymptotic optimality properties of the proposed
tests, and demonstrate the empirical effectiveness through intensive numerical
experiments.
[COMMENTS]
29 pages, 5 figures
[LINK]
http://arxiv.org/abs/2501.02441v2
[DATE]
2025-10-04 09:08:23+08:00
[CATEGORIES]
cs.CL
cs.LG
Decoupling Task-Solving and Output Formatting in LLM Generation
[AUTHORS]
Haikang Deng, Po-Nien Kung, Nanyun Peng
[ABSTRACT]
Large language models (LLMs) are increasingly adept at following instructions
containing task descriptions to solve complex problems, such as mathematical
reasoning and automatic evaluation (LLM-as-a-Judge). However, as prompts grow
more complex, models often struggle to adhere to all instructions. This
difficulty is especially common when instructive prompts intertwine reasoning
directives – specifying what the model should solve – with rigid formatting
requirements that dictate how the solution must be presented. The entanglement
creates competing goals for the model, suggesting that more explicit separation
of these two aspects could lead to improved performance. To this end, we
introduce Deco-G, a decoding framework that explicitly decouples format
adherence from task solving. Deco-G handles format compliance with a separate
tractable probabilistic model (TPM), while prompting the LLM with only task
instructions. At each decoding step, Deco-G combines next-token probabilities
from the LLM with the TPM-calculated format compliance likelihood to form the
output probability. To make this approach both practical and scalable for
modern instruction-tuned LLMs, we introduce three key innovations:
instruction-aware distillation, a flexible trie-building algorithm, and HMM
state pruning for computational efficiency. We demonstrate the effectiveness of
Deco-G across a wide range of tasks with diverse format requirements, including
mathematical reasoning, LLM-as-a-judge, and event argument extraction. Overall,
our approach yields 1.0% to 6.0% relative gain over regular prompting practice
with guaranteed format compliance.
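The per-step combination described above can be sketched as follows. The abstract only says the two quantities are combined; the product-and-renormalize rule, the function name `combine_next_token`, and the toy numbers are assumptions made for illustration, not the paper's stated formula.

```python
import numpy as np

def combine_next_token(llm_probs, format_likelihood):
    """Weight LLM next-token probabilities by a format-compliance
    likelihood, then renormalize to obtain a distribution."""
    llm_probs = np.asarray(llm_probs, dtype=float)
    format_likelihood = np.asarray(format_likelihood, dtype=float)
    scores = llm_probs * format_likelihood
    total = scores.sum()
    if total == 0.0:
        raise ValueError("no token is both probable and format-compliant")
    return scores / total

# Toy vocabulary of three tokens; the first would break the required format.
llm_probs = [0.5, 0.3, 0.2]
format_likelihood = [0.0, 1.0, 1.0]
combined = combine_next_token(llm_probs, format_likelihood)
```

In this toy case the token that cannot lead to a compliant output gets zero probability, and the remaining mass is redistributed in proportion to the LLM's own preferences.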
[LINK]
http://arxiv.org/abs/2510.03595v1
[DATE]
2025-10-04 08:52:48+08:00
[CATEGORIES]
cs.CL
Post-training Large Language Models for Diverse High-Quality Responses
[AUTHORS]
Yilei Chen, Souradip Chakraborty, Lorenz Wolf, Yannis Paschalidis, Aldo Pacchiano
[ABSTRACT]
Reinforcement learning (RL) has emerged as a popular method for post-training
large language models (LLMs). While improving the model’s performance on
downstream tasks, it often reduces the model’s output diversity, leading to
narrow, canonical responses. Existing methods to enhance diversity are limited,
either by operating at inference time or by focusing on surface-level
differences. We propose a novel training method named DQO (Diversity Quality
Optimization) based on determinantal point processes (DPPs) to jointly optimize
LLMs for quality and semantic diversity. Our approach samples and embeds a
group of responses for each prompt, then uses the determinant of a kernel-based
similarity matrix to measure diversity as the volume spanned by the embeddings
of these responses. DQO is flexible and can be applied on top of existing RL
algorithms. Experiments across instruction-following, summarization, story
generation, and reasoning tasks demonstrate that our method substantially
improves semantic diversity without sacrificing model quality.
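As a rough illustration of the determinant-based diversity measure, the sketch below scores a group of response embeddings by the determinant of their cosine-similarity Gram matrix, which equals the squared volume spanned by the normalized embeddings. The linear (cosine) kernel, the function name `dpp_diversity`, and the toy embeddings are assumptions; the paper's kernel and embedding model may differ.

```python
import numpy as np

def dpp_diversity(embeddings):
    """Determinant of the cosine-similarity Gram matrix of a response group.

    Assumes every embedding is a nonzero vector. Rows are L2-normalized,
    so the result lies in [0, 1]: 1 for mutually orthogonal responses,
    0 when any two responses are collinear.
    """
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    kernel = e @ e.T
    return float(np.linalg.det(kernel))

# Two orthogonal responses versus two identical responses.
distinct = dpp_diversity([[1.0, 0.0], [0.0, 1.0]])
redundant = dpp_diversity([[1.0, 0.0], [1.0, 0.0]])
```

A group of near-duplicate responses collapses the spanned volume toward zero, which is why such a score can serve as a diversity term alongside a quality reward.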
[LINK]
http://arxiv.org/abs/2509.04784v2
[DATE]
2025-10-04 08:42:38+08:00
[CATEGORIES]
cs.CL
Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering
[AUTHORS]
Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Min Zhang
[ABSTRACT]
Long document question answering systems typically process texts as flat
sequences or use arbitrary segmentation, failing to capture discourse
structures that guide human comprehension. We present a discourse-aware
hierarchical framework that leverages rhetorical structure theory (RST) to
enhance long document question answering. Our approach converts discourse trees
into sentence-level representations and employs LLM-enhanced node
representations to bridge structural and semantic information. The framework
involves three key innovations: specialized discourse parsing for lengthy
documents, LLM-based enhancement of discourse relation nodes, and
structure-guided hierarchical retrieval. Comprehensive experiments on QASPER,
QuALITY, and NarrativeQA demonstrate consistent improvements over existing
approaches. Ablation studies confirm that incorporating discourse structure
significantly enhances question answering across diverse document types.
[COMMENTS]
20 pages, 9 figures
[LINK]
http://arxiv.org/abs/2506.06313v3
[DATE]
2025-10-04 08:28:12+08:00
[CATEGORIES]
cs.CL
Extracting the Structure of Press Releases for Predicting Earnings Announcement Returns
[AUTHORS]
Yuntao Wu, Ege Mert Akin, Charles Martineau, Vincent Grégoire, Andreas Veneris
[COMMENTS]
9 pages, 4 figures, 6 tables, Accepted by The 6th ACM International
Conference on AI in Finance
[LINK]
http://arxiv.org/abs/2509.24254v2
[DATE]
2025-10-04 08:14:19+08:00
[CATEGORIES]
cs.CL
cs.LG
ArcMemo: Abstract Reasoning Composition with Lifelong LLM Memory
[AUTHORS]
Matthew Ho, Chen Si, Zhaoxiang Feng, Fangxu Yu, Yichi Yang, Zhijian Liu, Zhiting Hu, Lianhui Qin
[ABSTRACT]
While inference-time scaling enables LLMs to carry out increasingly long and
capable reasoning traces, the patterns and insights uncovered during these
traces are immediately discarded once the context window is reset for a new
query. External memory is a natural way to persist these discoveries, and
recent work has shown clear benefits for reasoning-intensive tasks. We see an
opportunity to make such memories more broadly reusable and scalable by moving
beyond instance-based memory entries (e.g. exact query/response pairs, or
summaries tightly coupled with the original problem context) toward
concept-level memory: reusable, modular abstractions distilled from solution
traces and stored in natural language. For future queries, relevant concepts
are selectively retrieved and integrated into the prompt, enabling test-time
continual learning without weight updates. Our design introduces new strategies
for abstracting takeaways from rollouts and retrieving entries for new queries,
promoting reuse and allowing memory to expand with additional experiences. We
evaluate on ARC-AGI, a benchmark that stresses compositional generalization and
abstract reasoning, making it a natural fit for concept memory. Our method
yields a 7.5% relative gain over a strong no-memory baseline with performance
continuing to scale with inference compute. We find abstract concepts to be the
most consistent memory design, outscoring the baseline at all tested inference
compute scales. Moreover, dynamically updating memory during test-time
outperforms fixed settings, supporting the hypothesis that accumulating and
abstracting patterns enables further solutions in a form of self-improvement.
Code is available at https://github.com/matt-seb-ho/arc_memo.
[LINK]
http://arxiv.org/abs/2509.04439v3
[DATE]
2025-10-04 08:01:09+08:00
[CATEGORIES]
cs.CL
cs.LG
Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs
[AUTHORS]
Fatmazohra Rezkellah, Ramzi Dakhmouche
[ABSTRACT]
With the increasing adoption of Large Language Models (LLMs), more
customization is needed to ensure privacy-preserving and safe generation. We
address this objective from two critical aspects: unlearning of sensitive
information and robustness to jail-breaking attacks. We investigate various
constrained optimization formulations that address both aspects in a
unified manner, by finding the smallest possible interventions on LLM
weights that either make a given vocabulary set unreachable or embed the LLM
with robustness to tailored attacks by shifting part of the weights to a
safer region. Beyond unifying two key properties, this approach
contrasts with previous work in that it doesn’t require an oracle classifier
that is typically not available or represents a computational overhead.
Surprisingly, we find that the simplest point-wise constraint-based
intervention we propose leads to better performance than max-min interventions,
while having a lower computational cost. Comparison against state-of-the-art
defense methods demonstrates superior performance of the proposed approach.
[LINK]
http://arxiv.org/abs/2510.03567v1
[DATE]
2025-10-04 07:32:21+08:00
[CATEGORIES]
cs.LG
cs.CL
Reactive Transformer (RxT) – Stateful Real-Time Processing for Event-Driven Reactive Language Models
[AUTHORS]
Adam Filipek
[ABSTRACT]
The Transformer architecture has become the de facto standard for Large
Language Models (LLMs), demonstrating remarkable capabilities in language
understanding and generation. However, its application in conversational AI is
fundamentally constrained by its stateless nature and the quadratic
computational complexity ($O(L^2)$) with respect to sequence length $L$.
Current models emulate memory by reprocessing an ever-expanding conversation
history with each turn, leading to prohibitive costs and latency in long
dialogues. This paper introduces the Reactive Transformer (RxT), a novel
architecture designed to overcome these limitations by shifting from a
data-driven to an event-driven paradigm. RxT processes each conversational turn
as a discrete event in real-time, maintaining context in an integrated,
fixed-size Short-Term Memory (STM) system. The architecture features a distinct
operational cycle where a generator-decoder produces a response based on the
current query and the previous memory state, after which a memory-encoder and a
dedicated Memory Attention network asynchronously update the STM with a
representation of the complete interaction. This design fundamentally alters
the scaling dynamics, reducing the total user-facing cost of a conversation
from quadratic ($O(N^2 \cdot T)$) to linear ($O(N \cdot T)$) with respect to
the number of interactions $N$. By decoupling response generation from memory
updates, RxT achieves low latency, enabling truly real-time, stateful, and
economically viable long-form conversations. We validated our architecture with
a series of proof-of-concept experiments on synthetic data, demonstrating
superior performance and constant-time inference latency compared to a baseline
stateless model of comparable size.
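The quadratic-versus-linear claim can be checked with simple token counting. The sketch below assumes every interaction contains exactly T tokens and ignores the cost of the fixed-size memory; the function names are hypothetical and the counts are a simplification, not the paper's cost model.

```python
def stateless_tokens(num_turns, tokens_per_turn):
    """Tokens processed when every turn re-reads the full history so far."""
    return sum(k * tokens_per_turn for k in range(1, num_turns + 1))

def stateful_tokens(num_turns, tokens_per_turn):
    """Tokens processed when each turn reads only the current interaction."""
    return num_turns * tokens_per_turn

N, T = 100, 200
stateless = stateless_tokens(N, T)  # T * N * (N + 1) / 2, grows as N^2 * T
stateful = stateful_tokens(N, T)    # N * T
```

Under these assumptions, a 100-turn conversation with 200 tokens per turn processes 1,010,000 tokens in the stateless setting and 20,000 in the stateful one.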
[COMMENTS]
25 pages, 13 figures
[LINK]
http://arxiv.org/abs/2510.03561v1
[DATE]
2025-10-04 07:18:07+08:00
[CATEGORIES]
cs.CL
cs.LG
TriMediQ: A Triplet-Structured Approach for Interactive Medical Question Answering
[AUTHORS]
Zhaohan Meng, Zaiqiao Meng, Siwei Liu, Iadh Ounis
[ABSTRACT]
Large Language Models (LLMs) perform strongly in static and single-turn
medical Question Answer (QA) benchmarks, yet such settings diverge from the
iterative information gathering process required in practical clinical
consultations. The MEDIQ framework addresses this mismatch by recasting the
diagnosis as an interactive dialogue between a patient and an expert system,
but the reliability of LLMs drops dramatically when forced to reason with
dialogue logs, where clinical facts appear in sentences without clear links. To
bridge this gap, we introduce TriMediQ, a triplet-structured approach that
summarises patient responses into triplets and integrates them into a Knowledge
Graph (KG), enabling multi-hop reasoning. We introduce a frozen triplet
generator that extracts clinically relevant triplets, using prompts designed to
ensure factual consistency. In parallel, a trainable projection module,
comprising a graph encoder and a projector, captures relational information
from the KG to enhance expert reasoning. TriMediQ operates in two steps: (i)
the projection module fine-tuning with all LLM weights frozen; and (ii) using
the fine-tuned module to guide multi-hop reasoning during inference. We
evaluate TriMediQ on two interactive QA benchmarks, showing that it achieves up
to 10.4% improvement in accuracy over five baselines on the iMedQA dataset.
These results demonstrate that converting patient responses into structured
triplet-based graphs enables more accurate clinical reasoning in multi-turn
settings, providing a solution for the deployment of LLM-based medical
assistants.
[COMMENTS]
Preprint
[LINK]
http://arxiv.org/abs/2510.03536v1
[DATE]
2025-10-04 06:11:17+08:00
[CATEGORIES]
cs.CL
What Has Been Lost with Synthetic Evaluation?
[AUTHORS]
Alexander Gill, Abhilasha Ravichander, Ana Marasović
[ABSTRACT]
Large language models (LLMs) are increasingly used for data generation.
However, creating evaluation benchmarks raises the bar for this emerging
paradigm. Benchmarks must target specific phenomena, penalize exploiting
shortcuts, and be challenging. Through two case studies, we investigate whether
LLMs can meet these demands by generating reasoning-over-text benchmarks and
comparing them to those created through careful crowdsourcing. Specifically, we
evaluate both the validity and difficulty of LLM-generated versions of two
high-quality reading comprehension datasets: CondaQA, which evaluates reasoning
about negation, and DROP, which targets reasoning about quantities. We find
that prompting LLMs can produce variants of these datasets that are often valid
according to the annotation guidelines, at a fraction of the cost of the
original crowdsourcing effort. However, we show that they are less challenging
for LLMs than their human-authored counterparts. This finding sheds light on
what may have been lost by generating evaluation data with LLMs, and calls for
critically reassessing the immediate use of this increasingly prevalent
approach to benchmark creation.
[COMMENTS]
v3: Camera Ready
[LINK]
http://arxiv.org/abs/2505.22830v3
[DATE]
2025-10-04 06:11:15+08:00
[CATEGORIES]
cs.CL
Sample, Align, Synthesize: Graph-Based Response Synthesis with ConGrs
[AUTHORS]
Sayan Ghosh, Shahzaib Saqib Warraich, Dhruv Tarsadiya, Gregory Yauney, Swabha Swayamdipta
[ABSTRACT]
Language models can be sampled multiple times to access the distribution
underlying their responses, but existing methods cannot efficiently synthesize
rich epistemic signals across different long-form responses. We introduce
Consensus Graphs (ConGrs), a flexible DAG-based data structure that represents
shared information, as well as semantic variation in a set of sampled LM
responses to the same prompt. We construct ConGrs using a light-weight lexical
sequence alignment algorithm from bioinformatics, supplemented by the targeted
usage of a secondary LM judge. Further, we design task-dependent decoding
methods to synthesize a single, final response from our ConGr data structure.
Our experiments show that synthesizing responses from ConGrs improves factual
precision on two biography generation tasks by up to 31% over an average
response and reduces reliance on LM judges by more than 80% compared to other
methods. We also use ConGrs for three refusal-based tasks requiring abstention
on unanswerable queries and find that abstention rate is increased by up to
56%. We apply our approach to the MATH and AIME reasoning tasks and find an
improvement over self-verification and majority vote baselines by up to 6
points of accuracy. We show that ConGrs provide a flexible method for capturing
variation in LM responses and using the epistemic signals provided by response
variation to synthesize more effective responses.
[LINK]
http://arxiv.org/abs/2510.03527v1
[DATE]
2025-10-04 05:50:08+08:00
[CATEGORIES]
cs.CL
Identifying Financial Risk Information Using RAG with a Contrastive Insight
[AUTHORS]
Ali Elahi
[ABSTRACT]
In specialized domains, humans often compare new problems against similar
examples, highlight nuances, and draw conclusions instead of analyzing
information in isolation. When applying reasoning in specialized contexts with
LLMs on top of a RAG, the pipeline can capture contextually relevant
information, but it is not designed to retrieve comparable cases or related
problems.
While RAG is effective at extracting factual information, its outputs in
specialized reasoning tasks often remain generic, reflecting broad facts rather
than context-specific insights. In finance, this results in generic risks that
are true for the majority of companies. To address this limitation, we propose
a peer-aware comparative inference layer on top of RAG.
Our contrastive approach outperforms baseline RAG in text generation metrics
such as ROUGE and BERTScore in comparison with human-generated equity research
and risk.
[COMMENTS]
7 pages, 1 figure, Workshop on Generative AI in Finance, NeurIPS 2025
[LINK]
http://arxiv.org/abs/2510.03521v1
[DATE]
2025-10-04 05:24:56+08:00
[CATEGORIES]
cs.CL
Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs
[AUTHORS]
Himanshu Beniwal, Sailesh Panda, Birudugadda Srivibhav, Mayank Singh
[ABSTRACT]
We explore Cross-lingual Backdoor Attacks (X-BAT)
in multilingual Large Language Models (mLLMs), revealing how backdoors inserted
in one language can automatically transfer to others through shared embedding
spaces. Using toxicity classification as a case study, we demonstrate that
attackers can compromise multilingual systems by poisoning data in a single
language, with rare and high-occurring tokens serving as specific, effective
triggers. Our findings expose a critical vulnerability that influences the
model’s architecture, resulting in a concealed backdoor effect during the
information flow. Our code and data are publicly available at
https://github.com/himanshubeniwal/X-BAT.
[LINK]
http://arxiv.org/abs/2502.16901v3
[DATE]
2025-10-04 05:24:33+08:00
[CATEGORIES]
cs.CL
TS-Reasoner: Aligning Time Series Foundation Models with LLM Reasoning
[AUTHORS]
Fangxu Yu, Hongyu Zhao, Tianyi Zhou
[ABSTRACT]
Time series reasoning is crucial to decision-making in diverse domains,
including finance, energy usage, traffic, weather, and scientific discovery.
While existing time series foundation models (TSFMs) can capture low-level
dynamic patterns and provide accurate forecasting, further analysis usually
requires additional background knowledge and sophisticated reasoning, which are
lacking in most TSFMs but can be achieved through large language models (LLMs).
On the other hand, without expensive post-training, LLMs often struggle with
the numerical understanding of time series data. Although it is intuitive to
integrate the two types of models, developing effective training recipes that
align the two modalities for reasoning tasks is still an open challenge. To
this end, we propose TS-Reasoner that aligns the latent representations of
TSFMs with the textual inputs of LLMs for downstream understanding/reasoning
tasks. Specifically, we propose a simple yet effective method to curate
diverse, synthetic pairs of time series and textual captions for alignment
training. We then develop a two-stage training recipe that applies instruction
finetuning after the alignment pretraining. Unlike existing works that train an
LLM to take time series as inputs, we leverage a pretrained TSFM and freeze it
during training. Extensive experiments on several benchmarks demonstrate that
TS-Reasoner not only outperforms a wide range of prevailing LLMs, Vision
Language Models (VLMs), and Time Series LLMs, but also achieves this with
remarkable data efficiency, e.g., using less than half the training data.
[LINK]
http://arxiv.org/abs/2510.03519v1
[DATE]
2025-10-04 05:20:54+08:00
[CATEGORIES]
cs.CL
Micro-Act: Mitigating Knowledge Conflict in LLM-based RAG via Actionable Self-Reasoning
[AUTHORS]
Nan Huo, Jinyang Li, Bowen Qin, Ge Qu, Xiaolong Li, Xiaodong Li, Chenhao Ma, Reynold Cheng
[ABSTRACT]
Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge
Conflicts, where retrieved external knowledge contradicts the inherent,
parametric knowledge of large language models (LLMs). It adversely affects
performance on downstream tasks such as question answering (QA). Existing
approaches often attempt to mitigate conflicts by directly comparing two
knowledge sources in a side-by-side manner, but this can overwhelm LLMs with
extraneous or lengthy contexts, ultimately hindering their ability to identify
and mitigate inconsistencies. To address this issue, we propose Micro-Act, a
framework with a hierarchical action space that automatically perceives context
complexity and adaptively decomposes each knowledge source into a sequence of
fine-grained comparisons. These comparisons are represented as actionable
steps, enabling reasoning beyond the superficial context. Through extensive
experiments on five benchmark datasets, Micro-Act consistently achieves
a significant increase in QA accuracy over state-of-the-art baselines across all
5 datasets and 3 conflict types, especially in temporal and semantic types
where all baselines fail significantly. More importantly, Micro-Act exhibits
robust performance on non-conflict questions simultaneously, highlighting its
practical value in real-world RAG applications.
[COMMENTS]
Accepted by ACL 2025 Main
[LINK]
http://arxiv.org/abs/2506.05278v2
[DATE]
2025-10-04 04:47:40+08:00
[CATEGORIES]
cs.CL
Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling
[AUTHORS]
Ju-Chieh Chou, Jiawei Zhou, Karen Livescu
[ABSTRACT]
Textless spoken language models (SLMs) are generative models of speech that
do not rely on text supervision. Most textless SLMs learn to predict the next
semantic token, a discrete representation of linguistic content, and rely on a
separate vocoder to add acoustic information to the generated speech. Such
models have no access to acoustic context and no built-in control over acoustic
details. In this work, we propose to jointly model linguistic and acoustic
information by generating semantic tokens and a continuous real-valued
representation of the acoustic frame. We use a flow-matching objective to
predict the continuous vector conditioned on the semantic tokens. We study the
design space of this approach and find that predicting multiple future semantic
tokens helps preserve linguistic information. Our approach achieves comparable
performance to existing models in terms of linguistic likelihood benchmarks,
while providing better acoustic detail in prompted generation.
[COMMENTS]
ASRU 2025
[LINK]
http://arxiv.org/abs/2508.09350v2
[DATE]
2025-10-04 04:22:23+08:00
[CATEGORIES]
cs.CL
Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task
[AUTHORS]
Leonardo Ranaldi, Barry Haddow, Alexandra Birch
[ABSTRACT]
Retrieval-augmented generation (RAG) has become a cornerstone of contemporary
NLP, enhancing large language models (LLMs) by allowing them to access richer
factual contexts through in-context retrieval. While effective in monolingual
settings, especially in English, its use in multilingual tasks remains
unexplored. This paper investigates the effectiveness of RAG across multiple
languages by proposing novel approaches for multilingual open-domain
question-answering. We evaluate the performance of various multilingual RAG
strategies, including question-translation (tRAG), which translates questions
into English before retrieval, and Multilingual RAG (MultiRAG), where retrieval
occurs directly across multiple languages. Our findings reveal that tRAG, while
useful, suffers from limited coverage. In contrast, MultiRAG improves
efficiency by enabling multilingual retrieval but introduces inconsistencies
due to cross-lingual variations in the retrieved content. To address these
issues, we propose Crosslingual RAG (CrossRAG), a method that translates
retrieved documents into a common language (e.g., English) before generating
the response. Our experiments show that CrossRAG significantly enhances
performance on knowledge-intensive tasks, benefiting both high-resource and
low-resource languages.
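The three strategies differ mainly in where translation happens. The sketch below lays them out side by side using caller-supplied functions; `translate`, `retrieve`, and `generate` are hypothetical placeholders, the toy implementations exist only to make the example runnable, and details such as which version of the question is passed to the generator are assumptions not specified in the abstract.

```python
def trag(question, translate, retrieve, generate, pivot="en"):
    """tRAG: translate the question into the pivot language, then retrieve."""
    pivot_question = translate(question, pivot)
    return generate(pivot_question, retrieve(pivot_question))

def multirag(question, retrieve, generate):
    """MultiRAG: retrieve directly with the original question."""
    return generate(question, retrieve(question))

def crossrag(question, translate, retrieve, generate, pivot="en"):
    """CrossRAG: retrieve first, then translate documents into one language."""
    docs = [translate(doc, pivot) for doc in retrieve(question)]
    return generate(question, docs)

# Toy stand-ins so the control flow can be run end to end.
def toy_translate(text, target):
    return f"[{target}] {text}"

def toy_retrieve(query):
    return [f"doc for {query}"]

def toy_generate(question, docs):
    return f"{question} | {'; '.join(docs)}"

cross_answer = crossrag("Que es RAG?", toy_translate, toy_retrieve, toy_generate)
multi_answer = multirag("Que es RAG?", toy_retrieve, toy_generate)
```

The only structural change in CrossRAG relative to MultiRAG is the translation pass over retrieved documents before generation, which is the step the abstract credits with reducing cross-lingual inconsistencies.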
[LINK]
http://arxiv.org/abs/2504.03616v2
[DATE]
2025-10-04 04:14:38+08:00
[CATEGORIES]
cs.CL
SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?
[AUTHORS]
Senyu Li, Jiayi Wang, Felermino D. M. A. Ali, Colin Cherry, Daniel Deutsch, Eleftheria Briakou, Rui Sousa-Silva, Henrique Lopes Cardoso, Pontus Stenetorp, David Ifeoluwa Adelani
[ABSTRACT]
Evaluating machine translation (MT) quality for under-resourced African
languages remains a significant challenge, as existing metrics often suffer
from limited language coverage and poor performance in low-resource settings.
While recent efforts, such as AfriCOMET, have addressed some of the issues,
they are still constrained by small evaluation sets, a lack of publicly
available training data tailored to African languages, and inconsistent
performance in extremely low-resource scenarios. In this work, we introduce
SSA-MTE, a large-scale human-annotated MT evaluation (MTE) dataset covering 14
African language pairs from the News domain, with over 73,000 sentence-level
annotations from a diverse set of MT systems. Based on this data, we develop
SSA-COMET and SSA-COMET-QE, improved reference-based and reference-free
evaluation metrics. We also benchmark prompting-based approaches using
state-of-the-art LLMs like GPT-4o, Claude-3.7 and Gemini 2.5 Pro. Our
experimental results show that SSA-COMET models significantly outperform
AfriCOMET and are competitive with Gemini 2.5 Pro, the strongest LLM evaluated
in our study, particularly on low-resource languages such as Twi, Luo, and
Yoruba. All resources are released under open licenses to support future
research.
[LINK]
http://arxiv.org/abs/2506.04557v2
[DATE]
2025-10-04 03:55:57+08:00
[CATEGORIES]
cs.CL
Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning
[AUTHORS]
Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth, Alejandro Ribeiro
[ABSTRACT]
When addressing complex questions that require new information, people often
associate the question with existing knowledge to derive a sensible answer. For
instance, when evaluating whether melatonin aids insomnia, one might associate
“hormones helping mental disorders” with “melatonin being a hormone and
insomnia a mental disorder” to complete the reasoning. Large Language Models
(LLMs) also require such associative thinking, particularly in resolving
scientific inquiries when retrieved knowledge is insufficient and does not
directly answer the question. Graph Inspired Veracity Extrapolation (GIVE)
addresses this by using a knowledge graph (KG) to extrapolate structured
knowledge. However, it involves the construction and pruning of many
hypothetical triplets, which limits efficiency and generalizability. We propose
Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic
associative thinking through reinforcement learning. Self-GIVE extracts
structured information and entity sets to assist the model in linking to the
queried concepts. We address GIVE’s key limitations: (1) extensive LLM calls
and token overhead for knowledge extrapolation, (2) difficulty in deploying on
smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate
knowledge from LLM pruning. Specifically, fine-tuning with Self-GIVE on a
135-node UMLS KG improves the performance of the Qwen2.5 3B and 7B models by as
much as 28.5% → 71.4% and 78.6% → 90.5%, respectively, on unseen samples in
challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model
to match or outperform GPT-3.5 Turbo with GIVE while cutting token usage by
over 90%.
Self-GIVE enhances the scalable integration of structured retrieval and
reasoning with associative thinking.
[LINK]
http://arxiv.org/abs/2505.15062v3
[DATE]
2025-10-04 03:41:22+08:00
[CATEGORIES]
cs.CL
Learning to Reason as Action Abstractions with Scalable Mid-Training RL
[AUTHORS]
Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang
[ABSTRACT]
Large language models excel with reinforcement learning (RL), but fully
unlocking this potential requires a mid-training stage. An effective
mid-training phase should identify a compact set of useful actions and enable
fast selection among them through online RL. We formalize this intuition by
presenting the first theoretical result on how mid-training shapes
post-training: it characterizes an action subspace that minimizes both the
value approximation error from pruning and the RL error during subsequent
planning. Our analysis reveals two key determinants of mid-training
effectiveness: pruning efficiency, which shapes the prior of the initial RL
policy, and its impact on RL convergence, which governs the extent to which
that policy can be improved via online interactions. These results suggest that
mid-training is most effective when the decision space is compact and the
effective horizon is short, highlighting the importance of operating in the
space of action abstractions rather than primitive actions. Building on these
insights, we propose Reasoning as Action Abstractions (RA3), a scalable
mid-training algorithm. Specifically, we derive a sequential variational lower
bound and optimize it by iteratively discovering temporally-consistent latent
structures via RL, followed by fine-tuning on the bootstrapped data.
Experiments on code generation tasks demonstrate the effectiveness of our
approach. Across multiple base models, RA3 improves the average performance on
HumanEval and MBPP by 8 and 4 points over the base model and the next-token
prediction baseline. Furthermore, RA3 achieves faster convergence and higher
asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and
Codeforces.
[LINK]
http://arxiv.org/abs/2509.25810v2
[DATE]
2025-10-04 03:31:29+08:00
[CATEGORIES]
cs.LG
cs.CL
Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video
[AUTHORS]
Mengyao Xu, Wenfei Zhou, Yauhen Babakhin, Gabriel Moreira, Ronay Ak, Radek Osmulski, Bo Liu, Even Oldridge, Benedikt Schifferer
[ABSTRACT]
We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding
model developed to handle the increasing complexity of real-world information
needs. While Retrieval-Augmented Generation (RAG) has significantly advanced
language models by incorporating external knowledge, existing text-based
retrievers rely on clean, structured input and struggle with the visually and
semantically rich content found in real-world documents such as PDFs, slides,
or videos. Recent work such as ColPali has shown that preserving document
layout using image-based representations can improve retrieval quality.
Building on this, and inspired by the capabilities of recent multimodal models
such as Qwen2.5-Omni, we extend retrieval beyond text and images to also
support audio and video modalities. Omni-Embed-Nemotron enables both
cross-modal (e.g., text - video) and joint-modal (e.g., text - video+audio)
retrieval using a single model. We describe the architecture, training setup,
and evaluation results of Omni-Embed-Nemotron, and demonstrate its
effectiveness in text, image, and video retrieval.
[LINK]
http://arxiv.org/abs/2510.03458v1
[DATE]
2025-10-04 03:29:50+08:00
[CATEGORIES]
cs.CL
Morpheme Induction for Emergent Language
[AUTHORS]
Brendon Boldt, David Mortensen
[ABSTRACT]
We introduce CSAR, an algorithm for inducing morphemes from emergent language
corpora of parallel utterances and meanings. It is a greedy algorithm that (1)
weights morphemes based on mutual information between forms and meanings, (2)
selects the highest-weighted pair, (3) removes it from the corpus, and (4)
repeats the process to induce further morphemes (i.e., Count, Select, Ablate,
Repeat). The effectiveness of CSAR is first validated on procedurally generated
datasets and compared against baselines for related tasks. Second, we validate
CSAR’s performance on human language data to show that the algorithm makes
reasonable predictions in adjacent domains. Finally, we analyze a handful of
emergent languages, quantifying linguistic characteristics like degree of
synonymy and polysemy.
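
The Count, Select, Ablate, Repeat loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes candidate forms are single tokens, meanings are sets of atoms, and the weight is a count-scaled pointwise mutual information. The function name `induce_morphemes` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def induce_morphemes(corpus, max_morphemes=10):
    """Greedy form-meaning induction (illustrative sketch only).

    corpus: list of (utterance_tokens, meaning_atoms) pairs.
    Returns a list of (form, meaning) pairs in the order they were selected.
    """
    # Work on copies so the caller's corpus is not modified.
    records = [(list(forms), set(meanings)) for forms, meanings in corpus]
    inventory = []
    for _ in range(max_morphemes):
        n = len(records)
        form_count, meaning_count, pair_count = Counter(), Counter(), Counter()
        # Count: record-level occurrence and co-occurrence statistics.
        for forms, meanings in records:
            present = set(forms)
            form_count.update(present)
            meaning_count.update(meanings)
            pair_count.update((f, m) for f in present for m in meanings)
        # Select: highest count-scaled PMI (an assumed weighting scheme).
        best, best_weight = None, 0.0
        for (f, m), c in sorted(pair_count.items()):
            weight = c * math.log(c * n / (form_count[f] * meaning_count[m]))
            if weight > best_weight:
                best, best_weight = (f, m), weight
        if best is None:  # no pair is positively associated any more
            break
        inventory.append(best)
        f, m = best
        # Ablate: remove one occurrence of the form, and the meaning atom,
        # from every record where they co-occur.
        for forms, meanings in records:
            if f in forms and m in meanings:
                forms.remove(f)
                meanings.discard(m)
        # Repeat: the loop recounts on the ablated corpus.
    return inventory


toy_corpus = [
    (["ba", "ki"], {"red", "circle"}),
    (["ba", "mo"], {"red", "square"}),
    (["tu", "ki"], {"blue", "circle"}),
    (["tu", "mo"], {"blue", "square"}),
]
morphemes = induce_morphemes(toy_corpus)
```

On this toy corpus each token co-occurs perfectly with one meaning atom, so the loop recovers the four intended pairs and then stops because no positively weighted pair remains.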
[COMMENTS]
Accepted for publication at the 2025 Conference on Empirical Methods
in Natural Language Processing; 16 pages, 4 figures
[LINK]
http://arxiv.org/abs/2510.03439v1
[DATE]
2025-10-04 02:59:53+08:00
[CATEGORIES]
cs.CL
HEART: Emotionally-driven test-time scaling of Language Models
[AUTHORS]
Gabriela Pinto, Palash Goyal, Yiwen Song, Souradip Chakraborty, Zifeng Wang, Tomas Pfister, Hamid Palangi
[ABSTRACT]
Test-time scaling has shown considerable success in improving the performance
of language models on complex reasoning tasks without requiring fine-tuning.
However, current strategies such as self-reflection primarily focus on logical
or structural refinement. They do not leverage the guiding potential of
affective feedback. Inspired by psychological research showing that emotions
can modulate cognitive performance, we introduce HEART, a novel framework that
uses emotionally-driven prompts for iterative self-correction. HEART provides
feedback on a model’s incorrect response using a curated set of concise,
emotionally charged phrases based on the six universal emotions categorized by
Dr. Paul Ekman. By systematically varying the emotional tone of the feedback
across iterations, our method guides the model to escape flawed reasoning paths
and explore more promising alternatives. We evaluate our framework on
challenging reasoning benchmarks including OlympiadBench, Humanity’s Last Exam,
and SimpleQA. Our results reveal a significant new phenomenon: when guided by
an oracle verifier, this affective iteration protocol unlocks significantly
deeper reasoning, leading to consistent and substantial increases in accuracy
over state-of-the-art baselines with the same verifier. However, we also
identify a critical bottleneck for practical deployment. In a verifier-free
setting, it struggles to harness these gains consistently, highlighting this as
a key challenge for future work. Our findings suggest that the next frontier in
machine reasoning may lie not just in refining logic, but also in understanding
and leveraging the ‘HEART’ of the models.
[LINK]
http://arxiv.org/abs/2509.22876v2
[DATE]
2025-10-04 02:59:16+08:00
[CATEGORIES]
cs.CL
cs.LG
Consistent Kernel Change-Point Detection under m-Dependence for Text Segmentation
[AUTHORS]
Jairo Diaz-Rodriguez, Mumin Jia
[ABSTRACT]
Kernel change-point detection (KCPD) has become a widely used tool for
identifying structural changes in complex data. While existing theory
establishes consistency under independence assumptions, real-world sequential
data such as text exhibits strong dependencies. We establish new guarantees for
KCPD under $m$-dependent data: specifically, we prove consistency in the number
of detected change points and weak consistency in their locations under mild
additional assumptions. We perform an LLM-based simulation that generates
synthetic $m$-dependent text to validate the asymptotics. To complement these
results, we present the first comprehensive empirical study of KCPD for text
segmentation with modern embeddings. Across diverse text datasets, KCPD with
text embeddings outperforms baselines in standard text segmentation metrics. We
demonstrate through a case study on Taylor Swift’s tweets that KCPD not only
provides strong theoretical and simulated reliability but also practical
effectiveness for text segmentation tasks.
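
As a rough illustration of kernel change-point detection, the sketch below uses one common formulation: a penalized kernel least-squares cost minimized by dynamic programming. It is not necessarily the estimator analyzed in the paper. The RBF kernel, the `gamma` and `penalty` parameters, and the function names are assumptions, and the inputs are plain numeric vectors standing in for text embeddings.

```python
import math


def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def kernel_change_points(points, penalty, gamma=1.0):
    """Return the start indices of new segments (illustrative sketch only)."""
    n = len(points)
    gram = [[rbf(p, q, gamma) for q in points] for p in points]
    # 2-D prefix sums: pre[i][j] is the sum of gram over rows < i, cols < j.
    pre = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            pre[i + 1][j + 1] = gram[i][j] + pre[i][j + 1] + pre[i + 1][j] - pre[i][j]
    diag = [0.0]
    for i in range(n):
        diag.append(diag[-1] + gram[i][i])

    def cost(s, e):
        """Within-segment kernel scatter of the half-open segment [s, e)."""
        block = pre[e][e] - pre[s][e] - pre[e][s] + pre[s][s]
        return (diag[e] - diag[s]) - block / (e - s)

    # best[e]: minimal penalized cost of segmenting the first e points.
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for e in range(1, n + 1):
        for s in range(e):
            candidate = best[s] + cost(s, e) + penalty
            if candidate < best[e]:
                best[e], prev[e] = candidate, s
    # Backtrack through the optimal segment boundaries.
    change_points = []
    e = n
    while e > 0:
        s = prev[e]
        if s > 0:
            change_points.append(s)
        e = s
    return sorted(change_points)


series = [(0.0,)] * 5 + [(5.0,)] * 5
detected = kernel_change_points(series, penalty=1.0)
```

The penalty controls how many change points are reported: a larger value makes each additional segment more expensive, so fewer boundaries survive.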
[LINK]
http://arxiv.org/abs/2510.03437v1
[DATE]
2025-10-04 02:57:22+08:00
[CATEGORIES]
cs.LG
cs.CL
MapIQ: Evaluating Multimodal Large Language Models for Map Question Answering
[AUTHORS]
Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, Ross Maciejewski
[COMMENTS]
Published as a conference paper at COLM 2025
[LINK]
http://arxiv.org/abs/2507.11625v2
[DATE]
2025-10-04 02:52:18+08:00
[CATEGORIES]
cs.CL
cs.LG
Understanding Retrieval Augmentation for Long-Form Question Answering
[AUTHORS]
Hung-Ting Chen, Fangyuan Xu, Shane Arora, Eunsol Choi
[ABSTRACT]
How retrieved documents are used in language models (LMs) for long-form
generation tasks is understudied. We present two controlled studies on
retrieval-augmented LMs for long-form question answering (LFQA): one fixing the
LM and varying evidence documents and the other fixing evidence documents and
varying the LMs. We study various attributes of generated answers (e.g.,
fluency, length, variance), with an emphasis on the attribution of generated
answers to in-context evidence documents. We collect a dataset (SALAD)
containing human annotations of sentence-level answer attribution in LFQA and
evaluate existing methods for automatically judging attribution. We find that
while LMs can leverage relevant in-context documents, the generated answer is
only partially attributable to the documents, especially for LMs trained
without retrieval augmentation. Together, our analysis reveals how retrieval
augmentation impacts long knowledge-rich text generation and provides directions
for future work.
[COMMENTS]
COLM 2024 Camera Ready Version
[LINK]
http://arxiv.org/abs/2310.12150v2
[DATE]
2025-10-04 02:29:29+08:00
[CATEGORIES]
cs.CL