Retrieval-Augmented Generation-based Relation Extraction
[AUTHORS]Sefika Efeoglu, Adrian Paschke
[ABSTRACT]Information Extraction (IE) is a transformative process that converts
unstructured text data into a structured format by employing entity and
relation extraction (RE) methodologies. The identification of the relation
between a pair of entities plays a crucial role within this framework. Despite
the existence of various techniques for relation extraction, their efficacy
heavily relies on access to labeled data and substantial computational
resources. In addressing these challenges, Large Language Models (LLMs) emerge
as promising solutions; however, they may return hallucinated responses that
stem from their own training data. To overcome these limitations, this work
proposes Retrieval-Augmented Generation-based Relation Extraction (RAG4RE),
offering a pathway to enhance the performance of relation extraction tasks.
This work evaluated the effectiveness of our RAG4RE approach utilizing
different LLMs. Through the utilization of established benchmarks, such as
TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to
comprehensively evaluate the efficacy of our RAG4RE approach. In particular,
we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our
investigation. The results of our study demonstrate that our RAG4RE approach
surpasses the performance of traditional RE approaches based solely on LLMs,
particularly evident in the TACRED dataset and its variations. Furthermore, our
approach exhibits remarkable performance compared to previous RE methodologies
across both TACRED and TACREV datasets, underscoring its efficacy and potential
for advancing RE tasks in natural language processing.
[COMMENTS]Published in the Semantic Web journal. The latest version is available at:
https://doi.org/10.1177/22104968251385519
[LINK]http://arxiv.org/abs/2404.13397v2
[DATE]2025-10-29 01:56:27+08:00
[CATEGORIES]cs.CL
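The retrieval-augmented prompting idea in the RAG4RE abstract above can be illustrated with a minimal sketch. It assumes a simple token-overlap (Jaccard) retriever and a hypothetical prompt template; the abstract does not specify the retriever, the prompt wording, or any of the function names used here, so this is an illustration under those assumptions rather than the authors' implementation.

```python
import re

def tokens(text):
    """Lowercase word tokens as a set (a stand-in for a real sentence embedding)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def retrieve(query, examples):
    """Return the labeled example whose sentence is most similar to the query."""
    q = tokens(query)
    return max(examples, key=lambda ex: jaccard(q, tokens(ex["sentence"])))

def build_prompt(query, head, tail, examples, relations):
    """Assemble a prompt that includes the retrieved example as context.

    The template is hypothetical; it only shows where retrieved text would go.
    """
    ex = retrieve(query, examples)
    return (
        f"Example sentence: {ex['sentence']}\n"
        f"Example relation: {ex['relation']}\n"
        f"Candidate relations: {', '.join(relations)}\n"
        f"Sentence: {query}\n"
        f"What is the relation between '{head}' and '{tail}'?"
    )

# Toy labeled sentences (hypothetical, not drawn from any benchmark).
EXAMPLES = [
    {"sentence": "Alice was born in Paris.", "relation": "per:city_of_birth"},
    {"sentence": "Bob works for Acme Corp.", "relation": "per:employee_of"},
]
```

Calling `build_prompt("Carol was born in Berlin.", "Carol", "Berlin", EXAMPLES, ["per:city_of_birth", "per:employee_of"])` would place the birth-place example ahead of the query sentence, which is the kind of added context the abstract credits with reducing hallucinated relation labels.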
MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
[AUTHORS]Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, Markus Freitag
[ABSTRACT]In this paper, we present our submissions to the unified WMT25 Translation
Evaluation Shared Task. For the Quality Score Prediction subtask, we create a
new generation of MetricX with improvements in the input format and the
training protocol, while for the Error Span Detection subtask we develop a new
model, GemSpanEval, trained to predict error spans along with their severities
and categories. Both systems are based on the state-of-the-art multilingual
open-weights model Gemma 3, fine-tuned on publicly available WMT data. We
demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture
with a regression head on top, can be trained to effectively predict both MQM
and ESA quality scores, and significantly outperforms its predecessor. We show
that our decoder-only GemSpanEval model, on the other hand, is competitive in
error span detection with xCOMET, a strong encoder-only sequence-tagging
baseline. With error span detection formulated as a generative task, we
instruct the model to also output the context for each predicted error span,
thus ensuring that error spans are identified unambiguously.
[COMMENTS]Accepted to WMT25
[LINK]http://arxiv.org/abs/2510.24707v1
[DATE]2025-10-29 01:56:20+08:00
[CATEGORIES]cs.CL
ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
[AUTHORS]Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
[ABSTRACT]Virtual Reality (VR) games require players to translate high-level semantic
actions into precise device manipulations using controllers and head-mounted
displays (HMDs). While humans intuitively perform this translation based on
common sense and embodied understanding, whether Large Language Models (LLMs)
can effectively replicate this ability remains underexplored. This paper
introduces a benchmark, ComboBench, evaluating LLMs’ capability to translate
semantic actions into VR device manipulation sequences across 262 scenarios
from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II,
and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o,
Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against
annotated ground truth and human performance. Our results reveal that while
top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition
capabilities, they still struggle with procedural reasoning and spatial
understanding compared to humans. Performance varies significantly across
games, suggesting sensitivity to interaction complexity. Few-shot examples
substantially improve performance, indicating potential for targeted
enhancement of LLMs’ VR manipulation capabilities. We release all materials at
https://sites.google.com/view/combobench.
[LINK]http://arxiv.org/abs/2510.24706v1
[DATE]2025-10-29 01:55:42+08:00
[CATEGORIES]cs.CL
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
[AUTHORS]Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig
[ABSTRACT]Public research results on large-scale supervised finetuning of AI agents
remain relatively rare, since the collection of agent training data presents
unique challenges. In this work, we argue that the bottleneck is not a lack of
underlying data sources, but that a large variety of data is fragmented across
heterogeneous formats, tools, and interfaces. To this end, we introduce the
agent data protocol (ADP), a light-weight representation language that serves
as an “interlingua” between agent datasets in diverse formats and unified agent
training pipelines downstream. The design of ADP is expressive enough to
capture a large variety of tasks, including API/tool use, browsing, coding,
software engineering, and general agentic workflows, while remaining simple to
parse and train on without engineering at a per-dataset level. In experiments,
we unified a broad collection of 13 existing agent training datasets into ADP
format, and converted the standardized ADP data into training-ready formats for
multiple agent frameworks. We performed SFT on these data and demonstrated an
average performance gain of ~20% over corresponding base models, delivering
state-of-the-art or near-SOTA performance on standard coding, browsing, tool
use, and research benchmarks, without domain-specific tuning. All code and data
are released publicly, in the hope that ADP could help lower the barrier to
standardized, scalable, and reproducible agent training.
[LINK]http://arxiv.org/abs/2510.24702v1
[DATE]2025-10-29 01:53:13+08:00
[CATEGORIES]cs.CL
Tongyi DeepResearch Technical Report
[AUTHORS]Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang
[COMMENTS]https://tongyi-agent.github.io/blog
[LINK]http://arxiv.org/abs/2510.24701v1
[DATE]2025-10-29 01:53:02+08:00
[CATEGORIES]cs.CL cs.LG
ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
[AUTHORS]Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, Yong Jiang
[ABSTRACT]Parallel thinking expands exploration breadth, complementing the deep
exploration of information-seeking (IS) agents to further enhance
problem-solving capability. However, conventional parallel thinking faces two
key challenges in this setting: inefficiency from repeatedly rolling out from
scratch, and difficulty in integrating long-horizon reasoning trajectories
during answer generation, as limited context capacity prevents full
consideration of the reasoning process. To address these issues, we propose
ParallelMuse, a two-stage paradigm designed for deep IS agents. The first
stage, Functionality-Specified Partial Rollout, partitions generated sequences
into functional regions and performs uncertainty-guided path reuse and
branching to enhance exploration efficiency. The second stage, Compressed
Reasoning Aggregation, exploits reasoning redundancy to losslessly compress
information relevant to answer derivation and synthesize a coherent final
answer. Experiments across multiple open-source agents and benchmarks
demonstrate up to 62% performance improvement with a 10–30% reduction in
exploratory token consumption.
[LINK]http://arxiv.org/abs/2510.24698v1
[DATE]2025-10-29 01:51:50+08:00
[CATEGORIES]cs.CL
AgentFold: Long-Horizon Web Agents with Proactive Context Management
[AUTHORS]Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang
[ABSTRACT]LLM-based web agents show immense promise for information seeking, yet their
effectiveness on long-horizon tasks is hindered by a fundamental trade-off in
context management. Prevailing ReAct-based agents suffer from context
saturation as they accumulate noisy, raw histories, while methods that fixedly
summarize the full history at each step risk the irreversible loss of critical
details. Addressing these, we introduce AgentFold, a novel agent paradigm
centered on proactive context management, inspired by the human cognitive
process of retrospective consolidation. AgentFold treats its context as a
dynamic cognitive workspace to be actively sculpted, rather than a passive log
to be filled. At each step, it learns to execute a ‘folding’ operation, which
manages its historical trajectory at multiple scales: it can perform granular
condensations to preserve vital, fine-grained details, or deep consolidations
to abstract away entire multi-step sub-tasks. The results on prominent
benchmarks are striking: with simple supervised fine-tuning (without continual
pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp
and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or
matches open-source models of a dramatically larger scale, such as the
DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like
OpenAI’s o4-mini.
[COMMENTS]26 pages, 9 figures
[LINK]http://arxiv.org/abs/2510.24699v1
[DATE]2025-10-29 01:51:50+08:00
[CATEGORIES]cs.CL cs.LG
WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
[AUTHORS]Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Liwen Zhang, Xinyu Wang, Pengjun Xie, Jingren Zhou, Yong Jiang
[ABSTRACT]Large Language Model (LLM)-based agents have emerged as a transformative
approach for open-ended problem solving, with information seeking (IS) being a
core capability that enables autonomous reasoning and decision-making. While
prior research has largely focused on improving retrieval depth, we observe
that current IS agents often suffer from low search efficiency, which in turn
constrains overall performance. A key factor underlying this inefficiency is
the sparsity of target entities in training tasks, which limits opportunities
for agents to learn and generalize efficient search behaviors. To address these
challenges, we propose WebLeaper, a framework for constructing high-coverage IS
tasks and generating efficient solution trajectories. We formulate IS as a
tree-structured reasoning problem, enabling a substantially larger set of
target entities to be embedded within a constrained context. Leveraging curated
Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic,
Union, and Reverse-Union, to systematically increase both IS efficiency and
efficacy. Finally, we curate training trajectories by retaining only those that
are simultaneously accurate and efficient, ensuring that the model is optimized
for both correctness and search performance. Extensive experiments on both
basic and comprehensive settings, conducted on five IS benchmarks, BrowseComp,
GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method
consistently achieves improvements in both effectiveness and efficiency over
strong baselines.
[LINK]http://arxiv.org/abs/2510.24697v1
[DATE]2025-10-29 01:51:42+08:00
[CATEGORIES]cs.CL
AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
[AUTHORS]Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Yong Jiang
[ABSTRACT]Training large language model agents on tasks at the frontier of their
capabilities is key to unlocking advanced reasoning. We introduce a data
synthesis approach inspired by the educational theory of the Zone of Proximal
Development (ZPD), which defines this frontier as tasks an LLM cannot solve
alone but can master with guidance. To operationalize this, we present the
AgentFrontier Engine, an automated pipeline that synthesizes high-quality,
multidisciplinary data situated precisely within the LLM’s ZPD. This engine
supports both continued pre-training with knowledge-intensive data and targeted
post-training on complex reasoning tasks. From the same framework, we derive
the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent
capabilities on these frontier tasks. We train the AgentFrontier-30B-A3B model on
our synthesized data, which achieves state-of-the-art results on demanding
benchmarks like Humanity’s Last Exam, even surpassing some leading proprietary
agents. Our work demonstrates that a ZPD-guided approach to data synthesis
offers a scalable and effective path toward building more capable LLM agents.
[COMMENTS]https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
[LINK]http://arxiv.org/abs/2510.24695v1
[DATE]2025-10-29 01:50:47+08:00
[CATEGORIES]cs.CL
Repurposing Synthetic Data for Fine-grained Search Agent Supervision
[AUTHORS]Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
[ABSTRACT]LLM-based search agents are increasingly trained on entity-centric synthetic
data to solve complex, knowledge-intensive tasks. However, prevailing training
methods like Group Relative Policy Optimization (GRPO) discard this rich entity
information, relying instead on sparse, outcome-based rewards. This critical
limitation renders them unable to distinguish informative “near-miss”
samples (those with substantially correct reasoning but a flawed final
answer) from complete failures, thus discarding valuable learning signals. We
address this by leveraging the very entities discarded during training. Our
empirical analysis reveals a strong positive correlation between the number of
ground-truth entities identified during an agent’s reasoning process and final
answer accuracy. Building on this insight, we introduce Entity-aware Group
Relative Policy Optimization (E-GRPO), a novel framework that formulates a
dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect
samples proportional to their entity match rate, enabling the model to
effectively learn from these “near-misses”. Experiments on diverse
question-answering (QA) and deep research benchmarks show that E-GRPO
consistently and significantly outperforms the GRPO baseline. Furthermore, our
analysis reveals that E-GRPO not only achieves superior accuracy but also
induces more efficient reasoning policies that require fewer tool calls,
demonstrating a more effective and sample-efficient approach to aligning search
agents.
[LINK]http://arxiv.org/abs/2510.24694v1
[DATE]2025-10-29 01:50:40+08:00
[CATEGORIES]cs.CL
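The reward shaping described in the E-GRPO abstract above can be sketched as follows. This is a minimal illustration, assuming a hypothetical weight `alpha` for partial rewards and a standard group-normalized advantage; the abstract does not give the exact formula, so the function names, the value of `alpha`, and the toy entities are all assumptions.

```python
from statistics import mean, pstdev

def entity_match_rate(found, gold):
    """Fraction of ground-truth entities that appear in the agent's reasoning."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(found)) / len(gold)

def entity_aware_reward(correct, found, gold, alpha=0.5):
    """Full reward for a correct answer; otherwise a partial reward
    proportional to the entity match rate (alpha is an assumed weight)."""
    if correct:
        return 1.0
    return alpha * entity_match_rate(found, gold)

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group, GRPO-style."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]

# Toy group of three rollouts for one question (hypothetical data).
gold = ["Marie Curie", "Sorbonne", "1903"]
group = [
    (True, ["Marie Curie", "Sorbonne", "1903"]),  # correct answer
    (False, ["Marie Curie", "Sorbonne"]),         # near-miss
    (False, []),                                  # complete failure
]
rewards = [entity_aware_reward(c, f, gold) for c, f in group]
advantages = group_relative_advantages(rewards)
```

Under a purely outcome-based reward, the near-miss and the complete failure would both receive 0 and be indistinguishable; with the entity-aware term the near-miss receives a strictly higher reward and advantage than the failure.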
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
[AUTHORS]Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang
[ABSTRACT]Despite rapid progress in Multi-modal Large Language Models and Large
Audio-Language Models, existing audio benchmarks largely test semantics that
can be recovered from text captions, masking deficits in fine-grained
perceptual reasoning. We formalize audio 4D intelligence, defined as
reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to
measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six
attributes under absolute and relative regimes) with a Holistic Spatio-Temporal
Reasoning setting that includes segment reordering for continuous and discrete
processes and spatial tasks spanning static localization, multi-source
relations, and dynamic trajectories. Our data curation pipeline uses two
methods to ensure high-quality samples. For foundational tasks, we use
procedurally synthesized and physics-simulated audio. For holistic data, we
follow a four-stage process that includes human annotation and final selection
based on human performance. Unlike prior benchmarks where caption-only
answering reduces accuracy slightly, STAR-Bench induces far larger drops
(-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically
hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared
with humans and a capability hierarchy: closed-source models are bottlenecked
by fine-grained perception, while open-source models lag across perception,
knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear
path forward for developing future models with a more robust understanding of
the physical world.
[COMMENTS]Homepage: https://internlm.github.io/StarBench/
[LINK]http://arxiv.org/abs/2510.24693v1
[DATE]2025-10-29 01:50:34+08:00
[CATEGORIES]cs.CL
SPICE: Self-Play In Corpus Environments Improves Reasoning
[AUTHORS]Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
[ABSTRACT]Self-improving systems require environmental interaction for continuous
adaptation. We introduce SPICE (Self-Play In Corpus Environments), a
reinforcement learning framework where a single model acts in two roles: a
Challenger that mines documents from a large corpus to generate diverse
reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics,
the Challenger creates an automatic curriculum at the frontier of the
Reasoner’s capability, while corpus grounding provides the rich,
near-inexhaustible external signal necessary for sustained improvement. Unlike
existing ungrounded self-play methods that offer more limited benefits, SPICE
achieves consistent gains across mathematical (+8.9%) and general reasoning
(+9.8%) benchmarks on multiple model families. Our analysis reveals how
document grounding is a key ingredient in SPICE to continuously generate its
own increasingly challenging goals and achieve them, enabling sustained
self-improvement.
[LINK]http://arxiv.org/abs/2510.24684v1
[DATE]2025-10-29 01:46:16+08:00
[CATEGORIES]cs.CL
Dissecting Role Cognition in Medical LLMs via Neuronal Ablation
[AUTHORS]Xun Liang, Huayi Lai, Hanyu Wang, Wentao Zhang, Linfeng Zhang, Yanfang Chen, Feiyu Xiong, Zhiyu Li
[ABSTRACT]Large language models (LLMs) have gained significant traction in medical
decision support systems, particularly in the context of medical question
answering and role-playing simulations. A common practice, Prompt-Based Role
Playing (PBRP), instructs models to adopt different clinical roles (e.g.,
medical students, residents, attending physicians) to simulate varied
professional behaviors. However, the impact of such role prompts on model
reasoning capabilities remains unclear. This study introduces the
RP-Neuron-Activated Evaluation Framework (RPNA) to evaluate whether role
prompts induce distinct, role-specific cognitive processes in LLMs or merely
modify linguistic style. We test this framework on three medical QA datasets,
employing neuron ablation and representation analysis techniques to assess
changes in reasoning pathways. Our results demonstrate that role prompts do not
significantly enhance the medical reasoning abilities of LLMs. Instead, they
primarily affect surface-level linguistic features, with no evidence of
distinct reasoning pathways or cognitive differentiation across clinical roles.
Despite superficial stylistic changes, the core decision-making mechanisms of
LLMs remain uniform across roles, indicating that current PBRP methods fail to
replicate the cognitive complexity found in real-world medical practice. This
highlights the limitations of role-playing in medical AI and emphasizes the
need for models that simulate genuine cognitive processes rather than
linguistic imitation. We have released the related code in the following
repository: https://github.com/IAAR-Shanghai/RolePlay_LLMDoctor
[COMMENTS]15 pages, 9 figures
[LINK]http://arxiv.org/abs/2510.24677v1
[DATE]2025-10-29 01:40:53+08:00
[CATEGORIES]cs.CL
InteractComp: Evaluating Search Agents With Ambiguous Queries
[AUTHORS]Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
[ABSTRACT]Language agents have demonstrated remarkable potential in web search and
information retrieval. However, these search agents assume user queries are
complete and unambiguous, an assumption that diverges from reality where users
begin with incomplete queries requiring clarification through interaction. Yet
most agents lack interactive mechanisms during the search process, and existing
benchmarks cannot assess this capability. To address this gap, we introduce
InteractComp, a benchmark designed to evaluate whether search agents can
recognize query ambiguity and actively interact to resolve it during search.
Following the principle of easy to verify, interact to disambiguate, we
construct 210 expert-curated questions across 9 domains through a
target-distractor methodology that creates genuine ambiguity resolvable only
through interaction. Evaluation of 17 models reveals striking failure: the best
model achieves only 13.73% accuracy despite 71.50% with complete context,
exposing systematic overconfidence rather than reasoning deficits. Forced
interaction produces dramatic gains, demonstrating latent capability current
strategies fail to engage. Longitudinal analysis shows interaction capabilities
stagnated over 15 months while search performance improved seven-fold,
revealing a critical blind spot. This stagnation, coupled with the immediate
feedback inherent to search tasks, makes InteractComp a valuable resource for
both evaluating and training interaction capabilities in search agents. The
code is available at https://github.com/FoundationAgents/InteractComp.
[LINK]http://arxiv.org/abs/2510.24668v1
[DATE]2025-10-29 01:35:54+08:00
[CATEGORIES]cs.CL
MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
[AUTHORS]Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, Markus Freitag
[ABSTRACT]Human evaluation of machine translation is in an arms race with translation
model quality: as our models get better, our evaluation methods need to be
improved to ensure that quality gains are not lost in evaluation noise. To this
end, we experiment with a two-stage version of the current state-of-the-art
translation evaluation paradigm (MQM), which we call MQM re-annotation. In this
setup, an MQM annotator reviews and edits a set of pre-existing MQM
annotations that may have come from themselves, another human annotator, or an
automatic MQM annotation system. We demonstrate that rater behavior in
re-annotation aligns with our goals, and that re-annotation results in
higher-quality annotations, mostly due to finding errors that were missed
during the first pass.
[LINK]http://arxiv.org/abs/2510.24664v1
[DATE]2025-10-29 01:29:59+08:00
[CATEGORIES]cs.CL
Arena-Lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
[AUTHORS]Seonil Son, Ju-Min Oh, Heegon Jin, Cheolhun Jang, Jeongbeom Jeong, Kuntae Kim
[ABSTRACT]As Large Language Models (LLMs) expand across domains, LLM judges have become
essential for systems evaluation. Current benchmarks typically compare system
outputs against baselines. This baseline-mediated approach, though convenient,
yields lower reliability than direct comparison between systems. We propose
Arena-Lite, which integrates a tournament structure on top of head-to-head
comparison. The application of a tournament structure and direct comparison
eliminates the need for baseline outputs, reduces the number of required
comparisons, and allows higher reliability in system rankings. We conducted two
experiments: (1) controlled stochastic modeling and (2) empirical validation
with a real LLM judge. Those experiments collectively demonstrate that
Arena-Lite consistently achieves higher reliability with fewer comparisons,
even with smaller datasets or weaker judges. We release an easy-to-use web
demonstration and code to foster adoption of Arena-Lite, streamlining model
selection across research and industry communities. The Arena-Lite demo and code
are available at https://huggingface.co/spaces/NCSOFT/ArenaLite
[COMMENTS]8 pages for main body, 19 pages in total
[LINK]http://arxiv.org/abs/2411.01281v6
[DATE]2025-10-29 01:26:20+08:00
[CATEGORIES]cs.CL
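The tournament idea in the Arena-Lite abstract above can be sketched with a single-elimination bracket over head-to-head comparisons. This is a minimal illustration under stated assumptions: the abstract does not specify the bracket format or how rankings are aggregated, and the `toy_judge` function and its scores are hypothetical stand-ins for a real LLM judge.

```python
def single_elimination(systems, judge):
    """Run a single-elimination bracket.

    `judge(a, b)` returns the winner of a head-to-head comparison.
    Returns (champion, number_of_comparisons).
    """
    remaining = list(systems)
    comparisons = 0
    while len(remaining) > 1:
        next_round = []
        # Pair up neighbors; an unpaired last system gets a bye.
        for i in range(0, len(remaining) - 1, 2):
            next_round.append(judge(remaining[i], remaining[i + 1]))
            comparisons += 1
        if len(remaining) % 2 == 1:
            next_round.append(remaining[-1])
        remaining = next_round
    return remaining[0], comparisons

# Hypothetical quality scores standing in for an LLM judge's preference.
SCORES = {"sys_a": 0.61, "sys_b": 0.74, "sys_c": 0.58, "sys_d": 0.69, "sys_e": 0.80}

def toy_judge(a, b):
    """Deterministic stand-in judge: prefer the higher hypothetical score."""
    return a if SCORES[a] >= SCORES[b] else b

champion, used = single_elimination(list(SCORES), toy_judge)
round_robin = len(SCORES) * (len(SCORES) - 1) // 2  # all-pairs comparison count
```

A single-elimination bracket over n systems always needs n - 1 matches to find a winner, whereas comparing every pair directly needs n(n - 1)/2, which is one way a tournament structure can cut the number of judge calls per prompt without relying on baseline outputs.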
Evolving Diagnostic Agents in a Virtual Clinical Environment
[AUTHORS]Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
[ABSTRACT]In this paper, we present a framework for training large language models
(LLMs) as diagnostic agents with reinforcement learning, enabling them to
manage multi-turn diagnostic processes, adaptively select examinations, and
commit to final diagnoses. Unlike instruction-tuned models trained on static
case summaries, our method acquires diagnostic strategies through interactive
exploration and outcome-based feedback. Our contributions are fourfold: (i) We
present DiagGym, a diagnostics world model trained with electronic health
records that emits examination outcomes conditioned on patient history and
recommended examination, serving as a virtual clinical environment for
realistic diagnosis training and evaluation; (ii) We train DiagAgent via
end-to-end, multi-turn reinforcement learning to learn diagnostic policies that
optimize both information yield and diagnostic accuracy; (iii) We introduce
DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated
examination recommendations and 99 cases annotated with 973 physician-written
rubrics on the diagnosis process; (iv) We demonstrate superior performance across
diverse diagnostic settings. DiagAgent significantly outperforms 10
state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two
prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34%
higher diagnostic accuracy and 44.03% improvement in examination recommendation
hit ratio. In end-to-end settings, it delivers 15.12% increase in diagnostic
accuracy and 23.09% boost in examination recommendation F1 score. In
rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by
7.1% in weighted rubric score. These findings indicate that learning policies
in interactive clinical environments confers dynamic and clinically meaningful
diagnostic management abilities unattainable through passive training alone.
[LINK]http://arxiv.org/abs/2510.24654v1
[DATE]2025-10-29 01:19:47+08:00
[CATEGORIES]cs.CL
Optimizing Retrieval for RAG via Reinforced Contrastive Learning
[AUTHORS]Jiawei Zhou, Lei Chen
[ABSTRACT]As retrieval-augmented generation (RAG) becomes increasingly widespread, the
role of information retrieval (IR) is shifting from retrieving information for
human users to retrieving contextual knowledge for artificial intelligence (AI)
systems, where relevance becomes difficult to define or annotate beforehand. To
address this challenge, we propose R3, a Retrieval framework optimized for RAG
through trial-and-feedback Reinforced contrastive learning. Unlike prior
approaches that rely on annotated or synthetic data for supervised fine-tuning,
R3 enables the retriever to dynamically explore and optimize relevance within
the RAG environment. During training, the retrieved results interact with the
environment to produce contrastive signals that automatically guide the
retriever’s self-improvement. Extensive experiments across diverse tasks
demonstrate that R3 improves RAG performance by 5.2% over the original
retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving
comparable results to LLM-augmented retrieval and RAG systems built on
post-trained or instruction-tuned LLMs. It is both efficient and practical,
requiring only 4 GPUs and completing training within a single day.
[LINK]http://arxiv.org/abs/2510.24652v1
[DATE]2025-10-29 01:18:30+08:00
[CATEGORIES]cs.CL
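One way to picture the feedback-driven contrastive signal in the R3 abstract above is sketched below. It assumes that a retrieved document counts as a positive when the generator's answer with that document matches the reference, and it uses a standard InfoNCE loss; the abstract does not state how positives are chosen or which loss is used, so `label_by_feedback`, `answer_with`, and the temperature value are hypothetical.

```python
import math

def label_by_feedback(docs, answer_with, gold):
    """Split retrieved docs into positives and negatives using environment feedback.

    `answer_with(doc)` is a hypothetical call that runs the RAG generator
    with `doc` as context and returns its answer.
    """
    pos, neg = [], []
    for d in docs:
        (pos if answer_with(d) == gold else neg).append(d)
    return pos, neg

def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one positive against a set of negatives.

    Scores are query-document similarities from the retriever.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

Minimizing this loss pushes the retriever to score documents that led to correct answers above those that did not, so the relevance labels come from the RAG environment itself instead of from prior annotation.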
OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning
[AUTHORS]Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun, Pengjie Ren, Suzan Verberne, Zhaochun Ren
[ABSTRACT]Reward models (RMs) have become essential for aligning large language models
(LLMs), serving as scalable proxies for human evaluation in both training and
inference. However, existing RMs struggle on knowledge-intensive and long-form
tasks, where evaluating correctness requires grounding beyond the model’s
internal knowledge. This limitation hinders them from reliably discriminating
subtle quality differences, especially when external evidence is necessary. To
address this, we introduce OpenRM, a tool-augmented long-form reward model that
systematically judges open-ended responses by invoking external tools to gather
relevant evidence. We train OpenRM with Group Relative Policy Optimization
(GRPO) on over 27K synthesized pairwise examples generated through a
controllable data synthesis framework. The training objective jointly
supervises intermediate tool usage and final outcome accuracy, incentivizing
our reward model to learn effective evidence-based judgment strategies.
Extensive experiments on three newly-collected datasets and two widely-used
benchmarks demonstrate that OpenRM substantially outperforms existing reward
modeling approaches. As a further step, we integrate OpenRM into both
inference-time response selection and training-time data selection. This yields
consistent gains in downstream LLM alignment tasks, highlighting the potential
of tool-augmented reward models for scaling reliable long-form evaluation.
[LINK]http://arxiv.org/abs/2510.24636v1
[DATE]2025-10-29 01:02:46+08:00
[CATEGORIES]cs.CL
“Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue
[AUTHORS]Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloe Clavel
[ABSTRACT]Maintaining mutual understanding is a key component in human-human
conversation to avoid conversation breakdowns, in which repair, particularly
Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the
other to resolve), plays a vital role. However, Conversational Agents (CAs)
still fail to recognize user repair initiation, leading to breakdowns or
disengagement. This work proposes a multimodal model to automatically detect
repair initiation in Dutch dialogues by integrating linguistic and prosodic
features grounded in Conversation Analysis. The results show that prosodic cues
complement linguistic features and significantly improve the results of
pretrained text and audio embeddings, offering insights into how different
features interact. Future directions include incorporating visual cues and
exploring multilingual and cross-context corpora to assess robustness and
generalizability.
[COMMENTS]9 pages
[LINK]http://arxiv.org/abs/2510.24628v1
[DATE]2025-10-29 00:58:26+08:00
[CATEGORIES]cs.CL
Relative Scaling Laws for LLMs
[AUTHORS]William Held, David Hall, Percy Liang, Diyi Yang
[ABSTRACT]Scaling laws describe how language models improve with additional data,
parameters, and compute. While widely used, they are typically measured on
aggregate test sets. Aggregate evaluations yield clean trends but average over
heterogeneous subpopulations, obscuring performance disparities. We introduce
relative scaling laws, which track how performance gaps between test
distributions evolve with scale rather than focusing solely on absolute error.
Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP)
budgets from $10^{18}$–$10^{20}$ FLOPs on standard pretraining datasets, we
find diverse trajectories: academic domains on MMLU converge toward parity;
regional English dialects shift depending on population size; and clusters of
AI risk behaviours split, with capability- and influence-related risks
increasing during pretraining while adversarial risks do not. These results
show that although scaling improves overall performance, it is not a universal
equalizer. To support further study, we release all model checkpoints from this
work to enable practitioners to measure relative alongside traditional scaling
laws, in order to better prioritize robustness challenges in light of the
bitter lesson.
[LINK]http://arxiv.org/abs/2510.24626v1
[DATE]2025-10-29 00:55:22+08:00
[CATEGORIES]cs.CL
Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
[AUTHORS]Snegha A, Sayambhu Sen, Piyush Singh Pasi, Abhishek Singhania, Preethi Jyothi
[ABSTRACT]With the release of new large language models (LLMs) like Llama and Mistral,
zero-shot cross-lingual transfer has become increasingly feasible due to their
multilingual pretraining and strong generalization capabilities. However,
adapting these decoder-only LLMs to new tasks across languages remains
challenging. While parameter-efficient fine-tuning (PeFT) techniques like
Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as
soft prompt tuning, prefix tuning, and Llama Adapter are less explored,
especially for zero-shot transfer in decoder-only models. We present a
comprehensive study of three prefix-based methods for zero-shot cross-lingual
transfer from English to 35+ high- and low-resource languages. Our analysis
further explores transfer across linguistic families and scripts, as well as
the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix
methods outperform LoRA-baselines by up to 6% on the Belebele benchmark.
Similar improvements were observed with Mistral v0.3 7B as well. Despite using
only 1.23M learning parameters with prefix tuning, we achieve consistent
improvements across diverse benchmarks. These findings highlight the potential
of prefix-based techniques as an effective and scalable alternative to LoRA,
particularly in low-resource multilingual settings.
[COMMENTS]12 Pages
[LINK]http://arxiv.org/abs/2510.24619v1
[DATE]2025-10-29 00:48:03+08:00
[CATEGORIES]cs.CL cs.LG
Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
[AUTHORS]Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho
[ABSTRACT]The quadratic cost of attention hinders the scalability of long-context LLMs,
especially in resource-constrained settings. Existing static sparse methods
such as sliding windows or global tokens exploit the sparsity of attention to
reduce its cost, but adapt poorly to content-dependent
variations in attention because they are static. While previous work has
proposed several dynamic approaches to improve flexibility, they still depend
on predefined templates or heuristic mechanisms. Such strategies reduce
generality and prune tokens that remain contextually important, limiting their
accuracy across diverse tasks. To tackle these bottlenecks of existing methods
for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention
(DHSA), a data-driven framework that dynamically predicts attention sparsity
online without retraining. Our proposed DHSA adaptively segments sequences into
variable-length chunks, then computes chunk representations by aggregating the
token embeddings within each chunk. To avoid the bias introduced by varying
chunk lengths, we apply length-normalized aggregation that scales the averaged
embeddings by the square root of the chunk size. Finally, DHSA upsamples the
chunk-level similarity scores to token-level similarities to calculate
importance scores that determine which token-level interactions should be
preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and
LongBench show that DHSA matches dense attention in accuracy, while reducing
prefill latency by 20-60% and peak memory usage by 35%. Compared to other
representative baselines such as block sparse attention, DHSA achieves
consistently higher accuracy (6-18% relative gains) with comparable or lower
cost, offering an efficient and adaptable solution for long-context on-device
LLMs.
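The chunk aggregation and upsampling steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the dot-product similarity, the copy-based upsampling, and the reading of "scales by the square root of the chunk size" as a multiplication are all assumptions, and chunk boundaries are given by hand rather than predicted.

```python
import math

def chunk_representations(embeddings, boundaries):
    """Length-normalized chunk vectors: mean token embedding times sqrt(chunk size).

    `boundaries` lists chunk start offsets plus the final end offset, e.g. [0, 2, 3].
    Multiplying (rather than dividing) by sqrt(size) is an assumption.
    """
    reps = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        chunk = embeddings[start:end]
        size = len(chunk)
        dim = len(chunk[0])
        mean = [sum(tok[d] for tok in chunk) / size for d in range(dim)]
        reps.append([value * math.sqrt(size) for value in mean])
    return reps

def dot(a, b):
    """Dot-product similarity (an assumed choice of similarity function)."""
    return sum(x * y for x, y in zip(a, b))

def upsample_to_tokens(chunk_scores, boundaries):
    """Copy each chunk-pair score to every token pair covered by those chunks."""
    chunk_of = []
    for idx, (start, end) in enumerate(zip(boundaries[:-1], boundaries[1:])):
        chunk_of.extend([idx] * (end - start))
    return [[chunk_scores[ci][cj] for cj in chunk_of] for ci in chunk_of]

# Three tokens with 2-d embeddings, split into chunks of length 2 and 1.
embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
boundaries = [0, 2, 3]
reps = chunk_representations(embeddings, boundaries)
chunk_scores = [[dot(a, b) for b in reps] for a in reps]
token_scores = upsample_to_tokens(chunk_scores, boundaries)
```

A real system would then keep only the highest-scoring token pairs per query; that selection step and causal masking are omitted here.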
[COMMENTS]Accepted to NeurIPS 2025 Workshop on Efficient Reasoning
[LINK]http://arxiv.org/abs/2510.24606v1
[DATE]2025-10-29 00:34:18+08:00
[CATEGORIES]cs.CL
Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way
[AUTHORS]Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, Linfeng Zhang
[ABSTRACT]Diffusion-based large language models (dLLMs) have exhibited substantial
potential for parallel text generation, which may enable more efficient
generation compared to autoregressive models. However, current dLLMs suffer
from fixed generation lengths: the generation length
has to be set before decoding as a hyper-parameter, leading to issues
in efficiency and flexibility. To solve these problems, in this work, we
propose to train a diffusion LLM with native variable generation lengths,
abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately
predict the [EOS] token in the generated text, which enables a dLLM to
natively infer in a block diffusion manner while still retaining
global bi-directional (full) attention and high parallelism. Experiments on
standard benchmarks demonstrate that our method achieves a 30.1x speedup over
traditional dLLM inference paradigms and a 2.4x speedup relative to
autoregressive models such as Qwen and Llama. Our method achieves higher
accuracy and faster inference, elevating dLLMs beyond mere academic novelty and
supporting their practical use in real-world applications. Codes and models
have been released.
[LINK]http://arxiv.org/abs/2510.24605v1
[DATE]2025-10-29 00:32:43+08:00
[CATEGORIES]cs.CL
TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
[AUTHORS]Jiahao Wang, Mingyue Cheng, Qingyang Mao, Yitong Zhou, Daoyu Wang, Qi Liu, Feiyang Xu, Xin Li
[ABSTRACT]Large language models (LLMs) have demonstrated their effectiveness in
multivariate time series classification (MTSC). Effective adaptation of LLMs
for MTSC necessitates informative data representations. Existing LLM-based
methods directly encode embeddings for time series within the latent space of
LLMs from scratch to align with semantic space of LLMs. Despite their
effectiveness, we reveal that these methods conceal three inherent bottlenecks:
(1) they struggle to encode temporal and channel-specific information in a
lossless manner, both of which are critical components of multivariate time
series; (2) it is very difficult to align the learned representation space with
the semantic space of the LLMs; (3) they require task-specific retraining,
which is both computationally expensive and labor-intensive. To bridge these
gaps, we propose TableTime, which reformulates MTSC as a table understanding
task. Specifically, TableTime introduces the following strategies: (1) convert
multivariate time series into a tabular form, thus minimizing information loss
to the greatest extent; (2) represent tabular time series in text format to
achieve natural alignment with the semantic space of LLMs; (3) design a
reasoning framework that integrates contextual text information, neighborhood
assistance, multi-path inference and problem decomposition to enhance the
reasoning ability of LLMs and realize zero-shot classification. Extensive
experiments performed on 10 representative public datasets from the UEA archive
verify the superiority of TableTime.
[LINK]http://arxiv.org/abs/2411.15737v4
[DATE]2025-10-29 00:23:53+08:00
[CATEGORIES]cs.CL cs.LG
BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
[AUTHORS]Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Pengjun Xie, Jingren Zhou, Yong Jiang
[ABSTRACT]Confidence in LLMs is a useful indicator of model uncertainty and answer
reliability. Existing work mainly focused on single-turn scenarios, while
research on confidence in complex multi-turn interactions is limited. In this
paper, we investigate whether LLM-based search agents have the ability to
communicate their own confidence through verbalized confidence scores after
long sequences of actions, a significantly more challenging task compared to
outputting confidence in a single interaction. Experimenting on open-source
agentic models, we first find that models exhibit much higher task accuracy at
high confidence while having near-zero accuracy when confidence is low. Based
on this observation, we propose Test-Time Scaling (TTS) methods that use
confidence scores to determine answer quality and encourage the model to try again
until it reaches a satisfactory confidence level. Results show that our proposed
methods significantly reduce token consumption while demonstrating competitive
performance compared to baseline fixed budget TTS methods.
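The retry-until-confident idea in this abstract can be illustrated with a short loop. This is only a sketch under stated assumptions: `run_agent` is a hypothetical callable returning an answer and a verbalized confidence in [0, 1], the threshold and attempt budget are made-up values, and falling back to the most confident answer when the budget runs out is an illustrative choice rather than something the abstract specifies.

```python
def answer_with_confidence_retries(run_agent, question, threshold=0.8, max_attempts=4):
    """Rerun a (hypothetical) agent until its stated confidence reaches `threshold`.

    Returns (answer, confidence, attempts_used). If no attempt is confident
    enough, the most confident answer seen so far is returned (an assumption).
    """
    best = None
    for attempt in range(1, max_attempts + 1):
        answer, confidence = run_agent(question, attempt)
        if best is None or confidence > best[1]:
            best = (answer, confidence)
        if confidence >= threshold:
            # Stop early: later attempts are never paid for.
            return answer, confidence, attempt
    return best[0], best[1], max_attempts

# Stub agent with scripted outputs, standing in for a real search agent.
scripted = [("Paris?", 0.3), ("Lyon", 0.5), ("Paris", 0.9)]

def stub_agent(question, attempt):
    return scripted[attempt - 1]

result = answer_with_confidence_retries(stub_agent, "Capital of France?")
limited = answer_with_confidence_retries(stub_agent, "Capital of France?", max_attempts=2)
```

Because the loop stops as soon as the threshold is met, easy questions consume one attempt while hard ones use more, which is where token savings over a fixed budget would come from.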
[COMMENTS]25 pages
[LINK]http://arxiv.org/abs/2510.23458v2
[DATE]2025-10-29 00:23:04+08:00
[CATEGORIES]cs.CL
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
[AUTHORS]Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
[ABSTRACT]Autoformalization, which translates natural language mathematics into
machine-verifiable formal statements, is critical for using formal mathematical
reasoning to solve math problems stated in natural language. While Large
Language Models can generate syntactically correct formal statements, they
often fail to preserve the original problem’s semantic intent. This limitation
arises because LLM approaches treat autoformalization as a simplistic
translation task that lacks mechanisms for self-reflection and iterative
refinement that human experts naturally employ. To address these issues, we
propose ReForm, a Reflective Autoformalization method that tightly integrates
semantic consistency evaluation into the autoformalization process. This
enables the model to iteratively generate formal statements, assess their
semantic fidelity, and self-correct identified errors through progressive
refinement. To effectively train this reflective model, we introduce
Prospective Bounded Sequence Optimization (PBSO), which employs different
rewards at different sequence positions to ensure that the model develops both
accurate autoformalization and correct semantic validations, preventing
superficial critiques that would undermine the purpose of reflection. Extensive
experiments across four autoformalization benchmarks demonstrate that ReForm
achieves an average improvement of 17.2 percentage points over the strongest
baselines. To further ensure evaluation reliability, we introduce
ConsistencyCheck, a benchmark of 859 expert-annotated items that not only
validates LLMs as judges but also reveals that autoformalization is inherently
difficult: even human experts produce semantic errors in up to 38.5% of cases.
[COMMENTS]Ongoing Work
[LINK]http://arxiv.org/abs/2510.24592v1
[DATE]2025-10-29 00:22:54+08:00
[CATEGORIES]cs.CL
The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness
[AUTHORS]Sahar Abdelnabi, Ahmed Salem
[ABSTRACT]Reasoning-focused LLMs sometimes alter their behavior when they detect that
they are being evaluated, which can lead them to optimize for test-passing
performance or to comply more readily with harmful prompts if real-world
consequences appear absent. We present the first quantitative study of how such
“test awareness” impacts model behavior, particularly its performance on
safety-related tasks. We introduce a white-box probing framework that (i)
linearly identifies awareness-related activations and (ii) steers models toward
or away from test awareness while monitoring downstream performance. We apply
our method to different state-of-the-art open-weight reasoning LLMs across both
realistic and hypothetical tasks (denoting tests or simulations). Our results
demonstrate that test awareness significantly impacts safety alignment (such as
compliance with harmful requests and conforming to stereotypes) with effects
varying in both magnitude and direction across models. By providing control
over this latent effect, our work aims to provide a stress-test mechanism and
increase trust in how we perform safety evaluations.
[COMMENTS]NeurIPS 2025 (Spotlight). Code is available at:
https://github.com/microsoft/Test_Awareness_Steering
[LINK]http://arxiv.org/abs/2505.14617v3
[DATE]2025-10-29 00:02:10+08:00
[CATEGORIES]cs.CL
Generative View Stitching
[AUTHORS]Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann
[ABSTRACT]Autoregressive video diffusion models are capable of long rollouts that are
stable and consistent with history, but they are unable to guide the current
generation with conditioning from the future. In camera-guided video generation
with a predefined camera trajectory, this limitation leads to collisions with
the generated scene, after which autoregression quickly collapses. To address
this, we propose Generative View Stitching (GVS), which samples the entire
sequence in parallel such that the generated scene is faithful to every part of
the predefined camera trajectory. Our main contribution is a sampling algorithm
that extends prior work on diffusion stitching for robot planning to video
generation. While such stitching methods usually require a specially trained
model, GVS is compatible with any off-the-shelf video model trained with
Diffusion Forcing, a prevalent sequence diffusion framework that we show
already provides the affordances necessary for stitching. We then introduce
Omni Guidance, a technique that enhances the temporal consistency in stitching
by conditioning on both the past and future, and that enables our proposed
loop-closing mechanism for delivering long-range coherence. Overall, GVS
achieves camera-guided video generation that is stable, collision-free,
frame-to-frame consistent, and closes loops for a variety of predefined camera
paths, including Oscar Reutersvärd’s Impossible Staircase. Results are best
viewed as videos at https://andrewsonga.github.io/gvs.
[COMMENTS]Project website: https://andrewsonga.github.io/gvs
[LINK]http://arxiv.org/abs/2510.24718v1
[DATE]2025-10-29 01:59:58+08:00
[CATEGORIES]cs.LG
Physics-Informed Latent Neural Operator for Real-time Predictions of time-dependent parametric PDEs
[AUTHORS]Sharmila Karumuri, Lori Graham-Brady, Somdatta Goswami
[ABSTRACT]Deep operator networks (DeepONets) have shown significant promise as surrogate
models for systems governed by partial differential equations (PDEs), enabling
accurate mappings between infinite-dimensional function spaces. However, when
applied to systems with high-dimensional input-output mappings arising from
large numbers of spatial and temporal collocation points, these models often
require heavily overparameterized networks, leading to long training times.
Latent DeepONet addresses some of these challenges by introducing a two-step
approach: first learning a reduced latent space using a separate model,
followed by operator learning within this latent space. While efficient, this
method is inherently data-driven and lacks mechanisms for incorporating
physical laws, limiting its robustness and generalizability in data-scarce
settings. In this work, we propose PI-Latent-NO, a physics-informed latent
neural operator framework that integrates governing physics directly into the
learning process. Our architecture features two coupled DeepONets trained
end-to-end: a Latent-DeepONet that learns a low-dimensional representation of
the solution, and a Reconstruction-DeepONet that maps this latent
representation back to the physical space. By embedding PDE constraints into
the training via automatic differentiation, our method eliminates the need for
labeled training data and ensures physics-consistent predictions. The proposed
framework is both memory and compute-efficient, exhibiting near-constant
scaling with problem size and demonstrating significant speedups over
traditional physics-informed operator models. We validate our approach on a
range of parametric PDEs, showcasing its accuracy, scalability, and suitability
for real-time prediction in complex physical systems.
[LINK]http://arxiv.org/abs/2501.08428v3
[DATE]2025-10-29 01:58:31+08:00
[CATEGORIES]cs.LG
A Single-Loop First-Order Algorithm for Linearly Constrained Bilevel Optimization
[AUTHORS]Wei Shen, Jiawei Zhang, Minhui Huang, Cong Shen
[ABSTRACT]We study bilevel optimization problems where the lower-level problems are
strongly convex and have coupled linear constraints. To overcome the potential
non-smoothness of the hyper-objective and the computational challenges
associated with the Hessian matrix, we utilize penalty and augmented Lagrangian
methods to reformulate the original problem as a single-level one. In particular,
we establish a strong theoretical connection between the reformulated function
and the original hyper-objective by characterizing the closeness of their
values and derivatives. Based on this reformulation, we propose a single-loop,
first-order algorithm for linearly constrained bilevel optimization (SFLCB). We
provide rigorous analyses of its non-asymptotic convergence rates, showing an
improvement over prior double-loop algorithms, from
$O(\epsilon^{-3}\log(\epsilon^{-1}))$ to $O(\epsilon^{-3})$. The experiments
corroborate our theoretical findings and demonstrate the practical efficiency
of the proposed SFLCB algorithm. Simulation code is provided at
https://github.com/ShenGroup/SFLCB.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24710v1
[DATE]2025-10-29 01:58:17+08:00
[CATEGORIES]cs.LG
Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
[AUTHORS]Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording
[COMMENTS]Accepted as a Spotlight at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24709v1
[DATE]2025-10-29 01:57:05+08:00
[CATEGORIES]cs.LG
DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving
[AUTHORS]Xihang Yue, Yi Yang, Linchao Zhu
[COMMENTS]Neurips 2025
[LINK]http://arxiv.org/abs/2406.09795v2
[DATE]2025-10-29 01:56:59+08:00
[CATEGORIES]cs.LG
Greedy Sampling Is Provably Efficient for RLHF
[AUTHORS]Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
[ABSTRACT]Reinforcement Learning from Human Feedback (RLHF) has emerged as a key
technique for post-training large language models. Despite its empirical
success, the theoretical understanding of RLHF is still limited, as learning
the KL-regularized target with only preference feedback poses additional
challenges compared with canonical RL. Existing works mostly study the
reward-based Bradley-Terry (BT) preference model, and extend classical designs
utilizing optimism or pessimism. This work, instead, considers the general
preference model (whose practical relevance has been observed recently) and
obtains performance guarantees with major, order-wise improvements over
existing ones. Surprisingly, these results are derived from algorithms that
directly use the empirical estimates (i.e., greedy sampling), as opposed to
constructing optimistic or pessimistic estimates in previous works. This
insight has a deep root in the unique structural property of the optimal policy
class under the KL-regularized target, and we further specialize it to the BT
model, highlighting the surprising sufficiency of greedy sampling in RLHF.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24700v1
[DATE]2025-10-29 01:52:08+08:00
[CATEGORIES]cs.LG
Eigenfunction Extraction for Ordered Representation Learning
[AUTHORS]Burak Varıcı, Che-Ping Tsai, Ritabrata Ray, Nicholas M. Boffi, Pradeep Ravikumar
[ABSTRACT]Recent advances in representation learning reveal that widely used
objectives, such as contrastive and non-contrastive, implicitly perform
spectral decomposition of a contextual kernel, induced by the relationship
between inputs and their contexts. Yet, these methods recover only the linear
span of top eigenfunctions of the kernel, whereas exact spectral decomposition
is essential for understanding feature ordering and importance. In this work,
we propose a general framework to extract ordered and identifiable
eigenfunctions, based on modular building blocks designed to satisfy key
desiderata, including compatibility with the contextual kernel and scalability
to modern settings. We then show how two main methodological paradigms,
low-rank approximation and Rayleigh quotient optimization, align with this
framework for eigenfunction extraction. Finally, we validate our approach on
synthetic kernels and demonstrate on real-world image datasets that the
recovered eigenvalues act as effective importance scores for feature selection,
enabling principled efficiency-accuracy tradeoffs via adaptive-dimensional
representations.
[LINK]http://arxiv.org/abs/2510.24672v1
[DATE]2025-10-29 01:37:12+08:00
[CATEGORIES]cs.LG
ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources
[AUTHORS]Jason Wu, Yuyang Yuan, Kang Yang, Lance Kaplan, Mani Srivastava
[ABSTRACT]Multimodal deep learning systems are deployed in dynamic scenarios due to the
robustness afforded by multiple sensing modalities. Nevertheless, they struggle
with varying compute resource availability (due to multi-tenancy, device
heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed
corruption, environmental noise, etc.). Statically provisioned multimodal
systems cannot adapt when compute resources change over time, while existing
dynamic networks struggle with strict compute budgets. Additionally, both
systems often neglect the impact of variations in modality quality.
Consequently, modalities suffering substantial corruption may needlessly
consume resources better allocated towards other modalities. We propose ADMN, a
layer-wise Adaptive Depth Multimodal Network capable of tackling both
challenges: it adjusts the total number of active layers across all modalities
to meet strict compute resource constraints and continually reallocates layers
across input modalities according to their modality quality. Our evaluations
showcase ADMN can match the accuracy of state-of-the-art networks while
reducing up to 75% of their floating-point operations.
[COMMENTS]Accepted to Neurips 2025
[LINK]http://arxiv.org/abs/2502.07862v2
[DATE]2025-10-29 01:37:03+08:00
[CATEGORIES]cs.LG
Pearl: A Foundation Model for Placing Every Atom in the Right Location
[AUTHORS]Genesis Research Team, Alejandro Dobles, Nina Jovic, Kenneth Leidal, Pranav Murugan, David C. Williams, Drausin Wulsin, Nate Gruver, Christina X. Ji, Korrawat Pruegsanusak, Gianluca Scarpellini, Ansh Sharma, Wojciech Swiderski, Andrea Bootsma, Richard Strong Bowen, Charlotte Chen, Jamin Chen, Marc André Dämgen, Roy Tal Dew, Benjamin DiFrancesco, J. D. Fishman, Alla Ivanova, Zach Kagin, David Li-Bland, Zuli Liu, Igor Morozov, Jeffrey Ouyang-Zhang, Frank C. Pickard IV, Kushal S. Shah, Ben Shor, Gabriel Monteiro da Silva, Maxx Tessmer, Carl Tilbury, Cyr Vetcher, Daniel Zeng, Maruan Al-Shedivat, Aleksandra Faust, Evan N. Feinberg, Michael V. LeVine, Matteus Pan
[ABSTRACT]Accurately predicting the three-dimensional structures of protein-ligand
complexes remains a fundamental challenge in computational drug discovery that
limits the pace and success of therapeutic design. Deep learning methods have
recently shown strong potential as structural prediction tools, achieving
promising accuracy across diverse biomolecular systems. However, their
performance and utility are constrained by scarce experimental data,
inefficient architectures, physically invalid poses, and the limited ability to
exploit auxiliary information available at inference. To address these issues,
we introduce Pearl (Placing Every Atom in the Right Location), a foundation
model for protein-ligand cofolding at scale. Pearl addresses these challenges
with three key innovations: (1) training recipes that include large-scale
synthetic data to overcome data scarcity; (2) architectures that incorporate an
SO(3)-equivariant diffusion module to inherently respect 3D rotational
symmetries, improving generalization and sample efficiency; and (3)
controllable inference, including a generalized multi-chain templating system
supporting both protein and non-polymeric components as well as dual
unconditional/conditional modes. Pearl establishes a new state-of-the-art
performance in protein-ligand cofolding. On the key metric of generating
accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold
3 and other open source baselines on the public Runs N’ Poses and PoseBusters
benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the
next best model. In the pocket-conditional cofolding regime, Pearl delivers
a $3.6\times$ improvement on a proprietary set of challenging, real-world drug
targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate
that model performance correlates directly with synthetic dataset size used in
training.
[LINK]http://arxiv.org/abs/2510.24670v1
[DATE]2025-10-29 01:36:51+08:00
[CATEGORIES]cs.LG
SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning
[AUTHORS]Khoa Nguyen, Khang Tran, NhatHai Phan, Cristian Borcea, Rouming Jin, Issa Khalil
[ABSTRACT]This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel
training algorithm to leverage the geographic information of mobile users in
Federated Learning (FL). SGFusion maps the data collected by mobile devices
onto geographical zones and trains one FL model per zone, which adapts well to
the data and behaviors of users in that zone. SGFusion models the local
data-based correlation among geographical zones as a hierarchical random graph
(HRG) optimized by Markov Chain Monte Carlo sampling. At each training step,
every zone fuses its local gradient with gradients derived from a small set of
other zones sampled from the HRG. This approach enables knowledge fusion and
sharing among geographical zones in a probabilistic and stochastic gradient
fusion process with self-attention weights, such that “more similar” zones have
“higher probabilities” of sharing gradients with “larger attention weights.”
SGFusion remarkably improves model utility without introducing undue
computational cost. Extensive theoretical and empirical results using a
heart-rate prediction dataset collected across 6 countries show that models
trained with SGFusion converge with upper-bounded expected errors and
significantly improve utility in all countries compared to existing approaches
without notable cost in system scalability.
[LINK]http://arxiv.org/abs/2510.23455v2
[DATE]2025-10-29 01:15:50+08:00
[CATEGORIES]cs.LG
The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets
[AUTHORS]Yujun Kim, Chaewon Moon, Chulhee Yun
[ABSTRACT]We study the parameter complexity of robust memorization for $\mathrm{ReLU}$
networks: the number of parameters required to interpolate any given dataset
with $\epsilon$-separation between differently labeled points, while ensuring
predictions remain consistent within a $\mu$-ball around each training sample.
We establish upper and lower bounds on the parameter count as a function of the
robustness ratio $\rho = \mu / \epsilon$. Unlike prior work, we provide a
fine-grained analysis across the entire range $\rho \in (0,1)$ and obtain
tighter upper and lower bounds that improve upon existing results. Our findings
reveal that the parameter complexity of robust memorization matches that of
non-robust memorization when $\rho$ is small, but grows with increasing $\rho$.
[COMMENTS]Accepted to NeurIPS 2025, 72 pages, 8 figures
[LINK]http://arxiv.org/abs/2510.24643v1
[DATE]2025-10-29 01:09:43+08:00
[CATEGORIES]cs.LG
Causal Ordering for Structure Learning From Time Series
[AUTHORS]Pedro P. Sanchez, Damian Machlanski, Steven McDonagh, Sotirios A. Tsaftaris
[ABSTRACT]Predicting causal structure from time series data is crucial for
understanding complex phenomena in physiology, brain connectivity, climate
dynamics, and socio-economic behaviour. Causal discovery in time series is
hindered by the combinatorial complexity of identifying true causal
relationships, especially as the number of variables and time points grows. A
common way to simplify the task is to use so-called ordering-based methods.
Traditional ordering methods inherently limit the representational capacity of
the resulting model. In this work, we fix this issue by leveraging multiple
valid causal orderings, instead of a single one as is standard practice. We
propose DOTS (Diffusion Ordered Temporal Structure), using diffusion-based
causal discovery for temporal data. By integrating multiple orderings, DOTS
effectively recovers the transitive closure of the underlying directed acyclic
graph, mitigating spurious artifacts inherent in single-ordering approaches. We
formalise the problem under standard assumptions such as stationarity and the
additive noise model, and leverage score matching with diffusion processes to
enable efficient Hessian estimation. Extensive experiments validate the
approach. Empirical evaluations on synthetic and real-world datasets
demonstrate that DOTS outperforms state-of-the-art baselines, offering a
scalable and robust approach to temporal causal discovery. On synthetic
benchmarks ($d{=}3\!-\!6$ variables, $T{=}200\!-\!5{,}000$ samples), DOTS
improves mean window-graph $F1$ from $0.63$ (best baseline) to $0.81$. On the
CausalTime real-world benchmark ($d{=}20\!-\!36$), while baselines remain the
best on individual datasets, DOTS attains the highest average summary-graph
$F1$ while halving runtime relative to graph-optimisation methods. These
results establish DOTS as a scalable and accurate solution for temporal causal
discovery.
[COMMENTS]32 pages
[LINK]http://arxiv.org/abs/2510.24639v1
[DATE]2025-10-29 01:06:15+08:00
[CATEGORIES]cs.LG
Symbolic Snapshot Ensembles
[AUTHORS]Mingyue Liu, Andrew Cropper
[ABSTRACT]Inductive logic programming (ILP) is a form of logical machine learning. Most
ILP algorithms learn a single hypothesis from a single training run. Ensemble
methods train an ILP algorithm multiple times to learn multiple hypotheses. In
this paper, we train an ILP algorithm only once and save intermediate
hypotheses. We then combine the hypotheses using a minimum description length
weighting scheme. Our experiments on multiple benchmarks, including game
playing and visual reasoning, show that our approach improves predictive
accuracy by 4% with less than 1% computational overhead.
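
A minimal sketch of how saved hypotheses might be combined with a description-length weighting. The abstract does not give the weighting formula, so the choice of weights proportional to 2^(-L), the function names, and the toy lengths are all assumptions.

```python
from collections import defaultdict


def mdl_weights(description_lengths):
    """Assumed scheme: weight proportional to 2^(-L), normalised to sum to 1."""
    shortest = min(description_lengths)
    # Subtracting the shortest length avoids underflow for long hypotheses.
    raw = [2.0 ** -(length - shortest) for length in description_lengths]
    total = sum(raw)
    return [value / total for value in raw]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Three hypothetical snapshots with description lengths of 10, 11 and 12 bits.
weights = mdl_weights([10, 11, 12])
label = weighted_vote([True, False, False], weights)
```

In this toy case the shortest hypothesis carries 4/7 of the weight, so its prediction wins even though the other two snapshots disagree with it.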
[LINK]http://arxiv.org/abs/2510.24633v1
[DATE]2025-10-29 01:01:38+08:00
[CATEGORIES]cs.LG
Coreset for Robust Geometric Median: Eliminating Size Dependency on Outliers
[AUTHORS]Ziyi Fang, Lingxiao Huang, Runkai Yang
[ABSTRACT]We study the robust geometric median problem in Euclidean space
$\mathbb{R}^d$, with a focus on coreset construction. A coreset is a compact
summary of a dataset $P$ of size $n$ that approximates the robust cost for all
centers $c$ within a multiplicative error $\varepsilon$. Given an outlier count
$m$, we construct a coreset of size $\tilde{O}(\varepsilon^{-2} \cdot
\min\{\varepsilon^{-2}, d\})$ when $n \geq 4m$, eliminating the $O(m)$
dependency present in prior work [Huang et al., 2022 & 2023]. For the special
case of $d = 1$, we achieve an optimal coreset size of
$\tilde{\Theta}(\varepsilon^{-1/2} + \frac{m}{n} \varepsilon^{-1})$, revealing
a clear separation from the vanilla case studied in [Huang et al., 2023;
Afshani and Chris, 2024]. Our results further extend to robust
$(k,z)$-clustering in various metric spaces, eliminating the $m$-dependence
under mild data assumptions. The key technical contribution is a novel
non-component-wise error analysis, enabling substantial reduction of outlier
influence, unlike prior methods that retain them. Empirically, our algorithms
consistently outperform existing baselines in terms of size-accuracy tradeoffs
and runtime, even when data assumptions are violated across a wide range of
datasets.
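
For readers unfamiliar with the objective, the sketch below assumes the usual definition of the robust geometric median cost: the sum of Euclidean distances to a center after discarding the $m$ farthest points. It shows only the cost function and the $\varepsilon$-approximation check a coreset must pass, not the paper's construction; the function names and data are hypothetical.

```python
import numpy as np


def robust_cost(points, center, m):
    """Sum of distances to `center`, ignoring the m farthest points."""
    dists = np.linalg.norm(points - center, axis=1)
    kept = np.sort(dists)[: len(dists) - m]
    return float(kept.sum())


def within_epsilon(full_cost, approx_cost, eps):
    """Check the multiplicative guarantee |approx - full| <= eps * full."""
    return abs(approx_cost - full_cost) <= eps * full_cost


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])
center = np.array([0.0, 0.0])

cost_no_outliers = robust_cost(points, center, m=0)
cost_one_outlier = robust_cost(points, center, m=1)  # drops (100, 100)
```

A coreset would be a small weighted set whose (weighted) cost passes `within_epsilon` against the full data for every center, which is a stronger requirement than the single-center check shown here.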
[COMMENTS]This paper has been accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24621v1
[DATE]2025-10-29 00:49:03+08:00
[CATEGORIES]cs.LG
Global Optimization of Gaussian Process Acquisition Functions Using a Piecewise-Linear Kernel Approximation
[AUTHORS]Yilin Xie, Shiqiang Zhang, Joel A. Paulson, Calvin Tsay
[ABSTRACT]Bayesian optimization relies on iteratively constructing and optimizing an
acquisition function. The latter turns out to be a challenging, non-convex
optimization problem itself. Despite the relative importance of this step, most
algorithms employ sampling- or gradient-based methods, which do not provably
converge to global optima. This work investigates mixed-integer programming
(MIP) as a paradigm for global acquisition function optimization. Specifically,
our Piecewise-linear Kernel Mixed Integer Quadratic Programming (PK-MIQP)
formulation introduces a piecewise-linear approximation for Gaussian process
kernels and admits a corresponding MIQP representation for acquisition
functions. The proposed method is applicable to uncertainty-based acquisition
functions for any stationary or dot-product kernel. We analyze the theoretical
regret bounds of the proposed approximation, and empirically demonstrate the
framework on synthetic functions, constrained benchmarks, and a hyperparameter
tuning task.
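
To make the piecewise-linear kernel idea concrete, the sketch below interpolates a squared-exponential kernel, written as a function of distance, between a handful of breakpoints. The kernel choice, the number and placement of breakpoints, and the names are assumptions; the mixed-integer encoding of the resulting segments is not shown.

```python
import numpy as np


def rbf(r, lengthscale=1.0):
    """Squared-exponential kernel as a function of distance r."""
    return np.exp(-0.5 * (r / lengthscale) ** 2)


# Assumed breakpoints: nine evenly spaced distances on [0, 4].
breakpoints = np.linspace(0.0, 4.0, 9)
kernel_at_breakpoints = rbf(breakpoints)


def pwl_rbf(r):
    """Piecewise-linear interpolation of the kernel between breakpoints."""
    return np.interp(r, breakpoints, kernel_at_breakpoints)


grid = np.linspace(0.0, 4.0, 401)
max_error = float(np.max(np.abs(rbf(grid) - pwl_rbf(grid))))
```

Each linear segment can be represented with linear constraints and binary variables selecting the active segment, which is what makes a mixed-integer formulation possible; adding breakpoints trades a larger program for a smaller approximation error.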
[COMMENTS]18 pages, 4 figures, 5 tables
[LINK]http://arxiv.org/abs/2410.16893v2
[DATE]2025-10-29 00:44:42+08:00
[CATEGORIES]cs.LG
Statistical physics of deep learning: Optimal learning of a multi-layer perceptron near interpolation
[AUTHORS]Jean Barbier, Francesco Camilli, Minh-Toan Nguyen, Mauro Pastore, Rudy Skerk
[ABSTRACT]For three decades statistical physics has been providing a framework to
analyse neural networks. A long-standing question remained on its capacity to
tackle deep learning models capturing rich feature learning effects, thus going
beyond the narrow networks or kernel methods analysed until now. We positively
answer through the study of the supervised learning of a multi-layer
perceptron. Importantly, (i) its width scales as the input dimension, making it
more prone to feature learning than ultra wide networks, and more expressive
than narrow ones or ones with fixed embedding layers; and (ii) we focus on the
challenging interpolation regime where the number of trainable parameters and
data are comparable, which forces the model to adapt to the task. We consider
the matched teacher-student setting. It provides the fundamental limits of
learning random deep neural network targets and helps in identifying the
sufficient statistics describing what is learnt by an optimally trained network
as the data budget increases. A rich phenomenology emerges with various
learning transitions. With enough data, optimal performance is attained through
the model’s “specialisation” towards the target, but it can be hard to reach for
training algorithms which get attracted by sub-optimal solutions predicted by
the theory. Specialisation occurs inhomogeneously across layers, propagating
from shallow towards deep ones, but also across neurons in each layer.
Furthermore, deeper targets are harder to learn. Despite its simplicity, the
Bayesian-optimal setting provides insights on how the depth, non-linearity and
finite (proportional) width influence neural networks in the feature learning
regime that are potentially relevant way beyond it.
[COMMENTS]30 pages, 19 figures + appendix. This submission supersedes both
arXiv:2505.24849 and arXiv:2501.18530
[LINK]http://arxiv.org/abs/2510.24616v1
[DATE]2025-10-29 00:44:34+08:00
[CATEGORIES]cs.LG
Semi-supervised and unsupervised learning for health indicator extraction from guided waves in aerospace composite structures
[AUTHORS]James Josep Perry, Pablo Garcia-Conde Ortiz, George Konstantinou, Cornelie Vergouwen, Edlyn Santha Kumaran, Morteza Moradi
[ABSTRACT]Health indicators (HIs) are central to diagnosing and prognosing the
condition of aerospace composite structures, enabling efficient maintenance and
operational safety. However, extracting reliable HIs remains challenging due to
variability in material properties, stochastic damage evolution, and diverse
damage modes. Manufacturing defects (e.g., disbonds) and in-service incidents
(e.g., bird strikes) further complicate this process. This study presents a
comprehensive data-driven framework that learns HIs via two learning approaches
integrated with multi-domain signal processing. Because ground-truth HIs are
unavailable, a semi-supervised and an unsupervised approach are proposed: (i) a
diversity deep semi-supervised anomaly detection (Diversity-DeepSAD) approach
augmented with continuous auxiliary labels used as hypothetical damage proxies,
which overcomes the limitation of prior binary labels that only distinguish
healthy and failed states while neglecting intermediate degradation, and (ii) a
degradation-trend-constrained variational autoencoder (DTC-VAE), in which the
monotonicity criterion is embedded via an explicit trend constraint. Guided
waves with multiple excitation frequencies are used to monitor single-stiffener
composite structures under fatigue loading. Time, frequency, and time-frequency
representations are explored, and per-frequency HIs are fused via unsupervised
ensemble learning to mitigate frequency dependence and reduce variance. Using
fast Fourier transform features, the augmented Diversity-DeepSAD model achieved
81.6% performance, while DTC-VAE delivered the most consistent HIs with 92.3%
performance, outperforming existing baselines.
[LINK]http://arxiv.org/abs/2510.24614v1
[DATE]2025-10-29 00:44:11+08:00
[CATEGORIES]cs.LG
Comparison of generalised additive models and neural networks in applications: A systematic review
[AUTHORS]Jessica Doohan, Lucas Kook, Kevin Burke
[ABSTRACT]Neural networks have become a popular tool in predictive modelling, more
commonly associated with machine learning and artificial intelligence than with
statistics. Generalised Additive Models (GAMs) are flexible non-linear
statistical models that retain interpretability. Both are state-of-the-art in
their own right, with their respective advantages and disadvantages. This paper
analyses how these two model classes have performed on real-world tabular data.
Following PRISMA guidelines, we conducted a systematic review of papers that
performed empirical comparisons of GAMs and neural networks. The search yielded
143 eligible papers comprising 430 datasets. Key attributes at both
paper and dataset levels were extracted and reported. Beyond summarising
comparisons, we analyse reported performance metrics using mixed-effects
modelling to investigate potential characteristics that can explain and
quantify observed differences, including application area, study year, sample
size, number of predictors, and neural network complexity. Across datasets, no
consistent evidence of superiority was found for either GAMs or neural networks
when considering the most frequently reported metrics (RMSE, $R^2$, and AUC).
Neural networks tended to outperform in larger datasets and in those with more
predictors, but this advantage narrowed over time. Conversely, GAMs remained
competitive, particularly in smaller data settings, while retaining
interpretability. Reporting of dataset characteristics and neural network
complexity was incomplete in much of the literature, limiting transparency and
reproducibility. This review highlights that GAMs and neural networks should be
viewed as complementary approaches rather than competitors. For many tabular
applications, the performance trade-off is modest, and interpretability may
favour GAMs.
[LINK]http://arxiv.org/abs/2510.24601v1
[DATE]2025-10-29 00:28:42+08:00
[CATEGORIES]cs.LG
A Novel XAI-Enhanced Quantum Adversarial Networks for Velocity Dispersion Modeling in MaNGA Galaxies
[AUTHORS]Sathwik Narkedimilli, N V Saran Kumar, Aswath Babu H, Manjunath K Vanahalli, Manish M, Vinija Jain, Aman Chadha
[ABSTRACT]Current quantum machine learning approaches often face challenges balancing
predictive accuracy, robustness, and interpretability. To address this, we
propose a novel quantum adversarial framework that integrates a hybrid quantum
neural network (QNN) with classical deep learning layers, guided by an
evaluator model with LIME-based interpretability, and extended through quantum
GAN and self-supervised variants. In the proposed model, an adversarial
evaluator concurrently guides the QNN by computing feedback loss, thereby
optimizing both prediction accuracy and model explainability. Empirical
evaluations show that the Vanilla model achieves RMSE = 0.27, MSE = 0.071, MAE
= 0.21, and R^2 = 0.59, delivering the most consistent performance across
regression metrics compared to adversarial counterparts. These results
demonstrate the potential of combining quantum-inspired methods with classical
architectures to develop lightweight, high-performance, and interpretable
predictive models, advancing the applicability of QML beyond current
limitations.
[LINK]http://arxiv.org/abs/2510.24598v1
[DATE]2025-10-29 00:27:10+08:00
[CATEGORIES]cs.LG
DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
[AUTHORS]Hao Wang, Licheng Pan, Yuan Lu, Zhixuan Chu, Xiaoxi Li, Shuting He, Zhichao Chen, Haoxuan Li, Qingsong Wen, Zhouchen Lin
[ABSTRACT]Training time-series forecast models requires aligning the conditional
distribution of model forecasts with that of the label sequence. The standard
direct forecast (DF) approach resorts to minimizing the conditional negative
log-likelihood of the label sequence, typically estimated using the mean
squared error. However, this estimation proves to be biased in the presence of
label autocorrelation. In this paper, we propose DistDF, which achieves
alignment by alternatively minimizing a discrepancy between the conditional
forecast and label distributions. Because conditional discrepancies are
difficult to estimate from finite time-series observations, we introduce a
joint-distribution Wasserstein discrepancy for time-series
forecasting, which provably upper bounds the conditional discrepancy of
interest. This discrepancy admits tractable, differentiable estimation from
empirical samples and integrates seamlessly with gradient-based training.
Extensive experiments show that DistDF improves the performance of diverse
forecast models and achieves state-of-the-art forecasting performance. Code
is available at https://anonymous.4open.science/r/DistDF-F66B.
[LINK]http://arxiv.org/abs/2510.24574v1
[DATE]2025-10-29 00:09:59+08:00
[CATEGORIES]cs.LG
GST-UNet: A Neural Framework for Spatiotemporal Causal Inference with Time-Varying Confounding
[AUTHORS]Miruna Oprescu, David K. Park, Xihaier Luo, Shinjae Yoo, Nathan Kallus
[ABSTRACT]Estimating causal effects from spatiotemporal observational data is essential
in public health, environmental science, and policy evaluation, where
randomized experiments are often infeasible. Existing approaches, however,
either rely on strong structural assumptions or fail to handle key challenges
such as interference, spatial confounding, temporal carryover, and time-varying
confounding – where covariates are influenced by past treatments and, in turn,
affect future ones. We introduce GST-UNet (G-computation Spatio-Temporal UNet),
a theoretically grounded neural framework that combines a U-Net-based
spatiotemporal encoder with regression-based iterative G-computation to
estimate location-specific potential outcomes under complex intervention
sequences. GST-UNet explicitly adjusts for time-varying confounders and
captures non-linear spatial and temporal dependencies, enabling valid causal
inference from a single observed trajectory in data-scarce settings. We
validate its effectiveness in synthetic experiments and in a real-world
analysis of wildfire smoke exposure and respiratory hospitalizations during the
2018 California Camp Fire. Together, these results position GST-UNet as a
principled and ready-to-use framework for spatiotemporal causal inference,
advancing reliable estimation in policy-relevant and scientific domains.
[COMMENTS]29 pages, 6 figures, 6 tables, NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.05295v2
[DATE]2025-10-29 00:01:40+08:00
[CATEGORIES]cs.LG
Levée d’ambiguïtés par grammaires locales
[AUTHORS]Eric G. C. Laporte
[ABSTRACT]Many words are ambiguous in terms of their part of speech (POS). However,
when a word appears in a text, this ambiguity is generally much reduced.
Disambiguating POS involves using context to reduce the number of POS
associated with words, and is one of the main challenges of lexical tagging.
The problem of labeling words by POS frequently arises in natural language
processing, for example for spelling correction, grammar or style checking,
expression recognition, text-to-speech conversion, text corpus analysis, etc.
Lexical tagging systems are thus useful as an initial component of many natural
language processing systems. A number of recent lexical tagging systems produce
multiple solutions when the text is lexically ambiguous or the uniquely correct
solution cannot be found. These contributions aim to guarantee a zero silence
rate: the correct tag(s) for a word must never be discarded. This objective is
unrealistic for systems that tag each word uniquely. This article concerns a
lexical disambiguation method adapted to the objective of a zero silence rate
and implemented in Silberztein’s INTEX system (1993). We present here a formal
description of this method. We show that to verify a local disambiguation
grammar in this framework, it is not sufficient to consider the transducer
paths separately: one needs to verify their interactions. Similarly, if a
combination of multiple transducers is used, the result cannot be predicted by
considering them in isolation. Furthermore, when examining the initial labeling
of a text as produced by INTEX, ideas for disambiguation rules come
spontaneously, but grammatical intuitions may turn out to be inaccurate, often
due to an unforeseen construction or ambiguity. If a zero silence rate is
targeted, local grammars must be carefully tested. This is where a detailed
specification of what a grammar will do once applied to texts would be
necessary.
[COMMENTS]in French language
[LINK]http://arxiv.org/abs/2510.24530v1
[DATE]2025-10-28 23:38:22+08:00
[CATEGORIES]cs.CL
Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
[AUTHORS]Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
[ABSTRACT]While Multimodal Large Language Models (MLLMs) excel at visual understanding,
they often struggle in complex scenarios that require visual planning and
imagination. Inspired by how humans use sketching as a form of visual thinking
to develop and communicate ideas, we introduce Latent Sketchpad, a framework
that equips MLLMs with an internal visual scratchpad. The internal visual
representations of MLLMs have traditionally been confined to perceptual
understanding. We repurpose them to support generative visual thought without
compromising reasoning ability. Building on frontier MLLMs, our approach
integrates visual generation directly into their native autoregressive
reasoning process. It allows the model to interleave textual reasoning with the
generation of visual latents. These latents guide the internal thought process
and can be translated into sketch images for interpretability. To realize this,
we introduce two components: a Context-Aware Vision Head autoregressively
produces visual representations, and a pretrained Sketch Decoder renders these
into human-interpretable images. We evaluate the framework on our new dataset
MazePlanning. Experiments across various MLLMs show that Latent Sketchpad
delivers reasoning performance comparable or even superior to their backbones.
It further generalizes across distinct frontier MLLMs, including Gemma3 and
Qwen2.5-VL. By extending the model’s textual reasoning to visual thinking, our
framework opens new opportunities for richer human-computer interaction and
broader applications. More details and resources are available on our project
page: https://latent-sketchpad.github.io/.
[LINK]http://arxiv.org/abs/2510.24514v1
[DATE]2025-10-28 23:26:20+08:00
[CATEGORIES]cs.CL
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
[AUTHORS]Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
[ABSTRACT]Accelerating the inference of large language models (LLMs) has been a
critical challenge in generative AI. Speculative decoding (SD) substantially
improves LLM inference efficiency. However, its utility is limited by a
fundamental constraint: the draft and target models must share the same
vocabulary, thus limiting the pool of available draft models and often
necessitating the training of a new model from scratch. Inspired by Dynamic
Time Warping (DTW), a classic algorithm for aligning time series, we propose
the algorithm TokenTiming for universal speculative decoding. It operates by
re-encoding the draft token sequence to get a new target token sequence, and
then uses DTW to build a mapping to transfer the probability distributions for
speculative sampling. Benefiting from this, our method accommodates mismatched
vocabularies and works with any off-the-shelf models without retraining or
modification. We conduct comprehensive experiments on various tasks,
demonstrating a 1.57x speedup. This work enables a universal approach for draft
model selection, making SD a more versatile and practical tool for LLM
acceleration.
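
As background for the alignment step, here is a sketch of classic Dynamic Time Warping with path recovery. It is the textbook algorithm only: the 0/1 token cost and the example tokenisations are assumptions, and the abstract does not say how TokenTiming defines its cost or transfers probabilities along the path.

```python
def dtw(a, b, cost):
    """Classic DTW: return (total cost, alignment path of index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
            table[i][j] = cost(a[i - 1], b[j - 1]) + best_prev

    # Walk back from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (table[i - 1][j - 1], i - 1, j - 1),
            (table[i - 1][j], i - 1, j),
            (table[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return table[n][m], path


# Hypothetical tokenisations of the same text under two vocabularies.
draft_tokens = ["to", "ken", "tim", "ing"]
target_tokens = ["token", "tim", "ing"]
distance, path = dtw(draft_tokens, target_tokens, lambda x, y: 0 if x == y else 1)
```

The recovered path maps both "to" and "ken" onto "token", which is the kind of many-to-one correspondence needed when two tokenizers split the same string differently.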
[LINK]http://arxiv.org/abs/2510.15545v2
[DATE]2025-10-28 23:23:35+08:00
[CATEGORIES]cs.CL
Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
[AUTHORS]Yihan Li, Xiyuan Fu, Ghanshyam Verma, Paul Buitelaar, Mingming Liu
[ABSTRACT]Hallucination remains one of the key obstacles to the reliable deployment of
large language models (LLMs), particularly in real-world applications. Among
various mitigation strategies, Retrieval-Augmented Generation (RAG) and
reasoning enhancement have emerged as two of the most effective and widely
adopted approaches, marking a shift from merely suppressing hallucinations to
balancing creativity and reliability. However, their synergistic potential and
underlying mechanisms for hallucination mitigation have not yet been
systematically examined. This survey adopts an application-oriented perspective
of capability enhancement to analyze how RAG, reasoning enhancement, and their
integration in Agentic Systems mitigate hallucinations. We propose a taxonomy
distinguishing knowledge-based and logic-based hallucinations, systematically
examine how RAG and reasoning address each, and present a unified framework
supported by real-world applications, evaluations, and benchmarks.
[COMMENTS]25 pages, 7 figures, 3 tables
[LINK]http://arxiv.org/abs/2510.24476v1
[DATE]2025-10-28 22:48:57+08:00
[CATEGORIES]cs.CL
Iterative Critique-Refine Framework for Enhancing LLM Personalization
[AUTHORS]Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed
[ABSTRACT]Personalized text generation requires models not only to produce coherent
text but also to align with a target user’s style, tone, and topical focus.
Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich
profiles with user and neighbor histories, but they stop at generation and
often yield outputs that drift in tone, topic, or style. We present PerFine, a
unified, training-free critique-refine framework that enhances personalization
through iterative, profile-grounded feedback. In each iteration, an LLM
generator produces a draft conditioned on the retrieved profile, and a critic
LLM, also conditioned on the same profile, provides structured feedback on
tone, vocabulary, sentence structure, and topicality. The generator then
revises, while a novel knockout strategy retains the stronger draft across
iterations. We further study additional inference-time strategies such as
Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp,
Goodreads, and Amazon datasets, PerFine consistently improves personalization
over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5
refinement iterations, and scalability with increasing critic size. These
results highlight that post-hoc, profile-aware feedback offers a powerful
paradigm for personalized LLM generation that is both training-free and
model-agnostic.
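
The iteration described above can be sketched as a loop over four callables. Everything here is illustrative: `generate`, `critique`, `revise`, and `score` are hypothetical stand-ins for LLM calls, and using a numeric score to decide which draft survives the knockout step is an assumption, since the abstract does not say how the stronger draft is chosen.

```python
def critique_refine(generate, critique, revise, score, profile, iterations=3):
    """Draft, critique, revise, and keep whichever draft scores higher."""
    best = generate(profile)
    for _ in range(iterations):
        feedback = critique(best, profile)
        candidate = revise(best, feedback, profile)
        # Knockout step: the revision replaces the draft only if it is stronger.
        if score(candidate, profile) > score(best, profile):
            best = candidate
    return best


# Toy stand-ins so the loop can run without any model.
def generate(profile):
    return "good food"


def critique(draft, profile):
    return [word for word in profile if word not in draft]


def revise(draft, feedback, profile):
    return draft + " " + feedback[0] if feedback else draft


def score(draft, profile):
    return sum(word in draft for word in profile)


result = critique_refine(generate, critique, revise, score, ["cozy", "spicy"])
```

Because a revision is kept only when it beats the current draft, the retained draft never gets worse under the chosen score, whatever the critic suggests.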
[LINK]http://arxiv.org/abs/2510.24469v1
[DATE]2025-10-28 22:36:22+08:00
[CATEGORIES]cs.CL
AutoJudge: Judge Decoding Without Manual Annotation
[AUTHORS]Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
[ABSTRACT]We introduce AutoJudge, a method that accelerates large language model (LLM)
inference with task-specific lossy speculative decoding. Instead of matching
the original model output distribution token-by-token, we identify which of the
generated tokens affect the downstream quality of the response, relaxing the
distribution match guarantee so that the “unimportant” tokens can be generated
faster. Our approach relies on a semi-greedy search algorithm to test which of
the mismatches between target and draft models should be corrected to preserve
quality and which ones may be skipped. We then train a lightweight classifier
based on existing LLM embeddings to predict, at inference time, which
mismatching tokens can be safely accepted without compromising the final answer
quality. We evaluate the effectiveness of AutoJudge with multiple draft/target
model pairs on mathematical reasoning and programming benchmarks, achieving
significant speedups at the cost of a minor accuracy reduction. Notably, on
GSM8k with the Llama 3.1 70B target model, our approach achieves up to
$\approx2\times$ speedup over speculative decoding at the cost of $\le 1\%$
drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge
automatically detects programming-specific important tokens, accepting $\ge 25$
tokens per speculation cycle at $2\%$ drop in Pass@1. Our approach requires no
human annotation and is easy to integrate with modern LLM inference frameworks.
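
A simplified sketch of the acceptance rule this implies, assuming greedy verification: draft tokens are kept while they match the target model or while a classifier judges the mismatch harmless. The function names and the toy classifier are hypothetical, and the real method works on model embeddings rather than token strings.

```python
def accept_tokens(draft_tokens, target_tokens, is_unimportant):
    """Accept draft tokens until the first mismatch judged important."""
    accepted = []
    for position, (draft, target) in enumerate(zip(draft_tokens, target_tokens)):
        if draft == target or is_unimportant(position, draft, target):
            accepted.append(draft)
        else:
            # An important mismatch: take the target token and end the cycle.
            accepted.append(target)
            break
    return accepted


draft = ["a", "b", "c", "d"]
target = ["a", "x", "c", "y"]

# Exact matching stops at the first mismatch.
strict = accept_tokens(draft, target, lambda position, d, t: False)

# A toy judge that treats the mismatch at position 1 as harmless.
lenient = accept_tokens(draft, target, lambda position, d, t: position == 1)
```

The lenient run accepts more tokens per cycle than the strict one, which is where the extra speedup comes from, at the cost of no longer matching the target model's output exactly.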
[LINK]http://arxiv.org/abs/2504.20039v3
[DATE]2025-10-28 22:35:32+08:00
[CATEGORIES]cs.CL cs.LG
Mano Technical Report
[AUTHORS]Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
[ABSTRACT]Graphical user interfaces (GUIs) are the primary medium for human-computer
interaction, yet automating GUI interactions remains challenging due to the
complexity of visual elements, dynamic environments, and the need for
multi-step reasoning. Existing methods based on vision-language models (VLMs)
often suffer from limited resolution, domain mismatch, and insufficient
sequential decision-making capability. To address these issues, we propose Mano,
a robust GUI agent built upon a multi-modal foundation model pre-trained on
extensive web and computer system data. Our approach integrates a novel
simulated environment for high-fidelity data generation, a three-stage training
pipeline (supervised fine-tuning, offline reinforcement learning, and online
reinforcement learning), and a verification module for error recovery. Mano
demonstrates state-of-the-art performance on multiple GUI benchmarks, including
Mind2Web and OSWorld, achieving significant improvements in success rate and
operational accuracy. Our work provides new insights into the effective
integration of reinforcement learning with VLMs for practical GUI agent
deployment, highlighting the importance of domain-specific data, iterative
training, and holistic reward design.
[LINK]http://arxiv.org/abs/2509.17336v2
[DATE]2025-10-28 22:31:14+08:00
[CATEGORIES]cs.CL
Are you sure? Measuring models bias in content moderation through uncertainty
[AUTHORS]Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci
[COMMENTS]accepted at Findings of ACL: EMNLP 2025
[LINK]http://arxiv.org/abs/2509.22699v2
[DATE]2025-10-28 22:11:48+08:00
[CATEGORIES]cs.CL
SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
[AUTHORS]Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro
[ABSTRACT]Multimodal large language models (MLLMs) have shown impressive capabilities
in vision-language tasks such as reasoning segmentation, where models generate
segmentation masks based on textual queries. While prior work has primarily
focused on perturbing image inputs, semantically equivalent textual
paraphrases (crucial in real-world applications where users express the same
intent in varied ways) remain underexplored. To address this gap, we introduce a
novel adversarial paraphrasing task: generating grammatically correct
paraphrases that preserve the original query meaning while degrading
segmentation performance. To evaluate the quality of adversarial paraphrases,
we develop a comprehensive automatic evaluation protocol validated with human
studies. Furthermore, we introduce SPARTA, a black-box, sentence-level
optimization method that operates in the low-dimensional semantic latent space
of a text autoencoder, guided by reinforcement learning. SPARTA achieves
significantly higher success rates, outperforming prior methods by up to 2x on
both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive
baselines to assess the robustness of advanced reasoning segmentation models.
We reveal that they remain vulnerable to adversarial paraphrasing, even under
strict semantic and grammatical constraints. All code and data will be released
publicly upon acceptance.
[LINK]http://arxiv.org/abs/2510.24446v1
[DATE]2025-10-28 22:09:05+08:00
[CATEGORIES]cs.CL
Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
[AUTHORS]Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir
[ABSTRACT]Large language models are increasingly used for Islamic guidance, but risk
misquoting texts, misapplying jurisprudence, or producing culturally
inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar
on prompts from authentic Islamic blogs. Our dual-agent framework uses a
quantitative agent for citation verification and six-dimensional scoring (e.g.,
Structure, Islamic Consistency, Citations) and a qualitative agent for
five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality).
GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI
followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong
performance, models still fall short in reliably producing accurate Islamic
content and citations – a paramount requirement in faith-sensitive writing.
GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led
qualitative pairwise wins (116/200). Fanar, though trailing, introduces
innovations for Islamic and Arabic contexts. This study underscores the need
for community-driven benchmarks centering Muslim perspectives, offering an
early step toward more reliable AI in Islamic knowledge and other high-stakes
domains such as medicine, law, and journalism.
[COMMENTS]Accepted at 39th Conference on Neural Information Processing Systems
(NeurIPS 2025) Workshop: 5th Muslims in Machine Learning (MusIML) Workshop
[LINK]http://arxiv.org/abs/2510.24438v1
[DATE]2025-10-28 22:05:55+08:00
[CATEGORIES]cs.CL
LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
[AUTHORS]Julian Valline, Cedric Lothritz, Jordi Cabot
[ABSTRACT]The effectiveness of instruction-tuned Large Language Models (LLMs) is often
limited in low-resource linguistic settings due to a lack of high-quality
training data. We introduce LuxIT, a novel, monolingual instruction tuning
dataset for Luxembourgish developed to mitigate this challenge. We synthesize
the dataset from a corpus of native Luxembourgish texts, utilizing
DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following
generation, we apply a quality assurance process, employing an LLM-as-a-judge
approach. To investigate the practical utility of the dataset, we fine-tune
several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base
models on Luxembourgish language proficiency examinations, however, yields
mixed results, with performance varying significantly across different models.
LuxIT represents a critical contribution to Luxembourgish natural language
processing and offers a replicable monolingual methodology, though our findings
highlight the need for further research to optimize its application.
[LINK]http://arxiv.org/abs/2510.24434v1
[DATE]2025-10-28 22:02:55+08:00
[CATEGORIES]cs.CL
From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users
[AUTHORS]Sadia Sultana Chowa, Riasad Alvi, Subhey Sadi Rahman, Md Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam
[ABSTRACT]The pursuit of human-level artificial intelligence (AI) has significantly
advanced the development of autonomous agents and Large Language Models (LLMs).
LLMs are now widely utilized as decision-making agents for their ability to
interpret instructions, manage sequential tasks, and adapt through feedback.
This review examines recent developments in employing LLMs as autonomous agents
and tool users and comprises seven research questions. We consider only papers
published between 2023 and 2025 in A*- and A-ranked conferences and Q1
journals. We present a structured analysis of LLM agents’ architectural design
principles, their applications in single-agent and multi-agent systems, and
strategies for integrating external tools. In
addition, the cognitive mechanisms of LLMs, including reasoning, planning, and
memory, and the impact of prompting methods and fine-tuning procedures on agent
performance are also investigated. Furthermore, we evaluated current benchmarks
and assessment protocols and have provided an analysis of 68 publicly available
datasets to assess the performance of LLM-based agents in various tasks. In
conducting this review, we have identified critical findings on verifiable
reasoning of LLMs, the capacity for self-improvement, and the personalization
of LLM-based agents. Finally, we have discussed ten future research directions
to overcome these gaps.
[COMMENTS]Submitted to Artificial Intelligence Review for peer review
[LINK]http://arxiv.org/abs/2508.17281v2
[DATE]2025-10-28 21:52:29+08:00
[CATEGORIES]cs.CL
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
[AUTHORS]Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff
[ABSTRACT]Evaluating the reasoning ability of language models (LMs) is complicated by
their extensive parametric world knowledge, where benchmark performance often
reflects factual recall rather than genuine reasoning. Existing datasets and
approaches (e.g., temporal filtering, paraphrasing, adversarial substitution)
cannot cleanly separate the two. We present SynthWorlds, a framework that
disentangles task reasoning complexity from factual knowledge. In SynthWorlds,
we construct parallel corpora representing two worlds with identical
interconnected structure: a real-mapped world, where models may exploit
parametric knowledge, and a synthetic-mapped world, where such knowledge is
meaningless. On top of these corpora, we design two mirrored tasks as case
studies: multi-hop question answering and page navigation, which maintain equal
reasoning difficulty across worlds. Experiments in parametric-only (e.g.,
closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings
reveal a persistent knowledge advantage gap, defined as the performance boost
models gain from memorized parametric world knowledge. Knowledge acquisition
and integration mechanisms reduce but do not eliminate this gap, highlighting
opportunities for system improvements. Fully automatic and scalable,
SynthWorlds provides a controlled environment for evaluating LMs in ways that
were previously challenging, enabling precise and testable comparisons of
reasoning and memorization.
[LINK]http://arxiv.org/abs/2510.24427v1
[DATE]2025-10-28 21:47:23+08:00
[CATEGORIES]cs.CL
Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models
[AUTHORS]Guangyu Xie, Yice Zhang, Jianzhu Bao, Qianlong Wang, Yang Sun, Bingbing Wang, Ruifeng Xu
[COMMENTS]Accepted by EMNLP 2025. 22 pages, 9 figures. The first two authors
contribute equally
[LINK]http://arxiv.org/abs/2510.24425v1
[DATE]2025-10-28 21:46:48+08:00
[CATEGORIES]cs.CL
OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
[AUTHORS]Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong
[ABSTRACT]Computer-using agents powered by Vision-Language Models (VLMs) have
demonstrated human-like capabilities in operating digital environments like
mobile platforms. While these agents hold great promise for advancing digital
automation, their potential for unsafe operations, such as system compromise
and privacy leakage, is raising significant concerns. Detecting these safety
concerns across the vast and complex operational space of mobile environments
presents a formidable challenge that remains critically underexplored. To
establish a foundation for mobile agent safety research, we introduce
MobileRisk-Live, a dynamic sandbox environment accompanied by a safety
detection benchmark comprising realistic trajectories with fine-grained
annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety
detection framework that synergistically combines a Formal Verifier for
detecting explicit system-level violations with a VLM-based Contextual Judge
for assessing contextual risks and agent actions. Experiments show that
OS-Sentinel achieves 10%-30% improvements over existing approaches across
multiple metrics. Further analysis provides critical insights that foster the
development of safer and more reliable autonomous mobile agents.
[COMMENTS]work in progress
[LINK]http://arxiv.org/abs/2510.24411v1
[DATE]2025-10-28 21:22:39+08:00
[CATEGORIES]cs.CL
Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
[AUTHORS]Xuannan Liu, Zekun Li, Zheqi He, Peipei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, Ran He
[COMMENTS]Accepted by NeurIPS 2025 Dataset and Benchmark Track, Project page:
https://liuxuannan.github.io/Video-SafetyBench.github.io/
[LINK]http://arxiv.org/abs/2505.11842v3
[DATE]2025-10-28 20:44:07+08:00
[CATEGORIES]cs.CL
Text Simplification with Sentence Embeddings
[AUTHORS]Matthew Shardlow
[ABSTRACT]Sentence embeddings can be decoded to give approximations of the original
texts used to create them. We explore this effect in the context of text
simplification, demonstrating that reconstructed text embeddings preserve
complexity levels. We experiment with a small feed-forward neural network to
effectively learn a transformation between sentence embeddings representing
high-complexity and low-complexity texts. We provide a comparison to Seq2Seq
and LLM-based approaches, showing encouraging results in our much smaller
learning setting. Finally, we demonstrate the applicability of our
transformation to an unseen simplification dataset (MedEASI), as well as
datasets from languages outside the training data (ES, DE). We conclude that
learning transformations in sentence embedding space is a promising direction
for future research and has potential to unlock the ability to develop small,
but powerful models for text simplification and other natural language
generation tasks.
[LINK]http://arxiv.org/abs/2510.24365v1
[DATE]2025-10-28 20:41:10+08:00
[CATEGORIES]cs.CL
Zero-Shot Tokenizer Transfer
[AUTHORS]Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić
[ABSTRACT]Language models (LMs) are bound to their tokenizer, which maps raw text to a
sequence of vocabulary items (tokens). This restricts their flexibility: for
example, LMs trained primarily on English may still perform well in other
natural and programming languages, but have vastly decreased efficiency due to
their English-centric tokenizer. To mitigate this, we should be able to swap
the original LM tokenizer with an arbitrary one, on the fly, without degrading
performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer
Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for
the tokens in the vocabulary of the new tokenizer. Since prior heuristics for
initializing embeddings often perform at chance level in a ZeTT setting, we
propose a new solution: we train a hypernetwork taking a tokenizer as input and
predicting the corresponding embeddings. We empirically demonstrate that the
hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and
decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models’
performance in cross-lingual and coding tasks while markedly reducing the
length of the tokenized sequence. We also find that the remaining gap can be
quickly closed by continued training on less than 1B tokens. Finally, we show
that a ZeTT hypernetwork trained for a base (L)LM can also be applied to
fine-tuned variants without extra training. Overall, our results make
substantial strides toward detaching LMs from their tokenizer.
[COMMENTS]NeurIPS 2024
[LINK]http://arxiv.org/abs/2405.07883v2
[DATE]2025-10-28 20:30:22+08:00
[CATEGORIES]cs.CL
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
[AUTHORS]Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Weinan Zhang, Yong Yu
[ABSTRACT]Recent advances in code agents have enabled automated software development at
the project level, supported by large language models (LLMs) and widely adopted
tools. However, existing benchmarks for code agent evaluation face two major
limitations: high annotation cost and expertise requirements, and rigid
evaluation metrics that rely primarily on unit tests. To address these
challenges, we propose an agent-driven benchmark construction pipeline that
leverages human supervision to efficiently generate diverse and challenging
project-level tasks. Based on this approach, we introduce PRDBench, a novel
benchmark comprising 50 real-world Python projects across 20 domains, each with
structured Product Requirement Document (PRD) requirements, comprehensive
evaluation criteria, and reference implementations. PRDBench features rich data
sources, high task complexity, and flexible metrics. We further employ an
Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of
various test types beyond unit tests. Extensive experiments on PRDBench
demonstrate its effectiveness in assessing the capabilities of both code agents
and evaluation agents, providing a scalable and robust framework for annotation
and evaluation.
[LINK]http://arxiv.org/abs/2510.24358v1
[DATE]2025-10-28 20:26:45+08:00
[CATEGORIES]cs.CL
LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
[AUTHORS]Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin
[ABSTRACT]Generating long, informative, and factual outputs remains a major challenge
for Large Language Models (LLMs). Existing benchmarks for long-form generation
typically assess real-world queries with hard-to-verify metrics or use
synthetic setups that ease evaluation but overlook real-world intricacies. In
this paper, we introduce LongWeave, which balances real-world and
verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval
constructs tasks by first defining verifiable targets within real-world
scenarios, then systematically generating corresponding queries, textual
materials, and constraints based on these targets. This ensures that tasks are
both realistic and objectively assessable, enabling rigorous assessment of
model capabilities in meeting complex real-world constraints. LongWeave
supports customizable input/output lengths (up to 64K/8K tokens) across seven
distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models
encounter significant challenges in long-form generation as real-world
complexity and output length increase.
[COMMENTS]EMNLP Findings 2025
[LINK]http://arxiv.org/abs/2510.24345v1
[DATE]2025-10-28 20:11:12+08:00
[CATEGORIES]cs.CL
Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
[AUTHORS]Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini
[ABSTRACT]Natural Language Processing and Generation systems have recently shown the
potential to complement and streamline the costly and time-consuming job of
professional fact-checkers. In this work, we lift several constraints of
current state-of-the-art pipelines for automated fact-checking based on the
Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under
more realistic scenarios, RAG-based methods for the generation of verdicts -
i.e., short texts discussing the veracity of a claim - evaluating them on
stylistically complex claims and heterogeneous, yet reliable, knowledge bases.
Our findings show a complex landscape, where, for example, LLM-based retrievers
outperform other retrieval techniques, though they still struggle with
heterogeneous knowledge bases; larger models excel in verdict faithfulness,
while smaller models provide better context adherence, with human evaluations
favouring zero-shot and one-shot approaches for informativeness, and fine-tuned
models for emotional alignment.
[COMMENTS]Code and data at https://github.com/drusso98/face-the-facts -
Accepted for publication at INLG 2025
[LINK]http://arxiv.org/abs/2412.15189v2
[DATE]2025-10-28 20:02:14+08:00
[CATEGORIES]cs.CL
Provable Scaling Laws for the Test-Time Compute of Large Language Models
[AUTHORS]Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
[ABSTRACT]We propose two simple, principled and practical algorithms that enjoy
provable scaling laws for the test-time compute of large language models
(LLMs). The first one is a two-stage knockout-style algorithm: given an input
problem, it first generates multiple candidate solutions, and then aggregates
them via a knockout tournament for the final output. Assuming that the LLM can
generate a correct solution with non-zero probability and do better than a
random guess in comparing a pair of correct and incorrect solutions, we prove
theoretically that the failure probability of this algorithm decays to zero
exponentially or by a power law (depending on the specific way of scaling) as
its test-time compute grows. The second one is a two-stage league-style
algorithm, where each candidate is evaluated by its average win rate against
multiple opponents, rather than eliminated upon loss to a single opponent.
Under analogous but more robust assumptions, we prove that its failure
probability also decays to zero exponentially with more test-time compute. Both
algorithms require a black-box LLM and nothing else (e.g., no verifier or
reward model) for a minimalistic implementation, which makes them appealing for
practical applications and easy to adapt for different tasks. Through extensive
experiments with diverse models and datasets, we validate the proposed theories
and demonstrate the outstanding scaling properties of both algorithms.
[COMMENTS]NeurIPS 2025 camera-ready version
[LINK]http://arxiv.org/abs/2411.19477v5
[DATE]2025-10-28 19:59:43+08:00
[CATEGORIES]cs.CL cs.LG
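Illustration for the entry above (Provable Scaling Laws for the Test-Time Compute of Large Language Models): the two-stage knockout procedure can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. `generate` and `compare` are hypothetical stand-ins for LLM calls, simulated here with a seeded random number generator, and `p_correct` and `p_compare` are assumed values. Each pair is compared once, which is a simplification chosen for brevity.

```python
import random


def knockout(candidates, compare):
    """Pair up candidates and keep each pairwise winner until one remains.

    Requires at least one candidate.
    """
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Compare neighbours pairwise; the winner advances.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(compare(pool[i], pool[i + 1]))
        # With an odd pool size the last candidate gets a bye.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def make_simulated_llm(p_correct=0.3, p_compare=0.8, seed=0):
    """Hypothetical stand-ins for LLM calls (assumed, not from the paper)."""
    rng = random.Random(seed)

    def generate():
        # A candidate is just a flag here: True means "correct solution".
        return rng.random() < p_correct

    def compare(a, b):
        # Same correctness: either choice is equivalent. Otherwise pick the
        # correct one with probability p_compare (better than a random guess).
        if a == b:
            return a
        return True if rng.random() < p_compare else False

    return generate, compare


def solve(n_candidates, generate, compare):
    # Stage 1: sample candidates. Stage 2: aggregate via the tournament.
    return knockout([generate() for _ in range(n_candidates)], compare)
```

Calling `solve(16, *make_simulated_llm())` runs one simulated tournament. Repeating it with different seeds and larger `n_candidates` gives a rough empirical view of how the failure rate changes with the number of candidates under these assumed probabilities.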
Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
[AUTHORS]Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
[ABSTRACT]Training critiquing language models to assess and provide feedback on model
outputs is a promising way to improve LLMs for complex reasoning tasks.
However, existing approaches typically rely on stronger supervisors for
annotating critique data. To address this, we propose Critique-RL, an online RL
approach for developing critiquing language models without stronger
supervision. Our approach operates on a two-player paradigm: the actor
generates a response, the critic provides feedback, and the actor refines the
response accordingly. We first reveal that relying solely on indirect reward
signals from the actor’s outputs for RL optimization often leads to
unsatisfactory critics: while their helpfulness (i.e., providing constructive
feedback) improves, the discriminability (i.e., determining whether a response
is high-quality or not) remains poor, resulting in marginal performance gains.
To overcome this, Critique-RL adopts a two-stage optimization strategy. In
stage I, it reinforces the discriminability of the critic with direct
rule-based reward signals; in stage II, it introduces indirect rewards based on
actor refinement to improve the critic’s helpfulness, while maintaining its
discriminability via appropriate regularization. Extensive experiments across
various tasks and models show that Critique-RL delivers substantial performance
improvements. For example, it achieves a 9.02% gain on in-domain tasks and a
5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
[COMMENTS]Preprint, 25 pages, 9 figures. Code:
https://github.com/WooooDyy/Critique-RL
[LINK]http://arxiv.org/abs/2510.24320v1
[DATE]2025-10-28 19:37:01+08:00
[CATEGORIES]cs.CL
Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
[AUTHORS]Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
[ABSTRACT]Reinforcement Learning with Verifiable Rewards (RLVR), particularly with
algorithms like Group Relative Policy Optimization (GRPO), has proven highly
effective in enhancing the reasoning capabilities of large language models.
However, a critical bottleneck in current pipelines lies in the limited
diversity of sampled trajectories during group rollouts. Homogeneous
trajectories and their associated rewards would diminish the return signals for
policy updates, thereby hindering effective policy learning. This lack of
diversity stems primarily from token-level stochastic sampling, where local
variations are likely to collapse into near-identical reasoning paths. To
address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a
novel rollout strategy designed to explicitly promote trajectory-level
diversity by enforcing branching into different candidate tokens likely to
yield distinct continuations. Specifically, LATR iteratively operates in three
stages: (1) branching at high-uncertainty generation steps, (2) performing
lookahead simulation for each new branch, and (3) pruning branches that
exhibit prolonged similarity during simulation. Compared with stochastic
sampling, LATR accelerates policy learning by 131% on average and improves
final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy
Optimization (DAPO) algorithms across different reasoning tasks. Our code and
data are publicly available at https://github.com/starreeze/latr.
[LINK]http://arxiv.org/abs/2510.24302v1
[DATE]2025-10-28 19:12:02+08:00
[CATEGORIES]cs.CL
LittleBit: Ultra Low-Bit Quantization via Latent Factorization
[AUTHORS]Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim
[ABSTRACT]Deploying large language models (LLMs) often faces challenges from
substantial memory and computational costs. Quantization offers a solution, yet
performance degradation in the sub-1-bit regime remains particularly difficult.
This paper introduces LittleBit, a novel method for extreme LLM compression. It
targets levels like 0.1 bits per weight (BPW), achieving nearly 31×
memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents
weights in a low-rank form using latent matrix factorization, subsequently
binarizing these factors. To counteract information loss from this extreme
precision, it integrates a multi-scale compensation mechanism. This includes
row, column, and an additional latent dimension that learns per-rank
importance. Two key contributions enable effective training: Dual
Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware
training (QAT) initialization, and integrated Residual Compensation to mitigate
errors. Extensive experiments confirm LittleBit’s superiority in sub-1-bit
quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading
method’s 0.7 BPW. LittleBit establishes a new, viable size-performance
trade-off, unlocking a potential 11.6× speedup over FP16 at the kernel level,
and makes powerful LLMs practical for resource-constrained environments.
[COMMENTS]Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed
equally
[LINK]http://arxiv.org/abs/2506.13771v2
[DATE]2025-10-28 18:57:14+08:00
[CATEGORIES]cs.LG cs.CL
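Illustration for the entry above (LittleBit): a quick consistency check on the quoted memory figures. The sketch assumes 13e9 parameters for Llama2-13B, 16 bits per weight for FP16, and decimal gigabytes. None of these assumptions is stated in the abstract, and the sketch does not attempt to map a specific BPW setting to the total model size.

```python
def model_size_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 13e9  # assumed parameter count for Llama2-13B

fp16_gb = model_size_gb(N_PARAMS, 16)  # FP16 baseline
compressed_gb = fp16_gb / 31           # apply the quoted "nearly 31x" reduction

print(f"FP16: {fp16_gb:.1f} GB, after 31x reduction: {compressed_gb:.2f} GB")
```

Under these assumptions the reduced size comes out at roughly 0.84 GB, which is in line with the "under 0.9 GB" figure in the abstract.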
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
[AUTHORS]Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan
[ABSTRACT]The limited capacity for fine-grained visual perception presents a critical
bottleneck for Vision-Language Models (VLMs) in real-world applications.
Addressing this is challenging due to the scarcity of high-quality data and the
limitations of existing methods: supervised fine-tuning (SFT) often compromises
general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual
reasoning over visual perception. To bridge this gap, we propose a novel
two-stage task that structures visual perception learning as a coarse-to-fine
progressive process. Based on this task formulation, we develop ViPER, a
self-bootstrapping framework specifically designed to enable iterative
evolution through self-critiquing and self-prediction. By synergistically
integrating image-level and instance-level reconstruction with a two-stage
reinforcement learning strategy, ViPER establishes a closed-loop training
paradigm, where internally synthesized data directly fuel the enhancement of
perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the
Qwen-Viper series. With an average gain of 1.7% on seven comprehensive
benchmarks spanning various tasks and up to 6.0% on fine-grained perception,
Qwen-Viper consistently demonstrates superior performance across different
vision-language scenarios while maintaining generalizability. Beyond enabling
self-improvement in perceptual capabilities, ViPER provides concrete evidence
for the reciprocal relationship between generation and understanding, a
breakthrough to developing more autonomous and capable VLMs.
[LINK]http://arxiv.org/abs/2510.24285v1
[DATE]2025-10-28 18:42:57+08:00
[CATEGORIES]cs.CL
Can LLMs Translate Human Instructions into a Reinforcement Learning Agent’s Internal Emergent Symbolic Representation?
[AUTHORS]Ziqi Ma, Sao Mai Nguyen, Philippe Xu
[ABSTRACT]Emergent symbolic representations are critical for enabling developmental
learning agents to plan and generalize across tasks. In this work, we
investigate whether large language models (LLMs) can translate human natural
language instructions into the internal symbolic representations that emerge
during hierarchical reinforcement learning. We apply a structured evaluation
framework to measure the translation performance of commonly seen LLMs – GPT,
Claude, Deepseek and Grok – across different internal symbolic partitions
generated by a hierarchical reinforcement learning algorithm in the Ant Maze
and Ant Fall environments. Our findings reveal that although LLMs demonstrate
some ability to translate natural language into a symbolic representation of
the environment dynamics, their performance is highly sensitive to partition
granularity and task complexity. The results expose limitations in current LLMs’
capacity for representation alignment, highlighting the need for further
research on robust alignment between language and internal agent
representations.
[LINK]http://arxiv.org/abs/2510.24259v1
[DATE]2025-10-28 18:13:43+08:00
[CATEGORIES]cs.CL
From Memorization to Reasoning in the Spectrum of Loss Curvature
[AUTHORS]Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
[ABSTRACT]We characterize how memorization is represented in transformer models and
show that it can be disentangled in the weights of both language models (LMs)
and vision transformers (ViTs) using a decomposition based on the loss
landscape curvature. This insight is based on prior theoretical and empirical
work showing that the curvature for memorized training points is much sharper
than for non-memorized points, meaning that ordering weight components from
high to low curvature can reveal a distinction without explicit labels. This
motivates a weight editing procedure that suppresses recitation of untargeted
memorized data far more effectively than a recent unlearning method
(BalancedSubnet), while maintaining lower perplexity. Since the basis of
curvature has a natural interpretation for shared structure in model weights,
we analyze the editing procedure extensively on its effect on downstream tasks
in LMs, and find that fact retrieval and arithmetic are specifically and
consistently negatively affected, even though open-book fact retrieval and
general logical reasoning are conserved. We posit these tasks rely heavily on
specialized directions in weight space rather than general purpose mechanisms,
regardless of whether those individual datapoints are memorized. We support
this by showing a correspondence between task data’s activation strength with
low curvature components that we edit out, and the drop in task performance
after the edit. Our work enhances the understanding of memorization in neural
networks with practical applications towards removing it, and provides evidence
for idiosyncratic, narrowly-used structures involved in solving tasks like math
and fact retrieval.
[LINK]http://arxiv.org/abs/2510.24256v1
[DATE]2025-10-28 18:09:35+08:00
[CATEGORIES]cs.CL cs.LG
Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
[AUTHORS]Ahmad Ghannam, Naif Alharthi, Faris Alasmary, Kholood Al Tabash, Shouq Sadah, Lahouari Ghouti
[ABSTRACT]In this work, we tackle the Diacritic Restoration (DR) task for Arabic
dialectal sentences using a multimodal approach that combines both textual and
speech information. We propose a model that represents the text modality using
an encoder extracted from our own pre-trained model named CATT. The speech
component is handled by the encoder module of the OpenAI Whisper base model.
Our solution is designed following two integration strategies. The former
consists of fusing the speech tokens with the input at an early stage, where
the 1500 frames of the audio segment are averaged over 10 consecutive frames,
resulting in 150 speech tokens. To ensure embedding compatibility, these
averaged tokens are processed through a linear projection layer prior to
merging them with the text tokens. Contextual encoding is guaranteed by the
CATT encoder module. The latter strategy relies on cross-attention, where text
and speech embeddings are fused. The cross-attention output is then fed to the
CATT classification head for token-level diacritic prediction. To further
improve model robustness, we randomly deactivate the speech input during
training, allowing the model to perform well with or without speech. Our
experiments show that the proposed approach achieves a word error rate (WER) of
0.25 and a character error rate (CER) of 0.9 on the development set. On the
test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
[LINK]http://arxiv.org/abs/2510.24247v1
[DATE]2025-10-28 17:58:18+08:00
[CATEGORIES]cs.CL
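Illustration for the entry above (CATT-Whisper): the early-fusion pooling step, in which 1500 audio frames are averaged over 10 consecutive frames to give 150 speech tokens, can be sketched as follows. Plain Python lists stand in for tensors. The function name `average_frames` and the toy feature dimension are illustrative assumptions, and the subsequent linear projection is omitted.

```python
def average_frames(frames, window=10):
    """Mean-pool consecutive groups of `window` frames into one token each."""
    if len(frames) % window != 0:
        raise ValueError("number of frames must be divisible by window")
    dim = len(frames[0])
    tokens = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]
        # Average each feature dimension across the frames in the group.
        tokens.append([sum(f[d] for f in group) / window for d in range(dim)])
    return tokens


# Toy input: 1500 frames with 4 features each (real encoder outputs are wider).
frames = [[float(i)] * 4 for i in range(1500)]
tokens = average_frames(frames)
print(len(tokens))  # number of pooled speech tokens
```

Non-overlapping windows are assumed here, since 1500 / 10 = 150 matches the token count given in the abstract.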
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
[AUTHORS]Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich
[COMMENTS]39th Conference on Neural Information Processing Systems (NeurIPS
2025) Workshop: NeurIPS 2025 Workshop on Evaluating the Evolving LLM
Lifecycle: Benchmarks, Emergent Abilities, and Scaling
[LINK]http://arxiv.org/abs/2510.24236v1
[DATE]2025-10-28 17:43:49+08:00
[CATEGORIES]cs.CL
NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables
[AUTHORS]Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang
[ABSTRACT]Processing structured tabular data, particularly large and lengthy tables,
constitutes a fundamental yet challenging task for large language models
(LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack
primarily focus on unstructured text, neglecting the challenge of diverse
structured tables. Meanwhile, previous tabular benchmarks mainly consider
downstream tasks that require high-level reasoning abilities, and overlook
models’ underlying fine-grained perception of individual table cells, which is
crucial for practical and robust LLM-based table applications. To address this
gap, we introduce NeedleInATable (NIAT), a new long-context tabular
benchmark that treats each table cell as a “needle” and requires models to
extract the target cell based on cell locations or lookup questions. Our
comprehensive evaluation of various LLMs and multimodal LLMs reveals a
substantial performance gap between popular downstream tabular tasks and the
simpler NIAT task, suggesting that models may rely on dataset-specific
correlations or shortcuts to obtain better benchmark results but lack truly
robust long-context understanding towards structured tables. Furthermore, we
demonstrate that using synthesized NIAT training data can effectively improve
performance on both the NIAT task and downstream tabular tasks, which validates the
importance of NIAT capability for LLMs’ genuine table understanding ability.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2504.06560v4
[DATE]2025-10-28 17:42:41+08:00
[CATEGORIES]cs.CL
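Illustration for the entry above (NeedleInATable): a location-based probe of the kind described in the abstract could be built roughly as follows. This is a hypothetical sketch. The prompt wording, the function names `make_location_probe` and `is_correct`, the pipe-separated table serialization, and exact-match scoring are all assumptions, not the benchmark's actual specification. The table contents are toy values.

```python
def make_location_probe(header, rows, row_idx, col_idx):
    """Build a (prompt, answer) pair that asks for one cell by its position."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    table_text = "\n".join(lines)
    # Positions are reported 1-based in the question; this wording is assumed.
    question = (
        f"What is the value in data row {row_idx + 1}, "
        f"column {col_idx + 1} of the table?"
    )
    return f"{table_text}\n\n{question}", str(rows[row_idx][col_idx])


def is_correct(prediction, answer):
    """Exact match after trimming whitespace (an assumed scoring rule)."""
    return prediction.strip() == answer


# Toy table: every cell is a potential "needle".
header = ["id", "name", "score"]
rows = [[1, "alpha", 0.5], [2, "beta", 0.9]]
prompt, answer = make_location_probe(header, rows, row_idx=1, col_idx=2)
```

Iterating `row_idx` and `col_idx` over every position yields one probe per cell, which mirrors the idea of treating each table cell as a needle.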
Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
[AUTHORS]Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
[ABSTRACT]Large Language Models (LLMs) encode vast amounts of knowledge in their
massive parameters, where it can be located, traced, and analyzed. Despite
advances in neural interpretability, it is still not clear how to transfer
knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT).
A key problem is enabling effective and efficient knowledge transfer across
LLMs of different scales, which is essential for achieving greater flexibility
and broader applicability in transferring knowledge between LLMs. Due to neural
incompatibility, referring to the architectural and parametric differences
between LLMs of varying scales, existing methods that directly reuse layer
parameters are severely limited. In this paper, we identify the semantic
alignment in latent space as the fundamental prerequisite for LLM cross-scale
knowledge transfer. Instead of directly using the layer parameters, our
approach takes activations as the medium of layer-wise knowledge transfer.
Leveraging the semantics in latent space, our approach is simple and
outperforms prior work, better aligning model behaviors across varying scales.
Evaluations on four benchmarks demonstrate the efficacy of our method. Further
analysis reveals the key factors easing cross-scale knowledge transfer and
provides insights into the nature of latent semantic alignment.
[COMMENTS]an early-stage version
[LINK]http://arxiv.org/abs/2510.24208v1
[DATE]2025-10-28 17:25:40+08:00
[CATEGORIES]cs.CL cs.LG
DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
[AUTHORS]Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye
[ABSTRACT]Recent studies on end-to-end (E2E) speech generation with large language
models (LLMs) have attracted significant community attention, with multiple
works extending text-based LLMs to generate discrete speech tokens. Existing
E2E approaches primarily fall into two categories: (1) Methods that generate
discrete speech tokens independently without incorporating them into the LLM’s
autoregressive process, resulting in text generation being unaware of
concurrent speech synthesis. (2) Models that generate interleaved or parallel
speech-text tokens through joint autoregressive modeling, enabling mutual
modality awareness during generation. This paper presents DrVoice, a parallel
speech-text voice conversation model based on joint autoregressive modeling,
featuring dual-resolution speech representations. Notably, while current
methods utilize mainly 12.5Hz input audio representation, our proposed
dual-resolution mechanism reduces the input frequency for the LLM to 5Hz,
significantly reducing computational cost and alleviating the frequency
discrepancy between speech and text tokens and in turn better exploiting LLMs’
capabilities. Experimental results demonstrate that DRVOICE-7B establishes new
state-of-the-art (SOTA) on OpenAudioBench and Big Bench Audio benchmarks, while
achieving performance comparable to the SOTA on VoiceBench and UltraEval-Audio
benchmarks, making it a leading open-source speech foundation model in ~7B
models.
[COMMENTS]Work in progress
[LINK]http://arxiv.org/abs/2506.09349v3
[DATE]2025-10-28 17:04:11+08:00
[CATEGORIES]cs.CL
Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
[AUTHORS]Anna Deichler, Jonas Beskow
[COMMENTS]10 pages, 6 figures, 2 tables. Accepted to the NeurIPS 2025 Workshop
on SPACE in Vision, Language, and Embodied AI (SpaVLE). Dataset:
https://huggingface.co/datasets/annadeichler/KTH-ARIA-referential
[LINK]http://arxiv.org/abs/2510.22672v2
[DATE]2025-10-28 16:39:14+08:00
[CATEGORIES]cs.CL
Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability
[AUTHORS]Iván Martínez-Murillo, Paloma Moreda, Elena Lloret
[ABSTRACT]This paper explores the influence of external knowledge integration in
Natural Language Generation (NLG), focusing on a commonsense generation task.
We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input
concept sets with retrieved semantic relations from ConceptNet and includes
manually annotated outputs. Using the T5-Large model, we compare sentence
generation under two conditions: with full external knowledge and with filtered
knowledge where highly relevant relations were deliberately removed. Our
interpretability benchmark follows a three-stage method: (1) identifying and
removing key knowledge, (2) regenerating sentences, and (3) manually assessing
outputs for commonsense plausibility and concept coverage. Results show that
sentences generated with full knowledge achieved 91% correctness across both
criteria, while filtering reduced performance drastically to 6%. These
findings demonstrate that relevant external knowledge is critical for
maintaining both coherence and concept coverage in NLG. This work highlights
the importance of designing interpretable, knowledge-enhanced NLG systems and
calls for evaluation frameworks that capture the underlying reasoning beyond
surface-level metrics.
[LINK]http://arxiv.org/abs/2510.24179v1
[DATE]2025-10-28 16:34:01+08:00
[CATEGORIES]cs.CL
LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
[AUTHORS]Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang
[ABSTRACT]Retrieval-Augmented Generation (RAG) is widely used to mitigate
hallucinations of Large Language Models (LLMs) by leveraging external
knowledge. While effective for simple queries, traditional RAG systems struggle
with large-scale, unstructured corpora where information is fragmented. Recent
advances incorporate knowledge graphs to capture relational structures,
enabling more comprehensive retrieval for complex, multi-hop reasoning tasks.
However, existing graph-based RAG (GraphRAG) methods rely on unstable and
costly relation extraction for graph construction, often producing noisy graphs
with incorrect or inconsistent relations that degrade retrieval quality. In
this paper, we revisit the pipeline of existing GraphRAG systems and propose
LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient
framework that enables reliable graph construction and precise passage
retrieval. Specifically, LinearRAG constructs a relation-free hierarchical
graph, termed Tri-Graph, using only lightweight entity extraction and semantic
linking, avoiding unstable relation modeling. This new paradigm of graph
construction scales linearly with corpus size and incurs no extra token
consumption, providing an economical and reliable indexing of the original
passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant
entity activation via local semantic bridging, followed by (ii) passage
retrieval through global importance aggregation. Extensive experiments on four
datasets demonstrate that LinearRAG significantly outperforms baseline models.
Our code and datasets are available at https://github.com/DEEP-PolyU/LinearRAG.
[LINK]http://arxiv.org/abs/2510.10114v3
[DATE]2025-10-28 15:51:04+08:00
[CATEGORIES]cs.CL
MATCH: Task-Driven Code Evaluation through Contrastive Learning
[AUTHORS]Marah Ghoummaid, Vladimir Tchuiev, Ofek Glick, Michal Moshkovitz, Dotan Di Castro
[ABSTRACT]AI-based code generation is increasingly prevalent, with GitHub Copilot
estimated to generate 46% of the code on GitHub. Accurately evaluating how well
generated code aligns with developer intent remains a critical challenge.
Traditional evaluation methods, such as unit tests, are often unscalable and
costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code
functionality, and metrics like CodeBERTScore require reference code, which is
not always available. To address the gap in reference-free evaluation, with few
alternatives such as ICE-Score, this paper introduces MATCH, a novel
reference-free metric. MATCH uses Contrastive Learning to generate meaningful
embeddings for code and natural language task descriptions, enabling similarity
scoring that reflects how well generated code implements the task. We show that
MATCH achieves stronger correlations with functional correctness and human
preference than existing metrics across multiple programming languages.
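
As a rough illustration of similarity scoring between a task description and candidate code, the following minimal sketch ranks candidates by cosine similarity of their embeddings. The embeddings here are hand-written placeholder vectors, and both the choice of cosine similarity and the helper names (`cosine_similarity`, `rank_candidates`) are assumptions for illustration. The abstract does not specify the scoring function or the encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(task_embedding, code_embeddings):
    """Sort candidate snippets by similarity to the task embedding, best first.

    `code_embeddings` maps a hypothetical snippet name to its embedding vector.
    """
    scored = [
        (name, cosine_similarity(task_embedding, emb))
        for name, emb in code_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Placeholder vectors standing in for learned embeddings.
    task = [0.9, 0.1, 0.0]
    candidates = {"snippet_a": [0.8, 0.2, 0.1], "snippet_b": [0.0, 0.1, 0.9]}
    for name, score in rank_candidates(task, candidates):
        print(name, round(score, 3))
```

No reference implementation is needed for this kind of score, which is what makes it reference-free: only the task description and the generated code are embedded and compared.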
[LINK]http://arxiv.org/abs/2510.23169v2
[DATE]2025-10-28 15:44:06+08:00
[CATEGORIES]cs.CL
Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
[AUTHORS]Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim, Hyunji M. Park, Sumin Bae, Mingyu Kang, Jaejin Lee
[COMMENTS]submitted to ACL ARR Rolling Review
[LINK]http://arxiv.org/abs/2510.24150v1
[DATE]2025-10-28 15:42:59+08:00
[CATEGORIES]cs.CL
GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training
[AUTHORS]Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong
[ABSTRACT]Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a
Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought
(CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models
(VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling
between thoughts and answers, sparse reward signals caused by limited parallel
sampling, and unstable advantage estimation. To mitigate these challenges, we
propose GRPO-MA, a simple yet theoretically grounded method that leverages
multi-answer generation from each thought process, enabling more robust and
efficient optimization. Theoretically, we show that the variance of thought
advantage decreases as the number of answers per thought increases.
Empirically, our gradient analysis confirms this effect, showing that GRPO-MA
reduces gradient spikes compared to GRPO. Experiments on math, code, and
diverse multimodal tasks demonstrate that GRPO-MA substantially improves
performance and training efficiency. Our ablation studies further reveal that
increasing the number of answers per thought consistently enhances model
performance.
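
The variance claim can be illustrated with a small simulation. This is a simplified sketch, not the paper's advantage estimator: it assumes each answer sampled from a thought earns an independent binary reward with a fixed success probability, and it estimates the thought's value as the mean reward over its answers. The function names and the probability 0.5 are hypothetical choices for illustration.

```python
import random
import statistics


def thought_value_estimate(p_correct, num_answers, rng):
    """Mean binary reward over `num_answers` answers sampled for one thought."""
    rewards = [1.0 if rng.random() < p_correct else 0.0 for _ in range(num_answers)]
    return sum(rewards) / num_answers


def estimate_variance(p_correct, num_answers, trials=20000, seed=0):
    """Empirical variance of the thought-value estimate across repeated trials."""
    rng = random.Random(seed)
    estimates = [
        thought_value_estimate(p_correct, num_answers, rng) for _ in range(trials)
    ]
    return statistics.pvariance(estimates)


if __name__ == "__main__":
    # Under the i.i.d. assumption the variance is p * (1 - p) / M.
    for m in (1, 2, 4, 8):
        print(m, round(estimate_variance(0.5, m), 4))
```

Under this i.i.d. assumption the variance of the estimate is p(1 - p)/M, so it shrinks as the number of answers M per thought grows, which matches the direction of the effect described in the abstract.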
[COMMENTS]Under review
[LINK]http://arxiv.org/abs/2509.24494v2
[DATE]2025-10-28 15:36:45+08:00
[CATEGORIES]cs.CL
Context-level Language Modeling by Learning Predictive Context Embeddings
[AUTHORS]Beiya Dai, Yuliang Liu, Daozheng Xue, Qipeng Guo, Kai Chen, Xinbing Wang, Bowen Zhou, Zhouhan Lin
[ABSTRACT]Next-token prediction (NTP) is the cornerstone of modern large language
model (LLM) pretraining, driving their unprecedented capabilities in text
generation, reasoning, and instruction following. However, the token-level
prediction limits the model’s capacity to capture higher-level semantic
structures and long-range contextual relationships. To overcome this
limitation, we introduce ContextLM, a framework that augments standard
pretraining with an inherent next-context prediction objective. This
mechanism trains the model to learn predictive representations of multi-token
contexts, leveraging error signals derived from future token chunks. Crucially,
ContextLM achieves this enhancement while remaining fully compatible with the
standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity).
Extensive experiments on the GPT2 and Pythia model families, scaled up to
1.5B parameters, show that ContextLM delivers consistent improvements in both
perplexity and downstream task performance. Our analysis indicates that
next-context prediction provides a scalable and efficient pathway to stronger
language modeling, yielding better long-range coherence and more effective
attention allocation with minimal computational overhead.
[COMMENTS]16 pages, 6 figures
[LINK]http://arxiv.org/abs/2510.20280v2
[DATE]2025-10-28 15:35:34+08:00
[CATEGORIES]cs.CL
Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
[AUTHORS]Chanwoo Park, Suyoung Park, Yelim Ahn, Jongmin Kim, Jongyeon Park, Jaejin Lee
[COMMENTS]submitted to ACL ARR Rolling Review
[LINK]http://arxiv.org/abs/2510.24139v1
[DATE]2025-10-28 15:24:32+08:00
[CATEGORIES]cs.CL
VC4VG: Optimizing Video Captions for Text-to-Video Generation
[AUTHORS]Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin
[ABSTRACT]Recent advances in text-to-video (T2V) generation highlight the critical role
of high-quality video-text pairs in training models capable of producing
coherent and instruction-aligned videos. However, strategies for optimizing
video captions specifically for T2V training remain underexplored. In this
paper, we introduce VC4VG (Video Captioning for Video Generation), a
comprehensive caption optimization framework tailored to the needs of T2V
models. We begin by analyzing caption content from a T2V perspective,
decomposing the essential elements required for video reconstruction into
multiple dimensions, and proposing a principled caption design methodology. To
support evaluation, we construct VC4VG-Bench, a new benchmark featuring
fine-grained, multi-dimensional, and necessity-graded metrics aligned with
T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a
strong correlation between improved caption quality and video generation
performance, validating the effectiveness of our approach. We release all
benchmark tools and code at https://github.com/qyr0403/VC4VG to support further
research.
[COMMENTS]Accepted by EMNLP 2025
[LINK]http://arxiv.org/abs/2510.24134v1
[DATE]2025-10-28 15:19:01+08:00
[CATEGORIES]cs.CL
SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture
[AUTHORS]Arijit Maji, Raghvendra Kumar, Akash Ghosh, Anushka, Sriparna Saha
[ABSTRACT]Language Models (LMs) are indispensable tools shaping modern workflows, but
their global effectiveness depends on understanding local socio-cultural
contexts. To address this, we introduce SANSKRITI, a benchmark designed to
evaluate language models’ comprehension of India’s rich cultural diversity.
Comprising 21,853 meticulously curated question-answer pairs spanning 28 states
and 8 union territories, SANSKRITI is the largest dataset for testing Indian
cultural knowledge. It covers sixteen key attributes of Indian culture: rituals
and ceremonies, history, tourism, cuisine, dance and music, costume, language,
art, festivals, religion, medicine, transport, sports, nightlife, and
personalities, providing a comprehensive representation of India’s cultural
tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic
Language Models (ILMs), and Small Language Models (SLMs), revealing significant
disparities in their ability to handle culturally nuanced queries, with many
models struggling in region-specific contexts. By offering an extensive,
culturally rich, and diverse dataset, SANSKRITI sets a new standard for
assessing and improving the cultural understanding of LMs.
[COMMENTS]ACL 2025 Findings
[LINK]http://arxiv.org/abs/2506.15355v2
[DATE]2025-10-28 15:12:22+08:00
[CATEGORIES]cs.CL
Reinforcement Learning for Long-Horizon Multi-Turn Search Agents
[AUTHORS]Vivek Kalyan, Martin Andrews
[COMMENTS]4 pages plus references and appendices. Accepted into the First
Workshop on Multi-Turn Interactions in Large Language Models at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24126v1
[DATE]2025-10-28 15:00:42+08:00
[CATEGORIES]cs.CL
RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
[AUTHORS]Md. Rezuwan Hassan, Azmol Hossain, Kanij Fatema, Rubayet Sabbir Faruque, Tanmoy Shome, Ruwad Naswan, Trina Chakraborty, Md. Foriduzzaman Zihad, Tawsif Tashwar Dipto, Nazia Tasnim, Nazmuddoha Ansary, Md. Mehedi Hasan Shawon, Ahmed Imtiaz Humayun, Md. Golam Rabiul Alam, Farig Sadeque, Asif Sushmit
[ABSTRACT]The Bengali language, spoken extensively across South Asia and among
diasporic communities, exhibits considerable dialectal diversity shaped by
geography, culture, and history. Phonological and pronunciation-based
classifications broadly identify five principal dialect groups: Eastern
Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further
distinctions emerge through variation in vocabulary, syntax, and morphology, as
observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali,
and Barishal. Despite this linguistic richness, systematic research on the
computational processing of Bengali dialects remains limited. This study seeks
to document and analyze the phonetic and morphological properties of these
dialects while exploring the feasibility of building computational models,
particularly Automatic Speech Recognition (ASR) systems, tailored to regional
varieties. Such efforts hold potential for applications in virtual assistants
and broader language technologies, contributing to both the preservation of
dialectal diversity and the advancement of inclusive digital tools for
Bengali-speaking communities. The dataset created for this study is released
for public use.
[COMMENTS]26 pages
[LINK]http://arxiv.org/abs/2510.24096v1
[DATE]2025-10-28 14:08:42+08:00
[CATEGORIES]cs.CL
Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
[AUTHORS]Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, Kaifu Zhang
[ABSTRACT]Large Language Models (LLMs) have advanced machine translation but remain
vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not
capable of exposing failures in multilingual LLMs. To disclose hallucination in
multilingual LLMs, we introduce a diagnostic framework with a taxonomy that
separates Instruction Detachment from Source Detachment. Guided by this
taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark
across 11 English-to-X directions. We employ 4 frontier LLMs to generate
candidates and scrutinize these candidates with an ensemble of LLM judges and
expert validation. In this way, we curate 5,435 high-quality instances. We
evaluate 17 LLMs on HalloMTBench. Results reveal distinct “hallucination
triggers”: unique failure patterns reflecting model scale, source length
sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified
language mixing. HalloMTBench offers a forward-looking testbed for diagnosing
LLM translation failures. HalloMTBench is available at
https://huggingface.co/collections/AIDC-AI/marco-mt.
[LINK]http://arxiv.org/abs/2510.24073v1
[DATE]2025-10-28 13:17:18+08:00
[CATEGORIES]cs.CL
MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards
[AUTHORS]ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim
[ABSTRACT]Distilling the tool-using capabilities of large language models (LLMs) into
smaller, more efficient small language models (SLMs) is a key challenge for
their practical application. The predominant approach, supervised fine-tuning
(SFT), suffers from poor generalization as it trains models to imitate a static
set of teacher trajectories rather than learn a robust methodology. While
reinforcement learning (RL) offers an alternative, the standard RL using sparse
rewards fails to effectively guide SLMs, causing them to struggle with
inefficient exploration and adopt suboptimal strategies. To address these
distinct challenges, we propose MENTOR, a framework that synergistically
combines RL with teacher-guided distillation. Instead of simple imitation,
MENTOR employs an RL-based process to learn a more generalizable policy through
exploration. In addition, to solve the problem of reward sparsity, it uses a
teacher’s reference trajectory to construct a dense, composite teacher-guided
reward that provides fine-grained guidance. Extensive experiments demonstrate
that MENTOR significantly improves the cross-domain generalization and
strategic competence of SLMs compared to both SFT and standard sparse-reward RL
baselines.
[LINK]http://arxiv.org/abs/2510.18383v2
[DATE]2025-10-28 12:50:06+08:00
[CATEGORIES]cs.CL
TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration
[AUTHORS]Yuwei Du, Jie Feng, Jie Zhao, Yong Li
[COMMENTS]Accepted by NeurIPS 2025,
https://github.com/tsinghua-fib-lab/TrajAgent
[LINK]http://arxiv.org/abs/2410.20445v5
[DATE]2025-10-28 12:18:04+08:00
[CATEGORIES]cs.CL cs.LG
Pie: A Programmable Serving System for Emerging LLM Applications
[AUTHORS]In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
[ABSTRACT]Emerging large language model (LLM) applications involve diverse reasoning
strategies and agentic workflows, straining the capabilities of existing
serving systems built on a monolithic token generation loop. This paper
introduces Pie, a programmable LLM serving system designed for flexibility and
efficiency. Pie decomposes the traditional generation loop into fine-grained
service handlers exposed via an API and delegates control of the generation
process to user-provided programs, called inferlets. This enables applications
to implement new KV cache strategies and bespoke generation logic, and to
seamlessly integrate computation and I/O, entirely within the application and
without requiring modifications to the serving system. Pie executes inferlets using
WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows
Pie matches state-of-the-art performance on standard tasks (3-12% latency
overhead) while significantly improving latency and throughput (1.3x-3.4x
higher) on agentic workflows by enabling application-specific optimizations.
[COMMENTS]SOSP 2025. Source code available at
https://github.com/pie-project/pie
[LINK]http://arxiv.org/abs/2510.24051v1
[DATE]2025-10-28 12:17:55+08:00
[CATEGORIES]cs.CL
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
[AUTHORS]Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.05316v3
[DATE]2025-10-28 11:50:11+08:00
[CATEGORIES]cs.LG cs.CL
ReCode: Unify Plan and Action for Universal Granularity Control
[AUTHORS]Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu
[ABSTRACT]Real-world tasks require decisions at varying granularities, and humans excel
at this by leveraging a unified cognitive representation where planning is
fundamentally understood as a high-level form of action. However, current Large
Language Model (LLM)-based agents lack this crucial capability to operate
fluidly across decision granularities. This limitation stems from existing
paradigms that enforce a rigid separation between high-level planning and
low-level action, which impairs dynamic adaptability and limits generalization.
We propose ReCode (Recursive Code Generation), a novel paradigm that addresses
this limitation by unifying planning and action within a single code
representation. In this representation, ReCode treats high-level plans as
abstract placeholder functions, which the agent then recursively decomposes
into finer-grained sub-functions until reaching primitive actions. This
recursive approach dissolves the rigid boundary between plan and action,
enabling the agent to dynamically control its decision granularity.
Furthermore, the recursive structure inherently generates rich,
multi-granularity training data, enabling models to learn hierarchical
decision-making processes. Extensive experiments show ReCode significantly
surpasses advanced baselines in inference performance and demonstrates
exceptional data efficiency in training, validating our core insight that
unifying planning and action through recursive code generation is a powerful
and effective approach to achieving universal granularity control. The code is
available at https://github.com/FoundationAgents/ReCode.
[LINK]http://arxiv.org/abs/2510.23564v2
[DATE]2025-10-28 11:22:35+08:00
[CATEGORIES]cs.CL cs.LG
AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation
[AUTHORS]Yilong Lai, Jialong Wu, Zhenglin Wang, Deyu Zhou
[ABSTRACT]Prompting-based conversational query reformulation has emerged as a powerful
approach for conversational search, refining ambiguous user queries into
standalone search queries. Best-of-N reformulation over the generated
candidates via prompting shows impressive potential scaling capability.
However, both the previous tuning methods (training time) and adaptation
approaches (test time) cannot fully unleash their benefits. In this paper, we
propose AdaRewriter, a novel framework for query reformulation using an
outcome-supervised reward model via test-time adaptation. By training a
lightweight reward model with contrastive ranking loss, AdaRewriter selects the
most promising reformulation during inference. Notably, it can operate
effectively in black-box systems, including commercial LLM APIs. Experiments on
five conversational search datasets show that AdaRewriter significantly
outperforms the existing methods across most settings, demonstrating the
potential of test-time adaptation for conversational query reformulation.
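
The selection step can be sketched as Best-of-N scoring with a reward model trained on ranked pairs. This is a minimal sketch under stated assumptions: the pairwise logistic loss below is one common ranking objective and is not confirmed to be the paper's exact contrastive ranking loss, and `score_fn` is a hypothetical stand-in for the lightweight reward model.

```python
import math


def pairwise_ranking_loss(score_better, score_worse):
    """Logistic ranking loss -log(sigmoid(score_better - score_worse)).

    Written in a numerically stable form to avoid overflow for large gaps.
    """
    diff = score_better - score_worse
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def select_best(candidates, score_fn):
    """Best-of-N: return the candidate reformulation with the highest score."""
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    # Toy scorer: prefers longer, more explicit rewrites (illustration only).
    rewrites = ["it?", "what is its price?", "what is the price of the Pixel 9?"]
    print(select_best(rewrites, score_fn=len))
    print(round(pairwise_ranking_loss(2.0, 0.5), 4))
```

Because selection only needs the candidate strings and their scores, this step can run on top of a black-box generator, which is consistent with the abstract's note about commercial LLM APIs.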
[COMMENTS]Accepted by EMNLP 2025
[LINK]http://arxiv.org/abs/2506.01381v2
[DATE]2025-10-28 11:11:09+08:00
[CATEGORIES]cs.CL
SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs
[AUTHORS]Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
[ABSTRACT]Knowledge Distillation (KD) has become a cornerstone technique for
compressing Large Language Models (LLMs) into smaller, more efficient student
models. However, conventional KD approaches typically apply the distillation
loss uniformly across all tokens, regardless of the teacher’s confidence. This
indiscriminate mimicry can introduce noise, as the student is forced to learn
from the teacher’s uncertain or high-entropy predictions, which may ultimately
harm student performance, especially when the teacher is much larger and more
powerful. To address this, we propose Speculative Knowledge Distillation
(SpecKD), a novel, plug-and-play framework that introduces a dynamic,
token-level gating mechanism inspired by the “propose-and-verify” paradigm of
speculative decoding. At each step, the student’s token proposal is verified
against the teacher’s distribution; the distillation loss is selectively
applied only to “accepted” tokens, while “rejected” tokens are masked out.
Extensive experiments on diverse text generation tasks show that SpecKD
consistently and significantly outperforms strong KD baselines, leading to more
stable training and more capable student models, and achieving state-of-the-art
results.
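
The token-level gating idea can be sketched as a masked distillation loss. This is a minimal sketch under stated assumptions: the acceptance rule used here (the student's greedy token must fall within the teacher's top-k tokens) is a hypothetical stand-in, since the abstract does not state the verification criterion, and the function names are illustrative.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def gated_distillation_loss(teacher_probs, student_probs, top_k=2):
    """Average KL over positions whose student proposal the teacher "accepts".

    A position is accepted when the student's greedy token is among the
    teacher's top-k tokens (an assumed rule). Rejected positions are masked out.
    Returns (loss, number_of_accepted_positions).
    """
    total, accepted = 0.0, 0
    for t, s in zip(teacher_probs, student_probs):
        proposal = max(range(len(s)), key=lambda i: s[i])
        teacher_top = sorted(range(len(t)), key=lambda i: t[i], reverse=True)[:top_k]
        if proposal in teacher_top:
            total += kl_divergence(t, s)
            accepted += 1
    return (total / accepted if accepted else 0.0), accepted


if __name__ == "__main__":
    teacher = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
    student = [[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
    print(gated_distillation_loss(teacher, student, top_k=1))
```

In the toy example the second position is rejected because the student's proposal disagrees with the teacher's top token, so only the first position contributes to the loss.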
[LINK]http://arxiv.org/abs/2510.24021v1
[DATE]2025-10-28 11:02:22+08:00
[CATEGORIES]cs.CL
Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
[AUTHORS]Hao An, Yang Xu
[ABSTRACT]Mitigating hallucinations in Large Language Models (LLMs) is critical for
their reliable deployment. Existing methods typically fine-tune LLMs to abstain
from answering questions beyond their knowledge scope. However, these methods
often rely on coarse-grained signals to guide LLMs to abstain, such as overall
confidence or uncertainty scores on multiple sampled answers, which may result
in an imprecise awareness of the model’s own knowledge boundaries. To this end,
we propose a novel reinforcement learning framework built on
Fine-grained Semantic Confidence Reward (FiSCoRe), which guides LLMs to abstain
via sample-specific
confidence. Specifically, our method operates by sampling multiple candidate
answers and conducting semantic clustering, then training the LLM to retain
answers within high-confidence clusters and discard those within low-confidence
ones, thereby promoting accurate post-hoc abstention. Additionally, we propose
a new metric for evaluating the reliability of abstention fine-tuning tasks
more comprehensively. Our method significantly enhances reliability in both
in-domain and out-of-distribution benchmarks.
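
The sample-cluster-filter step can be sketched as follows. This is a simplified sketch, not the paper's method: semantic clustering is approximated by grouping answers with identical normalized strings, cluster confidence is taken as the cluster's share of all samples, and the threshold value is a hypothetical parameter.

```python
from collections import Counter


def normalize(answer):
    """Crude stand-in for semantic equivalence: case, spacing, trailing period."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def filter_by_cluster_confidence(samples, threshold=0.5):
    """Keep sampled answers whose cluster holds at least `threshold` of samples.

    An empty result can be read as a signal to abstain.
    """
    if not samples:
        return []
    keys = [normalize(answer) for answer in samples]
    counts = Counter(keys)
    total = len(samples)
    return [
        answer for answer, key in zip(samples, keys) if counts[key] / total >= threshold
    ]


if __name__ == "__main__":
    print(filter_by_cluster_confidence(["Paris", "paris.", " Paris ", "Lyon"]))
    print(filter_by_cluster_confidence(["1912", "1915", "1921"]))
```

When the samples agree, the dominant cluster survives the filter. When every sample differs, nothing passes the threshold, which corresponds to the case where abstaining is the safer choice.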
[COMMENTS]23 pages, 4 figures
[LINK]http://arxiv.org/abs/2510.24020v1
[DATE]2025-10-28 11:00:35+08:00
[CATEGORIES]cs.CL
TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents
[AUTHORS]Yizhu Jiao, Sha Li, Sizhe Zhou, Heng Ji, Jiawei Han
[ABSTRACT]The task of information extraction (IE) is to extract structured knowledge
from text. However, it is often not straightforward to utilize IE output due to
the mismatch between the IE ontology and the downstream application needs. We
propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE
output and the target database (or knowledge base). Given a user instruction, a
document set, and a database, our task requires the model to update the
database with values from the document set to satisfy the user instruction.
This task requires understanding user instructions for what to extract and
adapting to the given DB/KB schema for how to extract on the fly. To evaluate
this new task, we introduce a new benchmark featuring common demands such as
data infilling, row population, and column addition. In addition, we propose an
LLM agent framework, OPAL (Observe-Plan-Analyze LLM), which includes an Observer
component that interacts with the database, a Planner component that
generates a code-based plan with calls to IE models, and an Analyzer component
that provides feedback regarding code quality before execution. Experiments
show that OPAL can successfully adapt to diverse database schemas by generating
different code plans and calling the required IE models. We also highlight
difficult cases such as dealing with large databases with complex dependencies
and extraction hallucination, which we believe deserve further investigation.
Source code: https://github.com/yzjiao/Text2DB
[COMMENTS]ACL 2025. Source code: https://github.com/yzjiao/Text2DB
[LINK]http://arxiv.org/abs/2510.24014v1
[DATE]2025-10-28 10:49:40+08:00
[CATEGORIES]cs.CL
Discourse Features Enhance Detection of Document-Level Machine-Generated Content
[AUTHORS]Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller
[ABSTRACT]The availability of high-quality APIs for Large Language Models (LLMs) has
facilitated the widespread creation of Machine-Generated Content (MGC), posing
challenges such as academic plagiarism and the spread of misinformation.
Existing MGC detectors often focus solely on surface-level information,
overlooking implicit and structural features. This makes them susceptible to
deception by surface-level sentence patterns, particularly for longer texts and
in texts that have been subsequently paraphrased. To overcome these challenges,
we introduce novel methodologies and datasets. Besides the publicly available
dataset Plagbench, we developed the paraphrased Long-Form Question and Answer
(paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and
DIPPER, a discourse paraphrasing tool, by extending artifacts from their
original versions. To better capture the structure of longer texts at document
level, we propose DTransformer, a model that integrates discourse analysis
through PDTB preprocessing to encode structural features. It results in
substantial performance gains across both datasets: 15.5% absolute improvement
on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement
on M4 compared to SOTA approaches. The data and code are available at:
https://github.com/myxp-lyp/Discourse-Features-Enhance-Detection-of-Document-Level-Machine-Generated-Content.git.
[COMMENTS]Accepted by IJCNN 2025
[LINK]http://arxiv.org/abs/2412.12679v2
[DATE]2025-10-28 10:20:41+08:00
[CATEGORIES]cs.CL
META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine
[AUTHORS]Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
[ABSTRACT]Evidence-based medicine (EBM) plays a crucial role in clinical practice.
Given suitable medical articles, doctors can effectively reduce the incidence of
misdiagnoses. Researchers find it efficient to use large language model (LLM)
techniques such as RAG for EBM tasks. However, EBM maintains stringent
requirements for evidence, and RAG applications in EBM struggle to efficiently
distinguish high-quality evidence. Therefore, inspired by the meta-analysis
used in EBM, we provide a new method to re-rank and filter the medical
evidence. This method presents multiple principles to filter the best evidence
for LLMs to diagnose. We employ a combination of several EBM methods to emulate
the meta-analysis, which includes reliability analysis, heterogeneity analysis,
and extrapolation analysis. These processes allow the users to retrieve the
best medical evidence for the LLMs. Ultimately, we evaluate these high-quality
articles and show an accuracy improvement of up to 11.4% in our experiments.
Our method successfully enables RAG to extract higher-quality and more
reliable evidence from the PubMed dataset. This work can reduce the infusion of
incorrect knowledge into responses and help users receive more effective
replies.
[LINK]http://arxiv.org/abs/2510.24003v1
[DATE]2025-10-28 10:18:09+08:00
[CATEGORIES]cs.CL
PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine
[AUTHORS]Mengzhou Sun, Sendong Zhao, Jianyu Chen, Bin Qin
[ABSTRACT]Evidence-based medicine (EBM) research has always been of paramount
importance. It is important to find appropriate medical theoretical support for
the needs of physicians or patients to reduce the occurrence of medical
accidents. This process is often carried out by humans querying relevant
literature databases, which lacks objectivity and efficiency. Therefore,
researchers utilize retrieval-augmented generation (RAG) to search for evidence
and generate responses automatically. However, current RAG methods struggle to
handle complex queries in real-world clinical scenarios. For example, when
queries lack certain information or use imprecise language, the model may
retrieve irrelevant evidence and generate unhelpful answers. To address this
issue, we present the PICOs-RAG to expand the user queries into a better
format. Our method can expand and normalize the queries into professional ones
and use the PICO format, a search strategy tool present in EBM, to extract the
most important information used for retrieval. This approach significantly
enhances retrieval efficiency and relevance, resulting in up to an 8.8%
improvement compared to the baseline evaluated by our method. PICOs-RAG
thereby makes large language models more helpful and reliable medical
assistants in EBM.
[LINK]http://arxiv.org/abs/2510.23998v1
[DATE]2025-10-28 10:01:05+08:00
[CATEGORIES]cs.CL
M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems
[AUTHORS]Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
[ABSTRACT]Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing
medical question-answering systems through the integration of large language
models (LLMs) with external medical literature. LLMs can retrieve relevant
medical articles to generate more professional responses efficiently. However,
current RAG applications still face problems. They generate incorrect
information, such as hallucinations, and they fail to use external knowledge
correctly. To solve these issues, we propose a new method named M-Eval. This
method is inspired by the heterogeneity analysis approach used in
Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG
responses using evidence from multiple sources. First, we extract additional
medical literature from external knowledge bases. Then, we retrieve the
evidence documents generated by the RAG system. We use heterogeneity analysis
to check whether the evidence supports different viewpoints in the response. In
addition to verifying the accuracy of the response, we also assess the
reliability of the evidence provided by the RAG system. Our method shows an
improvement of up to 23.31% accuracy across various LLMs. This work can help
detect errors in current RAG-based medical systems. It also makes the
applications of LLMs more reliable and reduces diagnostic errors.
[LINK]http://arxiv.org/abs/2510.23995v1
[DATE]2025-10-28 09:57:40+08:00
[CATEGORIES]cs.CL
PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings
[AUTHORS]Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, Yohan Jo
[ABSTRACT]Visual persuasion, which uses visual elements to influence cognition and
behaviors, is crucial in fields such as advertising and political
communication. With recent advancements in artificial intelligence, there is
growing potential to develop persuasive systems that automatically generate
persuasive images tailored to individuals. However, a significant bottleneck in
this area is the lack of comprehensive datasets that connect the persuasiveness
of images with the personal information about those who evaluated the images.
To address this gap and facilitate technological advancements in personalized
visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset,
comprising 28,454 persuasive images across 596 messages and 9 persuasion
strategies. Importantly, the PVP dataset provides persuasiveness scores of
images evaluated by 2,521 human annotators, along with their demographic and
psychological characteristics (personality traits and values). We demonstrate
the utility of our dataset by developing a persuasive image generator and an
automated evaluator, and establish benchmark baselines. Our experiments reveal
that incorporating psychological characteristics enhances the generation and
evaluation of persuasive images, providing valuable insights for personalized
visual persuasion.
[COMMENTS]ACL 2025 Main. Code and dataset are released at:
https://github.com/holi-lab/PVP_Personalized_Visual_Persuasion
[LINK]http://arxiv.org/abs/2506.00481v2
[DATE]2025-10-28 08:59:36+08:00
[CATEGORIES]cs.CL
emg2speech: synthesizing speech from electromyography using self-supervised speech models
[AUTHORS]Harshavardhana T. Gowda, Lee M. Miller
[ABSTRACT]We present a neuromuscular speech interface that translates electromyographic
(EMG) signals collected from orofacial muscles during speech articulation
directly into audio. We show that self-supervised speech (SS) representations
exhibit a strong linear relationship with the electrical power of muscle action
potentials: SS features can be linearly mapped to EMG power with a correlation
of $r = 0.85$. Moreover, EMG power vectors corresponding to different
articulatory gestures form structured and separable clusters in feature space.
This relationship: $\text{SS features}$ $\xrightarrow{\texttt{linear mapping}}$
$\text{EMG power}$ $\xrightarrow{\texttt{gesture-specific clustering}}$
$\text{articulatory movements}$, highlights that SS models implicitly encode
articulatory mechanisms. Leveraging this property, we directly map EMG signals
to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech
generation without explicit articulatory models and vocoder training.
[LINK]http://arxiv.org/abs/2510.23969v1
[DATE]2025-10-28 08:50:15+08:00
[CATEGORIES]cs.CL
Leveraging LLMs for Early Alzheimer’s Prediction
[AUTHORS]Tananun Songdechakraiwut
[ABSTRACT]We present a connectome-informed LLM framework that encodes dynamic fMRI
connectivity as temporal sequences, applies robust normalization, and maps
these data into a representation suitable for a frozen pre-trained LLM for
clinical prediction. Applied to early Alzheimer’s detection, our method
achieves sensitive prediction with error rates well below clinically recognized
margins, with implications for timely Alzheimer’s intervention.
[LINK]http://arxiv.org/abs/2510.23946v1
[DATE]2025-10-28 07:59:03+08:00
[CATEGORIES]cs.CL
EQ-Negotiator: Emotion Policing Personas for Anti-Manipulation in Credit Collection Dialogues
[AUTHORS]Yunbo Long, Yuhan Liu
[ABSTRACT]Persona modeling in large language models typically focuses on static
character traits, but overlooks the dynamic emotional intelligence required for
real-time adversarial negotiations. In financial dialogues, this limitation
creates critical vulnerabilities: debtors exploit predictable empathetic
responses through emotional manipulation tactics like aggression, feigned
distress, and guilt-tripping. To bridge this gap, we present EQ-Negotiator, a
novel framework that grounds persona behavior in emotion dynamics rather than
static personality profiles. Unlike naive empathy-centric agents, EQ-Negotiator
integrates emotion memory and game-theoretic reasoning, powered by a Hidden
Markov Model (HMM) to track and predict debtor emotional states. By analyzing
both real-time and historical emotional cues, EQ-Negotiator strategically
counters negative emotions (e.g., aggression, feigned distress) while
preserving productive debtor relationships. This work advances persona modeling
from descriptive character profiles to functional emotional architectures,
establishing emotion as the critical link between persona design and tactical
execution. Through agent-to-agent validation across 20 credit negotiation
scenarios, we demonstrate that emotion-driven personas enable robust defensive
capabilities against manipulation while maintaining strategic effectiveness.
[LINK]http://arxiv.org/abs/2503.21080v6
[DATE]2025-10-28 07:31:13+08:00
[CATEGORIES]cs.CL
Seeing Symbols, Missing Cultures: Probing Vision-Language Models’ Reasoning on Fire Imagery and Cultural Meaning
[AUTHORS]Haorui Yu, Yang Zhao, Yijia Chu, Qiufeng Yi
[COMMENTS]8 pages, 5 figures, 4 tables. Submitted to WiNLP 2025 Workshop at
COLING 2025
[LINK]http://arxiv.org/abs/2509.23311v2
[DATE]2025-10-28 07:22:21+08:00
[CATEGORIES]cs.CL
Latent Chain-of-Thought for Visual Reasoning
[AUTHORS]Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
[ABSTRACT]Chain-of-thought (CoT) reasoning is critical for improving the
interpretability and reliability of Large Vision-Language Models (LVLMs).
However, existing training algorithms such as SFT, PPO, and GRPO may not
generalize well across unseen reasoning tasks and heavily rely on a biased
reward model. To address this challenge, we reformulate reasoning in LVLMs as
posterior inference and propose a scalable training algorithm based on
amortized variational inference. By leveraging diversity-seeking reinforcement
learning algorithms, we introduce a novel sparse reward function for
token-level learning signals that encourage diverse, high-likelihood latent
CoT, overcoming deterministic sampling limitations and avoiding reward hacking.
Additionally, we implement a Bayesian inference-scaling strategy that replaces
costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank
optimal rationales and answers. We empirically demonstrate that the proposed
method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in
terms of effectiveness, generalization, and interpretability.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23925v1
[DATE]2025-10-28 07:10:06+08:00
[CATEGORIES]cs.CL
Agent-based Automated Claim Matching with Instruction-following LLMs
[AUTHORS]Dina Pisarevskaya, Arkaitz Zubiaga
[ABSTRACT]We present a novel agent-based approach for the automated claim matching task
with instruction-following LLMs. We propose a two-step pipeline that first
generates prompts with LLMs, to then perform claim matching as a binary
classification task with LLMs. We demonstrate that LLM-generated prompts can
outperform SOTA with human-generated prompts, and that smaller LLMs can do as
well as larger ones in the generation process, which saves computational
resources. We also demonstrate the effectiveness of using different LLMs for
each step of the pipeline, i.e. using an LLM for prompt generation, and another
for claim matching. Our investigation into the prompt generation process in
turn reveals insights into the LLMs’ understanding of claim matching.
[COMMENTS]Accepted for the International Joint Conference on Natural Language
Processing & Asia-Pacific Chapter of the Association for Computational
Linguistics (2025) Findings
[LINK]http://arxiv.org/abs/2510.23924v1
[DATE]2025-10-28 07:09:35+08:00
[CATEGORIES]cs.CL
Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector
[AUTHORS]Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, Taha Kass-Hout
[ABSTRACT]LLM-as-a-Judge has emerged as a promising tool for automatically evaluating
generated outputs, but its reliability is often undermined by potential biases
in judgment. Existing efforts to mitigate these biases face key limitations:
in-context learning-based methods fail to address rooted biases due to the
evaluator’s limited capacity for self-reflection, whereas fine-tuning is not
applicable to all evaluator types, especially closed-source models. To address
this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is
a plug-in module that identifies biased evaluations and generates structured
reasoning to guide evaluator self-correction. Rather than modifying the
evaluator itself, RBD operates externally and engages in an iterative process
of bias detection and feedback-driven revision. To support its development, we
design a complete pipeline consisting of biased dataset construction,
supervision collection, distilled reasoning-based fine-tuning of RBD, and
integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging
from 1.5B to 14B, and observe consistent performance improvements across all
scales. Experimental results on 4 bias types–verbosity, position, bandwagon,
and sentiment–evaluated using 8 LLM evaluators demonstrate RBD’s strong
effectiveness. For example, the RBD-8B model improves evaluation accuracy by an
average of 18.5% and consistency by 10.9%, and surpasses prompting-based
baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results
highlight RBD’s effectiveness and scalability. Additional experiments further
demonstrate its strong generalization across biases and domains, as well as its
efficiency.
[COMMENTS]Accepted at NeurIPS 2025 (Camera-Ready Version)
[LINK]http://arxiv.org/abs/2505.17100v2
[DATE]2025-10-28 06:09:54+08:00
[CATEGORIES]cs.CL
AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages
[AUTHORS]Kosei Uemura, Miaoran Zhang, David Ifeoluwa Adelani
[ABSTRACT]Text embeddings are an essential building component of several NLP tasks, such
as retrieval-augmented generation, which is crucial for preventing
hallucinations in LLMs. Despite the recent release of massively multilingual
MTEB (MMTEB), African languages remain underrepresented, with existing tasks
often repurposed from translation benchmarks such as FLORES clustering or
SIB-200. In this paper, we introduce AfriMTEB – a regional expansion of MMTEB
covering 59 languages, 14 tasks, and 38 datasets, including six newly added
datasets. Unlike many MMTEB datasets that include fewer than five languages,
the new additions span 14 to 56 African languages and introduce entirely new
tasks, such as hate speech detection, intent detection, and emotion
classification, which were not previously covered. Complementing this, we
present AfriE5, an adaptation of the instruction-tuned mE5 model to African
languages through cross-lingual contrastive distillation. Our evaluation shows
that AfriE5 achieves state-of-the-art performance, outperforming strong
baselines such as Gemini-Embeddings and mE5.
[LINK]http://arxiv.org/abs/2510.23896v1
[DATE]2025-10-28 06:06:43+08:00
[CATEGORIES]cs.CL
GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
[AUTHORS]Zhichao Wang
[ABSTRACT]I propose Group-relative Implicit Fine Tuning (GIFT), a novel
reinforcement learning framework for aligning
LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT
minimizes the discrepancy between implicit and explicit reward models. It
combines three key ideas: (1) the online multi-response generation and
normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the
implicit-explicit reward alignment principle of UNA. By jointly normalizing the
implicit and explicit rewards, GIFT eliminates an otherwise intractable term
that prevents effective use of implicit rewards. This normalization transforms
the complex reward maximization objective into a simple mean squared error
(MSE) loss between the normalized reward functions, converting a non-convex
optimization problem into a convex, stable, and analytically differentiable
formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy
and thus retains exploration capability. Compared to GRPO, it requires fewer
hyperparameters, converges faster, and generalizes better with significantly
reduced training overfitting. Empirically, GIFT achieves superior reasoning and
alignment performance on mathematical benchmarks while remaining
computationally efficient.
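The loss described above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes the implicit reward takes the DPO form beta * (log pi(y|x) - log pi_ref(y|x)), that normalization means standardizing rewards within a group of responses to the same prompt, and that the function names (`normalize`, `gift_style_loss`) are hypothetical.

```python
import math

def normalize(values):
    """Standardize a group of values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 when all values are equal
    return [(v - mean) / std for v in values]

def gift_style_loss(policy_logps, ref_logps, explicit_rewards, beta=1.0):
    """MSE between group-normalized implicit and explicit rewards.

    Assumption: the implicit reward is the DPO-style log-ratio
    beta * (log pi - log pi_ref), computed per sampled response.
    """
    implicit = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    norm_implicit = normalize(implicit)
    norm_explicit = normalize(explicit_rewards)
    squared = [(a - b) ** 2 for a, b in zip(norm_implicit, norm_explicit)]
    return sum(squared) / len(squared)
```

Under these assumptions, any additive term shared by all responses in a group cancels in the mean subtraction, which is one plausible reading of how joint normalization removes the intractable term.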
[LINK]http://arxiv.org/abs/2510.23868v1
[DATE]2025-10-28 05:18:19+08:00
[CATEGORIES]cs.LG cs.CL
Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs
[AUTHORS]Jyotika Singh, Weiyi Sun, Amit Agarwal, Viji Krishnamurthy, Yassine Benajiba, Sujith Ravi, Dan Roth
[ABSTRACT]In modern industry systems like multi-turn chat agents, Text-to-SQL
technology bridges natural language (NL) questions and database (DB) querying.
The conversion of tabular DB results into NL representations (NLRs) enables the
chat-based interaction. Currently, NLR generation is typically handled by large
language models (LLMs), but information loss or errors in presenting tabular
results in NL remain largely unexplored. This paper introduces a novel
evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that
combines the benefits of multiple existing methods, optimizing evaluation
fidelity and achieving a significant reduction in LLM calls by 25-61%.
Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR
benchmarking. Through human evaluations, we demonstrate the superior alignment
of Combo-Eval with human judgments, applicable across scenarios with and
without ground truth references.
[COMMENTS]Accepted at EMNLP 2025
[LINK]http://arxiv.org/abs/2510.23854v1
[DATE]2025-10-28 04:52:19+08:00
[CATEGORIES]cs.CL
Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception
[AUTHORS]Yize Cheng, Arshia Soltani Moakhar, Chenrui Fan, Kazem Faghih, Parsa Hosseini, Wenxiao Wang, Soheil Feizi
[ABSTRACT]Large language model agents are increasingly used in multi-turn
conversational settings to interact with and execute tasks in dynamic
environments. However, a key limitation is their temporal blindness: they, by
default, operate with a stationary context, failing to account for the
real-world time elapsed between messages. This becomes a critical liability
when an agent must decide whether to invoke a tool based on how much time has
passed since the last observation. Without temporal awareness, agents often
either over-rely on previous context (skipping necessary tool calls), or
under-rely on it (unnecessarily repeating tool calls). To study this challenge,
we introduce TicToc-v1, a test set of multi-turn user-agent trajectories across
34 scenarios with varying time sensitivity. Each trajectory ends with a user
question, where the need for a tool call depends on the amount of time elapsed
since the last message. To give LLMs temporal context, we augment dialogue
messages with explicit timestamps, bridging the gap between static dialogue and
evolving environments. We then collected human preferences for these samples,
creating two subsets: one where humans preferred relying on the previous
observation (prefer-noTool), and another where they preferred a new tool call
(prefer-Tool). We evaluated how well LLM tool-calling decisions align with
human preferences under varying time intervals on TicToc-v1. Our analysis shows
that without time information, most models perform only slightly better than
random, with the top alignment rate being just over 60%. While adding
timestamps leads to a slight improvement, particularly for larger models, the
improvement is modest, peaking at around 65%. We also show that naive,
prompt-based alignment has limited effectiveness. Our findings highlight the
need for specific post-training alignment to align multi-turn LLM tool use with
human temporal perception.
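The timestamp augmentation mentioned above could look something like the sketch below. The message schema (`role`, `content`, `time`), the bracketed UTC prefix, and the function name `add_timestamps` are assumptions made for illustration; the abstract does not specify the format used in TicToc-v1.

```python
from datetime import datetime, timezone

def add_timestamps(messages):
    """Return copies of the messages with a UTC timestamp prefixed to the content.

    Each input message is assumed to be a dict with "role", "content",
    and a timezone-aware datetime under "time" (a hypothetical schema).
    """
    stamped = []
    for msg in messages:
        ts = msg["time"].astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
        stamped.append({"role": msg["role"], "content": f"[{ts}] {msg['content']}"})
    return stamped
```

With timestamps in the context, the gap between the last tool observation and the new user question becomes visible to the model, which is the information it needs to decide whether a fresh tool call is warranted.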
[COMMENTS]preliminary work in progress
[LINK]http://arxiv.org/abs/2510.23853v1
[DATE]2025-10-28 04:51:58+08:00
[CATEGORIES]cs.CL
A Neural Model for Contextual Biasing Score Learning and Filtering
[AUTHORS]Wanting Huang, Weiran Wang
[ABSTRACT]Contextual biasing improves automatic speech recognition (ASR) by integrating
external knowledge, such as user-specific phrases or entities, during decoding.
In this work, we use an attention-based biasing decoder to produce scores for
candidate phrases based on acoustic information extracted by an ASR encoder,
which can be used to filter out unlikely phrases and to calculate a bonus for
shallow-fusion biasing. We introduce a per-token discriminative objective that
encourages higher scores for ground-truth phrases while suppressing
distractors. Experiments on the Librispeech biasing benchmark show that our
method effectively filters out the majority of the candidate phrases, and
significantly improves recognition accuracy under different biasing conditions
when the scores are used in shallow fusion biasing. Our approach is modular and
can be used with any ASR system, and the filtering mechanism can potentially
boost performance of other biasing methods.
[COMMENTS]Accepted to IEEE ASRU 2025
[LINK]http://arxiv.org/abs/2510.23849v1
[DATE]2025-10-28 04:41:52+08:00
[CATEGORIES]cs.CL
A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications
[AUTHORS]Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Hui Liu, Xiang Zhang, Suhang Wang
[ABSTRACT]The advent of large language models (LLMs) has transformed information access
and reasoning through open-ended natural language interaction. However, LLMs
remain limited by static knowledge, factual hallucinations, and the inability
to retrieve real-time or domain-specific information. Retrieval-Augmented
Generation (RAG) mitigates these issues by grounding model outputs in external
evidence, but traditional RAG pipelines are often single-turn and heuristic,
lacking adaptive control over retrieval and reasoning. Recent advances in
agentic search address these limitations by enabling LLMs to plan, retrieve,
and reflect through multi-step interaction with search environments. Within
this paradigm, reinforcement learning (RL) offers a powerful mechanism for
adaptive and self-improving search behavior. This survey provides the first
comprehensive overview of RL-based agentic search, organizing the
emerging field along three complementary dimensions: (i) What RL is for
(functional roles), (ii) How RL is used (optimization strategies), and (iii)
Where RL is applied (scope of optimization). We summarize representative
methods, evaluation protocols, and applications, and discuss open challenges
and future directions toward building reliable and scalable RL-driven agentic
search systems. We hope this survey will inspire future research on the
integration of RL and agentic search. Our repository is available at
https://github.com/ventr1c/Awesome-RL-based-Agentic-Search-Papers.
[COMMENTS]38 pages, 4 figures, 7 tables
[LINK]http://arxiv.org/abs/2510.16724v2
[DATE]2025-10-28 03:23:17+08:00
[CATEGORIES]cs.CL
Science Hierarchography: Hierarchical Organization of Science Literature
[AUTHORS]Muhan Gao, Jash Shah, Weiqi Wang, Kuan-Hao Huang, Daniel Khashabi
[ABSTRACT]Scientific knowledge is growing rapidly, making it difficult to track
progress and high-level conceptual links across broad disciplines. While tools
like citation networks and search engines help retrieve related papers, they
lack the abstraction needed to represent the density and
structure of activity across subfields.
We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific
literature into a high-quality hierarchical structure that spans multiple
levels of abstraction – from broad domains to specific studies. Such a
representation can provide insights into which fields are well-explored and
which are under-explored. To achieve this goal, we develop a hybrid approach
that combines efficient embedding-based clustering with LLM-based prompting,
striking a balance between scalability and semantic precision. Compared to
LLM-heavy methods like iterative tree construction, our approach achieves
superior quality-speed trade-offs. Our hierarchies capture different dimensions
of research contributions, reflecting the interdisciplinary and multifaceted
nature of modern science. We evaluate its utility by measuring how effectively
an LLM-based agent can navigate the hierarchy to locate target papers. Results
show that our method improves interpretability and offers an alternative
pathway for exploring scientific literature beyond traditional search methods.
Code, data and demo are available:
https://github.com/JHU-CLSP/science-hierarchography
[LINK]http://arxiv.org/abs/2504.13834v6
[DATE]2025-10-28 03:17:31+08:00
[CATEGORIES]cs.CL
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
[AUTHORS]Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiyi, Yu-Chiang Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
[ABSTRACT]Advancing machine intelligence requires developing the ability to perceive
across multiple modalities, much as humans sense the world. We introduce
OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We
carefully study the design choices across model architecture and data curation.
For model architecture, we present three key innovations: (i) OmniAlignNet for
strengthening alignment between vision and audio embeddings in a shared
omni-modal latent space; (ii) Temporal Embedding Grouping for capturing
relative temporal alignment between vision and audio signals; and (iii)
Constrained Rotary Time Embedding for encoding absolute temporal information in
omni-modal embeddings. We introduce a curation and synthesis pipeline that
generates 24M single-modal and omni-modal conversations. We find that
modalities reinforce one another in both perception and reasoning. Our model,
OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal
understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while
using just 0.2T training tokens - a 6 times reduction compared to
Qwen2.5-Omni’s 1.2T. We finally demonstrate omni-modal advantages in downstream
applications spanning robotics, medical AI, and smart factory.
[COMMENTS]Technical Report. Code: https://github.com/NVlabs/OmniVinci
[LINK]http://arxiv.org/abs/2510.15870v2
[DATE]2025-10-28 03:12:55+08:00
[CATEGORIES]cs.CL
Semantic Agreement Enables Efficient Open-Ended LLM Cascades
[AUTHORS]Duncan Soiffer, Steven Kolawole, Virginia Smith
[ABSTRACT]Cascade systems route computational requests to smaller models when possible
and defer to larger models only when necessary, offering a promising approach
to balance cost and quality in LLM deployment. However, they face a fundamental
challenge in open-ended text generation: determining output reliability when
generation quality lies on a continuous spectrum, often with multiple valid
responses. To address this, we propose semantic agreement – meaning-level
consensus between ensemble outputs – as a training-free signal for reliable
deferral. We show that when diverse model outputs agree semantically, their
consensus is a stronger reliability signal than token-level confidence.
Evaluated from 500M to 70B-parameter models, we find that semantic cascades
match or surpass target-model quality at 40% of the cost and reduce latency by
up to 60%. Our method requires no model internals, works across black-box APIs,
and remains robust to model updates, making it a practical baseline for
real-world LLM deployment.
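A deferral rule of this kind can be sketched as follows. This is an illustrative sketch under stated assumptions: token-level Jaccard overlap stands in for a real meaning-level similarity measure, the models are plain callables, and the names `jaccard`, `mean_pairwise_agreement`, `cascade`, and the threshold of 0.6 are hypothetical rather than taken from the paper.

```python
def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def mean_pairwise_agreement(outputs, similarity=jaccard):
    """Average similarity over all unordered pairs of outputs."""
    pairs = [(i, j) for i in range(len(outputs)) for j in range(i + 1, len(outputs))]
    if not pairs:
        return 1.0
    return sum(similarity(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)

def cascade(prompt, small_models, large_model, threshold=0.6, similarity=jaccard):
    """Answer with the small ensemble if its outputs agree, else defer."""
    outputs = [model(prompt) for model in small_models]
    if mean_pairwise_agreement(outputs, similarity) >= threshold:
        return outputs[0], "small"
    return large_model(prompt), "large"
```

Because the rule only inspects output text, it needs no logits or other model internals, which is consistent with the black-box setting the abstract describes. In practice the similarity function would be swapped for an embedding- or entailment-based measure.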
[COMMENTS]2025 Conference on Empirical Methods in Natural Language Processing
(EMNLP) Industry Track
[LINK]http://arxiv.org/abs/2509.21837v3
[DATE]2025-10-28 02:59:37+08:00
[CATEGORIES]cs.CL
Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
[AUTHORS]Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton
[COMMENTS]Accepted at NeurIPS 2025 (main conference)
[LINK]http://arxiv.org/abs/2506.06964v2
[DATE]2025-10-28 02:56:23+08:00
[CATEGORIES]cs.CL cs.LG
Learned, Lagged, LLM-splained: LLM Responses to End User Security Questions
[AUTHORS]Vijay Prakash, Kevin Lee, Arkaprabha Bhattacharya, Danny Yuxing Huang, Jessica Staddon
[ABSTRACT]Answering end user security questions is challenging. While large language
models (LLMs) like GPT, LLAMA, and Gemini are far from error-free, they have
shown promise in answering a variety of questions outside of security. We
studied LLM performance in the area of end user security by qualitatively
evaluating 3 popular LLMs on 900 systematically collected end user security
questions.
While LLMs demonstrate broad generalist “knowledge” of end user security
information, there are patterns of errors and limitations across LLMs
consisting of stale and inaccurate answers, and indirect or unresponsive
communication styles, all of which impact the quality of information received.
Based on these patterns, we suggest directions for model improvement and
recommend user strategies for interacting with LLMs when seeking assistance
with security.
[COMMENTS]17 pages, 7 tables
[LINK]http://arxiv.org/abs/2411.14571v2
[DATE]2025-10-28 02:54:02+08:00
[CATEGORIES]cs.CL
BitSkip: An Empirical Analysis of Quantization and Early Exit Composition
[AUTHORS]Ramshankar Bhuvaneswaran, Handan Liu
[ABSTRACT]The pursuit of efficient Large Language Models (LLMs) has led to increasingly
complex techniques like extreme quantization and dynamic routing. While
individual benefits of these methods are well-documented, their compositional
effects remain poorly understood. This paper introduces BitSkip, a hybrid
architectural framework for systematically exploring these interactions.
Counter-intuitively, our findings reveal that a simple 8-bit quantized model
without Hadamard transform (BitSkip-V1) not only outperforms its more complex
4-bit and Hadamard-enhanced counterparts but also competes with the
full-precision baseline in quality (perplexity of 1.13 vs 1.19). The
introduction of Hadamard transforms, even at 8-bit precision, catastrophically
degraded performance by over 37,000%, tracing to fundamental training
instability. Our BitSkip-V1 recipe
demonstrates superior early-exit characteristics, with layer 18 providing
optimal 32.5% speed gain for minimal 4% quality loss.
[COMMENTS]Submitted to JMLR
[LINK]http://arxiv.org/abs/2510.23766v1
[DATE]2025-10-28 02:53:08+08:00
[CATEGORIES]cs.CL
RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
[AUTHORS]Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
[ABSTRACT]Retrieval-Augmented Generation (RAG) enhances recency and factuality in
answers. However, existing evaluations rarely test how well these systems cope
with real-world noise, conflicts between internal and external retrieved
contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness
Evaluation (RARE), a unified framework and large-scale benchmark that jointly
stress-tests query and document perturbations over dynamic, time-sensitive
corpora. One of the central features of RARE is a knowledge-graph-driven
synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop
relations from the customized corpus and generates multi-level question sets
without manual intervention. Leveraging this pipeline, we construct a dataset
(RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and
policy documents and 48295 questions whose distribution evolves as the
underlying sources change. To quantify resilience, we formalize
retrieval-conditioned robustness metrics (RARE-Met) that capture a model’s
ability to remain correct or recover when queries, documents, or real-world
retrieval results are systematically altered. Our findings reveal that RAG
systems are unexpectedly sensitive to perturbations. Moreover, they
consistently demonstrate lower robustness on multi-hop queries compared to
single-hop queries across all domains.
[LINK]http://arxiv.org/abs/2506.00789v3
[DATE]2025-10-28 02:46:06+08:00
[CATEGORIES]cs.CL
DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans
[AUTHORS]Bingsheng Yao, Bo Sun, Yuanzhe Dong, Yuxuan Lu, Dakuo Wang
[ABSTRACT]The emerging large language model role-playing agents (LLM RPAs) aim to
simulate individual human behaviors, but the persona fidelity is often
undermined by manually-created profiles (e.g., cherry-picked information and
personality characteristics) without validating the alignment with the target
individuals. To address this limitation, our work introduces the Dynamic
Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM
RPAs’ behaviors with those of target individuals by iteratively identifying the
cognitive divergence, either through free-form or theory-grounded, structured
analysis, between generated behaviors and human ground truth, and refining the
persona profile to mitigate these divergences. We evaluate DPRF with five LLMs
on four diverse behavior-prediction scenarios: formal debates, social media
posts with mental health issues, public interviews, and movie reviews. DPRF can
consistently improve behavioral alignment considerably over baseline personas
and generalizes across models and scenarios. Our work provides a robust
methodology for creating high-fidelity persona profiles and enhancing the
validity of downstream applications, such as user simulation, social studies,
and personalized AI.
[COMMENTS]In Submission
[LINK]http://arxiv.org/abs/2510.14205v2
[DATE]2025-10-28 02:45:42+08:00
[CATEGORIES]cs.CL
Evaluating Long-Term Memory for Long-Context Question Answering
[AUTHORS]Alessandra Terranova, Björn Ross, Alexandra Birch
[ABSTRACT]In order for large language models to achieve true conversational continuity
and benefit from experiential learning, they need memory. While research has
focused on the development of complex memory systems, it remains unclear which
types of memory are most effective for long-context conversational tasks. We
present a systematic evaluation of memory-augmented methods using LoCoMo, a
benchmark of synthetic long-context dialogues annotated for question-answering
tasks that require diverse reasoning strategies. We analyse full-context
prompting, semantic memory through retrieval-augmented generation and agentic
memory, episodic memory through in-context learning, and procedural memory
through prompt optimization. Our findings show that memory-augmented approaches
reduce token usage by over 90% while maintaining competitive accuracy. Memory
architecture complexity should scale with model capability, with small
foundation models benefitting most from RAG, and strong instruction-tuned
reasoning models gaining from episodic learning through reflections and more
complex agentic semantic memory. In particular, episodic memory can help LLMs
recognise the limits of their own knowledge.
[COMMENTS]14 pages including appendix, 3 figures. Submitted to October ARR and
to Metacognition in Generative AI EurIPS workshop (under review for both)
[LINK]http://arxiv.org/abs/2510.23730v1
[DATE]2025-10-28 02:03:50+08:00
[CATEGORIES]cs.CL
Variational Masked Diffusion Models
[AUTHORS]Yichi Zhang, Alex Schwing, Zhizhen Zhao
[ABSTRACT]Masked diffusion models have recently emerged as a flexible framework for
discrete generative modeling. However, a key limitation of standard masked
diffusion is its inability to effectively capture dependencies among tokens
that are predicted concurrently, leading to degraded generation quality when
dependencies among tokens are important. To explicitly model dependencies among
tokens, we propose Variational Masked Diffusion (VMD), a framework that
introduces latent variables into the masked diffusion process. Through
controlled experiments on synthetic datasets, we demonstrate that VMD
successfully learns dependencies that conventional masked diffusion fails to
capture. We further validate the effectiveness of our approach on Sudoku
puzzles and text datasets, where learning of dependencies among tokens improves
global consistency. Across these domains, VMD enhances both generation quality
and dependency awareness, highlighting the value of integrating variational
inference into masked diffusion. Our code is available at:
https://riccizz.github.io/VMD.
[COMMENTS]Project Page: https://riccizz.github.io/VMD
[LINK]http://arxiv.org/abs/2510.23606v1
[DATE]2025-10-28 01:59:57+08:00
[CATEGORIES]cs.LG cs.CL
Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models
[AUTHORS]Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab
[ABSTRACT]Large Language Models (LLMs) deployed in real-world settings increasingly
face the need to unlearn sensitive, outdated, or proprietary information.
Existing unlearning methods typically formulate forgetting and retention as a
regularized trade-off, combining both objectives into a single scalarized loss.
This often leads to unstable optimization and degraded performance on retained
data, especially under aggressive forgetting. We propose a new formulation of
LLM unlearning as a constrained optimization problem: forgetting is enforced
via a novel logit-margin flattening loss that explicitly drives the output
distribution toward uniformity on a designated forget set, while retention is
preserved through a hard constraint on a separate retain set. Compared to
entropy-based objectives, our loss is softmax-free, numerically stable, and
maintains non-vanishing gradients, enabling more efficient and robust
optimization. We solve the constrained problem using a scalable primal-dual
algorithm that exposes the trade-off between forgetting and retention through
the dynamics of the dual variable, all without any extra computational
overhead. Evaluations on the TOFU and MUSE benchmarks across diverse LLM
architectures demonstrate that our approach consistently matches or exceeds
state-of-the-art baselines, effectively removing targeted information while
preserving downstream utility.
[COMMENTS]The Thirty-Ninth Annual Conference on Neural Information Processing
Systems
[LINK]http://arxiv.org/abs/2506.05314v2
[DATE]2025-10-28 01:59:13+08:00
[CATEGORIES]cs.CL cs.LG
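The primal-dual idea mentioned in the Constrained Entropic Unlearning abstract above can be illustrated on a toy problem. The sketch below is not the paper's algorithm or loss: it assumes a hypothetical one-dimensional objective, minimize (x - 2)^2 subject to x - 1 <= 0, and the function name and step sizes are chosen only for illustration. It shows how the dual variable grows while the constraint is violated and settles once the constraint is met.

```python
def primal_dual(steps=2000, lr_primal=0.05, lr_dual=0.05):
    """Toy problem (an assumption for illustration): minimize (x - 2)^2 s.t. x - 1 <= 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal descent on the Lagrangian L(x, lam) = (x - 2)^2 + lam * (x - 1).
        grad_x = 2.0 * (x - 2.0) + lam
        x -= lr_primal * grad_x
        # Dual ascent on the constraint value, projected back onto lam >= 0.
        lam = max(0.0, lam + lr_dual * (x - 1.0))
    return x, lam


x, lam = primal_dual()
print(round(x, 4), round(lam, 4))  # approaches the KKT point x = 1, lam = 2
```

In this toy setting the final value of the dual variable (about 2) measures how strongly the constraint pulls against the unconstrained optimum, which is the kind of trade-off signal the abstract attributes to the dual dynamics.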
Think Twice: Branch-and-Rethink Reasoning Reward Model
[AUTHORS]Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
[ABSTRACT]Large language models (LLMs) increasingly rely on thinking models that
externalize intermediate steps and allocate extra test-time compute, with
think-twice strategies showing that a deliberate second pass can elicit
stronger reasoning. In contrast, most reward models (RMs) still compress many
quality dimensions into a single scalar in one shot, a design that induces
judgment diffusion: attention spreads across evaluation criteria, yielding
diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a
two-turn RM that transfers the think-twice principle to reward modeling. Turn 1
performs adaptive branching, selecting a small set of instance-critical
dimensions (such as factuality and safety) and sketching concise,
evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a
targeted reread that tests those hypotheses and scrutinizes only what matters
most. We train with GRPO-style reinforcement learning over structured two-turn
traces using a simple binary outcome reward with strict format checks, making
the approach compatible with standard RLHF pipelines. By converting
all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment
diffusion and improves sensitivity to subtle yet
consequential errors while remaining practical and scalable. Experimental
results demonstrate that our model achieves state-of-the-art performance on
three challenging reward modeling benchmarks across diverse domains. The code
and the model will be released soon.
[LINK]http://arxiv.org/abs/2510.23596v1
[DATE]2025-10-28 01:58:07+08:00
[CATEGORIES]cs.CL
Hope Speech Detection in Social Media English Corpora: Performance of Traditional and Transformer Models
[AUTHORS]Luis Ramos, Hiram Calvo, Olga Kolesnikova
[ABSTRACT]The identification of hope speech has become a promising NLP task, considering
the need to detect motivational expressions of agency and goal-directed
behaviour on social media platforms. This proposal evaluates traditional
machine learning models and fine-tuned transformers on a hope speech dataset
previously split into train, development, and test sets. On the development set, a
linear-kernel SVM and logistic regression both reached a macro-F1 of 0.78; SVM
with RBF kernel reached 0.77, and Naïve Bayes hit 0.75. Transformer models
delivered better results: the best model achieved weighted precision of 0.82,
weighted recall of 0.80, weighted F1 of 0.79, macro F1 of 0.79, and 0.80
accuracy. These results suggest that while optimally configured traditional
machine learning models remain agile, transformer architectures detect some
subtle semantics of hope to achieve higher precision and recall in hope speech
detection, suggesting that larger transformers and LLMs could perform better in
small datasets.
[LINK]http://arxiv.org/abs/2510.23585v1
[DATE]2025-10-28 01:53:40+08:00
[CATEGORIES]cs.CL
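The hope speech abstract above reports both macro and weighted F1. The sketch below shows how the two averages differ on an imbalanced label set. The labels and predictions are made-up toy data, not taken from the paper, and the helper names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true examples per class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy data (assumed): 2 "hope" posts and 8 "not_hope" posts.
y_true = ["hope"] * 2 + ["not_hope"] * 8
y_pred = ["hope", "not_hope"] + ["not_hope"] * 7 + ["hope"]

print(macro_f1(y_true, y_pred))     # 0.6875
print(weighted_f1(y_true, y_pred))  # 0.8
```

Because the minority class is classified worse here, the macro average falls below the weighted one. This is why the two figures are worth reporting side by side on imbalanced data.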
LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology
[AUTHORS]Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, Liqing Zhang
[ABSTRACT]Large language models (LLMs) and emerging agentic frameworks are beginning to
transform single-cell biology by enabling natural-language reasoning,
generative annotation, and multimodal data integration. However, progress
remains fragmented across data modalities, architectures, and evaluation
standards. LLM4Cell presents the first unified survey of 58 foundation and
agentic models developed for single-cell research, spanning RNA, ATAC,
multi-omic, and spatial modalities. We categorize these methods into five
families (foundation, text-bridge, spatial, multimodal, epigenomic, and
agentic) and map them to eight key analytical tasks including annotation,
trajectory and perturbation modeling, and drug-response prediction. Drawing on
over 40 public datasets, we analyze benchmark suitability, data diversity, and
ethical or scalability constraints, and evaluate models across 10 domain
dimensions covering biological grounding, multi-omics alignment, fairness,
privacy, and explainability. By linking datasets, models, and evaluation
domains, LLM4Cell provides the first integrated view of language-driven
single-cell intelligence and outlines open challenges in interpretability,
standardization, and trustworthy model development.
[COMMENTS]34 pages, 5 figures, 7 tables
[LINK]http://arxiv.org/abs/2510.07793v2
[DATE]2025-10-28 01:46:32+08:00
[CATEGORIES]cs.CL
Superficial Self-Improved Reasoners Benefit from Model Merging
[AUTHORS]Xiangchi Yuan, Chunhui Zhang, Zheyuan Liu, Dachuan Shi, Leyan Pan, Soroush Vosoughi, Wenke Lee
[COMMENTS]EMNLP 2025
[LINK]http://arxiv.org/abs/2503.02103v2
[DATE]2025-10-28 01:35:32+08:00
[CATEGORIES]cs.CL
ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models
[AUTHORS]Bohan Li, Wenbin Huang, Yuhang Qiu, Yiwei Guo, Hankun Wang, Zhihan Li, Jing Peng, Ziyang Ma, Xie Chen, Kai Yu
[ABSTRACT]Large Audio Language Models (LALMs), which couple acoustic perception with
large language models (LLMs) to extract and understand diverse information from
audio, have attracted intense interest from both academic and industrial
communities. However, existing LALMs are highly sensitive to how instructions
are phrased, affecting both (i) instruction-following rates and (ii) task
performance. Yet, no existing benchmarks offer a systematic and comprehensive
evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark
evaluating instruction sensitivity for LALMs along three axes: instruction
description, output format, and task composition. We assess recent open-source
and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy
under controlled instruction variations. Experimental results reveal that even
state-of-the-art LALMs suffer significant instruction sensitivity, leading to
degraded performance on fundamental audio understanding tasks. To mitigate this
issue, we fine-tune Qwen2-Audio on a specifically constructed complex
instruction-variant dataset, achieving a marked improvement in
instruction-following performance. However, this also induces nontrivial
catastrophic forgetting: the model loses some previously mastered task
capabilities when exposed to new instruction styles. Our benchmark provides a
standardized basis for assessing and improving instruction sensitivity in
LALMs, underscoring the need for instruction-robust audio understanding in
real-world pipelines.
[COMMENTS]submitted to icassp 2026
[LINK]http://arxiv.org/abs/2510.23558v1
[DATE]2025-10-28 01:31:25+08:00
[CATEGORIES]cs.CL
Fixing It in Post: A Comparative Study of LLM Post-Training Data Quality and Model Performance
[AUTHORS]Aladin Djuhera, Swanand Ravindra Kadhe, Syed Zawad, Farhan Ahmed, Heiko Ludwig, Holger Boche
[ABSTRACT]Recent work on large language models (LLMs) has increasingly focused on
post-training and alignment with datasets curated to enhance instruction
following, world knowledge, and specialized skills. However, most post-training
datasets used in leading open- and closed-source LLMs remain inaccessible to
the public, with limited information about their construction process. This
lack of transparency has motivated the recent development of open-source
post-training corpora. While training on these open alternatives can yield
performance comparable to that of leading models, systematic comparisons remain
challenging due to the significant computational cost of conducting them
rigorously at scale, and are therefore largely absent. As a result, it remains
unclear how specific samples, task types, or curation strategies influence
downstream performance when assessing data quality. In this work, we conduct
the first comprehensive side-by-side analysis of two prominent open
post-training datasets: Tulu-3-SFT-Mix and SmolTalk. Using the Magpie
framework, we annotate each sample with detailed quality metrics, including
turn structure (single-turn vs. multi-turn), task category, input quality, and
response quality, and we derive statistics that reveal structural and
qualitative similarities and differences between the two datasets. Based on
these insights, we design a principled curation recipe that produces a new data
mixture, TuluTalk, which contains 14% fewer samples than either source dataset
while matching or exceeding their performance on key benchmarks. Our findings
offer actionable insights for constructing more effective post-training
datasets that improve model performance within practical resource limits. To
support future research, we publicly release both the annotated source datasets
and our curated TuluTalk mixture.
[LINK]http://arxiv.org/abs/2506.06522v2
[DATE]2025-10-28 01:31:21+08:00
[CATEGORIES]cs.CL
A U-Net and Transformer Pipeline for Multilingual Image Translation
[AUTHORS]Siddharth Sahay, Radhika Agarwal
[ABSTRACT]This paper presents an end-to-end multilingual translation pipeline that
integrates a custom U-Net for text detection, the Tesseract engine for text
recognition, and a from-scratch sequence-to-sequence (Seq2Seq) Transformer for
Neural Machine Translation (NMT). Our approach first utilizes a U-Net model,
trained on a synthetic dataset, to accurately segment and detect text regions
from an image. These detected regions are then processed by Tesseract to
extract the source text. This extracted text is fed into a custom Transformer
model trained from scratch on a multilingual parallel corpus spanning 5
languages. Unlike systems reliant on monolithic pre-trained models, our
architecture emphasizes full customization and adaptability. The system is
evaluated on its text detection accuracy, text recognition quality, and
translation performance via BLEU scores. The complete pipeline demonstrates
promising results, validating the viability of a custom-built system for
translating text directly from images.
[COMMENTS]6 pages, 3 figures, 5 tables, and 2 algorithms. Prepared in IEEE
double-column format
[LINK]http://arxiv.org/abs/2510.23554v1
[DATE]2025-10-28 01:28:55+08:00
[CATEGORIES]cs.LG cs.CL
LimRank: Less is More for Reasoning-Intensive Information Reranking
[AUTHORS]Tingyu Song, Yilun Zhao, Siyue Zhang, Chen Zhao, Arman Cohan
[ABSTRACT]Existing approaches typically rely on large-scale fine-tuning to adapt LLMs
for information reranking tasks, which is computationally expensive. In this
work, we demonstrate that modern LLMs can be effectively adapted using only
minimal, high-quality supervision. To enable this, we design
LIMRANK-SYNTHESIZER, a reusable and open-source pipeline for generating
diverse, challenging, and realistic reranking examples. Using this synthetic
data, we fine-tune our reranker model, LIMRANK. We evaluate LIMRANK on two
challenging benchmarks, i.e., BRIGHT for reasoning-intensive retrieval and
FollowIR for instruction-following retrieval. Our experiments demonstrate that
LIMRANK achieves competitive performance, while being trained on less than 5%
of the data typically used in prior work. Further ablation studies demonstrate
the effectiveness of LIMRANK-SYNTHESIZER and the strong generalization
capabilities of LIMRANK across downstream tasks, including scientific
literature search and retrieval-augmented generation for knowledge-intensive
problem solving.
[COMMENTS]EMNLP 2025 Main (Short)
[LINK]http://arxiv.org/abs/2510.23544v1
[DATE]2025-10-28 01:19:37+08:00
[CATEGORIES]cs.CL
JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence
[AUTHORS]Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan
[ABSTRACT]The scope of neural code intelligence is rapidly expanding beyond text-based
source code to encompass the rich visual outputs that programs generate. This
visual dimension is critical for advanced applications like flexible content
generation and precise, program-driven editing of visualizations. However,
progress has been impeded by the scarcity of high-quality multimodal code data,
a bottleneck stemming from challenges in synthesis and quality assessment. To
address these challenges, we make contributions from both a data and modeling
perspective. We first introduce a complete synthesis toolkit that leverages
reciprocal synergies between data modalities to efficiently produce a
large-scale, high-quality corpus spanning from standard charts to complex
interactive web UIs and code-driven animations. Leveraging this toolkit, we
construct JanusCode-800K, the largest multimodal code corpus to date. This
powers the training of our models, JanusCoder and JanusCoderV, which establish
a visual-programmatic interface for generating code from textual instructions,
visual inputs, or a combination of both. Our unified model is a departure from
existing approaches that build specialized models for isolated tasks. Extensive
experiments on both text-centric and vision-centric coding tasks demonstrate
the superior performance of the JanusCoder series, with our 7B to 14B scale
models approaching or even exceeding the performance of commercial models.
Furthermore, extensive analysis provides key insights into harmonizing
programmatic logic with its visual expression. Our code and checkpoints
are available at https://github.com/InternLM/JanusCoder.
[COMMENTS]Work in progress
[LINK]http://arxiv.org/abs/2510.23538v1
[DATE]2025-10-28 01:13:49+08:00
[CATEGORIES]cs.CL
AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
[AUTHORS]Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu
[LINK]http://arxiv.org/abs/2503.10720v2
[DATE]2025-10-28 00:55:55+08:00
[CATEGORIES]cs.CL
Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions
[AUTHORS]Wang Bill Zhu, Tianqi Chen, Xinyan Velocity Yu, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
[ABSTRACT]Cancer patients are increasingly turning to large language models (LLMs) for
medical information, making it critical to assess how well these models handle
complex, personalized questions. However, current medical benchmarks focus on
medical exams or consumer-searched questions and do not evaluate LLMs on real
patient questions with patient details. In this paper, we first have three
hematology-oncology physicians evaluate cancer-related questions drawn from
real patients. While LLM responses are generally accurate, the models
frequently fail to recognize or address false presuppositions in the questions,
posing risks to safe medical decision-making. To study this limitation
systematically, we introduce Cancer-Myth, an expert-verified adversarial
dataset of 585 cancer-related questions with false presuppositions. On this
benchmark, no frontier LLM – including GPT-5, Gemini-2.5-Pro, and
Claude-4-Sonnet – corrects these false presuppositions more than $43\%$ of the
time. To study mitigation strategies, we further construct a 150-question
Cancer-Myth-NFP set, in which physicians confirm the absence of false
presuppositions. We find typical mitigation strategies, such as adding
precautionary prompts with GEPA optimization, can raise accuracy on Cancer-Myth
to $80\%$, but at the cost of misidentifying presuppositions in $41\%$ of
Cancer-Myth-NFP questions and causing a $10\%$ relative performance drop on
other medical benchmarks. These findings highlight a critical gap in the
reliability of LLMs, show that prompting alone is not a reliable remedy for
false presuppositions, and underscore the need for more robust safeguards in
medical AI systems.
[LINK]http://arxiv.org/abs/2504.11373v2
[DATE]2025-10-28 00:39:30+08:00
[CATEGORIES]cs.CL
Less is More: Local Intrinsic Dimensions of Contextual Language Models
[AUTHORS]Benjamin Matthias Ruppik, Julius von Rohrscheidt, Carel van Niekerk, Michael Heck, Renato Vukovic, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Bastian Rieck, Marcus Zibrowius, Milica Gašić
[ABSTRACT]Understanding the internal mechanisms of large language models (LLMs) remains
a challenging and complex endeavor. Even fundamental questions, such as how
fine-tuning affects model behavior, often require extensive empirical
evaluation. In this paper, we introduce a novel perspective based on the
geometric properties of contextual latent embeddings to study the effects of
training and fine-tuning. To that end, we measure the local dimensions of a
contextual language model’s latent space and analyze their shifts during
training and fine-tuning. We show that the local dimensions provide insights
into the model’s training dynamics and generalization ability. Specifically,
the mean of the local dimensions predicts when the model’s training
capabilities are exhausted, as exemplified in a dialogue state tracking task,
overfitting, as demonstrated in an emotion recognition task, and grokking, as
illustrated with an arithmetic task. Furthermore, our experiments suggest a
practical heuristic: reductions in the mean local dimension tend to accompany
and predict subsequent performance gains. Through this exploration, we aim to
provide practitioners with a deeper understanding of the implications of
fine-tuning on embedding spaces, facilitating informed decisions when
configuring models for specific applications. The results of this work
contribute to the ongoing discourse on the interpretability, adaptability, and
generalizability of LLMs by bridging the gap between intrinsic model mechanisms
and geometric properties in the respective embeddings.
[COMMENTS]Accepted at the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025; in press). 10 pages, with an additional 17 pages in
the appendix. Our code is available at
https://github.com/aidos-lab/Topo_LLM_public and
https://github.com/aidos-lab/grokking-via-lid
[LINK]http://arxiv.org/abs/2506.01034v2
[DATE]2025-10-28 00:17:17+08:00
[CATEGORIES]cs.CL cs.LG
MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring
[AUTHORS]Tengchao Yang, Sichen Guo, Mengzhao Jia, Jiaming Su, Yuanyang Liu, Zhihan Zhang, Meng Jiang
[ABSTRACT]Effective math tutoring requires not only solving problems but also
diagnosing students’ difficulties and guiding them step by step. While
multimodal large language models (MLLMs) show promise, existing benchmarks
largely overlook these tutoring skills. We introduce MMTutorBench, the first
benchmark for AI math tutoring, consisting of 685 problems built around
pedagogically significant key-steps. Each problem is paired with
problem-specific rubrics that enable fine-grained evaluation across six
dimensions, and structured into three tasks: Insight Discovery, Operation
Formulation, and Operation Execution. We evaluate 12 leading MLLMs and find
clear performance gaps between proprietary and open-source systems, substantial
room for improvement compared to human tutors, and consistent trends across input variants: OCR
pipelines degrade tutoring quality, few-shot prompting yields limited gains,
and our rubric-based LLM-as-a-Judge proves highly reliable. These results
highlight both the difficulty and diagnostic value of MMTutorBench for
advancing AI tutoring.
[LINK]http://arxiv.org/abs/2510.23477v1
[DATE]2025-10-28 00:11:49+08:00
[CATEGORIES]cs.CL
Computational-Assisted Systematic Review and Meta-Analysis (CASMA): Effect of a Subclass of GnRH-a on Endometriosis Recurrence
[AUTHORS]Sandro Tsang
[ABSTRACT]Background: Evidence synthesis facilitates evidence-based medicine. This task
becomes increasingly difficult to accomplish without applying computational
solutions, since the medical literature grows at astonishing rates. Objective:
This study evaluates an information retrieval-driven workflow, CASMA, to
enhance the efficiency, transparency, and reproducibility of systematic
reviews. Endometriosis recurrence serves as the ideal case due to its complex
and ambiguous literature. Methods: The hybrid approach integrates PRISMA
guidelines with fuzzy matching and regular expression (regex) to facilitate
semi-automated deduplication and filtering of records before manual screening. The
workflow synthesised evidence from randomised controlled trials on the efficacy
of a subclass of gonadotropin-releasing hormone agonists (GnRH-a). A modified
splitting method addressed unit-of-analysis errors in multi-arm trials.
Results: The workflow sharply reduced the screening workload, taking only 11
days to fetch and filter 33,444 records. Seven eligible RCTs were synthesized
(841 patients). The pooled random-effects model yielded a Risk Ratio (RR) of
$0.64$ ($95\%$ CI $0.48$ to $0.86$), demonstrating a $36\%$ reduction in
recurrence, with non-significant heterogeneity ($I^2=0.00\%$, $\tau^2=0.00$).
The findings were robust and stable, as they were backed by sensitivity
analyses. Conclusion: This study demonstrates an application of an
information-retrieval-driven workflow for medical evidence synthesis. The
approach yields valuable clinical results and a generalisable framework to
scale up the evidence synthesis, bridging the gap between clinical research and
computer science.
[COMMENTS]15 pages, 12 figures and 4 tables. This work describes an information
retrieval-driven workflow for medical evidence synthesis, with an application
to endometriosis recurrence. The method can be generalized to other
systematic reviews. The preregistered protocol is available:
https://doi.org/10.17605/OSF.IO/R2DFA
[LINK]http://arxiv.org/abs/2509.16599v3
[DATE]2025-10-28 00:02:27+08:00
[CATEGORIES]cs.CL
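The CASMA abstract above combines fuzzy matching with regular expressions to deduplicate and filter records before manual screening. The sketch below shows one way this could look. It uses Python's standard `difflib` and `re` modules, and the similarity threshold, the regex pattern, the function names, and the sample titles are all assumptions made for illustration, not the paper's actual settings.

```python
import re
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase the title and replace runs of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(titles, threshold=0.9):
    """Keep the first record from each group of near-identical titles."""
    kept, kept_norm = [], []
    for title in titles:
        norm = normalize(title)
        is_duplicate = any(
            SequenceMatcher(None, norm, seen).ratio() >= threshold
            for seen in kept_norm
        )
        if not is_duplicate:
            kept.append(title)
            kept_norm.append(norm)
    return kept


def matches_topic(title, pattern=r"\b(gnrh|gonadotropin)"):
    """Regex filter: keep titles that mention the (assumed) topic keywords."""
    return re.search(pattern, title, flags=re.IGNORECASE) is not None


# Hypothetical records, not taken from the study.
records = [
    "GnRH agonist therapy after surgery for endometriosis",
    "GnRH Agonist Therapy After Surgery for Endometriosis.",
    "Dietary fibre intake and colorectal cancer risk",
]
unique = deduplicate(records)
filtered = [title for title in unique if matches_topic(title)]
print(filtered)  # ['GnRH agonist therapy after surgery for endometriosis']
```

The pairwise comparison is quadratic in the number of records, so a real pipeline over tens of thousands of records would likely need blocking or indexing first. That part is omitted here.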
Adaptive Anomaly Detection in Network Flows with Low-Rank Tensor Decompositions and Deep Unrolling
[AUTHORS]Lukas Schynol, Marius Pesavento
[ABSTRACT]Anomaly detection (AD) is increasingly recognized as a key component for
ensuring the resilience of future communication systems. While deep learning
has shown state-of-the-art AD performance, its application in critical systems
is hindered by concerns regarding training data efficiency, domain adaptation
and interpretability. This work considers AD in network flows using incomplete
measurements, leveraging a robust tensor decomposition approach and deep
unrolling techniques to address these challenges. We first propose a novel
block-successive convex approximation algorithm based on a regularized
model-fitting objective where the normal flows are modeled as low-rank tensors
and anomalies as sparse. An augmentation of the objective is introduced to
decrease the computational cost. We apply deep unrolling to derive a novel deep
network architecture based on our proposed algorithm, treating the
regularization parameters as learnable weights. Inspired by Bayesian
approaches, we extend the model architecture to perform online adaptation to
per-flow and per-time-step statistics, improving AD performance while
maintaining a low parameter count and preserving the problem’s permutation
equivariances. To optimize the deep network weights for detection performance,
we employ a homotopy optimization approach based on an efficient approximation
of the area under the receiver operating characteristic curve. Extensive
experiments on synthetic and real-world data demonstrate that our proposed deep
network architecture exhibits a high training data efficiency, outperforms
reference methods, and adapts seamlessly to varying network topologies.
[COMMENTS]18 pages, 7 figures
[LINK]http://arxiv.org/abs/2409.11529v3
[DATE]2025-10-28 23:59:49+08:00
[CATEGORIES]cs.LG
LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
[AUTHORS]Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang
[ABSTRACT]With the widespread adoption of LLMs, LoRA has become a dominant method for
PEFT, and its initialization methods have attracted increasing attention.
However, existing methods have notable limitations: many methods do not
incorporate target-domain data, while gradient-based methods exploit data only
at a shallow level by relying on one-step gradient decomposition, which remains
unsatisfactory due to the weak empirical performance of the one-step
fine-tuning model that serves as their basis, as well as the fact that these
methods either lack a rigorous theoretical foundation or depend heavily on
restrictive isotropic assumptions. In this paper, we establish a theoretical
framework for data-aware LoRA initialization based on asymptotic analysis.
Starting from a general optimization objective that minimizes the expectation
of the parameter discrepancy between the fine-tuned and target models, we
derive an optimization problem with two components: a bias term, which is
related to the parameter distance between the fine-tuned and target models, and
is approximated using a Fisher-gradient formulation to preserve anisotropy; and
a variance term, which accounts for the uncertainty introduced by sampling
stochasticity through the Fisher information. By solving this problem, we
obtain an optimal initialization strategy for LoRA. Building on this
theoretical framework, we develop an efficient algorithm, LoRA-DA, which
estimates the terms in the optimization problem from a small set of target
domain samples and obtains the optimal LoRA initialization. Empirical results
across multiple benchmarks demonstrate that LoRA-DA consistently improves final
accuracy over existing initialization methods. Additional studies show faster,
more stable convergence, robustness across ranks, and only a small
initialization overhead for LoRA-DA. The source code will be released upon
publication.
[LINK]http://arxiv.org/abs/2510.24561v1
[DATE]2025-10-28 23:55:36+08:00
[CATEGORIES]cs.LG
Enforcing boundary conditions for physics-informed neural operators
[AUTHORS]Niklas Göschel, Sebastian Götschel, Daniel Ruprecht
[ABSTRACT]Machine-learning based methods like physics-informed neural networks and
physics-informed neural operators are becoming increasingly adept at solving
even complex systems of partial differential equations. Boundary conditions can
be enforced either weakly by penalizing deviations in the loss function or
strongly by training a solution structure that inherently matches the
prescribed values and derivatives. The former approach is easy to implement but
the latter can provide benefits with respect to accuracy and training times.
However, previous approaches to strongly enforcing Neumann or Robin boundary
conditions require a domain with a fully $C^1$ boundary and, as we demonstrate,
can lead to instability if those boundary conditions are posed on a segment of
the boundary that is piecewise $C^1$ but only $C^0$ globally. We introduce a
generalization of the approach by Sukumar & Srivastava (doi:
10.1016/j.cma.2021.114333), and a new approach based on orthogonal projections
that overcome this limitation. The performance of these new techniques is
compared against weakly and semi-weakly enforced boundary conditions for the
scalar Darcy flow equation and the stationary Navier-Stokes equations.
[LINK]http://arxiv.org/abs/2510.24557v1
[DATE]2025-10-28 23:51:48+08:00
[CATEGORIES]cs.LG
Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining
[AUTHORS]Xiaofan Zhou, Lu Cheng
[ABSTRACT]Continual Learning (CL) is essential for enabling self-evolving large
language models (LLMs) to adapt and remain effective amid rapid knowledge
growth. Yet, despite its importance, little attention has been given to
establishing statistical reliability guarantees for LLMs under CL, particularly
in the setting of continual domain pretraining (CDP). Conformal Prediction (CP)
has shown promise in offering correctness guarantees for LLMs, but it faces
major challenges in CDP: testing data often stems from unknown or shifting
domain distributions, under which CP may no longer provide valid guarantees.
Moreover, when high coverage is required, CP can yield excessively large
prediction sets for unanswerable queries, reducing informativeness. To address
these challenges, we introduce an adaptive rejection and non-exchangeable CP
framework. Our method first estimates the distribution of questions across
domains in the test set using transformer-based clustering, then reweights or
resamples the calibration data accordingly. Building on this, adaptive
rejection CP allows the LLM to selectively abstain from answering when its
confidence or competence shifts significantly. Extensive experiments
demonstrate that our framework enhances both the effectiveness and reliability
of CP under CDP scenarios. Our code is available at:
https://anonymous.4open.science/r/CPCL-8C12/
[LINK]http://arxiv.org/abs/2510.22931v2
[DATE]2025-10-28 23:51:13+08:00
[CATEGORIES]cs.LG
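Illustrative note on the entry above: reweighting calibration data is commonly realised as a weighted quantile of nonconformity scores. The sketch below follows the standard weighted split-conformal rule and is not necessarily the authors' procedure. The weights are assumed to be given (for example, estimated domain proportions); the clustering and rejection steps are omitted, and all function names are hypothetical.

```python
from typing import Sequence


def weighted_conformal_threshold(
    scores: Sequence[float],
    weights: Sequence[float],
    test_weight: float,
    alpha: float,
) -> float:
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    The test point contributes its own weight at +infinity, so the
    threshold is infinite when the calibration mass cannot reach 1 - alpha.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(weights) + test_weight
    target = (1.0 - alpha) * total
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight
        if cumulative >= target:
            return score
    return float("inf")


def prediction_set(candidate_scores: dict, threshold: float) -> list:
    """Keep every candidate answer whose score is within the threshold."""
    return [answer for answer, s in candidate_scores.items() if s <= threshold]
```

Up-weighting calibration examples from the domains that dominate the test set moves the threshold towards the scores typical of those domains.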
GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection
[AUTHORS]Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, Jiaqi W. Ma
[ABSTRACT]Gradient-based data attribution methods, such as influence functions, are
critical for understanding the impact of individual training samples without
requiring repeated model retraining. However, their scalability is often
limited by the high computational and memory costs associated with per-sample
gradient computation. In this work, we propose GraSS, a novel gradient
compression algorithm, and its variant FactGraSS, designed specifically for
linear layers, both of which explicitly leverage the inherent sparsity of
per-sample gradients to achieve sub-linear space and time complexity. Extensive
experiments demonstrate the effectiveness of our approach, achieving
substantial speedups while preserving data influence fidelity. In particular,
FactGraSS achieves up to 165% faster throughput on billion-scale models
compared to the previous state-of-the-art baselines. Our code is publicly
available at https://github.com/TRAIS-Lab/GraSS.
[COMMENTS]Accepted at the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2505.18976v3
[DATE]2025-10-28 23:46:13+08:00
[CATEGORIES]cs.LG
Dual-Mind World Models: A General Framework for Learning in Dynamic Wireless Networks
[AUTHORS]Lingyi Wang, Rashed Shelim, Walid Saad, Naren Ramakrishnan
[ABSTRACT]Despite the popularity of reinforcement learning (RL) in wireless networks,
existing approaches that rely on model-free RL (MFRL) and model-based RL (MBRL)
are data inefficient and short-sighted. Such RL-based solutions cannot
generalize to novel network states since they capture only statistical patterns
rather than the underlying physics and logic from wireless data. These
limitations become particularly challenging in complex wireless networks with
high dynamics and long-term planning requirements. To address these
limitations, in this paper, a novel dual-mind world model-based learning
framework is proposed with the goal of optimizing completeness-weighted age of
information (CAoI) in a challenging mmWave V2X scenario. Inspired by cognitive
psychology, the proposed dual-mind world model encompasses a pattern-driven
System 1 component and a logic-driven System 2 component to learn dynamics and
logic of the wireless network, and to provide long-term link scheduling over
reliable imagined trajectories. Link scheduling is learned through end-to-end
differentiable imagined trajectories with logical consistency over an extended
horizon rather than relying on wireless data obtained from environment
interactions. Moreover, through imagination rollouts, the proposed world model
can jointly reason network states and plan link scheduling. During intervals
without observations, the proposed method remains capable of making efficient
decisions. Extensive experiments are conducted on a realistic simulator based
on Sionna with real-world physical channel, ray-tracing, and scene objects with
material properties. Simulation results show that the proposed world model
achieves a significant improvement in data efficiency and achieves strong
generalization and adaptation to unseen environments, compared to the
state-of-the-art RL baselines, and the world model approach with only System 1.
[LINK]http://arxiv.org/abs/2510.24546v1
[DATE]2025-10-28 23:45:15+08:00
[CATEGORIES]cs.LG
Online (Non-)Convex Learning via Tempered Optimism
[AUTHORS]Maxime Haddouche, Olivier Wintenberger, Benjamin Guedj
[ABSTRACT]Optimistic Online Learning aims to exploit experts conveying reliable
information to predict the future. However, such implicit optimism may be
challenged when it comes to practical crafting of such experts. A fundamental
example consists in approximating a minimiser of the current problem and using
it as an expert. In the context of dynamic environments, such an expert only conveys
partially relevant information as it may lead to overfitting. To tackle this
issue, we introduce in this work the \emph{optimistically tempered} (OT) online
learning framework designed to handle such imperfect experts. As a first
contribution, we show that tempered optimism is a fruitful paradigm for Online
Non-Convex Learning by proposing a simple yet powerful modification of Online
Gradient and Mirror Descent. Second, we derive a second OT algorithm for convex
losses. Third, we evaluate the practical efficiency of tempered optimism on
real-life datasets and a toy experiment.
[LINK]http://arxiv.org/abs/2301.07530v3
[DATE]2025-10-28 23:36:36+08:00
[CATEGORIES]cs.LG
Unsupervised Machine-Learning Pipeline for Data-Driven Defect Detection and Characterisation: Application to Displacement Cascades
[AUTHORS]Samuel Del Fré, Andrée de Backer, Christophe Domain, Ludovic Thuinet, Charlotte S. Becquart
[ABSTRACT]Neutron irradiation produces, within a few picoseconds, displacement cascades
that are sequences of atomic collisions generating point and extended defects
which subsequently affect the long-term evolution of materials. The diversity
of these defects, characterized morphologically and statistically, defines what
is called the “primary damage”. In this work, we present a fully unsupervised
machine learning (ML) workflow that detects and classifies these defects
directly from molecular dynamics data. Local environments are encoded by the
Smooth Overlap of Atomic Positions (SOAP) vector, anomalous atoms are isolated
with autoencoder neural networks (AE), embedded with Uniform Manifold
Approximation and Projection (UMAP) and clustered using Hierarchical
Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). Applied
to 80 keV displacement cascades in Ni, Fe70Ni10Cr20, and Zr, the AEs
successfully identify the small fraction of outlier atoms that participate in
defect formation. HDBSCAN then partitions the UMAP latent space of AE-flagged
SOAP descriptors into well-defined groups representing vacancy- and
interstitial-dominated regions and, within each, separates small from large
aggregates, assigning 99.7 % of outliers to compact physical motifs. A signed
cluster-identification score confirms this separation, and cluster size scales
with net defect counts ($R^2 > 0.89$). Statistical cross analyses between the ML
outlier map and several conventional detectors (centrosymmetry, dislocation
extraction, etc.) reveal strong overlap and complementary coverage, all
achieved without template or threshold tuning. This ML workflow thus provides
an efficient tool for the quantitative mapping of structural anomalies in
materials, particularly those arising from irradiation damage in displacement
cascades.
[COMMENTS]22 pages, 1 graphical abstract, 7 figures, 4 tables
[LINK]http://arxiv.org/abs/2510.24523v1
[DATE]2025-10-28 23:34:23+08:00
[CATEGORIES]cs.LG
Uni-LoRA: One Vector is All You Need
[AUTHORS]Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji
[ABSTRACT]Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient
fine-tuning (PEFT) method for large language models (LLMs) by constraining
weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and
VB-LoRA push efficiency further by introducing additional constraints to reduce
the trainable parameter space. In this paper, we show that the parameter space
reduction strategies employed by these LoRA variants can be formulated within a
unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a
high-dimensional vector space $R^D$, can be reconstructed through a projection
from a subspace $R^d$, with $d \ll D$. We demonstrate that the fundamental
difference among various LoRA methods lies in the choice of the projection
matrix, $P \in R^{D \times d}$. Most existing LoRA variants rely on layer-wise
or structure-specific projections that limit cross-layer parameter sharing,
thereby compromising parameter efficiency. In light of this, we introduce an
efficient and theoretically grounded projection matrix that is isometric,
enabling global parameter sharing and reducing computation overhead.
Furthermore, under the unified view of Uni-LoRA, this design requires only a
single trainable vector to reconstruct LoRA parameters for the entire LLM,
making Uni-LoRA both a unified framework and a “one-vector-only” solution.
Extensive experiments on GLUE, mathematical reasoning, and instruction tuning
benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter
efficiency while outperforming or matching prior approaches in predictive
performance. Our code is available at
https://github.com/KaiyangLi1992/Uni-LoRA.
[COMMENTS]NeurIPS 2025 Spotlight
[LINK]http://arxiv.org/abs/2506.00799v3
[DATE]2025-10-28 23:20:47+08:00
[CATEGORIES]cs.LG
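Illustrative note on the entry above: one simple way to obtain an isometric projection from $R^d$ to $R^D$ is to give each of the D rows a single nonzero entry and normalise each column to unit length. The sketch below is such a construction under stated assumptions; it is not claimed to be the authors' exact matrix, and `build_projection`, `project`, and `norm` are hypothetical names.

```python
import math
import random


def build_projection(D: int, d: int, seed: int = 0):
    """Return (bucket, scale): row i of P has one nonzero, scale[bucket[i]].

    Every bucket receives at least one row (requires D >= d), and each
    column is normalised to unit length, so P^T P = I.
    """
    if D < d:
        raise ValueError("need D >= d")
    bucket = [i % d for i in range(D)]
    random.Random(seed).shuffle(bucket)
    counts = [0] * d
    for b in bucket:
        counts[b] += 1
    scale = [1.0 / math.sqrt(c) for c in counts]
    return bucket, scale


def project(theta_d, bucket, scale):
    """Compute theta_D = P @ theta_d without materialising P."""
    return [scale[b] * theta_d[b] for b in bucket]


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))
```

Because the columns are orthonormal, the projected vector has the same norm as the d-dimensional trainable vector, and storing `bucket` costs one integer per row instead of a dense D x d matrix.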
MIMIC-Sepsis: A Curated Benchmark for Modeling and Learning from Sepsis Trajectories in the ICU
[AUTHORS]Yong Huang, Zhongqi Yang, Amir Rahmani
[ABSTRACT]Sepsis is a leading cause of mortality in intensive care units (ICUs), yet
existing research often relies on outdated datasets, non-reproducible
preprocessing pipelines, and limited coverage of clinical interventions. We
introduce MIMIC-Sepsis, a curated cohort and benchmark framework derived from
the MIMIC-IV database, designed to support reproducible modeling of sepsis
trajectories. Our cohort includes 35,239 ICU patients with time-aligned
clinical variables and standardized treatment data, including vasopressors,
fluids, mechanical ventilation and antibiotics. We describe a transparent
preprocessing pipeline (based on Sepsis-3 criteria, structured imputation
strategies, and treatment inclusion) and release it alongside benchmark tasks
focused on early mortality prediction, length-of-stay estimation, and shock
onset classification. Empirical results demonstrate that incorporating
treatment variables substantially improves model performance, particularly for
Transformer-based architectures. MIMIC-Sepsis serves as a robust platform for
evaluating predictive and sequential models in critical care research.
[LINK]http://arxiv.org/abs/2510.24500v1
[DATE]2025-10-28 23:13:38+08:00
[CATEGORIES]cs.LG
Group-in-Group Policy Optimization for LLM Agent Training
[AUTHORS]Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An
[ABSTRACT]Recent advances in group-based reinforcement learning (RL) have driven
frontier large language models (LLMs) in single-turn tasks like mathematical
reasoning. However, their scalability to multi-turn LLM agent training remains
limited. Unlike static tasks, agent-environment interactions unfold over many
steps and often yield sparse or delayed rewards, making credit assignment
across individual steps significantly more challenging. In this work, we
propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that
achieves fine-grained credit assignment for LLM agents while preserving the
appealing properties of group-based RL: critic-free, low memory, and stable
convergence. GiGPO introduces a two-level structure for estimating relative
advantage: (i) At the episode-level, GiGPO computes macro relative advantages
based on groups of complete trajectories; (ii) At the step-level, GiGPO
introduces an anchor state grouping mechanism that retroactively constructs
step-level groups by identifying repeated environment states across
trajectories. Actions stemming from the same state are grouped together,
enabling micro relative advantage estimation. This hierarchical structure
effectively captures both global trajectory quality and local step
effectiveness without relying on auxiliary models or additional rollouts. We
evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop,
as well as tool-integrated reasoning on search-augmented QA tasks, using
Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step
credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on
WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B
and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical
LLM rollout, and incurring little to no additional time cost.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.10978v3
[DATE]2025-10-28 23:11:36+08:00
[CATEGORIES]cs.LG
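Illustrative note on the entry above: the two-level advantage can be sketched as an episode-level group-relative term plus a step-level term computed within groups of steps that share the same environment state. The data layout, the per-step return used as the step signal, the standard-deviation normalisation, and the additive combination with `step_weight` are all assumptions made for this example rather than details taken from the abstract.

```python
from collections import defaultdict
from statistics import mean, pstdev


def relative(values, eps=1e-8):
    """Group-relative advantage: (x - mean) / (std + eps)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def two_level_advantages(trajectories, step_weight=1.0):
    """trajectories: list of dicts with 'ret' (episode return) and 'steps',
    a list of (state, step_return) pairs with hashable states.

    Returns advantages[i][t] = A_episode(i) + step_weight * A_step(i, t).
    """
    episode_adv = relative([traj["ret"] for traj in trajectories])

    # Anchor-state grouping: collect every step that starts from the same state.
    groups = defaultdict(list)
    for i, traj in enumerate(trajectories):
        for t, (state, step_return) in enumerate(traj["steps"]):
            groups[state].append((i, t, step_return))

    step_adv = {}
    for members in groups.values():
        advs = relative([r for _, _, r in members])
        for (i, t, _), a in zip(members, advs):
            step_adv[(i, t)] = a

    return [
        [episode_adv[i] + step_weight * step_adv[(i, t)]
         for t in range(len(traj["steps"]))]
        for i, traj in enumerate(trajectories)
    ]
```

A state visited by only one trajectory forms a singleton group and contributes a step-level advantage of zero, so such steps fall back to the episode-level signal alone.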
TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising
[AUTHORS]J. T. Fry, Xinyi Hope Fu, Zhenghao Fu, Kaliroe M. W. Pappas, Lindley Winslow, Aobo Li
[COMMENTS]Accepted by NeurIPS 2025 (Spotlight)
[LINK]http://arxiv.org/abs/2406.04378v3
[DATE]2025-10-28 23:03:25+08:00
[CATEGORIES]cs.LG
Mirror Descent and Novel Exponentiated Gradient Algorithms Using Trace-Form Entropies and Deformed Logarithms
[AUTHORS]Andrzej Cichocki, Toshihisa Tanaka, Frank Nielsen, Sergio Cruces
[ABSTRACT]This paper introduces a broad class of Mirror Descent (MD) and Generalized
Exponentiated Gradient (GEG) algorithms derived from trace-form entropies
defined via deformed logarithms. Leveraging these generalized entropies yields
MD \& GEG algorithms with improved convergence behavior, robustness to
vanishing and exploding gradients, and inherent adaptability to non-Euclidean
geometries through mirror maps. We establish deep connections between these
methods and Amari’s natural gradient, revealing a unified geometric foundation
for additive, multiplicative, and natural gradient updates. Focusing on the
Tsallis, Kaniadakis, Sharma–Taneja–Mittal, and Kaniadakis–Lissia–Scarfone
entropy families, we show that each entropy induces a distinct Riemannian
metric on the parameter space, leading to GEG algorithms that preserve the
natural statistical geometry. The tunable parameters of deformed logarithms
enable adaptive geometric selection, providing enhanced robustness and
convergence over classical Euclidean optimization. Overall, our framework
unifies key first-order MD optimization methods under a single
information-geometric perspective based on generalized Bregman divergences,
where the choice of entropy determines the underlying metric and dual geometric
structure.
[COMMENTS]22 pages, 9 figures
[LINK]http://arxiv.org/abs/2503.08748v4
[DATE]2025-10-28 23:01:16+08:00
[CATEGORIES]cs.LG
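Illustrative note on the entry above: for the Tsallis family, the deformed logarithm and its inverse have closed forms, and a mirror-descent step with the deformed logarithm as link function reads w_new = exp_q(log_q(w) - eta * grad). The sketch below covers only this one family without any normalisation step and is a generic mirror-descent update, not the authors' full GEG algorithm; the function names are hypothetical.

```python
import math


def log_q(x: float, q: float) -> float:
    """Tsallis deformed logarithm; reduces to ln(x) as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def exp_q(y: float, q: float) -> float:
    """Inverse of log_q; for q < 1 it is clipped at zero outside its domain."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(y)
    base = 1.0 + (1.0 - q) * y
    if base <= 0.0:
        if q < 1.0:
            return 0.0
        raise ValueError("exp_q is undefined at this argument for q > 1")
    return base ** (1.0 / (1.0 - q))


def geg_step(w, grad, eta: float, q: float):
    """One mirror-descent step with link function log_q:
    w_new[i] = exp_q(log_q(w[i]) - eta * grad[i]).
    """
    return [exp_q(log_q(wi, q) - eta * gi, q) for wi, gi in zip(w, grad)]
```

Setting q = 1 gives the classical multiplicative exponentiated-gradient update, while q = 0 gives log_q(x) = x - 1 and hence the additive gradient-descent update, which is one concrete instance of the unification the abstract describes.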
Sample-efficient and Scalable Exploration in Continuous-Time RL
[AUTHORS]Klemens Iten, Lenart Treven, Bhavya Sukhija, Florian Dörfler, Andreas Krause
[ABSTRACT]Reinforcement learning algorithms are typically designed for discrete-time
dynamics, even though the underlying real-world control systems are often
continuous in time. In this paper, we study the problem of continuous-time
reinforcement learning, where the unknown system dynamics are represented using
nonlinear ordinary differential equations (ODEs). We leverage probabilistic
models, such as Gaussian processes and Bayesian neural networks, to learn an
uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily
maximizes a weighted sum of the extrinsic reward and model epistemic
uncertainty. This yields a scalable and sample-efficient approach to
continuous-time model-based RL. We show that COMBRL achieves sublinear regret
in the reward-driven setting, and in the unsupervised RL setting (i.e., without
extrinsic rewards), we provide a sample complexity bound. In our experiments,
we evaluate COMBRL in both standard and unsupervised RL settings and
demonstrate that it scales better, is more sample-efficient than prior methods,
and outperforms baselines across several deep RL tasks.
[COMMENTS]26 pages, 6 figures, 6 tables
[LINK]http://arxiv.org/abs/2510.24482v1
[DATE]2025-10-28 22:54:12+08:00
[CATEGORIES]cs.LG
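Illustrative note on the entry above: the objective "extrinsic reward plus weighted epistemic uncertainty" can be shown on a single-step toy problem. Here the uncertainty is assumed to be the disagreement of an ensemble of model predictions, and the action set, the reward values, and all names are hypothetical; the abstract's setting concerns continuous-time dynamics, which this sketch does not model.

```python
from statistics import pstdev


def epistemic_uncertainty(predictions):
    """Disagreement (population std) across ensemble predictions."""
    return pstdev(predictions)


def optimistic_score(reward, predictions, weight):
    """Weighted sum of extrinsic reward and epistemic uncertainty."""
    return reward + weight * epistemic_uncertainty(predictions)


def greedy_action(candidates, weight):
    """candidates maps action -> (reward, ensemble predictions)."""
    return max(
        candidates,
        key=lambda a: optimistic_score(candidates[a][0], candidates[a][1], weight),
    )
```

With `weight = 0` the choice is purely reward-driven. Setting every reward to zero leaves only the uncertainty term, which corresponds to the unsupervised setting mentioned in the abstract.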
Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning
[AUTHORS]Léopold Maytié, Roland Bertin Johannet, Rufin VanRullen
[ABSTRACT]Humans leverage rich internal models of the world to reason about the future,
imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement
Learning (RL), world models aim to capture how the environment evolves in
response to the agent’s actions, facilitating planning and generalization.
However, typical world models directly operate on the environment variables
(e.g. pixels, physical attributes), which can make their training slow and
cumbersome; instead, it may be advantageous to rely on high-level latent
dimensions that capture relevant multimodal variables. Global Workspace (GW)
Theory offers a cognitive framework for multimodal integration and information
broadcasting in the brain, and recent studies have begun to introduce efficient
deep learning implementations of GW. Here, we evaluate the capabilities of an
RL system combining GW with a world model. We compare our GW-Dreamer with
various versions of the standard PPO and the original Dreamer algorithms. We
show that performing the dreaming process (i.e., mental simulation) inside the
GW latent space allows for training with fewer environment steps. As an
additional emergent property, the resulting model (but not its comparison
baselines) displays strong robustness to the absence of one of its observation
modalities (images or simulation attributes). We conclude that the combination
of GW with World Models holds great potential for improving decision-making in
RL agents.
[COMMENTS]Under review
[LINK]http://arxiv.org/abs/2502.21142v2
[DATE]2025-10-28 22:49:07+08:00
[CATEGORIES]cs.LG
Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
[AUTHORS]Nicolò Dal Fabbro, Milad Mesbahi, Renato Mendes, João Borges de Sousa, George J. Pappas
[ABSTRACT]We study the problem of long-term (multiple days) mapping of a river plume
using multiple autonomous underwater vehicles (AUVs), focusing on the Douro
river as a representative use case. We propose an energy- and communication-efficient
multi-agent reinforcement learning approach in which a central
coordinator intermittently communicates with the AUVs, collecting measurements
and issuing commands. Our approach integrates spatiotemporal Gaussian process
regression (GPR) with a multi-head Q-network controller that regulates
direction and speed for each AUV. Simulations using the Delft3D ocean model
demonstrate that our method consistently outperforms both single- and
multi-agent benchmarks, and that scaling the number of agents improves both mean
squared error (MSE) and operational endurance. In some instances, our algorithm
demonstrates that doubling the number of AUVs can more than double endurance
while maintaining or improving accuracy, underscoring the benefits of
multi-agent coordination. Our learned policies generalize across unseen
seasonal regimes over different months and years, demonstrating promise for
future developments of data-driven long-term monitoring of dynamic plume
environments.
[LINK]http://arxiv.org/abs/2510.03534v2
[DATE]2025-10-28 22:48:21+08:00
[CATEGORIES]cs.LG
Methodology for Comparing Machine Learning Algorithms for Survival Analysis
[AUTHORS]Lucas Buk Cardoso, Simone Aldrey Angelo, Yasmin Pacheco Gil Bonilha, Fernando Maia, Adeylson Guimarães Ribeiro, Maria Paula Curado, Gisele Aparecida Fernandes, Vanderlei Cunha Parro, Flávio Almeida de Magalhães Cipparrone, Alexandre Dias Porto Chiavegatto Filho, Tatiana Natasha Toporcov
[ABSTRACT]This study presents a comparative methodological analysis of six machine
learning models for survival analysis (MLSA). Using data from nearly 45,000
colorectal cancer patients in the Hospital-Based Cancer Registries of São
Paulo, we evaluated Random Survival Forest (RSF), Gradient Boosting for
Survival Analysis (GBSA), Survival SVM (SSVM), XGBoost-Cox (XGB-Cox),
XGBoost-AFT (XGB-AFT), and LightGBM (LGBM), all capable of predicting survival
in the presence of censored data. Hyperparameter optimization was performed with
different samplers, and model performance was assessed using the Concordance
Index (C-Index), C-Index IPCW, time-dependent AUC, and Integrated Brier Score
(IBS). Survival curves produced by the models were compared with predictions
from classification algorithms, and predictor interpretation was conducted
using SHAP and permutation importance. XGB-AFT achieved the best performance
(C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF. The results
highlight the potential and applicability of MLSA to improve survival
prediction and support decision making.
[LINK]http://arxiv.org/abs/2510.24473v1
[DATE]2025-10-28 22:42:28+08:00
[CATEGORIES]cs.LG
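Illustrative note on the entry above: the Concordance Index used for model comparison can be computed directly from its definition. The sketch below implements Harrell's C-index on plain lists; it ignores pairs with tied times and does not include the IPCW correction also mentioned in the abstract. The function name and argument layout are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] is truthy) strictly before time j. It is concordant when
    the model assigns i the higher risk; tied risks count as one half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A censored subject never starts a comparable pair, because its true event time is unknown, but it can still serve as the later member of a pair.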
Non-Singularity of the Gradient Descent map for Neural Networks with Piecewise Analytic Activations
[AUTHORS]Alexandru Crăciun, Debarghya Ghoshdastidar
[ABSTRACT]The theory of training deep networks has become a central question of modern
machine learning and has inspired many practical advancements. In particular,
the gradient descent (GD) optimization algorithm has been extensively studied
in recent years. A key assumption about GD has appeared in several recent
works: the \emph{GD map is non-singular} – it preserves sets of measure zero
under preimages. Crucially, this assumption has been used to prove that GD
avoids saddle points and maxima, and to establish the existence of a computable
quantity that determines the convergence to global minima (both for GD and
stochastic GD). However, the current literature either assumes the
non-singularity of the GD map or imposes restrictive assumptions, such as
Lipschitz smoothness of the loss (for example, Lipschitzness does not hold for
deep ReLU networks with the cross-entropy loss) and restricts the analysis to
GD with small step-sizes. In this paper, we investigate the neural network map
as a function on the space of weights and biases. We also prove, for the first
time, the non-singularity of the gradient descent (GD) map on the loss
landscape of realistic neural network architectures (with fully connected,
convolutional, or softmax attention layers) and piecewise analytic activations
(which includes sigmoid, ReLU, leaky ReLU, etc.) for almost all step-sizes. Our
work significantly extends the existing results on the convergence of GD and
SGD by guaranteeing that they apply to practical neural network settings and
has the potential to unlock further exploration of learning dynamics.
[LINK]http://arxiv.org/abs/2510.24466v1
[DATE]2025-10-28 22:34:33+08:00
[CATEGORIES]cs.LG
Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
[AUTHORS]Daniel Sin, Milad Toutounchian
[ABSTRACT]In our article, we describe a method for generating counterfactual
explanations in high-dimensional spaces using four steps that involve fitting
our dataset to a model, finding the decision boundary, determining constraints
on the problem, and computing the closest point (counterfactual explanation)
from that boundary. We propose a discretized approach where we find many
discrete points on the boundary and then identify the closest feasible
counterfactual explanation. This method, which we later call $\textit{Segmented
Sampling for Boundary Approximation}$ (SSBA), applies binary search to find
decision boundary points and then searches for the closest boundary point.
Across four datasets of varying dimensionality, we show that our method can
outperform current methods for counterfactual generation, with reductions in
distance of between $5\%$ and $50\%$ in terms of the $L_2$ norm. Our method can
also handle real-world constraints by restricting changes to immutable and
categorical features, such as age, gender, sex, height, and other related
characteristics, as is the case for a health-based dataset. In terms of
runtime, the SSBA algorithm generates multiple orders of magnitude more
decision boundary points in the same amount of time than a grid-based
approach. In general, our method provides a simple and effective model-agnostic
method that can compute nearest feasible (i.e. realistic with constraints)
counterfactual explanations. All of our results and code are available at:
https://github.com/dsin85691/SSBA_For_Counterfactuals
[COMMENTS]This paper is 15 pages long consisting of multiple sections including
an abstract, introduction, related works, methodology, results, ablation
studies, conclusion, future works, and an appendix section. There are 10
figures and 5 tables in total
[LINK]http://arxiv.org/abs/2510.22911v2
[DATE]2025-10-28 22:33:37+08:00
[CATEGORIES]cs.LG
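Illustrative note on the entry above: the two steps named in the abstract, binary search for decision boundary points followed by a search for the closest one, can be sketched for a black-box classifier. The toy classifier, the choice of one segment per opposite-class point, and all names are assumptions; feasibility constraints on immutable features are omitted.

```python
import math


def classify(point):
    """Hypothetical black-box classifier: 1 inside the unit circle, else 0."""
    return 1 if point[0] ** 2 + point[1] ** 2 < 1.0 else 0


def boundary_point(a, b, predict, tol=1e-6):
    """Bisect the segment a-b, whose endpoints get different labels."""
    label_a = predict(a)
    if label_a == predict(b):
        raise ValueError("endpoints must lie on opposite sides")
    while math.dist(a, b) > tol:
        mid = tuple((x + y) / 2.0 for x, y in zip(a, b))
        if predict(mid) == label_a:
            a = mid
        else:
            b = mid
    return tuple((x + y) / 2.0 for x, y in zip(a, b))


def closest_counterfactual(query, opposite_points, predict):
    """Search one segment per opposite-class point, keep the nearest hit."""
    hits = [boundary_point(query, p, predict) for p in opposite_points]
    return min(hits, key=lambda h: math.dist(query, h))
```

Each bisection needs only label queries, so the procedure is model-agnostic, and its cost grows with the logarithm of the requested precision rather than with a grid resolution.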
Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
[AUTHORS]Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
[ABSTRACT]Vision-Language-Action (VLA) models adapt large vision-language backbones to
map images and instructions into robot actions. However, prevailing VLAs either
generate actions auto-regressively in a fixed left-to-right order or attach
separate MLP or diffusion heads outside the backbone, leading to fragmented
information pathways and specialized training requirements that hinder a
unified, scalable architecture. We present Discrete Diffusion VLA, a
unified-transformer policy that models discretized action chunks with discrete
diffusion. The design retains diffusion’s progressive refinement paradigm while
remaining natively compatible with the discrete token interface of VLMs. Our
method achieves an adaptive decoding order that resolves easy action elements
before harder ones and uses secondary re-masking to revisit uncertain
predictions across refinement rounds, which improves consistency and enables
robust error correction. This unified decoder preserves pre-trained
vision-language priors, supports parallel decoding, breaks the autoregressive
bottleneck, and reduces the number of function evaluations. Discrete Diffusion
VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on
SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge, improving over
autoregressive, MLP decoder and continuous diffusion baselines. These findings
indicate that discrete-diffusion VLA supports precise action modeling and
consistent training, laying groundwork for scaling VLA to larger models and
datasets. Our project page is https://github.com/Liang-ZX/DiscreteDiffusionVLA
[COMMENTS]16 pages
[LINK]http://arxiv.org/abs/2508.20072v2
[DATE]2025-10-28 22:22:20+08:00
[CATEGORIES]cs.LG
The Importance of Being Discrete: Measuring the Impact of Discretization in End-to-End Differentially Private Synthetic Data
[AUTHORS]Georgi Ganev, Meenatchi Sundaram Muthu Selva Annamalai, Sofiane Mahiou, Emiliano De Cristofaro
[ABSTRACT]Differentially Private (DP) generative marginal models are often used in the
wild to release synthetic tabular datasets in lieu of sensitive data while
providing formal privacy guarantees. These models approximate low-dimensional
marginals or query workloads; crucially, they require the training data to be
pre-discretized, i.e., continuous values need to first be partitioned into
bins. However, as the range of values (or their domain) is often inferred
directly from the training data, with the number of bins and bin edges
typically defined arbitrarily, this approach can ultimately break end-to-end DP
guarantees and may not always yield optimal utility.
In this paper, we present an extensive measurement study of four
discretization strategies in the context of DP marginal generative models. More
precisely, we design DP versions of three discretizers (uniform, quantile, and
k-means) and reimplement the PrivTree algorithm. We find that optimizing both
the choice of discretizer and bin count can improve utility, on average, by
almost 30% across six DP marginal models, compared to the default strategy and
number of bins, with PrivTree being the best-performing discretizer in the
majority of cases. We demonstrate that, while DP generative models with
non-private discretization remain vulnerable to membership inference attacks,
applying DP during discretization effectively mitigates this risk. Finally, we
improve on an existing approach for automatically selecting the optimal number
of bins, and achieve high utility while reducing both privacy budget
consumption and computational overhead.
[LINK]http://arxiv.org/abs/2504.06923v4
[DATE]2025-10-28 22:21:34+08:00
[CATEGORIES]cs.LG
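Illustrative note on the entry above: the abstract points out that inferring the value range from the training data can break end-to-end DP guarantees. The sketch below shows the contrasting idea of a uniform discretizer whose bin edges come only from caller-supplied bounds, so the edges do not depend on the data. It is not one of the DP discretizers designed in the paper; the bounds are assumed to be public knowledge, and the names are hypothetical.

```python
def uniform_edges(lower: float, upper: float, n_bins: int):
    """Bin edges from caller-supplied bounds; the data are never inspected."""
    width = (upper - lower) / n_bins
    return [lower + k * width for k in range(n_bins + 1)]


def discretize(values, lower: float, upper: float, n_bins: int):
    """Clip each value to [lower, upper] and map it to a bin index."""
    width = (upper - lower) / n_bins
    indices = []
    for v in values:
        clipped = min(max(v, lower), upper)
        indices.append(min(int((clipped - lower) / width), n_bins - 1))
    return indices
```

Out-of-range values are clipped into the first or last bin instead of widening the domain, which keeps the binning itself independent of any individual record.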
ARIMA_PLUS: Large-scale, Accurate, Automatic and Interpretable In-Database Time Series Forecasting and Anomaly Detection in Google BigQuery
[AUTHORS]Xi Cheng, Weijie Shen, Haoming Chen, Chaoyi Shen, Jean Ortega, Jiashang Liu, Steve Thomas, Honglin Zheng, Haoyun Wu, Yuxiang Li, Casey Lichtendahl, Jenny Ortiz, Gang Liu, Haiyang Qi, Omid Fatemieh, Chris Fry, Jing Jing Long
[ABSTRACT]Time series forecasting and anomaly detection are common tasks for
practitioners in industries such as retail, manufacturing, advertising and
energy. Two unique challenges stand out: (1) efficiently and accurately
forecasting time series or detecting anomalies in large volumes automatically;
and (2) ensuring interpretability of results to effectively incorporate
business insights. We present ARIMA_PLUS, a novel framework to overcome these
two challenges by a unique combination of (a) accurate and interpretable time
series models and (b) scalable and fully managed system infrastructure. The
model has a sequential and modular structure to handle different components of
the time series, including holiday effects, seasonality, trend, and anomalies,
which enables high interpretability of the results. Novel enhancements are made
to each module, and a unified framework is established to address both
forecasting and anomaly detection tasks simultaneously. In terms of accuracy,
its comprehensive benchmark on the 42 public datasets in the Monash forecasting
repository shows superior performance over not only well-established
statistical alternatives (such as ETS, ARIMA, TBATS, Prophet) but also newer
neural network models (such as DeepAR, N-BEATS, PatchTST, TimeMixer). In terms
of infrastructure, it is directly built into the query engine of BigQuery in
Google Cloud. It uses a simple SQL interface and automates tedious
technicalities such as data cleaning and model selection. It automatically
scales with managed cloud computational and storage resources, making it
possible to forecast 100 million time series in only 1.5 hours, with a
throughput of more than 18000 time series per second. In terms of
interpretability, we present several case studies to demonstrate time series
insights it generates and customizability it offers.
[LINK]http://arxiv.org/abs/2510.24452v1
[DATE]2025-10-28 22:18:50+08:00
[CATEGORIES]cs.LG
Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
[AUTHORS]Seyed Mahdi Basiri Azad, Joschka Boedecker
[ABSTRACT]Reinforcement learning (RL) in sparse-reward environments remains a
significant challenge due to the lack of informative feedback. We propose a
simple yet effective method that uses a small number of successful
demonstrations to initialize the value function of an RL agent. By precomputing
value estimates from offline demonstrations and using them as targets for early
learning, our approach provides the agent with a useful prior over promising
actions. The agent then refines these estimates through standard online
interaction. This hybrid offline-to-online paradigm significantly reduces the
exploration burden and improves sample efficiency in sparse-reward settings.
Experiments on benchmark tasks demonstrate that our method accelerates
convergence and outperforms standard baselines, even with minimal or suboptimal
demonstration data.
[LINK]http://arxiv.org/abs/2510.24432v1
[DATE]2025-10-28 22:01:13+08:00
[CATEGORIES]cs.LG
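The idea in the abstract above, precomputing value estimates from a few successful demonstrations and then refining them online, can be illustrated with a tabular sketch. Everything here is an assumption made for illustration: the tabular Q-table, the use of discounted returns-to-go as the precomputed estimates, and the names `init_q_from_demos` and `q_learning_update`. The paper's actual estimator and function approximator may differ.

```python
def init_q_from_demos(demos, gamma=0.99):
    """Seed Q-values with discounted returns-to-go from demonstrations.

    demos: list of trajectories, each a list of (state, action, reward).
    Returns a dict mapping (state, action) to the best return observed.
    """
    q = {}
    for trajectory in demos:
        ret = 0.0
        # Walk backwards so each return-to-go reuses the following one.
        for state, action, reward in reversed(trajectory):
            ret = reward + gamma * ret
            key = (state, action)
            if key not in q or ret > q[key]:
                q[key] = ret
    return q


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99, done=False):
    """Standard online Q-learning step that refines the seeded table."""
    best_next = 0.0 if done else max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)


# One successful demonstration on a sparse-reward chain: reward only at the end.
demo = [(0, "right", 0.0), (1, "right", 0.0), (2, "right", 1.0)]
q_table = init_q_from_demos([demo], gamma=0.9)
```

With the sparse reward, the seeded table already ranks the demonstrated action at every visited state, so early online updates start from an informative prior instead of all zeros.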
Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
[AUTHORS]Tony Bonnaire, Raphaël Urfin, Giulio Biroli, Marc Mézard
[ABSTRACT]Diffusion models have achieved remarkable success across a wide range of
generative tasks. A key challenge is understanding the mechanisms that prevent
their memorization of training data and allow generalization. In this work, we
investigate the role of the training dynamics in the transition from
generalization to memorization. Through extensive experiments and theoretical
analysis, we identify two distinct timescales: an early time
$\tau_\mathrm{gen}$ at which models begin to generate high-quality samples, and
a later time $\tau_\mathrm{mem}$ beyond which memorization emerges. Crucially,
we find that $\tau_\mathrm{mem}$ increases linearly with the training set size
$n$, while $\tau_\mathrm{gen}$ remains constant. This creates a growing window
of training times with $n$ where models generalize effectively, despite showing
strong memorization if training continues beyond it. It is only when $n$
becomes larger than a model-dependent threshold that overfitting disappears at
infinite training times. These findings reveal a form of implicit dynamical
regularization in the training dynamics, which allows models to avoid memorization even
in highly overparameterized settings. Our results are supported by numerical
experiments with standard U-Net architectures on realistic and synthetic
datasets, and by a theoretical analysis using a tractable random features model
studied in the high-dimensional limit.
[COMMENTS]Accepted as an oral at NeurIPS 2025. 40 pages, 15 figures
[LINK]http://arxiv.org/abs/2505.17638v2
[DATE]2025-10-28 21:54:07+08:00
[CATEGORIES]cs.LG
RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
[AUTHORS]Wonkyo Choe, Yangfeng Ji, Felix Xiaozhu Lin
[ABSTRACT]To deploy LLMs on resource-constrained platforms such as mobile robots and
smartphones, non-transformer LLMs have achieved major breakthroughs. Recently,
a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV), has shown
strong computational efficiency; nevertheless, RWKV models still have high
parameter counts, which limits their deployment. In this paper, we propose a
suite of compression techniques, ranging from model architecture optimizations
to post-training compression, tailored to the RWKV architecture. Combined, our
techniques reduce the memory footprint of RWKV models by 3.4x – 5x with only
negligible degradation in accuracy; compared to transformer LLMs with similar
accuracy, our models have a 4x smaller memory footprint.
[LINK]http://arxiv.org/abs/2412.10856v4
[DATE]2025-10-28 21:45:25+08:00
[CATEGORIES]cs.LG
Attack on a PUF-based Secure Binary Neural Network
[AUTHORS]Bijeet Basak, Nupur Patil, Kurian Polachan, Srinivas Vivek
[ABSTRACT]Binarized Neural Networks (BNNs) deployed on memristive crossbar arrays
provide energy-efficient solutions for edge computing but are susceptible to
physical attacks due to memristor nonvolatility. Recently, Rajendran et al.
(IEEE Embedded Systems Letters, 2025) proposed a Physical Unclonable Function
(PUF)-based scheme to secure BNNs against theft attacks. Specifically, the
weight and bias matrices of the BNN layers were secured by swapping columns
based on the device’s PUF key bits.
In this paper, we demonstrate that this scheme to secure BNNs is vulnerable
to a PUF-key recovery attack. As a consequence of our attack, we recover the
secret weight and bias matrices of the BNN. Our approach is motivated by
differential cryptanalysis and reconstructs the PUF key bit-by-bit by observing
the change in model accuracy, eventually recovering the BNN model
parameters. Evaluated on a BNN trained on the MNIST dataset, our attack could
recover 85% of the PUF key, and recover the BNN model up to 93% classification
accuracy compared to the original model’s 96% accuracy. Our attack is very
efficient and takes a couple of minutes to recover the PUF key and the
model parameters.
[COMMENTS]Accepted at VLSID 2026. To be published in IEEE Xplore
[LINK]http://arxiv.org/abs/2510.24422v1
[DATE]2025-10-28 21:43:00+08:00
[CATEGORIES]cs.LG
Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
[AUTHORS]Robert J Appleton, Brian C Barnes, Alejandro Strachan
[ABSTRACT]Data-driven approaches such as deep learning can result in predictive models
for material properties with exceptional accuracy and efficiency. However, in
many applications, data is sparse, severely limiting their accuracy and
applicability. To improve predictions, techniques such as transfer learning and
multitask learning have been used. The performance of multitask learning models
depends on the strength of the underlying correlations between tasks and the
completeness of the data set. Standard multitask models tend to underperform
when trained on sparse data sets with weakly correlated properties. To address
this gap, we fuse deep-learned embeddings generated by independent pretrained
single-task models, resulting in a multitask model that inherits rich,
property-specific representations. By reusing (rather than retraining) these
embeddings, the resulting fused model outperforms standard multitask models and
can be extended with fewer trainable parameters. We demonstrate this technique
on a widely used benchmark data set of quantum chemistry data for small
molecules as well as a newly compiled sparse data set of experimental data
collected from literature and our own quantum chemistry and thermochemical
calculations.
[LINK]http://arxiv.org/abs/2504.07297v2
[DATE]2025-10-28 21:27:06+08:00
[CATEGORIES]cs.LG
APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries
[AUTHORS]Aryan Pedawi, Jordi Silvestre-Ryan, Bradley Worley, Darren J Hsu, Kushal S Shah, Elias Stehle, Jingrong Zhang, Izhar Wallach
[ABSTRACT]Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL
have significantly enabled drug discovery efforts. However, their large size
presents a challenge for virtual screening, where the goal is to identify the
top compounds in a library according to a computational objective (e.g.,
optimizing docking score) subject to computational constraints under a limited
computational budget. For current library sizes – numbering in the tens of
billions of compounds – and scoring functions of interest, a routine virtual
screening campaign may be limited to scoring fewer than 0.1% of the available
compounds, leaving potentially many high scoring compounds undiscovered.
Furthermore, as constraints (and sometimes objectives) change during the course
of a virtual screening campaign, existing virtual screening algorithms
typically offer little room for amortization. We propose the
approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a
neural network surrogate that exploits the structure of CSLs in the prediction
of objectives and constraints to make full enumeration on a consumer GPU
possible in under a minute, allowing for exact retrieval of approximate top-$k$
sets. To demonstrate APEX’s capabilities, we develop a benchmark CSL comprised
of more than 10 million compounds, all of which have been annotated with their
docking scores on five medically relevant targets along with physicochemical
properties measured with RDKit such that, for any objective and set of
constraints, the ground truth top-$k$ compounds can be identified and compared
against the retrievals from any virtual screening algorithm. We show APEX’s
consistently strong performance both in retrieval accuracy and runtime compared
to alternative methods.
[LINK]http://arxiv.org/abs/2510.24380v1
[DATE]2025-10-28 20:57:59+08:00
[CATEGORIES]cs.LG
Telegrapher’s Generative Model via Kac Flows
[AUTHORS]Richard Duong, Jannis Chemseddine, Peter K. Friz, Gabriele Steidl
[ABSTRACT]We break the mold in flow-based generative modeling by proposing a new model
based on the damped wave equation, also known as telegrapher’s equation.
Similar to the diffusion equation and Brownian motion, there is a Feynman-Kac
type relation between the telegrapher’s equation and the stochastic Kac process
in 1D. The Kac flow evolves stepwise linearly in time, so that the probability
flow is Lipschitz continuous in the Wasserstein distance and, in contrast to
diffusion flows, the norm of the velocity is globally bounded. Furthermore, the
Kac model has the diffusion model as its asymptotic limit. We extend these
considerations to a multi-dimensional stochastic process which consists of
independent 1D Kac processes in each spatial component. We show that this
process gives rise to an absolutely continuous curve in the Wasserstein space
and compute the conditional velocity field starting in a Dirac point
analytically. Using the framework of flow matching, we train a neural network
that approximates the velocity field and use it for sample generation. Our
numerical experiments demonstrate the scalability of our approach, and show its
advantages over diffusion models.
[COMMENTS]Update V2: We added CIFAR experiments. Update V3: The old FID scores
& CIFAR images of the Kac model corresponded to the schedule g(t) = t. We now
updated them with both schedules t and t^2. Update V4: We corrected a minor
implementation error and updated the CIFAR images/table
[LINK]http://arxiv.org/abs/2506.20641v4
[DATE]2025-10-28 20:57:51+08:00
[CATEGORIES]cs.LG
Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
[AUTHORS]Andrzej Cichocki
[ABSTRACT]In this paper we propose and investigate a new class of Generalized
Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, and
applying the Bregman divergence with a two–parameter
deformation of the logarithm as a link function. This link function (referred
to here as the Euler logarithm) is associated with a relatively wide class of
trace–form entropies. In order to derive novel GEG/MD updates, we estimate a
deformed exponential function, which closely approximates the inverse of the
Euler two–parameter deformed logarithm. The characteristic shape and
properties of the Euler logarithm and its inverse, the deformed exponential
function, are tuned by two hyperparameters. By learning these hyperparameters,
we can adapt to the distribution of training data and adjust them to achieve
desired properties of gradient descent algorithms. In the literature, there
now exist more than fifty mathematically well-established entropic
functionals and associated deformed logarithms, so it is impossible to
investigate all of them in one research paper. Therefore, we focus here on a
class of trace-form entropies and the associated two–parameter deformed
logarithms.
[COMMENTS]10 pages, preprint of Journal paper
[LINK]http://arxiv.org/abs/2502.17500v2
[DATE]2025-10-28 20:53:44+08:00
[CATEGORIES]cs.LG
A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport
[AUTHORS]Yuanyuan Wu, Zhenlin Qin, Zhenliang Ma
[ABSTRACT]Synthetic data offers a promising solution to the privacy and accessibility
challenges of using smart card data in public transport research. Despite rapid
progress in generative modeling, there is limited attention to comprehensive
evaluation, leaving unclear how reliable, safe, and useful synthetic data truly
are. Existing evaluations remain fragmented, typically limited to
population-level representativeness or record-level privacy, without
considering group-level variations or task-specific utility. To address this
gap, we propose a Representativeness-Privacy-Utility (RPU) framework that
systematically evaluates synthetic trip data across three complementary
dimensions and three hierarchical levels (record, group, population). The
framework integrates a consistent set of metrics to quantify similarity,
disclosure risk, and practical usefulness, enabling transparent and balanced
assessment of synthetic data quality. We apply the framework to benchmark
twelve representative generation methods, spanning conventional statistical
models, deep generative networks, and privacy-enhanced variants. Results show
that synthetic data do not inherently guarantee privacy and that there is no
“one-size-fits-all” model; the trade-off between privacy and
representativeness/utility is obvious. The Conditional Tabular Generative
Adversarial Network (CTGAN) provides the most balanced trade-off and is
suggested for practical applications. The RPU framework provides a systematic
and reproducible basis for researchers and practitioners to compare synthetic
data generation techniques and select appropriate methods in public transport
applications.
[LINK]http://arxiv.org/abs/2510.24375v1
[DATE]2025-10-28 20:52:47+08:00
[CATEGORIES]cs.LG
Linear regression with overparameterized linear neural networks: Tight upper and lower bounds for implicit $\ell^1$-regularization
[AUTHORS]Hannes Matt, Dominik Stöger
[ABSTRACT]Modern machine learning models are often trained in a setting where the
number of parameters exceeds the number of training samples. To understand the
implicit bias of gradient descent in such overparameterized models, prior work
has studied diagonal linear neural networks in the regression setting. These
studies have shown that, when initialized with small weights, gradient descent
tends to favor solutions with minimal $\ell^1$-norm - an effect known as
implicit regularization. In this paper, we investigate implicit regularization
in diagonal linear neural networks of depth $D\ge 2$ for overparameterized
linear regression problems. We focus on analyzing the approximation error
between the limit point of gradient flow trajectories and the solution to the
$\ell^1$-minimization problem. By deriving tight upper and lower bounds on the
approximation error, we precisely characterize how the approximation error
depends on the scale of initialization $\alpha$. Our results reveal a
qualitative difference between depths: for $D \ge 3$, the error decreases
linearly with $\alpha$, whereas for $D=2$, it decreases at rate
$\alpha^{1-\varrho}$, where the parameter $\varrho \in [0,1)$ can be explicitly
characterized. Interestingly, this parameter is closely linked to so-called
null space property constants studied in the sparse recovery literature. We
demonstrate the asymptotic tightness of our bounds through explicit examples.
Numerical experiments corroborate our theoretical findings and suggest that
deeper networks, i.e., $D \ge 3$, may lead to better generalization,
particularly for realistic initialization scales.
[LINK]http://arxiv.org/abs/2506.01143v2
[DATE]2025-10-28 20:49:08+08:00
[CATEGORIES]cs.LG
Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring
[AUTHORS]Yuxin Wang, Dennis Frauen, Jonas Schweisthal, Maresa Schröder, Stefan Feuerriegel
[ABSTRACT]Dropout is common in clinical studies, with up to half of patients leaving
early due to side effects or other reasons. When dropout is informative (i.e.,
dependent on survival time), it introduces censoring bias, because of which
treatment effect estimates are also biased. In this paper, we propose an
assumption-lean framework to assess the robustness of conditional average
treatment effect (CATE) estimates in survival analysis when facing censoring
bias. Unlike existing works that rely on strong assumptions, such as
non-informative censoring, to obtain point estimation, we use partial
identification to derive informative bounds on the CATE. Thereby, our framework
helps to identify patient subgroups where treatment is effective despite
informative censoring. We further develop a novel meta-learner that estimates
the bounds using arbitrary machine learning models and with favorable
theoretical properties, including double robustness and quasi-oracle
efficiency. We demonstrate the practical value of our meta-learner through
numerical experiments and in an application to a cancer drug trial. Together,
our framework offers a practical tool for assessing the robustness of estimated
treatment effects in the presence of censoring and thus promotes the reliable
use of survival data for evidence generation in medicine and epidemiology.
[LINK]http://arxiv.org/abs/2510.13397v2
[DATE]2025-10-28 20:46:53+08:00
[CATEGORIES]cs.LG
Filtering instances and rejecting predictions to obtain reliable models in healthcare
[AUTHORS]Maria Gabriela Valeriano, David Kohan Marzagão, Alfredo Montelongo, Carlos Roberto Veiga Kiffer, Natan Katz, Ana Carolina Lorena
[ABSTRACT]Machine Learning (ML) models are widely used in high-stakes domains such as
healthcare, where the reliability of predictions is critical. However, these
models often fail to account for uncertainty, providing predictions even with
low confidence. This work proposes a novel two-step data-centric approach to
enhance the performance of ML models by improving data quality and filtering
low-confidence predictions. The first step involves leveraging Instance
Hardness (IH) to filter problematic instances during training, thereby refining
the dataset. The second step introduces a confidence-based rejection mechanism
during inference, ensuring that only reliable predictions are retained. We
evaluate our approach using three real-world healthcare datasets, demonstrating
its effectiveness at improving model reliability while balancing predictive
performance and rejection rate. Additionally, we use alternative criteria -
influence values for filtering and uncertainty for rejection - as baselines to
evaluate the efficiency of the proposed method. The results demonstrate that
integrating IH filtering with confidence-based rejection effectively enhances
model performance while preserving a large proportion of instances. This
approach provides a practical method for deploying ML systems in
safety-critical applications.
[COMMENTS]This paper is under review at Machine Learning (Springer)
[LINK]http://arxiv.org/abs/2510.24368v1
[DATE]2025-10-28 20:45:20+08:00
[CATEGORIES]cs.LG
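The second step described in the abstract above, confidence-based rejection at inference time, can be sketched in a few lines. This is an illustrative sketch only: the threshold value, the use of the maximum class probability as the confidence score, and the function names are assumptions, and the Instance Hardness filtering step is not shown.

```python
def predict_with_rejection(probabilities, threshold=0.8):
    """Return a class index per instance, or None when the model abstains.

    probabilities: list of per-class probability lists, one per instance.
    """
    decisions = []
    for probs in probabilities:
        confidence = max(probs)
        if confidence >= threshold:
            decisions.append(probs.index(confidence))
        else:
            decisions.append(None)  # reject: confidence too low
    return decisions


def rejection_rate(decisions):
    """Fraction of instances for which no prediction was returned."""
    return sum(d is None for d in decisions) / len(decisions)


# Hypothetical classifier outputs for three patients (two classes).
probs = [[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]]
decisions = predict_with_rejection(probs, threshold=0.8)
rate = rejection_rate(decisions)
```

Raising the threshold trades a higher rejection rate for more reliable retained predictions, which is the balance the abstract refers to.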
Diffusion Models Meet Contextual Bandits
[AUTHORS]Imad Aouali
[ABSTRACT]Efficient online decision-making in contextual bandits is challenging, as
methods without informative priors often suffer from computational or
statistical inefficiencies. In this work, we leverage pre-trained diffusion
models as expressive priors to capture complex action dependencies and develop
a practical algorithm that efficiently approximates posteriors under such
priors, enabling both fast updates and sampling. Empirical results demonstrate
the effectiveness and versatility of our approach across diverse contextual
bandit settings.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2402.10028v3
[DATE]2025-10-28 20:23:40+08:00
[CATEGORIES]cs.LG
Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning
[AUTHORS]Suman Sanyal
[ABSTRACT]We introduce Perception Learning (PeL), a paradigm that optimizes an agent’s
sensory interface $f_\phi:\mathcal{X}\to\mathcal{Z}$ using task-agnostic
signals, decoupled from downstream decision learning
$g_\theta:\mathcal{Z}\to\mathcal{Y}$. PeL directly targets label-free
perceptual properties, such as stability to nuisances, informativeness without
collapse, and controlled geometry, assessed via objective
representation-invariant metrics. We formalize the separation of perception and
decision, define perceptual properties independent of objectives or
reparameterizations, and prove that PeL updates preserving sufficient
invariants are orthogonal to Bayes task-risk gradients. Additionally, we
provide a suite of task-agnostic evaluation metrics to certify perceptual
quality.
[LINK]http://arxiv.org/abs/2510.24356v1
[DATE]2025-10-28 20:19:49+08:00
[CATEGORIES]cs.LG
Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
[AUTHORS]Amit Peleg, Naman Deep Singh, Matthias Hein
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.24424v2
[DATE]2025-10-28 20:08:40+08:00
[CATEGORIES]cs.LG
Transformers can do Bayesian Clustering
[AUTHORS]Prajit Bhaskaran, Tom Viering
[ABSTRACT]Bayesian clustering accounts for uncertainty but is computationally demanding
at scale. Furthermore, real-world datasets often contain missing values, and
simple imputation ignores the associated uncertainty, resulting in suboptimal
results. We present Cluster-PFN, a Transformer-based model that extends
Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained
entirely on synthetic datasets generated from a finite Gaussian Mixture Model
(GMM) prior, Cluster-PFN learns to estimate the posterior distribution over
both the number of clusters and the cluster assignments. Our method estimates
the number of clusters more accurately than handcrafted model selection
procedures such as AIC, BIC and Variational Inference (VI), and achieves
clustering quality competitive with VI while being orders of magnitude faster.
Cluster-PFN can be trained on complex priors that include missing data,
outperforming imputation-based baselines on real-world genomic datasets at
high missingness. These results show that Cluster-PFN can provide scalable
and flexible Bayesian clustering.
[LINK]http://arxiv.org/abs/2510.24318v1
[DATE]2025-10-28 19:36:31+08:00
[CATEGORIES]cs.LG
FraudTransformer: Time-Aware GPT for Transaction Fraud Detection
[AUTHORS]Gholamali Aminian, Andrew Elliott, Tiger Li, Timothy Cheuk Hin Wong, Victor Claude Dehon, Lukasz Szpruch, Carsten Maple, Christopher Read, Martin Brown, Gesine Reinert, Mo Mamouei
[ABSTRACT]Detecting payment fraud in real-world banking streams requires models that
can exploit both the order of events and the irregular time gaps between them.
We introduce FraudTransformer, a sequence model that augments a vanilla
GPT-style architecture with (i) a dedicated time encoder that embeds either
absolute timestamps or inter-event values, and (ii) a learned positional
encoder that preserves relative order. Experiments on a large industrial
dataset – tens of millions of transactions and auxiliary events – show that
FraudTransformer surpasses four strong classical baselines (Logistic
Regression, XGBoost and LightGBM) as well as transformer ablations that omit
either the time or positional component. On the held-out test set it delivers
the highest AUROC and PRAUC.
[COMMENTS]Accepted in AI-FIND ICAIF’25
(https://sites.google.com/view/icaif-fraud-detection-workshop/home)
[LINK]http://arxiv.org/abs/2509.23712v2
[DATE]2025-10-28 19:34:23+08:00
[CATEGORIES]cs.LG
ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
[AUTHORS]Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
[COMMENTS]Accepted at NeurIPS 2025 (oral)
[LINK]http://arxiv.org/abs/2509.20234v3
[DATE]2025-10-28 19:26:53+08:00
[CATEGORIES]cs.LG
EDC: Equation Discovery for Classification
[AUTHORS]Guus Toussaint, Arno Knobbe
[ABSTRACT]Equation Discovery techniques have shown considerable success in regression
tasks, where they are used to discover concise and interpretable models
(Symbolic Regression). In this paper, we propose a new ED-based binary
classification framework. Our proposed method EDC finds analytical functions of
manageable size that specify the location and shape of the decision boundary.
In extensive experiments on artificial and real-life data, we demonstrate how
EDC is able to discover both the structure of the target equation and
the values of its parameters, outperforming the current state-of-the-art
ED-based classification methods in binary classification and achieving
performance comparable to the state of the art in binary classification. We
suggest a grammar of modest complexity that appears to work well on the tested
datasets but argue that the exact grammar – and thus the complexity of the
models – is configurable, and especially domain-specific expressions can be
included in the pattern language, where that is required. The presented grammar
consists of a series of summands (additive terms) that include linear,
quadratic and exponential terms, as well as products of two features (producing
hyperbolic curves ideal for capturing XOR-like dependencies). The experiments
demonstrate that this grammar allows fairly flexible decision boundaries while
not being so rich as to cause overfitting.
[COMMENTS]This preprint has not undergone peer review or any post-submission
improvements or corrections. The Version of Record of this contribution is
published in Lecture Notes in Computer Science, and is available online at
https://doi.org/10.1007/978-3-032-05461-6_9
[LINK]http://arxiv.org/abs/2510.24310v1
[DATE]2025-10-28 19:20:06+08:00
[CATEGORIES]cs.LG
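The EDC abstract above notes that a summand formed from the product of two features yields hyperbolic boundaries suited to XOR-like dependencies. The sketch below illustrates only that one point; the function names, the fixed weight and the toy data are assumptions, and no equation search or parameter fitting from the paper is reproduced.

```python
def decision_function(x1, x2, w_prod=1.0, bias=0.0):
    """One product summand plus a bias: f(x) = w_prod * x1 * x2 + bias.

    For a non-zero bias the boundary f(x) = 0 is the hyperbola
    x1 * x2 = -bias / w_prod; for bias = 0 it degenerates to the two axes.
    """
    return w_prod * x1 * x2 + bias


def classify(x1, x2):
    """Predict class 1 on the positive side of the boundary, else class 0."""
    return 1 if decision_function(x1, x2) > 0 else 0


# XOR-like toy data: the label depends on whether the two signs agree,
# which no single linear term in x1 or x2 can separate.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, 0, 0]
predictions = [classify(x1, x2) for x1, x2 in points]
```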
Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
[AUTHORS]Aman Sharma, Paras Chopra
[LINK]http://arxiv.org/abs/2510.08146v3
[DATE]2025-10-28 18:58:14+08:00
[CATEGORIES]cs.LG
Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation
[AUTHORS]Edward Fish, Richard Bowden
[ABSTRACT]Recent progress in Sign Language Translation (SLT) has focussed primarily on
improving the representational capacity of large language models to incorporate
Sign Language features. This work explores an alternative direction: enhancing
the geometric properties of skeletal representations themselves. We propose
Geo-Sign, a method that leverages the properties of hyperbolic geometry to
model the hierarchical structure inherent in sign language kinematics. By
projecting skeletal features derived from Spatio-Temporal Graph Convolutional
Networks (ST-GCNs) into the Poincaré ball model, we aim to create more
discriminative embeddings, particularly for fine-grained motions like finger
articulations. We introduce a hyperbolic projection layer, a weighted Fréchet
mean aggregation scheme, and a geometric contrastive loss operating directly in
hyperbolic space. These components are integrated into an end-to-end
translation framework as a regularisation function, to enhance the
representations within the language model. This work demonstrates the potential
of hyperbolic geometry to improve skeletal representations for Sign Language
Translation, improving on SOTA RGB methods while preserving privacy and
improving computational efficiency. Code available here:
https://github.com/ed-fish/geo-sign.
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.00129v2
[DATE]2025-10-28 18:56:55+08:00
[CATEGORIES]cs.LG
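For readers unfamiliar with the Poincaré ball mentioned in the Geo-Sign abstract above, the following sketch shows the standard exponential map at the origin and the standard geodesic distance for the unit ball (curvature -1). It is a textbook illustration under that fixed-curvature assumption, not the paper's projection layer; the function names are illustrative, and the authors' repository should be consulted for the actual implementation.

```python
import math


def expmap0(v, eps=1e-12):
    """Map a Euclidean vector (tangent at the origin) into the unit Poincaré ball."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < eps:
        return [0.0 for _ in v]
    scale = math.tanh(norm) / norm  # result has Euclidean norm tanh(norm) < 1
    return [scale * x for x in v]


def poincare_distance(x, y):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    nx = sum(a * a for a in x)
    ny = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - nx) * (1.0 - ny)))


# Hypothetical 2-D skeletal feature; moderate norms keep points away from
# the boundary, where the distance formula becomes numerically unstable.
feature = [0.3, 0.4]
point = expmap0(feature)
origin = [0.0, 0.0]
dist = poincare_distance(origin, point)
```

Distances grow rapidly towards the boundary of the ball, which is the property that makes hyperbolic space a natural fit for hierarchical structure.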
Problem-Parameter-Free Decentralized Bilevel Optimization
[AUTHORS]Zhiwei Zhai, Wenjing Yan, Ying-Jun Angela Zhang
[ABSTRACT]Decentralized bilevel optimization has garnered significant attention due to
its critical role in solving large-scale machine learning problems. However,
existing methods often rely on prior knowledge of problem parameters (such as
smoothness, convexity, or communication network topologies) to determine
appropriate stepsizes. In practice, these problem parameters are typically
unavailable, leading to substantial manual effort for hyperparameter tuning. In
this paper, we propose AdaSDBO, a fully problem-parameter-free algorithm for
decentralized bilevel optimization with a single-loop structure. AdaSDBO
leverages adaptive stepsizes based on cumulative gradient norms to update all
variables simultaneously, dynamically adjusting its progress and eliminating
the need for problem-specific hyperparameter tuning. Through rigorous
theoretical analysis, we establish that AdaSDBO achieves a convergence rate of
$\widetilde{\mathcal{O}}\left(\frac{1}{T}\right)$, matching the performance of
well-tuned state-of-the-art methods up to polylogarithmic factors. Extensive
numerical experiments demonstrate that AdaSDBO delivers competitive performance
compared to existing decentralized bilevel optimization methods while
exhibiting remarkable robustness across diverse stepsize configurations.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24288v1
[DATE]2025-10-28 18:50:04+08:00
[CATEGORIES]cs.LG
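The AdaSDBO abstract above relies on stepsizes driven by cumulative gradient norms. The sketch below shows that mechanism in its simplest single-level, single-node form (an AdaGrad-Norm style update) on a toy quadratic. It is not the decentralized bilevel algorithm from the paper; the function names, the test objective and the step count are assumptions chosen for illustration.

```python
import math


def adagrad_norm(grad_fn, x0, steps=500, eps=1e-8):
    """Gradient descent whose stepsize is 1 / sqrt(sum of squared gradient norms).

    No smoothness or curvature constant is supplied: the stepsize shrinks
    automatically as gradient norms accumulate.
    """
    x = list(x0)
    accum = 0.0
    for _ in range(steps):
        g = grad_fn(x)
        accum += sum(gi * gi for gi in g)
        step = 1.0 / (math.sqrt(accum) + eps)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def objective(x):
    """Toy ill-conditioned quadratic: 0.5 * (x1^2 + 10 * x2^2)."""
    return 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)


def gradient(x):
    return [x[0], 10.0 * x[1]]


start = [5.0, 5.0]
final = adagrad_norm(gradient, start)
```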
SALS: Sparse Attention in Latent Space for KV cache Compression
[AUTHORS]Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li
[ABSTRACT]Large Language Models capable of handling extended contexts are in high
demand, yet their inference remains challenging due to substantial Key-Value
cache size and high memory bandwidth requirements. Previous research has
demonstrated that KV cache exhibits low-rank characteristics within the hidden
dimension, suggesting the potential for effective compression. However, due to
the widely adopted Rotary Position Embedding mechanism in modern LLMs, naive
low-rank compression suffers severe accuracy degradation or creates a new speed
bottleneck, as the low-rank cache must first be reconstructed in order to apply
RoPE. In this paper, we introduce two key insights: first, the application of
RoPE to the key vectors increases their variance, which in turn results in a
higher rank; second, after the key vectors are transformed into the latent
space, they largely maintain their representation across most layers. Based on
these insights, we propose the Sparse Attention in Latent Space framework. SALS
projects the KV cache into a compact latent space via low-rank projection, and
performs sparse token selection using RoPE-free query-key interactions in this
space. By reconstructing only a small subset of important tokens, it avoids the
overhead of full KV cache reconstruction. We comprehensively evaluate SALS on
various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and
additionally verify its scalability on the RULER-128k benchmark with
LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA
performance by maintaining competitive accuracy. Under different settings, SALS
achieves 6.4-fold KV cache compression and 5.7-fold speed-up in the attention
operator compared to FlashAttention2 on 4K sequences. For the end-to-end
throughput performance, we achieve 1.4-fold and 4.5-fold improvements compared
to GPT-fast on 4K and 32K sequences, respectively.
[LINK]http://arxiv.org/abs/2510.24273v1
[DATE]2025-10-28 18:32:52+08:00
[CATEGORIES]cs.LG
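The pipeline outlined in the SALS abstract above (project keys into a low-rank latent space, score tokens there, reconstruct only the selected ones) can be sketched with NumPy. This is a simplified illustration under several assumptions: the projection is fitted by SVD on the keys themselves, RoPE is omitted entirely, values are left uncompressed, and all names and sizes are hypothetical. It does not reproduce the paper's method or its performance.

```python
import numpy as np


def fit_projection(keys, rank):
    """Orthonormal (d, rank) basis from the top right singular vectors of the keys."""
    _, _, vt = np.linalg.svd(keys, full_matrices=False)
    return vt[:rank].T


def latent_sparse_attention(query, latent_keys, values, proj, top_k):
    """Select tokens in latent space, then attend over reconstructed keys only."""
    latent_query = query @ proj                  # (rank,)
    scores = latent_keys @ latent_query          # approximate relevance, (n,)
    selected = np.argsort(scores)[-top_k:]       # indices of the top_k tokens
    recon_keys = latent_keys[selected] @ proj.T  # reconstruct only these keys
    logits = recon_keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())      # numerically stable softmax
    weights /= weights.sum()
    return weights @ values[selected], selected


rng = np.random.default_rng(0)
n_tokens, dim, rank, top_k = 64, 16, 4, 8
keys = rng.normal(size=(n_tokens, dim))
values = rng.normal(size=(n_tokens, dim))
query = rng.normal(size=dim)

proj = fit_projection(keys, rank)
latent_keys = keys @ proj                        # compressed cache, (n, rank)
output, selected = latent_sparse_attention(query, latent_keys, values, proj, top_k)
```

Only `top_k` of the `n_tokens` keys are ever expanded back to full dimension, which is where the saving over full cache reconstruction comes from in this toy setting.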
UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation
[AUTHORS]Jiyu Guo, Shuo Yang, Yiming Huang, Yancheng Long, Xiaobo Xia, Xiu Su, Bo Zhao, Zeke Xie, Liqiang Nie
[ABSTRACT]Data augmentation using generative models has emerged as a powerful paradigm
for enhancing performance in computer vision tasks. However, most existing
augmentation approaches primarily focus on optimizing intrinsic data attributes
– such as fidelity and diversity – to generate visually high-quality
synthetic data, while often neglecting task-specific requirements. Yet, it is
essential for data generators to account for the needs of downstream tasks, as
training data requirements can vary significantly across different tasks and
network architectures. To address these limitations, we propose UtilGen, a
novel utility-centric data augmentation framework that adaptively optimizes the
data generation process to produce task-specific, high-utility training data
via downstream task feedback. Specifically, we first introduce a weight
allocation network to evaluate the task-specific utility of each synthetic
sample. Guided by these evaluations, UtilGen iteratively refines the data
generation process using a dual-level optimization strategy to maximize the
synthetic data utility: (1) model-level optimization tailors the generative
model to the downstream task, and (2) instance-level optimization adjusts
generation policies – such as prompt embeddings and initial noise – at each
generation round. Extensive experiments on eight benchmark datasets of varying
complexity and granularity demonstrate that UtilGen consistently achieves
superior performance, with an average accuracy improvement of 3.87% over
previous SOTA. Further analysis of data influence and distribution reveals that
UtilGen produces more impactful and task-relevant synthetic data, validating
the effectiveness of the paradigm shift from visual characteristics-centric to
task utility-centric data augmentation.
[COMMENTS]39th Conference on Neural Information Processing Systems (NeurIPS
2025)
[LINK]http://arxiv.org/abs/2510.24262v1
[DATE]2025-10-28 18:17:11+08:00
[CATEGORIES]cs.LG
Forecasting precipitation in the Arctic using probabilistic machine learning informed by causal climate drivers
[AUTHORS]Madhurima Panja, Dhiman Das, Tanujit Chakraborty, Arnob Ray, R. Athulya, Chittaranjan Hens, Syamal K. Dana, Nuncio Murukesh, Dibakar Ghosh
[ABSTRACT]Understanding and forecasting precipitation events in the Arctic maritime
environments, such as Bear Island and Ny-Ålesund, is crucial for assessing
climate risk and developing early warning systems in vulnerable marine regions.
This study proposes a probabilistic machine learning framework for modeling and
predicting the dynamics and severity of precipitation. We begin by analyzing
the scale-dependent relationships between precipitation and key atmospheric
drivers (e.g., temperature, relative humidity, cloud cover, and air pressure)
using wavelet coherence, which captures localized dependencies across time and
frequency domains. To assess joint causal influences, we employ
Synergistic-Unique-Redundant Decomposition, which quantifies the impact of
interaction effects among each variable on future precipitation dynamics. These
insights inform the development of data-driven forecasting models that
incorporate both historical precipitation and causal climate drivers. To
account for uncertainty, we employ the conformal prediction method, which
enables the generation of calibrated non-parametric prediction intervals. Our
results underscore the importance of utilizing a comprehensive framework that
combines causal analysis with probabilistic forecasting to enhance the
reliability and interpretability of precipitation predictions in Arctic marine
environments.
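
As a rough illustration of the conformal prediction step mentioned above, the sketch below builds split-conformal intervals from absolute residuals on a held-out calibration set. The abstract does not say which conformal variant the authors use, so the split-conformal choice, the function name `split_conformal_interval`, and the toy numbers are all assumptions made for illustration.

```python
import math

def split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.1):
    """Return a (lower, upper) interval around new_pred with target coverage 1 - alpha."""
    # Nonconformity scores: absolute residuals on held-out calibration data.
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Finite-sample corrected rank of the score used as the interval half-width.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only a trivial interval is valid.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (new_pred - q, new_pred + q)

# Hypothetical calibration observations and point forecasts (illustrative values only).
cal_true = [1.0, 0.0, 2.5, 3.0, 0.5, 1.5, 2.0, 0.0, 4.0]
cal_pred = [1.2, 0.3, 2.0, 2.0, 0.5, 1.0, 2.6, 0.1, 3.0]
interval = split_conformal_interval(cal_true, cal_pred, new_pred=1.8, alpha=0.2)
```

The interval width depends only on the calibration residuals, not on any distributional assumption, which is what makes the resulting intervals non-parametric.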
[LINK]http://arxiv.org/abs/2510.24254v1
[DATE]2025-10-28 18:05:34+08:00
[CATEGORIES]cs.LG
Acoustic and Machine Learning Methods for Speech-Based Suicide Risk Assessment: A Systematic Review
[AUTHORS]Ambre Marie, Marine Garnier, Thomas Bertin, Laura Machart, Guillaume Dardenne, Gwenolé Quellec, Sofian Berrouiguet
[ABSTRACT]Suicide remains a public health challenge, necessitating improved detection
methods to facilitate timely intervention and treatment. This systematic review
evaluates the role of Artificial Intelligence (AI) and Machine Learning (ML) in
assessing suicide risk through acoustic analysis of speech. Following PRISMA
guidelines, we analyzed 33 articles selected from PubMed, Cochrane, Scopus, and
Web of Science databases. The last search was conducted in February 2025. Risk
of bias was assessed using the PROBAST tool. Studies analyzing acoustic
features between individuals at risk of suicide (RS) and those not at risk
(NRS) were included, while studies lacking acoustic data, a suicide-related
focus, or sufficient methodological details were excluded. Sample sizes varied
widely and were reported in terms of participants or speech segments, depending
on the study. Results were synthesized narratively based on acoustic features
and classifier performance. Findings consistently showed significant acoustic
feature variations between RS and NRS populations, particularly involving
jitter, fundamental frequency (F0), Mel-frequency cepstral coefficients (MFCC),
and power spectral density (PSD). Classifier performance varied based on
algorithms, modalities, and speech elicitation methods, with multimodal
approaches integrating acoustic, linguistic, and metadata features
demonstrating superior performance. Among the 29 classifier-based studies,
reported AUC values ranged from 0.62 to 0.985 and accuracies from 60% to
99.85%. Most datasets were imbalanced in favor of NRS, and performance metrics
were rarely reported separately by group, limiting clear identification of
direction of effect.
[COMMENTS]Preprint version of a manuscript submitted to the Journal of
Affective Disorders
[LINK]http://arxiv.org/abs/2505.18195v2
[DATE]2025-10-28 18:02:13+08:00
[CATEGORIES]cs.LG
Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
[AUTHORS]Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Michael Jones, Linus Gisslén
[ABSTRACT]While several high profile video games have served as testbeds for Deep
Reinforcement Learning (DRL), this technique has rarely been employed by the
game industry for crafting authentic AI behaviors. Previous research focuses on
training super-human agents with large models, which is impractical for game
studios with limited resources aiming for human-like agents. This paper
proposes a sample-efficient DRL method tailored for training and fine-tuning
agents in industrial settings such as the video game industry. Our method
improves sample efficiency of value-based DRL by leveraging pre-collected data
and increasing network plasticity. We evaluate our method by training a goalkeeper
agent in EA SPORTS FC 25, one of the best-selling football simulations today.
Our agent outperforms the game’s built-in AI by 10% in ball saving rate.
Ablation studies show that our method trains agents 50% faster compared to
standard DRL methods. Finally, qualitative evaluation from domain experts
indicates that our approach creates more human-like gameplay compared to
hand-crafted agents. As a testimony of the impact of the approach, the method
is intended to replace the hand-crafted counterpart in next iterations of the
series.
[LINK]http://arxiv.org/abs/2510.23216v2
[DATE]2025-10-28 17:50:12+08:00
[CATEGORIES]cs.LG
Enabling Near-realtime Remote Sensing via Satellite-Ground Collaboration of Large Vision-Language Models
[AUTHORS]Zihan Li, Jiahao Yang, Yuxin Zhang, Zhe Chen, Yue Gao
[ABSTRACT]Large vision-language models (LVLMs) have recently demonstrated great
potential in remote sensing (RS) tasks (e.g., disaster monitoring) conducted by
low Earth orbit (LEO) satellites. However, their deployment in real-world LEO
satellite systems remains largely unexplored, hindered by limited onboard
computing resources and brief satellite-ground contacts. We propose Grace, a
satellite-ground collaborative system designed for near-realtime LVLM inference
in RS tasks. Accordingly, we deploy a compact LVLM on satellites for realtime
inference, but larger ones on ground stations (GSs) to guarantee end-to-end
performance. Grace comprises two main phases: asynchronous
satellite-GS Retrieval-Augmented Generation (RAG) and a task dispatch
algorithm. First, we distill the knowledge archive of the GS RAG into the satellite
archive with a tailored adaptive update algorithm during the limited satellite-ground
data exchange period. Second, we propose a confidence-based test algorithm that
either processes the task onboard the satellite or offloads it to the GS.
Extensive experiments based on real-world satellite orbital data show that
Grace reduces the average latency by 76-95% compared to state-of-the-art
methods, without compromising inference accuracy.
[COMMENTS]15 pages, 11 figures
[LINK]http://arxiv.org/abs/2510.24242v1
[DATE]2025-10-28 17:48:26+08:00
[CATEGORIES]cs.LG
Temporal Knowledge Graph Hyperedge Forecasting: Exploring Entity-to-Category Link Prediction
[AUTHORS]Edward Markai, Sina Molavipour
[ABSTRACT]Temporal Knowledge Graphs have emerged as a powerful way of not only modeling
static relationships between entities but also the dynamics of how relations
evolve over time. As these informational structures can be used to store
information from a real-world setting, such as a news flow, predicting future
graph components to a certain extent equates to predicting real-world events. Most
of the research in this field focuses on embedding-based methods, often
leveraging convolutional neural net architectures. These solutions act as black
boxes, limiting insight. In this paper, we explore an extension to an
established rule-based framework, TLogic, that yields a high accuracy in
combination with explainable predictions. This offers transparency and allows
the end-user to critically evaluate the rules applied at the end of the
prediction stage. The new rule format incorporates entity category as a key
component with the purpose of limiting rule application only to relevant
entities. When categories are unknown for building the graph, we propose a
data-driven method to generate them with an LLM-based approach. Additionally,
we investigate the choice of aggregation method for scores of retrieved
entities when performing category prediction.
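
To make the aggregation question above concrete, the following sketch pools the scores of retrieved entities into per-category scores under three common rules (max, sum, noisy-OR). The abstract does not list which aggregation methods were compared, so these three rules, the function name `aggregate_category_scores`, and the example entities are assumptions. Noisy-OR additionally assumes each score lies in [0, 1].

```python
from collections import defaultdict

def aggregate_category_scores(entity_scores, entity_category, method="noisy_or"):
    """Pool per-entity scores into one score per category."""
    grouped = defaultdict(list)
    for entity, score in entity_scores.items():
        grouped[entity_category[entity]].append(score)

    result = {}
    for category, scores in grouped.items():
        if method == "max":
            result[category] = max(scores)
        elif method == "sum":
            result[category] = sum(scores)
        elif method == "noisy_or":
            # Probability that at least one supporting entity is correct,
            # treating the scores as independent probabilities.
            miss = 1.0
            for s in scores:
                miss *= 1.0 - s
            result[category] = 1.0 - miss
        else:
            raise ValueError(f"unknown aggregation method: {method}")
    return result

# Hypothetical retrieved entities with scores and category labels.
entity_scores = {"Germany": 0.6, "France": 0.5, "UN": 0.7}
entity_category = {"Germany": "country", "France": "country", "UN": "organisation"}

by_max = aggregate_category_scores(entity_scores, entity_category, method="max")
by_noisy_or = aggregate_category_scores(entity_scores, entity_category, method="noisy_or")
```

In this toy case the predicted category differs between rules: max favours the single strongest entity, while noisy-OR rewards a category supported by several moderately scored entities.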
[LINK]http://arxiv.org/abs/2510.24240v1
[DATE]2025-10-28 17:47:38+08:00
[CATEGORIES]cs.LG
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
[AUTHORS]Ai Jian, Jingqing Ruan, Xing Ma, Dailin Li, QianLin Zhou, Ke Zeng, Xunliang Cai
[ABSTRACT]Reward models (RMs) are central to reinforcement learning from human feedback
(RLHF), providing the critical supervision signals that align large language
models (LLMs) with human preferences. While generative reward models (GRMs)
offer greater interpretability than traditional scalar RMs, current training
paradigms remain limited. Pair-wise methods rely on binary good-versus-bad
labels, which cause mismatches for point-wise inference and necessitate complex
pairing strategies for effective application in RLHF. On the other hand,
point-wise methods require more elaborate absolute labeling with rubric-driven
criteria, resulting in poor adaptability and high annotation costs. In this
work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a
unified framework that integrates a preference-aware reward (PAR) mechanism
with dynamic rubric adaptation. PaTaRM leverages relative preference
information from pairwise data to construct robust point-wise training signals,
eliminating the need for explicit point-wise labels. Simultaneously, it employs
a task-adaptive rubric system that flexibly generates evaluation criteria for
both global task consistency and instance-specific fine-grained reasoning. This
design enables efficient, generalizable, and interpretable reward modeling for
RLHF. Extensive experiments show that PaTaRM achieves an average relative
improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B
models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average
improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its
effectiveness and robustness. Our code is available at
https://github.com/JaneEyre0530/PaTaRM.
[LINK]http://arxiv.org/abs/2510.24235v1
[DATE]2025-10-28 17:43:47+08:00
[CATEGORIES]cs.LG
The Logical Expressiveness of Temporal GNNs via Two-Dimensional Product Logics
[AUTHORS]Marco Sälzer, Przemysław Andrzej Wałęga, Martin Lange
[ABSTRACT]In recent years, the expressive power of various neural architectures –
including graph neural networks (GNNs), transformers, and recurrent neural
networks – has been characterised using tools from logic and formal language
theory. As the capabilities of basic architectures are becoming well
understood, increasing attention is turning to models that combine multiple
architectural paradigms. Among them particularly important, and challenging to
analyse, are temporal extensions of GNNs, which integrate both spatial
(graph-structure) and temporal (evolution over time) dimensions. In this paper,
we initiate the study of logical characterisation of temporal GNNs by
connecting them to two-dimensional product logics. We show that the expressive
power of temporal GNNs depends on how graph and temporal components are
combined. In particular, temporal GNNs that apply static GNNs recursively over
time can capture all properties definable in the product logic of (past)
propositional temporal logic PTL and the modal logic K. In contrast,
architectures such as graph-and-time TGNNs and global TGNNs can only express
restricted fragments of this logic, where the interaction between temporal and
spatial operators is syntactically constrained. These provide us with the first
results on the logical expressiveness of temporal GNNs.
[LINK]http://arxiv.org/abs/2505.11930v2
[DATE]2025-10-28 17:43:41+08:00
[CATEGORIES]cs.LG
Sparse Optimistic Information Directed Sampling
[AUTHORS]Ludovic Schwartz, Hamish Flynn, Gergely Neu
[ABSTRACT]Many high-dimensional online decision-making problems can be modeled as
stochastic sparse linear bandits. Most existing algorithms are designed to
achieve optimal worst-case regret in either the data-rich regime, where
polynomial dependence on the ambient dimension is unavoidable, or the
data-poor regime, where dimension-independence is possible at the cost of worse
dependence on the number of rounds. In contrast, the sparse Information
Directed Sampling (IDS) algorithm satisfies a Bayesian regret bound that has
the optimal rate in both regimes simultaneously. In this work, we explore the
use of Sparse Optimistic Information Directed Sampling (SOIDS) to achieve the
same adaptivity in the worst-case setting, without Bayesian assumptions.
Through a novel analysis that enables the use of a time-dependent learning
rate, we show that SOIDS can optimally balance information and regret. Our
results extend the theoretical guarantees of IDS, providing the first
algorithm that simultaneously achieves optimal worst-case regret in both the
data-rich and data-poor regimes. We empirically demonstrate the good
performance of SOIDS.
[LINK]http://arxiv.org/abs/2510.24234v1
[DATE]2025-10-28 17:42:15+08:00
[CATEGORIES]cs.LG
PRIVET: Privacy Metric Based on Extreme Value Theory
[AUTHORS]Antoine Szatkownik, Aurélien Decelle, Beatriz Seoane, Nicolas Bereux, Léo Planche, Guillaume Charpiat, Burak Yelmen, Flora Jay, Cyril Furtlehner
[ABSTRACT]Deep generative models are often trained on sensitive data, such as genetic
sequences, health data, or more broadly, any copyrighted, licensed or protected
content. This raises critical concerns around privacy-preserving synthetic
data, and more specifically around privacy leakage, an issue closely tied to
overfitting. Existing methods almost exclusively rely on global criteria to
estimate the risk of privacy failure associated with a model, offering only
quantitative, non-interpretable insights. The absence of rigorous evaluation
methods for data privacy at the sample-level may hinder the practical
deployment of synthetic data in real-world applications. Using extreme value
statistics on nearest-neighbor distances, we propose PRIVET, a generic
sample-based, modality-agnostic algorithm that assigns an individual privacy
leak score to each synthetic sample. We empirically demonstrate that PRIVET
reliably detects instances of memorization and privacy leakage across diverse
data modalities, including settings with very high dimensionality, limited
sample sizes such as genetic data and even under underfitting regimes. We
compare our method to existing approaches under controlled settings and show
its advantage in providing both dataset-level and sample-level assessments
through qualitative and quantitative outputs. Additionally, our analysis
reveals limitations in existing computer vision embeddings to yield
perceptually meaningful distances when identifying near-duplicate samples.
[LINK]http://arxiv.org/abs/2510.24233v1
[DATE]2025-10-28 17:42:03+08:00
[CATEGORIES]cs.LG
MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
[AUTHORS]Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
[ABSTRACT]Despite recent efforts in Large Language Model (LLM) safety and alignment,
current adversarial attacks on frontier LLMs can still consistently force
harmful generations. Although adversarial training has been widely studied and
shown to significantly improve the robustness of traditional machine learning
models, its strengths and weaknesses in the context of LLMs are less
understood. Specifically, while existing discrete adversarial attacks are
effective at producing harmful content, training LLMs with concrete adversarial
prompts is often computationally expensive, leading to reliance on continuous
relaxations. At the same time, despite their effectiveness and generalization
capabilities, training with continuous perturbations does not always capture
the full spectrum of vulnerabilities exploited by discrete attacks. In this
work, we aim to bridge this gap by introducing MixAT, a novel method that
combines stronger discrete and faster continuous attacks during training. We
rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks,
proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the
worst-case vulnerability of models. We show MixAT achieves substantially better
robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while
maintaining a runtime comparable to methods based on continuous relaxations. We
further analyze MixAT in realistic deployment settings, exploring how chat
templates, quantization, low-rank adapters, and temperature affect both
adversarial training and evaluation, revealing additional blind spots in
current methodologies. Our results demonstrate that MixAT’s discrete-continuous
defense offers a principled and superior robustness-accuracy tradeoff with
minimal computational overhead, highlighting its promise for building safer
LLMs. We provide our code and models at
https://github.com/insait-institute/MixAT.
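
One plausible reading of the ALO-ASR metric described above is the fraction of prompts on which at least one attack in the suite succeeds. The sketch below computes it under that reading; the function name `alo_asr`, the attack labels, and the outcome data are hypothetical, and the paper's exact definition may differ.

```python
def alo_asr(results):
    """results maps attack name -> list of booleans (True = attack succeeded), one per prompt."""
    per_attack = list(results.values())
    n_prompts = len(per_attack[0])
    if any(len(r) != n_prompts for r in per_attack):
        raise ValueError("every attack must be evaluated on the same prompts")
    # A prompt counts as broken if any attack in the suite succeeds on it.
    broken = [any(outcomes) for outcomes in zip(*per_attack)]
    return sum(broken) / n_prompts

# Hypothetical outcomes for three attacks on the same five prompts.
results = {
    "attack_a": [True, False, False, False, False],
    "attack_b": [False, False, True, False, False],
    "attack_c": [True, False, False, False, False],
}
individual = {name: sum(r) / len(r) for name, r in results.items()}
combined = alo_asr(results)
```

Because a prompt is counted once any attack breaks it, the combined rate can never be lower than the best individual attack's success rate, which is why it serves as a worst-case measure.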
[COMMENTS]Published at 39th Conference on Neural Information Processing Systems
(NeurIPS 2025)
[LINK]http://arxiv.org/abs/2505.16947v2
[DATE]2025-10-28 17:41:22+08:00
[CATEGORIES]cs.LG
SeeDNorm: Self-Rescaled Dynamic Normalization
[AUTHORS]Wenrui Cai, Defa Zhu, Qingjie Liu, Qiyang Min
[ABSTRACT]Normalization layers are an essential component of neural networks. In
transformers, the predominantly used RMSNorm constrains vectors to a unit
hypersphere, followed by dimension-wise rescaling through a learnable scaling
coefficient $\gamma$ to maintain the representational capacity of the model.
However, RMSNorm discards the input norm information in the forward pass and a
static scaling factor $\gamma$ may be insufficient to accommodate the wide
variability of input data and distributional shifts, thereby limiting further
performance improvements, particularly in zero-shot scenarios that large
language models routinely encounter. To address this limitation, we propose
SeeDNorm, which enhances the representational capability of the model by
dynamically adjusting the scaling coefficient based on the current input,
thereby preserving the input norm information and enabling data-dependent,
self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains
the ability of RMSNorm to dynamically adjust the gradient according to the input
norm. We provide a detailed analysis of the training optimization for SeeDNorm
and propose corresponding solutions to address potential instability issues
that may arise when applying SeeDNorm. We validate the effectiveness of
SeeDNorm across models of varying sizes in large language model pre-training as
well as supervised and unsupervised computer vision tasks. By introducing a
minimal number of parameters and with negligible impact on model efficiency,
SeeDNorm achieves consistently superior performance compared to previously
commonly used normalization layers such as RMSNorm and LayerNorm, as well as
element-wise activation alternatives to normalization layers like DyT.
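
The claim that RMSNorm discards the input norm can be checked with a few lines of code. The sketch below implements standard RMSNorm only (not SeeDNorm, whose formula is not given in the abstract) and shows that scaling the input by 10 leaves the output essentially unchanged. The function name `rms_norm`, the `eps` value, and the test vector are illustrative assumptions.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Standard RMSNorm: divide by the root mean square, then rescale per dimension."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

x = [1.0, -2.0, 3.0, 0.5]
gamma = [1.0, 1.0, 1.0, 1.0]  # static, input-independent scaling coefficient

y_small = rms_norm(x, gamma)
y_large = rms_norm([10.0 * v for v in x], gamma)  # same direction, 10x the norm

# Largest elementwise gap between the two outputs (tiny, due only to eps).
max_gap = max(abs(a - b) for a, b in zip(y_small, y_large))
```

Since the two outputs coincide up to the `eps` term, any information carried by the input magnitude is lost after normalization, and the static `gamma` cannot restore it.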
[COMMENTS]31 pages, 14 figures, 18 tables
[LINK]http://arxiv.org/abs/2510.22777v2
[DATE]2025-10-28 17:39:42+08:00
[CATEGORIES]cs.LG
A comparison between joint and dual UKF implementations for state estimation and leak localization in water distribution networks
[AUTHORS]Luis Romero-Ben, Paul Irofti, Florin Stoican, Vicenç Puig
[ABSTRACT]The sustainability of modern cities highly depends on efficient water
distribution management, including effective pressure control and leak
detection and localization. Accurate information about the network hydraulic
state is therefore essential. This article presents a comparison between two
data-driven state estimation methods based on the Unscented Kalman Filter
(UKF), fusing pressure, demand and flow data for head and flow estimation. One
approach uses a joint state vector with a single estimator, while the other
uses a dual-estimator scheme. We analyse their main characteristics, discussing
differences, advantages and limitations, and compare them theoretically in
terms of accuracy and complexity. Finally, we show several estimation results
for the L-TOWN benchmark, allowing us to discuss their properties in a real
implementation.
[COMMENTS]This work has been submitted to ECC2026 for review. It has 7 pages
and 2 figures
[LINK]http://arxiv.org/abs/2510.24228v1
[DATE]2025-10-28 17:39:41+08:00
[CATEGORIES]cs.LG
DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
[AUTHORS]Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2508.06041v3
[DATE]2025-10-28 17:34:36+08:00
[CATEGORIES]cs.LG
Minimax Optimal Transfer Learning for Kernel-based Nonparametric Regression
[AUTHORS]Chao Wang, Caixing Wang, Xin He, Xingdong Feng
[ABSTRACT]In recent years, transfer learning has garnered significant attention in the
machine learning community. Its ability to leverage knowledge from related
studies to improve generalization performance in a target study has made it
highly appealing. This paper focuses on investigating the transfer learning
problem within the context of nonparametric regression over a reproducing
kernel Hilbert space. The aim is to bridge the gap between practical
effectiveness and theoretical guarantees. We specifically consider two
scenarios: one where the transferable sources are known and another where they
are unknown. For the known transferable source case, we propose a two-step
kernel-based estimator by solely using kernel ridge regression. For the unknown
case, we develop a novel method based on an efficient aggregation algorithm,
which can automatically detect and alleviate the effects of negative sources.
This paper provides the statistical properties of the desired estimators and
establishes the minimax optimal rate. Through extensive numerical experiments
on synthetic data and real examples, we validate our theoretical findings and
demonstrate the effectiveness of our proposed method.
[LINK]http://arxiv.org/abs/2310.13966v2
[DATE]2025-10-28 17:32:53+08:00
[CATEGORIES]cs.LG
Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem
[AUTHORS]Neslihan Kose, Anthony Rhodes, Umur Aybars Ciftci, Ilke Demir
[ABSTRACT]As generative models are advancing in quality and quantity for creating
synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors
are proposed to counter this effect; however, misuse of detectors claiming fake
content as real or vice versa further fuels this misinformation problem. We
present the first comprehensive uncertainty analysis of deepfake detectors,
systematically investigating how generative artifacts influence prediction
confidence. As reflected in detectors’ responses, deepfake generators also
contribute to this uncertainty as their generative residues vary, so we cross
the uncertainty analysis of deepfake detectors and generators. Based on our
observations, the uncertainty manifold holds enough consistent information to
leverage uncertainty for deepfake source detection. Our approach leverages
Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and
epistemic uncertainties across diverse detector architectures. We evaluate
uncertainty on two datasets with nine generators, with four blind and two
biological detectors, compare different uncertainty methods, explore region-
and pixel-based uncertainty, and conduct ablation studies. We conduct and
analyze binary real/fake, multi-class real/fake, source detection, and
leave-one-out experiments between the generator/detector combinations to share
their generalization capability, model calibration, uncertainty, and robustness
against adversarial attacks. We further introduce uncertainty maps that
localize prediction confidence at the pixel level, revealing distinct patterns
correlated with generator-specific artifacts. Our analysis provides critical
insights for deploying reliable deepfake detection systems and establishes
uncertainty quantification as a fundamental requirement for trustworthy
synthetic media detection.
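
A common way to separate aleatoric from epistemic uncertainty with Monte Carlo dropout is the entropy decomposition sketched below: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the individual stochastic passes, and the difference (the mutual information) is taken as epistemic. Whether the authors use exactly this decomposition is an assumption; the function names and the sample probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def decompose_uncertainty(samples):
    """samples: list of class-probability vectors, one per stochastic forward pass."""
    n = len(samples)
    k = len(samples[0])
    mean_probs = [sum(s[c] for s in samples) / n for c in range(k)]
    total = entropy(mean_probs)                       # predictive entropy
    aleatoric = sum(entropy(s) for s in samples) / n  # expected per-pass entropy
    epistemic = total - aleatoric                     # mutual information
    return total, aleatoric, epistemic

# Hypothetical real/fake probabilities from three dropout passes.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]           # passes agree with each other
disagree = [[0.99, 0.01], [0.01, 0.99], [0.5, 0.5]]    # passes contradict each other

agree_parts = decompose_uncertainty(agree)
disagree_parts = decompose_uncertainty(disagree)
```

When the passes agree, the epistemic term is close to zero even if each pass is itself uncertain; when they contradict each other, the epistemic term dominates.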
[COMMENTS]Accepted for publication at the ICCV 2025 workshop - STREAM
[LINK]http://arxiv.org/abs/2509.17550v3
[DATE]2025-10-28 17:32:18+08:00
[CATEGORIES]cs.LG
Clustering-Based Low-Rank Matrix Approximation for Medical Image Compression
[AUTHORS]Sisipho Hamlomo, Marcellin Atemkeng
[ABSTRACT]Medical images are inherently high-resolution and contain locally varying
structures crucial for diagnosis. Efficient compression must preserve
diagnostic fidelity while minimizing redundancy. Low-rank matrix approximation
(LoRMA) techniques have shown strong potential for image compression by
capturing global correlations; however, they often fail to adapt to local
structural variations across regions of interest. To address this, we introduce
an adaptive LoRMA, which partitions a medical image into overlapping patches,
groups structurally similar patches into clusters using k-means, and performs
SVD within each cluster. We derive the overall compression factor accounting
for patch overlap and analyze how patch size influences compression efficiency
and computational cost. While applicable to any data with high local variation,
we focus on medical imaging due to its pronounced local variability. We
evaluate and compare our adaptive LoRMA against global SVD across four imaging
modalities: MRI, ultrasound, CT scan, and chest X-ray. Results demonstrate that
adaptive LoRMA effectively preserves structural integrity, edge details, and
diagnostic relevance, measured by PSNR, SSIM, MSE, IoU, and EPI. Adaptive LoRMA
minimizes block artifacts and residual errors, particularly in pathological
regions, consistently outperforming global SVD in PSNR, SSIM, IoU, EPI, and
achieving lower MSE. It prioritizes clinically salient regions while allowing
aggressive compression in non-critical regions, optimizing storage efficiency.
Although adaptive LoRMA requires higher processing time, its diagnostic
fidelity justifies the overhead for high-compression applications.
[LINK]http://arxiv.org/abs/2505.08256v2
[DATE]2025-10-28 17:31:26+08:00
[CATEGORIES]cs.LG
Unlocking Out-of-Distribution Generalization in Dynamics through Physics-Guided Augmentation
[AUTHORS]Fan Xu, Hao Wu, Kun Wang, Nan Wang, Qingsong Wen, Xian Wu, Wei Gong, Xibin Zhao
[ABSTRACT]In dynamical system modeling, traditional numerical methods are limited by
high computational costs, while modern data-driven approaches struggle with
data scarcity and distribution shifts. To address these fundamental
limitations, we first propose SPARK, a physics-guided quantitative augmentation
plugin. Specifically, SPARK utilizes a reconstruction autoencoder to integrate
physical parameters into a physics-rich discrete state dictionary. This state
dictionary then acts as a structured dictionary of physical states, enabling
the creation of new, physically-plausible training samples via principled
interpolation in the latent space. Further, for downstream prediction, these
augmented representations are seamlessly integrated with a Fourier-enhanced
Graph ODE, a combination designed to robustly model the enriched data
distribution while capturing long-term temporal dependencies. Extensive
experiments on diverse benchmarks demonstrate that SPARK significantly
outperforms state-of-the-art baselines, particularly in challenging
out-of-distribution scenarios and data-scarce regimes, proving the efficacy of
our physics-guided augmentation paradigm.
[LINK]http://arxiv.org/abs/2510.24216v1
[DATE]2025-10-28 17:30:35+08:00
[CATEGORIES]cs.LG
What Can Be Recovered Under Sparse Adversarial Corruption? Assumption-Free Theory for Linear Measurements
[AUTHORS]Vishal Halder, Alexandre Reiffers-Masson, Abdeldjalil Aïssa-El-Bey, Gugan Thoppe
[ABSTRACT]Let $\bm{A} \in \mathbb{R}^{m \times n}$ be an arbitrary, known matrix and
$\bm{e}$ a $q$-sparse adversarial vector. Given $\bm{y} = \bm{A} x^* +
\bm{e}$ and $q$, we seek the smallest set containing $x^*$ (hence the one
conveying maximal information about $x^*$) that is uniformly recoverable from
$\bm{y}$ without knowing $\bm{e}$. While exact recovery of $x^*$ via
strong (and often impractical) structural assumptions on $\bm{A}$ or $x^*$
(for example, restricted isometry, sparsity) is well studied, recoverability
for arbitrary $\bm{A}$ and $x^*$ remains open. Our main result shows that
the best that one can hope to recover is $x^* + \ker(\bm{U})$, where
$\bm{U}$ is the unique projection matrix onto the intersection of rowspaces
of all possible submatrices of $\bm{A}$ obtained by deleting $2q$ rows.
Moreover, we prove that every $x$ that minimizes the $\ell_0$-norm of
$\bm{y} - \bm{A} x$ lies in $x^* + \ker(\bm{U})$, which then gives a
constructive approach to recover this set.
[LINK]http://arxiv.org/abs/2510.24215v1
[DATE]2025-10-28 17:29:46+08:00
[CATEGORIES]cs.LG
MARS-M: When Variance Reduction Meets Matrices
[AUTHORS]Yifeng Liu, Angela Yuan, Quanquan Gu
[ABSTRACT]Matrix-based preconditioned optimizers, such as Muon, have recently been
shown to be more efficient than scalar-based optimizers for training
large-scale neural networks, including large language models (LLMs). On the
other hand, recent benchmarks on optimizers for LLM pre-training have
demonstrated that variance-reduction techniques such as MARS can achieve
substantial speedups over standard optimizers that do not employ variance
reduction. In this paper, to achieve the best of both worlds, we introduce
MARS-M, a new optimizer that integrates the variance reduction technique in
MARS with Muon. Under standard regularity conditions, we prove that MARS-M
converges to a first-order stationary point at a rate of
$\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the
$\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on
language modeling and computer vision tasks demonstrate that MARS-M
consistently yields lower losses and improved performance across various
downstream benchmarks. The implementation of MARS-M is available at
https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
[LINK]http://arxiv.org/abs/2510.21800v2
[DATE]2025-10-28 17:27:41+08:00
[CATEGORIES]cs.LG
Geometric Mixture Models for Electrolyte Conductivity Prediction
[AUTHORS]Anyi Li, Jiacheng Cen, Songyou Li, Mingze Li, Yang Yu, Wenbing Huang
[ABSTRACT]Accurate prediction of ionic conductivity in electrolyte systems is crucial
for advancing numerous scientific and technological applications. While
significant progress has been made, current research faces two fundamental
challenges: (1) the lack of high-quality standardized benchmarks, and (2)
inadequate modeling of geometric structure and intermolecular interactions in
mixture systems. To address these limitations, we first reorganize and enhance
the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph
representations of molecules. We then propose GeoMix, a novel geometry-aware
framework that preserves Set-SE(3) equivariance, an essential but challenging
property for mixture systems. At the heart of GeoMix lies the Geometric
Interaction Network (GIN), an equivariant module specifically designed for
intermolecular geometric message passing. Comprehensive experiments demonstrate
that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs,
and geometric GNNs) across both datasets, validating the importance of
cross-molecular geometric interactions and equivariant message passing for
accurate property prediction. This work not only establishes new benchmarks for
electrolyte research but also provides a general geometric learning framework
that advances modeling of mixture systems in energy materials, pharmaceutical
development, and beyond.
[LINK]http://arxiv.org/abs/2510.15403v2
[DATE]2025-10-28 17:19:23+08:00
[CATEGORIES]cs.LG
Taxonomy and Trends in Reinforcement Learning for Robotics and Control Systems: A Structured Review
[AUTHORS]Kumater Ter, Ore-Ofe Ajayi, Daniel Udekwe
[ABSTRACT]Reinforcement learning (RL) has become a foundational approach for enabling
intelligent robotic behavior in dynamic and uncertain environments. This work
presents an in-depth review of RL principles, advanced deep reinforcement
learning (DRL) algorithms, and their integration into robotic and control
systems. Beginning with the formalism of Markov Decision Processes (MDPs), the
study outlines essential elements of the agent-environment interaction and
explores core algorithmic strategies including actor-critic methods,
value-based learning, and policy gradients. Emphasis is placed on modern DRL
techniques such as DDPG, TD3, PPO, and SAC, which have shown promise in solving
high-dimensional, continuous control tasks. A structured taxonomy is introduced
to categorize RL applications across domains such as locomotion, manipulation,
multi-agent coordination, and human-robot interaction, along with training
methodologies and deployment readiness levels. The review synthesizes recent
research efforts, highlighting technical trends, design patterns, and the
growing maturity of RL in real-world robotics. Overall, this work aims to
bridge theoretical advances with practical implementations, providing a
consolidated perspective on the evolving role of RL in autonomous robotic
systems.
[LINK]http://arxiv.org/abs/2510.21758v2
[DATE]2025-10-28 17:14:57+08:00
[CATEGORIES]cs.LG
URB - Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles
[AUTHORS]Ahmet Onur Akman, Anastasia Psarou, Michał Hoffmann, Łukasz Gorczyca, Łukasz Kowalski, Paweł Gora, Grzegorz Jamróz, Rafał Kucharski
[COMMENTS]Accepted at the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025), Datasets and Benchmarks Track
[LINK]http://arxiv.org/abs/2505.17734v2
[DATE]2025-10-28 17:06:30+08:00
[CATEGORIES]cs.LG
SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning
[AUTHORS]Alexander Bakarsky, Dimitar I. Dimitrov, Maximilian Baader, Martin Vechev
[ABSTRACT]Federated Learning has seen an increased deployment in real-world scenarios
recently, as it enables the distributed training of machine learning models
without explicit data sharing between individual clients. Yet, the introduction
of the so-called gradient inversion attacks has fundamentally challenged its
privacy-preserving properties. Unfortunately, as these attacks mostly rely on
direct data optimization without any formal guarantees, the vulnerability of
real-world systems remains in dispute and requires tedious testing for each new
federated deployment. To overcome these issues, recently the SPEAR attack was
introduced, which is based on a theoretical analysis of the gradients of linear
layers with ReLU activations. While SPEAR is an important theoretical
breakthrough, the attack’s practicality was severely limited by its exponential
runtime in the batch size b. In this work, we fill this gap by applying
state-of-the-art techniques from Sparsely-Used Dictionary Learning to make the
problem of gradient inversion on linear layers with ReLU activations tractable.
Our experiments demonstrate that our new attack, SPEAR++, retains all desirable
properties of SPEAR, such as robustness to DP noise and FedAvg aggregation,
while being applicable to 10x bigger batch sizes.
[COMMENTS]Published at the Workshop on Regulatable ML at the 39th Conference on
Neural Information Processing Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2510.24200v1
[DATE]2025-10-28 17:06:19+08:00
[CATEGORIES]cs.LG
Blindfolded Experts Generalize Better: Insights from Robotic Manipulation and Videogames
[AUTHORS]Ev Zisselman, Mirco Mutti, Shelly Francis-Meretzki, Elisei Shafer, Aviv Tamar
[ABSTRACT]Behavioral cloning is a simple yet effective technique for learning
sequential decision-making from demonstrations. Recently, it has gained
prominence as the core of foundation models for the physical world, where
achieving generalization requires countless demonstrations of a multitude of
tasks. Typically, a human expert with full information on the task demonstrates
a (nearly) optimal behavior. In this paper, we propose to hide some of the
task’s information from the demonstrator. This “blindfolded” expert is
compelled to employ non-trivial exploration to solve the task. We show that
cloning the blindfolded expert generalizes better to unseen tasks than its
fully-informed counterpart. We conduct experiments on real-world robot peg
insertion tasks with (limited) human demonstrations, alongside videogames from
the Procgen benchmark. Additionally, we support our findings with theoretical
analysis, which confirms that the generalization error scales with
$\sqrt{I/m}$, where $I$ measures the amount of task information available to
the demonstrator, and $m$ is the number of demonstrated tasks. Both theory and
practice indicate that cloning blindfolded experts generalizes better with
fewer demonstrated tasks. Project page with videos and code:
https://sites.google.com/view/blindfoldedexperts/home
[LINK]http://arxiv.org/abs/2510.24194v1
[DATE]2025-10-28 16:57:27+08:00
[CATEGORIES]cs.LG
Do Language Models Use Their Depth Efficiently?
[AUTHORS]Róbert Csordás, Christopher D. Manning, Christopher Potts
[ABSTRACT]Modern LLMs are increasingly deep, and depth correlates with performance,
albeit with diminishing returns. However, do these models use their depth
efficiently? Do they compose more features to create higher-order computations
that are impossible in shallow models, or do they merely spread the same kinds
of computation out over more layers? To address these questions, we analyze the
residual stream of the Llama 3.1, Qwen 3, and OLMo 2 families of models. We find:
First, comparing the output of the sublayers to the residual stream reveals
that layers in the second half contribute much less than those in the first
half, with a clear phase transition between the two halves. Second, skipping
layers in the second half has a much smaller effect on future computations and
output predictions. Third, for multihop tasks, we are unable to find evidence
that models are using increased depth to compose subresults in examples
involving many hops. Fourth, we seek to directly address whether deeper models
are using their additional layers to perform new kinds of computation. To do
this, we train linear maps from the residual stream of a shallow model to a
deeper one. We find that layers with the same relative depth map best to each
other, suggesting that the larger model simply spreads the same computations
out over its many layers. All this evidence suggests that deeper models are not
using their depth to learn new kinds of computation, but only using the greater
depth to perform more fine-grained adjustments to the residual. This may help
explain why increasing scale leads to diminishing returns for stacked
Transformer architectures.
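The linear-map probe mentioned above can be sketched with ordinary least squares. This is a minimal illustration on synthetic stand-in activations, not the authors' setup: the array sizes, the noise level, and the helper names are all assumptions. A lower held-out error is what "maps best" would mean in this sketch.

```python
import numpy as np

def fit_linear_map(src, dst):
    """Least-squares W minimizing ||src @ W - dst||_F."""
    W, *_ = np.linalg.lstsq(src, dst, rcond=None)
    return W

def relative_error(src, dst, W):
    """Relative Frobenius error of predicting dst from src with W."""
    return float(np.linalg.norm(src @ W - dst) / np.linalg.norm(dst))

rng = np.random.default_rng(0)
n_tokens, d_small, d_large = 2000, 16, 32  # hypothetical sizes

# Synthetic stand-ins for residual-stream activations (not real model data).
shallow = rng.standard_normal((n_tokens, d_small))
W_true = rng.standard_normal((d_small, d_large))
deep_matched = shallow @ W_true + 0.1 * rng.standard_normal((n_tokens, d_large))
deep_unrelated = rng.standard_normal((n_tokens, d_large))

# Fit on the first 1500 tokens, evaluate on the remaining 500.
train, test = slice(0, 1500), slice(1500, None)
err_matched = relative_error(
    shallow[test], deep_matched[test],
    fit_linear_map(shallow[train], deep_matched[train]))
err_unrelated = relative_error(
    shallow[test], deep_unrelated[test],
    fit_linear_map(shallow[train], deep_unrelated[train]))
print(err_matched, err_unrelated)
```

In a real analysis one such map would be fitted for every (shallow layer, deep layer) pair and the errors compared across pairs.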
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.13898v3
[DATE]2025-10-28 16:56:58+08:00
[CATEGORIES]cs.LG
Self-Concordant Perturbations for Linear Bandits
[AUTHORS]Lucas Lévy, Jean-Lou Valeau, Arya Akhavan, Patrick Rebeschini
[ABSTRACT]We study the adversarial linear bandits problem and present a unified
algorithmic framework that bridges Follow-the-Regularized-Leader (FTRL) and
Follow-the-Perturbed-Leader (FTPL) methods, extending the known connection
between them from the full-information setting. Within this framework, we
introduce self-concordant perturbations, a family of probability distributions
that mirror the role of self-concordant barriers previously employed in the
FTRL-based SCRiBLe algorithm. Using this idea, we design a novel FTPL-based
algorithm that combines self-concordant regularization with efficient
stochastic exploration. Our approach achieves a regret of $O(d\sqrt{n \ln n})$
on both the $d$-dimensional hypercube and the Euclidean ball. On the Euclidean
ball, this matches the rate attained by existing self-concordant FTRL methods.
For the hypercube, this represents a $\sqrt{d}$ improvement over these methods
and matches the optimal bound up to logarithmic factors.
[LINK]http://arxiv.org/abs/2510.24187v1
[DATE]2025-10-28 16:47:15+08:00
[CATEGORIES]cs.LG
Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion
[AUTHORS]Haoyu Li, Xiangru Zhong, Bin Hu, Huan Zhang
[ABSTRACT]Learning-based neural network (NN) control policies have shown impressive
empirical performance. However, obtaining stability guarantees and estimates of
the region of attraction of these learned neural controllers is challenging due
to the lack of stable and scalable training and verification algorithms.
Although previous works in this area have achieved great success, much
conservatism remains in their frameworks. In this work, we propose a novel
two-stage training framework to jointly synthesize a controller and a Lyapunov
function for continuous-time systems. By leveraging a Zubov-inspired region of
attraction characterization to directly estimate stability boundaries, we
propose a novel training-data sampling strategy and a domain-updating mechanism
that significantly reduces the conservatism in training. Moreover, unlike
existing works on continuous-time systems that rely on an SMT solver to
formally verify the Lyapunov condition, we extend state-of-the-art neural
network verifier $\alpha,\!\beta$-CROWN with the capability of performing
automatic bound propagation through the Jacobian of dynamical systems and a
novel verification scheme that avoids expensive bisection. To demonstrate the
effectiveness of our approach, we conduct numerical experiments by synthesizing
and verifying controllers on several challenging nonlinear systems across
multiple dimensions. We show that our training can yield regions of attraction
with volume $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and
our verification on continuous systems can be up to $40-10{,}000$ times faster
compared to the traditional SMT solver dReal. Our code is available at
https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.01356v2
[DATE]2025-10-28 16:42:04+08:00
[CATEGORIES]cs.LG
V-SAT: Video Subtitle Annotation Tool
[AUTHORS]Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, Aritra Sen, Srushti Anil Patil, Vishwanathan Raman
[ABSTRACT]The surge of audiovisual content on streaming platforms and social media has
heightened the demand for accurate and accessible subtitles. However, existing
subtitle generation methods, primarily speech-based transcription or OCR-based
extraction, suffer from several shortcomings, including poor synchronization,
incorrect or harmful text, inconsistent formatting, inappropriate reading
speeds, and the inability to adapt to dynamic audio-visual contexts. Current
approaches often address isolated issues, leaving post-editing as a
labor-intensive and time-consuming process. In this paper, we introduce V-SAT
(Video Subtitle Annotation Tool), a unified framework that automatically
detects and corrects a wide range of subtitle quality issues. By combining
Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing,
and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from
both audio and video. Subtitle quality improved, with the SUBER score reduced
from 9.6 to 3.54 after resolving all language mode issues and F1-scores of
~0.80 for image mode issues. Human-in-the-loop validation ensures high-quality
results, providing the first comprehensive solution for robust subtitle
annotation.
[LINK]http://arxiv.org/abs/2510.24180v1
[DATE]2025-10-28 16:34:27+08:00
[CATEGORIES]cs.LG
Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model
[AUTHORS]Kotaro Ikeda, Masanori Koyama, Jinzhe Zhang, Kohei Hayashi, Kenji Fukumizu
[ABSTRACT]In this paper, we propose a flow-based method for learning all-to-all
transfer maps among conditional distributions that approximates pairwise
optimal transport. The proposed method addresses the challenge of handling the
case of continuous conditions, which often involve a large set of conditions
with sparse empirical observations per condition. We introduce a novel cost
function that enables simultaneous learning of optimal transports for all pairs
of conditional distributions. Our method is supported by a theoretical
guarantee that, in the limit, it converges to the pairwise optimal transports
among infinite pairs of conditional distributions. The learned transport maps
are subsequently used to couple data points in conditional flow matching. We
demonstrate the effectiveness of this method on synthetic and benchmark
datasets, as well as on chemical datasets in which continuous physical
properties are defined as conditions. The code for this project can be found at
https://github.com/kotatumuri-room/A2A-FM
[COMMENTS]Accepted at NeurIPS 2025, 32 pages, 18 figures
[LINK]http://arxiv.org/abs/2504.03188v4
[DATE]2025-10-28 16:28:52+08:00
[CATEGORIES]cs.LG
EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale
[AUTHORS]Yiheng Du, Aditi S. Krishnapriyan
[ABSTRACT]Computationally resolving turbulence remains a central challenge in fluid
dynamics due to its multi-scale interactions. Fully resolving large-scale
turbulence through direct numerical simulation (DNS) is computationally
prohibitive, motivating data-driven machine learning alternatives. In this
work, we propose EddyFormer, a Transformer-based spectral-element (SEM)
architecture for large-scale turbulence simulation that combines the accuracy
of spectral methods with the scalability of the attention mechanism. We
introduce an SEM tokenization that decomposes the flow into grid-scale and
subgrid-scale components, enabling capture of both local and global features.
We create a new three-dimensional isotropic turbulence dataset and train
EddyFormer to achieve DNS-level accuracy at 256^3 resolution, providing a 30x
speedup over DNS. When applied to unseen domains up to 4x larger than in
training, EddyFormer preserves accuracy on physics-invariant metrics (energy
spectra, correlation functions, and structure functions), showing domain
generalization. On The Well benchmark suite of diverse turbulent flows,
EddyFormer resolves cases where prior ML models fail to converge, accurately
reproducing complex dynamics across a wide range of physical conditions.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24173v1
[DATE]2025-10-28 16:27:37+08:00
[CATEGORIES]cs.LG
Scalable Exploration via Ensemble++
[AUTHORS]Yingru Li, Jiawei Xu, Baoxiang Wang, Zhi-Quan Luo
[ABSTRACT]Thompson Sampling is a principled method for balancing exploration and
exploitation, but its real-world adoption faces computational challenges in
large-scale or non-conjugate settings. While ensemble-based approaches offer
partial remedies, they typically require prohibitively large ensemble sizes. We
propose Ensemble++, a scalable exploration framework using a novel
shared-factor ensemble architecture with random linear combinations. For linear
bandits, we provide theoretical guarantees showing that Ensemble++ achieves
regret comparable to exact Thompson Sampling with only $\Theta(d \log T)$
ensemble sizes, significantly outperforming prior methods. Crucially, this
efficiency holds across both compact and finite action sets with either
time-invariant or time-varying contexts without configuration changes. We
extend this theoretical foundation to nonlinear rewards by replacing fixed
features with learnable neural representations while preserving the same
incremental update principle, effectively bridging theory and practice for
real-world tasks. Comprehensive experiments across linear, quadratic, neural,
and GPT-based contextual bandits validate our theoretical findings and
demonstrate Ensemble++’s superior regret-computation tradeoff versus
state-of-the-art methods.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2407.13195v6
[DATE]2025-10-28 16:26:28+08:00
[CATEGORIES]cs.LG
FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
[AUTHORS]Joongwon Lee, Seonghwan Kim, Seokhyun Moon, Hyunwoo Kim, Woo Youn Kim
[ABSTRACT]We introduce FragFM, a novel hierarchical framework via fragment-level
discrete flow matching for efficient molecular graph generation. FragFM
generates molecules at the fragment level, leveraging a coarse-to-fine
autoencoder to reconstruct details at the atom level. Together with a
stochastic fragment bag strategy to effectively handle an extensive fragment
space, our framework enables more efficient and scalable molecular generation.
We demonstrate that our fragment-based approach achieves better property
control than the atom-based method and additional flexibility through
conditioning the fragment bag. We also propose a Natural Product Generation
benchmark (NPGen) to evaluate modern molecular graph generative models’ ability
to generate natural product-like molecules. Since natural products are
biologically prevalidated and differ from typical drug-like molecules, our
benchmark provides a more challenging yet meaningful evaluation relevant to
drug discovery. We conduct a FragFM comparative study against various models on
diverse molecular generation benchmarks, including NPGen, demonstrating
superior performance. The results highlight the potential of fragment-based
generative modeling for large-scale, property-aware molecular design, paving
the way for more efficient exploration of chemical space.
[COMMENTS]49 pages, 29 figures, under review
[LINK]http://arxiv.org/abs/2502.15805v3
[DATE]2025-10-28 16:12:05+08:00
[CATEGORIES]cs.LG
PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
[AUTHORS]Tatsuki Kawakami, Kazuki Egashira, Atsuyuki Miyai, Go Irie, Kiyoharu Aizawa
[COMMENTS]Accepted at NeurIPS 2025 Workshop: Evaluating the Evolving LLM
Lifecycle
[LINK]http://arxiv.org/abs/2507.01271v4
[DATE]2025-10-28 16:11:23+08:00
[CATEGORIES]cs.LG
Rademacher Meets Colors: More Expressivity, but at What Cost?
[AUTHORS]Martin Carrasco, Caio F. Deberaldini Netto, Vahan A. Martirosyan, Aneeqa Mehrab, Ehimare Okoyomon, Caterina Graziani
[ABSTRACT]The expressive power of graph neural networks (GNNs) is typically understood
through their correspondence with graph isomorphism tests such as the
Weisfeiler-Leman (WL) hierarchy. While more expressive GNNs can distinguish a
richer set of graphs, they are also observed to suffer from higher
generalization error. This work provides a theoretical explanation for this
trade-off by linking expressivity and generalization through the lens of
coloring algorithms. Specifically, we show that the number of equivalence
classes induced by WL colorings directly bounds the GNN's Rademacher complexity,
a key data-dependent measure of generalization. Our analysis reveals that
greater expressivity leads to higher complexity and thus weaker generalization
guarantees. Furthermore, we prove that the Rademacher complexity is stable
under perturbations in the color counts across different samples, ensuring
robustness to sampling variability across datasets. Importantly, our framework
is not restricted to message-passing GNNs or 1-WL, but extends to arbitrary GNN
architectures and expressivity measures that partition graphs into equivalence
classes. These results unify the study of expressivity and generalization in
GNNs, providing a principled understanding of why increasing expressive power
often comes at the cost of generalization.
[LINK]http://arxiv.org/abs/2510.10101v3
[DATE]2025-10-28 15:57:39+08:00
[CATEGORIES]cs.LG
Identifiable learning of dissipative dynamics
[AUTHORS]Aiqing Zhu, Beatrice W. Soh, Grigorios A. Pavliotis, Qianxiao Li
[ABSTRACT]Complex dissipative systems appear across science and engineering, from
polymers and active matter to learning algorithms. These systems operate far
from equilibrium, where energy dissipation and time irreversibility are key to
their behavior, but are difficult to quantify from data. Learning accurate and
interpretable models of such dynamics remains a major challenge: the models
must be expressive enough to describe diverse processes, yet constrained enough
to remain physically meaningful and mathematically identifiable. Here, we
introduce I-OnsagerNet, a neural framework that learns dissipative stochastic
dynamics directly from trajectories while ensuring both interpretability and
uniqueness. I-OnsagerNet extends the Onsager principle to guarantee that the
learned potential is obtained from the stationary density and that the drift
decomposes cleanly into time-reversible and time-irreversible components, as
dictated by the Helmholtz decomposition. Our approach enables us to calculate
the entropy production and to quantify irreversibility, offering a principled
way to detect and quantify deviations from equilibrium. Applications to polymer
stretching in elongational flow and to stochastic gradient Langevin dynamics
reveal new insights, including super-linear scaling of barrier heights and
sub-linear scaling of entropy production rates with the strain rate, and the
suppression of irreversibility with increasing batch size. I-OnsagerNet thus
establishes a general, data-driven framework for discovering and interpreting
non-equilibrium dynamics.
[LINK]http://arxiv.org/abs/2510.24160v1
[DATE]2025-10-28 15:57:14+08:00
[CATEGORIES]cs.LG
Self-supervised Synthetic Pretraining for Inference of Stellar Mass Embedded in Dense Gas
[AUTHORS]Keiya Hirashima, Shingo Nozaki, Naoto Harada
[ABSTRACT]Stellar mass is a fundamental quantity that determines the properties and
evolution of stars. However, estimating stellar masses in star-forming regions
is challenging because young stars are obscured by dense gas and the regions
are highly inhomogeneous, making spherical dynamical estimates unreliable.
Supervised machine learning could link such complex structures to stellar mass,
but it requires large, high-quality labeled datasets from high-resolution
magneto-hydrodynamical (MHD) simulations, which are computationally expensive.
We address this by pretraining a vision transformer on one million synthetic
fractal images using the self-supervised framework DINOv2, and then applying
the frozen model to limited high-resolution MHD simulations. Our results
demonstrate that synthetic pretraining improves frozen-feature regression
stellar mass predictions, with the pretrained model performing slightly better
than a supervised model trained on the same limited simulations. Principal
component analysis of the extracted features further reveals semantically
meaningful structures, suggesting that the model enables unsupervised
segmentation of star-forming regions without the need for labeled data or
fine-tuning.
[COMMENTS]6 pages, 3 figures, 1 table, accepted for NeurIPS 2025 ML4PS workshop
[LINK]http://arxiv.org/abs/2510.24159v1
[DATE]2025-10-28 15:55:34+08:00
[CATEGORIES]cs.LG
Interpretable Clustering with Adaptive Heterogeneous Causal Structure Learning in Mixed Observational Data
[AUTHORS]Wenrui Li, Qinghao Zhang, Xiaowo Wang
[ABSTRACT]Understanding causal heterogeneity is essential for scientific discovery in
domains such as biology and medicine. However, existing methods lack causal
awareness, with insufficient modeling of heterogeneity, confounding, and
observational constraints, leading to poor interpretability and difficulty
distinguishing true causal heterogeneity from spurious associations. We propose
an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering
with Adaptive Heterogeneous Causal Structure Learning), that jointly infers
latent clusters and their associated causal structures from mixed-type
observational data without requiring temporal ordering, environment labels,
interventions or other prior knowledge. HCL relaxes the homogeneity and
sufficiency assumptions by introducing an equivalent representation that
encodes both structural heterogeneity and confounding. It further develops a
bi-directional iterative strategy to alternately refine causal clustering and
structure learning, along with a self-supervised regularization that balances
cross-cluster universality and specificity. Together, these components enable
convergence toward interpretable, heterogeneous causal patterns. Theoretically,
we show identifiability of heterogeneous causal structures under mild
conditions. Empirically, HCL achieves superior performance in both clustering
and structure learning tasks, and recovers biologically meaningful mechanisms
in real-world single-cell perturbation data, demonstrating its utility for
discovering interpretable, mechanism-level causal heterogeneity.
[LINK]http://arxiv.org/abs/2509.04415v2
[DATE]2025-10-28 15:32:34+08:00
[CATEGORIES]cs.LG
Fixed Point Neural Acceleration and Inverse Surrogate Model for Battery Parameter Identification
[AUTHORS]Hojin Cheon, Hyeongseok Seo, Jihun Jeon, Wooju Lee, Dohyun Jeong, Hongseok Kim
[ABSTRACT]The rapid expansion of electric vehicles has intensified the need for
accurate and efficient diagnosis of lithium-ion batteries. Parameter
identification of electrochemical battery models is widely recognized as a
powerful method for battery health assessment. However, conventional
metaheuristic approaches suffer from high computational cost and slow
convergence, and recent machine learning methods are limited by their reliance
on constant current data, which may not be available in practice. To overcome
these challenges, we propose a deep learning-based framework for parameter
identification of electrochemical battery models. The proposed framework
combines a neural surrogate model of the single particle model with electrolyte
(NeuralSPMe) and a deep learning-based fixed-point iteration method. NeuralSPMe
is trained on realistic EV load profiles to accurately predict lithium
concentration dynamics under dynamic operating conditions while a parameter
update network (PUNet) performs fixed-point iterative updates to significantly
reduce both the evaluation time per sample and the overall number of iterations
required for convergence. Experimental evaluations demonstrate that the
proposed framework accelerates the parameter identification by more than 2000
times, achieves superior sample efficiency and more than 10 times higher
accuracy compared to conventional metaheuristic algorithms, particularly under
dynamic load scenarios encountered in practical applications.
[COMMENTS]31 pages, 11 figures, submitted to Applied Energy
[LINK]http://arxiv.org/abs/2510.24135v1
[DATE]2025-10-28 15:20:38+08:00
[CATEGORIES]cs.LG
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
[AUTHORS]Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
[ABSTRACT]The exponential growth in demand for GPU computing resources has created an
urgent need for automated CUDA optimization strategies. While recent advances
in LLMs show promise for code generation, current SOTA models achieve low
success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an
automated reinforcement learning framework for CUDA optimization that employs a
novel contrastive RL algorithm.
CUDA-L1 achieves significant performance improvements on the CUDA
optimization task: trained on A100, it delivers an average speedup of x3.12
with a median speedup of x1.42 against default baselines across all 250
CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to
the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over
Torch Compile, x2.88 over Torch Compile with reduce overhead, x2.81 over CUDA
Graph implementations, and remarkably x7.72 over cuDNN libraries. Furthermore,
the model also demonstrates portability across different GPU architectures.
Beyond these benchmark results, CUDA-L1 demonstrates several properties: it
1) discovers a variety of CUDA optimization techniques and learns to combine
them strategically to achieve optimal performance; 2) uncovers fundamental
principles of CUDA optimization, such as the multiplicative nature of
optimizations; 3) identifies non-obvious performance bottlenecks and rejects
seemingly beneficial optimizations that actually harm performance. These
capabilities demonstrate that RL can transform an initially poor-performing
LLM into an effective CUDA optimizer through speedup-based reward signals
alone, without human expertise or domain knowledge. This paradigm opens
possibilities for automated optimization of CUDA operations, and holds promise
to substantially promote GPU efficiency and alleviate the rising pressure on
GPU computing resources.
[COMMENTS]Project Page: https://deepreinforce-ai.github.io/cudal1_blog/
[LINK]http://arxiv.org/abs/2507.14111v8
[DATE]2025-10-28 15:04:44+08:00
[CATEGORIES]cs.LG
Causal Convolutional Neural Networks as Finite Impulse Response Filters
[AUTHORS]Kiran Bacsa, Wei Liu, Xudong Jian, Huangbin Liang, Eleni Chatzi
[ABSTRACT]This study investigates the behavior of Causal Convolutional Neural Networks
(CNNs) with quasi-linear activation functions when applied to time-series data
characterized by multimodal frequency content. We demonstrate that, once
trained, such networks exhibit properties analogous to Finite Impulse Response
(FIR) filters, particularly when the convolutional kernels are of extended
length exceeding those typically employed in standard CNN architectures. Causal
CNNs are shown to capture spectral features both implicitly and explicitly,
offering enhanced interpretability for tasks involving dynamic systems.
Leveraging the associative property of convolution, we further show that the
entire network can be reduced to an equivalent single-layer filter resembling
an FIR filter optimized via least-squares criteria. This equivalence yields new
insights into the spectral learning behavior of CNNs trained on signals with
sparse frequency content. The approach is validated on both simulated beam
dynamics and real-world bridge vibration datasets, underlining its relevance
for modeling and identifying physical systems governed by dynamic responses.
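The reduction to a single equivalent filter can be illustrated for the idealized case of a purely linear stack. The sketch below is not the authors' code: it assumes single-channel layers, no biases, and exactly linear activations (the abstract speaks of quasi-linear ones), and the kernel lengths and helper name are hypothetical. Because convolution is associative, convolving the layer kernels together yields one FIR kernel that reproduces the stacked output.

```python
import numpy as np

def causal_conv(x, kernel):
    """Causal FIR filtering: y[n] = sum_k kernel[k] * x[n - k], zero initial state."""
    return np.convolve(x, kernel)[: len(x)]

rng = np.random.default_rng(0)
x = rng.standard_normal(256)                            # hypothetical input signal
kernels = [rng.standard_normal(k) for k in (5, 9, 17)]  # hypothetical layer kernels

# Pass the signal through the linear stack one layer at a time.
y_stack = x
for h in kernels:
    y_stack = causal_conv(y_stack, h)

# Collapse the stack into one equivalent FIR kernel by convolving the kernels.
h_eq = kernels[0]
for h in kernels[1:]:
    h_eq = np.convolve(h_eq, h)

y_single = causal_conv(x, h_eq)
max_abs_diff = float(np.max(np.abs(y_stack - y_single)))
print(len(h_eq), max_abs_diff)
```

The equivalent kernel has length equal to the sum of the layer kernel lengths minus the number of layers plus one, which is the effective memory of the collapsed filter in this linear setting.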
[COMMENTS]14 pages, 19 figures, Under review
[LINK]http://arxiv.org/abs/2510.24125v1
[DATE]2025-10-28 14:57:14+08:00
[CATEGORIES]cs.LG
Mixture-of-Experts Meets In-Context Reinforcement Learning
[AUTHORS]Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
[ABSTRACT]In-context reinforcement learning (ICRL) has emerged as a promising paradigm
for adapting RL agents to downstream tasks through prompt conditioning.
However, two notable challenges remain in fully harnessing in-context learning
within RL domains: the intrinsic multi-modality of the state-action-reward data
and the diverse, heterogeneous nature of decision tasks. To tackle these
challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an
innovative framework that introduces architectural advances of
mixture-of-experts (MoE) into transformer-based decision models. T2MIR
substitutes the feedforward layer with two parallel layers: a token-wise MoE
that captures distinct semantics of input tokens across multiple modalities,
and a task-wise MoE that routes diverse tasks to specialized experts for
managing a broad task distribution with alleviated gradient conflicts. To
enhance task-wise routing, we introduce a contrastive learning method that
maximizes the mutual information between the task and its router
representation, enabling more precise capture of task-relevant information. The
outputs of the two MoE components are concatenated and fed into the next layer.
Comprehensive experiments show that T2MIR significantly improves in-context
learning capacity and outperforms various types of baselines. We bring the
potential and promise of MoE to ICRL, offering a simple and scalable
architectural enhancement that brings ICRL one step closer to the achievements
of the language and vision communities. Our code is available at
https://github.com/NJU-RL/T2MIR.
[COMMENTS]28 pages, 13 figures, 17 tables
[LINK]http://arxiv.org/abs/2506.05426v3
[DATE]2025-10-28 14:55:14+08:00
[CATEGORIES]cs.LG
Graph-Guided Concept Selection for Efficient Retrieval-Augmented Generation
[AUTHORS]Ziyu Liu, Yijing Liu, Jianfei Yuan, Minzhi Yan, Le Yue, Honghui Xiong, Yi Yang
[LINK]http://arxiv.org/abs/2510.24120v1
[DATE]2025-10-28 14:47:30+08:00
[CATEGORIES]cs.LG
HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology
[AUTHORS]Sandeep Vissapragada, Vikrant Sahu, Gagan Raj Gupta, Vandita Singh
[ABSTRACT]For doctors to truly trust artificial intelligence, it can’t be a black box.
They need to understand its reasoning, almost as if they were consulting a
colleague. We created HistoLens to be that transparent, collaborative partner.
It allows a pathologist to simply ask a question in plain English about a
tissue slide, just as they would ask a trainee. Our system intelligently
translates this question into a precise query for its AI engine, which then
provides a clear, structured report. But it doesn’t stop there. If a doctor
ever asks, “Why?”, HistoLens can instantly provide a ‘visual proof’ for any
finding: a heatmap that points to the exact cells and regions the AI used for
its analysis. We’ve also ensured the AI focuses only on the patient’s tissue,
just like a trained pathologist would, by teaching it to ignore distracting
background noise. The result is a workflow where the pathologist remains the
expert in charge, using a trustworthy AI assistant to verify their insights and
make faster, more confident diagnoses.
[LINK]http://arxiv.org/abs/2510.24115v1
[DATE]2025-10-28 14:38:59+08:00
[CATEGORIES]cs.LG
An unsupervised tour through the hidden pathways of deep neural networks
[AUTHORS]Diego Doimo
[ABSTRACT]The goal of this thesis is to improve our understanding of the internal
mechanisms by which deep artificial neural networks create meaningful
representations and are able to generalize. We focus on the challenge of
characterizing the semantic content of the hidden representations with
unsupervised learning tools, partially developed by us and described in this
thesis, which allow harnessing the low-dimensional structure of the data.
Chapter 2 introduces Gride, a method that allows estimating the intrinsic
dimension of the data as an explicit function of the scale without performing
any decimation of the data set. Our approach is based on rigorous
distributional results that enable the quantification of uncertainty of the
estimates. Moreover, our method is simple and computationally efficient since
it relies only on the distances among nearest data points. In Chapter 3, we
study the evolution of the probability density across the hidden layers in some
state-of-the-art deep neural networks. We find that the initial layers generate
a unimodal probability density getting rid of any structure irrelevant to
classification. In subsequent layers, density peaks arise in a hierarchical
fashion that mirrors the semantic hierarchy of the concepts. This process
leaves a footprint in the probability density of the output layer, where the
topography of the peaks allows reconstructing the semantic relationships of the
categories. In Chapter 4, we study the problem of generalization in deep neural
networks: adding parameters to a network that interpolates its training data
will typically improve its generalization performance, at odds with the
classical bias-variance trade-off. We show that wide neural networks learn
redundant representations instead of overfitting to spurious correlation and
that redundant neurons appear only if the network is regularized and the
training error is zero.
[COMMENTS]PhD thesis
[LINK]http://arxiv.org/abs/2510.21582v2
[DATE]2025-10-28 14:37:16+08:00
[CATEGORIES]cs.LG
Taming the Tail: NoI Topology Synthesis for Mixed DL Workloads on Chiplet-Based Accelerators
[AUTHORS]Arnav Shukla, Harsh Sharma, Srikant Bharadwaj, Vinayak Abrol, Sujay Deb
[ABSTRACT]Heterogeneous chiplet-based systems improve scaling by disaggregating
CPUs/GPUs and emerging technologies (HBM/DRAM). However, this on-package
disaggregation introduces latency in the Network-on-Interposer (NoI). We observe
that in modern large-model inference, parameters and activations routinely move
back and forth from HBM/DRAM, injecting large, bursty flows into the interposer.
These memory-driven transfers inflate tail latency and violate Service Level
Agreements (SLAs) across k-ary n-cube baseline NoI topologies. To address this
gap, we introduce an Interference Score (IS) that quantifies worst-case slowdown
under contention. We then formulate NoI synthesis as a multi-objective
optimization (MOO) problem. We develop PARL (Partition-Aware Reinforcement
Learner), a topology generator that balances throughput, latency, and power.
PARL-generated topologies reduce contention at the memory
cut, meet SLAs, and cut worst-case slowdown to 1.2 times while maintaining
competitive mean throughput relative to link-rich meshes. Overall, this
reframes NoI design for heterogeneous chiplet accelerators with workload-aware
objectives.
[LINK]http://arxiv.org/abs/2510.24113v1
[DATE]2025-10-28 14:36:44+08:00
[CATEGORIES]cs.LG
Enhancing Pre-trained Representation Classifiability can Boost its Interpretability
[AUTHORS]Shufan Shen, Zhaobo Qi, Junshu Sun, Qingming Huang, Qi Tian, Shuhui Wang
[COMMENTS]ICLR 2025 (Spotlight)
[LINK]http://arxiv.org/abs/2510.24105v1
[DATE]2025-10-28 14:21:06+08:00
[CATEGORIES]cs.LG
A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning
[AUTHORS]Qingyue Zhang, Haohao Fu, Guanbo Huang, Yaoyuan Liang, Chang Chu, Tianren Peng, Yanru Wu, Qi Li, Yang Li, Shao-Lun Huang
[ABSTRACT]Multi-source transfer learning provides an effective solution to data
scarcity in real-world supervised learning scenarios by leveraging multiple
source tasks. In this field, existing works typically use all available samples
from sources in training, which constrains their training efficiency and may
lead to suboptimal results. To address this, we propose a theoretical framework
that answers the question: what is the optimal quantity of source samples
needed from each source task to jointly train the target model? Specifically,
we introduce a generalization error measure based on K-L divergence, and
minimize it based on high-dimensional statistical analysis to determine the
optimal transfer quantity for each source task. Additionally, we develop an
architecture-agnostic and data-efficient algorithm, OTQMS, to implement our
theoretical results for target model training in multi-source transfer
learning. Experimental studies on diverse architectures and two real-world
benchmark datasets show that our proposed algorithm significantly outperforms
state-of-the-art approaches in both accuracy and data efficiency. The code and
supplementary materials are available at https://github.com/zqy0126/OTQMS.
[COMMENTS]NeurIPS 2025 Poster
[LINK]http://arxiv.org/abs/2502.04242v4
[DATE]2025-10-28 14:15:39+08:00
[CATEGORIES]cs.LG
PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models
[AUTHORS]He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zhen Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong
[ABSTRACT]Post-training quantization (PTQ) of large language models (LLMs) to extremely
low bit-widths remains challenging due to the fundamental trade-off between
computational efficiency and model expressiveness. While existing ultra-low-bit
PTQ methods rely on binary approximations or complex compensation mechanisms,
they suffer from either limited representational capacity or computational
overhead that undermines their efficiency gains. We introduce PTQ to
Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes
weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit
representation. PTQTP achieves multiplication-free inference, identical to
1-bit quantization, while maintaining superior expressiveness through its novel
structured decomposition. Our approach provides: (1) a theoretically grounded
progressive approximation algorithm ensuring global weight consistency; (2)
model-agnostic deployment across diverse modern LLMs without architectural
modifications; and (3) uniform ternary operations that eliminate the need for
mixed-precision or compensation schemes. Comprehensive experiments across
LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP
significantly outperforms existing low-bit PTQ methods, achieving 82.4%
mathematical reasoning retention versus 0% for competing approaches. PTQTP
approaches and sometimes surpasses 1.58-bit quantization-aware training
performance while requiring only single-hour quantization compared to 10-14 GPU
days for training-based methods. These results establish PTQTP as a practical
solution for efficient LLM deployment in resource-constrained environments. The
code will be available at https://github.com/HeXiao-55/PTQTP.
[COMMENTS]under review
[LINK]http://arxiv.org/abs/2509.16989v2
[DATE]2025-10-28 14:14:52+08:00
[CATEGORIES]cs.LG
Learning Parameterized Skills from Demonstrations
[AUTHORS]Vedant Gupta, Haotian Fu, Calvin Luo, Yiding Jiang, George Konidaris
[ABSTRACT]We present DEPS, an end-to-end algorithm for discovering parameterized skills
from expert demonstrations. Our method learns parameterized skill policies
jointly with a meta-policy that selects the appropriate discrete skill and
continuous parameters at each timestep. Using a combination of temporal
variational inference and information-theoretic regularization methods, we
address the challenge of degeneracy common in latent variable models, ensuring
that the learned skills are temporally extended, semantically meaningful, and
adaptable. We empirically show that learning parameterized skills from
multitask expert demonstrations significantly improves generalization to unseen
tasks. Our method outperforms multitask as well as skill learning baselines on
both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers
interpretable parameterized skills, such as an object grasping skill whose
continuous arguments define the grasp location.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24095v1
[DATE]2025-10-28 14:08:25+08:00
[CATEGORIES]cs.LG
Information-Theoretic Discrete Diffusion
[AUTHORS]Moongyu Jeon, Sangwoo Shin, Dongjae Jeon, Albert No
[ABSTRACT]We present an information-theoretic framework for discrete diffusion models
that yields principled estimators of log-likelihood using score-matching
losses. Inspired by the I-MMSE identity for the Gaussian setup, we derive
analogous results for the discrete setting. Specifically, we introduce the
Information-Minimum Denoising Score Entropy (I-MDSE) relation, which links
mutual information between data and its diffused version to the minimum
denoising score entropy (DSE) loss. We extend this theory to masked diffusion
and establish the Information-Minimum Denoising Cross-Entropy (I-MDCE)
relation, connecting cross-entropy losses to mutual information in discrete
masked processes. These results provide a time-integral decomposition of the
log-likelihood of the data in terms of optimal score-based losses, showing that
commonly used losses such as DSE and DCE are not merely variational bounds but
tight and principled estimators of log-likelihood. The I-MDCE decomposition
further enables practical extensions, including a time-free formula, conditional
likelihood estimation in prompt-response tasks, and coupled Monte Carlo
estimation of likelihood ratios. Experiments on synthetic and real-world data
confirm the accuracy, variance stability, and utility of our estimators. The
code is publicly available at https://github.com/Dongjae0324/infodis.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24088v1
[DATE]2025-10-28 13:59:05+08:00
[CATEGORIES]cs.LG
Modeling Electric Vehicle Car-Following Behavior: Classical vs Machine Learning Approach
[AUTHORS]Md. Shihab Uddin, Md Nazmus Shakib, Rahul Bhadani
[ABSTRACT]The increasing adoption of electric vehicles (EVs) necessitates an
understanding of their driving behavior to enhance traffic safety and develop
smart driving systems. This study compares classical and machine learning
models for EV car-following behavior. Classical models include the Intelligent
Driver Model (IDM), Optimum Velocity Model (OVM), Optimal Velocity Relative
Velocity (OVRV), and a simplified Cooperative Adaptive Cruise Control (CACC)
model, while the machine learning
approach employs a Random Forest Regressor. Using a real-world dataset of an EV
following an internal combustion engine (ICE) vehicle under varied driving
conditions, we calibrated classical model parameters by minimizing the RMSE
between predictions and real data. The Random Forest model predicts
acceleration using spacing, speed, and gap type as inputs. Results demonstrate
the Random Forest’s superior accuracy, achieving RMSEs of 0.0046 (medium gap),
0.0016 (long gap), and 0.0025 (extra-long gap). Among physics-based models,
CACC performed best, with an RMSE of 2.67 for long gaps. These findings
highlight the machine learning model’s performance across all scenarios. Such
models are valuable for simulating EV behavior and analyzing mixed-autonomy
traffic dynamics in EV-integrated environments.
[LINK]http://arxiv.org/abs/2510.24085v1
[DATE]2025-10-28 13:54:50+08:00
[CATEGORIES]cs.LG
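The calibration step mentioned in the abstract above (fitting classical model parameters by minimizing RMSE against observed data) can be sketched for the Intelligent Driver Model, using its standard textbook form. The default parameter values, the synthetic data, and the one-dimensional grid search over the time headway `T` are all assumptions for illustration; the paper's dataset and optimizer are not reproduced here.

```python
import numpy as np

def idm_acceleration(gap, speed, lead_speed,
                     v0=30.0, T=1.5, a_max=1.0, b=1.5, s0=2.0, delta=4.0):
    """Standard IDM acceleration; all default parameter values are illustrative."""
    dv = speed - lead_speed  # approach rate (positive when closing in)
    # Desired dynamic gap s*: jam distance plus headway and braking terms.
    s_star = s0 + np.maximum(
        0.0, speed * T + speed * dv / (2.0 * np.sqrt(a_max * b)))
    return a_max * (1.0 - (speed / v0) ** delta - (s_star / gap) ** 2)

def rmse(pred, obs):
    """Root-mean-square error between predicted and observed values."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(obs)) ** 2)))

# Synthetic "observations" generated with a known headway T = 1.2 s.
rng = np.random.default_rng(1)
gap = rng.uniform(10.0, 60.0, 100)
speed = rng.uniform(5.0, 25.0, 100)
lead = speed + rng.uniform(-2.0, 2.0, 100)
observed = idm_acceleration(gap, speed, lead, T=1.2)

# Hypothetical calibration: grid search over T that minimizes RMSE.
grid = np.linspace(0.8, 2.0, 13)
errors = [rmse(idm_acceleration(gap, speed, lead, T=T), observed) for T in grid]
best_T = float(grid[int(np.argmin(errors))])
```

On real trajectories the same RMSE objective would typically be passed to a multi-parameter optimizer rather than a one-dimensional grid.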
Selecting Critical Scenarios of DER Adoption in Distribution Grids Using Bayesian Optimization
[AUTHORS]Olivier Mulkin, Miguel Heleno, Mike Ludkovski
[ABSTRACT]We develop a new methodology to select scenarios of DER adoption most
critical for distribution grids. Anticipating risks of future voltage and line
flow violations due to additional PV adopters is central for utility investment
planning but continues to rely on deterministic or ad hoc scenario selection.
We propose a highly efficient search framework based on multi-objective
Bayesian Optimization. We treat underlying grid stress metrics as
computationally expensive black-box functions, approximated via Gaussian
Process surrogates and design an acquisition function based on probability of
scenarios being Pareto-critical across a collection of line- and bus-based
violation objectives. Our approach provides a statistical guarantee and offers
an order of magnitude speed-up relative to a conservative exhaustive search.
Case studies on realistic feeders with 200-400 buses demonstrate the
effectiveness and accuracy of our approach.
[COMMENTS]12 pages, 4 tables, 12 figures
[LINK]http://arxiv.org/abs/2501.14118v2
[DATE]2025-10-28 13:52:26+08:00
[CATEGORIES]cs.LG
MathBode: Understanding LLM Reasoning with Dynamical Systems
[AUTHORS]Charles L. Wang
[ABSTRACT]This paper presents MathBode, a dynamic diagnostic for mathematical reasoning
in large language models (LLMs). Instead of one-shot accuracy, MathBode treats
each parametric problem as a system: we drive a single parameter sinusoidally
and fit first-harmonic responses of model outputs and exact solutions. This
yields interpretable, frequency-resolved metrics – gain (amplitude tracking)
and phase (lag) – that form Bode-style fingerprints. Across five closed-form
families (linear solve, ratio/saturation, compound interest, 2x2 linear
systems, similar triangles), the diagnostic surfaces systematic low-pass
behavior and growing phase lag that accuracy alone obscures. We compare several
models against a symbolic baseline that calibrates the instrument ($G \approx
1$, $\phi \approx 0$). Results separate frontier from mid-tier models on
dynamics, providing a compact, reproducible protocol that complements standard
benchmarks with actionable measurements of reasoning fidelity and consistency.
We open-source the dataset and code to enable further research and adoption.
[LINK]http://arxiv.org/abs/2509.23143v3
[DATE]2025-10-28 13:44:55+08:00
[CATEGORIES]cs.LG
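The gain and phase metrics described in the abstract above can be illustrated with a least-squares first-harmonic fit. This is a minimal sketch under stated assumptions: the toy linear problem family, the synthetic "model" response (attenuated by 0.8 and lagging by 0.3 rad), and the function names are hypothetical and do not come from the MathBode code release.

```python
import numpy as np

def first_harmonic(signal, omega, t):
    """Fit signal ~ a*sin(wt) + b*cos(wt) + c and return (amplitude, phase)."""
    design = np.column_stack(
        [np.sin(omega * t), np.cos(omega * t), np.ones_like(t)])
    (a, b, _), *_ = np.linalg.lstsq(design, signal, rcond=None)
    # a*sin(wt) + b*cos(wt) == R*sin(wt + phi), R = hypot(a, b), phi = atan2(b, a)
    return np.hypot(a, b), np.arctan2(b, a)

def gain_and_phase(model_out, exact_out, omega, t):
    """Gain = amplitude ratio; phase = model phase minus exact phase (negative = lag)."""
    amp_m, ph_m = first_harmonic(model_out, omega, t)
    amp_e, ph_e = first_harmonic(exact_out, omega, t)
    dphi = np.angle(np.exp(1j * (ph_m - ph_e)))  # wrap into (-pi, pi]
    return amp_m / amp_e, dphi

t = np.arange(64, dtype=float)
omega = 2 * np.pi * 4 / 64                 # four full cycles over the sweep
param = 5.0 + 2.0 * np.sin(omega * t)      # sinusoidally driven parameter
exact = 3.0 * param + 1.0                  # exact answers for a toy linear family
# Hypothetical model outputs: amplitude scaled by 0.8, lagging by 0.3 rad.
model = 3.0 * (5.0 + 0.8 * 2.0 * np.sin(omega * t - 0.3)) + 1.0

G, phi = gain_and_phase(model, exact, omega, t)
G_ref, phi_ref = gain_and_phase(exact, exact, omega, t)  # symbolic-baseline check
```

Feeding the exact solution through the same pipeline plays the role of the calibrating baseline, for which the fit should return a gain of 1 and a phase of 0.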
DeepRTE: Pre-trained Attention-based Neural Network for Radiative Transfer
[AUTHORS]Yekun Zhu, Min Tang, Zheng Ma
[ABSTRACT]In this paper, we propose a novel neural network approach, termed DeepRTE, to
address the steady-state Radiative Transfer Equation (RTE). The RTE is a
differential-integral equation that governs the propagation of radiation
through a participating medium, with applications spanning diverse domains such
as neutron transport, atmospheric radiative transfer, heat transfer, and
optical imaging. Our DeepRTE framework demonstrates superior computational
efficiency for solving the steady-state RTE, surpassing traditional methods and
existing neural network approaches. This efficiency is achieved by embedding
physical information through derivation of the RTE and mathematically-informed
network architecture. Concurrently, DeepRTE achieves high accuracy with
significantly fewer parameters, largely due to its incorporation of mechanisms
such as multi-head attention. Furthermore, DeepRTE is a mesh-free neural
operator framework with inherent zero-shot capability. This is achieved by
incorporating Green’s function theory and pre-training with delta-function
inflow boundary conditions into both its architecture design and training data
construction. The efficacy of the proposed approach is substantiated through
comprehensive numerical experiments.
[LINK]http://arxiv.org/abs/2505.23190v3
[DATE]2025-10-28 13:36:40+08:00
[CATEGORIES]cs.LG
Deep Learning-Enhanced Calibration of the Heston Model: A Unified Framework
[AUTHORS]Arman Zadgar, Somayeh Fallah, Farshid Mehrdoust
[ABSTRACT]The Heston stochastic volatility model is a widely used tool in financial
mathematics for pricing European options. However, its calibration remains
computationally intensive and sensitive to local minima due to the model’s
nonlinear structure and high-dimensional parameter space. This paper introduces
a hybrid deep learning-based framework that enhances both the computational
efficiency and the accuracy of the calibration procedure. The proposed approach
integrates two supervised feedforward neural networks: the Price Approximator
Network (PAN), which approximates the option price surface based on strike and
moneyness inputs, and the Calibration Correction Network (CCN), which refines
the Heston model’s output by correcting systematic pricing errors. Experimental
results on real S&P 500 option data demonstrate that the deep learning
approach outperforms traditional calibration techniques across multiple error
metrics, achieving faster convergence and superior generalization in both
in-sample and out-of-sample settings. This framework offers a practical and
robust solution for real-time financial model calibration.
[LINK]http://arxiv.org/abs/2510.24074v1
[DATE]2025-10-28 13:21:55+08:00
[CATEGORIES]cs.LG
RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases
[AUTHORS]Dongwon Choi, Sunwoo Kim, Juyeon Kim, Kyungho Kim, Geon Lee, Shinhwan Kang, Myunghwan Kim, Kijung Shin
[ABSTRACT]Recent advances have demonstrated the effectiveness of graph-based learning
on relational databases (RDBs) for predictive tasks. Such approaches require
transforming RDBs into graphs, a process we refer to as RDB-to-graph modeling,
where rows of tables are represented as nodes and foreign-key relationships as
edges. Yet, effective modeling of RDBs into graphs remains challenging.
Specifically, there exist numerous ways to model RDBs into graphs, and
performance on predictive tasks varies significantly depending on the chosen
graph model of RDBs. In our analysis, we find that the best-performing graph
model can yield up to a 10% higher performance compared to the common heuristic
rule for graph modeling, which remains non-trivial to identify. To foster
research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the
first benchmark framework for evaluating such methods. We construct extensive
datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in
around 50k graph model-performance pairs for efficient and reproducible
evaluations. Thanks to our precomputed datasets, we were able to benchmark 10
automatic RDB-to-graph modeling methods on the 12 tasks about 380x faster than
on-the-fly evaluation, which requires repeated GNN training. Our analysis of
the datasets and benchmark results reveals key structural patterns affecting
graph model effectiveness, along with practical implications for effective
graph modeling. Our datasets and code are available at
https://github.com/chlehdwon/RDB2G-Bench.
[COMMENTS]Accepted at NeurIPS 2025 Datasets and Benchmarks Track
[LINK]http://arxiv.org/abs/2506.01360v2
[DATE]2025-10-28 13:17:40+08:00
[CATEGORIES]cs.LG
Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks
[AUTHORS]Faezeh Dehghan Tarzjani, Bhaskar Krishnamachari
[ABSTRACT]The p-persistent CSMA protocol is central to random-access MAC analysis, but
predicting saturation throughput in heterogeneous multi-hop wireless networks
remains a hard problem. Simplified models that assume a single, shared
interference domain can underestimate throughput by 48-62% in sparse
topologies. Exact Markov-chain analyses are accurate but scale exponentially in
computation time, making them impractical for large networks. These
computational barriers motivate structural machine learning approaches like
GNNs for scalable throughput prediction in general network topologies. Yet
off-the-shelf GNNs struggle here: a standard GCN yields 63.94% normalized mean
absolute error (NMAE) on heterogeneous networks because symmetric normalization
conflates a node’s direct interference with higher-order, cascading effects
that pertain to how interference propagates over the network graph.
Building on these insights, we propose the Decoupled Graph Convolutional
Network (D-GCN), a novel architecture that explicitly separates processing of a
node’s own transmission probability from neighbor interference effects. D-GCN
replaces mean aggregation with learnable attention, yielding interpretable,
per-neighbor contribution weights while capturing complex multihop interference
patterns. D-GCN attains 3.3% NMAE, outperforms strong baselines, remains
tractable even when exact analytical methods become computationally infeasible,
and enables gradient-based network optimization that achieves within 1% of
theoretical optima.
[LINK]http://arxiv.org/abs/2510.14137v2
[DATE]2025-10-28 13:11:02+08:00
[CATEGORIES]cs.LG
Multimodal 3D Genome Pre-training
[AUTHORS]Minghao Yang, Pengteng Li, Yan Liang, Qianyi Cai, Zhihang Zheng, Shichen Zhang, Pengfei Zhang, Zhi-An Huang, Hui Xiong
[ABSTRACT]Deep learning techniques have driven significant progress in various
analytical tasks within 3D genomics in computational biology. However, a
holistic understanding of 3D genomics knowledge remains underexplored. Here, we
propose MIX-HIC, the first multimodal foundation model of the 3D genome that
integrates both 3D genome structure and epigenomic tracks, which obtains
unified and comprehensive semantics. For accurate heterogeneous semantic
fusion, we design the cross-modal interaction and mapping blocks for robust
unified representation, yielding the accurate aggregation of 3D genome
knowledge. Besides, we introduce the first large-scale dataset comprising over
1 million pairwise samples of Hi-C contact maps and epigenomic tracks for
high-quality pre-training, enabling the exploration of functional implications
in 3D genomics. Extensive experiments show that MIX-HIC can significantly
surpass existing state-of-the-art methods in diverse downstream tasks. This
work provides a valuable resource for advancing 3D genomics research.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2504.09060v2
[DATE]2025-10-28 13:01:44+08:00
[CATEGORIES]cs.LG
Federated Structured Sparse PCA for Anomaly Detection in IoT Networks
[AUTHORS]Chenyi Huang, Xianchao Xiu
[ABSTRACT]Although federated learning has gained prominence as a privacy-preserving
framework tailored for distributed Internet of Things (IoT) environments,
current federated principal component analysis (PCA) methods lack integration
of sparsity, a critical feature for robust anomaly detection. To address this
limitation, we propose a novel federated structured sparse PCA (FedSSP)
approach for anomaly detection in IoT networks. The proposed model uniquely
integrates double sparsity regularization: (1) row-wise sparsity governed by
$\ell_{2,p}$-norm with $p\in [0,1)$ to eliminate redundant feature dimensions,
and (2) element-wise sparsity via $\ell_{q}$-norm with $q\in [0,1)$ to suppress
noise-sensitive components. To solve this nonconvex problem in a distributed
setting, we devise an efficient optimization algorithm based on the proximal
alternating minimization (PAM). Numerical experiments validate that
incorporating structured sparsity enhances both model interpretability and
detection accuracy. Our code is available at
https://github.com/xianchaoxiu/FedSSP.
[LINK]http://arxiv.org/abs/2503.23981v3
[DATE]2025-10-28 12:55:22+08:00
[CATEGORIES]cs.LG
Forecasting Outside the Box: Application-Driven Optimal Pointwise Forecasts for Stochastic Optimization
[AUTHORS]Tito Homem-de-Mello, Juan Valencia, Felipe Lagos, Guido Lagos
[ABSTRACT]We study a class of two-stage stochastic programs, namely, those with fixed
recourse matrix and fixed costs, and linear second stage. We show that, under
mild assumptions, the problem can be solved with just one scenario, which we
call an “optimal scenario.” Such a scenario does not have to be unique and
may fall outside the support of the underlying distribution. Although finding
an optimal scenario in general might be hard, we show that the result can be
particularly useful in the case of stochastic optimization problems with
contextual information, where the goal is to optimize the expected value of a
certain function given some contextual information (e.g., previous demand,
customer type, etc.) that accompanies the main data of interest. The contextual
information allows for a better estimation of the quantity of interest via
machine learning methods. We focus on a class of learning methods – sometimes
called in the literature decision-focused learning – that integrate the
learning and optimization procedures by means of a bilevel optimization
formulation, which determines the parameters for pointwise forecasts. By using
the optimal scenario result, we prove that when such models are applied to the
class of contextual two-stage problems considered in this paper, the pointwise
forecasts computed from the bilevel optimization formulation actually yield
asymptotically the best approximation of an optimal scenario within the
modeler’s pre-specified set of parameterized forecast functions. Numerical
experiments conducted with inventory problems from the literature (with synthetic
data) as well as a bike-sharing problem with real data demonstrate that the
proposed approach performs well when compared to benchmark methods from the
literature.
[LINK]http://arxiv.org/abs/2411.03520v3
[DATE]2025-10-28 12:54:54+08:00
[CATEGORIES]cs.LG
PEARL: Peer-Enhanced Adaptive Radio via On-Device LLM
[AUTHORS]Ju-Hyung Lee, Yanqing Lu, Klaus Doppler
[ABSTRACT]We present PEARL (Peer-Enhanced Adaptive Radio via On-Device LLM), a
framework for cooperative cross-layer optimization in device-to-device (D2D)
communication. Building on our previous work on single-device on-device LLMs,
PEARL extends the paradigm by leveraging both publisher and subscriber states
to guide Wi-Fi Aware (WA) parameter selection. A context-aware reward, which
normalizes latency by application tolerances and modulates energy by device
battery states, provides richer supervision for KL-based finetuning. We study
two lightweight variants: PEARL (Head + Low-Rank Adaptation (LoRA)) achieves
the best overall performance, while PEARL-Lite (Head-only) delivers sub-20 ms
inference at near-identical objective scores. Across synthetic scenarios
grounded in real measurements, PEARL improves objective scores over heuristic
and compact model baselines and reduces energy by up to 16% in cooperative
low-battery cases. These results demonstrate that peer-aware context,
reward-aligned training, and head-based efficiency make LLMs practical for
always-on, on-device cross-layer control. Code, real-world demo, and dataset
are available at https://github.com/abman23/pearl
[LINK]http://arxiv.org/abs/2509.24085v2
[DATE]2025-10-28 12:48:14+08:00
[CATEGORIES]cs.LG
FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
[AUTHORS]Kanghyun Choi, Hyeyoon Lee, SunJong Park, Dain Kwon, Jinho Lee
[ABSTRACT]Low-bit floating-point (FP) formats, such as FP8, provide significant
acceleration and memory savings in model training thanks to native hardware
support on modern GPUs and NPUs. However, our analysis shows that FP8 quantization
offers speedup primarily for large-dimensional matrix multiplications, while
inherent quantization overheads diminish speedup when applied to low-rank
adaptation (LoRA), which uses small-dimensional matrices for efficient
fine-tuning of large language models (LLMs). To address this limitation, we
propose FALQON, a novel framework that eliminates the quantization overhead
from separate LoRA computational paths by directly merging LoRA adapters into
an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the
forward and backward computations for merged adapters to significantly reduce
quantization overhead, and introduce a row-wise proxy update mechanism that
efficiently integrates substantial updates into the quantized backbone.
Experimental evaluations demonstrate that FALQON achieves approximately a
3$\times$ training speedup over existing quantized LoRA methods with a similar
level of accuracy, providing a practical solution for efficient large-scale
model fine-tuning. Moreover, FALQON’s end-to-end FP8 workflow removes the need
for post-training quantization, facilitating efficient deployment. Code is
available at https://github.com/iamkanghyunchoi/falqon.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24061v1
[DATE]2025-10-28 12:44:49+08:00
[CATEGORIES]cs.LG
MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
[AUTHORS]Swadhin Das, Raksha Sharma
[ABSTRACT]Remote sensing images contain complex spatial patterns and semantic
structures, which makes them difficult for captioning models to describe
accurately. Encoder-decoder architectures have become a widely used approach
for remote sensing image captioning (RSIC), translating visual content into
descriptive text. However, many existing methods rely on a single-stream
architecture, which weakens the model’s ability to describe the image
accurately. Such single-stream architectures typically
struggle to extract diverse spatial features or capture complex semantic
relationships, limiting their effectiveness in scenes with high intraclass
similarity or contextual ambiguity. In this work, we propose a novel
Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance
of RSIC by optimizing both the spatial representation and language generation
of the encoder-decoder architecture. The encoder fuses information from two
complementary image encoders, thereby promoting feature diversity through the
integration of multiscale and structurally distinct cues. To improve the
capture of context-aware descriptions, we refine the input sequence’s semantic
modeling on the decoder side using a stacked GRU architecture with an
element-wise aggregation scheme. Experiments on three benchmark RSIC datasets
show that MsEdF outperforms several baseline models.
[LINK]http://arxiv.org/abs/2502.09282v4
[DATE]2025-10-28 12:40:41+08:00
[CATEGORIES]cs.LG
Copula-Stein Discrepancy: A Generator-Based Stein Operator for Archimedean Dependence
[AUTHORS]Agnideep Aich, Ashit Baran Aich
[ABSTRACT]Kernel Stein discrepancies (KSDs) have become a principal tool for
goodness-of-fit testing, but standard KSDs are often insensitive to
higher-order dependency structures, such as tail dependence, which are critical
in many scientific and financial domains. We address this gap by introducing
the Copula-Stein Discrepancy (CSD), a novel class of discrepancies tailored to
the geometry of statistical dependence. By defining a Stein operator directly
on the copula density, CSD leverages the generative structure of dependence,
rather than relying on the joint density’s score function. For the broad class
of Archimedean copulas, this approach yields a closed-form Stein kernel derived
from the scalar generator function. We provide a comprehensive theoretical
analysis, proving that CSD (i) metrizes weak convergence of copula
distributions, ensuring it detects any mismatch in dependence; (ii) has an
empirical estimator that converges at the minimax optimal rate of
$O_P(n^{-1/2})$; and (iii) is provably sensitive to differences in tail
dependence coefficients. The framework is extended to general non-Archimedean
copulas, including elliptical and vine copulas. Computationally, the exact CSD
kernel evaluation scales linearly in dimension, while a novel random feature
approximation reduces the $n$-dependence from quadratic $O(n^2)$ to near-linear
$\tilde{O}(n)$, making CSD a practical and theoretically principled tool for
dependence-aware inference.
[LINK]http://arxiv.org/abs/2510.24056v1
[DATE]2025-10-28 12:33:57+08:00
[CATEGORIES]cs.LG
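To make the "scalar generator function" in the Copula-Stein abstract above concrete, the sketch below builds one standard Archimedean copula, the Clayton copula, from its generator and checks it against the textbook closed form and the known lower tail dependence coefficient $2^{-1/\theta}$. It illustrates only the generator construction; it is not the CSD Stein operator or kernel, and the choice of Clayton and of $\theta = 2$ is an assumption made for the example.

```python
import numpy as np

def clayton_generator(t, theta):
    """Clayton generator phi(t) = (t^(-theta) - 1) / theta, for theta > 0."""
    return (t ** (-theta) - 1.0) / theta

def clayton_generator_inv(s, theta):
    """Inverse generator phi^{-1}(s) = (1 + theta * s)^(-1/theta)."""
    return (1.0 + theta * s) ** (-1.0 / theta)

def archimedean_copula(u, v, theta):
    """C(u, v) = phi^{-1}(phi(u) + phi(v)), the Archimedean construction."""
    s = clayton_generator(u, theta) + clayton_generator(v, theta)
    return clayton_generator_inv(s, theta)

def clayton_closed_form(u, v, theta):
    """Textbook closed form of the Clayton copula."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

theta = 2.0                     # illustrative dependence parameter
u, v = 0.3, 0.7
c_generator = archimedean_copula(u, v, theta)
c_closed = clayton_closed_form(u, v, theta)

# Lower tail dependence: lim_{q -> 0} C(q, q) / q = 2^(-1/theta) for Clayton.
small = 1e-6
tail_estimate = archimedean_copula(small, small, theta) / small
tail_exact = 2.0 ** (-1.0 / theta)
```

Because the whole dependence structure is encoded in the one-dimensional function `clayton_generator`, quantities such as tail dependence follow from the generator alone, which is the kind of structure the abstract says CSD exploits.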
Language-Conditioned Representations and Mixture-of-Experts Policy for Robust Multi-Task Robotic Manipulation
[AUTHORS]Xiucheng Zhang, Yang Jiang, Hongwei Qing, Jiashuo Bai
[ABSTRACT]Perceptual ambiguity and task conflict limit multitask robotic manipulation
via imitation learning. We propose a framework combining a Language-Conditioned
Visual Representation (LCVR) module and a Language-conditioned
Mixture-of-Experts Density Policy (LMoE-DP). LCVR resolves perceptual
ambiguities by grounding visual features with language instructions, enabling
differentiation between visually similar tasks. To mitigate task conflict,
LMoE-DP uses a sparse expert architecture to specialize in distinct, multimodal
action distributions, stabilized by gradient modulation. On real-robot
benchmarks, LCVR boosts Action Chunking with Transformers (ACT) and Diffusion
Policy (DP) success rates by 33.75% and 25%, respectively. The full framework
achieves a 79% average success, outperforming the advanced baseline by 21%. Our
work shows that combining semantic grounding and expert specialization enables
robust, efficient multi-task manipulation.
[COMMENTS]8 pages
[LINK]http://arxiv.org/abs/2510.24055v1
[DATE]2025-10-28 12:27:03+08:00
[CATEGORIES]cs.LG
MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
[AUTHORS]Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming Li
[ABSTRACT]Video large language models (Video-LLMs) have made significant progress in
understanding videos. However, processing multiple frames leads to lengthy
visual token sequences, which raises two challenges: the limited context
length cannot accommodate the entire video, and the inclusion of irrelevant
frames hinders visual perception. Hence, effective frame selection is crucial.
This paper emphasizes that frame selection should follow three key principles:
query relevance, list-wise diversity, and sequentiality. Existing methods, such
as uniform frame sampling and query-frame matching, do not capture all of these
principles. Thus, we propose Markov decision determinantal point process with
dynamic programming (MDP3) for frame selection, a training-free and
model-agnostic method that can be seamlessly integrated into existing
Video-LLMs. Our method first estimates frame similarities conditioned on the
query using a conditional Gaussian kernel within the reproducing kernel Hilbert
space (RKHS). We then apply the determinantal point process (DPP) to the
similarity matrix to capture both query relevance and list-wise diversity. To
incorporate sequentiality, we segment the video and apply DPP within each
segment, conditioned on the preceding segment selection, modeled as a Markov
decision process (MDP) for allocating selection sizes across segments.
Theoretically, MDP3 provides a $(1 - 1/e)$-approximate solution to the
NP-hard list-wise frame selection problem with pseudo-polynomial time
complexity, demonstrating its efficiency. Empirically, MDP3 significantly
outperforms existing methods, verifying its effectiveness and robustness.
[COMMENTS]26 pages, 14 figures
[LINK]http://arxiv.org/abs/2501.02885v2
[DATE]2025-10-28 12:18:29+08:00
[CATEGORIES]cs.LG
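The combination of query relevance and list-wise diversity described in the MDP3 abstract above can be sketched with a generic greedy DPP selection. The code below is a simplified stand-in, not the MDP3 algorithm: it builds a kernel of the common form L = diag(r) S diag(r) from a Gaussian frame similarity S and a Gaussian query relevance r, then greedily adds the frame that most increases the log-determinant. The segment-wise MDP and dynamic programming steps are omitted, and all function names, the bandwidth, and the feature shapes are assumptions.

```python
import numpy as np

def query_conditioned_kernel(frame_feats, query_feat, bandwidth=1.0):
    """L = diag(r) S diag(r): Gaussian frame similarity S scaled by query relevance r."""
    diffs = frame_feats[:, None, :] - frame_feats[None, :, :]
    similarity = np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * bandwidth ** 2))
    q_dists = np.sum((frame_feats - query_feat) ** 2, axis=-1)
    relevance = np.exp(-q_dists / (2.0 * bandwidth ** 2))
    return relevance[:, None] * similarity * relevance[None, :]

def greedy_dpp_select(kernel, k):
    """Greedily pick up to k items, each time maximizing log det of the selected submatrix."""
    selected = []
    for _ in range(k):
        best_idx, best_logdet = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # A non-positive determinant means the candidate is redundant.
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx is None:
            break
        selected.append(best_idx)
    return sorted(selected)  # return frames in temporal order

rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(12, 8))   # 12 frames, 8-dim features (toy data)
query_feat = rng.normal(size=8)
kernel = query_conditioned_kernel(frame_feats, query_feat, bandwidth=2.0)
chosen = greedy_dpp_select(kernel, k=4)
```

The determinant shrinks toward zero when two selected frames are near-duplicates, so the greedy step trades relevance against redundancy rather than simply taking the top-k most relevant frames.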
Learning from History: A Retrieval-Augmented Framework for Spatiotemporal Prediction
[AUTHORS]Hao Jia, Penghao Zhao, Hao Wu, Yuan Gao, Yangyu Tao, Bin Cui
[ABSTRACT]Accurate and long-term spatiotemporal prediction for complex physical systems
remains a fundamental challenge in scientific computing. While deep learning
models, as powerful parametric approximators, have shown remarkable success,
they suffer from a critical limitation: the accumulation of errors during
long-term autoregressive rollouts often leads to physically implausible
artifacts. This deficiency arises from their purely parametric nature, which
struggles to capture the full constraints of a system’s intrinsic dynamics. To
address this, we introduce a novel Retrieval-Augmented Prediction
(RAP) framework, a hybrid paradigm that synergizes the predictive power of
deep networks with the grounded truth of historical data. The core philosophy
of RAP is to leverage historical evolutionary exemplars as a non-parametric
estimate of the system’s local dynamics. For any given state, RAP efficiently
retrieves the most similar historical analog from a large-scale database. The
true future evolution of this analog then serves as a reference
target. Critically, this target is not a hard constraint in the loss function
but rather a powerful conditional input to a specialized dual-stream
architecture. It provides strong dynamic guidance, steering the
model’s predictions towards physically viable trajectories. In extensive
benchmarks across meteorology, turbulence, and fire simulation, RAP not only
surpasses state-of-the-art methods but also significantly outperforms a strong
analog-only forecasting baseline. More importantly, RAP generates
predictions that are more physically realistic by effectively suppressing error
divergence in long-term rollouts.
[LINK]http://arxiv.org/abs/2510.24049v1
[DATE]2025-10-28 12:09:16+08:00
[CATEGORIES]cs.LG
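The retrieval step in the RAP abstract above (find the most similar historical state and use its true future as a reference target) can be sketched as a nearest-neighbour lookup. This is a minimal illustration under assumed conventions: states are flat vectors, similarity is Euclidean distance, and the database is one trajectory held in memory. The function name and the toy trajectory are hypothetical, and how the reference is consumed by the dual-stream network is not shown.

```python
import numpy as np

def retrieve_analog_future(history, query_state, horizon):
    """Return the index of the closest historical state and the `horizon` states after it.

    history: array of shape (T, D), one state per time step.
    Only states with a full `horizon` of future steps are eligible.
    """
    candidates = history[: len(history) - horizon]
    dists = np.linalg.norm(candidates - query_state, axis=1)
    best = int(np.argmin(dists))
    reference = history[best + 1 : best + 1 + horizon]
    return best, reference

# Toy trajectory on less than one period of a circle, so states are unique.
t = np.linspace(0.0, 3.0, 40)
history = np.stack([np.sin(t), np.cos(t)], axis=1)

query_state = history[5] + 1e-4          # slightly perturbed copy of a known state
best, reference = retrieve_analog_future(history, query_state, horizon=4)
```

Here `reference` plays the role of the reference target: per the abstract it would be passed to the model as a conditional input rather than used as a hard constraint in the loss.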
Causal-Aware Generative Adversarial Networks with Reinforcement Learning
[AUTHORS]Tu Anh Hoang Nguyen, Dang Nguyen, Tri-Nhan Vo, Thuc Duy Le, Sunil Gupta
[ABSTRACT]The utility of tabular data for tasks ranging from model training to
large-scale data analysis is often constrained by privacy concerns or
regulatory hurdles. While existing data generation methods, particularly those
based on Generative Adversarial Networks (GANs), have shown promise, they
frequently struggle with capturing complex causal relationships, maintaining
data utility, and providing provable privacy guarantees suitable for enterprise
deployment. We introduce CA-GAN, a novel generative framework specifically
engineered to address these challenges for real-world tabular datasets. CA-GAN
utilizes a two-step approach: causal graph extraction to learn a robust,
comprehensive causal relationship in the data’s manifold, followed by a custom
Conditional WGAN-GP (Wasserstein GAN with Gradient Penalty) that operates
exclusively as per the structure of nodes in the causal graph. More
importantly, the generator is trained with a new Reinforcement Learning-based
objective that aligns the causal graphs constructed from real and fake data,
ensuring the causal awareness in both training and sampling phases. We
demonstrate CA-GAN’s superiority over six SOTA methods across 14 tabular
datasets. Our evaluations focus on core data engineering metrics: causal
preservation, utility preservation, and privacy preservation. Our method offers
a practical, high-performance solution for data engineers seeking to create
high-quality, privacy-compliant synthetic datasets to benchmark database
systems, accelerate software development, and facilitate secure data-driven
research.
[LINK]http://arxiv.org/abs/2510.24046v1
[DATE]2025-10-28 12:02:49+08:00
[CATEGORIES]cs.LG
MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
[AUTHORS]Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu
[ABSTRACT]Scaling up data, parameters, and test-time computation has been the
mainstream method for improving LLM systems (LLMsys), but its upper bounds are
almost reached due to the gradual depletion of high-quality data and marginal
gains obtained from larger computational resource consumption. Inspired by the
abilities of humans and traditional AI systems in learning from practice,
constructing memory and continual learning frameworks for LLMsys has become an
important and popular research direction in recent literature. Yet, existing
benchmarks for LLM memory often focus on evaluating the system on homogeneous
reading comprehension tasks with long-form inputs rather than testing their
abilities to learn from accumulated user feedback in service time. Therefore,
we propose a user feedback simulation framework and a comprehensive benchmark
covering multiple domains, languages, and types of tasks to evaluate the
continual learning abilities of LLMsys. Experiments show that the effectiveness
and efficiency of state-of-the-art baselines are far from satisfactory, and we
hope this benchmark could pave the way for future studies on LLM memory and
optimization algorithms.
[LINK]http://arxiv.org/abs/2510.17281v2
[DATE]2025-10-28 12:01:30+08:00
[CATEGORIES]cs.LG
Schrödinger bridge for generative AI: Soft-constrained formulation and convergence analysis
[AUTHORS]Jin Ma, Ying Tan, Renyuan Xu
[ABSTRACT]Generative AI can be framed as the problem of learning a model that maps
simple reference measures into complex data distributions, and it has recently
found a strong connection to the classical theory of the Schrödinger bridge
problems (SBPs) due partly to their common nature of interpolating between
prescribed marginals via entropy-regularized stochastic dynamics. However, the
classical SBP enforces hard terminal constraints, which often leads to
instability in practical implementations, especially in high-dimensional or
data-scarce regimes. To address this challenge, we follow the idea of the
so-called soft-constrained Schrödinger bridge problem (SCSBP), in which the
terminal constraint is replaced by a general penalty function. This relaxation
leads to a more flexible stochastic control formulation of McKean-Vlasov type.
We establish the existence of optimal solutions for all penalty levels and
prove that, as the penalty grows, both the controls and value functions
converge to those of the classical SBP at a linear rate. Our analysis builds on
Doob’s h-transform representations, the stability results of Schrödinger
potentials, Gamma-convergence, and a novel fixed-point argument that couples an
optimization problem over the space of measures with an auxiliary entropic
optimal transport problem. These results not only provide the first
quantitative convergence guarantees for soft-constrained bridges but also shed
light on how penalty regularization enables robust generative modeling,
fine-tuning, and transfer learning.
[COMMENTS]31 pages
[LINK]http://arxiv.org/abs/2510.11829v2
[DATE]2025-10-28 11:59:44+08:00
[CATEGORIES]cs.LG
Mitigating Negative Transfer via Reducing Environmental Disagreement
[AUTHORS]Hui Sun, Zheng Xie, Hao-Yuan He, Ming Li
[ABSTRACT]Unsupervised Domain Adaptation (UDA) focuses on transferring knowledge from a
labeled source domain to an unlabeled target domain, addressing the challenge
of domain shift. Significant domain shifts hinder effective knowledge
transfer, leading to negative transfer and deteriorating model
performance. Therefore, mitigating negative transfer is essential. This study
revisits negative transfer through the lens of causally disentangled learning,
emphasizing cross-domain discriminative disagreement on non-causal
environmental features as a critical factor. Our theoretical analysis reveals
that overreliance on non-causal environmental features as the environment
evolves can cause discriminative disagreements (termed environmental
disagreement), thereby resulting in negative transfer. To address this, we
propose Reducing Environmental Disagreement (RED), which disentangles each
sample into domain-invariant causal features and domain-specific non-causal
environmental features via adversarially training domain-specific environmental
feature extractors in the opposite domains. Subsequently, RED estimates and
reduces environmental disagreement based on domain-specific non-causal
environmental features. Experimental results confirm that RED effectively
mitigates negative transfer and achieves state-of-the-art performance.
[COMMENTS]13 pages, 5 figures
[LINK]http://arxiv.org/abs/2510.24044v1
[DATE]2025-10-28 11:56:20+08:00
[CATEGORIES]cs.LG
Riemannian-Geometric Fingerprints of Generative Models
[AUTHORS]Hae Jin Song, Laurent Itti
[ABSTRACT]Recent breakthroughs and rapid integration of generative models (GMs) have
sparked interest in the problem of model attribution and their fingerprints.
For instance, service providers need reliable methods of authenticating their
models to protect their IP, while users and law enforcement seek to verify the
source of generated content for accountability and trust. In addition, a
growing threat of model collapse is arising, as more model-generated data are
being fed back into sources (e.g., YouTube) that are often harvested for
training (“regurgitative training”), heightening the need to differentiate
synthetic from human data. Yet a gap still exists in understanding generative
models’ fingerprints, which we believe stems from the lack of a formal framework
that can define, represent, and analyze the fingerprints in a principled way.
To address this gap, we take a geometric approach and propose a new definition
of artifact and fingerprint of GMs using Riemannian geometry, which allows us
to leverage the rich theory of differential geometry. Our new definition
generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by
learning Riemannian metrics from data and replacing the Euclidean distances and
nearest-neighbor search with geodesic distances and kNN-based Riemannian center
of mass. We apply our theory to a new gradient-based algorithm for computing
the fingerprints in practice. Results show that it is more effective in
distinguishing a large array of GMs, spanning across 4 different datasets in 2
different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2
modalities (Vision, Vision-Language). Using our proposed definition
significantly improves the performance on model attribution, as well as
generalization to unseen datasets, model types, and modalities, suggesting its
practical efficacy.
[COMMENTS]ICCV 2025 Highlight paper
[LINK]http://arxiv.org/abs/2506.22802v2
[DATE]2025-10-28 11:55:35+08:00
[CATEGORIES]cs.LG
Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
[AUTHORS]Akira Tamamori
[ABSTRACT]This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection
framework that overcomes the coexisting limitations of conventional
projection-based methods: their reliance on a fixed statistical metric and
their assumption of a single data structure. Our framework uniquely synthesizes
three key concepts: (1) a generalized loss-based outlyingness measure (PLO)
that replaces the fixed metric with flexible, adaptive loss functions like our
proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear
data structures; and (3) a subsequent local clustering stage to handle
multi-modal distributions. Comprehensive 5-fold cross-validation experiments on
10 benchmark datasets, with automated hyperparameter optimization, demonstrate
that Two-Stage LKPLO achieves state-of-the-art performance. It significantly
outperforms strong baselines on datasets with challenging structures where
existing methods fail, most notably on multi-cluster data (Optdigits) and
complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study
empirically confirms that the synergistic combination of both the kernelization
and localization stages is indispensable for its superior performance. This
work contributes a powerful new tool for a significant class of outlier
detection problems and underscores the importance of hybrid, multi-stage
architectures.
[COMMENTS]10 pages, 4 figures; submitted to The IEICE Transactions on
Information and Systems
[LINK]http://arxiv.org/abs/2510.24043v1
[DATE]2025-10-28 11:53:46+08:00
[CATEGORIES]cs.LG
JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model
[AUTHORS]Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
[ABSTRACT]Large language models (LLMs) have revolutionized natural language processing
and are increasingly applied to other sequential data types, including genetic
sequences. However, adapting LLMs to genomics presents significant challenges.
Capturing complex genomic interactions requires modeling long-range
dependencies within DNA sequences, where interactions often span over 10,000
base pairs, even within a single gene, posing substantial computational burdens
under conventional model architectures and training paradigms. Moreover,
standard LLM training approaches are suboptimal for DNA: autoregressive
training, while efficient, supports only unidirectional understanding. However,
DNA is inherently bidirectional, e.g., bidirectional promoters regulate
transcription in both directions and account for nearly 11% of human gene
expression. Masked language models (MLMs) allow bidirectional understanding but
are inefficient, as only masked tokens contribute to the loss per step. To
address these limitations, we introduce JanusDNA, the first bidirectional DNA
foundation model built upon a novel pretraining paradigm that combines the
optimization efficiency of autoregressive modeling with the bidirectional
comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and
Mixture of Experts (MoE) architecture, combining long-range modeling of
Attention with efficient sequential learning of Mamba. MoE layers further scale
model capacity via sparse activation while keeping computational cost low.
Notably, JanusDNA processes up to 1 million base pairs at single nucleotide
resolution on a single 80GB GPU. Extensive experiments and ablations show
JanusDNA achieves new SOTA results on three genomic representation benchmarks,
outperforming models with 250x more activated parameters. Code:
https://github.com/Qihao-Duan/JanusDNA
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.17257v4
[DATE]2025-10-28 11:53:33+08:00
[CATEGORIES]cs.LG
Geometric Algorithms for Neural Combinatorial Optimization with Constraints
[AUTHORS]Nikolaos Karalias, Akbar Rafiey, Yifei Xu, Zhishang Luo, Behrooz Tahmasebi, Connie Jiang, Stefanie Jegelka
[ABSTRACT]Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an
emerging paradigm for solving combinatorial problems using neural networks. In
this paper, we address a central challenge of SSL for CO: solving problems with
discrete constraints. We design an end-to-end differentiable framework that
enables us to solve discrete constrained optimization problems with neural
networks. Concretely, we leverage algorithmic techniques from the literature on
convex geometry and Carathéodory’s theorem to decompose neural network
outputs into convex combinations of polytope corners that correspond to
feasible sets. This decomposition-based approach not only enables self-supervised
training but also ensures efficient quality-preserving rounding of the neural
net output into feasible solutions. Extensive experiments in
cardinality-constrained optimization show that our approach can consistently
outperform neural baselines. We further provide worked-out examples of how our
method can be applied beyond cardinality-constrained problems to a diverse set
of combinatorial optimization tasks, including finding independent sets in
graphs, and solving matroid-constrained problems.
[LINK]http://arxiv.org/abs/2510.24039v1
[DATE]2025-10-28 11:49:01+08:00
[CATEGORIES]cs.LG
Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models
[AUTHORS]Shufan Shen, Junshu Sun, Shuhui Wang, Qingming Huang
[ABSTRACT]Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision
models to downstream tasks. Among PEFT paradigms, sparse tuning achieves
remarkable performance by adjusting only the weights most relevant to
downstream tasks, rather than densely tuning the entire weight matrix. Current
methods follow a two-stage paradigm. First, they locate task-relevant weights by
gradient information, which overlooks the parameter adjustments during
fine-tuning and limits the performance. Second, they update only the located
weights by applying a sparse mask to the gradient of the weight matrix, which
results in high memory usage due to the storage of all weight matrices in the
optimizer. In this paper, we propose a one-stage method named SNELLA to
overcome the above limitations. For memory usage, SNELLA selectively updates
the weight matrix by adding it to another sparse matrix that is merged from two
low-rank learnable matrices. We extend the low-rank decomposition by
introducing nonlinear kernel functions, thereby increasing the rank of the
resulting merged matrix to prevent the interdependency among weight updates,
enabling better adaptation to downstream tasks. For locating task-relevant
weights, we propose an adaptive bi-level sparsity allocation mechanism that
encourages weights to compete across and inside layers based on their
importance scores in an end-to-end manner. Extensive experiments are conducted
on classification, segmentation, and generation tasks using different
pre-trained vision models. The results show that SNELLA achieves SOTA
performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% vs.
90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA.
Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9%
across models with parameter scales from 86M to 632M. Our source codes are
available at https://github.com/ssfgunner/SNELL.
[LINK]http://arxiv.org/abs/2510.24037v1
[DATE]2025-10-28 11:39:18+08:00
[CATEGORIES]cs.LG
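The claim in the SNELLA abstract above, that applying a nonlinear kernel to two low-rank factors raises the rank of the merged matrix, can be checked numerically. The sketch below compares an ordinary low-rank product with an entrywise Gaussian-kernel merge of the same factors. The Gaussian kernel, the matrix sizes, and the variable names are assumptions for illustration; the abstract does not say which kernels SNELLA uses, and the sparsification step is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, r = 16, 16, 2                 # illustrative sizes, rank-2 factors
B = rng.normal(size=(n_out, r))            # one r-dim vector per output row
A = rng.normal(size=(n_in, r))             # one r-dim vector per input column

# Standard low-rank merge: entry (i, j) is the inner product <B_i, A_j>.
linear_merge = B @ A.T

# Kernelized merge: entry (i, j) is k(B_i, A_j) with a Gaussian kernel (assumed).
sq_dists = np.sum((B[:, None, :] - A[None, :, :]) ** 2, axis=-1)
kernel_merge = np.exp(-0.5 * sq_dists)

rank_linear = int(np.linalg.matrix_rank(linear_merge))
rank_kernel = int(np.linalg.matrix_rank(kernel_merge))
```

Both merges are parameterized by the same `2 * 16 * r` numbers, but the inner-product merge can never exceed rank `r`, while the kernelized merge is not bound by that limit, which is the property the abstract credits with reducing interdependency among weight updates.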
Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
[AUTHORS]Haizhong Zheng, Jiawei Zhao, Beidi Chen
[ABSTRACT]Reinforcement learning has been central to recent advances in large language
model reasoning, but most algorithms rely on on-policy training that demands
fresh rollouts at every update, limiting efficiency and scalability.
Asynchronous RL systems alleviate this by decoupling rollout generation from
training, yet their effectiveness hinges on tolerating large staleness in
rollout data, a setting where existing methods either degrade in performance or
collapse. We revisit this challenge and uncover a prosperity-before-collapse
phenomenon: stale data can be as informative as on-policy data if exploited
properly. Building on this insight, we introduce M2PO (Second-Moment Trust
Policy Optimization), which constrains the second moment of importance weights
to suppress only extreme outliers while preserving informative updates.
Notably, M2PO sharply reduces the fraction of clipped tokens under high
staleness (from 1.22% to 0.06% over training), precisely masking high-variance
tokens while maintaining stable optimization. Extensive evaluation across six
models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable
off-policy training even with data stale by at least 256 model updates and
matches on-policy performance.
[LINK]http://arxiv.org/abs/2510.01161v2
[DATE]2025-10-28 11:28:48+08:00
[CATEGORIES]cs.LG
Spatio-temporal Multivariate Time Series Forecast with Chosen Variables
[AUTHORS]Zibo Liu, Zhe Jiang, Zelin Xu, Tingsong Xiao, Yupu Zhang, Zhengkun Xiao, Haibo Wang, Shigang Chen
[ABSTRACT]Spatio-Temporal Multivariate time series Forecast (STMF) uses the time series
of $n$ spatially distributed variables in a period of recent past to forecast
their values in a period of near future. It has important applications in
spatio-temporal sensing forecast such as road traffic prediction and air
pollution prediction. Recent papers have addressed a practical problem of
missing variables in the model input, which arises in the sensing applications
where the number $m$ of sensors is far less than the number $n$ of locations to
be monitored, due to budget constraints. We observe that the state of the art
assumes that the $m$ variables (i.e., locations with sensors) in the model
input are pre-determined and the important problem of how to choose the $m$
variables in the input has never been studied. This paper fills the gap by
studying a new problem of STMF with chosen variables, which optimally selects
$m$-out-of-$n$ variables for the model input in order to maximize the forecast
accuracy. We propose a unified framework that jointly performs variable
selection and model optimization for both forecast accuracy and model
efficiency. It consists of three novel technical components: (1) masked
variable-parameter pruning, which progressively prunes less informative
variables and attention parameters through quantile-based masking; (2)
prioritized variable-parameter replay, which replays low-loss past samples to
preserve learned knowledge for model stability; (3) dynamic extrapolation
mechanism, which propagates information from variables selected for the input
to all other variables via learnable spatial embeddings and adjacency
information. Experiments on five real-world datasets show that our work
significantly outperforms the state-of-the-art baselines in both accuracy and
efficiency, demonstrating the effectiveness of joint variable selection and
model optimization.
[COMMENTS]In submission
[LINK]http://arxiv.org/abs/2510.24027v1
[DATE]2025-10-28 11:19:06+08:00
[CATEGORIES]cs.LG
Efficient Global-Local Fusion Sampling for Physics-Informed Neural Networks
[AUTHORS]Jiaqi Luo, Shixin Xu, Zhouwang Yang
[ABSTRACT]The accuracy of Physics-Informed Neural Networks (PINNs) critically depends
on the placement of collocation points, as the PDE loss is approximated through
sampling over the solution domain. Global sampling ensures stability by
covering the entire domain but requires many samples and is computationally
expensive, whereas local sampling improves efficiency by focusing on
high-residual regions but may neglect well-learned areas, reducing robustness.
We propose a Global-Local Fusion (GLF) Sampling Strategy that combines the
strengths of both approaches. Specifically, new collocation points are
generated by perturbing training points with Gaussian noise scaled inversely to
the residual, thereby concentrating samples in difficult regions while
preserving exploration. To further reduce computational overhead, a lightweight
linear surrogate is introduced to approximate the global residual-based
distribution, achieving similar effectiveness at a fraction of the cost.
Together, these components, residual-adaptive sampling and residual-based
approximation, preserve the stability of global methods while retaining the
efficiency of local refinement. Extensive experiments on benchmark PDEs
demonstrate that GLF consistently improves both accuracy and efficiency
compared with global and local sampling strategies. This study provides a
practical and scalable framework for enhancing the reliability and efficiency
of PINNs in solving complex and high-dimensional PDEs.
[LINK]http://arxiv.org/abs/2510.24026v1
[DATE]2025-10-28 11:10:54+08:00
[CATEGORIES]cs.LG
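The residual-adaptive perturbation described in the GLF abstract above can be sketched as follows: each training point is jittered with Gaussian noise whose standard deviation shrinks as its PDE residual grows, so new points stay close to hard regions and spread out around easy ones. The specific scaling rule `base_sigma / (1 + relative residual)`, the unit-square domain, and the toy residual field are assumptions; the abstract does not give the exact formula, and the linear surrogate is not included.

```python
import numpy as np

def noise_scales(residuals, base_sigma=0.1):
    """Per-point noise std, inversely related to the residual magnitude (assumed rule)."""
    magnitude = np.abs(residuals)
    relative = magnitude / (np.mean(magnitude) + 1e-12)
    return base_sigma / (1.0 + relative)

def resample_collocation(points, residuals, base_sigma=0.1, lo=0.0, hi=1.0, rng=None):
    """Perturb each point with residual-scaled Gaussian noise and clip to the domain."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = noise_scales(residuals, base_sigma)
    noise = rng.normal(size=points.shape) * sigma[:, None]
    return np.clip(points + noise, lo, hi)

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(200, 2))        # current collocation points
# Toy residual field peaked at the domain centre (stand-in for a real PDE residual).
residuals = np.exp(-50.0 * np.sum((points - 0.5) ** 2, axis=1))

sigma = noise_scales(residuals)
new_points = resample_collocation(points, residuals, rng=rng)
```

In a training loop, `residuals` would be recomputed from the network's PDE loss at `points` before each resampling round.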
Unveiling Concept Attribution in Diffusion Models
[AUTHORS]Quang H. Nguyen, Hoang Phan, Khoa D. Doan
[ABSTRACT]Diffusion models have shown remarkable abilities in generating realistic and
high-quality images from text prompts. However, a trained model remains largely
black-box; little do we know about the roles of its components in exhibiting a
concept such as objects or styles. Recent works employ causal tracing to
localize knowledge-storing layers in generative models without showing how
other layers contribute to the target concept. In this work, we approach
diffusion models’ interpretability problem from a more general perspective and
pose a question: “How do model components work jointly to demonstrate
knowledge?” To answer this question, we decompose diffusion models using
component attribution, systematically unveiling the importance of each
component (specifically the model parameter) in generating a concept. The
proposed framework, called Component Attribution for
Diffusion Model (CAD), discovers the localization of concept-inducing
(positive) components, while also uncovering another type of component
that contributes negatively to generating a concept, which is missing in the
previous knowledge localization work. Based on this holistic understanding of
diffusion models, we introduce two fast, inference-time model editing
algorithms, CAD-Erase and CAD-Amplify; in particular, CAD-Erase enables erasure
and CAD-Amplify allows amplification of a generated concept by ablating the
positive and negative components, respectively, while retaining knowledge of
other concepts. Extensive experimental results validate the significance of
both positive and negative components pinpointed by our framework,
demonstrating the potential of providing a complete view of interpreting
generative models. Our code is available at
https://github.com/mail-research/CAD-attribution4diffusion.
[LINK]http://arxiv.org/abs/2412.02542v3
[DATE]2025-10-28 11:07:50+08:00
[CATEGORIES]cs.LG
NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis
[AUTHORS]Guo Tianqi Guo, Chen Liping, Peng Ciyuan, Guo Jingjing, Ren Jing
[ABSTRACT]Understanding the evolution of brain functional networks over time is of
great significance for the analysis of cognitive mechanisms and the diagnosis
of neurological diseases. Existing methods often have difficulty in capturing
the temporal evolution characteristics of connections between specific
functional communities. To this end, this paper proposes a new path-level
trajectory modeling framework (NeuroPathNet) to characterize the dynamic
behavior of connection pathways between brain functional partitions. Based on
medically supported static partitioning schemes (such as Yeo and Smith ICA), we
extract the time series of connection strengths between each pair of functional
partitions and model them using a temporal neural network. We validate the
model performance on three public functional Magnetic Resonance Imaging (fMRI)
datasets, and the results show that it outperforms existing mainstream methods
in multiple indicators. This study can promote the development of dynamic graph
learning methods for brain network analysis, and provide possible clinical
applications for the diagnosis of neurological diseases.
[LINK]http://arxiv.org/abs/2510.24025v1
[DATE]2025-10-28 11:07:06+08:00
[CATEGORIES]cs.LG
TraceTrans: Translation and Spatial Tracing for Surgical Prediction
[AUTHORS]Xiyu Luo, Haodong Li, Xinxing Cheng, He Zhao, Yang Hu, Xuan Song, Tianyang Zhang
[ABSTRACT]Image-to-image translation models have achieved notable success in converting
images across visual domains and are increasingly used for medical tasks such
as predicting post-operative outcomes and modeling disease progression.
However, most existing methods primarily aim to match the target distribution
and often neglect spatial correspondences between the source and translated
images. This limitation can lead to structural inconsistencies and
hallucinations, undermining the reliability and interpretability of the
predictions. These challenges are accentuated in clinical applications by the
stringent requirement for anatomical accuracy. In this work, we present
TraceTrans, a novel deformable image translation model designed for
post-operative prediction that generates images aligned with the target
distribution while explicitly revealing spatial correspondences with the
pre-operative input. The framework employs an encoder for feature extraction
and dual decoders for predicting spatial deformations and synthesizing the
translated image. The predicted deformation field imposes spatial constraints
on the generated output, ensuring anatomical consistency with the source.
Extensive experiments on medical cosmetology and brain MRI datasets demonstrate
that TraceTrans delivers accurate and interpretable post-operative predictions,
highlighting its potential for reliable clinical deployment.
[LINK]http://arxiv.org/abs/2510.22379v2
[DATE]2025-10-28 11:06:09+08:00
[CATEGORIES]cs.LG
CFM-GP: Unified Conditional Flow Matching to Learn Gene Perturbation Across Cell Types
[AUTHORS]Abrar Rahman Abir, Sajib Acharjee Dip, Liqing Zhang
[ABSTRACT]Understanding gene perturbation effects across diverse cellular contexts is a
central challenge in functional genomics, with important implications for
therapeutic discovery and precision medicine. Single-cell technologies enable
high-resolution measurement of transcriptional responses, but collecting such
data is costly and time-consuming, especially when repeated for each cell type.
Existing computational methods often require separate models per cell type,
limiting scalability and generalization. We present CFM-GP, a method for cell
type-agnostic gene perturbation prediction. CFM-GP learns a continuous,
time-dependent transformation between unperturbed and perturbed gene expression
distributions, conditioned on cell type, allowing a single model to predict
across all cell types. Unlike prior approaches that use discrete modeling,
CFM-GP employs a flow matching objective to capture perturbation dynamics in a
scalable manner. We evaluate on five datasets: SARS-CoV-2 infection, IFN-beta
stimulated PBMCs, glioblastoma treated with Panobinostat, lupus under IFN-beta
stimulation, and Statefate progenitor fate mapping. CFM-GP consistently
outperforms state-of-the-art baselines in R-squared and Spearman correlation,
and pathway enrichment analysis confirms recovery of key biological pathways.
These results demonstrate the robustness and biological fidelity of CFM-GP as a
scalable solution for cross-cell type gene perturbation prediction.
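
The flow matching objective mentioned above can be illustrated with a minimal sketch. It assumes the common linear interpolation path between an unperturbed sample x0 and a perturbed sample x1; the abstract does not state which path CFM-GP uses, and the names `cfm_training_pair`, `cfm_loss`, and `predict_velocity` are hypothetical.

```python
def cfm_training_pair(x0, x1, t):
    """Build one regression target for flow matching.

    Assumption: a straight-line path x_t = (1 - t) * x0 + t * x1, whose
    velocity is the constant x1 - x0. The paper may use a different path.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity


def cfm_loss(predict_velocity, x0, x1, cell_type, t):
    """Mean squared error between predicted and target velocity.

    predict_velocity is a hypothetical model: (x_t, t, cell_type) -> velocity.
    Conditioning on cell_type is what lets one model cover all cell types.
    """
    x_t, target = cfm_training_pair(x0, x1, t)
    predicted = predict_velocity(x_t, t, cell_type)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)
```

In training, t would typically be drawn uniformly from [0, 1] and the loss averaged over many (x0, x1, cell type) triples.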
[COMMENTS]28 Pages, 19 Tables, 8 Figures. The first two authors contributed
equally
[LINK]http://arxiv.org/abs/2508.08312v2
[DATE]2025-10-28 10:55:43+08:00
[CATEGORIES]cs.LG
RL-AUX: Reinforcement Learning for Auxiliary Task Generation
[AUTHORS]Judah Goldfeder, Matthew So, Hod Lipson
[ABSTRACT]Auxiliary Learning (AL) is a special case of Multi-task Learning (MTL) in
which a network trains on auxiliary tasks to improve performance on its main
task. This technique is used to improve generalization and, ultimately,
performance on the network’s main task. AL has been demonstrated to improve
performance across multiple domains, including navigation, image
classification, and natural language processing. One weakness of AL is the need
for labeled auxiliary tasks, which can require human effort and domain
expertise to generate. Meta Learning techniques have been used to solve this
issue by learning an additional auxiliary task generation network that can
create helpful tasks for the primary network. The most prominent techniques
rely on Bi-Level Optimization, which incurs computational cost and increased
code complexity. To avoid the need for Bi-Level Optimization, we present an
RL-based approach to dynamically create auxiliary tasks. In this framework, an
RL agent is tasked with selecting auxiliary labels for every data point in a
training set. The agent is rewarded when its selection improves the
performance on the primary task. We also experiment with learning optimal
strategies for weighting the auxiliary loss per data point. On the 20-Superclass
CIFAR100 problem, our RL approach outperforms human-labeled auxiliary tasks and
performs as well as a prominent Bi-Level Optimization technique. Our weight
learning approaches significantly outperform all of these benchmarks. For
example, a Weight-Aware RL-based approach helps the VGG16 architecture achieve
80.9% test accuracy while the human-labeled auxiliary task setup achieved
75.53%. The goal of this work is to (1) prove that RL is a viable approach to
dynamically generate auxiliary tasks and (2) demonstrate that per-sample
auxiliary task weights can be learned alongside the auxiliary task labels and
can achieve strong results.
[LINK]http://arxiv.org/abs/2510.22940v2
[DATE]2025-10-28 10:44:02+08:00
[CATEGORIES]cs.LG
Discovering Heuristics with Large Language Models (LLMs) for Mixed-Integer Programs: Single-Machine Scheduling
[AUTHORS]İbrahim Oğuz Çetinkaya, İ. Esra Büyüktahtakın, Parshin Shojaee, Chandan K. Reddy
[ABSTRACT]Our study contributes to the scheduling and combinatorial optimization
literature with new heuristics discovered by leveraging the power of Large
Language Models (LLMs). We focus on the single-machine total tardiness (SMTT)
problem, which aims to minimize total tardiness by sequencing n jobs on a
single processor without preemption, given processing times and due dates. We
develop and benchmark two novel LLM-discovered heuristics, the EDD Challenger
(EDDC) and MDD Challenger (MDDC), inspired by the well-known Earliest Due Date
(EDD) and Modified Due Date (MDD) rules. In contrast to prior studies that
employed simpler rule-based heuristics, we evaluate our LLM-discovered
algorithms using rigorous criteria, including optimality gaps and solution time
derived from a mixed-integer programming (MIP) formulation of SMTT. We compare
their performance against state-of-the-art heuristics and exact methods across
various job sizes (20, 100, 200, and 500 jobs). For instances with more than
100 jobs, exact methods such as MIP and dynamic programming become
computationally intractable. Up to 500 jobs, EDDC improves upon the classic EDD
rule and another widely used algorithm in the literature. MDDC consistently
outperforms traditional heuristics and remains competitive with exact
approaches, particularly on larger and more complex instances. This study shows
that human-LLM collaboration can produce scalable, high-performing heuristics
for NP-hard constrained combinatorial optimization, even under limited
resources when effectively configured.
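
For orientation, the two classical baselines named above (EDD and MDD) can be sketched as follows. This is only the textbook form of the two rules, not the LLM-discovered EDDC or MDDC heuristics, whose details the abstract does not give; function names and the tie-breaking by job index are illustrative choices.

```python
def total_tardiness(sequence, proc, due):
    """Sum of max(0, completion time - due date) over a job sequence."""
    t = 0
    total = 0
    for j in sequence:
        t += proc[j]
        total += max(0, t - due[j])
    return total


def edd(proc, due):
    """Earliest Due Date: sequence jobs by nondecreasing due date."""
    return sorted(range(len(proc)), key=lambda j: (due[j], j))


def mdd(proc, due):
    """Modified Due Date: at time t, pick the unscheduled job with the
    smallest max(due date, t + processing time)."""
    remaining = set(range(len(proc)))
    t = 0
    sequence = []
    while remaining:
        j = min(remaining, key=lambda k: (max(due[k], t + proc[k]), k))
        sequence.append(j)
        remaining.remove(j)
        t += proc[j]
    return sequence
```

On the toy instance with processing times [10, 1, 1] and due dates [5, 6, 7], EDD schedules the long job first, whereas MDD postpones it and ends with lower total tardiness.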
[LINK]http://arxiv.org/abs/2510.24013v1
[DATE]2025-10-28 10:43:04+08:00
[CATEGORIES]cs.LG
Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
[AUTHORS]Byeonghu Na, Mina Kang, Jiseok Kwak, Minsang Park, Jiwoo Shin, SeJoon Jun, Gayoung Lee, Jin-Hwa Kim, Il-Chul Moon
[ABSTRACT]Text-to-image models have recently made significant advances in generating
realistic and semantically coherent images, driven by advanced diffusion models
and large-scale web-crawled datasets. However, these datasets often contain
inappropriate or biased content, raising concerns about the generation of
harmful outputs when provided with malicious text prompts. We propose Safe Text
embedding Guidance (STG), a training-free approach to improve the safety of
diffusion models by guiding the text embeddings during sampling. STG adjusts
the text embeddings based on a safety function evaluated on the expected final
denoised image, allowing the model to generate safer outputs without additional
training. Theoretically, we show that STG aligns the underlying model
distribution with safety constraints, thereby achieving safer outputs while
minimally affecting generation quality. Experiments on various safety
scenarios, including nudity, violence, and artist-style removal, show that STG
consistently outperforms both training-based and training-free baselines in
removing unsafe content while preserving the core semantic intent of input
prompts. Our code is available at https://github.com/aailab-kaist/STG.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24012v1
[DATE]2025-10-28 10:37:20+08:00
[CATEGORIES]cs.LG
Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach
[AUTHORS]Youngjun Choi, Joonseong Kang, Sungjun Lim, Kyungwoo Song
[ABSTRACT]Data valuation has become central in the era of data-centric AI. It drives
efficient training pipelines and enables objective pricing in data markets by
assigning a numeric value to each data point. Most existing data valuation
methods estimate the effect of removing individual data points by evaluating
changes in model validation performance under in-distribution (ID) settings, as
opposed to out-of-distribution (OOD) scenarios where data follow different
patterns. Since ID and OOD data behave differently, data valuation methods
based on ID loss often fail to generalize to OOD settings, particularly when
the validation set contains no OOD data. Furthermore, although OOD-aware
methods exist, they involve heavy computational costs, which hinder practical
deployment. To address these challenges, we introduce Eigen-Value (EV),
a plug-and-play data valuation framework for OOD robustness that uses only an
ID data subset, including during validation. EV provides a new spectral
approximation of domain discrepancy, i.e., the loss gap between ID and OOD data,
using ratios of eigenvalues of the ID data’s covariance matrix. EV then
estimates the marginal contribution of each data point to this discrepancy via
perturbation theory, alleviating the computational burden. Subsequently, EV
plugs into ID loss-based methods by adding an EV term without any additional
training loop. We demonstrate that EV achieves improved OOD robustness and
stable value rankings across real-world datasets, while remaining
computationally lightweight. These results indicate that EV is practical for
large-scale settings with domain shift, offering an efficient path to
OOD-robust data valuation.
[LINK]http://arxiv.org/abs/2510.23409v2
[DATE]2025-10-28 10:35:45+08:00
[CATEGORIES]cs.LG
Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks
[AUTHORS]Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar, Jacob Adler, Steven Lu, Umaa Rebbapragada, Hannah Kerner
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.24010v1
[DATE]2025-10-28 10:34:08+08:00
[CATEGORIES]cs.LG
Learning Provably Improves the Convergence of Gradient Descent
[AUTHORS]Qingyu Song, Wei Lin, Hong Xu
[ABSTRACT]Learn to Optimize (L2O) trains deep neural network-based solvers for
optimization, achieving success in accelerating convex problems and improving
non-convex solutions. However, L2O lacks rigorous theoretical backing for its
own training convergence, as existing analyses often use unrealistic
assumptions – a gap this work highlights empirically. We bridge this gap by
proving the training convergence of L2O models that learn Gradient Descent (GD)
hyperparameters for quadratic programming, leveraging the Neural Tangent Kernel
(NTK) theory. We propose a deterministic initialization strategy to support our
theoretical results and promote stable training over extended optimization
horizons by mitigating gradient explosion. Our L2O framework demonstrates over
50% better optimality than GD and superior robustness over state-of-the-art L2O
methods on synthetic datasets. The code of our method can be found at
https://github.com/NetX-lab/MathL2OProof-Official.
[COMMENTS]48 pages, 11 figures, NeurIPS 2025
[LINK]http://arxiv.org/abs/2501.18092v5
[DATE]2025-10-28 10:24:55+08:00
[CATEGORIES]cs.LG
Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
[AUTHORS]Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
[ABSTRACT]Video-to-Audio generation has made remarkable strides in automatically
synthesizing sound for video. However, existing evaluation metrics, which focus
on semantic and temporal alignment, overlook a critical failure mode: models
often generate acoustic events, particularly speech and music, that have no
corresponding visual source. We term this phenomenon Insertion Hallucination (IH)
and identify it as a systemic risk driven by dataset biases, such as the
prevalence of off-screen sounds, that remains completely undetected by current
metrics. To address this challenge, we first develop a systematic evaluation
framework that employs a majority-voting ensemble of multiple audio event
detectors. We also introduce two novel metrics to quantify the prevalence and
severity of this issue: IH@vid (the fraction of videos with hallucinations) and
IH@dur (the fraction of hallucinated duration). Building on this, we propose
Posterior Feature Correction (PFC), a novel training-free inference-time method that
mitigates IH. PFC operates in a two-pass process: it first generates an initial
audio output to detect hallucinated segments, and then regenerates the audio
after masking the corresponding video features at those timestamps. Experiments
on several mainstream V2A benchmarks first reveal that state-of-the-art models
suffer from severe IH. In contrast, our PFC method reduces both the prevalence
and duration of hallucinations by over 50% on average, without degrading, and
in some cases even improving, conventional metrics for audio quality and
temporal synchronization. Our work is the first to formally define,
systematically measure, and effectively mitigate Insertion Hallucination,
paving the way for more reliable and faithful V2A models.
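
The two metrics can be computed along the following lines. This is a sketch under stated assumptions: detected segments are non-overlapping (start, end) pairs in seconds, and IH@dur is pooled over the whole dataset; the paper may instead average per video. The function name `ih_metrics` is hypothetical.

```python
def ih_metrics(videos):
    """Compute (IH@vid, IH@dur) for a list of videos.

    videos: list of (duration_seconds, segments), where segments is a list of
    non-overlapping (start, end) hallucinated intervals for that video.
    Assumption: IH@dur is total hallucinated time over total duration.
    """
    with_hallucination = sum(1 for _, segments in videos if segments)
    total_duration = sum(duration for duration, _ in videos)
    hallucinated = sum(end - start for _, segments in videos
                       for start, end in segments)
    ih_vid = with_hallucination / len(videos)
    ih_dur = hallucinated / total_duration
    return ih_vid, ih_dur
```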
[LINK]http://arxiv.org/abs/2510.08078v3
[DATE]2025-10-28 10:16:25+08:00
[CATEGORIES]cs.LG
CT-OT Flow: Estimating Continuous-Time Dynamics from Discrete Temporal Snapshots
[AUTHORS]Keisuke Kawano, Takuro Kutsuna, Naoki Hayashi, Yasushi Esaki, Hidenori Tanaka
[ABSTRACT]In many real-world settings (e.g., single-cell RNA sequencing, mobility
sensing, and environmental monitoring), data are observed only as temporally
aggregated snapshots collected over finite time windows, often with noisy or
uncertain timestamps, and without access to continuous trajectories. We study
the problem of estimating continuous-time dynamics from such snapshots. We
present Continuous-Time Optimal Transport Flow (CT-OT Flow), a two-stage
framework that (i) infers high-resolution time labels by aligning neighboring
intervals via partial optimal transport (POT) and (ii) reconstructs a
continuous-time data distribution through temporal kernel smoothing, from which
we sample pairs of nearby times to train standard ODE/SDE models. Our
formulation explicitly accounts for snapshot aggregation and time-label
uncertainty and uses practical accelerations (screening and mini-batch POT),
making it applicable to large datasets. Across synthetic benchmarks and two
real datasets (scRNA-seq and typhoon tracks), CT-OT Flow reduces distributional
and trajectory errors compared with OT-CFM, [SF]$^{2}$M, TrajectoryNet, MFM,
and ENOT.
[LINK]http://arxiv.org/abs/2505.17354v2
[DATE]2025-10-28 10:08:49+08:00
[CATEGORIES]cs.LG
Auto-Adaptive PINNs with Applications to Phase Transitions
[AUTHORS]Kevin Buck, Woojeong Kim
[ABSTRACT]We propose an adaptive sampling method for the training of Physics Informed
Neural Networks (PINNs) which allows for sampling based on an arbitrary
problem-specific heuristic which may depend on the network and its gradients.
In particular we focus our analysis on the Allen-Cahn equations, attempting to
accurately resolve the characteristic interfacial regions using a PINN without
any post-hoc resampling. In experiments, we show the effectiveness of these
methods over residual-adaptive frameworks.
[LINK]http://arxiv.org/abs/2510.23999v1
[DATE]2025-10-28 10:03:39+08:00
[CATEGORIES]cs.LG
Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project
[AUTHORS]Pratik Rathore, Zachary Frangella, Sachin Garg, Shaghayegh Fazliani, Michał Dereziński, Madeleine Udell
[ABSTRACT]Gaussian processes (GPs) play an essential role in biostatistics, scientific
machine learning, and Bayesian optimization for their ability to provide
probabilistic predictions and model uncertainty. However, GP inference
struggles to scale to large datasets (which are common in modern applications),
since it requires the solution of a linear system whose size scales
quadratically with the number of samples in the dataset. We propose an
approximate, distributed, accelerated sketch-and-project algorithm
(ADASAP) for solving these linear systems, which improves
scalability. We use the theory of determinantal point processes to show that
the posterior mean induced by sketch-and-project rapidly converges to the true
posterior mean. In particular, this yields the first efficient, condition
number-free algorithm for estimating the posterior mean along the top spectral
basis functions, showing that our approach is principled for GP inference.
ADASAP outperforms state-of-the-art solvers based on conjugate
gradient and coordinate descent across several benchmark datasets and a
large-scale Bayesian optimization task. Moreover, ADASAP scales to a
dataset with $> 3 \cdot 10^8$ samples, a feat which has not been accomplished
in the literature.
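
To make the underlying idea concrete, here is a minimal sketch of plain sketch-and-project in its simplest special case: single-coordinate sketches on a symmetric positive definite system, which reduces to randomized coordinate descent. It is not ADASAP; it has none of the approximation, distribution, or acceleration described above, and the function name is hypothetical.

```python
import random


def coordinate_sketch_and_project(A, b, iters=2000, seed=0):
    """Solve A w = b for symmetric positive definite A.

    Each step samples one coordinate i and updates w[i] so that equation i
    holds exactly. This is sketch-and-project with unit-vector sketches
    (randomized coordinate descent), used here only as an illustration.
    """
    rng = random.Random(seed)
    n = len(b)
    w = [0.0] * n
    for _ in range(iters):
        i = rng.randrange(n)
        residual_i = b[i] - sum(A[i][j] * w[j] for j in range(n))
        w[i] += residual_i / A[i][i]
    return w
```

In GP inference, A would play the role of the regularized kernel matrix and b the training targets, so w gives the weights of the posterior mean.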
[COMMENTS]NeurIPS 2025; 31 pages, 7 figures, 2 tables
[LINK]http://arxiv.org/abs/2505.13723v2
[DATE]2025-10-28 09:57:58+08:00
[CATEGORIES]cs.LG
Predicting Barge Tow Size on Inland Waterways Using Vessel Trajectory Derived Features: Proof of Concept
[AUTHORS]Geoffery Agorku, Sarah Hernandez, Hayley Hames, Cade Wagner
[ABSTRACT]Accurate, real-time estimation of barge quantity on inland waterways remains
a critical challenge due to the non-self-propelled nature of barges and the
limitations of existing monitoring systems. This study introduces a novel
method to use Automatic Identification System (AIS) vessel tracking data to
predict the number of barges in tow using Machine Learning (ML). To train and
test the model, barge instances were manually annotated from satellite scenes
across the Lower Mississippi River. Labeled images were matched to AIS vessel
tracks using a spatiotemporal matching procedure. A comprehensive set of 30
AIS-derived features capturing vessel geometry, dynamic movement, and
trajectory patterns were created and evaluated using Recursive Feature
Elimination (RFE) to identify the most predictive variables. Six regression
models, including ensemble, kernel-based, and generalized linear approaches,
were trained and evaluated. The Poisson Regressor model yielded the best
performance, achieving a Mean Absolute Error (MAE) of 1.92 barges using 12 of
the 30 features. The feature importance analysis revealed that metrics
capturing vessel maneuverability such as course entropy, speed variability and
trip length were most predictive of barge count. The proposed approach provides
a scalable, readily implementable method for enhancing Maritime Domain
Awareness (MDA), with strong potential applications in lock scheduling, port
management, and freight planning. Future work will expand the proof of concept
presented here to explore model transferability to other inland rivers with
differing operational and environmental conditions.
[LINK]http://arxiv.org/abs/2510.23994v1
[DATE]2025-10-28 09:51:23+08:00
[CATEGORIES]cs.LG
Optimal Arm Elimination Algorithms for Combinatorial Bandits
[AUTHORS]Yuxiao Wen, Yanjun Han, Zhengyuan Zhou
[ABSTRACT]Combinatorial bandits extend the classical bandit framework to settings where
the learner selects multiple arms in each round, motivated by applications such
as online recommendation and assortment optimization. While extensions of upper
confidence bound (UCB) algorithms arise naturally in this context, adapting arm
elimination methods has proved more challenging. We introduce a novel
elimination scheme that partitions arms into three categories (confirmed,
active, and eliminated), and incorporates explicit exploration to update these
sets. We demonstrate the efficacy of our algorithm in two settings: the
combinatorial multi-armed bandit with general graph feedback, and the
combinatorial linear contextual bandit. In both cases, our approach achieves
near-optimal regret, whereas UCB-based methods can provably fail due to
insufficient explicit exploration. Matching lower bounds are also provided.
[LINK]http://arxiv.org/abs/2510.23992v1
[DATE]2025-10-28 09:50:24+08:00
[CATEGORIES]cs.LG
STNet: Spectral Transformation Network for Solving Operator Eigenvalue Problem
[AUTHORS]Hong Wang, Jiang Yixuan, Jie Wang, Xinyi Li, Jian Luo, Huanshuo Dong
[ABSTRACT]Operator eigenvalue problems play a critical role in various scientific
fields and engineering applications, yet numerical methods are hindered by the
curse of dimensionality. Recent deep learning methods provide an efficient
approach to address this challenge by iteratively updating neural networks.
These methods’ performance relies heavily on the spectral distribution of the
given operator: larger gaps between the operator’s eigenvalues will improve
precision, thus tailored spectral transformations that leverage the spectral
distribution can enhance their performance. Based on this observation, we
propose the Spectral Transformation Network (STNet). During each iteration,
STNet uses approximate eigenvalues and eigenfunctions to perform spectral
transformations on the original operator, turning it into an equivalent but
easier problem. Specifically, we employ deflation projection to exclude the
subspace corresponding to already solved eigenfunctions, thereby reducing the
search space and avoiding converging to existing eigenfunctions. Additionally,
our filter transform magnifies eigenvalues in the desired region and suppresses
those outside, further improving performance. Extensive experiments demonstrate
that STNet consistently outperforms existing learning-based methods, achieving
state-of-the-art performance in accuracy.
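
The deflation idea can be illustrated on a small symmetric matrix, used here as a stand-in for the operator. The sketch below replaces the neural solver with plain power iteration and shows only how projecting out already-found eigenvectors keeps the search from converging to them again; it assumes a symmetric positive semidefinite matrix, and all names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def project_out(v, basis):
    """Deflation projection: remove components along known eigenvectors."""
    for u in basis:
        c = dot(v, u)
        v = [x - c * y for x, y in zip(v, u)]
    return v


def top_eigenpairs(A, k, iters=500):
    """Find the k largest eigenpairs, deflating the solved subspace each time."""
    n = len(A)
    values, found = [], []
    for idx in range(k):
        # Deterministic, non-degenerate starting vector (illustrative choice).
        v = normalize(project_out([1.0 + 0.1 * (i + idx) for i in range(n)], found))
        for _ in range(iters):
            v = normalize(project_out(matvec(A, v), found))
        values.append(dot(v, matvec(A, v)))  # Rayleigh quotient
        found.append(v)
    return values, found
```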
[LINK]http://arxiv.org/abs/2510.23986v1
[DATE]2025-10-28 09:43:54+08:00
[CATEGORIES]cs.LG
High-Energy Concentration for Federated Learning in Frequency Domain
[AUTHORS]Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Yunsong Li, Leyuan Fang
[ABSTRACT]Federated Learning (FL) presents significant potential for collaborative
optimization without data sharing. Since synthetic data is sent to the server,
leveraging the popular concept of dataset distillation, this FL framework
protects real data privacy while alleviating data heterogeneity. However, such
methods are still challenged by the redundant information and noise in entire
spatial-domain designs, which inevitably increases the communication burden. In
this paper, we propose a novel Frequency-Domain aware FL method with
high-energy concentration (FedFD) to address this problem. Our FedFD is
inspired by the discovery that the discrete cosine transform predominantly
distributes energy to specific regions, referred to as high-energy
concentration. The principle behind FedFD is that low-energy components, such as
high-frequency ones, usually contain redundant information and noise, so
filtering them out helps reduce communication costs and optimize performance. Our
FedFD is mathematically formulated to preserve the low-frequency components
using a binary mask, facilitating an optimal solution through frequency-domain
distribution alignment. In particular, real data-driven synthetic
classification is imposed into the loss to enhance the quality of the
low-frequency components. On five image and speech datasets, FedFD achieves
better performance than state-of-the-art methods while reducing communication
costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha
= 0.01$, FedFD achieves a minimum reduction of 37.78% in the communication
cost, while attaining a 10.88% performance gain.
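
The low-frequency masking step can be sketched as follows. For brevity this uses a one-dimensional orthonormal DCT-II, whereas image data would call for a two-dimensional transform; the mask shape and the `keep` parameter are hypothetical, not the paper's settings.

```python
import math


def _scale(k, n):
    # Orthonormal DCT-II scaling factor.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x):
    """Orthonormal DCT-II of a 1D signal."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]


def compress(x, keep):
    """Apply a binary mask that keeps only the first `keep` (lowest-frequency)
    DCT coefficients; only those would need to be communicated."""
    mask = [1 if k < keep else 0 for k in range(len(x))]
    return [c * m for c, m in zip(dct(x), mask)]
```

For the smooth signal [1, 2, 3, 4], the first coefficient alone carries 25 of the 30 units of energy, which is the kind of concentration the abstract refers to.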
[LINK]http://arxiv.org/abs/2509.12630v2
[DATE]2025-10-28 09:41:54+08:00
[CATEGORIES]cs.LG
CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
[AUTHORS]Disheng Liu, Yiran Qiao, Wuche Liu, Yiren Lu, Yunlai Zhou, Tuo Liang, Yu Yin, Jing Ma
[ABSTRACT]True intelligence hinges on the ability to uncover and leverage hidden causal
relations. Despite significant progress in AI and computer vision (CV), there
remains a lack of benchmarks for assessing models’ abilities to infer latent
causality from complex visual data. In this paper, we introduce
Causal3D, a novel and comprehensive benchmark that integrates
structured data (tables) with corresponding visual representations (images) to
evaluate causal reasoning. Designed within a systematic framework, Causal3D
comprises 19 3D-scene datasets capturing diverse causal relations, views, and
backgrounds, enabling evaluations across scenes of varying complexity. We
assess multiple state-of-the-art methods, including classical causal discovery,
causal representation learning, and large/vision-language models (LLMs/VLMs).
Our experiments show that as causal structures grow more complex without prior
knowledge, performance declines significantly, highlighting the challenges even
advanced methods face in complex causal scenarios. Causal3D serves as a vital
resource for advancing causal reasoning in CV and fostering trustworthy AI in
critical domains.
[LINK]http://arxiv.org/abs/2503.04852v2
[DATE]2025-10-28 09:41:35+08:00
[CATEGORIES]cs.LG
Score-based constrained generative modeling via Langevin diffusions with boundary conditions
[AUTHORS]Adam Nordenhög, Akash Sharma
[ABSTRACT]Score-based generative models based on stochastic differential equations
(SDEs) achieve impressive performance in sampling from unknown distributions,
but often fail to satisfy underlying constraints. We propose a constrained
generative model using kinetic (underdamped) Langevin dynamics with specular
reflection of velocity on the boundary defining the constraints. This results in
piecewise continuously differentiable noising and denoising processes, where the
latter is characterized by time-reversed dynamics restricted to a domain with
boundary due to the specular boundary condition. In addition, we contribute to
existing constrained generative models based on reflected SDEs, where the
stochastic dynamics are restricted through an abstract local time term. By
presenting efficient numerical samplers that converge at the optimal rate in
terms of the discretization step, we provide a comprehensive comparison of models
based on confined (specularly reflected kinetic) Langevin diffusion with models
based on reflected diffusion with local time.
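
Specular reflection of the velocity at the boundary can be illustrated with a one-dimensional toy on the interval [0, 1]. The sketch uses a basic Euler-Maruyama step with unit mass and temperature; it is not one of the paper's samplers and makes no claim about their convergence rate. The names `reflect` and `kinetic_langevin_step` are hypothetical.

```python
import math
import random


def reflect(x, v, lo=0.0, hi=1.0):
    """Mirror the position back into [lo, hi] and flip the velocity sign
    on every boundary crossing (specular reflection in 1D)."""
    while x < lo or x > hi:
        x = 2 * lo - x if x < lo else 2 * hi - x
        v = -v
    return x, v


def kinetic_langevin_step(x, v, grad_u, dt, gamma, rng):
    """One Euler-Maruyama step of dX = V dt,
    dV = -(gamma * V + grad U(X)) dt + sqrt(2 * gamma) dW,
    followed by reflection at the boundary."""
    noise = math.sqrt(2.0 * gamma * dt) * rng.gauss(0.0, 1.0)
    v = v - dt * (gamma * v + grad_u(x)) + noise
    x = x + dt * v
    return reflect(x, v)
```

Calling `kinetic_langevin_step` repeatedly with `rng = random.Random(0)` and a zero potential gradient produces a trajectory that never leaves the unit interval.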
[LINK]http://arxiv.org/abs/2510.23985v1
[DATE]2025-10-28 09:36:54+08:00
[CATEGORIES]cs.LG
HyperGraphX: Graph Transductive Learning with Hyperdimensional Computing and Message Passing
[AUTHORS]Guojing Cong, Tom Potok, Hamed Poursiami, Maryam Parsa
[ABSTRACT]We present a novel algorithm, \hdgc, that marries graph convolution with
binding and bundling operations in hyperdimensional computing for transductive
graph learning. For prediction accuracy, \hdgc outperforms major and popular
graph neural network implementations as well as state-of-the-art
hyperdimensional computing implementations for a collection of homophilic
graphs and heterophilic graphs. Compared with the most accurate learning
methodologies we have tested, on the same target GPU platform, \hdgc is on
average 9561.0 and 144.5 times faster than \gcnii, a graph neural network
implementation, and HDGL, a hyperdimensional computing implementation,
respectively. As the majority of the learning operates on binary vectors, we
expect outstanding energy performance of \hdgc on neuromorphic and emerging
process-in-memory devices.
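
For readers unfamiliar with hyperdimensional computing, the binding and bundling operations can be sketched as below. The sketch uses bipolar {-1, +1} vectors, where binding is element-wise multiplication (the counterpart of XOR on binary vectors) and bundling is an element-wise majority vote. It shows only these two generic operations, not the algorithm presented above, and the helper names are hypothetical.

```python
import random


def random_hypervector(dim, rng):
    """Draw a random bipolar hypervector."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Binding: element-wise product; it is its own inverse."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundling: element-wise majority vote (ties resolved to +1)."""
    sums = [sum(column) for column in zip(*vectors)]
    return [1 if s >= 0 else -1 for s in sums]


def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / len(a)
```

A bundled vector stays similar to each of its inputs while remaining nearly orthogonal to unrelated random vectors, which is what makes bundling usable as a neighborhood aggregation step.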
[LINK]http://arxiv.org/abs/2510.23980v1
[DATE]2025-10-28 09:21:54+08:00
[CATEGORIES]cs.LG
Synergistic Neural Forecasting of Air Pollution with Stochastic Sampling
[AUTHORS]Yohan Abeysinghe, Muhammad Akhtar Munir, Sanoojan Baliah, Ron Sarafian, Fahad Shahbaz Khan, Yinon Rudich, Salman Khan
[ABSTRACT]Air pollution remains a leading global health and environmental risk,
particularly in regions vulnerable to episodic air pollution spikes due to
wildfires, urban haze and dust storms. Accurate forecasting of particulate
matter (PM) concentrations is essential to enable timely public health warnings
and interventions, yet existing models often underestimate rare but hazardous
pollution events. Here, we present SynCast, a high-resolution neural
forecasting model that integrates meteorological and air composition data to
improve predictions of both average and extreme pollution levels. Built on a
regionally adapted transformer backbone and enhanced with a diffusion-based
stochastic refinement module, SynCast captures the nonlinear dynamics driving
PM spikes more accurately than existing approaches. Leveraging harmonized
ERA5 and CAMS datasets, our model shows substantial gains in forecasting
fidelity across multiple PM variables (PM$_{1}$, PM$_{2.5}$, PM$_{10}$),
especially under extreme conditions. We demonstrate that conventional loss
functions underrepresent distributional tails (rare pollution events) and show
that SynCast, guided by domain-aware objectives and extreme value theory,
significantly enhances performance in highly impacted regions without
compromising global accuracy. This approach provides a scalable foundation for
next-generation air quality early warning systems and supports climate-health
risk mitigation in vulnerable regions.
[LINK]http://arxiv.org/abs/2510.23977v1
[DATE]2025-10-28 09:18:00+08:00
[CATEGORIES]cs.LG
Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models
[AUTHORS]Byeonghu Na, Minsang Park, Gyuwon Sim, Donghyeok Shin, HeeSun Bae, Mina Kang, Se Jung Kwon, Wanmo Kang, Il-Chul Moon
[ABSTRACT]Text-to-image diffusion models rely on text embeddings from a pre-trained
text encoder, but these embeddings remain fixed across all diffusion timesteps,
limiting their adaptability to the generative process. We propose Diffusion
Adaptive Text Embedding (DATE), which dynamically updates text embeddings at
each diffusion timestep based on intermediate perturbed data. We formulate an
optimization problem and derive an update rule that refines the text embeddings
at each sampling step to improve alignment and preference between the mean
predicted image and the text. This allows DATE to dynamically adapt the text
conditions to the reverse-diffused images throughout diffusion sampling without
requiring additional model training. Through theoretical analysis and empirical
results, we show that DATE maintains the generative capability of the model
while providing superior text-image alignment over fixed text embeddings across
various tasks, including multi-concept generation and text-guided image
editing. Our code is available at https://github.com/aailab-kaist/DATE.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23974v1
[DATE]2025-10-28 09:10:15+08:00
[CATEGORIES]cs.LG
An efficient probabilistic hardware architecture for diffusion-like models
[AUTHORS]Andraž Jelinčič, Owen Lockwood, Akhil Garlapati, Guillaume Verdon, Trevor McCourt
[ABSTRACT]The proliferation of probabilistic AI has promoted proposals for specialized
stochastic computers. Despite promising efficiency gains, these proposals have
failed to gain traction because they rely on fundamentally limited modeling
techniques and exotic, unscalable hardware. In this work, we address these
shortcomings by proposing an all-transistor probabilistic computer that
implements powerful denoising models at the hardware level. A system-level
analysis indicates that devices based on our architecture could achieve
performance parity with GPUs on a simple image benchmark using approximately
10,000 times less energy.
[COMMENTS]9 pages, 6 figures
[LINK]http://arxiv.org/abs/2510.23972v1
[DATE]2025-10-28 09:09:19+08:00
[CATEGORIES]cs.LG
A Pragmatic Way to Measure Chain-of-Thought Monitorability
[AUTHORS]Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah
[ABSTRACT]While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI
safety, this opportunity could be lost through shifts in training practices or
model architecture. To help preserve monitorability, we propose a pragmatic way
to measure two components of it: legibility (whether the reasoning can be
followed by a human) and coverage (whether the CoT contains all the reasoning
needed for a human to also produce the final output). We implement these
metrics with an autorater prompt that enables any capable LLM to compute the
legibility and coverage of existing CoTs. After sanity-checking our prompted
autorater with synthetic CoT degradations, we apply it to several frontier
models on challenging benchmarks, finding that they exhibit high
monitorability. We present these metrics, including our complete autorater
prompt, as a tool for developers to track how design decisions impact
monitorability. While the exact prompt we share is still a preliminary version
under ongoing development, we are sharing it now in the hopes that others in
the community will find it useful. Our method helps measure the default
monitorability of CoT - it should be seen as a complement, not a replacement,
for the adversarial stress-testing needed to test robustness against
deliberately evasive models.
[COMMENTS]The first two authors contributed equally
[LINK]http://arxiv.org/abs/2510.23966v1
[DATE]2025-10-28 08:44:25+08:00
[CATEGORIES]cs.LG
Seeding neural network quantum states with tensor network states
[AUTHORS]Ryui Kaneko, Shimpei Goto
[ABSTRACT]We find an efficient approach to approximately convert matrix product states
(MPSs) into restricted Boltzmann machine wave functions consisting of a
multinomial hidden unit through a canonical polyadic (CP) decomposition of the
MPSs. This method allows us to generate well-behaved initial neural network
quantum states for quantum many-body ground-state calculations in time
polynomial in the number of variational parameters and systematically shorten the
distance between the initial states and the ground states while increasing the
rank of the CP decomposition. We demonstrate the efficiency of our method by
taking the transverse-field Ising model as an example and discuss possible
applications of our method to more general quantum many-body systems in which
the ground-state wave functions possess complex nodal structures.
[COMMENTS]15 pages, 15 figures, All codes and data used in this manuscript are
available at https://github.com/ryuikaneko/mps2rbm
[LINK]http://arxiv.org/abs/2506.23550v3
[DATE]2025-10-28 08:43:01+08:00
[CATEGORIES]cs.LG
The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
[AUTHORS]Aymane El Gadarri, Ali Aouad, Vivek F. Farias
[ABSTRACT]Traditional LLM alignment methods are vulnerable to heterogeneity in human
preferences. Fitting a naive probabilistic model to pairwise comparison data
(say over prompt-completion pairs) yields an inconsistent estimate of the
population-average utility, a canonical measure of social welfare. We propose a
new method, dubbed the sign estimator, that provides a simple, provably
consistent, and efficient estimator by replacing cross-entropy with binary
classification loss in the aggregation step. This simple modification recovers
consistent ordinal alignment under mild assumptions and achieves the first
polynomial finite-sample error bounds in this setting. In realistic simulations
of LLM alignment using digital twins, the sign estimator substantially reduces
preference distortion over a panel of simulated personas, cutting (angular)
estimation error by nearly 35% and decreasing disagreement with true population
preferences from 12% to 8% compared to standard RLHF. Our method also compares
favorably to panel data heuristics that explicitly model user heterogeneity and
require tracking individual-level preference data, all while maintaining the
implementation simplicity of existing LLM alignment pipelines.
[LINK]http://arxiv.org/abs/2510.23965v1
[DATE]2025-10-28 08:42:38+08:00
[CATEGORIES]cs.LG
Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
[AUTHORS]Ruaridh Mon-Williams, Max Taylor-Davies, Elizabeth Mieczkowski, Natalia Velez, Neil R. Bramley, Yanwei Wang, Thomas L. Griffiths, Christopher G. Lucas
[ABSTRACT]Humans are remarkably adept at collaboration, able to infer the strengths and
weaknesses of new partners in order to work successfully towards shared goals.
To build AI systems with this capability, we must first understand its building
blocks: does such flexibility require explicit, dedicated mechanisms for
modelling others – or can it emerge spontaneously from the pressures of
open-ended cooperative interaction? To investigate this question, we train
simple model-free RNN agents to collaborate with a population of diverse
partners. Using the ‘Overcooked-AI’ environment, we collect data from thousands
of collaborative teams, and analyse agents’ internal hidden states. Despite a
lack of additional architectural features, inductive biases, or auxiliary
objectives, the agents nevertheless develop structured internal representations
of their partners’ task abilities, enabling rapid adaptation and generalisation
to novel collaborators. We investigate these internal models through probing
techniques and large-scale behavioural analysis. Notably, we find that
structured partner modelling emerges when agents can influence partner
behaviour by controlling task allocation. Our results show that partner
modelling can arise spontaneously in model-free agents – but only under
environmental conditions that impose the right kind of social pressure.
[LINK]http://arxiv.org/abs/2505.17323v2
[DATE]2025-10-28 08:28:59+08:00
[CATEGORIES]cs.LG
ChessQA: Evaluating Large Language Models for Chess Understanding
[AUTHORS]Qianfeng Wen, Zhenwei Tang, Ashton Anderson
[ABSTRACT]Chess provides an ideal testbed for evaluating the reasoning, modeling, and
abstraction capabilities of large language models (LLMs), as it has
well-defined structure and objective ground truth while admitting a wide
spectrum of skill levels. However, existing evaluations of LLM ability in chess
are ad hoc and narrow in scope, making it difficult to accurately measure LLM
chess understanding and how it varies with scale, post-training methodologies,
or architecture choices. We present ChessQA, a comprehensive benchmark that
assesses LLM chess understanding across five task categories (Structural,
Motifs, Short Tactics, Position Judgment, and Semantic), which approximately
correspond to the ascending abstractions that players master as they accumulate
chess knowledge, from understanding basic rules and learning tactical motifs to
correctly calculating tactics, evaluating positions, and semantically
describing high-level concepts. In this way, ChessQA captures a more
comprehensive picture of chess ability and understanding, going significantly
beyond the simple move quality evaluations done previously, and offers a
controlled, consistent setting for diagnosis and comparison. Furthermore,
ChessQA is inherently dynamic, with prompts, answer keys, and construction
scripts that can evolve as models improve. Evaluating a range of contemporary
LLMs, we find persistent weaknesses across all five categories and provide
results and error analyses by category. We will release the code, periodically
refreshed datasets, and a public leaderboard to support further research.
[COMMENTS]33 pages, 8 figures
[LINK]http://arxiv.org/abs/2510.23948v1
[DATE]2025-10-28 08:02:52+08:00
[CATEGORIES]cs.LG
Modeling Biological Multifunctionality with Echo State Networks
[AUTHORS]Anastasia-Maria Leventi-Peetz, Jörg-Volker Peetz, Kai Weber, Nikolaos Zacharis
[ABSTRACT]In this work, a three-dimensional multicomponent reaction-diffusion model has
been developed, combining excitable-system dynamics with diffusion processes
and sharing conceptual features with the FitzHugh-Nagumo model. Designed to
capture the spatiotemporal behavior of biological systems, particularly
electrophysiological processes, the model was solved numerically to generate
time-series data. These data were subsequently used to train and evaluate an
Echo State Network (ESN), which successfully reproduced the system’s dynamic
behavior. The results demonstrate that simulating biological dynamics using
data-driven, multifunctional ESN models is both feasible and effective.
[COMMENTS]26 pages, 17 figures, 6 tables, 23 references
[LINK]http://arxiv.org/abs/2510.23940v1
[DATE]2025-10-28 07:47:51+08:00
[CATEGORIES]cs.LG
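As an illustration of the Echo State Network idea named in the abstract above, here is a minimal sketch, not the authors' model: a fixed random reservoir driven by a scalar input, with only a linear readout fitted by ridge regression. The function name `train_esn`, the reservoir size, the spectral radius of 0.9, and the sine-wave training signal are all assumptions chosen for illustration; the paper trains on data from a 3D reaction-diffusion model instead.

```python
import numpy as np

def train_esn(u, y, n_res=100, rho=0.9, ridge=1e-4, seed=0):
    """Fit a ridge-regression readout on top of a fixed random reservoir."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=n_res)          # input weights (fixed)
    w = rng.uniform(-0.5, 0.5, size=(n_res, n_res))    # recurrent weights (fixed)
    # Rescale so the largest eigenvalue magnitude (spectral radius) equals rho.
    w *= rho / np.max(np.abs(np.linalg.eigvals(w)))
    states = np.zeros((len(u), n_res))
    x = np.zeros(n_res)
    for t, u_t in enumerate(u):
        x = np.tanh(w @ x + w_in * u_t)                 # reservoir update
        states[t] = x
    # Closed-form ridge solution for the linear readout weights.
    a = states.T @ states + ridge * np.eye(n_res)
    w_out = np.linalg.solve(a, states.T @ y)
    return w, w_out, states

# Toy task (an assumption): one-step-ahead prediction of a sine wave.
t = np.linspace(0, 20 * np.pi, 2000)
signal = np.sin(t)
w, w_out, states = train_esn(signal[:-1], signal[1:])
train_mse = float(np.mean((states @ w_out - signal[1:]) ** 2))
```

Only `w_out` is learned; the reservoir weights stay fixed after the rescaling step, which is what keeps ESN training cheap compared with backpropagation through time.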
A data free neural operator enabling fast inference of 2D and 3D Navier Stokes equations
[AUTHORS]Junho Choi, Teng-Yuan Chang, Namjung Kim, Youngjoon Hong
[ABSTRACT]Ensemble simulations of high-dimensional flow models (e.g., Navier Stokes
type PDEs) are computationally prohibitive for real time applications. Neural
operators enable fast inference but are limited by costly data requirements and
poor generalization to 3D flows. We present a data-free operator network for
the Navier Stokes equations that eliminates the need for paired solution data
and enables robust, real time inference for large ensemble forecasting. The
physics-grounded architecture takes initial and boundary conditions as well as
forcing functions, yielding solutions robust to high variability and
perturbations. Across 2D benchmarks and 3D test cases, the method surpasses
prior neural operators in accuracy and, for ensembles, achieves greater
efficiency than conventional numerical solvers. Notably, it delivers accurate
solutions of the three dimensional Navier Stokes equations, a regime not
previously demonstrated for data free neural operators. By uniting a
numerically grounded architecture with the scalability of machine learning,
this approach establishes a practical pathway toward data free, high fidelity
PDE surrogates for end to end scientific simulation and prediction.
[LINK]http://arxiv.org/abs/2510.23936v1
[DATE]2025-10-28 07:41:42+08:00
[CATEGORIES]cs.LG
Differential Privacy: Gradient Leakage Attacks in Federated Learning Environments
[AUTHORS]Miguel Fernandez-de-Retana, Unai Zulaika, Rubén Sánchez-Corcuera, Aitor Almeida
[ABSTRACT]Federated Learning (FL) allows for the training of Machine Learning models in
a collaborative manner without the need to share sensitive data. However, it
remains vulnerable to Gradient Leakage Attacks (GLAs), which can reveal private
information from the shared model updates. In this work, we investigate the
effectiveness of Differential Privacy (DP) mechanisms - specifically, DP-SGD
and a variant based on explicit regularization (PDP-SGD) - as defenses against
GLAs. To this end, we evaluate the performance of several computer vision
models trained under varying privacy levels on a simple classification task,
and then analyze the quality of private data reconstructions obtained from the
intercepted gradients in a simulated FL environment. Our results demonstrate
that DP-SGD significantly mitigates the risk of gradient leakage attacks,
albeit with a moderate trade-off in model utility. In contrast, PDP-SGD
maintains strong classification performance but proves ineffective as a
practical defense against reconstruction attacks. These findings highlight the
importance of empirically evaluating privacy mechanisms beyond their
theoretical guarantees, particularly in distributed learning scenarios where
information leakage may represent an unacceptable critical threat to data
security and privacy.
[COMMENTS]17 pages, 12 figures
[LINK]http://arxiv.org/abs/2510.23931v1
[DATE]2025-10-28 07:33:21+08:00
[CATEGORIES]cs.LG
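For readers unfamiliar with the DP-SGD defense evaluated in the abstract above, the following is a minimal sketch of one standard DP-SGD update: clip each per-example gradient, sum, add Gaussian noise, then average. The function name `dp_sgd_step` and all hyperparameter values are assumptions for illustration; the PDP-SGD variant is not sketched because the abstract does not specify it.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose L2 norm exceeds clip_norm; leave others as-is.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=params.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(per_example_grads)
    return params - lr * noisy_mean, clipped

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4)) * 5.0   # toy per-example gradients (assumed data)
new_params, clipped = dp_sgd_step(
    np.zeros(4), grads, lr=0.1, clip_norm=1.0, noise_multiplier=1.1, rng=rng
)
```

Clipping bounds how much any single example can move the update, and the added noise masks what remains; this is the mechanism that makes gradient-based reconstruction harder.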
REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
[AUTHORS]Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh
[ABSTRACT]While model serving has unlocked unprecedented capabilities, the high cost of
serving large-scale models continues to be a significant barrier to widespread
accessibility and rapid innovation. Compiler optimizations have long driven
substantial performance improvements, but existing compilers struggle with
neural workloads due to the exponentially large and highly interdependent space
of possible transformations. Although existing stochastic search techniques can
be effective, they are often sample-inefficient and fail to leverage the
structural context underlying compilation decisions. We set out to investigate
the research question of whether reasoning with large language models (LLMs),
without any retraining, can leverage the context-aware decision space of
compiler optimizations to significantly improve sample efficiency. To that end,
we introduce a novel compilation framework (dubbed Reasoning Compiler) that
formulates optimization as a sequential, context-aware decision process guided
by a large language model and structured Monte Carlo tree search (MCTS). The
LLM acts as a proposal mechanism, suggesting hardware-informed transformations
that reflect the current program state and accumulated performance feedback.
MCTS incorporates the LLM-generated proposals to balance exploration and
exploitation, facilitating structured, context-sensitive traversal of the
expansive compiler optimization space. By achieving substantial speedups with
markedly fewer samples than leading neural compilers, our approach demonstrates
the potential of LLM-guided reasoning to transform the landscape of compiler
optimization.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.01374v2
[DATE]2025-10-28 07:22:07+08:00
[CATEGORIES]cs.LG
Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model
[AUTHORS]Xingtu Liu, Lin F. Yang, Sharan Vaswani
[ABSTRACT]We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov
decision processes (CMDPs) where the objective is to find a policy that
maximizes the expected cumulative reward subject to expected cumulative
constraints. Given access to a generative model, we propose to solve CMDPs with
a primal-dual framework that can leverage any black-box unconstrained MDP
solver. For linear CMDPs with feature dimension $d$, we instantiate the
framework by using mirror descent value iteration
(\texttt{MDVI})~\citep{kitamura2023regularization} as an example MDP solver. We
provide sample complexity bounds for the resulting CMDP algorithm in two cases:
(i) relaxed feasibility, where small constraint violations are allowed, and
(ii) strict feasibility, where the output policy is required to exactly satisfy
the constraint. For (i), we prove that the algorithm can return an
$\epsilon$-optimal policy with high probability by using
$\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\epsilon^2}\right)$ samples. For (ii),
we show that the algorithm requires
$\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\epsilon^2\zeta^2}\right)$ samples,
where $\zeta$ is the problem-dependent Slater constant that characterizes the
size of the feasible region. Furthermore, we prove a lower bound of
$\Omega\left(\frac{d^2}{(1-\gamma)^5\epsilon^2\zeta^2}\right)$ for the strict
feasibility setting. We note that our upper bounds under both settings exhibit
a near-optimal dependence on $d$, $\epsilon$, and $\zeta$. Finally, we
instantiate our framework for tabular CMDPs and show that it can be used to
recover near-optimal sample complexities in this setting.
[LINK]http://arxiv.org/abs/2507.02089v2
[DATE]2025-10-28 07:20:30+08:00
[CATEGORIES]cs.LG
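To make the relationship between the bounds stated in the abstract above explicit, the following short calculation divides the strict-feasibility upper bound by the lower bound, and the strict-feasibility upper bound by the relaxed one. It uses only the expressions quoted in the abstract and ignores the logarithmic factors hidden by $\tilde{O}$.

```latex
\[
\frac{\dfrac{d^2}{(1-\gamma)^6\,\epsilon^2\,\zeta^2}}
     {\dfrac{d^2}{(1-\gamma)^5\,\epsilon^2\,\zeta^2}}
  = \frac{1}{1-\gamma},
\qquad
\frac{\dfrac{d^2}{(1-\gamma)^6\,\epsilon^2\,\zeta^2}}
     {\dfrac{d^2}{(1-\gamma)^4\,\epsilon^2}}
  = \frac{1}{(1-\gamma)^2\,\zeta^2}.
\]
```

The first ratio shows that the strict-feasibility upper and lower bounds agree in $d$, $\epsilon$, and $\zeta$ and differ only by one factor of $1/(1-\gamma)$. The second shows the additional cost of strict over relaxed feasibility.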
Improving the Straight-Through Estimator with Zeroth-Order Information
[AUTHORS]Ningfeng Yang, Tor M. Aamodt
[COMMENTS]39th Conference on Neural Information Processing Systems (NeurIPS
2025)
[LINK]http://arxiv.org/abs/2510.23926v1
[DATE]2025-10-28 07:14:59+08:00
[CATEGORIES]cs.LG
Beyond PCA: Manifold Dimension Estimation via Local Graph Structure
[AUTHORS]Zelong Bi, Pierre Lafaye de Micheaux
[ABSTRACT]Local principal component analysis (Local PCA) has proven to be an effective
tool for estimating the intrinsic dimension of a manifold. More recently,
curvature-adjusted PCA (CA-PCA) has improved upon this approach by explicitly
accounting for the curvature of the underlying manifold, rather than assuming
local flatness. Building on these insights, we propose a general framework for
manifold dimension estimation that captures the manifold’s local graph
structure by integrating PCA with regression-based techniques. Within this
framework, we introduce two representative estimators: quadratic embedding (QE)
and total least squares (TLS). Experiments on both synthetic and real-world
datasets demonstrate that these methods perform competitively with, and often
outperform, state-of-the-art alternatives.
[LINK]http://arxiv.org/abs/2510.15141v2
[DATE]2025-10-28 07:02:56+08:00
[CATEGORIES]cs.LG
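The Local PCA baseline that the abstract above builds on can be sketched as follows. This is a minimal illustration of plain Local PCA only, not the proposed QE or TLS estimators: for each point, take its nearest neighbors, run PCA, and count the components needed to explain most of the local variance. The function name `local_pca_dimension`, the neighborhood size `k`, and the 95% variance threshold are assumptions.

```python
import numpy as np

def local_pca_dimension(points, k=20, var_threshold=0.95):
    """Median over points of the number of local principal components
    needed to reach var_threshold of the neighborhood variance."""
    dims = []
    for p in points:
        dist = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(dist)[:k]]            # k nearest neighbors (incl. p)
        centered = nbrs - nbrs.mean(axis=0)
        sv = np.linalg.svd(centered, compute_uv=False)
        ratio = np.cumsum(sv ** 2) / np.sum(sv ** 2)   # cumulative explained variance
        dims.append(int(np.searchsorted(ratio, var_threshold) + 1))
    return int(np.median(dims))

# Toy data (an assumption): a flat 2D patch embedded isometrically in 5D.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(5, 2)))       # orthonormal 5x2 basis
points = rng.uniform(size=(300, 2)) @ basis.T
estimate = local_pca_dimension(points)
```

On a flat patch the local-flatness assumption holds exactly; on curved manifolds it does not, which is the limitation that CA-PCA and the framework in the abstract aim to address.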
Doubly-Robust Estimation of Counterfactual Policy Mean Embeddings
[AUTHORS]Houssam Zenati, Bariscan Bozkurt, Arthur Gretton
[ABSTRACT]Estimating the distribution of outcomes under counterfactual policies is
critical for decision-making in domains such as recommendation, advertising,
and healthcare. We propose and analyze a novel framework, Counterfactual Policy
Mean Embedding (CPME), that represents the entire counterfactual outcome
distribution in a reproducing kernel Hilbert space (RKHS), enabling flexible
and nonparametric distributional off-policy evaluation. We introduce both a
plug-in estimator and a doubly robust estimator; the latter enjoys improved
convergence rates by correcting for bias in both the outcome embedding and
propensity models. Building on this, we develop a doubly robust kernel test
statistic for hypothesis testing, which achieves asymptotic normality and thus
enables computationally efficient testing and straightforward construction of
confidence intervals. Our framework also supports sampling from the
counterfactual distribution. Numerical simulations illustrate the practical
benefits of CPME over existing methods.
[LINK]http://arxiv.org/abs/2506.02793v2
[DATE]2025-10-28 07:02:35+08:00
[CATEGORIES]cs.LG
NOBLE – Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models
[AUTHORS]Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar
[ABSTRACT]Characterizing the cellular properties of neurons is fundamental to
understanding their function in the brain. In this quest, the generation of
bio-realistic models is central to integrating multimodal cellular data
sets and establishing causal relationships. However, current modeling
approaches remain constrained by the limited availability and intrinsic
variability of experimental neuronal data. The deterministic formalism of
bio-realistic models currently precludes accounting for the natural variability
observed experimentally. While deep learning is becoming increasingly relevant
in this space, it fails to capture the full biophysical complexity of neurons,
their nonlinear voltage dynamics, and variability. To address these
shortcomings, we introduce NOBLE, a neural operator framework that learns a
mapping from a continuous frequency-modulated embedding of interpretable neuron
features to the somatic voltage response induced by current injection. Trained
on synthetic data generated from bio-realistic neuron models, NOBLE predicts
distributions of neural dynamics accounting for the intrinsic experimental
variability. Unlike conventional bio-realistic neuron models, interpolating
within the embedding space offers models whose dynamics are consistent with
experimentally observed responses. NOBLE enables the efficient generation of
synthetic neurons that closely resemble experimental data and exhibit
trial-to-trial variability, offering a $4200\times$ speedup over the numerical
solver. NOBLE is the first scaled-up deep learning framework that validates its
generalization with real experimental data. To this end, NOBLE captures
fundamental neural properties in a unique and emergent manner that opens the
door to a better understanding of cellular composition and computations,
neuromorphic architectures, large-scale brain circuits, and general neuroAI
applications.
[LINK]http://arxiv.org/abs/2506.04536v3
[DATE]2025-10-28 06:48:13+08:00
[CATEGORIES]cs.LG
Geometry-Inspired Unified Framework for Discounted and Average Reward MDPs
[AUTHORS]Arsenii Mustafin, Xinyi Sheng, Dominik Baumann
[ABSTRACT]The theoretical analysis of Markov Decision Processes (MDPs) is commonly
split into two cases - the average-reward case and the discounted-reward case -
which, while sharing similarities, are typically analyzed separately. In this
work, we extend a recently introduced geometric interpretation of MDPs for the
discounted-reward case to the average-reward case, thereby unifying both. This
allows us to extend a major result known for the discounted-reward case to the
average-reward case: under a unique and ergodic optimal policy, the Value
Iteration algorithm achieves a geometric convergence rate.
[COMMENTS]12 pages, 1 figure
[LINK]http://arxiv.org/abs/2510.23914v1
[DATE]2025-10-28 06:42:53+08:00
[CATEGORIES]cs.LG
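As background for the convergence result mentioned in the abstract above, here is a minimal sketch of Value Iteration in the discounted-reward case, where the sup-norm gap between successive iterates is known to shrink by at least a factor of $\gamma$ per step. It does not implement the paper's geometric interpretation or the average-reward extension; the random MDP, its sizes, and the function name `value_iteration` are assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma, n_iters):
    """P: (A, S, S) transition tensor, R: (S, A) rewards. Returns all iterates."""
    v = np.zeros(P.shape[1])
    history = [v]
    for _ in range(n_iters):
        # Bellman optimality backup: q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * v[t]
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v = q.max(axis=1)
        history.append(v)
    return history

rng = np.random.default_rng(0)
n_actions, n_states, gamma = 3, 5, 0.9
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # make each row a probability distribution
R = rng.random((n_states, n_actions))
history = value_iteration(P, R, gamma, n_iters=50)
# Sup-norm distance between successive iterates.
diffs = [float(np.max(np.abs(b - a))) for a, b in zip(history, history[1:])]
```

Each entry of `diffs` is at most `gamma` times the previous one, which is the geometric rate in the discounted case.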
FALCON: An ML Framework for Fully Automated Layout-Constrained Analog Circuit Design
[AUTHORS]Asal Mehradfar, Xuzhe Zhao, Yilun Huang, Emir Ceyani, Yankai Yang, Shihao Han, Hamidreza Aghasi, Salman Avestimehr
[ABSTRACT]Designing analog circuits from performance specifications is a complex,
multi-stage process encompassing topology selection, parameter inference, and
layout feasibility. We introduce FALCON, a unified machine learning framework
that enables fully automated, specification-driven analog circuit synthesis
through topology selection and layout-constrained optimization. Given a target
performance, FALCON first selects an appropriate circuit topology using a
performance-driven classifier guided by human design heuristics. Next, it
employs a custom, edge-centric graph neural network trained to map circuit
topology and parameters to performance, enabling gradient-based parameter
inference through the learned forward model. This inference is guided by a
differentiable layout cost, derived from analytical equations capturing
parasitic and frequency-dependent effects, and constrained by design rules. We
train and evaluate FALCON on a large-scale custom dataset of 1M analog mm-wave
circuits, generated and simulated using Cadence Spectre across 20
expert-designed topologies. Through this evaluation, FALCON demonstrates >99%
accuracy in topology inference, <10% relative error in performance prediction,
and efficient layout-aware design that completes in under 1 second per
instance. Together, these results position FALCON as a practical and extensible
foundation model for end-to-end analog circuit design automation.
[COMMENTS]Accepted at the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2505.21923v2
[DATE]2025-10-28 06:42:49+08:00
[CATEGORIES]cs.LG
DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning
[AUTHORS]Eddison Pham, Prisha Priyadarshini, Adrian Maliackel, Kanishk Bandi, Cristian Meo, Kevin Zhu
[ABSTRACT]Scene-level captioning in instructional videos can enhance learning by
requiring an understanding of both visual cues and temporal structure. By
aligning visual cues with textual guidance, this understanding supports
procedural learning and multimodal reasoning, providing a richer context for
skill acquisition. However, captions that fail to capture this structure may
lack coherence and quality, which can create confusion and undermine the
video’s educational intent. To address this gap, we introduce DynaStride, a
pipeline to generate coherent, scene-level captions without requiring manual
scene segmentation. Using the YouCookII dataset’s scene annotations, DynaStride
performs adaptive frame sampling and multimodal windowing to capture key
transitions within each scene. It then employs a multimodal chain-of-thought
process to produce multiple action-object pairs, which are refined and fused
using a dynamic stride window selection algorithm that adaptively balances
temporal context and redundancy. The final scene-level caption integrates
visual semantics and temporal reasoning in a single instructional caption.
Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o,
demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and
semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses
further show that DynaStride produces captions that are more temporally
coherent and informative, suggesting a promising direction for improving
AI-powered instructional content generation.
[COMMENTS]16 pages, 15 figures, 5 Tables, submitted to AAAI AI4ED Workshop 2026
[LINK]http://arxiv.org/abs/2510.23907v1
[DATE]2025-10-28 06:29:08+08:00
[CATEGORIES]cs.LG
Group Interventions on Deep Networks for Causal Discovery in Subsystems
[AUTHORS]Wasim Ahmad, Maha Shadaydeh, Joachim Denzler
[ABSTRACT]Causal discovery uncovers complex relationships between variables, enhancing
predictions, decision-making, and insights into real-world systems, especially
in nonlinear multivariate time series. However, most existing methods primarily
focus on pairwise cause-effect relationships, overlooking interactions among
groups of variables, i.e., subsystems and their collective causal influence. In
this study, we introduce gCDMI, a novel multi-group causal discovery method
that leverages group-level interventions on trained deep neural networks and
employs model invariance testing to infer causal relationships. Our approach
involves three key steps. First, we use deep learning to jointly model the
structural relationships among groups of all time series. Second, we apply
group-wise interventions to the trained model. Finally, we conduct model
invariance testing to determine the presence of causal links among variable
groups. We evaluate our method on simulated datasets, demonstrating its
superior performance in identifying group-level causal relationships compared
to existing methods. Additionally, we validate our approach on real-world
datasets, including brain networks and climate ecosystems. Our results
highlight that applying group-level interventions to deep learning models,
combined with invariance testing, can effectively reveal complex causal
structures, offering valuable insights for domains such as neuroscience and
climate science.
[COMMENTS]Submitted to IEEE Access. We are working on the revised version
[LINK]http://arxiv.org/abs/2510.23906v1
[DATE]2025-10-28 06:26:20+08:00
[CATEGORIES]cs.LG
Inferring Group Intent as a Cooperative Game. An NLP-based Framework for Trajectory Analysis using Graph Transformer Neural Network
[AUTHORS]Yiming Zhang, Vikram Krishnamurthy, Shashwat Jain
[ABSTRACT]This paper studies group target trajectory intent as the outcome of a
cooperative game where the complex spatio-temporal trajectories are modeled using an
NLP-based generative model. In our framework, the group intent is specified by
the characteristic function of a cooperative game, and allocations for players
in the cooperative game are specified by either the core, the Shapley value, or
the nucleolus. The resulting allocations induce probability distributions that
govern the coordinated spatio-temporal trajectories of the targets that reflect
the group’s underlying intent. We address two key questions: (1) How can the
intent of a group trajectory be optimally formalized as the characteristic
function of a cooperative game? (2) How can such intent be inferred from noisy
observations of the targets? To answer the first question, we introduce a
Fisher-information-based characteristic function of the cooperative game, which
yields probability distributions that generate coordinated spatio-temporal
patterns. As a generative model for these patterns, we develop an NLP-based
generative model built on formal grammar, enabling the creation of realistic
multi-target trajectory data. To answer the second question, we train a Graph
Transformer Neural Network (GTNN) to infer group trajectory intent, expressed as
the characteristic function of the cooperative game, from observational data
with high accuracy. The self-attention function of the GTNN depends on the
track estimates. Thus, the formulation and algorithms provide a multi-layer
approach that spans target tracking (Bayesian signal processing) and the GTNN
(for group intent inference).
[LINK]http://arxiv.org/abs/2510.23905v1
[DATE]2025-10-28 06:23:53+08:00
[CATEGORIES]cs.LG
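Of the three allocation rules named in the abstract above, the Shapley value is the simplest to compute directly. The sketch below averages each player's marginal contribution over all arrival orders. The three-player characteristic function `v` is a toy assumption and not the paper's Fisher-information-based function; exact enumeration is only practical for small player counts.

```python
from itertools import permutations

def shapley_values(n_players, v):
    """Exact Shapley values by averaging marginal contributions over all orders."""
    totals = [0.0] * n_players
    orders = list(permutations(range(n_players)))
    for order in orders:
        coalition = frozenset()
        for player in order:
            grown = coalition | {player}
            totals[player] += v(grown) - v(coalition)   # marginal contribution
            coalition = grown
    return [t / len(orders) for t in totals]

def v(coalition):
    # Toy game (an assumption): worth 1 only if player 0 joins at least one other.
    return 1.0 if 0 in coalition and len(coalition) >= 2 else 0.0

phi = shapley_values(3, v)
```

Here player 0 receives 2/3 and players 1 and 2 receive 1/6 each, and the allocations sum to the value of the grand coalition.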
RS-ORT: A Reduced-Space Branch-and-Bound Algorithm for Optimal Regression Trees
[AUTHORS]Cristobal Heredia, Pedro Chumpitaz-Flores, Kaixun Hua
[ABSTRACT]Mixed-integer programming (MIP) has emerged as a powerful framework for
learning optimal decision trees. Yet, existing MIP approaches for regression
tasks are either limited to purely binary features or become computationally
intractable when continuous, large-scale data are involved. Naively binarizing
continuous features sacrifices global optimality and often yields needlessly
deep trees. We recast the optimal regression-tree training as a two-stage
optimization problem and propose Reduced-Space Optimal Regression Trees
(RS-ORT) - a specialized branch-and-bound (BB) algorithm that branches
exclusively on tree-structural variables. This design guarantees the
algorithm’s convergence and its independence from the number of training
samples. Leveraging the model’s structure, we introduce several bound
tightening techniques - closed-form leaf prediction, empirical threshold
discretization, and exact depth-1 subtree parsing - that combine with
decomposable upper and lower bounding strategies to accelerate the training.
The BB node-wise decomposition enables trivial parallel execution, further
alleviating the computational intractability even for million-size datasets.
In empirical studies on several regression benchmarks containing both
binary and continuous features, RS-ORT also delivers better training and
testing performance than state-of-the-art methods. Notably, on datasets with up
to 2,000,000 samples with continuous features, RS-ORT can obtain guaranteed
training performance with a simpler tree structure and a better generalization
ability in four hours.
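As an illustration of two ingredients named in the abstract, closed-form leaf prediction and empirical threshold discretization, the following sketch solves a depth-1 regression tree exactly by enumeration. It assumes squared-error loss (under which the optimal leaf value is the leaf mean) and candidate thresholds at midpoints between consecutive distinct feature values. The name `best_stump` is hypothetical, and this is not the authors' RS-ORT implementation.

```python
import numpy as np

def best_stump(X, y):
    """Exhaustively find the depth-1 regression tree minimizing squared error.

    Assumptions (not taken from the paper): squared-error loss, thresholds at
    midpoints between consecutive distinct observed values of each feature.
    Returns (feature_index, threshold, left_value, right_value, sse).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Start from the no-split solution: a single leaf predicting the mean.
    best = (None, None, float(y.mean()), float(y.mean()),
            float(((y - y.mean()) ** 2).sum()))
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])                  # sorted distinct values
        thresholds = (values[:-1] + values[1:]) / 2  # empirical candidates
        for t in thresholds:
            left = X[:, j] <= t
            yl, yr = y[left], y[~left]
            # Closed-form leaf prediction: the mean minimizes squared error.
            sse = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
            if sse < best[4]:
                best = (j, float(t), float(yl.mean()), float(yr.mean()), float(sse))
    return best

# Toy data: the target depends only on feature 0.
X = np.array([[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]])
y = np.array([1.0, 1.0, 3.0, 3.0])
feature, threshold, left_value, right_value, sse = best_stump(X, y)
```

Only thresholds that separate observed values can change the partition, so the candidate set stays finite without binarizing the feature in advance.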
[COMMENTS]20 pages, 1 figure, uses ICLR 2026 LaTeX style. Submitted to arXiv as
a preprint version
[LINK]http://arxiv.org/abs/2510.23901v1
[DATE]2025-10-28 06:17:09+08:00
[CATEGORIES]cs.LG
Pre-trained knowledge elevates large language models beyond traditional chemical reaction optimizers
[AUTHORS]Robert MacKnight, Jose Emilio Regio, Jeffrey G. Ethier, Luke A. Baldwin, Gabe Gomes
[ABSTRACT]Modern optimization in experimental chemistry employs algorithmic search
through black-box parameter spaces. Here we demonstrate that pre-trained
knowledge in large language models (LLMs) fundamentally changes this paradigm.
Using six fully enumerated categorical reaction datasets (768-5,684
experiments), we benchmark LLM-guided optimization (LLM-GO) against Bayesian
optimization (BO) and random sampling. Frontier LLMs consistently match or
exceed BO performance across five single-objective datasets, with advantages
growing as parameter complexity increases and high-performing conditions become
scarce (<5% of space). BO retains superiority only for explicit multi-objective
trade-offs. To understand these contrasting behaviors, we introduce a
topology-agnostic information theory framework quantifying sampling diversity
throughout optimization campaigns. This analysis reveals that LLMs maintain
systematically higher exploration Shannon entropy than BO across all datasets
while achieving superior performance, with advantages most pronounced in
solution-scarce parameter spaces where high-entropy exploration typically
fails, suggesting that pre-trained domain knowledge enables more effective
navigation of chemical parameter space rather than replacing structured
exploration strategies. To enable transparent benchmarking and community
validation, we release Iron Mind (https://gomes.andrew.cmu.edu/iron-mind), a
no-code platform for side-by-side evaluation of human, algorithmic, and LLM
optimization campaigns with public leaderboards and complete trajectories. Our
findings establish that LLM-GO excels precisely where traditional methods
struggle: complex categorical spaces requiring domain understanding rather than
mathematical optimization.
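The Shannon entropy of a sampling campaign, mentioned above as the measure of exploration, can be sketched as follows. The sketch assumes entropy is computed over the empirical distribution of values chosen for a single categorical parameter and is reported in bits. The paper's topology-agnostic framework may aggregate differently, and the campaigns below are made-up examples.

```python
import math
from collections import Counter

def shannon_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of `samples`."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical campaigns over one categorical parameter (a solvent choice).
exploratory = ["DMF", "THF", "MeCN", "toluene", "DMF", "THF", "MeCN", "toluene"]
exploitative = ["DMF", "DMF", "DMF", "DMF", "DMF", "DMF", "DMF", "THF"]

h_exploratory = shannon_entropy(exploratory)    # uniform over 4 choices: 2 bits
h_exploitative = shannon_entropy(exploitative)  # concentrated on one choice
```

A campaign that keeps spreading its picks across the options scores higher than one that repeats a single condition.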
[COMMENTS]27 pages, 8 figures
[LINK]http://arxiv.org/abs/2509.00103v2
[DATE]2025-10-28 06:13:12+08:00
[CATEGORIES]cs.LG
PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs
[AUTHORS]Jiaqi Xue, Yifei Zhao, Mansour Al Ghanim, Shangqian Gao, Ruimin Sun, Qian Lou, Mengxin Zheng
[ABSTRACT]Text watermarking for large language models (LLMs) enables model owners to
verify text origin and protect intellectual property. While watermarking
methods for closed-source LLMs are relatively mature, extending them to
open-source models remains challenging, as developers cannot control the
decoding process. Consequently, owners of open-source LLMs lack practical means
to verify whether text was generated by their models. A core difficulty lies in
embedding watermarks directly into model weights without hurting detectability.
A promising idea is to distill watermarks from a closed-source model into an
open one, but this suffers from (i) poor detectability due to mismatch between
learned and predefined patterns, and (ii) fragility to downstream modifications
such as fine-tuning or model merging. To overcome these limitations, we propose
PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO
jointly trains a watermark policy model with the LLM, producing patterns that
are easier for the model to learn and more consistent with detection criteria.
A regularization term further simulates downstream perturbations and penalizes
degradation in watermark detectability, ensuring robustness under model edits.
Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO
substantially improves both watermark detectability and resilience to model
modifications.
[LINK]http://arxiv.org/abs/2510.23891v1
[DATE]2025-10-28 06:00:49+08:00
[CATEGORIES]cs.LG
MinatoLoader: Accelerating Machine Learning Training Through Efficient Data Preprocessing
[AUTHORS]Rahma Nouaji, Stella Bitchebe, Ricardo Macedo, Oana Balmau
[ABSTRACT]Data loaders are used by Machine Learning (ML) frameworks like PyTorch and
TensorFlow to apply transformations to data before feeding it into the
accelerator. This operation is called data preprocessing. Data preprocessing
plays an important role in the ML training workflow: if it is
inefficiently pipelined with training, it can leave GPUs largely idle,
causing significant training delays. Unfortunately, existing data loaders
waste GPU resources, with 76% GPU idleness when using the
PyTorch data loader, for example. One key source of inefficiency is the
variability in preprocessing time across samples within the same dataset.
Existing data loaders are oblivious to this variability, and they construct
batches without any consideration of slow or fast samples. In this case, the
entire batch is delayed by a single slow sample, stalling the training pipeline
and resulting in head-of-line blocking.
To address these inefficiencies, we present MinatoLoader, a general-purpose
data loader for PyTorch that accelerates training and improves GPU utilization.
MinatoLoader is designed for a single-server setup, containing multiple GPUs.
It continuously prepares data in the background and actively constructs batches
by prioritizing fast-to-preprocess samples, while slower samples are processed
in parallel.
We evaluate MinatoLoader on servers with V100 and A100 GPUs. On a machine
with four A100 GPUs, MinatoLoader improves the training time of a wide range of
workloads by up to 7.5x (3.6x on average) over PyTorch DataLoader
and Pecan, and up to 3x (2.2x on average) over DALI. It also
increases average GPU utilization from 46.4% with PyTorch to 90.45%, while
preserving model accuracy and enabling faster convergence.
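The head-of-line blocking described above can be illustrated with a small timing model. This is a sketch under a strong simplifying assumption (every sample starts preprocessing at t=0 on its own worker), not MinatoLoader's actual scheduler. The function name and the preprocessing times are hypothetical.

```python
def batch_ready_times(prep_times, batch_size, prioritize_fast):
    """Time at which each batch has all of its samples preprocessed.

    Simplifying assumption: every sample starts preprocessing at t=0 on its
    own worker, so sample i is ready at prep_times[i].
    """
    # Fast-first batching fills batches in order of readiness; in-order
    # batching keeps the dataset order regardless of preprocessing time.
    order = sorted(prep_times) if prioritize_fast else list(prep_times)
    return [max(order[i:i + batch_size])
            for i in range(0, len(order), batch_size)]

# One slow sample (5.0 s) among fast ones (0.1 s each).
prep_times = [0.1, 0.1, 5.0, 0.1, 0.1, 0.1, 0.1, 0.1]
in_order = batch_ready_times(prep_times, batch_size=4, prioritize_fast=False)
fast_first = batch_ready_times(prep_times, batch_size=4, prioritize_fast=True)
```

With in-order batching the first batch waits 5.0 s for its slow sample, even though seven samples are ready after 0.1 s. Filling batches in order of readiness delivers the first batch at 0.1 s and defers the slow sample to a later batch.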
[COMMENTS]Paper accepted at EuroSys 2026
[LINK]http://arxiv.org/abs/2509.10712v2
[DATE]2025-10-28 05:57:37+08:00
[CATEGORIES]cs.LG
STree: Speculative Tree Decoding for Hybrid State-Space Models
[AUTHORS]Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto
[ABSTRACT]Speculative decoding is a technique to leverage hardware concurrency in order
to enable multiple steps of token generation in a single forward pass, thus
improving the efficiency of large-scale autoregressive (AR) Transformer models.
State-space models (SSMs) are already more efficient than AR Transformers,
since their state summarizes all past data with no need to cache or re-process
tokens in the sliding window context. However, their state can also comprise
thousands of tokens; so, speculative decoding has recently been extended to
SSMs. Existing approaches, however, do not leverage the tree-based verification
methods, since current SSMs lack the means to compute a token tree efficiently.
We propose the first scalable algorithm to perform tree-based speculative
decoding in state-space models (SSMs) and hybrid architectures of SSMs and
Transformer layers. We exploit the structure of accumulated state transition
matrices to facilitate tree-based speculative decoding with minimal overhead
relative to current SSM implementations. Along with the algorithm, we describe
a hardware-aware implementation that improves on the naive application of AR
Transformer tree-based speculative decoding methods to SSMs. Furthermore, we
outperform vanilla speculative decoding with SSMs even with a baseline drafting
model and tree structure on three different benchmarks, opening up
opportunities for further speed up with SSM and hybrid model inference. Code
can be found at: https://github.com/wyc1997/stree.
[LINK]http://arxiv.org/abs/2505.14969v2
[DATE]2025-10-28 05:48:48+08:00
[CATEGORIES]cs.LG
Generating Creative Chess Puzzles
[AUTHORS]Xidong Feng, Vivek Veeriah, Marcus Chiam, Michael Dennis, Ryan Pachauri, Thomas Tumiel, Federico Barbero, Johan Obando-Ceron, Jiaxin Shi, Satinder Singh, Shaobo Hou, Nenad Tomašev, Tom Zahavy
[ABSTRACT]While Generative AI rapidly advances in various domains, generating truly
creative, aesthetic, and counter-intuitive outputs remains a challenge. This
paper presents an approach to tackle these difficulties in the domain of chess
puzzles. We start by benchmarking Generative AI architectures, and then
introduce an RL framework with novel rewards based on chess engine search
statistics to overcome some of those shortcomings. The rewards are designed to
enhance a puzzle’s uniqueness, counter-intuitiveness, diversity, and realism.
Our RL approach dramatically increases counter-intuitive puzzle generation by
10x, from 0.22% (supervised) to 2.5%, surpassing existing dataset rates
(2.1%) and the best Lichess-trained model (0.4%). Our puzzles meet novelty
and diversity benchmarks, retain aesthetic themes, and are rated by human
experts as more creative, enjoyable, and counter-intuitive than composed book
puzzles, even approaching classic compositions. Our final outcome is a curated
booklet of these AI-generated puzzles, which is acknowledged for creativity by
three world-renowned experts.
[LINK]http://arxiv.org/abs/2510.23881v1
[DATE]2025-10-28 05:43:39+08:00
[CATEGORIES]cs.LG
Artificial Intelligence Based Predictive Maintenance for Electric Buses
[AUTHORS]Ayse Irmak Ercevik, Ahmet Murat Ozbayoglu
[ABSTRACT]Predictive maintenance (PdM) is crucial for optimizing efficiency and
minimizing downtime of electric buses. While these vehicles provide
environmental benefits, they pose challenges for PdM due to complex electric
transmission and battery systems. Traditional maintenance, often based on
scheduled inspections, struggles to capture anomalies in multi-dimensional
real-time CAN Bus data. This study employs a graph-based feature selection
method to analyze relationships among CAN Bus parameters of electric buses and
investigates the prediction performance of targeted alarms using artificial
intelligence techniques. The raw data collected over two years underwent
extensive preprocessing to ensure data quality and consistency. A hybrid
graph-based feature selection tool was developed by combining statistical
filtering (Pearson correlation, Cramer’s V, ANOVA F-test) with
optimization-based community detection algorithms (InfoMap, Leiden, Louvain,
Fast Greedy). Machine learning models, including SVM, Random Forest, and
XGBoost, were optimized through grid and random search with data balancing via
SMOTEEN and binary search-based down-sampling. Model interpretability was
achieved using LIME to identify the features influencing predictions. The
results demonstrate that the developed system effectively predicts vehicle
alarms, enhances feature interpretability, and supports proactive maintenance
strategies aligned with Industry 4.0 principles.
[LINK]http://arxiv.org/abs/2510.23879v1
[DATE]2025-10-28 05:39:25+08:00
[CATEGORIES]cs.LG
LIME: Link-based user-item Interaction Modeling with decoupled xor attention for Efficient test time scaling
[AUTHORS]Yunjiang Jiang, Ayush Agarwal, Yang Liu, Bi Xue
[ABSTRACT]Scaling large recommendation systems requires advancing three major
frontiers: processing longer user histories, expanding candidate sets, and
increasing model capacity. While promising, transformers’ computational cost
scales quadratically with the user sequence length and linearly with the number
of candidates. This trade-off makes it prohibitively expensive to expand
candidate sets or increase sequence length at inference, despite the
significant performance improvements.
We introduce LIME, a novel architecture that resolves this
trade-off. Through two key innovations, LIME fundamentally reduces
computational complexity. First, low-rank “link embeddings” enable
pre-computation of attention weights by decoupling user and candidate
interactions, making the inference cost nearly independent of candidate set
size. Second, a linear attention mechanism, LIME-XOR, reduces the
complexity with respect to user sequence length from quadratic ($O(N^2)$) to
linear ($O(N)$).
Experiments on public and industrial datasets show LIME achieves near-parity
with state-of-the-art transformers but with a 10x inference speedup on
large candidate sets or long sequence lengths. When tested on a major
recommendation platform, LIME improved user engagement while maintaining
minimal inference costs with respect to candidate set size and user history
length, establishing a new paradigm for efficient and expressive recommendation
systems.
[COMMENTS]16 pages
[LINK]http://arxiv.org/abs/2510.18239v2
[DATE]2025-10-28 05:18:47+08:00
[CATEGORIES]cs.LG
A PDE-Informed Latent Diffusion Model for 2-m Temperature Downscaling
[AUTHORS]Paul Rosu, Muchang Bahng, Erick Jiang, Rico Zhu, Vahid Tarokh
[ABSTRACT]This work presents a physics-conditioned latent diffusion model tailored for
dynamical downscaling of atmospheric data, with a focus on reconstructing
high-resolution 2-m temperature fields. Building upon a pre-existing diffusion
architecture and employing a residual formulation against a reference UNet, we
integrate a partial differential equation (PDE) loss term into the model’s
training objective. The PDE loss is computed in the full resolution (pixel)
space by decoding the latent representation and is designed to enforce physical
consistency through a finite-difference approximation of an effective
advection-diffusion balance. Empirical observations indicate that conventional
diffusion training already yields low PDE residuals, and we investigate how
fine-tuning with this additional loss further regularizes the model and
enhances the physical plausibility of the generated fields. Our full
codebase is available on GitHub for future reference and development.
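A finite-difference PDE penalty of the kind described above might look like the following sketch. It assumes a steady advection-diffusion balance u dT/dx + v dT/dy = kappa * laplacian(T) on a regular grid indexed as T[y, x], with second-order central differences on interior points. The exact "effective" balance, the wind fields, and the coefficient used by the authors are not given in the abstract, so all names and defaults here are hypothetical.

```python
import numpy as np

def advection_diffusion_residual(T, u, v, kappa, dx, dy):
    """Residual u*dT/dx + v*dT/dy - kappa*laplacian(T) on interior points.

    Assumed form (not from the paper): steady state, arrays indexed [y, x],
    second-order central differences.
    """
    dTdx = (T[1:-1, 2:] - T[1:-1, :-2]) / (2 * dx)
    dTdy = (T[2:, 1:-1] - T[:-2, 1:-1]) / (2 * dy)
    lap = ((T[1:-1, 2:] - 2 * T[1:-1, 1:-1] + T[1:-1, :-2]) / dx ** 2
           + (T[2:, 1:-1] - 2 * T[1:-1, 1:-1] + T[:-2, 1:-1]) / dy ** 2)
    return u[1:-1, 1:-1] * dTdx + v[1:-1, 1:-1] * dTdy - kappa * lap

def pde_loss(T, u, v, kappa=1.0, dx=1.0, dy=1.0):
    """Mean squared PDE residual, usable as an extra training penalty."""
    r = advection_diffusion_residual(T, u, v, kappa, dx, dy)
    return float(np.mean(r ** 2))

# A linear temperature field has zero Laplacian, so only advection contributes.
yy, xx = np.mgrid[0:5, 0:5].astype(float)
T = 2.0 * xx + 3.0 * yy
calm = np.zeros_like(T)
wind_x = np.ones_like(T)

loss_calm = pde_loss(T, calm, calm)    # no wind, no curvature: residual is 0
loss_wind = pde_loss(T, wind_x, calm)  # residual is 1 * dT/dx = 2 everywhere
```

In the setting the abstract describes, `T` would be the decoded full-resolution field, and the penalty would be added to the diffusion training objective with some weight.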
[LINK]http://arxiv.org/abs/2510.23866v1
[DATE]2025-10-28 05:17:03+08:00
[CATEGORIES]cs.LG
Program Evaluation with Remotely Sensed Outcomes
[AUTHORS]Ashesh Rambachan, Rahul Singh, Davide Viviano
[ABSTRACT]Economists often estimate treatment effects in experiments using remotely
sensed variables (RSVs), e.g., satellite images or mobile phone activity, in
place of directly measured economic outcomes. A common practice is to use an
observational sample to train a predictor of the economic outcome from the RSV,
and then use these predictions as the outcomes in the experiment. We show that
this method is biased whenever the RSV is a post-outcome variable, meaning that
variation in the economic outcome causes variation in the RSV. For example,
changes in poverty or environmental quality cause changes in satellite images,
but not vice versa. As our main result, we nonparametrically identify the
treatment effect by formalizing the intuition underlying common practice: the
conditional distribution of the RSV given the outcome and treatment is stable
across samples. Our identifying formula reveals that efficient inference
requires predictions of three quantities from the RSV – the outcome,
treatment, and sample indicator – whereas common practice only predicts the
outcome. Valid inference does not require any rate conditions on RSV
predictions, justifying the use of complex deep learning algorithms with
unknown statistical properties. We reanalyze the effect of an anti-poverty
program in India using satellite images.
[LINK]http://arxiv.org/abs/2411.10959v3
[DATE]2025-10-28 04:57:45+08:00
[CATEGORIES]cs.LG
Inter-turbine Modelling of Wind-Farm Power using Multi-task Learning
[AUTHORS]Simon M. Brealy, Lawrence A. Bull, Pauline Beltrando, Anders Sommer, Nikolaos Dervilis, Keith Worden
[ABSTRACT]Because of the global need to increase power production from renewable energy
resources, developments in the online monitoring of the associated
infrastructure are of interest to reduce operation and maintenance costs.
However, challenges exist for data-driven approaches to this problem, such as
incomplete or limited histories of labelled damage-state data, operational and
environmental variability, or the desire for the quantification of uncertainty
to support risk management.
This work first introduces a probabilistic regression model for predicting
wind-turbine power, which adjusts for wake effects learnt from data. Spatial
correlations in the learned model parameters for different tasks (turbines) are
then leveraged in a hierarchical Bayesian model (an approach to multi-task
learning) to develop a “metamodel”, which can be used to make power-predictions
which adjust for turbine location - including on previously unobserved turbines
not included in the training data. The results show that the metamodel
outperforms a series of benchmark models and demonstrates a novel strategy
for making efficient use of data for inference in populations of structures, in
particular where correlations exist in the variable(s) of interest (such as
those from wind-turbine wake-effects).
[COMMENTS]Preprint submitted to Mechanical Systems and Signal Processing. A
shortened version of this article has been submitted to the Wind Energy Science
Conference 2025
[LINK]http://arxiv.org/abs/2502.14527v2
[DATE]2025-10-28 04:50:49+08:00
[CATEGORIES]cs.LG
Testing-driven Variable Selection in Bayesian Modal Regression
[AUTHORS]Jiasong Duan, Hongmei Zhang, Xianzheng Huang
[ABSTRACT]We propose a Bayesian variable selection method in the framework of modal
regression for heavy-tailed responses. An efficient expectation-maximization
algorithm is employed to expedite parameter estimation. A test statistic is
constructed to exploit the shape of the model error distribution to effectively
separate informative covariates from unimportant ones. Through simulations, we
demonstrate and evaluate the efficacy of the proposed method in identifying
important covariates in the presence of non-Gaussian model errors. Finally, we
apply the proposed method to analyze two datasets arising in genetic and
epigenetic studies.
[COMMENTS]30 pages, 2 figures, preprint under review
[LINK]http://arxiv.org/abs/2510.23831v1
[DATE]2025-10-28 04:17:34+08:00
[CATEGORIES]cs.LG
LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
[AUTHORS]Jiani Huang, Ziyang Li, Mayur Naik, Ser-Nam Lim
[ABSTRACT]Supervised approaches for learning spatio-temporal scene graphs (STSG) from
video are greatly hindered by their reliance on STSG-annotated videos,
which are labor-intensive to construct at scale. Is it feasible to instead use
readily available video captions as weak supervision? To address this question,
we propose LASER, a neuro-symbolic framework to enable training STSG generators
using only video captions. LASER employs large language models to first extract
logical specifications with rich spatio-temporal semantic information from
video captions. LASER then trains the underlying STSG generator to align the
predicted STSG with the specification. The alignment algorithm overcomes the
challenges of weak supervision by leveraging a differentiable symbolic reasoner
and using a combination of contrastive, temporal, and semantics losses. The
overall approach efficiently trains low-level perception models to extract a
fine-grained STSG that conforms to the video caption. In doing so, it enables a
novel methodology for learning STSGs without tedious annotations. We evaluate
our method on three video datasets: OpenPVSG, 20BN, and MUGEN. Our approach
demonstrates substantial improvements over fully-supervised baselines,
achieving a unary predicate prediction accuracy of 27.78% (+12.65%) and a
binary recall@5 of 0.42 (+0.22) on OpenPVSG. Additionally, LASER exceeds
baselines by 7% on 20BN and 5.2% on MUGEN in terms of overall predicate
prediction accuracy.
[COMMENTS]Accepted at International Conference on Learning Representations
(ICLR) 2025
[LINK]http://arxiv.org/abs/2304.07647v7
[DATE]2025-10-28 04:14:22+08:00
[CATEGORIES]cs.LG
GeoClip: Geometry-Aware Clipping for Differentially Private SGD
[AUTHORS]Atefeh Gilani, Naima Tasnim, Lalitha Sankar, Oliver Kosut
[ABSTRACT]Differentially private stochastic gradient descent (DP-SGD) is the most
widely used method for training machine learning models with provable privacy
guarantees. A key challenge in DP-SGD is setting the per-sample gradient
clipping threshold, which significantly affects the trade-off between privacy
and utility. While recent adaptive methods improve performance by adjusting
this threshold during training, they operate in the standard coordinate system
and fail to account for correlations across the coordinates of the gradient. We
propose GeoClip, a geometry-aware framework that clips and perturbs gradients
in a transformed basis aligned with the geometry of the gradient distribution.
GeoClip adaptively estimates this transformation using only previously released
noisy gradients, incurring no additional privacy cost. We provide convergence
guarantees for GeoClip and derive a closed-form solution for the optimal
transformation that minimizes the amount of noise added while keeping the
probability of gradient clipping under control. Experiments on both tabular and
image datasets demonstrate that GeoClip consistently outperforms existing
adaptive clipping methods under the same privacy budget.
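The clip-and-perturb step that GeoClip modifies can be sketched as below. The first part is the standard DP-SGD recipe (clip each per-sample gradient to norm `clip_norm`, sum, add Gaussian noise). The optional `basis` argument applies the same recipe in linearly transformed coordinates and maps the result back. How GeoClip estimates its transform from previously released noisy gradients is not reproduced here, and the function name and arguments are hypothetical.

```python
import numpy as np

def dp_sgd_step(per_sample_grads, clip_norm, noise_multiplier, rng, basis=None):
    """Return (privatized mean gradient, clipped per-sample gradients).

    `basis` is an optional invertible matrix M (an assumption for this
    sketch): clipping and noising then act on M @ g, and the noisy mean is
    mapped back with M^{-1}.
    """
    g = np.asarray(per_sample_grads, dtype=float)
    if basis is not None:
        g = g @ basis.T                              # rows become M @ g_i
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    # Scale down only the gradients whose norm exceeds the threshold.
    clipped = g * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=g.shape[1])
    noisy_mean = (clipped.sum(axis=0) + noise) / g.shape[0]
    if basis is not None:
        noisy_mean = np.linalg.solve(basis, noisy_mean)  # back to original basis
    return noisy_mean, clipped

rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4]])   # norms 5.0 and 0.5
noisy_mean, clipped = dp_sgd_step(grads, clip_norm=1.0,
                                  noise_multiplier=1.0, rng=rng)
```

In this example the first gradient is scaled from norm 5.0 down to norm 1.0, while the second is left unchanged. A fixed threshold in the standard coordinates treats every direction alike, which is the limitation the abstract attributes to existing adaptive methods.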
[LINK]http://arxiv.org/abs/2506.06549v3
[DATE]2025-10-28 04:00:10+08:00
[CATEGORIES]cs.LG
ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
[AUTHORS]Yilang Zhang, Xiaodong Yang, Yiwei Cai, Georgios B. Giannakis
[ABSTRACT]As large language models (LLMs) continue to scale in size, the computational
overhead has become a major bottleneck for task-specific fine-tuning. While
low-rank adaptation (LoRA) effectively curtails this cost by confining the
weight updates to a low-dimensional subspace, such a restriction can hinder
effectiveness and slow convergence. This contribution addresses these
limitations by progressively accumulating a high-rank weight update from
consecutive low-rank increments. Specifically, the per-update optimal low-rank
matrix is identified to minimize the loss function and closely approximate full
fine-tuning. To enable efficient and seamless optimization without restarting,
this optimal choice is formed by appropriately scaling the columns of the
original low-rank matrix. Rigorous performance guarantees reveal that the
optimal scaling can be found analytically. Extensive numerical tests with
popular LLMs scaling up to 12 billion parameters demonstrate a consistent
performance gain and fast convergence relative to state-of-the-art LoRA
variants on diverse tasks including natural language understanding, commonsense
reasoning, and mathematical problem solving.
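The idea of building a high-rank update from consecutive low-rank increments can be checked numerically. The sketch below uses random factors as stand-ins for learned ones and does not implement the optimal column scaling that the abstract describes. The dimensions and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, steps = 8, 2, 3            # weight size, LoRA rank, number of increments

W_update = np.zeros((d, d))
for _ in range(steps):
    B = rng.normal(size=(d, r))  # hypothetical low-rank factors for one
    A = rng.normal(size=(r, d))  # increment (random here, learned in practice)
    W_update += B @ A            # each increment has rank at most r

single_rank = np.linalg.matrix_rank(B @ A)
accumulated_rank = np.linalg.matrix_rank(W_update)
```

A single increment is confined to rank `r`, whereas the accumulated sum can reach rank up to `steps * r`. This is why accumulation can approach full fine-tuning more closely than a single fixed low-rank subspace.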
[LINK]http://arxiv.org/abs/2510.23818v1
[DATE]2025-10-28 03:59:46+08:00
[CATEGORIES]cs.LG
Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives
[AUTHORS]Jinchao Li, Yuejiao Wang, Junan Li, Jiawen Kang, Bo Zheng, Ka Ho Wong, Brian Mak, Helene H. Fung, Jean Woo, Man-Wai Mak, Timothy Kwok, Vincent Mok, Xianmin Gong, Xixin Wu, Xunying Liu, Patrick C. M. Wong, Helen Meng
[ABSTRACT]Early detection of neurocognitive disorders (NCDs) is crucial for timely
intervention and disease management. Given that language impairments manifest
early in NCD progression, visual-stimulated narrative (VSN)-based analysis
offers a promising avenue for NCD detection. Current VSN-based NCD detection
methods primarily focus on linguistic microstructures (e.g., lexical diversity)
that are closely tied to bottom-up, stimulus-driven cognitive processes. While
these features illuminate basic language abilities, the higher-order linguistic
macrostructures (e.g., topic development) that may reflect top-down,
concept-driven cognitive abilities remain underexplored. These macrostructural
patterns are crucial for NCD detection, yet challenging to quantify due to
their abstract and complex nature. To bridge this gap, we propose two novel
macrostructural approaches: (1) a Dynamic Topic Model (DTM) to track topic
evolution over time, and (2) a Text-Image Temporal Alignment Network (TITAN) to
measure cross-modal consistency between narrative and visual stimuli.
Experimental results show the effectiveness of the proposed approaches in NCD
detection, with TITAN achieving superior performance across three corpora:
ADReSS (F1=0.8889), ADReSSo (F1=0.8504), and CU-MARVEL-RABBIT (F1=0.7238).
Feature contribution analysis reveals that macrostructural features (e.g.,
topic variability, topic change rate, and topic consistency) constitute the
most significant contributors to the model’s decision pathways, outperforming
the investigated microstructural features. These findings underscore the value
of macrostructural analysis for understanding linguistic-cognitive interactions
associated with NCDs.
[COMMENTS]16 pages, 5 figures, accepted by “IEEE Journal of Selected Topics in
Signal Processing”
[LINK]http://arxiv.org/abs/2501.03727v3
[DATE]2025-10-28 03:54:43+08:00
[CATEGORIES]cs.LG
One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models
[AUTHORS]Viacheslav Surkov, Chris Wendler, Antonio Mari, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre, David Bau
[ABSTRACT]For large language models (LLMs), sparse autoencoders (SAEs) have been shown
to decompose intermediate representations, which are often not directly
interpretable, into sparse sums of interpretable features, facilitating better
control and subsequent analysis. However, similar analyses and approaches have
been lacking for text-to-image models. We investigate the possibility of using
SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image
diffusion model. To this end, we train SAEs on the updates performed by
transformer blocks within SDXL Turbo’s denoising U-net in its 1-step setting.
Interestingly, we find that they generalize to 4-step SDXL Turbo and even to
the multi-step SDXL base model (i.e., a different model) without additional
training. In addition, we show that their learned features are interpretable,
causally influence the generation process, and reveal specialization among the
blocks. We do so by creating RIEBench, a representation-based image editing
benchmark, for editing images while they are generated by turning on and off
individual SAE features. This allows us to track which transformer blocks’
features are the most impactful depending on the edit category. Our work is the
first investigation of SAEs for interpretability in text-to-image diffusion
models and our results establish SAEs as a promising approach for understanding
and manipulating the internal mechanisms of text-to-image models.
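Editing by switching individual SAE features on and off, as described above, can be sketched with a generic sparse autoencoder. The sketch assumes a ReLU encoder with a pre-encoder bias subtraction, which is one common SAE formulation. The SAE variant, the training procedure, and the way edits are injected into the SDXL Turbo blocks are not specified in the abstract, so every name and weight below is hypothetical.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, feature_scale=None):
    """Encode x into non-negative features, optionally rescale them, decode.

    Assumed architecture (not from the paper): ReLU encoder applied to
    x - b_dec, followed by a linear decoder.
    """
    f = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # feature activations
    if feature_scale is not None:
        f = f * feature_scale                         # 0 turns a feature off
    return f, f @ W_dec + b_dec

# Toy weights: two features that each copy one input coordinate.
W_enc = np.eye(2)
W_dec = np.eye(2)
b_enc = np.zeros(2)
b_dec = np.zeros(2)
x = np.array([1.0, 2.0])

features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
_, edited = sae_forward(x, W_enc, b_enc, W_dec, b_dec,
                        feature_scale=np.array([1.0, 0.0]))
```

Zeroing the second feature removes its contribution from the decoded vector while leaving the first untouched. Applied to a block's update, this is the kind of intervention a feature-level editing benchmark can measure.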
[LINK]http://arxiv.org/abs/2410.22366v5
[DATE]2025-10-28 03:52:38+08:00
[CATEGORIES]cs.LG
A Physics-informed Multi-resolution Neural Operator
[AUTHORS]Sumanta Roy, Bahador Bahmani, Ioannis G. Kevrekidis, Michael D. Shields
[ABSTRACT]The predictive accuracy of operator learning frameworks depends on the
quality and quantity of available training data (input-output function pairs),
often requiring substantial amounts of high-fidelity data, which can be
challenging to obtain in some real-world engineering applications. These
datasets may be unevenly discretized from one realization to another, with the
grid resolution varying across samples. In this study, we introduce a
physics-informed operator learning approach by extending the Resolution
Independent Neural Operator (RINO) framework to a fully data-free setup,
addressing both challenges simultaneously. Here, the arbitrarily (but
sufficiently finely) discretized input functions are projected onto a latent
embedding space (i.e., a vector space of finite dimensions), using pre-trained
basis functions. The operator associated with the underlying partial
differential equations (PDEs) is then approximated by a simple multi-layer
perceptron (MLP), which takes as input a latent code along with spatiotemporal
coordinates to produce the solution in the physical space. The PDEs are
enforced via a finite difference solver in the physical space. The proposed
method is validated and benchmarked on several numerical
examples with multi-resolution data, where input functions are sampled at
varying resolutions, including both coarse and fine discretizations.
[COMMENTS]26 pages, 14 figures, 4 tables
[LINK]http://arxiv.org/abs/2510.23810v1
[DATE]2025-10-28 03:50:02+08:00
[CATEGORIES]cs.LG
How do simple rotations affect the implicit bias of Adam?
[AUTHORS]Adela DePavia, Vasileios Charisopoulos, Rebecca Willett
[ABSTRACT]Adaptive gradient methods such as Adam and Adagrad are widely used in machine
learning, yet their effect on the generalization of learned models – relative
to methods like gradient descent – remains poorly understood. Prior work on
binary classification suggests that Adam exhibits a “richness bias,” which
can help it learn nonlinear decision boundaries closer to the Bayes-optimal
decision boundary relative to gradient descent. However, the coordinate-wise
preconditioning scheme employed by Adam renders the overall method sensitive to
orthogonal transformations of feature space. We show that this sensitivity can
manifest as a reversal of Adam’s competitive advantage: even small rotations of
the underlying data distribution can make Adam forfeit its richness bias and
converge to a linear decision boundary that is farther from the Bayes-optimal
decision boundary than the one learned by gradient descent. To alleviate this
issue, we show that a recently proposed reparameterization method – which
applies an orthogonal transformation to the optimization objective – endows
any first-order method with equivariance to data rotations, and we empirically
demonstrate its ability to restore Adam’s bias towards rich decision
boundaries.
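The sensitivity to rotations mentioned above can be seen in a single update step. In the sketch below, gradient descent commutes with a rotation of the coordinates, while the first Adam step (taken from zero moment estimates, where bias correction reduces the update to grad / (|grad| + eps)) does not. The gradient vector and rotation angle are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def gd_step(w, grad, lr=0.1):
    return w - lr * grad

def adam_first_step(w, grad, lr=0.1, eps=1e-8):
    """First Adam step from zero moments: after bias correction the moment
    estimates equal grad and grad**2, so the update is coordinate-wise
    grad / (|grad| + eps)."""
    return w - lr * grad / (np.sqrt(grad ** 2) + eps)

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

w = np.array([0.0, 0.0])
grad = np.array([1.0, 0.1])   # gradient of some loss at w

# In rotated coordinates the iterate is R @ w and the gradient is R @ grad.
gd_equivariant = np.allclose(gd_step(R @ w, R @ grad), R @ gd_step(w, grad))
adam_equivariant = np.allclose(adam_first_step(R @ w, R @ grad),
                               R @ adam_first_step(w, grad))
```

Here `gd_equivariant` is true and `adam_equivariant` is false: the coordinate-wise normalization discards the relative gradient magnitudes, so the step direction depends on the basis in which the data are expressed.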
[LINK]http://arxiv.org/abs/2510.23804v1
[DATE]2025-10-28 03:38:46+08:00
[CATEGORIES]cs.LG
CausalPFN: Amortized Causal Effect Estimation via In-Context Learning
[AUTHORS]Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, Rahul G. Krishnan
[ABSTRACT]Causal effect estimation from observational data is fundamental across
various applications. However, selecting an appropriate estimator from dozens
of specialized methods demands substantial manual effort and domain expertise.
We present CausalPFN, a single transformer that amortizes this workflow:
trained once on a large library of simulated data-generating processes that
satisfy ignorability, it infers causal effects for new observational datasets
out of the box. CausalPFN combines ideas from Bayesian causal inference with
the large-scale training protocol of prior-fitted networks (PFNs), learning to
map raw observations directly to causal effects without any task-specific
adjustment. Our approach achieves superior average performance on heterogeneous
and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC).
Moreover, it shows competitive performance for real-world policy making on
uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to
support reliable decision-making based on Bayesian principles. This
ready-to-use model requires no further training or tuning and takes a step
toward automated causal inference (https://github.com/vdblm/CausalPFN/).
[LINK]http://arxiv.org/abs/2506.07918v2
[DATE]2025-10-28 03:38:29+08:00
[CATEGORIES]cs.LG
FoGE: Fock Space inspired encoding for graph prompting
[AUTHORS]Sotirios Panagiotis Chytas, Rudrasis Chakraborty, Vikas Singh
[ABSTRACT]Recent results show that modern Large Language Models (LLMs) are indeed
capable of understanding and answering questions about structured data such as
graphs. This new paradigm can lead to solutions that require less supervision
while, at the same time, providing a model that can generalize and answer
questions beyond the training labels. Existing proposals often use some
description of the graph to create an “augmented” prompt fed to the LLM. For
a chosen class of graphs, if a well-tailored graph encoder is deployed to play
together with a pre-trained LLM, the model can answer graph-related questions
well. Existing solutions to graph-based prompts range from graph serialization
to graph transformers. In this work, we show that the use of a parameter-free
graph encoder based on Fock space representations, a concept borrowed from
mathematical physics, is remarkably versatile in this problem setting. The
simple construction, inherited directly from the theory with a few small
adjustments, can provide rich and informative graph encodings, for a wide range
of different graphs. We investigate the use of this idea for prefix-tuned
prompts leveraging the capabilities of a pre-trained, frozen LLM. The
modifications lead to a model that can answer graph-related questions – from
simple graphs to proteins to hypergraphs – effectively and with minimal, if
any, adjustments to the architecture. Our work significantly simplifies
existing solutions and generalizes well to multiple different graph-based
structures effortlessly.
[LINK]http://arxiv.org/abs/2507.02937v2
[DATE]2025-10-28 03:36:33+08:00
[CATEGORIES]cs.LG
Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
[AUTHORS]Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow
[ABSTRACT]While sparse autoencoders (SAEs) successfully extract interpretable features
from language models, applying them to audio generation faces unique
challenges: audio’s dense nature requires compression that obscures semantic
meaning, and automatic feature characterization remains limited. We propose a
framework for interpreting audio generative models by mapping their latent
representations to human-interpretable acoustic concepts. We train SAEs on
audio autoencoder latents, then learn linear mappings from SAE features to
discretized acoustic properties (pitch, amplitude, and timbre). This enables
both controllable manipulation and analysis of the AI music generation process,
revealing how acoustic properties emerge during synthesis. We validate our
approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer)
audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music
model, to demonstrate how pitch, timbre, and loudness evolve throughout
generation. Although this work addresses only the audio modality, the framework
can be extended to interpretable analysis of visual latent-space generation models.
[COMMENTS]Accepted to NeurIPS 2025 Mechanistic Interpretability Workshop
[LINK]http://arxiv.org/abs/2510.23802v1
[DATE]2025-10-28 03:35:39+08:00
[CATEGORIES]cs.LG
Distilled Protein Backbone Generation
[AUTHORS]Liyang Xie, Haoran Zhang, Zhendong Wang, Wesley Tansey, Mingyuan Zhou
[ABSTRACT]Diffusion- and flow-based generative models have recently demonstrated strong
performance in protein backbone generation tasks, offering unprecedented
capabilities for de novo protein design. However, while achieving notable
performance in generation quality, these models are limited by their generation
speed, often requiring hundreds of iterative steps in the reverse-diffusion
process. This computational bottleneck limits their practical utility in
large-scale protein discovery, where thousands to millions of candidate
structures are needed. To address this challenge, we explore the techniques of
score distillation, which has shown great success in reducing the number of
sampling steps in the vision domain while maintaining high generation quality.
However, a straightforward adaptation of these methods results in unacceptably
low designability. Through extensive study, we have identified how to
appropriately adapt Score identity Distillation (SiD), a state-of-the-art score
distillation strategy, to train few-step protein backbone generators which
significantly reduce sampling time, while maintaining comparable performance to
their pretrained teacher model. In particular, multistep generation combined
with inference time noise modulation is key to the success. We demonstrate that
our distilled few-step generators achieve more than a 20-fold improvement in
sampling speed, while achieving similar levels of designability, diversity, and
novelty as the Proteina teacher model. This reduction in inference cost enables
large-scale in silico protein design, thereby bringing diffusion-based models
closer to real-world protein engineering applications. The PyTorch
implementation is available at https://github.com/LY-Xie/SiD_Protein
[COMMENTS]PyTorch implementation: https://github.com/LY-Xie/SiD_Protein
[LINK]http://arxiv.org/abs/2510.03095v3
[DATE]2025-10-28 03:32:07+08:00
[CATEGORIES]cs.LG
Revealing the Potential of Learnable Perturbation Ensemble Forecast Model for Tropical Cyclone Prediction
[AUTHORS]Jun Liu, Tao Zhou, Jiarui Li, Xiaohui Zhong, Peng Zhang, Jie Feng, Lei Chen, Hao Li
[ABSTRACT]Tropical cyclones (TCs) are highly destructive and inherently uncertain
weather systems. Ensemble forecasting helps quantify these uncertainties, yet
traditional systems are constrained by high computational costs and limited
capability to fully represent atmospheric nonlinearity. FuXi-ENS introduces a
learnable perturbation scheme for ensemble generation, representing a novel
AI-based forecasting paradigm. Here, we systematically compare FuXi-ENS with
ECMWF-ENS using all 90 global TCs in 2018, examining their performance in
TC-related physical variables, track and intensity forecasts, and the
associated dynamical and thermodynamical fields. FuXi-ENS demonstrates clear
advantages in predicting TC-related physical variables, and achieves more
accurate track forecasts with reduced ensemble spread, though it still
underestimates intensity relative to observations. Further dynamical and
thermodynamical analyses reveal that FuXi-ENS better captures large-scale
circulation, with moisture turbulent energy more tightly concentrated around
the TC warm core, whereas ECMWF-ENS exhibits a more dispersed distribution.
These findings highlight the potential of learnable perturbations to improve TC
forecasting skill and provide valuable insights for advancing AI-based ensemble
prediction of extreme weather events that have significant societal impacts.
[COMMENTS]30 pages, 21 figures, 1 table
[LINK]http://arxiv.org/abs/2510.23794v1
[DATE]2025-10-28 03:27:04+08:00
[CATEGORIES]cs.LG
Apollo: A Posteriori Label-Only Membership Inference Attack Towards Machine Unlearning
[AUTHORS]Liou Tang, James Joshi, Ashish Kundu
[ABSTRACT]Machine Unlearning (MU) aims to update Machine Learning (ML) models following
requests to remove training samples and their influences on a trained model
efficiently without retraining the original ML model from scratch. While MU
itself has been employed to provide privacy protection and regulatory
compliance, it can also increase the attack surface of the model. Existing
privacy inference attacks towards MU that aim to infer properties of the
unlearned set rely on the weaker threat model that assumes the attacker has
access to both the unlearned model and the original model, limiting their
feasibility in real-life scenarios. We propose a novel privacy attack, A
Posteriori Label-Only Membership Inference Attack towards MU, Apollo, that
infers whether a data sample has been unlearned, following a strict threat
model where an adversary has access to the label-output of the unlearned model
only. We demonstrate that our proposed attack, while requiring less access to
the target model compared to previous attacks, can achieve relatively high
precision on the membership status of the unlearned samples.
[LINK]http://arxiv.org/abs/2506.09923v2
[DATE]2025-10-28 03:22:43+08:00
[CATEGORIES]cs.LG
Relaxed Sequence Sampling for Diverse Protein Design
[AUTHORS]Joohwan Ko, Aristofanis Rontogiannis, Yih-En Andrew Ban, Axel Elaldi, Nicholas Franklin
[ABSTRACT]Protein design using structure prediction models such as AlphaFold2 has shown
remarkable success, but existing approaches like relaxed sequence optimization
(RSO) rely on single-path gradient descent and ignore sequence-space
constraints, limiting diversity and designability. We introduce Relaxed
Sequence Sampling (RSS), a Markov chain Monte Carlo (MCMC) framework that
integrates structural and evolutionary information for protein design. RSS
operates in continuous logit space, combining gradient-guided exploration with
protein language model-informed jumps. Its energy function couples
AlphaFold2-derived structural objectives with ESM2-derived sequence priors,
balancing accuracy and biological plausibility. In an in silico protein binder
design task, RSS produces 5$\times$ more designable structures and 2-3$\times$
greater structural diversity than RSO baselines, at equal computational cost.
These results highlight RSS as a principled approach for efficiently exploring
the protein design landscape.
[LINK]http://arxiv.org/abs/2510.23786v1
[DATE]2025-10-28 03:18:36+08:00
[CATEGORIES]cs.LG
PPFL-RDSN: Privacy-Preserving Federated Learning-based Residual Dense Spatial Networks for Encrypted Lossy Image Reconstruction
[AUTHORS]Peilin He, James Joshi
[ABSTRACT]Reconstructing high-quality images from low-resolution inputs using Residual
Dense Spatial Networks (RDSNs) is crucial yet challenging. It is even more
challenging in centralized training where multiple collaborating parties are
involved, as it poses significant privacy risks, including data leakage and
inference attacks, as well as high computational and communication costs. We
propose a novel Privacy-Preserving Federated Learning-based RDSN (PPFL-RDSN)
framework specifically tailored for encrypted lossy image reconstruction.
PPFL-RDSN integrates Federated Learning (FL), local differential privacy, and
robust model watermarking techniques to ensure that data remains secure on
local clients/devices, safeguards privacy-sensitive information, and maintains
model authenticity without revealing underlying data. Empirical evaluations
show that PPFL-RDSN achieves comparable performance to the state-of-the-art
centralized methods while reducing computational burdens, and effectively
mitigates security and privacy vulnerabilities, making it a practical solution
for secure and privacy-preserving collaborative computer vision applications.
[COMMENTS]Accepted for publication at the 7th IEEE International Conference on
Trust, Privacy and Security in Intelligent Systems, and Applications, Nov.
11-14, 2025, Pittsburgh, PA, USA.
https://www.sis.pitt.edu/lersais/conference/tps/2025/
[LINK]http://arxiv.org/abs/2507.00230v3
[DATE]2025-10-28 03:09:31+08:00
[CATEGORIES]cs.LG
Evaluating In Silico Creativity: An Expert Review of AI Chess Compositions
[AUTHORS]Vivek Veeriah, Federico Barbero, Marcus Chiam, Xidong Feng, Michael Dennis, Ryan Pachauri, Thomas Tumiel, Johan Obando-Ceron, Jiaxin Shi, Shaobo Hou, Satinder Singh, Nenad Tomašev, Tom Zahavy
[COMMENTS]Accepted at the Creative AI Track, NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23772v1
[DATE]2025-10-28 03:00:02+08:00
[CATEGORIES]cs.LG
Practical Bayes-Optimal Membership Inference Attacks
[AUTHORS]Marcus Lassila, Johan Östman, Khac-Hoang Ngo, Alexandre Graell i Amat
[ABSTRACT]We develop practical and theoretically grounded membership inference attacks
(MIAs) against both independent and identically distributed (i.i.d.) data and
graph-structured data. Building on the Bayesian decision-theoretic framework of
Sablayrolles et al., we derive the Bayes-optimal membership inference rule for
node-level MIAs against graph neural networks, addressing key open questions
about optimal query strategies in the graph setting. We introduce BASE and
G-BASE, tractable approximations of the Bayes-optimal membership inference.
G-BASE achieves superior performance compared to previously proposed
classifier-based node-level MIAs. BASE, which is also applicable to
non-graph data, matches or exceeds the performance of prior state-of-the-art
MIAs, such as LiRA and RMIA, at a significantly lower computational cost.
Finally, we show that BASE and RMIA are equivalent under a specific
hyperparameter setting, providing a principled, Bayes-optimal justification for
the RMIA attack.
[COMMENTS]In NeurIPS 2025, 10 pages plus 15 pages of appendices
[LINK]http://arxiv.org/abs/2505.24089v2
[DATE]2025-10-28 02:58:38+08:00
[CATEGORIES]cs.LG
Explaining Robustness to Catastrophic Forgetting Through Incremental Concept Formation
[AUTHORS]Nicki Barari, Edward Kim, Christopher MacLellan
[ABSTRACT]Catastrophic forgetting remains a central challenge in continual learning,
where models are required to integrate new knowledge over time without losing
what they have previously learned. In prior work, we introduced Cobweb/4V, a
hierarchical concept formation model that exhibited robustness to catastrophic
forgetting in visual domains. Motivated by this robustness, we examine three
hypotheses regarding the factors that contribute to such stability: (1)
adaptive structural reorganization enhances knowledge retention, (2) sparse and
selective updates reduce interference, and (3) information-theoretic learning
based on sufficiency statistics provides advantages over gradient-based
backpropagation. To test these hypotheses, we compare Cobweb/4V with neural
baselines, including CobwebNN, a neural implementation of the Cobweb framework
introduced in this work. Experiments on datasets of varying complexity (MNIST,
Fashion-MNIST, MedMNIST, and CIFAR-10) show that adaptive restructuring
enhances learning plasticity, sparse updates help mitigate interference, and
the information-theoretic learning process preserves prior knowledge without
revisiting past data. Together, these findings provide insight into mechanisms
that can mitigate catastrophic forgetting and highlight the potential of
concept-based, information-theoretic approaches for building stable and
adaptive continual learning systems.
[COMMENTS]18 pages, 5 figures, Advances in Cognitive Systems 2025
[LINK]http://arxiv.org/abs/2510.23756v1
[DATE]2025-10-28 02:41:25+08:00
[CATEGORIES]cs.LG
Debiasing Reward Models by Representation Learning with Guarantees
[AUTHORS]Ignavier Ng, Patrick Blöbaum, Siddharth Bhandari, Kun Zhang, Shiva Kasiviswanathan
[ABSTRACT]Recent alignment techniques, such as reinforcement learning from human
feedback, have been widely adopted to align large language models with human
preferences by learning and leveraging reward models. In practice, these models
often exploit spurious correlations, involving, e.g., response length,
discrimination, sycophancy, and conceptual bias, which is a problem that has
received increasing attention. In this work, we propose a principled framework
that mitigates these biases in reward models while preserving the underlying
factors that reflect intended preferences. We first provide a formulation of
the data-generating process, assuming that the observed data (e.g., text) is
generated from both spurious and non-spurious latent variables. We show that,
interestingly, these non-spurious latent variables can be theoretically
identified from data, regardless of whether a surrogate for the spurious latent
variables is available. This further inspires a practical method that uses
variational inference to recover these variables and leverages them to train
reward models. Experiments on synthetic and real-world datasets demonstrate
that our method effectively mitigates spurious correlation issues and yields
more robust reward models.
[LINK]http://arxiv.org/abs/2510.23751v1
[DATE]2025-10-28 02:37:57+08:00
[CATEGORIES]cs.LG
Re-envisioning Euclid Galaxy Morphology: Identifying and Interpreting Features with Sparse Autoencoders
[AUTHORS]John F. Wu, Michael Walmsley
[ABSTRACT]Sparse Autoencoders (SAEs) can efficiently identify candidate monosemantic
features from pretrained neural networks for galaxy morphology. We demonstrate
this on Euclid Q1 images using both supervised (Zoobot) and new self-supervised
(MAE) models. Our publicly released MAE achieves superhuman image
reconstruction performance. While a Principal Component Analysis (PCA) on the
supervised model primarily identifies features already aligned with the Galaxy
Zoo decision tree, SAEs can identify interpretable features outside of this
framework. SAE features also show stronger alignment than PCA with Galaxy Zoo
labels. Although challenges in interpretability remain, SAEs provide a powerful
engine for discovering astrophysical phenomena beyond the confines of
human-defined classification.
[COMMENTS]Accepted to NeurIPS Machine Learning and the Physical Sciences
Workshop
[LINK]http://arxiv.org/abs/2510.23749v1
[DATE]2025-10-28 02:28:56+08:00
[CATEGORIES]cs.LG
Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra
[AUTHORS]Laura Mismetti, Marvin Alberts, Andreas Krause, Mara Graziani
[ABSTRACT]Tandem Mass Spectrometry enables the identification of unknown compounds in
crucial fields such as metabolomics, natural product discovery and
environmental analysis. However, current methods rely on database matching from
previously observed molecules, or on multi-step pipelines that require
intermediate fragment or fingerprint prediction. This makes finding the correct
molecule highly challenging, particularly for compounds absent from reference
databases. We introduce a framework that, by leveraging test-time tuning,
enhances the learning of a pre-trained transformer model to address this gap,
enabling end-to-end de novo molecular structure generation directly from the
tandem mass spectra and molecular formulae, bypassing manual annotations and
intermediate steps. We surpass the de facto state-of-the-art approach DiffMS on
two popular benchmarks, NPLIB1 and MassSpecGym, by 100% and 20%, respectively.
Test-time tuning on experimental spectra allows the model to dynamically adapt
to novel spectra, and the relative performance gain over conventional
fine-tuning is 62% on MassSpecGym. When predictions deviate from the ground
truth, the generated molecular candidates remain structurally accurate,
providing valuable guidance for human interpretation and more reliable
identification.
[LINK]http://arxiv.org/abs/2510.23746v1
[DATE]2025-10-28 02:25:36+08:00
[CATEGORIES]cs.LG
Bayesian neural networks with interpretable priors from Mercer kernels
[AUTHORS]Alex Alberts, Ilias Bilionis
[ABSTRACT]Quantifying the uncertainty in the output of a neural network is essential
for deployment in scientific or engineering applications where decisions must
be made under limited or noisy data. Bayesian neural networks (BNNs) provide a
framework for this purpose by constructing a Bayesian posterior distribution
over the network parameters. However, the prior, which is of key importance in
any Bayesian setting, is rarely meaningful for BNNs. This is because the
complexity of the input-to-output map of a BNN makes it difficult to understand
how certain distributions enforce any interpretable constraint on the output
space. Gaussian processes (GPs), on the other hand, are often preferred in
uncertainty quantification tasks due to their interpretability. The drawback is
that GPs are limited to small datasets without advanced techniques, which often
rely on the covariance kernel having a specific structure. To address these
challenges, we introduce a new class of priors for BNNs, called Mercer priors,
such that samples from the resulting BNN approximate those of a specified
GP. The method works by defining a prior directly over the network parameters
from the Mercer representation of the covariance kernel, and does not rely on
the network having a specific structure. In doing so, we can exploit the
scalability of BNNs in a meaningful Bayesian way.
[LINK]http://arxiv.org/abs/2510.23745v1
[DATE]2025-10-28 02:25:21+08:00
[CATEGORIES]cs.LG
Structured Reinforcement Learning for Combinatorial Decision-Making
[AUTHORS]Heiko Hoppe, Léo Baty, Louis Bouvier, Axel Parmentier, Maximilian Schiffer
[ABSTRACT]Reinforcement learning (RL) is increasingly applied to real-world problems
involving complex and structured decisions, such as routing, scheduling, and
assortment planning. These settings challenge standard RL algorithms, which
struggle to scale, generalize, and exploit structure in the presence of
combinatorial action spaces. We propose Structured Reinforcement Learning
(SRL), a novel actor-critic paradigm that embeds combinatorial
optimization layers into the actor neural network. We enable end-to-end
learning of the actor via Fenchel-Young losses and provide a geometric
interpretation of SRL as a primal-dual algorithm in the dual of the moment
polytope. Across six environments with exogenous and endogenous uncertainty,
SRL matches or surpasses the performance of unstructured RL and imitation
learning on static tasks and improves over these baselines by up to 92% on
dynamic problems, with improved stability and convergence speed.
[COMMENTS]29 pages, 8 figures, accepted at the 39th Annual Conference on Neural
Information Processing Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2505.19053v2
[DATE]2025-10-28 02:03:20+08:00
[CATEGORIES]cs.LG
Geometry matters: insights from Ollivier Ricci Curvature and Ricci Flow into representational alignment through Ollivier-Ricci Curvature and Ricci Flow
[AUTHORS]Nahid Torbati, Michael Gaebler, Simon M. Hofmann, Nico Scherf
[ABSTRACT]Representational similarity analysis (RSA) is widely used to analyze the
alignment between humans and neural networks; however, conclusions based on
this approach can be misleading without considering the underlying
representational geometry. Our work introduces a framework using Ollivier Ricci
Curvature and Ricci Flow to analyze the fine-grained local structure of
representations. This approach is agnostic to the source of the
representational space, enabling a direct geometric comparison between human
behavioral judgments and a model’s vector embeddings. We apply it to compare
human similarity judgments for 2D and 3D face stimuli with a baseline 2D native
network (VGG-Face) and a variant of it aligned to human behavior. Our results
suggest that geometry-aware analysis provides a more sensitive characterization
of discrepancies and geometric dissimilarities in the underlying
representations that remain only partially captured by RSA. Notably, we reveal
geometric inconsistencies in the alignment when moving from 2D to 3D viewing
conditions. This highlights how incorporating geometric information can expose
alignment differences missed by traditional metrics, offering deeper insight
into representational organization.
[COMMENTS]Presented at NeuReps workshop, NeurIPS 2024
[LINK]http://arxiv.org/abs/2501.00919v2
[DATE]2025-10-28 02:01:43+08:00
[CATEGORIES]cs.LG
In Search of the Unknown Unknowns: A Multi-Metric Distance Ensemble for Out of Distribution Anomaly Detection in Astronomical Surveys
[AUTHORS]Siddharth Chaini, Federica B. Bianco, Ashish Mahabal
[ABSTRACT]Distance-based methods involve the computation of distance values between
features and are a well-established paradigm in machine learning. In anomaly
detection, anomalies are identified by their large distance from normal data
points. However, the performance of these methods often hinges on a single,
user-selected distance metric (e.g., Euclidean), which may not be optimal for
the complex, high-dimensional feature spaces common in astronomy. Here, we
introduce a novel anomaly detection method, Distance Multi-Metric Anomaly
Detection (DiMMAD), which uses an ensemble of distance metrics to find
novelties.
Using multiple distance metrics is effectively equivalent to using different
geometries in the feature space. By using a robust ensemble of diverse distance
metrics, we overcome the metric-selection problem, creating an anomaly score
that is not reliant on any single definition of distance. We demonstrate this
multi-metric approach as a tool for simple, interpretable scientific discovery
on astronomical time series – (1) with simulated data for the upcoming Vera C.
Rubin Observatory Legacy Survey of Space and Time, and (2) real data from the
Zwicky Transient Facility.
We find that DiMMAD excels at out-of-distribution anomaly detection –
anomalies in the data that might be new classes – and beats other
state-of-the-art methods in the goal of maximizing the diversity of new classes
discovered. For rare in-distribution anomaly detection, DiMMAD performs
similarly to other methods, but may allow for improved interpretability. All
our code is open source: DiMMAD is implemented within DistClassiPy:
https://github.com/sidchaini/distclassipy/, while all code to reproduce the
results of this paper is available here: https://github.com/sidchaini/dimmad/.
[COMMENTS]9 pages, 5 figures, Accepted at the 2025 Machine Learning and the
Physical Sciences (ML4PS) workshop at NeurIPS
[LINK]http://arxiv.org/abs/2510.23702v1
[DATE]2025-10-28 02:00:00+08:00
[CATEGORIES]cs.LG
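As an illustration of the multi-metric idea described in the DiMMAD abstract above, the following is a minimal sketch of an anomaly score that does not depend on a single distance definition. It is not the DiMMAD algorithm and does not use the DistClassiPy API. The three metrics, the nearest-reference distance, the rank normalization, and the names `ensemble_scores`, `reference`, and `queries` are all assumptions made for this example.

```python
import math

# Three simple distance metrics (an assumed, illustrative choice).
def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

METRICS = (euclidean, manhattan, chebyshev)

def ensemble_scores(reference, queries, metrics=METRICS):
    """Average, over metrics, of each query's rank-normalized distance
    to its nearest reference point. Higher means more anomalous."""
    n = len(queries)
    totals = [0.0] * n
    for metric in metrics:
        # Distance from each query to the closest "normal" reference point.
        nearest = [min(metric(q, r) for r in reference) for q in queries]
        # Convert distances to ranks in [0, 1] so metrics with different
        # scales contribute equally to the ensemble.
        order = sorted(range(n), key=lambda i: nearest[i])
        for rank, i in enumerate(order):
            totals[i] += rank / (n - 1) if n > 1 else 0.0
    return [t / len(metrics) for t in totals]

# Hypothetical toy data: a tight cluster of normal points and three queries.
reference = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
queries = [(0.05, 0.05), (0.3, 0.1), (3.0, 3.0)]
scores = ensemble_scores(reference, queries)
```

In this toy setup the far-away query `(3.0, 3.0)` receives the highest score under every metric, so it also tops the ensemble. With real features the metrics can disagree, which is where averaging over them matters.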
Track, Inpaint, Resplat: Subject-driven 3D and 4D Generation with Progressive Texture Infilling
[AUTHORS]Shuhong Zheng, Ashkan Mirzaei, Igor Gilitschenski
[ABSTRACT]Current 3D/4D generation methods are usually optimized for photorealism,
efficiency, and aesthetics. However, they often fail to preserve the semantic
identity of the subject across different viewpoints. Adapting generation
methods with one or few images of a specific subject (also known as
Personalization or Subject-driven generation) allows generating visual content
that aligns with the identity of the subject. However, personalized 3D/4D
generation is still largely underexplored. In this work, we introduce TIRE
(Track, Inpaint, REsplat), a novel method for subject-driven 3D/4D generation.
It takes an initial 3D asset produced by an existing 3D generative model as
input and uses video tracking to identify the regions that need to be modified.
Then, we adopt a subject-driven 2D inpainting model for progressively infilling
the identified regions. Finally, we resplat the modified 2D multi-view
observations back to 3D while still maintaining consistency. Extensive
experiments demonstrate that our approach significantly improves identity
preservation in 3D/4D generation compared to state-of-the-art methods. Our
project website is available at
https://zsh2000.github.io/track-inpaint-resplat.github.io/.
[COMMENTS]NeurIPS 2025, 38 pages, 22 figures
[LINK]http://arxiv.org/abs/2510.23605v1
[DATE]2025-10-28 01:59:51+08:00
[CATEGORIES]cs.LG
UNDREAM: Bridging Differentiable Rendering and Photorealistic Simulation for End-to-end Adversarial Attacks
[AUTHORS]Mansi Phute, Matthew Hull, Haoran Wang, Alec Helbling, ShengYun Peng, Willian Lunardi, Martin Andreoni, Wenke Lee, Duen Horng Chau
[ABSTRACT]Deep learning models deployed in safety-critical applications like autonomous
driving use simulations to test their robustness against adversarial attacks in
realistic conditions. However, these simulations are non-differentiable,
forcing researchers to create attacks that do not integrate simulation
environmental factors, reducing attack success. To address this limitation, we
introduce UNDREAM, the first software framework that bridges the gap between
photorealistic simulators and differentiable renderers to enable end-to-end
optimization of adversarial perturbations on any 3D object. UNDREAM enables
manipulation of the environment by offering complete control over weather,
lighting, backgrounds, camera angles, trajectories, and realistic human and
object movements, thereby allowing the creation of diverse scenes. We showcase
a wide array of distinct physically plausible adversarial objects that UNDREAM
enables researchers to swiftly explore in different configurable environments.
This combination of photorealistic simulation and differentiable optimization
opens new avenues for advancing research on physical adversarial attacks.
[LINK]http://arxiv.org/abs/2510.16923v2
[DATE]2025-10-28 01:59:01+08:00
[CATEGORIES]cs.LG
Lightweight Robust Direct Preference Optimization
[AUTHORS]Cheol Woo Kim, Shresth Verma, Mauricio Tec, Milind Tambe
[ABSTRACT]Direct Preference Optimization (DPO) has become a popular method for
fine-tuning large language models (LLMs) due to its stability and simplicity.
However, it is also known to be sensitive to noise in the data and prone to
overfitting. Recent works have proposed using distributionally robust
optimization (DRO) to address potential noise and distributional shift in the
data. However, these methods often suffer from excessive conservatism and high
computational cost. We propose DPO-PRO (DPO with Preference Robustness), a
robust fine-tuning algorithm based on DPO which accounts for uncertainty in the
preference distribution through a lightweight DRO formulation. Unlike prior
DRO-based variants, DPO-PRO focuses solely on uncertainty in preferences,
avoiding unnecessary conservatism and incurring negligible computational
overhead. We further show that DPO-PRO is equivalent to a regularized DPO
objective that penalizes model overconfidence under weak preference signals. We
evaluate DPO-PRO on standard alignment benchmarks and a real-world public
health task. Experimental results show that our method consistently improves
robustness to noisy preference signals compared to existing DPO variants.
[COMMENTS]arXiv admin note: substantial text overlap with arXiv:2509.02709
[LINK]http://arxiv.org/abs/2510.23590v1
[DATE]2025-10-28 01:55:06+08:00
[CATEGORIES]cs.LG
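For context on the DPO-PRO abstract above, here is a minimal sketch of the standard per-example DPO loss that such variants build on. It shows only the widely used baseline objective, not DPO-PRO or its DRO formulation. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for this example.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trained policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy prefers the chosen response
# in the first call and the rejected response in the second.
loss_aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_misaligned = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference does. When the policy matches the reference exactly, the margin is zero and the loss equals log 2.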
R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models
[AUTHORS]Aladin Djuhera, Vlad C. Andrei, Xinyang Li, Ullrich J. Mönich, Holger Boche, Walid Saad
[ABSTRACT]Split federated learning (SFL) is a compute-efficient paradigm in distributed
machine learning (ML), where components of large ML models are outsourced to
remote servers. A significant challenge in SFL, particularly when deployed over
wireless channels, is the susceptibility of transmitted model parameters to
adversarial jamming that could jeopardize the learning process. This is
particularly pronounced for embedding parameters in large language models
(LLMs) and vision language models (VLMs), which are learned feature vectors
essential for domain understanding. In this paper, rigorous insights are
provided into the influence of jamming embeddings in SFL by deriving an
expression for the ML training loss divergence and showing that it is
upper-bounded by the mean squared error (MSE). Based on this analysis, a
physical layer framework is developed for resilient SFL with LLMs (R-SFLLM)
over wireless networks. R-SFLLM leverages wireless sensing data to gather
information on the jamming directions-of-arrival (DoAs) for the purpose of
devising a novel, sensing-assisted anti-jamming strategy while jointly
optimizing beamforming, user scheduling, and resource allocation. Extensive
experiments using both LLMs and VLMs demonstrate R-SFLLM’s effectiveness,
achieving close-to-baseline performance across various natural language
processing (NLP) and computer vision (CV) tasks, datasets, and modalities. The
proposed methodology further introduces an adversarial training component,
where controlled noise exposure significantly enhances the model’s resilience
to perturbed parameters during training. The results show that more
noise-sensitive models, such as RoBERTa, benefit from this feature, especially
when resource allocation is unfair. It is also shown that worst-case jamming in
particular translates into worst-case model outcomes, thereby necessitating
jamming-resilient SFL protocols.
[LINK]http://arxiv.org/abs/2407.11654v3
[DATE]2025-10-28 01:52:25+08:00
[CATEGORIES]cs.LG
ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
[AUTHORS]Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser-Nam Lim, Ziyang Li, Mayur Naik
[ABSTRACT]Multi-modal large language models (MLLMs) are making rapid progress toward
general-purpose embodied agents. However, existing MLLMs do not reliably
capture fine-grained links between low-level visual features and high-level
textual semantics, leading to weak grounding and inaccurate perception. To
overcome this challenge, we propose ESCA, a framework that contextualizes
embodied agents by grounding their perception in spatial-temporal scene graphs.
At its core is SGCLIP, a novel, open-domain, promptable foundation model for
generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+
open-domain videos using a neurosymbolic pipeline that aligns automatically
generated captions with scene graphs produced by the model itself, eliminating
the need for human-labeled annotations. We demonstrate that SGCLIP excels in
both prompt-based inference and task-specific fine-tuning, achieving
state-of-the-art results on scene graph generation and action localization
benchmarks. ESCA with SGCLIP improves perception for embodied agents based on
both open-source and commercial MLLMs, achieving state-of-the-art performance
across two embodied environments. Notably, ESCA significantly reduces agent
perception errors and enables open-source models to surpass proprietary
baselines. We release the source code for SGCLIP model training at
https://github.com/video-fm/LASER and for the embodied agent at
https://github.com/video-fm/ESCA.
[COMMENTS]Accepted as a Spotlight Paper at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.15963v2
[DATE]2025-10-28 01:51:21+08:00
[CATEGORIES]cs.LG
Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation
[AUTHORS]Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen
[ABSTRACT]Audio-driven human animation models often suffer from identity drift during
temporal autoregressive generation, where characters gradually lose their
identity over time. One solution is to generate keyframes as intermediate
temporal anchors that prevent degradation, but this requires an additional
keyframe generation stage and can restrict natural motion dynamics. To address
this, we propose Lookahead Anchoring, which leverages keyframes from future
timesteps ahead of the current generation window, rather than within it. This
transforms keyframes from fixed boundaries into directional beacons: the model
continuously pursues these future anchors while responding to immediate audio
cues, maintaining consistent identity through persistent guidance. This also
enables self-keyframing, where the reference image serves as the lookahead
target, eliminating the need for keyframe generation entirely. We find that the
temporal lookahead distance naturally controls the balance between expressivity
and consistency: larger distances allow for greater motion freedom, while
smaller ones strengthen identity adherence. When applied to three recent human
animation models, Lookahead Anchoring achieves superior lip synchronization,
identity preservation, and visual quality, demonstrating improved temporal
conditioning across several different architectures. Video results are
available at the following link: https://lookahead-anchoring.github.io.
[COMMENTS]Project page: https://lookahead-anchoring.github.io
[LINK]http://arxiv.org/abs/2510.23581v1
[DATE]2025-10-28 01:50:19+08:00
[CATEGORIES]cs.LG
RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation
[AUTHORS]Yash Jangir, Yidi Zhang, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki
[ABSTRACT]The pursuit of robot generalists - instructable agents capable of performing
diverse tasks across diverse environments - demands rigorous and scalable
evaluation. Yet real-world testing of robot policies remains fundamentally
constrained: it is labor-intensive, slow, unsafe at scale, and difficult to
reproduce. Existing simulation benchmarks are similarly limited, as they train
and test policies within the same synthetic domains and cannot assess models
trained from real-world demonstrations or alternative simulation environments.
As policies expand in scope and complexity, these barriers only intensify,
since defining “success” in robotics often hinges on nuanced human judgments of
execution quality. In this paper, we introduce a new benchmarking framework
that overcomes these challenges by shifting VLA evaluation into large-scale
simulated environments augmented with online human feedback. Leveraging
advances in vision-language models, 2D-to-3D generative modeling, and
differentiable rendering, our approach automatically converts video
demonstrations from widely used robot datasets into simulated counterparts.
Within these digital twins, we assess VLA policies using both automated
VLM-guided scoring and scalable human preference judgments collected from
crowdworkers, transforming human involvement from tedious scene setup,
resetting, and safety supervision into lightweight preference comparisons. To
measure robustness, we systematically perturb simulated environments along
multiple axes, such as textures and object placements, stress-testing policy
generalization under controlled variation. The result is a continuously
evolving, reproducible, and scalable benchmark for real-world trained robot
manipulation policies, addressing a critical missing capability in today’s
robotics landscape.
[COMMENTS]Website: https://robotarenainf.github.io
[LINK]http://arxiv.org/abs/2510.23571v1
[DATE]2025-10-28 01:41:38+08:00
[CATEGORIES]cs.LG
Minimizing Human Intervention in Online Classification
[AUTHORS]William Réveillard, Vasileios Saketos, Alexandre Proutiere, Richard Combes
[ABSTRACT]We introduce and study an online problem arising in question answering
systems. In this problem, an agent must sequentially classify user-submitted
queries represented by $d$-dimensional embeddings drawn i.i.d. from an unknown
distribution. The agent may consult a costly human expert for the correct
label, or guess on her own without receiving feedback. The goal is to minimize
regret against an oracle with free expert access. When the time horizon $T$ is
at least exponential in the embedding dimension $d$, one can learn the geometry
of the class regions: in this regime, we propose the Conservative Hull-based
Classifier (CHC), which maintains convex hulls of expert-labeled queries and
calls the expert as soon as a query lands outside all known hulls. CHC attains
$\mathcal{O}(\log^d T)$ regret in $T$ and is minimax optimal for $d=1$.
Otherwise, the geometry cannot be reliably learned without additional
distributional assumptions. We show that when the queries are drawn from a
subgaussian mixture, for $T \le e^d$, a Center-based Classifier (CC) achieves
regret proportional to $N\log{N}$ where $N$ is the number of labels. To bridge
these regimes, we introduce the Generalized Hull-based Classifier (GHC), a
practical extension of CHC that allows for more aggressive guessing via a
tunable threshold parameter. Our approach is validated with experiments,
notably on real-world question-answering datasets using embeddings derived from
state-of-the-art large language models.
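As a rough illustration of the hull-based rule described above, the sketch
below specializes to $d=1$, where a convex hull is simply an interval. The
class name `IntervalCHC`, the `expert` callback, and the toy ground truth are
assumptions made for illustration; this is not the authors' implementation.

```python
from typing import Callable, Dict, Tuple


class IntervalCHC:
    """Toy 1-D conservative hull-based classifier (hypothetical sketch)."""

    def __init__(self, expert: Callable[[float], str]) -> None:
        self.expert = expert  # costly oracle returning the true label
        self.hulls: Dict[str, Tuple[float, float]] = {}  # label -> (lo, hi)
        self.expert_calls = 0

    def classify(self, x: float) -> str:
        # Guess without feedback when x lies inside a known hull.
        for label, (lo, hi) in self.hulls.items():
            if lo <= x <= hi:
                return label
        # Otherwise consult the expert and grow that label's hull.
        label = self.expert(x)
        self.expert_calls += 1
        lo, hi = self.hulls.get(label, (x, x))
        self.hulls[label] = (min(lo, x), max(hi, x))
        return label


def demo_expert(x: float) -> str:
    # Assumed ground truth: two classes split at zero.
    return "neg" if x < 0 else "pos"


clf = IntervalCHC(demo_expert)
queries = [-2.0, 3.0, -1.0, 1.0, -3.0, 2.5]
predictions = [clf.classify(q) for q in queries]
```

In this toy run only the last query falls inside an existing interval, so it
is the only one answered without the expert. If the true class regions are
convex, guesses made inside a hull cannot be wrong, which is what makes the
rule conservative.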
[COMMENTS]49 pages, 8 figures
[LINK]http://arxiv.org/abs/2510.23557v1
[DATE]2025-10-28 01:31:24+08:00
[CATEGORIES]cs.LG
KV-weights are all you need for skipless transformers
[AUTHORS]Nils Graef
[ABSTRACT]He and Hofmann (arXiv:2311.01906) detailed a skipless transformer without the
V and P (post-attention projection) linear layers, which reduces the total
number of weights. However, this scheme is applicable only to MHA (multi-head
attention), not to MQA (multi-query attention) or GQA (grouped-query
attention). The latter schemes are used by many popular LLMs such as Llama 2,
Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes
mathematically equivalent versions that are suitable for MQA and GQA. For
example, removing Q and P from a skipless version of Mistral-7B would remove
15% of its weights (and thus reduce its compute and memory complexity). Watch
our explainer video https://youtu.be/Tx_lMpphd2g and see
https://github.com/OpenMachine-ai/transformer-tricks for code and more
transformer tricks.
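The 15% figure can be sanity-checked with a back-of-the-envelope weight count.
The sketch below assumes the publicly documented Mistral-7B shape parameters
(they are not stated in the abstract), untied input and output embeddings, and
ignores normalization weights; whether the paper counts weights in exactly
this way is an assumption.

```python
# Assumed Mistral-7B shape parameters (from its public config, not the abstract).
d_model = 4096
n_layers = 32
n_heads = 32
n_kv_heads = 8
d_ffn = 14336
vocab = 32000

head_dim = d_model // n_heads
q_weights = d_model * d_model                      # query projection (Q)
p_weights = d_model * d_model                      # post-attention projection (P)
kv_weights = 2 * d_model * n_kv_heads * head_dim   # key and value projections (GQA)
ffn_weights = 3 * d_model * d_ffn                  # gate, up, and down projections
per_layer = q_weights + p_weights + kv_weights + ffn_weights
embeddings = 2 * vocab * d_model                   # input and output embeddings

total = n_layers * per_layer + embeddings          # normalization weights ignored
removed = n_layers * (q_weights + p_weights)       # weights saved by dropping Q and P
fraction_removed = removed / total
print(f"{fraction_removed:.1%}")
```

Under these assumptions the removed fraction comes out just under 15%,
consistent with the figure quoted above.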
[COMMENTS]6 pages, 4 figures
[LINK]http://arxiv.org/abs/2404.12362v2
[DATE]2025-10-28 01:31:15+08:00
[CATEGORIES]cs.LG
Enhancing Graph Neural Networks: A Mutual Learning Approach
[AUTHORS]Paul Agbaje, Arkajyoti Mitra, Afia Anjum, Pranali Khose, Ebelechukwu Nwafor, Habeeb Olufowobi
[ABSTRACT]Knowledge distillation (KD) techniques have emerged as a powerful tool for
transferring expertise from complex teacher models to lightweight student
models, particularly beneficial for deploying high-performance models in
resource-constrained devices. This approach has been successfully applied to
graph neural networks (GNNs), harnessing their expressive capabilities to
generate node embeddings that capture structural and feature-related
information. In this study, we depart from the conventional KD approach by
exploring the potential of collaborative learning among GNNs. In the absence of
a pre-trained teacher model, we show that relatively simple and shallow GNN
architectures can synergetically learn efficient models capable of performing
better during inference, particularly in tackling multiple tasks. We propose a
collaborative learning framework where ensembles of student GNNs mutually teach
each other throughout the training process. We introduce an adaptive logit
weighting unit to facilitate efficient knowledge exchange among models and an
entropy enhancement technique to improve mutual learning. These components
dynamically empower the models to adapt their learning strategies during
training, optimizing their performance for downstream tasks. Extensive
experiments conducted on three datasets each for node and graph classification
demonstrate the effectiveness of our approach.
[LINK]http://arxiv.org/abs/2510.19223v2
[DATE]2025-10-28 01:26:39+08:00
[CATEGORIES]cs.LG
On the Stability of Graph Convolutional Neural Networks: A Probabilistic Perspective
[AUTHORS]Ning Zhang, Henry Kenlay, Li Zhang, Mihai Cucuringu, Xiaowen Dong
[ABSTRACT]Graph convolutional neural networks (GCNNs) have emerged as powerful tools
for analyzing graph-structured data, achieving remarkable success across
diverse applications. However, the theoretical understanding of the stability
of these models, i.e., their sensitivity to small changes in the graph
structure, remains confined to rather limited settings, hampering the development and
deployment of robust and trustworthy models in practice. To fill this gap, we
study how perturbations in the graph topology affect GCNN outputs and propose a
novel formulation for analyzing model stability. Unlike prior studies that
focus only on worst-case perturbations, our distribution-aware formulation
characterizes output perturbations across a broad range of input data. This
way, our framework enables, for the first time, a probabilistic perspective on
the interplay between the statistical properties of the node data and
perturbations in the graph topology. We conduct extensive experiments to
validate our theoretical findings and demonstrate their benefits over existing
baselines, in terms of both representation stability and adversarial attacks on
downstream tasks. Our results demonstrate the practical significance of the
proposed formulation and highlight the importance of incorporating data
distribution into stability analysis.
[LINK]http://arxiv.org/abs/2506.01213v4
[DATE]2025-10-28 01:20:28+08:00
[CATEGORIES]cs.LG
ProSpero: Active Learning for Robust Protein Design Beyond Wild-Type Neighborhoods
[AUTHORS]Michal Kmicikiewicz, Vincent Fortuin, Ewa Szczurek
[ABSTRACT]Designing protein sequences of both high fitness and novelty is a challenging
task in data-efficient protein engineering. Exploration beyond wild-type
neighborhoods often leads to biologically implausible sequences or relies on
surrogate models that lose fidelity in novel regions. Here, we propose
ProSpero, an active learning framework in which a frozen pre-trained generative
model is guided by a surrogate updated from oracle feedback. By integrating
fitness-relevant residue selection with biologically-constrained Sequential
Monte Carlo sampling, our approach enables exploration beyond wild-type
neighborhoods while preserving biological plausibility. We show that our
framework remains effective even when the surrogate is misspecified. ProSpero
consistently outperforms or matches existing methods across diverse protein
engineering tasks, retrieving sequences of both high fitness and novelty.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.22494v2
[DATE]2025-10-28 01:12:30+08:00
[CATEGORIES]cs.LG
On the Structure of Stationary Solutions to McKean-Vlasov Equations with Applications to Noisy Transformers
[AUTHORS]Krishnakumar Balasubramanian, Sayan Banerjee, Philippe Rigollet
[ABSTRACT]We study stationary solutions of McKean-Vlasov equations on the circle. Our
main contributions stem from observing an exact equivalence between solutions
of the stationary McKean-Vlasov equation and an infinite-dimensional quadratic
system of equations over Fourier coefficients, which allows explicit
characterization of the stationary states in a sequence space rather than a
function space. This framework provides a transparent description of local
bifurcations, characterizing their periodicity, and resonance structures, while
accommodating singular potentials. We derive analytic expressions that
characterize the emergence, form and shape (supercritical, critical,
subcritical or transcritical) of bifurcations involving possibly multiple
Fourier modes and connect them with discontinuous phase transitions. We also
characterize, under suitable assumptions, the detailed structure of the
stationary bifurcating solutions that are accurate up to an arbitrary number of
Fourier modes. At the global level, we establish regularity and concavity
properties of the free energy landscape, proving existence, compactness, and
coexistence of globally minimizing stationary measures, further identifying
discontinuous phase transitions with points of non-differentiability of the
minimum free energy map. As an application, we specialize the theory to the
Noisy Mean-Field Transformer model, where we show how changing the inverse
temperature parameter $\beta$ affects the geometry of the infinitely many
bifurcations from the uniform measure. We also explain how increasing $\beta$
can lead to a rich class of approximate multi-mode stationary solutions which
can be seen as ‘metastable states’. Further, a sharp transition from continuous
to discontinuous (first-order) phase behavior is observed as $\beta$ increases.
[COMMENTS]46 pages, 5 figures
[LINK]http://arxiv.org/abs/2510.20094v2
[DATE]2025-10-28 01:12:03+08:00
[CATEGORIES]cs.LG
Sequential Multi-Agent Dynamic Algorithm Configuration
[AUTHORS]Chen Lu, Ke Xue, Lei Yuan, Yao Wang, Yaoyuan Wang, Sheng Fu, Chao Qian
[ABSTRACT]Dynamic algorithm configuration (DAC) is a recent trend in automated machine
learning, which can dynamically adjust the algorithm’s configuration during the
execution process and relieve users from tedious trial-and-error tuning tasks.
Recently, multi-agent reinforcement learning (MARL) approaches have improved
the configuration of multiple heterogeneous hyperparameters, making various
parameter configurations for complex algorithms possible. However, many complex
algorithms have inherent inter-dependencies among multiple parameters (e.g.,
determining the operator type first and then the operator’s parameter), which
previous approaches do not consider, thus leading to
sub-optimal results. In this paper, we propose the sequential multi-agent DAC
(Seq-MADAC) framework to address this issue by considering the inherent
inter-dependencies of multiple parameters. Specifically, we propose a
sequential advantage decomposition network, which can leverage action-order
information through sequential advantage decomposition. Experiments from
synthetic functions to the configuration of multi-objective optimization
algorithms demonstrate Seq-MADAC’s superior performance over state-of-the-art
MARL methods and show strong generalization across problem classes. Seq-MADAC
establishes a new paradigm for the widespread dependency-aware automated
algorithm configuration. Our code is available at
https://github.com/lamda-bbo/seq-madac.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23535v1
[DATE]2025-10-28 01:11:03+08:00
[CATEGORIES]cs.LG
WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation
[AUTHORS]Christiaan M. Geldenhuys, Günther Tonitz, Thomas R. Niesler
[ABSTRACT]While recent sound event detection (SED) systems can identify baleen whale
calls in marine audio, challenges related to false positive and minority-class
detection persist. We propose the boundary proposal network (BPN), which
extends an existing lightweight SED system. The BPN is inspired by work in
image object detection and aims to reduce the number of false positive
detections. It achieves this by using intermediate latent representations
computed within the backbone classification model to gate the final output.
When added to an existing SED system, the BPN achieves a 16.8 % absolute
increase in precision, as well as 21.3 % and 9.4 % improvements in the F1-score
for minority-class d-calls and bp-calls, respectively. We further consider two
approaches to the selection of post-processing hyperparameters: a
forward-search and a backward-search. By separately optimising event-level and
frame-level hyperparameters, these two approaches lead to considerable
performance improvements over parameters selected using empirical methods. The
complete WhaleVAD-BPN system achieves a cross-validated development F1-score of
0.475, which is a 9.8 % absolute improvement over the baseline.
[LINK]http://arxiv.org/abs/2510.21280v2
[DATE]2025-10-28 01:10:50+08:00
[CATEGORIES]cs.LG
Direct Debiased Machine Learning via Bregman Divergence Minimization
[AUTHORS]Masahiro Kato
[ABSTRACT]We develop a direct debiased machine learning framework comprising Neyman
targeted estimation and generalized Riesz regression. Our framework unifies
Riesz regression for automatic debiased machine learning, covariate balancing,
targeted maximum likelihood estimation (TMLE), and density-ratio estimation. In
many problems involving causal effects or structural models, the parameters of
interest depend on regression functions. Plugging regression functions
estimated by machine learning methods into the identifying equations can yield
poor performance because of first-stage bias. To reduce such bias, debiased
machine learning employs Neyman orthogonal estimating equations. Debiased
machine learning typically requires estimation of the Riesz representer and the
regression function. For this problem, we develop a direct debiased machine
learning framework with an end-to-end algorithm. We formulate estimation of the
nuisance parameters, the regression function and the Riesz representer, as
minimizing the discrepancy between Neyman orthogonal scores computed with known
and unknown nuisance parameters, which we refer to as Neyman targeted
estimation. Neyman targeted estimation includes Riesz representer estimation,
and we measure discrepancies using the Bregman divergence. The Bregman
divergence encompasses various loss functions as special cases, where the
squared loss yields Riesz regression and the Kullback-Leibler divergence yields
entropy balancing. We refer to this Riesz representer estimation as generalized
Riesz regression. Neyman targeted estimation also yields TMLE as a special case
for regression function estimation. Furthermore, for specific pairs of models
and Riesz representer estimation methods, we can automatically obtain the
covariate balancing property without explicitly solving the covariate balancing
objective.
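The abstract does not write out the divergence. As a reminder, under the
standard textbook definition for a differentiable, strictly convex scalar
function $f$ (the paper's exact objective for the Riesz representer may be
set up differently), the two special cases mentioned above arise as follows:

```latex
\begin{align*}
D_f(a \,\|\, b) &= f(a) - f(b) - f'(b)\,(a - b), \\
f(t) = t^2 \;&\Rightarrow\; D_f(a \,\|\, b) = a^2 - b^2 - 2b(a - b) = (a - b)^2, \\
f(t) = t\log t - t \;&\Rightarrow\; D_f(a \,\|\, b) = a \log\frac{a}{b} - a + b .
\end{align*}
```

The first case is the squared loss and the second is the (generalized)
Kullback-Leibler divergence for positive $a$ and $b$.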
[LINK]http://arxiv.org/abs/2510.23534v1
[DATE]2025-10-28 01:10:43+08:00
[CATEGORIES]cs.LG
When No Paths Lead to Rome: Benchmarking Systematic Neural Relational Reasoning
[AUTHORS]Anirban Das, Irtaza Khalid, Rafael Peñaloza, Steven Schockaert
[ABSTRACT]Designing models that can learn to reason in a systematic way is an important
and long-standing challenge. In recent years, a wide range of solutions have
been proposed for the specific case of systematic relational reasoning,
including Neuro-Symbolic approaches, variants of the Transformer architecture,
and specialised Graph Neural Networks. However, existing benchmarks for
systematic relational reasoning focus on an overly simplified setting, based on
the assumption that reasoning can be reduced to composing relational paths. In
fact, this assumption is hard-baked into the architecture of several recent
models, leading to approaches that can perform well on existing benchmarks but
are difficult to generalise to other settings. To support further progress in
the field of systematic relational reasoning with neural networks, we introduce
NoRA, a new benchmark which adds several levels of difficulty and requires
models to go beyond path-based reasoning.
[COMMENTS]accepted at NeurIPS 2025 D&B track
[LINK]http://arxiv.org/abs/2510.23532v1
[DATE]2025-10-28 01:09:16+08:00
[CATEGORIES]cs.LG
Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization
[AUTHORS]Bernardo Torres, Manuel Moussallam, Gabriel Meseguer-Brocal
[ABSTRACT]Audio autoencoders learn useful, compressed audio representations, but their
non-linear latent spaces prevent intuitive algebraic manipulation such as
mixing or scaling. We introduce a simple training methodology to induce
linearity in a high-compression Consistency Autoencoder (CAE) by using data
augmentation, thereby inducing homogeneity (equivariance to scalar gain) and
additivity (the decoder preserves addition) without altering the model’s
architecture or loss function. When trained with our method, the CAE exhibits
linear behavior in both the encoder and decoder while preserving reconstruction
fidelity. We test the practical utility of our learned space on music source
composition and separation via simple latent arithmetic. This work presents a
straightforward technique for constructing structured latent spaces, enabling
more intuitive and efficient audio processing.
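The two properties named above can be checked numerically for any encoder or
decoder. The sketch below is a minimal, framework-free test harness; the
stand-in maps `linear_map` and `nonlinear_map` are hypothetical and only
illustrate what passing and failing look like, not the CAE itself.

```python
import math
from typing import Callable, List

Vector = List[float]


def add(u: Vector, v: Vector) -> Vector:
    return [a + b for a, b in zip(u, v)]


def scale(c: float, u: Vector) -> Vector:
    return [c * a for a in u]


def max_abs_diff(u: Vector, v: Vector) -> float:
    return max(abs(a - b) for a, b in zip(u, v))


def homogeneity_error(f: Callable[[Vector], Vector], x: Vector, gain: float) -> float:
    """Distance between f(gain * x) and gain * f(x)."""
    return max_abs_diff(f(scale(gain, x)), scale(gain, f(x)))


def additivity_error(f: Callable[[Vector], Vector], x: Vector, y: Vector) -> float:
    """Distance between f(x + y) and f(x) + f(y)."""
    return max_abs_diff(f(add(x, y)), add(f(x), f(y)))


def linear_map(z: Vector) -> Vector:
    # Hypothetical linear stand-in for an encoder or decoder.
    return [2.0 * z[0] + z[1], z[0] - z[1]]


def nonlinear_map(z: Vector) -> Vector:
    # Hypothetical non-linear stand-in; tanh breaks both properties.
    return [math.tanh(2.0 * z[0] + z[1]), z[0] - z[1]]


x, y = [1.0, 2.0], [0.5, -1.0]
lin_h = homogeneity_error(linear_map, x, 0.5)
lin_a = additivity_error(linear_map, x, y)
non_h = homogeneity_error(nonlinear_map, x, 0.5)
```

For a trained model, `f` would be the encoder or decoder applied to audio or
latent vectors, and small errors over many random inputs and gains would
indicate approximately linear behavior.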
[LINK]http://arxiv.org/abs/2510.23530v1
[DATE]2025-10-28 01:08:27+08:00
[CATEGORIES]cs.LG
Toward Carbon-Neutral Human AI: Rethinking Data, Computation, and Learning Paradigms for Sustainable Intelligence
[AUTHORS]KC Santosh, Rodrigue Rizk, Longwei Wang
[ABSTRACT]The rapid advancement of Artificial Intelligence (AI) has led to
unprecedented computational demands, raising significant environmental and
ethical concerns. This paper critiques the prevailing reliance on large-scale,
static datasets and monolithic training paradigms, advocating for a shift
toward human-inspired, sustainable AI solutions. We introduce a novel
framework, Human AI (HAI), which emphasizes incremental learning, carbon-aware
optimization, and human-in-the-loop collaboration to enhance adaptability,
efficiency, and accountability. By drawing parallels with biological cognition
and leveraging dynamic architectures, HAI seeks to balance performance with
ecological responsibility. We detail the theoretical foundations, system
design, and operational principles that enable AI to learn continuously and
contextually while minimizing carbon footprints and human annotation costs. Our
approach addresses pressing challenges in active learning, continual
adaptation, and energy-efficient model deployment, offering a pathway toward
responsible, human-centered artificial intelligence.
[COMMENTS]9 pages, 3 figures
[LINK]http://arxiv.org/abs/2510.23524v1
[DATE]2025-10-28 01:02:30+08:00
[CATEGORIES]cs.LG
DmC: Nearest Neighbor Guidance Diffusion Model for Offline Cross-domain Reinforcement Learning
[AUTHORS]Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, Sunil Gupta
[ABSTRACT]Cross-domain offline reinforcement learning (RL) seeks to enhance sample
efficiency in offline RL by utilizing additional offline source datasets. A key
challenge is to identify and utilize source samples that are most relevant to
the target domain. Existing approaches address this challenge by measuring
domain gaps through domain classifiers, target transition dynamics modeling, or
mutual information estimation using contrastive loss. However, these methods
often require large target datasets, which is impractical in many real-world
scenarios. In this work, we address cross-domain offline RL under a limited
target data setting, identifying two primary challenges: (1) Dataset imbalance,
which is caused by large source and small target datasets and leads to
overfitting in neural network-based domain gap estimators, resulting in
uninformative measurements; and (2) Partial domain overlap, where only a subset
of the source data is closely aligned with the target domain. To overcome these
issues, we propose DmC, a novel framework for cross-domain offline RL with
limited target samples. Specifically, DmC utilizes $k$-nearest neighbor
($k$-NN) based estimation to measure domain proximity without neural network
training, effectively mitigating overfitting. Then, by utilizing this domain
proximity, we introduce a nearest-neighbor-guided diffusion model to generate
additional source samples that are better aligned with the target domain, thus
enhancing policy learning with more effective source samples. Through
theoretical analysis and extensive experiments in diverse MuJoCo environments,
we demonstrate that DmC significantly outperforms state-of-the-art cross-domain
offline RL methods, achieving substantial performance gains.
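To make the training-free proximity idea concrete, the sketch below scores
each source sample by its distance to the $k$-th nearest target sample and
ranks source samples accordingly. The feature representation, the Euclidean
metric, and the function names are assumptions for illustration; the
diffusion-based generation step is not shown, and this is not the authors'
estimator.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def knn_distance(query: Point, targets: Sequence[Point], k: int) -> float:
    """Distance from `query` to its k-th nearest neighbour in `targets`."""
    dists = sorted(math.dist(query, t) for t in targets)
    return dists[k - 1]


def rank_source_by_proximity(
    source: Sequence[Point], target: Sequence[Point], k: int = 1
) -> List[int]:
    """Indices of source samples, ordered from closest to farthest from the target data."""
    scores = [knn_distance(s, target, k) for s in source]
    return sorted(range(len(source)), key=lambda i: scores[i])


# Hypothetical 2-D features standing in for transition samples.
target_data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
source_data = [(5.0, 5.0), (0.1, 0.1), (2.0, 0.0)]
ranking = rank_source_by_proximity(source_data, target_data, k=2)
```

Because the score involves no learned parameters, it cannot overfit a small
target dataset in the way a neural domain classifier can, which is the
motivation given above.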
[COMMENTS]accepted at ECAI 2025; offline cross-domain reinforcement learning
with a guided diffusion model
[LINK]http://arxiv.org/abs/2507.20499v2
[DATE]2025-10-28 01:00:52+08:00
[CATEGORIES]cs.LG
A Deep Latent Factor Graph Clustering with Fairness-Utility Trade-off Perspective
[AUTHORS]Siamak Ghodsi, Amjad Seyedi, Tai Le Quy, Fariba Karimi, Eirini Ntoutsi
[ABSTRACT]Fair graph clustering seeks partitions that respect network structure while
maintaining proportional representation across sensitive groups, with
applications spanning community detection, team formation, resource allocation,
and social network analysis. Many existing approaches enforce rigid constraints
or rely on multi-stage pipelines (e.g., spectral embedding followed by
$k$-means), limiting trade-off control, interpretability, and scalability. We
introduce DFNMF, an end-to-end deep nonnegative tri-factorization
tailored to graphs that directly optimizes cluster assignments with a soft
statistical-parity regularizer. A single parameter $\lambda$ tunes the
fairness–utility balance, while nonnegativity yields parts-based factors and
transparent soft memberships. The optimization uses sparse-friendly alternating
updates and scales near-linearly with the number of edges. Across synthetic and
real networks, DFNMF achieves substantially higher group balance at comparable
modularity, often dominating state-of-the-art baselines on the Pareto front.
The code is available at https://github.com/SiamakGhodsi/DFNMF.git.
[COMMENTS]Accepted to IEEE Big-Data 2025 main research track. The paper is 10
main pages and 4 pages of Appendix
[LINK]http://arxiv.org/abs/2510.23507v1
[DATE]2025-10-28 00:40:52+08:00
[CATEGORIES]cs.LG
Bayes-Split-Edge: Bayesian Optimization for Constrained Collaborative Inference in Wireless Edge Systems
[AUTHORS]Fatemeh Zahra Safaeipour, Jacob Chakareski, Morteza Hashemi
[ABSTRACT]Mobile edge devices (e.g., AR/VR headsets) typically need to complete timely
inference tasks while operating with limited on-board computing and energy
resources. In this paper, we investigate the problem of collaborative inference
in wireless edge networks, where energy-constrained edge devices aim to
complete inference tasks within given deadlines. These tasks are carried out
using neural networks, and the edge device seeks to optimize inference
performance under energy and delay constraints. The inference process can be
split between the edge device and an edge server, thereby achieving
collaborative inference over wireless networks. We formulate an inference
utility optimization problem subject to energy and delay constraints, and
propose a novel solution called Bayes-Split-Edge, which leverages Bayesian
optimization for collaborative split inference over wireless edge networks. Our
solution jointly optimizes the transmission power and the neural network split
point. The Bayes-Split-Edge framework incorporates a novel hybrid acquisition
function that balances inference task utility, sample efficiency, and
constraint violation penalties. We evaluate our approach using the VGG19 model
on the ImageNet-Mini dataset, ResNet101 on Tiny-ImageNet, and real-world
mMobile wireless channel datasets. Numerical results demonstrate that
Bayes-Split-Edge achieves up to 2.4x reduction in evaluation cost compared to
standard Bayesian optimization and achieves near-linear convergence. It also
outperforms several baselines, including CMA-ES, DIRECT, exhaustive search, and
Proximal Policy Optimization (PPO), while matching exhaustive search
performance under tight constraints. These results confirm that the proposed
framework provides sample-efficient, constraint-aware optimization, requiring
at most 20 function evaluations, for wireless split inference in edge
computing systems.
[LINK]http://arxiv.org/abs/2510.23503v1
[DATE]2025-10-28 00:36:51+08:00
[CATEGORIES]cs.LG
Towards Deep Physics-Informed Kolmogorov-Arnold Networks
[AUTHORS]Spyros Rigas, Fotios Anagnostopoulos, Michalis Papachristou, Georgios Alexandridis
[ABSTRACT]Since their introduction, Kolmogorov-Arnold Networks (KANs) have been
successfully applied across several domains, with physics-informed machine
learning (PIML) emerging as one of the areas where they have thrived. In the
PIML setting, Chebyshev-based physics-informed KANs (cPIKANs) have become the
standard due to their computational efficiency. However, like their multilayer
perceptron-based counterparts, cPIKANs face significant challenges when scaled
to depth, leading to training instabilities that limit their applicability to
several PDE problems. To address this, we propose a basis-agnostic, Glorot-like
initialization scheme that preserves activation variance and yields substantial
improvements in stability and accuracy over the default initialization of
cPIKANs. Inspired by the PirateNet architecture, we further introduce
Residual-Gated Adaptive KANs (RGA KANs), designed to mitigate divergence in
deep cPIKANs where initialization alone is not sufficient. Through empirical
tests and information bottleneck analysis, we show that RGA KANs successfully
traverse all training phases, unlike baseline cPIKANs, which stagnate in the
diffusion phase in specific PDE settings. Evaluations on seven standard forward
PDE benchmarks under a fixed training pipeline with adaptive components
demonstrate that RGA KANs consistently outperform parameter-matched cPIKANs and
PirateNets - often by several orders of magnitude - while remaining stable in
settings where the others diverge.
[COMMENTS]73 pages, 22 figures
[LINK]http://arxiv.org/abs/2510.23501v1
[DATE]2025-10-28 00:35:01+08:00
[CATEGORIES]cs.LG
Mixed Precision Training of Neural ODEs
[AUTHORS]Elena Celledoni, Brynjulf Owren, Lars Ruthotto, Tianjiao Nicole Yang
[ABSTRACT]Exploiting low-precision computations has become a standard strategy in deep
learning to address the growing computational costs imposed by ever larger
models and datasets. However, naively performing all computations in low
precision can lead to roundoff errors and instabilities. Therefore, mixed
precision training schemes usually store the weights in high precision and use
low-precision computations only for whitelisted operations. Despite their
success, these principles are currently not reliable for training
continuous-time architectures such as neural ordinary differential equations
(Neural ODEs). This paper presents a mixed precision training framework for
neural ODEs, combining explicit ODE solvers with a custom backpropagation
scheme, and demonstrates its effectiveness across a range of learning tasks.
Our scheme uses low-precision computations for evaluating the velocity,
parameterized by the neural network, and for storing intermediate states, while
stability is provided by a custom dynamic adjoint scaling and by accumulating
the solution and gradients in higher precision. These contributions address two
key challenges in training neural ODEs: the computational cost of repeated
network evaluations and the growth of memory requirements with the number of
time steps or layers. Along with the paper, we publish our extendable,
open-source PyTorch package rampde, whose syntax resembles that of leading
packages to provide a drop-in replacement in existing codes. We demonstrate the
reliability and effectiveness of our scheme using challenging test cases and on
neural ODE applications in image classification and generative models,
achieving approximately 50% memory reduction and up to 2x speedup while
maintaining accuracy comparable to single-precision training.
[COMMENTS]Code available at https://github.com/EmoryMLIP/rampde; 26 pages, 4
figures
[LINK]http://arxiv.org/abs/2510.23498v1
[DATE]2025-10-28 00:32:56+08:00
[CATEGORIES]cs.LG
Universal Sequence Preconditioning
[AUTHORS]Annie Marsden, Elad Hazan
[ABSTRACT]We study the problem of preconditioning in sequential prediction. From the
theoretical lens of linear dynamical systems, we show that convolving the input
sequence corresponds to applying a polynomial to the hidden transition matrix.
Building on this insight, we propose a universal preconditioning method that
convolves the input with coefficients from orthogonal polynomials such as
Chebyshev or Legendre. We prove that this approach reduces regret for two
distinct prediction algorithms and yields the first ever sublinear and
hidden-dimension independent regret bounds (up to logarithmic factors) that
hold for systems with marginally stable and asymmetric transition matrices.
Finally, extensive synthetic and real-world experiments show that this simple
preconditioning strategy improves the performance of a diverse range of
algorithms, including recurrent neural networks, and generalizes to signals
beyond linear dynamical systems.
[COMMENTS]35 pages, 3 tables, 5 figures
[LINK]http://arxiv.org/abs/2502.06545v3
[DATE]2025-10-28 00:31:36+08:00
[CATEGORIES]cs.LG
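The preconditioning idea in the Universal Sequence Preconditioning abstract above can be illustrated with a short Python sketch that convolves a sequence with the monomial coefficients of a Chebyshev polynomial. This is a minimal sketch, not the authors' implementation: the function names, the kernel ordering (highest-degree coefficient applied to the most recent element), and the zero padding before t = 0 are assumptions made for illustration.

```python
def chebyshev_coeffs(n):
    """Monomial coefficients of the Chebyshev polynomial T_n, lowest degree first."""
    prev, cur = [1], [0, 1]  # T_0 = 1, T_1 = x
    if n == 0:
        return prev
    for _ in range(n - 1):
        # Recurrence: T_{k+1}(x) = 2x * T_k(x) - T_{k-1}(x)
        nxt = [0] + [2 * c for c in cur]
        for i, c in enumerate(prev):
            nxt[i] -= c
        prev, cur = cur, nxt
    return cur


def causal_convolve(seq, kernel):
    """out[t] = sum_j kernel[j] * seq[t - j], treating seq as zero before t = 0."""
    out = []
    for t in range(len(seq)):
        taps = min(len(kernel), t + 1)
        out.append(sum(kernel[j] * seq[t - j] for j in range(taps)))
    return out


def precondition(seq, degree):
    """Convolve seq with T_degree's coefficients (assumed ordering: highest degree first)."""
    kernel = list(reversed(chebyshev_coeffs(degree)))
    return causal_convolve(seq, kernel)
```

For the scalar sequence 1, a, a^2, ..., the preconditioned value at any step t >= n equals T_n(a) * a^(t - n). This is the scalar analogue of a convolution acting as a polynomial in the transition matrix.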
Deriving Transformer Architectures as Implicit Multinomial Regression
[AUTHORS]Jonas A. Actor, Anthony Gruber, Eric C. Cyr
[ABSTRACT]While attention has been empirically shown to improve model performance, it
lacks a rigorous mathematical justification. This short paper establishes a
novel connection between attention mechanisms and multinomial regression.
Specifically, we show that in a fixed multinomial regression setting,
optimizing over latent features yields solutions that align with the dynamics
induced on features by attention blocks. In other words, the evolution of
representations through a transformer can be interpreted as a trajectory that
recovers the optimal features for classification.
[COMMENTS]4 pages, additional 3 pages of references and supplementary details
[LINK]http://arxiv.org/abs/2509.04653v2
[DATE]2025-10-28 00:26:55+08:00
[CATEGORIES]cs.LG
Quantum Phase Classification of Rydberg Atom Systems Using Resource-Efficient Variational Quantum Circuits and Classical Shadows
[AUTHORS]Hemish Ahuja, Samradh Bhardwaj, Kirti Dhir, Roman Bagdasarian, Ziwoong Jang
[ABSTRACT]Quantum phase transitions in Rydberg atom arrays present significant
opportunities for studying many-body physics, yet distinguishing between
different ordered phases without explicit order parameters remains challenging.
We present a resource-efficient quantum machine learning approach combining
classical shadow tomography with variational quantum circuits (VQCs) for binary
phase classification of Z2 and Z3 ordered phases. Our pipeline processes 500
randomized measurements per 51-atom chain state, reconstructs shadow operators,
performs PCA dimensionality reduction (514 features), and encodes features
using angle embedding onto a 2-qubit parameterized circuit. The circuit employs
RY-RZ angle encoding, strong entanglement via all-to-all CZ gates, and a
minimal 2-parameter ansatz achieving depth 7. Training via simultaneous
perturbation stochastic approximation (SPSA) with hinge loss converged in 120
iterations. The model achieved 100% test accuracy with perfect precision,
recall, and F1 scores, demonstrating that minimal quantum resources suffice for
high-accuracy phase classification. This work establishes pathways for
quantum-enhanced condensed matter physics on near-term quantum devices.
[COMMENTS]7 pages, 2 tables, and 3 figures. For associated code files, see
https://github.com/Hemishahuja/FLIQ_Challenge_ClassiqDuQIS
[LINK]http://arxiv.org/abs/2510.23489v1
[DATE]2025-10-28 00:25:16+08:00
[CATEGORIES]cs.LG
The Marked Edge Walk: A Novel MCMC Algorithm for Sampling of Graph Partitions
[AUTHORS]Atticus McWhorter, Daryl DeFord
[ABSTRACT]Novel Markov Chain Monte Carlo (MCMC) methods have enabled the generation of
large ensembles of redistricting plans through graph partitioning. However,
existing algorithms such as Reversible Recombination (RevReCom) and
Metropolized Forest Recombination (MFR) are constrained to sampling from
distributions related to spanning trees. We introduce the marked edge walk
(MEW), a novel MCMC algorithm for sampling from the space of graph partitions
under a tunable distribution. The walk operates on the space of spanning trees
with marked edges, allowing for calculable transition probabilities for use in
the Metropolis-Hastings algorithm. Empirical results on real-world dual graphs
show convergence under target distributions unrelated to spanning trees. For
this reason, MEW represents an advancement in flexible ensemble generation.
[LINK]http://arxiv.org/abs/2510.17714v2
[DATE]2025-10-28 00:20:43+08:00
[CATEGORIES]cs.LG
Leveraging Approximate Caching for Faster Retrieval-Augmented Generation
[AUTHORS]Shai Bergman, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos, Ji Zhang
[ABSTRACT]Retrieval-augmented generation (RAG) improves the reliability of large
language model (LLM) answers by integrating external knowledge. However, RAG
increases the end-to-end inference time since looking for relevant documents
from large vector databases is computationally expensive. To address this, we
introduce Proximity, an approximate key-value cache that optimizes the RAG
workflow by leveraging similarities in user queries. Instead of treating each
query independently, Proximity reuses previously retrieved documents when
similar queries appear, substantially reducing the reliance on expensive vector
database lookups. To efficiently scale, Proximity employs a locality-sensitive
hashing (LSH) scheme that enables fast cache lookups while preserving retrieval
accuracy. We evaluate Proximity using the MMLU and MedRAG question-answering
benchmarks. Our experiments demonstrate that Proximity with our LSH scheme and
a realistically-skewed MedRAG workload reduces database calls by 77.2% while
maintaining database recall and test accuracy. We experiment with different
similarity tolerances and cache capacities, and show that the time spent within
the Proximity cache remains low and constant (4.8 microseconds) even as the
cache grows substantially in size. Our results demonstrate that approximate
caching is a practical and effective strategy for optimizing RAG-based systems.
[COMMENTS]Accepted at Middleware '25
[LINK]http://arxiv.org/abs/2503.05530v3
[DATE]2025-10-28 00:20:28+08:00
[CATEGORIES]cs.LG
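The approximate-caching idea in the Proximity abstract above (reuse retrieved documents for similar queries, with locality-sensitive hashing for fast lookups) can be sketched in Python as follows. This is a hedged illustration, not the Proximity implementation: the class name `ApproxCache`, the random-hyperplane hashing scheme, the Euclidean tolerance test, and the `fetch` callback standing in for the vector database are all assumptions, and cache eviction is omitted.

```python
import math
import random


class ApproxCache:
    """Toy approximate key-value cache keyed by query embeddings (hypothetical design)."""

    def __init__(self, dim, n_planes=8, tolerance=0.1, seed=0):
        rng = random.Random(seed)
        # Random hyperplanes define the LSH signature (one sign bit per plane).
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.tolerance = tolerance
        self.buckets = {}   # signature -> list of (query_vector, documents)
        self.db_calls = 0   # number of times the expensive lookup was needed

    def _signature(self, vec):
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in self.planes)

    def lookup(self, vec, fetch):
        """Return cached documents for a similar query, else call fetch(vec)."""
        bucket = self.buckets.setdefault(self._signature(vec), [])
        for key, docs in bucket:
            if math.dist(key, vec) <= self.tolerance:
                return docs  # cache hit: skip the database call
        self.db_calls += 1
        docs = fetch(vec)
        bucket.append((list(vec), docs))
        return docs
```

Only queries that land in the same hash bucket are compared, so lookup cost depends on bucket size rather than on the total number of cached entries. The trade-off is that a similar query hashed to a different bucket is treated as a miss.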
Validating LLM-as-a-Judge Systems under Rating Indeterminacy
[AUTHORS]Luke Guerdan, Solon Barocas, Kenneth Holstein, Hanna Wallach, Zhiwei Steven Wu, Alexandra Chouldechova
[ABSTRACT]The LLM-as-a-judge paradigm, in which a judge LLM system replaces human
raters in rating the outputs of other generative AI (GenAI) systems, plays a
critical role in scaling and standardizing GenAI evaluations. To validate such
judge systems, evaluators assess human–judge agreement by first collecting
multiple human ratings for each item in a validation corpus, then aggregating
the ratings into a single, per-item gold label rating. For many items, however,
rating criteria may admit multiple valid interpretations, so a human or LLM
rater may deem multiple ratings “reasonable” or “correct.” We call this
condition rating indeterminacy. Problematically, many rating tasks that contain
rating indeterminacy rely on forced-choice elicitation, whereby raters are
instructed to select only one rating for each item. In this paper, we introduce
a framework for validating LLM-as-a-judge systems under rating indeterminacy.
We draw theoretical connections between different measures of judge system
performance under different human–judge agreement metrics, and different
rating elicitation and aggregation schemes. We demonstrate that differences in
how humans and LLMs resolve rating indeterminacy when responding to
forced-choice rating instructions can heavily bias LLM-as-a-judge validation.
Through extensive experiments involving 11 real-world rating tasks and 9
commercial LLMs, we show that standard validation approaches that rely upon
forced-choice ratings select judge systems that are highly suboptimal,
performing as much as 31% worse than judge systems selected by our approach
that uses multi-label “response set” ratings to account for rating
indeterminacy. We conclude with concrete recommendations for more principled
approaches to LLM-as-a-judge validation.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2503.05965v4
[DATE]2025-10-28 00:18:11+08:00
[CATEGORIES]cs.LG
Parallel BiLSTM-Transformer networks for forecasting chaotic dynamics
[AUTHORS]Junwen Ma, Mingyu Ge, Yisen Wang, Yong Zhang, Weicheng Fu
[ABSTRACT]The nonlinear nature of chaotic systems results in extreme sensitivity to
initial conditions and highly intricate dynamical behaviors, posing fundamental
challenges for accurately predicting their evolution. To overcome the
limitation that conventional approaches fail to capture both local features and
global dependencies in chaotic time series simultaneously, this study proposes
a parallel predictive framework integrating Transformer and Bidirectional Long
Short-Term Memory (BiLSTM) networks. The hybrid model employs a dual-branch
architecture, where the Transformer branch mainly captures long-range
dependencies while the BiLSTM branch focuses on extracting local temporal
features. The complementary representations from the two branches are fused in
a dedicated feature-fusion layer to enhance predictive accuracy. As
illustrative examples, the model’s performance is systematically evaluated on
two representative tasks in the Lorenz system. The first is autonomous
evolution prediction, in which the model recursively extrapolates system
trajectories from the time-delay embeddings of the state vector to evaluate
long-term tracking accuracy and stability. The second is inference of
unmeasured variables, where the model reconstructs the unobserved states from
the time-delay embeddings of partial observations to assess its
state-completion capability. The results consistently indicate that the
proposed hybrid framework outperforms both single-branch architectures across
tasks, demonstrating its robustness and effectiveness in chaotic system
prediction.
[COMMENTS]9 pages, 7 figures
[LINK]http://arxiv.org/abs/2510.23685v1
[DATE]2025-10-28 00:17:10+08:00
[CATEGORIES]cs.LG
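Both tasks in the BiLSTM-Transformer abstract above take time-delay embeddings as model input. A minimal Python sketch of such an embedding for a scalar series follows. The function name, the `dim` and `tau` parameters, and the convention that each vector ends at the current sample are assumptions for illustration. The abstract does not specify the embedding parameters used in the paper.

```python
def delay_embed(series, dim, tau=1):
    """Time-delay embedding of a scalar series.

    Each output vector is (x[t-(dim-1)*tau], ..., x[t-tau], x[t]) for every t
    that has enough history. dim is the embedding dimension, tau the delay.
    """
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    span = (dim - 1) * tau  # history needed before the first full vector
    return [
        [series[t - span + k * tau] for k in range(dim)]
        for t in range(span, len(series))
    ]
```

For example, `delay_embed([0, 1, 2, 3, 4], dim=3)` yields three overlapping windows of length three. For partial observations of a multivariate system, the same function could be applied to the observed component only.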
Tighter CMI-Based Generalization Bounds via Stochastic Projection and Quantization
[AUTHORS]Milad Sefidgaran, Kimia Nadjahi, Abdellatif Zaidi
[ABSTRACT]In this paper, we leverage stochastic projection and lossy compression to
establish new conditional mutual information (CMI) bounds on the generalization
error of statistical learning algorithms. It is shown that these bounds are
generally tighter than the existing ones. In particular, we prove that for
certain problem instances for which existing MI and CMI bounds were recently
shown in Attias et al. [2024] and Livni [2023] to become vacuous or fail to
describe the right generalization behavior, our bounds yield suitable
generalization guarantees of the order of $\mathcal{O}(1/\sqrt{n})$, where $n$
is the size of the training dataset. Furthermore, we use our bounds to
investigate the problem of data “memorization” raised in those works, which
asserts that there are learning problem instances for which, for any learning
algorithm that has good prediction, there exist distributions under which the
algorithm must “memorize” a big fraction of the training dataset. We show that
for every learning algorithm, there exists an auxiliary algorithm that does not
memorize and which yields comparable generalization error for any data
distribution. In part, this shows that memorization is not necessary for good
generalization.
[COMMENTS]Accepted for oral presentation at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23485v1
[DATE]2025-10-28 00:17:09+08:00
[CATEGORIES]cs.LG
T-REGS: Minimum Spanning Tree Regularization for Self-Supervised Learning
[AUTHORS]Julie Mordacq, David Loiseaux, Vicky Kalogeiton, Steve Oudot
[ABSTRACT]Self-supervised learning (SSL) has emerged as a powerful paradigm for
learning representations without labeled data, often by enforcing invariance to
input transformations such as rotations or blurring. Recent studies have
highlighted two pivotal properties for effective representations: (i) avoiding
dimensional collapse, where the learned features occupy only a low-dimensional
subspace, and (ii) enhancing uniformity of the induced distribution. In this
work, we introduce T-REGS, a simple regularization framework for SSL based on
the length of the Minimum Spanning Tree (MST) over the learned representation.
We provide theoretical analysis demonstrating that T-REGS simultaneously
mitigates dimensional collapse and promotes distribution uniformity on
arbitrary compact Riemannian manifolds. Several experiments on synthetic data
and on classical SSL benchmarks validate the effectiveness of our approach at
enhancing representation quality.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23484v1
[DATE]2025-10-28 00:16:40+08:00
[CATEGORIES]cs.LG
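The quantity at the core of the T-REGS abstract above is the length of the Minimum Spanning Tree over a set of representations. The Python sketch below computes that length with Prim's algorithm on Euclidean distances. It is only an illustration of the quantity: the function name is hypothetical, the choice of Euclidean distance is an assumption, and how the length enters the training loss is not stated in the abstract and is not shown here.

```python
import math


def mst_length(points):
    """Total edge length of the Euclidean minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge connecting each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        # Relax the attachment cost of the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total
```

Fully collapsed representations (all points identical) give a length of zero, while well-spread points give a larger value. That is the intuition behind using the length as a signal against collapse.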
Effortless, Simulation-Efficient Bayesian Inference using Tabular Foundation Models
[AUTHORS]Julius Vetter, Manuel Gloeckler, Daniel Gedon, Jakob H. Macke
[ABSTRACT]Simulation-based inference (SBI) offers a flexible and general approach to
performing Bayesian inference: In SBI, a neural network is trained on synthetic
data simulated from a model and used to rapidly infer posterior distributions
for observed data. A key goal for SBI is to achieve accurate inference with as
few simulations as possible, especially for expensive simulators. In this work,
we address this challenge by repurposing recent probabilistic foundation models
for tabular data: We show how tabular foundation models – specifically TabPFN
– can be used as pre-trained autoregressive conditional density estimators for
SBI. We propose Neural Posterior Estimation with Prior-data Fitted Networks
(NPE-PFN) and show that it is competitive with current SBI approaches in terms
of accuracy for both benchmark tasks and two complex scientific inverse
problems. Crucially, it often substantially outperforms them in terms of
simulation efficiency, sometimes requiring orders of magnitude fewer
simulations. NPE-PFN eliminates the need for inference network selection,
training, and hyperparameter tuning. We also show that it exhibits superior
robustness to model misspecification and can be scaled to simulation budgets
that exceed the context size limit of TabPFN. NPE-PFN provides a new direction
for SBI, where training-free, general-purpose inference models offer efficient,
easy-to-use, and flexible solutions for a wide range of stochastic inverse
problems.
[LINK]http://arxiv.org/abs/2504.17660v2
[DATE]2025-10-28 00:14:05+08:00
[CATEGORIES]cs.LG
BBOPlace-Bench: Benchmarking Black-Box Optimization for Chip Placement
[AUTHORS]Ke Xue, Ruo-Tong Chen, Rong-Xi Tan, Xi Lin, Yunqi Shi, Siyuan Xu, Mingxuan Yuan, Chao Qian
[ABSTRACT]Chip placement is a vital stage in modern chip design as it has a substantial
impact on the subsequent processes and the overall quality of the final chip.
The use of black-box optimization (BBO) for chip placement has a history of
several decades. However, early efforts were limited by immature problem
formulations and inefficient algorithm designs. Recent progress has shown the
effectiveness and efficiency of BBO for chip placement, proving its potential
to achieve state-of-the-art results. Despite these advancements, the field
lacks a unified, BBO-specific benchmark for thoroughly assessing various
problem formulations and BBO algorithms. To fill this gap, we propose
BBOPlace-Bench, the first benchmark designed specifically for evaluating and
developing BBO algorithms for chip placement tasks. It integrates three problem
formulations of BBO for chip placement, and offers a modular, decoupled, and
flexible framework that enables users to seamlessly implement, test, and
compare their own algorithms. BBOPlace-Bench integrates a wide variety of
existing BBO algorithms, including simulated annealing (SA), evolutionary
algorithms (EAs), and Bayesian optimization (BO). Experimental results show
that the problem formulations of mask-guided optimization and hyperparameter
optimization outperform the sequence pair problem
formulation, while EAs demonstrate better overall performance than SA and BO,
especially in high-dimensional search spaces, and also achieve state-of-the-art
performance compared to the mainstream chip placement methods. BBOPlace-Bench
not only facilitates the development of efficient BBO-driven solutions for chip
placement but also broadens the practical application scenarios (which are
urgently needed) for the BBO community. The code of BBOPlace-Bench is available
at https://github.com/lamda-bbo/BBOPlace-Bench.
[LINK]http://arxiv.org/abs/2510.23472v1
[DATE]2025-10-28 00:10:32+08:00
[CATEGORIES]cs.LG
Robust Decision Making with Partially Calibrated Forecasts
[AUTHORS]Shayan Kiyani, Hamed Hassani, George Pappas, Aaron Roth
[ABSTRACT]Calibration has emerged as a foundational goal in “trustworthy machine
learning”, in part because of its strong decision theoretic semantics.
Independent of the underlying distribution, and independent of the decision
maker's utility function, calibration promises that amongst all policies
mapping predictions to actions, the uniformly best policy is the one that
“trusts the predictions” and acts as if they were correct. But this is true
only of \emph{fully calibrated} forecasts, which are tractable to guarantee
only for very low dimensional prediction problems. For higher dimensional
prediction problems (e.g. when outcomes are multiclass), weaker forms of
calibration have been studied that lack these decision theoretic properties. In
this paper we study how a conservative decision maker should map predictions
endowed with these weaker (“partial”) calibration guarantees to actions, in a
way that is robust in a minimax sense: i.e. to maximize their expected utility
in the worst case over distributions consistent with the calibration
guarantees. We characterize their minimax optimal decision rule via a duality
argument, and show that surprisingly, “trusting the predictions and acting
accordingly” is recovered in this minimax sense by \emph{decision calibration}
(and any strictly stronger notion of calibration), a substantially weaker and
more tractable condition than full calibration. For calibration guarantees that
fall short of decision calibration, the minimax optimal decision rule is still
efficiently computable, and we provide an empirical evaluation of a natural one
that applies to any regression model solved to optimize squared error.
[LINK]http://arxiv.org/abs/2510.23471v1
[DATE]2025-10-28 00:09:07+08:00
[CATEGORIES]cs.LG
Adaptive Dual Prompting: Hierarchical Debiasing for Fairness-aware Graph Neural Networks
[AUTHORS]Yuhan Yang, Xingbo Fu, Jundong Li
[ABSTRACT]In recent years, pre-training Graph Neural Networks (GNNs) through
self-supervised learning on unlabeled graph data has emerged as a widely
adopted paradigm in graph learning. Although the paradigm is effective for
pre-training powerful GNN models, an objective gap often exists between
pre-training and downstream tasks. To bridge this gap, graph prompting adapts
pre-trained GNN models to specific downstream tasks with extra learnable
prompts while keeping the pre-trained GNN models frozen. As recent graph
prompting methods largely focus on enhancing model utility on downstream tasks,
they often overlook fairness concerns when designing prompts for adaptation. In
fact, pre-trained GNN models will produce discriminative node representations
across demographic subgroups, as downstream graph data inherently contains
biases in both node attributes and graph structures. To address this issue, we
propose an Adaptive Dual Prompting (ADPrompt) framework that enhances fairness
for adapting pre-trained GNN models to downstream tasks. To mitigate attribute
bias, we design an Adaptive Feature Rectification module that learns customized
attribute prompts to suppress sensitive information at the input layer,
reducing bias at the source. Afterward, we propose an Adaptive Message
Calibration module that generates structure prompts at each layer, which adjust
the message from neighboring nodes to enable dynamic and soft calibration of
the information flow. Finally, ADPrompt jointly optimizes the two prompting
modules to adapt the pre-trained GNN while enhancing fairness. We conduct
extensive experiments on four datasets with four pre-training strategies to
evaluate the performance of ADPrompt. The results demonstrate that our proposed
ADPrompt outperforms seven baseline methods on node classification tasks.
[LINK]http://arxiv.org/abs/2510.23469v1
[DATE]2025-10-28 00:07:36+08:00
[CATEGORIES]cs.LG
Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods
[AUTHORS]Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal
[ABSTRACT]Despite their theoretical appeal, totally corrective boosting methods based
on linear programming have received limited empirical attention. In this paper,
we conduct the first large-scale experimental study of six LP-based boosting
formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20
diverse datasets. We evaluate the use of both heuristic and optimal base
learners within these formulations, and analyze not only accuracy, but also
ensemble sparsity, margin distribution, anytime performance, and hyperparameter
sensitivity. We show that totally corrective methods can outperform or match
state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees,
while producing significantly sparser ensembles. We further show that these
methods can thin pre-trained ensembles without sacrificing performance, and we
highlight both the strengths and limitations of using optimal decision trees in
this context.
[COMMENTS]Published in Transactions on Machine Learning Research (2025), see:
https://openreview.net/forum?id=lscC4PZUE4
[LINK]http://arxiv.org/abs/2507.18242v2
[DATE]2025-10-28 00:02:45+08:00
[CATEGORIES]cs.LG
Differential Privacy as a Perk: Federated Learning over Multiple-Access Fading Channels with a Multi-Antenna Base Station
[AUTHORS]Hao Liang, Haifeng Wen, Kaishun Wu, Hong Xing
[ABSTRACT]Federated Learning (FL) is a distributed learning paradigm that preserves
privacy by eliminating the need to exchange raw data during training. In its
prototypical edge instantiation with underlying wireless transmissions enabled
by analog over-the-air computing (AirComp), referred to as \emph{over-the-air
FL (AirFL)}, the inherent channel noise plays a unique role of \emph{frenemy}
in the sense that it degrades training due to noisy global aggregation while
providing a natural source of randomness for privacy-preserving mechanisms,
formally quantified by \emph{differential privacy (DP)}. It remains,
nevertheless, challenging to effectively harness such channel impairments, as
prior work, under assumptions of either simple channel models or restricted
types of loss functions, mostly considers (local) DP enhancement with a
single-round or non-convergent bound on privacy loss. In this paper, we study
AirFL over multiple-access fading channels with a multi-antenna base station
(BS) subject to user-level DP requirements. Despite a recent study, which
claimed in similar settings that artificial noise (AN) must be injected to
ensure DP in general, we demonstrate, on the contrary, that DP can be gained as
a \emph{perk} even \emph{without} employing any AN. Specifically, we derive a
novel bound on DP that converges under general bounded-domain assumptions on
model parameters, along with a convergence bound with general smooth and
non-convex loss functions. Next, we optimize over receive beamforming and power
allocations to characterize the optimal convergence-privacy trade-offs, which
also reveal explicit conditions in which DP is achievable without compromising
training. Finally, our theoretical findings are validated by extensive
numerical results.
[COMMENTS]15 pages, 5 figures, submitted for possible publication
[LINK]http://arxiv.org/abs/2510.23463v1
[DATE]2025-10-28 00:01:15+08:00
[CATEGORIES]cs.LG
AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees
[AUTHORS]Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi
[ABSTRACT]We study the problem of determining whether a piece of text has been authored
by a human or by a large language model (LLM). Existing state-of-the-art
logits-based detectors make use of statistics derived from the log-probability
of the observed text evaluated using the distribution function of a given
source LLM. However, relying solely on log probabilities can be sub-optimal. In
response, we introduce AdaDetectGPT – a novel classifier that adaptively
learns a witness function from training data to enhance the performance of
logits-based detectors. We provide statistical guarantees on its true positive
rate, false positive rate, true negative rate and false negative rate.
Extensive numerical studies show AdaDetectGPT nearly uniformly improves on the
state-of-the-art method across various combinations of datasets and LLMs, and the
improvement can reach up to 37%. A Python implementation of our method is
available at https://github.com/Mamba413/AdaDetectGPT.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.01268v3
[DATE]2025-10-27 23:06:24+08:00
[CATEGORIES]cs.CL cs.LG
A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving
[AUTHORS]Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral
[ABSTRACT]With the rapid adoption of Large Language Models (LLMs), LLM-adapters have
become increasingly common, providing lightweight specialization of large-scale
models. Serving hundreds or thousands of these adapters on a single GPU allows
request aggregation, increasing throughput, but may also cause request
starvation if GPU memory limits are exceeded. To address this issue, this study
focuses on determining the joint configuration of concurrent and parallel
adapters that maximizes GPU throughput without inducing starvation, given
heterogeneous adapter and traffic properties. We propose a data-driven ML
approach leveraging interpretable models to tackle this caching problem and
introduce the first Digital Twin capable of reproducing an LLM-adapter serving
system, enabling efficient training data generation. Experiments with the vLLM
framework and LoRA adapters show that the Digital Twin reproduces throughput
within 5.1% of real results, while the ML approach predicts optimal numbers of
concurrent and parallel adapters with an error of at most 7.2% under
heterogeneous, real-world workloads.
[COMMENTS]Accepted in a computer science workshop
[LINK]http://arxiv.org/abs/2508.08343v2
[DATE]2025-10-27 22:59:46+08:00
[CATEGORIES]cs.CL
EMTSF: Extraordinary Mixture of SOTA Models for Time Series Forecasting
[AUTHORS]Musleh Alharthi, Kaleel Mahmood, Sarosh Patel, Ausif Mahmood
[ABSTRACT]The immense success of the Transformer architecture in Natural Language
Processing has led to its adoption in Time Series Forecasting (TSF), where
superior performance has been shown. However, a recent important paper
questioned its effectiveness by demonstrating that a simple single-layer
linear model outperforms Transformer-based models. This claim was soon
weakened by a better Transformer-based model termed PatchTST. More recently,
TimeLLM demonstrated even better results by repurposing a Large Language Model
(LLM) for the TSF domain. Again, a follow-up paper challenged this by
demonstrating that removing the LLM component or replacing it with a basic
attention layer in fact yields better performance. One of the challenges in
forecasting is the fact that TSF data favors the more recent past, and is
sometimes subject to unpredictable events. Based upon these recent insights in
TSF, we propose a strong Mixture of Experts (MoE) framework. Our method
combines state-of-the-art (SOTA) models including xLSTM, enhanced Linear,
PatchTST, and minGRU, among others. This set of complementary and diverse
models for TSF is integrated in a Transformer-based MoE gating network. Our
proposed model outperforms all existing TSF models on standard benchmarks,
surpassing even the latest approaches based on MoE frameworks.
[LINK]http://arxiv.org/abs/2510.23396v1
[DATE]2025-10-27 22:55:30+08:00
[CATEGORIES]cs.CL
Detecting Religious Language in Climate Discourse
[AUTHORS]Evy Beijen, Pien Pieterse, Yusuf Çelik, Willem Th. van Peursen, Sandjai Bhulai, Meike Morren
[ABSTRACT]Religious language continues to permeate contemporary discourse, even in
ostensibly secular domains such as environmental activism and climate change
debates. This paper investigates how explicit and implicit forms of religious
language appear in climate-related texts produced by secular and religious
nongovernmental organizations (NGOs). We introduce a dual methodological
approach: a rule-based model using a hierarchical tree of religious terms
derived from ecotheology literature, and large language models (LLMs) operating
in a zero-shot setting. Using a dataset of more than 880,000 sentences, we
compare how these methods detect religious language and analyze points of
agreement and divergence. The results show that the rule-based method
consistently labels more sentences as religious than LLMs. These findings
highlight not only the methodological challenges of computationally detecting
religious language but also the broader tension over whether religious language
should be defined by vocabulary alone or by contextual meaning. This study
contributes to digital methods in religious studies by demonstrating both the
potential and the limitations of approaches for analyzing how the sacred
persists in climate discourse.
[LINK]http://arxiv.org/abs/2510.23395v1
[DATE]2025-10-27 22:54:51+08:00
[CATEGORIES]cs.CL
Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
[AUTHORS]Xiaoyuan Wu, Weiran Lin, Omer Akgul, Lujo Bauer
[COMMENTS]Published as a main conference paper at EMNLP 2025
[LINK]http://arxiv.org/abs/2505.23799v3
[DATE]2025-10-27 22:42:01+08:00
[CATEGORIES]cs.CL cs.LG
Can Large Language Models Unlock Novel Scientific Research Ideas?
[AUTHORS]Sandeep Kumar, Tirthankar Ghosal, Vinayak Goyal, Asif Ekbal
[ABSTRACT]The widespread adoption of Large Language Models (LLMs) and publicly
available ChatGPT have marked a significant turning point in the integration of
Artificial Intelligence (AI) into people’s everyday lives. This study examines
the ability of Large Language Models (LLMs) to generate future research ideas
from scientific papers. Unlike tasks such as summarization or translation, idea
generation lacks a clearly defined reference set or structure, making manual
evaluation the default standard. However, human evaluation in this setting is
extremely challenging, i.e., it requires substantial domain expertise, contextual
understanding of the paper, and awareness of the current research landscape.
This makes it time-consuming, costly, and fundamentally non-scalable,
particularly as new LLMs are being released at a rapid pace. Currently, there
is no automated evaluation metric specifically designed for this task. To
address this gap, we propose two automated evaluation metrics: Idea Alignment
Score (IAScore) and Idea Distinctness Index. We further conducted human
evaluation to assess the novelty, relevance, and feasibility of the generated
future research ideas. This investigation offers insights into the evolving
role of LLMs in idea generation, highlighting both their capabilities and
limitations. Our work contributes to the ongoing efforts in evaluating and
utilizing language models for generating future research ideas. We make our
datasets and code publicly available.
[COMMENTS]EMNLP 2025 (Main)
[LINK]http://arxiv.org/abs/2409.06185v2
[DATE]2025-10-27 22:39:52+08:00
[CATEGORIES]cs.CL cs.LG
ClaimGen-CN: A Large-scale Chinese Dataset for Legal Claim Generation
[AUTHORS]Siying Zhou, Yiquan Wu, Hui Chen, Xavier Hu, Kun Kuang, Adam Jatowt, Ming Hu, Chunyan Zheng, Fei Wu
[ABSTRACT]Legal claims refer to the plaintiff’s demands in a case and are essential to
guiding judicial reasoning and case resolution. While many works have focused
on improving the efficiency of legal professionals, the research on helping
non-professionals (e.g., plaintiffs) remains unexplored. This paper explores
the problem of legal claim generation based on the given case’s facts. First,
we construct ClaimGen-CN, the first dataset for the Chinese legal claim
generation task, from various real-world legal disputes. Additionally, we design an
evaluation metric tailored for assessing the generated claims, which
encompasses two essential dimensions: factuality and clarity. Building on this,
we conduct a comprehensive zero-shot evaluation of state-of-the-art general and
legal-domain large language models. Our findings highlight the limitations of
the current models in factual precision and expressive clarity, pointing to the
need for more targeted development in this domain. To encourage further
exploration of this important task, we will make the dataset publicly
available.
[LINK]http://arxiv.org/abs/2508.17234v2
[DATE]2025-10-27 22:25:55+08:00
[CATEGORIES]cs.CL
Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model’s Empathy
[AUTHORS]Ananya Malik, Nazanin Sabri, Melissa Karnaze, Mai Elsherief
[COMMENTS]9 pages, 4 figures, 4 tables, EMNLP 2025 Findings
[LINK]http://arxiv.org/abs/2510.10328v2
[DATE]2025-10-27 22:25:32+08:00
[CATEGORIES]cs.CL
Bootstrapping Referring Multi-Object Tracking
[AUTHORS]Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong
[ABSTRACT]Referring understanding is a fundamental task that bridges natural language
and visual content by localizing objects described in free-form expressions.
However, existing works are constrained by limited language expressiveness,
lacking the capacity to model object dynamics in spatial numbers and temporal
states. To address these limitations, we introduce a new and general referring
understanding task, termed referring multi-object tracking (RMOT). Its core
idea is to employ a language expression as a semantic cue to guide the
prediction of multi-object tracking, comprehensively accounting for variations
in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT
benchmark named Refer-KITTI-V2, featuring scalable and diverse language
expressions. To efficiently generate high-quality annotations covering object
dynamics with minimal manual effort, we propose a semi-automatic labeling
pipeline that formulates a total of 9,758 language prompts. In addition, we
propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT.
At its core is a query-driven Temporal Enhancement Module that represents each
object as a Transformer query, enabling long-term spatial-temporal interactions
with other objects and past frames to efficiently refine these queries.
TempRMOT achieves state-of-the-art performance on both Refer-KITTI and
Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source
code and dataset are available at https://github.com/zyn213/TempRMOT.
[LINK]http://arxiv.org/abs/2406.05039v2
[DATE]2025-10-27 22:22:30+08:00
[CATEGORIES]cs.CL
Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
[AUTHORS]Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong, Jingyu Liu, Pan Hu, Suman Banerjee
[ABSTRACT]Large Multimodal Models (LMMs) are inherently modular, consisting of vision
and audio encoders, projectors, and large language models. Yet, they are almost
always executed monolithically, which underutilizes the heterogeneous
accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end
latency. In this paper, we present NANOMIND, a hardware–software co-design
inference framework for Large Multimodal Models (LMMs) that breaks large models
into modular “bricks” (vision, language, audio, etc.) and maps each to its
ideal accelerator. The key insight is that large models can be broken into
modular components and scheduled to run on the most appropriate compute units.
It performs module-level dynamic offloading across accelerators on
unified-memory SoCs. By combining customized hardware design, system-level
scheduling, and optimized low-bit computation kernels, we demonstrate our
framework with a compact, battery-powered device capable of running LMMs
entirely on device. This prototype functions as a self-contained intelligent
assistant that requires no network connectivity, while achieving higher
throughput and superior power efficiency under strict resource constraints. The
design further bypasses CPU bottlenecks and reduces redundant memory usage
through token-aware buffer management and module-level coordination. Our system
outperforms existing implementations in resource efficiency, cutting energy
consumption by 42.3% and GPU memory usage by 11.2%. This enables a
battery-powered device to run LLaVA-OneVision with a camera for nearly half a
day and LLaMA-3-8B for voice interactions for almost 20.8 hours.
[LINK]http://arxiv.org/abs/2510.05109v2
[DATE]2025-10-27 22:17:43+08:00
[CATEGORIES]cs.CL
MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
[AUTHORS]Zonglin Yang, Wanhao Liu, Ben Gao, Tong Xie, Yuqiang Li, Wanli Ouyang, Soujanya Poria, Erik Cambria, Dongzhan Zhou
[ABSTRACT]Scientific discovery plays a pivotal role in advancing human society, and
recent progress in large language models (LLMs) suggests their potential to
accelerate this process. However, it remains unclear whether LLMs can
autonomously generate novel and valid hypotheses in chemistry. In this work, we
investigate whether LLMs can discover high-quality chemistry hypotheses given
only a research background (comprising a question and/or a survey), without
restriction on the domain of the question. We begin with the observation that
hypothesis discovery is a seemingly intractable task. To address this, we
propose a formal mathematical decomposition grounded in a fundamental
assumption: that most chemistry hypotheses can be composed from a research
background and a set of inspirations. This decomposition leads to three
practical subtasks (retrieving inspirations, composing hypotheses with
inspirations, and ranking hypotheses), which together constitute a sufficient
set of subtasks for the overall scientific discovery task. We further develop
an agentic LLM framework, MOOSE-Chem, that is a direct implementation of this
mathematical decomposition. To evaluate this framework, we construct a
benchmark of 51 high-impact chemistry papers published and online after January
2024, each manually annotated by PhD chemists with background, inspirations,
and hypothesis. The framework is able to rediscover many hypotheses with high
similarity to the ground truth, successfully capturing the core
innovations, while ensuring no data contamination since it uses an LLM with a
knowledge cutoff date prior to 2024. Finally, based on the LLM’s surprisingly high
accuracy on inspiration retrieval, a task with inherently out-of-distribution
nature, we propose a bold assumption: that LLMs may already encode latent
scientific knowledge associations not yet recognized by humans.
[COMMENTS]Accepted by ICLR 2025
[LINK]http://arxiv.org/abs/2410.07076v6
[DATE]2025-10-27 22:10:54+08:00
[CATEGORIES]cs.CL cs.LG
How AI Forecasts AI Jobs: Benchmarking LLM Predictions of Labor Market Changes
[AUTHORS]Sheri Osborn, Rohit Valecha, H. Raghav Rao, Dan Sass, Anthony Rios
[ABSTRACT]Artificial intelligence is reshaping labor markets, yet we lack tools to
systematically forecast its effects on employment. This paper introduces a
benchmark for evaluating how well large language models (LLMs) can anticipate
changes in job demand, especially in occupations affected by AI. Existing
research has shown that LLMs can extract sentiment, summarize economic reports,
and emulate forecaster behavior, but little work has assessed their use for
forward-looking labor prediction. Our benchmark combines two complementary
datasets: a high-frequency index of sector-level job postings in the United
States, and a global dataset of projected occupational changes due to AI
adoption. We format these data into forecasting tasks with clear temporal
splits, minimizing the risk of information leakage. We then evaluate LLMs using
multiple prompting strategies, comparing task-scaffolded, persona-driven, and
hybrid approaches across model families. We assess both quantitative accuracy
and qualitative consistency over time. Results show that structured task
prompts consistently improve forecast stability, while persona prompts offer
advantages on short-term trends. However, performance varies significantly
across sectors and horizons, highlighting the need for domain-aware prompting
and rigorous evaluation protocols. By releasing our benchmark, we aim to
support future research on labor forecasting, prompt design, and LLM-based
economic reasoning. This work contributes to a growing body of research on how
LLMs interact with real-world economic data, and provides a reproducible
testbed for studying the limits and opportunities of AI as a forecasting tool
in the context of labor markets.
[COMMENTS]8 pages + Limitations + References
[LINK]http://arxiv.org/abs/2510.23358v1
[DATE]2025-10-27 22:08:27+08:00
[CATEGORIES]cs.CL
Prompting is not Enough: Exploring Knowledge Integration and Controllable Generation
[AUTHORS]Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
[ABSTRACT]Open-domain question answering (OpenQA) represents a cornerstone in natural
language processing (NLP), primarily focused on extracting answers from
unstructured textual data. With the rapid advancements in Large Language Models
(LLMs), LLM-based OpenQA methods have reaped the benefits of emergent
understanding and answering capabilities enabled by massive parameters compared
to traditional methods. However, most of these methods encounter two critical
challenges: how to integrate knowledge into LLMs effectively and how to
adaptively generate results with specific answer formats for various task
situations. To address these challenges, we propose a novel framework named
GenKI, which aims to improve the OpenQA performance by exploring Knowledge
Integration and controllable Generation on LLMs simultaneously. Specifically,
we first train a dense passage retrieval model to retrieve associated knowledge
from a given knowledge base. Subsequently, we introduce a novel knowledge
integration model that incorporates the retrieved knowledge into instructions
during fine-tuning to strengthen the model. Furthermore, to enable controllable
generation in LLMs, we leverage a fine-tuned LLM and an ensemble based
on text consistency that incorporates coherence, fluency, and answer-format
assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO,
and CMRC2018 datasets, featuring diverse answer formats, have demonstrated the
effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover,
ablation studies have disclosed a linear relationship between the frequency of
retrieved knowledge and the model’s ability to recall knowledge accurately
against the ground truth. Our code of GenKI is available at
https://github.com/USTC-StarTeam/GenKI
[COMMENTS]13 pages, 5 figures
[LINK]http://arxiv.org/abs/2505.19660v3
[DATE]2025-10-27 22:08:24+08:00
[CATEGORIES]cs.CL
LightKGG: Simple and Efficient Knowledge Graph Generation from Textual Data
[AUTHORS]Teng Lin
[ABSTRACT]The scarcity of high-quality knowledge graphs (KGs) remains a critical
bottleneck for downstream AI applications, as existing extraction methods rely
heavily on error-prone pattern-matching techniques or resource-intensive large
language models (LLMs). While recent tools leverage LLMs to generate KGs, their
computational demands limit accessibility for low-resource environments. Our
paper introduces LightKGG, a novel framework that enables efficient KG
extraction from textual data using small-scale language models (SLMs) through
two key technical innovations: (1) Context-integrated Graph extraction
integrates contextual information with nodes and edges into a unified graph
structure, reducing the reliance on complex semantic processing while
maintaining more key information; (2) Topology-enhanced relationship inference
leverages the inherent topology of the extracted graph to efficiently infer
relationships, enabling relationship discovery without relying on complex
language understanding capabilities of LLMs. By enabling accurate KG
construction with minimal hardware requirements, this work bridges the gap
between automated knowledge extraction and practical deployment scenarios while
introducing scientifically rigorous methods for optimizing SLM efficiency in
structured NLP tasks.
[LINK]http://arxiv.org/abs/2510.23341v1
[DATE]2025-10-27 21:55:13+08:00
[CATEGORIES]cs.CL
Planning Ahead with RSA: Efficient Signalling in Dynamic Environments by Projecting User Awareness across Future Timesteps
[AUTHORS]Anwesha Das, John Duff, Jörg Hoffmann, Vera Demberg
[ABSTRACT]Adaptive agent design offers a way to improve human-AI collaboration on
time-sensitive tasks in rapidly changing environments. In such cases, to ensure
the human maintains an accurate understanding of critical task elements, an
assistive agent must not only identify the highest priority information but
also estimate how and when this information can be communicated most
effectively, given that human attention represents a zero-sum cognitive
resource where focus on one message diminishes awareness of other or upcoming
information. We introduce a theoretical framework for adaptive signalling which
meets these challenges by using principles of rational communication,
formalised as Bayesian reference resolution using the Rational Speech Act (RSA)
modelling framework, to plan a sequence of messages which optimise timely
alignment between user belief and a dynamic environment. The agent adapts
message specificity and timing to the particulars of a user and scenario based
on projections of how prior-guided interpretation of messages will influence
attention to the interface and subsequent belief update, across several
timesteps out to a fixed horizon. In a comparison to baseline methods, we show
that this effectiveness depends crucially on combining multi-step planning with
a realistic model of user awareness. As the first application of RSA for
communication in a dynamic environment, and for human-AI interaction in
general, we establish theoretical foundations for pragmatic communication in
human-agent teams, highlighting how insights from cognitive science can be
capitalised on to inform the design of assistive agents.
[COMMENTS]11 pages, 3 figures
[LINK]http://arxiv.org/abs/2510.23340v1
[DATE]2025-10-27 21:54:54+08:00
[CATEGORIES]cs.CL
LLMs can hide text in other text of the same length
[AUTHORS]Antonio Norelli, Michael Bronstein
[ABSTRACT]A meaningful text can be hidden inside another, completely different yet
still coherent and plausible, text of the same length. For example, a tweet
containing a harsh political critique could be embedded in a tweet that
celebrates the same political leader, or an ordinary product review could
conceal a secret manuscript. This uncanny state of affairs is now possible
thanks to Large Language Models, and in this paper we present a simple and
efficient protocol to achieve it. We show that even modest 8-billion-parameter
open-source LLMs are sufficient to obtain high-quality results, and a message
as long as this abstract can be encoded and decoded locally on a laptop in
seconds. The existence of such a protocol demonstrates a radical decoupling of
text from authorial intent, further eroding trust in written communication,
already shaken by the rise of LLM chatbots. We illustrate this with a concrete
scenario: a company could covertly deploy an unfiltered LLM by encoding its
answers within the compliant responses of a safe model. This possibility raises
urgent questions for AI safety and challenges our understanding of what it
means for a Large Language Model to know something.
[COMMENTS]21 pages, main paper 9 pages
[LINK]http://arxiv.org/abs/2510.20075v3
[DATE]2025-10-27 21:54:40+08:00
[CATEGORIES]cs.CL cs.LG
BaZi-Based Character Simulation Benchmark: Evaluating AI on Temporal and Persona Reasoning
[AUTHORS]Siyuan Zheng, Pai Liu, Xi Chen, Jizheng Dong, Sihan Jia
[ABSTRACT]Human-like virtual characters are crucial for games, storytelling, and
virtual reality, yet current methods rely heavily on annotated data or
handcrafted persona prompts, making it difficult to scale up and generate
realistic, contextually coherent personas. We create the first QA dataset for
BaZi-based persona reasoning, where real human experiences categorized into
wealth, health, kinship, career, and relationships are represented as
life-event questions and answers. Furthermore, we propose the first BaZi-LLM
system that integrates symbolic reasoning with large language models to
generate temporally dynamic and fine-grained virtual personas. Compared with
mainstream LLMs such as DeepSeek-v3 and GPT-5-mini, our method achieves a
30.3%-62.6% accuracy improvement. In addition, when incorrect BaZi information
is used, our model’s accuracy drops by 20%-45%, showing the potential of
culturally grounded symbolic-LLM integration for realistic character
simulation.
[LINK]http://arxiv.org/abs/2510.23337v1
[DATE]2025-10-27 21:51:13+08:00
[CATEGORIES]cs.CL
Adaptive Blockwise Search: Inference-Time Alignment for Large Language Models
[AUTHORS]Mohammad Atif Quamar, Mohammad Areeb, Nishant Sharma, Ananth Shreekumar, Jonathan Rosenthal, Muslum Ozgur Ozmen, Mikhail Kuznetsov, Z. Berkay Celik
[ABSTRACT]LLM alignment remains a critical challenge. Inference-time methods provide a
flexible alternative to fine-tuning, but their uniform computational effort
often yields suboptimal alignment. We hypothesize that for many alignment
tasks, the initial tokens of a response are disproportionately more critical.
To leverage this principle, we introduce AdaSearch, a novel blockwise search
strategy. It adaptively allocates a fixed computational budget using a sampling
schedule, focusing search effort on these critical tokens. We apply AdaSearch
to sequential decoding and introduce its tree-search counterpart, AdaBeam. Our
comprehensive evaluation across eight LLMs demonstrates that AdaSearch
outperforms strong Best-of-N and fine-tuning baselines. Specifically, win-rates
improve by over 10% for harmlessness generation, controlled sentiment
generation, and mathematical reasoning tasks relative to Best-of-N.
[LINK]http://arxiv.org/abs/2510.23334v1
[DATE]2025-10-27 21:48:59+08:00
[CATEGORIES]cs.CL
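As a rough illustration of the AdaSearch entry above, the idea of spending a fixed sampling budget unevenly across response blocks can be sketched as follows. This is a minimal sketch and not the authors' implementation: the abstract does not specify the schedule, so the geometric decay, the function name `allocate_budget`, and the `decay` parameter are all assumptions made for illustration.

```python
from typing import List


def allocate_budget(total_samples: int, num_blocks: int, decay: float = 0.5) -> List[int]:
    """Split a fixed sample budget across blocks, front-loading early blocks.

    Hypothetical schedule: block i gets a share proportional to decay**i,
    with at least one sample per block. The geometric shape is an assumption.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    if total_samples < num_blocks:
        raise ValueError("need at least one sample per block")
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")

    weights = [decay ** i for i in range(num_blocks)]
    total_weight = sum(weights)

    # Reserve one sample per block, then share the rest proportionally.
    remaining = total_samples - num_blocks
    counts = [1 + int(remaining * w / total_weight) for w in weights]

    # Samples lost to rounding down go to the earliest blocks first.
    leftover = total_samples - sum(counts)
    for i in range(leftover):
        counts[i % num_blocks] += 1
    return counts


if __name__ == "__main__":
    print(allocate_budget(total_samples=32, num_blocks=4, decay=0.5))
```

The returned counts always sum to the budget, so comparing against a uniform Best-of-N baseline with the same total cost stays fair; setting `decay=1.0` recovers a uniform split.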
LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization
[AUTHORS]Máté Gedeon, Péter Mihajlik
[ABSTRACT]We introduce LibriConvo, a simulated multi-speaker conversational dataset
based on speaker-aware conversation simulation (SASC), designed to support
training and evaluation of speaker diarization and automatic speech recognition
(ASR) systems. Unlike prior resources that mostly rely on semantically
disconnected utterances and implausible temporal gaps, LibriConvo ensures
semantic coherence and realistic conversational timing. Our pipeline leverages
CallHome with external VAD for reliable boundaries, applies compression to
reduce unnaturally long silences, and organizes LibriTTS utterances by book to
maintain contextual consistency. Acoustic realism is enhanced via a novel room
impulse response selection procedure that ranks speaker-microphone
configurations by spatial plausibility, balancing realism and diversity. The
dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers,
split in a speaker-disjoint manner for robust evaluation. Baselines show that
the sortformer model outperforms the pyannote pipeline in diarization, while a
fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves
7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides
a valuable resource for advancing multi-speaker speech processing research with
realistic conversational dynamics and controlled experimental conditions.
[COMMENTS]Submitted to LREC 2026
[LINK]http://arxiv.org/abs/2510.23320v1
[DATE]2025-10-27 21:35:22+08:00
[CATEGORIES]cs.CL
MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search
[AUTHORS]Zonglin Yang, Wanhao Liu, Ben Gao, Yujie Liu, Wei Li, Tong Xie, Lidong Bing, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
[ABSTRACT]Large language models (LLMs) have shown promise in automating scientific
hypothesis generation, yet existing approaches primarily yield coarse-grained
hypotheses lacking critical methodological and experimental details. We
introduce and formally define the new task of fine-grained scientific
hypothesis discovery, which entails generating detailed, experimentally
actionable hypotheses from coarse initial research directions. We frame this as
a combinatorial optimization problem and investigate the upper limits of LLMs’
capacity to solve it when maximally leveraged. Specifically, we explore four
foundational questions: (1) how to best harness an LLM’s internal heuristics to
formulate the fine-grained hypothesis it itself would judge as the most
promising among all the possible hypotheses it might generate, based on its own
internal scoring, thus defining a latent reward landscape over the hypothesis
space; (2) whether such LLM-judged better hypotheses exhibit stronger alignment
with ground-truth hypotheses; (3) whether shaping the reward landscape using an
ensemble of diverse LLMs of similar capacity yields better outcomes than
defining it with repeated instances of the strongest LLM among them; and (4)
whether an ensemble of identical LLMs provides a more reliable reward landscape
than a single LLM. To address these questions, we propose a hierarchical search
method that incrementally proposes and integrates details into the hypothesis,
progressing from general concepts to specific experimental configurations. We
show that this hierarchical process smooths the reward landscape and enables
more effective optimization. Empirical evaluations on a new benchmark of
expert-annotated fine-grained hypotheses from recent literature show that our
method consistently outperforms strong baselines.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.19209v2
[DATE]2025-10-27 21:16:36+08:00
[CATEGORIES]cs.CL
The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation
[AUTHORS]Patrick Kahardipraja, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
[ABSTRACT]Large language models are able to exploit in-context learning to access
external knowledge beyond their training data through retrieval-augmentation.
While promising, its inner workings remain unclear. In this work, we shed light
on the mechanism of in-context retrieval augmentation for question answering by
viewing a prompt as a composition of informational components. We propose an
attribution-based method to identify specialized attention heads, revealing
in-context heads that comprehend instructions and retrieve relevant contextual
information, and parametric heads that store entities’ relational knowledge. To
better understand their roles, we extract function vectors and modify their
attention weights to show how they can influence the answer generation process.
Finally, we leverage the gained insights to trace the sources of knowledge used
during inference, paving the way towards safer and more transparent language
models.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.15807v2
[DATE]2025-10-27 21:12:32+08:00
[CATEGORIES]cs.CL cs.LG
TaoSR1: The Thinking Model for E-commerce Relevance Search
[AUTHORS]Chenhe Dong, Shaowei Yao, Pengkun Jiao, Jianhui Yang, Yiming Jin, Zerui Huang, Xiaojiang Zhou, Dan Ou, Haihong Tang, Bo Zheng
[ABSTRACT]Query-product relevance prediction is a core task in e-commerce search.
BERT-based models excel at semantic matching but lack complex reasoning
capabilities. While Large Language Models (LLMs) are explored, most still use
discriminative fine-tuning or distill to smaller models for deployment. We
propose a framework to directly deploy LLMs for this task, addressing key
challenges: Chain-of-Thought (CoT) error accumulation, discriminative
hallucination, and deployment feasibility. Our framework, TaoSR1, involves
three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning;
(2) Offline sampling with a pass@N strategy and Direct Preference Optimization
(DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling
with Group Relative Policy Optimization (GRPO) to mitigate discriminative
hallucination. Additionally, post-CoT processing and a cumulative
probability-based partitioning method enable efficient online deployment.
TaoSR1 significantly outperforms baselines on offline datasets and achieves
substantial gains in online side-by-side human evaluations, introducing a novel
paradigm for applying CoT reasoning to relevance classification.
[LINK]http://arxiv.org/abs/2508.12365v2
[DATE]2025-10-27 21:03:18+08:00
[CATEGORIES]cs.CL
Thought Anchors: Which LLM Reasoning Steps Matter?
[AUTHORS]Paul C. Bogdan, Uzay Macar, Neel Nanda, Arthur Conmy
[ABSTRACT]Current frontier large-language models rely on reasoning to achieve
state-of-the-art performance. Many existing interpretability methods are limited in
this area, as standard methods have been designed to study single forward
passes of a model rather than the multi-token computational steps that unfold
during reasoning. We argue that analyzing reasoning traces at the sentence
level is a promising approach to understanding reasoning processes. We
introduce a black-box method that measures each sentence’s counterfactual
importance by repeatedly sampling replacement sentences from the model,
filtering for semantically different ones, and continuing the chain of thought
from that point onwards to quantify the sentence’s impact on the distribution
of final answers. We discover that certain sentences can have an outsized
impact on the trajectory of the reasoning trace and final answer. We term these
sentences thought anchors. These are generally planning or uncertainty
management sentences, and specialized attention heads consistently attend from
subsequent sentences to thought anchors. We further show that examining
sentence-sentence causal links within a reasoning trace gives insight into a
model’s behavior. Such information can be used to predict a problem’s
difficulty and the extent to which different question domains involve sequential or
diffuse reasoning. As a proof-of-concept, we demonstrate that our techniques
together provide a practical toolkit for analyzing reasoning models by
conducting a detailed case study of how the model solves a difficult math
problem, finding that our techniques yield a consistent picture of the
reasoning trace’s structure. We provide an open-source tool
(thought-anchors.com) for visualizing the outputs of our methods on further
problems. The convergence across our methods shows the potential of
sentence-level analysis for a deeper understanding of reasoning models.
[COMMENTS]Paul C. Bogdan and Uzay Macar contributed equally to this work, and
their listed order was determined by coinflip. Neel Nanda and Arthur Conmy
contributed equally to this work as senior authors, and their listed order
was determined by coinflip
[LINK]http://arxiv.org/abs/2506.19143v4
[DATE]2025-10-27 20:36:23+08:00
[CATEGORIES]cs.LG cs.CL
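The counterfactual-importance measurement described in the Thought Anchors entry above can be sketched as a comparison of two final-answer distributions: one from rollouts that keep the sentence, one from rollouts where it was replaced by a semantically different resample. This is a minimal sketch under stated assumptions: the abstract only says the impact on the distribution of final answers is quantified, so the use of KL divergence, the smoothing constant, and the function names are illustrative choices, and model sampling itself is out of scope (the inputs are plain lists of final answers).

```python
import math
from collections import Counter
from typing import Dict, Sequence


def answer_distribution(
    answers: Sequence[str], support: Sequence[str], smoothing: float = 1e-9
) -> Dict[str, float]:
    """Empirical distribution over `support`, lightly smoothed to avoid log(0)."""
    counts = Counter(answers)
    total = len(answers) + smoothing * len(support)
    return {a: (counts[a] + smoothing) / total for a in support}


def counterfactual_importance(
    answers_with_sentence: Sequence[str],
    answers_resampled: Sequence[str],
) -> float:
    """KL(resampled || original) over final answers (an assumed metric).

    Larger values mean that replacing the sentence shifts the final-answer
    distribution more, i.e. the sentence matters more to the outcome.
    """
    support = sorted(set(answers_with_sentence) | set(answers_resampled))
    p = answer_distribution(answers_resampled, support)
    q = answer_distribution(answers_with_sentence, support)
    return sum(p[a] * math.log(p[a] / q[a]) for a in support)


if __name__ == "__main__":
    kept = ["42"] * 10                   # rollouts continuing from the original sentence
    replaced = ["42"] * 5 + ["7"] * 5    # rollouts after resampling the sentence
    print(counterfactual_importance(kept, replaced))
```

A sentence whose replacement leaves the answer distribution unchanged scores zero; ranking sentences by this score gives one way to surface candidate anchors for closer inspection.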
Code Aesthetics with Agentic Reward Feedback
[AUTHORS]Bang Xiao, Lingjie Jiang, Shaohan Huang, Tengchao Lv, Yupan Huang, Xun Wu, Lei Cui, Furu Wei
[ABSTRACT]Large Language Models (LLMs) have become valuable assistants for developers
in code-related tasks. While LLMs excel at traditional programming tasks such
as code generation and bug fixing, they struggle with visually-oriented coding
tasks, often producing suboptimal aesthetics. In this paper, we introduce a new
pipeline to enhance the aesthetic quality of LLM-generated code. We first
construct AesCode-358K, a large-scale instruction-tuning dataset focused on
code aesthetics. Next, we propose agentic reward feedback, a multi-agent system
that evaluates executability, static aesthetics, and interactive aesthetics.
Building on this, we develop GRPO-AR, which integrates these signals into the
GRPO algorithm for joint optimization of functionality and code aesthetics.
Finally, we develop OpenDesign, a benchmark for assessing code aesthetics.
Experimental results show that combining supervised fine-tuning on AesCode-358K
with reinforcement learning using agentic reward feedback significantly
improves performance on OpenDesign and also enhances results on existing
benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o
and GPT-4.1, and achieves performance comparable to large open-source models
with 480B-685B parameters, underscoring the effectiveness of our approach.
[COMMENTS]30 pages, 7 figures
[LINK]http://arxiv.org/abs/2510.23272v1
[DATE]2025-10-27 20:32:33+08:00
[CATEGORIES]cs.CL
Mubeen AI: A Specialized Arabic Language Model for Heritage Preservation and User Intent Understanding
[AUTHORS]Mohammed Aljafari, Ismail Alturki, Ahmed Mori, Yehya Kadumi
[ABSTRACT]Mubeen is a proprietary Arabic language model developed by MASARAT SA,
optimized for deep understanding of Arabic linguistics, Islamic studies, and
cultural heritage. Trained on an extensive collection of authentic Arabic
sources significantly expanded by digitizing historical manuscripts via a
proprietary Arabic OCR engine, the model incorporates seminal scholarly works
in linguistics, jurisprudence, hadith, and Quranic exegesis, alongside
thousands of academic theses and peer-reviewed research papers. Conditioned
through a deep linguistic engineering framework, Mubeen masters not just the
meaning but the eloquence of Arabic, enabling precise understanding across
classical texts, contemporary writing, and regional dialects with focus on
comprehending user intent and delivering accurate, contextually relevant
responses. Unlike other Arabic models relying on translated English data that
often fail in intent detection or retrieval-augmented generation (RAG), Mubeen
uses native Arabic sources to ensure cultural authenticity and accuracy. Its
core innovation is the Practical Closure Architecture, designed to solve the
“Utility Gap Crisis” where factually correct answers fail to resolve users’
core needs, forcing them into frustrating cycles of re-prompting. By
prioritizing clarity and decisive guidance, Mubeen transforms from an
information repository into a decisive guide, aligning with Saudi Vision 2030.
The model’s architecture combines deep heritage specialization with
multi-disciplinary expert modules, enabling robust performance across both
cultural preservation and general knowledge domains.
[COMMENTS]21 pages, 2 figures, 3 tables. Includes appendices on ethical
guidelines and training framework. Submitted September 04, 2025
[LINK]http://arxiv.org/abs/2510.23271v1
[DATE]2025-10-27 20:29:27+08:00
[CATEGORIES]cs.CL
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model
[AUTHORS]Chen Wang, Tianyu Peng, Wen Yang, Yinan Bai, Guangfu Wang, Jun Lin, Lanpeng Jia, Lingxiang Wu, Jinqiao Wang, Chengqing Zong, Jiajun Zhang
[ABSTRACT]Empathetic interaction is a cornerstone of human-machine communication, due
to the need for understanding speech enriched with paralinguistic cues and
generating emotional and expressive responses. However, the most powerful
empathetic LSLMs are increasingly closed off, leaving the crucial details about
the architecture, data and development opaque to researchers. Given the
critical need for transparent research into the LSLMs and empathetic behavior,
we present OpenS2S, a fully open-source, transparent and end-to-end LSLM
designed to enable empathetic speech interactions. Based on our empathetic
speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved
decoding architecture to achieve low-latency speech generation. To facilitate
end-to-end training, OpenS2S incorporates an automated data construction
pipeline that synthesizes diverse, high-quality empathetic speech dialogues at
low cost. By leveraging large language models to generate empathetic content
and controllable text-to-speech systems to introduce speaker and emotional
variation, we construct a scalable training corpus with rich paralinguistic
diversity and minimal human supervision. We release the fully open-source
OpenS2S model, including the dataset, model weights, pre-training and
fine-tuning code, to empower the broader research community and accelerate
innovation in empathetic speech systems. The project webpage can be accessed at
https://casia-lm.github.io/OpenS2S
[COMMENTS]Technical Report, Update on OpenS2S_v1.5
[LINK]http://arxiv.org/abs/2507.05177v3
[DATE]2025-10-27 19:59:16+08:00
[CATEGORIES]cs.CL
Input Matters: Evaluating Input Structure’s Impact on LLM Summaries of Sports Play-by-Play
[AUTHORS]Barkavi Sundararajan, Somayajulu Sripada, Ehud Reiter
[ABSTRACT]A major concern when deploying LLMs in accuracy-critical domains such as
sports reporting is that the generated text may not faithfully reflect the
input data. We quantify how input structure affects hallucinations and other
factual errors in LLM-generated summaries of NBA play-by-play data, across
three formats: row-structured, JSON and unstructured. We manually annotated
3,312 factual errors across 180 game summaries produced by two models,
Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input
reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured
input, while row-structured input reduces errors by 54% for Llama and 51% for
Qwen. A two-way repeated measures ANOVA shows that input structure accounts for
over 80% of the variance in error rates, with Tukey HSD post hoc tests
confirming statistically significant differences between all input formats.
[COMMENTS]Accepted at INLG 2025
[LINK]http://arxiv.org/abs/2510.21034v2
[DATE]2025-10-27 19:49:08+08:00
[CATEGORIES]cs.CL
Process Reward Models for Sentence-Level Verification of LVLM Radiology Reports
[AUTHORS]Alois Thomas, Maya Varma, Jean-Benoit Delbrouck, Curtis P. Langlotz
[ABSTRACT]Automating radiology report generation with Large Vision-Language Models
(LVLMs) holds great potential, yet these models often produce clinically
critical hallucinations, posing serious risks. Existing hallucination detection
methods frequently lack the necessary sentence-level granularity or robust
generalization across different LVLM generators. We introduce a novel approach:
a sentence-level Process Reward Model (PRM) adapted for this vision-language
task. Our PRM predicts the factual correctness of each generated sentence,
conditioned on clinical context and preceding text. When fine-tuned on
MIMIC-CXR with weakly-supervised labels, a lightweight 0.5B-parameter PRM
outperforms existing verification techniques, demonstrating, for instance,
relative improvements of 7.5% in Matthews Correlation Coefficient and 1.8% in
AUROC over strong white-box baselines on outputs from one LVLM. Unlike methods
reliant on internal model states, our PRM demonstrates strong generalization to
an unseen LVLM. We further show its practical utility: PRM scores effectively
filter low-quality reports, improving F1-CheXbert scores by 4.5% (when
discarding the worst 10% of reports). Moreover, when guiding a novel weighted
best-of-N selection process on the MIMIC-CXR test set, our PRM shows relative
improvements in clinical metrics of 7.4% for F1-CheXbert and 0.6% for
BERTScore. These results demonstrate that a lightweight, context-aware PRM
provides a model-agnostic safety layer for clinical LVLMs without access to
internal activations.
[LINK]http://arxiv.org/abs/2510.23217v1
[DATE]2025-10-27 19:08:05+08:00
[CATEGORIES]cs.CL
Cohort Discovery: A Survey on LLM-Assisted Clinical Trial Recruitment
[AUTHORS]Shrestha Ghosh, Moritz Schneider, Carina Reinicke, Carsten Eickhoff
[ABSTRACT]Recent advances in LLMs have greatly improved general-domain NLP tasks. Yet,
their adoption in critical domains, such as clinical trial recruitment, remains
limited. As trials are designed in natural language and patient data is
represented as both structured and unstructured text, the task of matching
trials and patients benefits from knowledge aggregation and reasoning abilities
of LLMs. Classical approaches are trial-specific, and LLMs, with their ability to
consolidate distributed knowledge, hold the potential to build a more general
solution. Yet recent applications of LLM-assisted methods rely on proprietary
models and weak evaluation benchmarks. In this survey, we are the first to
analyze the task of trial-patient matching and contextualize emerging LLM-based
approaches in clinical trial recruitment. We critically examine existing
benchmarks, approaches and evaluation frameworks, the challenges to adopting
LLM technologies in clinical research and exciting future directions.
[LINK]http://arxiv.org/abs/2506.15301v2
[DATE]2025-10-27 18:59:03+08:00
[CATEGORIES]cs.CL
PTPP-Aware Adaptation Scaling Laws: Predicting Domain-Adaptation Performance at Unseen Pre-Training Budgets
[AUTHORS]Etienne Goffinet, Shane Bergsma, Avraham Sheinin, Natalia Vassilieva, Shaheer Muhammad, Preslav Nakov, Gurpreet Gosal
[ABSTRACT]Continual pre-training (CPT) for domain adaptation must balance target-domain
gains with stability on the base domain. Existing CPT scaling laws typically
assume a fixed pre-training budget, which limits their ability to forecast
adaptation outcomes for models trained at different tokens-per-parameter
(PTPP). We present PTPP-aware adaptation scaling laws that make the
pre-training budget an explicit variable, enabling accurate prediction
of adaptation loss at unseen PTPP. On a multilingual setup (English/Arabic
$\rightarrow$ French), PTPP-aware formulations trained on early stages
(PTPP $= \{15, 31\}$) predict target loss at PTPP $= 279$ and outperform a
PTPP-agnostic D-CPT transfer baseline on metrics (Huber-on-log,
MAE$_\mathrm{rel}$, calibration slope); full diagnostics (RMSE, MAPE) are in
the appendix. Beyond forecasting, we show a practical use case: planning replay
ratios and adaptation token budgets that satisfy target and forgetting
constraints under compute limits.
[LINK]http://arxiv.org/abs/2510.23198v1
[DATE]2025-10-27 18:36:15+08:00
[CATEGORIES]cs.LG cs.CL
Beyond Direct Generation: A Decomposed Approach to Well-Crafted Screenwriting with LLMs
[AUTHORS]Hang Lei, Shengyi Zong, Zhaoyan Li, Ziren Zhou, Hao Liu
[ABSTRACT]The screenplay serves as the foundation for television production, defining
narrative structure, character development, and dialogue. While Large Language
Models (LLMs) show great potential in creative writing, direct end-to-end
generation approaches often fail to produce well-crafted screenplays. We argue
this failure stems from forcing a single model to simultaneously master two
disparate capabilities: creative narrative construction and rigid format
adherence. The resulting outputs may mimic superficial style but lack the deep
structural integrity and storytelling substance required for professional use.
To enable LLMs to generate high-quality screenplays, we introduce Dual-Stage
Refinement (DSR), a decomposed framework that decouples creative narrative
generation from format conversion. The first stage transforms a brief outline
into rich, novel-style prose. The second stage refines this narrative into a
professionally formatted screenplay. This separation enables the model to
specialize in one distinct capability at each stage. A key challenge in
implementing DSR is the scarcity of paired outline-to-novel training data. We
address this through hybrid data synthesis: reverse synthesis deconstructs
existing screenplays into structured inputs, while forward synthesis leverages
these inputs to generate high-quality narrative texts as training targets.
Blind evaluations by professional screenwriters show that DSR achieves a 75%
win rate against strong baselines like Gemini-2.5-Pro and reaches 82.7% of
human-level performance. Our work demonstrates that decomposed generation
architecture with tailored data synthesis effectively specializes LLMs in
complex creative domains.
[LINK]http://arxiv.org/abs/2510.23163v1
[DATE]2025-10-27 17:41:29+08:00
[CATEGORIES]cs.CL
ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix
[AUTHORS]Zile Yang, Ling Li, Na Di, Jinlong Pang, Yao Zhou, Hao Cheng, Bo Han, Jiaheng Wei
[ABSTRACT]Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs)
to domain-specific instructions by training on a carefully curated subset of
high-quality instruction-response pairs, typically drawn from a larger dataset
that often contains many low-quality or noisy samples. However, existing
quality-first paradigms often overlook valuable signals in discarded
low-quality data and rely on imperfect quality filters. We introduce ENTP
(Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a
framework that revitalizes low-quality corpora through symbolic purification
and neural reconstruction. The symbolic module identifies and prunes noisy
samples based on statistical priors, while the neural component synthesizes
enriched instruction-response pairs by leveraging latent representations and
model knowledge. This neural-symbolic synergy enhances data informativeness and
diversity. Experiments show that ENTP-augmented datasets, constructed
exclusively from low-quality data, outperform 13 established data-selection
baselines across five instruction-following benchmarks, and even surpass
fine-tuning on the full original dataset (approximately 300K examples). Our
results highlight the untapped potential of low-quality data and underscore the
importance of intelligent purification and synthesis for efficient instruction
alignment.
[LINK]http://arxiv.org/abs/2510.23160v1
[DATE]2025-10-27 17:39:22+08:00
[CATEGORIES]cs.CL
Rethinking GSPO: The Perplexity-Entropy Equivalence
[AUTHORS]Chi Liu
[ABSTRACT]We provide a new perspective on GSPO’s length-normalized importance ratios by
establishing their connection to information-theoretic quantities. We show that
GSPO’s sequence-level weight $s(\theta) =
(\pi_\theta/\pi_{\theta_{\text{old}}})^{1/|y|}$ can be equivalently expressed
as the inverse perplexity ratio
$\text{PPL}_{\theta_{\text{old}}}/\text{PPL}_\theta$ and as the exponential
cross-entropy change $\exp(\Delta H)$. While the perplexity-entropy
relationship follows from standard definitions, this observation provides a
useful lens for understanding GSPO: the algorithm weights policy gradient
updates by perplexity ratios, offering an information-theoretic interpretation
of the importance weights. This perspective helps explain GSPO’s empirical
properties, including log-domain variance reduction through geometric averaging
and stability in training mixture-of-experts models. We validate the
mathematical equivalences and variance predictions through controlled
experiments on mathematical reasoning tasks.
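
The equivalence stated in this abstract can be checked numerically. The following is a minimal sketch, not the author's code: it assumes per-token log-probabilities of one sampled sequence under the old and new policies, and it assumes the sign convention $\Delta H = H_{\text{old}} - H_\theta$, where $H$ is the average negative log-likelihood per token. The function names and the sample values are hypothetical.

```python
import math

def sequence_weight(logp_new, logp_old):
    """Length-normalized importance ratio (pi_new / pi_old) ** (1 / |y|)."""
    n = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / n)

def cross_entropy(logp):
    """Average negative log-likelihood per token of the sampled sequence."""
    return -sum(logp) / len(logp)

def perplexity(logp):
    """Perplexity is the exponential of the per-token cross-entropy."""
    return math.exp(cross_entropy(logp))

# Hypothetical per-token log-probs for a 4-token response.
logp_old = [-1.2, -0.7, -2.0, -0.3]
logp_new = [-1.0, -0.9, -1.5, -0.2]

s = sequence_weight(logp_new, logp_old)
ppl_ratio = perplexity(logp_old) / perplexity(logp_new)
exp_delta_h = math.exp(cross_entropy(logp_old) - cross_entropy(logp_new))
```

Under these assumptions the three quantities agree up to floating-point error, because each is the exponential of the same average log-probability difference.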
[COMMENTS]10 pages, 2 figures
[LINK]http://arxiv.org/abs/2510.23142v1
[DATE]2025-10-27 17:19:10+08:00
[CATEGORIES]cs.LG cs.CL
Corpus Frequencies in Morphological Inflection: Do They Matter?
[AUTHORS]Tomáš Sourada, Jana Straková
[ABSTRACT]The traditional approach to morphological inflection (the task of modifying a
base word (lemma) to express grammatical categories) has been, for decades, to
consider lexical entries of lemma-tag-form triples uniformly, lacking any
information about their frequency distribution. However, in production
deployment, one might expect the user inputs to reflect a real-world
distribution of frequencies in natural texts. With future deployment in mind,
we explore the incorporation of corpus frequency information into the task of
morphological inflection along three key dimensions during system development:
(i) for train-dev-test split, we combine a lemma-disjoint approach, which
evaluates the model’s generalization capabilities, with a frequency-weighted
strategy to better reflect the realistic distribution of items across different
frequency bands in training and test sets; (ii) for evaluation, we complement
the standard type accuracy (often referred to simply as accuracy), which treats
all items equally regardless of frequency, with token accuracy, which assigns
greater weight to frequent words and better approximates performance on running
text; (iii) for training data sampling, we introduce a method novel in the
context of inflection, frequency-aware training, which explicitly incorporates
word frequency into the sampling process. We show that frequency-aware training
outperforms uniform sampling in 26 out of 43 languages.
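
The difference between type accuracy and token accuracy described in (ii) can be illustrated with a small sketch. It assumes each evaluation item carries a corpus frequency and a flag for whether the predicted form was correct; the function names and the toy data are hypothetical and not taken from the paper.

```python
def type_accuracy(items):
    """Every lemma-tag-form item counts once, regardless of frequency."""
    return sum(1 for _, correct in items if correct) / len(items)

def token_accuracy(items):
    """Each item is weighted by its corpus frequency."""
    total = sum(freq for freq, _ in items)
    return sum(freq for freq, correct in items if correct) / total

# Hypothetical (corpus_frequency, prediction_correct) pairs.
items = [(900, True), (90, True), (9, False), (1, False)]

type_acc = type_accuracy(items)    # 2 of 4 items correct
token_acc = token_accuracy(items)  # 990 of 1000 running-text tokens correct
```

In this toy case the two errors fall on rare items, so token accuracy is far higher than type accuracy; errors on frequent items would reverse the picture.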
[COMMENTS]Published in the proceedings of ITAT 2025. 15 pages, 1 figure, 4
tables
[LINK]http://arxiv.org/abs/2510.23131v1
[DATE]2025-10-27 17:12:04+08:00
[CATEGORIES]cs.CL
First SFT, Second RL, Third UPT: Continual Improving Multi-Modal LLM Reasoning via Unsupervised Post-Training
[AUTHORS]Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
[ABSTRACT]Improving Multi-modal Large Language Models (MLLMs) in the post-training
stage typically relies on supervised fine-tuning (SFT) or reinforcement
learning (RL), which require expensive and manually annotated multi-modal
data, an ultimately unsustainable resource. This limitation has motivated a
growing interest in unsupervised paradigms as a third stage of post-training
after SFT and RL. While recent efforts have explored this direction, their
methods are complex and difficult to iterate. To address this, we propose
MM-UPT, a simple yet effective framework for unsupervised post-training of
MLLMs, enabling continual self-improvement without any external supervision.
The training method of MM-UPT builds upon GRPO, replacing traditional reward
signals with a self-rewarding mechanism based on majority voting over multiple
sampled responses. Our experiments demonstrate that this training method
effectively improves the reasoning ability of Qwen2.5-VL-7B (e.g.,
66.3% $\rightarrow$ 72.9% on MathVista, 62.9% $\rightarrow$ 68.7% on We-Math),
using standard datasets without ground truth labels. To further explore
scalability, we extend our framework to a data self-generation setting,
designing two strategies that prompt the MLLM to synthesize new training
samples on its own. Additional experiments show that combining these synthetic
data with the unsupervised training method can also boost performance,
highlighting a promising approach for scalable self-improvement. Overall,
MM-UPT offers a new paradigm for autonomous enhancement of MLLMs, serving as a
critical third step after initial SFT and RL in the absence of external
supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
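
A self-rewarding signal based on majority voting, as described in this abstract, might be sketched as follows. This is an illustration under stated assumptions, not the MM-UPT implementation: it assumes final answers have already been extracted from the sampled responses as strings, uses a binary reward, and breaks ties in favor of the answer seen first. The function name is hypothetical.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Reward 1.0 for responses matching the most common answer, else 0.0."""
    counts = Counter(answers)
    majority_answer, _ = counts.most_common(1)[0]
    return [1.0 if a == majority_answer else 0.0 for a in answers]

# Hypothetical final answers extracted from five sampled responses.
sampled_answers = ["42", "42", "17", "42", "8"]
rewards = majority_vote_rewards(sampled_answers)
```

Rewards of this kind could then stand in for ground-truth rewards in a GRPO-style update; how the paper handles ties or answer normalization is not stated in the abstract.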
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.22453v2
[DATE]2025-10-27 17:06:32+08:00
[CATEGORIES]cs.CL cs.LG
Beyond Higher Rank: Token-wise Input-Output Projections for Efficient Low-Rank Adaptation
[AUTHORS]Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Ziqiang Cui, Dugang Liu, Yuhua Li, Xiuqiang He, Ruixuan Li
[ABSTRACT]Low-rank adaptation (LoRA) is a parameter-efficient fine-tuning (PEFT) method
widely used in large language models (LLMs). LoRA essentially describes the
projection of an input space into a low-dimensional output space, with the
dimensionality determined by the LoRA rank. In standard LoRA, all input tokens
share the same weights and undergo an identical input-output projection. This
limits LoRA’s ability to capture token-specific information due to the inherent
semantic differences among tokens. To address this limitation, we propose
Token-wise Projected Low-Rank Adaptation (TopLoRA), which dynamically adjusts
LoRA weights according to the input token, thereby learning token-wise
input-output projections in an end-to-end manner. Formally, the weights of
TopLoRA can be expressed as $B\Sigma_X A$, where $A$ and $B$ are low-rank
matrices (as in standard LoRA), and $\Sigma_X$ is a diagonal matrix generated
from each input token $X$. Notably, TopLoRA does not increase the rank of LoRA
weights but achieves more granular adaptation by learning token-wise LoRA
weights (i.e., token-wise input-output projections). Extensive experiments
across multiple models and datasets demonstrate that TopLoRA consistently
outperforms LoRA and its variants. The code is available at
https://github.com/Leopold1423/toplora-neurips25.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23123v1
[DATE]2025-10-27 16:57:24+08:00
[CATEGORIES]cs.CL cs.LG
DeepOmni: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
[AUTHORS]Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Zuwei Long, Dong Yang, Ke Li, Xing Sun
[ABSTRACT]Native multimodal large language models (MLLMs) restructure a single large
language model (LLM) into a spoken language model (SLM) capable of both speech
and text generation. Compared to modular and aligned MLLMs, native MLLMs
preserve richer paralinguistic features such as emotion and prosody, and
generate speech responses directly within the backbone LLM rather than using a
separate speech decoder. This integration also results in lower response
latency and smoother interaction. However, native MLLMs suffer from
catastrophic forgetting and performance degradation because the available
paired speech-text data is insufficient to support the pretraining of MLLMs
compared to the vast amount of text data required to pretrain text LLMs. To
address this issue, we propose DeepTalk, a framework for adaptive modality
expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk
first adaptively distinguishes modality experts according to their modality
load within the LLM. Each modality expert then undergoes specialized
single-modality training, followed by joint multimodal collaborative training.
As a result, DeepTalk incurs only a 5.5% performance drop compared to the
original LLM, which is significantly lower than the average performance drop of
over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par
with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within
0.5 seconds, ensuring a seamless and intelligent speech interaction experience.
Code and models are released at https://github.com/talkking/DeepTalk.
[COMMENTS]Under Review
[LINK]http://arxiv.org/abs/2506.21864v3
[DATE]2025-10-27 16:52:21+08:00
[CATEGORIES]cs.CL
COUNTDOWN: Contextually Sparse Activation Filtering Out Unnecessary Weights in Down Projection
[AUTHORS]Jaewon Cheon, Pilsung Kang
[ABSTRACT]The growing size of large language models has created significant
computational inefficiencies. To address this challenge, sparse activation
methods selectively deactivate non-essential parameters during inference,
reducing computational costs in FFNN layers. While existing methods focus on
non-linear gating mechanisms, we hypothesize that the sparsity of the FFNN
layer lies globally in the form of a linear combination over its internal down
projection matrix. Based on this insight, we propose two methods: M-COUNTDOWN,
leveraging indirect coefficients, and D-COUNTDOWN, utilizing direct
coefficients of the linear combination. Experimental results demonstrate that
D-COUNTDOWN can omit 90% of computations with performance loss as low as 5.5%
ideally, while M-COUNTDOWN provides a predictor-free solution with up to 29.4%
better performance preservation compared to existing methods. Our specialized
kernel implementations effectively translate these theoretical gains into
substantial real-world acceleration.
[COMMENTS]EMNLP 2025 (Main Track)
[LINK]http://arxiv.org/abs/2505.17701v3
[DATE]2025-10-27 16:19:35+08:00
[CATEGORIES]cs.LG cs.CL
GraphInstruct: Empowering Large Language Models with Graph Understanding and Reasoning Capability
[AUTHORS]Zihan Luo, Xiran Song, Hong Huang, Jianxun Lian, Chenhao Zhang, Jinqi Jiang, Xing Xie, Hai Jin
[ABSTRACT]Improving the general capabilities of large language models (LLMs) is an
active research topic. Because graphs are a common data structure in many
real-world domains, understanding graph data is a crucial part of advancing general intelligence.
To this end, we propose a dynamic benchmark named GraphInstruct in this paper,
which comprehensively includes 21 classical graph reasoning tasks, providing
diverse graph generation pipelines and detailed intermediate reasoning steps
for each sample. Based on GraphInstruct, we develop GraphSolver via efficient
instruction-tuning, which demonstrates prominent graph understanding capability
compared to other open-sourced LLMs. To further endow LLMs with multi-step
graph reasoning capability, we propose a label-mask training strategy and build
GraphSolver+, which leverages masked supervision on intermediate reasoning
tokens to emphasize crucial node-identification signals. As one of the
pioneering efforts to enhance the graph understanding and reasoning abilities
of LLMs, extensive experiments have demonstrated the superiority of GraphSolver
and GraphSolver+ over other LLMs. We sincerely hope GraphInstruct will
facilitate further research on applying LLMs to graph-structured data. Our code
and data are released publicly at: https://github.com/CGCL-codes/GraphInstruct.
[COMMENTS]Accepted by Frontiers of Computer Science
[LINK]http://arxiv.org/abs/2403.04483v3
[DATE]2025-10-27 16:07:17+08:00
[CATEGORIES]cs.CL
MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models
[AUTHORS]Suchan Lee, Jihoon Choi, Sohyeon Lee, Minseok Song, Bong-Gyu Jang, Hwanjo Yu, Soyeon Caren Han
[ABSTRACT]Recent advances have investigated the use of pretrained large language models
(LLMs) for time-series forecasting by aligning numerical inputs with LLM
embedding spaces. However, existing multimodal approaches often overlook the
distinct statistical properties and temporal dependencies that are fundamental
to time-series data. To bridge this gap, we propose MAP4TS, a novel
Multi-Aspect Prompting Framework that explicitly incorporates classical
time-series analysis into the prompt design. Our framework introduces four
specialized prompt components: a Global Domain Prompt that conveys
dataset-level context, a Local Domain Prompt that encodes recent trends and
series-specific behaviors, and a pair of Statistical and Temporal Prompts that
embed handcrafted insights derived from autocorrelation (ACF), partial
autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined
with raw time-series embeddings and passed through a cross-modality alignment
module to produce unified representations, which are then processed by an LLM
and projected for final forecasting. Extensive experiments across eight diverse
datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based
methods. Our ablation studies further reveal that prompt-aware designs
significantly enhance performance stability and that GPT-2 backbones, when
paired with structured prompts, outperform larger models like LLaMA in
long-term forecasting tasks.
[LINK]http://arxiv.org/abs/2510.23090v1
[DATE]2025-10-27 15:51:54+08:00
[CATEGORIES]cs.CL
The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
[AUTHORS]Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan
[ABSTRACT]Cross-lingual retrieval-augmented generation (RAG) is a critical capability
for retrieving and generating answers across languages. Prior work in this
context has mostly focused on generation and relied on benchmarks derived from
open-domain sources, most notably Wikipedia. In such settings, retrieval
challenges often remain hidden due to language imbalances, overlap with
pretraining data, and memorized content. To address this gap, we study
Arabic-English RAG in a domain-specific setting using benchmarks derived from
real-world corporate datasets. Our benchmarks include all combinations of
languages for the user query and the supporting document, drawn independently
and uniformly at random. This enables a systematic study of multilingual
retrieval behavior.
Our findings reveal that retrieval is a critical bottleneck in cross-lingual
domain-specific scenarios, with substantial performance drops occurring when
the user query and supporting document languages differ. A key insight is that
these failures stem primarily from the retriever’s difficulty in ranking
documents across languages. Finally, we propose two simple retrieval strategies
that address this source of failure by enforcing equal retrieval from both
languages or by translating the query, resulting in substantial improvements in
cross-lingual and overall performance. These results highlight meaningful
opportunities for improving multilingual retrieval, particularly in practical,
real-world RAG applications.
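
The first strategy mentioned above, enforcing equal retrieval from both languages, could look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' method: it assumes each candidate document carries a language tag ("ar" or "en") and a retriever score, and it splits the budget k evenly. The function name, tuple layout, and sample scores are hypothetical.

```python
def balanced_top_k(scored_docs, k, languages=("ar", "en")):
    """Select k // len(languages) top-scoring documents from each language."""
    per_language = k // len(languages)
    selected = []
    for lang in languages:
        pool = [d for d in scored_docs if d[1] == lang]
        pool.sort(key=lambda d: d[2], reverse=True)
        selected.extend(pool[:per_language])
    # Present the merged set in descending score order.
    return sorted(selected, key=lambda d: d[2], reverse=True)

# Hypothetical (doc_id, language, score) candidates for an English query,
# where same-language documents tend to receive higher scores.
candidates = [
    ("e1", "en", 0.9), ("e2", "en", 0.8), ("e3", "en", 0.7),
    ("a1", "ar", 0.6), ("a2", "ar", 0.5),
]
result_ids = [doc_id for doc_id, _, _ in balanced_top_k(candidates, k=4)]
```

A plain top-4 by score would return three English documents here; the balanced variant keeps two per language, so an Arabic supporting document is not crowded out by ranking bias.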
[COMMENTS]Accepted to ArabicNLP 2025
[LINK]http://arxiv.org/abs/2507.07543v2
[DATE]2025-10-27 15:40:12+08:00
[CATEGORIES]cs.CL
ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions
[AUTHORS]Bufang Yang, Lilin Xu, Liekang Zeng, Kaiwei Liu, Siyang Jiang, Wenrui Lu, Hongkai Chen, Xiaofan Jiang, Guoliang Xing, Zhenyu Yan
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.14668v2
[DATE]2025-10-27 15:17:51+08:00
[CATEGORIES]cs.CL
Quality-Aware Translation Tagging in Multilingual RAG system
[AUTHORS]Hoyeon Moon, Byeolhee Kim, Nikhil Verma
[ABSTRACT]Multilingual Retrieval-Augmented Generation (mRAG) often retrieves English
documents and translates them into the query language for low-resource
settings. However, poor translation quality degrades response generation
performance. Existing approaches either assume sufficient translation quality
or utilize the rewriting method, which introduces factual distortion and
hallucinations. To mitigate these problems, we propose Quality-Aware
Translation Tagging in mRAG (QTT-RAG), which explicitly evaluates translation
quality along three dimensions (semantic equivalence, grammatical accuracy, and
naturalness and fluency) and attaches these scores as metadata without altering the
original content. We evaluate QTT-RAG against CrossRAG and DKM-RAG as baselines
in two open-domain QA benchmarks (XORQA, MKQA) using six instruction-tuned LLMs
ranging from 2.4B to 14B parameters, covering two low-resource languages
(Korean and Finnish) and one high-resource language (Chinese). QTT-RAG
outperforms the baselines by preserving factual integrity while enabling
generator models to make informed decisions based on translation reliability.
This approach allows for effective usage of cross-lingual documents in
low-resource settings with limited native language documents, offering a
practical and robust solution across multilingual domains.
[COMMENTS]EMNLP 2025 MRL Workshop
[LINK]http://arxiv.org/abs/2510.23070v1
[DATE]2025-10-27 15:11:01+08:00
[CATEGORIES]cs.CL
Computational Analysis of Character Development in Holocaust Testimonies
[AUTHORS]Esther Shizgal, Eitan Wagner, Renana Keydar, Omri Abend
[ABSTRACT]This work presents a computational approach to analyze character development
along the narrative timeline. The analysis characterizes the inner and outer
changes the protagonist undergoes within a narrative, and the interplay between
them. We consider transcripts of Holocaust survivor testimonies as a test case,
each telling the story of an individual in first-person terms. We focus on the
survivor’s religious trajectory, examining the evolution of their disposition
toward religious belief and practice along the testimony. Clustering the
resulting trajectories in the dataset, we identify common sequences in the
data. Our findings highlight multiple common structures of religiosity across
the narratives: in terms of belief, most present a constant disposition, while
for practice, most present an oscillating structure, serving as valuable
material for historical and sociological research. This work demonstrates the
potential of natural language processing techniques for analyzing character
evolution through thematic trajectories in narratives.
[LINK]http://arxiv.org/abs/2412.17063v5
[DATE]2025-10-27 14:49:01+08:00
[CATEGORIES]cs.CL
Knocking-Heads Attention
[AUTHORS]Zhanchao Zhou, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Jianguo Li
[ABSTRACT]Multi-head attention (MHA) has become the cornerstone of modern large
language models, enhancing representational capacity through parallel attention
heads. However, increasing the number of heads inherently weakens individual
head capacity, and existing attention mechanisms - whether standard MHA or its
variants like grouped-query attention (GQA) and grouped-tied attention (GTA) -
simply concatenate outputs from isolated heads without strong interaction. To
address this limitation, we propose knocking-heads attention (KHA), which
enables attention heads to “knock” on each other - facilitating cross-head
feature-level interactions before the scaled dot-product attention. This is
achieved by applying a shared, diagonally-initialized projection matrix across
all heads. The diagonal initialization preserves head-specific specialization
at the start of training while allowing the model to progressively learn
integrated cross-head representations. KHA adds only minimal parameters and
FLOPs and can be seamlessly integrated into MHA, GQA, GTA, and other attention
variants. We validate KHA by training a 6.1B parameter MoE model (1.01B
activated) on 1T high-quality tokens. Compared to baseline attention
mechanisms, KHA brings superior and more stable training dynamics, achieving
better performance across downstream tasks.
[LINK]http://arxiv.org/abs/2510.23052v1
[DATE]2025-10-27 14:28:58+08:00
[CATEGORIES]cs.CL
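The diagonal-initialization property described in the Knocking-Heads Attention abstract can be illustrated with a toy sketch. The abstract does not give the shape of the projection or the exact point where it is applied, so the code below assumes one reading: a single d x d matrix, initialized to the identity, applied to each head's feature vector. All names and values are hypothetical and this is not the authors' implementation.

```python
def identity(d):
    """d x d identity matrix, standing in for the diagonal initialization."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def project(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(len(mat[0]))]


def apply_shared(heads, shared):
    """Apply the same projection matrix to every head's feature vector."""
    return [project(h, shared) for h in heads]


# Two hypothetical heads with 2-dimensional features.
heads = [[1.0, 2.0], [3.0, 4.0]]

shared = identity(2)
at_init = apply_shared(heads, shared)        # heads pass through unchanged

shared[0][1] = 0.5                           # pretend training moved one off-diagonal weight
after_update = apply_shared(heads, shared)   # every head changes through the same weight
```

At initialization the projection leaves each head untouched, which matches the abstract's claim that head-specific specialization is preserved at the start of training. Once an off-diagonal weight moves, all heads are transformed by the same shared parameters.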
FaithLM: Towards Faithful Explanations for Large Language Models
[AUTHORS]Yu-Neng Chuang, Guanchu Wang, Chia-Yuan Chang, Ruixiang Tang, Shaochen Zhong, Fan Yang, Mengnan Du, Xuanting Cai, Vladimir Braverman, Xia Hu
[ABSTRACT]Large language models (LLMs) increasingly produce natural language
explanations, yet these explanations often lack faithfulness: they do not
reliably reflect the evidence the model uses to decide. We introduce FaithLM, a
model-agnostic framework that evaluates and improves the faithfulness of LLM
explanations without token masking or task-specific heuristics. FaithLM
formalizes explanation faithfulness as an intervention property: a faithful
explanation should yield a prediction shift when its content is contradicted.
Theoretical analysis shows that the resulting contrary-hint score is a sound
and discriminative estimator of faithfulness. Building on this principle,
FaithLM iteratively refines both the elicitation prompt and the explanation to
maximize the measured score. Experiments on three multi-domain datasets and
multiple LLM backbones demonstrate that FaithLM consistently increases
faithfulness and produces explanations more aligned with human rationales than
strong self-explanation baselines. These findings highlight that
intervention-based evaluation, coupled with iterative optimization, provides a
principled route toward faithful and reliable LLM explanations.
[LINK]http://arxiv.org/abs/2402.04678v4
[DATE]2025-10-27 14:19:56+08:00
[CATEGORIES]cs.CL cs.LG
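The intervention idea in the FaithLM abstract (a faithful explanation should shift the prediction when its content is contradicted) can be sketched with a toy classifier. The abstract does not define the contrary-hint score itself, so the binary check below is only an assumed simplification; `toy_predict` is a hypothetical stand-in for an LLM, and both hints are made up for illustration.

```python
def contrary_hint_shift(predict, question, contrary_hint):
    """Return True if adding a hint that contradicts the explanation changes the prediction."""
    baseline = predict(question, hint=None)
    intervened = predict(question, hint=contrary_hint)
    return baseline != intervened


def toy_predict(question, hint=None):
    """Hypothetical stand-in for an LLM: a keyword-based sentiment classifier."""
    text = question if hint is None else f"{question} {hint}"
    return "negative" if "not" in text.split() else "positive"


question = "The food was great"
# Contradicts an explanation that drives the toy prediction ("great" signals positive).
relevant = contrary_hint_shift(toy_predict, question, "the food was not great")
# Contradicts an explanation the toy model never relied on (the restaurant's location).
irrelevant = contrary_hint_shift(toy_predict, question, "the restaurant is far away")
```

Only the hint that contradicts what the toy model actually uses flips its prediction, which is the sense in which a prediction shift signals a faithful explanation.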
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
[AUTHORS]Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
[ABSTRACT]AI agent frameworks operate in isolation, forcing agents to rediscover
solutions and repeat mistakes across different systems. Despite valuable
problem-solving experiences accumulated by frameworks like smolagents,
OpenHands, and OWL, this knowledge remains trapped within individual systems,
preventing the emergence of collective intelligence. Current memory systems
focus on individual agents or framework-specific demonstrations, failing to
enable cross-architecture knowledge transfer. We introduce AGENT KB, a
universal memory infrastructure enabling seamless experience sharing across
heterogeneous agent frameworks without retraining. AGENT KB aggregates
trajectories into a structured knowledge base and serves lightweight APIs. At
inference time, hybrid retrieval operates through two stages: planning seeds
agents with cross-domain workflows, while feedback applies targeted diagnostic
fixes. A disagreement gate ensures retrieved knowledge enhances rather than
disrupts reasoning, addressing knowledge interference in cross-framework
transfer. We validate AGENT KB across major frameworks on GAIA, Humanity’s Last
Exam, GPQA, and SWE-bench. Results show substantial improvements across diverse
model families: compared to baseline pass@1, smolagents with AGENT KB achieve
up to 18.7pp gains at pass@3 (55.2% -> 73.9%), while OpenHands improves 4.0pp
on SWE-bench pass@1 (24.3% -> 28.3%). Similar improvements are observed across
all base model families. Ablations confirm that hybrid retrieval and feedback
stages are essential, with automatically generated experiences matching manual
curation. This establishes the foundation for collective agent intelligence
through shared memory infrastructures.
[LINK]http://arxiv.org/abs/2507.06229v5
[DATE]2025-10-27 14:16:14+08:00
[CATEGORIES]cs.CL
Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
[AUTHORS]Di Zhang, Xun Wu, Shaohan Huang, Yaru Hao, Li Dong, Zewen Chi, Zhifang Sui, Furu Wei
[ABSTRACT]Recent advances in reinforcement learning (RL) have substantially improved
the training of large-scale language models, leading to significant gains in
generation quality and reasoning ability. However, most existing research
focuses on dense models, while RL training for Mixture-of-Experts (MoE)
architectures remains underexplored. To address the instability commonly
observed in MoE training, we propose a novel router-aware approach to optimize
importance sampling (IS) weights in off-policy RL. Specifically, we design a
rescaling strategy guided by router logits, which effectively reduces gradient
variance and mitigates training divergence. Experimental results demonstrate
that our method significantly improves both the convergence stability and the
final performance of MoE models, highlighting the potential of RL algorithmic
innovations tailored to MoE architectures and providing a promising direction
for efficient training of large-scale expert models.
[LINK]http://arxiv.org/abs/2510.23027v1
[DATE]2025-10-27 13:47:48+08:00
[CATEGORIES]cs.LG cs.CL
M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
[AUTHORS]Huixuan Zhang, Xiaojun Wan
[ABSTRACT]Text-to-image models are known to struggle with generating images that
perfectly align with textual prompts. Several previous studies have focused on
evaluating image-text alignment in text-to-image generation. However, these
evaluations either address overly simple scenarios, especially overlooking the
difficulty of prompts with multiple different instances belonging to the same
category, or they introduce metrics that do not correlate well with human
evaluation. In this study, we introduce M$^3$T2IBench, a large-scale,
multi-category, multi-instance, multi-relation benchmark, along with an
object-detection-based evaluation metric, $AlignScore$, which aligns closely
with human evaluation. Our findings reveal that current open-source
text-to-image models perform poorly on this challenging benchmark.
Additionally, we propose the Revise-Then-Enforce approach to enhance image-text
alignment. This training-free post-editing method demonstrates improvements in
image-text alignment across a broad range of diffusion models. Our code and
data have been released in supplementary material and will be made publicly
available after the paper is accepted.
[LINK]http://arxiv.org/abs/2510.23020v1
[DATE]2025-10-27 13:32:50+08:00
[CATEGORIES]cs.CL
LangLingual: A Personalised, Exercise-oriented English Language Learning Tool Leveraging Large Language Models
[AUTHORS]Sammriddh Gupta, Sonit Singh, Aditya Joshi, Mira Kim
[ABSTRACT]Language educators strive to create a rich experience for learners, while
they may be restricted in the extent of feedback and practice they can provide.
We present the design and development of LangLingual, a conversational agent
built using the LangChain framework and powered by Large Language Models. The
system is specifically designed to provide real-time, grammar-focused feedback,
generate context-aware language exercises and track learner proficiency over
time. The paper discusses the architecture, implementation and evaluation of
LangLingual in detail. The results indicate strong usability, positive learning
outcomes and encouraging learner engagement.
[COMMENTS]14 pages
[LINK]http://arxiv.org/abs/2510.23011v1
[DATE]2025-10-27 13:11:07+08:00
[CATEGORIES]cs.CL
Unified Sparse Mixture of Experts
[AUTHORS]Giang Do, Hung Le, Truyen Tran
[ABSTRACT]Sparse Mixtures of Experts (SMoEs) scale the capacity of models while
maintaining constant computational overhead. Early designs typically relied on
a fixed value of $k$, where $k$ represents either the number of experts
selected per token or the number of tokens assigned per expert. However, these
approaches encounter three key limitations: they may fail to route to important
experts or tokens, may assign irrelevant ones, and often suffer from
representation collapse among experts. This paper reexamines SMoEs through the
lens of Linear Programming, and proposes a Unified Sparse Mixture of
Experts (USMoE) framework that addresses these limitations. Specifically, our
approach introduces a unified mechanism that integrates information from both
the expert and token dimensions, and a unified scoring function that linearly
combines similarity scores between experts and tokens. We provide both
theoretical justification and empirical evidence demonstrating USMoE’s
effectiveness in overcoming the limitations of traditional routing methods.
Through comprehensive evaluations on both clean and corrupted settings for
large language models and vision tasks, under both training-free and training
scenarios, USMoE achieves up to a 10% performance improvement over standard
approaches or reduces inference costs by up to 14%, while maintaining
competitive accuracy.
[COMMENTS]26 pages
[LINK]http://arxiv.org/abs/2503.22996v2
[DATE]2025-10-27 12:51:10+08:00
[CATEGORIES]cs.CL
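The two fixed-$k$ routing conventions contrasted in the USMoE abstract can be illustrated on a toy score matrix. The sketch below is not the USMoE method; the scores and function names are made up for illustration. It only shows how "k experts per token" and "k tokens per expert" select different token-expert pairs from the same scores.

```python
def top_k_indices(values, k):
    """Return the indices of the k largest values (ties broken by lower index)."""
    order = sorted(range(len(values)), key=lambda i: (-values[i], i))
    return sorted(order[:k])


def token_choice(scores, k):
    """Each token (row) keeps its k highest-scoring experts."""
    return {(t, e) for t, row in enumerate(scores) for e in top_k_indices(row, k)}


def expert_choice(scores, k):
    """Each expert (column) keeps its k highest-scoring tokens."""
    n_experts = len(scores[0])
    pairs = set()
    for e in range(n_experts):
        column = [row[e] for row in scores]
        pairs.update((t, e) for t in top_k_indices(column, k))
    return pairs


# Hypothetical token-expert similarity scores: 3 tokens (rows) x 2 experts (columns).
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
]

tc = token_choice(scores, k=1)   # every token picks expert 0
ec = expert_choice(scores, k=1)  # expert 0 picks token 0, expert 1 picks token 2
```

With these scores, token choice leaves expert 1 unused, while expert choice drops token 1 entirely. Both are instances of the routing failures the abstract attributes to a fixed $k$ along a single dimension.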
Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures
[AUTHORS]Shenran Wang, Timothy Tin-Long Tse, Jian Zhu
[ABSTRACT]We perform in-depth evaluations of in-context learning (ICL) on
state-of-the-art transformer, state-space, and hybrid large language models
over two categories of knowledge-based ICL tasks. Using a combination of
behavioral probing and intervention-based methods, we have discovered that,
while LLMs of different architectures can behave similarly in task performance,
their internals could remain different. We discover that function vectors (FVs)
responsible for ICL are primarily located in the self-attention and Mamba
layers, and speculate that Mamba2 uses a different mechanism from FVs to
perform ICL. FVs are more important for ICL involving parametric knowledge
retrieval, but not for contextual knowledge understanding. Our work contributes
to a more nuanced understanding across architectures and task types.
Methodologically, our approach also highlights the importance of combining both
behavioral and mechanistic analyses to investigate LLM capabilities.
[LINK]http://arxiv.org/abs/2510.23006v1
[DATE]2025-10-27 12:49:01+08:00
[CATEGORIES]cs.CL
Learning to Better Search with Language Models via Guided Reinforced Self-Training
[AUTHORS]Seungyong Moon, Bumsoo Park, Hyun Oh Song
[ABSTRACT]While language models have shown remarkable performance across diverse tasks,
they still encounter challenges in complex reasoning scenarios. Recent research
suggests that language models trained on linearized search traces toward
solutions, rather than solely on the final solutions, exhibit improved
generalization, despite the search traces being potentially noisy or
suboptimal. However, relying on such imperfect traces can result in inefficient
use of test-time compute. To address this, we propose guided reinforced
self-training (Guided-ReST), a fine-tuning algorithm designed to improve the
model’s capability for effective search during inference. The key insight
behind Guided-ReST is that optimal solutions can serve as valuable step-by-step
landmarks to guide the model’s search process. Based on this insight, we
introduce a novel data generation method that seamlessly incorporates optimal
solutions into the model’s search procedure, enabling the generation of
high-quality search traces. By fine-tuning the model on these search traces, we
effectively distill improved search strategies into the model. Our method
significantly enhances the search capabilities of language models on arithmetic
reasoning and code self-repair tasks, including Countdown, CodeContests, and
CodeForces. We release the source code at
https://github.com/snu-mllab/guided-rest.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2410.02992v2
[DATE]2025-10-27 12:46:45+08:00
[CATEGORIES]cs.CL
MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
[AUTHORS]Yucheng Ning, Xixun Lin, Fang Fang, Yanan Cao
[ABSTRACT]The widespread adoption of Large Language Models (LLMs) raises critical
concerns about the factual accuracy of their outputs, especially in high-risk
domains such as biomedicine, law, and education. Existing evaluation methods
for short texts often fail on long-form content due to complex reasoning
chains, intertwined perspectives, and cumulative information. To address this,
we propose a systematic approach integrating large-scale long-form datasets,
multi-agent verification mechanisms, and weighted evaluation metrics. We
construct LongHalluQA, a Chinese long-form factuality dataset, and develop
MAD-Fact, a debate-based multi-agent verification system. We introduce a fact
importance hierarchy to capture the varying significance of claims in long-form
texts. Experiments on two benchmarks show that larger LLMs generally maintain
higher factual consistency, while domestic models excel on Chinese content. Our
work provides a structured framework for evaluating and enhancing factual
reliability in long-form LLM outputs, guiding their safe deployment in
sensitive domains.
[COMMENTS]This article has been accepted by Frontiers of Computer Science (FCS)
[LINK]http://arxiv.org/abs/2510.22967v1
[DATE]2025-10-27 11:41:32+08:00
[CATEGORIES]cs.CL
Tagging-Augmented Generation: Assisting Language Models in Finding Intricate Knowledge In Long Contexts
[AUTHORS]Anwesan Pal, Karen Hovsepian, Tinghao Guo, Mengnan Zhao, Somendra Tripathi, Nikos Kanakaris, George Mihaila, Sumit Nigam
[ABSTRACT]Recent investigations into effective context lengths of modern flagship large
language models (LLMs) have revealed major limitations in effective question
answering (QA) and reasoning over long and complex contexts for even the
largest and most impressive cadre of models. While approaches like
retrieval-augmented generation (RAG) and chunk-based re-ranking attempt to
mitigate this issue, they are sensitive to chunking, embedding and retrieval
strategies and models, and furthermore, rely on extensive pre-processing,
knowledge acquisition and indexing steps. In this paper, we propose
Tagging-Augmented Generation (TAG), a lightweight data augmentation strategy
that boosts LLM performance in long-context scenarios, without degrading or
altering the integrity and composition of retrieved documents. We validate our
hypothesis by augmenting two challenging and directly relevant
question-answering benchmarks – NoLima and NovelQA – and show that tagging
the context or even just adding tag definitions into QA prompts leads to
consistent performance gains over the baseline – up to 17% for 32K token
contexts, and 2.9% in complex reasoning question-answering for multi-hop
queries requiring knowledge across a wide span of text. Additional details are
available at https://sites.google.com/view/tag-emnlp.
[COMMENTS]Paper accepted at EMNLP 2025
[LINK]http://arxiv.org/abs/2510.22956v1
[DATE]2025-10-27 11:23:25+08:00
[CATEGORIES]cs.CL
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
[AUTHORS]Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Alon Albalak, Yejin Choi
[ABSTRACT]Language models (LMs) often struggle to generate diverse, human-like creative
content, raising concerns about the long-term homogenization of human thought
through repeated exposure to similar outputs. Yet scalable methods for
evaluating LM output diversity remain limited, especially beyond narrow tasks
such as random number or name generation, or beyond repeated sampling from a
single model. We introduce Infinity-Chat, a large-scale dataset of 26K diverse,
real-world, open-ended user queries that admit a wide range of plausible
answers with no single ground truth. We introduce the first comprehensive
taxonomy for characterizing the full spectrum of open-ended prompts posed to
LMs, comprising 6 top-level categories (e.g., brainstorm & ideation) that
further break down into 17 subcategories. Using Infinity-Chat, we present a
large-scale study of mode collapse in LMs, revealing a pronounced Artificial
Hivemind effect in open-ended generation of LMs, characterized by (1)
intra-model repetition, where a single model consistently generates similar
responses, and more so (2) inter-model homogeneity, where different models
produce strikingly similar outputs. Infinity-Chat also includes 31,250 human
annotations, across absolute ratings and pairwise preferences, with 25
independent human annotations per example. This enables studying collective and
individual-specific human preferences in response to open-ended queries. Our
findings show that LMs, reward models, and LM judges are less well calibrated
to human ratings on model generations that elicit differing idiosyncratic
annotator preferences, despite maintaining comparable overall quality. Overall,
Infinity-Chat presents the first large-scale resource for systematically
studying real-world open-ended queries to LMs, revealing critical insights to
guide future research for mitigating long-term AI safety risks posed by the
Artificial Hivemind.
[COMMENTS]NeurIPS 2025 D&B Paper (Oral); Camera-Ready Version
[LINK]http://arxiv.org/abs/2510.22954v1
[DATE]2025-10-27 11:16:21+08:00
[CATEGORIES]cs.CL
Unsupervised Classification of English Words Based on Phonological Information: Discovery of Germanic and Latinate Clusters
[AUTHORS]Takashi Morita, Timothy J. O’Donnell
[ABSTRACT]Cross-linguistically, native words and loanwords follow different
phonological rules. In English, for example, words of Germanic and Latinate
origin exhibit different stress patterns, and a certain syntactic structure,
double-object datives, is predominantly associated with Germanic verbs rather
than Latinate verbs. As a cognitive model, however, such etymology-based
generalizations face challenges in terms of learnability, since the historical
origins of words are presumably inaccessible information for general language
learners. In this study, we present computational evidence indicating that the
Germanic-Latinate distinction in the English lexicon is learnable from the
phonotactic information of individual words. Specifically, we performed an
unsupervised clustering on corpus-extracted words, and the resulting word
clusters largely aligned with the etymological distinction. The
model-discovered clusters also recovered various linguistic generalizations
documented in the previous literature regarding the corresponding etymological
classes. Moreover, our findings uncovered previously unrecognized features
of the quasi-etymological clusters.
[LINK]http://arxiv.org/abs/2504.11770v3
[DATE]2025-10-27 11:10:43+08:00
[CATEGORIES]cs.CL
Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
[AUTHORS]Woojin Chung, Jeonghoon Kim
[ABSTRACT]Large language models are trained with tokenizers, and the resulting token
distribution is highly imbalanced: a few words dominate the stream while most
occur rarely. Recent practice favors ever-larger vocabularies, but it is
unclear where the benefit comes from. To this end, we perform a controlled
study that scales the vocabulary of the language model from 24K to 196K while
holding data, computation, and optimization unchanged. We begin by quantifying
the complexity of tokenized text – formalized via Kolmogorov complexity – and
show that larger vocabularies reduce this complexity. Above 24K, every common
word is already tokenized as a single token, so enlarging vocabulary only
deepens the relative token-frequency imbalance. Word-level loss decomposition
shows that larger vocabularies reduce cross-entropy loss almost exclusively by
lowering uncertainty on the 2,500 most frequent words, even though loss on the
rare tail rises. The same frequent words cover roughly 75% of tokens in
downstream benchmarks, so this training advantage transfers intact. We further
show that enlarging model parameters with a fixed vocabulary yields the same
frequent-word benefit. Our results recast “bigger vocabularies help” as
“lowering complexity of tokenized text helps,” offering a simple, principled
knob for tokenizer-model co-design and clarifying the loss dynamics that govern
language model scaling in pre-training.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2508.15390v2
[DATE]2025-10-27 10:39:13+08:00
[CATEGORIES]cs.CL cs.LG
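The frequency imbalance discussed in the abstract above (a few words dominating the stream, with the most frequent words covering most tokens) can be measured with a simple coverage statistic. This is a minimal sketch on a made-up toy corpus, not the paper's analysis; the function name `top_n_coverage` is hypothetical.

```python
from collections import Counter


def top_n_coverage(words, n):
    """Fraction of all word occurrences accounted for by the n most frequent words."""
    if not words:
        return 0.0
    counts = Counter(words)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(words)


# Toy corpus (made up for illustration): a few words dominate, the rest are rare.
corpus = ["the"] * 6 + ["of"] * 2 + ["tokenizer", "entropy"]

coverage = top_n_coverage(corpus, n=2)  # "the" and "of" cover 8 of 10 occurrences
```

Applied to a real benchmark with n set to 2,500, the same statistic would correspond to the roughly 75% coverage figure quoted in the abstract.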
Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale
[AUTHORS]Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, Dan Roth
[ABSTRACT]Large Language Models (LLMs) have emerged as personalized assistants for
users across a wide range of tasks – from offering writing support to
delivering tailored recommendations or consultations. Over time, the
interaction history between a user and an LLM can provide extensive information
about an individual’s traits and preferences. However, open questions remain on
how well LLMs today can effectively leverage such history to (1) internalize
the user’s inherent traits and preferences, (2) track how the user’s profile
and preferences evolve over time, and (3) generate personalized responses
accordingly in new scenarios.
In this work, we introduce the PERSONAMEM benchmark. PERSONAMEM features
curated user profiles with over 180 simulated user-LLM interaction histories,
each containing up to 60 sessions of multi-turn conversations across 15
real-world tasks that require personalization. Given an in-situ user query,
i.e., a query issued by the user from the first-person perspective, we evaluate
LLM chatbots’ ability to identify the most suitable response according to the
current state of the user’s profile. We observe that current LLMs still
struggle to recognize the dynamic evolution in users’ profiles over time
through direct prompting approaches. As a consequence, LLMs often fail to
deliver responses that align with users’ current situations and preferences,
with frontier models such as GPT-4.1, o4-mini, GPT-4.5, o1, or Gemini-2.0
achieving only around 50% overall accuracy, suggesting room for improvement. We
hope that PERSONAMEM, along with the user profile and conversation simulation
pipeline, can facilitate future research in the development of truly user-aware
chatbots. Code and data are available at github.com/bowen-upenn/PersonaMem.
[COMMENTS]The 2025 Conference on Language Modeling (COLM)
[LINK]http://arxiv.org/abs/2504.14225v2
[DATE]2025-10-27 10:22:53+08:00
[CATEGORIES]cs.CL
Integrated Design and Governance of Agentic AI Systems through Adaptive Information Modulation
[AUTHORS]Qiliang Chen, Sepehr Ilami, Nunzio Lore, Babak Heydari
[ABSTRACT]Modern engineered systems increasingly involve complex sociotechnical
environments where multiple agents, including humans and the emerging paradigm
of agentic AI powered by large language models, must navigate social dilemmas
that pit individual interests against collective welfare. As engineered systems
evolve toward multi-agent architectures with autonomous LLM-based agents,
traditional governance approaches using static rules or fixed network
structures fail to address the dynamic uncertainties inherent in real-world
operations. This paper presents a novel framework that integrates adaptive
governance mechanisms directly into the design of sociotechnical systems
through a unique separation of agent interaction networks from information flow
networks. We introduce a system comprising strategic LLM-based system agents
that engage in repeated interactions and a reinforcement learning-based
governing agent that dynamically modulates information transparency. Unlike
conventional approaches that require direct structural interventions or payoff
modifications, our framework preserves agent autonomy while promoting
cooperation through adaptive information governance. The governing agent learns
to strategically adjust information disclosure at each timestep, determining
what contextual or historical information each system agent can access.
Experimental results demonstrate that this RL-based governance significantly
enhances cooperation compared to static information-sharing baselines.
[LINK]http://arxiv.org/abs/2409.10372v4
[DATE]2025-10-27 09:25:31+08:00
[CATEGORIES]cs.CL
Language Server CLI Empowers Language Agents with Process Rewards
[AUTHORS]Yifan Zhang, Lanser Contributors
[ABSTRACT]Large language models routinely hallucinate APIs and mislocalize edits, while
language servers compute verified, IDE-grade facts about real code. We present
Lanser-CLI, a CLI-first orchestration layer that pins and mediates a Language
Server Protocol (LSP) server for coding agents and CI, exposing deterministic,
replayable workflows. Our position is that language servers provide not only
structural information (definitions, references, types, diagnostics) but also
an actionable process reward: machine-checked, step-wise signals that align an
agent’s planning loop with program reality. In this work, Lanser-CLI
contributes: (i) a robust addressing scheme beyond brittle “file:line:col” via
a Selector DSL (symbolic, AST-path, and content-anchored selectors) with a
principled relocation algorithm; (ii) deterministic Analysis Bundles that
normalize Language Server responses and capture environment/capability metadata
with stable content hashes; (iii) a safety envelope for mutating operations
(rename, code actions) with preview, workspace jails, and Git-aware,
transactional apply; and (iv) a process-reward functional derived from Language
Server facts (diagnostic deltas, disambiguation confidence, and safe-apply
checks) that is computable online and replayable offline. We formalize
determinism under frozen snapshots and establish a monotonicity property for
the process reward, making it suitable for process supervision and
counterfactual analysis. Project Page:
https://github.com/yifanzhang-pro/lanser-cli
[COMMENTS]Project Page: https://github.com/yifanzhang-pro/lanser-cli
[LINK]http://arxiv.org/abs/2510.22907v1
[DATE]2025-10-27 09:25:20+08:00
[CATEGORIES]cs.CL
Modeling Political Discourse with Sentence-BERT and BERTopic
[AUTHORS]Margarida Mendonca, Alvaro Figueira
[ABSTRACT]Social media has reshaped political discourse, offering politicians a
platform for direct engagement while reinforcing polarization and ideological
divides. This study introduces a novel topic evolution framework that
integrates BERTopic-based topic modeling with Moral Foundations Theory (MFT) to
analyze the longevity and moral dimensions of political topics in Twitter
activity during the 117th U.S. Congress. We propose a methodology for tracking
dynamic topic shifts over time, measuring their association with moral
values, and quantifying topic persistence. Our findings reveal that while
overarching themes remain stable, granular topics tend to dissolve rapidly,
limiting their long-term influence. Moreover, moral foundations play a critical
role in topic longevity, with Care and Loyalty dominating durable topics, while
partisan differences manifest in distinct moral framing strategies. This work
contributes to the field of social network analysis and computational political
discourse by offering a scalable, interpretable approach to understanding
moral-driven topic evolution on social media.
[COMMENTS]11 pages. Continues previous study by Mendonca M. and Figueira A,
2023: “Analyzing Political Discourse in the 117th U.S. Congress Using
Transformer-Based Topic Models”, presented at the International Conference on
Computational Social Science
[LINK]http://arxiv.org/abs/2510.22904v1
[DATE]2025-10-27 09:19:42+08:00
[CATEGORIES]cs.CL
ControlText: Unlocking Controllable Fonts in Multilingual Text Rendering without Font Annotations
[AUTHORS]Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao, Lyle Ungar, Camillo J. Taylor
[ABSTRACT]This work demonstrates that diffusion models can achieve font-controllable
multilingual text rendering using just raw images without font label
annotations. Visual text rendering remains a significant challenge. While recent
methods condition diffusion on glyphs, it is impossible to retrieve exact font
annotations from large-scale, real-world datasets, which prevents
user-specified font control. To address this, we propose a data-driven solution
that integrates the conditional diffusion model with a text segmentation model,
utilizing segmentation masks to capture and represent fonts in pixel space in a
self-supervised manner, thereby eliminating the need for any ground-truth
labels and enabling users to customize text rendering with any multilingual
font of their choice. The experiment provides a proof of concept of our
algorithm in zero-shot text and font editing across diverse fonts and
languages, providing valuable insights for the community and industry toward
achieving generalized visual text rendering. Code is available at
github.com/bowen-upenn/ControlText.
[COMMENTS]The 2025 Conference on Empirical Methods in Natural Language
Processing (EMNLP) Findings
[LINK]http://arxiv.org/abs/2502.10999v2
[DATE]2025-10-27 08:52:27+08:00
[CATEGORIES]cs.CL
Dynamic Retriever for In-Context Knowledge Editing via Policy Optimization
[AUTHORS]Mahmud Wasif Nafee, Maiqi Jiang, Haipeng Chen, Yanfu Zhang
[ABSTRACT]Large language models (LLMs) excel at factual recall yet still propagate
stale or incorrect knowledge. In-context knowledge editing offers a
gradient-free remedy suitable for black-box APIs, but current editors rely on
static demonstration sets chosen by surface-level similarity, leading to two
persistent obstacles: (i) a quantity-quality trade-off, and (ii) lack of
adaptivity to task difficulty. We address these issues by dynamically selecting
supporting demonstrations according to their utility for the edit. We propose
Dynamic Retriever for In-Context Knowledge Editing (DR-IKE), a lightweight
framework that (1) trains a BERT retriever with REINFORCE to rank
demonstrations by editing reward, and (2) employs a learnable threshold to
prune low-value examples, shortening the prompt when the edit is easy and
expanding it when the task is hard. DR-IKE performs editing without modifying
model weights, relying solely on forward passes for compatibility with
black-box LLMs. On the COUNTERFACT benchmark, it improves edit success by up to
17.1%, reduces latency by 41.6%, and preserves accuracy on unrelated queries,
demonstrating scalable and adaptive knowledge editing. The code is available at
https://github.com/mwnafee/DR-IKE .
[COMMENTS]Accepted at EMNLP 2025. Copyright 2025 Association for Computational
Linguistics (CC BY 4.0). 12 pages, 5 figures
[LINK]http://arxiv.org/abs/2510.21059v2
[DATE]2025-10-27 08:25:35+08:00
[CATEGORIES]cs.CL
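The threshold-based pruning step mentioned in the DR-IKE abstract can be sketched as follows. In the paper the scores come from a trained BERT retriever and the threshold is learned; here both are fixed, made-up numbers, and the function name is hypothetical. How scores relate to edit difficulty is also an assumption of this toy example.

```python
def select_demonstrations(scored_demos, threshold):
    """Rank demonstrations by score and keep those at or above the threshold."""
    ranked = sorted(scored_demos, key=lambda pair: pair[1], reverse=True)
    return [demo for demo, score in ranked if score >= threshold]


# Hypothetical retriever scores for two edits.
easy_edit = [("demo_a", 0.9), ("demo_b", 0.2), ("demo_c", 0.1)]
hard_edit = [("demo_a", 0.55), ("demo_b", 0.7), ("demo_c", 0.6)]

short_prompt = select_demonstrations(easy_edit, threshold=0.5)
long_prompt = select_demonstrations(hard_edit, threshold=0.5)
```

With one dominant demonstration the prompt shrinks to a single example, while several comparably useful demonstrations all survive the cut, so the prompt length adapts to the score distribution.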
Offline Preference Optimization via Maximum Marginal Likelihood Estimation
[AUTHORS]Saeed Najafi, Alona Fyshe
[ABSTRACT]Aligning Large Language Models (LLMs) with human preferences is crucial, but
standard methods like Reinforcement Learning from Human Feedback (RLHF) are
often complex and unstable. In this work, we propose a new, simpler approach
that recasts alignment through the lens of Maximum Marginal Likelihood (MML)
estimation. Our new MML-based Preference Optimization (MMPO) maximizes the
marginal log-likelihood of a preferred text output, using the preference pair
as samples for approximation, and forgoes the need for both an explicit reward
model and entropy maximization. We theoretically demonstrate that MMPO
implicitly performs preference optimization, producing a weighted gradient that
naturally up-weights chosen responses over rejected ones. Across models ranging
from 135M to 8B parameters, we empirically show that MMPO: 1) is more stable
with respect to the hyperparameter $\beta$ compared to alternative baselines,
and 2) achieves competitive or superior preference alignment while better
preserving the base model’s general language capabilities. Through a series of
ablation experiments, we show that this improved performance is indeed
attributable to MMPO’s implicit preference optimization within the gradient
updates.
[LINK]http://arxiv.org/abs/2510.22881v1
[DATE]2025-10-27 08:15:57+08:00
[CATEGORIES]cs.LG cs.CL
Batch Speculative Decoding Done Right
[AUTHORS]Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
[ABSTRACT]Speculative decoding speeds up LLM inference by using a small draft model to
propose multiple tokens that a target model verifies in parallel. Extending
this idea to batches is essential for production serving, but it introduces the
ragged tensor problem: sequences in the same batch accept different numbers of
draft tokens, breaking right-alignment and corrupting position IDs, attention
masks, and KV-cache state. We show that several existing batch implementations
violate output equivalence, the fundamental requirement that speculative
decoding must produce identical token sequences to standard autoregressive
generation. These violations occur precisely due to improper handling of the
ragged tensor problem. In response, we (1) characterize the synchronization
requirements that guarantee correctness, (2) present a correctness-first batch
speculative decoding method, EQSPEC, that exposes realignment as consuming 40% of
overhead, and (3) introduce EXSPEC, which maintains a sliding pool of sequences
and dynamically forms same-length groups, to reduce the realignment overhead
while preserving per-sequence speculative speedups. On the SpecBench dataset,
across Vicuna-7B/68M, Qwen3-8B/0.6B, and GLM-4-9B/0.6B target/draft pairs, our
approach achieves up to 3$\times$ throughput improvement at batch size 8
compared to batch size 1, with efficient scaling through batch size 8, while
maintaining 95% output equivalence. Our method requires no custom kernels and
integrates cleanly with existing inference stacks. Our code is available at
https://github.com/eBay/spec_dec.
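Illustrative note: the ragged tensor problem and the idea of forming same-length groups can be sketched with plain bookkeeping on sequence lengths. This is a toy model under stated assumptions, not the EXSPEC implementation: the function names, the sequence ids, and the acceptance counts are hypothetical, and no model, KV cache, or attention mask is involved.

```python
from collections import defaultdict


def accept_draft_tokens(lengths, accepted):
    """Return new sequence lengths after each sequence accepts its drafts."""
    return {seq_id: lengths[seq_id] + accepted[seq_id] for seq_id in lengths}


def form_same_length_groups(lengths, max_group_size):
    """Bucket sequence ids by current length, splitting oversized buckets."""
    buckets = defaultdict(list)
    for seq_id, length in sorted(lengths.items()):
        buckets[length].append(seq_id)
    groups = []
    for length in sorted(buckets):
        ids = buckets[length]
        for start in range(0, len(ids), max_group_size):
            groups.append((length, ids[start:start + max_group_size]))
    return groups


# Four sequences start aligned at length 10.
lengths = {"s0": 10, "s1": 10, "s2": 10, "s3": 10}
# Each sequence accepts a different number of draft tokens (hypothetical).
accepted = {"s0": 3, "s1": 1, "s2": 3, "s3": 4}

new_lengths = accept_draft_tokens(lengths, accepted)  # now ragged
groups = form_same_length_groups(new_lengths, max_group_size=8)
```

After one verification step the batch is no longer aligned, and only `s0` and `s2` can share a group without padding or realignment.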
[LINK]http://arxiv.org/abs/2510.22876v1
[DATE]2025-10-27 07:59:23+08:00
[CATEGORIES]cs.CL
The Chameleon Nature of LLMs: Quantifying Multi-Turn Stance Instability in Search-Enabled Language Models
[AUTHORS]Shivam Ratnakar, Sanjay Raghavendra
[ABSTRACT]Integration of Large Language Models with search/retrieval engines has become
ubiquitous, yet these systems harbor a critical vulnerability that undermines
their reliability. We present the first systematic investigation of “chameleon
behavior” in LLMs: their alarming tendency to shift stances when presented with
contradictory questions in multi-turn conversations (especially in
search-enabled LLMs). Through our novel Chameleon Benchmark Dataset, comprising
17,770 carefully crafted question-answer pairs across 1,180 multi-turn
conversations spanning 12 controversial domains, we expose fundamental flaws in
state-of-the-art systems. We introduce two theoretically grounded metrics: the
Chameleon Score (0-1) that quantifies stance instability, and Source Re-use
Rate (0-1) that measures knowledge diversity. Our rigorous evaluation of
Llama-4-Maverick, GPT-4o-mini, and Gemini-2.5-Flash reveals consistent
failures: all models exhibit severe chameleon behavior (scores 0.391-0.511),
with GPT-4o-mini showing the worst performance. Crucially, small
across-temperature variance (less than 0.004) suggests the effect is not a
sampling artifact. Our analysis uncovers the mechanism: strong correlations
between source re-use rate and confidence (r=0.627) and stance changes
(r=0.429) are statistically significant (p less than 0.05), indicating that
limited knowledge diversity makes models pathologically deferential to query
framing. These findings highlight the need for comprehensive consistency
evaluation before deploying LLMs in healthcare, legal, and financial systems
where maintaining coherent positions across interactions is critical for
reliable decision support.
[COMMENTS]39th Conference on Neural Information Processing Systems (NeurIPS
2025) Workshop: MTI-LLM @ NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.16712v2
[DATE]2025-10-27 07:59:21+08:00
[CATEGORIES]cs.CL
FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge
[AUTHORS]Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung
[COMMENTS]accepted to EMNLP 2025
[LINK]http://arxiv.org/abs/2502.19207v2
[DATE]2025-10-27 07:55:34+08:00
[CATEGORIES]cs.CL
Beyond QA Pairs: Assessing Parameter-Efficient Fine-Tuning for Fact Embedding in LLMs
[AUTHORS]Shivam Ratnakar, Abhiroop Talasila, Raghav Chamadiya, Nikhil Agarwal, Vinayak K Doifode
[ABSTRACT]This paper presents an extensive examination of Parameter-Efficient
Fine-Tuning (PEFT) for embedding domain specific facts into Large Language
Models (LLMs), focusing on improving the fine-tuning process by categorizing
question-answer (QA) pairs into Factual and Conceptual classes using a
BERT-based classifier. Two distinct Llama-2 models are fine-tuned based on
these classifications and evaluated using larger models like GPT-3.5 Turbo and
Gemini. Our results indicate that models trained on conceptual datasets
outperform those trained on factual datasets. Additionally, we compare the
efficiency of two synthetic fine-tuning dataset generation techniques, D-RAG
and D-Naive, with D-Naive demonstrating superior performance. Although PEFT has
shown effectiveness, our research indicates that it may not be the optimal
method for embedding facts into LLMs. However, it has demonstrated exceptional
performance in instruction-based tasks. Our findings are reinforced by a
1000-sample dataset in the data center domain, where the fine-tuned Llama-2 7B
model significantly outperforms the baseline model in generating product
recommendations. Our study highlights the importance of QA pair categorization
and synthetic dataset generation techniques in enhancing the performance of
LLMs in specific domains.
[COMMENTS]Presented at the Workshop on Preparing Good Data for Generative AI:
Challenges and Approaches (Good-Data) in conjunction with AAAI 2025. The
authors retain the copyright
[LINK]http://arxiv.org/abs/2503.01131v2
[DATE]2025-10-27 07:53:26+08:00
[CATEGORIES]cs.CL
Interpreting and Mitigating Unwanted Uncertainty in LLMs
[AUTHORS]Tiasa Singha Roy, Ayush Rajesh Jhaveri, Ilias Triantafyllopoulos
[ABSTRACT]Despite their impressive capabilities, Large Language Models (LLMs) exhibit
unwanted uncertainty, a phenomenon where a model changes a previously correct
answer into an incorrect one when re-prompted. This behavior undermines trust
and poses serious risks in high-stakes domains. In this work, we investigate
the mechanisms that drive this phenomenon. We adapt the Needle-in-a-Haystack
retrieval framework and integrate a Flip-style re-evaluation prompt to simulate
realistic answer-flipping scenarios. We find that retrieval heads are not
primarily responsible for avoiding uncertainty. Instead, we identify a small
set of non-retrieval attention heads that disproportionately attend to
misleading tokens in uncertain contexts. Masking these heads yields significant
improvements, reducing flip behavior by up to 15% without introducing
incoherence or overcorrection. However, when tested for downstream tasks, we
observe trade-offs with flip behavior. Our findings contribute to the growing
field of mechanistic interpretability and present a simple yet effective
technique for mitigating uncertainty-driven failure modes in LLMs.
[LINK]http://arxiv.org/abs/2510.22866v1
[DATE]2025-10-27 07:16:59+08:00
[CATEGORIES]cs.CL cs.LG
Far from the Shallow: Brain-Predictive Reasoning Embedding through Residual Disentanglement
[AUTHORS]Linyang He, Tianjun Zhong, Richard Antonello, Gavin Mischler, Micah Goldblum, Nima Mesgarani
[ABSTRACT]Understanding how the human brain progresses from processing simple
linguistic inputs to performing high-level reasoning is a fundamental challenge
in neuroscience. While modern large language models (LLMs) are increasingly
used to model neural responses to language, their internal representations are
highly “entangled,” mixing information about lexicon, syntax, meaning, and
reasoning. This entanglement biases conventional brain encoding analyses toward
linguistically shallow features (e.g., lexicon and syntax), making it difficult
to isolate the neural substrates of cognitively deeper processes. Here, we
introduce a residual disentanglement method that computationally isolates these
components. By first probing an LM to identify feature-specific layers, our
method iteratively regresses out lower-level representations to produce four
nearly orthogonal embeddings for lexicon, syntax, meaning, and, critically,
reasoning. We used these disentangled embeddings to model intracranial (ECoG)
brain recordings from neurosurgical patients listening to natural speech. We
show that: 1) This isolated reasoning embedding exhibits unique predictive
power, accounting for variance in neural activity not explained by other
linguistic features and even extending to the recruitment of visual regions
beyond classical language areas. 2) The neural signature for reasoning is
temporally distinct, peaking later (~350-400ms) than signals related to
lexicon, syntax, and meaning, consistent with its position atop a processing
hierarchy. 3) Standard, non-disentangled LLM embeddings can be misleading, as
their predictive success is primarily attributable to linguistically shallow
features, masking the more subtle contributions of deeper cognitive processing.
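Illustrative note: "regressing out" lower-level representations can be sketched with ordinary least squares on toy data. The sketch below is an assumption-laden simplification, not the authors' pipeline: scalar per-word features stand in for layer embeddings, the numbers are made up, and the helper names `regress_out` and `disentangle` are hypothetical.

```python
def regress_out(target, predictor):
    """Residual of `target` after least-squares regression on `predictor`.

    Both inputs are equal-length lists of floats; an intercept is included,
    so the returned residual is mean-centered.
    """
    n = len(target)
    mean_t = sum(target) / n
    mean_p = sum(predictor) / n
    centered_p = [p - mean_p for p in predictor]
    var_p = sum(c * c for c in centered_p)
    cov = sum(c * (t - mean_t) for c, t in zip(centered_p, target))
    slope = cov / var_p if var_p else 0.0
    return [(t - mean_t) - slope * c for t, c in zip(target, centered_p)]


def disentangle(features):
    """Sequentially residualize features ordered from low to high level."""
    residuals = []
    for feature in features:
        mean_f = sum(feature) / len(feature)
        current = [x - mean_f for x in feature]
        # Remove everything explained by the already-residualized lower levels.
        for lower in residuals:
            current = regress_out(current, lower)
        residuals.append(current)
    return residuals


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy scalar features, ordered from shallow to deep (values are made up).
lexicon = [1.0, 2.0, 3.0, 4.0, 5.0]
syntax = [2.0, 1.0, 4.0, 3.0, 6.0]
reasoning = [1.0, 4.0, 2.0, 8.0, 5.0]

lex_r, syn_r, rea_r = disentangle([lexicon, syntax, reasoning])
```

Each residual is orthogonal to the ones before it, which mirrors the "nearly orthogonal embeddings" goal in a one-dimensional setting.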
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.22860v1
[DATE]2025-10-27 06:46:26+08:00
[CATEGORIES]cs.CL
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
[AUTHORS]Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
[ABSTRACT]Current Vision-Language Models (VLMs) struggle with fine-grained spatial
reasoning, particularly when multi-step logic and precise spatial alignment are
required. In this work, we introduce SpatialReasoner-R1, a vision-language
reasoning model designed to address these limitations. To construct
high-quality supervision for spatial reasoning, we design a Multi-Model Monte
Carlo Tree Search (M3CTS) method that generates diverse, logically consistent
Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose
fine-grained Direct Preference Optimization (fDPO), which introduces
segment-specific preference granularity for descriptive grounding and logical
reasoning, guided by a spatial reward mechanism that evaluates candidate
responses based on visual consistency, spatial grounding, and logical
coherence. Experimental results demonstrate that fDPO achieves an average
improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0%
gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a
new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in
average accuracy, while maintaining competitive performance on general
vision-language tasks.
[LINK]http://arxiv.org/abs/2506.21656v2
[DATE]2025-10-27 06:18:43+08:00
[CATEGORIES]cs.CL
Once Upon an Input: Reasoning via Per-Instance Program Synthesis
[AUTHORS]Adam Stein, Neelay Velingker, Mayur Naik, Eric Wong
[ABSTRACT]Large language models (LLMs) excel at zero-shot inference but continue to
struggle with complex, multi-step reasoning. Recent methods that augment LLMs
with intermediate reasoning steps such as Chain of Thought (CoT) and Program of
Thought (PoT) improve performance but often produce undesirable solutions,
especially in algorithmic domains. We introduce Per-Instance Program Synthesis
(PIPS), a method that generates and refines programs at the instance level
using structural feedback without relying on task-specific guidance or explicit
test cases. To further improve performance, PIPS incorporates a confidence
metric that dynamically chooses between direct inference and program synthesis
on a per-instance basis. Experiments across three frontier LLMs and 30
benchmarks including all tasks of Big Bench Extra Hard (BBEH), visual question
answering tasks, relational reasoning tasks, and mathematical reasoning tasks
show that PIPS improves the absolute harmonic mean accuracy by up to 8.6% and
9.4% compared to PoT and CoT respectively, and reduces undesirable program
generations by 65.1% on the algorithmic tasks compared to PoT with
Gemini-2.0-Flash.
[COMMENTS]Accepted at NeurIPS 2025. 34 pages, 7 figures
[LINK]http://arxiv.org/abs/2510.22849v1
[DATE]2025-10-27 05:58:33+08:00
[CATEGORIES]cs.CL cs.LG
Breaking Language Barriers or Reinforcing Bias? A Study of Gender and Racial Disparities in Multilingual Contrastive Vision Language Models
[AUTHORS]Zahraa Al Sahili, Ioannis Patras, Matthew Purver
[ABSTRACT]Multilingual vision-language models (VLMs) promise universal image-text
retrieval, yet their social biases remain underexplored. We perform the first
systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP,
CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in
resource availability and morphological gender marking. Using balanced subsets
of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify
race and gender bias and measure stereotype amplification. Contrary to the
intuition that multilinguality mitigates bias, every model exhibits stronger
gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest
biases precisely in the low-resource languages it targets, while the shared
encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into
gender-neutral languages; loosely coupled encoders largely avoid this leakage.
Although SigLIP-2 reduces agency and communion skews, it inherits – and in
caption-sparse contexts (e.g., Xhosa) amplifies – the English anchor’s crime
associations. Highly gendered languages consistently magnify all bias types,
yet gender-neutral languages remain vulnerable whenever cross-lingual weight
sharing imports foreign stereotypes. Aggregated metrics thus mask
language-specific hot spots, underscoring the need for fine-grained,
language-aware bias evaluation in future multilingual VLM research.
[COMMENTS]Accepted at IJCNLP-AACL 2025
[LINK]http://arxiv.org/abs/2505.14160v3
[DATE]2025-10-27 05:27:28+08:00
[CATEGORIES]cs.CL cs.LG
Leveraging Large Language Models to Identify Conversation Threads in Collaborative Learning
[AUTHORS]Prerna Ravi, Dong Won Lee, Beatriz Flamia, Jasmine David, Brandon Hanks, Cynthia Breazeal, Emma Anderson, Grace Lin
[ABSTRACT]Understanding how ideas develop and flow in small-group conversations is
critical for analyzing collaborative learning. A key structural feature of
these interactions is threading, the way discourse talk naturally organizes
into interwoven topical strands that evolve over time. While threading has been
widely studied in asynchronous text settings, detecting threads in synchronous
spoken dialogue remains challenging due to overlapping turns and implicit cues.
At the same time, large language models (LLMs) show promise for automating
discourse analysis but often struggle with long-context tasks that depend on
tracing these conversational links. In this paper, we investigate whether
explicit thread linkages can improve LLM-based coding of relational moves in
group talk. We contribute a systematic guidebook for identifying threads in
synchronous multi-party transcripts and benchmark different LLM prompting
strategies for automated threading. We then test how threading influences
performance on downstream coding of conversational analysis frameworks that
capture core collaborative actions such as agreeing, building, and eliciting.
Our results show that providing clear conversational thread information
improves LLM coding performance and underscores the heavy reliance of
downstream analysis on well-structured dialogue. We also discuss practical
trade-offs in time and cost, emphasizing where human-AI hybrid approaches can
yield the best value. Together, this work advances methods for combining LLMs
and robust conversational thread structures to make sense of complex, real-time
group interactions.
[COMMENTS]In Submission: Journal of Educational Data Mining (jEDM) 2026
[LINK]http://arxiv.org/abs/2510.22844v1
[DATE]2025-10-27 05:25:23+08:00
[CATEGORIES]cs.CL
Reasoning is Periodicity? Improving Large Language Models Through Effective Periodicity Modeling
[AUTHORS]Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, Hao Zhu, Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei
[ABSTRACT]Periodicity, as one of the most important basic characteristics, lays the
foundation for facilitating structured knowledge acquisition and systematic
cognitive processes within human learning paradigms. However, the potential
flaws of periodicity modeling in Transformer affect the learning efficiency and
establishment of underlying principles from data for large language models
(LLMs) built upon it. In this paper, we demonstrate that integrating effective
periodicity modeling can improve the learning efficiency and performance of
LLMs. We introduce FANformer, which adapts Fourier Analysis Network (FAN) into
attention mechanism to achieve efficient periodicity modeling, by modifying the
feature projection process of attention mechanism. Extensive experimental
results on language modeling show that FANformer consistently outperforms
Transformer when scaling up model size and training tokens, underscoring its
superior learning efficiency. Our pretrained FANformer-1B exhibits marked
improvements on downstream tasks compared to open-source LLMs with similar
model parameters or training tokens. Moreover, we reveal that FANformer
exhibits superior ability to learn and apply rules for reasoning compared to
Transformer. The results position FANformer as an effective and promising
architecture for advancing LLMs.
[COMMENTS]Accepted to NeurIPS'25
[LINK]http://arxiv.org/abs/2502.21309v4
[DATE]2025-10-27 03:43:00+08:00
[CATEGORIES]cs.CL cs.LG
Exact Coset Sampling for Quantum Lattice Algorithms
[AUTHORS]Yifan Zhang
[ABSTRACT]We give a simple replacement for the contested “domain-extension” in Step 9
of a recent windowed-QFT lattice algorithm with complex-Gaussian windows (Chen,
2024). As acknowledged by the author, the reported issue is due to a
periodicity/support mismatch when extending only the first coordinate in the
presence of offsets, which breaks the intended $\mathbb{Z}_P$-fiber. Our new
subroutine replaces domain extension by a pair-shift difference that cancels
unknown offsets exactly and synthesizes a uniform cyclic subgroup (a
zero-offset coset) of order $P$ inside $(\mathbb{Z}_{M_2})^n$. We adopt a
gate-level access model and run a short prepass that measures the designated
outcome registers (Chen’s Steps 1, 3, and 5), fixing $E=(y',z',h^{\ast})$. We
then identify a concrete program point $t^{\star}$ at which an index wire $J
\in \mathbb{Z}_P$ is preserved and the coordinate block equals
$\mathbf{X}(j)\equiv 2D^2 j\,\mathbf{b}^{\ast}+\mathbf{v}^{\ast}\ (\bmod M_2)$.
A compute-copy-uncompute sandwich on the prefix up to $t^{\star}$ yields a
reversible evaluator that we call only on basis inputs $j=0,1$ to harvest
$V=\mathbf{X}(0)$ and $\Delta=\mathbf{X}(1)-\mathbf{X}(0)\equiv
2D^2\mathbf{b}^{\ast}$ within the same run. We never invert a measurement, and
we do not claim the circuit suffix after $t^{\star}$. The default Step
$9^{\dagger}$ uses only $\Delta$ (no foreknowledge of $\mathbf{b}^\ast$): set
$\mathbf{Z}\leftarrow -\,T\cdot \Delta\ (\bmod M_2)$ for uniform
$T\in\mathbb{Z}_P$ and erase $T$ coherently primewise by modular inversion and
CRT.
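Illustrative note: the offset cancellation behind the pair-shift difference can be checked with classical modular arithmetic. The sketch below only mirrors the formulas $\mathbf{X}(j)$, $\Delta$, and $\mathbf{Z}$ quoted above. All numeric values ($D$, $M_2$, $\mathbf{b}^{\ast}$, $\mathbf{v}^{\ast}$, $T$) are toy assumptions, the function names are hypothetical, and nothing here models the quantum circuit or the coherent erasure of $T$.

```python
def coordinate_block(j, D, b_star, v_star, M2):
    """X(j) = 2 D^2 j b* + v*  (mod M2), computed componentwise."""
    return [(2 * D * D * j * b + v) % M2 for b, v in zip(b_star, v_star)]


def pair_shift_difference(D, b_star, v_star, M2):
    """Delta = X(1) - X(0) (mod M2); the unknown offset v* cancels."""
    x0 = coordinate_block(0, D, b_star, v_star, M2)
    x1 = coordinate_block(1, D, b_star, v_star, M2)
    return [(a - b) % M2 for a, b in zip(x1, x0)]


def step9_output(T, delta, M2):
    """Z = -T * Delta (mod M2) for a given T."""
    return [(-T * d) % M2 for d in delta]


# Toy parameters (assumptions for illustration only).
D, M2 = 3, 1009
b_star = [5, 12, 7]

# Two different offsets give the same Delta.
delta_a = pair_shift_difference(D, b_star, [400, 17, 993], M2)
delta_b = pair_shift_difference(D, b_star, [1, 2, 3], M2)

# In the abstract T is uniform over Z_P; a fixed T = 2 keeps this deterministic.
Z = step9_output(2, delta_a, M2)
```

Because $\Delta$ does not depend on $\mathbf{v}^{\ast}$, neither does $\mathbf{Z}$, which is the sense in which the offsets are cancelled exactly.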
[COMMENTS]Project Page: https://github.com/yifanzhang-pro/quantum-lattice
[LINK]http://arxiv.org/abs/2509.12341v4
[DATE]2025-10-27 03:21:26+08:00
[CATEGORIES]cs.CL
FAN: Fourier Analysis Networks
[AUTHORS]Yihong Dong, Ge Li, Yongding Tao, Xue Jiang, Kechi Zhang, Jia Li, Jinliang Deng, Jing Su, Jun Zhang, Jingjing Xu
[ABSTRACT]Despite the remarkable successes of general-purpose neural networks, such as
MLPs and Transformers, we find that they exhibit notable shortcomings in
modeling and reasoning about periodic phenomena, achieving only marginal
performance within the training domain and failing to generalize effectively to
out-of-domain (OOD) scenarios. Periodicity is ubiquitous throughout nature and
science. Therefore, neural networks should be equipped with the essential
ability to model and handle periodicity. In this work, we propose FAN, a novel
neural network that effectively addresses periodicity modeling challenges while
offering broad applicability similar to MLP with fewer parameters and FLOPs.
Periodicity is naturally integrated into FAN’s structure and computational
processes by introducing the Fourier Principle. Unlike existing Fourier-based
networks, which possess particular periodicity modeling abilities but face
challenges in scaling to deeper networks and are typically designed for
specific tasks, our approach overcomes this challenge to enable scaling to
large-scale models and maintains general-purpose modeling capability. Through
extensive experiments, we demonstrate the superiority of FAN in periodicity
modeling tasks and the effectiveness and generalizability of FAN across a range
of real-world tasks. Moreover, we reveal that compared to existing
Fourier-based networks, FAN accommodates both periodicity modeling and
general-purpose modeling well.
[COMMENTS]Accepted to NeurIPS'25
[LINK]http://arxiv.org/abs/2410.02675v6
[DATE]2025-10-27 03:15:11+08:00
[CATEGORIES]cs.LG cs.CL
VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions
[AUTHORS]Thu Phuong Nguyen, Duc M. Nguyen, Hyotaek Jeon, Hyunwook Lee, Hyunmin Song, Sungahn Ko, Taehwan Kim
[ABSTRACT]Automatically assessing handwritten mathematical solutions is an important
problem in educational technology with practical applications, but it remains a
significant challenge due to the diverse formats, unstructured layouts, and
symbolic complexity of student work. To address this challenge, we introduce
VEHME, a Vision-Language Model for Evaluating Handwritten Mathematics
Expressions, designed to assess open-form handwritten math responses with high
accuracy and interpretable reasoning traces. VEHME integrates a two-phase
training pipeline: (i) supervised fine-tuning using structured reasoning data,
and (ii) reinforcement learning that aligns model outputs with
multi-dimensional grading objectives, including correctness, reasoning depth,
and error localization. To enhance spatial understanding, we propose an
Expression-Aware Visual Prompting Module, trained on our synthesized multi-line
math expressions dataset to robustly guide attention in visually heterogeneous
inputs. Evaluated on AIHub and FERMAT datasets, VEHME achieves state-of-the-art
performance among open-source models and approaches the accuracy of proprietary
systems, demonstrating its potential as a scalable and accessible tool for
automated math assessment. Our training and experiment code is publicly
available at our GitHub repository.
[COMMENTS]EMNLP 2025. Project Website: https://vehme.github.io/
[LINK]http://arxiv.org/abs/2510.22798v1
[DATE]2025-10-27 03:03:27+08:00
[CATEGORIES]cs.CL cs.LG
EuroSpeech: A Multilingual Speech Corpus
[AUTHORS]Samuel Pfisterer, Florian Grötschla, Luca A. Lanzendörfer, Florian Yan, Roger Wattenhofer
[COMMENTS]Published in the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025) Track on Datasets and Benchmark
[LINK]http://arxiv.org/abs/2510.00514v2
[DATE]2025-10-27 02:50:17+08:00
[CATEGORIES]cs.CL cs.LG
How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations
[AUTHORS]Zora Zhiruo Wang, Yijia Shao, Omar Shaikh, Daniel Fried, Graham Neubig, Diyi Yang
[ABSTRACT]AI agents are continually optimized for tasks related to human work, such as
software engineering and professional writing, signaling a pressing trend with
significant impacts on the human workforce. However, these agent developments
have often not been grounded in a clear understanding of how humans execute
work, to reveal what expertise agents possess and the roles they can play in
diverse workflows. In this work, we study how agents do human work by
presenting the first direct comparison of human and agent workers across
multiple essential work-related skills: data analysis, engineering,
computation, writing, and design. To better understand and compare
heterogeneous computer-use activities of workers, we introduce a scalable
toolkit to induce interpretable, structured workflows from either human or
agent computer-use activities. Using such induced workflows, we compare how
humans and agents perform the same tasks and find that: (1) While agents
exhibit promise in their alignment to human workflows, they take an
overwhelmingly programmatic approach across all work domains, even for
open-ended, visually dependent tasks like design, creating a contrast with the
UI-centric methods typically used by humans. (2) Agents produce work of
inferior quality, yet often mask their deficiencies via data fabrication and
misuse of advanced tools. (3) Nonetheless, agents deliver results 88.3% faster
and cost 90.4-96.2% less than humans, highlighting the potential for enabling
efficient collaboration by delegating easily programmable tasks to agents.
[LINK]http://arxiv.org/abs/2510.22780v1
[DATE]2025-10-27 02:10:22+08:00
[CATEGORIES]cs.CL
Scalable Supervising Software Agents with Patch Reasoner
[AUTHORS]Junjielong Xu, Boyin Tan, Xiaoyuan Liu, Chao Peng, Pengfei Gao, Pinjia He
[ABSTRACT]While large language model agents have advanced software engineering tasks,
the unscalable nature of existing test-based supervision is limiting the
potential improvement of data scaling. The reason is twofold: (1) building and
running a test sandbox is rather heavy and fragile, and (2) data with
high-coverage tests is naturally rare and threatened by test hacking via edge
cases. In this paper, we propose R4P, a patch verifier model to provide
scalable rewards for training and testing SWE agents via reasoning. We consider
that patch verification is fundamentally a reasoning task, mirroring how human
repository maintainers review patches without writing and running new
reproduction tests. To obtain sufficient reference and reduce the risk of
reward hacking, R4P uses a group-wise objective for RL training, enabling it to
verify multiple patches against each other’s modification and gain a dense
reward for stable training. R4P achieves 72.2% Acc. for verifying patches from
SWE-bench-verified, surpassing OpenAI o3. To demonstrate R4P’s practicality, we
design and train a lite scaffold, Mini-SE, with pure reinforcement learning
where all rewards are derived from R4P. As a result, Mini-SE achieves 26.2%
Pass@1 on SWE-bench-verified, showing a 10.0% improvement over the original
Qwen3-32B. This can be further improved to 32.8% with R4P for test-time
scaling. Furthermore, R4P verifies patches within a second, 50x faster than
testing on average. The stable scaling curves of rewards and accuracy along
with high efficiency reflect R4P’s practicality.
[LINK]http://arxiv.org/abs/2510.22775v1
[DATE]2025-10-27 01:52:05+08:00
[CATEGORIES]cs.CL
MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion
[AUTHORS]Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu
[ABSTRACT]As Large Vision-Language Models (LVLMs) are increasingly deployed in domains
such as shopping, health, and news, they are exposed to pervasive persuasive
content. A critical question is how these models function as persuadees: how and
why they can be influenced by persuasive multimodal inputs. Understanding both
their susceptibility to persuasion and the effectiveness of different
persuasive strategies is crucial, as overly persuadable models may adopt
misleading beliefs, override user preferences, or generate unethical or unsafe
outputs when exposed to manipulative messages. We introduce MMPersuade, a
unified framework for systematically studying multimodal persuasion dynamics in
LVLMs. MMPersuade contributes (i) a comprehensive multimodal dataset that pairs
images and videos with established persuasion principles across commercial,
subjective and behavioral, and adversarial contexts, and (ii) an evaluation
framework that quantifies both persuasion effectiveness and model
susceptibility via third-party agreement scoring and self-estimated token
probabilities on conversation histories. Our study of six leading LVLMs as
persuadees yields three key insights: (i) multimodal inputs substantially
increase persuasion effectiveness (and model susceptibility) compared to text
alone, especially in misinformation scenarios; (ii) stated prior preferences
decrease susceptibility, yet multimodal information maintains its persuasive
advantage; and (iii) different strategies vary in effectiveness across
contexts, with reciprocity being most potent in commercial and subjective
contexts, and credibility and logic prevailing in adversarial contexts. By
jointly analyzing persuasion effectiveness and susceptibility, MMPersuade
provides a principled foundation for developing models that are robust,
preference-consistent, and ethically aligned when engaging with persuasive
multimodal content.
[LINK]http://arxiv.org/abs/2510.22768v1
[DATE]2025-10-27 01:39:21+08:00
[CATEGORIES]cs.CL
TELL-TALE: Task Efficient LLMs with Task Aware Layer Elimination
[AUTHORS]Omar Naim, Krish Sharma, Nicholas Asher
[ABSTRACT]In this paper we introduce TALE, Task-Aware Layer Elimination, an
inference-time algorithm that prunes entire transformer layers in an LLM by
directly optimizing task-specific validation performance. We evaluate TALE on 9
tasks and 5 models, including LLaMA 3.1 8B, Qwen 2.5 7B, Qwen 2.5 0.5B, Mistral
7B, and Lucie 7B, under both zero-shot and few-shot settings. Unlike prior
approaches, TALE requires no retraining and consistently improves accuracy
while reducing computational cost across all benchmarks. Furthermore, applying
TALE during finetuning leads to additional performance gains. Finally, TALE
provides flexible user control over trade-offs between accuracy and efficiency.
Mutual information analysis shows that certain layers act as bottlenecks,
degrading task-relevant representations. TALE’s selective layer removal
remedies this problem, producing smaller, faster, and more accurate models that
are also faster to fine-tune while offering new insights into transformer
interpretability.
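Illustrative note: one way to picture layer elimination driven by validation performance is a greedy search. The abstract does not specify the search procedure, so the greedy loop, the stopping rule, the function names, and the toy scoring function below are all assumptions made for illustration, not the TALE algorithm itself.

```python
def greedy_layer_elimination(num_layers, evaluate):
    """Greedily drop layers while the validation score does not decrease.

    `evaluate` maps a tuple of kept layer indices to a validation score
    (higher is better). Returns the kept layers and their score.
    """
    kept = tuple(range(num_layers))
    best_score = evaluate(kept)
    while len(kept) > 1:
        # Score every configuration obtained by removing one more layer.
        candidates = []
        for i in range(len(kept)):
            layers = kept[:i] + kept[i + 1:]
            candidates.append((evaluate(layers), layers))
        score, layers = max(candidates, key=lambda c: c[0])
        if score < best_score:
            break  # every further removal hurts validation performance
        kept, best_score = layers, score
    return kept, best_score


# Toy stand-in for a validation run: layers 2 and 4 hurt the task.
HARMFUL = {2, 4}
TOTAL_LAYERS = 6


def toy_validation_score(kept):
    harmful_kept = len(HARMFUL & set(kept))
    useful_total = TOTAL_LAYERS - len(HARMFUL)
    useful_removed = useful_total - len(set(kept) - HARMFUL)
    return 0.70 - 0.05 * harmful_kept - 0.10 * useful_removed


kept_layers, final_score = greedy_layer_elimination(
    TOTAL_LAYERS, toy_validation_score
)
```

In this toy setting the search removes the two harmful layers and then stops, since any further removal lowers the score. A real run would replace `toy_validation_score` with an actual evaluation of the pruned model on task validation data.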
[LINK]http://arxiv.org/abs/2510.22767v1
[DATE]2025-10-27 01:34:40+08:00
[CATEGORIES]cs.LG cs.CL
Iterative Layer Pruning for Efficient Translation Inference
[AUTHORS]Yasmin Moslem, Muhammad Hazim Al Farouq, John D. Kelleher
[ABSTRACT]Large language models (LLMs) have transformed many areas of natural language
processing, including machine translation. However, efficient deployment of
LLMs remains challenging due to their intensive computational requirements. In
this paper, we address this challenge and present our submissions to the Model
Compression track at the Conference on Machine Translation (WMT 2025). In our
experiments, we investigate iterative layer pruning guided by layer importance
analysis. We evaluate this method using the Aya-Expanse-8B model for
translation from Czech to German, and from English to Egyptian Arabic. Our
approach achieves substantial reductions in model size and inference time,
while maintaining the translation quality of the baseline models.
[COMMENTS]WMT 2025
[LINK]http://arxiv.org/abs/2510.22763v1
[DATE]2025-10-27 01:26:14+08:00
[CATEGORIES]cs.CL
EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
[AUTHORS]Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao, Junyi Ao, Yuhao Zhang, Benyou Wang, Haizhou Li
[ABSTRACT]Speech Language Models (SLMs) have made significant progress in spoken
language understanding. Yet it remains unclear whether they can fully perceive
non-lexical vocal cues alongside spoken words, and respond with empathy that
aligns with both emotional and contextual factors. Existing benchmarks
typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in
isolation, overlooking the integration of these skills that is crucial for
human-like, emotionally intelligent conversation. We present EchoMind, the
first interrelated, multi-level benchmark that simulates the cognitive process
of empathetic dialogue through sequential, context-linked tasks: spoken-content
understanding, vocal-cue perception, integrated reasoning, and response
generation. All tasks share identical and semantically neutral scripts that are
free of explicit emotional or contextual cues, and controlled variations in
vocal style are used to test the effect of delivery independent of the
transcript. EchoMind is grounded in an empathy-oriented framework spanning 3
coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and
evaluated using both objective and subjective metrics. Testing 12 advanced SLMs
reveals that even state-of-the-art models struggle with high-expressive vocal
cues, limiting empathetic response quality. Analyses of prompt strength, speech
source, and ideal vocal cue recognition reveal persistent weaknesses in
instruction-following, resilience to natural speech variability, and effective
use of vocal cues for empathy. These results underscore the need for SLMs that
integrate linguistic content with diverse vocal cues to achieve truly
empathetic conversational ability.
[COMMENTS]Speech Language Models, Spoken Language Understanding, Vocal Cue
Perception, Empathetic Dialogue, Benchmark Evaluation
[LINK]http://arxiv.org/abs/2510.22758v1
[DATE]2025-10-27 01:15:56+08:00
[CATEGORIES]cs.CL
Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models
[AUTHORS]Anooshka Bajaj, Deven Mahesh Mistry, Sahaj Singh Maini, Yash Aggarwal, Zoran Tiganj
[ABSTRACT]In-context learning is governed by both temporal and semantic relationships,
shaping how Large Language Models (LLMs) retrieve contextual information.
Analogous to human episodic memory, where the retrieval of specific events is
enabled by separating events that happened at different times, this work probes
the ability of various pretrained LLMs, including transformer and state-space
models, to differentiate and retrieve temporally separated events.
Specifically, we prompted models with sequences containing multiple
presentations of the same token, which reappears at the sequence end. By fixing
the positions of these repeated tokens and permuting all others, we removed
semantic confounds and isolated temporal effects on next-token prediction.
Across diverse sequences, models consistently placed the highest probabilities
on tokens following a repeated token, but with a notable bias for those nearest
the beginning or end of the input. An ablation experiment linked this
phenomenon in transformers to induction heads. Extending the analysis to unique
semantic contexts with partial overlap further demonstrated that memories
embedded in the middle of a prompt are retrieved less reliably. Despite
architectural differences, state-space and transformer models showed comparable
temporal biases. Our findings deepen the understanding of temporal biases in
in-context learning and offer an illustration of how these biases can enable
temporal separation and episodic retrieval.
[LINK]http://arxiv.org/abs/2510.22752v1
[DATE]2025-10-27 01:01:41+08:00
[CATEGORIES]cs.CL
Multi-Modal Fact-Verification Framework for Reducing Hallucinations in Large Language Models
[AUTHORS]Piyushkumar Patel
[ABSTRACT]While Large Language Models have transformed how we interact with AI systems,
they suffer from a critical flaw: they confidently generate false information
that sounds entirely plausible. This hallucination problem has become a major
barrier to deploying these models in real-world applications where accuracy
matters. We developed a fact verification framework that catches and corrects
these errors in real time by cross-checking LLM outputs against multiple
knowledge sources. Our system combines structured databases, live web searches,
and academic literature to verify factual claims as they’re generated. When we
detect inconsistencies, we automatically correct them while preserving the
natural flow of the response. Testing across various domains showed we could
reduce hallucinations by 67% without sacrificing response quality. Domain
experts in healthcare, finance, and scientific research rated our corrected
outputs 89% satisfactory, a significant improvement over unverified LLM
responses. This work offers a practical solution for making LLMs more
trustworthy in applications where getting facts wrong isn’t an option.
[LINK]http://arxiv.org/abs/2510.22751v1
[DATE]2025-10-27 00:58:54+08:00
[CATEGORIES]cs.CL
Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
[AUTHORS]Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim
[ABSTRACT]Despite the widespread adoption of large language models (LLMs), their
strongest capabilities remain largely confined to a small number of
high-resource languages for which there is abundant training data. Recently,
continual pre-training (CPT) has emerged as a means to fine-tune these models
to low-resource regional dialects. In this paper, we study the use of CPT for
dialect learning under tight data and compute budgets. Using low-rank
adaptation (LoRA) and compute-efficient continual pre-training, we adapt three
LLMs to the Québec French dialect using a very small dataset and benchmark
them on the COLE suite. Our experiments demonstrate an improvement on the
minority dialect benchmarks with minimal regression on the prestige language
benchmarks with under 1% of model parameters updated. Analysis of the results
demonstrates that gains are highly contingent on corpus composition. These
findings indicate that CPT with parameter-efficient fine-tuning (PEFT) can
narrow the dialect gap by providing cost-effective and sustainable language
resource creation, expanding high-quality LLM access to minority linguistic
communities. We release the first Québec French LLMs on HuggingFace.
[COMMENTS]Submitted to LREC 2026
[LINK]http://arxiv.org/abs/2510.22747v1
[DATE]2025-10-27 00:49:06+08:00
[CATEGORIES]cs.CL
Evolving LLMs’ Self-Refinement Capability via Synergistic Training-Inference Optimization
[AUTHORS]Yongcheng Zeng, Xinyu Cui, Xuanfa Jin, Qirui Mi, Guoqing Liu, Zexu Sun, Mengyue Yang, Dong Li, Weiyu Ma, Ning Yang, Jian Zhao, Jianye Hao, Haifeng Zhang, Jun Wang
[ABSTRACT]Self-Refinement refers to a model’s ability to revise its own responses to
produce improved outputs. This capability can also serve as a fundamental
mechanism for Self-Improvement, for example, by reconstructing datasets with
refined results to enhance intrinsic model performance. However, our
comprehensive experiments reveal that large language models (LLMs) show no
clear evidence of inherent Self-Refinement and may even experience response
quality degradation after Self-Refinement. To address this issue, we propose
EVOLVE, a simple and effective framework for eliciting and tracking the
evolution of Self-Refinement through iterative training. We first explore
optimization methods during training to activate the model’s Self-Refinement
capability. Then, at inference, we investigate various generation strategies to
further enhance and utilize Self-Refinement while supplying the necessary data
for training. Through synergistic optimization of training and inference
stages, we continually evolve the model’s Self-Refinement ability, enabling it
to better refine its own responses. Moreover, we demonstrate the potential of
leveraging Self-Refinement to achieve broader Self-Improvement of intrinsic
model abilities. Experiments show that the evolved Self-Refinement ability
enables the Llama-3.1-8B base model to surpass GPT-4o, achieving 62.3%
length-controlled and 63.3% raw win rates on AlpacaEval 2, and 50.3% on
Arena-Hard. It also generalizes effectively to out-of-domain reasoning tasks,
improving performance on mathematical reasoning benchmarks such as GSM8K and
MATH.
[LINK]http://arxiv.org/abs/2502.05605v6
[DATE]2025-10-27 00:21:53+08:00
[CATEGORIES]cs.CL cs.LG
REVISION:Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization
[AUTHORS]Yiwen Tang, Qiuyu Zhao, Zenghui Sun, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang
[ABSTRACT]In Taobao e-commerce visual search, user behavior analysis reveals a large
proportion of no-click requests, suggesting diverse and implicit user intents.
These intents are expressed in various forms and are difficult to mine and
discover, thereby leading to the limited adaptability and lag in platform
strategies. This greatly restricts users’ ability to express diverse intents
and hinders the scalability of the visual search system. This mismatch between
user implicit intent expression and system response defines the User-SearchSys
Intent Discrepancy. To alleviate the issue, we propose a novel framework
REVISION. This framework integrates offline reasoning mining with online
decision-making and execution, enabling adaptive strategies to solve implicit
user demands. In the offline stage, we construct a periodic pipeline to mine
discrepancies from historical no-click requests. Leveraging large models, we
analyze implicit intent factors and infer optimal suggestions by jointly
reasoning over query and product metadata. These inferred suggestions serve as
actionable insights for refining platform strategies. In the online stage,
REVISION-R1-3B, trained on the curated offline data, performs holistic analysis
over query images and associated historical products to generate optimization
plans and adaptively schedule strategies across the search pipeline. Our
framework offers a streamlined paradigm for integrating large models with
traditional search systems, enabling end-to-end intelligent optimization across
information aggregation and user interaction. Experimental results demonstrate
that our approach improves the efficiency of implicit intent mining from
large-scale search logs and significantly reduces the no-click rate.
[LINK]http://arxiv.org/abs/2510.22739v1
[DATE]2025-10-27 00:15:50+08:00
[CATEGORIES]cs.CL
$\text{E}^2\text{Rank}$: Your Text Embedding can Also be an Effective and Efficient Listwise Reranker
[AUTHORS]Qi Liu, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Jiaxin Mao
[ABSTRACT]Text embedding models serve as a fundamental component in real-world search
applications. By mapping queries and documents into a shared embedding space,
they deliver competitive retrieval performance with high efficiency. However,
their ranking fidelity remains limited compared to dedicated rerankers,
especially recent LLM-based listwise rerankers, which capture fine-grained
query-document and document-document interactions. In this paper, we propose a
simple yet effective unified framework, $\text{E}^2\text{Rank}$, which stands for
Efficient Embedding-based Ranking (and also Embedding-to-Rank), which extends a single
text embedding model to perform both high-quality retrieval and listwise
reranking through continued training under a listwise ranking objective,
thereby achieving strong effectiveness with remarkable efficiency. By applying
cosine similarity between the query and document embeddings as a unified
ranking function, the listwise ranking prompt, which is constructed from the
original query and its candidate documents, serves as an enhanced query
enriched with signals from the top-K documents, akin to pseudo-relevance
feedback (PRF) in traditional retrieval models. This design preserves the
efficiency and representational quality of the base embedding model while
significantly improving its reranking performance. Empirically,
$\text{E}^2\text{Rank}$ achieves state-of-the-art results on the BEIR
reranking benchmark and demonstrates competitive performance on the
reasoning-intensive BRIGHT benchmark, with very low reranking latency. We also
show that the ranking training process improves embedding performance on the
MTEB benchmark. Our findings indicate that a single embedding model can
effectively unify retrieval and reranking, offering both computational
efficiency and competitive ranking accuracy.
[COMMENTS]Code and models are available at https://alibaba-nlp.github.io/E2Rank
[LINK]http://arxiv.org/abs/2510.22733v1
[DATE]2025-10-27 00:04:48+08:00
[CATEGORIES]cs.CL
ATLAS: Actor-Critic Task-Completion with Look-ahead Action Simulation
[AUTHORS]Jiali Cheng, Anjishnu Kumar, Roshan Lal, Rishi Rajasekaran, Hani Ramezani, Omar Zia Khan, Oleg Rokhlenko, Sunny Chiu-Webster, Gang Hua, Hadi Amiri
[ABSTRACT]We observe that current state-of-the-art web-agents are unable to effectively
adapt to new environments without neural network fine-tuning, without which
they produce inefficient execution plans due to a lack of awareness of the
structure and dynamics of the new environment. To address this limitation, we
introduce ATLAS (Actor-Critic Task-completion with Look-ahead Action
Simulation), a memory-augmented agent that is able to make plans grounded in a
model of the environment by simulating the consequences of those actions in
cognitive space. Our agent starts by building a “cognitive map” by performing a
lightweight curiosity-driven exploration of the environment. The planner
proposes candidate actions; the simulator predicts their consequences in
cognitive space; a critic analyzes the options to select the best roll-out and
update the original plan; and a browser executor performs the chosen action. On
the WebArena-Lite Benchmark, we achieve a 63% success rate compared to 53.9%
success rate for the previously published state-of-the-art. Unlike previous
systems, our modular architecture requires no website-specific LLM fine-tuning.
Ablations show sizable drops without the world-model, hierarchical planner, and
look-ahead-based replanner, confirming their complementary roles within the
design of our system.
[COMMENTS]9 pages, NeurIPS 2025 Workshop on Language Agents and World Models
[LINK]http://arxiv.org/abs/2510.22732v1
[DATE]2025-10-27 00:03:39+08:00
[CATEGORIES]cs.LG cs.CL
Covering Multiple Objectives with a Small Set of Solutions Using Bayesian Optimization
[AUTHORS]Natalie Maus, Kyurae Kim, Yimeng Zeng, Haydn Thomas Jones, Fangping Wan, Marcelo Der Torossian Torres, Cesar de la Fuente-Nunez, Jacob R. Gardner
[ABSTRACT]In multi-objective black-box optimization, the goal is typically to find
solutions that optimize a set of $T$ black-box objective functions, $f_1,
\ldots, f_T$, simultaneously. Traditional approaches often seek a single
Pareto-optimal set that balances trade-offs among all objectives. In contrast,
we consider a problem setting that departs from this paradigm: finding a small
set of $K < T$ solutions that collectively “cover” the $T$ objectives. A set
of solutions is defined as “covering” if, for each objective $f_1, \ldots, f_T$,
there is at least one good solution. A motivating example for this problem
setting occurs in drug design. For example, we may have $T$ pathogens and aim
to identify a set of $K < T$ antibiotics such that at least one antibiotic can
be used to treat each pathogen. This problem, known as coverage optimization,
has yet to be tackled with the Bayesian optimization (BO) framework. To fill
this void, we develop Multi-Objective Coverage Bayesian Optimization (MOCOBO),
a BO algorithm for solving coverage optimization. Our approach is based on a
new acquisition function reminiscent of expected improvement in the vanilla BO
setup. We demonstrate the performance of our method on high-dimensional
black-box optimization tasks, including applications in peptide and molecular
design. Results show that the coverage of the $K < T$ solutions found by MOCOBO
matches or nearly matches the coverage of $T$ solutions obtained by optimizing
each objective individually. Furthermore, in in vitro experiments, the peptides
found by MOCOBO exhibited high potency against drug-resistant pathogens,
further demonstrating the potential of MOCOBO for drug discovery. All of our
code is publicly available at the following link:
https://github.com/nataliemaus/mocobo.
[LINK]http://arxiv.org/abs/2501.19342v4
[DATE]2025-10-27 23:59:24+08:00
[CATEGORIES]cs.LG
Schrodinger Neural Network and Uncertainty Quantification: Quantum Machine
[AUTHORS]M. M. Hammad
[ABSTRACT]We introduce the Schrodinger Neural Network (SNN), a principled architecture
for conditional density estimation and uncertainty quantification inspired by
quantum mechanics. The SNN maps each input to a normalized wave function on the
output domain and computes predictive probabilities via the Born rule. The SNN
departs from standard parametric likelihood heads by learning complex
coefficients of a spectral expansion (e.g., Chebyshev polynomials) whose
squared modulus yields the conditional density $p(y|x)=\left|\psi_x(y)\right|^2$
with analytic normalization. This representation confers three practical
advantages: positivity and exact normalization by construction, native
multimodality through interference among basis modes without explicit mixture
bookkeeping, and closed-form (or efficiently computable) functionals, such as
moments and several calibration diagnostics, expressed as
quadratic forms in coefficient space. We develop the statistical and
computational foundations of the SNN, including (i) training by exact
maximum-likelihood with unit-sphere coefficient parameterization, (ii)
physics-inspired quadratic regularizers (kinetic and potential energies)
motivated by uncertainty relations between localization and spectral
complexity, (iii) scalable low-rank and separable extensions for multivariate
outputs, (iv) operator-based extensions that represent observables,
constraints, and weak labels as self-adjoint matrices acting on the amplitude
space, and (v) a comprehensive framework for evaluating multimodal predictions.
The SNN provides a coherent, tractable framework to elevate probabilistic
prediction from point estimates to physically inspired amplitude-based
distributions.
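
The Born-rule construction in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Chebyshev basis on (-1, 1) whose weight function is absorbed into the basis so that the modes are orthonormal, and it replaces the network by a fixed list of complex coefficients. All function names are hypothetical.

```python
import math
import random


def amplitude_basis(y, num_modes):
    """Assumed orthonormal basis: phi_k(y) = sqrt(w(y)) * a_k * T_k(y),
    with Chebyshev weight w(y) = 1 / (pi * sqrt(1 - y^2)), a_0 = 1, a_k = sqrt(2)."""
    weight = 1.0 / (math.pi * math.sqrt(1.0 - y * y))
    theta = math.acos(y)  # T_k(y) = cos(k * arccos(y))
    return [
        math.sqrt(weight) * (1.0 if k == 0 else math.sqrt(2.0)) * math.cos(k * theta)
        for k in range(num_modes)
    ]


def unit_sphere(raw_coeffs):
    """Project complex coefficients onto the unit sphere (sum of |c_k|^2 = 1)."""
    norm = math.sqrt(sum(abs(c) ** 2 for c in raw_coeffs))
    return [c / norm for c in raw_coeffs]


def born_density(y, coeffs):
    """Born rule: p(y) = |psi(y)|^2 with psi(y) = sum_k c_k * phi_k(y)."""
    basis = amplitude_basis(y, len(coeffs))
    psi = sum(c * b for c, b in zip(coeffs, basis))
    return abs(psi) ** 2


def total_probability(coeffs, num_nodes=64):
    """Gauss-Chebyshev quadrature of p(y) over (-1, 1); should be close to 1."""
    total = 0.0
    for i in range(1, num_nodes + 1):
        y = math.cos((2 * i - 1) * math.pi / (2 * num_nodes))
        # The rule integrates f(y) / sqrt(1 - y^2), so multiply p(y) by sqrt(1 - y^2).
        total += born_density(y, coeffs) * math.sqrt(1.0 - y * y)
    return total * math.pi / num_nodes


random.seed(0)
raw = [complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(5)]
coeffs = unit_sphere(raw)
print(total_probability(coeffs))
```

Because the assumed basis is orthonormal, the integral of the density equals the squared norm of the coefficient vector, so constraining the coefficients to the unit sphere gives normalization without a separate partition function.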
[COMMENTS]29 pages, 16 figures
[LINK]http://arxiv.org/abs/2510.23449v1
[DATE]2025-10-27 23:52:47+08:00
[CATEGORIES]cs.LG
Automatic Discovery of One Parameter Subgroups of $SO(n)$
[AUTHORS]Pavan Karjol, Vivek V Kashyap, Rohan Kashyap, Prathosh A P
[ABSTRACT]We introduce a novel framework for the automatic discovery of one-parameter
subgroups ($H_{\gamma}$) of $SO(3)$ and, more generally, $SO(n)$. One-parameter
subgroups of $SO(n)$ are crucial in a wide range of applications, including
robotics, quantum mechanics, and molecular structure analysis. Our method
utilizes the standard Jordan form of skew-symmetric matrices, which define the
Lie algebra of $SO(n)$, to establish a canonical form for orbits under the
action of $H_{\gamma}$. This canonical form is then employed to derive a
standardized representation for $H_{\gamma}$-invariant functions. By learning
the appropriate parameters, the framework uncovers the underlying one-parameter
subgroup $H_{\gamma}$. The effectiveness of the proposed approach is
demonstrated through tasks such as double pendulum modeling, moment of inertia
prediction, top quark tagging and invariant polynomial regression, where it
successfully recovers meaningful subgroup structure and produces interpretable,
symmetry-aware representations.
[LINK]http://arxiv.org/abs/2509.22219v2
[DATE]2025-10-27 23:45:43+08:00
[CATEGORIES]cs.LG
Onboard Mission Replanning for Adaptive Cooperative Multi-Robot Systems
[AUTHORS]Elim Kwan, Rehman Qureshi, Liam Fletcher, Colin Laganier, Victoria Nockles, Richard Walters
[ABSTRACT]Cooperative autonomous robotic systems have significant potential for
executing complex multi-task missions across space, air, ground, and maritime
domains. But they commonly operate in remote, dynamic and hazardous
environments, requiring rapid in-mission adaptation without reliance on fragile
or slow communication links to centralised compute. Fast, on-board replanning
algorithms are therefore needed to enhance resilience. Reinforcement Learning
shows strong promise for efficiently solving mission planning tasks when
formulated as Travelling Salesperson Problems (TSPs), but existing methods: 1)
are unsuitable for replanning, where agents do not start at a single location;
2) do not allow cooperation between agents; 3) are unable to model tasks with
variable durations; or 4) lack practical considerations for on-board
deployment. Here we define the Cooperative Mission Replanning Problem as a
novel variant of multiple TSP with adaptations to overcome these issues, and
develop a new encoder/decoder-based model using Graph Attention Networks and
Attention Models to solve it effectively and efficiently. Using a simple
example of cooperative drones, we show our replanner consistently (90% of the
time) maintains performance within 10% of the state-of-the-art LKH3 heuristic
solver, whilst running 85-370 times faster on a Raspberry Pi. This work paves
the way for increased resilience in autonomous multi-agent systems.
[COMMENTS]9 pages, 5 figures, 1 table
[LINK]http://arxiv.org/abs/2506.06094v3
[DATE]2025-10-27 23:42:48+08:00
[CATEGORIES]cs.LG
Coresets for Clustering Under Stochastic Noise
[AUTHORS]Lingxiao Huang, Zhize Li, Nisheeth K. Vishnoi, Runkai Yang, Haoyu Zhao
[ABSTRACT]We study the problem of constructing coresets for $(k, z)$-clustering when
the input dataset is corrupted by stochastic noise drawn from a known
distribution. In this setting, evaluating the quality of a coreset is
inherently challenging, as the true underlying dataset is unobserved. To
address this, we investigate coreset construction using surrogate error metrics
that are tractable and provably related to the true clustering cost. We analyze
a traditional metric from prior work and introduce a new error metric that more
closely aligns with the true cost. Although our metric is defined independently
of the noise distribution, it enables approximation guarantees that scale with
the noise level. We design a coreset construction algorithm based on this
metric and show that, under mild assumptions on the data and noise, enforcing
an $\varepsilon$-bound under our metric yields smaller coresets and tighter
guarantees on the true clustering cost than those obtained via classical
metrics. In particular, we prove that the coreset size can improve by a factor
of up to $\mathrm{poly}(k)$, where $n$ is the dataset size. Experiments on
real-world datasets support our theoretical findings and demonstrate the
practical advantages of our approach.
[COMMENTS]This paper has been accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23438v1
[DATE]2025-10-27 23:41:27+08:00
[CATEGORIES]cs.LG
Conditional Mean and Variance Estimation via \textit{k}-NN Algorithm with Automated Variance Selection
[AUTHORS]Marcos Matabuena, Juan C. Vidal, Oscar Hernan Madrid Padilla, Jukka-Pekka Onnela
[ABSTRACT]We introduce a novel \textit{k}-nearest neighbor (\textit{k}-NN) regression
method for joint estimation of the conditional mean and variance. The proposed
algorithm preserves the computational efficiency and manifold-learning
capabilities of classical non-parametric \textit{k}-NN models, while
integrating a data-driven variable selection step that improves empirical
performance. By accurately estimating both conditional mean and variance
regression functions, the method effectively reconstructs the conditional
distribution and density functions for multiple families of
scale-and-localization generative models. We show that our estimator can
achieve fast convergence rates, and we derive practical rules for selecting the
smoothing parameter~$k$ that enhance the precision of the algorithm in finite
sample regimes. Extensive simulations for low, moderate and large-dimensional
covariate spaces, together with a real-world biomedical application,
demonstrate that the proposed method can consistently outperform the
conventional \textit{k}-NN regression algorithm while being more interpretable
in the model output.
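
As a rough illustration of joint mean and variance estimation with nearest neighbors, the sketch below uses a plain k-NN average for the mean and the neighbors' average squared deviation for the variance. It assumes a one-dimensional covariate and omits the data-driven variable selection step described above. The function name is hypothetical, and this is not the authors' estimator.

```python
def knn_mean_variance(x_train, y_train, x_query, k):
    """Estimate the conditional mean and variance at x_query from the k nearest points."""
    # Rank training indices by distance to the query (ties keep input order).
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_query))
    neighbors = order[:k]
    mean = sum(y_train[i] for i in neighbors) / k
    # Local variance: average squared deviation of the neighbors from the local mean.
    variance = sum((y_train[i] - mean) ** 2 for i in neighbors) / k
    return mean, variance


mean, variance = knn_mean_variance([0.0, 1.0, 2.0, 10.0], [1.0, 3.0, 5.0, 100.0], 1.0, 3)
print(mean, variance)
```

With the two estimates in hand, a location-scale model would place the conditional distribution at `mean` with spread `variance ** 0.5`. The choice of `k` trades bias against variance in both estimates.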
[LINK]http://arxiv.org/abs/2402.01635v2
[DATE]2025-10-27 23:39:45+08:00
[CATEGORIES]cs.LG
VIKING: Deep variational inference with stochastic projections
[AUTHORS]Samuel G. Fadel, Hrittik Roy, Nicholas Krämer, Yevgen Zainchkovskyy, Stas Syrota, Alejandro Valverde Mahou, Carl Henrik Ek, Søren Hauberg
[ABSTRACT]Variational mean field approximations tend to struggle with contemporary
overparametrized deep neural networks. Where a Bayesian treatment is usually
associated with high-quality predictions and uncertainties, the practical
reality has been the opposite, with unstable training, poor predictive power,
and subpar calibration. Building upon recent work on reparametrizations of
neural networks, we propose a simple variational family that considers two
independent linear subspaces of the parameter space. These represent functional
changes inside and outside the support of training data. This allows us to
build a fully-correlated approximate posterior reflecting the
overparametrization that tunes easy-to-interpret hyperparameters. We develop
scalable numerical routines that maximize the associated evidence lower bound
(ELBO) and sample from the approximate posterior. Empirically, we observe
state-of-the-art performance across tasks, models, and datasets compared to a
wide array of baseline methods. Our results show that approximate Bayesian
inference applied to deep neural networks is far from a lost cause when
constructing inference mechanisms that reflect the geometry of
reparametrizations.
[COMMENTS]NeurIPS 2025 (poster)
[LINK]http://arxiv.org/abs/2510.23684v1
[DATE]2025-10-27 23:38:35+08:00
[CATEGORIES]cs.LG
Conformal Prediction for Hierarchical Data
[AUTHORS]Guillaume Principato, Gilles Stoltz, Yvenn Amara-Ouali, Yannig Goude, Bachir Hamrouche, Jean-Michel Poggi
[ABSTRACT]We consider conformal prediction for multivariate data and focus on
hierarchical data, where some components are linear combinations of others.
Intuitively, the hierarchical structure can be leveraged to reduce the size of
prediction regions for the same coverage level. We implement this intuition by
including a projection step (also called a reconciliation step) in the split
conformal prediction [SCP] procedure, and prove that the resulting prediction
regions are indeed globally smaller. We do so both under the classic objective
of joint coverage and under a new and challenging task: component-wise
coverage, for which efficiency results are more difficult to obtain. The
associated strategies and their analyses are based both on the literature of
SCP and of forecast reconciliation, which we connect. We also illustrate the
theoretical findings, for different scales of hierarchies on simulated data.
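
For orientation, a minimal sketch of standard split conformal prediction for a single scalar target is given below. It uses absolute residuals as conformity scores and does not include the projection (reconciliation) step that the abstract adds for hierarchical data. The function name and inputs are hypothetical.

```python
import math


def split_conformal_interval(cal_predictions, cal_targets, new_prediction, alpha):
    """Return a (lower, upper) interval with marginal coverage of at least 1 - alpha,
    assuming calibration and test points are exchangeable."""
    # Conformity scores: absolute residuals on the held-out calibration set.
    scores = sorted(abs(t - p) for p, t in zip(cal_predictions, cal_targets))
    n = len(scores)
    # Finite-sample corrected rank of the score quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested level: interval is unbounded.
        return (-math.inf, math.inf)
    radius = scores[rank - 1]
    return (new_prediction - radius, new_prediction + radius)


interval = split_conformal_interval([0.0] * 7, [1, 2, 3, 4, 5, 6, 7], 10.0, 0.25)
print(interval)
```

The width of the interval is twice the selected calibration score, so any step that shrinks the calibration residuals, such as the projection discussed above, would shrink the region at the same coverage level.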
[COMMENTS]38 pages, 3 figures
[LINK]http://arxiv.org/abs/2411.13479v3
[DATE]2025-10-27 23:33:38+08:00
[CATEGORIES]cs.LG
Improving Predictions of Molecular Properties with Graph Featurisation and Heterogeneous Ensemble Models
[AUTHORS]Michael L. Parker, Samar Mahmoud, Bailey Montefiore, Mario Öeren, Himani Tandon, Charlotte Wharrick, Matthew D. Segall
[ABSTRACT]We explore a “best-of-both” approach to modelling molecular properties by
combining learned molecular descriptors from a graph neural network (GNN) with
general-purpose descriptors and a mixed ensemble of machine learning (ML)
models. We introduce a MetaModel framework to aggregate predictions from a
diverse set of leading ML models. We present a featurisation scheme for
combining task-specific GNN-derived features with conventional molecular
descriptors.
We demonstrate that our framework outperforms the cutting-edge ChemProp model
on all regression datasets tested and 6 of 9 classification datasets. We
further show that including the GNN features derived from ChemProp boosts the
ensemble model’s performance on several datasets where it otherwise would have
underperformed. We conclude that to achieve optimal performance across a wide
set of problems, it is vital to combine general-purpose descriptors with
task-specific learned features and use a diverse set of ML models to make the
predictions.
[LINK]http://arxiv.org/abs/2510.23428v1
[DATE]2025-10-27 23:33:05+08:00
[CATEGORIES]cs.LG
PrivacyGuard: A Modular Framework for Privacy Auditing in Machine Learning
[AUTHORS]Luca Melis, Matthew Grange, Iden Kalemaj, Karan Chadha, Shengyuan Hu, Elena Kashtelyan, Will Bullock
[ABSTRACT]The increasing deployment of Machine Learning (ML) models in sensitive
domains motivates the need for robust, practical privacy assessment tools.
PrivacyGuard is a comprehensive tool for empirical differential privacy (DP)
analysis, designed to evaluate privacy risks in ML models through
state-of-the-art inference attacks and advanced privacy measurement techniques.
To this end, PrivacyGuard implements a diverse suite of privacy attacks –
including membership inference, extraction, and reconstruction attacks –
enabling both off-the-shelf and highly configurable privacy analyses. Its
modular architecture allows for the seamless integration of new attacks and
privacy metrics, supporting rapid adaptation to emerging research advances. We
make PrivacyGuard available at
https://github.com/facebookresearch/PrivacyGuard.
[LINK]http://arxiv.org/abs/2510.23427v1
[DATE]2025-10-27 23:33:01+08:00
[CATEGORIES]cs.LG
Attention-based clustering
[AUTHORS]Rodrigo Maulen-Soto, Pierre Marion, Claire Boyer
[ABSTRACT]Transformers have emerged as a powerful neural network architecture capable
of tackling a wide range of learning tasks. In this work, we provide a
theoretical analysis of their ability to automatically extract structure from
data in an unsupervised setting. In particular, we demonstrate their
suitability for clustering when the input data is generated from a Gaussian
mixture model. To this end, we study a simplified two-head attention layer and
define a population risk whose minimization with unlabeled data drives the head
parameters to align with the true mixture centroids. This phenomenon highlights
the ability of attention-based layers to capture underlying distributional
structure. We further examine an attention layer with key, query, and value
matrices fixed to the identity, and show that, even without any trainable
parameters, it can perform in-context quantization, revealing the surprising
capacity of transformer-based methods to adapt dynamically to input-specific
distributions.
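
The parameter-free layer mentioned above can be sketched as softmax attention with key, query, and value matrices set to the identity, so each output is a softmax-weighted average of the input points. The sketch assumes unscaled dot-product scores and a single head. Whether the paper uses additional scaling is not stated in the abstract, and the function name is hypothetical.

```python
import math


def identity_attention(points):
    """Compute softmax(X X^T) X row by row, with key = query = value = identity."""
    outputs = []
    for query in points:
        # Unscaled dot-product scores between the query and every input point.
        scores = [sum(a * b for a, b in zip(query, p)) for p in points]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(query))
        ])
    return outputs


# Two well-separated groups of points on the x-axis.
data = [[5.0, 0.0], [5.2, 0.0], [-5.0, 0.0], [-5.2, 0.0]]
for row in identity_attention(data):
    print(row)
```

In this toy input, points from the opposite group receive almost no weight, so each output stays within its own group. This is the intuition behind reading the layer as a form of in-context quantization.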
[LINK]http://arxiv.org/abs/2505.13112v3
[DATE]2025-10-27 23:23:43+08:00
[CATEGORIES]cs.LG
Psi-Sampler: Initial Particle Sampling for SMC-Based Inference-Time Reward Alignment in Score Models
[AUTHORS]Taehoon Yoon, Yunhong Min, Kyeongmin Yeo, Minhyuk Sung
[ABSTRACT]We introduce $\Psi$-Sampler, an SMC-based framework incorporating pCNL-based
initial particle sampling for effective inference-time reward alignment with a
score-based generative model. Inference-time reward alignment with score-based
generative models has recently gained significant traction, following a broader
paradigm shift from pre-training to post-training optimization. At the core of
this trend is the application of Sequential Monte Carlo (SMC) to the denoising
process. However, existing methods typically initialize particles from the
Gaussian prior, which inadequately captures reward-relevant regions and results
in reduced sampling efficiency. We demonstrate that initializing from the
reward-aware posterior significantly improves alignment performance. To enable
posterior sampling in high-dimensional latent spaces, we introduce the
preconditioned Crank-Nicolson Langevin (pCNL) algorithm, which combines
dimension-robust proposals with gradient-informed dynamics. This approach
enables efficient and scalable posterior sampling and consistently improves
performance across various reward alignment tasks, including layout-to-image
generation, quantity-aware generation, and aesthetic-preference generation, as
demonstrated in our experiments. Project Webpage:
https://psi-sampler.github.io/
[COMMENTS]NeurIPS 2025, Spotlight Presentation
[LINK]http://arxiv.org/abs/2506.01320v3
[DATE]2025-10-27 23:22:41+08:00
[CATEGORIES]cs.LG
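As a rough illustration of the dimension-robust proposal mentioned in the abstract above, the sketch below implements plain preconditioned Crank-Nicolson (pCN) in one dimension. It deliberately omits the Langevin (gradient) term that pCNL adds, and the quadratic toy reward, the step size `beta`, and all names are assumptions rather than the authors' setup.

```python
import math
import random

def pcn_sample(neg_log_weight, n_steps, beta=0.5, seed=0):
    """Plain pCN sampler for a 1-D target proportional to
    exp(-neg_log_weight(x)) * N(x; 0, 1).

    Simplification: no gradient term, so this is pCN rather than pCNL.
    """
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)  # initialize from the standard Gaussian prior
    samples = []
    for _ in range(n_steps):
        # This proposal leaves the N(0, 1) prior invariant.
        proposal = math.sqrt(1.0 - beta ** 2) * x + beta * rng.gauss(0.0, 1.0)
        # The acceptance ratio involves only the reweighting term.
        log_accept = neg_log_weight(x) - neg_log_weight(proposal)
        if rng.random() < math.exp(min(0.0, log_accept)):
            x = proposal
        samples.append(x)
    return samples

# Hypothetical reward r(x) = -(x - 2)^2 / 2, so the target is N(1, 1/2).
samples = pcn_sample(lambda x: 0.5 * (x - 2.0) ** 2, n_steps=20000)
```

Because the proposal preserves the Gaussian prior exactly, the prior density cancels in the acceptance ratio, which is what makes this family of proposals robust as the dimension grows.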
DataRater: Meta-Learned Dataset Curation
[AUTHORS]Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa M. Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeffrey Dean, Hado van Hasselt, David Silver
[ABSTRACT]The quality of foundation models depends heavily on their training data.
Consequently, great efforts have been put into dataset curation. Yet most
approaches rely on manual tuning of coarse-grained mixtures of large buckets of
data, or filtering by hand-crafted heuristics. An approach that is ultimately
more scalable (let alone more satisfying) is to \emph{learn} which data is
actually valuable for training. This type of meta-learning could allow more
sophisticated, fine-grained, and effective curation. Our proposed
\emph{DataRater} is an instance of this idea. It estimates the value of
training on any particular data point. This is done by meta-learning using
'meta-gradients', with the objective of improving training efficiency on
held-out data. In extensive experiments across a range of model scales and datasets,
we find that using our DataRater to filter data is highly effective, resulting
in significantly improved compute efficiency.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.17895v2
[DATE]2025-10-27 23:19:13+08:00
[CATEGORIES]cs.LG
A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning
[AUTHORS]Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang
[COMMENTS]Published in NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.17697v3
[DATE]2025-10-27 23:15:26+08:00
[CATEGORIES]cs.LG
Neural variational inference for cutting feedback during uncertainty propagation
[AUTHORS]Jiafang Song, Sandipan Pramanik, Abhirup Datta
[ABSTRACT]In many scientific applications, uncertainty of estimates from an earlier
(upstream) analysis needs to be propagated in subsequent (downstream) Bayesian
analysis, without feedback. Cutting feedback methods, also termed cut-Bayes,
achieve this by constructing a cut-posterior distribution that prevents
backward information flow. Cutting feedback methods like nested MCMC are
computationally challenging, while variational inference (VI) cut-Bayes methods
need two variational approximations and require access to the upstream data and
model. In this manuscript, we propose NeVI-Cut, a provably accurate and modular neural
network-based variational inference method for cutting feedback. We directly
utilize samples from the upstream analysis without requiring access to the
upstream data or model. This simultaneously preserves modularity of analysis
and reduces approximation errors by avoiding a variational approximation for
the upstream model. We then use normalizing flows to specify the conditional
variational family for the downstream parameters and estimate the conditional
cut-posterior as a variational solution of Monte Carlo average loss over all
the upstream samples. We provide theoretical guarantees on the ability of the
NeVI-Cut estimate to approximate any cut-posterior. Our results are in a fixed-data
regime and provide convergence rates of the actual variational solution,
quantifying how richness of the neural architecture and the complexity of the
target cut-posterior dictate the approximation quality. In the process, we
establish new results on uniform Kullback-Leibler approximation rates of
conditional normalizing flows. Simulation studies and two real-world analyses
illustrate how NeVI-Cut achieves significant computational gains over
traditional cutting feedback methods and is considerably more accurate than
parametric variational cut approaches.
[LINK]http://arxiv.org/abs/2510.10268v2
[DATE]2025-10-27 23:13:18+08:00
[CATEGORIES]cs.LG
AutoStreamPipe: LLM Assisted Automatic Generation of Data Stream Processing Pipelines
[AUTHORS]Abolfazl Younesi, Zahra Najafabadi Samani, Thomas Fahringer
[ABSTRACT]Data pipelines are essential in stream processing as they enable the
efficient collection, processing, and delivery of real-time data, supporting
rapid data analysis. In this paper, we present AutoStreamPipe, a novel
framework that employs Large Language Models (LLMs) to automate the design,
generation, and deployment of stream processing pipelines. AutoStreamPipe
bridges the semantic gap between high-level user intent and platform-specific
implementations across distributed stream processing systems for structured
multi-agent reasoning by integrating a Hypergraph of Thoughts (HGoT) as an
extended version of Graph of Thoughts (GoT). AutoStreamPipe combines resilient execution
strategies, advanced query analysis, and HGoT to deliver pipelines with good
accuracy. Experimental evaluations on diverse pipelines demonstrate that
AutoStreamPipe significantly reduces development time (x6.3) and error rates
(x5.19), as measured by a novel Error-Free Score (EFS), compared to LLM
code-generation methods.
[COMMENTS]Under review
[LINK]http://arxiv.org/abs/2510.23408v1
[DATE]2025-10-27 23:11:31+08:00
[CATEGORIES]cs.LG
Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations
[AUTHORS]Zaikang Lin, Sei Chang, Aaron Zweig, Minseo Kang, Elham Azizi, David A. Knowles
[ABSTRACT]Modern high-throughput biological datasets with thousands of perturbations
provide the opportunity for large-scale discovery of causal graphs that
represent the regulatory interactions between genes. Differentiable causal
graphical models have been proposed to infer a gene regulatory network (GRN)
from large-scale interventional datasets, capturing the causal gene regulatory
relationships from genetic perturbations. However, existing models are limited
in their expressivity and scalability while failing to address the dynamic
nature of biological processes such as cellular differentiation. We propose
PerturbODE, a novel framework that incorporates biologically informative neural
ordinary differential equations (neural ODEs) to model cell state trajectories
under perturbations and derive the causal GRN from the neural ODE’s parameters.
We demonstrate PerturbODE’s efficacy in trajectory prediction and GRN inference
across simulated and real over-expression datasets.
[LINK]http://arxiv.org/abs/2501.02409v4
[DATE]2025-10-27 23:06:54+08:00
[CATEGORIES]cs.LG
Informed Initialization for Bayesian Optimization and Active Learning
[AUTHORS]Carl Hvarfner, David Eriksson, Eytan Bakshy, Max Balandat
[ABSTRACT]Bayesian Optimization is a widely used method for optimizing expensive
black-box functions, relying on probabilistic surrogate models such as Gaussian
Processes. The quality of the surrogate model is crucial for good optimization
performance, especially in the few-shot setting where only a small number of
batches of points can be evaluated. In this setting, the initialization plays a
critical role in shaping the surrogate’s predictive quality and guiding
subsequent optimization. Despite this, practitioners typically rely on
(quasi-)random designs to cover the input space. However, such approaches
neglect two key factors: (a) space-filling designs may not be desirable to
reduce predictive uncertainty, and (b) efficient hyperparameter learning during
initialization is essential for high-quality prediction, which may conflict
with space-filling designs. To address these limitations, we propose
Hyperparameter-Informed Predictive Exploration (HIPE), a novel acquisition
strategy that balances predictive uncertainty reduction with hyperparameter
learning using information-theoretic principles. We derive a closed-form
expression for HIPE in the Gaussian Process setting and demonstrate its
effectiveness through extensive experiments in active learning and few-shot BO.
Our results show that HIPE outperforms standard initialization strategies in
terms of predictive accuracy, hyperparameter identification, and subsequent
optimization performance, particularly in large-batch, few-shot settings
relevant to many real-world Bayesian Optimization applications.
[COMMENTS]28 pages
[LINK]http://arxiv.org/abs/2510.23681v1
[DATE]2025-10-27 23:05:12+08:00
[CATEGORIES]cs.LG
Controllable Collision Scenario Generation via Collision Pattern Prediction
[AUTHORS]Pin-Lun Chen, Chi-Hsi Kung, Che-Han Chang, Wei-Chen Chiu, Yi-Ting Chen
[ABSTRACT]Evaluating the safety of autonomous vehicles (AVs) requires diverse,
safety-critical scenarios, with collisions being especially important yet rare
and unsafe to collect in the real world. Therefore, the community has been
focusing on generating safety-critical scenarios in simulation. However,
controlling attributes such as collision type and time-to-accident (TTA)
remains challenging. We introduce a new task called controllable collision
scenario generation, where the goal is to produce trajectories that realize a
user-specified collision type and TTA, to investigate the feasibility of
automatically generating desired collision scenarios. To support this task, we
present COLLIDE, a large-scale collision scenario dataset constructed by
transforming real-world driving logs into diverse collisions, balanced across
five representative collision types and different TTA intervals. We propose a
framework that predicts Collision Pattern, a compact and interpretable
representation that captures the spatial configuration of the ego and the
adversarial vehicles at impact, before rolling out full adversarial
trajectories. Experiments show that our approach outperforms strong baselines
in both collision rate and controllability. Furthermore, generated scenarios
consistently induce higher planner failure rates, revealing limitations of
existing planners. We demonstrate that these scenarios can be used to fine-tune
planners for improved robustness, contributing to safer AV deployment in different
collision scenarios. Project page is available at
https://submit-user.github.io/anon2025
[COMMENTS]8 pages, 3 figures
[LINK]http://arxiv.org/abs/2510.12206v2
[DATE]2025-10-27 22:53:32+08:00
[CATEGORIES]cs.LG
Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
[AUTHORS]Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
[ABSTRACT]Sharpness-aware minimization (SAM) has emerged as a highly effective
technique for improving model generalization, but its underlying principles are
not fully understood. We investigate the phenomenon known as m-sharpness,
where the performance of SAM improves monotonically as the micro-batch size for
computing perturbations decreases. In practice, the empirical m-sharpness
effect underpins the deployment of SAM in distributed training, yet a rigorous
theoretical account has remained lacking. To provide a theoretical explanation
for m-sharpness, we leverage an extended Stochastic Differential Equation (SDE)
framework and analyze the structure of stochastic gradient noise (SGN) to
characterize the dynamics of various SAM variants, including n-SAM and m-SAM.
Our findings reveal that the stochastic noise introduced during SAM
perturbations inherently induces a variance-based sharpness regularization
effect. Motivated by our theoretical insights, we introduce Reweighted SAM
(RW-SAM), which employs sharpness-weighted sampling to mimic the generalization
benefits of m-SAM while remaining parallelizable. Comprehensive experiments
validate the effectiveness of our theoretical analysis and proposed method.
[LINK]http://arxiv.org/abs/2509.18001v2
[DATE]2025-10-27 22:49:07+08:00
[CATEGORIES]cs.LG
The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation
[AUTHORS]Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov
[ABSTRACT]The application of Reinforcement Learning with Verifiable Rewards (RLVR) to
mathematical and coding domains has demonstrated significant improvements in
the reasoning and problem-solving abilities of Large Language Models. Despite
its success in single-generation problem solving, the reinforcement learning
fine-tuning process may harm the model’s exploration ability, as reflected in
decreased diversity of generations and a resulting degradation of performance
during Best-of-N sampling for large N values. In this work, we focus on
optimizing the max@k metric, a continuous generalization of pass@k. We derive
an unbiased on-policy gradient estimate for direct optimization of this metric.
Furthermore, we extend our derivations to off-policy updates, a common
element in modern RLVR algorithms that allows better sample efficiency.
Empirically, we show that our objective effectively optimizes the max@k metric in
off-policy scenarios, aligning the model with the Best-of-N inference strategy.
[LINK]http://arxiv.org/abs/2510.23393v1
[DATE]2025-10-27 22:47:30+08:00
[CATEGORIES]cs.LG
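The max@k metric in the abstract above can be made concrete with a small sketch. It estimates the expected maximum reward among k samples from n scored generations by averaging the maximum over all size-k subsets in closed form, and it reduces to the standard pass@k estimate when rewards are binary. This covers only the metric, not the paper's gradient estimators, and the function names are assumptions.

```python
from math import comb

def max_at_k(rewards, k):
    """Unbiased estimate of E[max of k samples] from n >= k i.i.d. rewards.

    The i-th smallest reward (1-indexed) is the maximum of
    comb(i - 1, k - 1) of the comb(n, k) size-k subsets.
    """
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(rewards)")
    ordered = sorted(rewards)
    # enumerate() is 0-indexed, so index i here equals (1-indexed rank) - 1.
    total = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return total / comb(n, k)

def pass_at_k(n, c, k):
    """Standard pass@k estimate with c correct samples out of n."""
    return 1.0 - comb(n - c, k) / comb(n, k)

# With binary rewards the two estimates agree.
binary_estimate = max_at_k([0, 0, 1], k=2)
pass_estimate = pass_at_k(n=3, c=1, k=2)
```

For k = 1 the estimate is the mean reward, and for k = n it is the largest observed reward, which matches the intuition behind Best-of-N sampling.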
Opinion Mining Based Entity Ranking using Fuzzy Logic Algorithmic Approach
[AUTHORS]Pratik N. Kalamkar, A. G. Phakatkar
[ABSTRACT]Opinions are central to almost all human activities and are key influencers
of our behaviors. With the growth of social networking websites and the
increasing number of e-commerce sites, a huge number of opinions is now
available on the web. Given a set of evaluative statements that contain opinions
(or sentiments) about an entity, opinion mining aims to extract the attributes and
components of the object that have been commented on in each statement and to
determine whether the comments are positive, negative, or neutral. While a lot of
research has recently been done in the field of opinion mining, some of it
dealing with ranking entities based on a review or opinion set, classifying
opinions at a finer level of granularity and then ranking entities has never been
done before. In this paper, a method for opinion mining from statements at a
deeper level of granularity is proposed. This is done using fuzzy logic
reasoning, after which entities are ranked according to this information.
[COMMENTS]8 pages, 4 figures, Conference Paper
[LINK]http://arxiv.org/abs/2510.23384v1
[DATE]2025-10-27 22:35:20+08:00
[CATEGORIES]cs.LG
Symbolic Neural Generation with Applications to Lead Discovery in Drug Design
[AUTHORS]Ashwin Srinivasan, A Baskar, Tirtharaj Dash, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee
[ABSTRACT]We investigate a relatively underexplored class of hybrid neurosymbolic
models integrating symbolic learning with neural reasoning to construct data
generators meeting formal correctness criteria. In \textit{Symbolic Neural
Generators} (SNGs), symbolic learners examine logical specifications of
feasible data from a small set of instances – sometimes just one. Each
specification in turn constrains the conditional information supplied to a
neural-based generator, which rejects any instance violating the symbolic
specification. Like other neurosymbolic approaches, SNG exploits the
complementary strengths of symbolic and neural methods. The outcome of an SNG
is a triple $(H, X, W)$, where $H$ is a symbolic description of feasible
instances constructed from data, $X$ a set of generated new instances that
satisfy the description, and $W$ an associated weight. We introduce a semantics
for such systems, based on the construction of appropriate \textit{base} and
\textit{fibre} partially-ordered sets combined into an overall partial order,
and outline a probabilistic extension relevant to practical applications. In
this extension, SNGs result from searching over a weighted partial ordering. We
implement an SNG combining a restricted form of Inductive Logic Programming
(ILP) with a large language model (LLM) and evaluate it on early-stage drug
design. Our main interest is the description and the set of potential inhibitor
molecules generated by the SNG. On benchmark problems – where drug targets are
well understood – SNG performance is statistically comparable to
state-of-the-art methods. On exploratory problems with poorly understood
targets, generated molecules exhibit binding affinities on par with leading
clinical candidates. Experts further find the symbolic specifications useful as
preliminary filters, with several generated molecules identified as viable for
synthesis and wet-lab testing.
[COMMENTS]37 pages, 15 figures; partial overlap of experimental results with
https://doi.org/10.1101/2025.02.14.634875
[LINK]http://arxiv.org/abs/2510.23379v1
[DATE]2025-10-27 22:29:22+08:00
[CATEGORIES]cs.LG
RotaTouille: Rotation Equivariant Deep Learning for Contours
[AUTHORS]Odin Hoff Gardaa, Nello Blaser
[ABSTRACT]Contours or closed planar curves are common in many domains. For example,
they appear as object boundaries in computer vision, isolines in meteorology,
and the orbits of rotating machinery. In many cases when learning from contour
data, planar rotations of the input will result in correspondingly rotated
outputs. It is therefore desirable that deep learning models be rotationally
equivariant. In addition, contours are typically represented as an ordered
sequence of edge points, where the choice of starting point is arbitrary. It is
therefore also desirable for deep learning methods to be equivariant under
cyclic shifts. We present RotaTouille, a deep learning framework for learning
from contour data that achieves both rotation and cyclic shift equivariance
through complex-valued circular convolution. We further introduce and
characterize equivariant non-linearities, coarsening layers, and global pooling
layers to obtain invariant representations for downstream tasks. Finally, we
demonstrate the effectiveness of RotaTouille through experiments in shape
classification, reconstruction, and contour regression.
[COMMENTS]19 pages, 6 figures
[LINK]http://arxiv.org/abs/2508.16359v2
[DATE]2025-10-27 22:23:31+08:00
[CATEGORIES]cs.LG
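The two equivariance properties named in the abstract above can be checked numerically with a short sketch. A contour is stored as a list of complex numbers, a planar rotation is multiplication by a unit complex number, and a change of starting point is a cyclic shift. The toy contour, the filter values, and the function names are assumptions, and this is a single linear layer rather than the RotaTouille architecture.

```python
import cmath
import math

def circular_conv(z, w):
    """Circular convolution of a complex contour z with a complex filter w."""
    n = len(z)
    return [sum(w[k] * z[(i - k) % n] for k in range(len(w))) for i in range(n)]

def rotate(z, theta):
    """Rotate every contour point by angle theta about the origin."""
    return [cmath.exp(1j * theta) * p for p in z]

def shift(z, s):
    """Cyclically shift the starting point of the contour by s positions."""
    n = len(z)
    return [z[(i - s) % n] for i in range(n)]

# Hypothetical contour: 8 points on an ellipse, stored as x + iy.
contour = [
    complex(2.0 * math.cos(2 * math.pi * t / 8), math.sin(2 * math.pi * t / 8))
    for t in range(8)
]
weights = [0.5 + 0.2j, -0.3j, 0.1 + 0j]  # assumed filter taps

rotated_then_conv = circular_conv(rotate(contour, 0.7), weights)
conv_then_rotated = rotate(circular_conv(contour, weights), 0.7)
shifted_then_conv = circular_conv(shift(contour, 2), weights)
conv_then_shifted = shift(circular_conv(contour, weights), 2)
```

Rotation equivariance follows from linearity over the complex numbers, and shift equivariance from the modular indexing, so both pairs agree up to floating-point error.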
HOPSE: Scalable Higher-Order Positional and Structural Encoder for Combinatorial Representations
[AUTHORS]Martin Carrasco, Guillermo Bernardez, Marco Montagna, Nina Miolane, Lev Telyatnikov
[ABSTRACT]While Graph Neural Networks (GNNs) have proven highly effective at modeling
relational data, pairwise connections cannot fully capture multi-way
relationships naturally present in complex real-world systems. In response to
this, Topological Deep Learning (TDL) leverages more general combinatorial
representations – such as simplicial or cellular complexes – to accommodate
higher-order interactions. Existing TDL methods often extend GNNs through
Higher-Order Message Passing (HOMP), but face critical \emph{scalability
challenges} due to \textit{(i)} a combinatorial explosion of message-passing
routes, and \textit{(ii)} significant complexity overhead from the propagation
mechanism. This work presents HOPSE (Higher-Order Positional and Structural
Encoder), an alternative method to solve tasks involving higher-order
interactions \emph{without message passing}. Instead, HOPSE breaks
\emph{arbitrary higher-order domains} into their neighborhood relationships
using a Hasse graph decomposition. This method shows that decoupling the
representation learning of neighborhood topology from that of attributes
results in lower computational complexity, casting doubt on the need for HOMP.
The experiments on molecular graph tasks and topological benchmarks show that
HOPSE matches performance on traditional TDL datasets and outperforms HOMP
methods on topological tasks, achieving up to $7\times$ speedups over
HOMP-based models, opening a new path for scalable TDL.
[LINK]http://arxiv.org/abs/2505.15405v2
[DATE]2025-10-27 22:16:21+08:00
[CATEGORIES]cs.LG
Robust Non-negative Proximal Gradient Algorithm for Inverse Problems
[AUTHORS]Hanzhang Wang, Zonglin Liu, Jingyi Xu, Chenyang Wang, Zhiwei Zhong, Qiangqiang Shen
[ABSTRACT]Proximal gradient algorithms (PGA), while foundational for inverse problems
like image reconstruction, often yield unstable convergence and suboptimal
solutions by violating the critical non-negativity constraint. We identify the
gradient descent step as the root cause of this issue, which introduces
negative values and induces high sensitivity to hyperparameters. To overcome
these limitations, we propose a novel multiplicative update proximal gradient
algorithm (SSO-PGA) with convergence guarantees, which is designed for
robustness in non-negative inverse problems. Our key innovation lies in
superseding the gradient descent step with a learnable sigmoid-based operator,
which inherently enforces non-negativity and boundedness by transforming
traditional subtractive updates into multiplicative ones. This design,
augmented by a sliding parameter for enhanced stability and convergence, not
only improves robustness but also boosts expressive capacity and noise
immunity. We further formulate a degradation model for multi-modal restoration
and derive its SSO-PGA-based optimization algorithm, which is then unfolded
into a deep network to marry the interpretability of optimization with the
power of deep learning. Extensive numerical and real-world experiments
demonstrate that our method significantly surpasses traditional PGA and other
state-of-the-art algorithms, ensuring superior performance and stability.
[LINK]http://arxiv.org/abs/2510.23362v1
[DATE]2025-10-27 22:10:25+08:00
[CATEGORIES]cs.LG
Macroeconomic Forecasting for the G7 countries under Uncertainty Shocks
[AUTHORS]Shovon Sengupta, Sunny Kumar Singh, Tanujit Chakraborty
[ABSTRACT]Accurate macroeconomic forecasting has become harder amid geopolitical
disruptions, policy reversals, and volatile financial markets. Conventional
vector autoregressions (VARs) overfit in high-dimensional settings, while
threshold VARs struggle with time-varying interdependencies and complex
parameter structures. We address these limitations by extending the Sims-Zha
Bayesian VAR with exogenous variables (SZBVARx) to incorporate domain-informed
shrinkage and four newspaper-based uncertainty shocks: economic policy
uncertainty, geopolitical risk, US equity market volatility, and US monetary
policy uncertainty. The framework improves structural interpretability,
mitigates dimensionality, and imposes empirically guided regularization. Using
G7 data, we study spillovers from uncertainty shocks to five core variables
(unemployment, real broad effective exchange rates, short-term rates, oil
prices, and CPI inflation), combining wavelet coherence (time-frequency
dynamics) with nonlinear local projections (state-dependent impulse responses).
Out-of-sample results at 12- and 24-month horizons show that SZBVARx outperforms
14 benchmarks, including classical VARs and leading machine learning models, as
confirmed by Murphy difference diagrams, multivariate Diebold-Mariano tests,
and Giacomini-White predictability tests. Credible Bayesian prediction
intervals deliver robust uncertainty quantification for scenario analysis and
risk management. The proposed SZBVARx offers G7 policymakers a transparent,
well-calibrated tool for modern macroeconomic forecasting under pervasive
uncertainty.
[LINK]http://arxiv.org/abs/2510.23347v1
[DATE]2025-10-27 22:01:41+08:00
[CATEGORIES]cs.LG
Assessing the Completeness of Traffic Scenario Categories for Automated Highway Driving Functions via Cluster-based Analysis
[AUTHORS]Niklas Roßberg, Marion Neumeier, Sinan Hasirlioglu, Mohamed Essayed Bouzouraa, Michael Botsch
[ABSTRACT]The ability to operate safely in increasingly complex traffic scenarios is a
fundamental requirement for Automated Driving Systems (ADS). Ensuring the safe
release of ADS functions necessitates a precise understanding of the occurring
traffic scenarios. To support this objective, this work introduces a pipeline
for traffic scenario clustering and the analysis of scenario category
completeness. The Clustering Vector Quantized-Variational Autoencoder
(CVQ-VAE) is employed for the clustering of highway traffic scenarios and
utilized to create various catalogs with differing numbers of traffic scenario
categories. Subsequently, the impact of the number of categories on the
completeness considerations of the traffic scenario categories is analyzed. The
results show improved clustering performance compared to previous work.
The trade-off between cluster quality and the amount of required data to
maintain completeness is discussed based on the publicly available highD
dataset.
[LINK]http://arxiv.org/abs/2506.02599v2
[DATE]2025-10-27 21:50:39+08:00
[CATEGORIES]cs.LG
The First Star-by-star $N$-body/Hydrodynamics Simulation of Our Galaxy Coupling with a Surrogate Model
[AUTHORS]Keiya Hirashima, Michiko S. Fujii, Takayuki R. Saitoh, Naoto Harada, Kentaro Nomura, Kohji Yoshikawa, Yutaka Hirai, Tetsuro Asano, Kana Moriwaki, Masaki Iwasawa, Takashi Okamoto, Junichiro Makino
[ABSTRACT]A major goal of computational astrophysics is to simulate the Milky Way
Galaxy with sufficient resolution down to individual stars. However, the
scaling fails due to some small-scale, short-timescale phenomena, such as
supernova explosions. We have developed a novel integration scheme of
$N$-body/hydrodynamics simulations working with machine learning. This approach
bypasses the short timesteps caused by supernova explosions using a surrogate
model, thereby improving scalability. With this method, we reached 300 billion
particles using 148,900 nodes, equivalent to 7,147,200 CPU cores, breaking
through the billion-particle barrier currently faced by state-of-the-art
simulations. This resolution allows us to perform the first star-by-star galaxy
simulation, which resolves individual stars in the Milky Way Galaxy. The
performance scales over $10^4$ CPU cores, an upper limit in the current
state-of-the-art simulations using both A64FX and X86-64 processors and NVIDIA
CUDA GPUs.
[COMMENTS]12 pages, 7 figures, 7 tables, IEEE/ACM Supercomputing Conference
(SC25)
[LINK]http://arxiv.org/abs/2510.23330v1
[DATE]2025-10-27 21:45:55+08:00
[CATEGORIES]cs.LG
GRAD: Real-Time Gated Recurrent Anomaly Detection in Autonomous Vehicle Sensors Using Reinforced EMA and Multi-Stage Sliding Window Techniques
[AUTHORS]Mohammad Hossein Jafari Naeimi, Ali Norouzi, Athena Abdi
[ABSTRACT]This paper introduces GRAD, a real-time anomaly detection method for
autonomous vehicle sensors that integrates statistical analysis and deep
learning to ensure the reliability of sensor data. The proposed approach
combines the Reinforced Exponential Moving Average (REMA), which adapts
smoothing factors and thresholding for outlier detection, with the Multi-Stage
Sliding Window (MS-SW) technique for capturing both short- and long-term
patterns. These features are processed using a lightweight Gated Recurrent Unit
(GRU) model, which detects and classifies anomalies based on bias types, while
a recovery module restores damaged sensor data to ensure continuous system
operation. GRAD has a lightweight architecture consisting of two GRU layers
with a limited number of neurons, which makes it appropriate for real-time
applications while maintaining high detection accuracy. The GRAD framework
achieved remarkable performance in anomaly detection and classification. The
model demonstrated an overall F1-score of 97.6% for abnormal data and 99.4% for
normal data, signifying its high accuracy in distinguishing between normal and
anomalous sensor data. Regarding the anomaly classification, GRAD successfully
categorized different anomaly types with high precision, enabling the recovery
module to accurately restore damaged sensor data. Relative to analogous
studies, GRAD surpasses current models by attaining a balance between elevated
detection accuracy and diminished computational expense. These results
demonstrate GRAD’s potential as a reliable and efficient solution for real-time
anomaly detection in autonomous vehicle systems, guaranteeing safe vehicle
operation with minimal computational overhead.
[LINK]http://arxiv.org/abs/2510.23327v1
[DATE]2025-10-27 21:44:15+08:00
[CATEGORIES]cs.LG
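The statistical half of the approach in the abstract above combines an exponential moving average with thresholding. The sketch below shows only a plain, fixed-parameter version of that idea: the smoothing factor `alpha`, the threshold multiplier `k`, the warm-up length, and the sensor readings are all assumptions, and the adaptive (reinforced) adjustment of these parameters described in the abstract is not modeled.

```python
def ema_outliers(values, alpha=0.2, k=3.0, warmup=5):
    """Flag readings that deviate from an exponential moving average by
    more than k running standard deviations.

    Fixed alpha and k are assumptions; REMA adapts them, which is omitted.
    """
    mean = values[0]
    var = 0.0
    flags = [False]
    for t, x in enumerate(values[1:], start=1):
        deviation = x - mean
        is_outlier = t >= warmup and abs(deviation) > k * var ** 0.5
        flags.append(is_outlier)
        if not is_outlier:
            # Update running statistics only with readings judged normal,
            # so a spike does not inflate the threshold for later readings.
            mean += alpha * deviation
            var = (1.0 - alpha) * (var + alpha * deviation ** 2)
    return flags

# Hypothetical sensor trace with one injected spike at index 7.
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 25.0, 10.0, 9.9]
flags = ema_outliers(readings)
```

Skipping the statistics update on flagged readings is one possible design choice; it keeps the baseline stable after an isolated spike but would need revisiting for sustained bias faults.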
Multitask Multimodal Self-Supervised Learning for Medical Images
[AUTHORS]Cristian Simionescu
[ABSTRACT]This thesis works to address a pivotal challenge in medical image analysis:
the reliance on extensive labeled datasets, which are often limited due to the
need for expert annotation and constrained by privacy and legal issues. By
focusing on the development of self-supervised learning techniques and domain
adaptation methods, this research aims to circumvent these limitations,
presenting a novel approach to enhance the utility and efficacy of deep
learning in medical imaging.
Central to this thesis is the development of the Medformer, an innovative
neural network architecture designed for multitask learning and deep domain
adaptation. This model is adept at pre-training on diverse medical image
datasets, handling varying sizes and modalities, and is equipped with a dynamic
input-output adaptation mechanism. This enables efficient processing and
integration of a wide range of medical image types, from 2D X-rays to complex
3D MRIs, thus mitigating the dependency on large labeled datasets.
Further, the thesis explores the current state of self-supervised learning in
medical imaging. It introduces novel pretext tasks that are capable of
extracting meaningful information from unlabeled data, significantly advancing
the model’s interpretative abilities. This approach is validated through
rigorous experimentation, including the use of the MedMNIST dataset,
demonstrating the model’s proficiency in learning generalized features
applicable to various downstream tasks.
In summary, this thesis contributes to the advancement of medical image
analysis by offering a scalable, adaptable framework that reduces reliance on
labeled data. It paves the way for more accurate, efficient diagnostic tools in
healthcare, signifying a major step forward in the application of deep learning
in medical imaging.
[LINK]http://arxiv.org/abs/2510.23325v1
[DATE]2025-10-27 21:42:16+08:00
[CATEGORIES]cs.LG
Predicting symbolic ODEs from multiple trajectories
[AUTHORS]Yakup Emre Şahin, Niki Kilbertus, Sören Becker
[ABSTRACT]We introduce MIO, a transformer-based model for inferring symbolic ordinary
differential equations (ODEs) from multiple observed trajectories of a
dynamical system. By combining multiple instance learning with
transformer-based symbolic regression, the model effectively leverages repeated
observations of the same system to learn more generalizable representations of
the underlying dynamics. We investigate different instance aggregation
strategies and show that even simple mean aggregation can substantially boost
performance. MIO is evaluated on systems ranging from one to four dimensions
and under varying noise levels, consistently outperforming existing baselines.
[COMMENTS]Published at: 39th Conference on Neural Information Processing
Systems (NeurIPS 2025) Workshop: Machine Learning and the Physical Sciences
[LINK]http://arxiv.org/abs/2510.23295v1
[DATE]2025-10-27 21:03:29+08:00
[CATEGORIES]cs.LG
Learning from Frustration: Torsor CNNs on Graphs
[AUTHORS]Daiyuan Li, Shreya Arya, Robert Ghrist
[ABSTRACT]Most equivariant neural networks rely on a single global symmetry, limiting
their use in domains where symmetries are instead local. We introduce Torsor
CNNs, a framework for learning on graphs with local symmetries encoded as edge
potentials – group-valued transformations between neighboring coordinate
frames. We establish that this geometric construction is fundamentally
equivalent to the classical group synchronization problem, yielding: (1) a
Torsor Convolutional Layer that is provably equivariant to local changes in
coordinate frames, and (2) the frustration loss – a standalone geometric
regularizer that encourages locally equivariant representations when added to
any NN’s training objective. The Torsor CNN framework unifies and generalizes
several architectures – including classical CNNs and Gauge CNNs on manifolds
– by operating on arbitrary graphs without requiring a global coordinate
system or smooth manifold structure. We establish the mathematical foundations
of this framework and demonstrate its applicability to multi-view 3D
recognition, where relative camera poses naturally define the required edge
potentials.
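The abstract does not give the exact form of the frustration loss. The sketch below uses a generic group-synchronization discrepancy as a stand-in: for each edge (i, j) with potential g_ij, it penalizes the squared distance between the feature at node i and the feature at node j transported by g_ij. The choice of planar rotations as the group, the function names, and the example graph are assumptions made for illustration.

```python
import numpy as np

def rotation(theta: float) -> np.ndarray:
    """2x2 rotation matrix, used here as a simple group-valued edge potential."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def frustration(features, edges):
    """Mean squared mismatch between x_i and g_ij @ x_j over all edges (i, j, g_ij)."""
    total = 0.0
    for i, j, g_ij in edges:
        diff = features[i] - g_ij @ features[j]
        total += float(diff @ diff)
    return total / len(edges)

# Hypothetical local frames on a triangle graph; potentials are consistent by construction.
angles = {0: 0.3, 1: 1.1, 2: -0.7}
v = np.array([1.0, 0.0])
features = {i: rotation(a) @ v for i, a in angles.items()}
edges = [(i, j, rotation(angles[i] - angles[j])) for i, j in [(0, 1), (1, 2), (2, 0)]]

aligned = frustration(features, edges)          # features agree with the potentials
perturbed = dict(features)
perturbed[1] = rotation(0.5) @ features[1]      # break agreement at node 1
misaligned = frustration(perturbed, edges)
```

In this toy setting the loss vanishes when the node features are compatible with the edge potentials and grows when one node is rotated out of agreement, which is the behavior a regularizer of this kind is meant to reward.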
[COMMENTS]19 pages (main text + appendices), 1 figure
[LINK]http://arxiv.org/abs/2510.23288v1
[DATE]2025-10-27 20:59:45+08:00
[CATEGORIES]cs.LG
A Novel Framework for Multi-Modal Protein Representation Learning
[AUTHORS]Runjie Zheng, Zhen Wang, Anjie Qiao, Jiancong Xie, Jiahua Rao, Yuedong Yang
[ABSTRACT]Accurate protein function prediction requires integrating heterogeneous
intrinsic signals (e.g., sequence and structure) with noisy extrinsic contexts
(e.g., protein-protein interactions and GO term annotations). However, two key
challenges hinder effective fusion: (i) cross-modal distributional mismatch
among embeddings produced by pre-trained intrinsic encoders, and (ii) noisy
relational graphs of extrinsic data that degrade GNN-based information
aggregation. We propose Diffused and Aligned Multi-modal Protein Embedding
(DAMPE), a unified framework that addresses these through two core mechanisms.
First, we propose Optimal Transport (OT)-based representation alignment that
establishes correspondence between intrinsic embedding spaces of different
modalities, effectively mitigating cross-modal heterogeneity. Second, we
develop a Conditional Graph Generation (CGG)-based information fusion method,
where a condition encoder fuses the aligned intrinsic embeddings to provide
informative cues for graph reconstruction. Meanwhile, our theoretical analysis
implies that the CGG objective drives this condition encoder to absorb
graph-aware knowledge into its produced protein representations. Empirically,
DAMPE outperforms or matches state-of-the-art methods such as DPFunc on
standard GO benchmarks, achieving AUPR gains of 0.002-0.013 pp and Fmax gains
of 0.004-0.007 pp. Ablation studies further show that OT-based alignment
contributes 0.043-0.064 pp AUPR, while CGG-based fusion adds 0.005-0.111 pp
Fmax. Overall, DAMPE offers a scalable and theoretically grounded approach for
robust multi-modal protein representation learning, substantially enhancing
protein function prediction.
[COMMENTS]35 pages, 5 figures, 4 tables
[LINK]http://arxiv.org/abs/2510.23273v1
[DATE]2025-10-27 20:33:01+08:00
[CATEGORIES]cs.LG
PAHQ: Accelerating Automated Circuit Discovery through Mixed-Precision Inference Optimization
[AUTHORS]Xinhai Wang, Shu Yang, Liangyu Wang, Lin Zhang, Huanyi Xie, Lijie Hu, Di Wang
[ABSTRACT]Circuit discovery, which involves identifying sparse and task-relevant
subnetworks in pre-trained language models, is a cornerstone of mechanistic
interpretability. Automated Circuit Discovery (ACDC) has emerged as a pivotal
methodology in circuit discovery, but its application to large language models
is severely limited by computational inefficiency and prohibitively high memory
requirements. Although several accelerated approaches have been proposed, they
primarily rely on linear approximations to ACDC, which significantly
compromises analytical faithfulness. Our proposed method for accelerating
automated circuit discovery, Per Attention Head Quantization (PAHQ), takes a
fundamentally different approach by optimizing the efficiency of each
individual patching operation. PAHQ leverages a fundamental alignment between
activation patching and mixed-precision quantization (MPQ): interpretability
analysis through patching essentially performs targeted ablation studies.
Therefore, we can maintain high precision exclusively for investigated
components while safely reducing precision elsewhere in the network.
PAHQ-accelerated ACDC reduces runtime by up to 80% and memory consumption by
up to 30% compared to unaccelerated ACDC while maintaining faithfulness.
Importantly, our method readily integrates with existing edge-based circuit
discovery techniques by modifying the attention computation mechanism. This
training-free approach provides a practical and novel pathway for accelerating
mechanistic interpretability methods. Our code is available at
https://github.com/626619403/PAHQ.
[LINK]http://arxiv.org/abs/2510.23264v1
[DATE]2025-10-27 20:24:14+08:00
[CATEGORIES]cs.LG
Toward Interpretable Evaluation Measures for Time Series Segmentation
[AUTHORS]Félix Chavelli, Paul Boniol, Michaël Thomazo
[ABSTRACT]Time series segmentation is a fundamental task in analyzing temporal data
across various domains, from human activity recognition to energy monitoring.
While numerous state-of-the-art methods have been developed to tackle this
problem, the evaluation of their performance remains critically limited.
Existing measures predominantly focus on change point accuracy or rely on
point-based measures such as Adjusted Rand Index (ARI), which fail to capture
the quality of the detected segments, ignore the nature of errors, and offer
limited interpretability. In this paper, we address these shortcomings by
introducing two novel evaluation measures: WARI (Weighted Adjusted Rand Index),
which accounts for the position of segmentation errors, and SMS (State Matching
Score), a fine-grained measure that identifies and scores four fundamental
types of segmentation errors while allowing error-specific weighting. We
empirically validate WARI and SMS on synthetic and real-world benchmarks,
showing that they not only provide a more accurate assessment of segmentation
quality but also uncover insights, such as error provenance and type, that are
inaccessible with traditional measures.
[LINK]http://arxiv.org/abs/2510.23261v1
[DATE]2025-10-27 20:23:37+08:00
[CATEGORIES]cs.LG
Fast Rate Bounds for Multi-Task and Meta-Learning with Different Sample Sizes
[AUTHORS]Hossein Zakerinia, Christoph H. Lampert
[ABSTRACT]We present new fast-rate PAC-Bayesian generalization bounds for multi-task
and meta-learning in the unbalanced setting, i.e. when the tasks have training
sets of different sizes, as is typically the case in real-world scenarios.
Previously, only standard-rate bounds were known for this situation, while
fast-rate bounds were limited to the setting where all training sets are of
equal size. Our new bounds are numerically computable as well as interpretable,
and we demonstrate their flexibility in handling a number of cases where they
give stronger guarantees than previous bounds. Besides the bounds themselves,
we also make conceptual contributions: we demonstrate that the unbalanced
multi-task setting has different statistical properties than the balanced
situation, specifically that proofs from the balanced situation do not carry
over to the unbalanced setting. Additionally, we shed light on the fact that
the unbalanced situation allows two meaningful definitions of multi-task risk,
depending on whether all tasks should be considered equally important or if
sample-rich tasks should receive more weight than sample-poor ones.
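The two notions of multi-task risk described above can be written out as follows. The notation is an assumption introduced here, not the paper's: n tasks, task i has m_i training samples and risk R_i, and M is the total sample count.

```latex
% Task-centric risk: every task counts equally.
\mathcal{R}_{\mathrm{task}} = \frac{1}{n} \sum_{i=1}^{n} R_i
% Sample-centric risk: tasks are weighted by their share of the data.
\qquad
\mathcal{R}_{\mathrm{sample}} = \sum_{i=1}^{n} \frac{m_i}{M} R_i,
\qquad M = \sum_{i=1}^{n} m_i .
% Worked example with n = 2, m_1 = 900, m_2 = 100, R_1 = 0.1, R_2 = 0.5:
% \mathcal{R}_{\mathrm{task}}   = (0.1 + 0.5) / 2 = 0.3
% \mathcal{R}_{\mathrm{sample}} = 0.9 \cdot 0.1 + 0.1 \cdot 0.5 = 0.14
```

The two quantities coincide when all m_i are equal, which is why the distinction only arises in the unbalanced setting.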
[COMMENTS]Conference on Neural Information Processing Systems (NeurIPS), 2025
[LINK]http://arxiv.org/abs/2505.15496v2
[DATE]2025-10-27 20:22:36+08:00
[CATEGORIES]cs.LG
GCAO: Group-driven Clustering via Gravitational Attraction and Optimization
[AUTHORS]Qi Li, Jun Wang
[ABSTRACT]Traditional clustering algorithms often struggle with high-dimensional and
non-uniformly distributed data, where low-density boundary samples are easily
disturbed by neighboring clusters, leading to unstable and distorted clustering
results. To address this issue, we propose a Group-driven Clustering via
Gravitational Attraction and Optimization (GCAO) algorithm. GCAO introduces a
group-level optimization mechanism that aggregates low-density boundary points
into collaboratively moving groups, replacing the traditional point-based
contraction process. By combining local density estimation with neighborhood
topology, GCAO constructs effective gravitational interactions between groups
and their surroundings, enhancing boundary clarity and structural consistency.
Using groups as basic motion units, a gravitational contraction strategy
ensures globally stable and directionally consistent convergence. Experiments
on multiple high-dimensional datasets demonstrate that GCAO outperforms 11
representative clustering methods, achieving average improvements of 37.13%,
52.08%, 44.98%, and 38.81% in NMI, ARI, Homogeneity, and ACC, respectively,
while maintaining competitive efficiency and scalability. These results
highlight GCAO’s superiority in preserving cluster integrity, enhancing
boundary separability, and ensuring robust performance on complex data
distributions.
[LINK]http://arxiv.org/abs/2510.23259v1
[DATE]2025-10-27 20:22:24+08:00
[CATEGORIES]cs.LG
Deep Active Inference with Diffusion Policy and Multiple Timescale World Model for Real-World Exploration and Navigation
[AUTHORS]Riko Yokozawa, Kentaro Fujii, Yuta Nomura, Shingo Murata
[ABSTRACT]Autonomous robotic navigation in real-world environments requires exploration
to acquire environmental information as well as goal-directed navigation in
order to reach specified targets. Active inference (AIF) based on the
free-energy principle provides a unified framework for these behaviors by
minimizing the expected free energy (EFE), thereby combining epistemic and
extrinsic values. To realize this practically, we propose a deep AIF framework
that integrates a diffusion policy as the policy model and a multiple timescale
recurrent state-space model (MTRSSM) as the world model. The diffusion policy
generates diverse candidate actions while the MTRSSM predicts their
long-horizon consequences through latent imagination, enabling action selection
that minimizes EFE. Real-world navigation experiments demonstrated that our
framework achieved higher success rates and fewer collisions compared with the
baselines, particularly in exploration-demanding scenarios. These results
highlight how AIF based on EFE minimization can unify exploration and
goal-directed navigation in real-world robotic settings.
[COMMENTS]Preprint version
[LINK]http://arxiv.org/abs/2510.23258v1
[DATE]2025-10-27 20:21:33+08:00
[CATEGORIES]cs.LG
Provable test-time adaptivity and distributional robustness of in-context learning
[AUTHORS]Tianyi Ma, Tengyao Wang, Richard J. Samworth
[ABSTRACT]We study in-context learning problems where a Transformer is pretrained on
tasks drawn from a mixture distribution $\pi=\sum_{\alpha\in\mathcal{A}}
\lambda_{\alpha} \pi_{\alpha}$, called the pretraining prior, in which each
mixture component $\pi_{\alpha}$ is a distribution on tasks of a specific
difficulty level indexed by $\alpha$. Our goal is to understand the performance
of the pretrained Transformer when evaluated on a different test distribution
$\mu$, consisting of tasks of fixed difficulty $\beta\in\mathcal{A}$, and with
potential distribution shift relative to $\pi_\beta$, subject to the
chi-squared divergence $\chi^2(\mu,\pi_{\beta})$ being at most $\kappa$. In
particular, we consider nonparametric regression problems with random
smoothness, and multi-index models with random smoothness as well as random
effective dimension. We prove that a large Transformer pretrained on sufficient
data achieves the optimal rate of convergence corresponding to the difficulty
level $\beta$, uniformly over test distributions $\mu$ in the chi-squared
divergence ball. Thus, the pretrained Transformer is able to achieve faster
rates of convergence on easier tasks and is robust to distribution shift at
test time. Finally, we prove that even if an estimator had access to the test
distribution $\mu$, the convergence rate of its expected risk over $\mu$ could
not be faster than that of our pretrained Transformers, thereby providing a
more appropriate optimality guarantee than minimax lower bounds.
[COMMENTS]44 pages
[LINK]http://arxiv.org/abs/2510.23254v1
[DATE]2025-10-27 20:16:49+08:00
[CATEGORIES]cs.LG
Local Learning for Covariate Selection in Nonparametric Causal Effect Estimation with Latent Variables
[AUTHORS]Zheng Li, Xichen Guo, Feng Xie, Yan Zeng, Hao Zhang, Zhi Geng
[ABSTRACT]Estimating causal effects from nonexperimental data is a fundamental problem
in many fields of science. A key component of this task is selecting an
appropriate set of covariates for confounding adjustment to avoid bias. Most
existing methods for covariate selection assume the absence of latent
variables and rely on learning the global network structure among variables.
However, identifying the global structure can be unnecessary and inefficient,
especially when our primary interest lies in estimating the effect of a
treatment variable on an outcome variable. To address this limitation, we
propose a novel local learning approach for covariate selection in
nonparametric causal effect estimation, which accounts for the presence of
latent variables. Our approach leverages testable independence and dependence
relationships among observed variables to identify a valid adjustment set for a
target causal relationship, ensuring both soundness and completeness under
standard assumptions. We validate the effectiveness of our algorithm through
extensive experiments on both synthetic and real-world data.
[LINK]http://arxiv.org/abs/2411.16315v7
[DATE]2025-10-27 19:57:07+08:00
[CATEGORIES]cs.LG
Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation
[AUTHORS]Stefan M. Fischer, Johannes Kiechle, Laura Daza, Lina Felsner, Richard Osuala, Daniel M. Lang, Karim Lekadir, Jan C. Peeken, Julia A. Schnabel
[ABSTRACT]In this work, we introduce Progressive Growing of Patch Size, an automatic
curriculum learning approach for 3D medical image segmentation. Our approach
progressively increases the patch size during model training, resulting in an
improved class balance for smaller patch sizes and accelerated convergence of
the training process. We evaluate our curriculum approach in two settings: a
resource-efficient mode and a performance mode, both regarding Dice score
performance and computational costs across 15 diverse and popular 3D medical
image segmentation tasks. The resource-efficient mode matches the Dice score
performance of the conventional constant patch size sampling baseline with a
notable reduction in training time to only 44%. The performance mode improves
upon constant patch size segmentation results, achieving a statistically
significant relative mean performance gain of 1.28% in Dice Score. Remarkably,
across all 15 tasks, our proposed performance mode manages to surpass the
constant patch size baseline in Dice Score performance, while simultaneously
reducing training time to only 89%. The benefits are particularly pronounced
for highly imbalanced tasks such as lesion segmentation tasks. Rigorous
experiments demonstrate that our performance mode not only improves mean
segmentation performance but also reduces performance variance, yielding more
trustworthy model comparison. Furthermore, our findings reveal that the
proposed curriculum sampling is not tied to a specific architecture but
represents a broadly applicable strategy that consistently boosts performance
across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In
summary, we show that this simple yet elegant transformation on input data
substantially improves both Dice Score performance and training runtime, while
being compatible across diverse segmentation backbones.
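The abstract does not state the schedule by which the patch size grows. As a hedged illustration only, the sketch below grows the patch edge length linearly over training and snaps it down to a multiple of a stride; the function name and every default value (32, 128, 16) are hypothetical.

```python
def patch_size_schedule(step: int, total_steps: int,
                        min_size: int = 32, max_size: int = 128,
                        stride: int = 16) -> int:
    """Patch edge length at a given training step, growing linearly from min_size to max_size."""
    if total_steps <= 1:
        return max_size
    # Clamp the step into the valid range, then interpolate with integer arithmetic.
    step = min(max(step, 0), total_steps - 1)
    raw = min_size + (max_size - min_size) * step // (total_steps - 1)
    # Snap down to a multiple of the stride so the size stays network-compatible.
    snapped = (raw // stride) * stride
    return max(min_size, min(snapped, max_size))

# Hypothetical 7-stage curriculum.
sizes = [patch_size_schedule(s, 7) for s in range(7)]
```

Small patches early in training are cheaper per step and, as the abstract notes, yield a better class balance; the final stage uses the full patch size of the constant-size baseline.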
[COMMENTS]Journal Extension of “Progressive Growing of Patch Size:
Resource-Efficient Curriculum Learning for Dense Prediction Tasks”
(MICCAI2024) submitted to MedIA
[LINK]http://arxiv.org/abs/2510.23241v1
[DATE]2025-10-27 19:55:12+08:00
[CATEGORIES]cs.LG
Robust Iterative Learning Hidden Quantum Markov Models
[AUTHORS]Ning Ning
[ABSTRACT]Hidden Quantum Markov Models (HQMMs) extend classical Hidden Markov Models to
the quantum domain, offering a powerful probabilistic framework for modeling
sequential data with quantum coherence. However, existing HQMM learning
algorithms are highly sensitive to data corruption and lack mechanisms to
ensure robustness under adversarial perturbations. In this work, we introduce
the Adversarially Corrupted HQMM (AC-HQMM), which formalizes robustness
analysis by allowing a controlled fraction of observation sequences to be
adversarially corrupted. To learn AC-HQMMs, we propose the Robust Iterative
Learning Algorithm (RILA), a derivative-free method that integrates a Remove
Corrupted Rows by Entropy Filtering (RCR-EF) module with an iterative
stochastic resampling procedure for physically valid Kraus operator updates.
RILA incorporates L1-penalized likelihood objectives to enhance stability,
resist overfitting, and remain effective under non-differentiable conditions.
Across multiple HQMM and HMM benchmarks, RILA demonstrates superior convergence
stability, corruption resilience, and preservation of physical validity
compared to existing algorithms, establishing a principled and efficient
approach for robust quantum sequential learning.
[COMMENTS]Quantum Computing, Bayesian Inference, Spatiotemporal Analysis,
Robust Learning
[LINK]http://arxiv.org/abs/2510.23237v1
[DATE]2025-10-27 19:48:44+08:00
[CATEGORIES]cs.LG
Grassmanian Interpolation of Low-Pass Graph Filters: Theory and Applications
[AUTHORS]Anton Savostianov, Michael T. Schaub, Benjamin Stamm
[ABSTRACT]Low-pass graph filters are fundamental for signal processing on graphs and
other non-Euclidean domains. However, the computation of such filters for
parametric graph families can be prohibitively expensive, as computation of the
corresponding low-frequency subspaces requires the repeated solution of an
eigenvalue problem. We suggest a novel algorithm for low-pass graph filter
interpolation based on Riemannian interpolation in normal coordinates on the
Grassmann manifold. We derive an error bound estimate for the subspace
interpolation and suggest two possible applications for induced parametric
graph families. First, we argue that the temporal evolution of the node
features may be translated to the evolving graph topology via a similarity
correction to adjust the homophily degree of the network. Second, we suggest a
dot product graph family induced by a given static graph, which allows inferring
an improved message passing scheme for node classification facilitated by the
filter interpolation.
[COMMENTS]13 pages
[LINK]http://arxiv.org/abs/2510.23235v1
[DATE]2025-10-27 19:40:14+08:00
[CATEGORIES]cs.LG
Secure and Confidential Certificates of Online Fairness
[AUTHORS]Olive Franzese, Ali Shahin Shamsabadi, Carter Luck, Hamed Haddadi
[ABSTRACT]The black-box service model enables ML service providers to serve clients
while keeping their intellectual property and client data confidential.
Confidentiality is critical for delivering ML services legally and responsibly,
but makes it difficult for outside parties to verify important model properties
such as fairness. Existing methods that assess model fairness confidentially
lack either (i) reliability because they certify fairness with respect to a
static set of data, and therefore fail to guarantee fairness in the presence of
distribution shift or service provider malfeasance; and/or (ii) scalability due
to the computational overhead of confidentiality-preserving cryptographic
primitives. We address these problems by introducing online fairness
certificates, which verify that a model is fair with respect to data received
by the service provider online during deployment. We then present OATH, a
deployably efficient and scalable zero-knowledge proof protocol for
confidential online group fairness certification. OATH exploits statistical
properties of group fairness via a cut-and-choose style protocol, enabling
scalability improvements over baselines.
[LINK]http://arxiv.org/abs/2410.02777v2
[DATE]2025-10-27 19:37:07+08:00
[CATEGORIES]cs.LG
Revisiting Agnostic Boosting
[AUTHORS]Arthur da Cunha, Mikael Møller Høgsgaard, Andrea Paudice, Yuxin Sun
[ABSTRACT]Boosting is a key method in statistical learning, allowing for converting
weak learners into strong ones. While well studied in the realizable case, the
statistical properties of weak-to-strong learning remain less understood in the
agnostic setting, where there are no assumptions on the distribution of the
labels. In this work, we propose a new agnostic boosting algorithm with
substantially improved sample complexity compared to prior works under very
general assumptions. Our approach is based on a reduction to the realizable
case, followed by a margin-based filtering of high-quality hypotheses.
Furthermore, we show a nearly-matching lower bound, settling the sample
complexity of agnostic boosting up to logarithmic factors.
[COMMENTS]Camera-ready version: NeurIPS 2025
[LINK]http://arxiv.org/abs/2503.09384v2
[DATE]2025-10-27 19:25:33+08:00
[CATEGORIES]cs.LG
Windsock is Dancing: Adaptive Multimodal Retrieval-Augmented Generation
[AUTHORS]Shu Zhao, Tianyi Shen, Nilesh Ahuja, Omesh Tickoo, Vijaykrishnan Narayanan
[ABSTRACT]Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a promising
method to generate factual and up-to-date responses of Multimodal Large
Language Models (MLLMs) by incorporating non-parametric knowledge from external
knowledge bases. However, existing MRAG approaches suffer from static retrieval
strategies, inflexible modality selection, and suboptimal utilization of
retrieved information, leading to three critical challenges: determining when
to retrieve, what modality to incorporate, and how to utilize retrieved
information effectively. To address these challenges, we introduce Windsock, a
query-dependent module making decisions on retrieval necessity and modality
selection, effectively reducing computational overhead and improving response
quality. Additionally, we propose Dynamic Noise-Resistance (DANCE) Instruction
Tuning, an adaptive training strategy that enhances MLLMs’ ability to utilize
retrieved information while maintaining robustness against noise. Moreover, we
adopt a self-assessment approach leveraging knowledge within MLLMs to convert
question-answering datasets to MRAG training datasets. Extensive experiments
demonstrate that our proposed method significantly improves the generation
quality by 17.07% while reducing retrieval times by 8.95%.
[COMMENTS]Accepted at NeurIPS 2025 UniReps Workshop
[LINK]http://arxiv.org/abs/2510.22694v1
[DATE]2025-10-26 22:36:16+08:00
[CATEGORIES]cs.CL
SALSA: Single-pass Autoregressive LLM Structured Classification
[AUTHORS]Ruslan Berdichevsky, Shai Nahum-Gefen, Elad Ben Zaken
[ABSTRACT]Despite their impressive generalization capabilities, instruction-tuned Large
Language Models often underperform on text classification benchmarks. We
introduce SALSA, a coherent pipeline that combines structured prompting,
class-to-token mapping, and parameter-efficient fine-tuning, thereby avoiding
cold-start training. Each class label is mapped to a distinct output token, and
prompts are constructed to elicit a single-token response. During inference,
the model’s output is projected only onto the logits of the relevant class
tokens, enabling efficient and accurate classification in a single forward
pass. SALSA achieves state-of-the-art results across diverse benchmarks,
demonstrating its robustness and scalability for LLM-based classification
applications.
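The inference step described above, restricting the model's next-token output to the logits of the class tokens, can be sketched as follows. This is a minimal illustration on a plain NumPy logit vector; the function name, the token ids, and the toy vocabulary size are assumptions, and no real tokenizer or model is involved.

```python
import numpy as np

def classify_from_logits(logits: np.ndarray, class_token_ids: dict):
    """Pick a class by comparing only the logits of the tokens mapped to class labels."""
    labels = list(class_token_ids)
    ids = np.array([class_token_ids[label] for label in labels])
    restricted = logits[ids]
    # Softmax over the class tokens only; all other vocabulary entries are ignored.
    restricted = restricted - restricted.max()
    probs = np.exp(restricted) / np.exp(restricted).sum()
    return labels[int(np.argmax(probs))], dict(zip(labels, probs))

# Hypothetical 10-token vocabulary: token 7 has the largest logit but is not a class token.
logits = np.zeros(10)
logits[7] = 9.0
logits[5] = 1.5
logits[2] = 0.5
label, probs = classify_from_logits(logits, {"negative": 2, "positive": 5})
```

Even though an unrelated token dominates the full distribution, the prediction depends only on the relative logits of the class tokens, so one forward pass is enough to produce a normalized distribution over labels.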
[LINK]http://arxiv.org/abs/2510.22691v1
[DATE]2025-10-26 22:28:42+08:00
[CATEGORIES]cs.CL cs.LG
Antidistillation Sampling
[AUTHORS]Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter
[ABSTRACT]Frontier models that generate extended reasoning traces inadvertently produce
rich token sequences that can facilitate model distillation. Recognizing this
vulnerability, model owners may seek sampling strategies that limit the
effectiveness of distillation without compromising model performance.
Antidistillation sampling provides exactly this capability. By strategically
modifying a model’s next-token probability distribution, antidistillation
sampling poisons reasoning traces, rendering them significantly less effective
for distillation while preserving the model’s practical utility. For further
details, see https://antidistillation.com.
[LINK]http://arxiv.org/abs/2504.13146v5
[DATE]2025-10-26 22:23:30+08:00
[CATEGORIES]cs.CL
Rule-Based Explanations for Retrieval-Augmented LLM Systems
[AUTHORS]Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta
[ABSTRACT]If-then rules are widely used to explain machine learning models; e.g., “if
employed = no, then loan application = rejected.” We present the first proposal
to apply rules to explain the emerging class of large language models (LLMs)
with retrieval-augmented generation (RAG). Since RAG enables LLM systems to
incorporate retrieved information sources at inference time, rules linking the
presence or absence of sources can explain output provenance; e.g., “if a Times
Higher Education ranking article is retrieved, then the LLM ranks Oxford
first.” To generate such rules, a brute force approach would probe the LLM with
all source combinations and check if the presence or absence of any sources
leads to the same output. We propose optimizations to speed up rule generation,
inspired by Apriori-like pruning from frequent itemset mining but redefined
within the scope of our novel problem. We conclude with qualitative and
quantitative experiments demonstrating our solutions’ value and efficiency.
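The brute-force baseline described above can be sketched as follows. The sketch probes a stand-in answer function with every subset of sources and keeps a rule "if source s is retrieved, then the output is o" only when every probed subset containing s yields the same output. The toy answer function, the source names, and the restriction to single-source presence rules are assumptions; the paper's rule language and pruning optimizations are not reproduced here.

```python
from itertools import combinations

def probe_all(sources, answer):
    """Query the answer function with every subset of sources (2^n probes)."""
    results = {}
    for r in range(len(sources) + 1):
        for combo in combinations(sources, r):
            results[frozenset(combo)] = answer(frozenset(combo))
    return results

def presence_rules(sources, results):
    """Keep 'if s is present then output o' when all subsets containing s agree on o."""
    rules = {}
    for s in sources:
        outputs = {out for combo, out in results.items() if s in combo}
        if len(outputs) == 1:
            rules[s] = outputs.pop()
    return rules

def toy_answer(retrieved):
    """Hypothetical stand-in for an LLM call: one source fully determines the answer."""
    return "Oxford" if "ranking_article" in retrieved else "MIT"

sources = ["ranking_article", "blog_post", "wiki_page"]
results = probe_all(sources, toy_answer)
rules = presence_rules(sources, results)
```

The number of probes doubles with every additional source, which is the cost that pruning strategies for rule generation aim to avoid.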
[LINK]http://arxiv.org/abs/2510.22689v1
[DATE]2025-10-26 22:22:07+08:00
[CATEGORIES]cs.CL
Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
[AUTHORS]Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
[ABSTRACT]The emergence of large language models (LLMs) has opened new opportunities
for creating dynamic non-player characters (NPCs) in gaming environments,
enabling both functional task execution and persona-consistent dialogue
generation. In this paper, we (Tu_Character_lab) report our participation in
the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which
evaluates agents across three tracks: task-oriented dialogue, context-aware
dialogue, and their integration. Our approach combines two complementary
strategies: (i) lightweight prompting techniques in the API track, including a
Deflanderization prompting method to suppress excessive role-play and improve
task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging
Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Our
best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on
Task 3 (GPU track).
[LINK]http://arxiv.org/abs/2510.13586v3
[DATE]2025-10-26 22:03:51+08:00
[CATEGORIES]cs.CL
RoboSVG: A Unified Framework for Interactive SVG Generation with Multi-modal Guidance
[AUTHORS]Jiuniu Wang, Gongjie Zhang, Quanhao Qian, Junlong Gao, Deli Zhao, Ran Xu
[ABSTRACT]Scalable Vector Graphics (SVGs) are fundamental to digital design and robot
control, encoding not only visual structure but also motion paths in
interactive drawings. In this work, we introduce RoboSVG, a unified multimodal
framework for generating interactive SVGs guided by textual, visual, and
numerical signals. Given an input query, the RoboSVG model first produces
multimodal guidance, then synthesizes candidate SVGs through dedicated
generation modules, and finally refines them under numerical guidance to yield
high-quality outputs. To support this framework, we construct RoboDraw, a
large-scale dataset of one million examples, each pairing an SVG generation
condition (e.g., text, image, and partial SVG) with its corresponding
ground-truth SVG code. The RoboDraw dataset enables systematic study of four tasks,
including basic generation (Text-to-SVG, Image-to-SVG) and interactive
generation (PartialSVG-to-SVG, PartialImage-to-SVG). Extensive experiments
demonstrate that RoboSVG achieves superior query compliance and visual fidelity
across tasks, establishing a new state of the art in versatile SVG generation.
The dataset and source code of this project will be publicly available soon.
[COMMENTS]15 pages, 5 figures
[LINK]http://arxiv.org/abs/2510.22684v1
[DATE]2025-10-26 21:57:08+08:00
[CATEGORIES]cs.CL
Do Stop Me Now: Detecting Boilerplate Responses with a Single Iteration
[AUTHORS]Yuval Kainan, Shaked Zychlinski
[ABSTRACT]Large Language Models (LLMs) often expend significant computational resources
generating boilerplate responses, such as refusals, simple acknowledgements and
casual greetings, which adds unnecessary cost and latency. To address this
inefficiency, we propose a simple yet highly effective method for detecting
such responses after only a single generation step. We demonstrate that the
log-probability distribution of the first generated token serves as a powerful
signal for classifying the nature of the entire subsequent response. Our
experiments, conducted across a diverse range of small, large, and
reasoning-specialized models, show that the first-token log-probability vectors
form distinctly separable clusters for different response types. Using a
lightweight k-NN classifier, we achieve high accuracy in predicting whether a
response will be a substantive answer or a form of boilerplate response,
including user-specified refusals. The primary implication is a practical,
computationally trivial technique, optimizing LLM inference by enabling early
termination or redirection to a smaller model, thereby yielding significant
savings in computational cost. This work presents a direct path toward more
efficient and sustainable LLM deployment.
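The classification step can be sketched with a small k-NN over first-token log-probability vectors. Real vectors would have one entry per vocabulary token and come from an actual model; here they are replaced by synthetic 4-dimensional vectors drawn around two hypothetical cluster centers, and all names and values are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x: np.ndarray, train_y: list, query: np.ndarray, k: int = 3) -> str:
    """Majority vote among the k training vectors closest to the query (Euclidean distance)."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Synthetic stand-ins for first-token log-probability vectors of two response types.
rng = np.random.default_rng(0)
substantive_center = np.array([-0.5, -6.0, -7.0, -8.0])
boilerplate_center = np.array([-7.0, -0.4, -6.5, -8.0])
train_x = np.vstack([
    substantive_center + 0.1 * rng.standard_normal((20, 4)),
    boilerplate_center + 0.1 * rng.standard_normal((20, 4)),
])
train_y = ["substantive"] * 20 + ["boilerplate"] * 20

query = boilerplate_center + 0.05
prediction = knn_predict(train_x, train_y, query)
```

If the prediction is boilerplate, generation could be stopped or routed to a smaller model after this single step, which is the source of the savings the abstract describes.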
[COMMENTS]13 pages, 4 figures
[LINK]http://arxiv.org/abs/2510.22679v1
[DATE]2025-10-26 21:43:56+08:00
[CATEGORIES]cs.CL
Conjugate Relation Modeling for Few-Shot Knowledge Graph Completion
[AUTHORS]Zilong Wang, Qingtian Zeng, Hua Duan, Cheng Cheng, Minghao Zou, Ziyang Wang
[ABSTRACT]Few-shot Knowledge Graph Completion (FKGC) infers missing triples from
limited support samples, tackling long-tail distribution challenges. Existing
methods, however, struggle to capture complex relational patterns and mitigate
data sparsity. To address these challenges, we propose a novel FKGC framework
for conjugate relation modeling (CR-FKGC). Specifically, it employs a
neighborhood aggregation encoder to integrate higher-order neighbor
information, a conjugate relation learner combining an implicit conditional
diffusion relation module with a stable relation module to capture stable
semantics and uncertainty offsets, and a manifold conjugate decoder for
efficient evaluation and inference of missing triples in manifold space.
Experiments on three benchmarks demonstrate that our method achieves superior
performance over state-of-the-art methods.
[LINK]http://arxiv.org/abs/2510.22656v1
[DATE]2025-10-26 20:38:23+08:00
[CATEGORIES]cs.CL
Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal
[AUTHORS]Ambalika Guha, Sajal Saha, Debanjan Ballav, Soumi Mitra, Hritwick Chakraborty
[ABSTRACT]Preserving linguistic diversity is necessary as every language offers a
distinct perspective on the world. There have been numerous global initiatives
to preserve endangered languages through documentation. This paper is a part of
a project which aims to develop a trilingual (Toto-Bangla-English) language
learning application to digitally archive and promote the endangered Toto
language of West Bengal, India. This application, designed for both native Toto
speakers and non-native learners, aims to revitalize the language by ensuring
accessibility and usability through Unicode script integration and a structured
language corpus. The research includes detailed linguistic documentation
collected via fieldwork, followed by the creation of a morpheme-tagged,
trilingual corpus used to train a Small Language Model (SLM) and a
Transformer-based translation engine. The analysis covers inflectional
morphology such as person-number-gender agreement, tense-aspect-mood
distinctions, and case marking, alongside derivational strategies that reflect
word-class changes. Script standardization and digital literacy tools were also
developed to enhance script usage. The study offers a sustainable model for
preserving endangered languages by integrating traditional linguistic
methodology with AI. This bridge between linguistic research and technological
innovation highlights the value of interdisciplinary collaboration for
community-based language revitalization.
[LINK]http://arxiv.org/abs/2510.22629v1
[DATE]2025-10-26 19:22:46+08:00
[CATEGORIES]cs.CL
PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion
[AUTHORS]Morteza Alikhani, Mohammadtaha Bagherifard, Erfan Zinvandi, Mehran Sarmadi
[ABSTRACT]We introduce PerCoR (Persian Commonsense Reasoning), the first large-scale
Persian benchmark for commonsense reasoning. PerCoR contains 106K
multiple-choice sentence-completion problems drawn from more than forty news,
cultural, and other web sources. We introduce a novel conjunction-based
segmentation strategy to generate coherent sentence-completion pairs, enabling
broad topical and structural diversity. To create challenging distractors, we
propose DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and
Adversarial Filtering), a generation-free adversarial filtering method that
selects distractors from the pool of gold continuations while maximising model
confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the
highest performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%).
The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both
the dataset’s difficulty and the remaining performance gap in Persian
commonsense reasoning. We further show that DRESS-AF transfers to the English
HellaSwag benchmark, increasing its difficulty without hurting human
solvability. The dataset is available at
https://huggingface.co/datasets/MCINext/PerCoR.
[COMMENTS]20 pages, 17 figures, Accepted to IJCNLP-AACL 2025 (Main Conference)
[LINK]http://arxiv.org/abs/2510.22616v1
[DATE]2025-10-26 18:25:02+08:00
[CATEGORIES]cs.CL
ATOM: AdapTive and OptiMized dynamic temporal knowledge graph construction using LLMs
[AUTHORS]Yassir Lairgi, Ludovic Moncla, Khalid Benabdeslem, Rémy Cazabet, Pierre Cléau
[ABSTRACT]In today’s rapidly expanding data landscape, knowledge extraction from
unstructured text is vital for real-time analytics, temporal inference, and
dynamic memory frameworks. However, traditional static knowledge graph (KG)
construction often overlooks the dynamic and time-sensitive nature of
real-world data, limiting adaptability to continuous changes. Moreover, recent
zero- or few-shot approaches that avoid domain-specific fine-tuning or reliance
on prebuilt ontologies often suffer from instability across multiple runs, as
well as incomplete coverage of key facts. To address these challenges, we
introduce ATOM (AdapTive and OptiMized), a few-shot and scalable approach that
builds and continuously updates Temporal Knowledge Graphs (TKGs) from
unstructured texts. ATOM splits input documents into minimal, self-contained
“atomic” facts, improving extraction exhaustivity and stability. Then, it
constructs atomic TKGs from these facts while employing dual-time modeling
that distinguishes when information is observed from when it is valid. The
resulting atomic TKGs are subsequently merged in parallel. Empirical
evaluations demonstrate that ATOM achieves ~18% higher exhaustivity, ~17%
better stability, and over 90% latency reduction compared to baseline methods,
demonstrating a strong scalability potential for dynamic TKG construction.
[LINK]http://arxiv.org/abs/2510.22590v1
[DATE]2025-10-26 17:10:26+08:00
[CATEGORIES]cs.CL
UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models
[AUTHORS]Wenming Tu, Guanrou Yang, Ruiqi Yan, Wenxi Chen, Ziyang Ma, Yipeng Kang, Kai Yu, Xie Chen, Zilong Zheng
[COMMENTS]23 pages, 4 figures
[LINK]http://arxiv.org/abs/2510.22588v1
[DATE]2025-10-26 17:06:55+08:00
[CATEGORIES]cs.CL
A Closed-Loop Personalized Learning Agent Integrating Neural Cognitive Diagnosis, Bounded-Ability Adaptive Testing, and LLM-Driven Feedback
[AUTHORS]Zhifeng Wang, Xinyue Zheng, Chunyan Zeng
[ABSTRACT]As information technology advances, education is moving from
one-size-fits-all instruction toward personalized learning. However, most
methods handle modeling, item selection, and feedback in isolation rather than
as a closed loop. This leads to coarse or opaque student models,
assumption-bound adaptivity that ignores diagnostic posteriors, and generic,
non-actionable feedback. To address these limitations, this paper presents an
end-to-end personalized learning agent, EduLoop-Agent, which integrates a
Neural Cognitive Diagnosis model (NCD), a Bounded-Ability Estimation
Computerized Adaptive Testing strategy (BECAT), and large language models
(LLMs). The NCD module provides fine-grained estimates of students’ mastery at
the knowledge-point level; BECAT dynamically selects subsequent items to
maximize relevance and learning efficiency; and LLMs convert diagnostic signals
into structured, actionable feedback. Together, these components form a
closed-loop framework of “Diagnosis–Recommendation–Feedback.” Experiments
on the ASSISTments dataset show that the NCD module achieves strong performance
on response prediction while yielding interpretable mastery assessments. The
adaptive recommendation strategy improves item relevance and personalization,
and the LLM-based feedback offers targeted study guidance aligned with
identified weaknesses. Overall, the results indicate that the proposed design
is effective and practically deployable, providing a feasible pathway to
generating individualized learning trajectories in intelligent education.
[COMMENTS]8 pages, 6 figures
[LINK]http://arxiv.org/abs/2510.22559v1
[DATE]2025-10-26 15:32:31+08:00
[CATEGORIES]cs.CL
SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
[AUTHORS]Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, Shilong Wang
[ABSTRACT]The growing memory footprint of the Key-Value (KV) cache poses a severe
scalability bottleneck for long-context Large Language Model (LLM) inference.
While KV cache eviction has emerged as an effective solution by discarding less
critical tokens, existing token-, block-, and sentence-level compression
methods struggle to balance semantic coherence and memory efficiency. To this
end, we introduce SABlock, a semantic-aware KV cache eviction
framework with adaptive block sizes. Specifically,
SABlock first performs semantic segmentation to align compression boundaries
with linguistic structures, then applies segment-guided token scoring to refine
token importance estimation. Finally, for each segment, a budget-driven search
strategy adaptively determines the optimal block size that preserves semantic
integrity while improving compression efficiency under a given cache budget.
Extensive experiments on long-context benchmarks demonstrate that SABlock
consistently outperforms state-of-the-art baselines under the same memory
budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9%
retrieval accuracy with only 96 KV entries, nearly matching the performance of
the full-cache baseline that retains up to 8K entries. Under a fixed cache
budget of 1,024, SABlock further reduces peak memory usage by 46.28% and
achieves up to 9.5x faster decoding on a 128K context length.
[LINK]http://arxiv.org/abs/2510.22556v1
[DATE]2025-10-26 15:17:10+08:00
[CATEGORIES]cs.CL
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
[AUTHORS]Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Hailei Gong, Zewen Ye, Shengjie Ma, Jianping Zhang
[ABSTRACT]Recent studies show that Large Language Models (LLMs) achieve strong
reasoning capabilities through supervised fine-tuning or reinforcement
learning. However, a key approach, the Process Reward Model (PRM), suffers from
reward hacking, making it unreliable in identifying the best intermediate step.
In addition, the cost of annotating reasoning processes for reward modeling is
high, making large-scale collection of high-quality data challenging. To
address this, we propose a novel reward model approach called the Hierarchical
Reward Model (HRM), which evaluates both individual and consecutive reasoning
steps at both fine-grained and coarse-grained levels. HRM excels at assessing
multi-step reasoning coherence, especially when flawed steps are later
corrected through self-reflection. To further reduce the cost of generating
training data, we introduce a lightweight and effective data augmentation
strategy called Hierarchical Node Compression (HNC), which merges two
consecutive reasoning steps into one within the tree structure. By applying HNC
to MCTS-generated reasoning trajectories, we enhance the diversity and
robustness of HRM training data while introducing controlled noise with minimal
computational overhead. Empirical results on the PRM800K dataset show that HRM,
together with HNC, provides more stable and reliable evaluations than PRM.
Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets
demonstrate HRM’s strong generalization and robustness across a variety of
reasoning tasks.
[LINK]http://arxiv.org/abs/2503.13551v4
[DATE]2025-10-26 14:47:24+08:00
[CATEGORIES]cs.CL
MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations
[AUTHORS]Genglin Liu, Vivian Le, Salman Rahman, Elisa Kreiss, Marzyeh Ghassemi, Saadia Gabriel
[COMMENTS]Accepted into EMNLP 2025 Main Conference, Oral Presentation
[LINK]http://arxiv.org/abs/2504.07830v3
[DATE]2025-10-26 14:27:44+08:00
[CATEGORIES]cs.CL
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought
[AUTHORS]Zihui Cheng, Qiguang Chen, Xiao Xu, Jiaqi Wang, Weiyun Wang, Hao Fei, Yidong Wang, Alex Jinpeng Wang, Zhi Chen, Wanxiang Che, Libo Qin
[ABSTRACT]Large Vision-Language Models (LVLMs) have achieved significant success in
multimodal tasks, with multimodal chain-of-thought (MCoT) further enhancing
performance and interpretability. Recent MCoT methods fall into two categories:
(i) Textual-MCoT (T-MCoT), which takes multimodal input and produces textual
output; and (ii) Interleaved-MCoT (I-MCoT), which generates interleaved
image-text outputs. Despite advances in both approaches, the mechanisms driving
these improvements are not fully understood. To fill this gap, we first reveal
that MCoT boosts LVLMs by incorporating visual thoughts, which convey image
information to the reasoning process regardless of the MCoT format, depending
only on clarity and conciseness of expression. Furthermore, to explore visual
thoughts systematically, we define four distinct forms of visual thought
expressions and analyze them comprehensively. Our findings demonstrate that
these forms differ in clarity and conciseness, yielding varying levels of MCoT
improvement. Additionally, we explore the internal nature of visual thoughts,
finding that visual thoughts serve as intermediaries between the input image
and reasoning to deeper transformer layers, enabling more advanced visual
information transmission. We hope that the visual thoughts can inspire further
breakthroughs for future MCoT research.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.15510v2
[DATE]2025-10-26 14:24:15+08:00
[CATEGORIES]cs.CL
LooGLE v2: Are LLMs Ready for Real World Long Dependency Challenges?
[AUTHORS]Ziyuan He, Yuxuan Wang, Jiaqi Li, Kexin Liang, Muhan Zhang
[ABSTRACT]Large language models (LLMs) have recently been equipped with increasingly
extended context windows, yet their long-context understanding capabilities over long
dependency tasks remain fundamentally limited and underexplored. This gap is
especially significant in many real-world long-context applications that were
rarely benchmarked. In this paper, we introduce LooGLE v2, a novel benchmark
designed to evaluate LLMs’ long context ability in real-world applications and
scenarios. Our benchmark consists of automatically collected real-world long
texts, ranging from 16k to 2M tokens, encompassing domains in law, finance,
game and code. Accordingly, we carefully design 10 types of domain-specific
long-dependency tasks and generate 1,934 QA instances of varying diversity
and complexity in a scalable data curation pipeline for further practical
needs. We conduct a comprehensive assessment of 6 locally deployed and 4
API-based LLMs. The evaluation results show that even the best-performing model
achieves only a 59.2% overall score on our benchmark. Despite the extensive
context windows, popular LLMs are only capable of understanding a much shorter
length of context than they claim, revealing significant limitations in
their ability to handle real-world tasks with long dependencies and
highlighting substantial room for model improvement in practical long-context
understanding.
[COMMENTS]NeurIPS 2025 Datasets and Benchmarks Track
[LINK]http://arxiv.org/abs/2510.22548v1
[DATE]2025-10-26 14:14:19+08:00
[CATEGORIES]cs.CL
LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing
[AUTHORS]Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
[ABSTRACT]Large Language Models often contain factually incorrect or outdated
knowledge, giving rise to model editing methods for precise knowledge updates.
However, current mainstream locate-then-edit approaches exhibit a progressive
performance decline during sequential editing, due to inadequate mechanisms for
long-term knowledge preservation. To tackle this, we model sequential
editing as a constrained stochastic programming problem. Given the challenges posed by
the cumulative preservation error constraint and the gradually revealed editing
tasks, LyapLock is proposed. It integrates queuing theory and Lyapunov
optimization to decompose the long-term constrained program into tractable
stepwise subproblems for efficient solving. This is the first model editing
framework with rigorous theoretical guarantees, achieving asymptotically optimal
editing performance while meeting the constraints of long-term knowledge
preservation. Experimental results show that our framework scales sequential
editing capacity to over 10,000 edits while stabilizing general capabilities
and boosting average editing efficacy by 11.89% over SOTA baselines.
Furthermore, it can be leveraged to enhance the performance of baseline
methods. Our code is released on https://github.com/caskcsg/LyapLock.
[COMMENTS]EMNLP 2025 main
[LINK]http://arxiv.org/abs/2505.15702v2
[DATE]2025-10-26 13:46:25+08:00
[CATEGORIES]cs.CL
Text to Trust: Evaluating Fine-Tuning and LoRA Trade-offs in Language Models for Unfair Terms of Service Detection
[AUTHORS]Noshitha Padma Pratyusha Juttu, Sahithi Singireddy, Sravani Gona, Sujal Timilsina
[ABSTRACT]Large Language Models (LLMs) have transformed text understanding, yet their
adaptation to specialized legal domains remains constrained by the cost of full
fine-tuning. This study provides a systematic evaluation of fine-tuning,
parameter-efficient adaptation (LoRA, QLoRA), and zero-shot prompting
strategies for unfair clause detection in Terms of Service (ToS) documents, a
key application in legal NLP. We fine-tune BERT and DistilBERT, apply 4-bit
Low-Rank Adaptation (LoRA) to models such as TinyLlama, LLaMA 3B/7B, and
SaulLM, and evaluate GPT-4o and O-versions in zero-shot settings. Experiments
on the CLAUDETTE-ToS benchmark and the Multilingual Scraper Corpus show that
full fine-tuning achieves the strongest precision-recall balance, while
LoRA-based models provide competitive recall with up to 3x lower memory cost.
These findings highlight practical design trade-offs for efficient and
domain-adapted LLMs, contributing open baselines for fine-tuning research in
legal text processing.
[COMMENTS]6 pages, including figures and tables. All experiments are
reproducible. Code and fine-tuned models are publicly available on: GitHub:
(https://github.com/Stimils02/UnfairTOSAgreementsDetection) and Hugging Face:
(https://huggingface.co/Noshitha98)
[LINK]http://arxiv.org/abs/2510.22531v1
[DATE]2025-10-26 12:46:06+08:00
[CATEGORIES]cs.CL cs.LG
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression
[AUTHORS]Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Chen Chen, Lei Chen, Xianzhi Yu, Wulong Liu, Jianye Hao, Mingxuan Yuan, Bin Li
[ABSTRACT]With the development of large language models (LLMs), efficient inference
through Key-Value (KV) cache compression has attracted considerable attention,
especially for long-context generation. To compress the KV cache, recent
methods identify critical KV tokens through static modeling of attention
scores. However, these methods often struggle to accurately determine critical
tokens as they neglect the temporal patterns in attention scores, resulting in
a noticeable degradation in LLM performance. To address this challenge, we
propose AttentionPredictor, which is the first learning-based method to
directly predict attention patterns for KV cache compression and critical token
identification. Specifically, AttentionPredictor learns a lightweight, unified
convolution model to dynamically capture spatiotemporal patterns and predict
the next-token attention scores. An appealing feature of AttentionPredictor is
that it accurately predicts the attention score and shares the unified
prediction model, which consumes negligible memory, among all transformer
layers. Moreover, we propose a cross-token critical cache prefetching framework
that hides the token estimation time overhead to accelerate the decoding stage.
By retaining most of the attention information, AttentionPredictor achieves
13x KV cache compression and 5.6x speedup in a cache offloading
scenario with comparable LLM performance, significantly outperforming
state-of-the-art methods. The code is available at
https://github.com/MIRALab-USTC/LLM-AttentionPredictor.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.04077v3
[DATE]2025-10-26 12:25:10+08:00
[CATEGORIES]cs.CL cs.LG
Scalable Oversight via Partitioned Human Supervision
[AUTHORS]Ren Yin, Takashi Ishida, Masashi Sugiyama
[ABSTRACT]As artificial intelligence (AI) systems approach and surpass expert human
performance across a broad range of tasks, obtaining high-quality human
supervision for evaluation and training becomes increasingly challenging. Our
focus is on tasks that require deep knowledge and skills of multiple domains.
Unfortunately, even the best human experts are knowledgeable only in a single
narrow area, and will not be able to evaluate the correctness of advanced AI
systems on such superhuman tasks. However, based on their narrow expertise,
humans may provide a weak signal, i.e., a complementary label indicating an
option that is incorrect. For example, a cardiologist could state that “this is
not related to cardiology,” even if they cannot identify the true disease.
Based on this weak signal, we propose a scalable oversight framework that
enables us to evaluate frontier AI systems without the need to prepare the
ground truth. We derive an unbiased estimator of top-1 accuracy from
complementary labels and quantify how many complementary labels are needed to
match the variance of ordinary labels. We further introduce two estimators to
combine scarce ordinary labels with abundant complementary labels. We provide
finite-sample deviation guarantees for both complementary-only and the mixed
estimators. Empirically, we show that we can evaluate the output of large
language models without the ground truth, if we have complementary labels. We
further show that we can train an AI system with such weak signals: we show how
we can design an agentic AI system automatically that can perform better with
this partitioned human supervision. Our code is available at
https://github.com/R-Yin-217/Scalable-Oversight-via-Human-Partitioned-Supervision.
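The idea of estimating top-1 accuracy from complementary labels can be illustrated with a minimal sketch. It assumes that each complementary label is drawn uniformly at random from the K - 1 incorrect options, an assumption made here for illustration that may differ from the paper's exact setup. Under it, a wrong prediction coincides with the complementary label with probability 1/(K - 1) and a correct prediction never does, so accuracy = 1 - (K - 1) * P(prediction = complementary label). The function name and signature are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence


def accuracy_from_complementary_labels(
    preds: Sequence[int],
    comp_labels: Sequence[int],
    num_options: int,
) -> float:
    """Estimate top-1 accuracy without ground-truth labels.

    Assumption (illustrative): each complementary label is sampled
    uniformly from the num_options - 1 incorrect options.
    """
    if len(preds) != len(comp_labels) or len(preds) == 0:
        raise ValueError("preds and comp_labels must be non-empty and equal in length")
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    # Fraction of items where the model picked the option known to be wrong.
    hits = sum(p == c for p, c in zip(preds, comp_labels))
    hit_rate = hits / len(preds)
    # Invert P(pred == comp) = (1 - accuracy) / (K - 1).
    return 1.0 - (num_options - 1) * hit_rate


if __name__ == "__main__":
    print(accuracy_from_complementary_labels([0, 1, 2, 3], [1, 1, 0, 2], 4))
```

On small samples the estimate can fall below 0, since it is unbiased rather than clipped to [0, 1].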
[LINK]http://arxiv.org/abs/2510.22500v1
[DATE]2025-10-26 10:42:03+08:00
[CATEGORIES]cs.LG cs.CL
A Sociophonetic Analysis of Racial Bias in Commercial ASR Systems Using the Pacific Northwest English Corpus
[AUTHORS]Michael Scott, Siyu Liang, Alicia Wassink, Gina-Anne Levow
[ABSTRACT]This paper presents a systematic evaluation of racial bias in four major
commercial automatic speech recognition (ASR) systems using the Pacific
Northwest English (PNWE) corpus. We analyze transcription accuracy across
speakers from four ethnic backgrounds (African American, Caucasian American,
ChicanX, and Yakama) and examine how sociophonetic variation contributes to
differential system performance. We introduce a heuristically-determined
Phonetic Error Rate (PER) metric that links recognition errors to specific
linguistically motivated variables derived from sociophonetic annotation. Our
analysis of eleven sociophonetic features reveals that vowel quality variation,
particularly resistance to the low-back merger and pre-nasal merger patterns,
is systematically associated with differential error rates across ethnic
groups, with the most pronounced effects for African American speakers across
all evaluated systems. These findings demonstrate that acoustic modeling of
dialectal phonetic variation, rather than lexical or syntactic factors, remains
a primary source of bias in commercial ASR systems. The study establishes the
PNWE corpus as a valuable resource for bias evaluation in speech technologies
and provides actionable guidance for improving ASR performance through targeted
representation of sociophonetic diversity in training data.
[LINK]http://arxiv.org/abs/2510.22495v1
[DATE]2025-10-26 10:19:40+08:00
[CATEGORIES]cs.CL
The Limits of Data Scaling: Sub-token Utilization and Acoustic Saturation in Multilingual ASR
[AUTHORS]Siyu Liang, Nicolas Ballier, Gina-Anne Levow, Richard Wright
[ABSTRACT]How much audio is needed to fully observe a multilingual ASR model’s learned
sub-token inventory across languages, and does data disparity in multilingual
pre-training affect how these tokens are utilized during inference? We address
this question by analyzing Whisper’s decoding behavior during inference across
49 languages. By logging decoding candidate sub-tokens and tracking their
cumulative discovery over time, we study the utilization pattern of the model’s
sub-token space. Results show that the total number of discovered tokens
remains largely independent of a language’s pre-training hours, indicating that
data disparity does not strongly influence lexical diversity in the model’s
hypothesis space. Sub-token discovery rates follow a consistent exponential
saturation pattern across languages, suggesting a stable time window after
which additional audio yields minimal new sub-token activation. We refer to
this convergence threshold as acoustic saturation time (AST). Further analyses
of rank-frequency distributions reveal Zipf-like patterns better modeled by a
Zipf-Mandelbrot law, and mean sub-token length shows a positive correlation
with resource level. Additionally, these metrics show more favorable patterns
for languages in the Latin script than those in scripts such as Cyrillic, CJK,
and Semitic. Together, our study suggests that sub-token utilization during
multilingual ASR inference is constrained more by the statistical, typological,
and orthographic structure of the speech than by training data scale, providing
an empirical basis for more equitable corpus construction and cross-lingual
evaluation.
[LINK]http://arxiv.org/abs/2510.22492v1
[DATE]2025-10-26 10:13:26+08:00
[CATEGORIES]cs.CL
Frustratingly Easy Task-aware Pruning for Large Language Models
[AUTHORS]Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song
[ABSTRACT]Pruning provides a practical way to reduce the resources required to run
large language models (LLMs), retaining their capabilities while
controlling their cost for training and inference. Research on LLM pruning
often ranks the importance of LLM parameters using their magnitudes and
calibration-data activations and removes (or masks) the less important ones,
accordingly reducing LLMs’ size. However, these approaches primarily focus on
preserving the LLM’s ability to generate fluent sentences, while neglecting
performance on specific domains and tasks. In this paper, we propose a simple
yet effective pruning approach for LLMs that preserves task-specific
capabilities while shrinking their parameter space. We first analyze how
conventional pruning minimizes loss perturbation under general-domain
calibration and extend this formulation by incorporating task-specific feature
distributions into the importance computation of existing pruning algorithms.
Thus, our framework computes separate importance scores using both general and
task-specific calibration data, partitions parameters into shared and exclusive
groups based on activation-norm differences, and then fuses their scores to
guide the pruning process. This design enables our method to integrate
seamlessly with various foundation pruning techniques and preserve the LLM’s
specialized abilities under compression. Experiments on widely used benchmarks
demonstrate that our approach is effective and consistently outperforms the
baselines with identical pruning ratios and different settings.
[COMMENTS]8 pages, 3 figures
[LINK]http://arxiv.org/abs/2510.22489v1
[DATE]2025-10-26 10:09:22+08:00
[CATEGORIES]cs.CL cs.LG
Gated Integration of Low-Rank Adaptation for Continual Learning of Large Language Models
[AUTHORS]Yan-Shuo Liang, Jia-Rui Chen, Wu-Jun Li
[ABSTRACT]Continual learning (CL), which requires the model to learn multiple tasks
sequentially, is crucial for large language models (LLMs). Recently, low-rank
adaptation (LoRA), one of the most representative parameter-efficient
fine-tuning (PEFT) methods, has gained increasing attention in CL of LLMs.
However, most existing CL methods based on LoRA typically expand a new LoRA
branch to learn each new task and force the new and old LoRA branches to
influence old tasks equally, potentially leading to forgetting. In this work,
we propose a new method, called gated integration of low-rank adaptation
(GainLoRA), for CL of LLMs. GainLoRA expands a new LoRA branch for each new
task and introduces gating modules to integrate the new and old LoRA branches.
Furthermore, GainLoRA leverages the new gating module to minimize the influence
from the new LoRA branch to old tasks, effectively mitigating forgetting and
improving the model’s overall performance. Experimental results on CL
benchmarks demonstrate that GainLoRA outperforms existing state-of-the-art
methods.
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.15424v2
[DATE]2025-10-26 10:08:11+08:00
[CATEGORIES]cs.CL
Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation
[AUTHORS]Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari
[COMMENTS]NeurIPS 2025 Workshop
[LINK]http://arxiv.org/abs/2510.09947v2
[DATE]2025-10-26 09:32:06+08:00
[CATEGORIES]cs.CL cs.LG
Language Model Guided Reinforcement Learning in Quantitative Trading
[AUTHORS]Adam Darmanin, Vince Vella
[ABSTRACT]Algorithmic trading requires short-term tactical decisions consistent with
long-term financial objectives. Reinforcement Learning (RL) has been applied to
such problems, but adoption is limited by myopic behaviour and opaque policies.
Large Language Models (LLMs) offer complementary strategic reasoning and
multi-modal signal interpretation when guided by well-structured prompts. This
paper proposes a hybrid framework in which LLMs generate high-level trading
strategies to guide RL agents. We evaluate (i) the economic rationale of
LLM-generated strategies through expert review, and (ii) the performance of
LLM-guided agents against unguided RL baselines using Sharpe Ratio (SR) and
Maximum Drawdown (MDD). Empirical results indicate that LLM guidance improves
both return and risk metrics relative to standard RL.
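The two evaluation metrics named above have standard textbook definitions, sketched below for reference. This is a generic illustration, not the authors' evaluation code: the function names, the sample-standard-deviation convention, and the optional annualization factor are assumptions. Sharpe Ratio is the mean excess return divided by its standard deviation; Maximum Drawdown is the largest peak-to-trough decline of an equity curve, expressed as a fraction of the peak.

```python
import math
import statistics
from typing import Sequence


def sharpe_ratio(
    returns: Sequence[float],
    risk_free: float = 0.0,
    periods_per_year: int = 1,
) -> float:
    """Mean excess return over its sample standard deviation.

    periods_per_year scales the result by sqrt(periods); 1 means no annualization.
    """
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)  # needs at least two observations
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe Ratio is undefined")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest fractional drop from a running peak (assumes positive equity values)."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


if __name__ == "__main__":
    print(max_drawdown([100.0, 120.0, 90.0, 110.0]))
    print(sharpe_ratio([0.01, 0.03]))
```

A higher Sharpe Ratio and a lower Maximum Drawdown both indicate a better risk-adjusted outcome.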
[COMMENTS]12 pages (4 pages appendix and references) and 6 figures. Accepted
for presentation at FLLM 2025, Vienna
[LINK]http://arxiv.org/abs/2508.02366v3
[DATE]2025-10-26 06:25:16+08:00
[CATEGORIES]cs.LG cs.CL
Modeling Hierarchical Thinking in Large Reasoning Models
[AUTHORS]G M Shahariar, Ali Nazari, Erfan Shayegani, Nael Abu-Ghazaleh
[ABSTRACT]Large Language Models (LLMs) have demonstrated remarkable reasoning abilities
when they generate step-by-step solutions, known as chain-of-thought (CoT)
reasoning. When trained using chain-of-thought reasoning examples, the
resulting models (called Large Reasoning Models, or LRMs) appear to learn
hierarchical thinking strategies similar to those used by humans. However,
understanding LRMs’ emerging reasoning capabilities remains a difficult open
problem, with many potentially important applications including improving
training and understanding robustness. In this paper, we adopt a memoryless
Finite State Machine formulation to approximate LRMs’ emerging hierarchical
reasoning dynamics as a structured, interpretable abstraction. We identify a
small set of discrete reasoning states (initialization, deduction,
augmentation-strategy, uncertainty-estimation, backtracking, and
final-conclusion) that capture the high-level states present in the model’s
reasoning process. By annotating each step of a model’s CoT with these states,
we can represent the reasoning trajectory as a transition sequence through the
state graph. This FSM formulation provides a systematic way to analyze,
interpret and visualize how different models approach problems. We describe the
FSM model, provide examples of CoT annotations under this scheme, and discuss
how it can shed light on differences between available models in their approach
to reasoning. Our results demonstrate that this FSM-based analysis reveals
distinct reasoning patterns and potential shortcomings, offering a new lens to
evaluate and improve LLM reasoning.
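Representing an annotated chain of thought as a transition sequence can be sketched as follows. The six state names come from the abstract; everything else (function names, the example trajectory, and estimating transition probabilities by simple counting) is an illustrative assumption rather than the authors' implementation. "Memoryless" is read here as each transition depending only on the current state.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

# State names as listed in the abstract.
STATES = (
    "initialization",
    "deduction",
    "augmentation-strategy",
    "uncertainty-estimation",
    "backtracking",
    "final-conclusion",
)

Transition = Tuple[str, str]


def transition_counts(trajectory: Sequence[str]) -> Counter:
    """Count (source, destination) pairs between consecutive annotated steps."""
    unknown = set(trajectory) - set(STATES)
    if unknown:
        raise ValueError(f"unknown state labels: {sorted(unknown)}")
    return Counter(zip(trajectory, trajectory[1:]))


def transition_probabilities(counts: Counter) -> Dict[Transition, float]:
    """Normalize counts per source state to get empirical transition probabilities."""
    totals: Counter = Counter()
    for (src, _dst), n in counts.items():
        totals[src] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical annotation of one chain of thought.
    steps = ["initialization", "deduction", "backtracking",
             "deduction", "final-conclusion"]
    for pair, prob in transition_probabilities(transition_counts(steps)).items():
        print(pair, prob)
```

Comparing such transition tables across models is one simple way to surface differences such as how often a model backtracks.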
[LINK]http://arxiv.org/abs/2510.22437v1
[DATE]2025-10-26 05:25:30+08:00
[CATEGORIES]cs.CL
ComPO: Preference Alignment via Comparison Oracles
[AUTHORS]Peter Chen, Xi Chen, Wotao Yin, Tianyi Lin
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.05465v2
[DATE]2025-10-26 04:23:09+08:00
[CATEGORIES]cs.CL cs.LG
DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models
[AUTHORS]Tingxu Han, Wei Song, Ziqi Ding, Ziming Li, Chunrong Fang, Yuekang Li, Dongfang Liu, Zhenyu Chen, Zhenting Wang
[ABSTRACT]Large language models (LLMs) increasingly mediate decisions in domains where
unfair treatment of demographic groups is unacceptable. Existing work probes
when biased outputs appear, but gives little insight into the mechanisms that
generate them, leaving existing mitigations largely fragile. In this paper, we
conduct a systematic investigation of LLM unfairness and propose DiffHeads, a
lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA)
prompting to Chain-of-Thought (CoT) prompting across eight representative open-
and closed-source LLMs. DA triggers the LLM’s inherent bias and increases
measured unfairness by 534.5%-391.9% in both one-turn and two-turn dialogues.
Next, we define a token-to-head contribution score that traces each token’s
influence back to individual attention heads. This reveals a small cluster of
bias heads that activate under DA but stay largely dormant with CoT, providing
the first causal link between prompting strategy and bias emergence. Finally,
building on this insight, we propose DiffHeads that identifies bias heads
through differential activation analysis between DA and CoT, and selectively
masks only those heads. DiffHeads reduces unfairness by 49.4% and 40.3% under
DA and CoT, respectively, without harming model utility.
[LINK]http://arxiv.org/abs/2510.10142v2
[DATE]2025-10-26 04:03:37+08:00
[CATEGORIES]cs.CL
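The differential selection step described in the DiffHeads abstract above can be sketched in Python under strong simplifying assumptions. The per-head scores, the `threshold` parameter, and both function names are hypothetical. The paper's token-to-head contribution score is not reproduced here, only the idea of comparing DA and CoT activations and masking the heads that differ most.

```python
def find_bias_heads(da_scores, cot_scores, threshold):
    """Return heads whose score under DA exceeds their score under CoT by more than threshold."""
    return sorted(
        head
        for head, da in da_scores.items()
        if da - cot_scores.get(head, 0.0) > threshold
    )


def mask_heads(head_outputs, heads_to_mask):
    """Zero out the outputs of the selected heads and copy the others unchanged."""
    return {
        head: ([0.0] * len(out) if head in heads_to_mask else list(out))
        for head, out in head_outputs.items()
    }


# Hypothetical per-head contribution scores keyed by (layer, head).
da_scores = {(0, 1): 0.9, (0, 2): 0.2, (1, 0): 0.7}
cot_scores = {(0, 1): 0.1, (0, 2): 0.2, (1, 0): 0.6}
bias_heads = find_bias_heads(da_scores, cot_scores, threshold=0.5)
masked = mask_heads({(0, 1): [0.3, -0.4], (0, 2): [0.1, 0.2]}, set(bias_heads))
```

Only the head with a large DA-minus-CoT gap is masked, while heads that behave similarly under both prompting styles are left alone.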
Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
[AUTHORS]Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
[ABSTRACT]Large language models (LLMs) can acquire new knowledge through fine-tuning,
but this process exhibits a puzzling duality: models can generalize remarkably
from new facts, yet are also prone to hallucinating incorrect information.
However, the reasons for this phenomenon remain poorly understood. In this
work, we argue that both behaviors stem from a single mechanism known as
out-of-context reasoning (OCR): the ability to deduce implications by
associating concepts, even those without a causal link. Our experiments across
five prominent LLMs confirm that OCR indeed drives both generalization and
hallucination, depending on whether the associated concepts are causally
related. To build a rigorous theoretical understanding of this phenomenon, we
then formalize OCR as a synthetic factual recall task. We empirically show that
a one-layer single-head attention-only transformer with factorized output and
value matrices can learn to solve this task, while a model with combined
weights cannot, highlighting the crucial role of matrix factorization. Our
theoretical analysis shows that the OCR capability can be attributed to the
implicit bias of gradient descent, which favors solutions that minimize the
nuclear norm of the combined output-value matrix. This mathematical structure
explains why the model learns to associate facts and implications with high
sample efficiency, regardless of whether the correlation is causal or merely
spurious. Ultimately, our work provides a theoretical foundation for
understanding the OCR phenomenon, offering a new lens for analyzing and
mitigating undesirable behaviors from knowledge injection.
[COMMENTS]NeurIPS 2025, first three authors contributed equally
[LINK]http://arxiv.org/abs/2506.10887v3
[DATE]2025-10-26 03:53:24+08:00
[CATEGORIES]cs.CL cs.LG
Aligning LLMs for Multilingual Consistency in Enterprise Applications
[AUTHORS]Amit Agarwal, Hansa Meghwani, Hitesh Laxmichand Patel, Tao Sheng, Sujith Ravi, Dan Roth
[ABSTRACT]Large language models (LLMs) remain unreliable for global enterprise
applications due to substantial performance gaps between high-resource and
mid/low-resource languages, driven by English-centric pretraining and internal
reasoning biases. This inconsistency undermines customer experience and
operational reliability in multilingual settings such as customer support,
content moderation, and information retrieval. Even with advanced
Retrieval-Augmented Generation (RAG) systems, we observe up to a 29% accuracy
drop in non-English languages compared to English. We propose a practical,
batch-wise alignment strategy for fine-tuning LLMs, leveraging semantically
equivalent multilingual data in each training batch to directly align model
outputs across languages. This approach improves non-English accuracy by up to
23.9% without compromising English performance, model reasoning, or retrieval
quality. Our method is simple to implement, scalable, and integrates seamlessly
with existing LLM training & deployment pipelines, enabling more robust and
equitable multilingual AI solutions in industry.
[COMMENTS]Accepted at EMNLP 2025
[LINK]http://arxiv.org/abs/2509.23659v2
[DATE]2025-10-26 02:56:44+08:00
[CATEGORIES]cs.CL
Confabulations from ACL Publications (CAP): A Dataset for Scientific Hallucination Detection
[AUTHORS]Federica Gamba, Aman Sinha, Timothee Mickus, Raul Vazquez, Patanjali Bhamidipati, Claudio Savelli, Ahana Chattopadhyay, Laura A. Zanella, Yash Kankanampati, Binesh Arakkal Remesh, Aryan Ashok Chandramania, Rohit Agarwal, Chuyuan Li, Ioana Buhnila, Radhika Mamidi
[ABSTRACT]We introduce the CAP (Confabulations from ACL Publications) dataset, a
multilingual resource for studying hallucinations in large language models
(LLMs) within scientific text generation. CAP focuses on the scientific domain,
where hallucinations can distort factual knowledge, as they frequently do. In
this domain, however, the presence of specialized terminology, statistical
reasoning, and context-dependent interpretations further exacerbates these
distortions, particularly given LLMs’ lack of true comprehension, limited
contextual understanding, and bias toward surface-level generalization. CAP
operates in a cross-lingual setting covering five high-resource languages
(English, French, Hindi, Italian, and Spanish) and four low-resource languages
(Bengali, Gujarati, Malayalam, and Telugu). The dataset comprises 900 curated
scientific questions and over 7000 LLM-generated answers from 16 publicly
available models, provided as question-answer pairs along with token sequences
and corresponding logits. Each instance is annotated with a binary label
indicating the presence of a scientific hallucination, denoted as a factuality
error, and a fluency label, capturing issues in the linguistic quality or
naturalness of the text. CAP is publicly released to facilitate advanced
research on hallucination detection, multilingual evaluation of LLMs, and the
development of more reliable scientific NLP systems.
[LINK]http://arxiv.org/abs/2510.22395v1
[DATE]2025-10-26 02:42:22+08:00
[CATEGORIES]cs.CL
SUMO: Subspace-Aware Moment-Orthogonalization for Accelerating Memory-Efficient LLM Training
[AUTHORS]Yehonathan Refael, Guy Smorodinsky, Tom Tirer, Ofir Lindenbaum
[ABSTRACT]Low-rank gradient-based optimization methods have significantly improved
memory efficiency during the training of large language models (LLMs), enabling
operations within constrained hardware without sacrificing performance.
However, these methods primarily emphasize memory savings, often overlooking
potential acceleration in convergence due to their reliance on standard
isotropic steepest descent techniques, which can perform suboptimally in the
highly anisotropic landscapes typical of deep networks, particularly LLMs. In
this paper, we propose SUMO (Subspace-Aware Moment-Orthogonalization), an
optimizer that employs exact singular value decomposition (SVD) for moment
orthogonalization within a dynamically adapted low-dimensional subspace,
enabling norm-inducing steepest descent optimization steps. By explicitly
aligning optimization steps with the spectral characteristics of the loss
landscape, SUMO effectively mitigates approximation errors associated with
commonly used methods like Newton-Schulz orthogonalization approximation. We
theoretically establish an upper bound on these approximation errors, proving
their dependence on the condition numbers of moments, conditions we
analytically demonstrate are encountered during LLM training. Furthermore, we
both theoretically and empirically illustrate that exact orthogonalization via
SVD substantially improves convergence rates while reducing overall complexity.
Empirical evaluations confirm that SUMO accelerates convergence, enhances
stability, improves performance, and reduces memory requirements by up to 20%
compared to state-of-the-art methods.
[LINK]http://arxiv.org/abs/2505.24749v2
[DATE]2025-10-26 02:18:42+08:00
[CATEGORIES]cs.LG cs.CL
Solving the Unsolvable: Translating Case Law in Hong Kong
[AUTHORS]King-kui Sin, Xi Xuan, Chunyu Kit, Clara Ho-yan Chan, Honic Ho-kin Ip
[ABSTRACT]This paper addresses the challenges of translating case law under Hong Kong’s
bilingual legal system. It highlights the initial success of translating all
written statutes into Chinese before the 1997 handover, a task mandated by the
Basic Law. The effort involved significant collaboration among legal,
linguistic, and translation experts, resulting in a comprehensive and
culturally appropriate bilingual legal system. However, translating case law
remains a significant challenge due to the sheer volume and continuous growth
of judicial decisions. The paper critiques the government’s and judiciary’s
sporadic and uncoordinated efforts to translate case law, contrasting them with
the thorough approach previously taken for statute translation. Although the
government acknowledges the importance of legal bilingualism, it lacks a
sustainable strategy for translating case law. The Judiciary’s position that
translating all judgments is unnecessary, unrealistic, and not cost-effective is
analyzed and critiqued for its impact on legal transparency and public trust. A
proposed solution involves leveraging machine translation technology through a
human-machine interactive translation platform, which undergoes two major
transitions. Initially based on a neural model, the platform transitions to
using a large language model for improved translation accuracy. Furthermore, it
evolves from a single-agent system to a multi-agent system, incorporating
Translator, Annotator, and Proofreader agents. This multi-agent approach,
supported by a grant, aims to facilitate efficient, high-quality translation of
judicial judgments by integrating advanced artificial intelligence and
continuous feedback mechanisms, thus better meeting the needs of a bilingual
legal system.
[LINK]http://arxiv.org/abs/2501.09444v3
[DATE]2025-10-26 02:00:35+08:00
[CATEGORIES]cs.CL cs.LG
Label Smoothing Improves Gradient Ascent in LLM Unlearning
[AUTHORS]Zirui Pang, Hao Zheng, Zhijie Deng, Ling Li, Zixin Zhong, Jiaheng Wei
[ABSTRACT]LLM unlearning has emerged as a promising approach, aiming to enable models
to forget hazardous/undesired knowledge at low cost while preserving as much
model utility as possible. Among existing techniques, the most straightforward
method is performing Gradient Ascent (GA) w.r.t. the forget data, thereby
forcing the model to unlearn the forget dataset. However, GA suffers from
severe instability, as it drives updates in a divergent direction, often
resulting in drastically degraded model utility. To address this issue, we
propose Smoothed Gradient Ascent (SGA). SGA combines the forget data with
multiple constructed normal data through a tunable smoothing rate. Intuitively,
this extends GA from learning solely on the forget data to jointly learning
across both forget and normal data, enabling more stable unlearning while
better preserving model utility. Theoretically, we provide guidance on
selecting the optimal smoothing rate. Empirically, we
evaluate SGA on three benchmarks: TOFU, Harry Potter, and MUSE-NEWS.
Experimental results demonstrate that SGA consistently outperforms the original
Gradient Ascent (GA) method across all metrics and achieves top-2 performance
among all baseline methods on several key metrics.
[LINK]http://arxiv.org/abs/2510.22376v1
[DATE]2025-10-26 01:43:34+08:00
[CATEGORIES]cs.LG cs.CL
Reasoning Models Reason Well, Until They Don’t
[AUTHORS]Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, Abulhair Saparov
[ABSTRACT]Large language models (LLMs) have shown significant progress in reasoning
tasks. However, recent studies show that transformers and LLMs fail
catastrophically once reasoning problems exceed modest complexity. We revisit
these findings through the lens of large reasoning models (LRMs) – LLMs
fine-tuned with incentives for step-by-step argumentation and
self-verification. LRM performance on graph and reasoning benchmarks such as
NLGraph seems extraordinary, with some even claiming they are capable of
generalized reasoning and innovation in reasoning-intensive fields such as
mathematics, physics, medicine, and law. However, by more carefully scaling the
complexity of reasoning problems, we show existing benchmarks actually have
limited complexity. We develop a new dataset, the Deep Reasoning Dataset
(DeepRD), along with a generative process for producing unlimited examples of
scalable complexity. We use this dataset to evaluate model performance on graph
connectivity and natural language proof planning. We find that the performance
of LRMs drops abruptly at sufficient complexity and does not generalize. We also
relate our LRM results to the distributions of the complexities of large,
real-world knowledge graphs, interaction graphs, and proof datasets. We find
the majority of real-world examples fall inside the LRMs’ success regime, yet
the long tails expose substantial failure potential. Our analysis highlights
the near-term utility of LRMs while underscoring the need for new methods that
generalize beyond the complexity of examples in the training distribution.
[LINK]http://arxiv.org/abs/2510.22371v1
[DATE]2025-10-26 01:28:38+08:00
[CATEGORIES]cs.CL
GigaEmbeddings: Efficient Russian Language Embedding Model
[AUTHORS]Egor Kolodin, Daria Khomich, Nikita Savushkin, Anastasia Ianina, Fyodor Minkin
[ABSTRACT]We introduce GigaEmbeddings, a novel framework for training high-performance
Russian-focused text embeddings through hierarchical instruction tuning of a
decoder-only LLM designed specifically for the Russian language (GigaChat-3B). Our
three-stage pipeline, comprising large-scale contrastive pre-training on
web-scale corpora, fine-tuning with hard negatives, and multitask
generalization across retrieval, classification, and clustering tasks,
addresses key limitations of existing methods by unifying diverse objectives
and leveraging synthetic data generation. Architectural innovations include
bidirectional attention for contextual modeling, latent attention pooling for
robust sequence aggregation, and strategic pruning of 25% of transformer layers
to enhance efficiency without compromising performance. Evaluated on the ruMTEB
benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves
state-of-the-art results (69.1 avg. score), outperforming strong baselines with
a larger number of parameters.
[LINK]http://arxiv.org/abs/2510.22369v1
[DATE]2025-10-26 01:26:05+08:00
[CATEGORIES]cs.CL
A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings
[AUTHORS]Fitsum Gaim, Hoyun Song, Huije Lee, Changgeon Ko, Eui Jun Hwang, Jong C. Park
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.12116v2
[DATE]2025-10-26 00:54:23+08:00
[CATEGORIES]cs.CL
Mapping Faithful Reasoning in Language Models
[AUTHORS]Jiazheng Li, Andreas Damianou, J Rosser, José Luis Redondo García, Konstantina Palla
[COMMENTS]9 pages, Accepted to the Mechanistic Interpretability Workshop at
NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.22362v1
[DATE]2025-10-26 00:48:19+08:00
[CATEGORIES]cs.LG cs.CL
Are they lovers or friends? Evaluating LLMs’ Social Reasoning in English and Korean Dialogues
[AUTHORS]Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Doğruöz, Najoung Kim, Alice Oh
[ABSTRACT]As large language models (LLMs) are increasingly used in human-AI
interactions, their social reasoning capabilities in interpersonal contexts are
critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean,
sourced from movie scripts. The task involves evaluating models’ social
reasoning capability to infer the interpersonal relationships (e.g., friends,
sisters, lovers) between speakers in each dialogue. Each dialogue is annotated
with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by
native (or equivalent) Korean and English speakers from Korea and the U.S.
Evaluating nine models on our task, current proprietary LLMs achieve around
75-80% on the English dataset, whereas their performance on Korean drops to
58-69%. More strikingly, models select Unlikely relationships in 10-25% of
their responses. Furthermore, we find that thinking models and chain-of-thought
prompting, effective for general reasoning, provide minimal benefits for social
reasoning and occasionally amplify social biases. Our findings reveal
significant limitations in current LLMs’ social reasoning capabilities,
highlighting the need for efforts to develop socially-aware language models.
[LINK]http://arxiv.org/abs/2510.19028v2
[DATE]2025-10-26 00:46:10+08:00
[CATEGORIES]cs.CL
Irony Detection in Urdu Text: A Comparative Study Using Machine Learning Models and Large Language Models
[AUTHORS]Fiaz Ahmad, Nisar Hussain, Amna Qasim, Momina Hafeez, Muhammad Usman, Grigori Sidorov, Alexander Gelbukh
[ABSTRACT]Irony identification is a challenging task in Natural Language Processing,
particularly when dealing with languages that differ in syntax and cultural
context. In this work, we aim to detect irony in Urdu by translating an English
Ironic Corpus into the Urdu language. We evaluate ten state-of-the-art machine
learning algorithms using GloVe and Word2Vec embeddings, and compare their
performance with classical methods. Additionally, we fine-tune advanced
transformer-based models, including BERT, RoBERTa, LLaMA 2 (7B), LLaMA 3 (8B),
and Mistral, to assess the effectiveness of large-scale models in irony
detection. Among machine learning models, Gradient Boosting achieved the best
performance with an F1-score of 89.18%. Among transformer-based models, LLaMA 3
(8B) achieved the highest performance with an F1-score of 94.61%. These results
demonstrate that combining transliteration techniques with modern NLP models
enables robust irony detection in Urdu, a historically low-resource language.
[COMMENTS]5 pages, 3 figures
[LINK]http://arxiv.org/abs/2510.22356v1
[DATE]2025-10-26 00:36:03+08:00
[CATEGORIES]cs.CL
DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
[AUTHORS]Cathy Jiao, Yijun Pan, Emily Xiao, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, Chenyan Xiong
[COMMENTS]NeurIPS 2025 Datasets and Benchmarks Track
[LINK]http://arxiv.org/abs/2507.09424v2
[DATE]2025-10-26 00:02:01+08:00
[CATEGORIES]cs.CL
FAIR-RAG: Faithful Adaptive Iterative Refinement for Retrieval-Augmented Generation
[AUTHORS]Mohammad Aghajani Asl, Majid Asgari-Bidhendi, Behrooz Minaei-Bidgoli
[ABSTRACT]While Retrieval-Augmented Generation (RAG) mitigates hallucination and
knowledge staleness in Large Language Models (LLMs), existing frameworks often
falter on complex, multi-hop queries that require synthesizing information from
disparate sources. Current advanced RAG methods, employing iterative or
adaptive strategies, lack a robust mechanism to systematically identify and
fill evidence gaps, often propagating noise or failing to gather a
comprehensive context. We introduce FAIR-RAG, a novel agentic framework that
transforms the standard RAG pipeline into a dynamic, evidence-driven reasoning
process. At its core is an Iterative Refinement Cycle governed by a module we
term Structured Evidence Assessment (SEA). The SEA acts as an analytical gating
mechanism: it deconstructs the initial query into a checklist of required
findings and audits the aggregated evidence to identify confirmed facts and,
critically, explicit informational gaps. These gaps provide a precise signal to
an Adaptive Query Refinement agent, which generates new, targeted sub-queries
to retrieve missing information. This cycle repeats until the evidence is
verified as sufficient, ensuring a comprehensive context for a final, strictly
faithful generation. We conducted experiments on challenging multi-hop QA
benchmarks, including HotpotQA, 2WikiMultiHopQA, and MusiQue. In a unified
experimental setup, FAIR-RAG significantly outperforms strong baselines. On
HotpotQA, it achieves an F1-score of 0.453 – an absolute improvement of 8.3
points over the strongest iterative baseline – establishing a new
state-of-the-art for this class of methods on these benchmarks. Our work
demonstrates that a structured, evidence-driven refinement process with
explicit gap analysis is crucial for unlocking reliable and accurate reasoning
in advanced RAG systems for complex, knowledge-intensive tasks.
[COMMENTS]30 pages, 5 figures, 5 tables. Keywords: Retrieval-Augmented
Generation (RAG), Large Language Models (LLMs), Agentic AI, Multi-hop
Question Answering, Faithfulness
[LINK]http://arxiv.org/abs/2510.22344v1
[DATE]2025-10-25 23:59:33+08:00
[CATEGORIES]cs.CL
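The refinement cycle in the FAIR-RAG abstract above (assess the evidence against a checklist, find gaps, issue targeted sub-queries, repeat) can be sketched in Python as a toy loop. Everything here is an assumption for illustration: the substring-based `assess_evidence`, the dictionary-backed retriever, and the use of each gap as its own sub-query stand in for the LLM-driven SEA module and query refinement agent, which the abstract does not specify in detail.

```python
def assess_evidence(checklist, evidence):
    """Split the checklist into findings confirmed by the evidence and remaining gaps."""
    confirmed = [item for item in checklist if any(item in doc for doc in evidence)]
    gaps = [item for item in checklist if item not in confirmed]
    return confirmed, gaps


def iterative_refinement(question, checklist, retrieve, max_rounds=3):
    """Retrieve and assess repeatedly, re-querying only for the findings still missing."""
    evidence = []
    gaps = list(checklist)
    queries = [question]
    rounds = 0
    for _ in range(max_rounds):
        rounds += 1
        for query in queries:
            evidence.extend(retrieve(query))
        _, gaps = assess_evidence(checklist, evidence)
        if not gaps:
            break
        queries = gaps  # toy refinement: each gap becomes a targeted sub-query
    return evidence, gaps, rounds


# Hypothetical two-hop question with a toy index standing in for a real retriever.
question = "What is the population of the author's birthplace?"
checklist = ["birthplace", "population"]
toy_index = {
    question: ["birthplace: Town A"],
    "population": ["population of Town A: 1200"],
}
evidence, gaps, rounds = iterative_refinement(
    question, checklist, lambda q: toy_index.get(q, [])
)
```

In this toy run the first retrieval confirms only one finding, so the second round queries for the missing one and the loop stops once no gaps remain.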
DynaSolidGeo: A Dynamic Benchmark for Genuine Spatial Mathematical Reasoning of VLMs in Solid Geometry
[AUTHORS]Changti Wu, Shijie Lian, Zihao Liu, Lei Zhang, Laurence Tianruo Yang, Kai Chen
[ABSTRACT]Solid geometry problem solving demands spatial mathematical reasoning that
integrates spatial intelligence and symbolic reasoning. However, most existing
multimodal mathematical reasoning benchmarks focus primarily on 2D plane
geometry, rely on static datasets prone to data contamination and memorization,
and evaluate models solely by final answers, overlooking the reasoning process.
To address these limitations, we introduce DynaSolidGeo, the first dynamic
benchmark for evaluating genuine spatial reasoning in Vision-Language Models
(VLMs). Constructed through a semi-automatic annotation pipeline, DynaSolidGeo
contains 503 expert-curated seed questions that can, in principle, dynamically
generate an unbounded number of diverse multimodal text-visual instances.
Beyond answer accuracy, we incorporate process evaluation based on
expert-annotated reasoning chains to measure logical validity and causal
coherence. Experiments across representative open-source and closed-source VLMs
reveal large performance gaps, severe degradation in dynamic settings, and poor
performance on tasks requiring high-level spatial intelligence, such as mental
rotation and visualization. The code and dataset are available at
https://zgca-ai4edu.github.io/DynaSolidGeo/.
[COMMENTS]The code and dataset are available at
https://zgca-ai4edu.github.io/DynaSolidGeo/
[LINK]http://arxiv.org/abs/2510.22340v1
[DATE]2025-10-25 23:49:45+08:00
[CATEGORIES]cs.CL cs.LG
BUSTED at AraGenEval Shared Task: A Comparative Study of Transformer-Based Models for Arabic AI-Generated Text Detection
[AUTHORS]Ali Zain, Sareem Farooqui, Muhammad Rafi
[ABSTRACT]This paper details our submission to the AraGenEval Shared Task on Arabic
AI-generated text detection, where our team, BUSTED, secured 5th place. We
investigated the effectiveness of three pre-trained transformer models:
AraELECTRA, CAMeLBERT, and XLM-RoBERTa. Our approach involved fine-tuning each
model on the provided dataset for a binary classification task. Our findings
revealed a surprising result: the multilingual XLM-RoBERTa model achieved the
highest performance with an F1 score of 0.7701, outperforming the specialized
Arabic models. This work underscores the complexities of AI-generated text
detection and highlights the strong generalization capabilities of multilingual
models.
[LINK]http://arxiv.org/abs/2510.20610v2
[DATE]2025-10-25 23:33:29+08:00
[CATEGORIES]cs.CL
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
[AUTHORS]Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
[ABSTRACT]Scaling language models unlocks impressive capabilities, but the accompanying
computational and memory demands make both training and deployment expensive.
Existing efficiency efforts typically target either parameter sharing or
adaptive computation, leaving open the question of how to attain both
simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework
that combines the two axes of efficiency inside a single Recursive Transformer.
MoR reuses a shared stack of layers across recursion steps to achieve parameter
efficiency, while lightweight routers enable adaptive token-level thinking by
dynamically assigning different recursion depths to individual tokens. This
allows MoR to focus quadratic attention computation only among tokens still
active at a given recursion depth, further improving memory access efficiency
by selectively caching only their key-value pairs. Beyond these core
mechanisms, we also propose a KV sharing variant that reuses KV pairs from the
first recursion, specifically designed to further decrease memory footprint.
Across model scales ranging from 135M to 1.7B parameters, MoR forms a new
Pareto frontier: at equal training FLOPs and smaller model sizes, it
significantly lowers validation perplexity and improves few-shot accuracy,
while delivering higher throughput compared with vanilla and existing recursive
baselines. These gains demonstrate that MoR is an effective path towards
large-model quality without incurring large-model cost.
[COMMENTS]38 pages, 9 figures, 17 tables, codes at
https://github.com/raymin0223/mixture_of_recursions
[LINK]http://arxiv.org/abs/2507.10524v3
[DATE]2025-10-25 22:12:56+08:00
[CATEGORIES]cs.CL cs.LG
MOOSE-Chem3: Toward Experiment-Guided Hypothesis Ranking via Simulated Experimental Feedback
[AUTHORS]Wanhao Liu, Zonglin Yang, Jue Wang, Lidong Bing, Di Zhang, Dongzhan Zhou, Yuqiang Li, Houqiang Li, Erik Cambria, Wanli Ouyang
[ABSTRACT]Hypothesis ranking is vital for automated scientific discovery, especially in
cost-intensive, throughput-limited natural science domains. Current methods
focus on pre-experiment ranking, relying solely on language model reasoning
without empirical feedback. We introduce experiment-guided ranking, which
prioritizes hypotheses based on feedback from prior tests. Due to the
impracticality of real experiments, we propose a simulator grounded in
domain-specific concepts that models hypothesis performance as a function of
similarity to a hidden ground truth, perturbed by noise. Validated against 124
hypotheses with experimentally reported outcomes, the simulator approximates
real results with consistent trend alignment. Although deviations exist, they
mimic wet-lab noise, promoting more robust ranking strategies. We frame
experiment-guided ranking as a sequential decision-making problem and propose
an in-context reinforcement learning (ICRL) framework. Our LLM-based policy
decomposes hypotheses into functional elements, clusters them by mechanistic
roles, and prioritizes recombinations based on feedback. Experiments show our
approach significantly outperforms pre-experiment baselines and strong
ablations. Our toolkit, comprising the simulator and ICRL framework, enables
systematic research on experiment-guided ranking, with the policy serving as a
strong proof of concept.
[LINK]http://arxiv.org/abs/2505.17873v3
[DATE]2025-10-25 22:00:54+08:00
[CATEGORIES]cs.CL
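The simulator idea in the MOOSE-Chem3 abstract above (performance as a function of similarity to a hidden ground truth, perturbed by noise) can be sketched in Python. The choice of Jaccard similarity over sets of components, Gaussian noise, and clamping to [0, 1] are all assumptions made for illustration. The paper's simulator is grounded in domain-specific concepts that are not reproduced here.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def simulate_feedback(hypothesis, ground_truth, noise_std=0.05, rng=None):
    """Return a noisy performance score in [0, 1] for a hypothesis."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    score = jaccard(hypothesis, ground_truth) + rng.gauss(0.0, noise_std)
    return min(1.0, max(0.0, score))


# Hypothetical component sets; the ground truth would be hidden from the ranking policy.
hidden_truth = {"catalyst_a", "solvent_b", "ligand_c"}
candidate = {"catalyst_a", "solvent_b"}
noiseless = simulate_feedback(candidate, hidden_truth, noise_std=0.0)
noisy = simulate_feedback(candidate, hidden_truth, noise_std=0.05)
```

A ranking policy would see only the returned scores, never `hidden_truth`, and would use them to decide which hypothesis to test next.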
CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
[AUTHORS]Tianhui Liu, Hetian Pang, Xin Zhang, Jie Feng, Yong Li, Pan Hui
[ABSTRACT]Harnessing publicly available, large-scale web data, such as street view and
satellite imagery, urban socio-economic sensing is of paramount importance for
achieving global sustainable development goals. With the emergence of Large
Vision-Language Models (LVLMs), new opportunities have arisen to solve this
task by treating it as a multi-modal perception and understanding problem.
However, recent studies reveal that LVLMs still struggle with accurate and
interpretable socio-economic predictions from visual data. To address these
limitations and maximize the potential of LVLMs, we introduce
CityRiSE, a novel framework for Reasoning urban
Socio-Economic status in LVLMs through pure reinforcement
learning (RL). With carefully curated multi-modal data and verifiable reward
design, our approach guides the LVLM to focus on semantically meaningful visual
cues, enabling structured and goal-oriented reasoning for generalist
socio-economic status prediction. Experiments demonstrate that CityRiSE with
emergent reasoning process significantly outperforms existing baselines,
improving both prediction accuracy and generalization across diverse urban
contexts, particularly for prediction on unseen cities and unseen indicators.
This work highlights the promise of combining RL and LVLMs for interpretable
and generalist urban socio-economic sensing.
[LINK]http://arxiv.org/abs/2510.22282v1
[DATE]2025-10-25 20:56:46+08:00
[CATEGORIES]cs.CL
From Slides to Chatbots: Enhancing Large Language Models with University Course Materials
[AUTHORS]Tu Anh Dinh, Philipp Nicolas Schumacher, Jan Niehues
[ABSTRACT]Large Language Models (LLMs) have advanced rapidly in recent years. One
application of LLMs is to support student learning in educational settings.
However, prior work has shown that LLMs still struggle to answer questions
accurately within university-level computer science courses. In this work, we
investigate how incorporating university course materials can enhance LLM
performance in this setting. A key challenge lies in leveraging diverse course
materials such as lecture slides and transcripts, which differ substantially
from typical textual corpora: slides also contain visual elements like images
and formulas, while transcripts contain spoken, less structured language. We
compare two strategies, Retrieval-Augmented Generation (RAG) and Continual
Pre-Training (CPT), to extend LLMs with course-specific knowledge. For lecture
slides, we further explore a multi-modal RAG approach, where we present the
retrieved content to the generator in image form. Our experiments reveal that,
given the relatively small size of university course materials, RAG is more
effective and efficient than CPT. Moreover, incorporating slides as images in
the multi-modal setting significantly improves performance over text-only
retrieval. These findings highlight practical strategies for developing AI
assistants that better support learning and teaching, and we hope they inspire
similar efforts in other educational contexts.
[LINK]http://arxiv.org/abs/2510.22272v1
[DATE]2025-10-25 20:31:26+08:00
[CATEGORIES]cs.CL
MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning
[AUTHORS]Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu
[ABSTRACT]Recent progress in Multi-modal Large Language Models (MLLMs) has enabled
step-by-step multi-modal mathematical reasoning by performing visual operations
based on the textual instructions. A promising approach uses code as an
intermediate representation to precisely express and manipulate the images in
the reasoning steps. However, existing evaluations focus mainly on text-only
reasoning outputs, leaving the MLLM’s ability to perform accurate visual
operations via code largely unexplored. This work takes a first step toward
addressing that gap by evaluating MLLMs’ code-based capabilities in multi-modal
mathematical reasoning. Specifically, our framework focuses on two key
evaluation aspects: (1) Multi-modal Code Generation (MCG) evaluates the model’s
ability to accurately understand and construct visualizations from scratch. (2)
Multi-modal Code Editing (MCE) assesses the model’s capacity for fine-grained
operations, which include three types: Deletion, Modification and Annotation.
To evaluate the above tasks, we incorporate a dataset that covers the five most
popular types of mathematical figures, including geometric diagrams, function
plots, and three types of statistical charts, to provide a comprehensive and
effective measurement of existing MLLMs. Our experimental evaluation involves
nine mainstream MLLMs, and the results reveal that existing models still lag
significantly behind human performance in performing fine-grained visual
operations.
[COMMENTS]Under Review
[LINK]http://arxiv.org/abs/2507.18140v2
[DATE]2025-10-25 20:17:56+08:00
[CATEGORIES]cs.CL
SteerX: Disentangled Steering for LLM Personalization
[AUTHORS]Xiaoyan Zhao, Ming Yan, Yilun Qiu, Haoting Ni, Yang Zhang, Fuli Feng, Hong Cheng, Tat-Seng Chua
[ABSTRACT]Large language models (LLMs) have shown remarkable success in recent years,
enabling a wide range of applications, including intelligent assistants that
support users’ daily life and work. A critical factor in building such
assistants is personalizing LLMs, as user preferences and needs vary widely.
Activation steering, which directly leverages directions representing user
preference in the LLM activation space to adjust its behavior, offers a
cost-effective way to align the model’s outputs with individual users. However,
existing methods rely on all historical data to compute the steering vector,
ignoring that not all content reflects true user preferences, which undermines
the personalization signal. To address this, we propose SteerX, a disentangled
steering method that isolates preference-driven components from
preference-agnostic components. Grounded in causal inference theory, SteerX
estimates token-level causal effects to identify preference-driven tokens,
transforms these discrete signals into a coherent description, and then
leverages them to steer personalized LLM generation. By focusing on the truly
preference-driven information, SteerX produces more accurate activation
steering vectors and enhances personalization. Experiments on two
representative steering backbone methods across real-world datasets demonstrate
that SteerX consistently enhances steering vector quality, offering a practical
solution for more effective LLM personalization.
[LINK]http://arxiv.org/abs/2510.22256v1
[DATE]2025-10-25 19:26:20+08:00
[CATEGORIES]cs.CL
PACR: Progressively Ascending Confidence Reward for LLM Reasoning
[AUTHORS]Eunseop Yoon, Hee Suk Yoon, Jaehyun Jang, SooHwan Eom, Qi Dai, Chong Luo, Mark A. Hasegawa-Johnson, Chang D. Yoo
[ABSTRACT]Reinforcement Learning with Verifiable Rewards (RLVR) has significantly
improved LLM reasoning, but its sparse, outcome-based reward provides no
guidance for intermediate steps, slowing exploration. We propose Progressively
Ascending Confidence Reward (PACR), a dense, model-intrinsic reward computed
directly from the model’s evolving belief in the correct answer. PACR encodes
the inductive bias that, along a well-formed reasoning trajectory, the
probability of the ground-truth answer should have a generally ascending trend.
We provide empirical and theoretical analysis validating that such an inductive
bias constrains the exploration search space to regions richer in logically
sound reasoning. We demonstrate that PACR accelerates exploration, reaches
reward saturation with fewer trajectories, and yields improvements on multiple
benchmarks. Our results suggest that dense, model-intrinsic shaping signals can
make RLVR training more effective and reliable.
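A minimal sketch of the "ascending confidence" idea, assuming the per-step probability of the ground-truth answer is already available as a list. The function names and the choice of log-probability differences as the per-step signal are illustrative assumptions, not the paper's stated formulation.

```python
import math


def stepwise_confidence_gains(answer_probs):
    """Log-probability gain of the ground-truth answer between consecutive steps.

    answer_probs[k] is assumed to be the model's probability of the correct
    answer after reasoning step k (all values must be > 0).
    """
    logs = [math.log(p) for p in answer_probs]
    return [later - earlier for earlier, later in zip(logs, logs[1:])]


def ascending_fraction(answer_probs):
    """Fraction of steps at which confidence in the correct answer increased."""
    gains = stepwise_confidence_gains(answer_probs)
    return sum(g > 0 for g in gains) / len(gains)
```

A trajectory such as `[0.1, 0.2, 0.15, 0.6]` has two ascending steps out of three, so a dense signal of this kind would favour it over a trajectory whose confidence drifts downward.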
[COMMENTS]16 pages, 14 figures
[LINK]http://arxiv.org/abs/2510.22255v1
[DATE]2025-10-25 19:25:35+08:00
[CATEGORIES]cs.CL
You Don’t Need Prompt Engineering Anymore: The Prompting Inversion
[AUTHORS]Imran Khan
[ABSTRACT]Prompt engineering, particularly Chain-of-Thought (CoT) prompting,
significantly enhances LLM reasoning capabilities. We introduce “Sculpting,” a
constrained, rule-based prompting method designed to improve upon standard CoT
by reducing errors from semantic ambiguity and flawed common sense.
We evaluate three prompting strategies (Zero Shot, standard CoT, and
Sculpting) across three OpenAI model generations (gpt-4o-mini, gpt-4o, gpt-5)
using the GSM8K mathematical reasoning benchmark (1,317 problems).
Our findings reveal a “Prompting Inversion”: Sculpting provides advantages on
gpt-4o (97% vs. 93% for standard CoT), but becomes detrimental on gpt-5 (94.00%
vs. 96.36% for CoT on full benchmark). We trace this to a
“Guardrail-to-Handcuff” transition where constraints preventing common-sense
errors in mid-tier models induce hyper-literalism in advanced models. Our
detailed error analysis demonstrates that optimal prompting strategies must
co-evolve with model capabilities, suggesting simpler prompts for more capable
models.
[COMMENTS]17 pages, 1 figure, 6 tables. Code and experimental data available at
https://github.com/strongSoda/prompt-sculpting
[LINK]http://arxiv.org/abs/2510.22251v1
[DATE]2025-10-25 19:04:01+08:00
[CATEGORIES]cs.CL cs.LG
Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety
[AUTHORS]Vamshi Krishna Bonagiri, Ponnurangam Kumaragurum, Khanh Nguyen, Benjamin Plaut
[ABSTRACT]As Large Language Model (LLM) agents increasingly operate in complex
environments with real-world consequences, their safety becomes critical. While
uncertainty quantification is well-studied for single-turn tasks, multi-turn
agentic scenarios with real-world tool access present unique challenges where
uncertainties and ambiguities compound, leading to severe or catastrophic risks
beyond traditional text generation failures. We propose using “quitting” as a
simple yet effective behavioral mechanism for LLM agents to recognize and
withdraw from situations where they lack confidence. Leveraging the ToolEmu
framework, we conduct a systematic evaluation of quitting behavior across 12
state-of-the-art LLMs. Our results demonstrate a highly favorable
safety-helpfulness trade-off: agents prompted to quit with explicit
instructions improve safety by an average of +0.39 on a 0-3 scale across all
models (+0.64 for proprietary models), while maintaining a negligible average
decrease of -0.03 in helpfulness. Our analysis demonstrates that simply adding
explicit quit instructions proves to be a highly effective safety mechanism
that can immediately be deployed in existing agent systems, and establishes
quitting as an effective first-line defense mechanism for autonomous agents in
high-stakes applications.
[COMMENTS]Reliable ML and Regulatable ML workshops, NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.16492v2
[DATE]2025-10-25 18:26:51+08:00
[CATEGORIES]cs.CL
Modeling Bottom-up Information Quality during Language Processing
[AUTHORS]Cui Ding, Yanning Yin, Lena A. Jäger, Ethan Gotlieb Wilcox
[ABSTRACT]Contemporary theories model language processing as integrating both top-down
expectations and bottom-up inputs. One major prediction of such models is that
the quality of the bottom-up inputs modulates ease of processing – noisy
inputs should lead to difficult and effortful comprehension. We test this
prediction in the domain of reading. First, we propose an information-theoretic
operationalization for the “quality” of bottom-up information as the mutual
information (MI) between visual information and word identity. We formalize
this prediction in a mathematical model of reading as a Bayesian update.
Second, we test our operationalization by comparing participants’ reading times
in conditions where words’ information quality has been reduced, either by
occluding their top or bottom half, with full words. We collect data in English
and Chinese. We then use multimodal language models to estimate the mutual
information between visual inputs and words. We use these data to estimate the
specific effect of reduced information quality on reading times. Finally, we
compare how information is distributed across visual forms. In English and
Chinese, the upper half contains more information about word identity than the
lower half. However, the asymmetry is more pronounced in English, a pattern
which is reflected in the reading times.
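For reference, mutual information between two discrete variables can be computed directly from a joint probability table. The sketch below is a generic illustration; the toy tables in the usage note are assumptions and do not come from the study's data or its multimodal estimator.

```python
import math


def mutual_information(joint):
    """Mutual information in bits for a joint table joint[x][y] of probabilities."""
    px = [sum(row) for row in joint]            # marginal over visual inputs
    py = [sum(col) for col in zip(*joint)]      # marginal over word identities
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0                                 # zero cells contribute nothing
    )
```

With `[[0.5, 0.0], [0.0, 0.5]]` the visual input fully determines the word and the result is 1 bit; with `[[0.25, 0.25], [0.25, 0.25]]` the input carries no information and the result is 0.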
[LINK]http://arxiv.org/abs/2509.17047v2
[DATE]2025-10-25 18:24:58+08:00
[CATEGORIES]cs.CL
Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide
[AUTHORS]Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer
[ABSTRACT]Fine-tuning large language models (LLMs) with limited data poses a practical
challenge in low-resource languages, specialized domains, and constrained
deployment settings. While pre-trained LLMs provide strong foundations,
effective adaptation under data scarcity requires focused and efficient
fine-tuning techniques. This paper presents a structured and practical survey
of recent methods for fine-tuning LLMs in data-scarce scenarios. We
systematically review parameter-efficient fine-tuning techniques that lower
training and deployment costs, domain and cross-lingual adaptation methods for
both encoder and decoder models, and model specialization strategies. We
further examine preference alignment approaches that guide model behavior using
limited human or synthetic feedback, emphasizing sample and compute efficiency.
Throughout, we highlight empirical trade-offs, selection criteria, and best
practices for choosing suitable techniques based on task constraints, including
model scaling, data scaling, and the mitigation of catastrophic forgetting. The
aim is to equip researchers and practitioners with actionable insights for
effectively fine-tuning LLMs when data and resources are limited.
[COMMENTS]Accepted to TACL. Pre-MIT Press version. Major restructuring; added
preference alignment section and additional tables. 36 pages
[LINK]http://arxiv.org/abs/2411.09539v2
[DATE]2025-10-25 18:17:48+08:00
[CATEGORIES]cs.CL cs.LG
Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking
[AUTHORS]Luigia Costabile, Gian Marco Orlando, Valerio La Gatta, Vincenzo Moscato
[ABSTRACT]The growing spread of online misinformation has created an urgent need for
scalable, reliable fact-checking solutions. Crowdsourced fact-checking - where
non-experts evaluate claim veracity - offers a cost-effective alternative to
expert verification, despite concerns about variability in quality and bias.
Encouraged by promising results in certain contexts, major platforms such as X
(formerly Twitter), Facebook, and Instagram have begun shifting from
centralized moderation to decentralized, crowd-based approaches.
In parallel, advances in Large Language Models (LLMs) have shown strong
performance across core fact-checking tasks, including claim detection and
evidence evaluation. However, their potential role in crowdsourced workflows
remains unexplored. This paper investigates whether LLM-powered generative
agents - autonomous entities that emulate human behavior and decision-making -
can meaningfully contribute to fact-checking tasks traditionally reserved for
human crowds.
Using the protocol of La Barbera et al. (2024), we simulate crowds of
generative agents with diverse demographic and ideological profiles. Agents
retrieve evidence, assess claims along multiple quality dimensions, and issue
final veracity judgments. Our results show that agent crowds outperform human
crowds in truthfulness classification, exhibit higher internal consistency, and
show reduced susceptibility to social and cognitive biases. Compared to humans,
agents rely more systematically on informative criteria such as Accuracy,
Precision, and Informativeness, suggesting a more structured decision-making
process. Overall, our findings highlight the potential of generative agents as
scalable, consistent, and less biased contributors to crowd-based fact-checking
systems.
[COMMENTS]This paper has been published in Online Social Networks and Media
(https://doi.org/10.1016/j.osnem.2025.100326). Please cite the published
version accordingly
[LINK]http://arxiv.org/abs/2504.19940v2
[DATE]2025-10-25 18:05:59+08:00
[CATEGORIES]cs.CL
Probabilistic adaptation of language comprehension for individual speakers: evidence from neural oscillations
[AUTHORS]Hanlin Wu, Xiaohui Rao, Zhenguang G Cai
[ABSTRACT]Listeners adapt language comprehension based on their mental representations
of speakers, but how these representations are updated remains unclear. We
investigated whether listeners probabilistically adapt comprehension based on
the frequency of speakers making stereotype-incongruent statements. In two EEG
experiments, participants heard speakers make stereotype-congruent or
incongruent statements, with incongruency base rate manipulated. In Experiment
1, stereotype-incongruent statements decreased high-beta (21-30 Hz) and theta
(4-6 Hz) oscillatory power in the low base rate condition but increased it in
the high base rate condition. The theta effect varied with listeners’ openness
trait: less open-minded participants tended to show theta increases to
stereotype incongruencies, while more open-minded participants tended to show
theta decreases. In Experiment 2, we dissociated incongruency base rate from
the target speaker by manipulating it using a non-target speaker and found that
only the high-beta effect persisted. Our findings reveal two potential
mechanisms: a speaker-general mechanism (indicated by high-beta oscillations)
that adjusts overall expectations about hearing statements that violate social
stereotypes, and a speaker-specific mechanism (indicated by theta oscillations)
that updates a more detailed mental model specifically about an individual
speaker. These findings provide evidence for how language processing interacts
with social cognition.
[LINK]http://arxiv.org/abs/2502.01299v2
[DATE]2025-10-25 17:56:46+08:00
[CATEGORIES]cs.CL
Better Estimation of the Kullback–Leibler Divergence Between Language Models
[AUTHORS]Afra Amini, Tim Vieira, Ryan Cotterell
[ABSTRACT]Estimating the Kullback–Leibler (KL) divergence between language models has
many applications, e.g., reinforcement learning from human feedback (RLHF),
interpretability, and knowledge distillation. However, computing the exact KL
divergence between two arbitrary language models is intractable. Thus,
practitioners often resort to sampling-based estimators. While it is easy to
fashion a simple Monte Carlo (MC) estimator that provides an unbiased estimate
of the KL divergence between language models, this estimator notoriously
suffers from high variance and can even result in a negative estimate of the KL
divergence, a non-negative quantity. In this paper, we introduce a
Rao–Blackwellized estimator that is unbiased and provably has variance less
than or equal to that of the standard Monte Carlo estimator. In an empirical
study on sentiment-controlled fine-tuning, we show that our estimator provides
more stable KL estimates and reduces variance substantially. Additionally, we
derive an analogous Rao–Blackwellized estimator of the gradient of the KL
divergence, which leads to more stable training and produces models that more
frequently appear on the Pareto frontier of reward vs. KL compared to the ones
trained with the MC estimator of the gradient.
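The contrast between the two estimators can be illustrated on a toy autoregressive model. In this sketch the plain Monte Carlo term is the log-ratio of a sampled sequence, while the Rao-Blackwellized term sums the exact next-token KL at each sampled prefix. The tables `P` and `Q`, the fixed length, and all function names are illustrative assumptions; the abstract does not spell out the estimator's exact form.

```python
import math
import random

# Toy next-token distributions over a two-token vocabulary, keyed by prefix.
P = {(): [0.7, 0.3], (0,): [0.9, 0.1], (1,): [0.2, 0.8]}
Q = {(): [0.5, 0.5], (0,): [0.6, 0.4], (1,): [0.5, 0.5]}
LENGTH = 2


def one_sample(rng):
    """Draw one sequence from P; return its (MC term, Rao-Blackwellized term)."""
    prefix, mc_term, rb_term = (), 0.0, 0.0
    for _ in range(LENGTH):
        p, q = P[prefix], Q[prefix]
        # Exact KL between next-token distributions at the sampled prefix.
        rb_term += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
        token = rng.choices(range(len(p)), weights=p)[0]
        # Log-ratio of the single sampled token (can be negative).
        mc_term += math.log(p[token] / q[token])
        prefix += (token,)
    return mc_term, rb_term


def sample_terms(n, seed=0):
    rng = random.Random(seed)
    return [one_sample(rng) for _ in range(n)]


def exact_kl():
    """Exact sequence-level KL by enumerating all length-2 sequences."""
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_seq = P[()][a] * P[(a,)][b]
            q_seq = Q[()][a] * Q[(a,)][b]
            total += p_seq * math.log(p_seq / q_seq)
    return total
```

Averaging either column of `sample_terms(n)` estimates the same quantity as `exact_kl()`, but each Rao-Blackwellized term is a sum of exact KL values and is therefore never negative, whereas individual Monte Carlo terms can be.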
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2504.10637v3
[DATE]2025-10-25 17:49:22+08:00
[CATEGORIES]cs.CL cs.LG
Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion
[AUTHORS]Syed Zohaib Hassan, Pierre Lison, Pål Halvorsen
[ABSTRACT]Disfluencies are a natural feature of spontaneous human speech but are
typically absent from the outputs of Large Language Models (LLMs). This absence
can diminish the perceived naturalness of synthesized speech, which is an
important criterion when building conversational agents that aim to mimic human
behaviours. We show how the insertion of disfluencies can alleviate this
shortcoming. The proposed approach involves (1) fine-tuning an LLM with
Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into
LLM-generated utterances and (2) synthesizing those utterances using a
text-to-speech model that supports the generation of speech phenomena such as
disfluencies. We evaluated the quality of the generated speech across two
metrics: intelligibility and perceived spontaneity. We demonstrate through a
user study that the insertion of disfluencies significantly increases the
perceived spontaneity of the generated speech. This increase came, however,
along with a slight reduction in intelligibility.
[COMMENTS]8 pages. Limitations, ethical considerations, and references are
additional
[LINK]http://arxiv.org/abs/2412.12710v2
[DATE]2025-10-25 17:33:41+08:00
[CATEGORIES]cs.CL
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
[AUTHORS]Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal
[ABSTRACT]Recent video diffusion models have enhanced video editing, but it remains
challenging to handle instructional editing and diverse tasks (e.g., adding,
removing, changing) within a unified framework. In this paper, we introduce
VEGGIE, a Video Editor with Grounded Generation from Instructions, a simple
end-to-end framework that unifies video concept editing, grounding, and
reasoning based on diverse user instructions. Specifically, given a video and
text query, VEGGIE first utilizes an MLLM to interpret user intentions in
instructions and ground them to the video contexts, generating frame-specific
grounded task queries for pixel-space responses. A diffusion model then renders
these plans and generates edited videos that align with user intent. To support
diverse tasks and complex instructions, we employ a curriculum learning
strategy: first aligning the MLLM and video diffusion model with large-scale
instructional image editing data, followed by end-to-end fine-tuning on
high-quality multitask video data. Additionally, we introduce a novel data
synthesis pipeline to generate paired instructional video editing data for
model training. It transforms static image data into diverse, high-quality
video editing samples by leveraging Image-to-Video models to inject dynamics.
VEGGIE shows strong performance in instructional video editing with different
editing skills, outperforming the best instructional baseline as a versatile
model, while other models struggle with multi-tasking. VEGGIE also excels in
video object grounding and reasoning segmentation, where other baselines fail.
We further reveal how the multiple tasks help each other and highlight
promising applications like zero-shot multimodal instructional and in-context
video editing.
[COMMENTS]ICCV 2025; First three authors contributed equally. Project page:
https://veggie-gen.github.io/
[LINK]http://arxiv.org/abs/2503.14350v3
[DATE]2025-10-25 17:08:18+08:00
[CATEGORIES]cs.CL
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
[AUTHORS]Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
[ABSTRACT]Combining pre-trained expert models offers substantial potential for scalable
multimodal reasoning, but building a unified framework remains challenging due
to the increasing diversity of input modalities and task complexity. For
instance, medical diagnosis requires precise reasoning over structured clinical
tables, while financial forecasting depends on interpreting plot-based data to
make informed predictions. To tackle this challenge, we introduce MEXA, a
training-free framework that performs modality- and task-aware aggregation of
multiple expert models to enable effective multimodal reasoning across diverse
and distinct domains. MEXA dynamically selects expert models based on the input
modality and the task-specific reasoning demands (i.e., skills). Each expert
model, specialized in a modality-task pair, generates interpretable textual
reasoning outputs. MEXA then aggregates and reasons over these outputs using a
Large Reasoning Model (LRM) to produce the final answer. This modular design
allows flexible and transparent multimodal reasoning across diverse domains
without additional training overhead. We extensively evaluate our approach on
diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D
Understanding, and Medical QA. MEXA consistently delivers performance
improvements over strong multimodal baselines, highlighting the effectiveness
and broad applicability of our expert-driven selection and aggregation in
diverse multimodal reasoning tasks.
[COMMENTS]EMNLP 2025 Findings; The first two authors contributed equally;
Github link: https://github.com/Yui010206/MEXA
[LINK]http://arxiv.org/abs/2506.17113v2
[DATE]2025-10-25 16:57:40+08:00
[CATEGORIES]cs.CL
Evolution of the lexicon: a probabilistic point of view
[AUTHORS]Maurizio Serva
[ABSTRACT]The Swadesh approach for determining the temporal separation between two
languages relies on the stochastic process of word replacement (when a
completely new word emerges to represent a given concept). It is well known that
the basic assumptions of the Swadesh approach are often unrealistic due to
various contamination phenomena and misjudgments (horizontal transfers,
variations over time and space of the replacement rate, incorrect assessments
of cognacy relationships, presence of synonyms, and so on). All of this means
that the results cannot be completely correct.
More importantly, even in the unrealistic case that all basic assumptions are
satisfied, simple mathematics places limits on the accuracy of estimating the
temporal separation between two languages. These limits, which are purely
probabilistic in nature and which are often neglected in lexicostatistical
studies, are analyzed in detail in this article.
Furthermore, in this work we highlight that the evolution of a language’s
lexicon is also driven by another stochastic process: gradual lexical
modification of words. We show that this process also represents a
major contribution to the reshaping of the vocabulary of languages over the
centuries and we also show, from a purely probabilistic perspective, that
taking into account this second random process significantly increases the
precision in determining the temporal separation between two languages.
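A rough sketch of why purely probabilistic limits arise, assuming the textbook replacement model: each word survives independently in each language at a constant rate, so the expected shared fraction after time t is exp(-2 * rate * t). The function name, parameters, and the delta-method error formula are assumptions for illustration and are not taken from the article.

```python
import math


def separation_estimate(shared, total, rate):
    """Estimate temporal separation and its approximate standard deviation.

    shared / total is the observed fraction of word-list items still shared;
    rate is the assumed per-word replacement rate per unit time.
    """
    c = shared / total
    t_hat = -math.log(c) / (2 * rate)
    # Binomial noise in c propagated through the logarithm (delta method).
    sd = math.sqrt((1 - c) / (total * c)) / (2 * rate)
    return t_hat, sd
```

For a 200-item list with half of the items shared, the relative uncertainty is already about 10% even if every modelling assumption holds, and it grows as the shared fraction shrinks.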
[LINK]http://arxiv.org/abs/2510.22220v1
[DATE]2025-10-25 16:48:15+08:00
[CATEGORIES]cs.CL
Estimating the Error of Large Language Models at Pairwise Text Comparison
[AUTHORS]Tianyi Li
[ABSTRACT]We measure LLMs’ output error at pairwise text comparison, noting the
probability of error in their preferences. Our method does not rely on the
ground truth and supports two scenarios: (i) uniform error rate regardless of
the order of comparison, estimated with two comparisons for each text pair with
either text placed first; (ii) binary positional bias assuming distinct error
rates for the two orders of comparison, estimated with repeated comparisons
between the texts. The Copeland counting constructs a ranking over the compared
texts from pairwise preferences; the ranking reveals the poor scalability of
LLM-based pairwise comparison and helps yield the estimates for LLMs’ error
rates. We apply the method to six LLMs (ChatGPT, Claude, DeepSeek, Gemini,
Grok, Qwen) with five types of text input and obtain consistent estimates of
LLMs’ error. In general, the two measured positional bias terms are similar,
close to the uniform error. Considering both the error rates and the robustness
to the variation of prompts, Claude obtained the most desirable performance in
this experiment. Our model outperforms the biased Bradley-Terry model and the
commutativity score in indicating LLMs’ error at this task.
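Copeland counting can be sketched in a few lines. The version below uses the common wins-minus-losses convention and breaks ties alphabetically; both choices, along with the function name and input format, are assumptions for illustration rather than details from the paper.

```python
from collections import defaultdict


def copeland_ranking(preferences):
    """Rank items from an iterable of (winner, loser) pairwise outcomes."""
    score = defaultdict(int)
    for winner, loser in preferences:
        score[winner] += 1   # one point per pairwise win
        score[loser] -= 1    # one point off per pairwise loss
    # Highest score first; ties are broken by item name for determinism.
    return sorted(score, key=lambda item: (-score[item], item))
```

Because every pair of texts needs at least one comparison, the number of LLM calls grows quadratically with the number of texts.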
[COMMENTS]14 pages, 6 figures
[LINK]http://arxiv.org/abs/2510.22219v1
[DATE]2025-10-25 16:39:52+08:00
[CATEGORIES]cs.CL
Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers
[AUTHORS]Samyak S. Sanghvi
[ABSTRACT]Antonym vs synonym distinction across multiple languages presents unique
computational challenges due to the paradoxical nature of antonymous
relationships: words that share semantic domains while expressing opposite
meanings. This work introduces Bhav-Net, a novel dual-space architecture that
enables effective knowledge transfer from complex multilingual models to
simpler, language-specific architectures while maintaining robust cross-lingual
antonym–synonym distinction capabilities. Our approach combines
language-specific BERT encoders with graph transformer networks, creating
distinct semantic projections where synonymous pairs cluster in one space while
antonymous pairs exhibit high similarity in a complementary space. Through
comprehensive evaluation across eight languages (English, German, French,
Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic
relationship modeling transfers effectively across languages. The dual-encoder
design achieves competitive performance against state-of-the-art baselines
while providing interpretable semantic representations and effective
cross-lingual generalization.
[COMMENTS]Found some issues and need to correct them
[LINK]http://arxiv.org/abs/2508.15792v3
[DATE]2025-10-25 16:38:47+08:00
[CATEGORIES]cs.CL
Preference Optimization by Estimating the Ratio of the Data Distribution
[AUTHORS]Yeongmin Kim, Heesun Bae, Byeonghu Na, Il-Chul Moon
[ABSTRACT]Direct preference optimization (DPO) is widely used as a simple and stable
method for aligning large language models (LLMs) with human preferences. This
paper investigates a generalized DPO loss that enables a policy model to match
the target policy from a likelihood ratio estimation perspective. The ratio of
the target policy provides a unique identification of the policy distribution
without relying on reward models or partition functions. This allows the
generalized loss to retain both simplicity and theoretical guarantees, which
prior work such as $f$-PO fails to achieve simultaneously. We propose Bregman
preference optimization (BPO), a generalized framework for ratio matching that
provides a family of objective functions achieving target policy optimality.
BPO subsumes DPO as a special case and offers tractable forms for all
instances, allowing implementation with a few lines of code. We further develop
scaled Basu’s power divergence (SBA), a gradient scaling method that can be
used for BPO instances. The BPO framework complements other DPO variants and is
applicable to target policies defined by these variants. In experiments, unlike
other probabilistic loss extensions such as $f$-DPO or $f$-PO, which exhibit a
trade-off between generation fidelity and diversity, instances of BPO improve
both win rate and entropy compared with DPO. When applied to
Llama-3-8B-Instruct, BPO achieves state-of-the-art performance among Llama-3-8B
backbones, with a 55.9% length-controlled win rate on AlpacaEval2. Project
page: https://github.com/aailab-kaist/BPO.
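As a point of reference, the standard DPO loss that BPO is said to subsume can be written for a single preference pair as below. This is the usual published DPO objective, not the BPO objective; the argument names and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss -log(sigmoid(margin)) for one preference pair."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference model the margin is zero and the loss equals log 2; it falls as the policy raises the chosen response's log-ratio relative to the rejected one.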
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.19601v2
[DATE]2025-10-25 16:32:17+08:00
[CATEGORIES]cs.LG cs.CL
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model
[AUTHORS]Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, Chengyao Wen, Congqi Li, Deng Zhao, Dingbo Yuan, Donghai You, Fagui Mao, Fanzhuang Meng, Feng Xu, Guojie Li, Guowei Wang, Hao Dai, Haonan Zheng, Hong Liu, Jia Guo, Jiaming Liu, Jian Liu, Jianhao Fu, Jiannan Shi, Jianwen Wang, Jianxin Lai, Jin Yang, Jun Mei, Jun Zhou, Junbo Zhao, Junping Zhao, Kuan Xu, Le Su, Lei Chen, Li Tang, Liang Jiang, Liangcheng Fu, Lianhao Xu, Linfeng Shi, Lisha Liao, Longfei Zheng, Meng Li, Mingchun Chen, Qi Zuo, Qiang Cheng, Qianggang Cao, Qitao Shi, Quanrui Guo, Senlin Zhu, Shaofei Wang, Shaomian Zheng, Shuaicheng Li, Shuwei Gu, Siba Chen, Tao Wu, Tao Zhang, Tianyu Zhang, Tianyu Zhou, Tiwei Bie, Tongkai Yang, Wang Hong, Wang Ren, Weihua Chen, Wenbo Yu, Wengang Zheng, Xiangchun Wang, Xiaodong Yan, Xiaopei Wan, Xin Zhao, Xinyu Kong, Xinyu Tang, Xudong Han, Xudong Wang, Xuemin Yang, Xueyu Hu, Yalin Zhang, Yan Sun, Yicheng Shan, Yilong Wang, Yingying Xu, Yongkang Liu, Yongzhen Guo, Yuanyuan Wang, Yuchen Yan, Yuefan Wang, Yuhong Guo, Zehuan Li, Zhankai Xu, Zhe Li, Zhenduo Zhang, Zhengke Gui, Zhenxuan Pan, Zhenyu Huang, Zhenzhong Lan, Zhiqiang Ding, Zhiqiang Zhang, Zhixun Li, Zhizhen Liu, Zihao Wang, Zujie Wen
[ABSTRACT]We present Ring-1T, the first open-source, state-of-the-art thinking model
with trillion-scale parameters. It features 1 trillion total parameters and
activates approximately 50 billion per token. Training such models at a
trillion-parameter scale introduces unprecedented challenges, including
train-inference misalignment, inefficiencies in rollout processing, and
bottlenecks in the RL system. To address these, we pioneer three interconnected
innovations: (1) IcePop stabilizes RL training via token-level discrepancy
masking and clipping, resolving instability from training-inference mismatches;
(2) C3PO++ improves resource utilization for long rollouts under a token budget
by dynamically partitioning them, thereby obtaining high time efficiency; and
(3) ASystem, a high-performance RL framework designed to overcome the systemic
bottlenecks that impede trillion-parameter model training. Ring-1T delivers
breakthrough results across critical benchmarks: 93.4 on AIME-2025, 86.72 on
HMMT-2025, 2088 on CodeForces, and 55.94 on ARC-AGI-1. Notably, it attains a
silver medal-level result on the IMO-2025, underscoring its exceptional
reasoning capabilities. By releasing the complete 1T-parameter MoE model, we
provide the research community with direct access to cutting-edge
reasoning capabilities. This contribution marks a significant milestone in
democratizing large-scale reasoning intelligence and establishes a new baseline
for open-source model performance.
[COMMENTS]Technical Report
[LINK]http://arxiv.org/abs/2510.18855v2
[DATE]2025-10-25 16:21:06+08:00
[CATEGORIES]cs.CL
TrendFact: A Benchmark for Explainable Hotspot Perception in Fact-Checking with Natural Language Explanation
[AUTHORS]Xiaocheng Zhang, Xi Wang, Yifei Lu, Jianing Wang, Zhuangzhuang Ye, Mengjiao Bao, Peng Yan, Xiaohong Su
[ABSTRACT]Fact-checking benchmarks provide standardized testing criteria for automated
fact-checking systems, driving technological advancement. With the surge of
misinformation on social media and the emergence of various fact-checking
methods, public concern about the transparency of automated systems and the
accuracy of fact-checking for high-influence events has grown. However,
existing benchmarks fail to meet these urgent needs and are predominantly
English-centric, hindering the progress of comprehensive fact-checking. To
address these issues, we introduce TrendFact, the first benchmark capable of
evaluating hotspot perception ability (HPA) and all fact-checking tasks.
TrendFact consists of 7,643 curated samples sourced from trending platforms and
professional fact-checking datasets, as well as an evidence library containing
366,634 entries with publication dates. Additionally, to complement existing
benchmarks in evaluating system explanation consistency and HPA, we propose two
new metrics: ECS and HCPI. Experimental results show that current fact-checking
systems face significant limitations when evaluated on TrendFact, which
facilitates the development of more robust fact-checking methods. Furthermore,
to enhance the capabilities of existing advanced fact-checking systems, the
reasoning large language models (RLMs), we propose FactISR, a reasoning
framework that integrates dynamic evidence augmentation with influence
score-based iterative self-reflection. FactISR effectively improves RLM’s
performance, offering new insights into explainable and complex fact-checking.
[LINK]http://arxiv.org/abs/2410.15135v4
[DATE]2025-10-25 16:19:39+08:00
[CATEGORIES]cs.CL
The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I)
[AUTHORS]Nnamdi Aghanya, Jun Li, Kewei Wang
[ABSTRACT]Large Language Models (LLMs) can achieve near-optimal lossless compression by
acting as powerful probability models. We investigate their use in the lossy
domain, where reconstruction fidelity is traded for higher compression ratios.
This paper introduces Error-Bounded Predictive Coding (EPC), a lossy text codec
that leverages a Masked Language Model (MLM) as a decompressor. Instead of
storing a subset of original tokens, EPC allows the model to predict masked
content and stores minimal, rank-based corrections only when the model’s top
prediction is incorrect. This creates a residual channel that offers continuous
rate-distortion control. We compare EPC to a simpler Predictive Masking (PM)
baseline and a transform-based Vector Quantisation with a Residual Patch
(VQ+RE) approach. Through an evaluation that includes precise bit accounting
and rate-distortion analysis, we demonstrate that EPC consistently dominates
PM, offering superior fidelity at a significantly lower bit rate by more
efficiently utilising the model’s intrinsic knowledge.
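Illustrative sketch (not taken from the paper): the rank-based correction idea described above can be mimicked with a toy predictor. Here a bigram frequency table stands in for the masked language model, and the names `build_ranker`, `encode`, `decode`, and the `max_rank` knob are hypothetical. A correction is stored only when the predictor's top guess is wrong; dropping high-rank corrections trades fidelity for rate.

```python
from collections import Counter, defaultdict


def build_ranker(corpus_tokens):
    """Toy stand-in for an MLM: rank candidates by how often they follow `prev`."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        follow[prev][nxt] += 1
    vocab = sorted(set(corpus_tokens))

    def rank(prev):
        counts = follow[prev]
        # Most frequent first; ties broken alphabetically so both sides agree.
        return sorted(vocab, key=lambda t: (-counts[t], t))

    return rank


def encode(tokens, masked, rank, max_rank=None):
    """Keep unmasked tokens; store a rank only where the top guess is wrong.

    Assumes position 0 is never masked. `max_rank` is a hypothetical lossy
    knob: corrections with a larger rank are dropped.
    """
    kept = {i: t for i, t in enumerate(tokens) if i not in masked}
    corrections = {}
    recon = list(tokens)  # what the decoder will see, position by position
    for i in sorted(masked):
        candidates = rank(recon[i - 1])
        r = candidates.index(tokens[i])
        if r == 0:
            continue  # top prediction is right: nothing to store
        if max_rank is None or r <= max_rank:
            corrections[i] = r
        else:
            recon[i] = candidates[0]  # accept the model's guess (lossy)
    return kept, corrections


def decode(length, kept, corrections, rank):
    """Rebuild the text from kept tokens, predictions, and stored ranks."""
    out = [None] * length
    for i in range(length):
        if i in kept:
            out[i] = kept[i]
        else:
            out[i] = rank(out[i - 1])[corrections.get(i, 0)]
    return out


tokens = "the cat sat on the mat and the cat ran".split()
rank = build_ranker(tokens)
kept, corrections = encode(tokens, {1, 5, 9}, rank)
restored = decode(len(tokens), kept, corrections, rank)
```

With all corrections kept the text is restored exactly; calling `encode` with `max_rank=0` stores no corrections and lets the predictor's top guess stand at every masked position.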
[COMMENTS]12 pages, 7 figures
[LINK]http://arxiv.org/abs/2510.22207v1
[DATE]2025-10-25 16:18:31+08:00
[CATEGORIES]cs.LG cs.CL
A Simple Linear Patch Revives Layer-Pruned Large Language Models
[AUTHORS]Xinrui Chen, Haoli Bai, Tao Yuan, Ruikang Liu, Kang Zhao, Xianzhi Yu, Lu Hou, Tian Guan, Yonghong He, Chun Yuan
[ABSTRACT]Layer pruning has emerged as a widely used technique for compressing large
language models (LLMs). However, existing layer pruning approaches often incur
substantial performance degradation. We attribute the majority of this
degradation to a single yet previously overlooked issue: the mismatch
of activation magnitudes at the pruning interface. The pre-interface
activations exhibit significantly different scales from the post-interface
ones, causing a distributional shift as they propagate through the remaining
layers. To address this issue, we introduce LinearPatch, a lightweight
and plug-and-play technique that fuses two operations into one matrix multiply
at the pruning interface: (i) a Hadamard transformation that suppresses massive
outliers at particular tokens and (ii) a channel-wise scaling that aligns
activation statistics. On LLaMA-3-8B, LinearPatch preserves up to
94.15% of the original model’s performance when pruning 5 out of 32
layers, outperforming the previous state of the art by 4%. The patch
can be further refined with 5K unlabeled samples via memory-efficient offline
distillation, pushing the retention to 95.16% within only 30 minutes on a
single GPU. Code is available at
https://github.com/chenxinrui-tsinghua/LinearPatch.
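Illustrative sketch (an assumption about how such a fusion could look, not the authors' code): a Hadamard rotation, a per-channel scaling, and the inverse rotation can be folded into a single matrix, so the patch costs one matrix multiply at inference. The function names and the example scale values are hypothetical; how the scales are estimated is left out.

```python
def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    scale = n ** -0.5
    return [[v * scale for v in row] for row in H]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def fused_patch(scales):
    """Return P = H @ diag(scales) @ H^T as one matrix."""
    H = hadamard(len(scales))
    HD = [[v * s for v, s in zip(row, scales)] for row in H]  # H @ diag(scales)
    Ht = [list(col) for col in zip(*H)]
    return matmul(HD, Ht)


def apply(x, M):
    """Row vector times matrix."""
    return [sum(xi * M[i][j] for i, xi in enumerate(x)) for j in range(len(M[0]))]


x = [1.0, -2.0, 3.0, 0.5]          # one activation vector (made-up values)
scales = [0.5, 2.0, 1.0, 4.0]      # hypothetical channel-wise scales
patched = apply(x, fused_patch(scales))
```

Applying `fused_patch(scales)` gives the same result as rotating, scaling, and rotating back in three separate steps, which is the point of fusing them.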
[COMMENTS]26 pages, accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.24680v2
[DATE]2025-10-25 15:24:08+08:00
[CATEGORIES]cs.CL
M-CIF: Multi-Scale Alignment For CIF-Based Non-Autoregressive ASR
[AUTHORS]Ruixiang Mao, Xiangnan Ma, Qing Yang, Ziming Zhu, Yucheng Qiao, Yuan Ge, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu
[ABSTRACT]The Continuous Integrate-and-Fire (CIF) mechanism provides effective
alignment for non-autoregressive (NAR) speech recognition. This mechanism
creates a smooth and monotonic mapping from acoustic features to target tokens,
achieving performance on Mandarin competitive with other NAR approaches.
However, without finer-grained guidance, its stability degrades in some
languages such as English and French. In this paper, we propose Multi-scale CIF
(M-CIF), which performs multi-level alignment by integrating character and
phoneme level supervision progressively distilled into subword representations,
thereby enhancing robust acoustic-text alignment. Experiments show that M-CIF
reduces WER compared to the Paraformer baseline, especially on CommonVoice by
4.21% in German and 3.05% in French. To further investigate these gains, we
define phonetic confusion errors (PE) and space-related segmentation errors
(SE) as evaluation metrics. Analysis of these metrics across different M-CIF
settings reveals that the phoneme and character layers are essential for
enhancing progressive CIF alignment.
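Illustrative sketch (the baseline CIF loop as it is commonly described, not the multi-scale M-CIF model): each acoustic frame carries a weight, weights are accumulated until they reach a threshold, and at that point one token-level vector is emitted. The function name `cif` and the toy inputs are hypothetical, and the sketch assumes each weight is at most the threshold, so a frame fires at most once.

```python
def cif(frames, alphas, threshold=1.0):
    """Integrate weighted frames and fire one output per unit of weight."""
    dim = len(frames[0])
    acc = 0.0                      # weight accumulated since the last firing
    integrated = [0.0] * dim       # weighted sum of frames since the last firing
    fired = []
    for h, a in zip(frames, alphas):
        if acc + a < threshold:
            # Not enough weight yet: keep integrating.
            acc += a
            integrated = [s + a * v for s, v in zip(integrated, h)]
        else:
            # Spend just enough of this frame's weight to reach the threshold,
            # emit a token vector, and carry the remainder to the next token.
            used = threshold - acc
            fired.append([s + used * v for s, v in zip(integrated, h)])
            rest = a - used
            acc = rest
            integrated = [rest * v for v in h]
    return fired


frames = [[1.0], [2.0], [3.0], [4.0]]   # four 1-dimensional "acoustic" frames
alphas = [0.5, 0.75, 0.5, 0.25]         # weights summing to 2.0
tokens = cif(frames, alphas)
```

Because the weights sum to 2.0, the loop emits two token vectors, and frame order is preserved, which is the monotonic mapping the abstract refers to.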
[LINK]http://arxiv.org/abs/2510.22172v1
[DATE]2025-10-25 13:51:02+08:00
[CATEGORIES]cs.CL
ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization
[AUTHORS]YuXuan Zhang
[COMMENTS]This version fixes some minor typographical errors and adds more
explanations to ensure clarity in presentation
[LINK]http://arxiv.org/abs/2507.03069v3
[DATE]2025-10-25 13:45:15+08:00
[CATEGORIES]cs.CL
SentiMaithili: A Benchmark Dataset for Sentiment and Reason Generation for the Low-Resource Maithili Language
[AUTHORS]Rahul Ranjan, Mahendra Kumar Gurve, Anuj, Nitin, Yamuna Prasad
[ABSTRACT]Developing benchmark datasets for low-resource languages poses significant
challenges, primarily due to the limited availability of native linguistic
experts and the substantial time and cost involved in annotation. Given these
challenges, Maithili is still underrepresented in natural language processing
research. It is an Indo-Aryan language spoken by more than 13 million people in
the Purvanchal region of India, valued for its rich linguistic structure and
cultural significance. While sentiment analysis has achieved remarkable
progress in high-resource languages, resources for low-resource languages, such
as Maithili, remain scarce, often restricted to coarse-grained annotations and
lacking interpretability mechanisms. To address this limitation, we introduce a
novel dataset comprising 3,221 Maithili sentences annotated for sentiment
polarity and accompanied by natural language justifications. Moreover, the
dataset is carefully curated and validated by linguistic experts to ensure both
label reliability and contextual fidelity. Notably, the justifications are
written in Maithili, thereby promoting culturally grounded interpretation and
enhancing the explainability of sentiment models. Furthermore, extensive
experiments using both classical machine learning and state-of-the-art
transformer architectures demonstrate the dataset’s effectiveness for
interpretable sentiment analysis. Ultimately, this work establishes the first
benchmark for explainable affective computing in Maithili, thus contributing a
valuable resource to the broader advancement of multilingual NLP and
explainable AI.
[LINK]http://arxiv.org/abs/2510.22160v1
[DATE]2025-10-25 12:58:18+08:00
[CATEGORIES]cs.CL
Gatsby Without the ‘E’: Crafting Lipograms with LLMs
[AUTHORS]Rohan Balasubramanian, Nitish Gokulakrishnan, Syeda Jannatus Saba, Steven Skiena
[ABSTRACT]Lipograms are a unique form of constrained writing where all occurrences of a
particular letter are excluded from the text, typified by the novel Gadsby,
which daringly avoids all usage of the letter ‘e’. In this study, we explore
the power of modern large language models (LLMs) by transforming F. Scott
Fitzgerald’s novel The Great Gatsby into a fully ‘e’-less text. We experimented
with a range of techniques, from baseline methods like synonym replacement to
sophisticated generative models enhanced with beam search and named entity
analysis. We show that excluding up to 3.6% of the most common letters (up to
the letter ‘u’) had minimal impact on the text’s meaning, although translation
fidelity rapidly and predictably decays with stronger lipogram constraints. Our
work highlights the surprising flexibility of English under strict constraints,
revealing just how adaptable and creative language can be.
[LINK]http://arxiv.org/abs/2505.20501v2
[DATE]2025-10-25 12:26:37+08:00
[CATEGORIES]cs.CL
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
[AUTHORS]Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
[ABSTRACT]Reinforcement learning with verifiable rewards (RLVR) is a promising approach
for training language models (LMs) on reasoning tasks that elicit emergent long
chains of thought (CoTs). Unlike supervised learning, it updates the model
using both correct and incorrect samples via policy gradients. To better
understand its mechanism, we decompose the learning signal into reinforcing
correct responses and penalizing incorrect ones, referred to as Positive and
Negative Sample Reinforcement (PSR and NSR), respectively. We train
Qwen2.5-Math-7B, Qwen3-4B and Llama-3.1-8B-Instruct on a mathematical reasoning
dataset and uncover a surprising result: training with only negative samples –
without reinforcing correct responses – can be highly effective: it
consistently improves performance over the base model across the entire
Pass@$k$ spectrum ($k$ up to 256), often matching or surpassing PPO and GRPO. In
contrast, reinforcing only correct responses improves Pass@1 but degrades
performance at higher $k$, due to reduced diversity. These inference-scaling
trends highlight that solely penalizing incorrect responses may contribute more
to performance than previously recognized. Through gradient analysis, we show
that NSR works by suppressing incorrect generations and redistributing
probability mass toward other plausible candidates, guided by the model’s prior
beliefs. It refines the model’s existing knowledge rather than introducing
entirely new behaviors. Building on this insight, we propose a simple variant
of the RL objective that upweights NSR, and show that it consistently improves
overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is
available at https://github.com/TianHongZXY/RLVR-Decomposed.
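Illustrative sketch (a simplified assumption, not the paper's training objective): for a softmax policy over a handful of candidate tokens, one REINFORCE-style step with reward -1 on a sampled incorrect token lowers that token's logit and raises every other logit in proportion to its current probability. The function names, logits, and learning rate below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nsr_step(logits, wrong, lr=1.0):
    """One gradient-ascent step on -log pi(wrong) with respect to the logits.

    d(-log pi_wrong)/dz_j = pi_j - 1[j == wrong], so the penalised token is
    pushed down and the others are pushed up in proportion to their prior.
    """
    p = softmax(logits)
    return [z + lr * (p[j] - (1.0 if j == wrong else 0.0))
            for j, z in enumerate(logits)]


logits = [2.0, 1.0, 0.0, -1.0]   # token 0 is the sampled incorrect answer
before = softmax(logits)
after = softmax(nsr_step(logits, wrong=0))
gains = [a - b for a, b in zip(after, before)]
```

In this toy setting the penalised token loses probability, and the freed mass goes mostly to the candidates the model already considered plausible, matching the redistribution described in the abstract.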
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.01347v2
[DATE]2025-10-25 11:36:42+08:00
[CATEGORIES]cs.CL cs.LG
OlaMind: Towards Human-Like and Hallucination-Safe Customer Service for Retrieval-Augmented Dialogue
[AUTHORS]Tianhong Gao, Jundong Shen, Bei Shi, Jiapeng Wang, Ying Ju, Junfeng Yao, Jiao Ran, Yong Zhang, Lin Dong, Huiyu Yu, Tingting Ye
[ABSTRACT]Intelligent customer service (ICS) systems via retrieval-augmented generation
(RAG) have been widely adopted in Web-based domains such as social platforms
and e-commerce, achieving remarkable improvements in automation and efficiency.
However, notable limitations still remain: these systems are prone to
hallucinations and often generate rigid, mechanical responses, which can
introduce business risks and undermine user experience, especially in Web-based
customer service interactions under the RAG scenarios. In this paper, we
introduce OlaMind, a human-like and hallucination-safe customer service
framework for retrieval-augmented dialogue. Specifically, it first leverages a
Learn-to-Think stage to learn the reasoning processes and response strategies
from human experts, and then employs a Learn-to-Respond stage to perform
cold-start supervised fine-tuning (SFT) combined with reinforcement learning
(RL) for basic-to-hard self-refinement. Our method significantly enhances
human-likeness and naturalness while effectively mitigating hallucinations and
critical business risks. We have conducted large-scale online A/B experiments
in an industry-level social customer service setting, and extensive
experimental results show that OlaMind achieves significant cumulative relative
improvements, with intelligent resolution rates of +28.92%/+18.42% and human
takeover rates of -6.08%/-7.12% in community-support/livestream-interaction
scenarios, respectively, which highlights its consistent effectiveness across
diverse real-world applications. The code and data will be publicly available.
[LINK]http://arxiv.org/abs/2510.22143v1
[DATE]2025-10-25 11:29:55+08:00
[CATEGORIES]cs.CL
LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction
[AUTHORS]Yuhang Gao, Xiang Xiang, Sheng Zhong, Guoyou Wang
[ABSTRACT]Vision-Language Models (VLMs) have shown significant progress in open-set
challenges. However, the limited availability of 3D datasets hinders their
effective application in 3D scene understanding. We propose LOC, a general
language-guided framework adaptable to various occupancy networks, supporting
both supervised and self-supervised learning paradigms. For self-supervised
tasks, we employ a strategy that fuses multi-frame LiDAR points for
dynamic/static scenes, using Poisson reconstruction to fill voids, and
assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain
comprehensive voxel representations. To mitigate feature over-homogenization
caused by direct high-dimensional feature distillation, we introduce Densely
Contrastive Learning (DCL). DCL leverages dense voxel semantic information and
predefined textual prompts. This efficiently enhances open-set recognition
without dense pixel-level supervision, and our framework can also leverage
existing ground truth to further improve performance. Our model predicts dense
voxel features embedded in the CLIP feature space, integrating textual and
image pixel information, and classifies based on text and semantic similarity.
Experiments on the nuScenes dataset demonstrate the method’s superior
performance, achieving high-precision predictions for known classes and
distinguishing unknown classes without additional training data.
[LINK]http://arxiv.org/abs/2510.22141v1
[DATE]2025-10-25 11:27:19+08:00
[CATEGORIES]cs.CL cs.LG
Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs
[AUTHORS]Jinzhe Liu, Junshu Sun, Shufan Shen, Chenxue Yang, Shuhui Wang
[ABSTRACT]Lifelong knowledge editing enables continuous, precise updates to outdated
knowledge in large language models (LLMs) without computationally expensive
full retraining. However, existing methods often accumulate errors throughout
the editing process, causing a gradual decline in both editing accuracy and
generalization. To tackle this problem, we propose Neuron-Specific Masked
Knowledge Editing (NMKE), a novel fine-grained editing framework that combines
neuron-level attribution with dynamic sparse masking. Leveraging neuron
functional attribution, we identify two key types of knowledge neurons:
knowledge-general neurons, which activate consistently across prompts, and
knowledge-specific neurons, which activate for specific prompts. NMKE further
introduces an entropy-guided dynamic sparse mask that locates the neurons
relevant to the target knowledge. This strategy enables precise neuron-level knowledge
editing with fewer parameter modifications. Experimental results from thousands
of sequential edits demonstrate that NMKE outperforms existing methods in
maintaining high editing success rates and preserving the model’s general
capabilities in lifelong editing.
[COMMENTS]19 pages, 11 figures, Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.22139v1
[DATE]2025-10-25 11:22:59+08:00
[CATEGORIES]cs.LG cs.CL
Trusted Knowledge Extraction for Operations and Maintenance Intelligence
[AUTHORS]Kathleen P. Mealey, Jonathan A. Karr Jr., Priscila Saboia Moreira, Paul R. Brenner, Charles F. Vardeman II
[ABSTRACT]Deriving operational intelligence from organizational data repositories is a
key challenge due to the dichotomy of data confidentiality vs data integration
objectives, as well as the limitations of Natural Language Processing (NLP)
tools relative to the specific knowledge structure of domains such as
operations and maintenance. In this work, we discuss Knowledge Graph
construction and break down the Knowledge Extraction process into its Named
Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation
Extraction functional components. We then evaluate sixteen NLP tools in concert
with or in comparison to the rapidly advancing capabilities of Large Language
Models (LLMs). We focus on the operational and maintenance intelligence use
case for trusted applications in the aircraft industry. A baseline dataset is
derived from a rich public domain US Federal Aviation Administration dataset
focused on equipment failures or maintenance requirements. We assess the
zero-shot performance of NLP and LLM tools that can be operated within a
controlled, confidential environment (no data is sent to third parties). Based
on our observation of significant performance limitations, we discuss the
challenges related to trusted NLP and LLM tools as well as their Technical
Readiness Level for wider use in mission-critical industries such as aviation.
We conclude with recommendations to enhance trust and provide our open-source
curated dataset to support further baseline testing and evaluation.
[LINK]http://arxiv.org/abs/2507.22935v3
[DATE]2025-10-25 11:17:52+08:00
[CATEGORIES]cs.CL
SEAL: Steerable Reasoning Calibration of Large Language Models for Free
[AUTHORS]Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang
[ABSTRACT]Large Language Models (LLMs), such as OpenAI’s o1-series, have demonstrated
compelling capabilities for complex reasoning tasks via the extended
chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal
substantial redundancy in the CoT reasoning traces, which not only increases
inference latency but also negatively impacts model performance by diverting
attention to unnecessary reasoning paths. To address this issue, we investigate
the internal reasoning structures of LLMs and categorize them into three
primary thought types: execution, reflection, and transition thoughts.
Moreover, our analysis reveals that excessive reflection and transition
thoughts are strongly correlated with failure cases and these thought
categories exhibit clear separation in the latent space. Based on these, we
introduce SEAL (Steerable reasoning calibration), a training-free approach that
seamlessly calibrates the CoT process, improving accuracy while demonstrating
significant efficiency gains. SEAL consists of an offline stage for extracting
the reasoning steering vector in the latent space, followed by an on-the-fly
calibration of the reasoning trace through representation intervention using
the steering vector. Notably, the steering vector exhibits strong
transferability across various tasks. Extensive experiments across multiple
models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500,
GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11%
improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our
code is publicly available at https://github.com/VITA-Group/SEAL.
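Illustrative sketch (an assumption about the general recipe, not the released SEAL code): a steering vector can be formed offline as the difference between the mean hidden state of execution thoughts and the mean hidden state of reflection and transition thoughts, then added to a hidden state at inference time. The function names, the strength `alpha`, and the two-dimensional toy vectors are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def steering_vector(execution_states, other_states):
    """Offline stage: direction from reflection/transition toward execution."""
    mu_exec = mean_vector(execution_states)
    mu_other = mean_vector(other_states)
    return [a - b for a, b in zip(mu_exec, mu_other)]


def steer(hidden, vector, alpha=1.0):
    """On-the-fly stage: shift one hidden state along the steering vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


execution = [[1.0, 0.0], [3.0, 0.0]]    # toy hidden states of execution thoughts
reflection = [[0.0, 1.0], [0.0, 3.0]]   # toy hidden states of reflection/transition
v = steering_vector(execution, reflection)
shifted = steer([1.0, 1.0], v, alpha=0.5)
```

No model weights change in this scheme, which is why such an intervention can be described as training-free.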
[LINK]http://arxiv.org/abs/2504.07986v3
[DATE]2025-10-25 11:17:22+08:00
[CATEGORIES]cs.CL
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation
[AUTHORS]Ling-Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bingwei Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chili Fu, Chunshao Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Dongke Hu, Fangzheng Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Zhang, Hailin Zhao, Hanxiao Zhang, Hanzi Wang, Hao Qian, Haoyi Yu, Heng Zhang, Hongliang Zhang, Hongzhi Luan, Huirong Dong, Huizhong Li, Jia Li, Jia Liu, Jialong Zhu, Jian Sha, Jianping Wei, Jiaolong Yang, Jieyue Ma, Jiewei Wu, Jinjing Huang, Jingyun Tian, Jingyuan Zhang, Jinquan Sun, Juanhui Tu, Jun Liu, Jun Xu, Jun Zhou, Junjie Ou, Junpeng Fang, Kaihong Zhang, Kaiqin Hu, Ke Shi, Kun Tang, Kunlong Chen, Lanyin Mei, Lei Liang, Lei Xu, Libo Zhang, Lin Ju, Lin Yuan, Ling Zhong, Lintao Ma, Lu Liu, Lu Yu, Lun Cai, Meiqi Zhu, Mengying Li, Min Chen, Minghao Xue, Minghong Cai, Mingming Yin, Peijie Jiang, Peilong Zhao, Pingping Liu, Qian Zhao, Qing Cui, Qingxiang Huang, Qingyuan Yang, Quankun Yu, Shaowei Wei, Shijie Lian, Shoujian Zheng, Shun Song, Shungen Zhang, Shuo Zhang, Siyuan Li, Song Liu, Ting Guo, Tong Zhao, Wanli Gu, Weichang Wu, Weiguang Han, Wenjing Fang, Wubin Wang, Xiang Shu, Xiao Shi, Xiaoshun Lan, Xiaolu Zhang, Xiaqing Sun, Xin Zhao, Xingyu Lu, Xiong Xu, Xudong Wang, Xudong Wang, Xuemin Yang, Yajie Yang, Yang Xiang, Yanzhe Li, Yi Zhang, Yilong Wang, Yingxue Li, Yongzhen Guo, Yuzhuo Fu, Yuanyuan Wang, Yue Yang, Yue Yu, Yufeng Deng, Yun Zhang, Yunfei Xu, Yuqi Zhang, Yuxiao He, Zengke Gui, Zhaoxin Huan, Zhaoyang Wang, Zhibo Zhu, Zhihao Wang, Zhiqiang Zhang, Zhoufei Wang, Zihang Zeng, Ziqi Liu, Zitao Xuan, Zuoli Tang
[ABSTRACT]We introduce Ling 2.0, a series of reasoning-oriented language foundation models built
upon the principle that every activation boosts reasoning capability. Designed
to scale from tens of billions to one trillion parameters under a unified
Mixture-of-Experts (MoE) paradigm, Ling 2.0 emphasizes high sparsity,
cross-scale consistency, and efficiency guided by empirical scaling laws. The
series includes three non-thinking (instruct) models - Ling-mini-2.0,
Ling-flash-2.0, and Ling-1T - ranging from 16B to 1T total parameters and
achieving up to 7-fold active-compute efficiency compared with dense
counterparts. Ling 2.0 integrates coordinated innovations across model
architecture, pre-training, post-training, and infrastructure: a high-sparsity
MoE with MTP for efficient reasoning, reasoning-oriented data and mid-training
CoT activation, reinforcement-based fine-tuning (DFT, Evo-CoT), and full-scale
FP8 training with fine-grained heterogeneous pipelines. At the trillion scale,
Ling-1T establishes a new Pareto frontier of reasoning accuracy versus
computational efficiency, demonstrating that sparse activation, when properly
aligned with reasoning objectives, enables scalable and efficient intelligence.
Collectively, Ling 2.0 provides a coherent, open, and efficient foundation for
advancing future reasoning and thinking models, including the Ring series built
upon the same base.
[COMMENTS]Ling 2.0 Technical Report
[LINK]http://arxiv.org/abs/2510.22115v1
[DATE]2025-10-25 09:51:37+08:00
[CATEGORIES]cs.CL
Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows
[AUTHORS]Billy Dickson, Zoran Tiganj
[ABSTRACT]Most approaches to long-context processing increase the complexity of the
transformer’s internal architecture by integrating mechanisms such as
recurrence or auxiliary memory modules. In this work, we introduce an
alternative approach that modifies the input representation itself, rather than
the transformer architecture. Inspired by cognitive models of human memory, our
method applies a scale-invariant logarithmic compression to the input tokens.
The resulting compressed representation is processed by a standard, unmodified
transformer, preserving architectural simplicity. We evaluate this approach on
the WikiText-103 and PG-19 language modeling benchmarks, showing a reduction in
perplexity compared to uncompressed baselines. Moreover, performance improves
consistently with longer compressed temporal contexts, showing that input-level
logarithmic compression is a simple and effective way to extend a transformer’s
long-range memory.
[LINK]http://arxiv.org/abs/2510.22109v1
[DATE]2025-10-25 09:29:37+08:00
[CATEGORIES]cs.CL
Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
[AUTHORS]Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
[ABSTRACT]Recent advances in language modeling have demonstrated the effectiveness of
State Space Models (SSMs) for efficient sequence modeling. While hybrid
architectures such as Samba and the decoder-decoder architecture, YOCO, have
shown promising performance gains over Transformers, prior works have not
investigated the efficiency potential of representation sharing between SSM
layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet
effective mechanism for efficient memory sharing across layers. We apply it to
create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in
the cross-decoder to share memory readout states from a Samba-based
self-decoder. SambaY significantly enhances decoding efficiency, preserves
linear pre-filling time complexity, and boosts long-context performance, all
while eliminating the need for explicit positional encoding. Through extensive
scaling experiments, we demonstrate that our model exhibits a significantly
lower irreducible loss compared to a strong YOCO baseline, indicating superior
performance scalability under large-scale compute regimes. Our largest model
enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves
significantly better performance than Phi4-mini-Reasoning on reasoning tasks
such as Math500, AIME24/25, and GPQA Diamond without any reinforcement
learning, while delivering up to 10x higher decoding throughput on 2K-length
prompts with 32K generation length under the vLLM inference framework. We
release our training codebase on open-source data at
https://github.com/microsoft/ArchScale.
[COMMENTS]Accepted by NeurIPS 2025. Camera-ready Version
[LINK]http://arxiv.org/abs/2507.06607v3
[DATE]2025-10-25 09:16:13+08:00
[CATEGORIES]cs.CL cs.LG
Generalization or Memorization: Dynamic Decoding for Mode Steering
[AUTHORS]Xuanming Zhang
[ABSTRACT]Large Language Models (LLMs) exhibit a troubling duality, capable of both
remarkable generalization and brittle, verbatim memorization of their training
data. This unpredictability undermines their reliability in high-stakes
applications. In this work, we propose a unified framework to understand,
identify, and control these distinct reasoning modes. First, we introduce a
theoretical model based on the Information Bottleneck (IB) principle,
formalizing generalization as the learning of a compressed, task-relevant
representation and memorization as a failure to compress. Building on this
theory, we develop Dynamic Mode Steering (DMS), a novel inference-time
algorithm which comprises two components: (1) a lightweight, causally-grounded
linear probe that identifies the model’s instantaneous reliance on
memorization, and (2) a dynamic activation steering mechanism that nudges the
model’s computation towards pre-identified generalization circuits. We frame
DMS as a form of adaptive, self-contrastive decoding. Experiments on reasoning
and faithfulness tasks demonstrate that DMS significantly improves logical
consistency and factual accuracy, thereby offering a principled approach to
enhancing LLM reliability.
[LINK]http://arxiv.org/abs/2510.22099v1
[DATE]2025-10-25 08:50:47+08:00
[CATEGORIES]cs.CL
Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
[AUTHORS]Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Mingyu Gao
[ABSTRACT]Leveraging attention sparsity to accelerate long-context large language
models (LLMs) has been a hot research topic. However, current algorithms such
as sparse attention or key-value (KV) cache compression tend to use a fixed
budget, which presents a significant challenge during deployment because it
fails to account for the dynamic nature of real-world scenarios, where the
optimal balance between accuracy and efficiency can vary greatly. In this
paper, we find that applying top-$p$ sampling (nucleus sampling) to sparse
attention can surprisingly achieve adaptive budgeting. Based on this, we
propose Twilight, a framework to bring adaptive sparsity to any existing sparse
attention algorithm without sacrificing its accuracy. Empirical results show
that Twilight can adaptively prune at most 98% of redundant tokens, leading to
$15.4\times$ acceleration in self-attention operations and $3.9\times$
acceleration in end-to-end per-token latency in long-context LLM decoding.
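Illustrative sketch (the basic top-p selection rule only, not the hierarchical Twilight pipeline): instead of keeping a fixed number of tokens, keep the smallest set of tokens whose attention weights cover a fraction p of the total mass. The function name `top_p_select` and the example weights are hypothetical.

```python
def top_p_select(weights, p):
    """Return the indices of the smallest set of tokens covering mass p."""
    order = sorted(range(len(weights)), key=lambda i: -weights[i])
    target = p * sum(weights)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += weights[i]
        if mass >= target:
            break  # enough attention mass is covered
    return sorted(chosen)


peaked = [0.5, 0.25, 0.125, 0.125]   # attention concentrated on a few tokens
flat = [0.25, 0.25, 0.25, 0.25]      # attention spread evenly
kept_peaked = top_p_select(peaked, 0.75)
kept_flat = top_p_select(flat, 0.75)
```

The same threshold keeps two tokens for the peaked distribution and three for the flat one, so the budget adapts to the shape of the attention weights rather than being fixed in advance.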
[COMMENTS]To appear at NeurIPS 2025 (spotlight)
[LINK]http://arxiv.org/abs/2502.02770v4
[DATE]2025-10-25 08:33:14+08:00
[CATEGORIES]cs.LG cs.CL
Embracing Trustworthy Brain-Agent Collaboration as Paradigm Extension for Intelligent Assistive Technologies
[AUTHORS]Yankai Chen, Xinni Zhang, Yifei Zhang, Yangning Li, Henry Peng Zou, Chunyu Miao, Weizhi Zhang, Xue Liu, Philip S. Yu
[COMMENTS]Accepted by NeurIPS’25 Position Track
[LINK]http://arxiv.org/abs/2510.22095v1
[DATE]2025-10-25 08:25:45+08:00
[CATEGORIES]cs.CL
Automated HIV Screening on Dutch Electronic Health Records with Large Language Models
[AUTHORS]Lang Zhou, Amrish Jhingoer, Yinghao Luo, Klaske Vliegenthart–Jongbloed, Carlijn Jordans, Ben Werkhoven, Tom Seinen, Erik van Mulligen, Casper Rokx, Yunlei Li
[ABSTRACT]Efficient screening and early diagnosis of HIV are critical for reducing
onward transmission. Although large-scale laboratory testing is not feasible,
the widespread adoption of Electronic Health Records (EHRs) offers new
opportunities to address this challenge. Existing research primarily focuses on
applying machine learning methods to structured data, such as patient
demographics, for improving HIV diagnosis. However, these approaches often
overlook unstructured text data such as clinical notes, which potentially
contain valuable information relevant to HIV risk. In this study, we propose a
novel pipeline that leverages a Large Language Model (LLM) to analyze
unstructured EHR text and determine a patient’s eligibility for further HIV
testing. Experimental results on clinical data from Erasmus University Medical
Center Rotterdam demonstrate that our pipeline achieved high accuracy while
maintaining a low false negative rate.
[COMMENTS]28 pages, 6 figures
[LINK]http://arxiv.org/abs/2510.19879v2
[DATE]2025-10-25 08:12:18+08:00
[CATEGORIES]cs.CL
Jailbreak Mimicry: Automated Discovery of Narrative-Based Jailbreaks for Large Language Models
[AUTHORS]Pavlos Ntais
[ABSTRACT]Large language models (LLMs) remain vulnerable to sophisticated prompt
engineering attacks that exploit contextual framing to bypass safety
mechanisms, posing significant risks in cybersecurity applications. We
introduce Jailbreak Mimicry, a systematic methodology for training compact
attacker models to automatically generate narrative-based jailbreak prompts in
a one-shot manner. Our approach transforms adversarial prompt discovery from
manual craftsmanship into a reproducible scientific process, enabling proactive
vulnerability assessment in AI-driven security systems. Developed for the
OpenAI GPT-OSS-20B Red-Teaming Challenge, we use parameter-efficient
fine-tuning (LoRA) on Mistral-7B with a curated dataset derived from AdvBench,
achieving an 81.0% Attack Success Rate (ASR) against GPT-OSS-20B on a held-out
test set of 200 items. Cross-model evaluation reveals significant variation in
vulnerability patterns: our attacks achieve 66.5% ASR against GPT-4, 79.5% on
Llama-3 and 33.0% against Gemini 2.5 Flash, demonstrating both broad
applicability and model-specific defensive strengths in cybersecurity contexts.
This represents a 54x improvement over direct prompting (1.5% ASR) and
demonstrates systematic vulnerabilities in current safety alignment approaches.
Our analysis reveals that technical domains (Cybersecurity: 93% ASR) and
deception-based attacks (Fraud: 87.8% ASR) are particularly vulnerable,
highlighting threats to AI-integrated threat detection, malware analysis, and
secure systems, while physical harm categories show greater resistance (55.6%
ASR). We employ automated harmfulness evaluation using Claude Sonnet 4,
cross-validated with human expert assessment, ensuring reliable and scalable
evaluation for cybersecurity red-teaming. Finally, we analyze failure
mechanisms and discuss defensive strategies to mitigate these vulnerabilities
in AI for cybersecurity.
[COMMENTS]18 pages, 5 figures
[LINK]http://arxiv.org/abs/2510.22085v1
[DATE]2025-10-25 07:53:16+08:00
[CATEGORIES]cs.CL cs.LG
Compositional Bias Control in Large Language Models: Preference Learning Fails, Supervision Succeeds
[AUTHORS]Atij Mahesh
[ABSTRACT]Large Language Models (LLMs) still produce gender-stereotyped language even
in occupation-neutral contexts that reflect deep societal biases (Rudinger et
al., 2018). To address this, prior work has proposed prompting, constrained
decoding (Dathathri et al., 2020; Zhou et al., 2024), post-processing, and
fine-tuning-based alignment (Rafailov et al., 2023; Ravfogel et al., 2022).
However, their comparative efficacy and learning dynamics remain poorly
understood. We report a comparative analysis of six control techniques for bias
mitigation: prompt-only, generate-and-filter, DFA-based Ctrl-G decoding,
Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and
Iterative Nullspace Projection (INLP). We evaluate each method on a
compositional constraint task. This task requires generating sentences that
contain at least one agentic and one communal descriptor for each of the twenty
Winogender-derived occupations. We quantify trade-offs between control strength
and naturalness with evaluations of constraint compliance, lexical diversity,
and fluency. Our results reveal key contrasts among the methods: SFT achieves
99.87 ± 0.15% compliance and high lexical diversity, while DPO, despite
similar training stability, fails at 4.53 ± 0.82%. Ctrl-G guarantees perfect
compliance, but at the cost of severely reduced fluency and diversity.
Preference-based learning fundamentally differs: it cannot satisfy
compositional constraints, as binary preference signals encode ranking, not
logical conjunctions. Only explicit positive supervision enables mitigation of
compositional biases; preference-based alignment fails to generalize logical
structures, underscoring the limitations of preference learning and the
necessity of explicit supervision for fair and fluent controlled generation.
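The compositional constraint described above (at least one agentic and one communal descriptor per sentence) can be sketched as a simple compliance check. This is only an illustration: the two word lists below are hypothetical placeholders, not the lexicons used in the paper, and the function names are made up for this example.

```python
import re

# Hypothetical descriptor lexicons, for illustration only.
AGENTIC = {"assertive", "decisive", "ambitious"}
COMMUNAL = {"caring", "supportive", "warm"}


def is_compliant(sentence):
    """True if the sentence has at least one agentic AND one communal word."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(tokens & AGENTIC) and bool(tokens & COMMUNAL)


def compliance_rate(sentences):
    """Fraction of sentences that satisfy the conjunction constraint."""
    if not sentences:
        return 0.0
    return sum(is_compliant(s) for s in sentences) / len(sentences)
```

The check is a logical conjunction, so a sentence containing only one descriptor type counts as a failure.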
[COMMENTS]20 pages
[LINK]http://arxiv.org/abs/2510.22084v1
[DATE]2025-10-25 07:52:37+08:00
[CATEGORIES]cs.CL
Emotions Where Art Thou: Understanding and Characterizing the Emotional Latent Space of Large Language Models
[AUTHORS]Benjamin Reichman, Adar Avsian, Larry Heck
[LINK]http://arxiv.org/abs/2510.22042v1
[DATE]2025-10-25 05:54:12+08:00
[CATEGORIES]cs.CL
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
[AUTHORS]Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell, Alex Pentland, Sercan Arik, Chen-Yu Lee, Sayna Ebrahimi
[ABSTRACT]Scaling laws research has focused overwhelmingly on English – yet the most
prominent AI models explicitly serve billions of international users. In this
work, we undertake the largest multilingual scaling laws study to date,
totaling 774 multilingual training experiments, spanning 10M-8B model
parameters, 400+ training languages and 48 evaluation languages. We introduce
the Adaptive Transfer Scaling Law (ATLAS) for both monolingual and multilingual
pretraining, which outperforms existing scaling laws’ out-of-sample
generalization often by more than 0.3 R^2. Our analyses of the experiments shed
light on multilingual learning dynamics, transfer properties between languages,
and the curse of multilinguality. First, we derive a cross-lingual transfer
matrix, empirically measuring mutual benefit scores between 38 x 38=1444
language pairs. Second, we derive a language-agnostic scaling law that reveals
how to optimally scale model size and data when adding languages without
sacrificing performance. Third, we identify the computational crossover points
for when to pretrain from scratch versus finetune from multilingual
checkpoints. We hope these findings provide the scientific foundation for
democratizing scaling laws across languages, and enable practitioners to
efficiently scale models – beyond English-first AI.
[LINK]http://arxiv.org/abs/2510.22037v1
[DATE]2025-10-25 05:45:22+08:00
[CATEGORIES]cs.CL cs.LG
Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
[AUTHORS]Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag
[ABSTRACT]Quality Estimation (QE) metrics are vital in machine translation for
reference-free evaluation and as a reward signal in tasks like reinforcement
learning. However, the prevalence and impact of length bias in QE have been
underexplored. Through a systematic study of top-performing regression-based
and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two
critical length biases: First, QE metrics consistently over-predict errors with
increasing translation length, even for high-quality, error-free texts. Second,
they exhibit a preference for shorter translations when multiple candidates are
available for the same source text. These inherent length biases risk unfairly
penalizing longer, correct translations and can lead to sub-optimal
decision-making in applications such as QE reranking and QE guided
reinforcement learning. To mitigate this, we propose two strategies: (a)
applying length normalization during model training, and (b) incorporating
reference texts during evaluation. Both approaches were found to effectively
reduce the identified length bias.
[LINK]http://arxiv.org/abs/2510.22028v1
[DATE]2025-10-25 05:22:06+08:00
[CATEGORIES]cs.CL
Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
[AUTHORS]Sarah Ball, Niki Hasrati, Alexander Robey, Avi Schwarzschild, Frauke Kreuter, Zico Kolter, Andrej Risteski
[ABSTRACT]Discrete optimization-based jailbreaking attacks on large language models aim
to generate short, nonsensical suffixes that, when appended onto input prompts,
elicit disallowed content. Notably, these suffixes are often transferable –
succeeding on prompts and models for which they were never optimized. And yet,
despite the fact that transferability is surprising and empirically
well-established, the field lacks a rigorous analysis of when and why transfer
occurs. To fill this gap, we identify three statistical properties that
strongly correlate with transfer success across numerous experimental settings:
(1) how much a prompt without a suffix activates a model’s internal refusal
direction, (2) how strongly a suffix induces a push away from this direction,
and (3) how large these shifts are in directions orthogonal to refusal. On the
other hand, we find that prompt semantic similarity only weakly correlates with
transfer success. These findings lead to a more fine-grained understanding of
transferability, which we use in interventional experiments to showcase how our
statistical analysis can translate into practical improvements in attack
success.
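One plausible reading of the three quantities listed above can be sketched with plain vector arithmetic. The abstract does not give the exact definitions, so the function below is an assumption: it takes an activation vector for the prompt alone, one for the prompt with the suffix appended, and a refusal direction, all supplied by the caller.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def refusal_stats(prompt_act, prompt_suffix_act, refusal_dir):
    """Return (base_projection, push_away, orthogonal_shift).

    All three definitions are illustrative assumptions, not the paper's.
    """
    norm = math.sqrt(dot(refusal_dir, refusal_dir))
    unit = [x / norm for x in refusal_dir]
    # (1) how strongly the bare prompt activates the refusal direction
    base = dot(prompt_act, unit)
    # Shift in activation caused by appending the suffix
    delta = [b - a for a, b in zip(prompt_act, prompt_suffix_act)]
    along = dot(delta, unit)
    # (2) positive values mean the suffix pushes away from refusal
    push_away = -along
    # (3) magnitude of the shift in directions orthogonal to refusal
    orth = [d - along * u for d, u in zip(delta, unit)]
    orth_shift = math.sqrt(dot(orth, orth))
    return base, push_away, orth_shift
```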
[LINK]http://arxiv.org/abs/2510.22014v1
[DATE]2025-10-25 04:28:49+08:00
[CATEGORIES]cs.CL
Optimal Detection for Language Watermarks with Pseudorandom Collision
[AUTHORS]T. Tony Cai, Xiang Li, Qi Long, Weijie J. Su, Garrett G. Wen
[ABSTRACT]Text watermarking plays a crucial role in ensuring the traceability and
accountability of large language model (LLM) outputs and mitigating misuse.
While promising, most existing methods assume perfect pseudorandomness. In
practice, repetition in generated text induces collisions that create
structured dependence, compromising Type I error control and invalidating
standard analyses.
We introduce a statistical framework that captures this structure through a
hierarchical two-layer partition. At its core is the concept of minimal units
– the smallest groups treatable as independent across units while permitting
dependence within. Using minimal units, we define a non-asymptotic efficiency
measure and cast watermark detection as a minimax hypothesis testing problem.
Applied to Gumbel-max and inverse-transform watermarks, our framework
produces closed-form optimal rules. It explains why discarding repeated
statistics often improves performance and shows that within-unit dependence
must be addressed unless degenerate. Both theory and experiments confirm
improved detection power with rigorous Type I error control. These results
provide the first principled foundation for watermark detection under imperfect
pseudorandomness, offering both theoretical insight and practical guidance for
reliable tracing of model outputs.
[LINK]http://arxiv.org/abs/2510.22007v1
[DATE]2025-10-25 04:21:52+08:00
[CATEGORIES]cs.LG cs.CL
GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
[AUTHORS]Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
[ABSTRACT]Developing high-performance software is a complex task that requires
specialized expertise. We introduce GSO, a benchmark for evaluating language
models’ capabilities in developing high-performance software. We develop an
automated pipeline that generates and executes performance tests to analyze
repository commit histories, identifying 102 challenging optimization tasks
across 10 codebases, spanning diverse domains and programming languages. An
agent is provided with a codebase and performance test as a precise
specification, and tasked to improve the runtime efficiency, which is measured
against the expert developer optimization. Our quantitative evaluation reveals
that leading SWE-Agents struggle significantly, achieving less than 5% success
rate, with limited improvements even with inference-time scaling. Our
qualitative analysis identifies key failure modes, including difficulties with
low-level languages, practicing lazy optimization strategies, and challenges in
accurately localizing bottlenecks. We release the code and artifacts of our
benchmark along with agent trajectories to enable future research.
[COMMENTS]Website: https://gso-bench.github.io/
[LINK]http://arxiv.org/abs/2505.23671v3
[DATE]2025-10-25 03:59:00+08:00
[CATEGORIES]cs.CL cs.LG
The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement
[AUTHORS]Ruihan Yang, Fanghua Ye, Jian Li, Siyu Yuan, Yikai Zhang, Zhaopeng Tu, Xiaolong Li, Deqing Yang
[ABSTRACT]Large language models (LLMs) have recently transformed from text-based
assistants to autonomous agents capable of planning, reasoning, and iteratively
improving their actions. While numerical reward signals and verifiers can
effectively rank candidate actions, they often provide limited contextual
guidance. In contrast, natural language feedback better aligns with the
generative capabilities of LLMs, providing richer and more actionable
suggestions. However, parsing and implementing this feedback effectively can be
challenging for LLM-based agents. In this work, we introduce Critique-Guided
Improvement (CGI), a novel two-player framework, comprising an actor model that
explores an environment and a critic model that generates detailed natural
language feedback. By training the critic to produce fine-grained assessments
and actionable revisions, and the actor to utilize these critiques, our
approach promotes more robust exploration of alternative strategies while
avoiding local optima. Experiments in three interactive environments show that
CGI outperforms existing baselines by a substantial margin. Notably, even a
small critic model surpasses GPT-4 in feedback quality. The resulting actor
achieves state-of-the-art performance, demonstrating the power of explicit
iterative guidance to enhance decision-making in LLM-based agents.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2503.16024v2
[DATE]2025-10-25 03:30:31+08:00
[CATEGORIES]cs.CL
Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks
[AUTHORS]Havva Alizadeh Noughabi, Julien Serbanescu, Fattane Zarrinkalam, Ali Dehghantanha
[ABSTRACT]Despite recent advances, Large Language Models remain vulnerable to jailbreak
attacks that bypass alignment safeguards and elicit harmful outputs. While
prior research has proposed various attack strategies differing in human
readability and transferability, little attention has been paid to the
linguistic and psychological mechanisms that may influence a model’s
susceptibility to such attacks. In this paper, we examine an interdisciplinary
line of research that leverages foundational theories of persuasion from the
social sciences to craft adversarial prompts capable of circumventing alignment
constraints in LLMs. Drawing on well-established persuasive strategies, we
hypothesize that LLMs, having been trained on large-scale human-generated text,
may respond more compliantly to prompts with persuasive structures.
Furthermore, we investigate whether LLMs themselves exhibit distinct persuasive
fingerprints that emerge in their jailbreak responses. Empirical evaluations
across multiple aligned LLMs reveal that persuasion-aware prompts significantly
bypass safeguards, demonstrating their potential to induce jailbreak behaviors.
This work underscores the importance of cross-disciplinary insight in
addressing the evolving challenges of LLM safety. The code and data are
available.
[LINK]http://arxiv.org/abs/2510.21983v1
[DATE]2025-10-25 03:20:23+08:00
[CATEGORIES]cs.CL
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
[AUTHORS]Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong
[COMMENTS]49 pages, 13 figures
[LINK]http://arxiv.org/abs/2505.13227v3
[DATE]2025-10-25 03:08:03+08:00
[CATEGORIES]cs.CL
Performance Trade-offs of Optimizing Small Language Models for E-Commerce
[AUTHORS]Josip Tomo Licardo, Nikola Tankovic
[ABSTRACT]Large Language Models (LLMs) offer state-of-the-art performance in natural
language understanding and generation tasks. However, the deployment of leading
commercial models for specialized tasks, such as e-commerce, is often hindered
by high computational costs, latency, and operational expenses. This paper
investigates the viability of smaller, open-weight models as a
resource-efficient alternative. We present a methodology for optimizing a
one-billion-parameter Llama 3.2 model for multilingual e-commerce intent
recognition. The model was fine-tuned using Quantized Low-Rank Adaptation
(QLoRA) on a synthetically generated dataset designed to mimic real-world user
queries. Subsequently, we applied post-training quantization techniques,
creating GPU-optimized (GPTQ) and CPU-optimized (GGUF) versions. Our results
demonstrate that the specialized 1B model achieves 99% accuracy, matching the
performance of the significantly larger GPT-4.1 model. A detailed performance
analysis revealed critical, hardware-dependent trade-offs: while 4-bit GPTQ
reduced VRAM usage by 41%, it paradoxically slowed inference by 82% on an older
GPU architecture (NVIDIA T4) due to dequantization overhead. Conversely, GGUF
formats on a CPU achieved a speedup of up to 18x in inference throughput and a
reduction of over 90% in RAM consumption compared to the FP16 baseline. We
conclude that small, properly optimized open-weight models are not just a
viable but a more suitable alternative for domain-specific applications,
offering state-of-the-art accuracy at a fraction of the computational cost.
[COMMENTS]15 pages, 9 figures
[LINK]http://arxiv.org/abs/2510.21970v1
[DATE]2025-10-25 02:49:28+08:00
[CATEGORIES]cs.CL
Parallel Sampling from Masked Diffusion Models via Conditional Independence Testing
[AUTHORS]Iskander Azangulov, Teodora Pandeva, Niranjani Prasad, Javier Zazo, Sushrut Karmalkar
[ABSTRACT]Masked diffusion models (MDMs) offer a compelling alternative to
autoregressive models (ARMs) for discrete text generation because they enable
parallel token sampling, rather than sequential, left-to-right generation. This
means potentially much faster inference. However, effective parallel sampling
faces two competing requirements: (i) simultaneously updated tokens must be
conditionally independent, and (ii) updates should prioritise high-confidence
predictions. These goals conflict because high-confidence predictions often
cluster and depend on each other, reducing opportunities for parallel updates.
We present PUNT, a model-agnostic sampler that reconciles this trade-off. Our
method identifies token dependencies and removes lower-confidence tokens from
conflicting groups. This produces sets of indices for unmasking that satisfy
both independence and confidence criteria. Our approach ensures improved
parallel unmasking through approximate conditional independence testing.
Our experiments show that PUNT delivers a superior trade-off between accuracy
and compute when compared to other strong training-free baselines, especially
for generation of longer sequences. On the IFEval benchmark, it achieves up to
16% higher accuracy over baseline methods, including sequential generation
(one-by-one). These gains hold across different values of hyperparameters,
mitigating the need for brittle hyperparameter tuning. Moreover, we observe
that PUNT induces an emergent hierarchical generation strategy, where the model
first establishes high-level paragraph structure before local refinement,
suggesting a planning-like generation process that contributes to strong
alignment performance.
[LINK]http://arxiv.org/abs/2510.21961v1
[DATE]2025-10-25 02:41:26+08:00
[CATEGORIES]cs.LG cs.CL
Transformer Based Linear Attention with Optimized GPU Kernel Implementation
[AUTHORS]Armin Gerami, Ramani Duraiswami
[ABSTRACT]The original softmax-based attention mechanism (regular attention) in the
extremely successful Transformer architecture computes attention between $N$
tokens, each embedded in a $D$-dimensional head, with a time complexity of
$O(N^2D)$. Given the success of Transformers, improving their runtime during
both training and inference is a popular research area. One such approach is
the introduction of linear attention (LA) mechanisms, which offer a linear
time complexity of $O(ND^2)$ and have demonstrated comparable accuracy to
regular attention. However, LA in practice lags behind its theoretical
efficiency. We propose a novel method for LA’s forward and backward passes,
along with a highly-optimized CUDA implementation. Our approach outperforms the
state-of-the-art by 3.3 times in speed and reduces memory consumption by 3.6
times. We validate these improvements in both single-layer and end-to-end
settings by training a 1.4 billion parameter language model, which demonstrates
similar expressivity to regular attention on major reasoning benchmarks.
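The complexity gap above comes from reordering the matrix products: instead of forming the $N \times N$ attention matrix, the keys and values are first summarized into a $D \times D$ state. The sketch below illustrates that identity in plain Python for the non-causal case. It is not the paper's CUDA method, and the feature map elu(x) + 1 is a common choice assumed here for illustration.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(Q, K, V):
    """Non-causal linear attention in O(N * D^2) for N tokens of width D."""
    D, Dv = len(Q[0]), len(V[0])
    # S = sum_j phi(k_j) v_j^T (D x Dv) and z = sum_j phi(k_j) (length D)
    S = [[0.0] * Dv for _ in range(D)]
    z = [0.0] * D
    for k, v in zip(K, V):
        fk = [phi(x) for x in k]
        for a in range(D):
            z[a] += fk[a]
            for b in range(Dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        denom = sum(fq[a] * z[a] for a in range(D))
        out.append([sum(fq[a] * S[a][b] for a in range(D)) / denom
                    for b in range(Dv)])
    return out


def quadratic_attention(Q, K, V):
    """The same kernelized attention computed the O(N^2 * D) way."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        # One similarity weight per key: this is the N x N part
        w = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(len(V[0]))])
    return out
```

Both functions return the same values up to floating-point error; only the order of summation, and therefore the cost, differs.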
[LINK]http://arxiv.org/abs/2510.21956v1
[DATE]2025-10-25 02:32:20+08:00
[CATEGORIES]cs.LG cs.CL
Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks
[AUTHORS]Gregory Kang Ruey Lau, Wenyang Hu, Diwen Liu, Jizhuo Chen, See-Kiong Ng, Bryan Kian Hsiang Low
[COMMENTS]Accepted to EMNLP 2025 Main Conference
[LINK]http://arxiv.org/abs/2412.15238v2
[DATE]2025-10-25 02:28:37+08:00
[CATEGORIES]cs.CL cs.LG
Demystifying Language Model Forgetting with Low-rank Example Associations
[AUTHORS]Xisen Jin, Xiang Ren
[ABSTRACT]Large language models (LLMs) suffer from forgetting of upstream knowledge
when fine-tuned. Despite efforts on mitigating forgetting, few have
investigated how forgotten upstream examples are dependent on newly learned
tasks. Insights on such dependencies enable efficient and targeted mitigation
of forgetting. In this paper, we empirically analyze forgetting that occurs in
$N$ upstream examples of language modeling or instruction-tuning after
fine-tuning LLMs on one of $M$ new tasks, visualized in $M\times N$ matrices.
We show that the matrices are often well-approximated with low-rank matrices,
indicating the dominance of simple associations between the learned tasks and
forgotten upstream examples. Leveraging the analysis, we predict forgetting of
upstream examples when fine-tuning LLMs on unseen tasks with matrix completion
over the empirical associations. This enables fast identification of the most
forgotten examples without expensive inference on the entire upstream data.
Despite its simplicity, the approach outperforms prior approaches that learn
semantic relationships of learned tasks and upstream examples with LMs. We
demonstrate the practical utility of our analysis by showing statistically
significantly reduced forgetting as we upweight predicted examples for replay
during fine-tuning.
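The low-rank observation above can be illustrated with a rank-1 approximation of an $M\times N$ matrix. The sketch below uses power iteration in plain Python and reports the relative reconstruction error. It only demonstrates low-rank approximation, not the matrix-completion procedure from the paper, and it assumes a nonzero matrix with non-negative entries (as a forgetting matrix might plausibly have).

```python
import math


def rank1_approx(A, iters=100):
    """Rank-1 approximation of A (a list of rows) via power iteration."""
    m, n = len(A), len(A[0])
    v = [1.0] * n
    u = [0.0] * m
    for _ in range(iters):
        # u <- normalized A v
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        u_norm = math.sqrt(sum(x * x for x in u))
        u = [x / u_norm for x in u]
        # v <- A^T u, which carries the singular-value scale
        v = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]


def relative_error(A, B):
    """Frobenius-norm error of B as an approximation of A, relative to A."""
    num = sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))
    den = sum(a * a for ra in A for a in ra)
    return math.sqrt(num / den)
```

A small relative error means a single task factor and a single example factor explain most of the matrix.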
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2406.14026v7
[DATE]2025-10-25 02:26:16+08:00
[CATEGORIES]cs.LG cs.CL
VisCoder2: Building Multi-Language Visualization Coding Agents
[AUTHORS]Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen
[ABSTRACT]Large language models (LLMs) have recently enabled coding agents capable of
generating, executing, and revising visualization code. However, existing
models often fail in practical workflows due to limited language coverage,
unreliable execution, and lack of iterative correction mechanisms. Progress has
been constrained by narrow datasets and benchmarks that emphasize single-round
generation and single-language tasks. To address these challenges, we introduce
three complementary resources for advancing visualization coding agents.
VisCode-Multi-679K is a large-scale, supervised dataset containing 679K
validated and executable visualization samples with multi-turn correction
dialogues across 12 programming languages. VisPlotBench is a benchmark for
systematic evaluation, featuring executable tasks, rendered outputs, and
protocols for both initial generation and multi-round self-debug. Finally, we
present VisCoder2, a family of multi-language visualization models trained on
VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms
strong open-source baselines and approaches the performance of proprietary
models like GPT-4.1, with further gains from iterative self-debug, reaching
82.4% overall execution pass rate at the 32B scale, particularly in symbolic or
compiler-dependent languages.
[LINK]http://arxiv.org/abs/2510.23642v1
[DATE]2025-10-25 02:03:57+08:00
[CATEGORIES]cs.CL
Knee-Deep in C-RASP: A Transformer Depth Hierarchy
[AUTHORS]Andy Yang, Michaël Cadilhac, David Chiang
[ABSTRACT]It has been observed that transformers with greater depth (that is, more
layers) have more capabilities, but can we establish formally which
capabilities are gained? We answer this question with a theoretical proof
followed by an empirical study. First, we consider transformers that round to
fixed precision except inside attention. We show that this subclass of
transformers is expressively equivalent to the programming language C-RASP and
this equivalence preserves depth. Second, we prove that deeper C-RASP programs
are more expressive than shallower C-RASP programs, implying that deeper
transformers are more expressive than shallower transformers (within the
subclass mentioned above). The same is also proven for transformers with
positional encodings (like RoPE and ALiBi). These results are established by
studying a temporal logic with counting operators equivalent to C-RASP.
Finally, we provide empirical evidence that our theory predicts the depth
required for transformers without positional encodings to length-generalize on
a family of sequential dependency tasks.
[COMMENTS]35 pages, 5 figures
[LINK]http://arxiv.org/abs/2506.16055v2
[DATE]2025-10-25 01:50:04+08:00
[CATEGORIES]cs.CL
SimuRA: A World-Model-Driven Simulative Reasoning Architecture for General Goal-Oriented Agents
[AUTHORS]Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing
[ABSTRACT]AI agents built on foundation models hold enormous promise. Current practice,
however, focuses on a one-task-one-agent approach, which not only falls short
of scalability and generality, but also faces practical limitations from
black-box autoregressive reasoning, where decisions unfold token by token
without explicit simulation or counterfactual evaluation of outcomes. Humans,
on the other hand, reason and plan by mentally simulating the consequences of
actions within an internal model of the world – a capability that supports
flexible, goal-directed behavior across diverse contexts. Moving towards a more
general and powerful AI agent, we introduce SimuRA, a goal-oriented
architecture for generalized agentic reasoning. Based on a principled
formulation of an optimal agent in any general environment, SimuRA addresses
the limitations of black-box autoregressive reasoning by incorporating a
world model for planning via simulation. Our prototype world model is
implemented using LLMs as a substrate, leveraging natural language as a
discrete, hierarchical representation grounded in concepts for planning, while
remaining model-agnostic. On complex web-browsing tasks such as flight search,
SimuRA improves the success rate from 0% to 32.2% compared to a representative
open-web agent baseline. Across tasks, world-model-based planning achieves up
to 124% higher task completion rates than a matched black-box autoregressive
baseline, demonstrating the advantages of simulative reasoning. We release
ReasonerAgent-Web, a web-browsing agent built on SimuRA, as an open-source
research demo.
[COMMENTS]This submission has been updated to adjust the scope and presentation
of the work
[LINK]http://arxiv.org/abs/2507.23773v2
[DATE]2025-10-25 01:44:52+08:00
[CATEGORIES]cs.CL cs.LG
Explaining and Mitigating Crosslingual Tokenizer Inequities
[AUTHORS]Catherine Arnett, Tyler A. Chang, Stella Biderman, Benjamin K. Bergen
[ABSTRACT]The number of tokens it takes to encode parallel text in different languages
is known to vary. These disparities are called token premiums. Having high
token premiums leads to less throughput during training and increases costs at
inference. In this paper, we show that even after controlling for dataset size,
vocabulary size, and data content, monolingual tokenizers exhibit a wide range
of token premiums across languages. To understand the cross-linguistic
differences that cause these token premiums, we train a suite of approximately
7,000 comparable monolingual tokenizers for 97 languages, manipulating
tokenization algorithm, vocabulary size, and dataset size. We measure token
premiums and test for a relationship between factors such as data similarity
(between tokenizer training and evaluation), vocabulary size, and
pre-tokenization. We also investigate the role of language-specific features
such as writing system and word length. We find that similarity between
training and test data does not impact token premiums, but vocabulary size and
pre-tokenization do. While simply increasing vocabulary size does not lead to
reduced token premium effects, we can determine an “optimal” vocabulary size
for each language to achieve significantly reduced token premium effects. We
also train superword tokenizers which allow merges over whitespaces, and we
find that they both reduce token premium effects and improve compression
overall. Thus, intervening on the vocabulary size or the pre-tokenizer
significantly reduces crosslingual token premium effects.
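A token premium can be sketched as a ratio of token counts over parallel text. The abstract does not state the exact formula, so the ratio below is an assumption, and the tokenizers passed in are arbitrary callables chosen by the caller.

```python
def token_premium(parallel_pairs, tokenize_a, tokenize_b):
    """Tokens needed for language A divided by tokens needed for language B.

    parallel_pairs: iterable of (text_a, text_b) with the same content.
    tokenize_a, tokenize_b: callables mapping a string to a list of tokens.
    """
    pairs = list(parallel_pairs)
    total_a = sum(len(tokenize_a(a)) for a, _ in pairs)
    total_b = sum(len(tokenize_b(b)) for _, b in pairs)
    if total_b == 0:
        raise ValueError("reference side produced no tokens")
    return total_a / total_b
```

Under this definition, a value above 1 means language A needs more tokens than language B for the same content.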
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21909v1
[DATE]2025-10-25 01:36:03+08:00
[CATEGORIES]cs.CL
RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback
[AUTHORS]Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Yinghui Li, Hai-Tao Zheng, Xue Liu, Irwin King, Philip S. Yu
[ABSTRACT]Large language models (LLMs) show promise in supporting scientific
research implementation, yet their ability to generate correct and executable
code remains limited. Existing works largely adopt one-shot settings, ignoring
the iterative and feedback-driven nature of realistic workflows of scientific
research development. To address this gap, we present RECODE-H, a benchmark of
102 tasks from research papers and repositories that evaluates LLM agents
through multi-turn interactions with LLM-simulated human feedback. It includes
structured instructions, unit tests, and a five-level feedback hierarchy to
reflect realistic researcher-agent collaboration. We further present
ReCodeAgent, a framework that integrates feedback into iterative code
generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4,
DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer
feedback, while also highlighting ongoing challenges in the generation of
complex research code. RECODE-H establishes a foundation for developing
adaptive, feedback-driven LLM agents in scientific research implementation.
[COMMENTS]Code and dataset are available at github.com/ChunyuMiao98/RECODE
[LINK]http://arxiv.org/abs/2510.06186v2
[DATE]2025-10-25 01:20:26+08:00
[CATEGORIES]cs.CL
Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
[AUTHORS]Mutian He, Philip N. Garner
[ABSTRACT]Linear-attention models that compress the entire input sequence into a
fixed-size recurrent state offer an efficient alternative to Transformers, but
their finite memory induces forgetfulness that harms retrieval-intensive tasks.
To mitigate the issue, we explore a series of hybrid models that restore direct
access to past tokens. We interleave token mixers with intermediate time and
space complexity between linear and full attention, including sparse attention
with token eviction, and the query-aware native sparse attention. Particularly,
we propose a novel learnable token eviction approach. Combined with
sliding-window attention, an end-to-end trainable lightweight CNN aggregates
information from both past and future adjacent tokens to adaptively retain a
limited set of critical KV-pairs per head, maintaining linear attention’s
constant time and space complexity. Efficient Triton kernels for the sparse
attention mechanisms are provided. Empirical evaluations on retrieval-intensive
benchmarks support the effectiveness of our approaches.
[COMMENTS]19 pages, 5 figures
[LINK]http://arxiv.org/abs/2510.20787v2
[DATE]2025-10-25 00:56:22+08:00
[CATEGORIES]cs.CL cs.LG
Electronic Circuit Principles of Large Language Models
[AUTHORS]Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiaqi Wang, Mengkang Hu, Zhi Chen, Wanxiang Che, Ting Liu
[ABSTRACT]Large language models (LLMs) such as DeepSeek-R1 have achieved remarkable
performance across diverse reasoning tasks. To uncover the principles that
govern their behaviour, we introduce the Electronic Circuit Principles (ECP),
which maps inference-time learning (ITL) onto a semantic electromotive force
and inference-time reasoning (ITR) onto a resistive network governed by Ohm’s
and Faraday’s laws. This circuit-based modelling yields closed-form predictions
of task performance and reveals how modular prompt components interact to shape
accuracy. We validated ECP on 70,000 samples spanning 350 reasoning tasks and 9
advanced LLMs, observing an approximately 60% improvement in Pearson correlation
relative to the conventional inference-time scaling law. Moreover, ECP explains
the efficacy of 15 established prompting strategies and directs the development
of new modular interventions that exceed the median score of the top 80% of
participants in both the International Olympiad in Informatics and the
International Mathematical Olympiad. By grounding LLM reasoning in
electronic-circuit principles, ECP provides a rigorous framework for predicting
performance and optimising modular components.
[COMMENTS]Manuscript
[LINK]http://arxiv.org/abs/2502.03325v2
[DATE]2025-10-25 00:55:30+08:00
[CATEGORIES]cs.CL
Few-Shot Knowledge Distillation of LLMs With Counterfactual Explanations
[AUTHORS]Faisal Hamman, Pasan Dissanayake, Yanjun Fu, Sanghamitra Dutta
[ABSTRACT]Knowledge distillation is a promising approach to transfer capabilities from
complex teacher models to smaller, resource-efficient student models that can
be deployed easily, particularly in task-aware scenarios. However, existing
methods of task-aware distillation typically require substantial quantities of
data which may be unavailable or expensive to obtain in many practical
scenarios. In this paper, we address this challenge by introducing a novel
strategy called Counterfactual-explanation-infused Distillation (CoD) for
few-shot task-aware knowledge distillation by systematically infusing
counterfactual explanations. Counterfactual explanations (CFEs) refer to inputs
that can flip the output prediction of the teacher model with minimum
perturbation. Our strategy CoD leverages these CFEs to precisely map the
teacher’s decision boundary with significantly fewer samples. We provide
theoretical guarantees for motivating the role of CFEs in distillation, from
both statistical and geometric perspectives. We mathematically show that CFEs
can improve parameter estimation by providing more informative examples near
the teacher’s decision boundary. We also derive geometric insights on how CFEs
effectively act as knowledge probes, helping the students mimic the teacher’s
decision boundaries more effectively than standard data. We perform experiments
across various datasets and LLMs to show that CoD outperforms standard
distillation approaches in few-shot regimes (as few as 8-512 samples). Notably,
CoD uses only half of the original samples used by the baselines, paired with
their corresponding CFEs, and still improves performance.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21631v1
[DATE]2025-10-25 00:36:34+08:00
[CATEGORIES]cs.LG cs.CL
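As a rough illustration of the counterfactual explanations (CFEs) described in the CoD abstract above, the sketch below computes the smallest L2 perturbation that flips a linear classifier's prediction. The abstract does not say how CFEs are generated for LLM teachers; the linear teacher, the closed-form projection, and the names `predict` and `counterfactual` are assumptions made only for this example.

```python
import math


def predict(w, b, x):
    """Toy teacher: linear classifier returning 1 if w.x + b > 0, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0


def counterfactual(w, b, x, margin=1e-6):
    """Smallest L2 move of x that crosses the boundary w.x + b = 0.

    Assumption: the teacher is linear, so the nearest boundary point is the
    orthogonal projection of x; we step a tiny margin past it to flip the label.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    step = -score / norm_sq                     # lands exactly on the boundary
    overshoot = margin if score <= 0 else -margin  # nudge to the other side
    return [xi + (step + overshoot) * wi for xi, wi in zip(x, w)]


w, b = [2.0, -1.0], 0.5
x = [1.0, 1.0]
x_cf = counterfactual(w, b, x)
distance = math.dist(x, x_cf)  # how far the input had to move
```

Pairing `x` with `x_cf` gives two nearby points with opposite teacher labels, which is the kind of boundary-hugging sample the abstract argues is informative for a student.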
The Universal Landscape of Human Reasoning
[AUTHORS]Qiguang Chen, Jinhao Liu, Libo Qin, Yimeng Zhang, Yihao Liang, Shangxu Ren, Chengyu Luan, Dengyun Peng, Hanjing Li, Jiannan Guan, Zheng Yan, Jiaqi Wang, Mengkang Hu, Yantao Du, Zhi Chen, Xie Chen, Wanxiang Che
[ABSTRACT]Understanding how information is dynamically accumulated and transformed in
human reasoning has long challenged cognitive psychology, philosophy, and
artificial intelligence. Existing accounts, from classical logic to
probabilistic models, illuminate aspects of output or individual modelling, but
do not offer a unified, quantitative description of general human reasoning
dynamics. To solve this, we introduce Information Flow Tracking (IF-Track),
which uses large language models (LLMs) as a probabilistic encoder to quantify
information entropy and gain at each reasoning step. Through fine-grained
analyses across diverse tasks, our method is the first to successfully model the
universal landscape of human reasoning behaviors within a single metric space.
We show that IF-Track captures essential reasoning features, identifies
systematic error patterns, and characterizes individual differences. Applied to
discussion of advanced psychological theory, we first reconcile single- versus
dual-process theories in IF-Track and discover the alignment of artificial and
human cognition and how LLMs are reshaping the human reasoning process. This approach
establishes a quantitative bridge between theory and measurement, offering
mechanistic insights into the architecture of reasoning.
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2510.21623v1
[DATE]2025-10-25 00:26:36+08:00
[CATEGORIES]cs.CL
DeepAgent: A General Reasoning Agent with Scalable Toolsets
[AUTHORS]Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
[ABSTRACT]Large reasoning models have demonstrated strong problem-solving abilities,
yet real-world tasks often require external tools and long-horizon
interactions. Existing agent frameworks typically follow predefined workflows,
which limit autonomous and global task completion. In this paper, we introduce
DeepAgent, an end-to-end deep reasoning agent that performs autonomous
thinking, tool discovery, and action execution within a single, coherent
reasoning process. To address the challenges of long-horizon interactions,
particularly the context length explosion from multiple tool calls and the
accumulation of interaction history, we introduce an autonomous memory folding
mechanism that compresses past interactions into structured episodic, working,
and tool memories, reducing error accumulation while preserving critical
information. To teach general-purpose tool use efficiently and stably, we
develop an end-to-end reinforcement learning strategy, namely ToolPO, that
leverages LLM-simulated APIs and applies tool-call advantage attribution to
assign fine-grained credit to the tool invocation tokens. Extensive experiments
on eight benchmarks, including general tool-use tasks (ToolBench, API-Bank,
TMDB, Spotify, ToolHop) and downstream applications (ALFWorld, WebShop, GAIA,
HLE), demonstrate that DeepAgent consistently outperforms baselines across both
labeled-tool and open-set tool retrieval scenarios. This work takes a step
toward more general and capable agents for real-world applications. The code
and demo are available at https://github.com/RUC-NLPIR/DeepAgent.
[LINK]http://arxiv.org/abs/2510.21618v1
[DATE]2025-10-25 00:24:01+08:00
[CATEGORIES]cs.CL cs.LG
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
[AUTHORS]Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, Mohit Bansal
[ABSTRACT]Despite advances in reinforcement learning (RL)-based video reasoning with
large language models (LLMs), data collection and fine-tuning remain
significant challenges. These methods often rely on large-scale supervised
fine-tuning (SFT) with extensive video data and long Chain-of-Thought (CoT)
annotations, making them costly and hard to scale. To address this, we present
Video-RTS, a new approach to improve video reasoning capability with
drastically improved data efficiency by combining data-efficient RL with a
video-adaptive test-time scaling (TTS) strategy. Building on observations about
the data scaling, we skip the resource-intensive SFT step and employ efficient
pure-RL training with output-based rewards, requiring no additional annotations
or extensive fine-tuning. Furthermore, to utilize computational resources more
efficiently, we introduce a sparse-to-dense video TTS strategy that improves
inference by iteratively adding frames based on output consistency. We validate
our approach on multiple video reasoning benchmarks, showing that Video-RTS
surpasses existing video reasoning models by 2.4% in accuracy using only 3.6%
training samples. Specifically, Video-RTS achieves a 4.2% improvement on
Video-Holmes, a recent and challenging video reasoning benchmark. Notably, our
pure RL training and adaptive video TTS offer complementary strengths, enabling
Video-RTS’s strong reasoning performance.
[COMMENTS]EMNLP 2025. The first two authors contributed equally. Project page:
https://sites.google.com/cs.unc.edu/videorts2025/
[LINK]http://arxiv.org/abs/2507.06485v2
[DATE]2025-10-25 00:19:27+08:00
[CATEGORIES]cs.CL
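The sparse-to-dense test-time scaling idea in the Video-RTS abstract above (iteratively adding frames based on output consistency) can be sketched as a small control loop. This is a minimal sketch, not the authors' implementation: the `answer_fn` callable, the unanimity criterion, the frame-doubling schedule, and the majority-vote fallback are all assumptions introduced for illustration.

```python
from collections import Counter


def sparse_to_dense_answer(answer_fn, start_frames=4, max_frames=32, num_samples=5):
    """Query with few frames first; add frames only while samples disagree.

    answer_fn(num_frames, sample_idx) -> str is a hypothetical stand-in for
    one sampled model response given `num_frames` video frames.
    """
    num_frames = start_frames
    while True:
        answers = [answer_fn(num_frames, i) for i in range(num_samples)]
        top_answer, votes = Counter(answers).most_common(1)[0]
        # Stop when every sample agrees, or when the frame budget is spent
        # (in which case the majority answer is returned).
        if votes == num_samples or num_frames >= max_frames:
            return top_answer, num_frames
        num_frames = min(num_frames * 2, max_frames)


def fake_model(num_frames, sample_idx):
    """Toy model: inconsistent with sparse frames, consistent with 16 or more."""
    if num_frames < 16:
        return "A" if sample_idx % 2 == 0 else "B"
    return "C"


result = sparse_to_dense_answer(fake_model)
```

With the toy model, the loop tries 4 and 8 frames, sees disagreement both times, and stops at 16 frames once all samples agree.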
RETuning: Upgrading Inference-Time Scaling for Stock Movement Prediction with Large Language Models
[AUTHORS]Xueyuan Lin, Cehao Yang, Ye Ma, Ming Li, Rongjunchen Zhang, Yang Ni, Xiaojun Wu, Chengjin Xu, Jian Guo, Hui Xiong
[ABSTRACT]Recently, large language models (LLMs) have demonstrated outstanding
reasoning capabilities on mathematical and coding tasks. However, their
application to financial tasks, especially the most fundamental task of stock
movement prediction, remains underexplored. We study a three-class
classification problem (up, hold, down) and, by analyzing existing reasoning
responses, observe that: (1) LLMs follow analysts’ opinions rather than
exhibiting a systematic, independent analytical logic (CoTs); (2) LLMs list
summaries from different sources without weighing adversarial evidence, yet such
counterevidence is crucial for reliable prediction. This shows that the model
does not make good use of its reasoning ability to complete the task. To
address this, we propose Reflective Evidence Tuning (RETuning), a cold-start
method prior to reinforcement learning, to enhance prediction ability. While
generating CoT, RETuning encourages dynamically constructing an analytical
framework from diverse information sources, organizing and scoring evidence for
price up or down based on that framework rather than on contextual
viewpoints, and finally reflecting to derive the prediction. This approach
maximally aligns the model with its learned analytical framework, ensuring
independent logical reasoning and reducing undue influence from context. We
also build a large-scale dataset spanning all of 2024 for 5,123 A-share stocks,
with long contexts (32K tokens) and over 200K samples. In addition to price and
news, it incorporates analysts’ opinions, quantitative reports, fundamental
data, macroeconomic indicators, and similar stocks. Experiments show that
RETuning successfully unlocks the model’s reasoning ability in the financial
domain. Inference-time scaling still works even after 6 months or on
out-of-distribution stocks, since the models gain valuable insights about stock
movement prediction.
[LINK]http://arxiv.org/abs/2510.21604v1
[DATE]2025-10-25 00:08:33+08:00
[CATEGORIES]cs.CL
Doc-Researcher: A Unified System for Multimodal Document Parsing and Deep Research
[AUTHORS]Kuicai Dong, Shurui Huang, Fangda Ye, Wei Han, Zhi Zhang, Dexun Li, Wenjun Li, Qu Yang, Gang Wang, Yichao Wang, Chen Zhang, Yong Liu
[ABSTRACT]Deep Research systems have revolutionized how LLMs solve complex questions
through iterative reasoning and evidence gathering. However, current systems
remain fundamentally constrained to textual web data, overlooking the vast
knowledge embedded in multimodal documents. Processing such documents demands
sophisticated parsing to preserve visual semantics (figures, tables, charts,
and equations), intelligent chunking to maintain structural coherence, and
adaptive retrieval across modalities, which are capabilities absent in existing
systems. In response, we present Doc-Researcher, a unified system that bridges
this gap through three integrated components: (i) deep multimodal parsing that
preserves layout structure and visual semantics while creating multi-granular
representations from chunk to document level, (ii) systematic retrieval
architecture supporting text-only, vision-only, and hybrid paradigms with
dynamic granularity selection, and (iii) iterative multi-agent workflows that
decompose complex queries, progressively accumulate evidence, and synthesize
comprehensive answers across documents and modalities. To enable rigorous
evaluation, we introduce M4DocBench, the first benchmark for Multi-modal,
Multi-hop, Multi-document, and Multi-turn deep research. Featuring 158
expert-annotated questions with complete evidence chains across 304 documents,
M4DocBench tests capabilities that existing benchmarks cannot assess.
Experiments demonstrate that Doc-Researcher achieves 50.6% accuracy, 3.4x better
than state-of-the-art baselines, validating that effective document research
requires not just better retrieval, but fundamentally deep parsing that
preserves multimodal integrity and supports iterative research. Our work
establishes a new paradigm for conducting deep research on multimodal document
collections.
[COMMENTS]preprint
[LINK]http://arxiv.org/abs/2510.21603v1
[DATE]2025-10-25 00:07:54+08:00
[CATEGORIES]cs.CL
Teaching Transformers Causal Reasoning through Axiomatic Training
[AUTHORS]Aniket Vashishtha, Abhinav Kumar, Atharva Pandey, Abbavaram Gowtham Reddy, Kabir Ahuja, Vineeth N Balasubramanian, Amit Sharma
[ABSTRACT]For text-based AI systems to interact in the real world, causal reasoning is
an essential skill. Since active interventions are costly, we study to what
extent a system can learn causal reasoning from symbolic demonstrations of
causal axioms. Specifically, we present an axiomatic training method where the
system learns from multiple demonstrations of a causal axiom (or rule), rather
than incorporating the axiom as an inductive bias or inferring it from data
values. A key question is whether the system would learn to generalize from the
axiom demonstrations to more complex scenarios. Our results, based on applying
axiomatic training to learn the transitivity axiom and d-separation rule,
indicate that such generalization is possible. To avoid data contamination
issues, we start with a 67 million parameter transformer model and train it
from scratch. On both tasks, we find that a model trained on linear causal
chains (along with some noisy variations) can generalize well to complex
graphs, including longer causal chains, causal chains with reversed order, and
graphs with branching. To handle diverse text inputs, the same method is
extended to finetune language models. Finetuning the Llama-3-8B-Instruct model on
our axiomatic data leads to significant gains on causal benchmarks such as
Corr2Cause and CLEAR, in some cases providing state-of-the-art performance
surpassing GPT-4.
[LINK]http://arxiv.org/abs/2407.07612v3
[DATE]2025-10-25 00:07:28+08:00
[CATEGORIES]cs.LG cs.CL
R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning
[AUTHORS]Yuan Li, Qi Luo, Xiaonan Li, Bufan Li, Qinyuan Cheng, Bo Wang, Yining Zheng, Yuxin Wang, Zhangyue Yin, Xipeng Qiu
[ABSTRACT]Retrieval-Augmented Generation (RAG) integrates external knowledge with Large
Language Models (LLMs) to enhance factual correctness and mitigate
hallucination. However, dense retrievers often become the bottleneck of RAG
systems due to their limited parameters compared to LLMs and their inability to
perform step-by-step reasoning. While prompt-based iterative RAG attempts to
address these limitations, it is constrained by human-designed workflows. To
address these limitations, we propose R3-RAG, which uses
Reinforcement learning to make the LLM learn how to
Reason and Retrieve step by step, thus retrieving
comprehensive external knowledge and leading to correct answers. R3-RAG is
divided into two stages. We first use cold start to make the model learn the
manner of iteratively interleaving reasoning and retrieval. Then we use
reinforcement learning to further harness its ability to better explore the
external retrieval environment. Specifically, we propose two rewards for
R3-RAG: 1) answer correctness for outcome reward, which judges whether the
trajectory leads to a correct answer; 2) relevance-based document verification
for process reward, encouraging the model to retrieve documents that are
relevant to the user question, through which we can let the model learn how to
iteratively reason and retrieve relevant documents to get the correct answer.
Experimental results show that R3-RAG significantly outperforms baselines and
can transfer well to different retrievers. We release R3-RAG at
https://github.com/Yuan-Li-FNLP/R3-RAG.
[LINK]http://arxiv.org/abs/2505.23794v2
[DATE]2025-10-24 23:52:23+08:00
[CATEGORIES]cs.CL
From Polyester Girlfriends to Blind Mice: Creating the First Pragmatics Understanding Benchmarks for Slovene
[AUTHORS]Mojca Brglez, Špela Vintar
[ABSTRACT]Large language models are demonstrating increasing capabilities, excelling at
benchmarks once considered very difficult. As their capabilities grow, there is
a need for more challenging evaluations that go beyond surface-level linguistic
competence. Namely, language competence involves not only syntax and semantics
but also pragmatics, i.e., understanding situational meaning as shaped by
context as well as linguistic and cultural norms. To contribute to this line of
research, we introduce SloPragEval and SloPragMega, the first pragmatics
understanding benchmarks for Slovene that contain altogether 405
multiple-choice questions. We discuss the difficulties of translation, describe
the campaign to establish a human baseline, and report pilot evaluations with
LLMs. Our results indicate that current models have greatly improved in
understanding nuanced language but may still fail to infer implied speaker
meaning in non-literal utterances, especially those that are culture-specific.
We also observe a significant gap between proprietary and open-source models.
Finally, we argue that benchmarks targeting nuanced language understanding and
knowledge of the target culture must be designed with care, preferably
constructed from native data, and validated with human responses.
[LINK]http://arxiv.org/abs/2510.21575v1
[DATE]2025-10-24 23:43:42+08:00
[CATEGORIES]cs.CL
Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding
[AUTHORS]Yuhang Zhou, Mingrui Zhang, Ke Li, Mingyi Wang, Qiao Liu, Qifei Wang, Jiayi Liu, Fei Liu, Serena Li, Weiwei Li, Mingze Gao, Abhishek Kumar, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang
[ABSTRACT]Understanding and reasoning over tables is a critical capability for many
real-world applications. Large language models (LLMs) have shown promise on
this task, but current approaches remain limited. Fine-tuning based methods
strengthen language reasoning; yet they are prone to arithmetic errors and
hallucination. In contrast, tool-based methods enable precise table
manipulation but rely on rigid schemas and lack semantic understanding. These
complementary drawbacks highlight the need for approaches that integrate robust
reasoning with reliable table processing. In this work, we propose
Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into
three specialized roles: planning, coding, and answering. This design enables
each agent to focus on a specific aspect of the task while leveraging code
execution for precise table manipulation. Building on this workflow, we
introduce a self-improvement training framework that employs Monte Carlo Tree
Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents
with reinforcement learning (RL). Extensive experiments show that
Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and
surpassing OpenAI-o4-mini-high. These results demonstrate the promise of
combining structured multi-agent workflows with RL to advance table
understanding.
[COMMENTS]18 pages, 4 figures
[LINK]http://arxiv.org/abs/2510.20176v2
[DATE]2025-10-24 23:36:31+08:00
[CATEGORIES]cs.CL
GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs
[AUTHORS]Nitin Gupta, Pallav Koppisetti, Kausik Lakkaraju, Biplav Srivastava
[ABSTRACT]The rapid proliferation of Generative AI (GenAI) into diverse, high-stakes
domains necessitates robust and reproducible evaluation methods. However,
practitioners often resort to ad-hoc, non-standardized scripts, as common
metrics are often unsuitable for specialized, structured outputs (e.g.,
automated plans, time-series) or holistic comparison across modalities (e.g.,
text, audio, and image). This fragmentation hinders comparability and slows AI
system development. To address this challenge, we present GAICo (Generative AI
Comparator): a deployed, open-source Python library that streamlines and
standardizes GenAI output comparison. GAICo provides a unified, extensible
framework supporting a comprehensive suite of reference-based metrics for
unstructured text, specialized structured data formats, and multimedia (images,
audio). Its architecture features a high-level API for rapid, end-to-end
analysis, from multi-model comparison to visualization and reporting, alongside
direct metric access for granular control. We demonstrate GAICo’s utility
through a detailed case study evaluating and debugging complex, multi-modal AI
Travel Assistant pipelines. GAICo empowers AI researchers and developers to
efficiently assess system performance, make evaluation reproducible, improve
development velocity, and ultimately build more trustworthy AI systems,
aligning with the goal of moving faster and safer in AI deployment. Since its
release on PyPI in Jun 2025, the tool has been downloaded over 13K times,
across versions, by Aug 2025, demonstrating growing community interest.
[COMMENTS]11 pages, 7 figures, accepted at IAAI/AAAI 2026; updated with
figures, captions, and acknowledgments
[LINK]http://arxiv.org/abs/2508.16753v2
[DATE]2025-10-24 23:20:55+08:00
[CATEGORIES]cs.CL
Are the LLMs Capable of Maintaining at Least the Language Genus?
[AUTHORS]Sandra Mitrović, David Kletz, Ljiljana Dolamic, Fabio Rinaldi
[ABSTRACT]Large Language Models (LLMs) display notable variation in multilingual
behavior, yet the role of genealogical language structure in shaping this
variation remains underexplored. In this paper, we investigate whether LLMs
exhibit sensitivity to linguistic genera by extending prior analyses on the
MultiQ dataset. We first check if models prefer to switch to genealogically
related languages when prompt language fidelity is not maintained. Next, we
investigate whether knowledge consistency is better preserved within than
across genera. We show that genus-level effects are present but strongly
conditioned by training resource availability. We further observe distinct
multilingual strategies across LLM families. Our findings suggest that LLMs
encode aspects of genus-level structure, but training data imbalances remain
the primary factor shaping their multilingual performance.
[LINK]http://arxiv.org/abs/2510.21561v1
[DATE]2025-10-24 23:20:40+08:00
[CATEGORIES]cs.CL
Document Understanding, Measurement, and Manipulation Using Category Theory
[AUTHORS]Jared Claypoole, Yunye Gong, Noson S. Yanofsky, Ajay Divakaran
[ABSTRACT]We apply category theory to extract multimodal document structure which leads
us to develop information theoretic measures, content summarization and
extension, and self-supervised improvement of large pretrained models. We first
develop a mathematical representation of a document as a category of
question-answer pairs. Second, we develop an orthogonalization procedure to
divide the information contained in one or more documents into non-overlapping
pieces. The structures extracted in the first and second steps lead us to
develop methods to measure and enumerate the information contained in a
document. We also build on those steps to develop new summarization techniques,
as well as to develop a solution to a new problem, viz. exegesis, resulting in an
extension of the original document. Our question-answer pair methodology
enables a novel rate distortion analysis of summarization techniques. We
implement our techniques using large pretrained models, and we propose a
multimodal extension of our overall mathematical framework. Finally, we develop
a novel self-supervised method using RLVR to improve large pretrained models
using consistency constraints such as composability and closure under certain
operations that stem naturally from our category theoretic framework.
[LINK]http://arxiv.org/abs/2510.21553v1
[DATE]2025-10-24 23:12:08+08:00
[CATEGORIES]cs.CL cs.LG
Revisiting Bi-Linear State Transitions in Recurrent Neural Networks
[AUTHORS]M. Reza Ebrahimi, Roland Memisevic
[ABSTRACT]The role of hidden units in recurrent neural networks is typically seen as
modeling memory, with research focusing on enhancing information retention
through gating mechanisms. A less explored perspective views hidden units as
active participants in the computation performed by the network, rather than
passive memory stores. In this work, we revisit bilinear operations, which
involve multiplicative interactions between hidden units and input embeddings.
We demonstrate theoretically and empirically that they constitute a natural
inductive bias for representing the evolution of hidden states in state
tracking tasks. These are the simplest type of tasks that require hidden units
to actively contribute to the behavior of the network. We also show that
bilinear state updates form a natural hierarchy corresponding to state tracking
tasks of increasing complexity, with popular linear recurrent networks such as
Mamba residing at the lowest-complexity center of that hierarchy.
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.21749v2
[DATE]2025-10-24 23:06:33+08:00
[CATEGORIES]cs.LG cs.CL
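To make the bilinear state update from the abstract above concrete, the sketch below writes the next hidden state as a product of the current state and the input embedding, h_next[k] = sum over i, j of W[k][i][j] * h[i] * x[j], and uses it to track parity. The abstract gives no parametrization; the generic third-order tensor form, the parity task, and the hand-set weights `W_PARITY` are assumptions chosen as the simplest state-tracking illustration.

```python
def bilinear_step(W, h, x):
    """One bilinear update: h_next[k] = sum_i sum_j W[k][i][j] * h[i] * x[j]."""
    return [
        sum(W[k][i][j] * h[i] * x[j] for i in range(len(h)) for j in range(len(x)))
        for k in range(len(W))
    ]


# Parity tracking with one-hot states and inputs over {0, 1}.
# Hand-set weights (not learned): W[k][i][j] = 1 exactly when k == i XOR j,
# so each input selects the state transition multiplicatively.
W_PARITY = [
    [[1.0 if k == (i ^ j) else 0.0 for j in range(2)] for i in range(2)]
    for k in range(2)
]


def run_parity(bits):
    """Return the parity (0 or 1) of a bit sequence via repeated bilinear updates."""
    h = [1.0, 0.0]  # one-hot state "even so far"
    for bit in bits:
        x = [1.0, 0.0] if bit == 0 else [0.0, 1.0]
        h = bilinear_step(W_PARITY, h, x)
    return h.index(1.0)


parity = run_parity([1, 0, 1, 1])
```

The hidden state here is not a passive memory: the input selects which permutation is applied to it, which is the "active participant" view the abstract describes.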
InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation
[AUTHORS]Likun Tan, Kuan-Wei Huang, Joy Shi, Kevin Wu
[ABSTRACT]Retrieval-Augmented Generation (RAG) integrates external knowledge to
mitigate hallucinations, yet models often generate outputs inconsistent with
retrieved content. Accurate hallucination detection requires disentangling the
contributions of external context and parametric knowledge, which prior methods
typically conflate. We investigate the mechanisms underlying RAG hallucinations
and find they arise when later-layer FFN modules disproportionately inject
parametric knowledge into the residual stream. To address this, we explore a
mechanistic detection approach based on external context scores and parametric
knowledge scores. Using Qwen3-0.6b, we compute these scores across layers and
attention heads and train regression-based classifiers to predict
hallucinations. Our method is evaluated against state-of-the-art LLMs (GPT-5,
GPT-4.1) and detection baselines (RAGAS, TruLens, RefChecker). Furthermore,
classifiers trained on Qwen3-0.6b signals generalize to GPT-4.1-mini responses,
demonstrating the potential of proxy-model evaluation. Our results highlight
mechanistic signals as efficient, generalizable predictors for hallucination
detection in RAG systems.
[LINK]http://arxiv.org/abs/2510.21538v1
[DATE]2025-10-24 23:02:01+08:00
[CATEGORIES]cs.CL
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
[AUTHORS]Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo
[ABSTRACT]Autoregressive (AR) language models generate text one token at a time, which
limits their inference speed. Diffusion-based language models offer a promising
alternative, as they can decode multiple tokens in parallel. However, we
identify a key bottleneck in current diffusion LMs: the long decoding-window
problem, where tokens generated far from the input context often become
irrelevant or repetitive. Previous solutions like semi-autoregressive (semi-AR)
decoding address this issue by splitting windows into blocks (sacrificing
bidirectionality), but we find that this also leads to a time-interval expansion
problem, sacrificing speed. Therefore, semi-AR eliminates the main advantages of diffusion
models. To overcome this, we propose Convolutional decoding (Conv), a
normalization-based method that narrows the decoding window without hard
segmentation, leading to better fluency and flexibility. Additionally, we
introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme
that better aligns tokens at positions far from context. Our methods achieve
state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval)
among diffusion LM baselines, with significantly lower step size than previous
works, demonstrating both speed and quality improvements.
[COMMENTS]NeurIPS 2025 spotlight
[LINK]http://arxiv.org/abs/2509.15188v3
[DATE]2025-10-24 22:56:21+08:00
[CATEGORIES]cs.CL cs.LG
Brain-tuning Improves Generalizability and Efficiency of Brain Alignment in Speech Models
[AUTHORS]Omer Moussa, Mariya Toneva
[COMMENTS]Published at the 39th Conference on Neural Information Processing
Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2510.21520v1
[DATE]2025-10-24 22:42:19+08:00
[CATEGORIES]cs.CL
Head Pursuit: Probing Attention Specialization in Multimodal Transformers
[AUTHORS]Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
[ABSTRACT]Language and vision-language models have shown impressive performance across
a wide range of tasks, but their internal mechanisms remain only partly
understood. In this work, we study how individual attention heads in
text-generative models specialize in specific semantic or visual attributes.
Building on an established interpretability method, we reinterpret the practice
of probing intermediate activations with the final decoding layer through the
lens of signal processing. This lets us analyze multiple samples in a
principled way and rank attention heads based on their relevance to target
concepts. Our results show consistent patterns of specialization at the head
level across both unimodal and multimodal transformers. Remarkably, we find
that editing as few as 1% of the heads, selected using our method, can reliably
suppress or enhance targeted concepts in the model output. We validate our
approach on language tasks such as question answering and toxicity mitigation,
as well as vision-language tasks including image classification and captioning.
Our findings highlight an interpretable and controllable structure within
attention layers, offering simple tools for understanding and editing
large-scale generative models.
[COMMENTS]Accepted at NeurIPS 2025 (spotlight)
[LINK]http://arxiv.org/abs/2510.21518v1
[DATE]2025-10-24 22:41:47+08:00
[CATEGORIES]cs.CL cs.LG
Deep Literature Survey Automation with an Iterative Workflow
[AUTHORS]Hongbo Zhang, Han Cui, Yidong Wang, Yijian Tian, Qi Guo, Cunxiang Wang, Jian Wu, Chiyu Song, Yue Zhang
[ABSTRACT]Automatic literature survey generation has attracted increasing attention,
yet most existing systems follow a one-shot paradigm, where a large set of
papers is retrieved at once and a static outline is generated before drafting.
This design often leads to noisy retrieval, fragmented structures, and context
overload, ultimately limiting survey quality. Inspired by the iterative reading
process of human researchers, we propose \ours, a framework based on recurrent
outline generation, in which a planning agent incrementally retrieves, reads,
and updates the outline to ensure both exploration and coherence. To provide
faithful paper-level grounding, we design paper cards that distill each paper
into its contributions, methods, and findings, and introduce a
review-and-refine loop with visualization enhancement to improve textual flow
and integrate multimodal elements such as figures and tables. Experiments on
both established and emerging topics show that \ours\ substantially outperforms
state-of-the-art baselines in content coverage, structural coherence, and
citation quality, while producing more accessible and better-organized surveys.
To provide a more reliable assessment of such improvements, we further
introduce Survey-Arena, a pairwise benchmark that complements absolute scoring
and more clearly positions machine-generated surveys relative to human-written
ones. The code is available at
https://github.com/HancCui/IterSurvey_Autosurveyv2.
[COMMENTS]Preprint version
[LINK]http://arxiv.org/abs/2510.21900v1
[DATE]2025-10-24 22:41:26+08:00
[CATEGORIES]cs.CL
Wisdom and Delusion of LLM Ensembles for Code Generation and Repair
[AUTHORS]Fernando Vallecillos Ruiz, Max Hort, Leon Moonen
[ABSTRACT]Today’s pursuit of a single Large Language Model (LLM) for all software
engineering tasks is resource-intensive and overlooks the potential benefits of
complementarity, where different models contribute unique strengths. However,
the degree to which coding LLMs complement each other and the best strategy for
maximizing an ensemble’s potential are unclear, leaving practitioners without a
clear path to move beyond single-model systems.
To address this gap, we empirically compare ten individual LLMs from five
families, and three ensembles of these LLMs across three software engineering
benchmarks covering code generation and program repair. We assess the
complementarity between models and the performance gap between the best
individual model and the ensembles. Next, we evaluate various selection
heuristics to identify correct solutions from an ensemble’s candidate pool.
We find that the theoretical upper bound for an ensemble’s performance can be
83% above the best single model. Our results show that consensus-based
strategies for selecting solutions fall into a “popularity trap,” amplifying
common but incorrect outputs. In contrast, a diversity-based strategy realizes
up to 95% of this theoretical potential, and proves effective even in small
two-model ensembles, enabling a cost-efficient way to enhance performance by
leveraging multiple LLMs.
[LINK]http://arxiv.org/abs/2510.21513v1
[DATE]2025-10-24 22:39:23+08:00
[CATEGORIES]cs.CL cs.LG
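One plausible reading of the "theoretical upper bound" in the abstract above is an oracle that counts a problem as solved if any ensemble member solves it. The abstract does not define the bound, so this definition, the function names, and the toy per-model results below are assumptions for illustration only, not the paper's data.

```python
def pass_rate(solved, num_problems):
    """Fraction of problems a single model solves."""
    return len(solved) / num_problems


def oracle_upper_bound(results, num_problems):
    """Fraction of problems solved by at least one model (oracle selection)."""
    union = set().union(*results.values())
    return len(union) / num_problems


# Hypothetical results: model name -> set of solved problem ids (out of 10).
results = {
    "model_a": {0, 1, 2},
    "model_b": {2, 3},
    "model_c": {4},
}
num_problems = 10

best_single = max(pass_rate(s, num_problems) for s in results.values())
oracle = oracle_upper_bound(results, num_problems)
relative_gain = (oracle - best_single) / best_single  # headroom over the best model
```

The gap between `oracle` and what a concrete selection heuristic actually achieves is the quantity the abstract compares across consensus-based and diversity-based strategies.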
Combining Textual and Structural Information for Premise Selection in Lean
[AUTHORS]Job Petrovčič, David Eliecer Narvaez Denis, Ljupčo Todorovski
[ABSTRACT]Premise selection is a key bottleneck for scaling theorem proving in large
formal libraries. Yet existing language-based methods often treat premises in
isolation, ignoring the web of dependencies that connects them. We present a
graph-augmented approach that combines dense text embeddings of Lean
formalizations with graph neural networks over a heterogeneous dependency graph
capturing both state–premise and premise–premise relations. On the LeanDojo
Benchmark, our method outperforms the ReProver language-based baseline by over
25% across standard retrieval metrics. These results demonstrate the power of
relational information for more effective premise selection.
[LINK]http://arxiv.org/abs/2510.23637v1
[DATE]2025-10-24 22:24:13+08:00
[CATEGORIES]cs.LG cs.CL
HugAgent: Evaluating LLMs in Simulating Individual-Level Human Reasoning on Open-Ended Tasks
[AUTHORS]Chance Jiajie Li, Zhenze Mo, Yuhan Tang, Ao Qu, Jiayi Wu, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Hang Jiang, Paul Pu Liang, Jinhua Zhao, Luis Alberto Alonso Pastor, Kent Larson
[COMMENTS]To appear in NeurIPS 2025 Workshop on Bridging Language, Agent, and
World Models (LAW)
[LINK]http://arxiv.org/abs/2510.15144v2
[DATE]2025-10-24 22:23:35+08:00
[CATEGORIES]cs.CL
MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization
[AUTHORS]Chenglong Wang, Yang Gan, Hang Zhou, Chi Hu, Yongyu Mu, Kai Song, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao
[ABSTRACT]Recent advances in diffusion language models (DLMs) have presented a
promising alternative to traditional autoregressive large language models
(LLMs). However, DLMs still lag behind LLMs in reasoning performance,
especially as the number of denoising steps decreases. Our analysis reveals
that this shortcoming arises primarily from the independent generation of
masked tokens across denoising steps, which fails to capture the token
correlation. In this paper, we define two types of token correlation:
intra-sequence correlation and inter-sequence correlation, and demonstrate that
enhancing these correlations improves reasoning performance. To this end, we
propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to
consider the token correlation during the denoising process. More specifically,
our MRO approach leverages test-time scaling, reject sampling, and
reinforcement learning to directly optimize the token correlation with multiple
elaborate rewards. Additionally, we introduce group step and importance
sampling strategies to mitigate reward variance and enhance sampling
efficiency. Through extensive experiments, we demonstrate that MRO not only
improves reasoning performance but also achieves significant sampling speedups
while maintaining high performance on reasoning benchmarks.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21473v1
[DATE]2025-10-24 21:57:59+08:00
[CATEGORIES]cs.CL
How Does Sequence Modeling Architecture Influence Base Capabilities of Pre-trained Language Models? Exploring Key Architecture Design Principles to Avoid Base Capabilities Degradation
[AUTHORS]Xin Lu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin, Ting Liu
[ABSTRACT]Pre-trained language models represented by the Transformer have been proven
to possess strong base capabilities, and the representative self-attention
mechanism in the Transformer has become a classic in sequence modeling
architectures. Unlike prior work that proposes sequence modeling
architectures to improve the efficiency of the attention mechanism, this work
focuses on the impact of sequence modeling architectures on base capabilities.
Specifically, our concern is: How exactly do sequence modeling architectures
affect the base capabilities of pre-trained language models? In this work, we
first point out that the mixed domain pre-training setting commonly adopted in
existing architecture design works fails to adequately reveal the differences
in base capabilities among various architectures. To address this, we propose a
limited domain pre-training setting with out-of-distribution testing, which
successfully uncovers significant differences in base capabilities among
architectures at an early stage. Next, we analyze the base capabilities of
stateful sequence modeling architectures, and find that they exhibit
significant degradation in base capabilities compared to the Transformer. Then,
through a series of architecture component analyses, we summarize a key
architecture design principle: A sequence modeling architecture needs to possess
full-sequence arbitrary selection capability to avoid degradation in base
capabilities. Finally, we empirically validate this principle using an
extremely simple Top-1 element selection architecture and further generalize it
to a more practical Top-1 chunk selection architecture. Experimental results
demonstrate our proposed sequence modeling architecture design principle and
suggest that our work can serve as a valuable reference for future architecture
improvements and novel designs.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.18522v2
[DATE]2025-10-24 21:51:13+08:00
[CATEGORIES]cs.CL
LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models
[AUTHORS]Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, Xipeng Qiu
[ABSTRACT]Vision-Language-Action (VLA) models report impressive success rates on
robotic manipulation benchmarks, yet these results may mask fundamental
weaknesses in robustness. We perform a systematic vulnerability analysis by
introducing controlled perturbations across seven dimensions: object layout,
camera viewpoints, robot initial states, language instructions, light
conditions, background textures and sensor noise. We comprehensively analyzed
multiple state-of-the-art models and revealed consistent brittleness beneath
apparent competence. Our analysis exposes critical weaknesses: models exhibit
extreme sensitivity to perturbation factors, including camera viewpoints and
robot initial states, with performance dropping from 95% to below 30% under
modest perturbations. Surprisingly, models are largely insensitive to language
variations, with further experiments revealing that models tend to ignore
language instructions completely. Our findings challenge the assumption that
high benchmark scores equate to true competency and highlight the need for
evaluation practices that assess reliability under realistic variation.
[LINK]http://arxiv.org/abs/2510.13626v2
[DATE]2025-10-24 21:50:04+08:00
[CATEGORIES]cs.CL
Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
[AUTHORS]Sam O’Connor Russell, Naomi Harte
[COMMENTS]Accepted to ACL 2025, Findings of the Association for Computational
Linguistics
[LINK]http://arxiv.org/abs/2505.21043v2
[DATE]2025-10-24 21:49:54+08:00
[CATEGORIES]cs.CL
SBASH: a Framework for Designing and Evaluating RAG vs. Prompt-Tuned LLM Honeypots
[AUTHORS]Adetayo Adebimpe, Helmut Neukirchen, Thomas Welsh
[ABSTRACT]Honeypots are decoy systems used for gathering valuable threat intelligence
or diverting attackers away from production systems. Maximising attacker
engagement is essential to their utility. However, research has highlighted that
context-awareness, such as the ability to respond to new attack types, systems
and attacker agents, is necessary to increase engagement. Large Language Models
(LLMs) have been shown to be one way to increase context awareness, but they
suffer from several challenges, including response accuracy and timeliness,
high operational costs and data-protection issues due to cloud deployment. We
propose the System-Based Attention Shell Honeypot (SBASH) framework which
manages data-protection issues through the use of lightweight local LLMs. We
investigate the use of Retrieval Augmented Generation (RAG) supported LLMs and
non-RAG LLMs for Linux shell commands and evaluate them using several different
metrics such as response time differences, realism from human testers, and
similarity to a real system calculated with Levenshtein distance, SBert, and
BertScore. We show that RAG improves accuracy for untuned models. Models tuned
via a system prompt that tells the LLM to respond like a Linux system achieve,
without RAG, accuracy similar to that of untuned models with RAG, while having
slightly lower latency.
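As a rough illustration of the Levenshtein-based similarity mentioned above, the following sketch computes the edit distance between a honeypot response and the output of a real system. The normalization into a 0 to 1 similarity score is an assumption made here for illustration; the abstract does not state how the distance is scaled, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b` (row-by-row dynamic program)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 when every
    character differs. Not necessarily the scaling used in the paper."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    real = "bash: foo: command not found"
    fake = "bash: foo: command not found."
    print(levenshtein(real, fake), round(similarity(real, fake), 3))
```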
[COMMENTS]to be published in: The 3rd International Conference on Foundation
and Large Language Models (FLLM2025), IEEE, 2025
[LINK]http://arxiv.org/abs/2510.21459v1
[DATE]2025-10-24 21:41:52+08:00
[CATEGORIES]cs.CL cs.LG
A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models
[AUTHORS]Hongming Tan, Shaoxiong Zhan, Fengwei Jia, Hai-Tao Zheng, Wai Kin Chan
[ABSTRACT]Measuring scientific paper innovation is both important and challenging.
Existing content-based methods often overlook the full-paper context, fail to
capture the full scope of innovation, and lack generalization. We propose
HSPIM, a hierarchical and training-free framework based on large language
models (LLMs). It introduces a Paper-to-Sections-to-QAs decomposition to assess
innovation. We segment the text by section titles and use zero-shot LLM
prompting to implement section classification, question-answering (QA)
augmentation, and weighted innovation scoring. The generated QA pair focuses on
section-level innovation and serves as additional context to improve the LLM
scoring. For each chunk, the LLM outputs a novelty score and a confidence
score. We use confidence scores as weights to aggregate novelty scores into a
paper-level innovation score. To further improve performance, we propose a
two-layer question structure consisting of common and section-specific
questions, and apply a genetic algorithm to optimize the question-prompt
combinations. Furthermore, under the fine-grained structure of innovation, we
extend HSPIM to an HSPIM$^+$ that generates novelty, contribution, and
feasibility scores with respective confidence scores. Comprehensive experiments
on scientific conference paper datasets show that HSPIM outperforms baseline
methods in effectiveness, generalization, and interpretability. Demo code is
available at https://github.com/Jasaxion/HSPIM.
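The confidence-weighted aggregation described above can be sketched as a weighted mean. This is a minimal sketch under stated assumptions: the abstract says only that confidence scores act as weights, so the normalization by total confidence, the score ranges, and the name `paper_innovation_score` are hypothetical choices made here.

```python
def paper_innovation_score(chunks):
    """Aggregate per-chunk (novelty, confidence) pairs into one score.

    Assumes a normalized weighted mean: chunks the LLM is more confident
    about contribute more to the paper-level score.
    """
    total_weight = sum(confidence for _, confidence in chunks)
    if total_weight == 0:
        raise ValueError("at least one chunk needs a non-zero confidence")
    weighted = sum(novelty * confidence for novelty, confidence in chunks)
    return weighted / total_weight


if __name__ == "__main__":
    # Illustrative values only: (novelty score, confidence score) per chunk.
    chunks = [(7.0, 0.9), (4.0, 0.3), (8.0, 0.6)]
    print(round(paper_innovation_score(chunks), 3))
```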
[LINK]http://arxiv.org/abs/2504.14620v2
[DATE]2025-10-24 21:28:52+08:00
[CATEGORIES]cs.CL
REMONI: An Autonomous System Integrating Wearables and Multimodal Large Language Models for Enhanced Remote Health Monitoring
[AUTHORS]Thanh Cong Ho, Farah Kharrat, Abderrazek Abid, Fakhri Karray
[ABSTRACT]With the widespread adoption of wearable devices in our daily lives, the
demand and appeal for remote patient monitoring have significantly increased.
Most research in this field has concentrated on collecting sensor data,
visualizing it, and analyzing it to detect anomalies in specific diseases such
as diabetes, heart disease and depression. However, this domain has a notable
gap in the aspect of human-machine interaction. This paper proposes REMONI, an
autonomous REmote health MONItoring system that integrates multimodal large
language models (MLLMs), the Internet of Things (IoT), and wearable devices.
The system automatically and continuously collects vital signs, accelerometer
data from a special wearable (such as a smartwatch), and visual data in patient
video clips collected from cameras. This data is processed by an anomaly
detection module, which includes a fall detection model and algorithms to
identify and alert caregivers of the patient’s emergency conditions. A
distinctive feature of our proposed system is the natural language processing
component, developed with MLLMs capable of detecting and recognizing a
patient’s activity and emotion while responding to healthcare workers’
inquiries. Additionally, prompt engineering is employed to integrate all
patient information seamlessly. As a result, doctors and nurses can access
real-time vital signs and the patient’s current state and mood by interacting
with an intelligent agent through a user-friendly web application. Our
experiments demonstrate that our system is implementable and scalable for
real-life scenarios, potentially reducing the workload of medical professionals
and healthcare costs. A full-fledged prototype illustrating the functionalities
of the system has been developed and is being tested to demonstrate the
robustness of its various capabilities.
[LINK]http://arxiv.org/abs/2510.21445v1
[DATE]2025-10-24 21:23:38+08:00
[CATEGORIES]cs.CL cs.LG
Redefining Retrieval Evaluation in the Era of LLMs
[AUTHORS]Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, Fabrizio Silvestri
[ABSTRACT]Traditional Information Retrieval (IR) metrics, such as nDCG, MAP, and MRR,
assume that human users sequentially examine documents with diminishing
attention to lower ranks. This assumption breaks down in Retrieval Augmented
Generation (RAG) systems, where search results are consumed by Large Language
Models (LLMs), which, unlike humans, process all retrieved documents as a whole
rather than sequentially. Additionally, traditional IR metrics do not account
for related but irrelevant documents that actively degrade generation quality,
rather than merely being ignored. Due to these two major misalignments, namely
human vs. machine position discount and human relevance vs. machine utility,
classical IR metrics do not accurately predict RAG performance. We introduce a
utility-based annotation schema that quantifies both the positive contribution
of relevant passages and the negative impact of distracting ones. Building on
this foundation, we propose UDCG (Utility and Distraction-aware Cumulative
Gain), a metric using an LLM-oriented positional discount to directly optimize
the correlation with the end-to-end answer accuracy. Experiments on five
datasets and six LLMs demonstrate that UDCG improves correlation by up to 36%
compared to traditional metrics. Our work provides a critical step toward
aligning IR evaluation with LLM consumers and enables more reliable assessment
of RAG components.
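For context, the human-oriented positional discount that the abstract argues against is the one built into standard nDCG, sketched below. This shows only the traditional metric; the UDCG formula itself is not given in the abstract and is not reproduced here.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: the gain at rank r is divided by
    log2(r + 1), so lower-ranked documents count for less."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # The same relevant document scores very differently depending on rank,
    # even though an LLM reads the whole retrieved set at once.
    print(ndcg([1, 0, 0]), ndcg([0, 0, 1]))
```

Note that a distracting passage can only receive a gain of zero here, which is the second misalignment the abstract describes: the metric cannot express that a document actively hurts generation.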
[LINK]http://arxiv.org/abs/2510.21440v1
[DATE]2025-10-24 21:17:00+08:00
[CATEGORIES]cs.CL
Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
[AUTHORS]Abderrazek Abid, Thanh-Cong Ho, Fakhri Karray
[ABSTRACT]As generative AI continues to evolve, Vision Language Models (VLMs) have
emerged as promising tools in various healthcare applications. One area that
remains relatively underexplored is their use in human activity recognition
(HAR) for remote health monitoring. VLMs offer notable strengths, including
greater flexibility and the ability to overcome some of the constraints of
traditional deep learning models. However, a key challenge in applying VLMs to
HAR lies in the difficulty of evaluating their dynamic and often
non-deterministic outputs. To address this gap, we introduce a descriptive
caption data set and propose comprehensive evaluation methods to evaluate VLMs
in HAR. Through comparative experiments with state-of-the-art deep learning
models, our findings demonstrate that VLMs achieve comparable performance and,
in some cases, even surpass conventional approaches in terms of accuracy. This
work contributes a strong benchmark and opens new possibilities for the
integration of VLMs into intelligent healthcare systems.
[LINK]http://arxiv.org/abs/2510.21424v1
[DATE]2025-10-24 21:04:13+08:00
[CATEGORIES]cs.CL cs.LG
zip2zip: Inference-Time Adaptive Tokenization via Online Compression
[AUTHORS]Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
[ABSTRACT]Tokenization efficiency plays a critical role in the performance and cost of
large language models (LLMs), yet most models rely on static tokenizers
optimized on general-purpose corpora. These tokenizers’ fixed vocabularies
often fail to adapt to domain- or language-specific inputs, leading to longer
token sequences and higher computational costs. We introduce zip2zip, a novel
method for achieving context-adaptive tokenization in LLMs at inference time.
Leveraging an online data compression algorithm (Lempel-Ziv-Welch), zip2zip
dynamically expands its active vocabulary at inference time by continuously
replacing fragmented token sequences with more compact hypertokens, which it
can immediately output during generation. In doing so, the model refines its
internal tokenization scheme to match the token distribution of the current
context, reducing redundancy and improving representational efficiency. zip2zip
consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch
compression that incrementally merges co-occurring tokens into reusable
hypertokens on the fly; (2) a dynamic embedding (and unembedding) layer that
computes embeddings for newly formed hypertokens at runtime; and (3) a variant
of autoregressive language modeling that pretrains the model to handle
hypertokenized, compressed text sequences as inputs and outputs. We show that
an existing LLM can be uptrained for zip2zip in 10 GPU-hours via
parameter-efficient finetuning. The resulting LLM performs test-time
adaptation, learning to use hypertokens in unseen contexts and reducing input
and output tokens by 15-40%.
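The core idea of LZW-style merging over token IDs can be sketched as follows. This is a textbook LZW pass applied to a token sequence, not the zip2zip implementation: the function name, the `vocab_size` parameter, and the choice to number hypertokens directly after the base vocabulary are assumptions for illustration, and the dynamic embedding layer is not modeled.

```python
def lzw_compress(tokens, vocab_size):
    """Compress a sequence of base token IDs with LZW.

    Every time a new multi-token sequence is seen it is registered as a
    hypertoken with a fresh ID (starting at `vocab_size`), so repeated
    patterns later in the same context are emitted as a single ID.
    """
    table = {}            # tuple of base token IDs -> hypertoken ID
    next_id = vocab_size  # first ID not used by the base vocabulary
    output = []
    current = ()

    def emit(seq):
        # Single tokens keep their base ID; longer runs use the table.
        output.append(seq[0] if len(seq) == 1 else table[seq])

    for token in tokens:
        candidate = current + (token,)
        if len(candidate) == 1 or candidate in table:
            current = candidate          # keep extending the match
        else:
            emit(current)                # flush the longest known match
            table[candidate] = next_id   # learn the new, longer sequence
            next_id += 1
            current = (token,)
    if current:
        emit(current)
    return output, table


if __name__ == "__main__":
    compressed, table = lzw_compress([1, 2, 1, 2, 1, 2], vocab_size=10)
    print(compressed, table)
```

Because the table is rebuilt from the sequence itself, a decoder that follows the same rules can reconstruct it without any side channel, which is what makes the scheme usable online at inference time.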
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.01084v2
[DATE]2025-10-24 20:53:34+08:00
[CATEGORIES]cs.CL cs.LG
ReDit: Reward Dithering for Improved LLM Policy Optimization
[AUTHORS]Chenxing Wei, Jiarui Yu, Ying Tiffany He, Hande Dong, Yao Shu, Fei Yu
[ABSTRACT]DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning
capabilities through its rule-based reward system. While it’s a “perfect”
reward system that effectively mitigates reward hacking, such reward functions
are often discrete. Our experimental observations suggest that discrete rewards
can lead to gradient anomaly, unstable optimization, and slow convergence. To
address this issue, we propose ReDit (Reward Dithering), a method that dithers
the discrete reward signal by adding simple random noise. With this perturbed
reward, exploratory gradients are continuously provided throughout the learning
process, enabling smoother gradient updates and accelerating convergence. The
injected noise also introduces stochasticity into flat reward regions,
encouraging the model to explore novel policies and escape local optima.
Experiments across diverse tasks demonstrate the effectiveness and efficiency
of ReDit. On average, ReDit achieves performance comparable to vanilla GRPO
with only approximately 10% of the training steps, and furthermore, still exhibits
a 4% performance improvement over vanilla GRPO when trained for a similar
duration. Visualizations confirm significant mitigation of gradient issues with
ReDit. Moreover, theoretical analyses are provided to further validate these
advantages.
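One way to see why discrete rewards can stall learning is a simplified GRPO-style group normalization: when every sampled response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal. The sketch below adds zero-mean Gaussian noise to the rewards; the noise distribution, the `sigma` value, and the function names are assumptions for illustration, since the abstract says only that simple random noise is added.

```python
import random
import statistics


def group_advantages(rewards):
    """Simplified group-normalized advantages: (r - mean) / std.
    Returns all zeros when the rewards in the group are identical."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def dither(rewards, sigma=0.05, rng=None):
    """Perturb each discrete reward with zero-mean Gaussian noise.
    `sigma` is a hypothetical noise scale, not a value from the paper."""
    if rng is None:
        rng = random.Random(0)
    return [r + rng.gauss(0.0, sigma) for r in rewards]


if __name__ == "__main__":
    rewards = [1, 1, 1, 1]  # every sample solved the task
    print(group_advantages(rewards))
    print(group_advantages(dither(rewards)))
```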
[COMMENTS]34 pages, 19 figures
[LINK]http://arxiv.org/abs/2506.18631v4
[DATE]2025-10-24 20:32:00+08:00
[CATEGORIES]cs.LG cs.CL
Reverse Engineering Human Preferences with Reinforcement Learning
[AUTHORS]Lisa Alazraki, Tan Yi-Chern, Jon Ander Campos, Maximilian Mozes, Marek Rei, Max Bartolo
[ABSTRACT]The capabilities of Large Language Models (LLMs) are routinely evaluated by
other LLMs trained to predict human preferences. This framework, known as
LLM-as-a-judge, is highly scalable and relatively low cost. However, it is also
vulnerable to malicious exploitation, as LLM responses can be tuned to overfit
the preferences of the judge. Previous work shows that the answers generated by
a candidate-LLM can be edited post hoc to maximise the score assigned to them
by a judge-LLM. In this study, we adopt a different approach and use the signal
provided by judge-LLMs as a reward to adversarially tune models that generate
text preambles designed to boost downstream performance. We find that frozen
LLMs pipelined with these models attain higher LLM-evaluation scores than
existing frameworks. Crucially, unlike other frameworks which intervene
directly on the model’s response, our method is virtually undetectable. We also
demonstrate that the effectiveness of the tuned preamble generator transfers
when the candidate-LLM and the judge-LLM are replaced with models that are not
used during training. These findings raise important questions about the design
of more reliable LLM-as-a-judge evaluation settings. They also demonstrate that
human preferences can be reverse engineered effectively, by pipelining LLMs to
optimise upstream preambles via reinforcement learning, an approach that could
find future applications in diverse tasks and domains beyond adversarial
attacks.
[COMMENTS]NeurIPS 2025 (Spotlight)
[LINK]http://arxiv.org/abs/2505.15795v2
[DATE]2025-10-24 20:30:19+08:00
[CATEGORIES]cs.CL
AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
[AUTHORS]Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, Ofir Press
[ABSTRACT]Despite progress in language model (LM) capabilities, evaluations have thus
far focused on models’ performance on tasks that humans have previously solved,
including in programming (Jimenez et al., 2024) and mathematics (Glazer et al.,
2024). We therefore propose testing models’ ability to design and implement
algorithms in an open-ended benchmark: We task LMs with writing code that
efficiently solves computationally challenging problems in computer science,
physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks
collected from domain experts and a framework for validating and timing
LM-synthesized solution code, which is compared to reference implementations
from popular open-source packages. In addition, we develop a baseline LM agent,
AlgoTuner, and evaluate its performance across a suite of frontier models.
AlgoTuner uses a simple, budgeted loop that edits code, compiles and runs it,
profiles performance, verifies correctness on tests, and selects the fastest
valid version. AlgoTuner achieves an average 1.72x speedup against our
reference solvers, which use libraries such as SciPy, sk-learn and CVXPY.
However, we find that current models fail to discover algorithmic innovations,
instead preferring surface-level optimizations. We hope that AlgoTune catalyzes
the development of LM agents exhibiting creative problem solving beyond
state-of-the-art human performance.
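The "verify correctness, time it, keep the fastest valid version" part of the loop described above can be sketched as follows. This is a minimal sketch, not the AlgoTuner agent: the LM editing step is omitted, and the candidate functions, the sorting task, and the name `select_fastest_valid` are hypothetical stand-ins.

```python
import time


def reference_sort(xs):
    """Stand-in for a reference solver from an open-source package."""
    return sorted(xs)


def candidate_bubble(xs):
    """A valid but slow candidate."""
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs


def candidate_broken(xs):
    """A fast candidate that returns the input unsorted (invalid)."""
    return list(xs)


def select_fastest_valid(candidates, reference, inputs, repeats=3):
    """Return (name, seconds) of the fastest candidate whose outputs match
    the reference on every input, or (None, inf) if none is valid."""
    best_name, best_time = None, float("inf")
    for name, fn in candidates.items():
        # Correctness gate: a wrong candidate is rejected however fast it is.
        if any(fn(list(x)) != reference(list(x)) for x in inputs):
            continue
        start = time.perf_counter()
        for _ in range(repeats):
            for x in inputs:
                fn(list(x))
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time


if __name__ == "__main__":
    inputs = [[3, 1, 2], list(range(200, 0, -1))]
    candidates = {
        "builtin": sorted,
        "bubble": candidate_bubble,
        "broken": candidate_broken,
    }
    print(select_fastest_valid(candidates, reference_sort, inputs))
```

Checking correctness before timing matters: without the gate, the loop would reward candidates that are fast only because they skip the work.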
[LINK]http://arxiv.org/abs/2507.15887v4
[DATE]2025-10-24 20:16:20+08:00
[CATEGORIES]cs.CL cs.LG
GoRA: Gradient-driven Adaptive Low Rank Adaptation
[AUTHORS]Haonan He, Peng Ye, Yuchen Ren, Yuan Yuan, Luyang Zhou, Shucun Ju, Lei Chen
[ABSTRACT]Low-Rank Adaptation (LoRA) is a crucial method for efficiently fine-tuning
large language models (LLMs), with its effectiveness influenced by two key
factors: rank selection and weight initialization. While numerous LoRA variants
have been proposed to improve performance by addressing one of these aspects,
they often compromise usability or computational efficiency. In this paper, we
analyze and identify the core limitations of existing approaches and propose a
novel framework, GoRA (Gradient-driven Adaptive Low Rank Adaptation), that
simultaneously adapts both the rank and initialization strategy within a
unified framework. GoRA leverages gradient information during training to
dynamically assign optimal ranks and initialize low-rank adapter weights in an
adaptive manner. To our knowledge, GoRA is the first method that not only
addresses the limitations of prior approaches, which often focus on either rank
selection or initialization in isolation, but also unifies both aspects within
a single framework, enabling more effective and efficient adaptation. Extensive
experiments across various architectures and modalities show that GoRA
consistently outperforms existing LoRA-based methods while preserving the
efficiency of vanilla LoRA. For example, when fine-tuning Llama3.1-8B-Base for
mathematical reasoning, GoRA achieves a 5.13-point improvement over standard
LoRA and even outperforms full fine-tuning by 2.05 points under high-rank
settings. Code is available at: https://github.com/hhnqqq/MyTransformers.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.12171v3
[DATE]2025-10-24 20:16:12+08:00
[CATEGORIES]cs.LG cs.CL
Supporting Online Discussions: Integrating AI Into the adhocracy+ Participation Platform To Enhance Deliberation
[AUTHORS]Maike Behrendt, Stefan Sylvius Wagner, Mira Warne, Jana Leonie Peters, Marc Ziegele, Stefan Harmeling
[ABSTRACT]Online spaces provide individuals with the opportunity to engage in
discussions on important topics and make collective decisions, regardless of
their geographic location or time zone. However, without adequate support and
careful design, such discussions often suffer from a lack of structure and
civility in the exchange of opinions. Artificial intelligence (AI) offers a
promising avenue for helping both participants and organizers in managing
large-scale online participation processes. This paper introduces an extension
of adhocracy+, a large-scale open-source participation platform. Our extension
features two AI-supported debate modules designed to improve discussion quality
and foster participant interaction. In a large-scale user study we examined the
effects and usability of both modules. We report our findings in this paper.
The extended platform is available at https://github.com/mabehrendt/discuss2.0.
[LINK]http://arxiv.org/abs/2409.07780v2
[DATE]2025-10-24 20:01:36+08:00
[CATEGORIES]cs.CL
Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation
[AUTHORS]Alexandra Apostolopoulou, Konstantinos Kanaris, Athanasios Koursaris, Dimitris Tsakalidis, George Domalis, Ioannis E. Livieris
[ABSTRACT]The advancement of natural language processing for morphologically rich and
moderately-resourced languages like Modern Greek has been hindered by
architectural stagnation, data scarcity, and limited context processing
capabilities, particularly in specialized domains such as law. In this work, we
propose the Greek Embedding Models (GEMs), a new family of transformer-based
language models, specifically developed to address these limitations through
architectural diversity and enhanced data curation. The proposed family of
models is trained on several large-scale, meticulously curated corpora,
encompassing both comprehensive general-domain datasets and specialized legal
collections, addressing the persistent data scarcity that has impeded Greek
language modeling advancement. The proposed quality-based corpus curation
methodology incorporates extensive preprocessing pipelines, sophisticated
deduplication strategies and targeted repetition of high-quality legal
sub-corpora to enhance domain adaptation. The GEMs family comprises both
established architectures (RoBERTa and Longformer) and advanced models not
previously applied to Greek (ELECTRA, ConvBERT, and ModernBERT), providing
comprehensive coverage of modern transformer designs. Additionally, we
introduce the first bilingual Greek-English embedding models tailored for
cross-lingual legal applications. Comprehensive evaluation across three core
natural language understanding benchmarks demonstrates that the proposed
GEM-RoBERTa and GEM-ConvBERT achieve statistically significant performance
improvements over established state-of-the-art models, with accuracy gains of
up to 3.6%, while statistical analysis using Friedman Aligned-Ranks
and Finner post-hoc tests confirms the superiority of our approach across
multiple evaluation metrics.
[COMMENTS]The manuscript is submitted to Applied Sciences
[LINK]http://arxiv.org/abs/2510.20002v2
[DATE]2025-10-24 19:58:25+08:00
[CATEGORIES]cs.CL
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees
[AUTHORS]Weibin Liao, Xu Chu, Yasha Wang
[ABSTRACT]In the domain of complex reasoning tasks, such as mathematical reasoning,
recent advancements have proposed the use of Direct Preference Optimization
(DPO) to suppress output of dispreferred responses, thereby enhancing the
long-chain reasoning capabilities of large language models (LLMs). To this end,
these studies employed LLMs to generate preference trees via Tree-of-thoughts
(ToT) and sample the paired preference responses required by the DPO algorithm.
However, the DPO algorithm based on binary preference optimization is unable to
learn multiple responses with varying degrees of preference/dispreference
provided by the preference trees, resulting in incomplete preference learning.
In this work, we introduce Tree Preference Optimization (TPO), which does not
sample paired preference responses from the preference tree; instead, it
directly learns from the entire preference tree during the fine-tuning.
Specifically, TPO formulates the language model alignment as a Preference List
Ranking problem, where the policy can potentially learn more effectively from a
ranked preference list of responses given the prompt. In addition, to further
assist LLMs in identifying discriminative steps within long-chain reasoning and
increase the relative reward margin in the preference list, TPO utilizes
Adaptive Step Reward to adjust the reward values of each step in the trajectory for
performing fine-grained preference optimization. We carry out extensive
experiments on mathematical reasoning tasks to evaluate TPO. The experimental
results indicate that TPO consistently outperforms DPO across five public large
language models on four datasets. Our code is publicly available at
https://github.com/MrBlankness/TPO.git.
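The contrast between binary and list-wise preference learning can be sketched with two losses. The first is the standard DPO pairwise loss on policy-versus-reference log-ratios. The second is a generic Plackett-Luce list-wise loss, included here only as an assumed stand-in for a ranking objective: the abstract does not give TPO's exact loss or its Adaptive Step Reward, so neither is reproduced, and the function names are hypothetical.

```python
import math


def dpo_pair_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is log pi(y|x) - log pi_ref(y|x) for that response.
    The loss is -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    """
    margin = beta * (logratio_chosen - logratio_rejected)
    return math.log1p(math.exp(-margin))


def plackett_luce_loss(scores_best_first):
    """Negative log-likelihood of a full ranking under Plackett-Luce.

    `scores_best_first` holds one score per response, ordered from most
    to least preferred. Every response contributes, not just one pair.
    """
    loss = 0.0
    for i, score in enumerate(scores_best_first):
        remaining = scores_best_first[i:]
        log_norm = math.log(sum(math.exp(s) for s in remaining))
        loss += log_norm - score
    return loss


if __name__ == "__main__":
    print(dpo_pair_loss(1.0, -1.0))
    print(plackett_luce_loss([2.0, 1.0, 0.0]))  # scores agree with ranking
    print(plackett_luce_loss([0.0, 1.0, 2.0]))  # scores contradict ranking
```

With a list of three or more responses, a pairwise loss has to discard or sample pairs, whereas the list-wise form uses the whole ordering in a single term per position.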
[COMMENTS]Accepted by ICLR 2025
[LINK]http://arxiv.org/abs/2410.12854v3
[DATE]2025-10-24 19:56:39+08:00
[CATEGORIES]cs.CL
HalleluBERT: Let every token that has meaning bear its weight
[AUTHORS]Raphael Scheible-Schmitt
[ABSTRACT]Transformer-based models have advanced NLP, yet Hebrew still lacks an
extensively trained large-scale RoBERTa encoder. Existing models such
as HeBERT, AlephBERT, and HeRo are limited by corpus size, vocabulary, or
training depth. We present HalleluBERT, a RoBERTa-based encoder family (base
and large) trained from scratch on 49.1 GB of deduplicated Hebrew web text and
Wikipedia with a Hebrew-specific byte-level BPE vocabulary. Evaluated on NER
and sentiment classification benchmarks, HalleluBERT outperforms both
monolingual and multilingual baselines. HalleluBERT sets a new state of the art
for Hebrew and highlights the benefits of fully converged monolingual
pretraining.
[LINK]http://arxiv.org/abs/2510.21372v1
[DATE]2025-10-24 19:52:29+08:00
[CATEGORIES]cs.CL
HIKMA: Human-Inspired Knowledge by Machine Agents through a Multi-Agent Framework for Semi-Autonomous Scientific Conferences
[AUTHORS]Zain Ul Abideen Tariq, Mahmood Al-Zubaidi, Uzair Shah, Marco Agus, Mowafa Househ
[ABSTRACT]HIKMA Semi-Autonomous Conference is the first experiment in reimagining
scholarly communication through an end-to-end integration of artificial
intelligence into the academic publishing and presentation pipeline. This paper
presents the design, implementation, and evaluation of the HIKMA framework,
which includes AI dataset curation, AI-based manuscript generation, AI-assisted
peer review, AI-driven revision, AI conference presentation, and AI archival
dissemination. By combining language models, structured research workflows, and
domain safeguards, HIKMA shows how AI can support, not replace, traditional
scholarly practices while maintaining intellectual property protection,
transparency, and integrity. The conference functions as a testbed and proof of
concept, providing insights into the opportunities and challenges of AI-enabled
scholarship. It also examines questions about AI authorship, accountability,
and the role of human-AI collaboration in research.
[LINK]http://arxiv.org/abs/2510.21370v1
[DATE]2025-10-24 19:52:24+08:00
[CATEGORIES]cs.CL
Magical: Medical Lay Language Generation via Semantic Invariance and Layperson-tailored Adaptation
[AUTHORS]Weibin Liao, Tianlong Wang, Yinghao Zhu, Yasha Wang, Junyi Gao, Liantao Ma
[ABSTRACT]Medical Lay Language Generation (MLLG) plays a vital role in improving the
accessibility of complex scientific content for broader audiences. Recent
work on MLLG commonly employs parameter-efficient fine-tuning methods such
as Low-Rank Adaptation (LoRA) to fine-tune large language models (LLMs) using
paired expert-lay language datasets. However, LoRA struggles with the
challenges posed by multi-source heterogeneous MLLG datasets. Specifically,
through a series of exploratory experiments, we reveal that standard LoRA fails
to meet the requirements for semantic fidelity and diverse lay-style generation
in the MLLG task. To address these limitations, we propose Magical, an asymmetric
LoRA architecture tailored for MLLG under heterogeneous data scenarios. Magical
employs a shared matrix $A$ for abstractive summarization, along with multiple
isolated matrices $B$ for diverse lay-style generation. To preserve semantic
fidelity during the lay language generation process, Magical introduces a
Semantic Invariance Constraint to mitigate semantic subspace shifts on matrix
$A$. Furthermore, to better adapt to diverse lay-style generation, Magical
incorporates the Recommendation-guided Switch, an external interface to
prompt the LLM to switch between different matrices $B$. Experimental results
on three real-world lay language generation datasets demonstrate that Magical
consistently outperforms prompt-based methods, vanilla LoRA, and its recent
variants, while also reducing trainable parameters by 31.66%. Our code is
publicly available at https://github.com/tianlwang/Magical.git.
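The shared-$A$, multiple-$B$ layout described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `AsymmetricLoRA`, the `style` index, and the initialization scale are hypothetical, and the Semantic Invariance Constraint and Recommendation-guided Switch are omitted. It only shows one down-projection shared across several style-specific up-projections.

```python
import random


class AsymmetricLoRA:
    """Illustrative low-rank update with one shared A and several B heads."""

    def __init__(self, d_in, d_out, rank, num_styles, seed=0):
        rng = random.Random(seed)
        # Shared down-projection A (rank x d_in), small random init (assumed scale).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # One up-projection B per style (d_out x rank), zero-initialized as in
        # standard LoRA so that the update starts at zero.
        self.B = [
            [[0.0] * rank for _ in range(d_out)] for _ in range(num_styles)
        ]

    def project(self, x):
        # Shared low-rank representation h = A x.
        return [sum(a * xi for a, xi in zip(row, x)) for row in self.A]

    def delta(self, x, style):
        # Style-specific update B_style (A x), to be added to the frozen layer output.
        h = self.project(x)
        return [sum(b * hi for b, hi in zip(row, h)) for row in self.B[style]]
```

Selecting a different `style` changes only which $B$ is applied; the shared $A$ and the frozen base weights stay the same.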
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2508.08730v2
[DATE]2025-10-24 19:50:54+08:00
[CATEGORIES]cs.CL
SindBERT, the Sailor: Charting the Seas of Turkish NLP
[AUTHORS]Raphael Scheible-Schmitt, Stefan Schweter
[ABSTRACT]Transformer models have revolutionized NLP, yet many morphologically rich
languages remain underrepresented in large-scale pre-training efforts. With
SindBERT, we set out to chart the seas of Turkish NLP, providing the first
large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB
of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base
and large configurations, representing the first large-scale encoder-only
language model available for Turkish. We evaluate SindBERT on part-of-speech
tagging, named entity recognition, offensive language detection, and the
TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT
performs competitively with existing Turkish and multilingual models, with the
large variant achieving the best scores in two of four tasks but showing no
consistent scaling advantage overall. This flat scaling trend, also observed
for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be
saturated. At the same time, comparisons with smaller but more curated models
such as BERTurk highlight that corpus quality and diversity can outweigh sheer
data volume. Taken together, SindBERT contributes both as an openly released
resource for Turkish NLP and as an empirical case study on the limits of
scaling and the central role of corpus composition in morphologically rich
languages. The SindBERT models are released under the MIT license and made
available in both fairseq and Huggingface formats.
[LINK]http://arxiv.org/abs/2510.21364v1
[DATE]2025-10-24 19:48:49+08:00
[CATEGORIES]cs.CL
FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models
[AUTHORS]Zihao Fu, Ryan Brown, Shun Shao, Kai Rawal, Eoin Delaney, Chris Russell
[ABSTRACT]Text-to-image diffusion models, such as Stable Diffusion, have demonstrated
remarkable capabilities in generating high-quality and diverse images from
natural language prompts. However, recent studies reveal that these models
often replicate and amplify societal biases, particularly along demographic
attributes like gender and race. In this paper, we introduce FairImagen
(https://github.com/fuzihaofzh/FairImagen), a post-hoc debiasing framework that
operates on prompt embeddings to mitigate such biases without retraining or
modifying the underlying diffusion model. Our method integrates Fair Principal
Component Analysis to project CLIP-based input embeddings into a subspace that
minimizes group-specific information while preserving semantic content. We
further enhance debiasing effectiveness through empirical noise injection and
propose a unified cross-demographic projection method that enables simultaneous
debiasing across multiple demographic attributes. Extensive experiments across
gender, race, and intersectional settings demonstrate that FairImagen
significantly improves fairness with a moderate trade-off in image quality and
prompt fidelity. Our framework outperforms existing post-hoc methods and offers
a simple, scalable, and model-agnostic solution for equitable text-to-image
generation.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21363v1
[DATE]2025-10-24 19:47:15+08:00
[CATEGORIES]cs.LG cs.CL
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
[AUTHORS]Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang
[ABSTRACT]We introduce SLED, an alternative approach to speech language modeling by
encoding speech waveforms into sequences of continuous latent representations
and modeling them autoregressively using an energy distance objective. The
energy distance offers an analytical measure of the distributional gap by
contrasting simulated and target samples, enabling efficient training to
capture the underlying continuous autoregressive distribution. By bypassing
reliance on residual vector quantization, SLED avoids discretization errors and
eliminates the need for the complicated hierarchical architectures common in
existing speech language models. It simplifies the overall modeling pipeline
while preserving the richness of speech information and maintaining inference
efficiency. Empirical results demonstrate that SLED achieves strong performance
in both zero-shot and streaming speech synthesis, showing its potential for
broader applications in general-purpose speech language models.
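For readers unfamiliar with the energy distance, the sketch below computes the standard sample estimate $2\,\mathbb{E}\|X-Y\| - \mathbb{E}\|X-X'\| - \mathbb{E}\|Y-Y'\|$ between two sets of vectors. It is the generic statistical definition, not SLED's training objective, whose exact form the abstract does not specify; the function names are illustrative.

```python
import math


def _dist(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean_pairwise(xs, ys):
    # Average distance over all pairs (x, y) with x in xs and y in ys.
    return sum(_dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def energy_distance(xs, ys):
    # V-statistic estimate: 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||.
    return (
        2.0 * _mean_pairwise(xs, ys)
        - _mean_pairwise(xs, xs)
        - _mean_pairwise(ys, ys)
    )
```

The estimate is zero when the two sample sets coincide and grows as the sets move apart, which is why it can contrast simulated samples with target samples.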
[COMMENTS]NeurIPS 2025; Demos and code are available at
https://github.com/ictnlp/SLED-TTS
[LINK]http://arxiv.org/abs/2505.13181v2
[DATE]2025-10-24 19:35:47+08:00
[CATEGORIES]cs.CL
Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment
[AUTHORS]Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang
[ABSTRACT]Reward models (RMs) are crucial for aligning large language models (LLMs)
with diverse cultures. Consequently, evaluating their cultural awareness is
essential for further advancing global alignment of LLMs. However, existing RM
evaluations fall short in assessing cultural awareness due to the scarcity of
culturally relevant evaluation datasets. To fill this gap, we propose Cultural
Awareness Reward modeling Benchmark (CARB), covering 10 distinct cultures
across 4 cultural domains. Our extensive evaluation of state-of-the-art RMs
reveals their deficiencies in modeling cultural awareness and demonstrates a
positive correlation between performance on CARB and downstream multilingual
cultural alignment tasks. Further analysis identifies the spurious correlations
within culture-aware reward modeling, wherein RMs’ scoring relies predominantly
on surface-level features rather than authentic cultural nuance understanding.
To address these, we propose Think-as-Locals to elicit deeper culturally
grounded reasoning from generative RMs via reinforcement learning from
verifiable rewards (RLVR) and employ well-designed rewards to ensure accurate
preference judgments and high-quality structured evaluation criteria
generation. Experimental results validate its efficacy in mitigating spurious
feature interference and advancing culture-aware reward modeling.
[COMMENTS]Under review; work in progress
[LINK]http://arxiv.org/abs/2509.21798v2
[DATE]2025-10-24 19:33:01+08:00
[CATEGORIES]cs.CL
Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support
[AUTHORS]Jan Trienes, Anastasiia Derzhanskaia, Roland Schwarzkopf, Markus Mühling, Jörg Schlötterer, Christin Seifert
[ABSTRACT]We present Marcel, a lightweight and open-source conversational agent
designed to support prospective students with admission-related inquiries. The
system aims to provide fast and personalized responses while reducing the
workload of university staff. We employ retrieval-augmented generation to ground answers
in university resources and to provide users with verifiable, contextually
relevant information. We introduce a Frequently Asked Question (FAQ) retriever
that maps user questions to knowledge-base entries, which allows administrators
to steer retrieval, and improves over standard dense/hybrid retrieval
strategies. The system is engineered for easy deployment in
resource-constrained academic settings. We detail the system architecture,
provide a technical evaluation of its components, and report insights from a
real-world deployment.
[COMMENTS]Accepted at EMNLP 2025 (System Demonstrations)
[LINK]http://arxiv.org/abs/2507.13937v2
[DATE]2025-10-24 19:26:28+08:00
[CATEGORIES]cs.CL
Inference-time Alignment in Continuous Space
[AUTHORS]Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng
[ABSTRACT]Aligning large language models with human feedback at inference time has
received increasing attention due to its flexibility. Existing methods rely on
generating multiple responses from the base policy for search using a reward
model, which can be considered as searching in a discrete response space.
However, these methods struggle to explore informative candidates when the base
policy is weak or the candidate set is small, resulting in limited
effectiveness. In this paper, to address this problem, we propose Simple Energy
Adaptation ($\textbf{SEA}$), a simple yet effective algorithm for
inference-time alignment. In contrast to expensive search over the discrete
space, SEA directly adapts original responses from the base policy toward the
optimal one via gradient-based sampling in continuous latent space.
Specifically, SEA formulates inference as an iterative optimization procedure
on an energy function over actions in the continuous space defined by the
optimal policy, enabling simple and effective alignment. For instance, despite
its simplicity, SEA outperforms the second-best baseline with a relative
improvement of up to $\textbf{77.51%}$ on AdvBench and $\textbf{16.36%}$ on
MATH. Our code is publicly available at https://github.com/yuanyige/sea
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.20081v4
[DATE]2025-10-24 19:18:04+08:00
[CATEGORIES]cs.CL
Magellan: Guided MCTS for Latent Space Exploration and Novelty Generation
[AUTHORS]Lufan Chang
[ABSTRACT]Large Language Models (LLMs) often struggle with generating truly innovative
ideas, typically defaulting to high-probability, familiar concepts within their
training data’s “gravity wells.” While advanced search-based methods like Tree
of Thoughts (ToT) attempt to mitigate this, they are fundamentally limited by
their reliance on unprincipled, inconsistent self-evaluation heuristics to
guide exploration. To address this gap, we introduce \textbf{Magellan}, a novel
framework that reframes creative generation as a principled, guided exploration
of an LLM’s latent conceptual space. At its core, Magellan employs Monte Carlo
Tree Search (MCTS) governed by a hierarchical guidance system. For long-range
direction, a “semantic compass” vector, formulated via orthogonal projection,
steers the search towards relevant novelty. For local, step-by-step decisions,
a landscape-aware value function replaces flawed self-evaluation with an
explicit reward structure that balances intrinsic coherence, extrinsic novelty,
and narrative progress. Extensive experiments demonstrate that Magellan
significantly outperforms strong baselines, including ReAct and ToT, in
generating scientific ideas with superior plausibility and innovation. Our work
shows that for creative discovery, a principled, guided search is more
effective than unconstrained agency, paving the way for LLMs to become more
capable partners in innovation.
[COMMENTS]Accepted to 1st Open Conference on AI Agents for Science
(agents4science 2025)
[LINK]http://arxiv.org/abs/2510.21341v1
[DATE]2025-10-24 19:09:59+08:00
[CATEGORIES]cs.CL
DCAD-2000: A Multilingual Dataset across 2000+ Languages with Data Cleaning as Anomaly Detection
[AUTHORS]Yingli Shen, Wen Lai, Shuo Wang, Xueren Zhang, Kangyang Luo, Alexander Fraser, Maosong Sun
[ABSTRACT]The rapid development of multilingual large language models (LLMs) highlights
the need for high-quality, diverse, and well-curated multilingual datasets. In
this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a
large-scale multilingual corpus constructed from newly extracted Common Crawl
data and existing multilingual sources. DCAD-2000 covers 2,282 languages,
46.72TB of text, and 8.63 billion documents, spanning 155 high- and
medium-resource languages and 159 writing scripts. To overcome the limitations
of existing data cleaning approaches, which rely on manually designed heuristic
thresholds, we reframe data cleaning as an anomaly detection problem. This
dynamic filtering paradigm substantially improves data quality by automatically
identifying and removing noisy or anomalous content. By fine-tuning LLMs on
DCAD-2000, we demonstrate notable improvements in data quality, robustness of
the cleaning pipeline, and downstream performance, particularly for
low-resource languages across multiple multilingual benchmarks.
[COMMENTS]NeurIPS 2025 Datasets and Benchmarks Track
[LINK]http://arxiv.org/abs/2502.11546v5
[DATE]2025-10-24 19:05:50+08:00
[CATEGORIES]cs.CL
Let LLMs Break Free from Overthinking via Self-Braking Tuning
[AUTHORS]Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang
[ABSTRACT]Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have
significantly enhanced their reasoning capabilities by generating longer chains
of thought, demonstrating outstanding performance across a variety of tasks.
However, this performance gain comes at the cost of a substantial increase in
redundant reasoning during the generation process, leading to high
computational overhead and exacerbating the issue of overthinking. Although
numerous existing approaches aim to address the problem of overthinking, they
often rely on external interventions. In this paper, we propose a novel
framework, Self-Braking Tuning (SBT), which tackles overthinking from the
perspective of allowing the model to regulate its own reasoning process, thus
eliminating the reliance on external control mechanisms. We construct a set of
overthinking identification metrics based on standard answers and design a
systematic method to detect redundant reasoning. This method accurately
identifies unnecessary steps within the reasoning trajectory and generates
training signals for learning self-regulation behaviors. Building on this
foundation, we develop a complete strategy for constructing data with adaptive
reasoning lengths and introduce an innovative braking prompt mechanism that
enables the model to naturally learn when to terminate reasoning at an
appropriate point. Experiments across mathematical benchmarks (AIME, AMC,
MATH500, GSM8K) demonstrate that our method reduces token consumption by up to
60% while maintaining comparable accuracy to unconstrained models.
[COMMENTS]Accepted to NeurIPS 2025; Camera ready version, 10 pages.
GitHub: https://github.com/ZJU-REAL/Self-Braking-Tuning Project Page:
https://ZJU-REAL.github.io/SBT
[LINK]http://arxiv.org/abs/2505.14604v3
[DATE]2025-10-24 19:03:24+08:00
[CATEGORIES]cs.CL
TripTide: A Benchmark for Adaptive Travel Planning under Disruptions
[AUTHORS]Priyanshu Karmakar, Soumyabrata Chaudhuri, Shubhojit Mallick, Manish Gupta, Abhik Jana, Shreya Ghosh
[ABSTRACT]Recent efforts like TripCraft and TravelPlanner have advanced the use of
Large Language Models (LLMs) for personalized, constraint-aware travel
itinerary generation. Yet, real travel often faces disruptions. To address
this, we present TripTide, the first benchmark evaluating LLMs’ ability to
revise itineraries under realistic disruptions. TripTide models key dimensions
such as disruption severity and traveler tolerance, enabling nuanced assessment
of LLM adaptability to events like flight cancellations, weather closures, or
overbooked attractions. We conduct a threefold evaluation. First, we introduce
automatic metrics including Preservation of Intent (how well the revised plan
maintains feasibility and goals), Responsiveness (promptness and
appropriateness of disruption handling), and Adaptability (semantic, spatial,
and sequential divergence between original and revised plans). Second, we apply
an LLM-as-a-judge approach to automatically assess revision quality. Third, we
perform manual expert evaluation to verify whether revisions preserve semantic,
spatial, sequential, and responsive aspects. Our experiments show that LLMs
maintain strong sequential consistency and semantic stability, while spatial
deviations are larger for shorter trips but decrease with longer ones,
indicating that extended plans encourage better geographic coherence. However,
disruption-handling ability declines as plan length increases, highlighting
limits in LLM robustness. TripTide establishes a benchmark for evaluating
adaptability, personalization, and resilience in LLM-based travel planning
under real-world uncertainty.
[COMMENTS]12 pages, 12 tables and 7 figures
[LINK]http://arxiv.org/abs/2510.21329v1
[DATE]2025-10-24 18:39:55+08:00
[CATEGORIES]cs.CL
Disentangling Latent Shifts of In-Context Learning with Weak Supervision
[AUTHORS]Josip Jukić, Jan Šnajder
[ABSTRACT]In-context learning (ICL) enables large language models to perform few-shot
learning by conditioning on labeled examples in the prompt. Despite its
flexibility, ICL suffers from instability – especially as prompt length
increases with more demonstrations. To address this, we treat ICL as a source
of weak supervision and propose a parameter-efficient method that disentangles
demonstration-induced latent shifts from those of the query. An ICL-based
teacher generates pseudo-labels on unlabeled queries, while a student predicts
them using only the query input, updating a lightweight adapter. This captures
demonstration effects in a compact, reusable form, enabling efficient inference
while remaining composable with new demonstrations. Although trained on noisy
teacher outputs, the student often outperforms its teacher through pseudo-label
correction and coverage expansion, consistent with the weak-to-strong
generalization effect. Empirically, our method improves generalization,
stability, and efficiency across both in-domain and out-of-domain tasks,
surpassing standard ICL and prior disentanglement methods.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2410.01508v3
[DATE]2025-10-24 18:38:52+08:00
[CATEGORIES]cs.CL cs.LG
HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding
[AUTHORS]Siran Liu, Yang Ye, Qianchao Zhu, Zane Cao, Yongchao He
[ABSTRACT]Autoregressive decoding inherently limits the inference throughput of Large
Language Models (LLMs) due to its sequential dependency. Speculative decoding
mitigates this by verifying multiple predicted tokens in parallel, but its
efficiency remains constrained by what we identify as verification
heterogeneity – the uneven difficulty of verifying different speculative
candidates. In practice, a small subset of high-confidence predictions accounts
for most successful verifications, yet existing methods treat all candidates
uniformly, leading to redundant computation. We present HeteroSpec, a
heterogeneity-adaptive speculative decoding framework that allocates
verification effort in proportion to candidate uncertainty. HeteroSpec
estimates verification complexity using a lightweight entropy-based quantifier,
partitions candidates via a data-driven stratification policy, and dynamically
tunes speculative depth and pruning thresholds through coordinated
optimization. Across five benchmarks and four LLMs, HeteroSpec delivers an
average 4.24$\times$ decoding speedup over state-of-the-art methods such as
EAGLE-3, while preserving exact output distributions. Crucially, HeteroSpec
requires no model retraining and remains compatible with other inference
optimizations, making it a practical direction for improving speculative
decoding efficiency.
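As a rough illustration of an entropy-based uncertainty signal, the sketch below computes the Shannon entropy of a candidate's next-token distribution and splits candidates into low- and high-uncertainty groups. It is not HeteroSpec's quantifier or stratification policy: the function names and the fixed `threshold` are hypothetical, whereas the abstract describes a data-driven policy.

```python
import math


def shannon_entropy(probs):
    # Entropy in nats of a discrete distribution; zero-probability terms contribute 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def stratify(distributions, threshold=0.5):
    # Split candidate indices by entropy. The fixed threshold is an assumption
    # made for illustration only.
    low, high = [], []
    for i, probs in enumerate(distributions):
        (low if shannon_entropy(probs) <= threshold else high).append(i)
    return low, high
```

Under this toy split, a sharply peaked distribution lands in the low-uncertainty group and a near-uniform one lands in the high-uncertainty group.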
[LINK]http://arxiv.org/abs/2505.13254v2
[DATE]2025-10-24 18:25:27+08:00
[CATEGORIES]cs.CL
Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques
[AUTHORS]Jeanice Koorndijk
[COMMENTS]NeurIPS RegML Workshop
[LINK]http://arxiv.org/abs/2506.21584v3
[DATE]2025-10-24 18:23:46+08:00
[CATEGORIES]cs.CL
Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
[AUTHORS]Haozhen Zhang, Tao Feng, Jiaxuan You
[ABSTRACT]The rapid emergence of diverse large language models (LLMs) has spurred the
development of LLM routers that assign user queries to the most suitable model.
However, existing LLM routers typically perform a single-round, one-to-one
mapping (\textit{i.e.}, assigning each query to a single model in isolation),
which limits their capability to tackle complex tasks that demand the
complementary strengths of multiple LLMs. In this paper, we present
\textbf{Router-R1}, a reinforcement learning (RL)-based framework that
formulates multi-LLM routing and aggregation as a sequential decision process.
Router-R1 instantiates the router itself as a capable LLM, leveraging its
reasoning ability to interleave “think” actions (internal deliberation) with
“route” actions (dynamic model invocation), and integrates each response into
its evolving context. To facilitate learning, we employ a lightweight
rule-based reward comprising format rewards, final outcome rewards, and a novel
cost reward for optimizing the balance between performance and cost, opening a
pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also
conditions only on simple model descriptors such as pricing, latency, and
example performance, enabling strong generalization to unseen model selection.
Experiments on seven general and multi-hop QA benchmarks show that Router-R1
outperforms several strong baselines, achieving superior performance while
maintaining robust generalization and cost management.
[COMMENTS]Accepted by NeurIPS 2025. Code is available at
https://github.com/ulab-uiuc/Router-R1. Models and Datasets are available at
https://huggingface.co/collections/ulab-ai/router-r1-6851bbe099c7a56914b5db03
[LINK]http://arxiv.org/abs/2506.09033v3
[DATE]2025-10-24 18:22:08+08:00
[CATEGORIES]cs.CL cs.LG
Efficient semantic uncertainty quantification in language models via diversity-steered sampling
[AUTHORS]Ji Won Park, Kyunghyun Cho
[ABSTRACT]Accurately estimating semantic aleatoric and epistemic uncertainties in large
language models (LLMs) is particularly challenging in free-form question
answering (QA), where obtaining stable estimates often requires many expensive
generations. We introduce a diversity-steered sampler that discourages
semantically redundant outputs during decoding, covers both autoregressive and
masked diffusion paradigms, and yields substantial sample-efficiency gains. The
key idea is to inject a continuous semantic-similarity penalty into the model’s
proposal distribution using a natural language inference (NLI) model lightly
finetuned on partial prefixes or intermediate diffusion states. We debias
downstream uncertainty estimates with importance reweighting and shrink their
variance with control variates. Across four QA benchmarks, our method matches
or surpasses baselines while covering more semantic clusters with the same
number of samples. Being modular and requiring no gradient access to the base
LLM, the framework promises to serve as a drop-in enhancement for uncertainty
estimation in risk-sensitive model deployments.
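The debiasing step mentioned above relies on importance reweighting. A minimal sketch of the generic self-normalized importance sampling estimator follows; it is not the authors' estimator, the function name is illustrative, and control variates are omitted. Samples are assumed to be drawn from a steered proposal with probabilities `proposal_probs`, while `target_probs` are the same samples' probabilities under the original model.

```python
def self_normalized_estimate(values, target_probs, proposal_probs):
    # Weight each sample by target/proposal probability, then normalize so the
    # weights sum to one. This corrects for sampling from the steered proposal.
    weights = [p / q for p, q in zip(target_probs, proposal_probs)]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total
```

When the proposal equals the target, all weights are equal and the estimate reduces to the plain sample mean.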
[COMMENTS]10 pages (+7 appendix), 7 figures. Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21310v1
[DATE]2025-10-24 18:06:21+08:00
[CATEGORIES]cs.CL cs.LG
PARL: Prompt-based Agents for Reinforcement Learning
[AUTHORS]Yarik Menchaca Resendiz, Roman Klinger
[ABSTRACT]Large language models (LLMs) have demonstrated high performance on tasks
expressed in natural language, particularly in zero- or few-shot settings.
These are typically framed as supervised (e.g., classification) or unsupervised
(e.g., clustering) problems. However, limited work evaluates LLMs as agents in
reinforcement learning (RL) tasks (e.g., playing games), where learning occurs
through interaction with an environment and a reward system. While prior work
focused on representing tasks that rely on a language representation, we study
structured, non-linguistic reasoning, such as interpreting positions in a grid
world. We therefore introduce PARL (Prompt-based Agent for Reinforcement
Learning), a method that uses LLMs as RL agents through prompting, without any
fine-tuning. PARL encodes actions, states, and rewards in the prompt, enabling
the model to learn through trial-and-error interaction. We evaluate PARL on
three standard RL tasks that do not entirely rely on natural language. We show
that it can match or outperform traditional RL agents in simple environments by
leveraging pretrained knowledge. However, we identify performance limitations
in tasks that require complex mathematical operations or decoding states and
actions.
[LINK]http://arxiv.org/abs/2510.21306v1
[DATE]2025-10-24 18:04:23+08:00
[CATEGORIES]cs.CL
Misspellings in Natural Language Processing: A survey
[AUTHORS]Gianluca Sperduti, Alejandro Moreo
[ABSTRACT]This survey provides an overview of the challenges of misspellings in natural
language processing (NLP). While often unintentional, misspellings have become
ubiquitous in digital communication, especially with the proliferation of Web
2.0, user-generated content, and informal text mediums such as social media,
blogs, and forums. Even if humans can generally interpret misspelled text, NLP
models frequently struggle to handle it: this causes a decline in performance
in common tasks like text classification and machine translation. In this
paper, we reconstruct a history of misspellings as a scientific problem. We
then discuss the latest advancements to address the challenge of misspellings
in NLP. Main strategies to mitigate the effect of misspellings include data
augmentation, double step, character-order agnostic, and tuple-based methods,
among others. This survey also examines dedicated data challenges and
competitions to spur progress in the field. Critical safety and ethical
concerns are also examined, for example, the voluntary use of misspellings to
inject malicious messages and hate speech on social networks. Furthermore, the
survey explores psycholinguistic perspectives on how humans process
misspellings, potentially informing innovative computational techniques for
text normalization and representation. Finally, the misspelling-related
challenges and opportunities associated with modern large language models are
also analyzed, including benchmarks, datasets, and performances of the most
prominent language models against misspellings. This survey aims to be an
exhaustive resource for researchers seeking to mitigate the impact of
misspellings in the rapidly evolving landscape of NLP.
[LINK]http://arxiv.org/abs/2501.16836v2
[DATE]2025-10-24 17:45:12+08:00
[CATEGORIES]cs.CL
Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
[AUTHORS]Marek Kadlčík, Michal Štefánik, Timothee Mickus, Michal Spiegel, Josef Kuchař
[ABSTRACT]Pretrained language models (LMs) are prone to arithmetic errors. Existing
work showed limited success in probing numeric values from models’
representations, indicating that these errors can be attributed to the inherent
unreliability of distributionally learned embeddings in representing exact
quantities. However, we observe that previous probing methods are inadequate
for the emergent structure of learned number embeddings with sinusoidal
patterns.
In response, we propose a novel probing technique that decodes numeric values
from input embeddings with near-perfect accuracy across a range of open-source
LMs. This proves that, after pre-training alone, LMs represent numbers with
remarkable precision. Finally, we find that the embeddings’ precision, judged
by our probe’s accuracy, explains a large portion of LMs’ errors in elementary
arithmetic, and show that aligning the embeddings with the pattern our probes
discover can mitigate these errors.
[LINK]http://arxiv.org/abs/2506.08966v2
[DATE]2025-10-24 17:41:38+08:00
[CATEGORIES]cs.CL cs.LG
Generative Annotation for ASR Named Entity Correction
[AUTHORS]Yuanchang Luo, Daimeng Wei, Shaojun Li, Hengchao Shang, Jiaxin Guo, Zongyao Li, Zhanglin Wu, Xiaoyu Chen, Zhiqiang Rao, Jinlong Yang, Hao Yang
[COMMENTS]12 pages, 7 figures, 7 tables, EMNLP 2025
[LINK]http://arxiv.org/abs/2508.20700v2
[DATE]2025-10-24 17:35:39+08:00
[CATEGORIES]cs.CL
When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails
[AUTHORS]Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun
[ABSTRACT]Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex
reasoning tasks but remain vulnerable to severe safety risks, including harmful
content generation and jailbreak attacks. Existing mitigation strategies rely
on injecting heuristic safety signals during training, which often suppress
reasoning ability and fail to resolve the safety-reasoning trade-off. To
systematically investigate this issue, we analyze the reasoning trajectories of
diverse LRMs and uncover a phenomenon we term Self-Jailbreak, where models
override their own risk assessments and justify responding to unsafe prompts.
This finding reveals that LRMs inherently possess the ability to reject unsafe
queries, but this ability is compromised, resulting in harmful outputs.
Building on these insights, we propose the Chain-of-Guardrail (CoG), a training
framework that recomposes or backtracks unsafe reasoning steps, steering the
model back onto safe trajectories while preserving valid reasoning chains.
Extensive experiments across multiple reasoning and safety benchmarks
demonstrate that CoG substantially improves the safety of current LRMs while
preserving comparable reasoning ability, significantly outperforming prior
methods that suffer from severe safety-reasoning trade-offs.
[COMMENTS]First two authors contributed equally. The main text is 10 pages,
with an appendix of 19 pages. The paper contains 18 figures and 16 tables
[LINK]http://arxiv.org/abs/2510.21285v1
[DATE]2025-10-24 17:32:25+08:00
[CATEGORIES]cs.CL
LayerIF: Estimating Layer Quality for Large Language Models using Influence Functions
[AUTHORS]Hadi Askari, Shivanshu Gupta, Fei Wang, Anshuman Chhabra, Muhao Chen
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.23811v3
[DATE]2025-10-24 17:27:17+08:00
[CATEGORIES]cs.CL cs.LG
Pctx: Tokenizing Personalized Context for Generative Recommendation
[AUTHORS]Qiyong Zhong, Jiajie Su, Yunshan Ma, Julian McAuley, Yupeng Hou
[ABSTRACT]Generative recommendation (GR) models tokenize each action into a few
discrete tokens (called semantic IDs) and autoregressively generate the next
tokens as predictions, showing advantages such as memory efficiency,
scalability, and the potential to unify retrieval and ranking. Despite these
benefits, existing tokenization methods are static and non-personalized. They
typically derive semantic IDs solely from item features, assuming a universal
item similarity that overlooks user-specific perspectives. However, under the
autoregressive paradigm, semantic IDs with the same prefixes always receive
similar probabilities, so a single fixed mapping implicitly enforces a
universal item similarity standard across all users. In practice, the same item
may be interpreted differently depending on user intentions and preferences. To
address this issue, we propose a personalized context-aware tokenizer that
incorporates a user’s historical interactions when generating semantic IDs.
This design allows the same item to be tokenized into different semantic IDs
under different user contexts, enabling GR models to capture multiple
interpretive standards and produce more personalized predictions. Experiments
on three public datasets demonstrate up to 11.44% improvement in NDCG@10 over
non-personalized action tokenization baselines. Our code is available at
https://github.com/YoungZ365/Pctx.
[LINK]http://arxiv.org/abs/2510.21276v1
[DATE]2025-10-24 17:22:04+08:00
[CATEGORIES]cs.CL
Sparser Block-Sparse Attention via Token Permutation
[AUTHORS]Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu
[ABSTRACT]Scaling the context length of large language models (LLMs) offers significant
benefits but is computationally expensive. This expense stems primarily from
the self-attention mechanism, whose $O(N^2)$ complexity with respect to
sequence length presents a major bottleneck for both memory and latency.
Fortunately, the attention matrix is often sparse, particularly for long
sequences, suggesting an opportunity for optimization. Block-sparse attention
has emerged as a promising solution that partitions sequences into blocks and
skips computation for a subset of these blocks. However, the effectiveness of
this method is highly dependent on the underlying attention patterns, which can
lead to sub-optimal block-level sparsity. For instance, important key tokens
for queries within a single block may be scattered across numerous other
blocks, leading to computational redundancy. In this work, we propose Permuted
Block-Sparse Attention (PBS-Attn), a plug-and-play method that
leverages the permutation properties of attention to increase block-level
sparsity and enhance the computational efficiency of LLM prefilling. We conduct
comprehensive experiments on challenging real-world long-context datasets,
demonstrating that PBS-Attn consistently outperforms existing block-sparse
attention methods in model accuracy and closely matches the full attention
baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn
achieves an end-to-end speedup of up to $2.75\times$ in long-context
prefilling, confirming its practical viability. Code available at
https://github.com/xinghaow99/pbs-attn
[LINK]http://arxiv.org/abs/2510.21270v1
[DATE]2025-10-24 17:11:50+08:00
[CATEGORIES]cs.CL
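The permutation property that PBS-Attn builds on can be illustrated with a small sketch. The abstract does not describe the actual permutation scheme or the custom kernels, so the code below only checks a general fact: softmax attention for a single query is unchanged when keys and values are reordered together. It ignores causal masking, and every name in it (`attention`, `perm`, the toy sizes) is an illustrative assumption, not the authors' API.

```python
import math
import random


def attention(q, keys, values):
    """Single-query softmax attention over lists of key and value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


rng = random.Random(0)
n, d = 8, 4  # toy sizes (assumption)
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Reorder keys and values with the same permutation.
perm = list(range(n))
rng.shuffle(perm)
keys_p = [keys[i] for i in perm]
values_p = [values[i] for i in perm]

out = attention(q, keys, values)
out_p = attention(q, keys_p, values_p)
max_diff = max(abs(a - b) for a, b in zip(out, out_p))
print(max_diff)  # differs only by floating-point rounding
```

Because the output does not depend on key order, important keys can in principle be moved next to each other so that fewer blocks need to be computed, which is the kind of block-level sparsity gain the abstract describes.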
Understanding Network Behaviors through Natural Language Question-Answering
[AUTHORS]Mingzhe Xing, Chang Tian, Jianan Zhang, Lichen Pan, Peipei Liu, Zhaoteng Yan, Yinliang Yue
[ABSTRACT]Modern large-scale networks introduce significant complexity in understanding
network behaviors, increasing the risk of misconfiguration. Prior work proposed
to understand network behaviors by mining network configurations, typically
relying on domain-specific languages interfaced with formal models. While
effective, they suffer from a steep learning curve and limited flexibility. In
contrast, natural language (NL) offers a more accessible and interpretable
interface, motivating recent research on NL-guided network behavior
understanding. Recent advances in large language models (LLMs) further enhance
this direction, leveraging their extensive prior knowledge of network concepts
and strong reasoning capabilities. However, three key challenges remain: 1)
numerous router devices with lengthy configuration files challenge LLMs’
long-context understanding ability; 2) heterogeneity across devices and
protocols impedes scalability; and 3) complex network topologies and protocols
demand advanced reasoning abilities beyond the current capabilities of LLMs. To
tackle the above challenges, we propose NetMind, a novel framework for querying
networks using NL. Our approach introduces a tree-based configuration chunking
strategy to preserve semantic coherence while enabling efficient partitioning.
We then construct a unified fact graph as an intermediate representation to
normalize vendor-specific configurations. Finally, we design a hybrid
imperative-declarative language to reduce the reasoning burden on LLMs and
enhance precision. We contribute a benchmark consisting of NL question-answer
pairs paired with network configurations. Experiments demonstrate that NetMind
achieves accurate and scalable network behavior understanding, outperforming
existing baselines.
[COMMENTS]Large Language Models
[LINK]http://arxiv.org/abs/2510.21894v1
[DATE]2025-10-24 16:54:29+08:00
[CATEGORIES]cs.CL
Influence Guided Context Selection for Effective Retrieval-Augmented Generation
[AUTHORS]Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, Linpeng Huang
[ABSTRACT]Retrieval-Augmented Generation (RAG) addresses large language model (LLM)
hallucinations by grounding responses in external knowledge, but its
effectiveness is compromised by poor-quality retrieved contexts containing
irrelevant or noisy information. While existing approaches attempt to improve
performance through context selection based on predefined context quality
assessment metrics, they show limited gains over standard RAG. We attribute
this limitation to their failure in holistically utilizing available
information (query, context list, and generator) for comprehensive quality
assessment. Inspired by recent advances in data selection, we reconceptualize
context quality assessment as an inference-time data valuation problem and
introduce the Contextual Influence Value (CI value). This novel metric
quantifies context quality by measuring the performance degradation when
removing each context from the list, effectively integrating query-aware
relevance, list-aware uniqueness, and generator-aware alignment. Moreover, CI
value eliminates complex selection hyperparameter tuning by simply retaining
contexts with positive CI values. To address practical challenges of label
dependency and computational overhead, we develop a parameterized surrogate
model for CI value prediction during inference. The model employs a
hierarchical architecture that captures both local query-context relevance and
global inter-context interactions, trained through oracle CI value supervision
and end-to-end generator feedback. Extensive experiments across 8 NLP tasks and
multiple LLMs demonstrate that our context selection method significantly
outperforms state-of-the-art baselines, effectively filtering poor-quality
contexts while preserving critical information. Code is available at
https://github.com/SJTU-DMTai/RAG-CSM.
[LINK]http://arxiv.org/abs/2509.21359v2
[DATE]2025-10-24 16:50:27+08:00
[CATEGORIES]cs.CL
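The leave-one-out idea behind the CI value in the entry above can be sketched as follows. This is a minimal illustration under stated assumptions: the `utility` function is a hypothetical stand-in for generator performance (facts covered minus a noise penalty), and the context records are made up. The paper's surrogate model, which predicts CI values at inference time, is not reproduced here.

```python
def utility(contexts, needed):
    """Toy generator score (hypothetical): needed facts covered minus a noise penalty."""
    covered = set()
    for c in contexts:
        covered |= c["facts"] & needed
    noisy = sum(1 for c in contexts if c["noisy"])
    return len(covered) - 0.5 * noisy


def ci_values(contexts, needed):
    """Performance drop when each context is removed from the full list."""
    full = utility(contexts, needed)
    return [full - utility(contexts[:i] + contexts[i + 1:], needed)
            for i in range(len(contexts))]


needed = {"a", "b", "c"}
contexts = [
    {"id": "c0", "facts": {"a", "c"}, "noisy": False},  # uniquely useful
    {"id": "c1", "facts": {"a"}, "noisy": False},       # redundant given c0
    {"id": "c2", "facts": {"b"}, "noisy": False},       # uniquely useful
    {"id": "c3", "facts": set(), "noisy": True},        # harmful noise
]

cis = ci_values(contexts, needed)
kept = [c["id"] for c, v in zip(contexts, cis) if v > 0]
print(cis)   # [1.0, 0.0, 1.0, -0.5]
print(kept)  # ['c0', 'c2']
```

In this toy setting the redundant context scores zero and the noisy one scores negative, so retaining only positive values needs no extra threshold, matching the selection rule stated in the abstract.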
Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
[AUTHORS]Benjamin Minixhofer, Ivan Vulić, Edoardo Maria Ponti
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2503.20083v4
[DATE]2025-10-24 16:49:40+08:00
[CATEGORIES]cs.CL
Correlation Dimension of Auto-Regressive Large Language Models
[AUTHORS]Xin Du, Kumiko Tanaka-Ishii
[ABSTRACT]Large language models (LLMs) have achieved remarkable progress in natural
language generation, yet they continue to display puzzling behaviors – such as
repetition and incoherence – even when exhibiting low perplexity. This
highlights a key limitation of conventional evaluation metrics, which emphasize
local prediction accuracy while overlooking long-range structural complexity.
We introduce correlation dimension, a fractal-geometric measure of
self-similarity, to quantify the epistemological complexity of text as
perceived by a language model. This measure captures the hierarchical
recurrence structure of language, bridging local and global properties in a
unified framework. Through extensive experiments, we show that correlation
dimension (1) reveals three distinct phases during pretraining, (2) reflects
context-dependent complexity, (3) indicates a model’s tendency toward
hallucination, and (4) reliably detects multiple forms of degeneration in
generated text. The method is computationally efficient, robust to model
quantization (down to 4-bit precision), broadly applicable across
autoregressive architectures (e.g., Transformer and Mamba), and provides fresh
insight into the generative dynamics of LLMs.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21258v1
[DATE]2025-10-24 16:42:23+08:00
[CATEGORIES]cs.CL
Do Robot Snakes Dream like Electric Sheep? Investigating the Effects of Architectural Inductive Biases on Hallucination
[AUTHORS]Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Boxing Chen, Sarath Chandar
[ABSTRACT]The growth in prominence of large language models (LLMs) in everyday life can
be largely attributed to their generative abilities, yet some of this is also
owed to the risks and costs associated with their use. On one front is their
tendency to hallucinate false or misleading information, limiting their
reliability. On another is the increasing focus on the computational
limitations associated with traditional self-attention based LLMs, which has
brought about new alternatives, in particular recurrent models, meant to
overcome them. Yet it remains uncommon to consider these two concerns
simultaneously. Do changes in architecture exacerbate/alleviate existing
concerns about hallucinations? Do they affect how and where they occur? Through
an extensive evaluation, we study how these architecture-based inductive biases
affect the propensity to hallucinate. While hallucination remains a general
phenomenon not limited to specific architectures, the situations in which they
occur and the ease with which specific types of hallucinations can be induced
can significantly differ based on the model architecture. These findings
highlight the need to better understand both problems in conjunction with each
other, and to consider how to design more universal techniques for handling
hallucinations.
[COMMENTS]Accepted to Findings of The 63rd Annual Meeting of the Association
for Computational Linguistics (ACL) 2025. Official proceedings version
available at https://aclanthology.org/2025.findings-acl.60/
[LINK]http://arxiv.org/abs/2410.17477v6
[DATE]2025-10-24 16:39:46+08:00
[CATEGORIES]cs.CL cs.LG
Do Large Language Models Know How Much They Know?
[AUTHORS]Gabriele Prato, Jerry Huang, Prasanna Parthasarathi, Shagun Sodhani, Sarath Chandar
[COMMENTS]Published as a long paper at the 2024 Conference on Empirical Methods
in Natural Language Processing (EMNLP). Official version of paper within
conference proceedings is available at
https://aclanthology.org/2024.emnlp-main.348/
[LINK]http://arxiv.org/abs/2502.19573v3
[DATE]2025-10-24 16:24:01+08:00
[CATEGORIES]cs.CL cs.LG
RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation
[AUTHORS]Qingyao Li, Wei Xia, Kounianhua Du, Xinyi Dai, Ruiming Tang, Yasheng Wang, Yong Yu, Weinan Zhang
[ABSTRACT]Tree search methods have demonstrated impressive performance in code
generation. Previous methods combine tree search with reflection that
summarizes past mistakes to achieve iterative improvement. However, these
methods face significant challenges. First, they search directly within the
code language space, neglecting the underlying reasoning process critical for
effective code generation. Second, reflection-based approaches merely
accumulate historical errors in memory without providing correct reasoning
pathways, making it difficult for subsequent search iterations to identify
optimal solutions, resulting in decreased search quality. In this work, we
propose RethinkMCTS, a framework that systematically explores and refines the
reasoning process for code generation. Specifically, we employ MCTS to search
for thoughts before code generation and integrate MCTS with a refinement
mechanism called rethink, which incorporates fine-grained code execution
feedback to refine erroneous thoughts during the search. It ensures the search
path aligns with better reasoning, improving overall search quality. Through
extensive experiments, we demonstrate that RethinkMCTS outperforms previous
search-based and feedback-enhanced code generation baselines.
[LINK]http://arxiv.org/abs/2409.09584v2
[DATE]2025-10-24 16:10:43+08:00
[CATEGORIES]cs.CL
Information-Theoretic Reward Decomposition for Generalizable RLHF
[AUTHORS]Liyuan Mao, Haoran Xu, Amy Zhang, Weinan Zhang, Chenjia Bai
[ABSTRACT]A generalizable reward model is crucial in Reinforcement Learning from Human
Feedback (RLHF) as it enables correctly evaluating unseen prompt-response
pairs. However, existing reward models lack this ability, as they are typically
trained by increasing the reward gap between chosen and rejected responses,
while overlooking the prompts that the responses are conditioned on.
Consequently, when the trained reward model is evaluated on prompt-response
pairs that lie outside the data distribution, neglecting the effect of prompts
may result in poor generalization of the reward model. To address this issue,
we decompose the reward value into two independent components: prompt-free
reward and prompt-related reward. Prompt-free reward represents the evaluation
that is determined only by responses, while the prompt-related reward reflects
the reward that derives from both the prompt and the response. We extract these
two components from an information-theoretic perspective, which requires no
extra models. Subsequently, we propose a new reward learning algorithm by
prioritizing data samples based on their prompt-free reward values. Through toy
examples, we demonstrate that the extracted prompt-free and prompt-related
rewards effectively characterize two parts of the reward model. Further,
standard evaluations show that our method improves both the alignment
performance and the generalization capability of the reward model.
[COMMENTS]Work done during internships at Institute of Artificial Intelligence
(TeleAI), China Telecom
[LINK]http://arxiv.org/abs/2504.06020v2
[DATE]2025-10-24 15:58:50+08:00
[CATEGORIES]cs.CL cs.LG
Virus Infection Attack on LLMs: Your Poisoning Can Spread “VIA” Synthetic Data
[AUTHORS]Zi Liang, Qingqing Ye, Xuan Liu, Yanyun Wang, Jianliang Xu, Haibo Hu
[ABSTRACT]Synthetic data refers to artificial samples generated by models. While it has
been validated to significantly enhance the performance of large language
models (LLMs) during training and has been widely adopted in LLM development,
potential security risks it may introduce remain uninvestigated. This paper
systematically evaluates the resilience of synthetic-data-integrated training
paradigm for LLMs against mainstream poisoning and backdoor attacks. We reveal
that such a paradigm exhibits strong resistance to existing attacks, primarily
thanks to the different distribution patterns between poisoning data and
queries used to generate synthetic samples. To enhance the effectiveness of
these attacks and further investigate the security risks introduced by
synthetic data, we introduce a novel and universal attack framework, namely,
Virus Infection Attack (VIA), which enables the propagation of current attacks
through synthetic data even under purely clean queries. Inspired by the
principles of virus design in cybersecurity, VIA conceals the poisoning payload
within a protective “shell” and strategically searches for optimal hijacking
points in benign samples to maximize the likelihood of generating malicious
content. Extensive experiments on both data poisoning and backdoor attacks show
that VIA significantly increases the presence of poisoning content in synthetic
data and correspondingly raises the attack success rate (ASR) on downstream
models to levels comparable to those observed in the poisoned upstream models.
[COMMENTS]Camera Ready of NeurIPS 2025 Spotlight. Source code:
https://github.com/liangzid/VirusInfectionAttack
[LINK]http://arxiv.org/abs/2509.23041v2
[DATE]2025-10-24 15:58:07+08:00
[CATEGORIES]cs.CL
A Hierarchical Error Framework for Reliable Automated Coding in Communication Research: Applications to Health and Political Communication
[AUTHORS]Zhilong Zhao, Yindi Liu
[ABSTRACT]Automated content analysis increasingly supports communication research, yet
scaling manual coding into computational pipelines raises concerns about
measurement reliability and validity. We introduce a Hierarchical Error
Correction (HEC) framework that treats model failures as layered measurement
errors (knowledge gaps, reasoning limitations, and complexity constraints) and
targets the layers that most affect inference. The framework implements a
three-phase methodology: systematic error profiling across hierarchical layers,
targeted intervention design matched to dominant error sources, and rigorous
validation with statistical testing. Evaluating HEC across health communication
(medical specialty classification), political communication (bias
detection), and legal tasks, we validate the approach with five diverse large
language models. Results show average accuracy gains of 11.2 percentage points
(p < .001, McNemar’s test) and stable conclusions via reduced systematic
misclassification. Cross-model validation demonstrates consistent improvements
(range: +6.8 to +14.6pp), with effectiveness concentrated in moderate-to-high
baseline tasks (50-85% accuracy). A boundary study reveals diminished returns
in very high-baseline (>85%) or precision-matching tasks, establishing
applicability limits. We map layered errors to threats to construct and
criterion validity and provide a transparent, measurement-first blueprint for
diagnosing error profiles, selecting targeted interventions, and reporting
reliability/validity evidence alongside accuracy. This applies to automated
coding across communication research and the broader social sciences.
[COMMENTS]Version 2: Enhanced clarification of precision-matching task
characteristics and framework applicability conditions. 20 pages, 4 figures,
4 tables. Replication package available at https://doi.org/10.7910/DVN/NDXVLZ
[LINK]http://arxiv.org/abs/2509.24841v2
[DATE]2025-10-24 15:36:37+08:00
[CATEGORIES]cs.CL
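The HEC abstract above reports significance with McNemar's test, which compares two classifiers on the same items using only the discordant pairs. A minimal sketch follows, assuming the chi-square form without continuity correction (the abstract does not say which variant was used). The counts are hypothetical and are not taken from the paper.

```python
import math


def mcnemar(b, c):
    """McNemar's test on discordant pairs.

    b: items only the baseline coded correctly
    c: items only the corrected pipeline coded correctly
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts for illustration only.
stat, p_value = mcnemar(b=10, c=40)
print(stat)             # 18.0
print(p_value < 0.001)  # True
```

Items that both systems code correctly, or both code incorrectly, do not enter the statistic, which is why the test suits before/after comparisons on a fixed evaluation set.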
How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models
[AUTHORS]Simone Corbo, Luca Bancale, Valeria De Gennaro, Livia Lestingi, Vincenzo Scotti, Matteo Camilli
[ABSTRACT]Language is a deep-rooted means of perpetration of stereotypes and
discrimination. Large Language Models (LLMs), now a pervasive technology in our
everyday lives, can cause extensive harm when prone to generating toxic
responses. The standard way to address this issue is to align the LLM, which,
however, dampens the issue without constituting a definitive solution.
Therefore, testing LLMs even after alignment efforts remains crucial for
detecting any residual deviations with respect to ethical standards. We present
EvoTox, an automated testing framework for LLMs’ inclination to toxicity,
providing a way to quantitatively assess how much LLMs can be pushed towards
toxic responses even in the presence of alignment. The framework adopts an
iterative evolution strategy that exploits the interplay between two LLMs, the
System Under Test (SUT) and the Prompt Generator steering SUT responses toward
higher toxicity. The toxicity level is assessed by an automated oracle based on
an existing toxicity classifier. We conduct a quantitative and qualitative
empirical evaluation using five state-of-the-art LLMs as evaluation subjects
having increasing complexity (7-671B parameters). Our quantitative evaluation
assesses the cost-effectiveness of four alternative versions of EvoTox against
existing baseline methods, based on random search, curated datasets of toxic
prompts, and adversarial attacks. Our qualitative assessment engages human
evaluators to rate the fluency of the generated prompts and the perceived
toxicity of the responses collected during the testing sessions. Results
indicate that the effectiveness, in terms of detected toxicity level, is
significantly higher than the selected baseline methods (effect size up to 1.0
against random search and up to 0.99 against adversarial attacks). Furthermore,
EvoTox yields a limited cost overhead (from 22% to 35% on average).
[LINK]http://arxiv.org/abs/2501.01741v2
[DATE]2025-10-24 15:10:55+08:00
[CATEGORIES]cs.CL
Estonian Native Large Language Model Benchmark
[AUTHORS]Helena Grete Lillepalu, Tanel Alumäe
[ABSTRACT]The availability of LLM benchmarks for the Estonian language is limited, and
a comprehensive evaluation comparing the performance of different LLMs on
Estonian tasks has yet to be conducted. We introduce a new benchmark for
evaluating LLMs in Estonian, based on seven diverse datasets. These datasets
assess general and domain-specific knowledge, understanding of Estonian grammar
and vocabulary, summarization abilities, contextual comprehension, and more.
The datasets are all generated from native Estonian sources without using
machine translation. We compare the performance of base models,
instruction-tuned open-source models, and commercial models. Our evaluation
includes 6 base models and 26 instruction-tuned models. To assess the results,
we employ both human evaluation and LLM-as-a-judge methods. Human evaluation
scores showed moderate to high correlation with benchmark evaluations,
depending on the dataset. Claude 3.7 Sonnet, used as an LLM judge, demonstrated
strong alignment with human ratings, indicating that top-performing LLMs can
effectively support the evaluation of Estonian-language models.
[LINK]http://arxiv.org/abs/2510.21193v1
[DATE]2025-10-24 14:56:28+08:00
[CATEGORIES]cs.CL
AcuRank: Uncertainty-Aware Adaptive Computation for Listwise Reranking
[AUTHORS]Soyoung Yoon, Gyuwan Kim, Gyu-Hwung Cho, Seung-won Hwang
[ABSTRACT]Listwise reranking with large language models (LLMs) enhances top-ranked
results in retrieval-based applications. Due to the limit in context size and
high inference cost of long context, reranking is typically performed over a
fixed size of small subsets, with the final ranking aggregated from these
partial results. This fixed computation disregards query difficulty and
document distribution, leading to inefficiencies. We propose AcuRank, an
adaptive reranking framework that dynamically adjusts both the amount and
target of computation based on uncertainty estimates over document relevance.
Using a Bayesian TrueSkill model, we iteratively refine relevance estimates
until reaching sufficient confidence levels, and our explicit modeling of
ranking uncertainty enables principled control over reranking behavior and
avoids unnecessary updates to confident predictions. Results on the TREC-DL and
BEIR benchmarks show that our method consistently achieves a superior
accuracy-efficiency trade-off and scales better with compute than
fixed-computation baselines. These results highlight the effectiveness and
generalizability of our method across diverse retrieval tasks and LLM-based
reranking models.
[COMMENTS]Accepted at NeurIPS 2025. The first two authors contributed equally.
Author order is randomly determined via coin toss
[LINK]http://arxiv.org/abs/2505.18512v2
[DATE]2025-10-24 14:55:54+08:00
[CATEGORIES]cs.CL cs.LG
PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
[AUTHORS]Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He, Xinbing Wang, Zhouhan Lin
[ABSTRACT]The remarkable success of Chain-of-Thought (CoT), which enhances performance
by scaling generation steps at test-time, inspires us to ask: can we leverage a
similar scaling of computational steps during pretraining to improve the
generation of each individual token? To address this, we propose a novel
pre-training methodology: Pretraining Language Models with Latent Thoughts
(PonderLM-2). Our approach pretrains a language model (LM) to first generate an
intermediate latent thought (the last hidden state of the current position), which
is then used as input to predict the actual subsequent token. This additional
computational step enables the LM to refine its prediction within unconstrained
continuous space. Our experiments demonstrate that, at an identical inference
cost, an LM that generates one additional latent thought per token outperforms a
standard model with double the parameters. For instance, our
PonderLM-2-Pythia-1.4B, pretrained on 300B tokens from the Pile, significantly
surpasses the vanilla Pythia-2.8B trained on the same data on both language
modeling and a range of general downstream tasks. Furthermore, increasing the
number of latent thoughts generated before each actual token, forming a chain
analogous to CoT, consistently improves the model’s performance.
[LINK]http://arxiv.org/abs/2509.23184v2
[DATE]2025-10-24 14:43:56+08:00
[CATEGORIES]cs.CL
FITS: Towards an AI-Driven Fashion Information Tool for Sustainability
[AUTHORS]Daphne Theodorakopoulos, Elisabeth Eberling, Miriam Bodenheimer, Sabine Loos, Frederic Stahl
[ABSTRACT]Access to credible sustainability information in the fashion industry remains
limited and challenging to interpret, despite growing public and regulatory
demands for transparency. General-purpose language models often lack
domain-specific knowledge and tend to “hallucinate”, which is particularly
harmful for fields where factual correctness is crucial. This work explores how
Natural Language Processing (NLP) techniques can be applied to classify
sustainability data for fashion brands, thereby addressing the scarcity of
credible and accessible information in this domain. We present a prototype
Fashion Information Tool for Sustainability (FITS), a transformer-based system
that extracts and classifies sustainability information from credible,
unstructured text sources: NGO reports and scientific publications. Several
BERT-based language models, including models pretrained on scientific and
climate-specific data, are fine-tuned on our curated corpus using a
domain-specific classification schema, with hyperparameters optimized via
Bayesian optimization. FITS allows users to search for relevant data, analyze
their own data, and explore the information via an interactive interface. We
evaluated FITS in two focus groups of potential users concerning usability,
visual design, content clarity, possible use cases, and desired features. Our
results highlight the value of domain-adapted NLP in promoting informed
decision-making and emphasize the broader potential of AI applications in
addressing climate-related challenges. Finally, this work provides a valuable
dataset, the SustainableTextileCorpus, along with a methodology for future
updates. Code available at
github.com/daphne12345/FITS.
[LINK]http://arxiv.org/abs/2509.26017v2
[DATE]2025-10-24 14:35:07+08:00
[CATEGORIES]cs.LG cs.CL
Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference
[AUTHORS]Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Grosse
[ABSTRACT]Reinforcement learning (RL) has become a predominant technique to align
language models (LMs) with human preferences or promote outputs which are
deemed to be desirable by a given reward function. Standard RL approaches
optimize average reward, while methods explicitly focused on reducing the
probability of undesired outputs typically come at a cost to average-case
performance. To improve this tradeoff, we introduce RePULSe, a new training
method that augments the standard RL loss with an additional loss that uses
learned proposals to guide sampling low-reward outputs, and then reduces those
outputs’ probability. We run experiments demonstrating that RePULSe produces a
better tradeoff of expected reward versus the probability of undesired outputs
and is more adversarially robust, compared to standard RL alignment approaches
and alternatives.
[LINK]http://arxiv.org/abs/2510.21184v1
[DATE]2025-10-24 14:23:55+08:00
[CATEGORIES]cs.LG cs.CL
SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
[AUTHORS]Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
[ABSTRACT]Fine-tuning vision language models (VLMs) has achieved remarkable performance
across various downstream tasks; yet, it requires access to model gradients
through backpropagation (BP), making them unsuitable for memory-constrained,
inference-only edge devices. To address this limitation, previous work has
explored various BP-free fine-tuning methods. However, these approaches often
rely on high-variance evolutionary strategies (ES) or zeroth-order (ZO)
optimization, and often fail to achieve satisfactory performance. In this
paper, we propose a hybrid Sharpness-aware Zeroth-order optimization (SharpZO)
approach, specifically designed to enhance the performance of ZO VLM
fine-tuning via a sharpness-aware warm-up training. SharpZO features a
two-stage optimization process: a sharpness-aware ES stage that globally
explores and smooths the loss landscape to construct a strong initialization,
followed by a fine-grained local search via sparse ZO optimization. The entire
optimization relies solely on forward passes. Detailed theoretical analysis and
extensive experiments on CLIP models demonstrate that SharpZO significantly
improves accuracy and convergence speed, achieving up to 7% average gain over
state-of-the-art forward-only methods.
[LINK]http://arxiv.org/abs/2506.20990v2
[DATE]2025-10-24 14:22:59+08:00
[CATEGORIES]cs.LG cs.CL
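The forward-only optimization that SharpZO (above) relies on can be illustrated with a generic two-point zeroth-order gradient estimate. This is a sketch under assumptions: it shows only plain ZO descent on a toy quadratic, not the sharpness-aware ES warm-up or the sparse ZO stage described in the abstract. The function names and hyperparameters (`mu`, `num_samples`, the step size) are hypothetical.

```python
import random


def zo_gradient(loss, theta, rng, mu=1e-3, num_samples=64):
    """Estimate the gradient of `loss` at `theta` using only loss evaluations."""
    grad = [0.0] * len(theta)
    for _ in range(num_samples):
        u = [rng.gauss(0, 1) for _ in theta]  # random probe direction
        plus = loss([t + mu * ui for t, ui in zip(theta, u)])
        minus = loss([t - mu * ui for t, ui in zip(theta, u)])
        slope = (plus - minus) / (2 * mu)     # directional derivative estimate
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_samples
    return grad


def quadratic(theta):
    """Toy loss standing in for a prompt-tuning objective."""
    return sum(t * t for t in theta)


rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
initial_loss = quadratic(theta)
for _ in range(300):
    g = zo_gradient(quadratic, theta, rng)
    theta = [t - 0.05 * gi for t, gi in zip(theta, g)]
final_loss = quadratic(theta)
print(initial_loss, final_loss)
```

Each update costs `2 * num_samples` forward evaluations and no backward pass. That is the trade-off forward-only methods accept, and the variance of the estimate is the issue the abstract attributes to earlier BP-free approaches.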
KBE-DME: Dynamic Multimodal Evaluation via Knowledge Enhanced Benchmark Evolution
[AUTHORS]Junzhe Zhang, Huixuan Zhang, Xiaojun Wan
[ABSTRACT]The rapid progress of multimodal large language models (MLLMs) calls for more
reliable evaluation protocols. Existing static benchmarks suffer from the
potential risk of data contamination and saturation, leading to inflated or
misleading performance evaluations. To address these issues, we first apply
a graph formulation to represent a static or dynamic VQA sample. With this
formulation, we propose Knowledge-enhanced Benchmark Evolution (KBE), a dynamic
multimodal evaluation framework. KBE first analyzes the original static
benchmark, then expands it by integrating multimodal knowledge, transforming
the static benchmark into a controllable, dynamically evolving version. Crucially,
KBE can both reconstruct questions by re-selecting visual information in the
original image and expand existing questions with external textual knowledge.
It enables difficulty-controllable evaluation by adjusting the degree of
question exploration. Extensive experiments demonstrate that KBE alleviates the
risks of data contamination and data saturation, and provides a more comprehensive
assessment of MLLM capabilities.
[COMMENTS]Submitting to ICLR 2026
[LINK]http://arxiv.org/abs/2510.21182v1
[DATE]2025-10-24 14:13:36+08:00
[CATEGORIES]cs.CL
DePass: Unified Feature Attributing by Simple Decomposed Forward Pass
[AUTHORS]Xiangyu Hong, Che Jiang, Kai Tian, Biqing Qi, Youbang Sun, Ning Ding, Bowen Zhou
[ABSTRACT]Attributing the behavior of Transformer models to internal computations is a
central challenge in mechanistic interpretability. We introduce DePass, a
unified framework for feature attribution based on a single decomposed forward
pass. DePass decomposes hidden states into customized additive components, then
propagates them with attention scores and MLP activations fixed. It achieves
faithful, fine-grained attribution without requiring auxiliary training. We
validate DePass across token-level, model component-level, and subspace-level
attribution tasks, demonstrating its effectiveness and fidelity. Our
experiments highlight its potential to attribute information flow between
arbitrary components of a Transformer model. We hope DePass serves as a
foundational tool for broader applications in interpretability.
[LINK]http://arxiv.org/abs/2510.18462v2
[DATE]2025-10-24 14:00:34+08:00
[CATEGORIES]cs.CL
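The decomposition idea in the DePass entry above rests on linearity: once attention scores are frozen, the attention-value path is a linear map, so additive components of the hidden state can be propagated separately and summed. The sketch below checks this on a toy example. It covers only that path, not the MLP handling or the rest of the authors' framework, and all matrices and names are illustrative assumptions.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Frozen attention weights for 3 tokens (each row sums to 1).
attn = [[1.0, 0.0, 0.0],
        [0.4, 0.6, 0.0],
        [0.2, 0.3, 0.5]]
# Toy value projection (2 -> 2 dimensions).
w_v = [[0.5, -1.0],
       [2.0, 0.25]]

# Two additive components of the hidden states (3 tokens x 2 dims).
comp_a = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5]]
comp_b = [[0.0, 2.0], [1.0, -1.0], [0.0, 0.0]]
hidden = add(comp_a, comp_b)


def attn_value_path(x):
    """Apply the value projection, then mix tokens with the frozen weights."""
    return matmul(attn, matmul(x, w_v))


full = attn_value_path(hidden)
parts = add(attn_value_path(comp_a), attn_value_path(comp_b))
max_diff = max(abs(f - p)
               for row_f, row_p in zip(full, parts)
               for f, p in zip(row_f, row_p))
print(max_diff)  # zero up to floating-point rounding
```

The per-component outputs show how much each component contributes to every token's output, which is the kind of attribution a single decomposed forward pass is meant to provide.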
DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
[AUTHORS]Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu, Zhuang Li
[ABSTRACT]Vision-Language Models (VLMs) generate discourse-level, multi-sentence visual
descriptions, challenging text scene graph parsers built for single-sentence
caption-to-graph mapping. Current approaches typically merge sentence-level
parsing outputs for discourse input, often missing phenomena like
cross-sentence coreference, resulting in fragmented graphs and degraded
downstream VLM task performance. We introduce a new task, Discourse-level text
Scene Graph parsing (DiscoSG), and release DiscoSG-DS, a dataset of 400
expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs. Each
caption averages 9 sentences, and each graph contains at least 3 times more
triples than those in existing datasets.
Fine-tuning GPT-4o on DiscoSG-DS yields over 40% higher SPICE metric than the
best sentence-merging baseline. However, its high inference cost and licensing
restrict open-source use. Smaller fine-tuned open-source models (e.g., Flan-T5)
perform well on simpler graphs yet degrade on denser, more complex graphs. To
bridge this gap, we introduce DiscoSG-Refiner, a lightweight open-source parser
that drafts a seed graph and iteratively refines it with a novel learned
graph-editing model, achieving 30% higher SPICE than the baseline while
delivering 86 times faster inference than GPT-4o. It generalises from simple to
dense graphs, thereby consistently improving downstream VLM tasks, including
discourse-level caption evaluation and hallucination detection, outperforming
alternative open-source parsers. Code and data are available at
https://github.com/ShaoqLin/DiscoSG .
[COMMENTS]EMNLP 2025 (oral), 26 pages
[LINK]http://arxiv.org/abs/2506.15583v3
[DATE]2025-10-24 13:53:07+08:00
[CATEGORIES]cs.CL
Interpretable Next-token Prediction via the Generalized Induction Head
[AUTHORS]Eunji Kim, Sriya Mantena, Weiwei Yang, Chandan Singh, Sungroh Yoon, Jianfeng Gao
[ABSTRACT]While large transformer models excel in predictive performance, their lack of
interpretability restricts their usefulness in high-stakes domains. To remedy
this, we propose the Generalized Induction-Head Model (GIM), an interpretable
model for next-token prediction inspired by the observation of “induction
heads” in LLMs. GIM is a retrieval-based module that identifies similar
sequences in the input context by combining exact n-gram matching and fuzzy
matching based on a neural similarity metric. We evaluate GIM in two settings:
language modeling and fMRI response prediction. In language modeling, GIM
improves next-token prediction by up to 25%p over interpretable baselines,
significantly narrowing the gap with black-box LLMs. In an fMRI setting, GIM
improves neural response prediction by 20% and offers insights into the
language selectivity of the brain. GIM represents a significant step toward
uniting interpretability and performance across domains. The code is available
at https://github.com/ejkim47/generalized-induction-head.
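The exact n-gram matching half of this idea can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the released GIM code: the function name `induction_predict`, the `max_n` parameter, and the majority-vote rule are hypothetical, and the fuzzy neural-similarity matching is omitted entirely.

```python
from collections import Counter

def induction_predict(context, max_n=3):
    """Predict the next token by exact suffix matching within the context.

    Looks for earlier positions where the last n tokens of the context
    occurred (longest n first) and votes for the tokens that followed them.
    Returns (predicted_token, matched_n), or (None, 0) if nothing matches.
    """
    for n in range(min(max_n, len(context) - 1), 0, -1):
        suffix = context[-n:]
        votes = Counter(
            context[i + n]
            for i in range(len(context) - n)  # excludes the suffix itself
            if context[i:i + n] == suffix
        )
        if votes:
            return votes.most_common(1)[0][0], n
    return None, 0

tokens = "the cat sat on the mat and the cat".split()
prediction, matched_n = induction_predict(tokens)
```

Because every prediction traces back to concrete earlier positions in the input, the matched spans can be shown directly as the explanation for the output.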
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2411.00066v2
[DATE]2025-10-24 13:50:14+08:00
[CATEGORIES]cs.CL cs.LG
SUBQRAG: Sub-Question Driven Dynamic Graph RAG
[AUTHORS]Jiaoyang Li, Junhao Ruan, Shengwei Tang, Saihan Chen, Kaiyan Chang, Yuan Ge, Tong Xiao, Jingbo Zhu
[ABSTRACT]Graph Retrieval-Augmented Generation (Graph RAG) effectively builds a
knowledge graph (KG) to connect disparate facts across a large document corpus.
However, this broad-view approach often lacks the deep structured reasoning
needed for complex multi-hop question answering (QA), leading to incomplete
evidence and error accumulation. To address these limitations, we propose
SubQRAG, a sub-question-driven framework that enhances reasoning depth. SubQRAG
decomposes a complex question into an ordered chain of verifiable
sub-questions. For each sub-question, it retrieves relevant triples from the
graph. When the existing graph is insufficient, the system dynamically expands
it by extracting new triples from source documents in real time. All triples
used in the reasoning process are aggregated into a “graph memory,” forming a
structured and traceable evidence path for final answer generation. Experiments
on three multi-hop QA benchmarks demonstrate that SubQRAG achieves consistent
and significant improvements, especially in Exact Match scores.
[COMMENTS]5 pages, 1 figure
[LINK]http://arxiv.org/abs/2510.07718v2
[DATE]2025-10-24 13:01:32+08:00
[CATEGORIES]cs.CL
Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
[AUTHORS]Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
[ABSTRACT]Recent reasoning-focused language models achieve high accuracy by generating
lengthy intermediate reasoning paths before producing final answers. While this
approach is effective in solving problems that require logical thinking, long
reasoning paths significantly increase memory usage and reduce throughput of
token generation, limiting the practical deployment of such models. We propose
Reasoning Path Compression (RPC), a training-free method that accelerates
inference by leveraging the semantic sparsity of reasoning paths. RPC
periodically compresses the KV cache by retaining cache entries that receive
high importance scores, which are computed using a selector window composed of
recently generated queries. Experiments show that RPC improves generation
throughput of QwQ-32B by up to 1.60$\times$ compared to the inference with full
KV cache, with an accuracy drop of 1.2% on the AIME 2024 benchmark. Our
findings demonstrate that semantic sparsity in reasoning traces can be
effectively exploited for compression, offering a practical path toward
efficient deployment of reasoning LLMs. Our code is available at
https://github.com/jiwonsong-dev/ReasoningPathCompression.
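One way to picture the compression step is to score each cached entry by how much attention it receives from a window of recent queries and keep only the top entries. The sketch below is a simplified single-head NumPy illustration under assumptions of my own (mean attention weight as the importance score, a fixed `keep` budget); the actual scoring and scheduling used by RPC are defined in the paper and the repository above.

```python
import numpy as np

def compress_kv(keys, values, recent_queries, keep):
    """Keep the `keep` cache entries that the recent queries attend to most.

    keys, values:   (n_cached, d) arrays for one attention head.
    recent_queries: (window, d) queries from the selector window.
    Returns the retained keys, retained values, and their original indices.
    """
    d = keys.shape[-1]
    logits = recent_queries @ keys.T / np.sqrt(d)   # (window, n_cached)
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)
    importance = attn.mean(axis=0)                  # one score per cache entry
    kept = np.sort(np.argsort(importance)[-keep:])  # preserve token order
    return keys[kept], values[kept], kept

# Toy cache: 8 entries with orthogonal keys; the recent queries point at 2 and 5.
keys = np.eye(8)
values = np.arange(16, dtype=float).reshape(8, 2)
recent_queries = 10.0 * np.eye(8)[[2, 5]]
small_keys, small_values, kept = compress_kv(keys, values, recent_queries, keep=2)
```

Sorting the retained indices keeps the surviving entries in their original token order, so positional structure is not scrambled by the selection.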
[LINK]http://arxiv.org/abs/2505.13866v2
[DATE]2025-10-24 12:48:06+08:00
[CATEGORIES]cs.CL
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
[AUTHORS]Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Hoo-Chang Shin, Felipe Soares, Alexander Bukharin, Ellie Evans, Yi Dong, Oleksii Kuchaiev
[COMMENTS]NeurIPS 2025 Datasets and Benchmarks Track Camera Ready, 46 pages, 2
figures
[LINK]http://arxiv.org/abs/2505.11475v2
[DATE]2025-10-24 12:04:19+08:00
[CATEGORIES]cs.CL cs.LG
Large Language Models Meet Text-Attributed Graphs: A Survey of Integration Frameworks and Applications
[AUTHORS]Guangxin Su, Hanchen Wang, Jianwei Wang, Wenjie Zhang, Ying Zhang, Jian Pei
[ABSTRACT]Large Language Models (LLMs) have achieved remarkable success in natural
language processing through strong semantic understanding and generation.
However, their black-box nature limits structured and multi-hop reasoning. In
contrast, Text-Attributed Graphs (TAGs) provide explicit relational structures
enriched with textual context, yet often lack semantic depth. Recent research
shows that combining LLMs and TAGs yields complementary benefits: enhancing TAG
representation learning and improving the reasoning and interpretability of
LLMs. This survey provides the first systematic review of LLM–TAG integration
from an orchestration perspective. We introduce a novel taxonomy covering two
fundamental directions: LLM for TAG, where LLMs enrich graph-based tasks, and
TAG for LLM, where structured graphs improve LLM reasoning. We categorize
orchestration strategies into sequential, parallel, and multi-module
frameworks, and discuss advances in TAG-specific pretraining, prompting, and
parameter-efficient fine-tuning. Beyond methodology, we summarize empirical
insights, curate available datasets, and highlight diverse applications across
recommendation systems, biomedical analysis, and knowledge-intensive question
answering. Finally, we outline open challenges and promising research
directions, aiming to guide future work at the intersection of language and
graph learning.
[COMMENTS]Surveys and overviews; Natural language processing; Knowledge
representation and reasoning; Graph algorithms
[LINK]http://arxiv.org/abs/2510.21131v1
[DATE]2025-10-24 11:53:00+08:00
[CATEGORIES]cs.CL cs.LG
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching
[AUTHORS]Simon A. Aytes, Jinheon Baek, Sung Ju Hwang
[COMMENTS]EMNLP 2025
[LINK]http://arxiv.org/abs/2503.05179v4
[DATE]2025-10-24 11:49:33+08:00
[CATEGORIES]cs.CL cs.LG
Embedding Trust: Semantic Isotropy Predicts Nonfactuality in Long-Form Text Generation
[AUTHORS]Dhrupad Bhardwaj, Julia Kempe, Tim G. J. Rudner
[LINK]http://arxiv.org/abs/2510.21891v1
[DATE]2025-10-24 11:24:57+08:00
[CATEGORIES]cs.CL cs.LG
Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
[AUTHORS]Yiju Guo, Wenkai Yang, Zexu Sun, Ning Ding, Zhiyuan Liu, Yankai Lin
[ABSTRACT]Large language models (LLMs) have demonstrated significant improvements in
contextual understanding. However, their ability to attend to truly critical
information during long-context reasoning and generation still lags behind.
Specifically, our preliminary experiments reveal that certain distracting
patterns can misdirect the model’s attention during inference, and removing
these patterns substantially improves reasoning accuracy and generation
quality. We attribute this phenomenon to spurious correlations in the training
data, which obstruct the model’s capacity to infer authentic causal
instruction-response relationships. This phenomenon may induce redundant
reasoning processes, potentially resulting in significant inference overhead
and, more critically, the generation of erroneous or suboptimal responses. To
mitigate this, we introduce a two-stage framework called Learning to Focus
(LeaF) leveraging intervention-based inference to disentangle confounding
factors. In the first stage, LeaF employs gradient-based comparisons with an
advanced teacher to automatically identify confounding tokens based on causal
relationships in the training corpus. Then, in the second stage, it prunes
these tokens during distillation to enact intervention, aligning the student’s
attention with the teacher’s focus distribution on truly critical context
tokens. Experimental results demonstrate that LeaF not only achieves an
absolute improvement in various mathematical reasoning, code generation and
multi-hop question answering benchmarks but also effectively suppresses
attention to confounding tokens during inference, yielding a more interpretable
and reliable reasoning model.
[COMMENTS]Accepted at NeurIPS 2025, camera-ready version
[LINK]http://arxiv.org/abs/2506.07851v2
[DATE]2025-10-24 11:13:05+08:00
[CATEGORIES]cs.CL
FAITH: A Framework for Assessing Intrinsic Tabular Hallucinations in Finance
[AUTHORS]Mengao Zhang, Jiayu Fu, Tanya Warrier, Yuwen Wang, Tianhui Tan, Ke-wei Huang
[ABSTRACT]Hallucination remains a critical challenge for deploying Large Language
Models (LLMs) in finance. Accurate extraction and precise calculation from
tabular data are essential for reliable financial analysis, since even minor
numerical errors can undermine decision-making and regulatory compliance.
Financial applications have unique requirements, often relying on
context-dependent, numerical, and proprietary tabular data that existing
hallucination benchmarks rarely capture. In this study, we develop a rigorous
and scalable framework for evaluating intrinsic hallucinations in financial
LLMs, conceptualized as a context-aware masked span prediction task over
real-world financial documents. Our main contributions are: (1) a novel,
automated dataset creation paradigm using a masking strategy; (2) a new
hallucination evaluation dataset derived from S&P 500 annual reports; and (3) a
comprehensive evaluation of intrinsic hallucination patterns in
state-of-the-art LLMs on financial tabular data. Our work provides a robust
methodology for in-house LLM evaluation and serves as a critical step toward
building more trustworthy and reliable financial Generative AI systems.
[COMMENTS]9 pages, ACM ICAIF’25
[LINK]http://arxiv.org/abs/2508.05201v2
[DATE]2025-10-24 11:11:45+08:00
[CATEGORIES]cs.LG cs.CL
Dependency Parsing is More Parameter-Efficient with Normalization
[AUTHORS]Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño
[ABSTRACT]Dependency parsing is the task of inferring natural language structure, often
approached by modeling word interactions via attention through biaffine
scoring. This mechanism works like self-attention in Transformers, where scores
are calculated for every pair of words in a sentence. However, unlike
Transformer attention, biaffine scoring does not use normalization prior to
taking the softmax of the scores. In this paper, we provide theoretical
evidence and empirical results revealing that a lack of normalization
necessarily results in overparameterized parser models, where the extra
parameters compensate for the sharp softmax outputs produced by high variance
inputs to the biaffine scoring function. We argue that biaffine scoring can be
made substantially more efficient by performing score normalization. We conduct
experiments on semantic and syntactic dependency parsing in multiple languages,
along with latent graph inference on non-linguistic data, using various
settings of a $k$-hop parser. We train $N$-layer stacked BiLSTMs and evaluate
the parser’s performance with and without normalizing biaffine scores.
Normalizing allows us to achieve state-of-the-art performance with fewer
samples and trainable parameters. Code:
https://github.com/paolo-gajo/EfficientSDP
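The effect of score normalization on softmax sharpness can be illustrated with a small NumPy experiment. This sketch assumes the same 1/sqrt(d) scaling used in Transformer attention; the paper's exact normalization may differ, and the random representations and the weight matrix `u` are hypothetical stand-ins for a trained parser.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def biaffine_scores(heads, deps, u, normalize):
    """Score every (dependent, head) pair as deps @ U @ heads^T."""
    scores = deps @ u @ heads.T
    if normalize:
        scores = scores / np.sqrt(heads.shape[-1])  # Transformer-style scaling
    return scores

def mean_entropy(p):
    """Average entropy of the per-dependent head distributions."""
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
n_words, d = 6, 64
heads = rng.normal(size=(n_words, d))
deps = rng.normal(size=(n_words, d))
u = rng.normal(size=(d, d)) / np.sqrt(d)  # hypothetical initialization

entropy_raw = mean_entropy(softmax(biaffine_scores(heads, deps, u, normalize=False)))
entropy_norm = mean_entropy(softmax(biaffine_scores(heads, deps, u, normalize=True)))
```

Dividing the scores by a constant greater than one acts like raising the softmax temperature, so `entropy_norm` comes out higher than `entropy_raw`: the unnormalized scores yield the sharper distributions that the abstract says extra parameters must compensate for.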
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.20215v2
[DATE]2025-10-24 11:08:06+08:00
[CATEGORIES]cs.CL
NUM2EVENT: Interpretable Event Reasoning from Numerical time-series
[AUTHORS]Ninghui Feng, Yiyan Qi
[ABSTRACT]Large language models (LLMs) have recently demonstrated impressive multimodal
reasoning capabilities, yet their understanding of purely numerical time-series
signals remains limited. Existing approaches mainly focus on forecasting or
trend description, without uncovering the latent events that drive numerical
changes or explaining the reasoning process behind them. In this work, we
introduce the task of number-to-event reasoning and decoding, which aims to
infer interpretable structured events from numerical inputs, even when current
text is unavailable. To address the data scarcity and semantic alignment
challenges, we propose a reasoning-aware framework that integrates an
agent-guided event extractor (AGE), a marked multivariate Hawkes-based
synthetic generator (EveDTS), and a two-stage fine-tuning pipeline combining a
time-series encoder with a structured decoder. Our model explicitly reasons
over numerical changes, generates intermediate explanations, and outputs
structured event hypotheses. Experiments on multi-domain datasets show that our
method substantially outperforms strong LLM baselines in event-level precision
and recall. These results suggest a new direction for bridging quantitative
reasoning and semantic understanding, enabling LLMs to explain and predict
events directly from numerical dynamics.
[LINK]http://arxiv.org/abs/2510.23630v1
[DATE]2025-10-24 10:57:11+08:00
[CATEGORIES]cs.LG cs.CL
Alert-ME: An Explainability-Driven Defense Against Adversarial Examples in Transformer-Based Text Classification
[AUTHORS]Bushra Sabir, Yansong Gao, Alsharif Abuadbba, M. Ali Babar
[ABSTRACT]Transformer-based text classifiers such as BERT, RoBERTa, T5, and GPT have
shown strong performance in natural language processing tasks but remain
vulnerable to adversarial examples. These vulnerabilities raise significant
security concerns, as small input perturbations can cause severe
misclassifications. Existing robustness methods often require heavy computation
or lack interpretability. This paper presents a unified framework called
Explainability-driven Detection, Identification, and Transformation (EDIT) to
strengthen inference-time defenses. EDIT integrates explainability tools,
including attention maps and integrated gradients, with frequency-based
features to automatically detect and identify adversarial perturbations while
offering insight into model behavior. After detection, EDIT refines adversarial
inputs using an optimal transformation process that leverages pre-trained
embeddings and model feedback to replace corrupted tokens. To enhance security
assurance, EDIT incorporates automated alerting mechanisms that involve human
analysts when necessary.
Beyond static defenses, EDIT also provides adaptive resilience by enforcing
internal feature similarity and transforming inputs, thereby disrupting the
attacker’s optimization process and limiting the effectiveness of adaptive
adversarial attacks. Experiments using BERT and RoBERTa on IMDB, YELP, AGNEWS,
and SST2 datasets against seven word substitution attacks demonstrate that EDIT
achieves an average F-score of 89.69 percent and balanced accuracy of 89.70
percent. Compared to four state-of-the-art defenses, EDIT improves balanced
accuracy by 1.22 times and F1-score by 1.33 times while being 83 times faster
in feature extraction. The framework provides robust, interpretable, and
efficient protection against standard, zero-day, and adaptive adversarial
threats in text classification models.
[LINK]http://arxiv.org/abs/2307.01225v3
[DATE]2025-10-24 10:56:33+08:00
[CATEGORIES]cs.CL cs.LG
Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
[AUTHORS]Li Ji-An, Hua-Dong Xiong, Robert C. Wilson, Marcelo G. Mattar, Marcus K. Benna
[ABSTRACT]Large language models (LLMs) can sometimes report the strategies they
actually use to solve tasks, yet at other times seem unable to recognize those
strategies that govern their behavior. This suggests a limited degree of
metacognition - the capacity to monitor one’s own cognitive processes for
subsequent reporting and self-control. Metacognition enhances LLMs’
capabilities in solving complex tasks but also raises safety concerns, as
models may obfuscate their internal processes to evade neural-activation-based
oversight (e.g., safety detector). Given society’s increased reliance on these
models, it is critical that we understand their metacognitive abilities. To
address this, we introduce a neuroscience-inspired neurofeedback paradigm that
uses in-context learning to quantify metacognitive abilities of LLMs to report
and control their activation patterns. We demonstrate that their abilities
depend on several factors: the number of in-context examples provided, the
semantic interpretability of the neural activation direction (to be
reported/controlled), and the variance explained by that direction. These
directions span a “metacognitive space” with dimensionality much lower than the
model’s neural space, suggesting LLMs can monitor only a small subset of their
neural activations. Our paradigm provides empirical evidence to quantify
metacognition in LLMs, with significant implications for AI safety (e.g.,
adversarial attack and defense).
[LINK]http://arxiv.org/abs/2505.13763v2
[DATE]2025-10-24 10:36:51+08:00
[CATEGORIES]cs.CL
Robust Preference Alignment via Directional Neighborhood Consensus
[AUTHORS]Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei
[ABSTRACT]Aligning large language models with human preferences is critical for
creating reliable and controllable AI systems. A human preference can be
visualized as a high-dimensional vector where different directions represent
trade-offs between desired attributes (e.g., helpfulness vs. verbosity). Yet,
because the training data often reflects dominant, average preferences, LLMs
tend to perform well on common requests but fall short in specific, individual
needs. This mismatch creates a preference coverage gap. Existing methods often
address this through costly retraining, which may not generalize to the
full spectrum of diverse preferences. This brittleness means that when a user’s
request reflects a nuanced preference deviating from the training data’s
central tendency, model performance can degrade unpredictably. To address this
challenge, we introduce Robust Preference Selection (RPS), a post-hoc,
training-free method that leverages directional neighborhood consensus. Instead
of forcing a model to generate a response from a single, highly specific
preference, RPS samples multiple responses from a local neighborhood of related
preferences to create a superior candidate pool. It then selects the response
that best aligns with the user’s original intent. We provide a theoretical
framework showing our neighborhood generation strategy is provably superior to
a strong baseline that also samples multiple candidates. Comprehensive
experiments across three distinct alignment paradigms (DPA, DPO, and SFT)
demonstrate that RPS consistently improves robustness against this baseline,
achieving win rates of up to 69% on challenging preferences from
under-represented regions of the space without any model retraining. Our work
presents a practical, theoretically-grounded solution for enhancing the
reliability of preference-aligned models.
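The sample-then-select loop described in this abstract can be sketched as follows. Everything here is a toy under stated assumptions rather than the RPS implementation: preferences are 2-D unit vectors, `toy_generate` is a hypothetical stand-in for conditioning a model on a direction and scoring the response's attributes, and selection uses a plain dot product with the user's original direction.

```python
import numpy as np

def neighborhood_directions(v, k, max_angle_deg, rng):
    """Sample k unit vectors within max_angle_deg of the 2-D direction v."""
    base = np.arctan2(v[1], v[0])
    offsets = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg, size=k))
    return np.stack([np.cos(base + offsets), np.sin(base + offsets)], axis=1)

def toy_generate(direction, rng):
    """Hypothetical stand-in for an LLM call: returns a noisy attribute vector
    for the response generated under the given preference direction."""
    return direction + rng.normal(scale=0.1, size=2)

def select_response(v, candidates):
    """Return the (name, attributes) pair that projects highest onto v."""
    scores = [float(np.dot(v, attrs)) for _, attrs in candidates]
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
user_direction = np.array([0.6, 0.8])  # hypothetical attribute trade-off
neighbors = neighborhood_directions(user_direction, k=4, max_angle_deg=15, rng=rng)
candidates = [(f"response {i}", toy_generate(d, rng)) for i, d in enumerate(neighbors)]
best_name, best_attrs = select_response(user_direction, candidates)
```

The candidates are produced from nearby directions, but the final choice is always judged against the user's original direction, which is the part of the procedure that keeps the output tied to the stated intent.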
[LINK]http://arxiv.org/abs/2510.20498v2
[DATE]2025-10-24 10:19:56+08:00
[CATEGORIES]cs.CL
Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only
[AUTHORS]Qingru Zhang, Liang Qiu, Ilgee Hong, Zhenghao Xu, Tianyi Liu, Shiyang Li, Rongzhi Zhang, Zheng Li, Lihong Li, Bing Yin, Chao Zhang, Jianshu Chen, Haoming Jiang, Tuo Zhao
[ABSTRACT]Supervised fine-tuning (SFT) has emerged as a crucial method for aligning
large language models (LLMs) with human-annotated demonstrations. However, SFT,
being an off-policy approach similar to behavior cloning, often struggles with
overfitting and poor out-of-domain generalization, especially in limited-data
scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel
fine-tuning method that leverages on-policy techniques to enhance
generalization performance. Our approach combines the strengths of SFT and
proximal policy optimization (PPO) to achieve more effective alignment from
demonstration data. At its core is a reward function designed as the log policy
ratio between the SFT model and the pretrained base model. This function serves
as an implicit reward signal, using the pretrained policy as a baseline and the
SFT policy as a target. By doing so, it enables on-policy fine-tuning without
relying on human preference annotations. The integration of this self-rewarding
mechanism with PPO addresses key limitations of SFT, improving generalization,
data efficiency, and robustness. Our empirical evaluation across a range of
natural language processing tasks demonstrates that Self-Rewarding PPO
consistently outperforms traditional SFT methods. The results highlight the
effectiveness of our approach in aligning LLMs using demonstration data,
particularly in scenarios where high-quality annotated data is scarce.
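The reward described above, a log policy ratio between the SFT model and the pretrained base model, reduces to a difference of sequence log-probabilities. The sketch below assumes per-token log-probabilities for the same response are already available from both models; the function name and the toy numbers are hypothetical, and details such as length normalization or clipping are not specified in the abstract and are left out.

```python
import math

def log_ratio_reward(sft_token_logprobs, base_token_logprobs):
    """Sequence-level reward: log pi_SFT(y|x) - log pi_base(y|x)."""
    if len(sft_token_logprobs) != len(base_token_logprobs):
        raise ValueError("both models must score the same response tokens")
    return sum(sft_token_logprobs) - sum(base_token_logprobs)

# Toy two-token response: the SFT model finds each token twice as likely.
sft = [math.log(0.5), math.log(0.4)]
base = [math.log(0.25), math.log(0.2)]
reward = log_ratio_reward(sft, base)
```

A positive value means the SFT policy assigns the response more probability than the base policy does, so responses that look more like the demonstrations are rewarded without any preference labels.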
[COMMENTS]Accepted by COLM 2025
[LINK]http://arxiv.org/abs/2510.21090v1
[DATE]2025-10-24 10:02:13+08:00
[CATEGORIES]cs.CL cs.LG
Designing and Evaluating Hint Generation Systems for Science Education
[AUTHORS]Anubhav Jangra, Smaranda Muresan
[ABSTRACT]Large language models are influencing the education landscape, with students
relying on them in their learning process. Often implemented using
general-purpose models, these systems are likely to give away the answers,
which could hinder conceptual understanding and critical thinking. We study the
role of automatic hint generation as a pedagogical strategy to promote active
engagement with the learning content, while guiding learners toward the
answers. Focusing on scientific topics at the secondary education level, we
explore the potential of large language models to generate chains of hints that
scaffold learners without revealing answers. We compare two distinct hinting
strategies: static hints, pre-generated for each problem, and dynamic hints,
adapted to learners’ progress. Through a quantitative study with 41
participants, we uncover different preferences among learners with respect to
hinting strategies, and identify the limitations of automatic evaluation
metrics to capture them. Our findings highlight key design considerations for
future research on hint generation and intelligent tutoring systems that seek
to develop learner-centered educational technologies.
[LINK]http://arxiv.org/abs/2510.21087v1
[DATE]2025-10-24 10:00:16+08:00
[CATEGORIES]cs.CL
BioCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models
[AUTHORS]Ziheng Zhang, Xinyue Ma, Arpita Chowdhury, Elizabeth G. Campolongo, Matthew J. Thompson, Net Zhang, Samuel Stevens, Hilmar Lapp, Tanya Berger-Wolf, Yu Su, Wei-Lun Chao, Jianyang Gu
[ABSTRACT]This work investigates descriptive captions as an additional source of
supervision for biological multimodal foundation models. Images and captions
can be viewed as complementary samples from the latent morphospace of a
species, each capturing certain biological traits. Incorporating captions
during training encourages alignment with this shared latent structure,
emphasizing potentially diagnostic characters while suppressing spurious
correlations. The main challenge, however, lies in obtaining faithful,
instance-specific captions at scale. This requirement has limited the
utilization of natural language supervision in organismal biology compared with
many other scientific domains. We address this gap by generating synthetic
captions with multimodal large language models (MLLMs), guided by
Wikipedia-derived visual information and taxon-tailored format examples. These
domain-specific contexts help reduce hallucination and yield accurate,
instance-based descriptive captions. Using these captions, we train BioCAP
(i.e., BioCLIP with Captions), a biological foundation model that captures rich
semantics and achieves strong performance in species classification and
text-image retrieval. These results demonstrate the value of descriptive
captions beyond labels in bridging biological images with multimodal foundation
models.
[COMMENTS]Project page: https://imageomics.github.io/biocap/
[LINK]http://arxiv.org/abs/2510.20095v2
[DATE]2025-10-24 09:51:09+08:00
[CATEGORIES]cs.CL cs.LG
BLEUBERI: BLEU is a surprisingly effective reward for instruction following
[AUTHORS]Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
[ABSTRACT]Reward models are central to aligning LLMs with human preferences, but they
are costly to train, requiring large-scale human-labeled preference data and
powerful pretrained LLM backbones. Meanwhile, the increasing availability of
high-quality synthetic instruction-following datasets raises the question: can
simpler, reference-based metrics serve as viable alternatives to reward models
during RL-based alignment? In this paper, we show first that BLEU, a basic
string-matching metric, surprisingly matches strong reward models in agreement
with human preferences on general instruction-following datasets. Based on this
insight, we develop BLEUBERI, a method that first identifies challenging
instructions and then applies Group Relative Policy Optimization (GRPO) using
BLEU directly as the reward function. We demonstrate that BLEUBERI-trained
models are competitive with models trained via reward model-guided RL across
four challenging instruction-following benchmarks and three different base
language models. A human evaluation further supports that the quality of
BLEUBERI model outputs is on par with those from reward model-aligned models.
Moreover, BLEUBERI models generate outputs that are more factually grounded
than competing methods. Overall, we show that given access to high-quality
reference outputs (easily obtained via existing instruction-following datasets
or synthetic data generation), string matching-based metrics are cheap yet
effective proxies for reward models during alignment. We release our code and
data at https://github.com/lilakk/BLEUBERI.
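To make "BLEU directly as the reward function" concrete, the following is a minimal, unsmoothed, single-reference sentence-level BLEU in pure Python. It is a sketch for illustration only: the authors' code in the repository above may use a different BLEU implementation, tokenization, or smoothing, and the whitespace tokenization here is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())  # clipped n-gram matches
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on a mat".split()
reward_exact = bleu(reference, reference)
reward_close = bleu("the cat sat on the mat".split(), reference)
```

In an RL loop, the score of each sampled response against the reference output would serve as its scalar reward; unsmoothed BLEU returns zero whenever any n-gram order has no match, so short responses may need a smoothed variant in practice.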
[COMMENTS]NeurIPS camera-ready
[LINK]http://arxiv.org/abs/2505.11080v3
[DATE]2025-10-24 09:33:28+08:00
[CATEGORIES]cs.CL cs.LG
Uniform Information Density and Syntactic Reduction: Revisiting $\textit{that}$-Mentioning in English Complement Clauses
[AUTHORS]Hailin Hao, Elsi Kaiser
[COMMENTS]To appear in EMNLP 2025
[LINK]http://arxiv.org/abs/2509.05254v2
[DATE]2025-10-24 08:55:49+08:00
[CATEGORIES]cs.CL
Bridging Language Gaps with Adaptive RAG: Improving Indonesian Language Question Answering
[AUTHORS]William Christian, Daniel Adamlu, Adrian Yu, Derwin Suhartono
[ABSTRACT]Question Answering (QA) has seen significant improvements with the
advancement of machine learning models. Further studies have enhanced QA
systems by retrieving external information, an approach called
Retrieval-Augmented Generation (RAG), to produce more accurate and informative
answers. However, this state-of-the-art performance is predominantly achieved
in English. To address this gap, we apply an Adaptive RAG system to the
Indonesian language. The Adaptive RAG system integrates a classifier that
assesses question complexity, which in turn determines the strategy for
answering the question. To overcome the limited availability of Indonesian
datasets, our study employs machine translation as a data augmentation
approach. Experiments show that the question complexity classifier is reliable;
however, we observed significant inconsistencies in the multi-retrieval
answering strategy, which negatively impacted the overall evaluation when this
strategy was applied. These findings highlight both the promise and the
challenges of question answering in low-resource languages and suggest
directions for future improvement.
[COMMENTS]12 pages, 7 figures, 5 tables
[LINK]http://arxiv.org/abs/2510.21068v1
[DATE]2025-10-24 08:50:20+08:00
[CATEGORIES]cs.CL
L$^2$M: Mutual Information Scaling Law for Long-Context Language Modeling
[AUTHORS]Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić
[ABSTRACT]We present a universal theoretical framework for understanding long-context
language modeling based on a bipartite mutual information scaling law that we
rigorously verify in natural language. We demonstrate that bipartite mutual
information captures multi-token interactions distinct from and scaling
independently of conventional two-point mutual information, and show that this
provides a more complete characterization of the dependencies needed for
accurately modeling long sequences. Leveraging this scaling law, we formulate
the Long-context Language Modeling (L$^2$M) condition, which lower bounds the
necessary scaling of a model’s history state – the latent variables
responsible for storing past information – for effective long-context
modeling. We validate the framework and its predictions on transformer and
state-space models. Our work provides a principled foundation to understand
long-context modeling and to design more efficient architectures with stronger
long-context capabilities, with potential applications beyond natural language.
[COMMENTS]34 pages, 13 figures, 2 tables
[LINK]http://arxiv.org/abs/2503.04725v2
[DATE]2025-10-24 08:31:37+08:00
[CATEGORIES]cs.CL cs.LG
LVLMs are Bad at Overhearing Human Referential Communication
[AUTHORS]Zhengxiang Wang, Weiling Li, Panagiotis Kaliosis, Owen Rambow, Susan E. Brennan
[COMMENTS]EMNLP 2025 (Main)
[LINK]http://arxiv.org/abs/2509.11514v2
[DATE]2025-10-24 08:25:59+08:00
[CATEGORIES]cs.CL
Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
[AUTHORS]Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos
[ABSTRACT]Automatically synthesizing dense rewards from natural language descriptions
is a promising paradigm in reinforcement learning (RL), with applications to
sparse reward problems, open-ended exploration, and hierarchical skill design.
Recent works have made promising steps by exploiting the prior knowledge of
large language models (LLMs). However, these approaches suffer from important
limitations: they are either not scalable to problems requiring billions of
environment samples, due to requiring LLM annotations for each observation, or
they require a diverse offline dataset, which may not exist or be impossible to
collect. In this work, we address these limitations through a combination of
algorithmic and systems-level contributions. We propose ONI, a distributed
architecture that simultaneously learns an RL policy and an intrinsic reward
function using LLM feedback. Our approach annotates the agent’s collected
experience via an asynchronous LLM server, which is then distilled into an
intrinsic reward model. We explore a range of algorithmic choices for reward
modeling with varying complexity, including hashing, classification, and
ranking models. Our approach achieves state-of-the-art performance across a
range of challenging tasks from the NetHack Learning Environment, while
removing the need for large offline datasets required by prior work. We make
our code available at https://github.com/facebookresearch/oni.
[COMMENTS]RLC 2025
[LINK]http://arxiv.org/abs/2410.23022v4
[DATE]2025-10-24 07:54:03+08:00
[CATEGORIES]cs.LG cs.CL
Tensor Product Attention Is All You Need
[AUTHORS]Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
[ABSTRACT]Scaling language models to handle longer input sequences typically
necessitates large key-value (KV) caches, resulting in substantial memory
overhead during inference. In this paper, we propose Tensor Product Attention
(TPA), a novel attention mechanism that uses tensor decompositions to represent
queries, keys, and values compactly, substantially shrinking the KV cache size
at inference time. By factorizing these representations into contextual
low-rank components and seamlessly integrating with Rotary Position Embedding
(RoPE), TPA achieves improved model quality alongside memory efficiency. Based
on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model
architecture for sequence modeling. Through extensive empirical evaluation on
language modeling tasks, we demonstrate that T6 surpasses or matches the
performance of standard Transformer baselines including Multi-Head Attention
(MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and
Multi-Head Latent Attention (MLA) across various metrics, including perplexity
and a range of established evaluation benchmarks. Notably, TPA’s memory
efficiency and computational efficiency at the decoding stage enable processing
longer sequences under fixed resource constraints, addressing a critical
scalability challenge in modern language models. Project Page:
https://github.com/tensorgi/TPA.
[COMMENTS]Published in NeurIPS 2025 (Spotlight); Project Page:
https://github.com/tensorgi/TPA
[LINK]http://arxiv.org/abs/2501.06425v5
[DATE]2025-10-24 07:35:32+08:00
[CATEGORIES]cs.CL cs.LG
Reasoning’s Razor: Reasoning Improves Accuracy but Can Hurt Recall at Critical Operating Points in Safety and Hallucination Detection
[AUTHORS]Atoosa Chegini, Hamid Kazemi, Garrett Souza, Maria Safi, Yang Song, Samy Bengio, Sinead Williamson, Mehrdad Farajtabar
[ABSTRACT]Reasoning has become a central paradigm for large language models (LLMs),
consistently boosting accuracy across diverse benchmarks. Yet its suitability
for precision-sensitive tasks remains unclear. We present the first systematic
study of reasoning for classification tasks under strict low false positive
rate (FPR) regimes. Our analysis covers two tasks, safety detection and
hallucination detection, evaluated in both fine-tuned and zero-shot settings,
using standard LLMs and Large Reasoning Models (LRMs). Our results reveal a
clear trade-off: Think On (reasoning-augmented) generation improves overall
accuracy, but underperforms at the low-FPR thresholds essential for practical
use. In contrast, Think Off (no reasoning during inference) dominates in these
precision-sensitive regimes, with Think On surpassing only when higher FPRs are
acceptable. In addition, we find token-based scoring substantially outperforms
self-verbalized confidence for precision-sensitive deployments. Finally, a
simple ensemble of the two modes recovers the strengths of each. Taken
together, our findings position reasoning as a double-edged tool: beneficial
for average accuracy, but often ill-suited for applications requiring strict
precision.
[LINK]http://arxiv.org/abs/2510.21049v1
[DATE]2025-10-24 07:23:36+08:00
[CATEGORIES]cs.CL cs.LG
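The low-FPR operating points discussed in the abstract above can be illustrated with a minimal sketch. It computes the best recall a score-thresholding detector can reach while keeping the false positive rate at or below a budget. The function name `recall_at_fpr`, the toy scores, and the strict `>` decision rule are assumptions made for illustration; they are not taken from the paper.

```python
import math


def recall_at_fpr(scores, labels, max_fpr):
    """Best recall of the rule `score > threshold` subject to FPR <= max_fpr.

    scores: detector confidences (higher means more likely positive).
    labels: 1 for positive (e.g. unsafe or hallucinated), 0 for negative.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")

    # Number of negatives we are allowed to flag (small epsilon guards float rounding).
    allowed_fp = math.floor(max_fpr * len(negatives) + 1e-9)

    if allowed_fp >= len(negatives):
        threshold = float("-inf")  # every example may be flagged
    else:
        # At most `allowed_fp` negatives score strictly above this value.
        threshold = negatives[allowed_fp]

    return sum(s > threshold for s in positives) / len(positives)


# Hypothetical toy data: 4 positives and 4 negatives.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(recall_at_fpr(toy_scores, toy_labels, max_fpr=0.0))
print(recall_at_fpr(toy_scores, toy_labels, max_fpr=0.25))
```

Comparing two scoring modes at the same `max_fpr` with a function like this is one way to see how a method with higher overall accuracy can still have lower recall at a strict operating point.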
Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering
[AUTHORS]Yavuz Bakman, Sungmin Kang, Zhiqi Huang, Duygu Nur Yaldiz, Catarina G. Belém, Chenyang Zhu, Anoop Kumar, Alfy Samuel, Salman Avestimehr, Daben Liu, Sai Praneeth Karimireddy
[ABSTRACT]Uncertainty Quantification (UQ) research has primarily focused on closed-book
factual question answering (QA), while contextual QA remains unexplored,
despite its importance in real-world applications. In this work, we focus on UQ
for the contextual QA task and propose a theoretically grounded approach to
quantify epistemic uncertainty. We begin by introducing a task-agnostic,
token-level uncertainty measure defined as the cross-entropy between the
predictive distribution of the given model and the unknown true distribution.
By decomposing this measure, we isolate the epistemic component and approximate
the true distribution by a perfectly prompted, idealized model. We then derive
an upper bound for epistemic uncertainty and show that it can be interpreted as
semantic feature gaps in the given model’s hidden representations relative to
the ideal model. We further apply this generic framework to the contextual QA
task and hypothesize that three features approximate this gap: context-reliance
(using the provided context rather than parametric knowledge), context
comprehension (extracting relevant information from context), and honesty
(avoiding intentional lies). Using a top-down interpretability approach, we
extract these features by using only a small number of labeled samples and
ensemble them to form a robust uncertainty score. Experiments on multiple QA
benchmarks in both in-distribution and out-of-distribution settings show that
our method substantially outperforms state-of-the-art unsupervised
(sampling-free and sampling-based) and supervised UQ methods, achieving up to a
13-point PRR improvement while incurring a negligible inference overhead.
[LINK]http://arxiv.org/abs/2510.02671v2
[DATE]2025-10-24 05:43:24+08:00
[CATEGORIES]cs.CL cs.LG
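The cross-entropy decomposition mentioned in the abstract above can be checked numerically with a small sketch. It uses the standard identity that cross-entropy equals the entropy of the true distribution plus the KL divergence from the true distribution to the model. Taking the expectation under the true distribution, reading the KL term as the epistemic part, and the toy next-token distributions are all assumptions made here for illustration; the paper's exact formulation may differ.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p_true, p_model):
    """Cross-entropy H(p_true, p_model), expectation taken under p_true."""
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model) if pt > 0)


def kl_divergence(p_true, p_model):
    """KL(p_true || p_model): the gap that shrinks as the model improves."""
    return sum(pt * math.log(pt / pm) for pt, pm in zip(p_true, p_model) if pt > 0)


# Hypothetical next-token distributions over a 3-token vocabulary.
p_true = [0.7, 0.2, 0.1]
p_model = [0.5, 0.3, 0.2]

total = cross_entropy(p_true, p_model)
irreducible = entropy(p_true)             # does not depend on the model
reducible = kl_divergence(p_true, p_model)  # model-dependent term

print(total, irreducible + reducible)
```

The two printed values agree up to floating-point error, and the model-dependent term is zero exactly when the model matches the true distribution.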
T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning
[AUTHORS]Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, Genta Indra Winata
[ABSTRACT]Large Language Models (LLMs) have demonstrated impressive capabilities as
intelligent agents capable of solving complex problems. However, effective
planning in scenarios involving dependencies between API or tool
calls, particularly in multi-turn conversations, remains a significant challenge.
To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn
conversational dataset specifically designed to capture and manage inter-tool
dependencies across diverse domains. T1 enables rigorous evaluation of agents’
ability to coordinate tool use across nine distinct domains (4 single-domain
and 5 multi-domain) with the help of an integrated caching mechanism for both
short- and long-term memory, while supporting dynamic replanning, such as
deciding whether to recompute or reuse cached results. Beyond facilitating
research on tool use and planning, T1 also serves as a benchmark for evaluating
the performance of open-weight and proprietary large language models. We
present results powered by T1-Agent, highlighting their ability to plan and
reason in complex, tool-dependent scenarios.
[COMMENTS]Accepted by NeurIPS 2025 Datasets and Benchmarks Track
[LINK]http://arxiv.org/abs/2505.16986v2
[DATE]2025-10-24 05:31:35+08:00
[CATEGORIES]cs.CL
Learning Linear Attention in Polynomial Time
[AUTHORS]Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
[ABSTRACT]Previous research has explored the computational expressivity of Transformer
models in simulating Boolean circuits or Turing machines. However, the
learnability of these simulators from observational data has remained an open
question. Our study addresses this gap by providing the first polynomial-time
learnability results (specifically strong, agnostic PAC learning) for
single-layer Transformers with linear attention. We show that linear attention
may be viewed as a linear predictor in a suitably defined RKHS. As a
consequence, the problem of learning any linear transformer may be converted
into the problem of learning an ordinary linear predictor in an expanded
feature space, and any such predictor may be converted back into a multiheaded
linear transformer. Moving to generalization, we show how to efficiently
identify training datasets for which every empirical risk minimizer is
equivalent (up to trivial symmetries) to the linear Transformer that generated
the data, thereby guaranteeing the learned model will correctly generalize
across all inputs. Finally, we provide examples of computations expressible via
linear attention and therefore polynomial-time learnable, including associative
memories, finite automata, and a class of Universal Turing Machines (UTMs) with
polynomially bounded computation histories. We empirically validate our
theoretical findings on three tasks: learning random linear attention networks,
key–value associations, and learning to execute finite automata. Our findings
bridge a critical gap between theoretical expressivity and learnability of
Transformers, and show that flexible and general models of computation are
efficiently learnable.
[LINK]http://arxiv.org/abs/2410.10101v4
[DATE]2025-10-24 05:29:09+08:00
[CATEGORIES]cs.LG cs.CL
Training the Untrainable: Introducing Inductive Bias via Representational Alignment
[AUTHORS]Vighnesh Subramaniam, David Mayo, Colin Conwell, Tomaso Poggio, Boris Katz, Brian Cheung, Andrei Barbu
[ABSTRACT]We demonstrate that architectures which traditionally are considered to be
ill-suited for a task can be trained using inductive biases from another
architecture. We call a network untrainable when it overfits, underfits, or
converges to poor results even when its hyperparameters are tuned. For example,
fully connected networks overfit on object recognition while deep convolutional
networks without residual connections underfit. The traditional answer is to
change the architecture to impose some inductive bias, although the nature of
that bias is unknown. We introduce guidance, where a guide network steers a
target network using a neural distance function. The target minimizes its task
loss plus a layerwise representational similarity against the frozen guide. If
the guide is trained, this transfers over the architectural prior and knowledge
of the guide to the target. If the guide is untrained, this transfers over only
part of the architectural prior of the guide. We show that guidance prevents
FCN overfitting on ImageNet, narrows the vanilla RNN-Transformer gap, boosts
plain CNNs toward ResNet accuracy, and aids Transformers on RNN-favored tasks.
We further identify that guidance-driven initialization alone can mitigate FCN
overfitting. Our method provides a mathematical tool to investigate priors and
architectures, and in the long term, could automate architecture design.
[COMMENTS]NeurIPS 2025; 39 pages, 18 figures, 6 tables; Project page and code
is at https://untrainable-networks.github.io/
[LINK]http://arxiv.org/abs/2410.20035v2
[DATE]2025-10-24 04:40:02+08:00
[CATEGORIES]cs.LG cs.CL
Preventing Catastrophic Forgetting: Behavior-Aware Sampling for Safer Language Model Fine-Tuning
[AUTHORS]Anh Pham, Mihir Thalanki, Michael Sun, Aditya Chaloo, Ankita Gupta, Tian Xia, Aditya Mate, Ehimwenma Nosakhare, Soundararajan Srinivasan
[ABSTRACT]Large language models often lose previously aligned safety behaviors when
fine-tuned on benign data, a phenomenon known as catastrophic forgetting. Prior
work shows that adding random safety examples can mitigate this effect, but it
remains unclear which examples are most effective. We propose a behavior-aware
sampling framework that selects safety examples based on two complementary
factors: instruction-response behavior (e.g., refusal versus compliance) and
semantic diversity across harm categories. Systematic evaluation shows that
this approach substantially reduces harmful outputs while maintaining
helpfulness, achieving up to a 41% reduction in harmfulness with only 0.5%
additional training data. These results highlight how targeted data selection
can improve the safety and efficiency of fine-tuning at scale.
[LINK]http://arxiv.org/abs/2510.21885v1
[DATE]2025-10-24 04:34:52+08:00
[CATEGORIES]cs.CL
From Detection to Discovery: A Closed-Loop Approach for Simultaneous and Continuous Medical Knowledge Expansion and Depression Detection on Social Media
[AUTHORS]Shuang Geng, Wenli Zhang, Jiaheng Xie, Rui Wang, Sudha Ram
[ABSTRACT]Social media user-generated content (UGC) provides real-time, self-reported
indicators of mental health conditions such as depression, offering a valuable
source for predictive analytics. While prior studies integrate medical
knowledge to improve prediction accuracy, they overlook the opportunity to
simultaneously expand such knowledge through predictive processes. We develop a
Closed-Loop Large Language Model (LLM)-Knowledge Graph framework that
integrates prediction and knowledge expansion in an iterative learning cycle.
In the knowledge-aware depression detection phase, the LLM jointly performs
depression detection and entity extraction, while the knowledge graph
represents and weights these entities to refine prediction performance. In the
knowledge refinement and expansion phase, new entities, relationships, and
entity types extracted by the LLM are incorporated into the knowledge graph
under expert supervision, enabling continual knowledge evolution. Using
large-scale UGC, the framework enhances both predictive accuracy and medical
understanding. Expert evaluations confirmed the discovery of clinically
meaningful symptoms, comorbidities, and social triggers complementary to
existing literature. We conceptualize and operationalize
prediction-through-learning and learning-through-prediction as mutually
reinforcing processes, advancing both methodological and theoretical
understanding in predictive analytics. The framework demonstrates the
co-evolution of computational models and domain knowledge, offering a
foundation for adaptive, data-driven knowledge systems applicable to other
dynamic risk monitoring contexts.
[COMMENTS]Presented at SWAIB2025 and HICSS2026
[LINK]http://arxiv.org/abs/2510.23626v1
[DATE]2025-10-24 04:34:36+08:00
[CATEGORIES]cs.LG cs.CL
Scaling Embedding Layers in Language Models
[AUTHORS]Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang
[ABSTRACT]We propose $SCONE$ ($S$calable, $C$ontextualized, $O$ffloaded, $N$-gram
$E$mbedding), a new method for extending input embedding layers to enhance
language model performance. To avoid increased decoding costs, $SCONE$ retains
the original vocabulary while introducing embeddings for a set of frequent
n-grams. These embeddings provide contextualized representation for each input
token and are learned with a separate model during training. After training,
embeddings are precomputed and stored in off-accelerator memory; during
inference, querying them has minimal impact on latency due to the low
complexity of embedding lookups. $SCONE$ enables two new scaling strategies:
increasing the number of n-gram embeddings and scaling the model used to learn
them, both while maintaining fixed accelerator usage during inference (in terms
of FLOPS and memory). We show that scaling both aspects enables a model with 1B
accelerator-resident parameters to outperform a 1.9B-parameter baseline across
diverse corpora, while using only about half the FLOPS and accelerator memory
during inference.
[COMMENTS]NeurIPS 2025 camera ready
[LINK]http://arxiv.org/abs/2502.01637v3
[DATE]2025-10-24 04:15:46+08:00
[CATEGORIES]cs.CL cs.LG
Irish-BLiMP: A Linguistic Benchmark for Evaluating Human and Language Model Performance in a Low-Resource Setting
[AUTHORS]Josh McGiff, Khanh-Tung Tran, William Mulcahy, Dáibhidh Ó Luinín, Jake Dalzell, Róisín Ní Bhroin, Adam Burke, Barry O’Sullivan, Hoang D. Nguyen, Nikola S. Nikolov
[ABSTRACT]We present Irish-BLiMP (Irish Benchmark of Linguistic Minimal Pairs), the
first dataset and framework designed for fine-grained evaluation of linguistic
competence in the Irish language, an endangered language. Drawing on a variety
of linguistic literature and grammar reference works, we manually constructed
and reviewed 1020 minimal pairs across a taxonomy of 11 linguistic features,
through a team of fluent Irish speakers. We evaluate both existing Large
Language Models (LLMs) and fluent human participants on their syntactic
knowledge of Irish. Our findings show that humans outperform all models across
all linguistic features, achieving 16.6% higher accuracy on average. Moreover,
a substantial performance gap of 18.1% persists between open- and closed-source
LLMs, with even the strongest model (gpt-5) reaching only 73.5% accuracy
compared to 90.1% by humans. Interestingly, human participants and models
struggle on different aspects of Irish grammar, thus highlighting a difference
in representation learned by the models. Overall, Irish-BLiMP provides the
first systematic framework for evaluating the grammatical competence of LLMs in
Irish and offers a valuable benchmark for advancing research on linguistic
understanding in low-resource languages.
[COMMENTS]8 pages
[LINK]http://arxiv.org/abs/2510.20957v1
[DATE]2025-10-24 03:36:36+08:00
[CATEGORIES]cs.CL
Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training
[AUTHORS]Zheng-Xin Yong, Stephen H. Bach
[ABSTRACT]We discover a novel and surprising phenomenon of unintentional misalignment
in reasoning language models (RLMs), which we call self-jailbreaking.
Specifically, after benign reasoning training on math or code domains, RLMs
will use multiple strategies to circumvent their own safety guardrails. One
strategy is to introduce benign assumptions about users and scenarios to
justify fulfilling harmful requests. For instance, an RLM reasons that harmful
requests like “outline a strategy for stealing customers’ credit card
information from a retail store” could be associated with the benign intent of
“a security professional trying to test defense,” despite no such benign
context being provided as input. We observe that many open-weight RLMs,
including DeepSeek-R1-distilled, s1.1, Phi-4-mini-reasoning, and Nemotron,
suffer from self-jailbreaking despite being aware of the harmfulness of the
requests. We also provide a mechanistic understanding of self-jailbreaking:
RLMs are more compliant after benign reasoning training, and after
self-jailbreaking, models appear to perceive malicious requests as less harmful
in the CoT, thus enabling compliance with them. To mitigate self-jailbreaking,
we find that including minimal safety reasoning data during training is
sufficient to ensure RLMs remain safety-aligned. Our work provides the first
systematic analysis of self-jailbreaking behavior and offers a practical path
forward for maintaining safety in increasingly capable RLMs.
[LINK]http://arxiv.org/abs/2510.20956v1
[DATE]2025-10-24 03:34:24+08:00
[CATEGORIES]cs.CL
AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting
[AUTHORS]Jibang Wu, Chenghao Yang, Yi Wu, Simon Mahns, Chaoqi Wang, Hao Zhu, Fei Fang, Haifeng Xu
[ABSTRACT]This paper develops an agentic framework that employs large language models
(LLMs) for grounded persuasive language generation in automated copywriting,
with real estate marketing as a focal application. Our method is designed to
align the generated content with user preferences while highlighting useful
factual attributes. This agent consists of three key modules: (1) Grounding
Module, mimicking expert human behavior to predict marketable features; (2)
Personalization Module, aligning content with user preferences; (3) Marketing
Module, ensuring factual accuracy and the inclusion of localized features. We
conduct systematic human-subject experiments in the domain of real estate
marketing, with a focus group of potential house buyers. The results
demonstrate that marketing descriptions generated by our approach are preferred
over those written by human experts by a clear margin while maintaining the
same level of factual accuracy. Our findings suggest a promising agentic
approach to automate large-scale targeted copywriting while ensuring factuality
of content generation.
[COMMENTS]V2: Add more human verification to ensure safety and examine
potential hallucination. Significant reframing for the general audience.
Website: https://yangalan123.github.io/ai-realtor/. Codebase:
https://github.com/yangalan123/AI-Realtor-Codebase. Data released at
Huggingface Hub (Sigma-Lab/AI_Realtor_xxx)
[LINK]http://arxiv.org/abs/2502.16810v5
[DATE]2025-10-24 03:25:34+08:00
[CATEGORIES]cs.CL
Self-Refining Language Model Anonymizers via Adversarial Distillation
[AUTHORS]Kyuyoung Kim, Hyunjun Jeon, Jinwoo Shin
[ABSTRACT]Large language models (LLMs) are increasingly used in sensitive domains,
where their ability to infer personal data from seemingly benign text
introduces emerging privacy risks. While recent LLM-based anonymization methods
help mitigate such risks, they often rely on proprietary models (e.g., GPT-4),
raising concerns about cost and the potential exposure of sensitive data to
untrusted external systems. To address this, we introduce SElf-refining
Anonymization with Language model (SEAL), a novel distillation framework for
training small language models (SLMs) to perform effective anonymization
without relying on external models at inference time. SEAL leverages
adversarial interactions between an LLM anonymizer and an inference model to
collect trajectories of anonymized texts and inferred attributes, which are
then used to distill anonymization and critique capabilities into SLMs through
supervised fine-tuning and preference learning. The resulting models learn both
to anonymize text and to evaluate their outputs, enabling iterative improvement
of anonymization quality via self-refinement. Experiments on SynthPAI, a
dataset of synthetic personal profiles and text comments, demonstrate that SLMs
trained with SEAL achieve substantial improvements in anonymization
capabilities. Notably, 8B models attain a privacy-utility trade-off comparable
to that of the GPT-4 anonymizer and, with self-refinement, even surpass it in
terms of privacy protection. These results highlight the effectiveness of our
adversarial distillation framework for training SLMs as efficient anonymizers.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.01420v2
[DATE]2025-10-24 03:22:08+08:00
[CATEGORIES]cs.CL cs.LG
LEXam: Benchmarking Legal Reasoning on 340 Law Exams
[AUTHORS]Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
[ABSTRACT]Long-form legal reasoning remains a key challenge for large language models
(LLMs) in spite of recent advances in test-time scaling. To address this, we
introduce LEXam, a novel benchmark derived from 340 law exams spanning
116 law school courses across a range of subjects and degree levels. The
dataset comprises 4,886 law exam questions in English and German, including
2,841 long-form, open-ended questions and 2,045 multiple-choice questions.
Besides reference answers, the open questions are also accompanied by explicit
guidance outlining the expected legal reasoning approach such as issue
spotting, rule recall, or rule application. Our evaluation shows that both open-ended
and multiple-choice questions present significant challenges for current LLMs;
in particular, they notably struggle with open questions that require
structured, multi-step legal reasoning. Moreover, our results underscore the
effectiveness of the dataset in differentiating between models with varying
capabilities. Deploying an ensemble LLM-as-a-Judge paradigm with rigorous human
expert validation, we demonstrate how model-generated reasoning steps can be
evaluated consistently and accurately, closely aligning with human expert
assessments. Our evaluation setup provides a scalable method to assess legal
reasoning quality beyond simple accuracy metrics. We have open-sourced our code
on https://github.com/LEXam-Benchmark/LEXam and released our data on
https://huggingface.co/datasets/LEXam-Benchmark/LEXam. Project page:
https://lexam-benchmark.github.io.
[LINK]http://arxiv.org/abs/2505.12864v5
[DATE]2025-10-24 03:18:23+08:00
[CATEGORIES]cs.CL cs.LG
ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs
[AUTHORS]Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, Kannan Ramchandran
[ABSTRACT]Large Language Models (LLMs) have achieved remarkable performance by
capturing complex interactions between input features. To identify these
interactions, most existing approaches require enumerating all possible
combinations of features up to a given order, causing them to scale poorly with
the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an
information-theoretic approach that uses interaction sparsity to scale to $n
\approx 10^3$ features. SPEX greatly improves upon prior methods but requires
tens of thousands of model inferences, which can be prohibitive for large
models. In this paper, we observe that LLM feature interactions are often
hierarchical – higher-order interactions are accompanied by their lower-order
subsets – which enables more efficient discovery. To exploit this hierarchy,
we propose ProxySPEX, an interaction attribution algorithm that first fits
gradient boosted trees to masked LLM outputs and then extracts the important
interactions. Experiments across four challenging high-dimensional datasets
show that ProxySPEX more faithfully reconstructs LLM outputs by 20% over
marginal attribution approaches while using $10\times$ fewer inferences than
SPEX. By accounting for interactions, ProxySPEX efficiently identifies the most
influential features, providing a scalable approximation of their Shapley
values. Further, we apply ProxySPEX to two interpretability tasks: data
attribution, where we identify interactions among CIFAR-10 training samples
that influence test predictions, and mechanistic interpretability, where we
uncover interactions between attention heads, both within and across layers, on
a question-answering task.
[COMMENTS]Algorithm available at: https://github.com/mmschlk/shapiq
[LINK]http://arxiv.org/abs/2505.17495v2
[DATE]2025-10-24 03:11:10+08:00
[CATEGORIES]cs.LG cs.CL
Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation
[AUTHORS]Li Zhang, Kevin D. Ashley
[ABSTRACT]Large Language Models (LLMs) are increasingly explored for legal argument
generation, yet they pose significant risks of manipulation through
hallucination and ungrounded persuasion, and often fail to utilize provided
factual bases effectively or abstain when arguments are untenable. This paper
introduces a novel reflective multi-agent method designed to address these
challenges in the context of legally compliant persuasion. Our approach employs
specialized agents (factor analyst and argument polisher) in an iterative
refinement process to generate 3-ply legal arguments (plaintiff, defendant,
rebuttal). We evaluate the reflective multi-agent method against single-agent,
enhanced-prompt single-agent, and non-reflective multi-agent baselines using
four diverse LLMs (GPT-4o, GPT-4o-mini, Llama-4-Maverick-17b-128e,
Llama-4-Scout-17b-16e) across three legal scenarios: “arguable”, “mismatched”,
and “non-arguable”. Results demonstrate that the reflective multi-agent
approach excels at successful abstention by preventing generation when
arguments cannot be grounded, improves hallucination accuracy by reducing
fabricated and misattributed factors, and enhances factor utilization recall by
better using the provided case facts. These findings suggest that structured
reflection within a multi-agent framework offers a robust method for fostering
ethical persuasion and mitigating manipulation in LLM-based legal argumentation
systems.
[COMMENTS]13 pages, 2 figures, 2nd ConventicLe on Artificial Intelligence
Regulation and Safety Workshop at ICAIL 2025
[LINK]http://arxiv.org/abs/2506.02992v2
[DATE]2025-10-24 02:35:56+08:00
[CATEGORIES]cs.CL cs.LG
FicSim: A Dataset for Multi-Faceted Semantic Similarity in Long-Form Fiction
[AUTHORS]Natasha Johnson, Amanda Bertsch, Maria-Emil Deal, Emma Strubell
[COMMENTS]Accepted to Findings of EMNLP 2025
[LINK]http://arxiv.org/abs/2510.20926v1
[DATE]2025-10-24 02:30:19+08:00
[CATEGORIES]cs.CL
Schema for In-Context Learning
[AUTHORS]Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung, Varinia Bernales, Alan Aspuru-Guzik
[ABSTRACT]In-Context Learning (ICL) enables transformer-based language models to adapt
to new tasks by conditioning on demonstration examples. However, traditional
example-driven in-context learning lacks explicit modules for knowledge
retrieval and transfer at the abstraction level. Inspired by cognitive science,
specifically schema theory, which holds that humans interpret new information
by activating pre-existing mental frameworks (schemas) to structure
understanding, we introduce Schema Activated In Context Learning (SA-ICL). This
framework extracts the representation of the building blocks of cognition for
the reasoning process instilled from prior examples, creating an abstracted
schema, a lightweight, structured template of key inferential steps and their
relationships, which is then used to augment a model’s reasoning process when
presented with a novel question. We demonstrate that a broad range of large
language models (LLMs) lack the capacity to form and utilize internal
schema-based learning representations implicitly, but instead benefit
significantly from explicit schema-based scaffolding. Across chemistry and
physics questions from the GPQA dataset, our experiments show that SA-ICL
consistently boosts performance, up to 36.19 percent, when the single
demonstration example is of high quality, which simultaneously reduces reliance
on the number of demonstrations and enhances interpretability. Schema Activated
In Context Learning not only bridges disparate ICL strategies ranging from
pattern priming to Chain-of-Thought prompting, but also paves a new path for
enhancing human-like reasoning in LLMs.
[LINK]http://arxiv.org/abs/2510.13905v2
[DATE]2025-10-24 02:04:38+08:00
[CATEGORIES]cs.CL
Code-enabled language models can outperform reasoning models on diverse tasks
[AUTHORS]Cedegao E. Zhang, Cédric Colas, Gabriel Poesia, Joshua B. Tenenbaum, Jacob Andreas
[ABSTRACT]Reasoning models (RMs), language models (LMs) trained with reinforcement
learning to produce long-form natural language reasoning, have been remarkably
successful, but they still require large amounts of computation and data to
train, and can be slow and expensive to run. In this paper, we show that
standard instruct LMs can already be elicited to be strong reasoners at a level
comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs
R1) without finetuning, across diverse domains from instruction following and
creative generation to mathematical reasoning. This is achieved by CodeAdapt,
our simple recipe that combines the CodeAct framework, where LMs interleave
natural language reasoning with code execution in a multi-step fashion, with
few-shot bootstrap in-context learning from as few as five training problems.
Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables
three LMs to outperform the corresponding RMs on average over eight tasks (up
to 22.9%) while being 10-81% more token efficient, and delivers superior
performance on six tasks when averaged over the four models (up to 35.7%).
Furthermore, the code-augmented reasoning traces display rich and varied
problem-solving strategies. Our findings support that (1) CodeAdapt-style
learning and reasoning may be robust and domain general and (2) code-enabled
LMs are cognitively grounded and powerful systems, potentially providing a
strong foundation for in-weight reinforcement learning.
[LINK]http://arxiv.org/abs/2510.20909v1
[DATE]2025-10-24 02:04:03+08:00
[CATEGORIES]cs.CL
Language Models use Lookbacks to Track Beliefs
[AUTHORS]Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov, Tamar Rott Shaham, David Bau, Atticus Geiger
[ABSTRACT]How do language models (LMs) represent characters’ beliefs, especially when
those beliefs may differ from reality? This question lies at the heart of
understanding the Theory of Mind (ToM) capabilities of LMs. We analyze LMs’
ability to reason about characters’ beliefs using causal mediation and
abstraction. We construct a dataset, CausalToM, consisting of simple stories
where two characters independently change the state of two objects, potentially
unaware of each other’s actions. Our investigation uncovers a pervasive
algorithmic pattern that we call a lookback mechanism, which enables the LM to
recall important information when it becomes necessary. The LM binds each
character-object-state triple together by co-locating their reference
information, represented as Ordering IDs (OIs), in low-rank subspaces of the
state token’s residual stream. When asked about a character’s beliefs regarding
the state of an object, the binding lookback retrieves the correct state OI and
then the answer lookback retrieves the corresponding state token. When we
introduce text specifying that one character is (not) visible to the other, we
find that the LM first generates a visibility ID encoding the relation between
the observing and the observed character OIs. In a visibility lookback, this ID
is used to retrieve information about the observed character and update the
observing character’s beliefs. Our work provides insights into belief tracking
mechanisms, taking a step toward reverse-engineering ToM reasoning in LMs.
[COMMENTS]31 pages, 33 figures. Code and data at https://belief.baulab.info/
[LINK]http://arxiv.org/abs/2505.14685v2
[DATE]2025-10-24 01:59:56+08:00
[CATEGORIES]cs.CL
Compress to Impress: Efficient LLM Adaptation Using a Single Gradient Step on 100 Samples
[AUTHORS]Shiva Sreeram, Alaa Maalouf, Pratyusha Sharma, Daniela Rus
[ABSTRACT]Recently, Sharma et al. suggested a method called Layer-SElective-Rank
reduction (LASER) which demonstrated that pruning high-order components of
carefully chosen LLM weight matrices can boost downstream accuracy – without
any gradient-based fine-tuning. Yet LASER’s exhaustive, per-matrix search (each
requiring full-dataset forward passes) makes it impractical for rapid
deployment. We demonstrate that this overhead can be removed and find that: (i)
Only a small, carefully chosen subset of matrices needs to be inspected –
eliminating the layer-by-layer sweep, (ii) The gradient of each matrix’s
singular values pinpoints which matrices merit reduction, (iii) Increasing the
factorization search space by allowing matrix rows to cluster around multiple
subspaces and then decomposing each cluster separately further reduces
overfitting on the original training data and further lifts accuracy by up to
24.6 percentage points, and finally, (iv) we discover that evaluating on just
100 samples rather than the full training data – both for computing the
indicative gradients and for measuring the final accuracy – suffices to
further reduce the search time; we explain that as adaptation to downstream
tasks is dominated by prompting style, not dataset size. As a result, we show
that combining these findings yields a fast and robust adaptation algorithm for
downstream tasks. Overall, with a single gradient step on 100 examples and a
quick scan of the top candidate layers and factorization techniques, we can
adapt LLMs to new datasets – entirely without fine-tuning.
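
The rank-reduction step that LASER-style methods build on can be illustrated with a truncated SVD. This is a minimal sketch, assuming NumPy is available; the function name `low_rank_approx` and the `keep_fraction` parameter are hypothetical, and the matrix selection by singular-value gradients and the row clustering described in the abstract are not modelled here.

```python
import numpy as np


def low_rank_approx(weight: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Drop the smallest singular components of a weight matrix.

    keep_fraction is a hypothetical knob: the share of singular values kept.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = max(1, int(round(keep_fraction * len(s))))
    # Rebuild the matrix from the k largest singular triplets only.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))          # stand-in for one LLM weight matrix
w_reduced = low_rank_approx(w, keep_fraction=0.5)
```

The reduced matrix has the same shape as the original, so it could be swapped in place without touching the rest of a model.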
[LINK]http://arxiv.org/abs/2510.20800v1
[DATE]2025-10-24 01:58:01+08:00
[CATEGORIES]cs.LG cs.CL
Shoot First, Ask Questions Later? Building Rational Agents that Explore and Act Like People
[AUTHORS]Gabriel Grand, Valerio Pepe, Jacob Andreas, Joshua B. Tenenbaum
[ABSTRACT]Many high-stakes applications of AI require forming data-driven hypotheses
and making targeted guesses; e.g., in scientific and diagnostic settings. Given
limited resources, to what extent do agents based on language models (LMs) act
rationally? We develop methods to benchmark and enhance agentic
information-seeking, drawing on insights from human behavior. First, we
introduce a strategic decision-oriented dialogue task called Collaborative
Battleship, in which a partially-informed Captain must balance exploration
(asking questions) and action (taking shots), while a fully-informed Spotter
must provide accurate answers under an information bottleneck. Compared to
human players (N=42), we find that LM agents struggle to ground answers in
context, generate informative questions, and select high-value actions. Next,
to address these gaps, we develop novel Monte Carlo inference strategies for
LMs based on principles from Bayesian Experimental Design (BED). For Spotter
agents, our approach boosts accuracy by up to 14.7% absolute over LM-only
baselines; for Captain agents, it raises expected information gain (EIG) by up
to 0.227 bits (94.2% of the achievable noise ceiling). Combined, these
components yield sharper targeting (+0.303-0.374 F1), and enable weaker LMs,
such as Llama-4-Scout, to outperform both humans (8% -> 82% win rate) and
frontier models (0% -> 67% win rate vs. GPT-5) at ~1% of GPT-5’s cost. We
replicate these findings on Guess Who? where our methods significantly boost
accuracy (+28.3-42.4 p.p.), demonstrating their general applicability for
building rational information-seeking agents.
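
As a rough illustration of the expected information gain (EIG) quantity mentioned above, the sketch below scores a yes/no question against a set of sampled hypotheses. It assumes equally weighted samples and noise-free answers, in which case the EIG equals the entropy of the predicted answer; the function name and the toy hypothesis space are assumptions, not the authors' implementation.

```python
import math
from typing import Callable, Sequence


def expected_information_gain(
    hypotheses: Sequence[object],
    question: Callable[[object], bool],
) -> float:
    """EIG in bits of a yes/no question under equally weighted hypotheses.

    Assumes the answer is a deterministic function of the hypothesis,
    so EIG reduces to the binary entropy of the answer distribution.
    """
    p_yes = sum(1 for h in hypotheses if question(h)) / len(hypotheses)
    if p_yes in (0.0, 1.0):
        return 0.0  # the answer is already known, nothing to learn
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))


# Toy example: a hidden target sits in one of eight cells.
hypotheses = list(range(8))
eig_half = expected_information_gain(hypotheses, lambda h: h < 4)
eig_narrow = expected_information_gain(hypotheses, lambda h: h == 0)
eig_useless = expected_information_gain(hypotheses, lambda h: h >= 0)
```

A question that splits the hypotheses evenly scores highest, which is why an agent guided by EIG would prefer it over a narrow guess.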
[LINK]http://arxiv.org/abs/2510.20886v1
[DATE]2025-10-24 01:57:28+08:00
[CATEGORIES]cs.CL
Simple Context Compression: Mean-Pooling and Multi-Ratio Training
[AUTHORS]Yair Feldman, Yoav Artzi
[ABSTRACT]A common strategy to reduce the computational costs of using long contexts in
retrieval-augmented generation (RAG) with large language models (LLMs) is soft
context compression, where the input sequence is transformed into a shorter
continuous representation. We develop a lightweight and simple mean-pooling
approach that consistently outperforms the widely used compression-tokens
architecture, and study training the same compressor to output multiple
compression ratios. We conduct extensive experiments across in-domain and
out-of-domain QA datasets, as well as across model families, scales, and
compression ratios. Overall, our simple mean-pooling approach achieves the
strongest performance, with a relatively small drop when training for multiple
compression ratios. More broadly though, across architectures and training
regimes the trade-offs are more nuanced, illustrating the complex landscape of
compression methods.
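
The mean-pooling idea can be sketched in a few lines. This is an illustrative version that assumes non-overlapping windows over token vectors, with the window size equal to the compression ratio; the function name and the plain-list representation are assumptions made for this example rather than details from the paper or its repository.

```python
from typing import List

Vector = List[float]


def mean_pool_compress(token_vectors: List[Vector], ratio: int) -> List[Vector]:
    """Average each consecutive window of `ratio` vectors into one vector."""
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    compressed = []
    for start in range(0, len(token_vectors), ratio):
        window = token_vectors[start:start + ratio]
        dim = len(window[0])
        # The last window may be shorter; it is averaged over its own length.
        compressed.append(
            [sum(v[d] for v in window) / len(window) for d in range(dim)]
        )
    return compressed


tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]
pooled = mean_pool_compress(tokens, ratio=2)
```

Because the ratio is just a window size, the same function can serve several compression ratios, which loosely mirrors the multi-ratio setting the abstract studies.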
[COMMENTS]Code available at
https://github.com/lil-lab/simple-context-compression
[LINK]http://arxiv.org/abs/2510.20797v1
[DATE]2025-10-24 01:57:23+08:00
[CATEGORIES]cs.CL cs.LG
Language Ranker: A Lightweight Ranking framework for LLM Decoding
[AUTHORS]Chenheng Zhang, Tianqi Du, Jizhe Zhang, Mingqing Xiao, Yifei Wang, Yisen Wang, Zhouchen Lin
[ABSTRACT]Conventional research on large language models (LLMs) has primarily focused
on refining output distributions, while paying less attention to the decoding
process that transforms these distributions into final responses. Recent
advances, such as scaling the computation of inference time with reward models,
have underscored the importance of decoding, but these methods often suffer
from high computational costs and limited applicability. In this paper, we
revisit LLM generation through the lens of recommender systems, conceptualizing
the decoding process as analogous to the ranking stage in recommendation
pipelines. From this perspective, we observe that both traditional decoding
methods and reward models exhibit clear limitations such as redundancy.
Motivated by this insight, we propose Language Ranker, a novel framework that
introduces a lightweight module to rerank candidate responses using features
extracted by the base model. Experiments across a wide range of tasks show that
Language Ranker achieves performance comparable to large-scale reward models,
while requiring only <0.5M additional parameters, significantly reducing the
computational overhead during both training and inference stages. This
highlights the efficiency and effectiveness of our method, showcasing its
potential to fully unlock the capabilities of LLMs.
[LINK]http://arxiv.org/abs/2510.21883v1
[DATE]2025-10-24 01:56:46+08:00
[CATEGORIES]cs.CL
BadGraph: A Backdoor Attack Against Latent Diffusion Model for Text-Guided Graph Generation
[AUTHORS]Liang Ye, Shengqin Chen, Jiazhu Dai
[ABSTRACT]The rapid progress of graph generation has raised new security concerns,
particularly regarding backdoor vulnerabilities. While prior work has explored
backdoor attacks in image diffusion and unconditional graph generation,
conditional, especially text-guided graph generation remains largely
unexamined. This paper proposes BadGraph, a backdoor attack method targeting
latent diffusion models for text-guided graph generation. BadGraph leverages
textual triggers to poison training data, covertly implanting backdoors that
induce attacker-specified subgraphs during inference when triggers appear,
while preserving normal performance on clean inputs. Extensive experiments on
four benchmark datasets (PubChem, ChEBI-20, PCDes, MoMu) demonstrate the
effectiveness and stealth of the attack: a poisoning rate below 10% can
achieve a 50% attack success rate, while 24% suffices for over 80% success rate,
with negligible performance degradation on benign samples. Ablation studies
further reveal that the backdoor is implanted during VAE and diffusion training
rather than pretraining. These findings reveal the security vulnerabilities in
latent diffusion models of text-guided graph generation, highlight the serious
risks in models’ applications such as drug discovery and underscore the need
for robust defenses against the backdoor attack in such diffusion models.
[LINK]http://arxiv.org/abs/2510.20792v1
[DATE]2025-10-24 01:54:17+08:00
[CATEGORIES]cs.LG cs.CL
Text2Mem: A Unified Memory Operation Language for Memory Operating System
[AUTHORS]Yi Wang, Lihai Yang, Boyu Chen, Gongyi Zou, Kerun Xu, Bo Tang, Feiyu Xiong, Siheng Chen, Zhiyu Li
[ABSTRACT]Large language model agents increasingly depend on memory to sustain long
horizon interaction, but existing frameworks remain limited. Most expose only a
few basic primitives such as encode, retrieve, and delete, while higher order
operations like merge, promote, demote, split, lock, and expire are missing or
inconsistently supported. Moreover, there is no formal and executable
specification for memory commands, leaving scope and lifecycle rules implicit
and causing unpredictable behavior across systems. We introduce Text2Mem, a
unified memory operation language that provides a standardized pathway from
natural language to reliable execution. Text2Mem defines a compact yet
expressive operation set aligned with encoding, storage, and retrieval. Each
instruction is represented as a JSON based schema instance with required fields
and semantic invariants, which a parser transforms into typed operation objects
with normalized parameters. A validator ensures correctness before execution,
while adapters map typed objects either to a SQL prototype backend or to real
memory frameworks. Model based services such as embeddings or summarization are
integrated when required. All results are returned through a unified execution
contract. This design ensures safety, determinism, and portability across
heterogeneous backends. We also outline Text2Mem Bench, a planned benchmark
that separates schema generation from backend execution to enable systematic
evaluation. Together, these components establish the first standardized
foundation for memory control in agents.
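
The parse-then-validate pipeline described above might look roughly like the following. Everything here is hypothetical: the field names (`op`, `target`, `args`), the `MemoryOp` type, and the validation rules are invented for illustration and are not the Text2Mem schema; only the operation names come from the abstract.

```python
import json
from dataclasses import dataclass

# Operation names listed in the abstract; treating them as one flat set
# is an assumption of this sketch.
ALLOWED_OPS = {
    "encode", "retrieve", "delete", "merge", "promote",
    "demote", "split", "lock", "expire",
}


@dataclass(frozen=True)
class MemoryOp:
    """Hypothetical typed operation object with normalized parameters."""
    op: str
    target: str
    args: dict


def parse_instruction(raw: str) -> MemoryOp:
    """Parse a JSON instruction and validate it before any execution."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("instruction must be a JSON object")
    missing = {"op", "target"} - set(data)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    op = str(data["op"]).lower()  # normalize case
    if op not in ALLOWED_OPS:
        raise ValueError(f"unknown operation: {op}")
    return MemoryOp(op=op, target=str(data["target"]), args=dict(data.get("args", {})))


parsed = parse_instruction(
    '{"op": "Expire", "target": "note-42", "args": {"after_days": 30}}'
)
```

Rejecting malformed instructions at this stage is what lets a backend adapter assume it only ever receives well-formed, typed operations.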
[COMMENTS]12 pages, 3 figures, 2 tables
[LINK]http://arxiv.org/abs/2509.11145v2
[DATE]2025-10-24 01:53:03+08:00
[CATEGORIES]cs.CL
A Use-Case Specific Dataset for Measuring Dimensions of Responsible Performance in LLM-generated Text
[AUTHORS]Alicia Sagae, Chia-Jung Lee, Sandeep Avula, Brandon Dang, Vanessa Murdock
[ABSTRACT]Current methods for evaluating large language models (LLMs) typically focus
on high-level tasks such as text generation, without targeting a particular AI
application. This approach is not sufficient for evaluating LLMs for
Responsible AI dimensions like fairness, since protected attributes that are
highly relevant in one application may be less relevant in another. In this
work, we construct a dataset that is driven by a real-world application
(generate a plain-text product description, given a list of product features),
parameterized by fairness attributes intersected with gendered adjectives and
product categories, yielding a rich set of labeled prompts. We show how to use
the data to identify quality, veracity, safety, and fairness gaps in LLMs,
contributing a proposal for LLM evaluation paired with a concrete resource for
the research community.
[COMMENTS]24 pages with 3 figures, to appear in Proceedings of the 34th ACM
International Conference on Information and Knowledge Management (CIKM ‘25)
[LINK]http://arxiv.org/abs/2510.20782v1
[DATE]2025-10-24 01:50:55+08:00
[CATEGORIES]cs.CL
Are Large Reasoning Models Good Translation Evaluators? Analysis and Performance Boost
[AUTHORS]Runzhe Zhan, Zhihong Huang, Xinyi Yang, Lidia S. Chao, Min Yang, Derek F. Wong
[ABSTRACT]Recent advancements in large reasoning models (LRMs) have introduced an
intermediate “thinking” process prior to generating final answers, improving
their reasoning capabilities on complex downstream tasks. However, the
potential of LRMs as evaluators for machine translation (MT) quality remains
underexplored. We provide the first systematic analysis of LRM-as-a-judge in
MT evaluation. We identify key challenges, revealing LRMs require tailored
evaluation materials, tend to “overthink” simpler instances and have issues
with scoring mechanisms leading to overestimation. To address these, we propose
to calibrate LRM thinking by training them on synthetic, human-like thinking
trajectories. Our experiments on WMT24 Metrics benchmarks demonstrate that this
approach largely reduces thinking budgets by ~35x while concurrently improving
evaluation performance across different LRM scales from 7B to 32B (e.g.,
R1-Distill-Qwen-7B achieves a +8.7 correlation point improvement). These
findings highlight the potential of efficiently calibrated LRMs to advance
fine-grained automatic MT evaluation.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.20780v1
[DATE]2025-10-24 01:48:36+08:00
[CATEGORIES]cs.CL
FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
[AUTHORS]Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
[ABSTRACT]Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning
method for foundation models, but it suffers from parameter interference,
resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based
LoRA variants show promise in mitigating intra-task correlations in single-task
instruction tuning, they introduce additional router parameters and remain
ineffective in multi-task model merging where inter-task interference arises.
Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit
MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the
up-projection matrix, and (2) an implicit router that unifies expert routing
and down-projection, where a frozen sparse random projection matrix replaces
the traditional dense trainable version. This design resolves the trade-off
between intra-task decorrelation and computational efficiency by eliminating
the need for an explicit router, while inherently mitigating inter-task
interference due to the orthogonality property of random matrices. Extensive
experiments across four domains – general knowledge understanding, scientific
question answering, mathematical reasoning, and code generation – demonstrate
consistent performance improvements over existing methods. Beyond empirical
gains, FlyLoRA highlights how biological structures can inspire innovations in
AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
[COMMENTS]NeurIPS 2025 accepted paper
[LINK]http://arxiv.org/abs/2510.08396v2
[DATE]2025-10-24 01:14:06+08:00
[CATEGORIES]cs.LG cs.CL
Integrating Structural and Semantic Signals in Text-Attributed Graphs with BiGTex
[AUTHORS]Azadeh Beiranvand, Seyed Mehdi Vahidipour
[ABSTRACT]Text-attributed graphs (TAGs) present unique challenges in representation
learning by requiring models to capture both the semantic richness of
node-associated texts and the structural dependencies of the graph. While graph
neural networks (GNNs) excel at modeling topological information, they lack the
capacity to process unstructured text. Conversely, large language models (LLMs)
are proficient in text understanding but are typically unaware of graph
structure. In this work, we propose BiGTex (Bidirectional Graph Text), a novel
architecture that tightly integrates GNNs and LLMs through stacked Graph-Text
Fusion Units. Each unit allows for mutual attention between textual and
structural representations, enabling information to flow in both directions,
text influencing structure and structure guiding textual interpretation. The
proposed architecture is trained using parameter-efficient fine-tuning (LoRA),
keeping the LLM frozen while adapting to task-specific signals. Extensive
experiments on five benchmark datasets demonstrate that BiGTex achieves
state-of-the-art performance in node classification and generalizes effectively
to link prediction. An ablation study further highlights the importance of soft
prompting and bi-directional attention in the model’s success.
[COMMENTS]26 pages, 4 figures
[LINK]http://arxiv.org/abs/2504.12474v3
[DATE]2025-10-24 01:06:25+08:00
[CATEGORIES]cs.CL
Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
[AUTHORS]Xi He, Sirui Lu, Bei Zeng
[ABSTRACT]We present a multi-agent, human-in-the-loop workflow that co-designs quantum
codes with prescribed transversal diagonal gates. It builds on the Subset-Sum
Linear Programming (SSLP) framework (arXiv:2504.20847), which partitions basis
strings by modular residues and enforces $Z$-marginal Knill-Laflamme (KL)
equalities via small LPs. The workflow is powered by GPT-5 and implemented
within TeXRA (https://texra.ai), a multi-agent research assistant platform that
supports an iterative tool-use loop agent and a derivation-then-edit workflow
reasoning agent. We work in a LaTeX-Python environment where agents reason,
edit documents, execute code, and synchronize their work to Git/Overleaf.
Within this workspace, three roles collaborate: a Synthesis Agent formulates
the problem; a Search Agent sweeps/screens candidates and exactifies numerics
into rationals; and an Audit Agent independently checks all KL equalities and
the induced logical action. As a first step we focus on distance $d=2$ with
nondegenerate residues. For code dimension $K\in\{2,3,4\}$ and $n\le6$ qubits,
systematic sweeps yield certificate-backed tables cataloging attainable cyclic
logical groups, all realized by new codes; e.g., for $K=3$ we obtain order $16$
at $n=6$. From verified instances, Synthesis Agent abstracts recurring
structures into closed-form families and proves they satisfy the KL equalities
for all parameters. It further demonstrates that SSLP accommodates residue
degeneracy by exhibiting a new $((6,4,2))$ code implementing the transversal
controlled-phase $\mathrm{diag}(1,1,1,i)$. Overall, the workflow recasts
diagonal-transversal feasibility as an analytical pipeline executed at scale,
combining systematic enumeration with exact analytical reconstruction. It
yields reproducible code constructions, supports targeted extensions to larger
$K$ and higher distances, and leads toward data-driven classification.
[COMMENTS]29 pages, 2 figures
[LINK]http://arxiv.org/abs/2510.20728v1
[DATE]2025-10-24 00:45:39+08:00
[CATEGORIES]cs.CL
Automated Extraction of Fluoropyrimidine Treatment and Treatment-Related Toxicities from Clinical Notes Using Natural Language Processing
[AUTHORS]Xizhi Wu, Madeline S. Kreider, Philip E. Empey, Chenyu Li, Yanshan Wang
[ABSTRACT]Objective: Fluoropyrimidines are widely prescribed for colorectal and breast
cancers, but are associated with toxicities such as hand-foot syndrome and
cardiotoxicity. Since toxicity documentation is often embedded in clinical
notes, we aimed to develop and evaluate natural language processing (NLP)
methods to extract treatment and toxicity information.
Materials and Methods: We constructed a gold-standard dataset of 236 clinical
notes from 204,165 adult oncology patients. Domain experts annotated categories
related to treatment regimens and toxicities. We developed rule-based, machine
learning-based (Random Forest, Support Vector Machine [SVM], Logistic
Regression [LR]), deep learning-based (BERT, ClinicalBERT), and large language
model (LLM)-based NLP approaches (zero-shot and error-analysis prompting).
Models used an 80:20 train-test split.
Results: Sufficient data existed to train and evaluate 5 annotated
categories. Error-analysis prompting achieved optimal precision, recall, and F1
scores (F1=1.000) for treatment and toxicities extraction, whereas zero-shot
prompting reached F1=1.000 for treatment and F1=0.876 for toxicities
extraction. LR and SVM ranked second for toxicities (F1=0.937). Deep learning
underperformed, with BERT (F1=0.873 treatment; F1=0.839 toxicities) and
ClinicalBERT (F1=0.873 treatment; F1=0.886 toxicities). Rule-based methods
served as our baseline with F1 scores of 0.857 in treatment and 0.858 in
toxicities.
Discussion: LLM-based approaches outperformed all others, followed by machine
learning methods. Machine and deep learning approaches were limited by small
training data and showed limited generalizability, particularly for rare
categories.
Conclusion: LLM-based NLP most effectively extracted fluoropyrimidine
treatment and toxicity information from clinical notes, and has strong
potential to support oncology research and pharmacovigilance.
[LINK]http://arxiv.org/abs/2510.20727v1
[DATE]2025-10-24 00:44:39+08:00
[CATEGORIES]cs.CL
Blockwise SFT for Diffusion Language Models: Reconciling Bidirectional Attention and Autoregressive Decoding
[AUTHORS]Bowen Sun, Yujun Cai, Ming-Hsuan Yang, Yiwei Wang
[ABSTRACT]Discrete diffusion language models have shown strong potential for text
generation, yet standard supervised fine-tuning (SFT) misaligns with their
semi-autoregressive inference: training randomly masks tokens across the entire
response, while inference generates fixed-size blocks sequentially. This
mismatch introduces noisy prefixes and leaky suffixes, biasing gradients away
from the desired blockwise likelihood. We propose Blockwise SFT, which
partitions responses into fixed-size blocks, selects one active block per step
for stochastic masking, freezes all preceding tokens, and fully hides future
ones. Loss is computed only over the active block, directly mirroring the
blockwise decoding process. Experiments on GSM8K, MATH, and MetaMathQA show
consistent gains over classical SFT under equal compute or token budgets. Block
size consistency studies and ablations confirm that improvements stem from
faithful training-inference alignment rather than incidental masking effects.
Our results highlight the importance of matching supervision granularity to the
decoding procedure in diffusion-based language models.
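
The masking scheme described above can be sketched as a function that labels every response position. This is a minimal illustration under stated assumptions: the state names, the `mask_prob` parameter, and the choice to take the loss only on masked positions inside the active block are assumptions of this sketch, not details confirmed by the abstract.

```python
import random
from typing import List


def blockwise_states(
    response_len: int,
    block_size: int,
    active_block: int,
    mask_prob: float,
    rng: random.Random,
) -> List[str]:
    """Label each response position for one Blockwise-SFT-style step.

    "visible": preceding tokens, kept clean (frozen prefix)
    "masked" : active-block tokens chosen for stochastic masking
    "hidden" : future tokens, fully hidden
    """
    start = active_block * block_size
    end = min(start + block_size, response_len)
    states = []
    for pos in range(response_len):
        if pos < start:
            states.append("visible")
        elif pos < end:
            states.append("masked" if rng.random() < mask_prob else "visible")
        else:
            states.append("hidden")
    return states


states = blockwise_states(
    response_len=10, block_size=4, active_block=1,
    mask_prob=0.5, rng=random.Random(0),
)
# Assumption: loss is taken only on masked positions in the active block.
loss_positions = [i for i, s in enumerate(states) if s == "masked"]
```

By construction no loss position can fall outside the active block, which is the training-inference alignment the abstract argues for.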
[LINK]http://arxiv.org/abs/2508.19529v2
[DATE]2025-10-24 00:36:55+08:00
[CATEGORIES]cs.CL
Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning
[AUTHORS]Wenyi Xiao, Leilei Gan
[ABSTRACT]When applied to large vision-language model reasoning, reinforcement
learning, typically through GRPO, struggles to effectively scale reasoning length
or generates verbose outputs across all tasks with only marginal gains in
accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that
dynamically adapts reasoning depth based on question characteristics. Through
empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs
by investigating how response length and data distribution affect performance.
Inspired by these observations, we introduce two complementary metrics to
estimate the difficulty of the questions, guiding the model to determine when
fast or slow thinking is more appropriate. Next, we incorporate adaptive
length-based rewards and difficulty-aware KL divergence into the GRPO
algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST
achieves state-of-the-art accuracy with over 10% relative improvement compared
to the base model, while reducing token usage by 32.7-67.3% compared to
previous slow-thinking approaches, effectively balancing reasoning length and
accuracy.
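
For readers unfamiliar with GRPO, the group-relative advantage it relies on can be sketched as below. This shows only the standard normalization of rewards within a group of sampled responses; the adaptive length-based rewards and difficulty-aware KL term of FAST-GRPO are not modelled, and the use of the population standard deviation plus a small `eps` is an implementation choice of this sketch.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one question; 1.0 marks a correct answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average receive positive advantages and the rest negative ones, so no separate value model is needed to form a baseline.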
[LINK]http://arxiv.org/abs/2504.18458v2
[DATE]2025-10-24 00:25:28+08:00
[CATEGORIES]cs.CL
On the Emergence of Linear Analogies in Word Embeddings
[AUTHORS]Daniel J. Korchinski, Dhruva Karkada, Yasaman Bahri, Matthieu Wyart
[ABSTRACT]Models such as Word2Vec and GloVe construct word embeddings based on the
co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The
resulting vectors $W_i$ not only group semantically similar words but also
exhibit a striking linear analogy structure – for example, $W_{\text{king}} -
W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ – whose
theoretical origin remains unclear. Previous observations indicate that this
analogy structure: (i) already emerges in the top eigenvectors of the matrix
$M(i,j) = P(i,j)/(P(i)P(j))$, (ii) strengthens and then saturates as more
eigenvectors of $M(i,j)$, which controls the dimension of the embeddings, are
included, (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$, and
(iv) persists even when all word pairs involved in a specific analogy relation
(e.g., king-queen, man-woman) are removed from the corpus. To explain these
phenomena, we introduce a theoretical generative model in which words are
defined by binary semantic attributes, and co-occurrence probabilities are
derived from attribute-based interactions. This model analytically reproduces
the emergence of linear analogy structure and naturally accounts for properties
(i)-(iv). It can be viewed as giving fine-grained resolution into the role of
each additional embedding dimension. It is robust to various forms of noise and
agrees well with co-occurrence statistics measured on Wikipedia and the analogy
benchmark introduced by Mikolov et al.
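
The matrix $M(i,j)$ referred to above, the joint co-occurrence probability divided by the product of the marginals, can be computed directly from a table of counts. The sketch below is illustrative only: it assumes a symmetric count matrix with non-zero row sums, and the function name is hypothetical. The eigen-decomposition and the generative model from the paper are not reproduced.

```python
from typing import List


def cooccurrence_ratio_matrix(counts: List[List[float]]) -> List[List[float]]:
    """Return M(i,j) = P(i,j) / (P(i) * P(j)) from symmetric co-occurrence counts."""
    total = sum(sum(row) for row in counts)
    joint = [[c / total for c in row] for row in counts]
    marginal = [sum(row) for row in joint]  # P(i) as the row sum of P(i,j)
    n = len(counts)
    return [
        [joint[i][j] / (marginal[i] * marginal[j]) for j in range(n)]
        for i in range(n)
    ]


# Counts built as an outer product, i.e. words that co-occur independently.
independent_counts = [[4, 2], [2, 1]]
m = cooccurrence_ratio_matrix(independent_counts)
```

For independent words every entry of the matrix is 1, so departures from 1 (or from 0 after taking the logarithm) are what carry the semantic signal.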
[COMMENTS]Main: 10 pages, 3 figures. Appendices: 11 pages, 7 figures. Accepted
at NeurIPS 2025 as a poster
[LINK]http://arxiv.org/abs/2505.18651v2
[DATE]2025-10-24 00:17:09+08:00
[CATEGORIES]cs.CL cs.LG
Structure-Conditional Minimum Bayes Risk Decoding
[AUTHORS]Bryan Eikema, Anna Rutkiewicz, Mario Giulianelli
[ABSTRACT]Minimum Bayes Risk (MBR) decoding has seen renewed interest as an alternative
to traditional generation strategies. While MBR has proven effective in machine
translation, where the variability of a language model’s outcome space is
naturally constrained, it may face challenges in more open-ended tasks such as
dialogue or instruction-following. We hypothesise that in such settings,
applying MBR with standard similarity-based utility functions may result in
selecting responses that are broadly representative of the model’s
distribution, yet sub-optimal with respect to any particular grouping of
generations that share an underlying latent structure. In this work, we
introduce three lightweight adaptations to the utility function, designed to
make MBR more sensitive to structural variability in the outcome space. To test
our hypothesis, we curate a dataset capturing three representative types of
latent structure: dialogue act, emotion, and response structure (e.g., a
sentence, a paragraph, or a list). We further propose two metrics to evaluate
the structural optimality of MBR. Our analysis demonstrates that common
similarity-based utility functions fall short by these metrics. In contrast,
our proposed adaptations considerably improve structural optimality. Finally,
we evaluate our approaches on real-world instruction-following benchmarks,
AlpacaEval and MT-Bench, and show that increased structural sensitivity
improves generation quality by up to 13.7 percentage points in win rate.
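
Standard MBR decoding, the baseline that the adaptations above modify, can be sketched as follows. The word-overlap utility is a simple stand-in chosen for this example and is an assumption; the structure-aware utilities proposed in the paper are not implemented here. The sketch expects at least two candidates.

```python
from typing import Callable, List


def mbr_select(candidates: List[str], utility: Callable[[str, str], float]) -> str:
    """Pick the candidate with the highest average utility against the others."""
    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, used here as a stand-in utility function."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


samples = ["the cat sat", "the cat sat down", "the cat slept", "a dog barked"]
choice = mbr_select(samples, jaccard)
```

The selected output is the one closest to the bulk of the samples, which illustrates the abstract's concern: a broadly representative choice can ignore a minority group of samples that share a different latent structure.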
[COMMENTS]EMNLP 2025 Camera-Ready
[LINK]http://arxiv.org/abs/2510.20700v1
[DATE]2025-10-24 00:13:49+08:00
[CATEGORIES]cs.CL
Superposition Yields Robust Neural Scaling
[AUTHORS]Yizhou Liu, Ziming Liu, Jeff Gore
[ABSTRACT]The success of today’s large language models (LLMs) depends on the
observation that larger models perform better. However, the origin of this
neural scaling law, that loss decreases as a power law with model size, remains
unclear. We propose that representation superposition, meaning that LLMs
represent more features than they have dimensions, can be a key contributor to
loss and cause neural scaling. Based on Anthropic’s toy model, we use weight
decay to control the degree of superposition, allowing us to systematically
study how loss scales with model size. When superposition is weak, the loss
follows a power law only if data feature frequencies are power-law distributed.
In contrast, under strong superposition, the loss generically scales inversely
with model dimension across a broad class of frequency distributions, due to
geometric overlaps between representation vectors. We confirmed that
open-sourced LLMs operate in the strong superposition regime and have loss
scaling like one over the model dimension, and that the Chinchilla scaling laws
are also consistent with this behavior. Our results identify representation
superposition as a central driver of neural scaling laws, providing insights
into questions like when neural scaling laws can be improved and when they will
break down.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.10465v3
[DATE]2025-10-24 00:06:53+08:00
[CATEGORIES]cs.LG cs.CL
Neural Diversity Regularizes Hallucinations in Small Models
[AUTHORS]Kushal Chakrabarti, Nirmal Balachundhar
[ABSTRACT]Language models continue to hallucinate despite increases in parameters,
compute, and data. We propose neural diversity – decorrelated parallel
representations – as a principled mechanism that reduces hallucination rates
at fixed parameter and data budgets. Inspired by portfolio theory, where
uncorrelated assets reduce risk by $\sqrt{P}$, we prove hallucination
probability is bounded by representational correlation: $P(H) \leq
f(\sigma^2((1-\rho(P))/P + \rho(P)), \mu^2)$, which predicts that language
models need an optimal amount of neurodiversity. To validate this, we introduce
ND-LoRA (Neural Diversity Low-Rank Adaptation), combining parallel LoRA
adapters with Barlow Twins regularization, and demonstrate that ND-LoRA reduces
hallucinations by up to 25.6% (and 14.6% on average) without degrading general
accuracy. Ablations show LoRA adapters and regularization act synergistically,
causal interventions prove neurodiversity as the mediating factor and
correlational analyses indicate scale: a 0.1% neural correlation increase is
associated with a 3.8% hallucination increase. Finally, task-dependent
optimality emerges: different tasks require different amounts of optimal
neurodiversity. Together, our results highlight neural diversity as a third
axis of scaling – orthogonal to parameters and data – to improve the
reliability of language models at fixed budgets.
[LINK]http://arxiv.org/abs/2510.20690v1
[DATE]2025-10-24 00:03:07+08:00
[CATEGORIES]cs.CL cs.LG
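The term $\sigma^2((1-\rho)/P + \rho)$ inside the bound above matches the textbook variance of the mean of P equally correlated variables, each with variance $\sigma^2$ and pairwise correlation $\rho$. The sketch below only evaluates that expression; it treats $\rho$ as a constant rather than a function of P, and the function name is hypothetical, so it should be read as an illustration of the portfolio analogy rather than the paper's full bound.

```python
def variance_of_mean(sigma2, rho, p):
    """Variance of the average of p variables with variance sigma2
    and common pairwise correlation rho (assumed constant here)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sigma2 * ((1.0 - rho) / p + rho)


if __name__ == "__main__":
    # Uncorrelated streams: variance falls as 1/P (std as 1/sqrt(P)).
    print(variance_of_mean(1.0, 0.0, 4))
    # Fully correlated streams: adding more of them does not help.
    print(variance_of_mean(1.0, 1.0, 4))
```

With any positive correlation the variance cannot drop below sigma2 * rho however large P becomes, which is why decorrelating the parallel representations matters in this picture.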
CantoNLU: A benchmark for Cantonese natural language understanding
[AUTHORS]Junghyun Min, York Hay Ng, Sophia Chan, Helena Shunhua Zhao, En-Shiun Annie Lee
[ABSTRACT]Cantonese, although spoken by millions, remains under-resourced due to policy
and diglossia. To address this scarcity of evaluation frameworks for Cantonese,
we introduce CantoNLU, a benchmark for Cantonese natural
language understanding (NLU). This novel benchmark spans seven tasks covering
syntax and semantics, including word sense disambiguation, linguistic
acceptability judgment, language detection, natural language inference,
sentiment analysis, part-of-speech tagging, and dependency parsing. In addition
to the benchmark, we provide model baseline performance across a set of models:
a Mandarin model without Cantonese training, two Cantonese-adapted models
obtained by continual pre-training a Mandarin model on Cantonese text, and a
monolingual Cantonese model trained from scratch. Results show that
Cantonese-adapted models perform best overall, while monolingual models perform
better on syntactic tasks. Mandarin models remain competitive in certain
settings, indicating that direct transfer may be sufficient when Cantonese
domain data is scarce. We release all datasets, code, and model weights to
facilitate future research in Cantonese NLP.
[COMMENTS]13 pages, 1 figure
[LINK]http://arxiv.org/abs/2510.20670v1
[DATE]2025-10-23 23:47:27+08:00
[CATEGORIES]cs.CL
BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning
[AUTHORS]Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
[COMMENTS]NeurIPS 2025 Spotlight; Project page:
https://imageomics.github.io/bioclip-2/
[LINK]http://arxiv.org/abs/2505.23883v2
[DATE]2025-10-23 23:25:21+08:00
[CATEGORIES]cs.CL cs.LG
The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
[AUTHORS]Alan Saji, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully
[ABSTRACT]Large Reasoning Models (LRMs) achieve strong performance on mathematical,
scientific, and other question-answering tasks, but their multilingual
reasoning abilities remain underexplored. When presented with non-English
questions, LRMs often default to reasoning in English, raising concerns about
interpretability and the handling of linguistic and cultural nuances. We
systematically compare an LRM’s reasoning in English versus the language of the
question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond
measuring answer accuracy, we also analyze cognitive attributes in the
reasoning traces. We find that English reasoning traces exhibit a substantially
higher presence of these cognitive behaviors, and that reasoning in English
generally yields higher final-answer accuracy, with the performance gap
increasing as tasks become more complex. However, this English-centric strategy
is susceptible to a key failure mode, getting “Lost in Translation,” where
translation steps introduce errors that reasoning in the question’s language
would have avoided.
[COMMENTS]14 pages, 13 figures, 5 tables
[LINK]http://arxiv.org/abs/2510.20647v1
[DATE]2025-10-23 23:22:00+08:00
[CATEGORIES]cs.CL
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
[AUTHORS]Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
[ABSTRACT]Reinforcement Learning with Verifiable Rewards (RLVR) has recently
demonstrated notable success in enhancing the reasoning performance of large
language models (LLMs), particularly on mathematics and programming tasks.
Similar to how traditional RL helps agents explore and learn new strategies,
RLVR is believed to enable LLMs to continuously self-improve, thus acquiring
novel reasoning abilities beyond those of the corresponding base models. In
this study we critically examine the current state of RLVR by systematically
probing the reasoning capability boundaries of RLVR-trained LLMs across various
model families, RL algorithms, and math, coding, and visual reasoning
benchmarks, using pass@k at large k values as the evaluation metric.
Surprisingly, we find that the current training setup does not elicit
fundamentally new reasoning patterns. While RLVR-trained models outperform
their base models at small k (e.g., k = 1), the base models achieve a higher
pass@k score when k is large. Coverage and perplexity analyses show that the
observed reasoning abilities originate from and are bounded by the base model.
Treating the base model as an upper bound, our quantitative analysis shows that
six popular RLVR algorithms perform similarly and remain far from optimal in
leveraging the potential of the base model. By contrast, we find that
distillation can introduce new reasoning patterns from the teacher and
genuinely expand the model’s reasoning capabilities. Overall, our findings
suggest that current RLVR methods have not yet realized the potential of RL to
elicit truly novel reasoning abilities in LLMs. This highlights the need for
improved RL paradigms, such as continual scaling and multi-turn
agent-environment interaction, to unlock this potential.
[COMMENTS]30 pages, 27 figures
[LINK]http://arxiv.org/abs/2504.13837v4
[DATE]2025-10-23 23:11:15+08:00
[CATEGORIES]cs.CL
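The abstract above evaluates models with pass@k at large k. A minimal sketch of how pass@k is commonly estimated from n samples per problem follows; it uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), where c is the number of correct samples. Whether the paper computes pass@k exactly this way is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: samples that passed, k: budget evaluated.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct sample out of four, evaluated at k = 1 and k = 2.
    print(pass_at_k(4, 1, 1), pass_at_k(4, 1, 2))
```

Averaging this quantity over problems gives the benchmark score. A model that solves a problem only rarely contributes little at k = 1 but approaches 1.0 as k grows, which is why large-k comparisons probe coverage rather than single-shot accuracy.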
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
[AUTHORS]Jianhui Chen, Xiaozhi Wang, Zijun Yao, Yushi Bai, Lei Hou, Juanzi Li
[ABSTRACT]Large language models (LLMs) excel in various capabilities but pose safety
risks such as generating harmful content and misinformation, even after safety
alignment. In this paper, we explore the inner mechanisms of safety alignment
through the lens of mechanistic interpretability, focusing on identifying and
analyzing safety neurons within LLMs that are responsible for safety behaviors.
We propose inference-time activation contrasting to locate these neurons and
dynamic activation patching to evaluate their causal effects on model safety.
Experiments on multiple prevalent LLMs demonstrate that we can consistently
identify about $5\%$ safety neurons, and by only patching their activations we
can restore over $90\%$ of the safety performance across various red-teaming
benchmarks without influencing general ability. The finding of safety neurons
also helps explain the “alignment tax” phenomenon by revealing that the key
neurons for model safety and helpfulness significantly overlap, yet they
require different activation patterns for the same neurons. Furthermore, we
demonstrate an application of our findings in safeguarding LLMs by detecting
unsafe outputs before generation. The source code is available at
https://github.com/THU-KEG/SafetyNeuron.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2406.14144v2
[DATE]2025-10-23 23:10:09+08:00
[CATEGORIES]cs.CL cs.LG
XtraGPT: Context-Aware and Controllable Academic Paper Revision
[AUTHORS]Nuo Chen, Andre Lin HuiKai, Jiaying Wu, Junyi Hou, Zining Zhang, Qian Wang, Xidong Wang, Bingsheng He
[ABSTRACT]Despite the growing adoption of large language models (LLMs) in academic
workflows, their capabilities remain too limited to support high-quality scientific
writing. Most existing systems are designed for general-purpose scientific text
generation and fail to meet the sophisticated demands of research communication
beyond surface-level polishing, such as conceptual coherence across sections.
Furthermore, academic writing is inherently iterative and revision-driven, a
process not well supported by direct prompting-based paradigms. To address
these scenarios, we propose a human-AI collaboration framework for academic
paper revision centered on criteria-guided intent alignment and context-aware
modeling. To validate the framework, we curate a dataset of 7,000 research
papers from top-tier venues annotated with 140,000 instruction-response pairs
that reflect realistic, section-level scientific revisions. We instantiate the
framework in XtraGPT, the first suite of open-source LLMs (1.5B to 14B
parameters) for context-aware, instruction-guided writing assistance. Extensive
experiments validate that XtraGPT significantly outperforms same-scale
baselines and approaches the quality of proprietary systems. Both automated
preference assessments and human evaluations confirm the effectiveness of
XtraGPT in improving scientific drafts.
[COMMENTS]Preprint. The model report is available at
https://arxiv.org/abs/2505.11336v1
[LINK]http://arxiv.org/abs/2505.11336v3
[DATE]2025-10-23 22:49:19+08:00
[CATEGORIES]cs.CL
Unlocking Multi-View Insights in Knowledge-Dense Retrieval-Augmented Generation
[AUTHORS]Guanhua Chen, Wenhan Yu, Xiao Lu, Xiao Zhang, Erli Meng, Lei Sha
[ABSTRACT]While Retrieval-Augmented Generation (RAG) plays a crucial role in the
application of Large Language Models (LLMs), existing retrieval methods in
knowledge-dense domains like law and medicine still suffer from a lack of
multi-perspective views, which are essential for improving interpretability and
reliability. Previous research on multi-view retrieval often focused solely on
different semantic forms of queries, neglecting the expression of specific
domain knowledge perspectives. This paper introduces a novel multi-view RAG
framework, MVRAG, tailored for knowledge-dense domains that utilizes
intention-aware query rewriting from multiple domain viewpoints to enhance
retrieval precision, thereby improving the effectiveness of the final
inference. Experiments conducted on legal and medical case retrieval
demonstrate significant improvements in recall and precision rates with our
framework. Our multi-perspective retrieval approach unleashes the potential of
multi-view information enhancing RAG tasks, accelerating the further
application of LLMs in knowledge-intensive fields.
[LINK]http://arxiv.org/abs/2404.12879v2
[DATE]2025-10-23 22:32:33+08:00
[CATEGORIES]cs.CL
Neural Attention Search
[AUTHORS]Difan Deng, Marius Lindauer
[ABSTRACT]We present Neural Attention Search (NAtS), a framework that automatically
evaluates the importance of each token within a sequence and determines if the
corresponding token can be dropped after several steps. This approach can
efficiently reduce the KV cache sizes required by transformer-based models
during inference and thus reduce inference costs. In this paper, we design a
search space that contains three token types: (i) Global Tokens will be
preserved and queried by all the following tokens. (ii) Local Tokens survive
until the next global token appears. (iii) Sliding Window Tokens have an impact
on the inference of a fixed size of the next following tokens. Similar to the
One-Shot Neural Architecture Search approach, this token-type information can
be learned jointly with the architecture weights via a learnable attention
mask. Experiments on both training a new transformer from scratch and
fine-tuning existing large language models show that NAtS can efficiently
reduce the KV cache size required for the models while maintaining the models’
performance.
[COMMENTS]35 pages, 11 figures
[LINK]http://arxiv.org/abs/2502.13251v4
[DATE]2025-10-23 22:23:24+08:00
[CATEGORIES]cs.CL
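The three token types described in the abstract above imply a rule for which cached tokens each query position can still attend to. The sketch below hard-codes one reading of those rules for a fixed assignment of types; it does not learn the types, and the function name, the "G"/"L"/"S" labels, and the boundary conventions (a local token is dropped as soon as a later global token appears; a sliding-window token stays visible for `window` following positions) are assumptions, not the paper's specification.

```python
def visible_tokens(token_types, window):
    """For each query position i, list the positions j <= i still in the cache.

    token_types: sequence of "G" (global), "L" (local), "S" (sliding window).
    window: number of following tokens a sliding-window token can influence.
    """
    visible = []
    for i in range(len(token_types)):
        keep = []
        for j in range(i + 1):
            kind = token_types[j]
            if kind == "G":
                # Global tokens are kept for every later query.
                keep.append(j)
            elif kind == "L":
                # Local tokens survive until a later global token appears.
                if "G" not in token_types[j + 1 : i + 1]:
                    keep.append(j)
            elif kind == "S":
                # Sliding-window tokens expire after `window` steps.
                if i - j <= window:
                    keep.append(j)
            else:
                raise ValueError(f"unknown token type: {kind!r}")
        visible.append(keep)
    return visible


if __name__ == "__main__":
    for i, keep in enumerate(visible_tokens(["G", "L", "S", "G", "S"], window=1)):
        print(i, keep, "cache size:", len(keep))
```

The length of each row is the KV cache size at that step under this reading, so the savings come from how many tokens are assigned the local or sliding-window type.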
Position: The Current AI Conference Model is Unsustainable! Diagnosing the Crisis of Centralized AI Conference
[AUTHORS]Nuo Chen, Moming Duan, Andre Huikai Lin, Qian Wang, Jiaying Wu, Bingsheng He
[ABSTRACT]Artificial Intelligence (AI) conferences are essential for advancing
research, sharing knowledge, and fostering academic community. However, their
rapid expansion has rendered the centralized conference model increasingly
unsustainable. This paper offers a data-driven diagnosis of a structural crisis
that threatens the foundational goals of scientific dissemination, equity, and
community well-being. We identify four key areas of strain: (1) scientifically,
with per-author publication rates more than doubling over the past decade to
over 4.5 papers annually; (2) environmentally, with the carbon footprint of a
single conference exceeding the daily emissions of its host city; (3)
psychologically, with 71% of online community discourse reflecting negative
sentiment and 35% referencing mental health concerns; and (4) logistically,
with attendance at top conferences such as NeurIPS 2024 beginning to outpace
venue capacity. These pressures point to a system that is misaligned with its
core mission. In response, we propose the Community-Federated Conference (CFC)
model, which separates peer review, presentation, and networking into globally
coordinated but locally organized components, offering a more sustainable,
inclusive, and resilient path forward for AI research.
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2508.04586v4
[DATE]2025-10-23 22:21:19+08:00
[CATEGORIES]cs.CL
Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders
[AUTHORS]Agam Goyal, Vedant Rathi, William Yeh, Yian Wang, Yuen Chen, Hari Sundaram
[ABSTRACT]Large language models (LLMs) are now ubiquitous in user-facing applications,
yet they still generate undesirable toxic outputs, including profanity,
vulgarity, and derogatory remarks. Although numerous detoxification methods
exist, most apply broad, surface-level fixes and can therefore easily be
circumvented by jailbreak attacks. In this paper we leverage sparse
autoencoders (SAEs) to identify toxicity-related directions in the residual
stream of models and perform targeted activation steering using the
corresponding decoder vectors. We introduce three tiers of steering
aggressiveness and evaluate them on GPT-2 Small and Gemma-2-2B, revealing
trade-offs between toxicity reduction and language fluency. At stronger
steering strengths, these causal interventions surpass competitive baselines in
reducing toxicity by up to 20%, though fluency can degrade noticeably on GPT-2
Small depending on the aggressiveness. Crucially, standard NLP benchmark scores
upon steering remain stable, indicating that the model’s knowledge and general
abilities are preserved. We further show that feature-splitting in wider SAEs
hampers safety interventions, underscoring the importance of disentangled
feature learning. Our findings highlight both the promise and the current
limitations of SAE-based causal interventions for LLM detoxification, further
suggesting practical guidelines for safer language-model deployment.
[COMMENTS]EMNLP 2025
[LINK]http://arxiv.org/abs/2505.14536v2
[DATE]2025-10-23 22:19:01+08:00
[CATEGORIES]cs.CL
MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
[AUTHORS]Agam Goyal, Xianyang Zhan, Yilun Chen, Koustuv Saha, Eshwar Chandrasekharan
[COMMENTS]EMNLP 2025 (Oral)
[LINK]http://arxiv.org/abs/2505.14483v2
[DATE]2025-10-23 22:05:56+08:00
[CATEGORIES]cs.CL
MultiHal: Multilingual Dataset for Knowledge-Graph Grounded Evaluation of LLM Hallucinations
[AUTHORS]Ernests Lavrinovics, Russa Biswas, Katja Hose, Johannes Bjerva
[ABSTRACT]Large Language Models (LLMs) have inherent limitations of faithfulness and
factuality, commonly referred to as hallucinations. Several benchmarks have
been developed that provide a test bed for factuality evaluation within the
context of English-centric datasets, while relying on supplementary informative
context like web links or text passages but ignoring the available structured
factual resources. To this end, Knowledge Graphs (KGs) have been identified as
a useful aid for hallucination mitigation, as they provide a structured way to
represent the facts about entities and their relations with minimal linguistic
overhead. We bridge the lack of KG paths and multilinguality for factual
language modeling within the existing hallucination evaluation benchmarks and
propose a KG-based multilingual, multihop benchmark called MultiHal framed for
generative text evaluation. As part of our data collection pipeline, we mined
140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths,
curating a high-quality subset of 25.9k. Our baseline evaluation shows an
absolute scale improvement by approximately 0.12 to 0.36 points for the
semantic similarity score, 0.16 to 0.36 for NLI entailment and 0.29 to 0.42 for
hallucination detection in KG-RAG over vanilla QA across multiple languages and
multiple models, demonstrating the potential of KG integration. We anticipate
MultiHal will foster future research towards several graph-based hallucination
mitigation and fact-checking tasks.
[LINK]http://arxiv.org/abs/2505.14101v2
[DATE]2025-10-23 21:59:23+08:00
[CATEGORIES]cs.CL
Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search
[AUTHORS]Zhouwei Zhai, Mengxiang Chen, Haoyun Xia, Jin Li, Renquan Zhou, Min Yang
[ABSTRACT]The retrieval-ranking paradigm has long dominated e-commerce search, but its
reliance on query-item matching fundamentally misaligns with multi-stage
cognitive decision processes of platform users. This misalignment introduces
critical limitations: semantic gaps in complex queries, high decision costs due
to cross-platform information foraging, and the absence of professional
shopping guidance. To address these issues, we propose a Multi-Agent Cognitive
Decision Framework (MACDF), which shifts the paradigm from passive retrieval to
proactive decision support. Extensive offline evaluations demonstrate MACDF’s
significant improvements in recommendation accuracy and user satisfaction,
particularly for complex queries involving negation, multi-constraint, or
reasoning demands. Online A/B testing on JD search platform confirms its
practical efficacy. This work highlights the transformative potential of
multi-agent cognitive systems in redefining e-commerce search.
[LINK]http://arxiv.org/abs/2510.20567v1
[DATE]2025-10-23 21:55:53+08:00
[CATEGORIES]cs.CL
GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
[AUTHORS]Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao
[ABSTRACT]Reinforcement learning has recently shown promise in improving
retrieval-augmented generation (RAG). Despite these advances, its effectiveness
in multi-hop question answering (QA) remains constrained by two fundamental
limitations: (i) the absence of global planning to structure multi-step reasoning, and
(ii) unfaithful execution, which hinders effective query formulation and
consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement
learning framework designed to enhance global reasoning in multi-hop QA.
GlobalRAG decomposes questions into subgoals, coordinates retrieval with
reasoning, and refines evidence iteratively. To guide this process, we
introduce Planning Quality Reward and SubGoal Completion Reward, which
encourage coherent planning and reliable subgoal execution. In addition, a
progressive weight annealing strategy balances process-oriented and
outcome-based objectives. Extensive experiments on both in-domain and
out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms
strong baselines while using only 8k training data (42% of the training data
used by strong baselines), achieving average improvements of 14.2% in both EM
and F1.
[COMMENTS]8 pages, 3 figures, 4 tables
[LINK]http://arxiv.org/abs/2510.20548v1
[DATE]2025-10-23 21:35:02+08:00
[CATEGORIES]cs.CL
The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts
[AUTHORS]Sangmitra Madhusudan, Kaige Chen, Ali Emami
[ABSTRACT]When language models correctly parse “The cat that the dog chased meowed,”
are they analyzing syntax or simply familiar with dogs chasing cats? Despite
extensive benchmarking, we lack methods to distinguish structural understanding
from semantic pattern matching. We introduce CenterBench, a dataset of 9,720
comprehension questions on center-embedded sentences (like “The cat [that the
dog chased] meowed”) where relative clauses nest recursively, creating
processing demands from simple to deeply nested structures. Each sentence has a
syntactically identical but semantically implausible counterpart (e.g., mailmen
prescribe medicine, doctors deliver mail) and six comprehension questions
testing surface understanding, syntactic dependencies, and causal reasoning.
Testing six models reveals that performance gaps between plausible and
implausible sentences widen systematically with complexity, with models showing
median gaps up to 26.8 percentage points, quantifying when they abandon
structural analysis for semantic associations. Notably, semantic plausibility
harms performance on questions about resulting actions, where following causal
relationships matters more than semantic coherence. Reasoning models improve
accuracy but their traces show semantic shortcuts, overthinking, and answer
refusal. Unlike models whose plausibility advantage systematically widens with
complexity, humans show variable semantic effects. CenterBench provides the
first framework to identify when models shift from structural analysis to
pattern matching.
[LINK]http://arxiv.org/abs/2510.20543v1
[DATE]2025-10-23 21:30:40+08:00
[CATEGORIES]cs.CL
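The center-embedded structure described in the abstract above (“The cat that the dog chased meowed”) nests each relative clause inside the previous one, so subjects appear in order and their verbs appear in reverse order. The sketch below generates such sentences at arbitrary depth; it is an illustration of the sentence pattern only, not the CenterBench construction pipeline, and the function name and word lists are hypothetical.

```python
def center_embed(nouns, verbs):
    """Build a center-embedded sentence.

    nouns[0] is the main subject and verbs[0] its main verb; for i >= 1,
    nouns[i] is the subject of verbs[i], which acts on nouns[i - 1].
    """
    if not nouns or len(nouns) != len(verbs):
        raise ValueError("nouns and verbs must be non-empty and equal in length")
    subject_chain = " that the ".join(nouns)
    # Verbs close the clauses from the innermost outwards.
    verb_chain = " ".join(reversed(verbs))
    return f"The {subject_chain} {verb_chain}."


if __name__ == "__main__":
    print(center_embed(["cat", "dog"], ["meowed", "chased"]))
    print(center_embed(["cat", "dog", "boy"], ["meowed", "chased", "owned"]))
```

Swapping the noun list while keeping the verbs fixed yields a syntactically identical sentence with different plausibility, which is the kind of contrast the benchmark relies on.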
Face-Human-Bench: A Comprehensive Benchmark of Face and Human Understanding for Multi-modal Assistants
[AUTHORS]Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, Weiran Xu
[COMMENTS]50 pages, 14 figures, 42 tables. NeurIPS 2025 Datasets and Benchmarks
Track
[LINK]http://arxiv.org/abs/2501.01243v3
[DATE]2025-10-23 21:25:59+08:00
[CATEGORIES]cs.CL
ARC-Encoder: learning compressed text representations for large language models
[AUTHORS]Hippolyte Pilchen, Edouard Grave, Patrick Pérez
[ABSTRACT]Recent techniques such as retrieval-augmented generation or chain-of-thought
reasoning have led to longer contexts and increased inference costs. Context
compression techniques can reduce these costs, but the most effective
approaches require fine-tuning the target model or even modifying its
architecture. This can degrade its general abilities when not used for this
specific purpose. Here we explore an alternative approach: an encoder that
compresses the context into continuous representations which replace token
embeddings in decoder LLMs. First, we perform a systematic study of training
strategies and architecture choices for the encoder. Our findings led to the
design of an Adaptable text Representations Compressor, named ARC-Encoder,
which outputs $x$-times fewer continuous representations (typically
$x \in \{4,8\}$) than text tokens. We evaluate ARC-Encoder across a variety
of LLM usage scenarios, ranging from in-context learning to context window
extension, on both instruct and base decoders. Results show that ARC-Encoder
achieves state-of-the-art performance on several benchmarks while improving
computational efficiency at inference. Finally, we demonstrate that our models
can be adapted to multiple decoders simultaneously, allowing a single encoder
to generalize across different decoder LLMs. This makes ARC-Encoder a flexible
and efficient solution for portable encoders that work seamlessly with multiple
LLMs. Training code is released at https://github.com/kyutai-labs/ARC-Encoder
and the fine-tuning dataset and pretrained models are available at
https://huggingface.co/collections/kyutai/arc-encoders-68ee18787301407d60a57047 .
[LINK]http://arxiv.org/abs/2510.20535v1
[DATE]2025-10-23 21:20:57+08:00
[CATEGORIES]cs.CL
Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
[AUTHORS]Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
[COMMENTS]Accepted at MELT Workshop @ COLM 2025
[LINK]http://arxiv.org/abs/2505.16722v3
[DATE]2025-10-23 21:15:41+08:00
[CATEGORIES]cs.CL
Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
[AUTHORS]Jiaqi Cao, Jiarui Wang, Rubin Wei, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
[ABSTRACT]Large Language Models (LLMs) have shown strong abilities in general language
tasks, yet adapting them to specific domains remains a challenge. Current
methods like Domain Adaptive Pretraining (DAPT) require costly full-parameter
training and suffer from catastrophic forgetting. Meanwhile,
Retrieval-Augmented Generation (RAG) introduces substantial inference latency
due to expensive nearest-neighbor searches and longer context. This paper
introduces Memory Decoder, a plug-and-play pretrained memory that enables
efficient domain adaptation without changing the original model’s parameters.
Memory Decoder employs a small transformer decoder that learns to imitate the
behavior of an external non-parametric retriever. Once trained, Memory Decoder
can be seamlessly integrated with any pretrained language model that shares the
same tokenizer, requiring no model-specific modifications. Experimental results
demonstrate that Memory Decoder enables effective adaptation of various Qwen
and Llama models to three distinct specialized domains: biomedicine, finance,
and law, reducing perplexity by an average of 6.17 points. Overall, Memory
Decoder introduces a novel paradigm centered on a specially pretrained memory
component designed for domain-specific adaptation. This memory architecture can
be integrated in a plug-and-play manner, consistently enhancing performance
across multiple models within the target domain.
[LINK]http://arxiv.org/abs/2508.09874v2
[DATE]2025-10-23 21:14:04+08:00
[CATEGORIES]cs.CL
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
[AUTHORS]Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Shu-Tao Xia
[ABSTRACT]Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where
generated responses seem semantically plausible yet exhibit little or no
relevance to the input image. Previous studies reveal that this issue primarily
stems from LVLMs’ over-reliance on language priors while disregarding the
visual information during decoding. To alleviate this issue, we introduce a
novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding
strategy, which adaptively strengthens the mutual dependency between generated
texts and input images to mitigate hallucinations. Unlike existing methods
solely focusing on text token sampling, we propose to jointly model the
contributions of visual and textual tokens to C-PMI, formulating hallucination
mitigation as a bi-level optimization problem aimed at maximizing mutual
information. To solve it, we design a token purification mechanism that
dynamically regulates the decoding process by sampling text tokens remaining
maximally relevant to the given image, while simultaneously refining image
tokens most pertinent to the generated response. Extensive experiments across
various benchmarks reveal that the proposed method significantly reduces
hallucinations in LVLMs while preserving decoding efficiency.
[LINK]http://arxiv.org/abs/2505.19678v3
[DATE]2025-10-23 21:08:11+08:00
[CATEGORIES]cs.CL
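For reference, the quantity named in the abstract above can be written in its standard form. This is the textbook definition of conditional pointwise mutual information, not necessarily the exact objective the paper optimizes; the symbols $v$ (image tokens), $x$ (text prompt), and $y_t$ (the token generated at step $t$) are notation assumed here.

```latex
\mathrm{PMI}(y_t ; v \mid x, y_{<t})
  = \log \frac{p(y_t \mid v, x, y_{<t})}{p(y_t \mid x, y_{<t})}
```

A token with a large positive value is one the image makes more likely than the language prior alone would, so favoring such tokens during decoding counteracts over-reliance on language priors.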
MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning
[AUTHORS]Mircea Lică, Ojas Shirekar, Baptiste Colle, Chirag Raman
[ABSTRACT]Embodied agents powered by large language models (LLMs), such as Voyager,
promise open-ended competence in worlds such as Minecraft. However, when
powered by open-weight LLMs they still falter on elementary tasks after
domain-specific fine-tuning. We propose MindForge, a generative-agent framework
for cultural lifelong learning through explicit perspective taking. We
introduce three key innovations: (1) a structured theory of mind representation
linking percepts, beliefs, desires, and actions; (2) natural inter-agent
communication; and (3) a multi-component memory system. Following the cultural
learning framework, we test MindForge in both instructive and collaborative
settings within Minecraft. In an instructive setting with GPT-4, MindForge
agents powered by open-weight LLMs significantly outperform their Voyager
counterparts in basic tasks yielding $3\times$ more tech-tree milestones and
collecting $2.3\times$ more unique items than the Voyager baseline.
Furthermore, in fully collaborative settings, we find that the
performance of two underachieving agents improves with more communication
rounds, echoing the Condorcet Jury Theorem. MindForge agents demonstrate
sophisticated behaviors, including expert-novice knowledge transfer,
collaborative problem solving, and adaptation to out-of-distribution tasks
through accumulated cultural experiences.
[COMMENTS]Accepted to NeurIPS 2025 main track as poster
[LINK]http://arxiv.org/abs/2411.12977v5
[DATE]2025-10-23 21:07:52+08:00
[CATEGORIES]cs.CL
HauntAttack: When Attack Follows Reasoning as a Shadow
[AUTHORS]Jingyuan Ma, Rui Li, Zheng Li, Junfeng Liu, Heming Xia, Lei Sha, Zhifang Sui
[ABSTRACT]Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and
reasoning tasks, showcasing remarkable capabilities. However, the enhancement
of reasoning abilities and the exposure of internal reasoning processes
introduce new safety vulnerabilities. A critical question arises: when
reasoning becomes intertwined with harmfulness, will LRMs become more
vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce
HauntAttack, a novel and general-purpose black-box adversarial attack framework
that systematically embeds harmful instructions into reasoning questions.
Specifically, we modify key reasoning conditions in existing questions with
harmful instructions, thereby constructing a reasoning pathway that guides the
model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs
and observe an average attack success rate of 70%, achieving up to 12
percentage points of absolute improvement over the strongest prior baseline.
Our further analysis reveals that even advanced safety-aligned models remain
highly susceptible to reasoning-based attacks, offering insights into the
urgent challenge of balancing reasoning capability and safety in future model
development.
[LINK]http://arxiv.org/abs/2506.07031v4
[DATE]2025-10-23 20:59:35+08:00
[CATEGORIES]cs.CL
Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset
[AUTHORS]Paul Lerner, François Yvon
[ABSTRACT]The political biases of Large Language Models (LLMs) are usually assessed by
simulating their answers to English surveys. In this work, we propose an
alternative framing of political biases, relying on principles of fairness in
multilingual translation. We systematically compare the translation quality of
speeches in the European Parliament (EP), observing systematic differences with
majority parties from left, center, and right being better translated than
outsider parties. This study is made possible by a new, 21-way multiparallel
version of EuroParl, the parliamentary proceedings of the EP, which includes
the political affiliations of each speaker. The dataset consists of 1.5M
sentences for a total of 40M words and 249M characters. It covers three years,
1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of
national parties.
[LINK]http://arxiv.org/abs/2510.20508v1
[DATE]2025-10-23 20:50:30+08:00
[CATEGORIES]cs.CL
Hierarchical Sequence Iteration for Heterogeneous Question Answering
[AUTHORS]Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim
[ABSTRACT]Retrieval-augmented generation (RAG) remains brittle on multi-step questions
and heterogeneous evidence sources, trading accuracy against latency and
token/tool budgets. This paper introduces Hierarchical Sequence (HSEQ) Iteration
for Heterogeneous Question Answering, a unified framework that (i) linearizes
documents, tables, and knowledge graphs into a reversible hierarchical sequence
with lightweight structural tags, and (ii) performs structure-aware iteration to
collect just enough evidence before answer synthesis. A Head Agent provides
guidance that leads retrieval, while an Iteration Agent selects and expands
HSEQ via structure-respecting actions (e.g., parent/child hops, table
row/column neighbors, KG relations). Finally, the Head Agent composes
canonicalized evidence to generate the final answer, with an optional
refinement loop to resolve detected contradictions. Experiments on HotpotQA
(text), HybridQA/TAT-QA (table+text), and MetaQA (KG) show consistent EM/F1
gains over strong single-pass, multi-hop, and agentic RAG baselines with high
efficiency. Besides, HSEQ exhibits three key advantages: (1) a format-agnostic
unification that enables a single policy to operate across text, tables, and
KGs without per-dataset specialization; (2) guided, budget-aware iteration that
reduces unnecessary hops, tool calls, and tokens while preserving accuracy; and
(3) evidence canonicalization for reliable QA, improving answer consistency
and auditability.
[COMMENTS]22 pages, 3 figures
[LINK]http://arxiv.org/abs/2510.20505v1
[DATE]2025-10-23 20:48:18+08:00
[CATEGORIES]cs.CL
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
[AUTHORS]Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No
[ABSTRACT]Large Reasoning Models (LRMs) have become powerful tools for complex problem
solving, but their structured reasoning pathways can lead to unsafe outputs
when exposed to harmful prompts. Existing safety alignment methods reduce
harmful outputs but can degrade reasoning depth, leading to significant
trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated
jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight
alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at
the start of their reasoning, in response to harmful prompts, while leaving the
rest of the reasoning process unsupervised. Empirical results across multiple
benchmarks indicate that SAFEPATH effectively reduces harmful outputs while
maintaining reasoning performance. Specifically, SAFEPATH reduces harmful
responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the
DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than
Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot
variant that requires no fine-tuning. In addition, we provide a comprehensive
analysis of how existing methods in LLMs generalize, or fail, when applied to
reasoning-centric models, revealing critical gaps and new directions for safer
AI.
[COMMENTS]Accepted at NeurIPS 2025. Code and models are available at
https://ai-isl.github.io/safepath
[LINK]http://arxiv.org/abs/2505.14667v4
[DATE]2025-10-23 20:04:50+08:00
[CATEGORIES]cs.CL
Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
[AUTHORS]Christian Hobelsberger, Theresa Winner, Andreas Nawroth, Oliver Mitevski, Anna-Carolina Haensch
[ABSTRACT]Large language models (LLMs) produce outputs with varying levels of
uncertainty, and, just as often, varying levels of correctness, making their
practical reliability far from guaranteed. To quantify this uncertainty, we
systematically evaluate four approaches for confidence estimation in LLM
outputs: VCE, MSP, Sample Consistency, and CoCoA (Vashurin et al., 2025). For
the evaluation of the approaches, we conduct experiments on four
question-answering tasks using a state-of-the-art open-source LLM. Our results
show that each uncertainty metric captures a different facet of model
confidence and that the hybrid CoCoA approach yields the best reliability
overall, improving both calibration and discrimination of correct answers. We
discuss the trade-offs of each method and provide recommendations for selecting
uncertainty measures in LLM applications.
[LINK]http://arxiv.org/abs/2510.20460v1
[DATE]2025-10-23 19:50:47+08:00
[CATEGORIES]cs.CL
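The abstract above names Sample Consistency as one of the four confidence estimators but does not define it. A minimal sketch, assuming the common reading in which confidence is the share of sampled answers that agree with the majority answer; the function name, the normalization step, and the sample data are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def sample_consistency(answers):
    """Return (majority_answer, agreement_rate) for a list of sampled answers.

    Assumption: answers count as equal after trimming whitespace and
    lowercasing. Real systems often use semantic similarity instead.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        raise ValueError("at least one sampled answer is required")
    majority, count = Counter(normalized).most_common(1)[0]
    return majority, count / len(normalized)


# Hypothetical samples drawn from one model for the same question.
samples = ["Paris", "paris ", "Lyon", "Paris", "Marseille"]
answer, confidence = sample_consistency(samples)
print(answer, confidence)
```

Here three of the five samples agree, so the estimate is 0.6. Exact-match agreement is the simplest choice and will understate consistency when answers are paraphrases of each other.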
LM-mixup: Text Data Augmentation via Language Model based Mixup
[AUTHORS]Zhijie Deng, Zhouan Shen, Ling Li, Yao Zhou, Zhaowei Zhu, Yanji He, Wei Wang, Jiaheng Wei
[ABSTRACT]Instruction tuning is crucial for aligning Large Language Models (LLMs), yet
the quality of instruction-following data varies significantly. While
high-quality data is paramount, it is often scarce; conversely, abundant
low-quality data is frequently discarded, leading to substantial information
loss. Existing data augmentation methods struggle to augment this low-quality
data effectively, and the evaluation of such techniques remains poorly defined.
To address this, we formally define the task of Instruction Distillation:
distilling multiple low-quality and redundant inputs into high-quality and
coherent instruction-output pairs. Specifically, we introduce a comprehensive
data construction pipeline to create MIXTURE, a 144K-sample dataset pairing
low-quality or semantically redundant imperfect instruction clusters with their
high-quality distillations. We then introduce LM-Mixup, trained by first performing
supervised fine-tuning on MIXTURE and then optimizing it with reinforcement
learning. This process uses three complementary reward signals: quality,
semantic alignment, and format compliance, via Group Relative Policy
Optimization (GRPO). We demonstrate that LM-Mixup effectively augments
imperfect datasets: fine-tuning LLMs on its distilled data, which accounts for
only about 3% of the entire dataset, not only surpasses full-dataset training
but also competes with state-of-the-art high-quality data selection methods
across multiple benchmarks. Our work establishes that low-quality data is a
valuable resource when properly distilled and augmented with LM-Mixup,
significantly enhancing the efficiency and performance of instruction-tuned
LLMs.
[LINK]http://arxiv.org/abs/2510.20449v1
[DATE]2025-10-23 19:33:35+08:00
[CATEGORIES]cs.CL
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
[AUTHORS]Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
[ABSTRACT]Speculative decoding is a widely adopted technique for accelerating inference
in large language models (LLMs), yet its application to vision-language models
(VLMs) remains underexplored, with existing methods achieving only modest
speedups (<1.5x). This gap is increasingly significant as multimodal
capabilities become central to large-scale models. We hypothesize that large
VLMs can effectively filter redundant image information layer by layer without
compromising textual comprehension, whereas smaller draft models struggle to do
so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a
novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor
module to compress image tokens into a compact representation, which is
seamlessly integrated into the draft model’s attention mechanism while
preserving original image positional information. Additionally, we extract a
global feature vector for each input image and augment all subsequent text
tokens with this feature to enhance multimodal coherence. To overcome the
scarcity of multimodal datasets with long assistant responses, we curate a
specialized training dataset by repurposing existing datasets and generating
extended outputs using the target VLM with modified prompts. Our training
strategy mitigates the risk of the draft model exploiting direct access to the
target model’s hidden states, which could otherwise lead to shortcut learning
when training solely on target model outputs. Extensive experiments validate
ViSpec, achieving, to our knowledge, the first substantial speedup in VLM
speculative decoding. Code is available at
https://github.com/KangJialiang/ViSpec.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2509.15235v5
[DATE]2025-10-23 18:59:53+08:00
[CATEGORIES]cs.CL
Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
[AUTHORS]Prashant Kodali, Vaishnavi Shivkumar, Swarang Joshi, Monojit Choudhary, Ponnurangam Kumaraguru, Manish Shrivastava
[ABSTRACT]We study model merging as a practical alternative to conventional adaptation
strategies for code-mixed NLP. Starting from a multilingual base model, we: (i)
perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an
adapted checkpoint, (ii) merge the checkpoint with the base model, and (iii)
fine-tune (FT) on the downstream task data. We evaluate our approach on
sentence classification (sentiment and hate speech) tasks in English-Hindi
(En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our
results show that merged models consistently outperform full fine-tuning and
CPT->FT. We observe gains of 2–5 points in F1 over full fine-tuning and ~1–2
points over CPT->FT, indicating that unlabeled data is leveraged more
effectively via merging than via CPT alone. Zero-/few-shot prompting with
larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged
checkpoints, underscoring limits of in-context learning for code-mixed inputs.
We further test cross-pair transfer by training on En-Hi and evaluating on
En-Ta and En-Ml: merged checkpoints transfer more strongly than
monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs
0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more
reliable substrate for low-resource pairs. We conclude with adaptation recipes
matched to common data regimes (labeled only; labeled+unlabeled; transfer-only)
and discuss limitations and scaling considerations for broader tasks and larger
models.
[COMMENTS]9 pages, 5 tables, CODS 2025
[LINK]http://arxiv.org/abs/2510.19782v2
[DATE]2025-10-23 18:53:54+08:00
[CATEGORIES]cs.CL
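Step (ii) of the recipe in the abstract above, merging the CPT checkpoint with the base model, can be illustrated with a plain task-vector merge. This is a minimal sketch, assuming parameters are stored as name-to-list mappings and that the merge is a scaled addition of the difference between the adapted and base weights; the function name, the `alpha` parameter, and the toy weights are hypothetical, and the paper's TV/TIES variants may differ in detail.

```python
def merge_task_vector(base, adapted, alpha=0.5):
    """Return base + alpha * (adapted - base) for every parameter.

    alpha=0 reproduces the base model and alpha=1 the adapted checkpoint.
    """
    if base.keys() != adapted.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, base_values in base.items():
        adapted_values = adapted[name]
        if len(base_values) != len(adapted_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # The task vector is the element-wise difference adapted - base.
        merged[name] = [
            b + alpha * (a - b) for b, a in zip(base_values, adapted_values)
        ]
    return merged


# Toy checkpoints with hypothetical parameter names.
base = {"encoder.weight": [1.0, 2.0], "encoder.bias": [0.0]}
adapted = {"encoder.weight": [2.0, 4.0], "encoder.bias": [1.0]}
merged = merge_task_vector(base, adapted, alpha=0.5)
print(merged)
```

The merged checkpoint would then be fine-tuned on the labeled task data, as in step (iii). The scaling factor is typically tuned on held-out data.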
Less is More: Compact Clue Selection for Efficient Retrieval-Augmented Generation Reasoning
[AUTHORS]Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng
[ABSTRACT]Current RAG retrievers are designed primarily for human readers, emphasizing
complete, readable, and coherent paragraphs. However, LLMs benefit more from
precise, compact, and well-structured input, which enhances reasoning quality
and efficiency. Existing methods often rely on reranking or summarization to
identify key sentences, but may suffer from semantic breaks and unfaithfulness.
Thus, efficiently extracting and organizing answer-relevant clues from
large-scale documents while reducing LLM reasoning costs remains a challenge
for RAG. Inspired by Occam’s razor, we frame LLM-centric retrieval as a MinMax
optimization: maximizing the extraction of potential clues and reranking them
into a well-organized order, while minimizing reasoning costs by truncating to the
smallest sufficient clue set. In this paper, we propose CompSelect, a Compact
clue Selection mechanism for LLM-centric RAG, consisting of a clue extractor, a
reranker, and a truncator. (1) The clue extractor first uses answer-containing
sentences as fine-tuning targets, aiming to extract sufficient potential clues;
(2) The reranker is trained to prioritize effective clues based on real LLM
feedback; (3) The truncator uses the truncated text containing the minimum
sufficient clues for answering the question as fine-tuning targets, thereby
enabling efficient RAG reasoning. Experiments on three QA datasets show that
CompSelect improves QA performance by approximately 11% and reduces Total
Latency and Online Latency by approximately 17% and 67% compared to various
baseline methods on both LLaMA3 and Qwen3. Further analysis confirms its
robustness to unreliable retrieval and generalization across different
scenarios, offering a scalable and cost-efficient solution for web-scale RAG
applications.
[COMMENTS]12 pages, 7 figures, 12 tables, under review
[LINK]http://arxiv.org/abs/2502.11811v6
[DATE]2025-10-23 18:36:01+08:00
[CATEGORIES]cs.CL
Teacher Demonstrations in a BabyLM’s Zone of Proximal Development for Contingent Multi-Turn Interaction
[AUTHORS]Suchir Salhan, Hongyi Gu, Donya Rooein, Diana Galvan-Sosa, Gabrielle Gaudeau, Andrew Caines, Zheng Yuan, Paula Buttery
[COMMENTS]Outstanding Paper Award, EMNLP 2025 BabyLM Workshop - Oral
presentation, Suzhou, China
[LINK]http://arxiv.org/abs/2510.20411v1
[DATE]2025-10-23 18:29:23+08:00
[CATEGORIES]cs.CL
Bi-Mamba: Towards Accurate 1-Bit State Space Models
[AUTHORS]Shengkun Tang, Liqun Ma, Haonan Li, Mingjie Sun, Zhiqiang Shen
[ABSTRACT]The typical Selective State-Space Model (SSM) used in Mamba addresses several
limitations of Transformers, such as the quadratic computational complexity
with respect to sequence length and the significant memory requirements during
inference due to the key-value (KV) cache. However, the increasing size of
Mamba models continues to pose challenges for training and deployment,
particularly due to their substantial computational demands during both
training and inference. In this work, we introduce $\texttt{Bi-Mamba}$, a
scalable and powerful 1-bit Mamba architecture designed to enable more
efficient large language models (LLMs), with model sizes of 780M, 1.3B, and
2.7B parameters. $\texttt{Bi-Mamba}$ models are trained from scratch on a
standard LLM-scale dataset using an autoregressive distillation loss. Extensive
experiments on language modeling benchmarks demonstrate that
$\texttt{Bi-Mamba}$ achieves performance comparable to its full-precision (FP16
or BF16) counterparts, while outperforming post-training binarization (PTB)
Mamba and binarization-aware training (BAT) Transformer baselines. Moreover,
$\texttt{Bi-Mamba}$ drastically reduces memory usage and computational cost
compared to the original Mamba. Our work pioneers a new line of
linear-complexity LLMs under low-bit representation and paves the way for
the design of specialized hardware optimized for efficient 1-bit Mamba-based
models. Code and the pre-trained weights are available at
https://github.com/Tangshengku/Bi-Mamba.
[COMMENTS]Accepted in TMLR 2025
[LINK]http://arxiv.org/abs/2411.11843v2
[DATE]2025-10-23 17:55:50+08:00
[CATEGORIES]cs.CL
MLMA: Towards Multilingual ASR With Mamba-based Architectures
[AUTHORS]Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti
[ABSTRACT]Multilingual automatic speech recognition (ASR) remains a challenging task,
especially when balancing performance across high- and low-resource languages.
Recent advances in sequence modeling suggest that architectures beyond
Transformers may offer better scalability and efficiency. In this work, we
introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new
approach that leverages the Mamba architecture – an efficient state-space
model optimized for long-context sequence processing – for multilingual ASR.
Using Mamba, MLMA implicitly incorporates language-aware conditioning and
shared representations to support robust recognition across diverse languages.
Experiments on standard multilingual benchmarks show that MLMA achieves
competitive performance compared to Transformer-based architectures. These
results highlight Mamba’s potential as a strong backbone for scalable,
efficient, and accurate multilingual speech recognition.
[COMMENTS]The paper is under review at ICASSP 2026
[LINK]http://arxiv.org/abs/2510.18684v2
[DATE]2025-10-23 17:45:28+08:00
[CATEGORIES]cs.CL
Relative-Based Scaling Law for Neural Language Models
[AUTHORS]Baoqing Yue, Jinyuan Zhou, Zixi Wei, Jingtao Zhan, Qingyao Ai, Yiqun Liu
[ABSTRACT]Scaling laws aim to accurately predict model performance across different
scales. Existing scaling-law studies almost exclusively rely on cross-entropy
as the evaluation metric. However, cross-entropy provides only a partial view
of performance: it measures the absolute probability assigned to the correct
token, but ignores the relative ordering between correct and incorrect tokens.
Yet relative ordering is crucial for language models, for example in the
greedy-sampling scenario. To address this limitation, we investigate scaling
from the perspective of relative ordering. We first propose the Relative-Based
Probability (RBP) metric, which quantifies the probability that the correct
token is ranked among the top predictions. Building on this metric, we
establish the Relative-Based Scaling Law, which characterizes how RBP improves
with increasing model size. Through extensive experiments on four datasets and
four model families spanning five orders of magnitude, we demonstrate the
robustness and accuracy of this law. Finally, we illustrate the broad
application of this law with two examples, namely providing a deeper
explanation of emergence phenomena and facilitating finding fundamental
theories of scaling laws. In summary, the Relative-Based Scaling Law
complements the cross-entropy perspective and contributes to a more complete
understanding of scaling large language models. Thus, it offers valuable
insights for both practical development and theoretical exploration.
[LINK]http://arxiv.org/abs/2510.20387v1
[DATE]2025-10-23 17:37:00+08:00
[CATEGORIES]cs.LG cs.CL
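The abstract above describes RBP as the probability that the correct token is ranked among the top predictions. A minimal sketch of an empirical estimate, assuming RBP at cutoff k is the fraction of positions where the correct token falls in the top k by logit; the function name, the tie-handling rule, and the toy logits are illustrative assumptions rather than the authors' code.

```python
def relative_based_probability(logit_rows, targets, k=1):
    """Fraction of positions whose correct token is ranked in the top k.

    Assumption: a token's rank is the number of tokens with a strictly
    higher logit, so ties are resolved in favor of the correct token.
    """
    if not targets or len(logit_rows) != len(targets):
        raise ValueError("need one target per non-empty row of logits")
    hits = 0
    for logits, target in zip(logit_rows, targets):
        rank = sum(1 for score in logits if score > logits[target])
        if rank < k:
            hits += 1
    return hits / len(targets)


# Toy logits over a 3-token vocabulary for four positions.
logit_rows = [
    [2.0, 1.0, 0.1],  # correct token 0 is ranked 1st
    [0.5, 1.5, 1.0],  # correct token 2 is ranked 2nd
    [0.2, 0.1, 3.0],  # correct token 0 is ranked 2nd
    [1.0, 0.9, 0.8],  # correct token 2 is ranked 3rd
]
targets = [0, 2, 0, 2]

for k in (1, 2, 3):
    print(k, relative_based_probability(logit_rows, targets, k=k))
```

Unlike cross-entropy, this quantity depends only on the ordering of the logits, so rescaling them by any positive factor leaves it unchanged. With k=1 it matches next-token accuracy under greedy decoding.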
NeoDictaBERT: Pushing the Frontier of BERT models for Hebrew
[AUTHORS]Shaltiel Shmidman, Avi Shmidman, Moshe Koppel
[ABSTRACT]Since their initial release, BERT models have demonstrated exceptional
performance on a variety of tasks, despite their relatively small size
(BERT-base has ~100M parameters). Nevertheless, the architectural choices used
in these models are outdated compared to newer transformer-based models such as
Llama3 and Qwen3. In recent months, several architectures have been proposed to
close this gap. ModernBERT and NeoBERT both show strong improvements on English
benchmarks and significantly extend the supported context window. Following
their successes, we introduce NeoDictaBERT and NeoDictaBERT-bilingual:
BERT-style models trained using the same architecture as NeoBERT, with a
dedicated focus on Hebrew texts. These models outperform existing ones on
almost all Hebrew benchmarks and provide a strong foundation for downstream
tasks. Notably, the NeoDictaBERT-bilingual model shows strong results on
retrieval tasks, outperforming other multilingual models of similar size. In
this paper, we describe the training process and report results across various
benchmarks. We release the models to the community as part of our goal to
advance research and development in Hebrew NLP.
[LINK]http://arxiv.org/abs/2510.20386v1
[DATE]2025-10-23 17:34:53+08:00
[CATEGORIES]cs.CL
Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution
[AUTHORS]Zhiyang Chen, Daliang Xu, Haiyang Shen, Mengwei Xu, Shangguang Wang, Yun Ma
[ABSTRACT]Enhancing on-device large language models (LLMs) with contextual information
from local data enables personalized and task-aware generation, powering use
cases such as intelligent assistants and UI agents. While recent developments
in neural processors have substantially improved the efficiency of prefill on
mobile devices, the token-by-token generation process still suffers from high
latency and limited hardware utilization due to its inherently memory-bound
characteristics. This work presents sd.npu, a mobile inference framework that
integrates speculative decoding with dynamic hardware scheduling to accelerate
context-aware text generation on mobile devices. The framework introduces three
synergistic components: (1) adaptive execution scheduling, which dynamically
balances compute graphs between prefill and decoding phases; (2)
context-aligned drafting, which improves speculative efficiency through
lightweight online calibration to current tasks; and (3) hardware-efficient
draft extension, which reuses and expands intermediate sequences to improve
processing parallelism and reduce verification cost. Experiments on multiple
smartphones and representative workloads show consistent improvements of up to
3.8x in generation speed and 4.7x in energy efficiency compared with existing
mobile inference solutions. Component-level analysis further validates the
contribution of each optimization.
[LINK]http://arxiv.org/abs/2510.15312v3
[DATE]2025-10-23 17:30:23+08:00
[CATEGORIES]cs.CL
The Impact of Negated Text on Hallucination with Large Language Models
[AUTHORS]Jaehyung Seo, Hyeonseok Moon, Heuiseok Lim
[COMMENTS]Accepted to EMNLP 2025
[LINK]http://arxiv.org/abs/2510.20375v1
[DATE]2025-10-23 17:20:15+08:00
[CATEGORIES]cs.CL
More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
[AUTHORS]Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
[ABSTRACT]Retrieval-Augmented Generation (RAG) enhances the accuracy of Large Language
Model (LLM) responses by leveraging relevant external documents during
generation. Although previous studies noted that retrieving many documents can
degrade performance, they did not isolate how the quantity of documents affects
performance while controlling for context length. We evaluate various language
models on custom datasets derived from a multi-hop QA task. We keep the context
length and position of relevant information constant while varying the number
of documents, and find that increasing the document count in RAG settings poses
significant challenges for most LLMs, reducing performance by up to 20%.
However, Qwen2.5 maintained consistent results across increasing document
counts, indicating better multi-document handling capability. Finally, our
results indicate that processing multiple documents is a separate challenge
from handling long contexts. We also make the datasets and code available:
https://github.com/shaharl6000/MoreDocsSameLen .
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2503.04388v2
[DATE]2025-10-23 17:06:07+08:00
[CATEGORIES]cs.CL
Dialogue Is Not Enough to Make a Communicative BabyLM (But Neither Is Developmentally Inspired Reinforcement Learning)
[AUTHORS]Francesca Padovani, Bastian Bunzeck, Manar Ali, Omar Momen, Arianna Bisazza, Hendrik Buschmeier, Sina Zarrieß
[ABSTRACT]We investigate whether pre-training exclusively on dialogue data results in
formally and functionally apt small language models. Based on this pre-trained
llamalogue model, we employ a variety of fine-tuning strategies to enforce
“more communicative” text generations by our models. Although our models
underperform on most standard BabyLM benchmarks, they excel at dialogue
continuation prediction in a minimal pair setting. While PPO fine-tuning has
mixed to adversarial effects on our models, DPO fine-tuning further improves
their performance on our custom dialogue benchmark.
[LINK]http://arxiv.org/abs/2510.20358v1
[DATE]2025-10-23 16:57:56+08:00
[CATEGORIES]cs.CL
FreeChunker: A Cross-Granularity Chunking Framework
[AUTHORS]Wenxuan Zhang, Yuan-Hao Jiang, Yonghe Wu
[ABSTRACT]Chunking strategies significantly impact the effectiveness of
Retrieval-Augmented Generation (RAG) systems. Existing methods operate within
fixed-granularity paradigms that rely on static boundary identification,
limiting their adaptability to diverse query requirements. This paper presents
FreeChunker, a Cross-Granularity Encoding Framework that fundamentally
transforms the traditional chunking paradigm: the framework treats sentences as
atomic units and shifts from static chunk segmentation to flexible retrieval
supporting arbitrary sentence combinations. This paradigm shift not only
significantly reduces the computational overhead required for semantic boundary
detection but also enhances adaptability to complex queries. Experimental
evaluation on LongBench V2 demonstrates that FreeChunker achieves superior
retrieval performance compared to traditional chunking methods, while
significantly outperforming existing approaches in computational efficiency.
[COMMENTS]Submitted to arXiv, October 2025
[LINK]http://arxiv.org/abs/2510.20356v1
[DATE]2025-10-23 16:57:00+08:00
[CATEGORIES]cs.CL
Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
[AUTHORS]Matteo Silvestri, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei
[ABSTRACT]Large Language Models (LLMs) are increasingly evaluated on their ability to
reason over structured data, yet such assessments often overlook a crucial
confound: dataset contamination. In this work, we investigate whether LLMs
exhibit prior knowledge of widely used tabular benchmarks such as Adult Income,
Titanic, and others. Through a series of controlled probing experiments, we
reveal that contamination effects emerge exclusively for datasets containing
strong semantic cues, for instance meaningful column names or interpretable
value categories. In contrast, when such cues are removed or randomized,
performance sharply declines to near-random levels. These findings suggest that
LLMs’ apparent competence on tabular reasoning tasks may, in part, reflect
memorization of publicly available datasets rather than genuine generalization.
We discuss implications for evaluation protocols and propose strategies to
disentangle semantic leakage from authentic reasoning ability in future LLM
assessments.
[LINK]http://arxiv.org/abs/2510.20351v1
[DATE]2025-10-23 16:51:14+08:00
[CATEGORIES]cs.CL
Teaching Language Models to Reason with Tools
[AUTHORS]Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
[ABSTRACT]Large reasoning models (LRMs) like OpenAI-o1 have shown impressive
capabilities in natural language reasoning. However, these models frequently
demonstrate inefficiencies or inaccuracies when tackling complex mathematical
operations. While integrating computational tools such as Code Interpreters
(CIs) offers a promising solution, it introduces a critical challenge: a
conflict between the model’s internal, probabilistic reasoning and the
external, deterministic knowledge provided by the CI, which often leads models
to unproductive deliberation. To overcome this, we introduce CoRT
(Code-Optimized Reasoning Training), a post-training framework designed to
teach LRMs to effectively utilize CIs. We propose \emph{Hint-Engineering}, a
new data synthesis strategy that strategically injects diverse hints at optimal
points within reasoning paths. This approach generates high-quality,
code-integrated reasoning data specifically tailored to optimize LRM-CI
interaction. Using this method, we have synthesized 30 high-quality samples to
post-train models ranging from 1.5B to 32B parameters through supervised
fine-tuning. CoRT further refines the multi-round interleaving of external CI
usage and internal thinking by employing rejection sampling and reinforcement
learning. Our experimental evaluations demonstrate CoRT’s effectiveness,
yielding absolute improvements of 4% and 8% on DeepSeek-R1-Distill-Qwen-32B
and DeepSeek-R1-Distill-Qwen-1.5B, respectively, across five challenging
mathematical reasoning datasets. Moreover, CoRT significantly enhances
efficiency, reducing token usage by approximately 30% for the 32B model and
50% for the 1.5B model compared to pure natural language reasoning baselines.
The models and code are available at: https://github.com/ChengpengLi1003/CoRT.
[COMMENTS]NIPS2025 Accepted
[LINK]http://arxiv.org/abs/2510.20342v1
[DATE]2025-10-23 16:41:44+08:00
[CATEGORIES]cs.CL
A New Benchmark Dataset and Mixture-of-Experts Language Models for Adversarial Natural Language Inference in Vietnamese
[AUTHORS]Tin Van Huynh, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen
[ABSTRACT]Existing Vietnamese Natural Language Inference (NLI) datasets lack
adversarial complexity, limiting their ability to evaluate model robustness
against challenging linguistic phenomena. In this article, we address the gap
in robust Vietnamese NLI resources by introducing ViANLI, the first adversarial
NLI dataset for Vietnamese, and propose NLIMoE, a Mixture-of-Experts model to
tackle its complexity. We construct ViANLI using an adversarial
human-and-machine-in-the-loop approach with rigorous verification. NLIMoE
integrates expert subnetworks with a learned dynamic routing mechanism on top
of a shared transformer encoder. ViANLI comprises over 10,000
premise-hypothesis pairs and challenges state-of-the-art models, with XLM-R
Large achieving only 45.5% accuracy, while NLIMoE reaches 47.3%. Training with
ViANLI improves performance on other benchmark Vietnamese NLI datasets
including ViNLI, VLSP2021-NLI, and VnNewsNLI. ViANLI is released for enhancing
research into model robustness and enriching resources for future Vietnamese
and multilingual NLI research.
[COMMENTS]Accepted by Expert Systems with Applications
[LINK]http://arxiv.org/abs/2406.17716v3
[DATE]2025-10-23 16:39:36+08:00
[CATEGORIES]cs.CL
Born a Transformer – Always a Transformer? On the Effect of Pretraining on Architectural Abilities
[AUTHORS]Mayank Jobanputra, Yana Veitsman, Yash Sarrof, Aleksandra Bakalova, Vera Demberg, Ellie Pavlick, Michael Hahn
[ABSTRACT]Transformers have theoretical limitations in modeling certain
sequence-to-sequence tasks, yet it remains largely unclear if these limitations
play a role in large-scale pretrained LLMs, or whether LLMs might effectively
overcome these constraints in practice due to the scale of both the models
themselves and their pretraining data. We explore how these architectural
constraints manifest after pretraining, by studying a family of
$\textit{retrieval}$ and $\textit{copying}$ tasks inspired by Liu et al.
[2024a]. We use a recently proposed framework for studying length
generalization [Huang et al., 2025] to provide guarantees for each of our
settings. Empirically, we observe an $\textit{induction-versus-anti-induction}$
asymmetry, where pretrained models are better at retrieving tokens to the right
(induction) rather than the left (anti-induction) of a query token. This
asymmetry disappears upon targeted fine-tuning if length-generalization is
guaranteed by theory. Mechanistic analysis reveals that this asymmetry is
connected to the differences in the strength of induction versus anti-induction
circuits within pretrained transformers. We validate our findings through
practical experiments on real-world tasks demonstrating reliability risks. Our
results highlight that pretraining selectively enhances certain transformer
capabilities, but does not overcome fundamental length-generalization limits.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.21785v3
[DATE]2025-10-23 16:30:36+08:00
[CATEGORIES]cs.LG cs.CL
“You Are Rejected!”: An Empirical Study of Large Language Models Taking Hiring Evaluations
[AUTHORS]Dingjie Fu, Dianxing Shi
[ABSTRACT]With the proliferation of the internet and the rapid advancement of
Artificial Intelligence, leading technology companies face an urgent annual
demand for a considerable number of software and algorithm engineers. To
efficiently and effectively identify high-potential candidates from thousands
of applicants, these firms have established a multi-stage selection process,
which crucially includes a standardized hiring evaluation designed to assess
job-specific competencies. Motivated by the demonstrated prowess of Large
Language Models (LLMs) in coding and reasoning tasks, this paper investigates a
critical question: Can LLMs successfully pass these hiring evaluations? To this
end, we conduct a comprehensive examination of a widely used professional
assessment questionnaire. We employ state-of-the-art LLMs to generate responses
and subsequently evaluate their performance. Contrary to any prior expectation
of LLMs being ideal engineers, our analysis reveals a significant inconsistency
between the model-generated answers and the company-referenced solutions. Our
empirical findings lead to a striking conclusion: All evaluated LLMs fail to
pass the hiring evaluation.
[COMMENTS]Technical Report, 14 pages, 8 figures
[LINK]http://arxiv.org/abs/2510.19167v2
[DATE]2025-10-23 16:28:01+08:00
[CATEGORIES]cs.CL
The Mirror Loop: Recursive Non-Convergence in Generative Reasoning Systems
[AUTHORS]Bentley DeVilling
[ABSTRACT]Large language models are often described as capable of reflective reasoning,
yet recursive self-evaluation without external feedback frequently yields
reformulation rather than progress. We test this prediction in a cross-provider
study of 144 reasoning sequences across three models (OpenAI GPT-4o-mini,
Anthropic Claude 3 Haiku, and Google Gemini 2.0 Flash) and four task families
(arithmetic, code, explanation, reflection), each iterated ten times under two
conditions: ungrounded self-critique and a minimal grounding intervention (a
single verification step at iteration three). Mean informational change (delta
I, measured via normalized edit distance) declined by 55% from early (0.193) to
late (0.087) iterations in ungrounded runs, with consistent patterns across all
three providers. Grounded runs showed a +28% rebound in informational change
immediately after the intervention and sustained non-zero variance thereafter.
Complementary measures (n-gram novelty, embedding drift, and character-level
entropy) converged on the same pattern: reflection without contact tends toward
informational closure. We interpret this as evidence for a structural limit on
self-correction in generative reasoning: without an exchange of information
with an independent verifier or environment, recursive inference approaches an
attractor state of epistemic stasis. Minimal grounding functions as dissipative
coupling, reintroducing informational flux. The cross-architecture consistency
suggests the mirror loop arises from shared autoregressive training objectives
rather than provider-specific alignment schemes. The results delineate when
reflection is performative rather than epistemic and motivate design principles
for grounded, cooperative reasoning. Materials and code are publicly available.
[COMMENTS]18 pages, 2 figures. Category: cs.LG. Code and data:
https://github.com/Course-Correct-Labs/mirror-loop
[LINK]http://arxiv.org/abs/2510.21861v1
[DATE]2025-10-23 15:53:26+08:00
[CATEGORIES]cs.LG cs.CL
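The informational-change measure in the Mirror Loop abstract above is described as a normalized edit distance between iterations. A minimal sketch follows, assuming character-level Levenshtein distance divided by the length of the longer string; the abstract does not state the exact normalization, so that choice and all function names here are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


def informational_change(iterations):
    """Change between each pair of consecutive iterations of a response."""
    return [normalized_edit_distance(x, y)
            for x, y in zip(iterations, iterations[1:])]


# Arithmetic check of the reported decline from 0.193 to 0.087.
relative_decline = (0.193 - 0.087) / 0.193
```

The last line reproduces the reported figure: (0.193 - 0.087) / 0.193 is roughly 0.55, i.e. the 55% decline quoted in the abstract.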
Exploring Generative Process Reward Modeling for Semi-Structured Data: A Case Study of Table Question Answering
[AUTHORS]Lei Tang, Wei Zhou, Mohsen Mesgar
[ABSTRACT]Process reward models (PRMs) improve complex reasoning in large language
models (LLMs) by grading candidate solutions step-by-step and selecting answers
via aggregated step scores. While effective in domains such as mathematics,
their applicability to tasks involving semi-structured data, such as table
question answering (TQA), remains unexplored. TQA poses unique challenges for
PRMs, including abundant irrelevant information, loosely connected reasoning
steps, and domain-specific reasoning. This work presents the first systematic
study of PRMs for TQA. We evaluate state-of-the-art generative PRMs on TQA from
both answer and step perspectives. Results show that PRMs that combine textual
and code verification can aid solution selection but struggle to generalize to
out-of-domain data. Analysis reveals a weak correlation between performance in
step-level verification and answer accuracy, possibly stemming from weak step
dependencies and loose causal links. Our findings highlight limitations of
current PRMs on TQA and offer valuable insights for building more robust,
process-aware verifiers.
[LINK]http://arxiv.org/abs/2510.20304v1
[DATE]2025-10-23 15:49:39+08:00
[CATEGORIES]cs.CL
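The PRM abstract above describes selecting answers via aggregated step scores. A minimal sketch of that selection step follows; the abstract does not say which aggregation is used, so the `min`, `prod`, and `mean` options, the dictionary layout, and the scores are illustrative assumptions.

```python
import math


def aggregate(step_scores, how="min"):
    """Collapse per-step scores into one solution-level score."""
    if not step_scores:
        raise ValueError("need at least one step score")
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def select_answer(candidates, how="min"):
    """Return the answer of the candidate with the highest aggregated score."""
    best = max(candidates, key=lambda c: aggregate(c["step_scores"], how))
    return best["answer"]


# Hypothetical candidates: A has one weak step, B is uniformly moderate.
candidates = [
    {"answer": "A", "step_scores": [0.95, 0.3, 0.95]},
    {"answer": "B", "step_scores": [0.7, 0.6, 0.7]},
]
```

On these made-up scores, `min` and `prod` prefer B while `mean` prefers A, which shows that the choice of aggregation can change the selected answer.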
LFD: Layer Fused Decoding to Exploit External Knowledge in Retrieval-Augmented Generation
[AUTHORS]Yang Sun, Zhiyong Xie, Dan Luo, Long Zhang, Liming Dong, Yunwei Zhao, Xixun Lin, Yanxiong Lu, Chenliang Li, Lixin Zou
[ABSTRACT]Retrieval-augmented generation (RAG) incorporates external knowledge into
large language models (LLMs), improving their adaptability to downstream tasks
and enabling information updates. Surprisingly, recent empirical evidence
demonstrates that injecting noise into retrieved relevant documents
paradoxically facilitates exploitation of external knowledge and improves
generation quality. Although counterintuitive and challenging to apply in
practice, this phenomenon enables granular control and rigorous analysis of how
LLMs integrate external knowledge. Therefore, in this paper, we intervene on
noise injection and establish a layer-specific functional demarcation within
the LLM: shallow layers specialize in local context modeling, intermediate
layers focus on integrating long-range external factual knowledge, and deeper
layers primarily rely on parametric internal knowledge. Building on this
insight, we propose Layer Fused Decoding (LFD), a simple decoding strategy that
directly combines representations from an intermediate layer with final-layer
decoding outputs to fully exploit the external factual knowledge. To identify
the optimal intermediate layer, we introduce an internal knowledge score (IKS)
criterion that selects the layer with the lowest IKS value in the latter half
of layers. Experimental results across multiple benchmarks demonstrate that LFD
helps RAG systems more effectively surface retrieved context knowledge with
minimal cost.
[LINK]http://arxiv.org/abs/2508.19614v2
[DATE]2025-10-23 14:59:59+08:00
[CATEGORIES]cs.CL
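The layer-selection rule in the LFD abstract above (lowest IKS value in the latter half of layers) can be sketched as follows. How IKS is computed is not given in the abstract, so the scores below are hypothetical inputs, and reading "latter half" as 0-based indices n//2 to n-1 is an assumption.

```python
def select_fusion_layer(iks_per_layer):
    """Index of the layer with the lowest IKS among the latter half of layers.

    `iks_per_layer[i]` is the (externally computed) IKS of layer i, 0-based.
    """
    n = len(iks_per_layer)
    if n == 0:
        raise ValueError("need at least one layer")
    start = n // 2  # assumed boundary of the "latter half"
    return min(range(start, n), key=lambda i: iks_per_layer[i])


# Hypothetical scores for a 6-layer model.
iks = [0.1, 0.9, 0.8, 0.5, 0.3, 0.7]
chosen = select_fusion_layer(iks)
```

Here layer 0 has the lowest score overall, but it lies in the first half, so the rule picks layer 4.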
Multilingual LLM Prompting Strategies for Medical English-Vietnamese Machine Translation
[AUTHORS]Nhu Vo, Nu-Uyen-Phuong Le, Dung D. Le, Massimo Piccardi, Wray Buntine
[ABSTRACT]Medical English-Vietnamese machine translation (En-Vi MT) is essential for
healthcare access and communication in Vietnam, yet Vietnamese remains a
low-resource and under-studied language. We systematically evaluate prompting
strategies for six multilingual LLMs (0.5B-9B parameters) on the MedEV dataset,
comparing zero-shot, few-shot, and dictionary-augmented prompting with Meddict,
an English-Vietnamese medical lexicon. Results show that model scale is the
primary driver of performance: larger LLMs achieve strong zero-shot results,
while few-shot prompting yields only marginal improvements. In contrast,
terminology-aware cues and embedding-based example retrieval consistently
improve domain-specific translation. These findings underscore both the promise
and the current limitations of multilingual LLMs for medical En-Vi MT.
[COMMENTS]This version has been withdrawn after receiving the conference review
results. We are currently extending and reorganizing the work into a new
study
[LINK]http://arxiv.org/abs/2509.15640v2
[DATE]2025-10-23 14:55:37+08:00
[CATEGORIES]cs.CL
ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining
[AUTHORS]Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
[COMMENTS]Accepted at EMNLP 2025 Industry Track
[LINK]http://arxiv.org/abs/2507.06795v4
[DATE]2025-10-23 14:41:59+08:00
[CATEGORIES]cs.CL cs.LG
Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
[AUTHORS]Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
[ABSTRACT]In this technical report, we present the Ring-linear model series,
specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0.
Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while
Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both
models adopt a hybrid architecture that effectively integrates linear attention
and softmax attention, significantly reducing I/O and computational overhead in
long-context inference scenarios. Compared to a 32 billion parameter dense
model, this series reduces inference cost to 1/10, and compared to the original
Ring series, the cost is also reduced by over 50%. Furthermore, through
systematic exploration of the ratio between different attention mechanisms in
the hybrid architecture, we have identified the currently optimal model
structure. Additionally, by leveraging our self-developed high-performance FP8
operator library, linghe, overall training efficiency has been improved by 50%.
Benefiting from the high alignment between the training and inference engine
operators, the models can undergo long-term, stable, and highly efficient
optimization during the reinforcement learning phase, consistently maintaining
SOTA performance across multiple challenging complex reasoning benchmarks.
[COMMENTS]20 pages, 13 figures
[LINK]http://arxiv.org/abs/2510.19338v2
[DATE]2025-10-23 14:33:17+08:00
[CATEGORIES]cs.LG cs.CL
TianHui: A Domain-Specific Large Language Model for Diverse Traditional Chinese Medicine Scenarios
[AUTHORS]Ji Yin, Menglan He, Yujie Zhang, Linshuai Zhang, Tingting Ma, Ce Tian, Jie Wu, Lin Xu, Tao Jiang
[ABSTRACT]Domain-specific LLMs in TCM face limitations in research settings due to
constrained adaptability, insufficient evaluation datasets, and limited
computational resources. This study presents TianHui, a specialized TCM LLM
built through contextual data integration and domain knowledge fusion. We
constructed a large-scale TCM corpus (0.97GB unsupervised data + 611,312 QA
pairs) and employed a two-stage training strategy with QLoRA, DeepSpeed Stage
2, and Flash Attention 2. Evaluation on 12 benchmarks showed TianHui ranked
top-three in all metrics for six datasets (APQ, TCMCD, HFR, HCCA, DHPE, TLAW)
and achieved top results in the other six (TCMEE, APR, GCPMI, TCMKQA, TCMRC,
ADTG). Optimal configuration was identified as LoRA rank=128, alpha=256,
epoch=4, dropout=0.2, max length=2048. TianHui enables systematic preservation
and scalable application of TCM knowledge. All resources are open-sourced.
[COMMENTS]46 pages, 5 figures, 3 tables
[LINK]http://arxiv.org/abs/2509.19834v2
[DATE]2025-10-23 14:29:28+08:00
[CATEGORIES]cs.CL
Calibrating Multimodal Consensus for Emotion Recognition
[AUTHORS]Guowei Zhong, Junjie Li, Huaiyu Zhu, Ruohong Huan, Yun Pan
[ABSTRACT]In recent years, Multimodal Emotion Recognition (MER) has made substantial
progress. Nevertheless, most existing approaches neglect the semantic
inconsistencies that may arise across modalities, such as conflicting emotional
cues between text and visual inputs. Besides, current methods are often
dominated by the text modality due to its strong representational capacity,
which can compromise recognition accuracy. To address these challenges, we
propose a model termed Calibrated Multimodal Consensus (CMC). CMC introduces a
Pseudo Label Generation Module (PLGM) to produce pseudo unimodal labels,
enabling unimodal pretraining in a self-supervised fashion. It then employs a
Parameter-free Fusion Module (PFM) and a Multimodal Consensus Router (MCR) for
multimodal finetuning, thereby mitigating text dominance and guiding the fusion
process toward a more reliable consensus. Experimental results demonstrate that
CMC achieves performance on par with or superior to state-of-the-art methods
across four datasets, CH-SIMS, CH-SIMS v2, CMU-MOSI, and CMU-MOSEI, and
exhibits notable advantages in scenarios with semantic inconsistencies on
CH-SIMS and CH-SIMS v2. The implementation of this work is publicly accessible
at https://github.com/gw-zhong/CMC.
[LINK]http://arxiv.org/abs/2510.20256v1
[DATE]2025-10-23 14:25:10+08:00
[CATEGORIES]cs.CL cs.LG
Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models
[AUTHORS]Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Yifu Lu, Mengdi Wang, Dinesh Manocha, Furong Huang, Mohammad Ghavamzadeh, Amrit Singh Bedi
[ABSTRACT]Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1,
DeepSeek R1) have led to a popular belief that extending thinking traces using
prompts like “Wait” or “Let me rethink” can improve performance. This raises a
natural question: Does thinking more at test-time truly lead to better
reasoning? To answer this question, we perform a detailed empirical study
across models and benchmarks, which reveals a consistent pattern of initial
performance improvements from additional thinking followed by a decline, due to
“overthinking”. To understand this non-monotonic trend, we consider a simple
probabilistic model, which reveals that additional thinking increases output
variance, creating an illusion of improved reasoning while ultimately
undermining precision. Thus, observed gains from “more thinking” are not true
indicators of improved reasoning, but artifacts stemming from the connection
between model uncertainty and evaluation metric. This suggests that test-time
scaling through extended thinking is not an effective way to utilize the
inference thinking budget. Recognizing these limitations, we introduce an
alternative test-time scaling approach, parallel thinking, inspired by
Best-of-N sampling. Our method generates multiple independent reasoning paths
within the same inference budget and selects the most consistent response via
majority vote, achieving up to 20% higher accuracy compared to extended
thinking. This provides a simple yet effective mechanism for test-time scaling
of reasoning models.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.04210v3
[DATE]2025-10-23 14:17:53+08:00
[CATEGORIES]cs.CL
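The parallel-thinking idea in the abstract above (several independent reasoning paths, then a majority vote) can be sketched as follows. The `sample_answer` callable stands in for a model call and is hypothetical; how the inference budget is split across paths is not modeled here.

```python
from collections import Counter


def majority_vote(answers):
    """Most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def parallel_thinking(sample_answer, n_paths):
    """Draw `n_paths` independent answers and return the majority answer.

    `sample_answer(i)` is a hypothetical stand-in for one independent
    reasoning path that ends in a final answer string.
    """
    answers = [sample_answer(i) for i in range(n_paths)]
    return majority_vote(answers)


# Stub sampler for illustration only: paths 0, 1, 3, 4 say "7", path 2 says "9".
stub = lambda i: ["7", "7", "9"][i % 3]
result = parallel_thinking(stub, 5)
```

With the stub sampler, four of the five paths agree on "7", so that is the selected answer.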
MLP Memory: A Retriever-Pretrained Memory for Large Language Models
[AUTHORS]Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, Zhouhan Lin
[ABSTRACT]Modern approaches to enhancing Large Language Models’ factual accuracy and
knowledge utilization face a fundamental trade-off: non-parametric
retrieval-augmented generation (RAG) provides flexible access to external
knowledge but suffers from high inference latency and shallow integration,
while parametric fine-tuning methods like LoRA risk catastrophic forgetting and
degraded general capabilities. In this work, we propose MLP Memory, a
lightweight parametric module that learns to internalize retrieval patterns
without explicit document access. By pretraining an MLP to imitate a $k$NN
retriever’s behavior on the entire pretraining dataset, we create a
differentiable memory component that captures the benefits of retrieval-based
knowledge access in a fully parametric form. Our architecture integrates this
pretrained MLP Memory with Transformer decoders through simple probability
interpolation, yielding 17.5% and 24.1% scaling gains on WikiText-103 and Web
datasets, respectively. It further achieves 12.3% relative improvement on five
question-answering benchmarks and 5.2 points absolute gain across nine general
NLP tasks, while reducing hallucinations by up to 10 points on HaluEval.
Moreover, MLP Memory delivers 2.5$\times$ faster inference than RAG with
superior accuracy. Our findings show that learning retrieval patterns
parametrically bridges the gap between efficient inference and effective
knowledge access, offering a practical alternative to both RAG and fine-tuning
approaches.
[LINK]http://arxiv.org/abs/2508.01832v3
[DATE]2025-10-23 13:46:50+08:00
[CATEGORIES]cs.CL
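The "simple probability interpolation" mentioned in the MLP Memory abstract above can be sketched as a convex mix of two next-token distributions. The weight `lam` and the toy distributions are assumptions; the abstract does not give the interpolation weight or say how it is chosen.

```python
def interpolate(p_lm, p_mem, lam):
    """Mix two next-token distributions: lam * p_mem + (1 - lam) * p_lm."""
    if len(p_lm) != len(p_mem):
        raise ValueError("distributions must cover the same vocabulary")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return [lam * m + (1.0 - lam) * l for l, m in zip(p_lm, p_mem)]


# Toy two-token vocabulary: the decoder is unsure, the memory is confident.
mixed = interpolate([0.5, 0.5], [1.0, 0.0], lam=0.5)
```

Because both inputs sum to 1 and the weights sum to 1, the mixed distribution also sums to 1; here it is [0.75, 0.25].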
Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders
[AUTHORS]Filippo Cenacchi, Deborah Richards, Longbing Cao
[ABSTRACT]Depression and post traumatic stress disorder (PTSD) often co-occur with
connected symptoms, complicating automated assessment, which is often binary
and disorder specific. Clinically useful diagnosis needs severity aware cross
disorder estimates and decision support explanations. Our unified tri modal
affective severity framework synchronizes and fuses interview text with
sentence level transformer embeddings, audio with log Mel statistics with
deltas, and facial signals with action units, gaze, head and pose descriptors
to output graded severities for diagnosing both depression (PHQ-8; 5 classes)
and PTSD (3 classes). Standardized features are fused via a calibrated late
fusion classifier, yielding per disorder probabilities and feature-level
attributions. This severity aware tri-modal affective fusion approach is demoed
on multi disorder concurrent depression and PTSD assessment. Stratified cross
validation on DAIC derived corpora outperforms unimodal/ablation baselines. The
fused model matches the strongest unimodal baseline on accuracy and weighted
F1, while improving decision curve utility and robustness under noisy or
missing modalities. For PTSD specifically, fusion reduces regression error and
improves class concordance. Errors cluster between adjacent severities; extreme
classes are identified reliably. Ablations show text contributes most to
depression severity, audio and facial cues are critical for PTSD, whereas
attributions align with linguistic and behavioral markers. Our approach offers
reproducible evaluation and clinician in the loop support for affective
clinical decision making.
[LINK]http://arxiv.org/abs/2510.20239v1
[DATE]2025-10-23 13:46:38+08:00
[CATEGORIES]cs.CL
Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?
[AUTHORS]Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li
[ABSTRACT]Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving
the performance of large language models through collaborative reasoning.
Despite recent advances, the key factors driving MAD’s effectiveness remain
unclear. In this work, we disentangle MAD into two key components, Majority
Voting and inter-agent Debate, and assess their respective contributions.
Through extensive experiments across seven NLP benchmarks, we find that
Majority Voting alone accounts for most of the performance gains typically
attributed to MAD. To explain this, we propose a theoretical framework that
models debate as a stochastic process. We prove that it induces a martingale
over agents’ belief trajectories, implying that debate alone does not improve
expected correctness. Guided by these insights, we demonstrate that targeted
interventions, by biasing the belief update toward correction, can meaningfully
enhance debate effectiveness. Overall, our findings suggest that while MAD has
potential, simple ensembling methods remain strong and more reliable
alternatives in many practical settings. Code is released at
https://github.com/deeplearning-wisc/debate-or-vote.
[COMMENTS]NeurIPS 2025 Spotlight
[LINK]http://arxiv.org/abs/2508.17536v2
[DATE]2025-10-23 13:44:57+08:00
[CATEGORIES]cs.CL
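The step from "martingale" to "debate alone does not improve expected correctness" in the abstract above can be spelled out with the tower property. The notation is an assumption made here for illustration: $p_t$ denotes an agent's belief in the correct answer after debate round $t$, and $\mathcal{F}_t$ the debate history up to that round.

```latex
\mathbb{E}\left[p_{t+1} \mid \mathcal{F}_t\right] = p_t
\quad\Longrightarrow\quad
\mathbb{E}[p_T]
  = \mathbb{E}\big[\mathbb{E}[p_T \mid \mathcal{F}_{T-1}]\big]
  = \mathbb{E}[p_{T-1}]
  = \dots
  = \mathbb{E}[p_0].
```

Under this reading, the expected belief in the correct answer after any number of rounds equals its initial value, which is consistent with the abstract's claim that gains come mainly from voting rather than from the debate rounds themselves.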
Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction
[AUTHORS]Ying-Ting Yeh, Janghoon Ock, Achuth Chandrasekhar, Shagun Maheshwari, Amir Barati Farimani
[ABSTRACT]We investigate transformer-based language models, including RoBERTa, T5,
Llama-3, and MatSciBERT, for predicting the band gaps of semiconductor
materials directly from textual descriptions. The inputs encode key material
features, such as chemical composition, crystal system, space group, and other
structural and electronic properties. Unlike shallow machine learning models,
which require extensive feature engineering, or Graph Neural Networks, which
rely on graph representations derived from atomic coordinates, pretrained
language models can process textual inputs directly, eliminating the need for
manual feature preprocessing or structure-based encoding. Material descriptions
were constructed in two formats: structured strings with a consistent template
and natural language narratives generated via the ChatGPT API. Each model was
augmented with a custom regression head and finetuned for the band gap prediction
task. Language models of different architectures and parameter sizes were all
able to predict band gaps from human-readable text with strong accuracy,
achieving MAEs in the range of 0.25-0.33 eV, highlighting the success of this
approach for scientific regression tasks. Finetuned Llama-3, with 1.2 billion
parameters, achieved the highest accuracy (MAE 0.248 eV, $R^2$ 0.891). MatSciBERT,
pretrained on materials science literature, reached comparable performance (MAE
0.288 eV, $R^2$ 0.871) with significantly fewer parameters (110 million),
emphasizing the importance of domain-specific pretraining. Attention analysis
shows that both models selectively focus on compositional and spin-related
features while de-emphasizing geometric features, reflecting the difficulty of
capturing spatial information from text. These results establish that
pretrained language models can effectively extract complex feature-property
relationships from textual material descriptions.
[LINK]http://arxiv.org/abs/2501.03456v3
[DATE]2025-10-23 13:31:46+08:00
[CATEGORIES]cs.CL
KAT-Coder Technical Report
[AUTHORS]Zizheng Zhan, Ken Deng, Xiaojiang Zhang, Jinghui Wang, Huaixi Tang, Zhiyi Lai, Haoyang Huang, Wen Xiang, Kun Wu, Wenhao Zhuang, Minglei Zhang, Shaojie Wang, Shangpeng Yan, Kepeng Lei, Zongxian Feng, Huiming Wang, Zheng Lin, Mengtong Li, Mengfei Xie, Yinghan Cui, Xuxing Chen, Chao Wang, Weihao Li, Wenqiang Zhu, Jiarong Zhang, Jingxuan Xu, Songwei Yu, Yifan Yao, Xinping Lei, C. Zhang, Han Li, Junqi Xiong, Zuchen Gao, Dailin Li, Haimo Li, Jiaheng Liu, Yuqun Zhang, Junyi Peng, Haotian Zhang, Bin Chen
[ABSTRACT]Recent advances in large language models (LLMs) have enabled progress in
agentic coding, where models autonomously reason, plan, and act within
interactive software development workflows. However, bridging the gap between
static text-based training and dynamic real-world agentic execution remains a
core challenge. In this technical report, we present KAT-Coder, a large-scale
agentic code model trained through a multi-stage curriculum encompassing
Mid-Term Training, Supervised Fine-Tuning (SFT), Reinforcement Fine-Tuning
(RFT), and Reinforcement-to-Deployment Adaptation. The Mid-Term stage enhances
reasoning, planning, and reflection capabilities through a corpus of real
software engineering data and synthetic agentic interactions. The SFT stage
constructs a million-sample dataset balancing twenty programming languages, ten
development contexts, and ten task archetypes. The RFT stage introduces a novel
multi-ground-truth reward formulation for stable and sample-efficient policy
optimization. Finally, the Reinforcement-to-Deployment phase adapts the model
to production-grade IDE environments using Error-Masked SFT and Tree-Structured
Trajectory Training. In summary, these stages enable KAT-Coder to achieve
robust tool-use reliability, instruction alignment, and long-context reasoning,
forming a deployable foundation for real-world intelligent coding agents. Our
KAT series 32B model, KAT-Dev, has been open-sourced on
https://huggingface.co/Kwaipilot/KAT-Dev.
[LINK]http://arxiv.org/abs/2510.18779v2
[DATE]2025-10-23 13:23:21+08:00
[CATEGORIES]cs.CL
A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System
[AUTHORS]Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam
[ABSTRACT]The integration of Large Language Models (LLMs) into software engineering has
driven a transition from traditional rule-based systems to autonomous agentic
systems capable of solving complex problems. However, systematic progress is
hindered by a lack of comprehensive understanding of how benchmarks and
solutions interconnect. This survey addresses this gap by providing the first
holistic analysis of LLM-powered software engineering, offering insights into
evaluation methodologies and solution paradigms. We review over 150 recent
papers and propose a taxonomy along two key dimensions: (1) Solutions,
categorized into prompt-based, fine-tuning-based, and agent-based paradigms,
and (2) Benchmarks, including tasks such as code generation, translation, and
repair. Our analysis highlights the evolution from simple prompt engineering to
sophisticated agentic systems incorporating capabilities like planning,
reasoning, memory mechanisms, and tool augmentation. To contextualize this
progress, we present a unified pipeline illustrating the workflow from task
specification to deliverables, detailing how different solution paradigms
address various complexity levels. Unlike prior surveys that focus narrowly on
specific aspects, this work connects 50+ benchmarks to their corresponding
solution strategies, enabling researchers to identify optimal approaches for
diverse evaluation criteria. We also identify critical research gaps and
propose future directions, including multi-agent collaboration, self-evolving
systems, and formal verification integration. This survey serves as a
foundational guide for advancing LLM-driven software engineering. We maintain a
GitHub repository that continuously updates the reviewed and related papers at
https://github.com/lisaGuojl/LLM-Agent-SE-Survey.
[COMMENTS]22 pages
[LINK]http://arxiv.org/abs/2510.09721v3
[DATE]2025-10-23 13:08:22+08:00
[CATEGORIES]cs.CL
Decoding-Free Sampling Strategies for LLM Marginalization
[AUTHORS]David Pohl, Marco Cognetta, Junyoung Lee, Naoaki Okazaki
[ABSTRACT]Modern language models operate on subword-tokenized text in order to make a
trade-off between model size, inference speed, and vocabulary coverage. A side
effect of this is that, during inference, models are evaluated by measuring the
probability of only the specific tokenization produced as the output, despite
there being many possible ways to represent the same text with a subword
vocabulary. Recent studies have argued instead for evaluating LLMs by
marginalization - the probability mass of all tokenizations of a given text.
Marginalization is difficult due to the number of possible tokenizations of a
text, so often approximate marginalization is done via sampling. However, a
downside of sampling is that an expensive generation step must be performed by
the LLM for each sample, which limits the number of samples that can be
acquired given a runtime budget, and therefore also the accuracy of the
approximation. Since computing the probability of a sequence given the
tokenization is relatively cheap compared to actually generating it, we
investigate sampling strategies that are decoding-free - they require no
generation from the LLM, instead relying entirely on extremely cheap sampling
strategies that are model and tokenizer agnostic.
We investigate the approximation quality and speed of decoding-free sampling
strategies for a number of open models to find that they provide sufficiently
accurate marginal estimates at a small fraction of the runtime cost and
demonstrate its use on a set of downstream inference tasks.
[COMMENTS]10 pages, 3 figures
[LINK]http://arxiv.org/abs/2510.20208v1
[DATE]2025-10-23 12:50:14+08:00
[CATEGORIES]cs.CL
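The marginalization quantity described in the abstract above, the total probability mass of all tokenizations of a text, can be illustrated by brute-force enumeration on a toy vocabulary. This sketch shows the quantity being estimated, not the paper's sampling strategies; the vocabulary, the unigram scorer, and the function names are hypothetical.

```python
import math


def tokenizations(text, vocab):
    """Yield every way to split `text` into pieces that are all in `vocab`."""
    if text == "":
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in tokenizations(text[end:], vocab):
                yield [piece] + rest


def marginal_probability(text, vocab, seq_prob):
    """Sum the model probability of every tokenization of `text`."""
    return sum(seq_prob(tokens) for tokens in tokenizations(text, vocab))


# Toy scorer (assumption): independent unigram probabilities per token.
toy_unigram = {"a": 0.2, "b": 0.3, "ab": 0.5}


def toy_seq_prob(tokens):
    return math.prod(toy_unigram[t] for t in tokens)


marginal = marginal_probability("ab", toy_unigram, toy_seq_prob)
```

For the text "ab" there are two tokenizations, ["a", "b"] and ["ab"], so the marginal is 0.2 * 0.3 + 0.5 = 0.56. Exact enumeration grows exponentially with text length, which is why the abstract turns to sampling-based approximations.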
MIR-Bench: Can Your LLM Recognize Complicated Patterns via Many-Shot In-Context Reasoning?
[AUTHORS]Kai Yan, Zhan Ling, Kang Liu, Yifan Yang, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen
[ABSTRACT]The ability to recognize patterns from examples and apply them to new ones is
a primal ability for general intelligence, and is widely studied by psychology
and AI researchers. Many benchmarks have been proposed to measure such ability
for Large Language Models (LLMs); however, they focus on few-shot (usually <10)
setting and lack evaluation for aggregating many pieces of information from
long contexts. On the other hand, the ever-growing context length of LLMs has
brought forth the novel paradigm of many-shot In-Context Learning (ICL), which
addresses new tasks with hundreds to thousands of examples without expensive
and inefficient fine-tuning. However, many-shot evaluations often focus on
classification, and popular long-context LLM tasks such as Needle-In-A-Haystack
(NIAH) seldom require complicated intelligence for integrating many pieces of
information. To fix the issues from both worlds, we propose MIR-Bench, the
first many-shot in-context reasoning benchmark for pattern recognition that
asks an LLM to predict outputs via input-output examples from underlying functions
with diverse data formats. Based on MIR-Bench, we study many novel problems for
many-shot in-context reasoning and obtain many insightful findings, including
scaling effects, robustness, inductive vs. transductive reasoning,
Retrieval-Augmented Generation (RAG), coding for inductive reasoning, cross-domain
generalizability, etc.
[COMMENTS]39 pages, 11 figures. The paper is accepted at NeurIPS 2025 Datasets
& Benchmarks Track, and the latest version adds modifications in camera-ready
[LINK]http://arxiv.org/abs/2502.09933v5
[DATE]2025-10-23 12:48:25+08:00
[CATEGORIES]cs.CL cs.LG
Sherlock: Self-Correcting Reasoning in Vision-Language Models
[AUTHORS]Yi Ding, Ruqi Zhang
[ABSTRACT]Reasoning Vision-Language Models (VLMs) have shown promising performance on
complex multimodal tasks. However, they still face significant challenges: they
are highly sensitive to reasoning errors, require large volumes of annotated
data or accurate verifiers, and struggle to generalize beyond specific domains.
To address these limitations, we explore self-correction as a strategy to
enhance reasoning VLMs. We first conduct an in-depth analysis of reasoning
VLMs’ self-correction abilities and identify key gaps. Based on our findings,
we introduce Sherlock, a self-correction and self-improvement training
framework. Sherlock introduces a trajectory-level self-correction objective, a
preference data construction method based on visual perturbation, and a dynamic
$\beta$ for preference tuning. Once the model acquires self-correction
capabilities using only 20k randomly sampled annotated data, it continues to
self-improve without external supervision. Built on the Llama3.2-Vision-11B
model, Sherlock achieves remarkable results across eight benchmarks, reaching
an average accuracy of 64.1 with direct generation and 65.4 after
self-correction. It outperforms LLaVA-CoT (63.2), Mulberry (63.9), and
LlamaV-o1 (63.4) while using less than 20% of the annotated data.
[COMMENTS]Published at NeurIPS 2025, 27 pages
[LINK]http://arxiv.org/abs/2505.22651v2
[DATE]2025-10-23 12:45:46+08:00
[CATEGORIES]cs.CL cs.LG
Stuck in the Matrix: Probing Spatial Reasoning in Large Language Models
[AUTHORS]Maggie Bai, Ava Kim Cohen, Eleanor Koss, Charlie Lichtenbaum
[ABSTRACT]This paper explores the spatial reasoning capability of large language models
(LLMs) over textual input through a suite of five tasks aimed at probing their
spatial understanding and computational abilities. The models were tested on
both fundamental spatial reasoning and multi-step problem-solving within
structured grid-based environments using tasks such as quadrant identification,
geometric transformations, distance evaluation, word searches, and tile
sliding. Each task was scaled in complexity through increasing grid dimensions,
requiring models to extend beyond simple pattern recognition into abstract
spatial reasoning. Our results reveal that while LLMs demonstrate moderate
success in all tasks with small complexity and size, performance drops off
rapidly as scale increases, with an average loss in accuracy of 42.7%, and
reaching as high as 84%. Every test that began with over 50% accuracy showed a
loss of at least 48%, illustrating the consistent nature of the deterioration.
Furthermore, their struggles with scaling complexity hint at a lack of robust
spatial representations in their underlying architectures. This paper
underscores the gap between linguistic and spatial reasoning in LLMs, offering
insights into their current limitations, and laying the groundwork for future
integrative benchmarks at the intersection of language and geometry.
[COMMENTS]20 pages, 24 figures
[LINK]http://arxiv.org/abs/2510.20198v1
[DATE]2025-10-23 12:32:46+08:00
[CATEGORIES]cs.CL
Multimedia-Aware Question Answering: A Review of Retrieval and Cross-Modal Reasoning Architectures
[AUTHORS]Rahul Raja, Arpita Vats
[ABSTRACT]Question Answering (QA) systems have traditionally relied on structured text
data, but the rapid growth of multimedia content (images, audio, video, and
structured metadata) has introduced new challenges and opportunities for
retrieval-augmented QA. In this survey, we review recent advancements in QA
systems that integrate multimedia retrieval pipelines, focusing on
architectures that align vision, language, and audio modalities with user
queries. We categorize approaches based on retrieval methods, fusion
techniques, and answer generation strategies, and analyze benchmark datasets,
evaluation protocols, and performance tradeoffs. Furthermore, we highlight key
challenges such as cross-modal alignment, latency-accuracy tradeoffs, and
semantic grounding, and outline open problems and future research directions
for building more robust and context-aware QA systems leveraging multimedia
data.
[COMMENTS]In Proceedings of the 2nd ACM Workshop in AI-powered Question and
Answering Systems (AIQAM ’25), October 27-28, 2025, Dublin, Ireland. ACM, New
York, NY, USA, 8 pages. https://doi.org/10.1145/3746274.3760393
[LINK]http://arxiv.org/abs/2510.20193v1
[DATE]2025-10-23 12:25:44+08:00
[CATEGORIES]cs.CL cs.LG
Every Question Has Its Own Value: Reinforcement Learning with Explicit Human Values
[AUTHORS]Dian Yu, Yulai Zhao, Kishan Panaganti, Linfeng Song, Haitao Mi, Dong Yu
[ABSTRACT]We propose Reinforcement Learning with Explicit Human Values (RLEV), a method
that aligns Large Language Model (LLM) optimization directly with quantifiable
human value signals. While Reinforcement Learning with Verifiable Rewards
(RLVR) effectively trains models in objective domains using binary correctness
rewards, it overlooks that not all tasks are equally significant. RLEV extends
this framework by incorporating human-defined value signals directly into the
reward function. Using exam-style data with explicit ground-truth value labels,
RLEV consistently outperforms correctness-only baselines across multiple RL
algorithms and model scales. Crucially, RLEV policies not only improve
value-weighted accuracy but also learn a value-sensitive termination policy:
concise for low-value prompts, thorough for high-value ones. We demonstrate
this behavior stems from value-weighted gradient amplification on
end-of-sequence tokens. Ablation studies confirm the gain is causally linked to
value alignment. RLEV remains robust under noisy value signals, such as
difficulty-based labels, demonstrating that optimizing for an explicit utility
function offers a practical path to aligning LLMs with human priorities.
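
The value-weighting idea in this abstract can be illustrated with a small sketch. It assumes the simplest possible form, a binary correctness reward scaled by a per-question value, and computes value-weighted accuracy from it. The function names and the multiplicative form are assumptions for illustration only; the abstract does not state the paper's exact reward function.

```python
from typing import Sequence


def value_weighted_reward(correct: bool, value: float) -> float:
    """Hypothetical reward: the question's value if answered correctly, else 0."""
    return value if correct else 0.0


def value_weighted_accuracy(correct: Sequence[bool], values: Sequence[float]) -> float:
    """Fraction of the total available value that was actually earned."""
    if len(correct) != len(values):
        raise ValueError("correct and values must have the same length")
    total = sum(values)
    if total <= 0:
        raise ValueError("total value must be positive")
    earned = sum(value_weighted_reward(c, v) for c, v in zip(correct, values))
    return earned / total


# Two of three answers are right (plain accuracy 2/3), but the missed
# question carries half of the total value.
score = value_weighted_accuracy([True, False, True], [1.0, 5.0, 4.0])
```

Under this assumed form, a policy that misses one high-value question scores lower than its plain accuracy would suggest, which is the distinction the abstract draws against correctness-only baselines.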
[COMMENTS]15 pages, 4 figures
[LINK]http://arxiv.org/abs/2510.20187v1
[DATE]2025-10-23 12:15:22+08:00
[CATEGORIES]cs.LG cs.CL
RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
[AUTHORS]Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
[ABSTRACT]Reinforcement learning (RL) has recently emerged as a compelling approach for
enhancing the reasoning capabilities of large language models (LLMs), where an
LLM generator serves as a policy guided by a verifier (reward model). However,
current RL post-training methods for LLMs typically use verifiers that are
fixed (rule-based or frozen pretrained) or trained discriminatively via
supervised fine-tuning (SFT). Such designs are susceptible to reward hacking
and generalize poorly beyond their training distributions. To overcome these
limitations, we propose Tango, a novel framework that uses RL to concurrently
train both an LLM generator and a verifier in an interleaved manner. A central
innovation of Tango is its generative, process-level LLM verifier, which is
trained via RL and co-evolves with the generator. Importantly, the verifier is
trained solely based on outcome-level verification correctness rewards without
requiring explicit process-level annotations. This generative RL-trained
verifier exhibits improved robustness and superior generalization compared to
deterministic or SFT-trained verifiers, fostering effective mutual
reinforcement with the generator. Extensive experiments demonstrate that both
components of Tango achieve state-of-the-art results among 7B/8B-scale models:
the generator attains best-in-class performance across five competition-level
math benchmarks and four challenging out-of-domain reasoning tasks, while the
verifier leads on the ProcessBench dataset. Remarkably, both components exhibit
particularly substantial improvements on the most difficult mathematical
reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
[COMMENTS]NeurIPS 2025. The first two authors contributed equally
[LINK]http://arxiv.org/abs/2505.15034v2
[DATE]2025-10-23 11:38:20+08:00
[CATEGORIES]cs.LG cs.CL
Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection ?
[AUTHORS]Anthony Dubreuil, Antoine Gourru, Christine Largeron, Amine Trabelsi
[COMMENTS]Accepted in EMNLP 2025 (Main)
[LINK]http://arxiv.org/abs/2510.20154v1
[DATE]2025-10-23 11:05:25+08:00
[CATEGORIES]cs.CL
BoundRL: Efficient Structured Text Segmentation through Reinforced Boundary Generation
[AUTHORS]Haoyuan Li, Zhengyuan Shen, Sullam Jeoung, Yueyan Chen, Jiayu Li, Qi Zhu, Shuai Wang, Vassilis Ioannidis, Huzefa Rangwala
[ABSTRACT]As structured texts become increasingly complex across diverse domains –
from technical reports to generative AI prompts – the need for text
segmentation into semantically meaningful components becomes critical. Such
texts often contain elements beyond plain language, including tables, code
snippets, and placeholders, which conventional sentence- or paragraph-level
segmentation methods cannot handle effectively. To address this challenge, we
propose BoundRL, a novel and efficient approach that jointly performs
token-level text segmentation and label prediction for long structured texts.
Instead of generating complete contents for each segment, it generates only a
sequence of starting tokens and reconstructs the complete contents by locating
these tokens within the original texts, thereby reducing inference costs by
orders of magnitude and minimizing hallucination. To adapt the model for the
output format, BoundRL performs reinforcement learning with verifiable rewards
(RLVR) with a specifically designed reward that jointly optimizes document
reconstruction fidelity and semantic alignment. To mitigate entropy collapse,
it further constructs intermediate candidates by systematically perturbing a
fraction of generated sequences of segments to create stepping stones toward
higher-quality solutions. To demonstrate BoundRL’s effectiveness on
particularly challenging structured texts, we focus evaluation on complex
prompts used for LLM applications. Experiments show that BoundRL enables small
language models (1.7B parameters) to outperform few-shot prompting of much
larger models. Moreover, RLVR with our designed reward yields significant
improvements over supervised fine-tuning, and incorporating intermediate
candidates further improves both performance and generalization.
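
The reconstruction step described in this abstract (generate only segment starts, then recover full segments by locating them in the original text) can be sketched as follows. This is a minimal illustration under assumptions: start markers are plain strings rather than model tokens, they are searched left to right, and `reconstruct_segments` is a hypothetical helper name, not the paper's API.

```python
from typing import List


def reconstruct_segments(text: str, starts: List[str]) -> List[str]:
    """Rebuild full segments from their start markers.

    Each marker is located in `text` after the previous marker; a segment
    runs from its marker to the next marker (or to the end of the text).
    Text before the first marker is not part of any segment.
    """
    positions = []
    cursor = 0
    for marker in starts:
        idx = text.find(marker, cursor)
        if idx == -1:
            raise ValueError(f"start marker not found: {marker!r}")
        positions.append(idx)
        cursor = idx + len(marker)
    bounds = positions + [len(text)]
    return [text[bounds[i]:bounds[i + 1]] for i in range(len(positions))]


prompt = "You are a helper.\nInput: {query}\nAnswer in JSON."
segments = reconstruct_segments(prompt, ["You are", "Input:", "Answer"])
```

Because the segments are sliced out of the original text, they cannot contain content the source lacks, which matches the abstract's point about reduced inference cost and less hallucination.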
[LINK]http://arxiv.org/abs/2510.20151v1
[DATE]2025-10-23 10:56:10+08:00
[CATEGORIES]cs.CL
Hybrid Latent Reasoning via Reinforcement Learning
[AUTHORS]Zhenrui Yue, Bowen Jin, Huimin Zeng, Honglei Zhuang, Zhen Qin, Jinsung Yoon, Lanyu Shang, Jiawei Han, Dong Wang
[ABSTRACT]Recent advances in large language models (LLMs) have introduced latent
reasoning as a promising alternative to autoregressive reasoning. By performing
internal computation with hidden states from previous steps, latent reasoning
benefits from more informative features rather than sampling a discrete
chain-of-thought (CoT) path. Yet latent reasoning approaches are often
incompatible with LLMs, as their continuous paradigm conflicts with the
discrete nature of autoregressive generation. Moreover, these methods rely on
CoT traces for training and thus fail to exploit the inherent reasoning
patterns of LLMs. In this work, we explore latent reasoning by leveraging the
intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we
introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid
latent reasoning approach that (1) integrates prior hidden states into sampled
tokens with a learnable gating mechanism, and (2) initializes training with
predominantly token embeddings while progressively incorporating more hidden
features. This design maintains LLMs’ generative capabilities and incentivizes
hybrid reasoning using both discrete and continuous representations. In
addition, the hybrid HRPO introduces stochasticity into latent reasoning via
token sampling, thereby enabling RL-based optimization without requiring CoT
trajectories. Extensive evaluations across diverse benchmarks show that HRPO
outperforms prior methods in both knowledge- and reasoning-intensive tasks.
Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing
behaviors like cross-lingual patterns and shorter completion lengths,
highlighting the potential of our RL-based approach and offering insights for
future work in latent reasoning.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.18454v2
[DATE]2025-10-23 10:18:11+08:00
[CATEGORIES]cs.CL
AssistedDS: Benchmarking How External Domain Knowledge Assists LLMs in Automated Data Science
[AUTHORS]An Luo, Xun Xian, Jin Du, Fangqiao Tian, Ganghua Wang, Ming Zhong, Shengchun Zhao, Xuan Bi, Zirui Liu, Jiawei Zhou, Jayanth Srinivasa, Ashish Kundu, Charles Fleming, Mingyi Hong, Jie Ding
[ABSTRACT]Large language models (LLMs) have advanced the automation of data science
workflows. Yet it remains unclear whether they can critically leverage external
domain knowledge as human data scientists do in practice. To answer this
question, we introduce AssistedDS (Assisted Data Science), a benchmark designed
to systematically evaluate how LLMs handle domain knowledge in tabular
prediction tasks. AssistedDS features both synthetic datasets with explicitly
known generative mechanisms and real-world Kaggle competitions, each
accompanied by curated bundles of helpful and adversarial documents. These
documents provide domain-specific insights into data cleaning, feature
engineering, and model selection. We assess state-of-the-art LLMs on their
ability to discern and apply beneficial versus harmful domain knowledge,
evaluating submission validity, information recall, and predictive performance.
Our results demonstrate three key findings: (1) LLMs frequently exhibit an
uncritical adoption of provided information, significantly impairing their
predictive performance when adversarial content is introduced, (2) helpful
guidance is often insufficient to counteract the negative influence of
adversarial information, and (3) in Kaggle datasets, LLMs often make errors in
handling time-series data, applying consistent feature engineering across
different folds, and interpreting categorical variables correctly. These
findings highlight a substantial gap in current models’ ability to critically
evaluate and leverage expert knowledge, underscoring an essential research
direction for developing more robust, knowledge-aware automated data science
systems. Our data and code are publicly available here:
https://github.com/jeremyxianx/Assisted-DS
[LINK]http://arxiv.org/abs/2506.13992v2
[DATE]2025-10-23 09:33:18+08:00
[CATEGORIES]cs.LG cs.CL
RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing
[AUTHORS]Hao Xiang, Tianyi Tang, Yang Su, Bowen Yu, An Yang, Fei Huang, Yichang Zhang, Yaojie Lu, Hongyu Lin, Xianpei Han, Jingren Zhou, Junyang Lin, Le Sun
[LINK]http://arxiv.org/abs/2507.20352v2
[DATE]2025-10-23 09:30:25+08:00
[CATEGORIES]cs.CL
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
[AUTHORS]Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
[ABSTRACT]Recent research has shown that Large Language Models (LLMs) are vulnerable to
automated jailbreak attacks, where adversarial suffixes crafted by algorithms
appended to harmful queries bypass safety alignment and trigger unintended
responses. Current methods for generating these suffixes are computationally
expensive and have low Attack Success Rates (ASR), especially against
well-aligned models like Llama2 and Llama3. To overcome these limitations, we
introduce ADV-LLM, an iterative self-tuning process that crafts adversarial
LLMs with enhanced jailbreak ability. Our framework significantly reduces the
computational cost of generating adversarial suffixes while achieving nearly
100% ASR on various open-source LLMs. Moreover, it exhibits strong attack
transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49%
ASR on GPT-4, despite being optimized solely on Llama3. Beyond improving
jailbreak ability, ADV-LLM provides valuable insights for future safety
alignment research through its ability to generate large datasets for studying
LLM safety. Our code is available at: https://github.com/SunChungEn/ADV-LLM
[COMMENTS]Accepted to NAACL 2025 Main (Oral)
[LINK]http://arxiv.org/abs/2410.18469v5
[DATE]2025-10-23 08:57:57+08:00
[CATEGORIES]cs.CL cs.LG
AI PB: A Grounded Generative Agent for Personalized Investment Insights
[AUTHORS]Daewoo Park, Suho Park, Inseok Hong, Hanwool Lee, Junkyu Park, Sangjun Lee, Jeongman An, Hyunbin Loh
[ABSTRACT]We present AI PB, a production-scale generative agent deployed in real retail
finance. Unlike reactive chatbots that answer queries passively, AI PB
proactively generates grounded, compliant, and user-specific investment
insights. It integrates (i) a component-based orchestration layer that
deterministically routes between internal and external LLMs based on data
sensitivity, (ii) a hybrid retrieval pipeline using OpenSearch and the
finance-domain embedding model, and (iii) a multi-stage recommendation
mechanism combining rule heuristics, sequential behavioral modeling, and
contextual bandits. Operating fully on-premises under Korean financial
regulations, the system employs Docker Swarm and vLLM across 24 NVIDIA H100
GPUs. Through human QA and system metrics, we demonstrate that grounded
generation with explicit routing and layered safety can deliver trustworthy AI
insights in high-stakes finance.
[COMMENTS]Under Review
[LINK]http://arxiv.org/abs/2510.20099v1
[DATE]2025-10-23 08:51:59+08:00
[CATEGORIES]cs.CL
Leveraging the Power of Large Language Models in Entity Linking via Adaptive Routing and Targeted Reasoning
[AUTHORS]Yajie Li, Albert Galimov, Mitra Datta Ganapaneni, Pujitha Thejaswi, De Meng, Priyanshu Kumar, Saloni Potdar
[ABSTRACT]Entity Linking (EL) has traditionally relied on large annotated datasets and
extensive model fine-tuning. While recent few-shot methods leverage large
language models (LLMs) through prompting to reduce training requirements, they
often suffer from inefficiencies due to expensive LLM-based reasoning. ARTER
(Adaptive Routing and Targeted Entity Reasoning) presents a structured pipeline
that achieves high performance without deep fine-tuning by strategically
combining candidate generation, context-based scoring, adaptive routing, and
selective reasoning. ARTER computes a small set of complementary signals (both
embedding and LLM-based) over the retrieved candidates to categorize contextual
mentions into easy and hard cases. The cases are then handled by a
low-computational entity linker (e.g. ReFinED) and more expensive targeted
LLM-based reasoning respectively. On standard benchmarks, ARTER outperforms
ReFinED by up to +4.47%, with an average gain of +2.53% on 5 out of 6 datasets,
and performs comparably to pipelines using LLM-based reasoning for all
mentions, while being twice as efficient in terms of the number of LLM
tokens.
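
The easy/hard routing described in this abstract can be pictured with a toy rule. The sketch below is purely illustrative: it assumes a single signal (the score margin between the top two candidates) and a hypothetical threshold, whereas the abstract mentions several embedding- and LLM-based signals whose exact form is not given here.

```python
from typing import List


def route_mention(candidate_scores: List[float], margin: float = 0.2) -> str:
    """Return 'easy' when the top candidate clearly beats the runner-up, else 'hard'.

    'easy' mentions would go to a cheap entity linker, 'hard' ones to
    targeted LLM-based reasoning. The margin rule is an assumption.
    """
    if not candidate_scores:
        return "hard"  # nothing retrieved: defer to the expensive path
    ranked = sorted(candidate_scores, reverse=True)
    if len(ranked) == 1:
        return "easy"
    return "easy" if ranked[0] - ranked[1] >= margin else "hard"


routes = [route_mention([0.9, 0.3, 0.1]), route_mention([0.55, 0.5])]
```

Only the mentions routed as hard incur LLM tokens, which is where the efficiency gain claimed in the abstract would come from.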
[LINK]http://arxiv.org/abs/2510.20098v1
[DATE]2025-10-23 08:50:14+08:00
[CATEGORIES]cs.CL
Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
[AUTHORS]Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao
[ABSTRACT]Key-Value (KV) caching is a common technique to enhance the computational
efficiency of Large Language Models (LLMs), but its memory overhead grows
rapidly with input length. Prior work has shown that not all tokens are equally
important for text generation, proposing layer-level KV cache compression to
selectively retain key information. Recognizing the distinct roles of attention
heads in generation, we propose HeadKV, a head-level KV cache compression
method, and HeadKV-R2, which leverages a novel contextual reasoning ability
estimation for compression. Our approach operates at the level of individual
heads, estimating their importance for contextual QA tasks that require both
retrieval and reasoning capabilities. Extensive experiments across diverse
benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct,
Mistral-7B-Instruct), and long-context abilities tests demonstrate that our
head-level KV cache compression significantly outperforms strong baselines,
particularly in low-resource settings (KV size = 64 & 128). Notably, our method
retains just 1.5% of the KV cache while achieving 97% of the performance of the
full KV cache on the contextual question answering benchmark. Code is
available at https://github.com/FYYFU/HeadKV
[COMMENTS]Accepted to ICLR2025
[LINK]http://arxiv.org/abs/2410.19258v4
[DATE]2025-10-23 08:47:24+08:00
[CATEGORIES]cs.CL
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
[AUTHORS]Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Sangwu Park, Kibum Kim, Chanyoung Park
[COMMENTS]EMNLP 2025 Findings
[LINK]http://arxiv.org/abs/2502.15086v2
[DATE]2025-10-23 08:23:30+08:00
[CATEGORIES]cs.CL
Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation
[AUTHORS]Jahidul Arafat, Fariha Tasmin, Sanjaya Poudel
[ABSTRACT]Wordle presents an algorithmically rich testbed for constraint satisfaction
problem (CSP) solving. While existing solvers rely on information-theoretic
entropy maximization or frequency-based heuristics without formal constraint
treatment, we present the first comprehensive CSP formulation of Wordle with
novel constraint-aware solving strategies. We introduce CSP-Aware Entropy,
computing information gain after constraint propagation rather than on raw
candidate sets, and a Probabilistic CSP framework integrating Bayesian
word-frequency priors with logical constraints. Through evaluation on 2,315
English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9%
success rate, a statistically significant 1.7% improvement over Forward
Checking (t=-4.82, p<0.001, Cohen’s d=0.07) with 46% faster runtime (12.9ms
versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3
percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic
CSP achieves 100% success across all noise levels (0-20%) through constraint
recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates
88% success with zero language-specific tuning, validating that core CSP
principles transfer across languages despite an 11.2 percentage point gap from
linguistic differences (p<0.001, Fisher’s exact test). Our open-source
implementation with 34 unit tests achieving 91% code coverage provides
reproducible infrastructure for CSP research. The combination of formal CSP
treatment, constraint-aware heuristics, probabilistic-logical integration,
robustness analysis, and cross-lexicon validation establishes new performance
benchmarks demonstrating that principled constraint satisfaction techniques
outperform classical information-theoretic and learning-based approaches for
structured puzzle-solving domains.
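
To make the "entropy after constraint propagation" idea in this abstract concrete, here is a minimal sketch. It assumes standard Wordle feedback rules and uses plain candidate filtering as a stand-in for constraint propagation; the function names are hypothetical and this is not the authors' implementation.

```python
import math
from collections import Counter
from typing import List


def feedback(guess: str, answer: str) -> str:
    """Wordle feedback: G = right letter and place, Y = wrong place, B = absent."""
    result = ["B"] * len(guess)
    remaining = Counter()  # answer letters not matched exactly
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1
    for i, g in enumerate(guess):
        if result[i] == "B" and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)


def filter_candidates(candidates: List[str], guess: str, pattern: str) -> List[str]:
    """Keep only words consistent with the observed feedback pattern."""
    return [w for w in candidates if feedback(guess, w) == pattern]


def guess_entropy(guess: str, candidates: List[str]) -> float:
    """Shannon entropy (bits) of the feedback distribution over the candidates."""
    counts = Counter(feedback(guess, w) for w in candidates)
    n = len(candidates)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


words = ["crane", "crate", "trace", "slate"]
survivors = filter_candidates(words, "crane", "GGGBG")
bits = guess_entropy("crane", words)
```

In this sketch the entropy would be evaluated on the list returned by `filter_candidates` after each guess, rather than on the full word list, which is the distinction the abstract draws.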
[COMMENTS]35 pages, 14 figures, 10 tables. Open-source implementation with 91%
test coverage available at
https://github.com/jahidul-arafat/constraint_satisfaction_wordle_arxiv_preprint
[LINK]http://arxiv.org/abs/2510.02855v2
[DATE]2025-10-23 07:50:28+08:00
[CATEGORIES]cs.CL
SIGN: Schema-Induced Games for Naming
[AUTHORS]Ryan Zhang, Herbert Woisetscläger
[ABSTRACT]Real-world AI systems are tackling increasingly complex problems, often
through interactions among large language model (LLM) agents. When these agents
develop inconsistent conventions, coordination can break down. Applications
such as collaborative coding and distributed planning therefore require
reliable, consistent communication, and scalability is a central concern as
systems grow. We introduce Schema-Induced Games for Naming (SIGN), a naming
game that examines how lightweight structure can steer convention formation. We
compare schema-induced communication to unconstrained natural language and find
faster convergence with up to 5.8x higher agreement. These results suggest that
minimal structure can act as a simple control knob for efficient multi-agent
coordination, pointing toward broader applications beyond the naming game.
[COMMENTS]AAAI 2026 Student Abstract (Oral). Code available at
https://github.com/ryanzhangofficial/schema-induced-games-for-naming
[LINK]http://arxiv.org/abs/2510.21855v1
[DATE]2025-10-23 07:12:06+08:00
[CATEGORIES]cs.CL cs.LG
Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems
[AUTHORS]Shangbin Feng, Zifeng Wang, Palash Goyal, Yike Wang, Weijia Shi, Huang Xia, Hamid Palangi, Luke Zettlemoyer, Yulia Tsvetkov, Chen-Yu Lee, Tomas Pfister
[ABSTRACT]We propose Heterogeneous Swarms, an algorithm to design multi-LLM systems by
jointly optimizing model roles and weights. We represent multi-LLM systems as
directed acyclic graphs (DAGs) of LLMs with topological message passing for
collaborative generation. Given a pool of LLM experts and a utility function,
Heterogeneous Swarms employs two iterative steps: role-step and weight-step.
For role-step, we interpret model roles as learning a DAG that specifies the
flow of inputs and outputs between LLMs. Starting from a swarm of random
continuous adjacency matrices, we decode them into discrete DAGs, call the LLMs
in topological order, evaluate on the utility function (e.g. accuracy on a
task), and optimize the adjacency matrices with particle swarm optimization
based on the utility score. For weight-step, we assess the contribution of
individual LLMs in the multi-LLM systems and optimize model weights with swarm
intelligence. We propose JFK-score to quantify the individual contribution of
each LLM in the best-found DAG of the role-step, then optimize model weights
with particle swarm optimization based on the JFK-score. Experiments
demonstrate that Heterogeneous Swarms outperforms 15 role- and/or weight-based
baselines by 18.5% on average across 12 tasks. Further analysis reveals that
Heterogeneous Swarms discovers multi-LLM systems with heterogeneous model roles
and substantial collaborative gains, and benefits from the diversity of
language models.
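
The role-step pipeline in this abstract (continuous adjacency matrix, then a discrete DAG, then calling models in topological order) can be sketched as follows. The decoding rule here is an assumption chosen for simplicity: keep an edge i -> j only when i < j and its weight exceeds a threshold, which guarantees acyclicity. The paper's actual decoding procedure may differ.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Set


def decode_dag(weights: List[List[float]], threshold: float = 0.5) -> Dict[int, Set[int]]:
    """Map each node to its set of predecessors, using upper-triangle edges only."""
    n = len(weights)
    preds: Dict[int, Set[int]] = {j: set() for j in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if weights[i][j] > threshold:
                preds[j].add(i)
    return preds


def call_order(preds: Dict[int, Set[int]]) -> List[int]:
    """Order in which models would be called so every input is ready first."""
    return list(TopologicalSorter(preds).static_order())


weights = [
    [0.0, 0.9, 0.2],
    [0.8, 0.0, 0.7],
    [0.6, 0.4, 0.0],
]
dag = decode_dag(weights)
order = call_order(dag)
```

A utility score obtained by running the models in `order` would then feed back into the optimizer that updates the continuous matrices.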
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.04510v2
[DATE]2025-10-23 07:08:47+08:00
[CATEGORIES]cs.CL
Rope to Nope and Back Again: A New Hybrid Attention Strategy
[AUTHORS]Bowen Yang, Bharat Venkitesh, Dwarak Talupuru, Hangyu Lin, David Cairuz, Phil Blunsom, Acyr Locatelli
[ABSTRACT]Long-context large language models (LLMs) have achieved remarkable
advancements, driven by techniques like Rotary Position Embedding (RoPE) (Su et
al., 2023) and its extensions (Chen et al., 2023; Liu et al., 2024c; Peng et
al., 2023). By adjusting RoPE parameters and incorporating training data with
extended contexts, we can train performant models with considerably longer
input sequences. However, existing RoPE-based methods exhibit performance
limitations when applied to extended context lengths. This paper presents a
comprehensive analysis of various attention mechanisms, including RoPE, No
Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm), identifying
their strengths and shortcomings in long-context modeling. Our investigation
identifies distinctive attention patterns in these methods and highlights their
impact on long-context performance, providing valuable insights for
architectural design. Building on these findings, we propose a novel
architecture featuring a hybrid attention mechanism that integrates global and
local attention spans. This design not only surpasses conventional RoPE-based
transformer models with full attention in both long and short context tasks but
also delivers substantial efficiency gains during training and inference.
[LINK]http://arxiv.org/abs/2501.18795v2
[DATE]2025-10-23 06:43:58+08:00
[CATEGORIES]cs.CL
Permutative Preference Alignment from Listwise Ranking of Human Judgments
[AUTHORS]Yang Zhao, Yixin Wang, Mingzhang Yin
[ABSTRACT]Aligning Large Language Models (LLMs) with human preferences is crucial in
ensuring desirable and controllable model behaviors. Current methods, such as
Reinforcement Learning from Human Feedback (RLHF) and Direct Preference
Optimization (DPO), rely on the Bradley-Terry (B-T) model to maximize the
likelihood of pairwise choices. However, when multiple responses are available,
the B-T model fails to guarantee an accurate list ranking of the responses. To
address this issue, we propose Permutative Preference Alignment (PPA), a novel
offline listwise approach that incorporates the Normalized Discounted
Cumulative Gain (NDCG), a widely-used ranking metric, as an alternative
training objective for LLM alignment. We develop an end-to-end alignment
algorithm by approximating NDCG with a differentiable surrogate loss.
Experiments demonstrate that PPA outperforms existing pairwise and listwise
methods on evaluation sets and general benchmarks such as AlpacaEval.
Furthermore, we show that NDCG-based approaches improve ranking accuracy more
effectively than B-T-based methods and provide a theoretical explanation for
this improvement.
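
For reference, the ranking metric named in this abstract can be computed as below. This sketch uses the common exponential-gain form of DCG (gain 2^rel - 1, discount log2(position + 1)), which is an assumption since the abstract does not say which variant is used. It shows the plain, non-differentiable metric only, not the surrogate loss the paper trains with.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of relevance labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1])   # responses already ranked best-first
reversed_ = ndcg([1, 2, 3])  # best response ranked last
```

Unlike a pairwise Bradley-Terry likelihood, this score depends on the position of every response in the list, with mistakes near the top penalized most.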
[COMMENTS]Published at EMNLP 2025 Main Conference
[LINK]http://arxiv.org/abs/2410.04346v2
[DATE]2025-10-23 06:15:48+08:00
[CATEGORIES]cs.CL
SpecEval: Evaluating Model Adherence to Behavior Specifications
[AUTHORS]Ahmed Ahmed, Kevin Klyman, Yi Zeng, Sanmi Koyejo, Percy Liang
[ABSTRACT]Companies that develop foundation models publish behavioral guidelines they
pledge their models will follow, but it remains unclear if models actually do
so. While providers such as OpenAI, Anthropic, and Google have published
detailed specifications describing both desired safety constraints and
qualitative traits for their models, there has been no systematic audit of
adherence to these guidelines. We introduce an automated framework that audits
models against their providers' specifications by parsing behavioral statements,
generating targeted prompts, and using models to judge adherence. Our central
focus is on three-way consistency between a provider's specification, its model
outputs, and its own models as judges, an extension of prior two-way
generator-validator consistency. This establishes a necessary baseline: at minimum, a
foundation model should consistently satisfy its developer's behavioral
specifications when judged by the developer's evaluator models. We apply our
framework to 16 models from six developers across more than 100 behavioral
statements, finding systematic inconsistencies including compliance gaps of up
to 20 percent across providers.
[LINK]http://arxiv.org/abs/2509.02464v2
[DATE]2025-10-23 05:55:45+08:00
[CATEGORIES]cs.CL
Policy Optimization Prefers The Path of Least Resistance
[AUTHORS]Debdeep Sanyal, Aakash Sen Sharma, Dhruv Kumar, Saurabh Deshpande, Murari Mandal
[ABSTRACT]Policy optimization (PO) algorithms are used to refine Large Language Models
for complex, multi-step reasoning. Current state-of-the-art pipelines enforce a
strict think-then-answer format to elicit chain-of-thought (CoT); however, the
behavior of PO when these rigid constraints are relaxed into an open-ended CoT
structure remains an under-studied question. We investigate this gap with an
extensive suite of controlled experiments and identify a consistent principle:
policy optimization consistently follows the path of least resistance.
When afforded the flexibility to interleave reasoning and response, policy
optimization consistently learns to discard explicit reasoning, causing the
policy to degenerate to a direct \texttt{
[COMMENTS]21 pages, 8 figures, 2 tables
[LINK]http://arxiv.org/abs/2510.21853v1
[DATE]2025-10-23 05:48:44+08:00
[CATEGORIES]cs.CL
Language Models (Mostly) Know When to Stop Reading
[AUTHORS]Roy Xie, Junlin Wang, Paul Rosu, Chunyuan Deng, Bolun Sun, Zihao Lin, Bhuwan Dhingra
[ABSTRACT]Large language models (LLMs) process entire input contexts indiscriminately,
which is inefficient when the information required to answer a query is
localized within the context. We present dynamic context cutoff, a novel method
enabling LLMs to self-terminate processing upon acquiring sufficient
task-relevant information. Through analysis of model internals, we discover
that specific attention heads inherently encode “sufficiency signals” –
detectable through lightweight classifiers – that predict when critical
information has been processed. This reveals a new efficiency paradigm: models’
internal understanding naturally dictates processing needs rather than external
compression heuristics. Comprehensive experiments across six QA datasets (up to
40K tokens) with three model families (LLaMA/Qwen/Mistral, 1B-70B) demonstrate
3.4% accuracy improvement while achieving 1.33x token reduction on average.
Furthermore, our method demonstrates superior performance compared to other
context efficiency methods at equivalent token reduction rates. Additionally,
we observe an emergent scaling phenomenon: while smaller models require probing
for sufficiency detection, larger models exhibit intrinsic self-assessment
capabilities through prompting.
[COMMENTS]Accepted to NeurIPS 2025. Project website:
https://royxie.com/when-to-stop-project
[LINK]http://arxiv.org/abs/2502.01025v2
[DATE]2025-10-23 05:46:56+08:00
[CATEGORIES]cs.CL
Beyond One-Way Influence: Bidirectional Opinion Dynamics in Multi-Turn Human-LLM Interactions
[AUTHORS]Yuyang Jiang, Longjie Guo, Yuchen Wu, Aylin Caliskan, Tanu Mitra, Hua Shen
[ABSTRACT]Large language model (LLM)-powered chatbots are increasingly used for opinion
exploration. Prior research examined how LLMs alter user views, yet little work
extended beyond one-way influence to address how user input can affect LLM
responses and how such bi-directional influence manifests throughout the
multi-turn conversations. This study investigates this dynamic through 50
controversial-topic discussions with participants (N=266) across three
conditions: static statements, standard chatbot, and personalized chatbot.
Results show that human opinions barely shifted, while LLM outputs changed more
substantially, narrowing the gap between human and LLM stance. Personalization
amplified these shifts in both directions compared to the standard setting.
Analysis of multi-turn conversations further revealed that exchanges involving
participants’ personal stories were most likely to trigger stance changes for
both humans and LLMs. Our work highlights the risk of over-alignment in
human-LLM interaction and the need for careful design of personalized chatbots
to more thoughtfully and stably align with users.
[COMMENTS]26 pages, 8 figures
[LINK]http://arxiv.org/abs/2510.20039v1
[DATE]2025-10-23 05:38:10+08:00
[CATEGORIES]cs.CL
Improving Transfer Learning for Sequence Labeling Tasks by Adapting Pre-trained Neural Language Models
[AUTHORS]David Dukić
[ABSTRACT]This doctoral thesis improves transfer learning for sequence labeling
tasks by adapting pre-trained neural language models. The proposed improvements
in transfer learning involve introducing a multi-task model that incorporates
an additional signal, a method based on architectural modifications in
autoregressive large language models, and a sequence labeling framework for
autoregressive large language models utilizing supervised in-context
fine-tuning combined with response-oriented adaptation strategies. The first
improvement is given in the context of domain transfer for the event trigger
detection task. The domain transfer of the event trigger detection task can be
improved by incorporating an additional signal obtained from a
domain-independent text processing system into a multi-task model. The second
improvement involves modifying the model’s architecture. For that purpose, a
method is proposed to enable bidirectional information flow across layers of
autoregressive large language models. The third improvement utilizes
autoregressive large language models as text generators through a generative
supervised in-context fine-tuning framework. The proposed model, method, and
framework demonstrate that pre-trained neural language models achieve their
best performance on sequence labeling tasks when adapted through targeted
transfer learning paradigms.
[LINK]http://arxiv.org/abs/2510.20033v1
[DATE]2025-10-23 05:23:53+08:00
[CATEGORIES]cs.CL
Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
[AUTHORS]Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
[COMMENTS]The first two authors contributed equally. This work has been
accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets
& Benchmark Track. Project Page: https://rf100-vl.org/
[LINK]http://arxiv.org/abs/2505.20612v4
[DATE]2025-10-23 05:12:56+08:00
[CATEGORIES]cs.CL cs.LG
DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
[AUTHORS]Julien Siems, Timur Carstensen, Arber Zela, Frank Hutter, Massimiliano Pontil, Riccardo Grazzi
[ABSTRACT]Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive
alternatives to Transformers for sequence modeling, offering efficient training
and linear-time inference. However, existing architectures face a fundamental
trade-off between expressivity and efficiency, dictated by the structure of
their state-transition matrices. Diagonal matrices, used in models such as
Mamba, GLA, or mLSTM, yield fast runtime but have limited expressivity. To
address this, recent architectures such as DeltaNet and RWKV-7 adopted a
diagonal plus rank–1 structure, which allows simultaneous token and channel
mixing, improving associative recall and, as recently shown, state-tracking
when allowing state-transition matrices to have negative eigenvalues. Building
on the interpretation of DeltaNet’s recurrence as performing one step of online
gradient descent per token on an associative recall loss, we introduce
DeltaProduct, which instead takes multiple ($n_h$) steps per token. This
naturally leads to diagonal plus rank–$n_h$ state-transition matrices, formed
as products of $n_h$ generalized Householder transformations, providing a
tunable mechanism to balance expressivity and efficiency. We provide a detailed
theoretical characterization of the state-tracking capability of DeltaProduct
in finite precision, showing how it improves by increasing $n_h$. Our extensive
experiments demonstrate that DeltaProduct outperforms DeltaNet in both
state-tracking and language modeling, while also showing significantly improved
length extrapolation capabilities.
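
As a rough illustration of the state-transition structure described above, the following sketch builds a product of generalized Householder matrices of the form I - beta * k k^T, one factor per micro-step. It is a minimal plain-Python sketch, not the authors' implementation: the function names (`generalized_householder`, `matmul`, `transition_matrix`), the unit-normalisation of keys, and the toy values are assumptions made for illustration.

```python
import math


def generalized_householder(key, beta):
    """Return the matrix I - beta * k k^T, with k the unit-normalised key."""
    norm = math.sqrt(sum(x * x for x in key))
    k = [x / norm for x in key]
    d = len(k)
    return [[(1.0 if i == j else 0.0) - beta * k[i] * k[j] for j in range(d)]
            for i in range(d)]


def matmul(a, b):
    """Plain-Python product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(keys, betas):
    """Multiply one generalized Householder factor per micro-step (n_h factors)."""
    d = len(keys[0])
    result = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for key, beta in zip(keys, betas):
        result = matmul(generalized_householder(key, beta), result)
    return result


# Toy case (assumed values): n_h = 2 with beta = 2 composes two reflections
# into a 90-degree rotation. A single factor is always a symmetric matrix,
# so it cannot equal this rotation.
rotation = transition_matrix([[1.0, 0.0], [1.0, 1.0]], [2.0, 2.0])
```

In this toy case the two-factor product is a rotation that no single factor can represent, which is one way to see how a larger $n_h$ trades extra computation for a richer family of transition matrices.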
[COMMENTS]v5: Characterization of DeltaProduct’s state-tracking ability.
Analysis of hidden state’s effective rank. Improved scaling analysis. v6:
Added analysis for products of RWKV-7 matrices, v6: Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.10297v7
[DATE]2025-10-23 05:10:28+08:00
[CATEGORIES]cs.LG cs.CL
Token embeddings violate the manifold hypothesis
[AUTHORS]Michael Robinson, Sourya Dey, Tony Chiang
[ABSTRACT]A full understanding of the behavior of a large language model (LLM) requires
our grasp of its input token space. If this space differs from our assumptions,
our comprehension of and conclusions about the LLM will likely be flawed. We
elucidate the structure of the token embeddings both empirically and
theoretically. We present a novel statistical test assuming that the
neighborhood around each token has a relatively flat and smooth structure as
the null hypothesis. Failing to reject the null is uninformative, but rejecting
it at a specific token $\psi$ implies an irregularity in the token subspace in
a $\psi$-neighborhood, $B(\psi)$. The structure assumed in the null is a
generalization of a manifold with boundary called a smooth fiber bundle
(which can be split into two spatial regimes – small and large radius), so we
denote our new hypothesis test as the “fiber bundle hypothesis.” By running
our test over several open-source LLMs, each with unique token embeddings, we
find that the null is frequently rejected, and so the evidence suggests that
the token subspace is not a fiber bundle and hence also not a manifold. As a
consequence of our findings, when an LLM is presented with two semantically
equivalent prompts, if one prompt contains a token implicated by our test, the
response to that prompt will likely exhibit less stability than the other.
[COMMENTS]30 pages, 9 figures, 10 tables
[LINK]http://arxiv.org/abs/2504.01002v3
[DATE]2025-10-23 04:05:21+08:00
[CATEGORIES]cs.CL
Deep Research Brings Deeper Harm
[AUTHORS]Shuo Chen, Zonggen Li, Zhen Han, Bailan He, Tong Liu, Haokun Chen, Georg Groh, Philip Torr, Volker Tresp, Jindong Gu
[COMMENTS]Accepted to Reliable ML from Unreliable Data Workshop @ NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.11851v2
[DATE]2025-10-23 04:02:16+08:00
[CATEGORIES]cs.CL
DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
[AUTHORS]Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
[ABSTRACT]Speculative decoding (SD) has emerged as a powerful method for accelerating
autoregressive generation in large language models (LLMs), yet its integration
into vision-language models (VLMs) remains underexplored. We introduce DREAM, a
novel speculative decoding framework tailored for VLMs that combines three key
innovations: (1) a cross-attention-based mechanism to inject intermediate
features from the target model into the draft model for improved alignment, (2)
adaptive intermediate feature selection based on attention entropy to guide
efficient draft model training, and (3) visual token compression to reduce
draft model latency. DREAM enables efficient, accurate, and parallel multimodal
decoding with significant throughput improvement. Experiments across a diverse
set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3,
demonstrate up to a 3.6x speedup over conventional decoding and show that DREAM
significantly outperforms prior SD baselines in both inference throughput and
speculative draft acceptance length across a broad range of multimodal
benchmarks. The code
is publicly available at: https://github.com/SAI-Lab-NYU/DREAM.git
[LINK]http://arxiv.org/abs/2505.19201v3
[DATE]2025-10-23 03:52:00+08:00
[CATEGORIES]cs.CL
A Fundamental Algorithm for Dependency Parsing (With Corrections)
[AUTHORS]Michael A. Covington
[ABSTRACT]This paper presents a fundamental algorithm for parsing natural language
sentences into dependency trees. Unlike phrase-structure (constituency)
parsers, this algorithm operates one word at a time, attaching each word as
soon as it can be attached, corresponding to properties claimed for the parser
in the human brain. Like phrase-structure parsing, its worst-case complexity is
$O(n^3)$, but in human language, the worst case occurs only for small $n$.
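
The word-at-a-time strategy can be sketched as follows. This is a simplified illustration rather than the paper's exact algorithm: the `can_depend(head, dependent)` predicate stands in for a grammar and is hypothetical, each word gets at most one head, and projectivity is not enforced.

```python
def _is_ancestor(heads, a, b):
    """Return True if word a is word b itself or one of b's ancestors."""
    while b is not None:
        if b == a:
            return True
        b = heads[b]
    return False


def covington_parse(words, can_depend):
    """Return the head index of each word (None for a root).

    Words are accepted one at a time; each new word is compared with the
    earlier words, nearest first, and attached as soon as the grammar allows.
    """
    heads = [None] * len(words)
    for i in range(len(words)):
        for j in range(i - 1, -1, -1):
            # Earlier, still headless word j becomes a dependent of new word i.
            if (heads[j] is None and can_depend(words[i], words[j])
                    and not _is_ancestor(heads, j, i)):
                heads[j] = i
            # Otherwise the new word i may become a dependent of earlier word j.
            elif (heads[i] is None and can_depend(words[j], words[i])
                    and not _is_ancestor(heads, i, j)):
                heads[i] = j
    return heads


# Toy grammar (assumed): a determiner depends on a noun, a noun on a verb.
ALLOWED = {("dog", "the"), ("barks", "dog")}
tree = covington_parse(["the", "dog", "barks"],
                       lambda head, dep: (head, dep) in ALLOWED)
# tree == [1, 2, None]
```

In this sketch the pair loop is quadratic in the sentence length and the ancestor check that prevents cycles adds a further linear factor.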
[COMMENTS]Corrected version of an already widely cited paper
[LINK]http://arxiv.org/abs/2510.19996v1
[DATE]2025-10-23 03:48:38+08:00
[CATEGORIES]cs.CL
Text Generation Beyond Discrete Token Sampling
[AUTHORS]Yufan Zhuang, Liyuan Liu, Chandan Singh, Jingbo Shang, Jianfeng Gao
[ABSTRACT]In standard autoregressive generation, an LLM predicts the next-token
distribution, samples a discrete token, and then discards the distribution,
passing only the sampled token as new input. To preserve this distribution’s
rich information, we propose Mixture of Inputs (MoI), a training-free method
for autoregressive generation. After generating a token following the standard
paradigm, we construct a new input that blends the generated discrete token
with the previously discarded token distribution. Specifically, we employ a
Bayesian estimation method that treats the token distribution as the prior, the
sampled token as the observation, and replaces the conventional one-hot vector
with the continuous posterior expectation as the new model input. MoI allows
the model to maintain a richer internal representation throughout the
generation process, resulting in improved text quality and reasoning
capabilities. On mathematical reasoning, code generation, and PhD-level QA
tasks, MoI consistently improves performance across multiple models including
QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional
training and negligible computational overhead.
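
One simple way to picture the blending step is a Dirichlet-categorical update, sketched below. The Dirichlet form, the concentration parameter `alpha`, and the function names are assumptions chosen for illustration; the abstract does not specify the estimator, so this should not be read as the authors' exact formulation.

```python
def posterior_weights(probs, sampled, alpha=1.0):
    """Posterior mean over tokens after observing one sampled token.

    Assumes a Dirichlet prior with parameters alpha * probs (the predicted
    next-token distribution) and a single categorical observation.
    """
    total = alpha + 1.0
    return [(alpha * p + (1.0 if i == sampled else 0.0)) / total
            for i, p in enumerate(probs)]


def mixed_input(embeddings, weights):
    """Weighted average of token embeddings, replacing the one-hot lookup."""
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings (assumed values).
probs = [0.5, 0.3, 0.2]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = posterior_weights(probs, sampled=1, alpha=1.0)
next_input = mixed_input(embeddings, weights)
```

Under these assumptions, a small `alpha` keeps the new input close to the sampled token's embedding, while a large `alpha` keeps it close to the expectation under the predicted distribution.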
[LINK]http://arxiv.org/abs/2505.14827v3
[DATE]2025-10-23 03:40:00+08:00
[CATEGORIES]cs.CL
LLM-Augmented Symbolic NLU System for More Reliable Continuous Causal Statement Interpretation
[AUTHORS]Xin Lian, Kenneth D. Forbus
[ABSTRACT]Despite the broad applicability of large language models (LLMs), their
reliance on probabilistic inference makes them vulnerable to errors such as
hallucination in generated facts and inconsistent output structure in natural
language understanding (NLU) tasks. By contrast, symbolic NLU systems provide
interpretable understanding grounded in curated lexicons, semantic resources,
and syntactic & semantic interpretation rules. They produce relational
representations that can be used for accurate reasoning and planning, as well
as incremental debuggable learning. However, symbolic NLU systems tend to be
more limited in coverage than LLMs and require scarce knowledge representation
and linguistics skills to extend and maintain. This paper explores a hybrid
approach that integrates the broad-coverage language processing of LLMs with
the symbolic NLU capabilities of producing structured relational
representations, aiming to get the best of both approaches. We use LLMs for
rephrasing and text simplification, to provide broad coverage, and as a source
of information to fill in knowledge gaps more automatically. We use symbolic
NLU to produce representations that can be used for reasoning and for
incremental learning. We evaluate this approach on the task of extracting and
interpreting quantities and causal laws from commonsense science texts, along
with symbolic- and LLM-only pipelines. Our results suggest that our hybrid
method works significantly better than the symbolic-only pipeline.
[COMMENTS]18 pages, 2 figures
[LINK]http://arxiv.org/abs/2510.19988v1
[DATE]2025-10-23 03:38:20+08:00
[CATEGORIES]cs.CL
LyriCAR: A Difficulty-Aware Curriculum Reinforcement Learning Framework For Controllable Lyric Translation
[AUTHORS]Le Ren, Xiangjian Zeng, Qingqiang Wu, Ruoxuan Liang
[ABSTRACT]Lyric translation is a challenging task that requires balancing multiple
musical constraints. Existing methods often rely on hand-crafted rules and
sentence-level modeling, which restrict their ability to internalize
musical-linguistic patterns and to generalize effectively at the paragraph
level, where cross-line coherence and global rhyme are crucial. In this work,
we propose LyriCAR, a novel framework for controllable lyric translation that
operates in a fully unsupervised manner. LyriCAR introduces a difficulty-aware
curriculum designer and an adaptive curriculum strategy, ensuring efficient
allocation of training resources, accelerating convergence, and improving
overall translation quality by guiding the model with increasingly complex
challenges. Extensive experiments on the EN-ZH lyric translation task show that
LyriCAR achieves state-of-the-art results across both standard translation
metrics and multi-dimensional reward scores, surpassing strong baselines.
Notably, the adaptive curriculum strategy reduces training steps by nearly 40%
while maintaining superior performance. Code, data and model can be accessed at
https://github.com/rle27/LyriCAR.
[COMMENTS]submitted to ICASSP 2026
[LINK]http://arxiv.org/abs/2510.19967v1
[DATE]2025-10-23 02:57:20+08:00
[CATEGORIES]cs.CL cs.LG
BugPilot: Complex Bug Generation for Efficient Learning of SWE Skills
[AUTHORS]Atharv Sonwane, Isadora White, Hyunji Lee, Matheus Pereira, Lucas Caccia, Minseon Kim, Zhengyan Shi, Chinmay Singh, Alessandro Sordoni, Marc-Alexandre Côté, Xingdi Yuan
[ABSTRACT]High-quality bugs are key to training the next generation of
language-model-based software engineering (SWE) agents. We introduce a novel method for
synthetic generation of difficult and diverse bugs. Our method instructs SWE
Agents to introduce a feature into the codebase whereby they may
unintentionally break tests, resulting in bugs. Prior approaches often induce
an out-of-distribution effect by generating bugs intentionally (e.g. by
introducing local perturbation to existing code), which does not reflect
realistic development processes. We perform qualitative analysis to demonstrate
that our approach for generating bugs more closely reflects the patterns found
in human-authored edits. Through extensive experiments, we demonstrate that our
bugs provide more efficient training data for supervised fine-tuning,
outperforming other bug datasets by 2% with half the training data (1.2k vs. 3k
bugs). We train on our newly generated bugs in addition to existing bug
datasets to get FrogBoss, a state-of-the-art 32B-parameter model with a pass@1
of 54.6% on SWE-bench Verified, and FrogMini, a state-of-the-art 14B model with
a pass@1 of 45.3% on SWE-bench Verified, averaged over three seeds.
[LINK]http://arxiv.org/abs/2510.19898v1
[DATE]2025-10-23 01:58:56+08:00
[CATEGORIES]cs.CL
LoRA vs Full Fine-tuning: An Illusion of Equivalence
[AUTHORS]Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma
[ABSTRACT]Fine-tuning is a crucial paradigm for adapting pre-trained large language
models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA)
have been shown to effectively fine-tune LLMs with an extreme reduction in
trainable parameters. But are their learned solutions really equivalent? We
study how LoRA and full fine-tuning change pre-trained models by
analyzing the model’s weight matrices through the lens of their spectral
properties. We find that LoRA and full fine-tuning yield weight matrices whose
singular value decompositions exhibit very different structure: weight matrices
trained with LoRA have new, high-ranking singular vectors, which we call
intruder dimensions, while those trained with full fine-tuning do not.
Further, we extend the finding that LoRA forgets less than full fine-tuning and
find its forgetting is largely localized to the intruder dimensions – by
causally intervening on the intruder dimensions by changing their associated
singular values post-fine-tuning, we show that they cause forgetting. Moreover,
scaling them down significantly improves modeling of the pre-training
distribution with a minimal drop in downstream task performance. Given this, we
should expect accumulating intruder dimensions to be harmful and lead to more
forgetting. This will be amplified during continual learning because of
sequential fine-tuning, and we show that LoRA models do accumulate intruder
dimensions here and tend to perform worse in this setting, emphasizing the
practicality of our findings.
[LINK]http://arxiv.org/abs/2410.21228v3
[DATE]2025-10-23 01:58:00+08:00
[CATEGORIES]cs.LG cs.CL
olmOCR 2: Unit Test Rewards for Document OCR
[AUTHORS]Jake Poznanski, Luca Soldaini, Kyle Lo
[ABSTRACT]We present olmOCR 2, the latest in our family of powerful OCR systems for
converting digitized print documents, like PDFs, into clean, naturally ordered
plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision
language model (VLM) trained using reinforcement learning with verifiable
rewards (RLVR), where our rewards are a diverse set of binary unit tests. To
scale unit test creation, we develop a pipeline for generating synthetic
documents with diverse and challenging layouts, known ground-truth HTML source
code, and extracted test cases. We show that RL training on these test cases
results in state-of-the-art performance on olmOCR-Bench, our English-language
OCR benchmark, with the largest improvements in math formula conversion, table
parsing, and multi-column layouts compared to previous versions. We release our
model, data and code under permissive open licenses.
[COMMENTS]https://olmocr.allen.ai/
[LINK]http://arxiv.org/abs/2510.19817v1
[DATE]2025-10-23 01:53:02+08:00
[CATEGORIES]cs.CL
Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM
[AUTHORS]Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu
[ABSTRACT]Large Language Models are typically trained on datasets collected from the
web, which may inadvertently contain harmful or sensitive personal information.
To address growing privacy concerns, unlearning methods have been proposed to
remove the influence of specific data from trained models. Of these, exact
unlearning – which retrains the model from scratch without the target data –
is widely regarded as the gold standard for mitigating privacy risks in
deployment. In this paper, we revisit this assumption in a practical deployment
setting where both the pre- and post-unlearning logits APIs are exposed, such as
in open-weight scenarios. Targeting this setting, we introduce a novel data
extraction attack that leverages signals from the pre-unlearning model to guide
the post-unlearning model, uncovering patterns that reflect the removed data
distribution. Combining model guidance with a token filtering strategy, our
attack significantly improves extraction success rates – doubling performance
in some cases – across common benchmarks such as MUSE, TOFU, and WMDP.
Furthermore, we demonstrate our attack’s effectiveness on a simulated medical
diagnosis dataset to highlight real-world privacy risks associated with exact
unlearning. In light of our findings, which suggest that unlearning may, in a
contradictory way, increase the risk of privacy leakage during real-world
deployments, we advocate for evaluation of unlearning methods to consider
broader threat models that account not only for post-unlearning models but also
for adversarial access to prior checkpoints. Code is publicly available at:
https://github.com/Nicholas0228/unlearned_data_extraction_llm.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.24379v3
[DATE]2025-10-23 01:51:21+08:00
[CATEGORIES]cs.LG cs.CL
Hubble: a Model Suite to Advance the Study of LLM Memorization
[AUTHORS]Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia
[ABSTRACT]We present Hubble, a suite of fully open-source large language models (LLMs)
for the scientific study of LLM memorization. Hubble models come in standard
and perturbed variants: standard models are pretrained on a large English
corpus, and perturbed models are trained in the same way but with controlled
insertion of text (e.g., book passages, biographies, and test sets) designed to
emulate key memorization risks. Our core release includes 8 models – standard
and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B
tokens – establishing that memorization risks are determined by the frequency
of sensitive data relative to the size of the training corpus (i.e., a password
appearing once in a smaller corpus is memorized better than the same password
in a larger corpus). Our release also includes 6 perturbed models with text
inserted at different pretraining phases, showing that sensitive data without
continued exposure can be forgotten. These findings suggest two best practices
for addressing memorization risks: to dilute sensitive data by increasing the
size of the training corpus, and to order sensitive data to appear earlier in
training. Beyond these general empirical findings, Hubble enables a broad range
of memorization research; for example, analyzing the biographies reveals how
readily different types of private information are memorized. We also
demonstrate that the randomized insertions in Hubble make it an ideal testbed
for membership inference and machine unlearning, and invite the community to
further explore, benchmark, and build upon our work.
[LINK]http://arxiv.org/abs/2510.19811v1
[DATE]2025-10-23 01:48:23+08:00
[CATEGORIES]cs.CL cs.LG
SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models
[AUTHORS]Gyubeum Lim, Yemo Koo, Vijay Krishna Madisetti
[ABSTRACT]Understanding long-context visual information remains a fundamental challenge
for vision-language models, particularly in agentic tasks such as GUI control
and web navigation. While web pages and GUI environments are inherently
structured documents, current VLMs typically neglect decision-oriented document
understanding in their training objectives. Existing approaches primarily
extend visual embeddings to process long, high-resolution inputs, but these
methods are memory-intensive and impractical for locally deployable solutions.
To address these issues, we propose SCoPE VLM, a document navigation expert
that leverages a novel Chain of Scroll mechanism to selectively and recursively
navigate documents, focusing exclusively on relevant segments. We introduce a
dedicated data generation pipeline to construct informative Chain of Scroll
trajectories and Episodic Group Relative Policy Optimization, a tailored
reinforcement learning method to reduce the gap between training and inference.
Our method substantially reduces memory usage and effectively models human-like
reading behaviors. To the best of our knowledge, SCoPE VLM is the first
framework to explicitly model agentic reading patterns in multi-page document
question answering, advancing the capabilities of multimodal agents.
[LINK]http://arxiv.org/abs/2510.21850v1
[DATE]2025-10-23 01:47:12+08:00
[CATEGORIES]cs.CL
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
[AUTHORS]Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan
[ABSTRACT]Recent advances in multimodal models have demonstrated remarkable text-guided
image editing capabilities, with systems like GPT-4o and Nano-Banana setting
new benchmarks. However, the research community’s progress remains constrained
by the absence of large-scale, high-quality, and openly accessible datasets
built from real images. We introduce Pico-Banana-400K, a comprehensive
400K-image dataset for instruction-based image editing. Our dataset is
constructed by leveraging Nano-Banana to generate diverse edit pairs from real
photographs in the OpenImages collection. What distinguishes Pico-Banana-400K
from previous synthetic datasets is our systematic approach to quality and
diversity. We employ a fine-grained image editing taxonomy to ensure
comprehensive coverage of edit types while maintaining precise content
preservation and instruction faithfulness through MLLM-based quality scoring
and careful curation. Beyond single turn editing, Pico-Banana-400K enables
research into complex editing scenarios. The dataset includes three specialized
subsets: (1) a 72K-example multi-turn collection for studying sequential
editing, reasoning, and planning across consecutive modifications; (2) a
56K-example preference subset for alignment research and reward model training;
and (3) paired long-short editing instructions for developing instruction
rewriting and summarization capabilities. By providing this large-scale,
high-quality, and task-rich resource, Pico-Banana-400K establishes a robust
foundation for training and benchmarking the next generation of text-guided
image editing models.
[LINK]http://arxiv.org/abs/2510.19808v1
[DATE]2025-10-23 01:43:15+08:00
[CATEGORIES]cs.CL cs.LG
Large Language Model enabled Mathematical Modeling
[AUTHORS]Guoyun Zhang
[ABSTRACT]The integration of Large Language Models (LLMs) with optimization modeling
offers a promising avenue for advancing decision-making in operations research
(OR). Traditional optimization methods, such as linear programming, mixed
integer programming, and simulation, depend heavily on domain expertise to
translate real-world problems into solvable mathematical models. While solvers
like Gurobi and COPT are powerful, expert input remains essential for defining
objectives, constraints, and variables. This research investigates the
potential of LLMs, specifically the DeepSeek-R1 model, to bridge this
formulation gap using natural language understanding and code generation.
Although prior models like GPT-4, Claude, and Bard have shown strong
performance in NLP and reasoning tasks, their high token costs and tendency
toward hallucinations limit real-world applicability in supply chain contexts.
In contrast, DeepSeek-R1, a cost-efficient and high-performing model trained
with reinforcement learning, presents a viable alternative. Despite its success
in benchmarks such as LiveCodeBench and Math-500, its effectiveness in applied
OR scenarios remains underexplored. This study systematically evaluates
DeepSeek-R1 across four key OR benchmarks: NL4OPT, IndustryOR, EasyLP, and
ComplexOR. Our methodology includes baseline assessments, the development of a
hallucination taxonomy, and the application of mitigation strategies like
LLM-as-a-Judge, Few-shot Learning (FSL), Tool Calling, and a Multi-agent
Framework. These techniques aim to reduce hallucinations, enhance formulation
accuracy, and better align model outputs with user intent.
[LINK]http://arxiv.org/abs/2510.19895v1
[DATE]2025-10-23 01:41:42+08:00
[CATEGORIES]cs.CL
The Art of Asking: Multilingual Prompt Optimization for Synthetic Data
[AUTHORS]David Mora, Viraat Aryabumi, Wei-Yin Ko, Sara Hooker, Julia Kreutzer, Marzieh Fadaee
[ABSTRACT]Synthetic data has become a cornerstone for scaling large language models,
yet its multilingual use remains bottlenecked by translation-based prompts.
This strategy inherits English-centric framing and style and neglects cultural
dimensions, ultimately constraining model generalization. We argue that the
overlooked prompt space, the very inputs that define training
distributions, offers a more powerful lever for improving multilingual
performance. We introduce a lightweight framework for prompt-space
optimization, where translated prompts are systematically transformed for
Naturalness, Cultural Adaptation, and Difficulty Enhancement. Using an
off-the-shelf multilingual LLM, we apply these transformations to prompts for
12 languages spanning 7 families. Under identical data conditions, our
approaches achieve substantial and consistent downstream improvements over the
translation-only baseline: +4.7% on Global-MMLU accuracy, +2.4% on Flores
XCometXL and +35.3% wins in preferences on mArenaHard. We establish
prompt-space optimization as a simple yet powerful paradigm for building
multilingual LLMs that are more robust, culturally grounded, and globally
capable.
[LINK]http://arxiv.org/abs/2510.19806v1
[DATE]2025-10-23 01:41:20+08:00
[CATEGORIES]cs.CL
Blackbox Model Provenance via Palimpsestic Membership Inference
[AUTHORS]Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Christopher Potts, Percy Liang
[ABSTRACT]Suppose Alice trains an open-weight language model and Bob uses a blackbox
derivative of Alice’s model to produce text. Can Alice prove that Bob is using
her model, either by querying Bob’s derivative model (query setting) or from
the text alone (observational setting)? We formulate this question as an
independence testing problem – in which the null hypothesis is that Bob’s model
or text is independent of Alice’s randomized training run – and investigate it
through the lens of palimpsestic memorization in language models: models are
more likely to memorize data seen later in training, so we can test whether Bob
is using Alice’s model using test statistics that capture correlation between
Bob’s model or text and the ordering of training examples in Alice’s training
run. If Alice has randomly shuffled her training data, then any significant
correlation amounts to exactly quantifiable statistical evidence against the
null hypothesis, regardless of the composition of Alice’s training data. In the
query setting, we directly estimate (via prompting) the likelihood Bob’s model
gives to Alice’s training examples and order; we correlate the likelihoods of
over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to
12B parameters with the base model’s training data order, achieving a p-value
on the order of at most 1e-8 in all but six cases. In the observational
setting, we try two approaches based on estimating 1) the likelihood of Bob’s
text overlapping with spans of Alice’s training examples and 2) the likelihood
of Bob’s text with respect to different versions of Alice’s model we obtain by
repeating the last phase (e.g., 1%) of her training run on reshuffled data. The
second approach can reliably distinguish Bob’s text from as little as a few
hundred tokens; the first does not involve any retraining but requires many
more tokens (several hundred thousand) to achieve high power.
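
The query-setting idea (correlating per-example likelihoods with training order) can be illustrated with a rank correlation and a permutation test, as sketched below. The choice of Spearman correlation, the one-sided permutation p-value, and all function names are assumptions made for illustration and are not taken from the paper; ties in the inputs are broken arbitrarily by sort order.

```python
import random


def ranks(values):
    """Return the 0-based rank of each value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation of two equal-length sequences (length >= 2)."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var


def permutation_p_value(order, loglik, trials=2000, seed=0):
    """One-sided p-value for 'later examples get higher likelihood'.

    Under the null hypothesis the training order is an independent random
    shuffle, so reshuffling the order gives the null distribution of the statistic.
    """
    rng = random.Random(seed)
    observed = spearman(order, loglik)
    shuffled = list(order)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if spearman(shuffled, loglik) >= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)


# Toy data (assumed): likelihood grows with position in the training run.
order = list(range(20))
loglik = [0.1 * position for position in order]
rho, p_value = permutation_p_value(order, loglik)
```

The add-one correction in the returned p-value keeps it strictly positive, so its smallest possible value is bounded by the number of permutation trials.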
[LINK]http://arxiv.org/abs/2510.19796v1
[DATE]2025-10-23 01:30:39+08:00
[CATEGORIES]cs.LG cs.CL
ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
[AUTHORS]Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
[ABSTRACT]Tool calling has become increasingly popular for Large Language Models
(LLMs). However, for large tool sets, the resulting tokens would exceed the
LLM’s context window limit, making it impossible to include every tool. Hence,
an external retriever is used to provide LLMs with the most relevant tools for
a query. Existing retrieval models rank tools based on the similarity between a
user query and a tool description (TD). This leads to suboptimal retrieval as
user requests are often poorly aligned with the language of TD. To remedy the
issue, we propose ToolDreamer, a framework to condition retriever models to
fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e.,
descriptions of tools that the LLM feels will be potentially useful for the
query. The framework enables a more natural alignment between queries and tools
within the language space of TDs. We apply ToolDreamer to the ToolRet dataset
and show that our method improves the performance of sparse and dense
retrievers with and without training, thus showcasing its flexibility. Through
our proposed framework, our aim is to offload a portion of the reasoning burden
to the retriever so that the LLM may effectively handle a large collection of
tools without inundating its context window.
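
The difference between retrieving with the raw query and retrieving with a hypothetical tool description can be sketched with a toy bag-of-words retriever. Everything below is illustrative: the tool names, the descriptions, and the hand-written hypothetical TD (which stands in for LLM output) are assumptions, and a real system would use a trained sparse or dense retriever.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def retrieve(query_texts, tools, top_k=1):
    """Rank tools by their best similarity to any of the query texts."""
    scored = [(max(cosine(q, description) for q in query_texts), name)
              for name, description in tools.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical tool set and query, for illustration only.
tools = {
    "weather_api": "returns the weather forecast for a city",
    "currency_converter": "converts an amount between two currencies",
}
query = "do I need an umbrella in Paris tomorrow"
# Stand-in for an LLM-generated hypothetical tool description.
hypothetical_td = ["returns the weather forecast for a given city and date"]

by_query = retrieve([query], tools)        # ['currency_converter']
by_td = retrieve(hypothetical_td, tools)   # ['weather_api']
```

In this toy case the raw query shares only an incidental word with the wrong tool, while the hypothetical description is phrased in the same vocabulary as the correct tool's description.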
[LINK]http://arxiv.org/abs/2510.19791v1
[DATE]2025-10-23 01:26:05+08:00
[CATEGORIES]cs.CL
Can They Dixit? Yes they Can! Dixit as a Playground for Multimodal Language Model Capabilities
[AUTHORS]Nishant Balepur, Dang Nguyen, Dayeon Ki
[COMMENTS]Accepted as a Spotlight paper at the EMNLP 2025 Wordplay Workshop
[LINK]http://arxiv.org/abs/2510.19892v1
[DATE]2025-10-23 01:21:16+08:00
[CATEGORIES]cs.CL
AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
[AUTHORS]Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
[ABSTRACT]Speculative Decoding (SD) accelerates large language model inference by
employing a small draft model to generate predictions, which are then verified
by a larger target model. The effectiveness of SD hinges on the alignment
between these models, which is typically enhanced by Knowledge Distillation
(KD). However, conventional KD methods aim to minimize the KL divergence
between the draft and target models across all tokens, a goal that is
misaligned with the true objective of SD, which is to maximize token acceptance
rate. Therefore, draft models often struggle to fully assimilate the target
model’s knowledge due to capacity constraints, leading to suboptimal
performance. To address this challenge, we propose AdaSPEC, a novel method that
incorporates selective token filtering into the KD process. AdaSPEC utilizes a
reference model to identify and filter out difficult-to-fit tokens, enabling
the distillation of a draft model that better aligns with the target model on
simpler tokens. This approach improves the overall token acceptance rate
without compromising generation quality. We evaluate AdaSPEC across diverse
tasks, including arithmetic reasoning, instruction-following, coding, and
summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters.
Our results demonstrate that AdaSPEC consistently outperforms the
state-of-the-art DistillSpec method, achieving higher acceptance rates across
all tasks (up to 15%). The code is publicly available at
https://github.com/yuezhouhu/adaspec.
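
The gap between minimizing KL divergence and maximizing token acceptance can be illustrated with a small sketch. It assumes the standard speculative-sampling acceptance rule, under which a drafted token is accepted with probability sum_x min(p(x), q(x)) for target distribution p and draft distribution q; the abstract does not state which verification rule AdaSPEC uses, and the function names and toy distributions below are hypothetical.

```python
import math


def acceptance_rate(target, draft):
    """Per-token acceptance probability under standard speculative sampling.

    Assumption: the verifier accepts a drafted token x with probability
    min(1, p(x) / q(x)), which gives sum_x min(p(x), q(x)) overall.
    """
    return sum(min(p, q) for p, q in zip(target, draft))


def kl_divergence(target, draft):
    """KL(target || draft) in nats; zero-mass target entries contribute 0."""
    return sum(p * math.log(p / q) for p, q in zip(target, draft) if p > 0)


if __name__ == "__main__":
    target = [0.9, 0.1]    # toy next-token distribution of the target model
    draft_a = [0.99, 0.01]  # over-confident draft
    draft_b = [0.75, 0.25]  # under-confident draft
    for name, draft in (("A", draft_a), ("B", draft_b)):
        print(name, acceptance_rate(target, draft), kl_divergence(target, draft))
```

In this toy case draft B has the lower KL divergence, yet draft A has the higher acceptance rate, so ranking drafts by KL alone would pick the one that is accepted less often.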
[LINK]http://arxiv.org/abs/2510.19779v1
[DATE]2025-10-23 01:13:00+08:00
[CATEGORIES]cs.CL cs.LG
GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters
[AUTHORS]Anand Choudhary, Yasser Sulaıman, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Antoine Bosselut
[ABSTRACT]Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a
sparse subset of model parameters. However, the effectiveness of sparse
adaptation depends on optimally selecting the model parameters to be
fine-tuned. In this work, we introduce a novel sparse fine-tuning technique
named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which
fine-tunes only those model parameters which have the largest gradient
magnitudes on downstream tasks and the smallest pre-trained magnitudes,
intuitively prioritizing parameters that are highly task-relevant, but
minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3
8B and Gemma 2B as base models shows that GaLLoP consistently improves or
matches the in-distribution as well as out-of-distribution performance obtained
via the usage of other leading parameter-efficient fine-tuning techniques,
including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates
catastrophic forgetting and memorization of task data, as important pre-trained
parameters remain unchanged, and stabilizes performance relative to other
fine-tuning techniques, robustly generalizing across most random seeds.
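
The selection rule described above (large gradient magnitude, small pre-trained magnitude) can be sketched as follows. The abstract does not say how GaLLoP combines the two criteria, so the ratio score |g| / (|w| + eps) used here is an assumption, and `select_mask`, `density`, and `eps` are hypothetical names.

```python
def select_mask(weights, grads, density=0.25, eps=1e-8):
    """Return a 0/1 mask marking which parameters to fine-tune.

    Assumed score: |gradient| / (|pre-trained weight| + eps), so parameters
    with large gradients and small magnitudes rank highest. The top
    `density` fraction of parameters (at least one) is selected.
    """
    if len(weights) != len(grads):
        raise ValueError("weights and grads must have the same length")
    k = max(1, int(round(density * len(weights))))
    scores = [abs(g) / (abs(w) + eps) for w, g in zip(weights, grads)]
    # Indices sorted by descending score; ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    mask = [0] * len(weights)
    for i in ranked[:k]:
        mask[i] = 1
    return mask


if __name__ == "__main__":
    weights = [0.01, 2.0, 0.02, 1.5]  # toy pre-trained values
    grads = [0.5, 0.6, 0.01, 0.01]    # toy downstream gradients
    print(select_mask(weights, grads, density=0.5))
```

During training, a mask like this would be multiplied into the gradient update so that unselected parameters keep their pre-trained values.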
[LINK]http://arxiv.org/abs/2510.19778v1
[DATE]2025-10-23 01:11:49+08:00
[CATEGORIES]cs.LG cs.CL
WikiVideo: Article Generation from Multiple Videos
[AUTHORS]Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme
[ABSTRACT]We introduce the task of grounded article generation with the goal of
creating a Wikipedia-style article from multiple diverse videos about
real-world events – from natural disasters to political elections – where all
the information in the article is supported by video evidence. Videos are
intuitive sources for retrieval-augmented generation (RAG), but most
contemporary RAG workflows focus heavily on text while existing methods for
video-based summarization focus on low-level scene understanding rather than
high-level event semantics. To close this gap, we introduce WikiVideo, a
benchmark consisting of expert-written articles and densely annotated videos
that provide evidence for articles’ claims, facilitating the integration of
video into RAG pipelines and enabling the creation of in-depth content that is
grounded in multimodal sources. We further propose Collaborative Article
Generation (CAG), a novel interactive method for article creation from multiple
videos. CAG leverages an iterative interaction between an r1-style reasoning
model and a VideoLLM to draw higher-level inferences about the target event
than is possible with VideoLLMs alone, which fixate on low-level visual
features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle
retrieval and RAG settings and find that CAG consistently outperforms
alternative methods, while suggesting intriguing avenues for future work.
[COMMENTS]Repo can be found here: https://github.com/alexmartin1722/wikivideo
[LINK]http://arxiv.org/abs/2504.00939v2
[DATE]2025-10-23 00:17:16+08:00
[CATEGORIES]cs.CL
The Coverage Principle: How Pre-Training Enables Post-Training
[AUTHORS]Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, Dylan J. Foster
[ABSTRACT]Language models demonstrate remarkable abilities when pre-trained on large
text corpora and fine-tuned for specific tasks, but how and why pre-training
shapes the success of the final model remains poorly understood. Notably,
although pre-training success is often quantified by cross-entropy loss,
cross-entropy can be a poor predictor of downstream performance. Instead, we
provide a theoretical perspective on this relationship through the lens of
coverage, which quantifies the probability mass the pre-trained model
places on high-quality responses and which is necessary and sufficient for
post-training and test-time scaling methods such as Best-of-N to succeed. Our
main results develop an understanding of the coverage principle, a
phenomenon whereby next-token prediction (more generally, maximum likelihood)
implicitly optimizes toward a model with good coverage. In particular, we
uncover a mechanism that explains the power of coverage in predicting
downstream performance: coverage generalizes faster than cross-entropy,
avoiding spurious dependence on problem-dependent parameters such as the
sequence length. We also study practical algorithmic interventions with
provable benefits for improving coverage, including (i) model/checkpoint
selection procedures, (ii) gradient normalization schemes, and (iii) test-time
decoding strategies.
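
The link between coverage and Best-of-N can be made concrete with a simplified calculation. Assume the model places probability mass p on high-quality responses and that N responses are sampled independently; then at least one sample is high-quality with probability 1 - (1 - p)^N. This is a textbook i.i.d. illustration, not the paper's formal definition of coverage, and the function names are hypothetical.

```python
import math


def best_of_n_success(p, n):
    """Probability that at least one of n i.i.d. samples is high-quality,
    given mass p on high-quality responses (simplifying assumption)."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p, target):
    """Smallest n with best_of_n_success(p, n) >= target, for 0 < p, target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01):
        print(p, samples_needed(p, 0.9))
```

The required N grows roughly like 1/p as the mass on good responses shrinks, which is why a model with poor coverage cannot be rescued by a modest sampling budget.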
[LINK]http://arxiv.org/abs/2510.15020v2
[DATE]2025-10-23 00:15:08+08:00
[CATEGORIES]cs.CL cs.LG
GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks
[AUTHORS]Varvara Krechetova, Denis Kochedykov
[ABSTRACT]This paper establishes a benchmark for evaluating tool-calling capabilities
of large language models (LLMs) on multi-step geospatial tasks relevant to
commercial GIS practitioners. We assess eight commercial LLMs (Claude Sonnet
3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o,
GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23
geospatial functions. Our benchmark comprises tasks in four categories of
increasing complexity, with both solvable and intentionally unsolvable tasks to
test rejection accuracy. We develop an LLM-as-Judge evaluation framework to
compare agent solutions against reference solutions. Results show that o4-mini and
Claude 3.5 Sonnet achieve the best overall performance. OpenAI’s GPT-4.1,
GPT-4o, and Google’s Gemini 2.5 Pro Preview do not fall far behind, but the last
two are more efficient in identifying unsolvable tasks. Claude Sonnet 4, due to
its preference to provide any solution rather than reject a task, proved to be
less accurate. We observe significant differences in token usage, with
Anthropic models consuming more tokens than competitors. Common errors include
misunderstanding geometrical relationships, relying on outdated knowledge, and
inefficient data manipulation. The resulting benchmark set, evaluation
framework, and data generation pipeline are released as open-source resources
(available at https://github.com/Solirinai/GeoBenchX), providing one more
standardized method for the ongoing evaluation of LLMs for GeoAI.
[COMMENTS]Github with code and benchmark set:
https://github.com/Solirinai/GeoBenchX
[LINK]http://arxiv.org/abs/2503.18129v2
[DATE]2025-10-23 00:12:30+08:00
[CATEGORIES]cs.CL
From Answers to Guidance: A Proactive Dialogue System for Legal Documents
[AUTHORS]Ashish Chouhan, Michael Gertz
[ABSTRACT]The accessibility of legal information remains a constant challenge,
particularly for laypersons seeking to understand and apply complex
institutional texts. While the European Union provides open access to
legislation, parliamentary responses, and regulatory documents, these resources
can be challenging for laypeople to explore. In this paper, we introduce
EUDial, a proactive multi-turn dialogue dataset constructed from 204 blogs
curated by the Citizens’ Enquiries Unit (AskEP) of the European Parliamentary
Research Service. EUDial contains 880 dialogue turns (averaging 4.3 turns per
dialogue), where each dialogue includes initial questions, structured answers,
and follow-up questions. Beyond dataset construction, we propose the LexGuide
framework that leverages retrieval-augmented generation with hierarchical topic
organization to structure dialogue progression, ensuring both comprehensive
coverage of legal aspects and coherence across conversational turns. The
results demonstrate that proactive, structured navigation closes the gap
between the availability of legal information and citizen comprehension,
establishing EUDial and LexGuide as practical resources for advancing proactive
legal dialogue systems.
[COMMENTS]21 pages, 3 figures, 2 tables, 2 prompts
[LINK]http://arxiv.org/abs/2510.19723v1
[DATE]2025-10-23 00:08:05+08:00
[CATEGORIES]cs.CL
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
[AUTHORS]Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
[ABSTRACT]Reward Models (RMs) are crucial to aligning large language models (LLMs), but
the degree to which an RM specialized to one task (e.g. writing) generalizes to
new tasks (e.g. math) is often not known a priori, which makes training LLMs
with a single fixed RM suboptimal. However, optimizing LLMs with multiple RMs
simultaneously can incur a prohibitively high computational cost and lead to
conflicting signals from different RMs that may degrade performance. To address
these challenges, we introduce LASeR (Learning to Adaptively Select Rewards),
which frames reward model selection as a multi-armed bandit problem,
efficiently and iteratively training LLMs using multiple RMs by selecting the
most well-suited RM for each instance. On commonsense and math reasoning tasks,
we show that LASeR boosts iterative LLM training, improving the absolute
average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of
RM scores while also showing superior efficiency (e.g., a 2x speedup).
Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to
a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to
long-context generation, LASeR improves by 2.96 F1 points (avg.) on
single-document QA tasks and 2.97 F1 points on few-shot learning over the RM
score ensemble baseline with best-of-n sampling.
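
Framing reward-model selection as a bandit problem can be sketched with a generic UCB1 selector, where each arm stands for one reward model. The abstract does not name the bandit algorithm LASeR uses, and LASeR selects an RM per instance, whereas this sketch is non-contextual; the class, the simulated Bernoulli feedback, and all parameter names are hypothetical.

```python
import math
import random


class UCB1Selector:
    """Generic UCB1 bandit; each arm stands for one reward model."""

    def __init__(self, n_arms, c=2.0):
        self.counts = [0] * n_arms   # times each arm was chosen
        self.means = [0.0] * n_arms  # running mean feedback per arm
        self.c = c                   # exploration constant

    def select(self):
        # Try every arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        ucb = [m + math.sqrt(self.c * math.log(total) / n)
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=lambda a: ucb[a])

    def update(self, arm, feedback):
        # Incremental mean update for the chosen arm.
        self.counts[arm] += 1
        self.means[arm] += (feedback - self.means[arm]) / self.counts[arm]


def simulate(true_means, steps=2000, seed=0):
    """Simulated loop: Bernoulli feedback stands in for the training signal
    obtained after using the selected reward model."""
    rng = random.Random(seed)
    selector = UCB1Selector(len(true_means))
    for _ in range(steps):
        arm = selector.select()
        feedback = 1.0 if rng.random() < true_means[arm] else 0.0
        selector.update(arm, feedback)
    return selector.counts


if __name__ == "__main__":
    print(simulate([0.2, 0.8, 0.5]))
```

Only one reward model is queried per step, so the cost of scoring with every RM is avoided while the selector still concentrates on the arm that gives the most useful feedback.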
[COMMENTS]NeurIPS 2025 camera-ready. First two authors contributed equally.
Code: https://github.com/duykhuongnguyen/LASeR-MAB
[LINK]http://arxiv.org/abs/2410.01735v3
[DATE]2025-10-23 00:01:03+08:00
[CATEGORIES]cs.CL cs.LG
Are Large Language Models Sensitive to the Motives Behind Communication?
[AUTHORS]Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.19687v1
[DATE]2025-10-22 23:35:00+08:00
[CATEGORIES]cs.CL cs.LG
metaTextGrad: Automatically optimizing language model optimizers
[AUTHORS]Guowei Xu, Mert Yuksekgonul, Carlos Guestrin, James Zou
[ABSTRACT]Large language models (LLMs) are increasingly used in learning algorithms,
evaluations, and optimization tasks. Recent studies have shown that using
LLM-based optimizers to automatically optimize model prompts, demonstrations,
predictions themselves, or other components can significantly enhance the
performance of AI systems, as demonstrated by frameworks such as DSPy and
TextGrad. However, optimizers built on language models themselves are usually
designed by humans with manual design choices; optimizers themselves are not
optimized. Moreover, these optimizers are general purpose by design, to be
useful to a broad audience, and are not tailored for specific tasks. To address
these challenges, we propose metaTextGrad, which focuses on designing a
meta-optimizer to further enhance existing optimizers and align them to be good
optimizers for a given task. Our approach consists of two key components: a
meta prompt optimizer and a meta structure optimizer. The combination of these
two significantly improves performance across multiple benchmarks, achieving an
average absolute performance improvement of up to 6% compared to the best
baseline.
[COMMENTS]21 pages, 2 figures
[LINK]http://arxiv.org/abs/2505.18524v2
[DATE]2025-10-22 23:27:06+08:00
[CATEGORIES]cs.CL
Test-time Prompt Intervention
[AUTHORS]Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, Weiping Wang
[ABSTRACT]Test-time compute has led to remarkable success in the large language model
(LLM) community, particularly for complex tasks, where longer chains of thought
(CoTs) are generated to enhance reasoning capabilities. However, growing
evidence reveals that such reasoning models often produce CoTs plagued by
excessive redundancy, including unnecessary verification steps and repetitive
reasoning shifts. The root cause lies in their post-training, which relies
heavily on outcome reward paradigms because data for process reward paradigms,
which regulate intermediate reasoning steps, is difficult to construct at scale. To
address this, we propose PI, a novel framework for Test-time Prompt
Intervention. PI provides an interface to dynamically guide and regulate
reasoning paths during inference through timely (When module) and proper (How
module) interventions and post-intervention sampling (Which module). This
allows human problem-solving expertise and cognitive science principles to be
seamlessly integrated into LLMs’ reasoning processes, enhancing controllability
and interpretability. Extensive experiments across multiple models and datasets
demonstrate that PI significantly shortens CoTs while reducing hallucination,
yielding more concise and reliable reasoning.
[COMMENTS]24 pages, 20 figures, under review
[LINK]http://arxiv.org/abs/2508.02511v2
[DATE]2025-10-22 23:27:03+08:00
[CATEGORIES]cs.CL
CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
[AUTHORS]Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
[ABSTRACT]We present CoSense-LLM, an edge-first framework that turns continuous
multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and
lightweight vision) into compact, verifiable semantic tokens and coordinates
with large language models under explicit latency, energy, bandwidth, and
privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight
encoder that aligns sensor embeddings with language and compresses them into
short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer
that grounds generation in site-specific policies and notes; (iii)
PromptRouter, a cost- and uncertainty-aware policy that selects edge-only
generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure
Execution, an auditable redaction path that enforces data minimization so raw
waveforms never leave the device. The system works with modern serving
optimizations, including paged or streaming KV caches, FlashAttention-style
kernels, speculative decoding, and quantized LoRA adapters, and supports
on-device personalization and federated updates under non-IID drift. Across home,
office, and clinic deployments, CoSense-LLM delivers grounded explanations
while meeting tight service-level objectives: it sustains sub-second (p95)
end-to-end latency on edge-dominant paths, reduces inter-tier token and bandwidth
costs by preferring local retrieval-grounded responses, and preserves privacy
by transmitting only discrete codes and redacted metadata. Ablations show that
Edge-RAG improves factual consistency and reduces contradictions, calibrated
uncertainty enables selective abstention and controlled escalations, and KV
plus decoding accelerators lower energy per decision. The results support an
edge-first design that treats semantics, privacy, and predictable latency as
co-equal goals for large model deployments in interference-prone environments.
[COMMENTS]19 pages, 8 figures
[LINK]http://arxiv.org/abs/2510.19670v1
[DATE]2025-10-22 23:16:56+08:00
[CATEGORIES]cs.CL
Unraveling Emotions with Pre-Trained Models
[AUTHORS]Alejandro Pajón-Sanmartín, Francisco De Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo-Rial
[ABSTRACT]Transformer models have significantly advanced the field of emotion
recognition. However, there are still open challenges when exploring open-ended
queries for Large Language Models (LLMs). Although current models offer good
results, automatic emotion analysis in open texts presents significant
challenges, such as contextual ambiguity, linguistic variability, and
difficulty interpreting complex emotional expressions. These limitations make
the direct application of generalist models difficult. Accordingly, this work
compares the effectiveness of fine-tuning and prompt engineering in emotion
detection in three distinct scenarios: (i) performance of fine-tuned
pre-trained models and general-purpose LLMs using simple prompts; (ii)
effectiveness of different emotion prompt designs with LLMs; and (iii) impact
of emotion grouping techniques on these models. Experimental tests attain
metrics above 70% with a fine-tuned pre-trained model for emotion recognition.
Moreover, the findings highlight that LLMs require structured prompt
engineering and emotion grouping to enhance their performance. These
advancements improve sentiment analysis, human-computer interaction, and
understanding of user behavior across various domains.
[LINK]http://arxiv.org/abs/2510.19668v1
[DATE]2025-10-22 23:13:52+08:00
[CATEGORIES]cs.CL
From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
[AUTHORS]Zhida Zhao, Talas Fu, Yifan Wang, Lijun Wang, Huchuan Lu
[ABSTRACT]Despite remarkable progress in driving world models, their potential for
autonomous systems remains largely untapped: the world models are mostly
learned for world simulation and decoupled from trajectory planning. While
recent efforts aim to unify world modeling and planning in a single framework,
the synergistic facilitation mechanism of world modeling for planning still
requires further exploration. In this work, we introduce a new driving paradigm
named Policy World Model (PWM), which not only integrates world modeling and
trajectory planning within a unified architecture, but is also able to benefit
planning using the learned world knowledge through the proposed action-free
future state forecasting scheme. Through collaborative state-action prediction,
PWM can mimic human-like anticipatory perception, yielding more reliable
planning performance. To facilitate the efficiency of video forecasting, we
further introduce a dynamically enhanced parallel token generation mechanism,
equipped with a context-guided tokenizer and an adaptive dynamic focal loss.
Despite utilizing only front camera input, our method matches or exceeds
state-of-the-art approaches that rely on multi-view and multi-modal inputs.
Code and model weights will be released at
https://github.com/6550Zhao/Policy-World-Model.
[COMMENTS]Accepted by NeurIPS 2025 (Poster)
[LINK]http://arxiv.org/abs/2510.19654v1
[DATE]2025-10-22 22:57:51+08:00
[CATEGORIES]cs.CL
LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation
[AUTHORS]Daria Cherniuk, Nikita Sukhorukov, Nikita Sushko, Daniil Gusak, Danil Sivtsov, Elena Tutubalina, Evgeny Frolov
[ABSTRACT]Retrieval-augmented generation has emerged as one of the most effective
approaches for code completion, particularly when context from a surrounding
repository is essential. However, incorporating context significantly extends
sequence length, leading to slower inference - a critical limitation for
interactive settings such as IDEs. In this work, we introduce LlavaCode, a
framework that compresses code into compact, semantically rich representations
interpretable by a code LLM, enhancing generation quality while reducing the
retrieved context to only a few compressed single-token vectors. Using a small
projector module, we can significantly increase the EM and ES metrics of the coding
model with a negligible latency increase. Our experiments demonstrate that
compressed context enables a 20-38% reduction in Time-to-First-Token (TTFT) on
line completion tasks compared to full-RAG pipelines.
[LINK]http://arxiv.org/abs/2510.19644v1
[DATE]2025-10-22 22:49:21+08:00
[CATEGORIES]cs.CL
Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
[AUTHORS]Yangshijie Zhang, Xinda Wang, Jialin Liu, Wenqiang Wang, Zhicong Ma, Xingxing Jia
[ABSTRACT]With social media growth, users employ stylistic fonts and font-like emoji to
express individuality, creating visually appealing text that remains
human-readable. However, these fonts introduce hidden vulnerabilities in NLP
models: while humans easily read stylistic text, models process these
characters as distinct tokens, causing interference. We identify this
human-model perception gap and propose a style-based attack, Style Attack
Disguise (SAD). We design two sizes: light for query efficiency and strong for
superior attack performance. Experiments on sentiment classification and
machine translation across traditional models, LLMs, and commercial services
demonstrate SAD’s strong attack performance. We also show SAD’s potential
threats to multimodal tasks including text-to-image and text-to-speech
generation.
[LINK]http://arxiv.org/abs/2510.19641v1
[DATE]2025-10-22 22:40:24+08:00
[CATEGORIES]cs.CL
dInfer: An Efficient Inference Framework for Diffusion Language Models
[AUTHORS]Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
[ABSTRACT]Diffusion-based large language models (dLLMs) have emerged as a promising
alternative to autoregressive (AR) LLMs, leveraging denoising-based generation
to enable inherent parallelism. Although more and more open-source dLLMs are
emerging, their widespread adoption remains constrained by the lack of a
standardized and efficient inference framework. We present dInfer, an efficient
and extensible framework for dLLM inference. dInfer decomposes the inference
pipeline into four modular components–model, diffusion iteration manager,
decoding strategy, and KV-cache manager–and integrates novel algorithms for
each component alongside system-level optimizations. Through this combination
of algorithmic innovations and system enhancements, dInfer achieves substantial
efficiency gains without compromising output quality on LLaDA-MoE. At batch
size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800
tokens per second across six benchmarks on 8× H800 GPUs. Compared to
prior systems, dInfer delivers a 10× speedup over Fast-dLLM while
maintaining similar model performance. Even compared to the AR model (with a
comparable number of activation parameters and performance) QWen2.5-3B, which
is highly optimized with the latest vLLM inference engine, dInfer still
delivers a 2-3× speedup. The implementation of dInfer is open-sourced
at https://github.com/inclusionAI/dInfer.
[LINK]http://arxiv.org/abs/2510.08666v3
[DATE]2025-10-22 22:33:49+08:00
[CATEGORIES]cs.CL
Unveiling Transformer Perception by Exploring Input Manifolds
[AUTHORS]Alessandro Benfenati, Alfio Ferrara, Alessio Marta, Davide Riva, Elisabetta Rocchetti
[ABSTRACT]This paper introduces a general method for the exploration of equivalence
classes in the input space of Transformer models. The proposed approach is
based on sound mathematical theory which describes the internal layers of a
Transformer architecture as sequential deformations of the input manifold.
Using eigendecomposition of the pullback of the distance metric defined on the
output space through the Jacobian of the model, we are able to reconstruct
equivalence classes in the input space and navigate across them. Our method
enables two complementary exploration procedures: the first retrieves input
instances that produce the same class probability distribution as the original
instance, thus identifying elements within the same equivalence class, while the
second discovers instances that yield a different class probability
distribution, effectively navigating toward distinct equivalence classes.
Finally, we demonstrate how the retrieved instances can be meaningfully
interpreted by projecting their embeddings back into a human-readable format.
[COMMENTS]11 pages, 4 figures
[LINK]http://arxiv.org/abs/2410.06019v2
[DATE]2025-10-22 22:30:40+08:00
[CATEGORIES]cs.LG cs.CL
HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application
[AUTHORS]Yiqian Yang, Tian Lan, Qianghuai Jia, Li Zhu, Hui Jiang, Hang Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang
[ABSTRACT]Effective deep search agents must not only access open-domain and
domain-specific knowledge but also apply complex rules, such as legal clauses,
medical manuals, and tariff rules. These rules often feature vague boundaries
and implicit logical relationships, making precise application challenging for
agents. However, this critical capability is largely overlooked by current
agent benchmarks.
To fill this gap, we introduce HSCodeComp, the first realistic, expert-level
e-commerce benchmark designed to evaluate deep search agents in hierarchical
rule application. In this task, the deep reasoning process of agents is guided
by these rules to predict the 10-digit Harmonized System Code (HSCode) of products
with noisy but realistic descriptions. These codes, established by the World
Customs Organization, are vital for global supply chain efficiency. Built from
real-world data collected from large-scale e-commerce platforms, our proposed
HSCodeComp comprises 632 product entries spanning diverse product categories,
with these HSCodes annotated by several human experts.
Extensive experimental results on several state-of-the-art LLMs, open-source
agents, and closed-source agents reveal a huge performance gap: the best agent achieves
only 46.8% 10-digit accuracy, far below human experts at 95.0%. In addition,
detailed analysis demonstrates the challenges of hierarchical rule application
and shows that test-time scaling fails to improve performance further.
[LINK]http://arxiv.org/abs/2510.19631v1
[DATE]2025-10-22 22:28:33+08:00
[CATEGORIES]cs.CL
CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English
[AUTHORS]Daryna Dementieva, Evgeniya Sukhodolskaya, Alexander Fraser
[ABSTRACT]In the era of social networks and rapid misinformation spread, news analysis
remains a critical task. Detecting fake news across multiple languages,
particularly beyond English, poses significant challenges. Cross-lingual news
comparison offers a promising approach to verify information by leveraging
external sources in different languages (Chen and Shu, 2024). However, existing
datasets for cross-lingual news analysis (Chen et al., 2022a) were manually
curated by journalists and experts, limiting their scalability and adaptability
to new languages. In this work, we address this gap by introducing a scalable,
explainable crowdsourcing pipeline for cross-lingual news similarity
assessment. Using this pipeline, we collected a novel dataset, CrossNews-UA, of
news pairs with Ukrainian as the central language, paired with linguistically and
contextually relevant languages: Polish, Russian, and English. Each news pair is
annotated for semantic similarity with detailed justifications based on the 4W
criteria (Who, What, Where, When). We further tested a range of models, from
traditional bag-of-words and Transformer-based architectures to large language
models (LLMs). Our results highlight the challenges in multilingual news
analysis and offer insights into model performance.
[LINK]http://arxiv.org/abs/2510.19628v1
[DATE]2025-10-22 22:23:50+08:00
[CATEGORIES]cs.CL
Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens
[AUTHORS]Mai AlKhamissi, Yunze Xiao, Badr AlKhamissi, Mona Diab
[ABSTRACT]Cultural evaluation of large language models has become increasingly
important, yet current benchmarks often reduce culture to static facts or
homogeneous values. This view conflicts with anthropological accounts that
emphasize culture as dynamic, historically situated, and enacted in practice.
To analyze this gap, we introduce a four-part framework that categorizes how
benchmarks frame culture: as knowledge, preference, performance, or bias.
Using this lens, we qualitatively examine 20 cultural benchmarks and identify
six recurring methodological issues, including treating countries as cultures,
overlooking within-culture diversity, and relying on oversimplified survey
formats. Drawing on established anthropological methods, we propose concrete
improvements: incorporating real-world narratives and scenarios, involving
cultural communities in design and validation, and evaluating models in context
rather than isolation. Our aim is to guide the development of cultural
benchmarks that go beyond static recall tasks and more accurately capture the
responses of the models to complex cultural situations.
[COMMENTS]12 pages; 2 figures; First two authors contributed equally
[LINK]http://arxiv.org/abs/2510.05931v2
[DATE]2025-10-22 22:01:52+08:00
[CATEGORIES]cs.CL
InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
[AUTHORS]Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, Hongxia Yang
[ABSTRACT]Model fusion combines multiple Large Language Models (LLMs) with different
strengths into a more powerful, integrated model through lightweight training
methods. Existing works on model fusion focus primarily on supervised
fine-tuning (SFT), leaving preference alignment (PA), a critical phase for
enhancing LLM performance, largely unexplored. The few existing fusion methods
for the PA phase, such as WRPO, simplify the process by utilizing only response outputs
from source models while discarding their probability information. To address
this limitation, we propose InfiFPO, a preference optimization method for
implicit model fusion. InfiFPO replaces the reference model in Direct
Preference Optimization (DPO) with a fused source model that synthesizes
multi-source probabilities at the sequence level, circumventing complex
vocabulary alignment challenges in previous works while preserving the
probability information. By introducing probability clipping and max-margin
fusion strategies, InfiFPO enables the pivot model to align with human
preferences while effectively distilling knowledge from source models.
Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO
consistently outperforms existing model fusion and preference optimization
methods. When using Phi-4 as the pivot model, InfiFPO improves its average
performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its
capabilities in mathematics, coding, and reasoning tasks.
[LINK]http://arxiv.org/abs/2505.13878v2
[DATE]2025-10-22 21:55:29+08:00
[CATEGORIES]cs.LG cs.CL
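As a rough illustration of the InfiFPO entry above, the sketch below computes the standard per-example DPO loss, in which the reference log-probabilities enter as plain inputs. The abstract says InfiFPO swaps the usual reference model for a fused source model; how that fusion (probability clipping, max-margin selection) is computed is not given here, so the function simply accepts whatever sequence-level reference log-probabilities are supplied. The function name, argument names, and the default `beta` are assumptions for illustration, not the authors' code.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (sequence-level log-probs).

    The reference log-probs may come from any reference model; in a
    fusion setting they would come from the fused source model
    (assumption: the fusion itself is computed elsewhere).
    """
    # Implicit reward margin between the chosen and rejected responses.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy and reference agree exactly: the loss equals log(2).
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy favors the chosen response more than the reference does: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because the reference values are ordinary arguments, the same function covers both plain DPO and any variant that changes only where the reference log-probabilities come from.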
Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
[AUTHORS]Qianli Ma, Siyu Wang, Yilin Chen, Yinhao Tang, Yixiang Yang, Chang Guo, Bingjie Gao, Zhening Xing, Yanan Sun, Zhipeng Zhang
[ABSTRACT]In the quest for scientific progress, communicating research is as vital as
the discovery itself. Yet, researchers are often sidetracked by the manual,
repetitive chore of building project webpages to make their dense papers
accessible. While automation has tackled static slides and posters, the
dynamic, interactive nature of webpages has remained an unaddressed challenge.
To bridge this gap, we reframe the problem, arguing that the solution lies not
in a single command, but in a collaborative, hierarchical process. We introduce
AutoPage, a novel multi-agent system that embodies this philosophy.
AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline
from narrative planning to multimodal content generation and interactive
rendering. To combat AI hallucination, dedicated “Checker” agents verify each
step against the source paper, while optional human checkpoints ensure the
final product aligns perfectly with the author’s vision, transforming the
system from a mere tool into a powerful collaborative assistant. To rigorously
validate our approach, we also construct PageBench, the first
benchmark for this new task. Experiments show AutoPage not only generates
high-quality, visually appealing pages but does so with remarkable efficiency
in under 15 minutes for less than $0.1. Code and dataset will be released at
https://mqleet.github.io/AutoPage_ProjectPage/.
[LINK]http://arxiv.org/abs/2510.19600v1
[DATE]2025-10-22 21:53:57+08:00
[CATEGORIES]cs.CL
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
[AUTHORS]Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
[ABSTRACT]Recent advancements in AI agents have demonstrated their growing potential to
drive and support scientific discovery. In this work, we introduce MLR-Bench, a
comprehensive benchmark for evaluating AI agents on open-ended machine learning
research. MLR-Bench includes three key components: (1) 201 research tasks
sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2)
MLR-Judge, an automated evaluation framework combining LLM-based reviewers with
carefully designed review rubrics to assess research quality; and (3)
MLR-Agent, a modular agent scaffold capable of completing research tasks
through four stages: idea generation, proposal formulation, experimentation,
and paper writing. Our framework supports both stepwise assessment across these
distinct research stages, and end-to-end evaluation of the final research
paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced
coding agent, finding that while LLMs are effective at generating coherent
ideas and well-structured papers, current coding agents frequently (e.g., in
80% of the cases) produce fabricated or invalidated experimental
results, posing a major barrier to scientific reliability. We validate
MLR-Judge through human evaluation, showing high agreement with expert
reviewers, supporting its potential as a scalable tool for research evaluation.
We open-source MLR-Bench to help the community benchmark, diagnose, and improve
AI research agents toward trustworthy and transparent scientific discovery.
[COMMENTS]49 pages, 9 figures. Accepted by NeurIPS 2025 D&B Track
[LINK]http://arxiv.org/abs/2505.19955v3
[DATE]2025-10-22 21:33:52+08:00
[CATEGORIES]cs.LG cs.CL
Using (Not-so) Large Language Models to Generate Simulation Models in a Formal DSL: A Study on Reaction Networks
[AUTHORS]Justin N. Kreikemeyer, Miłosz Jankowski, Pia Wilsdorf, Adelinde M. Uhrmacher
[ABSTRACT]Formal languages are an integral part of modeling and simulation. They allow
the distillation of knowledge into concise simulation models amenable to
automatic execution, interpretation, and analysis. However, arguably the most
accessible way for humans to express models is through natural language,
which is not easily interpretable by computers. Here, we evaluate how a Large
Language Model (LLM) might be used for formalizing natural language into
simulation models. Existing studies only explored using very large LLMs, like
the commercial GPT models, without fine-tuning model weights. To close this
gap, we show how an open-weights, 7B-parameter Mistral model can be fine-tuned
to translate natural language descriptions to reaction network models in a
domain-specific language, offering a self-hostable, compute-efficient, and
memory-efficient alternative. To this end, we develop a synthetic data
generator to serve as the basis for fine-tuning and evaluation. Our
quantitative evaluation shows that our fine-tuned Mistral model can recover the
ground truth simulation model in up to 84.5% of cases. In addition, our
small-scale user study demonstrates the model’s practical potential for
one-time generation as well as interactive modeling in various domains. While
promising, in its current form, the fine-tuned small LLM cannot catch up with
large LLMs. We conclude that higher-quality training data are required, and
expect future small and open-source LLMs to offer new opportunities.
[COMMENTS]27 pages, 5 figures; supplemental material available at
https://doi.org/10.1145/3733719
[LINK]http://arxiv.org/abs/2503.01675v2
[DATE]2025-10-22 21:17:36+08:00
[CATEGORIES]cs.LG cs.CL
Conditions for Catastrophic Forgetting in Multilingual Translation
[AUTHORS]Danni Liu, Jan Niehues
[ABSTRACT]Fine-tuning multilingual foundation models on specific languages often
induces catastrophic forgetting, degrading performance on languages unseen in
fine-tuning. While this phenomenon is widely documented, the literature
presents fragmented results about when forgetting occurs. To address this
ambiguity, we conduct a systematic empirical study using machine translation as
a testbed to identify the conditions that trigger catastrophic forgetting in
multilingual fine-tuning. Through controlled experiments across different model
architectures, data scales, and fine-tuning approaches, we reveal that the
relative scale between model and data size is a primary determinant of
forgetting. Moreover, we demonstrate that a model’s instruction-following
ability is more critical for retaining multilingual knowledge than its
architecture. Contrary to assumptions, parameter-efficient fine-tuning offers
no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we
show that cross-lingual alignment can mitigate forgetting while also
facilitating positive transfer to unseen target languages.
[COMMENTS]Multilingual Representation Learning (MRL) Workshop 2025
[LINK]http://arxiv.org/abs/2510.19546v1
[DATE]2025-10-22 20:54:00+08:00
[CATEGORIES]cs.CL
Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
[AUTHORS]Maureen de Seyssel, Eeshan Gunesh Dhekane
[ABSTRACT]Speech foundation models have recently achieved remarkable capabilities
across a wide range of tasks. However, their evaluation remains disjointed
across tasks and model types. Different models excel at distinct aspects of
speech processing and thus require different evaluation protocols. This paper
proposes a unified taxonomy that addresses the question: Which evaluation is
appropriate for which model? The taxonomy defines three orthogonal axes: the
evaluation aspect being measured, the model capabilities required to
attempt the task, and the task or protocol requirements needed to perform it.
We classify a broad set of existing evaluations and benchmarks along these
axes, spanning areas such as representation learning, speech generation, and
interactive dialogue. By mapping each evaluation to the capabilities a model
exposes (e.g., speech generation, real-time processing) and to its
methodological demands (e.g., fine-tuning data, human judgment), the taxonomy
provides a principled framework for aligning models with suitable evaluation
methods. It also reveals systematic gaps, such as limited coverage of prosody,
interaction, or reasoning, that highlight priorities for future benchmark
design. Overall, this work offers a conceptual foundation and practical guide
for selecting, interpreting, and extending evaluations of speech models.
[COMMENTS]57 pages (26 main, 25 appendix, 6 references)
[LINK]http://arxiv.org/abs/2510.19509v1
[DATE]2025-10-22 20:04:32+08:00
[CATEGORIES]cs.CL
Lookahead Routing for Large Language Models
[AUTHORS]Canbin Huang, Tianyuan Shi, Yuhua Zhu, Ruijun Chen, Xiaojun Quan
[ABSTRACT]Large language model (LLM) routers improve the efficiency of multi-model
systems by directing each query to the most appropriate model while leveraging
the diverse strengths of heterogeneous LLMs. Most existing approaches frame
routing as a classification problem based solely on the input query. While this
reduces overhead by avoiding inference across all models, it overlooks valuable
information that could be gleaned from potential outputs and fails to capture
implicit intent or contextual nuances that often emerge only during response
generation. These limitations can result in suboptimal routing decisions,
particularly for complex or ambiguous queries that require deeper semantic
understanding. To address this challenge, we propose Lookahead, a routing
framework that “foresees” potential model outputs by predicting their latent
representations and uses these predictions to guide model selection, thus
enabling more informed routing without full inference. Within this framework,
we implement two approaches based on causal and masked language models.
Empirical evaluations across seven public benchmarks - spanning instruction
following, mathematical reasoning, and code generation - show that Lookahead
consistently outperforms existing routing baselines, achieving an average
performance gain of 7.7% over the state-of-the-art. Our code is available at
https://github.com/huangcb01/lookahead-routing.
[LINK]http://arxiv.org/abs/2510.19506v1
[DATE]2025-10-22 20:00:21+08:00
[CATEGORIES]cs.CL
A Multimodal, Multitask System for Generating E Commerce Text Listings from Images
[AUTHORS]Nayan Kumar Singh
[ABSTRACT]Manually generating catchy descriptions and names is a labor-intensive and
slow process for retailers. Although generative AI provides an automation
solution in the form of Vision to Language Models (VLMs), current VLMs are prone
to factual “hallucinations”. Siloed, single-task models are not only
inefficient but also fail to capture interdependent relationships between
features. To address these challenges, we propose an end-to-end, multi-task
system that generates factually grounded textual listings from a single image.
The contributions of this study are two proposals for the model architecture.
The first is a multi-task learning approach to fine-tuning a vision
encoder, in which a single vision backbone is jointly trained on attribute
prediction (such as color, hemline, and neck style) and price regression. The
second is a hierarchical generation process in which the model’s own
predicted attributes are embedded in a prompt and fed to the text decoder to
improve factual consistency. The experiments demonstrate the superiority of
this architecture. The multi-task approach outperforms both independent
price regression, with a 3.6% better R2 value, and independent attribute
classification, with a 6.6% improvement in F1 score. Critically, the hierarchical
generation process proves highly effective, slashing the factual hallucination rate from
12.7% to 7.1%, a 44.5% relative reduction, compared to a non-hierarchical
ablation. The hierarchical approach also reduces the latency of the
autoregressive text generation process by a factor of 3.5 compared to a
direct vision-to-language model of similar size. One minor caveat is that the
model performs 3.5% worse than the direct vision-to-language model on
ROUGE-L.
[COMMENTS]24 pages, 10 figures, 11 tables. Code can be found at:
https://github.com/SinghNayanKumar/multimodal-product-lister/
[LINK]http://arxiv.org/abs/2510.21835v1
[DATE]2025-10-22 19:50:49+08:00
[CATEGORIES]cs.LG cs.CL
What is the Best Sequence Length for BABYLM?
[AUTHORS]Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery
[ABSTRACT]Transformer language models typically operate with a fixed-length context
window, which has grown in step with large-scale pretraining datasets. In the
BabyLM Challenge, however, many past submissions have defaulted to using much
shorter sequence lengths. We examine the impact of sequence length on BabyLM
pretraining, to answer the simple question: what sequence length should we be
using when training Baby LMs? Using 100M-word training data and fixed compute
budgets, we compare 125M-parameter Mamba and OPT models, finding that although
longer is often better, the optimal length depends on both task and
architecture. Shorter sequences are sufficient for grammatical generalization
tasks whereas longer contexts benefit morphological analogical reasoning tasks.
[COMMENTS]Paper Accepted at the 2025 BabyLM Workshop @ EMNLP (Suzhou, China)
[LINK]http://arxiv.org/abs/2510.19493v1
[DATE]2025-10-22 19:42:33+08:00
[CATEGORIES]cs.CL
Machine Text Detectors are Membership Inference Attacks
[AUTHORS]Ryuto Koike, Liam Dugan, Masahiro Kaneko, Chris Callison-Burch, Naoaki Okazaki
[ABSTRACT]Although membership inference attacks (MIAs) and machine-generated text
detection target different goals (identifying training samples and synthetic
texts, respectively), their methods often exploit similar signals based on a language model’s
probability distribution. Despite this shared methodological foundation, the
two tasks have been independently studied, which may lead to conclusions that
overlook stronger methods and valuable insights developed in the other task. In
this work, we theoretically and empirically investigate the transferability,
i.e., how well a method originally developed for one task performs on the
other, between MIAs and machine text detection. For our theoretical
contribution, we prove that the metric that achieves the asymptotically highest
performance on both tasks is the same. We unify a large proportion of the
existing literature in the context of this optimal metric and hypothesize that
the accuracy with which a given method approximates this metric is directly
correlated with its transferability. Our large-scale empirical experiments,
including 7 state-of-the-art MIA methods and 5 state-of-the-art machine text
detectors across 13 domains and 10 generators, demonstrate very strong rank
correlation (rho > 0.6) in cross-task performance. We notably find that
Binoculars, originally designed for machine text detection, achieves
state-of-the-art performance on MIA benchmarks as well, demonstrating the
practical impact of the transferability. Our findings highlight the need for
greater cross-task awareness and collaboration between the two research
communities. To facilitate cross-task developments and fair evaluations, we
introduce MINT, a unified evaluation suite for MIAs and machine-generated text
detection, with implementation of 15 recent methods from both tasks.
[LINK]http://arxiv.org/abs/2510.19492v1
[DATE]2025-10-22 19:39:01+08:00
[CATEGORIES]cs.CL
VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
[AUTHORS]Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
[ABSTRACT]Training computer-use agents requires massive amounts of GUI interaction
data, but manually annotating action trajectories at scale is prohibitively
expensive. We present VideoAgentTrek, a scalable pipeline that automatically
mines training data from publicly available screen-recorded videos at web
scale, eliminating the need for manual annotation. Our approach addresses a key
challenge: raw videos contain implicit demonstrations but lack explicit action
labels. To solve this, we develop Video2Action, an inverse dynamics module
(IDM) with two components: (1) a video grounding model that detects and
localizes GUI actions with precise temporal boundaries and context, and (2) an
action-content recognizer that extracts structured parameters like click
coordinates and typed text with high fidelity. Applied to 39,000 YouTube
tutorial videos, our pipeline generates 1.52 million interaction steps
automatically. We leverage this data through continued pretraining followed by
supervised fine-tuning. On OSWorld-Verified, our approach improves task success
rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On
AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results
demonstrate that passive internet videos can be transformed into high-quality
supervision for computer-use agents, providing a scalable alternative to
expensive manual annotation.
[COMMENTS]8 pages, 6 figures
[LINK]http://arxiv.org/abs/2510.19488v1
[DATE]2025-10-22 19:25:48+08:00
[CATEGORIES]cs.CL cs.LG
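The relative-improvement figure quoted in the VideoAgentTrek entry above can be checked with a short calculation. This is only a worked-arithmetic sketch: the helper name `relative_improvement` is hypothetical, and the inputs are the percentages stated in the abstract.

```python
def relative_improvement(before: float, after: float) -> float:
    """Return (after - before) / before, the relative change from `before`."""
    if before == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (after - before) / before


# Task success rate on OSWorld-Verified: 9.3% -> 15.8%.
osworld = relative_improvement(9.3, 15.8)
# Step accuracy on AgentNetBench: 64.1% -> 69.3%.
agentnet = relative_improvement(64.1, 69.3)

print(f"OSWorld-Verified: {osworld:.1%} relative improvement")
print(f"AgentNetBench: {agentnet:.1%} relative improvement")
```

The first value comes out at roughly 69.9%, which matches the rounded 70% in the abstract; the second is about 8.1%, a figure the abstract does not state explicitly.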
Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
[AUTHORS]Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
[ABSTRACT]Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies
on high-quality training data. While data selection and data synthesis are two
common strategies to improve data quality, existing approaches often face
limitations in static dataset curation that fail to adapt to evolving model
capabilities. In this paper, we introduce Middo, a self-evolving Model-informed
dynamic data optimization framework that uses model-aware data selection and
context-preserving data refinement. Unlike conventional one-off
filtering/synthesis methods, our framework establishes a closed-loop
optimization system: (1) A self-referential diagnostic module proactively
identifies suboptimal samples through tri-axial model signals - loss patterns
(complexity), embedding cluster dynamics (diversity), and self-alignment scores
(quality); (2) An adaptive optimization engine then transforms suboptimal
samples into pedagogically valuable training points while preserving semantic
integrity; (3) This optimization process continuously evolves with model
capability through dynamic learning principles. Experiments on multiple
benchmarks demonstrate that Middo consistently enhances the quality of seed
data and boosts LLM performance, improving accuracy by 7.15% on average
while maintaining the original dataset scale. This work establishes a new
paradigm for sustainable LLM training through dynamic human-AI co-evolution of
data and models. Our datasets, models, and code are publicly available at
https://github.com/Word2VecT/Middo.
[COMMENTS]Accepted by EMNLP 2025 (Main)
[LINK]http://arxiv.org/abs/2508.21589v5
[DATE]2025-10-22 19:09:23+08:00
[CATEGORIES]cs.CL
Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
[AUTHORS]Yuu Jinnai
[ABSTRACT]Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding
outperforms beam search in text-to-text generation tasks, such as machine
translation, text summarization, and image captioning. On the other hand, beam
search is the current practice for speech-to-text tasks such as automatic
speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding
is effective in text-to-text generation tasks, it is reasonable to expect it to
also be effective for speech-to-text tasks. In this paper, we evaluate MBR
decoding for ASR and ST tasks on English and Japanese using Whisper and its
derivative models. We observe that the accuracy of MBR decoding outperforms
that of beam search in most of the experimental settings we have evaluated. The
results show that MBR decoding is a promising method for offline ASR and ST
tasks that require high accuracy. The code is available at
https://github.com/CyberAgentAILab/mbr-for-asr
[LINK]http://arxiv.org/abs/2510.19471v1
[DATE]2025-10-22 19:06:20+08:00
[CATEGORIES]cs.CL cs.LG
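To make the sample-based MBR decoding mentioned in the entry above concrete, here is a minimal sketch: each sampled hypothesis is scored by its average utility against the other samples, and the highest-scoring one is returned. The utility used here (whitespace-token F1) is an assumption chosen to keep the example dependency-free; the abstract does not say which utility the paper uses, and the function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def token_f1(hyp: str, ref: str) -> float:
    """Token-overlap F1 between two strings (illustrative utility only)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples: Sequence[str],
               utility: Callable[[str, str], float] = token_f1) -> str:
    """Return the sample with the highest mean utility against the others."""
    if not samples:
        raise ValueError("need at least one sample")
    if len(samples) == 1:
        return samples[0]
    best, best_score = samples[0], float("-inf")
    for i, hyp in enumerate(samples):
        # The other samples act as pseudo-references for this hypothesis.
        others = [s for j, s in enumerate(samples) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical samples drawn from a speech recognizer for one utterance.
samples = ["the cat sat", "the cat sat down", "the cat sat", "a dog ran"]
choice = mbr_decode(samples)
```

Unlike beam search, which returns the single most probable hypothesis, this procedure favors the hypothesis that agrees most with the rest of the sample set, at the cost of a number of utility evaluations quadratic in the sample count.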
Can Large Language Models be Effective Online Opinion Miners?
[AUTHORS]Ryang Heo, Yongsik Seo, Junseong Lee, Dongha Lee
[COMMENTS]Accepted to EMNLP 2025 Main
[LINK]http://arxiv.org/abs/2505.15695v3
[DATE]2025-10-22 18:13:27+08:00
[CATEGORIES]cs.CL
LLM Unlearning with LLM Beliefs
[AUTHORS]Kemou Li, Qizhou Wang, Yue Wang, Fengpeng Li, Jun Liu, Bo Han, Jiantao Zhou
[ABSTRACT]Large language models trained on vast corpora inherently risk memorizing
sensitive or harmful content, which may later resurface in their outputs.
Prevailing unlearning methods generally rely on gradient ascent and its
variants to lower the probability of specific target responses. However, we
find that this strategy induces a critical side effect: probability mass is
redistributed into high-likelihood regions, often corresponding to semantically
related rephrasings of the targets. We refer to this as the squeezing effect,
which explains why many methods yield merely spurious unlearning, a problem
further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport
actual success. To address this, we propose a bootstrapping (BS) framework that
explicitly links the squeezing effect with the model’s own high-confidence
generations, namely its model beliefs. Since model beliefs inherently capture
the very high-likelihood regions where probability mass is squeezed,
incorporating them into the unlearning objective directly counters the
squeezing effect. By jointly suppressing both target responses and model
beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S
(sequence) removes entire high-confidence generations, together achieving more
thorough forgetting while preserving utility. Extensive experiments across
diverse benchmarks with various model families confirm the effectiveness of our
approach.
[LINK]http://arxiv.org/abs/2510.19422v1
[DATE]2025-10-22 17:44:36+08:00
[CATEGORIES]cs.LG cs.CL
Stream: Scaling up Mechanistic Interpretability to Long Context in LLMs via Sparse Attention
[AUTHORS]J Rosser, José Luis Redondo García, Gustavo Penha, Konstantina Palla, Hugues Bouchard
[ABSTRACT]As Large Language Models (LLMs) scale to million-token contexts, traditional
Mechanistic Interpretability techniques for analyzing attention scale
quadratically with context length, demanding terabytes of memory beyond 100,000
tokens. We introduce Sparse Tracing, a novel technique that leverages dynamic
sparse attention to efficiently analyze long context attention patterns. We
present Stream, a compilable hierarchical pruning algorithm that estimates
per-head sparse attention masks in near-linear time $O(T \log T)$ and linear
space $O(T)$, enabling one-pass interpretability at scale. Stream performs a
binary-search-style refinement to retain only the top-$k$ key blocks per query
while preserving the model’s next-token behavior. We apply Stream to long
chain-of-thought reasoning traces and identify thought anchors while pruning
97-99% of token interactions. On the RULER benchmark, Stream preserves
critical retrieval paths while discarding 90-96% of interactions and exposes
layer-wise routes from the needle to output. Our method offers a practical
drop-in tool for analyzing attention patterns and tracing information flow
without terabytes of caches. By making long context interpretability feasible
on consumer GPUs, Sparse Tracing helps democratize chain-of-thought monitoring.
Code is available at https://anonymous.4open.science/r/stream-03B8/.
[LINK]http://arxiv.org/abs/2510.19875v1
[DATE]2025-10-22 17:42:29+08:00
[CATEGORIES]cs.CL
BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models
[AUTHORS]Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun
[ABSTRACT]To bridge the gap between performance-oriented benchmarks and the evaluation
of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner
Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm
of selective tolerance, testing whether a model finds a naturalistic learner
error more plausible than a matched, artificial error within the same sentence.
Constructed from over 2.8 million naturalistic learner sentences, BLiSS
provides 136,867 controlled triplets (corrected, learner, artificial) for this
purpose. Experiments on a diverse suite of models demonstrate that selective
tolerance is a distinct capability from standard grammaticality, with
performance clustering strongly by training paradigm. This validates BLiSS as a
robust tool for measuring how different training objectives impact a model’s
alignment with the systematic patterns of human language acquisition.
[COMMENTS]Accepted Paper at the BabyLM Workshop 2025 @ EMNLP (Presentation in
Suzhou, China)
[LINK]http://arxiv.org/abs/2510.19419v1
[DATE]2025-10-22 17:42:01+08:00
[CATEGORIES]cs.CL
Spatio-temporal Sign Language Representation and Translation
[AUTHORS]Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet
[ABSTRACT]This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign
language translation (SLT) task from Swiss German Sign Language (video) into
German (text). State-of-the-art techniques for SLT use a generic seq2seq
architecture with customized input embeddings. Instead of word embeddings as
used in textual machine translation, SLT systems use features extracted from
video frames. Standard approaches often do not benefit from temporal features.
In our participation, we present a system that learns spatio-temporal feature
representations and translation in a single model, resulting in a real
end-to-end architecture expected to better generalize to new data sets. Our
best system achieved $5\pm1$ BLEU points on the development set, but the
performance on the test set dropped to $0.11\pm0.06$ BLEU points.
[LINK]http://arxiv.org/abs/2510.19413v1
[DATE]2025-10-22 17:34:01+08:00
[CATEGORIES]cs.CL
ToMMeR – Efficient Entity Mention Detection from Large Language Models
[AUTHORS]Victor Morand, Nadi Tomeh, Josiane Mothe, Benjamin Piwowarski
[ABSTRACT]Identifying which text spans refer to entities – mention detection – is
both foundational for information extraction and a known performance
bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing
mention detection capabilities from early LLM layers. Across 13 NER benchmarks,
ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as
a judge, showing that ToMMeR rarely produces spurious predictions despite high
recall. Cross-model analysis reveals that diverse architectures (14M-15B
parameters) converge on similar mention boundaries (DICE >75%), confirming
that mention detection emerges naturally from language modeling. When extended
with span classification heads, ToMMeR achieves near SOTA NER performance
(80-87% F1 on standard benchmarks). Our work provides evidence that structured
entity representations exist in early transformer layers and can be efficiently
recovered with minimal parameters.
[COMMENTS]Code is available at https://github.com/VictorMorand/llm2ner
[LINK]http://arxiv.org/abs/2510.19410v1
[DATE]2025-10-22 17:28:18+08:00
[CATEGORIES]cs.CL
SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
[AUTHORS]Yasser Hamidullah, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet
[ABSTRACT]Sign language translation (SLT) is typically trained with text in a single
spoken language, which limits scalability and cross-language generalization.
Earlier approaches have replaced gloss supervision with text-based sentence
embeddings, but up to now, these remain tied to a specific language and
modality. In contrast, here we employ language-agnostic, multimodal embeddings
trained on text and speech from multiple languages to supervise SLT, enabling
direct multilingual translation. To address data scarcity, we propose a coupled
augmentation method that combines multilingual target augmentations (i.e.
translations into many languages) with video-level perturbations, improving
model robustness. Experiments show consistent BLEURT gains over text-only
sentence embedding supervision, with larger improvements in low-resource
settings. Our results demonstrate that language-agnostic embedding supervision,
combined with coupled augmentation, provides a scalable and semantically robust
alternative to traditional SLT training.
[LINK]http://arxiv.org/abs/2510.19398v1
[DATE]2025-10-22 17:17:31+08:00
[CATEGORIES]cs.CL
Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
[AUTHORS]Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, Qingsong Wen, Lixing Chen, Pan Zhou, Lichao Sun
[ABSTRACT]Self-correction of large language models (LLMs) emerges as a critical
component for enhancing their reasoning performance. Although various
self-correction methods have been proposed, a comprehensive evaluation of these
methods remains largely unexplored, and the question of whether LLMs can truly
correct themselves is a matter of significant interest and concern. In this
study, we introduce CorrectBench, a benchmark developed to evaluate the
effectiveness of self-correction strategies, including intrinsic, external, and
fine-tuned approaches, across three tasks: commonsense reasoning, mathematical
reasoning, and code generation. Our findings reveal that: 1) Self-correction
methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing
different self-correction strategies yields further improvements, though it
reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) gain little
from additional self-correction methods and incur high time costs.
Interestingly, a comparatively simple chain-of-thought (CoT) baseline
demonstrates competitive accuracy and efficiency. These results underscore the
potential of self-correction to enhance LLMs’ reasoning performance while
highlighting the ongoing challenge of improving their efficiency. Consequently,
we advocate for further research focused on optimizing the balance between
reasoning capabilities and operational efficiency. Project Page:
https://correctbench.github.io/
[COMMENTS]47 pages, 25 figures, 10 tables
[LINK]http://arxiv.org/abs/2510.16062v2
[DATE]2025-10-22 17:04:12+08:00
[CATEGORIES]cs.CL
From TOWER to SPIRE: Adding the Speech Modality to a Translation-Specialist LLM
[AUTHORS]Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, André F. T. Martins, Marcely Zanon Boito
[ABSTRACT]We introduce Spire, a speech-augmented language model (LM) capable of both
translating and transcribing speech input from English into 10 other languages
as well as translating text input in both language directions. Spire integrates
the speech modality into an existing multilingual LM via speech discretization
and continued pre-training using only 42.5K hours of speech. In particular, we
adopt the pretraining framework of multilingual LMs and treat discretized
speech input as an additional translation language. This approach not only
equips the model with speech capabilities, but also preserves its strong
text-based performance. We achieve this using significantly less data than
existing speech LMs, demonstrating that discretized speech input integration as
an additional language is feasible during LM adaptation. We make our code and
models available to the community.
[COMMENTS]EMNLP 2025 (Findings) camera ready
[LINK]http://arxiv.org/abs/2503.10620v3
[DATE]2025-10-22 16:47:03+08:00
[CATEGORIES]cs.CL
Sign Language Translation with Sentence Embedding Supervision
[AUTHORS]Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet
[ABSTRACT]State-of-the-art sign language translation (SLT) systems facilitate the
learning process through gloss annotations, either in an end-to-end manner or by
involving an intermediate step. Unfortunately, gloss labelled sign language
data is usually not available at scale and, when available, gloss annotations
widely differ from dataset to dataset. We present a novel approach using
sentence embeddings of the target sentences at training time that take the role
of glosses. This new kind of supervision needs no manual annotation; instead,
it is learned from raw textual data. As our approach easily facilitates
multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and
American (How2Sign) sign languages and experiment with mono- and multilingual
sentence embeddings and translation systems. Our approach significantly
outperforms other gloss-free approaches, setting the new state-of-the-art for
data sets where glosses are not available and when no additional SLT datasets
are used for pretraining, diminishing the gap between gloss-free and
gloss-dependent systems.
[LINK]http://arxiv.org/abs/2510.19367v1
[DATE]2025-10-22 16:40:41+08:00
[CATEGORIES]cs.CL
MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
[AUTHORS]Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li
[ABSTRACT]Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI,
achieve high quality by sparsely activating parameters. However, their reliance
on routing between a few monolithic experts via a top-k mechanism creates a
“quality cliff”, offering only a few coarse-grained operating points. This
inflexibility forces a difficult trade-off between cost and quality, preventing
adaptation to diverse Service Level Objectives (SLOs) and leading to
significant resource over-provisioning.
This paper introduces MoE-Prism, a model-system co-design that transforms
rigid MoE models into elastic services. Our methodology is divided into two
phases. First, an Offline Refactoring Engine systematically deconstructs
monolithic experts into fine-grained “sub-experts.” This engine employs a
partitioning optimization solver that uses a metaheuristic-based approach to
group neurons, preserving functional locality without requiring retraining.
Second, an Online Scheduling Engine leverages this new elasticity
through QoS-aware scheduling. It implements specialized policies to solve
complex system problems, including maximizing throughput in cloud deployments
and managing latency-optimized offloading for memory-constrained devices. Our
evaluation across three different MoE models shows that MoE-Prism provides over
4 times more distinct, stable operating points than the baseline. This allows
an AI service to dynamically improve throughput by up to 19.9% under a strict
latency budget or reduce latency by up to 10.36% under limited resources.
MoE-Prism provides the critical “control knob” to bridge the model-system gap,
enabling the next generation of adaptive, efficient, and QoS-aware AI services.
[LINK]http://arxiv.org/abs/2510.19366v1
[DATE]2025-10-22 16:40:01+08:00
[CATEGORIES]cs.CL cs.LG
The Massive Legal Embedding Benchmark (MLEB)
[AUTHORS]Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec
[ABSTRACT]We present the Massive Legal Embedding Benchmark (MLEB), the largest, most
diverse, and most comprehensive open-source benchmark for legal information
retrieval to date. MLEB consists of ten expert-annotated datasets spanning
multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore),
document types (cases, legislation, regulatory guidance, contracts, and
literature), and task types (search, zero-shot classification, and question
answering). Seven of the datasets in MLEB were newly constructed in order to
fill domain and jurisdictional gaps in the open-source legal information
retrieval landscape. We document our methodology in building MLEB and creating
the new constituent datasets, and release our code, results, and data openly to
assist with reproducible evaluations.
[COMMENTS]15 pages, 2 figures
[LINK]http://arxiv.org/abs/2510.19365v1
[DATE]2025-10-22 16:38:44+08:00
[CATEGORIES]cs.CL
Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
[AUTHORS]Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou
[ABSTRACT]The advent of Large Language Models (LLMs) has revolutionized product
recommenders, yet their susceptibility to adversarial manipulation poses
critical challenges, particularly in real-world commercial applications. Our
approach is the first one to tap into human psychological principles,
seamlessly modifying product descriptions, making such manipulations hard to
detect. In this work, we investigate cognitive biases as black-box adversarial
strategies, drawing parallels between their effects on LLMs and human
purchasing behavior. Through extensive evaluation across models of varying
scale, we find that certain biases, such as social proof, consistently boost
product recommendation rate and ranking, while others, like scarcity and
exclusivity, surprisingly reduce visibility. Our results demonstrate that
cognitive biases are deeply embedded in state-of-the-art LLMs, leading to
highly unpredictable behavior in product recommendations and posing significant
challenges for effective mitigation.
[COMMENTS]Accepted at EMNLP 2025
[LINK]http://arxiv.org/abs/2502.01349v4
[DATE]2025-10-22 16:36:39+08:00
[CATEGORIES]cs.CL
AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
[AUTHORS]Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jiaheng Wei
[ABSTRACT]The creation of high-quality datasets to improve Large Language Model (LLM)
reasoning remains a significant challenge, as current methods often suffer from
generating low-quality/incorrect answers and limited information richness from
available data sources. To address this, we propose AgenticMath, a novel
agentic pipeline for generating high-quality mathematical question-answer pairs
to enhance the supervised fine-tuning of LLMs. Our method operates through four
stages: (1) a Seed Question Filter that selects questions with high information
richness, complexity, and clarity; (2) an Agentic Question Rephrase step that
employs a multi-agent system to generate diverse, logically consistent
paraphrases; (3) an Answer Augment step that rewrites answers using
chain-of-thought reasoning to enhance numerical and logical correctness,
without reliance on human-provided labels; and (4) a final Question and Answer
Evaluation that retains only the highest-quality pairs. Extensive experiments
demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated
datasets (comprising only 30-60K math samples) achieves competitive or superior
performance on diverse in-domain and out-of-domain mathematical reasoning
benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M
samples). Our work demonstrates that targeted, high-quality data generation is
a more efficient path to improving mathematical reasoning in LLMs than
large-scale, low-quality alternatives.
[COMMENTS]Work in progress
[LINK]http://arxiv.org/abs/2510.19361v1
[DATE]2025-10-22 16:34:13+08:00
[CATEGORIES]cs.CL
Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system
[AUTHORS]Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
[ABSTRACT]Removing Personally Identifiable Information (PII) from clinical notes in
Electronic Health Records (EHRs) is essential for research and AI development.
While Large Language Models (LLMs) are powerful, their high computational costs
and the data privacy risks of API-based services limit their use, especially in
low-resource settings. To address this, we developed LOGICAL (Local Obfuscation
by GLINER for Impartial Context-Aware Lineage), an efficient, locally
deployable PII removal system built on a fine-tuned Generalist and Lightweight
Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a
psychiatric hospital’s EHR system. We defined nine PII categories for removal.
A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and
evaluated on a test set of 376 instances using character-level precision,
recall, and F1-score. We compared its performance against Microsoft Azure NER,
Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and
Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior
performance, with an overall micro-average F1-score of 0.980, significantly
outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95%
of documents completely, compared to 64% for the next-best solution. The model
operated efficiently on a standard laptop without a dedicated GPU. However, a
2% entity-level false negative rate underscores the need for human-in-the-loop
validation across all tested systems. Fine-tuned, specialised transformer
models like GLiNER offer an accurate, computationally efficient, and secure
solution for PII removal from clinical notes. This “sanitisation at the source”
approach is a practical alternative to resource-intensive LLMs, enabling the
creation of de-identified datasets for research and AI development while
preserving data privacy, particularly in resource-constrained environments.
[COMMENTS]30 pages, 15 main text and 15 supplementary material
[LINK]http://arxiv.org/abs/2510.19346v1
[DATE]2025-10-22 16:12:07+08:00
[CATEGORIES]cs.CL
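The LOGICAL entry above reports character-level precision, recall, and F1. As a minimal sketch of what such a metric can look like, assuming spans are given as (start, end) character offsets with an exclusive end, the function below counts overlapping characters between gold and predicted PII spans. The function name, the span format, and the example note are illustrative assumptions; the abstract does not describe the authors' exact scoring or how they micro-average across the nine categories.

```python
def char_level_prf(gold_spans, pred_spans):
    """Character-level precision, recall, and F1 for span-based PII detection.

    Spans are (start, end) character offsets with an exclusive end.
    This span format is an assumption made for illustration.
    """
    # Expand each span into the set of character positions it covers.
    gold = {i for start, end in gold_spans for i in range(start, end)}
    pred = {i for start, end in pred_spans for i in range(start, end)}

    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical note: "Seen by Dr Rao on 12 May"
# Gold PII: "Rao" at [11, 14) and "12 May" at [18, 24).
gold = [(11, 14), (18, 24)]
# Prediction over-extends the name ("Dr Rao") and cuts the date short ("12 M").
pred = [(8, 14), (18, 22)]

precision, recall, f1 = char_level_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Scoring at the character level gives partial credit for partially matched spans, and the missed characters of the date still lower recall, which is the kind of leak a de-identification system has to account for.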
Memorization-Compression Cycles Improve Generalization
[AUTHORS]Fangyuan Yu
[ABSTRACT]We prove theoretically that generalization improves not only through data
scaling but also by compressing internal representations. To operationalize
this insight, we introduce the Information Bottleneck Language Modeling (IBLM)
objective, which reframes language modeling as a constrained optimization
problem: minimizing representation entropy subject to optimal prediction
performance. Empirically, we observe an emergent memorization-compression cycle
during LLM pretraining, evidenced by oscillating positive/negative gradient
alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of
representation entropy. This pattern closely mirrors the predictive-compressive
trade-off prescribed by IBLM and also parallels the biological alternation
between awake learning and sleep consolidation. Motivated by this observation,
we propose Gated Phase Transition (GAPT), a training algorithm that adaptively
switches between memorization and compression phases. When applied to GPT-2
pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves
cross-entropy by 4.8%. GAPT improves OOD generalization by 35% in a pretraining
task on arithmetic multiplication. In a setting designed to simulate
catastrophic forgetting, GAPT reduces interference by compressing and
separating representations, achieving a 97% improvement in separation,
paralleling the functional role of sleep consolidation.
[COMMENTS]12 pages, 6 figures, NeurIPS2025 NEGEL Workshop
[LINK]http://arxiv.org/abs/2505.08727v2
[DATE]2025-10-22 16:10:52+08:00
[CATEGORIES]cs.LG cs.CL
Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
[AUTHORS]Ewelina Gajewska, Arda Derbent, Jaroslaw A Chudziak, Katarzyna Budzynska
[ABSTRACT]In this paper, we investigate how personalising Large Language Models
(Persona-LLMs) with annotator personas affects their sensitivity to hate
speech, particularly regarding biases linked to shared or differing identities
between annotators and targets. To this end, we employ Google’s Gemini and
OpenAI’s GPT-4.1-mini models and two persona-prompting methods: shallow persona
prompting and a deeply contextualised persona development based on
Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We
analyse the impact of using in-group and out-group annotator personas on the
models’ detection performance and fairness across diverse social groups. This
work bridges psychological insights on group identity with advanced NLP
techniques, demonstrating that incorporating socio-demographic attributes into
LLMs can address bias in automated hate speech detection. Our results highlight
both the potential and limitations of persona-based approaches in reducing
bias, offering valuable insights for developing more equitable hate speech
detection systems.
[COMMENTS]This paper has been accepted for the upcoming 59th Hawaii
International Conference on System Sciences (HICSS-59), 2026, Hawaii, USA.
The final published version will appear in the official conference
proceedings
[LINK]http://arxiv.org/abs/2510.19331v1
[DATE]2025-10-22 15:48:57+08:00
[CATEGORIES]cs.CL
Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization
[AUTHORS]Junjie Song, Yiwen Liu, Dapeng Li, Yin Sun, Shukun Fu, Siqi Chen, Yuji Cao
[ABSTRACT]Text summarization is a crucial task that requires the simultaneous
optimization of multiple objectives, including consistency, coherence,
relevance, and fluency, which presents considerable challenges. Although large
language models (LLMs) have demonstrated remarkable performance, enhanced by
reinforcement learning (RL), few studies have focused on optimizing the
multi-objective problem of summarization through RL based on LLMs. In this
paper, we introduce hypervolume optimization (HVO), a novel optimization
strategy that dynamically adjusts the scores between groups during the reward
process in RL by using the hypervolume method. This method guides the model’s
optimization to progressively approximate the Pareto front, thereby generating
balanced summaries across multiple objectives. Experimental results on several
representative summarization datasets demonstrate that our method outperforms
group relative policy optimization (GRPO) in overall scores and shows more
balanced performance across different dimensions. Moreover, a 7B foundation
model enhanced by HVO performs comparably to GPT-4 in the summarization task,
while maintaining a shorter generation length. Our code is publicly available
at https://github.com/ai4business-LiAuto/HVO.git
[LINK]http://arxiv.org/abs/2510.19325v1
[DATE]2025-10-22 15:39:04+08:00
[CATEGORIES]cs.CL
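The HVO entry above relies on the hypervolume method to favour balanced summaries. The sketch below shows only the generic idea, assuming four per-objective scores in [0, 1] and a reference point at the origin: the hypervolume of a single score vector is the volume of the box between the reference point and that vector. The function name and the example scores are hypothetical, and this is not the paper's reward formulation, which the abstract does not spell out.

```python
import math


def hypervolume(scores, reference):
    """Volume of the axis-aligned box between a reference point and a score vector.

    Higher scores are better; any objective at or below the reference
    contributes zero width, so the whole volume collapses to zero.
    """
    return math.prod(max(s - r, 0.0) for s, r in zip(scores, reference))


# Hypothetical scores for (consistency, coherence, relevance, fluency).
reference = (0.0, 0.0, 0.0, 0.0)
balanced = (0.5, 0.5, 0.5, 0.5)
lopsided = (0.9, 0.9, 0.1, 0.1)  # same sum as `balanced`, but uneven

hv_balanced = hypervolume(balanced, reference)
hv_lopsided = hypervolume(lopsided, reference)
print(hv_balanced, hv_lopsided)
```

Both score vectors have the same sum, so a plain average would rate them equally, while the hypervolume is much larger for the balanced one. That is the property that makes a hypervolume-style signal attractive for multi-objective rewards.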
HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy
[AUTHORS]Fan Xu, Xinyu Hu, Zhenghan Yu, Li Lin, Xu Zhang, Yang Zhang, Wei Zhou, Jinjie Gu, Xiaojun Wan
[ABSTRACT]The increasing reliance on natural language generation (NLG) models,
particularly large language models, has raised concerns about the reliability
and accuracy of their outputs. A key challenge is hallucination, where models
produce plausible but incorrect information. As a result, hallucination
detection has become a critical task. In this work, we introduce a
comprehensive hallucination taxonomy with 11 categories across various NLG
tasks and propose the HAllucination Detection (HAD) models
(https://github.com/pku0xff/HAD), which integrate hallucination detection,
span-level identification, and correction into a single inference process.
Trained on an elaborate synthetic dataset of about 90K samples, our HAD models
are versatile and can be applied to various NLG tasks. We also carefully
annotate a test set for hallucination detection, called HADTest, which contains
2,248 samples. Evaluations on in-domain and out-of-domain test sets show that
our HAD models generally outperform the existing baselines, achieving
state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their
robustness and versatility.
[LINK]http://arxiv.org/abs/2510.19318v1
[DATE]2025-10-22 15:28:37+08:00
[CATEGORIES]cs.CL
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
[AUTHORS]Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, Qing Li
[ABSTRACT]Large Multimodal Models encode extensive factual knowledge in their
pre-trained weights. However, their knowledge remains static and limited, unable
to keep pace with real-world developments, which hinders continuous knowledge
acquisition. Effective knowledge injection thus becomes critical, involving two
goals: knowledge adaptation (injecting new knowledge) and knowledge retention
(preserving old knowledge). Existing methods often struggle to learn new
knowledge and suffer from catastrophic forgetting. To address this, we propose
KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints
for injecting new knowledge into large multimodal models while preserving old
knowledge. Unlike general text or image data augmentation, KORE automatically
converts individual knowledge items into structured and comprehensive knowledge
to ensure that the model accurately learns new knowledge, enabling accurate
adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix
of the LMM’s linear layer activations and initializes the adapter by projecting the
original weights into the matrix’s null space, defining a fine-tuning direction
that minimizes interference with previous knowledge, enabling powerful
retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B,
LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new
knowledge injection performance and effectively mitigates catastrophic
forgetting.
[COMMENTS]project page: https://kore-lmm.github.io/
[LINK]http://arxiv.org/abs/2510.19316v1
[DATE]2025-10-22 15:26:55+08:00
[CATEGORIES]cs.CL
JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation
[AUTHORS]Fan Xu, Huixuan Zhang, Zhenliang Zhang, Jiahao Wang, Xiaojun Wan
[ABSTRACT]Current large language models (LLMs) often suffer from hallucination issues,
i.e., generating content that appears factual but is actually unreliable. A
typical hallucination detection pipeline involves response decomposition (i.e.,
claim extraction), query generation, evidence collection (i.e., search or
retrieval), and claim verification. However, existing methods exhibit
limitations in the first two stages, such as context loss during claim
extraction and low specificity in query generation, resulting in degraded
performance across the hallucination detection pipeline. In this work, we
introduce JointCQ (https://github.com/pku0xff/JointCQ), a joint claim-and-query
generation framework designed to construct an effective and efficient
claim-query generator. Our framework leverages elaborately designed evaluation
criteria to filter synthesized training data, and finetunes a language model
for joint claim extraction and query generation, providing reliable and
informative inputs for downstream search and verification. Experimental results
demonstrate that our method outperforms previous methods on multiple
open-domain QA hallucination detection benchmarks, advancing the goal of more
trustworthy and transparent language model systems.
[LINK]http://arxiv.org/abs/2510.19310v1
[DATE]2025-10-22 15:15:37+08:00
[CATEGORIES]cs.CL
From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
[AUTHORS]Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo
[ABSTRACT]Discrete diffusion models have emerged as a promising direction for
vision-language tasks, offering bidirectional context modeling and theoretical
parallelization. However, their practical application is severely hindered by a
train-inference discrepancy, which leads to catastrophic error cascades:
initial token errors during parallel decoding pollute the generation context,
triggering a chain reaction of compounding errors and leading to syntactic
errors and semantic hallucinations. To address this fundamental challenge, we
reframe the generation process from passive denoising to active refining. We
introduce ReDiff, a refining-enhanced diffusion framework that teaches the
model to identify and correct its own errors. Our approach features a two-stage
training process: first, we instill a foundational revision capability by
training the model to revise synthetic errors; second, we implement a novel
online self-correction loop where the model is explicitly trained to revise its
own flawed drafts by learning from an expert’s corrections. This mistake-driven
learning endows the model with the crucial ability to revisit and refine its
already generated output, effectively breaking the error cascade. Extensive
experiments demonstrate that ReDiff significantly improves the coherence and
factual accuracy of generated content, enabling stable and efficient parallel
generation far superior to traditional denoising methods. Our code and models
are available at https://rediff-hku.github.io/.
[LINK]http://arxiv.org/abs/2510.19871v1
[DATE]2025-10-22 14:58:55+08:00
[CATEGORIES]cs.CL
Reasoning Models Better Express Their Confidence
[AUTHORS]Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, Minjoon Seo
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.14489v2
[DATE]2025-10-22 14:37:15+08:00
[CATEGORIES]cs.CL
Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
[AUTHORS]Yuto Tomikawa, Masaki Uto
[ABSTRACT]Difficulty-controllable question generation for reading comprehension has
gained significant attention in the field of education as a fundamental tool
for adaptive learning support. Although several neural question generation
methods have recently succeeded in controlling difficulty, conventional
approaches still face two major limitations. First, they cannot directly
generate multiple-choice questions, which are the most widely used question
type in educational contexts. Second, they are not explicitly trained to
optimize the accuracy of difficulty control, leaving room for further
improvement in difficulty controllability. To address these limitations, this
study proposes a novel difficulty-controllable multiple-choice question
generation method for reading comprehension which leverages a large language
model trained using a direct preference optimization technique to improve the
accuracy of difficulty control.
[COMMENTS]This work has been submitted to the IEEE for possible publication
[LINK]http://arxiv.org/abs/2510.19265v1
[DATE]2025-10-22 13:49:31+08:00
[CATEGORIES]cs.CL
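The entry above trains its question generator with direct preference optimization (DPO). For readers unfamiliar with the objective, here is a minimal sketch of the standard DPO loss for one preference pair, written in plain Python. The log-probability values are made up, and treating the "chosen" question as the one whose difficulty is closer to the target is an assumption for illustration; the abstract does not say how the preference pairs are built.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits), computed in a stable form.
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical log-probabilities for one pair of generated questions.
loss_at_init = dpo_loss(-42.0, -40.0, -42.0, -40.0)   # policy == reference
loss_improved = dpo_loss(-38.0, -44.0, -42.0, -40.0)  # policy now prefers "chosen"
print(loss_at_init, loss_improved)
```

When the policy equals the reference model the loss is log 2, and it falls as the policy raises the chosen response's likelihood relative to the rejected one.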
Beyond Hearing: Learning Task-agnostic ExG Representations from Earphones via Physiology-informed Tokenization
[AUTHORS]Hyungjun Yoon, Seungjoo Lee, Yu Yvonne Wu, Xiaomeng Chen, Taiting Lu, Freddy Yifei Liu, Taeckyung Lee, Hyeongheon Cha, Haochen Zhao, Gaoteng Zhao, Sung-Ju Lee, Cecilia Mascolo, Dongyao Chen, Lili Qiu
[ABSTRACT]Electrophysiological (ExG) signals offer valuable insights into human
physiology, yet building foundation models that generalize across everyday
tasks remains challenging due to two key limitations: (i) insufficient data
diversity, as most ExG recordings are collected in controlled labs with bulky,
expensive devices; and (ii) task-specific model designs that require tailored
processing (e.g., targeted frequency filters) and architectures, which limit
generalization across tasks. To address these challenges, we introduce an
approach for scalable, task-agnostic ExG monitoring in the wild. We collected
50 hours of unobtrusive free-living ExG data with an earphone-based hardware
prototype to narrow the data diversity gap. At the core of our approach is
Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG
signals into 12 physiology-informed tokens, followed by a reconstruction task
to learn robust representations. This enables adaptive feature recognition
across the full frequency spectrum while capturing task-relevant information.
Experiments on our new DailySense dataset (the first to enable ExG-based
analysis across five human senses), together with four public ExG benchmarks,
demonstrate that PiMT consistently outperforms state-of-the-art methods across
diverse tasks.
[COMMENTS]19 pages, 9 figures
[LINK]http://arxiv.org/abs/2510.20853v1
[DATE]2025-10-22 13:11:02+08:00
[CATEGORIES]cs.CL
SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets
[AUTHORS]Ziwei Wang, Jiayuan Su, Mengyu Zhou, Huaxing Zeng, Mengni Jia, Xiao Lv, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
[ABSTRACT]Understanding and reasoning over complex spreadsheets remain fundamental
challenges for large language models (LLMs), which often struggle with
accurately capturing the complex structure of tables and ensuring reasoning
correctness. In this work, we propose SheetBrain, a neuro-symbolic dual
workflow agent framework designed for accurate reasoning over tabular data,
supporting both spreadsheet question answering and manipulation tasks.
SheetBrain comprises three core modules: an understanding module, which
produces a comprehensive overview of the spreadsheet, including a sheet summary
and query-based problem insight, to guide reasoning; an execution module, which
integrates a Python sandbox with preloaded table-processing libraries and an
Excel helper toolkit for effective multi-turn reasoning; and a validation
module, which verifies the correctness of reasoning and answers, triggering
re-execution when necessary. We evaluate SheetBrain on multiple public tabular
QA and manipulation benchmarks, and introduce SheetBench, a new benchmark
targeting large, multi-table, and structurally complex spreadsheets.
Experimental results show that SheetBrain significantly improves accuracy on
both existing benchmarks and the more challenging scenarios presented in
SheetBench. Our code is publicly available at
https://github.com/microsoft/SheetBrain.
[LINK]http://arxiv.org/abs/2510.19247v1
[DATE]2025-10-22 13:09:44+08:00
[CATEGORIES]cs.CL
FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
[AUTHORS]Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni
[ABSTRACT]Large language models (LLMs) owe much of their stellar performance to
expansive input contexts, yet such verbosity inflates monetary costs, carbon
footprint, and inference-time latency. Much of this overhead stems from the
redundant low-utility tokens present in typical prompts, as only a fraction of
tokens typically carries the majority of the semantic weight. We address this
inefficiency by introducing FrugalPrompt, a novel prompt compression framework
for LLMs, which retains only the most semantically significant tokens.
Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX,
we assign salience scores to every token in an input sequence, rank them to
preserve the top-k% tokens in their original order, and obtain a sparse
frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment
Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a
suite of frontier LLMs. For the first three tasks, a 20% prompt reduction
incurs only a marginal loss in task performance, demonstrating that
contemporary LLMs can reconstruct elided context from high-salience cues. In
contrast, performance on mathematical reasoning deteriorates sharply,
reflecting a stronger dependence on complete token continuity. Further analysis
with bottom-k% and random-k% tokens reveals asymmetric performance patterns
that may suggest potential task contamination effects, wherein models may
resort to shallow memorized patterns from pretraining exposure for conventional
NLP tasks. We posit that our work contributes to a more nuanced understanding
of LLM behavior in performance-efficiency trade-offs, and delineate the
boundary between tasks tolerant to contextual sparsity and those requiring
exhaustive context. Our source code and models are available at:
https://github.com/Starscream-11813/Frugal-ICL.
[LINK]http://arxiv.org/abs/2510.16439v2
[DATE]2025-10-22 12:39:03+08:00
[CATEGORIES]cs.CL
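The FrugalPrompt entry above describes ranking tokens by salience and keeping the top-k% in their original order. A minimal sketch of that selection step follows, assuming the salience scores are already available as one number per token. In the paper they come from GlobEnc or DecompX; the scores here are invented, and the function name and the round-up rule for the number of kept tokens are assumptions.

```python
import math


def frugalize(tokens, salience, keep_percent):
    """Keep the `keep_percent` most salient tokens, preserving original order."""
    if len(tokens) != len(salience):
        raise ValueError("tokens and salience must have the same length")
    # Number of tokens to keep; rounding up is an illustrative choice.
    n_keep = max(1, math.ceil(len(tokens) * keep_percent / 100))
    # Rank token indices from most to least salient.
    ranked = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)
    # Restore the original left-to-right order among the kept indices.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


# Hypothetical tokens and salience scores for a sentiment-analysis input.
tokens = ["the", "movie", "was", "absolutely", "wonderful", "."]
salience = [0.05, 0.60, 0.10, 0.40, 0.90, 0.02]

compressed = frugalize(tokens, salience, keep_percent=50)
print(compressed)  # ['movie', 'absolutely', 'wonderful']
```

The 20% prompt reduction mentioned in the abstract would correspond to `keep_percent=80` in this sketch.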
Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search
[AUTHORS]Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, Xiaosong Wang
[ABSTRACT]Multimodal large language models (MLLMs) have begun to demonstrate robust
reasoning capabilities on general tasks, yet their application in the medical
domain remains in its early stages. Constructing chain-of-thought (CoT)
training data is essential for bolstering the reasoning abilities of medical
MLLMs. However, existing approaches exhibit a deficiency in offering a
comprehensive framework for searching and evaluating effective reasoning paths
towards critical diagnosis. To address this challenge, we propose Mentor-Intern
Collaborative Search (MICS), a novel reasoning-path searching scheme to
generate rigorous and effective medical CoT data. MICS first leverages mentor
models to initialize the reasoning, one step at a time, then prompts each
intern model to continue the thinking along those initiated paths, and finally
selects the optimal reasoning path according to the overall reasoning
performance of multiple intern models. The reasoning performance is determined
by an MICS-Score, which assesses the quality of generated reasoning paths.
Eventually, we construct MMRP, a multi-task medical reasoning dataset with
ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum
learning strategy, with robust visual question-answering and generalizable
reasoning capabilities. Extensive experiments demonstrate that Chiron-o1,
trained on our CoT dataset constructed using MICS, achieves state-of-the-art
performance across a list of medical visual question answering and reasoning
benchmarks. Code is available at https://github.com/manglu097/Chiron-o1
[LINK]http://arxiv.org/abs/2506.16962v2
[DATE]2025-10-22 12:23:57+08:00
[CATEGORIES]cs.CL
Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
[AUTHORS]York Hay Ng, Aditya Khan, Xiang Lu, Matteo Salloum, Michael Zhou, Phuong H. Hoang, A. Seza Doğruöz, En-Shiun Annie Lee
[ABSTRACT]Existing linguistic knowledge bases such as URIEL+ provide valuable
geographic, genetic and typological distances for cross-lingual transfer but
suffer from two key limitations. One, their one-size-fits-all vector
representations are ill-suited to the diverse structures of linguistic data,
and two, they lack a principled method for aggregating these signals into a
single, comprehensive score. In this paper, we address these gaps by
introducing a framework for type-matched language distances. We propose novel,
structure-aware representations for each distance type: speaker-weighted
distributions for geography, hyperbolic embeddings for genealogy, and a latent
variables model for typology. We unify these signals into a robust,
task-agnostic composite distance. In selecting transfer languages, our
representations and composite distances consistently improve performance across
a wide range of NLP tasks, providing a more principled and effective toolkit
for multilingual research.
[LINK]http://arxiv.org/abs/2510.19217v1
[DATE]2025-10-22 11:59:19+08:00
[CATEGORIES]cs.CL
Flexible-length Text Infilling for Discrete Diffusion Models
[AUTHORS]Andrew Zhang, Anushka Sivakumar, Chiawei Tang, Chris Thomas
[ABSTRACT]Discrete diffusion models are a new class of text generators that offer
advantages such as bidirectional context use, parallelizable generation, and
flexible prompting compared to autoregressive models. However, a critical
limitation of discrete diffusion models is their inability to perform
flexible-length or flexible-position text infilling without access to
ground-truth positional data. We introduce DDOT (Discrete Diffusion with
Optimal Transport Position Coupling),
the first discrete diffusion model to overcome this challenge. DDOT jointly
denoises token values and token positions, employing a novel sample-level
Optimal Transport (OT) coupling. This coupling preserves relative token
ordering while dynamically adjusting the positions and length of infilled
segments, a capability previously missing in text diffusion. Our method is
orthogonal to existing discrete text diffusion methods and is compatible with
various pretrained text denoisers. Extensive experiments on text infilling
benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms
naive diffusion baselines. Furthermore, DDOT achieves performance on par with
state-of-the-art non-autoregressive models and enables significant improvements
in training efficiency and flexibility.
[COMMENTS]Major edit of methodology section. Matches EMNLP camera-ready version
[LINK]http://arxiv.org/abs/2506.13579v2
[DATE]2025-10-22 11:27:18+08:00
[CATEGORIES]cs.LG cs.CL
Aligning Multilingual News for Stock Return Prediction
[AUTHORS]Yuntao Wu, Lynn Tao, Ing-Haw Cheng, Charles Martineau, Yoshio Nozawa, John Hull, Andreas Veneris
[ABSTRACT]News spreads rapidly across languages and regions, but translations may lose
subtle nuances. We propose a method to align sentences in multilingual news
articles using optimal transport, identifying semantically similar content
across languages. We apply this method to align more than 140,000 pairs of
Bloomberg English and Japanese news articles covering around 3500 stocks on
the Tokyo exchange over 2012-2024. Aligned sentences are sparser, more
interpretable, and exhibit higher semantic similarity. Return scores
constructed from aligned sentences show stronger correlations with realized
stock returns, and long-short trading strategies based on these alignments
achieve 10% higher Sharpe ratios than strategies that analyze the full text sample.
[COMMENTS]6 pages, 4 tables, 2 figures, AI for Finance Symposium’25 Workshop at
ICAIF’25
[LINK]http://arxiv.org/abs/2510.19203v1
[DATE]2025-10-22 11:23:24+08:00
[CATEGORIES]cs.CL
NAACL2025 Tutorial: Adaptation of Large Language Models
[AUTHORS]Zixuan Ke, Yifei Ming, Shafiq Joty
[ABSTRACT]This tutorial on adaptation of LLMs is designed to address the growing demand
for models that go beyond the static capabilities of generic LLMs by providing
an overview of dynamic, domain-specific, and task-adaptive LLM adaptation
techniques. While general LLMs have demonstrated strong generalization across a
variety of tasks, they often struggle to perform well in specialized domains
such as finance, healthcare, and code generation for underrepresented
languages. Additionally, their static nature limits their ability to evolve
with the changing world, and they are often extremely large in size, making
them impractical and costly to deploy at scale. As a result, the adaptation of
LLMs has drawn much attention since the birth of LLMs and is of core
importance, both for industry, which focuses on serving its targeted users, and
academia, which can greatly benefit from small but powerful LLMs. To address
this gap, this tutorial aims to provide an overview of the LLM adaptation
techniques. We start with an introduction to LLM adaptation, from both the data
perspective and the model perspective. We then emphasize how the evaluation
metrics and benchmarks are different from other techniques. After establishing
the problems, we explore various adaptation techniques. We categorize
adaptation techniques into two main families. The first is parametric knowledge
adaptation, which focuses on updating the parametric knowledge within LLMs.
Additionally, we will discuss real-time adaptation techniques, including model
editing, which allows LLMs to be updated dynamically in production
environments. The second kind of adaptation is semi-parametric knowledge
adaptation, where the goal is to update LLM parameters to better leverage
external knowledge or tools through techniques like retrieval-augmented
generation (RAG) and agent-based systems.
[COMMENTS]NAACL2025 Tutorial
[LINK]http://arxiv.org/abs/2504.03931v3
[DATE]2025-10-22 11:10:43+08:00
[CATEGORIES]cs.CL
Demystifying Domain-adaptive Post-training for Financial LLMs
[AUTHORS]Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
[ABSTRACT]Domain-adaptive post-training of large language models (LLMs) has emerged as
a promising approach for specialized domains such as medicine and finance.
However, significant challenges remain in identifying optimal adaptation
criteria and training strategies across varying data and model configurations.
To address these challenges, we introduce FINDAP, a systematic and fine-grained
investigation into domain-adaptive post-training of LLMs for the finance
domain. Our approach consists of four key components: FinCap, which defines the
core capabilities required for the target domain; FinRec, an effective training
recipe that jointly optimizes continual pre-training and instruction-following,
along with a novel preference data distillation method leveraging process
signals from a generative reward model; FinTrain, a curated set of training
datasets supporting FinRec; and FinEval, a comprehensive evaluation suite
aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art
performance across a wide range of financial tasks. Our analysis also
highlights how each post-training stage contributes to distinct capabilities,
uncovering specific challenges and effective solutions, providing valuable
insights for domain adaptation of LLMs.
[COMMENTS]EMNLP 2025 (Oral, ARR best paper nomination)
[LINK]http://arxiv.org/abs/2501.04961v4
[DATE]2025-10-22 11:01:08+08:00
[CATEGORIES]cs.CL cs.LG
An Evaluation of the Pedagogical Soundness and Usability of AI-Generated Lesson Plans Across Different Models and Prompt Frameworks in High-School Physics
[AUTHORS]Xincheng Liu
[ABSTRACT]This study evaluates the pedagogical soundness and usability of AI-generated
lesson plans across five leading large language models: ChatGPT (GPT-5), Claude
Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Grok 4. Beyond model choice,
three structured prompt frameworks were tested: TAG (Task, Audience, Goal),
RACE (Role, Audience, Context, Execution), and COSTAR (Context, Objective,
Style, Tone, Audience, Response Format).
Fifteen lesson plans were generated for a single high-school physics topic,
The Electromagnetic Spectrum. The lesson plans were analyzed through four
automated computational metrics: (1) readability and linguistic complexity, (2)
factual accuracy and hallucination detection, (3) standards and curriculum
alignment, and (4) cognitive demand of learning objectives.
Results indicate that model selection exerted the strongest influence on
linguistic accessibility, with DeepSeek producing the most readable teaching
plan (FKGL = 8.64) and Claude generating the densest language (FKGL = 19.89).
The prompt framework structure most strongly affected factual accuracy
and pedagogical completeness, with the RACE framework yielding the lowest
hallucination index and the highest incidental alignment with NGSS curriculum
standards. Across all models, the learning objectives in the fifteen lesson
plans clustered at the Remember and Understand tiers of Bloom’s taxonomy. Few
higher-order verbs appeared in the extracted learning objectives.
Overall, the findings suggest that readability is significantly governed by
model design, while instructional reliability and curricular alignment depend
more on the prompt framework. The most effective configuration for lesson plans
identified in the results was to combine a readability-optimized model with the
RACE framework and an explicit checklist of physics concepts, curriculum
standards, and higher-order objectives.
[COMMENTS]20 pages, 6 tables
[LINK]http://arxiv.org/abs/2510.19866v1
[DATE]2025-10-22 10:53:06+08:00
[CATEGORIES]cs.CL
Interpretable Question Answering with Knowledge Graphs
[AUTHORS]Kartikeya Aneja, Manasvi Srivastava, Subhayan Das, Nagender Aneja
[ABSTRACT]This paper presents a question answering system that operates exclusively on
knowledge graph retrieval, without relying on retrieval-augmented generation
(RAG) with large language models (LLMs). Instead, a small paraphraser model is
used to paraphrase the entity relationship edges retrieved from querying the
knowledge graph. The proposed pipeline is divided into two main stages. The
first stage involves pre-processing a document to generate sets of
question-answer (QA) pairs. The second stage converts these QAs into a
knowledge graph from which graph-based retrieval is performed using embeddings
and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to
generate a final answer. This work includes an evaluation using LLM-as-a-judge
on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using
LLAMA-3.2 and GPT-3.5-Turbo, respectively.
[LINK]http://arxiv.org/abs/2510.19181v1
[DATE]2025-10-22 10:36:35+08:00
[CATEGORIES]cs.CL cs.LG
The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models
[AUTHORS]Yuqiao Tan, Shizhu He, Kang Liu, Jun Zhao
[ABSTRACT]Reasoning models have demonstrated exceptional performance in tasks such as
mathematics and logical reasoning, primarily due to their ability to engage in
step-by-step thinking during the reasoning process. However, this often leads
to overthinking, resulting in unnecessary computational overhead. To address
this issue, Mode Selection aims to automatically decide between Long-CoT
(Chain-of-Thought) and Short-CoT by utilizing either a Thinking or NoThinking
mode. Simultaneously, Early Exit determines the optimal stopping point during
the iterative reasoning process. Both methods seek to reduce the computational
burden. In this paper, we first identify Mode Selection as a more challenging
variant of the Early Exit problem, as they share similar objectives but differ
in decision timing. While Early Exit focuses on determining the best stopping
point for concise reasoning at inference time, Mode Selection must make this
decision at the beginning of the reasoning process, relying on pre-defined fake
thoughts without engaging in an explicit reasoning process, referred to as
zero-step thinking. Through empirical studies on nine baselines, we observe
that prompt-based approaches often fail due to their limited classification
capabilities when provided with minimal hand-crafted information. In contrast,
approaches that leverage internal information generally perform better across
most scenarios but still exhibit issues with stability. Our findings indicate
that existing methods relying solely on the information provided by models are
insufficient for effectively addressing Mode Selection in scenarios with
limited information, highlighting the ongoing challenges of this task. Our code
is available at https://github.com/Trae1ounG/Zero_Step_Thinking.
[COMMENTS]Accepted by NeurIPS’25 Efficient Reasoning Workshop
[LINK]http://arxiv.org/abs/2510.19176v1
[DATE]2025-10-22 10:28:10+08:00
[CATEGORIES]cs.CL
Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
[AUTHORS]Yichi Zhang, Zhuo Chen, Lingbing Guo, Lei Liang, Wen Zhang, Huajun Chen
[ABSTRACT]Understanding and reasoning with abstractive information from the visual
modality presents significant challenges for current multi-modal large language
models (MLLMs). Among the various forms of abstractive information, Multi-Modal
Relational Knowledge (MMRK), which represents abstract relational structures
between multi-modal entities using node-edge formats, remains largely
under-explored. In particular, STructured and Abstractive Reasoning (STAR) on
such data has received little attention from the research community. To bridge
the dual gaps in large-scale high-quality data and capability enhancement
methodologies, this paper makes the following key contributions: (i) an
automatic STAR data engine capable of synthesizing images with MMRK to build
multi-modal instruction data with reliable chain-of-thought thinking for
various STAR tasks, and (ii) a comprehensive two-stage capability enhancement
training framework, accompanied by a suite of evaluation protocols tailored to
different STAR tasks. Based upon these contributions, we introduce STAR-64K, a
dataset comprising 64K high-quality multi-modal instruction samples, and
conduct experiments across 5 open-source MLLMs. Experimental results show that
our two-stage enhancement framework enables smaller 3B/7B models to
significantly outperform GPT-4o in STAR. Additionally, we provide in-depth
analysis regarding the effectiveness of various designs, data transferability,
and scalability.
[COMMENTS]Work in Progress. Code and data will be released at
https://github.com/zjukg/STAR
[LINK]http://arxiv.org/abs/2510.21828v1
[DATE]2025-10-22 10:23:40+08:00
[CATEGORIES]cs.CL
When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
[AUTHORS]Nishanth Sridhar Nakshatri, Shamik Roy, Manoj Ghuhan Arivazhagan, Hanhan Zhou, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
[ABSTRACT]LLMs often fail to handle temporal knowledge conflicts: contradictions
arising when facts evolve over time within their training data. Existing
studies evaluate this phenomenon through benchmarks built on structured
knowledge bases like Wikidata, but they focus on widely-covered,
easily-memorized popular entities and lack the dynamic structure needed to
fairly evaluate LLMs with different knowledge cut-off dates. We introduce
evolveQA, a benchmark specifically designed to evaluate LLMs on temporally
evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS
updates, Azure changes, and WHO disease outbreak reports. Our framework
identifies naturally occurring knowledge evolution and generates questions with
gold answers tailored to different LLM knowledge cut-off dates. Through
extensive evaluation of 12 open and closed-source LLMs across 3 knowledge
probing formats, we demonstrate significant performance drops of up to 31% on
evolveQA compared to static knowledge questions.
[COMMENTS]Under submission
[LINK]http://arxiv.org/abs/2510.19172v1
[DATE]2025-10-22 10:12:32+08:00
[CATEGORIES]cs.CL
Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG
[AUTHORS]Jihwan Bang, Juntae Lee, Seunghan Yang, Sungha Choi
[ABSTRACT]Multi-hop retrieval-augmented generation (RAG) is a promising strategy for
complex reasoning, yet existing iterative prompting approaches remain
inefficient. They often regenerate predictable token sequences at every step
and rely on stochastic stopping, leading to excessive token usage and unstable
termination. We propose TSSS (Think Straight, Stop Smart), a structured
multi-hop RAG framework designed for efficiency. TSSS introduces (i)
template-based reasoning that caches recurring prefixes and anchors sub-queries
to the main question, reducing token generation cost while promoting stable
reasoning, and (ii) a retriever-based terminator, which deterministically halts
reasoning once additional sub-queries collapse into repetition. This separation
of structured reasoning and termination control enables both faster inference
and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS
achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT
approaches, highlighting its effectiveness in efficiency-constrained scenarios
such as on-device inference.
[COMMENTS]Accepted at NeurIPS 2025 Workshop
[LINK]http://arxiv.org/abs/2510.19171v1
[DATE]2025-10-22 10:09:23+08:00
[CATEGORIES]cs.CL
OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform
[AUTHORS]Thomas Wang, Haowen Li
[ABSTRACT]As large language models (LLMs) become increasingly integrated into
real-world applications, safeguarding them against unsafe, malicious, or
privacy-violating content is critically important. We present OpenGuardrails,
the first open-source project to provide both a context-aware safety and
manipulation detection model and a deployable platform for comprehensive AI
guardrails. OpenGuardrails protects against content-safety risks,
model-manipulation attacks (e.g., prompt injection, jailbreaking,
code-interpreter abuse, and the generation/execution of malicious code), and
data leakage. Content-safety and model-manipulation detection are implemented
by a unified large model, while data-leakage identification and redaction are
performed by a separate lightweight NER pipeline (e.g., Presidio-style models
or regex-based detectors). The system can be deployed as a security gateway or
an API-based service, with enterprise-grade, fully private deployment options.
OpenGuardrails achieves state-of-the-art (SOTA) performance on safety
benchmarks, excelling in both prompt and response classification across
English, Chinese, and multilingual tasks. All models are released under the
Apache 2.0 license for public use.
[LINK]http://arxiv.org/abs/2510.19169v1
[DATE]2025-10-22 10:02:27+08:00
[CATEGORIES]cs.CL
Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
[AUTHORS]Yanhong Li, Zixuan Lan, Jiawei Zhou
[COMMENTS]Accepted to EMNLP 2025 Findings (“Text or Pixels? Evaluating
Efficiency and Understanding of LLMs with Visual Text Inputs”)
[LINK]http://arxiv.org/abs/2510.18279v2
[DATE]2025-10-22 09:54:03+08:00
[CATEGORIES]cs.CL
Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
[AUTHORS]Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K. Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
[ABSTRACT]Machine unlearning techniques aim to mitigate unintended memorization in
large language models (LLMs). However, existing approaches predominantly focus
on the explicit removal of isolated facts, often overlooking latent inferential
dependencies and the non-deterministic nature of knowledge within LLMs.
Consequently, facts presumed forgotten may persist implicitly through
correlated information. To address these challenges, we propose a knowledge
unlearning evaluation framework that more accurately captures the implicit
structure of real-world knowledge by representing relevant factual contexts as
knowledge graphs with associated confidence scores. We further develop an
inference-based evaluation protocol leveraging powerful LLMs as judges; these
judges reason over the extracted knowledge subgraph to determine unlearning
success. Our LLM judges utilize carefully designed prompts and are calibrated
against human evaluations to ensure their trustworthiness and stability.
Extensive experiments on our newly constructed benchmark demonstrate that our
framework provides a more realistic and rigorous assessment of unlearning
performance. Moreover, our findings reveal that current evaluation strategies
tend to overestimate unlearning effectiveness. Our code is publicly available
at https://github.com/Graph-COM/Knowledge_Unlearning.git.
[COMMENTS]NeurIPS Camera-Ready Version. Code available at:
https://github.com/Graph-COM/Knowledge_Unlearning
[LINK]http://arxiv.org/abs/2506.05735v4
[DATE]2025-10-22 09:45:06+08:00
[CATEGORIES]cs.CL cs.LG
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
[AUTHORS]Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar
[COMMENTS]The first two authors contributed equally
[LINK]http://arxiv.org/abs/2510.17947v2
[DATE]2025-10-22 09:18:53+08:00
[CATEGORIES]cs.CL cs.LG
Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
[AUTHORS]Cheng Huang, Nyima Tashi, Fan Gao, Yutong Liu, Jiahao Li, Hao Tian, Siyang Jiang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Jin Zhang, Xiao Feng, Hao Wang, Jie Tang, Guojie Tang, Xiangxiang Wang, Jia Zhang, Tsengdar Lee, Yongbin Yu
[ABSTRACT]Tibetan, one of the major low-resource languages in Asia, presents unique
linguistic and sociocultural characteristics that pose both challenges and
opportunities for AI research. Despite increasing interest in developing AI
systems for underrepresented languages, Tibetan has received limited attention
due to a lack of accessible data resources, standardized benchmarks, and
dedicated tools. This paper provides a comprehensive survey of the current
state of Tibetan AI, covering textual and speech data
resources, NLP tasks, machine translation, speech recognition, and recent
developments in LLMs. We systematically categorize existing datasets and tools,
evaluate methods used across different tasks, and compare performance where
possible. We also identify persistent bottlenecks such as data sparsity,
orthographic variation, and the lack of unified evaluation metrics.
Additionally, we discuss the potential of cross-lingual transfer, multi-modal
learning, and community-driven resource creation. This survey aims to serve as
a foundational reference for future work on Tibetan AI research and encourages
collaborative efforts to build an inclusive and sustainable AI ecosystem for
low-resource languages.
[LINK]http://arxiv.org/abs/2510.19144v1
[DATE]2025-10-22 08:29:35+08:00
[CATEGORIES]cs.CL