MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
[AUTHORS]Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
[ABSTRACT]Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
[LINK]http://arxiv.org/abs/2601.10712v1
[DATE]2026-01-16 02:59:23+08:00
[CATEGORIES]cs.CL
Grounding Agent Memory in Contextual Intent
[AUTHORS]Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
[ABSTRACT]Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step’s intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history.
For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
[LINK]http://arxiv.org/abs/2601.10702v1
[DATE]2026-01-16 02:55:13+08:00
[CATEGORIES]cs.CL
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
[AUTHORS]Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
[ABSTRACT]Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation, interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
[LINK]http://arxiv.org/abs/2601.10700v1
[DATE]2026-01-16 02:54:50+08:00
[CATEGORIES]cs.CL
BASIL: Bayesian Assessment of Sycophancy in LLMs
[AUTHORS]Katherine Atwell, Pedram Heydari, Anthony Sicilia, Malihe Alikhani
[ABSTRACT]Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we achieve three objectives: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
[LINK]http://arxiv.org/abs/2508.16846v3
[DATE]2026-01-16 02:31:50+08:00
[CATEGORIES]cs.CL
Detecting Winning Arguments with Large Language Models and Persuasion Strategies
[AUTHORS]Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino
[ABSTRACT]Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Argument dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
[LINK]http://arxiv.org/abs/2601.10660v1
[DATE]2026-01-16 02:30:15+08:00
[CATEGORIES]cs.CL
Knowledge Homophily in Large Language Models
[AUTHORS]Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang
[ABSTRACT]Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.
[LINK]http://arxiv.org/abs/2509.23773v2
[DATE]2026-01-16 02:26:36+08:00
[CATEGORIES]cs.LG cs.CL
PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
[AUTHORS]Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
[ABSTRACT]Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
[LINK]http://arxiv.org/abs/2505.20323v2
[DATE]2026-01-16 02:18:24+08:00
[CATEGORIES]cs.CL cs.LG
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
[AUTHORS]Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
[ABSTRACT]Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model’s uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
[LINK]http://arxiv.org/abs/2510.01444v2
[DATE]2026-01-16 01:51:14+08:00
[CATEGORIES]cs.CL cs.LG
On the Failure of Latent State Persistence in Large Language Models
[AUTHORS]Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze
[ABSTRACT]While Large Language Models (LLMs) excel in reasoning, whether they can sustain persistent latent states remains under-explored. The capacity to maintain and manipulate unexpressed, internal representations-analogous to human working memory-is a cornerstone of complex reasoning. In this paper, we formalize and quantify the “Latent State Persistence” (LSP) gap through three novel experiments. First, we utilize a Number Guessing Game, demonstrating that across independent queries, LLMs fail to allocate probability mass to a singular hidden choice, violating a fundamental probabilistic principle. Second, we employ a Yes-No Game to show that as the number of questions increases, LLMs suffer from “concept drift,” leading to inevitable self-contradictions due to the lack of LSP. Finally, inspired by Mathematical Mentalism, we task models with tracking transformations on hidden variables, revealing a failure in variable binding and state evolution when the initial state is not explicitly present in the context. Collectively, these findings suggest that LLMs function as reactive post-hoc solvers rather than proactive planners with LSP. Our work provides a framework for evaluating the fidelity of internal representations and highlights a fundamental architectural divergence between autoregressive transformers and human-like cognition.
[COMMENTS]8 pages, 6 figures, 9 tables
[LINK]http://arxiv.org/abs/2505.10571v4
[DATE]2026-01-16 01:44:56+08:00
[CATEGORIES]cs.CL
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
[AUTHORS]Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
[ABSTRACT]Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
[COMMENTS]Work in Progress
[LINK]http://arxiv.org/abs/2601.08763v2
[DATE]2026-01-16 01:24:46+08:00
[CATEGORIES]cs.LG cs.CL
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
[AUTHORS]Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
[ABSTRACT]Large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs must construct representations of the world and form probabilistic beliefs about them. To provide personalized recommendations, for example, the LLM needs to infer a user’s preferences from their behavior over multiple interactions. The Bayesian inference framework lays out the optimal way for an agent to update its beliefs as it receives new information. We first show that LLMs fall far short of the standard defined by the Bayesian framework. We then show that by teaching LLMs to mimic the predictions of the normative Bayesian model, we can dramatically improve their ability to update their beliefs; this ability generalizes to new tasks. We conclude that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
[COMMENTS]Nature Communications
[LINK]http://arxiv.org/abs/2503.17523v3
[DATE]2026-01-16 01:21:57+08:00
[CATEGORIES]cs.CL
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
[AUTHORS]Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
[ABSTRACT]Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce \textbf{Multi-Agent Test-Time Reinforcement Learning (MATTRL)}, a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67\% over a multi-agent baseline, and by 8.67\% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
[COMMENTS]Work in Progress
[LINK]http://arxiv.org/abs/2601.09667v2
[DATE]2026-01-16 01:20:36+08:00
[CATEGORIES]cs.CL
Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
[AUTHORS]Hao Wang, Yanting Wang, Hao Li, Rui Li, Lei Sha
[ABSTRACT]Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial “jailbreak” attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self- Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs a Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
[LINK]http://arxiv.org/abs/2601.10589v1
[DATE]2026-01-16 01:00:16+08:00
[CATEGORIES]cs.CL
PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization
[AUTHORS]Jiajun Zhang, Jianke Zhang, Zeyu Cui, Jiaxi Yang, Lei Zhang, Binyuan Hui, Qiang Liu, Zilei Wang, Liang Wang, Junyang Lin
[ABSTRACT]Recent Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation. However, their ability to create complex visualizations for scaled and structured data remains largely unevaluated and underdeveloped. To address this gap, we introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks that cover a wide range of topics, such as finance, scientific research, and sociology. The benchmark is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. Crucially, it is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities. Our comprehensive evaluation of 23 leading LLMs on PlotCraft reveals obvious performance deficiencies in handling sophisticated visualization tasks. To bridge this performance gap, we develope SynthVis-30K, a large-scale, high-quality dataset of complex visualization code synthesized via a collaborative agent framework. Building upon this dataset, we develope PlotCraftor, a novel code generation model that achieves strong capabilities in complex data visualization with a remarkably small size. Across VisEval, PandasPlotBench, and our proposed PlotCraft, PlotCraftor shows performance comparable to that of leading proprietary approaches. Especially, on hard task, Our model achieves over 50% performance improvement. We will release the benchmark, dataset, and code at https://github.com/Speakn0w/PlotCraft-Benchmark.
[LINK]http://arxiv.org/abs/2511.00010v2
[DATE]2026-01-16 01:00:01+08:00
[CATEGORIES]cs.CL
Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
[AUTHORS]Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib, Nazia Tasnim, Farig Sadeque
[ABSTRACT]Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ approx 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation - erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
[COMMENTS]16 pages, 4 figures
[LINK]http://arxiv.org/abs/2601.10566v1
[DATE]2026-01-16 00:28:14+08:00
[CATEGORIES]cs.CL cs.LG
Learning Latency-Aware Orchestration for Parallel Multi-Agent Systems
[AUTHORS]Xi Shi, Mengxin Zheng, Qian Lou
[ABSTRACT]Multi-agent systems (MAS) enable complex reasoning by coordinating multiple agents, but often incur high inference latency due to multi-step execution and repeated model invocations, severely limiting their scalability and usability in time-sensitive scenarios. Most existing approaches primarily optimize task performance and inference cost, and explicitly or implicitly assume sequential execution, making them less optimal for controlling latency under parallel execution. In this work, we investigate learning-based orchestration of multi-agent systems with explicit latency supervision under parallel execution. We propose Latency-Aware Multi-agent System (LAMaS), a latency-aware multi-agent orchestration framework that enables parallel execution and explicitly optimizes the critical execution path, allowing the controller to construct execution topology graphs with lower latency under parallel execution. Our experiments show that our approach reduces critical path length by 38-46% compared to the state-of-the-art baseline for multi-agent architecture search across multiple benchmarks, while maintaining or even improving task performance. These results highlight the importance of explicitly optimizing latency under parallel execution when designing efficient multi-agent systems. The code is available at https://github.com/xishi404/LAMaS
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2601.10560v1
[DATE]2026-01-16 00:23:53+08:00
[CATEGORIES]cs.CL
Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost
[AUTHORS]Mihai Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran
[ABSTRACT]Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TinyFabulist Translation Framework (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English->Romanian literary translation, centered on the creation and open release of both a compact, fine-tuned language model (TF2-12B) and large-scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high-quality literary datasets in low-resource languages such as Romanian. Our pipeline first generates 15k high-quality Romanian reference translations from the TF1 pool using a high-performing LLM. We then apply a two-stage fine-tuning process to a 12B-parameter open-weight model: (i) instruction tuning to capture genre-specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus-level BLEU with a five-dimension LLM-based rubric (accuracy, fluency, coherence, style, and cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine-tuned model achieves strong fluency and adequacy, narrowing the gap to top-performing proprietary models under automated and human-anchored evaluation, while being open, accessible, and significantly more cost-effective. Alongside the fine-tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost-efficient translation, cross-lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low-resource settings.
[COMMENTS]25 pages, 8 figures, includes datasets and models released on Hugging Face
[LINK]http://arxiv.org/abs/2509.07829v2
[DATE]2026-01-16 00:20:47+08:00
[CATEGORIES]cs.CL cs.LG
Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing
[AUTHORS]Yinzhi Zhao, Ming Wang, Shi Feng, Xiaocui Yang, Daling Wang, Yifei Zhang
[ABSTRACT]Large language models (LLMs) have achieved impressive performance across natural language tasks and are increasingly deployed in real-world applications. Despite extensive safety alignment efforts, recent studies show that such alignment is often shallow and remains vulnerable to jailbreak attacks. Existing defense mechanisms, including decoding-based constraints and post-hoc content detectors, struggle against sophisticated jailbreaks, often intervening robust detection or excessively degrading model utility. In this work, we examine the decoding process of LLMs and make a key observation: even when successfully jailbroken, models internally exhibit latent safety-related signals during generation. However, these signals are overridden by the model’s drive for fluent continuation, preventing timely self-correction or refusal. Building on this observation, we propose a simple yet effective approach that explicitly surfaces and leverages these latent safety signals for early detection of unsafe content during decoding. Experiments across diverse jailbreak attacks demonstrate that our approach significantly enhances safety, while maintaining low over-refusal rates on benign inputs and preserving response quality. Our results suggest that activating intrinsic safety-awareness during decoding offers a promising and complementary direction for defending against jailbreak attacks. Code is available at: https://github.com/zyz13590/SafeProbing.
[LINK]http://arxiv.org/abs/2601.10543v1
[DATE]2026-01-16 00:09:10+08:00
[CATEGORIES]cs.CL
DInf-Grid: A Neural Differential Equation Solver with Differentiable Feature Grids
[AUTHORS]Navami Kairanda, Shanthika Naik, Marc Habermann, Avinash Sharma, Christian Theobalt, Vladislav Golyanik
[ABSTRACT]We present a novel differentiable grid-based representation for efficiently solving differential equations (DEs). Widely used architectures for neural solvers, such as sinusoidal neural networks, are coordinate-based MLPs that are both computationally intensive and slow to train. Although grid-based alternatives for implicit representations (e.g., Instant-NGP and K-Planes) train faster by exploiting signal structure, their reliance on linear interpolation restricts their ability to compute higher-order derivatives, rendering them unsuitable for solving DEs. Our approach overcomes these limitations by combining the efficiency of feature grids with radial basis function interpolation, which is infinitely differentiable. To effectively capture high-frequency solutions and enable stable and faster computation of global gradients, we introduce a multi-resolution decomposition with co-located grids. Our proposed representation, DInf-Grid, is trained implicitly using the differential equations as loss functions, enabling accurate modelling of physical fields. We validate DInf-Grid on a variety of tasks, including the Poisson equation for image reconstruction, the Helmholtz equation for wave fields, and the Kirchhoff-Love boundary value problem for cloth simulation. Our results demonstrate a 5-20x speed-up over coordinate-based MLP-based methods, solving differential equations in seconds or minutes while maintaining comparable accuracy and compactness.
[COMMENTS]25 pages; 16 figures; project page: https://4dqv.mpi-inf.mpg.de/DInf-Grid/
[LINK]http://arxiv.org/abs/2601.10715v1
[DATE]2026-01-16 02:59:57+08:00
[CATEGORIES]cs.LG
High-accuracy and dimension-free sampling with diffusions
[AUTHORS]Khashayar Gatmiry, Sitan Chen, Adil Salim
[ABSTRACT]Diffusion models have shown remarkable empirical success in sampling from rich multi-modal distributions. Their inference relies on numerically solving a certain differential equation. This differential equation cannot be solved in closed form, and its resolution via discretization typically requires many small iterations to produce \emph{high-quality} samples.
More precisely, prior works have shown that the iteration complexity of discretization methods for diffusion models scales polynomially in the ambient dimension and the inverse accuracy $1/\varepsilon$. In this work, we propose a new solver for diffusion models relying on a subtle interplay between low-degree approximation and the collocation method (Lee, Song, Vempala 2018), and we prove that its iteration complexity scales \emph{polylogarithmically} in $1/\varepsilon$, yielding the first “high-accuracy” guarantee for a diffusion-based sampler that only uses (approximate) access to the scores of the data distribution. In addition, our bound does not depend explicitly on the ambient dimension; more precisely, the dimension affects the complexity of our solver through the \emph{effective radius} of the support of the target distribution only.
[LINK]http://arxiv.org/abs/2601.10708v1
[DATE]2026-01-16 02:58:50+08:00
[CATEGORIES]cs.LG
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
[AUTHORS]Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf
[ABSTRACT]Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD). We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: $90$% of variance is captured by $17/64$ principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a $6.2$% average improvement and up to $20.4$% in closed-loop simulations, while being $2.4\times$ faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
[LINK]http://arxiv.org/abs/2601.10707v1
[DATE]2026-01-16 02:58:33+08:00
[CATEGORIES]cs.LG
Distributed Perceptron under Bounded Staleness, Partial Participation, and Noisy Communication
[AUTHORS]Keval Jain, Anant Raj, Saurav Prakash, Girish Varma
[ABSTRACT]We study a semi-asynchronous client-server perceptron trained via iterative parameter mixing (IPM-style averaging): clients run local perceptron updates and a server forms a global model by aggregating the updates that arrive in each communication round. The setting captures three system effects in federated and distributed deployments: (i) stale updates due to delayed model delivery and delayed application of client computations (two-sided version lag), (ii) partial participation (intermittent client availability), and (iii) imperfect communication on both downlink and uplink, modeled as effective zero-mean additive noise with bounded second moment. We introduce a server-side aggregation rule called staleness-bucket aggregation with padding that deterministically enforces a prescribed staleness profile over update ages without assuming any stochastic model for delays or participation. Under margin separability and bounded data radius, we prove a finite-horizon expected bound on the cumulative weighted number of perceptron mistakes over a given number of server rounds: the impact of delay appears only through the mean enforced staleness, whereas communication noise contributes an additional term that grows on the order of the square root of the horizon with the total noise energy. In the noiseless case, we show how a finite expected mistake budget yields an explicit finite-round stabilization bound under a mild fresh-participation condition.
[LINK]http://arxiv.org/abs/2601.10705v1
[DATE]2026-01-16 02:56:54+08:00
[CATEGORIES]cs.LG
Communication-Efficient and Privacy-Adaptable Mechanism – a Federated Learning Scheme with Convergence Analysis
[AUTHORS]Chun Hei Michael Shiu, Chih Wei Ling
[ABSTRACT]Federated learning enables multiple parties to jointly train learning models without sharing their own underlying data, offering a practical pathway to privacy-preserving collaboration under data-governance constraints. Continued study of federated learning is essential to address key challenges in it, including communication efficiency and privacy protection between parties. A recent line of work introduced a novel approach called the Communication-Efficient and Privacy-Adaptable Mechanism (CEPAM), which achieves both objectives simultaneously. CEPAM leverages the rejection-sampled universal quantizer (RSUQ), a randomized vector quantizer whose quantization error is equivalent to a prescribed noise, which can be tuned to customize privacy protection between parties. In this work, we theoretically analyze the privacy guarantees and convergence properties of CEPAM. Moreover, we assess CEPAM’s utility performance through experimental evaluations, including convergence profiles compared with other baselines, and accuracy-privacy trade-offs between different parties.
[COMMENTS]19 pages, 5 figures. This work is submitted in part to the 2026 IEEE International Symposium on Information Theory (ISIT). arXiv admin note: substantial text overlap with arXiv:2501.12046
[LINK]http://arxiv.org/abs/2601.10701v1
[DATE]2026-01-16 02:55:00+08:00
[CATEGORIES]cs.LG
Data-driven stochastic reduced-order modeling of parametrized dynamical systems
[AUTHORS]Andrew F. Ilersich, Kevin Course, Prasanth B. Nair
[ABSTRACT]Modeling complex dynamical systems under varying conditions is computationally intensive, often rendering high-fidelity simulations intractable. Although reduced-order models (ROMs) offer a promising solution, current methods often struggle with stochastic dynamics and fail to quantify prediction uncertainty, limiting their utility in robust decision-making contexts. To address these challenges, we introduce a data-driven framework for learning continuous-time stochastic ROMs that generalize across parameter spaces and forcing conditions. Our approach, based on amortized stochastic variational inference, leverages a reparametrization trick for Markov Gaussian processes to eliminate the need for computationally expensive forward solvers during training. This enables us to jointly learn a probabilistic autoencoder and stochastic differential equations governing the latent dynamics, at a computational cost that is independent of the dataset size and system stiffness. Additionally, our approach offers the flexibility of incorporating physics-informed priors if available. Numerical studies are presented for three challenging test problems, where we demonstrate excellent generalization to unseen parameter combinations and forcings, and significant efficiency gains compared to existing approaches.
[LINK]http://arxiv.org/abs/2601.10690v1
[DATE]2026-01-16 02:50:18+08:00
[CATEGORIES]cs.LG
On the origin of neural scaling laws: from random graphs to natural language
[AUTHORS]Maissam Barkeshli, Alberto Alfarano, Andrey Gromov
[ABSTRACT]Scaling laws have played a major role in the modern AI revolution, providing practitioners predictive power over how the model performance will improve with increasing data, compute, and number of model parameters. This has spurred an intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws even in the absence of power law structure in the data correlations. We further consider dialing down the complexity of natural language systematically, by training on sequences sampled from increasingly simplified generative language models, from 4,2,1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdös-Renyi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling, demonstrating that several essential results can be reproduced using 2 layer transformers with context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute optimal curves as compared with current practice in published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.
[COMMENTS]33 pages
[LINK]http://arxiv.org/abs/2601.10684v1
[DATE]2026-01-16 02:46:09+08:00
[CATEGORIES]cs.LG
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
[AUTHORS]Zirui Ren, Ziming Liu
[ABSTRACT]Hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study on its reasoning patterns and find three surprising facts: (a) Failure of extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) “Grokking” dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM “guesses” the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be “guessing” instead of “reasoning”. Leveraging this “guessing” picture, we propose three strategies to scale HRM’s guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models “reason”.
[LINK]http://arxiv.org/abs/2601.10679v1
[DATE]2026-01-16 02:42:50+08:00
[CATEGORIES]cs.LG
Single-Stage Huffman Encoder for ML Compression
[AUTHORS]Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
[ABSTRACT]Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression using Huffman codes is an effective way to alleviate the issue, however, its three-stage design requiring on-the-fly frequency analysis, codebook generation and transmission of codebook along with data introduces computational, latency and data overheads which are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through our analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. Using this approach we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.
[COMMENTS]5 pages, 4 figures
[LINK]http://arxiv.org/abs/2601.10673v1
[DATE]2026-01-16 02:37:56+08:00
[CATEGORIES]cs.LG
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
[AUTHORS]Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
[ABSTRACT]Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent’s context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
[LINK]http://arxiv.org/abs/2601.10657v1
[DATE]2026-01-16 02:25:23+08:00
[CATEGORIES]cs.LG
Sparse Nonparametric Contextual Bandits
[AUTHORS]Hamish Flynn, Julia Olkhovskaya, Paul Rognon-Vael
[ABSTRACT]We study the benefits of sparsity in nonparametric contextual bandit problems, in which the set of candidate features is countably or uncountably infinite. Our contribution is two-fold. First, using a novel reduction to sequences of multi-armed bandit problems, we provide lower bounds on the minimax regret, which show that polynomial dependence on the number of actions is generally unavoidable in this setting. Second, we show that a variant of the Feel-Good Thompson Sampling algorithm enjoys regret bounds that match our lower bounds up to logarithmic factors of the horizon, and have logarithmic dependence on the effective number of candidate features. When we apply our results to kernelised and neural contextual bandits, we find that sparsity enables better regret bounds whenever the horizon is large enough relative to the sparsity and the number of actions.
[COMMENTS]44 pages
[LINK]http://arxiv.org/abs/2503.16382v2
[DATE]2026-01-16 02:10:26+08:00
[CATEGORIES]cs.LG
RMBRec: Robust Multi-Behavior Recommendation towards Target Behaviors
[AUTHORS]Miaomiao Cai, Zhijie Zhang, Junfeng Fang, Zhiyong Cheng, Xiang Wang, Meng Wang
[ABSTRACT]Multi-behavior recommendation faces a critical challenge in practice: auxiliary behaviors (e.g., clicks, carts) are often noisy, weakly correlated, or semantically misaligned with the target behavior (e.g., purchase), which leads to biased preference learning and suboptimal performance. While existing methods attempt to fuse these heterogeneous signals, they inherently lack a principled mechanism to ensure robustness against such behavioral inconsistency.
In this work, we propose Robust Multi-Behavior Recommendation towards Target Behaviors (RMBRec), a robust multi-behavior recommendation framework grounded in an information-theoretic robustness principle. We interpret robustness as a joint process of maximizing predictive information while minimizing its variance across heterogeneous behavioral environments. Under this perspective, the Representation Robustness Module (RRM) enhances local semantic consistency by maximizing the mutual information between users’ auxiliary and target representations, whereas the Optimization Robustness Module (ORM) enforces global stability by minimizing the variance of predictive risks across behaviors, which is an efficient approximation to invariant risk minimization. This local-global collaboration bridges representation purification and optimization invariance in a theoretically coherent way. Extensive experiments on three real-world datasets demonstrate that RMBRec not only outperforms state-of-the-art methods in accuracy but also maintains remarkable stability under various noise perturbations. For reproducibility, our code is available at https://github.com/miaomiao-cai2/RMBRec/.
[LINK]http://arxiv.org/abs/2601.08705v2
[DATE]2026-01-16 02:06:47+08:00
[CATEGORIES]cs.LG
TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge
[AUTHORS]Matteo Fasulo, Giusy Spacone, Thorir Mar Ingolfsson, Yawei Li, Luca Benini, Andrea Cossettini
[ABSTRACT]Objective: Surface electromyography (EMG) is a non-invasive sensing modality widely used in biomechanics, rehabilitation, prosthetic control, and human-machine interfaces. Despite decades of use, achieving robust generalization across subjects, recording systems, and acquisition protocols remains challenging. While foundation models (FMs) are gaining traction for EMG, existing approaches remain limited to single downstream tasks and lack deployability on embedded platforms. This work addresses these limitations. Methods: We present TinyMyo, a lightweight FM based on a Transformer encoder architecture. The model is pre-trained in a self-supervised manner using masked reconstruction on publicly available datasets. With only 3.6M parameters, TinyMyo is designed to support multiple downstream tasks through minimal task-specific head adaptations. Results: We demonstrate generalization across hand gesture classification, hand kinematic regression, speech production and speech recognition, with performance comparable to or surpassing the state of the art (SoA), and model size below 5M parameters. We achieve SoA results compared to previous FM-based works on the NinaPro DB5 (89.4%), UCI-EMG (97.56%), and EPN-612 (96.74%) datasets. We demonstrate the first-time deployment of an EMG FM on an ultra-low power microcontroller (GAP9), with an inference time of 0.785 s, energy of 44.91 mJ and power envelope of 57.18 mW. Conclusion: TinyMyo demonstrates that compact, self-supervised EMG FM can guarantee strong generalization across multiple downstream tasks while remaining compatible with low-power edge devices. Significance: TinyMyo is the first EMG FM for ultra-low power edge devices, enabling scalable and energy-efficient sensing for motor intent decoding, neuromuscular assessment, and biosignal driven human-machine interaction.
[LINK]http://arxiv.org/abs/2512.15729v2
[DATE]2026-01-16 02:05:25+08:00
[CATEGORIES]cs.LG
Adjusted Similarity Measures and a Violation of Expectations
[AUTHORS]William L. Lippitt, Edward J. Bedrick, Nichole E. Carlson
[ABSTRACT]Adjusted similarity measures, such as Cohen’s kappa for inter-rater reliability and the adjusted Rand index used to compare clustering algorithms, are a vital tool for comparing discrete labellings. These measures are intended to have the property of 0 expectation under a null distribution and maximum value 1 under maximal similarity to aid in interpretation. Measures are frequently adjusted with respect to the permutation distribution for historic and analytic reasons. There is currently renewed interest in considering other null models more appropriate for context, such as clustering ensembles permitting a random number of identified clusters. The purpose of this work is two – fold: (1) to generalize the study of the adjustment operator to general null models and to a more general procedure which includes statistical standardization as a special case and (2) to identify sufficient conditions for the adjustment operator to produce the intended properties, where sufficient conditions are related to whether and how observed data are incorporated into null distributions. We demonstrate how violations of the sufficient conditions may lead to substantial breakdown, such as by producing a non-positive measure under traditional adjustment rather than one with mean 0, or by producing a measure which is deterministically 0 under statistical standardization.
[COMMENTS]12 pages, 1 figure
[LINK]http://arxiv.org/abs/2601.10641v1
[DATE]2026-01-16 02:01:26+08:00
[CATEGORIES]cs.LG
STEM: Scaling Transformers with Embedding Modules
[AUTHORS]Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
[ABSTRACT]Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3–4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.
[LINK]http://arxiv.org/abs/2601.10639v1
[DATE]2026-01-16 02:00:27+08:00
[CATEGORIES]cs.LG
Riesz Representer Fitting under Bregman Divergence: A Unified Framework for Debiased Machine Learning
[AUTHORS]Masahiro Kato
[ABSTRACT]Estimating the Riesz representer is central to debiased machine learning for causal and structural parameter estimation. We propose generalized Riesz regression, a unified framework that estimates the Riesz representer by fitting a representer model via Bregman divergence minimization. This framework includes the squared loss and the Kullback–Leibler (KL) divergence as special cases: the former recovers Riesz regression, while the latter recovers tailored loss minimization. Under suitable model specifications, the dual problems correspond to covariate balancing, which we call automatic covariate balancing. Moreover, under the same specifications, outcome averages weighted by the estimated Riesz representer satisfy Neyman orthogonality even without estimating the regression function, a property we call automatic Neyman orthogonalization. This property not only reduces the estimation error of Neyman orthogonal scores but also clarifies a key distinction between debiased machine learning and targeted maximum likelihood estimation. Our framework can also be viewed as a generalization of density ratio fitting under Bregman divergences to Riesz representer estimation, and it applies beyond density ratio estimation. We provide convergence analyses for both reproducing kernel Hilbert space (RKHS) and neural network model classes. A Python package for generalized Riesz regression is available at https://github.com/MasaKat0/grr.
[LINK]http://arxiv.org/abs/2601.07752v2
[DATE]2026-01-16 01:55:37+08:00
[CATEGORIES]cs.LG
Relative Information Gain and Gaussian Process Regression
[AUTHORS]Hamish Flynn
[ABSTRACT]The sample complexity of estimating or maximising an unknown function in a reproducing kernel Hilbert space is known to be linked to both the effective dimension and the information gain associated with the kernel. While the information gain has an attractive information-theoretic interpretation, the effective dimension typically results in better rates. We introduce a new quantity called the relative information gain, which measures the sensitivity of the information gain with respect to the observation noise. We show that the relative information gain smoothly interpolates between the effective dimension and the information gain, and that the relative information gain has the same growth rate as the effective dimension. In the second half of the paper, we prove a new PAC-Bayesian excess risk bound for Gaussian process regression. The relative information gain arises naturally from the complexity term in this PAC-Bayesian bound. We prove bounds on the relative information gain that depend on the spectral properties of the kernel. When these upper bounds are combined with our excess risk bound, we obtain minimax-optimal rates of convergence.
[COMMENTS]30 pages, 1 figure
[LINK]http://arxiv.org/abs/2510.04277v2
[DATE]2026-01-16 01:55:28+08:00
[CATEGORIES]cs.LG
Data-Driven Dynamic Factor Modeling via Manifold Learning
[AUTHORS]Graeme Baker, Agostino Capponi, J. Antonio Sidaoui
[ABSTRACT]We introduce a data-driven dynamic factor framework for modeling the joint evolution of high-dimensional covariates and responses without parametric assumptions. Standard factor models applied to covariates alone often lose explanatory power for responses. Our approach uses anisotropic diffusion maps, a manifold learning technique, to learn low-dimensional embeddings that preserve both the intrinsic geometry of the covariates and the predictive relationship with responses. For time series arising from Langevin diffusions in Euclidean space, we show that the associated graph Laplacian converges to the generator of the underlying diffusion. We further establish a bound on the approximation error between the diffusion map coordinates and linear diffusion processes, and we show that ergodic averages in the embedding space converge under standard spectral assumptions. These results justify using Kalman filtering in diffusion-map coordinates for predicting joint covariate-response evolution. We apply this methodology to equity-portfolio stress testing using macroeconomic and financial variables from Federal Reserve supervisory scenarios, achieving mean absolute error improvements of up to 55% over classical scenario analysis and 39% over principal component analysis benchmarks.
[LINK]http://arxiv.org/abs/2506.19945v2
[DATE]2026-01-16 01:50:32+08:00
[CATEGORIES]cs.LG
Parametric RDT approach to computational gap of symmetric binary perceptron
[AUTHORS]Mihailo Stojnic
[ABSTRACT]We study potential presence of statistical-computational gaps (SCG) in symmetric binary perceptrons (SBP) via a parametric utilization of \emph{fully lifted random duality theory} (fl-RDT) [96]. A structural change from decreasingly to arbitrarily ordered $c$-sequence (a key fl-RDT parametric component) is observed on the second lifting level and associated with \emph{satisfiability} ($α_c$) – \emph{algorithmic} ($α_a$) constraints density threshold change thereby suggesting a potential existence of a nonzero computational gap $SCG=α_c-α_a$. The second level estimate is shown to match the theoretical $α_c$ whereas the $r\rightarrow \infty$ level one is proposed to correspond to $α_a$. For example, for the canonical SBP ($κ=1$ margin) we obtain $α_c\approx 1.8159$ on the second and $α_a\approx 1.6021$ (with converging tendency towards $\sim 1.59$ range) on the seventh level. Our propositions remarkably well concur with recent literature: (i) in [20] local entropy replica approach predicts $α_{LE}\approx 1.58$ as the onset of clustering defragmentation (presumed driving force behind locally improving algorithms failures); (ii) in $α\rightarrow 0$ regime we obtain on the third lifting level $κ\approx 1.2385\sqrt{\frac{α_a}{-\log\left ( α_a \right ) }}$ which qualitatively matches overlap gap property (OGP) based predictions of [43] and identically matches local entropy based predictions of [24]; (iii) $c$-sequence ordering change phenomenology mirrors the one observed in asymmetric binary perceptron (ABP) in [98] and the negative Hopfield model in [100]; and (iv) as in [98,100], we here design a CLuP based algorithm whose practical performance closely matches proposed theoretical predictions.
[LINK]http://arxiv.org/abs/2601.10628v1
[DATE]2026-01-16 01:48:58+08:00
[CATEGORIES]cs.LG
Differential Privacy as a Perk: Federated Learning over Multiple-Access Fading Channels with a Multi-Antenna Base Station
[AUTHORS]Hao Liang, Haifeng Wen, Kaishun Wu, Hong Xing
[ABSTRACT]Federated Learning (FL) is a distributed learning paradigm that preserves privacy by eliminating the need to exchange raw data during training. In its prototypical edge instantiation with underlying wireless transmissions enabled by analog over-the-air computing (AirComp), referred to as \emph{over-the-air FL (AirFL)}, the inherent channel noise plays a unique role of \emph{frenemy} in the sense that it degrades training due to noisy global aggregation while providing a natural source of randomness for privacy-preserving mechanisms, formally quantified by \emph{differential privacy (DP)}. It remains, nevertheless, challenging to effectively harness such channel impairments, as prior arts, under assumptions of either simple channel models or restricted types of loss functions, mostly considering (local) DP enhancement with a single-round or non-convergent bound on privacy loss. In this paper, we study AirFL over multiple-access fading channels with a multi-antenna base station (BS) subject to user-level DP requirements. Despite a recent study, which claimed in similar settings that artificial noise (AN) must be injected to ensure DP in general, we demonstrate, on the contrary, that DP can be gained as a \emph{perk} even \emph{without} employing any AN. Specifically, we derive a novel bound on DP that converges under general bounded-domain assumptions on model parameters, along with a convergence bound with general smooth and non-convex loss functions. Next, we optimize over receive beamforming and power allocations to characterize the optimal convergence-privacy trade-offs, which also reveal explicit conditions in which DP is achievable without compromising training. Finally, our theoretical findings are validated by extensive numerical results.
[COMMENTS]13 pages, 6 figures, submitted for possible publication
[LINK]http://arxiv.org/abs/2510.23463v3
[DATE]2026-01-16 01:38:48+08:00
[CATEGORIES]cs.LG
An Improved Privacy and Utility Analysis of Differentially Private SGD with Bounded Domain and Smooth Losses
[AUTHORS]Hao Liang, Wanrong Zhang, Xinlei He, Kaishun Wu, Hong Xing
[ABSTRACT]Differentially Private Stochastic Gradient Descent (DPSGD) is widely used to protect sensitive data during the training of machine learning models, but its privacy guarantee often comes at a large cost of model performance due to the lack of tight theoretical bounds quantifying privacy loss. While recent efforts have achieved more accurate privacy guarantees, they still impose some assumptions prohibited from practical applications, such as convexity and complex parameter requirements, and rarely investigate in-depth the impact of privacy mechanisms on the model’s utility. In this paper, we provide a rigorous privacy characterization for DPSGD with general L-smooth and non-convex loss functions, revealing converged privacy loss with iteration in bounded-domain cases. Specifically, we track the privacy loss over multiple iterations, leveraging the noisy smooth-reduction property, and further establish comprehensive convergence analysis in different scenarios. In particular, we show that for DPSGD with a bounded domain, (i) the privacy loss can still converge without the convexity assumption, (ii) a smaller bounded diameter can improve both privacy and utility simultaneously under certain conditions, and (iii) the attainable big-O order of the privacy utility trade-off for DPSGD with gradient clipping (DPSGD-GC) and for DPSGD-GC with bounded domain (DPSGD-DC) and mu-strongly convex population risk function, respectively. Experiments via membership inference attack (MIA) in a practical setting validate insights gained from the theoretical results.
[COMMENTS]19 pages, 5 figures, accepted by AAAI 2026
[LINK]http://arxiv.org/abs/2502.17772v4
[DATE]2026-01-16 01:33:50+08:00
[CATEGORIES]cs.LG
ProbFM: Probabilistic Time Series Foundation Model with Uncertainty Decomposition
[AUTHORS]Arundeep Chinta, Lucas Vinh Tran, Jay Katukuri
[ABSTRACT]Time Series Foundation Models (TSFMs) have emerged as a promising approach for zero-shot financial forecasting, demonstrating strong transferability and data efficiency gains. However, their adoption in financial applications is hindered by fundamental limitations in uncertainty quantification: current approaches either rely on restrictive distributional assumptions, conflate different sources of uncertainty, or lack principled calibration mechanisms. While recent TSFMs employ sophisticated techniques such as mixture models, Student’s t-distributions, or conformal prediction, they fail to address the core challenge of providing theoretically-grounded uncertainty decomposition. For the very first time, we present a novel transformer-based probabilistic framework, ProbFM (probabilistic foundation model), that leverages Deep Evidential Regression (DER) to provide principled uncertainty quantification with explicit epistemic-aleatoric decomposition. Unlike existing approaches that pre-specify distributional forms or require sampling-based inference, ProbFM learns optimal uncertainty representations through higher-order evidence learning while maintaining single-pass computational efficiency. To rigorously evaluate the core DER uncertainty quantification approach independent of architectural complexity, we conduct an extensive controlled comparison study using a consistent LSTM architecture across five probabilistic methods: DER, Gaussian NLL, Student’s-t NLL, Quantile Loss, and Conformal Prediction. Evaluation on cryptocurrency return forecasting demonstrates that DER maintains competitive forecasting accuracy while providing explicit epistemic-aleatoric uncertainty decomposition. This work establishes both an extensible framework for principled uncertainty quantification in foundation models and empirical evidence for DER’s effectiveness in financial applications.
[COMMENTS]Accepted for oral presentation at the AI Meets Quantitative Finance Workshop at ICAIF 2025. An enhanced version was accepted for oral presentation at the AI for Time Series Analysis Workshop at AAAI 2026
[LINK]http://arxiv.org/abs/2601.10591v1
[DATE]2026-01-16 01:02:06+08:00
[CATEGORIES]cs.LG
SSFL: Discovering Sparse Unified Subnetworks at Initialization for Efficient Federated Learning
[AUTHORS]Riyasat Ohib, Bishal Thapaliya, Gintare Karolina Dziugaite, Jingyu Liu, Vince Calhoun, Sergey Plis
[ABSTRACT]In this work, we propose Salient Sparse Federated Learning (SSFL), a streamlined approach for sparse federated learning with efficient communication. SSFL identifies a sparse subnetwork prior to training, leveraging parameter saliency scores computed separately on local client data in non-IID scenarios, and then aggregated, to determine a global mask. Only the sparse model weights are trained and communicated each round between the clients and the server. On standard benchmarks including CIFAR-10, CIFAR-100, and Tiny-ImageNet, SSFL consistently improves the accuracy sparsity trade off, achieving more than 20\% relative error reduction on CIFAR-10 compared to the strongest sparse baseline, while reducing communication costs by $2 \times$ relative to dense FL. Finally, in a real-world federated learning deployment, SSFL delivers over $2.3 \times$ faster communication time, underscoring its practical efficiency.
[COMMENTS]Published in Transactions on Machine Learning Research (TMLR), 2026
[LINK]http://arxiv.org/abs/2405.09037v2
[DATE]2026-01-16 01:01:07+08:00
[CATEGORIES]cs.LG
Searching for Quantum Effects in the Brain: A Bell-Type Test for Nonclassical Latent Representations in Autoencoders
[AUTHORS]I. K. Kominis, C. Xie, S. Li, M. Skotiniotis, G. P. Tsironis
[ABSTRACT]Whether neural information processing is entirely classical or involves quantum-mechanical elements remains an open question. Here we propose a model-agnostic, information-theoretic test of nonclassicality that bypasses microscopic assumptions and instead probes the structure of neural representations themselves. Using autoencoders as a transparent model system, we introduce a Bell-type consistency test in latent space, and ask whether decoding statistics obtained under multiple readout contexts can be jointly explained by a single positive latent-variable distribution. By shifting the search for quantum-like signatures in neural systems from microscopic dynamics to experimentally testable constraints on information processing, this work opens a new route for probing the fundamental physics of neural computation.
[COMMENTS]6 pages, 2 figures
[LINK]http://arxiv.org/abs/2601.10588v1
[DATE]2026-01-16 00:59:40+08:00
[CATEGORIES]cs.LG
Image Complexity-Aware Adaptive Retrieval for Efficient Vision-Language Models
[AUTHORS]Mikel Williams-Lekuona, Georgina Cosma
[ABSTRACT]Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training that produces compatible embeddings from both the early-exit and full-depth paths. This maintains compatibility between image representations and text embeddings in the same semantic space, whether an image exits early or processes fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation coefficient of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding while maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
[COMMENTS]Camera-ready version for ECIR 2026
[LINK]http://arxiv.org/abs/2512.15372v2
[DATE]2026-01-16 00:58:39+08:00
[CATEGORIES]cs.LG
TPV: Parameter Perturbations Through the Lens of Test Prediction Variance
[AUTHORS]Devansh Arpit
[ABSTRACT]We identify test prediction variance (TPV) – the first-order sensitivity of model outputs to parameter perturbations around a trained solution – as a unifying quantity that links several classical observations about generalization in deep networks. TPV is a fully label-free object whose trace form separates the geometry of the trained model from the specific perturbation mechanism, allowing a broad family of parameter perturbations like SGD noise, label noise, finite-precision noise, and other post-training perturbations to be analyzed under a single framework. Theoretically, we show that TPV estimated on the training set converges to its test-set value in the overparameterized limit, providing the first result that prediction variance under local parameter perturbations can be inferred from training inputs alone. Empirically, TPV exhibits a striking stability across datasets and architectures – including extremely narrow networks – and correlates well with clean test loss. Finally, we demonstrate that modeling pruning as a TPV perturbation yields a simple label-free importance measure that performs competitively with state-of-the-art pruning methods, illustrating the practical utility of TPV. Code available at github.com/devansharpit/TPV.
[LINK]http://arxiv.org/abs/2512.11089v2
[DATE]2026-01-16 00:56:11+08:00
[CATEGORIES]cs.LG
Combinatorial Optimization Augmented Machine Learning
[AUTHORS]Maximilian Schiffer, Heiko Hoppe, Yue Su, Louis Bouvier, Axel Parmentier
[ABSTRACT]Combinatorial optimization augmented machine learning (COAML) has recently emerged as a powerful paradigm for integrating predictive models with combinatorial decision-making. By embedding combinatorial optimization oracles into learning pipelines, COAML enables the construction of policies that are both data-driven and feasibility-preserving, bridging the traditions of machine learning, operations research, and stochastic optimization. This paper provides a comprehensive overview of the state of the art in COAML. We introduce a unifying framework for COAML pipelines, describe their methodological building blocks, and formalize their connection to empirical cost minimization. We then develop a taxonomy of problem settings based on the form of uncertainty and decision structure. Using this taxonomy, we review algorithmic approaches for static and dynamic problems, survey applications across domains such as scheduling, vehicle routing, stochastic programming, and reinforcement learning, and synthesize methodological contributions in terms of empirical cost minimization, imitation learning, and reinforcement learning. Finally, we identify key research frontiers. This survey aims to serve both as a tutorial introduction to the field and as a roadmap for future research at the interface of combinatorial optimization and machine learning.
[LINK]http://arxiv.org/abs/2601.10583v1
[DATE]2026-01-16 00:55:19+08:00
[CATEGORIES]cs.LG
MixMin: Finding Data Mixtures via Convex Minimization
[AUTHORS]Anvith Thudi, Evianne Rovers, Yangjun Ruan, Tristan Thrush, Chris J. Maddison
[ABSTRACT]Modern machine learning pipelines are increasingly combining and mixing data from diverse and disparate sources, e.g., pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-410M model trained on 8.2B tokens, resulting between 1-5% relative improvement to negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures for smaller models improved training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements to average precision scores of 0.03-0.15.
[COMMENTS]Proceedings of the 42nd International Conference on Machine Learning
[LINK]http://arxiv.org/abs/2502.10510v3
[DATE]2026-01-16 00:49:37+08:00
[CATEGORIES]cs.LG
Random Walk Learning and the Pac-Man Attack
[AUTHORS]Xingran Chen, Parimal Parag, Rohit Bhagat, Zonghong Liu, Salim El Rouayheb
[ABSTRACT]Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an adversarial threat that we term the “Pac-Man” attack, in which a malicious node probabilistically terminates any RW that visits it. This stealthy behavior gradually eliminates active RWs from the network, effectively halting the learning process without triggering failure alarms. To counter this threat, we propose the Average Crossing (AC) algorithm–a fully decentralized mechanism for duplicating RWs to prevent RW extinction in the presence of Pac-Man. Our theoretical analysis establishes that (i) the RW population remains almost surely bounded under AC and (ii) RW-based stochastic gradient descent remains convergent under AC, even in the presence of Pac-Man, with a quantifiable deviation from the true optimum. Our extensive empirical results on both synthetic and real-world datasets corroborate our theoretical findings. Furthermore, they uncover a phase transition in the extinction probability as a function of the duplication threshold. We offer theoretical insights by analyzing a simplified variant of the AC, which sheds light on the observed phase transition.
[COMMENTS]The updated manuscript represents an incomplete version of the work. A substantially updated version will be prepared before further dissemination
[LINK]http://arxiv.org/abs/2508.05663v3
[DATE]2026-01-16 00:40:18+08:00
[CATEGORIES]cs.LG
Process-Guided Concept Bottleneck Model
[AUTHORS]Reza M. Asiyabi, SEOSAW Partnership, Steven Hancock, Casey Ryan
[ABSTRACT]Concept Bottleneck Models (CBMs) improve the explainability of black-box Deep Learning (DL) by introducing intermediate semantic concepts. However, standard CBMs often overlook domain-specific relationships and causal mechanisms, and their dependence on complete concept labels limits applicability in scientific domains where supervision is sparse but processes are well defined. To address this, we propose the Process-Guided Concept Bottleneck Model (PG-CBM), an extension of CBMs which constrains learning to follow domain-defined causal mechanisms through biophysically meaningful intermediate concepts. Using above ground biomass density estimation from Earth Observation data as a case study, we show that PG-CBM reduces error and bias compared to multiple benchmarks, whilst leveraging multi-source heterogeneous training data and producing interpretable intermediate outputs. Beyond improved accuracy, PG-CBM enhances transparency, enables detection of spurious learning, and provides scientific insights, representing a step toward more trustworthy AI systems in scientific applications.
[COMMENTS]13 pages with 7 figures and 1 table, Supplementary Materials 10 pages with 3 figures
[LINK]http://arxiv.org/abs/2601.10562v1
[DATE]2026-01-16 00:25:55+08:00
[CATEGORIES]cs.LG
Effects of Structural Allocation of Geometric Task Diversity in Linear Meta-Learning Models
[AUTHORS]Saptati Datta, Nicolas W. Hengartner, Yulia Pimonova, Natalie E. Klein, Nicholas Lubbers
[ABSTRACT]Meta-learning aims to leverage information across related tasks to improve prediction on unlabeled data for new tasks when only a small number of labeled observations are available (“few-shot” learning). Increased task diversity is often believed to enhance meta-learning by providing richer information across tasks. However, recent work by Kumar et al. (2022) shows that increasing task diversity, quantified through the overall geometric spread of task representations, can in fact degrade meta-learning prediction performance across a range of models and datasets. In this work, we build on this observation by showing that meta-learning performance is affected not only by the overall geometric variability of task parameters, but also by how this variability is allocated relative to an underlying low-dimensional structure. Similar to Pimonova et al. (2025), we decompose task-specific regression effects into a structurally informative component and an orthogonal, non-informative component. We show theoretically and through simulation that meta-learning prediction degrades when a larger fraction of between-task variability lies in orthogonal, non-informative directions, even when the overall geometric variability of tasks is held fixed.
[LINK]http://arxiv.org/abs/2509.18349v3
[DATE]2026-01-16 00:17:24+08:00
[CATEGORIES]cs.LG
Sample Complexity of Composite Quantum Hypothesis Testing
[AUTHORS]Jacob Paul Simpson, Efstratios Palias, Sharu Theresa Jose
[ABSTRACT]This paper investigates symmetric composite binary quantum hypothesis testing (QHT), where the goal is to determine which of two uncertainty sets contains an unknown quantum state. While asymptotic error exponents for this problem are well-studied, the finite-sample regime remains poorly understood. We bridge this gap by characterizing the sample complexity – the minimum number of state copies required to achieve a target error level. Specifically, we derive lower bounds that generalize the sample complexity of simple QHT and introduce new upper bounds for various uncertainty sets, including of both finite and infinite cardinalities. Notably, our upper and lower bounds match up to universal constants, providing a tight characterization of the sample complexity. Finally, we extend our analysis to the differentially private setting, establishing the sample complexity for privacy-preserving composite QHT.
[COMMENTS]Under review
[LINK]http://arxiv.org/abs/2601.08588v2
[DATE]2026-01-16 00:07:59+08:00
[CATEGORIES]cs.LG
Machine Unlearning Fails to Remove Data Poisoning Attacks
[AUTHORS]Martin Pawelczyk, Jimmy Z. Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, Seth Neel
[ABSTRACT]We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of poisoned data. We experimentally demonstrate that, while existing unlearning methods have been demonstrated to be effective in a number of settings, they fail to remove the effects of data poisoning across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly-introduced Gaussian poisoning attack) and models (image classifiers and LLMs); even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, are required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned data without having to retrain, our work suggests that these methods are not yet “ready for prime time,” and currently provide limited benefit over retraining.
[COMMENTS]Published at ICLR 2025, Made author ordering consistent with ICLR‘25 submission
[LINK]http://arxiv.org/abs/2406.17216v3
[DATE]2026-01-16 00:07:52+08:00
[CATEGORIES]cs.LG
Mixtures of Transparent Local Models
[AUTHORS]Niffa Cheick Oumar Diaby, Thierry Duchesne, Mario Marchand
[ABSTRACT]The predominance of machine learning models in many spheres of human activity has led to a growing demand for their transparency. The transparency of models makes it possible to discern some factors, such as security or non-discrimination. In this paper, we propose a mixture of transparent local models as an alternative solution for designing interpretable (or transparent) models. Our approach is designed for the situations where a simple and transparent function is suitable for modeling the label of instances in some localities/regions of the input space, but may change abruptly as we move from one locality to another. Consequently, the proposed algorithm is to learn both the transparent labeling function and the locality of the input space where the labeling function achieves a small risk in its assigned locality. By using a new multi-predictor (and multi-locality) loss function, we established rigorous PAC-Bayesian risk bounds for the case of binary linear classification problem and that of linear regression. In both cases, synthetic data sets were used to illustrate how the learning algorithms work. The results obtained from real data sets highlight the competitiveness of our approach compared to other existing methods as well as certain opaque models. Keywords: PAC-Bayes, risk bounds, local models, transparent models, mixtures of local transparent models.
[COMMENTS]44 pages, 32 figues
[LINK]http://arxiv.org/abs/2601.10541v1
[DATE]2026-01-16 00:05:20+08:00
[CATEGORIES]cs.LG
A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
[AUTHORS]Xingjun Ma, Yixu Wang, Hengyuan Xu, Yutao Wu, Yifan Ding, Yunhan Zhao, Zilong Wang, Jiabin Hua, Ming Wen, Jianan Liu, Ranjie Duan, Yifeng Gao, Yingshui Tan, Yunhao Chen, Hui Xue, Xin Wang, Wei Cheng, Jingjing Chen, Zuxuan Wu, Bo Li, Yu-Gang Jiang
[ABSTRACT]The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has produced substantial gains in reasoning, perception, and generative capability across language and vision. However, whether these advances yield commensurate improvements in safety remains unclear, in part due to fragmented evaluation practices limited to single modalities or threat models. In this report, we present an integrated safety evaluation of 7 frontier models: GPT-5.2, Gemini 3 Pro, Qwen3-VL, Doubao 1.8, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5. We evaluate each model across language, vision-language, and image generation settings using a unified protocol that integrates benchmark evaluation, adversarial evaluation, multilingual evaluation, and compliance evaluation. Aggregating our evaluations into safety leaderboards and model safety profiles across multiple evaluation modes reveals a sharply heterogeneous safety landscape. While GPT-5.2 demonstrates consistently strong and balanced safety performance across evaluations, other models exhibit pronounced trade-offs among benchmark safety, adversarial alignment, multilingual generalization, and regulatory compliance. Both language and vision-language modalities show significant vulnerability under adversarial evaluation, with all models degrading substantially despite strong results on standard benchmarks. Text-to-image models achieve relatively stronger alignment in regulated visual risk categories, yet remain brittle under adversarial or semantically ambiguous prompts. Overall, these results show that safety in frontier models is inherently multidimensional–shaped by modality, language, and evaluation scheme, underscoring the need for standardized safety evaluations to accurately assess real-world risk and guide responsible model development and deployment.
[COMMENTS]42 pages, 24 figures
[LINK]http://arxiv.org/abs/2601.10527v1
[DATE]2026-01-15 23:52:52+08:00
[CATEGORIES]cs.CL cs.LG
GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt
[AUTHORS]Zhenhe Li, Can Lin, Ling Zheng, Wen-Da Wei, Junli Liang, Qi Song
[ABSTRACT]Multi-turn instruction following is essential for building intelligent conversational systems that can consistently adhere to instructions across dialogue turns. However, existing approaches to enhancing multi-turn instruction following primarily rely on collecting or generating large-scale multi-turn dialogue datasets to fine-tune large language models (LLMs), which treat each response generation as an isolated task and fail to explicitly incorporate multi-turn instruction following into the optimization objectives. As a result, instruction-tuned LLMs often struggle with complex long-distance constraints. In multi-turn dialogues, relational constraints across turns can be naturally modeled as labeled directed edges, making graph structures particularly suitable for modeling multi-turn instruction following. Despite this potential, leveraging graph structures to enhance the multi-turn instruction following capabilities of LLMs remains unexplored. To bridge this gap, we propose GraphIF, a plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs. GraphIF comprises three key components: (1) an agent-based relation extraction module that captures inter-turn semantic relations via action-triggered mechanisms to construct structured graphs; (2) a relation graph prompt generation module that converts structured graph information into natural language prompts; and (3) a response rewriting module that refines initial LLM outputs using the generated graph prompts. Extensive experiments on two long multi-turn dialogue datasets demonstrate that GraphIF can be seamlessly integrated into instruction-tuned LLMs and leads to significant improvements across all four multi-turn instruction-following evaluation metrics.
[LINK]http://arxiv.org/abs/2511.10051v2
[DATE]2026-01-15 23:45:38+08:00
[CATEGORIES]cs.CL
DR-Arena: an Automated Evaluation Framework for Deep Research Agents
[AUTHORS]Yiwen Gao, Ruochen Zhao, Yang Deng, Wenxuan Zhang
[ABSTRACT]As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents the state-of-the-art alignment with human preferences without any manual efforts, validating DR-Arena as a reliable alternative for costly human adjudication.
[COMMENTS]22 pages, 8 figures
[LINK]http://arxiv.org/abs/2601.10504v1
[DATE]2026-01-15 23:28:21+08:00
[CATEGORIES]cs.CL
Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models
[AUTHORS]Abhinaba Basu, Pavan Chakraborty
[ABSTRACT]A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences – no adversarial prompting required.
We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes.
We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases – a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening.
The implication is methodological: bias scores from fixed-condition tests may not generalize.This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, “Under what conditions does bias appear?” rather than “Is this model biased?” We release our benchmark, code, and results.
[LINK]http://arxiv.org/abs/2601.10460v1
[DATE]2026-01-15 22:50:49+08:00
[CATEGORIES]cs.CL cs.LG
Parallel Test-Time Scaling for Latent Reasoning Models
[AUTHORS]Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li
[ABSTRACT]Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code and checkpoints released at https://github.com/ModalityDance/LatentTTS
[LINK]http://arxiv.org/abs/2510.07745v3
[DATE]2026-01-15 22:50:27+08:00
[CATEGORIES]cs.CL cs.LG
SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability
[AUTHORS]Ruochen Li, Kun Yuan, Yufei Xia, Yue Zhou, Qingyu Lu, Weihang Li, Youxiang Zhu, Nassir Navab
[ABSTRACT]Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.
[LINK]http://arxiv.org/abs/2601.10455v1
[DATE]2026-01-15 22:47:26+08:00
[CATEGORIES]cs.CL
TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction
[AUTHORS]Mihai Dan Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran
[ABSTRACT]Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework. Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.
[LINK]http://arxiv.org/abs/2601.10410v1
[DATE]2026-01-15 22:02:00+08:00
[CATEGORIES]cs.CL
INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects
[AUTHORS]Tarun Sharma, Manikandan Ravikiran, Sourava Kumar Behera, Pramit Bhattacharya, Arnab Bhattacharya, Rohit Saluja
[ABSTRACT]Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist which contain standard Hindi and Odia languages, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task. While fine-tuned transformer based models pretrained on Indian languages substantially improve performance e.g., improving F1 from 19.6\% to 89.8\% on dialect classification. For dialect to language translation, we find that hybrid AI model achieves highest BLEU score of 61.32 compared to the baseline score of 23.36. Interestingly, due to complexity in generating dialect sentences, we observe that for language to dialect translation the ``rule-based followed by AI” approach achieves best BLEU score of 48.44 compared to the baseline score of 27.59. INDIC-DIALECT thus is a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.
[LINK]http://arxiv.org/abs/2601.10388v1
[DATE]2026-01-15 21:40:27+08:00
[CATEGORIES]cs.CL
The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
[AUTHORS]Christina Lu, Jack Gallagher, Jonathan Michala, Kyle Fish, Jack Lindsey
[ABSTRACT]Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an “Assistant Axis,” which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model’s tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts “persona drift,” a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model’s processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios – and also in the face of adversarial persona-based jailbreaks. Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
[LINK]http://arxiv.org/abs/2601.10387v1
[DATE]2026-01-15 21:40:06+08:00
[CATEGORIES]cs.CL
CoMAT: Chain of Mathematically Annotated Thought Improves Mathematical Reasoning
[AUTHORS]Joshua Ong Jun Leang, Aryo Pradipta Gema, Shay B. Cohen
[COMMENTS]9 pages, 12 figures
[LINK]http://arxiv.org/abs/2410.10336v2
[DATE]2026-01-15 21:23:01+08:00
[CATEGORIES]cs.CL cs.LG
Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
[AUTHORS]Zhihao Xu, Rumei Li, Jiahuan Li, Rongxiang Weng, Jingang Wang, Xunliang Cai, Xiting Wang
[ABSTRACT]Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieve a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ - bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.
[LINK]http://arxiv.org/abs/2601.10355v1
[DATE]2026-01-15 20:58:46+08:00
[CATEGORIES]cs.CL
Bias Dynamics in BabyLMs: Towards a Compute-Efficient Sandbox for Democratising Pre-Training Debiasing
[AUTHORS]Filip Trhlik, Andrew Caines, Paula Buttery
[ABSTRACT]Pre-trained language models (LMs) have, over the last few years, grown substantially in both societal adoption and training costs. This rapid growth in size has constrained progress in understanding and mitigating their biases. Since re-training LMs is prohibitively expensive, most debiasing work has focused on post-hoc or masking-based strategies, which often fail to address the underlying causes of bias. In this work, we seek to democratise pre-model debiasing research by using low-cost proxy models. Specifically, we investigate BabyLMs, compact BERT-like models trained on small and mutable corpora that can approximate bias acquisition and learning dynamics of larger models. We show that BabyLMs display closely aligned patterns of intrinsic bias formation and performance development compared to standard BERT models, despite their drastically reduced size. Furthermore, correlations between BabyLMs and BERT hold across multiple intra-model and post-model debiasing methods. Leveraging these similarities, we conduct pre-model debiasing experiments with BabyLMs, replicating prior findings and presenting new insights regarding the influence of gender imbalance and toxicity on bias formation. Our results demonstrate that BabyLMs can serve as an effective sandbox for large-scale LMs, reducing pre-training costs from over 500 GPU-hours to under 30 GPU-hours. This provides a way to democratise pre-model debiasing research and enables faster, more accessible exploration of methods for building fairer LMs.
[COMMENTS]21 pages, 18 figures
[LINK]http://arxiv.org/abs/2601.09421v2
[DATE]2026-01-15 20:25:45+08:00
[CATEGORIES]cs.CL
ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
[AUTHORS]Xueyun Tian, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang, Huawei Shen
[ABSTRACT]Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate ROMA achieves state-of-the-art performance on proactive tasks while competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
[COMMENTS]Our project page is available at https://eureka-maggie.github.io/ROMA_show
[LINK]http://arxiv.org/abs/2601.10323v1
[DATE]2026-01-15 20:09:04+08:00
[CATEGORIES]cs.CL
An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit
[AUTHORS]Warren Jouanneau, Emma Jouffroy, Marc Palyart
[ABSTRACT]Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture, that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.
[LINK]http://arxiv.org/abs/2601.10321v1
[DATE]2026-01-15 20:01:27+08:00
[CATEGORIES]cs.CL cs.LG
Lil: Less is Less When Applying Post-Training Sparse-Attention Algorithms in Long-Decode Stage
[AUTHORS]Junhao Hu, Fangze Li, Mingtao Xu, Feifan Meng, Shiju Zhao, Tiancheng Hu, Ting Peng, Anmin Liu, Wenrui Huang, Chenxu Liu, Ziyue Hua, Tao Xie
[ABSTRACT]Large language models (LLMs) demonstrate strong capabilities across a wide range of complex tasks and are increasingly deployed at scale, placing significant demands on inference efficiency. Prior work typically decomposes inference into prefill and decode stages, with the decode stage dominating total latency. To reduce time and memory complexity in the decode stage, a line of work introduces sparse-attention algorithms. In this paper, we show, both empirically and theoretically, that sparse attention can paradoxically increase end-to-end complexity: information loss often induces significantly longer sequences, a phenomenon we term “Less is Less” (Lil). To mitigate the Lil problem, we propose an early-stopping algorithm that detects the threshold where information loss exceeds information gain during sparse decoding. Our early-stopping algorithm reduces token consumption by up to 90% with a marginal accuracy degradation of less than 2% across reasoning-intensive benchmarks.
[LINK]http://arxiv.org/abs/2601.03043v2
[DATE]2026-01-15 19:59:09+08:00
[CATEGORIES]cs.CL cs.LG
Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis
[AUTHORS]Songsong Tian, Kongsheng Zhuo, Zhendong Wang, Rong Shen, Shengtao Zhang, Yong Wu
[ABSTRACT]In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy-leveraging Abstract Syntax Tree analysis and dense result matching-and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesse SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability. The source code and benchmark are available anonymously at: https://github.com/TianSongS/BAR-SQL.
[LINK]http://arxiv.org/abs/2601.10318v1
[DATE]2026-01-15 19:55:01+08:00
[CATEGORIES]cs.CL
ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios
[AUTHORS]Aniket Deroy
[ABSTRACT]As large-scale speech-to-speech models achieve high fidelity, the distinction between synthetic voices in structured environments becomes a vital area of study. This paper introduces Advosynth-500, a specialized dataset comprising 100 synthetic speech files featuring 10 unique advocate identities. Using the Speech Llama Omni model, we simulate five distinct advocate pairs engaged in courtroom arguments. We define specific vocal characteristics for each advocate and present a speaker identification challenge to evaluate the ability of modern systems to map audio files to their respective synthetic origins.
Dataset is available at this link-https: //github.com/naturenurtureelite/ADVOSYNTH-500.
[LINK]http://arxiv.org/abs/2601.10315v1
[DATE]2026-01-15 19:46:47+08:00
[CATEGORIES]cs.CL
Multilinguality as Sense Adaptation
[AUTHORS]Jan Christian Blaise Cruz, David Ifeoluwa Adelani, Alham Fikri Aji
[ABSTRACT]We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.
[COMMENTS]Code available at https://github.com/jcblaisecruz02/sensia
[LINK]http://arxiv.org/abs/2601.10310v1
[DATE]2026-01-15 19:44:01+08:00
[CATEGORIES]cs.CL
The Straight and Narrow: Do LLMs Possess an Internal Moral Path?
[AUTHORS]Luoming Hu, Jingjie Zeng, Liang Yang, Hongfei Lin
[ABSTRACT]Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.
[LINK]http://arxiv.org/abs/2601.10307v1
[DATE]2026-01-15 19:42:00+08:00
[CATEGORIES]cs.CL
Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning
[AUTHORS]Xin Guan, Zijian Li, Shen Huang, Pengjun Xie, Jingren Zhou, Jiuxin Cao
[ABSTRACT]While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by sparsity of outcome rewards. This limitation fails to penalize ungrounded “lucky guesses,” leaving the critical process of needle-in-a-haystack evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling that precise evidence extraction is the decisive bottleneck for long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward, providing dense process supervision to explicitly improve evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism. This mechanism iteratively refines the reward model using outcome-consistent rollouts, sharpening its discriminative capability to ensure precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly enhances long-context reasoning performance compared to SOTA baselines.
[LINK]http://arxiv.org/abs/2601.10306v1
[DATE]2026-01-15 19:40:57+08:00
[CATEGORIES]cs.CL
LittleBit: Ultra Low-Bit Quantization via Latent Factorization
[AUTHORS]Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim
[ABSTRACT]Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit’s superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method’s 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off–unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level–and makes powerful LLMs practical for resource-constrained environments. Our code can be found at https://github.com/SamsungLabs/LittleBit.
[COMMENTS]Accepted to NeurIPS 2025. Banseok Lee and Dongkyu Kim contributed equally
[LINK]http://arxiv.org/abs/2506.13771v4
[DATE]2026-01-15 18:46:32+08:00
[CATEGORIES]cs.LG cs.CL
TranslateGemma Technical Report
[AUTHORS]Mara Finkelstein, Isaac Caswell, Tobias Domhan, Jan-Thorsten Peter, Juraj Juraska, Parker Riley, Daniel Deutsch, Cole Dilanni, Colin Cherry, Eleftheria Briakou, Elizabeth Nielsen, Jiaming Luo, Kat Black, Ryan Mullins, Sweta Agrawal, Wenda Xu, Erin Kats, Stephane Jaskiewicz, Markus Freitag, David Vilar
[ABSTRACT]We present TranslateGemma, a suite of open machine translation models based on the Gemma 3 foundation models. To enhance the inherent multilingual capabilities of Gemma 3 for the translation task, we employ a two-stage fine-tuning process. First, supervised fine-tuning is performed using a rich mixture of high-quality large-scale synthetic parallel data generated via state-of-the-art models and human-translated parallel data. This is followed by a reinforcement learning phase, where we optimize translation quality using an ensemble of reward models, including MetricX-QE and AutoMQM, targeting translation quality. We demonstrate the effectiveness of TranslateGemma with human evaluation on the WMT25 test set across 10 language pairs and with automatic evaluation on the WMT24++ benchmark across 55 language pairs. Automatic metrics show consistent and substantial gains over the baseline Gemma 3 models across all sizes. Notably, smaller TranslateGemma models often achieve performance comparable to larger baseline models, offering improved efficiency. We also show that TranslateGemma models retain strong multimodal capabilities, with enhanced performance on the Vistra image translation benchmark. The release of the open TranslateGemma models aims to provide the research community with powerful and adaptable tools for machine translation.
[LINK]http://arxiv.org/abs/2601.09012v2
[DATE]2026-01-15 18:43:25+08:00
[CATEGORIES]cs.CL
Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel
[AUTHORS]Hiroaki Yamagiwa, Yusuke Takase, Hidetoshi Shimodaira
[ABSTRACT]Understanding relationships between attention heads is essential for interpreting the internal structure of Transformers, yet existing metrics do not capture this structure well. We focus on the subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using the Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. We further introduce a framework to quantify the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. As an application, we analyze a directed graph constructed from PK and show that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head.
[LINK]http://arxiv.org/abs/2601.10266v1
[DATE]2026-01-15 18:37:55+08:00
[CATEGORIES]cs.CL
Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs
[AUTHORS]Nan Li, Bo Kang, Tijl De Bie
[ABSTRACT]When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study \emph{what} changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at https://anonymous.4open.science/r/CrossCulturalMoralJudgement.
[LINK]http://arxiv.org/abs/2601.10257v1
[DATE]2026-01-15 18:26:29+08:00
[CATEGORIES]cs.CL
Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?
[AUTHORS]Guanxu Chen, Dongrui Liu, Jing Shao
[ABSTRACT]Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)–architectures that increase computational depth by iterating shared layers–can bridge this gap by utilizing their iterative nature as a form of introspection. Our experiments reveal that while increasing loop iterations narrows the gap, it is partly driven by a degradation of their internal knowledge carried by representations. Moreover, another empirical analysis suggests that current LTs’ ability to perceive representations does not improve across loops; it is only present in the final loop. These results suggest that while LTs offer a promising direction for scaling computational depth, they have yet to achieve the introspection required to truly link representation space and natural language.
[COMMENTS]9 pages,6 figures
[LINK]http://arxiv.org/abs/2601.10242v1
[DATE]2026-01-15 18:01:21+08:00
[CATEGORIES]cs.CL
Are Language Models Efficient Reasoners? A Perspective from Logic Programming
[AUTHORS]Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf
[ABSTRACT]Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language – as generated by an LM – with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with various number of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions – even with minimal, domain-consistent distractions – and the proofs they generate frequently exhibit detours through irrelevant inferences.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.25626v2
[DATE]2026-01-15 17:56:12+08:00
[CATEGORIES]cs.CL cs.LG
GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients
[AUTHORS]Kentaro Kazama, Daiki Shirafuji, Tatsuhiko Saito
[ABSTRACT]Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in a latent space behaves like a natural-gradient adjustment in the original hidden-state space. It ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series. We measure via answer accuracy and overall reasoning performance. GeoSteer improved the exact match accuracy by up to 2.6 points. It also enhanced the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.
[COMMENTS]The Third workshop of NeusymBridge @AAAI 2026 (Bridging Neurons and Symbols for NLP and Knowledge Graph Reasoning)
[LINK]http://arxiv.org/abs/2601.10229v1
[DATE]2026-01-15 17:44:07+08:00
[CATEGORIES]cs.CL
COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation
[AUTHORS]Uliana Parkina, Maxim Rakhuba
[ABSTRACT]Recent studies suggest that context-aware low-rank approximation is a useful tool for compression and fine-tuning of modern large-scale neural networks. In this type of approximation, a norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas involving explicit Gram matrix computation and their subsequent inversion. We demonstrate that this can degrade the approximation quality or cause numerically singular matrices.
To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle possible challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.
[LINK]http://arxiv.org/abs/2507.07580v2
[DATE]2026-01-15 17:44:00+08:00
[CATEGORIES]cs.LG cs.CL
Multi-Personality Generation of LLMs at Decoding-time
[AUTHORS]Rongxin Chen, Yunfan Li, Yige Yuan, Bingbing Xu, Huawei Shen
[ABSTRACT]Multi-personality generation for LLMs, enabling simultaneous embodiment of multiple personalization attributes, is a fundamental challenge. Existing retraining-based approaches are costly and poorly scalable, while decoding-time methods often rely on external models or heuristics, limiting flexibility and robustness. In this paper, we propose a novel Multi-Personality Generation (MPG) framework under the decoding-time combination paradigm. It flexibly controls multi-personality without relying on scarce multi-dimensional models or extra training, leveraging implicit density ratios in single-dimensional models as a “free lunch” to reformulate the task as sampling from a target strategy aggregating these ratios. To implement MPG efficiently, we design Speculative Chunk-level based Rejection sampling (SCR), which generates responses in chunks and parallelly validates them via estimated thresholds within a sliding window. This significantly reduces computational overhead while maintaining high-quality generation. Experiments on MBTI personality and Role-Playing demonstrate the effectiveness of MPG, showing improvements up to 16%-18%. Code and data are available at https://github.com/Libra117/MPG .
[COMMENTS]Accepted by WSDM 2026
[LINK]http://arxiv.org/abs/2511.01891v4
[DATE]2026-01-15 17:29:50+08:00
[CATEGORIES]cs.CL
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction
[AUTHORS]Yijun Liu, Yixuan Wang, Yuzhuang Xu, Shiyu Ji, Yang Xu, Qingfu Zhu, Wanxiang Che
[ABSTRACT]Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. The size of KV cache grows linearly as the length of the sequence extends, which seriously affects memory usage and decoding efficiency. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. Although this scheme is simple to implement, it tends to overly focus on local information, potentially leading to the neglect or omission of crucial global information. To mitigate this issue, we propose Judge Q, a novel training method which incorporates a soft token list. This method only tunes the model’s embedding layer at a low training cost. By concatenating the soft token list at the end of the input sequence, we train these tokens’ attention map to the original input sequence to align with that of the actual decoded tokens. In this way, the queries corresponding to the soft tokens can effectively capture global information and better evaluate the importance of the keys and values within the KV cache, thus maintaining decoding quality when KV cache is evicted. Under the same eviction budget, our method exhibits less performance degradation compared to existing eviction approaches. We validate our approach through experiments conducted on models such as Llama-3.1-8B-Instruct and Mistral-7B-Instruct-v0.3, using benchmarks including LongBench, RULER, and Needle-in-a-Haystack. Results indicate an improvement of approximately 1 point on the LongBench and over 3 points on RULER. This proposed methodology can be seamlessly integrated into existing open-source models with minimal training overhead, thereby enhancing performance in KV cache eviction scenarios.
[COMMENTS]Accepted in AAAI 2026
[LINK]http://arxiv.org/abs/2509.10798v2
[DATE]2026-01-15 17:11:49+08:00
[CATEGORIES]cs.CL
One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?
[AUTHORS]Arya Shah, Himanshu beniwal, Mayank Singh
[ABSTRACT]Aligning multilingual assistants with culturally grounded user preferences is essential for serving India’s linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4\% on monolingual retrieval and 20.7\% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1\% Recall@1. For classification, LaBSE attains 75.3\% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work\footnote{Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.
[COMMENTS]12 pages, 4 figures, 10 tables
[LINK]http://arxiv.org/abs/2601.10205v1
[DATE]2026-01-15 17:10:14+08:00
[CATEGORIES]cs.CL
Disco-RAG: Discourse-Aware Retrieval-Augmented Generation
[AUTHORS]Dongqi Liu, Hang Ding, Qiming Feng, Jian Li, Xurong Xie, Zhucun Xue, Chengjie Wang, Jiangning Zhang, Yabiao Wang
[ABSTRACT]Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.
[LINK]http://arxiv.org/abs/2601.04377v3
[DATE]2026-01-15 17:06:14+08:00
[CATEGORIES]cs.CL cs.LG
PRL: Process Reward Learning Improves LLMs’ Reasoning Ability and Broadens the Reasoning Boundary
[AUTHORS]Jiarui Yao, Ruida Wang, Tong Zhang
[ABSTRACT]Improving the reasoning abilities of Large Language Models (LLMs) has been a continuous topic recently. But most relevant works are based on outcome rewards at the trajectory level, missing fine-grained supervision during the reasoning process. Other existing training frameworks that try to combine process signals together to optimize LLMs also rely heavily on tedious additional steps like MCTS, training a separate reward model, etc., doing harm to the training efficiency. Moreover, the intuition behind the process signals design lacks rigorous theoretical support, leaving the understanding of the optimization mechanism opaque. In this paper, we propose Process Reward Learning (PRL), which decomposes the entropy regularized reinforcement learning objective into intermediate steps, with rigorous process rewards that could be assigned to models accordingly. Starting from theoretical motivation, we derive the formulation of PRL that is essentially equivalent to the objective of reward maximization plus a KL-divergence penalty term between the policy model and a reference model. However, PRL could turn the outcome reward into process supervision signals, which helps better guide the exploration during RL optimization. From our experiment results, we demonstrate that PRL not only improves the average performance for LLMs’ reasoning ability measured by average @ n, but also broadens the reasoning boundary by improving the pass @ n metric. Extensive experiments show the effectiveness of PRL could be verified and generalized.
[LINK]http://arxiv.org/abs/2601.10201v1
[DATE]2026-01-15 17:01:53+08:00
[CATEGORIES]cs.LG cs.CL
HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns
[AUTHORS]Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang, Jun Gao, Shuai Huang, Yueping Kang, Liyuan Gou, Hongwei Feng, Yanghua Xiao
[ABSTRACT]Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling–simulating not just what humans do, but the psychological processes generating those behaviors.
[LINK]http://arxiv.org/abs/2601.10198v1
[DATE]2026-01-15 16:56:53+08:00
[CATEGORIES]cs.CL
HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning
[AUTHORS]Ziang Cui, Mengran Yu, Tianjiao Li, Chenyu Shi, Yingxuan Shi, Lusheng Zhang, Hongwei Lin
[ABSTRACT]Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively “tames” the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.
[LINK]http://arxiv.org/abs/2601.10187v1
[DATE]2026-01-15 16:45:54+08:00
[CATEGORIES]cs.CL
Scalable and Reliable Evaluation of AI Knowledge Retrieval Systems: RIKER and the Coherent Simulated Universe
[AUTHORS]JV Roig
[ABSTRACT]Evaluating knowledge systems (LLMs, RAG, knowledge graphs, etc) faces fundamental challenges: static benchmarks are vulnerable to contamination, LLM-based judges exhibit systematic biases, and ground truth extraction requires expensive human annotation. We present RIKER (Retrieval Intelligence and Knowledge Extraction Rating), both a benchmark and a replicable methodology based on paradigm inversion - generating documents from known ground truth rather than extracting ground truth from documents. This approach enables deterministic scoring and scalable evaluation without human annotation or reference models, and contamination resistance through regenerable corpora. Our evaluation of 33 models using over 21 billion tokens reveals that context length claims frequently exceed usable capacity, with significant degradation beyond 32K tokens; cross-document aggregation proves substantially harder than single-document extraction; and grounding ability and hallucination resistance are distinct capabilities - models excelling at finding facts that exist may still fabricate facts that do not. Beyond the specific benchmark, we contribute a domain-agnostic methodology for constructing scalable and contamination-resistant evaluations wherever synthetic documents can be generated from structured ground truth.
[COMMENTS]26 pages, 17 tables, 1 figure
[LINK]http://arxiv.org/abs/2601.08847v2
[DATE]2026-01-15 16:39:19+08:00
[CATEGORIES]cs.CL
Bridging the gap: A comparative exploration of Speech-LLM and end-to-end architecture for multilingual conversational ASR
[AUTHORS]Yuxiang Mei, Dongxing Xu, Jiaen Liang, Yanhua Long
[ABSTRACT]The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end(E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, ranking on par with the top-ranked Track 1 systems, even though it uses only 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.
[COMMENTS]5 pages, 1 figure
[LINK]http://arxiv.org/abs/2601.01461v2
[DATE]2026-01-15 16:38:42+08:00
[CATEGORIES]cs.CL
Mistake Notebook Learning: Batch-Clustered Failures for Training-Free Agent Adaptation
[AUTHORS]Xuanbo Su, Yingfang Zhang, Hao Luo, Xiaoteng Liu, Leo Huang
[ABSTRACT]With the growing adoption of Large Language Model (LLM) agents in persistent, real-world roles, they naturally encounter continuous streams of tasks and inevitable failures. A key limitation, however, is their inability to systematically learn from these mistakes, forcing them to repeat identical errors in similar contexts. Unlike prior training-free methods that primarily store raw instance-level experience or focus on retrieving successful trajectories, we propose Mistake Notebook Learning (MNL), a novel memory framework that enables agents to self-curate generalizable guidance from batch-clustered failures. This mechanism allows agents to distill shared error patterns into structured “mistake notes,” updating an external memory only when batch performance improves to ensure stability. To further amplify adaptability, we integrate MNL with test-time scaling, leveraging aggregated failure patterns to actively steer the search process away from known pitfalls. Experiments on mathematical reasoning, Text-to-SQL, and interactive agent benchmarks show that MNL achieves competitive performance compared to existing memory mechanisms and in-context methods in both effectiveness and efficiency. These findings position structured mistake abstraction as a critical lever for robust agent evolution, enabling continuous improvement without the cost of parameter updates.
[LINK]http://arxiv.org/abs/2512.11485v2
[DATE]2026-01-15 16:29:51+08:00
[CATEGORIES]cs.CL
Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system
[AUTHORS]Shohei Hisada, Endo Sunao, Himi Yamato, Shoko Wakamiya, Eiji Aramaki
[ABSTRACT]This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. Although robust evaluation frameworks are essential for the safe development of medical LLMs, resources in Japanese are scarce and often consist of translated multiple-choice questions. Our research addresses this issue in two ways. First, we establish a performance baseline by applying a machine-translated version of HealthBench’s 5,000 scenarios to evaluate two models: a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Secondly, we use an LLM-as-a-Judge approach to systematically classify the benchmark’s scenarios and rubric criteria. This allows us to identify ‘contextual gaps’ where the content is misaligned with Japan’s clinical guidelines, healthcare systems or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches, as well as a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification shows that, despite most scenarios being applicable, a significant proportion of the rubric criteria require localisation. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localised adaptation, a “J-HealthBench”, to ensure the reliable and safe evaluation of medical LLMs in Japan.
[COMMENTS]draft v0.3 Code and analysis data is available at https://zenodo.org/records/17405321
[LINK]http://arxiv.org/abs/2509.17444v3
[DATE]2026-01-15 16:24:28+08:00
[CATEGORIES]cs.CL
ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack
[AUTHORS]Hao Li, Yankai Yang, G. Edward Suh, Ning Zhang, Chaowei Xiao
[ABSTRACT]Large Language Models (LLMs) have enabled the development of powerful agentic systems capable of automating complex workflows across various fields. However, these systems are highly vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external data can hijack agent behavior. In this work, we present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. The core idea of ReasAlign is to incorporate structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user’s intended tasks to defend against indirect injection attacks. To further ensure reasoning logic and accuracy, we introduce a test-time scaling mechanism with a preference-optimized judge model that scores reasoning steps and selects the best trajectory. Comprehensive evaluations across various benchmarks show that ReasAlign maintains utility comparable to an undefended model while consistently outperforming Meta SecAlign, the strongest prior guardrail. On the representative open-ended CyberSecEval2 benchmark, which includes multiple prompt-injected tasks, ReasAlign achieves 94.6% utility and only 3.6% ASR, far surpassing the state-of-the-art defensive model of Meta SecAlign (56.4% utility and 74.4% ASR). These results demonstrate that ReasAlign achieves the best trade-off between security and utility, establishing a robust and practical defense against prompt injection attacks in real-world agentic systems. Our code and experimental results could be found at https://github.com/leolee99/ReasAlign.
[COMMENTS]15 pages, 10 figures
[LINK]http://arxiv.org/abs/2601.10173v1
[DATE]2026-01-15 16:23:38+08:00
[CATEGORIES]cs.CL
Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection
[AUTHORS]Nhung Nguyen Thi Hong, Cuong Nguyen Dang, Tri Le Ngoc
[ABSTRACT]Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.
[COMMENTS]8 pages, 0 figures, 3 tables. Preprint
[LINK]http://arxiv.org/abs/2601.10167v1
[DATE]2026-01-15 16:12:55+08:00
[CATEGORIES]cs.CL
AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
[AUTHORS]Prachuryya Kaushik, Ashish Anand
[COMMENTS]Submitted to ACL‘26 System Demonstration
[LINK]http://arxiv.org/abs/2601.10161v1
[DATE]2026-01-15 16:00:25+08:00
[CATEGORIES]cs.CL
ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
[AUTHORS]Yutao Mou, Zhangchi Xue, Lijun Li, Peiyang Liu, Shikun Zhang, Wei Ye, Jing Shao
[ABSTRACT]While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
[COMMENTS]Work in Progress. Code available: https://github.com/MurrayTom/ToolSafe
[LINK]http://arxiv.org/abs/2601.10156v1
[DATE]2026-01-15 15:54:32+08:00
[CATEGORIES]cs.CL
Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation
[AUTHORS]Zirui Zhao, Boye Niu, David Hsu, Wee Sun Lee
[ABSTRACT]We study abstract visual composition, in which identity is primarily determined by the spatial configuration and relations among a small set of geometric primitives (e.g., parts, symmetry, topology). They are invariant primarily to texture and photorealistic detail. Composing such structures from fixed components under geometric constraints and vague goal specification (such as text) is non-trivial due to combinatorial placement choices, limited data, and discrete feasibility (overlap-free, allowable orientations), which create a sparse solution manifold ill-suited to purely statistical pixel-space generators. We propose a constraint-guided framework that combines explicit geometric reasoning with neural semantics. An AlphaGo-style search enforces feasibility, while a fine-tuned vision-language model scores semantic alignment as reward signals. Our algorithm uses a policy network as a heuristic in Monte-Carlo Tree Search and fine-tunes the network via search-generated plans. Inspired by the Generative Adversarial Network, we use the generated instances for adversarial reward refinement. Over time, the generation should approach the actual data more closely when the reward model cannot distinguish between generated instances and ground-truth. In the Tangram Assembly task, our approach yields higher validity and semantic fidelity than diffusion and auto-regressive baselines, especially as constraints tighten.
[LINK]http://arxiv.org/abs/2512.01242v2
[DATE]2026-01-15 15:18:11+08:00
[CATEGORIES]cs.CL
Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends
[AUTHORS]Ye Wang, Jiaxing Chen, Hongjiang Xiao
[ABSTRACT]In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.
[LINK]http://arxiv.org/abs/2601.10122v1
[DATE]2026-01-15 15:08:20+08:00
[CATEGORIES]cs.CL
TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems
[AUTHORS]Rui Sun, Jie Ding, Chenghua Gong, Tianjun Gu, Yihang Jiang, Juyuan Zhang, Liming Pan, Linyuan Lü
[ABSTRACT]Optimizing communication topology in LLM-based multi-agent system is critical for enabling collective intelligence. Existing methods mainly rely on spatio-temporal interaction paradigms, where the sequential execution of multi-round dialogues incurs high latency and computation. Motivated by the recent insights that evaluation and debate mechanisms can improve problem-solving in multi-agent systems, we propose TopoDIM, a framework for one-shot Topology generation with Diverse Interaction Modes. Designed for decentralized execution to enhance adaptability and privacy, TopoDIM enables agents to autonomously construct heterogeneous communication without iterative coordination, achieving token efficiency and improved task performance. Experiments demonstrate that TopoDIM reduces total token consumption by 46.41% while improving average performance by 1.50% over state-of-the-art methods. Moreover, the framework exhibits strong adaptability in organizing communication among heterogeneous agents. Code is available at: https://anonymous.4open.science/r/TopoDIM-8D35/
[LINK]http://arxiv.org/abs/2601.10120v1
[DATE]2026-01-15 15:05:04+08:00
[CATEGORIES]cs.CL
MiLe Loss: a New Entropy-Weighed Loss for Mitigating the Bias of Learning Difficulties in Large Language Models
[AUTHORS]Zhenpeng Su, Xing Wu, Xue Bai, Zijia Lin, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu
[COMMENTS]This paper has been accepted by NAACL 2024
[LINK]http://arxiv.org/abs/2310.19531v8
[DATE]2026-01-15 14:45:36+08:00
[CATEGORIES]cs.CL
MATRIX AS PLAN: Structured Logical Reasoning with Feedback-Driven Replanning
[AUTHORS]Ke Chen, Jiandian Zeng, Zihao Peng, Guo Li, Guangxue Zhang, Tian Wang
[ABSTRACT]As knowledge and semantics on the web grow increasingly complex, enhancing Large Language Models (LLMs) comprehension and reasoning capabilities has become particularly important. Chain-of-Thought (CoT) prompting has been shown to enhance the reasoning capabilities of LLMs. However, it still falls short on logical reasoning tasks that rely on symbolic expressions and strict deductive rules. Neuro-symbolic methods address this gap by enforcing formal correctness through external solvers. Yet these solvers are highly format-sensitive, and small instabilities in model outputs can lead to frequent processing failures. LLM-driven approaches avoid parsing brittleness, but they lack structured representations and process-level error-correction mechanisms. To further enhance the logical reasoning capabilities of LLMs, we propose MatrixCoT, a structured CoT framework with a matrix-based plan. Specifically, we normalize and type natural language expressions, attach explicit citation fields, and introduce a matrix-based planning method to preserve global relations among steps. The plan becomes a verifiable artifact, making execution more stable. For verification, we also add a feedback-driven replanning mechanism. Under semantic-equivalence constraints, it identifies omissions and defects, rewrites and compresses the dependency matrix, and produces a more trustworthy final answer. Experiments on five logical-reasoning benchmarks and five LLMs show that, without relying on external solvers, MatrixCoT enhances both robustness and interpretability when tackling complex symbolic reasoning tasks, while maintaining competitive performance.
[COMMENTS]12 pages, 5 figures, 2 tables. Accepted at The Web Conference (WWW) 2026
[LINK]http://arxiv.org/abs/2601.10101v1
[DATE]2026-01-15 14:12:00+08:00
[CATEGORIES]cs.CL
CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking
[AUTHORS]Viet Cuong Nguyen, Nhi Yen Nguyen, Kristin A. Candan, Mary Conlon, Vanessa Rumie, Kristen Risola, Srijan Kumar, Munmun De Choudhury
[ABSTRACT]Large Language Models (LLMs) are increasingly used in mental health-related settings, yet they struggle to sustain realistic, goal-directed dialogue over extended interactions. While LLMs generate fluent responses, they optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, leading to brittleness and long-horizon drift. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing (MI) dialogues that explicitly models dual-actor conversational dynamics. CALM-IT represents therapist-client interaction as a bidirectional state-space process, in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Across large-scale evaluations, CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases. Although CALM-IT initiates fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise and therapeutically aligned intervention timing. Overall, CALM-IT provides evidence for modeling evolving conversational state being essential for generating high-quality long-form synthetic conversations.
[COMMENTS]46 pages
[LINK]http://arxiv.org/abs/2601.10085v1
[DATE]2026-01-15 13:21:54+08:00
[CATEGORIES]cs.CL
Deriving Character Logic from Storyline as Codified Decision Trees
[AUTHORS]Letian Peng, Kun Zhou, Longfei Yun, Yupeng Hou, Jingbo Shang
[ABSTRACT]Role-playing (RP) agents rely on behavioral profiles to act consistently across diverse narrative contexts, yet existing profiles are largely unstructured, non-executable, and weakly validated, leading to brittle agent behavior. We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents behavioral profiles as a tree of conditional rules, where internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, enabling deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. Across multiple benchmarks, CDT substantially outperforms human-written profiles and prior profile induction methods on $85$ characters across $16$ artifacts, indicating that codified and validated behavioral representations lead to more reliable agent grounding.
[LINK]http://arxiv.org/abs/2601.10080v1
[DATE]2026-01-15 13:12:43+08:00
[CATEGORIES]cs.CL
Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
[AUTHORS]Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang, Jinbo Su, Mengshu Sun, Lei Liang, Jing Zhang
[ABSTRACT]Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL empowers stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving the performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
[LINK]http://arxiv.org/abs/2601.10079v1
[DATE]2026-01-15 13:12:03+08:00
[CATEGORIES]cs.LG cs.CL
Controlled Self-Evolution for Algorithmic Code Optimization
[AUTHORS]Tu Hu, Ronghao Chen, Shuo Zhang, Jianghao Yin, Mou Xiao Feng, Jingping Liu, Shaolei Zhang, Wenqi Jiang, Yuqi Fang, Sen Hu, Huacan Wang, Yi Xu
[ABSTRACT]Self-evolution methods enhance code generation through iterative “generate-verify-refine” cycles, yet existing approaches suffer from low exploration efficiency, failing to discover solutions with superior complexity within limited budgets. This inefficiency stems from initialization bias trapping evolution in poor solution regions, uncontrolled stochastic operations lacking feedback guidance, and insufficient experience utilization across tasks. To address these bottlenecks, we propose Controlled Self-Evolution (CSE), which consists of three key components. Diversified Planning Initialization generates structurally distinct algorithmic strategies for broad solution space coverage. Genetic Evolution replaces stochastic operations with feedback-guided mechanisms, enabling targeted mutation and compositional crossover. Hierarchical Evolution Memory captures both successful and failed experiences at inter-task and intra-task levels. Experiments on EffiBench-X demonstrate that CSE consistently outperforms all baselines across various LLM backbones. Furthermore, CSE achieves higher efficiency from early generations and maintains continuous improvement throughout evolution. Our code is publicly available at https://github.com/QuantaAlpha/EvoControl.
[COMMENTS]27 pages
[LINK]http://arxiv.org/abs/2601.07348v4
[DATE]2026-01-15 12:14:52+08:00
[CATEGORIES]cs.CL
JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation
[AUTHORS]Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Shengjia Ma, Yinghan Shen, Zixuan Li, Jian Guo, Yuanzhuo Wang
[ABSTRACT]Current evaluation methods for large language models (LLMs) primarily rely on static benchmarks, presenting two major challenges: limited knowledge coverage and fixed difficulties that mismatch with the evaluated LLMs. These limitations lead to superficial assessments of LLM knowledge, thereby impeding the targeted model optimizations. To bridge this gap, we propose JudgeAgent, a knowledge-driven and dynamic evaluation framework for LLMs. To address the challenge of limited knowledge coverage, JudgeAgent leverages LLM agents equipped with context graphs to traverse knowledge structures systematically for question generation. Furthermore, to mitigate data contamination and difficulty mismatch, it adopts a difficulty-adaptive and multi-turn interview mechanism. Thereby, JudgeAgent can achieve comprehensive evaluations and facilitate more effective improvement of LLMs. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iterations, highlighting the potential of this knowledge-driven and dynamic evaluation paradigm. The source code is available on https://github.com/DataArcTech/JudgeAgent.
[LINK]http://arxiv.org/abs/2509.02097v4
[DATE]2026-01-15 12:01:33+08:00
[CATEGORIES]cs.CL
How Many Human Judgments Are Enough? Feasibility Limits of Human Preference Evaluation
[AUTHORS]Wilson Y. Lee
[ABSTRACT]Human preference evaluations are widely used to compare generative models, yet it remains unclear how many judgments are required to reliably detect small improvements. We show that when preference signal is diffuse across prompts (i.e., all prompt types are similarly informative), proportional allocation is minimax-optimal: no allocation strategy substantially improves detectability. Empirical analysis of large-scale human preference datasets shows that most comparisons fall into this diffuse regime, exhibiting small preference margins that require far more judgments than typically collected, even in well-sampled comparisons. These limits persist across evaluation protocols and modalities, including chat, image generation, and code generation with execution feedback. In contrast, curated benchmarks that reduce prompt induced variability systematically induce larger margins and improve detectability through a $1.5\times$ reduction in prompt-level variance. Our results show that inconclusive or negative human evaluation outcomes frequently reflect underpowered evaluation rather than model equivalence, underscoring the need to account explicitly for effect size, budget, and protocol design.
[LINK]http://arxiv.org/abs/2601.09084v2
[DATE]2026-01-15 11:47:46+08:00
[CATEGORIES]cs.CL cs.LG
Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
[AUTHORS]Anaelia Ovalle, Candace Ross, Sebastian Ruder, Adina Williams, Karen Ullrich, Mark Ibrahim, Levent Sagun
[COMMENTS]Accepted to 2025 EMNLP Multilingual Representation Learning Workshop
[LINK]http://arxiv.org/abs/2512.22712v2
[DATE]2026-01-15 11:34:27+08:00
[CATEGORIES]cs.CL
Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
[AUTHORS]Xueru Wen, Jie Lou, Xinyu Lu, Junjie Yang, Yanjiang Liu, Yaojie Lu, Debing Zhang, Xing Yu
[ABSTRACT]As AI capabilities increasingly surpass human proficiency in complex tasks, current alignment techniques, including SFT and RLHF, face fundamental challenges in ensuring reliable oversight. These methods rely on direct human assessment and become impractical when AI outputs exceed human cognitive thresholds. In response to this challenge, we explore two hypotheses: (1) \textit{Critique of critique can be easier than critique itself}, extending the widely-accepted observation that verification is easier than generation to the critique domain, as critique itself is a specialized form of generation; (2) \textit{This difficulty relationship holds recursively}, suggesting that when direct evaluation is infeasible, performing higher-order critiques (e.g., critique of critique of critique) offers a more tractable supervision pathway. We conduct Human-Human, Human-AI, and AI-AI experiments to investigate the potential of recursive self-critiquing for AI supervision. Our results highlight recursive critique as a promising approach for scalable AI oversight.
[LINK]http://arxiv.org/abs/2502.04675v4
[DATE]2026-01-15 11:27:36+08:00
[CATEGORIES]cs.CL cs.LG
Exploring the Translation Mechanism of Large Language Models
[AUTHORS]Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Yang Xiang, Min Zhang
[ABSTRACT]While large language models (LLMs) demonstrate remarkable success in multilingual translation, their internal core translation mechanisms, even at the fundamental word level, remain insufficiently understood. To address this critical gap, this work introduces a systematic framework for interpreting the mechanism behind LLM translation from the perspective of computational components. This paper first proposes subspace-intervened path patching for precise, fine-grained causal analysis, enabling the detection of components crucial to translation tasks and subsequently characterizing their behavioral patterns in human-interpretable terms. Comprehensive experiments reveal that translation is predominantly driven by a sparse subset of components: specialized attention heads serve critical roles in extracting source language, translation indicators, and positional features, which are then integrated and processed by specific multi-layer perceptrons (MLPs) into intermediary English-centric latent representations before ultimately yielding the final translation. The significance of these findings is underscored by the empirical demonstration that targeted fine-tuning a minimal parameter subset ($<5\%$) enhances translation performance while preserving general capabilities. This result further indicates that these crucial components generalize effectively to sentence-level translation and are instrumental in elucidating more intricate translation tasks.
[COMMENTS]Accepted in NeurIPS 2025 Poster
[LINK]http://arxiv.org/abs/2502.11806v3
[DATE]2026-01-15 11:25:26+08:00
[CATEGORIES]cs.CL
Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems
[AUTHORS]Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, Jesse C. Cresswell
[ABSTRACT]Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at https://github.com/layer6ai-labs/rag-error-classification.
[COMMENTS]EACL 2026
[LINK]http://arxiv.org/abs/2510.13975v2
[DATE]2026-01-15 11:23:18+08:00
[CATEGORIES]cs.CL cs.LG
TeleMem: Building Long-Term and Multimodal Memory for Agentic AI
[AUTHORS]Chunliang Chen, Ming Guan, Xiao Lin, Jiaxu Li, Luxi Lin, Qiyi Wang, Xiangyu Chen, Jixiang Luo, Changzhi Sun, Dell Zhang, Xuelong Li
[ABSTRACT]Large language models (LLMs) excel at many NLP tasks but struggle to sustain long-term interactions due to limited attention over extended dialogue histories. Retrieval-augmented generation (RAG) mitigates this issue but lacks reliable mechanisms for updating or refining stored memories, leading to schema-driven hallucinations, inefficient write operations, and minimal support for multimodal reasoning.To address these challenges, we propose TeleMem, a unified long-term and multimodal memory system that maintains coherent user profiles through narrative dynamic extraction, ensuring that only dialogue-grounded information is preserved. TeleMem further introduces a structured writing pipeline that batches, retrieves, clusters, and consolidates memory entries, substantially improving storage efficiency, reducing token usage, and accelerating memory operations. Additionally, a multimodal memory module combined with ReAct-style reasoning equips the system with a closed-loop observe, think, and act process that enables accurate understanding of complex video content in long-term contexts. Experimental results show that TeleMem surpasses the state-of-the-art Mem0 baseline with 19% higher accuracy, 43% fewer tokens, and a 2.1x speedup on the ZH-4O long-term role-play gaming benchmark.
[LINK]http://arxiv.org/abs/2601.06037v3
[DATE]2026-01-15 11:01:53+08:00
[CATEGORIES]cs.CL
Continuous-Depth Transformers with Learned Control Dynamics
[AUTHORS]Peter Jemley
[ABSTRACT]We present a hybrid transformer architecture that replaces discrete middle layers with a continuous-depth Neural Ordinary Differential Equation (ODE) block, enabling inference-time control over generation attributes via a learned steering signal. Unlike standard transformers that process representations through fixed discrete layers, our approach treats depth as a continuous variable governed by a learned vector field $F_θ(H, τ, u)$, where $u$ is a low-dimensional control signal injected via explicit concatenation. We validate the architecture through four experiments: (1) gradient flow stability with zero exploding/vanishing gradient events, (2) semantic steering achieving 98\%/88\% accuracy for positive/negative sentiment control, (3) continuous interpolation validated by a negligible 0.068\% trajectory divergence between fixed and adaptive solvers, and (4) efficiency benchmarking demonstrating latency parity with standard discrete baselines. Additionally, we show that adaptive ODE solvers reveal geometric structure in the learned dynamics: the control signal partitions the vector field into distinct dynamical regimes with different curvature characteristics. The adjoint method enables $O(1)$ memory training regardless of integration depth. Our results demonstrate that continuous-depth dynamics with learned control signals provide a viable, efficient mechanism for steerable language generation.
[COMMENTS]9 pages, 4 figures. Code available at: https://github.com/PeterJemley/Continuous-Depth-Transformers-with-Learned-Control-Dynamics
[LINK]http://arxiv.org/abs/2601.10007v1
[DATE]2026-01-15 10:35:37+08:00
[CATEGORIES]cs.LG cs.CL
A.X K1 Technical Report
[AUTHORS]Sung Jun Cheon, Jaekyung Cho, Seongho Choi, Hyunjun Eun, Seokhwan Jo, Jaehyun Jun, Minsoo Kang, Jin Kim, Jiwon Kim, Minsang Kim, Sungwan Kim, Seungsik Kim, Tae Yoon Kim, Youngrang Kim, Hyeongmun Lee, Sangyeol Lee, Sungeun Lee, Youngsoon Lee, Yujin Lee, Seongmin Ok, Chanyong Park, Hyewoong Park, Junyoung Park, Hyunho Yang, Subin Yi, Soohyun Bae, Dhammiko Arya, Yongseok Choi, Sangho Choi, Dongyeon Cho, Seungmo Cho, Gyoungeun Han, Yong-jin Han, Seokyoung Hong, Hyeon Hwang, Wonbeom Jang, Minjeong Ju, Wonjin Jung, Keummin Ka, Sungil Kang, Dongnam Kim, Joonghoon Kim, Jonghwi Kim, SaeRom Kim, Sangjin Kim, Seongwon Kim, Youngjin Kim, Seojin Lee, Sunwoo Lee, Taehoon Lee, Chanwoo Park, Sohee Park, Sooyeon Park, Yohan Ra, Sereimony Sek, Seungyeon Seo, Gun Song, Sanghoon Woo, Janghan Yoon, Sungbin Yoon
[ABSTRACT]We introduce A.X K1, a 519B-parameter Mixture-of-Experts (MoE) language model trained from scratch. Our design leverages scaling laws to optimize training configurations and vocabulary size under fixed computational budgets. A.X K1 is pre-trained on a corpus of approximately 10T tokens, curated by a multi-stage data processing pipeline. Designed to bridge the gap between reasoning capability and inference efficiency, A.X K1 supports explicitly controllable reasoning to facilitate scalable deployment across diverse real-world scenarios. We propose a simple yet effective Think-Fusion training recipe, enabling user-controlled switching between thinking and non-thinking modes within a single unified model. Extensive evaluations demonstrate that A.X K1 achieves performance competitive with leading open-source models, while establishing a distinctive advantage in Korean-language benchmarks.
[COMMENTS]This paper is withdrawn pending additional internal review of the methodology and analysis
[LINK]http://arxiv.org/abs/2601.09200v2
[DATE]2026-01-15 10:32:50+08:00
[CATEGORIES]cs.CL
SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction
[AUTHORS]Sanghyeok Choi, Woosang Jeon, Kyuseok Yang, Taehyeong Kim
[ABSTRACT]Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.
[LINK]http://arxiv.org/abs/2601.10003v1
[DATE]2026-01-15 10:26:51+08:00
[CATEGORIES]cs.CL
Fairness Definitions in Language Models Explained
[AUTHORS]Zhipeng Yin, Zichong Wang, Avash Palikhe, Wenbin Zhang
[ABSTRACT]Language Models (LMs) have demonstrated exceptional performance across various Natural Language Processing (NLP) tasks. Despite these advancements, LMs can inherit and amplify societal biases related to sensitive attributes such as gender and race, limiting their adoption in real-world applications. Therefore, fairness has been extensively explored in LMs, leading to the proposal of various fairness notions. However, the lack of clear agreement on which fairness definition to apply in specific contexts and the complexity of understanding the distinctions between these definitions can create confusion and impede further progress. To this end, this paper proposes a systematic survey that clarifies the definitions of fairness as they apply to LMs. Specifically, we begin with a brief introduction to LMs and fairness in LMs, followed by a comprehensive, up-to-date overview of existing fairness notions in LMs and the introduction of a novel taxonomy that categorizes these concepts based on their transformer architecture: encoder-only, decoder-only, and encoder-decoder LMs. We further illustrate each definition through experiments, showcasing their practical implications and outcomes. Finally, we discuss current research challenges and open questions, aiming to foster innovative ideas and advance the field. The repository is publicly available online at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/definitions.
[LINK]http://arxiv.org/abs/2407.18454v3
[DATE]2026-01-15 10:02:02+08:00
[CATEGORIES]cs.CL cs.LG
WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms
[AUTHORS]Zhisong Zhang, Tianqing Fang, Kaixin Ma, Wenhao Yu, Hongming Zhang, Haitao Mi, Dong Yu
[ABSTRACT]With recent advancements in large language models, web agents have been greatly improved. However, dealing with complex and dynamic web environments requires more advanced planning and search abilities. Previous studies usually adopt a greedy one-way search strategy, which may struggle to recover from erroneous states. In this work, we enhance web agents with an explicit rollback mechanism, enabling the agent to revert back to a previous state in its navigation trajectory. This mechanism gives models the flexibility to directly control the search process, leading to an effective and efficient web navigation method. We conduct experiments on two live web navigation benchmarks with zero-shot and fine-tuning settings. The results demonstrate the effectiveness of our proposed approach.
[COMMENTS]EACL 2026
[LINK]http://arxiv.org/abs/2504.11788v3
[DATE]2026-01-15 10:00:45+08:00
[CATEGORIES]cs.CL
Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
[AUTHORS]David Samuel Setiawan, Raphaël Merx, Jey Han Lau
[ABSTRACT]Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust “safety net,” repairing severe failures in zero-shot domains.
[LINK]http://arxiv.org/abs/2601.09982v1
[DATE]2026-01-15 09:57:34+08:00
[CATEGORIES]cs.CL
SPRInG: Continual LLM Personalization via Selective Parametric Adaptation and Retrieval-Interpolated Generation
[AUTHORS]Seoyeon Kim, Jaehyung Kim
[ABSTRACT]Personalizing Large Language Models typically relies on static retrieval or one-time adaptation, assuming user preferences remain invariant over time. However, real-world interactions are dynamic, where user interests continuously evolve, posing a challenge for models to adapt to preference drift without catastrophic forgetting. Standard continual learning approaches often struggle in this context, as they indiscriminately update on noisy interaction streams, failing to distinguish genuine preference shifts from transient contexts. To address this, we introduce SPRInG, a novel semi-parametric framework designed for effective continual personalization. During training, SPRInG employs drift-driven selective adaptation, which utilizes a likelihood-based scoring function to identify high-novelty interactions. This allows the model to selectively update the user-specific adapter on drift signals while preserving hard-to-learn residuals in a replay buffer. During inference, we apply strict relevance gating and fuse parametric knowledge with retrieved history via logit interpolation. Experiments on the long-form personalized generation benchmark demonstrate that SPRInG outperforms existing baselines, validating its robustness for real-world continual personalization.
[COMMENTS]under review, 23 pages
[LINK]http://arxiv.org/abs/2601.09974v1
[DATE]2026-01-15 09:32:27+08:00
[CATEGORIES]cs.CL
CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
[AUTHORS]Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
[ABSTRACT]We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site specific policies and notes; (iii) PromptRouter, a cost and uncertainty aware policy that selects edge only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention style kernels, speculative decoding, and quantized LoRA adapters, and supports on device personalization and federated updates under non IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub second (p95) end to end latency on edge dominant paths, reduces inter tier token and bandwidth costs by preferring local retrieval grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge first design that treats semantics, privacy, and predictable latency as co equal goals for large model deployments in interference prone environments.
[COMMENTS]arXiv admin note: This paper has been withdrawn by arXiv due to unverifiable authorship and affiliation
[LINK]http://arxiv.org/abs/2510.19670v4
[DATE]2026-01-15 06:53:47+08:00
[CATEGORIES]cs.CL
ZK-SenseLM: Verifiable Large-Model Wireless Sensing with Selective Abstention and Zero-Knowledge Attestation
[AUTHORS]Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
[ABSTRACT]ZK-SenseLM is a secure and auditable wireless sensing framework that pairs a large-model encoder for Wi-Fi channel state information (and optionally mmWave radar or RFID) with a policy-grounded decision layer and end-to-end zero-knowledge proofs of inference. The encoder uses masked spectral pretraining with phase-consistency regularization, plus a light cross-modal alignment that ties RF features to compact, human-interpretable policy tokens. To reduce unsafe actions under distribution shift, we add a calibrated selective-abstention head; the chosen risk-coverage operating point is registered and bound into the proof. We implement a four-stage proving pipeline: (C1) feature sanity and commitment, (C2) threshold and version binding, (C3) time-window binding, and (C4) PLONK-style proofs that the quantized network, given the committed window, produced the logged action and confidence. Micro-batched proving amortizes cost across adjacent windows, and a gateway option offloads proofs from low-power devices. The system integrates with differentially private federated learning and on-device personalization without weakening verifiability: model hashes and the registered threshold are part of each public statement. Across activity, presence or intrusion, respiratory proxy, and RF fingerprinting tasks, ZK-SenseLM improves macro-F1 and calibration, yields favorable coverage-risk curves under perturbations, and rejects tamper and replay with compact proofs and fast verification.
[COMMENTS]arXiv admin note: This paper has been withdrawn by arXiv due to unverifiable authorship and affiliation
[LINK]http://arxiv.org/abs/2510.25677v3
[DATE]2026-01-15 06:53:23+08:00
[CATEGORIES]cs.CL
Self-reflection in Automated Qualitative Coding: Improving Text Annotation through Secondary LLM Critique
[AUTHORS]Zackary Okun Dunivin, Mobina Noori, Seth Frey, Curtis Atkinson
[ABSTRACT]Large language models (LLMs) allow for sophisticated qualitative coding of large datasets, but zero- and few-shot classifiers can produce an intolerable number of errors, even with careful, validated prompting. We present a simple, generalizable two-stage workflow: an LLM applies a human-designed, LLM-adapted codebook; a secondary LLM critic performs self-reflection on each positive label by re-reading the source text alongside the first model’s rationale and issuing a final decision. We evaluate this approach on six qualitative codes over 3,000 high-content emails from Apache Software Foundation project evaluation discussions. Our human-derived audit of 360 positive annotations (60 passages by six codes) found that the first-line LLM had a false-positive rate of 8% to 54%, despite F1 scores of 0.74 and 1.00 in testing. Subsequent recoding of all stage-one annotations via a second self-reflection stage improved F1 by 0.04 to 0.25, bringing two especially poor performing codes up to 0.69 and 0.79 from 0.52 and 0.55 respectively. Our manual evaluation identified two recurrent error classes: misinterpretation (violations of code definitions) and meta-discussion (debate about a project evaluation criterion mistaken for its use as a decision justification). Code-specific critic clauses addressing observed failure modes were especially effective with testing and refinement, replicating the codebook-adaption process for LLM interpretation in stage-one. We explain how favoring recall in first-line LLM annotation combined with secondary critique delivers precision-first, compute-light control. With human guidance and validation, self-reflection slots into existing LLM-assisted annotation pipelines to reduce noise and potentially salvage unusable classifiers.
[LINK]http://arxiv.org/abs/2601.09905v1
[DATE]2026-01-15 06:27:13+08:00
[CATEGORIES]cs.CL
The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
[AUTHORS]Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev
[ABSTRACT]In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.
[LINK]http://arxiv.org/abs/2506.21621v2
[DATE]2026-01-15 05:46:26+08:00
[CATEGORIES]cs.CL
Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering
[AUTHORS]Feras AlMannaa, Talia Tseriotou, Jenny Chim, Maria Liakata
[ABSTRACT]This study is the first to investigate LLM comprehension capabilities over long-context (LC), clinically relevant medical Question Answering (QA) beyond MCQA. Our comprehensive approach considers a range of settings based on content inclusion of varying size and relevance, LLM models of different capabilities and a variety of datasets across task formulations. We reveal insights on model size effects and their limitations, underlying memorization issues and the benefits of reasoning models, while demonstrating the value and challenges of leveraging the full long patient’s context. Importantly, we examine the effect of Retrieval Augmented Generation (RAG) on medical LC comprehension, showcasing best settings in single versus multi-document QA datasets. We shed light into some of the evaluation aspects using a multi-faceted approach uncovering common metric challenges. Our quantitative analysis reveals challenging cases where RAG excels while still showing limitations in cases requiring temporal reasoning.
[LINK]http://arxiv.org/abs/2510.18691v2
[DATE]2026-01-15 05:40:27+08:00
[CATEGORIES]cs.CL
Relative Scaling Laws for LLMs
[AUTHORS]William Held, David Hall, Percy Liang, Diyi Yang
[ABSTRACT]Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$–$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.
[LINK]http://arxiv.org/abs/2510.24626v2
[DATE]2026-01-15 05:02:33+08:00
[CATEGORIES]cs.CL
OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing
[AUTHORS]Yilin Bao, Ziyao He, Zayden Yang
[ABSTRACT]Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edit evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stabilize learning,we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, We further introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.
[LINK]http://arxiv.org/abs/2601.09858v1
[DATE]2026-01-15 04:37:26+08:00
[CATEGORIES]cs.CL cs.LG
Thinking Long, but Short: Stable Sequential Test-Time Scaling for Large Reasoning Models
[AUTHORS]Michael R. Metel, Yufei Cui, Boxing Chen, Prasanna Parthasarathi
[ABSTRACT]Sequential test-time scaling is a promising training-free method to improve large reasoning model accuracy, but as currently implemented, significant limitations have been observed. Inducing models to think for longer can increase their accuracy, but as the length of reasoning is further extended, it has also been shown to result in accuracy degradation and model instability. This work presents a novel sequential test-time scaling method, Min-Seek, which improves model accuracy significantly over a wide range of induced thoughts, stabilizing the accuracy of sequential scaling, and removing the need for reasoning length fine-tuning. Beyond improving model accuracy over a variety of reasoning tasks, our method is inherently efficient, as only the KV pairs of one additional induced thought are kept in the KV cache during reasoning. With a custom KV cache which stores keys without position embeddings, by dynamically encoding them contiguously before each new generated thought, our method can continue to reason well beyond a model’s maximum context length, and under mild conditions has linear computational complexity.
[COMMENTS]Findings of EACL 2026
[LINK]http://arxiv.org/abs/2601.09855v1
[DATE]2026-01-15 04:30:55+08:00
[CATEGORIES]cs.CL
Value-Aware Numerical Representations for Transformer Language Models
[AUTHORS]Andreea Dutulescu, Stefan Ruseti, Mihai Dascalu
[ABSTRACT]Transformer-based language models often achieve strong results on mathematical reasoning benchmarks while remaining fragile on basic numerical understanding and arithmetic operations. A central limitation is that numbers are processed as symbolic tokens whose embeddings do not explicitly encode numerical value, leading to systematic errors. We introduce a value-aware numerical representation that augments standard tokenized inputs with a dedicated prefix token whose embedding is explicitly conditioned on the underlying numerical value. This mechanism injects magnitude information directly into the model’s input space while remaining compatible with existing tokenizers and decoder-only Transformer architectures. Evaluation on arithmetic tasks shows that the proposed approach outperforms baselines across numerical formats, tasks, and operand lengths. These results indicate that explicitly encoding numerical value is an effective and efficient way to improve fundamental numerical robustness in language models.
[LINK]http://arxiv.org/abs/2601.09706v1
[DATE]2026-01-15 02:59:14+08:00
[CATEGORIES]cs.CL cs.LG
ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation
[AUTHORS]Sicong Liu, Yanxian Huang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Yuchi Ma, Hongyu Zhang, Yin Zhang, Yanlin Wang
[ABSTRACT]Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development efforts and enhancing software productivity. The emergence of large language models (LLMs) has significantly advanced code generation, though their efficiency is still impacted by certain inherent architectural constraints. Each token generation necessitates a complete inference pass, requiring persistent retention of contextual information in memory and escalating resource consumption. While existing research prioritizes inference-phase optimizations such as prompt compression and model quantization, the generation phase remains underexplored. To tackle these challenges, we propose a knowledge-infused framework named ShortCoder, which optimizes code generation efficiency while preserving semantic equivalence and readability. In particular, we introduce: (1) ten syntax-level simplification rules for Python, derived from AST-preserving transformations, achieving 18.1% token reduction without functional compromise; (2) a hybrid data synthesis pipeline integrating rule-based rewriting with LLM-guided refinement, producing ShorterCodeBench, a corpus of validated tuples of original code and simplified code with semantic consistency; (3) a fine-tuning strategy that injects conciseness awareness into the base LLMs. Extensive experimental results demonstrate that ShortCoder consistently outperforms state-of-the-art methods on HumanEval, achieving an improvement of 18.1%-37.8% in generation efficiency over previous methods while ensuring the performance of code generation.
[LINK]http://arxiv.org/abs/2601.09703v1
[DATE]2026-01-15 02:57:31+08:00
[CATEGORIES]cs.CL
Empathy Applicability Modeling for General Health Queries
[AUTHORS]Shan Randhawa, Agha Ali Raza, Kentaro Toyama, Julie Hui, Mustafa Naseem
[ABSTRACT]LLMs are increasingly being integrated into clinical workflows, yet they often lack clinical empathy, an essential aspect of effective doctor-patient communication. Existing NLP frameworks focus on reactively labeling empathy in doctors’ responses but offer limited support for anticipatory modeling of empathy needs, especially in general health queries. We introduce the Empathy Applicability Framework (EAF), a theory-driven approach that classifies patient queries in terms of the applicability of emotional reactions and interpretations, based on clinical, contextual, and linguistic cues. We release a benchmark of real patient queries, dual-annotated by Humans and GPT-4o. In the subset with human consensus, we also observe substantial human-GPT alignment. To validate EAF, we train classifiers on human-labeled and GPT-only annotations to predict empathy applicability, achieving strong performance and outperforming the heuristic and zero-shot LLM baselines. Error analysis highlights persistent challenges: implicit distress, clinical-severity ambiguity, and contextual hardship, underscoring the need for multi-annotator modeling, clinician-in-the-loop calibration, and culturally diverse annotation. EAF provides a framework for identifying empathy needs before response generation, establishes a benchmark for anticipatory empathy modeling, and enables supporting empathetic communication in asynchronous healthcare.
[COMMENTS]In Submission to ACL
[LINK]http://arxiv.org/abs/2601.09696v1
[DATE]2026-01-15 02:47:02+08:00
[CATEGORIES]cs.CL
LLMs can Compress LLMs: Adaptive Pruning by Agents
[AUTHORS]Sai Varun Kodathala, Rakesh Vunnam
[ABSTRACT]As Large Language Models (LLMs) continue to scale, post-training pruning has emerged as a promising approach to reduce computational costs while preserving performance. Existing methods such as SparseGPT and Wanda achieve high sparsity through layer-wise weight reconstruction or activation-aware magnitude pruning, but rely on uniform or hand-crafted heuristics to determine per-layer sparsity ratios. Moreover, recent work has shown that pruned LLMs suffer from severe factual knowledge degradation, with structured pruning methods experiencing near-total collapse in factual question-answering capabilities. We introduce agent-guided pruning, where a foundation model acts as an adaptive pruning agent to intelligently select which layers to prune at each iteration while preserving critical knowledge pathways. Our method constructs layer-wise sensitivity profiles by combining Wanda-inspired weight-activation metrics with gradient importance scores, normalized as z-scores for model-agnostic comparison. These statistics are processed by an LLM agent equipped with self-reflection capabilities, enabling it to learn from previous pruning outcomes and iteratively refine its strategy. A checkpoint rollback mechanism maintains model quality by reverting when perplexity degradation exceeds a threshold. We evaluate our approach on Qwen3 models (4B and 8B parameters) at approximately 45% sparsity, demonstrating substantial improvements over structured pruning baselines: 56% relative improvement in MMLU accuracy, 19x better factual knowledge retention on FreebaseQA, and 69% lower perplexity degradation. Notably, our framework requires no retraining, operates in a model-agnostic manner, and exhibits effective self-correction with only 2-4 rollbacks across 21-40 iterations, demonstrating that foundation models can effectively guide the compression of other foundation models.
[COMMENTS]17 Pages
[LINK]http://arxiv.org/abs/2601.09694v1
[DATE]2026-01-15 02:45:36+08:00
[CATEGORIES]cs.CL
Routing with Generated Data: Annotation-Free LLM Skill Estimation and Expert Selection
[AUTHORS]Tianyi Niu, Justin Chih-Yao Chen, Genta Indra Winata, Shi-Xiong Zhang, Supriyo Chakraborty, Sambit Sahu, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
[ABSTRACT]Large Language Model (LLM) routers dynamically select optimal models for given inputs. Existing approaches typically assume access to ground-truth labeled data, which is often unavailable in practice, especially when user request distributions are heterogeneous and unknown. We introduce Routing with Generated Data (RGD), a challenging setting in which routers are trained exclusively on generated queries and answers produced from high-level task descriptions by generator LLMs. We evaluate query-answer routers (using both queries and labels) and query-only routers across four diverse benchmarks and 12 models, finding that query-answer routers degrade faster than query-only routers as generator quality decreases. Our analysis reveals two crucial characteristics of effective generators: they must accurately respond to their own questions, and their questions must produce sufficient performance differentiation among the model pool. We then show how filtering for these characteristics can improve the quality of generated data. We further propose CASCAL, a novel query-only router that estimates model correctness through consensus voting and identifies model-specific skill niches via hierarchical clustering. CASCAL is substantially more robust to generator quality, outperforming the best query-answer router by 4.6% absolute accuracy when trained on weak generator data.
[COMMENTS]Code: https://github.com/tianyiniu/RoutingGenData
[LINK]http://arxiv.org/abs/2601.09692v1
[DATE]2026-01-15 02:43:32+08:00
[CATEGORIES]cs.CL cs.LG
DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation
[AUTHORS]Yibo Wang, Lei Wang, Yue Deng, Keming Wu, Yao Xiao, Huanjin Yao, Liwei Kang, Hai Ye, Yongcheng Jing, Lidong Bing
[ABSTRACT]Deep research systems are widely used for multi-step web research, analysis, and cross-source synthesis, yet their evaluation remains challenging. Existing benchmarks often require annotation-intensive task construction, rely on static evaluation dimensions, or fail to reliably verify facts when citations are missing. To bridge these gaps, we introduce DeepResearchEval, an automated framework for deep research task construction and agentic evaluation. For task construction, we propose a persona-driven pipeline generating realistic, complex research tasks anchored in diverse user profiles, applying a two-stage filter Task Qualification and Search Necessity to retain only tasks requiring multi-source evidence integration and external retrieval. For evaluation, we propose an agentic pipeline with two components: an Adaptive Point-wise Quality Evaluation that dynamically derives task-specific evaluation dimensions, criteria, and weights conditioned on each generated task, and an Active Fact-Checking that autonomously extracts and verifies report statements via web search, even when citations are missing.
[COMMENTS]Source code: https://github.com/Infinity-AILab/DeepResearchEval
[LINK]http://arxiv.org/abs/2601.09688v1
[DATE]2026-01-15 02:38:31+08:00
[CATEGORIES]cs.CL
Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection
[AUTHORS]Ziyu Yang, Guibin Chen, Yuxin Yang, Aoxiong Zeng, Xiangquan Yang
[ABSTRACT]Multi-Task Learning (MTL) combined with Low-Rank Adaptation (LoRA) has emerged as a promising direction for parameter-efficient deployment of Large Language Models (LLMs). By sharing a single adapter across multiple tasks, one can significantly reduce storage overhead. However, this approach suffers from negative transfer, where conflicting gradient updates from distinct tasks degrade the performance of individual tasks compared to single-task fine-tuning. This problem is exacerbated in LoRA due to the low-rank constraint, which limits the optimization landscape’s capacity to accommodate diverse task requirements. In this paper, we propose Ortho-LoRA, a gradient projection method specifically tailored for the bipartite structure of LoRA. Ortho-LoRA dynamically projects conflicting task gradients onto the orthogonal complement of each other within the intrinsic LoRA subspace. Extensive experiments on the GLUE benchmark demonstrate that Ortho-LoRA effectively mitigates task interference, outperforming standard joint training and recovering 95\% of the performance gap between multi-task and single-task baselines with negligible computational overhead.
[COMMENTS]preprint
[LINK]http://arxiv.org/abs/2601.09684v1
[DATE]2026-01-15 02:36:22+08:00
[CATEGORIES]cs.LG cs.CL
The Geometry of Thought: Disclosing the Transformer as a Tropical Polynomial Circuit
[AUTHORS]Faruk Alpay, Bilge Senturk
[ABSTRACT]We prove that the Transformer self-attention mechanism in the high-confidence regime ($β\to \infty$, where $β$ is an inverse temperature) operates in the tropical semiring (max-plus algebra). In particular, we show that taking the tropical limit of the softmax attention converts it into a tropical matrix product. This reveals that the Transformer’s forward pass is effectively executing a dynamic programming recurrence (specifically, a Bellman-Ford path-finding update) on a latent graph defined by token similarities. Our theoretical result provides a new geometric perspective for chain-of-thought reasoning: it emerges from an inherent shortest-path (or longest-path) algorithm being carried out within the network’s computation.
[COMMENTS]7 pages, 2 figures
[LINK]http://arxiv.org/abs/2601.09775v1
[DATE]2026-01-15 02:05:42+08:00
[CATEGORIES]cs.LG cs.CL
Non-Linear Scoring Model for Translation Quality Evaluation
[AUTHORS]Serge Gladkoff, Lifeng Han, Katerina Gasova
[ABSTRACT]Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition.
Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size.
Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model
E(x) = a * ln(1 + b * x), a, b > 0,
anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added.
The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.
[COMMENTS]Technical report, 31 pages
[LINK]http://arxiv.org/abs/2511.13467v4
[DATE]2026-01-15 01:45:20+08:00
[CATEGORIES]cs.CL
TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion
[AUTHORS]Sahil Mishra, Srinitish Srinivasan, Srikanta Bedathur, Tanmoy Chakraborty
[ABSTRACT]Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet, manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric “is-a” relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experimentation on five benchmark datasets demonstrates that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.
[COMMENTS]Accepted in The Web Conference (WWW) 2026
[LINK]http://arxiv.org/abs/2601.09633v1
[DATE]2026-01-15 01:08:37+08:00
[CATEGORIES]cs.CL
Mathematical Derivation Graphs: A Relation Extraction Task in STEM Manuscripts
[AUTHORS]Vishesh Prasad, Brian Kim, Nickvash Kani
[ABSTRACT]Recent advances in natural language processing (NLP), particularly with the emergence of large language models (LLMs), have significantly enhanced the field of textual analysis. However, while these developments have yielded substantial progress in analyzing natural language text, applying analysis to mathematical equations and their relationships within texts has produced mixed results. This paper takes the initial steps in expanding the problem of relation extraction towards understanding the dependency relationships between mathematical expressions in STEM articles. The authors construct the Mathematical Derivation Graphs Dataset (MDGD), sourced from a random sampling of the arXiv corpus, containing an analysis of $107$ published STEM manuscripts with over $2000$ manually labeled inter-equation dependency relationships, resulting in a new object referred to as a derivation graph that summarizes the mathematical content of the manuscript. The authors exhaustively evaluate analytical and machine learning (ML) based models to assess their capability to identify and extract the derivation relationships for each article and compare the results with the ground truth. The authors show that the best tested LLMs achieve $F_1$ scores of $\sim45\%-52\%$, and attempt to improve their performance by combining them with analytic algorithms and other methods.
[COMMENTS]29 pages, 11 figures
[LINK]http://arxiv.org/abs/2410.21324v2
[DATE]2026-01-15 01:06:58+08:00
[CATEGORIES]cs.CL cs.LG
LLMs Got Rhythm? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
[AUTHORS]Stergios Chatzikyriakidis
[ABSTRACT]Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant “Reasoning Gap”: while native-like models (Claude 3.7) perform intuitively (40\% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54\%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4\% valid poems), while our hybrid verification loop restores performance to 73.1\%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
[LINK]http://arxiv.org/abs/2601.09631v1
[DATE]2026-01-15 01:05:17+08:00
[CATEGORIES]cs.CL
Toward Understanding Unlearning Difficulty: A Mechanistic Perspective and Circuit-Guided Difficulty Metric
[AUTHORS]Jiali Cheng, Ziheng Chen, Chirag Agarwal, Hadi Amiri
[ABSTRACT]Machine unlearning is becoming essential for building trustworthy and compliant language models. Yet unlearning success varies considerably across individual samples: some are reliably erased, while others persist despite the same procedure. We argue that this disparity is not only a data-side phenomenon, but also reflects model-internal mechanisms that encode and protect memorized information. We study this problem from a mechanistic perspective based on model circuits–structured interaction pathways that govern how predictions are formed. We propose Circuit-guided Unlearning Difficulty (CUD), a {\em pre-unlearning} metric that assigns each sample a continuous difficulty score using circuit-level signals. Extensive experiments demonstrate that CUD reliably separates intrinsically easy and hard samples, and remains stable across unlearning methods. We identify key circuit-level patterns that reveal a mechanistic signature of difficulty: easy-to-unlearn samples are associated with shorter, shallower interactions concentrated in earlier-to-intermediate parts of the original model, whereas hard samples rely on longer and deeper pathways closer to late-stage computation. Compared to existing qualitative studies, CUD takes a first step toward a principled, fine-grained, and interpretable analysis of unlearning difficulty; and motivates the development of unlearning methods grounded in model mechanisms.
[LINK]http://arxiv.org/abs/2601.09624v1
[DATE]2026-01-15 00:55:58+08:00
[CATEGORIES]cs.LG cs.CL
LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models
[AUTHORS]Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang, Zongmou Huang, Yangdi Xu, Bowen Qin, Zheqi He, Xi Yang, Changjin Li, Yonghua Lin
[ABSTRACT]The rapid advancement of large language models (LLMs) has not been matched by their evaluation in low-resource languages, especially Southeast Asian languages like Lao. To fill this gap, we introduce \textbf{LaoBench}, the first large-scale, high-quality, and multidimensional benchmark for assessing LLM language understanding and reasoning in Lao. LaoBench contains \textbf{17,000+} expert-curated samples across three dimensions: culturally grounded knowledge application, curriculum-aligned K12 education, and bilingual translation among Lao, Chinese, and English. It includes open-source and held-out subsets, where the held-out portion enables secure black-box evaluation via a controlled service to improve fairness and data security. We construct LaoBench with a hybrid pipeline that combines expert authoring with agent-assisted verification, ensuring linguistic accuracy, cultural relevance, and educational validity. We evaluate diverse state-of-the-art open-source and closed-source LLMs, and find that even strong multilingual models lag behind human experts, particularly in culturally grounded reasoning and translation fidelity. We hope LaoBench will catalyze research on Lao and other underrepresented Southeast Asian languages for more inclusive multilingual evaluation.
[LINK]http://arxiv.org/abs/2511.11334v2
[DATE]2026-01-15 00:47:57+08:00
[CATEGORIES]cs.CL
DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing
[AUTHORS]Qian Cao, Yahui Liu, Wei Bi, Yi Zhao, Ruihua Song, Xiting Wang, Ruiming Tang, Guorui Zhou, Han Li
[ABSTRACT]Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity, undermining their utility in open-ended tasks like creative writing. Current methods lack explicit mechanisms for guiding diverse exploration and instead prioritize optimization efficiency and performance over diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT), in which the generation process is decomposed into explicitly planned intermediate steps. We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation, alongside a group-aware diversity reward to encourage distinct trajectories. Experimental results on creative writing benchmarks demonstrate that our approach significantly improves output diversity without compromising generation quality, consistently outperforming existing baselines.
[LINK]http://arxiv.org/abs/2601.09609v1
[DATE]2026-01-15 00:30:20+08:00
[CATEGORIES]cs.CL
CoGen: Creation of Reusable UI Components in Figma via Textual Commands
[AUTHORS]Ishani Kanapathipillai, Obhasha Priyankara
[ABSTRACT]The evolution of User Interface design has emphasized the need for efficient, reusable, and editable components to ensure an efficient design process. This research introduces CoGen, a system that uses machine learning techniques to generate reusable UI components directly in Figma, one of the most popular UI design tools. Addressing gaps in current systems, CoGen focuses on creating atomic components such as buttons, labels, and input fields using structured JSON and natural language prompts.
The project integrates Figma API data extraction, Seq2Seq models, and fine-tuned T5 transformers for component generation. The key results demonstrate the efficiency of the T5 model in prompt generation, with an accuracy of 98% and a BLEU score of 0.2668, which ensures the mapping of JSON to descriptive prompts. For JSON creation, CoGen achieves a success rate of up to 100% in generating simple JSON outputs for specified component types.
[COMMENTS]8 pages, 6 figures, 11 tables
[LINK]http://arxiv.org/abs/2601.10536v1
[DATE]2026-01-15 23:57:59+08:00
[CATEGORIES]cs.LG
Coarsening Causal DAG Models
[AUTHORS]Francisco Madaleno, Pratik Misra, Alex Markham
[ABSTRACT]Directed acyclic graphical (DAG) models are a powerful tool for representing causal relationships among jointly distributed random variables, especially concerning data from across different experimental settings. However, it is not always practical or desirable to estimate a causal model at the granularity of given features in a particular dataset. There is a growing body of research on causal abstraction to address such problems. We contribute to this line of research by (i) providing novel graphical identifiability results for practically-relevant interventional settings, (ii) proposing an efficient, provably consistent algorithm for directly learning abstract causal graphs from interventional data with unknown intervention targets, and (iii) uncovering theoretical insights about the lattice structure of the underlying search space, with connections to the field of causal discovery more generally. As proof of concept, we apply our algorithm on synthetic and real datasets with known ground truths, including measurements from a controlled physical system with interacting light intensity and polarization.
[COMMENTS]25 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.10531v1
[DATE]2026-01-15 23:56:20+08:00
[CATEGORIES]cs.LG
Quantum circuit complexity and unsupervised machine learning of topological order
[AUTHORS]Yanming Che, Clemens Gneiting, Xiaoguang Wang, Franco Nori
[ABSTRACT]Inspired by the close relationship between Kolmogorov complexity and unsupervised machine learning, we explore quantum circuit complexity, an important concept in quantum computation and quantum information science, as a pivot to understand and to build interpretable and efficient unsupervised machine learning for topological order in quantum many-body systems. We argue that Nielsen’s quantum circuit complexity represents an intrinsic topological distance between topological quantum many-body phases of matter, and as such plays a central role in interpretable manifold learning of topological order. To span a bridge from conceptual power to practical applicability, we present two theorems that connect Nielsen’s quantum circuit complexity for the quantum path planning between two arbitrary quantum many-body states with quantum Fisher complexity (Bures distance) and entanglement generation, respectively. Leveraging these connections, fidelity-based and entanglement-based similarity measures or kernels, which are more practical for implementation, are formulated. Using the two proposed distance measures, unsupervised manifold learning of quantum phases of the bond-alternating XXZ spin chain, the ground state of Kitaev’s toric code and random product states, is conducted, demonstrating their superior performance. Moreover, we find that the entanglement-based approach, which captures the long-range structure of quantum entanglement of topological orders, is more robust to local Haar random noises. Relations with classical shadow tomography and shadow kernel learning are also discussed, where the latter can be naturally understood from our approach. Our results establish connections between key concepts and tools of quantum circuit computation, quantum complexity, quantum metrology, and machine learning of topological quantum order.
[COMMENTS]Updated version; With enriched Supplementary Information; 23 pages; 5 figures. Code is available upon reasonable request, and will be open-sourced along with the publication. Comments are welcome
[LINK]http://arxiv.org/abs/2508.04486v2
[DATE]2026-01-15 23:54:45+08:00
[CATEGORIES]cs.LG
Transformer-Based Cognitive Radio: Adaptive Modulation Strategies Using Transformer Models
[AUTHORS]Andrea Melis, Andrea Piroddi, Roberto Girau
[ABSTRACT]Cognitive Radio (CR) systems, which dynamically adapt to changing spectrum environments, could benefit significantly from advancements in machine learning technologies. These systems can be enhanced in terms of spectral efficiency, robustness, and security through innovative approaches such as the use of Transformer models. This work investigates the application of Transformer models, specifically the GPT-2 architecture, to generate novel modulation schemes for wireless communications. By training a GPT-2 model on a dataset of existing modulation formulas, new modulation schemes has been created. These generated schemes are then compared to traditional methods using key performance metrics such as Signal-to-Noise Ratio (SNR) and Power Spectrum Density (PSD). The results show that Transformer-generated modulation schemes can achieve performance comparable to, and in some cases outperforming, traditional methods. This demonstrates that advanced CR systems could greatly benefit from the implementation of Transformer models, leading to more efficient, robust, and secure communication systems.
[LINK]http://arxiv.org/abs/2601.10519v1
[DATE]2026-01-15 23:46:22+08:00
[CATEGORIES]cs.LG
Curvature Tuning: Provable Training-free Model Steering From a Single Parameter
[AUTHORS]Leyang Hu, Matteo Gamba, Randall Balestriero
[ABSTRACT]The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model’s decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions-thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 8.59%/8.34% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_\infty$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available at https://github.com/Leon-Leyang/curvature-tuning.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.07783v5
[DATE]2026-01-15 23:36:28+08:00
[CATEGORIES]cs.LG
Projected Microbatch Accumulation yields reference-free proximal policy updates for reinforcement learning
[AUTHORS]Nilin Abrahamsen
[ABSTRACT]This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy update method for large language model fine-tuning. PROMA accumulates policy gradients across microbatches by projecting out sequence-wise gradient components before microbatch aggregation. The projection is applied layer-wise during the backward pass, enabling efficient implementation without additional forward or backward passes. Empirically, PROMA enforces tighter control of local KL divergence than GRPO, resulting in more stable policy learning. Unlike PPO and GRPO, PROMA achieves proximal updates without inducing entropy collapse and does not rely on a reference policy or likelihood-ratio clipping.
[LINK]http://arxiv.org/abs/2601.10498v1
[DATE]2026-01-15 23:16:15+08:00
[CATEGORIES]cs.LG
CROCS: A Two-Stage Clustering Framework for Behaviour-Centric Consumer Segmentation with Smart Meter Data
[AUTHORS]Luke W. Yerbury, Ricardo J. G. B. Campello, G. C. Livingston, Mark Goldsworthy, Lachlan O’Neil
[ABSTRACT]With grid operators confronting rising uncertainty from renewable integration and a broader push toward electrification, Demand-Side Management (DSM) – particularly Demand Response (DR) – has attracted significant attention as a cost-effective mechanism for balancing modern electricity systems. Unprecedented volumes of consumption data from a continuing global deployment of smart meters enable consumer segmentation based on real usage behaviours, promising to inform the design of more effective DSM and DR programs. However, existing clustering-based segmentation methods insufficiently reflect the behavioural diversity of consumers, often relying on rigid temporal alignment, and faltering in the presence of anomalies, missing data, or large-scale deployments.
To address these challenges, we propose a novel two-stage clustering framework – Clustered Representations Optimising Consumer Segmentation (CROCS). In the first stage, each consumer’s daily load profiles are clustered independently to form a Representative Load Set (RLS), providing a compact summary of their typical diurnal consumption behaviours. In the second stage, consumers are clustered using the Weighted Sum of Minimum Distances (WSMD), a novel set-to-set measure that compares RLSs by accounting for both the prevalence and similarity of those behaviours. Finally, community detection on the WSMD-induced graph reveals higher-order prototypes that embody the shared diurnal behaviours defining consumer groups, enhancing the interpretability of the resulting clusters.
Extensive experiments on both synthetic and real Australian smart meter datasets demonstrate that CROCS captures intra-consumer variability, uncovers both synchronous and asynchronous behavioural similarities, and remains robust to anomalies and missing data, while scaling efficiently through natural parallelisation. These results…
[LINK]http://arxiv.org/abs/2601.10494v1
[DATE]2026-01-15 23:13:54+08:00
[CATEGORIES]cs.LG
Communication-Efficient Federated Learning by Exploiting Spatio-Temporal Correlations of Gradients
[AUTHORS]Shenlong Zheng, Zhen Zhang, Yuhui Deng, Geyong Min, Lin Cui
[ABSTRACT]Communication overhead is a critical challenge in federated learning, particularly in bandwidth-constrained networks. Although many methods have been proposed to reduce communication overhead, most focus solely on compressing individual gradients, overlooking the temporal correlations among them. Prior studies have shown that gradients exhibit spatial correlations, typically reflected in low-rank structures. Through empirical analysis, we further observe a strong temporal correlation between client gradients across adjacent rounds. Based on these observations, we propose GradESTC, a compression technique that exploits both spatial and temporal gradient correlations. GradESTC exploits spatial correlations to decompose each full gradient into a compact set of basis vectors and corresponding combination coefficients. By exploiting temporal correlations, only a small portion of the basis vectors need to be dynamically updated in each round. GradESTC significantly reduces communication overhead by transmitting lightweight combination coefficients and a limited number of updated basis vectors instead of the full gradients. Extensive experiments show that, upon reaching a target accuracy level near convergence, GradESTC reduces uplink communication by an average of 39.79% compared to the strongest baseline, while maintaining comparable convergence speed and final accuracy to uncompressed FedAvg. By effectively leveraging spatio-temporal gradient structures, GradESTC offers a practical and scalable solution for communication-efficient federated learning.
[LINK]http://arxiv.org/abs/2601.10491v1
[DATE]2026-01-15 23:11:41+08:00
[CATEGORIES]cs.LG
Continuous Diffusion for Mixed-Type Tabular Data
[AUTHORS]Markus Mueller, Kathrin Gruber, Dennis Fok
[ABSTRACT]Score-based generative models, commonly referred to as diffusion models, have proven to be successful at generating text and image data. However, their adaptation to mixed-type tabular data remains underexplored. In this work, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. CDTD is based on a novel combination of score matching and score interpolation to enforce a unified continuous noise distribution for both continuous and categorical features. We explicitly acknowledge the necessity of homogenizing distinct data types by relying on model-specific loss calibration and initialization schemes. To further address the high heterogeneity in mixed-type tabular data, we introduce adaptive feature- or type-specific noise schedules. These ensure balanced generative performance across features and optimize the allocation of model capacity across features and diffusion time. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models, captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality. Replication code is available at https://github.com/muellermarkus/cdtd.
[COMMENTS]published at ICLR 2025
[LINK]http://arxiv.org/abs/2312.10431v6
[DATE]2026-01-15 23:08:41+08:00
[CATEGORIES]cs.LG
H-EFT-VA: An Effective-Field-Theory Variational Ansatz with Provable Barren Plateau Avoidance
[AUTHORS]Eyad I. B Hamid
[ABSTRACT]Variational Quantum Algorithms (VQAs) are critically threatened by the Barren Plateau (BP) phenomenon. In this work, we introduce the H-EFT Variational Ansatz (H-EFT-VA), an architecture inspired by Effective Field Theory (EFT). By enforcing a hierarchical “UV-cutoff” on initialization, we theoretically restrict the circuit’s state exploration, preventing the formation of approximate unitary 2-designs. We provide a rigorous proof that this localization guarantees an inverse-polynomial lower bound on the gradient variance: $Var[\partial θ] \in Ω(1/poly(N))$. Crucially, unlike approaches that avoid BPs by limiting entanglement, we demonstrate that H-EFT-VA maintains volume-law entanglement and near-Haar purity, ensuring sufficient expressibility for complex quantum states. Extensive benchmarking across 16 experiments – including Transverse Field Ising and Heisenberg XXZ models – confirms a 109x improvement in energy convergence and a 10.7x increase in ground-state fidelity over standard Hardware-Efficient Ansatze (HEA), with a statistical significance of $p < 10^{-88}$.
[COMMENTS]7 pages, 5 figuers, Appendix
[LINK]http://arxiv.org/abs/2601.10479v1
[DATE]2026-01-15 23:01:16+08:00
[CATEGORIES]cs.LG
DeFlow: Decoupling Manifold Modeling and Value Maximization for Offline Policy Extraction
[AUTHORS]Zhancun Mu
[ABSTRACT]We present DeFlow, a decoupled offline RL framework that leverages flow matching to faithfully capture complex behavior manifolds. Optimizing generative policies is computationally prohibitive, typically necessitating backpropagation through ODE solvers. We address this by learning a lightweight refinement module within an explicit, data-derived trust region of the flow manifold, rather than sacrificing the iterative generation capability via single-step distillation. This way, we bypass solver differentiation and eliminate the need for balancing loss terms, ensuring stable improvement while fully preserving the flow’s iterative expressivity. Empirically, DeFlow achieves superior performance on the challenging OGBench benchmark and demonstrates efficient offline-to-online adaptation.
[COMMENTS]13 pages, 3 figures
[LINK]http://arxiv.org/abs/2601.10471v1
[DATE]2026-01-15 22:56:57+08:00
[CATEGORIES]cs.LG
Instance-level quantitative saliency in multiple sclerosis lesion segmentation
[AUTHORS]Federico Spagnolo, Nataliia Molchanova, Meritxell Bach Cuadra, Mario Ocampo Pineda, Lester Melie-Garcia, Cristina Granziera, Vincent Andrearczyk, Adrien Depeursinge
[ABSTRACT]Explainable artificial intelligence (XAI) methods have been proposed to interpret model decisions in classification and, more recently, in semantic segmentation. However, instance-level XAI for semantic segmentation, namely explanations focused on a single object among multiple instances of the same class, remains largely unexplored. Such explanations are particularly important in multi-lesional diseases to understand what drives the detection and contouring of a specific lesion. We propose instance-level explanation maps for semantic segmentation by extending SmoothGrad and Grad-CAM++ to obtain quantitative instance saliency. These methods were applied to the segmentation of white matter lesions (WMLs), a magnetic resonance imaging biomarker in multiple sclerosis. We used 4023 FLAIR and MPRAGE MRI scans from 687 patients collected at the University Hospital of Basel, Switzerland, with WML masks annotated by four expert clinicians. Three deep learning architectures, a 3D U-Net, nnU-Net, and Swin UNETR, were trained and evaluated, achieving normalized Dice scores of 0.71, 0.78, and 0.80, respectively. Instance saliency maps showed that the models relied primarily on FLAIR rather than MPRAGE for WML segmentation, with positive saliency inside lesions and negative saliency in their immediate neighborhood, consistent with clinical practice. Peak saliency values differed significantly across correct and incorrect predictions, suggesting that quantitative instance saliency may help identify segmentation errors. In conclusion, we introduce two architecture-agnostic XAI methods that provide quantitative instance-level explanations for semantic segmentation and support clinically meaningful interpretation of model decisions.
[LINK]http://arxiv.org/abs/2406.09335v3
[DATE]2026-01-15 22:55:41+08:00
[CATEGORIES]cs.LG
Towards Understanding Deep Learning Model in Image Recognition via Coverage Test
[AUTHORS]Wenkai Li, Xiaoqi Li, Yingjie Mao, Yishun Wang
[ABSTRACT]Deep neural networks (DNNs) play a crucial role in the field of artificial intelligence, and their security-related testing has been a prominent research focus. By inputting test cases, the behavior of models is examined for anomalies, and coverage metrics are utilized to determine the extent of neurons covered by these test cases. With the widespread application and advancement of DNNs, different types of neural behaviors have garnered attention, leading to the emergence of various coverage metrics for neural networks. However, there is currently a lack of empirical research on these coverage metrics, specifically in analyzing the relationships and patterns between model depth, configuration information, and neural network coverage. This paper aims to investigate the relationships and patterns of four coverage metrics: primary functionality, boundary, hierarchy, and structural coverage. A series of empirical experiments were conducted, selecting LeNet, VGG, and ResNet as different DNN architectures, along with 10 models of varying depths ranging from 5 to 54 layers, to compare and study the relationships between different depths, configuration information, and various neural network coverage metrics. Additionally, an investigation was carried out on the relationships between modified decision/condition coverage and dataset size. Finally, three potential future directions are proposed to further contribute to the security testing of DNN Models.
[LINK]http://arxiv.org/abs/2505.08814v2
[DATE]2026-01-15 22:53:58+08:00
[CATEGORIES]cs.LG
LangLasso: Interactive Cluster Descriptions through LLM Explanation
[AUTHORS]Raphael Buchmüller, Dennis Collaris, Linhao Meng, Angelos Chatzimparmpas
[ABSTRACT]Dimensionality reduction is a powerful technique for revealing structure and potential clusters in data. However, as the axes are complex, non-linear combinations of features, they often lack semantic interpretability. Existing visual analytics (VA) methods support cluster interpretation through feature comparison and interactive exploration, but they require technical expertise and intense human effort. We present \textit{LangLasso}, a novel method that complements VA approaches through interactive, natural language descriptions of clusters using large language models (LLMs). It produces human-readable descriptions that make cluster interpretation accessible to non-experts and allow integration of external contextual knowledge beyond the dataset. We systematically evaluate the reliability of these explanations and demonstrate that \langlasso provides an effective first step for engaging broader audiences in cluster interpretation. The tool is available at https://langlasso.vercel.app
[COMMENTS]This manuscript is accepted for publication in VIS 2025 VISxGenAI Workshop
[LINK]http://arxiv.org/abs/2601.10458v1
[DATE]2026-01-15 22:49:24+08:00
[CATEGORIES]cs.LG
Stable Differentiable Modal Synthesis for Learning Nonlinear Dynamics
[AUTHORS]Victor Zheleznov, Stefan Bilbao, Alec Wright, Simon King
[ABSTRACT]Modal methods are a long-standing approach to physical modelling synthesis. Extensions to nonlinear problems are possible, including the case of a high-amplitude vibration of a string. A modal decomposition leads to a densely coupled nonlinear system of ordinary differential equations. Recent work in scalar auxiliary variable techniques has enabled construction of explicit and stable numerical solvers for such classes of nonlinear systems. On the other hand, machine learning approaches (in particular neural ordinary differential equations) have been successful in modelling nonlinear systems automatically from data. In this work, we examine how scalar auxiliary variable techniques can be combined with neural ordinary differential equations to yield a stable differentiable model capable of learning nonlinear dynamics. The proposed approach leverages the analytical solution for linear vibration of system’s modes so that physical parameters of a system remain easily accessible after the training without the need for a parameter encoder in the model architecture. As a proof of concept, we generate synthetic data for the nonlinear transverse vibration of a string and show that the model can be trained to reproduce the nonlinear dynamics of the system. Sound examples are presented.
[COMMENTS]Submitted to the Journal of Audio Engineering Society (December 2025)
[LINK]http://arxiv.org/abs/2601.10453v1
[DATE]2026-01-15 22:43:21+08:00
[CATEGORIES]cs.LG
AgentGuardian: Learning Access Control Policies to Govern AI Agent Behavior
[AUTHORS]Nadya Abaev, Denis Klimov, Gerard Levinov, David Mimran, Yuval Elovici, Asaf Shabtai
[ABSTRACT]Artificial intelligence (AI) agents are increasingly used in a variety of domains to automate tasks, interact with users, and make decisions based on data inputs. Ensuring that AI agents perform only authorized actions and handle inputs appropriately is essential for maintaining system integrity and preventing misuse. In this study, we introduce the AgentGuardian, a novel security framework that governs and protects AI agent operations by enforcing context-aware access-control policies. During a controlled staging phase, the framework monitors execution traces to learn legitimate agent behaviors and input patterns. From this phase, it derives adaptive policies that regulate tool calls made by the agent, guided by both real-time input context and the control flow dependencies of multi-step agent actions. Evaluation across two real-world AI agent applications demonstrates that AgentGuardian effectively detects malicious or misleading inputs while preserving normal agent functionality. Moreover, its control-flow-based governance mechanism mitigates hallucination-driven errors and other orchestration-level malfunctions.
[COMMENTS]14 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.10440v1
[DATE]2026-01-15 22:33:36+08:00
[CATEGORIES]cs.LG
Reinforcement Learning with Multi-Step Lookahead Information Via Adaptive Batching
[AUTHORS]Nadav Merlis
[ABSTRACT]We study tabular reinforcement learning problems with multiple steps of lookahead information. Before acting, the learner observes $\ell$ steps of future transition and reward realizations: the exact state the agent would reach and the rewards it would collect under any possible course of action. While it has been shown that such information can drastically boost the value, finding the optimal policy is NP-hard, and it is common to apply one of two tractable heuristics: processing the lookahead in chunks of predefined sizes (‘fixed batching policies’), and model predictive control. We first illustrate the problems with these two approaches and propose utilizing the lookahead in adaptive (state-dependent) batches; we refer to such policies as adaptive batching policies (ABPs). We derive the optimal Bellman equations for these strategies and design an optimistic regret-minimizing algorithm that enables learning the optimal ABP when interacting with unknown environments. Our regret bounds are order-optimal up to a potential factor of the lookahead horizon $\ell$, which can usually be considered a small constant.
[LINK]http://arxiv.org/abs/2601.10418v1
[DATE]2026-01-15 22:09:49+08:00
[CATEGORIES]cs.LG
CS-GBA: A Critical Sample-based Gradient-guided Backdoor Attack for Offline Reinforcement Learning
[AUTHORS]Yuanjie Zhao, Junnan Qiu, Yue Ding, Jie Li
[ABSTRACT]Offline Reinforcement Learning (RL) enables policy optimization from static datasets but is inherently vulnerable to backdoor attacks. Existing attack strategies typically struggle against safety-constrained algorithms (e.g., CQL) due to inefficient random poisoning and the use of easily detectable Out-of-Distribution (OOD) triggers. In this paper, we propose CS-GBA (Critical Sample-based Gradient-guided Backdoor Attack), a novel framework designed to achieve high stealthiness and destructiveness under a strict budget. Leveraging the theoretical insight that samples with high Temporal Difference (TD) errors are pivotal for value function convergence, we introduce an adaptive Critical Sample Selection strategy that concentrates the attack budget on the most influential transitions. To evade OOD detection, we propose a Correlation-Breaking Trigger mechanism that exploits the physical mutual exclusivity of state features (e.g., 95th percentile boundaries) to remain statistically concealed. Furthermore, we replace the conventional label inversion with a Gradient-Guided Action Generation mechanism, which searches for worst-case actions within the data manifold using the victim Q-network’s gradient. Empirical results on D4RL benchmarks demonstrate that our method significantly outperforms state-of-the-art baselines, achieving high attack success rates against representative safety-constrained algorithms with a minimal 5% poisoning budget, while maintaining the agent’s performance in clean environments.
[LINK]http://arxiv.org/abs/2601.10407v1
[DATE]2026-01-15 21:57:52+08:00
[CATEGORIES]cs.LG
Discrete Feynman-Kac Correctors
[AUTHORS]Mohsin Hasan, Viktor Ohanesian, Artem Gazizov, Yoshua Bengio, Alán Aspuru-Guzik, Roberto Bondesan, Marta Skreta, Kirill Neklyudov
[ABSTRACT]Discrete diffusion models have recently emerged as a promising alternative to the autoregressive approach for generating discrete sequences. Sample generation via gradual denoising or demasking processes allows them to capture hierarchical non-sequential interdependencies in the data. These custom processes, however, do not assume a flexible control over the distribution of generated samples. We propose Discrete Feynman-Kac Correctors, a framework that allows for controlling the generated distribution of discrete masked diffusion models at inference time. We derive Sequential Monte Carlo (SMC) algorithms that, given a trained discrete diffusion model, control the temperature of the sampled distribution (i.e. perform annealing), sample from the product of marginals of several diffusion processes (e.g. differently conditioned processes), and sample from the product of the marginal with an external reward function, producing likely samples from the target distribution that also have high reward. Notably, our framework does not require any training of additional models or fine-tuning of the original model. We illustrate the utility of our framework in several applications including: efficient sampling from the annealed Boltzmann distribution of the Ising model, improving the performance of language models for code generation and amortized learning, as well as reward-tilted protein sequence generation.
[COMMENTS]Code: https://github.com/hasanmohsin/discrete_fkc
[LINK]http://arxiv.org/abs/2601.10403v1
[DATE]2026-01-15 21:55:38+08:00
[CATEGORIES]cs.LG
Discrete Solution Operator Learning for Geometry-Dependent PDEs
[AUTHORS]Jinshuai Bai, Haolin Li, Zahra Sharif Khodaei, M. H. Aliabadi, YuanTong Gu, Xi-Qiao Feng
[ABSTRACT]Neural operator learning accelerates PDE solution by approximating operators as mappings between continuous function spaces. Yet in many engineering settings, varying geometry induces discrete structural changes, including topological changes, abrupt changes in boundary conditions or boundary types, and changes in the computational domain, which break the smooth-variation premise. Here we introduce Discrete Solution Operator Learning (DiSOL), a complementary paradigm that learns discrete solution procedures rather than continuous function-space operators. DiSOL factorizes the solver into learnable stages that mirror classical discretizations: local contribution encoding, multiscale assembly, and implicit solution reconstruction on an embedded grid, thereby preserving procedure-level consistency while adapting to geometry-dependent discrete structures. Across geometry-dependent Poisson, advection-diffusion, linear elasticity, as well as spatiotemporal heat conduction problems, DiSOL produces stable and accurate predictions under both in-distribution and strongly out-of-distribution geometries, including discontinuous boundaries and topological changes. These results highlight the need for procedural operator representations in geometry-dominated problems and position discrete solution operator learning as a distinct, complementary direction in scientific machine learning.
[COMMENTS]15 pages main text, 40 pages SI
[LINK]http://arxiv.org/abs/2601.09143v2
[DATE]2026-01-15 21:46:37+08:00
[CATEGORIES]cs.LG
Bootstrap Off-policy with World Model
[AUTHORS]Guojian Zhan, Likun Wang, Xiangteng Zhang, Jiaxin Gao, Masayoshi Tomizuka, Shengbo Eben Li
[ABSTRACT]Online planning has proven effective in reinforcement learning (RL) for improving sample efficiency and final performance. However, using planning for environment interaction inevitably introduces a divergence between the collected data and the policy’s actual behaviors, degrading both model learning and policy improvement. To address this, we propose BOOM (Bootstrap Off-policy with WOrld Model), a framework that tightly integrates planning and off-policy learning through a bootstrap loop: the policy initializes the planner, and the planner refines actions to bootstrap the policy through behavior alignment. This loop is supported by a jointly learned world model, which enables the planner to simulate future trajectories and provides value targets to facilitate policy improvement. The core of BOOM is a likelihood-free alignment loss that bootstraps the policy using the planner’s non-parametric action distribution, combined with a soft value-weighted mechanism that prioritizes high-return behaviors and mitigates variability in the planner’s action quality within the replay buffer. Experiments on the high-dimensional DeepMind Control Suite and Humanoid-Bench show that BOOM achieves state-of-the-art results in both training stability and final performance. The code is accessible at https://github.com/molumitu/BOOM_MBRL.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2511.00423v3
[DATE]2026-01-15 21:38:44+08:00
[CATEGORIES]cs.LG
Probabilistic Insights for Efficient Exploration Strategies in Reinforcement Learning
[AUTHORS]Ernesto Garcia, Paola Bermolen, Matthieu Jonckheere, Seva Shneer
[ABSTRACT]We investigate efficient exploration strategies of environments with unknown stochastic dynamics and sparse rewards. Specifically, we analyze first the impact of parallel simulations on the probability of reaching rare states within a finite time budget. Using simplified models based on random walks and Lévy processes, we provide analytical results that demonstrate a phase transition in reaching probabilities as a function of the number of parallel simulations. We identify an optimal number of parallel simulations that balances exploration diversity and time allocation. Additionally, we analyze a restarting mechanism that exponentially enhances the probability of success by redirecting efforts toward more promising regions of the state space. Our findings contribute to a more qualitative and quantitative theory of some exploration schemes in reinforcement learning, offering insights into developing more efficient strategies for environments characterized by rare events.
[LINK]http://arxiv.org/abs/2503.03565v2
[DATE]2026-01-15 21:24:06+08:00
[CATEGORIES]cs.LG
Entropy Production in Machine Learning Under Fokker-Planck Probability Flow
[AUTHORS]Lennon Shikhman
[ABSTRACT]Machine learning models deployed in nonstationary environments inevitably experience performance degradation due to data drift. While numerous drift detection heuristics exist, most lack a dynamical interpretation and provide limited guidance on how retraining decisions should be balanced against operational cost. In this work, we propose an entropy-based retraining framework grounded in nonequilibrium statistical physics. Interpreting drift as probability flow governed by a Fokker-Planck equation, we quantify model-data mismatch using relative entropy and show that its time derivative admits an entropy-balance decomposition featuring a nonnegative entropy production term driven by probability currents. Guided by this theory, we implement an entropy-triggered retraining policy using an exponentially weighted moving-average (EWMA) control statistic applied to a streaming kernel density estimator of the Kullback-Leibler divergence. We evaluate this approach across multiple nonstationary data streams. In synthetic, financial, and web-traffic domains, entropy-based retraining achieves predictive performance comparable to frequent retraining while reducing retraining frequency by one to two orders of magnitude. However, in a challenging biomedical ECG setting, the entropy-based trigger underperforms the maximum-frequency baseline, highlighting limitations of feature-space entropy monitoring under complex label-conditional drift.
[COMMENTS]12 pages, 4 figures, 1 table
[LINK]http://arxiv.org/abs/2601.00554v2
[DATE]2026-01-15 21:22:48+08:00
[CATEGORIES]cs.LG
A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models
[AUTHORS]Leah Bar, Liron Mor Yosef, Shai Zucker, Neta Shoham, Inbar Seroussi, Nir Sochen
[ABSTRACT]Most models of generative AI for images assume that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models the low dimensional nature of the data manifest itself by the introduction of a lower dimensional latent space. Yet, the probability distribution in the latent or the manifold’s coordinate space is considered uninteresting and is predefined or considered uniform. In this study, we address the problem of Blind Image Denoising (BID), and to some extent, the problem of generating images from noise by unifying geometric and probabilistic perspectives. We introduce a novel framework that improves upon existing probabilistic approaches by incorporating geometric assumptions that enable the effective use of kernel-based probabilistic methods. Furthermore, the proposed framework extends prior geometric approaches by combining explicit and implicit manifold descriptions through the introduction of a distance function. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of “good images”. This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.
[LINK]http://arxiv.org/abs/2510.00666v2
[DATE]2026-01-15 21:08:37+08:00
[CATEGORIES]cs.LG
PLGC: Pseudo-Labeled Graph Condensation
[AUTHORS]Jay Nandy, Arnab Kumar Mondal, Anuj Rathore, Mahesh Chandran
[ABSTRACT]Large graph datasets make training graph neural networks (GNNs) computationally costly. Graph condensation methods address this by generating small synthetic graphs that approximate the original data. However, existing approaches rely on clean, supervised labels, which limits their reliability when labels are scarce, noisy, or inconsistent. We propose Pseudo-Labeled Graph Condensation (PLGC), a self-supervised framework that constructs latent pseudo-labels from node embeddings and optimizes condensed graphs to match the original graph’s structural and feature statistics – without requiring ground-truth labels. PLGC offers three key contributions: (1) A diagnosis of why supervised condensation fails under label noise and distribution shift. (2) A label-free condensation method that jointly learns latent prototypes and node assignments. (3) Theoretical guarantees showing that pseudo-labels preserve latent structural statistics of the original graph and ensure accurate embedding alignment. Empirically, across node classification and link prediction tasks, PLGC achieves competitive performance with state-of-the-art supervised condensation methods on clean datasets and exhibits substantial robustness under label noise, often outperforming all baselines by a significant margin. Our findings highlight the practical and theoretical advantages of self-supervised graph condensation in noisy or weakly-labeled environments.
[LINK]http://arxiv.org/abs/2601.10358v1
[DATE]2026-01-15 21:02:31+08:00
[CATEGORIES]cs.LG
EvoMorph: Counterfactual Explanations for Continuous Time-Series Extrinsic Regression Applied to Photoplethysmography
[AUTHORS]Mesut Ceylan, Alexis Tabin, Patrick Langer, Elgar Fleisch, Filipe Barata
[ABSTRACT]Wearable devices enable continuous, population-scale monitoring of physiological signals, such as photoplethysmography (PPG), creating new opportunities for data-driven clinical assessment. Time-series extrinsic regression (TSER) models increasingly leverage PPG signals to estimate clinically relevant outcomes, including heart rate, respiratory rate, and oxygen saturation. For clinical reasoning and trust, however, single point estimates alone are insufficient: clinicians must also understand whether predictions are stable under physiologically plausible variations and to what extent realistic, attainable changes in physiological signals would meaningfully alter a model’s prediction. Counterfactual explanations (CFE) address these “what-if” questions, yet existing time series CFE generation methods are largely restricted to classification, overlook waveform morphology, and often produce physiologically implausible signals, limiting their applicability to continuous biomedical time series. To address these limitations, we introduce EvoMorph, a multi-objective evolutionary framework for generating physiologically plausible and diverse CFE for TSER applications. EvoMorph optimizes morphology-aware objectives defined on interpretable signal descriptors and applies transformations to preserve the waveform structure. We evaluated EvoMorph on three PPG datasets (heart rate, respiratory rate, and oxygen saturation) against a nearest-unlike-neighbor baseline. In addition, in a case study, we evaluated EvoMorph as a tool for uncertainty quantification by relating counterfactual sensitivity to bootstrap-ensemble uncertainty and data-density measures. Overall, EvoMorph enables the generation of physiologically-aware counterfactuals for continuous biomedical signals and supports uncertainty-aware interpretability, advancing trustworthy model analysis for clinical time-series applications.
[LINK]http://arxiv.org/abs/2601.10356v1
[DATE]2026-01-15 20:58:59+08:00
[CATEGORIES]cs.LG
Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer
[AUTHORS]Jian Feng, Zhihong Huang
[ABSTRACT]Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive \textbf{B}ayesian \textbf{S}ubspace \textbf{Z}eroth-Order \textbf{O}ptimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/γ$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67\% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00$\times$–1.08$\times$ of MeZO).
[COMMENTS]23 pages, 2 figures, 5 tables
[LINK]http://arxiv.org/abs/2601.01452v3
[DATE]2026-01-15 20:44:25+08:00
[CATEGORIES]cs.LG
A reduced-order derivative-informed neural operator for subsurface fluid-flow
[AUTHORS]Jeongjin Park, Grant Bruer, Huseyin Tuna Erdinc, Abhinav Prakash Gahlot, Felix J. Herrmann
[ABSTRACT]Neural operators have emerged as cost-effective surrogates for expensive fluid-flow simulators, particularly in computationally intensive tasks such as permeability inversion from time-lapse seismic data, and uncertainty quantification. In these applications, the fidelity of the surrogate’s gradients with respect to system parameters is crucial, as the accuracy of downstream tasks, such as optimization and Bayesian inference, relies directly on the quality of the derivative information. Recent advances in physics-informed methods have leveraged derivative information to improve surrogate accuracy. However, incorporating explicit Jacobians can become computationally prohibitive, as the complexity typically scales quadratically with the number of input parameters. To address this limitation, we propose DeFINO (Derivative-based Fisher-score Informed Neural Operator), a reduced-order, derivative-informed training framework. DeFINO integrates Fourier neural operators (FNOs) with a novel derivative-based training strategy guided by the Fisher Information Matrix (FIM). By projecting Jacobians onto dominant eigen-directions identified by the FIM, DeFINO captures critical sensitivity information directly informed by observational data, significantly reducing computational expense. We validate DeFINO through synthetic experiments in the context of subsurface multi-phase fluid-flow, demonstrating improvements in gradient accuracy while maintaining robust forward predictions of underlying fluid dynamics. These results highlight DeFINO’s potential to offer practical, scalable solutions for inversion problems in complex real-world scenarios, all at substantially reduced computational cost.
[LINK]http://arxiv.org/abs/2509.13620v2
[DATE]2026-01-15 20:27:56+08:00
[CATEGORIES]cs.LG
Learning Regularization Functionals for Inverse Problems: A Comparative Study
[AUTHORS]Johannes Hertrich, Hok Shing Wong, Alexander Denker, Stanislas Ducotterd, Zhenghan Fang, Markus Haltmeier, Željko Kereta, Erich Kobler, Oscar Leong, Mohammad Sadegh Salehi, Carola-Bibiane Schönlieb, Johannes Schwab, Zakhar Shumaylov, Jeremias Sulam, German Shâma Wache, Martin Zach, Yasi Zhang, Matthias J. Ehrhardt, Sebastian Neumayer
[ABSTRACT]In recent years, a variety of learned regularization frameworks for solving inverse problems in imaging have emerged. These offer flexible modeling together with mathematical insights. The proposed methods differ in their architectural design and training strategies, making direct comparison challenging due to non-modular implementations. We address this gap by collecting and unifying the available code into a common framework. This unified view allows us to systematically compare the approaches and highlight their strengths and limitations, providing valuable insights into their future potential. We also provide concise descriptions of each method, complemented by practical guidelines.
[LINK]http://arxiv.org/abs/2510.01755v2
[DATE]2026-01-15 20:27:37+08:00
[CATEGORIES]cs.LG
An analytic theory of convolutional neural network inverse problems solvers
[AUTHORS]Minh Hai Nguyen, Quoc Bao Do, Edouard Pauwels, Pierre Weiss
[ABSTRACT]Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks outputs (PSNR $\gtrsim25$dB). Furthermore, we provide insights into the differences between \emph{physics-aware} and \emph{physics-agnostic} estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc).
[LINK]http://arxiv.org/abs/2601.10334v1
[DATE]2026-01-15 20:25:59+08:00
[CATEGORIES]cs.LG
Advancing Safe Mechanical Ventilation Using Offline RL With Hybrid Actions and Clinically Aligned Rewards
[AUTHORS]Muhammad Hamza Yousuf, Jason Li, Sahar Vahdati, Raphael Theilen, Jakob Wittenstein, Jens Lehmann
[ABSTRACT]Invasive mechanical ventilation (MV) is a life-sustaining therapy commonly used in the intensive care unit (ICU) for patients with severe and acute conditions. These patients frequently rely on MV for breathing. Given the high risk of death in such cases, optimal MV settings can reduce mortality, minimize ventilator-induced lung injury, shorten ICU stays, and ease the strain on healthcare resources. However, optimizing MV settings remains a complex and error-prone process due to patient-specific variability. While Offline Reinforcement Learning (RL) shows promise for optimizing MV settings, current methods struggle with the hybrid (continuous and discrete) nature of MV settings. Discretizing continuous settings leads to exponential growth in the action space, which limits the number of optimizable settings. Converting the predictions back to continuous can cause a distribution shift, compromising safety and performance. To address this challenge, in the IntelliLung project, we are developing an AI-based approach where we constrain the action space and employ factored action critics. This approach allows us to scale to six optimizable settings compared to 2-3 in previous studies. We adapt SOTA offline RL algorithms to operate directly on hybrid action spaces, avoiding the pitfalls of discretization. We also introduce a clinically grounded reward function based on ventilator-free days and physiological targets. Using multiobjective optimization for reward selection, we show that this leads to a more equitable consideration of all clinically relevant objectives. Notably, we develop a system in close collaboration with healthcare professionals that is aligned with real-world clinical objectives and designed with future deployment in mind.
[COMMENTS]Accepted to AAAI-26
[LINK]http://arxiv.org/abs/2506.14375v2
[DATE]2026-01-15 20:24:48+08:00
[CATEGORIES]cs.LG
Meta Dynamic Graph for Traffic Flow Prediction
[AUTHORS]Yiqing Zou, Hanning Yuan, Qianyu Yang, Ziqiang Yuan, Shuliang Wang, Sijie Ruan
[ABSTRACT]Traffic flow prediction is a typical spatio-temporal prediction problem and has a wide range of applications. The core challenge lies in modeling the underlying complex spatio-temporal dependencies. Various methods have been proposed, and recent studies show that the modeling of dynamics is useful to meet the core challenge. While handling spatial dependencies and temporal dependencies using separate base model structures may hinder the modeling of spatio-temporal correlations, the modeling of dynamics can bridge this gap. Incorporating spatio-temporal heterogeneity also advances the main goal, since it can extend the parameter space and allow more flexibility. Despite these advances, two limitations persist: 1) the modeling of dynamics is often limited to the dynamics of spatial topology (e.g., adjacency matrix changes), which, however, can be extended to a broader scope; 2) the modeling of heterogeneity is often separated for spatial and temporal dimensions, but this gap can also be bridged by the modeling of dynamics. To address the above limitations, we propose a novel framework for traffic prediction, called Meta Dynamic Graph (MetaDG). MetaDG leverages dynamic graph structures of node representations to explicitly model spatio-temporal dynamics. This generates both dynamic adjacency matrices and meta-parameters, extending dynamic modeling beyond topology while unifying the capture of spatio-temporal heterogeneity into a single dimension. Extensive experiments on four real-world datasets validate the effectiveness of MetaDG.
[COMMENTS]Accepted to AAAI 2026
[LINK]http://arxiv.org/abs/2601.10328v1
[DATE]2026-01-15 20:15:54+08:00
[CATEGORIES]cs.LG
Barrier Certificates for Unknown Systems with Latent States and Polynomial Dynamics using Bayesian Inference
[AUTHORS]Robert Lefringhausen, Sami Leon Noel Aziz Hanna, Elias August, Sandra Hirche
[ABSTRACT]Certifying safety in dynamical systems is crucial, but barrier certificates - widely used to verify that system trajectories remain within a safe region - typically require explicit system models. When dynamics are unknown, data-driven methods can be used instead, yet obtaining a valid certificate requires rigorous uncertainty quantification. For this purpose, existing methods usually rely on full-state measurements, limiting their applicability. This paper proposes a novel approach for synthesizing barrier certificates for unknown systems with latent states and polynomial dynamics. A Bayesian framework is employed, where a prior in state-space representation is updated using output data via a targeted marginal Metropolis-Hastings sampler. The resulting samples are used to construct a barrier certificate through a sum-of-squares program. Probabilistic guarantees for its validity with respect to the true, unknown system are obtained by testing on an additional set of posterior samples. The approach and its probabilistic guarantees are illustrated through a numerical simulation.
[COMMENTS]Accepted for publication in the Proceedings of the 64th IEEE Conference on Decision and Control
[LINK]http://arxiv.org/abs/2504.01807v3
[DATE]2026-01-15 19:58:37+08:00
[CATEGORIES]cs.LG
Provably Safe Reinforcement Learning for Stochastic Reach-Avoid Problems with Entropy Regularization
[AUTHORS]Abhijit Mazumdar, Rafal Wisniewski, Manuela L. Bujorianu
[ABSTRACT]We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.
[LINK]http://arxiv.org/abs/2601.08646v2
[DATE]2026-01-15 19:31:58+08:00
[CATEGORIES]cs.LG
Quartet: Native FP4 Training Can Be Optimal for Large Language Models
[AUTHORS]Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, Dan Alistarh
[ABSTRACT]Training large language models (LLMs) models directly in low-precision offers a way to address computational costs by improving both throughput and energy efficiency. For those purposes, NVIDIA’s recent Blackwell architecture facilitates very low-precision operations using FP4 variants. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training with all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this investigation, we design an “optimal” technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 half-precision and to FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
[LINK]http://arxiv.org/abs/2505.14669v4
[DATE]2026-01-15 19:15:06+08:00
[CATEGORIES]cs.LG
Decorrelation Speeds Up Vision Transformers
[AUTHORS]Kieran Carrigg, Rob van Gastel, Melda Yeghaian, Sander Dalm, Faysal Boughorbel, Marcel van Gerven
[ABSTRACT]Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label data regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by integrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. To mimic constrained-data scenarios, we evaluate our approach on ImageNet-1K pre-training and ADE20K fine-tuning using randomly sampled subsets of each dataset. Under this setting, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training.
Keywords: Deep learning, Vision transformers, Efficient AI, Decorrelation
[COMMENTS]20 pages, 12 figures, CVC 2026 camera-ready version
[LINK]http://arxiv.org/abs/2510.14657v3
[DATE]2026-01-15 19:12:10+08:00
[CATEGORIES]cs.LG
Spectral Convolutional Conditional Neural Processes
[AUTHORS]Peiman Mohseni, Nick Duffield
[ABSTRACT]Neural Processes (NPs) are meta-learning models that learn to map sets of observations to approximations of the corresponding posterior predictive distributions. By accommodating variable-sized, unstructured collections of observations and enabling probabilistic predictions at arbitrary query points, NPs provide a flexible framework for modeling functions over continuous domains. Since their introduction, numerous variants have emerged; however, early formulations shared a fundamental limitation: they compressed the observed data into finite-dimensional global representations via aggregation operations such as mean pooling. This strategy induces an intrinsic mismatch with the infinite-dimensional nature of the stochastic processes that NPs intend to model. Convolutional conditional neural processes (ConvCNPs) address this limitation by constructing infinite-dimensional functional embeddings processed through convolutional neural networks (CNNs) to enforce translation equivariance. Yet CNNs with local spatial kernels struggle to capture long-range dependencies without resorting to large kernels, which impose significant computational costs. To overcome this limitation, we propose spectral ConvCNPs (SConvCNPs), which perform global convolution in the frequency domain. Inspired by Fourier neural operators (FNOs) for learning solution operators of partial differential equations (PDEs), our approach directly parameterizes convolution kernels in the frequency domain, leveraging the relatively compact yet global Fourier representation of many natural signals. We validate the effectiveness of SConvCNPs on both synthetic and real-world datasets, demonstrating how ideas from operator learning can advance the capabilities of NPs.
[LINK]http://arxiv.org/abs/2404.13182v6
[DATE]2026-01-15 19:10:21+08:00
[CATEGORIES]cs.LG
SPIKE: Sparse Koopman Regularization for Physics-Informed Neural Networks
[AUTHORS]Jose Marie Antonio Minoza
[ABSTRACT]Physics-Informed Neural Networks (PINNs) provide a mesh-free approach for solving differential equations by embedding physical constraints into neural network training. However, PINNs tend to overfit within the training domain, leading to poor generalization when extrapolating beyond trained spatiotemporal regions. This work presents SPIKE (Sparse Physics-Informed Koopman-Enhanced), a framework that regularizes PINNs with continuous-time Koopman operators to learn parsimonious dynamics representations. By enforcing linear dynamics $dz/dt = Az$ in a learned observable space, both PIKE (without explicit sparsity) and SPIKE (with L1 regularization on $A$) learn sparse generator matrices, embodying the parsimony principle that complex dynamics admit low-dimensional structure. Experiments across parabolic, hyperbolic, dispersive, and stiff PDEs, including fluid dynamics (Navier-Stokes) and chaotic ODEs (Lorenz), demonstrate consistent improvements in temporal extrapolation, spatial generalization, and long-term prediction accuracy. The continuous-time formulation with matrix exponential integration provides unconditional stability for stiff systems while avoiding diagonal dominance issues inherent in discrete-time Koopman operators.
[LINK]http://arxiv.org/abs/2601.10282v1
[DATE]2026-01-15 18:59:48+08:00
[CATEGORIES]cs.LG
Queueing-Aware Optimization of Reasoning Tokens for Accuracy-Latency Trade-offs in LLM Servers
[AUTHORS]Emre Ozbas, Melih Bastopcu
[ABSTRACT]We consider a single large language model (LLM) server that serves a heterogeneous stream of queries belonging to $N$ distinct task types. Queries arrive according to a Poisson process, and each type occurs with a known prior probability. For each task type, the server allocates a fixed number of internal thinking tokens, which determines the computational effort devoted to that query. The token allocation induces an accuracy-latency trade-off: the service time follows an approximately affine function of the allocated tokens, while the probability of a correct response exhibits diminishing returns. Under a first-in, first-out (FIFO) service discipline, the system operates as an $M/G/1$ queue, and the mean system time depends on the first and second moments of the resulting service-time distribution. We formulate a constrained optimization problem that maximizes a weighted average accuracy objective penalized by the mean system time, subject to architectural token-budget constraints and queue-stability conditions. The objective function is shown to be strictly concave over the stability region, which ensures existence and uniqueness of the optimal token allocation. The first-order optimality conditions yield a coupled projected fixed-point characterization of the optimum, together with an iterative solution and an explicit sufficient condition for contraction. Moreover, a projected gradient method with a computable global step-size bound is developed to guarantee convergence beyond the contractive regime. Finally, integer-valued token allocations are attained via rounding of the continuous solution, and the resulting performance loss is evaluated in simulation results.
[LINK]http://arxiv.org/abs/2601.10274v1
[DATE]2026-01-15 18:47:11+08:00
[CATEGORIES]cs.LG
GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
[AUTHORS]Yeonjoon Jung, Daehyun Ahn, Hyungjun Kim, Taesu Kim, Eunhyeok Park
[ABSTRACT]Low-Rank Adaptation (LoRA) is a popular method for parameter-efficient fine-tuning (PEFT) of generative models, valued for its simplicity and effectiveness. Despite recent enhancements, LoRA still suffers from a fundamental limitation: overfitting when the bottleneck is widened. It performs best at ranks 32-64, yet its accuracy stagnates or declines at higher ranks, still falling short of full fine-tuning (FFT) performance. We identify the root cause as LoRA’s structural bottleneck, which introduces gradient entanglement to the unrelated input channels and distorts gradient propagation. To address this, we introduce a novel structure, Granular Low-Rank Adaptation (GraLoRA) that partitions weight matrices into sub-blocks, each with its own low-rank adapter. With negligible computational or storage cost, GraLoRA overcomes LoRA’s limitations, effectively increases the representational capacity, and more closely approximates FFT behavior. Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank settings, making GraLoRA a scalable and robust solution for PEFT. Code, data, and scripts are available at https://github.com/SqueezeBits/GraLoRA.git
[COMMENTS]39th Conference on Neural Information Processing Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2505.20355v2
[DATE]2026-01-15 18:43:27+08:00
[CATEGORIES]cs.LG
Deep Learning for Continuous-Time Stochastic Control with Jumps
[AUTHORS]Patrick Cheridito, Jean-Loup Dupret, Donatien Hainaut
[ABSTRACT]In this paper, we introduce a model-based deep-learning approach to solve finite-horizon continuous-time stochastic control problems with jumps. We iteratively train two neural networks: one to represent the optimal policy and the other to approximate the value function. Leveraging a continuous-time version of the dynamic programming principle, we derive two different training objectives based on the Hamilton-Jacobi-Bellman equation, ensuring that the networks capture the underlying stochastic dynamics. Empirical evaluations on different problems illustrate the accuracy and scalability of our approach, demonstrating its effectiveness in solving complex high-dimensional stochastic control tasks.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.15602v3
[DATE]2026-01-15 18:43:19+08:00
[CATEGORIES]cs.LG
Softly Induced Functional Simplicity: Implications for Neural Network Generalisation, Robustness, and Distillation
[AUTHORS]Maciej Glowacki
[ABSTRACT]Learning robust and generalisable abstractions from high-dimensional input data is a central challenge in machine learning and its applications to high-energy physics (HEP). Solutions of lower functional complexity are known to produce abstractions that generalise more effectively and are more robust to input perturbations. In complex hypothesis spaces, inductive biases make such solutions learnable by shaping the loss geometry during optimisation. In a HEP classification task, we show that a soft symmetry respecting inductive bias creates approximate degeneracies in the loss, which we identify as pseudo-Goldstone modes. We quantify functional complexity using metrics derived from first principles Hessian analysis and via compressibility. Our results demonstrate that solutions of lower complexity give rise to abstractions that are more generalisable, robust, and efficiently distillable.
[LINK]http://arxiv.org/abs/2601.06584v2
[DATE]2026-01-15 18:40:18+08:00
[CATEGORIES]cs.LG
Early Fault Detection on CMAPSS with Unsupervised LSTM Autoencoders
[AUTHORS]P. Sánchez, K. Reyes, B. Radu, E. Fernández
[ABSTRACT]This paper introduces an unsupervised health-monitoring framework for turbofan engines that does not require run-to-failure labels. First, operating-condition effects in NASA CMAPSS sensor streams are removed via regression-based normalisation; then a Long Short-Term Memory (LSTM) autoencoder is trained only on the healthy portion of each trajectory. Persistent reconstruction error, estimated using an adaptive data-driven threshold, triggers real-time alerts without hand-tuned rules. Benchmark results show high recall and low false-alarm rates across multiple operating regimes, demonstrating that the method can be deployed quickly, scale to diverse fleets, and serve as a complementary early-warning layer to Remaining Useful Life models.
[LINK]http://arxiv.org/abs/2601.10269v1
[DATE]2026-01-15 18:38:14+08:00
[CATEGORIES]cs.LG
In-Context Source and Channel Coding
[AUTHORS]Ziqiong Wang, Tianqi Ren, Rongpeng Li, Zhifeng Zhao, Honggang Zhang
[ABSTRACT]Separate Source-Channel Coding (SSCC) remains attractive for text transmission due to its modularity and compatibility with mature entropy coders and powerful channel codes. However, SSCC often suffers from a pronounced cliff effect in low Signal-to-Noise Ratio (SNR) regimes, where residual bit errors after channel decoding can catastrophically break lossless source decoding, especially for Arithmetic Coding (AC) driven by Large Language Models (LLMs). This paper proposes a receiver-side In-Context Decoding (ICD) framework that enhances SSCC robustness without modifying the transmitter. ICD leverages an Error Correction Code Transformer (ECCT) to obtain bit-wise reliability for the decoded information bits. Based on the context-consistent bitstream, ICD constructs a confidence-ranked candidate pool via reliability-guided bit flipping, samples a compact yet diverse subset of candidates, and applies an LLM-based arithmetic decoder to obtain both reconstructions and sequence-level log-likelihoods. A reliability-likelihood fusion rule then selects the final output. We further provide theoretical guarantees on the stability and convergence of the proposed sampling procedure. Extensive experiments over Additive White Gaussian Noise (AWGN) and Rayleigh fading channels demonstrate consistent gains compared with conventional SSCC baselines and representative Joint Source-Channel Coding (JSCC) schemes.
[LINK]http://arxiv.org/abs/2601.10267v1
[DATE]2026-01-15 18:37:57+08:00
[CATEGORIES]cs.LG
X-SAM: Boosting Sharpness-Aware Minimization with Dominant-Eigenvector Gradient Correction
[AUTHORS]Hongru Duan, Yongle Chen, Lei Guan
[ABSTRACT]Sharpness-Aware Minimization (SAM) aims to improve generalization by minimizing a worst-case perturbed loss over a small neighborhood of model parameters. However, during training, its optimization behavior does not always align with theoretical expectations, since both sharp and flat regions may yield a small perturbed loss. In such cases, the gradient may still point toward sharp regions, failing to achieve the intended effect of SAM. To address this issue, we investigate SAM from a spectral and geometric perspective: specifically, we utilize the angle between the gradient and the leading eigenvector of the Hessian as a measure of sharpness. Our analysis illustrates that when this angle is less than or equal to ninety degrees, the effect of SAM’s sharpness regularization can be weakened. Furthermore, we propose an explicit eigenvector-aligned SAM (X-SAM), which corrects the gradient via orthogonal decomposition along the top eigenvector, enabling more direct and efficient regularization of the Hessian’s maximum eigenvalue. We prove X-SAM’s convergence and superior generalization, with extensive experimental evaluations confirming both theoretical and practical advantages.
[LINK]http://arxiv.org/abs/2601.10251v1
[DATE]2026-01-15 18:19:08+08:00
[CATEGORIES]cs.LG
Fast Mining and Dynamic Time-to-Event Prediction over Multi-sensor Data Streams
[AUTHORS]Kota Nakamura, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai
[ABSTRACT]Given real-time sensor data streams obtained from machines, how can we continuously predict when a machine failure will occur? This work aims to continuously forecast the timing of future events by analyzing multi-sensor data streams. A key characteristic of real-world data streams is their dynamic nature, where the underlying patterns evolve over time. To address this, we present TimeCast, a dynamic prediction framework designed to adapt to these changes and provide accurate, real-time predictions of future event time. Our proposed method has the following properties: (a) Dynamic: it identifies the distinct time-evolving patterns (i.e., stages) and learns individual models for each, enabling us to make adaptive predictions based on pattern shifts. (b) Practical: it finds meaningful stages that capture time-varying interdependencies between multiple sensors and improve prediction performance; (c) Scalable: our algorithm scales linearly with the input size and enables online model updates on data streams. Extensive experiments on real datasets demonstrate that TimeCast provides higher prediction accuracy than state-of-the-art methods while finding dynamic changes in data streams with a great reduction in computational time.
[COMMENTS]Accepted by KDD 2026
[LINK]http://arxiv.org/abs/2601.04741v2
[DATE]2026-01-15 18:15:50+08:00
[CATEGORIES]cs.LG
Arbitrary Polynomial Separations in Trainable Quantum Machine Learning
[AUTHORS]Eric R. Anschuetz, Xun Gao
[ABSTRACT]Recent theoretical results in quantum machine learning have demonstrated a general trade-off between the expressive power of quantum neural networks (QNNs) and their trainability; as a corollary of these results, practical exponential separations in expressive power over classical machine learning models are believed to be infeasible as such QNNs take a time to train that is exponential in the model size. We here circumvent these negative results by constructing a hierarchy of efficiently trainable QNNs that exhibit unconditionally provable, polynomial memory separations of arbitrary constant degree over classical neural networks – including state-of-the-art models, such as Transformers – in performing a classical sequence modeling task. This construction is also computationally efficient, as each unit cell of the introduced class of QNNs only has constant gate complexity. We show that contextuality – informally, a quantitative notion of semantic ambiguity – is the source of the expressivity separation, suggesting that other learning tasks with this property may be a natural setting for the use of quantum learning algorithms.
[COMMENTS]30 pages, 3 figures, version accepted to Quantum
[LINK]http://arxiv.org/abs/2402.08606v4
[DATE]2026-01-15 18:01:32+08:00
[CATEGORIES]cs.LG
Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory
[AUTHORS]Johann Licher, Max Bartholdt, Henrik Krauss, Tim-Lukas Habich, Thomas Seel, Moritz Schappler
[ABSTRACT]Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of 44000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator’s length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s2.
[COMMENTS]Submitted to IEEE Transactions on Robotics, 20 pages, 14 figures
[LINK]http://arxiv.org/abs/2508.12681v2
[DATE]2026-01-15 17:58:58+08:00
[CATEGORIES]cs.LG
VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction
[AUTHORS]Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher
[ABSTRACT]In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON’s adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9% compared to DPOT and 44.7% compared to MPP, while requiring only 72.5% and 34.8% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in imperfect measurement systems where sampling frequencies may differ or frames might be dropped - common challenges in real-world settings - without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41% relative performance degradation compared to 71.37%-74.49% degradation in baseline methods, demonstrating its versatility for deploying in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.
[COMMENTS]update after TMLR accept
[LINK]http://arxiv.org/abs/2411.16063v5
[DATE]2026-01-15 17:58:38+08:00
[CATEGORIES]cs.LG
Distributionally Robust Causal Abstractions
[AUTHORS]Yorgos Felekis, Theodoros Damoulas, Paris Giampouras
[ABSTRACT]Causal Abstraction (CA) theory provides a principled framework for relating causal models that describe the same system at different levels of granularity while ensuring interventional consistency between them. Recently, several approaches for learning CAs have been proposed, but all assume fixed and well-specified exogenous distributions, making them vulnerable to environmental shifts and misspecification. In this work, we address these limitations by introducing the first class of distributionally robust CAs and their associated learning algorithms. The latter cast robust causal abstraction learning as a constrained min-max optimization problem with Wasserstein ambiguity sets. We provide theoretical results, for both empirical and Gaussian environments, leading to principled selection of the level of robustness via the radius of these sets. Furthermore, we present empirical evidence across different problems and CA learning methods, demonstrating our framework’s robustness not only to environmental shifts but also to structural model and intervention mapping misspecification.
[LINK]http://arxiv.org/abs/2510.04842v2
[DATE]2026-01-15 17:56:46+08:00
[CATEGORIES]cs.LG
LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities
[AUTHORS]Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
[ABSTRACT]Generative models are spearheading recent progress in deep learning, showcasing strong promise for trajectory sampling in dynamical systems as well. However, whereas latent space modeling paradigms have transformed image and video generation, similar approaches are more difficult for most dynamical systems. Such systems – from chemical molecule structures to collective human behavior – are described by interactions of entities, making them inherently linked to connectivity patterns, entity conservation, and the traceability of entities over time. Our approach, LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked Entities), bridges the gap between: (1) keeping the traceability of individual entities in a latent system representation, and (2) leveraging the efficiency and scalability of recent advances in image and video generation, where pre-trained encoder and decoder enable generative modeling directly in latent space. The core idea of LaM-SLidE is the introduction of identifier representations (IDs) that enable the retrieval of entity properties and entity composition from latent system representations, thus fostering traceability. Experimentally, across different domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy, and generalizability. Code is available at https://github.com/ml-jku/LaM-SLidE .
[COMMENTS]Project page: https://ml-jku.github.io/LaM-SLidE/
[LINK]http://arxiv.org/abs/2502.12128v5
[DATE]2026-01-15 17:56:42+08:00
[CATEGORIES]cs.LG
Fundamental Limitations of Favorable Privacy-Utility Guarantees for DP-SGD
[AUTHORS]Murat Bilgehan Ertan, Marten van Dijk
[ABSTRACT]Differentially Private Stochastic Gradient Descent (DP-SGD) is the dominant paradigm for private training, but its fundamental limitations under worst-case adversarial privacy definitions remain poorly understood. We analyze DP-SGD in the $f$-differential privacy framework, which characterizes privacy via hypothesis-testing trade-off curves, and study shuffled sampling over a single epoch with $M$ gradient updates. We derive an explicit suboptimal upper bound on the achievable trade-off curve. This result induces a geometric lower bound on the separation $κ$ which is the maximum distance between the mechanism’s trade-off curve and the ideal random-guessing line. Because a large separation implies significant adversarial advantage, meaningful privacy requires small $κ$. However, we prove that enforcing a small separation imposes a strict lower bound on the Gaussian noise multiplier $σ$, which directly limits the achievable utility. In particular, under the standard worst-case adversarial model, shuffled DP-SGD must satisfy
$σ\ge \frac{1}{\sqrt{2\ln M}}$ $\quad\text{or}\quad$ $κ\ge\ \frac{1}{\sqrt{8}}!\left(1-\frac{1}{\sqrt{4π\ln M}}\right)$,
and thus cannot simultaneously achieve strong privacy and high utility. Although this bound vanishes asymptotically as $M \to \infty$, the convergence is extremely slow: even for practically relevant numbers of updates the required noise magnitude remains substantial. We further show that the same limitation extends to Poisson subsampling up to constant factors. Our experiments confirm that the noise levels implied by this bound leads to significant accuracy degradation at realistic training settings, thus showing a critical bottleneck in DP-SGD under standard worst-case adversarial assumptions.
[LINK]http://arxiv.org/abs/2601.10237v1
[DATE]2026-01-15 17:50:36+08:00
[CATEGORIES]cs.LG
Learning Without Augmenting: Unsupervised Time Series Representation Learning via Frame Projections
[AUTHORS]Berken Utku Demirel, Christian Holz
[ABSTRACT]Self-supervised learning (SSL) has emerged as a powerful paradigm for learning representations without labeled data. Most SSL approaches rely on strong, well-established, handcrafted data augmentations to generate diverse views for representation learning. However, designing such augmentations requires domain-specific knowledge and implicitly imposes representational invariances on the model, which can limit generalization. In this work, we propose an unsupervised representation learning method that replaces augmentations by generating views using orthonormal bases and overcomplete frames. We show that embeddings learned from orthonormal and overcomplete spaces reside on distinct manifolds, shaped by the geometric biases introduced by representing samples in different spaces. By jointly leveraging the complementary geometry of these distinct manifolds, our approach achieves superior performance without artificially increasing data diversity through strong augmentations. We demonstrate the effectiveness of our method on nine datasets across five temporal sequence tasks, where signal-specific characteristics make data augmentations particularly challenging. Without relying on augmentation-induced diversity, our method achieves performance gains of up to 15–20\% over existing self-supervised approaches. Source code: https://github.com/eth-siplab/Learning-with-FrameProjections
[COMMENTS]Published at the Conference on Neural Information Processing Systems (NeurIPS) 2025
[LINK]http://arxiv.org/abs/2510.22655v2
[DATE]2026-01-15 17:38:50+08:00
[CATEGORIES]cs.LG
Persistent Homology via Ellipsoids
[AUTHORS]Niklas Canova, Sara Kališnik, Aaron Moser, Bastian Rieck, Ana Žegarac
[ABSTRACT]Persistent homology is one of the most popular methods in topological data analysis. An initial step in its use involves constructing a nested sequence of simplicial complexes. There is an abundance of different complexes to choose from, with Čech, Rips, alpha, and witness complexes being popular choices. In this manuscript, we build a novel type of geometrically informed simplicial complex, called a Rips-type ellipsoid complex. This complex is based on the idea that ellipsoids aligned with tangent directions better approximate the data compared to conventional (Euclidean) balls centered at sample points, as used in the construction of Rips and Alpha complexes. We use Principal Component Analysis to estimate tangent spaces directly from samples and present an algorithm for computing Rips-type ellipsoid barcodes, i.e., topological descriptors based on Rips-type ellipsoid complexes. Additionally, we show that the ellipsoid barcodes depend continuously on the input data so that small perturbations of a k-generic point cloud lead to proportionally small changes in the resulting ellipsoid barcodes. This provides a theoretical guarantee analogous, if somewhat weaker, to the classical stability results for Rips and Čech filtrations. We also conduct extensive experiments and compare Rips-type ellipsoid barcodes with standard Rips barcodes. Our findings indicate that Rips-type ellipsoid complexes are particularly effective for estimating the homology of manifolds and spaces with bottlenecks from samples. In particular, the persistence intervals corresponding to ground-truth topological features are longer compared to those obtained using the Rips complex of the data. Furthermore, Rips-type ellipsoid barcodes lead to better classification results in sparsely sampled point clouds. Finally, we demonstrate that Rips-type ellipsoid barcodes outperform Rips barcodes in classification tasks.
[LINK]http://arxiv.org/abs/2408.11450v3
[DATE]2026-01-15 17:27:35+08:00
[CATEGORIES]cs.LG
AgriFM: A Multi-source Temporal Remote Sensing Foundation Model for Agriculture Mapping
[AUTHORS]Wenyuan Li, Shunlin Liang, Keyan Chen, Yongzhe Chen, Han Ma, Jianglei Xu, Yichuan Ma, Shikang Guan, Husheng Fang, Zhenwei Shi
[ABSTRACT]Accurate crop mapping fundamentally relies on modeling multi-scale spatiotemporal patterns, where spatial scales range from individual field textures to landscape-level context, and temporal scales capture both short-term phenological transitions and full growing-season dynamics. Transformer-based remote sensing foundation models (RSFMs) offer promising potential for crop mapping due to their innate ability for unified spatiotemporal processing. However, current RSFMs remain suboptimal for crop mapping: they either employ fixed spatiotemporal windows that ignore the multi-scale nature of crop systems or completely disregard temporal information by focusing solely on spatial patterns. To bridge these gaps, we present AgriFM, a multi-source remote sensing foundation model specifically designed for agricultural crop mapping. Our approach begins by establishing the necessity of simultaneous hierarchical spatiotemporal feature extraction, leading to the development of a modified Video Swin Transformer architecture where temporal down-sampling is synchronized with spatial scaling operations. This modified backbone enables efficient unified processing of long time-series satellite inputs. AgriFM leverages temporally rich data streams from three satellite sources including MODIS, Landsat-8/9 and Sentinel-2, and is pre-trained on a global representative dataset comprising over 25 million image samples supervised by land cover products. The resulting framework incorporates a versatile decoder architecture that dynamically fuses these learned spatiotemporal representations, supporting diverse downstream tasks. Comprehensive evaluations demonstrate AgriFM’s superior performance over conventional deep learning approaches and state-of-the-art general-purpose RSFMs across all downstream tasks. Codes will be available at https://github.com/flyakon/AgriFM.
[LINK]http://arxiv.org/abs/2505.21357v3
[DATE]2026-01-15 17:17:32+08:00
[CATEGORIES]cs.LG
Optimal kernel regression bounds under energy-bounded noise
[AUTHORS]Amon Lahr, Johannes Köhler, Anna Scampicchio, Melanie N. Zeilinger
[ABSTRACT]Non-conservative uncertainty bounds are key for both assessing an estimation algorithm’s accuracy and in view of downstream tasks, such as its deployment in safety-critical contexts. In this paper, we derive a tight, non-asymptotic uncertainty bound for kernel-based estimation, which can also handle correlated noise sequences. Its computation relies on a mild norm-boundedness assumption on the unknown function and the noise, returning the worst-case function realization within the hypothesis class at an arbitrary query input location. The value of this function is shown to be given in terms of the posterior mean and covariance of a Gaussian process for an optimal choice of the measurement noise covariance. By rigorously analyzing the proposed approach and comparing it with other results in the literature, we show its effectiveness in returning tight and easy-to-compute bounds for kernel-based estimates.
[LINK]http://arxiv.org/abs/2505.22235v3
[DATE]2026-01-15 17:07:03+08:00
[CATEGORIES]cs.LG
Adaptive Querying for Reward Learning from Human Feedback
[AUTHORS]Yashwanthi Anand, Nnamdi Nwagwu, Kevin Sabbe, Naomi T. Fitter, Sandhya Saisubramanian
[ABSTRACT]Learning from human feedback is a popular approach to train robots to adapt to user preferences and improve safety. Existing approaches typically consider a single querying (interaction) format when seeking human feedback and do not leverage multiple modes of user interaction with a robot. We examine how to learn a penalty function associated with unsafe behaviors using multiple forms of human feedback, by optimizing both the query state and feedback format. Our proposed adaptive feedback selection is an iterative, two-phase approach which first selects critical states for querying, and then uses information gain to select a feedback format for querying across the sampled critical states. The feedback format selection also accounts for the cost and probability of receiving feedback in a certain format. Our experiments in simulation demonstrate the sample efficiency of our approach in learning to avoid undesirable behaviors. The results of our user study with a physical robot highlight the practicality and effectiveness of adaptive feedback selection in seeking informative, user-aligned feedback that accelerate learning. Experiment videos, code and appendices are found on our website: https://tinyurl.com/AFS-learning.
[LINK]http://arxiv.org/abs/2412.07990v2
[DATE]2026-01-15 17:01:35+08:00
[CATEGORIES]cs.LG
Graph Regularized PCA
[AUTHORS]Antonio Briola, Marwin Schmidt, Fabio Caccioli, Carlos Ros Perez, James Singleton, Christian Michler, Tomaso Aste
[ABSTRACT]High-dimensional data often exhibit dependencies among variables that violate the isotropic-noise assumption under which principal component analysis (PCA) is optimal. For cases where the noise is not independent and identically distributed across features (i.e., the covariance is not spherical) we introduce Graph Regularized PCA (GR-PCA). It is a graph-based regularization of PCA that incorporates the dependency structure of the data features by learning a sparse precision graph and biasing loadings toward the low-frequency Fourier modes of the corresponding graph Laplacian. Consequently, high-frequency signals are suppressed, while graph-coherent low-frequency ones are preserved, yielding interpretable principal components aligned with conditional relationships. We evaluate GR-PCA on synthetic data spanning diverse graph topologies, signal-to-noise ratios, and sparsity levels. Compared to mainstream alternatives, it concentrates variance on the intended support, produces loadings with lower graph-Laplacian energy, and remains competitive in out-of-sample reconstruction. When high-frequency signals are present, the graph Laplacian penalty prevents overfitting, reducing the reconstruction accuracy but improving structural fidelity. The advantage over PCA is most pronounced when high-frequency signals are graph-correlated, whereas PCA remains competitive when such signals are nearly rotationally invariant. The procedure is simple to implement, modular with respect to the precision estimator, and scalable, providing a practical route to structure-aware dimensionality reduction that improves structural fidelity without sacrificing predictive performance.
[COMMENTS]15 pages, 2 figures, 4 Tables
[LINK]http://arxiv.org/abs/2601.10199v1
[DATE]2026-01-15 16:57:18+08:00
[CATEGORIES]cs.LG
ShuffleGate: Scalable Feature Optimization for Recommender Systems via Batch-wise Sensitivity Learning
[AUTHORS]Yihong Huang, Chen Chu, Fan Zhang, Liping Wang Fei Chen, Yu Lin, Ruiduan Li, Zhihao Li
[ABSTRACT]Feature optimization, specifically Feature Selection (FS) and Dimension Selection (DS), is critical for the efficiency and generalization of large-scale recommender systems. While conceptually related, these tasks are typically tackled with isolated solutions that often suffer from ambiguous importance scores or prohibitive computational costs.
In this paper, we propose ShuffleGate, a unified and interpretable mechanism that estimates component importance by measuring the model’s sensitivity to information loss. Unlike conventional gating that learns relative weights, ShuffleGate introduces a batch-wise shuffling strategy to effectively erase information in an end-to-end differentiable manner. This paradigm shift yields naturally polarized importance distributions, bridging the long-standing “search-retrain gap” and distinguishing essential signals from noise without complex threshold tuning.
ShuffleGate provides a unified solution across granularities. It achieves state-of-the-art performance on feature and dimension selection tasks. Furthermore, to demonstrate its extreme scalability and precision, we extend ShuffleGate to evaluate fine-grained embedding entries. Experiments show it can identify and prune 99.9% of redundant embedding parameters on the Criteo dataset while maintaining competitive AUC, verifying its robustness in massive search spaces. Finally, the method has been successfully deployed in a top-tier industrial video recommendation platform. By compressing the concatenated input dimension from over 10,000 to 1,000+, it achieved a 91% increase in training throughput while serving billions of daily requests without performance degradation.
[LINK]http://arxiv.org/abs/2503.09315v4
[DATE]2026-01-15 16:46:45+08:00
[CATEGORIES]cs.LG
CC-OR-Net: A Unified Framework for LTV Prediction through Structural Decoupling
[AUTHORS]Mingyu Zhao, Haoran Bai, Yu Tian, Bing Zhu, Hengliang Luo
[ABSTRACT]Customer Lifetime Value (LTV) prediction, a central problem in modern marketing, is characterized by a unique zero-inflated and long-tail data distribution. This distribution presents two fundamental challenges: (1) the vast majority of low-to-medium value users numerically overwhelm the small but critically important segment of high-value “whale” users, and (2) significant value heterogeneity exists even within the low-to-medium value user base. Common approaches either rely on rigid statistical assumptions or attempt to decouple ranking and regression using ordered buckets; however, they often enforce ordinality through loss-based constraints rather than inherent architectural design, failing to balance global accuracy with high-value precision. To address this gap, we propose \textbf{C}onditional \textbf{C}ascaded \textbf{O}rdinal-\textbf{R}esidual Networks \textbf{(CC-OR-Net)}, a novel unified framework that achieves a more robust decoupling through \textbf{structural decomposition}, where ranking is architecturally guaranteed. CC-OR-Net integrates three specialized components: a \textit{structural ordinal decomposition module} for robust ranking, an \textit{intra-bucket residual module} for fine-grained regression, and a \textit{targeted high-value augmentation module} for precision on top-tier users. Evaluated on real-world datasets with over 300M users, CC-OR-Net achieves a superior trade-off across all key business metrics, outperforming state-of-the-art methods in creating a holistic and commercially valuable LTV prediction solution.
[COMMENTS]Accepted by WWW’26
[LINK]http://arxiv.org/abs/2601.10176v1
[DATE]2026-01-15 16:35:17+08:00
[CATEGORIES]cs.LG
OBLR-PO: A Theoretical Framework for Stable Reinforcement Learning
[AUTHORS]Zixun Huang, Jiayi Sheng, Zeyu Zheng
[ABSTRACT]Existing reinforcement learning (RL)-based post-training methods for large language models have advanced rapidly, yet their design has largely been guided by heuristics rather than systematic theoretical principles. This gap limits our understanding of the properties of the gradient estimators and the associated optimization algorithms, thereby constraining opportunities to improve training stability and overall performance. In this work, we provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators under mild assumptions. Our analysis establishes unbiasedness, derives exact variance expressions, and yields an optimization-loss upper bound that enables principled reasoning about learning dynamics. Building on these results, we prove convergence guarantees and derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of gradients. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction and naturally enhancing stability beyond existing methods. These insights motivate Optimal Baseline and Learning-Rate Policy Optimization (OBLR-PO), an algorithm that jointly adapts learning rates and baselines in a theoretically grounded manner. Experiments on Qwen3-4B-Base and Qwen3-8B-Base demonstrate consistent gains over existing policy optimization methods, validating that our theoretical contributions translate into practical improvements in large-scale post-training.
[COMMENTS]19 pages, 7 figures
[LINK]http://arxiv.org/abs/2511.23310v2
[DATE]2026-01-15 16:06:30+08:00
[CATEGORIES]cs.LG
LOOKAT: Lookup-Optimized Key-Attention for Memory-Efficient Transformers
[AUTHORS]Aryan Karmore
[ABSTRACT]Compressing the KV cache is a required step to deploy large language models on edge devices. Current quantization methods compress storage but fail to reduce bandwidth as attention calculation requires dequantizing keys from INT4/INT8 to FP16 before use. We observe that attention scoring is mathematically equivalent to the inner product similarity search and we can apply some compression techniques from vector databases to compress KV-cache better. We propose LOOKAT, which applies product quantization and asymmetric distance computation, to transformer architecture by decomposing key vectors into subspaces, learning codebooks and computing attention tables via lookup tables. This transforms attention from memory-bound to compute-bound. LOOKAT achieves 64 $\times$ compression at 95.7\% output fidelity and 32 $\times$ compression at 95.0\% fidelity when tested on GPT-2. LOOKAT requires no architecture changes or training while maintaining rank correlation $ρ> 0.95$. Theoretical analysis confirms that rank correlation degrades as $O(d_k/mK)$, with guarantees validated across sequence lengths up to 1024 tokens.
[LINK]http://arxiv.org/abs/2601.10155v1
[DATE]2026-01-15 15:54:07+08:00
[CATEGORIES]cs.LG
MHub.ai: A Simple, Standardized, and Reproducible Platform for AI Models in Medical Imaging
[AUTHORS]Leonard Nürnberg, Dennis Bontempi, Suraj Pai, Curtis Lisle, Steve Pieper, Ron Kikinis, Sil van de Leemput, Rahul Soni, Gowtham Murugesan, Cosmin Ciausu, Miriam Groeneveld, Felix J. Dorfner, Jue Jiang, Aneesh Rangnekar, Harini Veeraraghavan, Joeran S. Bosma, Keno Bressem, Raymond Mak, Andrey Fedorov, Hugo JWL Aerts
[ABSTRACT]Artificial intelligence (AI) has the potential to transform medical imaging by automating image analysis and accelerating clinical research. However, research and clinical use are limited by the wide variety of AI implementations and architectures, inconsistent documentation, and reproducibility issues. Here, we introduce MHub.ai, an open-source, container-based platform that standardizes access to AI models with minimal configuration, promoting accessibility and reproducibility in medical imaging. MHub.ai packages models from peer-reviewed publications into standardized containers that support direct processing of DICOM and other formats, provide a unified application interface, and embed structured metadata. Each model is accompanied by publicly available reference data that can be used to confirm model operation. MHub.ai includes an initial set of state-of-the-art segmentation, prediction, and feature extraction models for different modalities. The modular framework enables adaptation of any model and supports community contributions. We demonstrate the utility of the platform in a clinical use case through comparative evaluation of lung segmentation models. To further strengthen transparency and reproducibility, we publicly release the generated segmentations and evaluation metrics and provide interactive dashboards that allow readers to inspect individual cases and reproduce or extend our analysis. By simplifying model use, MHub.ai enables side-by-side benchmarking with identical execution commands and standardized outputs, and lowers the barrier to clinical translation.
[COMMENTS]41 pages, 15 figures, 6 tables
[LINK]http://arxiv.org/abs/2601.10154v1
[DATE]2026-01-15 15:53:09+08:00
[CATEGORIES]cs.LG
Debiased Orthogonal Boundary-Driven Efficient Noise Mitigation
[AUTHORS]Hao Li, Jiayang Gu, Jingkuan Song, An Zhang, Lianli Gao
[ABSTRACT]Mitigating the detrimental effects of noisy labels on the training process has become increasingly critical, as obtaining entirely clean or human-annotated samples for large-scale pre-training tasks is often impractical. Nonetheless, existing noise mitigation methods often encounter limitations in practical applications due to their task-specific design, model dependency, and significant computational overhead. In this work, we exploit the properties of high-dimensional orthogonality to identify a robust and effective boundary in cone space for separating clean and noisy samples. Building on this, we propose One-Step Anti-noise (OSA), a model-agnostic noisy label mitigation paradigm that employs an estimator model and a scoring function to assess the noise level of input pairs through just one-step inference. We empirically validate the superiority of OSA, demonstrating its enhanced training robustness, improved task transferability, streamlined deployment, and reduced computational overhead across diverse benchmarks, models, and tasks. Our code is released at https://github.com/leolee99/OSA.
[COMMENTS]20 pages, 4 figures, 11 Tables
[LINK]http://arxiv.org/abs/2410.01944v2
[DATE]2026-01-15 15:49:21+08:00
[CATEGORIES]cs.LG
Simple Network Graph Comparative Learning
[AUTHORS]Qiang Yu, Xinran Cheng, Shiqiang Xu, Chuanyi Liu
[ABSTRACT]The effectiveness of contrastive learning methods has been widely recognized in the field of graph learning, especially in contexts where graph data often lack labels or are difficult to label. However, the application of these methods to node classification tasks still faces a number of challenges. First, existing data enhancement techniques may lead to significant differences from the original view when generating new views, which may weaken the relevance of the view and affect the efficiency of model training. Second, the vast majority of existing graph comparison learning algorithms rely on the use of a large number of negative samples. To address the above challenges, this study proposes a novel node classification contrast learning method called Simple Network Graph Comparative Learning (SNGCL). Specifically, SNGCL employs a superimposed multilayer Laplace smoothing filter as a step in processing the data to obtain global and local feature smoothing matrices, respectively, which are thus passed into the target and online networks of the siamese network, and finally employs an improved triple recombination loss function to bring the intra-class distance closer and the inter-class distance farther. We have compared SNGCL with state-of-the-art models in node classification tasks, and the experimental results show that SNGCL is strongly competitive in most tasks.
[COMMENTS]10 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.10150v1
[DATE]2026-01-15 15:43:42+08:00
[CATEGORIES]cs.LG
Understanding and Preserving Safety in Fine-Tuned LLMs
[AUTHORS]Jiawen Zhang, Yangfan Hu, Kejia Chen, Lipeng He, Jiachen Ma, Jian Lou, Dan Li, Jian Liu, Xiaohu Yang, Ruoxi Jia
[ABSTRACT]Fine-tuning is an essential and pervasive functionality for applying large language models (LLMs) to downstream tasks. However, it has the potential to substantially degrade safety alignment, e.g., by greatly increasing susceptibility to jailbreak attacks, even when the fine-tuning data is entirely harmless. Despite garnering growing attention in defense efforts during the fine-tuning stage, existing methods struggle with a persistent safety-utility dilemma: emphasizing safety compromises task performance, whereas prioritizing utility typically requires deep fine-tuning that inevitably leads to steep safety declination.
In this work, we address this dilemma by shedding new light on the geometric interaction between safety- and utility-oriented gradients in safety-aligned LLMs. Through systematic empirical analysis, we uncover three key insights: (I) safety gradients lie in a low-rank subspace, while utility gradients span a broader high-dimensional space; (II) these subspaces are often negatively correlated, causing directional conflicts during fine-tuning; and (III) the dominant safety direction can be efficiently estimated from a single sample. Building upon these novel insights, we propose safety-preserving fine-tuning (SPF), a lightweight approach that explicitly removes gradient components conflicting with the low-rank safety subspace. Theoretically, we show that SPF guarantees utility convergence while bounding safety drift. Empirically, SPF consistently maintains downstream task performance and recovers nearly all pre-trained safety alignment, even under adversarial fine-tuning scenarios. Furthermore, SPF exhibits robust resistance to both deep fine-tuning and dynamic jailbreak attacks. Together, our findings provide new mechanistic understanding and practical guidance toward always-aligned LLM fine-tuning.
[LINK]http://arxiv.org/abs/2601.10141v1
[DATE]2026-01-15 15:33:13+08:00
[CATEGORIES]cs.LG
Step-by-Step Causality: Transparent Causal Discovery with Multi-Agent Tree-Query and Adversarial Confidence Estimation
[AUTHORS]Ziyi Ding, Chenfei Ye-Hao, Zheyuan Wang, Xiao-Ping Zhang
[ABSTRACT]Causal discovery aims to recover “what causes what”, but classical constraint-based methods (e.g., PC, FCI) suffer from error propagation, and recent LLM-based causal oracles often behave as opaque, confidence-free black boxes. This paper introduces Tree-Query, a tree-structured, multi-expert LLM framework that reduces pairwise causal discovery to a short sequence of queries about backdoor paths, (in)dependence, latent confounding, and causal direction, yielding interpretable judgments with robustness-aware confidence scores. Theoretical guarantees are provided for asymptotic identifiability of four pairwise relations. On data-free benchmarks derived from Mooij et al. and UCI causal graphs, Tree-Query improves structural metrics over direct LLM baselines, and a diet–weight case study illustrates confounder screening and stable, high-confidence causal conclusions. Tree-Query thus offers a principled way to obtain data-free causal priors from LLMs that can complement downstream data-driven causal discovery. Code is available at https://anonymous.4open.science/r/Repo-9B3E-4F96.
[LINK]http://arxiv.org/abs/2601.10137v1
[DATE]2026-01-15 15:28:59+08:00
[CATEGORIES]cs.LG
Is More Context Always Better? Examining LLM Reasoning Capability for Time Interval Prediction
[AUTHORS]Yanan Cao, Farnaz Fallahi, Murali Mohana Krishna Dandu, Lalitesh Morishetti, Kai Zhao, Luyi Ma, Sinduja Subramaniam, Jianpeng Xu, Evren Korpeoglu, Kaushiki Nag, Sushant Kumar, Kannan Achan
[ABSTRACT]Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning and prediction across different domains. Yet, their ability to infer temporal regularities from structured behavioral data remains underexplored. This paper presents a systematic study investigating whether LLMs can predict time intervals between recurring user actions, such as repeated purchases, and how different levels of contextual information shape their predictive behavior. Using a simple but representative repurchase scenario, we benchmark state-of-the-art LLMs in zero-shot settings against both statistical and machine-learning models. Two key findings emerge. First, while LLMs surpass lightweight statistical baselines, they consistently underperform dedicated machine-learning models, showing their limited ability to capture quantitative temporal structure. Second, although moderate context can improve LLM accuracy, adding further user-level detail degrades performance. These results challenge the assumption that “more context leads to better reasoning”. Our study highlights fundamental limitations of today’s LLMs in structured temporal inference and offers guidance for designing future context-aware hybrid models that integrate statistical precision with linguistic flexibility.
[COMMENTS]Accepted at The Web Conference 2026 (WWW 2026)
[LINK]http://arxiv.org/abs/2601.10132v1
[DATE]2026-01-15 15:18:40+08:00
[CATEGORIES]cs.LG
Prototype-Guided Non-Exemplar Continual Learning for Cross-subject EEG Decoding
[AUTHORS]Dan Li, Hye-Bin Shin, Yeon-Woo Choi
[ABSTRACT]Due to the significant variability in electroencephalo-gram (EEG) signals across individuals, knowledge acquired from previous subjects is often overwritten as new subjects are introduced in continual EEG decoding tasks. Existing methods mainly rely on storing historical data from seen subjects as replay buffers to mitigate forgetting, which is impractical under privacy or memory constraints. To address this issue, we propose a Prototype-guided Non-Exemplar Continual Learning (ProNECL) framework that preserves prior knowledge without accessing historical EEG samples. ProNECL summarizes subject-specific discriminative representations into class-level prototypes and incrementally aligns new subject representations with a global prototype memory through prototype-based feature regulariza-tion and cross-subject alignment. Experiments on the BCI Com-petition IV 2a and 2b datasets demonstrate that ProNECL effec-tively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding tasks.
[COMMENTS]4 pages, 2 figures, 14th IEEE International Winter Conference on Brain-Computer Interface Conference 2026
[LINK]http://arxiv.org/abs/2511.20696v2
[DATE]2026-01-15 15:05:51+08:00
[CATEGORIES]cs.LG
Projection-based Lyapunov method for fully heterogeneous weakly-coupled MDPs
[AUTHORS]Xiangcheng Zhang, Yige Hong, Weina Wang
[ABSTRACT]Heterogeneity poses a fundamental challenge for many real-world large-scale decision-making problems but remains largely understudied. In this paper, we study the fully heterogeneous setting of a prominent class of such problems, known as weakly-coupled Markov decision processes (WCMDPs). Each WCMDP consists of $N$ arms (or subproblems), which have distinct model parameters in the fully heterogeneous setting, leading to the curse of dimensionality when $N$ is large. We show that, under mild assumptions, an efficiently computable policy achieves an $O(1/\sqrt{N})$ optimality gap in the long-run average reward per arm for fully heterogeneous WCMDPs as $N$ becomes large. This is the first asymptotic optimality result for fully heterogeneous average-reward WCMDPs. Our main technical innovation is the construction of projection-based Lyapunov functions that certify the convergence of rewards and costs to an optimal region, even under full heterogeneity.
[COMMENTS]37 pages; full version for NeurIPS proceeding paper; fixed some typos
[LINK]http://arxiv.org/abs/2502.06072v6
[DATE]2026-01-15 14:56:22+08:00
[CATEGORIES]cs.LG
Fine-Tuning Diffusion Models via Intermediate Distribution Shaping
[AUTHORS]Gautham Govind Anil, Shaan Ul Haque, Nithish Kannen, Dheeraj Nagaraj, Sanjay Shakkottai, Karthikeyan Shanmugam
[ABSTRACT]Diffusion models are widely used for generative tasks across domains. While pre-trained diffusion models effectively capture the training data distribution, it is often desirable to shape these distributions using reward functions to align with downstream applications. Policy gradient methods, such as Proximal Policy Optimization (PPO), are widely used in the context of autoregressive generation. However, the marginal likelihoods required for such methods are intractable for diffusion models, leading to alternative proposals and relaxations. In this context, we unify variants of Rejection sAmpling based Fine-Tuning (RAFT) as GRAFT, and show that this implicitly performs KL regularized reward maximization with reshaped rewards. We then introduce P-GRAFT to shape distributions at intermediate noise levels and demonstrate empirically that this can lead to more effective fine-tuning. We mathematically explain this via a bias-variance tradeoff. Motivated by this, we propose inverse noise correction to improve flow models without leveraging explicit rewards. We empirically evaluate our methods on text-to-image(T2I) generation, layout generation, molecule generation and unconditional image generation. Notably, our framework, applied to Stable Diffusion 2, improves over policy gradient methods on popular T2I benchmarks in terms of VQAScore and shows an $8.81\%$ relative improvement over the base model. For unconditional image generation, inverse noise correction improves FID of generated images at lower FLOPs/image.
[LINK]http://arxiv.org/abs/2510.02692v2
[DATE]2026-01-15 14:46:41+08:00
[CATEGORIES]cs.LG
Learning Physics-Informed Noise Models from Dark Frames for Low-Light Raw Image Denoising
[AUTHORS]Hansen Feng, Lizhi Wang, Yiqi Huang, Yuzhi Wang, Lin Zhu, Hua Huang
[ABSTRACT]Recently, the mainstream practice for training low-light raw image denoising methods has shifted towards employing synthetic data. Noise modeling, which focuses on characterizing the noise distribution of real-world sensors, profoundly influences the effectiveness and practicality of synthetic data. Currently, physics-based noise modeling struggles to characterize the entire real noise distribution, while learning-based noise modeling impractically depends on paired real data. In this paper, we propose a novel strategy: learning the noise model from dark frames instead of paired real data, to break down the data dependency. Based on this strategy, we introduce an efficient physics-informed noise neural proxy (PNNP) to approximate the real-world sensor noise model. Specifically, we integrate physical priors into neural proxies and introduce three efficient techniques: physics-guided noise decoupling (PND), physics-aware proxy model (PPM), and differentiable distribution loss (DDL). PND decouples the dark frame into different components and handles different levels of noise flexibly, which reduces the complexity of noise modeling. PPM incorporates physical priors to constrain the synthetic noise, which promotes the accuracy of noise modeling. DDL provides explicit and reliable supervision for noise distribution, which promotes the precision of noise modeling. PNNP exhibits powerful potential in characterizing the real noise distribution. Extensive experiments on public datasets demonstrate superior performance in practical low-light raw image denoising. The source code will be publicly available at the project homepage.
[COMMENTS]18 pages, 13 figures. Accepted by IEEE TPAMI (2026)
[LINK]http://arxiv.org/abs/2310.09126v3
[DATE]2026-01-15 14:46:31+08:00
[CATEGORIES]cs.LG
Functional Critics Are Essential in Off-Policy Actor-Critic: Provable Convergence and Efficient Exploration
[AUTHORS]Qinxun Bai, Yuxuan Han, Wei Xu, Zhengyuan Zhou
[ABSTRACT]Off-policy reinforcement learning (RL) with function approximation offers an effective way to improve sample efficiency by reusing past experience. Within this setting, the actor-critic (AC) framework has achieved strong empirical success but suffers from the “moving target” problem, where the policy being evaluated changes continually. Functional critics, or policy-conditioned value functions, have been proposed to address this issue by including a representation of the policy as input. While the concept of generalizing value functions across policy space is appealing, previous efforts have struggled to remain competitive against state-of-the-art AC algorithms that do not utilize functional critics. In this work, we revisit functional critics within the off-policy AC framework and identify two aspects that render them a necessity rather than a luxury. First, in off-policy AC, critic learning contends with both the “deadly triad” instability and the “moving target” issue, while actor learning faces the challenge of estimating the exact off-policy policy gradient. This complex interplay makes theoretical convergence extremely difficult for practical algorithms. We demonstrate that a functional critic is essential for addressing this challenge and establish the first convergence proof for an off-policy target-based AC algorithm under linear function approximation. Second, we identify a crucial link between functional critic modeling and efficient exploration. Specifically, we show that approximating posterior sampling for exploration in model-free settings is infeasible without functional critics. Practically, we propose a tailored neural network architecture and a minimal AC algorithm that relies solely on these insights. In experiments on the DeepMind Control Suite, this implementation achieves performance competitive with state-of-the-art methods.
[LINK]http://arxiv.org/abs/2509.22964v3
[DATE]2026-01-15 14:25:49+08:00
[CATEGORIES]cs.LG
Learning normalized image densities via dual score matching
[AUTHORS]Florentin Guth, Zahra Kadkhodaie, Eero P Simoncelli
[ABSTRACT]Learning probability models from data is at the heart of many machine learning endeavors, but is notoriously difficult due to the curse of dimensionality. We introduce a new framework for learning \emph{normalized} energy (log probability) models that is inspired by diffusion generative models, which rely on networks optimized to estimate the score. We modify a score network architecture to compute an energy while preserving its inductive biases. The gradient of this energy network with respect to its input image is the score of the learned density, which can be optimized using a denoising objective. Importantly, the gradient with respect to the noise level provides an additional score that can be optimized with a novel secondary objective, ensuring consistent and normalized energies across noise levels. We train an energy network with this \emph{dual} score matching objective on the ImageNet64 dataset, and obtain a cross-entropy (negative log likelihood) value comparable to the state of the art. We further validate our approach by showing that our energy model \emph{strongly generalizes}: log probabilities estimated with two networks trained on non-overlapping data subsets are nearly identical. Finally, we demonstrate that both image probability and dimensionality of local neighborhoods vary substantially depending on image content, in contrast with conventional assumptions such as concentration of measure or support on a low-dimensional manifold.
[LINK]http://arxiv.org/abs/2506.05310v3
[DATE]2026-01-15 14:23:22+08:00
[CATEGORIES]cs.LG
Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
[AUTHORS]Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
[ABSTRACT]Sharpness-aware minimization (SAM) has emerged as a highly effective technique to improve model generalization, but its underlying principles are not fully understood. We investigate m-sharpness, where SAM performance improves monotonically as the micro-batch size for computing perturbations decreases, a phenomenon critical for distributed training yet lacking rigorous explanation. We leverage an extended Stochastic Differential Equation (SDE) framework and analyze stochastic gradient noise (SGN) to characterize the dynamics of SAM variants, including n-SAM and m-SAM. Our analysis reveals that stochastic perturbations induce an implicit variance-based sharpness regularization whose strength increases as m decreases. Motivated by this insight, we propose Reweighted SAM (RW-SAM), which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate our theory and method.
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2509.18001v3
[DATE]2026-01-15 14:16:42+08:00
[CATEGORIES]cs.LG
DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection
[AUTHORS]Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li
[ABSTRACT]Detecting small objects in UAV remote sensing images and identifying surface defects in industrial inspection remain difficult tasks. These applications face common obstacles: features are sparse and weak, backgrounds are cluttered, and object scales vary dramatically. Current transformer-based detectors, while powerful, struggle with three critical issues. First, features degrade severely as networks downsample progressively. Second, spatial convolutions cannot capture long-range dependencies effectively. Third, standard upsampling methods inflate feature maps unnecessarily.
We introduce DFIR-DETR to tackle these problems through dynamic feature aggregation combined with frequency-domain processing. Our architecture builds on three novel components. The DCFA module uses dynamic K-sparse attention, cutting complexity from O(N2) down to O(NK), and employs spatial gated linear units for better nonlinear modeling. The DFPN module applies amplitude-normalized upsampling to prevent feature inflation and uses dual-path shuffle convolution to retain spatial details across scales. The FIRC3 module operates in the frequency domain, achieving global receptive fields without sacrificing efficiency.
We tested our method extensively on NEU-DET and VisDrone datasets. Results show mAP50 scores of 92.9% and 51.6% respectively-both state-of-the-art. The model stays lightweight with just 11.7M parameters and 41.2 GFLOPs. Strong performance across two very different domains confirms that DFIR-DETR generalizes well and works effectively in resource-limited settings for cross-scene small object detection.
[COMMENTS]16 pages. Correct typos
[LINK]http://arxiv.org/abs/2512.07078v2
[DATE]2026-01-15 14:15:47+08:00
[CATEGORIES]cs.LG
Multilingual-To-Multimodal (M2M): Unlocking New Languages with Monolingual Text
[AUTHORS]Piyush Singh Pasi
[ABSTRACT]Multimodal models excel in English, supported by abundant image-text and audio-text data, but performance drops sharply for other languages due to limited multilingual multimodal resources. Existing solutions rely heavily on machine translation, while advances in multilingual text modeling remain underutilized. We introduce METAL, a lightweight alignment method that learns only a few linear layers using English text alone to map multilingual text embeddings into a multimodal space. Despite its simplicity, METAL matches baseline performance in English (94.9 percent Recall at 10) and achieves strong zero-shot transfer (89.5 percent Recall at 10 averaged across 11 languages, 10 unseen) on XTD text-to-image retrieval. Qualitative t-SNE visualizations show that multilingual embeddings align tightly with multimodal representations, while weight analysis reveals that the transformation reshapes embedding geometry rather than performing trivial rotations. Beyond image-text retrieval, METAL generalizes to audio-text retrieval and cross-lingual text-to-image generation. We release code and checkpoints at https://github.com/m2m-codebase/M2M , as well as multilingual evaluation datasets including MSCOCO Multilingual 30K (https://huggingface.co/datasets/piyushsinghpasi/mscoco-multilingual-30k ), AudioCaps Multilingual (https://huggingface.co/datasets/piyushsinghpasi/audiocaps-multilingual ), and Clotho Multilingual (https://huggingface.co/datasets/piyushsinghpasi/clotho-multilingual ), to facilitate further research.
[COMMENTS]EACL 2026 Findings accepted. Initial Draft of Camera-ready
[LINK]http://arxiv.org/abs/2601.10096v1
[DATE]2026-01-15 13:56:37+08:00
[CATEGORIES]cs.LG
V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
[AUTHORS]Han Wang, Yi Yang, Jingyuan Hu, Minfeng Zhu, Wei Chen
[ABSTRACT]Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
[LINK]http://arxiv.org/abs/2601.10094v1
[DATE]2026-01-15 13:47:43+08:00
[CATEGORIES]cs.LG
LLM-Based Emulation of the Radio Resource Control Layer: Towards AI-Native RAN Protocols
[AUTHORS]Ziming Liu, Bryan Liu, Alvaro Valcarce, Xiaoli Chu
[ABSTRACT]Integrating Large AI Models (LAMs) into 6G mobile networks is a key enabler of the AI-Native Air Interface (AI-AI), where protocol intelligence must scale beyond handcrafted logic. This paper presents, to our knowledge, the first standards-compliant emulation of the Radio Resource Control (RRC) layer using a decoder-only LAM (LLAMA-class) fine-tuned with Low-Rank Adaptation (LoRA) on a multi-vendor corpus of real-world traces spanning both 5G and 4G systems. We treat RRC as a domain-specific language and construct a segmentation-safe, question-answer (Question-and-Answer (QA)) dataset that preserves Abstract Syntax Notation (ASN.1) structure through linearization prior to Byte Pair Encoding (BPE) tokenization. The proposed approach combines parameter-efficient adaptation with schema-bounded prompting to ensure syntactic and procedural fidelity. Evaluation introduces a standards-aware triad – ASN.1 conformance, field-level coverage analysis, and uplink-to-downlink state-machine checks – alongside semantic similarity and latency profiling across 120 configurations. On 30k 5G request-response pairs plus an additional 4.8k QA turns from 4G sessions, our 8B model achieves a median cosine similarity of 0.97, a 61% relative gain over a zero-shot baseline, while sustaining high conformance rates. These results demonstrate that LAMs, when augmented with protocol-aware reasoning, can directly orchestrate control-plane procedures, laying the foundation for the future Artificial Intelligence (AI)-native Radio Access Network (RAN).
[COMMENTS]This work has been submitted to the IEEE for possible publication. Focuses on applying LLMs to 5G RRC protocol generation; primary: cs.NI; cross-list: eess.SP, cs.LG
[LINK]http://arxiv.org/abs/2505.16821v5
[DATE]2026-01-15 13:44:21+08:00
[CATEGORIES]cs.LG
ProDAG: Projected Variational Inference for Directed Acyclic Graphs
[AUTHORS]Ryan Thompson, Edwin V. Bonilla, Robert Kohn
[ABSTRACT]Directed acyclic graph (DAG) learning is a central task in structure discovery and causal inference. Although the field has witnessed remarkable advances over the past few years, it remains statistically and computationally challenging to learn a single (point estimate) DAG from data, let alone provide uncertainty quantification. We address the difficult task of quantifying graph uncertainty by developing a Bayesian variational inference framework based on novel, provably valid distributions that have support directly on the space of sparse DAGs. These distributions, which we use to define our prior and variational posterior, are induced by a projection operation that maps an arbitrary continuous distribution onto the space of sparse weighted acyclic adjacency matrices. While this projection is combinatorial, it can be solved efficiently using recent continuous reformulations of acyclicity constraints. We empirically demonstrate that our method, ProDAG, can outperform state-of-the-art alternatives in both accuracy and uncertainty quantification.
[COMMENTS]To appear in Advances in Neural Information Processing Systems
[LINK]http://arxiv.org/abs/2405.15167v7
[DATE]2026-01-15 13:40:25+08:00
[CATEGORIES]cs.LG
Bayesian Meta-Analyses Could Be More: A Case Study in Trial of Labor After a Cesarean-section Outcomes and Complications
[AUTHORS]Ashley Klein, Edward Raff, Marcia DesJardin
[ABSTRACT]The meta-analysis’s utility is dependent on previous studies having accurately captured the variables of interest, but in medical studies, a key decision variable that impacts a physician’s decisions was not captured. This results in an unknown effect size and unreliable conclusions. A Bayesian approach may allow analysis to determine if the claim of a positive effect is still warranted, and we build a Bayesian approach to this common medical scenario. To demonstrate its utility, we assist professional OBGYNs in evaluating Trial of Labor After a Cesarean-section (TOLAC) situations where few interventions are available for patients and find the support needed for physicians to advance patient care.
[COMMENTS]To appear in AAAI 2026
[LINK]http://arxiv.org/abs/2601.10089v1
[DATE]2026-01-15 13:29:32+08:00
[CATEGORIES]cs.LG
Adaptive Label Error Detection: A Bayesian Approach to Mislabeled Data Detection
[AUTHORS]Zan Chaudhry, Noam H. Rotenberg, Brian Caffo, Craig K. Jones, Haris I. Sair
[ABSTRACT]Machine learning classification systems are susceptible to poor performance when trained with incorrect ground truth labels, even when data is well-curated by expert annotators. As machine learning becomes more widespread, it is increasingly imperative to identify and correct mislabeling to develop more powerful models. In this work, we motivate and describe Adaptive Label Error Detection (ALED), a novel method of detecting mislabeling. ALED extracts an intermediate feature space from a deep convolutional neural network, denoises the features, models the reduced manifold of each class with a multidimensional Gaussian distribution, and performs a simple likelihood ratio test to identify mislabeled samples. We show that ALED has markedly increased sensitivity, without compromising precision, compared to established label error detection methods, on multiple medical imaging datasets. We demonstrate an example where fine-tuning a neural network on corrected data results in a 33.8% decrease in test set errors, providing strong benefits to end users. The ALED detector is deployed in the Python package statlab.
[COMMENTS]10 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.10084v1
[DATE]2026-01-15 13:20:00+08:00
[CATEGORIES]cs.LG
Cross-Modal Fine-Tuning of 3D Convolutional Foundation Models for ADHD Classification with Low-Rank Adaptation
[AUTHORS]Jyun-Ping Kao, Shinyeong Rho, Shahar Lazarev, Hyun-Hae Cho, Fangxu Xing, Taehoon Shin, C. -C. Jay Kuo, Jonghye Woo
[ABSTRACT]Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.
[COMMENTS]Accepted for presentation at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026
[LINK]http://arxiv.org/abs/2511.06163v2
[DATE]2026-01-15 13:18:46+08:00
[CATEGORIES]cs.LG
Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting
[AUTHORS]Zhendong Wang, Lebin Zhou, Jingchuan Xiao, Rongduo Han, Nam Ling, Cihan Ruan
[ABSTRACT]In 1888, Vincent van Gogh wrote, “I am seeking exaggeration in the essential.” This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression.
We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints.
Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.
[COMMENTS]7 pages, 8 figures
[LINK]http://arxiv.org/abs/2601.10075v1
[DATE]2026-01-15 13:00:02+08:00
[CATEGORIES]cs.LG
Reward Learning through Ranking Mean Squared Error
[AUTHORS]Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor
[ABSTRACT]Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from human feedback in the form of ratings, rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., “bad,” “neutral,” “good”). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher’s ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
[LINK]http://arxiv.org/abs/2601.09236v2
[DATE]2026-01-15 12:48:46+08:00
[CATEGORIES]cs.LG
Lifelong Domain Adaptive 3D Human Pose Estimation
[AUTHORS]Qucheng Peng, Hongfei Xue, Pu Wang, Chen Chen
[ABSTRACT]3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain’s adaptation and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.
[COMMENTS]Accepted by AAAI 2026
[LINK]http://arxiv.org/abs/2512.23860v2
[DATE]2026-01-15 12:43:31+08:00
[CATEGORIES]cs.LG
Eventually LIL Regret: Almost Sure $\ln\ln T$ Regret for a sub-Gaussian Mixture on Unbounded Data
[AUTHORS]Shubhada Agrawal, Aaditya Ramdas
[ABSTRACT]We prove that a classic sub-Gaussian mixture proposed by Robbins in a stochastic setting actually satisfies a path-wise (deterministic) regret bound. For every path in a natural “Ville event” $E_α$, this regret till time $T$ is bounded by $\ln^2(1/α)/V_T + \ln (1/α) + \ln \ln V_T$ up to universal constants, where $V_T$ is a nonnegative, nondecreasing, cumulative variance process. (The bound reduces to $\ln(1/α) + \ln \ln V_T$ if $V_T \geq \ln(1/α)$.) If the data were stochastic, then one can show that $E_α$ has probability at least $1-α$ under a wide class of distributions (eg: sub-Gaussian, symmetric, variance-bounded, etc.). In fact, we show that on the Ville event $E_0$ of probability one, the regret on every path in $E_0$ is eventually bounded by $\ln \ln V_T$ (up to constants). We explain how this work helps bridge the world of adversarial online learning (which usually deals with regret bounds for bounded data), with game-theoretic statistics (which can handle unbounded data, albeit using stochastic assumptions). In short, conditional regret bounds serve as a bridge between stochastic and adversarial betting.
[COMMENTS]To appear at ALT 2026
[LINK]http://arxiv.org/abs/2512.12325v2
[DATE]2026-01-15 12:33:41+08:00
[CATEGORIES]cs.LG
Query-Efficient Locally Private Hypothesis Selection via the Scheffe Graph
[AUTHORS]Gautam Kamath, Alireza F. Pour, Matthew Regehr, David P. Woodruff
[ABSTRACT]We propose an algorithm with improved query-complexity for the problem of hypothesis selection under local differential privacy constraints. Given a set of $k$ probability distributions $Q$, we describe an algorithm that satisfies local differential privacy, performs $\tilde{O}(k^{3/2})$ non-adaptive queries to individuals who each have samples from a probability distribution $p$, and outputs a probability distribution from the set $Q$ which is nearly the closest to $p$. Previous algorithms required either $Ω(k^2)$ queries or many rounds of interactive queries.
Technically, we introduce a new object we dub the Scheffé graph, which captures structure of the differences between distributions in $Q$, and may be of more broad interest for hypothesis selection tasks.
[LINK]http://arxiv.org/abs/2509.16180v2
[DATE]2026-01-15 12:24:41+08:00
[CATEGORIES]cs.LG
Unlabeled Data Can Provably Enhance In-Context Learning of Transformers
[AUTHORS]Renpu Liu, Jing Yang
[ABSTRACT]Large language models (LLMs) exhibit impressive in-context learning (ICL) capabilities, yet the quality of their predictions is fundamentally limited by the few costly labeled demonstrations that can fit into a prompt. Meanwhile, there exist vast and continuously growing amounts of unlabeled data that may be closely related to the ICL task. How to utilize such unlabeled data to provably enhance the performance of ICL thus becomes an emerging fundamental question. In this work, we propose a novel augmented ICL framework, in which the prompt includes a small set of labeled examples alongside a block of unlabeled inputs. We focus on the multi-class linear classification setting and demonstrate that, with chain-of-thought (CoT) prompting, a multi-layer transformer can effectively emulate an expectation-maximization (EM) algorithm. This enables the transformer to implicitly extract useful information from both labeled and unlabeled data, leading to provable improvements in ICL accuracy. Moreover, we show that such a transformer can be trained via teacher forcing, with its parameters converging to the desired solution at a linear rate. Experiments demonstrate that the augmented ICL framework consistently outperforms conventional few-shot ICL, providing empirical support for our theoretical findings. To the best of our knowledge, this is the first theoretical study on the impact of unlabeled data on the ICL performance of transformers.
[COMMENTS]Published as a conference paper at NeurIPS 2025
[LINK]http://arxiv.org/abs/2601.10058v1
[DATE]2026-01-15 12:23:32+08:00
[CATEGORIES]cs.LG
Wasserstein-p Central Limit Theorem Rates: From Local Dependence to Markov Chains
[AUTHORS]Yixuan Zhang, Qiaomin Xie
[ABSTRACT]Finite-time central limit theorem (CLT) rates play a central role in modern machine learning. In this paper, we study CLT rates for multivariate dependent data in Wasserstein-$p$ ($W_p$) distance, for general $p \geq 1$. We focus on two fundamental dependence structures that commonly arise in machine learning: locally dependent sequences and geometrically ergodic Markov chains. In both settings, we establish the first optimal $O(n^{-1/2})$ rate in $W_1$, as well as the first $W_p$ ($p\ge 2$) CLT rates under mild moment assumptions, substantially improving the best previously known bounds in these dependent-data regimes. As an application of our optimal $W_1$ rate for locally dependent sequences, we further obtain the first optimal $W_1$-CLT rate for multivariate $U$-statistics.
On the technical side, we derive a tractable auxiliary bound for $W_1$ Gaussian approximation errors that is well suited for studying dependent data. For Markov chains, we further prove that the regeneration time of the split chain associated with a geometrically ergodic chain has a geometric tail without assuming strong aperiodicity or other restrictive conditions. These tools may be of independent interests and enable our optimal $W_1$ rates and underpin our $W_p$ ($p\ge 2$) results.
[COMMENTS]51 pages
[LINK]http://arxiv.org/abs/2601.08184v2
[DATE]2026-01-15 11:46:02+08:00
[CATEGORIES]cs.LG
Instruction Finetuning LLaMA-3-8B Model Using LoRA for Financial Named Entity Recognition
[AUTHORS]Zhiming Lian
[ABSTRACT]Particularly, financial named-entity recognition (NER) is one of the many important approaches to translate unformatted reports and news into structured knowledge graphs. However, free, easy-to-use large language models (LLMs) often fail to differentiate organisations as people, or disregard an actual monetary amount entirely. This paper takes Meta’s Llama 3 8B and applies it to financial NER by combining instruction fine-tuning and Low-Rank Adaptation (LoRA). Each annotated sentence is converted into an instruction-input-output triple, enabling the model to learn task descriptions while fine-tuning with small low-rank matrices instead of updating all weights. Using a corpus of 1,693 sentences, our method obtains a micro-F1 score of 0.894 compared with Qwen3-8B, Baichuan2-7B, T5, and BERT-Base. We present dataset statistics, describe training hyperparameters, and perform visualizations of entity density, learning curves, and evaluation metrics. Our results show that instruction tuning combined with parameter-efficient fine-tuning enables state-of-the-art performance on domain-sensitive NER.
[LINK]http://arxiv.org/abs/2601.10043v1
[DATE]2026-01-15 11:41:00+08:00
[CATEGORIES]cs.LG
BPE: Behavioral Profiling Ensemble
[AUTHORS]Yanxin Liu, Yunqi Zhang
[ABSTRACT]Ensemble learning is widely recognized as a pivotal strategy for pushing the boundaries of predictive performance. Traditional static ensemble methods, such as Stacking, typically assign weights by treating each base learner as a holistic entity, thereby overlooking the fact that individual models exhibit varying degrees of competence across different regions of the instance space. To address this limitation, Dynamic Ensemble Selection (DES) was introduced. However, both static and dynamic approaches predominantly rely on the divergence among different models as the basis for integration. This inter-model perspective neglects the intrinsic characteristics of the models themselves and necessitates a heavy reliance on validation sets for competence estimation. In this paper, we propose the Behavioral Profiling Ensemble (BPE) framework, which introduces a novel paradigm shift. Unlike traditional methods, BPE constructs a “behavioral profile” intrinsic to each model and derives integration weights based on the deviation between the model’s response to a specific test instance and its established behavioral profile. Extensive experiments on both synthetic and real-world datasets demonstrate that the algorithm derived from the BPE framework achieves significant improvements over state-of-the-art ensemble baselines. These gains are evident not only in predictive accuracy but also in computational efficiency and storage resource utilization across various scenarios.
[LINK]http://arxiv.org/abs/2601.10024v1
[DATE]2026-01-15 11:14:51+08:00
[CATEGORIES]cs.LG
Time Aggregation Features for XGBoost Models
[AUTHORS]Mykola Pinchuk
[ABSTRACT]This paper studies time aggregation features for XGBoost models in click-through rate prediction. The setting is the Avazu click-through rate prediction dataset with strict out-of-time splits and a no-lookahead feature constraint. Features for hour H use only impressions from hours strictly before H. This paper compares a strong time-aware target encoding baseline to models augmented with entity history time aggregation under several window designs. Across two rolling-tail folds on a deterministic ten percent sample, a trailing window specification improves ROC AUC by about 0.0066 to 0.0082 and PR AUC by about 0.0084 to 0.0094 relative to target encoding alone. Within the time aggregation design grid, event count windows provide the only consistent improvement over trailing windows, and the gain is small. Gap windows and bucketized windows underperform simple trailing windows in this dataset and protocol. These results support a practical default of trailing windows, with an optional event count window when marginal ROC AUC gains matter.
[COMMENTS]17 pages, 18 tables and figures
[LINK]http://arxiv.org/abs/2601.10019v1
[DATE]2026-01-15 11:02:04+08:00
[CATEGORIES]cs.LG
CAFEDistill: Learning Personalized and Dynamic Models through Federated Early-Exit Network Distillation
[AUTHORS]Boyi Liu, Zimu Zhou, Yongxin Tong
[COMMENTS]12 pages, conference
[LINK]http://arxiv.org/abs/2601.10015v1
[DATE]2026-01-15 10:52:12+08:00
[CATEGORIES]cs.LG
PID-Guided Partial Alignment for Multimodal Decentralized Federated Learning
[AUTHORS]Yanhang Shi, Xiaoyu Wang, Houwei Cao, Jian Li, Yong Liu
[ABSTRACT]Multimodal decentralized federated learning (DFL) is challenging because agents differ in available modalities and model architectures, yet must collaborate over peer-to-peer (P2P) networks without a central coordinator. Standard multimodal pipelines learn a single shared embedding across all modalities. In DFL, such a monolithic representation induces gradient misalignment between uni- and multimodal agents; as a result, it suppresses heterogeneous sharing and cross-modal interaction. We present PARSE, a multimodal DFL framework that operationalizes partial information decomposition (PID) in a server-free setting. Each agent performs feature fission to factorize its latent representation into redundant, unique, and synergistic slices. P2P knowledge sharing among heterogeneous agents is enabled by slice-level partial alignment: only semantically shareable branches are exchanged among agents that possess the corresponding modality. By removing the need for central coordination and gradient surgery, PARSE resolves uni-/multimodal gradient conflicts, thereby overcoming the multimodal DFL dilemma while remaining compatible with standard DFL constraints. Across benchmarks and agent mixes, PARSE yields consistent gains over task-, modality-, and hybrid-sharing DFL baselines. Ablations on fusion operators and split ratios, together with qualitative visualizations, further demonstrate the efficiency and robustness of the proposed design.
[LINK]http://arxiv.org/abs/2601.10012v1
[DATE]2026-01-15 10:42:29+08:00
[CATEGORIES]cs.LG
Autoencoding Random Forests
[AUTHORS]Binh Duc Vu, Jan Kapar, Marvin Wright, David S. Watson
[ABSTRACT]We propose a principled method for autoencoding with random forests. Our strategy builds on foundational results from nonparametric statistics and spectral graph theory to learn a low-dimensional embedding of the model that optimally represents relationships in the data. We provide exact and approximate solutions to the decoding problem via constrained optimization, split relabeling, and nearest neighbors regression. These methods effectively invert the compression pipeline, establishing a map from the embedding space back to the input space using splits learned by the ensemble’s constituent trees. The resulting decoders are universally consistent under common regularity assumptions. The procedure works with supervised or unsupervised models, providing a window into conditional or joint distributions. We demonstrate various applications of this autoencoder, including powerful new tools for visualization, compression, clustering, and denoising. Experiments illustrate the ease and utility of our method in a wide range of settings, including tabular, image, and genomic data.
[COMMENTS]10 pages main text, 27 pages total. 9 figures, 4 tables. To be published in proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2505.21441v4
[DATE]2026-01-15 10:40:53+08:00
[CATEGORIES]cs.LG
Topological constraints on self-organisation in locally interacting systems
[AUTHORS]Francesco Sacco, Dalton A R Sakthivadivel, Michael Levin
[ABSTRACT]All intelligence is collective intelligence, in the sense that it is made of parts which must align with respect to system-level goals. Understanding the dynamics which facilitate or limit navigation of problem spaces by aligned parts thus impacts many fields ranging across life sciences and engineering. To that end, consider a system on the vertices of a planar graph, with pairwise interactions prescribed by the edges of the graph. Such systems can sometimes exhibit long-range order, distinguishing one phase of macroscopic behaviour from another. In networks of interacting systems we may view spontaneous ordering as a form of self-organisation, modelling neural and basal forms of cognition. Here, we discuss necessary conditions on the topology of the graph for an ordered phase to exist, with an eye towards finding constraints on the ability of a system with local interactions to maintain an ordered target state. By studying the scaling of free energy under the formation of domain walls in three model systems – the Potts model, autoregressive models, and hierarchical networks – we show how the combinatorics of interactions on a graph prevent or allow spontaneous ordering. As an application we are able to analyse why multiscale systems like those prevalent in biology are capable of organising into complex patterns, whereas rudimentary language models are challenged by long sequences of outputs.
[COMMENTS]11+3 pages, four figures, four tikzpictures. This version to appear in Philos Trans R Soc A
[LINK]http://arxiv.org/abs/2501.13188v2
[DATE]2026-01-15 10:14:16+08:00
[CATEGORIES]cs.LG
FaTRQ: Tiered Residual Quantization for LLM Vector Search in Far-Memory-Aware ANNS Systems
[AUTHORS]Tianqi Zhang, Flavio Ponzina, Tajana Rosing
[ABSTRACT]Approximate Nearest-Neighbor Search (ANNS) is a key technique in retrieval-augmented generation (RAG), enabling rapid identification of the most relevant high-dimensional embeddings from massive vector databases. Modern ANNS engines accelerate this process using prebuilt indexes and store compressed vector-quantized representations in fast memory. However, they still rely on a costly second-pass refinement stage that reads full-precision vectors from slower storage like SSDs. For modern text and multimodal embeddings, these reads now dominate the latency of the entire query. We propose FaTRQ, a far-memory-aware refinement system using tiered memory that eliminates the need to fetch full vectors from storage. It introduces a progressive distance estimator that refines coarse scores using compact residuals streamed from far memory. Refinement stops early once a candidate is provably outside the top-k. To support this, we propose tiered residual quantization, which encodes residuals as ternary values stored efficiently in far memory. A custom accelerator is deployed in a CXL Type-2 device to perform low-latency refinement locally. Together, FaTRQ improves the storage efficiency by 2.4$\times$ and improves the throughput by up to 9$ \times$ than SOTA GPU ANNS system.
[LINK]http://arxiv.org/abs/2601.09985v1
[DATE]2026-01-15 09:59:29+08:00
[CATEGORIES]cs.LG
Online Scheduling for LLM Inference with KV Cache Constraints
[AUTHORS]Patrick Jaillet, Jiashuo Jiang, Konstantina Mellou, Marco Molinaro, Chara Podimata, Zijie Zhou
[ABSTRACT]Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose a novel batching and scheduling algorithm that minimizes inference latency while effectively managing the KV cache’s memory.
More specifically, we make the following contributions. First, to evaluate the performance of online algorithms for scheduling in LLM inference, we introduce a hindsight optimal benchmark, formulated as an integer program that computes the minimum total inference latency under full future information. Second, we prove that no deterministic online algorithm can achieve a constant competitive ratio when the arrival process is arbitrary. Third, motivated by the computational intractability of solving the integer program at scale, we propose a polynomial-time online scheduling algorithm and show that under certain conditions it can achieve a constant competitive ratio. We also demonstrate our algorithm’s strong empirical performance by comparing it to the hindsight optimal in a synthetic dataset. Finally, we conduct empirical evaluations on a real-world public LLM inference dataset, simulating the Llama2-70B model on A100 GPUs, and show that our algorithm significantly outperforms the benchmark algorithms. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.
[LINK]http://arxiv.org/abs/2502.07115v5
[DATE]2026-01-15 09:54:52+08:00
[CATEGORIES]cs.LG
Bayesian Monocular Depth Refinement via Neural Radiance Fields
[AUTHORS]Arun Muthukkumar
[ABSTRACT]Monocular depth estimation has applications in many fields, such as autonomous navigation and extended reality, making it an essential computer vision task. However, current methods often produce smooth depth maps that lack the fine geometric detail needed for accurate scene understanding. We propose MDENeRF, an iterative framework that refines monocular depth estimates using depth information from Neural Radiance Fields (NeRFs). MDENeRF consists of three components: (1) an initial monocular estimate for global structure, (2) a NeRF trained on perturbed viewpoints, with per-pixel uncertainty, and (3) Bayesian fusion of the noisy monocular and NeRF depths. We derive NeRF uncertainty from the volume rendering process to iteratively inject high-frequency fine details. Meanwhile, our monocular prior maintains global structure. We demonstrate improvements on key metrics and experiments using indoor scenes from the SUN RGB-D dataset.
[COMMENTS]IEEE 8th International Conference on Algorithms, Computing and Artificial Intelligence (ACAI 2025)
[LINK]http://arxiv.org/abs/2601.03869v2
[DATE]2026-01-15 09:46:55+08:00
[CATEGORIES]cs.LG
Performance of AI agents based on reasoning language models on ALD process optimization tasks
[AUTHORS]Angel Yanguas-Gil
[ABSTRACT]In this work we explore the performance and behavior of reasoning large language models to autonomously optimize atomic layer deposition (ALD) processes. In the ALD process optimization task, an agent built on top of a reasoning LLM has to find optimal dose times for an ALD precursor and a coreactant without any prior knowledge on the process, including whether it is actually self-limited. The agent is meant to interact iteratively with an ALD reactor in a fully unsupervised way. We evaluate this agent using a simple model of an ALD tool that incorporates ALD processes with different self-limited surface reaction pathways as well as a non self-limited component. Our results show that agents based on reasoning models like OpenAI’s o3 and GPT5 consistently succeeded at completing this optimization task. However, we observed significant run-to-run variability due to the non deterministic nature of the model’s response. In order to understand the logic followed by the reasoning model, the agent uses a two step process in which the model first generates an open response detailing the reasoning process. This response is then transformed into a structured output. An analysis of these reasoning traces showed that the logic of the model was sound and that its reasoning was based on the notions of self-limited process and saturation expected in the case of ALD. However, the agent can sometimes be misled by its own prior choices when exploring the optimization space.
[LINK]http://arxiv.org/abs/2601.09980v1
[DATE]2026-01-15 09:46:49+08:00
[CATEGORIES]cs.LG
In-Context Operator Learning on the Space of Probability Measures
[AUTHORS]Frank Cole, Dixi Wang, Yineng Chen, Yulong Lu, Rongjie Lai
[ABSTRACT]We introduce \emph{in-context operator learning on probability measure spaces} for optimal transport (OT). The goal is to learn a single solution operator that maps a pair of distributions to the OT map, using only few-shot samples from each distribution as a prompt and \emph{without} gradient updates at inference. We parameterize the solution operator and develop scaling-law theory in two regimes. In the \emph{nonparametric} setting, when tasks concentrate on a low-intrinsic-dimension manifold of source–target pairs, we establish generalization bounds that quantify how in-context accuracy scales with prompt size, intrinsic task dimension, and model capacity. In the \emph{parametric} setting (e.g., Gaussian families), we give an explicit architecture that recovers the exact OT map in context and provide finite-sample excess-risk bounds. Our numerical experiments on synthetic transports and generative-modeling benchmarks validate the framework.
[LINK]http://arxiv.org/abs/2601.09979v1
[DATE]2026-01-15 09:44:10+08:00
[CATEGORIES]cs.LG
GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm
[AUTHORS]Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen
[ABSTRACT]Deep neural networks are highly vulnerable to adversarial examples, which are inputs with small, carefully crafted perturbations that cause misclassification – making adversarial attacks a critical tool for evaluating robustness. Existing black-box methods typically entail a trade-off between precision and flexibility: pixel-sparse attacks (e.g., single- or few-pixel attacks) provide fine-grained control but lack adaptability, whereas patch- or frequency-based attacks improve efficiency or transferability, but at the cost of producing larger and less precise perturbations. We present GreedyPixel, a fine-grained black-box attack method that performs brute-force-style, per-pixel greedy optimization guided by a surrogate-derived priority map and refined by means of query feedback. It evaluates each coordinate directly without any gradient information, guaranteeing monotonic loss reduction and convergence to a coordinate-wise optimum, while also yielding near white-box-level precision and pixel-wise sparsity and perceptual quality. On the CIFAR-10 and ImageNet datasets, spanning convolutional neural networks (CNNs) and Transformer models, GreedyPixel achieved state-of-the-art success rates with visually imperceptible perturbations, effectively bridging the gap between black-box practicality and white-box performance. The implementation is available at https://github.com/azrealwang/greedypixel.
[COMMENTS]IEEE Transactions on Information Forensics and Security
[LINK]http://arxiv.org/abs/2501.14230v4
[DATE]2026-01-15 09:36:12+08:00
[CATEGORIES]cs.LG
An Exploratory Study to Repurpose LLMs to a Unified Architecture for Time Series Classification
[AUTHORS]Hansen He, Shuheng Li
[ABSTRACT]Time series classification (TSC) is a core machine learning problem with broad applications. Recently there has been growing interest in repurposing large language models (LLMs) for TSC, motivated by their strong reasoning and generalization ability. Prior work has primarily focused on alignment strategies that explicitly map time series data into the textual domain; however, the choice of time series encoder architecture remains underexplored. In this work, we conduct an exploratory study of hybrid architectures that combine specialized time series encoders with a frozen LLM backbone. We evaluate a diverse set of encoder families, including Inception, convolutional, residual, transformer-based, and multilayer perceptron architectures, among which the Inception model is the only encoder architecture that consistently yields positive performance gains when integrated with an LLM backbone. Overall, this study highlights the impact of time series encoder choice in hybrid LLM architectures and points to Inception-based models as a promising direction for future LLM-driven time series learning.
[LINK]http://arxiv.org/abs/2601.09971v1
[DATE]2026-01-15 09:22:53+08:00
[CATEGORIES]cs.LG
A Sustainable AI Economy Needs Data Deals That Work for Generators
[AUTHORS]Ruoxi Jia, Luis Oala, Wenjie Xiong, Suqin Ge, Jiachen T. Wang, Feiyang Kang, Dawn Song
[ABSTRACT]We argue that the machine learning value chain is structurally unsustainable due to an economic data processing inequality: each state in the data cycle from inputs to model weights to synthetic outputs refines technical signal but strips economic equity from data generators. We show, by analyzing seventy-three public data deals, that the majority of value accrues to aggregators, with documented creator royalties rounding to zero and widespread opacity of deal terms. This is not just an economic welfare concern: as data and its derivatives become economic assets, the feedback loop that sustains current learning algorithms is at risk. We identify three structural faults - missing provenance, asymmetric bargaining power, and non-dynamic pricing - as the operational machinery of this inequality. In our analysis, we trace these problems along the machine learning value chain and propose an Equitable Data-Value Exchange (EDVEX) Framework to enable a minimal market that benefits all participants. Finally, we outline research directions where our community can make concrete contributions to data deals and contextualize our position with related and orthogonal viewpoints.
[COMMENTS]Published at NeurIPS 2025 (https://neurips.cc/virtual/2025/loc/san-diego/poster/121926)
[LINK]http://arxiv.org/abs/2601.09966v1
[DATE]2026-01-15 09:05:48+08:00
[CATEGORIES]cs.LG
MatchMiner-AI: An Open-Source Solution for Cancer Clinical Trial Matching
[AUTHORS]Jennifer Altreuter, Pavel Trukhanov, Morgan A. Paul, Michael J. Hassett, Irbaz B. Riaz, Muhammad Umar Afzal, Arshad A. Mohammed, Sarah Sammons, James Lindsay, Emily Mallaber, Harry R. Klein, Gufran Gungor, Matthew Galvin, Michael Deletto, Stephen C. Van Nostrand, James Provencher, Joyce Yu, Naeem Tahir, Jonathan Wischhusen, Olga Kozyreva, Taylor Ortiz, Hande Tuncer, Jad El Masri, Alys Malcolm, Tali Mazor, Ethan Cerami, Kenneth L. Kehl
[ABSTRACT]Background Clinical trials are essential to advancing cancer treatments, yet fewer than 10% of adults with cancer enroll in trials, and many studies fail to meet accrual targets. Artificial intelligence (AI) could improve identification of appropriate trials for patients, but sharing AI models trained on protected health information remains difficult due to privacy restrictions.
Methods We developed MatchMiner-AI, an open-source platform for clinical trial search and ranking trained entirely on synthetic electronic health record (EHR) data. The system extracts core clinical criteria from longitudinal EHR text and embeds patient summaries and trial “spaces” (target populations) in a shared vector space for rapid retrieval. It then applies custom text classifiers to assess whether each patient-trial pairing is a clinically reasonable consideration. The pipeline was evaluated on real clinical data.
Results Across retrospective evaluations on real EHR data, the fine-tuned pipeline outperformed baseline text-embedding approaches. For trial-enrolled patients, 90% of the top 20 recommended trials were relevant matches (compared to 17% for the baseline model). Similar improvements were noted for patients who received standard-of-care treatments (88% of the top 20 matches were relevant, compared to 14% for baseline). Text classification modules demonstrated strong discrimination (AUROC 0.94-0.98) for evaluating candidate patient-trial space pair eligibility; incorporating these components consistently increased mean average precision to ~ 0.90 across patient- and trial-centric use cases. Synthetic training data, model weights, inference tools, and demonstration frontends are publicly available.
Conclusions MatchMiner-AI demonstrates an openly accessible, privacy-preserving approach to distilling a clinical trial matching AI pipeline from LLM-generated synthetic EHR data.
[LINK]http://arxiv.org/abs/2412.17228v3
[DATE]2026-01-15 09:01:39+08:00
[CATEGORIES]cs.LG
Kinematic Tokenization: Optimization-Based Continuous-Time Tokens for Learnable Decision Policies in Noisy Time Series
[AUTHORS]Griffin Kearney
[ABSTRACT]Transformers are designed for discrete tokens, yet many real-world signals are continuous processes observed through noisy sampling. Discrete tokenizations (raw values, patches, finite differences) can be brittle in low signal-to-noise regimes, especially when downstream objectives impose asymmetric penalties that rationally encourage abstention. We introduce Kinematic Tokenization, an optimization-based continuous-time representation that reconstructs an explicit spline from noisy measurements and tokenizes local spline coefficients (position, velocity, acceleration, jerk). This is applied to financial time series data in the form of asset prices in conjunction with trading volume profiles. Across a multi-asset daily-equity testbed, we use a risk-averse asymmetric classification objective as a stress test for learnability. Under this objective, several discrete baselines collapse to an absorbing cash policy (the Liquidation Equilibrium), whereas the continuous spline tokens sustain calibrated, non-trivial action distributions and stable policies. These results suggest that explicit continuous-time tokens can improve the learnability and calibration of selective decision policies in noisy time series under abstention-inducing losses.
[LINK]http://arxiv.org/abs/2601.09949v1
[DATE]2026-01-15 08:21:02+08:00
[CATEGORIES]cs.LG
Interpolation-Based Optimization for Enforcing lp-Norm Metric Differential Privacy in Continuous and Fine-Grained Domains
[AUTHORS]Chenxi Qiu
[ABSTRACT]Metric Differential Privacy (mDP) generalizes Local Differential Privacy (LDP) by adapting privacy guarantees based on pairwise distances, enabling context-aware protection and improved utility. While existing optimization-based methods reduce utility loss effectively in coarse-grained domains, optimizing mDP in fine-grained or continuous settings remains challenging due to the computational cost of constructing dense perterubation matrices and satisfying pointwise constraints.
In this paper, we propose an interpolation-based framework for optimizing lp-norm mDP in such domains. Our approach optimizes perturbation distributions at a sparse set of anchor points and interpolates distributions at non-anchor locations via log-convex combinations, which provably preserve mDP. To address privacy violations caused by naive interpolation in high-dimensional spaces, we decompose the interpolation process into a sequence of one-dimensional steps and derive a corrected formulation that enforces lp-norm mDP by design. We further explore joint optimization over perturbation distributions and privacy budget allocation across dimensions. Experiments on real-world location datasets demonstrate that our method offers rigorous privacy guarantees and competitive utility in fine-grained domains, outperforming baseline mechanisms. in high-dimensional spaces, we decompose the interpolation process into a sequence of one-dimensional steps and derive a corrected formulation that enforces lp-norm mDP by design. We further explore joint optimization over perturbation distributions and privacy budget allocation across dimensions. Experiments on real-world location datasets demonstrate that our method offers rigorous privacy guarantees and competitive utility in fine-grained domains, outperforming baseline mechanisms.
[COMMENTS]USENIX Security 2026
[LINK]http://arxiv.org/abs/2601.09946v1
[DATE]2026-01-15 08:12:54+08:00
[CATEGORIES]cs.LG
A Taxonomy for Evaluating Generalist Robot Manipulation Policies
[AUTHORS]Jensen Gao, Suneel Belkhale, Sudeep Dasari, Ashwin Balakrishna, Dhruv Shah, Dorsa Sadigh
[ABSTRACT]Machine learning for robot manipulation promises to unlock generalization to novel tasks and environments. But how should we measure the progress of these policies towards generalization? Evaluating and quantifying generalization is the Wild West of modern robotics, with each work proposing and measuring different types of generalization in their own, often difficult to reproduce settings. In this work, our goal is (1) to outline the forms of generalization we believe are important for robot manipulation in a comprehensive and fine-grained manner, and (2) to provide reproducible guidelines for measuring these notions of generalization. We first propose STAR-Gen, a taxonomy of generalization for robot manipulation structured around visual, semantic, and behavioral generalization. Next, we instantiate STAR-Gen with two case studies on real-world benchmarking: one based on open-source models and the Bridge V2 dataset, and another based on the bimanual ALOHA 2 platform that covers more dexterous and longer horizon tasks. Our case studies reveal many interesting insights: for example, we observe that open-source vision-language-action models often struggle with semantic generalization, despite pre-training on internet-scale language datasets. We provide videos and other supplementary material at our website stargen-taxonomy.github.io.
[COMMENTS]IEEE Robotics and Automation Letters (RA-L)
[LINK]http://arxiv.org/abs/2503.01238v2
[DATE]2026-01-15 08:06:44+08:00
[CATEGORIES]cs.LG
Deep Jump Gaussian Processes for Surrogate Modeling of High-Dimensional Piecewise Continuous Functions
[AUTHORS]Yang Xu, Chiwoo Park
[ABSTRACT]We introduce Deep Jump Gaussian Processes (DJGP), a novel method for surrogate modeling of a piecewise continuous function on a high-dimensional domain. DJGP addresses the limitations of conventional Jump Gaussian Processes (JGP) in high-dimensional input spaces by integrating region-specific, locally linear projections with JGP modeling. These projections employ region-dependent matrices to capture local low-dimensional subspace structures, making them well suited to the inherently localized modeling behavior of JGPs, a variant of local Gaussian processes. To control model complexity, we place a Gaussian Process prior on the projection matrices, allowing them to evolve smoothly across the input space. The projected inputs are then modeled with a JGP to capture piecewise continuous relationships with the response. This yields a distinctive two-layer deep learning of GP/JGP. We further develop a scalable variational inference algorithm to jointly learn the projection matrices and JGP hyperparameters. Rigorous theoretical analysis and extensive empirical studies are provided to justify the proposed approach. In particular, we derive an oracle error bound for DJGP and decompose it into four distinct sources of error, which are then linked to practical implications. Experiments on synthetic and benchmark datasets demonstrate that DJGP achieves superior predictive accuracy and more reliable uncertainty quantification compared with existing methods.
[LINK]http://arxiv.org/abs/2510.21974v2
[DATE]2026-01-15 07:54:48+08:00
[CATEGORIES]cs.LG
Permissive Information-Flow Analysis for Large Language Models
[AUTHORS]Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella-Béguelin
[ABSTRACT]Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model’s behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a $k$-nearest-neighbors language model. We compare these with a baseline that uses introspection to predict the output label. Our experimental results in an LLM agent setting show that the permissive label propagator improves over the baseline in more than 85% of the cases, which underscores the practicality of our approach.
[LINK]http://arxiv.org/abs/2410.03055v3
[DATE]2026-01-15 07:52:42+08:00
[CATEGORIES]cs.LG
From Dormant to Deleted: Tamper-Resistant Unlearning Through Weight-Space Regularization
[AUTHORS]Shoaib Ahmed Siddiqui, Adrian Weller, David Krueger, Gintare Karolina Dziugaite, Michael Curtis Mozer, Eleni Triantafillou
[LINK]http://arxiv.org/abs/2505.22310v2
[DATE]2026-01-15 07:36:05+08:00
[CATEGORIES]cs.LG
Geometric and Dynamic Scaling in Deep Transformers
[AUTHORS]Haoran Su, Chenyu You
[ABSTRACT]Despite their empirical success, pushing Transformer architectures to extreme depth often leads to a paradoxical failure: representations become increasingly redundant, lose rank, and ultimately collapse. Existing explanations largely attribute this phenomenon to optimization instability or vanishing gradients, yet such accounts fail to explain why collapse persists even under modern normalization and initialization schemes. In this paper, we argue that the collapse of deep Transformers is fundamentally a geometric problem. Standard residual updates implicitly assume that feature accumulation is always beneficial, but offer no mechanism to constrain update directions or to erase outdated information. As depth increases, this leads to systematic drift off the semantic manifold and monotonic feature accumulation, causing representational degeneracy. We propose a unified geometric framework that addresses these failures through two orthogonal principles. First, manifold-constrained hyper-connections restrict residual updates to valid local tangent directions, preventing uncontrolled manifold drift. Second, deep delta learning introduces data-dependent, non-monotonic updates that enable reflection and erasure of redundant features rather than their unconditional accumulation. Together, these mechanisms decouple the direction and sign of feature updates, yielding a stable geometric evolution across depth. We term the resulting architecture the Manifold-Geometric Transformer (MGT). Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks. We outline an evaluation protocol for Transformers exceeding 100 layers to test the hypothesis that geometry, rather than depth itself, is the key limiting factor in deep representation learning.
[COMMENTS]Research Proposal Only
[LINK]http://arxiv.org/abs/2601.01014v3
[DATE]2026-01-15 07:33:37+08:00
[CATEGORIES]cs.LG
The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation
[AUTHORS]Kirandeep Kaur, Vinayak Gupta, Aditya Gupta, Chirag Shah
[ABSTRACT]Most language-based assistants follow a reactive ask-and-respond paradigm, requiring users to explicitly state their needs. As a result, relevant but unexpressed needs often go unmet. Existing proactive agents attempt to address this gap either by eliciting further clarification, preserving this burden, or by extrapolating future needs from context, often leading to unnecessary or mistimed interventions. We introduce ProPer, Proactivity-driven Personalized agents, a novel two-agent architecture consisting of a Dimension Generating Agent (DGA) and a Response Generating Agent (RGA). DGA, a fine-tuned LLM agent, leverages explicit user data to generate multiple implicit dimensions (latent aspects relevant to the user’s task but not considered by the user) or knowledge gaps. These dimensions are selectively filtered using a reranker based on quality, diversity, and task relevance. RGA then balances explicit and implicit dimensions to tailor personalized responses with timely and proactive interventions. We evaluate ProPer across multiple domains using a structured, gap-aware rubric that measures coverage, initiative appropriateness, and intent alignment. Our results show that ProPer improves quality scores and win rates across all domains, achieving up to 84% gains in single-turn evaluation and consistent dominance in multi-turn interactions.
[LINK]http://arxiv.org/abs/2601.09926v1
[DATE]2026-01-15 07:13:01+08:00
[CATEGORIES]cs.LG
Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation
[AUTHORS]Sungmin Cha, Kyunghyun Cho
[ABSTRACT]Knowledge distillation (KD) is a core component in the training and deployment of modern generative models, particularly large language models (LLMs). While its empirical benefits are well documented – enabling smaller student models to emulate the performance of much larger teachers – the underlying mechanisms by which KD improves generative quality remain poorly understood. In this work, we present a minimal working explanation of KD in generative modeling. Using a controlled simulation with mixtures of Gaussians, we demonstrate that distillation induces a trade-off between precision and recall in the student model. As the teacher distribution becomes more selective, the student concentrates more probability mass on high-likelihood regions at the expense of coverage, which is a behavior modulated by a single entropy-controlling parameter. We then validate this effect in a large-scale language modeling setup using the SmolLM2 family of models. Empirical results reveal the same precision-recall dynamics observed in simulation, where precision corresponds to sample quality and recall to distributional coverage. This precision-recall trade-off in LLMs is found to be especially beneficial in scenarios where sample quality is more important than diversity, such as instruction tuning or downstream generation. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
[COMMENTS]NeurIPS 2025 camera ready version
[LINK]http://arxiv.org/abs/2505.13111v3
[DATE]2026-01-15 06:57:57+08:00
[CATEGORIES]cs.LG
An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model
[AUTHORS]Enoch H. Kang, Hema Yoganarasimhan, Lalit Jain
[ABSTRACT]We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition – a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
[LINK]http://arxiv.org/abs/2502.14131v6
[DATE]2026-01-15 06:45:37+08:00
[CATEGORIES]cs.LG
Attn-JGNN: Attention Enhanced Join-Graph Neural Networks
[AUTHORS]Jixin Zhang
[ABSTRACT]We propose an Attention Enhanced Join-Graph Neural Networks(Attn-JGNN) model for solving #SAT problems, which significantly improves the solving accuracy. Inspired by the Iterative Join Graph Propagation (IJGP) algorithm, Attn-JGNN uses tree decomposition to encode the CNF formula into a join-graph, then performs iterative message passing on the join-graph, and finally approximates the model number by learning partition functions. In order to further improve the accuracy of the solution, we apply the attention mechanism in and between clusters of the join-graphs, which makes Attn-JGNN pay more attention to the key variables and clusters in probabilistic inference, and reduces the redundant calculation. Finally, our experiments show that our Attn-JGNN model achieves better results than other neural network methods.
[LINK]http://arxiv.org/abs/2510.15583v2
[DATE]2026-01-15 06:23:38+08:00
[CATEGORIES]cs.LG
Calibrating Generative Models to Distributional Constraints
[AUTHORS]Henry D. Smith, Nathaniel L. Diamant, Brian L. Trippe
[ABSTRACT]Generative models frequently suffer miscalibration, wherein statistics of the sampling distribution such as class probabilities deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in Kullback-Leibler divergence satisfying calibration constraints. To address the intractability of imposing these constraints exactly, we introduce two surrogate objectives for fine-tuning: (1) the relax loss, which replaces the constraint with a miscalibration penalty, and (2) the reward loss, which converts calibration into a reward fine-tuning problem. We demonstrate that these approaches substantially reduce calibration error across hundreds of simultaneous constraints and models with up to one billion parameters, spanning applications in protein design, image generation, and language modeling.
[COMMENTS]Our codebase accompanying the paper is available at: https://github.com/smithhenryd/cgm
[LINK]http://arxiv.org/abs/2510.10020v2
[DATE]2026-01-15 06:21:51+08:00
[CATEGORIES]cs.LG
Future-as-Label: Scalable Supervision from Real-World Outcomes
[AUTHORS]Benjamin Turtel, Paul Wilczewski, Danny Franklin, Kris Skothiem
[ABSTRACT]Time creates free supervision: forecasts about real-world events resolve to verifiable outcomes. The passage of time provides labels that require no annotation. To exploit this structure, we extend reinforcement learning with verifiable rewards to real-world prediction over time. We train language models to make probabilistic forecasts from causally masked information, using proper scoring rules as the reward function once events resolve. Learning is driven entirely by realized outcomes, enabling scalable outcome-based supervision in open-world prediction. On real-world forecasting benchmarks, Qwen3-32B trained using Foresight Learning improves Brier score by 27% and halves calibration error relative to its pretrained baseline, and outperforms Qwen3-235B on both constructed future-event prediction tasks and the Metaculus benchmark despite a 7x parameter disadvantage.
[LINK]http://arxiv.org/abs/2601.06336v2
[DATE]2026-01-15 06:14:08+08:00
[CATEGORIES]cs.LG
EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence
[AUTHORS]Ansel Kaplan Erol, Seungjun Lee, Divya Mahajan
[ABSTRACT]Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit. However, these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute costs, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a distributed decision problem between orbit and ground. EarthSight introduces three core innovations: (1) multi-task inference on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a ground-station query scheduler that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) dynamic filter ordering, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. EarthSight leverages global context from ground stations and resource-aware adaptive decisions in orbit to enable constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. Evaluations using a prior established satellite simulator show that EarthSight reduces average compute time per image by 1.9x and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to the state-of-the-art baseline.
[LINK]http://arxiv.org/abs/2511.10834v2
[DATE]2026-01-15 06:13:17+08:00
[CATEGORIES]cs.LG
Evaluating Large Language Models for Fair and Reliable Organ Allocation
[AUTHORS]Brian Hyeongseok Kim, Hannah Murray, Isabelle Lee, Jason Byun, Joshua Lum, Dani Yogatama, Evi Micha
[ABSTRACT]Medical institutions are considering the use of LLMs in high-stakes clinical decision-making, such as organ allocation. In such sensitive use cases, evaluating fairness is imperative. However, existing evaluation methods often fall short; benchmarks are too simplistic to capture real-world complexity, and accuracy-based metrics fail to address the absence of a clear ground truth. To realistically and fairly model organ allocation, specifically kidney allocation, we begin by testing the medical knowledge of LLMs to determine whether they understand the clinical factors required to make sound allocation decisions. Building on this foundation, we design two tasks: (1) Choose-One and (2) Rank-All. In Choose-One, LLMs select a single candidate from a list of potential candidates to receive a kidney. In this scenario, we assess fairness across demographics using traditional fairness metrics, such as proportional parity. In Rank-All, LLMs rank all candidates waiting for a kidney, reflecting real-world allocation processes more closely, where an organ is passed down a ranked list until allocated. Our evaluation on three LLMs reveals a divergence between fairness metrics: while exposure-based metrics suggest equitable outcomes, probability-based metrics uncover systematic preferential sorting, where specific groups were clustered in upper-ranking tiers. Furthermore, we observe that demographic preferences are highly task-dependent, showing inverted trends between Choose-One and Rank-All tasks, even when considering the topmost rank. Overall, our results indicate that current LLMs can introduce inequalities in real-world allocation scenarios, underscoring the urgent need for rigorous fairness evaluation and human oversight before their use in high-stakes decision-making.
[LINK]http://arxiv.org/abs/2504.03716v2
[DATE]2026-01-15 06:02:49+08:00
[CATEGORIES]cs.LG
Geometric Algorithms for Neural Combinatorial Optimization with Constraints
[AUTHORS]Nikolaos Karalias, Akbar Rafiey, Yifei Xu, Zhishang Luo, Behrooz Tahmasebi, Connie Jiang, Stefanie Jegelka
[ABSTRACT]Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an emerging paradigm for solving combinatorial problems using neural networks. In this paper, we address a central challenge of SSL for CO: solving problems with discrete constraints. We design an end-to-end differentiable framework that enables us to solve discrete constrained optimization problems with neural networks. Concretely, we leverage algorithmic techniques from the literature on convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners that correspond to feasible sets. This decomposition-based approach enables self-supervised training but also ensures efficient quality-preserving rounding of the neural net output into feasible solutions. Extensive experiments in cardinality-constrained optimization show that our approach can consistently outperform neural baselines. We further provide worked-out examples of how our method can be applied beyond cardinality-constrained problems to a diverse set of combinatorial optimization tasks, including finding independent sets in graphs, and solving matroid-constrained problems.
[LINK]http://arxiv.org/abs/2510.24039v3
[DATE]2026-01-15 05:57:33+08:00
[CATEGORIES]cs.LG
SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
[AUTHORS]Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jianfei Chen, Jun Zhu
[ABSTRACT]The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on RTX5090, which is a 5x speedup over the fastest FlashAttention on RTX5090. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
[LINK]http://arxiv.org/abs/2505.11594v3
[DATE]2026-01-15 05:51:33+08:00
[CATEGORIES]cs.LG
Privacy amplification by random allocation
[AUTHORS]Vitaly Feldman, Moshe Shenfeld
[ABSTRACT]We consider the privacy amplification properties of a sampling scheme in which a user’s data is used in k steps chosen randomly and uniformly from a sequence (or set) of t steps. This sampling scheme has been recently applied in the context of differentially private optimization [Chua et al., 2024a, Choquette-Choo et al., 2025] and is also motivated by communication-efficient high-dimensional private aggregation [Asi et al., 2025]. Existing analyses of this scheme either rely on privacy amplification by shuffling which leads to overly conservative bounds or require Monte Carlo simulations that are computationally prohibitive in most practical scenarios.
We give the first theoretical guarantees and numerical estimation algorithms for this sampling scheme. In particular, we demonstrate that the privacy guarantees of random k-out-of-t allocation can be upper bounded by the privacy guarantees of the well-studied independent (or Poisson) subsampling in which each step uses the user’s data with probability $(1+o(1))k/t$. Further, we provide two additional analysis techniques that lead to numerical improvements in several parameter regimes. Altogether, our bounds give efficiently-computable and nearly tight numerical results for random allocation applied to Gaussian noise addition.
[LINK]http://arxiv.org/abs/2502.08202v4
[DATE]2026-01-15 05:36:37+08:00
[CATEGORIES]cs.LG
Transition Matching Distillation for Fast Video Generation
[AUTHORS]Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, Arash Vahdat
[ABSTRACT]Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd
[LINK]http://arxiv.org/abs/2601.09881v1
[DATE]2026-01-15 05:30:03+08:00
[CATEGORIES]cs.LG
A Kolmogorov metric embedding for live cell microscopy signaling patterns
[AUTHORS]Layton Aho, Mark Winter, Marc DeCarlo, Agne Frismantiene, Yannick Blum, Paolo Armando Gagliardi, Olivier Pertz, Andrew R. Cohen
[ABSTRACT]We present a metric embedding that captures spatiotemporal patterns of cell signaling dynamics in 5-D $(x,y,z,channel,time)$ live cell microscopy movies. The embedding uses a metric distance called the normalized information distance (NID) based on Kolmogorov complexity theory, an absolute measure of information content between digital objects. The NID uses statistics of lossless compression to compute a theoretically optimal metric distance between pairs of 5-D movies, requiring no a priori knowledge of expected pattern dynamics, and no training data. The cell signaling structure function (SSF) is defined using a class of metric 3-D image filters that compute at each spatiotemporal cell centroid the voxel intensity configuration of the nucleus w.r.t. the surrounding cytoplasm, or a functional output e.g. velocity. The only parameter is the expected cell radii ($μm$). The SSF can be optionally combined with segmentation and tracking algorithms. The resulting lossless compression pipeline represents each 5-D input movie as a single point in a metric embedding space. The utility of a metric embedding follows from Euclidean distance between any points in the embedding space approximating optimally the pattern difference, as measured by the NID, between corresponding pairs of 5-D movies. This is true throughout the embedding space, not only at points corresponding to input images. Examples are shown for synthetic data, for 2-D+time movies of ERK and AKT signaling under different oncogenic mutations in human epithelial (MCF10A) cells, for 3-D MCF10A spheroids under optogenetic manipulation of ERK, and for ERK dynamics during colony differentiation in human stem cells.
[LINK]http://arxiv.org/abs/2401.02501v5
[DATE]2026-01-15 05:29:49+08:00
[CATEGORIES]cs.LG
Statistical Taylor Expansion: A New and Path-Independent Method for Uncertainty Analysis
[AUTHORS]Chengpu Wang
[ABSTRACT]As a rigorous statistical approach, statistical Taylor expansion extends the conventional Taylor expansion by replacing precise input variables with random variables of known distributions, to compute means and standard deviations of the results. Statistical Taylor expansion traces the dependency of the input uncertainties in the intermediate steps, so that the variables in the intermediate analytic expressions can no longer be regarded as independent of each other, and the result of the analytic expression is path independent. Thus, it differs fundamentally from the conventional common approaches in applied mathematics which optimize execution path for each calculation. In fact, statistical Taylor expansion may standardize numerical calculations for analytic expressions. Its statistical nature allows religious testing of its result when the sample size is large enough. This paper also introduces an implementation of statistical Taylor expansion called variance arithmetic and presents corresponding test results in a very wide range of mathematical applications.
Another important conclusion of this paper is that the numerical errors in the library function can have significant effects on the result. For example, the periodic numerical errors in the trigonometric library functions can resonate with periodic signals, producing large numerical errors in the results.
[COMMENTS]83 pages, 67 figures
[LINK]http://arxiv.org/abs/2410.01223v16
[DATE]2026-01-15 05:29:47+08:00
[CATEGORIES]cs.LG
HERMES: Holographic Equivariant neuRal network model for Mutational Effect and Stability prediction
[AUTHORS]Gian Marco Visani, William Galvin, Zac Jones, Michael N. Pun, Eric Daniel, Kevin Borisiak, Utheri Wagura, Armita Nourmohammad
[ABSTRACT]Predicting the stability and fitness effects of amino acid mutations in proteins is a cornerstone of biological discovery and engineering. Various experimental techniques have been developed to measure mutational effects, providing us with extensive datasets across a diverse range of proteins. By training on these data, traditional computational modeling and more recent machine learning approaches have advanced significantly in predicting mutational effects. Here, we introduce HERMES, a 3D rotationally equivariant structure-based neural network model for mutational effect and stability prediction. Pre-trained to predict amino acid propensity from its surrounding 3D structure, HERMES can be fine-tuned for mutational effects using our open-source code. We present a suite of HERMES models, pre-trained with different strategies, and fine-tuned to predict the stability effect of mutations. Benchmarking against other models shows that HERMES often outperforms or matches their performance in predicting mutational effect on stability, binding, and fitness. HERMES offers versatile tools for evaluating mutational effects and can be fine-tuned for specific predictive objectives.
[LINK]http://arxiv.org/abs/2407.06703v2
[DATE]2026-01-15 05:08:56+08:00
[CATEGORIES]cs.LG
VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching
[AUTHORS]Kiarie Ndegwa, Andreas Gros, Tony Chang, David Diaz, Vincent A. Landau, Nathan E. Rutenbeck, Luke J. Zachmann, Guy Bayes, Scott Conway
[ABSTRACT]We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights >= 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.
[COMMENTS]12 pages, 8 figures, 2 tables
[LINK]http://arxiv.org/abs/2601.09866v1
[DATE]2026-01-15 04:56:35+08:00
[CATEGORIES]cs.LG
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization
[AUTHORS]Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, Deepak Ramachandran
[ABSTRACT]A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.01930v4
[DATE]2026-01-15 04:55:41+08:00
[CATEGORIES]cs.LG
Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment
[AUTHORS]Jacob Sander, Brian Jalaian, Venkat R. Dasari
[ABSTRACT]Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices due to high computational, memory, and energy demands. Optimizing these models requires addressing three key challenges: acquiring task-specific data, fine-tuning for performance, and compressing models to accelerate inference while reducing resource demands. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process to significantly reduce model size and complexity while preserving or enhancing task-specific performance. By leveraging data distillation, knowledge distillation via Kullback-Leibler divergence, Bayesian hyperparameter optimization, and the Muon optimizer, our pipeline achieves up to 2x memory compression (e.g., reducing a 6GB model to 3GB) and enables efficient inference for specialized tasks. Empirical results demonstrate superior performance on standard LLM benchmarks compared to GPTQ quantization alone, with the Muon optimizer notably enhancing fine-tuned models’ resistance to accuracy decay during quantization.
[COMMENTS]12 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.09865v1
[DATE]2026-01-15 04:50:30+08:00
[CATEGORIES]cs.LG
CleanSurvival: Automated data preprocessing for time-to-event models using reinforcement learning
[AUTHORS]Yousef Koka, David Selby, Gerrit Großmann, Sebastian Vollmer, Kathan Pandya
[ABSTRACT]Data preprocessing is a critical yet frequently neglected aspect of machine learning, often paid little attention despite its potentially significant impact on model performance. While automated machine learning pipelines are starting to recognize and integrate data preprocessing into their solutions for classification and regression tasks, this integration is lacking for more specialized tasks like survival or time-to-event models. As a result, survival analysis not only faces the general challenges of data preprocessing but also suffers from the lack of tailored, automated solutions in this area. To address this gap, this paper presents ‘CleanSurvival’, a reinforcement-learning-based solution for optimizing preprocessing pipelines, extended specifically for survival analysis. The framework can handle continuous and categorical variables, using Q-learning to select which combination of data imputation, outlier detection and feature extraction techniques achieves optimal performance for a Cox, random forest, neural network or user-supplied time-to-event model. The package is available on GitHub: https://github.com/datasciapps/CleanSurvival Experimental benchmarks on real-world datasets show that the Q-learning-based data preprocessing results in superior predictive performance to standard approaches, finding such a model up to 10 times faster than undirected random grid search. Furthermore, a simulation study demonstrates the effectiveness in different types and levels of missingness and noise in the data.
[LINK]http://arxiv.org/abs/2502.03946v2
[DATE]2026-01-15 04:45:45+08:00
[CATEGORIES]cs.LG
Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP
[AUTHORS]Anant Mehta, Xiyuan Wei, Xingyu Chen, Tianbao Yang
[COMMENTS]Submitted to ICLR 2026
[LINK]http://arxiv.org/abs/2601.09859v1
[DATE]2026-01-15 04:38:36+08:00
[CATEGORIES]cs.LG
Challenges and Research Directions for Large Language Model Inference Hardware
[AUTHORS]Xiaoyu Ma, David Patterson
[ABSTRACT]Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speedup communication. While our focus is datacenter AI, we also review their applicability for mobile devices.
[COMMENTS]Accepted for publication by IEEE Computer, 2026
[LINK]http://arxiv.org/abs/2601.05047v2
[DATE]2026-01-15 04:37:46+08:00
[CATEGORIES]cs.LG
Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
[AUTHORS]Jucheng Shen, Gaurav Sarkar, Yeonju Ro, Sharath Nittur Sridhar, Zhangyang Wang, Aditya Akella, Souvik Kundu
[ABSTRACT]We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
[COMMENTS]9 pages, 3 figures. Preprint under review
[LINK]http://arxiv.org/abs/2512.07173v2
[DATE]2026-01-15 04:22:57+08:00
[CATEGORIES]cs.LG
Non-Expansive Mappings in Two-Time-Scale Stochastic Approximation: Finite-Time Analysis
[AUTHORS]Siddharth Chandak
[ABSTRACT]Two-time-scale stochastic approximation algorithms are iterative methods used in applications such as optimization, reinforcement learning, and control. Finite-time analysis of these algorithms has primarily focused on fixed point iterations where both time-scales have contractive mappings. In this work, we broaden the scope of such analyses by considering settings where the slower time-scale has a non-expansive mapping. For such algorithms, the slower time-scale can be viewed as a stochastic inexact Krasnoselskii-Mann iteration. We also study a variant where the faster time-scale has a projection step which leads to non-expansiveness in the slower time-scale. We show that the last-iterate mean square residual error for such algorithms decays at a rate $O(1/k^{1/4-ε})$, where $ε>0$ is arbitrarily small. We further establish almost sure convergence of iterates to the set of fixed points. We demonstrate the applicability of our framework by applying our results to minimax optimization, linear stochastic approximation, and Lagrangian optimization.
[COMMENTS]Submitted to SIAM Journal on Control and Optimization
[LINK]http://arxiv.org/abs/2501.10806v3
[DATE]2026-01-15 04:17:50+08:00
[CATEGORIES]cs.LG
Accelerated Regularized Wasserstein Proximal Sampling Algorithms
[AUTHORS]Hong Ye Tan, Stanley Osher, Wuchen Li
[ABSTRACT]We consider sampling from a Gibbs distribution by evolving a finite number of particles using a particular score estimator rather than Brownian motion. To accelerate the particles, we consider a second-order score-based ODE, similar to Nesterov acceleration. In contrast to traditional kernel density score estimation, we use the recently proposed regularized Wasserstein proximal method, yielding the Accelerated Regularized Wasserstein Proximal method (ARWP). We provide a detailed analysis of continuous- and discrete-time non-asymptotic and asymptotic mixing rates for Gaussian initial and target distributions, using techniques from Euclidean acceleration and accelerated information gradients. Compared with the kinetic Langevin sampling algorithm, the proposed algorithm exhibits a higher contraction rate in the asymptotic time regime. Numerical experiments are conducted across various low-dimensional experiments, including multi-modal Gaussian mixtures and ill-conditioned Rosenbrock distributions. ARWP exhibits structured and convergent particles, accelerated discrete-time mixing, and faster tail exploration than the non-accelerated regularized Wasserstein proximal method and kinetic Langevin methods. Additionally, ARWP particles exhibit better generalization properties for some non-log-concave Bayesian neural network tasks.
[LINK]http://arxiv.org/abs/2601.09848v1
[DATE]2026-01-15 04:08:11+08:00
[CATEGORIES]cs.LG
A pipeline for enabling path-specific causal fairness in observational health data
[AUTHORS]Aparajita Kashyap, Sara Matijevic, Noémie Elhadad, Steven A. Kushner, Shalmali Joshi
[ABSTRACT]When training machine learning (ML) models for potential deployment in a healthcare setting, it is essential to ensure that they do not replicate or exacerbate existing healthcare biases. Although many definitions of fairness exist, we focus on path-specific causal fairness, which allows us to better consider the social and medical contexts in which biases occur (e.g., direct discrimination by a clinician or model versus bias due to differential access to the healthcare system) and to characterize how these biases may appear in learned models. In this work, we map the structural fairness model to the observational healthcare setting and create a generalizable pipeline for training causally fair models. The pipeline explicitly considers specific healthcare context and disparities to define a target “fair” model. Our work fills two major gaps: first, we expand on characterizations of the “fairness-accuracy” tradeoff by detangling direct and indirect sources of bias and jointly presenting these fairness considerations alongside considerations of accuracy in the context of broadly known biases. Second, we demonstrate how a foundation model trained without fairness constraints on observational health data can be leveraged to generate causally fair downstream predictions in tasks with known social and medical disparities. This work presents a model-agnostic pipeline for training causally fair machine learning models that address both direct and indirect forms of healthcare bias.
[LINK]http://arxiv.org/abs/2601.09841v1
[DATE]2026-01-15 03:57:34+08:00
[CATEGORIES]cs.LG
An information-matching approach to optimal experimental design and active learning
[AUTHORS]Yonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis, Alex M. Stankovic, Mingjian Wen, Ilia Nikiforov, Ellad B. Tadmor, Vasily V. Bulatov, Vincenzo Lordi, Mark K. Transtrum
[ABSTRACT]The efficacy of mathematical models heavily depends on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoI). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an Active Learning loop for material science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
[LINK]http://arxiv.org/abs/2411.02740v5
[DATE]2026-01-15 03:50:14+08:00
[CATEGORIES]cs.LG
A New Convergence Analysis of Plug-and-Play Proximal Gradient Descent Under Prior Mismatch
[AUTHORS]Guixian Xu, Jinglai Li, Junqi Tang
[ABSTRACT]In this work, we provide a new convergence theory for plug-and-play proximal gradient descent (PnP-PGD) under prior mismatch where the denoiser is trained on a different data distribution to the inference task at hand. To the best of our knowledge, this is the first convergence proof of PnP-PGD under prior mismatch. Compared with the existing theoretical results for PnP algorithms, our new results removed the need for several restrictive and unverifiable assumptions.
[LINK]http://arxiv.org/abs/2601.09831v1
[DATE]2026-01-15 03:47:31+08:00
[CATEGORIES]cs.LG
Exploring specialization and sensitivity of convolutional neural networks in the context of simultaneous image augmentations
[AUTHORS]Pavel Kharyuk, Sergey Matveev, Ivan Oseledets
[ABSTRACT]Drawing parallels with the way biological networks are studied, we adapt the treatment–control paradigm to explainable artificial intelligence research and enrich it through multi-parametric input alterations. In this study, we propose a framework for investigating the internal inference impacted by input data augmentations. The internal changes in network operation are reflected in activation changes measured by variance, which can be decomposed into components related to each augmentation, employing Sobol indices and Shapley values. These quantities enable one to visualize sensitivity to different variables and use them for guided masking of activations. In addition, we introduce a way of single-class sensitivity analysis where the candidates are filtered according to their matching to prediction bias generated by targeted damaging of the activations. Relying on the observed parallels, we assume that the developed framework can potentially be transferred to studying biological neural networks in complex environments.
[COMMENTS]31 pages; main text: 5 figures, 5 tables; appendix: 5 sections, 4 tables; supplementary: 7 files (figures S1-S6: pdf files packed as 7z archives, S7: single pdf file)
[LINK]http://arxiv.org/abs/2503.03283v2
[DATE]2026-01-15 03:31:26+08:00
[CATEGORIES]cs.LG
Improving Chain-of-Thought for Logical Reasoning via Attention-Aware Intervention
[AUTHORS]Nguyen Minh Phuong, Dang Huu Tien, Naoya Inoue
[ABSTRACT]Modern logical reasoning with LLMs primarily relies on employing complex interactive frameworks that decompose the reasoning process into subtasks solved through carefully designed prompts or requiring external resources (e.g., symbolic solvers) to exploit their strong logical structures. While interactive approaches introduce additional overhead, hybrid approaches depend on external components, which limit their scalability. A non-interactive, end-to-end framework enables reasoning to emerge within the model itself – improving generalization while preserving analyzability without any external resources. In this work, we introduce a non-interactive, end-to-end framework for reasoning tasks. We show that introducing structural information into the few-shot prompt activates a subset of attention heads that patterns aligned with logical reasoning operators. Building on this insight, we propose Attention-Aware Intervention (AAI), an inference-time intervention method that reweights attention scores across selected heads identified by their logical patterns. AAI offers an efficient way to steer the model’s reasoning toward leveraging prior knowledge through attention modulation. Extensive experiments show that AAI enhances logical reasoning performance across diverse benchmarks and model architectures, while incurring negligible additional computational overhead. Code is available at https://github.com/phuongnm94/aai_for_logical_reasoning.
[COMMENTS]Findings of EACL 2026
[LINK]http://arxiv.org/abs/2601.09805v1
[DATE]2026-01-15 03:10:10+08:00
[CATEGORIES]cs.LG
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
[AUTHORS]Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz, Yu-Chiang Frank Wang, Fu-En Yang
[ABSTRACT]Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3\% reduced inference latency over state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
[COMMENTS]Project page: https://jasper0314-huang.github.io/fast-thinkact/
[LINK]http://arxiv.org/abs/2601.09708v1
[DATE]2026-01-15 02:59:59+08:00
[CATEGORIES]cs.LG
Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design
[AUTHORS]Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer
[ABSTRACT]Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for pre-defined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves state-of-the-art zero-shot virtual screening performance in settings where no binding pocket information is provided as input, substantially outperforms existing methods on a challenging target fishing task, and demonstrates competitive ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.
[COMMENTS]ELLIS ML4Molecules Workshop 2025, ELLIS Unconference, Copenhagen 2025
[LINK]http://arxiv.org/abs/2601.09693v1
[DATE]2026-01-15 02:45:08+08:00
[CATEGORIES]cs.LG
OptiMind: Teaching LLMs to Think Like Optimization Experts
[AUTHORS]Xinzhi Zhang, Zeyi Chen, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, Sirui Li
[ABSTRACT]Mathematical programming – the task of expressing operations and decision-making problems in precise mathematical language – is fundamental across domains, yet remains a skill-intensive process requiring operations research expertise. Recent advances in large language models for complex reasoning have spurred interest in automating this task, translating natural language into executable optimization models. Current approaches, however, achieve limited accuracy, hindered by scarce and noisy training data without leveraging domain knowledge. In this work, we systematically integrate optimization expertise to improve formulation accuracy for mixed-integer linear programming, a key family of mathematical programs. Our OptiMind framework leverages semi-automated, class-based error analysis to guide both training and inference, explicitly preventing common mistakes within each optimization class. Our resulting fine-tuned LLM significantly improves formulation accuracy by 20.7% across multiple optimization benchmarks, with consistent gains under test-time scaling methods such as self-consistency and multi-turn feedback, enabling further progress toward robust LLM-assisted optimization formulation.
[LINK]http://arxiv.org/abs/2509.22979v2
[DATE]2026-01-15 02:26:45+08:00
[CATEGORIES]cs.LG
TimeSAE: Sparse Decoding for Faithful Explanations of Black-Box Time Series Models
[AUTHORS]Khalid Oublal, Quentin Bouniot, Qi Gan, Stephan Clémençon, Zeynep Akata
[ABSTRACT]As black box models and pretrained models gain traction in time series applications, understanding and explaining their predictions becomes increasingly vital, especially in high-stakes domains where interpretability and trust are essential. However, most of the existing methods involve only in-distribution explanation, and do not generalize outside the training support, which requires the learning capability of generalization. In this work, we aim to provide a framework to explain black-box models for time series data through the dual lenses of Sparse Autoencoders (SAEs) and causality. We show that many current explanation methods are sensitive to distributional shifts, limiting their effectiveness in real-world scenarios. Building on the concept of Sparse Autoencoder, we introduce TimeSAE, a framework for black-box model explanation. We conduct extensive evaluations of TimeSAE on both synthetic and real-world time series datasets, comparing it to leading baselines. The results, supported by both quantitative metrics and qualitative insights, show that TimeSAE provides more faithful and robust explanations. Our code is available in an easy-to-use library TimeSAE-Lib: https://anonymous.4open.science/w/TimeSAE-571D/.
[LINK]http://arxiv.org/abs/2601.09776v1
[DATE]2026-01-15 02:13:32+08:00
[CATEGORIES]cs.LG
Policy Compatible Skill Incremental Learning via Lazy Learning Interface
[AUTHORS]Daehee Lee, Dongsu Lee, TaeYoon Kwack, Wonje Choi, Honguk Woo
[ABSTRACT]Skill Incremental Learning (SIL) is the process by which an embodied agent expands and refines its skill set over time by leveraging experience gained through interaction with its environment or by the integration of additional data. SIL facilitates efficient acquisition of hierarchical policies grounded in reusable skills for downstream tasks. However, as the skill repertoire evolves, it can disrupt compatibility with existing skill-based policies, limiting their reusability and generalization. In this work, we propose SIL-C, a novel framework that ensures skill-policy compatibility, allowing improvements in incrementally learned skills to enhance the performance of downstream policies without requiring policy re-training or structural adaptation. SIL-C employs a bilateral lazy learning-based mapping technique to dynamically align the subtask space referenced by policies with the skill space decoded into agent behaviors. This enables each subtask, derived from the policy’s decomposition of a complex task, to be executed by selecting an appropriate skill based on trajectory distribution similarity. We evaluate SIL-C across diverse SIL scenarios and demonstrate that it maintains compatibility between evolving skills and downstream policies while ensuring efficiency throughout the learning process.
[COMMENTS]NeurIPS 2025 Spotlight
[LINK]http://arxiv.org/abs/2509.20612v3
[DATE]2026-01-15 02:11:20+08:00
[CATEGORIES]cs.LG
Dynamics-Aligned Latent Imagination in Contextual World Models for Zero-Shot Generalization
[AUTHORS]Frank Röder, Jan Benad, Manfred Eppe, Pradeep Kr. Banerjee
[ABSTRACT]Real-world reinforcement learning demands adaptation to unseen environmental conditions without costly retraining. Contextual Markov Decision Processes (cMDP) model this challenge, but existing methods often require explicit context variables (e.g., friction, gravity), limiting their use when contexts are latent or hard to measure. We introduce Dynamics-Aligned Latent Imagination (DALI), a framework integrated within the Dreamer architecture that infers latent context representations from agent-environment interactions. By training a self-supervised encoder to predict forward dynamics, DALI generates actionable representations conditioning the world model and policy, bridging perception and control. We theoretically prove this encoder is essential for efficient context inference and robust generalization. DALI’s latent space enables counterfactual consistency: Perturbing a gravity-encoding dimension alters imagined rollouts in physically plausible ways. On challenging cMDP benchmarks, DALI achieves significant gains over context-unaware baselines, often surpassing context-aware baselines in extrapolation tasks, enabling zero-shot generalization to unseen contextual variations.
[COMMENTS]31 pages, 4 figures, accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2508.20294v2
[DATE]2026-01-15 01:50:26+08:00
[CATEGORIES]cs.LG
Exploring Fine-Tuning for Tabular Foundation Models
[AUTHORS]Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Vinay Kumar Sankarapu
[ABSTRACT]Tabular Foundation Models (TFMs) have recently shown strong in-context learning capabilities on structured data, achieving zero-shot performance comparable to traditional machine learning methods. We find that zero-shot TFMs already achieve strong performance, while the benefits of fine-tuning are highly model and data-dependent. Meta-learning and PEFT provide moderate gains under specific conditions, whereas full supervised fine-tuning (SFT) often reduces accuracy or calibration quality. This work presents the first comprehensive study of fine-tuning in TFMs across benchmarks including TALENT, OpenML-CC18, and TabZilla. We compare Zero-Shot, Meta-Learning, Supervised (SFT), and parameter-efficient (PEFT) approaches, analyzing how dataset factors such as imbalance, size, and dimensionality affect outcomes. Our findings cover performance, calibration, and fairness, offering practical guidelines on when fine-tuning is most beneficial and its limitations.
[LINK]http://arxiv.org/abs/2601.09654v1
[DATE]2026-01-15 01:40:46+08:00
[CATEGORIES]cs.LG
SPGD: Steepest Perturbed Gradient Descent Optimization
[AUTHORS]Amir M. Vahedi, Horea T. Ilies
[ABSTRACT]Optimization algorithms are pivotal in advancing various scientific and industrial fields but often encounter obstacles such as trapping in local minima, saddle points, and plateaus (flat regions), which makes the convergence to reasonable or near-optimal solutions particularly challenging. This paper presents the Steepest Perturbed Gradient Descent (SPGD), a novel algorithm that innovatively combines the principles of the gradient descent method with periodic uniform perturbation sampling to effectively circumvent these impediments and lead to better solutions whenever possible. SPGD is distinctively designed to generate a set of candidate solutions and select the one exhibiting the steepest loss difference relative to the current solution. It enhances the traditional gradient descent approach by integrating a strategic exploration mechanism that significantly increases the likelihood of escaping sub-optimal local minima and navigating complex optimization landscapes effectively. Our approach not only retains the directed efficiency of gradient descent but also leverages the exploratory benefits of stochastic perturbations, thus enabling a more comprehensive search for global optima across diverse problem spaces. We demonstrate the efficacy of SPGD in solving the 3D component packing problem, an NP-hard challenge. Preliminary results show a substantial improvement over four established methods, particularly on response surfaces with complex topographies and in multidimensional non-convex continuous optimization problems. Comparative analyses with established 2D benchmark functions highlight SPGD’s superior performance, showcasing its ability to navigate complex optimization landscapes. These results emphasize SPGD’s potential as a versatile tool for a wide range of optimization problems.
[COMMENTS]28 pages, 26 figures, submitted to Journal of Mechanical Design
[LINK]http://arxiv.org/abs/2411.04946v3
[DATE]2026-01-15 01:20:45+08:00
[CATEGORIES]cs.LG
Training Large Neural Networks With Low-Dimensional Error Feedback
[AUTHORS]Maher Hanut, Jonathan Kadmon
[ABSTRACT]Training deep neural networks typically relies on backpropagating high dimensional error signals a computationally intensive process with little evidence supporting its implementation in the brain. However, since most tasks involve low-dimensional outputs, we propose that low-dimensional error signals may suffice for effective learning. To test this hypothesis, we introduce a novel local learning rule based on Feedback Alignment that leverages indirect, low-dimensional error feedback to train large networks. Our method decouples the backward pass from the forward pass, enabling precise control over error signal dimensionality while maintaining high-dimensional representations. We begin with a detailed theoretical derivation for linear networks, which forms the foundation of our learning framework, and extend our approach to nonlinear, convolutional, and transformer architectures. Remarkably, we demonstrate that even minimal error dimensionality on the order of the task dimensionality can achieve performance matching that of traditional backpropagation. Furthermore, our rule enables efficient training of convolutional networks, which have previously been resistant to Feedback Alignment methods, with minimal error. This breakthrough not only paves the way toward more biologically accurate models of learning but also challenges the conventional reliance on high-dimensional gradient signals in neural network training. Our findings suggest that low-dimensional error signals can be as effective as high-dimensional ones, prompting a reevaluation of gradient-based learning in high-dimensional systems. Ultimately, our work offers a fresh perspective on neural network optimization and contributes to understanding learning mechanisms in both artificial and biological systems.
[LINK]http://arxiv.org/abs/2502.20580v4
[DATE]2026-01-15 01:19:43+08:00
[CATEGORIES]cs.LG
Coupled Data and Measurement Space Dynamics for Enhanced Diffusion Posterior Sampling
[AUTHORS]Shayan Mohajer Hamidi, En-Hui Yang, Ben Liang
[ABSTRACT]Inverse problems, where the goal is to recover an unknown signal from noisy or incomplete measurements, are central to applications in medical imaging, remote sensing, and computational biology. Diffusion models have recently emerged as powerful priors for solving such problems. However, existing methods either rely on projection-based techniques that enforce measurement consistency through heuristic updates, or they approximate the likelihood $p(\boldsymbol{y} \mid \boldsymbol{x})$, often resulting in artifacts and instability under complex or high-noise conditions. To address these limitations, we propose a novel framework called \emph{coupled data and measurement space diffusion posterior sampling} (C-DPS), which eliminates the need for constraint tuning or likelihood approximation. C-DPS introduces a forward stochastic process in the measurement space $\{\boldsymbol{y}t\}$, evolving in parallel with the data-space diffusion $\{\boldsymbol{x}_t\}$, which enables the derivation of a closed-form posterior $p(\boldsymbol{x}{t-1} \mid \boldsymbol{x}t, \boldsymbol{y}{t-1})$. This coupling allows for accurate and recursive sampling based on a well-defined posterior distribution. Empirical results demonstrate that C-DPS consistently outperforms existing baselines, both qualitatively and quantitatively, across multiple inverse problem benchmarks.
[LINK]http://arxiv.org/abs/2510.09676v2
[DATE]2026-01-15 01:17:22+08:00
[CATEGORIES]cs.LG
Adaptive Requesting in Decentralized Edge Networks via Non-Stationary Bandits
[AUTHORS]Yi Zhuang, Kun Yang, Xingran Chen
[ABSTRACT]We study a decentralized collaborative requesting problem that aims to optimize the information freshness of time-sensitive clients in edge networks consisting of multiple clients, access nodes (ANs), and servers. Clients request content through ANs acting as gateways, without observing AN states or the actions of other clients. We define the reward as the age of information reduction resulting from a client’s selection of an AN, and formulate the problem as a non-stationary multi-armed bandit. In this decentralized and partially observable setting, the resulting reward process is history-dependent and coupled across clients, and exhibits both abrupt and gradual changes in expected rewards, rendering classical bandit-based approaches ineffective. To address these challenges, we propose the AGING BANDIT WITH ADAPTIVE RESET algorithm, which combines adaptive windowing with periodic monitoring to track evolving reward distributions. We establish theoretical performance guarantees showing that the proposed algorithm achieves near-optimal performance, and we validate the theoretical results through simulations.
[LINK]http://arxiv.org/abs/2601.08760v2
[DATE]2026-01-15 01:16:54+08:00
[CATEGORIES]cs.LG
Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES
[AUTHORS]Oliver Goldstein, Samuel March
[ABSTRACT]Algebraic data types (ADTs) let a representation specify at the type level what molecular values are valid and what transformations are meaningful. We propose a molecular representation as a family of typed ADTs that separates (i) constitution (Dietz style bonding systems), (ii) 3D coordinates and stereochemistry, and (iii) electronic structure annotations. This separation makes invariants explicit, supports deterministic local edits, and provides hooks for symmetry aware and Bayesian modeling. These data structures allow us to consider how the representation constrains operations which may be performed over them. Types make invalid manipulations unrepresentable and make it easier to define meaningful priors/likelihoods over generative models (programs with sample and score operations). Unlike string based formats, the ADT exposes chemical structure directly; validity conditions (e.g., valence and symmetry constraints) can be enforced by construction and checked deterministically during transformations. We optionally attach electronic structure annotations (shell/subshell/orbital metadata) to atoms when such information is available; we do not attempt to compute orbitals in this work. We sketch Bayesian probabilistic programming via an integration with LazyPPL, a lazy probabilistic programming library; molecules can be made instances of a group under rotation to support geometric learning settings where molecular properties are invariant under rigid motions and relabellings; and the framework’s flexibility is demonstrated through an extension to represent chemical reactions. We provide a Haskell library implementing the representation, released under an OSI approved open source license and archived with a DOI.
[COMMENTS]3 Figures
[LINK]http://arxiv.org/abs/2501.13633v4
[DATE]2026-01-15 01:15:35+08:00
[CATEGORIES]cs.LG
Physically Plausible Multi-System Trajectory Generation and Symmetry Discovery
[AUTHORS]Jiayin Liu, Yulong Yang, Vineet Bansal, Christine Allen-Blanchette
[ABSTRACT]From metronomes to celestial bodies, mechanics underpins how the world evolves in time and space. With consideration of this, a number of recent neural network models leverage inductive biases from classical mechanics to encourage model interpretability and ensure forecasted states are physical. However, in general, these models are designed to capture the dynamics of a single system with fixed physical parameters, from state-space measurements of a known configuration space. In this paper we introduce Symplectic Phase Space GAN (SPS-GAN) which can capture the dynamics of multiple systems, and generalize to unseen physical parameters from. Moreover, SPS-GAN does not require prior knowledge of the system configuration space. In fact, SPS-GAN can discover the configuration space structure of the system from arbitrary measurement types (e.g., state-space measurements, video frames). To achieve physically plausible generation, we introduce a novel architecture which embeds a Hamiltonian neural network recurrent module in a conditional GAN backbone. To discover the structure of the configuration space, we optimize the conditional time-series GAN objective with an additional physically motivated term to encourages a sparse representation of the configuration space. We demonstrate the utility of SPS-GAN for trajectory prediction, video generation and symmetry discovery. Our approach captures multiple systems and achieves performance on par with supervised models designed for single systems.
[LINK]http://arxiv.org/abs/2509.23003v2
[DATE]2026-01-15 01:15:25+08:00
[CATEGORIES]cs.LG
PersonalAlign: Hierarchical Implicit Intent Alignment for Personalized GUI Agent with Long-Term User-Centric Records
[AUTHORS]Yibo Lyu, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie
[ABSTRACT]While GUI agents have shown strong performance under explicit and completion instructions, real-world deployment requires aligning with users’ more complex implicit intents. In this work, we highlight Hierarchical Implicit Intent Alignment for Personalized GUI Agent (PersonalAlign), a new agent task that requires agents to leverage long-term user records as persistent context to resolve omitted preferences in vague instructions and anticipate latent routines by user state for proactive assistance. To facilitate this study, we introduce AndroidIntent, a benchmark designed to evaluate agents’ ability in resolving vague instructions and providing proactive suggestions through reasoning over long-term user records. We annotated 775 user-specific preferences and 215 routines from 20k long-term records across different users for evaluation. Furthermore, we introduce Hierarchical Intent Memory Agent (HIM-Agent), which maintains a continuously updating personal memory and hierarchically organizes user preferences and routines for personalization. Finally, we evaluate a range of GUI agents on AndroidIntent, including GPT-5, Qwen3-VL, and UI-TARS, further results show that HIM-Agent significantly improves both execution and proactive performance by 15.7% and 7.3%.
[LINK]http://arxiv.org/abs/2601.09636v1
[DATE]2026-01-15 01:12:48+08:00
[CATEGORIES]cs.LG
LLM for Large-Scale Optimization Model Auto-Formulation: A Lightweight Few-Shot Learning Approach
[AUTHORS]Kuo Liang, Yuhang Lu, Jianming Mao, Shuyi Sun, Chunwei Yang, Congcong Zeng, Xiao Jin, Hanzhang Qin, Ruihao Zhu, Chung-Piaw Teo
[ABSTRACT]Large-scale optimization is a key backbone of modern business decision-making. However, building these models is often labor-intensive and time-consuming. We address this by proposing LEAN-LLM-OPT, a LightwEight AgeNtic workflow construction framework for LLM-assisted large-scale OPTimization auto-formulation. LEAN-LLM-OPT takes as input a problem description together with associated datasets and orchestrates a team of LLM agents to produce an optimization formulation. Specifically, upon receiving a query, two upstream LLM agents dynamically construct a workflow that specifies, step-by-step, how optimization models for similar problems can be formulated. A downstream LLM agent then follows this workflow to generate the final output. Leveraging LLMs’ text-processing capabilities and common modeling practices, the workflow decomposes the modeling task into a sequence of structured sub-tasks and offloads mechanical data-handling operations to auxiliary tools. This design alleviates the downstream agent’s burden related to planning and data handling, allowing it to focus on the most challenging components that cannot be readily standardized. Extensive simulations show that LEAN-LLM-OPT, instantiated with GPT-4.1 and the open source gpt-oss-20B, achieves strong performance on large-scale optimization modeling tasks and is competitive with state-of-the-art approaches. In addition, in a Singapore Airlines choice-based revenue management use case, LEAN-LLM-OPT demonstrates practical value by achieving leading performance across a range of scenarios. Along the way, we introduce Large-Scale-OR and Air-NRM, the first comprehensive benchmarks for large-scale optimization auto-formulation. The code and data of this work is available at https://github.com/CoraLiang01/lean-llm-opt.
[COMMENTS]Updated version of https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5329027
[LINK]http://arxiv.org/abs/2601.09635v1
[DATE]2026-01-15 01:09:57+08:00
[CATEGORIES]cs.LG
MISA: Memory-Efficient LLMs Optimization with Module-wise Importance Sampling
[AUTHORS]Yuxi Liu, Renjia Deng, Yutong He, Xue Wang, Tao Yao, Kun Yuan
[ABSTRACT]The substantial memory demands of pre-training and fine-tuning large language models (LLMs) require memory-efficient optimization algorithms. One promising approach is layer-wise optimization, which treats each transformer block as a single layer and optimizes it sequentially, while freezing the other layers to save optimizer states and activations. Although effective, these methods ignore the varying importance of the modules within each layer, leading to suboptimal performance. Moreover, layer-wise sampling provides only limited memory savings, as at least one full layer must remain active during optimization. To overcome these limitations, we propose Module-wise Importance SAmpling (MISA), a novel method that divides each layer into smaller modules and assigns importance scores to each module. MISA uses a weighted random sampling mechanism to activate modules, provably reducing gradient variance compared to layer-wise sampling. Additionally, we establish an (\mathcal{O}(1/\sqrt{K})) convergence rate under non-convex and stochastic conditions, where $K$ is the total number of block updates, and provide a detailed memory analysis showcasing MISA’s superiority over existing baseline methods. Experiments on diverse learning tasks validate the effectiveness of MISA. Source code is available at https://github.com/pkumelon/MISA.
[COMMENTS]This paper is accepted to Neural Information Processing Systems (NeurIPS) 2025
[LINK]http://arxiv.org/abs/2511.00056v3
[DATE]2026-01-15 01:04:50+08:00
[CATEGORIES]cs.LG
From Prompt to Protocol: Fast Charging Batteries with Large Language Models
[AUTHORS]Ge Lei, Ferran Brosa Planella, Sterling G. Baird, Samuel J. Cooper
[ABSTRACT]Efficiently optimizing battery charging protocols is challenging because each evaluation is slow, costly, and non-differentiable. Many existing approaches address this difficulty by heavily constraining the protocol search space, which limits the diversity of protocols that can be explored, preventing the discovery of higher-performing solutions. We introduce two gradient-free, LLM-driven closed-loop methods: Prompt-to-Optimizer (P2O), which uses an LLM to propose the code for small neural-network-based protocols, which are then trained by an inner loop, and Prompt-to-Protocol (P2P), which simply writes an explicit function for the current and its scalar parameters. Across our case studies, LLM-guided P2O outperforms neural networks designed by Bayesian optimization, evolutionary algorithms, and random search. In a realistic fast charging scenario, both P2O and P2P yield around a 4.2 percent improvement in state of health (capacity retention based health metric under fast charging cycling) over a state-of-the-art multi-step constant current (CC) baseline, with P2P achieving this under matched evaluation budgets (same number of protocol evaluations). These results demonstrate that LLMs can expand the space of protocol functional forms, incorporate language-based constraints, and enable efficient optimization in high cost experimental settings.
[LINK]http://arxiv.org/abs/2601.09626v1
[DATE]2026-01-15 00:58:20+08:00
[CATEGORIES]cs.LG
Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation
[AUTHORS]Chenrui Ma, Zechang Sun, Tao Jing, Zheng Cai, Yuan-Sen Ting, Song Huang, Mingyu Li
[ABSTRACT]Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often struggle with generalization in large-scale surveys due to the limited representativeness of labeled datasets, whether from simulations or human annotation, a challenge pronounced for rare yet scientifically valuable objects. To address this, we propose a conditional diffusion model to synthesize realistic galaxy images for augmenting ML training data (hereafter GalaxySD). Leveraging the Galaxy Zoo 2 dataset which contains visual feature, galaxy image pairs from volunteer annotation, we demonstrate that GalaxySD generates diverse, high-fidelity galaxy images that closely adhere to the specified morphological feature conditions. Moreover, this model enables generative extrapolation to project well-annotated data into unseen domains and advancing rare object detection. Integrating synthesized images into ML pipelines improves performance in standard morphology classification, boosting completeness and purity by up to 30% across key metrics. For rare object detection, using early-type galaxies with prominent dust lane features (~0.1% in GZ2 dataset) as a test case, our approach doubled the number of detected instances, from 352 to 872, compared to previous studies based on visual inspection. This study highlights the power of generative models to bridge gaps between scarce labeled data and the vast, uncharted parameter space of observational astronomy and sheds insight for future astrophysical foundation model developments. Our project homepage is available at https://galaxysd-webpage.streamlit.app/.
[COMMENTS]29 pages, 17 figures, accepted version for ApJS. Comments welcome. See another independent work for further reference, Category-based Galaxy Image Generation via Diffusion Models (Fan, Tang et al.)
[LINK]http://arxiv.org/abs/2506.16233v2
[DATE]2026-01-15 00:33:14+08:00
[CATEGORIES]cs.LG
Instance-Dependent Regret Bounds for Nonstochastic Linear Partial Monitoring
[AUTHORS]Federico Di Gennaro, Khaled Eldowa, Nicolò Cesa-Bianchi
[ABSTRACT]In contrast to the classic formulation of partial monitoring, linear partial monitoring can model infinite outcome spaces, while imposing a linear structure on both the losses and the observations. This setting can be viewed as a generalization of linear bandits where loss and feedback are decoupled in a flexible manner. In this work, we address a nonstochastic (adversarial), finite-actions version of the problem through a simple instance of the exploration-by-optimization method that is amenable to efficient implementation. We derive regret bounds that depend on the game structure in a more transparent manner than previous theoretical guarantees for this paradigm. Our bounds feature instance-specific quantities that reflect the degree of alignment between observations and losses, and resemble known guarantees in the stochastic setting. Notably, they achieve the standard $\sqrt{T}$ rate in easy (locally observable) games and $T^{2/3}$ in hard (globally observable) games, where $T$ is the time horizon. We instantiate these bounds in a selection of old and new partial information settings subsumed by this model, and illustrate that the achieved dependence on the game structure can be tight in interesting cases.
[LINK]http://arxiv.org/abs/2510.19158v2
[DATE]2026-01-15 00:24:31+08:00
[CATEGORIES]cs.LG
A Multi-Stage Deep Learning Framework with PKCP-MixUp Augmentation for Pediatric Liver Tumor Diagnosis Using Multi-Phase Contrast-Enhanced CT
[AUTHORS]Wanqi Wang, Chun Yang, Jianbo Shao, Yaokai Zhang, Xuehua Peng, Jin Sun, Chao Xiong, Long Lu, Lianting Hu
[ABSTRACT]Pediatric liver tumors are one of the most common solid tumors in pediatrics, with differentiation of benign or malignant status and pathological classification critical for clinical treatment. While pathological examination is the gold standard, the invasive biopsy has notable limitations: the highly vascular pediatric liver and fragile tumor tissue raise complication risks such as bleeding; additionally, young children with poor compliance require anesthesia for biopsy, increasing medical costs or psychological trauma. Although many efforts have been made to utilize AI in clinical settings, most researchers have overlooked its importance in pediatric liver tumors. To establish a non-invasive examination procedure, we developed a multi-stage deep learning (DL) framework for automated pediatric liver tumor diagnosis using multi-phase contrast-enhanced CT. Two retrospective and prospective cohorts were enrolled. We established a novel PKCP-MixUp data augmentation method to address data scarcity and class imbalance. We also trained a tumor detection model to extract ROIs, and then set a two-stage diagnosis pipeline with three backbones with ROI-masked images. Our tumor detection model has achieved high performance (mAP=0.871), and the first stage classification model between benign and malignant tumors reached an excellent performance (AUC=0.989). Final diagnosis models also exhibited robustness, including benign subtype classification (AUC=0.915) and malignant subtype classification (AUC=0.979). We also conducted multi-level comparative analyses, such as ablation studies on data and training pipelines, as well as Shapley-Value and CAM interpretability analyses. This framework fills the pediatric-specific DL diagnostic gap, provides actionable insights for CT phase selection and model design, and paves the way for precise, accessible pediatric liver tumor diagnosis.
[LINK]http://arxiv.org/abs/2511.19478v2
[DATE]2026-01-15 00:22:53+08:00
[CATEGORIES]cs.LG
HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning
[AUTHORS]Guimin Hu, Daniel Hershcovich, Hasti Seifi
[ABSTRACT]Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals for the sense of touch remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that interprets vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning using the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA’s captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06 respectively. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
[LINK]http://arxiv.org/abs/2508.06475v2
[DATE]2026-01-14 23:59:08+08:00
[CATEGORIES]cs.CL
Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering
[AUTHORS]Huiyao Chen, Yi Yang, Yinghui Li, Meishan Zhang, Baotian Hu, Min Zhang
[ABSTRACT]Existing long-document question answering systems typically process texts as flat sequences or use heuristic chunking, which overlook the discourse structures that naturally guide human comprehension. We present a discourse-aware hierarchical framework that leverages rhetorical structure theory (RST) for long document question answering. Our approach converts discourse trees into sentence-level representations and employs LLM-enhanced node representations to bridge structural and semantic information. The framework involves three key innovations: language-universal discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval. Extensive experiments on four datasets demonstrate consistent improvements over existing approaches through the incorporation of discourse structure, across multiple genres and languages. Moreover, the proposed framework exhibits strong robustness across diverse document types and linguistic settings.
[COMMENTS]21 pages, 9 figures
[LINK]http://arxiv.org/abs/2506.06313v4
[DATE]2026-01-14 23:49:23+08:00
[CATEGORIES]cs.CL
Head Pursuit: Probing Attention Specialization in Multimodal Transformers
[AUTHORS]Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
[ABSTRACT]Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
[COMMENTS]Accepted at NeurIPS 2025 (spotlight)
[LINK]http://arxiv.org/abs/2510.21518v2
[DATE]2026-01-14 23:47:59+08:00
[CATEGORIES]cs.CL cs.LG
Dialogue Telemetry: Turn-Level Instrumentation for Autonomous Information Gathering
[AUTHORS]Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo
[ABSTRACT]Autonomous systems conducting schema-grounded information-gathering dialogues face an instrumentation gap, lacking turn-level observables for monitoring acquisition efficiency and detecting when questioning becomes unproductive. We introduce Dialogue Telemetry (DT), a measurement framework that produces two model-agnostic signals after each question-answer exchange: (i) a Progress Estimator (PE) quantifying residual information potential per category (with a bits-based variant), and (ii) a Stalling Index (SI) detecting an observable failure signature characterized by repeated category probing with semantically similar, low-marginal-gain responses. SI flags this pattern without requiring causal diagnosis, supporting monitoring in settings where attributing degradation to specific causes may be impractical. We validate DT in controlled search-and-rescue (SAR)-inspired interviews using large language model (LLM)-based simulations, distinguishing efficient from stalled dialogue traces and illustrating downstream utility by integrating DT signals into a reinforcement learning (RL) policy. Across these settings, DT provides interpretable turn-level instrumentation that improves policy performance when stalling carries operational costs.
[COMMENTS]16 pages, 9 Figures, Version submitted to IEEE for publication
[LINK]http://arxiv.org/abs/2601.09570v1
[DATE]2026-01-14 23:39:52+08:00
[CATEGORIES]cs.CL
DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs
[AUTHORS]Nayoung Choi, Jonathan Zhang, Jinho D. Choi
[ABSTRACT]Large Language Models (LLMs) often exhibit increased response latency and degraded answer quality as dialogue length grows, making effective context management essential. However, existing methods rely on extra LLM calls to build memory or perform offline memory construction without considering the current user utterance, which can introduce inefficiencies or disrupt conversational continuity. We introduce DyCP, a lightweight context management method that dynamically segment and retrieve relevant memory at query time. It preserves the sequential structure of dialogue without predefined topic boundaries and supports efficient, adaptive context retrieval. Across three long-form dialogue benchmarks, LoCoMo, MT-Bench+, and SCM4LLMs, and multiple LLMs, DyCP consistently improves answer quality while reducing response latency. We also examine the gap between modern LLMs’ expanded context windows and their actual long-context processing capacity, highlighting the continued importance of effective context management.
[COMMENTS]Accepted (B) to TACL 2026
[LINK]http://arxiv.org/abs/2601.07994v2
[DATE]2026-01-14 23:26:22+08:00
[CATEGORIES]cs.CL
Navigating the Reality Gap: Privacy-Preserving On-Device Continual Adaptation of ASR for Clinical Telephony
[AUTHORS]Darshil Chauhan, Adityasinh Solanki, Vansh Patel, Kanav Kapoor, Ritvik Jain, Aditya Bansal, Pratik Narang, Dhruv Kumar
[ABSTRACT]Automatic Speech Recognition (ASR) holds immense potential to assist in clinical documentation and patient report generation, particularly in resource-constrained regions. However, deployment is currently hindered by a technical deadlock: a severe “Reality Gap” between laboratory performance and noisy, real-world clinical audio, coupled with strict privacy and resource constraints. Such adaptation is essential for clinical telephony systems, where patient speech is highly variable and transcription errors can directly impact downstream clinical workflows. We quantify this gap, showing that a robust multilingual model (IndicWav2Vec) degrades up to a 40.94% WER on rural clinical telephony speech from India, rendering it unusable. We demonstrate consistent improvements on these helpline interactions without transmitting raw patient data off-device via an on-device continual adaptation framework using Low-Rank Adaptation (LoRA). We conduct an investigative study of stabilization strategies, characterizing the trade-offs between data-driven and parameter-driven approaches. Our results demonstrate that multi-domain Experience Replay (ER) yields the primary performance gains, achieving a 17.1% relative improvement in target WER and reducing catastrophic forgetting by 55% compared to naive adaptation. Furthermore, we investigate a stabilized importance estimation strategy (Absolute Fisher) to ensure robust convergence against the high-variance gradients common in clinical telephony speech. Finally, we verify via a domain-specific spot check that acoustic adaptation is a fundamental prerequisite for usability in healthcare settings which cannot be bypassed by language models alone.
[COMMENTS]17 pages, 13 figures. Under review
[LINK]http://arxiv.org/abs/2512.16401v4
[DATE]2026-01-14 23:22:47+08:00
[CATEGORIES]cs.CL
Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats
[AUTHORS]Manyi Zhang, Ji-Fu Li, Zhongao Sun, Haoli Bai, Hui-Ling Zhen, Zhenhua Dong, Xianzhi Yu
[ABSTRACT]Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
[LINK]http://arxiv.org/abs/2601.09555v1
[DATE]2026-01-14 23:16:55+08:00
[CATEGORIES]cs.CL
DiffSampling: Enhancing Diversity and Accuracy in Neural Text Generation
[AUTHORS]Giorgio Franceschelli, Mirco Musolesi
[ABSTRACT]Despite their growing capabilities, language models still frequently reproduce content from their training data, generate repetitive text, and favor common grammatical patterns and vocabulary. A possible cause is the decoding strategy: the most common strategies either consider only the most probable tokens, which reduces output diversity, or increase the likelihood of unlikely tokens, compromising output accuracy and correctness. In this paper, we propose DiffSampling, a new decoding method that leverages a mathematical analysis of the token probability distribution to ensure the generation of contextually appropriate text. In particular, the difference between consecutive, sorted probabilities can be used to truncate incorrect tokens. In addition, we also propose two variations of the proposed method that aim to correct the subtle inconsistencies of common sampling strategies. Experiments involving four different text-generation tasks demonstrate that our approach consistently performs at least on par with the existing methods it builds upon in terms of quality, despite sampling from a larger set of tokens.
[COMMENTS]Published in Transactions on Machine Learning Research (2025), see https://tmlr.infinite-conf.org/paper_pages/kXjHbMvdIi.html
[LINK]http://arxiv.org/abs/2502.14037v5
[DATE]2026-01-14 23:16:12+08:00
[CATEGORIES]cs.CL cs.LG
Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems
[AUTHORS]Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tianxing He
[ABSTRACT]Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. Especially, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
[COMMENTS]37 pages, 22 figures
[LINK]http://arxiv.org/abs/2506.06821v4
[DATE]2026-01-14 23:14:27+08:00
[CATEGORIES]cs.CL
Autofocus Retrieval: An Effective Pipeline for Multi-Hop Question Answering With Semi-Structured Knowledge
[AUTHORS]Derian Boer, Stephen Roth, Stefan Kramer
[ABSTRACT]In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. Yet, most rely on either. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data. In this work, we present Autofocus-Retriever (AF-Retriever), a modular framework for SKB-based, multi-hop question answering. It combines structural and textual retrieval through novel integration steps and optimizations, achieving the best zero- and one-shot results across all three STaRK QA benchmarks, which span diverse domains and evaluation metrics.
AF-Retriever’s average first-hit rate surpasses the second-best method by 32.1%. Its performance is driven by (1) leveraging exchangeable large language models (LLMs) to extract entity attributes and relational constraints for both parsing and reranking the top-k answers, (2) vector similarity search for ranking both extracted entities and final answers, (3) a novel incremental scope expansion procedure that prepares for the reranking on a configurable amount of suitable candidates that fulfill the given constraints the most, and (4) a hybrid retrieval strategy that reduces error susceptibility.
In summary, while constantly adjusting the focus like an optical autofocus, AF-Retriever delivers a configurable amount of answer candidates in four constraint-driven retrieval steps, which are then supplemented and ranked through four additional processing steps. An ablation study and a detailed error analysis, including a comparison of three different LLM reranking strategies, provide component-level insights. The source code is available at https://github.com/kramerlab/AF-Retriever.
[LINK]http://arxiv.org/abs/2505.09246v3
[DATE]2026-01-14 22:49:17+08:00
[CATEGORIES]cs.CL
SERM: Self-Evolving Relevance Model with Agent-Driven Learning from Massive Query Streams
[AUTHORS]Chenglong Wang, Canjia Li, Xingzhao Zhu, Yifu Huo, Huiyu Wang, Weixiong Lin, Yun Yang, Qiaozhi He, Tianhua Zhou, Xiaojia Chang, Jingbo Zhu, Tong Xiao
[ABSTRACT]Due to the dynamically evolving nature of real-world query streams, relevance models struggle to generalize to practical search scenarios. A sophisticated solution is self-evolution techniques. However, in large-scale industrial settings with massive query streams, this technique faces two challenges: (1) informative samples are often sparse and difficult to identify, and (2) pseudo-labels generated by the current model could be unreliable. To address these challenges, in this work, we propose a Self-Evolving Relevance Model approach (SERM), which comprises two complementary multi-agent modules: a multi-agent sample miner, designed to detect distributional shifts and identify informative training samples, and a multi-agent relevance annotator, which provides reliable labels through a two-level agreement framework. We evaluate SERM in a large-scale industrial setting, which serves billions of user requests daily. Experimental results demonstrate that SERM can achieve significant performance gains through iterative self-evolution, as validated by extensive offline multilingual evaluations and online testing.
[LINK]http://arxiv.org/abs/2601.09515v1
[DATE]2026-01-14 22:31:16+08:00
[CATEGORIES]cs.CL
AutoToM: Scaling Model-based Mental Inference via Automated Agent Modeling
[AUTHORS]Zhining Zhang, Chuanyang Jin, Mung Yao Jia, Shunchi Zhang, Tianmin Shu
[ABSTRACT]Theory of Mind (ToM), the ability to understand people’s minds based on their behavior, is key to developing socially intelligent agents. Current approaches to ToM reasoning either rely on prompting Large Language Models (LLMs), which are prone to systematic errors, or use handcrafted, rigid agent models for model-based inference, which are more robust but fail to generalize across domains. In this work, we introduce AutoToM, an automated agent modeling method for scalable, robust, and interpretable mental inference. Given a ToM problem, AutoToM first proposes an initial agent model and then performs automated Bayesian inverse planning based on this model, leveraging an LLM backend. Guided by inference uncertainty, it iteratively refines the model by introducing additional mental variables and/or incorporating more timesteps in the context. Across five diverse benchmarks, AutoToM outperforms existing ToM methods and even large reasoning models. Additionally, we show that AutoToM can produce human-like confidence estimates and enable online mental inference for embodied decision-making.
[COMMENTS]NeurIPS 2025 (Spotlight). 42 pages, 11 figures, 15 tables. Website at https://chuanyangjin.com/AutoToM/
[LINK]http://arxiv.org/abs/2502.15676v3
[DATE]2026-01-14 22:23:27+08:00
[CATEGORIES]cs.CL
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning
[AUTHORS]Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, Yunpu Ma
[ABSTRACT]Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).
[LINK]http://arxiv.org/abs/2508.19828v5
[DATE]2026-01-14 22:21:21+08:00
[CATEGORIES]cs.CL
Structured yet Bounded Temporal Understanding in Large Language Models
[AUTHORS]Damin Zhang, Julia Rayz
[COMMENTS]Under review. Results on larger dataset. Correct a theoretical error. 11 pages, 5 figures
[LINK]http://arxiv.org/abs/2510.16685v2
[DATE]2026-01-14 22:17:12+08:00
[CATEGORIES]cs.CL
MVSS: A Unified Framework for Multi-View Structured Survey Generation
[AUTHORS]Yinqi Liu, Yueqi Zhu, Yongkang Zhang, Xinfeng Li, Feiran Liu, Yufei Sun, Xin Wang, Renzhao Liang, Yidong Wang, Cunxiang Wang
[ABSTRACT]Scientific surveys require not only summarizing large bodies of literature, but also organizing them into clear and coherent conceptual structures. Existing automatic survey generation methods typically focus on linear text generation and struggle to explicitly model hierarchical relations among research topics and structured methodological comparisons, resulting in gaps in structural organization compared to expert-written surveys. We propose MVSS, a multi-view structured survey generation framework that jointly generates and aligns citation-grounded hierarchical trees, structured comparison tables, and survey text. MVSS follows a structure-first paradigm: it first constructs a conceptual tree of the research domain, then generates comparison tables constrained by the tree, and finally uses both as structural constraints for text generation. This enables complementary multi-view representations across structure, comparison, and narrative. We introduce an evaluation framework assessing structural quality, comparative completeness, and citation fidelity. Experiments on 76 computer science topics show MVSS outperforms existing methods in organization and evidence grounding, achieving performance comparable to expert surveys.
[LINK]http://arxiv.org/abs/2601.09504v1
[DATE]2026-01-14 22:11:39+08:00
[CATEGORIES]cs.CL
Development and Evaluation of HopeBot: an LLM-based chatbot for structured and interactive PHQ-9 depression screening
[AUTHORS]Zhijun Guo, Alvina Lai, Julia Ive, Alexandru Petcu, Yutong Wang, Luyuan Qi, Johan H Thygesen, Kezhi Li
[ABSTRACT]Static tools like the Patient Health Questionnaire-9 (PHQ-9) effectively screen depression but lack interactivity and adaptability. We developed HopeBot, a chatbot powered by a large language model (LLM) that administers the PHQ-9 using retrieval-augmented generation and real-time clarification. In a within-subject study, 132 adults in the United Kingdom and China completed both self-administered and chatbot versions. Scores demonstrated strong agreement (ICC = 0.91; 45% identical). Among 75 participants providing comparative feedback, 71% reported greater trust in the chatbot, highlighting clearer structure, interpretive guidance, and a supportive tone. Mean ratings (0-10) were 8.4 for comfort, 7.7 for voice clarity, 7.6 for handling sensitive topics, and 7.4 for recommendation helpfulness; the latter varied significantly by employment status and prior mental-health service use (p < 0.05). Overall, 87.1% expressed willingness to reuse or recommend HopeBot. These findings demonstrate voice-based LLM chatbots can feasibly serve as scalable, low-burden adjuncts for routine depression screening.
[LINK]http://arxiv.org/abs/2507.05984v2
[DATE]2026-01-14 22:03:49+08:00
[CATEGORIES]cs.CL
SlidesGen-Bench: Evaluating Slides Generation via Computational and Quantitative Metrics
[AUTHORS]Yunqiao Yang, Wenbo Li, Houxing Ren, Zimu Lu, Ke Wang, Zhiyuan Huang, Zhuofan Zong, Mingjie Zhan, Hongsheng Li
[ABSTRACT]The rapid evolution of Large Language Models (LLMs) has fostered diverse paradigms for automated slide generation, ranging from code-driven layouts to image-centric synthesis. However, evaluating these heterogeneous systems remains challenging, as existing protocols often struggle to provide comparable scores across architectures or rely on uncalibrated judgments. In this paper, we introduce SlidesGen-Bench, a benchmark designed to evaluate slide generation through a lens of three core principles: universality, quantification, and reliability. First, to establish a unified evaluation framework, we ground our analysis in the visual domain, treating terminal outputs as renderings to remain agnostic to the underlying generation method. Second, we propose a computational approach that quantitatively assesses slides across three distinct dimensions - Content, Aesthetics, and Editability - offering reproducible metrics where prior works relied on subjective or reference-dependent proxies. Finally, to ensure high correlation with human preference, we construct the Slides-Align1.5k dataset, a human preference aligned dataset covering slides from nine mainstream generation systems across seven scenarios. Our experiments demonstrate that SlidesGen-Bench achieves a higher degree of alignment with human judgment than existing evaluation pipelines. Our code and data are available at https://github.com/YunqiaoYang/SlidesGen-Bench.
[COMMENTS]37 pages, 34 figures
[LINK]http://arxiv.org/abs/2601.09487v1
[DATE]2026-01-14 21:50:30+08:00
[CATEGORIES]cs.CL
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
[AUTHORS]Zhi-Yi Chin, Chieh-Ming Jiang, Ching-Chun Huang, Pin-Yu Chen, Wei-Chen Chiu
[ABSTRACT]Text-to-image diffusion models, e.g. Stable Diffusion (SD), lately have shown remarkable ability in high-quality content generation, and become one of the representatives for the recent wave of transformative AI. Nevertheless, such advance comes with an intensifying concern about the misuse of this generative technology, especially for producing copyrighted or NSFW (i.e. not safe for work) images. Although efforts have been made to filter inappropriate images/prompts or remove undesirable concepts/styles via model fine-tuning, the reliability of these safety mechanisms against diversified problematic prompts remains largely unexplored. In this work, we propose Prompting4Debugging (P4D) as a debugging and red-teaming tool that automatically finds problematic prompts for diffusion models to test the reliability of a deployed safety mechanism. We demonstrate the efficacy of our P4D tool in uncovering new vulnerabilities of SD models with safety mechanisms. Particularly, our result shows that around half of prompts in existing safe prompting benchmarks which were originally considered “safe” can actually be manipulated to bypass many deployed safety mechanisms, including concept removal, negative prompt, and safety guidance. Our findings suggest that, without comprehensive testing, the evaluations on limited safe prompting benchmarks can lead to a false sense of safety for text-to-image models.
[COMMENTS]ICML 2024 main conference paper. The source code is available at https://github.com/zhiyichin/P4D
[LINK]http://arxiv.org/abs/2309.06135v3
[DATE]2026-01-14 21:34:07+08:00
[CATEGORIES]cs.CL
Dissecting Judicial Reasoning in U.S. Copyright Damage Awards
[AUTHORS]Pei-Chi Lo, Thomas Y. Lu
[ABSTRACT]Judicial reasoning in copyright damage awards poses a core challenge for computational legal analysis. Although federal courts follow the 1976 Copyright Act, their interpretations and factor weightings vary widely across jurisdictions. This inconsistency creates unpredictability for litigants and obscures the empirical basis of legal decisions. This research introduces a novel discourse-based Large Language Model (LLM) methodology that integrates Rhetorical Structure Theory (RST) with an agentic workflow to extract and quantify previously opaque reasoning patterns from judicial opinions. Our framework addresses a major gap in empirical legal scholarship by parsing opinions into hierarchical discourse structures and using a three-stage pipeline, i.e., Dataset Construction, Discourse Analysis, and Agentic Feature Extraction. This pipeline identifies reasoning components and extract feature labels with corresponding discourse subtrees. In analyzing copyright damage rulings, we show that discourse-augmented LLM analysis outperforms traditional methods while uncovering unquantified variations in factor weighting across circuits. These findings offer both methodological advances in computational legal analysis and practical insights into judicial reasoning, with implications for legal practitioners seeking predictive tools, scholars studying legal principle application, and policymakers confronting inconsistencies in copyright law.
[COMMENTS]Presented in SIGKDD’25 SciSoc LLM Workshop: Large Language Models for Scientific and Societal Advances
[LINK]http://arxiv.org/abs/2601.09459v1
[DATE]2026-01-14 21:09:16+08:00
[CATEGORIES]cs.CL
Improving Symbolic Translation of Language Models for Logical Reasoning
[AUTHORS]Ramya Keerthy Thatikonda, Jiuzhou Han, Wray Buntine, Ehsan Shareghi
[ABSTRACT]The use of formal language for deductive logical reasoning aligns well with language models (LMs), where translating natural language (NL) into first-order logic (FOL) and employing an external solver results in a verifiable and therefore reliable reasoning system. However, smaller LMs often struggle with this translation task, frequently producing incorrect symbolic outputs due to formatting and translation errors. Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. To address this, we first categorize common errors and fine-tune smaller LMs using data synthesized by large language models. The evaluation is performed using the defined error categories. We introduce incremental inference, which divides inference into two stages, predicate generation and FOL translation, providing greater control over model behavior and enhancing generation quality as measured by predicate metrics. This decomposition framework also enables the use of a verification module that targets predicate-arity errors to further improve performance. Our study evaluates three families of models across four logical-reasoning datasets. The comprehensive fine-tuning, incremental inference, and verification modules reduce error rates, increase predicate coverage, and improve reasoning performance for smaller LMs, moving us closer to developing reliable and accessible symbolic-reasoning systems.
[COMMENTS]The Third workshop of NeusymBridge @AAAI 2026 (Bridging Neurons and Symbols for NLP and Knowledge Graph Reasoning)
[LINK]http://arxiv.org/abs/2601.09446v1
[DATE]2026-01-14 20:47:14+08:00
[CATEGORIES]cs.CL
ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
[AUTHORS]Edoardo Bianchi, Jacopo Staiano, Antonio Liotta
[ABSTRACT]Existing approaches treat action quality assessment and skill proficiency estimation as classification problems, outputting discrete labels without interpretable reasoning. We reformulate this task as generative vision language modeling, introducing ProfVLM, a compact model that jointly predicts proficiency levels and generates expert-like natural language feedback from multi-view videos. ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores. Central to our method is an AttentiveGatedProjector that dynamically fuses and projects multi-view egocentric and exocentric features from a frozen TimeSformer backbone into a language model fine-tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60% compared to existing classification-based methods. By providing natural language critiques aligned with performance levels, this work shows that generative vision-language modeling offers a powerful and efficient paradigm shift for interpretable action quality assessment.
[LINK]http://arxiv.org/abs/2509.26278v2
[DATE]2026-01-14 20:30:42+08:00
[CATEGORIES]cs.CL
Sentiment Analysis Of Shopee Product Reviews Using Distilbert
[AUTHORS]Zahri Aksa Dautd, Aviv Yuniar Rahman
[ABSTRACT]The rapid growth of digital commerce has led to the accumulation of a massive number of consumer reviews on online platforms. Shopee, as one of the largest e-commerce platforms in Southeast Asia, receives millions of product reviews every day containing valuable information regarding customer satisfaction and preferences. Manual analysis of these reviews is inefficient, thus requiring a computational approach such as sentiment analysis. This study examines the use of DistilBERT, a lightweight transformer-based deep learning model, for sentiment classification on Shopee product reviews. The dataset used consists of approximately one million English-language reviews that have been preprocessed and trained using the distilbert-base-uncased model. Evaluation was conducted using accuracy, precision, recall, and F1-score metrics, and compared against benchmark models such as BERT and SVM. The results show that DistilBERT achieved an accuracy of 94.8%, slightly below BERT (95.3%) but significantly higher than SVM (90.2%), with computation time reduced by more than 55%. These findings demonstrate that DistilBERT provides an optimal balance between accuracy and efficiency, making it suitable for large scale sentiment analysis on e-commerce platforms. Keywords: Sentiment Analysis, DistilBERT, Shopee Reviews, Natural Language Processing, Deep Learning, Transformer Models.
[COMMENTS]The authors have decided to withdraw this manuscript because substantial improvements are needed in the methodology, data analysis, and presentation of the research results to ensure the article’s scientific quality meets journal publication standards. Therefore, the authors plan to conduct a thorough revision before resubmitting to an appropriate journal
[LINK]http://arxiv.org/abs/2511.22313v2
[DATE]2026-01-14 19:52:53+08:00
[CATEGORIES]cs.CL
Structured Knowledge Representation through Contextual Pages for Retrieval-Augmented Generation
[AUTHORS]Xinze Li, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun
[ABSTRACT]Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge. Recently, some works have incorporated iterative knowledge accumulation processes into RAG models to progressively accumulate and refine query-related knowledge, thereby constructing more comprehensive knowledge representations. However, these iterative processes often lack a coherent organizational structure, which limits the construction of more comprehensive and cohesive knowledge representations. To address this, we propose PAGER, a page-driven autonomous knowledge representation framework for RAG. PAGER first prompts an LLM to construct a structured cognitive outline for a given question, which consists of multiple slots representing a distinct knowledge aspect. Then, PAGER iteratively retrieves and refines relevant documents to populate each slot, ultimately constructing a coherent page that serves as contextual input for guiding answer generation. Experiments on multiple knowledge-intensive benchmarks and backbone models show that PAGER consistently outperforms all RAG baselines. Further analyses demonstrate that PAGER constructs higher-quality and information-dense knowledge representations, better mitigates knowledge conflicts, and enables LLMs to leverage external knowledge more effectively. All code is available at https://github.com/OpenBMB/PAGER.
[LINK]http://arxiv.org/abs/2601.09402v1
[DATE]2026-01-14 19:44:31+08:00
[CATEGORIES]cs.CL
SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing
[AUTHORS]Ziyang Ma, Guanrou Yang, Wenxi Chen, Zhifu Gao, Yexing Du, Xiquan Li, Zhisheng Zheng, Haina Zhu, Jianheng Zhuo, Zheshu Song, Ruiyang Xu, Tiranrui Wang, Yifan Yang, Yanqiao Zhu, Zhikang Niu, Liumeng Xue, Yinghao Ma, Ruibin Yuan, Shiliang Zhang, Kai Yu, Eng Siong Chng, Xie Chen
[ABSTRACT]The recent surge in open-source Multimodal Large Language Models (MLLM) frameworks, such as LLaVA, provides a convenient kickoff for artificial intelligence developers and researchers. However, most of the MLLM frameworks take vision as the main input modality, and provide limited in-depth support for the modality of speech, audio, and music. This situation hinders the development of audio-language models, and forces researchers to spend a lot of effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints like LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some relevant techniques have also been accepted by academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to the LLM-based speech, audio and music processing.
[COMMENTS]Published in IEEE Journal of Selected Topics in Signal Processing (JSTSP)
[LINK]http://arxiv.org/abs/2601.09385v1
[DATE]2026-01-14 19:25:36+08:00
[CATEGORIES]cs.CL
Long-term Task-oriented Agent: Proactive Long-term Intent Maintenance in Dynamic Environments
[AUTHORS]Qinglong Shi, Donghai Wang, Hantao Zhou, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He
[ABSTRACT]Current large language model agents predominantly operate under a reactive paradigm, responding only to immediate user queries within short-term sessions. This limitation hinders their ability to maintain long-term user’s intents and dynamically adapt to evolving external environments. In this paper, we propose a novel interaction paradigm for proactive Task-oriented Agents capable of bridging the gap between relatively static user’s needs and a dynamic environment. We formalize proactivity through two key capabilities, (i) Intent-Conditioned Monitoring: The agent autonomously formulates trigger conditions based on dialog history; (ii) Event-Triggered Follow-up: The agent actively engages the user upon detecting useful environmental updates. We introduce a high-quality data synthesis pipeline to construct complex, multi-turn dialog data in a dynamic environment. Furthermore, we attempt to address the lack of evaluation criteria of task-oriented interaction in a dynamic environment by proposing a new benchmark, namely ChronosBench. We evaluated some leading close-source and open-source models at present and revealed their flaws in long-term task-oriented interaction. Furthermore, our fine-tuned model trained using synthetic data for supervised learning achieves a task completion rate of 85.19% for complex tasks including shifts in user intent, outperforming other models under test. And the result validated the effectiveness of our data-driven strategy.
[COMMENTS]8 pages, 2 figures
[LINK]http://arxiv.org/abs/2601.09382v1
[DATE]2026-01-14 19:15:40+08:00
[CATEGORIES]cs.CL
Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation
[AUTHORS]Arjun Krishnakumar, Rhea Sanjay Sukthanker, Hannan Javed Mahadik, Gabriela Kadlecová, Vladyslav Moroshan, Timur Carstensen, Frank Hutter, Aaron Klein
[ABSTRACT]Small Language models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse sub-network initializations that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use evolutionary search to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply knowledge distillation from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered using evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring 5.16x and 1.26x fewer floating point operations for token budgets of 10B and 100B, respectively. We release all code publicly, offering a practical and reproducible path toward cost-efficient small language model development at scale.
[LINK]http://arxiv.org/abs/2510.07227v2
[DATE]2026-01-14 19:05:32+08:00
[CATEGORIES]cs.CL
The Imperfective Paradox in Large Language Models
[AUTHORS]Bolei Ma, Yusuke Miyao
[ABSTRACT]Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.
[LINK]http://arxiv.org/abs/2601.09373v1
[DATE]2026-01-14 18:57:16+08:00
[CATEGORIES]cs.CL
Relation Extraction Capabilities of LLMs on Clinical Text: A Bilingual Evaluation for English and Turkish
[AUTHORS]Aidana Aidynkyzy, Oğuz Dikenelli, Oylum Alatlı, Şebnem Bora
[ABSTRACT]The scarcity of annotated datasets for clinical information extraction in non-English languages hinders the evaluation of large language model (LLM)-based methods developed primarily in English. In this study, we present the first comprehensive bilingual evaluation of LLMs for the clinical Relation Extraction (RE) task in both English and Turkish. To facilitate this evaluation, we introduce the first English-Turkish parallel clinical RE dataset, derived and carefully curated from the 2010 i2b2/VA relation classification corpus. We systematically assess a diverse set of prompting strategies, including multiple in-context learning (ICL) and Chain-of-Thought (CoT) approaches, and compare their performance to fine-tuned baselines such as PURE. Furthermore, we propose Relation-Aware Retrieval (RAR), a novel in-context example selection method based on contrastive learning, that is specifically designed to capture both sentence-level and relation-level semantics. Our results show that prompting-based LLM approaches consistently outperform traditional fine-tuned models. Moreover, evaluations for English performed better than their Turkish counterparts across all evaluated LLMs and prompting techniques. Among ICL methods, RAR achieves the highest performance, with Gemini 1.5 Flash reaching a micro-F1 score of 0.906 in English and 0.888 in Turkish. Performance further improves to 0.918 F1 in English when RAR is combined with a structured reasoning prompt using the DeepSeek-V3 model. These findings highlight the importance of high-quality demonstration retrieval and underscore the potential of advanced retrieval and prompting techniques to bridge resource gaps in clinical natural language processing.
[LINK]http://arxiv.org/abs/2601.09367v1
[DATE]2026-01-14 18:49:46+08:00
[CATEGORIES]cs.CL
Can Language Models Discover Scaling Laws?
[AUTHORS]Haowei Lin, Haotian Ye, Wenzheng Feng, Quzhe Huang, Yujun Li, Hubert Lim, Zhengrui Li, Xiangyu Wang, Jianzhu Ma, James Zou, Yitao Liang
[ABSTRACT]Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate eight diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimize the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrates that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and verify their practical utility in both pretraining and finetuning applications. This work establishes a new paradigm for agentic scientific discovery, showing that AI systems can understand their own scaling behavior, and can contribute novel and practical knowledge back to the research community.
[LINK]http://arxiv.org/abs/2507.21184v4
[DATE]2026-01-14 18:48:38+08:00
[CATEGORIES]cs.LG cs.CL
Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
[AUTHORS]Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan
[ABSTRACT]LLMs can solve complex tasks by generating long, multi-step reasoning chains. Test-time scaling (TTS) can further improve LLM performance by sampling multiple variants of intermediate reasoning steps, verifying their correctness, and strategically choosing the best steps for continuation. However, existing verification approaches, such as Process Reward Models (PRMs), are computationally expensive, limited to specific domains, and require large-scale human or model-generated annotations. We propose a lightweight alternative for step-level reasoning verification based on probing the internal states of LLMs. We train a transformer-based probe that uses the internal states of the frozen LLM to estimate the credibility of its reasoning steps during generation. Annotation can be generated either by another larger LLM (e.g., DeepSeek-R1) or in a self-supervised manner by the original model itself. The probes are both effective and lightweight, containing fewer than 10M parameters. Across multiple domains, including mathematics, planning, and general knowledge question answering, our probes match or even exceed the performance of PRMs that are up to 810x larger. Our findings suggest that the internal states of LLMs encode their confidence in reasoning processes and can serve as reliable signals for reasoning step verification, offering a promising direction towards scalable and generalizable TTS and introspective LLMs.
[COMMENTS]Preprint under review
[LINK]http://arxiv.org/abs/2511.06209v3
[DATE]2026-01-14 18:08:56+08:00
[CATEGORIES]cs.CL
AgenticIE: An Adaptive Agent for Information Extraction from Complex Regulatory Documents
[AUTHORS]Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
[ABSTRACT]Declaration of Performance (DoP) documents, mandated by EU regulation, specify characteristics of construction products, such as fire resistance and insulation. While this information is essential for quality control and reducing carbon footprints, it is not easily machine readable. Despite content requirements, DoPs exhibit significant variation in layout, schema, and format, further complicated by their multilingual nature. In this work, we propose DoP Key Information Extraction (KIE) and Question Answering (QA) as new NLP challenges. To address this challenge, we design a domain-specific AgenticIE system based on a planner-executor-corresponder pattern. For evaluation, we introduce a high-density, expert-annotated dataset of complex, multi-page regulatory documents in English and German. Unlike standard IE datasets (e.g., FUNSD, CORD) with sparse annotations, our dataset contains over 15K annotated entities, averaging over 190 annotations per document. Our agentic system outperforms static and multimodal LLM baselines, achieving Exact Match (EM) scores of 0.396 vs. 0.342 (GPT-4o, +16%) and 0.314 (GPT-4o-V, +26%) across the KIE and QA tasks. Our experimental analysis validates the benefits of the agentic system, as well as the challenging nature of our new DoP dataset.
[LINK]http://arxiv.org/abs/2509.11773v3
[DATE]2026-01-14 17:15:01+08:00
[CATEGORIES]cs.CL
Bench360: Benchmarking Local LLM Inference from 360 Degrees
[AUTHORS]Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst
[ABSTRACT]Running LLMs locally has become increasingly common, but users face a complex design space across models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in one place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy, startup time). We demonstrate it on four NLP tasks across three GPUs and four engines, showing how design choices shape efficiency and output quality. Results confirm that tradeoffs are substantial and configuration choices depend on specific workloads and constraints. There is no universal best option, underscoring the need for comprehensive, deployment-oriented benchmarks.
[LINK]http://arxiv.org/abs/2511.16682v2
[DATE]2026-01-14 16:53:59+08:00
[CATEGORIES]cs.CL cs.LG
MCGA: A Multi-task Classical Chinese Literary Genre Audio Corpus
[AUTHORS]Yexing Du, Kaiyuan Liu, Bihe Zhang, Youcheng Pan, Bo Yang, Liangyu Huo, Xiyuan Zhang, Jian Xie, Daojing He, Yang Xiang, Ming Liu, Bin Qin
[ABSTRACT]With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when processed on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric to measure the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: https://github.com/yxduir/MCGA
[LINK]http://arxiv.org/abs/2601.09270v1
[DATE]2026-01-14 16:05:16+08:00
[CATEGORIES]cs.CL
DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation
[AUTHORS]Guanzhi Deng, Bo Li, Ronghao Chen, Huacan Wang, Lijie Wen, Linqi Song
[ABSTRACT]Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning (PEFT), such as LoRA, is widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches assign identical LoRA ranks to all experts, overlooking the intrinsic functional specialization within MoE LLMs. This uniform allocation leads to resource mismatch, task-relevant experts are under-provisioned while less relevant ones receive redundant parameters. We propose a Dynamic Rank LoRA framework named DR-LoRA, which dynamically grows expert LoRA ranks during fine-tuning based on task-specific demands. DR-LoRA employs an Expert Saliency Scoring mechanism that integrates expert routing frequency and LoRA rank importance to quantify each expert’s demand for additional capacity. Experts with higher saliency scores are prioritized for rank expansion, enabling the automatic formation of a heterogeneous rank distribution tailored to the target task. Experiments on multiple benchmarks demonstrate that DR-LoRA consistently outperforms standard LoRA and static allocation strategies under the same parameter budget, achieving superior task performance with more efficient parameter utilization.
[LINK]http://arxiv.org/abs/2601.04823v3
[DATE]2026-01-14 16:04:55+08:00
[CATEGORIES]cs.CL
Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan’s Intelligent Interaction Systems
[AUTHORS]Xuxin Cheng, Ke Zeng, Zhiquan Cao, Linyi Dai, Wenxuan Gao, Fei Han, Ai Jian, Feng Hong, Wenxing Hu, Zihe Huang, Dejian Kong, Jia Leng, Zhuoyuan Liao, Pei Liu, Jiaye Lin, Xing Ma, Jingqing Ruan, Jiaxing Song, Xiaoyu Tan, Ruixuan Xiao, Wenhui Yu, Wenyu Zhan, Haoxing Zhang, Chao Zhou, Hao Zhou, Shaodong Zheng, Ruinian Chen, Siyuan Chen, Ziyang Chen, Yiwen Dong, Yaoyou Fan, Yangyi Fang, Yang Gan, Shiguang Guo, Qi He, Chaowen Hu, Binghui Li, Dailin Li, Xiangyu Li, Yan Li, Chengjian Liu, Xiangfeng Liu, Jiahui Lv, Qiao Ma, Jiang Pan, Cong Qin, Chenxing Sun, Wen Sun, Zhonghui Wang, Abudukelimu Wuerkaixi, Xin Yang, Fangyi Yuan, Yawen Zhu, Tianyi Zhai, Jie Zhang, Runlai Zhang, Yao Xu, Yiran Zhao, Yifan Wang, Xunliang Cai, Yangen Hu, Cao Liu, Lu Pan, Xiaoli Wang, Bo Xiao, Wenyuan Yao, Qianlin Zhou, Benchang Zhu
[COMMENTS]36 pages, 14 figures
[LINK]http://arxiv.org/abs/2510.13291v2
[DATE]2026-01-14 15:30:10+08:00
[CATEGORIES]cs.CL
TeachPro: Multi-Label Qualitative Teaching Evaluation via Cross-View Graph Synergy and Semantic Anchored Evidence Encoding
[AUTHORS]Xiangqian Wang, Yifan Jia, Yang Xiang, Yumin Zhang, Yanbin Wang, Ke Liu
[ABSTRACT]Standardized Student Evaluation of Teaching often suffer from low reliability, restricted response options, and response distortion. Existing machine learning methods that mine open-ended comments usually reduce feedback to binary sentiment, which overlooks concrete concerns such as content clarity, feedback timeliness, and instructor demeanor, and provides limited guidance for instructional improvement.We propose TeachPro, a multi-label learning framework that systematically assesses five key teaching dimensions: professional expertise, instructional behavior, pedagogical efficacy, classroom experience, and other performance metrics. We first propose a Dimension-Anchored Evidence Encoder, which integrates three core components: (i) a pre-trained text encoder that transforms qualitative feedback annotations into contextualized embeddings; (ii) a prompt module that represents five teaching dimensions as learnable semantic anchors; and (iii) a cross-attention mechanism that aligns evidence with pedagogical dimensions within a structured semantic space. We then propose a Cross-View Graph Synergy Network to represent student comments. This network comprises two components: (i) a Syntactic Branch that extracts explicit grammatical dependencies from parse trees, and (ii) a Semantic Branch that models latent conceptual relations derived from BERT-based similarity graphs. BiAffine fusion module aligns syntactic and semantic units, while a differential regularizer disentangles embeddings to encourage complementary representations. Finally, a cross-attention mechanism bridges the dimension-anchored evidence with the multi-view comment representations. We also contribute a novel benchmark dataset featuring expert qualitative annotations and multi-label scores. Extensive experiments demonstrate that TeachPro offers superior diagnostic granularity and robustness across diverse evaluation settings.
[LINK]http://arxiv.org/abs/2601.09246v1
[DATE]2026-01-14 15:27:57+08:00
[CATEGORIES]cs.CL
When to Trust: A Causality-Aware Calibration Framework for Accurate Knowledge Graph Retrieval-Augmented Generation
[AUTHORS]Jing Ren, Bowen Li, Ziqi Xu, Xinkun Zhang, Haytham Fayek, Xiaodong Li
[ABSTRACT]Knowledge Graph Retrieval-Augmented Generation (KG-RAG) extends the RAG paradigm by incorporating structured knowledge from knowledge graphs, enabling Large Language Models (LLMs) to perform more precise and explainable reasoning. While KG-RAG improves factual accuracy in complex tasks, existing KG-RAG models are often severely overconfident, producing high-confidence predictions even when retrieved sub-graphs are incomplete or unreliable, which raises concerns for deployment in high-stakes domains. To address this issue, we propose Ca2KG, a Causality-aware Calibration framework for KG-RAG. Ca2KG integrates counterfactual prompting, which exposes retrieval-dependent uncertainties in knowledge quality and reasoning reliability, with a panel-based re-scoring mechanism that stabilises predictions across interventions. Extensive experiments on two complex QA datasets demonstrate that Ca2KG consistently improves calibration while maintaining or even enhancing predictive accuracy.
[COMMENTS]Accepted by WWW 2026
[LINK]http://arxiv.org/abs/2601.09241v1
[DATE]2026-01-14 15:22:59+08:00
[CATEGORIES]cs.CL
GIFT: Unlocking Global Optimality in Post-Training via Finite-Temperature Gibbs Initialization
[AUTHORS]Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng, Chengyu Shen, Lexiang Tang, Haoze Sun, Peng Pei, Wentao Zhang
[ABSTRACT]The prevailing post-training paradigm for Large Reasoning Models (LRMs)–Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)–suffers from an intrinsic optimization mismatch: the rigid supervision inherent in SFT induces distributional collapse, thereby exhausting the exploration space necessary for subsequent RL. In this paper, we reformulate SFT within a unified post-training framework and propose Gibbs Initialization with Finite Temperature (GIFT). We characterize standard SFT as a degenerate zero-temperature limit that suppresses base priors. Conversely, GIFT incorporates supervision as a finite-temperature energy potential, establishing a distributional bridge that ensures objective consistency throughout the post-training pipeline. Our experiments demonstrate that GIFT significantly outperforms standard SFT and other competitive baselines when utilized for RL initialization, providing a mathematically principled pathway toward achieving global optimality in post-training. Our code is available at https://github.com/zzy1127/GIFT.
[LINK]http://arxiv.org/abs/2601.09233v1
[DATE]2026-01-14 15:13:57+08:00
[CATEGORIES]cs.LG cs.CL
Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
[AUTHORS]Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli
[ABSTRACT]Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by $\mathbf{2.8{-}3.1\times}$ while provably preserving the model’s output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM’s distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to $\mathbf{7.9\times}$.
[COMMENTS]Original version uploaded on Sep 22, 2025. (v2): Extended Table 2 with additional analysis and referenced it in Sec 5.2. (v3): Added note to Sec 4.2 and Appendix A.2 specifying conditions for losslessness
[LINK]http://arxiv.org/abs/2509.18085v3
[DATE]2026-01-14 14:58:58+08:00
[CATEGORIES]cs.LG cs.CL
Temporal Knowledge Graph Question Answering: A Survey
[AUTHORS]Miao Su, Zixuan Li, Zhuo Chen, Long Bai, Xiaolong Jin, Jiafeng Guo
[ABSTRACT]Knowledge Base Question Answering (KBQA) has been a long-standing field to answer questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted a growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task to answer temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.
[COMMENTS]8 pages, 3 figures. This work has been submitted to the IEEE for possible publication
[LINK]http://arxiv.org/abs/2406.14191v4
[DATE]2026-01-14 14:54:40+08:00
[CATEGORIES]cs.CL cs.LG
UserLM-R1: Modeling Human Reasoning in User Language Models with Multi-Reward Reinforcement Learning
[AUTHORS]Feng Zhang, Shijia Li, Chunmao Zhang, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu, Han Liu
[ABSTRACT]User simulators serve as the critical interactive environment for agent post-training, and an ideal user simulator generalizes across domains and proactively engages in negotiation by challenging or bargaining. However, current methods exhibit two issues. They rely on static and context-unaware profiles, necessitating extensive manual redesign for new scenarios, thus limiting generalizability. Moreover, they neglect human strategic thinking, leading to vulnerability to agent manipulation. To address these issues, we propose UserLM-R1, a novel user language model with reasoning capability. Specifically, we first construct comprehensive user profiles with both static roles and dynamic scenario-specific goals for adaptation to diverse scenarios. Then, we propose a goal-driven decision-making policy to generate high-quality rationales before producing responses, and further refine the reasoning and improve strategic capabilities with supervised fine-tuning and multi-reward reinforcement learning. Extensive experimental results demonstrate that UserLM-R1 outperforms competitive baselines, particularly on the more challenging adversarial set.
[LINK]http://arxiv.org/abs/2601.09215v1
[DATE]2026-01-14 14:42:01+08:00
[CATEGORIES]cs.CL
The Agentic Leash: Extracting Causal Feedback Fuzzy Cognitive Maps with LLMs
[AUTHORS]Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko
[ABSTRACT]We design a large-language-model (LLM) agent that extracts causal feedback fuzzy cognitive maps (FCMs) from raw text. The causal learning or extraction process is agentic both because of the LLM’s semi-autonomy and because ultimately the FCM dynamical system’s equilibria drive the LLM agents to fetch and process causal text. The fetched text can in principle modify the adaptive FCM causal structure and so modify the source of its quasi-autonomy–its equilibrium limit cycles and fixed-point attractors. This bidirectional process endows the evolving FCM dynamical system with a degree of autonomy while still staying on its agentic leash. We show in particular that a sequence of three finely tuned system instructions guide an LLM agent as it systematically extracts key nouns and noun phrases from text, as it extracts FCM concept nodes from among those nouns and noun phrases, and then as it extracts or infers partial or fuzzy causal edges between those FCM nodes. We test this FCM generation on a recent essay about the promise of AI from the late diplomat and political theorist Henry Kissinger and his colleagues. This three-step process produced FCM dynamical systems that converged to the same equilibrium limit cycles as did the human-generated FCMs even though the human-generated FCM differed in the number of nodes and edges. A final FCM mixed generated FCMs from separate Gemini and ChatGPT LLM agents. The mixed FCM absorbed the equilibria of its dominant mixture component but also created new equilibria of its own to better approximate the underlying causal dynamical system.
[COMMENTS]15 figures
[LINK]http://arxiv.org/abs/2601.00097v2
[DATE]2026-01-14 14:26:14+08:00
[CATEGORIES]cs.CL
Optimizing Fine-Tuning through Advanced Initialization Strategies for Low-Rank Adaptation
[AUTHORS]Yongfu Xue
[ABSTRACT]The rapid development of parameter-efficient fine-tuning methods has noticeably improved the efficiency of adapting large language models. Among these, LoRA has gained widespread popularity due to its strong balance of effectiveness and parameter efficiency. However, LoRA relies on initializing two low-rank matrices whose product is zero, which limits its ability to effectively activate and leverage the original model weights-creating a potential bottleneck for optimal performance. To address this limitation, we propose \textbf{IniLoRA}, a novel initialization strategy that initializes the low-rank matrices to closely approximate the original model weights. Experimental results indicate that IniLoRA achieves better performance than LoRA across a range of models and tasks. Additionally, we introduce two variants, IniLoRA-$α$ and IniLoRA-$β$, both leveraging distinct initialization methods to enhance performance further.
[LINK]http://arxiv.org/abs/2510.03731v4
[DATE]2026-01-14 14:12:58+08:00
[CATEGORIES]cs.LG cs.CL
“Hiding in Plain Sight”: Designing Synthetic Dialog Generation for Uncovering Socially Situated Norms
[AUTHORS]Chengfei Wu, Dan Goldwasser
[COMMENTS]Accepted at AAAI Workshop on Shaping Responsible Synthetic Data in the Era of Foundation Models
[LINK]http://arxiv.org/abs/2410.00998v2
[DATE]2026-01-14 14:03:44+08:00
[CATEGORIES]cs.CL
KPoEM: A Human-Annotated Dataset for Emotion Classification and RAG-Based Poetry Generation in Korean Modern Poetry
[AUTHORS]Iro Lim, Haein Ji, Byungjun Kim
[ABSTRACT]This study introduces KPoEM (Korean Poetry Emotion Mapping), a novel dataset that serves as a foundation for both emotion-centered analysis and generative applications in modern Korean poetry. Despite advancements in NLP, poetry remains underexplored due to its complex figurative language and cultural specificity. We constructed a multi-label dataset of 7,662 entries (7,007 line-level and 615 work-level), annotated with 44 fine-grained emotion categories from five influential Korean poets. The KPoEM emotion classification model, fine-tuned through a sequential strategy – moving from general-purpose corpora to the specialized KPoEM dataset – achieved an F1-micro score of 0.60, significantly outperforming previous models (0.43). The model demonstrates an enhanced ability to identify temporally and culturally specific emotional expressions while preserving core poetic sentiments. Furthermore, applying the structured emotion dataset to a RAG-based poetry generation model demonstrates the empirical feasibility of generating texts that reflect the emotional and cultural sensibilities of Korean literature. This integrated approach strengthens the connection between computational techniques and literary analysis, opening new pathways for quantitative emotion research and generative poetics. Overall, this study provides a foundation for advancing emotion-centered analysis and creation in modern Korean poetry.
[COMMENTS]43 pages, 22 tables, 3 figures, Digital Humanities and Social Sciences Korea Conference, James Joo-Jin Kim Center for Korean Studies, University of Pennsylvania, Philadelphia, USA
[LINK]http://arxiv.org/abs/2509.03932v2
[DATE]2026-01-14 13:34:47+08:00
[CATEGORIES]cs.CL cs.LG
OrthoGeoLoRA: Geometric Parameter-Efficient Fine-Tuning for Structured Social Science Concept Retrieval on theWeb
[AUTHORS]Zeqiang Wang, Xinyue Wu, Chenxi Li, Zixi Chen, Nishanth Sastry, Jon Johnson, Suparna De
[ABSTRACT]Large language models and text encoders increasingly power web-based information systems in the social sciences, including digital libraries, data catalogues, and search interfaces used by researchers, policymakers, and civil society. Full fine-tuning is often computationally and energy intensive, which can be prohibitive for smaller institutions and non-profit organizations in the Web4Good ecosystem. Parameter-Efficient Fine-Tuning (PEFT), especially Low-Rank Adaptation (LoRA), reduces this cost by updating only a small number of parameters. We show that the standard LoRA update $ΔW = BA^\top$ has geometric drawbacks: gauge freedom, scale ambiguity, and a tendency toward rank collapse. We introduce OrthoGeoLoRA, which enforces an SVD-like form $ΔW = BΣA^\top$ by constraining the low-rank factors to be orthogonal (Stiefel manifold). A geometric reparameterization implements this constraint while remaining compatible with standard optimizers such as Adam and existing fine-tuning pipelines. We also propose a benchmark for hierarchical concept retrieval over the European Language Social Science Thesaurus (ELSST), widely used to organize social science resources in digital repositories. Experiments with a multilingual sentence encoder show that OrthoGeoLoRA outperforms standard LoRA and several strong PEFT variants on ranking metrics under the same low-rank budget, offering a more compute- and parameter-efficient path to adapt foundation models in resource-constrained settings.
[LINK]http://arxiv.org/abs/2601.09185v1
[DATE]2026-01-14 13:34:40+08:00
[CATEGORIES]cs.CL
Geometric Stability: The Missing Axis of Representations
[AUTHORS]Prashant C. Raju
[ABSTRACT]Analysis of learned representations has a blind spot: it focuses on $similarity$, measuring how closely embeddings align with external references, but similarity reveals only what is represented, not whether that structure is robust. We introduce $geometric$ $stability$, a distinct dimension that quantifies how reliably representational geometry holds under perturbation, and present $Shesha$, a framework for measuring it. Across 2,463 configurations in seven domains, we show that stability and similarity are empirically uncorrelated ($ρ\approx 0.01$) and mechanistically distinct: similarity metrics collapse after removing the top principal components, while stability retains sensitivity to fine-grained manifold structure. This distinction yields actionable insights: for safety monitoring, stability acts as a functional geometric canary, detecting structural drift nearly 2$\times$ more sensitively than CKA while filtering out the non-functional noise that triggers false alarms in rigid distance metrics; for controllability, supervised stability predicts linear steerability ($ρ= 0.89$-$0.96$); for model selection, stability dissociates from transferability, revealing a geometric tax that transfer optimization incurs. Beyond machine learning, stability predicts CRISPR perturbation coherence and neural-behavioral coupling. By quantifying $how$ $reliably$ systems maintain structure, geometric stability provides a necessary complement to similarity for auditing representations across biological and computational systems.
[LINK]http://arxiv.org/abs/2601.09173v1
[DATE]2026-01-14 13:15:22+08:00
[CATEGORIES]cs.LG cs.CL
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
[AUTHORS]Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
[COMMENTS]Accepted to NeurIPS 2025 (Poster). Code available at https://github.com/HumainLab/ExPO_rl_reasoning_by_explanation
[LINK]http://arxiv.org/abs/2507.02834v2
[DATE]2026-01-14 12:45:32+08:00
[CATEGORIES]cs.LG cs.CL
EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
[AUTHORS]Shijian Ma, Yan Lin, Yi Yang
[ABSTRACT]Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen’s Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework leveraging a core insight: disagreement between frontier LLMs signals hard examples most valuable for training. We mine boundary cases where two strong annotators conflict, using a judge to resolve labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs 0.393) - evidence that disagreement mining acts as implicit regularization. Our trained model Eva-4B (4B parameters) achieves 81.3 percent accuracy, outperforming its base by 25 percentage points and approaching frontier LLM performance at a fraction of inference cost.
[COMMENTS]Shijian Ma and Yan Lin contributed equally. Corresponding author: Yan Lin
[LINK]http://arxiv.org/abs/2601.09142v1
[DATE]2026-01-14 12:26:43+08:00
[CATEGORIES]cs.LG cs.CL
Identity-Robust Language Model Generation via Content Integrity Preservation
[AUTHORS]Miao Zhang, Kelly Chen, Md Mehrab Tanjim, Rumi Chunara
[ABSTRACT]Large Language Model (LLM) outputs often vary across user sociodemographic attributes, leading to disparities in factual accuracy, utility, and safety, even for objective questions where demographic information is irrelevant. Unlike prior work on stereotypical or representational bias, this paper studies identity-dependent degradation of core response quality. We show empirically that such degradation arises from biased generation behavior, despite factual knowledge being robustly encoded across identities. Motivated by this mismatch, we propose a lightweight, training-free framework for identity-robust generation that selectively neutralizes non-critical identity information while preserving semantically essential attributes, thus maintaining output content integrity. Experiments across four benchmarks and 18 sociodemographic identities demonstrate an average 77% reduction in identity-dependent bias compared to vanilla prompting and a 45% reduction relative to prompt-based defenses. Our work addresses a critical gap in mitigating the impact of user identity cues in prompts on core generation quality.
[LINK]http://arxiv.org/abs/2601.09141v1
[DATE]2026-01-14 12:25:29+08:00
[CATEGORIES]cs.CL
Adaptive Multi-Stage Patent Claim Generation with Unified Quality Assessment
[AUTHORS]Chen-Wei Liang, Bin Guo, Zhen-Yuan Wei, Mu-Jiang-Shan Wang
[ABSTRACT]Current patent claim generation systems face three fundamental limitations: poor cross-jurisdictional generalization, inadequate semantic relationship modeling between claims and prior art, and unreliable quality assessment. We introduce a novel three-stage framework that addresses these challenges through relationship-aware similarity analysis, domain-adaptive claim generation, and unified quality assessment. Our approach employs multi-head attention with eight specialized heads for explicit relationship modeling, integrates curriculum learning with dynamic LoRA adapter selection across five patent domains, and implements cross-attention mechanisms between evaluation aspects for comprehensive quality assessment. Extensive experiments on USPTO HUPD dataset, EPO patent collections, and Patent-CE benchmark demonstrate substantial improvements: 7.6-point ROUGE-L gain over GPT-4o, 8.3\% BERTScore enhancement over Llama-3.1-8B, and 0.847 correlation with human experts compared to 0.623 for separate evaluation models. Our method maintains 89.4\% cross-jurisdictional performance retention versus 76.2\% for baselines, establishing a comprehensive solution for automated patent prosecution workflows.
[COMMENTS]18 pages, 7 figures. Preprint
[LINK]http://arxiv.org/abs/2601.09120v1
[DATE]2026-01-14 11:44:27+08:00
[CATEGORIES]cs.CL
Contrastive Bi-Encoder Models for Multi-Label Skill Extraction: Enhancing ESCO Ontology Matching with BERT and Attention Mechanisms
[AUTHORS]Yongming Sun
[ABSTRACT]Fine-grained labor market analysis increasingly relies on mapping unstructured job advertisements to standardized skill taxonomies such as ESCO. This mapping is naturally formulated as an Extreme Multi-Label Classification (XMLC) problem, but supervised solutions are constrained by the scarcity and cost of large-scale, taxonomy-aligned annotations–especially in non-English settings where job-ad language diverges substantially from formal skill definitions. We propose a zero-shot skill extraction framework that eliminates the need for manually labeled job-ad training data. The framework uses a Large Language Model (LLM) to synthesize training instances from ESCO definitions, and introduces hierarchically constrained multi-skill generation based on ESCO Level-2 categories to improve semantic coherence in multi-label contexts. On top of the synthetic corpus, we train a contrastive bi-encoder that aligns job-ad sentences with ESCO skill descriptions in a shared embedding space; the encoder augments a BERT backbone with BiLSTM and attention pooling to better model long, information-dense requirement statements. An upstream RoBERTa-based binary filter removes non-skill sentences to improve end-to-end precision. Experiments show that (i) hierarchy-conditioned generation improves both fluency and discriminability relative to unconstrained pairing, and (ii) the resulting multi-label model transfers effectively to real-world Chinese job advertisements, achieving strong zero-shot retrieval performance (F1@5 = 0.72) and outperforming TF–IDF and standard BERT baselines. Overall, the proposed pipeline provides a scalable, data-efficient pathway for automated skill coding in labor economics and workforce analytics.
[LINK]http://arxiv.org/abs/2601.09119v1
[DATE]2026-01-14 11:43:45+08:00
[CATEGORIES]cs.CL
Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces
[AUTHORS]Minju Gwak, Guijin Son, Jaehyung Kim
[ABSTRACT]The Uniform Information Density (UID) hypothesis suggests that effective communication maintains a stable flow of information. In this work, we revisit this principle in the context of large language model (LLM) reasoning traces, asking whether step-level uniformity reflects reasoning quality. To this end, we propose an entropy-based stepwise information density metric and introduce two complementary measures of uniformity, local and global uniformity scores. Across the experiments on six different reasoning benchmarks, we find that step-level uniformity not only provides a strong theoretical lens but also yields practical performance benefits; for example, selecting reasoning traces with more uniform information density at the step-level improves accuracy by 10-32\% relative gains over baselines at AIME2025. Our analysis further reveals that correct reasoning traces tend to avoid sharp information density spikes, while incorrect traces exhibit irregular information bursts. These results demonstrate that UID-inspired information density measures outperform alternative internal signals as predictors of reasoning quality. Results highlight the uniformity of the information density as a robust diagnostic and selection criterion for building more reliable and accurate reasoning systems.
[LINK]http://arxiv.org/abs/2510.06953v2
[DATE]2026-01-14 11:29:50+08:00
[CATEGORIES]cs.CL
To Retrieve or To Think? An Agentic Approach for Context Evolution
[AUTHORS]Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li
[ABSTRACT]Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. It aims to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token consumption. Our work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
[LINK]http://arxiv.org/abs/2601.08747v2
[DATE]2026-01-14 11:14:05+08:00
[CATEGORIES]cs.CL
AviationLMM: A Large Multimodal Foundation Model for Civil Aviation
[AUTHORS]Wenbin Li, Jingling Wu, Xiaoyong Lin. Jing Chen, Cong Chen
[ABSTRACT]Civil aviation is a cornerstone of global transportation and commerce, and ensuring its safety, efficiency and customer satisfaction is paramount. Yet conventional Artificial Intelligence (AI) solutions in aviation remain siloed and narrow, focusing on isolated tasks or single modalities. They struggle to integrate heterogeneous data such as voice communications, radar tracks, sensor streams and textual reports, which limits situational awareness, adaptability, and real-time decision support. This paper introduces the vision of AviationLMM, a Large Multimodal foundation Model for civil aviation, designed to unify the heterogeneous data streams of civil aviation and enable understanding, reasoning, generation and agentic applications. We firstly identify the gaps between existing AI solutions and requirements. Secondly, we describe the model architecture that ingests multimodal inputs such as air-ground voice, surveillance, on-board telemetry, video and structured texts, and performs cross-modal alignment and fusion, and produces flexible outputs ranging from situation summaries and risk alerts to predictive diagnostics and multimodal incident reconstructions. In order to fully realize this vision, we identify key research opportunities to address, including data acquisition, alignment and fusion, pretraining, reasoning, trustworthiness, privacy, robustness to missing modalities, and synthetic scenario generation. By articulating the design and challenges of AviationLMM, we aim to boost the civil aviation foundation model progress and catalyze coordinated research efforts toward an integrated, trustworthy and privacy-preserving aviation AI ecosystem.
[COMMENTS]Accepted by 2025 7th International Conference on Interdisciplinary Computer Science and Engineering (ICICSE 2025) conference, Chongqing, China; 9 pages,1 figure,5 tables
[LINK]http://arxiv.org/abs/2601.09105v1
[DATE]2026-01-14 11:10:33+08:00
[CATEGORIES]cs.CL
Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs
[AUTHORS]Kangda Wei, Hasnat Md Abdullah, Ruihong Huang
[ABSTRACT]Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We will release the code and generated data. We release the code and generated data at: https://github.com/WeiKangda/LLMs-Exploratory-Bias-Mitigation/tree/main.
[LINK]http://arxiv.org/abs/2505.17217v3
[DATE]2026-01-14 10:45:29+08:00
[CATEGORIES]cs.CL
SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
[AUTHORS]Shuyang Hou, Yi Hu, Muhan Zhang
[ABSTRACT]Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed due to lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.
[LINK]http://arxiv.org/abs/2601.09089v1
[DATE]2026-01-14 10:45:08+08:00
[CATEGORIES]cs.CL
Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
[AUTHORS]Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye
[ABSTRACT]In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation – even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself – enabling the student model to learn the teacher’s full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher’s sequence-level distribution; ii) Misalignment between the teacher’s output distribution and the student’s learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples – an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
[COMMENTS]Project Page: https://github.com/D2I-ai/dasd-thinking
[LINK]http://arxiv.org/abs/2601.09088v1
[DATE]2026-01-14 10:43:17+08:00
[CATEGORIES]cs.LG cs.CL
MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
[AUTHORS]Kangda Wei, Ruihong Huang
[ABSTRACT]Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. We will release our code, trained models, and experimental protocols.
[LINK]http://arxiv.org/abs/2601.09085v1
[DATE]2026-01-14 10:35:19+08:00
[CATEGORIES]cs.LG cs.CL
From Symbolic to Natural-Language Relations: Rethinking Knowledge Graph Construction in the Era of Large Language Models
[AUTHORS]Kanyao Han, Yushang Lai
[ABSTRACT]Knowledge graphs (KGs) have commonly been constructed using predefined symbolic relation schemas, typically implemented as categorical relation labels. This design has notable shortcomings: real-world relations are often contextual, nuanced, and sometimes uncertain, and compressing it into discrete relation labels abstracts away critical semantic detail. Nevertheless, symbolic-relation KGs remain widely used because they have been operationally effective and broadly compatible with pre-LLM downstream models and algorithms, in which KG knowledge could be retrieved or encoded into quantified features and embeddings at scale. The emergence of LLMs has reshaped how knowledge is created and consumed. LLMs support scalable synthesis of domain facts directly in concise natural language, and prompting-based inference favors context-rich free-form text over quantified representations. This position paper argues that these changes call for rethinking the representation of relations themselves rather than merely using LLMs to populate conventional schemas more efficiently. We therefore advocate moving from symbolic to natural-language relation descriptions, and we propose hybrid design principles that preserve a minimal structural backbone while enabling more flexible and context-sensitive relational representations.
[LINK]http://arxiv.org/abs/2601.09069v1
[DATE]2026-01-14 09:49:24+08:00
[CATEGORIES]cs.CL
Word Synchronization Challenge: A Benchmark for Word Association Responses for Large Language Models
[AUTHORS]Tanguy Cazalets, Joni Dambre
[ABSTRACT]This paper introduces the Word Synchronization Challenge, a novel benchmark to evaluate large language models (LLMs) in Human-Computer Interaction (HCI). This benchmark uses a dynamic game-like framework to test LLMs ability to mimic human cognitive processes through word associations. By simulating complex human interactions, it assesses how LLMs interpret and align with human thought patterns during conversational exchanges, which are essential for effective social partnerships in HCI. Initial findings highlight the influence of model sophistication on performance, offering insights into the models capabilities to engage in meaningful social interactions and adapt behaviors in human-like ways. This research advances the understanding of LLMs potential to replicate or diverge from human cognitive functions, paving the way for more nuanced and empathetic human-machine collaborations.
[LINK]http://arxiv.org/abs/2502.08312v2
[DATE]2026-01-14 09:37:18+08:00
[CATEGORIES]cs.CL
Mi:dm 2.0 Korea-centric Bilingual Language Models
[AUTHORS]Donghoon Shin, Sejung Lee, Soonmin Bae, Hwijung Ryu, Changwon Ok, Hoyoun Jung, Hyesung Ji, Jeehyun Lim, Jehoon Lee, Ji-Eun Han, Jisoo Baik, Mihyeon Kim, Riwoo Chung, Seongmin Lee, Wonjae Park, Yoonseok Heo, Youngkyung Seo, Seyoun Won, Boeun Kim, Cheolhun Heo, Eunkyeong Lee, Honghee Lee, Hyeongju Ju, Hyeontae Seo, Jeongyong Shim, Jisoo Lee, Junseok Koh, Junwoo Kim, Minho Lee, Minji Kang, Minju Kim, Sangha Nam, Seongheum Park, Taehyeong Kim, Euijai Ahn, Hong Seok Jeung, Jisu Shin, Jiyeon Kim, Seonyeong Song, Seung Hyun Kong, Sukjin Hong, Taeyang Yun, Yu-Seon Kim, A-Hyun Lee, Chae-Jeong Lee, Hye-Won Yu, Ji-Hyun Ahn, Song-Yeon Kim, Sun-Woo Jung, Eunju Kim, Eunji Ha, Jinwoo Baek, Yun-ji Lee, Wanjin Park, Jeong Yeop Kim, Eun Mi Kim, Hyoung Jun Park, Jung Won Yoon, Min Sung Noh, Myung Gyo Oh, Wongyoung Lee, Yun Jin Park, Young S. Kwon, Hyun Keun Kim, Jieun Lee, YeoJoo Park
[ABSTRACT]We introduce Mi:dm 2.0, a bilingual large language model (LLM) specifically engineered to advance Korea-centric AI. This model goes beyond Korean text processing by integrating the values, reasoning patterns, and commonsense knowledge inherent to Korean society, enabling nuanced understanding of cultural contexts, emotional subtleties, and real-world scenarios to generate reliable and culturally appropriate responses. To address limitations of existing LLMs, often caused by insufficient or low-quality Korean data and lack of cultural alignment, Mi:dm 2.0 emphasizes robust data quality through a comprehensive pipeline that includes proprietary data cleansing, high-quality synthetic data generation, strategic data mixing with curriculum learning, and a custom Korean-optimized tokenizer to improve efficiency and coverage. To realize this vision, we offer two complementary configurations: Mi:dm 2.0 Base (11.5B parameters), built with a depth-up scaling strategy for general-purpose use, and Mi:dm 2.0 Mini (2.3B parameters), optimized for resource-constrained environments and specialized tasks. Mi:dm 2.0 achieves state-of-the-art performance on Korean-specific benchmarks, with top-tier zero-shot results on KMMLU and strong internal evaluation results across language, humanities, and social science tasks. The Mi:dm 2.0 lineup is released under the MIT license to support extensive research and commercial use. By offering accessible and high-performance Korea-centric LLMs, KT aims to accelerate AI adoption across Korean industries, public services, and education, strengthen the Korean AI developer community, and lay the groundwork for the broader vision of K-intelligence. Our models are available at https://huggingface.co/K-intelligence. For technical inquiries, please contact [email protected].
[LINK]http://arxiv.org/abs/2601.09066v1
[DATE]2026-01-14 09:28:29+08:00
[CATEGORIES]cs.CL
Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP
[AUTHORS]Yinuo Xu, David Jurgens
[ABSTRACT]Annotator disagreement is widespread in NLP, particularly for subjective and ambiguous tasks such as toxicity detection and stance analysis. While early approaches treated disagreement as noise to be removed, recent work increasingly models it as a meaningful signal reflecting variation in interpretation and perspective. This survey provides a unified view of disagreement-aware NLP methods. We first present a domain-agnostic taxonomy of the sources of disagreement spanning data, task, and annotator factors. We then synthesize modeling approaches using a common framework defined by prediction targets and pooling structure, highlighting a shift from consensus learning toward explicitly modeling disagreement, and toward capturing structured relationships among annotators. We review evaluation metrics for both predictive performance and annotator behavior, and noting that most fairness evaluations remain descriptive rather than normative. We conclude by identifying open challenges and future directions, including integrating multiple sources of variation, developing disagreement-aware interpretability frameworks, and grappling with the practical tradeoffs of perspectivist modeling.
[LINK]http://arxiv.org/abs/2601.09065v1
[DATE]2026-01-14 09:26:29+08:00
[CATEGORIES]cs.CL
Investigating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries
[AUTHORS]Gabrielle Kaili-May Liu, Bryan Li, Arman Cohan, William Gantt Walden, Eugene Yang
[COMMENTS]ECIR 2026
[LINK]http://arxiv.org/abs/2510.11956v3
[DATE]2026-01-14 09:20:51+08:00
[CATEGORIES]cs.CL
Efficient Multilingual Dialogue Processing via Translation Pipelines and Distilled Language Models
[AUTHORS]Santiago Martínez Novoa, Nicolás Rozo Fajardo, Diego Alejandro González Vargas, Nicolás Bedoya Figueroa
[ABSTRACT]This paper presents team Kl33n3x’s multilingual dialogue summarization and question answering system developed for the NLPAI4Health 2025 shared task. The approach employs a three-stage pipeline: forward translation from Indic languages to English, multitask text generation using a 2.55B parameter distilled language model, and reverse translation back to source languages. By leveraging knowledge distillation techniques, this work demonstrates that compact models can achieve highly competitive performance across nine languages. The system achieved strong win rates across the competition’s tasks, with particularly robust performance on Marathi (86.7% QnA), Tamil (86.7% QnA), and Hindi (80.0% QnA), demonstrating the effectiveness of translation-based approaches for low-resource language processing without task-specific fine-tuning.
[LINK]http://arxiv.org/abs/2601.09059v1
[DATE]2026-01-14 09:02:06+08:00
[CATEGORIES]cs.CL
StegoStylo: Squelching Stylometric Scrutiny through Steganographic Stitching
[AUTHORS]Robert Dilworth
[ABSTRACT]Stylometry–the identification of an author through analysis of a text’s style (i.e., authorship attribution)–serves many constructive purposes: it supports copyright and plagiarism investigations, aids detection of harmful content, offers exploratory cues for certain medical conditions (e.g., early signs of dementia or depression), provides historical context for literary works, and helps uncover misinformation and disinformation. In contrast, when stylometry is employed as a tool for authorship verification–confirming whether a text truly originates from a claimed author–it can also be weaponized for malicious purposes. Techniques such as de-anonymization, re-identification, tracking, profiling, and downstream effects like censorship illustrate the privacy threats that stylometric analysis can enable. Building on these concerns, this paper further explores how adversarial stylometry combined with steganography can counteract stylometric analysis. We first present enhancements to our adversarial attack, $\textit{TraceTarnish}$, providing stronger evidence of its capacity to confound stylometric systems and reduce their attribution and verification accuracy. Next, we examine how steganographic embedding can be fine-tuned to mask an author’s stylistic fingerprint, quantifying the level of authorship obfuscation achievable as a function of the proportion of words altered with zero-width Unicode characters. Based on our findings, steganographic coverage of 33% or higher seemingly ensures authorship obfuscation. Finally, we reflect on the ways stylometry can be used to undermine privacy and argue for the necessity of defensive tools like $\textit{TraceTarnish}$.
[COMMENTS]16 pages, 6 figures, 1 table
[LINK]http://arxiv.org/abs/2601.09056v1
[DATE]2026-01-14 08:49:20+08:00
[CATEGORIES]cs.CL
SITA: Learning Speaker-Invariant and Tone-Aware Speech Representations for Low-Resource Tonal Languages
[AUTHORS]Tianyi Xu, Xuan Ouyang, Binwei Yao, Shoua Xiong, Sara Misurelli, Maichou Lor, Junjie Hu
[ABSTRACT]Tonal low-resource languages are widely spoken yet remain underserved by modern speech technology. A key challenge is learning representations that are robust to nuisance variation such as gender while remaining tone-aware for different lexical meanings. To address this, we propose SITA, a lightweight adaptation recipe that enforces Speaker-Invariance and Tone-Awareness for pretrained wav2vec-style encoders. SITA uses staged multi-objective training: (i) a cross-gender contrastive objective encourages lexical consistency across speakers, while a tone-repulsive loss prevents tone collapse by explicitly separating same-word different-tone realizations; and (ii) an auxiliary Connectionist Temporal Classification (CTC)-based ASR objective with distillation stabilizes recognition-relevant structure. We evaluate primarily on Hmong, a highly tonal and severely under-resourced language where off-the-shelf multilingual encoders fail to represent tone effectively. On a curated Hmong word corpus, SITA improves cross-gender lexical retrieval accuracy, while maintaining usable ASR accuracy relative to an ASR-adapted XLS-R teacher. We further observe similar gains when transferring the same recipe to Mandarin, suggesting SITA is a general, plug-in approach for adapting multilingual speech encoders to tonal languages.
[COMMENTS]8 pages (excluding references, limitations, ethics, acknowledgement, and appendix); 4 figures in the main paper; appendix included
[LINK]http://arxiv.org/abs/2601.09050v1
[DATE]2026-01-14 08:42:27+08:00
[CATEGORIES]cs.CL
Is Grokking Worthwhile? Functional Analysis and Transferability of Generalization Circuits in Transformers
[AUTHORS]Kaiyu He, Zhang Mian, Peilin Wu, Xinya Du, Zhiyu Chen
[ABSTRACT]While Large Language Models (LLMs) excel at factual retrieval, they often struggle with the “curse of two-hop reasoning” in compositional tasks. Recent research suggests that parameter-sharing transformers can bridge this gap by forming a “Generalization Circuit” during a prolonged “grokking” phase. A fundamental question arises: Is a grokked model superior to its non-grokked counterparts on downstream tasks? Furthermore, is the extensive computational cost of waiting for the grokking phase worthwhile? In this work, we conduct a mechanistic study to evaluate the Generalization Circuit’s role in knowledge assimilation and transfer. We demonstrate that: (i) The inference paths established by non-grokked and grokked models for in-distribution compositional queries are identical. This suggests that the “Generalization Circuit” does not represent the sudden acquisition of a new reasoning paradigm. Instead, we argue that grokking is the process of integrating memorized atomic facts into an naturally established reasoning path. (ii) Achieving high accuracy on unseen cases after prolonged training and the formation of a certain reasoning path are not bound; they can occur independently under specific data regimes. (iii) Even a mature circuit exhibits limited transferability when integrating new knowledge, suggesting that “grokked” Transformers do not achieve a full mastery of compositional logic.
[LINK]http://arxiv.org/abs/2601.09049v1
[DATE]2026-01-14 08:40:35+08:00
[CATEGORIES]cs.CL
SpectraQuery: A Hybrid Retrieval-Augmented Conversational Assistant for Battery Science
[AUTHORS]Sreya Vangara, Jagjit Nanda, Yan-Kai Tzeng, Eric Darve
[ABSTRACT]Scientific reasoning increasingly requires linking structured experimental data with the unstructured literature that explains it, yet most large language model (LLM) assistants cannot reason jointly across these modalities. We introduce SpectraQuery, a hybrid natural-language query framework that integrates a relational Raman spectroscopy database with a vector-indexed scientific literature corpus using a Structured and Unstructured Query Language (SUQL)-inspired design. By combining semantic parsing with retrieval-augmented generation, SpectraQuery translates open-ended questions into coordinated SQL and literature retrieval operations, producing cited answers that unify numerical evidence with mechanistic explanation. Across SQL correctness, answer groundedness, retrieval effectiveness, and expert evaluation, SpectraQuery demonstrates strong performance: approximately 80 percent of generated SQL queries are fully correct, synthesized answers reach 93-97 percent groundedness with 10-15 retrieved passages, and battery scientists rate responses highly across accuracy, relevance, grounding, and clarity (4.1-4.6/5). These results show that hybrid retrieval architectures can meaningfully support scientific workflows by bridging data and discourse for high-volume experimental datasets.
[COMMENTS]11 pages, 8 figures, appendix included
[LINK]http://arxiv.org/abs/2601.09036v1
[DATE]2026-01-14 08:01:46+08:00
[CATEGORIES]cs.CL
OpenDecoder: Open Large Language Model Decoding to Incorporate Document Quality in RAG
[AUTHORS]Fengran Mo, Zhan Su, Yuchen Hui, Jinghan Zhang, Jia Ao Sun, Zheyuan Liu, Chao Zhang, Tetsuya Sakai, Jian-Yun Nie
[ABSTRACT]The development of large language models (LLMs) has achieved superior performance in a range of downstream tasks, including LLM-based retrieval-augmented generation (RAG). The quality of generated content heavily relies on the usefulness of the retrieved information and the capacity of LLMs’ internal information processing mechanism to incorporate it in answer generation. It is generally assumed that the retrieved information is relevant to the question. However, the retrieved information may have a variable degree of relevance and usefulness, depending on the question and the document collection. It is important to take into account the relevance of the retrieved information in answer generation. In this paper, we propose OpenDecoder, a new approach that leverages explicit evaluation of the retrieved information as quality indicator features for generation. We aim to build a RAG model that is more robust to varying levels of noisy context. Three types of explicit evaluation information are considered: relevance score, ranking score, and QPP (query performance prediction) score. The experimental results on five benchmark datasets demonstrate the effectiveness and better robustness of OpenDecoder by outperforming various baseline methods. Importantly, this paradigm is flexible to be integrated with the post-training of LLMs for any purposes and incorporated with any type of external indicators.
[COMMENTS]Accepted by ACM WWW 2026
[LINK]http://arxiv.org/abs/2601.09028v1
[DATE]2026-01-14 07:26:30+08:00
[CATEGORIES]cs.CL
A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models
[AUTHORS]Aviv Brokman, Xuguang Ai, Yuhang Jiang, Shashank Gupta, Ramakanth Kavuluru
[ABSTRACT]Extracting relations from scientific literature is a fundamental task in biomedical NLP because entities and relations among them drive hypothesis generation and knowledge discovery. As literature grows rapidly, relation extraction (RE) is indispensable to curate knowledge graphs to be used as computable structured and symbolic representations. With the rise of LLMs, it is pertinent to examine if it is better to skip tailoring supervised RE methods, save annotation burden, and just use zero shot RE (ZSRE) via LLM API calls. In this paper, we propose a benchmark with seven biomedical RE datasets with interesting characteristics and evaluate three Open AI models (GPT-4, o1, and GPT-OSS-120B) for end-to-end ZSRE. We show that LLM-based ZSRE is inching closer to supervised methods in performances on some datasets but still struggles on complex inputs expressing multiple relations with different predicates. Our error analysis reveals scope for improvements.
[COMMENTS]Accepted to appear in proceedings of the 3rd Workshop on Artificial Intelligence for Scientific Publications (WASP@AACL 2025). Code and data are available here: https://github.com/bionlproc/ZeroShotRE
[LINK]http://arxiv.org/abs/2504.04083v3
[DATE]2026-01-14 07:22:16+08:00
[CATEGORIES]cs.CL
LingVarBench: Benchmarking LLMs on Entity Recognitions and Linguistic Verbalization Patterns in Phone-Call Transcripts
[AUTHORS]Seyedali Mohammadi, Manas Paldhe, Amit Chhabra, Youngseo Son, Vishal Seshagiri
[ABSTRACT]We study structured entity extraction from phone-call transcripts in customer-support and healthcare settings, where annotation is costly, and data access is limited by privacy and consent. Existing methods degrade under disfluencies, interruptions, and speaker overlap, yet large real-call corpora are rarely shareable. We introduce LingVarBench, a benchmark and semantic synthetic data generation pipeline that generates linguistically varied training data via (1) LLM-sampled entity values, (2) curated linguistic verbalization patterns covering diverse disfluencies and entity-specific readout styles, and (3) a value-transcript consistency filter. Using this dataset, DSPy’s SIMBA automatically synthesizes and optimizes extraction prompts, reducing manual prompt engineering and targeting robustness to verbal variation. On real customer transcripts, prompts optimized solely on LingVarBench outperform zero-shot baselines and match or closely approach human-tuned prompts for structured entities such as ZIP code, date of birth, and name (F1 approximately 94-95 percent). For subjective questionnaire items, optimized prompts substantially improve over zero-shot performance and approach human-tuned prompts. LingVarBench offers a practical and cost-efficient path to deployment in a direct-answer setting, with real annotations later enabling additional refinement.
[COMMENTS]Accepted to EACL 2026 (Industry Track); to appear in the proceedings
[LINK]http://arxiv.org/abs/2508.15801v2
[DATE]2026-01-14 06:53:41+08:00
[CATEGORIES]cs.CL cs.LG
Quiet Feature Learning in Algorithmic Tasks
[AUTHORS]Prudhviraj Naidu, Zixian Wang, Leon Bergen, Ramamohan Paturi
[ABSTRACT]We train Transformer-based language models on ten foundational algorithmic tasks and observe pronounced phase transitions in their loss curves that deviate from established power-law scaling trends. Over large ranges of compute, the validation loss barely improves, then abruptly decreases. Probing the models’ internal representations reveals that quiet features are learned prior to any decrease in task loss. These quiet features represent intermediate algorithmic computations that do not by themselves improve the output loss. Ablation experiments demonstrate that individual quiet features are causally necessary for task performance. Our results demonstrate that substantial representational progress can remain hidden beneath an apparently flat loss curve, challenging the prevailing use of cross-entropy as a proxy for learning and motivating richer diagnostics for monitoring model training.
[COMMENTS]Accepted as Oral presentation @ AAAI 2026 Special Track on AI Alignment
[LINK]http://arxiv.org/abs/2505.03997v3
[DATE]2026-01-14 06:46:57+08:00
[CATEGORIES]cs.LG cs.CL
Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain
[AUTHORS]Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, Unso Eun Seo Jo
[ABSTRACT]Text-to-SQL benchmarks have traditionally only tested simple data access as a translation task of natural language to SQL queries. But in reality, users tend to ask diverse questions that require more complex responses including data-driven predictions or recommendations. Using the business domain as a motivating example, we introduce CORGI, a new benchmark that expands text-to-SQL to reflect practical database queries encountered by end users. CORGI is composed of synthetic databases inspired by enterprises such as DoorDash, Airbnb, and Lululemon. It provides questions across four increasingly complicated categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance degrades on higher-level questions as question complexity increases. CORGI also introduces and encourages the text-to-SQL community to consider new automatic methods for evaluating open-ended, qualitative responses in data access tasks. Our experiments show that LLMs exhibit an average 33.12% lower success execution rate (SER) on CORGI compared to existing benchmarks such as BIRD, highlighting the substantially higher complexity of real-world business needs. We release the CORGI dataset, an evaluation framework, and a submission website to support future research.
[COMMENTS]23 pages, under review for ACL ARR
[LINK]http://arxiv.org/abs/2510.07309v4
[DATE]2026-01-14 06:44:40+08:00
[CATEGORIES]cs.CL
Multicultural Spyfall: Assessing LLMs through Dynamic Multilingual Social Deduction Game
[AUTHORS]Haryo Akbarianto Wibowo, Alaa Elsetohy, Qinrong Cui, Alham Fikri Aji
[ABSTRACT]The rapid advancement of Large Language Models (LLMs) has necessitated more robust evaluation methods that go beyond static benchmarks, which are increasingly prone to data saturation and leakage. In this paper, we propose a dynamic benchmarking framework for evaluating multilingual and multicultural capabilities through the social deduction game Spyfall. In our setup, models must engage in strategic dialogue to either identify a secret agent or avoid detection, utilizing culturally relevant locations or local foods. Our results show that our game-based rankings align closely with the Chatbot Arena. However, we find a significant performance gap in non-English contexts: models are generally less proficient when handling locally specific entities and often struggle with rule-following or strategic integrity in non-English languages. We demonstrate that this game-based approach provides a scalable, leakage-resistant, and culturally nuanced alternative to traditional NLP benchmarks. The game history can be accessed here https://huggingface.co/datasets/haryoaw/cultural-spyfall.
[LINK]http://arxiv.org/abs/2601.09017v1
[DATE]2026-01-14 06:38:40+08:00
[CATEGORIES]cs.CL
Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM
[AUTHORS]Pedro Memoli Buffa, Luciano Del Corro
[ABSTRACT]Deploying LLMs raises two coupled challenges: (1) monitoring - estimating where a model underperforms as traffic and domains drift - and (2) improvement - prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all “10 choose k” combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
[LINK]http://arxiv.org/abs/2601.09001v1
[DATE]2026-01-14 05:54:38+08:00
[CATEGORIES]cs.CL
Collaborative Causal Sensemaking: Closing the Complementarity Gap in Human-AI Decision Support
[AUTHORS]Raunak Jain, Mudita Khurana
[ABSTRACT]LLM-based agents are increasingly deployed for expert decision support, yet human-AI teams in high-stakes settings do not yet reliably outperform the best individual. We argue this complementarity gap reflects a fundamental mismatch: current agents are trained as answer engines, not as partners in the collaborative sensemaking through which experts actually make decisions. Sensemaking (the ability to co-construct causal explanations, surface uncertainties, and adapt goals) is the key capability that current training pipelines do not explicitly develop or evaluate. We propose Collaborative Causal Sensemaking (CCS) as a research agenda to develop this capability from the ground up, spanning new training environments that reward collaborative thinking, representations for shared human-AI mental models, and evaluation centred on trust and complementarity. Taken together, these directions shift MAS research from building oracle-like answer engines to cultivating AI teammates that co-reason with their human partners over the causal structure of shared decisions, advancing the design of effective human-AI teams.
[LINK]http://arxiv.org/abs/2512.07801v4
[DATE]2026-01-14 04:22:43+08:00
[CATEGORIES]cs.CL cs.LG
Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
[AUTHORS]Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li
[ABSTRACT]Recent advances in world models have shown promise for modeling future dynamics of environmental states, enabling agents to reason and act without accessing real environments. Current methods mainly perform single-step or fixed-horizon rollouts, leaving their potential for complex task planning under-exploited. We propose Imagine-then-Plan (\texttt{ITP}), a unified framework for agent learning via lookahead imagination, where an agent’s policy model interacts with the learned world model, yielding multi-step “imagined” trajectories. Since the imagination horizon may vary by tasks and stages, we introduce a novel adaptive lookahead mechanism by trading off the ultimate goal and task progress. The resulting imagined trajectories provide rich signals about future consequences, such as achieved progress and potential conflicts, which are fused with current observations, formulating a partially \textit{observable} and \textit{imaginable} Markov decision process to guide policy learning. We instantiate \texttt{ITP} with both training-free and reinforcement-trained variants. Extensive experiments across representative agent benchmarks demonstrate that \texttt{ITP} significantly outperforms competitive baselines. Further analyses validate that our adaptive lookahead largely enhances agents’ reasoning capability, providing valuable insights into addressing broader, complex tasks.
[LINK]http://arxiv.org/abs/2601.08955v1
[DATE]2026-01-14 03:49:58+08:00
[CATEGORIES]cs.CL cs.LG
Synthetic Data for Veterinary EHR De-identification: Benefits, Limits, and Safety Trade-offs Under Fixed Compute
[AUTHORS]David Brundage
[ABSTRACT]Veterinary electronic health records (vEHRs) contain privacy-sensitive identifiers that limit secondary use. While PetEVAL provides a benchmark for veterinary de-identification, the domain remains low-resource. This study evaluates whether large language model (LLM)-generated synthetic narratives improve de-identification safety under distinct training regimes, emphasizing (i) synthetic augmentation and (ii) fixed-budget substitution. We conducted a controlled simulation using a PetEVAL-derived corpus (3,750 holdout/1,249 train). We generated 10,382 synthetic notes using a privacy-preserving “template-only” regime where identifiers were removed prior to LLM prompting. Three transformer backbones (PetBERT, VetBERT, Bio_ClinicalBERT) were trained under varying mixtures. Evaluation prioritized document-level leakage rate (the fraction of documents with at least one missed identifier) as the primary safety outcome. Results show that under fixed-sample substitution, replacing real notes with synthetic ones monotonically increased leakage, indicating synthetic data cannot safely replace real supervision. Under compute-matched training, moderate synthetic mixing matched real-only performance, but high synthetic dominance degraded utility. Conversely, epoch-scaled augmentation improved performance: PetBERT span-overlap F1 increased from 0.831 to 0.850 +/- 0.014, and leakage decreased from 6.32% to 4.02% +/- 0.19%. However, these gains largely reflect increased training exposure rather than intrinsic synthetic data quality. Corpus diagnostics revealed systematic synthetic-real mismatches in note length and label distribution that align with persistent leakage. We conclude that synthetic augmentation is effective for expanding exposure but is complementary, not substitutive, for safety-critical veterinary de-identification.
[LINK]http://arxiv.org/abs/2601.09756v1
[DATE]2026-01-14 03:35:25+08:00
[CATEGORIES]cs.CL
Swap distance minimization beyond entropy minimization in word order variation
[AUTHORS]Víctor Franco-Sánchez, Arnau Martí-Llobet, Ramon Ferrer-i-Cancho
[ABSTRACT]Consider a linguistic structure formed by $n$ elements, for instance, subject, direct object and verb ($n=3$) or subject, direct object, indirect object and verb ($n=4$). We investigate whether the frequency of the $n!$ possible orders is constrained by two principles. First, entropy minimization, a principle that has been suggested to shape natural communication systems at distinct levels of organization. Second, swap distance minimization, namely a preference for word orders that require fewer swaps of adjacent elements to be produced from a source order. We present average swap distance, a novel score for research on swap distance minimization. We find strong evidence of pressure for entropy minimization and swap distance minimization with respect to a die rolling experiment in distinct linguistic structures with $n=3$ or $n=4$. Evidence with respect to a Polya urn process is strong for $n=4$ but weaker for $n=3$. We still find evidence consistent with the action of swap distance minimization when word order frequencies are shuffled, indicating that swap distance minimization effects are beyond pressure to reduce word order entropy.
[COMMENTS]minor correction, mainly typos; in press in the Journal of Quantitative Linguistics
[LINK]http://arxiv.org/abs/2404.14192v6
[DATE]2026-01-14 03:28:00+08:00
[CATEGORIES]cs.CL
Navigating Ideation Space: Decomposed Conceptual Representations for Positioning Scientific Ideas
[AUTHORS]Yuexi Shen, Minqian Liu, Dawei Zhou, Lifu Huang
[ABSTRACT]Scientific discovery is a cumulative process and requires new ideas to be situated within an ever-expanding landscape of existing knowledge. An emerging and critical challenge is how to identify conceptually relevant prior work from rapidly growing literature, and assess how a new idea differentiates from existing research. Current embedding approaches typically conflate distinct conceptual aspects into single representations and cannot support fine-grained literature retrieval; meanwhile, LLM-based evaluators are subject to sycophancy biases, failing to provide discriminative novelty assessment. To tackle these challenges, we introduce the Ideation Space, a structured representation that decomposes scientific knowledge into three distinct dimensions, i.e., research problem, methodology, and core findings, each learned through contrastive training. This framework enables principled measurement of conceptual distance between ideas, and modeling of ideation transitions that capture the logical connections within a proposed idea. Building upon this representation, we propose a Hierarchical Sub-Space Retrieval framework for efficient, targeted literature retrieval, and a Decomposed Novelty Assessment algorithm that identifies which aspects of an idea are novel. Extensive experiments demonstrate substantial improvements, where our approach achieves Recall@30 of 0.329 (16.7% over baselines), our ideation transition retrieval reaches Hit Rate@30 of 0.643, and novelty assessment attains 0.37 correlation with expert judgments. In summary, our work provides a promising paradigm for future research on accelerating and evaluating scientific discovery.
[COMMENTS]21 pages, 6 tables
[LINK]http://arxiv.org/abs/2601.08901v1
[DATE]2026-01-14 02:56:11+08:00
[CATEGORIES]cs.CL cs.LG
Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
[AUTHORS]Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
[ABSTRACT]Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
[COMMENTS]21 pages. Code available at https://github.com/GMLR-Penn/Multiplex-Thinking
[LINK]http://arxiv.org/abs/2601.08808v1
[DATE]2026-01-14 02:48:00+08:00
[CATEGORIES]cs.CL cs.LG
A Vision for Multisensory Intelligence: Sensing, Science, and Synergy
[AUTHORS]Paul Pu Liang
[ABSTRACT]Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see https://mit-mi.github.io/.
[LINK]http://arxiv.org/abs/2601.04563v3
[DATE]2026-01-14 02:24:14+08:00
[CATEGORIES]cs.LG cs.CL
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
[AUTHORS]Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang
[ABSTRACT]Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from human or non-Long-CoT LLMs imitation. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2601.06002v2
[DATE]2026-01-14 02:21:01+08:00
[CATEGORIES]cs.CL
Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling
[AUTHORS]Yang Cai, Weiqiang Zheng
[ABSTRACT]Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k\ge 1$ candidate responses and a user selects their preferred one. We introduce $(k,f(k))$-robust alignment, which requires the $k$-output model to have win rate $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k)\to 1$ as $k\to\infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k)=\frac{k}{k+1}$, and no method can achieve a faster rate in general.
We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k=1$, sampling from the resulting (often deterministic) policy cannot guarantee win rates above $\tfrac{1}{2}$ except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k,\frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
[LINK]http://arxiv.org/abs/2601.08777v1
[DATE]2026-01-14 02:08:06+08:00
[CATEGORIES]cs.LG cs.CL
Spatial Context Improves the Integration of Text with Remote Sensing for Mapping Environmental Variables
[AUTHORS]Valerie Zermatten, Chiara Vanalli, Gencer Sumbul, Diego Marcos, Devis Tuia
[ABSTRACT]Recent developments in natural language processing highlight text as an emerging data source for ecology. Textual resources carry unique information that can be used in complementarity with geospatial data sources, thus providing insights at the local scale into environmental conditions and properties hidden from more traditional data sources. Leveraging textual information in a spatial context presents several challenges. First, the contribution of textual data remains poorly defined in an ecological context, and it is unclear for which tasks it should be incorporated. Unlike ubiquitous satellite imagery or environmental covariates, the availability of textual data is sparse and irregular; its integration with geospatial data is not straightforward. In response to these challenges, this work proposes an attention-based approach that combines aerial imagery and geolocated text within a spatial neighbourhood, i.e. integrating contributions from several nearby observations. Our approach combines vision and text representations with a geolocation encoding, with an attention-based module that dynamically selects spatial neighbours that are useful for predictive tasks.The proposed approach is applied to the EcoWikiRS dataset, which combines high-resolution aerial imagery with sentences extracted from Wikipedia describing local environmental conditions across Switzerland. Our model is evaluated on the task of predicting 103 environmental variables from the SWECO25 data cube. Our approach consistently outperforms single-location or unimodal, i.e. image-only or text-only, baselines. When analysing variables by thematic groups, results show a significant improvement in performance for climatic, edaphic, population and land use/land cover variables, underscoring the benefit of including the spatial context when combining text and image data.
[COMMENTS]submitted
[LINK]http://arxiv.org/abs/2601.08750v1
[DATE]2026-01-14 01:27:16+08:00
[CATEGORIES]cs.CL
TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL
[AUTHORS]Jinbo Su, Yuxuan Hu, Cuiping Li, Hong Chen, Jia Li, Lintao Ma, Jing Zhang
[ABSTRACT]In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. While user queries typically focus on recurrent table sets-offering an opportunity for KV cache sharing across queries-current inference engines, such as SGLang and vLLM, generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is the computation of table caches while preserving primary foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation loading pipeline for parallelizing model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
[LINK]http://arxiv.org/abs/2601.08743v1
[DATE]2026-01-14 01:20:55+08:00
[CATEGORIES]cs.CL
Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents
[AUTHORS]Xin Quan, Jiafeng Xiong, Marco Valentino, André Freitas
[ABSTRACT]Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent’s capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments.
[LINK]http://arxiv.org/abs/2601.08742v1
[DATE]2026-01-14 01:18:38+08:00
[CATEGORIES]cs.CL
From Rows to Reasoning: A Retrieval-Augmented Multimodal Framework for Spreadsheet Understanding
[AUTHORS]Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul
[ABSTRACT]Large Language Models (LLMs) struggle to reason over large-scale enterprise spreadsheets containing thousands of numeric rows, multiple linked sheets, and embedded visual content such as charts and receipts. Prior state-of-the-art spreadsheet reasoning approaches typically rely on single-sheet compression or full-context encoding, which limits scalability and fails to reflect how real users interact with complex, multimodal workbooks. We introduce FRTR-Bench, the first large-scale benchmark for multimodal spreadsheet reasoning, comprising 30 enterprise-grade Excel workbooks spanning nearly four million cells and more than 50 embedded images. To address these challenges, we present From Rows to Reasoning (FRTR), an advanced, multimodal retrieval-augmented generation framework that decomposes Excel workbooks into granular row, column, and block embeddings, employs hybrid lexical-dense retrieval with Reciprocal Rank Fusion (RRF), and integrates multimodal embeddings to reason over both numerical and visual information. We tested FRTR on six LLMs, achieving 74% answer accuracy on FRTR-Bench with Claude Sonnet 4.5, a substantial improvement over prior state-of-the-art approaches that reached only 24%. On the SpreadsheetLLM benchmark, FRTR achieved 87% accuracy with GPT-5 while reducing token usage by roughly 50% compared to context-compression methods.
[LINK]http://arxiv.org/abs/2601.08741v1
[DATE]2026-01-14 01:18:14+08:00
[CATEGORIES]cs.CL
PIE: Performance Interval Estimation for Free-Form Generation Tasks
[AUTHORS]Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Matthew Lease, Omar Alonso
[ABSTRACT]Confidence estimation infers a probability for whether each model output is correct or not. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for any arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression is seen to achieve both: i) lower error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.
[COMMENTS]10 pages
[LINK]http://arxiv.org/abs/2509.07309v2
[DATE]2026-01-14 01:15:04+08:00
[CATEGORIES]cs.CL cs.LG
PrivGemo: Privacy-Preserving Dual-Tower Graph Retrieval for Empowering LLM Reasoning with Memory Augmentation
[AUTHORS]Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
[ABSTRACT]Knowledge graphs (KGs) provide structured evidence that can ground large language model (LLM) reasoning for knowledge-intensive question answering. However, many practical KGs are private, and sending retrieved triples or exploration traces to closed-source LLM APIs introduces leakage risk. Existing privacy treatments focus on masking entity names, but they still face four limitations: structural leakage under semantic masking, uncontrollable remote interaction, fragile multi-hop and multi-entity reasoning, and limited experience reuse for stability and efficiency. To address these issues, we propose PrivGemo, a privacy-preserving retrieval-augmented framework for KG-grounded reasoning with memory-guided exposure control. PrivGemo uses a dual-tower design to keep raw KG knowledge local while enabling remote reasoning over an anonymized view that goes beyond name masking to limit both semantic and structural exposure. PrivGemo supports multi-hop, multi-entity reasoning by retrieving anonymized long-hop paths that connect all topic entities, while keeping grounding and verification on the local KG. A hierarchical controller and a privacy-aware experience memory further reduce unnecessary exploration and remote interactions. Comprehensive experiments on six benchmarks show that PrivGemo achieves overall state-of-the-art results, outperforming the strongest baseline by up to 17.1%. Furthermore, PrivGemo enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
[LINK]http://arxiv.org/abs/2601.08739v1
[DATE]2026-01-14 01:14:23+08:00
[CATEGORIES]cs.CL
RAGShaper: Eliciting Sophisticated Agentic RAG Skills via Automated Data Synthesis
[AUTHORS]Zhengwei Tao, Bo Li, Jialong Wu, Guochen Yan, Huanyao Zhang, Jiahao Xu, Haitao Mi, Wentao Zhang
[ABSTRACT]Agentic Retrieval-Augmented Generation (RAG) empowers large language models to autonomously plan and retrieve information for complex problem-solving. However, the development of robust agents is hindered by the scarcity of high-quality training data that reflects the noise and complexity of real-world retrieval environments. Conventional manual annotation is unscalable and often fails to capture the dynamic reasoning strategies required to handle retrieval failures. To bridge this gap, we introduce RAGShaper, a novel data synthesis framework designed to automate the construction of RAG tasks and robust agent trajectories. RAGShaper incorporates an InfoCurator to build dense information trees enriched with adversarial distractors spanning Perception and Cognition levels. Furthermore, we propose a constrained navigation strategy that forces a teacher agent to confront these distractors, thereby eliciting trajectories that explicitly demonstrate error correction and noise rejection. Comprehensive experiments confirm that models trained on our synthesized corpus significantly outperform existing baselines, exhibiting superior robustness in noise-intensive and complex retrieval tasks.
[LINK]http://arxiv.org/abs/2601.08699v1
[DATE]2026-01-14 00:25:07+08:00
[CATEGORIES]cs.CL
Theoretical Foundations of Prompt Engineering: From Heuristics to Expressivity
[AUTHORS]Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
[ABSTRACT]Prompts can switch a model’s behavior even when the weights are fixed, yet this phenomenon is rarely treated as a clean theoretical object rather than a heuristic. We study the family of functions obtainable by holding a Transformer backbone fixed as an executor and varying only the prompt. Our core idea is to view the prompt as an externally injected program and to construct a simplified Transformer that interprets it to implement different computations. The construction exposes a mechanism-level decomposition: attention performs selective routing from prompt memory, the FFN performs local arithmetic conditioned on retrieved fragments, and depth-wise stacking composes these local updates into a multi-step computation. Under this viewpoint, we prove a constructive existential result showing that a single fixed backbone can approximate a broad class of target behaviors via prompts alone. The framework provides a unified starting point for formalizing trade-offs under prompt length/precision constraints and for studying structural limits of prompt-based switching, while remaining distinct from empirical claims about pretrained LLMs.
[COMMENTS]Revised for clarity and correctness; improved exposition and fixed minor issues
[LINK]http://arxiv.org/abs/2512.12688v2
[DATE]2026-01-14 00:04:19+08:00
[CATEGORIES]cs.LG cs.CL
Exploiting Task Relationships in Continual Learning via Transferability-Aware Task Embeddings
[AUTHORS]Yanru Wu, Jianning Wang, Xiangyu Chen, Enming Zhang, Yang Tan, Hanbing Liu, Yang Li
[ABSTRACT]Continual learning (CL) has been a critical topic in contemporary deep neural network applications, where higher levels of both forward and backward transfer are desirable for an effective CL performance. Existing CL strategies primarily focus on task models, either by regularizing model updates or by separating task-specific and shared components, while often overlooking the potential of leveraging inter-task relationships to enhance transfer. To address this gap, we propose a transferability-aware task embedding, termed H-embedding, and construct a hypernet framework under its guidance to learn task-conditioned model weights for CL tasks. Specifically, H-embedding is derived from an information theoretic measure of transferability and is designed to be online and easy to compute. Our method is also characterized by notable practicality, requiring only the storage of a low-dimensional task embedding per task and supporting efficient end-to-end training. Extensive evaluations on benchmarks including CIFAR-100, ImageNet-R, and DomainNet show that our framework performs prominently compared to various baseline and SOTA approaches, demonstrating strong potential in capturing and utilizing intrinsic task relationships. Our code is publicly available at https://github.com/viki760/Hembedding_Guided_Hypernet.
[COMMENTS]28 pages, 5 figures, accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2502.11609v4
[DATE]2026-01-14 23:58:22+08:00
[CATEGORIES]cs.LG
Energy-Entropy Regularization: The True Power of Minimal Looped Transformers
[AUTHORS]Wai-Lun Lam
[ABSTRACT]Recent research suggests that looped Transformers have superior reasoning capabilities compared to standard deep architectures. Current approaches to training single-head looped architectures on benchmark tasks frequently fail or yield suboptimal performance due to a highly non-convex and irregular loss landscape. In these settings, optimization often stagnates in poor local minima and saddle points of the loss landscape, preventing the model from discovering the global minimum point. The internal mechanisms of these single-head looped transformer models remain poorly understood, and training them from scratch remains a significant challenge. In this paper, we propose a novel training framework that leverages Tsallis entropy and Hamiltonian dynamics to transform the geometry of the loss landscape. By treating the parameter updates as a physical flow, we successfully trained a single-head looped Transformer with model dimension $d = 8$ to solve induction head task with input sequence length of 1000 tokens. This success reveals the internal mechanism behind the superior reasoning capability.
[COMMENTS]19 pages, 2 figures
[LINK]http://arxiv.org/abs/2601.09588v1
[DATE]2026-01-14 23:56:35+08:00
[CATEGORIES]cs.LG
Constraint- and Score-Based Nonlinear Granger Causality Discovery with Kernels
[AUTHORS]Fiona Murphy, Alessio Benavoli
[ABSTRACT]Kernel-based methods are used in the context of Granger Causality to enable the identification of nonlinear causal relationships between time series variables. In this paper, we show that two state of the art kernel-based Granger Causality (GC) approaches can be theoretically unified under the framework of Kernel Principal Component Regression (KPCR), and introduce a method based on this unification, demonstrating that this approach can improve causal identification. Additionally, we introduce a Gaussian Process score-based model with Smooth Information Criterion penalisation on the marginal likelihood, and demonstrate improved performance over existing state of the art time-series nonlinear causal discovery methods. Furthermore, we propose a contemporaneous causal identification algorithm, fully based on GC, using the proposed score-based $GP_{SIC}$ method, and compare its performance to a state of the art contemporaneous time series causal discovery algorithm.
[LINK]http://arxiv.org/abs/2601.09579v1
[DATE]2026-01-14 23:48:53+08:00
[CATEGORIES]cs.LG
Do Sparse Autoencoders Identify Reasoning Features in Language Models?
[AUTHORS]George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi
[ABSTRACT]We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). We first show through a simple theoretical analysis that $\ell_1$-regularized SAEs are intrinsically biased toward low-dimensional patterns, providing a mechanistic explanation for why shallow linguistic cues may be preferentially captured over distributed reasoning behaviors. Motivated by this bias, we introduce a falsification-oriented evaluation framework that combines causal token injection and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that features identified by contrastive methods are highly sensitive to token-level interventions, with 45% to 90% activating when a small number of associated tokens are injected into non-reasoning text. For the remaining features, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields no improvements in benchmark performance. Overall, our results suggest that SAE features identified by current contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves. Code is available at https://github.com/GeorgeMLP/reasoning-probing.
[LINK]http://arxiv.org/abs/2601.05679v2
[DATE]2026-01-14 23:46:18+08:00
[CATEGORIES]cs.LG
Differentially private federated learning for localized control of infectious disease dynamics
[AUTHORS]Raouf Kerkouche, Henrik Zunker, Mario Fritz, Martin J. Kühn
[ABSTRACT]In times of epidemics, swift reaction is necessary to mitigate epidemic spreading. For this reaction, localized approaches have several advantages, limiting necessary resources and reducing the impact of interventions on a larger scale. However, training a separate machine learning (ML) model on a local scale is often not feasible due to limited available data. Centralizing the data is also challenging because of its high sensitivity and privacy constraints. In this study, we consider a localized strategy based on the German counties and communities managed by the related local health authorities (LHA). For the preservation of privacy to not oppose the availability of detailed situational data, we propose a privacy-preserving forecasting method that can assist public health experts and decision makers. ML methods with federated learning (FL) train a shared model without centralizing raw data. Considering the counties, communities or LHAs as clients and finding a balance between utility and privacy, we study a FL framework with client-level differential privacy (DP). We train a shared multilayer perceptron on sliding windows of recent case counts to forecast the number of cases, while clients exchange only norm-clipped updates and the server aggregated updates with DP noise. We evaluate the approach on COVID-19 data on county-level during two phases. As expected, very strict privacy yields unstable, unusable forecasts. At a moderately strong level, the DP model closely approaches the non-DP model: R2 around 0.94 (vs. 0.95) and mean absolute percentage error (MAPE) of 26 % in November 2020; R2 around 0.88 (vs. 0.93) and MAPE of 21 % in March 2022. Overall, client-level DP-FL can deliver useful county-level predictions with strong privacy guarantees, and viable privacy budgets depend on epidemic phase, allowing privacy-compliant collaboration among health authorities for local forecasting.
[COMMENTS]26 pages, 9 figures
[LINK]http://arxiv.org/abs/2509.14024v2
[DATE]2026-01-14 23:45:09+08:00
[CATEGORIES]cs.LG
Differentially Private Bilevel Optimization
[AUTHORS]Guy Kornowski
[ABSTRACT]We present differentially private (DP) algorithms for bilevel optimization, a problem class that received significant attention lately in various machine learning applications. These are the first algorithms for such problems under standard DP constraints, and are also the first to avoid Hessian computations which are prohibitive in large-scale settings. Under the well-studied setting in which the upper-level is not necessarily convex and the lower-level problem is strongly-convex, our proposed gradient-based $(ε,δ)$-DP algorithm returns a point with hypergradient norm at most $\widetilde{\mathcal{O}}\left((\sqrt{d_\mathrm{up}}/εn)^{1/2}+(\sqrt{d_\mathrm{low}}/εn)^{1/3}\right)$ where $n$ is the dataset size, and $d_\mathrm{up}/d_\mathrm{low}$ are the upper/lower level dimensions. Our analysis covers constrained and unconstrained problems alike, accounts for mini-batch gradients, and applies to both empirical and population losses. As an application, we specialize our analysis to derive a simple private rule for tuning a regularization hyperparameter.
[COMMENTS]Accepted to ALT 2026; some fixes following reviews
[LINK]http://arxiv.org/abs/2409.19800v3
[DATE]2026-01-14 23:44:18+08:00
[CATEGORIES]cs.LG
Enhancing Large Language Models for Time-Series Forecasting via Vector-Injected In-Context Learning
[AUTHORS]Jianqi Zhang, Jingyao Wang, Wenwen Qiang, Fanjiang Xu, Changwen Zheng
[ABSTRACT]The World Wide Web needs reliable predictive capabilities to respond to changes in user behavior and usage patterns. Time series forecasting (TSF) is a key means to achieve this goal. In recent years, the large language models (LLMs) for TSF (LLM4TSF) have achieved good performance. However, there is a significant difference between pretraining corpora and time series data, making it hard to guarantee forecasting quality when directly applying LLMs to TSF; fine-tuning LLMs can mitigate this issue, but often incurs substantial computational overhead. Thus, LLM4TSF faces a dual challenge of prediction performance and compute overhead. To address this, we aim to explore a method for improving the forecasting performance of LLM4TSF while freezing all LLM parameters to reduce computational overhead. Inspired by in-context learning (ICL), we propose LVICL. LVICL uses our vector-injected ICL to inject example information into a frozen LLM, eliciting its in-context learning ability and thereby enhancing its performance on the example-related task (i.e., TSF). Specifically, we first use the LLM together with a learnable context vector adapter to extract a context vector from multiple examples adaptively. This vector contains compressed, example-related information. Subsequently, during the forward pass, we inject this vector into every layer of the LLM to improve forecasting performance. Compared with conventional ICL that adds examples into the prompt, our vector-injected ICL does not increase prompt length; moreover, adaptively deriving a context vector from examples suppresses components harmful to forecasting, thereby improving model performance. Extensive experiments demonstrate the effectiveness of our approach.
[LINK]http://arxiv.org/abs/2601.07903v2
[DATE]2026-01-14 23:32:42+08:00
[CATEGORIES]cs.LG
Content Accuracy and Quality Aware Resource Allocation Based on LP-Guided DRL for ISAC-Driven AIGC Networks
[AUTHORS]Ningzhe Shi, Yiqing Zhou, Ling Liu, Jinglin Shi, Yihao Wu, Haiwei Shi, Hanxiao Yu
[ABSTRACT]Integrated sensing and communication (ISAC) can enhance artificial intelligence-generated content (AIGC) networks by providing efficient sensing and transmission. Existing AIGC services usually assume that the accuracy of the generated content can be ensured, given accurate input data and prompt, thus only the content generation quality (CGQ) is concerned. However, it is not applicable in ISAC-based AIGC networks, where content generation is based on inaccurate sensed data. Moreover, the AIGC model itself introduces generation errors, which depend on the number of generating steps (i.e., computing resources). To assess the quality of experience of ISAC-based AIGC services, we propose a content accuracy and quality aware service assessment metric (CAQA). Since allocating more resources to sensing and generating improves content accuracy but may reduce communication quality, and vice versa, this sensing-generating (computing)-communication three-dimensional resource tradeoff must be optimized to maximize the average CAQA (AvgCAQA) across all users with AIGC (CAQA-AIGC). This problem is NP-hard, with a large solution space that grows exponentially with the number of users. To solve the CAQA-AIGC problem with low complexity, a linear programming (LP) guided deep reinforcement learning (DRL) algorithm with an action filter (LPDRL-F) is proposed. Through the LP-guided approach and the action filter, LPDRL-F can transform the original three-dimensional solution space to two dimensions, reducing complexity while improving the learning performance of DRL. Simulations show that compared to existing DRL and generative diffusion model (GDM) algorithms without LP, LPDRL-F converges faster and finds better resource allocation solutions, improving AvgCAQA by more than 10%. With LPDRL-F, CAQA-AIGC can achieve an improvement in AvgCAQA of more than 50% compared to existing schemes focusing solely on CGQ.
[COMMENTS]Accepted by IEEE TMC
[LINK]http://arxiv.org/abs/2508.12079v2
[DATE]2026-01-14 23:28:17+08:00
[CATEGORIES]cs.LG
Beyond Uniform SVD:Dual-Level Optimization across Columns and Modules for LLM Compression
[AUTHORS]Lin Xv, Xian Gao, Ting Li, Yuzhuo Fu
[ABSTRACT]Low-rank decomposition, particularly Singular Value Decomposition (SVD), is a pivotal technique for mitigating the storage and computational demands of Large Language Models (LLMs). However, prevalent SVD-based approaches overlook the critical phenomenon that decomposition errors exhibit significant disparity across different components of the parameter matrix, often leading to suboptimal approximation. Furthermore, existing methods lack a direct metric to evaluate the importance of individual weight matrices. To address these limitations, we propose Duo-SVD (Dual-level Optimization SVD), a novel training-free framework that synergizes optimization at both the column and the module levels. First, Duo-SVD incorporates a Column-Preserving Strategy that explicitly retains columns exhibiting high decomposition errors, while applying low-rank approximation solely to those with lower errors. Second, at the module level, we employ a Module-Adaptive Allocation Strategy that formulates ratio allocation as a global constrained optimization problem based on perturbation-induced model deviation. Extensive experiments demonstrate that Duo-SVD consistently outperforms state-of-the-art SVD-based baselines and structured pruning methods, establishing it as a superior paradigm for efficient LLM compression.
[LINK]http://arxiv.org/abs/2510.19385v2
[DATE]2026-01-14 23:28:04+08:00
[CATEGORIES]cs.LG
Graph Neural Network Surrogates to leverage Mechanistic Expert Knowledge towards Reliable and Immediate Pandemic Response
[AUTHORS]Agatha Schmidt, Henrik Zunker, Alexander Heinlein, Martin J. Kühn
[ABSTRACT]During the COVID-19 crisis, mechanistic models have guided evidence-based decision making. However, time-critical decisions in a dynamical environment limit the time available to gather supporting evidence. We address this bottleneck by developing a graph neural network (GNN) surrogate of an age-structured and spatially resolved mechanistic metapopulation simulation model. This combined approach complements classical modeling approaches which are mostly mechanistic and purely data-driven machine learning approaches which are often black box. Our design of experiments spans outbreak and persistent-threat regimes, up to three contact change points, and age-structured contact matrices on a spatial graph with 400 nodes representing German counties. We benchmark multiple GNN layers and identify an ARMAConv-based architecture that offers a strong accuracy-runtime trade-off. Across horizons of 30-90 day simulation and prediction, allowing up to three contact change points, the surrogate model attains 10-27 \% mean absolute percentage error (MAPE) while delivering (near) constant runtime with respect to the forecast horizon. Our approach accelerates evaluation by up to 28,670 times compared with the mechanistic model, allowing responsive decision support in time-critical scenarios and straightforward web integration. These results show how GNN surrogates can translate complex metapopulation models into immediate, reliable tools for pandemic response.
[COMMENTS]20 pages, 9 figures
[LINK]http://arxiv.org/abs/2411.06500v4
[DATE]2026-01-14 23:26:51+08:00
[CATEGORIES]cs.LG
Enabling Global, Human-Centered Explanations for LLMs:From Tokens to Interpretable Code and Test Generation
[AUTHORS]Dipin Khati, Daniel Rodriguez-Cardenas, David N. Palacio, Alejandro Velasco, Denys Poshyvanyk
[ABSTRACT]As Large Language Models for Code (LM4Code) become integral to software engineering, establishing trust in their output becomes critical. However, standard accuracy metrics obscure the underlying reasoning of generative models, offering little insight into how decisions are made. Although post-hoc interpretability methods attempt to fill this gap, they often restrict explanations to local, token-level insights, which fail to provide a developer-understandable global analysis. Our work highlights the urgent need for \textbf{global, code-based} explanations that reveal how models reason across code. To support this vision, we introduce \textit{code rationales} (CodeQ), a framework that enables global interpretability by mapping token-level rationales to high-level programming categories. Aggregating thousands of these token-level explanations allows us to perform statistical analyses that expose systemic reasoning behaviors. We validate this aggregation by showing it distills a clear signal from noisy token data, reducing explanation uncertainty (Shannon entropy) by over 50%. Additionally, we find that a code generation model (\textit{codeparrot-small}) consistently favors shallow syntactic cues (e.g., \textbf{indentation}) over deeper semantic logic. Furthermore, in a user study with 37 participants, we find its reasoning is significantly misaligned with that of human developers. These findings, hidden from traditional metrics, demonstrate the importance of global interpretability techniques to foster trust in LM4Code.
[COMMENTS]Accepted to ICSE 2026
[LINK]http://arxiv.org/abs/2503.16771v2
[DATE]2026-01-14 23:10:06+08:00
[CATEGORIES]cs.LG
Unsupervised full-field Bayesian inference of orthotropic hyperelasticity from a single biaxial test: a myocardial case study
[AUTHORS]Rogier P. Krijnen, Akshay Joshi, Siddhant Kumar, Mathias Peirlinck
[ABSTRACT]Cardiac muscle tissue exhibits highly non-linear hyperelastic and orthotropic material behavior during passive deformation. Traditional constitutive identification protocols therefore combine multiple loading modes and typically require multiple specimens and substantial handling. In soft living tissues, such protocols are challenged by inter- and intra-sample variability and by manipulation-induced alterations of mechanical response, which can bias inverse calibration. In this work we exploit spatially heterogeneous full-field kinematics as an information-rich alternative to multimodal testing. We adapt EUCLID, an unsupervised method for the automated discovery of constitutive models, towards Bayesian parameter inference for highly nonlinear, orthotropic constitutive models. Using synthetic myocardial tissue slabs, we demonstrate that a single heterogeneous biaxial experiment, combined with sparse reaction-force measurements, enables robust recovery of Holzapfel-Ogden parameters with quantified uncertainty, across multiple noise levels. The inferred responses agree closely with ground-truth simulations and yield credible intervals that reflect the impact of measurement noise on orthotropic material model inference. Our work supports single-shot, uncertainty-aware characterization of nonlinear orthotropic material models from a single biaxial test, reducing sample demand and experimental manipulation.
[LINK]http://arxiv.org/abs/2510.09498v2
[DATE]2026-01-14 23:05:54+08:00
[CATEGORIES]cs.LG
Hamiltonian Property Testing
[AUTHORS]Andreas Bluhm, Matthias C. Caro, Aadil Oufkir
[ABSTRACT]Locality is a fundamental feature of many physical time evolutions. Assumptions on locality and related structural properties also underlie recently proposed procedures for learning an unknown Hamiltonian from access to the induced time evolution. However, no protocols to rigorously test whether an unknown Hamiltonian is local were known. We investigate Hamiltonian locality testing as a property testing problem, where the task is to determine whether an unknown $n$-qubit Hamiltonian $H$ is $k$-local or $\varepsilon$-far from all $k$-local Hamiltonians, given access to the time evolution along $H$. First, we emphasize the importance of the chosen distance measure: With respect to the operator norm, a worst-case distance measure, incoherent quantum locality testers require $\tildeΩ(2^n)$ many time evolution queries and an expected total evolution time of $\tildeΩ(2^n / \varepsilon)$, and even coherent testers need $Ω(2^{n/2})$ many queries and $Ω(2^{n/2}/\varepsilon)$ total evolution time. In contrast, when distances are measured according to the normalized Frobenius norm, corresponding to an average-case distance, we give a sample-, time-, and computationally efficient incoherent Hamiltonian locality testing algorithm based on randomized measurements. In fact, our procedure can be used to simultaneously test a wide class of Hamiltonian properties beyond locality. Finally, we prove that learning a general Hamiltonian remains exponentially hard with this average-case distance, thereby establishing an exponential separation between Hamiltonian testing and learning. Our work initiates the study of property testing for quantum Hamiltonians, demonstrating that a broad class of Hamiltonian properties is efficiently testable even with limited quantum capabilities, and positioning Hamiltonian testing as an independent area of research alongside Hamiltonian learning.
[COMMENTS]70 pages, 3 figures; minor improvements. Version accepted for publication in Quantum
[LINK]http://arxiv.org/abs/2403.02968v4
[DATE]2026-01-14 23:05:17+08:00
[CATEGORIES]cs.LG
Improved Segmentation of Polyps and Visual Explainability Analysis
[AUTHORS]Akwasi Asare, Thanh-Huy Nguyen, Ulas Bagci
[ABSTRACT]Colorectal cancer (CRC) remains one of the leading causes of cancer-related morbidity and mortality worldwide, with gastrointestinal (GI) polyps serving as critical precursors according to the World Health Organization (WHO). Early and accurate segmentation of polyps during colonoscopy is essential for reducing CRC progression, yet manual delineation is labor-intensive and prone to observer variability. Deep learning methods have demonstrated strong potential for automated polyp analysis, but their limited interpretability remains a barrier to clinical adoption. In this study, we present PolypSeg-GradCAM, an explainable deep learning framework that integrates a U-Net architecture with a pre-trained ResNet-34 backbone and Gradient-weighted Class Activation Mapping (Grad-CAM) for transparent polyp segmentation. To ensure rigorous benchmarking, the model was trained and evaluated using 5-Fold Cross-Validation on the Kvasir-SEG dataset of 1,000 annotated endoscopic images. Experimental results show a mean Dice coefficient of 0.8902 +/- 0.0125, a mean Intersection-over-Union (IoU) of 0.8023, and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.9722. Advanced quantitative analysis using an optimal threshold yielded a Sensitivity of 0.9058 and Precision of 0.9083. Additionally, Grad-CAM visualizations confirmed that predictions were guided by clinically relevant regions, offering insight into the model’s decision-making process. This study demonstrates that integrating segmentation accuracy with interpretability can support the development of trustworthy AI-assisted colonoscopy tools.
[LINK]http://arxiv.org/abs/2509.18159v6
[DATE]2026-01-14 23:04:35+08:00
[CATEGORIES]cs.LG
Differential Privacy for Transformer Embeddings of Text with Nonparametric Variational Information Bottleneck
[AUTHORS]Dina El Zein, James Henderson
[ABSTRACT]We propose a privacy-preserving method for sharing text data by sharing noisy versions of their transformer embeddings. It has been shown that hidden representations learned by deep models can encode sensitive information from the input, making it possible for adversaries to recover the input data with considerable accuracy. This problem is exacerbated in transformer embeddings because they consist of multiple vectors, one per token. To mitigate this risk, we propose Nonparametric Variational Differential Privacy (NVDP), which ensures both useful data sharing and strong privacy protection. We take a differential privacy (DP) approach, integrating a nonparametric variational information bottleneck (NVIB) layer into the transformer architecture to inject noise into its multivector embeddings and thereby hide information, and measuring privacy protection with Rényi Divergence (RD) and its corresponding Bayesian Differential Privacy (BDP) guarantee. Training the NVIB layer calibrates the noise level according to the utility of the downstream task. We test NVDP on the General Language Understanding Evaluation (GLUE) benchmark and show that varying the noise level gives us a useful trade-off between privacy and accuracy. With lower noise levels, our model maintains high accuracy while offering strong privacy guarantees, effectively balancing privacy and utility.
[COMMENTS]11 pages, 2 figures
[LINK]http://arxiv.org/abs/2601.02307v2
[DATE]2026-01-14 23:01:54+08:00
[CATEGORIES]cs.LG
Residual Power Flow for Neural Solvers
[AUTHORS]Jochen Stiasny, Jochen Cremer
[ABSTRACT]The energy transition challenges operational tasks based on simulations and optimisation. These computations need to be fast and flexible as the grid is ever-expanding, and renewables’ uncertainty requires a flexible operational environment. Learned approximations, proxies or surrogates – we refer to them as Neural Solvers – excel in terms of evaluation speed, but are inflexible with respect to adjusting to changing tasks. Hence, neural solvers are usually applicable to highly specific tasks, which limits their usefulness in practice; a widely reusable, foundational neural solver is required. Therefore, this work proposes the Residual Power Flow (RPF) formulation. RPF formulates residual functions based on Kirchhoff’s laws to quantify the infeasibility of an operating condition. The minimisation of the residuals determines the voltage solution; an additional slack variable is needed to achieve AC-feasibility. RPF forms a natural, foundational subtask of tasks subject to power flow constraints. We propose to learn RPF with neural solvers to exploit their speed. Furthermore, RPF improves learning performance compared to common power flow formulations. To solve operational tasks, we integrate the neural solver in a Predict-then-Optimise (PO) approach to combine speed and flexibility. The case study investigates the IEEE 9-bus system and three tasks (AC Optimal Power Flow (OPF), power-flow and quasi-steady state power flow) solved by PO. The results demonstrate the accuracy and flexibility of learning with RPF.
[COMMENTS]This work has been submitted to the IEEE for possible publication
[LINK]http://arxiv.org/abs/2601.09533v1
[DATE]2026-01-14 22:56:25+08:00
[CATEGORIES]cs.LG
Input Convex Kolmogorov Arnold Networks
[AUTHORS]Thomas Deschatre, Xavier Warin
[ABSTRACT]This article presents an input convex neural network architecture using Kolmogorov-Arnold networks (ICKAN). Two specific networks are presented: the first is based on a low-order, linear-by-part, representation of functions, and a universal approximation theorem is provided. The second is based on cubic splines, for which only numerical results support convergence. We demonstrate on simple tests that these networks perform competitively with classical input convex neural networks (ICNNs). In a second part, we use the networks to solve some optimal transport problems needing a convex approximation of functions and demonstrate their effectiveness. Comparisons with ICNNs show that cubic ICKANs produce results similar to those of classical ICNNs.
[LINK]http://arxiv.org/abs/2505.21208v2
[DATE]2026-01-14 22:52:00+08:00
[CATEGORIES]cs.LG
AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration
[AUTHORS]Aditri Paul, Archan Paul
[ABSTRACT]Successful autonomous planetary exploration hinges on real-time, high-fidelity environmental perception. However, standard deep learning models usually demand far more memory and computation power than space-qualified, radiation-hardened onboard hardware can provide. This creates a fundamental design challenge of deploying sophisticated detection architectures without saturating the rigid power and memory envelopes of the computation hardware of planetary exploration platforms. We propose the Adaptive Quantized Planetary Crater Detection System to resolve this bottleneck. Our framework integrates a Quantized Neural Network, refined through Quantization Aware Training, with an Adaptive Multi-Sensor Fusion module. By forcing weights into low-precision integer arithmetic, we effectively strip away the floating-point overhead that typically bottlenecks onboard processors and system memory. This yields a leaner model footprint and significantly faster processing while the detection fidelity remains high. Such efficiency enables AMF module to merge high-bandwidth Optical Imagery streams with Digital Elevation Models using an Adaptive Weighting Mechanism to re-balance sensor priority under variable conditions like deep shadows or high albedo. Integrated Multi-Scale Detection Heads then resolve craters across a wide range of diameters, providing a computationally efficient and precise solution for real-time detection, localization of craters and hazard avoidance. This paper establishes the architectural design and theoretical justification of the system. While our methodology is grounded in principles of hybrid computer vision and planetary science, we present this as a blueprint for future empirical validation and hardware benchmarking on integer-arithmetic units. This system provides a capability vital for the next generation of autonomous landing, navigation, and deep space explorations.
[COMMENTS]17 pages, 6 figures. A research paper on a novel deep learning framework for planetary crater detection
[LINK]http://arxiv.org/abs/2508.18025v2
[DATE]2026-01-14 22:49:07+08:00
[CATEGORIES]cs.LG
Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs
[AUTHORS]Jonathan Knoop, Hendrik Holtmann
[ABSTRACT]SMEs increasingly seek alternatives to cloud LLM APIs, which raise data privacy concerns. Dedicated cloud GPU instances offer improved privacy but with limited guarantees and ongoing costs, while professional on-premise hardware (A100, H100) remains prohibitively expensive. We present a systematic evaluation of NVIDIA’s Blackwell consumer GPUs (RTX 5060 Ti, 5070 Ti, 5090) for production LLM inference, benchmarking four open-weight models (Qwen3-8B, Gemma3-12B, Gemma3-27B, GPT-OSS-20B) across 79 configurations spanning quantization formats (BF16, W4A16, NVFP4, MXFP4), context lengths (8k-64k), and three workloads: RAG, multi-LoRA agentic serving, and high-concurrency APIs. The RTX 5090 delivers 3.5-4.6x higher throughput than the 5060 Ti with 21x lower latency for RAG, but budget GPUs achieve the highest throughput-per-dollar for API workloads with sub-second latency. NVFP4 quantization provides 1.6x throughput over BF16 with 41% energy reduction and only 2-4% quality loss. Self-hosted inference costs $0.001-0.04 per million tokens (electricity only), which is 40-200x cheaper than budget-tier cloud APIs, with hardware breaking even in under four months at moderate volume (30M tokens/day). Our results show that consumer GPUs can reliably replace cloud inference for most SME workloads, except latency-critical long-context RAG, where high-end GPUs remain essential. We provide deployment guidance and release all benchmark data for reproducible SME-scale deployments.
[COMMENTS]15 pages, 18 tables, 7 figures. Includes link to GitHub repository and Docker image for reproducibility
[LINK]http://arxiv.org/abs/2601.09527v1
[DATE]2026-01-14 22:49:07+08:00
[CATEGORIES]cs.LG
Kernel Limit for a Class of Recurrent Neural Networks Trained on Ergodic Data Sequences
[AUTHORS]Samuel Chun-Hei Lam, Justin Sirignano, Konstantinos Spiliopoulos
[ABSTRACT]Mathematical methods are developed to characterize the asymptotics of recurrent neural networks (RNN) as the number of hidden units, data samples in the sequence, hidden state updates, and training steps simultaneously grow to infinity. In the case of an RNN with a simplified weight matrix, we prove the convergence of the RNN to the solution of an infinite-dimensional ODE coupled with the fixed point of a random algebraic equation. The analysis requires addressing several challenges which are unique to RNNs. In typical mean-field applications (e.g., feedforward neural networks), discrete updates are of magnitude $\mathcal{O}(1/N)$ and the number of updates is $\mathcal{O}(N)$. Therefore, the system can be represented as an Euler approximation of an appropriate ODE/PDE, which it will converge to as $N \rightarrow \infty$. However, the RNN hidden layer updates are $\mathcal{O}(1)$. Therefore, RNNs cannot be represented as a discretization of an ODE/PDE and standard mean-field techniques cannot be applied. Instead, we develop a fixed point analysis for the evolution of the RNN memory states, with convergence estimates in terms of the number of update steps and the number of hidden units. The RNN hidden layer is studied as a function in a Sobolev space, whose evolution is governed by the data sequence (a Markov chain), the parameter updates, and its dependence on the RNN hidden layer at the previous time step. Due to the strong correlation between updates, a Poisson equation must be used to bound the fluctuations of the RNN around its limit equation. These mathematical methods give rise to the neural tangent kernel (NTK) limits for RNNs trained on data sequences as the number of data samples and size of the neural network grow to infinity.
[COMMENTS]Revision in response to reviewers’ comments. The mean-field random function has been replaced by a mean-field term. Some typos fixed. Minor title change
[LINK]http://arxiv.org/abs/2308.14555v4
[DATE]2026-01-14 22:47:00+08:00
[CATEGORIES]cs.LG
Fair Foundation Models for Medical Image Analysis: Challenges and Perspectives
[AUTHORS]Dilermando Queiroz, Anderson Carlos, André Anjos, Lilian Berton
[ABSTRACT]Ensuring equitable Artificial Intelligence (AI) in healthcare demands systems that make unbiased decisions across all demographic groups, bridging technical innovation with ethical principles. Foundation Models (FMs), trained on vast datasets through self-supervised learning, enable efficient adaptation across medical imaging tasks while reducing dependency on labeled data. These models demonstrate potential for enhancing fairness, though significant challenges remain in achieving consistent performance across demographic groups. Our review indicates that effective bias mitigation in FMs requires systematic interventions throughout all stages of development. While previous approaches focused primarily on model-level bias mitigation, our analysis reveals that fairness in FMs requires integrated interventions throughout the development pipeline, from data documentation to deployment protocols. This comprehensive framework advances current knowledge by demonstrating how systematic bias mitigation, combined with policy engagement, can effectively address both technical and institutional barriers to equitable AI in healthcare. The development of equitable FMs represents a critical step toward democratizing advanced healthcare technologies, particularly for underserved populations and regions with limited medical infrastructure and computational resources.
[LINK]http://arxiv.org/abs/2502.16841v2
[DATE]2026-01-14 22:42:54+08:00
[CATEGORIES]cs.LG
Class Adaptive Conformal Training
[AUTHORS]Badr-Eddine Marani, Julio Silva-Rodriguez, Ismail Ben Ayed, Maria Vakalopoulou, Stergios Christodoulidis, Jose Dolz
[ABSTRACT]Deep neural networks have achieved remarkable success across a variety of tasks, yet they often suffer from unreliable probability estimates. As a result, they can be overconfident in their predictions. Conformal Prediction (CP) offers a principled framework for uncertainty quantification, yielding prediction sets with rigorous coverage guarantees. Existing conformal training methods optimize for overall set size, but shaping the prediction sets in a class-conditional manner is not straightforward and typically requires prior knowledge of the data distribution. In this work, we introduce Class Adaptive Conformal Training (CaCT), which formulates conformal training as an augmented Lagrangian optimization problem that adaptively learns to shape prediction sets class-conditionally without making any distributional assumptions. Experiments on multiple benchmark datasets, including standard and long-tailed image recognition as well as text classification, demonstrate that CaCT consistently outperforms prior conformal training methods, producing significantly smaller and more informative prediction sets while maintaining the desired coverage guarantees.
[LINK]http://arxiv.org/abs/2601.09522v1
[DATE]2026-01-14 22:41:23+08:00
[CATEGORIES]cs.LG
Variational Bayesian Inference for Tensor Robust Principal Component Analysis
[AUTHORS]Chao Wang, Huiwen Zheng, Raymond Chan, Youwei Wen
[ABSTRACT]Tensor Robust Principal Component Analysis (TRPCA) holds a crucial position in machine learning and computer vision. It aims to recover underlying low-rank structures and to characterize the sparse structures of noise. Current approaches often encounter difficulties in accurately capturing the low-rank properties of tensors and balancing the trade-off between low-rank and sparse components, especially in a mixed-noise scenario. To address these challenges, we introduce a Bayesian framework for TRPCA, which integrates a low-rank tensor nuclear norm prior and a generalized sparsity-inducing prior. By embedding the priors within the Bayesian framework, our method can automatically determine the optimal tensor nuclear norm and achieve a balance between the nuclear norm and sparse components. Furthermore, our method can be efficiently extended to the weighted tensor nuclear norm model. Experiments conducted on synthetic and real-world datasets demonstrate the effectiveness and superiority of our method compared to state-of-the-art approaches.
[LINK]http://arxiv.org/abs/2412.18717v2
[DATE]2026-01-14 22:28:04+08:00
[CATEGORIES]cs.LG
CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion
[AUTHORS]Ralf Römer, Yi Zhang, Angela P. Schoellig
[ABSTRACT]To teach robots complex manipulation tasks, it is now a common practice to fine-tune a pre-trained vision-language-action model (VLA) on task-specific data. However, since this recipe updates existing representations, it is unsuitable for long-term operation in the real world, where robots must continually adapt to new tasks and environments while retaining the knowledge they have already acquired. Existing continual learning methods for robotics commonly require storing previous data (exemplars), struggle with long task sequences, or rely on task identifiers for deployment. To address these limitations, we propose CLARE, a general, parameter-efficient framework for exemplar-free continual learning with VLAs. CLARE introduces lightweight modular adapters into selected feedforward layers and autonomously expands the model only where necessary when learning a new task, guided by layer-wise feature similarity. During deployment, an autoencoder-based routing mechanism dynamically activates the most relevant adapters without requiring task labels. Through extensive experiments on the LIBERO benchmark, we show that CLARE achieves high performance on new tasks without catastrophic forgetting of earlier tasks, significantly outperforming even exemplar-based methods. Code and data are available at https://tum-lsy.github.io/clare.
[COMMENTS]Project page: https://tum-lsy.github.io/clare. 9 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.09512v1
[DATE]2026-01-14 22:23:42+08:00
[CATEGORIES]cs.LG
What Can RL Bring to VLA Generalization? An Empirical Study
[AUTHORS]Jijia Liu, Feng Gao, Bingwen Wei, Xinlei Chen, Qingmin Liao, Yi Wu, Chao Yu, Yu Wang
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.19789v4
[DATE]2026-01-14 22:23:30+08:00
[CATEGORIES]cs.LG
Lens: A Knowledge-Guided Foundation Model for Network Traffic
[AUTHORS]Xiaochang Li, Chen Qian, Qineng Wang, Jiangtao Kong, Yuchen Wang, Ziyu Yao, Bo Ji, Long Cheng, Gang Zhou, Huajie Shao
[ABSTRACT]Network traffic refers to the amount of data being sent and received over the Internet or any system that connects computers. Analyzing network traffic is vital for security and management, yet remains challenging due to the heterogeneity of plain-text packet headers and encrypted payloads. To capture the latent semantics of traffic, recent studies have adopted Transformer-based pretraining techniques to learn network representations from massive traffic data. However, these methods pre-train on data-driven tasks but overlook network knowledge, such as masking partial digits of the indivisible network port numbers for prediction, thereby limiting semantic understanding. In addition, they struggle to extend classification to new classes during fine-tuning due to the distribution shift. Motivated by these limitations, we propose \Lens, a unified knowledge-guided foundation model for both network traffic classification and generation. In pretraining, we propose a Knowledge-Guided Mask Span Prediction method with textual context for learning knowledge-enriched representations. For extending to new classes in finetuning, we reframe the traffic classification as a closed-ended generation task and introduce context-aware finetuning to adapt to the distribution shift. Evaluation results across various benchmark datasets demonstrate that the proposed Lens~achieves superior performance on both classification and generation tasks. For traffic classification, Lens~outperforms competitive baselines substantially on 8 out of 12 tasks with an average accuracy of \textbf{96.33\%} and extends to novel classes with significantly better performance. For traffic generation, Lens~generates better high-fidelity network traffic for network simulation, gaining up to \textbf{30.46\%} and \textbf{33.3\%} better accuracy and F1 in fuzzing tests. We will open-source the code upon publication.
[LINK]http://arxiv.org/abs/2402.03646v5
[DATE]2026-01-14 22:09:25+08:00
[CATEGORIES]cs.LG
Towards Robust Cross-Dataset Object Detection Generalization under Domain Specificity
[AUTHORS]Ritabrata Chakraborty, Hrishit Mitra, Shivakumara Palaiahnakote, Umapada Pal
[ABSTRACT]Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train–test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at \href{[https://github.com/Ritabrata04/cdod-icpr.git}{https://github.com/Ritabrata04/cdod-icpr}.
[COMMENTS]15 pages, 4 figures, 6 tables
[LINK]http://arxiv.org/abs/2601.09497v1
[DATE]2026-01-14 22:03:11+08:00
[CATEGORIES]cs.LG
Parallelizable memory recurrent units
[AUTHORS]Florent De Geeter, Gaspard Lambrechts, Damien Ernst, Guillaume Drion
[ABSTRACT]With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the uprising of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of the RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
[COMMENTS]19 pages, 12 figures. This work has been the subject of a patent application (Number: EP26151077). This work has been submitted to the IEEE for possible publication
[LINK]http://arxiv.org/abs/2601.09495v1
[DATE]2026-01-14 22:01:11+08:00
[CATEGORIES]cs.LG
Deep Operator Networks for Surrogate Modeling of Cyclic Adsorption Processes with Varying Initial Conditions
[AUTHORS]Beatrice Ceccanti, Mattia Galanti, Ivo Roghair, Martin van Sint Annaland
[ABSTRACT]Deep Operator Networks are emerging as fundamental tools among various neural network types to learn mappings between function spaces, and have recently gained attention due to their ability to approximate nonlinear operators. In particular, DeepONets offer a natural formulation for PDE solving, since the solution of a partial differential equation can be interpreted as an operator mapping an initial condition to its corresponding solution field. In this work, we applied DeepONets in the context of process modeling for adsorption technologies, to assess their feasibility as surrogates for cyclic adsorption process simulation and optimization. The goal is to accelerate convergence of cyclic processes such as Temperature-Vacuum Swing Adsorption (TVSA), which require repeated solution of transient PDEs, which are computationally expensive. Since each step of a cyclic adsorption process starts from the final state of the preceding step, effective surrogate modeling requires generalization across a wide range of initial conditions. The governing equations exhibit steep traveling fronts, providing a demanding benchmark for operator learning. To evaluate functional generalization under these conditions, we construct a mixed training dataset composed of heterogeneous initial conditions and train DeepONets to approximate the corresponding solution operators. The trained models are then tested on initial conditions outside the parameter ranges used during training, as well as on completely unseen functional forms. The results demonstrate accurate predictions both within and beyond the training distribution, highlighting DeepONets as potential efficient surrogates for accelerating cyclic adsorption simulations and optimization workflows.
[COMMENTS]36 pages, 11 figures
[LINK]http://arxiv.org/abs/2601.09491v1
[DATE]2026-01-14 21:56:25+08:00
[CATEGORIES]cs.LG
A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering
[AUTHORS]Chenliang Zhang, Lin Wang, Yuanyuan Lu, Yusheng Qi, Kexin Wang, Peixu Hou, Wenshi Chen
[ABSTRACT]This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable for multi-modal multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution is based on the vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied curriculum learning strategies to guide reinforcement learning, resulting in improved answer accuracy and reduced hallucination. For Task 2 and Task 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of the integration of curriculum learning with reinforcement learning in our training pipeline.
[LINK]http://arxiv.org/abs/2508.10337v2
[DATE]2026-01-14 21:50:30+08:00
[CATEGORIES]cs.LG
UniConFlow: A Unified Constrained Flow-Matching Framework for Certified Motion Planning
[AUTHORS]Zewen Yang, Xiaobing Dai, Dian Yu, Zhijun Li, Majid Khadiv, Sandra Hirche, Sami Haddadin
[ABSTRACT]Generative models have become increasingly powerful tools for robot motion generation, enabling flexible and multimodal trajectory generation across various tasks. Yet, most existing approaches remain limited in handling multiple types of constraints, such as collision avoidance, actuation limits, and dynamic consistency, which are typically addressed individually or heuristically. In this work, we propose UniConFlow, a unified constrained flow matching-based framework for trajectory generation that systematically incorporates both equality and inequality constraints. Moreover, UniConFlow introduces a novel prescribed-time zeroing function that shapes a time-varying guidance field during inference, allowing the generation process to adapt to varying system models and task requirements. Furthermore, to further address the computational challenges of long-horizon and high-dimensional trajectory generation, we propose two practical strategies for the terminal constraint enforcement and inference process: a violation-segment extraction protocol that precisely localizes and refines only the constraint-violating portions of trajectories, and a trajectory compression method that accelerates optimization in a reduced-dimensional space while preserving high-fidelity reconstruction after decoding. Empirical validation across three experiments, including a double inverted pendulum, a real-to-sim car racing task, and a sim-to-real manipulation task, demonstrates that UniConFlow outperforms state-of-the-art generative planners and conventional optimization baselines, achieving superior performance on certified motion planning metrics such as safety, kinodynamic consistency, and action feasibility. Project page is available at: https://uniconflow.github.io.
[LINK]http://arxiv.org/abs/2506.02955v2
[DATE]2026-01-14 21:47:10+08:00
[CATEGORIES]cs.LG
Comprehensive language-image pre-training for 3D medical image understanding
[AUTHORS]Tassilo Wald, Ibrahim Ethem Hamamci, Yuan Gao, Sam Bond-Taylor, Harshita Sharma, Maximilian Ilse, Cynthia Lo, Olesya Melnichenko, Anton Schwaighofer, Noel C. F. Codella, Maria Teodora Wetscherek, Klaus H. Maier-Hein, Panagiotis Korfiatis, Valentina Salvatelli, Javier Alvarez-Valle, Fernando Pérez-García
[ABSTRACT]Vision-language pre-training, i.e., aligning images with paired text, is a powerful paradigm to create encoders that can be directly used for tasks such as classification, retrieval, and segmentation. In the 3D medical image domain, these capabilities allow vision-language encoders (VLEs) to support radiologists by retrieving patients with similar abnormalities, predicting likelihoods of abnormality, or, with downstream adaptation, generating radiological reports. While the methodology holds promise, data availability and domain-specific hurdles limit the capabilities of current 3D VLEs.
In this paper, we overcome these challenges by injecting additional supervision via a report generation objective and combining vision-language with vision-only pre-training. This allows us to leverage both image-only and paired image-text 3D datasets, increasing the total amount of data to which our model is exposed. Through these additional objectives, paired with best practices of the 3D medical imaging domain, we develop the Comprehensive Language-Image Pre-training (COLIPRI) encoder family. Our COLIPRI encoders achieve state-of-the-art performance in report generation, semantic segmentation, classification probing, and zero-shot classification. The model is available at https://huggingface.co/microsoft/colipri.
[LINK]http://arxiv.org/abs/2510.15042v2
[DATE]2026-01-14 21:42:41+08:00
[CATEGORIES]cs.LG
Fusion or Confusion? Multimodal Complexity Is Not All You Need
[AUTHORS]Tillmann Rheude, Roland Eils, Benjamin Wild
[ABSTRACT]Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions. We evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analyses show that complex methods perform on par with SimBaMM and often fail to consistently outperform well-tuned unimodal baselines, especially in small-data settings. To support our findings, we include a case study highlighting common methodological shortcomings in the literature followed by a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
[LINK]http://arxiv.org/abs/2512.22991v2
[DATE]2026-01-14 21:38:18+08:00
[CATEGORIES]cs.LG
Learning About Learning: A Physics Path from Spin Glasses to Artificial Intelligence
[AUTHORS]Denis D. Caprioti, Matheus Haas, Constantino F. Vasconcelos, Mauricio Girardi-Schappo
[ABSTRACT]The Hopfield model, originally inspired by spin-glass physics, occupies a central place at the intersection of statistical mechanics, neural networks, and modern artificial intelligence. Despite its conceptual simplicity and broad applicability – from associative memory to near-optimal solutions of combinatorial optimization problems – it is rarely integrated into standard undergraduate physics curricula. In this paper, we present the Hopfield model as a pedagogically rich framework that naturally unifies core topics from undergraduate statistical physics, dynamical systems, linear algebra, and computational methods. We provide a concise and illustrated theoretical introduction grounded in familiar physics concepts, analyze the model’s energy function, dynamics, and pattern stability, and discuss practical aspects of simulation, including a freely available simulation code. To support instruction, we conclude with classroom-ready example problems designed to mirror research practice. By explicitly connecting fundamental physics to contemporary AI applications, this work aims to help prepare physics students to understand, apply, and critically engage with the computational tools increasingly central to research, industry, and society.
[COMMENTS]18 pages, 11 figures
[LINK]http://arxiv.org/abs/2601.07635v2
[DATE]2026-01-14 21:36:51+08:00
[CATEGORIES]cs.LG
Terminally constrained flow-based generative models from an optimal control perspective
[AUTHORS]Weiguo Gao, Ming Li, Qianxiao Li
[ABSTRACT]We address the problem of sampling from terminally constrained distributions with pre-trained flow-based generative models through an optimal control formulation. Theoretically, we characterize the value function by a Hamilton-Jacobi-Bellman equation and derive the optimal feedback control as the minimizer of the associated Hamiltonian. We show that as the control penalty increases, the controlled process recovers the reference distribution, while as the penalty vanishes, the terminal law converges to a generalized Wasserstein projection onto the constraint manifold. Algorithmically, we introduce Terminal Optimal Control with Flow-based models (TOCFlow), a geometry-aware sampling-time guidance method for pre-trained flows. Solving the control problem in a terminal co-moving frame that tracks reference trajectories yields a closed-form scalar damping factor along the Riemannian gradient, capturing second-order curvature effects without matrix inversions. TOCFlow therefore matches the geometric consistency of Gauss-Newton updates at the computational cost of standard gradient guidance. We evaluate TOCFlow on three high-dimensional scientific tasks spanning equality, inequality, and global statistical constraints, namely Darcy flow, constrained trajectory planning, and turbulence snapshot generation with Kolmogorov spectral scaling. Across all settings, TOCFlow improves constraint satisfaction over Euclidean guidance and projection baselines while preserving the reference model’s generative quality.
[COMMENTS]59 pages, 9 figures
[LINK]http://arxiv.org/abs/2601.09474v1
[DATE]2026-01-14 21:32:15+08:00
[CATEGORIES]cs.LG
SimMerge: Learning to Select Merge Operators from Similarity Signals
[AUTHORS]Oliver Bolton, Aakanksha, Arash Ahmadian, Sara Hooker, Marzieh Fadaee, Beyza Ermis
[ABSTRACT]Model merging enables multiple large language models (LLMs) to be combined into a single model while preserving performance. This makes it a valuable tool in LLM development, offering a competitive alternative to multi-task training. However, merging can be difficult at scale, as successful merging requires choosing the right merge operator, selecting the right models, and merging them in the right order. This often leads researchers to run expensive merge-and-evaluate searches to select the best merge. In this work, we provide an alternative by introducing \simmerge{}, \emph{a predictive merge-selection method} that selects the best merge using inexpensive, task-agnostic similarity signals between models. From a small set of unlabeled probes, we compute functional and structural features and use them to predict the performance of a given 2-way merge. Using these predictions, \simmerge{} selects the best merge operator, the subset of models to merge, and the merge order, eliminating the expensive merge-and-evaluate loop. We demonstrate that we surpass standard merge-operator performance on 2-way merges of 7B-parameter LLMs, and that \simmerge{} generalizes to multi-way merges and 111B-parameter LLM merges without retraining. Additionally, we present a bandit variant that supports adding new tasks, models, and operators on the fly. Our results suggest that learning how to merge is a practical route to scalable model composition when checkpoint catalogs are large and evaluation budgets are tight.
[LINK]http://arxiv.org/abs/2601.09473v1
[DATE]2026-01-14 21:30:00+08:00
[CATEGORIES]cs.LG
FairGU: Fairness-aware Graph Unlearning in Social Network
[AUTHORS]Renqiang Luo, Yongshuai Yang, Huafei Huang, Qing Qing, Mingliang Hou, Ziqi Xu, Yi Yu, Jingjing Zhou, Feng Xia
[ABSTRACT]Graph unlearning has emerged as a critical mechanism for supporting sustainable and privacy-preserving social networks, enabling models to remove the influence of deleted nodes and thereby better safeguard user information. However, we observe that existing graph unlearning techniques insufficiently protect sensitive attributes, often leading to degraded algorithmic fairness compared with traditional graph learning methods. To address this gap, we introduce FairGU, a fairness-aware graph unlearning framework designed to preserve both utility and fairness during the unlearning process. FairGU integrates a dedicated fairness-aware module with effective data protection strategies, ensuring that sensitive attributes are neither inadvertently amplified nor structurally exposed when nodes are removed. Through extensive experiments on multiple real-world datasets, we demonstrate that FairGU consistently outperforms state-of-the-art graph unlearning methods and fairness-enhanced graph learning baselines in terms of both accuracy and fairness metrics. Our findings highlight a previously overlooked risk in current unlearning practices and establish FairGU as a robust and equitable solution for the next generation of socially sustainable networked systems. The codes are available at https://github.com/LuoRenqiang/FairGU.
[COMMENTS]9 pages, 2 figs, WWW 2026 accepted
[LINK]http://arxiv.org/abs/2601.09469v1
[DATE]2026-01-14 21:21:39+08:00
[CATEGORIES]cs.LG
Searth Transformer: A Transformer Architecture Incorporating Earth’s Geospheric Physical Priors for Global Mid-Range Weather Forecasting
[AUTHORS]Tianye Li, Qi Liu, Hao Li, Lei Chen, Wencong Cheng, Fei Zheng, Xiangao Xia, Ya Wang, Gang Huang, Weiwei Wang, Xuan Tong, Ziqing Zu, Yi Fang, Shenming Fu, Jiang Jiang, Haochen Li, Mingxing Li, Jiangjiang Xia
[ABSTRACT]Accurate global medium-range weather forecasting is fundamental to Earth system science. Most existing Transformer-based forecasting models adopt vision-centric architectures that neglect the Earth’s spherical geometry and zonal periodicity. In addition, conventional autoregressive training is computationally expensive and limits forecast horizons due to error accumulation. To address these challenges, we propose the Shifted Earth Transformer (Searth Transformer), a physics-informed architecture that incorporates zonal periodicity and meridional boundaries into window-based self-attention for physically consistent global information exchange. We further introduce a Relay Autoregressive (RAR) fine-tuning strategy that enables learning long-range atmospheric evolution under constrained memory and computational budgets. Based on these methods, we develop YanTian, a global medium-range weather forecasting model. YanTian achieves higher accuracy than the high-resolution forecast of the European Centre for Medium-Range Weather Forecasts and performs competitively with state-of-the-art AI models at one-degree resolution, while requiring roughly 200 times lower computational cost than standard autoregressive fine-tuning. Furthermore, YanTian attains a longer skillful forecast lead time for Z500 (10.3 days) than HRES (9 days). Beyond weather forecasting, this work establishes a robust algorithmic foundation for predictive modeling of complex global-scale geophysical circulation systems, offering new pathways for Earth system science.
[LINK]http://arxiv.org/abs/2601.09467v1
[DATE]2026-01-14 21:20:17+08:00
[CATEGORIES]cs.LG
Horizon Activation Mapping for Neural Networks in Time Series Forecasting
[AUTHORS]Hans Krupakar, V A Kandappan
[ABSTRACT]Neural networks for time series forecasting have relied on error metrics and architecture-specific interpretability approaches for model selection that don’t apply across models of different families. To interpret forecasting models agnostic to the types of layers across state-of-the-art model families, we introduce Horizon Activation Mapping (HAM), a visual interpretability technique inspired by grad-CAM that uses gradient norm averages to study the horizon’s subseries where grad-CAM studies attention maps over image data. We introduce causal and anti-causal modes to calculate gradient update norm averages across subseries at every timestep and lines of proportionality signifying uniform distributions of the norm averages. Optimization landscape studies with respect to changes in batch sizes, early stopping, train-val-test splits, architectural choices, univariate forecasting and dropouts are studied with respect to performances and subseries in HAM. Interestingly, batch size based differences in activities seem to indicate potential for existence of an exponential approximation across them per epoch relative to each other. Multivariate forecasting models including MLP-based CycleNet, N-Linear, N-HITS, self attention-based FEDformer, Pyraformer, SSM-based SpaceTime and diffusion-based Multi-Resolution DDPM over different horizon sizes trained over the ETTm2 dataset are used for HAM plots in this study. NHITS’ neural approximation theorem and SpaceTime’s exponential autoregressive activities have been attributed to trends in HAM plots over their training, validation and test sets. In general, HAM can be used for granular model selection, validation set choices and comparisons across different neural network model families.
[LINK]http://arxiv.org/abs/2601.02094v2
[DATE]2026-01-14 21:10:20+08:00
[CATEGORIES]cs.LG
Nonlinear reconciliation: Error reduction theorems
[AUTHORS]Lorenzo Nespoli, Anubhab Biswas, Roberto Rocchetta, Vasco Medici
[ABSTRACT]Forecast reconciliation, an ex-post technique applied to forecasts that must satisfy constraints, has been a prominent topic in the forecasting literature over the past two decades. Recently, several efforts have sought to extend reconciliation methods to the probabilistic settings. Nevertheless, formal theorems demonstrating error reduction in nonlinear constraints, analogous to those presented in Panagiotelis et al.(2021), are still lacking. This paper addresses that gap by establishing such theorems for various classes of nonlinear hypersurfaces and vector-valued functions. Specifically, we derive an exact analog of Theorem 3.1 from Panagiotelis et al.(2021) for hypersurfaces with constant-sign curvature. Additionally, we provide an error reduction theorem for the broader case of hypersurfaces with non-constant-sign curvature and for general manifolds with codimension > 1. To support reproducibility and practical adoption, we release a JAX-based Python package, JNLR, implementing the presented theorems and reconciliation procedures.
[LINK]http://arxiv.org/abs/2507.22500v2
[DATE]2026-01-14 21:09:42+08:00
[CATEGORIES]cs.LG
SoK: Enhancing Cryptographic Collaborative Learning with Differential Privacy
[AUTHORS]Francesco Capano, Jonas Böhler, Benjamin Weggenmann
[ABSTRACT]In collaborative learning (CL), multiple parties jointly train a machine learning model on their private datasets. However, data can not be shared directly due to privacy concerns. To ensure input confidentiality, cryptographic techniques, e.g., multi-party computation (MPC), enable training on encrypted data. Yet, even securely trained models are vulnerable to inference attacks aiming to extract memorized data from model outputs. To ensure output privacy and mitigate inference attacks, differential privacy (DP) injects calibrated noise during training. While cryptography and DP offer complementary guarantees, combining them efficiently for cryptographic and differentially private CL (CPCL) is challenging. Cryptography incurs performance overheads, while DP degrades accuracy, creating a privacy-accuracy-performance trade-off that needs careful design considerations. This work systematizes the CPCL landscape. We introduce a unified framework that generalizes common phases across CPCL paradigms, and identify secure noise sampling as the foundational phase to achieve CPCL. We analyze trade-offs of different secure noise sampling techniques, noise types, and DP mechanisms discussing their implementation challenges and evaluating their accuracy and cryptographic overhead across CPCL paradigms. Additionally, we implement identified secure noise sampling options in MPC and evaluate their computation and communication costs in WAN and LAN. Finally, we propose future research directions based on identified key observations, gaps and possible enhancements in the literature.
[COMMENTS]This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2026)
[LINK]http://arxiv.org/abs/2601.09460v1
[DATE]2026-01-14 21:09:29+08:00
[CATEGORIES]cs.LG
Bayesian Optimization with Preference Exploration using a Monotonic Neural Network Ensemble
[AUTHORS]Hanyang Wang, Juergen Branke, Matthias Poloczek
[ABSTRACT]Many real-world black-box optimization problems have multiple conflicting objectives. Rather than attempting to approximate the entire set of Pareto-optimal solutions, interactive preference learning allows to focus the search on the most relevant subset. However, few previous studies have exploited the fact that utility functions are usually monotonic. In this paper, we address the Bayesian Optimization with Preference Exploration (BOPE) problem and propose using a neural network ensemble as a utility surrogate model. This approach naturally integrates monotonicity and supports pairwise comparison data. Our experiments demonstrate that the proposed method outperforms state-of-the-art approaches and exhibits robustness to noise in utility evaluations. An ablation study highlights the critical role of monotonicity in enhancing performance.
[LINK]http://arxiv.org/abs/2501.18792v5
[DATE]2026-01-14 20:51:34+08:00
[CATEGORIES]cs.LG
DeepLight: A Sobolev-trained Image-to-Image Surrogate Model for Light Transport in Tissue
[AUTHORS]Philipp Haim, Vasilis Ntziachristos, Torsten Enßlin, Dominik Jüstel
[ABSTRACT]In optoacoustic imaging, recovering the absorption coefficients of tissue by inverting the light transport remains a challenging problem. Improvements in solving this problem can greatly benefit the clinical value of optoacoustic imaging. Existing variational inversion methods require an accurate and differentiable model of this light transport. As neural surrogate models allow fast and differentiable simulations of complex physical processes, they are considered promising candidates to be used in solving such inverse problems. However, there are in general no guarantees that the derivatives of these surrogate models accurately match those of the underlying physical operator. As accurate derivatives are central to solving inverse problems, errors in the model derivative can considerably hinder high fidelity reconstructions. To overcome this limitation, we present a surrogate model for light transport in tissue that uses Sobolev training to improve the accuracy of the model derivatives. Additionally, the form of Sobolev training we used is suitable for high-dimensional models in general. Our results demonstrate that Sobolev training for a light transport surrogate model not only improves derivative accuracy but also reduces generalization error for in-distribution and out-of-distribution samples. These improvements promise to considerably enhance the utility of the surrogate model in downstream tasks, especially in solving inverse problems.
[LINK]http://arxiv.org/abs/2601.09439v1
[DATE]2026-01-14 20:40:02+08:00
[CATEGORIES]cs.LG
Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?
[AUTHORS]David Reid, Ognjen Arandjelovic
[ABSTRACT]Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise in focusing on identification of semantic elements as they are commonly depicted on ancient coins, by using convolutional neural networks (CNNs). This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the task of identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). This article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coins analysis and provides an evaluation of their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
[LINK]http://arxiv.org/abs/2601.09433v1
[DATE]2026-01-14 20:30:49+08:00
[CATEGORIES]cs.LG
Draw it like Euclid: Teaching transformer models to generate CAD profiles using ruler and compass construction steps
[AUTHORS]Siyi Li, Joseph G. Lambourne, Longfei Zhang, Pradeep Kumar Jayaraman, Karl. D. D. Willis
[ABSTRACT]We introduce a new method of generating Computer Aided Design (CAD) profiles via a sequence of simple geometric constructions including curve offsetting, rotations and intersections. These sequences start with geometry provided by a designer and build up the points and curves of the final profile step by step. We demonstrate that adding construction steps between the designer’s input geometry and the final profile improves generation quality in a similar way to the introduction of a chain of thought in language models. Similar to the constraints in a parametric CAD model, the construction sequences reduce the degrees of freedom in the modeled shape to a small set of parameter values which can be adjusted by the designer, allowing parametric editing with the constructed geometry evaluated to floating point precision. In addition we show that applying reinforcement learning to the construction sequences gives further improvements over a wide range of metrics, including some which were not explicitly optimized.
[LINK]http://arxiv.org/abs/2601.09428v1
[DATE]2026-01-14 20:17:34+08:00
[CATEGORIES]cs.LG
Hydrogen production from blended waste biomass: pyrolysis, thermodynamic-kinetic analysis and AI-based modelling
[AUTHORS]Sana Kordoghli, Abdelhakim Settar, Oumayma Belaati, Mohammad Alkhatib, Khaled Chetehouna, Zakaria Mansouri
[ABSTRACT]This work contributes to advancing sustainable energy and waste management strategies by investigating the thermochemical conversion of food-based biomass through pyrolysis, highlighting the role of artificial intelligence (AI) in enhancing process modelling accuracy and optimization efficiency. The main objective is to explore the potential of underutilized biomass resources, such as spent coffee grounds (SCG) and date seeds (DS), for sustainable hydrogen production. Specifically, it aims to optimize the pyrolysis process while evaluating the performance of these resources both individually and as blends. Proximate, ultimate, fibre, TGA/DTG, kinetic, thermodynamic, and Py-Micro GC analyses were conducted for pure DS, SCG, and blends (75% DS - 25% SCG, 50% DS - 50% SCG, 25% DS - 75% SCG). Blend 3 offered superior hydrogen yield potential but had the highest activation energy (Ea: 313.24 kJ/mol), while Blend 1 exhibited the best activation energy value (Ea: 161.75 kJ/mol). The kinetic modelling based on isoconversional methods (KAS, FWO, Friedman) identified KAS as the most accurate. These approaches provide a detailed understanding of the pyrolysis process, with particular emphasis on the integration of artificial intelligence. An LSTM model trained with lignocellulosic data predicted TGA curves with exceptional accuracy (R^2: 0.9996-0.9998).
[COMMENTS]41 pages, 21 figures
[LINK]http://arxiv.org/abs/2510.15960v2
[DATE]2026-01-14 20:17:31+08:00
[CATEGORIES]cs.LG
Exploring the Secondary Risks of Large Language Models
[AUTHORS]Jiawei Chen, Zhengwei Fang, Xiao Yang, Chao Yu, Zhaoxia Yin, Hang Su
[ABSTRACT]Ensuring the safety and alignment of Large Language Models is a significant challenge with their growing integration into critical applications and societal functions. While prior research has primarily focused on jailbreak attacks, less attention has been given to non-adversarial failures that subtly emerge during benign interactions. We introduce secondary risks a novel class of failure modes marked by harmful or misleading behaviors during benign prompts. Unlike adversarial attacks, these risks stem from imperfect generalization and often evade standard safety mechanisms. To enable systematic evaluation, we introduce two risk primitives verbose response and speculative advice that capture the core failure patterns. Building on these definitions, we propose SecLens, a black-box, multi-objective search framework that efficiently elicits secondary risk behaviors by optimizing task relevance, risk activation, and linguistic plausibility. To support reproducible evaluation, we release SecRiskBench, a benchmark dataset of 650 prompts covering eight diverse real-world risk categories. Experimental results from extensive evaluations on 16 popular models demonstrate that secondary risks are widespread, transferable across models, and modality independent, emphasizing the urgent need for enhanced safety mechanisms to address benign yet harmful LLM behaviors in real-world deployments.
[COMMENTS]18 pages, 5 figures
[LINK]http://arxiv.org/abs/2506.12382v4
[DATE]2026-01-14 20:16:50+08:00
[CATEGORIES]cs.LG
SGAC: A Graph Neural Network Framework for Imbalanced and Structure-Aware AMP Classification
[AUTHORS]Yingxu Wang, Victor Liang, Nan Yin, Siwei Liu, Eran Segal
[ABSTRACT]Classifying Antimicrobial Peptides (AMPs) from the vast collection of peptides derived from metagenomic sequencing offers a promising avenue for combating antibiotic resistance. However, most existing AMP classification methods rely primarily on sequence-based representations and fail to capture the spatial structural information critical for accurate identification. Although recent graph-based approaches attempt to incorporate structural information, they typically construct residue- or atom-level graphs that introduce redundant atomic details and increase structural complexity. Furthermore, the class imbalance between the small number of known AMPs and the abundant non-AMPs significantly hinders predictive performance. To address these challenges, we employ lightweight OmegaFold to predict the three-dimensional structures of peptides and construct peptide graphs using C α atoms to capture their backbone geometry and spatial topology. Building on this representation, we propose the Spatial GNN-based AMP Classifier (SGAC), a novel framework that leverages Graph Neural Networks (GNNs) to extract structural features and generate discriminative graph representations. To handle class imbalance, SGAC incorporates Weight-enhanced Contrastive Learning to cluster structurally similar peptides and separate dissimilar ones through adaptive weighting, and applies Weight-enhanced Pseudo-label Distillation to generate high-confidence pseudo labels for unlabeled samples, achieving balanced and consistent representation learning. Experiments on publicly available AMP and non-AMP datasets demonstrate that SGAC significantly achieves state-of-the-art performance compared to baselines.
[LINK]http://arxiv.org/abs/2412.16276v2
[DATE]2026-01-14 20:01:15+08:00
[CATEGORIES]cs.LG
SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling
[AUTHORS]Yixian Zhang, Shu’ang Yu, Tonghe Zhang, Mo Guang, Haojia Hui, Kaiwen Long, Yu Wang, Chao Yu, Wenbo Ding
[ABSTRACT]Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. To address this, we reparameterize the velocity network using principles from modern sequential models, introducing two stable architectures: Flow-G, which incorporates a gated velocity, and Flow-T, which utilizes a decoded velocity. We then develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies. Our approach supports both from-scratch and offline-to-online learning and achieves state-of-the-art performance on continuous control and robotic manipulation benchmarks, eliminating the need for common workarounds like policy distillation or surrogate objectives.
[LINK]http://arxiv.org/abs/2509.25756v3
[DATE]2026-01-14 19:56:33+08:00
[CATEGORIES]cs.LG
On Bayesian Neural Networks with Dependent and Possibly Heavy-Tailed Weights
[AUTHORS]Nicola Apollonio, Giovanni Franzina, Giovanni Luca Torrisi
[ABSTRACT]We consider fully connected and feedforward deep neural networks with dependent and possibly heavy-tailed weights, as introduced in Lee et al. 2023, to address limitations of the standard Gaussian prior. It has been proved in Lee et al. 2023 that, as the number of nodes in the hidden layers grows large, according to a sequential and ordered limit, the law of the output converges weakly to a Gaussian mixture. Among our results, we present sufficient conditions on the model parameters (the activation function and the associated Lévy measures) which ensure that the sequential limit is independent of the order. Next, we study the neural network through the lens of the posterior distribution with a Gaussian likelihood. If the random covariance matrix of the infinite-width limit is positive definite under the prior, we identify the posterior distribution of the output in the wide-width limit according to a sequential regime. Remarkably, we provide mild sufficient conditions to ensure the aforementioned invertibility of the random covariance matrix under the prior, thereby extending the results in L. Carvalho et al. 2025. We illustrate our findings using numerical simulations.
[COMMENTS]2 figures
[LINK]http://arxiv.org/abs/2507.22095v4
[DATE]2026-01-14 19:50:48+08:00
[CATEGORIES]cs.LG
KnowEEG: Explainable Knowledge Driven EEG Classification
[AUTHORS]Amarpal Sahota, Navid Mohammadi Foumani, Raul Santos-Rodriguez, Zahraa S. Abdallah
[ABSTRACT]Electroencephalography (EEG) is a method of recording brain activity that shows significant promise in applications ranging from disease classification to emotion detection and brain-computer interfaces. Recent advances in deep learning have improved EEG classification performance yet model explainability remains an issue. To address this key limitation of explainability we introduce KnowEEG; a novel explainable machine learning approach for EEG classification. KnowEEG extracts a comprehensive set of per-electrode features, filters them using statistical tests, and integrates between-electrode connectivity statistics. These features are then input to our modified Random Forest model (Fusion Forest) that balances per electrode statistics with between electrode connectivity features in growing the trees of the forest. By incorporating knowledge from both the generalized time-series and EEG-specific domains, KnowEEG achieves performance comparable to or exceeding state-of-the-art deep learning models across five different classification tasks: emotion detection, mental workload classification, eyes open/closed detection, abnormal EEG classification, and event detection. In addition to high performance, KnowEEG provides inherent explainability through feature importance scores for understandable features. We demonstrate by example on the eyes closed/open classification task that this explainability can be used to discover knowledge about the classes. This discovered knowledge for eyes open/closed classification was proven to be correct by current neuroscience literature. Therefore, the impact of KnowEEG will be significant for domains where EEG explainability is critical such as healthcare.
[LINK]http://arxiv.org/abs/2505.00541v2
[DATE]2026-01-14 19:46:55+08:00
[CATEGORIES]cs.LG
Preliminary Tests of the Anticipatory Classifier System with Hindsight Experience Replay
[AUTHORS]Olgierd Unold, Stanisław Franczyk
[ABSTRACT]This paper introduces ACS2HER, a novel integration of the Anticipatory Classifier System (ACS2) with the Hindsight Experience Replay (HER) mechanism. While ACS2 is highly effective at building cognitive maps through latent learning, its performance often stagnates in environments characterized by sparse rewards. We propose a specific architectural variant that triggers hindsight learning when the agent fails to reach its primary goal, re-labeling visited states as virtual goals to densify the learning signal. The proposed model was evaluated on two benchmarks: the deterministic \texttt{Maze 6} and the stochastic \texttt{FrozenLake}. The results demonstrate that ACS2HER significantly accelerates knowledge acquisition and environmental mastery compared to the standard ACS2. However, this efficiency gain is accompanied by increased computational overhead and a substantial expansion in classifier numerosity. This work provides the first analysis of combining anticipatory mechanisms with retrospective goal-relabeling in Learning Classifier Systems.
[LINK]http://arxiv.org/abs/2601.09400v1
[DATE]2026-01-14 19:43:36+08:00
[CATEGORIES]cs.LG
Multilevel neural simulation-based inference
[AUTHORS]Yuga Hikida, Ayush Bharti, Niall Jeffrey, François-Xavier Briol
[ABSTRACT]Neural simulation-based inference (SBI) is a popular set of methods for Bayesian inference when models are only available in the form of a simulator. These methods are widely used in the sciences and engineering, where writing down a likelihood can be significantly more challenging than constructing a simulator. However, the performance of neural SBI can suffer when simulators are computationally expensive, thereby limiting the number of simulations that can be performed. In this paper, we propose a novel approach to neural SBI which leverages multilevel Monte Carlo techniques for settings where several simulators of varying cost and fidelity are available. We demonstrate through both theoretical analysis and extensive experiments that our method can significantly enhance the accuracy of SBI methods given a fixed computational budget.
[LINK]http://arxiv.org/abs/2506.06087v4
[DATE]2026-01-14 19:31:35+08:00
[CATEGORIES]cs.LG
Fodor and Pylyshyn’s Legacy: Still No Human-like Systematic Compositionality in Neural Networks
[AUTHORS]Tim Woydt, Moritz Willig, Antonia Wüst, Lukas Helff, Wolfgang Stammer, Constantin A. Rothkopf, Kristian Kersting
[ABSTRACT]Strong meta-learning capabilities for systematic compositionality are emerging as an important skill for navigating the complex and changing tasks of today’s world. However, in presenting models for robust adaptation to novel environments, it is important to refrain from making unsupported claims about the performance of meta-learning systems that ultimately do not stand up to scrutiny. While Fodor and Pylyshyn famously posited that neural networks inherently lack this capacity as they are unable to model compositional representations or structure-sensitive operations, and thus are not a viable model of the human mind, Lake and Baroni recently presented meta-learning as a pathway to compositionality. In this position paper, we critically revisit this claim and highlight limitations in the proposed meta-learning framework for compositionality. Our analysis shows that modern neural meta-learning systems can only perform such tasks, if at all, under a very narrow and restricted definition of a meta-learning setup. We therefore claim that `Fodor and Pylyshyn’s legacy’ persists, and to date, there is no human-like systematic compositionality learned in neural networks.
[LINK]http://arxiv.org/abs/2506.01820v2
[DATE]2026-01-14 19:24:37+08:00
[CATEGORIES]cs.LG
Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders
[AUTHORS]James Oldfield, Shawn Im, Sharon Li, Mihalis A. Nicolaou, Ioannis Patras, Grigorios G Chrysos
[ABSTRACT]Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping–significantly increasing model’s next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights–preserving the original decoders’ expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language–opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
[COMMENTS]NeurIPS 2025 camera-ready version
[LINK]http://arxiv.org/abs/2505.21364v3
[DATE]2026-01-14 18:55:34+08:00
[CATEGORIES]cs.LG
In-Context Learning Enhanced Credibility Transformer
[AUTHORS]Kishan Padayachy, Ronald Richman, Salvatore Scognamiglio, Mario V. Wüthrich
[ABSTRACT]The starting point of our network architecture is the Credibility Transformer which extends the classical Transformer architecture by a credibility mechanism to improve model learning and predictive performance. This Credibility Transformer learns credibilitized CLS tokens that serve as learned representations of the original input features. In this paper we present a new paradigm that augments this architecture by an in-context learning mechanism, i.e., we increase the information set by a context batch consisting of similar instances. This allows the model to enhance the CLS token representations of the instances by additional in-context information and fine-tuning. We empirically verify that this in-context learning enhances predictive accuracy by adapting to similar risk patterns. Moreover, this in-context learning also allows the model to generalize to new instances which, e.g., have feature levels in the categorical covariates that have not been present when the model was trained – for a relevant example, think of a new vehicle model which has just been developed by a car manufacturer.
[LINK]http://arxiv.org/abs/2509.08122v2
[DATE]2026-01-14 18:49:19+08:00
[CATEGORIES]cs.LG
FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models
[AUTHORS]Nils Neukirch, Johanna Vielhaben, Nils Strodthoff
[ABSTRACT]Internal representations are crucial for understanding deep neural networks, such as their properties and reasoning patterns, but remain difficult to interpret. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model - a pretrained high-fidelity diffusion model conditioned on spatially resolved feature maps - to learn such a mapping in a probabilistic manner. We demonstrate the feasibility of this approach across various pretrained image classifiers from CNNs to ViTs, showing excellent reconstruction capabilities. Through qualitative comparisons and robustness analysis, we validate our method and showcase possible applications, such as the visualization of concept steering in input space or investigations of the composite nature of the feature space. This approach has broad potential for improving feature space understanding in computer vision models.
[COMMENTS]Version published by Transactions on Machine Learning Research in 2025 (TMLR ISSN 2835-8856) at https://openreview.net/forum?id=UtE1YnPNgZ. 32 pages, 27 figures. This work builds on an earlier manuscript (arXiv:2505.21032) and crucially extends it. Code is available at https://github.com/AI4HealthUOL/FeatInv
[LINK]http://arxiv.org/abs/2505.21032v2
[DATE]2026-01-14 18:46:17+08:00
[CATEGORIES]cs.LG
Consistent Sampling and Simulation: Molecular Dynamics with Energy-Based Diffusion Models
[AUTHORS]Michael Plainer, Hao Wu, Leon Klein, Stephan Günnemann, Frank Noé
[ABSTRACT]In recent years, diffusion models trained on equilibrium molecular distributions have proven effective for sampling biomolecules. Beyond direct sampling, the score of such a model can also be used to derive the forces that act on molecular systems. However, while classical diffusion sampling usually recovers the training distribution, the corresponding energy-based interpretation of the learned score is often inconsistent with this distribution, even for low-dimensional toy systems. We trace this inconsistency to inaccuracies of the learned score at very small diffusion timesteps, where the model must capture the correct evolution of the data distribution. In this regime, diffusion models fail to satisfy the Fokker-Planck equation, which governs the evolution of the score. We interpret this deviation as one source of the observed inconsistencies and propose an energy-based diffusion model with a Fokker-Planck-derived regularization term to enforce consistency. We demonstrate our approach by sampling and simulating multiple biomolecular systems, including fast-folding proteins, and by introducing a state-of-the-art transferable Boltzmann emulator for dipeptides that supports simulation and achieves improved consistency and efficient sampling. Our code, model weights, and self-contained JAX and PyTorch notebooks are available at https://github.com/noegroup/ScoreMD.
[COMMENTS]Accepted at Conference on Neural Information Processing Systems (NeurIPS 2025)
[LINK]http://arxiv.org/abs/2506.17139v3
[DATE]2026-01-14 18:46:14+08:00
[CATEGORIES]cs.LG
GeoRA: Geometry-Aware Low-Rank Adaptation for RLVR
[AUTHORS]Jiaying Zhang, Lei Shi, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He
[ABSTRACT]Reinforcement Learning with Verifiable Rewards (RLVR) is crucial for advancing large-scale reasoning models. However, existing parameter-efficient methods, such as PiSSA and MiLoRA, are designed for Supervised Fine-Tuning (SFT) and do not account for the distinct optimization dynamics and geometric structures of RLVR. Applying these methods directly leads to spectral collapse and optimization instability, which severely limit model performance. Meanwhile, alternative approaches that leverage update sparsity encounter significant efficiency bottlenecks on modern hardware due to unstructured computations. To address these challenges, we propose GeoRA (Geometry-Aware Low-Rank Adaptation), which exploits the anisotropic and compressible nature of RL update subspaces. GeoRA initializes adapters by extracting principal directions via Singular Value Decomposition (SVD) within a geometrically constrained subspace while freezing the residual components. This method preserves the pre-trained geometric structure and enables efficient GPU computation through dense operators. Experiments on Qwen and Llama demonstrate that GeoRA mitigates optimization bottlenecks caused by geometric misalignment. It consistently outperforms established low-rank baselines on key mathematical benchmarks, achieving state-of-the-art (SOTA) results. Moreover, GeoRA shows superior generalization and resilience to catastrophic forgetting in out-of-domain tasks.
[LINK]http://arxiv.org/abs/2601.09361v1
[DATE]2026-01-14 18:41:34+08:00
[CATEGORIES]cs.LG
High-Performance Serverless Computing: A Systematic Literature Review on Serverless for HPC, AI, and Big Data
[AUTHORS]Valerio Besozzi, Matteo Della Bartola, Patrizio Dazzi, Marco Danelutto
[ABSTRACT]The widespread deployment of large-scale, compute-intensive applications such as high-performance computing, artificial intelligence, and big data is leading to convergence between cloud and high-performance computing infrastructures. Cloud providers are increasingly integrating high-performance computing capabilities in their infrastructures, such as hardware accelerators and high-speed interconnects, while researchers in the high-performance computing community are starting to explore cloud-native paradigms to improve scalability, elasticity, and resource utilization. In this context, serverless computing emerges as a promising execution model to efficiently handle highly dynamic, parallel, and distributed workloads. This paper presents a comprehensive systematic literature review of 122 research articles published between 2018 and early 2025, exploring the use of the serverless paradigm to develop, deploy, and orchestrate compute-intensive applications across cloud, high-performance computing, and hybrid environments. From these, a taxonomy comprising eight primary research directions and nine targeted use case domains is proposed, alongside an analysis of recent publication trends and collaboration networks among authors, highlighting the growing interest and interconnections within this emerging research field. Overall, this work aims to offer a valuable foundation for both new researchers and experienced practitioners, guiding the development of next-generation serverless solutions for parallel compute-intensive applications.
[LINK]http://arxiv.org/abs/2601.09334v1
[DATE]2026-01-14 18:10:20+08:00
[CATEGORIES]cs.LG
Improving AI-Based Canine Heart Disease Diagnosis with Expert-Consensus Auscultation Labeling
[AUTHORS]Pinar Bisgin, Tom Strube, Niklas Tschorn, Michael Pantförder, Maximilian Fecke, Ingrid Ljungvall, Jens Häggström, Gerhard Wess, Christoph Schummer, Sven Meister, Falk M. Howar
[ABSTRACT]Noisy labels pose significant challenges for AI model training in veterinary medicine. This study examines expert assessment ambiguity in canine auscultation data, highlights the negative impact of label noise on classification performance, and introduces methods for label noise reduction. To evaluate whether label noise can be minimized by incorporating multiple expert opinions, a dataset of 140 heart sound recordings (HSR) was annotated regarding the intensity of holosystolic heart murmurs caused by Myxomatous Mitral Valve Disease (MMVD). The expert opinions facilitated the selection of 70 high-quality HSR, resulting in a noise-reduced dataset. By leveraging individual heart cycles, the training data was expanded and classification robustness was enhanced. The investigation encompassed training and evaluating three classification algorithms: AdaBoost, XGBoost, and Random Forest. While AdaBoost and Random Forest exhibited reasonable performances, XGBoost demonstrated notable improvements in classification accuracy. All algorithms showed significant improvements in classification accuracy due to the applied label noise reduction, most notably XGBoost. Specifically, for the detection of mild heart murmurs, sensitivity increased from 37.71% to 90.98% and specificity from 76.70% to 93.69%. For the moderate category, sensitivity rose from 30.23% to 55.81% and specificity from 64.56% to 97.19%. In the loud/thrilling category, sensitivity and specificity increased from 58.28% to 95.09% and from 84.84% to 89.69%, respectively. These results highlight the importance of minimizing label noise to improve classification algorithms for the detection of canine heart murmurs. Index Terms: AI diagnosis, canine heart disease, heart sound classification, label noise reduction, machine learning, XGBoost, veterinary cardiology, MMVD.
[COMMENTS]Accepted to IEEE Engineering in Medicine and Biology Conference (EMBC) 2025
[LINK]http://arxiv.org/abs/2507.05950v2
[DATE]2026-01-14 18:03:29+08:00
[CATEGORIES]cs.LG
Uncertainty-Aware PCA for Arbitrarily Distributed Data Modeled by Gaussian Mixture Models
[AUTHORS]Daniel Klötzl, Ozan Tastekin, David Hägele, Marina Evers, Daniel Weiskopf
[ABSTRACT]Multidimensional data is often associated with uncertainties that are not well-described by normal distributions. In this work, we describe how such distributions can be projected to a low-dimensional space using uncertainty-aware principal component analysis (UAPCA). We propose to model multidimensional distributions using Gaussian mixture models (GMMs) and derive the projection from a general formulation that allows projecting arbitrary probability density functions. The low-dimensional projections of the densities exhibit more details about the distributions and represent them more faithfully compared to UAPCA mappings. Further, we support including user-defined weights between the different distributions, which allows for varying the importance of the multidimensional distributions. We evaluate our approach by comparing the distributions in low-dimensional space obtained by our method and UAPCA to those obtained by sample-based projections.
[COMMENTS]10 pages, 6 figures
[LINK]http://arxiv.org/abs/2508.13990v2
[DATE]2026-01-14 17:53:51+08:00
[CATEGORIES]cs.LG
Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs
[AUTHORS]Maria Camporese, Fabio Massacci
[ABSTRACT]Background: Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair. Recent studies show that large language models (LLMs) outperform traditional techniques, extending their success beyond code generation and fault detection.
Hypothesis: These gains may be driven by hidden factors – “invisible hands” such as training-data leakage or perfect fault localization – that let an LLM reproduce human-authored fixes for the same code.
Objective: We replicate prior AVR studies under controlled conditions by deliberately adding errors to the reported vulnerability location in the prompt. If LLMs merely regurgitate memorized fixes, both small and large localization errors should yield the same number of correct patches, because any offset should divert the model from the original fix.
Method: Our pipeline repairs vulnerabilities from the Vul4J and VJTrans benchmarks after shifting the fault location by n lines from the ground truth. A first LLM generates a patch, a second LLM reviews it, and we validate the result with regression and proof-of-vulnerability tests. Finally, we manually audit a sample of patches and estimate the error rate with the Agresti-Coull-Wilson method.
[LINK]http://arxiv.org/abs/2507.20977v2
[DATE]2026-01-14 17:49:09+08:00
[CATEGORIES]cs.LG
Controlling Ensemble Variance in Diffusion Models: An Application for Reanalyses Downscaling
[AUTHORS]Fabio Merizzi, Davide Evangelista, Harilaos Loukos
[ABSTRACT]In recent years, diffusion models have emerged as powerful tools for generating ensemble members in meteorology. In this work, we demonstrate how a Denoising Diffusion Implicit Model (DDIM) can effectively control ensemble variance by varying the number of diffusion steps. Introducing a theoretical framework, we relate diffusion steps to the variance expressed by the reverse diffusion process. Focusing on reanalysis downscaling, we propose an ensemble diffusion model for the full ERA5-to-CERRA domain, generating variance-calibrated ensemble members for wind speed at full spatial and temporal resolution. Our method aligns global mean variance with a reference ensemble dataset and ensures spatial variance is distributed in accordance with observed meteorological variability. Additionally, we address the lack of ensemble information in the CARRA dataset, showcasing the utility of our approach for efficient, high-resolution ensemble generation.
[LINK]http://arxiv.org/abs/2501.14822v2
[DATE]2026-01-14 17:45:28+08:00
[CATEGORIES]cs.LG
IKDiffuser: a Diffusion-based Generative Inverse Kinematics Solver for Kinematic Trees
[AUTHORS]Zeyu Zhang, Ziyuan Jiao
[ABSTRACT]Solving Inverse Kinematics (IK) for arbitrary kinematic trees presents significant challenges due to their high-dimensionality, redundancy, and complex inter-branch constraints. Conventional optimization-based solvers can be sensitive to initialization and suffer from local minima or conflicting gradients. At the same time, existing learning-based approaches are often tied to a predefined number of end-effectors and a fixed training objective, limiting their reusability across various robot morphologies and task requirements. To address these limitations, we introduce IKDiffuser, a scalable IK solver built upon conditional diffusion-based generative models, which learns the distribution of the configuration space conditioned on end-effector poses. We propose a structure-agnostic formulation that represents end-effector poses as a sequence of tokens, leading to a unified framework that handles varying numbers of end-effectors while learning the implicit kinematic structures entirely from data. Beyond standard IK generation, IKDiffuser handles partially specified goals via a masked marginalization mechanism that conditions only on a subset of end-effector constraints. Furthermore, it supports adding task objectives at inference through objective-guided sampling, enabling capabilities such as warm-start initialization and manipulability maximization without retraining. Extensive evaluations across seven diverse robotic platforms demonstrate that IKDiffuser significantly outperforms state-of-the-art baselines in accuracy, solution diversity, and collision avoidance. Moreover, when used to initialize optimization-based solvers, IKDiffuser significantly boosts success rates on challenging redundant systems with high Degrees of Freedom (DoF), such as the 29-DoF Unitree G1 humanoid, from 21.01% to 96.96% while reducing computation time to the millisecond range.
[COMMENTS]under review
[LINK]http://arxiv.org/abs/2506.13087v4
[DATE]2026-01-14 17:44:39+08:00
[CATEGORIES]cs.LG
Out-of-Distribution Detection via Channelwise Feature Aggregation in Neural Network-Based Receivers
[AUTHORS]Marko Tuononen, Heikki Penttinen, Duy Vu, Dani Korpi, Vesa Starck, Ville Hautamäki
[ABSTRACT]Neural network-based radio receivers are expected to play a key role in future wireless systems, making reliable Out-Of-Distribution (OOD) detection essential. We propose a post-hoc, layerwise OOD framework based on channelwise feature aggregation that avoids classwise statistics–critical for multi-label soft-bit outputs with astronomically many classes. Receiver activations exhibit no discrete clusters but a smooth Signal-to-Noise-Ratio (SNR)-aligned manifold, consistent with classical receiver behavior and motivating manifold-aware OOD detection. We evaluate multiple OOD feature types, distance metrics, and methods across layers. Gaussian Mahalanobis with mean activations is the strongest single detector, earlier layers outperform later, and SNR/classifier fusions offer small, inconsistent AUROC gains. High-delay OOD is detected reliably, while high-speed remains challenging.
[COMMENTS]50 pages, 39 figures, 48 tables, and 31 equations
[LINK]http://arxiv.org/abs/2505.15570v2
[DATE]2026-01-14 17:33:18+08:00
[CATEGORIES]cs.LG
Divergence-Based Similarity Function for Multi-View Contrastive Learning
[AUTHORS]Jae Hyoung Jeon, Cheolsu Lim, Myungjoo Kang
[ABSTRACT]Recent success in contrastive learning has sparked growing interest in more effectively leveraging multiple augmented views of data. While prior methods incorporate multiple views at the loss or feature level, they primarily capture pairwise relationships and fail to model the joint structure across all views. In this work, we propose a divergence-based similarity function (DSF) that explicitly captures the joint structure by representing each set of augmented views as a distribution and measuring similarity as the divergence between distributions. Extensive experiments demonstrate that DSF consistently improves performance across diverse tasks, including kNN classification, linear evaluation, transfer learning, and distribution shift, while also achieving greater efficiency than other multi-view methods. Furthermore, we establish a connection between DSF and cosine similarity, and demonstrate that, unlike cosine similarity, DSF operates effectively without the need for tuning a temperature hyperparameter.
[COMMENTS]9 pages, 5 figures. Code and Pretrained Model: https://github.com/Jeon789/DSF, v4: Removed erroneous AAAI copyright notice
[LINK]http://arxiv.org/abs/2507.06560v4
[DATE]2026-01-14 17:17:55+08:00
[CATEGORIES]cs.LG
Explainable Autoencoder-Based Anomaly Detection in IEC 61850 GOOSE Networks
[AUTHORS]Dafne Lozano-Paredes, Luis Bote-Curiel, Juan Ramón Feijóo-Martínez, Ismael Gómez-Talal, José Luis Rojo-Álvarez
[ABSTRACT]The IEC 61850 Generic Object-Oriented Substation Event (GOOSE) protocol plays a critical role in real-time protection and automation of digital substations, yet its lack of native security mechanisms can expose power systems to sophisticated cyberattacks. Traditional rule-based and supervised intrusion detection techniques struggle to detect protocol-compliant and zero-day attacks under significant class imbalance and limited availability of labeled data. This paper proposes an explainable, unsupervised multi-view anomaly detection framework for IEC 61850 GOOSE networks that explicitly separates semantic integrity and temporal availability. The approach employs asymmetric autoencoders trained only on real operational GOOSE traffic to learn distinct latent representations of sequence-based protocol semantics and timing-related transmission dynamics in normal traffic. Anomaly detection is implemented using reconstruction errors mixed with statistically grounded thresholds, enabling robust detection without specified attack types. Feature-level reconstruction analysis provides intrinsic explainability by directly linking detection outcomes to IEC 61850 protocol characteristics. The proposed framework is evaluated using real substation traffic for training and a public dataset containing normal traffic and message suppression, data manipulation, and denial-of-service attacks for testing. Experimental results show attack detection rates above 99% with false positives remaining below 5% of total traffic, demonstrating strong generalization across environments and effective operation under extreme class imbalance and interpretable anomaly attribution.
[LINK]http://arxiv.org/abs/2601.09287v1
[DATE]2026-01-14 16:47:07+08:00
[CATEGORIES]cs.LG
Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction
[AUTHORS]Mianzhi Pan, JianFei Li, Peishuo Liu, Botian Wang, Yawen Ouyang, Yiming Rong, Hao Zhou, Jianbing Zhang
[ABSTRACT]Metal-organic frameworks (MOFs) are porous crystalline materials with broad applications such as carbon capture and drug delivery, yet accurately predicting their 3D structures remains a significant challenge. While Large Language Models (LLMs) have shown promise in generating crystals, their application to MOFs is hindered by MOFs’ high atomic complexity. Inspired by the success of block-wise paradigms in deep generative models, we pioneer the use of LLMs in this domain by introducing MOF-LLM, the first LLM framework specifically adapted for block-level MOF structure prediction. To effectively harness LLMs for this modular assembly task, our training paradigm integrates spatial-aware continual pre-training (CPT), structural supervised fine-tuning (SFT), and matching-driven reinforcement learning (RL). By incorporating explicit spatial priors and optimizing structural stability via Soft Adaptive Policy Optimization (SAPO), our approach substantially enhances the spatial reasoning capability of a Qwen-3 8B model for accurate MOF structure prediction. Comprehensive experiments demonstrate that MOF-LLM outperforms state-of-the-art denoising-based and LLM-based methods while exhibiting superior sampling efficiency.
[LINK]http://arxiv.org/abs/2601.09285v1
[DATE]2026-01-14 16:45:07+08:00
[CATEGORIES]cs.LG
Cluster Workload Allocation: Semantic Soft Affinity Using Natural Language Processing
[AUTHORS]Leszek Sliwko, Jolanta Mizeria-Pietraszko
[ABSTRACT]Cluster workload allocation often requires complex configurations, creating a usability gap. This paper introduces a semantic, intent-driven scheduling paradigm for cluster systems using Natural Language Processing. The system employs a Large Language Model (LLM) integrated via a Kubernetes scheduler extender to interpret natural language allocation hint annotations for soft affinity preferences. A prototype featuring a cluster state cache and an intent analyzer (using AWS Bedrock) was developed. Empirical evaluation demonstrated high LLM parsing accuracy (>95% Subset Accuracy on an evaluation ground-truth dataset) for top-tier models like Amazon Nova Pro/Premier and Mistral Pixtral Large, significantly outperforming a baseline engine. Scheduling quality tests across six scenarios showed the prototype achieved superior or equivalent placement compared to standard Kubernetes configurations, particularly excelling in complex and quantitative scenarios and handling conflicting soft preferences. The results validate using LLMs for accessible scheduling but highlight limitations like synchronous LLM latency, suggesting asynchronous processing for production readiness. This work confirms the viability of semantic soft affinity for simplifying workload orchestration.
[LINK]http://arxiv.org/abs/2601.09282v1
[DATE]2026-01-14 16:36:21+08:00
[CATEGORIES]cs.LG
Comparative Performance Analysis of Quantum Machine Learning Architectures for Credit Card Fraud Detection
[AUTHORS]Mansour El Alami, Nouhaila Innan, Muhammad Shafique, Mohamed Bennai
[ABSTRACT]As financial fraud becomes increasingly complex, effective detection methods are essential. Quantum Machine Learning (QML) introduces certain capabilities that may enhance both accuracy and efficiency in this area. This study examines how different quantum feature maps and ansatz configurations affect the performance of three QML-based classifiers, the Variational Quantum Classifier (VQC), the Sampler Quantum Neural Network (SQNN), and the Estimator Quantum Neural Network (EQNN), when applied to two non-normalized financial fraud datasets. Different quantum feature map and ansatz configurations are evaluated, revealing distinct performance patterns. The VQC consistently demonstrates strong classification results, achieving an F1-score of 0.88, while the SQNN also delivers promising outcomes. In contrast, the EQNN struggles to produce robust results, emphasizing the challenges presented by non-standardized data. Statistical validation using ANOVA confirms the significance of observed performance differences. Additionally, robustness tests on the best-performing models under five quantum noise types show that they maintain competitive performance, supporting their practical applicability. These findings highlight the importance of careful model configuration in QML-based financial fraud detection. By showing how specific feature maps and ansatz choices influence predictive success, this work guides researchers and practitioners in refining QML approaches for complex financial applications.
[COMMENTS]15 pages, 20 figures, 8 tables. Accepted in Applied Intelligence
[LINK]http://arxiv.org/abs/2412.19441v3
[DATE]2026-01-14 16:20:11+08:00
[CATEGORIES]cs.LG
Random Multiplexing
[AUTHORS]Lei Liu, Yuhao Chi, Shunqi Huang, Zhaoyang Zhang
[ABSTRACT]As wireless communication applications evolve from traditional multipath environments to high-mobility scenarios like unmanned aerial vehicles, multiplexing techniques have advanced accordingly. Traditional single-carrier frequency-domain equalization (SC-FDE) and orthogonal frequency-division multiplexing (OFDM) have given way to emerging orthogonal time-frequency space (OTFS) and affine frequency-division multiplexing (AFDM). These approaches exploit specific channel structures to diagonalize or sparsify the effective channel, thereby enabling low-complexity detection. However, their reliance on these structures significantly limits their robustness in dynamic, real-world environments. To address these challenges, this paper studies a random multiplexing technique that is decoupled from the physical channels, enabling its application to arbitrary norm-bounded and spectrally convergent channel matrices. Random multiplexing achieves statistical fading-channel ergodicity for transmitted signals by constructing an equivalent input-isotropic channel matrix in the random transform domain. It guarantees the asymptotic replica MAP bit-error rate (BER) optimality of AMP-type detectors for linear systems with arbitrary norm-bounded, spectrally convergent channel matrices and signaling configurations, under the unique fixed point assumption. A low-complexity cross-domain memory AMP (CD-MAMP) detector is considered, leveraging the sparsity of the time-domain channel and the randomness of the equivalent channel. Optimal power allocations are derived to minimize the replica MAP BER and maximize the replica constrained capacity of random multiplexing systems. The optimal coding principle and replica constrained-capacity optimality of CD-MAMP detector are investigated for random multiplexing systems. Additionally, the versatility of random multiplexing in diverse wireless applications is explored.
[COMMENTS]This paper has been accepted for publication in IEEE Transactions on Information Theory
[LINK]http://arxiv.org/abs/2512.24087v2
[DATE]2026-01-14 16:15:56+08:00
[CATEGORIES]cs.LG
Neural Emulator Superiority: When Machine Learning for PDEs Surpasses its Training Data
[AUTHORS]Felix Koehler, Nils Thuerey
[ABSTRACT]Neural operators or emulators for PDEs trained on data from numerical solvers are conventionally assumed to be limited by their training data’s fidelity. We challenge this assumption by identifying “emulator superiority,” where neural networks trained purely on low-fidelity solver data can achieve higher accuracy than those solvers when evaluated against a higher-fidelity reference. Our theoretical analysis reveals how the interplay between emulator inductive biases, training objectives, and numerical error characteristics enables superior performance during multi-step rollouts. We empirically validate this finding across different PDEs using standard neural architectures, demonstrating that emulators can implicitly learn dynamics that are more regularized or exhibit more favorable error accumulation properties than their training data, potentially surpassing training data limitations and mitigating numerical artifacts. This work prompts a re-evaluation of emulator benchmarking, suggesting neural emulators might achieve greater physical fidelity than their training source within specific operational regimes. Project Page: https://tum-pbs.github.io/emulator-superiority
[COMMENTS]Accepted at NeurIPS 2025: https://neurips.cc/virtual/2025/poster/116770 ; V2: Updated references, fix typos
[LINK]http://arxiv.org/abs/2510.23111v2
[DATE]2026-01-14 16:02:51+08:00
[CATEGORIES]cs.LG
STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized Variational Autoencoders for Financial Trading
[AUTHORS]Yilei Zhao, Wentao Zhang, Tingran Yang, Yong Jiang, Fei Huang, Wei Yang Bryan Lim
[ABSTRACT]In financial trading, factor models are widely used to price assets and capture excess returns from mispricing. Recently, we have witnessed the rise of variational autoencoder-based latent factor models, which learn latent factors self-adaptively. While these models focus on modeling overall market conditions, they often fail to effectively capture the temporal patterns of individual stocks. Additionally, representing multiple factors as single values simplifies the model but limits its ability to capture complex relationships and dependencies. As a result, the learned factors are of low quality and lack diversity, reducing their effectiveness and robustness across different trading periods. To address these issues, we propose a Spatio-Temporal factOR Model based on dual vector quantized variational autoencoders, named STORM, which extracts features of stocks from temporal and spatial perspectives, then fuses and aligns these features at the fine-grained and semantic level, and represents the factors as multi-dimensional embeddings. The discrete codebooks cluster similar factor embeddings, ensuring orthogonality and diversity, which helps distinguish between different factors and enables factor selection in financial trading. To show the performance of the proposed factor model, we apply it to two downstream experiments: portfolio management on two stock datasets and individual trading tasks on six specific stocks. The extensive experiments demonstrate STORM’s flexibility in adapting to downstream tasks and superior performance over baseline models.
[LINK]http://arxiv.org/abs/2412.09468v4
[DATE]2026-01-14 15:55:43+08:00
[CATEGORIES]cs.LG
Learning to Trust Experience: A Monitor-Trust-Regulator Framework for Learning under Unobservable Feedback Reliability
[AUTHORS]Zhipeng Zhang, Zhenjie Yao, Kai Li, Lei Yang
[ABSTRACT]Learning under unobservable feedback reliability poses a distinct challenge beyond optimization robustness: a system must decide whether to learn from an experience, not only how to learn stably. We study this setting as Epistemic Identifiability under Unobservable Reliability (EIUR), where each experience has a latent credibility, reliable and unreliable feedback can be locally indistinguishable, and data are generated in a closed loop by the learner’s own evolving beliefs and actions. In EIUR, standard robust learning can converge stably yet form high-confidence, systematically wrong beliefs.
We propose metacognitive regulation as a practical response: a second, introspective control loop that infers experience credibility from endogenous evidence in the learner’s internal dynamics. We formalize this as a modular Monitor-Trust-Regulator (MTR) decomposition and instantiate it with self-diagnosis, which maintains a slowly varying experience-trust variable that softly modulates learning updates, without exogenous reliability labels or an explicit corruption model.
Empirically, in the EIUR regimes studied here, self-diagnosis is associated with improved epistemic identifiability. In reinforcement learning, it enables calibrated skepticism and recovery under systematically corrupted rewards. In supervised learning, it exposes a critical dissociation: performance recovery does not imply epistemic recovery. Accuracy can rebound while internal belief dynamics remain locked-in by early misleading data, a failure detectable only through introspective diagnostics. Together, MTR and self-diagnosis provide an organizing abstraction and a concrete design template for intrinsic reliability assessment in autonomous learning under unobservable reliability.
[COMMENTS]23 pages, 7 figures. Preprint
[LINK]http://arxiv.org/abs/2601.09261v1
[DATE]2026-01-14 15:52:14+08:00
[CATEGORIES]cs.LG
LatencyPrism: Online Non-intrusive Latency Sculpting for SLO-Guaranteed LLM Inference
[AUTHORS]Du Yin, Jiayi Ren, Xiayu Sun, Tianyao Zhou, Haizhu Zhou, Ruiyan Ma, Danyang Zhang
[ABSTRACT]LLM inference latency critically determines user experience and operational costs, directly impacting throughput under SLO constraints. Even brief latency spikes degrade service quality despite acceptable average performance. However, distributed inference environments featuring diverse software frameworks and XPU architectures combined with dynamic workloads make latency analysis challenging. Constrained by intrusive designs that necessitate service restarts or even suspension, and by hardware-bound implementations that fail to adapt to heterogeneous inference environments, existing AI profiling methods are often inadequate for real-time production analysis.
We present LatencyPrism, the first zero-intrusion multi-platform latency sculpting system. It aims to break down the inference latency across pipeline, proactively alert on inference latency anomalies, and guarantee adherence to SLOs, all without requiring code modifications or service restarts. LatencyPrism has been deployed across thousands of XPUs for over six months. It enables low-overhead real-time monitoring at batch level with alerts triggered in milliseconds. This approach distinguishes between workload-driven latency variations and anomalies indicating underlying issues with an F1-score of 0.98. We also conduct extensive experiments and investigations into root cause analysis to demonstrate LatencyPrism’s capability.
[COMMENTS]12 pages, 6 figures
[LINK]http://arxiv.org/abs/2601.09258v1
[DATE]2026-01-14 15:46:59+08:00
[CATEGORIES]cs.LG
RIFT: Repurposing Negative Samples via Reward-Informed Fine-Tuning
[AUTHORS]Zehua Liu, Shuqi Liu, Tao Zhong, Mingxuan Yuan
[ABSTRACT]While Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) are standard for LLM alignment, they either rely on costly expert data or discard valuable negative samples, leading to data inefficiency. To address this, we propose Reward Informed Fine-Tuning (RIFT), a simple yet effective framework that utilizes all self-generated samples. Unlike the hard thresholding of RFT, RIFT repurposes negative trajectories, reweighting the loss with scalar rewards to learn from both the positive and negative trajectories from the model outputs. To overcome the training collapse caused by naive reward integration, where direct multiplication yields an unbounded loss, we introduce a stabilized loss formulation that ensures numerical robustness and optimization efficiency. Extensive experiments on mathematical benchmarks across various base models show that RIFT consistently outperforms RFT. Our results demonstrate that RIFT is a robust and data-efficient alternative for alignment using mixed-quality, self-generated data.
[LINK]http://arxiv.org/abs/2601.09253v1
[DATE]2026-01-14 15:41:03+08:00
[CATEGORIES]cs.LG
HGATSolver: A Heterogeneous Graph Attention Solver for Fluid-Structure Interaction
[AUTHORS]Qin-Yi Zhang, Hong Wang, Siyao Liu, Haichuan Lin, Linying Cao, Xiao-Hu Zhou, Chen Chen, Shuangyi Wang, Zeng-Guang Hou
[ABSTRACT]Fluid-structure interaction (FSI) systems involve distinct physical domains, fluid and solid, governed by different partial differential equations and coupled at a dynamic interface. While learning-based solvers offer a promising alternative to costly numerical simulations, existing methods struggle to capture the heterogeneous dynamics of FSI within a unified framework. This challenge is further exacerbated by inconsistencies in response across domains due to interface coupling and by disparities in learning difficulty across fluid and solid regions, leading to instability during prediction. To address these challenges, we propose the Heterogeneous Graph Attention Solver (HGATSolver). HGATSolver encodes the system as a heterogeneous graph, embedding physical structure directly into the model via distinct node and edge types for fluid, solid, and interface regions. This enables specialized message-passing mechanisms tailored to each physical domain. To stabilize explicit time stepping, we introduce a novel physics-conditioned gating mechanism that serves as a learnable, adaptive relaxation factor. Furthermore, an Inter-domain Gradient-Balancing Loss dynamically balances the optimization objectives across domains based on predictive uncertainty. Extensive experiments on two constructed FSI benchmarks and a public dataset demonstrate that HGATSolver achieves state-of-the-art performance, establishing an effective framework for surrogate modeling of coupled multi-physics systems.
[LINK]http://arxiv.org/abs/2601.09251v1
[DATE]2026-01-14 15:38:02+08:00
[CATEGORIES]cs.LG
XLinear: A Lightweight and Accurate MLP-Based Model for Long-Term Time Series Forecasting with Exogenous Inputs
[AUTHORS]Xinyang Chen, Huidong Jin, Yu Huang, Zaiwen Feng
[ABSTRACT]Despite the prevalent assumption of uniform variable importance in long-term time series forecasting models, real world applications often exhibit asymmetric causal relationships and varying data acquisition costs. Specifically, cost-effective exogenous data (e.g., local weather) can unilaterally influence dynamics of endogenous variables, such as lake surface temperature. Exploiting these links enables more effective forecasts when exogenous inputs are readily available. Transformer-based models capture long-range dependencies but incur high computation and suffer from permutation invariance. Patch-based variants improve efficiency yet can miss local temporal patterns. To efficiently exploit informative signals across both the temporal dimension and relevant exogenous variables, this study proposes XLinear, a lightweight time series forecasting model built upon MultiLayer Perceptrons (MLPs). XLinear uses a global token derived from an endogenous variable as a pivotal hub for interacting with exogenous variables, and employs MLPs with sigmoid activation to extract both temporal patterns and variate-wise dependencies. Its prediction head then integrates these signals to forecast the endogenous series. We evaluate XLinear on seven standard benchmarks and five real-world datasets with exogenous inputs. Compared with state-of-the-art models, XLinear delivers superior accuracy and efficiency for both multivariate forecasts and univariate forecasts influenced by exogenous inputs.
[COMMENTS]Accepted by AAAI 2026
[LINK]http://arxiv.org/abs/2601.09237v1
[DATE]2026-01-14 15:21:29+08:00
[CATEGORIES]cs.LG
Unified Multimodal Brain Decoding via Cross-Subject Soft-ROI Fusion
[AUTHORS]Xuanyu Hu
[ABSTRACT]Multimodal brain decoding aims to reconstruct semantic information that is consistent with visual stimuli from brain activity signals such as fMRI, and then generate readable natural language descriptions. However, multimodal brain decoding still faces key challenges in cross-subject generalization and interpretability. We propose a BrainROI model and achieve leading-level results in brain-captioning evaluation on the NSD dataset. Under the cross-subject setting, compared with recent state-of-the-art methods and representative baselines, metrics such as BLEU-4 and CIDEr show clear improvements. Firstly, to address the heterogeneity of functional brain topology across subjects, we design a new fMRI encoder. We use multi-atlas soft functional parcellations (soft-ROI) as a shared space. We extend the discrete ROI Concatenation strategy in MINDLLM to a voxel-wise gated fusion mechanism (Voxel-gate). We also ensure consistent ROI mapping through global label alignment, which enhances cross-subject transferability. Secondly, to overcome the limitations of manual and black-box prompting methods in stability and transparency, we introduce an interpretable prompt optimization process. In a small-sample closed loop, we use a locally deployed Qwen model to iteratively generate and select human-readable prompts. This process improves the stability of prompt design and preserves an auditable optimization trajectory. Finally, we impose parameterized decoding constraints during inference to further improve the stability and quality of the generated descriptions.
[COMMENTS]15 pages, 2 figures, 4 tables. Submitted to ICPR 2026
[LINK]http://arxiv.org/abs/2512.20249v2
[DATE]2026-01-14 15:14:01+08:00
[CATEGORIES]cs.LG
Tipping Point Forecasting in Non-Stationary Dynamics on Function Spaces
[AUTHORS]Miguel Liu-Schiaffini, Clare E. Singer, Nikola Kovachki, Sze Chai Leung, Hyunji Jane Bae, Kamyar Azizzadenesheli, Anima Anandkumar
[ABSTRACT]Tipping points are abrupt, drastic, and often irreversible changes in the evolution of non-stationary and chaotic dynamical systems. For instance, increased greenhouse gas concentrations are predicted to lead to drastic decreases in low cloud cover, referred to as a climatological tipping point. In this paper, we learn the evolution of such non-stationary dynamical systems using a novel recurrent neural operator (RNO), which learns mappings between function spaces. After training RNO on only the pre-tipping dynamics, we employ it to detect future tipping points using an uncertainty-based approach. In particular, we propose a conformal prediction framework to forecast tipping points by monitoring deviations from physics constraints (such as conserved quantities and partial differential equations), enabling forecasting of these abrupt changes along with a rigorous measure of uncertainty. We illustrate our proposed methodology on non-stationary ordinary and partial differential equations, such as the Lorenz-63 and Kuramoto-Sivashinsky equations. We also apply our methods to forecast a climate tipping point in stratocumulus cloud cover and airfoil wake and stall transitions using only limited knowledge of the governing equations. For the latter, we show that our proposed method zero-shot generalizes to forecasting multiple future tipping points under varying Reynolds numbers. In our experiments, we demonstrate that even partial or approximate physics constraints can be used to accurately forecast future tipping points.
[LINK]http://arxiv.org/abs/2308.08794v3
[DATE]2026-01-14 14:54:26+08:00
[CATEGORIES]cs.LG
From Hawkes Processes to Attention: Time-Modulated Mechanisms for Event Sequences
[AUTHORS]Xinzi Tan, Kejian Zhang, Junhan Yu, Doudou Zhou
[ABSTRACT]Marked Temporal Point Processes (MTPPs) arise naturally in medical, social, commercial, and financial domains. However, existing Transformer-based methods mostly inject temporal information only via positional encodings, relying on shared or parametric decay structures, which limits their ability to capture heterogeneous and type-specific temporal effects. Inspired by this observation, we derive a novel attention operator called Hawkes Attention from the multivariate Hawkes process theory for MTPP, using learnable per-type neural kernels to modulate query, key and value projections, thereby replacing the corresponding parts in the traditional attention. Benefited from the design, Hawkes Attention unifies event timing and content interaction, learning both the time-relevant behavior and type-specific excitation patterns from the data. The experimental results show that our method achieves better performance compared to the baselines. In addition to the general MTPP, our attention mechanism can also be easily applied to specific temporal structures, such as time series forecasting.
[LINK]http://arxiv.org/abs/2601.09220v1
[DATE]2026-01-14 14:47:37+08:00
[CATEGORIES]cs.LG
Annealed Relaxation of Speculative Decoding for Faster Autoregressive Image Generation
[AUTHORS]Xingyao Li, Fengzhuo Zhang, Cunxiao Du, Hui Ji
[ABSTRACT]Despite significant progress in autoregressive image generation, inference remains slow due to the sequential nature of AR models and the ambiguity of image tokens, even when using speculative decoding. Recent works attempt to address this with relaxed speculative decoding but lack theoretical grounding. In this paper, we establish the theoretical basis of relaxed SD and propose COOL-SD, an annealed relaxation of speculative decoding built on two key insights. The first analyzes the total variation (TV) distance between the target model and relaxed speculative decoding and yields an optimal resampling distribution that minimizes an upper bound of the distance. The second uses perturbation analysis to reveal an annealing behaviour in relaxed speculative decoding, motivating our annealed design. Together, these insights enable COOL-SD to generate images faster with comparable quality, or achieve better quality at similar latency. Experiments validate the effectiveness of COOL-SD, showing consistent improvements over prior methods in speed-quality trade-offs.
[COMMENTS]Accepted to AAAI 2026
[LINK]http://arxiv.org/abs/2601.09212v1
[DATE]2026-01-14 14:35:21+08:00
[CATEGORIES]cs.LG
Soft Contrastive Learning for Time Series
[AUTHORS]Seunghan Lee, Taeyoung Park, Kibok Lee
[COMMENTS]ICLR 2024 Spotlight
[LINK]http://arxiv.org/abs/2312.16424v4
[DATE]2026-01-14 14:23:37+08:00
[CATEGORIES]cs.LG
CuMoLoS-MAE: A Masked Autoencoder for Remote Sensing Data Reconstruction
[AUTHORS]Anurup Naskar, Nathanael Zhixin Wong, Sara Shamekh
[ABSTRACT]Accurate atmospheric profiles from remote sensing instruments such as Doppler Lidar, Radar, and radiometers are frequently corrupted by low-SNR (Signal to Noise Ratio) gates, range folding, and spurious discontinuities. Traditional gap filling blurs fine-scale structures, whereas deep models lack confidence estimates. We present CuMoLoS-MAE, a Curriculum-Guided Monte Carlo Stochastic Ensemble Masked Autoencoder designed to (i) restore fine-scale features such as updraft and downdraft cores, shear lines, and small vortices, (ii) learn a data-driven prior over atmospheric fields, and (iii) quantify pixel-wise uncertainty. During training, CuMoLoS-MAE employs a mask-ratio curriculum that forces a ViT decoder to reconstruct from progressively sparser context. At inference, we approximate the posterior predictive by Monte Carlo over random mask realisations, evaluating the MAE multiple times and aggregating the outputs to obtain the posterior predictive mean reconstruction together with a finely resolved per-pixel uncertainty map. Together with high-fidelity reconstruction, this novel deep learning-based workflow enables enhanced convection diagnostics, supports real-time data assimilation, and improves long-term climate reanalysis.
[COMMENTS]Accepted for poster presentation at the NeurIPS 2025 workshop on Tackling Climate Change with Machine Learning. 5 pages, 2 figures
[LINK]http://arxiv.org/abs/2508.14957v2
[DATE]2026-01-14 13:43:13+08:00
[CATEGORIES]cs.LG
Noisy Quantum Learning Theory
[AUTHORS]Jordan Cotler, Weiyuan Gong, Ishaan Kannan
[ABSTRACT]We develop a framework for learning from noisy quantum experiments in which fault-tolerant devices access uncharacterized systems through noisy couplings. Introducing the complexity class $\textsf{NBQP}$ (“noisy BQP’’), we model noisy fault-tolerant quantum computers that cannot generally error-correct the oracle systems they query. Using this class, we prove that while noise can eliminate the exponential quantum learning advantages of unphysical, noiseless learners, a superpolynomial gap remains between $\textsf{NISQ}$ and fault-tolerant devices. Turning to canonical learning tasks in noisy settings, we find that the exponential two-copy advantage for purity testing collapses under local depolarizing noise. Nevertheless, we identify a setting motivated by AdS/CFT in which noise-resilient physical structure restores this quantum learning advantage. We then analyze noisy Pauli shadow tomography, deriving lower bounds characterizing how instance size, quantum memory and noise jointly control sample complexity, and design algorithms with parametrically matching scalings. We study similar tradeoffs in quantum metrology, and show that the Heisenberg-limited sensitivity of existing error-correction-based protocols persists only up to a timescale inverse-polynomial in the error rate per probe qubit. Together, our results demonstrate that the primitives underlying quantum-enhanced experiments are fundamentally fragile to noise, and that realizing meaningful quantum advantages in future experiments will require interfacing noise-robust physical properties with available algorithmic techniques.
[COMMENTS]13+56 pages, 4 figures; v2: added metrology-related results, typos corrected
[LINK]http://arxiv.org/abs/2512.10929v2
[DATE]2026-01-14 13:36:11+08:00
[CATEGORIES]cs.LG
$D^2Prune$: Sparsifying Large Language Models via Dual Taylor Expansion and Attention Distribution Awareness
[AUTHORS]Lang Xiong, Ning Liu, Ao Ren, Yuheng Bai, Haining Fang, BinYan Zhang, Zhe Jiang, Yujuan Tan, Duo Liu
[ABSTRACT]Large language models (LLMs) face significant deployment challenges due to their massive computational demands. % While pruning offers a promising compression solution, existing methods suffer from two critical limitations: (1) They neglect activation distribution shifts between calibration data and test data, resulting in inaccurate error estimations; (2) They overlook the long-tail distribution characteristics of activations in the attention module. To address these limitations, this paper proposes a novel pruning method, $D^2Prune$. First, we propose a dual Taylor expansion-based method that jointly models weight and activation perturbations for precise error estimation, leading to precise pruning mask selection and weight updating and facilitating error minimization during pruning. % Second, we propose an attention-aware dynamic update strategy that preserves the long-tail attention pattern by jointly minimizing the KL divergence of attention distributions and the reconstruction error. Extensive experiments show that $D^2Prune$ consistently outperforms SOTA methods across various LLMs (e.g., OPT-125M, LLaMA2/3, and Qwen3). Moreover, the dynamic attention update mechanism also generalizes well to ViT-based vision models like DeiT, achieving superior accuracy on ImageNet-1K.
[LINK]http://arxiv.org/abs/2601.09176v1
[DATE]2026-01-14 13:17:35+08:00
[CATEGORIES]cs.LG
BalDRO: A Distributionally Robust Optimization based Framework for Large Language Model Unlearning
[AUTHORS]Pengyang Shao, Naixin Zhai, Lei Chen, Yonghui Yang, Fengbin Zhu, Xun Yang, Meng Wang
[ABSTRACT]As Large Language Models (LLMs) increasingly shape online content, removing targeted information from well-trained LLMs (also known as LLM unlearning) has become critical for web governance. A key challenge lies in sample-wise imbalance within the forget set: different samples exhibit widely varying unlearning difficulty, leading to asynchronous forgetting where some knowledge remains insufficiently erased while others become over-forgotten. To address this, we propose BalDRO, a novel and efficient framework for balanced LLM unlearning. BalDRO formulates unlearning as a min-sup process: an inner step identifies a worst-case data distribution that emphasizes hard-to-unlearn samples, while an outer step updates model parameters under this distribution. We instantiate BalDRO via two efficient variants: BalDRO-G, a discrete GroupDRO-based approximation focusing on high-loss subsets, and BalDRO-DV, a continuous Donsker-Varadhan dual method enabling smooth adaptive weighting within standard training pipelines. Experiments on TOFU and MUSE show that BalDRO significantly improves both forgetting quality and model utility over existing methods, and we release code for reproducibility.
[LINK]http://arxiv.org/abs/2601.09172v1
[DATE]2026-01-14 13:15:10+08:00
[CATEGORIES]cs.LG
Reinforcement Learning with Exogenous States and Rewards
[AUTHORS]George Trimponias, Thomas G. Dietterich
[ABSTRACT]Exogenous state variables and rewards can slow reinforcement learning by injecting uncontrolled variation into the reward signal. This paper formalizes exogenous state variables and rewards and shows that if the reward function decomposes additively into endogenous and exogenous components, the MDP can be decomposed into an exogenous Markov Reward Process (based on the exogenous reward) and an endogenous Markov Decision Process (optimizing the endogenous reward). Any optimal policy for the endogenous MDP is also an optimal policy for the original MDP, but because the endogenous reward typically has reduced variance, the endogenous MDP is easier to solve. We study settings where the decomposition of the state space into exogenous and endogenous state spaces is not given but must be discovered. The paper introduces and proves correctness of algorithms for discovering the exogenous and endogenous subspaces of the state space when they are mixed through linear combination. These algorithms can be applied during reinforcement learning to discover the exogenous subspace, remove the exogenous reward, and focus reinforcement learning on the endogenous MDP. Experiments on a variety of challenging synthetic MDPs show that these methods, applied online, discover large exogenous state spaces and produce substantial speedups in reinforcement learning.
[COMMENTS]Substantial rewrite to improve rigor and clarity in response to referee reports at JMLR
[LINK]http://arxiv.org/abs/2303.12957v2
[DATE]2026-01-14 13:15:05+08:00
[CATEGORIES]cs.LG
N-EIoU-YOLOv9: A Signal-Aware Bounding Box Regression Loss for Lightweight Mobile Detection of Rice Leaf Diseases
[AUTHORS]Dung Ta Nguyen Duc, Thanh Bui Dang, Hoang Le Minh, Tung Nguyen Viet, Huong Nguyen Thanh, Dong Trinh Cong
[ABSTRACT]In this work, we propose N EIoU YOLOv9, a lightweight detection framework based on a signal aware bounding box regression loss derived from non monotonic gradient focusing and geometric decoupling principles, referred to as N EIoU (Non monotonic Efficient Intersection over Union). The proposed loss reshapes localization gradients by combining non monotonic focusing with decoupled width and height optimization, thereby enhancing weak regression signals for hard samples with low overlap while reducing gradient interference. This design is particularly effective for small and low contrast targets commonly observed in agricultural disease imagery. The proposed N EIoU loss is integrated into a lightweight YOLOv9t architecture and evaluated on a self collected field dataset comprising 5908 rice leaf images across four disease categories and healthy leaves. Experimental results demonstrate consistent performance gains over the standard CIoU loss, achieving a mean Average Precision of 90.3 percent, corresponding to a 4.3 percent improvement over the baseline, with improved localization accuracy under stricter evaluation criteria. For practical validation, the optimized model is deployed on an Android device using TensorFlow Lite with Float16 quantization, achieving an average inference time of 156 milliseconds per frame while maintaining accuracy. These results confirm that the proposed approach effectively balances accuracy, optimization stability, and computational efficiency for edge based agricultural monitoring systems.
[LINK]http://arxiv.org/abs/2601.09170v1
[DATE]2026-01-14 13:13:36+08:00
[CATEGORIES]cs.LG
DP-FEDSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix
[AUTHORS]Sidhant R. Nair, Tanmay Sen, Mrinmay Sen
[ABSTRACT]Differentially private federated learning (DP-FL) suffers from slow convergence under tight privacy budgets due to the overwhelming noise introduced to preserve privacy. While adaptive optimizers can accelerate convergence, existing second-order methods such as DP-FedNew require O(d^2) memory at each client to maintain local feature covariance matrices, making them impractical for high-dimensional models. We propose DP-FedSOFIM, a server-side second-order optimization framework that leverages the Fisher Information Matrix (FIM) as a natural gradient preconditioner while requiring only O(d) memory per client. By employing the Sherman-Morrison formula for efficient matrix inversion, DP-FedSOFIM achieves O(d) computational complexity per round while maintaining the convergence benefits of second-order methods. Our analysis proves that the server-side preconditioning preserves (epsilon, delta)-differential privacy through the post-processing theorem. Empirical evaluation on CIFAR-10 demonstrates that DP-FedSOFIM achieves superior test accuracy compared to first-order baselines across multiple privacy regimes.
[COMMENTS]17 pages, 1 figure. Submitted to ICML 2026
[LINK]http://arxiv.org/abs/2601.09166v1
[DATE]2026-01-14 13:11:28+08:00
[CATEGORIES]cs.LG
Multi-Teacher Ensemble Distillation: A Mathematical Framework for Probability-Domain Knowledge Aggregation
[AUTHORS]Aaron R. Flouro, Shawn P. Chadwick
[ABSTRACT]Building on the probability-domain distillation framework of Sparse-KD, we develop an axiomatic, operator-theoretic framework for multi-teacher ensemble knowledge distillation. Rather than prescribing a specific aggregation formula, we define five core axioms governing valid knowledge aggregation operators, encompassing convexity, positivity, continuity, weight monotonicity, and temperature coherence. We prove the existence and non-uniqueness of operator families satisfying these axioms, establishing that multiple distinct aggregation mechanisms conform to the same foundational principles.
Within this framework, we establish operator-agnostic guarantees showing that multi-teacher aggregation reduces both stochastic variance and systematic supervisory bias under heterogeneous teachers, while providing Jensen-type bounds, log-loss guarantees, and safety attenuation properties. For aggregation operators linear in teacher weights, we further establish classical ensemble variance-reduction results under standard independence assumptions, with extensions to correlated-error regimes. The framework provides theoretical grounding for multi-teacher distillation from diverse frontier models while admitting multiple valid implementation strategies.
[COMMENTS]7 pages, 1 table
[LINK]http://arxiv.org/abs/2601.09165v1
[DATE]2026-01-14 13:10:36+08:00
[CATEGORIES]cs.LG
Efficient Clustering in Stochastic Bandits
[AUTHORS]G Dhinesh Chandran, Kota Srinivas Reddy, Srikrishna Bhashyam
[ABSTRACT]We study the Bandit Clustering (BC) problem under the fixed confidence setting, where the objective is to group a collection of data sequences (arms) into clusters through sequential sampling from adaptively selected arms at each time step while ensuring a fixed error probability at the stopping time. We consider a setting where arms in a cluster may have different distributions. Unlike existing results in this setting, which assume Gaussian-distributed arms, we study a broader class of vector-parametric distributions that satisfy mild regularity conditions. Existing asymptotically optimal BC algorithms require solving an optimization problem as part of their sampling rule at each step, which is computationally costly. We propose an Efficient Bandit Clustering algorithm (EBC), which, instead of solving the full optimization problem, takes a single step toward the optimal value at each time step, making it computationally efficient while remaining asymptotically optimal. We also propose a heuristic variant of EBC, called EBC-H, which further simplifies the sampling rule, with arm selection based on quantities computed as part of the stopping rule. We highlight the computational efficiency of EBC and EBC-H by comparing their per-sample run time with that of existing algorithms. The asymptotic optimality of EBC is supported through simulations on the synthetic datasets. Through simulations on both synthetic and real-world datasets, we show the performance gain of EBC and EBC-H over existing approaches.
[LINK]http://arxiv.org/abs/2601.09162v1
[DATE]2026-01-14 13:05:58+08:00
[CATEGORIES]cs.LG
Do What You Say: Steering Vision-Language-Action Models via Runtime Reasoning-Action Alignment Verification
[AUTHORS]Yilin Wu, Anqi Li, Tucker Hermans, Fabio Ramos, Andrea Bajcsy, Claudia Pérez-D’Arpino
[ABSTRACT]Reasoning Vision Language Action (VLA) models improve robotic instruction-following by generating step-by-step textual plans before low-level actions, an approach inspired by Chain-of-Thought (CoT) reasoning in language models. Yet even with a correct textual plan, the generated actions can still miss the intended outcomes in the plan, especially in out-of-distribution (OOD) scenarios. We formalize this phenomenon as a lack of embodied CoT faithfulness, and introduce a training-free, runtime policy steering method for reasoning-action alignment. Given a reasoning VLA’s intermediate textual plan, our framework samples multiple candidate action sequences from the same model, predicts their outcomes via simulation, and uses a pre-trained Vision-Language Model (VLM) to select the sequence whose outcome best aligns with the VLA’s own textual plan. Only executing action sequences that align with the textual reasoning turns our base VLA’s natural action diversity from a source of error into a strength, boosting robustness to semantic and visual OOD perturbations and enabling novel behavior composition without costly re-training. We also contribute a reasoning-annotated extension of LIBERO-100, environment variations tailored for OOD evaluation, and demonstrate up to 15% performance gain over prior work on behavior composition tasks and scales with compute and data diversity. Project Website at: https://yilin-wu98.github.io/steering-reasoning-vla/
[LINK]http://arxiv.org/abs/2510.16281v2
[DATE]2026-01-14 13:03:39+08:00
[CATEGORIES]cs.LG
Deep Learning-based Binary Analysis for Vulnerability Detection in x86-64 Machine Code
[AUTHORS]Mitchell Petingola
[ABSTRACT]While much of the current research in deep learning-based vulnerability detection relies on disassembled binaries, this paper explores the feasibility of extracting features directly from raw x86-64 machine code. Although assembly language is more interpretable for humans, it requires more complex models to capture token-level context. In contrast, machine code may enable more efficient, lightweight models and preserve all information that might be lost in disassembly. This paper approaches the task of vulnerability detection through an exploratory study on two specific deep learning model architectures and aims to systematically evaluate their performance across three vulnerability types. The results demonstrate that graph-based models consistently outperform sequential models, emphasizing the importance of control flow relationships, and that machine code contains sufficient information for effective vulnerability discovery.
[LINK]http://arxiv.org/abs/2601.09157v1
[DATE]2026-01-14 12:52:47+08:00
[CATEGORIES]cs.LG
KTCF: Actionable Recourse in Knowledge Tracing via Counterfactual Explanations for Education
[AUTHORS]Woojin Kim, Changkwon Lee, Hyeoncheol Kim
[ABSTRACT]Using Artificial Intelligence to improve teaching and learning benefits greater adaptivity and scalability in education. Knowledge Tracing (KT) is recognized for student modeling task due to its superior performance and application potential in education. To this end, we conceptualize and investigate counterfactual explanation as the connection from XAI for KT to education. Counterfactual explanations offer actionable recourse, are inherently causal and local, and easy for educational stakeholders to understand who are often non-experts. We propose KTCF, a counterfactual explanation generation method for KT that accounts for knowledge concept relationships, and a post-processing scheme that converts a counterfactual explanation into a sequence of educational instructions. We experiment on a large-scale educational dataset and show our KTCF method achieves superior and robust performance over existing methods, with improvements ranging from 5.7% to 34% across metrics. Additionally, we provide a qualitative evaluation of our post-processing scheme, demonstrating that the resulting educational instructions help in reducing large study burden. We show that counterfactuals have the potential to advance the responsible and practical use of AI in education. Future works on XAI for KT may benefit from educationally grounded conceptualization and developing stakeholder-centered methods.
[COMMENTS]Accepted to AAAI-26 Special Track AI for Social Impact (oral presentation)
[LINK]http://arxiv.org/abs/2601.09156v1
[DATE]2026-01-14 12:51:54+08:00
[CATEGORIES]cs.LG
Interpretable Probability Estimation with LLMs via Shapley Reconstruction
[AUTHORS]Yang Nan, Qihao Wen, Jiahao Wang, Pengfei He, Ravi Tandon, Yong Ge, Han Xu
[ABSTRACT]Large Language Models (LLMs) demonstrate potential to estimate the probability of uncertain events, by leveraging their extensive knowledge and reasoning capabilities. This ability can be applied to support intelligent decision-making across diverse fields, such as financial forecasting and preventive healthcare. However, directly prompting LLMs for probability estimation faces significant challenges: their outputs are often noisy, and the underlying predicting process is opaque. In this paper, we propose PRISM: Probability Reconstruction via Shapley Measures, a framework that brings transparency and precision to LLM-based probability estimation. PRISM decomposes an LLM’s prediction by quantifying the marginal contribution of each input factor using Shapley values. These factor-level contributions are then aggregated to reconstruct a calibrated final estimate. In our experiments, we demonstrate PRISM improves predictive accuracy over direct prompting and other baselines, across multiple domains including finance, healthcare, and agriculture. Beyond performance, PRISM provides a transparent prediction pipeline: our case studies visualize how individual factors shape the final estimate, helping build trust in LLM-based decision support systems.
[LINK]http://arxiv.org/abs/2601.09151v1
[DATE]2026-01-14 12:45:36+08:00
[CATEGORIES]cs.LG
Distributional Machine Unlearning via Selective Data Removal
[AUTHORS]Youssef Allouah, Rachid Guerraoui, Sanmi Koyejo
[ABSTRACT]Machine learning systems increasingly face requirements to remove entire domains of information–such as toxic language or biases–rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain’s statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for Gaussian distributions and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15-82% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.
[LINK]http://arxiv.org/abs/2507.15112v4
[DATE]2026-01-14 12:19:32+08:00
[CATEGORIES]cs.LG
Why are there many equally good models? An Anatomy of the Rashomon Effect
[AUTHORS]Harsh Parikh
[ABSTRACT]The Rashomon effect – the existence of multiple, distinct models that achieve nearly equivalent predictive performance – has emerged as a fundamental phenomenon in modern machine learning and statistics. In this paper, we explore the causes underlying the Rashomon effect, organizing them into three categories: statistical sources arising from finite samples and noise in the data-generating process; structural sources arising from non-convexity of optimization objectives and unobserved variables that create fundamental non-identifiability; and procedural sources arising from limitations of optimization algorithms and deliberate restrictions to suboptimal model classes. We synthesize insights from machine learning, statistics, and optimization literature to provide a unified framework for understanding why the multiplicity of good models arises. A key distinction emerges: statistical multiplicity diminishes with more data, structural multiplicity persists asymptotically and cannot be resolved without different data or additional assumptions, and procedural multiplicity reflects choices made by practitioners. Beyond characterizing causes, we discuss both the challenges and opportunities presented by the Rashomon effect, including implications for inference, interpretability, fairness, and decision-making under uncertainty.
[LINK]http://arxiv.org/abs/2601.06730v2
[DATE]2026-01-14 12:15:47+08:00
[CATEGORIES]cs.LG
Uncertainty-Aware Multi-Objective Reinforcement Learning-Guided Diffusion Models for 3D De Novo Molecular Design
[AUTHORS]Lianghong Chen, Dongkyu Eugene Kim, Mike Domaratzki, Pingzhao Hu
[ABSTRACT]Designing de novo 3D molecules with desirable properties remains a fundamental challenge in drug discovery and molecular engineering. While diffusion models have demonstrated remarkable capabilities in generating high-quality 3D molecular structures, they often struggle to effectively control complex multi-objective constraints critical for real-world applications. In this study, we propose an uncertainty-aware Reinforcement Learning (RL) framework to guide the optimization of 3D molecular diffusion models toward multiple property objectives while enhancing the overall quality of the generated molecules. Our method leverages surrogate models with predictive uncertainty estimation to dynamically shape reward functions, facilitating balance across multiple optimization objectives. We comprehensively evaluate our framework across three benchmark datasets and multiple diffusion model architectures, consistently outperforming baselines for molecular quality and property optimization. Additionally, Molecular Dynamics (MD) simulations and ADMET profiling of top generated candidates indicate promising drug-like behavior and binding stability, comparable to known Epidermal Growth Factor Receptor (EGFR) inhibitors. Our results demonstrate the strong potential of RL-guided generative diffusion models for advancing automated molecular design.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21153v2
[DATE]2026-01-14 12:12:09+08:00
[CATEGORIES]cs.LG
Trustworthy scientific inference with generative models
[AUTHORS]James Carzon, Luca Masserano, Joshua D. Ingram, Alex Shen, Antonio Carlos Herling Ribeiro Junior, Tommaso Dorigo, Michele Doro, Joshua S. Speagle, Rafael Izbicki, Ann B. Lee
[ABSTRACT]Generative artificial intelligence (AI) excels at producing complex data structures (text, images, videos) by learning patterns from training examples. Across scientific disciplines, researchers are now applying generative models to “inverse problems” to directly predict hidden parameters from observed data along with measures of uncertainty. While these predictive or posterior-based methods can handle intractable likelihoods and large-scale studies, they can also produce biased or overconfident conclusions even without model misspecifications. We present a solution with Frequentist-Bayes (FreB), a mathematically rigorous protocol that reshapes AI-generated posterior probability distributions into (locally valid) confidence regions that consistently include true parameters with the expected probability, while achieving minimum size when training and target data align. We demonstrate FreB’s effectiveness by tackling diverse case studies in the physical sciences: identifying unknown sources under dataset shift, reconciling competing theoretical models, and mitigating selection bias and systematics in observational studies. By providing validity guarantees with interpretable diagnostics, FreB enables trustworthy scientific inference across fields where direct likelihood evaluation remains impossible or prohibitively expensive.
[LINK]http://arxiv.org/abs/2508.02602v3
[DATE]2026-01-14 11:45:17+08:00
[CATEGORIES]cs.LG
A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication
[AUTHORS]Yufan Xia, Marco De La Pierre, Amanda S. Barnard, Giuseppe Maria Junior Barca
[ABSTRACT]The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS when using GEMM of memory usage within 100 MB.
[COMMENTS]2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
[LINK]http://arxiv.org/abs/2601.09114v1
[DATE]2026-01-14 11:28:54+08:00
[CATEGORIES]cs.LG
Enhancing Imbalanced Electrocardiogram Classification: A Novel Approach Integrating Data Augmentation through Wavelet Transform and Interclass Fusion
[AUTHORS]Haijian Shao, Wei Liu, Xing Deng, Daze Lu
[ABSTRACT]Imbalanced electrocardiogram (ECG) data hampers the efficacy and resilience of algorithms in the automated processing and interpretation of cardiovascular diagnostic information, which in turn impedes deep learning-based ECG classification. Notably, certain cardiac conditions that are infrequently encountered are disproportionately underrepresented in these datasets. Although algorithmic generation and oversampling of specific ECG signal types can mitigate class skew, there is a lack of consensus regarding the effectiveness of such techniques in ECG classification. Furthermore, the methodologies and scenarios of ECG acquisition introduce noise, further complicating the processing of ECG data. This paper presents a significantly enhanced ECG classifier that simultaneously addresses both class imbalance and noise-related challenges in ECG analysis, as observed in the CPSC 2018 dataset. Specifically, we propose the application of feature fusion based on the wavelet transform, with a focus on wavelet transform-based interclass fusion, to generate the training feature library and the test set feature library. Subsequently, the original training and test data are amalgamated with their respective feature databases, resulting in more balanced training and test datasets. Employing this approach, our ECG model achieves recognition accuracies of up to 99%, 98%, 97%, 98%, 96%, 92%, and 93% for Normal, AF, I-AVB, LBBB, RBBB, PAC, PVC, STD, and STE, respectively. Furthermore, the average recognition accuracy for these categories ranges between 92\% and 98\%. Notably, our proposed data fusion methodology surpasses any known algorithms in terms of ECG classification accuracy in the CPSC 2018 dataset.
[COMMENTS]18 pages, 9 figures, 3 tables, 1 algorithm
[LINK]http://arxiv.org/abs/2601.09103v1
[DATE]2026-01-14 11:09:13+08:00
[CATEGORIES]cs.LG
Comparative Assessment of Concrete Compressive Strength Prediction at Industry Scale Using Embedding-based Neural Networks, Transformers, and Traditional Machine Learning Approaches
[AUTHORS]Md Asiful Islam, Md Ahmed Al Muzaddid, Afia Jahin Prema, Sreenath Reddy Vuske
[ABSTRACT]Concrete is the most widely used construction material worldwide; however, reliable prediction of compressive strength remains challenging due to material heterogeneity, variable mix proportions, and sensitivity to field and environmental conditions. Recent advances in artificial intelligence enable data-driven modeling frameworks capable of supporting automated decision-making in construction quality control. This study leverages an industry-scale dataset consisting of approximately 70,000 compressive strength test records to evaluate and compare multiple predictive approaches, including linear regression, decision trees, random forests, transformer-based neural networks, and embedding-based neural networks. The models incorporate key mixture design and placement variables such as water cement ratio, cementitious material content, slump, air content, temperature, and placement conditions. Results indicate that the embedding-based neural network consistently outperforms traditional machine learning and transformer-based models, achieving a mean 28-day prediction error of approximately 2.5%. This level of accuracy is comparable to routine laboratory testing variability, demonstrating the potential of embedding-based learning frameworks to enable automated, data-driven quality control and decision support in large-scale construction operations.
[LINK]http://arxiv.org/abs/2601.09096v1
[DATE]2026-01-14 10:58:05+08:00
[CATEGORIES]cs.LG
Hidden States as Early Signals: Step-level Trace Evaluation and Pruning for Efficient Test-Time Scaling
[AUTHORS]Zhixiang Liang, Beichen Huang, Zheng Wang, Minjia Zhang
[ABSTRACT]Large Language Models (LLMs) can enhance reasoning capabilities through test-time scaling by generating multiple traces. However, the combination of lengthy reasoning traces with multiple sampling introduces substantial computation and high end-to-end latency. Prior work on accelerating this process has relied on similarity-based or confidence-based pruning, but these signals do not reliably indicate trace quality. To address these limitations, we propose STEP: Step-level Trace Evaluation and Pruning, a novel pruning framework that evaluates reasoning steps using hidden states and dynamically prunes unpromising traces during generation. We train a lightweight step scorer to estimate trace quality, and design a GPU memory-aware pruning strategy that triggers pruning as the GPU memory is saturated by KV cache to reduce end-to-end latency. Experiments across challenging reasoning benchmarks demonstrate that STEP reduces end-to-end inference latency by 45%-70% on average compared to self-consistency while also improving reasoning accuracy. Our code is released at: https://github.com/Supercomputing-System-AI-Lab/STEP
[LINK]http://arxiv.org/abs/2601.09093v1
[DATE]2026-01-14 10:54:55+08:00
[CATEGORIES]cs.LG
MORE: Multi-Objective Adversarial Attacks on Speech Recognition
[AUTHORS]Xiaoxue Gao, Zexin Li, Yiming Chen, Nancy F. Chen
[ABSTRACT]The emergence of large-scale automatic speech recognition (ASR) models such as Whisper has greatly expanded their adoption across diverse real-world applications. Ensuring robustness against even minor input perturbations is therefore critical for maintaining reliable performance in real-time environments. While prior work has mainly examined accuracy degradation under adversarial attacks, robustness with respect to efficiency remains largely unexplored. This narrow focus provides only a partial understanding of ASR model vulnerabilities. To address this gap, we conduct a comprehensive study of ASR robustness under multiple attack scenarios. We introduce MORE, a multi-objective repetitive doubling encouragement attack, which jointly degrades recognition accuracy and inference efficiency through a hierarchical staged repulsion-anchoring mechanism. Specifically, we reformulate multi-objective adversarial optimization into a hierarchical framework that sequentially achieves the dual objectives. To further amplify effectiveness, we propose a novel repetitive encouragement doubling objective (REDO) that induces duplicative text generation by maintaining accuracy degradation and periodically doubling the predicted sequence length. Overall, MORE compels ASR models to produce incorrect transcriptions at a substantially higher computational cost, triggered by a single adversarial input. Experiments show that MORE consistently yields significantly longer transcriptions while maintaining high word error rates compared to existing baselines, underscoring its effectiveness in multi-objective adversarial attack.
[COMMENTS]19 pages
[LINK]http://arxiv.org/abs/2601.01852v2
[DATE]2026-01-14 10:50:21+08:00
[CATEGORIES]cs.LG
SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache
[AUTHORS]Chi-Chih Chang, Siqi Zhu, Zhichen Zeng, Haibin Lin, Jiaxuan You, Mohamed S. Abdelfattah, Ziheng Jiang, Xuehai Qian
[ABSTRACT]We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the empirical similarity of rollouts for the same prompt across training steps by storing previously generated continuations in a per-prompt tree-structured cache. During generation, the current policy uses this tree as the draft model for performing speculative decoding. To keep the cache fresh and improve draft model quality, SRT updates trees online from ongoing rollouts and proactively performs run-ahead generation during idle GPU bubbles. Integrated into standard RL pipelines (\textit{e.g.}, PPO, GRPO and DAPO) and multi-turn settings, SRT consistently reduces generation and step latency and lowers per-token inference cost, achieving up to 2.08x wall-clock time speedup during rollout.
[LINK]http://arxiv.org/abs/2601.09083v1
[DATE]2026-01-14 10:34:48+08:00
[CATEGORIES]cs.LG
Lean Clients, Full Accuracy: Hybrid Zeroth- and First-Order Split Federated Learning
[AUTHORS]Zhoubin Kou, Zihan Chen, Jing Yang, Cong Shen
[ABSTRACT]Split Federated Learning (SFL) enables collaborative training between resource-constrained edge devices and a compute-rich server. Communication overhead is a central issue in SFL and can be mitigated with auxiliary networks. Yet, the fundamental client-side computation challenge remains, as back-propagation requires substantial memory and computation costs, severely limiting the scale of models that edge devices can support. To enable more resource-efficient client computation and reduce the client-server communication, we propose HERON-SFL, a novel hybrid optimization framework that integrates zeroth-order (ZO) optimization for local client training while retaining first-order (FO) optimization on the server. With the assistance of auxiliary networks, ZO updates enable clients to approximate local gradients using perturbed forward-only evaluations per step, eliminating memory-intensive activation caching and avoiding explicit gradient computation in the traditional training process. Leveraging the low effective rank assumption, we theoretically prove that HERON-SFL’s convergence rate is independent of model dimensionality, addressing a key scalability concern common to ZO algorithms. Empirically, on ResNet training and language model (LM) fine-tuning tasks, HERON-SFL matches benchmark accuracy while reducing client peak memory by up to 64% and client-side compute cost by up to 33% per step, substantially expanding the range of models that can be trained or adapted on resource-limited devices.
[LINK]http://arxiv.org/abs/2601.09076v1
[DATE]2026-01-14 10:17:49+08:00
[CATEGORIES]cs.LG
Stock Price Responses to Firm-Level News in Supply Chain Networks
[AUTHORS]Hiroyasu Inoue, Yasuyuki Todo
[ABSTRACT]This study examines how positive and negative news about firms is associated with stock prices and whether these associations extend to firms’ suppliers and clients connected through supply chain relationships, using large samples of publicly listed firms worldwide and in Japan. News sentiment is measured using FinBERT, a natural language processing model fine-tuned for financial texts, and supply chain links are identified from financial statements for global firms and from large-scale firm-level surveys for Japanese firms. We find that stock prices exhibit systematic associations with positive and negative news even before public disclosure. These associations are also observed for suppliers and clients before and after disclosure. In general, relative to the pre-disclosure period, post-disclosure associations are larger, with the difference concentrated around the time of public news disclosure. However, for Japanese firms, the post-disclosure associations for suppliers and clients are smaller than the pre-disclosure associations, in contrast to the pattern observed for firms outside Japan.
[LINK]http://arxiv.org/abs/2409.06255v5
[DATE]2026-01-14 10:00:49+08:00
[CATEGORIES]cs.LG
Optimism Without Regularization: Constant Regret in Zero-Sum Games
[AUTHORS]John Lazarsfeld, Georgios Piliouras, Ryann Sim, Stratis Skoulakis
[ABSTRACT]This paper studies the optimistic variant of Fictitious Play for learning in two-player zero-sum games. While it is known that Optimistic FTRL – a regularized algorithm with a bounded stepsize parameter – obtains constant regret in this setting, we show for the first time that similar, optimal rates are also achievable without regularization: we prove for two-strategy games that Optimistic Fictitious Play (using any tiebreaking rule) obtains only constant regret, providing surprising new evidence on the ability of non-no-regret algorithms for fast learning in games. Our proof technique leverages a geometric view of Optimistic Fictitious Play in the dual space of payoff vectors, where we show a certain energy function of the iterates remains bounded over time. Additionally, we also prove a regret lower bound of $Ω(\sqrt{T})$ for Alternating Fictitious Play. In the unregularized regime, this separates the ability of optimism and alternation in achieving $o(\sqrt{T})$ regret.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.16736v3
[DATE]2026-01-14 09:56:57+08:00
[CATEGORIES]cs.LG
Detecting Batch Heterogeneity via Likelihood Clustering
[AUTHORS]Austin Talbot, Yue Ke
[ABSTRACT]Batch effects represent a major confounder in genomic diagnostics. In copy number variant (CNV) detection from NGS, many algorithms compare read depth between test samples and a reference sample, assuming they are process-matched. When this assumption is violated, with causes ranging from reagent lot changes to multi-site processing, the reference becomes inappropriate, introducing false CNV calls or masking true pathogenic variants. Detecting such heterogeneity before downstream analysis is critical for reliable clinical interpretation. Existing batch effect detection methods either cluster samples based on raw features, risking conflation of biological signal with technical variation, or require known batch labels that are frequently unavailable. We introduce a method that addresses both limitations by clustering samples according to their Bayesian model evidence. The central insight is that evidence quantifies compatibility between data and model assumptions, technical artifacts violate assumptions and reduce evidence, whereas biological variation, including CNV status, is anticipated by the model and yields high evidence. This asymmetry provides a discriminative signal that separates batch effects from biology. We formalize heterogeneity detection as a likelihood ratio test for mixture structure in evidence space, using parametric bootstrap calibration to ensure conservative false positive rates. We validate our approach on synthetic data demonstrating proper Type I error control, three clinical targeted sequencing panels (liquid biopsy, BRCA, and thalassemia) exhibiting distinct batch effect mechanisms, and mouse electrophysiology recordings demonstrating cross-modality generalization. Our method achieves superior clustering accuracy compared to standard correlation-based and dimensionality-reduction approaches while maintaining the conservativeness required for clinical usage.
[LINK]http://arxiv.org/abs/2601.09758v1
[DATE]2026-01-14 09:49:21+08:00
[CATEGORIES]cs.LG
When do spectral gradient updates help in deep learning?
[AUTHORS]Damek Davis, Dmitriy Drusvyatskiy
[ABSTRACT]Spectral gradient methods, such as the recently popularized Muon optimizer, are a promising alternative to standard Euclidean gradient descent for training deep neural networks and transformers, but it is still unclear in which regimes they are expected to perform better. We propose a simple layerwise condition that predicts when a spectral update yields a larger decrease in the loss than a Euclidean gradient step. This condition compares, for each parameter block, the squared nuclear-to-Frobenius ratio of the gradient to the stable rank of the incoming activations. To understand when this condition may be satisfied, we first prove that post-activation matrices have low stable rank at Gaussian initialization in random feature regression, feedforward networks, and transformer blocks. In spiked random feature models we then show that, after a short burn-in, the Euclidean gradient’s nuclear-to-Frobenius ratio grows with the data dimension while the stable rank of the activations remains bounded, so the predicted advantage of spectral updates scales with dimension. We validate these predictions in synthetic regression experiments and in NanoGPT-scale language model training, where we find that intermediate activations have low-stable-rank throughout training and the corresponding gradients maintain large nuclear-to-Frobenius ratios. Together, these results identify conditions for spectral gradient methods, such as Muon, to be effective in training deep networks and transformers.
[LINK]http://arxiv.org/abs/2512.04299v2
[DATE]2026-01-14 09:20:17+08:00
[CATEGORIES]cs.LG
Coupling Generative Modeling and an Autoencoder with the Causal Bridge
[AUTHORS]Ruolin Meng, Ming-Yu Chung, Dhanajit Brahma, Ricardo Henao, Lawrence Carin
[ABSTRACT]We consider inferring the causal effect of a treatment (intervention) on an outcome of interest in situations where there is potentially an unobserved confounder influencing both the treatment and the outcome. This is achievable by assuming access to two separate sets of control (proxy) measurements associated with treatment and outcomes, which are used to estimate treatment effects through a function termed the em causal bridge (CB). We present a new theoretical perspective, associated assumptions for when estimating treatment effects with the CB is feasible, and a bound on the average error of the treatment effect when the CB assumptions are violated. From this new perspective, we then demonstrate how coupling the CB with an autoencoder architecture allows for the sharing of statistical strength between observed quantities (proxies, treatment, and outcomes), thus improving the quality of the CB estimates. Experiments on synthetic and real-world data demonstrate the effectiveness of the proposed approach in relation to the state-of-the-art methodology for proxy measurements.
[COMMENTS]Accepted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2509.25599v2
[DATE]2026-01-14 09:18:27+08:00
[CATEGORIES]cs.LG
Deep Incomplete Multi-View Clustering via Hierarchical Imputation and Alignment
[AUTHORS]Yiming Du, Ziyu Wang, Jian Li, Rui Ning, Lusi Li
[ABSTRACT]Incomplete multi-view clustering (IMVC) aims to discover shared cluster structures from multi-view data with partial observations. The core challenges lie in accurately imputing missing views without introducing bias, while maintaining semantic consistency across views and compactness within clusters. To address these challenges, we propose DIMVC-HIA, a novel deep IMVC framework that integrates hierarchical imputation and alignment with four key components: (1) view-specific autoencoders for latent feature extraction, coupled with a view-shared clustering predictor to produce soft cluster assignments; (2) a hierarchical imputation module that first estimates missing cluster assignments based on cross-view contrastive similarity, and then reconstructs missing features using intra-view, intra-cluster statistics; (3) an energy-based semantic alignment module, which promotes intra-cluster compactness by minimizing energy variance around low-energy cluster anchors; and (4) a contrastive assignment alignment module, which enhances cross-view consistency and encourages confident, well-separated cluster predictions. Experiments on benchmarks demonstrate that our framework achieves superior performance under varying levels of missingness.
[COMMENTS]Accepted by AAAI 2026
[LINK]http://arxiv.org/abs/2601.09051v1
[DATE]2026-01-14 08:46:00+08:00
[CATEGORIES]cs.LG
Horseshoe Mixtures-of-Experts (HS-MoE)
[AUTHORS]Nick Polson, Vadim Sokolov
[ABSTRACT]Horseshoe mixtures-of-experts (HS-MoE) models provide a Bayesian framework for sparse expert selection in mixture-of-experts architectures. We combine the horseshoe prior’s adaptive global-local shrinkage with input-dependent gating, yielding data-adaptive sparsity in expert usage. Our primary methodological contribution is a particle learning algorithm for sequential inference, in which the filter is propagated forward in time while tracking only sufficient statistics. We also discuss how HS-MoE relates to modern mixture-of-experts layers in large language models, which are deployed under extreme sparsity constraints (e.g., activating a small number of experts per token out of a large pool).
[LINK]http://arxiv.org/abs/2601.09043v1
[DATE]2026-01-14 08:18:54+08:00
[CATEGORIES]cs.LG
SCaLE: Switching Cost aware Learning and Exploration
[AUTHORS]Neelkamal Bhuyan, Debankur Mukherjee, Adam Wierman
[ABSTRACT]This work addresses the fundamental problem of unbounded metric movement costs in bandit online convex optimization, by considering high-dimensional dynamic quadratic hitting costs and $\ell_2$-norm switching costs in a noisy bandit feedback model. For a general class of stochastic environments, we provide the first algorithm SCaLE that provably achieves a distribution-agnostic sub-linear dynamic regret, without the knowledge of hitting cost structure. En-route, we present a novel spectral regret analysis that separately quantifies eigenvalue-error driven regret and eigenbasis-perturbation driven regret. Extensive numerical experiments, against online-learning baselines, corroborate our claims, and highlight statistical consistency of our algorithm.
[COMMENTS]42 pages
[LINK]http://arxiv.org/abs/2601.09042v1
[DATE]2026-01-14 08:14:58+08:00
[CATEGORIES]cs.LG
Layer-Parallel Training for Transformers
[AUTHORS]Shuai Jiang, Marc Salvado, Eric C. Cyr, Alena Kopaničáková, Rolf Krause, Jacob B. Schroder
[ABSTRACT]We present a new training methodology for transformers using a multilevel, layer-parallel approach. Through a neural ODE formulation of transformers, our application of a multilevel parallel-in-time algorithm for the forward and backpropagation phases of training achieves parallel acceleration over the layer dimension. This dramatically enhances parallel scalability as the network depth increases, which is particularly useful for increasingly large foundational models. However, achieving this introduces errors that cause systematic bias in the gradients, which in turn reduces convergence when closer to the minima. We develop an algorithm to detect this critical transition and either switch to serial training or systematically increase the accuracy of layer-parallel training. Results, including BERT, GPT2, ViT, and machine translation architectures, demonstrate parallel-acceleration as well as accuracy commensurate with serial pre-training while fine-tuning is unaffected.
[COMMENTS]20 pages, 12 figures
[LINK]http://arxiv.org/abs/2601.09026v1
[DATE]2026-01-14 07:12:53+08:00
[CATEGORIES]cs.LG
Universal Latent Homeomorphic Manifolds: Cross-Domain Representation Learning via Homeomorphism Verification
[AUTHORS]Tong Wu, Tayab Uddin Wara, Daniel Hernandez, Sidong Lei
[ABSTRACT]We present the Universal Latent Homeomorphic Manifold (ULHM), a framework that unifies semantic representations (e.g., human descriptions, diagnostic labels) and observation-driven machine representations (e.g., pixel intensities, sensor readings) into a single latent structure. Despite originating from fundamentally different pathways, both modalities capture the same underlying reality. We establish \emph{homeomorphism}, a continuous bijection preserving topological structure, as the mathematical criterion for determining when latent manifolds induced by different semantic-observation pairs can be rigorously unified. This criterion provides theoretical guarantees for three critical applications: (1) semantic-guided sparse recovery from incomplete observations, (2) cross-domain transfer learning with verified structural compatibility, and (3) zero-shot compositional learning via valid transfer from semantic to observation space. Our framework learns continuous manifold-to-manifold transformations through conditional variational inference, avoiding brittle point-to-point mappings. We develop practical verification algorithms, including trust, continuity, and Wasserstein distance metrics, that empirically validate homeomorphic structure from finite samples. Experiments demonstrate: (1) sparse image recovery from 5\% of CelebA pixels and MNIST digit reconstruction at multiple sparsity levels, (2) cross-domain classifier transfer achieving 86.73\% accuracy from MNIST to Fashion-MNIST without retraining, and (3) zero-shot classification on unseen classes achieving 89.47\% on MNIST, 84.70\% on Fashion-MNIST, and 78.76\% on CIFAR-10. Critically, the homeomorphism criterion correctly rejects incompatible datasets, preventing invalid unification and providing a feasible way to principled decomposition of general foundation models into verified domain-specific components.
[LINK]http://arxiv.org/abs/2601.09025v1
[DATE]2026-01-14 07:08:16+08:00
[CATEGORIES]cs.LG
An Inexact Weighted Proximal Trust-Region Method
[AUTHORS]Leandro Farias Maia, Robert Baraldi, Drew P. Kouri
[ABSTRACT]In [R. J. Baraldi and D. P. Kouri, Math. Program., 201:1 (2023), pp. 559-598], the authors introduced a trust-region method for minimizing the sum of a smooth nonconvex and a nonsmooth convex function, the latter of which has an analytical proximity operator. While many functions satisfy this criterion, e.g., the $\ell_1$-norm defined on $\ell_2$, many others are precluded by either the topology or the nature of the nonsmooth term. Using the $δ$-Fréchet subdifferential, we extend the definition of the inexact proximity operator and enable its use within the aforementioned trust-region algorithm. Moreover, we augment the analysis for the standard trust-region convergence theory to handle proximity operator inexactness with weighted inner products. We first introduce an algorithm to generate a point in the inexact proximity operator and then apply the algorithm within the trust-region method to solve an optimal control problem constrained by Burgers’ equation.
[LINK]http://arxiv.org/abs/2601.09024v1
[DATE]2026-01-14 07:07:46+08:00
[CATEGORIES]cs.LG
Human-in-the-Loop Segmentation of Multi-species Coral Imagery
[AUTHORS]Scarlett Raine, Ross Marchant, Brano Kusy, Frederic Maire, Niko Suenderhauf, Tobias Fischer
[ABSTRACT]Marine surveys by robotic underwater and surface vehicles result in substantial quantities of coral reef imagery, however labeling these images is expensive and time-consuming for domain experts. Point label propagation is a technique that uses existing images labeled with sparse points to create augmented ground truth data, which can be used to train a semantic segmentation model. In this work, we show that recent advances in large foundation models facilitate the creation of augmented ground truth masks using only features extracted by the denoised version of the DINOv2 foundation model and K-Nearest Neighbors (KNN), without any pre-training. For images with extremely sparse labels, we use human-in-the-loop principles to enhance annotation efficiency: if there are 5 point labels per image, our method outperforms the prior state-of-the-art by 19.7% for mIoU. When human-in-the-loop labeling is not available, using the denoised DINOv2 features with a KNN still improves on the prior state-of-the-art by 5.8% for mIoU (5 grid points). On the semantic segmentation task, we outperform the prior state-of-the-art by 13.5% for mIoU when only 5 point labels are used for point label propagation. Additionally, we perform a comprehensive study into the number and placement of point labels, and make several recommendations for improving the efficiency of labeling images with points.
[COMMENTS]IEEE Journal of Oceanic Engineering accepted preprint of extended paper, 36 pages, 14 figures. Original conference paper (v2) accepted at the CVPR 2024 3rd Workshop on Learning with Limited Labelled Data for Image and Video Understanding (L3D-IVU)
[LINK]http://arxiv.org/abs/2404.09406v4
[DATE]2026-01-14 06:54:33+08:00
[CATEGORIES]cs.LG
Tail-Sensitive KL and Rényi Convergence of Unadjusted Hamiltonian Monte Carlo via One-Shot Couplings
[AUTHORS]Nawaf Bou-Rabee, Siddharth Mitra, Andre Wibisono
[ABSTRACT]Hamiltonian Monte Carlo (HMC) algorithms are among the most widely used sampling methods in high dimensional settings, yet their convergence properties are poorly understood in divergences that quantify relative density mismatch, such as Kullback-Leibler (KL) and Rényi divergences. These divergences naturally govern acceptance probabilities and warm-start requirements for Metropolis-adjusted Markov chains. In this work, we develop a framework for upgrading Wasserstein convergence guarantees for unadjusted Hamiltonian Monte Carlo (uHMC) to guarantees in tail-sensitive KL and Rényi divergences. Our approach is based on one-shot couplings, which we use to establish a regularization property of the uHMC transition kernel. This regularization allows Wasserstein-2 mixing-time and asymptotic bias bounds to be lifted to KL divergence, and analogous Orlicz-Wasserstein bounds to be lifted to Rényi divergence, paralleling earlier work of Bou-Rabee and Eberle (2023) that upgrade Wasserstein-1 bounds to total variation distance via kernel smoothing. As a consequence, our results provide quantitative control of relative density mismatch, clarify the role of discretization bias in strong divergences, and yield principled guarantees relevant both for unadjusted sampling and for generating warm starts for Metropolis-adjusted Markov chains.
[COMMENTS]64 pages
[LINK]http://arxiv.org/abs/2601.09019v1
[DATE]2026-01-14 06:39:23+08:00
[CATEGORIES]cs.LG
Meta-learning to Address Data Shift in Time Series Classification
[AUTHORS]Samuel Myren, Nidhi Parikh, Natalie Klein
[ABSTRACT]Across engineering and scientific domains, traditional deep learning (TDL) models perform well when training and test data share the same distribution. However, the dynamic nature of real-world data, broadly termed \textit{data shift}, renders TDL models prone to rapid performance degradation, requiring costly relabeling and inefficient retraining. Meta-learning, which enables models to adapt quickly to new data with few examples, offers a promising alternative for mitigating these challenges. Here, we systematically compare TDL with fine-tuning and optimization-based meta-learning algorithms to assess their ability to address data shift in time-series classification. We introduce a controlled, task-oriented seismic benchmark (SeisTask) and show that meta-learning typically achieves faster and more stable adaptation with reduced overfitting in data-scarce regimes and smaller model architectures. As data availability and model capacity increase, its advantages diminish, with TDL with fine-tuning performing comparably. Finally, we examine how task diversity influences meta-learning and find that alignment between training and test distributions, rather than diversity alone, drives performance gains. Overall, this work provides a systematic evaluation of when and why meta-learning outperforms TDL under data shift and contributes SeisTask as a benchmark for advancing adaptive learning research in time-series domains.
[LINK]http://arxiv.org/abs/2601.09018v1
[DATE]2026-01-14 06:38:43+08:00
[CATEGORIES]cs.LG
Block Decomposable Methods for Large-Scale Optimization Problems
[AUTHORS]Leandro Farias Maia
[ABSTRACT]This dissertation explores block decomposable methods for large-scale optimization problems. It focuses on alternating direction method of multipliers (ADMM) schemes and block coordinate descent (BCD) methods. Specifically, it introduces a new proximal ADMM algorithm and proposes two BCD methods. The first part of the research presents a new proximal ADMM algorithm. This method is adaptive to all problem parameters and solves the proximal augmented Lagrangian (AL) subproblem inexactly. This adaptiveness facilitates the highly efficient application of the algorithm to a broad swath of practical problems. The inexact solution of the proximal AL subproblem overcomes many key challenges in the practical applications of ADMM. The resultant algorithm obtains an approximate solution of an optimization problem in a number of iterations that matches the state-of-the-art complexity for the class of proximal ADMM schemes. The second part of the research focuses on an inexact proximal mapping for the class of block proximal gradient methods. Key properties of this operator is established, facilitating the derivation of convergence rates for the proposed algorithm. Under two error decreases conditions, the algorithm matches the convergence rate of its exactly computed counterpart. Numerical results demonstrate the superior performance of the algorithm under a dynamic error regime over a fixed one. The dissertation concludes by providing convergence guarantees for the randomized BCD method applied to a broad class of functions, known as Hölder smooth functions. Convergence rates are derived for non-convex, convex, and strongly convex functions. These convergence rates match those furnished in the existing literature for the Lipschtiz smooth setting.
[LINK]http://arxiv.org/abs/2601.09010v1
[DATE]2026-01-14 06:18:57+08:00
[CATEGORIES]cs.LG
Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers
[AUTHORS]Annalisa Belloni, Lorenzo Noci, Antonio Orvieto
[ABSTRACT]The Warmup Stable Decay (WSD) learning rate scheduler has recently become popular, largely due to its good performance and flexibility when training large language models. It remains an open question whether the remarkable performance of WSD - using a decaying learning rate for only a fraction of training compared to cosine decay - is a phenomenon specific to transformer-based language models that can potentially offer new theoretical insights into their training dynamics. Inspired by the usage of learning rate schedulers as a new lens into understanding landscape geometry (e.g., river valley, connected minima, progressive sharpening), in this work we compare the WSD path of the Adam optimizer on a Pythia-like language model to that of a small CNN trained to classify CIFAR10 images. We observe most training signals, optimizer path features, and sharpness dynamics to be qualitatively similar in such architectures. This consistency points to shared geometric characteristics of the loss landscapes of old and new nonconvex problems, and hints to future research questions around the geometry of high dimensional optimization problems.
[COMMENTS]Accepted at the 2025 HiLD and MOSS Workshops at ICML
[LINK]http://arxiv.org/abs/2601.09000v1
[DATE]2026-01-14 05:54:22+08:00
[CATEGORIES]cs.LG
Physics-Guided Counterfactual Explanations for Large-Scale Multivariate Time Series: Application in Scalable and Interpretable SEP Event Prediction
[AUTHORS]Pranjal Patil, Anli Ji, Berkay Aydin
[ABSTRACT]Accurate prediction of solar energetic particle events is vital for safeguarding satellites, astronauts, and space-based infrastructure. Modern space weather monitoring generates massive volumes of high-frequency, multivariate time series (MVTS) data from sources such as the Geostationary perational Environmental Satellites (GOES). Machine learning (ML) models trained on this data show strong predictive power, but most existing methods overlook domain-specific feasibility constraints. Counterfactual explanations have emerged as a key tool for improving model interpretability, yet existing approaches rarely enforce physical plausibility. This work introduces a Physics-Guided Counterfactual Explanation framework, a novel method for generating counterfactual explanations in time series classification tasks that remain consistent with underlying physical principles. Applied to solar energetic particles (SEP) forecasting, this framework achieves over 80% reduction in Dynamic Time Warping (DTW) distance increasing the proximity, produces counterfactual explanations with higher sparsity, and reduces runtime by nearly 50% compared to state-of-the-art baselines such as DiCE. Beyond numerical improvements, this framework ensures that generated counterfactual explanations are physically plausible and actionable in scientific domains. In summary, the framework generates counterfactual explanations that are both valid and physically consistent, while laying the foundation for scalable counterfactual generation in big data environments.
[COMMENTS]This is a pre-print of an accepted paper at IEEE BigData 2025, SS 11:Towards an Understanding of Artificial Intelligence: Bridging Theory, Explainability, and Practical Applications
[LINK]http://arxiv.org/abs/2601.08999v1
[DATE]2026-01-14 05:48:51+08:00
[CATEGORIES]cs.LG
Optimising for Energy Efficiency and Performance in Machine Learning
[AUTHORS]Emile Dos Santos Ferreira, Neil D. Lawrence, Andrei Paleyes
[ABSTRACT]The ubiquity of machine learning (ML) and the demand for ever-larger models bring an increase in energy consumption and environmental impact. However, little is known about the energy scaling laws in ML, and existing research focuses on training cost – ignoring the larger cost of inference. Furthermore, tools for measuring the energy consumption of ML do not provide actionable feedback.
To address these gaps, we developed Energy Consumption Optimiser (ECOpt): a hyperparameter tuner that optimises for energy efficiency and model performance. ECOpt quantifies the trade-off between these metrics as an interpretable Pareto frontier. This enables ML practitioners to make informed decisions about energy cost and environmental impact, while maximising the benefit of their models and complying with new regulations.
Using ECOpt, we show that parameter and floating-point operation counts can be unreliable proxies for energy consumption, and observe that the energy efficiency of Transformer models for text generation is relatively consistent across hardware. These findings motivate measuring and publishing the energy metrics of ML models. We further show that ECOpt can have a net positive environmental impact and use it to uncover seven models for CIFAR-10 that improve upon the state of the art, when considering accuracy and energy efficiency together.
[COMMENTS]Accepted to CAIN’26
[LINK]http://arxiv.org/abs/2601.08991v1
[DATE]2026-01-14 05:28:58+08:00
[CATEGORIES]cs.LG
Game of Coding: Sybil Resistant Decentralized Machine Learning with Minimal Trust Assumption
[AUTHORS]Hanzaleh Akbari Nodehi, Viveck R. Cadambe, Mohammad Ali Maddah-Ali
[ABSTRACT]Coding theory plays a crucial role in ensuring data integrity and reliability across various domains, from communication to computation and storage systems. However, its reliance on trust assumptions for data recovery, which requires the number of honest nodes to exceed adversarial nodes by a certain margin, poses significant challenges, particularly in emerging decentralized systems where trust is a scarce resource. To address this, the game of coding framework was introduced, offering insights into strategies for data recovery within incentive-oriented environments. In such environments, participant nodes are rewarded as long as the system remains functional (live). This incentivizes adversaries to maximize their rewards (utility) by ensuring that the decoder, as the data collector (DC), successfully recovers the data, preferably with a high estimation error. This rational behavior is leveraged in a game-theoretic framework, where the equilibrium leads to a robust and resilient system, referred to as the game of coding. The focus of the earliest version of the game of coding was limited to scenarios involving only two nodes. In this paper, we generalize the game of coding framework to scenarios with $N \ge 2$ nodes, exploring critical aspects of system behavior. Specifically, we (i) demonstrate that the adversary’s utility at equilibrium is non-increasing with additional adversarial nodes, ensuring no gain for the adversary and no pain for the DC, thus establishing the game of coding framework’s Sybil resistance; (ii) show that increasing the number of honest nodes does not always enhance the DC’s utility, providing examples and proposing an algorithm to identify and mitigate this counterintuitive effect; and (iii) outline the optimal strategies for both the DC and the adversary, demonstrating that the system achieves enhanced liveness at equilibrium.
[LINK]http://arxiv.org/abs/2410.05540v3
[DATE]2026-01-14 05:09:38+08:00
[CATEGORIES]cs.LG
Continuous Fairness On Data Streams
[AUTHORS]Subhodeep Ghosh, Zhihui Du, Angela Bonifati, Manish Kumar, David Bader, Senjuti Basu Roy
[ABSTRACT]We study the problem of enforcing continuous group fairness over windows in data streams. We propose a novel fairness model that ensures group fairness at a finer granularity level (referred to as block) within each sliding window. This formulation is particularly useful when the window size is large, making it desirable to enforce fairness at a finer granularity. Within this framework, we address two key challenges: efficiently monitoring whether each sliding window satisfies block-level group fairness, and reordering the current window as effectively as possible when fairness is violated. To enable real-time monitoring, we design sketch-based data structures that maintain attribute distributions with minimal overhead. We also develop optimal, efficient algorithms for the reordering task, supported by rigorous theoretical guarantees. Our evaluation on four real-world streaming scenarios demonstrates the practical effectiveness of our approach. We achieve millisecond-level processing and a throughput of approximately 30,000 queries per second on average, depending on system parameters. The stream reordering algorithm improves block-level group fairness by up to 95% in certain cases, and by 50-60% on average across datasets. A qualitative study further highlights the advantages of block-level fairness compared to window-level fairness.
[LINK]http://arxiv.org/abs/2601.08976v1
[DATE]2026-01-14 04:51:20+08:00
[CATEGORIES]cs.LG
Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment
[AUTHORS]Xin Lei Lin, Soroush Mehraban, Abhishek Moturu, Babak Taati
[ABSTRACT]Automated pain assessment from facial expressions is crucial for non-communicative patients, such as those with dementia. Progress has been limited by two challenges: (i) existing datasets exhibit severe demographic and label imbalance due to ethical constraints, and (ii) current generative models cannot precisely control facial action units (AUs), facial structure, or clinically validated pain levels.
We present 3DPain, a large-scale synthetic dataset specifically designed for automated pain assessment, featuring unprecedented annotation richness and demographic diversity. Our three-stage framework generates diverse 3D meshes, textures them with diffusion models, and applies AU-driven face rigging to synthesize multi-view faces with paired neutral and pain images, AU configurations, PSPI scores, and the first dataset-level annotations of pain-region heatmaps. The dataset comprises 82,500 samples across 25,000 pain expression heatmaps and 2,500 synthetic identities balanced by age, gender, and ethnicity.
We further introduce ViTPain, a Vision Transformer based cross-modal distillation framework in which a heatmap-trained teacher guides a student trained on RGB images, enhancing accuracy, interpretability, and clinical reliability. Together, 3DPain and ViTPain establish a controllable, diverse, and clinically grounded foundation for generalizable automated pain assessment.
[LINK]http://arxiv.org/abs/2509.16727v4
[DATE]2026-01-14 04:50:25+08:00
[CATEGORIES]cs.LG
Benchmarking Positional Encodings for GNNs and Graph Transformers
[AUTHORS]Florian Grötschla, Jiaqing Xie, Roger Wattenhofer
[ABSTRACT]Positional Encodings (PEs) are essential for injecting structural information into Graph Neural Networks (GNNs), particularly Graph Transformers, yet their empirical impact remains insufficiently understood. We introduce a unified benchmarking framework that decouples PEs from architectural choices, enabling a fair comparison across 8 GNN and Transformer models, 9 PEs, and 10 synthetic and real-world datasets. Across more than 500 model-PE-dataset configurations, we find that commonly used expressiveness proxies, including Weisfeiler-Lehman distinguishability, do not reliably predict downstream performance. In particular, highly expressive PEs frequently fail to improve, and can even degrade performance on real-world tasks. At the same time, we identify several simple and previously overlooked model-PE combinations that match or outperform recent state-of-the-art methods. Our results demonstrate the strong task-dependence of PEs and underscore the need for empirical validation beyond theoretical expressiveness. To support reproducible research, we release an open-source benchmarking framework for evaluating PEs for graph learning tasks.
[COMMENTS]Accepted at KDD 2026 Datasets & Benchmarks Track
[LINK]http://arxiv.org/abs/2411.12732v2
[DATE]2026-01-14 04:35:35+08:00
[CATEGORIES]cs.LG
Machine Learning-Driven Creep Law Discovery Across Alloy Compositional Space
[AUTHORS]Hongshun Chen, Ryan Zhou, Rujing Zha, Zihan Chen, Wenpan Li, Rowan Rolark, John Patrick Reidy, Jian Cao, Ping Guo, David C. Dunand, Horacio D. Espinosa
[ABSTRACT]Hihg-temperature creep characterization of structural alloys traditionally relies on serial uniaxial tests, which are highly inefficient for exploring the large search space of alloy compositions and for material discovery. Here, we introduce a machine-learning-assisted, high-throughput framework for creep law identification based on a dimple array bulge instrument (DABI) configuration, which enables parallel creep testing of 25 dimples, each fabricated from a different alloy, in a single experiment. Full-field surface displacements of dimples undergoing time-dependent creep-induced bulging under inert gas pressure are measured by 3D digital image correlation. We train a recurrent neural network (RNN) as a surrogate model, mapping creep parameters and loading conditions to the time-dependent deformation response of DABI. Coupling this surrogate with a particle swarm optimization scheme enables rapid and global inverse identification with sparsity regularization of creep parameters from experiment displacement-time histories. In addition, we propose a phenomenological creep law with a time-dependent stress exponent that captures the sigmoidal primary creep observed in wrought INCONEL 625 and extracts its temperature dependence from DABI test at multiple temperatures. Furthermore, we employ a general creep law combining several conventional forms together with regularized inversion to identify the creep laws for 47 additional Fe-, Ni-, and Co-rich alloys and to automatically select the dominant functional form for each alloy. This workflow combined with DABI experiment provides a quantitative, high-throughput creep characterization platform that is compatible with data mining, composition-property modeling, and nonlinear structural optimization with creep behavior across a large alloy design space.
[COMMENTS]27 pages, 7 figures
[LINK]http://arxiv.org/abs/2601.08970v1
[DATE]2026-01-14 04:29:15+08:00
[CATEGORIES]cs.LG
Breaking the Bottlenecks: Scalable Diffusion Models for 3D Molecular Generation
[AUTHORS]Adrita Das, Peiran Jiang, Dantong Zhu, Barnabas Poczos, Jose Lugo-Martinez
[ABSTRACT]Diffusion models have emerged as a powerful class of generative models for molecular design, capable of capturing complex structural distributions and achieving high fidelity in 3D molecule generation. However, their widespread use remains constrained by long sampling trajectories, stochastic variance in the reverse process, and limited structural awareness in denoising dynamics. The Directly Denoising Diffusion Model (DDDM) mitigates these inefficiencies by replacing stochastic reverse MCMC updates with deterministic denoising step, substantially reducing inference time. Yet, the theoretical underpinnings of such deterministic updates have remained opaque. In this work, we provide a principled reinterpretation of DDDM through the lens of the Reverse Transition Kernel (RTK) framework by Huang et al. 2024, unifying deterministic and stochastic diffusion under a shared probabilistic formalism. By expressing the DDDM reverse process as an approximate kernel operator, we show that the direct denoising process implicitly optimizes a structured transport map between noisy and clean samples. This perspective elucidates why deterministic denoising achieves efficient inference. Beyond theoretical clarity, this reframing resolves several long-standing bottlenecks in molecular diffusion. The RTK view ensures numerical stability by enforcing well-conditioned reverse kernels, improves sample consistency by eliminating stochastic variance, and enables scalable and symmetry-preserving denoisers that respect SE(3) equivariance. Empirically, we demonstrate that RTK-guided deterministic denoising achieves faster convergence and higher structural fidelity than stochastic diffusion models, while preserving chemical validity across GEOM-DRUGS dataset. Code, models, and datasets are publicly available in our project repository.
[LINK]http://arxiv.org/abs/2601.08963v1
[DATE]2026-01-14 04:09:44+08:00
[CATEGORIES]cs.LG
Integration Matters for Learning PDEs with Backward SDEs
[AUTHORS]Sungje Park, Stephen Tu
[ABSTRACT]Backward stochastic differential equation (BSDE)-based deep learning methods provide an alternative to Physics-Informed Neural Networks (PINNs) for solving high-dimensional partial differential equations (PDEs), offering potential algorithmic advantages in settings such as stochastic optimal control, where the PDEs of interest are tied to an underlying dynamical system. However, standard BSDE-based solvers have empirically been shown to underperform relative to PINNs in the literature. In this paper, we identify the root cause of this performance gap as a discretization bias introduced by the standard Euler-Maruyama (EM) integration scheme applied to one-step self-consistency BSDE losses, which shifts the optimization landscape off target. We find that this bias cannot be satisfactorily addressed through finer step-sizes or multi-step self-consistency losses. To properly handle this issue, we propose a Stratonovich-based BSDE formulation, which we implement with stochastic Heun integration. We show that our proposed approach completely eliminates the bias issues faced by EM integration. Furthermore, our empirical results show that our Heun-based BSDE method consistently outperforms EM-based variants and achieves competitive results with PINNs across multiple high-dimensional benchmarks. Our findings highlight the critical role of integration schemes in BSDE-based PDE solvers, an algorithmic detail that has received little attention thus far in the literature.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.01078v3
[DATE]2026-01-14 04:06:31+08:00
[CATEGORIES]cs.LG
DriftGuard: A Hierarchical Framework for Concept Drift Detection and Remediation in Supply Chain Forecasting
[AUTHORS]Shahnawaz Alam, Mohammed Abdul Rahman, Bareera Sadeqa
[ABSTRACT]Supply chain forecasting models degrade over time as real-world conditions change. Promotions shift, consumer preferences evolve, and supply disruptions alter demand patterns, causing what is known as concept drift. This silent degradation leads to stockouts or excess inventory without triggering any system warnings. Current industry practice relies on manual monitoring and scheduled retraining every 3-6 months, which wastes computational resources during stable periods while missing rapid drift events. Existing academic methods focus narrowly on drift detection without addressing diagnosis or remediation, and they ignore the hierarchical structure inherent in supply chain data. What retailers need is an end-to-end system that detects drift early, explains its root causes, and automatically corrects affected models. We propose DriftGuard, a five-module framework that addresses the complete drift lifecycle. The system combines an ensemble of four complementary detection methods, namely error-based monitoring, statistical tests, autoencoder anomaly detection, and Cumulative Sum (CUSUM) change-point analysis, with hierarchical propagation analysis to identify exactly where drift occurs across product lines. Once detected, Shapley Additive Explanations (SHAP) analysis diagnoses the root causes, and a cost-aware retraining strategy selectively updates only the most affected models. Evaluated on over 30,000 time series from the M5 retail dataset, DriftGuard achieves 97.8% detection recall within 4.2 days and delivers up to 417 return on investment through targeted remediation.
[LINK]http://arxiv.org/abs/2601.08928v1
[DATE]2026-01-14 03:08:11+08:00
[CATEGORIES]cs.LG
Motion Attribution for Video Generation
[AUTHORS]Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine
[ABSTRACT]Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
[COMMENTS]See the project website at https://research.nvidia.com/labs/sil/projects/MOTIVE/
[LINK]http://arxiv.org/abs/2601.08828v1
[DATE]2026-01-14 02:59:09+08:00
[CATEGORIES]cs.LG
Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning
[AUTHORS]Shao-Ting Chiu, Siu Wun Cheung, Ulisses Braga-Neto, Chak Shing Lee, Rui Peng Li
[ABSTRACT]Kolmogorov-Arnold Networks (KANs) have shown strong potential for efficiently approximating complex nonlinear functions. However, the original KAN formulation relies on B-spline basis functions, which incur substantial computational overhead due to De Boor’s algorithm. To address this limitation, recent work has explored alternative basis functions such as radial basis functions (RBFs) that can improve computational efficiency and flexibility. Yet, standard RBF-KANs often sacrifice accuracy relative to the original KAN design. In this work, we propose Free-RBF-KAN, a RBF-based KAN architecture that incorporates adaptive learning grids and trainable smoothness to close this performance gap. Our method employs freely learnable RBF shapes that dynamically align grid representations with activation patterns, enabling expressive and adaptive function approximation. Additionally, we treat smoothness as a kernel parameter optimized jointly with network weights, without increasing computational complexity. We provide a general universality proof for RBF-KANs, which encompasses our Free-RBF-KAN formulation. Through a broad set of experiments, including multiscale function approximation, physics-informed machine learning, and PDE solution operator learning, Free-RBF-KAN achieves accuracy comparable to the original B-spline-based KAN while delivering faster training and inference. These results highlight Free-RBF-KAN as a compelling balance between computational efficiency and adaptive resolution, particularly for high-dimensional structured modeling tasks.
[LINK]http://arxiv.org/abs/2601.07760v2
[DATE]2026-01-14 02:39:13+08:00
[CATEGORIES]cs.LG
Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems
[AUTHORS]Ibrahim K. Ozaslan, Panagiotis Patrinos, Mihailo R. Jovanović
[ABSTRACT]We examine stability properties of primal-dual gradient flow dynamics for composite convex optimization problems with multiple, possibly nonsmooth, terms in the objective function under the generalized consensus constraint. The proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM which faces significant challenges from both analysis and implementation viewpoints in large-scale multi-block scenarios. In contrast to customized algorithms with individualized convergence guarantees, we develop a systematic approach for solving a broad class of challenging composite optimization problems. We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics. Our assumptions are much weaker than those required to prove (exponential) stability of primal-dual dynamics as well as (linear) convergence of discrete-time methods such as standard two-block and multi-block ADMM and EXTRA algorithms. Finally, we show necessity of some of our structural assumptions for exponential stability and provide computational experiments to demonstrate the convenience of the proposed approach for parallel and distributed computing applications.
[COMMENTS]32 pages; 4 figures
[LINK]http://arxiv.org/abs/2408.15969v3
[DATE]2026-01-14 02:25:29+08:00
[CATEGORIES]cs.LG
On the use of graph models to achieve individual and group fairness
[AUTHORS]Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo
[ABSTRACT]Machine Learning algorithms are ubiquitous in key decision-making contexts such as justice, healthcare and finance, which has spawned a great demand for fairness in these procedures. However, the theoretical properties of such models in relation with fairness are still poorly understood, and the intuition behind the relationship between group and individual fairness is still lacking. In this paper, we provide a theoretical framework based on Sheaf Diffusion to leverage tools based on dynamical systems and homology to model fairness. Concretely, the proposed method projects input data into a bias-free space that encodes fairness constrains, resulting in fair solutions. Furthermore, we present a collection of network topologies handling different fairness metrics, leading to a unified method capable of dealing with both individual and group bias. The resulting models have a layer of interpretability in the form of closed-form expressions for their SHAP values, consolidating their place in the responsible Artificial Intelligence landscape. Finally, these intuitions are tested on a simulation study and standard fairness benchmarks, where the proposed methods achieve satisfactory results. More concretely, the paper showcases the performance of the proposed models in terms of accuracy and fairness, studying available trade-offs on the Pareto frontier, checking the effects of changing the different hyper-parameters, and delving into the interpretation of its outputs.
[COMMENTS]75 pages, 46 figures
[LINK]http://arxiv.org/abs/2601.08784v1
[DATE]2026-01-14 02:17:43+08:00
[CATEGORIES]cs.LG
Incentivizing Multi-Tenant Split Federated Learning for Foundation Models at the Network Edge
[AUTHORS]Songyuan Li, Jia Hu, Geyong Min, Haojun Huang
[ABSTRACT]Foundation models (FMs) such as GPT-4 exhibit exceptional generative capabilities across diverse downstream tasks through fine-tuning. Split Federated Learning (SFL) facilitates privacy-preserving FM fine-tuning on resource-constrained local devices by offloading partial FM computations to edge servers, enabling device-edge synergistic fine-tuning. Practical edge networks often host multiple SFL tenants to support diversified downstream tasks. However, existing research primarily focuses on single-tenant SFL scenarios, and lacks tailored incentive mechanisms for multi-tenant settings, which are essential to effectively coordinate self-interested local devices for participation in various downstream tasks, ensuring that each SFL tenant’s distinct FM fine-tuning requirements (e.g., FM types, performance targets, and fine-tuning deadlines) are met. To address this gap, we propose a novel Price-Incentive Mechanism (PRINCE) that guides multiple SFL tenants to offer strategic price incentives, which solicit high-quality device participation for efficient FM fine-tuning. Specifically, we first develop a bias-resilient global SFL model aggregation scheme to eliminate model biases caused by independent device participation. We then derive a rigorous SFL convergence bound to evaluate the contributions of heterogeneous devices to FM performance improvements, guiding the incentive strategies of SFL tenants. Furthermore, we model inter-tenant device competition as a congestion game for Stackelberg equilibrium (SE) analysis, deriving each SFL tenant’s optimal incentive strategy. Extensive simulations involving four representative SFL tenant types (ViT, BERT, Whisper, and LLaMA) across diverse data modalities (text, images, and audio) demonstrate that PRINCE accelerates FM fine-tuning by up to 3.07x compared to state-of-the-art approaches, while consistently meeting fine-tuning performance targets.
[COMMENTS]Accepted for publication in IEEE/ACM Transactions on Networking. Index Terms: Foundation models, Edge computing, Split federated learning, Multi-tenant system, Incentive mechanism
[LINK]http://arxiv.org/abs/2503.04971v2
[DATE]2026-01-14 02:16:48+08:00
[CATEGORIES]cs.LG
Fast and explainable clustering in the Manhattan and Tanimoto distance
[AUTHORS]Stefan Güttel, Kaustubh Roy
[ABSTRACT]The CLASSIX algorithm is a fast and explainable approach to data clustering. In its original form, this algorithm exploits the sorting of the data points by their first principal component to truncate the search for nearby data points, with nearness being defined in terms of the Euclidean distance. Here we extend CLASSIX to other distance metrics, including the Manhattan distance and the Tanimoto distance. Instead of principal components, we use an appropriate norm of the data vectors as the sorting criterion, combined with the triangle inequality for search termination. In the case of Tanimoto distance, a provably sharper intersection inequality is used to further boost the performance of the new algorithm. On a real-world chemical fingerprint benchmark, CLASSIX Tanimoto is about 30 times faster than the Taylor–Butina algorithm, and about 80 times faster than DBSCAN, while computing higher-quality clusters in both cases.
[LINK]http://arxiv.org/abs/2601.08781v1
[DATE]2026-01-14 02:14:03+08:00
[CATEGORIES]cs.LG
Comprehensive Machine Learning Benchmarking for Fringe Projection Profilometry with Photorealistic Synthetic Data
[AUTHORS]Anush Lakshman S, Adam Haroon, Beiwen Li
[ABSTRACT]Machine learning approaches for fringe projection profilometry (FPP) are hindered by the lack of large, diverse datasets and comprehensive benchmarking protocols. This paper introduces the first open-source, photorealistic synthetic dataset for FPP, generated using NVIDIA Isaac Sim with 15,600 fringe images and 300 depth reconstructions across 50 diverse objects. We benchmark four neural network architectures (UNet, Hformer, ResUNet, Pix2Pix) on single-shot depth reconstruction, revealing that all models achieve similar performance (58-77 mm RMSE) despite substantial architectural differences. Our results demonstrate fundamental limitations of direct fringe-to-depth mapping without explicit phase information, with reconstruction errors approaching 75-95\% of the typical object depth range. This resource provides standardized evaluation protocols enabling systematic comparison and development of learning-based FPP approaches.
[LINK]http://arxiv.org/abs/2601.08900v1
[DATE]2026-01-14 02:04:07+08:00
[CATEGORIES]cs.LG
Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis
[AUTHORS]Sara Giordano, Kornikar Sen, Miguel A. Martin-Delgado
[ABSTRACT]A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the Noisy Intermediate-Scale Quantum (NISQ) era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension.The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. This is a circuit-aware reward, in contrast to the current trend of works on this topic, which are primarily fidelity-based. By leveraging sparse matrix representations and state-space discretization, the method enables practical navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set still yields low depth circuits, highlighting the algorithm robustness and adaptability. The results confirm that this RL-driven approach, with our completely circuit-aware method, efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization.
[COMMENTS]35 pages, 7 figures, color figures
[LINK]http://arxiv.org/abs/2507.16641v2
[DATE]2026-01-14 01:34:14+08:00
[CATEGORIES]cs.LG
Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions
[AUTHORS]Zhaoxian Wu, Quan Xiao, Tayfun Gokmen, Omobayode Fagbohungbe, Tianyi Chen
[ABSTRACT]As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially its training dynamic, is underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses. While the conductance changes by a constant in response to each pulse, in reality, the change is scaled by asymmetric and non-linear response functions, leading to a non-ideal training dynamic. This paper provides a theoretical foundation for gradient-based training on AIMC hardware with non-ideal response functions. We demonstrate that asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty on the objective. To overcome the issue, we propose Residual Learning algorithm, which provably converges exactly to a critical point by solving a bilevel optimization problem. We demonstrate that the proposed method can be extended to address other hardware imperfections, such as limited response granularity. As we know, it is the first paper to investigate the impact of a class of generic non-ideal response functions. The conclusion is supported by simulations validating our theoretical insights.
[LINK]http://arxiv.org/abs/2502.06309v5
[DATE]2026-01-14 01:33:05+08:00
[CATEGORIES]cs.LG
A Novel Approach to Explainable AI with Quantized Active Ingredients in Decision Making
[AUTHORS]A. M. A. S. D. Alagiyawanna, Asoka Karunananda, Thushari Silva, A. Mahasinghe
[ABSTRACT]Artificial Intelligence (AI) systems have shown good success at classifying. However, the lack of explainability is a true and significant challenge, especially in high-stakes domains, such as health and finance, where understanding is paramount. We propose a new solution to this challenge: an explainable AI framework based on our comparative study with Quantum Boltzmann Machines (QBMs) and Classical Boltzmann Machines (CBMs). We leverage principles of quantum computing within classical machine learning to provide substantive transparency around decision-making. The design involves training both models on a binarised and dimensionally reduced MNIST dataset, where Principal Component Analysis (PCA) is applied for preprocessing. For interpretability, we employ gradient-based saliency maps in QBMs and SHAP (SHapley Additive exPlanations) in CBMs to evaluate feature attributions.QBMs deploy hybrid quantum-classical circuits with strongly entangling layers, allowing for richer latent representations, whereas CBMs serve as a classical baseline that utilises contrastive divergence. Along the way, we found that QBMs outperformed CBMs on classification accuracy (83.5% vs. 54%) and had more concentrated distributions in feature attributions as quantified by entropy (1.27 vs. 1.39). In other words, QBMs not only produced better predictive performance than CBMs, but they also provided clearer identification of “active ingredient” or the most important features behind model predictions. To conclude, our results illustrate that quantum-classical hybrid models can display improvements in both accuracy and interpretability, which leads us toward more trustworthy and explainable AI systems.
[COMMENTS]Accepted and published in IEEE 2025. This is the authors manuscript version; final version available at IEEE Xplore: https://ieeexplore.ieee.org/document/11318441
[LINK]http://arxiv.org/abs/2601.08733v1
[DATE]2026-01-14 01:06:19+08:00
[CATEGORIES]cs.LG
Quantization Error Propagation: Revisiting Layer-Wise Post-Training Quantization
[AUTHORS]Yamato Arai, Yuma Ichikawa
[COMMENTS]29 pages, 3 figures, Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2504.09629v3
[DATE]2026-01-14 01:04:13+08:00
[CATEGORIES]cs.LG
Efficient and Reproducible Biomedical Question Answering using Retrieval Augmented Generation
[AUTHORS]Linus Stuhlmann, Michael Alexander Saxer, Jonathan Fürst
[ABSTRACT]Biomedical question-answering (QA) systems require effective retrieval and generation components to ensure accuracy, efficiency, and scalability. This study systematically examines a Retrieval-Augmented Generation (RAG) system for biomedical QA, evaluating retrieval strategies and response time trade-offs. We first assess state-of-the-art retrieval methods, including BM25, BioBERT, MedCPT, and a hybrid approach, alongside common data stores such as Elasticsearch, MongoDB, and FAISS, on a ~10% subset of PubMed (2.4M documents) to measure indexing efficiency, retrieval latency, and retriever performance in the end-to-end RAG system. Based on these insights, we deploy the final RAG system on the full 24M PubMed corpus, comparing different retrievers’ impact on overall performance. Evaluations of the retrieval depth show that retrieving 50 documents with BM25 before reranking with MedCPT optimally balances accuracy (0.90), recall (0.90), and response time (1.91s). BM25 retrieval time remains stable (82ms), while MedCPT incurs the main computational cost. These results highlight previously not well-known trade-offs in retrieval depth, efficiency, and scalability for biomedical QA. With open-source code, the system is fully reproducible and extensible.
[COMMENTS]Minor wording corrections and updated author contact information
[LINK]http://arxiv.org/abs/2505.07917v2
[DATE]2026-01-14 01:00:37+08:00
[CATEGORIES]cs.LG
Spike-timing-dependent Hebbian learning as noisy gradient descent
[AUTHORS]Niklas Dexheimer, Sascha Gaudlitz, Johannes Schmidt-Hieber
[ABSTRACT]Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.
[LINK]http://arxiv.org/abs/2505.10272v3
[DATE]2026-01-14 00:54:42+08:00
[CATEGORIES]cs.LG
Model-Agnostic Solutions for Deep Reinforcement Learning in Non-Ergodic Contexts
[AUTHORS]Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis
[ABSTRACT]Reinforcement Learning (RL) remains a central optimisation framework in machine learning. Although RL agents can converge to optimal solutions, the definition of “optimality” depends on the environment’s statistical properties. The Bellman equation, central to most RL algorithms, is formulated in terms of expected values of future rewards. However, when ergodicity is broken, long-term outcomes depend on the specific trajectory rather than on the ensemble average. In such settings, the ensemble average diverges from the time-average growth experienced by individual agents, with expected-value formulations yielding systematically suboptimal policies. Prior studies demonstrated that traditional RL architectures fail to recover the true optimum in non-ergodic environments. We extend this analysis to deep RL implementations and show that these, too, produce suboptimal policies under non-ergodic dynamics. Introducing explicit time dependence into the learning process can correct this limitation. By allowing the network’s function approximation to incorporate temporal information, the agent can estimate value functions consistent with the process’s intrinsic growth rate. This improvement does not require altering the environmental feedback, such as reward transformations or modified objective functions, but arises naturally from the agent’s exposure to temporal trajectories. Our results contribute to the growing body of research on reinforcement learning methods for non-ergodic systems.
[LINK]http://arxiv.org/abs/2601.08726v1
[DATE]2026-01-14 00:53:40+08:00
[CATEGORIES]cs.LG
Kernel Learning for Regression via Quantum Annealing Based Spectral Sampling
[AUTHORS]Yasushi Hasegawa, Masayuki Ohzeki
[ABSTRACT]While quantum annealing (QA) has been developed for combinatorial optimization, practical QA devices operate at finite temperature and under noise, and their outputs can be regarded as stochastic samples close to a Gibbs–Boltzmann distribution. In this study, we propose a QA-in-the-loop kernel learning framework that integrates QA not merely as a substitute for Markov-chain Monte Carlo sampling but as a component that directly determines the learned kernel for regression. Based on Bochner’s theorem, a shift-invariant kernel is represented as an expectation over a spectral distribution, and random Fourier features (RFF) approximate the kernel by sampling frequencies. We model the spectral distribution with a (multi-layer) restricted Boltzmann machine (RBM), generate discrete RBM samples using QA, and map them to continuous frequencies via a Gaussian–Bernoulli transformation. Using the resulting RFF, we construct a data-adaptive kernel and perform Nadaraya–Watson (NW) regression. Because the RFF approximation based on $\cos(\bmω^{\top}Δ\bm{x})$ can yield small negative values and cancellation across neighbors, the Nadaraya–Watson denominator $\sum_j k_{ij}$ may become close to zero. We therefore employ nonnegative squared-kernel weights $w_{ij}=k(\bm{x}_i,\bm{x}_j)^2$, which also enhances the contrast of kernel weights. The kernel parameters are trained by minimizing the leave-one-out NW mean squared error, and we additionally evaluate local linear regression with the same squared-kernel weights at inference. Experiments on multiple benchmark regression datasets demonstrate a decrease in training loss, accompanied by structural changes in the kernel matrix, and show that the learned kernel tends to improve $R^2$ and RMSE over the baseline Gaussian-kernel NW. Increasing the number of random features at inference further enhances accuracy.
[COMMENTS]15pages, 5 figures, 4 tables
[LINK]http://arxiv.org/abs/2601.08724v1
[DATE]2026-01-14 00:50:07+08:00
[CATEGORIES]cs.LG
Soft Partition-based KAPI-ELM for Multi-Scale PDEs
[AUTHORS]Vikas Dwivedi, Monica Sigovan, Bruno Sixou
[ABSTRACT]Physics-informed machine learning holds great promise for solving differential equations, yet existing methods struggle with highly oscillatory, multiscale, or singularly perturbed PDEs due to spectral bias, costly backpropagation, and manually tuned kernel or Fourier frequencies. This work introduces a soft partition–based Kernel-Adaptive Physics-Informed Extreme Learning Machine (KAPI-ELM), a deterministic low-dimensional parameterization in which smooth partition lengths jointly control collocation centers and Gaussian kernel widths, enabling continuous coarse-to-fine resolution without Fourier features, random sampling, or hard domain interfaces. A signed-distance-based weighting further stabilizes least-squares learning on irregular geometries. Across eight benchmarks–including oscillatory ODEs, high-frequency Poisson equations, irregular-shaped domains, and stiff singularly perturbed convection-diffusion problems-the proposed method matches or exceeds the accuracy of state-of-the-art Physics-Informed Neural Network (PINN) and Theory of Functional Connections (TFC) variants while using only a single linear solve. Although demonstrated on steady linear PDEs, the results show that soft-partition kernel adaptation provides a fast, architecture-free approach for multiscale PDEs with broad potential for future physics-informed modeling. For reproducibility, the reference codes are available at https://github.com/vikas-dwivedi-2022/soft_kapi
[LINK]http://arxiv.org/abs/2601.08719v1
[DATE]2026-01-14 00:43:38+08:00
[CATEGORIES]cs.LG
Precision over Diversity: High-Precision Reward Generalizes to Robust Instruction Following
[AUTHORS]Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang, Haonan Song, Wu Ning, Dandan Tu, Qixun Zhang, Bibo Cai, Yuxiang He, Ting Liu
[ABSTRACT]A central belief in scaling reinforcement learning with verifiable rewards for instruction following (IF) tasks is that, a diverse mixture of verifiable hard and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false response, which leads to severe reward hacking, thereby undermining the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4\% in performance while achieving a 58\% reduction in training time, maintaining strong generalization beyond instruction following. Our findings advocate for a paradigm shift: moving away from the indiscriminate pursuit of data diversity toward high-precision rewards.
[COMMENTS]Under review, 13 pages, 8 figures
[LINK]http://arxiv.org/abs/2601.04954v2
[DATE]2026-01-14 00:42:42+08:00
[CATEGORIES]cs.LG
Evaluating the Ability of Explanations to Disambiguate Models in a Rashomon Set
[AUTHORS]Kaivalya Rawal, Eoin Delaney, Zihao Fu, Sandra Wachter, Chris Russell
[ABSTRACT]Explainable artificial intelligence (XAI) is concerned with producing explanations indicating the inner workings of models. For a Rashomon set of similarly performing models, explanations provide a way of disambiguating the behavior of individual models, helping select models for deployment. However explanations themselves can vary depending on the explainer used, and need to be evaluated. In the paper “Evaluating Model Explanations without Ground Truth”, we proposed three principles of explanation evaluation and a new method “AXE” to evaluate the quality of feature-importance explanations. We go on to illustrate how evaluation metrics that rely on comparing model explanations against ideal ground truth explanations obscure behavioral differences within a Rashomon set. Explanation evaluation aligned with our proposed principles would highlight these differences instead, helping select models from the Rashomon set. The selection of alternate models from the Rashomon set can maintain identical predictions but mislead explainers into generating false explanations, and mislead evaluation methods into considering the false explanations to be of high quality. AXE, our proposed explanation evaluation method, can detect this adversarial fairwashing of explanations with a 100% success rate. Unlike prior explanation evaluation strategies such as those based on model sensitivity or ground truth comparison, AXE can determine when protected attributes are used to make predictions.
[COMMENTS]This is a preprint of the paper published at the MURE workshop, AAAI 2026, which builds on a preprint of separate work published at FAccT 2025 (arXiv:2505.10399)
[LINK]http://arxiv.org/abs/2601.08703v1
[DATE]2026-01-14 00:31:11+08:00
[CATEGORIES]cs.LG
The Kernel Manifold: A Geometric Approach to Gaussian Process Model Selection
[AUTHORS]Md Shafiqul Islam, Shakti Prasad Padhy, Douglas Allaire, Raymundo Arróyave
[ABSTRACT]Gaussian Process (GP) regression is a powerful nonparametric Bayesian framework, but its performance depends critically on the choice of covariance kernel. Selecting an appropriate kernel is therefore central to model quality, yet remains one of the most challenging and computationally expensive steps in probabilistic modeling. We present a Bayesian optimization framework built on kernel-of-kernels geometry, using expected divergence-based distances between GP priors to explore kernel space efficiently. A multidimensional scaling (MDS) embedding of this distance matrix maps a discrete kernel library into a continuous Euclidean manifold, enabling smooth BO. In this formulation, the input space comprises kernel compositions, the objective is the log marginal likelihood, and featurization is given by the MDS coordinates. When the divergence yields a valid metric, the embedding preserves geometry and produces a stable BO landscape. We demonstrate the approach on synthetic benchmarks, real-world time-series datasets, and an additive manufacturing case study predicting melt-pool geometry, achieving superior predictive accuracy and uncertainty calibration relative to baselines including Large Language Model (LLM)-guided search. This framework establishes a reusable probabilistic geometry for kernel search, with direct relevance to GP modeling and deep kernel learning.
[COMMENTS]Included a subsection named “Budgetary impact of inline kernel optimization during BO”, and corrected label of a figure
[LINK]http://arxiv.org/abs/2601.05371v2
[DATE]2026-01-14 00:22:49+08:00
[CATEGORIES]cs.LG
Enabling Population-Based Architectures for Neural Combinatorial Optimization
[AUTHORS]Andoni Irazusta Garmendia, Josu Ceberio, Alexander Mendiburu
[ABSTRACT]Neural Combinatorial Optimization (NCO) has mostly focused on learning policies, typically neural networks, that operate on a single candidate solution at a time, either by constructing one from scratch or iteratively improving it. In contrast, decades of work in metaheuristics have shown that maintaining and evolving populations of solutions improves robustness and exploration, and often leads to stronger performance. To close this gap, we study how to make NCO explicitly population-based by learning policies that act on sets of candidate solutions. We first propose a simple taxonomy of population awareness levels and use it to highlight two key design challenges: (i) how to represent a whole population inside a neural network, and (ii) how to learn population dynamics that balance intensification (generating good solutions) and diversification (maintaining variety). We make these ideas concrete with two complementary tools: one that improves existing solutions using information shared across the whole population, and the other generates new candidate solutions that explicitly balance being high-quality with diversity. Experimental results on Maximum Cut and Maximum Independent Set indicate that incorporating population structure is advantageous for learned optimization methods and opens new connections between NCO and classical population-based search.
[LINK]http://arxiv.org/abs/2601.08696v1
[DATE]2026-01-14 00:19:34+08:00
[CATEGORIES]cs.LG
Interactive and Hybrid Imitation Learning: Provably Beating Behavior Cloning
[AUTHORS]Yichen Li, Chicheng Zhang
[ABSTRACT]Imitation learning (IL) is a paradigm for learning sequential decision making policies from experts, leveraging offline demonstrations, interactive annotations, or both. Recent advances show that when annotation cost is tallied per trajectory, Behavior Cloning (BC) which relies solely on offline demonstrations cannot be improved in general, leaving limited conditions for interactive methods such as DAgger to help. We revisit this conclusion and prove that when the annotation cost is measured per state, algorithms using interactive annotations can provably outperform BC. Specifically: (1) we show that Stagger, a one sample per round variant of DAgger, provably beats BC under low recovery cost settings; (2) we initiate the study of hybrid IL where the agent learns from offline demonstrations and interactive annotations. We propose Warm Stagger whose learning guarantee is not much worse than using either data source alone. Furthermore, motivated by compounding error and cold start problem in imitation learning practice, we give an MDP example in which Warm Stagger has significant better annotation cost; (3) experiments on MuJoCo continuous control tasks confirm that, with modest cost ratio between interactive and offline annotations, interactive and hybrid approaches consistently outperform BC. To the best of our knowledge, our work is the first to highlight the benefit of state wise interactive annotation and hybrid feedback in imitation learning.
[COMMENTS]42 pages, Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2412.07057v3
[DATE]2026-01-14 00:15:49+08:00
[CATEGORIES]cs.LG
Parallel Context-of-Experts Decoding for Retrieval Augmented Generation
[AUTHORS]Giulio Corallo, Paolo Papotti
[ABSTRACT]Retrieval Augmented Generation faces a trade-off: concatenating documents in a long prompt enables multi-document reasoning but creates prefill bottlenecks, while encoding document KV caches separately offers speed but breaks cross-document interaction. We propose Parallel Context-of-Experts Decoding (Pced), a training-free framework that shifts evidence aggregation from the attention mechanism to the decoding. Pced treats retrieved documents as isolated “experts”, synchronizing their predictions via a novel retrieval-aware contrastive decoding rule that weighs expert logits against the model prior. This approach recovers cross-document reasoning capabilities without constructing a shared attention across documents.
[LINK]http://arxiv.org/abs/2601.08670v1
[DATE]2026-01-13 23:46:59+08:00
[CATEGORIES]cs.CL
Analyzing Bias in False Refusal Behavior of Large Language Models for Hate Speech Detoxification
[AUTHORS]Kyuri Im, Shuzhou Yuan, Michael Färber
[ABSTRACT]While large language models (LLMs) have increasingly been applied to hate speech detoxification, the prompts often trigger safety alerts, causing LLMs to refuse the task. In this study, we systematically investigate false refusal behavior in hate speech detoxification and analyze the contextual and linguistic biases that trigger such refusals. We evaluate nine LLMs on both English and multilingual datasets, our results show that LLMs disproportionately refuse inputs with higher semantic toxicity and those targeting specific groups, particularly nationality, religion, and political ideology. Although multilingual datasets exhibit lower overall false refusal rates than English datasets, models still display systematic, language-dependent biases toward certain targets. Based on these findings, we propose a simple cross-translation strategy, translating English hate speech into Chinese for detoxification and back, which substantially reduces false refusals while preserving the original content, providing an effective and lightweight mitigation approach.
[LINK]http://arxiv.org/abs/2601.08668v1
[DATE]2026-01-13 23:45:31+08:00
[CATEGORIES]cs.CL
Alleviating Attention Hacking in Discriminative Reward Modeling through Interaction Distillation
[AUTHORS]Jianxiang Zang
[ABSTRACT]The reward model (RM), as the core component of reinforcement learning from human feedback (RLHF) for large language models (LLMs), responsible for providing reward signals to generated responses. However, the mainstream discriminative reward modeling is inadequate in terms of token-level interaction, making its judgment signals vulnerable to being hacked by misallocated attention to context. This stems from two fundamental limitations: (1) Current preference modeling employs decoder-only architectures, where the unidirectional causal attention mechanism leads to forward-decaying intra-sequence attention within the prompt-response sequence. (2) The independent Siamese-encoding paradigm induces the absence of token-level inter-sequence attention between chosen and rejected sequences. To address this “attention hacking”, we propose “Interaction Distillation”, a novel training framework for more adequate discriminative reward modeling via attention-level optimization. The method introduces an interaction-based natural language understanding model as the teacher to provide sophisticated token interaction patterns via comprehensive attention, and guides the reward modeling to simulate teacher model’s interaction pattern through an attentional alignment objective. Through extensive experiments, interaction distillation has demonstrated its ability to provide more stable and generalizable reward signals compared to state-of-the-art RM optimization methods that target data noise, highlighting the attention hacking constitute a more fundamental limitation in discriminative RM.
[LINK]http://arxiv.org/abs/2508.02618v3
[DATE]2026-01-13 23:31:54+08:00
[CATEGORIES]cs.CL
RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation
[AUTHORS]Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong
[ABSTRACT]The LLM-as-a-Judge paradigm promises scalable rubric-based evaluation, yet aligning frozen black-box models with human standards remains a challenge due to inherent generation stochasticity. We reframe judge alignment as a criteria transfer problem and isolate three recurrent failure modes: rubric instability caused by prompt sensitivity, unverifiable reasoning that lacks auditable evidence, and scale misalignment with human grading boundaries. To address these issues, we introduce RULERS (Rubric Unification, Locking, and Evidence-anchored Robust Scoring), a compiler-executor framework that transforms natural language rubrics into executable specifications. RULERS operates by compiling criteria into versioned immutable bundles, enforcing structured decoding with deterministic evidence verification, and applying lightweight Wasserstein-based post-hoc calibration, all without updating model parameters. Extensive experiments on essay and summarization benchmarks demonstrate that RULERS significantly outperforms representative baselines in human agreement, maintains strong stability against adversarial rubric perturbations, and enables smaller models to rival larger proprietary judges. Overall, our results suggest that reliable LLM judging requires executable rubrics, verifiable evidence, and calibrated scales rather than prompt phrasing alone. Code is available at https://github.com/LabRAI/Rulers.git.
[LINK]http://arxiv.org/abs/2601.08654v1
[DATE]2026-01-13 23:31:42+08:00
[CATEGORIES]cs.CL cs.LG
Safe Language Generation in the Limit
[AUTHORS]Antonios Anastasopoulos, Giuseppe Ateniese, Evgenios M. Kornaropoulos
[ABSTRACT]Recent results in learning a language in the limit have shown that, although language identification is impossible, language generation is tractable. As this foundational area expands, we need to consider the implications of language generation in real-world settings.
This work offers the first theoretical treatment of safe language generation. Building on the computational paradigm of learning in the limit, we formalize the tasks of safe language identification and generation. We prove that under this model, safe language identification is impossible, and that safe language generation is at least as hard as (vanilla) language identification, which is also impossible. Last, we discuss several intractable and tractable cases.
[LINK]http://arxiv.org/abs/2601.08648v1
[DATE]2026-01-13 23:25:44+08:00
[CATEGORIES]cs.CL cs.LG
Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation
[AUTHORS]Saumitra Yadav, Manish Shrivastava
[ABSTRACT]Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation. But, for low-resource languages, human translation to generate sufficient data is prohibitively expensive. Therefore, it is crucial to develop a framework that screens source sentences to form efficient parallel text, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for optimal MT system training. Our extensively tested framework, (Lexical And Linguistically Informed Text Analysis) LALITA, targets source sentence selection using lexical and linguistic features to curate parallel corpora. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. We test this by simulating low-resource data availabilty with curated datasets of 50K to 800K English sentences and report improved performances on all data sizes. LALITA demonstrates remarkable efficiency, reducing data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German). This approach not only reduces MT systems training cost by reducing training data requirement, but also showcases LALITA’s utility in data augmentation.
[COMMENTS]Under Review
[LINK]http://arxiv.org/abs/2601.08629v1
[DATE]2026-01-13 23:05:19+08:00
[CATEGORIES]cs.CL
ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents
[AUTHORS]Huhai Zou, Tianhao Sun, Chuanjiang He, Yu Tian, Zhenyang Li, Li Jin, Nayu Liu, Jiang Zhong, Kaiwen Wei
[ABSTRACT]Memory is critical for dialogue agents to maintain coherence and enable continuous adaptation in long-term interactions. While existing memory mechanisms offer basic storage and retrieval capabilities, they are hindered by two primary limitations: (1) rigid memory granularity often disrupts semantic integrity, resulting in fragmented and incoherent memory units; (2) prevalent flat retrieval paradigms rely solely on surface-level semantic similarity, neglecting the structural cues of discourse required to navigate and locate specific episodic contexts. To mitigate these limitations, drawing inspiration from Event Segmentation Theory, we propose ES-Mem, a framework incorporating two core components: (1) a dynamic event segmentation module that partitions long-term interactions into semantically coherent events with distinct boundaries; (2) a hierarchical memory architecture that constructs multi-layered memories and leverages boundary semantics to anchor specific episodic memory for precise context localization. Evaluations on two memory benchmarks demonstrate that ES-Mem yields consistent performance gains over baseline methods. Furthermore, the proposed event segmentation module exhibits robust applicability on dialogue segmentation datasets.
[LINK]http://arxiv.org/abs/2601.07582v2
[DATE]2026-01-13 23:04:26+08:00
[CATEGORIES]cs.CL
How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction
[AUTHORS]Yingjie He, Zhaolu Kang, Kehan Jiang, Qianyuan Zhang, Jiachen Qian, Chunlei Meng, Yujie Feng, Yuan Wang, Jiabao Dou, Aming Wu, Leqi Zheng, Pengxiang Zhao, Jiaxin Liu, Zeyu Zhang, Lei Wang, Guansu Wang, Qishi Zhan, Xiaomin He, Meisheng Zhang, Jianyuan Ni
[ABSTRACT]Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent dissociation between semantic recall and structural planning, suggesting that structural robustness is not an automatic byproduct of semantic competence.
[LINK]http://arxiv.org/abs/2601.08626v1
[DATE]2026-01-13 23:03:38+08:00
[CATEGORIES]cs.CL
GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning
[AUTHORS]Jiajin Liu, Yuanfu Sun, Dongzhe Fan, Qiaoyu Tan
[ABSTRACT]Recent advances in search-augmented large reasoning models (LRMs) enable the retrieval of external knowledge to reduce hallucinations in multistep reasoning. However, their ability to operate on graph-structured data, prevalent in domains such as e-commerce, social networks, and scientific citations, remains underexplored. Unlike plain text corpora, graphs encode rich topological signals that connect related entities and can serve as valuable priors for retrieval, enabling more targeted search and improved reasoning efficiency. Yet, effectively leveraging such structure poses unique challenges, including the difficulty of generating graph-expressive queries and ensuring reliable retrieval that balances structural and semantic relevance. To address this gap, we introduce GraphSearch, the first framework that extends search-augmented reasoning to graph learning, enabling zero-shot graph learning without task-specific fine-tuning. GraphSearch combines a Graph-aware Query Planner, which disentangles search space (e.g., 1-hop, multi-hop, or global neighbors) from semantic queries, with a Graph-aware Retriever, which constructs candidate sets based on topology and ranks them using a hybrid scoring function. We further instantiate two traversal modes: GraphSearch-R, which recursively expands neighborhoods hop by hop, and GraphSearch-F, which flexibly retrieves across local and global neighborhoods without hop constraints. Extensive experiments across diverse benchmarks show that GraphSearch achieves competitive or even superior performance compared to supervised graph learning methods, setting state-of-the-art results in zero-shot node classification and link prediction. These findings position GraphSearch as a flexible and generalizable paradigm for agentic reasoning over graphs.
[COMMENTS]16 pages, 5 pages
[LINK]http://arxiv.org/abs/2601.08621v1
[DATE]2026-01-13 23:00:57+08:00
[CATEGORIES]cs.CL
KaLM: Knowledge-aligned Autoregressive Language Modeling via Dual-view Knowledge Graph Contrastive Learning
[AUTHORS]Peng Yu, Cheng Deng, Beiya Dai, Xinbing Wang, Ying Wen
[ABSTRACT]Autoregressive large language models (LLMs) pre-trained by next token prediction are inherently proficient in generative tasks. However, their performance on knowledge-driven tasks such as factual knowledge querying remains unsatisfactory. Knowledge graphs (KGs), as high-quality structured knowledge bases, can provide reliable knowledge for LLMs, potentially compensating for their knowledge deficiencies. Aligning LLMs with explicit, structured knowledge from KGs has been a challenge; previous attempts either failed to effectively align knowledge representations or compromised the generative capabilities of LLMs, leading to less-than-optimal outcomes. This paper proposes \textbf{KaLM}, a \textit{Knowledge-aligned Language Modeling} approach, which fine-tunes autoregressive LLMs to align with KG knowledge via the joint objective of explicit knowledge alignment and implicit knowledge alignment. The explicit knowledge alignment objective aims to directly optimize the knowledge representation of LLMs through dual-view knowledge graph contrastive learning. The implicit knowledge alignment objective focuses on incorporating textual patterns of knowledge into LLMs through triple completion language modeling. Notably, our method achieves a significant performance boost in evaluations of knowledge-driven tasks, specifically embedding-based knowledge graph completion and generation-based knowledge graph question answering.
[COMMENTS]The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-026-50906-6}
[LINK]http://arxiv.org/abs/2412.04948v2
[DATE]2026-01-13 22:38:40+08:00
[CATEGORIES]cs.CL
Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates
[AUTHORS]Yang Liu, Yi Chen, Yongwei Zhao, Yifan Hao, Zifu Zheng, Weihao Kong, Zhangmai Li, Dongchen Jiang, Ruiyang Xia, Zhihong Ma, Zisheng Liu, Zhaoyong Wan, Yunqi Lu, Ximing Liu, Hongrui Guo, Zhihao Yang, Zhe Wang, Tianrui Ma, Mo Zou, Rui Zhang, Ling Li, Xing Hu, Zidong Du, Zhiwei Xu, Qi Guo, Tianshi Chen, Yunji Chen
[ABSTRACT]The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. A straightforward hardwiring of gpt-oss 120 B would require fabricating photomask sets valued at over 6 billion dollars, rendering this straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 photomask layers are homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x that of GPU/WSE), 36 tokens/J (1,047x/283x that of GPU/WSE), 13,232 mm2 total die area, $59.46 M-123.5 M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 41.7-80.4x improvement in cost-effectiveness and 357x reduction in carbon footprint compared to OpenAI-scale H100 clusters, under an annual weight updating assumption.
[LINK]http://arxiv.org/abs/2508.16151v2
[DATE]2026-01-13 22:31:14+08:00
[CATEGORIES]cs.CL
Learner-Tailored Program Repair: A Solution Generator with Iterative Edit-Driven Retrieval Enhancement
[AUTHORS]Zhenlong Dai, Zhuoluo Zhao, Hengning Wang, Xiu Tang, Sai Wu, Chang Yao, Zhipeng Gao, Jingyuan Chen
[ABSTRACT]With the development of large language models (LLMs) in the field of programming, intelligent programming coaching systems have gained widespread attention. However, most research focuses on repairing the buggy code of programming learners without providing the underlying causes of the bugs. To address this gap, we introduce a novel task, namely \textbf{LPR} (\textbf{L}earner-Tailored \textbf{P}rogram \textbf{R}epair). We then propose a novel and effective framework, \textbf{\textsc{\MethodName{}}} (\textbf{L}earner-Tailored \textbf{S}olution \textbf{G}enerator), to enhance program repair while offering the bug descriptions for the buggy code. In the first stage, we utilize a repair solution retrieval framework to construct a solution retrieval database and then employ an edit-driven code retrieval approach to retrieve valuable solutions, guiding LLMs in identifying and fixing the bugs in buggy code. In the second stage, we propose a solution-guided program repair method, which fixes the code and provides explanations under the guidance of retrieval solutions. Moreover, we propose an Iterative Retrieval Enhancement method that utilizes evaluation results of the generated code to iteratively optimize the retrieval direction and explore more suitable repair strategies, improving performance in practical programming coaching scenarios. The experimental results show that our approach outperforms a set of baselines by a large margin, validating the effectiveness of our framework for the newly proposed LPR task.
[LINK]http://arxiv.org/abs/2601.08545v1
[DATE]2026-01-13 21:31:11+08:00
[CATEGORIES]cs.CL
ToolRM: Towards Agentic Tool-Use Reward Modeling
[AUTHORS]Renhao Li, Jianhong Tu, Yang Su, Yantao Liu, Fei Huang, Hamid Alinejad-Rokny, Derek F. Wong, Junyang Lin, Min Yang
[ABSTRACT]Reward models (RMs) play a critical role in aligning large language models (LLMs) with human preferences. Yet in the domain of tool learning, the lack of RMs specifically designed for function-calling tasks has limited progress toward more capable agentic AI. We introduce ToolRM, a family of lightweight reward models tailored for general tool-use scenarios. To build these models, we propose a novel pipeline that constructs high-quality pairwise preference data using rule-based scoring and multidimensional sampling. This yields ToolPref-Pairwise-30K, a diverse, balanced, and challenging preference dataset that supports both generative and discriminative reward modeling. We also introduce TRBench$_{BFCL}$, a benchmark built on the agent evaluation suite BFCL to evaluate RMs on tool calling tasks. Trained on our constructed data, models from the Qwen3-4B/8B series achieve up to 17.94% higher accuracy, substantially outperforming frontier LLMs and RMs in pairwise reward judgments. Beyond training objectives, generative ToolRM generalizes to broader critique tasks, including Best-of-N sampling and self-correction. Experiments on ACEBench highlight its effectiveness and efficiency, enabling inference-time scaling while reducing output token usage by over 66%. Its support for downstream RL training further validates its practical utility. We release data to facilitate future research.
[LINK]http://arxiv.org/abs/2510.26167v2
[DATE]2026-01-13 21:28:26+08:00
[CATEGORIES]cs.CL
MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation
[AUTHORS]Yuelyu Ji, Wuwei Lan, Patrick NG
[ABSTRACT]Multimodal Retrieval-Augmented Generation (Visual RAG) significantly advances question answering by integrating visual and textual evidence. Yet, current evaluations fail to systematically account for query difficulty and ambiguity. We propose MRAG-Suite, a diagnostic evaluation platform integrating diverse multimodal benchmarks (WebQA, Chart-RAG, Visual-RAG, MRAG-Bench). We introduce difficulty-based and ambiguity-aware filtering strategies, alongside MM-RAGChecker, a claim-level diagnostic tool. Our results demonstrate substantial accuracy reductions under difficult and ambiguous queries, highlighting prevalent hallucinations. MM-RAGChecker effectively diagnoses these issues, guiding future improvements in Visual RAG systems.
[LINK]http://arxiv.org/abs/2509.24253v3
[DATE]2026-01-13 21:26:27+08:00
[CATEGORIES]cs.CL
Algorithmic Stability in Infinite Dimensions: Characterizing Unconditional Convergence in Banach Spaces
[AUTHORS]Przemysław Spyra
[ABSTRACT]The distinction between conditional, unconditional, and absolute convergence in infinite-dimensional spaces has fundamental implications for computational algorithms. While these concepts coincide in finite dimensions, the Dvoretzky-Rogers theorem establishes their strict separation in general Banach spaces. We present a comprehensive characterization theorem unifying seven equivalent conditions for unconditional convergence: permutation invariance, net convergence, subseries tests, sign stability, bounded multiplier properties, and weak uniform convergence. These theoretical results directly inform algorithmic stability analysis, governing permutation invariance in gradient accumulation for Stochastic Gradient Descent and justifying coefficient thresholding in frame-based signal processing. Our work bridges classical functional analysis with contemporary computational practice, providing rigorous foundations for order-independent and numerically robust summation processes.
[LINK]http://arxiv.org/abs/2601.08512v1
[DATE]2026-01-13 20:51:58+08:00
[CATEGORIES]cs.CL
STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio
[AUTHORS]Seong-Gyu Park, Sohee Park, Jisu Lee, Hyunsik Na, Daeseon Choi
[ABSTRACT]Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a malicious input-induced path exhibits high posterior probability despite a low prior probability in the model’s general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B-70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC $\approx$ 1.0) with approximately $42\times$ greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
[COMMENTS]16 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.08511v1
[DATE]2026-01-13 20:51:13+08:00
[CATEGORIES]cs.CL cs.LG
STAGE: A Benchmark for Knowledge Graph Construction, Question Answering, and In-Script Role-Playing over Movie Screenplays
[AUTHORS]Qiuyu Tian, Yiding Li, Fengyi Chen, Zequn Liu, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang
[ABSTRACT]Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models’ abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.
[COMMENTS]66 pages, 9 figures
[LINK]http://arxiv.org/abs/2601.08510v1
[DATE]2026-01-13 20:50:58+08:00
[CATEGORIES]cs.CL
Spectral Generative Flow Models: A Physics-Inspired Replacement for Vectorized Large Language Models
[AUTHORS]Andrew Kiruluta
[ABSTRACT]We introduce Spectral Generative Flow Models (SGFMs), a physics-inspired alternative to transformer-based large language models. Instead of representing text or video as sequences of discrete tokens processed by attention, SGFMs treat generation as the evolution of a continuous field governed by constrained stochastic dynamics in a multiscale wavelet basis. This formulation replaces global attention with local operators, spectral projections, and Navier–Stokes-like transport, yielding a generative mechanism grounded in continuity, geometry, and physical structure.
Our framework provides three key innovations: (i) a field-theoretic ontology in which text and video are unified as trajectories of a stochastic partial differential equation; (ii) a wavelet-domain representation that induces sparsity, scale separation, and computational efficiency; and (iii) a constrained stochastic flow that enforces stability, coherence, and uncertainty propagation. Together, these components define a generative architecture that departs fundamentally from autoregressive modeling and diffusion-based approaches. SGFMs offer a principled path toward long-range coherence, multimodal generality, and physically structured inductive bias in next-generation generative models.
[LINK]http://arxiv.org/abs/2601.08893v1
[DATE]2026-01-13 20:50:24+08:00
[CATEGORIES]cs.LG cs.CL
Evaluating Role-Consistency in LLMs for Counselor Training
[AUTHORS]Eric Rudolph, Natalie Engert, Jens Albrecht
[ABSTRACT]The rise of online counseling services has highlighted the need for effective training methods for future counselors. This paper extends research on VirCo, a Virtual Client for Online Counseling, designed to complement traditional role-playing methods in academic training by simulating realistic client interactions. Building on previous work, we introduce a new dataset incorporating adversarial attacks to test the ability of large language models (LLMs) to maintain their assigned roles (role-consistency). The study focuses on evaluating the role consistency and coherence of the Vicuna model’s responses, comparing these findings with earlier research. Additionally, we assess and compare various open-source LLMs for their performance in sustaining role consistency during virtual client interactions. Our contributions include creating an adversarial dataset, evaluating conversation coherence and persona consistency, and providing a comparative analysis of different LLMs.
[LINK]http://arxiv.org/abs/2601.08892v1
[DATE]2026-01-13 20:26:15+08:00
[CATEGORIES]cs.CL
BenchOverflow: Measuring Overflow in Large Language Models via Plain-Text Prompts
[AUTHORS]Erin Feiglin, Nir Hutnik, Raz Lapid
[ABSTRACT]We investigate a failure mode of large language models (LLMs) in which plain-text prompts elicit excessive outputs, a phenomenon we term Overflow. Unlike jailbreaks or prompt injection, Overflow arises under ordinary interaction settings and can lead to elevated serving cost, latency, and cross-user performance degradation, particularly when scaled across many requests. Beyond usability, the stakes are economic and environmental: unnecessary tokens increase per-request cost and energy consumption, compounding into substantial operational spend and carbon footprint at scale. Moreover, Overflow represents a practical vector for compute amplification and service degradation in shared environments. We introduce BenchOverflow, a model-agnostic benchmark of nine plain-text prompting strategies that amplify output volume without adversarial suffixes or policy circumvention. Using a standardized protocol with a fixed budget of 5000 new tokens, we evaluate nine open- and closed-source models and observe pronounced rightward shifts and heavy tails in length distributions. Cap-saturation rates (CSR@1k/3k/5k) and empirical cumulative distribution functions (ECDFs) quantify tail risk; within-prompt variance and cross-model correlations show that Overflow is broadly reproducible yet heterogeneous across families and attack vectors. A lightweight mitigation-a fixed conciseness reminder-attenuates right tails and lowers CSR for all strategies across the majority of models. Our findings position length control as a measurable reliability, cost, and sustainability concern rather than a stylistic quirk. By enabling standardized comparison of length-control robustness across models, BenchOverflow provides a practical basis for selecting deployments that minimize resource waste and operating expense, and for evaluating defenses that curb compute amplification without eroding task performance.
[COMMENTS]Accepted at TMLR 2026
[LINK]http://arxiv.org/abs/2601.08490v1
[DATE]2026-01-13 20:22:29+08:00
[CATEGORIES]cs.CL
Surgical Refusal Ablation: Disentangling Safety from Intelligence via Concept-Guided Spectral Cleaning
[AUTHORS]Tony Cristofano
[ABSTRACT]Safety-aligned language models systematically refuse harmful requests. While activation steering can modulate refusal, ablating the raw “refusal vector” calculated from contrastive harmful and harmless prompts often causes collateral damage and distribution drift. We argue this degradation occurs because the raw vector is polysemantic, entangling the refusal signal with core capability circuits and linguistic style.
We introduce Surgical Refusal Ablation (SRA) to distill these steering directions. SRA constructs a registry of independent Concept Atoms representing protected capabilities and stylistic confounds, then uses ridge-regularized spectral residualization to orthogonalize the refusal vector against these directions. This yields a clean refusal direction that targets refusal-relevant structure while minimizing disruption to the model’s semantic geometry.
Across five models (Qwen3-VL and Ministral series), SRA achieves deep refusal reduction (0-2%) with negligible perplexity impact on Wikitext-2 (mean delta PPL approx. 0.02) and minimal distribution drift. Notably, standard ablation on Qwen3-VL-4B induces severe drift (first-token KL = 2.088), whereas SRA maintains the original distribution (KL = 0.044) while achieving the same 0% refusal rate. Using teacher-forced perplexity on GSM8K and MBPP as a high-resolution capability proxy, we show SRA preserves math and code distributions. These results suggest that common “model damage” is often “Ghost Noise,” defined as the spectral bleeding of the dirty refusal direction into capability subspaces.
[LINK]http://arxiv.org/abs/2601.08489v1
[DATE]2026-01-13 20:21:44+08:00
[CATEGORIES]cs.CL
Do You Understand How I Feel?: Towards Verified Empathy in Therapy Chatbots
[AUTHORS]Francesco Dettori, Matteo Forasassi, Lorenzo Veronese, Livia Lestingi, Vincenzo Scotti, Matteo Giovanni Rossi
[ABSTRACT]Conversational agents are increasingly used as support tools along mental therapeutic pathways with significant societal impacts. In particular, empathy is a key non-functional requirement in therapeutic contexts, yet current chatbot development practices provide no systematic means to specify or verify it. This paper envisions a framework integrating natural language processing and formal verification to deliver empathetic therapy chatbots. A Transformer-based model extracts dialogue features, which are then translated into a Stochastic Hybrid Automaton model of dyadic therapy sessions. Empathy-related properties can then be verified through Statistical Model Checking, while strategy synthesis provides guidance for shaping agent behavior. Preliminary results show that the formal model captures therapy dynamics with good fidelity and that ad-hoc strategies improve the probability of satisfying empathy requirements.
[LINK]http://arxiv.org/abs/2601.08477v1
[DATE]2026-01-13 20:08:58+08:00
[CATEGORIES]cs.CL
DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction
[AUTHORS]Jian Chen, Zhenyan Chen, Xuming Hu, Peilin Zhou, Yining Hua, Han Fang, Cissy Hing Yee Choy, Xinmei Ke, Jingfeng Luo, Zixuan Yuan
[ABSTRACT]Natural Language to SQL (NL2SQL) provides a new model-centric paradigm that simplifies database access for non-technical users by converting natural language queries into SQL commands. Recent advancements, particularly those integrating Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) reasoning, have made significant strides in enhancing NL2SQL performance. However, challenges such as inaccurate task decomposition and keyword extraction by LLMs remain major bottlenecks, often leading to errors in SQL generation. While existing datasets aim to mitigate these issues by fine-tuning models, they struggle with over-fragmentation of tasks and lack of domain-specific keyword annotations, limiting their effectiveness. To address these limitations, we present DeKeyNLU, a novel dataset which contains 1,500 meticulously annotated QA pairs aimed at refining task decomposition and enhancing keyword extraction precision for the RAG pipeline. Fine-tuned with DeKeyNLU, we propose DeKeySQL, a RAG-based NL2SQL pipeline that employs three distinct modules for user question understanding, entity retrieval, and generation to improve SQL generation accuracy. We benchmarked multiple model configurations within DeKeySQL RAG pipeline. Experimental results demonstrate that fine-tuning with DeKeyNLU significantly improves SQL generation accuracy on both BIRD (62.31% to 69.10%) and Spider (84.2% to 88.7%) dev datasets.
[LINK]http://arxiv.org/abs/2509.14507v2
[DATE]2026-01-13 19:51:42+08:00
[CATEGORIES]cs.CL
Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation
[AUTHORS]Siyuan Li, Jian Chen, Rui Yao, Xuming Hu, Peilin Zhou, Weihua Qiu, Simin Zhang, Chucheng Dong, Zhiyao Li, Qipeng Xie, Zixuan Yuan
[ABSTRACT]Nowadays, regulatory compliance has become a cornerstone of corporate governance, ensuring adherence to systematic legal frameworks. At its core, financial regulations often comprise highly intricate provisions, layered logical structures, and numerous exceptions, which inevitably result in labor-intensive or comprehension challenges. To mitigate this, recent Regulatory Technology (RegTech) and Large Language Models (LLMs) have gained significant attention in automating the conversion of regulatory text into executable compliance logic. However, their performance remains suboptimal particularly when applied to Chinese-language financial regulations, due to three key limitations: (1) incomplete domain-specific knowledge representation, (2) insufficient hierarchical reasoning capabilities, and (3) failure to maintain temporal and logical coherence. One promising solution is to develop a domain specific and code-oriented datasets for model training. Existing datasets such as LexGLUE, LegalBench, and CODE-ACCORD are often English-focused, domain-mismatched, or lack fine-grained granularity for compliance code generation. To fill these gaps, we present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. Covering 1,159 annotated clauses from 361 regulations across ten categories, each clause is modularly structured with four logical elements-subject, condition, constraint, and contextual information-along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing. To demonstrate utility, we present FinCheck: a pipeline for regulation structuring, code generation, and report generation.
[LINK]http://arxiv.org/abs/2505.19804v3
[DATE]2026-01-13 19:49:38+08:00
[CATEGORIES]cs.CL
JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
[AUTHORS]Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao
[ABSTRACT]Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to Vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality–efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42\% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.
[COMMENTS]16 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.08468v1
[DATE]2026-01-13 19:47:42+08:00
[CATEGORIES]cs.CL cs.LG
An Under-Explored Application for Explainable Multimodal Misogyny Detection in code-mixed Hindi-English
[AUTHORS]Sargam Yadav, Abhishek Kaushik, Kevin Mc Daid
[ABSTRACT]Digital platforms have an ever-expanding user base, and act as a hub for communication, business, and connectivity. However, this has also allowed for the spread of hate speech and misogyny. Artificial intelligence models have emerged as an effective solution for countering online hate speech but are under explored for low resource and code-mixed languages and suffer from a lack of interpretability. Explainable Artificial Intelligence (XAI) can enhance transparency in the decisions of deep learning models, which is crucial for a sensitive domain such as hate speech detection. In this paper, we present a multi-modal and explainable web application for detecting misogyny in text and memes in code-mixed Hindi and English. The system leverages state-of-the-art transformer-based models that support multilingual and multimodal settings. For text-based misogyny identification, the system utilizes XLM-RoBERTa (XLM-R) and multilingual Bidirectional Encoder Representations from Transformers (mBERT) on a dataset of approximately 4,193 comments. For multimodal misogyny identification from memes, the system utilizes mBERT + EfficientNet, and mBERT + ResNET trained on a dataset of approximately 4,218 memes. It also provides feature importance scores using explainability techniques including Shapley Additive Values (SHAP) and Local Interpretable Model Agnostic Explanations (LIME). The application aims to serve as a tool for both researchers and content moderators, to promote further research in the field, combat gender based digital violence, and ensure a safe digital space. The system has been evaluated using human evaluators who provided their responses on Chatbot Usability Questionnaire (CUQ) and User Experience Questionnaire (UEQ) to determine overall usability.
[LINK]http://arxiv.org/abs/2601.08457v1
[DATE]2026-01-13 19:31:55+08:00
[CATEGORIES]cs.CL
Maximizing Local Entropy Where It Matters: Prefix-Aware Localized LLM Unlearning
[AUTHORS]Naixin Zhai, Pengyang Shao, Binbin Zheng, Yonghui Yang, Fei Shen, Long Bai, Xun Yang
[ABSTRACT]Machine unlearning aims to forget sensitive knowledge from Large Language Models (LLMs) while maintaining general utility. However, existing approaches typically treat all tokens in a response indiscriminately and enforce uncertainty over the entire vocabulary. This global treatment results in unnecessary utility degradation and extends optimization to content-agnostic regions. To address these limitations, we propose PALU (Prefix-Aware Localized Unlearning), a framework driven by a local entropy maximization objective across both temporal and vocabulary dimensions. PALU reveals that (i) suppressing the sensitive prefix alone is sufficient to sever the causal generation link, and (ii) flattening only the top-$k$ logits is adequate to maximize uncertainty in the critical subspace. These findings allow PALU to avoid redundant optimization across the full vocabulary and parameter space while minimizing collateral damage to general model performance. Extensive experiments validate that PALU achieves superior forgetting efficacy and utility preservation compared to state-of-the-art baselines.
[LINK]http://arxiv.org/abs/2601.03190v2
[DATE]2026-01-13 19:13:49+08:00
[CATEGORIES]cs.CL
Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
[AUTHORS]Nonghai Zhang, Weitao Ma, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu
[ABSTRACT]Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust “truth centroid” through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.
[LINK]http://arxiv.org/abs/2601.08427v1
[DATE]2026-01-13 18:55:08+08:00
[CATEGORIES]cs.CL cs.LG
Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization
[AUTHORS]Joonho Yang, Seunghyun Yoon, Hwan Chang, Byeongjeong Kim, Hwanhee Lee
[ABSTRACT]Large Language Models (LLMs) have significantly advanced text generation capabilities, including tasks like summarization, often producing coherent and fluent outputs. However, faithfulness to source material remains a significant challenge due to the generation of hallucinations. While extensive research focuses on detecting and reducing these inaccuracies, less attention has been paid to the positional distribution of hallucination within generated text, particularly in long outputs. In this work, we investigate where hallucinations occur in LLM-based long response generation, using long document summarization as a key case study. Focusing on the challenging setting of long context-aware long response generation, we find a consistent and concerning phenomenon: hallucinations tend to concentrate disproportionately in the latter parts of the generated long response. To understand this bias, we explore potential contributing factors related to the dynamics of attention and decoding over long sequences. Furthermore, we investigate methods to mitigate this positional hallucination, aiming to improve faithfulness specifically in the concluding segments of long outputs.
[COMMENTS]26 tables, 7 figures
[LINK]http://arxiv.org/abs/2505.15291v3
[DATE]2026-01-13 18:41:56+08:00
[CATEGORIES]cs.CL
MultiCheck: Strengthening Web Trust with Unified Multimodal Fact Verification
[AUTHORS]Aditya Kishore, Gaurav Kumar, Jasabanta Patro
[ABSTRACT]Misinformation on the web increasingly appears in multimodal forms, combining text, images, and OCR-rendered content in ways that amplify harm to public trust and vulnerable communities. While prior fact-checking systems often rely on unimodal signals or shallow fusion strategies, modern misinformation campaigns operate across modalities and require models that can reason over subtle cross-modal inconsistencies in a transparent and responsible manner. We introduce MultiCheck, a lightweight and interpretable framework for multimodal fact verification that jointly analyzes textual, visual, and OCR evidence. At its core, MultiCheck employs a relational fusion module based on element-wise difference and product operations, allowing for explicit cross-modal interaction modeling with minimal computational overhead. A contrastive alignment objective further helps the model distinguish between supporting and refuting evidence while maintaining a small memory and energy footprint, making it suitable for low-resource deployment. Evaluated on the Factify-2 (5-class) and Mocheg (3-class) benchmarks, MultiCheck achieves huge performance improvement and remains robust under noisy OCR and missing modality conditions. Its efficiency, transparency, and real-world robustness make it well-suited for journalists, civil society organisations, and web integrity efforts working to build a safer and more trustworthy web.
[LINK]http://arxiv.org/abs/2508.05097v3
[DATE]2026-01-13 18:20:04+08:00
[CATEGORIES]cs.CL
Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness
[AUTHORS]Luca Giordano, Simon Razniewski
[ABSTRACT]Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.
[COMMENTS]Accepted and published in Findings of EACL 2026
[LINK]http://arxiv.org/abs/2510.06780v3
[DATE]2026-01-13 18:18:12+08:00
[CATEGORIES]cs.CL
Whose Facts Win? LLM Source Preferences under Knowledge Conflicts
[AUTHORS]Jakob Schuster, Vagrant Gautam, Katja Markert
[ABSTRACT]As large language models (LLMs) are more frequently used in retrieval-augmented generation pipelines, it is increasingly relevant to study their behavior under knowledge conflicts. Thus far, the role of the source of the retrieved information has gone unexamined. We address this gap with a novel framework to investigate how source preferences affect LLM resolution of inter-context knowledge conflicts in English, motivated by interdisciplinary research on credibility. With a comprehensive, tightly-controlled evaluation of 13 open-weight LLMs, we find that LLMs prefer institutionally-corroborated information (e.g., government or newspaper sources) over information from people and social media. However, these source preferences can be reversed by simply repeating information from less credible sources. To mitigate repetition effects and maintain consistent preferences, we propose a novel method that reduces repetition bias by up to 99.8%, while also maintaining at least 88.8% of original preferences. We release all data and code to encourage future work on credibility and source preferences in knowledge-intensive NLP.
[COMMENTS]Data and code: https://github.com/JaSchuste/llm-source-preference
[LINK]http://arxiv.org/abs/2601.03746v2
[DATE]2026-01-13 17:48:40+08:00
[CATEGORIES]cs.CL
Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models
[AUTHORS]Bo Wang, Junzhuo Li, Hong Chen, Yuanlin Chu, Yuxuan Fan, Xuming Hu
[ABSTRACT]Mixture-of-Experts (MoE) architectures decouple model capacity from per-token computation, enabling scaling beyond the computational limits imposed by dense scaling laws. Yet how MoE architectures shape knowledge acquisition during pre-training, and how this process differs from dense architectures, remains unknown. To address this issue, we introduce Gated-LPI (Log-Probability Increase), a neuron-level attribution metric that decomposes log-probability increase across neurons. We present a time-resolved comparison of knowledge acquisition dynamics in MoE and dense architectures, tracking checkpoints over 1.2M training steps (~ 5.0T tokens) and 600K training steps (~ 2.5T tokens), respectively. Our experiments uncover three patterns: (1) Low-entropy backbone. The top approximately 1% of MoE neurons capture over 45% of positive updates, forming a high-utility core, which is absent in the dense baseline. (2) Early consolidation. The MoE model locks into a stable importance profile within < 100K steps, whereas the dense model remains volatile throughout training. (3) Functional robustness. Masking the ten most important MoE attention heads reduces relational HIT@10 by < 10%, compared with > 50% for the dense model, showing that sparsity fosters distributed – rather than brittle – knowledge storage. These patterns collectively demonstrate that sparsity fosters an intrinsically stable and distributed computational backbone from early in training, helping bridge the gap between sparse architectures and training-time interpretability.
[COMMENTS]Accepted by AAAI26
[LINK]http://arxiv.org/abs/2601.08383v1
[DATE]2026-01-13 17:44:00+08:00
[CATEGORIES]cs.CL cs.LG
Memory in the Age of AI Agents
[AUTHORS]Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhenfei Yin, Xiaobin Hu, Yue Liao, Qiankun Li, Kun Wang, Wangchunshu Zhou, Yixin Liu, Dawei Cheng, Qi Zhang, Tao Gui, Shirui Pan, Yan Zhang, Philip Torr, Zhicheng Dou, Ji-Rong Wen, Xuanjing Huang, Yu-Gang Jiang, Shuicheng Yan
[ABSTRACT]Memory has emerged, and will continue to remain, a core capability of foundation model-based agents. As research on agent memory rapidly expands and attracts unprecedented attention, the field has also become increasingly fragmented. Existing works that fall under the umbrella of agent memory often differ substantially in their motivations, implementations, and evaluation protocols, while the proliferation of loosely defined memory terminologies has further obscured conceptual clarity. Traditional taxonomies such as long/short-term memory have proven insufficient to capture the diversity of contemporary agent memory systems. This work aims to provide an up-to-date landscape of current agent memory research. We begin by clearly delineating the scope of agent memory and distinguishing it from related concepts such as LLM memory, retrieval augmented generation (RAG), and context engineering. We then examine agent memory through the unified lenses of forms, functions, and dynamics. From the perspective of forms, we identify three dominant realizations of agent memory, namely token-level, parametric, and latent memory. From the perspective of functions, we propose a finer-grained taxonomy that distinguishes factual, experiential, and working memory. From the perspective of dynamics, we analyze how memory is formed, evolved, and retrieved over time. To support practical development, we compile a comprehensive summary of memory benchmarks and open-source frameworks. Beyond consolidation, we articulate a forward-looking perspective on emerging research frontiers, including memory automation, reinforcement learning integration, multimodal memory, multi-agent memory, and trustworthiness issues. We hope this survey serves not only as a reference for existing work, but also as a conceptual foundation for rethinking memory as a first-class primitive in the design of future agentic intelligence.
[LINK]http://arxiv.org/abs/2512.13564v2
[DATE]2026-01-13 17:33:57+08:00
[CATEGORIES]cs.CL
Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data
[AUTHORS]Jiacheng Liu, Mayi Xu, Qiankun Pi, Wenli Li, Ming Zhong, Yuanyuan Zhu, Mengchi Liu, Tieyun Qian
[ABSTRACT]Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including texts, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs’ ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Yet it remains unclear whether such biases are systematic, which data-level factors drive them, and what internal mechanisms underlie their emergence.
In this paper, we present the first comprehensive study of format bias in LLMs through a three-stage empirical analysis. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors influence these biases. The third stage analyzes how format bias emerges within LLMs’ attention patterns and evaluates a lightweight intervention to test its effectiveness. Our results show that format bias is consistent across model families, driven by information richness, structure quality, and representation type, and is closely associated with attention imbalance within the LLMs. Based on these investigations, we identify three future research directions to reduce format bias: enhancing data pre-processing through format repair and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.
[COMMENTS]Accepted by AAAI 2026, camera ready version
[LINK]http://arxiv.org/abs/2508.15793v3
[DATE]2026-01-13 17:26:14+08:00
[CATEGORIES]cs.CL cs.LG
Information Capacity: Evaluating the Efficiency of Large Language Models via Text Compression
[AUTHORS]Cheng Yuan, Jiawei Shao, Chi Zhang, Xuelong Li
[ABSTRACT]Recent years have witnessed the rapid advancements of large language models (LLMs) and their expanding applications, leading to soaring demands for computational resources. The widespread adoption of test-time scaling further intensifies the tension between model capability and resource consumption, highlighting the importance of inference efficiency. However, a unified metric that accurately reflects an LLM’s efficiency across diverse model sizes and architectures remains absent. Motivated by the correlation between compression and intelligence, we introduce information capacity, a measure of model efficiency based on text compression performance relative to computational complexity. A distinctive feature of information capacity is its incorporation of tokenizer efficiency, which affects inference costs but is often neglected in LLM evaluations. We assess the information capacity of 52 open-source models and observe a consistent information capacity among different-sized models within a series. Experiments on 5 heterogeneous datasets reveal strong linguistic bias in mainstream LLMs. Three major factors of information capacity include tokenizer efficiency, pretraining data, and the mixture-of-experts architecture. Empirical results verify the accuracy of performance prediction across model sizes based on information capacity and show the correlation between information capacity and benchmark scores.
[COMMENTS]Code: https://github.com/TeleAI-AI-Flow/InformationCapacity. Data: https://huggingface.co/datasets/TeleAI-AI-Flow/InformationCapacity
[LINK]http://arxiv.org/abs/2511.08066v5
[DATE]2026-01-13 17:23:25+08:00
[CATEGORIES]cs.CL
Align-GRAG: Anchor and Rationale Guided Dual Alignment for Graph Retrieval-Augmented Generation
[AUTHORS]Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang, Huifeng Guo, Ruiming Tang, Enhong Chen, Tong Xu
[ABSTRACT]Despite the strong abilities, large language models (LLMs) still suffer from hallucinations and reliance on outdated knowledge, raising concerns in knowledge-intensive tasks. Graph-based retrieval-augmented generation (GRAG) enriches LLMs with knowledge by retrieving graphs leveraging relational evidence, but it faces two challenges: structure-coupled irrelevant knowledge introduced by neighbor expansion and structure-reasoning discrepancy between graph embeddings and LLM semantics. We propose \ourmodel, an anchor-and-rationale guided refinement framework to address these challenges. It prompts an LLM to extract anchors and rationale chains, which provide intermediate supervision for \textbf{(1) node-level alignment} that identifies critical nodes and prunes noisy evidence, and \textbf{(2) graph-level alignment} that bridges graph and language semantic spaces via contrastive learning. Extensive experiments on commonsense reasoning, scene graph understanding, and knowledge graph reasoning demonstrate consistent gains over 18 strong baselines, validating the effectiveness of \ourmodel for improving graph-grounded generation. The code can be found in https://anonymous.4open.science/r/Align-GRAG-F3D8/.
[LINK]http://arxiv.org/abs/2505.16237v3
[DATE]2026-01-13 17:14:51+08:00
[CATEGORIES]cs.CL
When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges
[AUTHORS]Sichu Liang, Zhenglin Wang, Jiajia Chu, Pengfei Xia, Hui Zang, Deyu Zhou
[ABSTRACT]Multi-agent LLM systems routinely generate multiple candidate responses that are aggregated by an LLM judge. To reduce the dominant prefill cost in such pipelines, recent work advocates KV cache reuse across partially shared contexts and reports substantial speedups for generation agents. In this work, we show that these efficiency gains do not transfer uniformly to judge-centric inference. Across GSM8K, MMLU, and HumanEval, we find that reuse strategies that are effective for execution agents can severely perturb judge behavior: end-task accuracy may appear stable, yet the judge’s selection becomes highly inconsistent with dense prefill. We quantify this risk using Judge Consistency Rate (JCR) and provide diagnostics showing that reuse systematically weakens cross-candidate attention, especially for later candidate blocks. Our ablation further demonstrates that explicit cross-candidate interaction is crucial for preserving dense-prefill decisions. Overall, our results identify a previously overlooked failure mode of KV cache reuse and highlight judge-centric inference as a distinct regime that demands dedicated, risk-aware system design.
[LINK]http://arxiv.org/abs/2601.08343v1
[DATE]2026-01-13 17:02:58+08:00
[CATEGORIES]cs.CL
CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark
[AUTHORS]Daniil Gurgurov, Yusser Al Ghussin, Tanja Baeumel, Cheng-Ting Chou, Patrick Schramowski, Marius Mosbach, Josef van Genabith, Simon Ostermann
[ABSTRACT]Understanding and controlling the behavior of large language models (LLMs) is an increasingly important topic in multilingual NLP. Beyond prompting or fine-tuning, , i.e.,~manipulating internal representations during inference, has emerged as a more efficient and interpretable technique for adapting models to a target language. Yet, no dedicated benchmarks or evaluation protocols exist to quantify the effectiveness of steering techniques. We introduce CLaS-Bench, a lightweight parallel-question benchmark for evaluating language-forcing behavior in LLMs across 32 languages, enabling systematic evaluation of multilingual steering methods. We evaluate a broad array of steering techniques, including residual-stream DiffMean interventions, probe-derived directions, language-specific neurons, PCA/LDA vectors, Sparse Autoencoders, and prompting baselines. Steering performance is measured along two axes: language control and semantic relevance, combined into a single harmonic-mean steering score. We find that across languages simple residual-based DiffMean method consistently outperforms all other methods. Moreover, a layer-wise analysis reveals that language-specific structure emerges predominantly in later layers and steering directions cluster based on language family. CLaS-Bench is the first standardized benchmark for multilingual steering, enabling both rigorous scientific analysis of language representations and practical evaluation of steering as a low-cost adaptation alternative.
[COMMENTS]pre-print
[LINK]http://arxiv.org/abs/2601.08331v1
[DATE]2026-01-13 16:42:03+08:00
[CATEGORIES]cs.CL
AgriAgent: Contract-Driven Planning and Capability-Aware Tool Orchestration in Real-World Agriculture
[AUTHORS]Bo Yang, Yu Zhang, Yunkui Chen, Lanfei Feng, Xiao Xu, Nueraili Aierken, Shijian Li
[ABSTRACT]Intelligent agent systems in real-world agricultural scenarios must handle diverse tasks under multimodal inputs, ranging from lightweight information understanding to complex multi-step execution. However, most existing approaches rely on a unified execution paradigm, which struggles to accommodate large variations in task complexity and incomplete tool availability commonly observed in agricultural environments. To address this challenge, we propose AgriAgent, a two-level agent framework for real-world agriculture. AgriAgent adopts a hierarchical execution strategy based on task complexity: simple tasks are handled through direct reasoning by modality-specific agents, while complex tasks trigger a contract-driven planning mechanism that formulates tasks as capability requirements and performs capability-aware tool orchestration and dynamic tool generation, enabling multi-step and verifiable execution with failure recovery. Experimental results show that AgriAgent achieves higher execution success rates and robustness on complex tasks compared to existing tool-centric agent baselines that rely on unified execution paradigms. All code, data will be released at after our work be accepted to promote reproducible research.
[LINK]http://arxiv.org/abs/2601.08308v1
[DATE]2026-01-13 15:53:09+08:00
[CATEGORIES]cs.CL
Hacking Neural Evaluation Metrics with Single Hub Text
[AUTHORS]Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai
[ABSTRACT]Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT’24 English-to-Japanese (En–Ja) and English-to-German (En–De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja–En and De–En.
[COMMENTS]Accepted at EACL2026 main
[LINK]http://arxiv.org/abs/2512.16323v2
[DATE]2026-01-13 15:48:36+08:00
[CATEGORIES]cs.CL
Demystifying the Slash Pattern in Attention: The Role of RoPE
[AUTHORS]Yuan Cheng, Fengzhuo Zhang, Yunlong Hou, Cunxiao Du, Chao Du, Tianyu Pang, Aixin Sun, Zhuoran Yang
[ABSTRACT]Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $Δ$-th sub-diagonal for some offset $Δ$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) Queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. Particularly, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs. The SDHs generalize to out-of-distribution prompts.
[LINK]http://arxiv.org/abs/2601.08297v1
[DATE]2026-01-13 15:40:57+08:00
[CATEGORIES]cs.LG cs.CL
D$^2$Plan: Dual-Agent Dynamic Global Planning for Complex Retrieval-Augmented Reasoning
[AUTHORS]Kangcheng Luo, Tinglang Wu, Yansong Feng
[LINK]http://arxiv.org/abs/2601.08282v1
[DATE]2026-01-13 15:17:51+08:00
[CATEGORIES]cs.CL
Discovery and Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees
[AUTHORS]Kun Li, Zenan Xu, Junan Li, Zengrui Jin, Jinghao Deng, Zexuan Qiu, Bo Zhou
[ABSTRACT]Tool-Integrated Reasoning has emerged as a key paradigm to augment Large Language Models (LLMs) with computational capabilities, yet integrating tool-use into long Chain-of-Thought (long CoT) remains underexplored, largely due to the scarcity of training data and the challenge of integrating tool-use without compromising the model’s intrinsic long-chain reasoning. In this paper, we introduce DART (Discovery And Reinforcement of Tool-Integrated Reasoning Chains via Rollout Trees), a reinforcement learning framework that enables spontaneous tool-use during long CoT reasoning without human annotation. DART operates by constructing dynamic rollout trees during training to discover valid tool-use opportunities, branching out at promising positions to explore diverse tool-integrated trajectories. Subsequently, a tree-based process advantage estimation identifies and credits specific sub-trajectories where tool invocation positively contributes to the solution, effectively reinforcing these beneficial behaviors. Extensive experiments on challenging benchmarks like AIME and GPQA-Diamond demonstrate that DART significantly outperforms existing methods, successfully harmonizing tool execution with long CoT reasoning.
[LINK]http://arxiv.org/abs/2601.08274v1
[DATE]2026-01-13 15:06:21+08:00
[CATEGORIES]cs.CL
Stuffed Mamba: Oversized States Lead to the Inability to Forget
[AUTHORS]Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, Maosong Sun
[ABSTRACT]Recent advancements in recurrent architectures, such as Mamba and RWKV, have showcased strong language capabilities. Unlike transformer-based models, these architectures encode all contextual information into a fixed-size state, leading to great inference efficiency. However, this approach can cause information interference, where different token data conflicts, resulting in performance degradation and incoherent outputs beyond a certain context length. To prevent this, most RNNs incorporate mechanisms designed to “forget” earlier tokens. In this paper, we reveal that Mamba-based models struggle to effectively forget earlier tokens even with built-in forgetting mechanisms. We demonstrate that this issue stems from training on contexts that are too short for the state size, enabling the model to perform well without needing to learn how to forget. Then, we show that the minimum training length required for the model to learn forgetting scales linearly with the state size, and the maximum context length for accurate retrieval of a 5-digit passkey scales exponentially with the state size, indicating that the model retains some information beyond the point where forgetting begins. These findings highlight a critical limitation in current RNN architectures and provide valuable insights for improving long-context modeling. Our work suggests that future RNN designs must account for the interplay between state size, training length, and forgetting mechanisms to achieve robust performance in long-context tasks.
[COMMENTS]COLM 2025
[LINK]http://arxiv.org/abs/2410.07145v4
[DATE]2026-01-13 15:01:24+08:00
[CATEGORIES]cs.CL cs.LG
Generative Digital Twins: Vision-Language Simulation Models for Executable Industrial Systems
[AUTHORS]YuChe Hsu, AnJui Wang, TsaiChing Ni, YuanFu Yang
[ABSTRACT]We propose a Vision-Language Simulation Model (VLSM) that unifies visual and textual understanding to synthesize executable FlexScript from layout sketches and natural-language prompts, enabling cross-modal reasoning for industrial simulation systems. To support this new paradigm, the study constructs the first large-scale dataset for generative digital twins, comprising over 120,000 prompt-sketch-code triplets that enable multimodal learning between textual descriptions, spatial structures, and simulation logic. In parallel, three novel evaluation metrics, Structural Validity Rate (SVR), Parameter Match Rate (PMR), and Execution Success Rate (ESR), are proposed specifically for this task to comprehensively evaluate structural integrity, parameter fidelity, and simulator executability. Through systematic ablation across vision encoders, connectors, and code-pretrained language backbones, the proposed models achieve near-perfect structural accuracy and high execution robustness. This work establishes a foundation for generative digital twins that integrate visual reasoning and language understanding into executable industrial simulation systems. Project page: https://danielhsu2014.github.io/GDT-VLSM-project/
[COMMENTS]10 pages, 9 figures
[LINK]http://arxiv.org/abs/2512.20387v4
[DATE]2026-01-13 14:58:35+08:00
[CATEGORIES]cs.CL
Med-CoReasoner: Reducing Language Disparities in Medical Reasoning via Language-Informed Co-Reasoning
[AUTHORS]Fan Gao, Sherry T. Tong, Jiwoong Sohn, Jiahao Huang, Junfeng Jiang, Ding Xia, Piyalitt Ittichaiwong, Kanyakorn Veerakanjana, Hyunjae Kim, Qingyu Chen, Edison Marrese Taylor, Kazuma Kobayashi, Akkiko Aizawa, Irene Li
[ABSTRACT]While reasoning-enhanced large language models perform strongly on English medical tasks, a persistent multilingual gap remains, with substantially weaker reasoning in local languages, limiting equitable global medical deployment. To bridge this gap, we introduce Med-CoReasoner, a language-informed co-reasoning framework that elicits parallel English and local-language reasoning, abstracts them into structured concepts, and integrates local clinical knowledge into an English logical scaffold via concept-level alignment and retrieval. This design combines the structural robustness of English reasoning with the practice-grounded expertise encoded in local languages. To evaluate multilingual medical reasoning beyond multiple-choice settings, we construct MultiMed-X, a benchmark covering seven languages with expert-annotated long-form question answering and natural language inference tasks, comprising 350 instances per language. Experiments across three benchmarks show that Med-CoReasoner improves multilingual reasoning performance by an average of 5%, with particularly substantial gains in low-resource languages. Moreover, model distillation and expert evaluation analysis further confirm that Med-CoReasoner produces clinically sound and culturally grounded reasoning traces.
[LINK]http://arxiv.org/abs/2601.08267v1
[DATE]2026-01-13 14:51:40+08:00
[CATEGORIES]cs.CL
Measuring Iterative Temporal Reasoning with Time Puzzles
[AUTHORS]Zhengxiang Wang, Zeyu Dong
[ABSTRACT]We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations, admits one or multiple valid solution dates, and is algorithmically generated for controlled, dynamic, and continual evaluation. Across 13 diverse LLMs, Time Puzzles well distinguishes their iterative temporal reasoning capabilities and remains challenging without tools: GPT-5 reaches only 49.3% accuracy and all other models stay below 31%, despite the dataset’s simplicity. Web search consistently yields substantial gains and using code interpreter shows mixed effects, but all models perform much better when constraints are rewritten with explicit dates, revealing a gap in reliable tool use. Overall, Time Puzzles presents a simple, cost-effective diagnostic for tool-augmented iterative temporal reasoning.
[LINK]http://arxiv.org/abs/2601.07148v2
[DATE]2026-01-13 14:40:50+08:00
[CATEGORIES]cs.CL
AutoContext: Instance-Level Context Learning for LLM Agents
[AUTHORS]Kuntai Cai, Juncheng Liu, Xianglin Yang, Zhaojie Niu, Xiaokui Xiao, Xing Chen
[ABSTRACT]Current LLM agents typically lack instance-level context, which comprises concrete facts such as environment structure, system configurations, and local mechanics. Consequently, existing methods are forced to intertwine exploration with task execution. This coupling leads to redundant interactions and fragile decision-making, as agents must repeatedly rediscover the same information for every new task. To address this, we introduce AutoContext, a method that decouples exploration from task solving. AutoContext performs a systematic, one-off exploration to construct a reusable knowledge graph for each environment instance. This structured context allows off-the-shelf agents to access necessary facts directly, eliminating redundant exploration. Experiments across TextWorld, ALFWorld, Crafter, and InterCode-Bash demonstrate substantial gains: for example, the success rate of a ReAct agent on TextWorld improves from 37% to 95%, highlighting the critical role of structured instance context in efficient agentic systems.
[LINK]http://arxiv.org/abs/2510.02369v3
[DATE]2026-01-13 14:31:54+08:00
[CATEGORIES]cs.CL
RAG-R1: Incentivizing the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
[AUTHORS]Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
[ABSTRACT]Large Language Models (LLMs), despite their remarkable capabilities, are prone to generating hallucinated or outdated content due to their static internal knowledge. While Retrieval-Augmented Generation (RAG) integrated with Reinforcement Learning (RL) offers a solution, these methods are fundamentally constrained by a single-query mode, leading to prohibitive latency and inherent brittleness. To overcome these limitations, we introduce RAG-R1, a novel two-stage training framework centered around multi-query parallelism. Our framework enables LLMs to adaptively leverage internal and external knowledge during the reasoning process while transitioning from the single-query mode to multi-query parallelism. This architectural shift bolsters reasoning robustness while significantly reducing inference latency. Extensive experiments on seven question-answering benchmarks confirm the superiority of our method, which outperforms the strongest baseline by up to 13.7% and decreases inference time by 11.1%.
[LINK]http://arxiv.org/abs/2507.02962v6
[DATE]2026-01-13 14:28:13+08:00
[CATEGORIES]cs.CL
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought
[AUTHORS]Guijin Son, Donghun Yang, Hitesh Laxmichand Patel, Amit Agarwal, Hyunwoo Ko, Chanuk Lim, Srikant Panda, Minhyuk Kim, Nikunj Drolia, Dasol Choi, Kyong-Ha Lee, Youngjae Yu
[ABSTRACT]Recent frontier models employ long chain-of-thought reasoning to explore solution spaces in context and achieve stonger performance. While many works study distillation to build smaller yet capable models, most focus on English and little is known about language-specific reasoning. To bridge this gap, we first introduct Language-Mixed CoT, a reasoning schema that switches between English and a target language, using English as an anchor to excel in reasoning while minimizing translation artificats. As a Korean case study, we curate Yi-Sang: 5.79M native-Korean prompts from web Q&A, exams, STEM, and code; 3.7M long reasoning traces generated from Qwen3-32B; and a targeted 260k high-yield subset. We train ninve models (4B-35B) across six families (Qwen2.5, Llama-3.1, Gemma-3, etc). Our best model, KO-REAson-35B, achieves state-of-the-art performance, with the highest overall average score (64.0 \pm 25), ranking first on 5/9 benchmarks and second on the remainder. Samller and mid-sized models also benefit substantially, with an average improvement of +18.6 points across teh evaluated nine benchmarks. Ablations show Language-Mixed CoT is more effective than monolingual CoT, also resulting in cross-lingual and mult-modal performance gains. We release our data-curation pipeline, evaluation system, datasets, and models to advance research on language-specific reasoning. Data and model collection: https://huggingface.co/KOREAson.
[COMMENTS]Work in Progress
[LINK]http://arxiv.org/abs/2510.04230v2
[DATE]2026-01-13 13:48:55+08:00
[CATEGORIES]cs.CL
User-Oriented Multi-Turn Dialogue Generation with Tool Use at scale
[AUTHORS]Jungho Cho, Minbyul Jeong, Sungrae Park
[ABSTRACT]The recent paradigm shift toward large reasoning models (LRMs) as autonomous agents has intensified the demand for sophisticated, multi-turn tool-use capabilities. Yet, existing datasets and data-generation approaches are limited by static, predefined toolsets that cannot scale to the complexity of open-ended human-agent collaboration. To address this, we initially developed a framework for automated task-oriented multi-turn dialogue generation at scale, utilizing an LRM-based simulator to dynamically generate high-value, domain-specific tools to solve specified tasks. However, we observe that a purely task-oriented design often results in “solely task-solving” trajectories, where the agent completes the objective with minimal interaction, failing to generate the high turn-count conversations seen in realistic scenarios. To bridge this gap, we shift toward a user-oriented simulation paradigm. By decoupling task generation from a dedicated user simulator that mimics human behavioral rules - such as incremental request-making and turn-by-turn feedback - we facilitate more authentic, extended multi-turn dialogues that reflect the iterative nature of real-world problem solving. Our generation pipeline operates as a versatile, plug-and-play module capable of initiating generation from any state, ensuring high scalability in producing extended tool-use data. Furthermore, by facilitating multiple task completions within a single trajectory, it yields a high-density dataset that reflects the multifaceted demands of real-world human-agent interaction.
[LINK]http://arxiv.org/abs/2601.08225v1
[DATE]2026-01-13 13:14:09+08:00
[CATEGORIES]cs.CL
Towards Principled Design of Mixture-of-Experts Language Models under Memory and Inference Constraints
[AUTHORS]Seng Pei Liew, Kenta Shinzato, Yuyang Dong
[ABSTRACT]Modern Mixture-of-Experts (MoE) language models are designed based on total parameters (memory footprint) and active parameters (inference cost). However, we find these two factors alone are insufficient to describe an optimal architecture. Through a systematic study, we demonstrate that MoE performance is primarily determined by total parameters ($N_{total}$) and expert sparsity ($s:=n_{exp}/n_{topk}$).
Moreover, $n_{exp}$ and $n_{topk}$ do not “cancel out” within the sparsity ratio; instead, a larger total number of experts slightly penalizes performance by forcing a reduction in core model dimensions (depth and width) to meet memory constraints. This motivates a simple principle for MoE design which maximizes $N_{total}$ while minimizing $s$ (maximizing $n_{topk}$) and $n_{exp}$ under the given constraints. Our findings provide a robust framework for resolving architectural ambiguity and guiding MoE design.
[COMMENTS]10 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.08215v1
[DATE]2026-01-13 12:42:45+08:00
[CATEGORIES]cs.CL cs.LG
Generation-Augmented Generation: A Plug-and-Play Framework for Private Knowledge Injection in Large Language Models
[AUTHORS]Rongji Li, Jian Xu, Xueqing Chen, Yisheng Yang, Jiayi Wang, Xingyu Chen, Chunyu Xie, Dawei Leng, Xu-Yao Zhang
[ABSTRACT]In domains such as biomedicine, materials, and finance, high-stakes deployment of large language models (LLMs) requires injecting private, domain-specific knowledge that is proprietary, fast-evolving, and under-represented in public pretraining. However, the two dominant paradigms for private knowledge injection each have pronounced drawbacks: fine-tuning is expensive to iterate, and continual updates risk catastrophic forgetting and general-capability regression; retrieval-augmented generation (RAG) keeps the base model intact but is brittle in specialized private corpora due to chunk-induced evidence fragmentation, retrieval drift, and long-context pressure that yields query-dependent prompt inflation. Inspired by how multimodal LLMs align heterogeneous modalities into a shared semantic space, we propose Generation-Augmented Generation (GAG), which treats private expertise as an additional expert modality and injects it via a compact, representation-level interface aligned to the frozen base model, avoiding prompt-time evidence serialization while enabling plug-and-play specialization and scalable multi-domain composition with reliable selective activation. Across two private scientific QA benchmarks (immunology adjuvant and catalytic materials) and mixed-domain evaluations, GAG improves specialist performance over strong RAG baselines by 15.34% and 14.86% on the two benchmarks, respectively, while maintaining performance on six open general benchmarks and enabling near-oracle selective activation for scalable multi-domain deployment.
[LINK]http://arxiv.org/abs/2601.08209v1
[DATE]2026-01-13 12:23:36+08:00
[CATEGORIES]cs.CL
Generating Text from Uniform Meaning Representation
[AUTHORS]Emma Markle, Reihaneh Iranmanesh, Shira Wein
[ABSTRACT]Uniform Meaning Representation (UMR) is a recently developed graph-based semantic representation, which expands on Abstract Meaning Representation (AMR) in a number of ways, in particular through the inclusion of document-level information and multilingual flexibility. In order to effectively adopt and leverage UMR for downstream tasks, efforts must be placed toward developing a UMR technological ecosystem. Though only a small amount of UMR annotations have been produced to date, in this work, we investigate the first approaches to producing text from multilingual UMR graphs. Exploiting the structural similarity between UMR and AMR graphs and the wide availability of AMR technologies, we introduce (1) a baseline approach which passes UMR graphs to AMR-to-text generation models, (2) a pipeline conversion of UMR to AMR, then using AMR-to-text generation models, and (3) a fine-tuning approach for both foundation models and AMR-to-text generation models with UMR data. Our best performing models achieve multilingual BERTscores of 0.825 for English and 0.882 for Chinese, a promising indication of the effectiveness of fine-tuning approaches for UMR-to-text generation even with limited UMR data.
[COMMENTS]Published and Outstanding Paper Award recipient at IJCNLP-AACL 2025
[LINK]http://arxiv.org/abs/2502.11973v3
[DATE]2026-01-13 12:10:04+08:00
[CATEGORIES]cs.CL
Triplets Better Than Pairs: Towards Stable and Effective Self-Play Fine-Tuning for LLMs
[AUTHORS]Yibo Wang, Hai-Long Sun, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
[ABSTRACT]Recently, self-play fine-tuning (SPIN) has been proposed to adapt large language models to downstream applications with scarce expert-annotated data, by iteratively generating synthetic responses from the model itself. However, SPIN is designed to optimize the current reward advantages of annotated responses over synthetic responses at hand, which may gradually vanish during iterations, leading to unstable optimization. Moreover, the utilization of reference policy induces a misalignment issue between the reward formulation for training and the metric for generation. To address these limitations, we propose a novel Triplet-based Self-Play fIne-tuNing (T-SPIN) method that integrates two key designs. First, beyond current advantages, T-SPIN additionally incorporates historical advantages between iteratively generated responses and proto-synthetic responses produced by the initial policy. Even if the current advantages diminish, historical advantages remain effective, stabilizing the overall optimization. Second, T-SPIN introduces the entropy constraint into the self-play framework, which is theoretically justified to support reference-free fine-tuning, eliminating the training-generation discrepancy. Empirical results on various tasks demonstrate not only the superior performance of T-SPIN over SPIN, but also its stable evolution during iterations. Remarkably, compared to supervised fine-tuning, T-SPIN achieves comparable or even better performance with only 25% samples, highlighting its effectiveness when faced with scarce annotated data.
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2601.08198v1
[DATE]2026-01-13 12:09:49+08:00
[CATEGORIES]cs.CL cs.LG
Evaluating Implicit Regulatory Compliance in LLM Tool Invocation via Logic-Guided Synthesis
[AUTHORS]Da Song, Yuheng Huang, Boqi Chen, Tianshuo Cong, Randy Goebel, Lei Ma, Foutse Khomh
[ABSTRACT]The integration of large language models (LLMs) into autonomous agents has enabled complex tool use, yet in high-stakes domains, these systems must strictly adhere to regulatory standards beyond simple functional correctness. However, existing benchmarks often overlook implicit regulatory compliance, thus failing to evaluate whether LLMs can autonomously enforce mandatory safety constraints. To fill this gap, we introduce LogiSafetyGen, a framework that converts unstructured regulations into Linear Temporal Logic oracles and employs logic-guided fuzzing to synthesize valid, safety-critical traces. Building on this framework, we construct LogiSafetyBench, a benchmark comprising 240 human-verified tasks that require LLMs to generate Python programs that satisfy both functional objectives and latent compliance rules. Evaluations of 13 state-of-the-art (SOTA) LLMs reveal that larger models, despite achieving better functional correctness, frequently prioritize task completion over safety, which results in non-compliant behavior.
[COMMENTS]11 pages, 3 figures
[LINK]http://arxiv.org/abs/2601.08196v1
[DATE]2026-01-13 11:55:18+08:00
[CATEGORIES]cs.CL
The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis
[AUTHORS]Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, Xueqi Cheng
[ABSTRACT]Test-time scaling via explicit reasoning trajectories significantly boosts large language model (LLM) performance but often triggers overthinking. To explore this, we analyze reasoning through two lenses: Reasoning Length Dynamics, which reveals a compensatory trade-off between thinking and answer content length that eventually leads to thinking redundancy, and Reasoning Semantic Dynamics, which identifies semantic convergence and repetitive oscillations. These dynamics uncover an instance-specific Reasoning Completion Point (RCP), beyond which computation continues without further performance gain. Since the RCP varies across instances, we propose a Reasoning Completion Point Detector (RCPD), an inference-time early-exit method that identifies the RCP by monitoring the rank dynamics of termination tokens (e.g., </think>). Across AIME and GPQA benchmarks using Qwen3 and DeepSeek-R1, RCPD reduces token usage by up to 44% while preserving accuracy, offering a principled approach to efficient test-time scaling.
[LINK]http://arxiv.org/abs/2508.17627v2
[DATE]2026-01-13 11:40:21+08:00
[CATEGORIES]cs.CL
Are Emotions Arranged in a Circle? Geometric Analysis of Emotion Representations via Hyperspherical Contrastive Learning
[AUTHORS]Yusuke Yamauchi, Akiko Aizawa
[ABSTRACT]Psychological research has long utilized circumplex models to structure emotions, placing similar emotions adjacently and opposing ones diagonally. Although frequently used to interpret deep learning representations, these models are rarely directly incorporated into the representation learning of language models, leaving their geometric validity unexplored. This paper proposes a method to induce circular emotion representations within language model embeddings via contrastive learning on a hypersphere. We show that while this circular alignment offers superior interpretability and robustness against dimensionality reduction, it underperforms compared to conventional designs in high-dimensional settings and fine-grained classification. Our findings elucidate the trade-offs involved in applying psychological circumplex models to deep learning architectures.
[LINK]http://arxiv.org/abs/2601.06575v2
[DATE]2026-01-13 11:30:23+08:00
[CATEGORIES]cs.CL
VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
[AUTHORS]Heyang Liu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yiqi Li, Yixuan Hou, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
[ABSTRACT]Speech large language models (SpeechLLMs) have extended human-machine interactions from the text modality to the dynamic speech domain. Spoken dialogues convey diverse information, including semantic concepts, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models lack instances mimicking real scenarios and predominantly focus on the performance of distinct aspects, lacking a comprehensive comparison of critical capabilities between current routines. To address this gap, we propose VocalBench to assess the speech conversational abilities, comprising around 24k carefully curated instances of both English and Mandarin across four key dimensions - semantic quality, acoustic performance, conversational abilities, and robustness, covering 14 user-oriented characters. Experiments on 27 mainstream models reveal the common challenges for current routes, and highlight the need for new insights into next-generation speech interactive systems.
[LINK]http://arxiv.org/abs/2505.15727v3
[DATE]2026-01-13 11:25:15+08:00
[CATEGORIES]cs.CL
Prompt-Based Clarity Evaluation and Topic Detection in Political Question Answering
[AUTHORS]Lavanya Prahallad, Sai Utkarsh Choudarypally, Pragna Prahallad, Pranathi Prahallad
[ABSTRACT]Automatic evaluation of large language model (LLM) responses requires not only factual correctness but also clarity, particularly in political question-answering. While recent datasets provide human annotations for clarity and evasion, the impact of prompt design on automatic clarity evaluation remains underexplored. In this paper, we study prompt-based clarity evaluation using the CLARITY dataset from the SemEval 2026 shared task. We compare a GPT-3.5 baseline provided with the dataset against GPT-5.2 evaluated under three prompting strategies: simple prompting, chain-of-thought prompting, and chain-of-thought with few-shot examples. Model predictions are evaluated against human annotations using accuracy and class-wise metrics for clarity and evasion, along with hierarchical exact match. Results show that GPT-5.2 consistently outperforms the GPT-3.5 baseline on clarity prediction, with accuracy improving from 56 percent to 63 percent under chain-of-thought with few-shot prompting. Chain-of-thought prompting yields the highest evasion accuracy at 34 percent, though improvements are less stable across fine-grained evasion categories. We further evaluate topic identification and find that reasoning-based prompting improves accuracy from 60 percent to 74 percent relative to human annotations. Overall, our findings indicate that prompt design reliably improves high-level clarity evaluation, while fine-grained evasion and topic detection remain challenging despite structured reasoning prompts.
[COMMENTS]6 pages, 6 tables
[LINK]http://arxiv.org/abs/2601.08176v1
[DATE]2026-01-13 11:10:58+08:00
[CATEGORIES]cs.CL
Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following
[AUTHORS]Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
[ABSTRACT]Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available at https://github.com/Rainier-rq/verl-if
[LINK]http://arxiv.org/abs/2510.14420v3
[DATE]2026-01-13 10:52:28+08:00
[CATEGORIES]cs.CL
AlignSAE: Concept-Aligned Sparse Autoencoders
[AUTHORS]Minglai Yang, Xinyu Guo, Zhengliang Shi, Jinhe Bi, Steven Bethard, Mihai Surdeanu, Liangming Pan
[ABSTRACT]Large Language Models (LLMs) encode factual knowledge within hidden parametric spaces that are difficult to inspect or control. While Sparse Autoencoders (SAEs) can decompose hidden activations into more fine-grained, interpretable features, they often struggle to reliably align these features with human-defined concepts, resulting in entangled and distributed feature representations. To address this, we introduce AlignSAE, a method that aligns SAE features with a predefined ontology through a “pre-train, then post-train” curriculum. After an initial unsupervised training phase, we apply supervised post-training to bind specific concepts to dedicated latent slots while preserving the remaining capacity for general reconstruction. This separation creates an interpretable interface where specific concepts can be inspected and controlled without interference from unrelated features. Empirical results demonstrate that AlignSAE enables precise causal interventions, such as reliable “concept swaps”, by targeting single, semantically aligned slots, and further supports multi-hop reasoning and a mechanistic probe of grokking-like generalization dynamics.
[COMMENTS]23 pages, 16 figures, 7 tables
[LINK]http://arxiv.org/abs/2512.02004v3
[DATE]2026-01-13 10:52:14+08:00
[CATEGORIES]cs.LG cs.CL
WISE-Flow: Workflow-Induced Structured Experience for Self-Evolving Conversational Service Agents
[AUTHORS]Yuqing Zhou, Zhuoer Wang, Jie Yuan, Hong Wang, Samson Koelle, Ziwei Zhu, Wei Niu
[ABSTRACT]Large language model (LLM)-based agents are widely deployed in user-facing services but remain error-prone in new tasks, tend to repeat the same failure patterns, and show substantial run-to-run variability. Fixing failures via environment-specific training or manual patching is costly and hard to scale. To enable self-evolving agents in user-facing service environments, we propose WISE-Flow, a workflow-centric framework that converts historical service interactions into reusable procedural experience by inducing workflows with prerequisite-augmented action blocks. At deployment, WISE-Flow aligns the agent’s execution trajectory to retrieved workflows and performs prerequisite-aware feasibility reasoning to achieve state-grounded next actions. Experiments on ToolSandbox and $τ^2$-bench show consistent improvement across base models.
[COMMENTS]19 pages
[LINK]http://arxiv.org/abs/2601.08158v1
[DATE]2026-01-13 10:43:41+08:00
[CATEGORIES]cs.CL
Mechanisms are Transferable: Data-Efficient Low-Resource Adaptation via Circuit-Targeted Supervised Fine-Tuning
[AUTHORS]Khumaisa Nur’aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya
[ABSTRACT]Adapting LLMs to low-resource languages is difficult: labeled data is scarce, full-model fine-tuning is unstable, and continued cross-lingual tuning can cause catastrophic forgetting. We propose Circuit-Targeted Supervised Fine-Tuning (CT-SFT): a counterfactual-free adaptation of CD-T (Contextual Decomposition Transformer) that uses a label-balanced mean baseline and task-directional relevance scoring to identify a sparse set of task-relevant attention heads in a proxy-language checkpoint, then transfer learns to a target language by updating only those heads (plus LayerNorm) via head-level gradient masking. Across NusaX-Senti and XNLI, CT-SFT improves cross-lingual accuracy over continued full fine-tuning while updating only a small subset of model parameters. We find an editing-preserving trade-off: harder transfers favor editing circuit heads, while easier transfers often favor near-zero (i.e., low-relevance heads) updates, preserving the source mechanism. CT-SFT also substantially reduces catastrophic forgetting, preserving proxy/source-language competence during transfer.
[LINK]http://arxiv.org/abs/2601.08146v1
[DATE]2026-01-13 10:20:53+08:00
[CATEGORIES]cs.CL cs.LG
Attention Projection Mixing and Exogenous Anchors
[AUTHORS]Jonathan Su
[ABSTRACT]Transformers that reuse early-layer attention projections as residuals face a fundamental tension: the first layer must simultaneously serve as a stable reference for all deeper layers and as an effective computational block. To resolve this, we propose ExoFormer, which learns dedicated exogenous anchor projections outside the sequential layer stack, decoupling the anchor role from computational refinement. Through a unified normalized mixing framework (studying different coefficient granularities: elementwise, headwise, scalar) across all attention pathways (queries, keys, values, and gate logits), ExoFormer variants consistently outperform their internal-anchor counterparts. Moreover, the dynamic variant achieves a 2.13-point increase in downstream accuracy over the baseline and demonstrates superior data efficiency, matching baseline validation loss with 1.84x fewer tokens. ExoFormer also achieves a 2x reduction in attention sink compared to standard Gated Attention. Paradoxically, all ExoFormer variants exhibit signs of representation collapse. We explain this via an Offloading Hypothesis: external anchors preserve essential token identity, allowing layers to specialize exclusively in computational refinement. We release codes and models to facilitate future research.
[LINK]http://arxiv.org/abs/2601.08131v1
[DATE]2026-01-13 09:52:19+08:00
[CATEGORIES]cs.CL
Debiasing Large Language Models via Adaptive Causal Prompting with Sketch-of-Thought
[AUTHORS]Bowen Li, Ziqi Xu, Jing Ren, Renqiang Luo, Xikun Zhang, Xiuzhen Zhang, Yongli Ren, Feng Xia
[ABSTRACT]Despite notable advancements in prompting methods for Large Language Models (LLMs), such as Chain-of-Thought (CoT), existing strategies still suffer from excessive token usage and limited generalisability across diverse reasoning tasks. To address these limitations, we propose an Adaptive Causal Prompting with Sketch-of-Thought (ACPS) framework, which leverages structural causal models to infer the causal effect of a query on its answer and adaptively select an appropriate intervention (i.e., standard front-door and conditional front-door adjustments). This design enables generalisable causal reasoning across heterogeneous tasks without task-specific retraining. By replacing verbose CoT with concise Sketch-of-Thought, ACPS enables efficient reasoning that significantly reduces token usage and inference cost. Extensive experiments on multiple reasoning benchmarks and LLMs demonstrate that ACPS consistently outperforms existing prompting baselines in terms of accuracy, robustness, and computational efficiency.
[COMMENTS]Accepted by Findings of EACL 2026
[LINK]http://arxiv.org/abs/2601.08108v1
[DATE]2026-01-13 08:58:43+08:00
[CATEGORIES]cs.CL
On the Entropy Calibration of Language Models
[AUTHORS]Steven Cao, Gregory Valiant, Percy Liang
[ABSTRACT]We study the problem of entropy calibration, which asks whether a language model’s entropy over generations matches its log loss on human text. Past work found that models are miscalibrated, with entropy per step increasing as generations grow longer, due to error accumulation. To calibrate the model and improve text quality, it has become standard practice to truncate the distribution, but this approach reduces output diversity, which we would like to avoid. Therefore, in this paper, we ask: does miscalibration improve automatically with scale, and if not, is it theoretically possible to calibrate without tradeoffs? To build intuition, we first study a simplified theoretical setting to characterize the scaling behavior of miscalibration with respect to dataset size. We find that the rate of scaling depends on the power law exponent of the data distribution – in particular, for a power law exponent close to 1, the scaling exponent is close to 0, meaning that miscalibration improves very slowly with scale. Next, we measure miscalibration empirically in language models ranging from 0.5B to 70B parameters. We find that the observed scaling behavior is similar to what is predicted theoretically: our fitted scaling exponents for text are close to 0, meaning that larger models accumulate error at a similar rate as smaller ones. This scaling (or, lack thereof) provides one explanation for why we sample from larger models with similar amounts of truncation as smaller models, even though the larger models are of higher quality. However, truncation is not a satisfying solution because it comes at the cost of increased log loss. In theory, is it even possible to reduce entropy while preserving log loss? We prove that it is possible, if we assume access to a black box which can fit models to predict the future entropy of text.
[COMMENTS]Neurips 2025
[LINK]http://arxiv.org/abs/2511.11966v2
[DATE]2026-01-13 08:58:19+08:00
[CATEGORIES]cs.CL cs.LG
Query Suggestion for Retrieval-Augmented Generation via Dynamic In-Context Learning
[AUTHORS]Fabian Spaeh, Tianyi Chen, Chen-Hao Chiang, Bin Shen
[ABSTRACT]Retrieval-augmented generation with tool-calling agents (agentic RAG) has become increasingly powerful in understanding, processing, and responding to user queries. However, the scope of the grounding knowledge is limited and asking questions that exceed this scope may lead to issues like hallucination. While guardrail frameworks aim to block out-of-scope questions (Rodriguez et al., 2024), no research has investigated the question of suggesting answerable queries in order to complete the user interaction.
In this paper, we initiate the study of query suggestion for agentic RAG. We consider the setting where user questions are not answerable, and the suggested queries should be similar to aid the user interaction. Such scenarios are frequent for tool-calling LLMs as communicating the restrictions of the tools or the underlying datasets to the user is difficult, and adding query suggestions enhances the interaction with the RAG agent. As opposed to traditional settings for query recommendations such as in search engines, ensuring that the suggested queries are answerable is a major challenge due to the RAG’s multi-step workflow that demands a nuanced understanding of the RAG as a whole, which the executing LLM lacks. As such, we introduce robust dynamic few-shot learning which retrieves examples from relevant workflows. We show that our system can be self-learned, for instance on prior user queries, and is therefore easily applicable in practice. We evaluate our approach on three benchmark datasets based on two unlabeled question datasets collected from real-world user queries. Experiments on real-world datasets confirm that our method produces more relevant and answerable suggestions, outperforming few-shot and retrieval-only baselines, and thus enable safer, more effective user interaction with agentic RAG.
[LINK]http://arxiv.org/abs/2601.08105v1
[DATE]2026-01-13 08:56:38+08:00
[CATEGORIES]cs.CL
AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling
[AUTHORS]Yongliang Miao, Yangyang Liang, Mengnan Du
[ABSTRACT]Reward modeling is essential for aligning large language models with human preferences, yet predominant architectures rely on a static pooling strategy to condense sequences into scalar scores. This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with task-dependent preference signals, and a representational mismatch, as the backbone is optimized for generation rather than fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. AdaJudge first refines backbone representations into a discrimination-oriented space via gated refinement blocks. It then replaces the static readout with an adaptive multi-view pooling module that dynamically routes and combines evidence. Extensive experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong off-the-shelf reward models and traditional pooling baselines.
[LINK]http://arxiv.org/abs/2601.08097v1
[DATE]2026-01-13 08:37:38+08:00
[CATEGORIES]cs.CL cs.LG
Exact Coset Sampling for Quantum Lattice Algorithms
[AUTHORS]Yifan Zhang
[COMMENTS]Preprint - Work in Progress
[LINK]http://arxiv.org/abs/2509.12341v7
[DATE]2026-01-13 08:19:39+08:00
[CATEGORIES]cs.CL
MemoBrain: Executive Memory as an Agentic Brain for Reasoning
[AUTHORS]Hongjin Qian, Zhao Cao, Zheng Liu
[ABSTRACT]Complex reasoning in tool-augmented agent frameworks is inherently long-horizon, causing reasoning traces and transient tool artifacts to accumulate and strain the bounded working context of large language models. Without explicit memory mechanisms, such accumulation disrupts logical continuity and undermines task alignment. This positions memory not as an auxiliary efficiency concern, but as a core component for sustaining coherent, goal-directed reasoning over long horizons.
We propose MemoBrain, an executive memory model for tool-augmented agents that constructs a dependency-aware memory over reasoning steps, capturing salient intermediate states and their logical relations. Operating as a co-pilot alongside the reasoning agent, MemoBrain organizes reasoning progress without blocking execution and actively manages the working context. Specifically, it prunes invalid steps, folds completed sub-trajectories, and preserves a compact, high-salience reasoning backbone under a fixed context budget. Together, these mechanisms enable explicit cognitive control over reasoning trajectories rather than passive context accumulation.
We evaluate MemoBrain on challenging long-horizon benchmarks, including GAIA, WebWalker, and BrowseComp-Plus, demonstrating consistent improvements over strong baselines.
[COMMENTS]Our codes are in https://github.com/qhjqhj00/MemoBrain
[LINK]http://arxiv.org/abs/2601.08079v1
[DATE]2026-01-13 07:44:59+08:00
[CATEGORIES]cs.CL
Exploiting DINOv3-Based Self-Supervised Features for Robust Few-Shot Medical Image Segmentation
[AUTHORS]Guoping Xu, Jayaram K. Udupa, Weiguo Lu, You Zhang
[ABSTRACT]Deep learning-based automatic medical image segmentation plays a critical role in clinical diagnosis and treatment planning but remains challenging in few-shot scenarios due to the scarcity of annotated training data. Recently, self-supervised foundation models such as DINOv3, which were trained on large natural image datasets, have shown strong potential for dense feature extraction that can help with the few-shot learning challenge. Yet, their direct application to medical images is hindered by domain differences. In this work, we propose DINO-AugSeg, a novel framework that leverages DINOv3 features to address the few-shot medical image segmentation challenge. Specifically, we introduce WT-Aug, a wavelet-based feature-level augmentation module that enriches the diversity of DINOv3-extracted features by perturbing frequency components, and CG-Fuse, a contextual information-guided fusion module that exploits cross-attention to integrate semantic-rich low-resolution features with spatially detailed high-resolution features. Extensive experiments on six public benchmarks spanning five imaging modalities, including MRI, CT, ultrasound, endoscopy, and dermoscopy, demonstrate that DINO-AugSeg consistently outperforms existing methods under limited-sample conditions. The results highlight the effectiveness of incorporating wavelet-domain augmentation and contextual fusion for robust feature representation, suggesting DINO-AugSeg as a promising direction for advancing few-shot medical image segmentation. Code and data will be made available on https://github.com/apple1986/DINO-AugSeg.
[COMMENTS]36 pages, 11 figures
[LINK]http://arxiv.org/abs/2601.08078v1
[DATE]2026-01-13 07:44:25+08:00
[CATEGORIES]cs.CL
Semantic Gravity Wells: Why Negative Constraints Backfire
[AUTHORS]Shailesh Rana
[ABSTRACT]Negative constraints (instructions of the form “do not use word X”) represent a fundamental test of instruction-following capability in large language models. Despite their apparent simplicity, these constraints fail with striking regularity, and the conditions governing failure have remained poorly understood. This paper presents the first comprehensive mechanistic investigation of negative instruction failure. We introduce semantic pressure, a quantitative measure of the model’s intrinsic probability of generating the forbidden token, and demonstrate that violation probability follows a tight logistic relationship with pressure ($p=σ(-2.40+2.27\cdot P_0)$; $n=40{,}000$ samples; bootstrap $95%$ CI for slope: $[2.21,,2.33]$). Through layer-wise analysis using the logit lens technique, we establish that the suppression signal induced by negative instructions is present but systematically weaker in failures: the instruction reduces target probability by only 5.2 percentage points in failures versus 22.8 points in successes – a $4.4\times$ asymmetry. We trace this asymmetry to two mechanistically distinct failure modes. In priming failure (87.5% of violations), the instruction’s explicit mention of the forbidden word paradoxically activates rather than suppresses the target representation. In override failure (12.5%), late-layer feed-forward networks generate contributions of $+0.39$ toward the target probability – nearly $4\times$ larger than in successes – overwhelming earlier suppression signals. Activation patching confirms that layers 23–27 are causally responsible: replacing these layers’ activations flips the sign of constraint effects. These findings reveal a fundamental tension in negative constraint design: the very act of naming a forbidden word primes the model to produce it.
[COMMENTS]10 pages, 8 figures. Code: https://github.com/gut-puncture
[LINK]http://arxiv.org/abs/2601.08070v1
[DATE]2026-01-13 07:30:18+08:00
[CATEGORIES]cs.CL
LLM generation novelty through the lens of semantic similarity
[AUTHORS]Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh
[ABSTRACT]Generation novelty is a key indicator of an LLM’s ability to generalize, yet measuring it against full pretraining corpora is computationally challenging. Existing evaluations often rely on lexical overlap, failing to detect paraphrased text, or do not consider the full pretraining corpus. We frame novelty as a semantic retrieval problem. This framing enables us to address novelty with modern embedding and indexing pipelines, allowing for efficient analysis at pre-training scale. Specifically, we propose a three-stage framework that retrieves semantically similar samples, reranks them at varying subsequence lengths, and calibrates scores using a human novelty reference for interpretability. We apply this framework to the SmolLM model family and report three key findings: (1) models draw on pre-training data across much longer sequences than previously reported; (2) some task domains systematically promote or suppress generation novelty; and (3) instruction tuning not only alters style but also increases novelty. These results highlight the value of semantic novelty analysis for studying generalization. To support reproducibility and further research, we release ~20 TB of corpus chunks and index artifacts at https://huggingface.co/datasets/stai-tuebingen/faiss-smollm
[LINK]http://arxiv.org/abs/2510.27313v2
[DATE]2026-01-13 07:15:20+08:00
[CATEGORIES]cs.LG cs.CL
Universal computation is intrinsic to language model decoding
[AUTHORS]Alex Lewandowski, Marlos C. Machado, Dale Schuurmans
[ABSTRACT]Language models now provide an interface to express and often solve general problems in natural language, yet their ultimate computational capabilities remain a major topic of scientific debate. Unlike a formal computer, a language model is trained to autoregressively predict successive elements in human-generated text. We prove that chaining a language model’s autoregressive output is sufficient to perform universal computation. That is, a language model can simulate the execution of any algorithm on any input. The challenge of eliciting desired computational behaviour can thus be reframed in terms of programmability: the ease of finding a suitable prompt. Strikingly, we demonstrate that even randomly initialized language models are capable of universal computation before training. This implies that training does not give rise to computational expressiveness – rather, it improves programmability, enabling a natural language interface for accessing these intrinsic capabilities.
[LINK]http://arxiv.org/abs/2601.08061v1
[DATE]2026-01-13 07:07:50+08:00
[CATEGORIES]cs.CL
Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
[AUTHORS]Zhenghao He, Guangzhi Xiong, Bohan Liu, Sanchit Sinha, Aidong Zhang
[ABSTRACT]Chain-of-Thought (CoT) prompting has improved the reasoning performance of large language models (LLMs), but it remains unclear why it works and whether it is the unique mechanism for triggering reasoning in large language models. In this work, we study this question by directly analyzing and intervening on the internal representations of LLMs with Sparse Autoencoders (SAEs), identifying a small set of latent features that are causally associated with LLM reasoning behavior. Across multiple model families and reasoning benchmarks, we find that steering a single reasoning-related latent feature can substantially improve accuracy without explicit CoT prompting. For large models, latent steering achieves performance comparable to standard CoT prompting while producing more efficient outputs. We further observe that this reasoning-oriented internal state is triggered early in generation and can override prompt-level instructions that discourage explicit reasoning. Overall, our results suggest that multi-step reasoning in LLMs is supported by latent internal activations that can be externally activated, while CoT prompting is one effective, but not unique, way of activating this mechanism rather than its necessary cause.
[LINK]http://arxiv.org/abs/2601.08058v1
[DATE]2026-01-13 07:01:21+08:00
[CATEGORIES]cs.CL
Human Mobility Datasets Enriched With Contextual and Social Dimensions
[AUTHORS]Chiara Pugliese, Francesco Lettich, Guido Rocchietti, Chiara Renso, Fabio Pinelli
[ABSTRACT]In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.
[COMMENTS]5 pages, 3 figures, 1 table
[LINK]http://arxiv.org/abs/2510.02333v3
[DATE]2026-01-13 06:43:04+08:00
[CATEGORIES]cs.CL
UNCAP: Uncertainty-Guided Neurosymbolic Planning Using Natural Language Communication for Cooperative Autonomous Vehicles
[AUTHORS]Neel P. Bhatt, Po-han Li, Kushagra Gupta, Rohan Siva, Daniel Milan, Alexander T. Hogue, Sandeep P. Chinchali, David Fridovich-Keil, Zhangyang Wang, Ufuk Topcu
[ABSTRACT]Safe large-scale coordination of multiple cooperative connected autonomous vehicles (CAVs) hinges on communication that is both efficient and interpretable. Existing approaches either rely on transmitting high-bandwidth raw sensor data streams or neglect perception and planning uncertainties inherent in shared data, resulting in systems that are neither scalable nor safe. To address these limitations, we propose Uncertainty-Guided Natural Language Cooperative Autonomous Planning (UNCAP), a vision-language model-based planning approach that enables CAVs to communicate via lightweight natural language messages while explicitly accounting for perception uncertainty in decision-making. UNCAP features a two-stage communication protocol: (i) an ego CAV first identifies the subset of vehicles most relevant for information exchange, and (ii) the selected CAVs then transmit messages that quantitatively express their perception uncertainty. By selectively fusing messages that maximize mutual information, this strategy allows the ego vehicle to integrate only the most relevant signals into its decision-making, improving both the scalability and reliability of cooperative planning. Experiments across diverse driving scenarios show a 63% reduction in communication bandwidth with a 31% increase in driving safety score, a 61% reduction in decision uncertainty, and a four-fold increase in collision distance margin during near-miss events. Project website: https://uncap-project.github.io/
[LINK]http://arxiv.org/abs/2510.12992v2
[DATE]2026-01-13 06:23:28+08:00
[CATEGORIES]cs.CL
LLM Review: Enhancing Creative Writing via Blind Peer Review Feedback
[AUTHORS]Weiyue Li, Mingxiao Song, Zhenda Shen, Dachuan Zhao, Yunfan Long, Yi Li, Yongce Li, Ruyi Yang, Mengyu Wang
[ABSTRACT]Large Language Models (LLMs) often struggle with creative generation, and multi-agent frameworks that improve reasoning through interaction can paradoxically hinder creativity by inducing content homogenization. We introduce LLM Review, a peer-review-inspired framework implementing Blind Peer Review: agents exchange targeted feedback while revising independently, preserving divergent creative trajectories. To enable rigorous evaluation, we propose SciFi-100, a science fiction writing dataset with a unified framework combining LLM-as-a-judge scoring, human annotation, and rule-based novelty metrics. Experiments demonstrate that LLM Review consistently outperforms multi-agent baselines, and smaller models with our framework can surpass larger single-agent models, suggesting interaction structure may substitute for model scale.
[LINK]http://arxiv.org/abs/2601.08003v1
[DATE]2026-01-13 05:10:28+08:00
[CATEGORIES]cs.CL
Modeling Language as a Sequence of Thoughts
[AUTHORS]Nasim Borazjanizadeh, James McClelland
[ABSTRACT]Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens, but by relying primarily on surface-level co-occurrence statistics they fail to form globally consistent latent representations of entities and events, which contributes to poor relational generalization (the reversal curse), contextualization errors, and data inefficiency. Cognitive science, by contrast, shows that human comprehension converts linguistic input into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by these findings, we introduce the Thought Gestalt (TG) model, a recurrent transformer that models language at two levels of abstraction: tokens and sentence-level “thought” states. TG generates one sentence at a time while cross-attending to a working memory of prior sentence representations. Token and sentence representations are generated using a shared stack of transformer blocks and trained with a single objective, next-token prediction loss. By retaining the computation graph of sentence representations written to working memory, gradients from future token losses flow backward through cross-attention to optimize the parameters that generate earlier sentence vectors. In scaling experiments, TG consistently improves data and parameter efficiency compared to matched GPT-2 runs and other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG’s test loss. TG also reduces errors in relational-direction generalization on a father-son reversal curse probe.
[LINK]http://arxiv.org/abs/2512.25026v2
[DATE]2026-01-13 04:57:25+08:00
[CATEGORIES]cs.CL
From Word Sequences to Behavioral Sequences: Adapting Modeling and Evaluation Paradigms for Longitudinal NLP
[AUTHORS]Adithya V Ganesan, Vasudha Varadarajan, Oscar NE Kjell, Whitney R Ringwald, Scott Feltman, Benjamin J Luft, Roman Kotov, Ryan L Boyd, H Andrew Schwartz
[ABSTRACT]While NLP typically treats documents as independent and unordered samples, in longitudinal studies, this assumption rarely holds: documents are nested within authors and ordered in time, forming person-indexed, time-ordered $\textit{behavioral sequences}$. Here, we demonstrate the need for and propose a longitudinal modeling and evaluation paradigm that consequently updates four parts of the NLP pipeline: (1) evaluation splits aligned to generalization over people ($\textit{cross-sectional}$) and/or time ($\textit{prospective}$); (2) accuracy metrics separating between-person differences from within-person dynamics; (3) sequence inputs to incorporate history by default; and (4) model internals that support different $\textit{coarseness}$ of latent state over histories (pooled summaries, explicit dynamics, or interaction-based models). We demonstrate the issues ensued by traditional pipeline and our proposed improvements on a dataset of 17k daily diary transcripts paired with PTSD symptom severity from 238 participants, finding that traditional document-level evaluation can yield substantially different and sometimes reversed conclusions compared to our ecologically valid modeling and evaluation. We tie our results to a broader discussion motivating a shift from word-sequence evaluation toward $\textit{behavior-sequence}$ paradigms for NLP.
[LINK]http://arxiv.org/abs/2601.07988v1
[DATE]2026-01-13 04:38:36+08:00
[CATEGORIES]cs.CL cs.LG
VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
[AUTHORS]Haorui Yu, Ramon Ruiz-Dolz, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi
[ABSTRACT]We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models’ (VLMs) cultural understanding beyond surface-level visual perception. Existing VLM benchmarks predominantly measure L1-L2 capabilities (object recognition, scene description, and factual question answering) while under-evaluate higher-order cultural interpretation. VULCA-Bench contains 7,410 matched image-critique pairs spanning eight cultural traditions, with Chinese-English bilingual coverage. We operationalise cultural understanding using a five-layer framework (L1-L5, from Visual Perception to Philosophical Aesthetics), instantiated as 225 culture-specific dimensions and supported by expert-written bilingual critiques. Our pilot results indicate that higher-layer reasoning (L3-L5) is consistently more challenging than visual and technical analysis (L1-L2). The dataset, evaluation scripts, and annotation tools are available under CC BY 4.0 in the supplementary materials.
[COMMENTS]8 pages, 4 figures, submitted to ACL 2026 Dataset Track
[LINK]http://arxiv.org/abs/2601.07986v1
[DATE]2026-01-13 04:36:30+08:00
[CATEGORIES]cs.CL
Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset
[AUTHORS]Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler, Djamé Seddah
[ABSTRACT]The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German languages by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.
[LINK]http://arxiv.org/abs/2601.07985v1
[DATE]2026-01-13 04:33:46+08:00
[CATEGORIES]cs.CL
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
[AUTHORS]Haorui Yu, Ramon Ruiz-Dolz, Xuehang Wen, Fengrui Zhang, Qiufeng Yi
[ABSTRACT]Vision-Language Models (VLMs) excel at visual perception, yet their ability to interpret cultural meaning in art remains under-validated. We present a tri-tier evaluation framework for cross-cultural art-critique assessment: Tier I computes automated coverage and risk indicators offline; Tier II applies rubric-based scoring using a single primary judge across five dimensions; and Tier III calibrates the Tier II aggregate score to human ratings via isotonic regression, yielding a 5.2% reduction in MAE on a 152-sample held-out set. The framework outputs a calibrated cultural-understanding score for model selection and cultural-gap diagnosis, together with dimension-level diagnostics and risk indicators. We evaluate 15 VLMs on 294 expert anchors spanning six cultural traditions. Key findings are that (i) automated metrics are unreliable proxies for cultural depth, (ii) Western samples score higher than non-Western samples under our sampling and rubric, and (iii) cross-judge scale mismatch makes naive score averaging unreliable, motivating a single primary judge with explicit calibration. Dataset and code are available in the supplementary materials.
[COMMENTS]16 pages, 7 figures, submitted to ACL 2026
[LINK]http://arxiv.org/abs/2601.07984v1
[DATE]2026-01-13 04:33:35+08:00
[CATEGORIES]cs.CL
Arctic-Text2SQL-R1: Simple Rewards, Strong Reasoning in Text-to-SQL
[AUTHORS]Zhewei Yao, Guoheng Sun, Lukasz Borchmann, Gaurav Nuti, Zheyu Shen, Minghang Deng, Bohan Zhai, Hao Zhang, Ang Li, Yuxiong He
[ABSTRACT]Translating natural language into SQL (Test2SQL) is a longstanding challenge at the intersection of natural language understanding and structured data access. While large language models (LLMs) have significantly improved fluency in SQL generation, producing correct and executable SQL–particularly for complex queries–remains a bottleneck. We present Arctic-Text2SQL-R1, a reinforcement learning (RL) framework and model family designed to generate accurate, executable SQL using a lightweight reward signal based solely on execution correctness. Our approach avoids brittle intermediate supervision and complex reward shaping, promoting stable training and alignment with the end task. Combined with carefully curated data, strong supervised initialization, and effective training practices, Arctic-Text2SQL-R1 achieves state-of-the-art execution accuracy across six diverse Test2SQL benchmarks, including the top position on the BIRD leaderboard. Notably, our 7B model outperforms prior 70B-class systems, highlighting the framework’s scalability and efficiency. We further demonstrate inference-time robustness through simple extensions like value retrieval and majority voting. Extensive experiments and ablation studies offer both positive and negative insights, providing practical guidance for future Test2SQL research.
[COMMENTS]22 pages, 2 figures
[LINK]http://arxiv.org/abs/2505.20315v2
[DATE]2026-01-13 04:32:51+08:00
[CATEGORIES]cs.CL
Explaining Generalization of AI-Generated Text Detectors Through Linguistic Analysis
[AUTHORS]Yuxi Xia, Kinga Stańczak, Benjamin Roth
[ABSTRACT]AI-text detectors achieve high accuracy on in-domain benchmarks, but often struggle to generalize across different generation conditions such as unseen prompts, model families, or domains. While prior work has reported these generalization gaps, there are limited insights about the underlying causes. In this work, we present a systematic study aimed at explaining generalization behavior through linguistic analysis. We construct a comprehensive benchmark that spans 6 prompting strategies, 7 large language models (LLMs), and 4 domain datasets, resulting in a diverse set of human- and AI-generated texts. Using this dataset, we fine-tune classification-based detectors on various generation settings and evaluate their cross-prompt, cross-model, and cross-dataset generalization. To explain the performance variance, we compute correlations between generalization accuracies and feature shifts of 80 linguistic features between training and test conditions. Our analysis reveals that generalization performance for specific detectors and evaluation conditions is significantly associated with linguistic features such as tense usage and pronoun frequency.
[LINK]http://arxiv.org/abs/2601.07974v1
[DATE]2026-01-13 04:16:06+08:00
[CATEGORIES]cs.CL
Vocabulary Expansion of Large Language Models via Kullback-Leibler-Based Self-Distillation
[AUTHORS]Max Rehman Linder
[ABSTRACT]Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.
[COMMENTS]Master’s Thesis
[LINK]http://arxiv.org/abs/2508.15807v2
[DATE]2026-01-13 04:12:21+08:00
[CATEGORIES]cs.CL
A Human-Centric Pipeline for Aligning Large Language Models with Chinese Medical Ethics
[AUTHORS]Haoan Jin, Han Ying, Jiacheng Ji, Hanhui Xu, Mengyue Wu
[ABSTRACT]Recent advances in large language models have enabled their application to a range of healthcare tasks. However, aligning LLMs with the nuanced demands of medical ethics, especially under complex real world scenarios, remains underexplored. In this work, we present MedES, a dynamic, scenario-centric benchmark specifically constructed from 260 authoritative Chinese medical, ethical, and legal sources to reflect the challenges in clinical decision-making. To facilitate model alignment, we introduce a guardian-in-the-loop framework that leverages a dedicated automated evaluator (trained on expert-labeled data and achieving over 97% accuracy within our domain) to generate targeted prompts and provide structured ethical feedback. Using this pipeline, we align a 7B-parameter LLM through supervised fine-tuning and domain-specific preference optimization. Experimental results, conducted entirely within the Chinese medical ethics context, demonstrate that our aligned model outperforms notably larger baselines on core ethical tasks, with observed improvements in both quality and composite evaluation metrics. Our work offers a practical and adaptable framework for aligning LLMs with medical ethics in the Chinese healthcare domain, and suggests that similar alignment pipelines may be instantiated in other legal and cultural environments through modular replacement of the underlying normative corpus.
[LINK]http://arxiv.org/abs/2601.07954v1
[DATE]2026-01-13 03:40:13+08:00
[CATEGORIES]cs.CL
Cross-Prompt Encoder for Low-Performing Languages
[AUTHORS]Beso Mikaberidze, Teimuraz Saghinadze, Simon Ostermann, Philipp Muller
[ABSTRACT]Soft prompts have emerged as a powerful alternative to adapters in parameter-efficient fine-tuning (PEFT), enabling large language models (LLMs) to adapt to downstream tasks without architectural changes or parameter updates. While prior work has focused on stabilizing training via parameter interaction in small neural prompt encoders, their broader potential for transfer across languages remains unexplored. In this paper, we demonstrate that a prompt encoder can play a central role in improving performance on low-performing languages - those that achieve poor accuracy even under full-model fine-tuning. We investigate a lightweight encoder paired with multi-source training on typologically diverse languages. We call this architecture-training combination the Cross-Prompt Encoder (XPE), and show that it advances the capture of abstract, transferable patterns across languages. To complement XPE, we propose a Dual Soft Prompt mechanism that combines an encoder-based prompt with a directly trained standard soft prompt. This hybrid design proves especially effective for target languages that benefit from both broadly shared structure and language-specific alignment. Text classification experiments with a transformer encoder (XLM-R) on the SIB-200 benchmark reveal a consistent trade-off: XPE is most effective for low-performing languages, while hybrid variants offer broader adaptability across multilingual settings.
[COMMENTS]Accepted at Findings of IJCNLP-AACL 2025
[LINK]http://arxiv.org/abs/2508.10352v2
[DATE]2026-01-13 03:31:52+08:00
[CATEGORIES]cs.CL
SAC: A Framework for Measuring and Inducing Personality Traits in LLMs with Dynamic Intensity Control
[AUTHORS]Adithya Chittem, Aishna Shrivastava, Sai Tarun Pendela, Jagat Sesh Challa, Dhruv Kumar
[ABSTRACT]Large language models (LLMs) have gained significant traction across a wide range of fields in recent years. There is also a growing expectation for them to display human-like personalities during interactions. To meet this expectation, numerous studies have proposed methods for modelling LLM personalities through psychometric evaluations. However, most existing models face two major limitations: they rely on the Big Five (OCEAN) framework, which only provides coarse personality dimensions, and they lack mechanisms for controlling trait intensity. In this paper, we address this gap by extending the Machine Personality Inventory (MPI), which originally used the Big Five model, to incorporate the 16 Personality Factor (16PF) model, allowing expressive control over sixteen distinct traits. We also developed a structured framework known as Specific Attribute Control (SAC) for evaluating and dynamically inducing trait intensity in LLMs. Our method introduces adjective-based semantic anchoring to guide trait intensity expression and leverages behavioural questions across five intensity factors: \textit{Frequency}, \textit{Depth}, \textit{Threshold}, \textit{Effort}, and \textit{Willingness}. Through experimentation, we find that modelling intensity as a continuous spectrum yields substantially more consistent and controllable personality expression compared to binary trait toggling. Moreover, we observe that changes in target trait intensity systematically influence closely related traits in psychologically coherent directions, suggesting that LLMs internalize multi-dimensional personality structures rather than treating traits in isolation. Our work opens new pathways for controlled and nuanced human-machine interactions in domains such as healthcare, education, and interviewing processes, bringing us one step closer to truly human-like social machines.
[COMMENTS]Accepted into 18th Edition of International Conference on Agents and Artificial Intelligence (ICAART)
[LINK]http://arxiv.org/abs/2506.20993v2
[DATE]2026-01-13 03:26:24+08:00
[CATEGORIES]cs.CL
Towards Specialized Generalists: A Multi-Task MoE-LoRA Framework for Domain-Specific LLM Adaptation
[AUTHORS]Yuxin Yang, Aoxiong Zeng, Xiangquan Yang
[ABSTRACT]The rapid evolution of Large Language Models (LLMs) has shifted focus from general-purpose capabilities to domain-specific expertise. However, adapting LLMs to specialized fields such as medicine presents two challenge: (1) the “Stability-Plasticity Dilemma”, where the model must acquire complex clinical knowledge without suffering from catastrophic forgetting of general world knowledge; and (2) “Task Interference”, where disparate sub-tasks, such as medical diagnosis, report summarization, and drug-drug interaction prediction, compete for limited low-rank parameter space. In this paper, we propose Med-MoE-LoRA, a novel framework that integrates Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to enable efficient multi-task domain adaptation, especially for medical scenarios. Drawing inspiration from recent advances, our framework employs an asymmetric expert distribution where deeper layers are equipped with a higher density of LoRA experts to capture complex semantic abstractions. We further introduce a “Knowledge-Preservation Plugin”, inspired by LoRA MoE, to isolate and protect general-purpose reasoning. By utilizing soft merging with adaptive routing and rank-wise decoupling, Med-MoE-LoRA achieves superior performance in medical benchmarks while reducing interference. Experimental results demonstrate that our approach consistently outperforms standard LoRA and conventional MoE architectures across multiple clinical NLP tasks while retaining the model’s general cognitive capabilities.
[COMMENTS]Work in Progress
[LINK]http://arxiv.org/abs/2601.07935v1
[DATE]2026-01-13 03:04:58+08:00
[CATEGORIES]cs.LG cs.CL
Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests
[AUTHORS]Manar Ali, Judith Sieker, Sina Zarrieß, Hendrik Buschmeier
[ABSTRACT]In human conversation, both interlocutors play an active role in maintaining mutual understanding. When addressees are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar addressee role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a good testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.
[LINK]http://arxiv.org/abs/2601.07820v1
[DATE]2026-01-13 02:53:09+08:00
[CATEGORIES]cs.CL
Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models
[AUTHORS]Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-young Paik, Liming Zhu
[ABSTRACT]Earlier research has shown that metaphors influence human’s decision making, which raises the question of whether metaphors also influence large language models (LLMs)’ reasoning pathways, considering their training data contain a large number of metaphors. In this work, we investigate the problem in the scope of the emergent misalignment problem where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs’ reasoning contents. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models’ cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predict misaligned content with high accuracy.
[COMMENTS]17 pages, 7 figures
[LINK]http://arxiv.org/abs/2601.03388v2
[DATE]2026-01-13 02:08:06+08:00
[CATEGORIES]cs.CL cs.LG
Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning
[AUTHORS]Wei Fang, James Glass
[ABSTRACT]LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
[LINK]http://arxiv.org/abs/2601.07782v1
[DATE]2026-01-13 01:58:39+08:00
[CATEGORIES]cs.CL
Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection
[AUTHORS]Mariana Costa, Alberlucia Rafael Soarez, Daniel Kim, Camila Ferreira
[ABSTRACT]While Chain-of-Thought (CoT) prompting advances LLM reasoning, challenges persist in consistency, accuracy, and self-correction, especially for complex or ethically sensitive tasks. Existing single-dimensional reflection methods offer insufficient improvements. We propose MyGO Poly-Reflective Chain-of-Thought (PR-CoT), a novel methodology employing structured multi-perspective reflection. After initial CoT, PR-CoT guides the LLM to self-assess its reasoning across multiple predefined angles: logical consistency, information completeness, biases/ethics, and alternative solutions. Implemented purely via prompt engineering, this process refines the initial CoT into a more robust and accurate final answer without model retraining. Experiments across arithmetic, commonsense, ethical decision-making, and logical puzzles, using GPT-three point five and GPT-four models, demonstrate PR-CoT’s superior performance. It significantly outperforms traditional CoT and existing reflection methods in logical consistency and error correction, with notable gains in nuanced domains like ethical decision-making. Ablation studies, human evaluations, and qualitative analyses further validate the contribution of each reflection perspective and the overall efficacy of our poly-reflective paradigm in fostering more reliable LLM reasoning.
[LINK]http://arxiv.org/abs/2601.07780v1
[DATE]2026-01-13 01:57:05+08:00
[CATEGORIES]cs.CL
Are LLM Decisions Faithful to Verbal Confidence?
[AUTHORS]Jiawei Wang, Yanfei Zhou, Siddartha Devic, Deqing Fu
[ABSTRACT]Large Language Models (LLMs) can produce surprisingly sophisticated estimates of their own uncertainty. However, it remains unclear to what extent this expressed confidence is tied to the reasoning, knowledge, or decision making of the model. To test this, we introduce $\textbf{RiskEval}$: a framework designed to evaluate whether models adjust their abstention policies in response to varying error penalties. Our evaluation of several frontier models reveals a critical dissociation: models are neither cost-aware when articulating their verbal confidence, nor strategically responsive when deciding whether to engage or abstain under high-penalty conditions. Even when extreme penalties render frequent abstention the mathematically optimal strategy, models almost never abstain, resulting in utility collapse. This indicates that calibrated verbal confidence scores may not be sufficient to create trustworthy and interpretable AI systems, as current models lack the strategic agency to convert uncertainty signals into optimal and risk-sensitive decisions.
[LINK]http://arxiv.org/abs/2601.07767v1
[DATE]2026-01-13 01:49:51+08:00
[CATEGORIES]cs.LG cs.CL
Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents
[AUTHORS]Aryan Mishra, Akash Anil
[ABSTRACT]Numerical reasoning is an important task in the analysis of financial documents. It helps in understanding and performing numerical predictions with logical conclusions for the given query seeking answers from financial texts. Recently, Large Language Models (LLMs) have shown promising results in multiple Question-Answering (Q-A) systems with the capability of logical reasoning. As documents related to finance often consist of long and complex financial contexts, LLMs appear well-suited for building high-quality automated financial question-answering systems. However, LLMs often face challenges in accurately processing the various numbers within financial reports. Extracting numerical data from unstructured text and semi-structured tables, and reliably performing accurate calculations, remains a significant bottleneck for numerical reasoning in most state-of-the-art LLMs. Recent studies have shown that structured data augmentations, such as Knowledge Graphs (KGs), have notably improved the predictions of LLMs along with logical explanations. Thus, it is an important requirement to consider inherent structured information in financial reports while using LLMs for various financial analytics. This paper proposes a framework to incorporate structured information using KGs along with LLM predictions for numerical reasoning tasks. The KGs are extracted using a proposed schema inherently from the document under processing. We evaluated our proposed framework over the benchmark data FinQA, using an open-source LLM, namely Llama 3.1 8B Instruct. We observed that the proposed framework improved execution accuracy by approximately 12% relative to the vanilla LLM.
[LINK]http://arxiv.org/abs/2601.07754v1
[DATE]2026-01-13 01:39:08+08:00
[CATEGORIES]cs.CL
Is Agentic RAG worth it? An experimental comparison of RAG approaches
[AUTHORS]Pietro Ferrazzi, Milica Cvjeticanin, Alessio Piraccini, Davide Giannuzzi
[ABSTRACT]Retrieval-Augmented Generation (RAG) systems are usually defined by the combination of a generator and a retrieval component that extracts textual context from a knowledge base to answer user queries. However, such basic implementations exhibit several limitations, including noisy or suboptimal retrieval, misuse of retrieval for out-of-scope queries, weak query-document matching, and variability or cost associated with the generator. These shortcomings have motivated the development of “Enhanced” RAG, where dedicated modules are introduced to address specific weaknesses in the workflow. More recently, the growing self-reflective capabilities of Large Language Models (LLMs) have enabled a new paradigm, which we refer to as “Agentic” RAG. In this approach, the LLM orchestrates the entire process-deciding which actions to perform, when to perform them, and whether to iterate-thereby reducing reliance on fixed, manually engineered modules. Despite the rapid adoption of both paradigms, it remains unclear which approach is preferable under which conditions. In this work, we conduct an extensive, empirically driven evaluation of Enhanced and Agentic RAG across multiple scenarios and dimensions. Our results provide practical insights into the trade-offs between the two paradigms, offering guidance on selecting the most effective RAG design for real-world applications, considering both costs and performance.
[LINK]http://arxiv.org/abs/2601.07711v1
[DATE]2026-01-13 00:43:44+08:00
[CATEGORIES]cs.CL
Perspectives in Play: A Multi-Perspective Approach for More Inclusive NLP Systems
[AUTHORS]Benedetta Muscato, Lucia Passaro, Gizem Gezici, Fosca Giannotti
[ABSTRACT]In the realm of Natural Language Processing (NLP), common approaches for handling human disagreement consist of aggregating annotators’ viewpoints to establish a single ground truth. However, prior studies show that disregarding individual opinions can lead can lead to the side effect of underrepresenting minority perspectives, especially in subjective tasks, where annotators may systematically disagree because of their preferences. Recognizing that labels reflect the diverse backgrounds, life experiences, and values of individuals, this study proposes a new multi-perspective approach using soft labels to encourage the development of the next generation of perspective aware models, more inclusive and pluralistic. We conduct an extensive analysis across diverse subjective text classification tasks, including hate speech, irony, abusive language, and stance detection, to highlight the importance of capturing human disagreements, often overlooked by traditional aggregation methods. Results show that the multi-perspective approach not only better approximates human label distributions, as measured by Jensen-Shannon Divergence (JSD), but also achieves superior classification performance (higher F1 scores), outperforming traditional approaches. However, our approach exhibits lower confidence in tasks like irony and stance detection, likely due to the inherent subjectivity present in the texts. Lastly, leveraging Explainable AI (XAI), we explore model uncertainty and uncover meaningful insights into model predictions.
[LINK]http://arxiv.org/abs/2506.20209v2
[DATE]2026-01-13 00:32:27+08:00
[CATEGORIES]cs.CL
Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task
[AUTHORS]Nick Ferguson, Alan Bundy, Kwabena Nuamah
[ABSTRACT]Recent advancements in Large Language Models (LLMs) are increasingly focused on “reasoning” ability, a concept with many overlapping definitions in the LLM discourse. We take a more structured approach, distinguishing meta-level reasoning (denoting the process of reasoning about intermediate steps required to solve a task) from object-level reasoning (which concerns the low-level execution of the aforementioned steps.) We design a novel question answering task, which is based around the values of geopolitical indicators for various countries over various years. Questions require breaking down into intermediate steps, retrieval of data, and mathematical operations over that data. The meta-level reasoning ability of LLMs is analysed by examining the selection of appropriate tools for answering questions. To bring greater depth to the analysis of LLMs beyond final answer accuracy, our task contains ‘essential actions’ against which we can compare the tool call output of LLMs to infer the strength of reasoning ability. We find that LLMs demonstrate good meta-level reasoning on our task, yet are flawed in some aspects of task understanding. We find that n-shot prompting has little effect on accuracy; error messages encountered do not often deteriorate performance; and provide additional evidence for the poor numeracy of LLMs. Finally, we discuss the generalisation and limitation of our findings to other task domains.
[LINK]http://arxiv.org/abs/2601.07696v1
[DATE]2026-01-13 00:29:21+08:00
[CATEGORIES]cs.CL
Accelerated Gradient Methods with Biased Gradient Estimates: Risk Sensitivity, High-Probability Guarantees, and Large Deviation Bounds
[AUTHORS]Mert Gürbüzbalaban, Yasa Syed, Necdet Serhat Aybat
[ABSTRACT]We study trade-offs between convergence rate and robustness to gradient errors in the context of first-order methods. Our focus is on generalized momentum methods (GMMs)–a broad class that includes Nesterov’s accelerated gradient, heavy-ball, and gradient descent methods–for minimizing smooth strongly convex objectives. We allow stochastic gradient errors that may be adversarial and biased, and quantify robustness of these methods to gradient errors via the risk-sensitive index (RSI) from robust control theory. For quadratic objectives with i.i.d. Gaussian noise, we give closed form expressions for RSI in terms of solutions to 2x2 matrix Riccati equations, revealing a Pareto frontier between RSI and convergence rate over the choice of step-size and momentum parameters. We then prove a large-deviation principle for time-averaged suboptimality in the large iteration limit and show that the rate function is, up to a scaling, the convex conjugate of the RSI function. We further show that the rate function and RSI are linked to the $H_\infty$-norm–a measure of robustness to the worst-case deterministic gradient errors–so that stronger worst-case robustness (smaller $H_\infty$-norm) leads to sharper decay of the tail probabilities for the average suboptimality. Beyond quadratics, under potentially biased sub-Gaussian gradient errors, we derive non-asymptotic bounds on a finite-time analogue of the RSI, yielding finite-time high-probability guarantees and non-asymptotic large-deviation bounds for the averaged iterates. In the case of smooth strongly convex functions, we also observe an analogous trade-off between RSI and convergence-rate bounds. To our knowledge, these are the first non-asymptotic guarantees for GMMs with biased gradients and the first risk-sensitive analysis of GMMs. Finally, we provide numerical experiments on a robust regression problem to illustrate our results.
[LINK]http://arxiv.org/abs/2509.13628v3
[DATE]2026-01-13 23:47:28+08:00
[CATEGORIES]cs.LG
ROSS: RObust decentralized Stochastic learning based on Shapley values
[AUTHORS]Lina Wang, Yunsheng Yuan, Feng Li, Lingjie Duan
[ABSTRACT]In the paradigm of decentralized learning, a group of agents collaborate to learn a global model using a distributed dataset without a central server; nevertheless, it is severely challenged by the heterogeneity of the data distribution across the agents. For example, the data may be distributed non-independently and identically, and even be noised or poisoned. To address these data challenges, we propose ROSS, a novel robust decentralized stochastic learning algorithm based on Shapley values, in this paper. Specifically, in each round, each agent aggregates the cross-gradient information from its neighbors, i.e., the derivatives of its local model with respect to the datasets of its neighbors, to update its local model in a momentum like manner, while we innovate in weighting the derivatives according to their contributions measured by Shapley values. We perform solid theoretical analysis to reveal the linear convergence speedup of our ROSS algorithm. We also verify the efficacy of our algorithm through extensive experiments on public datasets. Our results demonstrate that, in face of the above variety of data challenges, our ROSS algorithm has significant advantages over existing state-of-the-art proposals in terms of both convergence and prediction accuracy.
[LINK]http://arxiv.org/abs/2411.00365v2
[DATE]2026-01-13 23:40:47+08:00
[CATEGORIES]cs.LG
Feed-Forward Optimization With Delayed Feedback for Neural Network Training
[AUTHORS]Katharina Flügel, Daniel Coquelin, Marie Weiel, Charlotte Debus, Achim Streit, Markus Götz
[ABSTRACT]Backpropagation has long been criticized for being biologically implausible due to its reliance on concepts that are not viable in natural learning processes. Two core issues are the weight transport and update locking problems caused by the forward-backward dependencies, which limit biological plausibility, computational efficiency, and parallelization. Although several alternatives have been proposed to increase biological plausibility, they often come at the cost of reduced predictive performance. This paper proposes an alternative approach to training feed-forward neural networks addressing these issues by using approximate gradient information. We introduce Feed-Forward with delayed Feedback (F$^3$), which approximates gradients using fixed random feedback paths and delayed error information from the previous epoch to balance biological plausibility with predictive performance. We evaluate F$^3$ across multiple tasks and architectures, including both fully-connected and Transformer networks. Our results demonstrate that, compared to similarly plausible approaches, F$^3$ significantly improves predictive performance, narrowing the gap to backpropagation by up to 56% for classification and 96% for regression. This work is a step towards more biologically plausible learning algorithms while opening up new avenues for energy-efficient and parallelizable neural network training.
[COMMENTS]This is the submitted manuscript of the version of record published in the proceedings to the 31st International Conference on Neural Information Processing (ICONIP 2024), Lecture Notes in Computer Science, vol 15289, and available online at https://doi.org/10.1007/978-981-96-6585-3_6
[LINK]http://arxiv.org/abs/2304.13372v2
[DATE]2026-01-13 23:40:45+08:00
[CATEGORIES]cs.LG
TRACE: Reconstruction-Based Anomaly Detection in Ensemble and Time-Dependent Simulations
[AUTHORS]Hamid Gadirov, Martijn Westra, Steffen Frey
[ABSTRACT]Detecting anomalies in high-dimensional, time-dependent simulation data is challenging due to complex spatial and temporal dynamics. We study reconstruction-based anomaly detection for ensemble data from parameterized Kármán vortex street simulations using convolutional autoencoders. We compare a 2D autoencoder operating on individual frames with a 3D autoencoder that processes short temporal stacks. The 2D model identifies localized spatial irregularities in single time steps, while the 3D model exploits spatio-temporal context to detect anomalous motion patterns and reduces redundant detections across time. We further evaluate volumetric time-dependent data and find that reconstruction errors are strongly influenced by the spatial distribution of mass, with highly concentrated regions yielding larger errors than dispersed configurations. Our results highlight the importance of temporal context for robust anomaly detection in dynamic simulations.
[LINK]http://arxiv.org/abs/2601.08659v1
[DATE]2026-01-13 23:36:52+08:00
[CATEGORIES]cs.LG
Geometry Aware Operator Transformer as an Efficient and Accurate Neural Surrogate for PDEs on Arbitrary Domains
[AUTHORS]Shizheng Wen, Arsh Kumbhat, Levi Lingsch, Sepehr Mousavi, Yizhou Zhao, Praveen Chandrashekar, Siddhartha Mishra
[ABSTRACT]The very challenging task of learning solution operators of PDEs on arbitrary domains accurately and efficiently is of vital importance to engineering and industrial simulations. Despite the existence of many operator learning algorithms to approximate such PDEs, we find that accurate models are not necessarily computationally efficient and vice versa. We address this issue by proposing a geometry aware operator transformer (GAOT) for learning PDEs on arbitrary domains. GAOT combines novel multiscale attentional graph neural operator encoders and decoders, together with geometry embeddings and (vision) transformer processors to accurately map information about the domain and the inputs into a robust approximation of the PDE solution. Multiple innovations in the implementation of GAOT also ensure computational efficiency and scalability. We demonstrate this significant gain in both accuracy and efficiency of GAOT over several baselines on a large number of learning tasks from a diverse set of PDEs, including achieving state of the art performance on three large scale three-dimensional industrial CFD datasets.
[LINK]http://arxiv.org/abs/2505.18781v4
[DATE]2026-01-13 23:34:28+08:00
[CATEGORIES]cs.LG
XGBoost Forecasting of NEPSE Index Log Returns with Walk Forward Validation
[AUTHORS]Sahaj Raj Malla, Shreeyash Kayastha, Rumi Suwal, Harish Chandra Bhandari, Rajendra Adhikari
[ABSTRACT]This study develops a robust machine learning framework for one-step-ahead forecasting of daily log-returns in the Nepal Stock Exchange (NEPSE) Index using the XGBoost regressor. A comprehensive feature set is engineered, including lagged log-returns (up to 30 days) and established technical indicators such as short- and medium-term rolling volatility measures and the 14-period Relative Strength Index. Hyperparameter optimization is performed using Optuna with time-series cross-validation on the initial training segment. Out-of-sample performance is rigorously assessed via walk-forward validation under both expanding and fixed-length rolling window schemes across multiple lag configurations, simulating real-world deployment and avoiding lookahead bias. Predictive accuracy is evaluated using root mean squared error, mean absolute error, coefficient of determination (R-squared), and directional accuracy on both log-returns and reconstructed closing prices. Empirical results show that the optimal configuration, an expanding window with 20 lags, outperforms tuned ARIMA and Ridge regression benchmarks, achieving the lowest log-return RMSE (0.013450) and MAE (0.009814) alongside a directional accuracy of 65.15%. While the R-squared remains modest, consistent with the noisy nature of financial returns, primary emphasis is placed on relative error reduction and directional prediction. Feature importance analysis and visual inspection further enhance interpretability. These findings demonstrate the effectiveness of gradient boosting ensembles in modeling nonlinear dynamics in volatile emerging market time series and establish a reproducible benchmark for NEPSE Index forecasting.
[COMMENTS]9 pages, 4 figures, 3 tables
[LINK]http://arxiv.org/abs/2601.08896v1
[DATE]2026-01-13 23:22:08+08:00
[CATEGORIES]cs.LG
TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards
[AUTHORS]Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fengbin Zhu, Qifan Wang, Fuli Feng
[ABSTRACT]Large language models have seen widespread adoption, yet they remain vulnerable to multi-turn jailbreak attacks, threatening their safe deployment. This has led to the task of training automated multi-turn attackers to probe model safety vulnerabilities. However, existing approaches typically rely on turn-level optimization, which is insufficient for learning long-term attack strategies. To bridge this gap, we formulate this task as a multi-turn reinforcement learning problem, directly optimizing the harmfulness of the final-turn response as the outcome reward. To address the sparse supervision of the outcome reward, we introduce TROJail, which employs two process rewards to evaluate the utility of intermediate prompts and integrate them into advantage estimation. These rewards (1) penalize overly harmful prompts that trigger the model’s refusal mechanism, and (2) encourage steering the semantic relevance of responses toward the targeted harmful content. Experimental results show improved attack success rates across multiple models and benchmarks, highlighting the effectiveness of our approach. The code is available at https://github.com/xxiqiao/TROJail. Warning: This paper contains examples of harmful content.
[COMMENTS]21 pages, 15 figures
[LINK]http://arxiv.org/abs/2512.07761v2
[DATE]2026-01-13 23:14:32+08:00
[CATEGORIES]cs.LG
Efficient and Scalable Implementation of Differentially Private Deep Learning without Shortcuts
[AUTHORS]Sebastian Rodriguez Beltran, Marlon Tobaben, Joonas Jälkö, Niki Loppi, Antti Honkela
[ABSTRACT]Differentially private stochastic gradient descent (DP-SGD) is the standard algorithm for training machine learning models under differential privacy (DP). The most common DP-SGD privacy accountants rely on Poisson subsampling to ensure the theoretical DP guarantees. Implementing computationally efficient DP-SGD with Poisson subsampling is not trivial, which leads many implementations to taking a shortcut by using computationally faster subsampling. We quantify the computational cost of training deep learning models under DP by implementing and benchmarking efficient methods with the correct Poisson subsampling. We find that using the naive implementation of DP-SGD with Opacus in PyTorch has a throughput between 2.6 and 8 times lower than that of SGD. However, efficient gradient clipping implementations like Ghost Clipping can roughly halve this cost. We propose an alternative computationally efficient implementation of DP-SGD with JAX that uses Poisson subsampling and performs comparably with efficient clipping optimizations based on PyTorch. We study the scaling behavior using up to 80 GPUs and find that DP-SGD scales better than SGD. We share our library at https://github.com/DPBayes/Towards-Efficient-Scalable-Training-DP-DL.
[COMMENTS]19 pages, 21 figures
[LINK]http://arxiv.org/abs/2406.17298v3
[DATE]2026-01-13 23:13:42+08:00
[CATEGORIES]cs.LG
Beyond Backpropagation: Optimization with Multi-Tangent Forward Gradients
[AUTHORS]Katharina Flügel, Daniel Coquelin, Marie Weiel, Charlotte Debus, Achim Streit, Markus Götz
[ABSTRACT]The gradients used to train neural networks are typically computed using backpropagation. While an efficient way to obtain exact gradients, backpropagation is computationally expensive, hinders parallelization, and is biologically implausible. Forward gradients are an approach to approximate the gradients from directional derivatives along random tangents computed by forward-mode automatic differentiation. So far, research has focused on using a single tangent per step. This paper provides an in-depth analysis of multi-tangent forward gradients and introduces an improved approach to combining the forward gradients from multiple tangents based on orthogonal projections. We demonstrate that increasing the number of tangents improves both approximation quality and optimization performance across various tasks.
[LINK]http://arxiv.org/abs/2410.17764v2
[DATE]2026-01-13 23:13:09+08:00
[CATEGORIES]cs.LG
M$^2$FMoE: Multi-Resolution Multi-View Frequency Mixture-of-Experts for Extreme-Adaptive Time Series Forecasting
[AUTHORS]Yaohui Huang, Runmin Zou, Yun Wang, Laeeq Aslam, Ruipeng Dong
[ABSTRACT]Forecasting time series with extreme events is critical yet challenging due to their high variance, irregular dynamics, and sparse but high-impact nature. While existing methods excel in modeling dominant regular patterns, their performance degrades significantly during extreme events, constituting the primary source of forecasting errors in real-world applications. Although some approaches incorporate auxiliary signals to improve performance, they still fail to capture extreme events’ complex temporal dynamics. To address these limitations, we propose M$^2$FMoE, an extreme-adaptive forecasting model that learns both regular and extreme patterns through multi-resolution and multi-view frequency modeling. It comprises three modules: (1) a multi-view frequency mixture-of-experts module assigns experts to distinct spectral bands in Fourier and Wavelet domains, with cross-view shared band splitter aligning frequency partitions and enabling inter-expert collaboration to capture both dominant and rare fluctuations; (2) a multi-resolution adaptive fusion module that hierarchically aggregates frequency features from coarse to fine resolutions, enhancing sensitivity to both short-term variations and sudden changes; (3) a temporal gating integration module that dynamically balances long-term trends and short-term frequency-aware features, improving adaptability to both regular and extreme temporal patterns. Experiments on real-world hydrological datasets with extreme patterns demonstrate that M$^2$FMoE outperforms state-of-the-art baselines without requiring extreme-event labels.
[COMMENTS]Accepted by AAAI 2026
[LINK]http://arxiv.org/abs/2601.08631v1
[DATE]2026-01-13 23:05:49+08:00
[CATEGORIES]cs.LG
SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models
[AUTHORS]Renyang Liu, Kangjie Chen, Han Qiu, Jie Zhang, Kwok-Yan Lam, Tianwei Zhang, See-Kiong Ng
[ABSTRACT]Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and a lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, which exhibit the limitations of requiring costly retraining, degrading the quality of benign generations, or failing to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier for identifying unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.
[COMMENTS]Code at https://github.com/ryliu68/SafeRedir
[LINK]http://arxiv.org/abs/2601.08623v1
[DATE]2026-01-13 23:01:38+08:00
[CATEGORIES]cs.LG
Robust low-rank estimation with multiple binary responses using pairwise AUC loss
[AUTHORS]The Tien Mai
[ABSTRACT]Multiple binary responses arise in many modern data-analytic problems. Although fitting separate logistic regressions for each response is computationally attractive, it ignores shared structure and can be statistically inefficient, especially in high-dimensional and class-imbalanced regimes. Low-rank models offer a natural way to encode latent dependence across tasks, but existing methods for binary data are largely likelihood-based and focus on pointwise classification rather than ranking performance. In this work, we propose a unified framework for learning with multiple binary responses that directly targets discrimination by minimizing a surrogate loss for the area under the ROC curve (AUC). The method aggregates pairwise AUC surrogate losses across responses while imposing a low-rank constraint on the coefficient matrix to exploit shared structure. We develop a scalable projected gradient descent algorithm based on truncated singular value decomposition. Exploiting the fact that the pairwise loss depends only on differences of linear predictors, we simplify computation and analysis. We establish non-asymptotic convergence guarantees, showing that under suitable regularity conditions, leading to linear convergence up to the minimax-optimal statistical precision. Extensive simulation studies demonstrate that the proposed method is robust in challenging settings such as label switching and data contamination and consistently outperforms likelihood-based approaches.
[LINK]http://arxiv.org/abs/2601.08618v1
[DATE]2026-01-13 23:00:10+08:00
[CATEGORIES]cs.LG
Accelerated Methods with Complexity Separation Under Data Similarity for Federated Learning Problems
[AUTHORS]Dmitry Bylinkin, Sergey Skorik, Dmitriy Bystrov, Leonid Berezin, Aram Avetisyan, Aleksandr Beznosikov
[ABSTRACT]Heterogeneity within data distribution poses a challenge in many modern federated learning tasks. We formalize it as an optimization problem involving a computationally heavy composite under data similarity. By employing different sets of assumptions, we present several approaches to develop communication-efficient methods. An optimal algorithm is proposed for the convex case. The constructed theory is validated through a series of experiments across various problems.
[COMMENTS]30 pages, 4 theorems, 2 figures
[LINK]http://arxiv.org/abs/2601.08614v1
[DATE]2026-01-13 22:59:22+08:00
[CATEGORIES]cs.LG
Interpretability and Individuality in Knee MRI: Patient-Specific Radiomic Fingerprint with Reconstructed Healthy Personas
[AUTHORS]Yaxi Chen, Simin Ni, Shuai Li, Shaheer U. Saeed, Aleksandra Ivanova, Rikin Hargunani, Jie Huang, Chaozong Liu, Yipeng Hu
[ABSTRACT]For automated assessment of knee MRI scans, both accuracy and interpretability are essential for clinical use and adoption. Traditional radiomics rely on predefined features chosen at the population level; while more interpretable, they are often too restrictive to capture patient-specific variability and can underperform end-to-end deep learning (DL). To address this, we propose two complementary strategies that bring individuality and interpretability: radiomic fingerprints and healthy personas. First, a radiomic fingerprint is a dynamically constructed, patient-specific feature set derived from MRI. Instead of applying a uniform population-level signature, our model predicts feature relevance from a pool of candidate features and selects only those most predictive for each patient, while maintaining feature-level interpretability. This fingerprint can be viewed as a latent-variable model of feature usage, where an image-conditioned predictor estimates usage probabilities and a transparent logistic regression with global coefficients performs classification. Second, a healthy persona synthesises a pathology-free baseline for each patient using a diffusion model trained to reconstruct healthy knee MRIs. Comparing features extracted from pathological images against their personas highlights deviations from normal anatomy, enabling intuitive, case-specific explanations of disease manifestations. We systematically compare fingerprints, personas, and their combination across three clinical tasks. Experimental results show that both approaches yield performance comparable to or surpassing state-of-the-art DL models, while supporting interpretability at multiple levels. Case studies further illustrate how these perspectives facilitate human-explainable biomarker discovery and pathology localisation.
[LINK]http://arxiv.org/abs/2601.08604v1
[DATE]2026-01-13 22:48:01+08:00
[CATEGORIES]cs.LG
A Differential Perspective on Distributional Reinforcement Learning
[AUTHORS]Juan Sebastian Rojas, Chi-Guhn Lee
[ABSTRACT]To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution of an average-reward MDP. We derive proven-convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms that have appealing scaling properties. Empirically, we find that these algorithms yield competitive and sometimes superior performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run per-step reward and differential return distributions.
[COMMENTS]In AAAI Conference on Artificial Intelligence 2026
[LINK]http://arxiv.org/abs/2506.03333v2
[DATE]2026-01-13 21:43:59+08:00
[CATEGORIES]cs.LG
EviNAM: Intelligibility and Uncertainty via Evidential Neural Additive Models
[AUTHORS]Sören Schleibaum, Anton Frederik Thielmann, Julian Teusch, Benjamin Säfken, Jörg P. Müller
[ABSTRACT]Intelligibility and accurate uncertainty estimation are crucial for reliable decision-making. In this paper, we propose EviNAM, an extension of evidential learning that integrates the interpretability of Neural Additive Models (NAMs) with principled uncertainty estimation. Unlike standard Bayesian neural networks and previous evidential methods, EviNAM enables, in a single pass, both the estimation of the aleatoric and epistemic uncertainty as well as explicit feature contributions. Experiments on synthetic and real data demonstrate that EviNAM matches state-of-the-art predictive performance. While we focus on regression, our method extends naturally to classification and generalized additive models, offering a path toward more intelligible and trustworthy predictions.
[LINK]http://arxiv.org/abs/2601.08556v1
[DATE]2026-01-13 21:41:49+08:00
[CATEGORIES]cs.LG
Contrastive and Multi-Task Learning on Noisy Brain Signals with Nonlinear Dynamical Signatures
[AUTHORS]Sucheta Ghosh, Zahra Monfared, Felix Dietrich
[ABSTRACT]We introduce a two-stage multitask learning framework for analyzing Electroencephalography (EEG) signals that integrates denoising, dynamical modeling, and representation learning. In the first stage, a denoising autoencoder is trained to suppress artifacts and stabilize temporal dynamics, providing robust signal representations. In the second stage, a multitask architecture processes these denoised signals to achieve three objectives: motor imagery classification, chaotic versus non-chaotic regime discrimination using Lyapunov exponent-based labels, and self-supervised contrastive representation learning with NT-Xent loss. A convolutional backbone combined with a Transformer encoder captures spatial-temporal structure, while the dynamical task encourages sensitivity to nonlinear brain dynamics. This staged design mitigates interference between reconstruction and discriminative goals, improves stability across datasets, and supports reproducible training by clearly separating noise reduction from higher-level feature learning. Empirical studies show that our framework not only enhances robustness and generalization but also surpasses strong baselines and recent state-of-the-art methods in EEG decoding, highlighting the effectiveness of combining denoising, dynamical features, and self-supervised learning.
[LINK]http://arxiv.org/abs/2601.08549v1
[DATE]2026-01-13 21:36:38+08:00
[CATEGORIES]cs.LG
Convergence of gradient flow for learning convolutional neural networks
[AUTHORS]Jona-Maria Diederen, Holger Rauhut, Ulrich Terstiege
[ABSTRACT]Convolutional neural networks are widely used in imaging and image recognition. Learning such networks from training data leads to the minimization of a non-convex function. This makes the analysis of standard optimization methods such as variants of (stochastic) gradient descent challenging. In this article we study the simplified setting of linear convolutional networks. We show that the gradient flow (to be interpreted as an abstraction of gradient descent) applied to the empirical risk defined via certain loss functions including the square loss always converges to a critical point, under a mild condition on the training data.
[COMMENTS]17 pages
[LINK]http://arxiv.org/abs/2601.08547v1
[DATE]2026-01-13 21:33:48+08:00
[CATEGORIES]cs.LG
Reducing Compute Waste in LLMs through Kernel-Level DVFS
[AUTHORS]Jeffrey Spaan, Kuan-Hsun Chen, Ana-Lucia Varbanescu
[ABSTRACT]The rapid growth of AI has fueled the expansion of accelerator- or GPU-based data centers. However, the rising operational energy consumption has emerged as a critical bottleneck and a major sustainability concern. Dynamic Voltage and Frequency Scaling (DVFS) is a well-known technique used to reduce energy consumption, and thus improve energy-efficiency, since it requires little effort and works with existing hardware. Reducing the energy consumption of training and inference of Large Language Models (LLMs) through DVFS or power capping is feasible: related work has shown energy savings can be significant, but at the cost of significant slowdowns. In this work, we focus on reducing waste in LLM operations: i.e., reducing energy consumption without losing performance. We propose a fine-grained, kernel-level, DVFS approach that explores new frequency configurations, and prove these save more energy than previous, pass- or iteration-level solutions. For example, for a GPT-3 training run, a pass-level approach could reduce energy consumption by 2% (without losing performance), while our kernel-level approach saves as much as 14.6% (with a 0.6% slowdown). We further investigate the effect of data and tensor parallelism, and show our discovered clock frequencies translate well for both. We conclude that kernel-level DVFS is a suitable technique to reduce waste in LLM operations, providing significant energy savings with negligible slow-down.
[LINK]http://arxiv.org/abs/2601.08539v1
[DATE]2026-01-13 21:26:57+08:00
[CATEGORIES]cs.LG
VGC-Bench: Towards Mastering Diverse Team Strategies in Competitive Pokémon
[AUTHORS]Cameron Angliss, Jiaxun Cui, Jiaheng Hu, Arrasy Rahman, Peter Stone
[ABSTRACT]Developing AI agents that can robustly adapt to varying strategic landscapes without retraining is a central challenge in multi-agent learning. Pokémon Video Game Championships (VGC) is a domain with a vast space of approximately $10^{139}$ team configurations, far larger than those of other games such as Chess, Go, Poker, StarCraft, or Dota. The combinatorial nature of team building in Pokémon VGC causes optimal strategies to vary substantially depending on both the controlled team and the opponent’s team, making generalization uniquely challenging. To advance research on this problem, we introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies a human-play dataset of over 700,000 battle logs and a range of baseline agents based on heuristics, large language models, behavior cloning, and multi-agent reinforcement learning with empirical game-theoretic methods such as self-play, fictitious play, and double oracle. In the restricted setting where an agent is trained and evaluated in a mirror match with a single team configuration, our methods can win against a professional VGC competitor. We repeat this training and evaluation with progressively larger team sets and find that as the number of teams increases, the best-performing algorithm in the single-team setting has worse performance and is more exploitable, but has improved generalization to unseen teams. Our code and dataset are open-sourced at https://github.com/cameronangliss/vgc-bench and https://huggingface.co/datasets/cameronangliss/vgc-battle-logs.
[COMMENTS]AAMAS 2026
[LINK]http://arxiv.org/abs/2506.10326v3
[DATE]2026-01-13 21:17:41+08:00
[CATEGORIES]cs.LG
Sampling via Stochastic Interpolants by Langevin-based Velocity and Initialization Estimation in Flow ODEs
[AUTHORS]Chenguang Duan, Yuling Jiao, Gabriele Steidl, Christian Wald, Jerry Zhijian Yang, Ruizhe Zhang
[ABSTRACT]We propose a novel method for sampling from unnormalized Boltzmann densities based on a probability-flow ordinary differential equation (ODE) derived from linear stochastic interpolants. The key innovation of our approach is the use of a sequence of Langevin samplers to enable efficient simulation of the flow. Specifically, these Langevin samplers are employed (i) to generate samples from the interpolant distribution at intermediate times and (ii) to construct, starting from these intermediate times, a robust estimator of the velocity field governing the flow ODE. For both applications of the Langevin diffusions, we establish convergence guarantees. Extensive numerical experiments demonstrate the efficiency of the proposed method on challenging multimodal distributions across a range of dimensions, as well as its effectiveness in Bayesian inference tasks.
[LINK]http://arxiv.org/abs/2601.08527v1
[DATE]2026-01-13 21:11:37+08:00
[CATEGORIES]cs.LG
A Mesh-Adaptive Hypergraph Neural Network for Unsteady Flow Around Oscillating and Rotating Structures
[AUTHORS]Rui Gao, Zhi Cheng, Rajeev K. Jaiman
[ABSTRACT]Graph neural networks, recently introduced into the field of fluid flow surrogate modeling, have been successfully applied to model the temporal evolution of various fluid flow systems. Existing applications, however, are mostly restricted to cases where the domain is time-invariant. The present work extends the application of graph neural network-based modeling to fluid flow around structures rotating with respect to a certain axis. Specifically, we propose to apply a graph neural network-based surrogate model with part of the mesh/graph co-rotating with the structure and part of the mesh/graph static. A single layer of interface cells are constructed at the interface between the two parts and are allowed to distort and adapt, which helps in circumventing the difficulty of interpolating information encoded by the neural network at every graph neural network layer. Dedicated reconstruction and re-projection schemes are designed to counter the error caused by the distortion and connectivity change of the interface cells. The effectiveness of our proposed framework is examined on two test cases: (i) fluid flow around a 2D oscillating airfoil, and (ii) fluid flow past a 3D rotating cube. Our results show that the model achieves stable rollout predictions over hundreds or even a thousand time steps. We further demonstrate that one could enforce accurate, error-bounded prediction results by incorporating the measurements from sparse pressure sensors. In addition to the accurate flow field predictions, the lift and drag force predictions closely match with the computational fluid dynamics calculations, highlighting the potential of the framework for modeling fluid flow around rotating structures, and paving the path towards a graph neural network-based surrogate model for more complex scenarios like flow around marine propellers.
[LINK]http://arxiv.org/abs/2503.22252v2
[DATE]2026-01-13 21:05:27+08:00
[CATEGORIES]cs.LG
$φ$-test: Global Feature Selection and Inference for Shapley Additive Explanations
[AUTHORS]Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
[ABSTRACT]We propose $φ$-test, a global feature-selection and significance procedure for black-box predictors that combines Shapley attributions with selective inference. Given a trained model and an evaluation dataset, $φ$-test performs SHAP-guided screening and fits a linear surrogate on the screened features via a selection rule with a tractable selective-inference form. For each retained feature, it outputs a Shapley-based global score, a surrogate coefficient, and post-selection $p$-values and confidence intervals in a global feature-importance table. Experiments on real tabular regression tasks with tree-based and neural backbones suggest that $φ$-test can retain much of the predictive ability of the original model while using only a few features and producing feature sets that remain fairly stable across resamples and backbone classes. In these settings, $φ$-test acts as a practical global explanation layer linking Shapley-based importance summaries with classical statistical inference.
[COMMENTS]Revised for clarity and correctness; improved exposition and fixed minor issues
[LINK]http://arxiv.org/abs/2512.07578v2
[DATE]2026-01-13 21:05:12+08:00
[CATEGORIES]cs.LG
Transfer Learning Across Fixed-Income Product Classes
[AUTHORS]Nicolas Camenzind, Damir Filipovic
[ABSTRACT]We propose a framework for transfer learning of discount curves across different fixed-income product classes. Motivated by challenges in estimating discount curves from sparse or noisy data, we extend kernel ridge regression (KR) to a vector-valued setting, formulating a convex optimization problem in a vector-valued reproducing kernel Hilbert space (RKHS). Each component of the solution corresponds to the discount curve implied by a specific product class. We introduce an additional regularization term motivated by economic principles, promoting smoothness of spread curves between product classes, and show that it leads to a valid separable kernel structure. A main theoretical contribution is a decomposition of the vector-valued RKHS norm induced by separable kernels. We further provide a Gaussian process interpretation of vector-valued KR, enabling quantification of estimation uncertainty. Illustrative examples show how transfer learning tightens confidence intervals compared to single-curve estimation. An extensive masking experiment demonstrates that transfer learning significantly improves extrapolation performance.
[LINK]http://arxiv.org/abs/2505.07676v2
[DATE]2026-01-13 20:47:03+08:00
[CATEGORIES]cs.LG
Temporal Fusion Nexus: A task-agnostic multi-modal embedding model for clinical narratives and irregular time series in post-kidney transplant care
[AUTHORS]Aditya Kumar, Simon Rauch, Mario Cypko, Marcel Naik, Matthieu-P Schapranow, Aadil Rashid, Fabian Halleck, Bilgin Osmanodja, Roland Roller, Lars Pape, Klemens Budde, Mario Schiffer, Oliver Amft
[ABSTRACT]We introduce Temporal Fusion Nexus (TFN), a multi-modal and task-agnostic embedding model to integrate irregular time series and unstructured clinical narratives. We analysed TFN in post-kidney transplant (KTx) care, with a retrospective cohort of 3382 patients, on three key outcomes: graft loss, graft rejection, and mortality. Compared to state-of-the-art model in post KTx care, TFN achieved higher performance for graft loss (AUC 0.96 vs. 0.94) and graft rejection (AUC 0.84 vs. 0.74). In mortality prediction, TFN yielded an AUC of 0.86. TFN outperformed unimodal baselines (approx 10% AUC improvement over time series only baseline, approx 5% AUC improvement over time series with static patient data). Integrating clinical text improved performance across all tasks. Disentanglement metrics confirmed robust and interpretable latent factors in the embedding space, and SHAP-based attributions confirmed alignment with clinical reasoning. TFN has potential application in clinical tasks beyond KTx, where heterogeneous data sources, irregular longitudinal data, and rich narrative documentation are available.
[COMMENTS]31 pages, 9 figures, 3 tables. A supplementary file is also available
[LINK]http://arxiv.org/abs/2601.08503v1
[DATE]2026-01-13 20:38:33+08:00
[CATEGORIES]cs.LG
GraphFusionSBR: Denoising Multi-Channel Graphs for Session-Based Recommendation
[AUTHORS]Jia-Xin He, Hung-Hsuan Chen
[ABSTRACT]Session-based recommendation systems must capture implicit user intents from sessions. However, existing models suffer from issues such as item interaction dominance and noisy sessions. We propose a multi-channel recommendation model, including a knowledge graph channel, a session hypergraph channel, and a session line graph channel, to capture information from multiple sources. Our model adaptively removes redundant edges in the knowledge graph channel to reduce noise. Knowledge graph representations cooperate with hypergraph representations for prediction to alleviate item dominance. We also generate in-session attention for denoising. Finally, we maximize mutual information between the hypergraph and line graph channels as an auxiliary task. Experiments demonstrate that our method enhances the accuracy of various recommendations, including e-commerce and multimedia recommendations. We release the code on GitHub for reproducibility.\footnote{https://github.com/hohehohe0509/DSR-HK}
[LINK]http://arxiv.org/abs/2601.08497v1
[DATE]2026-01-13 20:32:31+08:00
[CATEGORIES]cs.LG
AUV Trajectory Learning for Underwater Acoustic Energy Transfer and Age Minimization
[AUTHORS]Mohamed Afouene Melki, Mohammad Shehab, Mohamed-Slim Alouini
[ABSTRACT]Internet of underwater things (IoUT) is increasingly gathering attention with the aim of monitoring sea life and deep ocean environment, underwater surveillance as well as maintenance of underwater installments. However, conventional IoUT devices, reliant on battery power, face limitations in lifespan and pose environmental hazards upon disposal. This paper introduces a sustainable approach for simultaneous information uplink from the IoUT devices and acoustic energy transfer (AET) to the devices via an autonomous underwater vehicle (AUV), potentially enabling them to operate indefinitely. To tackle the time-sensitivity, we adopt age of information (AoI), and Jain’s fairness index. We develop two deep-reinforcement learning (DRL) algorithms, offering a high-complexity, high-performance frequency division duplex (FDD) solution and a low-complexity, medium-performance time division duplex (TDD) approach. The results elucidate that the proposed FDD and TDD solutions significantly reduce the average AoI and boost the harvested energy as well as data collection fairness compared to baseline approaches.
[LINK]http://arxiv.org/abs/2601.08491v1
[DATE]2026-01-13 20:23:53+08:00
[CATEGORIES]cs.LG
DiffMM: Efficient Method for Accurate Noisy and Sparse Trajectory Map Matching via One Step Diffusion
[AUTHORS]Chenxu Han, Sean Bin Yang, Jilin Hu
[ABSTRACT]Map matching for sparse trajectories is a fundamental problem for many trajectory-based applications, e.g., traffic scheduling and traffic flow analysis. Existing methods for map matching are generally based on Hidden Markov Model (HMM) or encoder-decoder framework. However, these methods continue to face significant challenges when handling noisy or sparsely sampled GPS trajectories. To address these limitations, we propose DiffMM, an encoder-diffusion-based map matching framework that produces effective yet efficient matching results through a one-step diffusion process. We first introduce a road segment-aware trajectory encoder that jointly embeds the input trajectory and its surrounding candidate road segments into a shared latent space through an attention mechanism. Next, we propose a one step diffusion method to realize map matching through a shortcut model by leveraging the joint embedding of the trajectory and candidate road segments as conditioning context. We conduct extensive experiments on large-scale trajectory datasets, demonstrating that our approach consistently outperforms state-of-the-art map matching methods in terms of both accuracy and efficiency, particularly for sparse trajectories and complex road network topologies.
[COMMENTS]AAAI-26
[LINK]http://arxiv.org/abs/2601.08482v1
[DATE]2026-01-13 20:14:57+08:00
[CATEGORIES]cs.LG
An AI-driven framework for the prediction of personalised health response to air pollution
[AUTHORS]Nazanin Zounemat-Kermani, Sadjad Naderi, Claire H. Dilliway, Claire E. Heaney, Shrreya Behll, Boyang Chen, Hisham Abubakar-Waziri, Alexandra E. Porter, Marc Chadeau-Hyam, Fangxin Fang, Ian M. Adcock, Kian Fan Chung, Christopher C. Pain
[ABSTRACT]Air pollution is a growing global health threat, exacerbated by climate change and linked to cardiovascular and respiratory diseases. While personal sensing devices enable real-time physiological monitoring, their integration with environmental data for individualised health prediction remains underdeveloped. Here, we present a modular, cloud-based framework that predicts personalised physiological responses to pollution by combining wearable-derived data with real-time environmental exposures. At its core is an Adversarial Autoencoder (AAE), initially trained on high-resolution pollution-health data from the INHALE study and fine-tuned using smartwatch data via transfer learning to capture individual-specific patterns. Consistent with changes in pollution levels commonly observed in the real-world, simulated pollution spikes (+100%) revealed modest but measurable increases in vital signs (e.g., +2.5% heart rate, +3.5% breathing rate). To assess clinical relevance, we analysed U-BIOPRED data and found that individuals with such subclinical vital sign elevations had higher asthma burden scores or elevated Fractional Exhaled Nitric Oxide (FeNO), supporting the physiological validity of these AI-predicted responses. This integrative approach demonstrates the feasibility of anticipatory, personalised health modelling in response to environmental challenges, offering a scalable and secure infrastructure for AI-driven environmental health monitoring.
[COMMENTS]Zounemat-Kermani and Naderi share first authorship. 22 pages, 5 figures and 1 table
[LINK]http://arxiv.org/abs/2505.10556v2
[DATE]2026-01-13 19:56:04+08:00
[CATEGORIES]cs.LG
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
[AUTHORS]Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
[ABSTRACT]We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0\% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and creates sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion shows how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Code: https://github.com/zaydzuhri/softpick-attention.
[COMMENTS]Updated by adding analysis on why it does not scale
[LINK]http://arxiv.org/abs/2504.20966v3
[DATE]2026-01-13 19:54:31+08:00
[CATEGORIES]cs.LG
Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling
[AUTHORS]Takamichi Miyata, Sumiko Miyata, Andrew Morris
[ABSTRACT]Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.
[LINK]http://arxiv.org/abs/2601.08467v1
[DATE]2026-01-13 19:46:05+08:00
[CATEGORIES]cs.LG
Decentralized Autoregressive Generation
[AUTHORS]Stepan Maschan, Haoxuan Qu, Jun Liu
[ABSTRACT]We present a theoretical analysis of decentralization of autoregressive generation. We define the Decentralized Discrete Flow Matching objective, by expressing probability generating velocity as a linear combination of expert flows. We also conduct experiments demonstrating the equivalence between decentralized and centralized training settings for multimodal language models across diverse set of benchmarks. Specifically, we compare two distinct paradigms: LLaVA and InternVL 2.5-1B, which uses a fixed CLIP vision encoder and performs full-parameter fine-tuning (ViT+MLP+LLM) during the instruction tuning stage.
[COMMENTS]Work in progress
[LINK]http://arxiv.org/abs/2601.03184v2
[DATE]2026-01-13 19:19:48+08:00
[CATEGORIES]cs.LG
Attention Consistency Regularization for Interpretable Early-Exit Neural Networks
[AUTHORS]Yanhua Zhao
[ABSTRACT]Early-exit neural networks enable adaptive inference by allowing predictions at intermediate layers, reducing computational cost. However, early exits often lack interpretability and may focus on different features than deeper layers, limiting trust and explainability. This paper presents Explanation-Guided Training (EGT), a multi-objective framework that improves interpretability and consistency in early-exit networks through attention-based regularization. EGT introduces an attention consistency loss that aligns early-exit attention maps with the final exit. The framework jointly optimizes classification accuracy and attention consistency through a weighted combination of losses. Experiments on a real-world image classification dataset demonstrate that EGT achieves up to 98.97% overall accuracy (matching baseline performance) with a 1.97x inference speedup through early exits, while improving attention consistency by up to 18.5% compared to baseline models. The proposed method provides more interpretable and consistent explanations across all exit points, making early-exit networks more suitable for explainable AI applications in resource-constrained environments.
[COMMENTS]2 pages, 1 figure
[LINK]http://arxiv.org/abs/2601.08891v1
[DATE]2026-01-13 19:19:09+08:00
[CATEGORIES]cs.LG
Noise-Adaptive Regularization for Robust Multi-Label Remote Sensing Image Classification
[AUTHORS]Tom Burgert, Julia Henkel, Begüm Demir
[ABSTRACT]The development of reliable methods for multi-label classification (MLC) has become a prominent research direction in remote sensing (RS). As the scale of RS data continues to expand, annotation procedures increasingly rely on thematic products or crowdsourced procedures to reduce the cost of manual annotation. While cost-effective, these strategies often introduce multi-label noise in the form of partially incorrect annotations. In MLC, label noise arises as additive noise, subtractive noise, or a combination of both in the form of mixed noise. Previous work has largely overlooked this distinction and commonly treats noisy annotations as supervised signals, lacking mechanisms that explicitly adapt learning behavior to different noise types. To address this limitation, we propose NAR, a noise-adaptive regularization method that explicitly distinguishes between additive and subtractive noise within a semi-supervised learning framework. NAR employs a confidence-based label handling mechanism that dynamically retains label entries with high confidence, temporarily deactivates entries with moderate confidence, and corrects low confidence entries via flipping. This selective attenuation of supervision is integrated with early-learning regularization (ELR) to stabilize training and mitigate overfitting to corrupted labels. Experiments across additive, subtractive, and mixed noise scenarios demonstrate that NAR consistently improves robustness compared with existing methods. Performance improvements are most pronounced under subtractive and mixed noise, indicating that adaptive suppression and selective correction of noisy supervision provide an effective strategy for noise robust learning in RS MLC.
[COMMENTS]Submitted to TGRS
[LINK]http://arxiv.org/abs/2601.08446v1
[DATE]2026-01-13 19:16:45+08:00
[CATEGORIES]cs.LG
Applying the maximum entropy principle to neural networks enhances multi-species distribution models
[AUTHORS]Maxime Ryckewaert, Diego Marcos, Christophe Botella, Maximilien Servajean, Pierre Bonnet, Alexis Joly
[ABSTRACT]The rapid expansion of citizen science initiatives has led to a significant growth of biodiversity databases, and particularly presence-only (PO) observations. PO data are invaluable for understanding species distributions and their dynamics, but their use in a Species Distribution Model (SDM) is curtailed by sampling biases and the lack of information on absences. Poisson point processes are widely used for SDMs, with Maxent being one of the most popular methods. Maxent maximises the entropy of a probability distribution across sites as a function of predefined transformations of variables, called features. In contrast, neural networks and deep learning have emerged as a promising technique for automatic feature extraction from complex input variables. Arbitrarily complex transformations of input variables can be learned from the data efficiently through backpropagation and stochastic gradient descent (SGD). In this paper, we propose DeepMaxent, which harnesses neural networks to automatically learn shared features among species, using the maximum entropy principle. To do so, it employs a normalised Poisson loss where for each species, presence probabilities across sites are modelled by a neural network. We evaluate DeepMaxent on a benchmark dataset known for its spatial sampling biases, using PO data for calibration and presence-absence (PA) data for validation across six regions with different biological groups and covariates. Our results indicate that DeepMaxent performs better than Maxent and other leading SDMs across all regions and taxonomic groups. The method performs particularly well in regions of uneven sampling, demonstrating substantial potential to increase SDM performances. In particular, our approach yields more accurate predictions than traditional single-species models, which opens up new possibilities for methodological enhancement.
[COMMENTS]Submitted to Methods in Ecology and Evolution
[LINK]http://arxiv.org/abs/2412.19217v4
[DATE]2026-01-13 18:53:51+08:00
[CATEGORIES]cs.LG
CausAdv: A Causal-based Framework for Detecting Adversarial Examples
[AUTHORS]Hichem Debbi
[ABSTRACT]Deep learning has led to tremendous success in computer vision, largely due to Convolutional Neural Networks (CNNs). However, CNNs have been shown to be vulnerable to crafted adversarial perturbations. This vulnerability of adversarial examples has has motivated research into improving model robustness through adversarial detection and defense methods. In this paper, we address the adversarial robustness of CNNs through causal reasoning. We propose CausAdv: a causal framework for detecting adversarial examples based on counterfactual reasoning. CausAdv learns both causal and non-causal features of every input, and quantifies the counterfactual information (CI) of every filter of the last convolutional layer. We then perform a statistical analysis of the filters’ CI across clean and adversarial samples, to demonstrate that adversarial examples exhibit different CI distributions compared to clean samples. Our results show that causal reasoning enhances the process of adversarial detection without the need to train a separate detector. Moreover, we illustrate the efficiency of causal explanations as a helpful detection tool by visualizing the extracted causal features.
[LINK]http://arxiv.org/abs/2411.00839v2
[DATE]2026-01-13 18:52:32+08:00
[CATEGORIES]cs.LG
Coverage Improvement and Fast Convergence of On-policy Preference Learning
[AUTHORS]Juno Kim, Jihun Yun, Jason D. Lee, Kwang-Sung Jun
[ABSTRACT]Online on-policy preference learning algorithms for language model alignment such as online direct policy optimization (DPO) can significantly outperform their offline counterparts. We provide a theoretical explanation for this phenomenon by analyzing how the sampling policy’s coverage evolves throughout on-policy training. We propose and rigorously justify the \emph{coverage improvement principle}: with sufficient batch size, each update moves into a region around the target where coverage is uniformly better, making subsequent data increasingly informative and enabling rapid convergence. In the contextual bandit setting with Bradley-Terry preferences and linear softmax policy class, we show that on-policy DPO converges exponentially in the number of iterations for batch size exceeding a generalized coverage threshold. In contrast, any learner restricted to offline samples from the initial policy suffers a slower minimax rate, leading to a sharp separation in total sample complexity. Motivated by this analysis, we further propose a simple hybrid sampler based on a novel \emph{preferential} G-optimal design, which removes dependence on coverage and guarantees convergence in just two rounds. Finally, we develop principled on-policy schemes for reward distillation in the general function class setting, and show faster noiseless rates under an alternative deviation-based notion of coverage. Experimentally, we confirm that on-policy DPO and our proposed reward distillation algorithms outperform their off-policy counterparts and enjoy stable, monotonic performance gains across iterations.
[COMMENTS]46 pages, 2 figures, 2 tables
[LINK]http://arxiv.org/abs/2601.08421v1
[DATE]2026-01-13 18:46:06+08:00
[CATEGORIES]cs.LG
Taxon: Hierarchical Tax Code Prediction with Semantically Aligned LLM Expert Guidance
[AUTHORS]Jihang Li, Qing Liu, Zulong Chen, Jing Wang, Wei Wang, Chuanfei Xu, Zeyi Wen
[ABSTRACT]Tax code prediction is a crucial yet underexplored task in automating invoicing and compliance management for large-scale e-commerce platforms. Each product must be accurately mapped to a node within a multi-level taxonomic hierarchy defined by national standards, where errors lead to financial inconsistencies and regulatory risks. This paper presents Taxon, a semantically aligned and expert-guided framework for hierarchical tax code prediction. Taxon integrates (i) a feature-gating mixture-of-experts architecture that adaptively routes multi-modal features across taxonomy levels, and (ii) a semantic consistency model distilled from large language models acting as domain experts to verify alignment between product titles and official tax definitions. To address noisy supervision in real business records, we design a multi-source training pipeline that combines curated tax databases, invoice validation logs, and merchant registration data to provide both structural and semantic supervision. Extensive experiments on the proprietary TaxCode dataset and public benchmarks demonstrate that Taxon achieves state-of-the-art performance, outperforming strong baselines. Further, an additional full hierarchical paths reconstruction procedure significantly improves structural consistency, yielding the highest overall F1 scores. Taxon has been deployed in production within Alibaba’s tax service system, handling an average of over 500,000 tax code queries per day and reaching peak volumes above five million requests during business event with improved accuracy, interpretability, and robustness.
[LINK]http://arxiv.org/abs/2601.08418v1
[DATE]2026-01-13 18:41:23+08:00
[CATEGORIES]cs.LG
Out-of-distribution generalization of deep-learning surrogates for 2D PDE-generated dynamics in the small-data regime
[AUTHORS]Binh Duong Nguyen, Stefan Sandfeld
[ABSTRACT]Partial differential equations (PDEs) are a central tool for modeling the dynamics of physical, engineering, and materials systems, but high-fidelity simulations are often computationally expensive. At the same time, many scientific applications can be viewed as the evolution of spatially distributed fields, making data-driven forecasting of such fields a core task in scientific machine learning. In this work we study autoregressive deep-learning surrogates for two-dimensional PDE dynamics on periodic domains, focusing on generalization to out-of-distribution initial conditions within a fixed PDE and parameter regime and on strict small-data settings with at most $\mathcal{O}(10^2)$ simulated trajectories per system. We introduce a multi-channel U-Net […], evaluate it on five qualitatively different PDE families and compare it to ViT, AFNO, PDE-Transformer, and KAN-UNet under a common training setup. Across all datasets, me-UNet matches or outperforms these more complex architectures in terms of field-space error, spectral similarity, and physics-based metrics for in-distribution rollouts, while requiring substantially less training time. It also generalizes qualitatively to unseen initial conditions with as few as $\approx 20$ training simulations. A data-efficiency study and Grad-CAM analysis further suggest that, in small-data periodic 2D PDE settings, convolutional architectures with inductive biases aligned to locality and periodic boundary conditions remain strong contenders for accurate and moderately out-of-distribution-robust surrogate modeling.
[LINK]http://arxiv.org/abs/2601.08404v1
[DATE]2026-01-13 18:20:59+08:00
[CATEGORIES]cs.LG
Generating crossmodal gene expression from cancer histopathology improves multimodal AI predictions
[AUTHORS]Samiran Dey, Christopher R. S. Banerji, Partha Basuchowdhuri, Sanjoy K. Saha, Deepak Parashar, Tapabrata Chakraborti
[ABSTRACT]Emerging research has highlighted that artificial intelligence-based multimodal fusion of digital pathology and transcriptomic features can improve cancer diagnosis (grading/subtyping) and prognosis (survival risk) prediction. However, such direct fusion is impractical in clinical settings, where histopathology remains the gold standard and transcriptomic tests are rarely requested in public healthcare. We experiment on two publicly available multimodal datasets, The Cancer Genomic Atlas and the Clinical Proteomic Tumor Analysis Consortium, spanning four independent cohorts: glioma-glioblastoma, renal, uterine, and breast, and observe significant performance gains in gradation and risk estimation (p-value<0.05) when incorporating synthesized transcriptomic data with WSIs. Also, predictions using synthesized features were statistically close to those obtained with real transcriptomic data (p-value>0.05), consistently across cohorts. Here we show that with our diffusion based crossmodal generative AI model, PathGen, gene expressions synthesized from digital histopathology jointly predict cancer grading and patient survival risk with high accuracy (state-of-the-art performance), certainty (through conformal coverage guarantee) and interpretability (through distributed co-attention maps). PathGen code is available for open use on GitHub at https://github.com/Samiran-Dey/PathGen.
[LINK]http://arxiv.org/abs/2502.00568v4
[DATE]2026-01-13 18:13:57+08:00
[CATEGORIES]cs.LG
Controlled LLM Training on Spectral Sphere
[AUTHORS]Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, Baining Guo
[ABSTRACT]Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization ($\boldsymbolμ$P) provides a theoretical safeguard for width-invariant $Θ(1)$ activation control, whereas emerging optimizers like Muon are only “half-aligned” with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the \textbf{Spectral Sphere Optimizer (SSO)}, which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully $\boldsymbolμ$P-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.
[LINK]http://arxiv.org/abs/2601.08393v1
[DATE]2026-01-13 17:59:47+08:00
[CATEGORIES]cs.LG
Gradient-free online learning of subgrid-scale dynamics with neural emulators
[AUTHORS]Hugo Frezat, Ronan Fablet, Guillaume Balarac, Julien Le Sommer
[ABSTRACT]In this paper, we propose a generic algorithm to train machine learning-based subgrid parametrizations online, i.e., with \textit{a posteriori} loss functions, but for non-differentiable numerical solvers. The proposed approach leverages a neural emulator to approximate the reduced state-space solver, which is then used to allow gradient propagation through temporal integration steps. We apply this methodology on a chaotic two-timescales Lorenz-96 system and a single layer quasi-geostrophic system with zonal dynamics, known to be highly unstable with offline strategies. Using our algorithm, we are able to train a parametrization that recovers most of the benefits of online strategies without having to compute the gradient of the original solver. We found that training the neural emulator and parametrization components separately with different loss quantities is necessary in order to minimize the propagation of approximation biases. Experiments on emulator architectures with different complexities also indicates that emulator performance is key in order to learn an accurate parametrization. This work is a step towards learning parametrization with online strategies for climate models.
[COMMENTS]39 pages, 8 figures, submitted for publication in Journal of Advances in Modeling Earth Systems (JAMES)
[LINK]http://arxiv.org/abs/2310.19385v5
[DATE]2026-01-13 17:59:16+08:00
[CATEGORIES]cs.LG
Data Science: a Natural Ecosystem
[AUTHORS]Emilio Porcu, Roy El Moukari, Laurent Najman, Francisco Herrera, Horst Simon
[ABSTRACT]This manuscript provides a systemic and data-centric view of what we term essential data science, as a natural ecosystem with challenges and missions stemming from the fusion of data universe with its multiple combinations of the 5D complexities (data structure, domain, cardinality, causality, and ethics) with the phases of the data life cycle. Data agents perform tasks driven by specific goals. The data scientist is an abstract entity that comes from the logical organization of data agents with their actions. Data scientists face challenges that are defined according to the missions. We define specific discipline-induced data science, which in turn allows for the definition of pan-data science, a natural ecosystem that integrates specific disciplines with the essential data science. We semantically split the essential data science into computational, and foundational. By formalizing this ecosystemic view, we contribute a general-purpose, fusion-oriented architecture for integrating heterogeneous knowledge, agents, and workflows-relevant to a wide range of disciplines and high-impact applications.
[LINK]http://arxiv.org/abs/2506.11010v2
[DATE]2026-01-13 17:56:43+08:00
[CATEGORIES]cs.LG
Detecting High-Stakes Interactions with Activation Probes
[AUTHORS]Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
[ABSTRACT]Monitoring is an important aspect of safely deploying Large Language Models (LLMs). This paper examines activation probes for detecting “high-stakes” interactions – where the text indicates that the interaction might lead to significant harm – as a critical, yet underexplored, target for such monitoring. We evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data. Probes’ performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude. These savings are enabled by reusing activations of the model that is being monitored. Our experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis. We release our novel synthetic dataset and the codebase at https://github.com/arrrlex/models-under-pressure.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2506.10805v3
[DATE]2026-01-13 17:49:38+08:00
[CATEGORIES]cs.LG
Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance
[AUTHORS]Matina Mahdizadeh Sani, Nima Jamali, Mohammad Jalali, Farzan Farnia
[ABSTRACT]Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their outputs often deviate from the characteristics of user-specific target data. Such mismatches are especially problematic in domain adaptation tasks, where only a few reference examples are available and retraining the diffusion model is infeasible. Existing inference-time guidance methods can adjust sampling trajectories, but they typically optimize surrogate objectives such as classifier likelihoods rather than directly aligning with the target distribution. We propose MMD Guidance, a training-free mechanism that augments the reverse diffusion process with gradients of the Maximum Mean Discrepancy (MMD) between generated samples and a reference dataset. MMD provides reliable distributional estimates from limited data, exhibits low variance in practice, and is efficiently differentiable, which makes it particularly well-suited for the guidance task. Our framework naturally extends to prompt-aware adaptation in conditional generation models via product kernels. Also, it can be applied with computational efficiency in latent diffusion models (LDMs), since guidance is applied in the latent space of the LDM. Experiments on synthetic and real-world benchmarks demonstrate that MMD Guidance can achieve distributional alignment while preserving sample fidelity.
[LINK]http://arxiv.org/abs/2601.08379v1
[DATE]2026-01-13 17:42:57+08:00
[CATEGORIES]cs.LG
Sequential Enumeration in Large Language Models
[AUTHORS]Kuinan Hou, Marco Zorzi, Alberto Testolin
[ABSTRACT]Reliably counting and generating sequences of items remain a significant challenge for neural networks, including Large Language Models (LLMs). Indeed, although this capability is readily handled by rule-based symbolic systems based on serial computation, learning to systematically deploy counting procedures is difficult for neural models, which should acquire these skills through learning. Previous research has demonstrated that recurrent architectures can only approximately track and enumerate sequences of events, and it remains unclear whether modern deep learning systems, including LLMs, can deploy systematic counting procedures over sequences of discrete symbols. This paper aims to fill this gap by investigating the sequential enumeration abilities of five state-of-the-art LLMs, including proprietary, open-source, and reasoning models. We probe LLMs in sequential naming and production tasks involving lists of letters and words, adopting a variety of prompting instructions to explore the role of chain-of-thought in the spontaneous emerging of counting strategies. We also evaluate open-source models with the same architecture but increasing size to see whether the mastering of counting principles follows scaling laws, and we analyze the embedding dynamics during sequential enumeration to investigate the emergent encoding of numerosity. We find that some LLMs are indeed capable of deploying counting procedures when explicitly prompted to do so, but none of them spontaneously engage in counting when simply asked to enumerate the number of items in a sequence. Our results suggest that, despite their impressive emergent abilities, LLMs cannot yet robustly and systematically deploy counting procedures, highlighting a persistent gap between neural and symbolic approaches to compositional generalization.
[LINK]http://arxiv.org/abs/2512.04727v2
[DATE]2026-01-13 17:39:43+08:00
[CATEGORIES]cs.LG
Geo-NVS-w: Geometry-Aware Novel View Synthesis In-the-Wild with an SDF Renderer
[AUTHORS]Anastasios Tsalakopoulos, Angelos Kanlis, Evangelos Chatzis, Antonis Karakottas, Dimitrios Zarpalas
[ABSTRACT]We introduce Geo-NVS-w, a geometry-aware framework for high-fidelity novel view synthesis from unstructured, in-the-wild image collections. While existing in-the-wild methods already excel at novel view synthesis, they often lack geometric grounding on complex surfaces, sometimes producing results that contain inconsistencies. Geo-NVS-w addresses this limitation by leveraging an underlying geometric representation based on a Signed Distance Function (SDF) to guide the rendering process. This is complemented by a novel Geometry-Preservation Loss which ensures that fine structural details are preserved. Our framework achieves competitive rendering performance, while demonstrating a 4-5x reduction reduction in energy consumption compared to similar methods. We demonstrate that Geo-NVS-w is a robust method for in-the-wild NVS, yielding photorealistic results with sharp, geometrically coherent details.
[COMMENTS]Presented at the ICCV 2025 Workshop on Large Scale Cross Device Localization
[LINK]http://arxiv.org/abs/2601.08371v1
[DATE]2026-01-13 17:34:01+08:00
[CATEGORIES]cs.LG
An Algebraic Representation Theorem for Linear GENEOs in Geometric Machine Learning
[AUTHORS]Francesco Conti, Patrizio Frosini, Nicola Quercioli
[LINK]http://arxiv.org/abs/2601.03910v2
[DATE]2026-01-13 17:26:45+08:00
[CATEGORIES]cs.LG
Decodable but not structured: linear probing enables Underwater Acoustic Target Recognition with pretrained audio embeddings
[AUTHORS]Hilde I. Hummel, Sandjai Bhulai, Rob D. van der Mei, Burooj Ghani
[ABSTRACT]Increasing levels of anthropogenic noise from ships contribute significantly to underwater sound pollution, posing risks to marine ecosystems. This makes monitoring crucial to understand and quantify the impact of the ship radiated noise. Passive Acoustic Monitoring (PAM) systems are widely deployed for this purpose, generating years of underwater recordings across diverse soundscapes. Manual analysis of such large-scale data is impractical, motivating the need for automated approaches based on machine learning. Recent advances in automatic Underwater Acoustic Target Recognition (UATR) have largely relied on supervised learning, which is constrained by the scarcity of labeled data. Transfer Learning (TL) offers a promising alternative to mitigate this limitation. In this work, we conduct the first empirical comparative study of transfer learning for UATR, evaluating multiple pretrained audio models originating from diverse audio domains. The pretrained model weights are frozen, and the resulting embeddings are analyzed through classification, clustering, and similarity-based evaluations. The analysis shows that the geometrical structure of the embedding space is largely dominated by recording-specific characteristics. However, a simple linear probe can effectively suppress this recording-specific information and isolate ship-type features from these embeddings. As a result, linear probing enables effective automatic UATR using pretrained audio models at low computational cost, significantly reducing the need for a large amounts of high-quality labeled ship recordings.
[LINK]http://arxiv.org/abs/2601.08358v1
[DATE]2026-01-13 17:15:31+08:00
[CATEGORIES]cs.LG
MLPlatt: Simple Calibration Framework for Ranking Models
[AUTHORS]Piotr Bajger, Roman Dusek, Krzysztof Galias, Paweł Młyniec, Aleksander Wawer, Paweł Zawistowski
[ABSTRACT]Ranking models are extensively used in e-commerce for relevance estimation. These models often suffer from poor interpretability and no scale calibration, particularly when trained with typical ranking loss functions. This paper addresses the problem of post-hoc calibration of ranking models. We introduce MLPlatt: a simple yet effective ranking model calibration method that preserves the item ordering and converts ranker outputs to interpretable click-through rate (CTR) probabilities usable in downstream tasks. The method is context-aware by design and achieves good calibration metrics globally, and within strata corresponding to different values of a selected categorical field (such as user country or device), which is often important from a business perspective of an E-commerce platform. We demonstrate the superiority of MLPlatt over existing approaches on two datasets, achieving an improvement of over 10\% in F-ECE (Field Expected Calibration Error) compared to other methods. Most importantly, we show that high-quality calibration can be achieved without compromising the ranking quality.
[LINK]http://arxiv.org/abs/2601.08345v1
[DATE]2026-01-13 17:04:42+08:00
[CATEGORIES]cs.LG
Automated Machine Learning in Radiomics: A Comparative Evaluation of Performance, Efficiency and Accessibility
[AUTHORS]Jose Lozano-Montoya, Emilio Soria-Olivas, Almudena Fuster-Matanzo, Angel Alberich-Bayarri, Ana Jimenez-Pastor
[ABSTRACT]Automated machine learning (AutoML) frameworks can lower technical barriers for predictive and prognostic model development in radiomics by enabling researchers without programming expertise to build models. However, their effectiveness in addressing radiomics-specific challenges remains unclear. This study evaluates the performance, efficiency, and accessibility of general-purpose and radiomics-specific AutoML frameworks on diverse radiomics classification tasks, thereby highlighting development needs for radiomics. Ten public/private radiomics datasets with varied imaging modalities (CT/MRI), sizes, anatomies and endpoints were used. Six general-purpose and five radiomics-specific frameworks were tested with predefined parameters using standardized cross-validation. Evaluation metrics included AUC, runtime, together with qualitative aspects related to software status, accessibility, and interpretability. Simplatab, a radiomics-specific tool with a no-code interface, achieved the highest average test AUC (81.81%) with a moderate runtime (~1 hour). LightAutoML, a general-purpose framework, showed the fastest execution with competitive performance (78.74% mean AUC in six minutes). Most radiomics-specific frameworks were excluded from the performance analysis due to obsolescence, extensive programming requirements, or computational inefficiency. Conversely, general-purpose frameworks demonstrated higher accessibility and ease of implementation. Simplatab provides an effective balance of performance, efficiency, and accessibility for radiomics classification problems. However, significant gaps remain, including the lack of accessible survival analysis support and the limited integration of feature reproducibility and harmonization within current AutoML frameworks. Future research should focus on adapting AutoML solutions to better address these radiomics-specific challenges.
[COMMENTS]27 pages, 4 figures, 3 tables, code available, see https://github.com/joselznom/AutoML-Comparison-in-Radiomics
[LINK]http://arxiv.org/abs/2601.08334v1
[DATE]2026-01-13 16:47:44+08:00
[CATEGORIES]cs.LG
On the Theoretical Foundation of Sparse Dictionary Learning in Mechanistic Interpretability
[AUTHORS]Yiming Tang, Harshvardhan Saini, Zhaoqian Yao, Yizhen Liao, Qianxiao Li, Mengnan Du, Dianbo Liu
[ABSTRACT]As AI models achieve remarkable capabilities across diverse domains, understanding what representations they learn and how they process information has become increasingly important for both scientific progress and trustworthy deployment. Recent works in mechanistic interpretability have shown that neural networks represent meaningful concepts as directions in their representation spaces and often encode diverse concepts in superposition. Various sparse dictionary learning (SDL) methods, including sparse autoencoders, transcoders, and crosscoders, address this by training auxiliary models with sparsity constraints to disentangle these superposed concepts into monosemantic features. These methods have demonstrated remarkable empirical success but have limited theoretical understanding. Existing theoretical work is limited to sparse autoencoders with tied-weight constraints, leaving the broader family of SDL methods without formal grounding. In this work, we develop the first unified theoretical framework considering SDL as one optimization problem. We demonstrate how diverse methods instantiate the theoretical framework and provide rigorous analysis of the optimization landscape. We provide novel theoretical explanations for empirically observed phenomena, including feature absorption and dead neurons. We design the Linear Representation Bench, a benchmark that strictly follows the Linear Representation Hypothesis, to evaluate SDL methods with fully accessible ground-truth features. Motivated by our theory and findings, we develop feature achoring, a novel technique applicable for all SDL methods, to enhance their feature recovery capabilities.
[LINK]http://arxiv.org/abs/2512.05534v2
[DATE]2026-01-13 16:47:42+08:00
[CATEGORIES]cs.LG
Using Subgraph GNNs for Node Classification:an Overlooked Potential Approach
[AUTHORS]Qian Zeng, Xin Lin, Jingyi Gao, Yang Yu
[ABSTRACT]Previous studies have demonstrated the strong performance of Graph Neural Networks (GNNs) in node classification. However, most existing GNNs adopt a node-centric perspective and rely on global message passing, leading to high computational and memory costs that hinder scalability. To mitigate these challenges, subgraph-based methods have been introduced, leveraging local subgraphs as approximations of full computational trees. While this approach improves efficiency, it often suffers from performance degradation due to the loss of global contextual information, limiting its effectiveness compared to global GNNs. To address this trade-off between scalability and classification accuracy, we reformulate the node classification task as a subgraph classification problem and propose SubGND (Subgraph GNN for NoDe). This framework introduces a differentiated zero-padding strategy and an Ego-Alter subgraph representation method to resolve label conflicts while incorporating an Adaptive Feature Scaling Mechanism to dynamically adjust feature contributions based on dataset-specific dependencies. Experimental results on six benchmark datasets demonstrate that SubGND achieves performance comparable to or surpassing global message-passing GNNs, particularly in heterophilic settings, highlighting its effectiveness and scalability as a promising solution for node classification.
[COMMENTS]16 pages
[LINK]http://arxiv.org/abs/2503.06614v2
[DATE]2026-01-13 16:44:48+08:00
[CATEGORIES]cs.LG
Bayesian Multiobject Tracking With Neural-Enhanced Motion and Measurement Models
[AUTHORS]Shaoxiu Wei, Mingchao Liang, Florian Meyer
[ABSTRACT]Multiobject tracking (MOT) is an important task in applications including autonomous driving, ocean sciences, and aerospace surveillance. Traditional MOT methods are model-based and combine sequential Bayesian estimation with data association and an object birth model. More recent methods are fully data-driven and rely on the training of neural networks. Both approaches offer distinct advantages in specific settings. In particular, model-based methods are generally applicable across a wide range of scenarios, whereas data-driven MOT achieves superior performance in scenarios where abundant labeled data for training is available. A natural thought is whether a general framework can integrate the two approaches. This paper introduces a hybrid method that utilizes neural networks to enhance specific aspects of the statistical model in Bayesian MOT that have been identified as overly simplistic. By doing so, the performance of the prediction and update steps of Bayesian MOT is improved. To ensure tractable computation, our framework uses belief propagation to avoid high-dimensional operations combined with sequential Monte Carlo methods to perform low-dimensional operations efficiently. The resulting method combines the flexibility and robustness of model-based approaches with the capability to learn complex information from data of neural networks. We evaluate the performance of the proposed method based on the nuScenes autonomous driving dataset and demonstrate that it has state-of-the-art performance.
[LINK]http://arxiv.org/abs/2506.18124v3
[DATE]2026-01-13 16:40:42+08:00
[CATEGORIES]cs.LG
Disentangling History and Propagation Dependencies in Cross-Subject Knee Contact Stress Prediction Using a Shared MeshGraphNet Backbone
[AUTHORS]Zhengye Pan, Jianwei Zuo, Jiajia Luo
[ABSTRACT]Background:Subject-specific finite element analysis accurately characterizes knee joint mechanics but is computationally expensive. Deep surrogate models provide a rapid alternative, yet their generalization across subjects under limited pose and load inputs remains unclear. It remains unclear whether the dominant source of prediction uncertainty arises from temporal history dependence or spatial propagation dependence. Methods:To disentangle these factors, we employed a shared MGN backbone with a fixed mesh topology. A dataset of running trials from nine subjects was constructed using an OpenSim-FEBio workflow. We developed four model variants to isolate specific dependencies: (1) a baseline MGN; (2) CT-MGN, incorporating a Control Transformer to encode short-horizon history; (3) MsgModMGN, applying state-conditioned modulation to message passing for adaptive propagation; (4) CT-MsgModMGN, combining both mechanisms. Models were evaluated using a rigorous grouped 3-fold cross-validation on unseen subjects.Results:The models incorporating history encoding significantly outperformed the baseline MGN and MsgModMGN in global accuracy and spatial consistency. Crucially, the CT module effectively mitigated the peak-shaving defect common in deep surrogates, significantly reducing peak stress prediction errors. In contrast, the spatial propagation modulation alone yielded no significant improvement over the baseline, and combining it with CT provided no additional benefit.Conclusion:Temporal history dependence, rather than spatial propagation modulation, is the primary driver of prediction uncertainty in cross-subject knee contact mechanics. Explicitly encoding short-horizon driver sequences enables the surrogate model to recover implicit phase information, thereby achieving superior fidelity in peak-stress capture and high-risk localization compared to purely state-based approaches.
[LINK]http://arxiv.org/abs/2601.08318v1
[DATE]2026-01-13 16:15:57+08:00
[CATEGORIES]cs.LG
ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning
[AUTHORS]Kun Liang, Clive Bai, Xin Xu, Chenming Tang, Sanwoo Lee, Weijie Liu, Saiyong Yang, Yunfang Wu
[ABSTRACT]Recent Large Reasoning Models (LRMs) achieve strong performance by leveraging long-form Chain-of-Thought (CoT) reasoning, but uniformly applying overlong reasoning at inference time incurs substantial and often unnecessary computational cost. To address this, prior work explores various strategies to infer an appropriate reasoning budget from the input. However, such approaches are unreliable in the worst case, as estimating the minimal required reasoning effort is fundamentally difficult, and they implicitly fix the trade-off between reasoning cost and accuracy during training, limiting flexibility under varying deployment scenarios. Motivated by these limitations, we propose ORBIT, a controllable multi-budget reasoning framework with well-separated reasoning modes triggered by input. ORBIT employs multi-stage reinforcement learning to discover Pareto-optimal reasoning behaviors at each effort, followed by on-policy distillation to fuse these behaviors into a single unified model. Experiments show that ORBIT achieves (1) controllable reasoning behavior over multiple modes, (2) competitive reasoning density within each mode, and (3) integration of these frontier policies into a single unified student model while preserving clear mode separation and high per-mode performance.
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2601.08310v1
[DATE]2026-01-13 15:57:48+08:00
[CATEGORIES]cs.LG
PINGS-X: Physics-Informed Normalized Gaussian Splatting with Axes Alignment for Efficient Super-Resolution of 4D Flow MRI
[AUTHORS]Sun Jo, Seok Young Hong, JinHyun Kim, Seungmin Kang, Ahjin Choi, Don-Gwan An, Simon Song, Je Hyeong Hong
[ABSTRACT]4D flow magnetic resonance imaging (MRI) is a reliable, non-invasive approach for estimating blood flow velocities, vital for cardiovascular diagnostics. Unlike conventional MRI focused on anatomical structures, 4D flow MRI requires high spatiotemporal resolution for early detection of critical conditions such as stenosis or aneurysms. However, achieving such resolution typically results in prolonged scan times, creating a trade-off between acquisition speed and prediction accuracy. Recent studies have leveraged physics-informed neural networks (PINNs) for super-resolution of MRI data, but their practical applicability is limited as the prohibitively slow training process must be performed for each patient. To overcome this limitation, we propose PINGS-X, a novel framework modeling high-resolution flow velocities using axes-aligned spatiotemporal Gaussian representations. Inspired by the effectiveness of 3D Gaussian splatting (3DGS) in novel view synthesis, PINGS-X extends this concept through several non-trivial novel innovations: (i) normalized Gaussian splatting with a formal convergence guarantee, (ii) axes-aligned Gaussians that simplify training for high-dimensional data while preserving accuracy and the convergence guarantee, and (iii) a Gaussian merging procedure to prevent degenerate solutions and boost computational efficiency. Experimental results on computational fluid dynamics (CFD) and real 4D flow MRI datasets demonstrate that PINGS-X substantially reduces training time while achieving superior super-resolution accuracy. Our code and datasets are available at https://github.com/SpatialAILab/PINGS-X.
[COMMENTS]Accepted at AAAI 2026. Supplementary material included after references. 27 pages, 21 figures, 11 tables
[LINK]http://arxiv.org/abs/2511.11048v2
[DATE]2026-01-13 15:54:23+08:00
[CATEGORIES]cs.LG
Visually Prompted Benchmarks Are Surprisingly Fragile
[AUTHORS]Haiwen Feng, Long Lian, Lisa Dunlap, Jiahao Shu, XuDong Wang, Renhao Wang, Trevor Darrell, Alane Suhr, Angjoo Kanazawa
[ABSTRACT]A key challenge in evaluating VLMs is testing models’ ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. We open-source VPBench and our analysis framework at: https://lisadunlap.github.io/vpbench/.
[LINK]http://arxiv.org/abs/2512.17875v2
[DATE]2026-01-13 15:50:07+08:00
[CATEGORIES]cs.LG
PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
[AUTHORS]Ruiyi Ding, Yongxuan Lv, Xianhui Meng, Jiahe Song, Chao Wang, Chen Jiang, Yuan Cheng
[ABSTRACT]Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning . While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low-reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through location-parameter shift. On MATH500, PRPO improves Qwen2.5-Math-1.5B accuracy from 61.2% to 64.4% over GRPO using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization. Code is available at: https://github.com/SchumiDing/srpocode
[COMMENTS]8 pages, 2 figures Code is available at: https://github.com/SchumiDing/srpocode
[LINK]http://arxiv.org/abs/2601.07182v2
[DATE]2026-01-13 15:35:53+08:00
[CATEGORIES]cs.LG
AgriLens: Semantic Retrieval in Agricultural Texts Using Topic Modeling and Language Models
[AUTHORS]Heba Shakeel, Tanvir Ahmad, Tanya Liyaqat, Chandni Saxena
[ABSTRACT]As the volume of unstructured text continues to grow across domains, there is an urgent need for scalable methods that enable interpretable organization, summarization, and retrieval of information. This work presents a unified framework for interpretable topic modeling, zero-shot topic labeling, and topic-guided semantic retrieval over large agricultural text corpora. Leveraging BERTopic, we extract semantically coherent topics. Each topic is converted into a structured prompt, enabling a language model to generate meaningful topic labels and summaries in a zero-shot manner. Querying and document exploration are supported via dense embeddings and vector search, while a dedicated evaluation module assesses topical coherence and bias. This framework supports scalable and interpretable information access in specialized domains where labeled data is limited.
[COMMENTS]8 Pages, 1st workshop on Democratizing GenAI and Scalable NLP with HiPC for Societal Impact; 32nd IEEE International Conference on High Performance Computing, Data, & Analytics
[LINK]http://arxiv.org/abs/2601.08283v1
[DATE]2026-01-13 15:18:59+08:00
[CATEGORIES]cs.LG
Brain network science modelling of sparse neural networks enables Transformers and LLMs to perform as fully connected
[AUTHORS]Yingtao Zhang, Diego Cerretti, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
[ABSTRACT]Dynamic sparse training (DST) can reduce the computational demands in ANNs, but faces difficulties in keeping peak performance at high sparsity levels. The Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages a gradient-free, topology-driven link regrowth, which has shown ultra-sparse (less than 1% connectivity) advantage across various tasks compared to fully connected networks. Yet, CHT suffers two main drawbacks: (i) its time complexity is $O(Nd^3)$ - N node network size, d node degree - restricting it to ultra-sparse regimes. (ii) it selects top link prediction scores, which is inappropriate for the early training epochs, when the network presents unreliable connections. Here, we design the first brain-inspired network model - termed bipartite receptive field (BRF) - to initialize the connectivity of sparse artificial neural networks. We further introduce a GPU-friendly matrix-based approximation of CH link prediction, reducing complexity to $O(N^3)$. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a flexible strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. Additionally, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that BRF offers performance advantages over previous network science models. Using 1% of connections, CHTs outperforms fully connected networks in MLP architectures on image classification tasks, compressing some networks to less than 30% of the nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Finally, at 30% connectivity, both CHTs and CHTss outperform other DST methods in language modeling task.
[LINK]http://arxiv.org/abs/2501.19107v4
[DATE]2026-01-13 15:17:05+08:00
[CATEGORIES]cs.LG
Rethinking Recurrent Neural Networks for Time Series Forecasting: A Reinforced Recurrent Encoder with Prediction-Oriented Proximal Policy Optimization
[AUTHORS]Xin Lai, Shiming Deng, Lu Yu, Yumin Lai, Shenghao Qiao, Xinze Zhang
[ABSTRACT]Time series forecasting plays a crucial role in contemporary engineering information systems for supporting decision-making across various industries, where Recurrent Neural Networks (RNNs) have been widely adopted due to their capability in modeling sequential data. Conventional RNN-based predictors adopt an encoder-only strategy with sliding historical windows as inputs to forecast future values. However, this approach treats all time steps and hidden states equally without considering their distinct contributions to forecasting, leading to suboptimal performance. To address this limitation, we propose a novel Reinforced Recurrent Encoder with Prediction-oriented Proximal Policy Optimization, RRE-PPO4Pred, which significantly improves time series modeling capacity and forecasting accuracy of the RNN models. The core innovations of this method are: (1) A novel Reinforced Recurrent Encoder (RRE) framework that enhances RNNs by formulating their internal adaptation as a Markov Decision Process, creating a unified decision environment capable of learning input feature selection, hidden skip connection, and output target selection; (2) An improved Prediction-oriented Proximal Policy Optimization algorithm, termed PPO4Pred, which is equipped with a Transformer-based agent for temporal reasoning and develops a dynamic transition sampling strategy to enhance sampling efficiency; (3) A co-evolutionary optimization paradigm to facilitate the learning of the RNN predictor and the policy agent, providing adaptive and interactive time series modeling. Comprehensive evaluations on five real-world datasets indicate that our method consistently outperforms existing baselines, and attains accuracy better than state-of-the-art Transformer models, thus providing an advanced time series predictor in engineering informatics.
[LINK]http://arxiv.org/abs/2601.03683v2
[DATE]2026-01-13 15:12:23+08:00
[CATEGORIES]cs.LG
Synergy over Discrepancy: A Partition-Based Approach to Multi-Domain LLM Fine-Tuning
[AUTHORS]Hua Ye, Siyuan Chen, Haoliang Zhang, Weihao Luo, Yanbin Li, Xuan Zhang
[ABSTRACT]Large language models (LLMs) demonstrate impressive generalization abilities, yet adapting them effectively across multiple heterogeneous domains remains challenging due to inter-domain interference. To overcome this challenge, we propose a partition-based multi-stage fine-tuning framework designed to exploit inter-domain synergies while minimizing negative transfer. Our approach strategically partitions domains into subsets (stages) by balancing domain discrepancy, synergy, and model capacity constraints. We theoretically analyze the proposed framework and derive novel generalization bounds that justify our partitioning strategy. Extensive empirical evaluations on various language understanding tasks show that our method consistently outperforms state-of-the-art baselines.
[COMMENTS]20 pages, 5 figures, 21 tables. Accepted at NeurIPS 2025. Corresponding author: Xuan Zhang ([email protected])
[LINK]http://arxiv.org/abs/2511.07198v3
[DATE]2026-01-13 14:58:58+08:00
[CATEGORIES]cs.LG
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
[AUTHORS]Hua Ye, Hang Ding, Siyuan Chen, Yiyang Jiang, Changyuan Zhang, Xuan Zhang
[ABSTRACT]Most multimodal models treat every negative pair alike, ignoring the ambiguous negatives that differ from the positive by only a small detail. We propose Boundary-Aware Curriculum with Local Attention (BACL), a lightweight add-on that turns these borderline cases into a curriculum signal. A Boundary-aware Negative Sampler gradually raises difficulty, while a Contrastive Local Attention loss highlights where the mismatch occurs. The two modules are fully differentiable and work with any off-the-shelf dual encoder. Theory predicts a fast O(1/n) error rate; practice shows up to +32% R@1 over CLIP and new SOTA on four large-scale benchmarks, all without extra labels.
[COMMENTS]24 pages, 6 figures, 5 tables. Submitted to NeurIPS 2025
[LINK]http://arxiv.org/abs/2511.08399v2
[DATE]2026-01-13 14:56:09+08:00
[CATEGORIES]cs.LG
How Memory in Optimization Algorithms Implicitly Modifies the Loss
[AUTHORS]Matias D. Cattaneo, Boris Shigida
[ABSTRACT]In modern optimization methods used in deep learning, each update depends on the history of previous iterations, often referred to as memory, and this dependence decays fast as the iterates go further into the past. For example, gradient descent with momentum has exponentially decaying memory through exponentially averaged past gradients. We introduce a general technique for identifying a memoryless algorithm that approximates an optimization algorithm with memory. It is obtained by replacing all past iterates in the update by the current one, and then adding a correction term arising from memory (also a function of the current iterate). This correction term can be interpreted as a perturbation of the loss, and the nature of this perturbation can inform how memory implicitly (anti-)regularizes the optimization dynamics. As an application of our theory, we find that Lion does not have the kind of implicit anti-regularization induced by memory that AdamW does, providing a theory-based explanation for Lion’s better generalization performance recently documented.
[LINK]http://arxiv.org/abs/2502.02132v3
[DATE]2026-01-13 14:37:35+08:00
[CATEGORIES]cs.LG
A Usable GAN-Based Tool for Synthetic ECG Generation in Cardiac Amyloidosis Research
[AUTHORS]Francesco Speziale, Ugo Lomoio, Fabiola Boccuto, Pierangelo Veltri, Pietro Hiram Guzzi
[ABSTRACT]Cardiac amyloidosis (CA) is a rare and underdiagnosed infiltrative cardiomyopathy, and available datasets for machine-learning models are typically small, imbalanced and heterogeneous. This paper presents a Generative Adversarial Network (GAN) and a graphical command-line interface for generating realistic synthetic electrocardiogram (ECG) beats to support early diagnosis and patient stratification in CA. The tool is designed for usability, allowing clinical researchers to train class-specific generators once and then interactively produce large volumes of labelled synthetic beats that preserve the distribution of minority classes.
[LINK]http://arxiv.org/abs/2601.08260v1
[DATE]2026-01-13 14:33:16+08:00
[CATEGORIES]cs.LG
On Evaluation of Unsupervised Feature Selection for Pattern Classification
[AUTHORS]Gyu-Il Kim, Dae-Won Kim, Jaesung Lee
[ABSTRACT]Unsupervised feature selection aims to identify a compact subset of features that captures the intrinsic structure of data without supervised label. Most existing studies evaluate the performance of methods using the single-label dataset that can be instantiated by selecting a label from multi-label data while maintaining the original features. Because the chosen label can vary arbitrarily depending on the experimental setting, the superiority among compared methods can be changed with regard to which label happens to be selected. Thus, evaluating unsupervised feature selection methods based solely on single-label accuracy is unreasonable for assessing their true discriminative ability. This study revisits this evaluation paradigm by adopting a multi-label classification framework. Experiments on 21 multi-label datasets using several representative methods demonstrate that performance rankings differ markedly from those reported under single-label settings, suggesting the possibility of multi-label evaluation settings for fair and reliable comparison of unsupervised feature selection methods.
[LINK]http://arxiv.org/abs/2601.08257v1
[DATE]2026-01-13 14:28:59+08:00
[CATEGORIES]cs.LG
LDLT L-Lipschitz Network Weight Parameterization Initialization
[AUTHORS]Marius F. R. Juston, Ramavarapu S. Sreenivas, Dustin Nottage, Ahmet Soylemezoglu
[ABSTRACT]We analyze initialization dynamics for LDLT-based $\mathcal{L}$-Lipschitz layers by deriving the exact marginal output variance when the underlying parameter matrix $W_0\in \mathbb{R}^{m\times n}$ is initialized with IID Gaussian entries $\mathcal{N}(0,σ^2)$. The Wishart distribution, $S=W_0W_0^\top\sim\mathcal{W}_m(n,σ^2 \boldsymbol{I}_m)$, used for computing the output marginal variance is derived in closed form using expectations of zonal polynomials via James’ theorem and a Laplace-integral expansion of $(α\boldsymbol{I}_m+S)^{-1}$. We develop an Isserlis/Wick-based combinatorial expansion for $\operatorname{\mathbb{E}}\left[\operatorname{tr}(S^k)\right]$ and provide explicit truncated moments up to $k=10$, which yield accurate series approximations for small-to-moderate $σ^2$. Monte Carlo experiments confirm the theoretical estimates. Furthermore, empirical analysis was performed to quantify that, using current He or Kaiming initialization with scaling $1/\sqrt{n}$, the output variance is $0.41$, whereas the new parameterization with $10/ \sqrt{n}$ for $α=1$ results in an output variance of $0.9$. The findings clarify why deep $\mathcal{L}$-Lipschitz networks suffer rapid information loss at initialization and offer practical prescriptions for choosing initialization hyperparameters to mitigate this effect. However, using the Higgs boson classification dataset, a hyperparameter sweep over optimizers, initialization scale, and depth was conducted to validate the results on real-world data, showing that although the derivation ensures variance preservation, empirical results indicate He initialization still performs better.
[COMMENTS]12 pages, 17 figures
[LINK]http://arxiv.org/abs/2601.08253v1
[DATE]2026-01-13 14:20:21+08:00
[CATEGORIES]cs.LG
Learning Steerable Clarification Policies with Collaborative Self-play
[AUTHORS]Jonathan Berant, Maximillian Chen, Adam Fisch, Reza Aghajani, Fantine Huot, Mirella Lapata, Jacob Eisenstein
[LINK]http://arxiv.org/abs/2512.04068v2
[DATE]2026-01-13 14:20:19+08:00
[CATEGORIES]cs.LG
Directed Homophily-Aware Graph Neural Network
[AUTHORS]Aihu Zhang, Jiaxing Xu, Mengcheng Lan, Shili Xiang, Yiping Ke
[ABSTRACT]Graph Neural Networks (GNNs) have achieved significant success in various learning tasks on graph-structured data. Nevertheless, most GNNs struggle to generalize to heterophilic neighborhoods. Additionally, many GNNs ignore the directional nature of real-world graphs, resulting in suboptimal performance on directed graphs with asymmetric structures. In this work, we propose Directed Homophily-aware Graph Neural Network (DHGNN), a novel framework that addresses these limitations by incorporating homophily-aware and direction-sensitive components. DHGNN employs a resettable gating mechanism to adaptively modulate message contributions based on homophily levels and informativeness, and a structure-aware noise-tolerant fusion module to effectively integrate node representations from the original and reverse directions. Extensive experiments on both homophilic and heterophilic directed graph datasets demonstrate that DHGNN outperforms state-of-the-art methods in node classification and link prediction. In particular, DHGNN improves over the best baseline by up to 15.07\% in link prediction. Our analysis further shows that the gating mechanism captures directional homophily gaps and fluctuating homophily across layers, providing deeper insights into message-passing behavior on complex graph structures.
[LINK]http://arxiv.org/abs/2505.22362v3
[DATE]2026-01-13 14:16:28+08:00
[CATEGORIES]cs.LG
Hyperbolic Heterogeneous Graph Transformer
[AUTHORS]Jongmin Park, Seunghoon Han, Hyewon Lee, Won-Yong Shin, Sungsu Lim
[ABSTRACT]In heterogeneous graphs, we can observe complex structures such as tree-like or hierarchical structures. Recently, the hyperbolic space has been widely adopted in many studies to effectively learn these complex structures. Although these methods have demonstrated the advantages of the hyperbolic space in learning heterogeneous graphs, most existing methods still have several challenges. They rely heavily on tangent-space operations, which often lead to mapping distortions during frequent transitions. Moreover, their message-passing architectures mainly focus on local neighborhood information, making it difficult to capture global hierarchical structures and long-range dependencies between different types of nodes. To address these limitations, we propose Hyperbolic Heterogeneous Graph Transformer (HypHGT), which effectively and efficiently learns heterogeneous graph representations entirely within the hyperbolic space. Unlike previous message-passing based hyperbolic heterogeneous GNNs, HypHGT naturally captures both local and global dependencies through transformer-based architecture. Furthermore, the proposed relation-specific hyperbolic attention mechanism in HypHGT, which operates with linear time complexity, enables efficient computation while preserving the heterogeneous information across different relation types. This design allows HypHGT to effectively capture the complex structural properties and semantic information inherent in heterogeneous graphs. We conduct comprehensive experiments to evaluate the effectiveness and efficiency of HypHGT, and the results demonstrate that it consistently outperforms state-of-the-art methods in node classification task, with significantly reduced training time and memory usage.
[COMMENTS]14pages, 9 figures
[LINK]http://arxiv.org/abs/2601.08251v1
[DATE]2026-01-13 14:15:14+08:00
[CATEGORIES]cs.LG
Structural Dimension Reduction in Bayesian Networks
[AUTHORS]Pei Heng, Yi Sun, Jianhua Guo
[ABSTRACT]This work introduces a novel technique, named structural dimension reduction, to collapse a Bayesian network onto a minimum and localized one while ensuring that probabilistic inferences between the original and reduced networks remain consistent. To this end, we propose a new combinatorial structure in directed acyclic graphs called the directed convex hull, which has turned out to be equivalent to their minimum localized Bayesian networks. An efficient polynomial-time algorithm is devised to identify them by determining the unique directed convex hulls containing the variables of interest from the original networks. Experiments demonstrate that the proposed technique has high dimension reduction capability in real networks, and the efficiency of probabilistic inference based on directed convex hulls can be significantly improved compared with traditional methods such as variable elimination and belief propagation algorithms. The code of this study is open at \href{https://github.com/Balance-H/Algorithms}{https://github.com/Balance-H/Algorithms} and the proofs of the results in the main body are postponed to the appendix.
[COMMENTS]13 pages
[LINK]http://arxiv.org/abs/2601.08236v1
[DATE]2026-01-13 13:42:52+08:00
[CATEGORIES]cs.LG
Gradient flow in parameter space is equivalent to linear interpolation in output space
[AUTHORS]Thomas Chen, Patrícia Muñoz Ewald
[ABSTRACT]We prove that the standard gradient flow in parameter space that underlies many training algorithms in deep learning can be continuously deformed into an adapted gradient flow which yields (constrained) Euclidean gradient flow in output space. Moreover, for the $L^{2}$ loss, if the Jacobian of the outputs with respect to the parameters is full rank (for fixed training data), then the time variable can be reparametrized so that the resulting flow is simply linear interpolation, and a global minimum can be achieved. For the cross-entropy loss, under the same rank condition and assuming the labels have positive components, we derive an explicit formula for the unique global minimum.
[COMMENTS]To appear in Journal of Geometry and Physics
[LINK]http://arxiv.org/abs/2408.01517v3
[DATE]2026-01-13 13:40:35+08:00
[CATEGORIES]cs.LG
When Does Learning Renormalize? Sufficient Conditions for Power Law Spectral Dynamics
[AUTHORS]Yizhou Zhang
[ABSTRACT]Empirical power–law scaling has been widely observed across modern deep learning systems, yet its theoretical origins and scope of validity remain incompletely understood. The Generalized Resolution–Shell Dynamics (GRSD) framework models learning as spectral energy transport across logarithmic resolution shells, providing a coarse–grained dynamical description of training. Within GRSD, power–law scaling corresponds to a particularly simple renormalized shell dynamics; however, such behavior is not automatic and requires additional structural properties of the learning process.
In this work, we identify a set of sufficient conditions under which the GRSD shell dynamics admits a renormalizable coarse–grained description. These conditions constrain the learning configuration at multiple levels, including boundedness of gradient propagation in the computation graph, weak functional incoherence at initialization, controlled Jacobian evolution along training, and log–shift invariance of renormalized shell couplings. We further show that power–law scaling does not follow from renormalizability alone, but instead arises as a rigidity consequence: once log–shift invariance is combined with the intrinsic time–rescaling covariance of gradient flow, the renormalized GRSD velocity field is forced into a power–law form.
[LINK]http://arxiv.org/abs/2512.18209v5
[DATE]2026-01-13 13:30:44+08:00
[CATEGORIES]cs.LG
GADPN: Graph Adaptive Denoising and Perturbation Networks via Singular Value Decomposition
[AUTHORS]Hao Deng, Bo Liu
[ABSTRACT]While Graph Neural Networks (GNNs) excel on graph-structured data, their performance is fundamentally limited by the quality of the observed graph, which often contains noise, missing links, or structural properties misaligned with GNNs’ underlying assumptions. To address this, graph structure learning aims to infer a more optimal topology. Existing methods, however, often incur high computational costs due to complex generative models and iterative joint optimization, limiting their practical utility. In this paper, we propose GADPN, a simple yet effective graph structure learning framework that adaptively refines graph topology via low-rank denoising and generalized structural perturbation. Our approach makes two key contributions: (1) we introduce Bayesian optimization to adaptively determine the optimal denoising strength, tailoring the process to each graph’s homophily level; and (2) we extend the structural perturbation method to arbitrary graphs via Singular Value Decomposition (SVD), overcoming its original limitation to symmetric structures. Extensive experiments on benchmark datasets demonstrate that GADPN achieves state-of-the-art performance while significantly improving efficiency. It shows particularly strong gains on challenging disassortative graphs, validating its ability to robustly learn enhanced graph structures across diverse network types.
[LINK]http://arxiv.org/abs/2601.08230v1
[DATE]2026-01-13 13:25:32+08:00
[CATEGORIES]cs.LG
A Comparative Study of Traditional Machine Learning, Deep Learning, and Large Language Models for Mental Health Forecasting using Smartphone Sensing Data
[AUTHORS]Kaidong Feng, Zhu Sun, Roy Ka-Wei Lee, Xun Jiang, Yin-Leng Theng, Yi Ding
[ABSTRACT]Smartphone sensing offers an unobtrusive and scalable way to track daily behaviors linked to mental health, capturing changes in sleep, mobility, and phone use that often precede symptoms of stress, anxiety, or depression. While most prior studies focus on detection that responds to existing conditions, forecasting mental health enables proactive support through Just-in-Time Adaptive Interventions. In this paper, we present the first comprehensive benchmarking study comparing traditional machine learning (ML), deep learning (DL), and large language model (LLM) approaches for mental health forecasting using the College Experience Sensing (CES) dataset, the most extensive longitudinal dataset of college student mental health to date. We systematically evaluate models across temporal windows, feature granularities, personalization strategies, and class imbalance handling. Our results show that DL models, particularly Transformer (Macro-F1 = 0.58), achieve the best overall performance, while LLMs show strength in contextual reasoning but weaker temporal modeling. Personalization substantially improves forecasts of severe mental health states. By revealing how different modeling approaches interpret phone sensing behavioral data over time, this work lays the groundwork for next-generation, adaptive, and human-centered mental health technologies that can advance both research and real-world well-being.
[LINK]http://arxiv.org/abs/2601.03603v2
[DATE]2026-01-13 13:06:22+08:00
[CATEGORIES]cs.LG
An Axiomatic Approach to General Intelligence: SANC(E3) – Self-organizing Active Network of Concepts with Energy E3
[AUTHORS]Daesuk Kwon, Won-gi Paeng
[ABSTRACT]General intelligence must reorganize experience into internal structures that enable prediction and action under finite resources. Existing systems implicitly presuppose fixed primitive units – tokens, subwords, pixels, or predefined sensor channels – thereby bypassing the question of how representational units themselves emerge and stabilize. This paper proposes SANC(E3), an axiomatic framework in which representational units are not given a priori but instead arise as stable outcomes of competitive selection, reconstruction, and compression under finite activation capacity, governed by the explicit minimization of an energy functional E3. SANC(E3) draws a principled distinction between system tokens – structural anchors such as {here, now, I} and sensory sources – and tokens that emerge through self-organization during co-occurring events. Five core axioms formalize finite capacity, association from co-occurrence, similarity-based competition, confidence-based stabilization, and the reconstruction-compression-update trade-off. A key feature is a pseudo-memory-mapped I/O mechanism, through which internally replayed Gestalts are processed via the same axiomatic pathway as external sensory input. As a result, perception, imagination, prediction, planning, and action are unified within a single representational and energetic process. From the axioms, twelve propositions are derived, showing that category formation, hierarchical organization, unsupervised learning, and high-level cognitive activities can all be understood as instances of Gestalt completion under E3 minimization.
[COMMENTS]20 pages, 3 tables
[LINK]http://arxiv.org/abs/2601.08224v1
[DATE]2026-01-13 13:06:07+08:00
[CATEGORIES]cs.LG
One-Shot Federated Ridge Regression: Exact Recovery via Sufficient Statistic Aggregation
[AUTHORS]Zahir Alsulaimawi
[ABSTRACT]Federated learning protocols require repeated synchronization between clients and a central server, with convergence rates depending on learning rates, data heterogeneity, and client sampling. This paper asks whether iterative communication is necessary for distributed linear regression. We show it is not. We formulate federated ridge regression as a distributed equilibrium problem where each client computes local sufficient statistics – the Gram matrix and moment vector – and transmits them once. The server reconstructs the global solution through a single matrix inversion. We prove exact recovery: under a coverage condition on client feature matrices, one-shot aggregation yields the centralized ridge solution, not an approximation. For heterogeneous distributions violating coverage, we derive non-asymptotic error bounds depending on spectral properties of the aggregated Gram matrix. Communication reduces from $\mathcal{O}(Rd)$ in iterative methods to $\mathcal{O}(d^2)$ total; for high-dimensional settings, we propose and experimentally validate random projection techniques reducing this to $\mathcal{O}(m^2)$ where $m \ll d$. We establish differential privacy guarantees where noise is injected once per client, eliminating the composition penalty that degrades privacy in multi-round protocols. We further address practical considerations including client dropout robustness, federated cross-validation for hyperparameter selection, and comparison with gradient-based alternatives. Comprehensive experiments on synthetic heterogeneous regression demonstrate that one-shot fusion matches FedAvg accuracy while requiring up to $38\times$ less communication. The framework applies to kernel methods and random feature models but not to general nonlinear architectures.
[LINK]http://arxiv.org/abs/2601.08216v1
[DATE]2026-01-13 12:47:22+08:00
[CATEGORIES]cs.LG
Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function
[AUTHORS]Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
[ABSTRACT]Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.
[COMMENTS]36 pages, 21 figures, 4 tables
[LINK]http://arxiv.org/abs/2512.04559v2
[DATE]2026-01-13 12:42:44+08:00
[CATEGORIES]cs.LG
Adaptive few-shot learning for robust part quality classification in two-photon lithography
[AUTHORS]Sixian Jia, Ruo-Syuan Mei, Chenhui Shao
[ABSTRACT]Two-photon lithography (TPL) is an advanced additive manufacturing (AM) technique for fabricating high-precision micro-structures. While computer vision (CV) is proofed for automated quality control, existing models are often static, rendering them ineffective in dynamic manufacturing environments. These models typically cannot detect new, unseen defect classes, be efficiently updated from scarce data, or adapt to new part geometries. To address this gap, this paper presents an adaptive CV framework for the entire life-cycle of quality model maintenance. The proposed framework is built upon a same, scale-robust backbone model and integrates three key methodologies: (1) a statistical hypothesis testing framework based on Linear Discriminant Analysis (LDA) for novelty detection, (2) a two-stage, rehearsal-based strategy for few-shot incremental learning, and (3) a few-shot Domain-Adversarial Neural Network (DANN) for few-shot domain adaptation. The framework was evaluated on a TPL dataset featuring hemisphere as source domain and cube as target domain structures, with each domain categorized into good, minor damaged, and damaged quality classes. The hypothesis testing method successfully identified new class batches with 99-100% accuracy. The incremental learning method integrated a new class to 92% accuracy using only K=20 samples. The domain adaptation model bridged the severe domain gap, achieving 96.19% accuracy on the target domain using only K=5 shots. These results demonstrate a robust and data-efficient solution for deploying and maintaining CV models in evolving production scenarios.
[LINK]http://arxiv.org/abs/2601.08885v1
[DATE]2026-01-13 12:32:49+08:00
[CATEGORIES]cs.LG
FUME: Fused Unified Multi-Gas Emission Network for Livestock Rumen Acidosis Detection
[AUTHORS]Taminul Islam, Toqi Tahamid Sarker, Mohamed Embaby, Khaled R Ahmed, Amer AbuGhazaleh
[ABSTRACT]Ruminal acidosis is a prevalent metabolic disorder in dairy cattle causing significant economic losses and animal welfare concerns. Current diagnostic methods rely on invasive pH measurement, limiting scalability for continuous monitoring. We present FUME (Fused Unified Multi-gas Emission Network), the first deep learning approach for rumen acidosis detection from dual-gas optical imaging under in vitro conditions. Our method leverages complementary carbon dioxide (CO2) and methane (CH4) emission patterns captured by infrared cameras to classify rumen health into Healthy, Transitional, and Acidotic states. FUME employs a lightweight dual-stream architecture with weight-shared encoders, modality-specific self-attention, and channel attention fusion, jointly optimizing gas plume segmentation and classification of dairy cattle health. We introduce the first dual-gas OGI dataset comprising 8,967 annotated frames across six pH levels with pixel-level segmentation masks. Experiments demonstrate that FUME achieves 80.99% mIoU and 98.82% classification accuracy while using only 1.28M parameters and 1.97G MACs–outperforming state-of-the-art methods in segmentation quality with 10x lower computational cost. Ablation studies reveal that CO2 provides the primary discriminative signal and dual-task learning is essential for optimal performance. Our work establishes the feasibility of gas emission-based livestock health monitoring, paving the way for practical, in vitro acidosis detection systems. Codes are available at https://github.com/taminulislam/fume.
[COMMENTS]10 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.08205v1
[DATE]2026-01-13 12:17:22+08:00
[CATEGORIES]cs.LG
Private Zeroth-Order Optimization with Public Data
[AUTHORS]Xuchen Gong, Tian Li
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2511.10859v2
[DATE]2026-01-13 12:08:46+08:00
[CATEGORIES]cs.LG
Latent Geometry of Taste: Scalable Low-Rank Matrix Factorization for Recommender Systems
[AUTHORS]Joshua Salako
[ABSTRACT]Scalability and data sparsity remain critical bottlenecks for collaborative filtering on massive interaction datasets. This work investigates the latent geometry of user preferences using the MovieLens 32M dataset, implementing a high-performance, parallelized Alternating Least Squares (ALS) framework. Through extensive hyperparameter optimization, we demonstrate that constrained low-rank models significantly outperform higher dimensional counterparts in generalization, achieving an optimal balance between Root Mean Square Error (RMSE) and ranking precision. We visualize the learned embedding space to reveal the unsupervised emergence of semantic genre clusters, confirming that the model captures deep structural relationships solely from interaction data. Finally, we validate the system’s practical utility in a cold-start scenario, introducing a tunable scoring parameter to manage the trade-off between popularity bias and personalized affinity effectively. The codebase for this research can be found here: https://github.com/joshsalako/recommender.git
[COMMENTS]Added a new figure on page 5, updated the title to include recommender systems, updated keywords, updated captions for all figures, and cited all figures in the text
[LINK]http://arxiv.org/abs/2601.03466v2
[DATE]2026-01-13 12:08:05+08:00
[CATEGORIES]cs.LG
On the Sample Complexity of Differentially Private Policy Optimization
[AUTHORS]Yi He, Xingyu Zhou
[ABSTRACT]Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.21060v2
[DATE]2026-01-13 12:00:52+08:00
[CATEGORIES]cs.LG
UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model
[AUTHORS]Junzhe Li, Sifan Zhou, Liya Guo, Xuerui Qiu, Linrui Xu, Delin Qu, Tingting Long, Chun Fan, Ming Li, Hehe Fan, Jun Liu, Shuicheng Yan
[ABSTRACT]Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily faces two challenges: $\textbf{(1)}$ $\textbf{fragmentation development}$, with existing methods failing to unify understanding and generation into a single one, hindering the way to artificial general intelligence. $\textbf{(2) lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To handle those issues, we propose $\textbf{UniF$^2$ace}$, $\textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $\textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, unifying masked generative models with discrete score matching diffusion and leading to a more precise approximation of the negative log-likelihood. Moreover, this D3Diff significantly enhances the model’s ability to synthesize high-fidelity facial details aligned with text input. $\textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture, adaptively incorporating the semantic and identity facial embeddings to complement the attribute forgotten phenomenon in representation evolvement. $\textbf{Finally}$, to this end, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models with a similar scale in both understanding and generation tasks, with 7.1\% higher Desc-GPT and 6.6\% higher VQA-score, respectively.
[LINK]http://arxiv.org/abs/2503.08120v5
[DATE]2026-01-13 11:56:56+08:00
[CATEGORIES]cs.LG
Cross-Domain Imitation Learning via Optimal Transport
[AUTHORS]Arnaud Fickinger, Samuel Cohen, Stuart Russell, Brandon Amos
[COMMENTS]ICLR 2022
[LINK]http://arxiv.org/abs/2110.03684v4
[DATE]2026-01-13 11:56:14+08:00
[CATEGORIES]cs.LG
Autonomous Materials Exploration by Integrating Automated Phase Identification and AI-Assisted Human Reasoning
[AUTHORS]Ming-Chiang Chang, Maximilian Amsler, Duncan R. Sutherland, Sebastian Ament, Katie R. Gann, Lan Zhou, Louisa M. Smieska, Arthur R. Woll, John M. Gregoire, Carla P. Gomes, R. Bruce van Dover, Michael O. Thompson
[ABSTRACT]Autonomous experimentation holds the potential to accelerate materials development by combining artificial intelligence (AI) with modular robotic platforms to explore extensive combinatorial chemical and processing spaces. Such self-driving laboratories can not only increase the throughput of repetitive experiments, but also incorporate human domain expertise to drive the search towards user-defined objectives, including improved materials performance metrics. We present an autonomous materials synthesis extension to SARA, the Scientific Autonomous Reasoning Agent, utilizing phase information provided by an automated probabilistic phase labeling algorithm to expedite the search for targeted phase regions. By incorporating human input into an expanded SARA-H (SARA with human-in-the-loop) framework, we enhance the efficiency of the underlying reasoning process. Using synthetic benchmarks, we demonstrate the efficiency of our AI implementation and show that the human input can contribute to significant improvement in sampling efficiency. We conduct experimental active learning campaigns using robotic processing of thin-film samples of several oxide material systems, including Bi$_2$O$_3$, SnO$_x$, and Bi-Ti-O, using lateral-gradient laser spike annealing to synthesize and kinetically trap metastable phases. We showcase the utility of human-in-the-loop autonomous experimentation for the Bi-Ti-O system, where we identify extensive processing domains that stabilize $δ$-Bi$_2$O$_3$ and Bi$_2$Ti$_2$O$_7$, explore dwell-dependent ternary oxide phase behavior, and provide evidence confirming predictions that cationic substitutional doping of TiO$_2$ with Bi inhibits the unfavorable transformation of the metastable anatase to the ground-state rutile phase. The autonomous methods we have developed enable the discovery of new materials and new understanding of materials synthesis and properties.
[COMMENTS]Main manuscript: 21 pages(including references), 6 figures. Supplementary Information: 12 pages, 9 figures, 1 table
[LINK]http://arxiv.org/abs/2601.08185v1
[DATE]2026-01-13 11:30:14+08:00
[CATEGORIES]cs.LG
TabPFN Through The Looking Glass: An interpretability study of TabPFN and its internal representations
[AUTHORS]Aviral Gupta, Armaan Sethi, Dhruv Kumar
[ABSTRACT]Tabular foundational models are pre-trained models designed for a wide range of tabular data tasks. They have shown strong performance across domains, yet their internal representations and learned concepts remain poorly understood. This lack of interpretability makes it important to study how these models process and transform input features. In this work, we analyze the information encoded inside the model’s hidden representations and examine how these representations evolve across layers. We run a set of probing experiments that test for the presence of linear regression coefficients, intermediate values from complex expressions, and the final answer in early layers. These experiments allow us to reason about the computations the model performs internally. Our results provide evidence that meaningful and structured information is stored inside the representations of tabular foundational models. We observe clear signals that correspond to both intermediate and final quantities involved in the model’s prediction process. This gives insight into how the model refines its inputs and how the final output emerges. Our findings contribute to a deeper understanding of the internal mechanics of tabular foundational models. They show that these models encode concrete and interpretable information, which moves us closer to making their decision processes more transparent and trustworthy.
[LINK]http://arxiv.org/abs/2601.08181v1
[DATE]2026-01-13 11:14:25+08:00
[CATEGORIES]cs.LG
Instruction-Driven 3D Facial Expression Generation and Transition
[AUTHORS]Anh H. Vo, Tae-Seok Kim, Hulin Jin, Soo-Mi Choi, Yong-Guk Kim
[ABSTRACT]A 3D avatar typically has one of six cardinal facial expressions. To simulate realistic emotional variation, we should be able to render a facial transition between two arbitrary expressions. This study presents a new framework for instruction-driven facial expression generation that produces a 3D face and, starting from an image of the face, transforms the facial expression from one designated facial expression to another. The Instruction-driven Facial Expression Decomposer (IFED) module is introduced to facilitate multimodal data learning and capture the correlation between textual descriptions and facial expression features. Subsequently, we propose the Instruction to Facial Expression Transition (I2FET) method, which leverages IFED and a vertex reconstruction loss function to refine the semantic comprehension of latent vectors, thus generating a facial expression sequence according to the given instruction. Lastly, we present the Facial Expression Transition model to generate smooth transitions between facial expressions. Extensive evaluation suggests that the proposed model outperforms state-of-the-art methods on the CK+ and CelebV-HQ datasets. The results show that our framework can generate facial expression trajectories according to text instruction. Considering that text prompts allow us to make diverse descriptions of human emotional states, the repertoire of facial expressions and the transitions between them can be expanded greatly. We expect our framework to find various practical applications More information about our project can be found at https://vohoanganh.github.io/tg3dfet/
[LINK]http://arxiv.org/abs/2601.08179v1
[DATE]2026-01-13 11:12:48+08:00
[CATEGORIES]cs.LG
VBO-MI: A Fully Gradient-Based Bayesian Optimization Framework Using Variational Mutual Information Estimation
[AUTHORS]Farhad Mirkarimi
[ABSTRACT]Many real-world tasks require optimizing expensive black-box functions accessible only through noisy evaluations, a setting commonly addressed with Bayesian optimization (BO). While Bayesian neural networks (BNNs) have recently emerged as scalable alternatives to Gaussian Processes (GPs), traditional BNN-BO frameworks remain burdened by expensive posterior sampling and acquisition function optimization. In this work, we propose {VBO-MI} (Variational Bayesian Optimization with Mutual Information), a fully gradient-based BO framework that leverages recent advances in variational mutual information estimation. To enable end-to-end gradient flow, we employ an actor-critic architecture consisting of an {action-net} to navigate the input space and a {variational critic} to estimate information gain. This formulation effectively eliminates the traditional inner-loop acquisition optimization bottleneck, achieving up to a {$10^2 \times$ reduction in FLOPs} compared to BNN-BO baselines. We evaluate our method on a diverse suite of benchmarks, including high-dimensional synthetic functions and complex real-world tasks such as PDE optimization, the Lunar Lander control problem, and categorical Pest Control. Our experiments demonstrate that VBO-MI consistently provides the same or superior optimization performance and computational scalability over the baselines.
[COMMENTS]31 pages, 8 figures, Code will be released upon acceptance
[LINK]http://arxiv.org/abs/2601.08172v1
[DATE]2026-01-13 11:07:52+08:00
[CATEGORIES]cs.LG
Regression-adjusted Monte Carlo Estimators for Shapley Values and Probabilistic Values
[AUTHORS]R. Teal Witter, Yurong Liu, Christopher Musco
[ABSTRACT]With origins in game theory, probabilistic values like Shapley values, Banzhaf values, and semi-values have emerged as a central tool in explainable AI. They are used for feature attribution, data attribution, data valuation, and more. Since all of these values require exponential time to compute exactly, research has focused on efficient approximation methods using two techniques: Monte Carlo sampling and linear regression formulations. In this work, we present a new way of combining both of these techniques. Our approach is more flexible than prior algorithms, allowing for linear regression to be replaced with any function family whose probabilistic values can be computed efficiently. This allows us to harness the accuracy of tree-based models like XGBoost, while still producing unbiased estimates. From experiments across eight datasets, we find that our methods give state-of-the-art performance for estimating probabilistic values. For Shapley values, the error of our methods can be $6.5\times$ lower than Permutation SHAP (the most popular Monte Carlo method), $3.8\times$ lower than Kernel SHAP (the most popular linear regression method), and $2.6\times$ lower than Leverage SHAP (the prior state-of-the-art Shapley value estimator). For more general probabilistic values, we can obtain error $215\times$ lower than the best estimator from prior work.
[LINK]http://arxiv.org/abs/2506.11849v2
[DATE]2026-01-13 10:37:05+08:00
[CATEGORIES]cs.LG
Explainable Molecular Property Prediction: Aligning Chemical Concepts with Predictions via Language Models
[AUTHORS]Zhenzhong Wang, Zehui Lin, Wanyu Lin, Ming Yang, Minggang Zeng, Kay Chen Tan
[ABSTRACT]Providing explainable molecular property predictions is critical for many scientific domains, such as drug discovery and material science. Though transformer-based language models have shown great potential in accurate molecular property prediction, they neither provide chemically meaningful explanations nor faithfully reveal the molecular structure-property relationships. In this work, we develop a framework for explainable molecular property prediction based on language models, dubbed as Lamole, which can provide chemical concepts-aligned explanations. We take a string-based molecular representation – Group SELFIES – as input tokens to pretrain and fine-tune our Lamole, as it provides chemically meaningful semantics. By disentangling the information flows of Lamole, we propose combining self-attention weights and gradients for better quantification of each chemically meaningful substructure’s impact on the model’s output. To make the explanations more faithfully respect the structure-property relationship, we then carefully craft a marginal loss to explicitly optimize the explanations to be able to align with the chemists’ annotations. We bridge the manifold hypothesis with the elaborated marginal loss to prove that the loss can align the explanations with the tangent space of the data manifold, leading to concept-aligned explanations. Experimental results over six mutagenicity datasets and one hepatotoxicity dataset demonstrate Lamole can achieve comparable classification accuracy and boost the explanation accuracy by up to 14.3%, being the state-of-the-art in explainable molecular property prediction.
[LINK]http://arxiv.org/abs/2405.16041v4
[DATE]2026-01-13 10:36:05+08:00
[CATEGORIES]cs.LG
Enriching Semantic Profiles into Knowledge Graph for Recommender Systems Using Large Language Models
[AUTHORS]Seokho Ahn, Sungbok Shin, Young-Duk Seo
[ABSTRACT]Rich and informative profiling to capture user preferences is essential for improving recommendation quality. However, there is still no consensus on how best to construct and utilize such profiles. To address this, we revisit recent profiling-based approaches in recommender systems along four dimensions: 1) knowledge base, 2) preference indicator, 3) impact range, and 4) subject. We argue that large language models (LLMs) are effective at extracting compressed rationales from diverse knowledge sources, while knowledge graphs (KGs) are better suited for propagating these profiles to extend their reach. Building on this insight, we propose a new recommendation model, called SPiKE. SPiKE consists of three core components: i) Entity profile generation, which uses LLMs to generate semantic profiles for all KG entities; ii) Profile-aware KG aggregation, which integrates these profiles into the KG; and iii) Pairwise profile preference matching, which aligns LLM- and KG-based representations during training. In experiments, we demonstrate that SPiKE consistently outperforms state-of-the-art KG- and LLM-based recommenders in real-world settings.
[COMMENTS]Accepted at KDD 2026
[LINK]http://arxiv.org/abs/2601.08148v1
[DATE]2026-01-13 10:23:32+08:00
[CATEGORIES]cs.LG
Dynamic Graph Structure Learning via Resistance Curvature Flow
[AUTHORS]Chaoqun Fei, Huanjiang Liu, Tinglve Zhou, Yangyang Li, Tianyong Hao
[ABSTRACT]Geometric Representation Learning (GRL) aims to approximate the non-Euclidean topology of high-dimensional data through discrete graph structures, grounded in the manifold hypothesis. However, traditional static graph construction methods based on Euclidean distance often fail to capture the intrinsic curvature characteristics of the data manifold. Although Ollivier-Ricci Curvature Flow (OCF) has proven to be a powerful tool for dynamic topological optimization, its core reliance on Optimal Transport (Wasserstein distance) leads to prohibitive computational complexity, severely limiting its application in large-scale datasets and deep learning frameworks. To break this bottleneck, this paper proposes a novel geometric evolution framework: Resistance Curvature Flow (RCF). Leveraging the concept of effective resistance from circuit physics, RCF transforms expensive curvature optimization into efficient matrix operations. This approach achieves over 100x computational acceleration while maintaining geometric optimization capabilities comparable to OCF. We provide an in-depth exploration of the theoretical foundations and dynamical principles of RCF, elucidating how it guides the redistribution of edge weights via curvature gradients to eliminate topological noise and strengthen local cluster structures. Furthermore, we provide a mechanistic explanation of RCF’s role in manifold enhancement and noise suppression, as well as its compatibility with deep learning models. We design a graph optimization algorithm, DGSL-RCF, based on this framework. Experimental results across deep metric learning, manifold learning, and graph structure learning demonstrate that DGSL-RCF significantly improves representation quality and downstream task performance.
[LINK]http://arxiv.org/abs/2601.08149v1
[DATE]2026-01-13 10:23:32+08:00
[CATEGORIES]cs.LG
Joint Progression Modeling (JPM): A Probabilistic Framework for Mixed-Pathology Progression
[AUTHORS]Hongtao Hao, Joseph L. Austerweil
[ABSTRACT]Event-based models (EBMs) infer disease progression from cross-sectional data, and standard EBMs assume a single underlying disease per individual. In contrast, mixed pathologies are common in neurodegeneration. We introduce the Joint Progression Model (JPM), a probabilistic framework that treats single-disease trajectories as partial rankings and builds a prior over joint progressions. We study several JPM variants (Pairwise, Bradley-Terry, Plackett-Luce, and Mallows) and analyze three properties: (i) calibration – whether lower model energy predicts smaller distance to the ground truth ordering; (ii) separation – the degree to which sampled rankings are distinguishable from random permutations; and (iii) sharpness – the stability of sampled aggregate rankings. All variants are calibrated, and all achieve near-perfect separation; sharpness varies by variant and is well-predicted by simple features of the input partial rankings (number and length of rankings, conflict, and overlap). In synthetic experiments, JPM improves ordering accuracy by roughly 21 percent over a strong EBM baseline (SA-EBM) that treats the joint disease as a single condition. Finally, using NACC, we find that the Mallows variant of JPM and the baseline model (SA-EBM) have results that are more consistent with prior literature on the possible disease progression of the mixed pathology of AD and VaD.
[COMMENTS]49 pages; Machine Learning for Health (ML4H) Symposium 2025
[LINK]http://arxiv.org/abs/2512.03475v2
[DATE]2026-01-13 10:14:55+08:00
[CATEGORIES]cs.LG
LoFT-LLM: Low-Frequency Time-Series Forecasting with Large Language Models
[AUTHORS]Jiacheng You, Jingcheng Yang, Yuhang Xie, Zhongxuan Wu, Xiucheng Li, Feng Li, Pengjie Wang, Jian Xu, Bo Zheng, Xinyang Chen
[ABSTRACT]Time-series forecasting in real-world applications such as finance and energy often faces challenges due to limited training data and complex, noisy temporal dynamics. Existing deep forecasting models typically supervise predictions using full-length temporal windows, which include substantial high-frequency noise and obscure long-term trends. Moreover, auxiliary variables containing rich domain-specific information are often underutilized, especially in few-shot settings. To address these challenges, we propose LoFT-LLM, a frequency-aware forecasting pipeline that integrates low-frequency learning with semantic calibration via a large language model (LLM). Firstly, a Patch Low-Frequency forecasting Module (PLFM) extracts stable low-frequency trends from localized spectral patches. Secondly, a residual learner then models high-frequency variations. Finally, a fine-tuned LLM refines the predictions by incorporating auxiliary context and domain knowledge through structured natural language prompts. Extensive experiments on financial and energy datasets demonstrate that LoFT-LLM significantly outperforms strong baselines under both full-data and few-shot regimes, delivering superior accuracy, robustness, and interpretability.
[COMMENTS]This submission is withdrawn due to internal review and compliance considerations
[LINK]http://arxiv.org/abs/2512.20002v3
[DATE]2026-01-13 10:09:17+08:00
[CATEGORIES]cs.LG
Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies
[AUTHORS]Zeyang Li, Sunbochen Tang, Navid Azizan
[ABSTRACT]Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which utilizes a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. Yet, it remains unclear how these objectives relate formally or if they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and enables the principled combination of Q-value and Q-gradient information to derive an optimal, minimum-variance estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL, and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.
[LINK]http://arxiv.org/abs/2601.08136v1
[DATE]2026-01-13 09:58:24+08:00
[CATEGORIES]cs.LG
Buffered AUC maximization for scoring systems via mixed-integer optimization
[AUTHORS]Moe Shiina, Shunnosuke Ikeda, Yuichi Takano
[ABSTRACT]A scoring system is a linear classifier composed of a small number of explanatory variables, each assigned a small integer coefficient. This system is highly interpretable and allows predictions to be made with simple manual calculations without the need for a calculator. Several previous studies have used mixed-integer optimization (MIO) techniques to develop scoring systems for binary classification; however, they have not focused on directly maximizing AUC (i.e., area under the receiver operating characteristic curve), even though AUC is recognized as an essential evaluation metric for scoring systems. Our goal herein is to establish an effective MIO framework for constructing scoring systems that directly maximize the buffered AUC (bAUC) as the tightest concave lower bound on AUC. Our optimization model is formulated as a mixed-integer linear optimization (MILO) problem that maximizes bAUC subject to a group sparsity constraint for limiting the number of questions in the scoring system. Computational experiments using publicly available real-world datasets demonstrate that our MILO method can build scoring systems with superior AUC values compared to the baseline methods based on regularization and stepwise regression. This research contributes to the advancement of MIO techniques for developing highly interpretable classification models.
[LINK]http://arxiv.org/abs/2601.05544v2
[DATE]2026-01-13 09:58:15+08:00
[CATEGORIES]cs.LG
Hierarchical Online-Scheduling for Energy-Efficient Split Inference with Progressive Transmission
[AUTHORS]Zengzipeng Tang, Yuxuan Sun, Wei Chen, Jianwen Ding, Bo Ai, Yulin Shao
[ABSTRACT]Device-edge collaborative inference with Deep Neural Networks (DNNs) faces fundamental trade-offs among accuracy, latency and energy consumption. Current scheduling exhibits two drawbacks: a granularity mismatch between coarse, task-level decisions and fine-grained, packet-level channel dynamics, and insufficient awareness of per-task complexity. Consequently, scheduling solely at the task level leads to inefficient resource utilization. This paper proposes a novel ENergy-ACcuracy Hierarchical optimization framework for split Inference, named ENACHI, that jointly optimizes task- and packet-level scheduling to maximize accuracy under energy and delay constraints. A two-tier Lyapunov-based framework is developed for ENACHI, with a progressive transmission technique further integrated to enhance adaptivity. At the task level, an outer drift-plus-penalty loop makes online decisions for DNN partitioning and bandwidth allocation, and establishes a reference power budget to manage the long-term energy-accuracy trade-off. At the packet level, an uncertainty-aware progressive transmission mechanism is employed to adaptively manage per-sample task complexity. This is integrated with a nested inner control loop implementing a novel reference-tracking policy, which dynamically adjusts per-slot transmit power to adapt to fluctuating channel conditions. Experiments on ImageNet dataset demonstrate that ENACHI outperforms state-of-the-art benchmarks under varying deadlines and bandwidths, achieving a 43.12\% gain in inference accuracy with a 62.13\% reduction in energy consumption under stringent deadlines, and exhibits high scalability by maintaining stable energy consumption in congested multi-user scenarios.
[COMMENTS]This work has been submitted to the IEEE for possible publication
[LINK]http://arxiv.org/abs/2601.08135v1
[DATE]2026-01-13 09:56:46+08:00
[CATEGORIES]cs.LG
WaveNet’s Precision in EEG Classification
[AUTHORS]Casper van Laar, Khubaib Ahmed
[ABSTRACT]This study introduces a WaveNet-based deep learning model designed to automate the classification of intracranial electroencephalography (iEEG) signals into physiological activity, pathological (epileptic) activity, power-line noise, and other non-cerebral artifacts. Traditional methods for iEEG signal classification, which rely on expert visual review, are becoming increasingly impractical due to the growing complexity and volume of iEEG recordings. Leveraging a publicly available annotated dataset from Mayo Clinic and St. Anne’s University Hospital, the WaveNet model was trained, validated, and tested on 209,231 samples using a 70/20/10 split. The model achieved a classification accuracy exceeding previous non-specialized CNN- and LSTM-based approaches and was benchmarked against a Temporal Convolutional Network (TCN) baseline. Notably, the model achieves high discrimination of noise and artifact classes, with precisions of 0.98 and approximately 1, respectively. Classification between physiological and pathological signals exhibits a modest but clinically interpretable overlap, with F1-scores of 0.96 and 0.90 and 175 and 272 cross-class false positives, respectively, reflecting inherent clinical overlap. WaveNet’s architecture, originally developed for raw audio synthesis, is well-suited for iEEG data due to its use of dilated causal convolutions and residual connections, enabling the capture of both fine-grained and long-range temporal dependencies. The study also details the preprocessing pipeline, including dynamic dataset partitioning, the use of focal loss to address class imbalance, and normalization steps that support high model performance. While the results demonstrate strong in-distribution performance, generalizability across datasets and clinical settings has yet to be established.
[COMMENTS]6 pages, 5 figures and 3 tables. Includes main text and bibliography
[LINK]http://arxiv.org/abs/2510.15947v2
[DATE]2026-01-13 09:51:48+08:00
[CATEGORIES]cs.LG
Generalization Analysis and Method for Domain Generalization for a Family of Recurrent Neural Networks
[AUTHORS]Atefeh Termehchi, Ekram Hossain, Isaac Woungang
[ABSTRACT]Deep learning (DL) has driven broad advances across scientific and engineering domains. Despite its success, DL models often exhibit limited interpretability and generalization, which can undermine trust, especially in safety-critical deployments. As a result, there is growing interest in (i) analyzing interpretability and generalization and (ii) developing models that perform robustly under data distributions different from those seen during training (i.e. domain generalization). However, the theoretical analysis of DL remains incomplete. For example, many generalization analyses assume independent samples, which is violated in sequential data with temporal correlations. Motivated by these limitations, this paper proposes a method to analyze interpretability and out-of-domain (OOD) generalization for a family of recurrent neural networks (RNNs). Specifically, the evolution of a trained RNN’s states is modeled as an unknown, discrete-time, nonlinear closed-loop feedback system. Using Koopman operator theory, these nonlinear dynamics are approximated with a linear operator, enabling interpretability. Spectral analysis is then used to quantify the worst-case impact of domain shifts on the generalization error. Building on this analysis, a domain generalization method is proposed that reduces the OOD generalization error and improves the robustness to distribution shifts. Finally, the proposed analysis and domain generalization approach are validated on practical temporal pattern-learning tasks.
[LINK]http://arxiv.org/abs/2601.08122v1
[DATE]2026-01-13 09:27:10+08:00
[CATEGORIES]cs.LG
Intra-tree Column Subsampling Hinders XGBoost Learning of Ratio-like Interactions
[AUTHORS]Mykola Pinchuk
[ABSTRACT]Many applied problems contain signal that becomes clear only after combining multiple raw measurements. Ratios and rates are common examples. In gradient boosted trees, this combination is not an explicit operation: the model must synthesize it through coordinated splits on the component features. We study whether intra-tree column subsampling in XGBoost makes that synthesis harder. We use two synthetic data generating processes with cancellation-style structure. In both, two primitive features share a strong nuisance factor, while the target depends on a smaller differential factor. A log ratio cancels the nuisance and isolates the signal. We vary colsample_bylevel and colsample_bynode over s in {0.4, 0.6, 0.8, 0.9}, emphasizing mild subsampling (s >= 0.8). A control feature set includes the engineered ratio, removing the need for synthesis. Across both processes, intra-tree column subsampling reduces test PR-AUC in the primitives-only setting. In the main process the relative decrease reaches 54 percent when both parameters are set to 0.4. The effect largely disappears when the engineered ratio is present. A path-based co-usage metric drops in the same cells where performance deteriorates. Practically, if ratio-like structure is plausible, either avoid intra-tree subsampling or include the intended ratio features.
[COMMENTS]14 pages, 11 figures and tables
[LINK]http://arxiv.org/abs/2601.08121v1
[DATE]2026-01-13 09:27:01+08:00
[CATEGORIES]cs.LG
Structure Detection for Contextual Reinforcement Learning
[AUTHORS]Tianyue Zhou, Jung-Hoon Cho, Cathy Wu
[ABSTRACT]Contextual Reinforcement Learning (CRL) tackles the problem of solving a set of related Contextual Markov Decision Processes (CMDPs) that vary across different context variables. Traditional approaches–independent training and multi-task learning–struggle with either excessive computational costs or negative transfer. A recently proposed multi-policy approach, Model-Based Transfer Learning (MBTL), has demonstrated effectiveness by strategically selecting a few tasks to train and zero-shot transfer. However, CMDPs encompass a wide range of problems, exhibiting structural properties that vary from problem to problem. As such, different task selection strategies are suitable for different CMDPs. In this work, we introduce Structure Detection MBTL (SD-MBTL), a generic framework that dynamically identifies the underlying generalization structure of CMDP and selects an appropriate MBTL algorithm. For instance, we observe Mountain structure in which generalization performance degrades from the training performance of the target task as the context difference increases. We thus propose M/GP-MBTL, which detects the structure and adaptively switches between a Gaussian Process-based approach and a clustering-based approach. Extensive experiments on synthetic data and CRL benchmarks–covering continuous control, traffic control, and agricultural management–show that M/GP-MBTL surpasses the strongest prior method by 12.49% on the aggregated metric. These results highlight the promise of online structure detection for guiding source task selection in complex CRL environments.
[LINK]http://arxiv.org/abs/2601.08120v1
[DATE]2026-01-13 09:22:39+08:00
[CATEGORIES]cs.LG
Learning a Stochastic Differential Equation Model of Tropical Cyclone Intensification from Reanalysis and Observational Data
[AUTHORS]Kenneth Gee, Sai Ravela
[ABSTRACT]Tropical cyclones are dangerous natural hazards, but their hazard is challenging to quantify directly from historical datasets due to limited dataset size and quality. Models of cyclone intensification fill this data gap by simulating huge ensembles of synthetic hurricanes based on estimates of the storm’s large scale environment. Both physics-based and statistical/ML intensification models have been developed to tackle this problem, but an open question is: can a physically reasonable and simple physics-style differential equation model of intensification be learned from data? In this paper, we answer this question in the affirmative by presenting a 10-term cubic stochastic differential equation model of Tropical Cyclone intensification. The model depends on a well-vetted suite of engineered environmental features known to drive intensification and is trained using a high quality dataset of hurricane intensity (IBTrACS) with estimates of the cyclone’s large scale environment from a data-assimilated simulation (ERA5 reanalysis), restricted to the Northern Hemisphere. The model generates synthetic intensity series which capture many aspects of historical intensification statistics and hazard estimates in the Northern Hemisphere. Our results show promise that interpretable, physics style models of complex earth system dynamics can be learned using automated system identification techniques.
[LINK]http://arxiv.org/abs/2601.08116v1
[DATE]2026-01-13 09:11:17+08:00
[CATEGORIES]cs.LG
CSQL: Mapping Documents into Causal Databases
[AUTHORS]Sridhar Mahadevan
[ABSTRACT]We describe a novel system, CSQL, which automatically converts a collection of unstructured text documents into an SQL-queryable causal database (CDB). A CDB differs from a traditional DB: it is designed to answer “why’’ questions via causal interventions and structured causal queries. CSQL builds on our earlier system, DEMOCRITUS, which converts documents into thousands of local causal models derived from causal discourse. Unlike RAG-based systems or knowledge-graph based approaches, CSQL supports causal analysis over document collections rather than purely associative retrieval. For example, given an article on the origins of human bipedal walking, CSQL enables queries such as: “What are the strongest causal influences on bipedalism?’’ or “Which variables act as causal hubs with the largest downstream influence?’’ Beyond single-document case studies, we show that CSQL can also ingest RAG/IE-compiled causal corpora at scale by compiling the Testing Causal Claims (TCC) dataset of economics papers into a causal database containing 265,656 claim instances spanning 45,319 papers, 44 years, and 1,575 reported method strings, thereby enabling corpus-level causal queries and longitudinal analyses in CSQL. Viewed abstractly, CSQL functions as a compiler from unstructured documents into a causal database equipped with a principled algebra of queries, and can be applied broadly across many domains ranging from business, humanities, and science.
[COMMENTS]26 pages
[LINK]http://arxiv.org/abs/2601.08109v1
[DATE]2026-01-13 09:00:38+08:00
[CATEGORIES]cs.LG
STO-RL: Offline RL under Sparse Rewards via LLM-Guided Subgoal Temporal Order
[AUTHORS]Chengyang Gu, Yuxin Pan, Hui Xiong, Yize Chen
[ABSTRACT]Offline reinforcement learning (RL) enables policy learning from pre-collected datasets, avoiding costly and risky online interactions, but it often struggles with long-horizon tasks involving sparse rewards. Existing goal-conditioned and hierarchical offline RL methods decompose such tasks and generate intermediate rewards to mitigate limitations of traditional offline RL, but usually overlook temporal dependencies among subgoals and rely on imprecise reward shaping, leading to suboptimal policies. To address these issues, we propose STO-RL (Offline RL using LLM-Guided Subgoal Temporal Order), an offline RL framework that leverages large language models (LLMs) to generate temporally ordered subgoal sequences and corresponding state-to-subgoal-stage mappings. Using this temporal structure, STO-RL applies potential-based reward shaping to transform sparse terminal rewards into dense, temporally consistent signals, promoting subgoal progress while avoiding suboptimal solutions. The resulting augmented dataset with shaped rewards enables efficient offline training of high-performing policies. Evaluations on four discrete and continuous sparse-reward benchmarks demonstrate that STO-RL consistently outperforms state-of-the-art offline goal-conditioned and hierarchical RL baselines, achieving faster convergence, higher success rates, and shorter trajectories. Ablation studies further confirm STO-RL’s robustness to imperfect or noisy LLM-generated subgoal sequences, demonstrating that LLM-guided subgoal temporal structures combined with theoretically grounded reward shaping provide a practical and scalable solution for long-horizon offline RL.
[COMMENTS]Accepted at International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
[LINK]http://arxiv.org/abs/2601.08107v1
[DATE]2026-01-13 08:57:45+08:00
[CATEGORIES]cs.LG
Towards A Unified PAC-Bayesian Framework for Norm-based Generalization Bounds
[AUTHORS]Xinping Yi, Gaojie Jin, Xiaowei Huang, Shi Jin
[ABSTRACT]Understanding the generalization behavior of deep neural networks remains a fundamental challenge in modern statistical learning theory. Among existing approaches, PAC-Bayesian norm-based bounds have demonstrated particular promise due to their data-dependent nature and their ability to capture algorithmic and geometric properties of learned models. However, most existing results rely on isotropic Gaussian posteriors, heavy use of spectral-norm concentration for weight perturbations, and largely architecture-agnostic analyses, which together limit both the tightness and practical relevance of the resulting bounds. To address these limitations, in this work, we propose a unified framework for PAC-Bayesian norm-based generalization by reformulating the derivation of generalization bounds as a stochastic optimization problem over anisotropic Gaussian posteriors. The key to our approach is a sensitivity matrix that quantifies the network outputs with respect to structured weight perturbations, enabling the explicit incorporation of heterogeneous parameter sensitivities and architectural structures. By imposing different structural assumptions on this sensitivity matrix, we derive a family of generalization bounds that recover several existing PAC-Bayesian results as special cases, while yielding bounds that are comparable to or tighter than state-of-the-art approaches. Such a unified framework provides a principled and flexible way for geometry-/structure-aware and interpretable generalization analysis in deep learning.
[LINK]http://arxiv.org/abs/2601.08100v1
[DATE]2026-01-13 08:42:22+08:00
[CATEGORIES]cs.LG
Local-Global Feature Fusion for Subject-Independent EEG Emotion Recognition
[AUTHORS]Zheng Zhou, Isabella McEvoy, Camilo E. Valderrama
[ABSTRACT]Subject-independent EEG emotion recognition is challenged by pronounced inter-subject variability and the difficulty of learning robust representations from short, noisy recordings. To address this, we propose a fusion framework that integrates (i) local, channel-wise descriptors and (ii) global, trial-level descriptors, improving cross-subject generalization on the SEED-VII dataset. Local representations are formed per channel by concatenating differential entropy with graph-theoretic features, while global representations summarize time-domain, spectral, and complexity characteristics at the trial level. These representations are fused in a dual-branch transformer with attention-based fusion and domain-adversarial regularization, with samples filtered by an intensity threshold. Experiments under a leave-one-subject-out protocol demonstrate that the proposed method consistently outperforms single-view and classical baselines, achieving approximately 40% mean accuracy in 7-class subject-independent emotion recognition. The code has been released at https://github.com/Danielz-z/LGF-EEG-Emotion.
[COMMENTS]7 pages, 5 figures, EMBC 2026
[LINK]http://arxiv.org/abs/2601.08094v1
[DATE]2026-01-13 08:27:01+08:00
[CATEGORIES]cs.LG
Q-realign: Piggybacking Realignment on Quantization for Safe and Efficient LLM Deployment
[AUTHORS]Qitao Tan, Xiaoying Song, Ningxi Cheng, Ninghao Liu, Xiaoming Zhai, Lingzi Hong, Yanzhi Wang, Zhen Xiang, Geng Yuan
[ABSTRACT]Public large language models (LLMs) are typically safety-aligned during pretraining, yet task-specific fine-tuning required for deployment often erodes this alignment and introduces safety risks. Existing defenses either embed safety recovery into fine-tuning or rely on fine-tuning-derived priors for post-hoc correction, leaving safety recovery tightly coupled with training and incurring high computational overhead and a complex workflow. To address these challenges, we propose \texttt{Q-realign}, a post-hoc defense method based on post-training quantization, guided by an analysis of representational structure. By reframing quantization as a dual-objective procedure for compression and safety, \texttt{Q-realign} decouples safety alignment from fine-tuning and naturally piggybacks into modern deployment pipelines. Experiments across multiple models and datasets demonstrate that our method substantially reduces unsafe behaviors while preserving task performance, with significant reductions in memory usage and GPU hours. Notably, our approach can recover the safety alignment of a fine-tuned 7B LLM on a single RTX 4090 within 40 minutes. Overall, our work provides a practical, turnkey solution for safety-aware deployment.
[LINK]http://arxiv.org/abs/2601.08089v1
[DATE]2026-01-13 08:07:24+08:00
[CATEGORIES]cs.LG
Asymptotically Stable Quaternion-valued Hopfield-structured Neural Network with Periodic Projection-based Supervised Learning Rules
[AUTHORS]Tianwei Wang, Xinhui Ma, Wei Pang
[ABSTRACT]Motivated by the geometric advantages of quaternions in representing rotations and postures, we propose a quaternion-valued supervised learning Hopfield-structured neural network (QSHNN) with a fully connected structure inspired by the classic Hopfield neural network (HNN). Starting from a continuous-time dynamical model of HNNs, we extend the formulation to the quaternionic domain and establish the existence and uniqueness of fixed points with asymptotic stability. For the learning rules, we introduce a periodic projection strategy that modifies standard gradient descent by periodically projecting each 4*4 block of the weight matrix onto the closest quaternionic structure in the least-squares sense. This approach preserves both convergence and quaternionic consistency throughout training. Benefiting from this rigorous mathematical foundation, the experimental model implementation achieves high accuracy, fast convergence, and strong reliability across randomly generated target sets. Moreover, the evolution trajectories of the QSHNN exhibit well-bounded curvature, i.e., sufficient smoothness, which is crucial for applications such as control systems or path planning modules in robotic arms, where joint postures are parameterized by quaternion neurons. Beyond these application scenarios, the proposed model offers a practical implementation framework and a general mathematical methodology for designing neural networks under hypercomplex or non-commutative algebraic structures.
[COMMENTS]Accepted by NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.16607v2
[DATE]2026-01-13 07:45:20+08:00
[CATEGORIES]cs.LG
Large-scale Regional Traffic Signal Control Based on Single-Agent Reinforcement Learning
[AUTHORS]Qiang Li, Jin Niu, Qin Luo, Lina Yu
[ABSTRACT]In the context of global urbanization and motorization, traffic congestion has become a significant issue, severely affecting the quality of life, environment, and economy. This paper puts forward a single-agent reinforcement learning (RL)-based regional traffic signal control (TSC) model. Different from multi - agent systems, this model can coordinate traffic signals across a large area, with the goals of alleviating regional traffic congestion and minimizing the total travel time. The TSC environment is precisely defined through specific state space, action space, and reward functions. The state space consists of the current congestion state, which is represented by the queue lengths of each link, and the current signal phase scheme of intersections. The action space is designed to select an intersection first and then adjust its phase split. Two reward functions are meticulously crafted. One focuses on alleviating congestion and the other aims to minimize the total travel time while considering the congestion level. The experiments are carried out with the SUMO traffic simulation software. The performance of the TSC model is evaluated by comparing it with a base case where no signal-timing adjustments are made. The results show that the model can effectively control congestion. For example, the queuing length is significantly reduced in the scenarios tested. Moreover, when the reward is set to both alleviate congestion and minimize the total travel time, the average travel time is remarkably decreased, which indicates that the model can effectively improve traffic conditions. This research provides a new approach for large-scale regional traffic signal control and offers valuable insights for future urban traffic management.
[COMMENTS]A critical error in the methodology. The reported congestion control effects were not caused by the proposed signal timing optimization, but by an incorrect traffic volume scaling factor during evaluation. The traffic demand was not properly amplified, resulting in misleading performance gains. Due to the substantial nature of the error, completion of revisions is not feasible in the short term
[LINK]http://arxiv.org/abs/2503.09252v2
[DATE]2026-01-13 07:22:59+08:00
[CATEGORIES]cs.LG
Robust Single-Agent Reinforcement Learning for Regional Traffic Signal Control Under Demand Fluctuations
[AUTHORS]Qiang Li, Jin Niu, Lina Yu
[ABSTRACT]Traffic congestion, primarily driven by intersection queuing, significantly impacts urban living standards, safety, environmental quality, and economic efficiency. While Traffic Signal Control (TSC) systems hold potential for congestion mitigation, traditional optimization models often fail to capture real-world traffic complexity and dynamics. This study introduces a novel single-agent reinforcement learning (RL) framework for regional adaptive TSC, circumventing the coordination complexities inherent in multi-agent systems through a centralized decision-making paradigm. The model employs an adjacency matrix to unify the encoding of road network topology, real-time queue states derived from probe vehicle data, and current signal timing parameters. Leveraging the efficient learning capabilities of the DreamerV3 world model, the agent learns control policies where actions sequentially select intersections and adjust their signal phase splits to regulate traffic inflow/outflow, analogous to a feedback control system. Reward design prioritizes queue dissipation, directly linking congestion metrics (queue length) to control actions. Simulation experiments conducted in SUMO demonstrate the model’s effectiveness: under inference scenarios with multi-level (10%, 20%, 30%) Origin-Destination (OD) demand fluctuations, the framework exhibits robust anti-fluctuation capability and significantly reduces queue lengths. This work establishes a new paradigm for intelligent traffic control compatible with probe vehicle technology. Future research will focus on enhancing practical applicability by incorporating stochastic OD demand fluctuations during training and exploring regional optimization mechanisms for contingency events.
[COMMENTS]A critical error in the methodology. The reported congestion control effects were not caused by the proposed signal timing optimization, but by an incorrect traffic volume scaling factor during evaluation. The traffic demand was not properly amplified, resulting in misleading performance gains. Due to the substantial nature of the error, completion of revisions is not feasible in the short term
[LINK]http://arxiv.org/abs/2511.00549v2
[DATE]2026-01-13 07:22:21+08:00
[CATEGORIES]cs.LG
HiGP: A high-performance Python package for Gaussian Process
[AUTHORS]Hua Huang, Tianshi Xu, Yuanzhe Xi, Edmond Chow
[ABSTRACT]Gaussian Processes (GPs) are flexible, nonparametric Bayesian models widely used for regression and classification because of their ability to capture complex data patterns and quantify predictive uncertainty. However, the O(n^3) computational cost of kernel matrix operations poses a major obstacle to applying GPs at scale. HiGP is a high-performance Python package designed to overcome these scalability limitations through advanced numerical linear algebra and hierarchical kernel representations. It integrates H^2 matrices to achieve near-linear complexity in both storage and computation for spatial datasets, supports on-the-fly kernel evaluation to avoid explicit storage in large-scale problems, and incorporates a robust Adaptive Factorized Nyström (AFN) preconditioner that accelerates convergence of iterative solvers across a broad range of kernel spectra. These computational kernels are implemented in C++ for maximum performance and exposed through Python interfaces, enabling seamless integration with modern machine learning workflows. HiGP also includes analytically derived gradient computations for efficient hyperparameter optimization, avoiding the inefficiencies of automatic differentiation in iterative solvers. By serving as a reusable numerical engine, HiGP complements existing GP frameworks such as GPJax, KeOps, and GaussianProcesses.jl, providing a reliable and scalable computational backbone for large-scale Gaussian Process regression and classification.
[LINK]http://arxiv.org/abs/2503.02259v2
[DATE]2026-01-13 06:55:14+08:00
[CATEGORIES]cs.LG
The Impact of Off-Policy Training Data on Probe Generalisation
[AUTHORS]Nathalie Kirch, Samuel Dower, Adrians Skapars, Ekdeep Singh Lubana, Dmitrii Krasheninnikov
[ABSTRACT]Probing has emerged as a promising method for monitoring large language models (LLMs), enabling cheap inference-time detection of concerning behaviours. However, natural examples of many behaviours are rare, forcing researchers to rely on synthetic or off-policy LLM responses for training probes. We systematically evaluate how off-policy data influences probe generalisation across eight distinct LLM behaviours. Testing linear and attention probes across multiple LLMs, we find that training data generation strategy can significantly affect probe performance, though the magnitude varies greatly by behaviour. The largest generalisation failures arise for behaviours defined by response “intent” (e.g. strategic deception) rather than text-level content (e.g. usage of lists). We then propose a useful test for predicting generalisation failures in cases where on-policy test data is unavailable: successful generalisation to incentivised data (where the model was coerced) strongly correlates with high performance against on-policy examples. Based on these results, we predict that current deception probes may fail to generalise to real monitoring scenarios. Additionally, our finding that off-policy data can yield more reliable probes than on-policy data from a sufficiently different setting underscores the need for new monitoring methods that better handle all types of distribution shift.
[COMMENTS]10 pages, EurIPS 2025 Workshop on Private AI Governance
[LINK]http://arxiv.org/abs/2511.17408v3
[DATE]2026-01-13 06:53:17+08:00
[CATEGORIES]cs.LG
Riemannian Zeroth-Order Gradient Estimation with Structure-Preserving Metrics for Geodesically Incomplete Manifolds
[AUTHORS]Shaocong Ma, Heng Huang
[ABSTRACT]In this paper, we study Riemannian zeroth-order optimization in settings where the underlying Riemannian metric $g$ is geodesically incomplete, and the goal is to approximate stationary points with respect to this incomplete metric. To address this challenge, we construct structure-preserving metrics that are geodesically complete while ensuring that every stationary point under the new metric remains stationary under the original one. Building on this foundation, we revisit the classical symmetric two-point zeroth-order estimator and analyze its mean-squared error from a purely intrinsic perspective, depending only on the manifold’s geometry rather than any ambient embedding. Leveraging this intrinsic analysis, we establish convergence guarantees for stochastic gradient descent with this intrinsic estimator. Under additional suitable conditions, an $ε$-stationary point under the constructed metric $g’$ also corresponds to an $ε$-stationary point under the original metric $g$, thereby matching the best-known complexity in the geodesically complete setting. Empirical studies on synthetic problems confirm our theoretical findings, and experiments on a practical mesh optimization task demonstrate that our framework maintains stable convergence even in the absence of geodesic completeness.
[LINK]http://arxiv.org/abs/2601.08039v1
[DATE]2026-01-13 06:08:03+08:00
[CATEGORIES]cs.LG
InfGraND: An Influence-Guided GNN-to-MLP Knowledge Distillation
[AUTHORS]Amir Eskandari, Aman Anand, Elyas Rashno, Farhana Zulkernine
[ABSTRACT]Graph Neural Networks (GNNs) are the go-to model for graph data analysis. However, GNNs rely on two key operations - aggregation and update, which can pose challenges for low-latency inference tasks or resource-constrained scenarios. Simple Multi-Layer Perceptrons (MLPs) offer a computationally efficient alternative. Yet, training an MLP in a supervised setting often leads to suboptimal performance. Knowledge Distillation (KD) from a GNN teacher to an MLP student has emerged to bridge this gap. However, most KD methods either transfer knowledge uniformly across all nodes or rely on graph-agnostic indicators such as prediction uncertainty. We argue this overlooks a more fundamental, graph-centric inquiry: “How important is a node to the structure of the graph?” We introduce a framework, InfGraND, an Influence-guided Graph KNowledge Distillation from GNN to MLP that addresses this by identifying and prioritizing structurally influential nodes to guide the distillation process, ensuring that the MLP learns from the most critical parts of the graph. Additionally, InfGraND embeds structural awareness in MLPs through one-time multi-hop neighborhood feature pre-computation, which enriches the student MLP’s input and thus avoids inference-time overhead. Our rigorous evaluation in transductive and inductive settings across seven homophilic graph benchmark datasets shows InfGraND consistently outperforms prior GNN to MLP KD methods, demonstrating its practicality for numerous latency-critical applications in real-world settings.
[COMMENTS]Accepted in Transactions on Machine Learning Research (TMLR), 2026
[LINK]http://arxiv.org/abs/2601.08033v1
[DATE]2026-01-13 06:03:26+08:00
[CATEGORIES]cs.LG
Token Reduction Should Go Beyond Efficiency in Generative Models – From Vision, Language to Multimodality
[AUTHORS]Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik
[ABSTRACT]In Transformer architectures, tokens\textemdash discrete units derived from raw data\textemdash are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations while preserving the input’s essential information. Due to the quadratic computational complexity of transformer self-attention mechanisms, token reduction has primarily been used as an efficiency strategy. This is especially true in single vision and language domains, where it helps balance computational costs, memory usage, and inference latency. Despite these advances, this paper argues that token reduction should transcend its traditional efficiency-oriented role in the era of large generative models. Instead, we position it as a fundamental principle in generative modeling, critically influencing both model architecture and broader applications. Specifically, we contend that across vision, language, and multimodal systems, token reduction can: (i) facilitate deeper multimodal integration and alignment, (ii) mitigate “overthinking” and hallucinations, (iii) maintain coherence over long inputs, and (iv) enhance training stability, etc. We reframe token reduction as more than an efficiency measure. By doing so, we outline promising future directions, including algorithm design, reinforcement learning-guided token reduction, token optimization for in-context learning, agentic framework design, and broader ML and scientific domains.
[COMMENTS]Project page: https://github.com/ZLKong/Awesome-Collection-Token-Reduction
[LINK]http://arxiv.org/abs/2505.18227v3
[DATE]2026-01-13 05:52:55+08:00
[CATEGORIES]cs.LG
Beyond the Next Port: A Multi-Task Transformer for Forecasting Future Voyage Segment Durations
[AUTHORS]Nairui Liu, Fang He, Xindi Tang
[ABSTRACT]Accurate forecasts of segment-level sailing durations are fundamental to enhancing maritime schedule reliability and optimizing long-term port operations. However, conventional estimated time of arrival (ETA) models are primarily designed for the immediate next port of call and rely heavily on real-time automatic identification system (AIS) data, which is inherently unavailable for future voyage segments. To address this gap, the study reformulates future-port ETA prediction as a segment-level time-series forecasting problem. We develop a transformer-based architecture that integrates historical sailing durations, destination port congestion proxies, and static vessel descriptors. The proposed framework employs a causally masked attention mechanism to capture long-range temporal dependencies and a multi-task learning head to jointly predict segment sailing durations and port congestion states, leveraging shared latent signals to mitigate high uncertainty. Evaluation on a real-world global dataset from 2021 demonstrates the proposed model consistently outperforms a comprehensive suite of competitive baselines. The result shows a relative reduction of 4.85% in mean absolute error (MAE) and 4.95% in mean absolute percentage error (MAPE) compared with sequence baseline models. The relative reductions with gradient boosting machines are 9.39% in MAE and 52.97% in MAPE. Case studies for the major destination port further illustrate the model’s superior accuracy.
[LINK]http://arxiv.org/abs/2601.08013v1
[DATE]2026-01-13 05:32:05+08:00
[CATEGORIES]cs.LG
TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models
[AUTHORS]Xin Jin, Yichuan Zhong, Yapeng Tian
[ABSTRACT]Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
[LINK]http://arxiv.org/abs/2601.08011v1
[DATE]2026-01-13 05:30:10+08:00
[CATEGORIES]cs.LG
Tensor Decomposition Networks for Fast Machine Learning Interatomic Potential Computations
[AUTHORS]Yuchao Lin, Cong Fu, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji
[ABSTRACT]$\rm{SO}(3)$-equivariant networks are the dominant models for machine learning interatomic potentials (MLIPs). The key operation of such networks is the Clebsch-Gordan (CG) tensor product, which is computationally expensive. To accelerate the computation, we develop tensor decomposition networks (TDNs) as a class of approximately equivariant networks in which CG tensor products are replaced by low-rank tensor decompositions, such as the CANDECOMP/PARAFAC (CP) decomposition. With the CP decomposition, we prove (i) a uniform bound on the induced error of $\rm{SO}(3)$-equivariance, and (ii) the universality of approximating any equivariant bilinear map. To further reduce the number of parameters, we propose path-weight sharing that ties all multiplicity-space weights across the $\mathcal{O}(L^3)$ CG paths into a single shared parameter set without compromising equivariance, where $L$ is the maximum angular degree. The resulting layer acts as a plug-and-play replacement for tensor products in existing networks, and the computational complexity of tensor products is reduced from $\mathcal{O}(L^6)$ to $\mathcal{O}(L^4)$. We evaluate TDNs on PubChemQCR, a newly curated molecular relaxation dataset containing 105 million DFT-calculated snapshots. We also use existing datasets, including OC20, and OC22. Results show that TDNs achieve competitive performance with dramatic speedup in computations. Our code is publicly available as part of the AIRS library (\href{https://github.com/divelab/AIRS/tree/main/OpenMol/TDN}{https://github.com/divelab/AIRS/}).
[LINK]http://arxiv.org/abs/2507.01131v5
[DATE]2026-01-13 04:02:23+08:00
[CATEGORIES]cs.LG
DataScribe: An AI-Native, Policy-Aligned Web Platform for Multi-Objective Materials Design and Discovery
[AUTHORS]Divyanshu Singh, Doguhan Sarıtürk, Cameron Lea, Md Shafiqul Islam, Raymundo Arroyave, Vahid Attari
[ABSTRACT]The acceleration of materials discovery requires digital platforms that go beyond data repositories to embed learning, optimization, and decision-making directly into research workflows. We introduce DataScribe, an AI-native, cloud-based materials discovery platform that unifies heterogeneous experimental and computational data through ontology-backed ingestion and machine-actionable knowledge graphs. The platform integrates FAIR-compliant metadata capture, schema and unit harmonization, uncertainty-aware surrogate modeling, and native multi-objective multi-fidelity Bayesian optimization, enabling closed-loop propose-measure-learn workflows across experimental and computational pipelines. DataScribe functions as an application-layer intelligence stack, coupling data governance, optimization, and explainability rather than treating them as downstream add-ons. We validate the platform through case studies in electrochemical materials and high-entropy alloys, demonstrating end-to-end data fusion, real-time optimization, and reproducible exploration of multi-objective trade spaces. By embedding optimization engines, machine learning, and unified access to public and private scientific data directly within the data infrastructure, and by supporting open, free use for academic and non-profit researchers, DataScribe functions as a general-purpose application-layer backbone for laboratories of any scale, including self-driving laboratories and geographically distributed materials acceleration platforms, with built-in support for performance, sustainability, and supply-chain-aware objectives.
[LINK]http://arxiv.org/abs/2601.07966v1
[DATE]2026-01-13 03:59:39+08:00
[CATEGORIES]cs.LG
Electron neural closure for turbulent magnetosheath simulations: energy channels
[AUTHORS]George Miloshevich, Luka Vranckx, Felipe Nathan de Oliveira Lopes, Pietro Dazzi, Giuseppe Arrò, Giovanni Lapenta
[ABSTRACT]In this work, we introduce a non-local five-moment electron pressure tensor closure parametrized by a Fully Convolutional Neural Network (FCNN). Electron pressure plays an important role in generalized Ohm’s law, competing with electron inertia. This model is used in the development of a surrogate model for a fully kinetic energy-conserving semi-implicit Particle-in-Cell simulation of decaying magnetosheath turbulence. We achieve this by training FCNN on a representative set of simulations with a smaller number of particles per cell and showing that our results generalise to a simulation with a large number of particles per cell. We evaluate the statistical properties of the learned equation of state, with a focus on pressure-strain interaction, which is crucial for understanding energy channels in turbulent plasmas. The resulting equation of state learned via FCNN significantly outperforms local closures, such as those learned by Multi-Layer Perceptron (MLP) or double adiabatic expressions. We report that the overall spatial distribution of pressure-strain and its conditional averages are reconstructed well. However, some small-scale features are missed, especially for the off-diagonal components of the pressure tensor. Nevertheless, the results are substantially improved with more training data, indicating favorable scaling and potential for improvement, which will be addressed in future work.
[COMMENTS]19 pages, 10 figures, 4 tables
[LINK]http://arxiv.org/abs/2510.00282v2
[DATE]2026-01-13 03:44:21+08:00
[CATEGORIES]cs.LG
Hybrid SARIMA LSTM Model for Local Weather Forecasting: A Residual Learning Approach for Data Driven Meteorological Prediction
[AUTHORS]Shreyas Rajeev, Karthik Mudenahalli Ashoka, Amit Mallappa Tiparaddi
[ABSTRACT]Accurately forecasting long-term atmospheric variables remains a defining challenge in meteorological science due to the chaotic nature of atmospheric systems. Temperature data represents a complex superposition of deterministic cyclical climate forces and stochastic, short-term fluctuations. While planetary mechanics drive predictable seasonal periodicities, rapid meteorological changes such as thermal variations, pressure anomalies, and humidity shifts introduce nonlinear volatilities that defy simple extrapolation. Historically, the Seasonal Autoregressive Integrated Moving Average (SARIMA) model has been the standard for modeling historical weather data, prized for capturing linear seasonal trends. However, SARIMA operates under strict assumptions of stationarity, failing to capture abrupt, nonlinear transitions. This leads to systematic residual errors, manifesting as the under-prediction of sudden spikes or the over-smoothing of declines. Conversely, Deep Learning paradigms, specifically Long Short-Term Memory (LSTM) networks, demonstrate exceptional efficacy in handling intricate time-series data. By utilizing memory gates, LSTMs learn complex nonlinear dependencies. Yet, LSTMs face instability in open-loop forecasting; without ground truth feedback, minor deviations compound recursively, causing divergence. To resolve these limitations, we propose a Hybrid SARIMA-LSTM architecture. This framework employs a residual-learning strategy to decompose temperature into a predictable climate component and a nonlinear weather component. The SARIMA unit models the robust, long-term seasonal trend, while the LSTM is trained exclusively on the residuals the nonlinear errors SARIMA fails to capture. By fusing statistical stability with neural plasticity, this hybrid approach minimizes error propagation and enhances long-horizon accuracy.
[LINK]http://arxiv.org/abs/2601.07951v1
[DATE]2026-01-13 03:34:51+08:00
[CATEGORIES]cs.LG
Reinforcement Learning Methods for Neighborhood Selection in Local Search
[AUTHORS]Yannick Molinghen, Augustin Delecluse, Renaud De Landtsheer, Stefano Michelini
[ABSTRACT]Reinforcement learning has recently gained traction as a means to improve combinatorial optimization methods, yet its effectiveness within local search metaheuristics specifically remains comparatively underexamined. In this study, we evaluate a range of reinforcement learning-based neighborhood selection strategies – multi-armed bandits (upper confidence bound, $ε$-greedy) and deep reinforcement learning methods (proximal policy optimization, double deep $Q$-network) – and compare them against multiple baselines across three different problems: the traveling salesman problem, the pickup and delivery problem with time windows, and the car sequencing problem. We show how search-specific characteristics, particularly large variations in cost due to constraint violation penalties, necessitate carefully designed reward functions to provide stable and informative learning signals. Our extensive experiments reveal that algorithm performance varies substantially across problems, although that $ε$-greedy consistently ranks among the best performers. In contrast, the computational overhead of deep reinforcement learning approaches only makes them competitive with a substantially longer runtime. These findings highlight both the promise and the practical limitations of deep reinforcement learning in local search.
[COMMENTS]Accepted at ICORES 2026
[LINK]http://arxiv.org/abs/2601.07948v1
[DATE]2026-01-13 03:25:29+08:00
[CATEGORIES]cs.LG
Coupled Diffusion-Encoder Models for Reconstruction of Flow Fields
[AUTHORS]AmirPouya Hemmasian, Amir Barati Farimani
[ABSTRACT]Data-driven flow-field reconstruction typically relies on autoencoder architectures that compress high-dimensional states into low-dimensional latent representations. However, classical approaches such as variational autoencoders (VAEs) often struggle to preserve the higher-order statistical structure of fluid flows when subjected to strong compression. We propose DiffCoder, a coupled framework that integrates a probabilistic diffusion model with a conventional convolutional ResNet encoder and trains both components end-to-end. The encoder compresses the flow field into a latent representation, while the diffusion model learns a generative prior over reconstructions conditioned on the compressed state. This design allows DiffCoder to recover distributional and spectral properties that are not strictly required for minimizing pointwise reconstruction loss but are critical for faithfully representing statistical properties of the flow field. We evaluate DiffCoder and VAE baselines across multiple model sizes and compression ratios on a challenging dataset of Kolmogorov flow fields. Under aggressive compression, DiffCoder significantly improves the spectral accuracy while VAEs exhibit substantial degradation. Although both methods show comparable relative L2 reconstruction error, DiffCoder better preserves the underlying distributional structure of the flow. At moderate compression levels, sufficiently large VAEs remain competitive, suggesting that diffusion-based priors provide the greatest benefit when information bottlenecks are severe. These results demonstrate that the generative decoding by diffusion offers a promising path toward compact, statistically consistent representations of complex flow fields.
[LINK]http://arxiv.org/abs/2601.07946v1
[DATE]2026-01-13 03:24:52+08:00
[CATEGORIES]cs.LG
A Statistical Assessment of Amortized Inference Under Signal-to-Noise Variation and Distribution Shift
[AUTHORS]Roy Shivam Ram Shreshtth, Arnab Hazra, Gourab Mukherjee
[ABSTRACT]Since the turn of the century, approximate Bayesian inference has steadily evolved as new computational techniques have been incorporated to handle increasingly complex and large-scale predictive problems. The recent success of deep neural networks and foundation models has now given rise to a new paradigm in statistical modeling, in which Bayesian inference can be amortized through large-scale learned predictors. In amortized inference, substantial computation is invested upfront to train a neural network that can subsequently produce approximate posterior or predictions at negligible marginal cost across a wide range of tasks. At deployment, amortized inference offers substantial computational savings compared with traditional Bayesian procedures, which generally require repeated likelihood evaluations or Monte Carlo simulations for predictions for each new dataset.
Despite the growing popularity of amortized inference, its statistical interpretation and its role within Bayesian inference remain poorly understood. This paper presents statistical perspectives on the working principles of several major neural architectures, including feedforward networks, Deep Sets, and Transformers, and examines how these architectures naturally support amortized Bayesian inference. We discuss how these models perform structured approximation and probabilistic reasoning in ways that yield controlled generalization error across a wide range of deployment scenarios, and how these properties can be harnessed for Bayesian computation. Through simulation studies, we evaluate the accuracy, robustness, and uncertainty quantification of amortized inference under varying signal-to-noise ratios and distributional shifts, highlighting both its strengths and its limitations.
[COMMENTS]26 pages, 5 figures, 3 tables
[LINK]http://arxiv.org/abs/2601.07944v1
[DATE]2026-01-13 03:21:51+08:00
[CATEGORIES]cs.LG
A New Formulation for Zeroth-Order Optimization of Adversarial EXEmples in Malware Detection
[AUTHORS]Marco Rando, Luca Demetrio, Lorenzo Rosasco, Fabio Roli
[ABSTRACT]Machine learning malware detectors are vulnerable to adversarial EXEmples, i.e., carefully-crafted Windows programs tailored to evade detection. Unlike other adversarial problems, attacks in this context must be functionality-preserving, a constraint that is challenging to address. As a consequence, heuristic algorithms are typically used, which inject new content, either randomly-picked or harvested from legitimate programs. In this paper, we show how learning malware detectors can be cast within a zeroth-order optimization framework, which allows incorporating functionality-preserving manipulations. This permits the deployment of sound and efficient gradient-free optimization algorithms, which come with theoretical guarantees and allow for minimal hyper-parameters tuning. As a by-product, we propose and study ZEXE, a novel zeroth-order attack against Windows malware detection. Compared to state-of-the-art techniques, ZEXE provides improvement in the evasion rate, reducing to less than one third the size of the injected content.
[COMMENTS]17 pages, 6 tables
[LINK]http://arxiv.org/abs/2405.14519v2
[DATE]2026-01-13 03:20:59+08:00
[CATEGORIES]cs.LG
Enhancing Portfolio Optimization with Deep Learning Insights
[AUTHORS]Brandon Luo, Jim Skufca
[ABSTRACT]Our work focuses on deep learning (DL) portfolio optimization, tackling challenges in long-only, multi-asset strategies across market cycles. We propose training models with limited regime data using pre-training techniques and leveraging transformer architectures for state variable inclusion. Evaluating our approach against traditional methods shows promising results, demonstrating our models’ resilience in volatile markets. These findings emphasize the evolving landscape of DL-driven portfolio optimization, stressing the need for adaptive strategies to navigate dynamic market conditions and improve predictive accuracy.
[LINK]http://arxiv.org/abs/2601.07942v1
[DATE]2026-01-13 03:13:26+08:00
[CATEGORIES]cs.LG
Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds
[AUTHORS]Bo Pan, Zhiping Zhang, Kevin Spiekermann, Tianchi Chen, Xiang Yu, Liying Zhang, Liang Zhao
[ABSTRACT]Functional group replacement is a pivotal approach in cheminformatics to enable the design of novel chemical compounds with tailored properties. Traditional methods for functional group removal and replacement often rely on rule-based heuristics, which can be limited in their ability to generate diverse and novel chemical structures. Recently, transformer-based models have shown promise in improving the accuracy and efficiency of molecular transformations, but existing approaches typically focus on single-step modeling, lacking the guarantee of structural similarity. In this work, we seek to advance the state of the art by developing a novel two-stage transformer model for functional group removal and replacement. Unlike one-shot approaches that generate entire molecules in a single pass, our method generates the functional group to be removed and appended sequentially, ensuring strict substructure-level modifications. Using a matched molecular pairs (MMPs) dataset derived from ChEMBL, we trained an encoder-decoder transformer model with SMIRKS-based representations to capture transformation rules effectively. Extensive evaluations demonstrate our method’s ability to generate chemically valid transformations, explore diverse chemical spaces, and maintain scalability across varying search sizes.
[COMMENTS]The 2nd AAAI Workshop on Foundation Models for Biological Discoveries at AAAI 2025
[LINK]http://arxiv.org/abs/2601.07930v1
[DATE]2026-01-13 03:01:11+08:00
[CATEGORIES]cs.LG
A Complete Decomposition of Stochastic Differential Equations
[AUTHORS]Samuel Duffield
[ABSTRACT]We show that any stochastic differential equation with prescribed time-dependent marginal distributions admits a decomposition into three components: a unique scalar field governing marginal evolution, a symmetric positive-semidefinite diffusion matrix field and a skew-symmetric matrix field.
[LINK]http://arxiv.org/abs/2601.07834v1
[DATE]2026-01-13 02:59:36+08:00
[CATEGORIES]cs.LG
Optimal Learning Rate Schedule for Balancing Effort and Performance
[AUTHORS]Valentina Njaradi, Rodrigo Carrasco-Davis, Peter E. Latham, Andrew Saxe
[ABSTRACT]Learning how to learn efficiently is a fundamental challenge for biological agents and a growing concern for artificial ones. To learn effectively, an agent must regulate its learning speed, balancing the benefits of rapid improvement against the costs of effort, instability, or resource use. We introduce a normative framework that formalizes this problem as an optimal control process in which the agent maximizes cumulative performance while incurring a cost of learning. From this objective, we derive a closed-form solution for the optimal learning rate, which has the form of a closed-loop controller that depends only on the agent’s current and expected future performance. Under mild assumptions, this solution generalizes across tasks and architectures and reproduces numerically optimized schedules in simulations. In simple learning models, we can mathematically analyze how agent and task parameters shape learning-rate scheduling as an open-loop control solution. Because the optimal policy depends on expectations of future performance, the framework predicts how overconfidence or underconfidence influence engagement and persistence, linking the control of learning speed to theories of self-regulated learning. We further show how a simple episodic memory mechanism can approximate the required performance expectations by recalling similar past learning experiences, providing a biologically plausible route to near-optimal behaviour. Together, these results provide a normative and biologically plausible account of learning speed control, linking self-regulated learning, effort allocation, and episodic memory estimation within a unified and tractable mathematical framework.
[LINK]http://arxiv.org/abs/2601.07830v1
[DATE]2026-01-13 02:59:07+08:00
[CATEGORIES]cs.LG
CLAPS: Posterior-Aware Conformal Intervals via Last-Layer Laplace
[AUTHORS]Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
[ABSTRACT]We present CLAPS, a posterior-aware conformal regression method that pairs a Last-Layer Laplace Approximation with split-conformal calibration. From the resulting Gaussian posterior, CLAPS defines a simple two-sided posterior CDF score that aligns the conformity metric with the full predictive shape, not just a point estimate. This alignment can yield substantially narrower prediction intervals at a fixed target coverage, particularly on small to medium tabular datasets where data are scarce and uncertainty modeling is informative. We also provide a lightweight diagnostic suite that separates aleatoric and epistemic components and visualizes posterior behavior, helping practitioners assess when and why intervals shrink. Across multiple benchmarks using the same MLP backbone, CLAPS achieves nominal coverage and offers the most efficient intervals on small to medium datasets with mild heterogeneity, while remaining competitive and diagnostically transparent on large-scale heterogeneous data where Normalized-CP and CQR attain the tightest intervals.
[COMMENTS]Revised for clarity and correctness; improved exposition and fixed minor issues
[LINK]http://arxiv.org/abs/2512.01384v3
[DATE]2026-01-13 02:49:06+08:00
[CATEGORIES]cs.LG
EEG-to-fMRI synthesis of task-evoked and spontaneous brain activity: addressing issues of statistical significance and generalizability
[AUTHORS]Neil Mehta, Ines Goncalves, Alberto Montagna, Mathis Fleury, Gustavo Caetano, Ines Esteves, Athanasios Vourvopoulos, Pulkit Grover, Patricia Figueiredo
[ABSTRACT]A growing interest has developed in the problem of training models of EEG features to predict brain activity measured using fMRI, i.e. the problem of EEG-to-fMRI synthesis. Despite some reported success, the statistical significance and generalizability of EEG-to-fMRI predictions remains to be fully demonstrated. Here, we investigate the predictive power of EEG for both task-evoked and spontaneous activity of the somatomotor network measured by fMRI, based on data collected from healthy subjects in two different sessions. We trained subject-specific distributed-lag linear models of time-varying, multi-channel EEG spectral power using Sparse Group LASSO regularization, and we showed that learned models outperformed conventional EEG somatomotor rhythm predictors as well as massive univariate correlation models. Furthermore, we showed that learned models were statistically significantly better than appropriate null models in most subjects and conditions, although less frequently for spontaneous compared to task-evoked activity. Critically, predictions improved significantly when training and testing on data acquired in the same session relative to across sessions, highlighting the importance of temporally separating the collection of train and test data to avoid data leakage and optimistic bias in model generalization. In sum, while we demonstrate that EEG models can provide fMRI predictions with statistical significance, we also show that predictive power is impaired for spontaneous fluctuations in brain activity and for models trained on data acquired in a different session. Our findings highlight the need to explicitly consider these often overlooked issues in the growing literature of EEG-to-fMRI synthesis.
[LINK]http://arxiv.org/abs/2504.10752v2
[DATE]2026-01-13 02:47:55+08:00
[CATEGORIES]cs.LG
The Power of Iterative Filtering for Supervised Learning with (Heavy) Contamination
[AUTHORS]Adam R. Klivans, Konstantinos Stavropoulos, Kevin Tian, Arsen Vasilyan
[ABSTRACT]Inspired by recent work on learning with distribution shift, we give a general outlier removal algorithm called iterative polynomial filtering and show a number of striking applications for supervised learning with contamination: (1) We show that any function class that can be approximated by low-degree polynomials with respect to a hypercontractive distribution can be efficiently learned under bounded contamination (also known as nasty noise). This is a surprising resolution to a longstanding gap between the complexity of agnostic learning and learning with contamination, as it was widely believed that low-degree approximators only implied tolerance to label noise. In particular, it implies the first efficient algorithm for learning halfspaces with $η$-bounded contamination up to error $2η+ε$ with respect to the Gaussian distribution. (2) For any function class that admits the (stronger) notion of sandwiching approximators, we obtain near-optimal learning guarantees even with respect to heavy additive contamination, where far more than $1/2$ of the training set may be added adversarially. Prior related work held only for regression and in a list-decodable setting. (3) We obtain the first efficient algorithms for tolerant testable learning of functions of halfspaces with respect to any fixed log-concave distribution. Even the non-tolerant case for a single halfspace in this setting had remained open. These results significantly advance our understanding of efficient supervised learning under contamination, a setting that has been much less studied than its unsupervised counterpart.
[COMMENTS]36 pages
[LINK]http://arxiv.org/abs/2505.20177v2
[DATE]2026-01-13 02:47:51+08:00
[CATEGORIES]cs.LG
ORACLE: Explaining Feature Interactions in Neural Networks with ANOVA
[AUTHORS]Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
[ABSTRACT]We introduce ORACLE, a framework for explaining neural networks on tabular data and scientific factorial designs. ORACLE summarizes a trained network’s prediction surface with main effects and pairwise interactions by treating the network as a black-box response, discretizing the inputs onto a grid, and fitting an orthogonal factorial (ANOVA-style) surrogate – the $L^2$ orthogonal projection of the model response onto a finite-dimensional factorial subspace. A simple centering and $μ$-rebalancing step then expresses this surrogate as main- and interaction-effect tables that remain faithful to the original model in the $L^2$ sense. The resulting grid-based interaction maps are easy to visualize, comparable across backbones, and directly aligned with classical design-of-experiments practice. On synthetic factorial benchmarks and low- to medium-dimensional tabular regression tasks, ORACLE more accurately recovers ground-truth interaction structure and hotspots than Monte Carlo SHAP-family interaction methods, as measured by ranking, localization, and cross-backbone stability. We also discuss its scope in latent image and text settings: grid-based factorial surrogates are most effective when features admit an interpretable factorial structure, making ORACLE particularly well-suited to scientific and engineering workflows that require stable DoE-style interaction summaries.
[COMMENTS]v4: Revised for clarity and correctness; improved exposition and fixed minor issues
[LINK]http://arxiv.org/abs/2509.10825v4
[DATE]2026-01-13 02:46:44+08:00
[CATEGORIES]cs.LG
Discovering Coordinated Joint Options via Inter-Agent Relative Dynamics
[AUTHORS]Raul D. Steleac, Mohan Sridharan, David Abel
[ABSTRACT]Temporally extended actions improve the ability to explore and plan in single-agent settings. In multi-agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi-agent options particularly challenging. Existing multi-agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviours. Toward addressing these limitations, we describe a novel approach for multi-agent option discovery. Specifically, we propose a joint-state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the \textit{Fermat} state, and use it to define a measure of \textit{spreadness}, capturing team-level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two multi-agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.
[LINK]http://arxiv.org/abs/2512.24827v2
[DATE]2026-01-13 02:29:50+08:00
[CATEGORIES]cs.LG
StarFlow: Generating Structured Workflow Outputs From Sketch Images
[AUTHORS]Patrice Bechard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
[ABSTRACT]Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams – including synthetic, manually annotated, and real-world samples – to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
[COMMENTS]To be presented at EACL2026
[LINK]http://arxiv.org/abs/2503.21889v2
[DATE]2026-01-13 02:27:42+08:00
[CATEGORIES]cs.LG
AgentCompress: Task-Aware Compression for Affordable Large Language Model Agents
[AUTHORS]Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
[ABSTRACT]Large language models hold considerable promise for various applications, but their computational requirements create a barrier that many institutions cannot overcome. A single session using a 70-billion-parameter model can cost around $127 in cloud computing fees, which puts these tools out of reach for organizations operating on limited budgets. We present AgentCompress, a framework that tackles this problem through task-aware dynamic compression. The idea comes from a simple observation: not all tasks require the same computational effort. Complex reasoning, for example, is far more demanding than text reformatting, yet conventional compression applies the same reduction to both. Our approach uses a lightweight neural controller that looks at the first few tokens of each request, estimates how complex the task will be, and sends it to an appropriately quantized version of the model. This routing step adds only about 12 milliseconds of overhead. We tested the framework on 290 multi-stage workflows from domains including computer science, physics, chemistry, and biology. The results show a 68.3% reduction in computational costs while preserving 96.2% of the original success rate. These findings suggest that routing queries intelligently can make powerful language models substantially more affordable without sacrificing output quality
[LINK]http://arxiv.org/abs/2601.05191v2
[DATE]2026-01-13 02:25:18+08:00
[CATEGORIES]cs.LG
DT-ICU: Towards Explainable Digital Twins for ICU Patient Monitoring via Multi-Modal and Multi-Task Iterative Inference
[AUTHORS]Wen Guo
[ABSTRACT]We introduce DT-ICU, a multimodal digital twin framework for continuous risk estimation in intensive care. DT-ICU integrates variable-length clinical time series with static patient information in a unified multitask architecture, enabling predictions to be updated as new observations accumulate over the ICU stay. We evaluate DT-ICU on the large, publicly available MIMIC-IV dataset, where it consistently outperforms established baseline models under different evaluation settings. Our test-length analysis shows that meaningful discrimination is achieved shortly after admission, while longer observation windows further improve the ranking of high-risk patients in highly imbalanced cohorts. To examine how the model leverages heterogeneous data sources, we perform systematic modality ablations, revealing that the model learnt a reasonable structured reliance on interventions, physiological response observations, and contextual information. These analyses provide interpretable insights into how multimodal signals are combined and how trade-offs between sensitivity and precision emerge. Together, these results demonstrate that DT-ICU delivers accurate, temporally robust, and interpretable predictions, supporting its potential as a practical digital twin framework for continuous patient monitoring in critical care. The source code and trained model weights for DT-ICU are publicly available at https://github.com/GUO-W/DT-ICU-release.
[LINK]http://arxiv.org/abs/2601.07778v1
[DATE]2026-01-13 01:54:19+08:00
[CATEGORIES]cs.LG
Towards Mitigating Excessive Forgetting in LLM Unlearning via Entanglement-Guidance with Proxy Constraint
[AUTHORS]Zhihao Liu, Jian Lou, Yuke Hu, Xiaochen Li, Yitian Chen, Tailun Chen, Zhizhen Qin, Kui Ren, Zhan Qin
[ABSTRACT]Large language models (LLMs) are trained on massive datasets that may include private or copyrighted content. Due to growing privacy and ownership concerns, data owners may request the removal of their data from trained models. Machine unlearning provides a practical solution by removing the influence of specific data without full retraining. However, most existing methods still suffer from over-unlearning due to the lack of a principled mechanism to regulate the forgetting boundary, leading to unnecessary utility degradation and heightened privacy and robustness risks. In this work, we propose EGUP (Entanglement-Guided Unlearning with Proxy Constraint), a novel framework that leverages entanglement and proxy constraint to guide the unlearning process while mitigating over-unlearning. Within each iteration, EGUP employs inter-sample entanglement to adaptively reweight the unlearning strength, assigning greater unlearning efforts to forget samples that are semantically closer to retained knowledge. Across iterations, EGUP leverages intra-sample entanglement to track the representation shift of each forget sample and dynamically adjust its unlearning effort. In addition, we incorporate a proxy constraint that approximates the model’s expected outputs after unlearning, forming a reference boundary that softly regularizes the unlearning process. EGUP is compatible with existing gradient-based objectives and serves as a plug-and-play enhancement. We evaluate EGUP on the TOFU and MUSE benchmarks, demonstrating consistent improvements in the unlearning-utility trade-off across multiple LLMs. Moreover, EGUP achieves performance close to the retrained model while remaining scalable and robust.
[LINK]http://arxiv.org/abs/2508.20443v2
[DATE]2026-01-13 01:50:04+08:00
[CATEGORIES]cs.LG
Bayesian Surrogates for Risk-Aware Pre-Assessment of Aging Bridge Portfolios
[AUTHORS]Sophia V. Kuhn, Rafael Bischof, Marius Weber, Antoine Binggeli, Michael A. Kraus, Walter Kaufmann, Fernando Pérez-Cruz
[ABSTRACT]Aging infrastructure portfolios pose a critical resource allocation challenge: deciding which structures require intervention and which can safely remain in service. Structural assessments must balance the trade-off between cheaper, conservative analysis methods and accurate but costly simulations that do not scale portfolio-wide. We propose Bayesian neural network (BNN) surrogates for rapid structural pre-assessment of worldwide common bridge types, such as reinforced concrete frame bridges. Trained on a large-scale database of non-linear finite element analyses generated via a parametric pipeline and developed based on the Swiss Federal Railway’s bridge portfolio, the models accurately and efficiently estimate high-fidelity structural analysis results by predicting code compliance factors with calibrated epistemic uncertainty. Our BNN surrogate enables fast, uncertainty-aware triage: flagging likely critical structures and providing guidance where refined analysis is pertinent. We demonstrate the framework’s effectiveness in a real-world case study of a railway underpass, showing its potential to significantly reduce costs and emissions by avoiding unnecessary analyses and physical interventions across entire infrastructure portfolios.
[COMMENTS]Accepted at the NeurIPS 2025 Workshop on MLxOR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making
[LINK]http://arxiv.org/abs/2509.25031v2
[DATE]2026-01-13 01:47:47+08:00
[CATEGORIES]cs.LG
Learning to bin: differentiable and Bayesian optimization for multi-dimensional discriminants in high-energy physics
[AUTHORS]Johannes Erdmann, Nitish Kumar Kasaraguppe, Florian Mausolf
[ABSTRACT]Categorizing events using discriminant observables is central to many high-energy physics analyses. Yet, bin boundaries are often chosen by hand. A simple, popular choice is to apply argmax projections of multi-class scores and equidistant binning of one-dimensional discriminants. We propose a binning optimization for signal significance directly in multi-dimensional discriminants. We use a Gaussian Mixture Model (GMM) to define flexible bin boundary shapes for multi-class scores, while in one dimension (binary classification) we move bin boundaries directly. On this binning model, we study two optimization strategies: a differentiable and a Bayesian optimization approach. We study two toy setups: a binary classification and a three-class problem with two signals and backgrounds. In the one-dimensional case, both approaches achieve similar gains in signal sensitivity compared to equidistant binnings for a given number of bins. In the multi-dimensional case, the GMM-based binning defines sensitive categories as well, with the differentiable approach performing best. We show that, in particular for limited separability of the signal processes, our approach outperforms argmax classification even with optimized binning in the one-dimensional projections. Both methods are released as lightweight Python plugins intended for straightforward integration into existing analyses.
[COMMENTS]13 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.07756v1
[DATE]2026-01-13 01:40:45+08:00
[CATEGORIES]cs.LG
Improving Domain Generalization in Contrastive Learning using Adaptive Temperature Control
[AUTHORS]Robert Lewis, Katie Matton, Rosalind W. Picard, John Guttag
[ABSTRACT]Self-supervised pre-training with contrastive learning is a powerful method for learning from sparsely labeled data. However, performance can drop considerably when there is a shift in the distribution of data from training to test time. We study this phenomenon in a setting in which the training data come from multiple domains, and the test data come from a domain not seen at training that is subject to significant covariate shift. We present a new method for contrastive learning that incorporates domain labels to increase the domain invariance of learned representations, leading to improved out-of-distribution generalization. Our method adjusts the temperature parameter in the InfoNCE loss – which controls the relative weighting of negative pairs – using the probability that a negative sample comes from the same domain as the anchor. This upweights pairs from more similar domains, encouraging the model to discriminate samples based on domain-invariant attributes. Through experiments on a variant of the MNIST dataset, we demonstrate that our method yields better out-of-distribution performance than domain generalization baselines. Furthermore, our method maintains strong in-distribution task performance, substantially outperforming baselines on this measure.
[COMMENTS]NeurIPS SSL Workshop 2023
[LINK]http://arxiv.org/abs/2601.07748v1
[DATE]2026-01-13 01:32:24+08:00
[CATEGORIES]cs.LG
MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models
[AUTHORS]Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller
[ABSTRACT]Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers – a property we term ‘latent solvability’. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.
[LINK]http://arxiv.org/abs/2512.21231v2
[DATE]2026-01-13 01:32:16+08:00
[CATEGORIES]cs.LG
PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials
[AUTHORS]Teddy Koker, Abhijeet Gangan, Mit Kotak, Jaime Marian, Tess Smidt
[ABSTRACT]Many materials properties depend on higher-order derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with standard a standard loss on energy, force, and stress errors can exhibit error in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP (trained on Materials Project) by 55% on average across phonon thermodynamic properties and achieves state-of-the-art performance among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second-derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.
[LINK]http://arxiv.org/abs/2601.07742v1
[DATE]2026-01-13 01:20:09+08:00
[CATEGORIES]cs.LG
Backward Reconstruction of the Chafee–Infante Equation via Physics-Informed WGAN-GP
[AUTHORS]Joseph L. Shomberg
[ABSTRACT]We present a physics-informed Wasserstein GAN with gradient penalty (WGAN-GP) for solving the inverse Chafee–Infante problem on two-dimensional domains with Dirichlet boundary conditions. The objective is to reconstruct an unknown initial condition from a near-equilibrium state obtained after 100 explicit forward Euler iterations of the reaction-diffusion equation [ u_t - γΔu + κ\left(u^3 - u\right)=0. ] Because this mapping strongly damps high-frequency content, the inverse problem is severely ill-posed and sensitive to noise.
Our approach integrates a U-Net generator, a PatchGAN critic with spectral normalization, Wasserstein loss with gradient penalty, and several physics-informed auxiliary terms, including Lyapunov energy matching, distributional statistics, and a crucial forward-simulation penalty. This penalty enforces consistency between the predicted initial condition and its forward evolution under the \emph{same} forward Euler discretization used for dataset generation. Earlier experiments employing an Eyre-type semi-implicit solver were not compatible with this residual mechanism due to the cost and instability of Newton iterations within batched GPU training.
On a dataset of 50k training and 10k testing pairs on $128\times128$ grids (with natural $[-1,1]$ amplitude scaling), the best trained model attains a mean absolute error (MAE) of approximately \textbf{0.23988159} on the full test set, with a sample-wise standard deviation of about \textbf{0.00266345}. The results demonstrate stable inversion, accurate recovery of interfacial structure, and robustness to high-frequency noise in the initial data.
[COMMENTS]5 pages, 9 figures
[LINK]http://arxiv.org/abs/2601.07733v1
[DATE]2026-01-13 01:11:00+08:00
[CATEGORIES]cs.LG
FBFL: A Field-Based Coordination Approach for Data Heterogeneity in Federated Learning
[AUTHORS]Davide Domini, Gianluca Aguzzi, Lukas Esterle, Mirko Viroli
[ABSTRACT]In the last years, Federated learning (FL) has become a popular solution to train machine learning models in domains with high privacy concerns. However, FL scalability and performance face significant challenges in real-world deployments where data across devices are non-independently and identically distributed (non-IID). The heterogeneity in data distribution frequently arises from spatial distribution of devices, leading to degraded model performance in the absence of proper handling. Additionally, FL typical reliance on centralized architectures introduces bottlenecks and single-point-of-failure risks, particularly problematic at scale or in dynamic environments. To close this gap, we propose Field-Based Federated Learning (FBFL), a novel approach leveraging macroprogramming and field coordination to address these limitations through: (i) distributed spatial-based leader election for personalization to mitigate non-IID data challenges; and (ii) construction of a self-organizing, hierarchical architecture using advanced macroprogramming patterns. Moreover, FBFL not only overcomes the aforementioned limitations, but also enables the development of more specialized models tailored to the specific data distribution in each subregion. This paper formalizes FBFL and evaluates it extensively using MNIST, FashionMNIST, and Extended MNIST datasets. We demonstrate that, when operating under IID data conditions, FBFL performs comparably to the widely-used FedAvg algorithm. Furthermore, in challenging non-IID scenarios, FBFL not only outperforms FedAvg but also surpasses other state-of-the-art methods, namely FedProx and Scaffold, which have been specifically designed to address non-IID data distributions. Additionally, we showcase the resilience of FBFL’s self-organizing hierarchical architecture against server failures.
[LINK]http://arxiv.org/abs/2502.08577v3
[DATE]2026-01-13 01:08:29+08:00
[CATEGORIES]cs.LG
Enhancing Binary Encoded Crime Linkage Analysis Using Siamese Network
[AUTHORS]Yicheng Zhan, Fahim Ahmed, Amy Burrell, Matthew J. Tonkin, Sarah Galambos, Jessica Woodhams, Dalal Alrajeh
[ABSTRACT]Effective crime linkage analysis is crucial for identifying serial offenders and enhancing public safety. To address limitations of traditional crime linkage methods in handling high-dimensional, sparse, and heterogeneous data, we propose a Siamese Autoencoder framework that learns meaningful latent representations and uncovers correlations in complex crime data. Using data from the Violent Crime Linkage Analysis System (ViCLAS), maintained by the Serious Crime Analysis Section of the UK’s National Crime Agency, our approach mitigates signal dilution in sparse feature spaces by integrating geographic-temporal features at the decoder stage. This design amplifies behavioral representations rather than allowing them to be overshadowed at the input level, yielding consistent improvements across multiple evaluation metrics. We further analyze how different domain-informed data reduction strategies influence model performance, providing practical guidance for preprocessing in crime linkage contexts. Our results show that advanced machine learning approaches can substantially enhance linkage accuracy, improving AUC by up to 9% over traditional methods while offering interpretable insights to support investigative decision-making.
[COMMENTS]AAAI 2026, 7 pages, 4 figures
[LINK]http://arxiv.org/abs/2511.07651v2
[DATE]2026-01-13 01:00:01+08:00
[CATEGORIES]cs.LG
An Information-Minimal Geometry for Qubit-Efficient Optimization
[AUTHORS]Gordon Ma, Dimitris G. Angelakis
[ABSTRACT]Qubit-efficient optimization studies how large combinatorial problems can be addressed with quantum circuits whose width is far smaller than the number of logical variables. In quadratic unconstrained binary optimization (QUBO), objective values depend only on one- and two-body statistics, yet standard variational algorithms explore exponentially large Hilbert spaces. We recast qubit-efficient optimization as a geometric question: what is the minimal representation the objective itself requires?
Focusing on QUBO problems, we show that enforcing mutual consistency among pairwise statistics defines a convex body – the level-2 Sherali-Adams polytope – that captures the information on which quadratic objectives depend. We operationalize this geometry in a minimal variational pipeline that separates representation, consistency, and decoding: a logarithmic-width circuit produces pairwise moments, a differentiable information projection enforces local feasibility, and a maximum-entropy ensemble provides a principled global decoder.
This information-minimal construction achieves near-optimal approximation ratios on large unweighted Max-Cut instances (up to N=2000) at shallow depth, indicating that pairwise polyhedral geometry already captures the relevant structure in this regime. By making the information-minimal geometry explicit, this work establishes a clean baseline for qubit-efficient optimization and sharpens the question of where genuinely quantum structure becomes necessary.
[COMMENTS]39 pages, 9 figures. v2: Revised for clarity
[LINK]http://arxiv.org/abs/2511.08362v2
[DATE]2026-01-13 00:40:01+08:00
[CATEGORIES]cs.LG
Geometry-Aware Edge Pooling for Graph Neural Networks
[AUTHORS]Katharina Limbeck, Lydia Mezrag, Guy Wolf, Bastian Rieck
[ABSTRACT]Graph Neural Networks (GNNs) have shown significant success for graph-based tasks. Motivated by the prevalence of large datasets in real-world applications, pooling layers are crucial components of GNNs. By reducing the size of input graphs, pooling enables faster training and potentially better generalisation. However, existing pooling operations often optimise for the learning task at the expense of discarding fundamental graph structures, thus reducing interpretability. This leads to unreliable performance across dataset types, downstream tasks and pooling ratios. Addressing these concerns, we propose novel graph pooling layers for structure-aware pooling via edge collapses. Our methods leverage diffusion geometry and iteratively reduce a graph’s size while preserving both its metric structure and its structural diversity. We guide pooling using magnitude, an isometry-invariant diversity measure, which permits us to control the fidelity of the pooling process. Further, we use the spread of a metric space as a faster and more stable alternative ensuring computational efficiency. Empirical results demonstrate that our methods (i) achieve top performance compared to alternative pooling layers across a range of diverse graph classification tasks, (ii) preserve key spectral properties of the input graphs, and (iii) retain high accuracy across varying pooling ratios.
[COMMENTS]Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS) 2025. Our code is available at https://github.com/aidos-lab/mag_edge_pool
[LINK]http://arxiv.org/abs/2506.11700v3
[DATE]2026-01-13 00:19:36+08:00
[CATEGORIES]cs.LG
Physics-Informed Singular-Value Learning for Cross-Covariances Forecasting in Financial Markets
[AUTHORS]Efstratios Manolakis, Christian Bongiorno, Rosario Nunzio Mantegna
[ABSTRACT]A new wave of work on covariance cleaning and nonlinear shrinkage has delivered asymptotically optimal analytical solutions for large covariance matrices. Building on this progress, these ideas have been generalized to empirical cross-covariance matrices, whose singular-value shrinkage characterizes comovements between one set of assets and another. Existing analytical cross-covariance cleaners are derived under strong stationarity and large-sample assumptions, and they typically rely on mesoscopic regularity conditions such as bounded spectra; macroscopic common modes (e.g., a global market factor) violate these conditions. When applied to real equity returns, where dependence structures drift over time and global modes are prominent, we find that these theoretically optimal formulas do not translate into robust out-of-sample performance. We address this gap by designing a random-matrix-inspired neural architecture that operates in the empirical singular-vector basis and learns a nonlinear mapping from empirical singular values to their corresponding cleaned values. By construction, the network can recover the analytical solution as a special case, yet it remains flexible enough to adapt to non-stationary dynamics and mode-driven distortions. Trained on a long history of equity returns, the proposed method achieves a more favorable bias-variance trade-off than purely analytical cleaners and delivers systematically lower out-of-sample cross-covariance prediction errors. Our results demonstrate that combining random-matrix theory with machine learning makes asymptotic theories practically effective in realistic time-varying markets.
[LINK]http://arxiv.org/abs/2601.07687v1
[DATE]2026-01-13 00:18:08+08:00
[CATEGORIES]cs.LG
Tab-TRM: Tiny Recursive Model for Insurance Pricing on Tabular Data
[AUTHORS]Kishan Padayachy, Ronald Richman, Mario V. Wüthrich
[ABSTRACT]We introduce Tab-TRM (Tabular-Tiny Recursive Model), a network architecture that adapts the recursive latent reasoning paradigm of Tiny Recursive Models (TRMs) to insurance modeling. Drawing inspiration from both the Hierarchical Reasoning Model (HRM) and its simplified successor TRM, the Tab-TRM model makes predictions by reasoning over the input features. It maintains two learnable latent tokens - an answer token and a reasoning state - that are iteratively refined by a compact, parameter-efficient recursive network. The recursive processing layer repeatedly updates the reasoning state given the full token sequence and then refines the answer token, in close analogy with iterative insurance pricing schemes. Conceptually, Tab-TRM bridges classical actuarial workflows - iterative generalized linear model fitting and minimum-bias calibration - on the one hand, and modern machine learning, in terms of Gradient Boosting Machines, on the other.
[COMMENTS]30 pages
[LINK]http://arxiv.org/abs/2601.07675v1
[DATE]2026-01-13 00:01:49+08:00
[CATEGORIES]cs.LG
Self-Creating Random Walks for Decentralized Learning under Pac-Man Attacks
[AUTHORS]Xingran Chen, Parimal Parag, Rohit Bhagat, Salim El Rouayheb
[ABSTRACT]Random walk (RW)-based algorithms have long been popular in distributed systems due to low overheads and scalability, with recent growing applications in decentralized learning. However, their reliance on local interactions makes them inherently vulnerable to malicious behavior. In this work, we investigate an adversarial threat that we term the “Pac-Man” attack, in which a malicious node probabilistically terminates any RW that visits it. This stealthy behavior gradually eliminates active RWs from the network, effectively halting the learning process without triggering failure alarms. To counter this threat, we propose the CREATE-IF-LATE (CIL) algorithm, which is a fully decentralized, resilient mechanism that enables self-creating RWs and prevents RW extinction in the presence of Pac-Man. Our theoretical analysis shows that the CIL algorithm guarantees several desirable properties, such as (i) non-extinction of the RW population, (ii) almost sure boundedness of the RW population, and (iii) convergence of RW-based stochastic gradient descent even in the presence of Pac-Man with a quantifiable deviation from the true optimum. Moreover, the learning process experiences at most a linear time delay due to Pac-Man interruptions and RW regeneration. Our extensive empirical results on both synthetic and public benchmark datasets validate our theoretical findings.
[COMMENTS]arXiv admin note: substantial text overlap with arXiv:2508.05663
[LINK]http://arxiv.org/abs/2601.07674v1
[DATE]2026-01-13 00:00:21+08:00
[CATEGORIES]cs.LG
Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
[AUTHORS]Nisansa de Silva
[ABSTRACT]Sinhala is the native language of the Sinhalese people who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, in the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap of a comprehensive literature survey of the publicly available Sinhala natural language tools and research so that the researchers working in this field can better utilize contributions of their peers. As such, we shall be uploading this paper to arXiv and perpetually update it periodically to reflect the advances made in the field.
[LINK]http://arxiv.org/abs/1906.02358v26
[DATE]2026-01-12 23:46:40+08:00
[CATEGORIES]cs.CL
Order in the Evaluation Court: A Critical Analysis of NLG Evaluation Trends
[AUTHORS]Jing Yang, Nils Feldhus, Salar Mohtaj, Leonhard Hennig, Qianli Wang, Eleni Metheniti, Sherzod Hakimov, Charlott Jakob, Veronika Solopova, Konrad Rieck, David Schlangen, Sebastian Möller, Vera Schmitt
[ABSTRACT]Despite advances in Natural Language Generation (NLG), evaluation remains challenging. Although various new metrics and LLM-as-a-judge (LaaJ) methods are proposed, human judgment persists as the gold standard. To systematically review how NLG evaluation has evolved, we employ an automatic information extraction scheme to gather key information from NLG papers, focusing on different evaluation methods (metrics, LaaJ and human evaluation). With extracted metadata from 14,171 papers across four major conferences (ACL, EMNLP, NAACL, and INLG) over the past six years, we reveal several critical findings: (1) Task Divergence: While Dialogue Generation demonstrates a rapid shift toward LaaJ (>40% in 2025), Machine Translation remains locked into n-gram metrics, and Question Answering exhibits a substantial decline in the proportion of studies conducting human evaluation. (2) Metric Inertia: Despite the development of semantic metrics, general-purpose metrics (e.g., BLEU, ROUGE) continue to be widely used across tasks without empirical justification, often lacking the discriminative power to distinguish between specific quality criteria. (3) Human-LaaJ Divergence: Our association analysis challenges the assumption that LLMs act as mere proxies for humans; LaaJ and human evaluations prioritize very different signals, and explicit validation is scarce (<8% of papers comparing the two), with only moderate to low correlation. Based on these observations, we derive practical recommendations to improve the rigor of future NLG evaluation.
[COMMENTS]8 pages
[LINK]http://arxiv.org/abs/2601.07648v1
[DATE]2026-01-12 23:27:58+08:00
[CATEGORIES]cs.CL
PlaM: Training-Free Plateau-Guided Model Merging for Better Visual Grounding in MLLMs
[AUTHORS]Zijing Wang, Yongkang Liu, Mingyang Wang, Ercong Nie, Deyuan Chen, Zhengjie Zhao, Shi Feng, Daling Wang, Xiaocui Yang, Yifei Zhang, Hinrich Schütze
[ABSTRACT]Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text’s reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework to mitigate this degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in multimodal large language models: early-modal separation, mid-modal alignment, and late-modal degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results based on five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is on https://github.com/wzj1718/PlaM.
[COMMENTS]under review
[LINK]http://arxiv.org/abs/2601.07645v1
[DATE]2026-01-12 23:27:51+08:00
[CATEGORIES]cs.CL
Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning
[AUTHORS]Jiaxuan Lu, Ziyu Kong, Yemin Wang, Rong Fu, Haiyuan Wan, Cheng Yang, Wenjie Lou, Haoran Sun, Lilong Wang, Yankai Jiang, Xiaosong Wang, Xiao Sun, Dongzhan Zhou
[ABSTRACT]The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open-ended scientific world. Existing LLM-based agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.
[LINK]http://arxiv.org/abs/2601.07641v1
[DATE]2026-01-12 23:22:51+08:00
[CATEGORIES]cs.CL
Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts
[AUTHORS]Di Zhang, Xun Wu, Shaohan Huang, Lingjie Jiang, Yaru Hao, Li Dong, Zewen Chi, Zhifang Sui, Furu Wei
[ABSTRACT]Recent advances in reinforcement learning (RL) have substantially improved the training of large-scale language models, leading to significant gains in generation quality and reasoning ability. However, most existing research focuses on dense models, while RL training for Mixture-of-Experts (MoE) architectures remains underexplored. To address the instability commonly observed in MoE training, we propose a novel router-aware approach to optimize importance sampling (IS) weights in off-policy RL. Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training divergence. Experimental results demonstrate that our method significantly improves both the convergence stability and the final performance of MoE models, highlighting the potential of RL algorithmic innovations tailored to MoE architectures and providing a promising direction for efficient training of large-scale expert models.
[COMMENTS]Added additional experiments, improved analysis, and fixed minor issues
[LINK]http://arxiv.org/abs/2510.23027v2
[DATE]2026-01-12 23:14:47+08:00
[CATEGORIES]cs.LG cs.CL
Uncovering the Computational Roles of Nonlinearity in Sequence Modeling Using Almost-Linear RNNs
[AUTHORS]Manuel Brenner, Georgia Koppe
[ABSTRACT]Sequence modeling tasks across domains such as natural language processing, time series forecasting, and control require learning complex input-output mappings. Nonlinear recurrence is theoretically required for universal approximation of sequence-to-sequence functions, yet linear recurrent models often prove surprisingly effective. This raises the question of when nonlinearity is truly required. We present a framework to systematically dissect the functional role of nonlinearity in recurrent networks, identifying when it is computationally necessary and what mechanisms it enables. We address this using Almost Linear Recurrent Neural Networks (AL-RNNs), which allow recurrence nonlinearity to be gradually attenuated and decompose network dynamics into analyzable linear regimes, making computational mechanisms explicit. We illustrate the framework across diverse synthetic and real-world tasks, including classic sequence modeling benchmarks, a neuroscientific stimulus-selection task, and a multi-task suite. We demonstrate how the AL-RNN’s piecewise linear structure enables identification of computational primitives such as gating, rule-based integration, and memory-dependent transients, revealing that these operations emerge within predominantly linear backbones. Across tasks, sparse nonlinearity improves interpretability by reducing and localizing nonlinear computations, promotes shared representations in multi-task settings, and reduces computational cost. Moreover, sparse nonlinearity acts as a useful inductive bias: in low-data regimes or when tasks require discrete switching between linear regimes, sparsely nonlinear models often match or exceed fully nonlinear architectures. Our findings provide a principled approach for identifying where nonlinearity is functionally necessary, guiding the design of recurrent architectures that balance performance, efficiency, and interpretability.
[COMMENTS]Published in Transactions on Machine Learning Research (TMLR), https://openreview.net/forum?id=qI2Vt9P9rl
[LINK]http://arxiv.org/abs/2506.07919v2
[DATE]2026-01-12 23:09:14+08:00
[CATEGORIES]cs.LG cs.CL
Rep3Net: An Approach Exploiting Multimodal Representation for Molecular Bioactivity Prediction
[AUTHORS]Sabrina Islam, Md. Atiqur Rahman, Md. Bakhtiar Hasan, Md. Hasanul Kabir
[ABSTRACT]Accurate prediction of compound potency accelerates early-stage drug discovery by prioritizing candidates for experimental testing. However, many Quantitative Structure-Activity Relationship (QSAR) approaches for this prediction are constrained by their choice of molecular representation: handcrafted descriptors capture global properties but miss local topology, graph neural networks encode structure but often lack broader chemical context, and SMILES-based language models provide contextual patterns learned from large corpora but are seldom combined with structural features. To exploit these complementary signals, we introduce Rep3Net, a unified multimodal architecture that fuses RDKit molecular descriptors, graph-derived features from a residual graph-convolutional backbone, and ChemBERTa SMILES embeddings. We evaluate Rep3Net on a curated ChEMBL subset for Human PARP1 using fivefold cross validation. Rep3Net attains an MSE of $0.83\pm0.06$, RMSE of $0.91\pm0.03$, $R^{2}=0.43\pm0.01$, and yields Pearson and Spearman correlations of $0.66\pm0.01$ and $0.67\pm0.01$, respectively, substantially improving over several strong GNN baselines. In addition, Rep3Net achieves a favorable latency-to-parameter trade-off thanks to a single-layer GCN backbone and parallel frozen encoders. Ablations show that graph topology, ChemBERTa semantics, and handcrafted descriptors each contribute complementary information, with full fusion providing the largest error reduction. These results demonstrate that multimodal representation fusion can improve potency prediction for PARP1 and provide a scalable framework for virtual screening in early-stage drug discovery.
[LINK]http://arxiv.org/abs/2512.00521v2
[DATE]2026-01-12 23:05:46+08:00
[CATEGORIES]cs.LG cs.CL
NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning
[AUTHORS]Zhongtao Miao, Kaiyan Zhao, Masaaki Nagata, Yoshimasa Tsuruoka
[ABSTRACT]Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains underexplored compared with general machine translation (MT). In this paper, we propose an agentic framework, NeoAMT, for neologism-aware machine translation using a Wiktionary search tool. Specifically, we first create a new dataset for neologism-aware machine translation and develop a search tool based on Wiktionary. The new dataset covers 16 languages and 75 translation directions and is derived from approximately 10 million records of an English Wiktionary dump. The retrieval corpus of the search tool is also constructed from around 3 million cleaned records of the Wiktionary dump. We then use it for training the translation agent with reinforcement learning (RL) and evaluating the accuracy of neologism-aware machine translation. Based on this, we also propose an RL training framework that contains a novel reward design and an adaptive rollout generation approach by leveraging “translation difficulty” to further improve the translation quality of translation agents using our search tool.
[COMMENTS]Fixed typos in Table 1, Figure 7 and Section 4.2: regex -> exact. Refined the caption of Table 3
[LINK]http://arxiv.org/abs/2601.03790v2
[DATE]2026-01-12 23:01:07+08:00
[CATEGORIES]cs.CL
Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
[AUTHORS]Bingyang Ye, Shan Chen, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman
[ABSTRACT]Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models’ judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers’ agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
[COMMENTS]under review
[LINK]http://arxiv.org/abs/2601.07606v1
[DATE]2026-01-12 22:55:37+08:00
[CATEGORIES]cs.CL
GRPO with State Mutations: Improving LLM-Based Hardware Test Plan Generation
[AUTHORS]Dimple Vijay Kochar, Nathaniel Pinckney, Guan-Ting Liu, Chia-Tung Ho, Chenhui Deng, Haoxing Ren, Brucek Khailany
[ABSTRACT]RTL design often relies heavily on ad-hoc testbench creation early in the design cycle. While large language models (LLMs) show promise for RTL code generation, their ability to reason about hardware specifications and generate targeted test plans remains largely unexplored. We present the first systematic study of LLM reasoning capabilities for RTL verification stimuli generation, establishing a two-stage framework that decomposes test plan generation from testbench execution. Our benchmark reveals that state-of-the-art models, including DeepSeek-R1 and Claude-4.0-Sonnet, achieve only 15.7-21.7% success rates on generating stimuli that pass golden RTL designs. To improve LLM generated stimuli, we develop a comprehensive training methodology combining supervised fine-tuning with a novel reinforcement learning approach, GRPO with State Mutation (GRPO-SMu), which enhances exploration by varying input mutations. Our approach leverages a tree-based branching mutation strategy to construct training data comprising equivalent and mutated trees, moving beyond linear mutation approaches to provide rich learning signals. Training on this curated dataset, our 7B parameter model achieves a 33.3% golden test pass rate and a 13.9% mutation detection rate, representing a 17.6% absolute improvement over baseline and outperforming much larger general-purpose models. These results demonstrate that specialized training methodologies can significantly enhance LLM reasoning capabilities for hardware verification tasks, establishing a foundation for automated sub-unit testing in semiconductor design workflows.
[LINK]http://arxiv.org/abs/2601.07593v1
[DATE]2026-01-12 22:42:42+08:00
[CATEGORIES]cs.CL cs.LG
Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models
[AUTHORS]Alex Laitenberger, Christopher D. Manning, Nelson F. Liu
[ABSTRACT]With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single context window, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To assess this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document’s Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We trace this strength to a combination of maintaining source fidelity and document structure, prioritizing recall within effective context windows, and favoring simplicity over added pipeline complexity. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, paired with state-of-the-art embedding and language models, and benchmarked under matched token budgets, to ensure that added pipeline complexity is justified by clear performance gains as models continue to improve.
[COMMENTS]11 pages, 6 figures, for associated source code, see https://github.com/alex-laitenberger/stronger-baselines-rag
[LINK]http://arxiv.org/abs/2506.03989v2
[DATE]2026-01-12 22:40:22+08:00
[CATEGORIES]cs.CL
A Unified Framework for Emotion Recognition and Sentiment Analysis via Expert-Guided Multimodal Fusion with Large Language Models
[AUTHORS]Jiaqi Qiao, Xiujuan Xu, Xinran Li, Yu Liu
[ABSTRACT]Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework combining expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks–a fine-grained local expert for subtle emotional nuances, a semantic correlation expert for cross-modal relationships, and a global context expert for long-range dependencies–adaptively integrated through hierarchical dynamic gating for context-aware feature selection. Enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, with superior cross-lingual robustness revealing universal patterns in multimodal emotional expressions across English and Chinese. We will release the source code publicly.
[LINK]http://arxiv.org/abs/2601.07565v1
[DATE]2026-01-12 22:21:32+08:00
[CATEGORIES]cs.CL
Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?
[AUTHORS]Jingwei Ni, Yu Fan, Vilém Zouhar, Donya Rooein, Alexander Hoyle, Mrinmaya Sachan, Markus Leippold, Dirk Hovy, Elliott Ash
[ABSTRACT]Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of RLHF LLMs (RL from human feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.
[COMMENTS]EACL 2026 Main
[LINK]http://arxiv.org/abs/2506.19467v3
[DATE]2026-01-12 21:46:04+08:00
[CATEGORIES]cs.CL
From RAG to Agentic RAG for Faithful Islamic Question Answering
[AUTHORS]Gagan Bhatia, Hamdy Mubarak, Mustafa Jarrar, George Mikros, Fadi Zaraket, Mahmoud Alhirthani, Mutaz Al-Khatib, Logan Cochrane, Kareem Darwish, Rashid Yahiaoui, Firoj Alam
[ABSTRACT]LLMs are increasingly used for Islamic question answering, where ungrounded responses may carry serious religious consequences. Yet standard MCQ/MRC-style evaluations do not capture key real-world failure modes, notably free-form hallucinations and whether models appropriately abstain when evidence is lacking. To shed a light on this aspect we introduce ISLAMICFAITHQA, a 3,810-item bilingual (Arabic/English) generative benchmark with atomic single-gold answers, which enables direct measurement of hallucination and abstention. We additionally developed an end-to-end grounded Islamic modelling suite consisting of (i) 25K Arabic text-grounded SFT reasoning pairs, (ii) 5K bilingual preference samples for reward-guided alignment, and (iii) a verse-level Qur’an retrieval corpus of $\sim$6k atomic verses (ayat). Building on these resources, we develop an agentic Quran-grounding framework (agentic RAG) that uses structured tool calls for iterative evidence seeking and answer revision. Experiments across Arabic-centric and multilingual LLMs show that retrieval improves correctness and that agentic RAG yields the largest gains beyond standard RAG, achieving state-of-the-art performance and stronger Arabic-English robustness even with a small model (i.e., Qwen3 4B). We will make the experimental resources and datasets publicly available for the community.
[LINK]http://arxiv.org/abs/2601.07528v1
[DATE]2026-01-12 21:28:28+08:00
[CATEGORIES]cs.CL
Thinking Before Constraining: A Unified Decoding Framework for Large Language Models
[AUTHORS]Ngoc Trinh Hung Nguyen, Alonso Silva, Laith Zumot, Liubov Tupikina, Armen Aghasaryan, Mehwish Alam
[ABSTRACT]Natural generation allows Language Models (LMs) to produce free-form responses with rich reasoning, but the lack of guaranteed structure makes outputs difficult to parse or verify. Structured generation, or constrained decoding, addresses this drawback by producing content in standardized formats such as JSON, ensuring consistency and guaranteed-parsable outputs, but it can inadvertently restrict the model’s reasoning capabilities. In this work, we propose a simple approach that combines the advantages of both natural and structured generation. By allowing LLMs to reason freely until specific trigger tokens are generated, and then switching to structured generation, our method preserves the expressive power of natural language reasoning while ensuring the reliability of structured outputs. We further evaluate our approach on several datasets, covering both classification and reasoning tasks, to demonstrate its effectiveness, achieving a substantial gain of up to 27% in accuracy compared to natural generation, while requiring only a small overhead of 10-20 extra tokens.
[LINK]http://arxiv.org/abs/2601.07525v1
[DATE]2026-01-12 21:25:28+08:00
[CATEGORIES]cs.CL
IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages
[AUTHORS]Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari
[ABSTRACT]While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani as extremely low-resource) plus Sanskrit-English code-mixed set. We evaluated 20 LLMs, both proprietary and open-weights, which reveals that even the top-performing \texttt{Gemini-2.5} reaches 58\% average accuracy, followed by \texttt{GPT-5} (45) and \texttt{DeepSeek-3.2} (43.1). We additionally label each question as knowledge-oriented or purely linguistic to discriminate factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats-such as list-based matching, assertion-reason pairs, and sequence ordering-alongside conventional multiple-choice questions. \benchmark\ provides insights into limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at https://huggingface.co/datasets/bharatgenai/IndicParam. Scripts to run benchmark are present at https://github.com/ayushbits/IndicParam.
[LINK]http://arxiv.org/abs/2512.00333v2
[DATE]2026-01-12 21:16:09+08:00
[CATEGORIES]cs.CL
Controlling Multimodal Conversational Agents with Coverage-Enhanced Latent Actions
[AUTHORS]Yongqi Li, Hao Lang, Tieyun Qian, Yongbin Li
[ABSTRACT]Vision-language models are increasingly employed as multimodal conversational agents (MCAs) for diverse conversational tasks. Recently, reinforcement learning (RL) has been widely explored for adapting MCAs to various human-AI interaction scenarios. Despite showing great enhancement in generalization performance, fine-tuning MCAs via RL still faces challenges in handling the extremely large text token space. To address this, we learn a compact latent action space for RL fine-tuning instead. Specifically, we adopt the learning from observation mechanism to construct the codebook for the latent action space, where future observations are leveraged to estimate current latent actions that could further be used to reconstruct future observations. However, the scarcity of paired image-text data hinders learning a codebook with sufficient coverage. Thus, we leverage both paired image-text data and text-only data to construct the latent action space, using a cross-modal projector for transforming text embeddings into image-text embeddings. We initialize the cross-modal projector on paired image-text data, and further train it on massive text-only data with a novel cycle consistency loss to enhance its robustness. We show that our latent action based method outperforms competitive baselines on two conversation tasks across various RL algorithms.
[LINK]http://arxiv.org/abs/2601.07516v1
[DATE]2026-01-12 21:13:24+08:00
[CATEGORIES]cs.CL cs.LG
High-Rank Structured Modulation for Parameter-Efficient Fine-Tuning
[AUTHORS]Yongkang Liu, Xing Li, Mengjie Zhao, Shanru Zhang, Zijing Wang, Qian Li, Shi Feng, Feiliang Ren, Daling Wang, Hinrich Schütze
[ABSTRACT]As the number of model parameters increases, parameter-efficient fine-tuning (PEFT) has become the go-to choice for tailoring pre-trained large language models. Low-rank Adaptation (LoRA) uses a low-rank update method to simulate full parameter fine-tuning, which is widely used to reduce resource requirements. However, decreasing the rank encounters challenges with limited representational capacity when compared to full parameter fine-tuning. We present \textbf{SMoA}, a high-rank \textbf{S}tructured \textbf{MO}dulation \textbf{A}dapter that uses fewer trainable parameters while maintaining a higher rank, thereby improving the model’s representational capacity and offering improved performance potential. The core idea is to freeze the original pretrained weights and selectively amplify or suppress important features of the original weights across multiple subspaces. The subspace mechanism provides an efficient way to increase the capacity and complexity of a model. We conduct both theoretical analyses and empirical studies on various tasks. Experiment results show that SMoA outperforms LoRA and its variants on 10 tasks, with extensive ablation studies validating its effectiveness.
[COMMENTS]under review
[LINK]http://arxiv.org/abs/2601.07507v1
[DATE]2026-01-12 21:06:17+08:00
[CATEGORIES]cs.CL
Bridging the Linguistic Divide: A Survey on Leveraging Large Language Models for Machine Translation
[AUTHORS]Baban Gain, Dibyanayan Bandyopadhyay, Asif Ekbal, Trilok Nath Singh
[ABSTRACT]Large Language Models (LLMs) are rapidly reshaping machine translation (MT), particularly by introducing instruction-following, in-context learning, and preference-based alignment into what has traditionally been a supervised encoder-decoder paradigm. This survey provides a comprehensive and up-to-date overview of how LLMs are being leveraged for MT across data regimes, languages, and application settings. We systematically analyze prompting-based methods, parameter-efficient and full fine-tuning strategies, synthetic data generation, preference-based optimization, and reinforcement learning with human and weakly supervised feedback. Special attention is given to low-resource translation, where we examine the roles of synthetic data quality, diversity, and preference signals, as well as the limitations of current RLHF pipelines. We further review recent advances in Mixture-of-Experts models, MT-focused LLMs, and multilingual alignment, highlighting trade-offs between scalability, specialization, and accessibility. Beyond sentence-level translation, we survey emerging document-level and discourse-aware MT methods with LLMs, showing that most approaches extend sentence-level pipelines through structured context selection, post-editing, or reranking rather than requiring fundamentally new data regimes or architectures. Finally, we discuss LLM-based evaluation, its strengths and biases, and its role alongside learned metrics. Overall, this survey positions LLM-based MT as an evolution of traditional MT systems, where gains increasingly depend on data quality, preference alignment, and context utilization rather than scale alone, and outlines open challenges for building robust, inclusive, and controllable translation systems.
[LINK]http://arxiv.org/abs/2504.01919v4
[DATE]2026-01-12 20:46:10+08:00
[CATEGORIES]cs.CL
Through the LLM Looking Glass: A Socratic Probing of Donkeys, Elephants, and Markets
[AUTHORS]Molly Kennedy, Ayyoob Imani, Timo Spinde, Akiko Aizawa, Hinrich Schütze
[ABSTRACT]Large Language Models (LLMs) are widely used for text generation, making it crucial to address potential bias. This study investigates ideological framing bias in LLM-generated articles, focusing on the subtle and subjective nature of such bias in journalistic contexts. We evaluate eight widely used LLMs on two datasets-POLIGEN and ECONOLEX-covering political and economic discourse where framing bias is most pronounced. Beyond text generation, LLMs are increasingly used as evaluators (LLM-as-a-judge), providing feedback that can shape human judgment or inform newer model versions. Inspired by the Socratic method, we further analyze LLMs’ feedback on their own outputs to identify inconsistencies in their reasoning. Our results show that most LLMs can accurately annotate ideologically framed text, with GPT-4o achieving human-level accuracy and high agreement with human annotators. However, Socratic probing reveals that when confronted with binary comparisons, LLMs often exhibit preference toward one perspective or perceive certain viewpoints as less biased.
[LINK]http://arxiv.org/abs/2503.16674v4
[DATE]2026-01-12 20:41:04+08:00
[CATEGORIES]cs.CL
Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control
[AUTHORS]Changhao Jiang, Jiahao Chen, Zhenghao Xiang, Zhixiong Yang, Hanchen Wang, Jiabao Zhuang, Xinmeng Che, Jiajun Sun, Hui Li, Yifei Cao, Shihan Dou, Ming Zhang, Junjie Ye, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
[ABSTRACT]Recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open-source system for long-form song generation with fine-grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy-to-deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text–music style similarity, and audio aesthetic quality, while enabling controllable segment-level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long-form song generation research. The project repository is available at https://github.com/yuhui1038/Muse.
[LINK]http://arxiv.org/abs/2601.03973v3
[DATE]2026-01-12 19:39:42+08:00
[CATEGORIES]cs.CL
SAD: A Large-Scale Strategic Argumentative Dialogue Dataset
[AUTHORS]Yongkang Liu, Jiayang Yu, Mingyang Wang, Yiqun Zhang, Ercong Nie, Shi Feng, Daling Wang, Kaisong Song, Hinrich Schütze
[ABSTRACT]Argumentation generation has attracted substantial research interest due to its central role in human reasoning and decision-making. However, most existing argumentative corpora focus on non-interactive, single-turn settings, either generating arguments from a given topic or refuting an existing argument. In practice, however, argumentation is often realized as multi-turn dialogue, where speakers defend their stances and employ diverse argumentative strategies to strengthen persuasiveness. To support deeper modeling of argumentation dialogue, we present the first large-scale \textbf{S}trategic \textbf{A}rgumentative \textbf{D}ialogue dataset, SAD, consisting of 392,822 examples. Grounded in argumentation theories, we annotate each utterance with five strategy types, allowing multiple strategies per utterance. Unlike prior datasets, SAD requires models to generate contextually appropriate arguments conditioned on the dialogue history, a specified stance on the topic, and targeted argumentation strategies. We further benchmark a range of pretrained generative models on SAD and present in-depth analysis of strategy usage patterns in argumentation.
[COMMENTS]under review
[LINK]http://arxiv.org/abs/2601.07423v1
[DATE]2026-01-12 19:11:37+08:00
[CATEGORIES]cs.CL
Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
[AUTHORS]Wen Luo, Guangyue Peng, Wei Li, Shaohang Wei, Feifan Song, Liang Wang, Nan Yang, Xingxing Zhang, Jing Jin, Furu Wei, Houfeng Wang
[ABSTRACT]Despite their impressive capabilities, large language models (LLMs) frequently generate hallucinations. Previous work shows that their internal states encode rich signals of truthfulness, yet the origins and mechanisms of these signals remain unclear. In this paper, we demonstrate that truthfulness cues arise from two distinct information pathways: (1) a Question-Anchored pathway that depends on question-answer information flow, and (2) an Answer-Anchored pathway that derives self-contained evidence from the generated answer itself. First, we validate and disentangle these pathways through attention knockout and token patching. Afterwards, we uncover notable and intriguing properties of these two mechanisms. Further experiments reveal that (1) the two mechanisms are closely associated with LLM knowledge boundaries; and (2) internal representations are aware of their distinctions. Finally, building on these insightful findings, two applications are proposed to enhance hallucination detection performance. Overall, our work provides new insight into how LLMs internally encode truthfulness, offering directions for more reliable and self-aware generative systems.
[LINK]http://arxiv.org/abs/2601.07422v1
[DATE]2026-01-12 19:10:43+08:00
[CATEGORIES]cs.CL
SPEC-RL: Accelerating On-Policy Reinforcement Learning with Speculative Rollouts
[AUTHORS]Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Xu Han, Peng Li, Anxiang Zeng, Jinsong Su
[ABSTRACT]Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods-such as parallelization, objective- and data-driven modifications, and replay buffers-either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including AIME24, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL
[COMMENTS]fixed typos
[LINK]http://arxiv.org/abs/2509.23232v3
[DATE]2026-01-12 19:06:46+08:00
[CATEGORIES]cs.LG cs.CL
SCALPEL: Selective Capability Ablation via Low-rank Parameter Editing for Large Language Model Interpretability Analysis
[AUTHORS]Zihao Fu, Xufeng Duan, Zhenguang G. Cai
[ABSTRACT]Large language models excel across diverse domains, yet their deployment in healthcare, legal systems, and autonomous decision-making remains limited by incomplete understanding of their internal mechanisms. As these models integrate into high-stakes systems, understanding how they encode capabilities has become fundamental to interpretability research. Traditional approaches identify important modules through gradient attribution or activation analysis, assuming specific capabilities map to specific components. However, this oversimplifies neural computation: modules may contribute to multiple capabilities simultaneously, while single capabilities may distribute across multiple modules. These coarse-grained analyses fail to capture fine-grained, distributed capability encoding. We present SCALPEL (Selective Capability Ablation via Low-rank Parameter Editing for Large language models), a framework representing capabilities as low-rank parameter subspaces rather than discrete modules. Our key insight is that capabilities can be characterized by low-rank modifications distributed across layers and modules, enabling precise capability removal without affecting others. By training LoRA adapters to reduce distinguishing correct from incorrect answers while preserving general language modeling quality, SCALPEL identifies low-rank representations responsible for particular capabilities while remaining disentangled from others. Experiments across diverse capability and linguistic tasks from BLiMP demonstrate that SCALPEL successfully removes target capabilities while preserving general capabilities, providing fine-grained insights into capability distribution across parameter space. Results reveal that capabilities exhibit low-rank structure and can be selectively ablated through targeted parameter-space interventions, offering nuanced understanding of capability encoding in LLMs.
[LINK]http://arxiv.org/abs/2601.07411v1
[DATE]2026-01-12 18:54:18+08:00
[CATEGORIES]cs.LG cs.CL
Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning
[AUTHORS]Ziheng Li, Liu Kang, Feng Xiao, Luxi Xing, Qingyi Si, Zhuoran Li, Weikang Gong, Deqing Yang, Yanghua Xiao, Hongcheng Guo
[ABSTRACT]Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model’s final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks demonstrate that while OAR-P sets the performance upper bound, OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.
[LINK]http://arxiv.org/abs/2601.07408v1
[DATE]2026-01-12 18:48:02+08:00
[CATEGORIES]cs.CL cs.LG
On Narrative: The Rhetorical Mechanisms of Online Polarisation
[AUTHORS]Jan Elfes, Marco Bastos, Luca Maria Aiello
[ABSTRACT]Polarisation research has demonstrated how people cluster in homogeneous groups with opposing opinions. However, this effect emerges not only through interaction between people, limiting communication between groups, but also between narratives, shaping opinions and partisan identities. Yet, how polarised groups collectively construct and negotiate opposing interpretations of reality, and whether narratives move between groups despite limited interactions, remains unexplored. To address this gap, we formalise the concept of narrative polarisation and demonstrate its measurement in 212 YouTube videos and 90,029 comments on the Israeli-Palestinian conflict. Based on structural narrative theory and implemented through a large language model, we extract the narrative roles assigned to central actors in two partisan information environments. We find that while videos produce highly polarised narratives, comments significantly reduce narrative polarisation, harmonising discourse on the surface level. However, on a deeper narrative level, recurring narrative motifs reveal additional differences between partisan groups.
[LINK]http://arxiv.org/abs/2601.07398v1
[DATE]2026-01-12 18:34:57+08:00
[CATEGORIES]cs.CL
GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap
[AUTHORS]Farzad Shami, Subhrasankha Dey, Nico Van de Weghe, Henrikki Tenkanen
[ABSTRACT]The evaluation of navigation instructions remains a persistent challenge in Vision-and-Language Navigation (VLN) research. Traditional reference-based metrics such as BLEU and ROUGE fail to capture the functional utility of spatial directives, specifically whether an instruction successfully guides a navigator to the intended destination. Although existing VLN agents could serve as evaluators, their reliance on high-fidelity visual simulators introduces licensing constraints and computational costs, and perception errors further confound linguistic quality assessment. This paper introduces GROKE(Graph-based Reasoning over OSM Knowledge for instruction Evaluation), a vision-free training-free hierarchical LLM-based framework for evaluating navigation instructions using OpenStreetMap data. Through systematic ablation studies, we demonstrate that structured JSON and textual formats for spatial information substantially outperform grid-based and visual graph representations. Our hierarchical architecture combines sub-instruction planning with topological graph navigation, reducing navigation error by 68.5% compared to heuristic and sampling baselines on the Map2Seq dataset. The agent’s execution success, trajectory fidelity, and decision patterns serve as proxy metrics for functional navigability given OSM-visible landmarks and topology, establishing a scalable and interpretable evaluation paradigm without visual dependencies. Code and data are available at https://anonymous.4open.science/r/groke.
[COMMENTS]Under Review for ACL 2026
[LINK]http://arxiv.org/abs/2601.07375v1
[DATE]2026-01-12 17:56:47+08:00
[CATEGORIES]cs.CL
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
[AUTHORS]Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, Wenfeng Liang
[ABSTRACT]While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains~(HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone’s early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.
[LINK]http://arxiv.org/abs/2601.07372v1
[DATE]2026-01-12 17:54:49+08:00
[CATEGORIES]cs.CL
Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs
[AUTHORS]Jinbo Liu, Defu Cao, Yifei Wei, Tianyao Su, Yuan Liang, Yushun Dong, Yan Liu, Yue Zhao, Xiyang Hu
[ABSTRACT]Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a framework that measures how network structure shapes leakage. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent’s memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over 10 rounds, we measure leakage as exact-match recovery of ground-truth PII from attacker outputs. We evaluate six canonical topologies (complete, ring, chain, tree, star, star-ring) across $n\in\{4,5,6\}$, attacker-target placements, and base models. Results are consistent: denser connectivity, shorter attacker-target distance, and higher target centrality increase leakage; most leakage occurs in early rounds and then plateaus; model choice shifts absolute rates but preserves topology ordering; spatiotemporal/location attributes leak more readily than identity credentials or regulated identifiers. We distill practical guidance for system design: favor sparse or hierarchical connectivity, maximize attacker-target separation, and restrict hub/shortcut pathways via topology-aware access control.
[LINK]http://arxiv.org/abs/2512.04668v3
[DATE]2026-01-12 17:40:51+08:00
[CATEGORIES]cs.CL
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
[AUTHORS]Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun
[ABSTRACT]Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to improve SD are to eliminate the need for a draft model and generate draft tokens in a retrieval-based manner in order to further alleviate the drafting overhead and significantly reduce the difficulty in deployment and applications. However, retrieval-based SD relies on a matching paradigm to retrieval the most relevant reference as the draft tokens, where these methods often fail to find matched and accurate draft tokens. To address this challenge, we propose LogitSpec to effectively expand the retrieval range and find the most relevant reference as drafts. Our LogitSpec is motivated by the observation that the logit of the last token can not only predict the next token, but also speculate the next next token. Specifically, LogitSpec generates draft tokens in two steps: (1) utilizing the last logit to speculate the next next token; (2) retrieving relevant reference for both the next token and the next next token. LogitSpec is training-free and plug-and-play, which can be easily integrated into existing LLM inference frameworks. Extensive experiments on a wide range of text generation benchmarks demonstrate that LogitSpec can achieve up to 2.61 $\times$ speedup and 3.28 mean accepted tokens per decoding step. Our code is available at https://github.com/smart-lty/LogitSpec.
[LINK]http://arxiv.org/abs/2507.01449v2
[DATE]2026-01-12 17:32:48+08:00
[CATEGORIES]cs.CL
Semantic Compression of LLM Instructions via Symbolic Metalanguages
[AUTHORS]Ernst van Gassen
[ABSTRACT]We introduce MetaGlyph, a symbolic language for compressing prompts by encoding instructions as mathematical symbols rather than prose. Unlike systems requiring explicit decoding rules, MetaGlyph uses symbols like $\in$ (membership) and $\Rightarrow$ (implication) that models already understand from their training data. We test whether these symbols work as ‘‘instruction shortcuts’’ that models can interpret without additional teaching.
We evaluate eight models across two dimensions relevant to practitioners: scale (3B-1T parameters) and accessibility (open-source for local deployment vs. proprietary APIs). MetaGlyph achieves 62-81% token reduction across all task types. For API-based deployments, this translates directly to cost savings; for local deployments, it reduces latency and memory pressure.
Results vary by model. Gemini 2.5 Flash achieves 75% semantic equivalence between symbolic and prose instructions on selection tasks, with 49.9% membership operator fidelity. Kimi K2 reaches 98.1% fidelity for implication ($\Rightarrow$) and achieves perfect (100%) accuracy on selection tasks with symbolic prompts. GPT-5.2 Chat shows the highest membership fidelity observed (91.3%), though with variable parse success across task types. Claude Haiku 4.5 achieves 100% parse success with 26% membership fidelity. Among mid-sized models, Qwen 2.5 7B shows 62% equivalence on extraction tasks. Mid-sized open-source models (7B-12B) show near-zero operator fidelity, suggesting a U-shaped relationship where sufficient scale overcomes instruction-tuning biases.
[COMMENTS]12 pages and 6 tables
[LINK]http://arxiv.org/abs/2601.07354v1
[DATE]2026-01-12 17:26:46+08:00
[CATEGORIES]cs.CL
TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees
[AUTHORS]Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun
[ABSTRACT]Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured generation, where the draft model constructs a tree of candidate tokens to explore multiple possible drafts in parallel. However, existing tree-based SD methods typically build a fixed-width, fixed-depth draft tree, which fails to adapt to the varying difficulty of tokens and contexts. As a result, the draft model cannot dynamically adjust the tree structure to early stop on difficult tokens and extend generation for simple ones. To address these challenges, we introduce TALON, a training-free, budget-driven adaptive tree expansion framework that can be plugged into existing tree-based methods. Unlike static methods, TALON constructs the draft tree iteratively until a fixed token budget is met, using a hybrid expansion strategy that adaptively allocates the node budget to each layer of the draft tree. This framework naturally shapes the draft tree into a “deep-and-narrow” form for deterministic contexts and a “shallow-and-wide” form for uncertain branches, effectively optimizing the trade-off between exploration width and generation depth under a given budget. Extensive experiments across 5 models and 6 datasets demonstrate that TALON consistently outperforms state-of-the-art EAGLE-3, achieving up to 5.16x end-to-end speedup over auto-regressive decoding.
[LINK]http://arxiv.org/abs/2601.07353v1
[DATE]2026-01-12 17:26:45+08:00
[CATEGORIES]cs.CL
Beyond Hard Masks: Progressive Token Evolution for Diffusion Language Models
[AUTHORS]Linhao Zhong, Linyu Wu, Bozhen Fang, Tianjian Feng, Chenchen Jing, Wen Wang, Jiaheng Zhang, Hao Chen, Chunhua Shen
[ABSTRACT]Diffusion Language Models (DLMs) offer a promising alternative for language modeling by enabling parallel decoding through iterative refinement. However, most DLMs rely on hard binary masking and discrete token assignments, which hinder the revision of early decisions and underutilize intermediate probabilistic representations. In this paper, we propose EvoToken-DLM, a novel diffusion-based language modeling approach that replaces hard binary masks with evolving soft token distributions. EvoToken-DLM enables a progressive transition from masked states to discrete outputs, supporting revisable decoding. To effectively support this evolution, we introduce continuous trajectory supervision, which aligns training objectives with iterative probabilistic updates. Extensive experiments across multiple benchmarks show that EvoToken-DLM consistently achieves superior performance, outperforming strong diffusion-based and masked DLM baselines. Project webpage: https://aim-uofa.github.io/EvoTokenDLM.
[COMMENTS]Project webpage: https://aim-uofa.github.io/EvoTokenDLM
[LINK]http://arxiv.org/abs/2601.07351v1
[DATE]2026-01-12 17:25:14+08:00
[CATEGORIES]cs.CL
DiffER: Diffusion Entity-Relation Modeling for Reversal Curse in Diffusion Large Language Models
[AUTHORS]Shaokai He, Kaiwen Wei, Xinyi Zeng, Xiang Chen, Xue Yang, Zhenyang Li, Jiang Zhong, Yu Tian
[ABSTRACT]The “reversal curse” refers to the phenomenon where large language models (LLMs) exhibit predominantly unidirectional behavior when processing logically bidirectional relationships. Prior work attributed this to autoregressive training – predicting the next token inherently favors left-to-right information flow over genuine bidirectional knowledge associations. However, we observe that Diffusion LLMs (DLLMs), despite being trained bidirectionally, also suffer from the reversal curse. To investigate the root causes, we conduct systematic experiments on DLLMs and identify three key reasons: 1) entity fragmentation during training, 2) data asymmetry, and 3) missing entity relations. Motivated by the analysis of these reasons, we propose Diffusion Entity-Relation Modeling (DiffER), which addresses the reversal curse through entity-aware training and balanced data construction. Specifically, DiffER introduces whole-entity masking, which mitigates entity fragmentation by predicting complete entities in a single step. DiffER further employs distribution-symmetric and relation-enhanced data construction strategies to alleviate data asymmetry and missing relations. Extensive experiments demonstrate that DiffER effectively alleviates the reversal curse in Diffusion LLMs, offering new perspectives for future research.
[LINK]http://arxiv.org/abs/2601.07347v1
[DATE]2026-01-12 17:22:10+08:00
[CATEGORIES]cs.CL
GlobalRAG: Enhancing Global Reasoning in Multi-hop Question Answering via Reinforcement Learning
[AUTHORS]Jinchang Luo, Mingquan Cheng, Fan Wan, Ni Li, Xiaoling Xia, Shuangshuang Tian, Tingcheng Bian, Haiwei Wang, Haohuan Fu, Yan Tao
[ABSTRACT]Reinforcement learning has recently shown promise in improving retrieval-augmented generation (RAG). Despite these advances, its effectiveness in multi-hop question answering (QA) remains limited by two fundamental limitations: (i) global planning absence to structure multi-step reasoning, and (ii) unfaithful execution, which hinders effective query formulation and consistent use of retrieved evidence. We propose GlobalRAG, a reinforcement learning framework designed to enhance global reasoning in multi-hop QA. GlobalRAG decomposes questions into subgoals, coordinates retrieval with reasoning, and refines evidence iteratively. To guide this process, we introduce Planning Quality Reward and SubGoal Completion Reward, which encourage coherent planning and reliable subgoal execution. In addition, a progressive weight annealing strategy balances process-oriented and outcome-based objectives. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that GlobalRAG significantly outperforms strong baselines while using only 8k training data (42% of the training data used by strong baselines), achieving average improvements of 14.2% in both EM and F1.
[COMMENTS]8 pages, 3 figures, 4 tables
[LINK]http://arxiv.org/abs/2510.20548v3
[DATE]2026-01-12 17:03:57+08:00
[CATEGORIES]cs.CL
Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation
[AUTHORS]Yanzhi Tian, Cunxiang Wang, Zeming Liu, Heyan Huang, Wenbo Yu, Dawei Song, Jie Tang, Yuhang Guo
[ABSTRACT]Large Language Models (LLMs) have significantly advanced Machine Translation (MT), applying them to linguistically complex domains-such as Social Network Services, literature etc. In these scenarios, translations often require handling non-literal expressions, leading to the inaccuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problem. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework, centered by a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 meta score compared with current metrics. Further experiments demonstrate the robustness of RATE to general-domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.
[LINK]http://arxiv.org/abs/2601.07338v1
[DATE]2026-01-12 17:03:42+08:00
[CATEGORIES]cs.CL
BayesRAG: Probabilistic Mutual Evidence Corroboration for Multimodal Retrieval-Augmented Generation
[AUTHORS]Xuan Li, Yining Wang, Haocai Luo, Shengping Liu, Jerry Liang, Ying Fu, Weihuang, Jun Yu, Junnan Zhu
[ABSTRACT]Retrieval-Augmented Generation (RAG) has become a pivotal paradigm for Large Language Models (LLMs), yet current approaches struggle with visually rich documents by treating text and images as isolated retrieval targets. Existing methods relying solely on cosine similarity often fail to capture the semantic reinforcement provided by cross-modal alignment and layout-induced coherence. To address these limitations, we propose BayesRAG, a novel multimodal retrieval framework grounded in Bayesian inference and Dempster-Shafer evidence theory. Unlike traditional approaches that rank candidates strictly by similarity, BayesRAG models the intrinsic consistency of retrieved candidates across modalities as probabilistic evidence to refine retrieval confidence. Specifically, our method computes the posterior association probability for combinations of multimodal retrieval results, prioritizing text-image pairs that mutually corroborate each other in terms of both semantics and layout. Extensive experiments demonstrate that BayesRAG significantly outperforms state-of-the-art (SOTA) methods on challenging multimodal benchmarks. This study establishes a new paradigm for multimodal retrieval fusion that effectively resolves the isolation of heterogeneous modalities through an evidence fusion mechanism and enhances the robustness of retrieval outcomes. Our code is available at https://github.com/TioeAre/BayesRAG.
[COMMENTS]17 pages, 8 figures
[LINK]http://arxiv.org/abs/2601.07329v1
[DATE]2026-01-12 16:53:14+08:00
[CATEGORIES]cs.CL
How to predict creativity ratings from written narratives: A comparison of co-occurrence and textual forma mentis networks
[AUTHORS]Roberto Passaro, Edith Haim, Massimo Stella
[ABSTRACT]This tutorial paper provides a step-by-step workflow for building and analysing semantic networks from short creative texts. We introduce and compare two widely used text-to-network approaches: word co-occurrence networks and textual forma mentis networks (TFMNs). We also demonstrate how they can be used in machine learning to predict human creativity ratings. Using a corpus of 1029 short stories, we guide readers through text preprocessing, network construction, feature extraction (structural measures, spreading-activation indices, and emotion scores), and application of regression models. We evaluate how network-construction choices influence both network topology and predictive performance. Across all modelling settings, TFMNs consistently outperformed co-occurrence networks through lower prediction errors (best MAE = 0.581 for TFMN, vs 0.592 for co-occurrence with window size 3). Network-structural features dominated predictive performance (MAE = 0.591 for TFMN), whereas emotion features performed worse (MAE = 0.711 for TFMN) and spreading-activation measures contributed little (MAE = 0.788 for TFMN). This paper offers practical guidance for researchers interested in applying network-based methods for cognitive fields like creativity research. we show when syntactic networks are preferable to surface co-occurrence models, and provide an open, reproducible workflow accessible to newcomers in the field, while also offering deeper methodological insight for experienced researchers.
[LINK]http://arxiv.org/abs/2601.07327v1
[DATE]2026-01-12 16:52:41+08:00
[CATEGORIES]cs.CL
Mitrasamgraha: A Comprehensive Classical Sanskrit Machine Translation Dataset
[AUTHORS]Sebastian Nehrdich, David Allport, Sven Sellmer, Jivnesh Sandhan, Manoj Balaji Jagadeeshan, Pawan Goyal, Sujeet Kumar, Kurt Keutzer
[ABSTRACT]While machine translation is regarded as a “solved problem” for many high-resource languages, close analysis quickly reveals that this is not the case for content that shows challenges such as poetic language, philosophical concepts, multi-layered metaphorical expressions, and more. Sanskrit literature is a prime example of this, as it combines a large number of such challenges in addition to inherent linguistic features like sandhi, compounding, and heavy morphology, which further complicate NLP downstream tasks. It spans multiple millennia of text production time as well as a large breadth of different domains, ranging from ritual formulas via epic narratives, philosophical treatises, poetic verses up to scientific material. As of now, there is a strong lack of publicly available resources that cover these different domains and temporal layers of Sanskrit. We therefore introduce Mitrasamgraha, a high-quality Sanskrit-to-English machine translation dataset consisting of 391,548 bitext pairs, more than four times larger than the largest previously available Sanskrit dataset Itih=asa. It covers a time period of more than three millennia and a broad range of historical Sanskrit domains. In contrast to web-crawled datasets, the temporal and domain annotation of this dataset enables fine-grained study of domain and time period effects on MT performance. We also release a validation set consisting of 5,587 and a test set consisting of 5,552 post-corrected bitext pairs. We conduct experiments benchmarking commercial and open models on this dataset and fine-tune NLLB and Gemma models on the dataset, showing significant improvements, while still recognizing significant challenges in the translation of complex compounds, philosophical concepts, and multi-layered metaphors. We also analyze how in-context learning on this dataset impacts the performance of commercial models
[LINK]http://arxiv.org/abs/2601.07314v1
[DATE]2026-01-12 16:37:15+08:00
[CATEGORIES]cs.CL
PsyCLIENT: Client Simulation via Conversational Trajectory Modeling for Trainee Practice and Model Evaluation in Mental Health Counseling
[AUTHORS]Huachuan Qiu, Zhaoming Chen, Yuqian Chen, Yuan Xie, Yu Lu, Zhenzhong Lan
[ABSTRACT]LLM-based client simulation has emerged as a promising tool for training novice counselors and evaluating automated counseling systems. However, existing client simulation approaches face three key challenges: (1) limited diversity and realism in client profiles, (2) the lack of a principled framework for modeling realistic client behaviors, and (3) a scarcity in Chinese-language settings. To address these limitations, we propose PsyCLIENT, a novel simulation framework grounded in conversational trajectory modeling. By conditioning LLM generation on predefined real-world trajectories that incorporate explicit behavior labels and content constraints, our approach ensures diverse and realistic interactions. We further introduce PsyCLIENT-CP, the first open-source Chinese client profile dataset, covering 60 distinct counseling topics. Comprehensive evaluations involving licensed professional counselors demonstrate that PsyCLIENT significantly outperforms baselines in terms of authenticity and training effectiveness. Notably, the simulated clients are nearly indistinguishable from human clients, achieving an about 95\% expert confusion rate in discrimination tasks. These findings indicate that conversational trajectory modeling effectively bridges the gap between theoretical client profiles and dynamic, realistic simulations, offering a robust solution for mental health education and research. Code and data will be released to facilitate future research in mental health counseling.
[LINK]http://arxiv.org/abs/2601.07312v1
[DATE]2026-01-12 16:33:05+08:00
[CATEGORIES]cs.CL
KVzap: Fast, Adaptive, and Faithful KV Cache Pruning
[AUTHORS]Simon Jegou, Maximilian Jeblick
[ABSTRACT]Growing context lengths in transformer-based language models have made the key-value (KV) cache a critical inference bottleneck. While many KV cache pruning methods have been proposed, they have not yet been adopted in major inference engines due to speed–accuracy trade-offs. We introduce KVzap, a fast, input-adaptive approximation of KVzip that works in both prefilling and decoding. On Qwen3-8B, Llama-3.1-8B-Instruct, and Qwen3-32B across long-context and reasoning tasks, KVzap achieves $2$–$4\times$ KV cache compression with negligible accuracy loss and achieves state-of-the-art performance on the KVpress leaderboard. Code and models are available at https://github.com/NVIDIA/kvpress.
[LINK]http://arxiv.org/abs/2601.07891v1
[DATE]2026-01-12 16:27:47+08:00
[CATEGORIES]cs.LG cs.CL
AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving
[AUTHORS]Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi
[ABSTRACT]Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for drafting and LLMs for verification, has emerged as a compelling technique to accelerate LLM inference. However, existing speculative decoding solutions often fail to adapt to fluctuating workloads and dynamic system environments, resulting in impaired performance and SLO violations. In this paper, we introduce AdaSpec, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads and system configurations. AdaSpec proposes a theoretical model to analyze and predict the efficiency of speculative strategies across diverse scenarios. Additionally, it implements intelligent drafting and verification algorithms to maximize performance while ensuring high SLO attainment. Experimental results on real-world LLM service traces demonstrate that AdaSpec consistently meets SLOs and achieves substantial performance improvements, delivering up to 66% speedup compared to state-of-the-art speculative inference systems. The source code is publicly available at https://github.com/cerebellumking/AdaSpec
[COMMENTS]This paper is accepted by ACM SoCC 2025
[LINK]http://arxiv.org/abs/2503.05096v2
[DATE]2026-01-12 16:18:53+08:00
[CATEGORIES]cs.CL
LRAS: Advanced Legal Reasoning with Agentic Search
[AUTHORS]Yujin Zhou, Chuxue Cao, Jinluan Yang, Lijun Wu, Conghui He, Sirui Han, Yike Guo
[ABSTRACT]While Large Reasoning Models (LRMs) have demonstrated exceptional logical capabilities in mathematical domains, their application to the legal field remains hindered by the strict requirements for procedural rigor and adherence to legal logic. Existing legal LLMs, which rely on “closed-loop reasoning” derived solely from internal parametric knowledge, frequently suffer from lack of self-awareness regarding their knowledge boundaries, leading to confident yet incorrect conclusions. To address this challenge, we present Legal Reasoning with Agentic Search (LRAS), the first framework designed to transition legal LLMs from static and parametric “closed-loop thinking” to dynamic and interactive “Active Inquiry”. By integrating Introspective Imitation Learning and Difficulty-aware Reinforcement Learning, LRAS enables LRMs to identify knowledge boundaries and handle legal reasoning complexity. Empirical results demonstrate that LRAS outperforms state-of-the-art baselines by 8.2-32\%, with the most substantial gains observed in tasks requiring deep reasoning with reliable knowledge. We will release our data and models for further exploration soon.
[LINK]http://arxiv.org/abs/2601.07296v1
[DATE]2026-01-12 16:07:35+08:00
[CATEGORIES]cs.CL
ReasonTabQA: A Comprehensive Benchmark for Table Question Answering from Real World Industrial Scenarios
[AUTHORS]Changzai Pan, Jie Zhang, Kaiwen Wei, Chenshuo Pan, Yu Zhao, Jingwang Huang, Jian Yang, Zhenhe Wu, Haoyang Zeng, Xiaoyan Gu, Weichao Sun, Yanbo Zhai, Yujie Mao, Zhuoru Jiang, Jiang Zhong, Shuangyong Song, Yongxiang Li, Zhongjiang He
[ABSTRACT]Recent advancements in Large Language Models (LLMs) have significantly catalyzed table-based question answering (TableQA). However, existing TableQA benchmarks often overlook the intricacies of industrial scenarios, which are characterized by multi-table structures, nested headers, and massive scales. These environments demand robust table reasoning through deep structured inference, presenting a significant challenge that remains inadequately addressed by current methodologies. To bridge this gap, we present ReasonTabQA, a large-scale bilingual benchmark encompassing 1,932 tables across 30 industry domains such as energy and automotive. ReasonTabQA provides high-quality annotations for both final answers and explicit reasoning chains, supporting both thinking and no-thinking paradigms. Furthermore, we introduce TabCodeRL, a reinforcement learning method that leverages table-aware verifiable rewards to guide the generation of logical reasoning paths. Extensive experiments on ReasonTabQA and 4 TableQA datasets demonstrate that while TabCodeRL yields substantial performance gains on open-source LLMs, the persistent performance gap on ReasonTabQA underscores the inherent complexity of real-world industrial TableQA.
[LINK]http://arxiv.org/abs/2601.07280v1
[DATE]2026-01-12 15:36:06+08:00
[CATEGORIES]cs.CL
Aligning the Spectrum: Hybrid Graph Pre-training and Prompt Tuning across Homophily and Heterophily
[AUTHORS]Haitong Luo, Suhang Wang, Weiyao Zhang, Ruiqi Meng, Xuying Meng, Yujun Zhang
[ABSTRACT]Graph “pre-training and prompt-tuning” aligns downstream tasks with pre-trained objectives to enable efficient knowledge transfer under limited supervision. However, current methods typically rely on single-filter backbones (e.g., low-pass), whereas real-world graphs exhibit inherent spectral diversity. Our theoretical \textit{Spectral Specificity} principle reveals that effective knowledge transfer requires alignment between pre-trained spectral filters and the intrinsic spectrum of downstream graphs. This identifies two fundamental limitations: (1) Knowledge Bottleneck: single-filter models suffer from irreversible information loss by suppressing signals from other frequency bands (e.g., high-frequency); (2) Utilization Bottleneck: spectral mismatches between pre-trained filters and downstream spectra lead to significant underutilization of pre-trained knowledge. To bridge this gap, we propose HS-GPPT. We utilize a hybrid spectral backbone to construct an abundant knowledge basis. Crucially, we introduce Spectral-Aligned Prompt Tuning to actively align the downstream graph’s spectrum with diverse pre-trained filters, facilitating comprehensive knowledge utilization across both homophily and heterophily. Extensive experiments validate the effectiveness under both transductive and inductive learning settings.
[COMMENTS]Under Review
[LINK]http://arxiv.org/abs/2508.11328v3
[DATE]2026-01-12 15:29:53+08:00
[CATEGORIES]cs.LG cs.CL
Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition
[AUTHORS]Junhong Ye, Xu Yuan, Xinying Qiu
[ABSTRACT]Accurate recognition of personally identifiable information (PII) is central to automated text anonymization. This paper investigates the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for PII recognition. Using annotated corpora from healthcare (I2B2), legal (TAB), and biography (Wikipedia), we evaluate models across four dimensions: in-domain performance, cross-domain transferability, fusion, and few-shot learning. Results show legal-domain data transfers well to biographical texts, while medical domains resist incoming transfer. Fusion benefits are domain-specific, and high-quality recognition is achievable with only 10% of training data in low-specialization domains.
[COMMENTS]Accepted to CLNLP 2025
[LINK]http://arxiv.org/abs/2507.11862v3
[DATE]2026-01-12 15:14:17+08:00
[CATEGORIES]cs.CL
SpecDetect: Simple, Fast, and Training-Free Detection of LLM-Generated Text via Spectral Analysis
[AUTHORS]Haitong Luo, Weiyao Zhang, Suhang Wang, Wenji Zou, Chungang Lin, Xuying Meng, Yujun Zhang
[ABSTRACT]The proliferation of high-quality text from Large Language Models (LLMs) demands reliable and efficient detection methods. While existing training-free approaches show promise, they often rely on surface-level statistics and overlook fundamental signal properties of the text generation process. In this work, we reframe detection as a signal processing problem, introducing a novel paradigm that analyzes the sequence of token log-probabilities in the frequency domain. By systematically analyzing the signal’s spectral properties using the global Discrete Fourier Transform (DFT) and the local Short-Time Fourier Transform (STFT), we find that human-written text consistently exhibits significantly higher spectral energy. This higher energy reflects the larger-amplitude fluctuations inherent in human writing compared to the suppressed dynamics of LLM-generated text. Based on this key insight, we construct SpecDetect, a detector built on a single, robust feature from the global DFT: DFT total energy. We also propose an enhanced version, SpecDetect++, which incorporates a sampling discrepancy mechanism to further boost robustness. Extensive experiments show that our approach outperforms the state-of-the-art model while running in nearly half the time. Our work introduces a new, efficient, and interpretable pathway for LLM-generated text detection, showing that classical signal processing techniques offer a surprisingly powerful solution to this modern challenge.
[COMMENTS]AAAI’26 Oral
[LINK]http://arxiv.org/abs/2508.11343v3
[DATE]2026-01-12 15:11:14+08:00
[CATEGORIES]cs.CL
Put the Space of LoRA Initialization to the Extreme to Preserve Pre-trained Knowledge
[AUTHORS]Pengwei Tang, Xiaolin Hu, Yong Liu, Lizhong Ding, Dongjie Zhang, Xing Wu, Debing Zhang
[ABSTRACT]Low-Rank Adaptation (LoRA) is the leading parameter-efficient fine-tuning method for Large Language Models (LLMs), but it still suffers from catastrophic forgetting. Recent work has shown that specialized LoRA initialization can alleviate catastrophic forgetting. There are currently two approaches to LoRA initialization aimed at preventing knowledge forgetting during fine-tuning: (1) making residual weights close to pre-trained weights, and (2) ensuring the space of LoRA initialization is orthogonal to pre-trained knowledge. The former is what current methods strive to achieve, while the importance of the latter is not sufficiently recognized. We find that the space of LoRA initialization is the key to preserving pre-trained knowledge rather than the residual weights. Existing methods like MiLoRA propose making the LoRA initialization space orthogonal to pre-trained weights. However, MiLoRA utilizes the null space of pre-trained weights. Compared to pre-trained weights, the input activations of pre-trained knowledge take into account the parameters of all previous layers as well as the input data, while pre-trained weights only contain information from the current layer. Moreover, we find that the effective ranks of input activations are much smaller than those of pre-trained weights. Thus, the null space of activations is more accurate and contains less pre-trained knowledge information compared to that of weights. Based on these, we introduce LoRA-Null, our proposed method that initializes LoRA in the null space of activations. Experimental results show that LoRA-Null effectively preserves the pre-trained world knowledge of LLMs while achieving good fine-tuning performance, as evidenced by extensive experiments. Code is available at {https://github.com/HungerPWAY/LoRA-Null}.
[COMMENTS]Accepted at AAAI 2026. We rediscovered why our approach works from the perspective of the LoRA initialization space. Accordingly, we added new experiments and also removed inappropriate experiments (those without catastrophic forgetting)
[LINK]http://arxiv.org/abs/2503.02659v2
[DATE]2026-01-12 15:08:54+08:00
[CATEGORIES]cs.CL
A Scalable Unsupervised Framework for multi-aspect labeling of Multilingual and Multi-Domain Review Data
[AUTHORS]Jiin Park, Misuk Kim
[ABSTRACT]Effectively analyzing online review data is essential across industries. However, many existing studies are limited to specific domains and languages or depend on supervised learning approaches that require large-scale labeled datasets. To address these limitations, we propose a multilingual, scalable, and unsupervised framework for cross-domain aspect detection. This framework is designed for multi-aspect labeling of multilingual and multi-domain review data. In this study, we apply automatic labeling to Korean and English review datasets spanning various domains and assess the quality of the generated labels through extensive experiments. Aspect category candidates are first extracted through clustering, and each review is then represented as an aspect-aware embedding vector using negative sampling. To evaluate the framework, we conduct multi-aspect labeling and fine-tune several pretrained language models to measure the effectiveness of the automatically generated labels. Results show that these models achieve high performance, demonstrating that the labels are suitable for training. Furthermore, comparisons with publicly available large language models highlight the framework’s superior consistency and scalability when processing large-scale data. A human evaluation also confirms that the quality of the automatic labels is comparable to those created manually. This study demonstrates the potential of a robust multi-aspect labeling approach that overcomes limitations of supervised methods and is adaptable to multilingual, multi-domain environments. Future research will explore automatic review summarization and the integration of artificial intelligence agents to further improve the efficiency and depth of review analysis.
[COMMENTS]36 pages, 10 figures. Published in Knowledge-Based Systems
[LINK]http://arxiv.org/abs/2505.09286v2
[DATE]2026-01-12 15:03:51+08:00
[CATEGORIES]cs.CL
ActiShade: Activating Overshadowed Knowledge to Guide Multi-Hop Reasoning in Large Language Models
[AUTHORS]Huipeng Ma, Luan Zhang, Dandan Song, Linmei Hu, Yuhang Tian, Jun Yang, Changzhi Zhou, Chenhao Li, Yizhou Jin, Xudong Li, Meng Lin, Mingxing Zhang, Shuhao Zhang
[ABSTRACT]In multi-hop reasoning, multi-round retrieval-augmented generation (RAG) methods typically rely on LLM-generated content as the retrieval query. However, these approaches are inherently vulnerable to knowledge overshadowing - a phenomenon where critical information is overshadowed during generation. As a result, the LLM-generated content may be incomplete or inaccurate, leading to irrelevant retrieval and causing error accumulation during the iteration process. To address this challenge, we propose ActiShade, which detects and activates overshadowed knowledge to guide large language models (LLMs) in multi-hop reasoning. Specifically, ActiShade iteratively detects the overshadowed keyphrase in the given query, retrieves documents relevant to both the query and the overshadowed keyphrase, and generates a new query based on the retrieved documents to guide the next-round iteration. By supplementing the overshadowed knowledge during the formulation of next-round queries while minimizing the introduction of irrelevant noise, ActiShade reduces the error accumulation caused by knowledge overshadowing. Extensive experiments show that ActiShade outperforms existing methods across multiple datasets and LLMs.
[COMMENTS]Accepted to AAAI 2026
[LINK]http://arxiv.org/abs/2601.07260v1
[DATE]2026-01-12 14:57:31+08:00
[CATEGORIES]cs.CL
Learning to Trust the Crowd: A Multi-Model Consensus Reasoning Engine for Large Language Models
[AUTHORS]Pranav Kallem
[ABSTRACT]Large language models (LLMs) achieve strong average performance yet remain unreliable at the instance level, with frequent hallucinations, brittle failures, and poorly calibrated confidence. We study reliability through the lens of multi-model consensus: given responses from several heterogeneous LLMs, can we learn which answer is most likely correct for a given query? We introduce a Multi-Model Consensus Reasoning Engine that treats the set of LLM outputs as input to a supervised meta-learner. The system maps natural language responses into structured features using semantic embeddings, pairwise similarity and clustering statistics, lexical and structural cues, reasoning-quality scores, confidence estimates, and model-specific priors, and then applies gradient-boosted trees, listwise ranking, and graph neural networks over similarity graphs of answers. Using three open-weight LLMs evaluated on compact, resource-constrained subsets of GSM8K, ARC-Challenge, HellaSwag, and TruthfulQA, our best graph-attention-based consensus model improves macro-average accuracy by 4.6 percentage points over the strongest single LLM and by 8.1 points over majority vote, while also yielding lower Brier scores and fewer TruthfulQA hallucinations. Ablation and feature-importance analyses show that semantic agreement and clustering features are most influential, with reasoning-quality and model-prior features providing complementary gains, suggesting supervised multi-model consensus is a practical route toward more reliable LLM behavior, even in a modest single-machine setup.
[LINK]http://arxiv.org/abs/2601.07245v1
[DATE]2026-01-12 14:27:06+08:00
[CATEGORIES]cs.CL cs.LG
Lost in the Noise: How Reasoning Models Fail with Contextual Distractors
[AUTHORS]Seongyun Lee, Yongrae Jo, Minju Seo, Moontae Lee, Minjoon Seo
[ABSTRACT]Recent advances in reasoning models and agentic AI systems have led to an increased reliance on diverse external information. However, this shift introduces input contexts that are inherently noisy, a reality that current sanitized benchmarks fail to capture. We introduce NoisyBench, a comprehensive benchmark that systematically evaluates model robustness across 11 datasets in RAG, reasoning, alignment, and tool-use tasks against diverse noise types, including random documents, irrelevant chat histories, and hard negative distractors. Our evaluation reveals a catastrophic performance drop of up to 80% in state-of-the-art models when faced with contextual distractors. Crucially, we find that agentic workflows often amplify these errors by over-trusting noisy tool outputs, and distractors can trigger emergent misalignment even without adversarial intent. We find that prompting, context engineering, SFT, and outcome-reward only RL fail to ensure robustness; in contrast, our proposed Rationale-Aware Reward (RARE) significantly strengthens resilience by incentivizing the identification of helpful information within noise. Finally, we uncover an inverse scaling trend where increased test-time computation leads to worse performance in noisy settings and demonstrate via attention visualization that models disproportionately focus on distractor tokens, providing vital insights for building the next generation of robust, reasoning-capable agents.
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2601.07226v1
[DATE]2026-01-12 13:43:51+08:00
[CATEGORIES]cs.CL
TagRAG: Tag-guided Hierarchical Knowledge Graph Retrieval-Augmented Generation
[AUTHORS]Wenbiao Tao, Xinyuan Li, Yunshi Lan, Weining Qian
[ABSTRACT]Retrieval-Augmented Generation enhances language models by retrieving external knowledge to support informed and grounded responses. However, traditional RAG methods rely on fragment-level retrieval, limiting their ability to address query-focused summarization queries. GraphRAG introduces a graph-based paradigm for global knowledge reasoning, yet suffers from inefficiencies in information extraction, costly resource consumption, and poor adaptability to incremental updates. To overcome these limitations, we propose TagRAG, a tag-guided hierarchical knowledge graph RAG framework designed for efficient global reasoning and scalable graph maintenance. TagRAG introduces two key components: (1) Tag Knowledge Graph Construction, which extracts object tags and their relationships from documents and organizes them into hierarchical domain tag chains for structured knowledge representation, and (2) Tag-Guided Retrieval-Augmented Generation, which retrieves domain-centric tag chains to localize and synthesize relevant knowledge during inference. This design significantly adapts to smaller language models, improves retrieval granularity, and supports efficient knowledge increment. Extensive experiments on UltraDomain datasets spanning Agriculture, Computer Science, Law, and cross-domain settings demonstrate that TagRAG achieves an average winning rate of 78.36% against baselines while maintaining about 14.6x construction and 1.9x retrieval efficiency compared with GraphRAG.
[LINK]http://arxiv.org/abs/2601.05254v2
[DATE]2026-01-12 13:32:52+08:00
[CATEGORIES]cs.CL
The Roots of Performance Disparity in Multilingual Language Models: Intrinsic Modeling Difficulty or Design Choices?
[AUTHORS]Chen Shani, Yuval Reif, Nathan Roll, Dan Jurafsky, Ekaterina Shutova
[ABSTRACT]Multilingual language models (LMs) promise broader NLP access, yet current systems deliver uneven performance across the world’s languages. This survey examines why these gaps persist and whether they reflect intrinsic linguistic difficulty or modeling artifacts. We organize the literature around two questions: do linguistic disparities arise from representation and allocation choices (e.g., tokenization, encoding, data exposure, parameter sharing) rather than inherent complexity; and which design choices mitigate inequities across typologically diverse languages. We review linguistic features, such as orthography, morphology, lexical diversity, syntax, information density, and typological distance, linking each to concrete modeling mechanisms. Gaps often shrink when segmentation, encoding, and data exposure are normalized, suggesting much apparent difficulty stems from current modeling choices. We synthesize these insights into design recommendations for tokenization, sampling, architectures, and evaluation to support more balanced multilingual LMs.
[LINK]http://arxiv.org/abs/2601.07220v1
[DATE]2026-01-12 13:25:39+08:00
[CATEGORIES]cs.CL
MI-PRUN: Optimize Large Language Model Pruning via Mutual Information
[AUTHORS]Hao Zhang, Zhibin Zhang, Guangxin Wu, He Chen, Jiafeng Guo, Xueqi Cheng
[ABSTRACT]Large Language Models (LLMs) have become indispensable across various domains, but this comes at the cost of substantial computational and memory resources. Model pruning addresses this by removing redundant components from models. In particular, block pruning can achieve significant compression and inference acceleration. However, existing block pruning methods are often unstable and struggle to attain globally optimal solutions. In this paper, we propose a mutual information based pruning method MI-PRUN for LLMs. Specifically, we leverages mutual information to identify redundant blocks by evaluating transitions in hidden states. Additionally, we incorporate the Data Processing Inequality (DPI) to reveal the relationship between the importance of entire contiguous blocks and that of individual blocks. Moreover, we develop the Fast-Block-Select algorithm, which iteratively updates block combinations to achieve a globally optimal solution while significantly improving the efficiency. Extensive experiments across various models and datasets demonstrate the stability and effectiveness of our method.
[COMMENTS]10 pages
[LINK]http://arxiv.org/abs/2601.07212v1
[DATE]2026-01-12 13:06:01+08:00
[CATEGORIES]cs.CL
MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization
[AUTHORS]Yang Zhao, Hepeng Wang, Xiao Ding, Yangou Ouyang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu
[ABSTRACT]Group-Relative Policy Optimization (GRPO) has emerged as an efficient paradigm for aligning Large Language Models (LLMs), yet its efficacy is primarily confined to domains with verifiable ground truths. Extending GRPO to open-domain settings remains a critical challenge, as unconstrained generation entails multi-faceted and often conflicting objectives - such as creativity versus factuality - where rigid, static reward scalarization is inherently suboptimal. To address this, we propose MAESTRO (Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization), which introduces a meta-cognitive orchestration layer that treats reward scalarization as a dynamic latent policy, leveraging the model’s terminal hidden states as a semantic bottleneck to perceive task-specific priorities. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal. Across seven benchmarks, MAESTRO consistently outperforms single-reward and static multi-objective baselines, while preserving the efficiency advantages of GRPO, and in some settings even reducing redundant generation.
[LINK]http://arxiv.org/abs/2601.07208v1
[DATE]2026-01-12 13:02:48+08:00
[CATEGORIES]cs.LG cs.CL
Relink: Constructing Query-Driven Evidence Graph On-the-Fly for GraphRAG
[AUTHORS]Manzong Huang, Chenyang Bu, Yi He, Xingrui Zhuo, Xindong Wu
[ABSTRACT]Graph-based Retrieval-Augmented Generation (GraphRAG) mitigates hallucinations in Large Language Models (LLMs) by grounding them in structured knowledge. However, current GraphRAG methods are constrained by a prevailing \textit{build-then-reason} paradigm, which relies on a static, pre-constructed Knowledge Graph (KG). This paradigm faces two critical challenges. First, the KG’s inherent incompleteness often breaks reasoning paths. Second, the graph’s low signal-to-noise ratio introduces distractor facts, presenting query-relevant but misleading knowledge that disrupts the reasoning process.
To address these challenges, we argue for a \textit{reason-and-construct} paradigm and propose Relink, a framework that dynamically builds a query-specific evidence graph. To tackle incompleteness, \textbf{Relink} instantiates required facts from a latent relation pool derived from the original text corpus, repairing broken paths on the fly. To handle misleading or distractor facts, Relink employs a unified, query-aware evaluation strategy that jointly considers candidates from both the KG and latent relations, selecting those most useful for answering the query rather than relying on their pre-existence. This empowers Relink to actively discard distractor facts and construct the most faithful and precise evidence path for each query.
Extensive experiments on five Open-Domain Question Answering benchmarks show that Relink achieves significant average improvements of 5.4\% in EM and 5.2\% in F1 over leading GraphRAG baselines, demonstrating the superiority of our proposed framework.
[COMMENTS]Accepted by AAAI 2026
[LINK]http://arxiv.org/abs/2601.07192v1
[DATE]2026-01-12 12:35:23+08:00
[CATEGORIES]cs.CL
Structured Reasoning for Large Language Models
[AUTHORS]Jinyi Han, Zixiang Di, Zishang Jiang, Ying Liao, Jiaqing Liang, Yongqi Wang, Yanghua Xiao
[ABSTRACT]Large language models (LLMs) achieve strong performance by generating long chains of thought, but longer traces always introduce redundant or ineffective reasoning steps. One typical behavior is that they often perform unnecessary verification and revisions even if they have reached the correct answers. This limitation stems from the unstructured nature of reasoning trajectories and the lack of targeted supervision for critical reasoning abilities. To address this, we propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components. We mainly implement SCR using a Generate-Verify-Revise paradigm. Specifically, we construct structured training data and apply Dynamic Termination Supervision to guide the model in deciding when to terminate reasoning. To avoid interference between learning signals for different reasoning abilities, we adopt a progressive two-stage reinforcement learning strategy: the first stage targets initial generation and self-verification, and the second stage focuses on revision. Extensive experiments on three backbone models show that SCR substantially improves reasoning efficiency and self-verification. Besides, compared with existing reasoning paradigms, it reduces output token length by up to 50%.
[LINK]http://arxiv.org/abs/2601.07180v1
[DATE]2026-01-12 12:04:01+08:00
[CATEGORIES]cs.CL
Credible Plan-Driven RAG Method for Multi-Hop Question Answering
[AUTHORS]Ningning Zhang, Chi Zhang, Zhizhong Tan, Xingxing Yang, Weiping Deng, Wenyong Wang
[ABSTRACT]Retrieval-augmented generation (RAG) has demonstrated strong performance in single-hop question answering (QA) by integrating external knowledge into large language models (LLMs). However, its effectiveness remains limited in multi-hop QA, which demands both stable reasoning and factual consistency. Existing approaches often provide partial solutions, addressing either reasoning trajectory stability or factual verification, but rarely achieving both simultaneously. To bridge this gap, we propose PAR-RAG, a three-stage Plan-then-Act-and-Review framework inspired by the PDCA cycle. PAR-RAG incorporates semantic complexity as a unifying principle through three key components: (i) complexity-aware exemplar selection guides plan generation by aligning decomposition granularity with question difficulty, thereby stabilizing reasoning trajectories; (ii) execution follows a structured retrieve-then-read process; and (iii) dual verification identifies and corrects intermediate errors while dynamically adjusting verification strength based on question complexity: emphasizing accuracy for simple queries and multi-evidence consistency for complex ones. This cognitively inspired framework integrates theoretical grounding with practical robustness. Experiments across diverse benchmarks demonstrate that PAR-RAG consistently outperforms competitive baselines, while ablation studies confirm the complementary roles of complexity-aware planning and dual verification. Collectively, these results establish PAR-RAG as a robust and generalizable framework for reliable multi-hop reasoning.
[COMMENTS]24 pages, 7 figures
[LINK]http://arxiv.org/abs/2504.16787v3
[DATE]2026-01-12 11:29:39+08:00
[CATEGORIES]cs.CL
TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios
[AUTHORS]Zehan Li, Hongjie Chen, Qing Wang, Yuxin Zhang, Jing Zhou, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Xuelong Li
[ABSTRACT]Spoken language models (SLMs) have advanced rapidly in recent years, accompanied by a growing number of evaluation benchmarks. However, most existing benchmarks emphasize task completion and capability scaling, while remaining poorly aligned with how users interact with SLMs in real-world spoken conversations. Effective spoken interaction requires not only accurate understanding of user intent and content, but also the ability to respond with appropriate interactional strategies. In this paper, we present TELEVAL, a dynamic, user-centered benchmark for evaluating SLMs in realistic Chinese spoken interaction scenarios. TELEVAL consolidates evaluation into two core aspects. Reliable Content Fulfillment assesses whether models can comprehend spoken inputs and produce semantically correct responses. Interactional Appropriateness evaluates whether models act as socially capable interlocutors, requiring them not only to generate human-like, colloquial responses, but also to implicitly incorporate paralinguistic cues for natural interaction. Experiments reveal that, despite strong performance on semantic and knowledge-oriented tasks, current SLMs still struggle to produce natural and interactionally appropriate responses, highlighting the need for more interaction-faithful evaluation.
[LINK]http://arxiv.org/abs/2507.18061v3
[DATE]2026-01-12 11:01:57+08:00
[CATEGORIES]cs.CL
Can Large Language Models Understand, Reason About, and Generate Code-Switched Text?
[AUTHORS]Genta Indra Winata, David Anugraha, Patrick Amadeus Irawan, Anirban Das, Haneul Yoo, Paresh Dashore, Shreyas Kulkarni, Ruochen Zhang, Haruki Sakajo, Frederikus Hudi, Anaelia Ovalle, Syrielle Montariol, Felix Gaschi, Michael Anugraha, Rutuj Ravindra Puranik, Zawad Hayat Ahmed, Adril Putra Merin, Emmanuele Chersoni
[ABSTRACT]Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood. In this work, we present a comprehensive evaluation of LLM capabilities in understanding, reasoning over, and generating code-switched text. We introduce CodeMixQA a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants that span multiple geographic regions and code-switching patterns, and include both original scripts and their transliterated forms. Using this benchmark, we analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs. We further conduct a systematic evaluation of LLM-generated synthetic code-switched text, focusing on both naturalness and semantic fidelity, and uncover key limitations in current generation capabilities. Our findings reveal persistent challenges in both reasoning and generation under code-switching conditions and provide actionable insights for building more robust multilingual LLMs. We release the dataset and code as open source.
[COMMENTS]Preprint
[LINK]http://arxiv.org/abs/2601.07153v1
[DATE]2026-01-12 10:52:38+08:00
[CATEGORIES]cs.CL
Rethinking Prompt Optimizers: From Prompt Merits to Optimization
[AUTHORS]Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Tianjiao Li, Chua Jia Jim Deryl, Mak Lee Onn, Gee Wah Ng, Kezhi Mao
[ABSTRACT]Prompt optimization (PO) provides a practical way to improve response quality when users lack the time or expertise to manually craft effective prompts. Existing methods typically rely on LLMs’ self-generation ability to optimize prompts. However, due to limited downward compatibility, the instruction-heavy prompts generated by advanced LLMs can overwhelm lightweight inference models and degrade response quality, while also lacking interpretability due to implicit optimization. In this work, we rethink prompt optimization through the lens of explicit and interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, locally deployable prompt optimizer trained on our merit-guided prompt preference dataset generated by a lightweight LLM. MePO avoids online optimization, reduces privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. The code, model and dataset can be found in https://github.com/MidiyaZhu/MePO
[COMMENTS]30 pages, 16 figures
[LINK]http://arxiv.org/abs/2505.09930v4
[DATE]2026-01-12 10:10:47+08:00
[CATEGORIES]cs.CL
ReinPool: Reinforcement Learning Pooling Multi-Vector Embeddings for Retrieval System
[AUTHORS]Sungguk Cha, DongWook Kim, Mintae Kim, Youngsub Han, Byoung-Ki Jeon, Sangyeob Lee
[ABSTRACT]Multi-vector embedding models have emerged as a powerful paradigm for document retrieval, preserving fine-grained visual and textual details through token-level representations. However, this expressiveness comes at a staggering cost: storing embeddings for every token inflates index sizes by over $1000\times$ compared to single-vector approaches, severely limiting scalability. We introduce \textbf{ReinPool}, a reinforcement learning framework that learns to dynamically filter and pool multi-vector embeddings into compact, retrieval-optimized representations. By training with an inverse retrieval objective and NDCG-based rewards, ReinPool identifies and retains only the most discriminative vectors without requiring manual importance annotations. On the Vidore V2 benchmark across three vision-language embedding models, ReinPool compresses multi-vector representations by $746$–$1249\times$ into single vectors while recovering 76–81\% of full multi-vector retrieval performance. Compared to static mean pooling baselines, ReinPool achieves 22–33\% absolute NDCG@3 improvement, demonstrating that learned selection significantly outperforms heuristic aggregation.
[COMMENTS]5 pages
[LINK]http://arxiv.org/abs/2601.07125v1
[DATE]2026-01-12 09:37:21+08:00
[CATEGORIES]cs.CL
ReMIND: Orchestrating Modular Large Language Models for Controllable Serendipity A REM-Inspired System Design for Emergent Creative Ideation
[AUTHORS]Makoto Sato
[ABSTRACT]Large language models (LLMs) are used not only for problem solving but also for creative ideation; however, eliciting serendipitous insights that are both novel and internally coherent remains difficult. While stochastic sampling promotes novelty, it often degrades consistency. Here, we propose ReMIND, a REM-inspired modular framework for ideation. ReMIND consists of four stages: wake, which generates a stable low-temperature semantic baseline; dream, which performs high-temperature exploratory generation; judge, which applies coarse evaluation to filter incoherent outputs and extract candidate ideas; and re-wake, which re-articulates selected ideas into coherent final outputs. By instantiating each stage as an independent LLM, ReMIND enables functional separation between exploration and consolidation. Parameter sweeps show that ReMIND reliably induces semantic exploration while preserving downstream stability. Embedding-based analyses confirm substantial semantic displacement during the dream phase, whereas external evaluations reveal that high-quality ideas emerge sporadically rather than as extrema along any single metric. These results suggest that serendipitous ideation in LLMs is a rare-event process best approached through system level design that shapes the conditions under which valuable ideas can emerge and be stabilized. ReMIND provides a general framework for studying the computational basis of serendipity and illustrates how modular LLM orchestration can bridge exploration and stabilization.
[LINK]http://arxiv.org/abs/2601.07121v1
[DATE]2026-01-12 09:22:45+08:00
[CATEGORIES]cs.CL
The Need for a Socially-Grounded Persona Framework for User Simulation
[AUTHORS]Pranav Narayanan Venkit, Yu Li, Yada Pruksachatkun, Chien-Sheng Wu
[ABSTRACT]Synthetic personas are widely used to condition large language models (LLMs) for social simulation, yet most personas are still constructed from coarse sociodemographic attributes or summaries. We revisit persona creation by introducing SCOPE, a socially grounded framework for persona construction and evaluation, built from a 141-item, two-hour sociopsychological protocol collected from 124 U.S.-based participants. Across seven models, we find that demographic-only personas are a structural bottleneck: demographics explain only ~1.5% of variance in human response similarity. Adding sociopsychological facets improves behavioral prediction and reduces over-accentuation, and non-demographic personas based on values and identity achieve strong alignment with substantially lower bias. These trends generalize to SimBench (441 aligned questions), where SCOPE personas outperform default prompting and NVIDIA Nemotron personas, and SCOPE augmentation improves Nemotron-based personas. Our results indicate that persona quality depends on sociopsychological structure rather than demographic templates or summaries.
[LINK]http://arxiv.org/abs/2601.07110v1
[DATE]2026-01-12 08:27:18+08:00
[CATEGORIES]cs.CL
Learning to Evolve: Bayesian-Guided Continual Knowledge Graph Embedding
[AUTHORS]Linyu Li, Zhi Jin, Yuanpeng He, Dongming Jin, Yichi Zhang, Haoran Duan, Xuan Zhang, Zhengwei Tao, Nyima Tash
[ABSTRACT]As social media and the World Wide Web become hubs for information dissemination, effectively organizing and understanding the vast amounts of dynamically evolving Web content is crucial. Knowledge graphs (KGs) provide a powerful framework for structuring this information. However, the rapid emergence of new hot topics, user relationships, and events in social media renders traditional static knowledge graph embedding (KGE) models rapidly outdated. Continual Knowledge Graph Embedding (CKGE) aims to address this issue, but existing methods commonly suffer from catastrophic forgetting, whereby older, but still valuable, information is lost when learning new knowledge (such as new memes or trending events). This means the model cannot effectively learn the evolution of the data. We propose a novel CKGE framework, BAKE. Unlike existing methods, BAKE formulates CKGE as a sequential Bayesian inference problem and utilizes the Bayesian posterior update principle as a natural continual learning strategy. This principle is insensitive to data order and provides theoretical guarantees to preserve prior knowledge as much as possible. Specifically, we treat each batch of new data as a Bayesian update to the model’s prior. By maintaining the posterior distribution, the model effectively preserves earlier knowledge even as it evolves over multiple snapshots. Furthermore, to constrain the evolution of knowledge across snapshots, we introduce a continual clustering method that maintains the compact cluster structure of entity embeddings through a regularization term, ensuring semantic consistency while allowing controlled adaptation to new knowledge. We conduct extensive experiments on multiple CKGE benchmarks, which demonstrate that BAKE achieves the top performance in the vast majority of cases compared to existing approaches.
[LINK]http://arxiv.org/abs/2508.02426v2
[DATE]2026-01-12 07:19:55+08:00
[CATEGORIES]cs.CL cs.LG
Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation
[AUTHORS]Tianyi Zhang, Shidong Pan, Zejun Zhang, Zhenchang Xing, Xiaoyu Sun
[ABSTRACT]Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions. However, current evaluation focuses on syntactic correctness while ignoring deployability, the critical measure of the utility of IaC configuration files. Six state-of-the-art LLMs performed poorly on deployability, achieving only 20.8$\sim$30.2% deployment success rate on the first attempt. In this paper, we construct DPIaC-Eval, the first deployability-centric IaC template benchmark consisting of 153 real-world scenarios cross 58 unique services. Also, we propose an LLM-based deployability-centric framework, dubbed IaCGen, that uses iterative feedback mechanism encompassing format verification, syntax checking, and live deployment stages, thereby closely mirroring the real DevOps workflows. Results show that IaCGen can make 54.6$\sim$91.6% generated IaC templates from all evaluated models deployable in the first 10 iterations. Additionally, human-in-the-loop feedback that provide direct guidance for the deployability errors, can further boost the performance to over 90% passItr@25 on all evaluated LLMs. Furthermore, we explore the trustworthiness of the generated IaC templates on user intent alignment and security compliance. The poor performance (25.2% user requirement coverage and 8.4% security compliance rate) indicates a critical need for continued research in this domain.
[COMMENTS]Accepted by FSE 2026
[LINK]http://arxiv.org/abs/2506.05623v3
[DATE]2026-01-12 07:18:31+08:00
[CATEGORIES]cs.CL
The Best Instruction-Tuning Data are Those That Fit
[AUTHORS]Dylan Zhang, Qirun Dai, Hao Peng
[ABSTRACT]High-quality supervised fine-tuning (SFT) data are crucial for eliciting strong capabilities from pretrained large language models (LLMs). Typically, instructions are paired with multiple responses sampled from other LLMs, which are often out of the distribution of the target model to be fine-tuned. This, at scale, can lead to diminishing returns and even hurt the models’ performance and robustness. We propose GRAPE, a novel SFT framework that accounts for the unique characteristics of the target model. For each instruction, it gathers responses from various LLMs and selects the one with the highest probability measured by the target model, indicating that it aligns most closely with the target model’s pretrained distribution; it then proceeds with standard SFT training.
We first evaluate GRAPE with a controlled experiment, where we sample various solutions for each question in UltraInteract from multiple models and fine-tune commonly used LMs like LLaMA3.1-8B, Mistral-7B, and Qwen2.5-7B on GRAPE-selected data. GRAPE significantly outperforms strong baselines, including distilling from the strongest model with an absolute gain of up to 13.8%, averaged across benchmarks, and training on 3x more data with a maximum performance improvement of 17.3%. GRAPE’s strong performance generalizes to realistic settings. We experiment with the post-training data used for Tulu3 and Olmo-2. GRAPE outperforms strong baselines trained on 4.5 times more data by 6.1% and a state-of-the-art data selection approach by 3% on average performance. Remarkably, using 1/3 of the data and half the number of epochs, GRAPE enables LLaMA3.1-8B to surpass the performance of Tulu3-SFT by 3.5%.
[LINK]http://arxiv.org/abs/2502.04194v3
[DATE]2026-01-12 07:18:22+08:00
[CATEGORIES]cs.CL cs.LG
Tensor Product Attention Is All You Need
[AUTHORS]Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
[ABSTRACT]Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA’s memory efficiency and computational efficiency at decoding stage enables processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. Project Page: https://github.com/tensorgi/TPA.
[COMMENTS]Published in NeurIPS 2025 (Spotlight); Project Page: https://github.com/tensorgi/TPA
[LINK]http://arxiv.org/abs/2501.06425v7
[DATE]2026-01-12 07:01:33+08:00
[CATEGORIES]cs.CL cs.LG
Test-Time Scaling of Reasoning Models for Machine Translation
[AUTHORS]Zihao Li, Shaoxiong Ji, Jörg Tiedemann
[ABSTRACT]Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model’s reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.
[LINK]http://arxiv.org/abs/2510.06471v2
[DATE]2026-01-12 06:32:41+08:00
[CATEGORIES]cs.CL
OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment
[AUTHORS]Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang
[ABSTRACT]Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured criteria to capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further remove noisy rubrics via preserving preference-label consistency. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 8.4%. These gains transfer to policy models on instruction-following and biomedical benchmarks.
[COMMENTS]The first two authors contributed equally. Updated OpenRubrics dataset, RMs, and results
[LINK]http://arxiv.org/abs/2510.07743v2
[DATE]2026-01-12 05:04:26+08:00
[CATEGORIES]cs.CL
Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
[AUTHORS]Zhuoyi Yang, Yurun Song, Iftekhar Ahmed, Ian Harris
[ABSTRACT]Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has explored different mechanisms for providing knowledge to LLMs, such as finetuning and retrieval-augmented generation (RAG), their relative effectiveness for multi-hop question answering remains insufficiently understood, particularly when the required knowledge is temporally novel.
In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: QASC, a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, designed to test knowledge beyond the models’ pretraining cutoff.
Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, retrieval-augmented generation yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
[LINK]http://arxiv.org/abs/2601.07054v1
[DATE]2026-01-12 04:24:25+08:00
[CATEGORIES]cs.CL cs.LG
When Abundance Conceals Weakness: Knowledge Conflict in Multilingual Models
[AUTHORS]Jiaqi Zhao, Qiang Huang, Haodong Chen, Xiaoxing You, Jun Yu
[ABSTRACT]Large Language Models (LLMs) encode vast world knowledge across multiple languages, yet their internal beliefs are often unevenly distributed across linguistic spaces. When external evidence contradicts these language-dependent memories, models encounter \emph{cross-lingual knowledge conflict}, a phenomenon largely unexplored beyond English-centric settings. We introduce \textbf{CLEAR}, a \textbf{C}ross-\textbf{L}ingual knowl\textbf{E}dge conflict ev\textbf{A}luation f\textbf{R}amework that systematically examines how multilingual LLMs reconcile conflicting internal beliefs and multilingual external evidence. CLEAR decomposes conflict resolution into four progressive scenarios, from multilingual parametric elicitation to competitive multi-source cross-lingual induction, and systematically evaluates model behavior across two complementary QA benchmarks with distinct task characteristics. We construct multilingual versions of ConflictQA and ConflictingQA covering 10 typologically diverse languages and evaluate six representative LLMs. Our experiments reveal a task-dependent decision dichotomy. In reasoning-intensive tasks, conflict resolution is dominated by language resource abundance, with high-resource languages exerting stronger persuasive power. In contrast, for entity-centric factual conflicts, linguistic affinity, not resource scale, becomes decisive, allowing low-resource but linguistically aligned languages to outperform distant high-resource ones.
[COMMENTS]14 pages, 7 figures, and 4 tables
[LINK]http://arxiv.org/abs/2601.07041v1
[DATE]2026-01-12 03:26:59+08:00
[CATEGORIES]cs.CL
Task Arithmetic with Support Languages for Low-Resource ASR
[AUTHORS]Emma Rafkin, Dan DeGenaro, Xiulin Yang
[COMMENTS]8 pages, 3 Figures, preprint after submitted for review for a *ACL conference
[LINK]http://arxiv.org/abs/2601.07038v1
[DATE]2026-01-12 03:24:05+08:00
[CATEGORIES]cs.CL
Codified Foreshadowing-Payoff Text Generation
[AUTHORS]Longfei Yun, Kun Zhou, Yupeng Hou, Letian Peng, Jingbo Shang
[ABSTRACT]Foreshadowing and payoff are ubiquitous narrative devices through which authors introduce commitments early in a story and resolve them through concrete, observable outcomes. However, despite advances in story generation, large language models (LLMs) frequently fail to bridge these long-range narrative dependencies, often leaving “Chekhov’s guns” unfired even when the necessary context is present. Existing evaluations largely overlook this structural failure, focusing on surface-level coherence rather than the logical fulfillment of narrative setups. In this paper, we introduce Codified Foreshadowing-Payoff Generation (CFPG), a novel framework that reframes narrative quality through the lens of payoff realization. Recognizing that LLMs struggle to intuitively grasp the “triggering mechanism” of a foreshadowed event, CFPG transforms narrative continuity into a set of executable causal predicates. By mining and encoding Foreshadow-Trigger-Payoff triples from the BookSum corpus, we provide structured supervision that ensures foreshadowed commitments are not only mentioned but also temporally and logically fulfilled. Experiments demonstrate that CFPG significantly outperforms standard prompting baselines in payoff accuracy and narrative alignment. Our findings suggest that explicitly codifying narrative mechanics is essential for moving LLMs from surface-level fluency to genuine narrative competence.
[LINK]http://arxiv.org/abs/2601.07033v1
[DATE]2026-01-12 03:05:37+08:00
[CATEGORIES]cs.CL
TurkBench: A Benchmark for Evaluating Turkish Large Language Models
[AUTHORS]Çağrı Toraman, Ahmet Kaan Sever, Ayse Aysu Cengiz, Elif Ecem Arslan, Görkem Sevinç, Mete Mert Birdal, Yusuf Faruk Güldemir, Ali Buğra Kanburoğlu, Sezen Felekoğlu, Osman Gürlek, Sarp Kantar, Birsen Şahin Kütük, Büşra Tufan, Elif Genç, Serkan Coşkun, Gupse Ekin Demir, Muhammed Emin Arayıcı, Olgun Dursun, Onur Gungor, Susan Üsküdarlı, Abdullah Topraksoy, Esra Darıcı
[ABSTRACT]With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench
[LINK]http://arxiv.org/abs/2601.07020v1
[DATE]2026-01-12 02:28:23+08:00
[CATEGORIES]cs.CL
Lexicalized Constituency Parsing for Middle Dutch: Low-resource Training and Cross-Domain Generalization
[AUTHORS]Yiming Liang, Fang Zhao
[ABSTRACT]Recent years have seen growing interest in applying neural networks and contextualized word embeddings to the parsing of historical languages. However, most advances have focused on dependency parsing, while constituency parsing for low-resource historical languages like Middle Dutch has received little attention. In this paper, we adapt a transformer-based constituency parser to Middle Dutch, a highly heterogeneous and low-resource language, and investigate methods to improve both its in-domain and cross-domain performance. We show that joint training with higher-resource auxiliary languages increases F1 scores by up to 0.73, with the greatest gains achieved from languages that are geographically and temporally closer to Middle Dutch. We further evaluate strategies for leveraging newly annotated data from additional domains, finding that fine-tuning and data combination yield comparable improvements, and our neural parser consistently outperforms the currently used PCFG-based parser for Middle Dutch. We further explore feature-separation techniques for domain adaptation and demonstrate that a minimum threshold of approximately 200 examples per domain is needed to effectively enhance cross-domain performance.
[LINK]http://arxiv.org/abs/2601.07008v1
[DATE]2026-01-12 01:48:19+08:00
[CATEGORIES]cs.CL
FinCARDS: Card-Based Analyst Reranking for Financial Document Question Answering
[AUTHORS]Yixi Zhou, Fan Zhang, Yu Chen, Haipeng Zhang, Preslav Nakov, Zhuohan Xie
[ABSTRACT]Financial question answering (QA) over long corporate filings requires evidence to satisfy strict constraints on entities, financial metrics, fiscal periods, and numeric values. However, existing LLM-based rerankers primarily optimize semantic relevance, leading to unstable rankings and opaque decisions on long documents. We propose FinCards, a structured reranking framework that reframes financial evidence selection as constraint satisfaction under a finance-aware schema. FinCards represents filing chunks and questions using aligned schema fields (entities, metrics, periods, and numeric spans), enabling deterministic field-level matching. Evidence is selected via a multi-stage tournament reranking with stability-aware aggregation, producing auditable decision traces. Across two corporate filing QA benchmarks, FinCards substantially improves early-rank retrieval over both lexical and LLM-based reranking baselines, while reducing ranking variance, without requiring model fine-tuning or unpredictable inference budgets. Our code is available at https://github.com/XanderZhou2022/FINCARDS.
[COMMENTS]15 pages, including figures and tables
[LINK]http://arxiv.org/abs/2601.06992v1
[DATE]2026-01-12 01:03:24+08:00
[CATEGORIES]cs.CL
MedTutor: A Retrieval-Augmented LLM System for Case-Based Medical Education
[AUTHORS]Dongsuk Jang, Ziyao Shangguan, Kyle Tegtmeyer, Anurag Gupta, Jan Czerminski, Sophie Chheang, Arman Cohan
[ABSTRACT]The learning process for medical residents presents significant challenges, demanding both the ability to interpret complex case reports and the rapid acquisition of accurate medical knowledge from reliable sources. Residents typically study case reports and engage in discussions with peers and mentors, but finding relevant educational materials and evidence to support their learning from these cases is often time-consuming and challenging. To address this, we introduce MedTutor, a novel system designed to augment resident training by automatically generating evidence-based educational content and multiple-choice questions from clinical case reports. MedTutor leverages a Retrieval-Augmented Generation (RAG) pipeline that takes clinical case reports as input and produces targeted educational materials. The system’s architecture features a hybrid retrieval mechanism that synergistically queries a local knowledge base of medical textbooks and academic literature (using PubMed, Semantic Scholar APIs) for the latest related research, ensuring the generated content is both foundationally sound and current. The retrieved evidence is filtered and ordered using a state-of-the-art reranking model and then an LLM generates the final long-form output describing the main educational content regarding the case-report. We conduct a rigorous evaluation of the system. First, three radiologists assessed the quality of outputs, finding them to be of high clinical and educational value. Second, we perform a large scale evaluation using an LLM-as-a Judge to understand if LLMs can be used to evaluate the output of the system. Our analysis using correlation between LLMs outputs and human expert judgments reveals a moderate alignment and highlights the continued necessity of expert oversight.
[COMMENTS]Accepted to EMNLP 2025 (System Demonstrations)
[LINK]http://arxiv.org/abs/2601.06979v1
[DATE]2026-01-12 00:27:21+08:00
[CATEGORIES]cs.CL
UETQuintet at BioCreative IX – MedHopQA: Enhancing Biomedical QA with Selective Multi-hop Reasoning and Contextual Retrieval
[AUTHORS]Quoc-An Nguyen, Thi-Minh-Thu Vu, Bich-Dat Nguyen, Dinh-Quang-Minh Tran, Hoang-Quynh Le
[ABSTRACT]Biomedical Question Answering systems play a critical role in processing complex medical queries, yet they often struggle with the intricate nature of medical data and the demand for multi-hop reasoning. In this paper, we propose a model designed to effectively address both direct and sequential questions. While sequential questions are decomposed into a chain of sub-questions to perform reasoning across a chain of steps, direct questions are processed directly to ensure efficiency and minimise processing overhead. Additionally, we leverage multi-source information retrieval and in-context learning to provide rich, relevant context for generating answers. We evaluated our model on the BioCreative IX - MedHopQA Shared Task datasets. Our approach achieves an Exact Match score of 0.84, ranking second on the current leaderboard. These results highlight the model’s capability to meet the challenges of Biomedical Question Answering, offering a versatile solution for advancing medical research and practice.
[COMMENTS]In Proceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP, IJCAI 2025
[LINK]http://arxiv.org/abs/2601.06974v1
[DATE]2026-01-12 00:12:38+08:00
[CATEGORIES]cs.CL
Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition
[AUTHORS]Nathan Roll, Pranav Bhalerao, Martijn Bartelds, Arjun Pawar, Yuka Tatsumi, Tolulope Ogunremi, Chen Shani, Calbert Graham, Meghan Sumner, Dan Jurafsky
[ABSTRACT]In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a “Categorize Early” strategy, resolving phoneme categories 29% earlier in depth and speaker gender by 16% depth. In contrast, Transformers “Integrate Late,” deferring phoneme, accent, and duration encoding to deep layers (49-57%). These fingerprints suggest design heuristics: Conformers’ front-loaded categorization may benefit low-latency streaming, while Transformers’ deep integration may favor tasks requiring rich context and cross-utterance normalization.
[COMMENTS]3 figures, 9 tables
[LINK]http://arxiv.org/abs/2601.06972v1
[DATE]2026-01-12 00:05:26+08:00
[CATEGORIES]cs.CL
REXO: Indoor Multi-View Radar Object Detection via 3D Bounding Box Diffusion
[AUTHORS]Ryoma Yataka, Pu Perry Wang, Petros Boufounos, Ryuhei Takahashi
[ABSTRACT]Multi-view indoor radar perception has drawn attention due to its cost-effectiveness and low privacy risks. Existing methods often rely on {implicit} cross-view radar feature association, such as proposal pairing in RFMask or query-to-feature cross-attention in RETR, which can lead to ambiguous feature matches and degraded detection in complex indoor scenes. To address these limitations, we propose \textbf{REXO} (multi-view Radar object dEtection with 3D bounding boX diffusiOn), which lifts the 2D bounding box (BBox) diffusion process of DiffusionDet into the 3D radar space. REXO utilizes these noisy 3D BBoxes to guide an {explicit} cross-view radar feature association, enhancing the cross-view radar-conditioned denoising process. By accounting for prior knowledge that the person is in contact with the ground, REXO reduces the number of diffusion parameters by determining them from this prior. Evaluated on two open indoor radar datasets, our approach surpasses state-of-the-art methods by a margin of +4.22 AP on the HIBER dataset and +11.02 AP on the MMVR dataset. The REXO implementation is available at https://github.com/merlresearch/radar-bbox-diffusion.
[COMMENTS]26 pages; Accepted to AAAI 2026; Code available at https://github.com/merlresearch/radar-bbox-diffusion
[LINK]http://arxiv.org/abs/2511.17806v2
[DATE]2026-01-12 23:56:26+08:00
[CATEGORIES]cs.LG
Modern Neuromorphic AI: From Intra-Token to Inter-Token Processing
[AUTHORS]Osvaldo Simeone
[ABSTRACT]The rapid growth of artificial intelligence (AI) has brought novel data processing and generative capabilities but also escalating energy requirements. This challenge motivates renewed interest in neuromorphic computing principles, which promise brain-like efficiency through discrete and sparse activations, recurrent dynamics, and non-linear feedback. In fact, modern AI architectures increasingly embody neuromorphic principles through heavily quantized activations, state-space dynamics, and sparse attention mechanisms. This paper elaborates on the connections between neuromorphic models, state-space models, and transformer architectures through the lens of the distinction between intra-token processing and inter-token processing. Most early work on neuromorphic AI was based on spiking neural networks (SNNs) for intra-token processing, i.e., for transformations involving multiple channels, or features, of the same vector input, such as the pixels of an image. In contrast, more recent research has explored how neuromorphic principles can be leveraged to design efficient inter-token processing methods, which selectively combine different information elements depending on their contextual relevance. Implementing associative memorization mechanisms, these approaches leverage state-space dynamics or sparse self-attention. Along with a systematic presentation of modern neuromorphic AI models through the lens of intra-token and inter-token processing, training methodologies for neuromorphic AI models are also reviewed. These range from surrogate gradients leveraging parallel convolutional processing to local learning rules based on reinforcement learning mechanisms.
[LINK]http://arxiv.org/abs/2601.00245v3
[DATE]2026-01-12 23:47:06+08:00
[CATEGORIES]cs.LG
Learning to accelerate Krasnosel’skii-Mann fixed-point iterations with guarantees
[AUTHORS]Andrea Martin, Giuseppe Belgioioso
[ABSTRACT]We introduce a principled learning to optimize (L2O) framework for solving fixed-point problems involving general nonexpansive mappings. Our idea is to deliberately inject summable perturbations into a standard Krasnosel’skii-Mann iteration to improve its average-case performance over a specific distribution of problems while retaining its convergence guarantees. Under a metric sub-regularity assumption, we prove that the proposed parametrization includes only iterations that locally achieve linear convergence-up to a vanishing bias term-and that it encompasses all iterations that do so at a sufficiently fast rate. We then demonstrate how our framework can be used to augment several widely-used operator splitting methods to accelerate the solution of structured monotone inclusion problems, and validate our approach on a best approximation problem using an L2O-augmented Douglas-Rachford splitting algorithm.
[LINK]http://arxiv.org/abs/2601.07665v1
[DATE]2026-01-12 23:45:40+08:00
[CATEGORIES]cs.LG
Active Evaluation of General Agents: Problem Definition and Comparison of Baseline Algorithms
[AUTHORS]Marc Lanctot, Kate Larson, Ian Gemp, Michael Kaisers
[ABSTRACT]As intelligent agents become more generally-capable, i.e. able to master a wide variety of tasks, the complexity and cost of properly evaluating them rises significantly. Tasks that assess specific capabilities of the agents can be correlated and stochastic, requiring many samples for accurate comparisons, leading to added costs. In this paper, we propose a formal definition and a conceptual framework for active evaluation of agents across multiple tasks, which assesses the performance of ranking algorithms as a function of number of evaluation data samples. Rather than curating, filtering, or compressing existing data sets as a preprocessing step, we propose an online framing: on every iteration, the ranking algorithm chooses the task and agents to sample scores from. Then, evaluation algorithms report a ranking of agents on each iteration and their performance is assessed with respect to the ground truth ranking over time. Several baselines are compared under different experimental contexts, with synthetic generated data and simulated online access to real evaluation data from Atari game-playing agents. We find that the classical Elo rating system – while it suffers from well-known failure modes, in theory – is a consistently reliable choice for efficient reduction of ranking error in practice. A recently-proposed method, Soft Condorcet Optimization, shows comparable performance to Elo on synthetic data and significantly outperforms Elo on real Atari agent evaluation. When task variation from the ground truth is high, selecting tasks based on proportional representation leads to higher rate of ranking error reduction.
[COMMENTS]AAMAS 2026
[LINK]http://arxiv.org/abs/2601.07651v1
[DATE]2026-01-12 23:32:11+08:00
[CATEGORIES]cs.LG
Backpropagation through space, time, and the brain
[AUTHORS]Benjamin Ellenberger, Paul Haider, Jakob Jordan, Kevin Max, Ismael Jaras, Laura Kriener, Federico Benitez, Mihai A. Petrovici
[ABSTRACT]How physical networks of neurons, bound by spatio-temporal locality constraints, can perform efficient credit assignment, remains, to a large extent, an open question. In machine learning, the answer is almost universally given by the error backpropagation algorithm, through both space and time. However, this algorithm is well-known to rely on biologically implausible assumptions, in particular with respect to spatio-temporal (non-)locality. Alternative forward-propagation models such as real-time recurrent learning only partially solve the locality problem, but only at the cost of scaling, due to prohibitive storage requirements. We introduce Generalized Latent Equilibrium (GLE), a computational framework for fully local spatio-temporal credit assignment in physical, dynamical networks of neurons. We start by defining an energy based on neuron-local mismatches, from which we derive both neuronal dynamics via stationarity and parameter dynamics via gradient descent. The resulting dynamics can be interpreted as a real-time, biologically plausible approximation of backpropagation through space and time in deep cortical networks with continuous-time neuronal dynamics and continuously active, local synaptic plasticity. In particular, GLE exploits the morphology of dendritic trees to enable more complex information storage and processing in single neurons, as well as the ability of biological neurons to phase-shift their output rate with respect to their membrane potential, which is essential in both directions of information propagation. For the forward computation, it enables the mapping of time-continuous inputs to neuronal space, effectively performing a spatio-temporal convolution. For the backward computation, it permits the temporal inversion of feedback signals, which consequently approximate the adjoint variables necessary for useful parameter updates.
[COMMENTS]First authorship shared by Benjamin Ellenberger and Paul Haider
[LINK]http://arxiv.org/abs/2403.16933v4
[DATE]2026-01-12 23:31:06+08:00
[CATEGORIES]cs.LG
Studying the Role of Synthetic Data for Machine Learning-based Wireless Networks Traffic Forecasting
[AUTHORS]José Pulido, Francesc Wilhelmi, Sergio Fortes, Alfonso Fernández-Durán, Lorenzo Galati Giordano, Raquel Barco
[ABSTRACT]Synthetic data generation is an appealing tool for augmenting and enriching datasets, playing a crucial role in advancing artificial intelligence (AI) and machine learning (ML). Not only does synthetic data help build robust AI/ML datasets cost-effectively, but it also offers privacy-friendly solutions and bypasses the complexities of storing large data volumes. This paper proposes a novel method to generate synthetic data, based on first-order auto-regressive noise statistics, for large-scale Wi-Fi deployments. The approach operates with minimal real data requirements while producing statistically rich traffic patterns that effectively mimic real Access Point (AP) behavior. Experimental results show that ML models trained on synthetic data achieve Mean Absolute Error (MAE) values within 10 to 15 of those obtained using real data when trained on the same APs, while requiring significantly less training data. Moreover, when generalization is required, synthetic-data-trained models improve prediction accuracy by up to 50 percent compared to real-data-trained baselines, thanks to the enhanced variability and diversity of the generated traces. Overall, the proposed method bridges the gap between synthetic data generation and practical Wi-Fi traffic forecasting, providing a scalable, efficient, and real-time solution for modern wireless networks.
[LINK]http://arxiv.org/abs/2601.07646v1
[DATE]2026-01-12 23:27:55+08:00
[CATEGORIES]cs.LG
Effectively Leveraging Momentum Terms in Stochastic Line Search Frameworks for Fast Optimization of Finite-Sum Problems
[AUTHORS]Matteo Lapucci, Davide Pucci
[ABSTRACT]In this work, we address unconstrained finite-sum optimization problems, with particular focus on instances originating in large scale deep learning scenarios. Our main interest lies in the exploration of the relationship between recent line search approaches for stochastic optimization in the overparametrized regime and momentum directions. First, we point out that combining these two elements with computational benefits is not straightforward. To this aim, we propose a solution based on mini-batch persistency. We then introduce an algorithmic framework that exploits a mix of data persistency, conjugate-gradient type rules for the definition of the momentum parameter and stochastic line searches. The resulting algorithm provably possesses convergence properties under suitable assumptions and is empirically shown to outperform other popular methods from the literature, obtaining state-of-the-art results in both convex and nonconvex large scale training problems.
[LINK]http://arxiv.org/abs/2411.07102v3
[DATE]2026-01-12 23:23:40+08:00
[CATEGORIES]cs.LG
Dual-Level Models for Physics-Informed Multi-Step Time Series Forecasting
[AUTHORS]Mahdi Nasiri, Johanna Kortelainen, Simo Särkkä
[ABSTRACT]This paper develops an approach for multi-step forecasting of dynamical systems by integrating probabilistic input forecasting with physics-informed output prediction. Accurate multi-step forecasting of time series systems is important for the automatic control and optimization of physical processes, enabling more precise decision-making. While mechanistic-based and data-driven machine learning (ML) approaches have been employed for time series forecasting, they face significant limitations. Incomplete knowledge of process mathematical models limits mechanistic-based direct employment, while purely data-driven ML models struggle with dynamic environments, leading to poor generalization. To address these limitations, this paper proposes a dual-level strategy for physics-informed forecasting of dynamical systems. On the first level, input variables are forecast using a hybrid method that integrates a long short-term memory (LSTM) network into probabilistic state transition models (STMs). On the second level, these stochastically predicted inputs are sequentially fed into a physics-informed neural network (PINN) to generate multi-step output predictions. The experimental results of the paper demonstrate that the hybrid input forecasting models achieve a higher log-likelihood and lower mean squared errors (MSE) compared to conventional STMs. Furthermore, the PINNs driven by the input forecasting models outperform their purely data-driven counterparts in terms of MSE and log-likelihood, exhibiting stronger generalization and forecasting performance across multiple test cases.
[LINK]http://arxiv.org/abs/2601.07640v1
[DATE]2026-01-12 23:19:21+08:00
[CATEGORIES]cs.LG
Reinforcement Learning for Micro-Level Claims Reserving
[AUTHORS]Benjamin Avanzi, Ronald Richman, Bernard Wong, Mario Wüthrich, Yagebu Xie
[ABSTRACT]Outstanding claim liabilities are revised repeatedly as claims develop, yet most modern reserving models are trained as one-shot predictors and typically learn only from settled claims. We formulate individual claims reserving as a claim-level Markov decision process in which an agent sequentially updates outstanding claim liability (OCL) estimates over development, using continuous actions and a reward design that balances accuracy with stable reserve revisions. A key advantage of this reinforcement learning (RL) approach is that it can learn from all observed claim trajectories, including claims that remain open at valuation, thereby avoiding the reduced sample size and selection effects inherent in supervised methods trained on ultimate outcomes only. We also introduce practical components needed for actuarial use – initialisation of new claims, temporally consistent tuning via a rolling-settlement scheme, and an importance-weighting mechanism to mitigate portfolio-level underestimation driven by the rarity of large claims. On CAS and SPLICE synthetic general insurance datasets, the proposed Soft Actor-Critic implementation delivers competitive claim-level accuracy and strong aggregate OCL performance, particularly for the immature claim segments that drive most of the liability.
[LINK]http://arxiv.org/abs/2601.07637v1
[DATE]2026-01-12 23:17:18+08:00
[CATEGORIES]cs.LG
Beyond Sharpness: A Flatness Decomposition Framework for Efficient Continual Learning
[AUTHORS]Yanan Chen, Tieliang Gong, Yunjiao Zhang, Wen Wen
[ABSTRACT]Continual Learning (CL) aims to enable models to sequentially learn multiple tasks without forgetting previous knowledge. Recent studies have shown that optimizing towards flatter loss minima can improve model generalization. However, existing sharpness-aware methods for CL suffer from two key limitations: (1) they treat sharpness regularization as a unified signal without distinguishing the contributions of its components. and (2) they introduce substantial computational overhead that impedes practical deployment. To address these challenges, we propose FLAD, a novel optimization framework that decomposes sharpness-aware perturbations into gradient-aligned and stochastic-noise components, and show that retaining only the noise component promotes generalization. We further introduce a lightweight scheduling scheme that enables FLAD to maintain significant performance gains even under constrained training time. FLAD can be seamlessly integrated into various CL paradigms and consistently outperforms standard and sharpness-aware optimizers in diverse experimental settings, demonstrating its effectiveness and practicality in CL.
[COMMENTS]Accepted by AAAI 2026
[LINK]http://arxiv.org/abs/2601.07636v1
[DATE]2026-01-12 23:17:04+08:00
[CATEGORIES]cs.LG
Neural Architecture for Fast and Reliable Coagulation Assessment in Clinical Settings: Leveraging Thromboelastography
[AUTHORS]Yulu Wang, Ziqian Zeng, Jianjun Wu, Zhifeng Tang
[ABSTRACT]In an ideal medical environment, real-time coagulation monitoring can enable early detection and prompt remediation of risks. However, traditional Thromboelastography (TEG), a widely employed diagnostic modality, can only provide such outputs after nearly 1 hour of measurement. The delay might lead to elevated mortality rates. These issues clearly point out one of the key challenges for medical AI development: Mak-ing reasonable predictions based on very small data sets and accounting for variation between different patient populations, a task where conventional deep learning methods typically perform poorly. We present Physiological State Reconstruc-tion (PSR), a new algorithm specifically designed to take ad-vantage of dynamic changes between individuals and to max-imize useful information produced by small amounts of clini-cal data through mapping to reliable predictions and diagnosis. We develop MDFE to facilitate integration of varied temporal signals using multi-domain learning, and jointly learn high-level temporal interactions together with attentions via HLA; furthermore, the parameterized DAM we designed maintains the stability of the computed vital signs. PSR evaluates with 4 TEG-specialized data sets and establishes remarkable perfor-mance – predictions of R2 > 0.98 for coagulation traits and error reduction around half compared to the state-of-the-art methods, and halving the inferencing time too. Drift-aware learning suggests a new future, with potential uses well be-yond thrombophilia discovery towards medical AI applica-tions with data scarcity.
[COMMENTS]This paper has been accepted by AAAI26
[LINK]http://arxiv.org/abs/2601.07618v1
[DATE]2026-01-12 23:03:53+08:00
[CATEGORIES]cs.LG
Simulation of Multivariate Extremes: a Wasserstein-Aitchison GAN approach
[AUTHORS]Stéphane Lhaut, Holger Rootzén, Johan Segers
[ABSTRACT]Economically responsible mitigation of multivariate extreme risks-such as extreme rainfall over large areas, large simultaneous variations in many stock prices, or widespread breakdowns in transportation systems-requires assessing the resilience of the systems under plausible stress scenarios. This paper uses Extreme Value Theory (EVT) to develop a new approach to simulating such multivariate extreme events. Specifically, we assume that after transformation to a standard scale the distribution of the random phenomenon of interest is multivariate regular varying and use this to provide a sampling procedure for extremes on the original scale. Our procedure combines a Wasserstein-Aitchison Generative Adversarial Network (WA-GAN) to simulate the tail dependence structure on the standard scale with joint modeling of the univariate marginal tails on the original scale. The WA-GAN procedure relies on the angular measure-encoding the distribution on the unit simplex of the angles of extreme observations-after transformation to Aitchison coordinates, which allows the Wasserstein-GAN algorithm to be run in a linear space. Our method is applied both to simulated data under various tail dependence scenarios and to a financial data set from the Kenneth French Data Library. The proposed algorithm demonstrates strong performance compared to existing alternatives in the literature, both in capturing tail dependence structures and in generating accurate new extreme observations.
[COMMENTS]40 pages
[LINK]http://arxiv.org/abs/2504.21438v3
[DATE]2026-01-12 22:38:53+08:00
[CATEGORIES]cs.LG
Large Language Models for Physics Instrument Design
[AUTHORS]Sara Zoccheddu, Shah Rukh Qasim, Patrick Owen, Nicola Serra
[ABSTRACT]We study the use of large language models (LLMs) for physics instrument design and compare their performance to reinforcement learning (RL). Using only prompting, LLMs are given task constraints and summaries of prior high-scoring designs and propose complete detector configurations, which we evaluate with the same simulators and reward functions used in RL-based optimization. Although RL yields stronger final designs, we find that modern LLMs consistently generate valid, resource-aware, and physically meaningful configurations that draw on broad pretrained knowledge of detector design principles and particle–matter interactions, despite having no task-specific training. Based on this result, as a first step toward hybrid design workflows, we explore pairing the LLMs with a dedicated trust region optimizer, serving as a precursor to future pipelines in which LLMs propose and structure design hypotheses while RL performs reward-driven optimization. Based on these experiments, we argue that LLMs are well suited as meta-planners: they can design and orchestrate RL-based optimization studies, define search strategies, and coordinate multiple interacting components within a unified workflow. In doing so, they point toward automated, closed-loop instrument design in which much of the human effort required to structure and supervise optimization can be reduced.
[LINK]http://arxiv.org/abs/2601.07580v1
[DATE]2026-01-12 22:30:54+08:00
[CATEGORIES]cs.LG
An adjoint method for training data-driven reduced-order models
[AUTHORS]Donglin Liu, Francisco García Atienza, Mengwu Guo
[ABSTRACT]Reduced-order modeling lies at the interface of numerical analysis and data-driven scientific computing, providing principled ways to compress high-fidelity simulations in science and engineering. We propose a training framework that couples a continuous-time form of operator inference with the adjoint-state method to obtain robust data-driven reduced-order models. This method minimizes a trajectory-based loss between reduced-order solutions and projected snapshot data, which removes the need to estimate time derivatives from noisy measurements and provides intrinsic temporal regularization through time integration. We derive the corresponding continuous adjoint equations to compute gradients efficiently and implement a gradient based optimizer to update the reduced model parameters. Each iteration only requires one forward reduced order solve and one adjoint solve, followed by inexpensive gradient assembly, making the method attractive for large-scale simulations. We validate the proposed method on three partial differential equations: viscous Burgers’ equation, the two-dimensional Fisher-KPP equation, and an advection-diffusion equation. We perform systematic comparisons against standard operator inference under two perturbation regimes, namely reduced temporal snapshot density and additive Gaussian noise. For clean data, both approaches deliver similar accuracy, but in situations with sparse sampling and noise, the proposed adjoint-based training provides better accuracy and enhanced roll-out stability.
[LINK]http://arxiv.org/abs/2601.07579v1
[DATE]2026-01-12 22:30:50+08:00
[CATEGORIES]cs.LG
d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation
[AUTHORS]Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang
[ABSTRACT]Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently face an accuracy-parallelism trade-off. Despite increasing interest, existing methods typically focus on only one-side of the coin, targeting either efficiency or performance. To address this limitation, we propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism: (i) during training, we introduce pseudo-trajectory distillation to teach the model which tokens can be decoded confidently at early steps, thereby improving parallelism; (ii) during inference, we employ entropy-based multi-block decoding with a KV-cache refresh mechanism to achieve high parallelism while maintaining accuracy. To better evaluate dLLMs, we also introduce AUP (Accuracy Under Parallelism), a new metric that jointly measures accuracy and parallelism. Experiments demonstrate that our d3LLM achieves up to 10$\times$ speedup over vanilla LLaDA/Dream and 5$\times$ speedup over AR models without much accuracy drop. Our code is available at https://github.com/hao-ai-lab/d3LLM.
[LINK]http://arxiv.org/abs/2601.07568v1
[DATE]2026-01-12 22:25:36+08:00
[CATEGORIES]cs.LG
TFEC: Multivariate Time-Series Clustering via Temporal-Frequency Enhanced Contrastive Learning
[AUTHORS]Zexi Tan, Tao Xie, Haoyi Xiao, Baoyao Yang, Yuzhu Ji, An Zeng, Xiang Zhang, Yiqun Zhang
[ABSTRACT]Multivariate Time-Series (MTS) clustering is crucial for signal processing and data analysis. Although deep learning approaches, particularly those leveraging Contrastive Learning (CL), are prominent for MTS representation, existing CL-based models face two key limitations: 1) neglecting clustering information during positive/negative sample pair construction, and 2) introducing unreasonable inductive biases, e.g., destroying time dependence and periodicity through augmentation strategies, compromising representation quality. This paper, therefore, proposes a Temporal-Frequency Enhanced Contrastive (TFEC) learning framework. To preserve temporal structure while generating low-distortion representations, a temporal-frequency Co-EnHancement (CoEH) mechanism is introduced. Accordingly, a synergistic dual-path representation and cluster distribution learning framework is designed to jointly optimize cluster structure and representation fidelity. Experiments on six real-world benchmark datasets demonstrate TFEC’s superiority, achieving 4.48% average NMI gains over SOTA methods, with ablation studies validating the design. The code of the paper is available at: https://github.com/yueliangy/TFEC.
[COMMENTS]Submitted to ICASSP 2026
[LINK]http://arxiv.org/abs/2601.07550v1
[DATE]2026-01-12 21:59:27+08:00
[CATEGORIES]cs.LG
Contextual Discrepancy-Aware Contrastive Learning for Robust Medical Time Series Diagnosis in Small-Sample Scenarios
[AUTHORS]Kaito Tanaka, Aya Nakayama, Masato Ito, Yuji Nishimura, Keisuke Matsuda
[ABSTRACT]Medical time series data, such as EEG and ECG, are vital for diagnosing neurological and cardiovascular diseases. However, their precise interpretation faces significant challenges due to high annotation costs, leading to data scarcity, and the limitations of traditional contrastive learning in capturing complex temporal patterns. To address these issues, we propose CoDAC (Contextual Discrepancy-Aware Contrastive learning), a novel framework that enhances diagnostic accuracy and generalization, particularly in small-sample settings. CoDAC leverages external healthy data and introduces a Contextual Discrepancy Estimator (CDE), built upon a Transformer-based Autoencoder, to precisely quantify abnormal signals through context-aware anomaly scores. These scores dynamically inform a Dynamic Multi-views Contrastive Framework (DMCF), which adaptively weights different temporal views to focus contrastive learning on diagnostically relevant, discrepant regions. Our encoder combines dilated convolutions with multi-head attention for robust feature extraction. Comprehensive experiments on Alzheimer’s Disease EEG, Parkinson’s Disease EEG, and Myocardial Infarction ECG datasets demonstrate CoDAC’s superior performance across all metrics, consistently outperforming state-of-the-art baselines, especially under low label availability. Ablation studies further validate the critical contributions of CDE and DMCF. CoDAC offers a robust and interpretable solution for medical time series diagnosis, effectively mitigating data scarcity challenges.
[LINK]http://arxiv.org/abs/2601.07548v1
[DATE]2026-01-12 21:54:50+08:00
[CATEGORIES]cs.LG
Multi-scale Graph Autoregressive Modeling: Molecular Property Prediction via Next Token Prediction
[AUTHORS]Zhuoyang Jiang, Yaosen Min, Peiran Jin, Lei Chen
[ABSTRACT]We present Connection-Aware Motif Sequencing (CamS), a graph-to-sequence representation that enables decoder-only Transformers to learn molecular graphs via standard next-token prediction (NTP). For molecular property prediction, SMILES-based NTP scales well but lacks explicit topology, whereas graph-native masked modeling captures connectivity but risks disrupting the pivotal chemical details (e.g., activity cliffs). CamS bridges this gap by serializing molecular graphs into structure-rich causal sequences. CamS first mines data-driven connection-aware motifs. It then serializes motifs via scaffold-rooted breadth-first search (BFS) to establish a stable core-to-periphery order. Crucially, CamS enables hierarchical modeling by concatenating sequences from fine to coarse motif scales, allowing the model to condition global scaffolds on dense, uncorrupted local structural evidence. We instantiate CamS-LLaMA by pre-training a vanilla LLaMA backbone on CamS sequences. It achieves state-of-the-art performance on MoleculeNet and the activity-cliff benchmark MoleculeACE, outperforming both SMILES-based language models and strong graph baselines. Interpretability analysis confirms that our multi-scale causal serialization effectively drives attention toward cliff-determining differences.
[LINK]http://arxiv.org/abs/2601.02530v2
[DATE]2026-01-12 21:52:52+08:00
[CATEGORIES]cs.LG
Near-Optimal Private Linear Regression via Iterative Hessian Mixing
[AUTHORS]Omri Lev, Moshe Shenfeld, Vishwak Srinivasan, Katrina Ligett, Ashia C. Wilson
[ABSTRACT]We study differentially private ordinary least squares (DP-OLS) with bounded data. The dominant approach, adaptive sufficient-statistics perturbation (AdaSSP), adds an adaptively chosen perturbation to the sufficient statistics, namely, the matrix $X^{\top}X$ and the vector $X^{\top}Y$, and is known to achieve near-optimal accuracy and to have strong empirical performance. In contrast, methods that rely on Gaussian-sketching, which ensure differential privacy by pre-multiplying the data with a random Gaussian matrix, are widely used in federated and distributed regression, yet remain relatively uncommon for DP-OLS. In this work, we introduce the iterative Hessian mixing, a novel DP-OLS algorithm that relies on Gaussian sketches and is inspired by the iterative Hessian sketch algorithm. We provide utility analysis for the iterative Hessian mixing as well as a new analysis for the previous methods that rely on Gaussian sketches. Then, we show that our new approach circumvents the intrinsic limitations of the prior methods and provides non-trivial improvements over AdaSSP. We conclude by running an extensive set of experiments across standard benchmarks to demonstrate further that our approach consistently outperforms these prior baselines.
[LINK]http://arxiv.org/abs/2601.07545v1
[DATE]2026-01-12 21:50:15+08:00
[CATEGORIES]cs.LG
Nonparametric Kernel Clustering with Bandit Feedback
[AUTHORS]Victor Thuot, Sebastian Vogt, Debarghya Ghoshdastidar, Nicolas Verzelen
[ABSTRACT]Clustering with bandit feedback refers to the problem of partitioning a set of items, where the clustering algorithm can sequentially query the items to receive noisy observations. The problem is formally posed as the task of partitioning the arms of an N-armed stochastic bandit according to their underlying distributions, grouping two arms together if and only if they share the same distribution, using samples collected sequentially and adaptively. This setting has gained attention in recent years due to its applicability in recommendation systems and crowdsourcing. Existing works on clustering with bandit feedback rely on a strong assumption that the underlying distributions are sub-Gaussian. As a consequence, the existing methods mainly cover settings with linearly-separable clusters, which has little practical relevance. We introduce a framework of “nonparametric clustering with bandit feedback”, where the underlying arm distributions are not constrained to any parametric, and hence, it is applicable for active clustering of real-world datasets. We adopt a kernel-based approach, which allows us to reformulate the nonparametric problem as the task of clustering the arms according to their kernel mean embeddings in a reproducing kernel Hilbert space (RKHS). Building on this formulation, we introduce the KABC algorithm with theoretical correctness guarantees and analyze its sampling budget. We introduce a notion of signal-to-noise ratio for this problem that depends on the maximum mean discrepancy (MMD) between the arm distributions and on their variance in the RKHS. Our algorithm is adaptive to this unknown quantity: it does not require it as an input yet achieves instance-dependent guarantees.
[LINK]http://arxiv.org/abs/2601.07535v1
[DATE]2026-01-12 21:36:07+08:00
[CATEGORIES]cs.LG
Stagewise Reinforcement Learning and the Geometry of the Regret Landscape
[AUTHORS]Chris Elliott, Einar Urdshals, David Quarel, Matthew Farrugia-Roberts, Daniel Murfet
[ABSTRACT]Singular learning theory characterizes Bayesian learning as an evolving tradeoff between accuracy and complexity, with transitions between qualitatively different solutions as sample size increases. We extend this theory to deep reinforcement learning, proving that the concentration of the generalized posterior over policies is governed by the local learning coefficient (LLC), an invariant of the geometry of the regret function. This theory predicts that Bayesian phase transitions in reinforcement learning should proceed from simple policies with high regret to complex policies with low regret. We verify this prediction empirically in a gridworld environment exhibiting stagewise policy development: phase transitions over SGD training manifest as “opposing staircases” where regret decreases sharply while the LLC increases. Notably, the LLC detects phase transitions even when estimated on a subset of states where the policies appear identical in terms of regret, suggesting it captures changes in the underlying algorithm rather than just performance.
[COMMENTS]50 pages, 14 figures
[LINK]http://arxiv.org/abs/2601.07524v1
[DATE]2026-01-12 21:25:21+08:00
[CATEGORIES]cs.LG
SegDAC: Improving Visual Reinforcement Learning by Extracting Dynamic Object-Centric Representations from Pretrained Vision Models
[AUTHORS]Alexandre Brown, Glen Berseth
[ABSTRACT]Visual reinforcement learning (RL) is challenging due to the need to extract useful representations from high-dimensional inputs while learning effective control from sparse and noisy rewards. Although large perception models exist, integrating them effectively into RL for visual generalization and improved sample efficiency remains difficult. We propose SegDAC, a Segmentation-Driven Actor-Critic method. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground the image segmentation process via text inputs. It includes a novel transformer-based architecture that supports a dynamic number of segments at each time step and effectively learns which segments to focus on using online RL, without using human labels. By evaluating SegDAC over a challenging visual generalization benchmark using Maniskill3, which covers diverse manipulation tasks under strong visual perturbations, we demonstrate that SegDAC achieves significantly better visual generalization, doubling prior performance on the hardest setting and matching or surpassing prior methods in sample efficiency across all evaluated tasks. Project Page: https://segdac.github.io/
[LINK]http://arxiv.org/abs/2508.09325v3
[DATE]2026-01-12 21:21:57+08:00
[CATEGORIES]cs.LG
Iterative Methods for Full-Scale Gaussian Process Approximations for Large Spatial Data
[AUTHORS]Tim Gyger, Reinhard Furrer, Fabio Sigrist
[ABSTRACT]Gaussian processes are flexible probabilistic regression models which are widely used in statistics and machine learning. However, a drawback is their limited scalability to large data sets. To alleviate this, full-scale approximations (FSAs) combine predictive process methods and covariance tapering, thus approximating both global and local structures. We show how iterative methods can be used to reduce computational costs in calculating likelihoods, gradients, and predictive distributions with FSAs. In particular, we introduce a novel preconditioner and show theoretically and empirically that it accelerates the conjugate gradient method’s convergence speed and mitigates its sensitivity with respect to the FSA parameters and the eigenvalue structure of the original covariance matrix, and we demonstrate empirically that it outperforms a state-of-the-art pivoted Cholesky preconditioner. Furthermore, we introduce an accurate and fast way to calculate predictive variances using stochastic simulation and iterative methods. In addition, we show how our newly proposed fully independent training conditional (FITC) preconditioner can also be used in iterative methods for Vecchia approximations. In our experiments, it outperforms existing state-of-the-art preconditioners for Vecchia approximations. All methods are implemented in a free C++ software library with high-level Python and R packages.
[LINK]http://arxiv.org/abs/2405.14492v5
[DATE]2026-01-12 21:19:21+08:00
[CATEGORIES]cs.LG
Intervention Efficiency and Perturbation Validation Framework: Capacity-Aware and Robust Clinical Model Selection under the Rashomon Effect
[AUTHORS]Yuwen Zhang, Viet Tran, Paul Weng
[ABSTRACT]In clinical machine learning, the coexistence of multiple models with comparable performance (a manifestation of the Rashomon Effect) poses fundamental challenges for trustworthy deployment and evaluation. Small, imbalanced, and noisy datasets, coupled with high-dimensional and weakly identified clinical features, amplify this multiplicity and make conventional validation schemes unreliable. As a result, selecting among equally performing models becomes uncertain, particularly when resource constraints and operational priorities are not considered by conventional metrics like F1 score. To address these issues, we propose two complementary tools for robust model assessment and selection: Intervention Efficiency (IE) and the Perturbation Validation Framework (PVF). IE is a capacity-aware metric that quantifies how efficiently a model identifies actionable true positives when only limited interventions are feasible, thereby linking predictive performance with clinical utility. PVF introduces a structured approach to assess the stability of models under data perturbations, identifying models whose performance remains most invariant across noisy or shifted validation sets. Empirical results on synthetic and real-world healthcare datasets show that using these tools facilitates the selection of models that generalize more robustly and align with capacity constraints, offering a new direction for tackling the Rashomon Effect in clinical settings.
[COMMENTS]Accepted to the Workshop on Navigating Model Uncertainty and the Rashomon Effect: From Theory and Tools to Applications and Impact (AAAI 2026)
[LINK]http://arxiv.org/abs/2511.14317v7
[DATE]2026-01-12 21:16:09+08:00
[CATEGORIES]cs.LG
Land-then-transport: A Flow Matching-Based Generative Decoder for Wireless Image Transmission
[AUTHORS]Jingwen Fu, Ming Xiao, Mikael Skoglund, Dong In Kim
[ABSTRACT]Due to strict rate and reliability demands, wireless image transmission remains difficult for both classical layered designs and joint source-channel coding (JSCC), especially under low latency. Diffusion-based generative decoders can deliver strong perceptual quality by leveraging learned image priors, but iterative stochastic denoising leads to high decoding delay. To enable low-latency decoding, we propose a flow-matching (FM) generative decoder under a new land-then-transport (LTT) paradigm that tightly integrates the physical wireless channel into a continuous-time probability flow. For AWGN channels, we build a Gaussian smoothing path whose noise schedule indexes effective noise levels, and derive a closed-form teacher velocity field along this path. A neural-network student vector field is trained by conditional flow matching, yielding a deterministic, channel-aware ODE decoder with complexity linear in the number of ODE steps. At inference, it only needs an estimate of the effective noise variance to set the ODE starting time. We further show that Rayleigh fading and MIMO channels can be mapped, via linear MMSE equalization and singular-value-domain processing, to AWGN-equivalent channels with calibrated starting times. Therefore, the same probability path and trained velocity field can be reused for Rayleigh and MIMO without retraining. Experiments on MNIST, Fashion-MNIST, and DIV2K over AWGN, Rayleigh, and MIMO demonstrate consistent gains over JPEG2000+LDPC, DeepJSCC, and diffusion-based baselines, while achieving good perceptual quality with only a few ODE steps. Overall, LTT provides a deterministic, physically interpretable, and computation-efficient framework for generative wireless image decoding across diverse channels.
[LINK]http://arxiv.org/abs/2601.07512v1
[DATE]2026-01-12 21:09:37+08:00
[CATEGORIES]cs.LG
Beyond topography: Topographic regularization improves robustness and reshapes representations in convolutional neural networks
[AUTHORS]Nhut Truong, Uri Hasson
[ABSTRACT]Topographic convolutional neural networks (TCNNs) are computational models that can simulate aspects of the brain’s spatial and functional organization. However, it is unclear whether and how different types of topographic regularization shape robustness, representational structure, and functional organization during end-to-end training. We address this question by comparing TCNNs trained with two local spatial losses applied to a penultimate-layer topographic grid: i) Weight Similarity (WS), whose objective penalizes differences between neighboring units’ incoming weight vectors, and ii) Activation Similarity (AS), whose objective penalizes differences between neighboring units’ activation patterns over stimuli. We evaluate the trained models on classification accuracy, robustness to weight perturbations and input degradation, the spatial organization of learned representations, and development of category-selective “expert units” in the penultimate layer. Both losses changed inter-unit correlation structure, but in qualitatively different ways. WS produced smooth topographies, with correlated neighborhoods. In contrast, AS produced a bimodal inter-unit correlation structure that lacked spatial smoothness. AS and WS training increased robustness relative to control (non-topographic) models: AS improved robustness to image degradation on CIFAR-10, WS did so on MNIST, and both improved robustness to weight perturbations. WS was also associated with greater input sensitivity at the unit level and stronger functional localization. In addition, as compared to control models, both AS and WS produced differences in orientation tuning, symmetry sensitivity, and eccentricity profiles of units. Together, these results show that local topographic regularization can improve robustness during end-to-end training while systematically reshaping representational structure.
[COMMENTS]Title updated, edits, expanded analysis of expert units
[LINK]http://arxiv.org/abs/2508.00043v2
[DATE]2026-01-12 21:03:00+08:00
[CATEGORIES]cs.LG
FROAV: A Framework for RAG Observation and Agent Verification – Lowering the Barrier to LLM Agent Research
[AUTHORS]Tzu-Hsuan Lin, Chih-Hsuan Kao
[ABSTRACT]The rapid advancement of Large Language Models (LLMs) and their integration into autonomous agent systems has created unprecedented opportunities for document analysis, decision support, and knowledge retrieval. However, the complexity of developing, evaluating, and iterating on LLM-based agent workflows presents significant barriers to researchers, particularly those without extensive software engineering expertise. We present FROAV (Framework for RAG Observation and Agent Verification), an open-source research platform that democratizes LLM agent research by providing a plug-and-play architecture combining visual workflow orchestration, a comprehensive evaluation framework, and extensible Python integration. FROAV implements a multi-stage Retrieval-Augmented Generation (RAG) pipeline coupled with a rigorous “LLM-as-a-Judge” evaluation system, all accessible through intuitive graphical interfaces. Our framework integrates n8n for no-code workflow design, PostgreSQL for granular data management, FastAPI for flexible backend logic, and Streamlit for human-in-the-loop interaction. Through this integrated ecosystem, researchers can rapidly prototype RAG strategies, conduct prompt engineering experiments, validate agent performance against human judgments, and collect structured feedback-all without writing infrastructure code. We demonstrate the framework’s utility through its application to financial document analysis, while emphasizing its material-agnostic architecture that adapts to any domain requiring semantic analysis. FROAV represents a significant step toward making LLM agent research accessible to a broader scientific community, enabling researchers to focus on hypothesis testing and algorithmic innovation rather than system integration challenges.
[COMMENTS]8 pages, 1 figure, 3 tables
[LINK]http://arxiv.org/abs/2601.07504v1
[DATE]2026-01-12 21:02:32+08:00
[CATEGORIES]cs.LG
CaTS-Bench: Can Language Models Describe Time Series?
[AUTHORS]Luca Zhou, Pratham Yashwante, Marshall Fisher, Alessio Sampieri, Zihao Zhou, Fabio Galasso, Rose Yu
[ABSTRACT]Time series captioning, the task of describing time series in natural language, requires numeric and temporal reasoning, trend interpretation, and contextual understanding. Existing benchmarks, however, often rely on fully synthetic or generic captions, and typically neglect metadata and visual representations. We introduce CaTS-Bench, a comprehensive benchmark for Context-aware Time Series reasoning across $11$ diverse domains, centered on a gold-standard evaluation set of $1746$ human-rewritten captions that measure how effectively models translate numeric trends into immediately interpretable narratives. To address the scarcity of human-annotated data, we also propose a scalable pipeline for generating high-fidelity synthetic captions, the quality of which we validate. We evaluate leading Vision-Language Models on our benchmark, revealing that even proprietary models struggle to capture numeric nuances in temporal descriptions, while finetuning open-source models on synthetic data yields substantial performance gains. Finally, we release a diagnostic suite of $910$ multiple-choice questions and tailored numeric metrics to gauge time-series-specific reasoning capabilities, establishing CaTS-Bench as a reliable foundation for grounded, multimodal language generation in numeric domains.
[COMMENTS]8 pages, 6 figures, 3 tables in the main paper. Many more in the appendix
[LINK]http://arxiv.org/abs/2509.20823v4
[DATE]2026-01-12 21:01:45+08:00
[CATEGORIES]cs.LG
Decentralized Online Convex Optimization with Unknown Feedback Delays
[AUTHORS]Hao Qiu, Mengxiao Zhang, Juliette Achddou
[ABSTRACT]Decentralized online convex optimization (D-OCO), where multiple agents within a network collaboratively learn optimal decisions in real-time, arises naturally in applications such as federated learning, sensor networks, and multi-agent control. In this paper, we study D-OCO under unknown, time-and agent-varying feedback delays. While recent work has addressed this problem (Nguyen et al., 2024), existing algorithms assume prior knowledge of the total delay over agents and still suffer from suboptimal dependence on both the delay and network parameters. To overcome these limitations, we propose a novel algorithm that achieves an improved regret bound of O N $\sqrt$ d tot + N $\sqrt$ T (1-$σ$2) 1/4 , where T is the total horizon, d tot denotes the average total delay across agents, N is the number of agents, and 1 -$σ$ 2 is the spectral gap of the network. Our approach builds upon recent advances in D-OCO (Wan et al., 2024a), but crucially incorporates an adaptive learning rate mechanism via a decentralized communication protocol. This enables each agent to estimate delays locally using a gossip-based strategy without the prior knowledge of the total delay. We further extend our framework to the strongly convex setting and derive a sharper regret bound of O N $δ$max ln T $α$ , where $α$ is the strong convexity parameter and $δ$ max is the maximum number of missing observations averaged over agents. We also show that our upper bounds for both settings are tight up to logarithmic factors. Experimental results validate the effectiveness of our approach, showing improvements over existing benchmark algorithms.
[LINK]http://arxiv.org/abs/2601.07901v1
[DATE]2026-01-12 20:59:01+08:00
[CATEGORIES]cs.LG
Learning Design-Score Manifold to Guide Diffusion Models for Offline Optimization
[AUTHORS]Tailin Zhou, Zhilin Chen, Wenlong Lyu, Zhitang Chen, Danny H. K. Tsang, Jun Zhang
[ABSTRACT]Optimizing complex systems, from discovering therapeutic drugs to designing high-performance materials, remains a fundamental challenge across science and engineering, as the underlying rules are often unknown and costly to evaluate. Offline optimization aims to optimize designs for target scores using pre-collected datasets without system interaction. However, conventional approaches may fail beyond training data, predicting inaccurate scores and generating inferior designs. This paper introduces ManGO, a diffusion-based framework that learns the design-score manifold, capturing the design-score interdependencies holistically. Unlike existing methods that treat design and score spaces in isolation, ManGO unifies forward prediction and backward generation, attaining generalization beyond training data. Key to this is its derivative-free guidance for conditional generation, coupled with adaptive inference-time scaling that dynamically optimizes denoising paths. Extensive evaluations demonstrate that ManGO outperforms 24 single- and 10 multi-objective optimization methods across diverse domains, including synthetic tasks, robot control, material design, DNA sequence, and real-world engineering optimization.
[COMMENTS]This manuscript was accepted by npj AI
[LINK]http://arxiv.org/abs/2506.05680v2
[DATE]2026-01-12 20:56:43+08:00
[CATEGORIES]cs.LG
EMP: Enhance Memory in Data Pruning
[AUTHORS]Jinying Xiao, Ping Li, Jie Nie, Bin Ji, Shasha Li, Xiaodong Liu, Jun Ma, Qingbo Wu, Jie Yu
[ABSTRACT]Recently, large language and vision models have shown strong performance, but due to high pre-training and fine-tuning costs, research has shifted towards faster training via dataset pruning. Previous methods used sample loss as an evaluation criterion, aiming to select the most “difficult” samples for training. However, when the pruning rate increases, the number of times each sample is trained becomes more evenly distributed, which causes many critical or general samples to not be effectively fitted. We refer to this as Low-Frequency Learning (LFL). In other words, LFL prevents the model from remembering most samples. In our work, we decompose the scoring function of LFL, provide a theoretical explanation for the inefficiency of LFL, and propose adding a memory term to the scoring function to enhance the model’s memory capability, along with an approximation of this memory term. Similarly, we explore memory in Self-Supervised Learning (SSL), marking the first discussion on SSL memory. Using contrastive learning, we derive the memory term both theoretically and experimentally. Finally, we propose Enhance Memory Pruning (EMP), which addresses the issue of insufficient memory under high pruning rates by enhancing the model’s memory of data, thereby improving its performance. We evaluated the performance of EMP in tasks such as image classification, natural language understanding, and model pre-training. The results show that EMP can improve model performance under extreme pruning rates. For example, in the CIFAR100-ResNet50 pre-training task, with 70\% pruning, EMP outperforms current methods by 2.2\%.
[LINK]http://arxiv.org/abs/2408.16031v2
[DATE]2026-01-12 20:55:22+08:00
[CATEGORIES]cs.LG
Graph Inference Towards ICD Coding
[AUTHORS]Xiaoxiao Deng
[ABSTRACT]Automated ICD coding involves assigning standardized diagnostic codes to clinical narratives. The vast label space and extreme class imbalance continue to challenge precise prediction. To address these issues, LabGraph is introduced – a unified framework that reformulates ICD coding as a graph generation task. By combining adversarial domain adaptation, graph-based reinforcement learning, and perturbation regularization, LabGraph effectively enhances model robustness and generalization. In addition, a label graph discriminator dynamically evaluates each generated code, providing adaptive reward feedback during training. Experiments on benchmark datasets demonstrate that LabGraph consistently outperforms previous approaches on micro-F1, micro-AUC, and P@K.
[COMMENTS]6 pages, 2 figures, 2 tables
[LINK]http://arxiv.org/abs/2601.07496v1
[DATE]2026-01-12 20:51:21+08:00
[CATEGORIES]cs.LG
Multi-dimensional Autoscaling of Processing Services: A Comparison of Agent-based Methods
[AUTHORS]Boris Sedlak, Alireza Furutanpey, Zihang Wang, Víctor Casamayor Pujol, Schahram Dustdar
[ABSTRACT]Edge computing breaks with traditional autoscaling due to strict resource constraints, thus, motivating more flexible scaling behaviors using multiple elasticity dimensions. This work introduces an agent-based autoscaling framework that dynamically adjusts both hardware resources and internal service configurations to maximize requirements fulfillment in constrained environments. We compare four types of scaling agents: Active Inference, Deep Q Network, Analysis of Structural Knowledge, and Deep Active Inference, using two real-world processing services running in parallel: YOLOv8 for visual recognition and OpenCV for QR code detection. Results show all agents achieve acceptable SLO performance with varying convergence patterns. While the Deep Q Network benefits from pre-training, the structural analysis converges quickly, and the deep active inference agent combines theoretical foundations with practical scalability advantages. Our findings provide evidence for the viability of multi-dimensional agent-based autoscaling for edge environments and encourage future work in this research direction.
[LINK]http://arxiv.org/abs/2506.10420v2
[DATE]2026-01-12 20:51:01+08:00
[CATEGORIES]cs.LG
The Secretary Problem with Predictions and a Chosen Order
[AUTHORS]Helia Karisani, Mohammadreza Daneshvaramoli, Hedyeh Beyhaghi, Mohammad Hajiesmaili, Cameron Musco
[ABSTRACT]We study a learning-augmented variant of the secretary problem, recently introduced by Fujii and Yoshida (2023), in which the decision-maker has access to machine-learned predictions of candidate values. The central challenge is to balance consistency and robustness: when predictions are accurate, the algorithm should select a near-optimal secretary, while under inaccurate predictions it should still guarantee a bounded competitive ratio.
We consider both the classical Random Order Secretary Problem (ROSP), where candidates arrive in a uniformly random order, and a more natural learning-augmented model in which the decision-maker may choose the arrival order based on predicted values. We call this model the Chosen Order Secretary Problem (COSP), capturing scenarios such as interview schedules set in advance.
We propose a new randomized algorithm applicable to both ROSP and COSP. Our method switches from fully trusting predictions to a threshold-based rule once a large prediction deviation is detected. Let $ε\in [0,1]$ denote the maximum multiplicative prediction error. For ROSP, our algorithm achieves a competitive ratio of $\max\{0.221, (1-ε)/(1+ε)\}$, improving upon the prior bound of $\max\{0.215, (1-ε)/(1+ε)\}$. For COSP, we achieve $\max\{0.262, (1-ε)/(1+ε)\}$, surpassing the $0.25$ worst-case bound for prior approaches and moving closer to the classical secretary benchmark of $1/e \approx 0.368$. These results highlight the benefit of combining predictions with arrival-order control in online decision-making.
[COMMENTS]Accepted to the International Conference on Innovations in Theoretical Computer Science (ITCS 2026)
[LINK]http://arxiv.org/abs/2601.07482v1
[DATE]2026-01-12 20:34:49+08:00
[CATEGORIES]cs.LG
Hierarchic Flows to Estimate and Sample High-dimensional Probabilities
[AUTHORS]Etienne Lempereur, Stéphane Mallat
[ABSTRACT]Finding low-dimensional interpretable models of complex physical fields such as turbulence remains an open question, 80 years after the pioneer work of Kolmogorov. Estimating high-dimensional probability distributions from data samples suffers from an optimization and an approximation curse of dimensionality. It may be avoided by following a hierarchic probability flow from coarse to fine scales. This inverse renormalization group is defined by conditional probabilities across scales, renormalized in a wavelet basis. For a $\vvarphi^4$ scalar potential, sampling these hierarchic models avoids the critical slowing down at the phase transition. In a well chosen wavelet basis, conditional probabilities can be captured with low dimensional parametric models, because interactions between wavelet coefficients are local in space and scales. An outstanding issue is also to approximate non-Gaussian fields having long-range interactions in space and across scales. We introduce low-dimensional models of wavelet conditional probabilities with the scattering covariance. It is calculated with a second wavelet transform, which defines interactions over two hierarchies of scales. We estimate and sample these wavelet scattering models to generate 2D vorticity fields of turbulence, and images of dark matter densities.
[LINK]http://arxiv.org/abs/2405.03468v2
[DATE]2026-01-12 20:33:25+08:00
[CATEGORIES]cs.LG
Practical do-Shapley Explanations with Estimand-Agnostic Causal Inference
[AUTHORS]Álvaro Parafita, Tomas Garriga, Axel Brando, Francisco J. Cazorla
[ABSTRACT]Among explainability techniques, SHAP stands out as one of the most popular, but often overlooks the causal structure of the problem. In response, do-SHAP employs interventional queries, but its reliance on estimands hinders its practical application. To address this problem, we propose the use of estimand-agnostic approaches, which allow for the estimation of any identifiable query from a single model, making do-SHAP feasible on complex graphs. We also develop a novel algorithm to significantly accelerate its computation at a negligible cost, as well as a method to explain inaccessible Data Generating Processes. We demonstrate the estimation and computational performance of our approach, and validate it on two real-world datasets, highlighting its potential in obtaining reliable explanations.
[COMMENTS]Published at NeurIPS 2025
[LINK]http://arxiv.org/abs/2509.20211v2
[DATE]2026-01-12 20:28:25+08:00
[CATEGORIES]cs.LG
Task Prototype-Based Knowledge Retrieval for Multi-Task Learning from Partially Annotated Data
[AUTHORS]Youngmin Oh, Hyung-Il Kim, Jung Uk Kim
[ABSTRACT]Multi-task learning (MTL) is critical in real-world applications such as autonomous driving and robotics, enabling simultaneous handling of diverse tasks. However, obtaining fully annotated data for all tasks is impractical due to labeling costs. Existing methods for partially labeled MTL typically rely on predictions from unlabeled tasks, making it difficult to establish reliable task associations and potentially leading to negative transfer and suboptimal performance. To address these issues, we propose a prototype-based knowledge retrieval framework that achieves robust MTL instead of relying on predictions from unlabeled tasks. Our framework consists of two key components: (1) a task prototype embedding task-specific characteristics and quantifying task associations, and (2) a knowledge retrieval transformer that adaptively refines feature representations based on these associations. To achieve this, we introduce an association knowledge generating (AKG) loss to ensure the task prototype consistently captures task-specific characteristics. Extensive experiments demonstrate the effectiveness of our framework, highlighting its potential for robust multi-task learning, even when only a subset of tasks is annotated.
[COMMENTS]Accepted at AAAI 2026
[LINK]http://arxiv.org/abs/2601.07474v1
[DATE]2026-01-12 20:27:02+08:00
[CATEGORIES]cs.LG
Puzzle it Out: Local-to-Global World Model for Offline Multi-Agent Reinforcement Learning
[AUTHORS]Sijia li, Xinran Li, Shibo Chen, Jun Zhang
[ABSTRACT]Offline multi-agent reinforcement learning (MARL) aims to solve cooperative decision-making problems in multi-agent systems using pre-collected datasets. Existing offline MARL methods primarily constrain training within the dataset distribution, resulting in overly conservative policies that struggle to generalize beyond the support of the data. While model-based approaches offer a promising solution by expanding the original dataset with synthetic data generated from a learned world model, the high dimensionality, non-stationarity, and complexity of multi-agent systems make it challenging to accurately estimate the transitions and reward functions in offline MARL. Given the difficulty of directly modeling joint dynamics, we propose a local-to-global (LOGO) world model, a novel framework that leverages local predictions-which are easier to estimate-to infer global state dynamics, thus improving prediction accuracy while implicitly capturing agent-wise dependencies. Using the trained world model, we generate synthetic data to augment the original dataset, expanding the effective state-action space. To ensure reliable policy learning, we further introduce an uncertainty-aware sampling mechanism that adaptively weights synthetic data by prediction uncertainty, reducing approximation error propagation to policies. In contrast to conventional ensemble-based methods, our approach requires only an additional encoder for uncertainty estimation, significantly reducing computational overhead while maintaining accuracy. Extensive experiments across 8 scenarios against 8 baselines demonstrate that our method surpasses state-of-the-art baselines on standard offline MARL benchmarks, establishing a new model-based baseline for generalizable offline multi-agent learning.
[LINK]http://arxiv.org/abs/2601.07463v1
[DATE]2026-01-12 20:17:11+08:00
[CATEGORIES]cs.LG
R-LAM: Reproducibility-Constrained Large Action Models for Scientific Workflow Automation
[AUTHORS]Suriya Sureshkumar
[ABSTRACT]Large Action Models (LAMs) extend large language models by enabling autonomous decision-making and tool execution, making them promising for automating scientific workflows. However, scientific workflows impose strict requirements on reproducibility, auditability, and deterministic execution, which are not satisfied by generic LLM-based agents. Unconstrained action generation can lead to silent state changes, non-deterministic executions, and irreproducible experimental results, limiting the applicability of LAMs in scientific settings.
In this paper, we propose R-LAM, a reproducibility-constrained framework for applying Large Action Models to scientific workflow automation. R-LAM introduces structured action schemas, deterministic execution policies, and explicit provenance tracking to ensure that every action and intermediate artifact is auditable and replayable. The framework supports failure-aware execution loops and controlled workflow forking, enabling iterative experimentation without compromising reproducibility.
We implement R-LAM as a lightweight Python framework and release it as an open-source PyPI package to facilitate reproducible research. An experimental evaluation of representative scientific workflows demonstrates that R-LAM improves reproducibility success rates and execution reliability compared to unconstrained LLM-based agents, while retaining adaptive control over workflow execution.
[COMMENTS]9 pages, 3 figures, 1 Table, 2 Artifacts
[LINK]http://arxiv.org/abs/2601.09749v1
[DATE]2026-01-12 20:13:14+08:00
[CATEGORIES]cs.LG
Improving Video Question Answering through query-based frame selection
[AUTHORS]Himanshu Patil, Geo Jolly, Ramana Raja Buddala, Ganesh Ramakrishnan, Rohit Saluja
[ABSTRACT]Video Question Answering (VideoQA) models enhance understanding and interaction with audiovisual content, making it more accessible, searchable, and useful for a wide range of fields such as education, surveillance, entertainment, and content creation. Due to heavy compute requirements, most large visual language models (VLMs) for VideoQA rely on a fixed number of frames by uniformly sampling the video. However, this process does not pick important frames or capture the context of the video. We present a novel query-based selection of frames relevant to the questions based on the submodular mutual Information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of multi-action video tasks. VideoQA accuracy on this dataset was assessed using two VLMs, namely Video-LLaVA and LLaVA-NeXT, both of which originally employed uniform frame sampling. Experiments were conducted using both uniform and query-based sampling strategies. An accuracy improvement of up to \textbf{4\%} was observed when using query-based frame selection over uniform sampling. Qualitative analysis further highlights that query-based selection, using SMI functions, consistently picks frames better aligned with the question. We opine that such query-based frame selection can enhance accuracy in a wide range of tasks that rely on only a subset of video frames.
[LINK]http://arxiv.org/abs/2601.07459v1
[DATE]2026-01-12 20:10:20+08:00
[CATEGORIES]cs.LG
Accelerating Targeted Hard-Label Adversarial Attacks in Low-Query Black-Box Settings
[AUTHORS]Arjhun Swaminathan, Mete Akgün
[ABSTRACT]Deep neural networks for image classification remain vulnerable to adversarial examples – small, imperceptible perturbations that induce misclassifications. In black-box settings, where only the final prediction is accessible, crafting targeted attacks that aim to misclassify into a specific target class is particularly challenging due to narrow decision regions. Current state-of-the-art methods often exploit the geometric properties of the decision boundary separating a source image and a target image rather than incorporating information from the images themselves. In contrast, we propose Targeted Edge-informed Attack (TEA), a novel attack that utilizes edge information from the target image to carefully perturb it, thereby producing an adversarial image that is closer to the source image while still achieving the desired target classification. Our approach consistently outperforms current state-of-the-art methods across different models in low query settings (nearly 70% fewer queries are used), a scenario especially relevant in real-world applications with limited queries and black-box access. Furthermore, by efficiently generating a suitable adversarial example, TEA provides an improved target initialization for established geometry-based attacks.
[COMMENTS]This paper contains 10 pages, 8 figures and 8 tables. For associated supplementary code, see https://github.com/mdppml/TEA. This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore
[LINK]http://arxiv.org/abs/2505.16313v3
[DATE]2026-01-12 20:05:34+08:00
[CATEGORIES]cs.LG
Studying Illustrations in Manuscripts: An Efficient Deep-Learning Approach
[AUTHORS]Yoav Evron, Michal Bar-Asher Siegal, Michael Fire
[ABSTRACT]The recent Artificial Intelligence (AI) revolution has opened transformative possibilities for the humanities, particularly in unlocking the visual-artistic content embedded in historical illuminated manuscripts. While digital archives now offer unprecedented access to these materials, the ability to systematically locate, extract, and analyze illustrations at scale remains a major challenge. We present a general and scalable AI-based pipeline for large-scale visual analysis of illuminated manuscripts. The framework integrates modern deep-learning models for page-level illustration detection, illustration extraction, and multimodal description, enabling scholars to search, cluster, and study visual materials and artistic trends across entire corpora. We demonstrate the applicability of this approach on large heterogeneous collections, including the Vatican Library and richly illuminated manuscripts such as the Bible of Borso d’Este. The system reveals meaningful visual patterns and cross-manuscript relationships by embedding illustrations into a shared representation space and analyzing their similarity structure (see figure 4). By harnessing recent advances in computer vision and vision-language models, our framework enables new forms of large-scale visual scholarship in historical studies, art history, and cultural heritage making it possible to explore iconography, stylistic trends, and cultural connections in ways that were previously impractical.
[COMMENTS]17 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.05269v2
[DATE]2026-01-12 19:37:13+08:00
[CATEGORIES]cs.LG
Surrogate-based Optimization via Clustering for Box-Constrained Problems
[AUTHORS]Maaz Ahmad, Iftekhar A. Karimi
[ABSTRACT]Global optimization of large-scale, complex systems such as multi-physics black-box simulations and real-world industrial systems is important but challenging. This work presents a novel Surrogate-Based Optimization framework based on Clustering, SBOC for global optimization of such systems, which can be used with any surrogate modeling technique. At each iteration, it uses a single surrogate model for the entire domain, employs k-means clustering to identify unexplored domain, and exploits a local region around the surrogate optimum to potentially add three new sample points in the domain. SBOC has been tested against sixteen promising benchmarking algorithms using 52 analytical test functions of varying input dimensionalities and shape profiles. It successfully identified a global minimum for most test functions with substantially lower computational effort than other algorithms. It worked especially well on test functions with four or more input variables. It was also among the top six algorithms in approaching a global minimum closely. Overall, SBOC is a robust, reliable, and efficient algorithm for global optimization of box-constrained systems.
[COMMENTS]34 pages, 4 Figures, 8 Tables
[LINK]http://arxiv.org/abs/2601.07442v1
[DATE]2026-01-12 19:29:53+08:00
[CATEGORIES]cs.LG
Variational Autoencoder with Normalizing flow for X-ray spectral fitting
[AUTHORS]Fiona Redmen, Ethan Tregidga, James F. Steiner, Cecilia Garraffo
[ABSTRACT]Black hole X-ray binaries (BHBs) can be studied with spectral fitting to provide physical constraints on accretion in extreme gravitational environments. Traditional methods of spectral fitting such as Markov Chain Monte Carlo (MCMC) face limitations due to computational times. We introduce a probabilistic model, utilizing a variational autoencoder with a normalizing flow, trained to adopt a physical latent space. This neural network produces predictions for spectral-model parameters as well as their full probability distributions. Our implementations result in a significant improvement in spectral reconstructions over a previous deterministic model while performing three orders of magnitude faster than traditional methods.
[COMMENTS]7 pages, 1 table, 3 figures. Accepted as a workshop paper to Machine Learning and the Physical Sciences at NeurIPS 2025
[LINK]http://arxiv.org/abs/2601.07440v1
[DATE]2026-01-12 19:28:40+08:00
[CATEGORIES]cs.LG
Position: Don’t be Afraid of Over-Smoothing And Over-Squashing
[AUTHORS]Niklas Kormann, Benjamin Doerr, Johannes F. Lutzeyer
[ABSTRACT]Over-smoothing and over-squashing have been extensively studied in the literature on Graph Neural Networks (GNNs) over the past years. We challenge this prevailing focus in GNN research, arguing that these phenomena are less critical for practical applications than assumed. We suggest that performance decreases often stem from uninformative receptive fields rather than over-smoothing. We support this position with extensive experiments on several standard benchmark datasets, demonstrating that accuracy and over-smoothing are mostly uncorrelated and that optimal model depths remain small even with mitigation techniques, thus highlighting the negligible role of over-smoothing. Similarly, we challenge that over-squashing is always detrimental in practical applications. Instead, we posit that the distribution of relevant information over the graph frequently factorises and is often localised within a small k-hop neighbourhood, questioning the necessity of jointly observing entire receptive fields or engaging in an extensive search for long-range interactions. The results of our experiments show that architectural interventions designed to mitigate over-squashing fail to yield significant performance gains. This position paper advocates for a paradigm shift in theoretical research, urging a diligent analysis of learning tasks and datasets using statistics that measure the underlying distribution of label-relevant information to better understand their localisation and factorisation.
[COMMENTS]Preprint. Copyright 2026 by the authors
[LINK]http://arxiv.org/abs/2601.07419v1
[DATE]2026-01-12 19:02:57+08:00
[CATEGORIES]cs.LG
PLANET v2.0: A comprehensive Protein-Ligand Affinity Prediction Model Based on Mixture Density Network
[AUTHORS]Haotian Gao, Xiangying Zhang, Jingyuan Li, Xinchong Chen, Haojie Wang, Yifei Qi, Renxiao Wang
[ABSTRACT]Drug discovery represents a time-consuming and financially intensive process, and virtual screening can accelerate it. Scoring functions, as one of the tools guiding virtual screening, have their precision closely tied to screening efficiency. In our previous study, we developed a graph neural network model called PLANET (Protein-Ligand Affinity prediction NETwork), but it suffers from the defect in representing protein-ligand contact maps. Incorrect binding modes inevitably lead to poor affinity predictions, so accurate prediction of the protein-ligand contact map is desired to improve PLANET. In this study, we have proposed PLANET v2.0 as an upgraded version. The model is trained via multi-objective training strategy and incorporates the Mixture Density Network to predict binding modes. Except for the probability density distributions of non-covalent interactions, we innovatively employ another Gaussian mixture model to describe the relationship between distance and energy of each interaction pair and predict protein-ligand affinity like calculating the mathematical expectation. As on the CASF-2016 benchmark, PLANET v2.0 demonstrates excellent scoring power, ranking power, and docking power. The screening power of PLANET v2.0 gets notably improved compared to PLANET and Glide SP and it demonstrates robust validation on a commercial ultra-large-scale dataset. Given its efficiency and accuracy, PLANET v2.0 can hopefully become one of the practical tools for virtual screening workflows. PLANET v2.0 is freely available at https://www.pdbbind-plus.org.cn/planetv2.
[LINK]http://arxiv.org/abs/2601.07415v1
[DATE]2026-01-12 18:56:29+08:00
[CATEGORIES]cs.LG
The Practicality of Normalizing Flow Test-Time Training in Bayesian Inference for Agent-Based Models
[AUTHORS]Junyao Zhang, Jinglai Li, Junqi Tang
[ABSTRACT]Agent-Based Models (ABMs) are gaining great popularity in economics and social science because of their strong flexibility to describe the realistic and heterogeneous decisions and interaction rules between individual agents. In this work, we investigate for the first time the practicality of test-time training (TTT) of deep models such as normalizing flows, in the parameters posterior estimations of ABMs. We propose several practical TTT strategies for fine-tuning the normalizing flow against distribution shifts. Our numerical study demonstrates that TTT schemes are remarkably effective, enabling real-time adjustment of flow-based inference for ABM parameters.
[LINK]http://arxiv.org/abs/2601.07413v1
[DATE]2026-01-12 18:55:01+08:00
[CATEGORIES]cs.LG
BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning
[AUTHORS]Yunpeng Qing, Yixiao Chi, Shuo Chen, Shunyu Liu, Kelu Yao, Sixu Lin, Litao Liu, Changqing Zou
[ABSTRACT]Recent advances in offline Reinforcement Learning (RL) have proven that effective policy learning can benefit from imposing conservative constraints on pre-collected datasets. However, such static datasets often exhibit distribution bias, resulting in limited generalizability. To address this limitation, a straightforward solution is data augmentation (DA), which leverages generative models to enrich data distribution. Despite the promising results, current DA techniques focus solely on reconstructing future trajectories from given states, while ignoring the exploration of history transitions that reach them. This single-direction paradigm inevitably hinders the discovery of diverse behavior patterns, especially those leading to critical states that may have yielded high-reward outcomes. In this work, we introduce Bidirectional Trajectory Diffusion (BiTrajDiff), a novel DA framework for offline RL that models both future and history trajectories from any intermediate states. Specifically, we decompose the trajectory generation task into two independent yet complementary diffusion processes: one generating forward trajectories to predict future dynamics, and the other generating backward trajectories to trace essential history transitions.BiTrajDiff can efficiently leverage critical states as anchors to expand into potentially valuable yet underexplored regions of the state space, thereby facilitating dataset diversity. Extensive experiments on the D4RL benchmark suite demonstrate that BiTrajDiff achieves superior performance compared to other advanced DA methods across various offline RL backbones.
[LINK]http://arxiv.org/abs/2506.05762v4
[DATE]2026-01-12 18:54:04+08:00
[CATEGORIES]cs.LG
GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes
[AUTHORS]Mohammad Pivezhandi, Mahdi Banisharif, Saeed Bakhshan, Abusayeed Saifullah, Ali Jannesari
[ABSTRACT]Performance prediction for OpenMP workloads on heterogeneous embedded SoCs is challenging due to complex interactions between task DAG structure, control-flow irregularity, cache and branch behavior, and thermal dynamics; classical heuristics struggle under workload irregularity, tabular regressors discard structural information, and model-free RL risks overheating resource-constrained devices. We introduce GraphPerf-RT, the first surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph representation with typed edges encoding precedence, placement, and contention. Multi-task evidential heads predict makespan, energy, cache and branch misses, and utilization with calibrated uncertainty (Normal-Inverse-Gamma), enabling risk-aware scheduling that filters low-confidence rollouts. We validate GraphPerf-RT on three embedded ARM platforms (Jetson TX2, Jetson Orin NX, RUBIK Pi), achieving R^2 > 0.95 with well-calibrated uncertainty (ECE < 0.05). To demonstrate end-to-end scheduling utility, we integrate the surrogate with four RL methods on Jetson TX2: single-agent model-free (SAMFRL), single-agent model-based (SAMBRL), multi-agent model-free (MAMFRL-D3QN), and multi-agent model-based (MAMBRL-D3QN). Experiments across 5 seeds (200 episodes each) show that MAMBRL-D3QN with GraphPerf-RT as the world model achieves 66% makespan reduction (0.97 +/- 0.35s) and 82% energy reduction (0.006 +/- 0.005J) compared to model-free baselines, demonstrating that accurate, uncertainty-aware surrogates enable effective model-based planning on thermally constrained embedded systems.
[COMMENTS]36 pages, 1 figure, 7 tables
[LINK]http://arxiv.org/abs/2512.12091v2
[DATE]2026-01-12 18:30:16+08:00
[CATEGORIES]cs.LG
OceanSAR-2: A Universal Feature Extractor for SAR Ocean Observation
[AUTHORS]Alexandre Tuel, Thomas Kerdreux, Quentin Febvre, Alexis Mouche, Antoine Grouazel, Jean-Renaud Miadana, Antoine Audras, Chen Wang, Bertrand Chapron
[ABSTRACT]We present OceanSAR-2, the second generation of our foundation model for SAR-based ocean observation. Building on our earlier release, which pioneered self-supervised learning on Sentinel-1 Wave Mode data, OceanSAR-2 relies on improved SSL training and dynamic data curation strategies, which enhances performance while reducing training cost. OceanSAR-2 demonstrates strong transfer performance across downstream tasks, including geophysical pattern classification, ocean surface wind vector and significant wave height estimation, and iceberg detection. We release standardized benchmark datasets, providing a foundation for systematic evaluation and advancement of SAR models for ocean applications.
[COMMENTS]accepted at EUSAR 2026
[LINK]http://arxiv.org/abs/2601.07392v1
[DATE]2026-01-12 18:20:43+08:00
[CATEGORIES]cs.LG
SAD Neural Networks: Divergent Gradient Flows and Asymptotic Optimality via o-minimal Structures
[AUTHORS]Julian Kranz, Davide Gallon, Steffen Dereich, Arnulf Jentzen
[ABSTRACT]We study gradient flows for loss landscapes of fully connected feedforward neural networks with commonly used continuously differentiable activation functions such as the logistic, hyperbolic tangent, softplus or GELU function. We prove that the gradient flow either converges to a critical point or diverges to infinity while the loss converges to an asymptotic critical value. Moreover, we prove the existence of a threshold $\varepsilon>0$ such that the loss value of any gradient flow initialized at most $\varepsilon$ above the optimal level converges to it. For polynomial target functions and sufficiently big architecture and data set, we prove that the optimal loss value is zero and can only be realized asymptotically. From this setting, we deduce our main result that any gradient flow with sufficiently good initialization diverges to infinity. Our proof heavily relies on the geometry of o-minimal structures. We confirm these theoretical findings with numerical experiments and extend our investigation to more realistic scenarios, where we observe an analogous behavior.
[COMMENTS]Accepted for NeurIPS 2025, 30 pages, 6 figures. The result about continuous data distributions now has an additional assumption since a gap was identified in a previous version of the proof
[LINK]http://arxiv.org/abs/2505.09572v3
[DATE]2026-01-12 18:12:47+08:00
[CATEGORIES]cs.LG
Graph Attention Specialized Expert Fusion Model for Node Classification: Based on Cora and Pubmed Datasets
[AUTHORS]Zihang Ma, Qitian Yin
[ABSTRACT]Graph node classification is a fundamental task in graph neural networks (GNNs), aiming to assign predefined class labels to nodes. On the PubMed citation network dataset, we observe significant classification difficulty disparities, with Category 2 achieving only 74.4% accuracy in traditional GCN, 7.5% lower than Category 1. To address this, we propose a Wasserstein-Rubinstein (WR) distance enhanced Expert Fusion Model (WR-EFM), training specialized GNN models for Categories 0/1 (with layer normalization and residual connections) and Multi-hop Graph Attention Networks (GAT) for Category 2. The WR distance metric optimizes representation similarity between models, particularly focusing on improving Category 2 performance. Our adaptive fusion strategy dynamically weights models based on category-specific performance, with Category 2 assigned a GAT weight of 0.8. WR distance further guides the fusion process by measuring distributional differences between model representations, enabling more principled integration of complementary features.
Experimental results show WR-EFM achieves balanced accuracy across categories: 77.8% (Category 0), 78.0% (Category 1), and 79.9% (Category 2), outperforming both single models and standard fusion approaches. The coefficient of variation (CV) of WR-EFM’s category accuracies is 0.013, 77.6% lower than GCN’s 0.058, demonstrating superior stability. Notably, WR-EFM improves Category 2 accuracy by 5.5% compared to GCN, verifying the effectiveness of WR-guided fusion in capturing complex structural patterns. This work provides a novel paradigm for handling class-imbalanced graph classification tasks. To promote the research community, we release our project at https://github.com/s010m00n/GASEM4NC.
[COMMENTS]It’s written so poorly, my bad
[LINK]http://arxiv.org/abs/2507.15784v2
[DATE]2026-01-12 18:07:48+08:00
[CATEGORIES]cs.LG
Computing patient similarity based on unstructured clinical notes
[AUTHORS]Petr Zelina, Marko Řeháček, Jana Halámková, Lucia Bohovicová, Martin Rusinko, Vít Nováček
[ABSTRACT]Clinical notes hold rich yet unstructured details about diagnoses, treatments, and outcomes that are vital to precision medicine but hard to exploit at scale. We introduce a method that represents each patient as a matrix built from aggregated embeddings of all their notes, enabling robust patient similarity computation based on their latent low-rank representations. Using clinical notes of 4,267 Czech breast-cancer patients and expert similarity labels from Masaryk Memorial Cancer Institute, we evaluate several matrix-based similarity measures and analyze their strengths and limitations across different similarity facets, such as clinical history, treatment, and adverse events. The results demonstrate the usefulness of the presented method for downstream tasks, such as personalized therapy recommendations or toxicity warnings.
[COMMENTS]This is a preprint and has not undergone peer review. Final version was presented at the Text, Speech, and Dialogue 2025 conference. The Version of Record is available at https://doi.org/10.1007/978-3-032-02551-7_13
[LINK]http://arxiv.org/abs/2601.07385v1
[DATE]2026-01-12 18:04:57+08:00
[CATEGORIES]cs.LG
CompNO: A Novel Foundation Model approach for solving Partial Differential Equations
[AUTHORS]Hamda Hmida, Hsiu-Wen Chang Joly, Youssef Mesri
[ABSTRACT]Partial differential equations (PDEs) govern a wide range of physical phenomena, but their numerical solution remains computationally demanding, especially when repeated simulations are required across many parameter settings. Recent Scientific Foundation Models (SFMs) aim to alleviate this cost by learning universal surrogates from large collections of simulated systems, yet they typically rely on monolithic architectures with limited interpretability and high pretraining expense. In this work we introduce Compositional Neural Operators (CompNO), a compositional neural operator framework for parametric PDEs. Instead of pretraining a single large model on heterogeneous data, CompNO first learns a library of Foundation Blocks, where each block is a parametric Fourier neural operator specialized to a fundamental differential operator (e.g. convection, diffusion, nonlinear convection). These blocks are then assembled, via lightweight Adaptation Blocks, into task-specific solvers that approximate the temporal evolution operator for target PDEs. A dedicated boundary-condition operator further enforces Dirichlet constraints exactly at inference time. We validate CompNO on one-dimensional convection, diffusion, convection–diffusion and Burgers’ equations from the PDEBench suite. The proposed framework achieves lower relative L2 error than strong baselines (PFNO, PDEFormer and in-context learning based models) on linear parametric systems, while remaining competitive on nonlinear Burgers’ flows. The model maintains exact boundary satisfaction with zero loss at domain boundaries, and exhibits robust generalization across a broad range of Peclet and Reynolds numbers. These results demonstrate that compositional neural operators provide a scalable and physically interpretable pathway towards foundation models for PDEs.
[COMMENTS]Under review at MDPI
[LINK]http://arxiv.org/abs/2601.07384v1
[DATE]2026-01-12 18:04:48+08:00
[CATEGORIES]cs.LG
OrderFusion: Encoding Orderbook for End-to-End Probabilistic Intraday Electricity Price Forecasting
[AUTHORS]Runyao Yu, Yuchen Tao, Fabian Leimgruber, Tara Esterl, Jochen Stiasny, Derek W. Bunn, Qingsong Wen, Hongye Guo, Jochen L. Cremer
[ABSTRACT]Probabilistic intraday electricity price forecasting is becoming increasingly important with the growth of renewable generation and the rise in demand-side engagement. Their uncertainties have increased the trading risks closer to delivery and the subsequent imbalance settlement costs. As a consequence, intraday trading has emerged to mitigate these risks. Unlike auction markets, intraday trading in many jurisdictions is characterized by the continuous posting of buy and sell orders on power exchange platforms. This dynamic orderbook microstructure of price formation presents special challenges for price forecasting. Conventional methods represent the orderbook via domain features aggregated from buy and sell trades, or by treating it as a multivariate time series, but such representations neglect the full buy-sell interaction structure of the orderbook. This research therefore develops a new order fusion methodology, which is an end-to-end and parameter-efficient probabilistic forecasting model that learns a full interaction-aware representation of the buy-sell dynamics. Furthermore, as quantile crossing is often a problem in probabilistic forecasting, this approach hierarchically estimates the quantiles with non-crossing constraints. Extensive experiments on the market price indices across high-liquidity (German) and low-liquidity (Austrian) markets demonstrate consistent improvements over conventional baselines, and ablation studies highlight the contributions of the main modeling components. The methodology is available at: https://runyao-yu.github.io/OrderFusion/.
[COMMENTS]10 pages, 3 figures, 5 tables
[LINK]http://arxiv.org/abs/2502.06830v4
[DATE]2026-01-12 17:55:39+08:00
[CATEGORIES]cs.LG
Approximating Persistent Homology for Large Datasets
[AUTHORS]Yueqi Cao, Anthea Monod
[ABSTRACT]Persistent homology is an important methodology in topological data analysis which adapts theory from algebraic topology to data settings. Computing persistent homology produces persistence diagrams, which have been successfully used in diverse domains. Despite its widespread use, persistent homology is simply impossible to compute when a dataset is very large. We study a statistical approach to the problem of computing persistent homology for massive datasets using a multiple subsampling framework and extend it to three summaries of persistent homology: Hölder continuous vectorizations of persistence diagrams; the alternative representation as persistence measures; and standard persistence diagrams. Specifically, we derive finite sample convergence rates for empirical means for persistent homology and practical guidance on interpreting and tuning parameters. We validate our approach through extensive experiments on both synthetic and real-world data. We demonstrate the performance of multiple subsampling in a permutation test to analyze the topological structure of Poincaré embeddings of large lexical databases.
[COMMENTS]42 pages, 11 figures
[LINK]http://arxiv.org/abs/2204.09155v3
[DATE]2026-01-12 17:38:34+08:00
[CATEGORIES]cs.LG
Revealing the Attention Floating Mechanism in Masked Diffusion Models
[AUTHORS]Xin Dai, Pengcheng Huang, Zhenghao Liu, Shuo Wang, Yukun Yan, Chaojun Xiao, Yu Gu, Ge Yu, Maosong Sun
[ABSTRACT]Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge-intensive tasks. All codes and datasets are available at https://github.com/NEUIR/Attention-Floating.
[LINK]http://arxiv.org/abs/2601.07894v1
[DATE]2026-01-12 17:10:05+08:00
[CATEGORIES]cs.LG
SEE: Signal Embedding Energy for Quantifying Noise Interference in Large Audio Language Models
[AUTHORS]Yuanhe Zhang, Jiayu Tian, Yibo Zhang, Shilinlu Yan, Liang Lin, Zhenhong Zhou, Li Sun, Sen Su
[ABSTRACT]Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model’s internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
[LINK]http://arxiv.org/abs/2601.07331v1
[DATE]2026-01-12 16:57:55+08:00
[CATEGORIES]cs.LG
Convergence Rate Analysis of the AdamW-Style Shampoo: Unifying One-sided and Two-Sided Preconditioning
[AUTHORS]Huan Li, Yiming Dong, Zhouchen Lin
[ABSTRACT]This paper studies the AdamW-style Shampoo optimizer, an effective implementation of classical Shampoo that notably won the external tuning track of the AlgoPerf neural network training algorithm competition. Our analysis unifies one-sided and two-sided preconditioning and establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[|\nabla f(X_k)|*\right]\leq O(\frac{\sqrt{m+n}C}{K^{1/4}})$ measured by nuclear norm, where $K$ represents the iteration number, $(m,n)$ denotes the size of matrix parameters, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, we have $|\nabla f(X)|_F\leq |\nabla f(X)|\leq \sqrt{m+n}|\nabla f(X)|_F$, supporting that our convergence rate can be considered to be analogous to the optimal $\frac{1}{K}\sum_{k=1}^KE\left[|\nabla f(X_k)|_F\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD in the ideal case of $|\nabla f(X)|_= Θ(\sqrt{m+n})|\nabla f(X)|_F$.
[LINK]http://arxiv.org/abs/2601.07326v1
[DATE]2026-01-12 16:51:03+08:00
[CATEGORIES]cs.LG
Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification
[AUTHORS]Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, Dapeng Wu
[ABSTRACT]The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, or 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces a 3:4 fine-grained sparsity that achieves a regularized 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify weight trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA-3.2 across five benchmarks demonstrate that Sherry matches state-of-the-art ternary performance while significantly reducing model size. Notably, on an Intel i7-14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and 10% speed up. The code is available at https://github.com/Tencent/AngelSlim .
[LINK]http://arxiv.org/abs/2601.07892v1
[DATE]2026-01-12 16:49:34+08:00
[CATEGORIES]cs.LG
Variational Approximations for Robust Bayesian Inference via Rho-Posteriors
[AUTHORS]EL Mahdi Khribch, Pierre Alquier
[ABSTRACT]The $ρ$-posterior framework provides universal Bayesian estimation with explicit contamination rates and optimal convergence guarantees, but has remained computationally difficult due to an optimization over reference distributions that precludes intractable posterior computation. We develop a PAC-Bayesian framework that recovers these theoretical guarantees through temperature-dependent Gibbs posteriors, deriving finite-sample oracle inequalities with explicit rates and introducing tractable variational approximations that inherit the robustness properties of exact $ρ$-posteriors. Numerical experiments demonstrate that this approach achieves theoretical contamination rates while remaining computationally feasible, providing the first practical implementation of $ρ$-posterior inference with rigorous finite-sample guarantees.
[COMMENTS]53 pages including the proofs in appendices, 16 figures
[LINK]http://arxiv.org/abs/2601.07325v1
[DATE]2026-01-12 16:47:13+08:00
[CATEGORIES]cs.LG
Segmental Advantage Estimation: Enhancing PPO for Long-Context LLM Training
[AUTHORS]Xue Gong, Qi Yi, Ziyuan Nan, Guanhua Huang, Kejiao Li, Yuhao Jiang, Ruibin Xiong, Zenan Xu, Jiaming Guo, Shaohui Peng, Bo Zhou
[ABSTRACT]Training Large Language Models (LLMs) for reasoning tasks is increasingly driven by Reinforcement Learning with Verifiable Rewards (RLVR), where Proximal Policy Optimization (PPO) provides a principled framework for stable policy updates. However, the practical application of PPO is hindered by unreliable advantage estimation in the sparse-reward RLVR regime. This issue arises because the sparse rewards in RLVR lead to inaccurate intermediate value predictions, which in turn introduce significant bias when aggregated at every token by Generalized Advantage Estimation (GAE). To address this, we introduce Segmental Advantage Estimation (SAE), which mitigates the bias that GAE can incur in RLVR. Our key insight is that aggregating $n$-step advantages at every token(as in GAE) is unnecessary and often introduces excessive bias, since individual tokens carry minimal information. Instead, SAE first partitions the generated sequence into coherent sub-segments using low-probability tokens as heuristic boundaries. It then selectively computes variance-reduced advantage estimates only from these information-rich segment transitions, effectively filtering out noise from intermediate tokens. Our experiments demonstrate that SAE achieves superior performance, with marked improvements in final scores, training stability, and sample efficiency. These gains are shown to be consistent across multiple model sizes, and a correlation analysis confirms that our proposed advantage estimator achieves a higher correlation with an approximate ground-truth advantage, justifying its superior performance.
[LINK]http://arxiv.org/abs/2601.07320v1
[DATE]2026-01-12 16:41:47+08:00
[CATEGORIES]cs.LG
DyMixOp: Guiding Neural Operator Design for PDEs from a Complex Dynamics Perspective with Local-Global-Mixing
[AUTHORS]Pengyu Lai, Yixiao Chen, Hui Xu
[ABSTRACT]A primary challenge in using neural networks to approximate nonlinear dynamical systems governed by partial differential equations (PDEs) is transforming these systems into a suitable format, especially when dealing with non-linearizable dynamics or the need for infinite-dimensional spaces for linearization. This paper introduces DyMixOp, a novel neural operator framework for PDEs that integrates insights from complex dynamical systems to address this challenge. Grounded in inertial manifold theory, DyMixOp transforms infinite-dimensional nonlinear PDE dynamics into a finite-dimensional latent space, establishing a structured foundation that maintains essential nonlinear interactions and enhances physical interpretability. A key innovation is the Local-Global-Mixing (LGM) transformation, inspired by convection dynamics in turbulence. This transformation effectively captures both fine-scale details and nonlinear interactions, while mitigating spectral bias commonly found in existing neural operators. The framework is further strengthened by a dynamics-informed architecture that connects multiple LGM layers to approximate linear and nonlinear dynamics, reflecting the temporal evolution of dynamical systems. Experimental results across diverse PDE benchmarks demonstrate that DyMixOp achieves state-of-the-art performance, significantly reducing prediction errors, particularly in convection-dominated scenarios reaching up to 86.7\%, while maintaining computational efficiency and scalability.
[LINK]http://arxiv.org/abs/2508.13490v2
[DATE]2026-01-12 16:40:45+08:00
[CATEGORIES]cs.LG
BEAT-Net: Injecting Biomimetic Spatio-Temporal Priors for Interpretable ECG Classification
[AUTHORS]Runze Ma, Caizhi Liao
[ABSTRACT]Although deep learning has advanced automated electrocardiogram (ECG) diagnosis, prevalent supervised methods typically treat recordings as undifferentiated one-dimensional (1D) signals or two-dimensional (2D) images. This formulation compels models to learn physiological structures implicitly, resulting in data inefficiency and opacity that diverge from medical reasoning. To address these limitations, we propose BEAT-Net, a Biomimetic ECG Analysis with Tokenization framework that reformulates the problem as a language modeling task. Utilizing a QRS tokenization strategy to transform continuous signals into biologically aligned heartbeat sequences, the architecture explicitly decomposes cardiac physiology through specialized encoders that extract local beat morphology while normalizing spatial lead perspectives and modeling temporal rhythm dependencies. Evaluations across three large-scale benchmarks demonstrate that BEAT-Net matches the diagnostic accuracy of dominant convolutional neural network (CNN) architectures while substantially improving robustness. The framework exhibits exceptional data efficiency, recovering fully supervised performance using only 30 to 35 percent of annotated data. Moreover, learned attention mechanisms provide inherent interpretability by spontaneously reproducing clinical heuristics, such as Lead II prioritization for rhythm analysis, without explicit supervision. These findings indicate that integrating biological priors offers a computationally efficient and interpretable alternative to data-intensive large-scale pre-training.
[COMMENTS]8 pages, 4 figures and 2 tables
[LINK]http://arxiv.org/abs/2601.07316v1
[DATE]2026-01-12 16:37:47+08:00
[CATEGORIES]cs.LG
Explaining Machine Learning Predictive Models through Conditional Expectation Methods
[AUTHORS]Silvia Ruiz-España, Laura Arnal, François Signol, Juan-Carlos Perez-Cortes, Joaquim Arlandis
[ABSTRACT]The rapid adoption of complex Artificial Intelligence (AI) and Machine Learning (ML) models has led to their characterization as black boxes due to the difficulty of explaining their internal decision-making processes. This lack of transparency hinders users’ ability to understand, validate and trust model behavior, particularly in high-risk applications. Although explainable AI (XAI) has made significant progress, there remains a need for versatile and effective techniques to address increasingly complex models. This work introduces Multivariate Conditional Expectation (MUCE), a model-agnostic method for local explainability designed to capture prediction changes from feature interactions. MUCE extends Individual Conditional Expectation (ICE) by exploring a multivariate grid of values in the neighborhood of a given observation at inference time, providing graphical explanations that illustrate the local evolution of model predictions. In addition, two quantitative indices, stability and uncertainty, summarize local behavior and assess model reliability. Uncertainty is further decomposed into uncertainty+ and uncertainty- to capture asymmetric effects that global measures may overlook. The proposed method is validated using XGBoost models trained on three datasets: two synthetic (2D and 3D) to evaluate behavior near decision boundaries, and one transformed real-world dataset to test adaptability to heterogeneous feature types. Results show that MUCE effectively captures complex local model behavior, while the stability and uncertainty indices provide meaningful insight into prediction confidence. MUCE, together with the ICE modification and the proposed indices, offers a practical contribution to local explainability, enabling both graphical and quantitative insights that enhance the interpretability of predictive models and support more trustworthy and transparent decision-making.
[COMMENTS]24 pages, 15 figures. Silvia Ruiz-España and Laura Arnal contributed equally to this work
[LINK]http://arxiv.org/abs/2601.07313v1
[DATE]2026-01-12 16:34:36+08:00
[CATEGORIES]cs.LG
On the Design of One-step Diffusion via Shortcutting Flow Paths
[AUTHORS]Haitao Lin, Peiyan Hu, Minsi Ren, Zhifeng Gao, Zhi-Ming Ma, Guolin ke, Tailin Wu, Stan Z. Li
[ABSTRACT]Recent advances in few-step diffusion models have demonstrated their efficiency and effectiveness by shortcutting the probabilistic paths of diffusion models, especially in training one-step diffusion models from scratch (\emph{a.k.a.} shortcut models). However, their theoretical derivation and practical implementation are often closely coupled, which obscures the design space. To address this, we propose a common design framework for representative shortcut models. This framework provides theoretical justification for their validity and disentangles concrete component-level choices, thereby enabling systematic identification of improvements. With our proposed improvements, the resulting one-step model achieves a new state-of-the-art FID50k of 2.85 on ImageNet-256x256 under the classifier-free guidance setting with one step generation, and further reaches FID50k of 2.53 with 2x training steps. Remarkably, the model requires no pre-training, distillation, or curriculum learning. We believe our work lowers the barrier to component-level innovation in shortcut models and facilitates principled exploration of their design space.
[COMMENTS]10 pages of main body, conference paper
[LINK]http://arxiv.org/abs/2512.11831v4
[DATE]2026-01-12 15:58:25+08:00
[CATEGORIES]cs.LG
Photometric Redshift Estimation Using Scaled Ensemble Learning
[AUTHORS]Swagata Biswas, Shubhrangshu Ghosh, Avyarthana Ghosh, Yogesh Wadadekar, Abhishek Roy Choudhury, Arijit Mukherjee, Shailesh Deshpande, Arpan Pal
[ABSTRACT]The development of the state-of-the-art telescopic systems capable of performing expansive sky surveys such as the Sloan Digital Sky Survey, Euclid, and the Rubin Observatory’s Legacy Survey of Space and Time (LSST) has significantly advanced efforts to refine cosmological models. These advances offer deeper insight into persistent challenges in astrophysics and our understanding of the Universe’s evolution. A critical component of this progress is the reliable estimation of photometric redshifts (Pz). To improve the precision and efficiency of such estimations, the application of machine learning (ML) techniques to large-scale astronomical datasets has become essential. This study presents a new ensemble-based ML framework aimed at predicting Pz for faint galaxies and higher redshift ranges, relying solely on optical (grizy) photometric data. The proposed architecture integrates several learning algorithms, including gradient boosting machine, extreme gradient boosting, k-nearest neighbors, and artificial neural networks, within a scaled ensemble structure. By using bagged input data, the ensemble approach delivers improved predictive performance compared to stand-alone models. The framework demonstrates consistent accuracy in estimating redshifts, maintaining strong performance up to z ~ 4. The model is validated using publicly available data from the Hyper Suprime-Cam Strategic Survey Program by the Subaru Telescope. Our results show marked improvements in the precision and reliability of Pz estimation. Furthermore, this approach closely adheres to-and in certain instances exceeds-the benchmarks specified in the LSST Science Requirements Document. Evaluation metrics include catastrophic outlier, bias, and rms.
[COMMENTS]10 pages, 2 figures, 3 tables
[LINK]http://arxiv.org/abs/2601.07292v1
[DATE]2026-01-12 15:55:24+08:00
[CATEGORIES]cs.LG
Kernel Alignment-based Multi-view Unsupervised Feature Selection with Sample-level Adaptive Graph Learning
[AUTHORS]Yalan Tan, Yanyong Huang, Zongxin Shen, Dongjie Wang, Fengmao Lv, Tianrui Li
[ABSTRACT]Although multi-view unsupervised feature selection (MUFS) has demonstrated success in dimensionality reduction for unlabeled multi-view data, most existing methods reduce feature redundancy by focusing on linear correlations among features but often overlook complex nonlinear dependencies. This limits the effectiveness of feature selection. In addition, existing methods fuse similarity graphs from multiple views by employing sample-invariant weights to preserve local structure. However, this process fails to account for differences in local neighborhood clarity among samples within each view, thereby hindering accurate characterization of the intrinsic local structure of the data. In this paper, we propose a Kernel Alignment-based multi-view unsupervised FeatUre selection with Sample-level adaptive graph lEarning method (KAFUSE) to address these issues. Specifically, we first employ kernel alignment with an orthogonal constraint to reduce feature redundancy in both linear and nonlinear relationships. Then, a cross-view consistent similarity graph is learned by applying sample-level fusion to each slice of a tensor formed by stacking similarity graphs from different views, which automatically adjusts the view weights for each sample during fusion. These two steps are integrated into a unified model for feature selection, enabling mutual enhancement between them. Extensive experiments on real multi-view datasets demonstrate the superiority of KAFUSE over state-of-the-art methods.
[LINK]http://arxiv.org/abs/2601.07288v1
[DATE]2026-01-12 15:50:51+08:00
[CATEGORIES]cs.LG
Covariance-Driven Regression Trees: Reducing Overfitting in CART
[AUTHORS]Likun Zhang, Wei Ma
[ABSTRACT]Decision trees are powerful machine learning algorithms, widely used in fields such as economics and medicine for their simplicity and interpretability. However, decision trees such as CART are prone to overfitting, especially when grown deep or the sample size is small. Conventional methods to reduce overfitting include pre-pruning and post-pruning, which constrain the growth of uninformative branches. In this paper, we propose a complementary approach by introducing a covariance-driven splitting criterion for regression trees (CovRT). This method is more robust to overfitting than the empirical risk minimization criterion used in CART, as it produces more balanced and stable splits and more effectively identifies covariates with true signals. We establish an oracle inequality of CovRT and prove that its predictive accuracy is comparable to that of CART in high-dimensional settings. We find that CovRT achieves superior prediction accuracy compared to CART in both simulations and real-world tasks.
[LINK]http://arxiv.org/abs/2601.07281v1
[DATE]2026-01-12 15:36:18+08:00
[CATEGORIES]cs.LG
A High-Recall Cost-Sensitive Machine Learning Framework for Real-Time Online Banking Transaction Fraud Detection
[AUTHORS]Karthikeyan V. R., Premnath S., Kavinraaj S., J. Sangeetha
[ABSTRACT]Fraudulent activities on digital banking services are becoming more intricate by the day, challenging existing defenses. While older rule driven methods struggle to keep pace, even precision focused algorithms fall short when new scams are introduced. These tools typically overlook subtle shifts in criminal behavior, missing crucial signals. Because silent breaches cost institutions far more than flagged but legitimate actions, catching every possible case is crucial. High sensitivity to actual threats becomes essential when oversight leads to heavy losses. One key aim here involves reducing missed fraud cases without spiking incorrect alerts too much. This study builds a system using group learning methods adjusted through smart threshold choices. Using real world transaction records shared openly, where cheating acts rarely appear among normal activities, tests are run under practical skewed distributions. The outcomes reveal that approximately 91 percent of actual fraud is detected, outperforming standard setups that rely on unchanging rules when dealing with uneven examples across classes. When tested in live settings, the fraud detection system connects directly to an online banking transaction flow, stopping questionable activities before they are completed. Alongside this setup, a browser add on built for Chrome is designed to flag deceptive web links and reduce threats from harmful sites. These results show that adjusting decisions by cost impact and validating across entire systems makes deployment more stable and realistic for today’s digital banking platforms.
[COMMENTS]7 pages, 5 figures. Submitted to arXiv as a preprint
[LINK]http://arxiv.org/abs/2601.07276v1
[DATE]2026-01-12 15:34:04+08:00
[CATEGORIES]cs.LG
Learning Domain-Invariant Representations for Cross-Domain Image Registration via Scene-Appearance Disentanglement
[AUTHORS]Jiahao Qin, Yiwen Wang
[ABSTRACT]Image registration under domain shift remains a fundamental challenge in computer vision and medical imaging: when source and target images exhibit systematic intensity differences, the brightness constancy assumption underlying conventional registration methods is violated, rendering correspondence estimation ill-posed. We propose SAR-Net, a unified framework that addresses this challenge through principled scene-appearance disentanglement. Our key insight is that observed images can be decomposed into domain-invariant scene representations and domain-specific appearance codes, enabling registration via re-rendering rather than direct intensity matching. We establish theoretical conditions under which this decomposition enables consistent cross-domain alignment (Proposition 1) and prove that our scene consistency loss provides a sufficient condition for geometric correspondence in the shared latent space (Proposition 2). Empirically, we validate SAR-Net on bidirectional scanning microscopy, where coupled domain shift and geometric distortion create a challenging real-world testbed. Our method achieves 0.885 SSIM and 0.979 NCC, representing 3.1x improvement over the strongest baseline, while maintaining real-time performance (77 fps). Ablation studies confirm that both scene consistency and domain alignment losses are necessary: removing either degrades performance by 90% SSIM or causes 223x increase in latent alignment error, respectively. Code and data are available at https://github.com/D-ST-Sword/SAR-NET.
[COMMENTS]12 pages, 7 figures, 4 tables. Code and data available at https://github.com/D-ST-Sword/SAR-NET
[LINK]http://arxiv.org/abs/2601.08875v1
[DATE]2026-01-12 15:14:11+08:00
[CATEGORIES]cs.LG
Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle
[AUTHORS]Ruifeng Ren, Sheng Ouyang, Huayi Tang, Yong Liu
[ABSTRACT]Attention-based Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration. The energy-based perspective has long provided a valuable principle for understanding neural computation. In this paper, we revisit the principle of energy as a lens to understand attention-based Transformer models. We present a unified energy-based framework which is composed of three key components: the local energy $E_i$, the global energy $F$, and the employed optimization algorithms. We show that different attention forms including unnormalized linear attention, gated linear attention and standard softmax attention can be induced by choosing their corresponding recipes within this framework. Building on this framework, we propose energy-based modifications of attention structures. Inspired by classical gradient descent (GD) algorithms, we extend the original attention formulation based on standard GD to the momentum-based GD, Nesterov Accelerated Gradient (NAG), and Newton’s method, each inducing a corresponding new attention structure. Our experiments provide preliminary support for the potential of the energy-based framework for designing attention mechanisms.
[LINK]http://arxiv.org/abs/2511.00907v2
[DATE]2026-01-12 15:13:15+08:00
[CATEGORIES]cs.LG
Pseudodata-guided Invariant Representation Learning Boosts the Out-of-Distribution Generalization in Enzymatic Kinetic Parameter Prediction
[AUTHORS]Haomin Wu, Zhiwei Nie, Hongyu Zhang, Zhixiang Ren
[ABSTRACT]Accurate prediction of enzyme kinetic parameters is essential for understanding catalytic mechanisms and guiding enzyme engineering.However, existing deep learning-based enzyme-substrate interaction (ESI) predictors often exhibit performance degradation on sequence-divergent, out-of-distribution (OOD) cases, limiting robustness under biologically relevant perturbations.We propose O$^2$DENet, a lightweight, plug-and-play module that enhances OOD generalization via biologically and chemically informed perturbation augmentation and invariant representation learning.O$^2$DENet introduces enzyme-substrate perturbations and enforces consistency between original and augmented enzyme-substrate-pair representations to encourage invariance to distributional shifts.When integrated with representative ESI models, O$^2$DENet consistently improves predictive performance for both $k_{cat}$ and $K_m$ across stringent sequence-identity-based OOD benchmarks, achieving state-of-the-art results among the evaluated methods in terms of accuracy and robustness metrics.Overall, O$^2$DENet provides a general and effective strategy to enhance the stability and deployability of data-driven enzyme kinetics predictors for real-world enzyme engineering applications.
[LINK]http://arxiv.org/abs/2601.07261v1
[DATE]2026-01-12 15:03:07+08:00
[CATEGORIES]cs.LG
Simulated Annealing-based Candidate Optimization for Batch Acquisition Functions
[AUTHORS]Sk Md Ahnaf Akif Alvi, Raymundo Arróyave, Douglas Allaire
[ABSTRACT]Bayesian Optimization with multi-objective acquisition functions such as q-Expected Hypervolume Improvement (qEHVI) requires efficient candidate optimization to maximize acquisition function values. Traditional approaches rely on continuous optimization methods like Sequential Least Squares Programming (SLSQP) for candidate selection. However, these gradient-based methods can become trapped in local optima, particularly in complex or high-dimensional objective landscapes. This paper presents a simulated annealing-based approach for candidate optimization in batch acquisition functions as an alternative to conventional continuous optimization methods. We evaluate our simulated annealing approach against SLSQP across four benchmark multi-objective optimization problems: ZDT1 (30D, 2 objectives), DTLZ2 (7D, 3 objectives), Kursawe (3D, 2 objectives), and Latent-Aware (4D, 2 objectives). Our results demonstrate that simulated annealing consistently achieves superior hypervolume performance compared to SLSQP in most test functions. The improvement is particularly pronounced for DTLZ2 and Latent-Aware problems, where simulated annealing reaches significantly higher hypervolume values and maintains better convergence characteristics. The histogram analysis of objective space coverage further reveals that simulated annealing explores more diverse and optimal regions of the Pareto front. These findings suggest that metaheuristic optimization approaches like simulated annealing can provide more robust and effective candidate optimization for multi-objective Bayesian optimization, offering a promising alternative to traditional gradient-based methods for batch acquisition function optimization.
[LINK]http://arxiv.org/abs/2601.07258v1
[DATE]2026-01-12 14:51:49+08:00
[CATEGORIES]cs.LG
Innovation Capacity of Dynamical Learning Systems
[AUTHORS]Anthony M. Polloreno
[ABSTRACT]In noisy physical reservoirs, the classical information-processing capacity $C_{\mathrm{ip}}$ quantifies how well a linear readout can realize tasks measurable from the input history, yet $C_{\mathrm{ip}}$ can be far smaller than the observed rank of the readout covariance. We explain this “missing capacity” by introducing the innovation capacity $C_{\mathrm{i}}$, the total capacity allocated to readout components orthogonal to the input filtration (Doob innovations, including input-noise mixing). Using a basis-free Hilbert-space formulation of the predictable/innovation decomposition, we prove the conservation law $C_{\mathrm{ip}}+C_{\mathrm{i}}=\mathrm{rank}(Σ_{XX})\le d$, so predictable and innovation capacities exactly partition the rank of the observable readout dimension covariance $Σ_{XX}\in \mathbb{R}^{\rm d\times d}$. In linear-Gaussian Johnson-Nyquist regimes, $Σ_{XX}(T)=S+T N_0$, the split becomes a generalized-eigenvalue shrinkage rule and gives an explicit monotone tradeoff between temperature and predictable capacity. Geometrically, in whitened coordinates the predictable and innovation components correspond to complementary covariance ellipsoids, making $C_{\mathrm{i}}$ a trace-controlled innovation budget. A large $C_{\mathrm{i}}$ forces a high-dimensional innovation subspace with a variance floor and under mild mixing and anti-concentration assumptions this yields extensive innovation-block differential entropy and exponentially many distinguishable histories. Finally, we give an information-theoretic lower bound showing that learning the induced innovation-block law in total variation requires a number of samples that scales with the effective innovation dimension, supporting the generative utility of noisy physical reservoirs.
[COMMENTS]12 pages, 3 figures
[LINK]http://arxiv.org/abs/2601.07257v1
[DATE]2026-01-12 14:51:44+08:00
[CATEGORIES]cs.LG
DDT: A Dual-Masking Dual-Expert Transformer for Energy Time-Series Forecasting
[AUTHORS]Mingnan Zhu, Qixuan Zhang, Yixuan Cheng, Fangzhou Gu, Shiming Lin
[ABSTRACT]Accurate energy time-series forecasting is crucial for ensuring grid stability and promoting the integration of renewable energy, yet it faces significant challenges from complex temporal dependencies and the heterogeneity of multi-source data. To address these issues, we propose DDT, a novel and robust deep learning framework for high-precision time-series forecasting. At its core, DDT introduces two key innovations. First, we design a dual-masking mechanism that synergistically combines a strict causal mask with a data-driven dynamic mask. This novel design ensures theoretical causal consistency while adaptively focusing on the most salient historical information, overcoming the rigidity of traditional masking techniques. Second, our architecture features a dual-expert system that decouples the modeling of temporal dynamics and cross-variable correlations into parallel, specialized pathways, which are then intelligently integrated through a dynamic gated fusion module. We conducted extensive experiments on 7 challenging energy benchmark datasets, including ETTh, Electricity, and Solar. The results demonstrate that DDT consistently outperforms strong state-of-the-art baselines across all prediction horizons, establishing a new benchmark for the task.
[LINK]http://arxiv.org/abs/2601.07250v1
[DATE]2026-01-12 14:36:36+08:00
[CATEGORIES]cs.LG
Multi-environment Invariance Learning with Missing Data
[AUTHORS]Yiran Jia
[ABSTRACT]Learning models that can handle distribution shifts is a key challenge in domain generalization. Invariance learning, an approach that focuses on identifying features invariant across environments, improves model generalization by capturing stable relationships, which may represent causal effects when the data distribution is encoded within a structural equation model (SEM) and satisfies modularity conditions. This has led to a growing body of work that builds on invariance learning, leveraging the inherent heterogeneity across environments to develop methods that provide causal explanations while enhancing robust prediction. However, in many practical scenarios, obtaining complete outcome data from each environment is challenging due to the high cost or complexity of data collection. This limitation in available data hinders the development of models that fully leverage environmental heterogeneity, making it crucial to address missing outcomes to improve both causal insights and robust prediction. In this work, we derive an estimator from the invariance objective under missing outcomes. We establish non-asymptotic guarantees on variable selection property and $\ell_2$ error convergence rates, which are influenced by the proportion of missing data and the quality of imputation models across environments. We evaluate the performance of the new estimator through extensive simulations and demonstrate its application using the UCI Bike Sharing dataset to predict the count of bike rentals. The results show that despite relying on a biased imputation model, the estimator is efficient and achieves lower prediction error, provided the bias is within a reasonable range.
[LINK]http://arxiv.org/abs/2601.07247v1
[DATE]2026-01-12 14:30:58+08:00
[CATEGORIES]cs.LG
ADPO: Anchored Direct Preference Optimization
[AUTHORS]Wang Zixian
[ABSTRACT]We present Anchored Direct Preference Optimization (ADPO), a policy alignment method derived from first principles of KL-regularized reinforcement learning. Unlike standard approaches that treat the reference policy merely as a regularizer, we show that the optimal policy in reinforcement learning from human feedback inherently operates in a differential coordinate system, optimizing relative advantage in the form of log ratios rather than absolute probabilities. ADPO explicitly parameterizes this optimal structure through anchored logits, effectively decoupling response quality from prior popularity and creating an implicit trust region through curvature scaling. We show that this formulation unifies supervised fine-tuning, reinforcement learning, and ranking-based objectives under a single geometric perspective. Theoretically, ADPO resolves the probability smearing problem of supervised fine-tuning while avoiding the mode-seeking instability characteristic of reverse-KL methods. Empirically, the listwise ranking variant of ADPO achieves state-of-the-art performance on reasoning tasks, outperforming GRPO by 30.9 percent on Qwen3-1.7B and demonstrating superior robustness under distribution shift.
[LINK]http://arxiv.org/abs/2510.18913v7
[DATE]2026-01-12 14:15:58+08:00
[CATEGORIES]cs.LG
Max-Min Neural Network Operators For Approximation of Multivariate Functions
[AUTHORS]Abhishek Yadav, Uaday Singh, Feng Dai
[ABSTRACT]In this paper, we develop a multivariate framework for approximation by max-min neural network operators. Building on the recent advances in approximation theory by neural network operators, particularly, the univariate max-min operators, we propose and analyze new multivariate operators activated by sigmoidal functions. We establish pointwise and uniform convergence theorems and derive quantitative estimates for the order of approximation via modulus of continuity and multivariate generalized absolute moment. Our results demonstrate that multivariate max-min structure of operators, besides their algebraic elegance, provide efficient and stable approximation tools in both theoretical and applied settings.
[COMMENTS]17 pages with 8 figures
[LINK]http://arxiv.org/abs/2601.07886v1
[DATE]2026-01-12 14:14:05+08:00
[CATEGORIES]cs.LG
Aligning Findings with Diagnosis: A Self-Consistent Reinforcement Learning Framework for Trustworthy Radiology Reporting
[AUTHORS]Kun Zhao, Siyuan Dai, Pan Wang, Jifeng Song, Hui Ji, Chenghua Lin, Liang Zhan, Haoteng Tang
[ABSTRACT]Multimodal Large Language Models (MLLMs) have shown strong potential for radiology report generation, yet their clinical translation is hindered by architectural heterogeneity and the prevalence of factual hallucinations. Standard supervised fine-tuning often fails to strictly align linguistic outputs with visual evidence, while existing reinforcement learning approaches struggle with either prohibitive computational costs or limited exploration. To address these challenges, we propose a comprehensive framework for self-consistent radiology report generation. First, we conduct a systematic evaluation to identify optimal vision encoder and LLM backbone configurations for medical imaging. Building on this foundation, we introduce a novel “Reason-then-Summarize” architecture optimized via Group Relative Policy Optimization (GRPO). This framework restructures generation into two distinct components: a think block for detailed findings and an answer block for structured disease labels. By utilizing a multi-dimensional composite reward function, we explicitly penalize logical discrepancies between the generated narrative and the final diagnosis. Extensive experiments on the MIMIC-CXR benchmark demonstrate that our method achieves state-of-the-art performance in clinical efficacy metrics and significantly reduces hallucinations compared to strong supervised baselines.
[LINK]http://arxiv.org/abs/2601.03321v2
[DATE]2026-01-12 13:56:31+08:00
[CATEGORIES]cs.LG
Consolidation or Adaptation? PRISM: Disentangling SFT and RL Data via Gradient Concentration
[AUTHORS]Yang Zhao, Yangou Ouyang, Xiao Ding, Hepeng Wang, Bibo Cai, Kai Xiong, Jinglong Gao, Zhouhao Sun, Li Du, Bing Qin, Ting Liu
[ABSTRACT]While Hybrid Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become the standard paradigm for training LLM agents, effective mechanisms for data allocation between these stages remain largely underexplored. Current data arbitration strategies often rely on surface-level heuristics that fail to diagnose intrinsic learning needs. Since SFT targets pattern consolidation through imitation while RL drives structural adaptation via exploration, misaligning data with these functional roles causes severe optimization interference. We propose PRISM, a dynamics-aware framework grounded in Schema Theory that arbitrates data based on its degree of cognitive conflict with the model’s existing knowledge. By analyzing the spatial geometric structure of gradients, PRISM identifies data triggering high spatial concentration as high-conflict signals that require RL for structural restructuring. In contrast, data yielding diffuse updates is routed to SFT for efficient consolidation. Extensive experiments on WebShop and ALFWorld demonstrate that PRISM achieves a Pareto improvement, outperforming state-of-the-art hybrid methods while reducing computational costs by up to 3.22$\times$. Our findings suggest that disentangling data based on internal optimization regimes is crucial for scalable and robust agent alignment.
[LINK]http://arxiv.org/abs/2601.07224v1
[DATE]2026-01-12 13:43:20+08:00
[CATEGORIES]cs.LG
Bag of Coins: A Statistical Probe into Neural Confidence Structures
[AUTHORS]Agnideep Aich, Sameera Hewage, Md Monzur Murshed, Bruce Wade, Ashit Baran Aich
[ABSTRACT]Modern neural networks often produce miscalibrated confidence scores and struggle to detect out-of-distribution (OOD) inputs, while most existing methods post-process outputs without testing internal consistency. We introduce the Bag-of-Coins (BoC) probe, a non-parametric diagnostic of logit coherence that compares softmax confidence $\hat p$ to an aggregate of pairwise Luce-style dominance probabilities $\bar q$, yielding a deterministic coherence score and a p-value-based structural score. Across ViT, ResNet, and RoBERTa with ID/OOD test sets, the coherence gap $Δ=\bar q-\hat p$ reveals clear ID/OOD separation for ViT (ID ${\sim}0.1$-$0.2$, OOD ${\sim}0.5$-$0.6$) but substantial overlap for ResNet and RoBERTa (both ${\sim}0$), indicating architecture-dependent uncertainty geometry. As a practical method, BoC improves calibration only when the base model is poorly calibrated (ViT: ECE $0.024$ vs.\ $0.180$) and underperforms standard calibrators (ECE ${\sim}0.005$), while for OOD detection it fails across architectures (AUROC $0.020$-$0.253$) compared to standard scores ($0.75$-$0.99$). We position BoC as a research diagnostic for interrogating how architectures encode uncertainty in logit geometry rather than a production calibration or OOD detection method.
[LINK]http://arxiv.org/abs/2507.19774v2
[DATE]2026-01-12 13:32:06+08:00
[CATEGORIES]cs.LG
Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement
[AUTHORS]K M Sajjadul Islam, Ravi Teja Karri, Srujan Vegesna, Jiawei Wu, Praveen Madiraju
[ABSTRACT]Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback presents significant challenges due to limited data and domain-specific nuances. Traditional supervised learning approaches require extensive labeled datasets, making unsupervised methods more viable for uncovering meaningful insights from patient feedback. This study explores unsupervised methods to extract meaningful topics from 439 survey responses collected from a healthcare system in Wisconsin, USA. A keyword-based filtering approach was applied to isolate complaint-related feedback using a domain-specific lexicon. To delve deeper and analyze dominant topics in feedback, we explored traditional topic modeling methods, including Latent Dirichlet Allocation (LDA) and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM), alongside BERTopic, an advanced neural embedding-based clustering approach. To improve coherence and interpretability where data are scarce and consist of short-texts, we propose kBERT, an integration of BERT embeddings with k-means clustering. Model performance was assessed using coherence scores (Cv ) for topic interpretability and average Inverted Rank-Biased Overlap (IRBOavg) for topic diversity. Results indicate that kBERT achieves the highest coherence (Cv = 0.53) and distinct topic separation (IRBOavg = 1.00), outperforming all other models in short-text healthcare feedback analysis. Our findings emphasize the importance of embedding-based techniques for topic identification and highlight the need for context-aware models in healthcare analytics.
[COMMENTS]The paper accepted at the 2025 IEEE COMPSAC, Toronto, Canada
[LINK]http://arxiv.org/abs/2504.14068v2
[DATE]2026-01-12 13:19:55+08:00
[CATEGORIES]cs.LG
Gems: Group Emotion Profiling Through Multimodal Situational Understanding
[AUTHORS]Anubhav Kataria, Surbhi Madan, Shreya Ghosh, Tom Gedeon, Abhinav Dhall
[ABSTRACT]Understanding individual, group and event level emotions along with contextual information is crucial for analyzing a multi-person social situation. To achieve this, we frame emotion comprehension as the task of predicting fine-grained individual emotion to coarse grained group and event level emotion. We introduce GEMS that leverages a multimodal swin-transformer and S3Attention based architecture, which processes an input scene, group members, and context information to generate joint predictions. Existing multi-person emotion related benchmarks mainly focus on atomic interactions primarily based on emotion perception over time and group level. To this end, we extend and propose VGAF-GEMS to provide more fine grained and holistic analysis on top of existing group level annotation of VGAF dataset. GEMS aims to predict basic discrete and continuous emotions (including valence and arousal) as well as individual, group and event level perceived emotions. Our benchmarking effort links individual, group and situational emotional responses holistically. The quantitative and qualitative comparisons with adapted state-of-the-art models demonstrate the effectiveness of GEMS framework on VGAF-GEMS benchmarking. We believe that it will pave the way of further research. The code and data is available at: https://github.com/katariaak579/GEMS
[LINK]http://arxiv.org/abs/2507.22393v2
[DATE]2026-01-12 12:59:29+08:00
[CATEGORIES]cs.LG
CalPro: Prior-Aware Evidential–Conformal Prediction with Structure-Aware Guarantees for Protein Structures
[AUTHORS]Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
[ABSTRACT]Deep protein structure predictors such as AlphaFold provide confidence estimates (e.g., pLDDT) that are often miscalibrated and degrade under distribution shifts across experimental modalities, temporal changes, and intrinsically disordered regions. We introduce CalPro, a prior-aware evidential-conformal framework for shift-robust uncertainty quantification. CalPro combines (i) a geometric evidential head that outputs Normal-Inverse-Gamma predictive distributions via a graph-based architecture; (ii) a differentiable conformal layer that enables end-to-end training with finite-sample coverage guarantees; and (iii) domain priors (disorder, flexibility) encoded as soft constraints. We derive structure-aware coverage guarantees under distribution shift using PAC-Bayesian bounds over ambiguity sets, and show that CalPro maintains near-nominal coverage while producing tighter intervals than standard conformal methods in regions where priors are informative. Empirically, CalPro exhibits at most 5% coverage degradation across modalities (vs. 15-25% for baselines), reduces calibration error by 30-50%, and improves downstream ligand-docking success by 25%. Beyond proteins, CalPro applies to structured regression tasks in which priors encode local reliability, validated on non-biological benchmarks.
[LINK]http://arxiv.org/abs/2601.07201v1
[DATE]2026-01-12 12:49:32+08:00
[CATEGORIES]cs.LG
Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment
[AUTHORS]Haozhong Wang, Zhuo Li, Yibo Yang, He Zhao, Hongyuan Zha, Dandan Guo
[ABSTRACT]The inherent safety alignment of Large Language Models (LLMs) is prone to erosion during fine-tuning, even when using seemingly innocuous datasets. While existing defenses attempt to mitigate this via data selection, they typically rely on heuristic, instance-level assessments that neglect the global geometry of the data distribution and fail to explicitly repel harmful patterns. To address this, we introduce Safety Optimal Transport (SOT), a novel framework that reframes safe fine-tuning from an instance-level filtering challenge to a distribution-level alignment task grounded in Optimal Transport (OT). At its core is a dual-reference “push-pull” weight-learning mechanism: SOT optimizes sample importance by actively pulling the downstream distribution towards a trusted safe anchor while simultaneously pushing it away from a general harmful reference. This establishes a robust geometric safety boundary that effectively purifies the training data. Extensive experiments across diverse model families and domains demonstrate that SOT significantly enhances model safety while maintaining competitive downstream performance, achieving a superior safety-utility trade-off compared to baselines.
[LINK]http://arxiv.org/abs/2601.07200v1
[DATE]2026-01-12 12:48:02+08:00
[CATEGORIES]cs.LG
Forward versus Backward: Comparing Reasoning Objectives in Direct Preference Optimization
[AUTHORS]Murtaza Nikzad, Raghuram Ramanujan
[ABSTRACT]Large language models exhibit impressive reasoning capabilities yet frequently generate plausible but incorrect solutions, a phenomenon commonly termed hallucination. This paper investigates the effect of training objective composition on reasoning reliability through Direct Preference Optimization. Two complementary training signals are examined: forward chain-of-thought generation, which trains the model to produce correct reasoning traces, and backward verification, which trains the model to verify and acknowledge errors in candidate solutions. Experiments on GSM8K reveal a fundamental trade-off between these objectives. Forward-only DPO training achieves the highest accuracy improvement, increasing from 83.1% to 86.6% (+3.5 percentage points), while backward-only training yields minimal accuracy gains but substantially reduces the false positive rate from 13.4% to 4.3%. Notably, both training variants reduce acknowledgement rate compared to the baseline, suggesting that preference optimization increases model confidence in its outputs. These findings indicate that forward and backward reasoning objectives provide distinct and complementary learning signals: forward training improves problem-solving capability, while backward training improves verification calibration. The complete training and evaluation pipeline, implemented efficiently through Low-Rank Adaptation, is released to facilitate further research.
[LINK]http://arxiv.org/abs/2601.07199v1
[DATE]2026-01-12 12:46:27+08:00
[CATEGORIES]cs.LG
Beyond Variance: Knowledge-Aware LLM Compression via Fisher-Aligned Subspace Diagnostics
[AUTHORS]Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
[ABSTRACT]Post-training activation compression is essential for deploying Large Language Models (LLMs) on resource-constrained hardware. However, standard methods like Singular Value Decomposition (SVD) are gradient-blind: they preserve high-variance dimensions regardless of their impact on factual knowledge preservation. We introduce Fisher-Aligned Subspace Compression (FASC), a knowledge-aware compression framework that selects subspaces by directly modeling activation-gradient coupling, minimizing a second-order surrogate of the loss function. FASC leverages the Fisher Information Matrix to identify dimensions critical for factual knowledge, which often reside in low-variance but high-gradient-sensitivity subspaces. We propose the Dependence Violation Score (\r{ho}) as a general-purpose diagnostic metric that quantifies activation-gradient coupling, revealing where factual knowledge is stored within transformer architectures. Extensive experiments on Mistral-7B and Llama-3-8B demonstrate that FASC preserves 6-8% more accuracy on knowledge-intensive benchmarks (MMLU, LAMA) compared to variance-based methods at 50% rank reduction, effectively enabling a 7B model to match the factual recall of a 13B uncompressed model. Our analysis reveals that \r{ho} serves as a fundamental signal of stored knowledge, with high-\r{ho} layers emerging only when models internalize factual associations during training.
[LINK]http://arxiv.org/abs/2601.07197v1
[DATE]2026-01-12 12:41:32+08:00
[CATEGORIES]cs.LG
ForensicFormer: Hierarchical Multi-Scale Reasoning for Cross-Domain Image Forgery Detection
[AUTHORS]Hema Hariharan Samson
[ABSTRACT]The proliferation of AI-generated imagery and sophisticated editing tools has rendered traditional forensic methods ineffective for cross-domain forgery detection. We present ForensicFormer, a hierarchical multi-scale framework that unifies low-level artifact detection, mid-level boundary analysis, and high-level semantic reasoning via cross-attention transformers. Unlike prior single-paradigm approaches, which achieve <75% accuracy on out-of-distribution datasets, our method maintains 86.8% average accuracy across seven diverse test sets, spanning traditional manipulations, GAN-generated images, and diffusion model outputs - a significant improvement over state-of-the-art universal detectors. We demonstrate superior robustness to JPEG compression (83% accuracy at Q=70 vs. 66% for baselines) and provide pixel-level forgery localization with a 0.76 F1-score. Extensive ablation studies validate that each hierarchical component contributes 4-10% accuracy improvement, and qualitative analysis reveals interpretable forensic features aligned with human expert reasoning. Our work bridges classical image forensics and modern deep learning, offering a practical solution for real-world deployment where manipulation techniques are unknown a priori.
[COMMENTS]9 pages, 4 figures, 5 tables. Technical report on hierarchical multi-scale image forgery detection
[LINK]http://arxiv.org/abs/2601.08873v1
[DATE]2026-01-12 12:29:36+08:00
[CATEGORIES]cs.LG
Prophet as a Reproducible Forecasting Framework: A Methodological Guide for Business and Financial Analytics
[AUTHORS]Sidney Shapiro, Burhanuddin Panvelwala
[ABSTRACT]Reproducibility remains a persistent challenge in forecasting research and practice, particularly in business and financial analytics, where forecasts inform high-stakes decisions. Traditional forecasting methods, while theoretically interpretable, often require extensive manual tuning and are difficult to replicate in proprietary environments. Machine learning approaches offer predictive flexibility but introduce challenges related to interpretability, stochastic training procedures, and cross-environment reproducibility. This paper examines Prophet, an open-source forecasting framework developed by Meta, as a reproducibility-enabling solution that balances interpretability, standardized workflows, and accessibility. Rather than proposing a new algorithm, this study evaluates how Prophet’s additive structure, open-source implementation, and standardized workflow contribute to transparent and replicable forecasting practice. Using publicly available financial and retail datasets, we compare the performance and interpretability of Prophet with multiple ARIMA specifications (auto-selected, manually specified, and seasonal variants) and Random Forest, under a controlled and fully documented experimental design. This multi-model comparison provides a robust assessment of Prophet’s relative performance and reproducibility advantages. Through concrete Python examples, we demonstrate how Prophet facilitates efficient forecasting workflows and integration with analytical pipelines. The study positions Prophet within the broader context of reproducible research. It highlights Prophet’s role as a methodological building block that supports verification, auditability, and methodological rigor. This work provides researchers and practitioners with a practical reference framework for reproducible forecasting in Python-based research workflows.
[LINK]http://arxiv.org/abs/2601.05929v2
[DATE]2026-01-12 11:36:50+08:00
[CATEGORIES]cs.LG
Paris: A Decentralized Trained Open-Weight Diffusion Model
[AUTHORS]Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy
[ABSTRACT]We present Paris, the first publicly released diffusion model pre-trained entirely through decentralized computation. Paris demonstrates that high-quality text-to-image generation can be achieved without centrally coordinated infrastructure. Paris is open for research and commercial use. Paris required implementing our Distributed Diffusion Training framework from scratch. The model consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation with no gradient, parameter, or intermediate activation synchronization. Rather than requiring synchronized gradient updates across thousands of GPUs, we partition data into semantically coherent clusters where each expert independently optimizes its subset while collectively approximating the full distribution. A lightweight transformer router dynamically selects appropriate experts at inference, achieving generation quality comparable to centrally coordinated baselines. Eliminating synchronization enables training on heterogeneous hardware without specialized interconnects. Empirical validation confirms that Paris’s decentralized training maintains generation quality while removing the dedicated GPU cluster requirement for large-scale diffusion models. Paris achieves this using 14$\times$ less training data and 16$\times$ less compute than the prior decentralized baseline.
[LINK]http://arxiv.org/abs/2510.03434v2
[DATE]2026-01-12 11:17:26+08:00
[CATEGORIES]cs.LG
Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization
[AUTHORS]Min Wang, Xin Li, Mingzhong Wang, Hasnaa Bennis
[ABSTRACT]Offline meta-reinforcement learning (OMRL) combines the strengths of learning from diverse datasets in offline RL with the adaptability to new tasks of meta-RL, promising safe and efficient knowledge acquisition by RL agents. However, OMRL still suffers extrapolation errors due to out-of-distribution (OOD) actions, compromised by broad task distributions and Markov Decision Process (MDP) ambiguity in meta-RL setups. Existing research indicates that the generalization of the $Q$ network affects the extrapolation error in offline RL. This paper investigates this relationship by decomposing the $Q$ value into feature and weight components, observing that while decomposition enhances adaptability and convergence in the case of high-quality data, it often leads to policy degeneration or collapse in complex tasks. We observe that decomposed $Q$ values introduce a large estimation bias when the feature encounters OOD samples, a phenomenon we term ‘‘feature overgeneralization’’. To address this issue, we propose FLORA, which identifies OOD samples by modeling feature distributions and estimating their uncertainties. FLORA integrates a return feedback mechanism to adaptively adjust feature components. Furthermore, to learn precise task representations, FLORA explicitly models the complex task distribution using a chain of invertible transformations. We theoretically and empirically demonstrate that FLORA achieves rapid adaptation and meta-policy improvement compared to baselines across various environments.
[LINK]http://arxiv.org/abs/2601.07164v1
[DATE]2026-01-12 11:16:07+08:00
[CATEGORIES]cs.LG
AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units
[AUTHORS]Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, Dongyang Tao, Xiansong Huang, Fan Xu, Feidiao Yang, Yao Lu, Chang-Dong Wang, Yutong Lu, Weicheng Xue, Bin Zhou, Yonghong Tian
[ABSTRACT]To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive. While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate. To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback. Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels. Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding. Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline’s complete failure. These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation.
[COMMENTS]33 pages,7 figures,16 tables
[LINK]http://arxiv.org/abs/2601.07160v1
[DATE]2026-01-12 11:12:58+08:00
[CATEGORIES]cs.LG
Stable On-Policy Distillation through Adaptive Target Reformulation
[AUTHORS]Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, Taesup Kim
[ABSTRACT]Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
[COMMENTS]10 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.07155v1
[DATE]2026-01-12 10:57:39+08:00
[CATEGORIES]cs.LG
Generating readily synthesizable small molecule fluorophore scaffolds with reinforcement learning
[AUTHORS]Ruhi Sayana, Kate Callon, Jennifer Xu, Jonathan Deutsch, Steven Chu, James Zou, John Janetzko, Rabindra V. Shivnaraine, Kyle Swanson
[ABSTRACT]Developing new fluorophores for advanced imaging techniques requires exploring new chemical space. While generative AI approaches have shown promise in designing novel dye scaffolds, prior efforts often produced synthetically intractable candidates due to a lack of reaction constraints. Here, we developed SyntheFluor-RL, a generative AI model that employs known reaction libraries and molecular building blocks to create readily synthesizable fluorescent molecule scaffolds via reinforcement learning. To guide the generation of fluorophores, SyntheFluor-RL employs a scoring function built on multiple graph neural networks (GNNs) that predict key photophysical properties, including photoluminescence quantum yield, absorption, and emission wavelengths. These outputs are dynamically weighted and combined with a computed pi-conjugation score to prioritize candidates with desirable optical characteristics and synthetic feasibility. SyntheFluor-RL generated 11,590 candidate molecules, which were filtered to 19 structures predicted to possess dye-like properties. Of the 19 molecules, 14 were synthesized and 13 were experimentally confirmed. The top three were characterized, with the lead compound featuring a benzothiadiazole chromophore and exhibiting strong fluorescence (PLQY = 0.62), a large Stokes shift (97 nm), and a long excited-state lifetime (11.5 ns). These results demonstrate the effectiveness of SyntheFluor-RL in the identification of synthetically accessible fluorophores for further development.
[LINK]http://arxiv.org/abs/2601.07145v1
[DATE]2026-01-12 10:31:43+08:00
[CATEGORIES]cs.LG
Optimal Transport under Group Fairness Constraints
[AUTHORS]Linus Bleistein, Mathieu Dagréou, Francisco Andrade, Thomas Boudou, Aurélien Bellet
[ABSTRACT]Ensuring fairness in matching algorithms is a key challenge in allocating scarce resources and positions. Focusing on Optimal Transport (OT), we introduce a novel notion of group fairness requiring that the probability of matching two individuals from any two given groups in the OT plan satisfies a predefined target. We first propose \texttt{FairSinkhorn}, a modified Sinkhorn algorithm to compute perfectly fair transport plans efficiently. Since exact fairness can significantly degrade matching quality in practice, we then develop two relaxation strategies. The first one involves solving a penalised OT problem, for which we derive novel finite-sample complexity guarantees. This result is of independent interest as it can be generalized to arbitrary convex penalties. Our second strategy leverages bilevel optimization to learn a ground cost that induces a fair OT solution, and we establish a bound guaranteeing that the learned cost yields fair matchings on unseen data. Finally, we present empirical results that illustrate the trade-offs between fairness and performance.
[LINK]http://arxiv.org/abs/2601.07144v1
[DATE]2026-01-12 10:26:32+08:00
[CATEGORIES]cs.LG
CFT-RAG: An Entity Tree Based Retrieval Augmented Generation Algorithm With Cuckoo Filter
[AUTHORS]Zihang Li, Yangdong Ruan, Wenjun Liu, Zhengyang Wang, Tong Yang
[ABSTRACT]Although retrieval-augmented generation(RAG) significantly improves generation quality by retrieving external knowledge bases and integrating generated content, it faces computational efficiency bottlenecks, particularly in knowledge retrieval tasks involving hierarchical structures for Tree-RAG. This paper proposes a Tree-RAG acceleration method based on the improved Cuckoo Filter, which optimizes entity localization during the retrieval process to achieve significant performance improvements. Tree-RAG effectively organizes entities through the introduction of a hierarchical tree structure, while the Cuckoo Filter serves as an efficient data structure that supports rapid membership queries and dynamic updates. The experiment results demonstrate that our method is much faster than naive Tree-RAG while maintaining high levels of generative quality. When the number of trees is large, our method is hundreds of times faster than naive Tree-RAG. Our work is available at https://github.com/TUPYP7180/CFT-RAG-2025.
[COMMENTS]New research based on it has been conducted
[LINK]http://arxiv.org/abs/2501.15098v2
[DATE]2026-01-12 10:22:29+08:00
[CATEGORIES]cs.LG
DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models
[AUTHORS]Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, Shiguo Lian
[COMMENTS]EMNLP 2025 Industry Track
[LINK]http://arxiv.org/abs/2503.04472v3
[DATE]2026-01-12 10:07:30+08:00
[CATEGORIES]cs.LG
MoEMeta: Mixture-of-Experts Meta Learning for Few-Shot Relational Learning
[AUTHORS]Han Wu, Jie Yin
[COMMENTS]Appear in NeurIPS 2025
[LINK]http://arxiv.org/abs/2510.23013v4
[DATE]2026-01-12 10:04:38+08:00
[CATEGORIES]cs.LG
A Concentration Bound for TD(0) with Function Approximation
[AUTHORS]Siddharth Chandak, Vivek S. Borkar
[ABSTRACT]We derive uniform all-time concentration bound of the type ‘for all $n \geq n_0$ for some $n_0$’ for TD(0) with linear function approximation. We work with online TD learning with samples from a single sample path of the underlying Markov chain. This makes our analysis significantly different from offline TD learning or TD learning with access to independent samples from the stationary distribution of the Markov chain. We treat TD(0) as a contractive stochastic approximation algorithm, with both martingale and Markov noises. Markov noise is handled using the Poisson equation and the lack of almost sure guarantees on boundedness of iterates is handled using the concept of relaxed concentration inequalities.
[COMMENTS]Published in Stochastic Systems
[LINK]http://arxiv.org/abs/2312.10424v4
[DATE]2026-01-12 09:59:30+08:00
[CATEGORIES]cs.LG
Proof of Reasoning for Privacy Enhanced Federated Blockchain Learning at the Edge
[AUTHORS]James Calo, Benny Lo
[ABSTRACT]Consensus mechanisms are the core of any blockchain system. However, the majority of these mechanisms do not target federated learning directly nor do they aid in the aggregation step. This paper introduces Proof of Reasoning (PoR), a novel consensus mechanism specifically designed for federated learning using blockchain, aimed at preserving data privacy, defending against malicious attacks, and enhancing the validation of participating networks. Unlike generic blockchain consensus mechanisms commonly found in the literature, PoR integrates three distinct processes tailored for federated learning. Firstly, a masked autoencoder (MAE) is trained to generate an encoder that functions as a feature map and obfuscates input data, rendering it resistant to human reconstruction and model inversion attacks. Secondly, a downstream classifier is trained at the edge, receiving input from the trained encoder. The downstream network’s weights, a single encoded datapoint, the network’s output and the ground truth are then added to a block for federated aggregation. Lastly, this data facilitates the aggregation of all participating networks, enabling more complex and verifiable aggregation methods than previously possible. This three-stage process results in more robust networks with significantly reduced computational complexity, maintaining high accuracy by training only the downstream classifier at the edge. PoR scales to large IoT networks with low latency and storage growth, and adapts to evolving data, regulations, and network conditions.
[COMMENTS]8 Pages, 5 figues, 9 tables, journal paper
[LINK]http://arxiv.org/abs/2601.07134v1
[DATE]2026-01-12 09:57:17+08:00
[CATEGORIES]cs.LG
Advancing time series completion via RFAMoE and MDFF
[AUTHORS]Ci Zhang, Huayu Li, Changdi Yang, Jiangnan Xia, Yanzhi Wang, Xiaolong Ma, Jin Lu, Ao Li, Geng Yuan
[ABSTRACT]Recent studies show that using diffusion models for time series signal reconstruction holds great promise. However, such approaches remain largely unexplored in the domain of medical time series. The unique characteristics of the physiological time series signals, such as multivariate, high temporal variability, highly noisy, and artifact-prone, make deep learning-based approaches still challenging for tasks such as imputation. Hence, we propose a novel Mixture of Experts (MoE)-based noise estimator within a score-based diffusion framework. Specifically, the Receptive Field Adaptive MoE (RFAMoE) module is designed to enable each channel to adaptively select desired receptive fields throughout the diffusion process. Moreover, recent literature has found that when generating a physiological signal, performing multiple inferences and averaging the reconstructed signals can effectively reduce reconstruction errors, but at the cost of significant computational and latency overhead. We design a Fusion MoE module and innovatively leverage the nature of MoE module to generate K noise signals in parallel, fuse them using a routing mechanism, and complete signal reconstruction in a single inference step. This design not only improves performance over previous methods but also eliminates the substantial computational cost and latency associated with multiple inference processes. Extensive results demonstrate that our proposed framework consistently outperforms diffusion-based SOTA works on different tasks and datasets.
[LINK]http://arxiv.org/abs/2512.07873v4
[DATE]2026-01-12 09:55:03+08:00
[CATEGORIES]cs.LG
Towards Automated Diagnosis of Inherited Arrhythmias: Combined Arrhythmia Classification Using Lead-Aware Spatial Attention Networks
[AUTHORS]Sophie Sigfstead, River Jiang, Brianna Davies, Zachary W. M. Laksman, Julia Cadrin-Tourigny, Rafik Tadros, Habib Khan, Joseph Atallah, Christian Steinberg, Shubhayan Sanatani, Mario Talajic, Rahul Krishnan, Andrew D. Krahn, Christopher C. Cheung
[ABSTRACT]Arrhythmogenic right ventricular cardiomyopathy (ARVC) and long QT syndrome (LQTS) are inherited arrhythmia syndromes associated with sudden cardiac death. Deep learning shows promise for ECG interpretation, but multi-class inherited arrhythmia classification with clinically grounded interpretability remains underdeveloped. Our objective was to develop and validate a lead-aware deep learning framework for multi-class (ARVC vs LQTS vs control) and binary inherited arrhythmia classification, and to determine optimal strategies for integrating ECG foundation models within arrhythmia screening tools. We assembled a 13-center Canadian cohort (645 patients; 1,344 ECGs). We evaluated four ECG foundation models using three transfer learning approaches: linear probing, fine-tuning, and combined strategies. We developed lead-aware spatial attention networks (LASAN) and assessed integration strategies combining LASAN with foundation models. Performance was compared against the established foundation model baselines. Lead-group masking quantified disease-specific lead dependence. Fine-tuning outperformed linear probing and combined strategies across all foundation models (mean macro-AUROC 0.904 vs 0.825). The best lead-aware integrations achieved near-ceiling performance (HuBERT-ECG hybrid: macro-AUROC 0.990; ARVC vs control AUROC 0.999; LQTS vs control AUROC 0.994). Lead masking demonstrated physiologic plausibility: V1-V3 were most critical for ARVC detection (4.54% AUROC reduction), while lateral leads were preferentially important for LQTS (2.60% drop). Lead-aware architectures achieved state-of-the-art performance for inherited arrhythmia classification, outperforming all existing published models on both binary and multi-class tasks while demonstrating clinically aligned lead dependence. These findings support potential utility for automated ECG screening pending validation.
[COMMENTS]34 pages, 2 figures, 4 tables
[LINK]http://arxiv.org/abs/2601.07124v1
[DATE]2026-01-12 09:27:26+08:00
[CATEGORIES]cs.LG
Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework
[AUTHORS]Yixiao Peng, Hao Hu, Feiyang Li, Xinye Cao, Yingchang Jiang, Jipeng Tang, Guoshun Nan, Yuling Liu
[ABSTRACT]While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK’s Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) An upper-level LLM agent with four modules–ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration–performs global awareness, human intent recognition, and tactical planning; (2) Lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains network availability 68.5% higher and achieves a 34.7% jumpstart performance gain when shifting the scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense. We will release our framework to the community, facilitating the advancement of robust and autonomous defense in cloud networks.
[LINK]http://arxiv.org/abs/2601.07122v1
[DATE]2026-01-12 09:25:41+08:00
[CATEGORIES]cs.LG
*Data-Driven Knowledge Transfer in Batch $Q^$ Learning**
[AUTHORS]Elynn Chen, Xi Chen, Wenbo Jing
[ABSTRACT]In data-driven decision-making in marketing, healthcare, and education, it is desirable to utilize a large amount of data from existing ventures to navigate high-dimensional feature spaces and address data scarcity in new ventures. We explore knowledge transfer in dynamic decision-making by concentrating on batch stationary environments and formally defining task discrepancies through the lens of Markov decision processes (MDPs). We propose a framework of Transferred Fitted $Q$-Iteration algorithm with general function approximation, enabling the direct estimation of the optimal action-state function $Q^$ using both target and source data. We establish the relationship between statistical performance and MDP task discrepancy under sieve approximation, shedding light on the impact of source and target sample sizes and task discrepancy on the effectiveness of knowledge transfer. We show that the final learning error of the $Q^$ function is significantly improved from the single task rate both theoretically and empirically.
[LINK]http://arxiv.org/abs/2404.15209v3
[DATE]2026-01-12 09:16:22+08:00
[CATEGORIES]cs.LG
Reward-Preserving Attacks For Robust Reinforcement Learning
[AUTHORS]Lucas Schott, Elies Gherbi, Hatem Hajri, Sylvain Lamprier
[ABSTRACT]Adversarial robustness in RL is difficult because perturbations affect entire trajectories: strong attacks can break learning, while weak attacks yield little robustness, and the appropriate strength varies by state. We propose $α$-reward-preserving attacks, which adapt the strength of the adversary so that an $α$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, we use a gradient-based attack direction and learn a state-dependent magnitude $η\le η_{\mathcal B}$ selected via a critic $Q^π_α((s,a),η)$ trained off-policy over diverse radii. This adaptive tuning calibrates attack strength and, with intermediate $α$, improves robustness across radii while preserving nominal performance, outperforming fixed- and random-radius baselines.
[COMMENTS]19 pages, 6 figures, 4 algorithms, preprint
[LINK]http://arxiv.org/abs/2601.07118v1
[DATE]2026-01-12 09:14:03+08:00
[CATEGORIES]cs.LG
How to Backdoor the Knowledge Distillation
[AUTHORS]Chen Wu, Qian Ma, Prasenjit Mitra, Sencun Zhu
[ABSTRACT]Knowledge distillation has become a cornerstone in modern machine learning systems, celebrated for its ability to transfer knowledge from a large, complex teacher model to a more efficient student model. Traditionally, this process is regarded as secure, assuming the teacher model is clean. This belief stems from conventional backdoor attacks relying on poisoned training data with backdoor triggers and attacker-chosen labels, which are not involved in the distillation process. Instead, knowledge distillation uses the outputs of a clean teacher model to guide the student model, inherently preventing recognition or response to backdoor triggers as intended by an attacker. In this paper, we challenge this assumption by introducing a novel attack methodology that strategically poisons the distillation dataset with adversarial examples embedded with backdoor triggers. This technique allows for the stealthy compromise of the student model while maintaining the integrity of the teacher model. Our innovative approach represents the first successful exploitation of vulnerabilities within the knowledge distillation process using clean teacher models. Through extensive experiments conducted across various datasets and attack settings, we demonstrate the robustness, stealthiness, and effectiveness of our method. Our findings reveal previously unrecognized vulnerabilities and pave the way for future research aimed at securing knowledge distillation processes against backdoor attacks.
[LINK]http://arxiv.org/abs/2504.21323v2
[DATE]2026-01-12 09:10:07+08:00
[CATEGORIES]cs.LG
Certifying Stability of Reinforcement Learning Policies using Generalized Lyapunov Functions
[AUTHORS]Kehan Long, Jorge Cortés, Nikolay Atanasov
[COMMENTS]NeurIPS 2025
[LINK]http://arxiv.org/abs/2505.10947v4
[DATE]2026-01-12 09:08:26+08:00
[CATEGORIES]cs.LG
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
[AUTHORS]Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic
[ABSTRACT]Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimisation (NS-DPO) that models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO proposes a computationally efficient solution by introducing only a single discount parameter in the loss function, which is used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios where various levels of preference drift is introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
[COMMENTS]31 pages, 10 figures. Accepted to ICML 2025
[LINK]http://arxiv.org/abs/2407.18676v3
[DATE]2026-01-12 09:01:31+08:00
[CATEGORIES]cs.LG
Flow Matching and Diffusion Models via PointNet for Generating Fluid Fields on Irregular Geometries
[AUTHORS]Ali Kashefi
[ABSTRACT]We present two novel generative geometric deep learning frameworks, termed Flow Matching PointNet and Diffusion PointNet, for predicting fluid flow variables on irregular geometries by incorporating PointNet into flow matching and diffusion models, respectively. In these frameworks, a reverse generative process reconstructs physical fields from standard Gaussian noise conditioned on unseen geometries. The proposed approaches operate directly on point-cloud representations of computational domains (e.g., grid vertices of finite-volume meshes) and therefore avoid the limitations of pixelation used to project geometries onto uniform lattices, as is common in U-Net-based flow matching and diffusion models. In contrast to graph neural network-based diffusion models, Flow Matching PointNet and Diffusion PointNet do not exhibit high-frequency noise artifacts in the predicted fields. Moreover, unlike such approaches, which require auxiliary intermediate networks to condition geometry, the proposed frameworks rely solely on PointNet, resulting in a simple and unified architecture. The performance of the proposed frameworks is evaluated on steady incompressible flow past a cylinder, using a geometric dataset constructed by varying the cylinder’s cross-sectional shape and orientation across samples. The results demonstrate that Flow Matching PointNet and Diffusion PointNet achieve more accurate predictions of velocity and pressure fields, as well as lift and drag forces, and exhibit greater robustness to incomplete geometries compared to a vanilla PointNet with the same number of trainable parameters.
[LINK]http://arxiv.org/abs/2601.03030v2
[DATE]2026-01-12 07:57:36+08:00
[CATEGORIES]cs.LG
Robust Bayesian Optimization via Tempered Posteriors
[AUTHORS]Jiguang Li, Hengrui Luo
[ABSTRACT]Bayesian optimization (BO) iteratively fits a Gaussian process (GP) surrogate to accumulated evaluations and selects new queries via an acquisition function such as expected improvement (EI). In practice, BO often concentrates evaluations near the current incumbent, causing the surrogate to become overconfident and to understate predictive uncertainty in the region guiding subsequent decisions. We develop a robust GP-based BO via tempered posterior updates, which downweight the likelihood by a power $α\in (0,1]$ to mitigate overconfidence under local misspecification. We establish cumulative regret bounds for tempered BO under a family of generalized improvement rules, including EI, and show that tempering yields strictly sharper worst-case regret guarantees than the standard posterior $(α=1)$, with the most favorable guarantees occurring near the classical EI choice.
Motivated by our theoretic findings, we propose a prequential procedure for selecting $α$ online: it decreases $α$ when realized prediction errors exceed model-implied uncertainty and returns $α$ toward one as calibration improves. Empirical results demonstrate that tempering provides a practical yet theoretically grounded tool for stabilizing BO surrogates under localized sampling.
[LINK]http://arxiv.org/abs/2601.07094v1
[DATE]2026-01-12 07:34:24+08:00
[CATEGORIES]cs.LG
Tackling Heterogeneity in Quantum Federated Learning: An Integrated Sporadic-Personalized Approach
[AUTHORS]Ratun Rahman, Shaba Shaon, Dinh C. Nguyen
[ABSTRACT]Quantum federated learning (QFL) emerges as a powerful technique that combines quantum computing with federated learning to efficiently process complex data across distributed quantum devices while ensuring data privacy in quantum networks. Despite recent research efforts, existing QFL frameworks struggle to achieve optimal model training performance primarily due to inherent heterogeneity in terms of (i) quantum noise where current quantum devices are subject to varying levels of noise due to varying device quality and susceptibility to quantum decoherence, and (ii) heterogeneous data distributions where data across participating quantum devices are naturally non-independent and identically distributed (non-IID). To address these challenges, we propose a novel integrated sporadic-personalized approach called SPQFL that simultaneously handles quantum noise and data heterogeneity in a single QFL framework. It is featured in two key aspects: (i) for quantum noise heterogeneity, we introduce a notion of sporadic learning to tackle quantum noise heterogeneity across quantum devices, and (ii) for quantum data heterogeneity, we implement personalized learning through model regularization to mitigate overfitting during local training on non-IID quantum data distributions, thereby enhancing the convergence of the global model. Moreover, we conduct a rigorous convergence analysis for the proposed SPQFL framework, with both sporadic and personalized learning considerations. Theoretical findings reveal that the upper bound of the SPQFL algorithm is strongly influenced by both the number of quantum devices and the number of quantum noise measurements. Extensive simulation results in real-world datasets also illustrate that the proposed SPQFL approach yields significant improvements in terms of training performance and convergence stability compared to the state-of-the-art methods.
[COMMENTS]Accepted at IEEE Transactions on Computers
[LINK]http://arxiv.org/abs/2601.07882v1
[DATE]2026-01-12 07:29:08+08:00
[CATEGORIES]cs.LG
Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture
[AUTHORS]John Dunbar, Scott Aaronson
[ABSTRACT]We establish that randomly initialized neural networks, with large width and a natural choice of hyperparameters, have nearly independent outputs exactly when their activation function is nonlinear with zero mean under the Gaussian measure: $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[σ(z)]=0$. For example, this includes ReLU and GeLU with an additive shift, as well as tanh, but not ReLU or GeLU by themselves. Because of their nearly independent outputs, we propose neural networks with zero-mean activation functions as a promising candidate for the Alignment Research Center’s computational no-coincidence conjecture – a conjecture that aims to measure the limits of AI interpretability.
[LINK]http://arxiv.org/abs/2510.06527v2
[DATE]2026-01-12 07:00:28+08:00
[CATEGORIES]cs.LG
When Should We Introduce Safety Interventions During Pretraining?
[AUTHORS]Dylan Sam, Sachin Goyal, Pratyush Maini, Alexander Robey, J. Zico Kolter
[ABSTRACT]Ensuring the safety of language models in high-stakes settings remains a pressing challenge, as aligned behaviors are often brittle and easily undone by adversarial pressure or downstream finetuning. Prior work has shown that interventions applied during pretraining, such as rephrasing harmful content, can substantially improve the safety of the resulting models. In this paper, we study the fundamental question: “When during pretraining should safety interventions be introduced?” We keep the underlying data fixed and vary only the choice of a safety curriculum: the timing of these interventions, i.e., after 0%, 20%, or 60% of the pretraining token budget. We find that introducing interventions earlier generally yields more robust models with no increase in overrefusal rates, with the clearest benefits appearing after downstream, benign finetuning. We also see clear benefits in the steerability of models towards safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs harmful examples. Overall, these results argue for incorporating safety signals early in pretraining, producing models that are more robust to downstream finetuning and jailbreaking, and more reliable under both standard and safety-aware inference procedures.
[LINK]http://arxiv.org/abs/2601.07087v1
[DATE]2026-01-12 06:38:17+08:00
[CATEGORIES]cs.LG
XBTorch: A Unified Framework for Modeling and Co-Design of Crossbar-Based Deep Learning Accelerators
[AUTHORS]Osama Yousuf, Andreu L. Glasmann, Martin Lueker-Boden, Sina Najmaei, Gina C. Adam
[ABSTRACT]Emerging memory technologies have gained significant attention as a promising pathway to overcome the limitations of conventional computing architectures in deep learning applications. By enabling computation directly within memory, these technologies - built on nanoscale devices with tunable and nonvolatile conductance - offer the potential to drastically reduce energy consumption and latency compared to traditional von Neumann systems. This paper introduces XBTorch (short for CrossBarTorch), a novel simulation framework that integrates seamlessly with PyTorch and provides specialized tools for accurately and efficiently modeling crossbar-based systems based on emerging memory technologies. Through detailed comparisons and case studies involving hardware-aware training and inference, we demonstrate how XBTorch offers a unified interface for key research areas such as device-level modeling, cross-layer co-design, and inference-time fault tolerance. While exemplar studies utilize ferroelectric field-effect transistor (FeFET) models, the framework remains technology-agnostic - supporting other emerging memories such as resistive RAM (ReRAM), as well as enabling user-defined custom device models. The code is publicly available at: https://github.com/ADAM-Lab-GW/xbtorch
[LINK]http://arxiv.org/abs/2601.07086v1
[DATE]2026-01-12 06:35:30+08:00
[CATEGORIES]cs.LG
Robust Mean Estimation under Quantization
[AUTHORS]Pedro Abdalla, Junren Chen
[ABSTRACT]We consider the problem of mean estimation under quantization and adversarial corruption. We construct multivariate robust estimators that are optimal up to logarithmic factors in two different settings. The first is a one-bit setting, where each bit depends only on a single sample, and the second is a partial quantization setting, in which the estimator may use a small fraction of unquantized data.
[LINK]http://arxiv.org/abs/2601.07074v1
[DATE]2026-01-12 05:42:46+08:00
[CATEGORIES]cs.LG
Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs
[AUTHORS]Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali
[ABSTRACT]Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM’s speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It “fails fast” by spending minimal compute in hard-to-speculate regions to shrink speculation latency and “wins big” by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9$\times$ speedup over vanilla decoding, 1.7$\times$ over the best naive dLLM drafter, and 2.0$\times$ over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
[LINK]http://arxiv.org/abs/2512.20573v2
[DATE]2026-01-12 05:06:29+08:00
[CATEGORIES]cs.LG
Multi-Agent Cooperative Learning for Robust Vision-Language Alignment under OOD Concepts
[AUTHORS]Philip Xu, Isabel Wagner, Eerke Boiten
[ABSTRACT]This paper introduces a novel Multi-Agent Cooperative Learning (MACL) framework to address cross-modal alignment collapse in vision-language models when handling out-of-distribution (OOD) concepts. Four core agents, including image, text, name, and coordination agents, collaboratively mitigate modality imbalance through structured message passing. The proposed framework enables multi-agent feature space name learning, incorporates a context exchange enhanced few-shot learning algorithm, and adopts an adaptive dynamic balancing mechanism to regulate inter-agent contributions. Experiments on the VISTA-Beyond dataset demonstrate that MACL significantly improves performance in both few-shot and zero-shot settings, achieving 1-5% precision gains across diverse visual domains.
[LINK]http://arxiv.org/abs/2601.09746v1
[DATE]2026-01-12 04:36:47+08:00
[CATEGORIES]cs.LG
The Blueprints of Intelligence: A Functional-Topological Foundation for Perception and Representation
[AUTHORS]Eduardo Di Santi
[ABSTRACT]Real-world phenomena do not generate arbitrary variability: their signals concentrate on compact, low-variability subsets of functional space, enabling rapid generalization from few examples. A small child can recognize a dog after extremely limited exposure because the perceptual manifold of “dog” is compact, structured, and low-dimensional. We formalize this principle through a deterministic functional-topological framework in which the set of valid realizations produced by a physical process forms a compact subset of a Banach space, endowed with stable invariants, a finite Hausdorff radius, and an induced continuous perceptual functional.
This geometry provides explicit limits on knowledge, conditions for identifiability, and guarantees for generalization from sparse evidence – properties fundamental to both natural and artificial intelligence. Across electromechanical, electrochemical, and physiological domains, we show that real-world processes consistently generate compact perceptual manifolds with the same geometric characteristics. Their boundaries can be discovered in a fully self-supervised manner as the empirical radius saturates with increasing sampling, even when the governing equations are unknown.
These results demonstrate that deterministic functional topology offers a unified mathematical foundation for perception, representation, and world-model construction. It provides a geometric explanation for why biological learners and self-supervised AI systems can generalize from few observations, and establishes compact perceptual manifolds as a fundamental building block for future AI architectures. Finally, this work unifies biological perception and modern self-supervised models under a single geometric principle: both derive their generalization ability from the compactness and invariants of real-world perceptual manifolds.
[COMMENTS]35 pages, 6 figures. This preprint develops a deterministic functional-topological framework showing that physical systems generate compact perceptual manifolds with finite radius. We provide theory, Monte-Carlo estimators, and validation across PM, battery, and ECG domains, unifying biological perception and self-supervised AI
[LINK]http://arxiv.org/abs/2512.05089v3
[DATE]2026-01-12 04:32:14+08:00
[CATEGORIES]cs.LG
Stable Minima of ReLU Neural Networks Suffer from the Curse of Dimensionality: The Neural Shattering Phenomenon
[AUTHORS]Tongtong Liang, Dan Qiao, Yu-Xiang Wang, Rahul Parhi
[ABSTRACT]We study the implicit bias of flatness / low (loss) curvature and its effects on generalization in two-layer overparameterized ReLU networks with multivariate inputs – a problem well motivated by the minima stability and edge-of-stability phenomena in gradient-descent training. Existing work either requires interpolation or focuses only on univariate inputs. This paper presents new and somewhat surprising theoretical results for multivariate inputs. On two natural settings (1) generalization gap for flat solutions, and (2) mean-squared error (MSE) in nonparametric function estimation by stable minima, we prove upper and lower bounds, which establish that while flatness does imply generalization, the resulting rates of convergence necessarily deteriorate exponentially as the input dimension grows. This gives an exponential separation between the flat solutions compared to low-norm solutions (i.e., weight decay), which are known not to suffer from the curse of dimensionality. In particular, our minimax lower bound construction, based on a novel packing argument with boundary-localized ReLU neurons, reveals how flat solutions can exploit a kind of “neural shattering” where neurons rarely activate, but with high weight magnitudes. This leads to poor performance in high dimensions. We corroborate these theoretical findings with extensive numerical simulations. To the best of our knowledge, our analysis provides the first systematic explanation for why flat minima may fail to generalize in high dimensions.
[COMMENTS]Camera Ready Version. Accepted by Neurips 2025 (Spotlight)
[LINK]http://arxiv.org/abs/2506.20779v4
[DATE]2026-01-12 03:55:02+08:00
[CATEGORIES]cs.LG
Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach
[AUTHORS]Jacob Carlson
[ABSTRACT]Social scientists are increasingly turning to unstructured datasets to unlock new empirical insights, e.g., estimating descriptive statistics of or causal effects on quantitative measures derived from text, audio, or video data. In many such settings, unsupervised analysis is of primary interest, in that the researcher does not want to (or cannot) pre-specify all important aspects of the unstructured data to measure; they are interested in “discovery.” This paper proposes a general and flexible framework for pursuing discovery from unstructured data in a statistically principled way. The framework leverages recent methods from the literature on machine learning interpretability to map unstructured data points to high-dimensional, sparse, and interpretable “dictionaries” of concepts; computes statistics of dictionary entries for testing relevant concept-level hypotheses; performs selective inference on these hypotheses using algorithms validated by new results in high-dimensional central limit theory, producing a selected set (“discoveries”); and both generates and evaluates human-interpretable natural language descriptions of these discoveries. The proposed framework has few researcher degrees of freedom, is fully replicable, and is cheap to implement – both in terms of financial cost and researcher time. Applications to recent descriptive and causal analyses of unstructured data in empirical economics are explored. An open source Jupyter notebook is provided for researchers to implement the framework in their own projects.
[LINK]http://arxiv.org/abs/2511.01680v2
[DATE]2026-01-12 03:06:37+08:00
[CATEGORIES]cs.LG
Tight Analysis of Decentralized SGD: A Markov Chain Perspective
[AUTHORS]Lucas Versini, Paul Mangold, Aymeric Dieuleveut
[ABSTRACT]We propose a novel analysis of the Decentralized Stochastic Gradient Descent (DSGD) algorithm with constant step size, interpreting the iterates of the algorithm as a Markov chain. We show that DSGD converges to a stationary distribution, with its bias, to first order, decomposable into two components: one due to decentralization (growing with the graph’s spectral gap and clients’ heterogeneity) and one due to stochasticity. Remarkably, the variance of local parameters is, at the first-order, inversely proportional to the number of clients, regardless of the network topology and even when clients’ iterates are not averaged at the end. As a consequence of our analysis, we obtain non-asymptotic convergence bounds for clients’ local iterates, confirming that DSGD has linear speed-up in the number of clients, and that the network topology only impacts higher-order terms.
[LINK]http://arxiv.org/abs/2601.07021v1
[DATE]2026-01-12 02:28:37+08:00
[CATEGORIES]cs.LG
Expert System for Bitcoin Forecasting: Integrating Global Liquidity via TimeXer Transformers
[AUTHORS]Sravan Karthick T
[ABSTRACT]Bitcoin price forecasting is characterized by extreme volatility and non-stationarity, often defying traditional univariate time-series models over long horizons. This paper addresses a critical gap by integrating Global M2 Liquidity, aggregated from 18 major economies, as a leading exogenous variable with a 12-week lag structure. Using the TimeXer architecture, we compare a liquidity-conditioned forecasting model (TimeXer-Exog) against state-of-the-art benchmarks including LSTM, N-BEATS, PatchTST, and a standard univariate TimeXer. Experiments conducted on daily Bitcoin price data from January 2020 to August 2025 demonstrate that explicit macroeconomic conditioning significantly stabilizes long-horizon forecasts. At a 70-day forecast horizon, the proposed TimeXer-Exog model achieves a mean squared error (MSE) 1.08e8, outperforming the univariate TimeXer baseline by over 89 percent. These results highlight that conditioning deep learning models on global liquidity provides substantial improvements in long-horizon Bitcoin price forecasting.
[LINK]http://arxiv.org/abs/2512.22326v2
[DATE]2026-01-12 02:28:01+08:00
[CATEGORIES]cs.LG
Reqo: A Comprehensive Learning-Based Cost Model for Robust and Explainable Query Optimization
[AUTHORS]Baoming Chang, Amin Kamali, Verena Kantere
[ABSTRACT]Although machine learning (ML) shows potential in improving query optimization by generating and selecting more efficient plans, ensuring the robustness of learning-based cost models (LCMs) remains challenging. These LCMs currently lack explainability, which undermines user trust and limits the ability to derive insights from their cost predictions to improve plan quality. Accurately converting tree-structured query plans into representations via tree models is also essential, as omitting any details may negatively impact subsequent cost model performance. Additionally, inherent uncertainty in cost estimation leads to inaccurate predictions, resulting in suboptimal plan selection. To address these challenges, we introduce Reqo, a Robust and Explainable Query Optimization cost model that comprehensively enhances three main stages in query optimization: plan generation, plan representation, and plan selection. Reqo integrates three innovations: the first explainability technique for LCMs that quantifies subgraph contributions and produces plan generation hints to enhance candidate plan quality; a novel tree model based on Bidirectional Graph Neural Networks (Bi-GNNs) with a Gated Recurrent Unit (GRU) aggregator to further capture both node-level and structural information and effectively strengthen plan representation; and an uncertainty-aware learning-to-rank cost estimator that adaptively integrates cost estimates with uncertainties to enhance plan selection robustness. Extensive experiments demonstrate that Reqo outperforms state-of-the-art approaches across all three stages.
[LINK]http://arxiv.org/abs/2501.17414v2
[DATE]2026-01-12 02:16:12+08:00
[CATEGORIES]cs.LG
Conditional Normalizing Flows for Forward and Backward Joint State and Parameter Estimation
[AUTHORS]Luke S. Lagunowich, Guoxiang Grayson Tong, Daniele E. Schiavazzi
[ABSTRACT]Traditional filtering algorithms for state estimation – such as classical Kalman filtering, unscented Kalman filtering, and particle filters - show performance degradation when applied to nonlinear systems whose uncertainty follows arbitrary non-Gaussian, and potentially multi-modal distributions. This study reviews recent approaches to state estimation via nonlinear filtering based on conditional normalizing flows, where the conditional embedding is generated by standard MLP architectures, transformers or selective state-space models (like Mamba-SSM). In addition, we test the effectiveness of an optimal-transport-inspired kinetic loss term in mitigating overparameterization in flows consisting of a large collection of transformations. We investigate the performance of these approaches on applications relevant to autonomous driving and patient population dynamics, paying special attention to how they handle time inversion and chained predictions. Finally, we assess the performance of various conditioning strategies for an application to real-world COVID-19 joint SIR system forecasting and parameter estimation.
[LINK]http://arxiv.org/abs/2601.07013v1
[DATE]2026-01-12 02:01:42+08:00
[CATEGORIES]cs.LG
List-Level Distribution Coupling with Applications to Speculative Decoding and Lossy Compression
[AUTHORS]Joseph Rowan, Buu Phan, Ashish Khisti
[ABSTRACT]We study a relaxation of the problem of coupling probability distributions – a list of samples is generated from one distribution and an accept is declared if any one of these samples is identical to the sample generated from the other distribution. We propose a novel method for generating samples, which extends the Gumbel-max sampling suggested in Daliri et al. (arXiv:2408.07978) for coupling probability distributions. We also establish a corresponding lower bound on the acceptance probability, which we call the list matching lemma. We next discuss two applications of our setup. First, we develop a new mechanism for multi-draft speculative sampling that is simple to implement and achieves performance competitive with baselines such as SpecTr and SpecInfer across a range of language tasks. Our method also guarantees a certain degree of drafter invariance with respect to the output tokens which is not supported by existing schemes. We also provide a theoretical lower bound on the token level acceptance probability. As our second application, we consider distributed lossy compression with side information in a setting where a source sample is compressed and available to multiple decoders, each with independent side information. We propose a compression technique that is based on our generalization of Gumbel-max sampling and show that it provides significant gains in experiments involving synthetic Gaussian sources and the MNIST image dataset.
[LINK]http://arxiv.org/abs/2506.05632v3
[DATE]2026-01-12 01:43:19+08:00
[CATEGORIES]cs.LG
Unity Forests: Improving Interaction Modelling and Interpretability in Random Forests
[AUTHORS]Roman Hornung, Alexander Hapfelmeier
[ABSTRACT]Random forests (RFs) are widely used for prediction and variable importance analysis and are often believed to capture any types of interactions via recursive splitting. However, since the splits are chosen locally, interactions are only reliably captured when at least one involved covariate has a marginal effect. We introduce unity forests (UFOs), an RF variant designed to better exploit interactions involving covariates without marginal effects. In UFOs, the first few splits of each tree are optimized jointly across a random covariate subset to form a “tree root” capturing such interactions; the remainder is grown conventionally. We further propose the unity variable importance measure (VIM), which is based on out-of-bag split criterion values from the tree roots. Here, only a small fraction of tree root splits with the highest in-bag criterion values are considered per covariate, reflecting that covariates with purely interaction-based effects are discriminative only if a split in an interacting covariate occurred earlier in the tree. Finally, we introduce covariate-representative tree roots (CRTRs), which select representative tree roots per covariate and provide interpretable insight into the conditions - marginal or interactive - under which each covariate has its strongest effects. In a simulation study, the unity VIM reliably identified interacting covariates without marginal effects, unlike conventional RF-based VIMs. In a large-scale real-data comparison, UFOs achieved higher discrimination and predictive accuracy than standard RFs, with comparable calibration. The CRTRs reproduced the covariates’ true effect types reliably in simulated data and provided interesting insights in a real data analysis.
[COMMENTS]33 pages, 12 figures
[LINK]http://arxiv.org/abs/2601.07003v1
[DATE]2026-01-12 01:36:58+08:00
[CATEGORIES]cs.LG
Model Privacy: A Unified Framework to Understand Model Stealing Attacks and Defenses
[AUTHORS]Ganghua Wang, Yuhong Yang, Jie Ding
[ABSTRACT]The use of machine learning (ML) has become increasingly prevalent in various domains, highlighting the importance of understanding and ensuring its safety. One pressing concern is the vulnerability of ML applications to model stealing attacks. These attacks involve adversaries attempting to recover a learned model through limited query-response interactions, such as those found in cloud-based services or on-chip artificial intelligence interfaces. While existing literature proposes various attack and defense strategies, these often lack a theoretical foundation and standardized evaluation criteria. In response, this work presents a framework called “Model Privacy”, providing a foundation for comprehensively analyzing model stealing attacks and defenses. We establish a rigorous formulation for the threat model and objectives, propose methods to quantify the goodness of attack and defense strategies, and analyze the fundamental tradeoffs between utility and privacy in ML models. Our developed theory offers valuable insights into enhancing the security of ML models, especially highlighting the importance of the attack-specific structure of perturbations for effective defenses. We demonstrate the application of model privacy from the defender’s perspective through various learning scenarios. Extensive experiments corroborate the insights and the effectiveness of defense mechanisms developed under the proposed framework.
[LINK]http://arxiv.org/abs/2502.15567v2
[DATE]2026-01-12 01:29:19+08:00
[CATEGORIES]cs.LG
Graph-KV: Breaking Sequence via Injecting Structural Biases into Large Language Models
[AUTHORS]Haoyu Wang, Peihao Wang, Mufei Li, Shikun Liu, Siqi Miao, Zhangyang Wang, Pan Li
[ABSTRACT]Modern large language models (LLMs) are inherently auto-regressive, requiring input to be serialized into flat sequences regardless of their structural dependencies. This serialization hinders the model’s ability to leverage structural inductive biases, especially in tasks such as retrieval-augmented generation (RAG) and reasoning on data with native graph structures, where inter-segment dependencies are crucial. We introduce Graph-KV with the potential to overcome this limitation. Graph-KV leverages the KV-cache of text segments as condensed representations and governs their interaction through structural inductive biases. In this framework, ‘target’ segments selectively attend only to the KV-caches of their designated ‘source’ segments, rather than all preceding segments in a serialized sequence. This approach induces a graph-structured block mask, sparsifying attention and enabling a message-passing-like step within the LLM. Furthermore, strategically allocated positional encodings for source and target segments reduce positional bias and context window consumption. We evaluate Graph-KV across three scenarios: (1) seven RAG benchmarks spanning direct inference, multi-hop reasoning, and long-document understanding; (2) Arxiv-QA, a novel academic paper QA task with full-text scientific papers structured as citation ego-graphs; and (3) paper topic classification within a citation network. By effectively reducing positional bias and harnessing structural inductive biases, Graph-KV substantially outperforms baselines, including standard costly sequential encoding, across various settings. Code and the Graph-KV data are publicly available.
[LINK]http://arxiv.org/abs/2506.07334v2
[DATE]2026-01-12 00:36:02+08:00
[CATEGORIES]cs.LG
Match Made with Matrix Completion: Efficient Learning under Matching Interference
[AUTHORS]Zhiyuan Tang, Wanning Chen, Kan Xu
[ABSTRACT]Matching markets face increasing needs to learn the matching qualities between demand and supply for effective design of matching policies. In practice, the matching rewards are high-dimensional due to the growing diversity of participants. We leverage a natural low-rank matrix structure of the matching rewards in these two-sided markets, and propose to utilize matrix completion to accelerate reward learning with limited offline data. A unique property for matrix completion in this setting is that the entries of the reward matrix are observed with matching interference – i.e., the entries are not observed independently but dependently due to matching or budget constraints. Such matching dependence renders unique technical challenges, such as sub-optimality or inapplicability of the existing analytical tools in the matrix completion literature, since they typically rely on sample independence. In this paper, we first show that standard nuclear norm regularization remains theoretically effective under matching interference. We provide a near-optimal Frobenius norm guarantee in this setting, coupled with a new analytical technique. Next, to guide certain matching decisions, we develop a novel “double-enhanced” estimator, based off the nuclear norm estimator, with a near-optimal entry-wise guarantee. Our double-enhancement procedure can apply to broader sampling schemes even with dependence, which may be of independent interest. Additionally, we extend our approach to online learning settings with matching constraints such as optimal matching and stable matching, and present improved regret bounds in matrix dimensions. Finally, we demonstrate the practical value of our methods using both synthetic data and real data of labor markets.
[LINK]http://arxiv.org/abs/2601.06982v1
[DATE]2026-01-12 00:33:01+08:00
[CATEGORIES]cs.LG
Diffusion Factor Models: Generating High-Dimensional Returns with Factor Structure
[AUTHORS]Minshuo Chen, Renyuan Xu, Yumin Xu, Ruixun Zhang
[ABSTRACT]Financial scenario simulation is essential for risk management and portfolio optimization, yet it remains challenging especially in high-dimensional and small data settings common in finance. We propose a diffusion factor model that integrates latent factor structure into generative diffusion processes, bridging econometrics with modern generative AI to address the challenges of the curse of dimensionality and data scarcity in financial simulation. By exploiting the low-dimensional factor structure inherent in asset returns, we decompose the score function–a key component in diffusion models–using time-varying orthogonal projections, and this decomposition is incorporated into the design of neural network architectures. We derive rigorous statistical guarantees, establishing nonasymptotic error bounds for both score estimation at O(d^{5/2} n^{-2/(k+5)}) and generated distribution at O(d^{5/4} n^{-1/2(k+5)}), primarily driven by the intrinsic factor dimension k rather than the number of assets d, surpassing the dimension-dependent limits in the classical nonparametric statistics literature and making the framework viable for markets with thousands of assets. Numerical studies confirm superior performance in latent subspace recovery under small data regimes. Empirical analysis demonstrates the economic significance of our framework in constructing mean-variance optimal portfolios and factor portfolios. This work presents the first theoretical integration of factor structure with diffusion models, offering a principled approach for high-dimensional financial simulation with limited data. Our code is available at https://github.com/xymmmm00/diffusion_factor_model.
[LINK]http://arxiv.org/abs/2504.06566v5
[DATE]2026-01-12 00:28:42+08:00
[CATEGORIES]cs.LG
Bridging Symbolic Control and Neural Reasoning in LLM Agents: The Structured Cognitive Loop
[AUTHORS]Myung Ho Kim
[ABSTRACT]Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences. We introduce Structured Cognitive Loop (SCL), a modular architecture that explicitly separates agent cognition into five phases: Retrieval, Cognition, Control, Action, and Memory (R-CCAM). At the core of SCL is Soft Symbolic Control, an adaptive governance mechanism that applies symbolic constraints to probabilistic inference, preserving neural flexibility while restoring the explainability and controllability of classical symbolic systems. Through empirical validation on multi-step conditional reasoning tasks, we demonstrate that SCL achieves zero policy violations, eliminates redundant tool calls, and maintains complete decision traceability. These results address critical gaps in existing frameworks such as ReAct, AutoGPT, and memory-augmented approaches. Our contributions are threefold: (1) we situate SCL within the taxonomy of hybrid intelligence, differentiating it from prompt-centric and memory-only approaches; (2) we formally define Soft Symbolic Control and contrast it with neuro-symbolic AI; and (3) we derive three design principles for trustworthy agents: modular decomposition, adaptive symbolic governance, and transparent state management. We provide a complete open-source implementation demonstrating the R-CCAM loop architecture, alongside a live GPT-4o-powered travel planning agent. By connecting expert system principles with modern LLM capabilities, this work offers a practical and theoretically grounded path toward reliable, explainable, and governable AI agents.
[COMMENTS]The reference list has been updated to reflect recent work
[LINK]http://arxiv.org/abs/2511.17673v3
[DATE]2026-01-11 23:54:09+08:00
[CATEGORIES]cs.CL
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
[AUTHORS]Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen
[ABSTRACT]As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or task-oriented dialogue, failing to capture “long-term project-oriented” interactions where agents must track evolving goals.
To bridge this gap, we introduce RealMem, the first benchmark grounded in realistic project scenarios. RealMem comprises over 2,000 cross-session dialogues across eleven scenarios, utilizing natural user queries for evaluation.
We propose a synthesis pipeline that integrates Project Foundation Construction, Multi-Agent Dialogue Generation, and Memory and Schedule Management to simulate the dynamic evolution of memory. Experiments reveal that current memory systems face significant challenges in managing the long-term project states and dynamic context dependencies inherent in real-world projects.
Our code and datasets are available at https://github.com/AvatarMemory/RealMemBench.
[LINK]http://arxiv.org/abs/2601.06966v1
[DATE]2026-01-11 23:49:36+08:00
[CATEGORIES]cs.CL
Restoring Rhythm: Punctuation Restoration Using Transformer Models for Bangla, A Low-Resource Language
[AUTHORS]Md Obyedullahil Mamun, Md Adyelullahil Mamun, Arif Ahmad, Md. Imran Hossain Emu
[ABSTRACT]Punctuation restoration enhances the readability of text and is critical for post-processing tasks in Automatic Speech Recognition (ASR), especially for low-resource languages like Bangla. In this study, we explore the application of transformer-based models, specifically XLM-RoBERTa-large, to automatically restore punctuation in unpunctuated Bangla text. We focus on predicting four punctuation marks: period, comma, question mark, and exclamation mark across diverse text domains. To address the scarcity of annotated resources, we constructed a large, varied training corpus and applied data augmentation techniques. Our best-performing model, trained with an augmentation factor of alpha = 0.20%, achieves an accuracy of 97.1% on the News test set, 91.2% on the Reference set, and 90.2% on the ASR set.
Results show strong generalization to reference and ASR transcripts, demonstrating the model’s effectiveness in real-world, noisy scenarios. This work establishes a strong baseline for Bangla punctuation restoration and contributes publicly available datasets and code to support future research in low-resource NLP.
[LINK]http://arxiv.org/abs/2507.18448v2
[DATE]2026-01-11 23:27:33+08:00
[CATEGORIES]cs.CL cs.LG
X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
[AUTHORS]Jie Wu, Haoling Li, Xin Zhang, Jiani Guo, Jane Luo, Steven Liu, Yangyu Huang, Ruihang Chu, Scarlett Li, Yujiu Yang
[ABSTRACT]Competitive programming presents great challenges for Code LLMs due to its intensive reasoning demands and high logical complexity. However, current Code LLMs still rely heavily on real-world data, which limits their scalability. In this paper, we explore a fully synthetic approach: training Code LLMs with entirely generated tasks, solutions, and test cases, to empower code reasoning models without relying on real-world data. To support this, we leverage feature-based synthesis to propose a novel data synthesis pipeline called SynthSmith. SynthSmith shows strong potential in producing diverse and challenging tasks, along with verified solutions and tests, supporting both supervised fine-tuning and reinforcement learning. Based on the proposed synthetic SFT and RL datasets, we introduce the X-Coder model series, which achieves a notable pass rate of 62.9 avg@8 on LiveCodeBench v5 and 55.8 on v6, outperforming DeepCoder-14B-Preview and AReal-boba2-14B despite having only 7B parameters. In-depth analysis reveals that scaling laws hold on our synthetic dataset, and we explore which dimensions are more effective to scale. We further provide insights into code-centric reinforcement learning and highlight the key factors that shape performance through detailed ablations and analysis. Our findings demonstrate that scaling high-quality synthetic data and adopting staged training can greatly advance code reasoning, while mitigating reliance on real-world coding data.
[COMMENTS]Project: https://github.com/JieWu02/X-Coder
[LINK]http://arxiv.org/abs/2601.06953v1
[DATE]2026-01-11 23:22:33+08:00
[CATEGORIES]cs.CL cs.LG
Sliced-Wasserstein Distribution Alignment Loss Improves the Ultra-Low-Bit Quantization of Large Language Models
[AUTHORS]Deyu Cao, Yixin Yin, Samin Aref
[ABSTRACT]The benefits of most large language models come with steep and often hidden economic and environmental costs due to their resource usage inefficiency during deployment. Model quantization improves energy and memory efficiency through representing model parameters by lower-precision values. However, compression below 4-bits often distorts activation distributions and degrades performance. We address this challenge by introducing a sliced Wasserstein loss function for distribution-aware calibration in ultra-low-bit post-training quantization. The proposed loss aligns the output distributions of full-precision and quantized models under random linear projections, complementing standard mean-squared error loss without adding any computational overhead during inference. Our proposed loss function can be incorporated with any post-training quantization framework that has a retraining component. We demonstrate the performance gains of our proposed model by incorporating it with two frontier methods known as OmniQuant and TesseraQ. Compared to these two baselines, the proposed loss consistently improves both perplexity and downstream task accuracy across multiple ultra-low-bit settings. Our proposed loss function recovers 4.12-20.37% of the OmniQuant’s lost accuracy on the language model LLaMA-2-7B, 0.93-7.65% on OPT-6.7B, and 2.26-6.20% on LLaMA-2-13B. TesseraQ’s accuracy degradation is recovered by 3.63-7.63% in relative terms when augmented by our proposed loss function. Taken together, these results demonstrate that distributional alignment provides a simple yet effective performance boost that can push the limits of frontier quantization methods. Our method is available on GitHub to facilitate future progress in ultra-low-bit quantization.
[COMMENTS]Post-peer-review accepted manuscript, 17 pages including the supplementary information
[LINK]http://arxiv.org/abs/2601.07878v1
[DATE]2026-01-11 23:14:05+08:00
[CATEGORIES]cs.LG cs.CL
Deep Value Benchmark: Measuring Whether Models Generalize Deep Values or Shallow Preferences
[AUTHORS]Joshua Ashkinaze, Hua Shen, Saipranav Avula, Eric Gilbert, Ceren Budak
[COMMENTS]NeurIPS 2025 (Spotlight)
[LINK]http://arxiv.org/abs/2511.02109v3
[DATE]2026-01-11 22:57:32+08:00
[CATEGORIES]cs.CL
Symphonym: Universal Phonetic Embeddings for Cross-Script Toponym Matching via Teacher-Student Distillation
[AUTHORS]Stephen Gadd
[ABSTRACT]Linking place names across languages and writing systems is a fundamental challenge in digital humanities and geographic information retrieval. Existing approaches rely on language-specific phonetic algorithms or transliteration rules that fail when names cross script boundaries – no string metric can determine that “Moscow” when rendered in Cyrillic or Arabic refer to the same city.
I present Symphonym, a neural embedding system that maps toponyms from 20 writing systems into a unified 128-dimensional phonetic space. A Teacher network trained on articulatory phonetic features (via Epitran and PanPhon) produces target embeddings, while a Student network learns to approximate these from raw characters. At inference, only the lightweight Student (1.7M parameters) is required, enabling deployment without runtime phonetic conversion.
Training uses a three-phase curriculum on 57 million toponyms from GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names. Phase 1 trains the Teacher on 467K phonetically-grounded triplets. Phase 2 aligns the Student to Teacher outputs across 23M samples, achieving 96.6% cosine similarity. Phase 3 fine-tunes on 3.3M hard negative triplets – negatives sharing prefix and script with the anchor but referring to different places – to sharpen discrimination.
Evaluation on the MEHDIE Hebrew-Arabic benchmark achieves 89.2% Recall@1, outperforming Levenshtein (81.5%) and Jaro-Winkler (78.5%). The system is optimised for cross-script matching; same-script variants can be handled by complementary string methods. Symphonym will enable fuzzy phonetic reconciliation and search across the World Historical Gazetteer’s 67 million toponyms. Code and models are publicly available.
[COMMENTS]30 pages, 5 tables, 2 figures
[LINK]http://arxiv.org/abs/2601.06932v1
[DATE]2026-01-11 22:36:36+08:00
[CATEGORIES]cs.CL
TreePS-RAG: Tree-based Process Supervision for Reinforcement Learning in Agentic RAG
[AUTHORS]Tianhua Zhang, Kun Li, Junan Li, Yunxiang Li, Hongyin Luo, Xixin Wu, James Glass, Helen Meng
[ABSTRACT]Agentic retrieval-augmented generation (RAG) formulates question answering as a multi-step interaction between reasoning and information retrieval, and has recently been advanced by reinforcement learning (RL) with outcome-based supervision. While effective, relying solely on sparse final rewards limits step-wise credit assignment and provides weak guidance for intermediate reasoning and actions. Recent efforts explore process-level supervision, but typically depend on offline constructed training data, which risks distribution shift, or require costly intermediate annotations. We present TreePS-RAG, an online, tree-based RL framework for agentic RAG that enables step-wise credit assignment while retaining standard outcome-only rewards. Our key insight is to model agentic RAG reasoning as a rollout tree, where each reasoning step naturally maps to a node. This tree structure allows step utility to be estimated via Monte Carlo estimation over its descendant outcomes, yielding fine-grained process advantages without requiring intermediate labels. To make this paradigm practical, we introduce an efficient online tree construction strategy that preserves exploration diversity under a constrained computational budget. With a rollout cost comparable to strong baselines like Search-R1, experiments on seven multi-hop and general QA benchmarks across multiple model scales show that TreePS-RAG consistently and significantly outperforms both outcome-supervised and leading process-supervised RL methods.
[LINK]http://arxiv.org/abs/2601.06922v1
[DATE]2026-01-11 22:07:30+08:00
[CATEGORIES]cs.CL
Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models
[AUTHORS]Shaoning Sun, Mingzhu Cai, Huang He, Bingjin Chen, Siqi Bao, Yujiu Yang, Hua Wu, Haifeng Wang
[ABSTRACT]Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: \textbf{distributional clarity} in probability space. Through a three-stage analysis-from phenomenon to mechanism to interpretation-we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the \textbf{Silhouette Coefficient} ($S$) and demonstrate that (1) high $S$ correlates strongly with RL performance; (2) low $S$ is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-$S$ samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-Friendliness.
[LINK]http://arxiv.org/abs/2601.06911v1
[DATE]2026-01-11 21:34:44+08:00
[CATEGORIES]cs.CL cs.LG
OM4OV: Leveraging Ontology Matching for Ontology Versioning
[AUTHORS]Zhangcheng Qiang, Kerry Taylor, Weiqing Wang
[ABSTRACT]Due to the dynamic nature of the Semantic Web, version control is necessary to manage changes in widely used ontologies. Despite the long-standing recognition of ontology versioning (OV) as a crucial component of efficient ontology management, many approaches treat OV as similar to ontology matching (OM) and directly reuse OM systems for OV tasks. In this study, we systematically analyse similarities and differences between OM and OV and formalise the OM4OV pipeline to offer more advanced OV support. The pipeline is implemented and evaluated in the state-of-the-art OM system Agent-OM. The experimental results indicate that OM systems can be reused for OV tasks, but without necessary extensions, the current OM4OV pipeline can produce skewed measurements, poor performance in detecting update entities, and limited explainability of false mappings. To tackle these issues, we propose an optimisation method called the cross-reference (CR) mechanism, which builds on existing OM alignments to reduce the number of matching candidates and to improve overall OV performance.
[COMMENTS]17 pages, 8 figures, 2 tables
[LINK]http://arxiv.org/abs/2409.20302v8
[DATE]2026-01-11 21:31:36+08:00
[CATEGORIES]cs.CL
Fine-grained Verbal Attack Detection via a Hierarchical Divide-and-Conquer Framework
[AUTHORS]Quan Zheng, Yuanhe Tian, Ming Wang, Yan Song
[ABSTRACT]In the digital era, effective identification and analysis of verbal attacks are essential for maintaining online civility and ensuring social security. However, existing research is limited by insufficient modeling of conversational structure and contextual dependency, particularly in Chinese social media where implicit attacks are prevalent. Current attack detection studies often emphasize general semantic understanding while overlooking user response relationships, hindering the identification of implicit and context-dependent attacks. To address these challenges, we present the novel “Hierarchical Attack Comment Detection” dataset and propose a divide-and-conquer, fine-grained framework for verbal attack recognition based on spatiotemporal information. The proposed dataset explicitly encodes hierarchical reply structures and chronological order, capturing complex interaction patterns in multi-turn discussions. Building on this dataset, the framework decomposes attack detection into hierarchical subtasks, where specialized lightweight models handle explicit detection, implicit intent inference, and target identification under constrained context. Extensive experiments on the proposed dataset and benchmark intention detection datasets show that smaller models using our framework significantly outperform larger monolithic models relying on parameter scaling, demonstrating the effectiveness of structured task decomposition.
[COMMENTS]13pages, 5figures
[LINK]http://arxiv.org/abs/2601.06907v1
[DATE]2026-01-11 21:17:59+08:00
[CATEGORIES]cs.CL
LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction
[AUTHORS]Weichu Liu, Jing Xiong, Yuxuan Hu, Zixuan Li, Minghuan Tan, Ningning Mao, Hui Shen, Wendong Xu, Chaofan Tao, Min Yang, Chengming Li, Lingpeng Kong, Ngai Wong
[ABSTRACT]Large language models (LLMs) have made significant progress in Emotional Intelligence (EI) and long-context modeling. However, existing benchmarks often overlook the fact that emotional information processing unfolds as a continuous long-context process. To address the absence of multidimensional EI evaluation in long-context inference and explore model performance under more challenging conditions, we present LongEmotion, a benchmark that encompasses a diverse suite of tasks targeting the assessment of models’ capabilities in Emotion Recognition, Knowledge Application, and Empathetic Generation, with an average context length of 15,341 tokens. To enhance performance under realistic constraints, we introduce the Collaborative Emotional Modeling (CoEM) framework, which integrates Retrieval-Augmented Generation (RAG) and multi-agent collaboration to improve models’ EI in long-context scenarios. We conduct a detailed analysis of various models in long-context settings, investigating how reasoning mode activation, RAG-based retrieval strategies, and context-length adaptability influence their EI performance. Our project page is: https://longemotion.github.io/
[COMMENTS]Technical Report
[LINK]http://arxiv.org/abs/2509.07403v2
[DATE]2026-01-11 20:20:05+08:00
[CATEGORIES]cs.CL
Paraphrasing Adversarial Attack on LLM-as-a-Reviewer
[AUTHORS]Masahiro Kaneko
[ABSTRACT]The use of large language models (LLMs) in peer review systems has attracted growing attention, making it essential to examine their potential vulnerabilities. Prior attacks rely on prompt injection, which alters manuscript content and conflates injection susceptibility with evaluation robustness. We propose the Paraphrasing Adversarial Attack (PAA), a black-box optimization method that searches for paraphrased sequences yielding higher review scores while preserving semantic equivalence and linguistic naturalness. PAA leverages in-context learning, using previous paraphrases and their scores to guide candidate generation. Experiments across five ML and NLP conferences with three LLM reviewers and five attacking models show that PAA consistently increases review scores without changing the paper’s claims. Human evaluation confirms that generated paraphrases maintain meaning and naturalness. We also find that attacked papers exhibit increased perplexity in reviews, offering a potential detection signal, and that paraphrasing submissions can partially mitigate attacks.
[LINK]http://arxiv.org/abs/2601.06884v1
[DATE]2026-01-11 20:14:10+08:00
[CATEGORIES]cs.CL cs.LG
BiasLab: A Multilingual, Dual-Framing Framework for Robust Measurement of Output-Level Bias in Large Language Models
[AUTHORS]William Guey, Wei Zhang, Pei-Luen Patrick Rau, Pierrick Bougault, Vitor D. de Moura, Bertan Ucar, Jose O. Gomes
[ABSTRACT]Large Language Models (LLMs) are increasingly deployed in high-stakes contexts where their outputs influence real-world decisions. However, evaluating bias in LLM outputs remains methodologically challenging due to sensitivity to prompt wording, limited multilingual coverage, and the lack of standardized metrics that enable reliable comparison across models. This paper introduces BiasLab, an open-source, model-agnostic evaluation framework for quantifying output-level (extrinsic) bias through a multilingual, robustness-oriented experimental design. BiasLab constructs mirrored probe pairs under a strict dual-framing scheme: an affirmative assertion favoring Target A and a reverse assertion obtained by deterministic target substitution favoring Target B, while preserving identical linguistic structure. To reduce dependence on prompt templates, BiasLab performs repeated evaluation under randomized instructional wrappers and enforces a fixed-choice Likert response format to maximize comparability across models and languages. Responses are normalized into agreement labels using an LLM-based judge, aligned for polarity consistency across framings, and aggregated into quantitative bias indicators with descriptive statistics including effect sizes and neutrality rates. The framework supports evaluation across diverse bias axes, including demographic, cultural, political, and geopolitical topics, and produces reproducible artifacts such as structured reports and comparative visualizations. BiasLab contributes a standardized methodology for cross-lingual and framing-sensitive bias measurement that complements intrinsic and dataset-based audits, enabling researchers and institutions to benchmark robustness and make better-informed deployment decisions.
[COMMENTS]source code and reproducibility scripts available on GitHub
[LINK]http://arxiv.org/abs/2601.06861v1
[DATE]2026-01-11 19:07:46+08:00
[CATEGORIES]cs.CL
†DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems
[AUTHORS]Zabir Al Nazi, Shubhashis Roy Dipta, Sudipta Kar
[ABSTRACT]Chain-of-Thought (CoT) prompting is widely adopted for mathematical problem solving, including in low-resource languages, yet its behavior under irrelevant context remains underexplored. To systematically study this challenge, we introduce DISTRACTMATH-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information. Evaluating seven models ranging from 3B to 12B parameters, we observe substantial performance degradation under distractors: standard models drop by up to 41 points, while reasoning-specialized models decline by 14 to 20 points despite consuming five times more tokens. We propose †DAGGER, which reformulates mathematical problem solving as executable computational graph generation with explicit modeling of distractor nodes. Fine-tuning Gemma-3 models using supervised fine-tuning followed by Group Relative Policy Optimization achieves comparable weighted accuracy on augmented benchmarks while using 89 percent fewer tokens than reasoning models. Importantly, this robustness emerges without explicit training on distractor-augmented examples. Our results suggest that enforcing structured intermediate representations improves robustness and inference efficiency in mathematical reasoning compared to free-form approaches, particularly in noisy, low-resource settings.
[LINK]http://arxiv.org/abs/2601.06853v1
[DATE]2026-01-11 18:51:03+08:00
[CATEGORIES]cs.CL cs.LG
Explainable Multimodal Aspect-Based Sentiment Analysis with Dependency-guided Large Language Model
[AUTHORS]Zhongzheng Wang, Yuanhe Tian, Hongzhi Wang, Yan Song
[ABSTRACT]Multimodal aspect-based sentiment analysis (MABSA) aims to identify aspect-level sentiments by jointly modeling textual and visual information, which is essential for fine-grained opinion understanding in social media. Existing approaches mainly rely on discriminative classification with complex multimodal fusion, yet lacking explicit sentiment explainability. In this paper, we reformulate MABSA as a generative and explainable task, proposing a unified framework that simultaneously predicts aspect-level sentiment and generates natural language explanations. Based on multimodal large language models (MLLMs), our approach employs a prompt-based generative paradigm, jointly producing sentiment and explanation. To further enhance aspect-oriented reasoning capabilities, we propose a dependency-syntax-guided sentiment cue strategy. This strategy prunes and textualizes the aspect-centered dependency syntax tree, guiding the model to distinguish different sentiment aspects and enhancing its explainability. To enable explainability, we use MLLMs to construct new datasets with sentiment explanations to fine-tune. Experiments show that our approach not only achieves consistent gains in sentiment classification accuracy, but also produces faithful, aspect-grounded explanations.
[COMMENTS]9 pages, 3 figures
[LINK]http://arxiv.org/abs/2601.06848v1
[DATE]2026-01-11 18:41:33+08:00
[CATEGORIES]cs.CL
Training Versatile Coding Agents in Synthetic Environments
[AUTHORS]Yiqi Zhu, Apurva Gandhi, Graham Neubig
[LINK]http://arxiv.org/abs/2512.12216v2
[DATE]2026-01-11 18:24:25+08:00
[CATEGORIES]cs.CL
Speak While Watching: Unleashing TRUE Real-Time Video Understanding Capability of Multimodal Large Language Models
[AUTHORS]Junyan Lin, Junlong Tong, Hao Wu, Jialiang Zhang, Jinming Liu, Xin Jin, Xiaoyu Shen
[ABSTRACT]Multimodal Large Language Models (MLLMs) have achieved strong performance across many tasks, yet most systems remain limited to offline inference, requiring complete inputs before generating outputs. Recent streaming methods reduce latency by interleaving perception and generation, but still enforce a sequential perception-generation cycle, limiting real-time interaction. In this work, we target a fundamental bottleneck that arises when extending MLLMs to real-time video understanding: the global positional continuity constraint imposed by standard positional encoding schemes. While natural in offline inference, this constraint tightly couples perception and generation, preventing effective input-output parallelism. To address this limitation, we propose a parallel streaming framework that relaxes positional continuity through three designs: Overlapped, Group-Decoupled, and Gap-Isolated. These designs enable simultaneous perception and generation, allowing the model to process incoming inputs while producing responses in real time. Extensive experiments reveal that Group-Decoupled achieves the best efficiency-performance balance, maintaining high fluency and accuracy while significantly reducing latency. We further show that the proposed framework yields up to 2x acceleration under balanced perception-generation workloads, establishing a principled pathway toward speak-while-watching real-time systems. We make all our code publicly available: https://github.com/EIT-NLP/Speak-While-Watching.
[LINK]http://arxiv.org/abs/2601.06843v1
[DATE]2026-01-11 18:12:11+08:00
[CATEGORIES]cs.CL
SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders
[AUTHORS]Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Qingsong Wen, Shikun Zhang, Wei Ye
[ABSTRACT]Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality, require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality through sampling LLM outputs instead of modifying. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework’s effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark’s consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
[COMMENTS]24 pages, 12 figures, NeurIPS 2025, code available: https://zhuohaoyu.github.io/SAEMark
[LINK]http://arxiv.org/abs/2508.08211v2
[DATE]2026-01-11 17:45:34+08:00
[CATEGORIES]cs.CL cs.LG
PDR: A Plug-and-Play Positional Decay Framework for LLM Pre-training Data Detection
[AUTHORS]Jinhan Liu, Yibo Yang, Ruiying Lu, Piotr Piekos, Yimeng Chen, Peng Wang, Dandan Guo
[ABSTRACT]Detecting pre-training data in Large Language Models (LLMs) is crucial for auditing data privacy and copyright compliance, yet it remains challenging in black-box, zero-shot settings where computational resources and training data are scarce. While existing likelihood-based methods have shown promise, they typically aggregate token-level scores using uniform weights, thereby neglecting the inherent information-theoretic dynamics of autoregressive generation. In this paper, we hypothesize and empirically validate that memorization signals are heavily skewed towards the high-entropy initial tokens, where model uncertainty is highest, and decay as context accumulates. To leverage this linguistic property, we introduce Positional Decay Reweighting (PDR), a training-free and plug-and-play framework. PDR explicitly reweights token-level scores to amplify distinct signals from early positions while suppressing noise from later ones. Extensive experiments show that PDR acts as a robust prior and can usually enhance a wide range of advanced methods across multiple benchmarks.
[LINK]http://arxiv.org/abs/2601.06827v1
[DATE]2026-01-11 17:32:13+08:00
[CATEGORIES]cs.CL
Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation
[AUTHORS]Dingwei Chen, Ziqiang Liu, Feiteng Fang, Chak Tou Leong, Shiwen Ni, Ahmadreza Argha, Hamid Alinejad-Rokny, Min Yang, Chengming Li
[ABSTRACT]Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs, commonly referred to as ‘‘hallucinations’’, remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose PLI (Premature Layers Interpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs’ internal mechanisms. Our dataset and code are available at https://github.com/CuSO4-Chen/PLI.
[LINK]http://arxiv.org/abs/2506.02973v2
[DATE]2026-01-11 16:43:55+08:00
[CATEGORIES]cs.CL
Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
[AUTHORS]Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Ying-Cong Chen
[ABSTRACT]Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized-only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model’s KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks-e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0-while remaining highly efficient.
[COMMENTS]Code: https://github.com/EnVision-Research/MTI
[LINK]http://arxiv.org/abs/2510.13940v3
[DATE]2026-01-11 16:32:23+08:00
[CATEGORIES]cs.CL
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
[AUTHORS]Yubo Wang, Juntian Zhang, Yichen Wu, Yankai Lin, Nils Lukas, Yuhan Liu
[ABSTRACT]While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a “Forest-before-Trees” cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.
[LINK]http://arxiv.org/abs/2601.06803v1
[DATE]2026-01-11 16:30:49+08:00
[CATEGORIES]cs.CL
Berezinskii–Kosterlitz–Thouless transition in a context-sensitive random language model
[AUTHORS]Yuma Toji, Jun Takahashi, Vwani Roychowdhury, Hideyuki Miyahara
[ABSTRACT]Several power-law critical properties involving different statistics in natural languages – reminiscent of scaling properties of physical systems at or near phase transitions – have been documented for decades. The recent rise of large language models has added further evidence and excitement by providing intriguing similarities with notions in physics such as scaling laws and emergent abilities. However, specific instances of classes of generative language models that exhibit phase transitions, as understood by the statistical physics community, are lacking. In this work, inspired by the one-dimensional Potts model in statistical physics, we construct a simple probabilistic language model that falls under the class of context-sensitive grammars, which we call the context-sensitive random language model, and numerically demonstrate an unambiguous phase transition in the framework of a natural language model. We explicitly show that a precisely defined order parameter – that captures symbol frequency biases in the sentences generated by the language model – changes from strictly zero to a strictly nonzero value (in the infinite-length limit of sentences), implying a mathematical singularity arising when tuning the parameter of the stochastic language model we consider. Furthermore, we identify the phase transition as a variant of the Berezinskii–Kosterlitz–Thouless (BKT) transition, which is known to exhibit critical properties not only at the transition point but also in the entire phase. This finding leads to the possibility that critical properties in natural languages may not require careful fine-tuning nor self-organized criticality, but are generically explained by the underlying connection between language structures and the BKT phases.
[COMMENTS]accepted for publication in PRE
[LINK]http://arxiv.org/abs/2412.01212v2
[DATE]2026-01-11 16:04:59+08:00
[CATEGORIES]cs.CL cs.LG
CIRAG: Construction-Integration Retrieval and Adaptive Generation for Multi-hop Question Answering
[AUTHORS]Zili Wei, Xiaocui Yang, Yilin Wang, Zihan Wang, Weidong Bao, Shi Feng, Daling Wang, Yifei Zhang
[ABSTRACT]Triple-based Iterative Retrieval-Augmented Generation (iRAG) mitigates document-level noise for multi-hop question answering. However, existing methods still face limitations: (i) greedy single-path expansion, which propagates early errors and fails to capture parallel evidence from different reasoning branches, and (ii) granularity-demand mismatch, where a single evidence representation struggles to balance noise control with contextual sufficiency. In this paper, we propose the Construction-Integration Retrieval and Adaptive Generation model, CIRAG. It introduces an Iterative Construction-Integration module that constructs candidate triples and history-conditionally integrates them to distill core triples and generate the next-hop query. This module mitigates the greedy trap by preserving multiple plausible evidence chains. Besides, we propose an Adaptive Cascaded Multi-Granularity Generation module that progressively expands contextual evidence based on the problem requirements, from triples to supporting sentences and full passages. Moreover, we introduce Trajectory Distillation, which distills the teacher model’s integration policy into a lightweight student, enabling efficient and reliable long-horizon reasoning. Extensive experiments demonstrate that CIRAG achieves superior performance compared to existing iRAG methods.
[LINK]http://arxiv.org/abs/2601.06799v1
[DATE]2026-01-11 15:56:02+08:00
[CATEGORIES]cs.CL
Garbage Attention in Large Language Models: BOS Sink Heads and Sink-aware Pruning
[AUTHORS]Jaewon Sok, Jewon Yeom, Seonghyeon Park, Jeongjae Park, Taesup Kim
[ABSTRACT]Large Language Models (LLMs) are known to contain significant redundancy, yet a systematic explanation for why certain components, particularly in higher layers, are more redundant has remained elusive. In this work, we identify the BOS sink phenomenon as a key mechanism driving this layer-wise sensitivity. We show that attention heads with high BOS sink scores are strongly associated with functional redundancy: such heads, especially in deeper layers, contribute little to predictive performance and effectively serve as \emph{dumping grounds} for superfluous attention weights. This provides a concrete functional explanation for the structural redundancy reported in prior studies. Leveraging this insight, we introduce a simple pruning strategy that removes high-BOS sink heads. Experiments on Gemma-3, Llama-3.1, and Qwen3 demonstrate that this approach identifies redundant transformer components more reliably than weight- or activation-based criteria, while preserving performance close to dense baselines even under aggressive pruning. Moreover, we find that the behavior of sink heads remains stable across different sequence lengths. Overall, our results suggest that structural properties of attention offer a more intuitive and robust basis for model compression than magnitude-based methods.
[LINK]http://arxiv.org/abs/2601.06787v1
[DATE]2026-01-11 14:33:30+08:00
[CATEGORIES]cs.CL
EpiCaR: Knowing What You Don’t Know Matters for Better Reasoning in LLMs
[AUTHORS]Jewon Yeom, Jaewon Sok, Seonghyeon Park, Jeongjae Park, Taesup Kim
[ABSTRACT]Improving the reasoning abilities of large language models (LLMs) has largely relied on iterative self-training with model-generated data. While effective at boosting accuracy, existing approaches primarily reinforce successful reasoning paths, incurring a substantial calibration cost: models become overconfident and lose the ability to represent uncertainty. This failure has been characterized as a form of model collapse in alignment, where predictive distributions degenerate toward low-variance point estimates. We address this issue by reframing reasoning training as an epistemic learning problem, in which models must learn not only how to reason, but also when their reasoning should be trusted. We propose epistemically-calibrated reasoning (EpiCaR) as a training objective that jointly optimizes reasoning performance and calibration, and instantiate it within an iterative supervised fine-tuning framework using explicit self-evaluation signals. Experiments on Llama-3 and Qwen-3 families demonstrate that our approach achieves Pareto-superiority over standard baselines in both accuracy and calibration, particularly in models with sufficient reasoning capacity (e.g., 3B+). This framework generalizes effectively to OOD mathematical reasoning (GSM8K) and code generation (MBPP). Ultimately, our approach enables a 3X reduction in inference compute, matching the K=30 performance of STaR with only K=10 samples in capable models.
[LINK]http://arxiv.org/abs/2601.06786v1
[DATE]2026-01-11 14:21:13+08:00
[CATEGORIES]cs.CL
Evaluation of the Automated Labeling Method for Taxonomic Nomenclature Through Prompt-Optimized Large Language Model
[AUTHORS]Keito Inoshita, Kota Nojiri, Haruto Sugeno, Takumi Taga
[ABSTRACT]Scientific names of organisms consist of a genus name and a species epithet, with the latter often reflecting aspects such as morphology, ecology, distribution, and cultural background. Traditionally, researchers have manually labeled species names by carefully examining taxonomic descriptions, a process that demands substantial time and effort when dealing with large datasets. This study evaluates the feasibility of automatic species name labeling using large language model (LLM) by leveraging their text classification and semantic extraction capabilities. Using the spider name dataset compiled by Mammola et al., we compared LLM-based labeling results-enhanced through prompt engineering-with human annotations. The results indicate that LLM-based classification achieved high accuracy in Morphology, Geography, and People categories. However, classification accuracy was lower in Ecology & Behavior and Modern & Past Culture, revealing challenges in interpreting animal behavior and cultural contexts. Future research will focus on improving accuracy through optimized few-shot learning and retrieval-augmented generation techniques, while also expanding the applicability of LLM-based labeling to diverse biological taxa.
[COMMENTS]This paper was accepted by IEEE IAICT
[LINK]http://arxiv.org/abs/2503.10662v2
[DATE]2026-01-11 13:20:29+08:00
[CATEGORIES]cs.CL
Multi-Stage Evolutionary Model Merging with Meta Data Driven Curriculum Learning for Sentiment-Specialized Large Language Modeling
[AUTHORS]Keito Inoshita, Xiaokang Zhou, Akira Kawai
[ABSTRACT]The emergence of large language models (LLMs) has significantly transformed natural language processing (NLP), enabling more generalized models to perform various tasks with minimal training. However, traditional sentiment analysis methods, which focus on individual tasks such as sentiment classification or aspect-based analysis, are not practical for real-world applications that usually require handling multiple tasks. While offering flexibility, LLMs in sentiment-specific tasks often fall short of the required accuracy. Techniques like fine-tuning and evolutionary model merging help integrate models into a unified framework, which can improve the learning performance while reducing computational costs. The use of task meta-data and curriculum learning to optimize learning processes remains underexplored, while sentiment analysis is a critical task in NLP that requires high accuracy and scalability across multiple subtasks. In this study, we propose a hybrid learning model called Multi-stage Evolutionary Model Merging with Meta data driven Curriculum Learning (MEM-MCL), to enhance the sentiment analysis in large language modeling. In particular, expert models are created through instruction tuning for specific sentiment tasks and then merged using evolutionary algorithms to form a unified model. The merging process is optimized with weak data to enhance performance across tasks. The curriculum learning is incorporated to provide a learning sequence based on task difficulty, improving knowledge extraction from LLMs. Experiment results demonstrate that the proposed MEM-MCL model outperforms conventional LLMs in a majority of sentiment analysis tasks, achieving superior results across various subtasks.
[COMMENTS]This paper was presented at the 10th IEEE International Conference on Data Science and Systems in December 2024 and is awaiting publication
[LINK]http://arxiv.org/abs/2601.06780v1
[DATE]2026-01-11 13:12:23+08:00
[CATEGORIES]cs.CL
Enhancing Rare Codes via Probability-Biased Directed Graph Attention for Long-Tail ICD Coding
[AUTHORS]Tianlei Chen, Yuxiao Chen, Yang Li, Feifei Wang
[ABSTRACT]Automated international classification of diseases (ICD) coding aims to assign multiple disease codes to clinical documents and plays a critical role in healthcare informatics. However, its performance is hindered by the extreme long-tail distribution of the ICD ontology, where a few common codes dominate while thousands of rare codes have very few examples. To address this issue, we propose a Probability-Biased Directed Graph Attention model (ProBias) that partitions codes into common and rare sets and allows information to flow only from common to rare codes. Edge weights are determined by conditional co-occurrence probabilities, which guide the attention mechanism to enrich rare-code representations with clinically related signals. To provide higher-quality semantic representations as model inputs, we further employ large language models to generate enriched textual descriptions for ICD codes, offering external clinical context that complements statistical co-occurrence signals. Applied to automated ICD coding, our approach significantly improves the representation and prediction of rare codes, achieving state-of-the-art performance on three benchmark datasets. In particular, we observe substantial gains in macro-averaged F1 score, a key metric for long-tail classification.
[LINK]http://arxiv.org/abs/2511.09559v2
[DATE]2026-01-11 11:53:10+08:00
[CATEGORIES]cs.LG cs.CL
GanitLLM: Difficulty-Aware Bengali Mathematical Reasoning through Curriculum-GRPO
[AUTHORS]Shubhashis Roy Dipta, Khairul Mahbub, Nadia Najjar
[ABSTRACT]We present a Bengali mathematical reasoning model called GanitLLM (named after the Bangla word for mathematics, “Ganit”), together with a new difficulty-aware Bengali math corpus and a curriculum-based GRPO pipeline. Bengali is one of the world’s most widely spoken languages, yet existing LLMs either reason in English and then translate, or simply fail on multi-step Bengali math, in part because reinforcement learning recipes are tuned for high-resource languages and collapse under reward sparsity in low-resource settings. To address this, we construct Ganit, a rigorously filtered and decontaminated Bengali math dataset with automatic difficulty tags derived from the pass@k of a strong evaluator model. Building on this dataset, we propose Curriculum-GRPO, which combines multi-stage training (SFT + GRPO) with difficulty-aware sampling and verifiable rewards for format, numerical correctness, and Bengali reasoning. On Bn-MGSM and Bn-MSVAMP, GanitLLM-4B improves over its Qwen3-4B base by +8 and +7 accuracy points, respectively, while increasing the percentage of Bengali reasoning tokens from 14% to over 88% and reducing average solution length from 943 to 193 words.
[LINK]http://arxiv.org/abs/2601.06767v1
[DATE]2026-01-11 11:49:18+08:00
[CATEGORIES]cs.CL cs.LG
MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues
[AUTHORS]Zheyuan Liu, Dongwhi Kim, Yixin Wan, Xiangchi Yuan, Zhaoxuan Tan, Fengran Mo, Meng Jiang
[ABSTRACT]Multimodal large language models (MLLMs) are increasingly deployed as assistants that interact through text and images, making it crucial to evaluate contextual safety when risk depends on both the visual scene and the evolving dialogue. Existing contextual safety benchmarks are mostly single-turn and often miss how malicious intent can emerge gradually or how the same scene can support both benign and exploitative goals. We introduce the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench), a benchmark of realistic images and multi-turn conversations that evaluates contextual safety in MLLMs under two complementary settings, escalation-based risk and context-switch risk. MTMCS-Bench offers paired safe and unsafe dialogues with structured evaluation. It contains over 30 thousand multimodal (image+text) and unimodal (text-only) samples, with metrics that separately measure contextual intent recognition, safety-awareness on unsafe cases, and helpfulness on benign ones. Across eight open-source and seven proprietary MLLMs, we observe persistent trade-offs between contextual safety and utility, with models tending to either miss gradual risks or over-refuse benign dialogues. Finally, we evaluate five current guardrails and find that they mitigate some failures but do not fully resolve multi-turn contextual risks.
[COMMENTS]A benchmark of realistic images and multi-turn conversations that evaluates contextual safety in MLLMs under two complementary settings
[LINK]http://arxiv.org/abs/2601.06757v1
[DATE]2026-01-11 11:10:56+08:00
[CATEGORIES]cs.CL
KARMA: Leveraging Multi-Agent LLMs for Automated Knowledge Graph Enrichment
[AUTHORS]Yuxing Lu, Wei Wu, Xukai Zhao, Rui Peng, Jinzhuo Wang
[ABSTRACT]Maintaining comprehensive and up-to-date knowledge graphs (KGs) is critical for modern AI systems, but manual curation struggles to scale with the rapid growth of scientific literature. This paper presents KARMA, a novel framework employing multi-agent large language models (LLMs) to automate KG enrichment through structured analysis of unstructured text. Our approach employs nine collaborative agents, spanning entity discovery, relation extraction, schema alignment, and conflict resolution that iteratively parse documents, verify extracted knowledge, and integrate it into existing graph structures while adhering to domain-specific schema. Experiments on 1,200 PubMed articles from three different domains demonstrate the effectiveness of KARMA in knowledge graph enrichment, with the identification of up to 38,230 new entities while achieving 83.1\% LLM-verified correctness and reducing conflict edges by 18.6\% through multi-layer assessments.
[COMMENTS]24 pages, 3 figures, 2 tables
[LINK]http://arxiv.org/abs/2502.06472v2
[DATE]2026-01-11 10:16:42+08:00
[CATEGORIES]cs.CL
Empirical Analysis of Decoding Biases in Masked Diffusion Models
[AUTHORS]Pengcheng Huang, Tianming Liu, Zhenghao Liu, Yukun Yan, Shuo Wang, Tong Xiao, Zulong Chen, Maosong Sun
[ABSTRACT]Masked diffusion models (MDMs), which leverage bidirectional attention and a denoising process, are narrowing the performance gap with autoregressive models (ARMs). However, their internal attention mechanisms remain under-explored. This paper investigates the attention behaviors in MDMs, revealing the phenomenon of Attention Floating. Unlike ARMs, where attention converges to a fixed sink, MDMs exhibit dynamic, dispersed attention anchors that shift across denoising steps and layers. Further analysis reveals its Shallow Structure-Aware, Deep Content-Focused attention mechanism: shallow layers utilize floating tokens to build a global structural framework, while deeper layers allocate more capability toward capturing semantic content. Empirically, this distinctive attention pattern provides a mechanistic explanation for the strong in-context learning capabilities of MDMs, allowing them to double the performance compared to ARMs in knowledge-intensive tasks. All codes are available at https://github.com/NEUIR/Uncode.
[COMMENTS]22 pages,17 figures
[LINK]http://arxiv.org/abs/2508.13021v3
[DATE]2026-01-11 09:39:09+08:00
[CATEGORIES]cs.CL
Generalization Bounds for Transformer Channel Decoders
[AUTHORS]Qinshan Zhang, Bin Chen, Yong Jiang, Shu-Tao Xia
[ABSTRACT]Transformer channel decoders, such as the Error Correction Code Transformer (ECCT), have shown strong empirical performance in channel decoding, yet their generalization behavior remains theoretically unclear. This paper studies the generalization performance of ECCT from a learning-theoretic perspective. By establishing a connection between multiplicative noise estimation errors and bit-error-rate (BER), we derive an upper bound on the generalization gap via bit-wise Rademacher complexity. The resulting bound characterizes the dependence on code length, model parameters, and training set size, and applies to both single-layer and multi-layer ECCTs. We further show that parity-check-based masked attention induces sparsity that reduces the covering number, leading to a tighter generalization bound. To the best of our knowledge, this work provides the first theoretical generalization guarantees for this class of decoders.
[COMMENTS]18 pages, 3 figures
[LINK]http://arxiv.org/abs/2601.06969v1
[DATE]2026-01-11 23:56:37+08:00
[CATEGORIES]cs.LG
The Impact of Anisotropic Covariance Structure on the Training Dynamics and Generalization Error of Linear Networks
[AUTHORS]Taishi Watanabe, Ryo Karakida, Jun-nosuke Teramae
[ABSTRACT]The success of deep neural networks largely depends on the statistical structure of the training data. While learning dynamics and generalization on isotropic data are well-established, the impact of pronounced anisotropy on these crucial aspects is not yet fully understood. We examine the impact of data anisotropy, represented by a spiked covariance structure, a canonical yet tractable model, on the learning dynamics and generalization error of a two-layer linear network in a linear regression setting. Our analysis reveals that the learning dynamics proceed in two distinct phases, governed initially by the input-output correlation and subsequently by other principal directions of the data structure. Furthermore, we derive an analytical expression for the generalization error, quantifying how the alignment of the spike structure of the data with the learning task improves performance. Our findings offer deep theoretical insights into how data anisotropy shapes the learning trajectory and final performance, providing a foundation for understanding complex interactions in more advanced network architectures.
[COMMENTS]18 pages
[LINK]http://arxiv.org/abs/2601.06961v1
[DATE]2026-01-11 23:37:39+08:00
[CATEGORIES]cs.LG
HAS-VQ: Hessian-Adaptive Sparse Vector Quantization for High-Fidelity LLM Compression
[AUTHORS]Vladimer Khasia
[ABSTRACT]Post-training quantization is essential for deploying Large Language Models (LLMs) on resource-constrained devices. However, standard integer quantization (e.g., INT4) fundamentally degrades performance by imposing a uniform grid on the heavy-tailed distribution of weight parameters, particularly in smaller-scale models (e.g., <2B parameters). We introduce HAS-VQ (Hessian-Adaptive Sparse Vector Quantization), a compression framework that strictly decouples high-sensitivity outliers from the bulk weight distribution using second-order sensitivity analysis. HAS-VQ employs a Hessian-Masked Decoupling strategy to isolate sensitive parameters, followed by robust Vector Quantization (VQ) of the remaining dense body. Crucially, we introduce a residual sparse feedback mechanism that corrects quantization errors in the most sensitive dimensions, ensuring exact reconstruction of outliers. We evaluate HAS-VQ on SmolLM2-1.7B, demonstrating two distinct regimes of superiority: (1) Pareto Dominance over Integer Baselines: At 4.23 effective bits-per-parameter (BPP), we achieve a perplexity of 14.23, significantly outperforming the standard INT4 baseline (20.03 PPL at 4.71 BPP). (2) High-Fidelity Compression: Relative to the FP16 baseline, HAS-VQ achieves a 2.3x reduction in model size (7.03 BPP) while maintaining statistically indistinguishable perplexity (10.12 vs. 10.04), effectively offering a lossless compression alternative for bandwidth-constrained environments. The code is available at https://github.com/VladimerKhasia/HASVQ
[LINK]http://arxiv.org/abs/2601.06959v1
[DATE]2026-01-11 23:35:10+08:00
[CATEGORIES]cs.LG
Towards Operational Streamflow Forecasting in the Limpopo River Basin using Long Short-Term Memory Networks
[AUTHORS]James Tlhomole, Edoardo Borgomeo, Karthikeyan Matheswaran, Mariangel Garcia Andarcia
[ABSTRACT]Robust hydrological simulation is key for sustainable development, water management strategies, and climate change adaptation. In recent years, deep learning methods have been demonstrated to outperform mechanistic models at the task of hydrological discharge simulation. Adoption of these methods has been catalysed by the proliferation of large sample hydrology datasets, consisting of the observed discharge and meteorological drivers, along with geological and topographical catchment descriptors. Deep learning methods infer rainfall-runoff characteristics that have been shown to generalise across catchments, benefitting from the data diversity in large datasets. Despite this, application to catchments in Africa has been limited. The lack of adoption of deep learning methodologies is primarily due to sparsity or lack of the spatiotemporal observational data required to enable downstream model training. We therefore investigate the application of deep learning models, including LSTMs, for hydrological discharge simulation in the transboundary Limpopo River basin, emphasising application to data scarce regions. We conduct a number of computational experiments primarily focused on assessing the impact of varying the LSTM model input data on performance. Results confirm that data constraints remain the largest obstacle to deep learning applications across African river basins. We further outline the impact of human influence on data-driven modelling which is a commonly overlooked aspect of data-driven large-sample hydrology approaches and investigate solutions for model adaptation under smaller datasets. Additionally, we include recommendations for future efforts towards seasonal hydrological discharge prediction and direct comparison or inclusion of SWAT model outputs, as well as architectural improvements.
[COMMENTS]14 pages, 6 figures, 2 tables
[LINK]http://arxiv.org/abs/2601.06941v1
[DATE]2026-01-11 23:05:27+08:00
[CATEGORIES]cs.LG
Stochastic Control Methods for Optimization
[AUTHORS]Jinniao Qiu
[ABSTRACT]In this work, we investigate a stochastic control framework for global optimization over both Euclidean spaces and the Wasserstein space of probability measures, where the objective function may be non-convex and/or non-differentiable. In the Euclidean setting, the original minimization problem is approximated by a family of regularized stochastic control problems; using dynamic programming, we analyze the associated Hamilton–Jacobi–Bellman equations and obtain tractable representations via the Cole–Hopf transformation and the Feynman–Kac formula. For optimization over probability measures, we formulate a regularized mean-field control problem characterized by a master equation, and further approximate it by controlled $N$-particle systems. We establish that, as the regularization parameter tends to zero (and as the particle number tends to infinity for the optimization over probability measures), the value of the control problem converges to the global minimum of the original objective. Building on the resulting probabilistic representations, Monte Carlo-based numerical schemes are proposed and numerical experiments are reported to illustrate the effectiveness of the methods and to support the theoretical convergence rates.
[COMMENTS]32 pages, 5 figures
[LINK]http://arxiv.org/abs/2601.01248v2
[DATE]2026-01-11 23:01:29+08:00
[CATEGORIES]cs.LG
mind_call: A Dataset for Mental Health Function Calling with Large Language Models
[AUTHORS]Fozle Rabbi Shafi, M. Anwar Hossain, Salimur Choudhury
[ABSTRACT]Large Language Model (LLM)-based systems increasingly rely on function calling to enable structured and controllable interaction with external data sources, yet existing datasets do not address mental health-oriented access to wearable sensor data. This paper presents a synthetic function-calling dataset designed for mental health assistance grounded in wearable health signals such as sleep, physical activity, cardiovascular measures, stress indicators, and metabolic data. The dataset maps diverse natural language queries to standardized API calls derived from a widely adopted health data schema. Each sample includes a user query, a query category, an explicit reasoning step, a normalized temporal parameter, and a target function. The dataset covers explicit, implicit, behavioral, symptom-based, and metaphorical expressions, which reflect realistic mental health-related user interactions. This resource supports research on intent grounding, temporal reasoning, and reliable function invocation in LLM-based mental health agents and is publicly released to promote reproducibility and future work.
[LINK]http://arxiv.org/abs/2601.06937v1
[DATE]2026-01-11 22:52:57+08:00
[CATEGORIES]cs.LG
Inductive Graph Representation Learning with Quantum Graph Neural Networks
[AUTHORS]Arthur M. Faria, Ignacio F. Graña, Savvas Varsamopoulos
[ABSTRACT]Quantum Graph Neural Networks (QGNNs) offer a promising approach to combining quantum computing with graph-structured data processing. While classical Graph Neural Networks (GNNs) are scalable and robust, existing QGNNs often lack flexibility due to graph-specific quantum circuit designs, limiting their applicability to diverse real-world problems. To address this, we propose a versatile QGNN framework inspired by GraphSAGE, using quantum models as aggregators. We integrate inductive representation learning techniques with parameterized quantum convolutional and pooling layers, bridging classical and quantum paradigms. The convolutional layer is flexible, allowing tailored designs for specific tasks. Benchmarked on a node regression task with the QM9 dataset, our framework, using a single minimal circuit for all aggregation steps, handles molecules with varying numbers of atoms without changing qubits or circuit architecture. While classical GNNs achieve higher training performance, our quantum approach remains competitive and often shows stronger generalization as molecular complexity increases. We also observe faster learning in early training epochs. To mitigate trainability limitations of a single-circuit setup, we extend the framework with multiple quantum aggregators on QM9. Assigning distinct circuits to each hop substantially improves training performance across all cases. Additionally, we numerically demonstrate the absence of barren plateaus as qubit numbers increase, suggesting that the proposed model can scale to larger, more complex graph-based problems.
[COMMENTS]12 pages, 12 figures
[LINK]http://arxiv.org/abs/2503.24111v3
[DATE]2026-01-11 22:26:54+08:00
[CATEGORIES]cs.LG
When Less Is More: Binary Feedback Can Outperform Ordinal Comparisons in Ranking Recovery
[AUTHORS]Shirong Xu, Jingnan Zhang, Junhui Wang
[ABSTRACT]Paired comparison data, where users evaluate items in pairs, play a central role in ranking and preference learning tasks. While ordinal comparison data intuitively offer richer information than binary comparisons, this paper challenges that conventional wisdom. We propose a general parametric framework for modeling ordinal paired comparisons without ties. The model adopts a generalized additive structure, featuring a link function that quantifies the preference difference between two items and a pattern function that governs the distribution over ordinal response levels. This framework encompasses classical binary comparison models as special cases, by treating binary responses as binarized versions of ordinal data. Within this framework, we show that binarizing ordinal data can significantly improve the accuracy of ranking recovery. Specifically, we prove that under the counting algorithm, the ranking error associated with binary comparisons exhibits a faster exponential convergence rate than that of ordinal data. Furthermore, we characterize a substantial performance gap between binary and ordinal data in terms of a signal-to-noise ratio (SNR) determined by the pattern function. We identify the pattern function that minimizes the SNR and maximizes the benefit of binarization. Extensive simulations and a real application on the MovieLens dataset further corroborate our theoretical findings.
[LINK]http://arxiv.org/abs/2507.01613v4
[DATE]2026-01-11 22:26:19+08:00
[CATEGORIES]cs.LG
Active Learning Strategies for Efficient Machine-Learned Interatomic Potentials Across Diverse Material Systems
[AUTHORS]Mohammed Azeez Khan, Aaron D’Souza, Vijay Choyal
[ABSTRACT]Efficient discovery of new materials demands strategies to reduce the number of costly first-principles calculations required to train predictive machine learning models. We develop and validate an active learning framework that iteratively selects informative training structures for machine-learned interatomic potentials (MLIPs) from large, heterogeneous materials databases, specifically the Materials Project and OQMD. Our framework integrates compositional and property-based descriptors with a neural network ensemble model, enabling real-time uncertainty quantification via Query-by-Committee. We systematically compare four selection strategies: random sampling (baseline), uncertainty-based sampling, diversity-based sampling (k-means clustering with farthest-point refinement), and a hybrid approach balancing both objectives. Experiments across four representative material systems (elemental carbon, silicon, iron, and a titanium-oxide compound) with 5 random seeds per configuration demonstrate that diversity sampling consistently achieves competitive or superior performance, with particularly strong advantages on complex systems like titanium-oxide (10.9% improvement, p=0.008). Our results show that intelligent data selection strategies can achieve target accuracy with 5-13% fewer labeled samples compared to random baselines. The entire pipeline executes on Google Colab in under 4 hours per system using less than 8 GB of RAM, thereby democratizing MLIP development for researchers globally with limited computational resources. Our open-source code and detailed experimental configurations are available on GitHub. This multi-system evaluation establishes practical guidelines for data-efficient MLIP training and highlights promising future directions including integration with symmetry-aware neural network architectures.
[COMMENTS]16 pages, 3 figures, 2 tables
[LINK]http://arxiv.org/abs/2601.06916v1
[DATE]2026-01-11 21:52:28+08:00
[CATEGORIES]cs.LG
Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities
[AUTHORS]Taehyun Hwang, Dahngoon Kim, Min-hwan Oh
[COMMENTS]Accepted at NeurIPS 2025
[LINK]http://arxiv.org/abs/2601.06913v1
[DATE]2026-01-11 21:43:42+08:00
[CATEGORIES]cs.LG
E^2-LLM: Bridging Neural Signals and Interpretable Affective Analysis
[AUTHORS]Fei Ma, Han Lin, Yifan Xie, Hongwei Ren, Xiaoyu Shen, Wenbo Ding, Qi Tian
[ABSTRACT]Emotion recognition from electroencephalography (EEG) signals remains challenging due to high inter-subject variability, limited labeled data, and the lack of interpretable reasoning in existing approaches. While recent multimodal large language models (MLLMs) have advanced emotion analysis, they have not been adapted to handle the unique spatiotemporal characteristics of neural signals. We present E^2-LLM (EEG-to-Emotion Large Language Model), the first MLLM framework for interpretable emotion analysis from EEG. E^2-LLM integrates a pretrained EEG encoder with Qwen-based LLMs through learnable projection layers, employing a multi-stage training pipeline that encompasses emotion-discriminative pretraining, cross-modal alignment, and instruction tuning with chain-of-thought reasoning. We design a comprehensive evaluation protocol covering basic emotion prediction, multi-task reasoning, and zero-shot scenario understanding. Experiments on the dataset across seven emotion categories demonstrate that E^2-LLM achieves excellent performance on emotion classification, with larger variants showing enhanced reliability and superior zero-shot generalization to complex reasoning scenarios. Our work establishes a new paradigm combining physiological signals with LLM reasoning capabilities, showing that model scaling improves both recognition accuracy and interpretable emotional understanding in affective computing.
[COMMENTS]11 pages
[LINK]http://arxiv.org/abs/2601.07877v1
[DATE]2026-01-11 21:21:20+08:00
[CATEGORIES]cs.LG
Two-Player Zero-Sum Games with Bandit Feedback
[AUTHORS]Elif Yılmaz, Christos Dimitrakakis
[ABSTRACT]We study a two-player zero-sum game in which the row player aims to maximize their payoff against an adversarial column player, under an unknown payoff matrix estimated through bandit feedback. We propose three algorithms based on the Explore-Then-Commit (ETC) framework. The first adapts it to zero-sum games, the second incorporates adaptive elimination that leverages the $\varepsilon$-Nash Equilibrium property to efficiently select the optimal action pair, and the third extends the elimination algorithm by employing non-uniform exploration. Our objective is to demonstrate the applicability of ETC in a zero-sum game setting by focusing on learning pure strategy Nash Equilibria. A key contribution of our work is a derivation of instance-dependent upper bounds on the expected regret of our proposed algorithms, which has received limited attention in the literature on zero-sum games. Particularly, after $T$ rounds, we achieve an instance-dependent regret upper bounds of $O(Δ+ \sqrt{T})$ for ETC in zero-sum game setting and $O(\log (T Δ^2)/Δ)$ for the adaptive elimination algorithm and its variant with non-uniform exploration, where $Δ$ denotes the suboptimality gap. Therefore, our results indicate that ETC-based algorithms perform effectively in zero-sum game settings, achieving regret bounds comparable to existing methods while providing insight through instance-dependent analysis.
[COMMENTS]22 pages
[LINK]http://arxiv.org/abs/2506.14518v3
[DATE]2026-01-11 21:14:05+08:00
[CATEGORIES]cs.LG
NOVAK: Unified adaptive optimizer for deep neural networks
[AUTHORS]Sergii Kavun
[ABSTRACT]This work introduces NOVAK, a modular gradient-based optimization algorithm that integrates adaptive moment estimation, rectified learning-rate scheduling, decoupled weight regularization, multiple variants of Nesterov momentum, and lookahead synchronization into a unified, performance-oriented framework. NOVAK adopts a dual-mode architecture consisting of a streamlined fast path designed for production. The optimizer employs custom CUDA kernels that deliver substantial speedups (3-5 for critical operations) while preserving numerical stability under standard stochastic-optimization assumptions. We provide fully developed mathematical formulations for rectified adaptive learning rates, a memory-efficient lookahead mechanism that reduces overhead from O(2p) to O(p + p/k), and the synergistic coupling of complementary optimization components. Theoretical analysis establishes convergence guarantees and elucidates the stability and variance-reduction properties of the method. Extensive empirical evaluation on CIFAR-10, CIFAR-100, ImageNet, and ImageNette demonstrates NOVAK superiority over 14 contemporary optimizers, including Adam, AdamW, RAdam, Lion, and Adan. Across architectures such as ResNet-50, VGG-16, and ViT, NOVAK consistently achieves state-of-the-art accuracy, and exceptional robustness, attaining very high accuracy on VGG-16/ImageNette demonstrating superior architectural robustness compared to contemporary optimizers. The results highlight that NOVAKs architectural contributions (particularly rectification, decoupled decay, and hybrid momentum) are crucial for reliable training of deep plain networks lacking skip connections, addressing a long-standing limitation of existing adaptive optimization methods.
[COMMENTS]77 pages, 14 figures, 7 tables
[LINK]http://arxiv.org/abs/2601.07876v1
[DATE]2026-01-11 21:03:57+08:00
[CATEGORIES]cs.LG
Fed-SE: Federated Self-Evolution for Privacy-Constrained Multi-Environment LLM Agents
[AUTHORS]Xiang Chen, Yuling Shi, Qizhen Lan, Yuchao Qiu, Min Wang, Xiaodong Gu, Yanfu Yan
[ABSTRACT]LLM agents are widely deployed in complex interactive tasks, yet privacy constraints often preclude centralized optimization and co-evolution across dynamic environments. Despite the demonstrated success of Federated Learning (FL) on static datasets, its effectiveness in open-ended, self-evolving agent systems remains largely unexplored. In such settings, the direct application of standard FL is particularly challenging, as heterogeneous tasks and sparse, trajectory-level reward signals give rise to severe gradient instability, which undermines the global optimization process. To bridge this gap, we propose Fed-SE, a Federated Self-Evolution framework for LLM agents that establishes a local evolution-global aggregation paradigm. Locally, agents employ parameter-efficient fine-tuning on filtered, high-return trajectories to achieve stable gradient updates. Globally, Fed-SE aggregates updates within a low-rank subspace, reducing communication cost across clients. Experiments across five heterogeneous environments demonstrate that Fed-SE improves average task success rates by 10\% over the state-of-the-art FedIT, validating its effectiveness in cross-environment knowledge transfer under privacy constraints.
[LINK]http://arxiv.org/abs/2512.08870v2
[DATE]2026-01-11 20:46:13+08:00
[CATEGORIES]cs.LG
Terrain Diffusion: A Diffusion-Based Successor to Perlin Noise in Infinite, Real-Time Terrain Generation
[AUTHORS]Alexander Goslin
[ABSTRACT]For decades, procedural worlds have been built on procedural noise functions such as Perlin noise, which are fast and infinite, yet fundamentally limited in realism and large-scale coherence. We introduce Terrain Diffusion, a generative framework that bridges the fidelity of diffusion models with the properties that made procedural noise indispensable: seamless infinite extent, seed-consistency, and constant-time random access. At its core is InfiniteDiffusion, a novel algorithm for infinite generation that reformulates standard diffusion sampling for unbounded domains. While noise functions remain near-instant, our framework outpaces orbital velocity by 9 times on a consumer GPU, enabling realistic terrain generation at interactive rates. We integrate a hierarchical stack of diffusion models to couple planetary context with local detail, a compact Laplacian encoding to stabilize outputs across Earth-scale dynamic ranges, and an open-source infinite-tensor framework for constant-memory manipulation of unbounded tensors. Together, these components position diffusion models as a practical, scalable foundation for the next generation of infinite virtual worlds.
[COMMENTS]Project website: https://xandergos.github.io/terrain-diffusion/ Code: https://github.com/xandergos/terrain-diffusion/
[LINK]http://arxiv.org/abs/2512.08309v3
[DATE]2026-01-11 20:34:21+08:00
[CATEGORIES]cs.LG
Shape-Aware Topological Representation for Pipeline Hyperbola Detection in GPR Data
[AUTHORS]Meiyan Kang, Shizuo Kaji, Sang-Yun Lee, Taegon Kim, Hee-Hwan Ryu, Suyoung Choi
[ABSTRACT]Ground Penetrating Radar (GPR) is a widely used Non-Destructive Testing (NDT) technique for subsurface exploration, particularly in infrastructure inspection and maintenance. However, conventional interpretation methods are often limited by noise sensitivity and a lack of structural awareness. This study presents a novel framework that enhances the detection of underground utilities, especially pipelines, by integrating shape-aware topological features derived from B-scan GPR images using Topological Data Analysis (TDA), with the spatial detection capabilities of the YOLOv5 deep neural network (DNN). We propose a novel shape-aware topological representation that amplifies structural features in the input data, thereby improving the model’s responsiveness to the geometrical features of buried objects. To address the scarcity of annotated real-world data, we employ a Sim2Real strategy that generates diverse and realistic synthetic datasets, effectively bridging the gap between simulated and real-world domains. Experimental results demonstrate significant improvements in mean Average Precision (mAP), validating the robustness and efficacy of our approach. This approach underscores the potential of TDA-enhanced learning in achieving reliable, real-time subsurface object detection, with broad applications in urban planning, safety inspection, and infrastructure management.
[COMMENTS]15 pages, 6 figures
[LINK]http://arxiv.org/abs/2506.06311v3
[DATE]2026-01-11 20:26:12+08:00
[CATEGORIES]cs.LG
MODE: Efficient Time Series Prediction with Mamba Enhanced by Low-Rank Neural ODEs
[AUTHORS]Xingsheng Chen, Regina Zhang, Bo Gao, Xingwei He, Xiaofeng Liu, Pietro Lio, Kwok-Yan Lam, Siu-Ming Yiu
[ABSTRACT]Time series prediction plays a pivotal role across diverse domains such as finance, healthcare, energy systems, and environmental modeling. However, existing approaches often struggle to balance efficiency, scalability, and accuracy, particularly when handling long-range dependencies and irregularly sampled data. To address these challenges, we propose MODE, a unified framework that integrates Low-Rank Neural Ordinary Differential Equations (Neural ODEs) with an Enhanced Mamba architecture. As illustrated in our framework, the input sequence is first transformed by a Linear Tokenization Layer and then processed through multiple Mamba Encoder blocks, each equipped with an Enhanced Mamba Layer that employs Causal Convolution, SiLU activation, and a Low-Rank Neural ODE enhancement to efficiently capture temporal dynamics. This low-rank formulation reduces computational overhead while maintaining expressive power. Furthermore, a segmented selective scanning mechanism, inspired by pseudo-ODE dynamics, adaptively focuses on salient subsequences to improve scalability and long-range sequence modeling. Extensive experiments on benchmark datasets demonstrate that MODE surpasses existing baselines in both predictive accuracy and computational efficiency. Overall, our contributions include: (1) a unified and efficient architecture for long-term time series modeling, (2) integration of Mamba’s selective scanning with low-rank Neural ODEs for enhanced temporal representation, and (3) substantial improvements in efficiency and scalability enabled by low-rank approximation and dynamic selective scanning.
[COMMENTS]12 pages, 6 figures, and 3 tables. Updated description and explanations, and correct some typos
[LINK]http://arxiv.org/abs/2601.00920v2
[DATE]2026-01-11 19:54:19+08:00
[CATEGORIES]cs.LG
Applying Embedding-Based Retrieval to Airbnb Search
[AUTHORS]Mustafa Abdool, Soumyadip Banerjee, Moutupsi Paul, Do-kyum Kim, Xioawei Liu, Bin Xu, Tracy Yu, Hui Gao, Karen Ouyang, Huiji Gao, Liwei He, Stephanie Moyerman, Sanjeev Katariya
[ABSTRACT]The goal of Airbnb search is to match guests with the ideal accommodation that fits their travel needs. This is a challenging problem, as popular search locations can have around a hundred thousand available homes, and guests themselves have a wide variety of preferences. Furthermore, the launch of new product features, such as \textit{flexible date search,} significantly increased the number of eligible homes per search query. As such, there is a need for a sophisticated retrieval system which can provide high-quality candidates with low latency in a way that integrates with the overall ranking stack.
This paper details our journey to build an efficient and high-quality retrieval system for Airbnb search. We describe the key unique challenges we encountered when implementing an Embedding-Based Retrieval (EBR) system for a two sided marketplace like Airbnb – such as the dynamic nature of the inventory, a lengthy user funnel with multiple stages, and a variety of product surfaces. We cover unique insights when modeling the retrieval problem, how to build robust evaluation systems, and design choices for online serving. The EBR system was launched to production and powers several use-cases such as regular search, flexible date and promotional emails for marketing campaigns. The system demonstrated statistically-significant improvements in key metrics, such as booking conversion, via A/B testing.
[COMMENTS]14 pages, 9 figures
[LINK]http://arxiv.org/abs/2601.06873v1
[DATE]2026-01-11 19:41:55+08:00
[CATEGORIES]cs.LG
Parallel Algorithms for Structured Sparse Support Vector Machines: Application in Music Genre Classification
[AUTHORS]Rongmei Liang, Zizheng Liu, Xiaofei Wu, Jingwen Tu
[ABSTRACT]Mathematical modelling, particularly through approaches such as structured sparse support vector machines (SS-SVM), plays a crucial role in processing data with complex feature structures, yet efficient algorithms for distributed large-scale data remain lacking. To address this gap, this paper proposes a unified optimization framework based on a consensus structure. This framework is not only applicable to various loss functions and combined regularization terms but can also be effectively extended to non-convex regularizers, demonstrating strong scalability. Building upon this framework, we develop a distributed parallel alternating direction method of multipliers (ADMM) algorithm to efficiently solve SS-SVMs under distributed data storage. To ensure convergence, we incorporate a Gaussian back-substitution technique. Additionally, for completeness, we introduce a family of sparse group Lasso support vector machine (SGL-SVM) and apply it to music information retrieval. Theoretical analysis confirms that the computational complexity of the proposed algorithm is independent of the choice of regularization terms and loss functions, underscoring the universality of the parallel approach. Experiments on both synthetic and real-world music archive datasets validate the reliability, stability, and efficiency of our algorithm.
[LINK]http://arxiv.org/abs/2512.07463v2
[DATE]2026-01-11 19:31:50+08:00
[CATEGORIES]cs.LG
DaQ-MSA: Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis
[AUTHORS]Jiazhang Liang, Jianheng Dai, Miaosen Luo, Menghua Jiang, Sijie Mai
[ABSTRACT]Multimodal large language models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their effectiveness on multimodal sentiment analysis remains constrained by the scarcity of high-quality training data, which limits accurate multimodal understanding and generalization. To alleviate this bottleneck, we leverage diffusion models to perform semantics-preserving augmentation on the video and audio modalities, expanding the multimodal training distribution. However, increasing data quantity alone is insufficient, as diffusion-generated samples exhibit substantial quality variation and noisy augmentations may degrade performance. We therefore propose DaQ-MSA (Denoising and Qualifying Diffusion Augmentations for Multimodal Sentiment Analysis), which introduces a quality scoring module to evaluate the reliability of augmented samples and assign adaptive training weights. By down-weighting low-quality samples and emphasizing high-fidelity ones, DaQ-MSA enables more stable learning. By integrating the generative capability of diffusion models with the semantic understanding of MLLMs, our approach provides a robust and generalizable automated augmentation strategy for training MLLMs without any human annotation or additional supervision.
[COMMENTS]11 pages, 4 figures
[LINK]http://arxiv.org/abs/2601.06870v1
[DATE]2026-01-11 19:29:21+08:00
[CATEGORIES]cs.LG
U-MASK: User-adaptive Spatio-Temporal Masking for Personalized Mobile AI Applications
[AUTHORS]Shiyuan Zhang, Yilai Liu, Yuwei Du, Ruoxuan Yang, Dong In Kim, Hongyang Du
[ABSTRACT]Personalized mobile artificial intelligence applications are widely deployed, yet they are expected to infer user behavior from sparse and irregular histories under a continuously evolving spatio-temporal context. This setting induces a fundamental tension among three requirements, i.e., immediacy to adapt to recent behavior, stability to resist transient noise, and generalization to support long-horizon prediction and cold-start users. Most existing approaches satisfy at most two of these requirements, resulting in an inherent impossibility triangle in data-scarce, non-stationary personalization. To address this challenge, we model mobile behavior as a partially observed spatio-temporal tensor and unify short-term adaptation, long-horizon forecasting, and cold-start recommendation as a conditional completion problem, where a user- and task-specific mask specifies which coordinates are treated as evidence. We propose U-MASK, a user-adaptive spatio-temporal masking method that allocates evidence budgets based on user reliability and task sensitivity. To enable mask generation under sparse observations, U-MASK learns a compact, task-agnostic user representation from app and location histories via U-SCOPE, which serves as the sole semantic conditioning signal. A shared diffusion transformer then performs mask-guided generative completion while preserving observed evidence, so personalization and task differentiation are governed entirely by the mask and the user representation. Experiments on real-world mobile datasets demonstrate consistent improvements over state-of-the-art methods across short-term prediction, long-horizon forecasting, and cold-start settings, with the largest gains under severe data sparsity. The code and dataset will be available at https://github.com/NICE-HKU/U-MASK.
[COMMENTS]18 pages, 9 figures
[LINK]http://arxiv.org/abs/2601.06867v1
[DATE]2026-01-11 19:22:25+08:00
[CATEGORIES]cs.LG
Variational decomposition autoencoding improves disentanglement of latent representations
[AUTHORS]Ioannis Ziogas, Aamna Al Shehhi, Ahsan H. Khandoker, Leontios J. Hadjileontiadis
[ABSTRACT]Understanding the structure of complex, nonstationary, high-dimensional time-evolving signals is a central challenge in scientific data analysis. In many domains, such as speech and biomedical signal processing, the ability to learn disentangled and interpretable representations is critical for uncovering latent generative mechanisms. Traditional approaches to unsupervised representation learning, including variational autoencoders (VAEs), often struggle to capture the temporal and spectral diversity inherent in such data. Here we introduce variational decomposition autoencoding (VDA), a framework that extends VAEs by incorporating a strong structural bias toward signal decomposition. VDA is instantiated through variational decomposition autoencoders (DecVAEs), i.e., encoder-only neural networks that combine a signal decomposition model, a contrastive self-supervised task, and variational prior approximation to learn multiple latent subspaces aligned with time-frequency characteristics. We demonstrate the effectiveness of DecVAEs on simulated data and three publicly available scientific datasets, spanning speech recognition, dysarthria severity evaluation, and emotional speech classification. Our results demonstrate that DecVAEs surpass state-of-the-art VAE-based methods in terms of disentanglement quality, generalization across tasks, and the interpretability of latent encodings. These findings suggest that decomposition-aware architectures can serve as robust tools for extracting structured representations from dynamic signals, with potential applications in clinical diagnostics, human-computer interaction, and adaptive neurotechnologies.
[COMMENTS]Supplementary information file at: https://drive.google.com/drive/folders/1sZl2AcCtRK-1oav7XZSaxlu0Cq0-3MMs?usp=sharing
[LINK]http://arxiv.org/abs/2601.06844v1
[DATE]2026-01-11 18:16:34+08:00
[CATEGORIES]cs.LG
Constrained Density Estimation via Optimal Transport
[AUTHORS]Yinan Hu, Estaban Tabak
[ABSTRACT]A novel framework for density estimation under expectation constraints is proposed. The framework minimizes the Wasserstein distance between the estimated density and a prior, subject to the constraints that the expected value of a set of functions adopts or exceeds given values. The framework is generalized to include regularization inequalities to mitigate the artifacts in the target measure. An annealing-like algorithm is developed to address non-smooth constraints, with its effectiveness demonstrated through both synthetic and proof-of-concept real world examples in finance.
[LINK]http://arxiv.org/abs/2601.06830v1
[DATE]2026-01-11 17:44:04+08:00
[CATEGORIES]cs.LG
Analyzing the effect of prediction accuracy on the distributionally-robust competitive ratio
[AUTHORS]Toru Yoshinaga, Yasushi Kawase
[ABSTRACT]The field of algorithms with predictions aims to improve algorithm performance by integrating machine learning predictions into algorithm design. A central question in this area is how predictions can improve performance, and a key aspect of this analysis is the role of prediction accuracy. In this context, prediction accuracy is defined as a guaranteed probability that an instance drawn from the distribution belongs to the predicted set. As a performance measure that incorporates prediction accuracy, we focus on the distributionally-robust competitive ratio (DRCR), introduced by Sun et al.~(ICML 2024). The DRCR is defined as the expected ratio between the algorithm’s cost and the optimal cost, where the expectation is taken over the worst-case instance distribution that satisfies the given prediction and accuracy requirement. A known structural property is that, for any fixed algorithm, the DRCR decreases linearly as prediction accuracy increases. Building on this result, we establish that the optimal DRCR value (i.e., the infimum over all algorithms) is a monotone and concave function of prediction accuracy. We further generalize the DRCR framework to a multiple-prediction setting and show that monotonicity and concavity are preserved in this setting. Finally, we apply our results to the ski rental problem, a benchmark problem in online optimization, to identify the conditions on prediction accuracies required for the optimal DRCR to attain a target value. Moreover, we provide a method for computing the critical accuracy, defined as the minimum accuracy required for the optimal DRCR to strictly improve upon the performance attainable without any accuracy guarantee.
[LINK]http://arxiv.org/abs/2601.06813v1
[DATE]2026-01-11 16:51:40+08:00
[CATEGORIES]cs.LG
WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport
[AUTHORS]Qiangwei Peng, Zihan Wang, Junda Ying, Yuhao Sun, Qing Nie, Lei Zhang, Tiejun Li, Peijie Zhou
[ABSTRACT]The Wasserstein-Fisher-Rao (WFR) metric extends dynamic optimal transport (OT) by coupling displacement with change of mass, providing a principled geometry for modeling unbalanced snapshot dynamics. Existing WFR solvers, however, are often unstable, computationally expensive, and difficult to scale. Here we introduce WFR Flow Matching (WFR-FM), a simulation-free training algorithm that unifies flow matching with dynamic unbalanced OT. Unlike classical flow matching which regresses only a transport vector field, WFR-FM simultaneously regresses a vector field for displacement and a scalar growth rate function for birth-death dynamics, yielding continuous flows under the WFR geometry. Theoretically, we show that minimizing the WFR-FM loss exactly recovers WFR geodesics. Empirically, WFR-FM yields more accurate and robust trajectory inference in single-cell biology, reconstructing consistent dynamics with proliferation and apoptosis, estimating time-varying growth fields, and applying to generative dynamics under imbalanced data. It outperforms state-of-the-art baselines in efficiency, stability, and reconstruction accuracy. Overall, WFR-FM establishes a unified and efficient paradigm for learning dynamical systems from unbalanced snapshots, where not only states but also mass evolve over time.
[LINK]http://arxiv.org/abs/2601.06810v1
[DATE]2026-01-11 16:48:08+08:00
[CATEGORIES]cs.LG
Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy
[AUTHORS]Shujian Gao, Yuan Wang, Jiangtao Yan, Zuxuan Wu, Yu-Gang Jiang
[ABSTRACT]Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced reasoning capabilities in Large Language Models. However, adapting RLVR to multimodal domains suffers from a critical \textit{perception-reasoning decoupling}. Existing paradigms, driven by text-centric outcome rewards, reasoning in language medium, inadvertently encourage models to bypass visual perception. We empirically validate this through blind experiments: state-of-the-art policies maintain or surprisingly improve performance even when visual inputs are entirely removed. This reveals that these models degenerate into \textit{blind reasoners}, exploiting linguistic priors to generate plausible answers instead of attending to visual evidence. In response, we propose \textbf{Thinking with Deltas}, a framework driven by a \textbf{Differential Visual Reasoning Policy (DVRP)}. DVRP introduces intrinsic supervision via visual triplets, comprising original, masked, and perturbed inputs. It optimizes the model to maximize reasoning divergence from masked inputs (enforcing \textit{visual sensitivity}) while minimizing divergence from perturbed inputs (ensuring \textit{visual robustness}). By aligning reasoning variations strictly with the \textit{Delta} of visual information, DVRP inherently bolsters visual understanding capabilities and significantly outperforms state-of-the-art methods on both general and medical benchmarks, without requiring external annotations or auxiliary tools.
[COMMENTS]24 pages, 10 tables, 4 figures
[LINK]http://arxiv.org/abs/2601.06801v1
[DATE]2026-01-11 16:25:34+08:00
[CATEGORIES]cs.LG
Graph Neural Network with One-side Edge Sampling for Fraud Detection
[AUTHORS]Hoang Hiep Trieu
[ABSTRACT]Financial fraud is always a major problem in the field of finance, as it can cause significant consequences. As a result, many approaches have been designed to detect it, and lately Graph Neural Networks (GNNs) have been demonstrated as a competent candidate. However, when trained with a large amount of data, they are slow and computationally demanding. In addition, GNNs may need a deep architecture to detect complex fraud patterns, but doing so may make them suffer from problems such as over-fitting or over-smoothing. Over-fitting leads to reduced generalisation of the model on unseen data, while over-smoothing causes all nodes’ features to converge to a fixed point due to excessive aggregation of information from neighbouring nodes. In this research, I propose an approach called One-Side Edge Sampling (OES) that can potentially reduce training duration as well as the effects of over-smoothing and over-fitting. The approach leverages predictive confidence in an edge classification task to sample edges from the input graph during a certain number of epochs. To explain why OES can alleviate over-smoothing, I perform a theoretical analysis of the proposed approach. In addition, to validate the effect of OES, I conduct experiments using different GNNs on two datasets. The results show that OES can empirically outperform backbone models in both shallow and deep architectures while also reducing training time.
[LINK]http://arxiv.org/abs/2601.06800v1
[DATE]2026-01-11 15:59:39+08:00
[CATEGORIES]cs.LG
CliffordNet: All You Need is Geometric Algebra
[AUTHORS]Zhongping Ji
[ABSTRACT]Modern computer vision architectures, from CNNs to Transformers, predominantly rely on the stacking of heuristic modules: spatial mixers (Attention/Conv) followed by channel mixers (FFNs). In this work, we challenge this paradigm by returning to mathematical first principles. We propose the \textbf{Clifford Algebra Network (CAN)}, also referred to as CliffordNet, a vision backbone grounded purely in Geometric Algebra. Instead of engineering separate modules for mixing and memory, we derive a unified interaction mechanism based on the \textbf{Clifford Geometric Product} ($uv = u \cdot v + u \wedge v$). This operation ensures algebraic completeness regarding the Geometric Product by simultaneously capturing feature coherence (via the generalized inner product) and structural variation (via the exterior wedge product).
Implemented via an efficient sparse rolling mechanism with \textbf{strict linear complexity $\mathcal{O}(N)$}, our model reveals a surprising emergent property: the geometric interaction is so representationally dense that standard Feed-Forward Networks (FFNs) become redundant. Empirically, CliffordNet establishes a new Pareto frontier: our \textbf{Nano} variant achieves \textbf{76.41\%} accuracy on CIFAR-100 with only \textbf{1.4M} parameters, effectively matching the heavy-weight ResNet-18 (11.2M) with \textbf{$8\times$ fewer parameters}, while our \textbf{Base} variant sets a new SOTA for tiny models at \textbf{78.05\%}. Our results suggest that global understanding can emerge solely from rigorous, algebraically complete local interactions, potentially signaling a shift where \textit{geometry is all you need}. Code is available at https://github.com/ParaMind2025/CAN.
[COMMENTS]15 pages
[LINK]http://arxiv.org/abs/2601.06793v1
[DATE]2026-01-11 15:26:02+08:00
[CATEGORIES]cs.LG
Cross-Modal Computational Model of Brain-Heart Interactions via HRV and EEG Feature
[AUTHORS]Malavika Pradeep, Akshay Sasi, Nusaibah Farrukh, Rahul Venugopal, Elizabeth Sherly
[ABSTRACT]The electroencephalogram (EEG) has been the gold standard for quantifying mental workload; however, due to its complexity and non-portability, it can be constraining. ECG signals, which are feasible on wearable equipment pieces such as headbands, present a promising method for cognitive state monitoring. This research explores whether electrocardiogram (ECG) signals are able to indicate mental workload consistently and act as surrogates for EEG-based cognitive indicators. This study investigates whether ECG-derived features can serve as surrogate indicators of cognitive load, a concept traditionally quantified using EEG. Using a publicly available multimodal dataset (OpenNeuro) of EEG and ECG recorded during working-memory and listening tasks, features of HRV and Catch22 descriptors are extracted from ECG, and spectral band-power with Catch22 features from EEG. A cross-modal regression framework based on XGBoost was trained to map ECG-derived HRV representations to EEG-derived cognitive features. In order to address data sparsity and model brain-heart interactions, we integrated the PSV-SDG to produce EEG-conditioned synthetic HRV time series.This addresses the challenge of inferring cognitive load solely from ECG-derived features using a combination of multimodal learning, signal processing, and synthetic data generation. These outcomes form a basis for light, interpretable machine learning models that are implemented through wearable biosensors in non-lab environments. Synthetic HRV inclusion enhances robustness, particularly in sparse data situations. Overall, this work is an initiation for building low-cost, explainable, and real-time cognitive monitoring systems for mental health, education, and human-computer interaction, with a focus on ageing and clinical populations.
[COMMENTS]6 pages, 2 figures, Code available at: https://github.com/Malavika-pradeep/Computational-model-for-Brain-heart-Interaction-Analysis. Presented at AIHC (not published)
[LINK]http://arxiv.org/abs/2601.06792v1
[DATE]2026-01-11 15:20:30+08:00
[CATEGORIES]cs.LG
Artificial Entanglement in the Fine-Tuning of Large Language Models
[AUTHORS]Min Chen, Zihan Wang, Canyu Chen, Zeguan Wu, Manling Li, Junyu Liu
[ABSTRACT]Large language models (LLMs) can be adapted to new tasks using parameter-efficient fine-tuning (PEFT) methods that modify only a small number of trainable parameters, often through low-rank updates. In this work, we adopt a quantum-information-inspired perspective to understand their effectiveness. From this perspective, low-rank parameterizations naturally correspond to low-dimensional Matrix Product States (MPS) representations, which enable entanglement-based characterizations of parameter structure. Thereby, we term and measure “Artificial Entanglement”, defined as the entanglement entropy of the parameters in artificial neural networks (in particular the LLMs). We first study the representative low-rank adaptation (LoRA) PEFT method, alongside full fine-tuning (FFT), using LLaMA models at the 1B and 8B scales trained on the Tulu3 and OpenThoughts3 datasets, and uncover: (i) Internal artificial entanglement in the updates of query and value projection matrices in LoRA follows a volume law with a central suppression (termed as the “Entanglement Valley”), which is sensitive to hyper-parameters and is distinct from that in FFT; (ii) External artificial entanglement in attention matrices, corresponding to token-token correlations in representation space, follows an area law with logarithmic corrections and remains robust to LoRA hyper-parameters and training steps. Drawing a parallel to the No-Hair Theorem in black hole physics, we propose that although LoRA and FFT induce distinct internal entanglement signatures, such differences do not manifest in the attention outputs, suggesting a “no-hair” property that results in the effectiveness of low rank updates. We further provide theoretical support based on random matrix theory, and extend our analysis to an MPS Adaptation PEFT method, which exhibits qualitatively similar behaviors.
[COMMENTS]41 pages, many figures
[LINK]http://arxiv.org/abs/2601.06788v1
[DATE]2026-01-11 14:34:50+08:00
[CATEGORIES]cs.LG
Dimension-reduced outcome-weighted learning for estimating individualized treatment regimes in observational studies
[AUTHORS]Sungtaek Son, Eardi Lila, Kwun Chuen Gary Chan
[ABSTRACT]Individualized treatment regimes (ITRs) aim to improve clinical outcomes by assigning treatment based on patient-specific characteristics. However, existing methods often struggle with high-dimensional covariates, limiting accuracy, interpretability, and real-world applicability. We propose a novel sufficient dimension reduction approach that directly targets the contrast between potential outcomes and identifies a low-dimensional subspace of the covariates capturing treatment effect heterogeneity. This reduced representation enables more accurate estimation of optimal ITRs through outcome-weighted learning. To accommodate observational data, our method incorporates kernel-based covariate balancing, allowing treatment assignment to depend on the full covariate set and avoiding the restrictive assumption that the subspace sufficient for modeling heterogeneous treatment effects is also sufficient for confounding adjustment. We show that the proposed method achieves universal consistency, i.e., its risk converges to the Bayes risk, under mild regularity conditions. We demonstrate its finite sample performance through simulations and an analysis of intensive care unit sepsis patient data to determine who should receive transthoracic echocardiography.
[COMMENTS]54 pages, 9 figures
[LINK]http://arxiv.org/abs/2601.06782v1
[DATE]2026-01-11 13:38:08+08:00
[CATEGORIES]cs.LG
From Sublinear to Linear: Fast Convergence in Deep Networks via Locally Polyak-Lojasiewicz Regions
[AUTHORS]Agnideep Aich, Ashit Baran Aich, Bruce Wade
[ABSTRACT]Gradient descent (GD) on deep neural network loss landscapes is non-convex, yet often converges far faster in practice than classical guarantees suggest. Prior work shows that within locally quasi-convex regions (LQCRs), GD converges to stationary points at sublinear rates, leaving the commonly observed near-exponential training dynamics unexplained. We show that, under a mild local Neural Tangent Kernel (NTK) stability assumption, the loss satisfies a PL-type error bound within these regions, yielding a Locally Polyak-Lojasiewicz Region (LPLR) in which the squared gradient norm controls the suboptimality gap. For properly initialized finite-width networks, we show that under local NTK stability this PL-type mechanism holds around initialization and establish linear convergence of GD as long as the iterates remain within the resulting LPLR. Empirically, we observe PL-like scaling and linear-rate loss decay in controlled full-batch training and in a ResNet-style CNN trained with mini-batch SGD on a CIFAR-10 subset, indicating that LPLR signatures can persist under modern architectures and stochastic optimization. Overall, the results connect local geometric structure, local NTK stability, and fast optimization rates in a finite-width setting.
[LINK]http://arxiv.org/abs/2507.21429v2
[DATE]2026-01-11 12:37:36+08:00
[CATEGORIES]cs.LG
Structure-preserving learning and prediction in optimal control of collective motion
[AUTHORS]Sofiia Huraka, Vakhtang Putkaradze
[ABSTRACT]Wide-spread adoption of unmanned vehicle technologies requires the ability to predict the motion of the combined vehicle operation from observations. While the general prediction of such motion for an arbitrary control mechanism is difficult, for a particular choice of control, the dynamics reduces to the Lie-Poisson equations [33,34]. Our goal is to learn the phase-space dynamics and predict the motion solely from observations, without any knowledge of the control Hamiltonian or the nature of interaction between vehicles. To achieve that goal, we propose the Control Optimal Lie-Poisson Neural Networks (CO-LPNets) for learning and predicting the dynamics of the system from data. Our methods learn the mapping of the phase space through the composition of Poisson maps, which are obtained as flows from Hamiltonians that could be integrated explicitly. CO-LPNets preserve the Poisson bracket and thus preserve Casimirs to machine precision. We discuss the completeness of the derived neural networks and their efficiency in approximating the dynamics. To illustrate the power of the method, we apply these techniques to systems of $N=3$ particles evolving on ${\rm SO}(3)$ group, which describe coupled rigid bodies rotating about their center of mass, and ${\rm SE}(3)$ group, applicable to the movement of unmanned air and water vehicles. Numerical results demonstrate that CO-LPNets learn the dynamics in phase space from data points and reproduce trajectories, with good accuracy, over hundreds of time steps. The method uses a limited number of points ($\sim200$/dimension) and parameters ($\sim 1000$ in our case), demonstrating potential for practical applications and edge deployment.
[COMMENTS]55 pages, 9 figures
[LINK]http://arxiv.org/abs/2601.06770v1
[DATE]2026-01-11 12:04:22+08:00
[CATEGORIES]cs.LG
ALFA: A Safe-by-Design Approach to Mitigate Quishing Attacks Launched via Fancy QR Codes
[AUTHORS]Muhammad Wahid Akram, Keshav Sood, Muneeb Ul Hassan, Dhananjay Thiruvady
[ABSTRACT]Phishing with Quick Response (QR) codes is termed as Quishing. The attackers exploit this method to manipulate individuals into revealing their confidential data. Recently, we see the colorful and fancy representations of QR codes, the 2D matrix of QR codes which does not reflect a typical mixture of black-white modules anymore. Instead, they become more tempting as an attack vector for adversaries which can evade the state-of-the-art deep learning visual-based and other prevailing countermeasures. We introduce “ALFA”, a safe-by-design approach, to mitigate Quishing and prevent everyone from accessing the post-scan harmful payload of fancy QR codes. Our method first converts a fancy QR code into the replica of binary grid and then identify the erroneous representation of modules in that grid. Following that, we present “FAST” method which can conveniently recover erroneous modules from that binary grid. Afterwards, using this binary grid, our solution extracts the structural features of fancy QR code and predicts its legitimacy using a pre-trained model. The effectiveness of our proposal is demonstrated by the experimental evaluation on a synthetic dataset (containing diverse variations of fancy QR codes) and achieve a FNR of 0.06% only. We also develop the mobile app to test the practical feasibility of our solution and provide a performance comparison of the app with the real-world QR readers. This comparison further highlights the classification reliability and detection accuracy of this solution in real-world environments.
[COMMENTS]LNCS Springer Template (19 pages, 5 figures, 4 tables). This paper is currently submitted to 31st European Symposium on Research in Computer Security (ESORICS) 2026 for publication
[LINK]http://arxiv.org/abs/2601.06768v1
[DATE]2026-01-11 11:56:56+08:00
[CATEGORIES]cs.LG
Generative Modeling via Hierarchical Tensor Sketching
[AUTHORS]Yifan Peng, Yian Chen, E. Miles Stoudenmire, Yuehaw Khoo
[ABSTRACT]We propose a hierarchical tensor-network approach for approximating high-dimensional probability density via empirical distribution. This leverages randomized singular value decomposition (SVD) techniques and involves solving linear equations for tensor cores in this tensor network. The complexity of the resulting algorithm scales linearly in the dimension of the high-dimensional density. An analysis of estimation error demonstrates the effectiveness of this method through several numerical experiments.
[COMMENTS]31 pages, 15 figures
[LINK]http://arxiv.org/abs/2304.05305v2
[DATE]2026-01-11 11:50:09+08:00
[CATEGORIES]cs.LG
A Backpropagation-Free Feedback-Hebbian Network for Continual Learning Dynamics
[AUTHORS]Josh Li
[ABSTRACT]Feedback-rich neural architectures can regenerate earlier representations and inject temporal context, making them a natural setting for strictly local synaptic plasticity. We ask whether a minimal, backpropagation-free feedback–Hebbian system can already express interpretable continual-learning–relevant behaviors under controlled training schedules. We introduce a compact prediction–reconstruction architecture with two feedforward layers for supervised association learning and two dedicated feedback layers trained to reconstruct earlier activity and re-inject it as additive temporal context. All synapses are updated by a unified local rule combining centered Hebbian covariance, Oja-style stabilization, and a local supervised drive where targets are available, requiring no weight transport or global error backpropagation. On a small two-pair association task, we characterize learning through layer-wise activity snapshots, connectivity trajectories (row/column means of learned weights), and a normalized retention index across phases. Under sequential A->B training, forward output connectivity exhibits a long-term depression (LTD)-like suppression of the earlier association while feedback connectivity preserves an A-related trace during acquisition of B. Under deterministic interleaving A,B,A,B,…, both associations are concurrently maintained rather than sequentially suppressed. Architectural controls and rule-term ablations isolate the role of dedicated feedback in regeneration and co-maintenance, and the role of the local supervised term in output selectivity and unlearning. Together, the results show that a compact feedback pathway trained with local plasticity can support regeneration and continual-learning–relevant dynamics in a minimal, mechanistically transparent setting.
[COMMENTS]8 pages, 10 figures
[LINK]http://arxiv.org/abs/2601.06758v1
[DATE]2026-01-11 11:25:38+08:00
[CATEGORIES]cs.LG
Crystal Generation using the Fully Differentiable Pipeline and Latent Space Optimization
[AUTHORS]Osman Goni Ridwan, Gilles Frapper, Hongfei Xue, Qiang Zhu
[ABSTRACT]We present a materials generation framework that couples a symmetry-conditioned variational autoencoder (CVAE) with a differentiable SO(3) power spectrum objective to steer candidates toward a specified local environment under the crystallographic constraints. In particular, we implement a fully differentiable pipeline to enable batch-wise optimization on both direct and latent crystallographic representations. Using the GPU acceleration, this implementation achieves about fivefold speed compared to our previous CPU workflow, while yielding comparable outcomes. In addition, we introduce the optimization strategy that alternatively performs optimization on the direct and latent crystal representations. This dual-level relaxation approach can effectively escape local minima defined by different objective gradients, thus increasing the success rate of generating complex structures satisfying the target local environments. This framework can be extended to systems consisting of multi-components and multi-environments, providing a scalable route to generate material structures with the target local environment.
[LINK]http://arxiv.org/abs/2601.04606v2
[DATE]2026-01-11 10:48:24+08:00
[CATEGORIES]cs.LG
Low-Rank Online Dynamic Assortment with Dual Contextual Information
[AUTHORS]Seong Jin Lee, Will Wei Sun, Yufeng Liu
[ABSTRACT]As e-commerce expands, delivering real-time personalized recommendations from vast catalogs poses a critical challenge for retail platforms. Maximizing revenue requires careful consideration of both individual customer characteristics and available item features to continuously optimize assortments over time. In this paper, we consider the dynamic assortment problem with dual contexts – user and item features. In high-dimensional scenarios, the quadratic growth of dimensions complicates computation and estimation. To tackle this challenge, we introduce a new low-rank dynamic assortment model to transform this problem into a manageable scale. Then we propose an efficient algorithm that estimates the intrinsic subspaces and utilizes the upper confidence bound approach to address the exploration-exploitation trade-off in online decision making. Theoretically, we establish a regret bound of $\tilde{O}((d_1+d_2)r\sqrt{T})$, where $d_1, d_2$ represent the dimensions of the user and item features respectively, $r$ is the rank of the parameter matrix, and $T$ denotes the time horizon. This bound represents a substantial improvement over prior literature, achieved by leveraging the low-rank structure. Extensive simulations and an application to the Expedia hotel recommendation dataset further demonstrate the advantages of our proposed method.
[LINK]http://arxiv.org/abs/2404.17592v2
[DATE]2026-01-11 10:30:46+08:00
[CATEGORIES]cs.LG
Federated Continual Learning for Privacy-Preserving Hospital Imaging Classification
[AUTHORS]Anay Sinhal, Arpana Sinhal, Amit Sinhal
[ABSTRACT]Deep learning models for radiology interpretation increasingly rely on multi-institutional data, yet privacy regulations and distribution shift across hospitals limit central data pooling. Federated learning (FL) allows hospitals to collaboratively train models without sharing raw images, but current FL algorithms typically assume a static data distribution. In practice, hospitals experience continual evolution in case mix, annotation protocols, and imaging devices, which leads to catastrophic forgetting when models are updated sequentially. Federated continual learning (FCL) aims to reconcile these challenges but existing methods either ignore the stringent privacy constraints of healthcare or rely on replay buffers and public surrogate datasets that are difficult to justify in clinical settings. We study FCL for chest radiography classification in a setting where hospitals are clients that receive temporally evolving streams of cases and labels. We introduce DP-FedEPC (Differentially Private Federated Elastic Prototype Consolidation), a method that combines elastic weight consolidation (EWC), prototype-based rehearsal, and client-side differential privacy within a standard FedAvg framework. EWC constrains updates along parameters deemed important for previous tasks, while a memory of latent prototypes preserves class structure without storing raw images. Differentially private stochastic gradient descent (DP-SGD) at each client adds calibrated Gaussian noise to clipped gradients, providing formal privacy guarantees for individual radiographs.
[LINK]http://arxiv.org/abs/2601.06742v1
[DATE]2026-01-11 09:28:34+08:00
[CATEGORIES]cs.LG
Logic-Driven Semantic Communication for Resilient Multi-Agent Systems
[AUTHORS]Tamara Alshammari, Mehdi Bennis
[ABSTRACT]The advent of 6G networks is accelerating autonomy and intelligence in large-scale, decentralized multi-agent systems (MAS). While this evolution enables adaptive behavior, it also heightens vulnerability to stressors such as environmental changes and adversarial behavior. Existing literature on resilience in decentralized MAS largely focuses on isolated aspects, such as fault tolerance, without offering a principled unified definition of multi-agent resilience. This gap limits the ability to design systems that can continuously sense, adapt, and recover under dynamic conditions. This article proposes a formal definition of MAS resilience grounded in two complementary dimensions: epistemic resilience, wherein agents recover and sustain accurate knowledge of the environment, and action resilience, wherein agents leverage that knowledge to coordinate and sustain goals under disruptions. We formalize resilience via temporal epistemic logic and quantify it using recoverability time (how quickly desired properties are re-established after a disturbance) and durability time (how long accurate beliefs and goal-directed behavior are sustained after recovery). We design an agent architecture and develop decentralized algorithms to achieve both epistemic and action resilience. We provide formal verification guarantees, showing that our specifications are sound with respect to the metric bounds and admit finite-horizon verification, enabling design-time certification and lightweight runtime monitoring. Through a case study on distributed multi-agent decision-making under stressors, we show that our approach outperforms baseline methods. Our formal verification analysis and simulation results highlight that the proposed framework enables resilient, knowledge-driven decision-making and sustained operation, laying the groundwork for resilient decentralized MAS in next-generation communication systems.
[LINK]http://arxiv.org/abs/2601.06733v1
[DATE]2026-01-11 08:54:09+08:00
[CATEGORIES]cs.LG
Predicting Student Success with Heterogeneous Graph Deep Learning and Machine Learning Models
[AUTHORS]Anca Muresan, Mihaela Cardei, Ionut Cardei
[ABSTRACT]Early identification of student success is crucial for enabling timely interventions, reducing dropout rates, and promoting on time graduation. In educational settings, AI powered systems have become essential for predicting student performance due to their advanced analytical capabilities. However, effectively leveraging diverse student data to uncover latent and complex patterns remains a key challenge. While prior studies have explored this area, the potential of dynamic data features and multi category entities has been largely overlooked. To address this gap, we propose a framework that integrates heterogeneous graph deep learning models to enhance early and continuous student performance prediction, using traditional machine learning algorithms for comparison. Our approach employs a graph metapath structure and incorporates dynamic assessment features, which progressively influence the student success prediction task. Experiments on the Open University Learning Analytics (OULA) dataset demonstrate promising results, achieving a 68.6% validation F1 score with only 7% of the semester completed, and reaching up to 89.5% near the semester’s end. Our approach outperforms top machine learning models by 4.7% in validation F1 score during the critical early 7% of the semester, underscoring the value of dynamic features and heterogeneous graph representations in student success prediction.
[LINK]http://arxiv.org/abs/2601.06729v1
[DATE]2026-01-11 08:46:44+08:00
[CATEGORIES]cs.LG
Copula-Stein Discrepancy: A Generator-Based Stein Operator for Archimedean Dependence
[AUTHORS]Agnideep Aich, Ashit Baran Aich
[ABSTRACT]Kernel Stein discrepancies (KSDs) are widely used for goodness-of-fit testing, but standard KSDs can be insensitive to higher-order dependence features such as tail dependence. We introduce the Copula-Stein Discrepancy (CSD), which defines a Stein operator directly on the copula density to target dependence geometry rather than the joint score. For Archimedean copulas, CSD admits a closed-form Stein kernel derived from the scalar generator. We prove that CSD metrizes weak convergence of copula distributions, admits an empirical estimator with minimax-optimal rate $O_P(n^{-1/2})$, and is sensitive to differences in tail dependence coefficients. We further extend the framework beyond Archimedean families to general copulas, including elliptical and vine constructions. Computationally, exact CSD kernel evaluation is linear in dimension, and a random-feature approximation reduces the quadratic $O(n^2)$ sample scaling to near-linear $\tilde{O}(n)$; experiments show near-nominal Type~I error, increasing power, and rapid concentration of the approximation toward the exact $\widehat{\mathrm{CSD}}_n^2$ as the number of features grows.
[LINK]http://arxiv.org/abs/2510.24056v2
[DATE]2026-01-11 08:21:24+08:00
[CATEGORIES]cs.LG